Xtables2 User Documentation

Jan Engelhardt

Xtables2, Architecture Draft 6 (“A6”)
November 2010

Copyright © 2010 Jan Engelhardt. This work is made available under the Creative Commons Attribution-Noncommercial-Sharealike 3.0 (CC-BY-NC-SA) license. See http://creativecommons.org/licenses/by-nc-sa/3.0/ for details. (Alternate arrangements can be made with the copyright holder(s).) Additionally, modifications to this work must clearly be indicated as such, and the title page needs to carry the words “Modified Version” if any such modifications have been made, unless the release was done by the designated maintainer(s). The Maintainers are members of the Netfilter Core Team, and any person(s) appointed as maintainer(s) by the coreteam.

Table of Contents

1 Architectural Differences
    Similarities -- Retained features
    Protocol-independent table
    Xtables1 support
    Singular table
    Initial absence of base chains
    Arbitrary new base chains
    Absent chain policies
    Absent default counters
    Multiple targets per rule
    Zero targets per rule (Internal detail)
    Packing of rules
    Higher degree of freedom in modifications (granularity)
    Re-entrancy
    Netlink protocol
    Userspace library for rule manipulation
    Other minor issues fixed
2 Installation
3 Usage

1 Architectural Differences

Similarities -- Retained features

The important features that Xtables1 has, and that I wanted to have in Xtables2 too:

• Network namespace support
• Backwards compatibility: xt1 userspace tools should continue to work (see below)
• Bulk addition/replacement/etc. of rules, especially atomically switching rulesets and parts thereof
• All current xt extensions continue to be usable

Other noteworthy retained constraints:

• Tables must be loop-free

TODO

1. Sysfs file to control counter transplantation between xt1 and xt2

Protocol-independent table

Xtables2 rules, in their empty form, no longer carry any mandatory protocol-specific parts, such as the struct ip6t_ip6 that used to be part of ip6tables rules.
This allows a single table to be called from various packet type handlers (such as IPv6, BRIDGE, etc.). User rulesets often do not care about the network protocol being used, and with the protocol-independent table, -p tcp -m tcp --dport 22 will match TCP/22 regardless of the network protocol in use, which saves you from replicating the rule between iptables and ip6tables. Checking explicitly for network protocols is of course still possible by means of -m ipv4, -m ipv6, and -m arp for the network protocols of the same name. The respective replacement for ebtables rules is a combination of -m physdev to check for an Ethernet bridge device, and -m eth to test for Ethernet packets.

The side effect of protocol-agnostic tables is that the ip_tables, ip6_tables, arp_tables and ebtables kernel modules all become obsolete in one fell swoop, ending a painful legacy of copy-and-paste.

Xtables1 support

To keep the Xtables1 interfaces working that older userspace tools such as iptables(8), ip6tables(8) and arptables(8) employ, Xtables2 provides a kernel-level rule translator that converts from and to Xtables2 rulesets on the fly as these programs talk to the kernel. Of course, the backtranslation from xt2 to xt1 (e.g. on `iptables -S`) only works if the rules in use can be completely mapped into xt1-style rules, which will be the case for all rules initially converted from xt1. (The translator will signal -ERANGE otherwise.)

Singular table

In Xtables2, each network namespace has only a single table (internally referred to as “master”). What used to be 40 base chains in 13 tables per network namespace, spread over four protocol/subsystem domains (IPv4, IPv6, ARP, bridge), is now a comfortable 40 base chains in a single table per network namespace. This has the benefit that user-defined chains need not be reproduced across different tables, as was previously necessary.
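The figures of 40 base chains and 13 tables can be checked by enumerating the xt1 table, hook and protocol combinations. The following is only an illustrative sketch: the groups mirror the “table/hook/nfproto” names that this document gives for the Xtables1 compatibility layer, while the Python names (XT1_GROUPS, compat_chain_names) are made up for this example and are not part of any Xtables2 interface.

```python
from itertools import product

# (table, hooks, protocols) groups, following the "table/hook/nfproto"
# names listed for the Xtables1 compatibility layer in this document.
XT1_GROUPS = [
    ("brouting", ["BROUTING"], ["bridge"]),
    ("raw", ["PREROUTING", "OUTPUT"], ["ipv4", "ipv6"]),
    ("nat", ["PREROUTING", "OUTPUT", "POSTROUTING"], ["bridge"]),
    ("nat", ["PREROUTING", "INPUT", "OUTPUT", "POSTROUTING"], ["ipv4"]),
    ("mangle", ["PREROUTING", "INPUT", "FORWARD", "OUTPUT", "POSTROUTING"],
     ["ipv4", "ipv6"]),
    ("filter", ["INPUT", "FORWARD", "OUTPUT"],
     ["ipv4", "ipv6", "arp", "bridge"]),
    ("security", ["INPUT", "FORWARD", "OUTPUT"], ["ipv4", "ipv6"]),
]


def compat_chain_names(groups=XT1_GROUPS):
    """Expand every group into its individual table/hook/nfproto names."""
    return [
        f"{table}/{hook}/{proto}"
        for table, hooks, protos in groups
        for hook, proto in product(hooks, protos)
    ]


names = compat_chain_names()
# An xt1 table is identified by its name plus its protocol domain.
xt1_tables = {(n.split("/")[0], n.split("/")[2]) for n in names}

print(len(names), len(xt1_tables))  # prints: 40 13
```

In Xtables2 all 40 names live side by side in the single per-namespace table, which is why a naming scheme with a table prefix is recommended to avoid overlap.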
Whereas xt1 only allowed replacing one table at a time atomically, Xtables2 can thus replace the entire ruleset, which may have spanned multiple tables in xt1, at once.

The name of base chains is now freely selectable; the recommended standard naming is “table/chainname” to avoid overlap. The compatibility translation layer for Xtables1 will use “table/hook/nfproto”:

• brouting/BROUTING/bridge
• raw/{PREROUTING,OUTPUT}/{ipv4,ipv6}
• nat/{PREROUTING,OUTPUT,POSTROUTING}/bridge
• nat/{PREROUTING,INPUT,OUTPUT,POSTROUTING}/ipv4
• mangle/{PREROUTING,INPUT,FORWARD,OUTPUT,POSTROUTING}/{ipv4,ipv6}
• filter/{INPUT,FORWARD,OUTPUT}/{ipv4,ipv6,arp,bridge}
• security/{INPUT,FORWARD,OUTPUT}/{ipv4,ipv6}

Initial absence of base chains

Initially, base chains do not exist and, from a kernel point of view, need to be created first. A userspace component can transparently take care of that for the user, just as iptables(8) autoloads table modules and thus makes base chains available. A respective Netfilter hook is installed on base chain creation and removed again on deletion, so that non-existing base chains do not delay packet processing. This is similar to not having loaded a table in xt1. In xt1, however, if you only used, for example, a single INPUT rule in the mangle table, you would still have added four hooks for the other base chains of the mangle table.

Arbitrary new base chains

The administrator is free to create base chains using arbitrary Netfilter hook priorities (corresponding to “raw”, “filter”, etc.) and hook numbers (corresponding to prerouting, input, etc.). This removes the need to have a kernel module for each table.

Absent chain policies

Chain policies used to be a hidden rule at the end of a base chain that was jumped to if the user issued a RETURN from the base chain (“underflow”). Running off the end of a chain also caused the hidden rule to be executed.
Since the Internet has become a much more hostile place since its inception, Xtables2 uses a strict drop policy for underflows and run-offs. You can deal with this by adding appropriate accepting rules.

Absent default counters

Rules no longer have byte and packet counters attached to them by default, which increases processing speed for users who do not need the counters. Counters, if desired, need to be explicitly specified when creating a rule. For them to have their original behavior, that is, to only rise when all match conditions of the rule have been met, -m counter is to be used as the last action of a rule. Counters may also be added anywhere else, as in:

• -m tcp -m time -m counter: counts after both xt_tcp and xt_time matched successfully.
• -m tcp -m counter -m time: counts as soon as xt_tcp has matched successfully.
• -m counter -m tcp -m time: always counts.
• -m tcp -m counter -m time -m counter: also possible; two counter objects.

The Xtables1 translator will of course generate xt2 rules that always have counters. This behavior can be changed by tuning the sysfs variable /sys/module/xt1_support/parameters/xlat_counters.

Multiple targets per rule

This feature, which is highly desired in the user community, has been implemented. Targets are executed one after another, provided the previous target was not terminating. This makes commands such as `-j LOG -j DROP` possible. `-j DROP -j LOG` will of course not log the packet.

Zero targets per rule (Internal detail)

Classic iptables encoded rules without a target as having an implicit XT_CONTINUE, i.e. a jump to the next rule. This is no longer needed in Xtables2, where rules default to XT_CONTINUE, so the 40 bytes that it took to encode CONTINUE are saved. (Not that rules without a target are in a majority...)

Packing of rules

Tests have shown that rulesets stored in “free-hanging” data structures, such as linked lists, suffer from increased memory usage and severe performance degradation when executing the ruleset.
Processing time and memory expansion of up to 2.8x has been observed. The time expansion is believed to be a result of increased D-cache or TLB misses due to the “fragmentation” of the ruleset's objects. The memory usage increase is due to the natural housekeeping cost of the allocator. Xtables1 packs all rules in a table together, which is good for locality, but has implications for the difficulty, specifically the time cost, of ruleset manipulation.

As a result of the findings about time and memory, Xtables2 (starting with snapshot A4) packs the rules of a chain together, for the following reasons.

1. Jumps can lead to anywhere (byte-wise) in the table blob. The first rule of the first encoded chain could cause a jump to the last encoded chain. Especially on larger rulesets, it is assumed that such a “far jump” is equally costly whether the entire table is packed, or only single chains.
2. Packing only chains rather than the entire table has the benefit that rule insertion/deletion time is only O(c) rather than O(n^2).

Higher degree of freedom in modifications (granularity)

The capabilities of an implementation can, among other things, be characterized by one or more of the following “granularities”. From coarse to fine-grained:

• Ruleset level: Exchange of a ruleset (allows manipulation of multiple tables).
• Table level: Exchange of a table (multi-chain manipulation -- within a single table).
• Chain level: Exchange of a chain (multi-rule manipulation -- within a single chain).
• Rule level: Exchange of a rule.
• Subrule level: Exchange of extensions, rule parameters.

Additionally, the atomic guarantees of a level apply to lower ones, so that table atomicity implies chain and rule atomicity. Subrule-level control also seems to have little practical value; it is only listed here for completeness.

The ip_tables kernel interface only offers table-level granularity.
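The ordering of the granularity levels, and the rule that an atomic guarantee at one level carries over to all finer levels, can be expressed compactly. This is a minimal sketch for illustration only; the names Granularity and covers are hypothetical and do not correspond to anything in the Xtables2 code.

```python
from enum import IntEnum


class Granularity(IntEnum):
    """Granularity levels from the list above; a larger value is coarser."""
    SUBRULE = 0
    RULE = 1
    CHAIN = 2
    TABLE = 3
    RULESET = 4


def covers(guaranteed: Granularity, requested: Granularity) -> bool:
    """Return True if atomicity at `guaranteed` implies atomicity at `requested`.

    A guarantee applies to its own level and to every finer level,
    but not to coarser ones.
    """
    return requested <= guaranteed


# Table atomicity implies chain and rule atomicity ...
assert covers(Granularity.TABLE, Granularity.CHAIN)
assert covers(Granularity.TABLE, Granularity.RULE)
# ... but does not extend to a ruleset that spans several tables.
assert not covers(Granularity.TABLE, Granularity.RULESET)
```

Under this model, an interface that is limited to table-level granularity, such as ip_tables, cannot atomically exchange a ruleset that spans more than one table.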
The iptables(8) userspace program retrieves an entire table at a time from the kernel, unpacks it, applies the desired user modification, repacks it, and then submits it back to the kernel. Even for small changes, this means that more data than really needed is transferred back and forth. Also, since a new table is constructed, the kernel has to redo all checks. This problem is amplified by users wrongly calling iptables repeatedly instead of doing a highly recommended bulk replace.

The Xtables2 kernel interface offers modification at four levels: ruleset, table, chain and rule. Since Xtables2 conveniently has just a single table, ruleset granularity is the same as table granularity in xt2's implementation. [Footnote: But it would not be so in Xtables1, of course.] Xtables2 always provides at least chain atomicity even for single rule updates, as a result of the implementation of rule splicing. Cost: per splice operation.

Re-entrancy

(xt_TEE has been included in Linux 2.6.35, and re-entrancy support has been backported to Xtables1. This is thus the original text I had written up before.)

Re-entrancy is a concern with extensions that cause the generation of additional network packets that will flow through the tables while the original packet is still being processed.

• NF_HOOK -> ip6t_do_table -> dst_output -> NF_HOOK -> ip6t_do_table (oops!)

Actually, re-entrancy has not been that much of a problem historically. Within the mainline kernel, only ipt_REJECT and ip6t_REJECT are affected, and in fact only half so, because the original packet is discarded as part of REJECT's operation. Outside the kernel, in the realm of 3rd-party code, I only know of the modules I have written or co-authored myself: xt_TEE for duplicating and rerouting packets, or the esoteric xt_ECHO (for the RFC 862 echo/udp protocol).

Due to iptables storing its jumpstack in the ruleset, reentering e.g.
ip6t_do_table from an already-running instance of the function causes the jumpstack to be overwritten, and targets must return an absolute verdict so that the trashed jumpstack is not used in the parent instance anymore. nftables uses a fixed array on the C stack to store the iptables jumpstack, which has limitations of its own: it is always too big or too small for someone.

Xtables2 solves this by allocating a fixed amount of jumpstack space when a table is reloaded. It analyzes the ruleset, and allocates space for m jumps and n do_table calls. m is the maximum number of jumps possible, i.e. the number of user-defined chains (see above). n is a sysfs tunable [currently: hardcoded tunable], defaulting to 2.

Netlink protocol

Xtables2 uses the highly regarded Netlink base protocol and transport for communication between user and kernel components.

Userspace library for rule manipulation

The iptables package did offer a userspace library, libiptc, that allowed modifying the in-kernel rulesets. However, not a lot of love has gone into the library or its API, so users generally went around it. The aforementioned code duplication did of course not stop at the kernel border; there is, in fact, libip4tc, libip6tc, libarptc and libebtc.

With Xtables2, a new package, libnetfilter_xtables, is introduced. It provides means to modify the in-kernel table directly, but also to create staging tables in userland first (like libiptc), which are then used for a bulk replace operation. (libnetfilter_xtables makes use of Pablo Neira Ayuso's “Minimalist Netlink” library, libmnl.)

Other minor issues fixed

The table replace operation is now a single atomic operation. Previously, it was split into two, whereby first the table was exchanged, and then the counters. As a result, “Resource temporarily unavailable” is not possible in xt2.

2 Installation

3 Usage