
6. Information for Programmers

I'll let you in on a secret: my pet hamster did all the coding. I was just a channel, a `front' if you will, in my pet's grand plan. So, don't blame me if there are bugs. Blame the cute, furry one.

6.1 Understanding iptables

iptables hooks in on the NF_IP_LOCAL_IN, NF_IP_FORWARD and NF_IP_LOCAL_OUT hooks. It keeps an array of rules in memory (hence the name `iptables', although in fact there is only one table). The only difference between the three hooks is where they begin traversing the table.

Inside the kernel, each rule (`struct ipt_kern_entry') consists of the following parts:

  1. A `struct ipt_ip' part, containing the specifications for the IP header which it is to match.
  2. A `struct ipt_match' pointer, indicating the (optional) extra match function for that element, eg. tcp packet matching.
  3. A `union ipt_matchinfo' part, which contains the data for the extra match function.
  4. A `struct ipt_target' pointer, indicating the action to perform if the packet matches.
  5. A `union ipt_targinfo' part, which contains the data for the target function.
  6. A `struct ipt_counters' part, which contains the number of matches. This is actually held in a separate array, with two for each CPU to avoid locking.
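
To make this concrete, here is a rough sketch of how such a rule might be laid out. The member names are my illustration of the list above, not the actual kernel declaration:

    /* Illustrative only: member names are guesses based on the
     * list above, not the real kernel declaration. */
    struct ipt_kern_entry {
        struct ipt_ip ip;              /* IP header specifications */
        struct ipt_match *match;       /* optional extra match, eg. tcp */
        union ipt_matchinfo matchinfo; /* data for the extra match */
        struct ipt_target *target;     /* action to perform on match */
        union ipt_targinfo targinfo;   /* data for the target */
        /* the `struct ipt_counters' live in a separate array,
         * two per CPU, to avoid locking */
    };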

Userspace has four operations: it can read the current table, read the info (hook positions and size of table), replace the table, and add in new counters.

The kernel starts traversing at the location indicated by the particular hook. That rule is examined; if it matches, the target function associated with that rule is called. If that function returns a negative number, that number is taken as a final result: the negated verdict minus one (unless it's IPT_RETURN, in which case the stack is popped). If the function returns a positive number, and the number is not equal to the last position plus one, that location is pushed on the stack. Traversal continues at the returned location.

6.2 Extending iptables

Because I'm lazy, iptables is fairly extensible. This is basically a scam to palm off work onto other people, which is what Open Source is all about (cf. Free Software, which as RMS would say, is about freedom, and I'm sitting in one of his talks at the moment).

Extending iptables potentially involves two parts: extending the kernel, by writing a new module, and possibly extending the userspace program iptables, by writing a new shared library.

The Kernel

Writing a kernel module itself is fairly simple, as you can see from the examples. One thing to be aware of is that your code must be re-entrant: one packet can be coming in from userspace while another arrives on an interrupt. In fact, on SMP there can be one packet on an interrupt per CPU, in 2.3.4 and above.

The functions you need to know about are:

init_module()

This is the entry-point of the module. It returns an error number, or 0 if it successfully registers itself with netfilter.

cleanup_module()

This is the exit point of the module; it should unregister itself with netfilter.

ipt_register_match()

This is used to register a new match type. You hand it a `struct ipt_match', which is usually declared as a static (file-scope) variable.

ipt_register_target()

This is used to register a new target type. You hand it a `struct ipt_target', which is usually declared as a static (file-scope) variable.

ipt_unregister_target()

Used to unregister your target.

ipt_unregister_match()

Used to unregister your match.
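
Putting these together, a minimal module skeleton might look like the following. I'm assuming ipt_register_match() simply takes a pointer to your `struct ipt_match' and returns 0 on success; the exact prototypes are assumptions, not quotes from the source:

    /* Skeleton only: prototypes are assumed. */
    static struct ipt_match silly_match;   /* filled in as described
                                              in the next section */

    int init_module(void)
    {
        return ipt_register_match(&silly_match);
    }

    void cleanup_module(void)
    {
        ipt_unregister_match(&silly_match);
    }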

New Match Functions

New match functions are usually written as a standalone module. It's possible to have these modules extensible in turn, although it's usually not necessary. One way would be to use the netfilter framework's `nf_register_sockopt' function to allow users to talk to your module directly. Another way would be to export symbols for other modules to register themselves, the same way netfilter and iptables do.

The core of your new match function is the struct ipt_match which it passes to ipt_register_match(). This structure has the following fields:

next

This field is set to NULL.

name

This field is the name of the match function, as referred to by userspace. The name should match the name of the module (ie. if the name is "mac", the module must be "ipt_mac.o") for auto-loading to work.

match

This field is a pointer to a match function, which takes the skb, the in and out device names (one of which may be ""), the ipt_matchinfo union from the rule that was matched, the IP offset (non-zero means a non-head fragment), a pointer to the protocol header (ie. just past the IP header) and the length of the data (ie. the packet length minus the IP header length). It should return non-zero if the packet matches.

checkentry

This field is a pointer to a function which checks the specifications for a rule; if this returns 0, then the rule will not be accepted from the user. For example, the "tcp" match type will only accept tcp packets, and so if the `struct ipt_ip' part of the rule does not specify that the protocol must be tcp, a zero is returned.

me

This field is set to `&__this_module', which gives a pointer to your module. It causes the usage-count to go up and down as rules of that type are created and destroyed. This prevents a user removing the module (and hence cleanup_module() being called) if a rule refers to it.
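
As a sketch, a trivial match which only matches head fragments might look like this. The parameter types and their order are assumptions drawn from the descriptions above:

    /* Sketch: parameter types and order are assumptions. */
    static int
    silly_match_fn(const struct sk_buff *skb,
                   const char *in, const char *out,
                   const union ipt_matchinfo *info,
                   int offset,           /* non-zero: non-head fragment */
                   const void *protohdr, /* just past the IP header */
                   int datalen)          /* packet len minus IP header */
    {
        return offset == 0;    /* match head fragments only */
    }

    static int
    silly_checkentry(const struct ipt_ip *ip,
                     const union ipt_matchinfo *info)
    {
        return 1;              /* accept any rule specification */
    }

    /* field order follows the list above: next, name, match,
     * checkentry, me.  Module would be named ipt_silly.o. */
    static struct ipt_match silly_match
    = { NULL, "silly", silly_match_fn, silly_checkentry, &__this_module };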

New Targets

New targets are also usually written as a standalone module. The discussions under the above section on `New Match Functions' apply equally here.

The core of your new target is the struct ipt_target which it passes to ipt_register_target(). This structure has the following fields:

next

This field is set to NULL.

name

This field is the name of the target function, as referred to by userspace. The name should match the name of the module (ie. if the name is "REJECT", the module must be "ipt_REJECT.o") for auto-loading to work.

target

This is a pointer to the target function, which takes the skbuff, the input and output device names (either of which may be ""), the ipt_targinfo union of the matching rule, and the position of the rule in the table. The target function returns a non-negative absolute position to jump to, or a negative verdict (which is the negated verdict minus one).

checkentry

This field is a pointer to a function which checks the specifications for a rule; if this returns 0, then the rule will not be accepted from the user.

me

This field is set to `&__this_module', which gives a pointer to your module. It causes the usage-count to go up and down as rules with this as a target are created and destroyed. This prevents a user removing the module (and hence cleanup_module() being called) if a rule refers to it.
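
For illustration, a target which always drops the packet could be sketched as follows; again, the exact parameter list is an assumption, but the return value shows the negated-verdict-minus-one encoding described earlier:

    /* Sketch: a target that always drops. */
    static int
    silly_target_fn(struct sk_buff **pskb,
                    const char *in, const char *out,
                    const union ipt_targinfo *info,
                    unsigned int pos)   /* rule's position in the table */
    {
        return -NF_DROP - 1;   /* negative: a final verdict */
    }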

Userspace Tool

Now you've written your nice shiny kernel module, you may want to control the options on it from userspace. Rather than have a branched version of iptables for each extension, I use the very latest 80's technology: laserdisc. Sorry, I mean shared libraries.

The shared library should have an `_init()' function, which will automatically be called upon loading: the moral equivalent of the kernel module's `init_module()' function. This should call `register_match()' or `register_target()', depending on whether your shared library provides a new match or a new target.

You only need to provide a shared library if you want to initialize part of the structure, or provide additional options. For example, the `REJECT' target doesn't require either of these, so there's no shared library.

There are useful functions described in the `iptables.h' header, especially:

check_inverse()

checks if an argument is actually a `!', and if so, sets the `invert' flag if not already set. If it returns true, you should increment optind, as done in the examples.

string_to_number()

converts a string into a number in the given range, returning -1 if it is malformed or out of range.

exit_error()

should be called if an error is found. Usually the first argument is `PARAMETER_PROBLEM', meaning the user didn't use the command line correctly.
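
For example, inside your parse() function you might combine them like this (a hedged sketch; only the behaviour described above is assumed):

    /* Sketch: parse a port argument using the helpers above. */
    int port = string_to_number(optarg, 0, 65535);
    if (port == -1)
        exit_error(PARAMETER_PROBLEM,
                   "invalid port `%s' specified", optarg);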

New Match Functions

Your shared library's _init() function hands `register_match()' a pointer to a static `struct iptables_match', which has the following fields:

next

This pointer is used to make a linked list of matches (used, for example, for listing rules). It should be set to NULL initially.

name

The name of the match function. This should match the library name (eg "tcp" for `libipt_tcp.so').

help

A function which prints out the option synopsis.

init

This can be used to initialize the ipt_matchinfo union. It will be called before `parse()'.

parse

This is called when an unrecognized option is seen on the command line. `invert' is true if a `!' has already been seen. The `flags' pointer is for the exclusive use of your match library, and is usually used to store a bitmask of options which have been specified. It should return non-zero if the option was indeed for your library. Make sure you adjust the nfcache field; if you are examining something not expressible using the contents of `linux/include/netfilter_ipv4.h', then simply OR in the NFC_UNKNOWN bit.

final_check

This is called after the command line has been parsed, and is handed the `flags' integer reserved for your library. This gives you a chance to check that any compulsory options have been specified: call `exit_error()' if they have not.

print

This is used by the chain listing code to print (to standard output) the ipt_matchinfo union for a rule. The numeric flag is set if the user specified the `-n' flag.

extra_opts

This is a NULL-terminated list of extra options which your library offers. This is merged with the current options and handed to getopt_long; see the man page for details. The return code for getopt_long becomes the first argument (`c') to your `parse()' function.
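
Here is a sketch of a complete (if useless) shared-library match. The order of the fields in `struct iptables_match' is assumed to follow the list above, and the helper-function prototypes are illustrative:

    #include <stdio.h>
    #include <getopt.h>
    #include "iptables.h"

    /* Sketch only: field order and prototypes are assumptions. */
    static void help(void)
    {
        printf("silly match options:\n  --silly  match sillily\n");
    }

    static int parse(int c, char **argv, int invert, unsigned int *flags,
                     union ipt_matchinfo *info)
    {
        if (c != '1')
            return 0;          /* option wasn't for us */
        *flags |= 1;           /* remember --silly was given */
        return 1;
    }

    static void final_check(unsigned int flags)
    {
        if (!(flags & 1))
            exit_error(PARAMETER_PROBLEM, "silly: --silly required");
    }

    static struct option opts[]
    = { { "silly", 0, 0, '1' }, { 0 } };

    static struct iptables_match silly
    = { NULL, "silly", help, NULL /* init */, parse, final_check,
        NULL /* print */, opts };

    void _init(void)
    {
        register_match(&silly);
    }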

New Targets

Your shared library's _init() function hands `register_target()' a pointer to a static `struct iptables_target', which has similar fields to the iptables_match structure detailed above.

Sometimes a target (like `REJECT') doesn't need a userspace library; the iptables program won't object if it can't load the library (but the kernel will object if it can't load the module).

Using `libiptc'

libiptc is the iptables control library, designed for listing and manipulating rules in the iptables kernel module. While its current use is for the iptables program, it makes writing other tools fairly easy. You need to be root to use these functions.

The kernel module itself simply contains a table of rules, and a set of three numbers representing entry points. Chain names ("INPUT", etc) are provided as an abstraction by the library. It does this by inserting error nodes before the head of each user-defined chain, which contain the chain name in the ipt_targinfo union (the builtin chain positions are defined by the three table entry points).

When `iptc_init()' is called, the table, including the counters, is read. This table is manipulated by the `iptc_insert_entry()', `iptc_replace_entry()', `iptc_append_entry()', `iptc_delete_entry()', `iptc_delete_num_entry()', `iptc_flush_entries()', `iptc_zero_entries()', `iptc_create_chain()', `iptc_delete_chain()', and `iptc_set_policy()' functions.

The table changes are not written back until the `iptc_commit()' function is called. This means it is possible for two library users operating on the same chain to race each other; locking would be required to prevent this, and it is not currently done.

There is no race with counters, however; counters are added back in to the kernel in such a way that counter increments between the reading and writing of the table still show up in the new table.

There are various helper functions:

iptc_next_chain()

This function returns one chain name at a time, and eventually NULL. To start iteration, you hand it NULL for the first argument: after that you hand it the chain name it gave you previously.

iptc_num_rules()

This function returns the number of rules in the chain given by chainname.

iptc_get_rule()

This returns a pointer to the n'th ipt_entry in the given chain. Do not manipulate this entry manually.

iptc_get_target()

This gets the target of the given rule. If it's an extended target, the name of that target is returned. If it's a jump to another chain, the name of that chain is returned. If it's a verdict (eg. DROP), that name is returned. If it has no target (an accounting-style rule), then the empty string is returned.

Note that this function should be used instead of using the value of the `verdict' field of the ipt_entry structure directly, as it offers the further interpretations of the standard verdict described above.

iptc_get_policy()

This gets the policy of a builtin chain, and fills in the `counters' argument with the hit statistics on that policy.

iptc_strerror()

This function returns a more meaningful explanation of a failure code in the iptc library. If a function fails, it will always set errno: this value can be passed to iptc_strerror() to yield an error message.
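
As a sketch of how these fit together, the following lists each chain and its rule count. I've assumed iptc_init() returns a handle which is passed to the helpers; the real prototypes may differ:

    /* Hedged sketch: prototypes, especially the handle argument,
     * are assumptions.  Must be run as root. */
    #include <stdio.h>
    #include "libiptc.h"      /* assumed header name */

    int main(void)
    {
        iptc_handle_t h = iptc_init();
        const char *chain;

        for (chain = iptc_next_chain(NULL, &h);   /* NULL starts iteration */
             chain != NULL;
             chain = iptc_next_chain(chain, &h))  /* hand back prev name */
            printf("%s: %u rules\n", chain, iptc_num_rules(chain, &h));

        return 0;
    }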

6.3 Understanding NAT

Welcome to Network Address Translation in the kernel. Note that the infrastructure offered is designed more for completeness than raw efficiency, and that future tweaks may increase the efficiency markedly. For the moment I'm happy that it works at all.

NAT is separated into connection tracking (which doesn't manipulate packets at all), and the NAT code itself. Connection tracking is also designed to be used by an iptables module, so it makes subtle distinctions in states which NAT doesn't care about.

Connection Tracking

Connection tracking (in the conntrack/ subdirectory) hooks into high-priority NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING hooks, in order to see packets before they enter the system.

The nfreason field in the skb is used to indicate the state of a packet, and the nfmark field is a pointer to the conntrack structure (unless the nfreason field is IPS_INVALID).

Currently one BFL (Big Fucking Lock) protects connection tracking and NAT; it's called ip_conntrack_lock.

6.4 Extending Connection Tracking/NAT

These frameworks are designed to accommodate any number of protocols and different mapping types. Some of these mapping types might be quite specific, such as a load-balancing/failover mapping type.

Internally, connection tracking converts a packet to a "tuple", representing the interesting parts of the packet, before searching for bindings or rules which match it. This tuple has a manipulatable part and a non-manipulatable part, called "src" and "dst" respectively, as this is the view for the first packet in the NAT world (it'd be a reply packet in the RNAT world). The tuple for every packet in the same packet stream in that direction is the same.

For example, a TCP packet's tuple contains the manipulatable part: source IP and source port, the non-manipulatable part: destination IP and the destination port. The manipulatable and non-manipulatable parts do not need to be the same type though; for example, an ICMP packet's tuple contains the manipulatable part: source IP and the ICMP id, and the non-manipulatable part: the destination IP and the ICMP type and code.

Every tuple has an inverse, which is the tuple of the reply packets in the stream. For example, the inverse of an ICMP ping packet, icmp id 7, from 192.168.1.1 to 1.2.3.4, is a ping-reply packet, icmp id 7, from 1.2.3.4 to 192.168.1.1.

These tuples, represented by the `struct ip_conntrack_tuple', are used widely. In fact, together with the hook the packet came in on (which has an effect on the type of manipulation expected), and the device involved, this is the complete information on the packet.
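
A rough sketch of the structure, using the TCP and ICMP examples above (the actual member names in the kernel are assumptions):

    /* Sketch only: member names may differ from the real source. */
    struct ip_conntrack_tuple {
        struct {                       /* manipulatable part */
            u_int32_t ip;              /* source IP */
            union {
                u_int16_t tcp_port;    /* TCP: source port */
                u_int16_t icmp_id;     /* ICMP: id */
            } u;
        } src;
        struct {                       /* non-manipulatable part */
            u_int32_t ip;              /* destination IP */
            union {
                u_int16_t tcp_port;    /* TCP: destination port */
                struct { u_int8_t type, code; } icmp;
            } u;
        } dst;
    };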

A connection is represented by the `struct ip_conntrack': it has two parts, a part referring to the direction of the original packet (dirinfo[0]), and a part referring to packets in the reply direction (dirinfo[1]).

Anyway, the first thing the NAT code does is to see if the connection tracking code managed to extract a tuple and find an existing connection, by looking at the skbuff's nfreason field; this tells us if it's an attempt at a new connection, or, if not, which direction it is in. In the latter case, the manipulations listed are done.

If it was the start of a new connection, we look for a rule for that tuple. If a rule matches, it is used to initialize the manipulations for both that direction and the reply; the connection-tracking code is told that the reply it should expect has changed. Then, the packet is manipulated as above.

If there is no rule, a `null' binding is created: this usually does not map the packet, but exists to ensure we don't map another stream over an existing one. Sometimes, the null binding cannot be created, because we have already mapped an existing stream over it, in which case the per-protocol manipulation may try to remap it, even though it's nominally a `null' binding.

Another important concept is a "range"; this is used to specify the range of addresses a mapping is allowed to bind into. A range element consists of an inclusive minimum and maximum IP address, and an inclusive minimum and maximum protocol-specific value (eg. TCP ports). There is also room for flags: one says whether the IP address can be mapped (sometimes we only want to map the protocol-specific part of a tuple, not the IP), and another says whether the protocol-specific part of the range is valid.

A range is a linked-list of these range elements; this means that a range could be "1.1.1.1-1.1.1.2 ports 50-55 AND 1.1.1.3 port 80". Each range element in the chain adds to the range (a union, for those who like set theory).
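
A range element might therefore be sketched like this; the names are illustrative (only IP_NAT_RANGE_PROTO_SPECIFIED is taken from the text below):

    /* Sketch of one range element, per the description above. */
    struct ip_nat_range_elem {
        struct ip_nat_range_elem *next;  /* union of elements */
        u_int32_t min_ip, max_ip;        /* inclusive IP bounds */
        u_int16_t min_proto, max_proto;  /* inclusive, eg. TCP ports */
        unsigned int flags;              /* eg. IP_NAT_RANGE_PROTO_SPECIFIED,
                                            and whether IP may be mapped */
    };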

Currently we don't actually use the linked list, due to a lack of user interface, but that will be fixed RSN.

New Protocols

Inside The Kernel

Implementing a new protocol first means deciding what the manipulatable and non-manipulatable parts of the tuple should be. Everything in the tuple has the property that it identifies the stream uniquely. The manipulatable part of the tuple is the part you can do NAT with: for TCP this is the source port, for ICMP it's the icmp ID; something to use as a "stream identifier". The non-manipulatable part is the rest of the packet that uniquely identifies the stream, but we can't play with (eg. TCP destination port, ICMP type).

Once you've decided this, you can write an extension to the connection-tracking code in the conntrack/ directory, and go about populating the `ip_conntrack_protocol' structure which you need to pass to `ip_conntrack_register_protocol()'.

You can use the "struct_reserve" macros in "struct_reserve.h" in the conntrack/ directory to reserve private room in the conntrack structure. This will always be initialized to zero for you, and you can use it as you wish.

The fields of `struct ip_conntrack_protocol' are:

next

Set it to NULL; used to sew you into the list.

proto

Your protocol number; see `/etc/protocols'.

name

The name of your protocol. This is the name the user will see; it's usually best if it's the canonical name in `/etc/protocols'.

min_length

The minimum size of the header for this protocol. Packets not containing this many bytes after the IP header won't even get to your functions.

pkt_to_tuple

The function which fills out the protocol-specific parts of the tuple, given the packet. The `datah' pointer points to the start of your header (just past the IP header), and datalen is the length of the packet. If the packet doesn't contain enough data for the header information, return 0; datalen will always be at least the min_length you specified, though.

invert_tuple

This function is simply used to change the protocol-specific part of the tuple into the way a reply to that packet would look.

print_tuple

This function is used to print out the protocol-specific part of a tuple; usually it's sprintf()'d into the buffer provided. The number of buffer characters used is returned. This is used to print the states for the /proc entry.

print_conntrack

This function is used to print the private part of the conntrack structure, if any, also used for printing the states in /proc.

destroy

This is called when a conntrack structure of this protocol is about to be destroyed; it gives you a chance to clean up.

established

This function is called when a packet is seen which is part of an established connection. You get a pointer to the conntrack structure, the IP header, the length, the direction of the packet and the statetype. You can modify the statetype: for example you will want to set it to IPS_INVALID if you delete the conntrack.

new

This function is similar to the above, but is called when a packet creates a connection for the first time; there is no direction arg, since the first packet is direction 0 by definition.

expiry

This is the initial time (in jiffies) until the connection should expire. If zero, the connection won't expire.
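
Tying these together, a registration skeleton might look like the following. The field order is assumed to follow the list above, and pkt_to_tuple() et al. stand for your implementations of the functions just described:

    /* Skeleton only: field order and prototypes are assumptions.
     * pkt_to_tuple() etc. are your implementations of the fields
     * described above, defined earlier in the file. */
    static struct ip_conntrack_protocol silly_proto
    = { NULL,                /* next */
        98,                  /* proto: hypothetical protocol number */
        "silly",             /* name: ideally as in /etc/protocols */
        8,                   /* min_length: header bytes required */
        pkt_to_tuple, invert_tuple, print_tuple, print_conntrack,
        NULL,                /* destroy */
        established, new_fn, /* the `new' field in the text */
        30 * HZ };           /* expiry: 30 seconds, in jiffies */

    int init_module(void)
    {
        return ip_conntrack_register_protocol(&silly_proto);
    }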

Once you've written and tested that you can track your new protocol, it's time to teach NAT how to translate it. This means writing a new module: an extension to the NAT code in the NAT/ directory, populating the `ip_nat_protocol' structure which you need to pass to `ip_nat_protocol_register()'. The fields of this structure are:

next

Set it to NULL; used to sew you into the list.

name

The name of your protocol. This is the name the user will see; it's best if it's the canonical name in `/etc/protocols' for userspace auto-loading, as we'll see later.

protonum

Your protocol number; see `/etc/protocols'.

manip_pkt

This is the other half of connection tracking's pkt_to_tuple function: you can think of it as "tuple_to_pkt". There are some differences though: you get a pointer to the start of the IP header, and the total packet length. This is because some protocols (UDP, TCP) need to know the IP header. You're given the ip_nat_tuple_manip field from the tuple (ie. the "src" field), rather than the entire tuple, and the type of manipulation you are to perform.

in_range

This function is used to tell if the manipulatable part of the given tuple is in the given range. This function is a bit tricky: we're given the manipulation type which has been applied to the tuple, which tells us how to interpret the range (is it a source range or a destination range we're aiming for?).

This function is used to check if an existing mapping puts us in the right range, and also to check if no manipulation is necessary at all.

unique_tuple

This function is the core of NAT: given a tuple and a range, we're to alter the per-protocol part of the tuple to place it within the range, and make it unique. If we can't find an unused tuple in the range, return 0. We also get a pointer to the conntrack structure, which is required for ip_nat_used_tuple().

The usual approach is to simply iterate the per-protocol part of the tuple through the range, checking `ip_nat_used_tuple()' on it, until one returns false; see the sketch after this list.

Note that the null-mapping case has already been checked: it's either outside the range given, or already taken.

If IP_NAT_RANGE_PROTO_SPECIFIED isn't set, it means that the user is doing NAT, not NAPT: do something sensible with the range. If no mapping is desirable (for example, within TCP, a destination mapping should not change the TCP port unless ordered to), return 0.

print

Given a character buffer, a match tuple and a mask, write out the per-protocol parts and return the length of the buffer used.

print_range

Given a character buffer and a range, write out the per-protocol part of the range, and return the length of the buffer used. This won't be called if the IP_NAT_RANGE_PROTO_SPECIFIED flag wasn't set for the range.
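
Here is a sketch of the usual unique_tuple() approach for a TCP-like protocol, as described above. The prototypes, and the member names (which follow the earlier tuple and range sketches), are assumptions:

    /* Sketch: iterate the tuple's port through the range until
     * ip_nat_used_tuple() reports it free.  Byte-order handling
     * is omitted for clarity. */
    static int
    unique_tuple(struct ip_conntrack_tuple *tuple,
                 const struct ip_nat_range_elem *range,
                 const struct ip_conntrack *conntrack)
    {
        unsigned int port;    /* wider than u_int16_t so the loop ends */

        for (port = range->min_proto; port <= range->max_proto; port++) {
            tuple->src.u.tcp_port = port;
            if (!ip_nat_used_tuple(tuple, conntrack))
                return 1;     /* found an unused tuple */
        }
        return 0;             /* range exhausted */
    }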

Extending `ipnatctl'

To have protocol-specific rule match criteria, and protocol-specific range criteria, you need to extend ipnatctl. This is done similarly to extending `iptables', only the requirements are currently simpler.

Simply write a shared library which exports the following symbols (ie. don't declare them `static'):

struct option options[]

A NULL-terminated array of `struct option', indicating the extra options you want to recognize. After the user has specified the protocol, the library will be loaded and these options will be available.

void help(void)

A function which prints (to standard error) a usage message.

void parse(int c, char **argv, struct ip_nat_rule_user *rule)

The function which parses the options described in the options array. If there is a parsing problem, call `print_usage()', a printf-style function which doesn't return. Remember to set the mask bits on if you add a requirement to the rule's protocol-specific tuple, and set the IP_NAT_RANGE_PROTO_SPECIFIED flag if you set up the protocol-specific part of the range.
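
A sketch of such a library follows; only the three exported symbols and `print_usage()' are from the text above, the rest is illustrative:

    #include <stdio.h>
    #include <getopt.h>

    /* Sketch: option and member details are illustrative. */
    struct option options[]
    = { { "silly-id", 1, 0, 'i' }, { 0 } };

    void help(void)
    {
        fprintf(stderr, "--silly-id <n>   require protocol id n\n");
    }

    void parse(int c, char **argv, struct ip_nat_rule_user *rule)
    {
        switch (c) {
        case 'i':
            /* Set the protocol-specific part of the tuple and its
             * mask bits here; the member names are assumptions. */
            break;
        default:
            print_usage();    /* printf-style, doesn't return */
        }
    }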

New Mapping Types

This is the really interesting part. You can override the default mapping type by providing a new mapping type, and getting a user to instruct ipnatctl to use that new type.

Two extra mapping types are provided in the default package: masquerade and redirect. These are fairly simple mappings, designed to illustrate the potential and power of writing a mapping type.

Note that you can reserve space inside the conntrack structure (see under `New Protocols' above) for your private use.

Inside the Kernel

This is the core of the new mapping type. A `struct ip_nat_mapping_type' is filled out, with the following fields:

next

Used for the internal linked list; set to NULL.

name

The name of the mapping type. This should match the module name for auto-loading to work, eg "ipnat_mapping_masquerade.o" has name "masquerade".

me

If this field is set, your module's usage count goes up and down as rules refer to you.

check

Your mapping type gets a chance to check a user-specified rule which is about to be inserted; returning 0 will cause the rule to fail. For example, your mapping type may only handle TCP packets; check here that the rule specifies TCP only.

setup

This is the meat of your new mapping type. When a rule of your type matches, this function is called. You get the conntrack, IP packet, the tuple, the rule which matched, and the range the rule specified to be mapped onto.

Inside this function you should call ip_nat_setup_info(); it takes the same arguments your specific setup function does. However, you will frequently want to hand it a different (narrower) range, rather than the one you were given. For example, the masquerading mapping uses a range containing a single IP address: the address of the interface the packet is about to head out of.
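
A skeleton mapping type might look like this; the field order follows the list above, and the check()/setup() prototypes are assumptions:

    /* Skeleton: layout and prototypes assumed from the text. */
    static int check(const struct ip_nat_rule_user *rule)
    {
        return 1;              /* accept any rule */
    }

    static int setup(struct ip_conntrack *conntrack,
                     struct iphdr *iph,
                     const struct ip_conntrack_tuple *tuple,
                     const struct ip_nat_rule *rule,
                     const struct ip_nat_range_elem *range)
    {
        /* Usually narrow the range first, then hand off. */
        return ip_nat_setup_info(conntrack, iph, tuple, rule, range);
    }

    /* module would be named ipnat_mapping_silly.o */
    static struct ip_nat_mapping_type silly_mapping
    = { NULL, "silly", &__this_module, check, setup };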

Protocol Helpers For UDP and TCP

This is still under development.

6.5 Understanding Netfilter

Netfilter is pretty simple, and is described fairly thoroughly in the previous sections. However, sometimes it's necessary to go beyond what the NAT or iptables infrastructure offers, or you may want to replace them entirely.

One important issue for netfilter (well, in the future) is caching. Each skb has an `nfcache' field: a bitmask of which fields in the header were examined, and whether the packet was altered or not. The idea is that each netfilter hook ORs in the bits relevant to it, so that we can later write a cache system clever enough to realize when packets do not need to be passed through netfilter at all.

The most important bits are NFC_ALTERED, meaning the packet was altered (this is already used for IPv4's NF_IP_LOCAL_OUT hook, to reroute altered packets), and NFC_UNKNOWN, which means caching should not be done because some property which cannot be expressed was examined. If in doubt, simply set the NFC_UNKNOWN flag on the skb's nfcache field inside your hook.

6.6 Writing New Netfilter Modules

Plugging Into Netfilter Hooks

To receive/mangle packets inside the kernel, you can simply write a module which registers a "netfilter hook". This is basically an expression of interest at some given point; the actual points are protocol-specific, and defined in protocol-specific netfilter headers, such as "netfilter_ipv4.h".

To register and unregister netfilter hooks, you use the functions `nf_register_hook' and `nf_unregister_hook'. These each take a pointer to a `struct nf_hook_ops', which you populate as follows:

next

Used to sew you into the linked list: set to NULL.

hook

The function which is called when a packet hits this hook point. Your function must return NF_ACCEPT, NF_DROP or NF_QUEUE. If NF_ACCEPT, the next hook attached to that point will be called. If NF_DROP, the packet is dropped. If NF_QUEUE, it's queued. You receive a pointer to an skb pointer, so you can entirely replace the skb if you wish.

flush

Currently unused: designed to pass on packet hits when the cache is flushed. May never be implemented: set it to NULL.

pf

The protocol family, eg, `PF_INET' for IPv4.

hooknum

The number of the hook you are interested in, eg `NF_IP_LOCAL_OUT'.
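
Putting it together, a module which watches locally-generated packets might be sketched as follows. The hook function's full parameter list is an assumption (the text only guarantees the skb pointer-pointer):

    /* Sketch: accept everything, just to demonstrate registration. */
    static unsigned int
    watch(struct sk_buff **pskb)   /* other parameters omitted/assumed */
    {
        return NF_ACCEPT;          /* let the next hook see it */
    }

    static struct nf_hook_ops watch_ops
    = { NULL,                      /* next */
        watch,                     /* hook */
        NULL,                      /* flush: unused */
        PF_INET,                   /* pf */
        NF_IP_LOCAL_OUT };         /* hooknum */

    int init_module(void)
    {
        return nf_register_hook(&watch_ops);
    }

    void cleanup_module(void)
    {
        nf_unregister_hook(&watch_ops);
    }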

Processing Queued Packets

This interface is currently only used by the netfilter_dev; you can register to handle queued packets. This has similar semantics to registering for a hook, except that you can block while processing the packet.

The two functions used to register interest in queued packets are nf_register_interest() and nf_unregister_interest(); both of these take a pointer to a `struct nf_interest', which has fields as follows:

next

Used to sew it into the linked list: set to NULL.

pf

The protocol family of packets you are interested in, eg. PF_INET.

hookmask

A bitmask representing the hooks you are interested in. Thus, 0 means no hooks, 0xFFFFFFFF means all hooks. To only receive packets queued by hook functions on hook NF_IP_LOCAL_OUT, for example, set this to `1 << NF_IP_LOCAL_OUT'.

mark

If this is non-zero, only skbuffs queued with this nfmark value will be passed to you.

reason

Similar to the mark field, but matches the skbuff's nfreason field.

wake

This is a pointer to an nf_wakeme structure, which contains a wait queue and an skbuff queue. The idea is that your driver sleeps on the wait queue in this structure, and gets woken when a packet is added to the queue. See linux/drivers/char/netfilter_dev.c for a way to do this without a race condition.

The reason this is a pointer is that the same nf_wakeme structure can be used for multiple nf_interest structures; this is useful since a process can only sleep on one semaphore.

If no-one is registered to handle a packet, it is dropped. If multiple handlers are registered to handle a packet, the first one in gets it.

Once you have registered interest in queued packets, they begin queueing. You can do whatever you want with them, but you must call `nf_reinject()' when you are finished (don't simply kfree_skb() them). When you reinject an skb, you hand it the skb and a result: NF_DROP causes it to be dropped, NF_ACCEPT causes it to continue iterating through the hooks, and NF_QUEUE causes it to be queued again.

You can also call `nf_getinfo()', with the skbuff which was queued on your queue, and two pointers to IFNAMSIZ-sized character arrays, and the information on the in and out devices will be filled into those arrays. Note that one or the other may be the empty string, depending on the hook the packet was queued from.

Note that even though you may specify NF_ACCEPT for a packet, there are cases where it will be dropped anyway, usually because an interface has vanished since it was queued.
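
In outline, the registration and reinjection dance looks like this; the structure layout and the nf_reinject() calling convention are assumptions from the description above:

    /* Outline only: layout and prototypes are assumptions. */
    static struct nf_wakeme my_wakeme;   /* wait queue + skbuff queue */

    static struct nf_interest my_interest
    = { NULL,                 /* next */
        PF_INET,              /* pf */
        1 << NF_IP_LOCAL_OUT, /* hookmask */
        0,                    /* mark: 0 means any */
        0,                    /* reason: 0 means any */
        &my_wakeme };         /* wake */

    void start(void)
    {
        nf_register_interest(&my_interest);
        /* Sleep on my_wakeme's wait queue; for each skb taken off
         * its skbuff queue when woken:
         *      nf_reinject(skb, NF_ACCEPT);
         * and never just kfree_skb() it. */
    }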

Receiving Commands From Userspace

It is common for netfilter components to want to interact with userspace. The method for doing this is by using the setsockopt mechanism. Note that each protocol must be modified to call nf_setsockopt() for setsockopt numbers it doesn't understand, and so far only IPv4 has been modified (in linux/net/ipv4/ip_sockglue.c).

Using a now-familiar technique, we register a `struct nf_setsockopt_ops' using the nf_register_sockopt() call. The fields of this structure are as follows:

next

Used to sew it into the linked list: set to NULL.

pf

The protocol family you handle, eg. PF_INET.

optmin

and

optmax

These specify the (inclusive) range of sockopts numbers handled.

fn

This is the function called when the user calls one of your sockopts. The caller must have the NET_ADMIN capability.

This technique is used, for example, by the iptables code to register its sockopts.
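
For example (a sketch; the fn prototype and the sockopt numbers are assumptions):

    /* Sketch: prototypes and numbers are illustrative. */
    static int silly_sockopt_fn(struct sock *sk, int optval,
                                void *user, unsigned int len)
    {
        /* handle sockopts numbered optmin..optmax here */
        return 0;
    }

    static struct nf_setsockopt_ops silly_sockopts
    = { NULL,             /* next */
        PF_INET,          /* pf */
        96,               /* optmin: hypothetical */
        97,               /* optmax: inclusive */
        silly_sockopt_fn };

    /* registered with nf_register_sockopt(&silly_sockopts) */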

6.7 Packet Handling in Userspace

Using the netfilter_dev device, almost anything which can be done inside the kernel can now be done in userspace. This means that, with some speed penalty, you can develop your code entirely in userspace. Unless you are trying to filter large bandwidths, you should find this approach superior to in-kernel packet mangling.

In the very early days of netfilter, I proved this by porting an embryonic version of iptables to userspace. Netfilter opens the doors for more people to write their own, fairly efficient netmangling modules, in whatever language they want.

Using /dev/netfilter_ipv4

Creating /dev/netfilter_ipv4

You need to create a character device called `/dev/netfilter_ipv4'. I'm currently using the temporary major number 120 (reserved for LOCAL/EXPERIMENTAL USE); the minor number represents the protocol (see linux/include/linux/socket.h), which for ipv4 (PF_INET) is 2. Hence the command to create /dev/netfilter_ipv4 is:

# mknod /dev/netfilter_ipv4 c 120 2
#

Note that this number will change later, to an official one. See linux/include/linux/netfilter_dev.h.

Grabbing Packets from /dev/netfilter_ipv4

You can simply read from the device, and you will see an information header followed by the packet. You can seek and read as normal, but writing only works in the packet part of the device.

The header is a `struct nfdev_head', which contains:

iniface

The in interface name. May be the empty string.

outiface

The out interface name. May be the empty string.

pktlen

The length of the packet; this, plus the size of the header, gives the largest offset you can seek to or read from.

mark

The skbuff's nfmark field.

reason

The skbuff's nfreason field.

hook

The hook the packet was queued from.

You can read as much or as little as you like. When you are finished with the packet, you do an NFDIOSVERDICT ioctl(), handing it a pointer to a `struct nfdev_verdict'. This structure specifies the verdict (NF_ACCEPT, NF_DROP or NF_QUEUE), the new packet length (currently this cannot be used to grow the packet), the new nfmark value, and the new nfreason value.

The `passer' program in the test/ subdirectory gives a simple example of a program which simply returns NF_ACCEPT for each packet which is queued. In conjunction with the iptables `-j QUEUE' target, this can be developed into a simple userspace examination program.
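
Here is a sketch of such a passer-style program. The per-packet read/verdict cycle and the structure member names are assumed from the description above, so treat this as an outline rather than working code:

    /* Sketch: member names and the exact read/verdict cycle are
     * assumptions based on the text. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/netfilter.h>        /* NF_ACCEPT */
    #include <linux/netfilter_dev.h>    /* NFDIOSVERDICT etc. */

    int main(void)
    {
        int fd = open("/dev/netfilter_ipv4", O_RDWR);
        struct nfdev_head head;
        struct nfdev_verdict v;

        for (;;) {
            /* Header first, then (optionally) the packet itself. */
            if (read(fd, &head, sizeof(head)) != sizeof(head))
                break;
            lseek(fd, head.pktlen, SEEK_CUR);   /* skip the payload */

            v.verdict = NF_ACCEPT;              /* member names assumed */
            v.pktlen = head.pktlen;             /* can't grow the packet */
            v.mark = head.mark;
            v.reason = head.reason;
            ioctl(fd, NFDIOSVERDICT, &v);
        }
        return 0;
    }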

Specifying What Packets You Want from /dev/netfilter_ipv4

When a program first opens `/dev/netfilter_ipv4', it gets all IPv4 packets on any hook, because the netfilter device registers an initial condition of `any hook, any mark, any reason'. There are two ioctls used to delete and add other specifications; usually this catch-all specification is deleted first, any packets drained from the queue, and then the specifications you want added.

Both the NFDIOSADDCOND and NFDIOSDELCOND ioctl take a pointer to a `struct nfdev_condition', which contains a hook mask, a mark value, and a reason value: see the section on Processing Queued Packets above for the meanings of these values.

6.8 Translating 2.0 and 2.2 Packet Filter Modules

In the compat/ directory is a simple layer which should make porting quite simple.

6.9 Motivation

As I was developing ipchains, I realized (in one of those blinding-flash-while-waiting-for-entree moments in a Chinese restaurant in Sydney) that packet filtering was being done in the wrong place. I can't find it now, but I remember sending mail to Alan Cox, who kind of said `why don't you finish what you're doing, first, even though you're probably right'. In the short term, pragmatism was to win over The Right Thing.

After I finished ipchains, which was initially going to be a minor modification of the kernel part of ipfwadm, and turned into a larger rewrite, and wrote the HOWTO, I became aware of just how much confusion there is in the wider Linux community about issues like packet filtering, masquerading, port forwarding and the like.

This is the joy of doing your own support: you get a closer feel for what the users are trying to do, and what they are struggling with. Free software is most rewarding when it's in the hands of the most users (that's the point, right?), and that means making it easy. The architecture, not the documentation, was the key flaw.

So I had the experience, with the ipchains code, and a good idea of what people out there were doing. There were only two problems.

Firstly, I didn't want to get back into security. Being a security consultant is a constant moral tug-of-war between your conscience and your wallet. At a fundamental level, you are selling the feeling of security, which is at odds with actual security. Maybe working in a military setting, where they understand security, it'd be different.

The second problem is that newbie users aren't the only concern; an increasing number of large companies and ISPs are using this stuff. I needed reliable input from that class of users if it was to scale to tomorrow's home users.

These problems were resolved, when I ran into David Bonn, of WatchGuard fame, at Usenix in July 1998. They were looking for a Linux kernel coder; in the end we agreed that I'd head across to their Seattle offices for a month and we'd see if we could hammer out an agreement whereby they'd sponsor my new code, and my current support efforts. The rate we agreed on was more than I asked, so I didn't take a pay cut. This means I don't have to even think about external conslutting for a while.

Exposure to WatchGuard gives me exposure to the large clients I need, and being independent from them allows me to support all users (eg. WatchGuard competitors) equally (my contract says I can't work full-time for a competitor, but I can consult).

So I could have simply written netfilter, ported ipchains over the top, and been done with it. Unfortunately, that would leave all the masquerading code in the kernel: the independence of masquerading and filtering is one of the major wins of moving the packet filtering points, but masquerading also needs to be moved over to the netfilter framework.

Also, my experience with ipfwadm's `interface-address' feature (the one I removed in ipchains) had taught me that there was no hope of simply ripping out the masquerading code and expecting someone who needed it to do the work of porting it onto netfilter for me.

So I needed to have at least as many features as the current code; preferably a few more, to encourage niche users to become early adopters. This means replacing transparent proxying (gladly!), masquerading and port forwarding. In other words, a complete NAT layer.

Even if I had decided to port the existing masquerading layer, instead of writing a generic NAT system, the masquerading code was showing its age, and lack of maintenance. See, there's no masquerading maintainer, and it shows. It seems that serious users generally don't use masquerading, and there aren't many home users up to the task of doing maintenance. People like Juan Ciarlante do fixes, but it's gotten to the stage (being extended over and over) that it needs a rewrite.

Please note that I wasn't the person to do a NAT rewrite: I don't use masquerading any more, and I'd not studied the existing code. That's probably why it took me longer than it should have. But the result is fairly good, in my opinion, and I sure as hell learned a lot. No doubt the second version will be even better, once we see how people use it.

