Same host detection through IP-ID

This page explains one of the privacy risks related to the identification field in the IPv4 header. These issues are not new. Shortcomings of the field were known in the '90s, and as a result a few improvements were included in the IPv6 design.

The field is mandatory in IPv4 and thus poses a risk regardless of whether you have any need for the field. The field is only 16 bits long in IPv4, which has proven to be insufficient.

In IPv6 the field is optional and on most IPv6 packets it isn't used. Moreover the size of the field has been increased to 32 bits, which avoids many of the problems.

How IP-ID can detect IP addresses assigned to the same host

Some networking stacks use a counter to generate IP-ID values. If the host has multiple IP addresses and uses the same counter for them, it is possible to inspect traffic from these IP addresses and see that the IP-ID values were generated by a common counter.

Linux uses an array of counters. When an IP-ID value is needed, a hash value is computed to choose an index in this array. The hash input includes the source and destination IP address. This reduces the risk of using the same counter for unrelated flows, but it doesn't eliminate the possibility. Older Linux versions used 1024 counters. Newer Linux versions use up to 262144 counters depending on available memory.
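The counter array described above can be illustrated with a small Python sketch. This is not the kernel's code: the class name IpIdCounters, the keyed BLAKE2b hash and the plain increment by one are all assumptions chosen for illustration, and only the overall shape (hash of source and destination address selects one counter in an array) comes from the description above.

```python
import hashlib
import ipaddress
import os


class IpIdCounters:
    """Toy model of an array of IP-ID counters indexed by a keyed hash."""

    def __init__(self, size=1024, key=None):
        # A secret key keeps outsiders from predicting the index (assumption).
        self.key = key if key is not None else os.urandom(16)
        self.counters = [0] * size

    def _index(self, src, dst):
        # Hash the packed source and destination addresses to an array index.
        data = ipaddress.IPv4Address(src).packed + ipaddress.IPv4Address(dst).packed
        digest = hashlib.blake2b(data, key=self.key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.counters)

    def next_id(self, src, dst):
        # Advance the selected counter and return it as a 16-bit IP-ID.
        i = self._index(src, dst)
        self.counters[i] = (self.counters[i] + 1) & 0xFFFF
        return self.counters[i]
```

Two flows whose address pairs hash to the same index draw from the same counter, which is exactly what an observer can notice from the outside.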

The proof-of-concept provided here can ping a pair of IP addresses using different source addresses, looking for a pair of source addresses which get to use the same IP-ID counter. Due to the birthday paradox the number of IP addresses needed is only the square root of the number of counters. Thus you would need around 32 IP addresses when targeting an older Linux version and around 512 when targeting a newer Linux version. For more reliable detection I opted for twice that number.
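A rough estimate shows why doubling the number of addresses helps so much. The sketch below assumes a simplified model that is not taken from the tool itself: each (source, target) pair lands on a uniformly random counter, n source addresses are tried against each of the two targets, and the n * n cross pairs are treated as independent. The function name collision_probability is made up for this example.

```python
import math


def collision_probability(sources_per_target, counters):
    """Approximate chance that at least one cross pair shares a counter."""
    pairs = sources_per_target ** 2
    return 1.0 - math.exp(-pairs / counters)


for counters in (1024, 262144):
    root = math.isqrt(counters)
    for n in (root, 2 * root):
        print(counters, n, round(collision_probability(n, counters), 3))
# 1024 32 0.632
# 1024 64 0.982
# 262144 512 0.632
# 262144 1024 0.982
```

Under this model the square root gives success a bit more often than not, while twice the square root succeeds in nearly every run.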

To use the tool you need a /22 IPv4 address range. The code in the repository has hardcoded the client IP range as 100.100.0.0/22. But you only need to edit a few places in the code to use it with a public IPv4 range.

Example usage

The source can be downloaded through this link or by running the command:

hg clone https://v6tools.kasperd.dk/same-host/

Once you have the source you can run it with two IP addresses as argument:

# ./same-host.py 172.19.0.2 172.19.0.3
2034
2034 2032
2032 2032
89
2032 2032
2032 2032
67
Evaluating
100.100.1.182 > 172.19.0.3
100.100.0.147 > 172.19.0.2
Evaluating
100.100.2.66 > 172.19.0.3
100.100.1.143 > 172.19.0.2
Evaluating
100.100.3.174 > 172.19.0.2
100.100.3.63 > 172.19.0.3
Evaluating
100.100.2.128 > 172.19.0.2
100.100.2.128 > 172.19.0.3
!!!!!! Found shared IPID counter between 172.19.0.2 and 172.19.0.3 !!!!!!!
Evaluating
100.100.1.203 > 172.19.0.3
100.100.1.245 > 172.19.0.2
Evaluating
100.100.2.11 > 172.19.0.2
100.100.2.248 > 172.19.0.3
Evaluating
100.100.3.241 > 172.19.0.3
100.100.0.49 > 172.19.0.2
Evaluating
100.100.0.97 > 172.19.0.3
100.100.0.230 > 172.19.0.2
Evaluating
100.100.1.250 > 172.19.0.3
100.100.1.50 > 172.19.0.2
Evaluating
100.100.0.224 > 172.19.0.2
100.100.1.186 > 172.19.0.3
!!!!!! Found shared IPID counter between 172.19.0.2 and 172.19.0.3 !!!!!!!
Evaluating
100.100.0.66 > 172.19.0.2
100.100.2.207 > 172.19.0.3
Evaluating
100.100.2.87 > 172.19.0.3
100.100.3.238 > 172.19.0.2
Evaluating
100.100.0.246 > 172.19.0.2
100.100.2.126 > 172.19.0.3
Evaluating
100.100.2.194 > 172.19.0.2
100.100.3.242 > 172.19.0.3

The proof-of-concept code produces somewhat verbose output with information about the steps it is taking. The relevant output is the line printed when IP addresses have been found to be using the same counter:

Evaluating
100.100.0.224 > 172.19.0.2
100.100.1.186 > 172.19.0.3
!!!!!! Found shared IPID counter between 172.19.0.2 and 172.19.0.3 !!!!!!!

This tells us that 172.19.0.2 and 172.19.0.3 are sharing an IP-ID counter, so they must be pointing to the same host. The output also shows which two client IP addresses were being used to achieve the same counter.

Several other sets of IP addresses were evaluated along the way because some of the counters happened to have nearby values, but the verification concluded that they were not the same counter after all.
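One plausible way to tell a real shared counter from two counters that merely have nearby values is to alternate probes between the two targets and check that every reply continues the sequence. The sketch below is not taken from same-host.py; the function name looks_like_one_counter and the max_step threshold are assumptions, and it presumes the counter advances by only a small amount between consecutive probes.

```python
def looks_like_one_counter(ids, max_step=64):
    """Check that IP-ID values from alternating probes form one sequence.

    Each value must be ahead of the previous one by a small positive step,
    computed modulo 2**16 so that wraparound is handled.
    """
    for prev, cur in zip(ids, ids[1:]):
        step = (cur - prev) & 0xFFFF
        if not 1 <= step <= max_step:
            return False
    return len(ids) >= 2
```

Two independent counters that start close to each other drift apart as probes alternate, so a longer run of probes makes a false positive less likely.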

How does this impact privacy?

You might be running two applications on one machine which communicate with servers with an expectation that the servers cannot tell that both applications are running on the same host. This could for example be two separate web browsers.

If you open a web page in each browser, those pages could collude to load resources from several different IP addresses in order to detect your host using the same IP-ID counter for some of those resources.

Another scenario would be if you are running one website using your regular IP address and a separate website using an IP address you got through a tunnel or VPN. You might not want clients to know that both IP addresses are hosted on the same machine, but by inspecting IP-ID counters they could find out.

What does Linux do about this?

Linux has taken two different approaches to address this issue on IPv4 and IPv6.

Linux has stopped using counters to assign IP-ID values on IPv6. Instead IP-ID values are generated using a random number generator. This increases the risk of colliding IP-ID values. However, because the field is 32 bits on IPv6, such collisions will still happen less frequently than on IPv4.

On IPv4 the number of counters has been increased from 1024 to as many as 262144 on hosts with lots of memory. However, as demonstrated by this proof-of-concept, that is insufficient to prevent a targeted attacker from arranging two flows using the same counter.

Solutions

Use IPv6
Keep in mind that you also need to ensure that you are using different IP addresses. If you switch to IPv6 and use the same IP address for both applications, then peers you are communicating with will know it's the same host from the IP address.

Workarounds

More sophisticated IP-ID algorithms
There are likely many different algorithms which would work. The first which comes to my mind is to not use the counter directly as IP-ID value and instead pass it through a tweakable block cipher. This block cipher would have to use a 16 bit block size, which rules out most of the commonly used block ciphers. The tweak value needs to include the same information used as input for the hash function, so the tweak would need to be larger than the block size.
Use a proxy
An application layer proxy won't pass through IP headers. As such it can be used to mask your IP-ID values. There are many variations of this such as HTTP proxies, SOCKS proxies, Tor, and others. But they will result in a more complicated setup and don't apply to all use cases.
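The tweakable block cipher idea mentioned under "More sophisticated IP-ID algorithms" could look roughly like the following sketch. It is a toy construction for illustration only, not a vetted cipher and not something Linux implements: a four-round Feistel network over two 8-bit halves with keyed BLAKE2s as round function. The names encrypt16 and decrypt16 are made up, and the tweak is assumed to be the packed source and destination addresses (64 bits, so larger than the block).

```python
import hashlib


def _round(key, tweak, rnd, half):
    # Round function: one byte derived from key, tweak, round number and half.
    h = hashlib.blake2s(tweak + bytes([rnd, half]), key=key, digest_size=1)
    return h.digest()[0]


def encrypt16(key, tweak, value, rounds=4):
    """Map a 16-bit counter value to a 16-bit IP-ID under the given tweak."""
    left, right = value >> 8, value & 0xFF
    for rnd in range(rounds):
        left, right = right, left ^ _round(key, tweak, rnd, right)
    return (left << 8) | right


def decrypt16(key, tweak, value, rounds=4):
    """Inverse of encrypt16, included to show the mapping is a permutation."""
    left, right = value >> 8, value & 0xFF
    for rnd in reversed(range(rounds)):
        left, right = right ^ _round(key, tweak, rnd, left), left
    return (left << 8) | right
```

Because a Feistel network is a permutation for every tweak, a counter that steps through all 65536 values still yields 65536 distinct IP-ID values per flow, while two flows sharing a counter would see unrelated-looking outputs.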

Non-solutions

Block ICMP
The proof of concept uses ICMP echo request packets. But the IP-ID issue lies in the IPv4 layer, so it could be exploited through other transport protocols and not just ICMP.
Use NAT
Using NAT makes everything harder. For that reason I didn't bother to design this proof-of-concept with NAT in mind. But that doesn't stop a determined attacker from implementing a version that works with NAT. Moreover keep in mind that there are other problems related to passing IP-ID values through NAT.
Switch to another OS
The underlying problem is in the design of IPv4. Linux has taken some steps to work around the issue, but they don't cover everything. I have not investigated other operating systems, but I doubt they will do significantly better.
Use more IP-ID counters
Using the same algorithm with a larger number of counters isn't viable. The number of counters you need grows quadratically with the number of IP addresses used by the attacker. You would need to allocate 32 exabytes of memory just for the counters to achieve good protection.
Use containers or network namespaces
The counters are shared between all containers and network namespaces. The hash function mapping IP addresses to array indices is separate per network namespace, but they all index into the same array, so collisions can still happen.
Extend the IP-ID field in IPv4
Deploying such a change is not feasible. Deploying to all endpoints and ensuring they use compatible versions is in itself a challenge. But even that isn't sufficient. Packets can be fragmented in flight on intermediate routers, and when that happens the routers would need to understand the extended IP-ID field. You might think an IPv4 option which gets copied to all fragments would solve that, but then you'd first have to fix the vast number of networks where IPv4 options don't work.
Exclude remote IP from hash input
The remote IP is included to protect against other IP-ID related risks. Removing it would solve this problem, but reintroduce other worse problems.
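The page does not spell out how the 32 exabyte figure under "Use more IP-ID counters" was derived. One set of assumptions that reproduces it is sketched below: an attacker who could use up to 2**32 source addresses, a counter array of the square of that size, and 2 bytes per 16-bit counter. All three values are guesses for illustration.

```python
attacker_addresses = 2 ** 32                 # assumed: the whole IPv4 space
counters_needed = attacker_addresses ** 2    # quadratic growth in addresses
bytes_per_counter = 2                        # assumed: one 16-bit counter

total_bytes = counters_needed * bytes_per_counter
print(total_bytes // 2 ** 60, "EiB")
# 32 EiB
```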

How severe is this problem?

The issue is not very severe. It is however more severe than some of the excuses sometimes made up for not deploying IPv6. So this page is primarily intended to be used to counter flawed arguments in favor of IPv4. What's demonstrated by this proof-of-concept is one inherent problem in IPv4 that's mostly fixed by IPv6.