
Storing IP addresses as integers - nreece
http://norbauer.com/notebooks/code/notes/storing-ip-addresses-as-integers
======
yagibear
Compatibility with IPv6 is much more important than saving a couple of bytes,
IMHO. Better to keep as a string.

~~~
pmorici
Storing the address as a string doesn't implicitly give you any better ipv6
compatibility than storing it as an integer.

I would think that integers would allow for better compatibility, seeing as a
number's a number no matter what; just expand the data type size of the column.
The fact that ipv6 allows you to write the same address in multiple ways, using
the :: notation to denote a single run of zeros, means there are a multitude of
string representations of the same address.

~~~
tptacek
Huh? Littering your code with u_int32_t declarations and comparisons against
magic numbers like "3232235520" doesn't hurt IPv6 compatibility? Pants are
shirts!

~~~
pmorici
Maybe that's the way you'd do it. I was thinking more along the lines of...

inet_addr_t some_var = inet_aton("123.456.789.123");

The way you represent constants in code doesn't have to be a direct
correlation with the way you store their representation.

~~~
tptacek
I don't know what language this is in, but if it's C, you meant to say
inet_aton(string, &addrstruct), and that string is probably more valuable than
struct in_addr. Since C doesn't offer a 128 bit scalar type, I'm not sure how
inet_addr_t helps you any more than u_int32_t.

~~~
pmorici
pseudo code, the language isn't important. Point is, because there are a
number of ways to write the same ipv6 address (assuming it contains runs of
zeros longer than 4 or multiple runs of zeros) you are going to have to
normalize your string input no matter what so why not normalize to an
integer/byte representation which is more compact.

~~~
tptacek
The conversation is about portability, and the observation is that the code
you're talking about still assumes an address is a scalar variable, which
effectively means it assumes IPv4 addressing. That's all.

I'm not super thrilled to debate the performance merits of optimizing 16-byte
strings vs. 4-byte integers. I use whatever is most convenient. It's slightly
easier to convert a charstar to a u_int32_t, but it's much easier to index a
u_int32_t.

~~~
pmorici
The conversation is about _compatibility_ , not portability. ie: what is the
best way to store ipv4 addresses so that they are compatible with ipv6
addresses when you need to start supporting those.

At any rate we are talking about two different things.

If you have the ipv6 address 1234:0000:5678:0000:0000:9212... you can write it
a number of different ways, using the :: abbreviation for the various runs of
zeros. ipv4 really doesn't have that pitfall in the string representation of
its addresses.

So, if you store it as a string one way and then later get input that has the
same address written another way, you are going to have problems if you don't
normalize. Since we are talking about storing in a database, as the parent of
this chain was, you would likely want to normalize to the most compact
representation to save space in your database. The precise type semantics of
your language of choice are irrelevant.
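
To make the normalization point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The address is a hypothetical one from the 2001:db8::/32 documentation prefix, not the truncated address above, and `normalize` is an illustrative helper name.

```python
import ipaddress

# Hypothetical example address (documentation prefix 2001:db8::/32),
# written three different but equivalent ways.
spellings = [
    "2001:0db8:0000:0000:0001:0000:0000:0001",
    "2001:db8::1:0:0:1",
    "2001:db8:0:0:1::1",
]


def normalize(text: str) -> bytes:
    """Parse any valid IPv6 spelling and return its 16-byte packed form."""
    return ipaddress.IPv6Address(text).packed


as_strings = set(spellings)
packed_forms = {normalize(s) for s in spellings}

print(len(as_strings))    # 3 distinct strings
print(len(packed_forms))  # 1 distinct address
```

Compared as raw strings the three spellings look like three different values; after parsing they collapse to a single 16-byte value that could go in a fixed-width binary column.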

------
pmorici
MySQL also has built-in functions to do this:
<http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_inet-aton>

~~~
neilc
PostgreSQL has IMHO a nicer solution: abstract data types to represent IPv4,
IPv6, and MAC addresses, along with the functions over those types that you'd
expect.

<http://www.postgresql.org/docs/8.4/static/datatype-net-types.html>

<http://www.postgresql.org/docs/8.4/static/functions-net.html>

------
audidude
Why would you _ever_ store a 32-bit ip address on disk or in memory in anything
_other_ than a 32-bit word?

~~~
ars
Because of IPv6.

You'll need a 128-bit word for that. Plus some bignum math libraries for 32-bit
machines that don't do 128 bits natively. (Do 64-bit machines handle 128 bits
natively?)

Also watch out that you use an unsigned int. In PHP for example (and probably
most other dynamic languages that don't do bignum as well) all ints are
signed. So you'll have to work with the number as a string.
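
A small sketch of the signedness problem, in Python (whose integers are arbitrary precision, so the wrap has to be simulated with `struct`). The IPv6 address is a hypothetical one from the documentation prefix.

```python
import ipaddress
import struct

addr = int(ipaddress.IPv4Address("192.168.0.0"))  # 3232235520
SIGNED_32_MAX = 2**31 - 1                         # 2147483647

# Reinterpret the same four bytes as a signed 32-bit integer, which is
# roughly what a signed 32-bit column or language int would hold.
as_signed = struct.unpack("!i", struct.pack("!I", addr))[0]

print(addr > SIGNED_32_MAX)  # True: does not fit in a signed 32-bit int
print(as_signed)             # -1062731776

# An IPv6 address is a 128-bit quantity; a Python int holds it directly.
v6 = int(ipaddress.IPv6Address("2001:db8::1"))
print(v6.bit_length() <= 128)  # True
```

Any address at or above 128.0.0.0 goes negative when squeezed into a signed 32-bit integer, which is why range comparisons on such a column can silently misbehave.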

~~~
dpifke
I'd be curious if matching 128 bit words is slower or more painful than
variable-length string comparisons in any RDBMS. I would bet not.

Plus, you can't do CIDR operations on strings, making it a pain to, e.g., match
all addresses within a given /27.
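
For illustration, a /27 match on integer addresses is a mask and a compare. This is a sketch; `to_int` and `in_prefix` are hypothetical helper names and the addresses are made up.

```python
import ipaddress


def to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))


def in_prefix(addr: str, network: str, prefix_len: int) -> bool:
    """True if addr falls inside network/prefix_len."""
    # Keep the top prefix_len bits, clear the rest.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return (to_int(addr) & mask) == (to_int(network) & mask)


# 192.168.1.64/27 covers 192.168.1.64 through 192.168.1.95.
print(in_prefix("192.168.1.70", "192.168.1.64", 27))  # True
print(in_prefix("192.168.1.96", "192.168.1.64", 27))  # False
```

In a database the same test becomes a bitwise AND on the integer column, or a range comparison between the first and last address of the block.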

~~~
audidude
You will often have to do a page-fault for both, so loading won't be too much
of an issue. However, you have to get it into an integer to compare it to begin
with. So the string method is _purely_ overhead.

~~~
tptacek
A page fault, because you touched a 4 byte word, a 16 byte string, or a 16
byte binary address? We're talking about data types that fit in a single L1
cache line.

------
brianobush
I have done this in code that I have written (yes, IPV6 - haven't seen it yet
and it has been talked about for years) that handles IP addresses.

One thing that is really easy once the ip address (x) is in the integer space:
private address determination becomes a simple integer comparison. e.g.,
0.0.0.0 is simply x > 0, in the range 192.168.0.0 to 192.168.255.255 is
written as: x >= 3232235520 && x <= 3232301055, etc.
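
A sketch of that range check in Python, deriving the two bounds from their dotted-quad form instead of hard-coding them. The names `LOW`, `HIGH` and `in_192_168` are illustrative only.

```python
import ipaddress

# Bounds of 192.168.0.0/16 as integers.
LOW = int(ipaddress.IPv4Address("192.168.0.0"))       # 3232235520
HIGH = int(ipaddress.IPv4Address("192.168.255.255"))  # 3232301055


def in_192_168(dotted: str) -> bool:
    """True if the address lies in 192.168.0.0 - 192.168.255.255."""
    x = int(ipaddress.IPv4Address(dotted))
    return LOW <= x <= HIGH


print(in_192_168("192.168.10.7"))  # True
print(in_192_168("192.169.0.1"))   # False
```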

~~~
tghw
Or, much easier for anyone else reading your code:

    ip.startswith('192.168.') or ip.startswith('10.')

No, it's not as fast, but are you really doing that many is_ip_private()
calls?

~~~
tptacek
That works for /16's and /8's. Now do it for /27.

~~~
ars
Grab the last octet, and see if it's <= 30

Personally I store a string until 128bit math becomes easier (so I can handle
IPv6). But usually I just want to log it, not check netmasks.

~~~
tptacek
There are, what, 8 /27s in a /24? How does seeing if the last octet is less
than 30 help you?
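
To spell out the count, here is a sketch that enumerates the /27 blocks inside a hypothetical /24 (192.168.1.0/24 is just an example) with Python's `ipaddress` module.

```python
import ipaddress

block = ipaddress.ip_network("192.168.1.0/24")  # hypothetical /24
subnets = list(block.subnets(new_prefix=27))

for net in subnets:
    first = int(net.network_address) & 0xFF    # last octet of first address
    last = int(net.broadcast_address) & 0xFF   # last octet of last address
    print(net, first, last)

print(len(subnets))  # 8

# A last octet of 40 fails the "<= 30" test, yet the address sits inside
# a /27 (the second one, 192.168.1.32/27).
print(ipaddress.ip_address("192.168.1.40") in subnets[1])  # True
```

A last octet of 30 or less only describes the first of the eight blocks, so the check says nothing about membership in any other /27.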

------
Confusion
This explanation is clear to everyone who already understands it and will be
inexplicable to anyone who doesn't. Foremost, he should explain that ipv4
addresses simply _are_ 4 byte numbers and that www.xxx.yyy.zzz is just a human
readable presentation. Then it is immediately clear that the latter isn't
necessarily the most common way to store the datum.
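
A minimal sketch of that relationship in Python, with the arithmetic written out by hand. `dotted_to_int` and `int_to_dotted` are hypothetical helper names, and there is no input validation.

```python
def dotted_to_int(dotted: str) -> int:
    """Each octet is one byte; the first octet is the most significant."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def int_to_dotted(value: int) -> str:
    """Peel the four bytes back off, most significant first."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(dotted_to_int("192.168.0.0"))  # 3232235520
print(int_to_dotted(3232235520))     # 192.168.0.0
```

So 192.168.0.0 is 192 * 2^24 + 168 * 2^16 + 0 * 2^8 + 0 = 3232235520; the dotted quad is just those four bytes printed in decimal.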

------
walesmd
In a project here at work we are storing IP Addresses in both string format
and in integer format (primarily, so we can sort the addresses intelligently).
By sorting on the integer column, yet displaying the string column, you get
the result set in the order that makes the most sense.
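
A quick sketch of why the string column sorts badly, using three made-up addresses:

```python
import ipaddress

addresses = ["10.0.0.2", "9.0.0.1", "10.0.0.10"]

# Plain string sort compares character by character.
by_string = sorted(addresses)

# Sorting on the integer value gives numeric address order.
by_integer = sorted(addresses, key=lambda a: int(ipaddress.IPv4Address(a)))

print(by_string)   # ['10.0.0.10', '10.0.0.2', '9.0.0.1']
print(by_integer)  # ['9.0.0.1', '10.0.0.2', '10.0.0.10']
```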

------
albertsun
How often would this actually be worth it? My hunch is that the computational
time involved in packing and unpacking IP addresses into integers is more
valuable than the space saved by storing them as integers.

~~~
jbyers
The computational time to pack and unpack an IP to an integer is vanishingly
small. My old MacBook Pro runs the corresponding PHP function 500,000 times
per second.

The difference between an integer and a fifteen byte string is eleven bytes.
Our database has a few hundred-million row tables (barely considered big by
today's standards) that store IPs. Storing IPs as integers saves us a GB per
hundred million rows in addition to a substantial index size reduction.

Your application may not need to store that much data, but it's my experience
that tables with IPs are the ones that tend to get big. :)
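
The space figure follows from simple arithmetic. The sketch below assumes the worst case of a 15-character dotted quad and ignores any per-row overhead the database adds for variable-length strings.

```python
STRING_BYTES = 15     # longest dotted quad, e.g. "255.255.255.255"
INT_BYTES = 4         # 32-bit unsigned integer
ROWS = 100_000_000    # one hundred million rows

saved = (STRING_BYTES - INT_BYTES) * ROWS

print(saved)          # 1100000000 bytes
print(saved / 10**9)  # 1.1 (GB)
```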

~~~
potatolicious
Storage is cheap - the primary win here is computation time.

if(ip1 == ip2) is a _lot_ faster as ints than strings.

~~~
tptacek
Seeing as how the largest dotted-quad IP address fits inside rax:rdx on a
modern CPU, and that two of them fit in a single cache line, I'm guessing x ==
y, while faster, is not "much" faster with integers than with strings.

I wouldn't populate an address trie with strings, but I also wouldn't give a
second thought to passing them around a random C program as charstars either.

~~~
potatolicious
Most string libs aren't that smart though - string comparisons are still _byte
by byte_. You're now comparing something 15 times instead of 1. You can do an
int32/int64 (depending on architecture) compare in a single op.

The point I guess is, you can keep 'em around as charstars, but eventually
you'll have to do this cast to compare them...

~~~
tptacek
All memcmp's are this smart. But that's kind of beside the point, right? 1
time, 14 times, if we're talking about L1 cache, we're really epsilon from
pure reg/reg ALU operations, implementing effectively constant-time
algorithms.

I agree, int32 is faster. Like I said, it's just not "much" faster.

------
tptacek
Last time I looked, Ruby's IPAddr code had a horrible DNS dependence.

