
Potholes to avoid when migrating to IPv6 - deafcalculus
https://rachelbythebay.com/w/2018/12/30/v6/
======
GlitchMr
Don't manually parse those things. If your database provides a way to parse
those (like PostgreSQL does), make use of it. Use what your programming
language provides, and if it really doesn't provide the functionality to parse
IPv6 addresses, use a library to do so.

For example, in Rust you can write this:

    
    
        use std::error::Error;
        use std::net::{AddrParseError, SocketAddr};
        
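        // Try "address:port" first; if that fails, treat the input as a bare
        // IP address and fall back to the default port 10443.
        fn parse_address(addr: &str) -> Result<SocketAddr, AddrParseError> {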
            addr.parse()
                .or_else(|_| Ok(SocketAddr::new(addr.parse()?, 10443)))
        }
        
        fn main() -> Result<(), Box<dyn Error>> {
            let addresses = [
                "[2001:0db8:f00f::0553:1211:0088]:10444",
                "2001:0db8:f00f::0553:1211:0088",
            ];
            for address in &addresses {
                println!("{:?}", parse_address(address)?);
            }
            Ok(())
        }
    

This isn't specific to IPv6, by the way. It also applies to other standards
like CSV (although I suppose with CSV it does vary; I've seen so many broken
CSV files that sometimes a custom implementation is the best way to parse
those).

~~~
mavhc
The most stupid thing about CSV is it's entirely pointless, ASCII has
codepoints for file, group, record and unit separators.
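
As a rough illustration of that idea, here is a minimal Python sketch that puts the ASCII unit separator (US, 0x1F) between fields and the record separator (RS, 0x1E) between records. The names `dump_records` and `load_records` are made up for this example, and it assumes field values never contain those two control characters.

```python
# ASCII control characters reserved for structuring data.
UNIT_SEP = "\x1f"    # US: separates fields within a record
RECORD_SEP = "\x1e"  # RS: separates records


def dump_records(rows):
    """Serialize a list of rows (lists of strings) into one string."""
    for row in rows:
        for field in row:
            # Assumption: there is no escaping scheme, so the separators
            # themselves may not appear inside a field.
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def load_records(text):
    """Split a dumped string back into rows (an empty string means zero rows)."""
    if text == "":
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes and newlines need no special treatment in the data.
rows = [["name", "note"], ["Ada", 'says "hi", twice'], ["Bob", "line one\nline two"]]
print(load_records(dump_records(rows)) == rows)
```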

~~~
twic
Which you can't type on a keyboard, sensibly print on a terminal, or express
with a symbolic escape (like '\n') in any language I know.

If history had gone differently, the separators would be useful for separating
things in files, but it didn't.

~~~
JdeBP
I am sure I am not alone at chuckling at the idea that one cannot type FS, GS,
RS, and US on a keyboard, clearly born of GUI conventions and thinking. Some
of us have been typing control characters on a keyboard for quite a number of
years. This is what the [Control] key is for. (-:

FS enjoys quite significant use nowadays on terminal keyboards in the worlds
of Unix and Linux, it being the usual special character mapped to SIGQUIT. GS
and RS are similarly fairly mainstream commands in VIM/NeoVIM.
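
For reference, the Control-key mapping is simple arithmetic: on a conventional terminal, Ctrl plus a key sends the key's ASCII code with the upper bits masked off (code & 0x1F). A small Python sketch of that rule, where `ctrl` is a helper name invented here:

```python
def ctrl(key):
    """Return the control character conventionally sent for Ctrl+key."""
    return chr(ord(key) & 0x1F)


# The four ASCII separators and the keys that produce them.
SEPARATORS = {
    "FS": ctrl("\\"),  # Ctrl+\  file separator
    "GS": ctrl("]"),   # Ctrl+]  group separator
    "RS": ctrl("^"),   # Ctrl+^  record separator
    "US": ctrl("_"),   # Ctrl+_  unit separator
}

for name, char in SEPARATORS.items():
    print(f"{name} = 0x{ord(char):02X}")
```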

~~~
adwn
> _I am sure I am not alone at chuckling at the idea [...]_

You might want to hold back on your condescending laughter for a bit. Neither
vim nor nano allow me to insert a control character by pressing Ctrl+ _key_ ,
so how am I supposed to enter the FS character when editing a text file?
Besides, there are other operating systems besides Unix and Linux ( _cough_
Windows _cough_ ).

~~~
jodoherty
Actually, VIM does support this. Type Ctrl+V in insert mode followed by
whatever you want to insert (e.g. Ctrl+\ for the ASCII FS character). VIM will
insert the literal value of the control code into the buffer.

~~~
adwn
Thanks! I had tried Ctrl+V followed by the character from the caret notation
(e.g., " _Ctrl+V \_ ", instead of " _Ctrl+V Ctrl+\_ "), and that hadn't
worked.

------
zAy0LfpBZLC8mAC
Well, I mean, yes, those are all mistakes people are going to make, no doubt
about that. But somehow the solution is missing?

The core mistake here is using string manipulation at all. The only correct
way to handle string data is to parse it, and then either operate on the
parsed data structure, or re-serialize into a canonical format and use that
for comparisons and stuff.

And in the best case, you don't try to build your own parser, but use a well-
tested one that's already there. For the particular use case of parsing IP
addresses, it's probably best to use getaddrinfo() with AI_NUMERICHOST, and
for re-serialization getnameinfo() with the same flag. If those don't
understand the address, you most likely won't be able to connect anyway. And
they will handle stuff like link-local addresses correctly, at least as long
as you are on the host that actually has the respective interface.
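
A minimal sketch of that round trip, written in Python because its `socket` module wraps the same two calls. The helper name `canonical_ip` is hypothetical, and the exact output text comes from the platform's libc, so the example compares results rather than asserting a particular spelling.

```python
import socket


def canonical_ip(text):
    """Parse a numeric address with getaddrinfo() and re-serialize it."""
    # AI_NUMERICHOST forbids DNS lookups: anything that is not a numeric
    # address raises socket.gaierror instead of being resolved.
    infos = socket.getaddrinfo(text, None, type=socket.SOCK_STREAM,
                               flags=socket.AI_NUMERICHOST)
    sockaddr = infos[0][4]
    # NI_NUMERICHOST asks for the numeric form instead of a reverse lookup.
    host, _service = socket.getnameinfo(
        sockaddr, socket.NI_NUMERICHOST | socket.NI_NUMERICSERV)
    return host


a = canonical_ip("2001:0db8:f00f::0553:1211:0088")
b = canonical_ip("2001:db8:f00f:0:0:553:1211:88")
print(a == b)  # both spellings name the same address
```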

For databases, you use column types intended for storing IP addresses, so the
database will do the parsing and canonicalization for you.

And when you actually have to build a parser yourself, read the damn spec for
the format instead of going by what you think the format is, because most
likely it's not that.

And mind you that most of those problems are not really IPv6-specific. There
are also many ways to write an IPv4 address that your average parser will
understand. Most of those are generally frowned upon, so they don't occur
often, but if you want to reliably compare IPv4 addresses, you actually need
to do the same as for IPv6.

~~~
TorKlingberg
One reason people keep using string manipulation is that in C getaddrinfo
gives an addrinfo, which is quite a complex struct. It contains a struct
sockaddr, which is actually a sockaddr_in, which contains both an address and
a port. So now you have to carefully keep track of which of your addrinfos and
sockaddrs have the port in them and which are just the address.

------
nieve
It won't solve most of the problems, but if you're using PostgreSQL its native
inet datatype (which supports 4 or 6) instead of a text or binary string can
save a world of pain and there are alternatives in some other RDBMSs:

[https://www.postgresql.org/docs/current/datatype-net-types.h...](https://www.postgresql.org/docs/current/datatype-net-types.html#DATATYPE-INET)
[https://www.postgresql.org/docs/current/functions-net.html](https://www.postgresql.org/docs/current/functions-net.html)

First class network and ip types that properly support contains and exclusion
operators make a lot of things less error-prone and potentially much faster.
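
For readers outside PostgreSQL, a rough analogue of those containment checks exists in Python's standard `ipaddress` module. This is only an illustration of what a first-class network type buys you, not a description of how `inet` works; the addresses are documentation-range examples picked for this sketch.

```python
import ipaddress

network = ipaddress.ip_network("2001:db8::/32")
inside = ipaddress.ip_address("2001:0DB8:0000:0000:0000:0000:0000:0001")
outside = ipaddress.ip_address("2001:db9::1")

# Containment is computed on the parsed values, not on the text, so the
# verbose upper-case spelling of `inside` makes no difference.
print(inside in network)
print(outside in network)

# Subnet relationships are checked the same way (Python 3.7+).
subnet = ipaddress.ip_network("2001:db8:f00f::/48")
print(subnet.subnet_of(network))
```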

~~~
jschwartzi
Yeah, most of these problems can be solved by declaring a new type of data,
defining how a string address is parsed into that data type, defining rules
for comparison, and using that type whenever a host is input into the system.
These are problems that happen when you don't use the type system of a
language or the language doesn't allow you to enforce specific types as input.
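
A minimal sketch of that approach in Python, assuming the standard `ipaddress` module does the actual parsing. `HostAddress` is a hypothetical type name; the point is only that raw strings get converted once, at the boundary, and everything after that compares parsed values.

```python
import ipaddress
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class HostAddress:
    """An IP address that can only be built from successfully parsed input."""
    ip: Union[ipaddress.IPv4Address, ipaddress.IPv6Address]

    @classmethod
    def parse(cls, text: str) -> "HostAddress":
        # Raises ValueError for anything that is not a valid IP address.
        return cls(ipaddress.ip_address(text.strip()))

    def __str__(self) -> str:
        return str(self.ip)


# Equality (generated by the dataclass) compares the parsed address,
# so different spellings of the same address are equal.
a = HostAddress.parse("2001:0db8:0000:0000:0000:0000:0000:0001")
b = HostAddress.parse("2001:db8::1")
print(a == b, str(a))
```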

~~~
nieve
I doubt we're ever going to get standardization on a given ip data type even
within most languages, much less across them, but there's a tiny bit of hope
that the various databases might converge on something vaguely portable. It's
taken twenty years, but I'm actually seeing it with some other types. We just
need to make sure MySQL doesn't release something where internally all ip
addresses (4 & 6) are stored as 47 EBCDIC characters and silently corrupts
your dumps if you're using a really-no-I-mean-it UTF-8 locale. (Not entirely
fair to the MySQL devs, but a lot of database people have been burned by their
approach and Oracle is not making this better.)

~~~
nly
Any language with networking in the standard library should have a data type
for IP addresses. There's no good reason for API bindings for database libs to
not use them, and no good reason for you to not use them even if you're not
using a database.

~~~
nieve
Agreed in theory, but I've seen plenty of hand-rolled solutions even so and I
think the dbs are at least slightly more likely to get standardization via ORM
devs. One of the most painful I've run across was a system that was allocating
IPs for VMs with short but highly variable lifespans and trying to store them
in a MySQL database manipulated with both Ruby and Python. Even with language
support there was no real defense against coworkers trying to brute-force
complicated network checks on assignments when sub-second responses were
needed. Unless your work is only ever in a small corner of a large system a
single language and standard library having a native type won't always protect
you from your code getting handed ip addresses from a foreign source in
various text formats. The original post explains why this is a potential
nightmare in detail.

------
peterwwillis
Don't try to parse things without a standard. Even CSV and e-mail addresses
are more complicated than they seem.

Also, it's pretty silly that we still use these unintuitive conventions from
40 years ago for modern systems. Is 192.168.2.8:10443 an address? A phone
number and extension? Is it TCP, UDP? IPv4, IPv6? An HTTPS service, or just
something resembling its decimal notation assigned service number? Are there
multiple services proxied behind this one address? Can I route between them?
When I request a URI, does the application know what I really want/expect?
What about a timeout for my request? What about authentication/authorization?
Consistency requirements? Idempotence? Security guarantees?

Operating systems don't even take <host>:<port> arguments for network
syscalls, that's just a convention we sort of came up with and later stuck to.
But as a URL it's pretty crap. I suggest we replace them with modern URLs that
can embed tiered information such as session IDs, service types, routes,
security requirements, operational parameters, etc. Most people may only need
[https://google.com/](https://google.com/), but sometimes we may also want to
request webv2+uquic+v6:/SC,TLSv1.3/userid[s:84742049]@google.com(r:NA)/ . I
know that's ugly as sin, but hopefully people wouldn't need to specify all of
that all of the time (service name/version, transport, address, strong
consistency, TLS 1.3, userid, session id, host/namespace, North American
region).

~~~
tedunangst
Why bother with text when ASN.1 solves all these problems too?

~~~
mcguire
But then you have to solve ASN.1's problems.

Some of such, circa 2000:
[https://tools.ietf.org/html/draft-yu-asn1-pitfalls-00](https://tools.ietf.org/html/draft-yu-asn1-pitfalls-00)

Holy mother of cows! The specs are available!
[https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx)

------
donatj
It’s certainly been said but IPv6 to my eyes is awash in second system
syndrome, which has largely slowed its adoption.

The complexity of handling addresses is plainly a failure of the design. Using
colons for separators when they’re already being used for ports served no
purpose but to confuse. Having more than a single valid form of an address
again only serves to confuse.

If there’s anything to learn from the UNIX principles, it’s that there is great
power in making things easily manipulated as strings. The design of IPv6 makes
this impossible.

~~~
stephen_g
None of the standard calls use colons for separating ports - I can’t recall
exactly whether it’s even a standard URI thing or just something web browsers
did.

Further, even UNIX paths containing ‘..’ and other things need normalisation,
and with symlinks you can’t necessarily just compare two paths to determine
whether they point to the same file.

The problem, as plenty of other people have said in these comments, is that
you just need to parse the address and compare the structure. There are
standard library calls for this, and most databases have types that can do
this.

~~~
isbjorn16
> None of the standard calls use colons for separating ports - I can’t recall
> exactly whether it’s even a standard URI thing or just something web
> browsers did.

[https://tools.ietf.org/html/rfc3986#section-3.2.3](https://tools.ietf.org/html/rfc3986#section-3.2.3)

Specifically part of the URI RFC.

Probably should have picked something wildly unrelated as a separator instead
of colons - but the horse has already left the barn, so now we just get to
play cleanup.

You're right that there's already a decent amount of support for this in
various languages and persistence mechanisms, but I've spent enough time in my
career playing with truly bleeding edge barely-an-alpha platforms to know that
sometimes you do just get stuck writing your own parser and comparators. It's
good to know in advance what sort of potholes you will be trying to navigate
:)

Edit: I just discovered the URI RFC is dated 2005, with IPV6 being dated 1998.
( [https://www.ietf.org/rfc/rfc2460.txt](https://www.ietf.org/rfc/rfc2460.txt)
)

Still, it struck me as odd that we had been using host:port for so long. The
URL conceptual predecessor to URI RFC also mentions using host:port in section
3.1.1 - which came out in 1994.

Regardless of which RFC made a hash of things though, all 3 are widely used
specifications, so we're stuck in this boat now!

------
mgerdts
Not just ipv6.

    
    
      $ ping 127.1
      PING 127.1 (127.0.0.1): 56 data bytes
      ...
      $ ping $(((127 << 24) + 1))
      PING 2130706433 (127.0.0.1): 56 data bytes
      ...
    

I've primarily seen the second form in the early days of the web when spammers
were presumably trying to bypass mail or web filters that were scanning for
blacklisted IPs.
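
A small Python sketch of the same equivalence, using the standard `ipaddress` module. It also shows that a strict parser may refuse the shorthand that `ping` accepts, which is one more reason to compare parsed values rather than text.

```python
import ipaddress

# The integer form from the shell example: (127 << 24) + 1 == 2130706433.
from_int = ipaddress.IPv4Address((127 << 24) + 1)
from_text = ipaddress.IPv4Address("127.0.0.1")
print(from_int, from_int == from_text)

# ipaddress is strict: the two-part shorthand is rejected rather than
# expanded the way inet_aton-style parsers expand it.
try:
    ipaddress.IPv4Address("127.1")
    shorthand_accepted = True
except ipaddress.AddressValueError:
    shorthand_accepted = False
print(shorthand_accepted)
```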

Edit: fix formatting

~~~
mschuster91
Hex works too (tested on OS X):

    
    
        ping 0x7f000001
        PING 0x7f000001 (127.0.0.1): 56 data bytes
    

or, as a URL (tested in OS X Chrome, and the URL formatter of HN also shows
it, with Chrome hover showing 127.0.0.1):
[https://0x7f000001/](https://0x7f000001/)

------
swinglock
The problem was (and remained) programmers thinking addresses are strings.

~~~
ztorkelson
Right. I was waiting for the author to get to this point, and much to my
surprise they never did.

The takeaway shouldn’t be “strings are hard”. Strings _are_ hard, but that’s
not the problem here. The problem is using inappropriate data types for the
task at hand, and the takeaway should be that representation matters.

Representing an IP endpoint as a string only really makes sense at the human-
computer interface. In general, the first thing a program should do with such
a string is convert it into a more suitable data type. And as noted elsewhere,
more often than not there will already be a library function to do so.

Strings are especially pernicious because they are ubiquitous, in that just
about anything can be represented as a string, but the operations one can do
on a string representation of an object do not generally correspond with the
operations one would want to do on the object itself. This disparity is the
source of many bugs, which the article exemplifies (though falls short of
directly addressing).

Similarly problematic are assumptions that converting an object to a string
and a string to an object (i.e. `parse` and `format` functions) are bijective.
This is not generally the case. (Wise practitioners might choose such a
mapping, though, when the opportunity presents.)
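
To make the non-bijective point concrete, here is a short Python sketch using the standard `ipaddress` module as the `parse` and `format` pair. The three spellings are examples chosen for this illustration.

```python
import ipaddress

spellings = [
    "2001:0db8:f00f:0000:0000:0553:1211:0088",
    "2001:db8:f00f::553:1211:88",
    "2001:DB8:F00F:0:0:553:1211:88",
]

# Three different strings collapse to a single value, so parsing maps
# many inputs onto one object.
parsed = {ipaddress.ip_address(s) for s in spellings}
print(len(parsed))

# format(parse(s)) therefore cannot give back s for every s ...
round_tripped = [str(ipaddress.ip_address(s)) for s in spellings]
print([s == r for s, r in zip(spellings, round_tripped)])

# ... while parse(format(x)) == x still holds for the parsed value.
value = next(iter(parsed))
print(ipaddress.ip_address(str(value)) == value)
```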

------
magicalhippo
I'm really stumped as to why they didn't eliminate ports when designing IPv6.
Having the last bits in the address take the role of ports makes sense given
the hierarchical nature of IPv6 addresses. The first bits specify the network
(my home), next bits the device, and last bits the service/endpoint on that
device.

It would also make parsing much easier given that they chose : as separator in
the addresses.

~~~
cnorthwood
Ports operate at a different layer in the OSI model - the transport layer, not
the network layer (which IP operates on). Ports are TCP and UDP concepts
(e.g., a different application can bind to TCP port 443 than UDP port 443, as
they're essentially different ports). Some protocols don't have ports at all,
e.g., ICMP, which instead has a "type".
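
A quick way to see that TCP and UDP port numbers are separate namespaces is to bind one socket of each kind to the same number. This Python sketch assumes a loopback interface is available and lets the OS pick the port; in the unlikely case that another process already holds that UDP port, the second bind would raise `OSError`.

```python
import socket

# Let the OS pick a free TCP port on the loopback interface.
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_sock.bind(("127.0.0.1", 0))
tcp_sock.listen()
tcp_port = tcp_sock.getsockname()[1]

# Bind a UDP socket to the same number. The TCP listener does not get in
# the way, because each transport protocol keeps its own set of ports.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_sock.bind(("127.0.0.1", tcp_port))
udp_port = udp_sock.getsockname()[1]

print(tcp_port == udp_port)

tcp_sock.close()
udp_sock.close()
```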

~~~
magicalhippo
Sure, but so what?

Imagine you could only bind services to their standard ports. What's the
difference between a service running on my computer having a unique ip
address, or my machine having a unique ip address and only running that
service?

~~~
cryptonector
Compatibility. Code written for layer 4 should be compatible even if layer 3
is changed. In practice much code had to be changed anyways, but the amount of
violence applied to existing code was limited by not getting rid of ports.
It's not just that either, but much else, including cognitive load.

------
z3t4
Another gotcha is that ipv4 and ipv6 have different firewall tables. So if you
for example have blocked _all_ traffic besides port 80, you _also_ need to do
it on ipv6!

~~~
yrro
pfft, who is still using iptables? :)

FYI, nftables has the 'inet' table which processes both IPv4 and IPv6 packets,
and 'ip' and 'ip6' which processes each respectively.

~~~
z3t4
iptables is easy to configure and available almost everywhere.

~~~
floatboth
easy to configure?? iptables is the most infamously difficult firewall to
configure.

Ubuntu created the ufw "uncomplicated firewall" wrapper for a reason :D

~~~
z3t4
I think it's similar to CSS... I however have a sense that IPv6 is much
different so we might get better and easier tools for future IP versions once
IPv4 is gone, I however don't expect that to happen during my lifetime :/

------
fanf2
All of this flailing around would have been avoided by either thinking
carefully about canonical representations, or reading
[https://tools.ietf.org/html/rfc5952](https://tools.ietf.org/html/rfc5952)
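
As a hedged illustration of what a canonical form looks like in practice: Python's standard `ipaddress` module prints addresses in a compressed lower-case form that, for the cases below, matches the RFC 5952 recommendations as I read them. The example addresses were picked for this sketch.

```python
import ipaddress

examples = [
    # Two zero runs of equal length: only one may become "::".
    "2001:0DB8:0000:0000:0001:0000:0000:0001",
    # A single zero field: not worth a "::".
    "2001:db8:0:1:1:1:1:1",
    # Upper-case hex digits and a "::" already present.
    "2001:DB8::ABCD",
]

canonical = [str(ipaddress.ip_address(text)) for text in examples]
for text, form in zip(examples, canonical):
    print(text, "->", form)
```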

------
signa11
fwiw, rfc-5952, specifically sec:6, outlines recommended string
representations. also, it is generally much nicer to use 'sockaddr_storage' as
an underlying representation of AF_{UNIX,INET,INET6} endpoints rather than
building abstractions to tide over differences of 'sockaddr_*'

------
contravariant
Maybe this is a stupid question at this point but if using colons introduces
this degree of ambiguity why not stick with simple periods? Or any other
separator for that matter.

------
nly
Even putting aside the ambiguity of using ':' as an address:port delimiter,
there are differences between platforms in parsing _just IPv6 addresses_

e.g. on Linux/glibc 2.28

    
    
        [cling]$ #include <arpa/inet.h>
        [cling]$ ::in6_addr dst;
        [cling]$ dst
        (::in6_addr &) @0x7ff318aff010
        [cling]$ inet_pton (AF_INET6, "1234:1234:1234:1234:1234::1234:8.8.8.8", &dst)
        (int) 0
        [cling]$ inet_pton (AF_INET6, "1234:1234:1234:1234:1234:1234:8.8.8.8", &dst)
        (int) 1
    

Here '::' in this 'full' address is (correctly) rejected.

However, although I haven't verified this on FreeBSD (perhaps someone can?),
there's a comment in the libc source suggesting that this will be accepted
there

[https://github.com/freebsd/freebsd/blob/1d6e4247415d264485ee...](https://github.com/freebsd/freebsd/blob/1d6e4247415d264485ee94b59fdbc12e0c566fd0/lib/libc/inet/inet_pton.c#L127)

Parsing sucks.

------
throwaway12iii
Obviously the point of the article is to not manually parse ip addresses.

Well actually, python can do that just as well as all the other languages.

    
    
        >>> import ipaddress
        >>> ipaddress.ip_address('2001:db8::')
        IPv6Address('2001:db8::')

~~~
throwaway12iii
Actually, in Java you can just use InetAddress.getByName.

------
pdkl95
The fundamental problem in the example:

    
    
        leader_host = bigdata.example.org:10443
    

is ":10443" is not part of the _host_ name. The field is called "leader_host";
if a port is needed, it should use its own field instead of trying to
overload the host field.

    
    
        leader_host = bigdata.example.org
        leader_port = 10443
    

(and as others have already mentioned, don't write your own parser when
they already exist in your stdlib/etc)
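
A minimal sketch of reading the two-field version with Python's standard `configparser`. The `[cluster]` section name is an assumption added only because `configparser` requires a section header; the field names are the ones from the example above.

```python
import configparser

CONFIG_TEXT = """
[cluster]
leader_host = bigdata.example.org
leader_port = 10443
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# No splitting on ':' anywhere: each value arrives already separated,
# so an IPv6 literal in leader_host would need no brackets either.
host = parser["cluster"]["leader_host"]
port = parser["cluster"].getint("leader_port")
print(host, port)
```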

------
ta57746uhhj
IPv6 was over-engineered. We need IPv5 which just fixes the address space,
then have a rethink about v7 ... without all the nonsense.

------
mrunkel
Why not split the parameters?

Address and port as separate values makes more sense to me.

~~~
JdeBP
Indeed, several mechanisms for passing this stuff to programs do exactly that.
Notice that they are two distinct command-line arguments to these programs,
for example:

* [http://jdebp.eu./Softwares/nosh/guide/tcp-socket-listen.html](http://jdebp.eu./Softwares/nosh/guide/tcp-socket-listen.html)

* [https://cr.yp.to/ucspi-tcp/tcpserver.html](https://cr.yp.to/ucspi-tcp/tcpserver.html)

One of the principles underlying this design was, famously, "Don't parse".
Don't have one string with a separator character. Have two strings that are
already separate.

* [http://cr.yp.to/qmail/guarantee.html](http://cr.yp.to/qmail/guarantee.html)
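
The same "two strings that are already separate" convention is easy to follow in a program of your own. A minimal Python sketch with `argparse`, for a hypothetical listener program (this is not how the tools linked above are implemented):

```python
import argparse
import ipaddress


def build_parser():
    parser = argparse.ArgumentParser(description="listen on an address and port")
    # Two positional arguments, so no separator ever has to be parsed out
    # of a combined "address:port" string.
    parser.add_argument("address", type=ipaddress.ip_address)
    parser.add_argument("port", type=int)
    return parser


# Equivalent to running: listener 2001:db8::1 10443
args = build_parser().parse_args(["2001:db8::1", "10443"])
print(args.address, args.port)
```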

------
johnklos
Canonicalize everything, then compare the canonicalized results. We've got
this ;)

------
peanut-walrus
':' is also a perfectly valid character in a domain name, so the naive
splitting already does not work in the IPv4 ecosystem.

~~~
tatersolid
No, a colon is _not_ legal in a domain name or host name. See RFC 1123. Only
alphanumerics and the hyphen character are allowed in DNS labels, which are
separated by full-stop characters. Non-hostnames in DNS may also allow the
underscore character as part of a DNS label.

~~~
peanut-walrus
RFC 1123 specifies the preferred format, not what is mandatory. This is
clarified in RFC 2181:

"The DNS itself places only one restriction on the particular labels that can
be used to identify resource records. That one restriction relates to the
length of the label and the full name. The length of any one label is limited
to between 1 and 63 octets. A full domain name is limited to 255 octets
(including the separators). The zero length full name is defined as
representing the root of the DNS tree, and is typically written and displayed
as ".". Those restrictions aside, any binary string whatever can be used as
the label of any resource record. Similarly, any binary string can serve as
the value of any record that includes a domain name as some or all of its
value (SOA, NS, MX, PTR, CNAME, and any others that may be added).
Implementations of the DNS protocols must not place any restrictions on the
labels that can be used."

You are correct that it can not be used in a hostname and I am not sure how it
should be encoded when used in a URI, but for a domain name it is perfectly
valid.

Indeed, I tested this on my local DNS server running Knot and I can create a
domain name with colons in it and it also resolves using dig.

~~~
tatersolid
I can’t register “foo:bar.com” on Hover, GoDaddy, or anywhere else.

There’s a difference between how DNS software is supposed to treat label data
as opaque octet strings for forward-compatibility reasons, versus the
standards for valid host and domain names enforced by the rest of the Internet
standards.

------
ggm
I've taken to using python split(':')[:-1] and like constructs

~~~
anderskaseorg
See also rsplit(':', 1). But instead of either of those, you should really be
using library functions that already understand the format in question instead
of making up your own rules.

    
    
      >>> import ipaddress, urllib.parse
      >>> u = urllib.parse.ParseResult('', '[2001:0db8:f00f::0553:1211:0088]:10443', '', '', '', '')
      >>> u.hostname
      '2001:0db8:f00f::0553:1211:0088'
      >>> ipaddress.ip_address(u.hostname)
      IPv6Address('2001:db8:f00f::553:1211:88')
      >>> u.port
      10443

~~~
ggm
Urllib? There has to be a better choice than that. I've been using inet_ntop
and pton to canonically convert bare IPs into sane forms.

~~~
detaro
What's the problem with urllib for this purpose?

------
emilfihlman
IPv6 is a clusterfuck of "hey let's add x!" and "let's forget reasonable
previous experience!".

------
thisacctforreal
Call me a luddite but I'm not looking forward to IPv6.

How about IPv5; add another byte.

We can keep the colons 192:168:1:1:1.

