Not enough detail here, but I'm surprised when I see the terms "voice" and "TCP" together. Do they use TCP to set up the call, then UDP to handle the audio?
With the traditional VoIP stacks, yes. There are two components to making phone calls: messaging and media. The messaging part sets up and tears down calls. The media part passes the audio.
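To make the messaging/media split concrete, here is a minimal loopback sketch, assuming nothing about RedPhone's actual wire format: a TCP connection carries a call-setup message (the `ACCEPT <port>` format is hypothetical), and the audio frame then travels as a UDP datagram to the port announced during setup.

```python
import socket


def demo_call() -> tuple[bytes, bytes]:
    """Set up a 'call' over TCP, then send one audio frame over UDP (loopback only)."""
    # Messaging: a reliable, ordered TCP stream carries call setup/teardown.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    caller = socket.create_connection(listener.getsockname())
    callee, _ = listener.accept()

    # Media: the callee opens a UDP socket to receive audio datagrams.
    media_rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    media_rx.bind(("127.0.0.1", 0))
    media_rx.settimeout(2.0)

    # Hypothetical setup message telling the caller where to send audio.
    callee.sendall(b"ACCEPT %d\n" % media_rx.getsockname()[1])
    with caller.makefile("rb") as reader:
        setup = reader.readline().strip()
    media_port = int(setup.split()[1])

    # One 20 ms frame of 8 kHz, 8-bit audio (e.g. G.711) is 160 bytes.
    media_tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    media_tx.sendto(bytes(160), ("127.0.0.1", media_port))
    received, _ = media_rx.recvfrom(2048)

    # Teardown would also go over the TCP messaging channel.
    for s in (media_tx, media_rx, caller, callee, listener):
        s.close()
    return setup, received
```

Calling `setup, audio = demo_call()` yields the setup line and the 160-byte frame. The design point is that a lost setup message must be retransmitted, while a late audio frame is better dropped than delivered.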
But typically you use something with at least some intelligent behaviour, like RTP, rather than raw UDP. Why did they choose that when literally every other pure VoIP service uses RTP?
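RTP is itself normally carried over UDP; what it adds is a small header with a sequence number, a media timestamp, and a stream identifier (SSRC). The sketch below packs and parses the 12-byte fixed header from RFC 3550, section 5.1, with no padding, extension, or CSRC list. It is illustrative only, not RedPhone's format; `missing_seqs` is a hypothetical helper that ignores 16-bit wraparound.

```python
import struct

RTP_VERSION = 2


def pack_rtp_header(seq: int, timestamp: int, ssrc: int,
                    payload_type: int = 0, marker: bool = False) -> bytes:
    """Pack the RTP fixed header: V=2, P=0, X=0, CC=0."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(data: bytes) -> dict:
    """Parse the first 12 bytes of an RTP packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {"version": byte0 >> 6, "marker": bool(byte1 >> 7),
            "payload_type": byte1 & 0x7F, "seq": seq,
            "timestamp": timestamp, "ssrc": ssrc}


def missing_seqs(received: list[int]) -> list[int]:
    """Sequence numbers absent between the lowest and highest received."""
    have = set(received)
    return [s for s in range(min(received), max(received) + 1) if s not in have]
```

Payload type 0 is PCMU (G.711 mu-law) in RFC 3551; at an 8 kHz clock, each 20 ms frame advances the timestamp by 160. A receiver uses the sequence number to detect loss and reordering and the timestamp to schedule playout, which is the "intelligent behaviour" raw UDP lacks.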
The next problem will come when they have a large number of servers in a region.
Then two problems will manifest themselves.
1) The client will take long enough to work through the list of servers that the wrong server is chosen simply because its connection was initiated first.
2) This design has every server seeing every call. The load from all those TCP connections hitting every server will consume an ever-larger share of each server's capacity.
Neither is a problem you can solve by adding more servers. In fact, they are made worse by adding servers!
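Both problems can be seen in a toy model. It assumes (hypothetically) that the client starts one connection every `initiation_gap_ms` in list order and that a server "wins" when its reply arrives first; the RTT figures are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    rtt_ms: float


def race_winner(servers: list[Server], initiation_gap_ms: float) -> str:
    """Server whose reply arrives first when connections start sequentially."""
    arrival = {s.name: i * initiation_gap_ms + s.rtt_ms
               for i, s in enumerate(servers)}
    return min(arrival, key=arrival.get)


def truly_fastest(servers: list[Server]) -> str:
    """Server with the lowest round-trip time, ignoring start order."""
    return min(servers, key=lambda s: s.rtt_ms).name


def connection_attempts(calls: int, num_servers: int) -> tuple[int, int]:
    """(total attempts, attempts per server) when every call hits every server."""
    total = calls * num_servers
    per_server = total // num_servers  # equals calls: more servers don't reduce it
    return total, per_server


# 49 slow servers listed ahead of one fast server.
region = [Server(f"s{i}", 80.0) for i in range(49)] + [Server("s49", 20.0)]
```

With a 2 ms gap, `race_winner(region, 2.0)` returns `"s0"` even though `truly_fastest(region)` is `"s49"`, while a two-server list picks the fast one. Meanwhile `connection_attempts(100, 10)` and `connection_attempts(100, 20)` both report 100 attempts per server: doubling the servers doubles the total work without relieving any single server.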
However, if you're charging per call, that's what they call a "good problem to have".
Agreed. We haven't run into this yet (and aren't charging at all; this is an OSS project), but I think that would be the point where you have to do one of two things:
1) Architect your DNS response to include a small sample of the total result set for the region, where the sample includes at least one switch from each micro-region.
2) Break down and introduce load balancers, so that there's one load balancer per micro-region, fronting all of the switches within that micro-region.
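Option 1 amounts to a stratified sample. Here is a minimal sketch, assuming a hypothetical inventory that maps each switch hostname to its micro-region: pick one switch per micro-region, then top up at random to the desired answer size. How this would be wired into a real DNS server is left out.

```python
import random
from collections import defaultdict


def sample_switches(switches: dict[str, str], sample_size: int,
                    rng: random.Random) -> list[str]:
    """Return at least one switch per micro-region, padded to sample_size.

    If there are more micro-regions than sample_size, coverage wins and
    the result has one switch per micro-region.
    """
    by_region: dict[str, list[str]] = defaultdict(list)
    for host, region in switches.items():
        by_region[region].append(host)

    # Guaranteed pick: one random switch from each micro-region.
    picked = [rng.choice(sorted(hosts)) for _, hosts in sorted(by_region.items())]

    # Top up with distinct switches not already chosen.
    remaining = sorted(set(switches) - set(picked))
    extra = max(0, sample_size - len(picked))
    picked += rng.sample(remaining, min(extra, len(remaining)))
    return picked
```

Each DNS answer stays small no matter how many switches the region holds, which keeps the client's connection race short while still giving it a candidate near every micro-region.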
Fortunately an individual switch doesn't really do much (just shovel packets around), so the number of simultaneous calls a switch can handle is high enough that redundancy is more about availability than load.
The server architecture is nice but my favorite part is the simple signaling protocol, as opposed to SIP or any other overly complex (for this use case) telephony signaling protocol. Nice work.
I'm confused (which is not too surprising, as I am no expert on this). Why can't they hole-punch through NATs? Then they would avoid the extra bounce-through-server latency and hugely reduce the load on the servers. I thought that was how Skype worked, for example.
(I realise this doesn't stop their "fastest first" idea being useful; I got kind of sidetracked by the explanation of how servers are used near the start of the post.)
Your NAT traversal strategy is limited by the type of NAT being employed. Traditionally, the worst case scenario for NAT traversal is "symmetric NAT."
Symmetric NAT means that each tuple of (source_ip, source_port, destination_ip, destination_port) gets its own unique (external_ip, external_port) tuple. This is bad because it means that STUN is ineffective: your STUN server will see a different external port than your actual destination would.
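A toy NAT model shows why. It is a simplification, not any vendor's behaviour: with endpoint-independent mapping the external port depends only on the internal socket, so the port a STUN server observes is the one a peer will see; with symmetric mapping each destination gets a fresh port. Addresses come from the RFC 5737 documentation ranges; 3478 is the default STUN port.

```python
import itertools


class Nat:
    """Toy NAT: symmetric=True allocates a new external port per destination."""

    def __init__(self, external_ip: str, symmetric: bool):
        self.external_ip = external_ip
        self.symmetric = symmetric
        self._ports = itertools.count(40000)  # next free external port
        self._table: dict[tuple, int] = {}

    def map(self, src_ip: str, src_port: int,
            dst_ip: str, dst_port: int) -> tuple[str, int]:
        # The mapping key is what distinguishes the two NAT types.
        if self.symmetric:
            key = (src_ip, src_port, dst_ip, dst_port)
        else:
            key = (src_ip, src_port)
        if key not in self._table:
            self._table[key] = next(self._ports)
        return self.external_ip, self._table[key]


def stun_prediction_works(nat: Nat) -> bool:
    """Does the address seen by a STUN server match the one a peer sees?"""
    src = ("192.168.1.10", 5000)
    seen_by_stun = nat.map(*src, "198.51.100.1", 3478)
    seen_by_peer = nat.map(*src, "203.0.113.7", 6000)
    return seen_by_stun == seen_by_peer
```

`stun_prediction_works(Nat("192.0.2.50", symmetric=False))` is `True`, but with `symmetric=True` it is `False`: the address advertised to the peer is already stale.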
Mobile data networks are actually worse than symmetric NAT. Not only will your external port change based on your target, but your external IP likely will as well.
This makes NAT traversal, AFAIK, basically impossible.
And there's no real reason to do it either, given the availability of global IPv6 addresses and any of the free tunnel brokers/Teredo/etc., other than the simple lack of adoption/momentum. Most ISPs will need a push, but from an infrastructure perspective it just boils down to a new gateway router or a firmware upgrade. It doesn't even require replacing end-user equipment if ISPs simply set up a tunnel server for their own customers as a transitional measure.
Edit: Yep, they use UDP for the audio: https://github.com/WhisperSystems/RedPhone/wiki/Signaling-Pr...