It's not about size, it's about rate and introducing latency. Just the hijack itself is going to add DNS latency, which is monitored by any competent operations team. Expert operations teams, and I know of one, also monitor the BGP path to their public addresses (including nameservers) to detect things like the Youtube kerfluffle.
Adding a conditional ("do I answer or do I proxy?") on every DNS query -- and there are many -- is going to introduce enough latency to be noticed unless you throw a lot of gear at it. And you're still going to introduce latency by inserting another hop. That's my point, though I do agree with you.
>Adding a conditional ("do I answer or do I proxy?") on every DNS query -- and there are many -- is going to introduce enough latency to be noticed unless you throw a lot of gear at it. And you're still going to introduce latency by inserting another hop. That's my point, though I do agree with you.
Welcome to the world of recursive name servers, there is a lot of software out there that does exactly what you just mentioned, I fail to see what would be hard about making this change.
Adding a conditional ("do I answer or do I proxy?") on every DNS query -- and there are many -- is going to introduce enough latency to be noticed unless you throw a lot of gear at it.
Hm, why? Any modern CPU is blazingly fast. Writing it in Ruby probably wouldn't be smart, but Python + PyPI or Lua + LuaJIT would easily get within a factor of 10x of C.
I didn't say it would be technically impossible, I said it would be noticed. If you make it a theoretical problem, and it most certainly isn't (there are a lot more practicalities involved), you're adding at least another string compare to every query. That's enough of a latency shift for me to notice in my graphs -- I notice when the Internet reroutes itself and my DNS latency goes up by 5 milliseconds.
This isn't a "could it be done?" exercise, it's more of a "could it be done without detection?" exercise. For this specific case, it's a pretty big risk.
I wasn't speaking theoretically. I don't understand how a pipe read + string compare + pipe write would add 5ms per query.
As for detection, that was the reason I brought up CPU power. Modern CPUs are so fast that that it seems like this redirector would hardly generate a blip in any chart (such as top).
I don't care about proving anybody wrong. I care about filling my knowledge gaps. I.e. it's interesting to try to figure out how something like this would be detected in practice.
You'd be using network sockets, not pipes (pipes are slow as fuck btw). And it would add the latency of the network transmission in both directions, plus the processing time, which would add up to much more than 5ms unless you're on the same network segment as your target. And higher CPU load increases latency.
Who is going to notice increased latency in DNS queries? Most likely web developers. Nobody else I can think of would do (non-cached) bulk DNS queries to random domains and actually be looking for millisecond changes in lookup time. And those developers would have no insight to the DNS infrastructure serving requests, so they'd have no idea to contact the DNS admins to investigate. Even the DNS admins could be fooled before they contact network admins to do further research.
The bottom line is not "has DNS latency changed?", it's "has DNS latency become unacceptably high enough to force me investigate?" Unless it's becoming a problem, I think anyone would ignore increased latency because they have ten other work tasks to deal with.
> Unless it's becoming a problem, I think anyone would ignore increased latency because they have ten other work tasks to deal with.
You'd be surprised once you start working with larger, higher-traffic infrastructures. If our average external DNS query rises 200ms, my phone goes off. There's more slack on p99, but it's also monitored.
All of the timings for the various parts of a request to the system that I administer are instrumented from a small libcurl app running in multiple ASNs remotely, because Pingdom and other services do not provide the resolution that we need. They are then rendered on a stacked graph that always lives on my third monitor, and any significant deviation averaged out over five minutes catches my eye.
I know it sounds like overkill, but it's crucial at scale.
Okay, so, when you notice your DNS latency going up by 5ms... how much investigation do you then do to confirm exactly what caused this, and have a very high confidence (how high?) of ruling out it being caused by a MitM on the DNS? Really?
Without getting too far into specific operational security -- the same reason that I hate there's an entire branch off my thread discussing this specific attack, which I think is detrimental to the discussion -- we have monitoring in place to tell me if this exact attack happens. Within seconds. The latency would just be a clue.
Think about the dumbest way you would do that. Then implement it. That's how simple our system is.