A good takeaway from this outage for the average user would be to make sure that your fallback DNS resolvers are operated by totally separate providers (e.g., configure 1.1.1.1 with 9.9.9.9 as a fallback, rather than 1.1.1.1 and 1.0.0.1). (Edit: fixed Cloudflare's secondary address)
DNS fallback is often also misunderstood to mean "fall back if the domain is not found", but it really means "fall back if the name server fails to respond". If the domain is not found (i.e. the server returns a valid NXDOMAIN response), most resolvers do not consult any other name servers.
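To make the distinction concrete, here is a minimal sketch using the dnspython package (an assumption on my part; any stub-resolver library behaves similarly). The nameserver addresses are just examples:

    # Sketch: NXDOMAIN is a valid answer and ends the lookup; only a
    # non-responding server causes the resolver to try the next address.
    import dns.exception
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["1.1.1.1", "9.9.9.9"]  # separate providers
    resolver.timeout = 2    # per-server timeout, seconds
    resolver.lifetime = 5   # total budget across all servers

    try:
        answer = resolver.resolve("some-name.example", "A")
        print([rr.address for rr in answer])
    except dns.resolver.NXDOMAIN:
        # The server answered "no such domain" -- no fallback happens.
        print("NXDOMAIN: the lookup is over")
    except dns.exception.Timeout:
        # Only here, when a server fails to respond, does the resolver
        # move on to 9.9.9.9 after 1.1.1.1.
        print("all servers timed out")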
Round-robin, cascading name lookups, and other frequently desired functionality can be obtained through dnsmasq or similar caching/forwarding name servers.
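For instance, a minimal dnsmasq configuration along these lines (a sketch; the option names come from dnsmasq's documentation, and the upstream addresses are illustrative):

    # /etc/dnsmasq.conf (sketch)
    # Ignore /etc/resolv.conf and use only the servers listed here.
    no-resolv
    server=1.1.1.1
    server=9.9.9.9
    # Query every upstream at once and keep the first answer.
    all-servers
    # Cache up to 1000 entries, honoring each record's TTL.
    cache-size=1000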
options timeout:2 retries:2 max-inflight:768 max-timeouts:100 single-request-reopen rotate
PS: someone here mentioned that this behavior is OS-dependent. Nope: it happens at the router level, and all devices in my apartment suffer.
My router uses a DNS resolver internally, and it will spread-cast to multiple DNS servers and use the quickest response it can get. It also caches using the TTL in the DNS response, and so it will serve up cached records transparently.
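Something similar can be sketched in a few lines of Python with dnspython (assumed); the server list is illustrative and the error handling is minimal:

    # Sketch: "spread-cast" to several resolvers, keep the quickest
    # answer, and cache it until its TTL expires.
    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import dns.message
    import dns.query

    SERVERS = ["1.1.1.1", "9.9.9.9", "8.8.8.8"]
    _cache = {}  # qname -> (expiry timestamp, response)

    def resolve(name):
        hit = _cache.get(name)
        if hit and hit[0] > time.time():
            return hit[1]  # served transparently from cache

        query = dns.message.make_query(name, "A")
        with ThreadPoolExecutor(len(SERVERS)) as pool:
            futures = [pool.submit(dns.query.udp, query, s, timeout=2)
                       for s in SERVERS]
            for fut in as_completed(futures):  # quickest response wins
                try:
                    resp = fut.result()
                except Exception:
                    continue  # that server failed; wait for another
                ttl = min((rrset.ttl for rrset in resp.answer), default=0)
                _cache[name] = (time.time() + ttl, resp)
                return resp  # note: pool shutdown waits for stragglers
        raise RuntimeError("no server answered")

    print(resolve("example.com").answer)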
Secure IP: 9.9.9.9 Provides: Security blocklist, DNSSEC, No EDNS Client-Subnet sent. If your DNS software requires a Secondary IP address, please use the secure secondary address of 149.112.112.112
Unsecured IP: 9.9.9.10 Provides: No security blocklist, DNSSEC, sends EDNS Client-Subnet. If your DNS software requires a Secondary IP address, please use the unsecured secondary address of 149.112.112.10
Client-Subnet lets providers with widely-distributed servers pick one that's near you to serve your content.
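For the curious, this is what attaching that option looks like with dnspython (assumed); the /24 prefix is illustrative, and per the listing above it is the unsecured 9.9.9.10 service that forwards ECS while 9.9.9.9 strips it:

    # Sketch: add an EDNS Client-Subnet option to a query.
    import dns.edns
    import dns.message
    import dns.query

    ecs = dns.edns.ECSOption("203.0.113.0", 24)  # client's rough location
    query = dns.message.make_query("example.com", "A",
                                   use_edns=0, options=[ecs])
    response = dns.query.udp(query, "9.9.9.10", timeout=2)
    print(response.answer)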
Seems like a fair deal to me. I get free DNS service, and companies sponsoring this program get metrics on threats and general Internet usage. I'm a little skeptical of their claims that individuals can't be identified from their anonymized data. E.g. I probably only get one or two hits on my personal website every week, so it might not be hard for a malicious employee to deanonymize visitors to my site.
Some highlights from the policy:
> Many nations classify IP addresses as Personally-Identifiable Information (PII), and we take a conservative approach in treating IP addresses as PII in all jurisdictions in which our systems reside. Our normal course of data management does not have any IP address information or other PII logged to disk or transmitted out of the location in which the query was received. We may aggregate certain counters to larger network block levels for statistical collection purposes, but those counters do not maintain specific IP address data nor is the format or model of data stored capable of being reverse-engineered to ascertain what specific IP addresses made what queries.
> There are exceptions to this storage model: In the event of events or observed behaviors which we deem malicious or anomalous, we may utilize more detailed logging to collect more specific IP address data in the process of normal network defense and mitigation. This collection and transmission off-site will be limited to IP addresses that we determine are involved in the event.
> We do not correlate or combine information from our logs with any personal information that you have provided Quad9 for other services, or with your specific IP address.
> Quad9 DNS Services generate and share high level anonymized aggregate statistics including threat metrics on threat type, geolocation, and if available, sector, as well as other vertical metrics including performance metrics on the Quad9 DNS Services (i.e. number of threats blocked, infrastructure uptime) when available with the Quad9 threat intelligence (TI) partners, academic researchers, or the public.
> Quad9 DNS Services share anonymized data on specific domains queried (records such as domain, timestamp, geolocation, number of hits, first seen, last seen) with its threat intelligence partners. Quad9 DNS Services also builds, stores, and may share certain DNS data streams which store high level information about domain resolved, query types, result codes, and timestamp. These streams do not contain IP address information of requestor and cannot be correlated to IP address or other PII.
> Quad9 does not track visitors over time and across third-party websites, and therefore does not respond to Do Not Track signaling.
With HTTPS, whoever you end up contacting needs to cough up a valid certificate for the domain in the URL. I run HTTPS Everywhere to try to get that protection as often as possible. In practice there are still ways that DNS tricks can cause trouble, but they are not as bad as you might think, and browsers are slowly pushing an HTTPS-only web (I hear Chrome will soon start marking all HTTP sites as "insecure" rather than HTTPS sites as secure). SSH has its own authentication method, and I do try to verify new hosts via another secure channel.
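The certificate check is easy to see from code. A minimal sketch with Python's standard library (the host name is illustrative): even if DNS steered the connection to the wrong machine, the handshake below fails, because the attacker can't present a valid certificate for the name in the URL.

    import socket
    import ssl

    context = ssl.create_default_context()  # verifies chain AND host name

    with socket.create_connection(("example.com", 443), timeout=5) as sock:
        # Raises ssl.SSLCertVerificationError if the server at the
        # resolved address can't prove it is example.com.
        with context.wrap_socket(sock, server_hostname="example.com") as tls:
            print(tls.getpeercert()["subject"])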
Speaking of not trusting companies, I am reminded that at one point I noticed that CenturyLink seems to be intercepting all DNS traffic no matter the intended destination, so without either a secure connection past the ISP or maybe a nonstandard port, it may not matter what DNS server you try to use. Hopefully all ISPs that do this also do the horrible trick of redirecting invalid domains, so attempting an HTTP connection to an invalid domain might reveal whether this is the case (I found it while trying some of the nonstandard domains that OpenNIC resolves).
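One crude way to test for that kind of interception, sketched with dnspython (assumed): query an address that should not be running a resolver at all (192.0.2.1 is a TEST-NET address, used here purely as an illustration) and see whether something answers anyway.

    import dns.exception
    import dns.message
    import dns.query

    query = dns.message.make_query("example.com", "A")
    try:
        # No resolver lives at this address, so a timeout is the honest
        # outcome...
        response = dns.query.udp(query, "192.0.2.1", timeout=3)
        # ...an answer means something in the path is hijacking port 53.
        print("interception suspected:", response.answer)
    except dns.exception.Timeout:
        print("no answer; port 53 does not appear to be intercepted")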
I can’t even use that address with the ISP Alestra in Mexico.
What does this code sample have to do with FRP? This code seems extremely trivial and doesn't give any real indication to me why you'd need a framework of any sort. It seems like they really want to emphasize that they use FRP, but this code just seems completely unrelated.
One thing I’d be interested to know more about is why it took 17 minutes to fix. While you can and always should strive to make them less likely, outages are inevitable, so how you respond is crucial. Here the outage was very obviously caused by a deployment that I’d assume was supervised by humans – why did it take 17 minutes to roll back?
Especially when you consider that they are getting DoS attacks every 2-3 minutes, so all deploys are going out into a hectic world, and the dots maybe aren't that easy to connect under those circumstances.
- shit is not working
- is this an attack?
- no it's us
- that's how
- let's go back
- have to get supervisor
- roll back huge thing
really that long?
- monitoring system picks up irregularity (smoothed over some window of time, which delays alerting)
- alert propagates to humans
- humans may take time to notice alert (even a page takes a few seconds to read)
- humans make decisions, may need to talk to other humans (all of what you said above)
- humans evaluate correct procedure, double-check it (you don't want them making the wrong "fix" and making something else worse, do you?)
- humans execute commands
- commands take time to run on large collections of computers (running them completely in parallel can cause thundering herd issues, in some cases)
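A toy sketch of those last two points (the run_on function is a hypothetical stand-in for whatever actually executes the command):

    # Stagger execution in batches with per-host jitter so a large
    # fleet doesn't act in lockstep (thundering herd).
    import random
    import time

    def run_on(host, command):
        print(f"{host}: {command}")  # stand-in for real execution

    def staggered_rollout(hosts, command, batch_size=100, pause_s=5.0):
        for i in range(0, len(hosts), batch_size):
            for host in hosts[i:i + batch_size]:
                time.sleep(random.uniform(0.0, 0.05))  # per-host jitter
                run_on(host, command)
            time.sleep(pause_s)  # let each batch settle before the next

    staggered_rollout([f"edge-{n}" for n in range(250)], "apply-config")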
Times have changed when you'd freak out getting up in the morning and finding that google.com did not work.
Automated processes can only mitigate so many edge cases. Even then, humans need to be involved, and that slows things down.
Another example - I can record a video and broadcast internationally, translated on the fly into dozens of languages, effectively communicating with significantly more people than if I could not harness that technical capability. In 2000 that communication process would have been orders of magnitude longer. (Did they have on-the-wire translation then? Idk, just making an assumption to illustrate my point.)
> Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.
> What we did not account for, and what Provision API didn’t know about, was that 1.1.1.0/24 and 1.0.0.0/24 are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.
> As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!
When porting legacy code, it is not only important to understand the edge cases and technical debt built up over time, but also to test more heavily in production: you never know if you got them all, because some smart guy built them long ago and/or there are unknown hacks that were cornerstones of the system, for better or worse.
Phased and alpha/beta rollouts, in an almost A/B-testing way, are good for replacement systems. Version 2 systems can also add new attack vectors or other single points of failure that aren't as well known as the legacy problems; the Provision API seems like a candidate for that.
Over time the Version 2 system will be hardened, just before it is EOL'd and replaced again to fix all the new problems that have arisen. Version 2s do innovate, but they also trade old issues and pain points for new, unknown problems.
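A phased rollout gate can be as simple as a deterministic hash bucket. A toy sketch, not how Cloudflare actually gates deployments (identifiers and percentages are illustrative):

    # Deterministically route a fixed percentage of identifiers to the
    # new (Version 2) code path; the rest stay on the legacy path.
    import hashlib

    def in_rollout(identifier, percent):
        digest = hashlib.sha256(identifier.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 10000
        return bucket < percent * 100  # e.g. 5.0 -> 500/10000 = 5%

    for ip_range in ["203.0.113.0/24", "198.51.100.0/24"]:
        path = "v2" if in_rollout(ip_range, 5.0) else "legacy"
        print(ip_range, "->", path)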
“Each of these bugs took weeks of real-world usage before they were found.”
— Things You Should Never Do, Part I (https://www.joelonsoftware.com/2000/04/06/things-you-should-...)
That Win 95 fix? Doesn't matter, because Win 95 was auto-upgraded to be unrecognizable (SaaS is a wonderful thing).
Low memory? Not our problem - go buy another computer. Remember, programmer time is more valuable.
You can try and buck the trend, but your dependencies won't, and the customer doesn't care whose code the bug is in. You won't get any brownie points, so you might as well just save the effort.
It's a brave new world.
I wonder if it would be possible to express the idea that if a block being applied drops traffic well below expected levels, it must be a mistake?
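It can be expressed as a simple guard. A hypothetical sketch — apply_block, rollback_block, and current_traffic_rps are invented stand-ins, not any real Cloudflare API:

    import time

    MAX_EXPECTED_DROP = 0.30  # one block should never cut >30% of traffic

    def current_traffic_rps():
        return 1000.0  # stand-in: would read a live metric

    def apply_block(rule):
        pass  # stand-in: would push the filter rule

    def rollback_block(rule):
        pass  # stand-in: would remove the filter rule

    def guarded_block(rule):
        baseline = current_traffic_rps()
        apply_block(rule)
        time.sleep(30)  # let the rule take effect and metrics settle
        drop = 1.0 - current_traffic_rps() / baseline
        if drop > MAX_EXPECTED_DROP:
            # Traffic fell far below expected levels: assume the block
            # is a mistake and undo it before paging a human.
            rollback_block(rule)
            return False
        return True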
It has got nothing to do with packet filtering per se.
I fucking love these guys
But works with 1.0.0.1
Quad9's goal seems to be about threat-detection and prevention.
It would be nice if you could have both simultaneously (and maybe you can), but at the moment both services are actually quite different.
The only way you get that in a more generic sense is if a specialist DNS provider started up that provided resolution for all the existing CDNs (or they all agreed to some type of federated standard that let them share the same anycast IP addresses for recursive DNS resolution).
It's amazing how fast it is.
PS: sorry for hijacking the thread.
In the meantime, have you tried adding other addresses such as 1.0.0.1, 8.8.8.8, or 9.9.9.9 to see whether the fallback works?