
Show HN: A DNS server that removes the top million domains - taxonomyman
http://millionshort.com/dns.html
======
samwillis
This is by the same guys as the million short search engine (Google minus the
top million). Probably good to use this in combination with the dns to find
things that are not just broken links.

<http://www.millionshort.com/>

The search engine was discussed on HN before:
<http://news.ycombinator.com/item?id=3910304>

~~~
znowi
Yes, it's not very useful without the search engine. Which I have tried just
now and the experience was... frustrating.

Why? Cause it did not return any results for any of my queries (e.g. "hello").
I thought, no, it can't be broken - must be something on my end. I opened it
in Chrome incognito window and it worked! Aha, "location based", I thought.
And I was right - they use IP location by default to localize results.

I know it is a common practice now, sadly, popularized by Google, et al - but
it sucks! I deal with this each time I travel. Can you, please, prioritize
this and first look into my actual request header, which explicitly says that
I prefer response in English? Thank you very much.
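
A minimal sketch of that priority order, assuming the server keeps a set of supported languages and only falls back to an IP-derived default when the Accept-Language header yields nothing usable. The function names and the `ip_default` parameter are made up for illustration; this is not how Million Short actually does it.

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language value into tags sorted by descending q-value."""
    prefs = []
    for position, part in enumerate(accept_language.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        q = 1.0  # a tag without an explicit q-value has weight 1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        if q > 0:
            # Sort by weight first, then by original position in the header.
            prefs.append((-q, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(prefs)]


def choose_language(accept_language, supported, ip_default):
    """Prefer what the browser asked for; use the IP-based guess only as a fallback."""
    for tag in preferred_languages(accept_language or ""):
        if tag == "*":
            continue
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # e.g. "en-us" -> "en"
        if primary in supported:
            return primary
    return ip_default
```

With a header of "en-US,en;q=0.9" and supported languages {"en", "de"}, this picks "en" even when the IP-based default is "de".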

~~~
furyofantares
Why would an incognito window prevent them from customizing results based on
your IP?

------
dumbfounder
If millionshort.com made it into the top 1 million domains would it cease to
exist?

~~~
robotmlg
Does the set of all sets that don't contain themselves contain itself?

------
bmmayer1
Why is this useful in any way, shape or form?

~~~
ronnier
To get on HackerNews.

~~~
bmmayer1
Which is a top 10k site...

------
gojomo
A great refinement would be: on the error page, suggest alternate sites with
similar content that are still reachable.

Or even: for the exact URL visited, suggest the one page in the remaining long
tail that's most like (by some text/semantic measure) the originally-requested
page. (Or even: redirect automatically to that page.)

------
s353
If you use these servers you may see a lot less advertising. That's because
more than a few of the top 100/1000/10000/1000000 sites are actually just ad
servers, assuming Million Short is using Alexa as the source. And because they
appear in the top Alexa list one might guess those particular ad servers serve
a significant share of the internet's advertising.

Another thought is you could potentially use these as general purpose DNS
servers; e.g. they are all Amazon EC2 I believe so with respect to the DNS-
based geolocation efforts of many websites, you'd be treated as if coming from
the location of whatever region the datacenter is in. Just add the top
100/1000/10000/1000000 sites to your HOSTS file.

~~~
andrewcooke
_so with respect to the DNS-based geolocation efforts of many websites, you'd
be treated as if coming from the location of whatever region the datacenter is
in_

wut? how does dns based geolocation work? you seem to be saying that sites
assume you share the physical location of your dns server?

~~~
s353
No. What sites assume is that you are located (at least in a regional sense)
near the (recursive) DNS servers you use. For example, that's how many CDNs
work.

Note: It's certainly possible to share the exact same location (or interface,
to be more precise) as your DNS server. I run my own personal DNS cache on
localhost. It's not unheard of. I'd guess there would be a few other readers
of HN who do this as well.

~~~
gojomo
I think that's wrong: there's no way for a site to know what DNS servers I
use. Instead, they use a reverse lookup from the apparent IP I'm connecting
from... that is available to them, and is unrelated to my DNS servers.

Or can you supply a reference/explanation for how they'd know my DNS
servers?

~~~
dsl
Run 'dig +short whoami.ultradns.net' in your terminal. You'll get back the IP
of the DNS server you are using.

Your ISP's recursive DNS servers send off a query to the site's authoritative
servers, which in turn look at the source IP address. That's how they know.
(Source: I've built a few CDNs)

~~~
gojomo
Sure, but that only applies to the CDNs who have been careful to send different
answers to different places, for sites relying heavily on such CDNs.

A standalone (single-IP) site not using a CDN, or even a site that uses a CDN
solely for bulky static assets, has no direct way to query what DNS servers a
client used, other than the fact that those servers resolved the request Host
to the listening IP. (Perhaps it could probe by attempting a number of
resource loads from hostnames that resolve differently based on different
major DNS sources, but that'd be obtrusive and require constant maintenance.)

Especially in the 'long tail' (of not-top-1-million-sites), I'd expect the
non-CDN or CDN-only-for-big-assets setup to predominate, and so any geographic
adaptation would be more likely based on IP lookups (via a database like from
MaxMind), rather than CDN inference.

Or is there some other way even static-asset CDNs somehow communicate their
geography-sensing back to primary sites?
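
A rough sketch of the IP-lookup approach, assuming a tiny hand-made prefix table standing in for a real database such as MaxMind's. The table contents, region codes, and the `region_for` name are invented for illustration, and the addresses come from the reserved documentation ranges.

```python
import ipaddress

# Hypothetical prefix-to-region table; a real database holds many more prefixes.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
    (ipaddress.ip_network("198.51.100.128/25"), "FR"),
]


def region_for(client_ip, table=GEO_TABLE):
    """Return the region of the most specific prefix containing client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    matches = [(net, region) for net, region in table if addr in net]
    if not matches:
        return None
    # Longest prefix wins, so a /25 overrides the /24 that contains it.
    return max(matches, key=lambda match: match[0].prefixlen)[1]
```

The lookup only needs the connecting IP, which the web server always sees; no DNS server is involved.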

~~~
dsl
I'm not sure I fully understand your question.

A "standalone" site can get the IP address of the user's DNS server by doing an
AJAX request to <http://[random].ip.yourdomain.com/>. Your DNS server responds
to requests for *.ip.yourdomain.com with the IP of your webserver and stores
the requesting IP address in a database using [random] as its key. Finally a
script on your website fetches the IP from the database when it gets the
request and prints it out wrapped in a cute little JSON wrapper. You can see
an example of this at <http://entropy.dns-oarc.net/test/>
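
The flow above can be sketched as an in-memory simulation. This is a sketch only: no real DNS or HTTP server is involved, the class and method names are hypothetical, and a dict stands in for the database.

```python
import secrets


class ResolverProbe:
    """Simulates the authoritative DNS side and the web side sharing one store."""

    def __init__(self, web_server_ip, zone="ip.yourdomain.com"):
        self.web_server_ip = web_server_ip
        self.zone = zone
        self.seen = {}  # random token -> source IP of the DNS query

    def new_hostname(self):
        """Hostname the page would request; the random label defeats DNS caching."""
        return f"{secrets.token_hex(8)}.{self.zone}"

    def handle_dns_query(self, qname, source_ip):
        """DNS side: record who asked, then answer with the web server's IP."""
        name = qname.lower().rstrip(".")
        suffix = "." + self.zone
        if not name.endswith(suffix):
            return None  # not in our wildcard zone
        token = name[: -len(suffix)]
        self.seen[token] = source_ip
        return self.web_server_ip

    def handle_http_request(self, host_header):
        """Web side: look up the resolver IP stored under the token in Host."""
        token = host_header.lower().split(":")[0].split(".")[0]
        return {"resolver_ip": self.seen.get(token)}
```

The source IP recorded here belongs to whichever recursive resolver contacted the authoritative server, not necessarily the one configured on the visitor's machine.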

~~~
gojomo
Clever, but it seems to me that might only coarsely reveal some global service
my DNS server falls back to, NOT the server my local machine consults first.

Is this technique, including running your own authoritative DNS server and
remembering every unique lookup, commonly used to geolocalize individual web
visitors? Or do servers more often just look up the originating IP? My
conjecture is that the latter dominates.

~~~
dsl
You said originally "there's no way for a site to know what DNS servers I
use." I proved that is false.

Is it used to geolocate users? No. Is it used to route traffic in most major
CDNs? Yes. The two are completely different use cases.

I think this is getting way out of scope for HN. If you are still curious how
this stuff works I can email you directly if you'd like.

~~~
gojomo
OK, a typical website acting alone can't know what DNS servers my local
machine is configured to contact, and furthermore doesn't use such DNS sensing
to geo-localize its content (the claim I was responding to).

But, with the technique you've described, a website coordinating with a DNS
server can probe to learn one of the DNS servers that gets consulted (directly
or indirectly) by my machine. Got it. Neat and useful trick.

------
petercooper
You can get into the top million on Alexa with a minuscule amount of traffic
so you'd be extremely limited. Losing the top 1000 would probably be a more
interesting experiment for mid/long term purposes.

~~~
cfn
They also have that (and 100k, 10k and 100).

------
ChikkaChiChi
What I think would be more interesting is a proxy that only uses the first 1k,
100k, 1m sites.

I might be wrong, but it might be an easy way to keep users on the "bright
streets" of the Internet instead of wandering down malware-ridden alleys.

------
measure2xcut1x
So what's the criteria for removal? I.e. how does a domain get in the top 1m?

~~~
rb2k_
they probably just grab the alexa top 1 million csv file that they provide.

------
garretruh
I can only imagine malicious uses for this. "Sorry, you're no longer allowed
to access Google, Facebook, Twitter, or Wikipedia." Not that that is entirely
a bad thing.

~~~
stephengillie
Someone will use it for one of their anti-distraction productivity tools.

Is changing DNS easier or more difficult than editing a HOSTS file?

~~~
ikawe
edit your /etc/resolv.conf

nameserver 1.2.3.4

------
dumbfounder
How did they do their ranking? Is it based on a web crawl, dns stats, other?
Is their list of the top million domains public? I would love to see the data.

~~~
dumbfounder
I haven't seen confirmation, but they probably use Alexa given they make it
easy to download their top 1 million list:

<http://s3.amazonaws.com/alexa-static/top-1m.csv.zip>
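
If that guess is right, the filtering step could look roughly like this. It is a sketch assuming the unzipped CSV has headerless `rank,domain` rows and that a subdomain counts as removed when any parent domain is on the list; both are assumptions, and the function names are made up.

```python
import csv
import io


def load_top_domains(csv_text, top_n):
    """Collect the domains ranked 1..top_n from 'rank,domain' CSV text."""
    top = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        rank, domain = int(row[0]), row[1].strip().lower()
        if rank <= top_n:
            top.add(domain)
    return top


def is_removed(hostname, top_domains):
    """True if the hostname or any of its parent domains is in the top list."""
    labels = hostname.lower().rstrip(".").split(".")
    # www.example.com is checked as www.example.com, example.com, then com.
    return any(".".join(labels[i:]) in top_domains for i in range(len(labels)))
```

A resolver built on this would refuse to answer when `is_removed` returns True and resolve normally otherwise.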

------
smudgymcscmudge
I don't get it. What's the point of this?

~~~
pixie_
You get 'popularly obscure' results. Which I think is a better name than
million short.

~~~
pixie_
'IndieSearch' is even better lol.

~~~
sukuriant
It comes already built with a "once it's popular, I don't like it anymore"
feature!

------
makmanalp
This could work as a pretty neat anti-procrastination tool. HN is ranked 2.9k
and reddit is 100-something.

