Show HN: I Scraped Hacker News for TLD Popularity (v01.io)
68 points by klausbreyer on March 13, 2023 | 41 comments



This will save you time:

https://play.clickhouse.com/play?user=play#U0VMRUNUIHRvcExld...

Takeaways:

- .org is 50% better than .com;
- .edu and .gov are really nice;
- .io is cool, and much better than .uk.

PS. I don't remember any rate limits of the API. Here is how I downloaded the data: https://github.com/ClickHouse/ClickHouse/issues/29693


Thanks zX41zdbW - I find the query you shared easier to reason about than:

https://github.com/klausbreyer/hackernewsstats/blob/main/que...

Also:

"Elapsed: 0.140 sec, read 33.95 million rows, 1.26 GB."

140 ms is quite impressive.


To be fair, that is in large part because CH (ClickHouse) seems to have a topLevelDomain function.
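For readers without ClickHouse at hand, the idea can be sketched in a few lines of Python. This is only an approximation under a simple assumption (the TLD is the last dot-separated label of the hostname), not ClickHouse's implementation; `top_level_domain` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url: str) -> str:
    """Return the last label of the URL's hostname, or "" if there is none."""
    host = urlsplit(url).hostname or ""
    # Hostnames without a dot (e.g. "localhost") have no TLD in this sketch.
    # Note: IP addresses and trailing-dot hostnames are not handled here.
    if "." not in host:
        return ""
    return host.rsplit(".", 1)[1]


# Hypothetical sample of submitted URLs.
urls = [
    "https://example.com/a",
    "https://example.org/b",
    "http://sub.example.org/c",
    "https://example.io",
]
tld_counts = Counter(top_level_domain(u) for u in urls)
```

Here `tld_counts` maps each TLD to its number of submissions, which is the kind of aggregate the shared query computes on the database side.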


You don't need to scrape HN: there's a public dataset in Google BigQuery. I don't know if it's still updated regularly.

edit:

here's the link https://console.cloud.google.com/marketplace/details/y-combi...

They seem to have stopped updating it around the later part of 2022; I don't know why.


To anyone interested, I noticed this stoppage as well. IIRC, the data in BigQuery stopped being updated Nov 16, 2022. I’ve got a complete dataset that merges in new content from the firebase dataset, then rectifies story points at a later time (since stories can amass points indefinitely). Comment here if you’re interested. I’m thinking of publishing it as a torrent, but don’t know if there’s enough interest.
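The merge-then-rectify approach described above could look roughly like this. It is a sketch under assumptions, not the commenter's actual pipeline: `merge_items` is a hypothetical helper, the records are made up, and only the field names (`id`, `title`, `score`) follow the public HN item format.

```python
def merge_items(archive: dict[int, dict], fresh: list[dict]) -> dict[int, dict]:
    """Merge freshly fetched items into an archive keyed by item id.

    New ids are added. For existing ids the fresh fields win, so a later
    re-fetch corrects values such as the score.
    """
    merged = dict(archive)
    for item in fresh:
        merged[item["id"]] = {**merged.get(item["id"], {}), **item}
    return merged


# Hypothetical records, for illustration only.
archive = {1: {"id": 1, "title": "Old story", "score": 10}}
fresh = [
    {"id": 1, "score": 42},                       # later snapshot of story 1
    {"id": 2, "title": "New story", "score": 3},  # item missing from the archive
]
merged = merge_items(archive, fresh)
```

Running the same merge again at a later date with newer snapshots is what would "rectify" the points of stories that kept gaining votes.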


> Comment here if you’re interested. I’m thinking of publishing it as a torrent, but don’t know if there’s enough interest.

Please do :)


Would be interested as well!


I would love that! I find HN useful to have backed up on my laptop, for freetext searching offline mostly.


+1


I also learned this today. Thanks for sharing! :)

Edit: But sadly it is not up to date. :)


Dumb question: but I thought the ".io" TLD was owned by a really sketchy group / organization or ownership was being contested (or similar?) and having a domain there was semi-risky, is that still the case? I've avoided .io TLD because of this vague notion, but never really knew the specifics.


The biggest concern I've seen over the .io TLD is that it's the ccTLD for a territory ("British Indian Ocean Territory") whose existence / sovereignty is fragile. The people who originally lived in this territory were forcibly expelled in the 1970s – https://en.wikipedia.org/wiki/Expulsion_of_the_Chagossians – so there are problems from two points of view: ethically this doesn't sit right with a lot of folks, but even from a pragmatic point of view, the UNGA adopted a resolution in 2019 specifying that the UK should withdraw its administration from the territory – this has not yet occurred, and it is uncertain what could/would happen with the ccTLD if or when it does.


"Island of Shame" by David Vine really gets into the details of this if you really want to know the history. Spoiler: The United States fully controls this territory because of the (very active and strategically important) military base. The UK ownership is just a cover... (an entire book could be written just about this alone)

Also, the people brought there were slaves. No one lived on the islands originally. This is why the various governments felt they could move these people wherever they wanted. The people became self-sufficient, of course. They were then forcibly moved to Mauritius and other places and given a stipend, but no land or access to any resources, when the US military expanded their base. The people were basically dumped into ghettos.

This is still a very open wound in 2023.


I have such a domain myself. I have been thinking about switching to .dev for a long time (I have already reserved it).

However, my dilemma:

I would then keep .io and redirect it so that no one can steal it. By doing so, I'm not helping anyone who lives there. :(

I haven't used it for new projects for a long time.



.io has some baggage. People working in FLOSS have been suggesting something else instead:

> In short, this TLD belongs to the Chagos Archipelago, which suffered a mass deportation by the UK government 50 years ago. These people still fight for the right to return to their homeland.

https://github.com/elementary/website/discussions/3108


If you're using .io because you want a "techy" TLD (and not .net, which isn't held in high esteem for some reason), .dev is a great alternative.


Isn't that Google's TLD? Unfortunately I trust them less than the sketchy administrative situation of .io. Wake up one day and their algo's decided you're a spam account and lock you out and take your domain.


You don't need to get it directly via Google. Mine are registered through a normal registrar, and not linked to a Google Account.


.tech .technology .software .it .digital

are options. I agree about avoiding Google.


There is also .rest (for APIs ), .run and .codes ;)


This is great! It's something I've wondered about for a while. I was surprised to see the decline of .com is fairly linear. Before I looked I was expecting the use of alternate TLDs to be accelerating a bit.

Are the absolute values the running totals? If so, why do they decline from 2021 to 2022?

I think a graph for unique counts would be cool to see too. For example, the ClickHouse query posted earlier in this thread shows:

    domain    count    unique
    -------------------------
    .org     349414     58226
    .net     114499     31129

So the submissions using .org are 3x .net, but the unique domains seen using .org are less than 2x .net. I'm not sure if there's any significance there, but it would be interesting to see the difference.
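A rough sketch of how such total and unique counts could be computed from a list of URLs, followed by the two ratios for the rows quoted above. It assumes "unique" means distinct hostnames, which the thread does not confirm; `tld_stats` is a hypothetical helper.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def tld_stats(urls):
    """Return {tld: (submission_count, unique_host_count)} for the given URLs."""
    counts = defaultdict(int)
    hosts = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if "." not in host:
            continue  # skip URLs without a usable hostname
        tld = "." + host.rsplit(".", 1)[1]
        counts[tld] += 1
        hosts[tld].add(host)
    return {tld: (counts[tld], len(hosts[tld])) for tld in counts}


# Ratios for the two rows quoted above.
submission_ratio = 349414 / 114499  # .org vs .net submissions, about 3.05
unique_ratio = 58226 / 31129        # .org vs .net unique domains, about 1.87
```

The gap between the two ratios suggests that the average .org domain is submitted more often than the average .net domain.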

In the same context, I think it would be interesting to see the top 50 domains on each TLD.

Anyway, it's very cool info to see. Thanks for sharing it!


Good feedback, thank you very much! I will see if I can address this in the future. :)


The interpretation you never asked for: the data exhibit strong preferential attachment [1] behaviour, i.e. you can draw a straight line through them in a log-log plot (even though only a semilog plot is shown). This is typical for real-world data.

[1] https://en.wikipedia.org/wiki/Preferential_attachment
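One way to check the straight-line claim is to fit a line to log(count) against log(rank). The sketch below assumes a rank-versus-count reading of the plot and uses synthetic numbers, not the real HN data; `loglog_fit` is a hypothetical helper.

```python
import math


def loglog_fit(counts):
    """Least-squares fit of log(count) = slope * log(rank) + intercept.

    Expects at least two strictly positive counts.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic counts following count = 1000 / rank, not real HN data.
synthetic = [1000 / rank for rank in range(1, 21)]
slope, intercept = loglog_fit(synthetic)
```

For this synthetic input the fitted slope is -1 (up to floating-point error). On real TLD counts, how closely the points follow the fitted line indicates how well a power law describes them.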


Thanks! A good thing to know!


Speaking of .io tld: Looks like the territory is going thru something (last paragraphs of https://en.wikipedia.org/wiki/.io#History mentions ".io domain could also be extinguished")

Should domain owners be worried?


Is there any information within .it, for the Italian provincial second level domains? I find the idea of these domains fascinating, although implementation of them could be problematic, given that countries can lose or gain territory over time (suppose the Kingdom of Naples secedes from Italy?)


> Wrong links: http://blog.plover.com./prog/lib.html

That’s technically not incorrect. Host names, as far as DNS is concerned, always have a trailing “.”, and my browser resolves the URL just fine.
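If such URLs should be counted rather than rejected, one option is to strip the root dot before extracting the TLD. This is a minimal sketch, assuming a single trailing dot is the only irregularity; `normalized_host` is a hypothetical helper, not part of the original scraper.

```python
from urllib.parse import urlsplit


def normalized_host(url: str) -> str:
    """Return the lowercased hostname without a single trailing root dot."""
    host = urlsplit(url).hostname or ""
    return host[:-1] if host.endswith(".") else host


host = normalized_host("http://blog.plover.com./prog/lib.html")
tld = host.rsplit(".", 1)[-1]
```

With this normalization the example URL is counted under `com` instead of ending up with an empty TLD.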


One remark: .io looks like it's #3 in the time series, but it seems to be missing from the bar plot.


I see it in both graphs? Using Firefox here, in case it matters.


Here's what I've been using to automate domain name discovery:

https://github.com/esteininger/domain-name-checker/blob/main...


> I had a database full of HN Stories since the very beginning, which accumulated to ~1GB.

Just curious, is there any way to download that?


Sure, I can make a dump available. Can you tell me a little bit about what you are planning to do with it?


Well, I need titles.

I've had this idea for a while. Take bitcoin: at some point this word emerged on Hacker News, and we started seeing it more and more. I'm wondering what other words are in titles now that weren't there 2-3 years back.

These words/topics are probably the future. If you ever find that out, let me know by email (in my profile), or just post a comment here.

There shouldn't be too many of them, probably hundreds.
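The idea above can be sketched as a set difference between recent title words and those seen in the preceding years. Everything here is illustrative: `emerging_words`, its parameters, and the sample titles are hypothetical, and real titles would need better tokenization.

```python
import re
from collections import Counter


def emerging_words(titles_by_year, recent_year, lookback=3, min_count=2):
    """Words seen at least `min_count` times in `recent_year` titles but
    never in the `lookback` years before it."""
    def words(titles):
        return Counter(
            w for t in titles for w in re.findall(r"[a-z0-9]+", t.lower())
        )

    old = set()
    for year in range(recent_year - lookback, recent_year):
        old.update(words(titles_by_year.get(year, [])))
    recent = words(titles_by_year.get(recent_year, []))
    return sorted(w for w, n in recent.items() if n >= min_count and w not in old)


# Hypothetical titles, for illustration only.
titles_by_year = {
    2009: ["Show HN: my new blog engine", "Ask HN: best text editor?"],
    2010: ["A new text editor in Lisp"],
    2011: ["Bitcoin explained", "Why Bitcoin matters", "A new blog engine"],
}
new_words = emerging_words(titles_by_year, 2011)
```

The `min_count` threshold filters out one-off words and typos, which would otherwise dominate the result on a real title dump.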


You've got mail :)


This is interesting, thanks.

If someone wanted to dig deeper, does anyone know if Google makes the .dev zone file public?


I can't figure out what they did. Web scraped Hacker News for domain ideas?


Why does the stack chart randomly change colors by just filtering/ordering?


I think plotly is giving the colors just based on the order of data inserted.


domain.com./whatever is valid tho?


In theory, technically it is correct. But it was harder for me to parse and it really was not a significant amount.



