
Show HN: Download the first 10,002,378 HN comments/stories as one archive - cdman
Magnet link: magnet:?xt=urn:btih:44c65b5779d9d8021e002584fa73740f36d052a6&dn=10m_hn_comments_sorted

Go to https://hn-archive.appspot.com/ for the torrent file / source code.

I'll be semi-frequently checking the story and answering any questions which may come up.
======
duggan
Somehow I can never turn down a data dump, despite never having done much with
one.

Some day!

------
binarymax
Thank you for this! I'm training word2vec on it right now - will take several
hours.

If anyone else is interested here is the (terrible) code to get it into a
prototype format.
[https://gist.github.com/binarymax/d3691180e65ff7f0dec5](https://gist.github.com/binarymax/d3691180e65ff7f0dec5)

~~~
philth
Keep us posted on your discoveries. It would be interesting to see how
different the embedding is to word2vec trained on a different corpus. I
imagine borrowed words like "python" are clustered with programming languages
rather than snakes in this case.

As a side note, not really having looked too deeply into word2vec, does
word2vec capture multiple meanings? If so, how?

~~~
binarymax
All done, results are very promising! Examples are too long so here is one
below, and this gist has more:
[https://gist.github.com/binarymax/6befa448df3f5fd6dba9](https://gist.github.com/binarymax/6befa448df3f5fd6dba9)

    
    
            Starting training using file 10m.txt
            Vocab size: 305432
            Words in train file: 565170189
            Alpha: 0.000045  Progress: 99.91%  Words/thread/sec: 107.57k  
            real   174m19.955s
            user   1315m35.661s
            sys    3m27.011s
    
    

            Enter word or sentence (EXIT to break): startup

    
    
            Word: startup  Position in vocabulary: 390
    
                                                          Word       Cosine distance
            ------------------------------------------------------------------------
                                                      startups      0.808231
                                                  bootstrapped      0.719379
                                                  entrepreneur      0.707722
                                                        starup      0.698379
                                                 bootstrapping      0.698216
                                                     incubator      0.683647
                                                      founders      0.664983
                                                       scrappy      0.660502
                                                 entrepreneurs      0.660176
                                               entrepreneurial      0.656120
                                                            yc      0.652160
                                                     cofounder      0.651848
                                                            vc      0.650642
                                                     fledgling      0.636813
                                                    cofounders      0.632761
                                                       venture      0.622636
                                                       company      0.617562
                                                    incubators      0.612947
                                                        statup      0.608451
                                                       founder      0.608080
                                              entrepreneurship      0.604812
                                                            sv      0.603689
                                                         bigco      0.602171
                                                   startuppers      0.592669
                                                     cofounded      0.588964
                                                  entrepeneurs      0.585747
                                                          solo      0.582533
                                                 entreprenuers      0.564045
                                                   boostrapped      0.562884
                                                  solopreneurs      0.559994
                                                    cofounding      0.559840
                                                       statups      0.558347
                                                      business      0.552922
                                                  bootstrapper      0.551885
                                                     techstars      0.545766
                                                 bootstrappers      0.545263
                                                       fintech      0.545090
                                                      fundable      0.542542
                                                       shotput      0.541257
                                                   accelerator      0.540787
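
The ranking above can be illustrated with a minimal sketch of a nearest-neighbour lookup by cosine similarity. Although the tool labels the column "Cosine distance", the values behave like a similarity (higher means closer). The three-dimensional `toy_vectors` and the `nearest` helper are made up for illustration; they are not taken from the trained model or from the word2vec source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest(word, vectors, k=3):
    """Return the k words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "startup": [1.0, 0.9, 0.1],
    "founder": [0.9, 1.0, 0.2],
    "snake": [0.0, 0.1, 1.0],
}

neighbors = nearest("startup", toy_vectors, k=2)
for other, score in neighbors:
    print(f"{other:>10}  {score:.6f}")
```

With real vectors the same loop would run over the whole vocabulary, which is why libraries normalise the vectors once and use a matrix product instead.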

------
tilt
[https://hn-archive.appspot.com/](https://hn-archive.appspot.com/)

Clickable

~~~
cdman
Thank you.

------
theklub
Someone should map the use of tech buzzwords over the years. Would be pretty
funny to look at.
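
A rough sketch of how that could be done, assuming each line of the archive is a JSON object with a Unix `time` field and a `text` and/or `title` field as in the official HN API item format. The sample records and the term list below are invented for illustration, and real `text` values contain HTML that one would probably want to strip first.

```python
import json
import re
from collections import defaultdict
from datetime import datetime, timezone


def term_counts_by_year(lines, terms):
    """Count whole-word, case-insensitive occurrences of each term per year."""
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(item, dict):
            continue
        timestamp = item.get("time")
        if not isinstance(timestamp, (int, float)):
            continue
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        # Comments carry "text", stories carry "title"; use whatever is present.
        body = " ".join(str(item.get(key) or "") for key in ("title", "text"))
        for term, pattern in patterns.items():
            counts[year][term] += len(pattern.findall(body))
    return {year: dict(per_term) for year, per_term in counts.items()}


# Invented sample records standing in for lines of the real archive.
sample_lines = [
    '{"type": "comment", "time": 1200000000, "text": "web 2.0 is the future of the cloud"}',
    '{"type": "comment", "time": 1430000000, "text": "Blockchain in the cloud, cloud everywhere"}',
    "not json",
]

result = term_counts_by_year(sample_lines, ["cloud", "blockchain"])
for year in sorted(result):
    print(year, result[year])
```

Dividing each count by the number of items in that year would give a fairer comparison, since comment volume grows over time.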

~~~
ecesena
This could be a good starting point:
[http://dclure.org/essays/visualizing-the-humanist/](http://dclure.org/essays/visualizing-the-humanist/)

------
paulsutter
I wish it included upvotes/downvotes. Why are those secret? It would be fun to
work on ranking algorithms, and anything effective requires knowing who is
doing the up/down voting.

------
ivan_ah
> 10,002,378

what date range does this correspond to? How big is the archive?

~~~
cdman
It is from story 1 to comment/story 10,002,378 :-)

The archive is 1.12GB and contains one JSON document per line. The JSON
document is approximately the format returned by the official HN API (although
there are some exceptions since some of the comments are not available through
the official API and those had to be retrieved through the Algolia API and/or
scraped from the site).
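
Since the file is one JSON document per line, it can be streamed without loading everything into memory. The following is only a sketch under that assumption: the field name `type` follows the official HN API item format, the sample records are made up, and `iter_items` / `count_types` are hypothetical helper names.

```python
import io
import json
from collections import Counter


def iter_items(lines):
    """Yield one dict per non-empty line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            continue  # scraped records may not always be well-formed
        if isinstance(item, dict):
            yield item


def count_types(lines):
    """Tally items by their "type" field (story, comment, ...)."""
    return Counter(item.get("type", "unknown") for item in iter_items(lines))


# Made-up records standing in for the real archive.
sample = io.StringIO(
    '{"id": 1, "type": "story", "by": "alice", "time": 1160418111, "title": "Example"}\n'
    '{"id": 2, "type": "comment", "by": "bob", "time": 1160423461, "text": "hello"}\n'
    "not json\n"
)

type_counts = count_types(sample)
print(type_counts)
```

For the real data, passing an open file handle (for example `open(path, encoding="utf-8")`, with `path` pointing at wherever the decompressed archive lives) instead of `sample` should work the same way, because a file object also iterates line by line.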

~~~
ivan_ah
Thx. Great job. Now I just have to dust off some LDA code and see some
topics...

------
orf
Does this include [dead] comments?

~~~
cdman
Yes, dead comments were fetched / scraped from the website (so it might not be
perfect since it uses regex to parse HTML :p).

------
callum85
Which pieces of data are included with each comment/link?

~~~
frou_dh
[https://github.com/HackerNews/API#items](https://github.com/HackerNews/API#items)

Also the file is ~5.3GB when decompressed, if anyone is wondering.

------
toomuchtodo
What license applies to the archive? Creative commons?

------
sitkack
Metadata request: can someone scrape the tracker and provide a log of all the
IPs that participated in the swarm?

