
Newsgroups and the Internet Archive: I Made a Difference - mailxplorer
http://screengod.blogspot.com/2018/03/newsgroups-i-made-difference.html
======
wmf
It's pretty sad that a bunch of people sent historical Usenet archives to
Google, they imported them into Google Groups, and then... basically hid/lost
access to it over time. I assume the data is in GFS somewhere but it may never
see the light of day. But maybe we just need to shame them every decade:
[https://www.wired.com/2009/10/usenet/](https://www.wired.com/2009/10/usenet/)

~~~
Steuard
Google Groups was great when they first acquired all those archives! It just
got progressively worse and worse over time, and I've never understood why.
(Stagnation I could understand, but how did the same searches get _less_
effective?)

~~~
ghaff
I get that Google essentially walked back from an at least implied role as an
information archivist. But I still don't really get why they so completely
abandoned things like Google Groups given the truly minuscule resources to
maintain them at some low level.

~~~
cultureleak_ta0
Nobody can get promoted for improvements like that. It would only garner
goodwill with a tiny customer base and it’s not clear how that would translate
to revenue or user growth on other products.

~~~
petercooper
Totally agree, but I kinda feel (and this is solely my opinion) that much as
it's our collective responsibility to donate to worthwhile causes, gigantic
tech companies should spend rounding-error money on good digital causes like
this.

------
NelsonMinar
Google's Usenet archive is too important to be owned by one company. My
understanding is some Google engineers worked a few years ago to make sure
Archive had a copy, but I can't verify that and I don't think it's online.
Some of the earliest archives come from Eugene Spafford's collection which is
readily online elsewhere, but Google did a lot of work cleaning it and of
course has the DejaNews archives which are invaluable.

I'm amazed Google Groups still exists as a product, it seems abandoned
internally and I expect to hear it's shut down any month now.

------
luckydude
I miss dejanews - anyone remember that before Google bought them? They had
full search, it's a shame Google let that go. Their search went back to pretty
much the beginning of net news. Does that exist anywhere anymore?

~~~
NelsonMinar
At least some of the Usenet archives are still online in Google Groups and
searchable. Can't vouch for completeness of the search index though, it looks
pretty wonky for stuff from the early 1990s. Anyway, example working archive
link I found via search:
[https://groups.google.com/forum/#!searchin/comp.os.minix/tor...](https://groups.google.com/forum/#!searchin/comp.os.minix/torvalds$20comp.os.minix%7Csort:date/comp.os.minix/yDS0CebiYdE/yjQw2YBmy2IJ)

Don't be too hard on Google's acquisition of Deja. I wasn't working at Google
at the time, but heard from many colleagues when I joined that the Deja
acquisition was quite chaotic because Deja was just about to shut down
entirely when Google picked it up rather than let it disappear. It's a shame
they don't do better with the Usenet archive now, but it's clearly not
Google's business anymore.

------
sverige
For all of you who are misty-eyed pondering those wonderful, probably-lost-
forever Usenet posts from the 90s, I dare you to read Kibo's .signature (last
updated 5/5/94 4:52AM <-- CINCO DE MAY-O !!!!) at
[http://archive.birdhouse.org/etc/kibosig.txt](http://archive.birdhouse.org/etc/kibosig.txt)
to cure yourselves of misplaced misty-eyed nostalgia for those long-ago times
when the Internet was something else.

Kibo for President!

~~~
edraferi
What did I just read. It’s like trying to understand satire from a hundred
years ago.

I was convinced it was hopelessly corrupted by the archiving process, but some
of the ASCII art still works (mostly).

~~~
aperrien
It's concentrated 90's memes, straight from the Gen X tap. Can be hazardous in
large doses.

------
llao
Thank you!

If I read it right, humanity has lost 1991-2003 to Google though, correct?

~~~
linguaz
Take a look here:

[https://archive.org/details/usenethistorical](https://archive.org/details/usenethistorical)

"This historical collection of Usenet spans more than 30 years and was given
to us by a generous donor"

This group for example, from that collection:

    
    
        https://archive.org/download/usenet-comp/comp.emacs.mbox.zip [69.8M]
    

includes posts spanning from December 1988 to June 2013.

For some reason the mbox files have an odd format, with From lines that look
like:

    
    
         From -8118066241627336028
    

I wrote some scripts to fix that so I could open the mbox files in Mutt.
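
For anyone wanting to try the same thing, here is a minimal sketch of one way to do it in Python. It is not the script mentioned above. It assumes the odd separator is always the word "From", a space, and a bare integer at the start of a line, and it swaps in a placeholder sender and date (`MAILER-DAEMON` and the Unix epoch, both made up for illustration) so that mbox readers see a conventional separator line.

```python
import re
import time

# Separator lines in the archive look like "From -8118066241627336028":
# "From", a space, and a (possibly negative) integer, nothing else.
ODD_FROM = re.compile(rb"^From -?\d+\s*$")

# Placeholder envelope sender and date. Both are assumptions, since the
# original separator line carries neither.
PLACEHOLDER = b"From MAILER-DAEMON " + time.asctime(time.gmtime(0)).encode("ascii")


def fix_from_lines(lines):
    """Yield lines, replacing odd separators with a conventional one."""
    for line in lines:
        if ODD_FROM.match(line):
            # Keep the original line ending (LF or CRLF).
            ending = b"\r\n" if line.endswith(b"\r\n") else b"\n"
            yield PLACEHOLDER + ending
        else:
            yield line


def fix_mbox(src_path, dst_path):
    """Copy an mbox file, rewriting only the odd separator lines."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(fix_from_lines(src))
```

Working on bytes avoids guessing the encoding of decades-old posts. Calling `fix_mbox("comp.emacs.mbox", "comp.emacs.fixed.mbox")` (hypothetical file names) would write a copy that differs only in its separator lines.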

BTW, found the above links on this page:

[http://ryanfb.github.io/etc/2015/02/23/early_usenet_history_...](http://ryanfb.github.io/etc/2015/02/23/early_usenet_history_and_archiving.html)

which has more info & links about historical usenet archives.

~~~
llao
Oh wow, does this mean that we have all of usenet texts? What stops people
from providing a better interface than Google's then?

------
zokier
> After a month I had most newsgroups - excepting binaries - and it came to
> 800GB.

> Trying to index THAT lot was impossible

Stupid question, but why would indexing 800GB of newsgroup postings be
impossible?

~~~
mailxplorer
CLucene was way too slow at indexing body text for anything over 1GB. I had my
own header parser in C++ (though you can do that in Python easily).

I'm trying again on that 800GB with KISS DB (append-only hashtable), and
Elasticsearch. Doesn't matter if GPL because it's a website.
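
To illustrate the "in Python easily" part, here is a rough sketch of header parsing using only the standard library's `email` package. It is not the C++ parser described above. The sample article and the choice of header fields are assumptions made up for the example.

```python
from email.parser import BytesHeaderParser

# A made-up article, used only to exercise the function below.
RAW = b"""From: someone@example.com
Newsgroups: comp.emacs
Subject: Example post
Message-ID: <abc@example.com>

Body text here.
"""

# Which headers to extract is an assumption; adjust to taste.
WANTED = ("From", "Newsgroups", "Subject", "Message-ID", "Date")


def parse_headers(raw):
    """Return the wanted headers as a dict; missing headers map to None."""
    # BytesHeaderParser stops at the first blank line, so the body is
    # never parsed into MIME parts.
    msg = BytesHeaderParser().parsebytes(raw)
    result = {}
    for name in WANTED:
        value = msg[name]
        result[name] = None if value is None else str(value)
    return result


if __name__ == "__main__":
    for name, value in parse_headers(RAW).items():
        print(f"{name}: {value}")
```

Parsing headers only keeps the per-article cost low, which matters when the input is hundreds of gigabytes. The resulting dicts could then be fed to whatever index is used.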

~~~
GuacheSuedeHN
Do you mind sharing the code? I think it would be interesting to see.

------
jl6
Reddit is the current era’s equivalent of Usenet, and we don’t have a robust
archive of that either.

~~~
brokensegue
The Wayback Machine's archive of reddit isn't perfect, but it works. Just give
the IA more money.

~~~
Asparagirl
Yes, but much of the Wayback Machine’s reddit content was specifically
targeted and scraped by ArchiveTeam, who are volunteers that seek out at-risk
content from the web and make sure that it gets into the Wayback. In the past
few years we’ve specifically tried to go after sub-reddits that we thought
were newsworthy and/or at high risk for deletion. But there’s no way we can
get all of it.

But you can help! If you have extra server space/bandwidth or you can spare
$40/month, we can add more pipelines:
[https://www.archiveteam.org/index.php/ArchiveBot](https://www.archiveteam.org/index.php/ArchiveBot)

Source: am ArchiveTeam member, run various pipelines, have scraped sub-reddits
ranging from The_Donald to the cryptocurrency worlds to darknet markets.

------
Steuard
It's great to see some Usenet archives out there to partly make up for the
disappointment of Google Groups. But I'm sad that this archive seems to be
incomplete, even within its stated date range. Back in the day, I was active
on rec.arts.books.tolkien and alt.fan.tolkien: in this archive, I can't find
any trace of the massive "alt." hierarchy at all, and the list of files for
the "rec." hierarchy doesn't include the Tolkien group. For that matter, the
list includes rec.humor.funny and rec.humor.d and others, but apparently not
rec.humor itself. (It really does make you appreciate just how substantial the
effort of collecting a comprehensive Usenet archive would be.)

On another note, not that anyone here would be able to fix it, but this list
would be a lot easier to search through if the item names didn't all begin
with "Usenet newsgroups within", so you could jump to first letters in a
meaningful way.

~~~
mailxplorer
It's my fault (not the IA), I thought I'd got all newsgroups but must have
missed some. I just checked the main newsgroup list and it's incomplete, for
some reason.

My plan though, is to dust off the old code, get a complete list of groups,
get them, and then make it searchable.

Sorry about that, I didn't check. This was all done in 2013. Basically, I
wanted to build a search engine but indexing the newsgroup posts (for header
and text body search) would take too long. I abandoned it, then in 2016 I sent
it to the IA.

So, I only just found out it's lacking a bunch of groups, 5 years on...

~~~
Steuard
No apologies necessary: creating the archive in the first place was awesome!
Like I said, this just shows how massive and complex Usenet is (or maybe,
was), and how it's not easy at all to create a comprehensive archive. It's far
better to have _some_ of it than none of it!

~~~
mailxplorer
From what I can gather, I downloaded one tenth of Usenet - 11,000 groups. This
means if I'd done all 110,000 groups it would have taken me about a year to
download them, and an 8TB drive to store them (in 2013!). That wasn't really
feasible...
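
The estimate above can be reproduced with a quick back-of-the-envelope calculation. This sketch assumes the 800GB and one-month figures quoted elsewhere in the thread correspond to the 11,000 downloaded groups, and that size and download time scale linearly with the number of groups, which is a simplification since group sizes vary widely.

```python
def estimate_full_download(groups_done, groups_total, size_gb, months):
    """Linearly scale storage (in TB) and time (in months) to all groups."""
    scale = groups_total / groups_done
    return size_gb * scale / 1000, months * scale


if __name__ == "__main__":
    # 11,000 groups, 800GB, about one month: figures from this thread.
    size_tb, total_months = estimate_full_download(11_000, 110_000, 800, 1)
    print(f"about {size_tb:.0f} TB and {total_months:.0f} months")
```

Ten months rounds to the "about a year" quoted above, and the storage figure matches the 8TB drive.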

------
fencepost
I'm impressed that Giganews maintains a 10-year archive of Usenet, I suspect
that would break most newsreaders.

~~~
u801e
A lot of the big usenet providers have at least a decade's worth of article
retention at this point (even for binary newsgroups).

------
xor1
Usenet was my very first exposure to internet communities. I read and posted
on alt.games.nintendo.pokemon between ages 9 and 13, then moved to Something
Awful after that.

------
alxlaz
I'm gonna put my dinosaur hat on and remind everyone that you can still use
USENET in 2018. It's alive, well, and if you don't need access to binary
groups, it's also free and very straightforward.

There are still some interesting discussions going on, mostly in the technical
groups.

I still open it up maybe once a month or so for nostalgia, though I haven't
posted in a while.

------
l1n
Search query to the void, but if anyone has archives of the umbc.* hierarchy,
I'd be eternally grateful to see them.

------
forapurpose
Usenet articles periodically get promoted to the front page. I wonder what that
says about the age demographics of HN, given that Usenet hasn't been
significant in maybe 20 years (and even then it was a niche). Is the younger
generation here? And if not here, where?

~~~
unimpressive
I'm kind of offended that you seem to think young people can't be aware of
things that happened before their time.

EDIT: To elaborate a bit, my expectation would be that the sort of young
person using HN is more historically interested than average and more likely
to appreciate the value of things like the Internet Archive. Usenet is a huge
part of digital history, and the fact that it's not available even though
archives were kept is something of a tragedy.

