Hacker News
Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 (modsandmembersblog.wordpress.com)
1393 points by Diagon 40 days ago | 405 comments



Extensive history is about to be lost. Despite Yahoo Groups being broken, many organizations still use it. Examples from that post:

A police cooperative in Washington DC that was using them as a network to communicate with their respective neighborhoods with over 17,000 members.

A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted.

A Birding group in New Delhi with 2,000 members that has collected data and research on birds for TWO DECADES.

An Adoption group in France, that has been using it for years and years to communicate and share history and photos and more.

They also would have found: Numerous support groups for people who are suicidal or depressed.

Numerous medical groups for people to communicate more effectively with their doctors.

Numerous Vet groups with 24 hr care advice for sick pets.

Numerous support and help groups for the Elderly.

Numerous Historical groups for WW2 Veterans, Vietnam Veterans, and etc.

Numerous science groups that have used them for years and have all their research there.

Numerous fan fiction groups or arts groups that have shared their work for years.


> A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted.

Wow, somebody invented a database that's even worse than an Excel file on a network share.

(Also, how are they going to assign new numbers when archive.org takes over? Is archive.org going to give them write access?)


“My understanding is that [the group] will still function as a mailing list, which is for all practical purposes, what people use this as,” https://www.theverge.com/2019/10/17/20919630/yahoo-groups-uk...


That's right, but our main concern is that the archives are being deleted. With no further history being recorded, its utility for some purposes is limited. I have also come across some complaints that even as a listserv it can be problematic. Posts, for example, are no longer arriving in order.


But as a mailing list, each subscriber has the entire archive, at least from the date they joined. And any one of them can make it publicly accessible at any point in the future. In practice it will undoubtedly result in the destruction of enormous amounts of human knowledge, but at least in theory not much is getting immediately lost.


The difficulty in a lot of cases is finding someone who has a complete copy of the group. Yahoo Groups also had file, photo and database features, and archives of those are likely to be incomplete. You'd have to go through the member list (primarily early members) and find someone who still had a copy of all the messages.

The other problem is making it available - I ran a Yahoo group for many years, and have Mbox and Maildir format archives. I'm still looking for a decent web-based browser for these. HyperKitty (Mailman's archive browser) came close, but seems to require most of Mailman to be installed in order to work.

In my case, I managed to archive a bunch of groups related to amateur radio, and I will be placing these on archive.org as soon as I have a spare moment to zip them up. A difficult-to-access archive is better than no archive at all; the important part is getting the data into a safe place.


I'm actually applying for an SBIR grant right now to work on the NLP algorithms that power fwdeveryone.com, if you have any interest in writing a letter of support. Basically it would eventually enable someone to mass export something like an MBOX archive onto the web in a cleaned up format with accessible typography. You can play around with prettyfwd.com to get an idea of the current state of the tech, it works well for 95% of (non-commercial) email threads but still needs some more work to support the rest.


I have the same problem. I used some old script back in 2006 to download a couple of groups in ... I think it's Mbox format. It's just not clear what to do with it.


Hmm... a standalone viewer for these formats (one that exposed a webserver that could be accessed in a browser) sounds like it would be pretty trivial given a parser for the email format itself. Especially maildir!

How big are these archives? Do you have any samples? Does the viewer need any special features? (threading?)
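
For what it's worth, here is a minimal sketch of that kind of standalone viewer, using only Python's standard library (`mailbox`, `html`, `http.server`). The function names `load_messages`, `render_page` and `serve` are made up for illustration. It shows only the first text/plain part of each message and does no threading or search, so treat it as a starting point under those assumptions rather than a finished tool.

```python
import html
import mailbox
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_messages(mbox_path):
    """Return (subject, sender, date, body) tuples for each message in an mbox file."""
    rows = []
    for msg in mailbox.mbox(mbox_path):
        body = ""
        # Use the first text/plain part; HTML parts and attachments are skipped.
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    body = payload.decode(charset, errors="replace")
                except LookupError:  # unknown charset name in an old message
                    body = payload.decode("utf-8", errors="replace")
                break
        rows.append((
            str(msg.get("Subject", "(no subject)")),
            str(msg.get("From", "")),
            str(msg.get("Date", "")),
            body,
        ))
    return rows


def render_page(rows):
    """Render all messages as one HTML page, escaping every field."""
    parts = ["<html><body><h1>Archive</h1>"]
    for subject, sender, date, body in rows:
        parts.append(
            "<h2>%s</h2><p>%s, %s</p><pre>%s</pre>"
            % (html.escape(subject), html.escape(sender),
               html.escape(date), html.escape(body))
        )
    parts.append("</body></html>")
    return "".join(parts)


def serve(mbox_path, port=8000):
    """Serve the rendered archive on localhost until interrupted."""
    page = render_page(load_messages(mbox_path)).encode("utf-8")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(page)))
            self.end_headers()
            self.wfile.write(page)

    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve("group.mbox")` (a hypothetical file name) would serve the whole archive as a single page on http://127.0.0.1:8000/. For a Maildir archive, swapping `mailbox.mbox` for `mailbox.Maildir` should work the same way, since both classes iterate over messages.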


> pretty trivial given a parser for the email format itself

The problem is that there isn't any standard that defines what can and can't go inside the body of an email message. So if you want to post each email message in a thread exactly as is, i.e. each with completely different typography and with all the replies attached and not sanitized in any way, then that's relatively easy. But it's also completely unreadable for more than about 30 seconds, and doesn't allow for good search functionality. These problems aren't a deal breaker if you're only trying to make sense of your own inbox, but when you're looking for specific information across millions of people's inboxes then they're a complete nonstarter.


I remember how great Gmane used to be with several incredible web-based views of mailing list archives. Too bad it sounds like the source code was lost and never open sourced. Another on the list of services that died without passing on enough of the torch.


Oh, neat, that's a pretty interesting read. Also good to know people will be able to keep their phone number after this.


Not to be flippant, but wouldn't one of the members of these groups have a copy of the group in their email? Given gmail and whatnot store things virtually indefinitely, couldn't the contents be recovered that way?

-EJ


Some of these groups are decades old. For them, you'd be hard pressed to find someone who was there for the whole history of the group and kept them all. Also, yahoo was often a headache, dropping emails to individuals - you'd have to go to the website to read them. And furthermore, there is a lot more than emails stored on the platform: files/images/links/calendars/databases ...

To add to all this, it's not an individual project. Most people don't have the technical competence. They need someone to help. That's what the Archive Team has been trying to offer (if not for Verizon).


Disclaimer: I'm a member of Archive Team who's helping coordinate the joining of Yahoo Groups in preparation for archival.

Yahoo's banning of a large number of the accounts we were using is a huge setback for us. In total we lost access to over 55,000 Yahoo Groups; many of these will now not be archived and will be lost when Yahoo deletes everything on December 14.

Particularly disastrous was the loss of access to all of the 30,000 Fandom (fanfic / fanart / etc.) groups that were requested to be archived by members of the fandom community. We're back to square one now, and it is looking increasingly likely that we're only going to be able to re-join (and therefore archive) a small percentage of these groups before December 14.

(And now for the inevitable, shameless plug...) We could really use some help! If you've got an hour or so, we could really use people to come and complete CAPTCHAs for us. (A CAPTCHA is needed to join every group). Instructions at: https://github.com/davidferguson/yahoogroups-joiner


I tried to do this but upon clicking the purple "Join Group" button Yahoo is giving me an error saying my email address is not linked to a Yahoo account:

> Your email address is not linked to a Yahoo ID. To join this group, you need to link your email address to a Yahoo account.

When I click "link your email address", it just takes me to a page called "Personal info" which doesn't have any obvious way to link my email address.

So I'm not sure how to proceed.

EDIT: Solved it. I had initially only "verified" the account with a phone number, but you have to add an email address as well. It's now working.

For anyone who, like me, signed up for this and filled in the Google form, but then couldn't find the leaderboard URL after closing the tab, it is https://df58.host.cs.st-andrews.ac.uk/yahoogroups/leaderboar...

It seems to be working through a list in reverse alphabetical order. Watching the progress being made is quite satisfying. When I started it was on groups like "sciencefiction" and now it's moved on to "petzluverz".


How long did it take you between adding the email address and being able to join the group?

Seeing the same thing now, I added an email address and verified it, but I'm still not allowed to join the group.


It didn't take long at all for me after verification. Although I have sometimes randomly gotten that error message. Interestingly, sometimes it actually had joined the group anyway. The site has been a little glitchy off and on, but it's working for me right now.


While the above post is concerned with Fandom groups, my concern is with groups that started doing early community-driven biohacking-type research. There are medical test results and discussions of medical interventions. While that's my focus, I'm sure there's additional important material. We really need to save this data.


Thanks for fighting the good fight!

I assumed I could help by going to a web page and solving a bunch of captchas for you, but when I read those instructions I found there's more involved (forging a Yahoo account, installing an extension) and it turned me off.

If captchas are the bottleneck, maybe some generous soul here could figure out a way to automate the rest and just give me a page where I can go solve captchas? Further reducing the friction might help get you some more uptake from the community - more monkeys like me banging at typewriters.

Sorry I wasn't more help, and best of luck with your efforts.


I imagine you guys already know this but considering we’re up against the timeline, I’d use the captcha solving service (easy to google yourself) and Luminati to distribute the IP addresses while swallowing my ethical qualms.


I would donate my IP/bandwidth to archive.org if I could run a scraper easily.



Thanks! I never heard of that before; just like project SETI though for archival purposes.

What are the hardware requirements of that VM? I'm attempting to import it on my NAS4Free home NAS Virtualbox service which is the only machine I keep up 24/7 atm, but it takes forever to import. The hardware is very limited however (Atom D410 + a bit over 1GB RAM available), so I'm not sure it would succeed, but so far it loads forever, no errors given. I'd like to run it for this project to start contributing quickly albeit with limited hw before the deadline, then find better iron in the future.


I’m running the Docker image on the smallest Hetzner VMs, with 5 concurrent groups and 40 shared rsync threads per container, and 12 containers per server. Start one container, do docker top on it to make sure it’s pulling, then start the others one by one, taking a few seconds between each to avoid overwhelming the CPU. I’ve got 6 of those little VMs going, and have rolled up 4GB and 2800 groups worth in 6 hours.

After they settle down, they’re more memory than processor intensive. I’ve considered playing with the settings a bit, but thought it was more important to get a bunch of them running on a couple different VMs at different sites.

If I were really feeling fancy, I’d write a nice deployment definition for orchestrating this with microk8s...


I'm running it on a Synology NAS (Celeron J3455), and the docker manager UI claims it's using 180 MB RAM and less than 1% CPU (and I just confirmed it's currently working on archiving Yahoo! Groups)


I don't find it processor or memory heavy, it's mostly doing a lot of IO (network and disk).


Unfortunately it doesn't offer a qemu-compatible image or an image that would work when converted. It's a shame, and the project is shooting itself in the foot.


You should be able to trivially run the Dockerfile[0] on a standard Ubuntu image for qemu, should that be your only reason for desisting.

0: https://hub.docker.com/r/archiveteam/warrior-dockerfile/


An ova file is just a tarball containing an ovf file and a vmdk file. The ovf file is a text-based configuration format, so you can get a basic idea of the config you'd need for qemu. Then the vmdk can be converted with qemu-img.
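
If you want to check what is inside the .ova before converting anything, a small Python sketch like the one below will do it with the standard `tarfile` module. The helper names `list_ova_members` and `read_ovf` are invented for this example, and it assumes the archive follows the usual layout of one `.ovf` descriptor plus one or more `.vmdk` disks.

```python
import tarfile


def list_ova_members(ova_path):
    """Group the file names inside an OVA (a plain tar archive) by role."""
    with tarfile.open(ova_path) as ova:
        names = ova.getnames()
    ovf = [n for n in names if n.lower().endswith(".ovf")]
    disks = [n for n in names if n.lower().endswith(".vmdk")]
    other = [n for n in names if n not in ovf and n not in disks]
    return {"ovf": ovf, "disks": disks, "other": other}


def read_ovf(ova_path):
    """Return the text of the first .ovf descriptor, or None if there is none."""
    with tarfile.open(ova_path) as ova:
        for member in ova.getmembers():
            if member.name.lower().endswith(".ovf"):
                return ova.extractfile(member).read().decode("utf-8", errors="replace")
    return None
```

The descriptor returned by `read_ovf` is XML, so the memory size, CPU count and disk references can be read off by eye or with any XML parser before you pick qemu options.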

I used the following qemu-img command:

    qemu-img convert -O qcow2 archiveteam-warrior-v3-20171013-disk001.vmdk archiveteam-warrior-v3-20171013-disk001.qcow2

I use the following to run the VM (I gave it some more memory because I have plenty to spare):

    qemu-system-x86_64 -m 1024 archiveteam-warrior-v3-20171013-disk001.qcow2

I think they were doing some kind of port forwarding, but I didn't bother, and I just access the web interface using the VM's IP (you can hit alt-right arrow to go to a login prompt and log in as root then run "ip a" to see the IP).


I know, I did that and it didn't boot. Couldn't be bothered further and I ain't installing docker on my system, it's incompatible with my setup.


It went pretty well for the first 10-20 or so groups, but now I get multiples of the really annoying captchas (click until none remain) per group... Damnit yahoo...


update: just enabling the vpn was enough to 'reset' captcha to the simple level, seems like yahoo does not take into account whether your IP is 'residential'.


I also noted that for yahoo changing IP, even changing continents, allowed me to use the same cookies as long as I kept my original browser window open.


Shoutout to https://github.com/dessant/buster by the way!

`Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.`


That's nice, but it doesn't scale. Google only lets you solve a few (5 or so) audio captchas in quick succession before you're banned for a while, so it's no good for us.


It's been working for me instead of clicking on all the little busses or crosswalks, even if it doesn't work at scale. Thought it might help some other users of the extension.


FYI: The extension offers many private groups that I can't join without approval and that seems to disrupt the flow of the extensions.


Yeah, sorry about that. The current (as of 2100 UTC) set of groups being sent out to be joined were ones submitted through our nomination form: https://tinyurl.com/savegroups

I did specify that groups requiring approval to join shouldn't be submitted, but not everyone took notice. (And then there were the several dozen Google Groups URLs that were submitted!)


It seems a weird set of groups. Like, lots of three-to-five person groups roleplaying doctor who, spiderman and things like that. Is this the long tail of what hasn't been archived or is there not even a good way to tell post/member count without loading up through the extensions?


From IRC (betamaxthetape):

It's a set of groups that have been specifically requested by the fandom community. Of course, the groups handed out depend on what's been joined, so if / once all the fandom groups are joined, we'll move onto something else.

I appreciate this isn't made clear in the instructions, but if you have a desired set of groups in mind, you don't need to use the chrome extension. Just join the groups you want saved and (provided you've sent the account details through the form) they'll be added to the queue to be archived. I did a lot of Amateur Radio (Ham Radio in US) groups that way.


Ah, that's good to know that I can browse and find things that I'm more interested in. The instructions weren't clear about the difference between extension/group access and archiving.


Yeah not volunteering for that mate.


See immediately above in the thread. Instructions were perhaps not clear.


Hah, this is fun! I've so far stumbled on a fantastic group with Sims 1 houses (pictures, and the actual lots), and a Dream Street fan-club, which of course prompted me to see who the hell they were.

I confess I'm doing this mostly to see what people posted on the internet at some point in time :)

Edit: All groups have around 1600 members... what causes this...


> Edit: All groups have around 1600 members... what causes this...

That's possibly the maximum cap?


Is there any cited reason for the groups they're blocking?


Verizon's response, and the response to the response, are in the article of the OP. They claim they offer a Group Downloads Manager, but it's very broken.


btw, maybe Mechanical Turk could help with the captcha part?


A couple of years ago I saw somebody giving a talk where they demonstrated a CAPTCHA-solving API, with people from India solving the CAPTCHAs for a few cents.


That's basically what the DeathByCaptcha server is.


Thanks - I just wanted to say such services exist or used to exist, didn't remember the name.


I feel like there must be some protection in place against using mTurk with captcha, or it would have already been abused.


Mturk's turnaround for this stuff can't be fast enough to work would be my guess. I know jobs I put up there for transcription, despite a generous bonus, were always delayed for at the very least hours.


You misunderstand. You keep a live page open and point jobs to the live page. No need to put a captcha image in the mturk job.

You can absolutely purchase captcha answers.


Just solved a bunch of captchas, but Chrome crashed a few times during. Due to the addon?


I've been using Edge (Chromium) for past few hours, no issues yet. Plugin could be unrelated to your crashing. May help to use a standalone Chromium build for this https://chromium.woolyss.com/


I checked on IRC. One person says they've been using it for hours on chromium without a problem. "I've been using Edge (Chromium) for past few hours, no issues. Could be unrelated, could be related. May help to use a standalone chromium build for this."


As an aside, is there any way to recover emails if I didn't sign into Yahoo for a year? I and a lot of others had up to 15 years of sentimental mail exchanged during that period :(


I don't see why not. Point Thunderbird at it or something and then just transfer the mails over to somewhere else if you want that - but this is not about mail. Rather it's about Yahoo Groups, whose archives are about to go away.


Forgive my naivety, but why would blocking of your accounts delete the data you have already backed up? This sounds like you are doing it the wrong WAY, IMO.


Two reasons: (a) If we hit Yahoo with everything we've got, Groups would have almost certainly crashed, or at least become unbearably slow. That's not a reasonable thing to do, and would be (IMHO) grounds for Verizon banning us.

(b) We were still testing / writing the scripts to do the actual archiving. Most of the groups we did save before the banning were from test runs of the archiving script.

And sure, given hindsight, I'd do things differently. We've learned, now, and are archiving a group soon after it is joined.


OK, thanks for explaining this. Just my 2 cents then: big companies make decisions like this based on the potential PR win/loss. If ignoring you keeps the PR delta at 0, while allowing you to export the data exposes them to even a minimal risk (I dunno, someone's private details buried in there), they will ignore, or even actively resist you.

Politically, you need to arrange it so that cooperating with you will give Verizon a small PR boost, while ignoring you will be seen negatively by the public. This thread had a good example of interesting data that is worth preserving, so I would try reaching out to news companies (NY Times and whatnot) to see if anyone wants to publish a piece. Phrasing this positively and ensuring enough people see it, would greatly increase the chances of cooperation from Verizon.


They hadn't backed up yet. They had set up accounts with yahoo that they were then planning to use to back up those groups. Backups themselves were starting, but they had to go slowly enough not to bog down yahoo's servers.


Have you posted this on Reddit anywhere? Possibly /technology?

You might even get the admins to make an announcement.


It’s been all over r/datahoarder lately, also saw a post on r/YouShouldKnow


Have you considered using NordVPN for CAPTCHA bypass? They are a shady company, but their network of residential VPNs is impressive.


There have to be some Verizon or Yahoo employees on HN who are reading this.

Can any of you shed some light on why Verizon and Yahoo aren't cooperating with the Archive Team to archive this valuable historical content?

(If you don't feel comfortable commenting with your regular HN account, maybe you could do so with a throwaway account?)

Also, is it possible for any of you to bring this issue to the attention of upper management and help them understand how important it is to archive this?

You Verizon/Yahoo employees have much more power to make a difference here than anyone of us from the outside can.


Probably not very helpful/informational but:

I work for VzM, but not historically directly on Yahoo products (product teams have been merged/consolidated etc. over the past few years, but there's still strong tendencies toward products people came from).

So I wouldn't be very clued into what's happening with Yahoo Groups internally. And I've heard nothing about this internally. At all.

As it stands, it's 2:30pm in SV, VzM is top of the HN frontpage, and not a single soul has mentioned it yet on internal Slack.

Will see if I can find out more.


Maybe you could be the one to raise it on the Slack channel and (even better) get some eyeballs with authority clued in on the matter.


It was someone quite high up in the company who was the first to raise it in Slack actually; though it's clear they were similarly not highly clued in to this before yesterday, and there are no substantive replies or info yet (just other colleagues with similar concerns).

I'm guessing this will blow up later this morning when people start waking for the work week.


Thanks. We also noticed that BoingBoing picked it up. Their graphics are a bit crude - but that's BoingBoing. We're trying to be more polite here: https://boingboing.net/2019/12/08/oath-makes-you-swear-2.htm...


Crude but quite applicable.


If VzM wants to contact someone at the archive team securely they can DM any of the @s on irc.efnet.org/#archiveteam or twitter DM myself (@JRWR) or Jason Scott (@textfiles)


Seems risky at this point, since they’ve already posted here. Probably better to wait quietly and report.


Why would it be risky? Surfacing an issue that is important to the public, where future/planned actions by the company could become a PR debacle sounds important.


We're commenting on an article that describes Verizon giving 0 fucks about what is "important to the public".


I hope this doesn't sound naive, but what does the M in VzM stand for?


Media. Verizon Media is the specific division of Verizon that contains Yahoo, AOL, and VDMS (formerly edgecast)


Not OP but I assume it means verizon media

https://en.wikipedia.org/wiki/Verizon_Media


Don't suppose you can find a way to unban our accounts...? ;)


That could really help. Thank you!


Pure speculation, but if you publish something created by another person without their explicit permission, it may open you up to a lawsuit. If some groups required explicit approval by a moderator in order to read the posts, I would take it that they didn't want the content to go public.

So technically, some legal troll could post some copyrighted information, wait for it to be published on Archive, and then sue Archive for copyright infringement and Verizon for assisting it. As a non-profit, Archive will likely get away with just taking it down, but a for-profit Verizon is a wholly different story.


Groups can be private or not. Require approval, or not. The archiving team isn't attempting to break into private groups and archive them. Only public groups are going to be collected.


Also, from one of the mails:

> The 128 people you banned were REQUESTED by the group owners to get their stuff.


I wrote a blog post explaining why products are shut down, using Yahoo and Verizon as illustrations. https://thehftguy.com/2019/12/10/why-products-are-shutdown-t...


I think everyone understands that corporations don't want to spend money and effort maintaining servers that don't generate revenue. No one is really surprised that they won't help with archive efforts.

The question is why they're spending real effort on blocking archivists. All they had to do was keep doing nothing for a few days. The cost to them might have been a couple hundred dollars' worth of bandwidth, at most, which I think archivists would have been happy to pay--they've done more before. (That's estimating based on small-scale commercial hosting prices; it might not even register on whatever enterprise uplink Yahoo/Verizon uses.)

Instead they've got at least one professional taking time away from productive work to fuck with archivists at no benefit to anyone. It's possible that the wage-hours spent on this actually exceed what the bandwidth costs would have been. It's astonishingly petty.


Two reasons I can think of right away. First, there can be a policy (and people) to detect abuse and shut off bot accounts; this can even be a separate entity from Yahoo Groups. Second, there can be internal metrics tracking active users and viewed pages, which they want to get down as low as possible before deletion. In both cases archive.org is ruining it for them.


<conspiracy tinfoil hat>

Is it possible that there may be some kind of political angle to all of this; that archiving this information for the future might allow someone to find out something that someone else doesn't want to come to light?

</conspiracy tinfoil hat>


how much storage do you think in total all of the Yahoo Groups content takes?


Over 4 petabytes 8 years ago


I'd love to have this as a torrent


you have 4,000 terabytes of storage space and bandwidth to torrent? (I'll be honest, I had to look up how many terabytes a petabyte is.)


It's actually not totally insane anymore. If you could afford a Tesla Roadster you can build yourself a 4PB storage solution. With some high-density top-loading storage servers (4U for 90 HDDs), 6TB HDDs and some SSDs thrown in for caching you can build that in 36U of rack space for less than $300k (not counting time needed to assemble and configure). So if that's your hobby, go ahead :D If one takes more than 5 minutes to research this I'm pretty sure that it's possible to push that number below $250k.
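
As a rough sanity check on those numbers, here is the back-of-the-envelope arithmetic in Python. It assumes decimal units (1 PB = 1000 TB), that all 36 rack units are filled with the 4-unit, 90-drive servers, and it counts raw capacity only, so whatever RAID or erasure-coding overhead you choose has to come out of the headroom. The variable names are just for illustration.

```python
TB_PER_PB = 1000                 # decimal units, as drive vendors count them
target_tb = 4 * TB_PER_PB        # the "over 4 petabytes" figure from upthread

drives_per_server = 90           # top-loading chassis from the comment above
tb_per_drive = 6
rack_units_per_server = 4
rack_units_total = 36
budget_usd = 300_000             # upper bound quoted above

servers = rack_units_total // rack_units_per_server
drives = servers * drives_per_server
raw_tb = drives * tb_per_drive
headroom_tb = raw_tb - target_tb           # left over for redundancy and growth
usd_per_raw_tb = budget_usd / raw_tb

print(f"{servers} servers, {drives} drives, {raw_tb} TB raw")
print(f"{headroom_tb} TB headroom, about ${usd_per_raw_tb:.2f} per raw TB")
```

That works out to 9 servers and 810 drives for 4860 TB raw, which leaves 860 TB of slack over the 4000 TB target before any redundancy is accounted for.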


Yes, a university should be able to make some room so researchers can work with it. It's a lot of data, but not impossible to do with a small investment.


Oh.


I'm genuinely curious from an ideological perspective, why archivists think all this material is worth saving?

People often compare the shutting down of sites or the banning of content (e.g. when Tumblr banned porn, or now Yahoo shutting down Groups) to the burning of the Library of Alexandria. But there is a huge difference. The LoA held knowledge collated and collected by the best thinkers of the time. The Internet is not that. The Internet is an open platform where anybody can say anything they like. Most comment sections are filled with all sorts of material ranging from factual to entirely fictional.

I realise it is hard to decide what is worth keeping (and therefore erring on the side of saving it all), but I'd wager that the vast majority of archived content is not useful at all. The Wayback machine is a perfect example. Lots of great stuff, but that's a drop in the bucket compared to the vast amounts of useless, or even redundant information stored.

It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.

Anybody want to share what drives them to do this?


Even if we still had the Library of Alexandria, it may have shed zero light on the actual lives of citizens. Archiving content on the internet means capturing thousands of individual level perspectives and experiences. We don't know what will end up being important to historians 50 or 100 years from now. I would bet there are dozens if not hundreds of historians that would give anything for a record of their favorite time period that contains even a fraction of the amount of content today's archive efforts are storing.

It's also not horrendously expensive - we are getting better and better at storage as well as data analysis techniques, so stuff that seems useless today may be useful 50 years from now and cost less to store than it does now. The key thing again being that we can't benefit from hindsight.

Even graffiti can give insight into a time period, even if that insight is that that time period had an unusually high number of graffiti artists.


Not to mention that historians of the future will be able to sort and characterize massive amounts of data and draw conclusions that couldn't be made without that data.

For a time period where data is more valuable than oil, where the wealthiest companies are trying to grab every piece of data they can, and on a site where this is frequently discussed and many work for said companies, I find the question "why do archivists want to archive data?" a little silly. Data might not be useful to us now, but might be to future historians (though this is similar to the argument made by the companies that do mass surveillance).


What about people who don't want stupid comments they made online when they were 14 permanently indexed and searchable for all of time by the Archive Team? Yes, they may have posted to Yahoo! Groups back in 1999 when they didn't know better, but now it's 2019 and you have people digging up decades-old dirt on people to try and destroy their reputations and careers.

Given that search engines have zero ethics when it comes to removing embarrassing (but not illegal) content, sometimes the loss of information is a small blessing for some.

Yes, it's their fault, but I also don't think it's fair that something a child said at 14 should haunt them their entire professional careers, either.


The stuff stored in the Yahoo groups is material from the beginning of the internet, when people explored what could be possible and how easy it was to connect globally. You have a valid point, but it's also one of these things in our generation that we have to live with. We explored and tried things. Only now do we look back and see what those explorations of our younger selves really are; sometimes funny, sometimes embarrassing. However, if you are cautious, you may be able to delete your stuff or at least make it anonymous by deleting the account in question. If not, you have to live with it. All of these people can now learn from it and can educate their kids in being careful with the internet. (Or at least this is what it should be)

The dogma that "everything posted to the internet will stay on the internet" may not be entirely true for this first generation, because now large parts are already gone. But I am certain that this will be very true for the current generation, because I really doubt that Facebook and others will ever freely delete large datasets of user content.


> Given that search engines have zero ethics when it comes to removing embarrassing (but not illegal) content,

Ethics are about codified sets of rules. Perhaps they're just following a set of rules that doesn't promote hiding things to make people feel better?


The archives are not easily indexable by search engines, they're posted as multi-GB gzip-compressed WARC files.
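
To give a rough idea of what's inside one of those files, here is a minimal sketch of walking the records in a gzip-compressed WARC using only the Python standard library. The function names `iter_warc_records` and `iter_warc_gz` are made up for illustration, and the parsing assumes the plain layout of version line, header lines, blank line, then `Content-Length` bytes of payload. Real archives may need a proper WARC library.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream.

    Hypothetical helper: assumes each record is a "WARC/x.y" version line,
    "Name: value" header lines, a blank line, then Content-Length bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def iter_warc_gz(path):
    """Open a .warc.gz file and yield its records (path is whatever you downloaded)."""
    with gzip.open(path, "rb") as stream:
        yield from iter_warc_records(stream)
```

Something like `for headers, body in iter_warc_gz("some-group.warc.gz"): print(headers.get("warc-target-uri"))` would list the captured URLs (the filename is a placeholder). The point is that you need a tool like this to get at the pages at all; a crawler won't see them.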


But someone could hypothetically convert the WARC files back to static HTML and host them on the clear web.


Hypothetically, yes; but right now all this stuff is available on the clearnet and searchable. So obviously any potential harm of the present situation is decreased. And unless your argument is that we should delete all fora on the web because someone may have said something embarrassing on them, I think you'd probably want to come down on the side of preservation.


I'm pretty sure Yahoo isn't doing this to protect people from their old posts.


IA are extremely responsive in delisting content on request.

Email info@archive.org


Withhold wide-scale, anonymous access for a few decades maybe? (Though presumably there is a middle ground that doesn't involve leaving _everything_ inaccessible for a few decades.)


For example: World War Two groups where many of the members have passed away by now. There could be first-hand accounts of history that has already been lost to time.


Could?

More like definitely.


YES! It's like preserving ecological diversity. It's a store for later learning. Verizon is working in cold hard capitalism, and you can bet your lunch that they did NOT use Google Groups to hold their shared wisdom/history, and they would never let it be lost.

But many don't have the pockets for better systems, and so their earned knowledge lived on Google Groups. And when you think of all the people and groups that might have had needs to store their history, and what tools they might have used, what do you expect the skew of Yahoo Groups was? Certainly no Fortune 500 companies, but rather nonprofit and grassroots and all sorts of domains that are already getting the short end of the stick in our world :)


Heh *Yahoo Groups, that is


Step 1: We only need to archive the genuinely good content.

Step 2: It will take a long time to look through all this content and determine which parts deserve keeping.

Step 3: We will inevitably leave out something that someone else thinks is worth keeping anyway.

Step 4: Let's just archive everything.


It's basically that. Yes, when saving everything we'll save a lot of trash and utter garbage, spam and all that shit... But the things we would risk losing if we didn't save everything are and will be so much more important. To save what is really important, you have to save everything.


And actually, spam is quite interesting to some people. It certainly gives a flavour of what early-2000s internet was like, and what happens when spam filters aren't good.


One man’s public toilet block graffiti wall is another’s Library of Alexandria. Let the historians and journalists decide what’s important and the archivists take their best crack at saving it.

I write a lot of historical content and often the most useful stuff I find—for example, old flyers or ads from the 1950s or 1960s—would have been considered trash by someone at the time.

So an archivist’s job isn’t to make a judgment. It’s to protect the data as they see fit.


Toilet wall graffiti and such, preserved in Pompeii, is an important archaeological resource for understanding the actual daily life of Romans.

So yes, there are real, hardcore scientific papers about ancients "shitposting" each other down to "your mom" jokes. Because it shows us how people really lived.


> I'm genuinely curious from an ideological perspective, why archivists think all this material is worth saving?

It's easier to just save it all and let gawd sort it out.

You never know what some future person might find interesting. For example, my father took lots and lots of pictures, but they're all set in the living room and kitchen. No pictures of the rest of the house. I'm sure the thought of photographing other rooms simply never occurred to him as being interesting.

For another example, many people are interested in where/when/why certain words first appeared, like the origin of "OK". Massive archives of text that are searchable would help with this.


It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.

Ask an antiquarian about the value of graffiti in the ruins of Pompeii and other archaeological sites sometime. The great historians of the day wrote about their contemporary culture, while the vandals and miscreants and lowlifes and commoners contributed to that culture. Having access to both sources gives us a much more complete picture.

You don't know what's worth saving at the time you save it.


Ha, ha! Well, there's some high quality material there too, but I take your point. In the right context, like "history from below," all kinds of material can be high quality!


To be clear, I wasn't comparing the fanfic authors and other Yahoo Groups contributors to vandals scribbling dicks all over Pompeii. Just saying that all other things being equal, future historians will prefer to have too much data to work with than too little.

By definition, we don't have the benefit of hindsight until it's too late.


See below. My main concern is early medical/biohacking groups that shared data, like medical tests, and engaged in extensive discussion/community-driven research. Such groups go back to at least the late 1990s.

A main concern of the Archive Group (again, below) is art that was uploaded there.

I'm sure those are not the only two classes of examples. See for example the bird watching group in Delhi that has been collecting data for decades. (In the link of the OP.)


Great question! I'll take an amateur swing at a decent answer:

People doing important work (esp important work that is underfunded) don't have time to write/record their own histories. But that history can be instructive, to learn what worked and what didn't, and help future travellers do it better :)

And perhaps especially important: ppl engaging in these under-resourced efforts are often working in domains that capitalism is... less curious about, we'll just say. Otherwise, their work would likely be better documented, as the incentive would be there to preserve it.

Our ability to improve our present from better understanding our past is a supposed benefit of a digital world that accrues data -- we have records of things that in prior ages just flew by in conversation (for better or for worse). But efforts like this rob us all of that wisdom <3

And again, there is an asymmetry in who gets robbed. It is often the folks working in the commons, those doing invisible maintenance labour (nonprofits, grassroots, community), and generally just people doing work within the cracks of capitalism.


> The LoA held knowledge collated and collected by the best thinkers of the time.

... that had access to writing services and were wealthy enough to have their thoughts stored.

There could have been many odd voices out there that would've told us an entire different story. But these are unknown because they didn't have access.

Now we are in the era of (almost) universal access to storing our thoughts, and we still don't listen to everyone, or we mark them as uninteresting and not worthy.


One never knows what may have value.

The graffiti on the toilet wall may well speak to the start of a trend, term, movement, or other event, for example.

Think longer timelines, broader scope than you personally may feel is relevant.

En masse, those questions have answers we individually are unlikely to fathom.


> It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.

We have that kind of graffiti from Pompeii. It's enormously more fascinating and insightful into regular people's lives than all the stuff about kings and battles people wrote about in the more official works.

When looking through all newspapers and magazines, the advertisements are often the most interesting bit. Especially since you can probably already read about the big events they wrote about on Wikipedia or history books.


Certain group contents are actually unique and valuable (see threads below), in which there could be certain similarities to the LoA.

But most importantly, Groups is a corpus representing many segments of society during a period (starting 2001, with a peak of over 100 million users in 2008). It's a snapshot that embodies concerns, beliefs, morals, language... across several realms. This is more than LoA even. It can be used profusely by researchers and historians to study society for years to come. Or by AI to learn how and who we are/were...


> The LoA held knowledge collated and collected by the best thinkers of the time. The Internet is not that.

I am in awe of your flair for understatement.


We don't think it's necessary to preserve everything that's ever spoken verbally. We don't lament that everyday conversation is ephemeral.

People are conflating internet discussion content with written content because it's stored as text. Whereas the more legitimate comparison is to verbal communication.


> We don't lament that everyday conversation is ephemeral.

I imagine you're not a historian. Neither am I, but I cannot imagine that there is a historian out there who hasn't lamented the ephemerality of everyday conversation (and even of apparently more durable forms of communication).


Historians would love to preserve spoken communication and there are many projects recording everyday conversations. There are even projects recording the typical sounds of the environment in certain areas at a certain time. However, many forms of spoken everyday conversation fall under restrictive privacy laws, which places strict limits on such preservation efforts.

The texts on the internet at a given time, on the other hand, are public and reflect the opinions and ways of living of a large number of people at that time. There is no doubt that these could be analysed in the future to give us historical insights in ways we cannot even conceive yet. (Think e.g. about getting them data mined and analysed by advanced A.I. to give new insights into the time period.)

The worth of the data is so obvious that it's really hard for me to understand why you and some other people don't think these are interesting data points for research, say 200, 500, or even 10000 years from now, on how we lived. The data is not only interesting to historians, but also to economists, political scientists, and linguists, btw.


>We don't lament that everyday conversation is ephemeral.

Linguists definitely do.


When I was at AOL I tried to get them to open source the q link server code from the 1980s. Someone actually got it on DVD for me and everything, but after the Verizon merger they fired the entire legal team that was responsible for authorizing open source releases, and it just stalled.


Open sourcing code can be tricky—there's quite a bit of review that needs to go into doing it right, as well as more work if you want the release to actually be reasonably useful. Blocking this archiving effort is on a whole other level. We're talking about saving information that was already public. All they have to do to allow this to happen is... nothing. I can't comprehend why Verizon/Yahoo would go out of their way to block these efforts.


It depends on the size of the codebase and how shitty your programmers are, but if you aren't greedy or scared of over-litigation, it isn't hard at all.

I have written great contributions to a python API library that could be of benefit to the community around it. The code has nothing to do with my company's core competency, and the code is used for internal orchestration, so "exposing insecure code" is an unlikely concern.

It is easier for a lawyer, especially a luddite, to say "no" than to help their employees give back to the world.


For new code it is indeed "simple". Old code, however, likely contains third-party code, be it from libraries or provided by contractors, where no (clear) license permitting relicensing of the source is available. This can be quite complex historical work, as version history might not exist (which code came from where?), documentation is limited (paper contracts lost in archives), and so on.


First, to Hell with whoever downvoted me, probably lawyers (not you, johannes). Second, I get there are occasionally complicating factors, BUT - licensing can't be difficult at most times, since the company often owns what the worker produces - for better or worse, it's simple that way. As for third party work, are you talking about library imports, or copy and paste? The logistics of solving those problems are either simple or really complex.


Yes, they own what employed workers produce. But especially before there was such a number of freely available open source licenses, software vendors licensed tons of stuff, often in source form, often without permission to relicense the source, and over time developers refactored the licensed code, which makes it hard to trace code back. Especially since version control was often done by having different sets of floppies, which are all gone.


>open source the q link server code

what a lovely thought. Thanks for the effort, even tho it didn't pan out. if you've got the dvd, torrent it out :)

now I'm wondering if there's a stratus emulator anywhere and/or the os code. Them things were nasty... individually battery backed hard drives were just the beginning. The slot cards looked like someone had dumped yellow patchwire spaghetti all over them.


Nah I don’t have the dvd and gave up trying to get it released because it wasn’t my job.


If you ever bump into this person again please consider suggesting this. If they don't feel comfortable releasing it to the public directly, there should be contacts at archive.org that would help releasing it anonymously.


It's like the burning of the Library of Alexandria all over again.

We don't know exactly what was in the library when it burned. We assume it was all great works of intellectualism, but it could very well have been the fanfics of their time.


Except that the Library of Alexandria never actually burnt! That is a very good ol' myth ;)

- https://www.firstthings.com/web-exclusives/2010/06/the-perni...

- https://www.ancientworldmagazine.com/articles/making-myth-li...

- https://history.stackexchange.com/questions/677/what-knowled...

But anyway, no one should delete human literature, be it inadvertently or by lack of effort.


From Wikipedia: "Scholars have interpreted Cassius Dio's wording to indicate that the fire did not actually destroy the entire Library itself, but rather only a warehouse located near the docks being used by the Library to house scrolls"

If anything this would make the analogy even more apt, since only part of Yahoo is being destroyed. :)

Regardless, it's mostly used as a metaphor for the destruction of knowledge at this point.


Too often historical events turn out to be perfectly true, but claimed to be myths due to dizzying semantic distinctions.

Just looking at the third link, the most upvoted answer agrees that humanity suffered a significant loss of important information. And the 'myth' is just an asinine distinction regarding whether the loss was literally due to fire, or whether the information was lost due to some other cause. I think declaring it a myth in a conversation like this misses the point (it certainly isn't a distinction relevant to the original comparison made here to Yahoo Groups) and just serves to confuse people.


It's quite clear the library is no longer here. How exactly it was lost does matter as its destruction has been used to paint various groups as anti intellectual barbarians since ancient times. Eliminating the story as a weapon to attack others would do humanity some good.


It has been used that way, but not here. Here, it's a disorienting non-sequitur that makes it sound like the information was never really lost.


These articles seem more concerned with detailing how important it is that it wasn't Christians. Makes sense for an organization centered around "religion and public life", I guess. Quite the angle.


It's quite important that it wasn't Christians. A large part of the public understanding of history is based on a belief that progress through the early Middle Ages was held back primarily by Christian repression of free thought. There are people who very seriously believe that we'd be flying between stars by now if Christianity had never become predominant.

You don't have to be a Christian apologist to think that it's important for people to understand history correctly.


Do people generally think it was Christians? Without looking it up, I would have said "barbarians", which may not rule out Christians but doesn't specify them either.


I think the majority of people have never thought about it one way or the other (and would probably think similarly to you), but there is a substantial group of people who do. While it's by no means predominant, you come across the idea with fair regularity on atheist discussion boards.


Whoa. I guess what they say is true - say a lie often enough, and it becomes the truth.


Wait, the library wasn't lost due to that fire, but the contents were slowly lost due to the passage of time and people not caring or having access to copy its contents? That makes the analogy way better, but the "burning" part is sadly wrong.


Yes, that is exactly what I wanted to convey by "lack of effort".

2000 years ago, as a civilization, even if we failed to care enough for the Works stored in the Library, their loss would not have happened if access was not limited, which would have helped in their dissemination and issuing of copies.

Today, as a civilization, if we fail to implement the right process to back up in time what matters to us, we will repeat the same errors as our ancestors.

I guess many historians today would prefer to see those non-existent backups of the Alexandria Library rather than those of Yahoo Groups, but who knows what is more important after all ;)


The main difference is that back then, "backup" meant copying everything by hand, and now it means one simple copy-paste. Considering the size and price of modern hard drives, and the relatively small size of old archives, any one individual can back up a huge amount of data (and even offer/share it as a download link/torrent seed/etc).

Their whole Library would probably fit on even the smallest SD card available now.


Yahoo Answers is an invaluable trove of insight into an intellectual class of people that I think a lot of us regularly forget exist.



https://i.redd.it/yv9k5nes87rz.jpg

Not sure why this one kills me so much...


It is an absolute trove of insight:

https://www.youtube.com/watch?v=EShUeudtaFg


Before clicking that link I guessed at what it would be, and I was wrong but not far (I was expecting https://www.youtube.com/watch?v=Ll-lia-FEIY )


I think one of the unintended consequences of privacy legislation is that it will support the burning of the Library of Alexandria over and over again.

The default corporate posture will be: Delete all the data! It's a liability and figuring out what we can keep is an enormous headache.


Well, to some degree it is a liability. It just took this long and some accidents for them to finally figure it out.

That attitude will create a problem - a.k.a. opportunity - for others to come in and solve. Google got rich by scraping the internet and solving the headache of how to find decent content. If there's value in some of this data headed to the dump, it gives a chance for someone to do the same. Who knows, they might even find a way to do it in a privacy-respecting manner.


This particular consequence was so clearly predictable that it's hard to call it unintentional. It's a hard-to-avoid trade-off.

Jury's out on whether it was the right one.


Note that "deleting" is problematic in itself, especially for "cloud" data, considering how storage works, especially transistor-based storage like SSDs!


Mark it inaccessible and write the new data over it in the next write cycle.


That doesn't work on SSDs, and the data might even be theoretically recoverable on HDDs: https://security.stackexchange.com/questions/12503/can-wiped...

> Therefore, you should assume there is no reliable way to securely erase individual files on a SSD; you need to sanitize the whole drive, as an entire unit.

There's a reason why when security is deemed important, the storage is physically destroyed instead.


I meant for the sake of data protection, not for forensics. You start with all ones and gradually deplete your ability to write ones over time in electron charge memories such as SSDs.


This is not about companies following best practices but about what is going to happen when some of the supposedly deleted data pops up again, as it eventually will.

Will a judge that is clueless about how computers really work consider that as a GDPR violation or not? As deliberate or not?


If you disable wear levelling, you could force it to end in a final all-zeros state.


Other collective projects to try to archive Yahoo groups:

Queer Digital History Project: https://queerdigital.com/ygpresproject

Project to Archive Trans Yahoo Groups: https://archivetransyahoo.noblogs.org/list-of-known-trans-gr...

Project to Archive South Asian American yahoo groups: https://yahoogroups.southasianamerican.org/

I've got to guess that there are more.


there are a few groups i was a member of like lifters https://groups.yahoo.com/neo/groups/Lifters/info which was an intensive technical development group in the field of propellerless, rocketless, jetless flight using only electronic high voltage.

also some of the politics groups were a great time capsule for around the clinton/bush election era

a lot of earthquake researchers gathered on several earthquake groups as well, including caltech seismologists and advanced amateurs, many of whom aren't around anymore.

also some of the info in these groups can be used to defeat patent applications as they show evidence of public prior concepts and art.

yahoogroups consisted of somewhat more technically advanced users than modern website users like reddit etc because they were earlier and somewhat harder to use.

its a lot of good quality content.

also in the early days on these groups spam and massive controlled astroturfing account groups were pretty rare.

this is like losing 15 years of ancient Sumerian writings in a very interesting early time for the Internet.


This is a wake-up call to the entire world: we cannot take internet history for granted. We need affordable, decentralized means with long-term economic incentives to archive the digital world.

In a way, the digital world is far more fragile than the physical world. And the time to solve this is now.


Tragedy of The Cloud.

IIRC, Archive.org is still running its fundraiser today.

We need LOTS of publicly-sponsored and paid-for digital archival centers that, like libraries, are maintained for the common welfare. Or we could, you know, add that duty (and funding) to existing libraries! With -paid- archivists!


Yeah, aren't there archiving obligations like that for at least books and movies?


What prevents Verizon from donating the Yahoo Groups database to the Internet Archive? What does Verizon have to gain from preventing the archival of Yahoo Groups?


Companies don't typically operate that way. All else being equal (especially when there's no $$$ in it for them) when given the choice between doing something and doing nothing, they usually choose to do nothing. It's often not malicious, but an overabundance of caution. (i.e. lawyers raising red flags about liability, 'our IP' etc... it's a real pain even from the inside getting large companies to do anything different from the status quo)

My bet would be that Verizon's network monitoring system/team sees the archive team's attempts as some sort of anomaly to be stopped. It's possible, though I wouldn't bet on it given Verizon's history re: public relations, that making noise might alter the equation and get them to allow the archive team to continue.


It is kind of incredible that they are expecting to be protected by IP laws, and yet aren't willing to put the slightest effort to archive the content that they are taking down...


Maybe those who care (we?) could organize a campaign to get customers to commit to leaving Verizon if they let the messages be deleted without archive? That would convert it into the language they understand.

To raise the perceived threat level, many folks could help build tooling or docs to help ppl migrate as easily and smoothly as possible, to minimize the tax on consumer time that they rely on. (E.g., help on comparable plans, cheat sheet for call centre keywords, etc.)

Maybe something team "Do Not Pay" could help run with...! [1]

[1]: https://boingboing.net/2019/10/28/parking-tickets-plus-plus....


There is a campaign already. https://modsandmembersblog.wordpress.com/


Oh God, I'm that guy. I'd been following this elsewhere, so didn't actually expect I'd get new info from the link itself :/ [opens mouth, inserts foot]


It's simply way too much work. Dying projects generating no revenues don't get the luxury of having tens of people assigned to work on them.


How is it too much work to tar up that shit, put it on a big ass drive or two, and ship it to them? I can't imagine it's that hard.


`rm -rf /` is objectively free from Verizon's perspective.

Paying lawyers to examine the fine details and determine what liability may arise from publishing a database dump or the software that can view the dump's contents is not free.


Probably a few minutes, and then 4 million lawyer hours for review.


You mean, "tar up" multiple databases across possibly multiple data centers + all related files uploaded to those groups (also possibly spread across multiple datacenters) while preserving full integrity and making sure that there's accompanying documentation on how to set all this up and run?

You tell me how much work it would be.

Compare that to pulling the plug and getting servers over to a landfill.


I can imagine it's easier and safer (from a legal perspective) to just delete the data and therefore no longer be responsible for the content. Twitter wants to delete older Twitter accounts because they're required to by law under the GDPR.

I mean, the GDPR makes things kind of difficult in this regard, and I suspect even archives are liable if somebody takes an issue with content they are hosting.


This seems relatively cheap to fix. Spin off Yahoo Groups as a new corporation, and have that corporation subsequently donate all its assets. If the corporation somehow manages to get sued, it doesn't really matter, since it has no assets.

Or spin it off and sell it.


No non-privately owned company would ever willingly put itself through the legal and tax requirements for spinning off a new company with part of its assets just to do the right, non-profitable thing, with those assets.

Also, in my opinion, no privately owned company either, unless the owner was soon dying of something and wanted to get in good with their creator.


I’d assume the law is smarter than this, because companies would otherwise continually spin off new corporations to get rid of their liabilities with no assets as a sort of lightning rod for lawsuits.


This is, iiuc, how the movie and construction industries work. Spin up a minicorp for every big risky project to shield the mother ship.


When you create the SPV in advance, it's very clear what part of the work done by the organization attaches to it (because the organization ensures that all its processes explicitly specify the legal compartment they're running under.)

When you create an SPV after-the-fact, you have to go back and reverse-engineer a separation of liabilities from documents that don't specify whether they're work done for the organization or the SPV (because the SPV didn't exist.)

It's like a divorce. (Or, for an even more on-the-nose analogy, it's like trying to use a condom after-the-fact by extracting any bodily contamination and putting it in the condom.)


They do it regularly. Lead in gasoline (ethyl corp), asbestos.


If Yahoo Groups has a GDPR obligation now (and it's not clear that they do), they don't erase that obligation by spinning up a different company and dumping all this personal data into that new company - that would be its own GDPR breach.


That doesn’t sound correct, given the GDPR doesn’t generally apply to archival products.


Why not? According to GDPR someone can show up and request (1) fixing personal data (PII) like a nickname - this is the data accuracy requirement; in fact, according to GDPR Yahoo should do the data accuracy check (for instance send a reminder to the user to check their data). (2) Someone can file a data portability request, and Yahoo needs to provide this. (3) Someone can request data removal. (4) Yahoo has to manage user consent for anything they do with that data.

For a product that does not bring in any revenue or significant revenue, it is better to dump everything and simply not be associated with the data any longer.

That's the side effect of GDPR: it is hard from the technical and financial perspective to maintain anything free on the Internet that keeps users' data.


GDPR has an actual archive exception to the "right to be forgotten", art. 17, §3d [0]. IANAL, so I don't want to say if it covers this archival, but I would hope so.

0: https://gdpr-info.eu/art-17-gdpr/


Anything being archived by archive.org is pretty clearly being done in the public interest. If it was something like Equifax archiving the data to use as a factor in people's credit scores then it would be much more ambiguous.


> Twitter wants to delete older Twitter accounts because they're required to by law under the GDPR.

So, by analogy, if Twitter did allow people to download an archive of any public Twitter account's history... what would the GDPR require them to do? Wrap those archives in some sort of auto-expiring DRM?


One of Verizon's spokespeople was literally Darth Vader. "Ma Bell has you by the calls".

Large corporations are not anthropomorphic entities, regardless of their disarming branding. Rather they are amoral bureaucracies, likely administered by people who have learned to ignore their empathy to get there. Verizon won't change course to accommodate the Internet Archive or general Internet community any more than a combine would pause for a field mouse.


We have examples of content that was destroyed because it was deemed trivial at the time, one example being the BBC's policy of erasing its television shows so the tape could be used for new shows. The policy began with the idea that a television broadcast was a temporary communication like radio, and really, what possible reason could there be for people in the future to want to watch things like comedy shows, Dr Who, or news programs from the 60s, or the BBC's coverage of the Apollo moon landing? Surely the value of these cultural artifacts was not as great as the cost of video tape? https://en.wikipedia.org/wiki/Wiping#BBC


The "dark side" of web scrapers has always been one step ahead with things like IP bans and CAPTCHA solvers, maybe it's time to get their assistance... as the old saying goes, "an enemy of an enemy is a friend".


Who are the dark side of web scrapers?


People who personally have 100,000 Yahoo accounts because they made them back when you could just pretend to be blind and request the captcha in spoken form, and then fed it into Google's speech to text engine, fed it back in to Yahoo, made the accounts, and who also have a botnet of a million residential IPs and can spin up a bunch of servers to run some scrapers.


This feels like an alt-take on That Scene in The Dark Knight. In a good way? :)


So spammers who had yahoo account mailers


The shady SEO people (including the social media account farmers) and the spammers, who seem to always find a way around everything that's put in place against them.


Call For Action

https://modsandmembersblog.wordpress.com/taking-action/

Don't miss the sidebar with these links:

https://modsandmembersblog.wordpress.com/media-contacts/

https://modsandmembersblog.wordpress.com/contacting-verizon-...

https://modsandmembersblog.wordpress.com/contacting-verizon-...

Also, you can add these emails to the media contacts:

  "Reporter Katyanna Quach" <kquach@theregister.co.uk>,
  "Managing editor Gavin Clarke" <gavin.clarke@theregister.co.uk>,
  "Corey Wilson & Rachel Janc; Senior Director, Communications" <press@Wired.Com>,
  "Pitches" <submit@wired.com>,
  "Rich Woods" <rich.woods@neowin.net>,
  "Paul Thurrott" <paul@thurrott.com>,
  "Brad Sams" <brad@petri.com>,
  "Kate Rayford, Media Inquiries" <katie.rayford@slate.com>,
  "Bryan Lowder (LGBTQ issues/culture)" <bryan.lowder@slate.com>,
  "Torie Bosch (emerging technology effects on public policy and society)" <torie.bosch@slate.com>,
  "Jonathan Fischer (big tech, cities, media/internet culture)" <jonathan.fischer@slate.com>,
  "Susan Matthews, Health & Science" <susan.matthews@slate.com>,
  "Erika Allen, Executive Managing Editor" <erika.allen@vice.com>,
  "Katie Drummond, SVP, Global Content" <katie.drummond@vice.com>,
  "Press, US" <press@vice.com>,
  "Press, Canada" <presscanada@vice.com>,
  "Press, UK" <ukpressoffice@vice.com>,
  "Pitches, Culture" <culture.pitches@vice.com>,
  "Pitches, Tech" <tech.pitches@vice.com>,
  "Issues" <issues.pitches@vice.com>


Please be aware that historically spamming media representatives has the opposite of the intended effect. A few emails to make it clear that it's coming from a group instead of just one individual can help, but at the point where it becomes saturating inbox noise it tends to get ignored.

It's not like interacting with political representatives or corporate PR/executive types where you're conveying the size of the interested party, in this case newsworthiness doesn't necessarily depend on how many people are sending the email.


Point taken.

There's also stuff there about contacting Verizon and contacting the shareholders of Verizon. For them, I think we need volume.


In the early 2000s there were two main ecosystems in mobile software, J2ME and BREW (not counting Symbian), the latter operated by Verizon. I had cofounded a QA consulting company that relied heavily on BREW’s highly extensive developer portal. Then one day, without warning, the developer portal disappeared. Luckily I had the foresight to download all the documentation a week before. My cofounder, a Microsoft developer, was dumbfounded.


Yes, this was incredibly sudden, and with no support for getting out. They gave 13 days notice of intention to shut down new additions to message archives (extended to 20 days after some commotion). That was October 21, I believe. They have offered a broken group downloader that produces incomplete results. Desperate group owners have been using a Windows piece of software called PGDownload, but Verizon has blocked that. Now the only organized effort is being actively interfered with. Dumbfounding is indeed the word.


There must be something I am missing somewhere.

1) I have been a member of a group for many years (Gann study group). Last Friday I received a notification from the owner explaining that the group was closing, so he had set up a new one somewhere else. I thought it would be nice if I made a backup. So I found a python script on github (there are dozens of scripts in various languages there which can be used to back up a yahoo group). It took me a couple of minutes to get it working, and then a while later, voila! I had it nicely packed on my hard drive. So why is it so hard to back up a group? I don't understand the problem.

2) "A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted."

What? Well OK why not.. But? They are a phone company. There must be someone able to scrape all this data? I don't get it? There are so many ways to extract data from yahoo group.


Most people running these groups are not technical. Even if they got the word in time, the only option many of them could find was PGOnline, a Win pay software, which by this point Yahoo has blocked. Furthermore, even if they got it, what do they do with it? For many groups, the archives are a resource to be referred to. They need to be hosted somewhere, preferably with some kind of front-end search engine. Even better if the search engine integrates with any new posts on the forum they move to.

The Archive Team has been taking requests for backups of groups for people who don't have the technical facility to run the python scripts. They then intend to make them available on the internet archive. The next project is making some kind of front end, in case group owners want to host that somewhere. Some of us, for example, will be doing that behind some kind of a forum login, so it won't be search engine indexed.

As for your point 2, that was cut/pasted from the link in the OP, where it's describing that many groups are still using the platform. More relevant to this project, is that many groups are losing their archives, and those archives contain anything from scientific data, to hobbyist & howto information, to art and literature, etc.


The current administration put Verizon’s chief counsel into the position of FCC Chairman. I would not expect Verizon to answer to anyone.

Also, it is a shame that the person in direct contact with Yahoo over this is sending angry emails in all caps. The Internet Archive deserves better.


I agree on the first point. The second is perhaps understandable if you read the whole exchange. You know they initially gave us 13 days before they cut off storing any more of the group emails (that is, new emails)? With an outcry, they increased that to 20. Many thousands of people were scrambling to find a new home. We are now reaching the end of the line (the last week) before the archives themselves are gone, and they have blocked the main concerted attempt to save some of that history. So, some level of frustration is in order.


A lack of emotional control is usually understandable. But it suggests a lack of care and focus that does not befit an important effort. I learned years ago to never send an email or text or to make a call when angry. I always thank myself the next day when I am able to choose my words more tactfully. That email makes them look like a group of angry trolls.


I'm wholeheartedly supporting the archival effort but was wondering exactly the same thing about the person communicating with Verizon. Her argumentation comes off as quite immature, and she's not making much sense with all that rambling.

Saying stuff such as this sounds pretentious and will unfortunately only get laughed at by anyone in the corporate world: "So the best thing Verizon could do, since they are just going to throw us all into the trash anyway, as we aren’t important to them, is let us get our archives any way we can.

The terms of service really should not apply to people who have been told, we’re gonna delete you from existence. If it’s lawful for us to get them from you, in broken buggy and virus ridden state, it’s just as lawful for us to get them ourselves."

As it is right now, she's just not doing any favors to the archivist community out there. Perhaps someone with proper communication skills and better nerves should take up that role? This is not a time to play a martyr and throw a fit while expecting Verizon to meet you half-way.


Don't use free corporate services for shit you care about. Or think you may care about later.

Don't use any service that suffers from a single point of control.

How much anguish when Facebook inevitably either goes away or pivots entirely?

Or HN, for that matter?


I don't believe that point is necessarily up for debate. At this point we are just trying to save the data that we know will be lost.


It might not be up for debate, but it's a good reminder.

Maybe Hacker News should be mirrored on Usenet...


Things like this are a good answer to when people question why internet centralization and walled gardens matter. If these things were hosted across thousands of servers, federated, or under a license that made them able to be copied, there would be no issue. This is only an issue in the first place because people posted content in a place and manner that made them give up ownership to it. One day, perhaps decades from now, Facebook is going to face the same problem. Twitter would, too, if it wasn't being archived by the Library of Congress.


Verizon claimed that the archivists violated the "terms of service" [1], but I couldn't find any reference to automation, downloading, crawling, or denial of service attacks that might apply.

Does anyone have an idea of exactly what term or terms were violated by the archivists?

[1] https://www.verizonmedia.com/policies/us/en/verizonmedia/ter...


Just playing devil's advocate here. The way archivists are downloading the data can be said to disrupt the services, which is mentioned in the terms of service:

2. d. viii: "interfere with or disrupt the Services or servers, systems or networks connected to the Services in any way."

I'd also like to point out that the apparent spokesperson Brenda Fowler said in her open letter to Verizon, that "If the problem is that all our attempts to rescue our archives in the time we have left is causing an overload or strain on your servers, then stop making us HAVE to work around the clock, and GIVE US MORE TIME. ..." Probably not the wisest thing to say right now.

Also, archiving the groups with automated tools is against the Use of Services rule, that states the following:

2. e: "Use of Services. You must follow any guidelines or policies associated with the Services. You must not misuse or interfere with the Services or try to access them using a method other than the interface and the instructions that we provide. ..."

As I mentioned in another comment, I really support the cause and am a big fan of archiving myself, but it's unfortunately quite clear that Verizon is right in calling out the violations of the "terms of service".


Using the interface wouldn't block scrapers, yes? They do use the interface. But, this is academic I think. They offer a broken way to get our stuff, and say that we can't do anything else. Should we acquiesce to this?

As for bogging down the servers, my understanding was different from what the author said. They hadn't started to archive, but were in script testing mode and were accumulating yahoo accounts. What I saw of their activities, they were very careful about not overloading the servers. (I know that because I was backing up my own groups independently at the time, and I was able to do it. Luckily.)


AFAIK they hadn't started doing mass-archiving either. They were still setting up.


Correct. They had done some testing, but that's all. They were just getting yahoo id's, while iterating on software improvements, so they could then download the groups.


I had just recently been reading about Arweave [0], a sort of distributed file storage that claims to permanently store files/webpages using various incentives.

Seems like something like this would be a good way to archive this sort of information or build sites like Yahoo groups on top of this file storage in the first place.

[0] https://www.arweave.org/


Arweave is doing great stuff but I think it'd still run into a similar situation as archive.org -- Check out the arweave discord dev community if you haven't though!


Storage isn't really the problem in this case, collecting the data is the problem because yahoo/verizon are actively hostile.


Such a pity we lost gmane.org.

Lots of knowledge gets lost these days.


Just thinking out loud: This makes me wonder if we can learn from this and prepare to back up other (similar) platforms that hold such an amount of data and might go away some day. Building the backup tools today and ideally starting to back up now, making the process incremental so you can run it every now and then and only scrape the new stuff.
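
To make the incremental idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `fetch_since` stands in for whatever scraper or API client a given platform would need, messages are assumed to carry a monotonically increasing integer `id`, and the JSON Lines archive plus a small state file is just one possible layout.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def incremental_backup(
    fetch_since: Callable[[int], Iterable[dict]],
    archive_path: Path,
    state_path: Path,
) -> int:
    """Append messages newer than the last saved ID; return how many were added."""
    # Resume from the highest message ID recorded by the previous run, if any.
    last_id = 0
    if state_path.exists():
        last_id = json.loads(state_path.read_text())["last_id"]

    added = 0
    # Append mode keeps earlier runs intact; each message is one JSON line.
    with archive_path.open("a", encoding="utf-8") as archive:
        for message in fetch_since(last_id):
            archive.write(json.dumps(message) + "\n")
            last_id = max(last_id, message["id"])
            added += 1

    # Record progress so the next run only asks for newer messages.
    state_path.write_text(json.dumps({"last_id": last_id}))
    return added


def make_fake_source(messages: list) -> Callable[[int], Iterable[dict]]:
    """Hypothetical in-memory stand-in for a real platform scraper."""
    def fetch_since(last_id: int) -> Iterable[dict]:
        return [m for m in messages if m["id"] > last_id]
    return fetch_since
```

Running `incremental_backup` repeatedly against the same source only appends what arrived since the previous run, which is what keeps a periodic job cheap. A real scraper would also need rate limiting and error handling, which this sketch leaves out.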


Modern Web3.0 portals that are built on async JS will be impossible to archive without hitting API limits or resource quotas.

New Reddit (without the old.reddit.com interface), for example. Many niche subreddits contain lots of information that would be lost if reddit dies (or just deletes these subreddits).

Youtube is unarchivable in principle due to the high amount of storage required (even thinking of 640x480), and yet it still contains tons of unique content found nowhere else, from rare AMVs (that survived prior deletions) to instructions to repair telescopes - or basically anything in video form that doesn't have backups (i.e. not uploaded to other video sites).

4chan and similar sites are archived by several sites in a haphazard manner (only the boards they like), and yet it is a huge chunk of internet culture that is going to be lost if these sites die (and that's more probable than for Reddit due to less funding). Usenet is slowly fading into obscurity and dependence on Google Groups. Many forums that exist today will not exist forever, yet very few are archived anywhere else. Other forum-like sites like Stackoverflow and Quora might disappear in the future with nothing replacing them. Github is subject to Microsoft's whims and positions on open-source. Wikipedia and various wiki farm sites don't have much in the way of revenue streams. Practically every major website we take for granted is vulnerable - people thought Yahoo Groups was going to last forever.


And these are the dangers of relying on a private, corporate, for-profit, law-bound organization. They're bound to abide by the laws and, of course, there is a cost attached to all of this.

Exploiting a free resource, as we all do these days (reddit, youtube, facebook, hackernews itself etc) is all well and good but maintaining history is expensive (content needs moderating, you are required to abide by the GDPR and DMCA, there may be disputes about content on the platform).

I mean, Google+, MySpace, Bebo, and IMDB comments are now dead and gone; how useful was the data really? I'm sure some people might go to archives, but I would imagine 95% of the data is just "rot" that has no value or substance.

History is lost all the time; we barely know what we've been up to for the last few thousand years. Only now can we document our world so extensively, with the precision and quality afforded to us.

But in the end, time moves on and some of that history is lost. It hurts, but who's to say any archived history will be preserved anyhow? We're still relying on our storage technology being readable years/decades/centuries from now, which is not a given.


Maintaining a static archive is remarkably inexpensive. The total amount of textual data included in even Google+ was likely only a few hundred GB. Images and multimedia, of course, would have been far more, though sampling-based estimates suggest that these were a few hundred KB each, on average, on about 30% of all posts.

The mean post size on G+ was roughly the same as on Twitter: about 120 characters. (Quite possibly because most G+ posts were themselves repurposed Twitter content.)
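
As a back-of-envelope check on those figures, here is a rough sketch in Python. The 120 characters per post, the roughly 30% of posts with media, and "a few hundred KB" per item come from the estimates above; the post count passed in below is purely a hypothetical placeholder (not a measured figure for G+), and one byte per character is an assumption that undercounts non-ASCII text.

```python
def estimate_archive_bytes(
    num_posts: int,
    mean_chars: int = 120,            # mean post length quoted above
    media_fraction: float = 0.30,     # share of posts carrying media, per above
    mean_media_bytes: int = 300_000,  # "a few hundred KB" taken as 300 KB (assumption)
) -> tuple:
    """Return (text_bytes, media_bytes) for a static archive, assuming 1 byte per character."""
    text_bytes = num_posts * mean_chars
    media_bytes = round(num_posts * media_fraction) * mean_media_bytes
    return text_bytes, media_bytes


def human(n: float) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Hypothetical post count, chosen only to illustrate the arithmetic.
text, media = estimate_archive_bytes(1_000_000_000)
print("text: ", human(text))
print("media:", human(media))
```

Under these assumptions the text comes out orders of magnitude smaller than the media, which is the point being made: the authored text alone is cheap to keep.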

Static content does not require ongoing moderation, though it's possible that problematic content will be periodically identified.

The bigger challenge is actually in the publishing engines. Even where these are static, it's possible that vulnerabilities will be identified. That was Google's (not especially convincing) excuse.

A challenge of the Internet Archive / Archive Team method of archival and access is that in preserving the original formatting and packaging of content, the bandwidth and storage requirements are increased tremendously. By about two orders of magnitude in the case of G+.

Were the Archive to focus on the actual originally-authored content rather than all the associated chrome, both factors would be tremendously reduced.


While I agree with your first point, and tried to get groups I was associated with to move for years, nevertheless there are groups there that engaged in community driven research and have important data uploaded there. (This is my main concern, though other groups were focused on different issues - uploaded art, for example.) So I think while we need to educate people about not using centralized providers like Yahoo and Google, right now we need to focus on getting someone at Verizon/Yahoo to respond to this urgent situation.



I totally agree. Google? FB? Twitter? How about the Friendiverse? :)


We cannot expect a private company to continue paying for resources they don't want to.

But giving an "export all the data in xml/json/whatever" button, and maybe even open-sourcing the now-abandoned component serving this data, would be a nice move. The first part could even become a regulatory requirement some day.


> maintaining history is expensive (content needs moderating, you are required to abide by the GDPR and DMCA, there may be disputes about content on the platform).

Things shouldn't be like this. The price per unit of storage and bandwidth falls fast (and, except for the sites dealing with user-generated videos, faster than the amount and size of content grows). Laws shouldn't apply retroactively.

The problem really is that our means of accessing information are services. When you have a physical letter, or an e-mail saved locally, or a text message from 15 years ago, you can just read them. Nobody will know or care. Nobody will come after you trying to apply GDPR or DMCA retroactively. And since storage is near-free, you won't ever lose it until you forget about it (or at least about doing regular backups). Whereas with modern webmail, forums, link aggregators, IMs - you don't have even your own messages, and viewing a conversation that happened 15 years ago is really being provided a service today. Services are ephemeral, they're also subject to ever-changing regulations and whims of the service providers.

Bottom line, while services are necessary for transferring conversations, we really shouldn't be relying on them for access to conversations that already happened.


If you are a company, GDPR does apply to data on physical letters and local emails. A large part of the preparation for the introduction of GDPR enforcement was companies getting a handle on what they had stored in various media.


Actually, emails and letters are an area where the GDPR falls short in some countries, especially Germany. The constitution basically sits above the GDPR, and depending on the letter or email, its content does not need to be acknowledged or shown (the GDPR also means you can access your data) to the person who wants their data deleted, shown, or whatever.


All true, but costs of hosting and serving aside, there is a non-zero legal cost with hosting and serving the content. Blame bureaucrats, parasite lawyers, and our litigious society.


Those costs reflect the actual social costs of that hosting. Prior to GDPR and similar legislation, those risks were externalised onto users and society at large. They're now being shifted, properly, to where they should have been borne in the first place, on the service providers themselves.

Blame risk-externalising business practices and willful ignorance.


What social cost is there to distributing content contributed by people who agreed to terms according to those terms? Users transmitted data about themselves to a party after reading that party's terms of service and agreeing to the things it promised to do with the data. To paraphrase a popular talking point, two consenting IP addresses should be able to send whatever data they want between each other.


1. Terms of use can change at any time.

2. Technical capabilities have expanded massively. When Yahoo Groups launched, enterprise storage of more than a few hundred GB was highly unusual. I worked for a Very Impressive Service Agency which was lucky to claim two Sun Starfire servers, only one of which was Large File (> 2 GB) at about the time, for analytic use.

By the late 2000s, AOL were deploying massive-RAM based systems to be able to perform whole-dataset operations in memory.

For the past ~5-8 years, large-scale SSD drives have been A Thing, now available in the terabyte range, for a price. Again, the level of analysis and exploration possible has made tremendous leaps.

3. There is the concept of manifest vs. latent functions, and awareness. The full realm of possibilities of technical systems are rarely apparent to their creators, let alone nontechnical users. See (very generally): https://en.wikipedia.org/wiki/Manifest_and_latent_functions_...

The marketing and disclosures of such services rarely include such disclaimers as "use of this system may subject you to a lifetime of personal and social profiling, grammar-based context analysis, GD ML AI based image content analysis, and imperil the global liberal social democratic experiment."

Hiding behind the figleaf of "you should have considered all possible future implications of your present actions and will have no future recourse" is grossly flawed, and quite frankly, professional malfeasance and malice aforethought given current understanding.

The awareness of risks has changed, and is unambiguous. Providers should foot the costs, or mitigate them accordingly.

(I suspect that at least in part, the actions of Yahoo, Google, and others, reflects this changed awareness, though I'm not aware any providers have explicitly stated this.)

Again: the risks always existed. The previous state was made possible only by pretending they did not. They do. Practices must change.


Social cost would be at best very difficult to quantify, though, making it quite hard to handle. "Increased partisan tensions" due to social media, for instance, is not the sort of thing the cost of which one can quantify and mitigate.

Your point that the things which can be done with collected information are constantly in flux is well taken, and I agree the ability to retroactively change terms of service to cover previously-collected data is ridiculous and implies an illusory contract which is not legally valid. No one should be able to run through a neural net data collected in the nineties. However, it's also not reasonable to demand that old data be removed, as it's produced at least as much by the server as by the client (e.g. access logs are typically produced by server-side monitoring of server-side software). The most sensible option is for companies to require explicit agreement to TOS changes to continue using the service, and use new data only under that policy while using the old data under the old policy. It's additional compliance overhead, certainly, but it's no different from how a client contract would be treated.

> professional malfeasance and malice aforethought

You are not the arbiter of such things, but thank you for your opinion. There's also a site guideline about assuming good faith, so you're in violation of that.


My own thinking on this has evolved very considerably over the past five years or so. That's included a comprehensive and ongoing exploration of the fields of media, communications, epistemology, and several others, related to this. I'd long seen computers as technology, largely independent of social implications. I now see these as utterly inextricably linked, and with implications that are anything but predictably benign.

Costs being difficult to assess does not mean impossible, and the notions of probability and risk are central to all finance, investment, and insurance. Uncertainty is NOT an absolute lack of knowledge.

Among the principles that becomes apparent is that changes in informational regimes have profound impacts upon societies, and that this is a pattern which can be traced back through history to the invention of writing itself, and via indirect anthropological evidence likely to the emergence of speech.

The principle transcends humans themselves -- a leading theory for the Cambrian Explosion is that it was a consequence, effectively, of structuring and communications mechanisms within organisms developing, and allowing the creation of complex body plans, and not merely single-celled organisms or masses or colonies of cells.

For media, see especially Elizabeth Eisenstein's The Printing Press as an Agent of Change and Marshall McLuhan's The Gutenberg Galaxy. The link between mass media and totalitarian, fascist, authoritarian, and nationalist sentiments has long been observed (Hannah Arendt, Dwight MacDonald, the Frankfurt School, Edward Herman & Noam Chomsky, Adam Curtis).

I've been impressed by the insight into, or occasionally the lack of awareness of, the potential perils of comprehensive data archives among pioneers within the data field.

Paul Baran, co-inventor of packet-based networking, wrote "On the Engineer's Responsibility in Protecting Privacy" (https://www.rand.org/pubs/papers/P3829.html) in 1968, some 51 years ago. In it he remarked on both the risks, and industry attitudes:

There are many amongst us who would not hesitate to build equipment to compromise the privacy of any given individual provided the price is right. These are the whores of industry. They would not hesitate building systems and devices contrary to the public interest; their only concern is the buck.

The full paper, and in fact, all of Baran's RAND publications, are online in full-text, following my request to RAND. I remain grateful to them for this.

Baran was also interviewed for a 1966 BBC documentary:

"Well, he who has access to information controls the game. This is very dangerous. I think both your country and mine have never trusted the government completely. We do so for good reason. Here we have a mechanism that could be abused. Here we have a mechanism that would allow the creation of a dictator. . .

I've yet to see an expression by anyone in Congress about this new type of danger. In fact, we see proposals for centralizing information, we see proposals for rushing ahead into new, more efficient computer information systems, and very little thought is being given to the dangers of the misuse of these systems. . . I ask a lot of people about privacy, why they valued it, and I was surprised by the number of people who said "Well, I don't do anything wrong. Why should I worry about privacy?" And then, on the other hand, I think there's a more wise group that says, 'Privacy is really the right to be wrong, then go on and live the rest of your life, without having it mark you forever.' I tend to think this latter view is the view we should hold."

https://invidio.us/watch?v=FwaDvJYZTVk&t=29m31s

Another view was expressed by AI pioneer and Nobel Laureate (economics) Herbert Simon:

"The privacy issue has been raised most insistently with respect to the creation and maintenance of longitudinal data files that assemble information about persons from a multitude of sources. Files of this kind would be highly valuable for many kinds of economic and social research, but they are bought at too high a price if they endanger human freedom or seriously enhance the opportunities of blackmailers. While such dangers should not be ignored, it should be noted that the lack of comprehensive data files has never been the limiting barrier to the suppression of human freedom. The Watergate criminals made extensive, if unskillful, use of electronics, but no computer played a role in their conspiracy. The Nazis operated with horrifying effectiveness and thoroughness without the benefits of any kind of mechanized data processing."

https://pdfs.semanticscholar.org/a9e7/33e25ee8f67d5e670b3b7d....

There is, of course, one slight problem with Simon's argument: The Nazis did make heavy use of mechanised data processing, provided and supported by IBM. Edwin Black documents this meticulously in his book IBM and the Holocaust:

https://ibmandtheholocaust.com


We can still read Fidonet messages decades after BBSs died. The power of decentralized networks.


I'm curious where you'd find these -- do you have any links to Fidonet archives?

It's been a while since I looked, but I didn't find anything significant last time I did.



Some were gated over to Usenet, and you can see traces in Google Groups, but I'm not aware of any mass archive of them.


A very substantial portion (~98% of all public posts) of Google+ was successfully archived, at the Internet Archive, thanks to the Archive Team. As a longtime G+ user, and one of the organisers behind the G+ "Plexodus", the existence, assistance, and capabilities of the Archive Team were hugely appreciated.

AT and the Internet Archive have succeeded in preserving other content, though not all projects are successful. You can see a partial listing at https://www.archiveteam.org/

Even as notorious a "wasteland" as Google+ (a naming I've had some role in establishing: https://ello.co/dredmorbius/post/naya9wqdemiovuvwvoyquq) had many millions of actual active users, and tens of thousands of active communities (https://social.antefriguserat.de/index.php/Migrating_Google%...).

Unlike numerous other shutdowns, Google announced the G+ shutdown well in advance, though they "accelerated" the schedule twice, from "sometime in August 2019" to April 1, 2019, the eventual shutdown date. The tools Google offered for archiving and migrating content, whilst among the best in the industry (an exceptionally low bar), were incredibly insufficient: buggy, incomplete, duplicative, and not readily portable. It was largely thanks to third-party tools and assistance -- the Friends+Me Google+ archiver and ArchiveTeam most especially -- that meaningful preservation was possible.

The conceit of large-scale, free-to-use services has been convenience, capability, and trust, the last a point Google explicitly made in its original G+ announcement:

You and over a billion others trust Google, and we don’t take this lightly. In fact we’ve focused on the user for over a decade: liberating data, working for an open Internet, and respecting people’s freedom to be who they want to be. We realize, however, that Google+ is a different kind of project, requiring a different kind of focus—on you. That’s why we’re giving you more ways to stay private or go public; more meaningful choices around your friends and your data....

https://googleblog.blogspot.com/2011/06/introducing-google-p...

That trust has been repeatedly violated.

And in actively opposing archival efforts, Google, Yahoo, Flickr, and others are violating that trust only so much the more.

In the G+ shutdown, it was the active dismissal, obstruction, and interference of Google and its user-based support team (the so-called "Top Contributors") which were most disappointing. Long-time Google supporter Lauren Weinstein made this point specifically and repeatedly:

https://lauren.vortex.com/2019/01/29/googles-g-user-trust-be...

I'll note that this tends to strongly reduce the value proposition of all Web 2.0 / SaaS offerings, given that even the very largest and wealthiest companies are willing to act in this manner.

The consistency of this behaviour and attitude across multiple service providers makes me think that the behaviour and practices are not coincidental or unintentional.


Thanks for that history. I wasn't aware of it.

I got into this tangentially because of a community and ecosystem of Y-groups that I've been involved in. When I found the Archive Team's efforts, I hitched my wagon - though I'm not at all central to that group.


That's pretty much my status.

Feel free to drop me a line -- dredmorbius <at> protonmail <dot> com

I suspect you've also been active on Reddit lately (ow my inbox!).

