A police cooperative in Washington DC that was using them as a network to communicate with their respective neighborhoods with over 17,000 members.
A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted.
A Birding group in New Delhi with 2,000 members that has collected data and research on birds for TWO DECADES.
An Adoption group in France, that has been using it for years and years to communicate and share history and photos and more.
They also would have found:
Numerous support groups for people who are suicidal or depressed.
Numerous medical groups for people to communicate more effectively with their doctors.
Numerous Vet groups with 24 hr care advice for sick pets.
Numerous support and help groups for the Elderly.
Numerous Historical groups for WW2 Veterans, Vietnam Veterans, etc.
Numerous science groups that have used them for years and have all their research there.
Numerous fan fiction groups or arts groups that have shared their work for years.
Wow, somebody invented a database that's even worse than an Excel file on a network share.
(Also, how are they going to assign new numbers when archive.org takes over? Is archive.org going to give them write access?)
The other problem is making it available - I ran a Yahoo group for many years, and have Mbox and Maildir format archives. I'm still looking for a decent web-based browser for these. HyperKitty (Mailman's archive browser) came close, but seems to require most of Mailman to be installed in order to work.
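For anyone sitting on an Mbox archive like this, here is a minimal sketch of the first thing a viewer would need: a thread index. It uses only Python's standard `mailbox` module. The function names `load_threads` and `render` are made up for illustration, and it assumes the archive's messages carry usable `Message-ID` and `In-Reply-To` headers, which real Yahoo Groups exports may not always do.

```python
import mailbox
from collections import defaultdict


def load_threads(mbox_path):
    """Return (children, subjects) for the messages in an mbox file.

    children maps a parent Message-ID to the Message-IDs replying to it;
    thread roots are stored under the key "".
    """
    subjects = {}
    parents = {}
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            msg_id = str(msg.get("Message-ID", "")).strip()
            if not msg_id:
                continue  # cannot be placed in a thread without an ID
            subjects[msg_id] = str(msg.get("Subject", "(no subject)"))
            parents[msg_id] = str(msg.get("In-Reply-To", "")).strip()
    finally:
        box.close()

    children = defaultdict(list)
    for msg_id, parent in parents.items():
        # A reply whose parent is missing from the archive becomes a root.
        key = parent if parent in subjects and parent != msg_id else ""
        children[key].append(msg_id)
    return dict(children), subjects


def render(children, subjects, parent="", depth=0):
    """Return indented subject lines, one per message, in thread order."""
    lines = []
    for msg_id in children.get(parent, []):
        lines.append("  " * depth + subjects[msg_id])
        lines.extend(render(children, subjects, msg_id, depth + 1))
    return lines
```

Calling `print("\n".join(render(*load_threads("group.mbox"))))` would print an indented outline of the archive (`group.mbox` is a placeholder path). A Maildir archive could probably be handled the same way by swapping in `mailbox.Maildir`, though that is untested here.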
In my case, I managed to archive a bunch of groups related to amateur radio -- and I will be placing these on archive.org as soon as I have a spare moment to zip them up. A difficult-to-access archive is better than no archive at all, the important part is getting the data into a safe place.
How big are these archives? Do you have any samples? Does the viewer need any special features? (threading?)
The problem is that there isn't any standard that defines what can and can't go inside the body of an email message. So if you want to post each email message exactly in a thread exactly as is, i.e. each with completely different typography and with all the replies attached and not sanitized in any way, then that's relatively easy. But it's also completely unreadable for more than about 30 seconds, and doesn't allow for good search functionality. These problems aren't a deal breaker if you're only trying to make sense of your own inbox, but when you're looking for specific information across millions of people's inboxes then they're a complete nonstarter.
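To make the "sanitizing" part concrete, here is a rough sketch of the kind of cleanup a readable, searchable archive would need: pull out the plain-text part of each message, then drop quoted replies and signatures. The helper names `plain_text` and `strip_quotes` are invented for this example, and the heuristics (lines starting with `>` are quotes, a line that is exactly `-- ` starts a signature) are common conventions rather than anything a standard guarantees, which is exactly the problem described above.

```python
from email.message import Message


def plain_text(msg: Message) -> str:
    """Return the first text/plain part of a message, decoded to str."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label: fall back to UTF-8.
                return payload.decode("utf-8", errors="replace")
    return ""


def strip_quotes(text: str) -> str:
    """Drop '>'-quoted lines and everything after a '-- ' signature marker."""
    kept = []
    for line in text.splitlines():
        if line == "-- ":
            break  # conventional signature delimiter
        if line.lstrip().startswith(">"):
            continue  # quoted reply text
        kept.append(line.rstrip())
    return "\n".join(kept).strip()
```

Messages that are HTML-only, or that quote by top-posting a full copy of the previous mail without `>` markers, would slip straight through this, so treat it as a starting point at best.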
To add to all this, it's not an individual project. Most people don't have technical competence. They need someone to help. That's what the Archive Team has been trying to offer (if not for Verizon).
Yahoo's banning of a large number of the accounts we were using is a huge setback for us. In total we lost access to over 55,000 Yahoo Groups; many of these will now not be archived and will be lost when Yahoo deletes everything on December 14.
Particularly disastrous was the loss of access to all of the 30,000 Fandom (fanfic / fanart / etc..) groups that were requested to be archived by members of the fandom community. We're back to square one now, and it is looking increasingly likely that we're only going to be able to re-join (and therefore archive) a small percentage of these groups before December 14.
(And now for the inevitable, shameless plug...) We could really use some help! If you've got an hour or so, we could really use people to come and complete CAPTCHAs for us. (A CAPTCHA is needed to join every group). Instructions at: https://github.com/davidferguson/yahoogroups-joiner
> Your email address is not linked to a Yahoo ID. To join this group, you need to link your email address to a Yahoo account.
When I click "link your email address", it just takes me to a page called "Personal info" which doesn't have any obvious way to link my email address.
So I'm not sure how to proceed.
EDIT: Solved it. I had initially only "verified" the account with a phone number, but you have to add an email address as well. It's now working.
For anyone who, like me, signed up for this and filled in the Google form, but then couldn't find the leaderboard URL after closing the tab, it is https://df58.host.cs.st-andrews.ac.uk/yahoogroups/leaderboar...
It seems to be working through a list in reverse alphabetical order. Watching the progress being made is quite satisfying. When I started it was on groups like "sciencefiction" and now it's moved on to "petzluverz".
Seeing the same thing now, I added an email address and verified it, but I'm still not allowed to join the group.
I assumed I could help by going to a web page and solving a bunch of captchas for you, but when I read those instructions I found there's more involved (forging a Yahoo account, installing an extension) and it turned me off.
If captchas are the bottleneck, maybe some generous soul here could figure out a way to automate the rest and just give me a page where I can go solve captchas? Further reducing the friction might help get you some more uptake from the community - more monkeys like me banging at typewriters.
Sorry I wasn't more help, and best of luck with your efforts.
What are the hardware requirements of that VM?
I'm attempting to import it on my NAS4Free home NAS Virtualbox service which is the only machine I keep up 24/7 atm, but it takes forever to import. The hardware is very limited however (Atom D410 + a bit over 1GB RAM available), so I'm not sure it would succeed, but so far it loads forever, no errors given. I'd like to run it for this project to start contributing quickly albeit with limited hw before the deadline, then find better iron in the future.
After they settle down, they’re more memory than processor intensive. I’ve considered playing with the settings a bit, but thought it was more important to get a bunch of them running on a couple different VMs at different sites.
If I were really feeling fancy, I’d write a nice deployment definition for orchestrating this with microk8s...
I used the following commands to convert the disk image with qemu-img and then boot it:
qemu-img convert -O qcow2 archiveteam-warrior-v3-20171013-disk001.vmdk archiveteam-warrior-v3-20171013-disk001.qcow2
qemu-system-x86_64 -m 1024 archiveteam-warrior-v3-20171013-disk001.qcow2
`Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.`
I did specify that groups requiring approval to join shouldn't be submitted, but not everyone took notice. (And then there was the several dozen Google Groups URLs that were submitted!)
It's a set of groups that have been specifically requested by the fandom community. Of course, the groups handed out depend on what's been joined, so if / once all the fandom groups are joined, we'll move onto something else.
I appreciate this isn't made clear in the instructions, but if you have a desired set of groups in mind, you don't need to use the chrome extension. Just join the groups you want saved and (provided you've sent the account details through the form) they'll be added to the queue to be archived. I did a lot of Amateur Radio (Ham Radio in US) groups that way.
I confess I'm doing this mostly to see what people posted on the internet at some point in time :)
Edit: All groups have around 1600 members... what causes this...
That's possibly the maximum cap?
You can absolutely purchase captcha answers.
(b) We were still testing / writing the scripts to do the actual archiving. Most of the groups we did save before the banning were from test runs of the archiving script.
And sure, given hindsight, I'd do things differently. We've learned, now, and are archiving a group soon after it is joined.
Politically, you need to arrange it so that cooperating with you will give Verizon a small PR boost, while ignoring you will be seen negatively by the public. This thread had a good example of interesting data that is worth preserving, so I would try reaching out to news companies (NY Times and whatnot) to see if anyone wants to publish a piece. Phrasing this positively and ensuring enough people see it, would greatly increase the chances of cooperation from Verizon.
You might even get the admins to make an announcement.
Can any of you shed some light on why Verizon and Yahoo aren't cooperating with the Archive Team to archive this valuable historical content?
(If you don't feel comfortable commenting with your regular HN account, maybe you could do so with a throwaway account?)
Also, is it possible for any of you to bring this issue to the attention of upper management and help them understand how important it is to archive this?
You Verizon/Yahoo employees have much more power to make a difference here than anyone of us from the outside can.
I work for VzM, but not historically directly on Yahoo products (product teams have been merged/consolidated etc. over the past few years, but there's still strong tendencies toward products people came from).
So I wouldn't be very clued into what's happening with Yahoo Groups internally. And I've heard nothing about this internally. At all.
As it stands, it's 2:30pm in SV, VzM is top of the HN frontpage, and not a single soul has mentioned it yet on internal Slack.
Will see if I can find out more.
I'm guessing this will blow up later this morning when people start waking for the work week.
So technically, some legal troll could post some copyrighted information, wait for it to be published on Archive, and then sue Archive for copyright infringement and Verizon for assisting it. As a non-profit, Archive will likely get away with just taking it down, but a for-profit Verizon is a wholly different story.
> The 128 people you banned were REQUESTED by the group owners to get their stuff.
The question is why they're spending real effort on blocking archivists. All they had to do was keep doing nothing for a few days. The cost to them might have been a couple hundred dollars' worth of bandwidth, at most, which I think archivists would have been happy to pay--they've done more before. (That's estimating based on small-scale commercial hosting prices; it might not even register on whatever enterprise uplink Yahoo/Verizon uses.)
Instead they've got at least one professional taking time away from productive work to fuck with archivists at no benefit to anyone. It's possible that the wage-hours spent on this actually exceed what the bandwidth costs would have been. It's astonishingly petty.
Is it possible that there may be some kind of political angle to all of this; that archiving this information for the future might allow someone to find out something that someone else doesn't want to come to light?
</conspiracy tinfoil hat>
People often compare the shutting down of sites or the banning of content (e.g. when Tumblr banned porn, or now Yahoo shutting down Groups) to the burning of the Library of Alexandria. But there is a huge difference. The LoA held knowledge collated and collected by the best thinkers of the time. The Internet is not that. The Internet is an open platform where anybody can say anything like that. Most comment sections are filled with all sorts of material ranging from factual to entirely fictional.
I realise it is hard to decide what is worth keeping (and therefore erring on the side of saving it all), but I'd wager that the vast majority of archived content is not useful at all. The Wayback machine is a perfect example. Lots of great stuff, but that's a drop in the bucket compared to the vast amounts of useless, or even redundant information stored.
It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.
Anybody want to share what drives them to do this?
It's also not horrendously expensive - we are getting better and better at storage as well as data analysis techniques, so stuff that seems useless today may be useful 50 years from now and cost less to store than it does now. The key thing again being that we can't benefit from hindsight.
Even graffiti can give insight into a time period, even if that insight is that that time period had an unusually high number of graffiti artists.
For a time period where data is more valuable than oil, where the wealthiest companies are trying to grab every piece of data they can, and on a site where this is frequently discussed and many work for said companies, I find the question "why do archivists want to archive data?" a little silly. Data might not be useful to us now, but might be to future historians (though this is similar to the argument made by the companies that do mass surveillance).
Given that search engines have zero ethics when it comes to removing embarrassing (but not illegal) content, sometimes the loss of information is a small blessing for some.
Yes, it's their fault, but I also don't think it's fair that something a child said at 14 should haunt them their entire professional careers, either.
The dogma, that "everything posted to the internet will stay on the internet" , may not be entirely true for this first generation, because now large parts are already gone. But I am certain that this will be very true for the current generation, because I really doubt that Facebook and others will ever freely delete large datasets of user content.
Ethics are about codified sets of rules. Perhaps they're just following a set of rules that doesn't promote hiding things to make people feel better?
More like definitely.
But many don't have the pockets for better systems, and so their earned knowledge lived on Google Groups. And when you think of all the people and groups that might have had needs to store their history, and what tools they might have used, what do you expect the skew of Yahoo Groups was. Certainly no Fortune 500 companies, but rather nonprofit and grassroots and all sorts of domains that are already getting the short end of the stick in our world :)
Step 2: It will take a long time to look through all this content and determine which parts deserve keeping.
Step 3: We will inevitably leave out something that someone else thinks is worth keeping anyway.
Step 4: Let's just archive everything.
I write a lot of historical content and often the most useful stuff I find—for example, old flyers or ads from the 1950s or 1960s—would have been considered trash by someone at the time.
So an archivist’s job isn’t to make a judgment. It’s to protect the data as they see fit.
So yes, there are real, hardcore scientific papers about ancients "shitposting" each other down to "your mom" jokes. Because it shows us how people really lived.
It's easier to just save it all and let gawd sort it out.
You never know what some future person might find interesting. For example, my father took lots and lots of pictures, but they're all set in the living room and kitchen. No pictures of the rest of the house. I'm sure the thought of photographing other rooms simply never occurred to him as being interesting.
For another example, many people are interested in where/when/why certain words first appeared, like the origin of "OK". Massive archives of text that are searchable would help with this.
Ask an antiquarian about the value of graffiti in the ruins of Pompeii and other archaeological sites sometime. The great historians of the day wrote about their contemporary culture, while the vandals and miscreants and lowlifes and commoners contributed to that culture. Having access to both sources gives us a much more complete picture.
You don't know what's worth saving at the time you save it.
By definition, we don't have the benefit of hindsight until it's too late.
A main concern of the Archive Group (again, below) is art that was uploaded there.
I'm sure those are not the only two classes of examples. See for example the bird watching group in Delhi that has been collecting data for decades. (In the link of the OP.)
People doing important work (esp important work that is underfunded) don't have time to write/record their own histories. But that history can be instructive, to learn what worked and what didn't, and help future travellers do it better :)
And perhaps especially important: ppl engaging in these under-resourced efforts are often working in domains that capitalism is... less curious about, we'll just say. Otherwise, it would likely be able to be more highly documented, as incentive is there to preserve it.
Our ability to improve our present from better understanding our past is a supposed benefit of a digital world that accrues data -- we have records of things that in prior ages just flew by in conversation (for better or for worse). But efforts like this rob us all of that wisdom <3
And again, there is an asymmetry in who gets robbed. It is often the folks working in the commons, those doing invisible maintenance labour (nonprofits, grassroots, community), and generally just people doing work within the cracks of capitalism.
... that had access to writing services and were wealthy enough to have their thoughts stored.
There could have been many odd voices out there that would've told us an entire different story. But these are unknown because they didn't have access.
Now we are in the era of (almost) universal access to storing our thoughts and we still don't listen to everyone, or we mark them as uninteresting and not worthy.
The graffiti on the toilet wall may well speak to the start of a trend, term, movement, or other event, for example.
Think longer timelines, broader scope than you personally may feel is relevant.
En masse, those questions have answers we individually are unlikely to fathom.
We have that kind of graffiti from Pompeii. It's enormously more fascinating and insightful into regular people's lives than all the stuff about kings and battles people wrote about in the more official works.
When looking through old newspapers and magazines, the advertisements are often the most interesting bit. Especially since you can probably already read about the big events they wrote about on Wikipedia or in history books.
But most importantly, Groups is a corpus representing many segments of society during a period (starting 2001, with a peak of over 100 million users in 2008). It's a snapshot that embodies concerns, beliefs, morals, language... at several realms. This is more than LoA even. It can be used profusely by researchers and historians to study society for years to come. Or by AI to learn how and who we are/were...
I am in awe of your flair for understatement.
People are conflating internet discussion content with written content because it's stored as text. Whereas the more legitimate comparison is to verbal communication.
I imagine you're not a historian. Neither am I, but I cannot imagine that there is a historian out there who hasn't lamented the ephemerality of everyday conversation (and even of apparently more durable forms of communication).
The texts on the internet at a given time, on the other hand, are public and reflect the opinions and ways of living of a large number of people at that time. There is no doubt that these could be analysed in the future to give us historical insights in ways we cannot even conceive yet. (Think e.g. about getting them data mined and analysed by advanced A.I. to give new insights into the time period.)
The worth of the data is so obvious that it's really hard for me to understand why you and some other people don't think these are interesting data points for research on how we lived, in, say, 200, 500, or even 10000 years from now. The data is not only interesting to historians, but also to economists, political scientists, and linguists, btw.
Linguists definitely do.
I have written great contributions to a python API library that could be of benefit to the community around it. The code has nothing to do with my company's core competency, and the code is used for internal orchestration, so "exposing insecure code" is an unlikely concern.
It is easier for a lawyer, especially a luddite, to say "no" than to help their employees give back to the world.
What a lovely thought. Thanks for the effort, even though it didn't pan out. If you've got the DVD, torrent it out :)
Now I'm wondering if there's a Stratus emulator anywhere and/or the OS code. Them things were nasty... individually battery-backed hard drives were just the beginning. The slot cards looked like someone had dumped yellow patchwire spaghetti all over them.
We don't know exactly what was in the library when it burned. We assume it was all great works of intellectualism, but it could very well have been the fanfics of their time.
But anyway, no one should delete human literature, be it inadvertently or by lack of effort.
If anything this would make the analogy even more apt, since only part of Yahoo is being destroyed. :)
Regardless, it's mostly used as a metaphor for the destruction of knowledge at this point.
Just looking at the third link, the most upvoted answer agrees that humanity suffered a significant loss of important information. And the 'myth' is just an asinine distinction regarding whether the loss was literally due to fire, or whether the information was lost due to some other cause. I think declaring it a myth in a conversation like this misses the point (it certainly isn't a distinction relevant to the original comparison made here to Yahoo Groups) and just serves to confuse people.
You don't have to be a Christian apologist to think that it's important for people to understand history correctly.
2000 years ago, as a civilization, even if we failed to care enough for the Works stored in the Library, their loss would not have happened if access was not limited, which would have helped in their dissemination and issuing of copies.
Today, as a civilization, if we fail to implement the right process to back up in time what matters to us, we will repeat the same errors as our ancestors.
I guess many historians today would prefer to see those non-existent backups of the Alexandria Library rather than those of Yahoo Groups, but who knows what is more important after all ;)
Their whole Library would probably fit on even the smallest SD card available now.
Not sure why this one kills me so much...
The default corporate posture will be: Delete all the data! It's a liability and figuring out what we can keep is an enormous headache.
That attitude will create a problem - a.k.a. opportunity - for others to come in and solve. Google got rich by scraping the internet and solving the headache of how to find decent content. If there's value in some of this data headed to the dump, it gives a chance for someone to do the same. Who knows, they might even find a way to do it in a privacy-respecting manner.
Jury's out on whether it was the right one.
There's a reason why when security is deemed important, the storage is physically destroyed instead.
Will a judge who is clueless about how computers really work consider that a GDPR violation or not? As deliberate or not?
Queer Digital History Project: https://queerdigital.com/ygpresproject
Project to Archive Trans Yahoo Groups: https://archivetransyahoo.noblogs.org/list-of-known-trans-gr...
Project to Archive South Asian American yahoo groups: https://yahoogroups.southasianamerican.org/
I've got to guess that there are more.
Also, some of the politics groups were a great time capsule for around the Clinton/Bush election era.
A lot of earthquake researchers gathered on several earthquake groups as well, including Caltech seismologists and advanced amateurs, many of whom aren't around anymore.
Also, some of the info in these groups can be used to defeat patent applications, as it shows evidence of public prior concepts and art.
Yahoo Groups consisted of somewhat more technically advanced users than modern website users (Reddit etc.) because the groups were earlier and somewhat harder to use.
It's a lot of good-quality content.
Also, in the early days of these groups, spam and massive controlled astroturfing account groups were pretty rare.
This is like losing 15 years of ancient Sumerian writings in a very interesting early time for the Internet.
In a way, the digital world is far more fragile than the physical world. And the time to solve this is now.
IIRC, Archive.org is still running its fundraiser today.
We need LOTS of publicly-sponsored and paid-for digital archival centers that, like libraries, are maintained for the common welfare. Or we could, you know, add that duty (and funding) to existing libraries! With -paid- archivists!
My bet would be that Verizon's network monitoring system/team sees the archive team's attempts as some sort of anomaly to be stopped. It's possible, though I wouldn't bet on it given Verizon's history re: public relations, that making noise might alter the equation and get them to allow the archive team to continue.
To raise the perceived threat level, many folks could support in building tooling or docs to help ppl migrate as easily and streamlined as possible, to minimize the tax on consumer time that they rely on. (E.g., help on comparable plans, cheat sheet for call centre keywords, etc.)
Maybe something team "Do Not Pay" could help run with...! 
Paying lawyers to examine the fine details and determine what liability may arise from publishing a database dump or the software that can view the dump's contents is not free.
You tell me how much work it would be.
Compare that to pulling the plug and getting the servers over to a landfill.
I mean, the GDPR makes things kind of difficult in this regard, and I suspect even archives are liable if somebody takes an issue with content they are hosting.
Or spin it off and sell it.
Also, in my opinion, no privately owned company either, unless the owner was soon dying of something and wanted to get in good with their creator.
When you create an SPV after-the-fact, you have to go back and reverse-engineer a separation of liabilities from documents that don't specify whether they're work done for the organization or the SPV (because the SPV didn't exist.)
It's like a divorce. (Or, for an even more on-the-nose analogy, it's like trying to use a condom after-the-fact by extracting any bodily contamination and putting it in the condom.)
For a product that does not bring in any revenue, or any significant revenue, it is better to dump everything and simply not be associated with the data any longer.
That's the side effect of GDPR: it is hard, from the technical and financial perspective, to maintain anything free on the Internet that keeps users' data.
So, by analogy, if Twitter did allow people to download an archive of any public Twitter account's history... what would the GDPR require them to do? Wrap those archives in some sort of auto-expiring DRM?
Large corporations are not anthropomorphic entities, regardless of their disarming branding. Rather they are amoral bureaucracies, likely administered by people who have learned to ignore their empathy to get there. Verizon won't change course to accommodate the Internet Archive or general Internet community any more than a combine would pause for a field mouse.
Don't miss the sidebar with these links:
Also, you can add these emails to the media contacts:
"Reporter Katyanna Quach" <email@example.com>,
"Managing editor Gavin Clarke" <firstname.lastname@example.org>,
"Corey Wilson & Rachel Janc; Senior Director, Communications" <press@Wired.Com>,
"Rich Woods" <email@example.com>,
"Paul Thurrott" <firstname.lastname@example.org>,
"Brad Sams" <email@example.com>,
"Kate Rayford, Media Inquiries" <firstname.lastname@example.org>,
"Bryan Lowder (LGBTQ issues/culture)" <email@example.com>,
"Torie Bosch (emerging technology effects on public policy and society)" <firstname.lastname@example.org>,
"Jonathan Fischer (big tech, cities, media/internet culture)" <email@example.com>,
"Susan Matthews, Health & Science" <firstname.lastname@example.org>,
"Erika Allen, Executive Managing Editor" <email@example.com>,
"Katie Drummond, SVP, Global Content" <firstname.lastname@example.org>,
"Press, US" <email@example.com>,
"Press, Canada" <firstname.lastname@example.org>,
"Press, UK" <email@example.com>,
"Pitches, Culture" <firstname.lastname@example.org>,
"Pitches, Tech" <email@example.com>,
It's not like interacting with political representatives or corporate PR/executive types where you're conveying the size of the interested party, in this case newsworthiness doesn't necessarily depend on how many people are sending the email.
There's also stuff there about contacting Verizon and contacting the shareholders of Verizon. For them, I think we need volume.
1) I have been a member of a group for many years (Gann study group) . Last Friday I received a notification from the owner who was explaining the group was closing so he set up a new one somewhere else.
I thought it would be nice if I made a backup. So I found a Python script on GitHub (there are dozens of scripts in various languages there which can be used to back up a Yahoo group).
It took me a couple of minutes to get it working, and then a while later, voilà! I had it nicely packed on my hard drive.
So why is it so hard to back up a group? I don't understand the problem.
2) "A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted."
What? Well OK why not.. But? They are a phone company. There must be someone able to scrape all this data? I don't get it? There are so many ways to extract data from yahoo group.
The Archive Team has been taking requests for backups of groups for people who don't have the technical facility to run the python scripts. They then intend to make them available on the internet archive. The next project is making some kind of front end, in case group owners want to host that somewhere. Some of us, for example, will be doing that behind some kind of a forum login, so it won't be search engine indexed.
As for your point 2, that was cut/pasted from the link in the OP, where it's describing that many groups are still using the platform. More relevant to this project, is that many groups are losing their archives, and those archives contain anything from scientific data, to hobbyist & howto information, to art and literature, etc.
Also, it is a shame that the person in direct contact with Yahoo over this is sending angry emails in all caps. The Internet Archive deserves better.
Saying stuff such as this sounds pretentious and will unfortunately only get laughed at by anyone in the corporate world: "So the best thing Verizon could do, since they are just going to throw us all into the trash anyway, as we aren’t important to them, is let us get our archives any way we can.
The terms of service really should not apply to people who have been told, we’re gonna delete you from existence. If it’s lawful for us to get them from you, in broken buggy and virus ridden state, it’s just as lawful for us to get them ourselves."
As it is right now, she's just not doing any favors to the archivist community out there. Perhaps someone with proper communication skills and better nerves should take up that role? This is not a time to play a martyr and throw a fit while expecting Verizon to meet you half-way.
Don't use any service that suffers from a single point of control.
How much anguish when Facebook inevitably either goes away or pivots entirely?
Or HN, for that matter?
Maybe Hacker News should be mirrored on Usenet...
Does anyone have an idea of exactly what term or terms were violated by the archivists?
2. d. viii: "interfere with or disrupt the Services or servers, systems or networks connected to the Services in any way."
I'd also like to point out that the apparent spokesperson Brenda Fowler said in her open letter to Verizon, that "If the problem is that all our attempts to rescue our archives in the time we have left is causing an overload or strain on your servers, then stop making us HAVE to work around the clock, and GIVE US MORE TIME. ..." Probably not the wisest thing to say right now.
Also, archiving the groups with automated tools is against the Use of Services rule, which states the following:
2. e: "Use of Services. You must follow any guidelines or policies associated with the Services. You must not misuse or interfere with the Services or try to access them using a method other than the interface and the instructions that we provide. ..."
As I mentioned in another comment, I really support the cause and am a big fan of archiving myself, but it's unfortunately quite clear that Verizon is right in calling out the violations of the "terms of service".
As for bogging down the servers, my understanding was different from what the author said. They hadn't started to archive, but were in script testing mode and were accumulating Yahoo accounts. From what I saw of their activities, they were very careful about not overloading the servers. (I know that because I was backing up my own groups independently at the time, and I was able to do it. Luckily.)
Seems like something like this would be a good way to archive this sort of information or build sites like Yahoo groups on top of this file storage in the first place.
Lots of knowledge gets lost these days.
New Reddit (without the old.reddit.com interface), for example.
Many niche subreddits contain lots of information that would be lost if Reddit dies (or just deletes these subreddits).
YouTube is unarchivable in principle due to the high amount of storage required (even at 640x480), and yet it still contains tons of unique content found nowhere else, from rare AMVs (that survived prior deletions) to instructions for repairing telescopes -- basically anything in video form that doesn't have backups (i.e. not uploaded to other video sites).
4chan and similar sites are archived by several sites in a haphazard manner (only the boards they like), and yet they are a huge chunk of internet culture that is going to be lost if these sites die (and that's more probable than for Reddit, given less funding).
Usenet is slowly fading into obscurity and dependence on Google Groups.
Many forums that exist today will not exist forever, yet very few are archived anywhere else.
Other forum-like sites like Stack Overflow and Quora might disappear in the future with nothing replacing them. GitHub is subject to Microsoft's whims and positions on open source. Wikipedia and various wiki farm sites don't have many revenue streams.
Practically every major website we take for granted is vulnerable - people thought Yahoo Groups was going to last forever.
Exploiting a free resource, as we all do these days (reddit, youtube, facebook, hackernews itself etc) is all well and good but maintaining history is expensive (content needs moderating, you are required to abide by the GDPR and DMCA, there may be disputes about content on the platform).
I mean, Google+, MySpace, Bebo, and IMDB comments are now dead and gone; how useful was the data really? I'm sure some people might go to archives, but I would imagine 95% of the data is just "rot" that has no value or substance.
History is lost all the time. We barely know what we've been up to for the last few thousand years; only now can we document our world so extensively, with the precision and quality afforded to us.
But in the end, time moves on and some of that history is lost. It hurts, but who's to say any archived history will be preserved anyhow? We're still relying on our storage technology being readable years/decades/centuries from now, which is not a given.
The mean post size on G+ was roughly the same as on Twitter: about 120 characters. (Quite possibly because most G+ posts were themselves repurposed Twitter content.)
Static content does not require ongoing moderation, though it's possible that problematic content will be periodically identified.
The bigger challenge is actually in the publishing engines. Even where these are static, it's possible that vulnerabilities will be identified. That was Google's (not especially convincing) excuse.
A challenge of the Internet Archive / Archive Team method of archival and access is that in preserving the original formatting and packaging of content, the bandwidth and storage requirements are increased tremendously. By about two orders of magnitude in the case of G+.
Were the Archive to focus on the actual originally-authored content rather than all the associated chrome, both factors would be tremendously reduced.
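To make the chrome-versus-content point concrete, here is a rough sketch of how one might compare the size of a full page with the size of the text a person actually wrote. It assumes the authored text can be approximated as all visible text outside script and style blocks, which is a simplification. The names (TextExtractor, authored_text, size_ratio) and the sample page are made up for illustration, and the ratio it yields is not a measurement of G+ or of anything the Archive stores.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def authored_text(html):
    """Return an approximation of the human-written text in a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def size_ratio(html):
    """Bytes of the full page divided by bytes of the extracted text."""
    text = authored_text(html)
    return len(html.encode("utf-8")) / max(1, len(text.encode("utf-8")))


# Hypothetical page: a short post wrapped in a lot of styling and script.
SAMPLE_PAGE = (
    "<html><head><style>" + "a{color:red}" * 200 + "</style>"
    "<script>" + "var x=1;" * 300 + "</script></head>"
    "<body><div class='post'>Short post, about a tweet long.</div></body></html>"
)

if __name__ == "__main__":
    print(authored_text(SAMPLE_PAGE))
    print(round(size_ratio(SAMPLE_PAGE), 1))
```

A real archive would also have to keep authorship, timestamps, and threading, so actual savings would depend on how much of that metadata is retained.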
But giving an "export all the data in xml/json/whatever" button, and maybe even open-sourcing the now-abandoned component serving this data, would be a nice move. The first part could even become a regulatory requirement some day.
Things shouldn't be like this. The price per unit of storage and bandwidth falls fast (and, except for the sites dealing with user-generated videos, faster than the amount and size of content grows). Laws shouldn't apply retroactively.
The problem really is that our means of accessing information are services. When you have a physical letter, or an e-mail saved locally, or a text message from 15 years ago, you can just read them. Nobody will know or care. Nobody will come after you trying to apply GDPR or DMCA retroactively. And since storage is near-free, you won't ever lose it until you forget about it (or at least about doing regular backups). Whereas with modern webmail, forums, link aggregators, IMs - you don't have even your own messages, and viewing a conversation that happened 15 years ago is really being provided a service today. Services are ephemeral, they're also subject to ever-changing regulations and whims of the service providers.
Bottom line, while services are necessary for transferring conversations, we really shouldn't be relying on them for access to conversations that already happened.
Blame risk-externalising business practices and willful ignorance.
2. Technical capabilities have expanded massively. When Yahoo Groups launched, enterprise storage of more than a few hundred GB was highly unusual. At about that time I worked for a Very Impressive Service Agency which was lucky to claim two Sun Starfire servers for analytic use, only one of which was Large File (> 2 GB) capable.
By the late 2000s, AOL were deploying massive-RAM based systems to be able to perform whole-dataset operations in memory.
For the past ~5-8 years, large-scale SSD drives have been A Thing, now available in the terabyte range, for a price. Again, the level of analysis and exploration possible has made tremendous leaps.
3. There is the concept of manifest vs. latent functions, and awareness. The full realm of possibilities of technical systems are rarely apparent to their creators, let alone nontechnical users. See (very generally): https://en.wikipedia.org/wiki/Manifest_and_latent_functions_...
The marketing and disclosures of such services rarely include such disclaimers as "use of this system may subject you to a lifetime of personal and social profiling, grammar-based context analysis, GD ML AI based image content analysis, and imperil the global liberal social democratic experiment."
Hiding behind the figleaf of "you should have considered all possible future implications of your present actions and will have no future recourse" is grossly flawed, and quite frankly, professional malfeasance and malice aforethought given current understanding.
The awareness of risks has changed, and is unambiguous. Providers should foot the costs, or mitigate them accordingly.
(I suspect that at least in part, the actions of Yahoo, Google, and others reflect this changed awareness, though I'm not aware that any providers have explicitly stated this.)
Again: the risks always existed. The previous state was made possible only by pretending they did not. They do. Practices must change.
Your point that the things which can be done with collected information are constantly in flux is well taken, and I agree that the ability to retroactively change terms of service to cover previously-collected data is ridiculous and implies an illusory contract which is not legally valid. No one should be able to run data collected in the nineties through a neural net. However, it's also not reasonable to demand that old data be removed, as it's produced at least as much by the server as by the client (e.g. access logs are typically produced by server-side monitoring of server-side software). The most sensible option is for companies to require explicit agreement to TOS changes to continue using the service, and to use new data only under that policy while using the old data under the old policy. It's additional compliance overhead, certainly, but it's no different from how a client contract would be treated.
> professional malfeasance and malice aforethought
You are not the arbiter of such things, but thank you for your opinion. There's also a site guideline about assuming good faith, so you're in violation of that.
Costs being difficult to assess does not mean impossible, and the notions of probability and risk are central to all finance, investment, and insurance. Uncertainty is NOT an absolute lack of knowledge.
Among the principles that become apparent is that changes in informational regimes have profound impacts upon societies, and that this is a pattern which can be traced back through history to the invention of writing itself, and via indirect anthropological evidence likely to the emergence of speech.
The principle transcends humans themselves -- a leading theory for the Cambrian Explosion is that it was a consequence, effectively, of structuring and communications mechanisms within organisms developing, and allowing the creation of complex body plans, and not merely single-celled organisms or masses or colonies of cells.
For media, see especially Elizabeth Eisenstein's The Printing Press as an Agent of Change and Marshall McLuhan's The Gutenberg Galaxy. The link between mass media and totalitarian, fascist, authoritarian, and nationalist sentiments has long been observed (Hannah Arendt, Dwight MacDonald, the Frankfurt School, Edward Herman & Noam Chomsky, Adam Curtis).
I've been impressed by the insight into, or occasionally the lack of awareness of, the potential perils of comprehensive data archives among pioneers within the data field.
Paul Baran, co-inventor of packet-based networking, wrote "On the Engineer's Responsibility in Protecting Privacy" (https://www.rand.org/pubs/papers/P3829.html) in 1968, some 51 years ago. In it he remarked on both the risks, and industry attitudes:
There are many amongst us who would not hesitate to build equipment to compromise the privacy of any given individual provided the price is right. These are the whores of industry. They would not hesitate building systems and devices contrary to the public interest; their only concern is the buck.
The full paper, and in fact, all of Baran's RAND publications, are online in full-text, following my request to RAND. I remain grateful to them for this.
Baran was also interviewed for a 1966 BBC documentary:
"Well, he who has access to information controls the game. This is very dangerous. I think both your country and mine have never trusted the government completely. We do so for good reason. Here we have a mechanism that could be abused. Here we have a mechanism that would allow the creation of a dictator. . .
I've yet to see an expression by anyone in Congress about this new type of danger. In fact, we see proposals for centralizing information, we see proposals for rushing ahead into new, more efficient computer information systems, and very little thought is being given to the dangers of the misuse of these systems. . . I ask a lot of people about privacy, why they valued it, and I was surprised by the number of people who said 'Well, I don't do anything wrong. Why should I worry about privacy?' And then, on the other hand, I think there's a more wise group that says, 'Privacy is really the right to be wrong, then go on and live the rest of your life, without having it mark you forever.' I tend to think this latter view is the view we should hold."
Another view was expressed by AI pioneer and Nobel Laureate (economics) Herbert Simon:
"The privacy issue has been raised most insistently with respect to the creation and maintenance of longitudinal data files that assemble information about persons from a multitude of sources. Files of this kind would be highly valuable for many kinds of economic and social research, but they are bought at too high a price if they endanger human freedom or seriously enhance the opportunities of blackmailers. While such dangers should not be ignored, it should be noted that the lack of comprehensive data files has never been the limiting barrier to the suppression of human freedom. The Watergate criminals made extensive, if unskillful, use of electronics, but no computer played a role in their conspiracy. The Nazis operated with horrifying effectiveness and thoroughness without the benefits of any kind of mechanized data processing."
There is, of course, one slight problem with Simon's argument: The Nazis did make heavy use of mechanised data processing, provided and supported by IBM. Edwin Black documents this meticulously in his book IBM and the Holocaust:
It's been a while since I looked, but I didn't find anything significant last time I did.
AT and the Internet Archive have succeeded in preserving other content, though not all projects are successful. You can see a partial listing at https://www.archiveteam.org/
Even as notorious a "wasteland" as Google+ (a naming I've had some role in establishing: https://ello.co/dredmorbius/post/naya9wqdemiovuvwvoyquq) had many millions of actual active users, and tens of thousands of active communities (https://social.antefriguserat.de/index.php/Migrating_Google%...).
Unlike numerous other shutdowns, Google announced the G+ shutdown well in advance, though they "accelerated" the schedule twice, from "sometime in August 2019" to April 1, 2019, the eventual shutdown date. The tools Google offered for archiving and migrating content, whilst among the best in the industry (an exceptionally low bar), were incredibly insufficient: buggy, incomplete, duplicative, and not readily portable. It was largely through third-party tools and assistance -- the Friends+Me Google+ archiver and ArchiveTeam most especially -- that meaningful preservation was possible.
The conceit of large-scale, free-to-use services has been convenience, capability, and trust, the last a point Google explicitly made in its original G+ announcement:
You and over a billion others trust Google, and we don’t take this lightly. In fact we’ve focused on the user for over a decade: liberating data, working for an open Internet, and respecting people’s freedom to be who they want to be. We realize, however, that Google+ is a different kind of project, requiring a different kind of focus—on you. That’s why we’re giving you more ways to stay private or go public; more meaningful choices around your friends and your data....
That trust has been repeatedly violated.
And in actively opposing archival efforts, Google, Yahoo, Flickr, and others are violating that trust all the more.
In the G+ shutdown, it was the active dismissal, obstruction, and interference of Google and its user-based support team (the so-called "Top Contributors") which were most disappointing. Long-time Google supporter Lauren Weinstein made this point specifically and repeatedly:
I'll note that this tends to strongly reduce the value proposition of all Web 2.0 / SaaS offerings, given that even the very largest and wealthiest companies are willing to act in this manner.
The consistency of this behaviour and attitude across multiple service providers makes me think that the behaviour and practices are not coincidental or unintentional.
I got into this tangentially because of a community and ecosystem of Y-groups that I've been involved in. When I found the Archive Team's efforts, I hitched my wagon - though I'm not at all central to that group.
Feel free to drop me a line -- dredmorbius <at> protonmail <dot> com
I suspect you've also been active on Reddit lately (ow my inbox!).