That's right, but our main concern is that the archives are being deleted. With no further history being recorded, its utility for some purposes is limited. I have also come across some complaints that even as a list-serve it can be problematic. Posts, for example, are no longer coming in order.
But as a mailing list, each subscriber has the entire archive, at least from the date they joined. And any one of them can make it publicly accessible at any point in the future. In practice it will undoubtedly result in the destruction of enormous amounts of human knowledge, but at least in theory not much is getting immediately lost.
The difficulty in a lot of cases is finding someone who has a complete copy of the group. Yahoo Groups also had file, photo and database features, and archives of those are likely to be incomplete. You'd have to go through the member list (primarily early members) and find someone who still had a copy of all the messages.
The other problem is making it available - I ran a Yahoo group for many years, and have Mbox and Maildir format archives. I'm still looking for a decent web-based browser for these. HyperKitty (Mailman's archive browser) came close, but seems to require most of Mailman to be installed in order to work.
In my case, I managed to archive a bunch of groups related to amateur radio -- and I will be placing these on archive.org as soon as I have a spare moment to zip them up. A difficult-to-access archive is better than no archive at all, the important part is getting the data into a safe place.
I'm actually applying for an SBIR grant right now to work on the NLP algorithms that power fwdeveryone.com, if you have any interest in writing a letter of support. Basically it would eventually enable someone to mass export something like an MBOX archive onto the web in a cleaned up format with accessible typography. You can play around with prettyfwd.com to get an idea of the current state of the tech, it works well for 95% of (non-commercial) email threads but still needs some more work to support the rest.
I have the same problem. I used some old script back in 2006 to download a couple of groups in ... I think it's Mbox format. It's just not clear what to do with it.
Hmm... a standalone viewer for these formats (that exposed a webserver that could be accessed in a browser) sounds like it would be pretty trivial given a parser for the email format itself. Especially maildir!
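For what it's worth, here is a minimal sketch of that idea using only the Python standard library. It reads an Mbox file with the `mailbox` module and serves a single HTML index of date, sender and subject. The function names (`render_index`, `serve`) and the port are made up for illustration, and it does no threading, no body rendering, and no decoding of RFC 2047 encoded headers, so treat it as a starting point rather than a real archive browser.

```python
import html
import mailbox
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_index(mbox_path):
    """Return one HTML page listing Date, From and Subject of every message."""
    # create=False: fail loudly instead of silently creating an empty mbox.
    box = mailbox.mbox(mbox_path, create=False)
    rows = []
    try:
        for msg in box:
            # Headers are shown raw; encoded-word headers are not decoded here.
            cells = [str(msg.get(name, "")) for name in ("Date", "From", "Subject")]
            rows.append(
                "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in cells) + "</tr>"
            )
    finally:
        box.close()
    return "<!doctype html><meta charset='utf-8'><table>\n" + "\n".join(rows) + "\n</table>"


def serve(mbox_path, port=8000):
    """Serve the index on localhost only (hypothetical helper, blocks forever)."""
    page = render_index(mbox_path).encode("utf-8")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(page)))
            self.end_headers()
            self.wfile.write(page)

    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve("group.mbox")` (the file name is a placeholder) and opening http://127.0.0.1:8000 would show the index. For a Maildir, `mailbox.Maildir` yields message objects in much the same way.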
How big are these archives? Do you have any samples? Does the viewer need any special features? (threading?)
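On the threading question, a rough sketch of how replies could be grouped, assuming the messages still carry `Message-ID`, `In-Reply-To` and `References` headers (old list exports do not always). The `build_threads` name and the return shape are invented for this example, and this is a much simpler heuristic than the full threading algorithms real mail clients use.

```python
import re
from collections import defaultdict

# Message IDs look like <something@host>; a header may hold several of them.
MSG_ID = re.compile(r"<[^<>\s]+>")


def _ids(value):
    """Extract every <message-id> token from a header value (or none)."""
    return MSG_ID.findall(str(value) if value is not None else "")


def build_threads(messages):
    """Return (roots, children), where children maps a Message-ID to its replies."""
    messages = list(messages)
    by_id = {}
    for msg in messages:
        own = _ids(msg.get("Message-ID"))
        if own:
            by_id[own[0]] = msg

    children = defaultdict(list)
    roots = []
    for msg in messages:
        own = _ids(msg.get("Message-ID"))
        # Prefer In-Reply-To; otherwise fall back to the last References entry.
        candidates = _ids(msg.get("In-Reply-To")) or _ids(msg.get("References"))[-1:]
        parent = candidates[0] if candidates else None
        if parent in by_id and (not own or parent != own[0]):
            children[parent].append(msg)
        else:
            # No known parent in this archive: treat the message as a thread root.
            roots.append(msg)
    return roots, children
```

Messages whose parent is missing from the archive end up as their own roots, which is probably the least surprising behaviour for an incomplete export.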
> pretty trivial given a parser for the email format itself
The problem is that there isn't any standard that defines what can and can't go inside the body of an email message. So if you want to post each email message exactly in a thread exactly as is, i.e. each with completely different typography and with all the replies attached and not sanitized in any way, then that's relatively easy. But it's also completely unreadable for more than about 30 seconds, and doesn't allow for good search functionality. These problems aren't a deal breaker if you're only trying to make sense of your own inbox, but when you're looking for specific information across millions of people's inboxes then they're a complete nonstarter.
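To make the "replies attached and not sanitized" part concrete, here is a deliberately naive sketch of one cleanup step: dropping `>`-quoted lines and "On ... wrote:" attribution lines from a plain-text body. This is only an illustration of the kind of heuristic involved, not how fwdeveryone.com or any other service actually does it. The `strip_quoted` name and the attribution pattern are assumptions, and real mail (HTML bodies, top-posted forwards, non-English attribution lines) breaks it quickly.

```python
import re

# Assumed attribution format; many clients and languages use other forms.
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted(body):
    """Remove quoted reply lines and attribution lines from a plain-text body."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            continue  # quoted text from an earlier message
        if ATTRIBUTION.match(stripped):
            continue  # "On <date>, <person> wrote:" lead-in to a quote
        kept.append(line.rstrip())
    text = "\n".join(kept)
    # Collapse the blank-line runs left behind by the removed blocks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Even this toy version throws away any legitimate line that happens to start with `>`, which hints at why the last few percent of threads are the hard part.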
I remember how great Gmane used to be with several incredible web-based views of mailing list archives. Too bad it sounds like the source code was lost and never open sourced. Another on the list of services that died without passing on enough of the torch.
Not to be flippant, but wouldn't one of the members of these groups have a copy of the group in their email? Given gmail and whatnot store things virtually indefinitely, couldn't the contents be recovered that way?
Some of these groups are decades old. For them, you'd be hard pressed to find someone who was there for the whole history of the group and kept them all. Also, yahoo was often a headache, dropping emails to individuals - you'd have to go to the website to read them. And furthermore, there is a lot more than emails stored on the platform: files/images/links/calendars/databases ...
To add to all this, it's not an individual project. Most people don't have technical competence. They need someone to help. That's what the Archive Team has been trying to offer (if not for Verizon).
disclaimer: I'm a Member of Archive Team who's helping coordinate the joining of Yahoo Groups in preparation for archival.
Yahoo's banning of a large number of the accounts we were using is a huge setback for us. In total we lost access to over 55,000 Yahoo Groups; many of these will now not be archived and will be lost when Yahoo deletes everything on December 14.
Particularly disastrous was the loss of access to all of the 30,000 Fandom (fanfic / fanart / etc..) groups that were requested to be archived by members of the fandom community. We're back to square one now, and it is looking increasingly likely that we're only going to be able to re-join (and therefore archive) a small percentage of these groups before December 14.
(And now for the inevitable, shameless plug...) We could really use some help! If you've got an hour or so, we could really use people to come and complete CAPTCHAs for us. (A CAPTCHA is needed to join every group). Instructions at: https://github.com/davidferguson/yahoogroups-joiner
I tried to do this but upon clicking the purple "Join Group" button Yahoo is giving me an error saying my email address is not linked to a Yahoo account:
> Your email address is not linked to a Yahoo ID. To join this group, you need to link your email address to a Yahoo account.
When I click "link your email address", it just takes me to a page called "Personal info" which doesn't have any obvious way to link my email address.
So I'm not sure how to proceed.
EDIT: Solved it. I had initially only "verified" the account with a phone number, but you have to add an email address as well. It's now working.
It seems to be working through a list in reverse alphabetical order. Watching the progress being made is quite satisfying. When I started it was on groups like "sciencefiction" and now it's moved on to "petzluverz".
It didn't take long at all for me after verification. Although I have sometimes randomly gotten that error message. Interestingly, sometimes it actually had joined the group anyway. The site has been a little glitchy off and on, but it's working for me right now.
While the above post is concerned with Fandom groups, my concern is with groups that started doing early community-driven biohacking-type research. There are medical test results and discussions of medical interventions. While that's my focus, I'm sure there's additional important material. We really need to save this data.
I assumed I could help by going to a web page and solving a bunch of captchas for you, but when I read those instructions I found there's more involved (forging a Yahoo account, installing an extension) and it turned me off.
If captchas are the bottleneck, maybe some generous soul here could figure out a way to automate the rest and just give me a page I can go solve captchas? Further reducing the friction might help get you some more uptick from the community - more monkeys like me banging at typewriters.
Sorry I wasn't more help, and best of luck with your efforts.
I imagine you guys already know this but considering we’re up against the timeline, I’d use the captcha solving service (easy to google yourself) and Luminati to distribute the IP addresses while swallowing my ethical qualms.
Thanks! I never heard of that before; just like project SETI though for archival purposes.
What are the hardware requirements of that VM?
I'm attempting to import it on my NAS4Free home NAS Virtualbox service which is the only machine I keep up 24/7 atm, but it takes forever to import. The hardware is very limited however (Atom D410 + a bit over 1GB RAM available), so I'm not sure it would succeed, but so far it loads forever, no errors given. I'd like to run it for this project to start contributing quickly albeit with limited hw before the deadline, then find better iron in the future.
I’m running the Docker image on the smallest Hetzner VMs, with 5 concurrent groups and 40 shared rsync threads per container, and 12 containers per server. Start one container, do docker top on it to make sure it’s pulling, then start the others one by one, taking a few seconds between each to avoid overwhelming the CPU. I’ve got 6 of those little VMs going, and have rolled up 4GB and 2800 groups worth in 6 hours.
After they settle down, they’re more memory than processor intensive. I’ve considered playing with the settings a bit, but thought it was more important to get a bunch of them running on a couple different VMs at different sites.
If I were really feeling fancy, I’d write a nice deployment definition for orchestrating this with microk8s...
I'm running it on a Synology NAS (Celeron J3455), and the docker manager UI claims it's using 180 MB RAM and less than 1% CPU (and I just confirmed it's currently working on archiving Yahoo! Groups)
An ova file is just a tarball containing an ovf file and a vmdk file. The ovf file is a text-based configuration format, so you can get a basic idea of the config you'd need for qemu. Then the vmdk can be converted with qemu-img.
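A rough sketch of those steps in Python, in case it saves someone a search. It pulls the `.ovf` and `.vmdk` members out of the tarball with the standard `tarfile` module and builds (but does not run) a `qemu-img convert` command line. The helper names are made up, and the qcow2 output format is just one reasonable choice; nothing here is specific to the Archive Team appliance.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_ova(ova_path, dest_dir):
    """Extract the .ovf and .vmdk members of an OVA; return their paths by kind."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    found = {"ovf": [], "vmdk": []}
    with tarfile.open(ova_path) as tar:
        for member in tar:
            # Keep only the base name so a hostile archive cannot escape dest_dir.
            name = Path(member.name).name
            kind = Path(name).suffix.lower().lstrip(".")
            if not member.isfile() or kind not in found:
                continue
            target = dest / name
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            found[kind].append(target)
    return found


def qemu_img_command(vmdk_path, qcow2_path):
    """Build the qemu-img invocation that converts a VMDK disk to qcow2."""
    return ["qemu-img", "convert", "-f", "vmdk", "-O", "qcow2",
            str(vmdk_path), str(qcow2_path)]
```

The returned command list can be passed to `subprocess.run(..., check=True)` once qemu-img is installed; the memory, CPU and network settings still have to be read out of the `.ovf` by hand.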
I think they were doing some kind of port forwarding, but I didn't bother, and I just access the web interface using the VM's IP (you can hit alt-right arrow to go to a login prompt and log in as root then run "ip a" to see the IP).
It went pretty good for the first 10-20 or so groups, but now I get multiples of the really annoying captchas (click until none remain) per group...
Damnit yahoo...
update: just enabling the vpn was enough to 'reset' captcha to the simple level, seems like yahoo does not take into account whether your IP is 'residential'.
I also noted that for yahoo changing IP, even changing continents, allowed me to use the same cookies as long as I kept my original browser window open.
`Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.`
That's nice, but it doesn't scale. Google only lets you solve a few (5 or so) audio captchas in quick succession before you're banned for a while, so it's no good for us.
It's been working for me instead of clicking on all the little busses or crosswalks, even if it doesn't work at scale. Thought it might help some other users of the extension.
Yeah, sorry about that. The current (as of 2100 UTC) set of groups being sent out to be joined were ones submitted through our nomination form: https://tinyurl.com/savegroups
I did specify that groups requiring approval to join shouldn't be submitted, but not everyone took notice. (And then there was the several dozen Google Groups URLs that were submitted!)
It seems a weird set of groups. Like, lots of three-to-five person groups roleplaying doctor who, spiderman and things like that. Is this the long tail of what hasn't been archived or is there not even a good way to tell post/member count without loading up through the extensions?
It's a set of groups that have been specifically requested by the fandom community. Of course, the groups handed out depend on what's been joined, so if / once all the fandom groups are joined, we'll move onto something else.
I appreciate this isn't made clear in the instructions, but if you have a desired set of groups in mind, you don't need to use the chrome extension. Just join the groups you want saved and (provided you've sent the account details through the form) they'll be added to the queue to be archived. I did a lot of Amateur Radio (Ham Radio in US) groups that way.
Ah, that's good to know that I can browse and find things that I'm more interested in. The instructions weren't clear about the difference between extension/group access and archiving.
Hah, this is fun! I've so far stumbled on a fantastic group with Sims 1 houses (pictures, and the actual lots), and a Dream Street fan-club, which of course prompted me to see who the hell they were.
I confess I'm doing this mostly to see what people posted on the internet at some point in time :)
Edit: All groups have around 1600 members... what causes this...
Verizon's response, and the response to the response, are in the article of the OP. They claim they offer a Group Downloads Manager, but it's very broken.
A couple of years ago I saw somebody giving a talk, where they demonstrated a CAPTCHA-solving API, with people from India solving the CAPTCHAs for a few cents.
Mturk's turnaround for this stuff can't be fast enough to work would be my guess. I know jobs I put up there for transcription, despite a generous bonus, were always delayed for at the very least hours.
I've been using Edge (Chromium) for past few hours, no issues yet. Plugin could be unrelated to your crashing. May help to use a standalone Chromium build for this https://chromium.woolyss.com/
I checked on IRC. One person says they've been using it for hours on chromium without a problem. "I've been using Edge (Chromium) for past few hours, no issues. Could be unrelated, could be related. May help to use a standalone chromium build for this."
As an aside, is there anyway to recover emails if I didn't sign into Yahoo for a year? I and a lot of others had up to 15 years of sentimental mail exchanged during that period :(
I don't see why not. Point Thunderbird at it or something and then just transfer the mails over to somewhere else if you want that - but this is not about mail. Rather it's about Yahoo Groups, whose archives are about to go away.
Forgive my naivety, but why would blocking of your accounts delete the data you have already backed up? This sounds like you are doing it the wrong WAY, IMO.
Two reasons:
(a) If we hit Yahoo with everything we've got, groups would have almost certainly crashed, or at least become unbearably slow. That's not a reasonable thing to do, and would be (IMHO) grounds for Verizon banning us.
(b) We were still testing / writing the scripts to do the actual archiving. Most of the groups we did save before the banning were from test runs of the archiving script.
And sure, given hindsight, I'd do things differently. We've learned, now, and are archiving a group soon after it is joined.
OK, thanks for explaining this. Just my 2 cents then: big companies make decisions like this based on the potential PR win/loss. If ignoring you keeps the PR delta at 0, while allowing to export the data exposes them to even a minimal risk (I dunno, someone's private details buried in), they will ignore, or even actively resist you.
Politically, you need to arrange it so that cooperating with you will give Verizon a small PR boost, while ignoring you will be seen negatively by the public. This thread had a good example of interesting data that is worth preserving, so I would try reaching out to news companies (NY Times and whatnot) to see if anyone wants to publish a piece. Phrasing this positively and ensuring enough people see it, would greatly increase the chances of cooperation from Verizon.
They hadn't backed up yet. They had set up accounts with yahoo that they were then planning to use to back up those groups. Backups themselves were starting, but they had to go slowly enough not to bog down yahoo's servers.
There have to be some Verizon or Yahoo employees on HN who are reading this.
Can any of you shed some light on why Verizon and Yahoo aren't cooperating with the Archive Team to archive this valuable historical content?
(If you don't feel comfortable commenting with your regular HN account, maybe you could do so with a throwaway account?)
Also, is it possible for any of you to bring this issue to the attention of upper management and help them understand how important it is to archive this?
You Verizon/Yahoo employees have much more power to make a difference here than anyone of us from the outside can.
I work for VzM, but not historically directly on Yahoo products (product teams have been merged/consolidated etc. over the past few years, but there's still strong tendencies toward products people came from).
So I wouldn't be very clued into what's happening with Yahoo Groups internally. And I've heard nothing about this internally. At all.
As it stands, it's 2:30pm in SV, VzM is top of the HN frontpage, and not a single soul has mentioned it yet on internal Slack.
It was someone quite high up in the company who was the first to raise it in Slack actually; though it's clear they were similarly not highly clued in to this before yesterday, and no substantive replies or info yet (just other colleagues with similar concerns).
I'm guessing this will blow up later this morning when people start waking for the work week.
If VzM wants to contact someone at the archive team securely they can DM any of the @s on irc.efnet.org/#archiveteam or twitter DM myself (@JRWR) or Jason Scott (@textfiles)
Why would it be risky? Surfacing an issue that is important to the public, where future/planned actions by the company could become a PR debacle sounds important.
Pure speculation, but if you publish something created by another person without an explicit permission by them, it may open you up for a lawsuit. If some groups required explicit approval by a moderator in order to read the posts, I would take it as they didn't want the content to go public.
So technically, some legal troll could post some copyrighted information, wait for it to be published on Archive, and then sue Archive for copyright infringement and Verizon for assisting it. As a non-profit, Archive will likely get away with just taking it down, but a for-profit Verizon is a wholly different story.
Groups can be private or not. Require approval, or not. The archiving team isn't attempting to break into private groups and archive them. Only public groups are going to be collected.
I think everyone understands that corporations don't want to spend money and effort maintaining servers that don't generate revenue. No one is really surprised that they won't help with archive efforts.
The question is why they're spending real effort on blocking archivists. All they had to do was keep doing nothing for a few days. The cost to them might have been a couple hundred dollars' worth of bandwidth, at most, which I think archivists would have been happy to pay--they've done more before. (That's estimating based on small-scale commercial hosting prices; it might not even register on whatever enterprise uplink Yahoo/Verizon uses.)
Instead they've got at least one professional taking time away from productive work to fuck with archivists at no benefit to anyone. It's possible that the wage-hours spent on this actually exceed what the bandwidth costs would have been. It's astonishingly petty.
Two reasons I can think of right away. First, there can be a policy (and people) to detect abuse and shut off bot accounts; this can even be a separate entity from Yahoo Groups. Second, there can be internal metrics tracking active users and viewed pages, which they want to get down as low as possible before deletion. In both cases archive.org is ruining it for them.
Is it possible that there may be some kind of political angle to all of this; that archiving this information for the future might allow someone to find out something that someone else doesn't want to come to light?
It's actually not totally insane anymore. If you could afford a Tesla Roadster you can build yourself a 4PB storage solution. With some high density top loading storage servers (4HE for 90 HDDs), 6TB HDDs and some SSDs thrown in for caching you can build that in 36HE for less than 300k$ (not counting time needed to assemble and configure). So if that's your hobby, go ahead :D If one takes more than 5 minutes to research this I'm pretty sure that it's possible to push that number below 250k$.
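The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures from the comment above (90 drives per 4HE chassis, 6TB drives, a 4PB target, a $300k budget), taking HE as rack units and 1PB as 1000TB. It counts raw capacity only, so redundancy, filesystem overhead and the SSD cache tier are ignored, and the function name is made up.

```python
import math


def size_cluster(target_tb=4000, drives_per_server=90, tb_per_drive=6,
                 units_per_server=4, budget_usd=300_000):
    """Raw-capacity sizing for top-loading storage servers (no redundancy)."""
    raw_per_server_tb = drives_per_server * tb_per_drive
    servers = math.ceil(target_tb / raw_per_server_tb)
    total_drives = servers * drives_per_server
    return {
        "raw_per_server_tb": raw_per_server_tb,
        "servers": servers,
        "rack_units": servers * units_per_server,
        "raw_total_tb": servers * raw_per_server_tb,
        # Whole budget divided by drive bays: chassis, drives and extras lumped together.
        "budget_per_drive_bay_usd": budget_usd / total_drives,
    }


if __name__ == "__main__":
    for key, value in size_cluster().items():
        print(f"{key}: {value}")
```

With those inputs it comes out to 8 servers (540TB raw each, 4320TB total) in 32HE, which leaves 4HE of the quoted 36HE for switches and cache nodes. Any RAID or erasure coding would push the server count up.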
Yes, university should be able to make some room so researchers can work with it. It's a lot of data, but not impossible to do with a small investment.
I'm genuinely curious from an ideological perspective, why archivists think all this material is worth saving?
People often compare the shutting down of sites or the banning of content (e.g. When Tumblr banned porn, or now yahoo shutting down groups) to the burning of the Library of Alexandria. But there is a huge difference. The LoA held knowledge collated and collected by the best thinkers of the time. The Internet is not that. The Internet is an open platform where anybody can say anything like that. Most comment sections are filled with all sorts of material ranging from factual to entirely fictional.
I realise it is hard to decide what is worth keeping (and therefore erring on the side of saving it all), but I'd wager that the vast majority of archived content is not useful at all. The Wayback machine is a perfect example. Lots of great stuff, but that's a drop in the bucket compared to the vast amounts of useless, or even redundant information stored.
It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.
Anybody want to share what drives them to do this?
Even if we still had the Library of Alexandria, it may have shed zero light on the actual lives of citizens. Archiving content on the internet means capturing thousands of individual level perspectives and experiences. We don't know what will end up being important to historians 50 or 100 years from now. I would bet there are dozens if not hundreds of historians that would give anything for a record of their favorite time period that contains even a fraction of the amount of content today's archive efforts are storing.
It's also not horrendously expensive - we are getting better and better at storage as well data analysis techniques, so stuff that seems useless today may be useful 50 years from now and cost less to store than it does now. The key thing again being that we can't benefit from hindsight.
Even graffiti can give insight into a time period, even if that insight is that that time period had an unusually high number of graffiti artists.
Not to mention that historians of the future will be able to sort and characterize massive amounts of data and draw conclusions that couldn't be made without that data.
For a time period where data is more valuable than oil, where the wealthiest companies are trying to grab every piece of data they can, and on a site where this is frequently discussed and many work for said companies, I find the question "why do archivists want to archive data?" a little silly. Data might not be useful to us now, but might be to future historians (though this is similar to the argument made by the companies that do mass surveillance).
What about people who don't want stupid comments they made online when they were 14 permanently indexed and searchable for all of time by the Archive Team? Yes, they may have posted to Yahoo! Groups back in 1999 when they didn't know better, but now it's 2019 and you have people digging up decades-old dirt on people to try and destroy their reputations and careers.
Given that search engines have zero ethics when it comes to removing embarrassing (but not illegal) content, sometimes the loss of information is a small blessing for some.
Yes, it's their fault, but I also don't think it's fair that something a child said at 14 should haunt them their entire professional careers, either.
The stuff stored in the Yahoo groups is material from the beginning of the internet, when people explored what could be possible and how easy it was to connect globally.
You have a valid point, but it's also one of these things in our generation that we have to live with. We explored and tried things. Only now do we look back and see what those explorations of our younger selves really are; sometimes funny, sometimes embarrassing. However, if you are cautious, you may be able to delete your stuff or at least make it anonymous by deleting the account in question. If not, you have to live with it.
All of these people can now learn from it and can educate their kids about being careful with the internet. (Or at least this is what it should be.)
The dogma, that "everything posted to the internet will stay on the internet" , may not be entirely true for this first generation, because now large parts are already gone. But I am certain that this will be very true for the current generation, because I really doubt that Facebook and others will ever freely delete large datasets of user content.
Hypothetically, yes; but right now all this stuff is available on the clearnet and searchable. So obviously any potential harm of the present situation, is decreased. And, unless your argument is that we should delete all fora on the web because someone may have said something embarrassing on them, then I think you'd probably want to come down on the side of preservation.
Withhold wide-scale, anonymous access for a few decades maybe? (Though presumably there is a middle ground that doesn't involve leaving _everything_ inaccessible for a few decades.)
For example: World War Two groups where many of the members have passed away by now.
There could be first hand accounts of history that has already been lost to time.
YES! It's like preserving ecological diversity. It's a store for later learning. Verizon is working in cold hard capitalism, and you can bet your lunch that they did NOT use Google Groups to hold their shared wisdom/history, and they would never let it be lost.
But many don't have the pockets for better systems, and so their earned knowledge lived on Google Groups. And when you think of all the people and groups that might have had needs to store their history, and what tools they might have used, what do you expect the skew of Yahoo Groups was? Certainly no Fortune 500 companies, but rather nonprofit and grassroots and all sorts of domains that are already getting the short end of the stick in our world :)
It's basically that. Yes, when saving everything we'll save a lot of trash and utter garbage, spam and all that shit... But the things we would risk losing if we didn't save everything are, and will be, so much more important. To save what is really important, you have to save everything.
And actually, spam is quite interesting to some people. It certainly gives a flavour of what early-2000s internet was like, and what happens when spam filters aren't good.
One man’s public toilet block graffiti wall is another’s Library of Alexandria. Let the historians and journalists decide what’s important and the archivists take their best crack at saving it.
I write a lot of historical content and often the most useful stuff I find—for example, old flyers or ads from the 1950s or 1960s—would have been considered trash by someone at the time.
So an archivist’s job isn’t to make a judgment. It’s to protect the data as they see fit.
Toilet wall graffiti and such, preserved in Pompeii, is an important archaeological resource for understanding the actual daily life of Romans.
So yes, there are real, hardcore scientific papers about ancients "shitposting" each other down to "your mom" jokes. Because it shows us how people really lived.
> I'm genuinely curious from an ideological perspective, why archivists think all this material is worth saving?
It's easier to just save it all and let gawd sort it out.
You never know what some future person might find interesting. For example, my father took lots and lots of pictures, but they're all set in the living room and kitchen. No pictures of the rest of the house. I'm sure the thought of photographing other rooms simply never occurred to him as being interesting.
For another example, many people are interested in where/when/why certain words first appeared, like the origin of "OK". Massive archives of text that are searchable would help with this.
> It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.
Ask an antiquarian about the value of graffiti in the ruins of Pompeii and other archaeological sites sometime. The great historians of the day wrote about their contemporary culture, while the vandals and miscreants and lowlifes and commoners contributed to that culture. Having access to both sources gives us a much more complete picture.
You don't know what's worth saving at the time you save it.
Ha, ha! Well, there's some high quality material there too, but I take your point. In the right context, like "history from below," all kinds of material can be high quality!
To be clear, I wasn't comparing the fanfic authors and other Yahoo Groups contributors to vandals scribbling dicks all over Pompeii. Just saying that all other things being equal, future historians will prefer to have too much data to work with than too little.
By definition, we don't have the benefit of hindsight until it's too late.
See below. My main concern is early medical/biohacking groups that shared data, like medical tests, and engaged in extensive discussion/community driven research. Such groups go back to at least the late 1990's.
A main concern of the Archive Group (again, below) is art that was uploaded there.
I'm sure those are not the only two classes of examples. See for example the bird watching group in Delhi that has been collecting data for decades. (In the link of the OP.)
Great question! I'll take an amateur swing at a decent answer:
People doing important work (esp important work that is underfunded) don't have time to write/record their own histories. But that history can be instructive, to learn what worked and what didn't, and help future travellers do it better :)
And perhaps especially important: ppl engaging in these under-resourced efforts are often working in domains that capitalism is... less curious about, we'll just say. Otherwise, it would likely be able to be more highly documented, as incentive is there to preserve it.
Our ability to improve our present from better understanding our past is a supposed benefit of a digital world that accrues data -- we have records of things that in prior ages just flew by in conversation (for better or for worse). But efforts like this rob us all of that wisdom <3
And again, there is an asymmetry in who gets robbed. It is often the folks working in the commons, those doing invisible maintenance labour (nonprofits, grassroots, community), and generally just people doing work within the cracks of capitalism.
> The LoA held knowledge collated and collected by the best thinkers of the time.
... that had access to writing services and were wealthy enough to have their thoughts stored.
There could have been many odd voices out there that would've told us an entire different story. But these are unknown because they didn't have access.
Now we are in the era of (almost) universal access to storing our thoughts and we still don't listen to everyone, or we mark them as uninteresting and not worthy.
> It is a lot of resources thrown at saving, not the equivalent of the Library of Alexandria, but the public toilet block graffiti wall.
We have that kind of graffiti from Pompeji. It's enormously more fascinating and insightful into regular people's lives than all the stuff about kings and battles people wrote about in the more official works.
When looking through all newspapers and magazines, the advertisements are often the most interesting bit. Especially since you can probably already read about the big events they wrote about on Wikipedia or history books.
Certain group contents are actually unique and valuable (see threads below), in which there could be certain similarities to the LoA.
But most importantly, Groups is a corpus representing many segments of society during a period (starting 2001, with a peak of over 100 million users in 2008). It's a snapshot that embodies concerns, beliefs, morals, language... at several realms. This is more than LoA even. It can be used profusely by researchers and historians to study society for years to come. Or by AI to learn how and who we are/were...
We don't think it's necessary to preserve everything that's ever spoken verbally. We don't lament that everyday conversation is ephemeral.
People are conflating internet discussion content with written content because it's stored as text. Whereas the more legitimate comparison is to verbal communication.
> We don't lament that everyday conversation is ephemeral.
I imagine you're not a historian. Neither am I, but I cannot imagine that there is a historian out there who hasn't lamented the ephemerality of everyday conversation (and even of apparently more durable forms of communication).
Historians would love to preserve spoken communication and there are many projects recording everyday conversations. There are even projects of recording the typical sounds of the environment in certain areas at a certain time. However, many forms of spoken everyday conversation fall under restrictive privacy laws, which poses strict limits to such preservation efforts.
The texts on the internet at a given time, on the other hand, are public and reflect the opinions and ways of living of a large number of people at that time. There is no doubt that these could be analysed in the future to give us historical insights in ways we cannot even conceive yet. (Think e.g. about getting them data mined and analysed by advanced A.I. to give new insights into the time period.)
The worth of the data is so obvious that it's really hard for me to understand why you and some other people don't think these are interesting data points for researchers who, say, 200, 500, or even 10000 years from now, want to study how we lived. The data is not only interesting to historians, but also to economists, political scientists, and linguists, btw.
When I was at aol I tried to get them to open source the q link server code from the 1980s. Someone actually got it on DVD for me and everything but after the Verizon merger they fired the entire legal team that was responsible for authorizing open source release and it just stalled.
Open sourcing code can be tricky—there's quite a bit of review that needs to go into doing it right, as well as more work if you want the release to actually be reasonably useful. Blocking this archiving effort is on a whole other level. We're talking about saving information that was already public. All they have to do to allow this to happen is... nothing. I can't comprehend why Verizon/Yahoo would go out of their way to block these efforts.
It depends on the size of the codebase and how shitty your programmers are, but if you aren't greedy or scared of over-litigation, it isn't hard at all.
I have written great contributions to a python API library that could be of benefit to the community around it. The code has nothing to do with my company's core competency, and the code is used for internal orchestration, so "exposing insecure code" is an unlikely concern.
It is easier for a lawyer, especially a luddite, to say "no" than to help their employees give back to the world.
For new code it is indeed "simple". Old code however likely contains third party provided code, be it from libraries or code provided by contractors, where no (clear) license permitting relicensing of the source is available. This can be quite complex historical work, as version history might not exist (which code came from where?), documentation is limited (paper contracts lost in archives), and so on.
First, to Hell with whoever downvoted me, probably lawyers (not you, johannes). Second, I get there are occasionally complicating factors, BUT - licensing can't be difficult at most times, since the company often owns what the worker produces - for better or worse, it's simple that way. As for third party work, are you talking about library imports, or copy and paste? The logistics of solving those problems are either simple or really complex.
Yes, they own what employed workers produce. But especially before there was such a number of freely available open source licenses, software vendors licensed tons of stuff, often in source form, often without permission to relicense the source, and over time developers refactored the licensed code, which makes it hard to trace code back. Especially since version control often was done by having different sets of floppies, which are all gone.
what a lovely thought. Thanks for the effort, even tho it didnt pan out. if you've got the dvd torrent it out :)
now im wondering if there's a stratus emulator anywhere and/or the os code. Them things were nasty... individually battery backed hard drives was just the beginning. The slot cards looked like someone had dumped yellow patchwire spaghetti all over them.
If you ever bump into this person again please consider suggesting this. If they don't feel comfortable releasing it to the public directly, there should be contacts at archive.org that would help releasing it anonymously.
It's like the burning of the Library of Alexandria all over again.
We don't know exactly what was in the library when it burned. We assume it was all great works of intellectualism, but it could very well have been the fanfics of their time.
From Wikipedia: "Scholars have interpreted Cassius Dio's wording to indicate that the fire did not actually destroy the entire Library itself, but rather only a warehouse located near the docks being used by the Library to house scrolls"
If anything this would make the analogy even more apt, since only part of Yahoo is being destroyed. :)
Regardless, it's mostly used as a metaphor for the destruction of knowledge at this point.
Too often historical events turn out to be perfectly true, but claimed to be myths due to dizzying semantic distinctions.
Just looking at the third link, the most upvoted answer agrees that humanity suffered a significant loss of important information. And the 'myth' is just an asinine distinction regarding whether the loss was literally due to fire, or whether the information was lost due to some other cause. I think declaring it a myth in a conversation like this misses the point (it certainly isn't a distinction relevant to the original comparison made here to Yahoo Groups) and just serves to confuse people.
It's quite clear the library is no longer here. How exactly it was lost does matter as its destruction has been used to paint various groups as anti intellectual barbarians since ancient times. Eliminating the story as a weapon to attack others would do humanity some good.
These articles seem more concerned with detailing how important it is that it wasn't Christians. Makes sense for an organization centered around "religion and public life", I guess. Quite the angle.
It's quite important that it wasn't Christians. A large part of the public understanding of history is based on a belief that progress through the early Middle Ages was held back primarily by Christian repression of free thought. There are people who very seriously believe that we'd be flying between stars by now if Christianity had never become predominant.
You don't have to be a Christian apologist to think that it's important for people to understand history correctly.
Do people generally think it was Christians? Without looking it up, I would have said "barbarians", which may not rule out Christians but doesn't specify them either.
I think the majority of people have never thought about it one way or the other (and would probably think similarly to you), but there is a substantial group of people who do. While it's by no means predominant, you come across the idea with fair regularity on atheist discussion boards.
Wait, the library wasn't lost due to that fire, but the contents were slowly lost due to the passage of time and people not caring or not having access to copy its contents? That makes the analogy way better, but the "burning" part is sadly wrong.
Yes, that is exactly what I wanted to convey by "lack of efforts".
2000 years ago, as a civilization, even if we failed to care enough for the Works stored in the Library, their loss would not have happened if access was not limited, which would have helped in their dissemination and issuing of copies.
Today, as a civilization, if we fail to implement the right process to back up what matters to us in time, we will repeat the same errors as our ancestors.
I guess many historians today would prefer to see those non-existent backups of the Alexandria Library rather than those of Yahoo Groups, but who knows what is more important after all ;)
The main difference is that back then, "backup" meant copying everything by hand, and now it means one simple copy-paste. Considering the size and price of modern hard drives, and the relatively small size of old archives, any one individual can back up a huge amount of data (and even offer/share it as a download link/torrent seed/etc).
Their whole Library would probably fit on even the smallest SD card available now.
Well, to some degree it is a liability. It just took this long and some accidents for them to finally figure it out.
That attitude will create a problem - a.k.a. opportunity - for others to come in and solve. Google got rich by scraping the internet and solving the headache of how to find decent content. If there's value in some of this data headed to the dump, it gives a chance for someone to do the same. Who knows, they might even find a way to do it in a privacy-respecting manner.
That doesn't work on SSDs, and the data might even be theoretically recoverable on HDDs:
https://security.stackexchange.com/questions/12503/can-wiped...
> Therefore, you should assume there is no reliable way to securely erase individual files on a SSD; you need to sanitize the whole drive, as an entire unit.
There's a reason why when security is deemed important, the storage is physically destroyed instead.
I meant for the sake of data protection, not for forensics. You start with all ones and gradually deplete your ability to write ones over time in electron charge memories such as SSDs.
This is not about companies following best practices but about what is going to happen when some of the supposedly deleted data pops up again, as it eventually will.
Will a judge that is clueless about how computers really work consider that as a GDPR violation or not ? As deliberate or not ?
there are a few groups i was a member of like lifters https://groups.yahoo.com/neo/groups/Lifters/info which was an intensive technical development group in the field of propellerless, rocketless, jetless flight using only electronic high voltage.
also some of the politics groups were a great time capsule for around the clinton/bush election era
a lot of earthquake researchers gathered on several earthquake groups as well, including caltech seismologists and advanced amateurs, many of whom aren't around anymore.
also some of the info in these groups can be used to defeat patent applications as they show evidence of public prior concepts and art.
yahoogroups consisted of somewhat more technically advanced users than modern website users like reddit etc because they were earlier and somewhat harder to use.
its a lot of good quality content.
also in the early days on these groups spam and massive controlled astroturfing account groups was pretty rare.
this is like losing 15 years of ancient Sumerian writings in a very interesting early time for the Internet.
This is a wake-up call to the entire world: we cannot take internet history for granted. We need affordable, decentralized means with long-term economic incentives to archive the digital world.
In a way, the digital world is far more fragile than the physical world. And the time to solve this is now.
IIRC, Archive.org is still running its fundraiser today.
We need LOTS of publicly-sponsored and paid-for digital archival centers that, like libraries, are maintained for the common welfare. Or we could, you know, add that duty (and funding) to existing libraries! With -paid- archivists!
What prevents Verizon from donating the Yahoo Groups database to the Internet Archive? What does Verizon have to gain from preventing the archival of Yahoo Groups?
Companies don't typically operate that way. All else being equal (especially when there's no $$$ in it for them) when given the choice between doing something and doing nothing, they usually choose to do nothing. It's often not malicious, but an overabundance of caution. (i.e. lawyers raising red flags about liability, 'our IP' etc... it's a real pain even from the inside getting large companies to do anything different from the status quo)
My bet would be that Verizon's network monitoring system/team sees the archive team's attempts as some sort of anomaly to be stopped. It's possible, though I wouldn't bet on it given Verizon's history re: public relations, that making noise might alter the equation and get them to allow the archive team to continue.
It is kind of incredible that they are expecting to be protected by IP laws, and yet aren't willing to put the slightest effort to archive the content that they are taking down...
Maybe those who care (we?) could organize a campaign to get customers to commit to leaving Verizon if they let the messages be deleted without archive? That would convert it into the language they understand.
To raise the perceived threat level, many folks could support in building tooling or docs to help ppl migrate as easily and streamlined as possible, to minimize the tax on consumer time that they rely on. (E.g., help on comparable plans, cheat sheet for call centre keywords, etc.)
Maybe something team "Do Not Pay" could help run with...! [1]
Oh God, I'm that guy. I'd been following this elsewhere, so didn't actually expect I'd get new info from the link itself :/ [opens mouth, inserts foot]
`rm -rf /` is objectively free from Verizon's perspective.
Paying lawyers to examine the fine details and determine what liability may arise from publishing a database dump or the software that can view the dump's contents is not free.
You mean, "tar up" multiple databases across possibly multiple data centers + all related files uploaded to those groups (also possibly spread across multiple datacenters) while preserving full integrity and making sure that there's accompanying documentation on how to set all this up and run?
You tell me how much work it would be.
Compare that to pulling the plug and getting the servers over to a landfill.
I can imagine it's easier and safer (from a legal perspective) to just delete the data and therefore no longer be responsible for the content. Twitter wants to delete older Twitter accounts because they're required to by law under the GDPR.
I mean, the GDPR makes things kind of difficult in this regard, and I suspect even archives are liable if somebody takes an issue with content they are hosting.
This seems relatively cheap to fix. Spin off Yahoo Groups as a new corporation, and have that corporation subsequently donate all its assets. If the corporation somehow manages to get sued, it doesn't really matter, since it has no assets.
No non-privately owned company would ever willingly put itself through the legal and tax requirements for spinning off a new company with part of its assets just to do the right, non-profitable thing, with those assets.
Also, in my opinion, no privately owned company either, unless the owner was soon dying of something and wanted to get in good with their creator.
I'd assume the law is smarter than this, because companies would otherwise continually spin off new corporations to get rid of their liabilities, with no assets, as a sort of lightning rod for lawsuits.
When you create the SPV in advance, it's very clear what part of the work done by the organization attaches to it (because the organization ensures that all its processes explicitly specify the legal compartment they're running under.)
When you create an SPV after-the-fact, you have to go back and reverse-engineer a separation of liabilities from documents that don't specify whether they're work done for the organization or the SPV (because the SPV didn't exist.)
It's like a divorce. (Or, for an even more on-the-nose analogy, it's like trying to use a condom after-the-fact by extracting any bodily contamination and putting it in the condom.)
If Yahoo Groups has a GDPR obligation now (and it's not clear that they do), they don't erase that obligation by spinning up a different company and dumping all this personal data into that new company - that would be its own GDPR breach.
Why not? According to GDPR someone can show up and (1) request fixing personal data (PII) like a nickname - this is the data accuracy requirement; in fact, according to GDPR Yahoo should do the data accuracy check (for instance, send a reminder to the user to check their data). (2) Someone can file a data portability request, and Yahoo needs to provide this. (3) Someone can request data removal. (4) Yahoo has to manage user consents for anything they do with that data.
For a product that does not bring in any revenue, or any significant revenue, it is better to dump everything and simply not be associated with the data any longer.
That's the side effect of GDPR, it is hard from the technical and financial perspective to maintain anything free on the Internet that keeps user's data.
GDPR has an actual archive exception to the "right to be forgotten", art. 17, §3d [0]. IANAL, so I don't want to say if it covers this archival, but I would hope so.
Anything being archived by archive.org is pretty clearly being done in the public interest. If it was something like Equifax archiving the data to use as a factor in people's credit scores then it would be much more ambiguous.
> Twitter wants to delete older Twitter accounts because they're required to by law under the GDPR.
So, by analogy, if Twitter did allow people to download an archive of any public Twitter account's history... what would the GDPR require them to do? Wrap those archives in some sort of auto-expiring DRM?
One of Verizon's spokespeople was literally Darth Vader. "Ma Bell has you by the calls".
Large corporations are not anthropomorphic entities, regardless of their disarming branding. Rather they are amoral bureaucracies, likely administered by people who have learned to ignore their empathy to get there. Verizon won't change course to accommodate the Internet Archive or general Internet community any more than a combine would pause for a field mouse.
We have examples of content that was destroyed because it was deemed trivial at the time, one example being the BBC's policy of erasing its television shows so the tape could be used for new shows. The policy began with the idea that a television broadcast was a temporary communication like radio, and really, what possible reason could there be for people in the future to want to watch things like comedy shows, Dr Who, or news programs from the 60s, or the BBC's coverage of the Apollo moon landing? Surely the value of these cultural artifacts was not as great as the cost of video tape?
https://en.wikipedia.org/wiki/Wiping#BBC
The "dark side" of web scrapers has always been one step ahead with things like IP bans and CAPTCHA solvers, maybe it's time to get their assistance... as the old saying goes, "an enemy of an enemy is a friend".
People who personally have 100,000 Yahoo accounts because they made them back when you could just pretend to be blind and request the captcha in spoken form, and then fed it into Google's speech to text engine, fed it back in to Yahoo, made the accounts, and who also have a botnet of a million residential IPs and can spin up a bunch of servers to run some scrapers.
The shady SEO people (including the social media account farmers) and the spammers, who seem to always find a way around everything that's put in place against them.
In the early 2000s there existed two main ecosystems in mobile software, J2ME and BREW (not counting Symbian), the latter operated by Verizon. I had cofounded a QA consulting company that relied heavily on BREW's extensive developer portal. Then one day, without warning, the developer portal disappeared. Luckily I had the foresight to download all the documentation a week before. My cofounder, a Microsoft developer, was dumbfounded.
Yes, this was incredibly sudden, and with no support for getting out. They gave 13 days notice of intention to shut down new additions to message archives (extended to 20 days after some commotion). That was October 21, I believe. They have offered a broken group downloader that produces incomplete results. Desperate group owners have been using a piece of Windows software called PGDownload, but Verizon has blocked that. Now the only organized effort is being actively interfered with. Dumbfounding is indeed the word.
Please be aware that historically spamming media representatives has the opposite of the intended effect. A few emails to make it clear that it's coming from a group instead of just one individual can help, but at the point where it becomes saturating inbox noise it tends to get ignored.
It's not like interacting with political representatives or corporate PR/executive types where you're conveying the size of the interested party, in this case newsworthiness doesn't necessarily depend on how many people are sending the email.
1) I have been a member of a group for many years (Gann study group) . Last Friday I received a notification from the owner who was explaining the group was closing so he set up a new one somewhere else.
I thought it would be nice if I made a backup. So I found a python script on github (there are dozens of scripts in various languages there which can be used to back up a yahoo group).
It took me a couple of minutes to get it working, and then a while later, voila! I had it nicely packed on my hard drive.
So why is it so hard to back up a group? I don't understand the problem.
2) "A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted."
What? Well OK why not.. But? They are a phone company. There must be someone able to scrape all this data? I don't get it? There are so many ways to extract data from yahoo group.
Most people running these groups are not technical. Even if they got the word in time, the only option many of them could find was PGOnline, a Win pay software, which by this point Yahoo has blocked. Furthermore, even if they got it, what do they do with it? For many groups, the archives are a resource to be referred to. They need to be hosted somewhere, preferably with some kind of front-end search engine. Even better if the search engine integrates with any new posts on the forum they move to.
The Archive Team has been taking requests for backups of groups for people who don't have the technical facility to run the python scripts. They then intend to make them available on the internet archive. The next project is making some kind of front end, in case group owners want to host that somewhere. Some of us, for example, will be doing that behind some kind of a forum login, so it won't be search engine indexed.
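For anyone wondering what the simplest possible "front end" over a saved archive could look like, here is a minimal sketch. It assumes the backup ended up in mbox format (an assumption, the scripts mentioned here may produce something else) and uses only Python's standard `mailbox` module. The function name `search_mbox` is made up for illustration; a real front end would want indexing, threading and attachment handling.

```python
import mailbox


def search_mbox(path, keyword):
    """Return (subject, sender) pairs for messages mentioning keyword.

    Assumes the archive at `path` is in mbox format. Only the subject
    and the body of non-multipart messages are searched.
    """
    keyword = keyword.lower()
    hits = []
    box = mailbox.mbox(path)
    try:
        for msg in box:
            subject = str(msg.get("Subject", ""))
            if msg.is_multipart():
                body = b""  # this sketch skips multipart bodies
            else:
                body = msg.get_payload(decode=True) or b""
            text = subject + " " + body.decode("utf-8", errors="replace")
            if keyword in text.lower():
                hits.append((subject, str(msg.get("From", ""))))
    finally:
        box.close()
    return hits
```

Something like `search_mbox("group.mbox", "antenna")` would then list matching posts; the linear scan is fine for a single group but would need a real index for anything bigger.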
As for your point 2, that was cut/pasted from the link in the OP, where it's describing that many groups are still using the platform. More relevant to this project, is that many groups are losing their archives, and those archives contain anything from scientific data, to hobbyist & howto information, to art and literature, etc.
I agree on the first point. The second is perhaps understandable if you read the whole exchange. You know they initially gave us 13 days before they cut off storing any more of the group emails (that is, new emails)? With an outcry, they increased that to 20. Many thousands of people were scrambling to find a new home. We are now reaching the end of the line (the last week) before the archives themselves are gone, and they have blocked the main concerted attempt to save some of that history. So, some level of frustration is in order.
A lack of emotional control is usually understandable. But it suggests a lack of care and focus that does not befit an important effort. I learned years ago to never send an email or text or make a call when angry. I always thank myself the next day when I am able to choose my words more tactfully. That email makes them look like a group of angry trolls.
I'm wholeheartedly supporting the archival effort but was wondering exactly the same thing about the person communicating with Verizon. Her argumentation comes off as quite immature, and she's not making much sense with all that rambling.
Saying stuff such as this sounds pretentious and will unfortunately only get laughed at by anyone in the corporate world: "So the best thing Verizon could do, since they are just going to throw us all into the trash anyway, as we aren’t important to them, is let us get our archives any way we can.
The terms of service really should not apply to people who have been told, we’re gonna delete you from existence. If it’s lawful for us to get them from you, in broken buggy and virus ridden state, it’s just as lawful for us to get them ourselves."
As it is right now, she's just not doing any favors to the archivist community out there. Perhaps someone with proper communication skills and better nerves should take up that role? This is not a time to play a martyr and throw a fit while expecting Verizon to meet you half-way.
Things like this are a good answer to when people question why internet centralization and walled gardens matter. If these things were hosted across thousands of servers, federated, or under a license that made them able to be copied, there would be no issue. This is only an issue in the first place because people posted content in a place and manner that made them give up ownership to it. One day, perhaps decades from now, Facebook is going to face the same problem. Twitter would, too, if it wasn't being archived by the Library of Congress.
Verizon claimed that the archivists violated the "terms of service" [1], but I couldn't find any reference to automation, downloading, crawling, or denial of service attacks that might apply.
Does anyone have an idea of exactly what term or terms were violated by the archivists?
Just playing a devil's advocate here. The way archivists are downloading the data can be said to disrupt the services, which is mentioned in the terms of service:
2. d. viii: "interfere with or disrupt the Services or servers, systems or networks connected to the Services in any way."
I'd also like to point out that the apparent spokesperson Brenda Fowler said in her open letter to Verizon, that "If the problem is that all our attempts to rescue our archives in the time we have left is causing an overload or strain on your servers, then stop making us HAVE to work around the clock, and GIVE US MORE TIME. ..." Probably not the wisest thing to say right now.
Also, archiving the groups with automated tools is against the Use of Services rule, that states the following:
2. e: "Use of Services. You must follow any guidelines or policies associated with the Services. You must not misuse or interfere with the Services or try to access them using a method other than the interface and the instructions that we provide. ..."
As I mentioned in another comment, I really support the cause and am a big fan of archiving myself, but it's unfortunately quite clear that Verizon is right in calling out the violations of the "terms of service".
Using the interface wouldn't block scrapers, yes? They do use the interface. But, this is academic I think. They offer a broken way to get our stuff, and say that we can't do anything else. Should we acquiesce to this?
As for bogging down the servers, my understanding was different from what the author said. They hadn't started to archive, but were in script testing mode and were accumulating yahoo accounts. What I saw of their activities, they were very careful about not overloading the servers. (I know that because I was backing up my own groups independently at the time, and I was able to do it. Luckily.)
Correct. They had done some testing, but that's all. They were just getting yahoo id's, while iterating on software improvements, so they could then download the groups.
I had just recently been reading about Arweave [0], a sort of distributed file storage that claims to permanently store files/webpages using various incentives.
Seems like something like this would be a good way to archive this sort of information or build sites like Yahoo groups on top of this file storage in the first place.
Arweave is doing great stuff but I think it'd still run into a similar situation as archive.org -- Check out the arweave discord dev community if you haven't though!
Just thinking out loud: This makes me wonder if we can learn from this and prepare to backup other (similar) platforms that hold such an amount of data and might go away some day. Building the backup tools today and ideally starting to backup now, making the process incremental so you can run it every now and then and only scrape the new stuff.
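To make the "incremental" part concrete, here is a rough sketch of the idea: remember which item IDs were already saved and only fetch the rest on the next run. `fetch_ids` and `fetch_item` are hypothetical callables you would have to write per platform (they stand in for whatever scraping or API access is actually available), and the `seen.json` state file is just one possible way to track progress.

```python
import json
from pathlib import Path


def incremental_backup(fetch_ids, fetch_item, out_dir):
    """Save only items that are not yet recorded in out_dir/seen.json.

    fetch_ids()  -> iterable of item IDs (all strings, hypothetical)
    fetch_item(i) -> text content of item i (hypothetical)
    Returns the list of IDs downloaded during this run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    state_file = out / "seen.json"
    if state_file.exists():
        seen = set(json.loads(state_file.read_text(encoding="utf-8")))
    else:
        seen = set()

    new_ids = [i for i in fetch_ids() if i not in seen]
    for item_id in new_ids:
        content = fetch_item(item_id)
        (out / f"{item_id}.txt").write_text(content, encoding="utf-8")
        seen.add(item_id)
        # Persist state after every item so an interrupted run can resume.
        state_file.write_text(json.dumps(sorted(seen)), encoding="utf-8")
    return new_ids
```

Running it from cron every now and then would only pull the new stuff, as long as the platform exposes stable IDs. Edited or deleted posts are not handled here.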
Modern Web3.0 portals that are built on async JS will be impossible to archive without hitting API limits or resource quotas.
New Reddit (without the old.reddit.com interface), for example.
Many niche subreddits contain lots of information that would be lost if reddit dies (or just deletes these subreddits).
Youtube is unarchivable in principle due to the high amount of storage required (even thinking of 640x480), and yet it still contains tons of unique content found nowhere else, from rare AMVs (that survived prior deletions) to instructions to repair telescopes, or basically anything in video form that doesn't have backups (i.e. not uploaded to other video sites).
4chan and similar sites are archived by several sites in a haphazard manner (only boards they like), and yet it is a huge chunk of internet culture that is going to be lost if these sites die (and that's more probable than with Reddit due to less funding).
Usenet is slowly fading into obscurity and dependence on Google Groups.
Many forums that exist today will not exist forever, yet very few are archived anywhere else.
Other forum-like sites like Stackoverflow and Quora might disappear in the future with nothing replacing them. Github is subject to Microsoft's whims and positions on open source. Wikipedia and various wiki farm sites don't have many revenue streams.
Practically every major website we take for granted is vulnerable - people thought Yahoo Groups was going to last forever.
And these are the dangers of relying on a private, corporate, for-profit, law-bound organization. They're susceptible to abiding by the laws and of course, there is a cost attached to all of this.
Exploiting a free resource, as we all do these days (reddit, youtube, facebook, hackernews itself etc) is all well and good but maintaining history is expensive (content needs moderating, you are required to abide by the GDPR and DMCA, there may be disputes about content on the platform).
I mean, Google+, MySpace, Bebo, IMDB comments are now dead and gone; how useful was the data really? I'm sure some people might go to archives, but I would imagine 95% of the data is just "rot" that has no value or substance.
History is lost all the time. We barely know what we've been up to for the last few thousand years; only now can we document our world so extensively, with the precision and quality afforded to us.
But in the end, time moves on and some of that history is lost. It hurts, but who's to say any archived history will be preserved anyhow? We're still relying on our storage technology being readable years/decades/centuries from now, which is not a given.
Maintaining a static archive is remarkably inexpensive. The total amount of textual data included in even Google+ was likely only a few hundred GB. Images and multimedia, of course, would have been far more, though sampling-based estimates suggest that these were a few hundred KB each, on average, on about 30% of all posts.
The mean post size on G+ was roughly the same as on Twitter: about 120 characters. (Quite possibly because most G+ posts were themselves repurposed Twitter content.)
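As a rough sketch of the arithmetic behind that estimate: the post count used in the usage note is a made-up placeholder, not a figure from this thread, and `estimate_archive_size`, its parameter names, and the 300 KB image size (standing in for "a few hundred KB") are all assumptions for illustration.

```python
def estimate_archive_size(n_posts, mean_chars=120, bytes_per_char=1,
                          image_fraction=0.30, mean_image_kb=300):
    """Back-of-envelope size of a static archive (decimal units).

    n_posts is hypothetical; mean_chars and image_fraction follow the
    figures quoted above; mean_image_kb=300 is an assumed stand-in for
    "a few hundred KB each".
    """
    # Text: posts x characters per post x bytes per character.
    text_bytes = n_posts * mean_chars * bytes_per_char
    # Images: only a fraction of posts carry one.
    image_bytes = n_posts * image_fraction * mean_image_kb * 1000
    return {
        "text_gb": text_bytes / 1e9,
        "images_tb": image_bytes / 1e12,
    }
```

With a placeholder of one billion posts, `estimate_archive_size(1_000_000_000)` comes out at roughly 120 GB of text against roughly 90 TB of images, which is the gap between text and multimedia described above.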
Static content does not require ongoing moderation, though it's possible that problematic content will be periodically identified.
The bigger challenge is actually in the publishing engines. Even where these are static, it's possible that vulnerabilities will be identified. That was Google's (not especially convincing) excuse.
A challenge of the Internet Archive / Archive Team method of archival and access is that in preserving the original formatting and packaging of content, the bandwidth and storage requirements are increased tremendously. By about two orders of magnitude in the case of G+.
Were the Archive to focus on the actual originally-authored content rather than all the associated chrome, both factors would be tremendously reduced.
While I agree with your first point, and tried to get groups I was associated with to move for years, nevertheless there are groups there that engaged in community driven research and have important data uploaded there. (This is my main concern, though other groups were focused on different issues - uploaded art, for example.) So I think while we need to educate people about not using centralized providers like Yahoo and Google, right now we need to focus on getting someone at Verizon/Yahoo to respond to this urgent situation.
We cannot expect a private company to continue paying for resources they don't want to.
But giving an "export all the data in xml/json/whatever" button, and maybe even open-sourcing the now-abandoned component serving this data, would be a nice move. The first part could even become a regulatory requirement some day.
> maintaining history is expensive (content needs moderating, you are required to abide by the GDPR and DMCA, there may be disputes about content on the platform).
Things shouldn't be like this. The price per unit of storage and bandwidth falls fast (and, except for the sites dealing with user-generated videos, faster than the amount and size of content grows). Laws shouldn't apply retroactively.
The problem really is that our means of accessing information are services. When you have a physical letter, or an e-mail saved locally, or a text message from 15 years ago, you can just read them. Nobody will know or care. Nobody will come after you trying to apply GDPR or DMCA retroactively. And since storage is near-free, you won't ever lose it until you forget about it (or at least about doing regular backups). Whereas with modern webmail, forums, link aggregators, IMs - you don't have even your own messages, and viewing a conversation that happened 15 years ago is really being provided a service today. Services are ephemeral, they're also subject to ever-changing regulations and whims of the service providers.
Bottom line, while services are necessary for transferring conversations, we really shouldn't be relying on them for access to conversations that already happened.
If you are a company, GDPR does apply to data on physical letters and local emails. A large part of the preparation for the introduction of GDPR enforcement was companies getting a handle on what they had stored in various media.
Actually, email and letters are an area where the GDPR falls short in some countries, especially Germany. Since the constitution is basically above the GDPR, depending on the letter/email, its content does not need to be acknowledged or shown (GDPR also means you can access your data) to the person who wants their data deleted/shown/whatever.
All true, but costs of hosting and serving aside, there is a non-zero legal cost with hosting and serving the content. Blame bureaucrats, parasite lawyers, and our litigious society.
Those costs reflect the actual social costs of that hosting. Prior to GDPR and similar legislation, those risks were externalised onto users and society at large. They're now being shifted, properly, to where they should have been borne in the first place, on the service providers themselves.
Blame risk-externalising business practices and willful ignorance.
What social cost is there to distributing content contributed by people who agreed to terms according to those terms? Users transmitted data about themselves to a party after reading that party's terms of service and agreeing to the things it promised to do with the data. To paraphrase a popular talking point, two consenting IP addresses should be able to send whatever data they want between each other.
2. Technical capabilities have expanded massively. When Yahoo Groups launched, enterprise storage of more than a few hundred GB was highly unusual. I worked for a Very Impressive Service Agency which was lucky to claim two Sun Starfire servers, only one of which was Large File (> 2 GB) at about the time, for analytic use.
By the late 2000s, AOL were deploying massive-RAM based systems to be able to perform whole-dataset operations in memory.
For the past ~5-8 years, large-scale SSD drives have been A Thing, now available in the terabyte range, for a price. Again, the level of analysis and exploration possible has made tremendous leaps.
3. There is the concept of manifest vs. latent functions, and awareness. The full realm of possibilities of technical systems are rarely apparent to their creators, let alone nontechnical users. See (very generally): https://en.wikipedia.org/wiki/Manifest_and_latent_functions_...
The marketing and disclosures of such services rarely include such disclaimers as "use of this system may subject you to a lifetime of personal and social profiling, grammar-based context analysis, GD ML AI based image content analysis, and imperil the global liberal social democratic experiment."
Hiding behind the figleaf of "you should have considered all possible future implications of your present actions and will have no future recourse" is grossly flawed, and quite frankly, professional malfeasance and malice aforethought given current understanding.
The awareness of risks has changed, and is unambiguous. Providers should foot the costs, or mitigate them accordingly.
(I suspect that, at least in part, the actions of Yahoo, Google, and others reflect this changed awareness, though I'm not aware that any providers have explicitly stated this.)
Again: the risks always existed. The previous state was made possible only by pretending they did not. They do. Practices must change.
Social cost would be at best very difficult to quantify, though, making it quite hard to handle. "Increased partisan tensions" due to social media, for instance, is not the sort of thing the cost of which one can quantify and mitigate.
I take your point that the things which can be done with collected information are constantly in flux, and I agree that the ability to retroactively change terms of service to cover previously-collected data is ridiculous and implies an illusory contract which is not legally valid. No one should be able to run through a neural net data collected in the nineties. However, it's also not reasonable to demand that old data be removed, as it's produced at least as much by the server as by the client (e.g. access logs are typically produced by server-side monitoring of server-side software). The most sensible option is for companies to require explicit agreement to TOS changes to continue using the service, and use new data only under that policy while using the old data under the old policy. It's additional compliance overhead, certainly, but it's no different from how a client contract would be treated.
> professional malfeasance and malice aforethought
You are not the arbiter of such things, but thank you for your opinion. There's also a site guideline about assuming good faith, so you're in violation of that.
My own thinking on this has evolved very considerably over the past five years or so. That's included a comprehensive and ongoing exploration of the fields of media, communications, epistemology, and several others, related to this. I'd long seen computers as technology, largely independent of social implications. I now see these as utterly inextricably linked, and with implications that are anything but predictably benign.
Costs being difficult to assess does not mean impossible, and the notions of probability and risk are central to all finance, investment, and insurance. Uncertainty is NOT an absolute lack of knowledge.
Among the principles that becomes apparent is that changes in informational regimes have profound impacts upon societies, and that this is a pattern which can be traced back through history to the invention of writing itself, and via indirect anthropological evidence likely to the emergence of speech.
The principle transcends humans themselves -- a leading theory for the Cambrian Explosion is that it was a consequence, effectively, of structuring and communications mechanisms within organisms developing, and allowing the creation of complex body plans, and not merely single-celled organisms or masses or colonies of cells.
For media, see especially Elizabeth Eisenstein's The Printing Press as an Agent of Change and Marshall McLuhan's The Gutenberg Galaxy. The link between mass media and totalitarian, fascist, authoritarian, and nationalist sentiments has long been observed (Hannah Arendt, Dwight MacDonald, the Frankfurt School, Edward Herman & Noam Chomsky, Adam Curtis).
I've been impressed by the insight into, or occasionally the lack of awareness of, the potential perils of comprehensive data archives among pioneers within the data field.
Paul Baran, co-inventor of packet-based networking, wrote "On the Engineer's Responsibility in Protecting Privacy" (https://www.rand.org/pubs/papers/P3829.html) in 1968, some 51 years ago. In it he remarked on both the risks, and industry attitudes:
There are many amongst us who would not hesitate to build equipment to compromise the privacy of any given individual provided the price is right. These are the whores of industry. They would not hesitate building systems and devices contrary to the public interest; their only concern is the buck.
The full paper, and in fact, all of Baran's RAND publications, are online in full-text, following my request to RAND. I remain grateful to them for this.
Baran was also interviewed for a 1966 BBC documentary:
"Well, he who has access to information controls the game. This is very dangerous. I think both your country and mine have never trusted the government completely. We do so for good reason. Here we have a mechanism that could be abused. Here we have a mechanism that would allow the creation of a dictator. . .
I've yet to see an expression by anyone in Congress about this new type of danger. In fact, we see proposals for centralizing information, we see proposals for rushing ahead into new, more efficient computer information systems, and very little thought is being given to the dangers of the misuse of these systems. . . I ask a lot of people about privacy, why they valued it, and I was surprised by the number of people who said "Well, I don't do anything wrong. Why should I worry about privacy?" And then, on the other hand, I think there's a more wise group that says, 'Privacy is really the right to be wrong, then go on and live the rest of your life, without having it mark you forever.' I tend to think this latter view is the view we should hold.
Another view was expressed by AI pioneer and Nobel Laureate (economics) Herbert Simon:
"The privacy issue has been raised most insistently with respect to the creation and maintenance of longitudinal data files that assemble information about persons from a multitude of sources. Files of this kind would be highly valuable for many kinds of economic and social research, but they are bought at too high a price if they endanger human freedom or seriously enhance the opportunities of blackmailers. While such dangers should not be ignored, it should be noted that the lack of comprehensive data files has never been the limiting barrier to the suppression of human freedom. The Watergate criminals made extensive, if unskillful, use of electronics, but no computer played a role in their conspiracy. The Nazis operated with horrifying effectiveness and thoroughness without the benefits of any kind of mechanized data processing."
There is, of course, one slight problem with Simon's argument: The Nazis did make heavy use of mechanised data processing, provided and supported by IBM. Edwin Black documents this meticulously in his book IBM and the Holocaust:
A very substantial portion (~98% of all public posts) of Google+ was successfully archived, at the Internet Archive, thanks to the Archive Team. As a longtime G+ user, and one of the organisers behind the G+ "Plexodus", the existence, assistance, and capabilities of the Archive Team were hugely appreciated.
AT and the Internet Archive have succeeded in preserving other content, though not all projects are successful. You can see a partial listing at https://www.archiveteam.org/
Unlike numerous other shutdowns, Google announced the G+ shutdown well in advance, though they "accelerated" the schedule twice, from "sometime in August 2019" to April 1, 2019, the eventual shutdown date. The tools Google offered for archiving and migrating content, whilst among the best in the industry (an exceptionally low bar), were incredibly insufficient: buggy, incomplete, duplicative, and not readily portable. It was largely thanks to third-party tools and assistance -- the Friends+Me Google+ archiver and ArchiveTeam most especially -- that meaningful preservation was possible.
The conceit of large-scale, free-to-use services has been convenience, capability, and trust, the last a point Google explicitly made in its original G+ announcement:
You and over a billion others trust Google, and we don’t take this lightly. In fact we’ve focused on the user for over a decade: liberating data, working for an open Internet, and respecting people’s freedom to be who they want to be. We realize, however, that Google+ is a different kind of project, requiring a different kind of focus—on you. That’s why we’re giving you more ways to stay private or go public; more meaningful choices around your friends and your data....
And in actively opposing archival efforts, Google, Yahoo, Flickr, and others are violating that trust only so much the more.
In the G+ shutdown, it was the active dismissal, obstruction, and interference of Google and its user-based support team (the so-called "Top Contributors") which were most disappointing. Long-time Google supporter Lauren Weinstein made this point specifically and repeatedly:
I'll note that this tends to strongly reduce the value proposition of all Web 2.0 / SaaS offerings, given that even the very largest and wealthiest companies are willing to act in this manner.
The consistency of this behaviour and attitude across multiple service providers makes me think that the behaviour and practices are not coincidental or unintentional.
I got into this tangentially because of a community and ecosystem of Y-groups that I've been involved in. When I found the Archive Team's efforts, I hitched my wagon - though I'm not at all central to that group.
This is really unfortunate. There's an amateur microscopy community on Yahoo groups with tons of information about old research microscopes. Corporations can really suck.
That only explains a decision to take them down in the first place. What can be gained by shredding all that information? Maybe they aren't shredding it. Maybe they just want no publicly available copies to exist.
It explains both. Blocking archiving will save a bunch of bandwidth as well as not having to scale up the servers for the load of dealing with the archiving.
I'll give you that. I wasn't thinking of a load spike, just that of the perpetually continued service.
But it doesn't excuse much. Capt. Obvious says, "it's temporary, it could have been anticipated, 'protecting' their datacenters from the load in exactly this way will almost certainly be interpreted as ill will, there are options not being taken such as throttling or even voluntarily sending it all to archive.org"
Create a service for long-term storage with an easy integration API; the idea would be that if you integrate your data with our service, and you eventually (maybe you're going out of business, or something) make a call to delete data, that data is first transferred to our service before deleting it on your end.
Integrating with us is basically like making a reservation in advance, so when you do perform the big delete like what's happening to these groups, it's offloaded to this service first.
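A minimal sketch of that "reserve now, transfer on delete" contract, assuming nothing about real storage or networking: `ArchiveService`, `Provider`, and every method name here are hypothetical, and plain in-memory dicts stand in for both the provider's database and the long-term store.

```python
class ArchiveService:
    """Hypothetical long-term store; a dict stands in for real storage."""

    def __init__(self):
        self._reservations = set()
        self._vault = {}

    def reserve(self, provider_id):
        # Integration step: the provider registers in advance.
        self._reservations.add(provider_id)

    def deposit(self, provider_id, key, data):
        if provider_id not in self._reservations:
            raise PermissionError(f"{provider_id} has no reservation")
        self._vault[(provider_id, key)] = data

    def fetch(self, provider_id, key):
        return self._vault[(provider_id, key)]


class Provider:
    """Hypothetical integrated service whose deletes go via the archive."""

    def __init__(self, provider_id, archive):
        self.provider_id = provider_id
        self.archive = archive
        self.store = {}
        archive.reserve(provider_id)

    def delete(self, key):
        data = self.store[key]
        # Transfer first; if the deposit raises, nothing is deleted locally.
        self.archive.deposit(self.provider_id, key, data)
        del self.store[key]
```

The ordering inside `delete` is the whole point: the local copy is removed only after the deposit call has returned without error.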
I have no good idea about how to store/structure the data, or how it would make money. But I also have no idea if there's an easier solution to problems like this, where you force users to scramble to save all their stuff somewhere. People would also begin judging services by whether their data will be saved once it's terminated (ie whether you integrate with us or not), so I feel like that would ultimately bring in a lot of customers.
With modifications, maybe. Like, you don't have to store everything on the service's end until you're ready to delete it. Also we don't seem to have a P2P service with the kind of contract I mentioned.
Redundancy is built into P2P, so deletion is not an issue.
(Popularity is - but systems like Flixxo have tried to tackle the issue...)
Also, what about IPFS?
I'm confused why Archive.org is attempting to archive and expose to the public what is essentially private communications?
My usage of Yahoo groups in the early 2000s was mostly to communicate with my high school / college / dorm groups and the last thing I want is for embarrassing messages from 20 years ago sent to a private group to be archived.
Clarification - we're not archive.org. Archive Team and Internet Archive are completely separate.
And we're only archiving things that "any guy on the internet" can see. If someone can access the messages simply by joining a group (with no moderator approval), I'd argue it's fair game.
We're not going to be unreasonable, though. If something private slips through and we receive a takedown request from the author, we typically remove it.
Reports that there is an extension of the deadline to download groups until Jan. 31 are a smokescreen put up by Verizon. The Dec. 14 deadline still holds, but they will take requests of users for their data until Jan 31. This is a fail because:
From Yahoo's reply to the modsandmembers group: "The Groups Download Manager will download any content an individual posted to Yahoo Groups. However, it will not download attachments and photos uploaded to the Group by other members. For those that are having difficulty with the files delivered, this help article explains the types of files within the .zip file sent ... "
The point made on IRC:
The important point is they won't give you the vast majority of photos and attachments. They also don't export databases at all, which are important for certain groups. It's completely arbitrary too, because they give you files other people posted...
Recently Verizon blocked all of my Yahoo accounts. I've spent some time trying to find any kind of support form to get them restored, with no luck. To get support you now need to pay money. Perhaps the Archive.org accounts fell under the same ban.
Verizon has stated in support emails that they were aware of Archive Team's efforts and specifically will not be un-banning our accounts.[0] I therefore think it likely that the banning was targeted.
We have groups that the owners can't even access any more. They demand a yahoo email even when there's a non-Yahoo email associated with it. Y-Groups has been badly broken for some time.
I don't see this as a big loss, and if I were someone who'd posted on Yahoo Groups, I'd be happy if it all disappeared. I don't consider this kind of content something that should be durable and everlasting. It's ephemeral conversation. If anything worth saving comes of conversation, it should be converted to another form and saved.
I'm glad I'm old enough that most of what I wrote on message boards as a teenager and young adult disappeared long before archive.org and similar sites existed. Conversations shouldn't last forever.
I'm part of an open-source project that has been around for a long time; the mailing lists (for good or bad) lived there for quite some time. There is a lot of design and troubleshooting history I would rather not be lost forever.
This is pretty standard. Google Groups bans me whenever I open too many tabs at once. It's annoying, but these big corps don't have any inclination to support user behavior other than standard, especially when closing down a product. I think the blocking bit is sensationalizing it a bit more than is the reality. The site policies and design intentionally prevent joining a group just to download everything, it's not an intentional blocking of a group or action beyond the default behavior of enforcing the TOS.
Well, it was actually intentional. While a lot of us were backing up our groups as individuals (whether as owners or just members), and having no trouble there, the Archive Team was taking requests. So there was volume involved, and they have been in touch with Yahoo who refused to be supportive of the archive effort. So their target really was these guys specifically. Not just the ToS, but these guys in the context of an archive effort at a point of "ToS" (Termination of Service).
This is why the change in ownership to private equity of .ORG TLD is problematic.
What if the owners of the .ORG TLD, or the owners after it is sold further down the line, prevent archiving? Or charge more for that? That would greatly affect web.archive.org and Wikipedia.org.
This move on preventing archival actions is probably set up to allow them to block it later, after the sale goes through, and say that it was a policy before they started it. A bit like how they lifted .ORG pricing limits just before the .ORG TLD was sold to Ethos Capital.
Wait. You're saying that the owners of the .org TLD could put restrictions on the use of those domain names? And something like "can't be used for archiving" is a legal possibility?
I admit to legal ignorance, but this does seem over the top.
I find it curious that at the same time we discuss the 'right to be forgotten' laws there's also the opposite problem of preventing the internet from forgetting something.
'right to be forgotten' laws are the result of the whole "numbers have owners" insanity, combined with the fact that the average person will mindlessly use random services to store private data.
In the US, dial 611, say "representative" at the prompt, then dial 3 for "something else."
I have Verizon prepaid, so switching is easy. I called and politely explained the situation to the customer service representative. I informed them that I will definitely switch away from Verizon if they delete the Yahoo Groups data without allowing archival. They promised to inform their manager and email me back.
Anyone know why the service is being shut down? Is it perhaps because of the California Consumer Privacy Act[0] that comes into effect in Jan 2020 and they know they can't possibly comply with these new regulations? That's my personal guess...
I'm an occasional yahoo groups user. I'm pleased to find that the owners of the groups I'm interested in have moved their content to groups.io.
These include mostly support groups for obsolete hardware and software. I would say that something like github would be a better place, except for copyright problems. Yahoo groups were better for quasi-legal archiving.
I'm wishing that Groups.io was hosted outside of the US for this reason..
i can't possibly be the only person here who has been expecting this type of fuckery for 10+ years, right?
it's taken me that long to get to a place where i feel confident that my own personal archive of data isn't going to disappear from control.
side note: it's too bad that ssbc isn't quite ready to deploy at scale for the types of folks who rely on Yahoo Groups for this service. sure seems like a great resource.
I have a disability group that actively discourages email messages and recommends - or, that is, recommended - that members use the Files section for answers to frequent questions. Email is for questions not answered in Files.
This group would be intolerable without a website, with the same questions asked over and over and over again. Yahoo is totally wrong that email is all groups need.
Yes, exactly. Other groups have medical tests uploaded to the files section, to be shared with other members and discussed on-list. And yet others had artistic works or literature.
This is very sad to hear. One particular piece of Yahoo groups that I particularly care about is the LTSpice group, which contained a lot of simulation models I've not been able to find anywhere else. Luckily, this particular one seems to have been migrated to https://groups.io/g/LTspice
Someone else mentioned that group on this thread. A group member. Scroll to the second page. He mentioned it along with a number of other groups worth archiving.
The lesson here isn't to run a Hail Mary effort on the last day to save Yahoo Groups. The lesson here is to back up things you care about before they start to disappear.
Is this stuff so important if not one of millions of users thought it was worth putting effort into saving?
Yahoo's takeover/shutdown was announced years ago.
Inertia. I'm connected with a series of groups and I tried to get people to move for years. It just doesn't seem anything happens except under pressure.
Also, the scripts have really been developing during these last 2 months, which is all the time they gave us when they said they'd be deleting all content. Until then, I didn't see anything that would get me all content including files/links/calendars/databases.
>We are receiving comments and messages from the frustrated and angry groups archivists, and some of those are posted below. You can send in your own if you e-mail it to owlsy@yahoo.com.
You're still using a yahoo.com email address to organize?
I have trouble understanding how Yahoo's moves are GDPR-compliant: if just one European citizen was part of the group to be archived, at least some if not all data in the group can be considered part of their personal data (as they were either sender or recipient of messages) and therefore there should be a mechanism allowing them to download that data.
Am I getting it wrong?
[Will be sent out to media. Thanks to those who are commenting here for useful bits of information that have been included.]
Verizon Media/Yahoo closing down Yahoo Groups, with imminent loss of important history.
Yahoo, now owned by Verizon, is in the process of partially closing down a two-decade-long service, Yahoo Groups. This email list-serve/bulletin board is both a discussion platform and a repository of history for a large array of communities. Verizon intends to make all such history unavailable after December 14th of this year, having provided less than two and a half months of warning and, due to a broken system, having failed to alert numerous group owners and members.
Some of the histories on this platform are important, not just to the early Internet, but even to academic research and scientific investigations. Examples are discussion groups that included WW-II veterans, many of whom are now deceased; queer and trans groups, some of whom have organized to try to preserve this history [1,2]; and minority/immigrant groups [3]. Those with scientific data include earthquake researchers, medical/biohacking groups, bird watchers with decades of data, and even material relevant to patent claims. Popular culture groups devoted to art have uploaded their works there, and fan-fiction groups exchanged stories and feedback [4]. Sadly, Verizon has not made arrangements to help people to preserve this material. This has resulted in individual Yahoo Group owners/moderators, most of whom do not have technical backgrounds, scrambling to find ways to download their archives. Yahoo claims they have provided a data exporter, but it is very incomplete. Groups include not only messages, but files, databases, calendars and other materials. Some more technical users have organized to take requests, crowd-sourcing the work to download these Groups. One such [4] has been in contact with the Internet Archive, with the objective of uploading whatever they can get to Archive.org. Unfortunately, for reasons we can't guess, Verizon has been actively obstructing these efforts [5]. We urgently request that if Verizon will not support efforts to preserve this material, then they at least allow some means that Group owners can download complete archives of their groups, and independent groups can download public Groups. For this, we also need more time.
Support can be offered by spreading the word on Twitter, Facebook, Reddit, Hacker News, and other social media.
This reminds me of my MSN Groups groups I had as a kid. I wish I could go back and see them, but I wasn't "big" enough, even though I had like 80 members.
Is this the future of Facebook, once personal data use becomes heavily regulated, the data becomes harder to monetize, and the next big thing rises on the horizon?
Probably. Now that the groups I am concerned with are migrating to our own hosted BB, we are also planning to migrate the associated FB groups away. For that, we expect to lose data in the process.
What was the original plan exactly? Subscribe to as many groups as possible and then wait until the last moment to grab the data? That would almost certainly have resulted in massive bandwidth problems and massive bans by Verizon in response at this point, failing the archival effort anyway.
The archiving scripts have been under development and testing since Yahoo first made their announcement. The plan was to get as much manual labor (volunteers solving CAPTCHAs to join groups) done while waiting for the scripts to be stable, reliable, and automatable.
Plan was to organize it, then carefully and thoughtfully balance out the load so that it didn't put undue burden on their servers. The entire plan was orchestrated in this way so that it wouldn't cause problems.
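To illustrate what "balancing out the load" can look like in code, here is a minimal sketch of a client-side throttle that enforces a minimum gap between requests. This is not Archive Team's actual tooling: `Throttle`, `fetch_all`, and the `fetch` callable are hypothetical names, and the interval is an arbitrary choice left to the caller.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the pacing can be tested
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            # Too early: sleep until the next permitted slot.
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval_s


def fetch_all(urls, fetch, throttle):
    """Fetch each URL in turn, pausing so the server sees a steady trickle."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

A single shared `Throttle` caps the request rate of one worker; spreading work across several workers would need a shared limiter, which this sketch does not attempt.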
Yahoo is going to keep the messages but just delete the art and other uploads or attachments to the messages, correct? Although apparently they will also make some groups private, essentially closing access.
A lot of people are going to groups.io, but I certainly wouldn't want to suggest anyone move to another centralized system. The best thing to do is probably just to get a VPS and install phpBB or something. For group owners, importing our Yahoo archives there is going to be the next task.
The point is they then own their data. If they want to move it to archive.org or anyplace else, they can. Unlike the situation we find ourselves in now.
If you read here you will see that there are many active groups. Even of those that aren't active, some have important history, even scientific value.
As for old posts disappearing, individuals have always been able to go to Yahoo Groups and delete posts that they later thought the better of. Group owners can also delete posts.
IDK if this is any help, but Verizon is holding their annual conference on Dec 10th (less than 2 days away as of this writing), with CEO Hans Vestberg presenting at 12:15 EST.
Anybody who has FU money: start filing papers for a lawsuit against Yahoo and notify them of your intention to do so. Pay for expedited filing. This is the worst kind of PR for a company and they'll do almost anything to avoid being sued in the first place. A settlement offer is Not An Option.
Anybody living in the SF Bay Area, Especially if you live in the Silicon Valley: Pay a visit to Yahoo Global HQ. Find out the borders of their property, park and picket right outside them.
Anybody with the resources: Make / Print a banner that says something like:
yahoo! #WENEEDMORETIME!
#YahooGroups
To drape over this pedestrian bridge located at [37.397547, -122.022673] https://goo.gl/maps/XknYg9H3BYWZPESF9
South Entrance Address: 101 E Ahwanee Ave, Sunnyvale, CA 94089 (Intersection of W. Ahwanee Ave. & Borregas Ave.)
P.S.: A good alternate location is the US-101 Mathilda Avenue Overpass, on the sidewalk southbound / northbound (37.398832, -122.027722)
Yahoo has sent emails to everyone allowing them to click a button and get a zip file prepared with everything they have posted or stored on every Yahoo Group they have ever participated in.
Blog author and people here believe that outsiders with no connection to these groups have an inherent right to download and republish all this information despite having no license to do so. Yahoo/Verizon seems to think differently.
There have been lots of reports that this 'Get My Data' function doesn't work. One of the demands made by the blog author is for Yahoo to fix this so it actually does work.
Additionally, the 'Get My Data' only gives you access to all the files / photos that you uploaded to the group. These archives should not be considered complete archives of the group - is it really reasonable to suggest that in order to completely back up a group, every member must complete their own data request?
> 'Get My Data' only gives you access to all the files / photos that you uploaded to the group
Your comment is quite interesting! I did not receive only my own contributions, but got fairly large files that seem to me to be exhaustive archives of all posts and other materials uploaded by any and all group members during the entire time I was a member of each group.
It's pretty interesting if different people are being sent totally different things though as you indicate. Maybe I am just lucky.
I'm pretty happy myself that they have this method to get everything from the time I was a member of any group. I can't think of any other service that provides for that. Yahoo seems to be the most open and data accessible service I can recall. With other boards facing imminent shutdown, I had to go to a lot of trouble to scrape old posts using bot scripts, which was inconvenient.
Yes, when a group owner wants to migrate to another platform - a bulletin board or the like - they need to be able to get all the group associated data so they can reconstitute their community.
I am a self interested party, but I’m personally glad since there’s a post in a Yahoo Group that’s findable through Google, that would absolutely ruin my reputation and life if discovered.
Well, I will say that if these are backed up, they will be less accessible than they would be with Yahoo. Someone will have to download a large archive (gigabytes, depending on the group) and then search through it for themselves.
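To give a sense of what "searching through it for themselves" involves, here is a rough sketch that assumes the archive is a single mbox file (the format mentioned earlier in the thread). It uses only Python's standard `mailbox` module; the function name `search_mbox` and the file name `group.mbox` are made up for illustration.

```python
import mailbox


def search_mbox(path, keyword):
    """Yield (date, sender, subject) for messages mentioning keyword.

    Matches case-insensitively against the subject and any text/plain parts.
    """
    needle = keyword.lower()
    box = mailbox.mbox(path, create=False)  # do not create the file if missing
    try:
        for msg in box:
            subject = str(msg.get("Subject", ""))
            body = ""
            for part in msg.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    body += payload.decode(charset, errors="replace")
                except LookupError:
                    # Unknown charset label in an old message: fall back to UTF-8.
                    body += payload.decode("utf-8", errors="replace")
            if needle in subject.lower() or needle in body.lower():
                yield str(msg.get("Date", "")), str(msg.get("From", "")), subject
    finally:
        box.close()


if __name__ == "__main__":
    # "group.mbox" is a hypothetical file name.
    for date, sender, subject in search_mbox("group.mbox", "sysex"):
        print(date, sender, subject)
```

This is a linear scan, so it gets slow on multi-gigabyte archives. It also means that anyone who wants to find a specific post has to know the archive exists, download it, and run something like this.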
Yeah it’s not that the post is immediately bad. It’s that I had an alt account for a completely different side of myself and made one errant post with that alt account to a forum linkable to my real identity. So it’s not a serious concern but I’ll be glad when it’s gone.
Well, all I can say is, don't add that to the list of groups that this team is trying to archive (https://docs.google.com/forms/d/1Z-lODnyXsE2kiu8uL01L--10nDq...) and if it does end up on archive.org, you will find another post on this thread saying that they are very responsive to takedown requests. (info@archive.org)
This comment breaks the site guidelines. Would you please review them and stick to the rules when posting here? Note this one: "Have curious conversation; don't cross-examine." People need to be able to discuss their work and their interests here without being harangued, and any substantive point you make can be made respectfully. Pummelling people helps no one.
We've had to ask you about these things before. Continuing like this will get you banned. I don't want to ban you, so would you please re-read the rules and fix this?
Yes, it wasn't me who started the groups in mind, and these were quite early, starting in the late 90's. While many members were researchers, on the whole they were not in computer fields- chemistry, biology, medicine... On top of that, people of that vintage knew very little about the internet.
There have also been many other comments here about other types of groups, from bird watchers to people building rockets to histories of WWII, that would certainly not have had this kind of technical background.
Likely having something more easily accessible to less technical people, plus being able to view the history of posts in the browser, are some reasons.
There's way too much data. Many group owners did not know of the shutdown (Yahoo was negligent regarding informing owners), and even if they did many group owners have little or no technical capability. That's why so many requested of the Archive Team that their groups be archived.
Is that enough? What about groups with only a few members? Or members that have passed away in the last 20 years? Will those WWII vet groups have 1000 responsive members with the technical skill to run an archive script in the next week?
It's not stupid. There were serious groups using that platform. While I never thought it was a good idea, they nevertheless did. My personal concern is community driven medical/biohacking research groups that go back to at least the late 1990's.
The one group I ever joined held plenty of useful/unique SysEx dumps containing custom patches for a popular 80's music synthesizer, among related things. I wonder who has already backed it up, and if I should.
edit: Oops, I'm also a member of LTspice. D'oh!
The best way to stop being ashamed of stupid things that you said forever ago is not to cast those things into the Memory Hole, but to stop saying those things, and most importantly, stop being the person who would. Then you know it's in the past, and it doesn't matter who else remembers.
Why wouldn't these groups for which Yahoo seems to be a critical service have done this work weeks ago? Either it's important and you make an effort yourself, or you do nothing, which clearly indicates that it is not important. I'm having a hard time getting worked into a lather over this one like it seems everyone else is - it's been announced for 2 months - ample time to save what you needed.
Actually, it's not. We were given 13 days for the shutoff of archives, with another 7 when there was an outcry. Most owners of groups are non-technical and have no idea how to get their data; this is why the archive team put up a request page for owners who wanted their groups archived (https://docs.google.com/forms/d/1Z-lODnyXsE2kiu8uL01L--10nDq...). Their more crowd sourced approach, which has been hard at work for these months, has been actively interfered with by Verizon (see the link of the OP). Verizon should be supporting this attempt to preserve this material.
‘There’s no point acting all surprised about it. All the planning charts and demolition orders have been on display in your local planning department in Alpha Centauri for fifty of your Earth years, so you’ve had plenty of time to lodge any formal complaint and it’s far too late to start making a fuss about it now.’
Yahoo sent out email notifications to every group owner and member. I got mine on November 1st.
Yahoo is definitely handling this badly: they should have offered an export tool, given a longer window, and been clearer in their messaging. But the claim that they didn't notify people (and just put the information up in a place no one would see) is wrong.
A lot of people apparently didn't get notified. The platform regularly lost emails. (Fifteen years of being on it, such problems were not irregular.) Also, they did offer an export tool, but the results were/are incomplete.
I'll add that some group owners/moderators have not even been able to log into their groups. Yahoo was doing something with needing a yahoo email to get in as admin, even when a non-yahoo email was associated with the account. Those people that lost access to their yahoo email account have been totally SOL.
I was a member of a single Yahoo group, a local recycling cooperative. I got a notification about the shutdown in mid-November and promptly sent a download request (not a full export). It took literally weeks before I was notified that the download was ready.
For groups, especially, this simply isn't realistic to expect.
When the G+ shutdown was announced (see earlier comment, I was one of the "Plexodus" organisers), the whole question of what to do about Communities migration hadn't even occurred to me. It wasn't until a community moderator happened to mention that they had a group with 400,000 subscribers and no idea at all how to even begin that I started looking into this.
It's a really basic thing. I've found no evidence that any such guide has ever existed online. I make no claims that what I came up with was the best, or even effective. It does seem to have been the first. Win, me.
We created a "short FAQ", formatted for posting to G+, which we posted again and again and again and again and again, to as many communities and spaces as we could. Because when you're trying to reach a mass audience (10ks to millions of users, 10ks to 8.1 million communities), mass repetition is necessary.
It's hard enough to motivate people to turn up at one place once every couple of years to mark a few spots on paper, when they all know that this happens regularly -- election turnouts tend around 30-40%. Getting mass compliance on a complicated, unexpected, and technical process (with social and legal entanglements) is far harder.
Many groups glide by on minimum maintenance for years. Whatever initial technical competence was required to set up the group may be long lost. At best, moderation skills are known (and even that is a long stretch). Migration is an entirely different skillset, one rarely exercised, and, as noted above, extraordinarily poorly supported.
TL;DR: your expectations are entirely unreasonable.
If Google ever wonders why their newer ideas failed to gain traction they should read your comment.
For every early adopter pissed off there will be a hundred or more people that never make it to your new and shiny project. So piss off enough early adopters and whatever you launch will be a dud.
They're also not monetizing the content or doing anything with it. They're just going to throw it away. Why would they go out of their way to block archival attempts?
This is the corporate equivalent of throwing your old computer into an empty ditch on the side of the road, and getting mad when someone responsible comes by to recycle it for you.
Corporate DNA in the US seems to be built around two basic principles:
1) Use whatever means to get yourself forward
2) Do whatever you can to hold everyone else back
The Groups thing is a wonderful example of #2 in action. It’s not about preventing access to Groups in particular, just the general principle of preventing and hindering all and always
That is one hell of an analogy. I'm going to try to use it.
(There is one guy I came across trying to monetize what he could download of some archives. He used the Windows software PGDownload to do it. It's a lot of alt-med stuff: https://groups.rifeforum.com)
There are probably real costs here. From their initial message in the post, it sounded like it would cost them “something” to allow all of the data to be archived. I could imagine that the data for these groups was expected to fall into an 80:20 distribution between cold data and hot data (or 90:9:1 between archived data, read only, and read/write). If you start pulling data from the archive or cold tier, then you could disrupt that storage pattern. Moving the data around could end up causing some kind of monetary cost.
But it seems strange to me that they would actively try to thwart these efforts and remove the possibility for a great deal of good will.
> …equivalent of throwing your old computer into an empty ditch on the side of the road, and getting mad when someone responsible comes by to recycle it for you.
a better equivalent is realizing that that computer might have important stuff and taking steps to wipe the hard drives. or retrieving those bags full of documents and letters so you can trash them securely.
i mean, most of us do it. not saying it is right but it is the sensible thing to do. for the legal aspect and also for security.
curious, isn't archive.org affected by GDPR at all?
I guess it's more like they don't want to have their brand associated with "inappropriate" content. From my (European) perspective the bar American corporations use is weird, but seems plausible to me.
Also I believe they come to it as a media corporation - access to content has to be restricted, so they can charge. If archive.org has too many texts/videos/... people don't watch their movie (not that it works that way ...)
Speculating wildly is OK. Speculating wildly using definitive-sounding statements, particularly in a thread directed specifically towards employees of a company, is extremely misleading.
If you had started with “Not related to Verizon in any way, but IMO...” that would have been perfectly fine, but given where and how you made the statements it looks remarkably like you were claiming inside knowledge of the situation, and then feigning innocence when called out.
A police cooperative in Washington DC that was using them as a network to communicate with their respective neighborhoods with over 17,000 members.
A phone company in the UK that assigns phone numbers using the groups and now will lose all those phone designations when it’s deleted.
A Birding group in New Delhi with 2,000 members that has collected data and research on birds for TWO DECADES.
An Adoption group in France, that has been using it for years and years to communicate and share history and photos and more.
They also would have found: Numerous support groups for people who are suicidal or depressed.
Numerous medical groups for people to communicate more effectively with their doctors.
Numerous Vet groups with 24 hr care advice for sick pets.
Numerous support and help groups for the Elderly.
Numerous Historical groups for WW2 Veterans, Vietnam Veterans, etc.
Numerous science groups that have used them for years and have all their research there.
Numerous fan fiction groups or arts groups that have shared their work for years.