Disclaimer: I'm a member of Archive Team who's helping coordinate the joining of Yahoo Groups in preparation for archival.

Yahoo's banning of a large number of the accounts we were using is a huge setback for us. In total we lost access to over 55,000 Yahoo Groups; many of these will now not be archived and will be lost when Yahoo deletes everything on December 14.

Particularly disastrous was the loss of access to all of the 30,000 Fandom (fanfic / fanart / etc.) groups that were requested to be archived by members of the fandom community. We're back to square one now, and it is looking increasingly likely that we're only going to be able to re-join (and therefore archive) a small percentage of these groups before December 14.

(And now for the inevitable, shameless plug...) We could really use some help! If you've got an hour or so, we could really use people to come and complete CAPTCHAs for us. (A CAPTCHA is needed to join every group). Instructions at: https://github.com/davidferguson/yahoogroups-joiner

I tried to do this but upon clicking the purple "Join Group" button Yahoo is giving me an error saying my email address is not linked to a Yahoo account:

> Your email address is not linked to a Yahoo ID. To join this group, you need to link your email address to a Yahoo account.

When I click "link your email address", it just takes me to a page called "Personal info" which doesn't have any obvious way to link my email address.

So I'm not sure how to proceed.

EDIT: Solved it. I had initially only "verified" the account with a phone number, but you have to add an email address as well. It's now working.

For anyone who, like me, signed up for this and filled in the Google form, but then couldn't find the leaderboard URL after closing the tab, it is https://df58.host.cs.st-andrews.ac.uk/yahoogroups/leaderboar...

It seems to be working through a list in reverse alphabetical order. Watching the progress being made is quite satisfying. When I started it was on groups like "sciencefiction" and now it's moved on to "petzluverz".

How long did it take you between adding the email address and being able to join the group?

Seeing the same thing now, I added an email address and verified it, but I'm still not allowed to join the group.

It didn't take long at all for me after verification. Although I have sometimes randomly gotten that error message. Interestingly, sometimes it actually had joined the group anyway. The site has been a little glitchy off and on, but it's working for me right now.

While the above post is concerned with Fandom groups, my concern is with groups that started doing early community-driven, biohacking-type research. There are medical test results and discussions of medical interventions. While that's my focus, I'm sure there's additional important material. We really need to save this data.

Thanks for fighting the good fight!

I assumed I could help by going to a web page and solving a bunch of captchas for you, but when I read those instructions I found there's more involved (forging a Yahoo account, installing an extension) and it turned me off.

If captchas are the bottleneck, maybe some generous soul here could figure out a way to automate the rest and just give me a page where I can go solve captchas? Further reducing the friction might help get you some more uptake from the community - more monkeys like me banging at typewriters.

Sorry I wasn't more help, and best of luck with your efforts.

I imagine you guys already know this but considering we’re up against the timeline, I’d use the captcha solving service (easy to google yourself) and Luminati to distribute the IP addresses while swallowing my ethical qualms.

I would donate my IP/bandwidth to archive.org if I could run a scraper easily.

Thanks! I never heard of that before; just like project SETI though for archival purposes.

What are the hardware requirements of that VM? I'm attempting to import it on my NAS4Free home NAS Virtualbox service which is the only machine I keep up 24/7 atm, but it takes forever to import. The hardware is very limited however (Atom D410 + a bit over 1GB RAM available), so I'm not sure it would succeed, but so far it loads forever, no errors given. I'd like to run it for this project to start contributing quickly albeit with limited hw before the deadline, then find better iron in the future.

I’m running the Docker image on the smallest Hetzner VMs, with 5 concurrent groups and 40 shared rsync threads per container, and 12 containers per server. Start one container, do docker top on it to make sure it’s pulling, then start the others one by one, taking a few seconds between each to avoid overwhelming the CPU. I’ve got 6 of those little VMs going, and have rolled up 4GB and 2800 groups worth in 6 hours.

After they settle down, they’re more memory than processor intensive. I’ve considered playing with the settings a bit, but thought it was more important to get a bunch of them running on a couple different VMs at different sites.

If I were really feeling fancy, I’d write a nice deployment definition for orchestrating this with microk8s...
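The staggered start described above can be sketched as a small shell loop. This is only an illustration under assumptions: `start_staggered` and `start_warrior` are hypothetical helper names, the `warrior-N` container names are made up, and the image name is taken from the Docker Hub link elsewhere in this thread. Project selection and concurrency settings are left out because the thread doesn't give them.

```shell
#!/bin/sh
# Run a command once per index (1..COUNT), pausing DELAY seconds between
# runs so the starts don't all hit the CPU at once.
# Usage: start_staggered COUNT DELAY COMMAND [ARGS...]
# The index is appended as the last argument of COMMAND.
start_staggered() {
    ss_count=$1
    ss_delay=$2
    shift 2
    ss_i=1
    while [ "$ss_i" -le "$ss_count" ]; do
        "$@" "$ss_i"
        # No need to wait after the last one.
        if [ "$ss_i" -lt "$ss_count" ]; then
            sleep "$ss_delay"
        fi
        ss_i=$((ss_i + 1))
    done
}

# Hypothetical wrapper: start one detached warrior container named warrior-N.
# Assumes the image archiveteam/warrior-dockerfile; add your own settings.
start_warrior() {
    docker run -d --name "warrior-$1" archiveteam/warrior-dockerfile
}

# Example: 12 containers, 5 seconds apart, then check the first one:
#   start_staggered 12 5 start_warrior
#   docker top warrior-1
```

Passing the command as an argument keeps the pacing logic separate from the Docker details, so the delay can be tuned without touching the `docker run` line.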

I'm running it on a Synology NAS (Celeron J3455), and the docker manager UI claims it's using 180 MB RAM and less than 1% CPU (and I just confirmed it's currently working on archiving Yahoo! Groups)

I don't find it processor or memory heavy, it's mostly doing a lot of IO (network and disk).

Unfortunately it doesn't offer a qemu-compatible image or an image that would work when converted; it's a shame and shooting itself in the foot.

You should be able to trivially run the Dockerfile[0] on a standard Ubuntu image for qemu, should that be your only reason for desisting.

0: https://hub.docker.com/r/archiveteam/warrior-dockerfile/

An ova file is just a tarball containing an ovf file and a vmdk file. The ovf file is a text-based configuration format, so you can get a basic idea of the config you'd need for qemu. Then the vmdk can be converted with qemu-img.

I used the following qemu-img command:

    qemu-img convert -O qcow2 archiveteam-warrior-v3-20171013-disk001.vmdk archiveteam-warrior-v3-20171013-disk001.qcow2

I use the following to run the VM (I gave it some more memory because I have plenty to spare):

    qemu-system-x86_64 -m 1024 archiveteam-warrior-v3-20171013-disk001.qcow2

I think they were doing some kind of port forwarding, but I didn't bother, and I just access the web interface using the VM's IP (you can hit alt-right arrow to go to a login prompt and log in as root then run "ip a" to see the IP).
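For the step before the conversion, here is a rough sketch of listing and unpacking the .ova with plain tar, since it is just a tarball. The helper names `list_ova_disks` and `extract_ova` and the file name `warrior.ova` are hypothetical; substitute whatever your download is called.

```shell
#!/bin/sh
# Print the names of the disk images (.vmdk) packed inside an .ova.
# An .ova is a plain tar archive, so tar can list it without extracting.
list_ova_disks() {
    tar -tf "$1" | grep '\.vmdk$'
}

# Extract an .ova into a directory (created if it does not exist).
# Usage: extract_ova ARCHIVE.ova TARGET_DIR
extract_ova() {
    mkdir -p "$2" && tar -xf "$1" -C "$2"
}

# Example (hypothetical file name):
#   list_ova_disks warrior.ova
#   extract_ova warrior.ova ./warrior
# The extracted .vmdk is what you pass to qemu-img convert, and the .ovf
# can be opened in a text editor to see the memory and NIC settings.
```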

I know, I did that and it didn't boot. Couldn't be bothered further and I ain't installing docker on my system, it's incompatible with my setup.

It went pretty well for the first 10-20 or so groups, but now I get multiple rounds of the really annoying captchas (click until none remain) per group... Damn it, Yahoo...

update: just enabling the vpn was enough to 'reset' captcha to the simple level, seems like yahoo does not take into account whether your IP is 'residential'.

I also noted that for yahoo changing IP, even changing continents, allowed me to use the same cookies as long as I kept my original browser window open.

Shoutout to https://github.com/dessant/buster by the way!

`Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.`

That's nice, but it doesn't scale. Google only lets you solve a few (5 or so) audio captchas in quick succession before you're banned for a while, so it's no good for us.

It's been working for me instead of clicking on all the little busses or crosswalks, even if it doesn't work at scale. Thought it might help some other users of the extension.

FYI: The extension offers many private groups that I can't join without approval, and that seems to disrupt the flow of the extension.

Yeah, sorry about that. The current (as of 2100 UTC) set of groups being sent out to be joined were ones submitted through our nomination form: https://tinyurl.com/savegroups

I did specify that groups requiring approval to join shouldn't be submitted, but not everyone took notice. (And then there were the several dozen Google Groups URLs that were submitted!)

It seems a weird set of groups. Like, lots of three-to-five person groups roleplaying Doctor Who, Spider-Man and things like that. Is this the long tail of what hasn't been archived, or is there not even a good way to tell post/member count without loading up through the extension?

From IRC (betamaxthetape):

It's a set of groups that have been specifically requested by the fandom community. Of course, the groups handed out depend on what's been joined, so if / once all the fandom groups are joined, we'll move onto something else.

I appreciate this isn't made clear in the instructions, but if you have a desired set of groups in mind, you don't need to use the chrome extension. Just join the groups you want saved and (provided you've sent the account details through the form) they'll be added to the queue to be archived. I did a lot of Amateur Radio (Ham Radio in US) groups that way.

Ah, that's good to know that I can browse and find things that I'm more interested in. The instructions weren't clear about the difference between extension/group access and archiving.

Yeah not volunteering for that mate.

See immediately above in the thread. Instructions were perhaps not clear.

Hah, this is fun! I've so far stumbled on a fantastic group with Sims 1 houses (pictures, and the actual lots), and a Dream Street fan-club, which of course prompted me to see who the hell they were.

I confess I'm doing this mostly to see what people posted on the internet at some point in time :)

Edit: All groups have around 1600 members... what causes this...

> Edit: All groups have around 1600 members... what causes this...

That's possibly the maximum cap?

Is there any cited reason for the groups they're blocking?

Verizon's response, and the response to the response, are in the article of the OP. They claim they offer a Group Downloads Manager, but it's very broken.

btw, maybe Mechanical Turk could help with the captcha part?

A couple of years ago I saw somebody giving a talk where they demonstrated a CAPTCHA-solving API, with people from India solving the CAPTCHAs for a few cents.

That's basically what the DeathByCaptcha service is.

Thanks - I just wanted to say such services exist or used to exist, didn't remember the name.

I feel like there must be some protection in place against using mTurk with captcha, or it would have already been abused.

Mturk's turnaround for this stuff can't be fast enough to work would be my guess. I know jobs I put up there for transcription, despite a generous bonus, were always delayed for at the very least hours.

You misunderstand. You keep a live page open and point jobs to the live page. No need to put a captcha image in the mturk job.

You can absolutely purchase captcha answers.

Just solved a bunch of captchas, but Chrome crashed a few times during. Due to the addon?

I've been using Edge (Chromium) for past few hours, no issues yet. Plugin could be unrelated to your crashing. May help to use a standalone Chromium build for this https://chromium.woolyss.com/

I checked on IRC. One person says they've been using it for hours on chromium without a problem. "I've been using Edge (Chromium) for past few hours, no issues. Could be unrelated, could be related. May help to use a standalone chromium build for this."

As an aside, is there any way to recover emails if I didn't sign into Yahoo for a year? I and a lot of others had up to 15 years of sentimental mail exchanged during that period :(

I don't see why not. Point Thunderbird at it or something and then just transfer the mails over to somewhere else if you want that - but this is not about mail. Rather it's about Yahoo Groups, whose archives are about to go away.

Forgive my naivety, but why would blocking of your accounts delete the data you have already backed up? This sounds like you are doing it the wrong WAY, IMO.

Two reasons: (a) If we hit Yahoo with everything we've got, Groups would almost certainly have crashed, or at least become unbearably slow. That's not a reasonable thing to do, and would be (IMHO) grounds for Verizon banning us.

(b) We were still testing / writing the scripts to do the actual archiving. Most of the groups we did save before the banning were from test runs of the archiving script.

And sure, given hindsight, I'd do things differently. We've learned now, and are archiving a group soon after it is joined.

OK, thanks for explaining this. Just my 2 cents then: big companies make decisions like this based on the potential PR win/loss. If ignoring you keeps the PR delta at 0, while allowing the data to be exported exposes them to even a minimal risk (I dunno, someone's private details buried in there), they will ignore, or even actively resist you.

Politically, you need to arrange it so that cooperating with you will give Verizon a small PR boost, while ignoring you will be seen negatively by the public. This thread had a good example of interesting data that is worth preserving, so I would try reaching out to news companies (NY Times and whatnot) to see if anyone wants to publish a piece. Phrasing this positively and ensuring enough people see it, would greatly increase the chances of cooperation from Verizon.

They hadn't backed up yet. They had set up accounts with yahoo that they were then planning to use to back up those groups. Backups themselves were starting, but they had to go slowly enough not to bog down yahoo's servers.

Have you posted this on Reddit anywhere? Possibly /technology?

You might even get the admins to make an announcement.

It’s been all over r/datahoarder lately, also saw a post on r/YouShouldKnow

Have you considered using NordVPN for CAPTCHA bypass? They are a shady company, but their network of residential VPNs is impressive.
