
> Archive Team promises to back up SoundCloud amid worries of a shutdown


Consider donating to the Internet Archive.

> Archive Team plans large scale backing up of Soundcloud soon, but seriously, please donate money to the Archive. http://Archive.org/donate


Edit: my info is dated. Archive Team backed off. Smoke if you've got 'em.

That article is out of date, unfortunately. The Internet Archive and Archive Team have abandoned their backup efforts by SoundCloud's request.


That is so infuriating. If the site goes down permanently, SoundCloud should swallow whatever costs this incurs as the price of their failure of stewardship.

By honoring SC's request Archive Team has allowed the indie Library of Alexandria to burn...

That's what pisses me off about VC. Everything's a fucking product...

They should ransom the whole database if they're going away. It worked for certain software (Blender), it might for the music too.

Whoa, I've never heard of this, can you expand on what happened with Blender?

Blender was developed by NeoGeo, a Dutch animation studio, for internal use. When NeoGeo was acquired, its lead developer Ton Roosendaal started a new company, NaN, to sell Blender as a commercial product. When NaN went bankrupt in 2002, the Blender Foundation was founded and raised the money to buy the source code from the creditors and release it as open source.


That's like $80k worth of data transfer. Also, how can you afford to store all of that unless you have gsuite?

Edit: I was mixed up. I was thinking of what it would cost SoundCloud to transfer all that music. I guess storing it would be much cheaper...

You can buy a dedicated server ($50-200/mo) hooked up to a 1Gbps line ($1-2k/mo). Assuming you fully saturate the link, 900TB will take 83 days to download. Of course you can also rent a bigger line, or more of them. And this doesn't consider storage costs. But it's certainly less than $80k.

Edit: I think you could even rent at a datacenter that peers with amazon, hook your server with the files up to a VPC via ipsec, then move files to s3 or glacier via a server in the VPC. Not sure if you could avoid the PUT costs, but you would avoid amazon bandwidth charges with the peering, I think (not 100% sure).
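For anyone who wants to check the 83-day figure, here is a rough sketch of the arithmetic. It assumes decimal terabytes (1TB = 10^12 bytes) and a link that stays fully saturated, which is the best case; the `utilization` parameter is my own addition for trying less optimistic numbers.

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8                         # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)  # effective bits per second
    return seconds / 86_400                           # seconds -> days


if __name__ == "__main__":
    print(f"{transfer_days(900, 1):.1f} days at 1 Gbps")
    print(f"{transfer_days(900, 10):.1f} days at 10 Gbps")
    print(f"{transfer_days(900, 1, utilization=0.7):.1f} days at 1 Gbps, 70% utilized")
```

At full saturation this comes out to about 83.3 days, so the estimate holds; a bigger line or several lines in parallel scales it down linearly.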

And how much do you need to spend on HDDs and how many do you need for 900TB of data?

WD 10TB Gold drives on Newegg are $414 each, so 90 of them come to $37,260. You want at least 25% RAID redundancy on top, which brings it to $46,575. Then how many U of rack space do you need to hold all those drives? It's not something you can throw in a 1U host. Even if you can fit 125 drives into servers 12 at a time, you still need at least 10 * 3U = 30U, so that's a pretty expensive hosting proposition, beyond the roughly $50K for drives alone.
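A quick sketch of that drive math, using only the numbers from the comment above ($414 per 10TB drive, 25% redundancy, 12 bays per 3U chassis). The function and parameter names are made up for illustration. Rounding up to whole drives gives 113 drives and $46,782, slightly above the $46,575 you get from multiplying the dollar figure directly.

```python
import math


def drive_budget(data_tb, drive_tb=10, price_usd=414, redundancy=0.25,
                 bays_per_chassis=12, chassis_u=3):
    """Estimate drive count, drive cost, and rack space for a raw data size."""
    data_drives = math.ceil(data_tb / drive_tb)
    # Extra drives for redundancy, rounded up to whole drives.
    total_drives = math.ceil(data_drives * (1 + redundancy))
    chassis = math.ceil(total_drives / bays_per_chassis)
    return {
        "data_drives": data_drives,
        "total_drives": total_drives,
        "drive_cost_usd": total_drives * price_usd,
        "rack_units": chassis * chassis_u,
    }


if __name__ == "__main__":
    for name, value in drive_budget(900).items():
        print(f"{name}: {value}")
```

This only covers the drives and the rack space they occupy; servers, power, and hosting fees are on top of that.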

A chassis design that holds the HDs vertically[1] would more than halve the space, down to 12U instead of 30U, which is slightly easier to stomach. However, 900TB is less than half of SoundCloud's reported 2.5 PB[2], though I have no idea how much of that 2.5 PB consists of downsampled copies which wouldn't need to be stored.

[1] https://www.supermicro.com/products/system/4U/6048/SSG-6048R...
[2] https://aws.amazon.com/solutions/case-studies/soundcloud/

This might be an application where the cost of a tape drive + many tape cartridges will be far less than the cost of HDDs. LTO-6 tapes are ~$4/TB and the drive is a few $K.

Tapes are also more reliable than HDDs for long-term archival storage.
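To put rough numbers on the tape option: a sketch using the ~$4/TB media price quoted above and LTO-6's native capacity of 2.5TB per cartridge. The $3,000 drive price is only my placeholder for "a few $K", not a quote.

```python
import math


def tape_budget(data_tb, usd_per_tb=4.0, drive_usd=3000.0, cartridge_tb=2.5):
    """Return (cartridge count, media cost, media + drive cost) for data_tb."""
    cartridges = math.ceil(data_tb / cartridge_tb)  # native capacity, no compression
    media_usd = data_tb * usd_per_tb
    return cartridges, media_usd, media_usd + drive_usd


if __name__ == "__main__":
    cartridges, media_usd, total_usd = tape_budget(900)
    print(f"{cartridges} cartridges, ${media_usd:,.0f} media, ${total_usd:,.0f} with drive")
```

Under those assumptions 900TB needs 360 cartridges and the media costs $3,600, so even with the drive it lands far below the HDD estimate. Audio is already compressed, so counting on native capacity rather than the compressed rating seems like the safer assumption.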

S3 transfer inbound is free, from anywhere. You'd use several dedicated servers consuming a queue of urls to warc package and push into S3, no vpc or IPSec tunnel required (random S3 key prefix ftw to prevent creating bucket hotspots).

Then you'd hope someone came along with a big chunk of AWS credits to use snowball to migrate your bucket(s) of data to the Internet Archive when they could safely accept said data.
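The "random S3 key prefix" part could look something like this. It's only a sketch of the key naming: the `warc_key` helper and the key layout are hypothetical, and the actual upload (with whatever S3 client you use) is left out. Hashing the URL gives a prefix that spreads keys evenly but stays deterministic, so a worker that retries the same URL writes to the same key.

```python
import hashlib
from urllib.parse import urlparse


def warc_key(url: str, prefix_len: int = 4) -> str:
    """Build an object key with a pseudo-random hex prefix derived from the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    parsed = urlparse(url)
    # Flatten host + path into a readable file name.
    slug = (parsed.netloc + parsed.path).strip("/").replace("/", "_") or "index"
    return f"{digest[:prefix_len]}/{slug}.warc.gz"


if __name__ == "__main__":
    # Hypothetical URLs standing in for items pulled off the work queue.
    for url in ("https://soundcloud.com/artist/track-one",
                "https://soundcloud.com/artist/track-two"):
        print(warc_key(url))
```

Each worker would pop a URL from the queue, write the WARC, and upload it under the key this returns.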

AWS Direct Connect traffic out of AWS into the connected DC is billed per GB according to https://aws.amazon.com/directconnect/pricing/ and it's quite the pretty penny!

Yes, but it looks like transfer in is free (other than paying for the pipe). So in the scenario we are discussing, you could get 900TB into AWS fairly cheaply.

He's been active on Reddit and IRC. He claims to be storing it in GSuite after ripping it with Google Compute Engine.

Since incoming traffic is free and outgoing traffic to other Google services is free, bandwidth costs would be minimal (just the size of the requests, not the responses).

Still has to pay for storage? GCS nearline?

GSuite has unlimited storage (until enough people do things like that I guess).

$80k? Even at Hetzner, which is one of the more expensive hosters w.r.t. transfer, this costs less than $1100.

If you directly run traffic agreements with ISPs, and peer directly, you can get below half of that.

Where are you getting those moon prices? AWS? GCP? Never use their prices to get a fair estimate. AWS and GCP's prices are orders of magnitude off from a fair price. Unless you're in Silicon Valley, running systems yourself at a traditional datacenter will always end up cheaper. (The reason being that Amazon and Google have to pay far higher wages than you do, and you get to cut out a middleman.)

I estimated using GCP's egress pricing since I was thinking of how much it would cost SoundCloud.

Still, it's quite impressive to be able to manage that much data.

I'd be very surprised if SoundCloud paid anything near the standard GCP pricing. It's much more likely that at their scale they pay the rates I mentioned.

That said, if they pay the rates you mentioned, then it's no wonder the company fell apart.

It's all in gsuite, I believe. For how long, I don't know. I can't imagine Google not taking notice.

...which means the next step is to get it out of gsuite and into more permanent archival storage.

Are we calling Google cloud gsuite now?

G Suite is the new name for Google Apps. I believe the person they are referring to is storing everything in Google Drive.

I estimated the cost really badly based on the GCP egress pricing.

Also, I was thinking that gsuite would be the cheapest option since it has "unlimited" storage.


That's a bad request they made. IA should ignore it now.

Unfortunately, I think the time to back up SoundCloud was before anyone could have known we needed to back up SoundCloud. A lot of my favorite artists that I discovered there have a third of their original catalog at best left up because of copyright claims, etc. Who decided free songs by hobbyist artists should be banned even from audibility simply because they contain a fraction of a sample of copyrighted material?

The most egregious example to me was when Madlib released a remix of a Kanye West song that he had produced and it was taken down for copyright by UMG (a shareholder in Soundcloud). That's when I knew the sun had set on Soundcloud as the open creative community it had started as.

In a similar vein, SoundCloud was a hotspot for DJ mixes (not remixes) for many online hobby DJ communities, but copyright takedowns eventually stopped that as well.

This has been Mixcloud's territory since day one, when SoundCloud said they didn't want mixes.

Seems like this void has been filled by Mixcloud now.

What's the alternative? Seems like it would take considerable expense to challenge the copyright claim if SoundCloud refused to take down the song and UMG sued.

*had to back off

There are still people interested in archiving in #soundbutt on EFNet but no plan as there is no place to store all the data. :(

We (rsync.net) contributed disk space to the original Archive Team effort - backing up geocities.

I continue to be interested in helping with these efforts but multiple petabytes of online storage is not trivial ...

Appreciate the contribution.

Any info on how to back up your liked songs and subscribed artists in a nice way? I've just saved the HTML of the page, so I can always retrieve them...

Youtube-dl it.
