Hacker News
Archiveteam are backing up SoundCloud (archiveteam.org)
338 points by hunglee2 4 days ago | 215 comments

This seems a bit ridiculous. 900TB costs about $22,500 in hard drives (assuming $100 per 4TB HDD), without any redundancy. I wonder what their storage solution is like.
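For anyone who wants to check that arithmetic, here is a rough sketch. The $100 per 4TB drive price is the assumption from the comment above; the `copies` parameter is a hypothetical knob for adding plain replication, not anything Archiveteam has described.

```python
import math

def raw_drive_cost(total_tb, drive_tb=4, drive_price=100, copies=1):
    """Return (number of drives, total price) for storing total_tb.

    copies is an illustrative redundancy factor: 1 means no redundancy,
    2 means every byte is stored twice, and so on.
    """
    drives = math.ceil(total_tb * copies / drive_tb)
    return drives, drives * drive_price

print(raw_drive_cost(900))            # (225, 22500)
print(raw_drive_cost(900, copies=2))  # (450, 45000)
```

This only counts the disks themselves; enclosures, power and replacements would come on top.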

Almost certainly Google Drive unlimited.

And this is how Google Drive Unlimited becomes Google Drive Limited...

Like every other Unlimited service, because people like this abuse the shit out of it then when it inevitably dies they run around exclaiming "dont blame me brah they said it was unlimited...., shouldn't have called it unlimited"

> when it inevitably dies they run around exclaiming "dont blame me brah they said it was unlimited...., shouldn't have called it unlimited"

Well... yeah? If you let people advertise 'unlimited' but they can't actually deliver, you end up with a kind of market for lemons situation.


Reposting my comment from https://news.ycombinator.com/item?id=14716912


There is a strong utilitarian argument for not allowing such false statements.

It devalues the products of people that aren't bullshitting you. Say with fake-unlimited the "real limit" is 4TB before they start terminating you, but a different provider provides 5TB of capacity.

Because the former is allowed to outright lie, there is no way for the latter to effectively communicate that they are in fact offering a better product. Instead, they too have to make a bullshit "fake unlimited" claim to compete. Now, because nobody has to actually back their claims with anything, they are in fact massively incentivised to cut the "real storage" limits, because it will cut their costs, and they can still keep making the same claims.

It's a market-for-lemons[1] race to the bottom, and everyone loses, producer and consumer, because scamming liars cannot be reliably assessed beforehand. So consumers lose faith in the entire market segment, and providers offering actual legitimate services become unsustainable.

[1] https://en.wikipedia.org/wiki/The_Market_for_Lemons


If you allow sellers to lie about information-opaque things like this, you drive the entire market to a shittier equilibrium. It should absolutely not be allowed.

It's like you go to a restaurant that offers unlimited refills and they don't allow you to keep refilling your cup while you're throwing the drinks onto the floor. False advertising!

I think you quite clearly know the difference between the two situations. Don't debase yourself.

Except for the fact that when Google, Amazon and the rest talk about "unlimited", they are referring to unlimited PERSONAL storage of data you create as a person. This would include backups of your personal computer, photos, important documents, etc.

Not backing up SoundCloud or the entire Internet for $5 a month

I would love to have my own personal backup of SoundCloud. It's like having a huge library at home, only not limited by physical space.

>... only not limited by physical space.

Well, it still kinda is - the physical drives have to live somewhere... ;-)

I didn't create my tax documents, I didn't create the professionally shot photos of my kid, and it's arguable whether it was truly I that created KSP game save files. I definitely didn't create a lot of the work documents that end up living there, either.

> they are referring to unlimited PERSONAL

Actually Google only offers unlimited to business and education accounts, so they're absolutely not referring to personal data.

Jesus H Christ..

People are too f'in literal, first

Unlimited === Store everything ever created

PERSONAL in the context of the discussion would include documents created by a person in the general course of business. My point is very clear, and only pedantic trolls do not understand the meaning of the point I was attempting to address.

Business use would be for the business that signed up for the service, not for storing an entire copy of SoundCloud

> Business use would be for the business that signed up for the service, not for storing an entire copy of SoundCloud

What if that business is Soundcloud? As far as I can tell, Google places no limits on business users, numerical or otherwise.

This is such a cool idea. Going to back up Soundcloud myself tomorrow!

So I can backup my computer every day, encrypt the blob and upload it?

Well, companies should think twice before selling unlimited plans. Maybe set a really high cap that's enough to cover 90% of the users but unlimited means unlimited.

90% would likely be covered by a TB or 2. That would not work for marketing.

What would work for marketing if you can't compete on the raw number game is framing the amount - like Apple did with "1000 songs in your pocket".

If you want to design a cap - Goog probably has an idea of the kinds of files people store (whether it's pics or docs or videos or sheets etc...), pick a number of those that's impressive and is just lower than what your 10% power users have, and then use that to influence the package size.

2TB of files can easily be turned into "half a million photos" or like their previous ad campaigns, show how it can store every photo from birth to university for your kid or something. Or a love letter every day from first date to goodbye.
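The "half a million photos" framing is easy to sanity-check. A small sketch, assuming decimal units (1TB = 1,000,000MB) and an average photo size of 4MB; the 4MB figure is my assumption, picked because it is what makes 2TB come out to half a million.

```python
def items_that_fit(capacity_tb, item_mb):
    """How many items of item_mb megabytes fit in capacity_tb terabytes."""
    return int(capacity_tb * 1_000_000 // item_mb)

print(items_that_fit(2, 4))  # 500000 photos at an assumed 4MB each
```

Swap in a different average size (say 10MB for larger photos) and the headline number changes accordingly, which is the point of framing the cap in items rather than bytes.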

If you can't compete on the number, don't compete on the number.

I prefer "libraries of Congress" as my storage unit of choice.

I suppose a new storage unit is in order: "SoundClouds".

It is amusing to trace the arguments in this thread. We went from "abusing unlimited storage is why we can't have nice things" to "well, don't call it unlimited if it isn't" to "but marketing doesn't like that".

This is why I so strongly prefer services that don't bullshit me. Tell me exactly what you're selling, and for how much[1]. If you promise to livestream setting your marketing department on fire, I'll pay double.

[1] If it is free, I already know what it costs and am not interested, thanks.

Over a million megabytes of storage!

Why should the companies think twice? It's the users who end up suffering, not the companies. The companies will just put an end to abusive users or punish everyone uniformly by increasing prices.

The companies suffer too. American Airlines has lost a lot of money with their unlimited travel pass and I'm not sure they could revoke all the subscriptions after they figure this out.

While this is clearly abusing the intent behind the "unlimited" sales pitch, I still do not think it's in violation of it. There's also most likely more than just this guy owning a copy of the files from SoundCloud, so once Google has a copy of them all they can just cache it for the rest of the people. I.e. you don't need unique files for every user who has some file; instead you can just give out the one copy to anyone who requests it.

Yeah, except people on DataHoarder encrypt their files, so the providers can't keep a single copy of each unique file and instead have to store a separate copy for every user's account that holds it.
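To make the dedup point concrete, here is a toy sketch of a store that keeps one blob per content hash, and of what happens when each user encrypts with their own key first. `DedupStore` and `toy_encrypt` are hypothetical names for illustration; the XOR keystream is a stand-in for real encryption and is NOT secure, and nothing here describes how Google or Amazon actually store files.

```python
import hashlib

class DedupStore:
    """Stores each distinct blob once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes (one copy per unique content)
        self.refs = []   # (user, digest) pairs, cheap to keep per user

    def put(self, user, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # only stored if unseen
        self.refs.append((user, digest))
        return digest

    def stored_bytes(self):
        return sum(len(b) for b in self.blobs.values())

def toy_encrypt(key, data):
    """Illustration only: XOR with a SHA-256 based keystream. Not secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

track = b"same song" * 1000  # 9000 bytes, identical for both users

plain = DedupStore()
plain.put("alice", track)
plain.put("bob", track)

encrypted = DedupStore()
encrypted.put("alice", toy_encrypt(b"alice-key", track))
encrypted.put("bob", toy_encrypt(b"bob-key", track))

print(plain.stored_bytes())      # 9000: one shared copy
print(encrypted.stored_bytes())  # 18000: ciphertexts differ, no sharing
```

With different keys the ciphertexts hash differently, so the provider ends up holding one copy per account.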

I always encrypt because I worry an automated copyright scan will be the end of my account. If there was a policy that assured my non-shared files wouldn't be subject to these scans I would happily store content unencrypted.

For reference, I don't store pirated content, rather content that I have a license for, but cloud providers have no way of knowing that. Unfortunately the dispute processes are unreliable, so when it's time to backup a media project, I play it safe and encrypt.

> For reference, I don't store pirated content, rather content that I have a license for, but cloud providers have no way of knowing that.

I'm pretty sure that they just assert that the backup is illegal even if it came from a licensed copy. One argument (used by Nintendo IIRC) is that the official media/servers are too reliable to require backups, and thus anything purporting to be a backup is really for another purpose and thereby not exempt under the statutes authorizing backups.

That's pretty uncaring on Nintendo's part. I wonder how many original Nintendo Entertainment System cartridges are destroyed in natural disasters without backup whose owners can only now play ROM dumps of them from the Internet.

My entire Nintendo DS and 3DS cart collection was stolen in a break-in at my place. You can bet that instead of repurchasing I simply bought a flash cart.

You often see that cloud services or streaming services are creating policies and DRM around copyright laws that actually remove freedoms given to you under copyright (fair use for example, or even instances where you have licensed copies).

I wouldn't trust such services to respect your rights in any capacity. In fact, I would argue that the "our incredible journey" trend is a form of property damage (a storage rental place can't just burn their store to the ground with customers' possessions still inside).

IANAL but Google's ToS for copyright seem to be pretty safe so long as you don't share links to copyrighted content https://support.google.com/docs/answer/148505

> Respect copyright laws. Do not share copyrighted content without authorization or provide links to sites where your readers can obtain unauthorized downloads of copyrighted content. It is our policy to respond to clear notices of alleged copyright infringement. Repeated infringement of intellectual property rights, including copyright, will result in account termination. If you see a violation of Google's copyright policies, report copyright infringement.

Given that they cater to businesses, I don't think they could do automated scans.

Looks like they do hash matching to automatically scan for copyrighted content: https://torrentfreak.com/google-drive-uses-hash-matching-det...

But they also only did anything with it when the user tried to share it.

Definitely not all. For example the top post of all time (where the guy has 1 Petabyte of videos in Amazon drive) he talks about how only his personal files are encrypted. Then again his videos are recordings of camsites so they might be quite unique

Is there a valid reason for this aside from making sure if they delete it once, they don't delete it everywhere? Now I see the true issue against datahoarders... Wasting more space than needed to account for a situation that their own actions kinda put them under...

I'll go ahead and state the obvious – it's because the vast majority of the files /r/Datahoarders is hoarding are illegal material that they don't want to get ToS'ed for.


Oooooooh... reminds me of something https://news.ycombinator.com/item?id=2438181 [How Dropbox sacrifices user privacy for cost savings (dubfire.net)]

> abuse the shit out of it

It wouldn't be abuse if it was actually unlimited. They uploaded a finite (if large) amount of data to a service advertising infinite capacity.

You know the saying: "you are not wrong, you are just an asshole".

I don't think OP is arguing that it's against ToS or anything like that. He is arguing that if people just upload stuff in these "unlimited" services all willy nilly soon they won't be unlimited anymore. The price will be the same, but they will drop the capacity to something reasonable. Which in turn might hurt some legitimate users.

I'm all for people backing up anything they feel is necessary, but they just need to know that if they are going to be uploading terabytes or even petabytes of data to an "unlimited" service without paying much for it, they are living on borrowed time.

Except they're not an asshole.

Companies should be responsible for what they put in their marketing. If someone calling the bluff on the "unlimited" plan makes them "downsize" the plan to actual ~10TB + $x per Y additional TB, then so be it. It probably doesn't change the reality in any way, and at least the company is no longer lying about its service.

Except it still might be way lower than 10TB. The unlimited thing, while it might technically be a lie, works on the premise that most users won't have even 1TB of data on the service, which lets some outliers have 10s of terabytes with no problem.

Obviously you are entitled to your opinion, and again you are not wrong: false advertising is bad. But if you are intentionally uploading stuff just to upload stuff, be prepared to lose most of it when the limits come crashing down. Take the guy who had/has over a petabyte of video on Amazon: if Amazon decides "OK, unlimited was a bad idea, let's give everyone 10TB", where is he going to put the remaining 1014TB of stuff? If the answer is "just let it get erased", then congratulations, you are the reason why we can't have nice things.

The problem is they don't downsize to a "reasonable" 10TB, they overcorrect in the opposite direction like Amazon did, and people that used the service as intended get fucked.

People were using amazon cloud drive to host their entire plex library and were hammering the service every time plex did a library update.

I think there is real damage though, if you let someone advertise 'unlimited' but not deliver.

If my alternative service had a much more reasonable offering with a high limit that 95% of both companies' users' data would fit in, I'm going to get screwed because I refuse to lie and call it unlimited.

> You know the saying: "you are not wrong, you are just an asshole".

That's more applicable when the other party is not being wrong by calling a service "unlimited", when it is not.

(and, arguably, assholes when they inevitably take it down because "oops we didn't mean, like, unlimited unlimited")

I think there's a hole in this reasoning. Because, it seems to me that "legitimate users" is being used to mean "people who are not uploading huge amounts of data" (usually phrased as 'abusing the service'). But then, by definition, imposing a reasonably large cap hurts no legitimate users.

Being in the terabytes for personal backup is pretty reasonable these days. I think my backup to Backblaze is about 3TB. A lot of that is photos; I have relatively little video.

I agree with your general point though. One thing that helps is that there's a certain throttle because of network bandwidth even if that isn't capped or deliberately throttled.

am i wrong?

no youre not wrong.

am i wrong?

youre not wrong walter. youre just an asshole!

all right then.

It's okay, for every one like this, there are 1000s like me that pay for the service and use a fraction of it.

Think of it as we are all contributing to keeping this data safe and available.

That is what they said about Amazon Cloud Drive and every other Unlimited service that came before it right up to the day the Unlimited Service died

Here's an idea then: don't make bullshit claims you can't back up.

There never was unlimited service so the only thing that happened is that services became accurately marketed.

And since the unlimited Google Drive is $10 a month per user and is targeting large companies, they definitely offset the costs. Now with team drives that helps a lot, as not every employee will have a unique copy of a file.

A car company says: "You can have any colour car you want as long as it is black..."

I am sure there will be people defending the car company as really offering unlimited colours, but obviously they have to restrict it to black, because some clients were unreasonable to expect the company to paint their car Neon Vermilion.

Or to put it another way, "You can have unlimited storage, except that it is limited..."

Nothing is unlimited, it does not exist.

Unlimited is a marketing term used to express simply to the consumer there are not overall limits placed on your storage provided you adhere to the rest of the terms of service.

In the context of cloud data storage, when Google, Amazon and the rest talk about "unlimited", they are referring to unlimited PERSONAL storage of data you create as a person. This would include backups of your personal computer, photos, important documents, etc.

Not backing up SoundCloud or the entire Internet for $5 a month

So if you are an amateur photographer, then storing 10TB of photos you took on the service is acceptable. Downloading 900TB of music files you do not own, did not create and have no permission to copy, "as a backup" because the service is going to go under, is abuse.

> Unlimited is a marketing term used to express simply to the consumer there are not overall limits placed on your storage provided you adhere to the rest of the terms of service.

Which specific clause of the TOS is this violating?

> In the context of cloud data storage, when Google, Amazon and the rest talk about "unlimited", they are referring to unlimited PERSONAL storage of data you create as a person. This would include backups of your personal computer, photos, important documents, etc.

That's your interpretation. Nowhere do they actually claim or imply that you're only supposed to use it for data you create as a person. In fact, it'd be absurd, considering that sharing files is built into the system.


That companies lie to us repeatedly under the guise of "common sense", as if they followed the same standard when applying their unreadable TOSs against us, is bad enough.

Corporations are not your friends, and they won't hesitate to block you if you start becoming a liability. Assuming good faith is absurd, defending it publicly is grotesque.

Any TOS I've seen from storage companies has clauses against violating local copyright law.

In the United States for instance, making a copy of any digital media is generally illegal, period. The two major exceptions here are if you can prove "fair use", or if you are an archive or library.

"Fair use" is, of course, a very fuzzy term. Fuzzy enough to give companies enough wiggle room to terminate if, as probably most large archives would be, a person uploaded terabytes and terabytes of copyrighted media to their drive. (If said person shares copyrighted links in particular, that usually is explicitly called out in storage TOS clauses... but even if not, I think it would be difficult to claim "fair use" for a personal upload of Soundcloud to your Google drive.)

In the Archive Team's case, it looks like the Archive Team is using the Wayback Machine from archive.org. (http://archiveteam.org/index.php?title=Dev/Infrastructure) Libraries and archives have their own set of rules allowing limited copying (https://www.law.cornell.edu/uscode/text/17/108), in addition to the general "fair use case". My guess is due to questions of Soundcloud's longevity, archiving Soundcloud would qualify.

> Unlimited is a marketing term used to express simply to the consumer there are not overall limits placed on your storage provided you adhere to the rest of the terms of service.

Well, that's doublespeak then, and if someone calls them out on it by actually testing the claim they make, so be it.

I don't understand why people seem not to mind being lied to their faces, as long as it's "just marketing".

>>I don't understand why people seem not to mind being lied to their faces,

Because most rational people use common sense and logic to come to the understanding that when a company is offering you "unlimited" storage for your PERSONAL FILES, they do not intend for you to go out and download SoundCloud as a backup in case the SoundCloud service goes under.

Just like an "All you can Eat" buffet does not intend this to mean "All you can eat in your entire life", whereby you fill grocery bags full of food to take home with you.

All you can eat is actually quite specific, but for knuckleheads like yourself it is usually clarified in writing in other places. It's all you can eat while at the establishment.. not all you can take. They don't advertise it as "unlimited" either as many places in busy markets have a posted time limit. This is clearly part of their terms of service as most buffets I know both clearly state that you can't use takeout containers and many have a posted price for taking out food by the pound or piece. You are not digging yourself further dude.

>>All you can eat is actually quite specific, but for knuckleheads like yourself it is usually clarified in writing in other places. It's all you can eat while at the establishment..

How am I a "knucklehead" in this situation? When I go to a buffet I eat a normal human portion of food, in line with the price I am charged for the meal.

I do not eat 25 plates full of prime rib for $5.

I do not abuse business simply because "I technically can because it is in the rules"

I fucking hate people that look for these types of technicalities to exploit in society. These types of people are exactly why there are pages of Terms of service, and why we can not have nice things, because people can not be trusted to not abuse shit.

Let's all calm down please.

Come to think of it, I'm mostly in agreement with you. I do feel there's a difference between "all you can eat buffet" and "unlimited storage" (or "lifetime warranties"). The former is more of a reasonable and well-explained offer; the latter is more of a bogus marketing claim. I detest bogus marketing claims.

detest them all you want, advocate for companies not to use them but do not justify bad abusive behavior as a means to "call out" the companies.

2 wrongs do not make a right and all that.

>I do not eat 25 plates full of prime rib for $5.

I'm not going to eat 25 plates of anything but, at a buffet, I have no issue with mostly going light on the cheaper fillers. And I'm not going to worry about it if I end up getting a "good deal" on the meal as a result.

> I do not eat 25 plates full of prime rib for $5.

Wait what? I mean the cost of admission is usually more like $50 but that's practically the SOP for Brazilian steakhouses. I have fasted for days just so that I could eat more.

Well then they shouldn't have called it unlimited.

I'm more interested in how they did this:

> Not a home connection, just google abuse. GCE has ~40Gbit to each server and >400Gbit peering with amazon

It sounds like they paid GCE for 900TB of data transfer?

There's no charge for ingress traffic. They spin up an instance in GCE and use youtube-dl or something else to scrape the site directly into an "unlimited" drive account (which is also free to transfer into).

I guess soundcloud is technically using Google Cloud now...

Consider what Soundcloud, which is hosted on AWS, paid for this. The cheapest public offer from AWS is 0.05$/GB for OUTBOUND traffic (for at least 350TB/month). If you consume more, you negotiate directly with AWS.

Let's assume they pay 0.005$/GB, which results in a 4500$ bill for Soundcloud; at the public 0.05$/GB rate it would be a 45000$ bill.
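Those two figures can be reproduced with a few lines. A rough sketch, assuming a flat per-GB rate, decimal units (900TB = 900,000GB), and the two rates mentioned above; whatever Soundcloud actually pays is not public, so both rates are guesses.

```python
from decimal import Decimal

def egress_bill(total_tb, price_per_gb):
    """Flat-rate outbound traffic cost in dollars for total_tb terabytes."""
    gigabytes = Decimal(total_tb) * 1000
    return gigabytes * Decimal(price_per_gb)

# Rates are passed as strings so Decimal keeps them exact.
for rate in ("0.005", "0.05"):
    print(rate, egress_bill(900, rate))
```

Real cloud pricing is tiered rather than flat, so this is only a ballpark for either assumption.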

> The cheapest public offer from AWS is 0.05$/GB for OUTBOUND traffic (for at least 350TB/month)

Actually it's $0.02/GB if your traffic is in the US and you do 5PB/month: https://aws.amazon.com/cloudfront/pricing/

But yes, they likely have a non-public deal.

they're using google drive

When your log file is 100GB..

Their website seems to be down: I'm just wondering, are they downloading everything accessible through the player, or just songs marked "Download"?

Even given that they could restrict themselves to songs marked okay to download, how much of that will be DJ mixes containing copyrighted songs?

I'm just wondering because Soundcloud actually has support to specify your copyright terms, which does not default to "everyone can download this", so it's an interesting case..

The website works now, it just says "selective content"

I'm sure SC serves up a lot of content per day, but how do you think they will react by suddenly having someone download all of their 900 TB or whatever it is in one day? How much will Archiveteam be contributing to SC's downfall by suddenly causing them a huge unexpected bill?

As someone who really wants the SC content backed up properly, I nonetheless see how this raises some interesting legal issues.

I think in most cases ArchiveTeam's actions have been copyright infringement on some level. They just don't care, and find keeping user content safe from unrepentant deletion to be more important.

If you've ever seen a Jason Scott talk, he isn't the sort of guy who gives a shit if you DMCA him while he's sucking up all of your bandwidth archiving your content two days before your servers shut down.

Also, DMCA takedown notices properly order certain kinds of intermediaries (not people who have consciously chosen to publish something) to stop facilitating access to allegedly infringing material. They're not properly directed to an alleged copyright infringer himself or herself, and they don't order people to stop downloading things, even if their reason for downloading is apparently to publish the information.


This is not to say that people don't send "DMCA notices" for anything and everything to anyone and everyone, but those notices are not following the law. (Also, a lawyer can always send a demand letter demanding that someone stop any behavior, but a properly-constructed DMCA notice gives the recipient an extra reason to follow it compared to a run-of-the-mill legal demand—"[a]n OSP who complies with the requirements for a given safe harbor is not liable for money damages", as Wikipedia puts it.)

Yeah, more likely you'd get a Cease-and-Desist. But once the ArchiveTeam publishes it, they'll likely get DMCA'd (though they've survived things like that before).

Not to mention that dying websites or recent acquisitions are likely to have the legal muster to start threatening archivists.

> I'm sure SC serves up a lot of content per day, but how do you think they will react by suddenly having someone download all of their 900 TB or whatever it is in one day? How much will Archiveteam be contributing to SC's downfall by suddenly causing them a huge unexpected bill?

Obviously, there's more than just the bandwidth cost, but assuming they pay $0.02/GB for CDN traffic, we're talking about $18k. It's not nothing, but I doubt it'd change their outlook in any meaningful way.

I should add that the ArchiveTeam doesn't download everything in a day, but rather uses a distributed crawler (ArchiveTeam Warrior) run by volunteers. They rate-limit the crawling rate as needed in order not to overwhelm the site being archived.
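For readers wondering what "rate-limit the crawling" can look like in practice, here is a generic token-bucket sketch. It is a common technique, not a description of how ArchiveTeam Warrior actually does it; `TokenBucket`, its parameters and the fake clock are all hypothetical names used for illustration.

```python
import time

class TokenBucket:
    """Allow at most rate_per_s requests per second, with bursts up to burst."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock          # injectable so the demo needs no sleeping
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Demo with a fake clock: 2 requests/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_s=2, burst=2, clock=lambda: fake_now[0])
first_three = [bucket.try_acquire() for _ in range(3)]
fake_now[0] = 0.5                   # half a second later, one token is back
after_wait = bucket.try_acquire()
print(first_three, after_wait)      # [True, True, False] True
```

A crawler would call `try_acquire()` before each request and back off when it returns `False`; lowering `rate_per_s` is how a coordinator could ease the load on a struggling site.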

That's making the incorrect assumption that all 900TB is hosted by the CDN... you typically only have a fraction of your content, known as your working set, in the CDN's working cache.

So you should factor in the bandwidth cost for CDN traffic AND origin pulls, which if you're serving from AWS (and not using CloudFront), is $0.08/GB.

The size of your working set also influences your overall CDN bill. Storage isn't free.

I'd expect their S3 data transfer pricing to be closer to the assumed $0.02 CDN price. The price available to the public goes down to $0.05 once you hit 200 TB, and AWS is known to give significant discounts to large clients.

The storage cost is a good point, and I don't know which CDN they use, but I expect the songs that weren't "hot" would be evicted from edge caches pretty fast, so that should be no significant increase.

To put things into perspective, an article from 2009 estimates that Spotify used 84TB of traffic per day[1], back when they had 5 million users. Soundcloud claims to have had 40 million registered users in 2013 and 175 million unique monthly listeners in 2014. These numbers aren't easy to compare, but either way I think it's safe to say that 900TB should be a small fraction of their monthly traffic, and shouldn't make a huge dent in their financials.
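A quick way to see the scale: turn the 2009 Spotify estimate into a monthly figure and compare the archive against it. This sketch uses only the 84TB/day number from the article cited above and assumes a 30-day month; it says nothing about Soundcloud's real traffic, which isn't given here.

```python
def share_of_month(archive_tb, daily_traffic_tb, days=30):
    """Archive size as a fraction of one month of traffic."""
    return archive_tb / (daily_traffic_tb * days)

spotify_monthly_tb = 84 * 30                 # 2520 TB per month in 2009
share = share_of_month(900, 84)
print(spotify_monthly_tb, round(share * 100))  # 2520 36
```

So the whole 900TB is roughly a third of one month of 2009-era Spotify traffic, from a service that then had 5 million users.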

[1]: https://www.theguardian.com/technology/blog/2009/oct/08/spot...

Is it just me, or is charging pennies per gigabyte usurious? If I paid even a penny per gig on my home internet, I'd have nearly a $150 bill just in gig charges, and I know I'm one of the lower bandwidth customers on my block.

Why stay with a host that bends you over a barrel for transit? Even Netflix doesn't use Amazon to deliver their data heavy content, as their pricing is pie in the sky high.

What block do you live on that 15TB is on the low end?

They replaced all lighting with streaming 1080p video.

I can only access it intermittently. Here is the google cache:


In response to your concerns; I think ArchiveTeam simply doesn't care. They are very firm in their convictions, and they don't exactly listen to requests to not archive things.[0]

If you're curious, here[1] is the initial discussion that AT had. People bring up copyright concerns.



I'm having weird issues myself downloading my own tracks that are hosted on soundcloud. When I login as myself even though I have set my songs to be able to be downloaded I can't. For one song I was able to log off and log back in and download that one but I was unable to do it for another.

Hmm, I've been playing with IPFS lately, and just had an idea: Since IPFS is perfect for archival, Archiveteam could put their files on IPFS, and users could help out by pinning stuff on their local nodes. For example, I could ask their website to give me a 10 GB list of files to pin (if I wanted to "donate" 10 GB to them), and I'd keep them available.

The only problem is that I don't know whether IPFS has any way to gauge availability, so I'm not sure if the team could tell which files were only hosted by a few people.
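One way around the availability question would be for the coordinating website to keep its own count of who has pledged to pin what, and hand out the least-replicated files first. A minimal sketch of that idea; `ArchiveFile`, `pinned_by` and `pick_files_to_pin` are hypothetical names, and this is bookkeeping on the coordinator's side, not an IPFS API.

```python
from dataclasses import dataclass

@dataclass
class ArchiveFile:
    cid: str        # content identifier of the file
    size_gb: float
    pinned_by: int  # how many volunteers the coordinator believes pin it

def pick_files_to_pin(files, budget_gb):
    """Greedily fill a volunteer's budget, least-replicated files first."""
    chosen, used = [], 0.0
    for f in sorted(files, key=lambda f: (f.pinned_by, f.size_gb)):
        if used + f.size_gb <= budget_gb:
            chosen.append(f)
            used += f.size_gb
    return chosen

catalog = [
    ArchiveFile("a", 6, 1),
    ArchiveFile("b", 5, 0),
    ArchiveFile("c", 4, 3),
    ArchiveFile("d", 3, 1),
]
chosen = pick_files_to_pin(catalog, budget_gb=10)
print([f.cid for f in chosen])  # ['b', 'd']
```

The counts are only as good as the volunteers' honesty and uptime, so a real coordinator would still need some way to spot-check that pledged files are actually retrievable.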

ArchiveTeam discussed backing up the Internet Archive a long time ago, in their project called INTERNETARCHIVE.BAK (http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK). They decided to go with git-annex because its creator was a cofounder of the ArchiveTeam and was willing to work with that project, which is still waiting for IPFS's proposal on how to do this (http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...).

You can participate in the effort of course. Have a few hundred GB and a good connection? Head over to http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/g... and follow the steps!

ArchiveTeam is putting the files in the Internet Archive archives. While IPFS is great, I'm not sure I agree it's good for archival because it depends on the availability of the IPFS network. The Internet Archive does work to make sure that there are sufficient backups to mean they can recreate their archive.

As long as there is a single party interested in content hosted on IPFS, the "IPFS Network" will persist.

It would absolutely be the best move for IPFS to be used in this case - maybe something like the AkashaApp guys, albeit for audio-media.

Edit: The Akasha App for those who aren't yet familiar with it - https://akasha.world - brings together IPFS and Ethereum to make a truly distributed peer network for persistent content.

> As long as there is a single party interested in content hosted on IPFS, the "IPFS Network" will persist.

I imagine that's one of the reasons why it's not ideal for archival content.

I don't get why? It's exactly the same today: you need at least one party to host the data (in this case the archive team). With IPFS however, if more parties wish to host it, it will lighten the load.

There is no incentive to keep data in the network.

You can pay pinning services, but what's the point if you're just going to pay someone hosting it?

I'd rather have this archived on Siacoin, Storj, Swarm or any other distributed network with actual incentives to keep things around

IPFS+Ethereum = incentive. Please do not be so flippant to reject something until you've grok'ed it sufficiently well enough to argue against it. If you have looked at IPFS+Ethereum and found it wanting, I'd love to know what exactly - because from my perspective this is precisely the kind of technology that delivers your stated requirements.

I've definitely "grok'ed" it.

Which is why I find Swarm a better solution. It is literally IPFS+Ethereum with additional support for ENS lookups, deniable storage, redundant storage, etc. This allows for far better privacy and being able to compensate the loss of parts of the file, both features lacking in IPFS itself.

The current swarm testnet performs, as per my experience, better than IPFS in terms of bandwidth and latency.


I think ipfs is very interesting, and will hopefully evolve to become a viable archive solution - but that'll only happen once you have a few serious organizations running their own redundant ipfs "intranet" storage networks, with geographical redundancy (or at least good local availability). Then we might see "archive peering" between such organizations.

The time horizon of "archival storage" probably starts at a hundred years - that will need some structure to have a likelihood of persisting for so long.

> The time horizon of "archival storage" probably starts at a hundred years - that will need some structure to have a likelihood of persisting for so long.

This reminds me of one of the principles of camlistore's data schema, which explicitly says that they made their schemas overly-explicit so that future digital archeologists can re-create the schema purely from examples[1].

It's a shame that camlistore feels more like a very long experiment over a polished and usable backup/archive system.

[1]: https://camlistore.org/doc/principles ; There used to be a more explicit explanation, but I couldn't find it.

> Since IPFS is perfect for archival

I don't believe this is true. IPFS doesn't have any built-in way of easily distributing parts of an archive, doesn't support (as far as I know) any form of erasure coding, which makes overhead quite high, and requires that you use its own weird block + hash scheme for integrity.

It's also very immature: we don't know if IPFS will be around in 10 years, and we don't know what kinds of bugs it will have.

IPFS is a great tool and it has its uses but I don't think archiving is one of them yet.
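
For anyone unsure what the erasure-coding point is about, here is a minimal sketch of the simplest possible form, a single XOR parity block. It is only an illustration: real archival systems typically use stronger codes such as Reed-Solomon, and the helper names (`xor_blocks`, `add_parity`, `recover`) are made up for this example, not part of IPFS or any other tool. The idea is that plain replication needs 2x the storage to survive the loss of one copy, while one parity block over three data blocks needs only 4/3x and still survives the loss of any one block.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def add_parity(data_blocks):
    """Return the data blocks plus one parity block (hypothetical helper)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored_blocks, missing_index):
    """Rebuild the single missing block by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stored_blocks) if i != missing_index]
    return xor_blocks(survivors)


blocks = [b"soun", b"dclo", b"ud!!"]   # three equal-length data blocks
stored = add_parity(blocks)            # four blocks stored for three of data
rebuilt = recover(stored, 1)           # pretend block 1 was lost
```

This only tolerates one lost block per group; tolerating more losses is where the heavier codes come in.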

IPFS would not be considered suitable for digital preservation. Have a look at LOCKSS[1].

[1] https://www.lockss.org/

Interested in understanding why you think it's not suitable for digital preservation? Feels like something that uses content-addressing and a P2P network is perfect for digital preservation.

In a P2P network the availability/integrity of the archived material depends on storage nodes not under the direct control of the archivist. I might trust a P2P network for content re-distribution, but I wouldn't trust it for long term storage.

You don't have to; If you would be hosting the archived material yourself anyway, there is no negative I can see in doing it through a P2P network. Long-term storage could still depend on you and any other users helping with any of the content for any amount of time would just be a bonus.
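
A rough sketch of what content-addressing buys you in this argument, assuming a plain SHA-256 hex digest as the address. IPFS's real addressing (chunking, multihash-based identifiers) is more involved, and `ContentStore` is a hypothetical toy class, not an IPFS API. The point is that the address is derived from the bytes, so a copy fetched from a node you don't control can still be checked, and identical uploads collapse into one entry.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data   # identical content maps to one entry
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hash on read so a corrupted or tampered copy is detected.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
addr_a = store.put(b"track bytes")
addr_b = store.put(b"track bytes")   # same content, same address
```

Note that this covers integrity only. Availability, which is the objection raised above, still depends on somebody actually keeping the bytes around.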

I've got a tool to grab all of the songs from your feed. I use this to offline sync mixes (not individual songs).


It would be an absolute shame if Soundcloud disappears. There has been so much music I have discovered on this service.

Shameless plug (has collected a bit of dust): https://github.com/krmannix/downcloud

It's a node tool built a few years ago to download the playlists of users through your command line. Might be helpful for a situation where you'd like to back up your own playlists.

You'll need to get an API key - not sure how feasible that is at this moment.

No offense, but https://github.com/rg3/youtube-dl is without question the best downloading experience I have ever encountered, and for damn sure doesn't require an API key

Yes, mine is built for SoundCloud, not YouTube or video sites, given the context of the article

That leads me to believe you have not used youtube-dl, as it has an unfortunate name but supports an incredible amount of sources: https://github.com/rg3/youtube-dl/tree/master/youtube_dl/ext... including, of course, soundcloud https://github.com/rg3/youtube-dl/blob/master/youtube_dl/ext...

Why doesn't SoundCloud want my money? There are no ads, and no paid plans for listeners. A lot of songs in my library disappear once the artist gets big and wants some cash from iTunes. I would have no problem paying for access to these songs but it's just not possible. I would also like to buy some band posters / t-shirts, vinyls, CDs, show tickets etc.; that's not possible either. It's like they are actively avoiding revenue streams. I don't get it.

There is a paid plan: SoundCloud Go.


Unavailable in my country apparently. It must also be the reason why I don't get ads.

Huh? There are both ads and paid plans for listeners, and I'm a subscriber.

Wondering this too. Why don't they sell tracks directly? I'm always wondering what various tracks are in the house mixes I listen to and would love to add them to my cart right there.

I imagine the problem with a feature like that could be that if they know what is in the mixes and they'd try to negotiate a deal for selling the tracks, they'd be asked to please pay or enforce payment of royalties for the tracks used in the mixes? (But I don't know to what extent they already do things like that/have deals in place)

> There are no ads

There are ads. Audio ads between songs.

There are ads — at least on mobile.

Archiveteam seems like a really cool project, what I was wondering (and couldn't find in the FAQ) is who is paying for all the storage? Is it donated by big tech companies?

Volunteers, no corps as far as I know.


rsync.net donated online storage for the original geocities backup efforts in 2009.

A decent amount eventually ends up at archive.org

[not all, because archive.org needs to be a bit more careful, but they have a decent symbiotic relationship]

I love archive.org but they have to start becoming a) very careful indeed and b) better at cleaning up the content that has been uploaded. By this time there is an insane amount of illegal content on it like ripped CDs and copies of books which still can be bought. I don't want them to go down because of this kind of stuff.

It was founded by Jason Scott (of textfiles.com) who works for the Internet Archive (which is nonprofit). So while I think the Archive Team is a fully community-driven project, what they back up ends up on the Internet Archive.

> Resource Limit Is Reached

> The website is temporarily unable to service your request as it exceeded resource limit. Please try again later.

I suppose I prefer an archive over the blog being unavailable

>I suppose I prefer an archive over the blog being unavailable


It is interesting that their blog is not good enough for the archive. The links above and below are different (the first one is updated but the second one is not, because the link contains some extra info). They would be able to save a lot of space by finding duplicates.


The more I think about this, the more convinced I am that Archiveteam are actually detrimental (in the long run) to the well-being of the Internet.

Don't get me wrong, I appreciate the work they do, and without them lots of content would simply disappear. But solving this problem should be at the core of the protocol itself (Xanadu, anyone?[1]), not depending on the resources and goodwill of a single team.

Just like IPv6, I don't think the problem will be solved as long as there's a patch that somehow works.

[1] https://en.wikipedia.org/wiki/Project_Xanadu

Apt username. Yes, you're wrong. See, we don't have Xanadu and we don't have an in-protocol solution so somebody needs to do the dirty work here and AT stepped up and does an absolutely incredible job.

So as long as there's a patch that somehow works we have at least a solution. If that patch wouldn't be there we'd have nothing.

The perfect solution is the enemy of the good.

There is no perfect solution in sight and this is a good one until a better one comes along.

>But solving this problem should be at the core of the protocol itself (Xanadu, anyone?[1]), not depending on the resources and goodwill of a single team.

I disagree. Expecting networks and software to act as an immutable medium is a fool's errand. The internet was never meant to provide a permanent cultural archive, and it's not actually a "problem" that it doesn't, because that's not what it's for.

Backups should be a service, not a feature of the network or the protocol itself. I think that what Archive Team does represents the correct way to approach the issue.

I downvoted this. Maybe we will get to this point one day, but in the meantime I think we should all appreciate the amount of conservation work they do for the future generations. Exactly how NGOs help feed people hoping that one day the system will be fixed.

I upvoted this. Regardless of whether I agree, I think it is a reasonable question to ask / discussion to have, and the downvote button should be used for comments that don't help the conversation in any way.

If/until the foundation is better, how does archiving hurt anything today?

I'm interested in people's opinions on the legality of this. They mention "Archive Team considers the SoundCloud service in danger and, as it hosts a lot of original content, finds it important to prepare to save it selectively (a full grab would be too big and would raise concerns of mass copyright infringement).", but how is downloading any portion of artist's music not copyright infringement?

I've written my own Soundcloud offline audio player, but didn't distribute it because it was against their TOS.

> but how is downloading any portion of artist's music not copyright infringement?

I had the same issue with backing up Geocities when it went down. I figured better safe than sorry, established a very easy deletion procedure for the copyright holders and have received only a very small number of nastygrams compared to an absolutely enormous number of messages from people that were happy their content got saved.

So at a guess, yes it is copyright infringement, no, it will not lead to trouble because most people are able to recognize a good faith effort when they see it.

> So at a guess, yes it is copyright infringement, no, it will not lead to trouble because most people are able to recognize a good faith effort when they see it.

A takedown notice from a few large commercial soundcloud users would probably be enough, no?

No, it wouldn't. You just take down their content, as is their right. If they sue they'll likely lose if they don't first send you a takedown notice but you are going to have to take that risk.

Each copyright owner can only request the takedown of their own content. They can't demand that you take the whole archive offline.

IIRC they said "selective backup"

To me that means "stuff that's most likely fine to preserve and most likely isn't found on other places".

Also, according to SoundCloud's ToS, by using them you are granting all users rights "to use, copy, listen to offline, repost, transmit or otherwise distribute" your content. So if Archive Team downloads everything they can (that does not in itself violate copyright (i.e. they are not Metallica songs)) there should be no copyright issues.

Not quite. The Terms of Use state that uploaders grant users the ability to use, copy, repost, distribute, etc. content specifically "utilizing the features of the Platform from time to time, and within the parameters set by you using the Services".

Which means you're only allowed to redistribute content through the facilities provided by Soundcloud. You're not allowed to simply download music and share it outside of the website.

That said, simply downloading music from Soundcloud isn't copyright infringement. You have to do that anyway to listen to the music. But redistributing it (outside of Soundcloud) is illegal unless the copyright holder has granted permission to do so, or the work is under Public Domain.

>By uploading Your Content to the Platform, you also grant a limited, worldwide, non-exclusive, royalty-free, fully paid up, license to other users of the Platform (...) to use, copy, listen to offline, repost, transmit or otherwise distribute, publicly display, publicly perform, adapt, prepare derivative works of, compile, make available and otherwise communicate to the public

At least I read this only as "by default you are granting all users permission to do whatever you like", then you can restrict the access rights according to next part.

>You can limit and restrict the availability of certain of Your Content to other users of the Platform, and to users of Linked Services

I'd guess there is still plenty of stuff to back up even if some artists/performers/bands use more restrictive licensing.

One of the reasons I prefer Soundcloud is that it's trivially easy for me to enable listeners to download tracks for free.

I am sure whatever Archive Team downloads that is determined safe (enough) will end up on the Internet Archive for everyone to enjoy. They have all kinds of neat stuff there that you might not even think to preserve, like old supermarket announcements. You can easily spend an evening just rummaging through it if you are into older stuff.

It's really not much different from the Internet Archive as a whole. The vast majority of the content that the Internet Archive stores is copyrighted and not under CC, etc. The Archive mostly gets away with it because it is/was all public facing material and they bend over backwards (with retroactive robots.txt, etc.) to remove anything that the owner objects to.

Copyright only covers distribution. Merely having a copy of something is fine. When you download something from Soundcloud, that's them distributing a copy to you; presumably they have permission to do that. If you then distribute copies of it to other people, then it might be a copyright violation (there are exceptions, and varying laws, and, etc). Just holding on to something that was distributed to you isn't a copyright violation.

It probably is and archive team just don't care...

Maybe they're only pulling either indie stuff with no copyright, or the rarer stuff that they're less likely to get in trouble over.

> indie stuff with no copyright

There's no such thing. Copyright is automatic: it applies as soon as the author puts the work in some tangible medium (like a hard drive).

I've never thought about the medium thing. If you write a song (and perhaps perform it in public and everything), but never commit it into writing or recording, does copyright not apply?

If you wrote it on paper, whether musical notation or just the lyrics, that would give you songwriter copyright.

If you then gave a performance and someone recorded it, the copyright in the recording would lie with them - but they would not be able to distribute it without also having your permission.

(People forget that these are different and that songwriter royalties are quite lucrative - famously, almost all the Beatles' songs are joint copyright Lennon/McCartney, not the band as a whole.)

What would the copyrights apply to if there's no recording?

Live concerts played from memory? E.g. Eubie Blake supposedly only wrote Charleston Rag down 16 years after composing it.

Someone else could write it down and publish it without the author's permission.

Huh, didn't know that!

So if I record and master a random riff, I technically have a copyright on it?

To be fair, I was mostly talking about ones with a record label, where you might be hounded by a label with a lot of legal representation.

Yes, sure, if you record a riff or write a few sentences on paper, then you own the copyright to that - there is no "magic importance limit" that needs to be surpassed for works to become copyrightable.

Shitty drawings made by five year olds get the same legal treatment as Picasso, etc, etc.

If you publish it. If you just sit on it, you don't. (At least not in the sense that you can get away with suing someone that publishes the same riff later)

What would publish imply?

Would putting it on a private Google Drive work? What about a Drive that's searchable and anyone can have a listen, if they so desired?

IANAL, but unlike patents, copyright violation only occurs if you actually copy the work - independent creations are not infringing, even if similar. While the standard for showing that the work was actually copied is usually not high, especially since these are usually civil suits, some link in a random Google Drive might not be enough.

I'd guess that standard would be set by a court case, unless there already has been one like this (in which case please enlighten me). In general, with vagueness like this, a court usually has the final say.

I don't know why SoundCloud does not have a huge, Wikipediaesque donation banner on every page illustrating the severity of their financial situation; it's embarrassing for them, but think of it like this: the artists have a right to know that the platform that manages their life's work needs their support.

"I don't know why SoundCloud does not have a huge, Wikipediaesque donation banner on every page illustrating the severity of their financial situation"

Maybe because they are a profit-oriented company that raised hundreds of millions of USD in venture capital. Asking for donations would be kinda unethical, and the founders would know that.

However, if SoundCloud somehow would be transformed into a non-profit organization...

So you're saying Richard Hendricks' New Internet should be used to keep Soundcloud afloat?

Did anyone backup GrooveShark? I had some unique pieces stored there which seem to be lost forever.

If these kinds of entities actually want to preserve resources, they shouldn't be generating a petabyte of bandwidth charges. Contact SoundCloud and come to an agreement that will be responsible.

I imagine a company with its funds running out will not be particularly interested in dedicating resources to the archive team. Not to mention that doing that would be kind of signing their own death certificate. It's hard to argue that you are "on path to profitability" when you're busy handing off your assets to a museum.

How much does a petabyte of bandwidth go for these days anyway? It might well be cheaper than paying engineering, legal and management to arrange for some kind of off-line data transfer.

That could cost them more than $50K in bandwidth.
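
Back-of-the-envelope version of that figure, as a sketch. The $0.05/GB egress price is an assumed round number chosen for illustration, not a quoted rate from any provider, and `transfer_cost_usd` is just a throwaway helper. At that rate a petabyte comes to about $50,000, so "more than $50K" corresponds to any per-GB price above that.

```python
def transfer_cost_usd(terabytes: float, usd_per_gb: float) -> float:
    """Cost of moving `terabytes` of data at a flat per-GB egress price."""
    gigabytes = terabytes * 1000      # decimal units: 1 TB = 1000 GB
    return gigabytes * usd_per_gb


# 1 PB = 1000 TB; the $0.05/GB rate is an assumption, not a real price list.
petabyte_cost = transfer_cost_usd(1000, 0.05)
```

Real pricing is usually tiered and negotiable at this volume, so treat the result as an order-of-magnitude estimate only.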

Or if you worked with them, and both of you happened to be in Amazon, that could be a no-cost transfer.

Archiveteam exists mostly because your suggested course of action rarely works.

If they can't convince soundcloud it's a positive, they should not be doing it.

I don't think SoundCloud is an entity with some optimistic face you could approach and convince through reason or merit.

I expect they, like any business, would take a stance primarily based on liability, cost, benefit, etc... overriding how anyone working at the company actually feels about it.

It costs them nothing to ignore/outright reject without consideration a crazy proposal like "lets give our entire database of content to a third party". Assessing the technical and legal ramifications of that proposal costs time/effort/money. Why bother?

If SoundCloud gives their content to Archive Team, they're shouldering the possibility of some kind of liability, surely. If they say nothing, let Archive Team take it themselves, from their website, they let Archive Team (who understand and are willing to) take responsibility for that.

That would be legally iffy, since they don't own the copyright, and so it's unlikely that they're allowed to bless a wholesale copy of their users' content to some third party.

If it's legally iffy, archiveteam should not be archiving it.

Probably legal issues with Soundcloud offloading the data.

This way they can skirt the rules and save the data.

Plus startups on the way out usually want to use their data as a bargaining chip to sell whatever remains of the service off, if they give it away they lose that chip.

What's the current context here? Is SoundCloud going away anytime soon?

That's at least the fear of many musicians, DJs and of the existing (but reportedly shrinking) user base after a recent round of layoffs and reports that SoundCloud's money will run out in Q4.

Soundcloud CEO says no, it's not going away anytime soon. https://www.digitalmusicnews.com/2017/07/14/soundcloud-ceo-r...

Would they say it if they knew it, though?

Anyone interested in backing up their own personal (public) SoundCloud files will find this tool useful: https://github.com/mafintosh/soundcloud-to-dat

Cool!! Based on the Dat project (http://datproject.org). These guys might become a good alternative to IPFS, especially if they reposition themselves as a message-based system, instead of just file sharing.

How can tracks be downloaded when many of them are pay-to-access? or stream-only?

Generally, if you "stream" (send) me bits, they're mine to do what I want with (practically, maybe not legally).

Although, if anyone could tell me how to do that with AES-SAMPLE HLS video, I'd be very happy.

Can you describe the ways you have attempted, and the bad outcome, to help focus the answer/search a little more?

The 10 minutes I spent poking around to see what "AES HLS" was made it appear that a MITM proxy would straighten that problem right out, unless you are in an iTunes FairPlay-esque encrypted to your iPhone type deal, in which case I believe the math is against you

One of the issues is that the stream I have as an example is geo-restricted to Australia.


youtube-dl seems to find the stream URL okay, but then prints ffmpeg errors:


https://github.com/selsta/hlsdl should be able to do it in theory, but it looks like WideVine DRM is involved.

Someone has a Kodi plugin to make it work but Kodi doesn't support recording of any kind (?)

Aren't most of the "bits" these days protected by DRM encryption, even in the hardware? I know video is, but audio?

I'm always confused when people say they've backed up some website when there are these practical and legal barriers.

If you can play the audio, it isn't encrypted

> If you can play the audio

This forcibly imposes a fixed recording rate, limited to the length of the track itself.

Especially if you’re playing over the rear audio jack instead of over the HDMI port.

How would you find all files to download without scraping all Soundcloud pages?

Just from a legal standpoint: isn't „...considers the SoundCloud service in danger...“ slander? Especially since the Internet Archive isn't a nobody (with maybe some inside information?).

Just remembered cases like Deutsche Bank vs Leo Kirch which are legal nightmares.

Good luck suing a not-for-profit volunteer group when you yourself are in financial trouble to begin with. I'm not saying it's impossible, but there's not much to gain.

Archive Team is not the Internet Archive, and the above quote is clearly opinion.

> Just from a legal standpoint: isn't „...considers the SoundCloud service in danger...“ slander?

I think you mean libel, and I think it's a reasonable conclusion that everyone is thinking given that SoundCloud just laid off so many people.

Maybe in Germany it could be, but not in the US.

I hadn't heard of Archiveteam before, but the fact that HN brings their site down doesn't make me too confident in their backing up skills :)

Backing up data then storing it to be infrequently read and handling a major spike of traffic are two very different things.

HN has been bringing down a few sites that were unprepared. True, it's possible to harden your page for unexpected traffic spikes, but that's always work and maybe you just decide it's not worth the effort and invest the time elsewhere. (whether that's a good decision is up for debate)

You're thinking scalability. What they do is reliability in archives, and "team work" (it's in the name) for retrieval. Cheers

Mostly, I saw the humour in it, which I tried to emphasize with the smiley. Didn't land.

Ah, yes, another good example of Poe's law! :)


HN can direct more traffic than you think, and it can do so very quickly. AT's backing up skills are fine, and if you've managed to be on HN for nearly 5 years without hearing about them then that does not make me too confident about your reading skills :)

fair point :) for me it's more this "first impression thing".

That's probably not going to help with their huge bandwidth costs

Ironically their web server is already over limit.

>508 Resource Limit Is Reached

Aww I like soundcloud

Does anyone feel that this is all a bit pointless. Like is there a greater social need to preserve SC?

I found a lot of the content too ephemeral, things like podcasts or DJ mixes. I dunno, it just seems a bit silly to put resources on it.

I have my own personal recordings up there (although they're also on bandcamp and I still have them backed up). I know of a lot of people who use it to host their music. It's not just podcasts and remixes.

Someone might have poured all of their creativity into content on that platform and, to loosely quote Jason Scott, "this might be the largest audience this specific line of genes have/will ever reach". Is that not worth preserving? Even if it just takes a thumb drive worth of space? Even if the content sucks, it's still part of our history as a species. Think of some of the old works/diaries/letters that are now thought to hold value. Like, I've lately been reading "letters from Seneca". Why are they any more worthy of preservation than something on SoundCloud?

I guess it's subjective; some people would think a bank of Vapourwave music on SC is important. I'm pretty happy to have a 128kbps version of blink 182 albums on my backup, because they mean something to me. I'm just really questioning the need to archive for the sake of archiving.

I'd recommend transcoding to 96kb/s VBR Ogg Opus during crawling.

Why? You increase the CPU requirements of the crawl, lose quality from an already low bitrate file and then have to QA the new encode. All for a very small decrease in filesize.

The goal of archiving is making a perfect copy. Transcoding ruins that.

Not relevant in this situation, but to be fair, archiving analog media isn’t about making a perfect copy, just a good one. Also, when old DRM encumbered games (StarForce, etc.) are archived, the DRM is removed first. So not a perfect copy there either, but just an as good as you can get copy.

as perfect as possible then, no need to argue about the 'perfect' phrasing. It's clear what I meant in the context.

Both your examples require transformation, while archiving unprotected digital media is as easy as making a copy.

That's not the goal of archiving.

Probably not an amazing and erudite source as others have posted but there's this tweet: https://twitter.com/chancetherapper/status/88592122075935539...

Soundcloud is here to stay

I don't understand. Is this supposed to imply this rapper will somehow invest and personally try to ensure SoundCloud will survive?

Or did he call the CEO, the CEO told him 'Yes we will stay'? I mean what is he supposed to say?

'No, don't bother uploading new songs, we will run out of cash soon, thanks for the call'?

Hm that's a good point. I didn't consider the cynical approach to a CEO's reassurance, as obvious as it seems.
