Magnet-hashes for all torrents on The Pirate Bay: 164 MB (thepiratebay.se)
176 points by Zirro 1720 days ago | 93 comments

This gets very meta very quickly.

If linking to copyrighted data 'should' be illegal (SOPA), then what about descriptions of that data that are sufficient to identify the original, but not reconstruct it (magnet links)?

And if those were made illegal, then what about descriptions of those descriptions? You can recurse infinitely on this.

Beyond mere amusement, after just one or two recursions, you get to the point where it would be difficult to write a law that would criminalize magnet links without also criminalizing people who link to a Sparknotes-like summary or commentary for a piece of media.

You're thinking about it the wrong way. From a programmer's POV, the link to the torrent can be abstracted endlessly into new and distinct forms, each of which you believe needs to be legislated away in turn. From a lawyer's point of view, the specifics are really not important, but rather the end result: is the user illegally procuring copyrighted material, or is the distributor providing them with a readily accessible means of doing so?

Law has certain resemblances to regular code, but folks here seem to think that if something isn't properly specified that the law will break in the same way that a program will fail to compile or run properly. But that's not how it works. Poorly drafted laws can fail, certainly, but it's not that hard to draft something that focuses on the end result.

Consider ordinary offences, such as robbery. You wouldn't get anywhere by arguing that you're alleged to have put your right hand in your pocket and pulled out a small hatchet, and that since there's no law specifically forbidding right-hand wielding of hatchets, you should go free. The technicalities of how you committed the robbery are irrelevant as long as it can be established that you took someone's property in a violent fashion. I'm a little perplexed as to why folks think torrenting/piracy/filesharing etc. is so different that it can't be addressed legally. Sure, the law needs to be clear and logical, but only up to a point. It doesn't need to be absolutely exhaustive, and 'beyond a reasonable doubt' has never meant 'beyond any imaginable possibility'. People do make arguments like that in criminal defense cases from time to time, but they typically fail because the doubts they attempt to raise are absurdly far-fetched.

Totally true. As someone commented on my own story on this, you can kill someone in war and be a hero, or kill someone at home and get life in prison. Context and the definition provided by the law are very important. And as you say, the law doesn't have to be needlessly ignorant of the reality. It can easily encompass intent, result, context, etc.

But! There are still interesting problems to be acknowledged as far as how you define the transgression. A torrent like this from one perspective is worth billions, from another is worth very little. And practically speaking, if someone were to really use it the way you might expect a pirate to use it, it is practically worth only a tiny fraction of its potential worth. These are things that are difficult to define or restrict, but are immediately obvious to anyone who would use it. That's why it's interesting to me.

Sounds like we're on the same page. Legislative drafting is often overlooked in these things, I agree. A lawyer friend in the UK told me that way back when, the very best lawyers would get recruited to work at parliament writing and 'debugging' legislation, in order to maximize its legal reliability. Around the 1980s this came to be regarded as a waste of money and the budgets were slashed, and instead of the best lawyers, legislative drafting fell to those who were unable to get decent jobs in the private sector... and the quality of legislation fell accordingly, while the cost of litigation soared. We're seeing a similar problem in American lawmaking nowadays, as best I can tell.

That's what this is already, eh?

A description of copyrighted data would be the torrent file, which content producers would probably like to argue is infringing.

A magnet link is a hash of the torrent file, so it's already two steps removed.

Of course, the Pirate Bay magnet dump is itself a torrent, so it's a hash of a hash of a hash of copyrighted data.

And that torrent itself has a magnet link: 938802790a385c49307f34cca4c30f80b03df59c is a hash of a hash of a hash of a hash of copyrighted data. (In the MPAA/RIAA's ideal world, I've just committed criminal copyright infringement with damages reaching into the $billions.)

Theoretically, the Pirate Bay dump could include the torrent for the Pirate Bay dump, and be an infinitely recursive description of itself... but that's probably an intractable cryptographic process.

And, continuing back from hexadecimal, we have the splendid:

in base 10 for the magnet link of the torrent containing the list of magnet links for the pirate bay's torrents. Shall we add to the list of illegal numbers?
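For anyone who wants to reproduce the conversion, here is a minimal Python sketch. It only assumes the 40-digit hex info-hash quoted earlier in the thread; the variable names are illustrative.

```python
# The 40-hex-digit info-hash quoted earlier in the thread.
INFO_HASH_HEX = "938802790a385c49307f34cca4c30f80b03df59c"

# Interpret the hex string as one big integer.
as_integer = int(INFO_HASH_HEX, 16)

# The same number written in base 10.
as_decimal = str(as_integer)
print(as_decimal)

# Converting back (zero-padded to 40 hex digits) recovers the original string,
# so both spellings carry exactly the same information.
round_trip = format(as_integer, "040x")
```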

Perhaps its prime factorization also belongs on that list.

Best to include the Roman numeral version of the number, as it's equally illegal.


It would also be wise to outlaw the URL I've just linked, and perhaps also the combination of letters "JROu8" as they also contain the information in question given proper context.


So, going back to hex and turning it into a flag just like the Free Speech flag in the wiki link, would that make this image:


the most copyright-infringing flag of all time?

What will they do if we find a crazy enough way to make, say, the number 5 infringing? And who owns 5, anyhow?

Even worse, they will make pointers illegal.

I can beat that. With a bit of indirection, the URL http://thepiratebay.se references all of the above.

> ... but that's probably an intractable cryptographic process.

Not always.

Algorithms that append a hash to a file, preserving the same hash, i.e.

  hash(s) == hash(s ++ hash(s))

A program that prints (though does not contain) a hash of itself:


(edited for formatting)

Ha. I bet one of those hashes is a hex representation of a copyrighted song lyric.

The more one studies computer science, the less one believes that information can be owned in any meaningful way. I remember reading a paper on a "lightnet" that basically XORed arbitrary blocks of data together, then produced a recipe on how to recover some original data by XORing the appropriate blocks. With just this recipe you could ask other nodes for the random blocks, then recover the original data locally. The paper made a good argument that however you try to define ownership of a block, it will lead to some contradiction. The system was implemented but I can't recall its name. I think it was hosted on SourceForge.
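A rough sketch of the XOR idea described above, in Python. This is not the actual algorithm of the system the comment refers to, just a toy illustration under simple assumptions: one random block per piece of data, blocks keyed by their SHA-1, and hypothetical helper names (`publish`, `recover`).

```python
import hashlib
import os
from typing import Dict, List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def publish(data: bytes, store: Dict[str, bytes]) -> List[str]:
    """Store two random-looking blocks and return the recipe (their keys)."""
    noise = os.urandom(len(data))   # a block of pure randomness
    mixed = xor_bytes(data, noise)  # also looks random on its own
    recipe = []
    for block in (noise, mixed):
        key = hashlib.sha1(block).hexdigest()
        store[key] = block
        recipe.append(key)
    return recipe


def recover(recipe: List[str], store: Dict[str, bytes]) -> bytes:
    """XOR together the blocks named in the recipe to rebuild the data."""
    blocks = [store[key] for key in recipe]
    result = blocks[0]
    for block in blocks[1:]:
        result = xor_bytes(result, block)
    return result


store: Dict[str, bytes] = {}
recipe = publish(b"some original data", store)
restored = recover(recipe, store)
```

Neither stored block contains the original bytes by itself; only the recipe ties them together, which is the point the ownership argument turns on.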

I'm sure most everyone on HN has read it already, but "What Colour are your bits?" http://ansuz.sooke.bc.ca/entry/23 is always a good reference in this sort of discussion. The system you mention sounds like what the Colour essay is railing against; trying to create a mathematical solution to a problem which is rooted in the law, and cannot be resolved by mathematical manipulations, but only by legal manipulations.

Do you recall the title or author of that paper? I'd be interested in reading about the contradiction argument you mention.

I think parent is talking about Monolith[1], which in fact that article specifically references:

    I think Colour is what the designers of Monolith are trying to challenge, 
    although I'm afraid I think their understanding of the issues is superficial
    on both the legal and computer-science sides. (...)
[1]: http://monolith.sourceforge.net/

I like that article as well, insofar as it helps both sides understand each other better, rather than continuously talking past each other. In order to do so effectively, that article pointedly avoids taking a position on the issue. However, the article also doesn't suggest that people shouldn't hold a position on that issue.

I personally feel comfortable taking the position that anyone claiming that bits have color has an objectively wrong view of reality. I don't think that position needs much advocacy, though; reality always tends to win in the long term. It could certainly use a little help sometimes, though.

The followup "Colour, social beings, and undecidability" http://ansuz.sooke.bc.ca/entry/24 is also pretty good.

I feel this is dealt with by the fact that law has a concept of intention. Handing out the recipe for reconstructing the pirated material counts as aiding piracy, I would say.

There's definitely the undecidable concept of intention, but there's also an undecidable concept of identity: for example, when is an MP3 file "the same" as a copyrighted song? You can encode a copyrighted song into different bitrates, swap left/right channels, slightly pitch shift, apply equalization, etc. Most likely, given some algorithm that tests audio identity, you could always come up with a way to create a copy that sounds the same to a human but is missed by the computer.

Granted, the fingerprinting algorithm used by YouTube is pretty good: http://www.csh.rit.edu/~parallax/

>when is an MP3 file "the same" as a copyrighted song?


I have to say, I find the concept of intent a little difficult to grasp when it comes to this stuff.

As an, albeit somewhat contrived, example:

Let's say I have a music player on my computer that, when fed the works of Shakespeare, plays Gaga's latest hit and, when fed the works of Sir Arthur Conan Doyle, plays Jingle Bells.

My actual intent is to read Shakespeare and to listen to Jingle Bells, but that is going to be a pretty difficult case to prove in a real court of law. The assumption will be made that I had the player and copy of Shakespeare so that I could listen to Gaga.

Theoretically, unless Shakespeare has been interpreted by said program, I have not committed any crime. But precedent disagrees. Just having the right sequence of bits on your computer is enough to prove intent, whether your actual intention was to use it to violate copyright or not. That is where I get lost.

Yes, in much the same way that any given number could be random, pseudorandom or nonrandom given its context, it can also be in violation or accordance with copyright law depending on its legal context.

Don't confuse a random number with the value returned by a function that returns nondeterministic values with a somewhat predictable pattern.

In case anyone is interested, the program to which I referred is "OFF System". Its home page is supposed to be http://offsystem.sourceforge.net/, but unfortunately it's down now and the project seems dead. The paper was on the site, but I can't find any link to it now.

EDIT: I think I found it at http://www.findthatfile.com/search-31270472-hPDF/download-do... . The name of the file is "CopyNumbCJ.pdf".

> This gets very meta very quickly.

You're thinking legality, but I'm thinking efficiency. So we can now distribute 1.5 million torrents (of a total of several thousand TBs, no doubt) in a file 90MB big, in a torrent which itself has a magnet address that takes up...20 bytes?

The size savings as you go up the tree are incredible. I see no reason why you couldn't create an almost-entirely distributed torrent site in this way.

Think about it: torrent discovery could be done by regular distribution of index torrents, and the clients use that to find out what can be downloaded and where.

In fact, in the world of magnet addresses, "uploading" a torrent would be as simple as requesting that its URI be put in the day's index. So running a torrent site would be as simple as curating a list of magnet URIs each day into an index, then publishing that torrent's URI somewhere. Like Twitter. You could run a torrent site entirely from Twitter.

eDonkey has had distributed search for a long time. It's possible to maintain a keyword index of magnet links in a DHT, and then you remove the need for the torrent site completely.

I think that requiring torrent files and trackers was a policy decision to deflect liability away from the client implementer to multiple third parties. That's why bit torrent is still around and Grokster isn't. There's no technical need for them.

BitTorrent has torrent files and trackers because it was designed for non-infringing use-cases (e.g. distributing Knoppix images). The centralized elements substantially improve reliability (and, at the time, performance) in those use-cases.

Java should be illegal. You can write code that downloads torrents with it.

Your comment should be illegal; it gives advice on how to go about this process.

Your comment should be illegal; it clarifies that the previous comment was advice and therefore assists in illegal action.

Someone could suggest a system, whereby spelling mistakes are used to encode partial information about a magnet link.

One switched pair could give you the position in the magnet-link key, the other switched pair could give you the value. That way, you could never pin down exactly who gave you what information.

Or maybe I shouldn't suggest it?

That could have interesting unintended consequences - ISPs being pressured to ban users who post blog comments with spelling errors.

I approve of this

Lawmakers can easily avoid this meta situation by writing simpler, more encompassing laws. So rather than being specific about how the pirated material is accessed, they can write a more open law along the lines of "a site whose main use is assisting the distribution of pirated material".

SOPA/PIPA tried to do that.

> then what about descriptions of that data that are sufficient to identify the original, but not reconstruct it

Such as the name of the copyrighted work?

Hello, I am an author of the scrape. I did it more to try it, but who knows, maybe it will be useful to someone.

I went through the description pages like http://thepiratebay.se/torrent/$i by increasing the $i and saving the magnet if Pirate Bay didn't return a 404 error. I went through the pages as a logged-out user, though, so that might be the reason why I got only 1.5m torrents.

I didn't know pirate bay has hidden porn torrents; there is TONS of porn in the scrape already.

The script is in perl, I will post it to pastebin in a moment.

edit: all right, the script itself is here http://pastebin.com/8RXXthXB

as you can see, it's not very complicated.
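The loop described above (count up through the IDs, skip pages that return 404, keep the magnet hash otherwise) could be sketched roughly like this. This is not the Perl script from the pastebin, only a Python approximation: `fetch_page` is a hypothetical callable supplied by the caller that returns the page HTML or `None` for a 404, and the regular expression is an assumption about how the magnet link appears in the page.

```python
import re
from typing import Callable, Dict, Optional

# Assumed shape of the magnet link inside a description page.
MAGNET_RE = re.compile(r"magnet:\?xt=urn:btih:([0-9a-fA-F]{40})")


def scrape_magnets(
    fetch_page: Callable[[int], Optional[str]],
    first_id: int,
    last_id: int,
) -> Dict[int, str]:
    """Walk the ID range and collect one info-hash per existing page."""
    found: Dict[int, str] = {}
    for torrent_id in range(first_id, last_id + 1):
        html = fetch_page(torrent_id)
        if html is None:
            # Treated as "404": the ID has a gap here, move on.
            continue
        match = MAGNET_RE.search(html)
        if match:
            found[torrent_id] = match.group(1).lower()
    return found
```

Passing the fetcher in as a parameter keeps the loop independent of any particular HTTP library and makes it easy to try against canned pages.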

I think it's a great idea, and a nifty hack.

It might be an idea to release a diff against this once a week, and write a quick script to grab it, keeping the list up-to-date.

I am thinking of releasing new versions once a week and putting the hash of the torrent of the newest version on some public site. (Say, some twitter account.)

But it would still be more proof of concept than really anything useful - the comments and descriptions ARE important.

edit: The more I think about it, the less useful it sounds.

First, the information about seeders varies constantly, especially with the new torrents.

Also, it STILL depends on a single point of failure - the Pirate Bay itself. If TPB goes down for any reason, I will have no place to scrape this from and it will all fall apart anyway.

Plus, I think Pirate Bay itself should make dumps like this. It would probably be much better for their database anyway :)

I like the idea of a weekly twitter update with the master magnet hash. I feel like the purpose would not be the usefulness of the string of chars, but more to prove a point.

The porn torrents are only hidden from naive searchers; all the pages for them are still accessible if you've got a direct link to them, so your scraper should've picked all of them up.

I tried to run the script; however, I get an error (I added diagnostics for more info, so line 13 refers to line 11 of your script, line 27 to line 25):

  Can't use an undefined value as an ARRAY reference at piratebay_magnet_scrape.pl line 13 (#1)
      (F) A value used as either a hard reference or a symbolic reference must be a defined value. This helps to delurk some insidious errors.

  Uncaught exception from user code:
      Can't use an undefined value as an ARRAY reference at piratebay_magnet_scrape.pl line 13.
   at piratebay_magnet_scrape.pl line 13
      main::__ANON__(20697, 0, undef, 0, 0) called at /usr/share/perl5/Parallel/ForkManager.pm line 354
      Parallel::ForkManager::on_finish('Parallel::ForkManager=HASH(0x9cd7ac8)', 20697, 0, undef, 0, 0) called at /usr/share/perl5/Parallel/ForkManager.pm line 333
      Parallel::ForkManager::wait_one_child('Parallel::ForkManager=HASH(0x9cd7ac8)', undef) called at /usr/share/perl5/Parallel/ForkManager.pm line 285
      Parallel::ForkManager::start('Parallel::ForkManager=HASH(0x9cd7ac8)') called at piratebay_magnet_scrape.pl line 27

The Pirate Bay front page claims 4.187.907 torrents. But this 164MB is only ~1.5 million torrents. Is the discrepancy from exclusion of the porn torrents? I'm guessing this guy's scrape missed them; you have to be logged in to TPB to see them.

It does contain adult content.

The IDs are sequential, but there are substantial gaps. Removed spam torrents, most likely.

When TPB had to be blocked in the Netherlands and they switched to recommending magnet links instead of torrents (pretty close after each other), I thought someone would have done this sooner. But it's here now, and proxies do their job just fine ^^. (I couldn't load the page directly as it's blocked here.)

magnet of the magnets: magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c&dn=The+whole+Pirate+Bay+magnet+archive&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80

You need to put your link example with new lines above and below, indented by two spaces. This will put it in a scrollable code box instead of stretching the page.



Should be titled "The whole Pirate Bay magnet archive except this torrent" :-)

I think a magnet link is based on the hash of the contents too, so it might be an interesting problem to include the torrent's own magnet link in itself.

Son, in this house we respect Russell's paradox!

The BT Infohash is mostly a hash of the hashes of the blocks + file names + a couple of other things. Notably the list of trackers is not part of the infohash, so adding trackers to a torrent file does not affect the hash.

If you're really interested, install the Python module bencode, and use it to de-serialize a .torrent file.
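To make the point about trackers concrete without installing anything, here is a small self-contained sketch. It uses a hand-rolled bencode encoder/decoder rather than the `bencode` module mentioned above, and the example torrent (file name, tracker URLs under `example.invalid`) is made up. It assumes the info-hash is the SHA-1 of the bencoded `info` dictionary, which matches the description in the parent comment.

```python
import hashlib


def bencode(value) -> bytes:
    """Minimal bencode encoder for ints, strings/bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def bdecode(data: bytes, pos: int = 0):
    """Minimal bencode decoder. Returns (value, position after the value)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    # Otherwise a byte string of the form <length>:<bytes>.
    colon = data.index(b":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


def info_hash_of(torrent_bytes: bytes) -> str:
    """SHA-1 (hex) of the bencoded 'info' dictionary of a .torrent file."""
    decoded, _ = bdecode(torrent_bytes)
    return hashlib.sha1(bencode(decoded[b"info"])).hexdigest()


# A made-up single-file torrent; only the 'info' part feeds the hash.
torrent = {
    "announce": "udp://tracker.example.invalid:80",
    "info": {
        "name": "example.txt",
        "length": 11,
        "piece length": 16384,
        "pieces": hashlib.sha1(b"hello world").digest(),
    },
}
original_hash = info_hash_of(bencode(torrent))

# Adding trackers changes the file but not the info-hash.
with_more_trackers = dict(torrent)
with_more_trackers["announce-list"] = [["udp://other.example.invalid:80"]]
hash_after_adding_trackers = info_hash_of(bencode(with_more_trackers))
```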

The thing is, new torrents are uploaded to Pirate Bay all the time, so one can only archive TPB in any given moment - which has to be, of course, before the creation of this torrent.

The archive is static, the TPB database is dynamic :)

Fun discussion here, guys. I've posted my thoughts on it at TechCrunch here: http://techcrunch.com/2012/02/08/is-a-hash-of-hash-of-a-torr...

Hope the community doesn't think I've hijacked the thread for my own purposes. I just thought it was an interesting little discussion and wanted to point it out.

I wonder if some self-updating mechanism could be implemented in magnet links. Something like an additional signature part in the magnet URL, so the owner could inform other peers that the content has changed and needs to be updated.

how can this happen i wonder

Got it down to 70 MB with lrzip. Any better?

58858381 bytes (57 MB) with nanozip

  nz -cc -m1.8g

I got 68.2MB with xz. (Whoa, 1.8MB difference...)

68MB with "xz -9 --extreme" :P

70841044 bytes (67.56 MB) by running it through "sort" first.

Bonus: it's in chronological order.

it's in sort-of chronological order originally (partially ordered is the correct term, I guess?)

but not 100%

I don't know how it was ordered originally. I imagine just whatever order the scraper returned data.

Putting it through sort ordered it by TPB's ID, which I imagine is assigned in chronological order. The low-numbered torrents seem to be from 2004.
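If you want to try the sort-then-compress comparison yourself, something along these lines would do it. This is only a sketch on synthetic data: the `id|hash` line layout is an assumption, the hashes are generated rather than real, and no particular size difference is claimed for the actual dump.

```python
import hashlib
import lzma
import random
from typing import List


def compressed_size(lines: List[str]) -> int:
    """Size in bytes of the lines after xz/LZMA compression at the highest preset."""
    payload = "\n".join(lines).encode("utf-8")
    return len(lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME))


# Synthetic stand-in for the dump: "<id>|<40 hex digits>" per line.
lines = [f"{i}|{hashlib.sha1(str(i).encode()).hexdigest()}" for i in range(1, 2001)]
random.Random(0).shuffle(lines)  # pretend the scraper returned them out of order

# Sort numerically by the leading ID field.
sorted_lines = sorted(lines, key=lambda line: int(line.split("|", 1)[0]))

print("unsorted:", compressed_size(lines), "bytes")
print("sorted:  ", compressed_size(sorted_lines), "bytes")
```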

67.8 MB with 7-zip LZMA2 64MB dict, 273 word size, 8 threads

0MB with rm

I wrote https://github.com/Bogdanp/Pirate as a fun little exercise and thought some of you might find it useful.


Just one thing: I see you are splitting by |, and some torrents (very few, but some) have | in their name (I didn't bother with escaping that).

I've accounted for that by grabbing all the other fields before grabbing the title. `python pirate.py -l | grep '|'` seems to yield correct results :).
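The trick of grabbing the fixed fields first and leaving the title for last might look something like the sketch below. The field layout (`id|title|size|seeders|leechers|hash`) is an assumption made for illustration, and this is not the code from the linked repository.

```python
from typing import Dict


def parse_line(line: str) -> Dict[str, str]:
    """Split one record, tolerating '|' characters inside the title."""
    # The ID is the first field, so split it off from the left.
    torrent_id, rest = line.rstrip("\n").split("|", 1)
    # The last four fields never contain '|', so peel them off from the right.
    # Whatever remains in the middle is the title, pipes and all.
    title, size, seeders, leechers, info_hash = rest.rsplit("|", 4)
    return {
        "id": torrent_id,
        "title": title,
        "size": size,
        "seeders": seeders,
        "leechers": leechers,
        "info_hash": info_hash,
    }


record = parse_line(
    "123|Some | odd | title|1024|5|2|938802790a385c49307f34cca4c30f80b03df59c"
)
```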


I was inspired to steal and pirate the above magnet link into this quick vigilantist internet liberation site: http://yason.kapsi.fi/piratebay.html. I would be positively surprised to catch the interest of even a single MAFIAA party, though.

I'm in Holland on Ziggo, so can't see it... :( But good news!

Does www.br3in.nl work to access it?

I didn't know about that one yet, but that's hilarious!

Awesome idea. Magnet links will be around as the standard for some time it seems :)

Sorry if I'm being daft, but can't you get this down to a magnet link itself? The page linked does just that: magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c&dn=The+whole+Pirate+Bay+magnet+archive&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80

but wikipedia has an example of a magnet link like this:


so could we get the magnet link to ALL of the magnet-hashes for ALL torrents on the Pirate Bay down to, what is that, 35 characters plus the magnet cruft "magnet:?xt=urn:sha1:"?
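On the character count: the hash in the link is 160 bits, so it is 40 characters in hex and would be 32 in Base32. A quick Python sketch of the arithmetic, using the info-hash from the link above (whether a given client accepts the Base32 spelling in a `btih` link is something to check rather than assume):

```python
import base64

INFO_HASH_HEX = "938802790a385c49307f34cca4c30f80b03df59c"

# 40 hex digits encode 20 raw bytes (160 bits).
raw = bytes.fromhex(INFO_HASH_HEX)

# Base32 packs 5 bits per character, so 160 bits fit in exactly 32 characters.
as_base32 = base64.b32encode(raw).decode("ascii")

print(len(INFO_HASH_HEX), "hex characters")
print(len(raw), "raw bytes")
print(len(as_base32), "Base32 characters")
```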

Hi, would you mind editing your post to remove the really really long string of text? It's breaking the page.

Which browser are you using? I am not seeing this no matter how much I resize the window. (I use Nightly, Firefox alpha)

Edit: To clarify, the line/link he posted splits across three rows automatically.

Chrome on windows. This happens in every comment thread with very long strings of text.

Works fine for me on Chrome Stable x64.

It breaks the page on Chrome 16.0.912.77m on Windows 7 x64.

The stable build is on 17.0.963.46 m. Been a while since you've restarted your browser?

Windows really must be getting better.

I used to comment on how I had to reboot my computer the other month. Windows users would be like "yeah, that sucks, I hope you had everything backed up ... wait, did you say reload or reboot?"

Chrome is only on 17.x.x.x since yesterday. Not really 'a while' :)

The page is still "broken" on Chrome 17.0.963.46 m Win7 x64.

Breaks with Chrome 18.0.1017.2.dev on Lucid as well.

Still broken on Chrome 28.

I wish this would be acted on. The comments are pretty unreadable in this thread as-is.

Alternatively, HN needs to implement a single-comment-only break somehow.

sorry, I saw your post past the edit window. Any mod can feel free to do as the parent suggests, it's ok by me.

if you know who to mail who can do that go ahead

You can leave out the "dn" (just a name), but if you leave out the tr links to trackers, the peers using that magnet link will have a harder time finding each other, and will often end up partitioned.
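To see which parts are which, here is a small Python sketch that pulls a magnet link apart with the standard library and rebuilds the bare-minimum form. The helper names (`parse_magnet`, `minimal_magnet`) are made up for illustration, and it only handles the `urn:btih:` form used in this thread.

```python
from typing import Dict
from urllib.parse import parse_qs

MAGNET = (
    "magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c"
    "&dn=The+whole+Pirate+Bay+magnet+archive"
    "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80"
    "&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80"
    "&tr=udp%3A%2F%2Ftracker.ccc.de%3A80"
)

SCHEME = "magnet:?"
BTIH_PREFIX = "urn:btih:"


def parse_magnet(uri: str) -> Dict[str, object]:
    """Split a magnet URI into info-hash, display name and tracker list."""
    if not uri.startswith(SCHEME):
        raise ValueError("not a magnet URI")
    params = parse_qs(uri[len(SCHEME):])  # also decodes '+' and %XX escapes
    exact_topic = params["xt"][0]
    if not exact_topic.startswith(BTIH_PREFIX):
        raise ValueError("only urn:btih: links are handled in this sketch")
    return {
        "info_hash": exact_topic[len(BTIH_PREFIX):],
        "name": params.get("dn", [None])[0],  # optional, purely cosmetic
        "trackers": params.get("tr", []),     # optional, but helps peers meet
    }


def minimal_magnet(info_hash: str) -> str:
    """The shortest link that still identifies the torrent (no dn, no tr)."""
    return f"{SCHEME}xt={BTIH_PREFIX}{info_hash}"


parsed = parse_magnet(MAGNET)
shortest = minimal_magnet(parsed["info_hash"])
```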

The magnets in the archive are already just the hashes to make it smaller.

right but if that's "all of the pirate bay" why does it stop being so after one additional level of indirection, being likewise "just the hash to make it smaller"?

If we accept that the linked file is "all of the pirate bay", then isn't my comment equally "all of the pirate bay"? Haven't I just included "all of the pirate bay" in my comment?

Maybe I'm being a bit black-and-white on this, but while the meta and the philosophy is interesting to talk about, no one is mentioning the morality. Stealing is wrong. You're taking someone else's work and not compensating them for it. I think it's sad that we're all so worried about the law when, in reality, you shouldn't pirate music for the same reason you don't steal a candy bar from the grocery store or snag 5 dollars out of your coworker's wallet or hack into Dropbox to get extra storage for free. It's wrong.

I'm getting a bit sick of the 'piracy is theft' nonsense. It isn't. Nobody is deprived of their possession. Me copying your song doesn't result in you no longer having a song.

Piracy is far closer to plagiarism, but even then only to a point. In plagiarism, one attempts to pass off the work of another as one's own. In piracy, one simply copies another's work for one's own use. They are fundamentally different.

This is why piracy is as prevalent as it is: it simply is not as bad as plagiarism, let alone theft. Most people have an intuitive understanding of this, and those who pirate do so without the cognitive dissonance that comes with acting against their moral code. It might be "wrong" in an abstract sense, sort of like lying on your resume is "wrong", but it's not wrong in the absolute sense of harming another person's body or property.

The real problem is that piracy is an attack on our capitalist society's framework for handling a new type of good--a non-scarce good. Copies of data are not scarce, yet society has decided to declare them to be scarce, by law, on pain of jail time.

Declaring copies of data to be scarce is a convenient convention (for capitalists), because it allows data itself to be sold exactly like the scarce vessels (records, cassettes, books) which used to be sold as a stand-in for the data itself, and like other scarce things (bread, oil).

Without scarcity, capitalism cannot function. You cannot sell air (yet, but see "Spaceballs") because there is air all around so you can't run out of it. You can't sell "fours" because you can create as many fours as you need with a pen and paper or keyboard. However, the capitalists have successfully legislated that certain sequences of data are scarce. If this fictional scarcity of data collapses many people would be unable to make money selling non-scarce "virtual" things like ideas and data, and would have to fall back to selling real things. And they'd be unable to "allocate" these goods to people because everybody could just have whatever they wanted. Very good for people, very bad for capitalists.

Much like the attempt to legislate Pi (a half-true half-myth btw), this 'fiat scarcity' (just coined that) is doomed to ultimate failure. If not in the law, in practice. Like the War on Drugs.

Piracy is sharing and most kids (I was anyway) are taught at a young age that sharing is caring. Maybe we need to now teach kids that sharing is piracy and will land you with huge fines. Next time a kid gives her friend some chocolate, we better lock her up before she starts infringing our rights!
