1.5 TB of Dark Net Market scrapes (gwern.net)
364 points by gwern on July 15, 2015 | 73 comments

I've had a very slow-moving hobby project to parse and analyze a subset of this data: https://github.com/rcompton/black-market-recommender-systems

So far I've had some OK results along the lines of "91.7% of vendors who sold speed and MDMA also sold ecstasy": http://ryancompton.net/2015/03/24/darknet-market-basket-anal... I am now working on extending this to markets besides Evolution.

Just FYI, MDMA is ecstasy.

I chose that statistic to be a bit tongue-in-cheek, cf

> While 'ecstasy' is the popular name for MDMA, the functional definition of ecstasy is any pill represented as MDMA on the street. Ecstasy pills are notoriously unreliable in content, more so than most other street drugs, and commonly contain either caffeine, ephedrine, amphetamines, MDA, MDE, DXM, or--in rare cases--DOB, and don't necessarily contain MDMA or any psychoactive.


To follow up on your response and clarify some: in your dataset, not only do 91.7% of the vendors of 'speed and MDMA' also sell 'Ecstasy'; 76.8% of the vendors of 'Stimulants and Ecstasy' also sell 'MDMA'.

If I understand correctly, this means that vendors were significantly more likely to mention 'Ecstasy' in a product name (or description) if they claimed to be selling 'MDMA' than they were to mention 'MDMA' in a product name/description if they claimed to be selling 'Ecstasy', reinforcing the point that you have just made.
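
For anyone wanting to see why the two percentages differ: both are the "confidence" of an association rule in market basket analysis, and confidence is not symmetric. Below is a minimal sketch, assuming each vendor is reduced to the set of categories they list under. The vendor sets are made-up toy data, not figures from the actual scrape, and `confidence` is a hypothetical helper name rather than anything from the linked repository.

```python
from typing import Iterable, Optional, Set


def confidence(baskets: Iterable[Set[str]],
               antecedent: Set[str],
               consequent: Set[str]) -> Optional[float]:
    """Share of baskets containing `antecedent` that also contain `consequent`.

    Returns None when no basket contains the antecedent.
    """
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return None
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


# Hypothetical toy data: one set of listing categories per vendor.
vendors = [
    {"speed", "mdma", "ecstasy"},
    {"speed", "mdma", "ecstasy"},
    {"speed", "mdma"},
    {"stimulants", "ecstasy", "mdma"},
    {"stimulants", "ecstasy"},
    {"cannabis"},
]

# Rule 1: {speed, mdma} -> {ecstasy}
print(confidence(vendors, {"speed", "mdma"}, {"ecstasy"}))
# Rule 2: {stimulants, ecstasy} -> {mdma}
print(confidence(vendors, {"stimulants", "ecstasy"}, {"mdma"}))
```

The two rules are conditioned on different groups of vendors, so nothing forces the two numbers to agree.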

What people think of as ecstasy (the pill that makes you dance all night) is usually mdma mixed with some kind of speed. That's why this statistic was amusing.

"Usually" is a real stretch. ecstasydata.org's analysis of lab testing reports indicates amphetamines were present in under 5% of tests for the last 15 years. http://www.ecstasydata.org/stats_substance_by_year.php?style...

If you also include the methamphetamine row, that number is a lot higher; in 2007 in particular, it was present in 38.6% of the pills tested. For contrast, last year only 33.1% of the pills tested actually contained MDMA.

I don't do MDMA personally but I have some friends who like going to raves with some chemical supplies. On the streets, some make it a point of honor to sell/buy only "MDMA" because "ecstasy" is "crap".

The only precise name is MDMA. In my opinion the name users give to what they're ingesting is inconsequential. Most users have no idea what MDMA means anyway, nor would they know how to chemically identify if what they bought might contain MDMA.

    The only precise name is MDMA.
There's also "Molly," which is the street name for pure MDMA. (Though: most terms are tied to a specific geography.)

    ...nor would they know how to chemically identify if what they
    bought might contain MDMA.
True, but that's not unique to MDMA users. cf. the heroin in the Netherlands that was being sold as cocaine, and which led to several deaths.

Organizations such as Dance Safe have existed for decades (in the U.S. at least) to let people test their drugs to make sure they're ingesting what they intend to ingest.

"Molly" is the US term for what should be MDMA but usually isn't; very often it's some mixture of random RCs: methylone, MDPV, and other alphabet drugs.

The UK terms are "Mandy", "mud", or "MD".

Ecstasy (which I have never ever heard used) refers to what should be MDMA in pill form along with some binders, and is usually referred to in the UK as "pills" or "Es". I believe the US prefers the term "rolls".

> the heroin in the Netherlands that was being sold as cocaine, and which led to several deaths.

How on earth can you buy coke and get heroin? No one buys coke and then rakes up a monster line; everyone does a dab test (lick a finger, stick it in the powder, and taste it).

Heroin tastes nothing like coke. Heroin sells for more, so why mix it in?

Ah yes, Molly. Heard that one too. The heroin is apparently still being sold as cocaine in Amsterdam. I saw posters about this on stores a couple of months ago :/

I wish researching the drugs you do and testing them as thoroughly as possible was more widespread. I have managed to drill it into most people I meet in that context. Most are actually receptive and, despite public assumptions, do care about their long-term health and what they put in their body. A small number are not. In one case I tested somebody's 'MDMA' for them and told them that I was certain that there was no MDMA in it at all, and they took it anyway.

    Most are actually receptive and, despite public assumptions,
    do care about their long-term health and what they put in their body.
That's where I don't agree. If they really cared, they wouldn't ingest unknown chemicals. Actions speak louder than words. The only reason they keep on doing it is because their experiences so far have been good on average, without too much drama. It's sad but sometimes even the death of a close one is not enough to stop people from blindly trusting street pushers.

I don't personally have a problem with MDMA and I would try it if I could be assured it was pure, with regulated dosage. Purifying street bought drugs and carefully measuring dosages is surprisingly easy and doesn't require a lot of money (FWIW, I'm a chemist).

The absurdity of the situation, in my opinion, is that people are willing to blindly trust some random (or even a well-known) pusher because the consequences of doing due diligence are harsher, in their minds. What I mean is that it's much easier and on average much less dangerous to just ingest a pill than to do due diligence and possibly get caught, sent to prison, etc.

The system works!

Gwern has a Patreon page now for anyone interested in supporting his research: https://www.patreon.com/gwern?ty=h

Gwern does more interesting things with lower monetary burn rate than most anyone I've met.

gwern, you are an absolute force of nature when it comes to generating and collecting and presenting information in a useful way. Thank you.

I hope he gets some good donations from HN; I'll send him a few dollars in BTC.

What amazing work! I am very interested in doing research with Tor and a dataset like this could make my job a heck of a lot easier. I have a legal question though: Are your scrapes text only? Before I work with this dataset, I want to make sure that there's no possibility it contains illegal images (child porn).

They are generally not text only. I feel that images are useful to allow browsing the markets as they were and may be highly valuable in their own right as research material, so I tried to collect images where applicable. (The forums usually did not support any kind of image upload other than avatars, so this is more relevant to the markets than forum scrapes.)

As far as CP goes, there should be essentially zero CP anywhere in the archive. DNM users almost universally loathe CP, and no market has ever dared to permit sales. (You may find this funny: CP is so taboo, on the DNMs like elsewhere, that it's been used in at least one attack - SR2's DoctorClu/Brian Farrell infamously attacked a rival market's forum by posting CP to it.)

> DNM users almost universally loathe CP, and no market has ever dared to permit sales.

These users are willing to do so many other illegal things, but the thought of being known as a pedophile or supporting pedophilia in any way, is completely abhorrent to them? Interesting datapoint.

> These users are willing to do so many other illegal things, but the thought of being known as a pedophile or supporting pedophilia in any way, is completely abhorrent to them?

I'm pretty sure the distinction is that voluntary transactions have no victims, and DNM folks care more about morality and ethics than legalities.

It is interesting, isn't it? You might expect there to be a 'general factor of criminality/antisociality/violence' akin to how we find a general factor of intelligence in psychological things, but as far as I can tell, drug use or sales seems to be largely orthogonal to other kinds of crimes - most of the DNM users will never buy credit card dumps and rip off retailers (carding is disliked by a lot of DNM users, although not enough to totally ostracize it like CP), most DNM users will never download CP, most DNM users will never beat or rape someone, etc. I have no issue buying and using illegal drugs, so I'm a criminal, but I'm not the same kind of criminal as, say, the Vallejo kidnappers (which BTW if you haven't read the complaint, it's an amazing read if you're into true-crime stories: http://www1.icsi.berkeley.edu/~nweaver/vallejo.pdf ).

This probably has a lot to do with why the War on Drugs has been such a failure and why legalizing does not seem to unleash crime waves.

just cause someone wants to get fucked up doesn't mean they want to fuck other people up too

Morality is not binary. That someone is fine with some things that are illegal does not mean that they are fine with everything that is illegal.

It's quite usual for career criminals to have a detailed and complicated worked-out ethical system, where that is a victimless crime but that is reprehensible.

The largest use of DNMs is for the sale of illegal recreational drugs. Regardless of your feelings on the matter, I hope that it's clear that crimes like that fall into a different ethical category than child pornography.

>It's quite usual for career criminals to have a detailed and complicated worked-out ethical system

Not even that complicated, pretty much summed up by "no women no children"

I think you will find that, apart from the paedophiles themselves, they are universally loathed. Probably more so by criminals.

The last place a paedophile wants to be is locked up in prison.

How about ascii art?

Actually, this is an interesting topic. Poisoning a dataset. CP would work for private security investigators, and to poison against government investigators you could use leaked classified secrets.

Could you work around this by operating on the files on a VPS you don't own, streaming a very low-res ('Basilisk'-proof - https://en.wikipedia.org/wiki/BLIT_(short_story) ) remote desktop image?

Possession laws are pretty strict and hard to decode. I wouldn't want to be the test case in court. The idea of "poisoning" a dataset is an interesting theoretical. But in practice, I just want to judge the likelihood that the dataset is poisoned by the presence of images. If it is then there's not much I can do with it.

Yes, this absolutely needs to be clarified by Gwern. This is a very dangerous thing to link researchers to if it contains any illegal content.

Nonsense. Gwern doesn't need to do anything for anyone.

It's an interesting issue, and a way investigators may be attacked, but it's their responsibility alone. There exists data. This is that data. The data may bite. Touch the data at your own risk.

Guess what, laws aren't universal! Unless gwern has a complete understanding of your jurisdiction and can somehow guess how you plan to use the data, he cannot know what is legal and what isn't. The burden lies on you.

Um.. It says Black Market right on the tin.

Indeed, a warning that it may contain illegal content would be about as sensible as the standard "Warning: may contain nuts" label on a tin of nuts.

What is illegal in a download of an online drug marketplace? Are pictures of drugs banned where you live?

It's a general black market, not just drugs. For example, one of the sites described on that page is PEDOFUNDING, "A crowdfunding site for child pornography." Now the dump isn't supposed to contain any images, but it's hard to be 100% sure. In any case, whatever risk there might be seems to be clearly implied in the name and description there.

FreeeOW, that's what I get for skimming the list I guess.

Um... Dark market

Does ASCII art have victims? Other than its audience, I mean.

Technically it could be translated from more traditional pictures. (Instead of being an original creation.)

Lower down on the page, he says he did scrape at least one site with such images, although he specifically only took text. Can't verify that this was the case for all scraped sites.

This is why I love the internet. This article has given me a fascinating glimpse into a world I have no idea about.

Author: thank you so much for taking the time to document this.

Unrelated to the darknet, but this Twitter account evoked a very similar feeling in me (random snippets from userboards in the 80's and early 90's):


"I heard about the Apple Watch recently and was going to check it out—but not now. It can't even transmit or input data. ☯93JAN"


Some of them are a bit less clairvoyant about the future:

"The actual date for the end of the world is July 5, 1998. ☯92NOV"

It is interesting how many were accurate or onto something. I just didn't expect it.

edit: Found this gem after wasting more time than I cared to.

"I think the future of personal communications holds great things in store for us, but privacy won't be one those things. ☯94AUG"

July 5, 1998 is "X-Day", from the Church of Subgenius. It actually happened.


Some of the quotes are from before I was even born, so that is just one of the references I didn't know; I just assumed it was a date thrown out for no reason.

This is true.

Love the "See also:" Heaven's Gate, a real-life cult with a similar concept of "X-Day"

what a great twitter account. thanks for bringing this to my attention.

You should take the time to read his other posts. Gwern is one of the most fascinating people/websites I know of.

Thanks, I will read at least another by him.

This is so cool. Thanks gwern.

If someone's feeling bored, you're welcome to put the entire archive on a web server for us to look at......

Or maybe I'll just do it.

I don't think that works. It's not remotely browsable or searchable. It would be quite challenging to put these scrapes up, anyway. They're regular wget crawls with a regular directory/file structure, the problem is that there's so much material and so many files that it can be almost impossible to find what you are looking for... (Plus you need to rewrite links into relative links to make everything render properly.)
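
To make the link-rewriting step concrete, here is a rough sketch of turning same-site absolute links in an already-downloaded page into relative ones. It assumes a wget-style mirror where directory URLs were saved as `index.html`; the function name `relativize_links` and the host `abc.onion` are made up for illustration, links with query strings are simply left alone, and a regex is only a stand-in for a real HTML parser. (wget's own `--convert-links` option does this kind of conversion at crawl time; this is for fixing up a mirror after the fact.)

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches href="..." or src="..." with single or double quotes.
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def relativize_links(html: str, page_path: str, site_host: str) -> str:
    """Rewrite links to `site_host` so they are relative to `page_path`.

    `page_path` is the page's own location inside the mirror,
    e.g. "forum/topic/12.html".
    """
    page_dir = posixpath.dirname("/" + page_path.lstrip("/"))

    def rewrite(match: "re.Match[str]") -> str:
        url = match.group("url")
        parts = urlsplit(url)
        same_site = parts.netloc == site_host
        root_relative = not parts.netloc and url.startswith("/")
        # Leave external links, already-relative links, and anything
        # with a query string untouched.
        if parts.query or not (same_site or root_relative):
            return match.group(0)
        target = parts.path or "/"
        if target.endswith("/"):
            # Assumption: directory URLs were saved as index.html.
            target += "index.html"
        rel = posixpath.relpath(target, start=page_dir)
        if parts.fragment:
            rel += "#" + parts.fragment
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{rel}{quote}'

    return ATTR_RE.sub(rewrite, html)


# Hypothetical page from a mirror of the made-up host abc.onion.
page = (
    '<a href="http://abc.onion/forum/topic/13.html#p2">next</a>'
    '<img src="/img/logo.png">'
    '<a href="https://example.com/x">elsewhere</a>'
)
print(relativize_links(page, "forum/topic/12.html", "abc.onion"))
```

This only covers the rendering problem; it does nothing for the harder problem of finding anything among that many files.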

Hmm. Now I'm thinking that I might end up using your idea (scraping the dark web) and using something like httrack[0] to do exactly that: structure.

[0] https://en.wikipedia.org/wiki/HTTrack

I once tried using HTTrack, but I found it was doing too much magic under the hood and was hard to work with. As dumb as wget is (that blacklist bug is over 12 years old now!), it at least is understandable.

Thanks for saving me the headache :)

What is the legality with respect to downloading this file? Could it contain material that would put us at legal risk?

I really don't know the answer. But I just want to remark that it seems to be a terrible situation that more than one of us have to wonder about the legality of downloading the dataset (there was a comment thread below on the same topic)

Since when would it be illegal to possess statistics? Are we afraid that dark markets might claim proprietary ownership of that data?

I wouldn't download it unless you want a visit from the feds.

>collating and creating these scrapes has absorbed an enormous amount of my time & energy due to the need to solve CAPTCHAs,...

Have you considered automating this?

I did, but the problem is that I never expected to be scraping for so long and it was always easier to just solve by hand than do a complete rewrite and allow for using CAPTCHA libraries. If I had known I would be scraping for ~3 years, I would have done many things differently: http://www.gwern.net/Black-market%20archives#how-to-crawl-ma...

The whole point of CAPTCHAs is to be difficult to automate. Or are you suggesting automating farming out the CAPTCHA solutions to cheap workers?

There are plenty of service providers that sell API access to CAPTCHA-solving services at reasonable prices.

Or if you don't want to spend money, you can always re-host things on your own site. Let your visitors do the work for you.

I assumed it was a joke. But maybe not?

There are plenty of services that do exactly that. They're pretty cheap.

The main problem is that most CAPTCHAs are terrible, and do a better job of keeping humans out than robots.

"HOW TO CRAWL MARKETS" section has good tips for general crawling as well.

you could also use a library to handle captchas. have a look at 'tesseract'
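
A rough sketch of what that could look like, shelling out to the Tesseract command-line tool from Python. It assumes Tesseract 4 or later is installed and on the PATH (older releases spelled the option `-psm`), and the helper names and `captcha.png` are hypothetical. Plain OCR like this is only likely to work on simple, undistorted CAPTCHAs.

```python
import shutil
import subprocess
from typing import List


def tesseract_command(image_path: str, psm: int = 7) -> List[str]:
    """Build a Tesseract CLI call that prints recognized text to stdout.

    Page segmentation mode 7 treats the image as a single line of text.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_captcha(image_path: str) -> str:
    """Run Tesseract on one image and return its best guess at the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical file name; only prints the command that would be run.
    print(" ".join(tesseract_command("captcha.png")))
```

In a crawler, a wrong guess would still have to fall back to a human or a paid solving service.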

