
1.5 TB of Dark Net Market scrapes - gwern
http://www.gwern.net/Black-market%20archives
======
rcpt
I've had a very slow-moving hobby project to parse and analyze a subset of
this data:
[https://github.com/rcompton/black-market-recommender-systems](https://github.com/rcompton/black-market-recommender-systems)

So far I've had some OK results along the lines of "91.7% of vendors who sold
speed and MDMA also sold ecstasy":
[http://ryancompton.net/2015/03/24/darknet-market-basket-anal...](http://ryancompton.net/2015/03/24/darknet-market-basket-analysis/)
I am now working on extending this to markets besides Evolution.
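
For readers unfamiliar with the metric: a figure like that is the "confidence"
of an association rule, i.e. of the vendors who sold everything on the left-hand
side, the fraction who also sold the right-hand side. A minimal sketch, assuming
each vendor is reduced to a set of product categories. The vendor names and
data below are made up for illustration and are not taken from the dataset:

```python
def rule_confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    baskets: iterable of item collections (one per vendor).
    Returns None when no basket contains the antecedent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    # Keep only baskets that contain every antecedent item.
    matching = [set(b) for b in baskets if antecedent <= set(b)]
    if not matching:
        return None
    # Of those, count the baskets that also contain every consequent item.
    hits = sum(1 for b in matching if consequent <= b)
    return hits / len(matching)


# Hypothetical toy data: vendor -> categories sold.
vendors = {
    "vendor_a": {"speed", "mdma", "ecstasy"},
    "vendor_b": {"speed", "mdma", "ecstasy", "lsd"},
    "vendor_c": {"speed", "mdma"},
    "vendor_d": {"cannabis"},
    "vendor_e": {"mdma", "ecstasy"},
}

conf = rule_confidence(vendors.values(), {"speed", "mdma"}, {"ecstasy"})
print(f"{conf:.1%} of vendors who sold speed and MDMA also sold ecstasy")
```

On this toy data, three vendors sell both speed and MDMA and two of them also
sell ecstasy, so the rule's confidence is 2/3.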

~~~
ez123
Just FYI, MDMA is ecstasy.

~~~
chm
I don't do MDMA personally but I have some friends who like going to raves
with some chemical supplies. On the streets, some make it a point of honor to
sell/buy only "MDMA" because "ecstasy" is "crap".

The only precise name is MDMA. In my opinion the name users give to what
they're ingesting is inconsequential. Most users have no idea what MDMA means
anyway, nor would they know how to chemically identify if what they bought
_might contain_ MDMA.

~~~
WalterGR

        The only precise name is MDMA.
    

There's also "Molly," which is the street name for pure MDMA. (Though: most
terms are tied to a specific geography.)

    
    
        ...nor would they know how to chemically identify if what they
        bought might contain MDMA.
    

True, but that's not unique to MDMA users. Cf. the heroin in the Netherlands
that was being sold as cocaine, and which led to several deaths.

Organizations such as DanceSafe have existed for decades (in the U.S. at
least) to let people test their drugs to make sure they're ingesting what they
intend to ingest.

~~~
user_0001
"Molly" is the US term for what should be MDMA (but usually isn't: very often
it is some mixture of random RCs such as methylone, MDPV and other alphabet
drugs).

The UK term is "Mandy", "mud" or "MD".

Ecstasy (a term I have never heard used) refers to what should be MDMA in
pill form along with some binders, and is usually referred to in the UK as
"pills" or "Es". I believe the US prefers the term "rolls".

>the heroin in the Netherlands that was being sold as cocaine, and which led
to several deaths.

How on earth can you buy coke and get heroin? No one buys coke and then rakes
up a monster line; everyone does a dab test (lick a finger, stick it in the
powder and taste it).

Heroin tastes nothing like coke. Heroin sells for more so why mix it in?

------
Moshe_Silnorin
Gwern has a Patreon page now for anyone interested in supporting his research:
[https://www.patreon.com/gwern?ty=h](https://www.patreon.com/gwern?ty=h)

~~~
mapt
Gwern does more interesting things with lower monetary burn rate than most
anyone I've met.

------
b6
gwern, you are an absolute force of nature when it comes to generating and
collecting and presenting information in a useful way. Thank you.

~~~
icpmacdo
I hope he gets some good donations from HN. I'll send him a few dollars in
BTC.

------
curiousg
What amazing work! I am very interested in doing research with Tor and a
dataset like this could make my job a heck of a lot easier. I have a legal
question though: Are your scrapes text only? Before I work with this dataset,
I want to make sure that there's no possibility it contains illegal images
(child porn).

~~~
fineman
How about ascii art?

Actually, this is an interesting topic. Poisoning a dataset. CP would work for
private security investigators, and to poison against government investigators
you could use leaked classified secrets.

Could you work around this by operating on the files on a VPS you don't own,
streaming a very low-res ('Basilisk'-proof -
[https://en.wikipedia.org/wiki/BLIT_(short_story)](https://en.wikipedia.org/wiki/BLIT_\(short_story\))
) remote desktop image?

~~~
curiousg
Possession laws are pretty strict and hard to decode. I wouldn't want to be
the test case in court. The idea of "poisoning" a dataset is an interesting
theoretical. But in practice, I just want to judge the likelihood that the
dataset is poisoned by the presence of images. If it is then there's not much
I can do with it.

~~~
setuptools
Yes, this absolutely needs to be clarified by Gwern. This is a very dangerous
thing to link researchers to if it contains any illegal content.

~~~
tomcam
Um.. It says Black Market right on the tin.

~~~
mikeash
Indeed, a warning that it may contain illegal content would be about as
sensible as the standard "Warning: may contain nuts" label on a tin of nuts.

~~~
tedks
What is illegal in a download of an online drug marketplace? Are pictures of
drugs banned where you live?

~~~
mikeash
It's a general black market, not just drugs. For example, one of the sites
described on that page is PEDOFUNDING, "A crowdfunding site for child
pornography." Now the dump isn't supposed to contain any images, but it's hard
to be 100% sure. In any case, whatever risk there might be seems to be clearly
implied in the name and description there.

~~~
tedks
FreeeOW, that's what I get for skimming the list I guess.

------
branchless
This is why I love the internet. This article has given me a fascinating
glimpse into a world I have no idea about.

Author: thank you so much for taking the time to document this.

~~~
jmduke
Unrelated to the darknet, but this Twitter account evoked a very similar
feeling in me (random snippets from userboards in the 80's and early 90's):

[https://twitter.com/wwwtxt](https://twitter.com/wwwtxt)

~~~
WalterGR
"I heard about the Apple Watch recently and was going to check it out—but not
now. It can't even transmit or input data. ☯93JAN"

Prescient.

~~~
hfsktr
Some of them are a bit less clairvoyant about the future:

"The actual date for the end of the world is July 5, 1998. ☯92NOV"

It is interesting how many were accurate or onto something. I just didn't
expect it.

edit: Found this gem after wasting more time than I cared to.

"I think the future of personal communications holds great things in store for
us, but privacy won't be one those things. ☯94AUG"

~~~
tripzilch
July 5, 1998 is "X-Day", from the Church of the SubGenius. It actually happened.

[https://en.wikipedia.org/wiki/X-Day_%28Church_of_the_SubGeni...](https://en.wikipedia.org/wiki/X-Day_%28Church_of_the_SubGenius%29)

~~~
hfsktr
Some of the quotes are from before I was even born so that is just one of the
references I don't know and just assumed it was a date thrown out for no
reason.

------
joshmn
This is so cool. Thanks gwern.

If someone's feeling bored, you're welcome to put the entire archive on a web
server for us to look at...

Or maybe I'll just do it.

~~~
fractalcat
You mean like this?
[https://archive.org/download/dnmarchives](https://archive.org/download/dnmarchives)

~~~
gwern
I don't think that works. It's not remotely browsable or searchable. It would
be quite challenging to put these scrapes up, anyway. They're regular wget
crawls with a regular directory/file structure; the problem is that there's so
much material and so many files that it can be almost impossible to find what
you are looking for... (Plus you need to rewrite links into relative links to
make everything render properly.)
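
The link rewriting mentioned above can be sketched roughly as follows. This is
a minimal illustration and not the process actually used for these archives:
the function name `relativize_links`, the host `example.onion`, and the page
paths are all hypothetical, and a regex is a crude stand-in for a real HTML
parser.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches double-quoted href/src attributes only; real pages need a proper parser.
ATTR_RE = re.compile(r'\b(?P<attr>href|src)="(?P<url>[^"]*)"', re.IGNORECASE)


def relativize_links(html, page_path, site_host):
    """Rewrite same-site absolute links so they resolve inside a local mirror.

    page_path: location of this page within the mirror, e.g. "listing/123/index.html".
    site_host: hostname of the crawled site; links to other hosts are left alone.
    """
    page_dir = posixpath.dirname("/" + page_path.lstrip("/"))

    def fix(match):
        parts = urlsplit(match.group("url"))
        if parts.netloc and parts.netloc != site_host:
            return match.group(0)  # external link: leave untouched
        if not parts.path.startswith("/"):
            return match.group(0)  # already relative, or a bare fragment/mailto
        # Express the target relative to the directory holding this page.
        rel = posixpath.relpath(parts.path, start=page_dir)
        if parts.query:
            rel += "?" + parts.query
        if parts.fragment:
            rel += "#" + parts.fragment
        return f'{match.group("attr")}="{rel}"'

    return ATTR_RE.sub(fix, html)


sample = '<a href="/vendor/bob.html">bob</a> <img src="http://example.onion/img/x.png">'
print(relativize_links(sample, "listing/123/index.html", "example.onion"))
```

With the page two directories deep, both same-site links come out prefixed
with `../../`, which is what lets the mirror render from a plain directory tree.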

~~~
joshmn
Hmm. Now I'm thinking that I might end up using your idea (scraping the dark
web) and using something like httrack[0] to do exactly that: structure.

[0]
[https://en.wikipedia.org/wiki/HTTrack](https://en.wikipedia.org/wiki/HTTrack)

~~~
gwern
I once tried using HTTrack, but I found it was doing too much magic under the
hood and was hard to work with. As dumb as wget is (that blacklist bug is over
12 years old now!), it at least is understandable.

~~~
joshmn
Thanks for saving me the headache :)

------
stevewepay
What is the legality with respect to downloading this file? Could it contain
material that would put us at legal risk?

~~~
NhanH
I really don't know the answer. But I just want to remark that it seems to be
a terrible situation that more than one of us has to wonder about the
legality of downloading the dataset (there was a comment thread below on the
same topic).

------
ryanlol
>collating and creating these scrapes has absorbed an enormous amount of my
time & energy due to the need to solve CAPTCHAs,...

Have you considered automating this?

~~~
smeyer
The whole point of CAPTCHAs is to be difficult to automate. Or are you
suggesting automating farming out the CAPTCHA solutions to cheap workers?

~~~
ryanlol
There are plenty of service providers that sell APIs to CAPTCHA-solving
services at reasonable prices.

~~~
amenghra
Or if you don't want to spend money, you can always re-host things on your own
site. Let your visitors do the work for you.

------
SergeyHack
"HOW TO CRAWL MARKETS" section has good tips for general crawling as well.

------
acosmism
You could also use a library to handle CAPTCHAs. Have a look at 'tesseract'.
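
A minimal sketch of what that might look like, assuming the `tesseract`
command-line tool is installed and on the PATH. The helper names
`tesseract_command` and `ocr_image` and the file name `captcha.png` are
hypothetical. Tesseract is a general-purpose OCR engine, so it may well fail on
CAPTCHAs that are deliberately distorted:

```python
import subprocess


def tesseract_command(image_path, psm=7):
    """Build the tesseract CLI invocation.

    "stdout" as the output base makes tesseract print the recognized text
    instead of writing a file; page segmentation mode 7 treats the image as a
    single line of text. (Tesseract 3.x spells the flag "-psm".)
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_image(image_path, psm=7):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout.strip()


# Usage (requires tesseract and an image file to exist):
# print(ocr_image("captcha.png"))
```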

