Dataset of classified screenshots from Tor hidden services (circl.lu)
117 points by adulau 6 days ago | 48 comments

I was reviewing the graph and was surprised at the rankings of things. Frankly, there are a lot more "not illegal" things going on in Tor than I would have expected. I expected a reasonable amount, but not such a large share of the total[0].

It's a double-edged sword in a way -- more people using the project for doing non-illegal things means perhaps the moniker "Dark" could be replaced with a less hostile word (Private? ...no...)[1].

It's bad in that it suggests users are making inconvenient choices because they feel their privacy is threatened enough to warrant it.

I was glad to see that they scrubbed the content -- would have been one scary tarball to download, otherwise, and very illegal for them to post considering some of those categories. It makes me wonder how they avoided running afoul of the law just collecting/viewing it for the purpose of scrubbing it, but IANAL. It'd keep me from being involved on any project like that, though!

[0] I'm eye-balling the counts for a few categories in addition to the obvious "other-not-illegal", so it's probably not 100%, and I'm assuming US law (specifically WRT speech)

[1] I get "Dark" is meant as "invisible" and it's perhaps even a more accurate description since it's not completely "invisible" unless you take additional steps (and the software contains no known flaws), but sufficiently invisible for most adversaries. However, it's spoken in the same breath as "drug/theft markets" and is used more like "dark alley". Not sure there's a better term and I'm probably bike-shedding, anyway.

The graph showing the distribution was built from only a subset of the pictures. Not all of them are classified yet, so these stats come from only a sample of this (quite big) dataset. Second, the stats were computed BEFORE removal of all "bad pictures", which means that what you have in the folder does not have exactly the same ratios/statistics.

Concerning "other-not-illegal": if the website matched none of the other labels AND was not illegal, it was usually tagged as "other-not-illegal". For instance, personal websites are tagged as "legitimate", and a Tor wiki is tagged as "other-not-illegal" and "wiki". General information about Tor, websites offering online calculations (hashing tools, calculators, ...) and online games (without money involved) are labelled "other-not-illegal".

Anything related to finance - even if it can be legal - is not labelled as "other-not-illegal", for example.

> It makes me wonder how they avoided running afoul of the law just collecting/viewing it for the purpose of scrubbing it, but IANAL. It'd keep me from being involved on any project like that, though!

CIRCL is the Computer Emergency Response Team of Luxembourg. They ̶a̶r̶e̶ work very closely with the law.

Disclaimer: I'm very much a layperson here.

It wasn't obvious to me from the article how or if this dataset is statistically significant in some way in regards to total tor traffic?

I believe the tor project publishes anonymized data sets of tor usage, and if I recall correctly the web site handling the most traffic over tor is Facebook.

If you look at the metrics for onion services traffic [0] and total network traffic [1] you'll see that this report is looking at the roughly 1% of tor traffic that is going to onion services. The overwhelming majority of traffic is to the regular, "clear" internet.

[0] https://metrics.torproject.org/hidserv-rend-relayed-cells.ht...

[1] https://metrics.torproject.org/bandwidth-flags.html

I couldn't agree more. I run a number of hidden services on Tor. That's because for every website I create, it's a simple thing to make it a Tor hidden service too. My ham radio and science hobby websites are in no way "dark". I link to them (and vice versa) from my clearweb sites and host from my home connection/IP.

Plus on Tor you actually own your domain rather than lease it on the whim of some company easily pressured by political and social winds.

Interesting. I've never thought to host my web sites on Tor. I don't have very many that are strictly "mine" but that'd be something interesting to do if only to understand a little more about Tor.

How do you own it?

Your domain is a predictable derivative of your hidden node's key, which is randomly generated. It's cryptographically impossible for the same domain to be created more than once.
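For anyone wondering what "predictable derivative" looks like in practice, here is a minimal sketch of the older 16-character (v2) scheme: the name is the base32 encoding of the first 80 bits of the SHA-1 hash of the DER-encoded public key. The function name `onion_v2_address` and the `fake_key` bytes are hypothetical; a real service would hash its actual RSA public key, and newer v3 addresses encode the key differently.

```python
import base64
import hashlib


def onion_v2_address(public_key_der: bytes) -> str:
    """Derive a v2-style .onion name from DER-encoded public key bytes."""
    digest = hashlib.sha1(public_key_der).digest()  # 20-byte SHA-1 digest
    truncated = digest[:10]                         # keep the first 80 bits
    label = base64.b32encode(truncated).decode("ascii").lower()
    return label + ".onion"


# Placeholder bytes standing in for a real DER-encoded RSA public key
# (assumption for illustration only).
fake_key = b"not a real key, just example bytes"
addr = onion_v2_address(fake_key)
print(addr)
```

The same key bytes always give the same name, and nobody hands the name out: whoever holds the matching private key can serve it.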

Yup. Further, you can generate billions of keys until you find one where the public key matches some sequence you define beforehand. I brute forced one that starts with "superkuh". It only takes some tens of minutes to an hour or two on a GPU from 2010.

As long as you keep the private keys private you get traffic to that public key and "own" the domain.
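A rough sketch of that brute-force search, under loud assumptions: random bytes stand in for freshly generated public keys (real tools generate actual RSA keys, which is far slower), and the helper names `address_for`, `expected_tries` and `find_vanity` are made up for illustration. Each base32 character has 32 possible values, so a prefix of n characters takes about 32^n candidates on average; for an 8-character prefix like "superkuh" that is 32^8, roughly 1.1 × 10^12.

```python
import base64
import hashlib
import random


def address_for(key_bytes: bytes) -> str:
    """v2-style label: base32 of the first 80 bits of SHA-1(key bytes)."""
    digest = hashlib.sha1(key_bytes).digest()[:10]
    return base64.b32encode(digest).decode("ascii").lower()


def expected_tries(prefix: str) -> int:
    """Average number of candidates needed to match a base32 prefix."""
    return 32 ** len(prefix)


def find_vanity(prefix: str, rng: random.Random, max_tries: int = 1_000_000):
    """Try random stand-in keys until the label starts with `prefix`."""
    for attempt in range(1, max_tries + 1):
        # Stand-in for generating a new key pair (assumption, not a real key).
        candidate = bytes(rng.getrandbits(8) for _ in range(32))
        label = address_for(candidate)
        if label.startswith(prefix):
            return candidate, label, attempt
    return None  # gave up within max_tries


rng = random.Random(0)            # seeded so the run is repeatable
result = find_vanity("hn", rng)   # two characters: about 1024 tries on average
if result is not None:
    _, label, attempt = result
    print(f"{label}.onion after {attempt} tries")
print(expected_tries("superkuh"))
```

Only short prefixes are practical in pure Python; the long ones are where the GPU comes in.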

Not impossible, just unlikely. Collisions can exist in theory in any hashing algorithm (though it may be a 2^2048 large address space or something)

Which is what "cryptographically impossible" means. Even with all the computation on the planet for a million years you can't get a glimmer of a chance.

You generate a public/private keypair for your service. Your domain is a short hash of your public key. Whoever owns the private key owns the domain name.

Selling stuff on darknet markets must be lucrative just because one or two researchers always have to buy it.

Feel free to check how much money the Bitcoin addresses you can find online have received... and you'll figure it out :)

I'm surprised to see finance pop up so high in the list... What's going on?

Crypto currency pump and dump schemes for the most part.

Not to mention mixers, credit-card sellers, PayPal-related schemes, crypto wallets, escrows ...

I would strongly advise anyone interested in the label frequencies of the dataset to take a look at the JSON file provided. The interesting thing is not so much the frequency of label X or Y as the frequency of each set of labels (this would actually be a great addition to the webpage).

Pictures can have multiple labels, so the ratio of "Forum + Drugs + Finance" vs "Market-place + Weapons" conveys more information than just the global frequency of "Finance"-related pages :)
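Counting label sets rather than single labels could be sketched like this. The record layout (`file`, `labels`) and the sample entries are invented for illustration and are not the dataset's actual JSON schema, so the field names would need adjusting after inspecting the real file.

```python
from collections import Counter

# Hypothetical records; the real JSON file may be structured differently.
records = [
    {"file": "a.png", "labels": ["forum", "drugs", "finance"]},
    {"file": "b.png", "labels": ["finance", "forum", "drugs"]},
    {"file": "c.png", "labels": ["market-place", "weapons"]},
    {"file": "d.png", "labels": ["finance"]},
    {"file": "e.png", "labels": ["market-place", "finance"]},
]


def label_set_counts(records):
    """Count how often each combination of labels occurs, ignoring order."""
    return Counter(frozenset(r["labels"]) for r in records)


def single_label_counts(records):
    """Count how often each individual label occurs across all pictures."""
    return Counter(label for r in records for label in r["labels"])


set_counts = label_set_counts(records)
label_counts = single_label_counts(records)

for labels, count in set_counts.most_common():
    print(" + ".join(sorted(labels)), count)
```

Using `frozenset` as the key means "forum, drugs, finance" and "finance, forum, drugs" count as the same combination, which is the point of comparing sets instead of the global "finance" total.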

> We also manually removed pictures which were identified as containing harmful content, such as violent, offensive, obscene or equivalent undesirable pictures which may shock anyone.


Very true. Most of the finance or market-place websites are scams. Not all of them are labeled as such (there is a SCAM label, but we can't truly test every website by ordering the stuff), but clearly some of them were scams. In particular techno-stuff (iPhone etc.) marketplaces are usually scams.

> In particular techno-stuff (iPhone etc.) marketplaces are usually scams.

Usually? Always. There are plenty of legitimate services that'll fence such items for you, but they'll just sell them very near retail on amazon/ebay/whatever.

> 43 dark-web:motivation="religious"

That's interesting.

Unless it's some cult requiring the sacrifice of humans, I guess maybe it's for a religious group in an oppressive country.

Religion seems like a topic where there are a lot of sincere beliefs and questions, but also a lot of social signalling and pressure within a community to adhere to specific beliefs.

Suppose you are a member of a congregation who questions a specific doctrine but you don't want to commit to opposing it or signalling that you are unreliable? Suppose you are a member of the clergy questioning your faith, and not satisfied with the discussions you have had with people of higher rank within your religious order.

I think there is significant value in anonymous forums. For example, the debate over ratifying the US Constitution would probably have been far less productive without anonymous and pseudonymous writing such as the Federalist Papers, where proposals carried neither the benefit of signalling allegiance to interest groups nor the disadvantage of signalling the opposite.

A great example today is the difference between the content on Quora and the content found on Hacker News. Just like a cover letter attached to a resume, a post on Quora may be truthful or interesting, but it is also inextricably linked to the poster's name and always suspected of being primarily motivated by the effect it has on the poster's standing in the real world. Here, there are varying degrees of anonymity, and posts are more likely to be motivated by a sincere interest in exploring a topic or advocating an opinion than by what making such a statement says about the individual saying it.

> Suppose you are a member of a congregation who questions a specific doctrine but you don't want to commit to opposing it or signalling that you are unreliable?

For a real and current example, see the recent EFF vs. Watchtower case where a Reddit user /u/darkspilver posted to an ex-Jehovah's Witness subreddit. Watchtower subpoenaed Reddit for the user's IP address so they could excommunicate the user. If they'd communicated over Tor, this would be less of a problem.

There's a lot of just... stuff hosted on .onion.

Several of the images tagged with only 'religious' are literally just hosting mirrors of the King James Bible from htmlbible.com.

Very true. Most of them are just biblical extracts. I haven't seen any "sect-related" content.

Be very careful about such projects on Tor. There are plenty of images within Tor the possession of which can land you in prison and destroy your life. I would hesitate to click any link to such a project without first examining how they dealt with that issue.

True. But this dataset is "safe". I can assure you that you could download it and show it to any audience without much concern about shocking anyone (maybe one or two pictures are not friendly for everyone, but nothing as bad as what you could encounter on the "true Tor"). That's precisely one use of this dataset: showing what is on Tor without your heart rate hitting 200 because you don't know which page you'll land on next.

This dataset does not include any such images.

It does say they filtered the dataset first “picture by picture” to remove violent or offensive material.

Come on, this is a project from Luxembourg's CERT. If it were some random anon blogger, I would not even open the main page of such a project without going through Tor.

It took me a minute to realize that "classified" in the headline means labeled, not secret.

Sorry, I wanted to change it after I pushed the submit button, but it was too late.

Maybe @dang can help, here? :)

I disagree that there is something that needs to be changed here. "Dataset of classified X" is commonly used to refer to this exact thing. (i.e. dataset of data points with human-made labels)

It's ambiguous, that's the problem. My first reading was it was a dataset of government information that had leaked on Tor (despite having a computer science background - that's not necessarily the context on HN). Even if it's accurate, changing it to something that is less ambiguous is useful. "Labelled dataset of screenshots" would remove the ambiguity, and it uses fairly standard terminology.

But the word has multiple meanings and both interpretations are equally plausible. Personally I also thought the title was about secrets. Changing from classified to labeled will remove ambiguity completely.

It's the ordering here:

"... classified screenshots ..." would usually be understood in en-gb to be "screenshots given a security rating requiring restricted access".

"... screenshots classified ..." would be "screenshots given a classification of some sort".

I disagree - the former would be used in both cases in my opinion. I think it just depends which usage you use more.

It is used in both situations when context makes it clear; there's no [or very little, at least] context in a title. Ordering disambiguates the intended meaning when context is insufficient.

Disagree; "classified" is normally used before the noun, not after, when referring to government classification. E.g. it's "classified material" not "material that's classified".

Erm, that's what I did. You said disagree and then repeated my first point.

I think something like "categorization" would be less ambiguous though

But "classification" is the terminology. "Categorization" would be ok if we go back X many years in CS/Stats literature and rename it as "categorization". I think "categorized" would be more confusing here. In this case, I would compromise to something like "Dataset of labeled screenshots from [...]".

HN is not an audience of CS and Stats academics, or those familiar with the literature in those fields.

You don't need to be an academic to be familiar with a discipline's terminology. If you keep reading articles like OP, at some point you'll learn that by convention people use the word "classification" for things like this.

But on HN, the more common usage is government classification, so headlines should be based on that precedent.
