Second, to anyone crawling hidden services or crawling over Tor: please run a relay or decrease your hop count. Don't sacrifice others' desperate need for anonymity for your $whatever_purpose_thats_probably_not_important. It might be a fun thing for you to do, but some people are relying on Tor to use the free, secure, and anonymous Internet.
People who actually need anonymity need to hide among traffic that is boring. If you reduce the number of hops your crawler is using, you're reducing the amount of boring traffic and making it easier to find the interesting people.
Running a relay in addition to using Tor in the normal way is a good idea, however, as it increases the bandwidth of the network.
But if the received wisdom becomes "if you're not rebelling against an oppressive regime, you should only be using 1 hop" then the advice has real harmful effects.
Onion as in "layered" services.
The Tor Project recently added a consensus flag which can globally disable single hop client connections as a DDoS mitigation approach. It is currently enabled. (DoSRefuseSingleHopClientRendezvous)
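For relay operators this also surfaces as a torrc option of the same name; a minimal sketch of the relevant lines (defaults shown; "auto" means the consensus parameter decides, which is what makes the network-wide switch possible):

    # torrc, relay side. "auto" defers to the consensus parameter,
    # which is how the global switch described above works.
    DoSRefuseSingleHopClientRendezvous auto
    # Related options shipped with the same DoS-mitigation work:
    DoSCircuitCreationEnabled auto
    DoSConnectionEnabled auto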
For the uninitiated, can you please explain the differences in what they are and how they're accessed?
The differences are explained in the post. The dark web is a vast group of services that cannot be accessed without special software or a proxy.
Hidden services are services running on the Tor network, accessed using a browser that uses the Tor proxy.
They are a type of dark web service, but not the entirety.
This takes me back to the 1990s... "group of services that cannot be accessed without special software" definitely matches NNTP, FTP, SMTP, SSH, HTTP, etc.
> The dark web is a vast group of services that cannot be accessed without special software or a proxy.
Can you name a few examples?
From the Tor blog:
> A new relay, assuming it is reliable and has plenty of bandwidth, goes through four phases: the unmeasured phase (days 0-3), where it gets roughly no use; the remote-measurement phase (days 3-8), where load starts to increase; the ramp-up guard phase (days 8-68), where load counterintuitively drops and then rises higher; and the steady-state guard phase (days 68+).
That's true, you never want to rebuild the circuit. But it strikes me that the idea that this is avoidable falls into at least two of the Eight Fallacies of Distributed Computing, namely "The Network Is Reliable" and "Topology Doesn't Change".
If we instead assume that the network isn't reliable, and topology does change, then instead of eliminating unreliable nodes and being conservative with changes to the topology, we would focus on reducing the costs of rebuilding a circuit so that network unreliability and topology changes aren't disastrous.
But it sounds like the Tor team has instead decided to bolster these assumptions, to make them less of assumptions; trying to make the network as reliable as possible and trying to make the topology change as little as possible.
I don't mean this to be a harsh criticism of the Tor team. I'm an outsider, and beyond an uncompromising privacy constraint, I don't know all the constraints Tor was built under. I'm sure the tradeoffs made by the Tor team make sense within the context of their constraints. Obviously, the Tor network works well enough to have a large user base, so they have provided a good-enough solution.
But I wonder if changes could be made to Tor's design in the future which would allow quicker adding and removing nodes, and handle network reliability issues better, so that Tor would be faster.
One possibility that stands out to me is to pool circuits and load-balance between them, so that if a circuit begins to have issues, you're still connected along the other circuits while you build a replacement for the unreliable one. This could run into issues where an adversary correlates traffic from different circuits to unmask clients, so you'd have to be careful, but I'm not sure those problems would be insurmountable.
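To make the pooling idea concrete, here's a rough Go sketch (every name here is hypothetical; none of this is a real Tor API): requests round-robin across a handful of pre-built circuits, and a failed circuit is dropped immediately while a replacement is built in the background, so callers never block on a rebuild.

    // Hypothetical sketch of circuit pooling. Circuit and buildCircuit
    // are stand-ins for the expensive 3-hop build, not real Tor APIs.
    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "sync"
        "time"
    )

    type Circuit struct{ id int }

    // buildCircuit stands in for the slow part we want off the hot path.
    func buildCircuit(id int) (*Circuit, error) {
        time.Sleep(100 * time.Millisecond) // pretend this is expensive
        return &Circuit{id: id}, nil
    }

    type Pool struct {
        mu       sync.Mutex
        circuits []*Circuit
        next     int
        nextID   int
    }

    func NewPool(size int) (*Pool, error) {
        p := &Pool{}
        for i := 0; i < size; i++ {
            c, err := buildCircuit(p.nextID)
            if err != nil {
                return nil, err
            }
            p.nextID++
            p.circuits = append(p.circuits, c)
        }
        return p, nil
    }

    // Get hands out circuits round-robin across everything still healthy.
    func (p *Pool) Get() (*Circuit, error) {
        p.mu.Lock()
        defer p.mu.Unlock()
        if len(p.circuits) == 0 {
            return nil, errors.New("no healthy circuits")
        }
        c := p.circuits[p.next%len(p.circuits)]
        p.next++
        return c, nil
    }

    // Fail drops a bad circuit now and rebuilds a replacement in the
    // background, so callers keep flowing over the remaining circuits.
    func (p *Pool) Fail(bad *Circuit) {
        p.mu.Lock()
        for i, c := range p.circuits {
            if c == bad {
                p.circuits = append(p.circuits[:i], p.circuits[i+1:]...)
                break
            }
        }
        id := p.nextID
        p.nextID++
        p.mu.Unlock()
        go func() {
            if c, err := buildCircuit(id); err == nil {
                p.mu.Lock()
                p.circuits = append(p.circuits, c)
                p.mu.Unlock()
            }
        }()
    }

    func main() {
        pool, _ := NewPool(4)
        for i := 0; i < 8; i++ {
            c, err := pool.Get()
            if err != nil {
                time.Sleep(150 * time.Millisecond) // let a rebuild land
                continue
            }
            fmt.Println("request", i, "over circuit", c.id)
            if rand.Intn(4) == 0 { // simulate a circuit going bad
                pool.Fail(c)
            }
        }
    }

The correlation caveat above still applies: spreading one client's traffic over several circuits gives an observer more edges to match up, so a real design would need to think hard about which streams may share a pool.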
But remember: what Tor is doing is hard. They are doing complex crypto, networking, security... the hard stuff. The real stuff. The Tor Project is a nonprofit organization with limited capabilities. They are doing their best. It took three years to design and implement the DoS mitigation techniques, for example.
Your proposed plan could take over 10 years, even for a well-funded corporation. It might take time and fail. It might create huge vulnerabilities due to code complexity. AFAIK, Tor can't risk that.
I mean, it's a dumb idea, but until you force people to contribute, most won't. I've looked at doing it, but didn't want to deal with legal headaches b/c creepy pedos use tor, too.
The thought is that other people are hiding in your noise. Make less noise, other people stand out more.
Can somebody list some positive, legitimate (not illegal) uses for desperately needing anonymity?
Most people, when they hear of things like "the dark web" and cryptocurrency think about the massively publicized instances of drug trafficking and ordering a hit on someone.
It's going to take a lot of work to reframe the utility and purpose of them to a more universal, humanitarian angle.
People in this world live in oppressive circumstances. This should be viewed as a step toward helping them not be systemically silenced.
If one is willing to argue that the US government throwing someone in a cage because they grew or bought the wrong plant is legitimate, then I don't see how they have any standing to complain about China doing something to someone who held up the wrong sign at a protest.
Reminds me a bit of how some societies approach drug addiction: providing a safe space with clean needles vs. throwing people in prison. There's a lot to think about.
And I think we've seen some of that with the marijuana legalization across the US. The state adoption had strong initial resistance, but public opinion began to shift once it got out of the shroud of stigma and moral enforcement.
Give me a solid reason for why you want corporations and governments to have access to detailed records of everything you do online.
There's value in that data to certain groups of people and we may not like what the future looks like once that value is tapped to its potential.
In a good, free society, maybe anonymity isn’t important.
But in a bad society, one in which collaboration on a cause is punished, but each individual desperately wants to collaborate and change something fundamental...
Anonymity allows the planning of synchronized action.
Mass or targeted misinformation also threatens the planning of synchronized action.
There is something of a Catch-22 here, I think. A society in which it's difficult or costly to do something anonymously is essentially a society with total surveillance.
And it seems intuitive that a surveillance state is not good or free.
So there is an argument you can make that good and free societies should allow anonymity even if they are the sort of society where it is least needed.
You are being tracked, at the very least, as an abstract person. If any of the above fingerprints are linked to a real identity (logging in just once, even, or posting your email on a forum), then you are now being tracked even when logged out.
If you use Tor and log into services, it provides no benefit. Tor, the browser and other distributions, will still leave fingerprints, but they will no longer be unique and match only you; they will match everyone using Tor.
Tor, the protocol, will hide that you are the one receiving or sending packets.
"why Tor would be any better than simply browsing in incognito mode"
Incognito mode does nothing to hide your packets or the source/destination you are communicating with. Your ISP could literally pull up all non-HTTPS sites you visited, along with their content, attributable to you, airstrike, as a person. Tor would block this.
From what I could see, the author made an effort to make the crawler distributed with k8s (which I don't think is needed, considering there are only approximately 75,000 onion addresses), using modern buzzword technology, but the crawler itself is rather simplistic. It doesn't even seem to index/crawl relative URLs, just absolute ones.
These things can happen in parallel but let’s also assume no more than 32 simultaneous TCP connections per host through a Tor proxy.
So we're looking at ~75k sites × 100 pages × 5 s / 32 ≈ 1.17 million seconds ≈ 14 days to run through all of them. You may not need to distribute this, but there are situations (e.g. I want a fresh index daily) where it is warranted.
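For concreteness, a minimal Go sketch of that bound: a buffered channel caps in-flight fetches at 32, everything dialed through Tor's local SOCKS port. The pages-per-site and seconds-per-page figures are the same assumptions as above, and 127.0.0.1:9050 is just Tor's default SOCKS address.

    // Back-of-envelope for the ~14-day figure, plus the 32-connection cap.
    // pagesPerSite and secsPerPage are assumptions, not numbers from the
    // article; 127.0.0.1:9050 is the default Tor SOCKS address.
    package main

    import (
        "fmt"
        "net/http"
        "time"

        "golang.org/x/net/proxy"
    )

    func main() {
        const (
            sites        = 75_000
            pagesPerSite = 100 // assumed
            secsPerPage  = 5   // assumed
            workers      = 32  // max simultaneous TCP connections
        )
        total := time.Duration(sites*pagesPerSite*secsPerPage/workers) * time.Second
        fmt.Printf("estimated wall time: %.1f days\n", total.Hours()/24) // ~13.6

        // Route everything through the local Tor SOCKS proxy.
        dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", nil, proxy.Direct)
        if err != nil {
            panic(err)
        }
        client := &http.Client{
            Transport: &http.Transport{Dial: dialer.Dial},
            Timeout:   30 * time.Second,
        }

        sem := make(chan struct{}, workers) // at most 32 fetches in flight
        urls := []string{ /* onion URLs from your frontier go here */ }
        for _, u := range urls {
            sem <- struct{}{} // blocks once 32 fetches are in flight
            go func(u string) {
                defer func() { <-sem }()
                if resp, err := client.Get(u); err == nil {
                    resp.Body.Close() // a real crawler would parse here
                }
            }(u)
        }
        for i := 0; i < cap(sem); i++ { // wait for the stragglers
            sem <- struct{}{}
        }
    }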
Regarding the number of onion addresses available, you are wrong. Addresses are encoded in base32, which means there are 32 characters available. So there are 32^16 = 1.208925819614629174706176×10^24 addresses available.
Not taken but available.
I agree that the crawler is really simplistic. But the project is new (2 months, I think) and has to evolve. You can make a PR if you want to help me improve it!
As a defense against the parent comment, though, this proves way too much. It doesn't matter how much k8s you throw at that, you're never going to so much as find your first site, if you're looking at the problem that way.
That's not really a relevant number here.
> Addresses are encoded in base32, which means there are 32 characters available. So there are 32^16 = 1.208925819614629174706176×10^24 addresses available.
I sorta understand what you mean, technically it's 32 characters per position (5 bits), and 16 positions.
In v2 .onion addresses, that is.
v3 ones are 56 positions, but not all the bits are used for addressing, so the same formula wouldn't quite work for the real theoretical capacity: a v3 address encodes a 32-byte ed25519 public key plus a 2-byte checksum and a 1-byte version (35 bytes = 280 bits = 56 base32 characters), so the effective keyspace is 2^256.
IIRC someone already made site which generates unlimited links to v3 addresses (without having them lead to anywhere, of course).
V3 addresses are just ed25519 pub keys plus a couple of extra bytes. You can use Go libraries like Bine to generate as many v3 (or v2) addresses as you want from keys.
0 - https://godoc.org/github.com/cretz/bine/torutil#OnionService...
edit: onion services, not addresses
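If you'd rather skip the library, the v3 derivation is small enough to sketch straight from the Tor spec (rend-spec-v3); this sketch assumes golang.org/x/crypto/sha3 for SHA3-256:

    // v3 onion address derivation per rend-spec-v3:
    //   onion    = base32(pubkey || checksum || version) + ".onion"
    //   checksum = SHA3-256(".onion checksum" || pubkey || version)[:2]
    package main

    import (
        "crypto/ed25519"
        "crypto/rand"
        "encoding/base32"
        "fmt"
        "strings"

        "golang.org/x/crypto/sha3"
    )

    func onionAddress(pub ed25519.PublicKey) string {
        const version = 0x03
        h := sha3.New256()
        h.Write([]byte(".onion checksum"))
        h.Write(pub)
        h.Write([]byte{version})
        checksum := h.Sum(nil)[:2]

        raw := make([]byte, 0, 35) // 32 + 2 + 1 bytes -> 56 base32 chars
        raw = append(raw, pub...)
        raw = append(raw, checksum...)
        raw = append(raw, version)
        return strings.ToLower(base32.StdEncoding.EncodeToString(raw)) + ".onion"
    }

    func main() {
        pub, _, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            panic(err)
        }
        // Prints a fresh, valid-looking v3 address that leads nowhere.
        fmt.Println(onionAddress(pub))
    }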
I would tread lightly crawling the dark web. There are cases where the FBI has admitted to running services on Tor to collect IP addresses:
What about when the FBI/CIA does it? Genuine question.
Why should I face legal problems?
However, if you have used the system, which creates a database of questionable dark web links on your machine, that could be tricky to explain... and easy to implicate you.
I am guessing you have some personal instance you use at least for testing/"education", right?
> On June 20th, board members of the „Zwiebelfreunde“ association in multiple German cities had their homes searched under the dubious pretence that they were „witnesses“ while their computers and storage media were confiscated.
Content that's illegal to possess is a different issue, though I'm sure it would make for an interesting case, because a crawler downloading, saving, and parsing an HTML page isn't as clear-cut as a human evaluating and deciding what to download and store. "The suspect has the hardware and software necessary to download this content" shouldn't be enough to convince a judge to issue a search warrant, but then again, judges probably have very little technical knowledge.
If the police can convince a judge to raid the board members of a registered association, and their families, just because they, among many other things, collected some donations for a US org, then some overzealous police detective or DA going after some dev who made a web crawler for the "dark web" and is probably in possession (knowingly or not) of illegal content isn't much of a stretch either.
Hell, a decade or a decade and a half back, they used to raid people accused by third parties of copyright infringement (for private personal use), though thankfully that has stopped now. They would come early in the morning, present you with a warrant that said "based on evidence provided by <third party>..." (i.e., somebody somehow collected an IP address you might have used off of some file-sharing swarm), take all your stuff, and scare your neighbors, and quite often your parents, because they raided a lot of minors too.
I know two people who this happened to personally.
One guy wanted his stuff back, to which the DA replied that if they got to keep his stuff, they would drop the case (I kid you not). The other guy had his stuff returned about two years later, except for his HDDs. And his stuff not only included computers, CDs, DVDs, and a printer: they had actually seized books... paper ones... wat. Neither was convicted of any crime in the end (IIRC, both cases were dropped as "minor offenses" not worth pursuing).
Turns out that a little googling of the German internet around that time turned up a lot of similar cases and some people claiming the police and DAs did that to get new computers "cheap"...
The OP wrote a crawler and used it to crawl Tor. Depending on where they live, accessing the content might be illegal, and storing some of it on their computer might be illegal as well.
Law enforcement might be monitoring some domains, or have set up some honeypots that the OP might crawl automatically.
You don't want to end up in court having to argue about why your computer accessed some child pornography and downloaded it, trying to explain to a jury that you did not do those things, but that the crawler you programmed did them.
Sure, nobody may end up raiding the OP's home, and even if they do, the OP might well survive a jury trial. But just having to go through that would suck.
If the OP only wrote the software and never used it, then they are fine. But from the article, they did use it, so who knows where the crawler landed. Chances are nowhere good.
And that some eager police like to "inconvenience" people connected to Tor somehow isn't exactly new, either.
E.g., there have been multiple raids against Tor exit node operators in different countries around the world in the past, even when the police were fully aware it was a Tor exit node that did not store information.
Maybe I'm just too paranoid; then again, I used to run a Tor exit node myself and had a bunch of less-than-pleasant run-ins with the police, though thankfully no raids.
Dread, the dark net Reddit, is surprisingly vibrant.
I think it's weird that people almost don't want to hear positive stories about the dark net.
It'll be funny when news articles and romcoms just start "forgetting" to qualify their plot piece with the "it's scary" trope.
If you're new to the field and want something that's easy to set up & polite, I strongly recommend Apache Storm Crawler (https://github.com/DigitalPebble/storm-crawler).
However, I'm wondering what a good practical purpose for crawling the dark web would be.
There's no practical purpose for the crawler. It's more an educational project than anything.