
Building a Dark Web Crawler in Go - aadlani
https://creekorful.me/building-fast-modern-web-crawler/
======
bureaucrat
First of all, it’s hidden services, not dark web.

Second, to anyone crawling hidden services or crawling over Tor, please run a
relay or decrease your hop count. Don’t sacrifice others’ desperate need for
anonymity for your $whatever_purpose_thats_probably_not_important. It could be
a fun thing to do for you, but some people rely on Tor to use the
free, secure and anonymous Internet.

~~~
MuffinFlavored
> other’s desperate need for anonymity

Can somebody list some positive, legitimate, not illegal reasons to
desperately need anonymity?

~~~
goda90
The "not illegal" part is the catch here. Something can be illegal but still
legitimate if the laws are illegitimate. Someone trying to exercise freedom of
speech or the press under an oppressive regime would need anonymity to avoid
being jailed or killed.

~~~
metamet
I think this is an important cultural reframing that needs to occur, sooner
rather than later.

Most people, when they hear of things like "the dark web" and cryptocurrency
think about the massively publicized instances of drug trafficking and
ordering a hit on someone.

It's going to take a lot of work to reframe the utility and purpose of them to
a more universal, humanitarian angle.

People in this world live in oppressive circumstances. This should be viewed
as a step toward helping them not be systemically silenced.

~~~
SkyBelow
Under the idea of legitimate and illegitimate use, why wouldn't drug
trafficking be legitimate? It gets drugs off the streets, decreases violence
compared to street level drug dealing, increases safety (while the reputation
of online sellers isn't a great metric, we are talking relative to the person
on the street corner), and generally involves only adults.

If one is willing to argue that the US government throwing someone in a cage
because they grew or bought the wrong plant is legitimate, then I don't see how
they have any standing to complain about China doing something for someone who
held up the wrong sign at a protest.

~~~
metamet
I suspect that since illicit drug trafficking has a strong social stigma, it
may not be the best thing to lead with. It can, however, be discussed with
nuance in a way that could change minds. I definitely think there's a lot of
legitimacy to what you're saying.

Reminds me a bit of what you see with how some societies approach drug
addiction. Providing a safe space with clean needles vs throwing people in
prison. There's a lot to think about.

And I think we've seen some of that with the marijuana legalization across the
US. The state adoption had strong initial resistance, but public opinion began
to shift once it got out of the shroud of stigma and moral enforcement.

------
Hitton
Disclaimer: I have rather little experience with Golang and just skimmed the
crawler code.

From what I could see, the author made an effort to make the crawler
distributed with k8s (which I don't think is needed, considering there are only
approximately 75 000 onion addresses) using modern buzzword technology, but
the crawler itself is rather simplistic. It doesn't even seem to index/crawl
relative URLs, just absolute ones.
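
For anyone wondering what handling relative URLs would involve, here is a minimal sketch using Go's standard `net/url` package. It is not taken from the crawler's code; the helper name `resolveLink` and the host `example.onion` are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink (hypothetical helper) turns an href found on a page into an
// absolute URL, using the page's own URL as the base. Hrefs that are already
// absolute pass through unchanged.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// example.onion is a placeholder host, not a real service.
	page := "http://example.onion/docs/index.html"
	hrefs := []string{"about.html", "/contact", "../img/logo.png", "http://other.onion/"}
	for _, href := range hrefs {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skip:", href, err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

With something like this in the link-extraction step, every href could be normalized before it is queued, so relative and absolute links end up treated the same way.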

~~~
creekorful
Author here. I'm fairly new to Golang too and it's my first project.

Regarding the number of onion addresses available you are wrong. Addresses are
encoded in Base32 which means there are 32 characters available. So there are
32^16=1.208925819614629174706176×10^24 addresses available.

Not taken but available.

I agree that the crawler is really simplistic. But the project
is new (2 months I think) and has to evolve. You can make a PR if you want to
help me improve it!

~~~
akklesed
Offtopic nitpick:

> Addresses are encoded in Base32 which means there are 32 characters
> available. So there are 32^16=1.208925819614629174706176×10^24 addresses
> available.

I sorta understand what you mean, technically it's 32 characters per position
(5 bits), and 16 positions. In v2 .onion addresses, that is.
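
As a quick check of that figure: 16 positions of 5 bits each is 80 bits, so 32^16 and 2^80 are the same number. A small sketch with `math/big` (the function name `addressSpace` is just for illustration):

```go
package main

import (
	"fmt"
	"math/big"
)

// addressSpace returns alphabet^positions as an exact integer.
func addressSpace(alphabet, positions int64) *big.Int {
	return new(big.Int).Exp(big.NewInt(alphabet), big.NewInt(positions), nil)
}

func main() {
	n := addressSpace(32, 16)
	fmt.Println("32^16 =", n)
	// A power of two 2^k has bit length k+1.
	fmt.Println("bits  =", n.BitLen()-1)
	fmt.Println("equals 2^80:", n.Cmp(addressSpace(2, 80)) == 0)
}
```

This only counts possible v2 address strings; it says nothing about how many are actually in use.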

v3 ones [1] are 56 positions, but not all the bits are used for addressing, so
the same formula wouldn't quite work to calculate real theoretical capacity.
IIRC someone already made a site which generates unlimited links to v3 addresses
(without having them lead anywhere, of course).

[1]
[https://trac.torproject.org/projects/tor/wiki/doc/NextGenOni...](https://trac.torproject.org/projects/tor/wiki/doc/NextGenOnions)

~~~
kodablah
> IIRC someone already made site which generates unlimited links to v3
> addresses (without having them lead to anywhere, of course)

V3 addresses are just ed25519 pub keys and a couple byte changes. You can use
Go libraries like Bine [0] to generate as many V3 (or V2) addresses as you
want from keys.

0 -
[https://godoc.org/github.com/cretz/bine/torutil#OnionService...](https://godoc.org/github.com/cretz/bine/torutil#OnionServiceIDFromV3PublicKey)
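
To show what "an address from a key" means using only the Go standard library, here is a sketch of the older v2 scheme, not v3: SHA-1 over the DER-encoded RSA public key, first 80 bits, Base32. The v3 scheme instead encodes the ed25519 public key plus a SHA3-based checksum and a version byte, which is what the Bine helper linked above deals with and which is not reproduced here. The function names are made up for this example.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha1"
	"crypto/x509"
	"encoding/base32"
	"fmt"
	"strings"
)

// v2OnionAddress derives a v2-style address from an RSA public key:
// SHA-1 over the DER (PKCS#1) encoding, first 10 bytes (80 bits),
// Base32-encoded and lowercased, which gives exactly 16 characters.
func v2OnionAddress(pub *rsa.PublicKey) string {
	der := x509.MarshalPKCS1PublicKey(pub)
	sum := sha1.Sum(der)
	id := base32.StdEncoding.EncodeToString(sum[:10])
	return strings.ToLower(id) + ".onion"
}

// randomV2Address generates a throwaway 1024-bit RSA key (the size v2
// services used) and returns the address that key would map to.
// Nothing is listening there; it is only a well-formed name.
func randomV2Address() (string, error) {
	key, err := rsa.GenerateKey(rand.Reader, 1024)
	if err != nil {
		return "", err
	}
	return v2OnionAddress(&key.PublicKey), nil
}

func main() {
	for i := 0; i < 3; i++ {
		addr, err := randomV2Address()
		if err != nil {
			fmt.Println("key generation failed:", err)
			return
		}
		fmt.Println(addr)
	}
}
```

Each run prints different addresses, which is the point made above: generating valid-looking names is cheap, so the size of the address space tells a crawler nothing about which names are live.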

------
jmnicolas
I'd be concerned that the DB is going to contain some pretty nasty stuff that
might be hard to explain in front of a judge.

~~~
creekorful
You are right. That's why it's an educational project and not a public search
engine

~~~
mellosouls
IANAL but "educational project" won't fly in court, and nor should it.

~~~
andrewjrhill
Programs like [https://www.hacksplaining.com/](https://www.hacksplaining.com/)
exist purely as educational programs that teach you to exploit known flaws in
web security and have no issue with the law.

~~~
F147H34D
Right, but possession of those items does not constitute a violation of law.
Whereas, the possession of child exploitation material does. No matter the
reasoning.

I would tread lightly crawling the dark web. There are cases where the FBI has
admitted to running services on Tor to collect IP addresses:

[https://www.wired.com/2013/09/freedom-hosting-fbi/](https://www.wired.com/2013/09/freedom-hosting-fbi/)

~~~
apta
> Right, but possession of those items do not constitute a violation of law.
> Whereas, the possession of child exploitation material does. No matter the
> reasoning.

What about when the FBI/CIA does it? Genuine question.

~~~
monoshift
No one watches the watchmen.

------
mschuster91
To anyone experimenting with such stuff, _take care_ and don't make your
services publicly available. The dark web especially is full of highly
illegal content such as child pornography, and in some jurisdictions even
"involuntary possession", such as in browser caches, may be enough to convict
you.

~~~
creekorful
Do you think I should add a license on GitHub to mention that? To protect me
and the users who will use the crawler?

~~~
weatherlight
yes.

------
rolltiide
I’ve been pretty surprised at how big hidden services have become

Dread, the dark net reddit, is surprisingly vibrant

I think it's weird that people almost don't _want_ to hear positive stories
about the dark net.

It’ll be funny when news articles and romcoms just start “forgetting” to
qualify their plot piece with the “its scary” trope

~~~
Phenomenit
I thought dread was dead?

~~~
rolltiide
It's not, hit up dark fail for the onion link to dark fail and browse the
latest onion links

------
zhdc1
Crawlers are fun!

If you're new to the field and want something that's easy to set up & polite,
I strongly recommend Apache Storm Crawler
([https://github.com/DigitalPebble/storm-crawler](https://github.com/DigitalPebble/storm-crawler)).

------
sbmthakur
A well-written article with a lot of technical detail. Well done.

However, I'm wondering what would be a good practical purpose of crawling the
dark web.

~~~
creekorful
Thank you!

There's no practical purpose for the crawler. It's more of an educational
project than anything.

~~~
warent
Weird, for some reason your comments are being instantly marked as "dead." I
think there's some kind of filter that's tripping out for your account since
it's new. I vouched for your two comments so hopefully everyone can see them
now, but an admin (i.e. dang?) will need to look into this for a longer term
solution.

~~~
creekorful
Thank you sir. Actually my other comments are invisible too. That's weird.

------
seisvelas
I did the same in Racket when I made a Tor search engine. Here's the source
code of the crawler!

[https://github.com/torgle/torgle/blob/master/backend/torgle....](https://github.com/torgle/torgle/blob/master/backend/torgle.rkt)

------
fs111
Any http-aware software that supports socks proxies can access information on
hidden services, so any crawler can do it. I fail to see what is novel about
that, except that it uses k8s and mongo and a catchy blog title.
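
To make that concrete, here is a minimal Go sketch of an HTTP client routed through a SOCKS5 proxy with nothing but the standard library. It assumes a local Tor daemon listening on 127.0.0.1:9050 (Tor's default SOCKS port); `newSOCKSClient` and `example.onion` are placeholders for illustration, not anything from the article.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newSOCKSClient (hypothetical helper) returns an http.Client that sends
// every request through the SOCKS5 proxy at proxyAddr (host:port).
func newSOCKSClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse("socks5://" + proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		// Hidden services can be slow, so the timeout is generous.
		Timeout: 60 * time.Second,
	}, nil
}

func main() {
	// Assumption: a Tor daemon is running locally on its default SOCKS port.
	client, err := newSOCKSClient("127.0.0.1:9050")
	if err != nil {
		fmt.Println("bad proxy address:", err)
		return
	}
	// example.onion is a placeholder; substitute a real address to try it.
	resp, err := client.Get("http://example.onion/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Everything past the client construction is ordinary HTTP, so any existing crawler that lets you swap the transport can be pointed at hidden services this way.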

------
woodandsteel
So how well would this thing work? What I am asking is what percentage of all
the tor hidden service sites out there would get detected by it?

------
goatsi
How well does it handle a gzip bomb?
[https://www.hackerfactor.com/blog/index.php?/archives/762-At...](https://www.hackerfactor.com/blog/index.php?/archives/762-Attacked-Over-Tor.html)
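
One common defence, sketched here in Go without any claim about what this crawler actually does, is to cap how many decompressed bytes you are willing to read. The limit, the helper names, and the in-memory "bomb" of zeros are all assumptions for the example.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("decompressed body exceeds limit")

// readGzipCapped decompresses r but gives up once more than limit bytes
// have come out of the decompressor.
func readGzipCapped(r io.Reader, limit int64) ([]byte, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	// Read one byte past the limit so oversized input can be detected.
	data, err := io.ReadAll(io.LimitReader(zr, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// readGzipBytesCapped is a convenience wrapper for in-memory input.
func readGzipBytesCapped(compressed []byte, limit int64) ([]byte, error) {
	return readGzipCapped(bytes.NewReader(compressed), limit)
}

// gzipZeros builds a tiny stand-in for a bomb: n zero bytes, compressed.
// Writes to a bytes.Buffer do not fail, so errors are ignored here.
func gzipZeros(n int) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(make([]byte, n))
	zw.Close()
	return buf.Bytes()
}

func main() {
	bomb := gzipZeros(10 << 20) // 10 MiB of zeros
	fmt.Println("compressed size:", len(bomb), "bytes")
	_, err := readGzipBytesCapped(bomb, 1<<20) // allow at most 1 MiB
	fmt.Println("result:", err)
}
```

Go's `http.Transport` decompresses gzip responses transparently by default, so in a crawler the same `io.LimitReader` wrapping would typically go around `resp.Body` rather than around an explicit gzip reader.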

------
Havoc
Sounds like a recipe to score yourself a free FBI visit

~~~
penagwin
Generally the FBI doesn't give a hoot until you start distributing illegal
stuff....

~~~
fishtacos
What does suck is being put on IP blacklists by various providers for merely
running a Tor relay, not an exit node. There are several websites I can only
access through a VPN because my IP is associated with running a relay.

------
getpolarized
Go is a horrible language in which to write a crawler. The main problem is
that NLP and machine learning code simply isn't as prevalent and robust as it
is in Java and Python.

~~~
marcrosoft
Go is great for a crawler. What do NLP and ML have to do with crawling?

