Shodan Search Engine: Search Engine for the Internet of Everything (shodan.io)
166 points by ohjeez 65 days ago | 47 comments



Excerpt worth pulling out for HN readers:

Crawling, incidentally, is I think the biggest issue with making a new search engine these days. Websites flat out refuse to support any crawler other than Google's, and Cloudflare and other protection services and CDNs flat out deny access to anyone but the incumbents. It is not a level playing field. I would actually like to see some sort of communal web crawl, supported by all web crawlers, that allows open access to everyone. The benefits to websites would be immense as well: they could be hit by a single crawler rather than multiple, and bugs could be ironed out.


While it's an obstacle, I don't think this is as big of an issue as they make it out to be, and I say this as someone who runs a DIY crawler and a janky homebrew search engine. You can get listed as a good bot with Cloudflare &c if you ask nicely enough. Even if you aren't, their IP ranges are public, so you can crawl them really slowly if you absolutely must have their stuff.
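The IP-range check is easy to sketch. A minimal Python example; the CIDR blocks and delay values here are illustrative assumptions, not the real list (CDNs publish their current ranges themselves, e.g. Cloudflare at cloudflare.com/ips, and a crawler should fetch and refresh those):

```python
import ipaddress

# Illustrative sample ranges only -- fetch the CDN's published list in practice.
CDN_RANGES = [ipaddress.ip_network(n) for n in ("104.16.0.0/13", "172.64.0.0/13")]

def crawl_delay_seconds(ip: str, default: float = 1.0, cdn: float = 30.0) -> float:
    """Return a per-request delay: back way off for CDN-fronted hosts."""
    addr = ipaddress.ip_address(ip)
    return cdn if any(addr in net for net in CDN_RANGES) else default

print(crawl_delay_seconds("104.17.0.1"))    # inside the sample CDN range
print(crawl_delay_seconds("93.184.216.34")) # not in any sample range
```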

In the end, you are never going to get completeness; that's a bad mark to aim for, and not even Google manages it. That goal requires a model of the Internet that's as old as Vannevar Bush. The Internet isn't a bunch of static files on a souped-up FTP server anymore; most pages are generated at request time, increasingly client-side. There are websites that don't just have infinite documents, they have infinite wildcard subdomains as well. There are websites that give you different results depending on what time of day it is, or where you are from.

The hard problem of search engines is to find a good subset of the internet, for some definition of good; and the first step toward doing that (in a timely fashion) is to rein in the crawling. Search engines are always going to present a selection of web sites, and there are always going to be pages you can't find because they didn't make the cut, either because they were excluded or because they excluded themselves.


Hey, your project is awesome, but you seem to be the only non-big-3 crawler that doesn't encounter this problem and call attention to it [1]. Maybe because you (awesomely) focus on low/no-JavaScript, text-heavy sites? Much of the most obnoxious antibot stuff (reCAPTCHA, CDN "AI" throttling, a lot of what Cloudflare does, etc.) is JavaScript-based.

Maybe I'm totally wrong here. If so, you really should make your crawl corpus publicly available. Trickle it into one of the public clouds and charge people a few microcents per query. You'll get to retire early, knowing you've done a good deed for human technological progress.

> There are websites that give you different results depending on what time of day it is, or where you are from.

That has been true since at least 1993 [2], before there were search engines!

> The hard problem of search engines is to find a good subset of the internet for some definition of good

This really is a case of the tech industry trying to redefine "search" in a way that gives them editorial power. Librarians don't do that sort of thing. We could learn a lot from their integrity. Or we could just stop calling it "search", because that's not what it is anymore. For example, that box at the top of amazon.com is most definitely not a search box anymore; it's a sales-lead-generator box. The button should be marked "try to sell me shit which is not necessarily related to these words" instead of "search".

[1] off the top of my head: archive.is/archive.today, gigablast. I'm actually having trouble remembering any still-operating whole-web crawlers besides Google, Bing, and Yandex.

[2] https://en.wikipedia.org/wiki/Common_Gateway_Interface


Some folks think it's a very big issue and have released data on their findings about robots.txt files: https://knuckleheads.club/the-evidence-we-found-so-far/

We see some problems, but as an established and respectful crawler we are usually able to overcome most blocks at Mojeek, an international web search engine with no tracking (currently 4.3 billion pages).


I learned of Mojeek via searX use; with the same search queries, it has found corners of the Internet which the big 3 have ignored. I find it very useful to balance out all the popular results. Kudos to your team.


I'm aware that CGI has been around for a long time, but it was bog slow since it forked a new process to deal with every call; it wasn't really feasible to build the sort of dynamic websites we do today until stuff like PHP entered the market.

The relative proportions of static files versus dynamic web content have probably reversed.

> This really is a case of the tech industry trying to redefine "search" in a way that gives them editorial power.

This isn't a new development at all. The signal to noise ratio problem goes back to the late '90s, and ruthlessly removing noise was arguably one of the moves that allowed Google to break through.

Unfortunately they seem to have forgotten why they became a success.

From http://infolab.stanford.edu/~backrub/google.html

> [...] In 1994, some people believed that a complete search index would make it possible to find anything easily. [...] However, the Web of 1997 is quite different. Anyone who has used a search engine recently, can readily testify that the completeness of the index is not the only factor in the quality of search results.

--

It's difficult to conceptualize these numbers, but we can use English Wikipedia as a measuring stick. It has on the order of ten million pages, and covers most topics you would ever want to read about. Maybe there are some niche things here and there that are missing, but overall, it's a relatively complete coverage of human ideas and interests. How many English Wikipedia Units do you need for your search engine to feel complete, assuming 100% signal and 0% noise? Do you need one? Ten? A hundred?

A "big search engine" crawl is over a hundred thousand English Wikipedia Units. How much of that is interesting? How much has been written by humans, has ever been looked at by humans?
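Spelling out the arithmetic (both inputs are the rough figures from the comment above, not measurements):

```python
# Back-of-envelope scale check using the rough figures above.
EWU_PAGES = 10_000_000      # one "English Wikipedia Unit": ~1e7 pages
BIG_CRAWL_EWU = 100_000     # "over a hundred thousand" units in a big crawl

big_crawl_pages = EWU_PAGES * BIG_CRAWL_EWU
print(f"{big_crawl_pages:,} pages")  # on the order of a trillion pages
```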


https://commoncrawl.org/ is a big crawler with all data made public. Probably a good resource for new search engines.
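For anyone who wants to poke at it, Common Crawl exposes a per-crawl CDX index over HTTP. A sketch that just builds the query URL; the crawl label here is an example and goes stale, the current labels are listed at index.commoncrawl.org:

```python
from urllib.parse import urlencode

# Example crawl label -- look up the current one at index.commoncrawl.org.
CRAWL = "CC-MAIN-2024-10"

def cdx_query_url(url_pattern: str) -> str:
    """Build a Common Crawl CDX index query; one GET returns JSON lines,
    each describing a captured page and where it sits in the WARC files."""
    qs = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{CRAWL}-index?{qs}"

print(cdx_query_url("example.com/*"))
```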


Yes and no. It works for the deep web, but not for anything that needs more frequent crawling. They don't do it often enough.

I have mentioned this elsewhere, but I'd love there to be a single crawl that everyone was able to use and build indexes off of. It would be especially good for webmasters too, as you would have one bot to worry about and optimise for, reducing the load on systems and allowing for innovation in the search space that Cloudflare and other CDNs are stifling, because they understandably block crawlers that aren't Google or Bing.

For example, I'd love to see something added to robots.txt that allows you to say "crawl at 2am, hit as hard as you want, don't come back for a week" or some such. Having a single crawl could allow for this.
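To be clear, nothing like this exists in robots.txt today (RFC 9309 only standardizes User-agent, Allow, and Disallow), so the directives below are entirely hypothetical; but the idea could look something like:

```python
# Hypothetical robots.txt extension -- none of these directives are real,
# this just sketches the "crawl at 2am, come back in a week" idea above.
SAMPLE = """\
User-agent: communal-crawler
Crawl-window: 02:00-04:00
Rate-limit: unlimited
Revisit-after: 7d
"""

def parse_directives(text: str) -> dict:
    """Parse key: value lines the way robots.txt parsers typically do."""
    out = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            out[key.strip().lower()] = value.strip()
    return out

print(parse_directives(SAMPLE))
```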


Your comment has been misplaced, it should be under https://news.ycombinator.com/item?id=28665395 (I was pretty confused until that realization).


Well, then change your user agent to 'googlebot' when crawling the web, since it has now become the synonym for search crawlers?


That trick doesn't work, because people also check the connecting IP against Google's addresses to verify that a "googlebot" crawler is actually a real Google bot.
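Google documents this verification dance: reverse-DNS the connecting IP, check the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A rough Python sketch, with the pure hostname check separated from the network calls:

```python
import socket

def is_google_hostname(hostname: str) -> bool:
    """Pure check: Google's crawlers reverse-resolve to these parent
    domains, per Google's own bot-verification documentation."""
    host = hostname.rstrip(".")
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

def verify_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm the
    name resolves back to the same IP. Makes network calls; can raise
    socket.herror/socket.gaierror on lookup failure."""
    hostname = socket.gethostbyaddr(ip)[0]
    return is_google_hostname(hostname) and socket.gethostbyname(hostname) == ip
```

Note the forward-confirm step: without it, anyone controlling reverse DNS for their own IP block could claim to be `crawl-x.googlebot.com`.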


Can you run a crawler from GCP, or do they explicitly filter for specific Google ASes?


Since it is pretty well hidden and I only found out about it in a previous HN thread: besides a free plan, and contrary to the mainstream, shodan.io has a pretty attractive one-time-payment plan. It is probably not enough if you need it 9 to 5, but it's perfect for occasional research.

That being said, there is also censys.io. They also don't advertise it prominently, but with their free Community Plan you can search their database for free and you'll get similar results to the free shodan search.


The cost of the one-time payment is behind a registration wall. It’s $49 for those curious.


I'm not sure if they still offer this, but I got my lifetime sub for $5 on black friday a few years back.


This is a great tool, but there's a caveat: last time I checked, Shodan didn't scan all ports, or from all locations. If you're using this to check the surface area of your infrastructure (including the notify-me-of-new-open-ports feature), always test with a full nmap scan yourself, preferably from within the same geographic region.
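A full nmap run (something like `nmap -p- <host>`) is the right tool for this; for illustration, the core of a TCP connect scan is only a few lines. A minimal sketch, only for hosts you're authorized to scan:

```python
import socket

def scan_ports(host: str, ports, timeout: float = 0.5):
    """Minimal TCP connect scan -- a toy stand-in for `nmap -p- <host>`.
    Only run this against hosts you are authorized to scan."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 when the TCP handshake succeeds
            if s.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports
```

Scanning all 65535 ports this way is slow and sequential; nmap parallelizes, does SYN scans, and fingerprints services, which is why the comment above recommends it.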


When I use Shodan as a data source for any info sec consulting, I try to get the point across that Shodan’s map of your network is not accurate (nmap is much better for that), but anything that an adversary sees on Shodan is likely to be an initial target (even if it is down at the time of the attacker looking).

Shodan is a fantastic tool, but I never trust it as a single source of truth for a network perimeter.


There are actually multiple alternatives available that are like Shodan but concentrate more on detailed port scanning, for example https://www.idyllum.com or https://pentest-tools.com/network-vulnerability-scanning/tcp...

There still really aren't many all-encompassing tool sites out there, so unfortunately you need to mix and match.


Slow news day?

Next up: https://search.censys.io


I wonder how many leaks we've all heard about started with a simple Shodan search.


If you are interested in shodan.io, every year they hold a one-day offer for a lifetime subscription for cheap.


I got one of those once. Had it for about 2 years and then they banned me without any explanation. How do I know they banned me? Because all profile-related actions (login, logout, etc.) suddenly became a 404 for me and no results would be returned. The 404 issue was resolved as soon as I deleted my cookies, but of course at that point I could no longer log in either.

I emailed them a few times and was ignored.

It was still a pretty good deal though.


We don't do bans like that, so it sounds like there might have been a configuration/network issue.


In that case, can you tell me why my account no longer exists? My username was 'mkoryak'. My account was created on Mon, Jul 18, 2016, 9:38 PM.


Your account got banned due to scraping of the website.


That's weird.

Edit: now I remember - I wrote a Greasemonkey script to load multiple pages of results into a single page because I wanted to see 5 pages at a time. I can see why that looked like scraping.


? Not a new product. Just a link to shodan?


Somebody just realised the website exists I guess. Straight to the front page...


Maybe, it was discussed here on HN eight years ago!

https://news.ycombinator.com/item?id=5512477



"The Lucky 10000 has been referenced over 10000 times on reddit"

https://www.reddit.com/r/xkcd/comments/64yacz/the_lucky_1000...


I always wondered... how is shodan actually legal?

In which jurisdiction is it?

edit: apparently it's from Texas.

https://www.linkedin.com/company/shodan/


To know if something is legal or not, we have to know why it would be _illegal_. What laws are they breaking by port scanning? (I guess the default would be the CFAA, but I doubt even that hammer could be used for simple port scanning?)


If you are talking HTTP to an endpoint that you have no idea whether you are authorized to use, and it happens that they have some ToS document stating "do not use unless", then it could be construed as 'unauthorized access'. This unauthorized access could be argued to be illegal under the CFAA.

The argument is iffy, hinging on the difference between just talking a little bit to a random IP on the one hand, and talking HTTP to a specific site whose ToS you could have been aware of on the other. That second case was recently relevant when some company (I think LinkedIn?) managed to stop people crawling them because, even though the IP endpoint is public, the ToS of the site did not authorize crawlers.

All this said, I could see the argument that it is illegal. As a bad analogy, suppose you just checked every door in the city for being locked, and if it wasn't locked, peeked in to record the door shape. That feels like trespassing.


I remember, years ago, sending abuse reports to Shodan's ISPs asking them to enforce their ToS against Shodan, and receiving several emails from Shodan telling me what their mission is and how it's not abuse. I continued reporting them and eventually their unsolicited scans ceased.

Seemed like such a bizarre response to a perfectly cromulent request.


This seems like a perfectly Karen request.

Information you expose to the public internet, behind an isp or not, is not private or secure.

This is like complaining that people keep taking walks by your house whenever your wife sunbathes nude on your front porch. The people doing the looking aren't the problem. Where your wife is exposing anything sensitive is the problem.

The solution is to move your wife to the privacy of the back yard, buy her a sunbooth, or accept that she's on display to the public.

Much the same, make sure your firewall facing your public network connection only exposes what you want it to. Part of the implicit structure of the internet is that by connecting, you're agreeing to be connected to, to be visible to everyone else that's connected.

Leave Shodan be, their presence is important and relevant to the ongoing discussions of educating the public and maintaining accountability for security decisions by giant corporations.


The example you picked has a flaw: a nude wife can be fined even if she exposes herself on your property, if she does so in a way that others can see her. A police officer will be quite quick to explain to her why she should not do this :)

The other point, on the "Karen request", is spot on, because Shodan is making requests for research purposes and selling the results. They don't access anything that is behind a password, and they don't scan with the intention of finding something they can break through.

That said, I mostly don't agree with people saying "scanning is not a crime", because I don't believe randoms on the internet claiming they are "scanning just for fun"; I expect they scan to find something they can get into, which is the first step of a crime: preparation.

People also don't understand that the intent behind scanning is what makes it legal or illegal, not the scan in itself. Just like a knife is not good or bad, but if you hold it to harm someone, that makes it bad and illegal.


> example you picked has a flaw that nude wife will be fined even if she exposes herself on your property

That's not even true for every jurisdiction in America, let alone the world.

> Police officer will be quite quick to explain her why she should not do this

That's not a flaw in the analogy. The point of the GP's comparison is to pick a similar real-life scenario in which the fault is very obvious. The fact that you're able to identify the obviousness of the fault (and suggest the police would too) demonstrates the success of the analogy.


I'll skip dissecting your painful metaphor and go straight to your claim. I haven't said anything about my opinion of Shodan or their work. I also know that the Internet is full of malicious actors and we must expect and protect against abuse.

I didn't give Shodan permission to access my systems. The services and methods they used were not publicly listed services that one would expect a member of the public to use. They don't have implicit permission.

Before reporting them to their ISPs, I checked the terms of the ISPs. These prohibited Shodan's behaviour.

I didn't exaggerate or lie to the ISPs, just gave them the facts. The ISPs chose to bring this up with Shodan, proving that my reports were meritorious.

Shodan contacted me more than once and I politely told them not to access my systems again. Each time they ignored my explicit withdrawal of consent.

Shodan are a commercial organisation that profits from the information they gather.

So they accessed my systems without permission, broke the terms of service of multiple ISPs, which were then enforced by their ISPs, and then were specifically asked by me to stop on more than one occasion, which they did not. They profit from all of this. And you think I'm a Karen?

It used to be that people who used the Internet did so under certain social norms. One of these was not to touch another's network without permission and, if caught, accept you got caught and gracefully stop it. I guess because you like Shodan, you assert they can do all of this and anyone who reasonably and politely objects to it is somehow unreasonable.

It bemuses me what the Internet has become.


>>It used to be that people who used the Internet did so under certain social norms. One of these was not to touch another's network without permission and, if caught, accept you got caught and gracefully stop it.

This is a silly argument. You're connecting to the internet, bud. It's a dangerous place, and there is zero expectation that nobody will look at or scan your network. PUBLIC internet means what you connect to it is PUBLIC, at global scale.

Shodan isn't hacking into secured devices. It isn't getting past your ISP's firewall and looking into restricted networks. It's scanning publicly addressed, explicitly broadcast ports, services, devices, and data. It's not on them to cater to your weird sense of internet manners, it's on you to wise up to what your public internet connection means.


To push the analogy a little: what if she is nude in the living room, and people are stopping on the sidewalk to look in through your windows? I think that would be a rude invasion of privacy. (Alternatively, replace the living-room window with an open front door, or, to push it even more, with a closed but not locked front door.)

The question of this analogy hinges, in my mind, on whether "this site talks HTTPS on port 443" is the equivalent of looking into someone's front yard, or the equivalent of looking through the window into someone's living room.

I think one weird difference is that we can passively observe things in the real world. But you can't passively observe an IP-address. You need to send data at it to see how it responds.

On that note, perhaps the analogy should be that people are shining a flashlight into your front-garden at night. Because they need to illuminate your private property in order to see anything.


> I think one weird difference is that we can passively observe things in the real world. But you can't passively observe an IP-address. You need to send data at it to see how it responds.

The other part of that is that you also can't passively show anything. The other side needs to actively respond with something that you can then observe.


I feel like the incentive here is to make sure your devices are secured rather than asking them to stop scanning :-)

Better someone who is openly providing you this information such that you can use it than a malicious attacker!


Perhaps, but I knew how my systems were set up, and their connections were odd enough that I noticed them. I would suggest this shows that my security precautions were working well.


So Shodan doesn't look at your network now but every crook out there can scan it, find vulnerabilities and possibly exploit them. How are you protecting your network against that?


I didn't ask them to look at my network. I then explicitly withdrew consent when they contacted me. They profit from the data they collected from me. This pushes them into crook territory in my opinion.

I'm well aware that crooks can connect to my systems: that's why Shodan came to my attention.

I find vulnerabilities by scanning my own network; I don't need someone else to profit by doing that without my permission.


Do you also report Chinese botnets? Because the most traffic I get is from those.



