
Ever wondered how many open FTP servers there are? - Lolapo
https://github.com/massivedynamic/openftp4
======
DonHopkins
I used to leave a file called README on my public ftp directory, which
contained only:

cat: README: No such file or directory

I'd occasionally get email from frustrated people who had trouble trying to
read the README file, so I'd tell them to simply run "emacs README", and emacs
would solve all of their problems. I don't know if my passive aggressive emacs
evangelism ever worked, because I never heard back from them.

~~~
anchpop
Reminds me of the Abbott and Costello skit "Who's on first" [1]

[1] [https://youtu.be/kTcRRaXV-fg?t=63](https://youtu.be/kTcRRaXV-fg?t=63)

------
kragen
I got my first internet account on my high school's VAX in 1992 in order to
download Dr. Dobb's code from ftp.mv.com. The VAX had a help file installed
which was a list of thousands of anonymous FTP servers, most of which
requested that you please not use anonymous FTP during business hours to avoid
overloading them; shortly I was downloading all kinds of things from
wuarchive.wustl.edu (which had a LOT of stuff) and WSMR-SIMTEL20.army.mil,
which worried my dad. I bound some function keys on my terminal to different
internet commands (FTP, TELNET) and internet sites (like those I mentioned) so
I could FTP to wuarchive or TELNET to HPCwire with two keystrokes.

At some point I got hold of Scott Yanoff's list of interesting Internet
services (capitalizing Internet was still justifiable at the time) and learned
about the Weather Underground, Archie, HPCWire, and this new thing called the
"World Wide Web" — I started telnetting to a server at the University of
Kansas where I could use Lynx, and it seemed pretty clear that this was going
to be a big deal, because of how enormously much easier it was than
downloading text files over FTP.

So not only have I wondered how many open FTP servers there are, my
exploration of the internet pretty much started with a list of them.

Nowadays I occasionally look for FTP servers because they tend to be less of a
pain in the ass for downloading stuff than HTTP servers — you can usually get
a full list of what they have, and they never interrupt you with CAPTCHAs.
It's kind of like a real-world "shibboleet" — I guess sometimes assholes push
mandates for CAPTCHAs and whatnot on a company's technical folk, but they
leave FTP open because the assholes don't know about it.

If you're wondering how many open HTTP servers there are, Netcraft does a
pretty good monthly survey.

~~~
Esau
"capitalizing Internet was still justifiable at the time"

Given the Internet is a particular thing, shouldn't it still be used as a
proper noun?

~~~
akshatpradhan
AP Style Alert. Internet and web are no longer to be capitalized. Check the
googles.

~~~
Esau
I don't work for the AP so I will continue to capitalize it.

------
p4bl0
I have a fun story about open FTP servers.

When we were undergrad students, a friend and I wondered exactly that same
question about FTP servers. So we wrote a script that tested random IPv4
addresses for FTP servers using nmap and then attempted to connect.
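A minimal sketch of that kind of scan in Python (the original used nmap; the function names here are mine, and a real scan would filter the full reserved-address list, not just the obvious private ranges):

```python
import random
from ftplib import FTP

def random_public_ipv4():
    """Pick a random IPv4 address, crudely skipping obvious reserved ranges."""
    while True:
        octets = [random.randint(1, 223)] + [random.randint(0, 255) for _ in range(3)]
        first, second = octets[0], octets[1]
        if first in (10, 127) or (first == 192 and second == 168) \
                or (first == 172 and 16 <= second <= 31):
            continue  # crude filter; a real scan would honor all reserved blocks
        return ".".join(map(str, octets))

def try_anonymous_ftp(ip, timeout=3):
    """Return True if the host accepts an anonymous FTP login on port 21."""
    try:
        with FTP() as ftp:
            ftp.connect(ip, 21, timeout=timeout)
            ftp.login()  # ftplib defaults to the "anonymous" user
            return True
    except Exception:
        return False
```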

To our surprise we found quite a few. Most of them were small or not writable,
but at least one of them was writable, and it had hundreds of GB of free
space. So we connected to it, and then saw that it had huge data files
containing what seemed like random junk.

Then, we tried to ssh to the box, but no ssh server was listening. So we tried
to telnet in. That worked but we were prompted:

    
    
        Enter password: 
    

Ah, too bad, we will never guess what the password is… Mh, let's try "123456"
just in case… But then, instead of logging us in or telling us that we entered
the wrong password, it simply said:

    
    
        Re-enter password:
    

Mh? So we re-entered "123456".

    
    
        The password has been set.
    

Haha! The owner must never have set this up. And then we were in. It was a
very strange system with a few commands with standard names but non-standard
behaviors, and a very minimalistic shell. By poking around and searching the
web, we understood that we were actually connected to a surveillance camera
and that the big files were probably parts of video.

Poking around more gave us access to other such cameras in the same network
and more importantly to a web interface from which we could see the videos
streaming live. We saw offices and people doing stuff like taking the garbage
out (I don't know why, but that is one of the more precise images I can recall
^^). The only distinctive thing I remember was a big sign saying "Miami
Fitness Club".

After that we never did anything about it as we had other things to do, but I
kind of cherish this story as a nice souvenir of my first year at the ENS.

~~~
saganus
I remember doing something similar except it was with my cable provider, when
they first came into town.

Since all clients would basically be connected to a LAN, as soon as I found I
could port scan random users I started doing it.

Of course there were a lot of businesses on the network that apparently used
FTP to move files around but were unsecured.

I spent the better part of a couple of weeks just going through the data
(never actually downloaded anything since I knew it could land me in trouble
with my parents).

Like you said, I also cherish this story since it was a rare peek at other
people's lives... without them even knowing a 12-year-old had access to all
their business files.

~~~
tthayer
@Home was wide open around 1999ish. My friend would send random (if you can
call goatse.cx random) print jobs to various open SMB printer shares. Ah, to
be young again.

------
sillysaurus3

      $ lz5
      -bash: lz5: command not found
      $ brew install lz5
      ==> Auto-updated Homebrew!
      Updated Homebrew from b5a6b4e to 7926114.
      Error: No available formula with the name "lz5" 
      ==> Searching for similarly named formulae...
      Error: No similarly named formulae found.
      ==> Searching taps...
      Error: No formulae found in taps.
    

So, rather than complain about this, how do I do something to fix it? Can I
add lz5 as a tap to Homebrew?

EDIT: Let me try this again, in a more productive way.

lz5 does not currently exist in Homebrew. I'd like to fix this. I've never
done this before. Does anyone have advice? Is it as simple as forking lz5,
then adding a tap to Homebrew?

Thanks. And as a note, this edit occurred after specialp's replies. They were
right: this originally wasn't a productive comment.

~~~
sepbot
[https://github.com/Homebrew/brew/blob/master/Library/Homebre...](https://github.com/Homebrew/brew/blob/master/Library/Homebrew/cask/CONTRIBUTING.md#adding-a-cask)

~~~
sillysaurus3
Ah, it'd be a cask? Cool, thanks.

~~~
reaperhulk
[https://github.com/Homebrew/homebrew-core/blob/master/.githu...](https://github.com/Homebrew/homebrew-core/blob/master/.github/CONTRIBUTING.md)
is a better link. It would just be a new formula in homebrew-core.

------
userbinator
I wonder how many of those are just mirrors of Linux distros and other open-
source software, how many have more interesting things (including software),
and how many of _those_ were deliberately configured to be open for sharing.
There is the somewhat-well-known filesearch.ru if you want to look for things
on this non-HTTP part of the Internet. (If I remember correctly, Google used
to index FTPs too and you'd get plenty of results with the right queries, but
that seems to have mostly and silently disappeared...)

~~~
bane
There's still a surprising number of niche FTP sites around: MSX, Demoscene,
etc. Mostly older scenes that pre-date the WWW and are still around. I think
it might be useful to test every port on every IP to see what happens
protocol-wise. Limiting to just common ports is probably missing lots of cool
things.

~~~
andai
How many ports are there?

~~~
aeinstein1
According to the list of reserved IP addresses, there are 588,514,304 reserved
addresses, and since there are 4,294,967,296 (2^32) IPv4 addresses in total,
there are 3,706,452,992 public addresses. There are 65,536 ports, so one would
have to scan all those ports on all those addresses.

------
lucb1e
S/he talks about it as if it's something bad, something unfixed. The whole
thing sounds like it, but in particular "[to be excluded] go fix your shit".

I agree that if you don't want people to access it, you should secure it. Yet
not all these servers are accidentally open: my ftp on 80.100.131.150 (I
assume it's in there) hosts a copy of Damn Small Linux because all downloads
were extremely slow or broken at the time.

~~~
a3_nm
Agreed. Having an open FTP server with read-only access is no more problematic
than having an open HTTP server.

------
djsumdog
I remember back in the 90s, people would hunt for FTP servers that allowed
anonymous writes. Companies that didn't know how to secure servers would
suddenly be hosting a ton-o-warez.

~~~
leeoniya
> would suddenly be hosting a ton-o-warez.

which could then be found by searching for "index of"...

~~~
jlgaddis
... if only we had had search engines then. :)

~~~
dalke
I used to telnet archie.mcgill.ca to search the collected index of archive
sites, back in 1991 or so.

[https://en.wikipedia.org/wiki/Archie_search_engine](https://en.wikipedia.org/wiki/Archie_search_engine)

------
jftuga
Side note: "xz -9 -e" compresses the file to 3,296,864 bytes, whereas "lz5
-15" only compresses the original file to 4,643,261 bytes. The xz-compressed
file is 29% smaller.

Even "gzip -9" compresses the file to 4,035,858 bytes.

So I wonder why lz5 was chosen for compression.

~~~
jlgaddis
How long did that "xz -9 -e" take compared to "lz5 -15" or "gzip -9"?

~~~
mappu
Numbers on my machine, ordered by compression duration:

    
    
        Compressor    Duration    Size
        cat           0.0s        11385584 [*]
        gzip -1       0.2s        4918778 [*]
        bzip2 -1      0.9s        3334653
        bzip2 -9      0.9s        3122347 [*]
        xz -1         1.1s        4222016
        gzip -9       2.4s        4085818
        lz5 -15       6.9s        4643261
        xz -9 -e      10.7s       4033589
        zpaq -m5      34.2s       2655834 [*]
    

A [*] indicates that no better compressor was faster. Test methodology was
`cat | $compressor > output` run 3-4 times to get an average.

I'm surprised by how bzip2 turned out.
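The time/size tradeoff in the table can be reproduced in miniature without installing every compressor, using only Python's stdlib gzip module (the sample data here is illustrative, not the file from the table):

```python
import gzip
import time

# Highly compressible sample data, roughly 4 MB.
data = b"the quick brown fox jumps over the lazy dog\n" * 100_000

for level in (1, 9):
    start = time.perf_counter()
    out = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    # Higher levels spend more time searching for matches to shave off bytes.
    print(f"gzip -{level}: {elapsed:.2f}s  {len(out)} bytes")
```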

------
danielrm26
I recommend using Shodan for this kind of stuff.

[https://shodan.io](https://shodan.io)

------
ryanmccullagh
Scanning the IPv4 space. I know there are many different projects that do it.
I was thinking about how I would do this today. I believe the first step would
be to enumerate all the IPv4 blocks (/22, etc.), then do a calculation to
arrive at the address of each based on the prefix. Then, in an array of
threads or so, try to connect(2) to the address on some type of service with a
timeout. If it succeeds, consider that address as up. I would consider doing
this in an async loop with epoll(7), so that many connections could be
attempted at once, improving throughput.

Anyway, nmap can probably do this and is a great tool.
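The connect-with-timeout idea can be sketched with asyncio (which uses epoll under the hood on Linux) rather than hand-rolled epoll(7); the function names and concurrency limit here are illustrative:

```python
import asyncio

async def probe(ip, port=21, timeout=2.0):
    """Try a TCP connect; return the ip if the port answered, else None."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(ip, port), timeout)
        writer.close()
        await writer.wait_closed()
        return ip
    except (OSError, asyncio.TimeoutError):
        return None

async def scan(ips, concurrency=500):
    """Probe many addresses concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(ip):
        async with sem:
            return await probe(ip)

    results = await asyncio.gather(*(bounded(ip) for ip in ips))
    return [ip for ip in results if ip]

# e.g.: up = asyncio.run(scan(["192.0.2.1", "198.51.100.7"]))
```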

~~~
toast0
If you want to have your scan done in a reasonable amount of time, you
probably want to use raw packets, and not connect.

If you don't mind all the flak you'll get, just send a SYN on the port you
care about to each IP (maybe skip RFC 1918, multicast, and reserved addresses;
or only send to addresses included in BGP announcements). If you send one
packet to each address, including the addresses you should probably skip,
that's 4 billion packets; if you do it at 1M pps (which should fit on a 1Gbps
ethernet connection), that's less than two hours.
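The arithmetic checks out; a quick sketch (the 60-byte on-the-wire SYN frame size is my rough assumption):

```python
addresses = 2 ** 32        # one SYN to every IPv4 address, skipping nothing
pps = 1_000_000            # 1M packets per second
seconds = addresses / pps  # ~4295 s

syn_frame_bytes = 60                    # rough ethernet frame size of a TCP SYN
gbps = pps * syn_frame_bytes * 8 / 1e9  # ~0.48 Gbps, fits in a 1 Gbps link

print(f"{seconds / 3600:.1f} hours at {gbps:.2f} Gbps")
```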

~~~
minxomat
That's what masscan automates. To fully utilize masscan, one requires a
friendly ISP. And even then, they can only be masscan-friendly as long as
their peers are. If you annoy the peers enough, they'll just drop you
([http://www.sudosecure.com/ecatels-harboring-of-spambots-and-...](http://www.sudosecure.com/ecatels-harboring-of-spambots-and-malware-causes-bgp-peers-to-stop-peering-with-them/)).
Many datacenters classify port-scanning as an offensive action, even with low
packet throughput.

Pulling data from research servers (such as Censys), reducing and _then_
scanning is always a good idea.

------
sigill
The submission links to a blog post on how the data was retrieved:
[http://255.wf/2016-09-18-mass-analyzing-a-chunk-of-the-inter...](http://255.wf/2016-09-18-mass-analyzing-a-chunk-of-the-internet/)

> For this little experiment, I’ve setup a single KVM instance, running a
> single 2GHz vCore with 2GiB of RAM and 10GiB of HDD space. This is
> sufficient. Probing for ftp access is an extremely CPU-intensive task. You
> are going to hit bottlenecks in this order:
>
> CPU > Memory > a whole lot of nothing > network
>
> While the rescan was running, only about 1 to 2kpps were exchanged, while
> the CPU was pinned at 100%.

So this means his setup spent about 1-2 million clock cycles _per probe_.
That's a lot!

I suppose this is because he runs the probe script once per IP address? I
suspect that an implementation which stayed in-process would be at least an
order of magnitude faster.

~~~
minxomat
Sure. Faster even with a better scheduler. I just wanted to show how the
simplest and most redneck way still finishes in a reasonable amount of time.
:-)

~~~
bemmu
I was amazed how fast that went. Was fully expecting the story to unfold with
how you rented out 100 AWS servers to complete the task, instead it was just
one computer and only took hours.

~~~
minxomat
It's all about reducing data offline before throwing the kitchen sink at the
internet.

------
barefootcoder
Good old archie... [http://archie.icm.edu.pl/archie-adv_eng.html](http://archie.icm.edu.pl/archie-adv_eng.html)

~~~
userbinator
> Search Type: Sub String Exact Regular Expression

Google has an advantage in terms of index size, but definitely loses in terms
of precision. The way it munges queries is a bit unsettling at times, and even
the "exact" option doesn't seem to always work the way it should.

~~~
daurnimator
Have you tried verbatim mode? (Thanks
[https://news.ycombinator.com/item?id=12046056](https://news.ycombinator.com/item?id=12046056))

~~~
mikeash
That is a huge improvement over the normal search, but unfortunately it's not
truly verbatim. For example, searching for `self` with the backticks turns up
results for just self with no backticks. It works with some symbols but not
others, not sure why.

------
heydonovan
Are there any legal implications for doing this? Was going to do something
similar with Redis and MongoDB.

~~~
achillean
This might be a useful starting point:

[https://blog.shodan.io/its-still-the-data-stupid/](https://blog.shodan.io/its-still-the-data-stupid/)

Shodan crawls for most NoSQL/queueing software, including MongoDB and Redis.

And related to OP: we also crawl for FTP and attempt anonymous as well as a
few other things to better understand FTP deployments on the Internet.

~~~
minxomat
Related to your related: Nice. Is this research public or commercial?

~~~
achillean
All the data is searchable for free on
[https://www.shodan.io](https://www.shodan.io)

------
awqrre
[http://ftpsearch.ntnu.no](http://ftpsearch.ntnu.no) used to index many of
them

------
albertzeyer
That would be fun to use together with fun side project
[RandomFtpGrabber]([https://github.com/albertz/RandomFtpGrabber](https://github.com/albertz/RandomFtpGrabber))
which will download random stuff from a list of FTPs.

------
andrewfromx
"All in all, there were exactly 18,454,087 things that responded to a banner
fetch... The JSON file is about 4GiB." where can I download this JSON file? I
only see the list of ips available.

~~~
minxomat
I specifically left the mass _scanning_ part out. If you want to experiment
with scan data without scanning yourself, a good place to start is
[https://censys.io](https://censys.io).

------
repler
> consider piping the list through shuf each time you try something new. You
> know why.

No, I don't know why. Can someone explain? (without being condescending,
preferably)

~~~
sprocket
It's not entirely clear to me either, though I would assume it's so that the
first server in the list doesn't get constantly hammered by a hundred people
going through the list sequentially.
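That shuffling can also live in the script itself; a minimal Python sketch of the same idea (the helper name is mine):

```python
import random

def load_targets(path):
    """Read one host per line and return them in random order,
    so the first entries in the file don't get hammered by everyone."""
    with open(path) as f:
        targets = [line.strip() for line in f if line.strip()]
    random.shuffle(targets)  # equivalent to piping the list through shuf
    return targets
```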

~~~
repler
hmm. Yeah that makes sense.

It would be better if they simply stated that instead of playing
insinuation-based guessing games.

------
pmontra
How many of them are honeypots, intentionally left open?

------
amelius
How many are there with write access?

And how do they protect against spam?

------
andrewfromx
any stats on how many servers are running Pure-FTPd vs. Paradise FTP vs. some
other software?

~~~
achillean
There are ~300,000 Pure-FTPd instances:
[https://www.shodan.io/search?query=product%3Apure-ftpd+port%...](https://www.shodan.io/search?query=product%3Apure-ftpd+port%3A21)

Not sure how to fingerprint Paradise FTP (searching for "Welcome to Paradise"
also returned some non-relevant results) but there aren't many that contain
"paradise" in their welcome banner.

Shodan also fingerprints lots of other FTP software (check out the "Top
Products" section):

[https://www.shodan.io/report/WHJBsZqV](https://www.shodan.io/report/WHJBsZqV)

ProFTPD is by far the most popular choice at the moment.

Note: Doing a search that uses a filter (ex: "product") requires a free Shodan
account. None of the above require paid access, you just need a free account.

------
GirlsCanCode
I can see having an open anonymous FTP that was read-only as a way of serving
some files that needed to be downloaded and didn't contain any sensitive
information. It's no different from providing a URL to them.

Or even a write-only one so people could deposit data.

The only real problem is a read-write one where people can use it to exchange
information.

