
Ask HN: Is Hackernews still using the file system as database? - tosh
I'd love to learn a bit about what the current hosting/tech stack setup looks like. IIRC the posts were stored in files on the file system and scaled quite well vertically. Is that still the case?
======
dchuk
While on the "how HN works" topic, I have a question:

I'm working on an HN app reader idea that I think is unique (more to come)
that requires regularly getting the front page posts and comments. I wrote a
script to do that through the Firebase API but holy shit it ended up needing
thousands of requests to get all the data for the current state of the front
page.

So instead, I wrote a scraper script to produce the same thing, just with 30
requests (1 per front page post). (Trying to turn these damn <tr>'s into a
recursive comment tree was quite the mindfuck btw)
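
The general shape of that recursion can be sketched in Python. On HN pages the nesting depth can be read from each comment row's spacer-image width; the `(depth, text)` rows below are a simplified stand-in for whatever the scraper extracts:

```python
# Sketch: fold a flat, page-ordered list of (depth, text) rows into a
# nested comment tree. The depth values are assumed to have already been
# extracted from the HTML (e.g. from spacer-image widths in each <tr>).

def build_tree(rows):
    """rows: iterable of (depth, text) tuples in page order."""
    root = {"text": None, "children": []}
    stack = [(-1, root)]  # (depth, node) path from root to current comment
    for depth, text in rows:
        node = {"text": text, "children": []}
        # Pop back up until the top of the stack is this comment's parent.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root["children"]
```

A stack of ancestors avoids actual recursion, so very deep threads can't blow the call stack.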

...is it ok to scrape HN? I see nothing in the robots.txt to say it shouldn't
be allowed, and I actually feel better making 30 scraping requests rather than
XXXX API requests.

EDIT: Also happy to receive suggestions. I'm coding in Ruby and don't see an
obvious way to access the Firebase DB directly (quite frankly I don't even
know if what I just said makes sense) so any help is appreciated.

~~~
tbirdz
The Algolia API has more bulk operations, so you can get more than one
response per API request. Also, if you look around, there are datasets where
people have downloaded all the HN API JSON responses and made them available,
so you could use those for local development and testing.
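
For example, the Algolia HN API's search endpoint can return every current front-page story in a single request, and its items endpoint returns a story with its whole comment tree nested under `children`, so the full front page is roughly 31 requests. A rough Python sketch (error handling and rate limiting omitted; `flatten` is just a plain tree walk):

```python
import json
import urllib.request

API = "https://hn.algolia.com/api/v1"

def front_page_ids():
    # One request: every story currently on the front page.
    with urllib.request.urlopen(f"{API}/search?tags=front_page") as r:
        return [int(hit["objectID"]) for hit in json.load(r)["hits"]]

def item_with_comments(item_id):
    # One request: the story plus its entire nested comment tree.
    with urllib.request.urlopen(f"{API}/items/{item_id}") as r:
        return json.load(r)

def flatten(item):
    # Depth-first walk over the nested "children" lists.
    yield item
    for child in item.get("children") or []:
        yield from flatten(child)
```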

As far as ethics go, it's probably better to use the API than scraping. Sure,
you're making more requests, but I think the Firebase servers and CDN have far
more capacity to handle a lot of requests than the HN server does. If that
many requests inconvenienced Firebase in any significant way, I'm sure they
would put a query load limit on the HN API.

~~~
Artemis2
HN is behind Cloudflare; if you're not logged in the pages can be cached so I
wouldn't expect small-scale scraping to have any effect. Cloudflare might
stop you though.

------
Kaizyn
The source code used to be open, and it isn't clear what happened to it. Maybe
ask over on the arclanguage.org forum, as HN is written in Arc.

The closest thing to a source repo is this from many years ago:
[https://github.com/wting/hackernews](https://github.com/wting/hackernews)

~~~
akkartik
The official sources are at
[http://arclanguage.org/install](http://arclanguage.org/install)

Community-supported version:
[http://arclanguage.github.io](http://arclanguage.github.io)

~~~
j_s
[https://news.ycombinator.com/item?id=11176894](https://news.ycombinator.com/item?id=11176894)

> _A version of HN's source code is included with the public release of Arc,
> but HN's algorithm has many extensions that aren't public._

[https://news.ycombinator.com/item?id=13456306](https://news.ycombinator.com/item?id=13456306)

> _We're unlikely to publish all that because doing so would increase two bad
> things: attempts to game the site, and meta nitpicking._

Including at least at one time a Chrome extension for moderation:

[https://news.ycombinator.com/item?id=11670071#11670562](https://news.ycombinator.com/item?id=11670071#11670562)

~~~
newsat13
> We're unlikely to publish all that because doing so would increase two bad
> things: attempts to game the site, and meta nitpicking.

So security by obscurity helps at times.

~~~
CydeWeys
Passwords are security through obscurity too. Things would be a lot less
secure if all passwords were publicly available.

~~~
pdpi
Security by obscurity is precisely defined as security that relies on the
algorithm/implementation itself being private to be able to function. Key
material being private does not qualify for this. The alternative is that
security through obscurity becomes such an all-encompassing term as to become
meaningless.

------
markkat
We at [https://hubski.com](https://hubski.com) use a very hacked version of
news.arc. Going on seven years. We moved the data to Postgres, though.

The app is top notch, but it does not scale very nicely. Too much to hold in
memory.

------
ers35
I'm working on a tool that maintains an up to date archive of Hacker News and
a companion tool to browse and search it. I'm not ready to release yet, but
here is a dump of the stories, comments, and users from the Firebase API as a
SQLite database with a full text search index:
[https://archive.org/details/hackernews-2017-05-18.db](https://archive.org/details/hackernews-2017-05-18.db)
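
The dump's actual schema isn't documented here, but querying an SQLite full-text index generally looks like this (hypothetical table and column names; assumes the FTS5 extension, which ships with most SQLite builds):

```python
import sqlite3

# Hypothetical schema -- the real column names in the archived dump may
# differ. This just demonstrates the FTS query pattern against SQLite.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE items USING fts5(title, text);
    INSERT INTO items VALUES
        ('Ask HN: file system as database?', 'IIRC posts were stored in files');
    INSERT INTO items VALUES ('Show HN: a thing', 'unrelated body');
""")
# MATCH searches all indexed columns; rank orders by relevance.
rows = db.execute(
    "SELECT title FROM items WHERE items MATCH ? ORDER BY rank", ("files",)
).fetchall()
```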

------
sandGorgon
The source of lobste.rs is actually a good second alternative to HN
[https://github.com/jcs/lobsters](https://github.com/jcs/lobsters)

It has been in active development for several years, and the site itself is
quite popular in its own right.

It also incorporates several interesting concepts, like an invite tree.

I don't mind HN close-sourcing its code... but I'm definitely curious about
keeping Arc alive. IMHO, the rest of the YC stack is pretty vanilla.

I wonder if Arc is the "burning candle that must never be allowed to die out"

------
cyberferret
The database is now on Google Firebase, is it not?

At least their API is sourcing all data from Firebase last time I looked at
it...

~~~
pvg
That's just for the API.

~~~
cyberferret
Thanks for clarifying... I'd be interested to hear about how they sweep the
data from the existing database into Firebase. Must be in close to real time,
because the Firebase API update feeds trigger every 10 to 15 seconds with
updated data.

~~~
kogir
I wrote the original integration, which required updating some Racket
libraries to support HTTP pipelining, chunked transfer encoding, and streaming
JSON serialization :)

My version was near real time, with a 30-60 second batching delay IIRC, and
could still be what powers it for all I know.
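
Streaming JSON serialization just means emitting the document in pieces rather than building one giant string; a language-agnostic sketch of the idea in Python (not the actual Racket code):

```python
import json

def stream_json_array(items):
    """Yield chunks of a JSON array without materializing the whole string."""
    yield "["
    for i, item in enumerate(items):
        if i:
            yield ","
        yield json.dumps(item)
    yield "]"
```

Chunks produced this way can be written straight into a chunked-transfer HTTP response, which is why the two features pair naturally.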

------
zitterbewegung
I would also like to know how they have optimized Arc and its underlying
Racket runtime. They seem to have proven that Arc can scale.

~~~
jacquesm
It mostly proves that Cloudflare can scale. Not that long ago Dan asked us to
log out due to server load, so that pages would be served from the Cloudflare
cache rather than being generated on the fly.

~~~
j_s
Thanks to DanG and others for their work keeping HN running! (for the past 3+
years:
[https://news.ycombinator.com/item?id=7493856](https://news.ycombinator.com/item?id=7493856)
)

\--

HN Scalability

[https://news.ycombinator.com/item?id=13755673#13756819](https://news.ycombinator.com/item?id=13755673#13756819)
(3 months ago, on _Ask HN: Is S3 down?_ )

> _If you don't plan to post anything, would you mind logging out? Then we
> can serve you from cache._

[https://news.ycombinator.com/item?id=12909752#12911870](https://news.ycombinator.com/item?id=12909752#12911870)
(6 months ago, on _Donald Trump Is Elected President_ )

> _please log out unless you intend to comment, so we can serve you from
> cache_

> _We buried the previous thread on this because, at almost 2000 comments, it
> was pegging the server._ |
> [https://news.ycombinator.com/item?id=12911042](https://news.ycombinator.com/item?id=12911042)

[https://news.ycombinator.com/item?id=9174869#9175457](https://news.ycombinator.com/item?id=9174869#9175457)
(1 year ago, on _Ask HN: Why the HN Nginx error for articles below 9M?_ )

> _We turned old articles off a few hours ago as an emergency measure because
> the site was being crawled aggressively and our poor single-core Racket
> process couldn't keep up. Previous solutions that have served us well until
> now, such as IP-address rate limiting, appear no longer to suffice._

\--

Source:
[https://hn.algolia.com/?query=author:dang%20cache&type=comme...](https://hn.algolia.com/?query=author:dang%20cache&type=comment)

------
peterwwillis
Don't use a filesystem as a database. We did that back in 1998 mostly just
because we were using 1998-era computing power, and the more static content
you could serve the less likely your box would get hosed by Slashdot. Now my
watch has more computing power than those old servers. Use a real database,
even a file-backed one like BerkeleyDB (or, god forbid, SQLite).

~~~
STRML
"God forbid, SQLite"? It's honestly a great product and a wonderful stepping-
stone. I wish more new projects used it rather than half-measures like Mongo.

~~~
nerdwaller
100% agreed, it can handle a few hundred thousand hits a day. Obviously not okay
for HN scale - but it's what my personal site and a bunch of side projects
use.

~~~
hdhzy
Do you have problems with locking if two requests are made at the same time?

~~~
akx
AFAIU it's only locked for concurrent writes, so a largely read-only workload
should be okay.

~~~
hdhzy
Reading the notes at [0], it seems it takes a little bit of configuration to
get a good setup.

[0]:
[https://secure.php.net/manual/en/sqlite3.exec.php#usernotes](https://secure.php.net/manual/en/sqlite3.exec.php#usernotes)
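
In practice that configuration usually amounts to WAL mode (readers no longer block behind a writer) plus a busy timeout; a Python sketch:

```python
import os
import sqlite3
import tempfile

# WAL needs a real file; :memory: databases report journal_mode=memory.
path = os.path.join(tempfile.mkdtemp(), "app.db")

# timeout: wait up to 5s for a lock instead of failing immediately.
db = sqlite3.connect(path, timeout=5.0)
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
db.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL
```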

