Ask HN: Is Hackernews still using the file system as database?
264 points by tosh on May 18, 2017 | 78 comments
I'd love to learn a bit about what the current hosting/tech stack setup looks like. IIRC the posts were stored in files on the file system and scaled quite well vertically. Is that still the case?

While on the "how HN works" topic, I have a question:

I'm working on an HN app reader idea that I think is unique (more to come) that requires regularly getting the front page posts and comments. I wrote a script to do that through the Firebase API but holy shit it ended up needing thousands of requests to get all the data for the current state of the front page.

So instead, I wrote a scraper script to produce the same thing, just with 30 requests (1 per front page post). (Trying to turn these damn <tr>'s into a recursive comment tree was quite the mindfuck btw)

...is it ok to scrape HN? I see nothing in the robots.txt to say it shouldn't be allowed, and I actually feel better making 30 scraping requests rather than XXXX API requests.

EDIT: Also happy to receive suggestions. I'm coding in Ruby and don't see an obvious way to access the Firebase DB directly (quite frankly I don't even know if what I just said makes sense) so any help is appreciated.
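
For the step of turning flat table rows into a recursive comment tree, here is a minimal sketch in Python. It assumes the scraper has already reduced each row to a (depth, author, text) tuple in page order; that tuple shape and the names Comment and build_tree are made up for illustration and say nothing about HN's actual markup.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    author: str
    text: str
    children: list = field(default_factory=list)


def build_tree(rows):
    """Turn flat (depth, author, text) rows, in page order, into nested comments."""
    roots = []
    stack = []  # stack[i] is the most recent comment seen at depth i
    for depth, author, text in rows:
        node = Comment(author, text)
        del stack[depth:]  # forget comments at this depth or deeper
        if stack:
            stack[-1].children.append(node)  # parent is the nearest shallower comment
        else:
            roots.append(node)
        stack.append(node)
    return roots
```

The stack holds the current chain of ancestors, so each row is attached in constant time and the whole page is processed in one pass.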

The Algolia API has more bulk operations, so you can get more than one response per API request. Also if you look around, there are some datasets where people have downloaded all the HN API JSON responses, and have made them available, so you could use those for local development and testing.

As far as ethics go, it's probably better to use the API than scraping. Sure you're making more requests, but I think the Firebase servers and CDN have far more ability to handle a lot of requests than the HN server does. If having that many requests would put Firebase out in any significant way, then I'm sure they would put a query load limit on the HN API.
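
The request count discussed above follows from the shape of the Firebase API: each item is fetched on its own, and a story's comments are reached by following the `kids` ids one by one. Here is a rough Python sketch of that walk. The helper names fetch_json and fetch_thread and the counter dict are illustrative only; the fetch function is injectable so the walk can be tried against canned data without touching the network.

```python
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path):
    """One HTTP request against the public HN Firebase API."""
    with urlopen(f"{BASE}/{path}.json") as resp:
        return json.load(resp)


def fetch_thread(item_id, fetch=fetch_json, counter=None):
    """Fetch an item and, recursively, every comment listed in its `kids`."""
    if counter is not None:
        counter["requests"] += 1  # one request per item, however deep it sits
    item = fetch(f"item/{item_id}")
    if item is None:  # nothing stored under that id
        return None
    children = (fetch_thread(k, fetch, counter) for k in item.get("kids", []))
    item["children"] = [c for c in children if c is not None]
    return item
```

A front page snapshot would call fetch_thread once per story id taken from `topstories`, so the total is one request per story plus one per comment, which is where the thousands come from.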

HN is behind Cloudflare; if you're not logged in the pages can be cached so I wouldn't expect small-scale scraping to have any effect. Cloudflare might stop you though.

It sounds like your struggles are with Firebase more than the actual scraping part, so I'm not sure how useful this will be. But there's a marvelous demonstration of Clojure's ability to render and serve a scraped digest of the HackerNews homepage.


The final product is about 17 lines of code. Which is quite compact, sure—but if it looks like Perl, who cares?

Thankfully, it looks nothing like Perl. The code is deeply intuitive and evangelizes the idea that sending an object over a network can often be defined as a composition of functions resulting in the creation of the very object that we're sending. Whether that object is an HTML document, JSON, or whatever, the alluring premise is that these can all be forged from a single bulk-concatenation of strings returned by functions being unrolled in an inner-loop. It looks practically effortless. Might give you some ideas for batching those 30 GET requests. (Or maybe not—I've never used firebase before!)

Very cool! Always wanted to get into Clojure, I just don't code for a living anymore so only do it in my free time so I tend to turn to tools I know. Bookmarked though!

The Firebase API is the way to go. I built a small app to do exactly as you described [0] and wrote a blog post about it [1]. Drop me a line if you need any more info.

UPDATE: Using the Firebase feed means you don't have to do periodic scraping. You can simply set a listener for the FB change() event and the API will basically tell you when there is fresh information from the front page etc.

[0] - https://tophn.info

[1] - https://hackernoon.com/tophn-a-fun-side-project-built-with-v...

I can't seem to find anything that would let me do what you did, but with Ruby. I'm not opposed to using Node, just prefer Ruby more.

For Hacker News Daily I make requests to https://hacker-news.firebaseio.com/v0/

IIRC I made this change at the suggestion of someone at YC (maybe dang, but it might have been before he took over HN) so I think it's very likely that this is the route they would prefer you to use. But if in doubt, send them an email; dang is very responsive.


What would you recommend for the instance where someone's emailed hn@ycombinator.com and hasn't gotten a response?

I suspect it's something I said, I'm just unsure exactly what. I asked about the posting and API ratelimits, the reviewer idea mentioned at https://news.ycombinator.com/item?id=11662380, source release (notwithstanding the voting system, which is fine), and other curios. I don't think I said anything offensive.

I have successfully gotten in touch in the past, so I'm just not sure what to do. The email was sent on the 27th of April; I just checked my spam folder (I must admit I didn't until now...), but I don't think gmail would have deleted any incorrectly-filed replies, since they'd be less than a month old.

I'm not complaining or trying to create drama, just curious what to do next. My fear is that it's not possible to answer the "what did I do" question due to social context and so forth (a hole I've fallen into many, many times due to communication perception issues).

Update for the record: I was, very happily, completely wrong.

dang got in touch to let me know the delay was due to backlog and that he'll properly reply when he gets the chance. Absolutely fine by me; I'm super glad to know my email was fine :D

Check out my implementation of comments parsing from the html, which is used in HNES: https://github.com/ibejoeb/HNES/commit/054fc4e626137aadef6f0...

Nice! I basically did it the same way. I'll push my code up to a public repo soon and let you know to compare.

Happy to help - email is in my profile.

Am I crazy? I don't see an email in your profile.

It is in the 'about' field.

It wasn't at the time your parent posted. It is there now. That, or we're experiencing folie à deux.

hckrnews.com ?

The source code used to be open, and it isn't clear what happened to it. Maybe ask over on the arclanguage.org forum, as HN is written in Arc.

The closest thing to a source repo is this from many years ago: https://github.com/wting/hackernews

The official sources are at http://arclanguage.org/install

Community-supported version: http://arclanguage.github.io


> A version of HN's source code is included with the public release of Arc, but HN's algorithm has many extensions that aren't public.


> We're unlikely to publish all that because doing so would increase two bad things: attempts to game the site, and meta nitpicking.

Including at least at one time a Chrome extension for moderation:


It's a shame that in closing the door to the pernicious "meta nitpicking", whatever that is, the possibility for constructive analysis of how HN's algorithms shape and mould the behaviour of its dedicated community has also been removed.

Sufficient amounts of "constructive" analysis by uninvolved and under-informed third parties can be a net negative.

So how about informing said parties?

Everything is tradeoffs. Time spent educating outsiders who may or may not be useful can be spent doing other things, like work that moves the ball forward.

That would work for people of good will, but not for people who are in a political-war state of mind.

> We're unlikely to publish all that because doing so would increase two bad things: attempts to game the site, and meta nitpicking.

So security by obscurity helps at times.

Someone once did an analysis on HN posts and their ranking relative to the time they were posted and votes. And they found posts with certain keywords were heavily penalized and sort of soft banned from the site. IIRC it included stuff like "NSA" and "HN" and posts from certain sites like reddit and youtube (but I could be misremembering).

Having the full list of banned keywords, or even acknowledging there is a list, could cause drama. And it's easily evaded if people know about it, like when reddit banned "Tesla" and posts got through by misspelling it "Telsa".

There's also other stuff, like a controversy filter that detects articles with more comments than votes and penalizes them. I try to avoid commenting in articles that are getting close to the limit to avoid triggering it.

Of course it helps; I don't think anybody's ever denied that. It just shouldn't be relied on.

It's not about security, just content curation.

Reddit has a similar approach if I'm not mistaken.

Passwords are security through obscurity too. Things would be a lot less secure if all passwords were publicly available.

Security by obscurity is precisely defined as security that relies on the algorithm/implementation itself being private to be able to function. Key material being private does not qualify for this. The alternative is that security through obscurity becomes such an all-encompassing term as to become meaningless

In that case, 256 bit encryption keys are security-through-obscurity too, they're just realllllly obscure.
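
To put the scale of "really obscure" into numbers, here is a back-of-the-envelope Python sketch. The function name years_to_search and any guess rate passed to it are assumptions for illustration; it models a plain brute-force search over a uniformly random secret and nothing more.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_search(keyspace_size, guesses_per_second):
    """Expected years to brute-force one uniformly random secret."""
    expected_guesses = keyspace_size / 2  # on average, half the keyspace
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR
```

For example, years_to_search(2**256, 1e12) models a 256-bit key against a hypothetical trillion guesses per second, while years_to_search(26**8, 1e12) models an 8-letter lowercase password (about 2 × 10^11 possibilities) at the same rate.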

Indeed. The difference between "Security by obscurity" versus login/passwords is really scale.

Usually, some numbnut "programmer" sets a no-login and a simple password as a secret service account. It invariably is found, and badness ensues.

Whereas login/password is a 1/password_space chance of getting it. It's the combination of a default hidden account and no way to know/change it.

It would be easier if they had an injectable function to handle moderation / anti-gaming etc. Release the code publicly with a stub function, then run the real one in prod.
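
The stub idea could look roughly like this in Python. Everything here is hypothetical: the names no_moderation and rank, the multiplier convention, and the use of a caller-supplied score function are illustration only and are not taken from news.arc or from HN's real ranking code.

```python
def no_moderation(story):
    """Public stub: applies no penalty. A private deployment would swap this out."""
    return 1.0


def rank(stories, score, moderation=no_moderation):
    """Sort stories by base score times a pluggable moderation multiplier."""
    return sorted(stories, key=lambda s: score(s) * moderation(s), reverse=True)
```

The public release would ship only no_moderation, and production would pass its own function as the moderation argument, so the ranking code itself stays identical in both places.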

Hacker News deviates from the original Arc code, though. Unfortunately, because YCombinator wants to protect their secret sauce, they're not likely to release anything interesting.

One can hope, though.

It's not like the tech behind Hacker News is what gives it its value. It's because it's associated with Y Combinator. Not sure why they aren't willing to release it.

Because, for some stupid reason, having your startup or blog post appear on the Hacker News frontpage is a Really Big Deal, and the staff don't want anyone peeking at their algorithms to find ways to game the system. Which means the really interesting problems they have to solve like voting ring detection, spam detection, etc. are the ones they don't want anyone to see their solutions to.

Also (personal theory of mine) maybe the code doesn't have the degree of abstraction or separation of concerns necessary to allow open sourcing without also exposing YC business concerns or parts they want to keep secret.

ycombinator is 0% of the value IMO. Interesting news and comments for me

> and comments

Which is to say, the community of users! (You all are awesome... usually)

The current code base would be interesting.

The HN frontpage algorithms used to be more predictable, now it's a black box.

We at https://hubski.com use a very hacked version of news.arc. Going on seven years. We moved data to postgres though.

The app is top notch, but it does not scale very nicely. Too much to hold in memory.

I'm working on a tool that maintains an up to date archive of Hacker News and a companion tool to browse and search it. I'm not ready to release yet, but here is a dump of the stories, comments, and users from the Firebase API as a SQLite database with a full text search index: https://archive.org/details/hackernews-2017-05-18.db

The source of lobste.rs is actually a good second alternative to HN https://github.com/jcs/lobsters

It has been in active development for several years, and is actually quite popular in its own right.

It also incorporates several interesting concepts like an invite tree, etc.

I don't mind HN close-sourcing its code... But I'm definitely curious about keeping Arc alive. IMHO, the rest of the YC stack is pretty vanilla.

I wonder if Arc is the "burning candle that must never be allowed to die out"

The database is now on Google Firebase, is it not?

At least their API is sourcing all data from Firebase last time I looked at it...

That's just for the API.

Thanks for clarifying... I'd be interested to hear about how they sweep the data from the existing database into Firebase. Must be in close to real time, because the Firebase API update feeds trigger every 10 to 15 seconds with updated data.

I wrote the original integration, which required updating some Racket libraries to support HTTP pipelining, chunked transfer encoding, and streaming JSON serialization :)

My version was near real time, with a 30-60 second batching delay IIRC, and could still be what powers it for all I know.

I would also like to know how they have optimized Arc and its underlying Racket runtime. They seem to have proven that Arc can scale.

It mostly proves that Cloudflare can scale. Not that long ago Dan asked us to log out due to server load so that pages would be served from the Cloudflare cache rather than having to be generated on the fly.

Thanks to DanG and others for their work keeping HN running! (for the past 3+ years: https://news.ycombinator.com/item?id=7493856 )


HN Scalability

https://news.ycombinator.com/item?id=13755673#13756819 (3 months ago, on Ask HN: Is S3 down?)

> If you don't plan to post anything, would you mind logging out? Then we can serve you from cache.

https://news.ycombinator.com/item?id=12909752#12911870 (6 months ago, on Donald Trump Is Elected President)

> please log out unless you intend to comment, so we can serve you from cache

> We buried the previous thread on this because, at almost 2000 comments, it was pegging the server. | https://news.ycombinator.com/item?id=12911042

https://news.ycombinator.com/item?id=9174869#9175457 (1 year ago, on Ask HN: Why the HN Nginx error for articles below 9M?)

> We turned old articles off a few hours ago as an emergency measure because the site was being crawled aggressively and our poor single-core Racket process couldn't keep up. Previous solutions that have served us well until now, such as IP-address rate limiting, appear no longer to suffice.


Source: https://hn.algolia.com/?query=author:dang%20cache&type=comme...

As of a few years ago, HN could run fine without CloudFlare and just nginx. CF was mostly to try and obscure the server's IP, and to handle (rare) DDoS attacks.

As opposed to just, you know, using ESI so your page loads are always from the cache.

They still ask us to do that when the thread has lots of comments.

Scale in what sense? I'm genuinely curious. Isn't this mostly a network/storage/memory bound problem?

Back in the day, HN was a single-core process on a single machine.

Obviously, that’s an entirely different story than services distributed over clusters of hundreds of different machines.

190 days ago still was:


Plus caching for logged-out visitors.

Don't use a filesystem as a database. We did that back in 1998 mostly just because we were using 1998-era computing power, and the more static content you could serve the less likely your box would get hosed by Slashdot. Now my watch has more computing power than those old servers. Use a real database, even a file-backed one like BerkeleyDB (or god forbid, SQLite)

"God forbid, SQLite"? It's honestly a great product and a wonderful stepping-stone. I wish more new projects used it rather than half-measures like Mongo.

100% agreed, it can handle a few hundred thousand hits a day. Obviously not okay for HN scale - but it's what my personal site and a bunch of side projects use.

Do you have problems with locking if two requests are made at the same time?

"Few hundred thousand" ~= 300,000 /day

= 3.47 hits / s

"Insert time / record" = ~50 ms (according to the internet)

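= ~174 ms / s locked time (3.47 hits / s x 50 ms, if every hit is a write).

So I'm guessing not, at average load: the database would be locked less than 20% of the time. At a 10x peak with every hit writing you'd start to queue, but if most hits are reads you'd only expect short delays from the database.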
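
Working the figures above through in code makes the estimate easy to vary. This is a small Python sketch; the function name write_lock_fraction and the write_share parameter are made up here, and the 50 ms insert time is the rough figure quoted above, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def write_lock_fraction(hits_per_day, write_seconds, write_share=1.0):
    """Fraction of each second a single-writer database spends locked for writes."""
    writes_per_second = hits_per_day * write_share / SECONDS_PER_DAY
    return writes_per_second * write_seconds
```

With 300,000 hits a day, 50 ms per write and every hit counted as a write, write_lock_fraction(300_000, 0.050) comes to about 0.17, i.e. roughly 174 ms of lock time per second. A result above 1.0 means writes arrive faster than a single writer can apply them; lowering write_share models a workload that is mostly reads.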

AFAIU it's only locked for concurrent writes, so a largely read-only workload should be okay.
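
As one way to set SQLite up for a mostly-read workload, here is a Python sketch that turns on write-ahead logging and a busy timeout. The helper name open_db and the 5 second default are arbitrary choices for illustration; the `PRAGMA journal_mode=WAL` statement and the `timeout` argument of sqlite3.connect are standard.

```python
import sqlite3


def open_db(path, busy_timeout_s=5.0):
    """Open a SQLite file in WAL mode, where readers do not block on the writer,
    and wait up to busy_timeout_s for a lock instead of failing immediately."""
    conn = sqlite3.connect(path, timeout=busy_timeout_s)
    # The pragma returns the journal mode now in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode
```

WAL mode applies to database files on disk; there is still only one writer at a time, so the timeout decides how long a second writer waits before giving up.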

Reading the notes at [0], it seems it takes a little bit of configuration to get a good setup.

[0]: https://secure.php.net/manual/en/sqlite3.exec.php#usernotes

I used SQLite in a side project a few years ago. It was very simple (like 2 tables/collections), but had a lot of updates every day. I eventually switched to MongoDB because SQLite was too slow (and also I wanted to play with NoSQL stuff).

> wonderful stepping-stone

It's honestly a fantastic final solution in certain circumstances.

A curious example from memory. Subversion VCS circa 2005 initially provided two backends: Berkeley DB and file system. Everyday Subversion usage proved that the file system backend was more reliable, more scalable, easier to manage and less prone to occasional data inconsistencies. So the file system has its practical use as a DB.

Another prominent example is Mercurial SCM. It uses a fine blend of file system and custom index files and shows miracles in terms of reliability and performance across a variety of platforms.

And with SSDs and HDD memory plus OS disk caching, HDD storage can be very fast. It works great as simple key/value storage. Duplication (read slaves) and sharding also work great. Modern file systems such as ZFS also help speed things up and provide nice features like "snapshots" that aren't even possible in modern databases.

To be fair though, BDB is hardly a poster child for databases. It's very fragile and extremely prone to corruption.

I agree filesystems are underappreciated for database duties, but it depends a lot on usage patterns.

You are probably downvoted because you are essentially saying "Use a database for the sake of using a database!" which is not very good advice. Use the best, simplest tech your problem needs. If a filesystem is sufficient, that's really good because it's stupid simple and cheap to operate.

A file system is essentially a single node document store indexed with a key that is the file path. It's a database!
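
Read literally, that description fits in a few lines of Python. This is only a sketch: the class name FileStore and its put/get methods are invented for illustration, keys are not validated (a key like "../x" would escape the root), and nothing here reflects how HN actually stores its data.

```python
import os
import tempfile
from pathlib import Path


class FileStore:
    """Minimal key/value store: the key is a relative path, the value is bytes."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, value):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename over the
        # target, so a reader never sees a half-written value.
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        with os.fdopen(fd, "wb") as f:
            f.write(value)
        os.replace(tmp, path)

    def get(self, key):
        try:
            return (self.root / key).read_bytes()
        except FileNotFoundError:
            return None
```

Lookups by key are just path lookups, which is the part a filesystem does well; anything like searching by content still needs an index built on top.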

A filesystem is a filesystem. A database is a database. We have different words for things for a reason.

If you wanted to compare a filesystem to a database, it's more a hierarchical collection of tables whose columns are the file stat() with a binary record at the end of each row. But it's somewhat of a useless database because its features are all oriented at its storage backend, not accessing the data. A real database allows you to optimize queries and search, index, join and select data at once and efficiently handles the more complex aspects of accessing and writing so you don't have to implement it yourself.

If you need a database, _use a database_. Tech hipsters need to stop badly reinventing the wheel.

This is a bit "no true Scotsman". By viewing a filesystem as a type of DB you can think more laterally about your tech choices.

No, I was saying to use a database because that's what the OP was asking about. And I got downvoted because people think using the simplest tech possible is a good idea. If that were true, we wouldn't be using a containerized virtualized app written in a high level language. Stupid simple now means it's a pain in the ass later.

Using simple things as much as possible is always a good idea until you need a more complex thing.

Why did we need the databases in the first place?

So you have a post on a forum. You can just save that to disk by itself and access it later. And here come the problems.

How many posts are there? Filesystems commonly have problems dealing with too many files, like inode limits, slowness reading large directories or in deep paths, slowness traversing large trees, path name max size, etc. Now you have to engineer a bunch of hacks around the filesystem, or try to find a perfect filesystem and be stuck with it.

What if you want to search posts? Well, you have to open and read all the posts. But what if you want to speed that up, like with a word index? Congrats, you're about to write a database/search engine.

What if dynamic processing is too resource-intense and you want to serve static content? Once you apply transforms to the user content and create your static resource, you get to decide if you keep the original content, because it's not useful to you just sitting on disk.

What if you want to merge, upgrade, convert, export, etc parts of the data? You have a big job ahead of you if you didn't keep the original content, and if you did, I hope you kept metadata to make this task easier.

What if you want to spread the load, store records efficiently, control I/O methods and cache, support strong locking, platform independence, and storage backend independence?

Databases are a powerful and simple tool to add features to an application and make it easier and more efficient to process data. If you don't want app features, data processing features, platform independence, etc then don't use a database.
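
The word index mentioned above is the smallest version of that "you're about to write a search engine" step. Here is a Python sketch of an in-memory inverted index; the names build_index and search and the crude tokenizer are illustrative, and a real setup would also need persistence and updates.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(posts):
    """Map each lowercased word to the set of post ids containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(post_id)
    return index


def search(index, query):
    """Return ids of posts containing every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

Searching now touches only the index rather than opening every post, at the cost of keeping the index in sync whenever a post is added or edited.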

What about the converse question? Why does the filesystem not provide indexing, tabulation, relationships among documents, archival tools, tools for merging two diverged versions of a folder tree, etc.? (Is there one that does it? I don't know!). E.g. Why can't I store a slidedeck by slides and show a "document" that is composed of transcluded bits and pieces? I think pondering alternate systems from time to time is valuable.
