I'm working on an HN reader app idea that I think is unique (more to come), and it requires regularly fetching the front-page posts and comments. I wrote a script to do that through the Firebase API, but holy shit, it ended up needing thousands of requests to get all the data for the current state of the front page.
So instead, I wrote a scraper script to produce the same thing, just with 30 requests (1 per front-page post). (Trying to turn these damn <tr>'s into a recursive comment tree was quite the mindfuck, btw.)
...is it ok to scrape HN? I see nothing in the robots.txt to say it shouldn't be allowed, and I actually feel better making 30 scraping requests rather than XXXX API requests.
EDIT: Also happy to receive suggestions. I'm coding in Ruby and don't see an obvious way to access the Firebase DB directly (quite frankly, I don't even know if what I just said makes sense), so any help is appreciated.
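For what it's worth, the official API is plain REST over HTTPS, so you don't need a Firebase client library from Ruby at all. A minimal sketch using only the standard library (endpoints per the public HN API docs; the request explosion comes from the comment trees, since every comment is its own item behind the "kids" ids):

    require 'net/http'
    require 'json'

    BASE = 'https://hacker-news.firebaseio.com/v0'

    # Fetch one endpoint of the HN Firebase REST API and parse the JSON.
    def fetch_json(path)
      JSON.parse(Net::HTTP.get(URI("#{BASE}/#{path}.json")))
    end

    # 1 request for the id list + 30 requests for the stories themselves.
    top_ids = fetch_json('topstories').first(30)
    stories = top_ids.map { |id| fetch_json("item/#{id}") }

    stories.each do |s|
      puts "#{s['score']} points: #{s['title']} (#{s['descendants']} comments)"
    end

    # Walking each story's 'kids' recursively is what blows this up into
    # thousands of requests -- one per comment.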
As far as ethics go, it's probably better to use the API than to scrape. Sure, you're making more requests, but I think the Firebase servers and CDN are far better equipped to handle a lot of requests than the HN server is. If that many requests put Firebase out in any significant way, I'm sure they would put a query load limit on the HN API.
The final product is about 17 lines of code. Which is quite compact, sure—but if it looks like Perl, who cares?
Thankfully, it looks nothing like Perl. The code is deeply intuitive and evangelizes the idea that sending an object over a network can often be defined as a composition of functions resulting in the creation of the very object that we're sending. Whether that object is an HTML document, JSON, or whatever, the alluring premise is that these can all be forged from a single bulk concatenation of strings returned by functions unrolled in an inner loop. It looks practically effortless. Might give you some ideas for batching those 30 GET requests. (Or maybe not; I've never used Firebase before!)
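If it helps to picture it, here's a rough Ruby rendition of that idea (a toy of mine, not HN's actual code): each piece of the page is just a function returning a string, and the response is one concatenation of them.

    # Toy illustration only -- not HN's code. The page is "forged" from
    # small string-returning functions composed in a single pass.
    def item_row(story)
      "<tr><td>#{story[:title]}</td></tr>"
    end

    def table(rows)
      "<table>#{rows.join}</table>"
    end

    def page(stories)
      table(stories.map { |s| item_row(s) })
    end

    puts page([{ title: 'Ask HN: Is it ok to scrape Hacker News?' }])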
UPDATE: Using the Firebase feed means you don't have to do periodic scraping. You can simply set a listener on the relevant ref (the Firebase value / child_changed events) and the API will basically tell you when there is fresh information from the front page, etc. (A rough REST-flavored sketch of the same idea follows the links below.)
 - https://tophn.info
 - https://hackernoon.com/tophn-a-fun-side-project-built-with-v...
IIRC I made this change at the suggestion of someone at YC (maybe dang, but it might have been before he took over HN) so I think it's very likely that this is the route they would prefer you to use. But if in doubt, send them an email; dang is very responsive.
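I don't know which client the parent used, but if you're stuck in Ruby with no official Firebase SDK, the Realtime Database's REST interface can stream changes as Server-Sent Events, which gets you roughly the same "tell me when the front page changes" behavior. A rough sketch (it doesn't handle Firebase's redirects or reconnects, which a real client would need):

    require 'net/http'

    # Stream changes to the front-page id list as Server-Sent Events.
    # Firebase pushes "put"/"patch" events whenever the data changes.
    uri = URI('https://hacker-news.firebaseio.com/v0/topstories.json')

    Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      req = Net::HTTP::Get.new(uri, 'Accept' => 'text/event-stream')
      http.request(req) do |res|
        res.read_body do |chunk|
          # Each event arrives as "event: put\ndata: {...}\n\n";
          # parse the data payload here and refresh your local copy.
          puts chunk
        end
      end
    end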
What would you recommend for the instance where someone's emailed email@example.com and hasn't gotten a response?
I suspect it's something I said, I'm just unsure exactly what. I asked about the posting and API ratelimits, the reviewer idea mentioned at https://news.ycombinator.com/item?id=11662380, source release (notwithstanding the voting system, which is fine), and other curios. I don't think I said anything offensive.
I have successfully gotten in touch in the past, so I'm just not sure what to do. The email was sent on the 27th of April; I just checked my spam folder (I must admit I hadn't until now...), but I don't think Gmail would have deleted any incorrectly filed replies, since they'd be less than a month old.
I'm not complaining or trying to create drama, just curious what to do next. My fear is that it's not possible to answer the "what did I do" question due to social context and so forth (a hole I've fallen into many, many times due to communication perception issues).
dang got in touch to let me know the delay was due to backlog and that he'll properly reply when he gets the chance. Absolutely fine by me; I'm super glad to know my email was fine :D
The closest thing to a source repo is this from many years ago:
Community-supported version: http://arclanguage.github.io
> A version of HN's source code is included with the public release of Arc, but HN's algorithm has many extensions that aren't public.
> We're unlikely to publish all that because doing so would increase two bad things: attempts to game the site, and meta nitpicking.
Including, at least at one time, a Chrome extension for moderation:
So security by obscurity helps at times.
Having the full list of banned keywords public, or even acknowledging there is a list, could cause drama. And it's easily evaded if people know about it; when reddit banned "Tesla", posts got through by misspelling it "Telsa".
There's also other stuff, like a controversy filter that detects articles with more comments than votes and penalizes them. I try to avoid commenting on articles that are getting close to the limit so I don't trigger it.
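For the curious, the real thresholds aren't public, but the heuristic described above is easy to picture; the 40-comment floor here is purely my guess:

    # Purely illustrative -- HN's actual algorithm isn't public. This just
    # encodes the rule of thumb above: lots of comments relative to votes
    # looks "controversial" and gets penalized in ranking.
    def looks_controversial?(item)
      comments = item['descendants'].to_i
      votes    = item['score'].to_i
      comments > 40 && comments > votes   # the 40-comment floor is a made-up guess
    end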
Usually, some numbnut "programmer" sets up a secret service account with a fixed login and a simple password. It invariably gets found, and badness ensues.
Whereas with a normal login/password, an attacker only has a 1/password_space chance of guessing their way in. The real problem is the combination of a default hidden account and no way to know about or change it.
One can hope, though.
Also (personal theory of mine) maybe the code doesn't have the degree of abstraction or separation of concerns necessary to allow open sourcing without also exposing YC business concerns or parts they want to keep secret.
Which is to say, the community of users! (You all are awesome... usually)
The HN front-page algorithms used to be more predictable; now they're a black box.
The app is top-notch, but it does not scale very nicely. Too much to hold in memory.
It has been in active development for several years and is actually quite popular on its own site.
It also incorporates several interesting concepts, like an invite tree, etc.
I don't mind HN keeping its code closed-source... but I'm definitely curious about them keeping Arc alive. IMHO, the rest of the YC stack is pretty vanilla.
I wonder if Arc is the "burning candle that must never be allowed to die out"
At least, their API was sourcing all of its data from Firebase the last time I looked at it...
My version was near real time, with a 30-60 second batching delay IIRC, and could still be what powers it for all I know.
https://news.ycombinator.com/item?id=13755673#13756819 (3 months ago, on Ask HN: Is S3 down?)
> If you don't plan to post anything, would you mind logging out? Then we can serve you from cache.
https://news.ycombinator.com/item?id=12909752#12911870 (6 months ago, on Donald Trump Is Elected President)
> please log out unless you intend to comment, so we can serve you from cache
> We buried the previous thread on this because, at almost 2000 comments, it was pegging the server. | https://news.ycombinator.com/item?id=12911042
https://news.ycombinator.com/item?id=9174869#9175457 (1 year ago, on Ask HN: Why the HN Nginx error for articles below 9M?)
> We turned old articles off a few hours ago as an emergency measure because the site was being crawled aggressively and our poor single-core Racket process couldn't keep up. Previous solutions that have served us well until now, such as IP-address rate limiting, appear no longer to suffice.
Obviously, that’s an entirely different story than services distributed over clusters of hundreds of different machines.
Plus caching for logged-out visitors.
 - traffic = 3.47 hits / s
 - "insert time / record" = ~50 ms (according to the internet)
 - giving ~50 ms / s of locked time
So I'm guessing not. Even if at peak he was seeing 10x traffic, you'd only expect short delays from the database.
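Spelling that estimate out (the writes-per-second figure is my assumption, implied by the ~50 ms/s number above; reads are assumed not to take the write lock):

    # Back-of-envelope from the figures quoted above.
    insert_ms      = 50.0   # "insert time / record", per the internet
    writes_per_sec = 1.0    # assumed: only a small fraction of the 3.47 hits/s are writes

    locked_ms = writes_per_sec * insert_ms
    puts format('normal load: %.0f ms locked per second (~%.0f%% busy)', locked_ms, locked_ms / 10)
    puts format('10x peak:    %.0f ms locked per second (~%.0f%% busy)', locked_ms * 10, locked_ms)
    # Even at 10x, the lock is free about half of every second, so writers
    # queue briefly rather than the database falling over.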
It's honestly a fantastic final solution in certain circumstances.
Another prominent example is Mercurial SCM. It uses a fine blend of the plain filesystem and custom index files, and it delivers remarkable reliability and performance across a variety of platforms.
I agree filesystems are underappreciated for database duties, but it depends a lot on usage patterns.
If you wanted to compare a filesystem to a database, it's more a hierarchical collection of tables whose columns are the file's stat() fields, with a binary record at the end of each row. But it's a somewhat useless database, because its features are all oriented toward its storage backend, not toward accessing the data. A real database lets you optimize queries and search, index, join, and select data at once, and it efficiently handles the more complex aspects of accessing and writing so you don't have to implement them yourself.
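As a toy sketch of that analogy (a hypothetical posts/ directory standing in for a table):

    # Each file is a "row": the stat() fields are the columns you can query
    # cheaply, and the body is an opaque blob a real database would let you
    # index and search. 'posts/' is a made-up directory for illustration.
    rows = Dir.glob('posts/*').map do |path|
      st = File.stat(path)
      {
        id:    File.basename(path),   # effectively the primary key
        uid:   st.uid,
        mtime: st.mtime,
        size:  st.size,
        body:  File.read(path)
      }
    end

    # "SELECT id FROM posts WHERE mtime > ? ORDER BY mtime" -- by hand, every time.
    recent = rows.select { |r| r[:mtime] > Time.now - 86_400 }
                 .sort_by { |r| r[:mtime] }
    puts recent.map { |r| r[:id] }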
If you need a database, _use a database_. Tech hipsters need to stop badly reinventing the wheel.
How many posts are there? Filesystems commonly have problems dealing with too many files, like inode limits, slowness reading large directories or in deep paths, slowness traversing large trees, path name max size, etc. Now you have to engineer a bunch of hacks around the filesystem, or try to find a perfect filesystem and be stuck with it.
What if you want to search posts? Well, you have to open and read all the posts. But what if you want to speed that up, like with a word index? Congrats, you're about to write a database/search engine.
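To make that concrete (hypothetical posts/ directory again), the naive version is a full scan, and the first optimization you reach for is already an index:

    # Naive search: open and read every post on every query.
    # Fine for hundreds of files, painful for millions.
    def search(term, dir = 'posts')
      Dir.glob("#{dir}/**/*").select do |path|
        File.file?(path) && File.read(path).include?(term)
      end
    end

    # The moment you cache something like {word => [paths]} to avoid
    # rereading everything, you've started writing the index that a
    # database or search engine would have given you for free.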
What if dynamic processing is too resource-intense and you want to serve static content? Once you apply transforms to the user content and create your static resource, you get to decide if you keep the original content, because it's not useful to you just sitting on disk.
What if you want to merge, upgrade, convert, export, etc parts of the data? You have a big job ahead of you if you didn't keep the original content, and if you did, I hope you kept metadata to make this task easier.
What if you want to spread the load, store records efficiently, control I/O methods and cache, support strong locking, platform independence, and storage backend independence?
Databases are a powerful and simple tool to add features to an application and make it easier and more efficient to process data. If you don't want app features, data-processing features, platform independence, etc., then don't use a database.