>What database did you use? We didn't use one. We just stored everything in files. The Unix file system is pretty good at not losing your data, [...] It is a common mistake to think of Web-based apps as interfaces to databases ...
This. I mean URLs were designed to encode Unix file names. Today you'd probably say 'it's a common mistake to think of Web-based apps as interfaces to "REST APIs"'.
Now the question of all questions is why Viaweb was recreated in a non-Lisp language (though probably discussed to death here) ...
The Unix file system is a database, just not a particularly capable one. It is a hierarchical key-value database, with the file/directory name being the key. If your data fits that schema you're golden; otherwise, not so much.
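In code terms, the schema is roughly this (a minimal Python sketch; the db/ root is made up):

    from pathlib import Path

    ROOT = Path("db")  # hypothetical root directory

    def put(key: str, value: bytes) -> None:
        # the hierarchical key IS the path: "users/pg/about" -> db/users/pg/about
        path = ROOT / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def get(key: str) -> bytes:
        return (ROOT / key).read_bytes()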
True, but a bit reductionist. A Unix file system also gives certain guarantees. For example, you can atomically rename a complete directory tree while client programs keep accessing the "old" one; you have a large choice of excellent SCMs for tracking files; you can have networked and/or distributed file systems, uniform permissions or ACLs ... all of which come in handy if you're running a web server.
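One standard way to get the atomic-rename guarantee is to reach the live tree through a symlink and swap the symlink. A minimal Python sketch (the symlink layout and paths are my own assumption, not anything Viaweb documented):

    import os

    def publish(live_link: str, new_tree: str) -> None:
        """Atomically repoint the `live_link` symlink at `new_tree`.

        rename() over an existing path is atomic on POSIX, so a reader
        sees either the old tree or the new one, never a half state, and
        files already open under the old tree stay readable.
        """
        tmp = live_link + ".swap"
        if os.path.lexists(tmp):       # clean up a stale staging link
            os.unlink(tmp)
        os.symlink(new_tree, tmp)      # stage a link to the new tree
        os.replace(tmp, live_link)     # the atomic swap

    # e.g. build "catalog.v2" offline, then: publish("catalog", "catalog.v2")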
>> We didn't use one. We just stored everything in files
Meaning what? What kind of file formats? What did you query the data with, etc.?
>> It turned out the Yahoo accounting department used Oracle.
I remember reading some Oracle advertisement (PC Magazine circa 2001) where a company like Amazon switched to them 'in one day'. It was probably the marketing department or something.
If it's anything like Hacker News, the way it works is as follows:
1. on startup, load all items into memory from the files.
2. whenever an item is changed, save it to disk.
In modern times, you can store each item as a separate .json file, for example.
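A minimal sketch of that pattern in Python (the ItemStore name and one-JSON-file-per-item layout are just illustrative):

    import json
    import os
    from pathlib import Path

    class ItemStore:
        """One JSON file per item; everything lives in memory between writes."""

        def __init__(self, root: str):
            self.root = Path(root)
            self.root.mkdir(exist_ok=True)
            # 1. on startup, load all items into memory from the files
            self.items = {p.stem: json.loads(p.read_text())
                          for p in self.root.glob("*.json")}

        def save(self, item_id: str, item: dict) -> None:
            # 2. whenever an item is changed, save it to disk;
            # temp-file-plus-rename so a crash can't leave a half-written file
            self.items[item_id] = item
            path = self.root / (item_id + ".json")
            tmp = path.with_suffix(".tmp")
            tmp.write_text(json.dumps(item))
            os.replace(tmp, path)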
With this technique, there is no risk of data corruption (as long as each file is written atomically: write to a temp file, then rename it into place). There is a risk of inconsistency; e.g. if I remember correctly, when you upvote an item, the vote is saved to disk, then the author's karma is incremented, and finally the author's profile is saved to disk. If the webserver dies between saving the vote and saving the karma count, the karma will no longer be the proper value. Stuff like that. Such things tend not to matter if you design it carefully, though.
EDIT: I was curious what the order of operations was, so I pulled up HN's old source code:
    ; (excerpt; the enclosing forms are cut off, hence the extra closing parens)
    (unless (or (author user i)
                (and (is ip i!ip) (~editor user))
                (is i!type 'pollopt))
      (++ (karma i!by) (case dir up 1 down -1))
      (save-prof i!by))
    (wipe (comment-cache* i!id)))
    (push vote i!votes)
    (save-item i)
    (push (list (seconds) i!id i!by (sitename i!url) dir)
          (uvar user votes))
    (= ((votes* user) i!id) vote)
    (save-votes user)
    (zap [firstn votewindow* _] (uvar user votes))
    (save-prof user)
    (push (cons i!id vote) recent-votes*))))
The user's karma is incremented or decremented, then the user's profile is saved; the vote is added to the item's votes, then the item is saved; the vote is stored in the global votes table, then the votes table is saved; the vote is added to the user's "votes" list, then the user's profile is saved.
Each file held an item (a product or section page), and it was stored as an s-expression.
Storing Lisp data as s-expressions is even more natural than storing Javascript data as JSON, which takes some effort if you want it to handle objects correctly.
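For comparison, here's roughly what that effort looks like, sketched in Python rather than Javascript (the Product class is made up): without explicit hooks in both directions, objects don't round-trip through JSON.

    import json

    class Product:
        def __init__(self, name: str, price: int):
            self.name, self.price = name, price

    # json.dumps(Product("mug", 1200)) raises TypeError: you must spell out
    # how objects become dicts and how dicts become objects again.
    def encode(obj):
        if isinstance(obj, Product):
            return {"__type__": "Product", "name": obj.name, "price": obj.price}
        raise TypeError(f"cannot serialize {type(obj)}")

    def decode(d):
        if d.get("__type__") == "Product":
            return Product(d["name"], d["price"])
        return d

    text = json.dumps(Product("mug", 1200), default=encode)
    item = json.loads(text, object_hook=decode)

With s-expressions, the printed form of the data is already the source form, so the serializer and parser come for free.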
> It is a common mistake to think of Web-based apps as interfaces to databases. Desktop apps aren't just interfaces to databases; why should Web-based apps be any different? The hard part is not where you store the data, but what the software does.
There's an alternate take on this point. Maybe the mistake is to not think of desktop apps as interfaces to databases (regardless of what the software actually does).
For Viaweb, users (i.e., e-commerce sellers) paid $100+ per month, so a process per user isn't expensive. Shoppers didn't require so much state (just a shopping cart), so that part of the system kept all the state in a sharded database.
There is some routing infrastructure required at the front end to route requests, time out processes after inactivity, and reload them when the user comes back. But it can give a very snappy response to have all the data needed to serve a request already loaded in memory.
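Purely as a guess at the shape of that routing layer, in Python (the `backend` command, its flags, and the timeout are all hypothetical, not how Viaweb actually did it):

    import subprocess
    import time

    IDLE_TIMEOUT = 600  # seconds; an assumption, not Viaweb's number

    class ProcessRouter:
        """Per-user backend processes: spawn on demand, reap when idle."""

        def __init__(self):
            self.procs = {}  # user -> (process, last-active timestamp)

        def route(self, user: str) -> subprocess.Popen:
            proc, _ = self.procs.get(user, (None, 0.0))
            if proc is None or proc.poll() is not None:
                # "backend" is a hypothetical per-user server that reloads
                # its state from files on startup
                proc = subprocess.Popen(["backend", "--user", user])
            self.procs[user] = (proc, time.time())
            return proc

        def reap_idle(self) -> None:
            now = time.time()
            for user, (proc, last) in list(self.procs.items()):
                if now - last > IDLE_TIMEOUT:
                    proc.terminate()  # state is already on disk
                    del self.procs[user]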
Not viable for handling many users at once on a single Unix box, no. Context switching between the processes starts to dominate your time. It's an old-school approach that is really tidy at relatively low concurrency.
In the early 2000s, this area was sometimes called the “C10k problem” - can you handle 10,000 concurrent connections on one machine? See https://en.m.wikipedia.org/wiki/C10k_problem
These days, most servers can blow way past that, even into millions, but none that I'm aware of do that with a process-per-connection model.
It depends. If you're caching ("weakly", e.g. revalidating If-Modified-Since against the file's mtime, or even "strongly", i.e. serving aggressively without revalidating), which you should, you're creating processes only for a small fraction of requests. The remaining major overhead is reparsing the dynamic-language backend code on each request, which you can further reduce by using native code.
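The "weak" revalidation part is only a few lines. A sketch with Python's stdlib HTTP server (the ./pages layout is an assumption):

    import os
    from email.utils import formatdate, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class RevalidatingHandler(BaseHTTPRequestHandler):
        """Answer If-Modified-Since against the file's mtime, so repeat
        visitors get a cheap 304 instead of a full regeneration."""

        def do_GET(self):
            # assumed layout: pregenerated pages under ./pages
            path = os.path.join("pages", self.path.lstrip("/") or "index.html")
            try:
                mtime = os.stat(path).st_mtime
            except FileNotFoundError:
                self.send_error(404)
                return
            ims = self.headers.get("If-Modified-Since")
            # HTTP dates have one-second resolution, hence int(mtime)
            if ims and parsedate_to_datetime(ims).timestamp() >= int(mtime):
                self.send_response(304)
                self.end_headers()
                return
            with open(path, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8080), RevalidatingHandler).serve_forever()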
The cache is served by an HTTP server still, though. What you are describing is a sort of hybrid, where most requests are handled in threads (either OS or green threads) but a small fraction get their own process. I think I agree that that could work, but it sounds a bit different from creating a process per connection.
I don't think parsing code is the major overhead. It's not really about starting the processes so much as switching between them when concurrently handling a bunch of requests.