>What database did you use? We didn't use one. We just stored everything in files. The Unix file system is pretty good at not losing your data, [...] It is a common mistake to think of Web-based apps as interfaces to databases ...
This. I mean URLs were designed to encode Unix file names. Today you'd probably say 'it's a common mistake to think of Web-based apps as interfaces to "REST APIs"'.
Now the question of all questions is why Viaweb was recreated in a non-Lisp language (though probably discussed to death here) ...
The Unix file system is a database, just not a particularly capable one. It is a hierarchical key-value database, with the file/directory name being the key. If your data fits that schema you're golden; otherwise, not so much.
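In code terms, the schema is roughly this (a minimal Python sketch; the db/ root is made up):

    from pathlib import Path

    ROOT = Path("db")  # hypothetical root directory

    def put(key: str, value: bytes) -> None:
        # the hierarchical key IS the path: "users/pg/about" -> db/users/pg/about
        path = ROOT / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def get(key: str) -> bytes:
        return (ROOT / key).read_bytes()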
True, but a bit reductionist. A Unix file system also gives certain guarantees. For example, you can atomically rename a complete directory tree while client programs keep accessing the "old" one; you have a large choice of excellent SCMs for tracking files; you can have networked and/or distributed file systems, uniform permissions or ACLs ... all of which come in handy if you're running a web server.
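One standard way to get the atomic-rename guarantee is to reach the live tree through a symlink and swap the symlink. A minimal Python sketch (the symlink layout and paths are my own assumption, not anything Viaweb documented):

    import os

    def publish(live_link: str, new_tree: str) -> None:
        """Atomically repoint the `live_link` symlink at `new_tree`.

        rename() over an existing path is atomic on POSIX, so a reader
        sees either the old tree or the new one, never a half state, and
        files already open under the old tree stay readable.
        """
        tmp = live_link + ".swap"
        if os.path.lexists(tmp):       # clean up a stale staging link
            os.unlink(tmp)
        os.symlink(new_tree, tmp)      # stage a link to the new tree
        os.replace(tmp, live_link)     # the atomic swap

    # e.g. build "catalog.v2" offline, then: publish("catalog", "catalog.v2")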
>> We didn't use one. We just stored everything in files
Meaning what? What kind of file formats? What did you query the data with, etc.?
>> It turned out the Yahoo accounting department used Oracle.
I remember reading some Oracle advertisement (PC Magazine circa 2001) where a company like Amazon switched to them 'in one day'. It was probably the marketing department or something.
If it's anything like Hacker News, the way it works is as follows:
1. on startup, load all items into memory from the files.
2. whenever an item is changed, save it to disk.
In modern times, you can store each item as a separate .json file, for example.
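A minimal sketch of that pattern in Python (the ItemStore name and one-JSON-file-per-item layout are just illustrative):

    import json
    import os
    from pathlib import Path

    class ItemStore:
        """One JSON file per item; everything lives in memory between writes."""

        def __init__(self, root: str):
            self.root = Path(root)
            self.root.mkdir(exist_ok=True)
            # 1. on startup, load all items into memory from the files
            self.items = {p.stem: json.loads(p.read_text())
                          for p in self.root.glob("*.json")}

        def save(self, item_id: str, item: dict) -> None:
            # 2. whenever an item is changed, save it to disk;
            # temp-file-plus-rename so a crash can't leave a half-written file
            self.items[item_id] = item
            path = self.root / (item_id + ".json")
            tmp = path.with_suffix(".tmp")
            tmp.write_text(json.dumps(item))
            os.replace(tmp, path)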
With this technique, there is no risk of data corruption (as long as each file is written atomically: write to a temp file, then rename it into place). There is a risk of inconsistency; e.g. if I remember correctly, when you upvote an item, the vote is saved to disk, then the author's karma is incremented, and finally the author's profile is saved to disk. If the webserver dies between saving the vote and saving the karma count, the karma will no longer be the proper value. Stuff like that. Such things tend not to matter if you design it carefully, though.
EDIT: I was curious what the order of operations was, so I pulled up HN's old source code:
    ; (excerpt; the enclosing forms are cut off, hence the extra closing parens)
    (unless (or (author user i)
                (and (is ip i!ip) (~editor user))
                (is i!type 'pollopt))
      (++ (karma i!by) (case dir up 1 down -1))
      (save-prof i!by))
    (wipe (comment-cache* i!id)))
    (push vote i!votes)
    (save-item i)
    (push (list (seconds) i!id i!by (sitename i!url) dir)
          (uvar user votes))
    (= ((votes* user) i!id) vote)
    (save-votes user)
    (zap [firstn votewindow* _] (uvar user votes))
    (save-prof user)
    (push (cons i!id vote) recent-votes*))))
The user's karma is incremented or decremented, then the user's profile is saved; the vote is added to the item's votes, then the item is saved; the vote is stored in the global votes table, then the votes table is saved; the vote is added to the user's "votes" list, then the user's profile is saved.
Each file held an item (a product or section page), and it was stored as an s-expression.
Storing Lisp data as s-expressions is even more natural than storing Javascript data as JSON, which takes some effort if you want it to handle objects correctly.
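For comparison, here's roughly what that effort looks like, sketched in Python rather than Javascript (the Product class is made up): without explicit hooks in both directions, objects don't round-trip through JSON.

    import json

    class Product:
        def __init__(self, name: str, price: int):
            self.name, self.price = name, price

    # json.dumps(Product("mug", 1200)) raises TypeError: you must spell out
    # how objects become dicts and how dicts become objects again.
    def encode(obj):
        if isinstance(obj, Product):
            return {"__type__": "Product", "name": obj.name, "price": obj.price}
        raise TypeError(f"cannot serialize {type(obj)}")

    def decode(d):
        if d.get("__type__") == "Product":
            return Product(d["name"], d["price"])
        return d

    text = json.dumps(Product("mug", 1200), default=encode)
    item = json.loads(text, object_hook=decode)

With s-expressions, the printed form of the data is already the source form, so the serializer and parser come for free.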
> It is a common mistake to think of Web-based apps as interfaces to databases. Desktop apps aren't just interfaces to databases; why should Web-based apps be any different? The hard part is not where you store the data, but what the software does.
There's an alternate take on this point. Maybe the mistake is to not think of desktop apps as interfaces to databases (regardless of what the software actually does).
For Viaweb, users (i.e., e-commerce sellers) paid $100+ per month, so a process per user isn't expensive. Shoppers didn't require so much state (just a shopping cart), so that part of the system kept all the state in a sharded database.
There is some routing infrastructure required at the front end to route requests, time out processes after inactivity, and reload them when the user comes back. But it can give a very snappy response to have all the data needed to serve a request already loaded in memory.
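Purely as a guess at the shape of that routing layer, in Python (the `backend` command, its flags, and the timeout are all hypothetical, not how Viaweb actually did it):

    import subprocess
    import time

    IDLE_TIMEOUT = 600  # seconds; an assumption, not Viaweb's number

    class ProcessRouter:
        """Per-user backend processes: spawn on demand, reap when idle."""

        def __init__(self):
            self.procs = {}  # user -> (process, last-active timestamp)

        def route(self, user: str) -> subprocess.Popen:
            proc, _ = self.procs.get(user, (None, 0.0))
            if proc is None or proc.poll() is not None:
                # "backend" is a hypothetical per-user server that reloads
                # its state from files on startup
                proc = subprocess.Popen(["backend", "--user", user])
            self.procs[user] = (proc, time.time())
            return proc

        def reap_idle(self) -> None:
            now = time.time()
            for user, (proc, last) in list(self.procs.items()):
                if now - last > IDLE_TIMEOUT:
                    proc.terminate()  # state is already on disk
                    del self.procs[user]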
Not viable for handling many users at once on a single Unix box, no. Context switching between the processes starts to dominate your time. It's an old-school approach that is really tidy at relatively low concurrency.
In the early 2000s, this area was sometimes called the “C10k problem” - can you handle 10,000 concurrent connections on one machine? See https://en.m.wikipedia.org/wiki/C10k_problem
These days, most servers can blow way past that, even into millions, but none that I'm aware of do that with a process-per-connection model.
It depends. If you're caching ("weakly", e.g. revalidating If-Modified-Since against the file's mtime, or even "strongly", i.e. serving aggressively without revalidating), which you should, you're creating processes only for a small fraction of requests. The remaining major overhead is reparsing the dynamic-language backend code on each request, which you can further reduce by using native code.
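The "weak" revalidation part is only a few lines. A sketch with Python's stdlib HTTP server (the ./pages layout is an assumption):

    import os
    from email.utils import formatdate, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class RevalidatingHandler(BaseHTTPRequestHandler):
        """Answer If-Modified-Since against the file's mtime, so repeat
        visitors get a cheap 304 instead of a full regeneration."""

        def do_GET(self):
            # assumed layout: pregenerated pages under ./pages
            path = os.path.join("pages", self.path.lstrip("/") or "index.html")
            try:
                mtime = os.stat(path).st_mtime
            except FileNotFoundError:
                self.send_error(404)
                return
            ims = self.headers.get("If-Modified-Since")
            # HTTP dates have one-second resolution, hence int(mtime)
            if ims and parsedate_to_datetime(ims).timestamp() >= int(mtime):
                self.send_response(304)
                self.end_headers()
                return
            with open(path, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8080), RevalidatingHandler).serve_forever()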
The cache is served by an HTTP server still, though. What you are describing is a sort of hybrid, where most requests are handled in threads (either OS or green threads) but a small fraction get their own process. I think I agree that that could work, but it sounds a bit different from creating a process per connection.
I don't think parsing code is the major overhead. It's not really about starting the processes so much as switching between them when concurrently handling a bunch of requests.