Althttpd: Simple webserver in a single C file (sqlite.org)
715 points by miles on June 8, 2021 | 339 comments



Here is the actual single C-code file: https://sqlite.org/althttpd/file?name=althttpd.c

Something I absolutely love about text-based protocols such as HTTP/1 is how easily you can implement them in virtually any programming language. Sure, the implementation is not top-notch, but it just damn works, it is portable, and it is understandable by humans. That's something that got lost with HTTP/2 and HTTP/3.


Binary protocols aren't (or at least don't have to be) any more difficult and frequently are even easier to implement.

Text protocols have difficult problems, like escaping or detecting the end of a particular field, that are a frequent source of mistakes.

The issue is that many (especially scripting) languages treat binary data as second class.

The only real issue is that inspecting a binary message visually is a little bit more difficult. You can usually tell at a glance whether your data structure is broken if it is in text form. What I usually do is keep a simple converter that turns the binary message into text form, and this helps me inspect the message easily.
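
Such a converter can be tiny; here is a minimal C sketch (the name `hexdump` and the 16-bytes-per-line layout are just one common convention, not the parent's actual code):

    #include <ctype.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Dump a binary message as hex plus printable ASCII, 16 bytes per line,
    ** so a broken field boundary is as visible as it would be in text. */
    static void hexdump(const unsigned char *buf, size_t len){
      for(size_t i = 0; i < len; i += 16){
        printf("%08zx  ", i);
        for(size_t j = 0; j < 16; j++){
          if(i + j < len) printf("%02x ", buf[i + j]);
          else            printf("   ");
        }
        printf(" |");
        for(size_t j = 0; j < 16 && i + j < len; j++)
          putchar(isprint(buf[i + j]) ? buf[i + j] : '.');
        printf("|\n");
      }
    }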


And capitalisation issues, and encodings… give me a binary protocol to parse any time. They’re normally comparatively well-defined, whereas text protocols are seldom properly defined and so have undefined behaviour left, right and centre, which inevitably leads to security bugs—or even if they are well-defined, they’re probably done in a way that makes your language’s string type unsuitable, and makes parsing more complicated.


People say this, but then there's ASN.1 which has had critical security bugs in: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=asn.1

I will agree that exhaustively defining text protocols is extremely hard, starting from character set / encoding and getting worse from there.


ASN.1 is not a binary protocol. It is a language to describe messages.

Typically you write a message description which is then compiled to code that can serialize/deserialize messages in BER-TLV or PER-TLV.

I know because I wrote a complete parser/serializer for BER-TLV. It is a simple protocol, and any security issue is in the parser/serializer, not the protocol itself, for the simple reason that the protocol is nothing more than a format for serializing/deserializing the data.


True, but they certainly don't make it too easy either with the indefinite length constructs.


I guess if you're going to pick the absolute worst example of a binary protocol.


That is absolutely not fair.

BER-TLV is a really nice protocol. I worked with it for a couple of years on an EMV application, which uses BER-TLV to communicate with the credit card, but it is also a very convenient format for all sorts of other uses and I would use it wherever I could. Think of it as JSON, but in binary form. It is not complicated, and I would not even bother parsing the messages -- I could interpret hex dumps of them by sight very easily.
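
To give a feel for how little machinery TLV needs, here is a rough C sketch of a walker for a common subset (one- and two-byte tags, length fields of up to two bytes) -- an illustration only, not the hardened parser an EMV application would use:

    #include <stddef.h>
    #include <stdio.h>

    /* Walk top-level BER-TLV objects in p[0..len-1], printing tag and
    ** length. A constructed tag's value would itself be walked recursively. */
    static void walk_tlv(const unsigned char *p, size_t len){
      size_t i = 0;
      while( i + 2 <= len ){
        unsigned tag = p[i++];
        if( (tag & 0x1f) == 0x1f ){        /* two-byte tag, e.g. 0x9F36 */
          if( i >= len ) return;
          tag = (tag << 8) | p[i++];
        }
        if( i >= len ) return;
        size_t vlen = p[i++];
        if( vlen == 0x81 ){                /* long form: one length byte */
          if( i >= len ) return;
          vlen = p[i++];
        }else if( vlen == 0x82 ){          /* long form: two length bytes */
          if( i + 2 > len ) return;
          vlen = ((size_t)p[i] << 8) | p[i+1];
          i += 2;
        }else if( vlen >= 0x80 ){
          return;                          /* longer forms: out of scope */
        }
        if( vlen > len - i ) return;       /* truncated value: bail out */
        printf("tag %04x, length %zu\n", tag, vlen);
        i += vlen;
      }
    }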


>>* The author disclaims copyright to this source code. In place of

>>* a legal notice, here is a blessing:

>>*

>>* May you do good and not evil.

>>* May you find forgiveness for yourself and forgive others.

>>* May you share freely, never taking more than you give.

Above all, this note at the beginning of the source code impressed me.


> give me a binary protocol to parse any time

I don't disagree in principle, but I've come across a handful of very poorly documented binary protocols over the years, and that is an extremely painful thing to deal with compared to text-based protocols.


I think it's pretty easy to make poor use of HTTP to the same end. Imagine you're dumping traffic between a client and a server and the exchange is:

   GET /asdfasdfasdfdsaf
   X-Asdf-Asdf: 83e7234
   
   HTTP/1.0 202
   X-83e7233: 1
   X-83f730b: 4
   
It's text, but you still have no idea what's going on.

Overall, I think it's kind of a wash. It's basically equally easy to take a documented text or binary protocol and write a working parser. Neither format solves any intrinsic problems -- malicious input has to be accounted for, you have to write fuzz tests, you have to deal with broken producers. It's a wash.

People like text because they can type it into telnet and see something work, which is kind of cool, but probably not a productive use of time. I can type HTTP messages, but use curl anyway. (SMTP was always my favorite. "HELO there / MAIL FROM foo / RCPT TO bar / email body / ." Felt like a real conversation going on. Still not sure how to send an email that consists of a single dot on its own line though.)


>Still not sure how to send an email that consists of a single dot on its own line though.

It seems you prefix any line of data starting with a period with an additional period, to distinguish it from the end-of-mail period.

https://datatracker.ietf.org/doc/html/rfc5321#section-4.5.2


> Still not sure how to send an email that consists of a single dot on its own line though.

A line starting with a period is escaped by adding an extra period; the receiving side removes the first character of a line if it is a period.
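
The sending side is a one-liner; a minimal C sketch of the dot-stuffing rule from RFC 5321 §4.5.2:

    #include <stdio.h>

    /* SMTP dot-stuffing: a data line that begins with '.' gets one extra
    ** '.' prepended, so it can't be mistaken for the end-of-data marker. */
    static void send_data_line(FILE *out, const char *line){
      if( line[0] == '.' ) fputc('.', out);
      fprintf(out, "%s\r\n", line);
    }

So sending the single-dot line puts ".." on the wire, and the receiver strips the first period back off, delivering the lone dot.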


It's the dirtiest hack I've ever seen.

Except I did something like this once in the past (but with a zero byte instead of a dot).


Ugh, flashbacks. Now that you’ve mentioned that I feel the burning need to change my wording to add the condition “fairly described”. I have also come across very poorly documented binary protocols!


Just use RecordFlux for binary parsers proven absent of runtime errors... ;-)


Can you expand on that a little?


RecordFlux[0] is a DSL written in Ada for specifying the messages of a binary protocol. Code for parsing these messages is then generated automatically, with a number of useful properties proven automatically, including that no runtime errors will occur.

[0] https://github.com/Componolit/RecordFlux


They definitely don't have to be. I've echoed these sentiments before.

However, HTTP/2 and HTTP/3 certainly are, though the reasons why they are complicated have nothing to do with choosing to use a binary based format. (They are complicated for good reason, though, and I hope that browsers and servers can continue to support HTTP/1 as the baseline till the sun burns out, just to make life easier.)


I did implement a webserver (one that supports WebSocket, a binary protocol) and also implemented a `make_binary_string` function along the way so as to not lose my bearings. Printf is 100% amazing, and I generally leave my printf statements in all my code because I can switch them on and off very easily for any individual function by using macros. Also clutch :)
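
For the curious, a helper like that can be tiny. A sketch of what a `make_binary_string`-style printer could look like, together with the macro toggle (this is a guess at the approach, not the parent's actual code):

    #include <ctype.h>
    #include <stddef.h>
    #include <stdio.h>

    #ifdef TRACE_FRAMES
    #define TRACE(...) fprintf(stderr, __VA_ARGS__)
    #else
    #define TRACE(...) ((void)0)   /* compiled out when tracing is off */
    #endif

    /* Print a WebSocket frame (or any buffer) with non-printable bytes
    ** escaped as \xNN, so it survives a trip through printf-style logs. */
    static void print_binary_string(const unsigned char *buf, size_t len){
      for(size_t i = 0; i < len; i++){
        if( isprint(buf[i]) && buf[i] != '\\' ) putchar(buf[i]);
        else printf("\\x%02x", buf[i]);
      }
      putchar('\n');
    }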


Yes and no. Debugging a binary protocol is a bit more difficult than a text one, which is the reason HTTP is ASCII-based.


The reason HTTP is ASCII-based isn't that it's easier to debug. It's that back then the other end was as likely to be a human as a piece of software. HTTP in its early days barely had headers or even formatting, so people typed "GET /" at the server directly, or used the same method to send mail directly.

Nobody does that anymore and debugging is easily solved by converting your binary protocol to a textual form.


I question the assertion that humans were doing "GET /" or "HELO my.fq.dn" on any regular basis. There were mail clients from the very beginning, for example:

* https://en.wikipedia.org/wiki/History_of_email

Perusing document-based information was also done via clients, first with Gopher and then with the WWW: either GUIs like Mosaic, or on the CLI via (e.g.) Lynx.


I definitely did back in the day :) But I'm only one single human.


I did both send mails over SMTP as well as downloading files from FTP using Telnet.


same


I would question whether this wasn't the case often enough that people would demand it be compatible with their teletype.


I don't think this is about human readability (although it might be, to some extent, during the design phase). Instead, I think this was the sensible thing to do when byte order was far more variable across a network. If you only send single bytes to begin with, you may as well use a textual format, IMO.


It depends on the nature of the payload. If I have to send text, I'll use a text-based protocol because I can debug it more easily, but if I have to send binary information, I'd rather send it using well-defined structures after taking into account word sizes, alignment, endianness, etc. on the hardware involved.

Back in the day I had to do that with small agents on different architectures (IBM Power and x86) in different languages (C and Object Pascal). The correct textbook way to do it was to use XML, but that way everything had to be converted to text, interpreted, and then translated back for replies. No way: hardware was slow and we aimed at speed, so we used plain C/Pascal structures with no manipulation other than adjusting type sizes and endianness where the hardware or language required it, so that all modules stayed compatible no matter what they were written in and where they would run.

I also built my own tools for testing and debugging, which were nothing more than agents that sent predefined packets and echoed the return values to the screen, so that any error in the data would be quickly spotted, while an XML translation of a corrupt buffer could have missed any error not contained in the marked-up text, hiding the fact that there was a problem somewhere. A solution like that is highly debatable for sure, but I'm among the ones who want their software to crash loudly at the first problem, not try to hide it until it creates bigger damage.


> which is the reason HTTP is ascii based.

well, the other side of HTTP was telnet in a terminal - this is why it is ASCII to begin with


This is not true. HTTP was developed for the first (graphical) web browser, WorldWideWeb. It was specifically intended to serve HTML and images, neither of which is particularly well suited for browsing with telnet.


Not sure that is strictly correct. I vaguely remember a command-line browser early in the piece. The inline-image ability and the <img> tag came a bit later in the story, when Andreessen put them into NCSA's Mosaic browser. I've always wondered if it was that single thing that really made the web take off, more so than hyperlinking, which had been around for quite a while. People like pretty pictures, especially advertisers.


You're describing the line mode browser:

https://en.wikipedia.org/wiki/Line_Mode_Browser

But that didn't just spit out the HTTP response - it parsed and rendered the HTML.


Yeah, humans can roughly speaking infer the schema of a text-based protocol but require one for a binary protocol.


I love C, but it's pretty scary sometimes. Five minutes ago: "I wonder if I can find a potential memory overwrite in 5 minutes?"

Sure enough, the function StrAppend can overflow its size_t size computation (without checking), after which the writes could go past the end of the allocated buffer. Given 5 minutes, I didn't look thoroughly into whether this is actually exploitable, but it's definitely a red flag for the code. Be careful out there! Hopefully I am missing something, or this is just a simple oversight, but I would carefully audit this code before using it.

Submitted a ticket through the Althttpd website.

static char *StrAppend(char *zPrior, const char *zSep, const char *zSrc){
  char *zDest;
  size_t size;
  size_t n0, n1, n2;

  if( zSrc==0 ) return 0;
  if( zPrior==0 ) return StrDup(zSrc);
  n0 = strlen(zPrior);
  n1 = strlen(zSep);
  n2 = strlen(zSrc);
  size = n0+n1+n2+1;
  zDest = (char*)SafeMalloc( size );
  memcpy(zDest, zPrior, n0);
  free(zPrior);
  memcpy(&zDest[n0],zSep,n1);
  memcpy(&zDest[n0+n1],zSrc,n2+1);
  return zDest;
}


> Sure enough, the function StrAppend potentially overflows a size_t size

How should this happen in practice? The three strings would have to be larger than the available address space...


Yeah. The function in question is called in only one place. It would seem you’d need to send the web server more than a size_t of data for this to be an issue.


Yes, absolutely. If the webserver is compiled 32-bit, that is only 4GB of data, which might be feasible? I don't know enough to say. Assuming a hacker kindly won't overflow your buffer is never a good idea.

However, the presence of one piece of code that is not integer-overflow safe definitely makes me nervous. This is just the one I found in 5 minutes, what else is in there?
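
For reference, the guard would only cost a few lines; a sketch of an overflow-checked version of that size computation (SIZE_MAX is from <stdint.h>; this is not the author's code):

    /* Refuse inputs whose combined length would wrap size_t, instead of
    ** letting SafeMalloc() receive a tiny wrapped-around size. */
    if( n0 > SIZE_MAX - n1
     || n0 + n1 > SIZE_MAX - n2
     || n0 + n1 + n2 > SIZE_MAX - 1 ){
      return 0;
    }
    size = n0 + n1 + n2 + 1;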


It's not a signed integer overflow that would be needed but an unsigned wraparound. The way I see it, on 32 bits that means the HTTP request would have to be bigger than what's available to both the user application and the OS together. In short, you just can't get an input request that big. Of course, if you manage it, you'll disprove this claim.


None that stand out to me, including what you posted. Do you have a real example?


MAX_CONTENT_LENGTH is 250MB. You won’t be able to send 4GB of data.


In most places it uses int for string and buffer lengths. It wouldn't surprise me if 2GiB of data could trigger several overflows.


Exactly. In a single-file C program, nobody can expect universal library functions that work in every imaginable context. The only relevant context is the code the function lives in, and in that context the function does enough.


And there's only one call to StrAppend() which is easily verified as safe.


HTTP/1's simplicity comes not from its text nature but from its clear concept and limited functionality.

At its core, it's just request-reply plus key-value metadata. Whether it's text or binary doesn't matter much. But writing HTTP/2 frame types out in letters would not make them any easier to understand.


There are also keep-alive, caching (a big topic), chunked transfer encoding, header-parsing peculiarities, and authentication in HTTP. The combination of these creates some nice opportunities for implementation bugs. Source: I have worked on a client-side implementation.

Now, HTTP/2 isn't even conceptually simple, I agree about that... it seems ugly.


> implementation bugs. Source: have worked on a client-side implementation

well, because the hard part is the client side. caching is a client-side-only thing with http, keep-alive is something a server pushes to a client, and the same goes for chunked transfer, which is not as easy to implement for a client as content-length was.

basically a server only needs to implement certain headers, but a client needs to know them all. also, most clients even accept bad servers, like content for HEAD requests, etc. most stateless protocols put a lot of burden on clients.

h2, on the other hand, is stateful, keeps the same hard semantics on the client side, and also makes servers more complex, because it's a state machine.


One good thing is that you don’t have to support everything if you’re writing a server. It just depends on what kind of use cases you want to support.


That came with http/1.1 ;-)


HTTP allows a timeout response to be sent by a server to a request you haven't yet sent. So it's not strictly request/reply.


'100 Continue' would probably be a better example of breaking the request/reply flow, as it provides useful functionality and requires non-trivial implementation and compatibility measures.

'408 Request Timeout', meanwhile, is a somewhat dubious fig leaf over TCP RST.


Really simple; I implemented a couple. Remember that the letters after the status code are purely aesthetic, so make sure to put your style in there:

    200 FINE
    200 KTHX
    404 MISS
    403 NO
    500 FUCK

Another similar one-file server is the busybox httpd command, if you are interested: https://git.busybox.net/busybox/tree/networking/httpd.c


I always wanted an HTTP response code for when the server detects a malicious request. Like, 430 DONT BE AN ASSHOLE


Relatedly, in every C project I've ever worked on, the custom (>255) error code for failures due to malicious-looking input that could not possibly have been supplied by mistake is the const EGETFUCT.


Once you have to support HTTP/1.1, it's not really a text-based protocol, because support for chunked encoding is mandatory for clients. (Yes, the chunk headers are technically text, but they're interspersed with arbitrary binary data. It's not something you can easily read in a text editor.)
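
For illustration, a chunked response looks like this on the wire (every line ends in CRLF; the sizes are hex byte counts, and a zero-size chunk terminates the body, so the decoded payload here is "MozillaDeveloper"):

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked

    7
    Mozilla
    9
    Developer
    0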


I make most HTTP requests using netcat or similar TCP clients, so I write filters that read from stdin. Reading text files with the chunk sizes in hex interspersed is generally easy; sometimes I do not even bother to remove the chunk sizes. Where it becomes an issue is when it breaks URLs. Here is a simple chunked-transfer decoder that reads from stdin and removes the chunk sizes.

   flex -8iCrfa <<eof
    int fileno (FILE *);
   xa "\15"|"\12"
   xb "\15\12" 
   %option noyywrap nounput noinput 
   %%
   ^[A-Fa-f0-9]+{xa}
   {xa}+[A-Fa-f0-9]+{xa}
   {xb}[A-Fa-f0-9]+{xb} 
   %%
   int main(){ yylex();exit(0);}
   eof

   cc -std=c89 -Wall -pipe lex.yy.c -static -o yy045
Example

Yahoo! serves chunked pages

   printf 'GET / HTTP/1.1\r\nHost: us.yahoo.com\r\nConnection: close\r\n\r\n'|openssl s_client -connect us.yahoo.com:443 -ign_eof|./yy045


I tried this but ended up with gibberish in my terminal. Also couldn't find an explanation for -a on flex's man page. I've never used the thing before.


The extra "a" is a typo but would have no effect. The "i" is also superfluous but harmless. Without more details on the "gibberish" it is difficult to guess what happened. The space before "int fileno (FILE *);" is required. All the other lines must be left-justified, no leading spaces, except the line with "int main()" which can be indented if desired.


This is the script I am running - https://pastebin.com/65GxJ9i9. The - after << ignores the tabbed indents in heredoc.

This is what it produces for me when I run `lexit.sh us.yahoo.com` - https://stuff-storage.sfo3.digitaloceanspaces.com/ee.txt


https://news.ycombinator.com/item?id=27490265 <-- yy054

The "gibberish" is GZIP compressed data. "yy054" is a simple filter I wrote to extract a GZIP file from stdin, i.e., discard leading and trailing garabage. As far as I can tell, the compressed file "ee.txt" is not chunked transfer encoded. If it was chunked we would first extract the GZIP, then decompress and finally process the chunks (e.g., filter out the chunk sizes with the filter submitted in the OP).

In this case all we need to do is extract the GZIP file "ee.txt" from stdin, then decompress it:

    printf "GET /ee.txt\r\nHost: stuff-storage.sfo3.digitaloceanspaces.com\r\nConnection: close\r\n\r\n"|openssl s_client -connect 138.68.34.161:443 -quiet|yy054|gzip -dc > 1.htm
    firefox ./1.htm
   
Hope this helps. Apologies, I initially guessed wrong on the here doc. I was not sure what was meant by "gibberish". Looks like the here doc is working fine.


New pastebin; had a typo in the old one: https://pastebin.com/4j9Z3eCc


Need to get rid of the leading spaces on all lines except the "int fileno" line. Can also forgo the "here doc" and just save the lines between "flex" and "eof" to a file. Run flex on that file. This will create lex.yy.c. Then compile lex.yy.c.

The compiled program is only useful for filtering chunked transfer encoding on stdin. Most HTTP clients like wget or curl already take care of processing chunked transfer encoding. It is when working with something like netcat that chunked transfer encoding becomes "DIY". This is a simple program that attempts to solve that problem. It could be written by hand without using flex.


Okay, I'll give up for now. There really are no spaces in front of the lines. In pastebin, if you check the raw version, you'll see they are tabs, which get stripped out because I added a `-` before eof. Providing the file manually to flex also produces the same gibberish for me.


HTTP/1 requests for binary files also aren't text-based under that criterion. The protocol itself is text-based, but the payload(s) can be binary. That doesn't change between HTTP/1 and /1.1.


HTTP is serious about backward compatibility, so a client that speaks HTTP/1.1 (or HTTP/1.0 + Host header) can still talk to most servers out there.

Similarly, if you write a server that only speaks HTTP/1.1 (or HTTP/1.0 + Host header), you can put it behind a reverse proxy or load balancer that handles higher versions, does connection management, and terminates TLS. It will work perfectly fine, only without some of the latest performance optimizations that you might or might not even need.


> you can put it behind a reverse proxy or load balancer that handles higher versions, does connection management, and terminates TLS

This is even standard practice in many production deployments. Typically you want the proxy or load balancer anyway and there's often little benefit (if any) to using HTTP/2 or HTTP/3 over a very low-latency, high reliability local network.


That seems like a high risk of HTTP desync attacks if you're only implementing a subset of HTTP/1.1.


How would one attack a static page?


Who said anything about static? This webserver supports cgi, which means it supports php, perl, etc.

However even if it didn't, js based client-side apps are probably still attackable in the right set of circumstances.


Why would you even need a reverse proxy for a static page? This isn't about that.


I see. Thanks!


For people who like text-based protocols (and dislike surveillance and web bloat), I suggest taking a look at Gemini, which was designed so you could write a client for it over the weekend.

https://gemini.circumlunar.space/


Gemini is quite interesting in that it brings back a nostalgic feeling from when the web was a new thing, while also being modern.

The Lagrange browser seems quite polished.


I prefer binary protocols, but I think you make a good point about HTTP/1, especially from a learning perspective. I remember how enlightening it was when I followed a tutorial to create an HTTP server and found that all I needed to do (ignoring TCP/IP) was write 'HTTP/1.1 200 OK' plus a few headers and some raw HTML into a string, and it would actually show up in my browser.
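
That whole exercise fits in a page of C. A minimal sketch (error handling mostly omitted; serves one fixed page on port 8080):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void){
      int srv = socket(AF_INET, SOCK_STREAM, 0);
      int yes = 1;
      struct sockaddr_in addr = {0};
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(8080);
      setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
      if( bind(srv, (struct sockaddr*)&addr, sizeof addr) < 0 ) return 1;
      listen(srv, 16);
      for(;;){
        char req[4096];
        int conn = accept(srv, 0, 0);
        if( conn < 0 ) continue;
        (void)read(conn, req, sizeof req);  /* read and ignore the request */
        const char *resp =
          "HTTP/1.1 200 OK\r\n"
          "Content-Type: text/html\r\n"
          "Content-Length: 20\r\n"
          "Connection: close\r\n"
          "\r\n"
          "<h1>hello world</h1>";           /* exactly 20 bytes of body */
        write(conn, resp, strlen(resp));
        close(conn);
      }
    }

Point a browser at http://localhost:8080/ and the page shows up, exactly as described.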


It's not a property of text-based protocols; rather, it's a property of simple protocols. HTTP/1 is not merely text-based, it's ASCII-based (technically ISO-8859-1, which includes ASCII). One char, one byte, one encoding. HTTP itself is mostly very simple: text "name: value" pairs separated by newlines, followed by arbitrary content as the body.

I think the solution is to start with a simple protocol and upgrade to more complex protocols after that. While technically you don't need to support HTTP/1 to support 2 and 3, the upgrade over TCP happens mostly in that way.


> While technically you don't need to support HTTP/1 to support 2 and 3, the upgrade over TCP happens mostly in that way.

This is not actually true of HTTP/2 (the upgrade doesn’t use HTTP/1), and slightly misleading of HTTP/3 (there’s nothing to upgrade because it sits beside TCP HTTP; but advertising HTTP/3 support is currently done over TCP HTTP).

Now for the details:

HTTP/2 upgrade is done at the TLS level, via ALPN. Essentially the client TLS handshake says “hello there! BTW do you do HTTP/2?” and the server either responds “hi!” and starts talking HTTP/1, or “hi! Let’s talk h2!” and starts talking HTTP/2.

So it’s perfectly possible (though not a good idea) to have a fully-functioning HTTP/2 server that doesn’t speak a lick of HTTP/1.

(HTTP/2 over cleartext, h2c, does use the old HTTP/1 Upgrade header mechanism, but h2c is more or less just not used by anyone.)

HTTP/3 upgrade, well, “upgrade” is the wrong word. It’s operating over UDP rather than TCP, so you’re not upgrading an existing HTTP thing to HTTP/3; you’re starting a new connection because you’ve learned that the server supports HTTP/3. This bootstrapping is currently done by advertising h3 support via the Alt-Svc header (HTTP/1+) or by an ALTSVC frame (HTTP/2+), which the client can then remember so it uses the best protocol next time.

For best results, they’re working on allowing you to advertise HTTP/3 support on DNS, so that after a while it should be genuinely possible (though a very bad idea) to have a fully-functioning HTTP/3 server that doesn’t speak any TCP at all, yet works with sufficiently recent browsers in network environments that don’t break HTTP/3. https://blog.cloudflare.com/speeding-up-https-and-http-3-neg... is good info on this part.


> it's ASCII based (technically ISO-8859-1, which includes ASCII)

No, it is (some) ASCII plus "opaque octets". ("A recipient SHOULD treat other octets in field content (obs-text) as opaque data.") If you want to say that historically it was such, that's not correct either; it was that plus "supporting other charsets only through use of [RFC2047] encoding" which is a nightmare of an encoding scheme.

> HTTP itself is mostly very simple, text "name: value" pairs separated by newlines

Except when it isn't: headers can be folded across multiple lines. (Which is also obsolete, and discouraged.)

Add to that 1xx responses, transfer encodings, chunks, chunk extensions, trailers. HTTP is far from "simple"…


Gopher is easier than that.


Recording state info in static variables makes it suitable only for embedded use.

I really hate when people do that. Static should only ever be const, init once, or something intrinsically singleton. There are very few exceptions.


Are you thinking of C++ issues with constructors? I don’t see anything wrong with the use of static variables in this single file C program. This isn’t a library; it’s a standalone web server.


I am thinking of multithreading. Mutable static variables pretty much destroy any possibility of multithreading without a major refactor. Test harnessing is an issue too.

But if this only ever wants to be an app binary, I guess it's sort of okay.


That class of errors is not an issue here.

This program isn't meant to be multithreaded; it's meant to be multi-/process/. The difference is that each instance of the application has its own independent address space, one per instance, and because of that there are no threading issues. As a result, any communication between different processes also has to be explicitly defined.

Making this program multi-threaded would be a big mistake because then you couldn't use any of the HTTP proxies out there to monitor for connections and hand them off to this program.

This is arguably a much better architecture from a safety/security standpoint than trying to spin up multiple threads sharing one address space. It forces you to use OS-provided mechanisms for shared state rather than simply manipulating memory to share information.


Being text-based is a huge flaw of HTTP in my opinion (and elsewhere too, like Redis). It leads to parsing bugs and overly verbose communication. Humans are good at reading text; CPUs prefer binary.


Text is used for popular protocols for a reason: computers don't write programs, people do. And while CPUs prefer binary, it's easier for programmers to read/write/reason about text. That makes it easier to work with a new protocol.

From a practical perspective, a binary protocol can be difficult to use across different languages, or to add support for a new language. With the simplest possible encoding, you'd send raw struct data, but this doesn't always work across different OS/arch/versions/etc. If the server is in C but the client is in Python, reading the binary protocol requires a far more complicated parser.

Obviously a more formal encoding (protobuf, etc.) would be preferred, but if you already need an encoding mechanism, why not wrap it in a text format? It's easier to write clients that can read/write text protocols in any language. Text protocols are so popular not because they are necessarily "better" but because they are easier to adopt. This is why the most popular protocols are text-based.


Except it quickly gets messy when you start dealing with real data and making sure encoding and escaping is done correctly.

> with a binary protocol, it can be difficult to use across different languages

This is also true of text protocols that aren't well-designed. I don't think it's necessarily the case that binary protocols are more difficult to deal with. You just have a different set of concerns to address.

> If you use the simplest possible encoding, you’d send raw struct data.

This is the "simplest" in the sense that it's definitely easy to just copy this data on the wire, but I think this is a straw man. I don't think it's really any more difficult to write a simple protocol that uses binary data compared to text.


> I don't think it's really any more difficult to write a simple protocol that uses binary data compared to text.

I don’t really think so either… I mean, I’ve done both, and it’s really not terrible to use binary. I think text is marginally easier to parse, but once you have the routines to read the right endianness, the advantage is minor. As you said, the biggest concern (as always) should be the design. A good design can be implemented easily in either mode.

However, it is significantly easier to debug a text protocol. Attaching a monitor or capturing packets is easier with text as the parsers are much easier and more generic.
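
As for those endianness routines, they really are short; a sketch of the usual fixed-byte-order pair in C, which sidesteps the host CPU's byte order entirely:

    #include <stdint.h>

    /* Read/write a 32-bit value in network (big-endian) byte order,
    ** regardless of the endianness of the machine running this code. */
    static uint32_t read_u32be(const unsigned char *p){
      return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
           | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    static void write_u32be(unsigned char *p, uint32_t v){
      p[0] = v >> 24;  p[1] = v >> 16;  p[2] = v >> 8;  p[3] = v;
    }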


That's fair. Although tools like Wireshark have made this much better.


Being binary based leads to errors and loss of momentum when you are debugging something that's deployed to prod. It's all about tradeoffs.


Performance is really bad. This is fine for running a small HTTP server on an embedded device, but not for serving production web traffic. Below is a report from running the server and hitting a minimal index.html page with artillery.

    All virtual users finished
    Summary report @ 09:39:57(-0400) 2021-06-08
      Scenarios launched:  33645
      Scenarios completed: 2573
      Requests completed:  2573
      Mean response/sec: 42.57
      Response time (msec):
        min: 0
        max: 9029
        median: 2
        p95: 6027.7
        p99: 8778.8
      Scenario counts:
        Get index.html: 33645 (100%)
      Codes:
        200: 2573
      Errors:
        ETIMEDOUT: 31008
        EPIPE: 48
        ECONNRESET: 16


Indeed. And a Citation biz-jet is way faster, flies higher, goes further, and carries more passengers than a Carbon Cub. On the other hand, the Citation costs more, burns more gas, takes more maintenance, and is more complex to fly, and you should not try to land a Citation on a sandbar in a remote Alaskan river.

Choose the right tool for the job.

Changing the https://sqlite.org/ website to run off of Nginx or Apache instead of althttpd would just increase the time I spend on administration and configuration auditing.


It is not clear that you would spend more time on administration with another webserver. I don't have experience with your webserver, but mine are "set it and forget it" affairs.


Love the Carbon Cub reference. STOL!!


And yet in years of using sqlite I have never once had a problem loading their website.


>but not for serving production web traffic

But it seems to be "good enough", no? As stated on the page, it serves 500k requests a day.

Were you running your tests using xinetd or stunnel?


It serves sqlite.org just fine.

Most people don’t need FANG tools.


I'm not sure I would call Apache or Nginx "FANG tools" (or FAANG tools)


The design goal is not top performance here. It is simplicity, observability of the source, and security.

It absolutely will fail under a DDoS-like punishing load which, say, nginx would have a chance to fend off.

It's still plenty adequate for many real-world configurations and load patterns, much as Apache 1.x was. Only this is about 2% the size of Apache 1.x.


Fair enough, but when considering the reasons and decisions behind using this server from the developers, isn’t your point kind of moot?

It’s not optimized for high ‘performance’. It’s optimized for low resource usage and the ability to reliably serve a large number of requests on a small budget, right?

They state that the website currently serves 500K requests and 50GB of bandwidth per day. Respectfully, this is quite the opposite of your ‘only good for small embedded devices’ claim.

I think this is very interesting, and I’m glad I know this exists now! Worth considering if you have the right type of use case.


That's not a lot of requests.

My hobby website serves more traffic at a quarter of the cost and is easy to configure.


Care to mention what you are using for your hobby project?


It's a cool demo, but I obviously would not recommend running it in any production environment. It's not battle-tested (performance and security) and not constantly peer-reviewed like Apache or NGINX. Even further down the totem pole, Caddy (which I really like) is better than Althttpd for lots of very good reasons. OK, with what I thought would be obvious said: Althttpd is still way cool and impressive.


> hitting it with artillery

This?

https://github.com/artilleryio/artillery


Yes


    /*
    ** Test procedure for ParseRfc822Date
    */
    void TestParseRfc822Date(void){
      time_t t1, t2;
      for(t1=0; t1<0x7fffffff; t1 += 127){
        t2 = ParseRfc822Date(Rfc822Date(t1));
        assert( t1==t2 );
      }
    }
There are only two billion positive 32-bit integers; I guess we can test them all. Well, substantially fewer than two billion with that skip. I wonder if that completes in a few seconds.


That was quite easy to cut and paste and compile:

    $ time ./althttpd-time-parse 
    Test completed in 10518961 us

    real    0m10,521s
    user    0m10,486s
    sys     0m0,004s
The "Test completed" line is from my main() "driver", I wanted to measure time inline too and the measurements seem to agree.

This is on a Dell Latitude featuring a Core i5-7300U at 2.6 GHz, running Ubuntu 20.10.


This loop doesn’t test them all - it tests one in every 127 integers, so it only runs about 16 million times.


And a modern CPU runs billions of instructions per second, so it shouldn't take too long at all. Depends on the implementation of the function as well though, and how the compiler can optimize, unroll or optimize both the function under test and the test itself.



How would you write a better test?

IMO most tests are fundamentally flawed. The way most testing is done it would be easier and better to just write everything twice, have a method to compare results, and hope you got it right at least once.


I do think of tests as a kind of double-entry bookkeeping.


Computers are insanely fast


Another comment shows 10 seconds on a relatively decent CPU from 2017. So it is a fairly heavyweight task, though I suppose could be rewritten to use more than one core.


yeah, that's a lot of iterations :-)

I was curious to see how my M1 compares to my intel 2019 macbook pro:

M1:

/tmp/tt 8.42s user 0.01s system 99% cpu 8.435 total

2,6 GHz 6-Core Intel Core i7

/tmp/tt 15.69s user 0.03s system 98% cpu 15.888 total


Looks like test code, rather than something that runs when serving in production.

It fails an assert if the parse doesn't work I guess?


I have yet to try running this for anything, but I do appreciate how it really sticks to the "do one thing well" ethos. Modern web servers can be extremely complicated with a lot of moving parts. This boils it down to just one thing and lets a person focus on the project instead of the infrastructure. Granted, it's very simplistic, but that's its strength.


I do respect the technical chops around sqlite. However, I think a "fork for every single http request" server isn't really useful in many situations.

That the sqlite website is able to run this way is more a testament to Linux's work on a lightweight/fast fork() than anything else. This would perform terribly on a more traditional Unix.


I ran a webmail service with 2m users that forked and exec'd a CGI for every request 20 years ago. 20 year old hardware was already fast enough that we were usually IO bound on the storage backend rather than constrained by the (much cheaper) frontends.

Forking for every request is slow, sure.

But if your code is written with it in mind it's faster than most people might expect, and most people never get to a scale where it matters.

It's not the right choice for everything, but people have ironically gotten obsessed with things we introduced a long time ago as workarounds for slow hardware (and fork used to be slow on Linux too) decades after the original problems were largely solved.

I do agree there are times this won't be useful, though.


"I ran a webmail service with 2m users that forked and exec'd a CGI for every request"

Yes, but that met the expectations of that time period, and the expectations for a webmail service. I'm curious whether you also forked for every static asset... that's what this setup appears to do.

I just don't see the benefit of sqlite choosing to use this today. It works, but there are other minimal HTTP servers that would be just as simple while being faster and using fewer resources. I suppose they don't need to change it, but to me it's not really a great example of anything other than "fork is cheap on Linux".


Performance expectations were if anything for the most part tighter than what people tend to get away with today. People hadn't gotten used to slow dynamic sites yet.

We didn't fork for every static asset, but the vast majority of overall requests were dynamic past the initial page load, so the vast majority of requests resulted in a fork.

In terms of benefits, the simplicity is attractive. It's an approach that is in general quite resilient to errors.


> Yes, but that met the expectations of that time period,

As a peer poster said, expectations years ago were, if anything, higher than today. It is frustrating how common it is for sites to think that downloading multiple MB of code just to show a simple page, and having it take seconds to render, is somehow OK.

The expectations used to be ~100ms for a page load and render back when I was working on high performance web servers (~15 years ago).


What are your thoughts on darkhttpd?

https://unix4lyfe.org/darkhttpd/


Never used it, but it says:

- Uses sendfile() on FreeBSD, Solaris and Linux

- Event loop, single threaded - no fork() or pthreads

- Supports If-Modified-Since, Keep-Alive, IPV4, 301 redirects

And appears to be just a little larger than Althttpd. Sounds good to me.


But pretty much nobody is running a more traditional Unix nowadays. Almost everyone uses Linux for web servers. So let's judge the tool based on its actual context, not an unrealistic one.


I'm saying it's not terribly interesting or broadly useful, unlike the rest of sqlite, which is. There are other minimal http servers that are vastly more efficient without being much more complicated.


Blocks some referers by default:

    static const char *azDisallow[] = {
      "skidrowcrack.com",
      "hoshiyuugi.tistory.com",
      "skidrowgames.net",
    };
Anyone know why?


It is located in an #if 0 block, so my guess is that it might be for testing purposes.


There's also this:

    }else if( strcasecmp(zFieldName,"Referer:")==0 ){
      zReferer = StrDup(zVal);
      if( strstr(zVal, "devids.net/")!=0 ){
        zReferer = "devids.net.smut";
        Forbidden(230); /* LOG: Referrer is devids.net */
      }
Which I can appreciate why it's there, it's still odd.


I remember seeing these crackers around back in the day. Skidrow cracked at least some Grand Theft Auto games.

Not sure why they're blocked


There's a lot of cool technology in the Tcl and SQLite ecosystem. Wapp [0] is a tiny web application framework (a single file) also written by D. Richard Hipp

[0] https://wapp.tcl.tk/home/doc/trunk/README.md


I'm all for SQLite and I am a fan of the project's author, but for a webserver I have turned away from Nginx to https://caddyserver.com/ because of the simplicity.

Caddy is just really awesome as a reverse proxy (2-line config!!), and I am in the process of moving all my projects to it. It is fast enough as well, since other things will be the bottleneck way before it is.

I am not affiliated with Caddy in any way, just blown away by the quality of it.


The thing I love most about Caddy is that it automatically does all the SSL certificate garbage, which is so painful in every other web server ever. Yes, certbot makes it less painful, but it’s still a big PITA, unlike Caddy, where SSL just works like magic.


And zero dependencies!

This might solve my problem with older servers that no longer support the latest SSL.

I really need to upgrade those rickety old machines.


I suppose you mean zero runtime dependencies? It seems to have a few dozen build dependencies.

Runtime dependencies create a nuisance as you have to update several things together. On the other hand, they can allow components with separate update cycles and responsibilities to be update separately.

Build dependencies create maintainability and security problems. They can also solve maintainability and security problems; it depends on what your consideration is. But as a matter of practice, many developers seem so concerned with possible behavioral/API breakage that they pin to specific versions of their dependencies, which means they aren't getting any security fixes.

(Technically, Althttpd doesn't achieve zero runtime dependencies in comparison to a modern http server that does HTTPS, because it requires a separate program to terminate TLS. But these connect through general mechanisms that are much easier to combine and update separately.)

Everyone has to make a judgement about how they maintain their own systems, but being excited about "zero (runtime) dependencies!" isn't where that judgement concludes.


You mean zero runtime deps, because it pulls in a lot of stuff when it gets built. Still great, but I'd use traefik for more than 10 sites.


Caddy can serve thousands of sites without a sweat. What are your concerns exactly?


The web-UI-based config helps with lots of sites, and my clients can do it themselves without bothering me.


The only thing I don't really know how to do with it is round-robin DNS across many servers with Let's Encrypt HTTPS.

It feels like I'd then need either shared storage for the certificate files (which somewhat goes against the idea of decentralization) or a DNS challenge type.

Anyone have experience with something like that?


Shared storage is the solution. Caddy supports multiple different storage backends (filesystem by default, and Redis, Consul, DynamoDB via plugins) and uses the storage to write locks so that one instance of Caddy can initiate the ACME order, and another can solve the challenge. See the docs: https://caddyserver.com/docs/automatic-https#storage

I'm doing this exact thing, with the Redis plugin behind DNSRR and it works seamlessly.


I fiddled around for many hours with traefik, and could not get it to do what I wanted -- something I'd done before and had a known example working config of.

10 minutes of caddy, I had everything running exactly as I wanted and the job was done.


I tried Traefik for the first time around 6 months ago (version 2) - man, coming from nginx (which I wouldn't call simple), I found Traefik config to be really confusing. It felt like I had to specify the same stuff 2 or 3 times, and in general it was just so unintuitive. And the docs (at the time at least) only showed snippets of trivial examples.

I don't think I'd choose to use it again. Instead, I'll try Caddy, or HAProxy if I need massive performance.


A big missing feature in Caddy for me is an embedded language like Lua for nginx so you can write tiny hooks. The Caddy authors have indicated on HN a while ago that Caddy 2 may have an embedded scripting language but I can't find anything about it in their docs.


Seems like it was postponed[0].

For very tiny hooks, you might be able to get away with using request matchers[1] and respond[2].

[0]: https://caddy.community/t/missing-starlark-documentation/958...

[1]: https://caddyserver.com/docs/caddyfile/matchers

[2]: https://caddyserver.com/docs/caddyfile/directives/respond


Writing plugins for Caddy is so easy that it's generally not necessary to have scripting built-in. You can just build yourself a new binary with a Go plugin just by using the "xcaddy" build tool. https://caddyserver.com/docs/extending-caddy

But yeah, it's still something at the back of our minds, and we were considering Starlark for this, but that hasn't really materialized because it's usually easier to just go with the plugin route.


> A separate process is started for each incoming connection, and that process is wholly focused on serving that one connection.

It makes you wonder just how "heavy" operating system processes actually are. We may not always need to worry about the complexity of trying to run multiple async requests in a single process/thread.


Linux processes are actually surprisingly lightweight, thanks to copy-on-write memory: fork is cheap, exec is expensive, and changing things in already-allocated memory is expensive. But this execution model does not exec, and it mostly allocates new buffers for anything it needs to do. As an extra benefit, you get garbage collection as a gift from the kernel ;)
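
The accept loop for that model is only a few lines. A hedged sketch of what fork-per-connection looks like (althttpd itself is normally launched once per connection from xinetd rather than running its own loop):

    #include <signal.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Fork-per-connection accept loop: each child serves exactly one
    ** connection in its own copy-on-write address space, then exits. */
    static void serve_forever(int listen_fd){
      signal(SIGCHLD, SIG_IGN);     /* let the kernel reap exited children */
      for(;;){
        int conn = accept(listen_fd, 0, 0);
        if( conn < 0 ) continue;
        pid_t pid = fork();
        if( pid == 0 ){             /* child */
          close(listen_fd);
          /* ... handle one HTTP connection on conn ... */
          close(conn);
          _exit(0);
        }
        close(conn);                /* parent: the child owns conn now */
      }
    }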


Apache's thread-per-connection model used to run basically the entire internet until nginx came along and demonstrated 10k simultaneous connections on a single server.

If you only have around 100 concurrent confections, a separate thread per connection is entirely feasible. A whole new process is probably fine on Linux, but e.g. Windows takes pretty long to spawn a process


the default mode in apache/apache2 used to be _process_ per connection (pre-forked workers), not thread per connection. threads came much later.


If you have 100 concurrent confections you're probably baking cookies.


But you can probably get away with using one regular commercial oven instead of getting a baking tunnel oven :)


To be fair, I think it would be ideal if every service would run their own 100 concurrent connections, instead of everyone using a service that handles 1 trillion.


You might want to review that knowledge. You can spawn a few thousand threads on a modern machine without much contention.


well the thing is he confuses processes with threads. apache2-prefork used/uses processes and not threads.


I like to remember that threading APIs were a later add-on after Unix had already existed for years. They're not fundamental like the Process is.

Isn't it an appealing model to not even have to talk about threads because every process is 1 thread by definition?


Multiple processes are a bit easier to deal with than threads for servers, mostly because POSIX signals interact poorly with threads.

It's also more secure because you should not mix different users' requests in the same process if you can avoid it.

nginx runs on a one process per core model and more or less does everything correctly.


So what was the point of the async/callback web programming revolution if processes were good enough?


It's because they weren't good enough for the thousands of concurrent requests of C10k or the millions that came after it with C10m.

Granted, 10k concurrent requests is a problem for the 1% of websites, so processes were (and still are) good enough for the long tail of personal or small-scale websites.


Memory use. Even though threads/processes are “cheap” right now, it wasn’t the case in the past, and they are still quite far from the couple of KBs per connection needed in async servers. You’re not getting a million parallel processes handling requests any time soon.


To add on to every other sibling comment: switching threads or processes requires a trip back up to kernel space (a context switch) instead of just remaining in user space, and in this switch all of your caches get busted.

Not a problem for most folks, but when you want the greatest possible performance, you want to avoid these kinds of transitions. Basically, the same reason some folks use user-space networking stacks.


The point was that you didn't have the ability to spawn new threads at all. async lets you pretend you have threads, at the cost of everything being run through a hidden event loop.


I thought it was to get the best of both worlds in that you could max out a core with async to avoid context switching or waiting for blocked I/O but you still open up additional threads on more cores if you were becoming CPU bound.


yeah, basically. a lot of people do not know this, but async/await has NOTHING to do with threads or processes. you can use threads with async/await, but you don't have to. async/await basically means you are running your "green threads"/tasks/promises on an event loop, and the event loop can be either single-threaded or run the work on multiple threads.

a lot of people just never got the difference between concurrency and parallelism. threads and processes are basically parallelism, while async programming is concurrency. good talk about that stuff from rob pike (go): https://www.youtube.com/watch?v=oV9rvDllKEg


There wasn't a lot of point. Almost nobody needs this; but since everybody wants to do what the hyper-successful mega-scalers are doing...


Millions of concurrent connections. On 20 year old hardware with a small fraction of the power of today's.


Fork() on modern Linux is very fast/lightweight. This isn't true for all POSIXy operating systems though. This would perform really terribly, for example, on any Unix implementation from the early 2000s or before, and maybe on some current ones.


fork() is evil[0].

  [0] https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234


The downside is excessive context switching, and sharing data between processes becomes difficult (counters, etc).


Althttpd is less complex than other HTTPDs partly because it doesn't support encryption itself; instead, it recommends using stunnel4.


Isn't that the same philosophy behind varnish?



Similar is thttpd.

I was not familiar with darkhttpd. Both of these are similar to the sqlite server in security design (chroot capability), but unlike it, a single process serves all requests and does not fork.

I have used stunnel in front of thttpd, and chrome has no complaints.

https://acme.com/software/thttpd/


althttpd has CGI, which is sometimes interesting.


The thttpd server also has CGI. I wrote about combining it with stunnel. The rfc-1867 tool is rather dated, and I've replaced it for my internal use:

https://www.linuxjournal.com/content/secure-file-transfer


It takes years of complexity to achieve this level of simplicity


There is also redbean (https://justine.lol/redbean/) - a single-file web server with embedded Lua interpreter as an Actually Portable Executable by Justine Tunney, the creator of Cosmopolitan.


The idea that you can have an executable that is incredibly small, runs on macOS/Linux/Windows, is very fast, and has features is mind-blowing.

Justine Tunney is a treasure!


Everytime I come across Justine's work, I'm always amazed.


That is amazing. And it looks like she may embed SQLite in as well.


Already happened; https://github.com/jart/cosmopolitan/pull/185 (Add sqlite3 support to Lua scripts in Redbean) has been merged.


I really wish more developers took the time to comment code as well as the sqlite.org devs.


I misread the title as "in a single line of C-code". I thought "must be a pretty long line"


91,933 characters.

I was curious about the number of lines, and counting the characters was a simple select-all from there. There are 2,592 lines.


Does the z in zTmpNam or zProtocol signify a global variable? https://sqlite.org/althttpd/file/althttpd.c

Also, I'd like to complain about the hn hug of death, because it isn't happening.


The "z" prefix is intended to denote a "zero-terminated string", or more specifically a pointer to a zero-terminated string.


I use SQLite in applications via their C/C++ interface and it’s spectacular. I’d love to see a version of this that I could embed too.


I have a great appreciation for D. Richard Hipp's work.

    **    May you do good and not evil.
    **    May you find forgiveness for yourself and forgive others.
    **    May you share freely, never taking more than you give.


> separate process is started for each incoming connection

wow. Thanks also for elaborating the xinetd & stunnel4 configs.
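
The xinetd side is only a handful of lines; a sketch along the lines of what the althttpd docs describe (the paths here are illustrative):

    service www
    {
      port        = 80
      socket_type = stream
      wait        = no
      user        = root
      server      = /usr/bin/althttpd
      server_args = -logfile /logs/http.log -root /home/www -user www-data
    }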


Sure, it lives in a single file, but considering the length, wouldn't it actually be better to split it into multiple files? I'm not that familiar with C, but it seems to be a "thing" in C to just have giant files.


I've been thinking about this a lot recently. For most of my career I've been a Java programmer, and on the majority of projects each class is put in its own file. The amount of jumping around (between files / following method calls) can get really tedious when you're trying to grok a code base. I've been working on TypeScript projects recently where the standard has been slightly larger files -- possibly containing a class definition, but more often an entire module -- and it's actually been kind of nice to just read the entire thing top to bottom in one go. I've looked for studies on the "locality" of source code, but haven't really found anything.


Yes, it can be really jarring to have to constantly move between several small files.

I mostly use C#, and a while back I settled on a middle ground, where closely-related classes and interfaces are grouped together in a single file.

When I'm working on web apps/APIs, I usually follow the "feature folder" concept too, where all the most central parts are together in the same file.


SQLite is actually maintained in many files, but they are concatenated into one file for distribution. Here is the reasoning: https://sqlite.org/amalgamation.html


See Jef’s original thttpd

https://acme.com/software/thttpd/


I would like to see something like:

althttpd -exec some_executable {}.method {}.body

So you could quickly call an executable from a browser and redirect its output into the response.



Isn't this just CGI again?


Always has been.


That's a good idea. Don't know why you are being modded down.


Interesting but why? There’s a bazillion web servers out there, surely one of them can do the job?

What have I missed?


> [Althttpd ...] has run the https://sqlite.org/ website since 2004

I don't know what the landscape was like in 2004 really, but there were probably at least an order of magnitude fewer web servers than today's bazillion (whatever that number would be!).


I don't think Nginx was out then, so I was using Apache HTTPD. Maybe Dwayne considered that too heavy for what he needed to serve up.


Nginx was first released to the public in 2004 [1]. Apache was released in 1995 [2].

On a more personal note, wow! I had no idea I started using the Internet for realz before the release of Apache, in 1994. This young made me feel, not.

[1]: https://en.wikipedia.org/wiki/Nginx

[2]: https://en.wikipedia.org/wiki/Apache_HTTP_Server


Nginx’s first public release was Oct 2004, so fits with your theory.


I don't know of any other well-known web server with the same feature set. For instance, it has no configuration file, it's run from xinetd statelessly/single-threaded, it runs itself in a chroot, and it's short enough to be readable without specific effort.

It also isn't brand new: it's been around since 2004. So that probably narrows the range of possible competitors even more.

If you can find a webserver that meets all of those constraints, please let us know.


filed [0] is written to be readable, and stateless, and runs from a chroot, and has no configuration file. It doesn't run from xinetd and it's multi-threaded, though.

I wrote it because no other web server could serve files fast enough on my system (not lighttpd, not nginx, not Apache httpd, not thttpd) to keep movies from buffering.

[0] https://filed.rkeene.org/


> I wrote it because no other web server could serve files fast enough on my system (not lighttpd, not nginx, not Apache httpd, not thttpd) to keep movies from buffering.

Could you expand on that? What type of files, how many clients? I seem to recall plain apache2 from spinning rust streaming fine to vlc over lan - but last time I did that was before HD was much of a thing... Now I seem to stream 4k hdr over ZeroTierOne over the Internet to my Nvidia Shield via just DLNA/UpNP (still to vlc) just fine. But I'm considering moving to caddy and/or http/webdav - as a reasonable web server with support for range request seem to handle skipping in the stream much better.


You might want to try filed !

This was for serving MPEG4-TS files with, IIRC, H.264 video and MP3 audio streams -- nothing fancy -- from a server running a container living on a disk attached via USB/1.1.

While USB/1.1 has enough bandwidth to stream the video, the other HTTP servers were too slow with Range requests, because they would do things like wait for log writes to complete and re-open the file being served (which is synchronous and requires walking the slow on-disk directory tree).
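
For context, every seek in the player turns into a request like this on the wire (hypothetical offsets, for a 100 MiB file):

    GET /movie.ts HTTP/1.1
    Host: example.local
    Range: bytes=10485760-

    HTTP/1.1 206 Partial Content
    Content-Range: bytes 10485760-104857599/104857600
    Content-Length: 94371840

Any per-request setup cost (logging, open(), directory walks) lands between the request and the 206, which is exactly where a seeking video player stalls.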


> a server running a container living on a disk attached via USB/1.1.

Ah, ok. That makes sense. USB 1.1 can certainly challenge cache layers and software assumptions.

I do wonder how far apache2 might have been pushed, dropping logs and adjusting proxy/cache settings.


I do not know, though I do know that just disabling logging wasn't sufficient.


You missed the date in the comment of the C file: 2001-09-15

Back then, there wasn't a bazillion web servers out there. A patchy server was still... patchy. And the engine that solves problem X (c10k) hadn't been created yet :)

(for whoever reads my comment, I am referring to Apache and nginx)


Dependency awareness. External stuff brings surprises. Not everybody feels comfortable with that.


There’s plenty of lightweight minimal web servers with minimal dependencies.

But hang on is dependency anxiety really the reason or did you just make that up?


No, not made up – self-made means even fewer dependencies than a 3rd party without other deps. And minimal is still more than zero.


>There’s plenty of lightweight minimal web servers with minimal dependencies.

In 2001?


Simplicity


This pretty much sounds like my networks class final project.


One file of 2600 lines. I guess if you give up common sense and follow this approach you can even create a kernel in one file.


I am not sure exactly where you were trying to go with this, but 2600 lines is actually pretty short for a C program.


Whose common sense though? You seem to have one particular opinion on how to organize a project, but that's not the only one. Sometimes, having just one file is easier.


Depends on your development style and aims.

One file is easy to add to a project, and the compiler optimizes better within a single translation unit than across several, so you get a bit of a performance increase in some cases.

Having "Find symbol in file" is nice too if you know you are looking for it just in this one file related to the code. Most editors aren't as ergonomic for finding "symbol in current directory" as they are for "symbol in current file".


For parsing/protocol handling the implementations where this was distributed amongst several files and/or classes have usually been the worst, in my experience.

But now I want to see how badly you could Uncle Bob this thing. My screen should be wide enough for the resulting function names.


have you ever looked at the source of sqlite?


AFAIK FreeRTOS is a kernel and distributed as a single file.


Maybe that was true at some point, but it’s no longer true. FreeRTOS consists of a core of 6-7 .c files (some of which may be optional) plus 2-3 board support source files. See their GitHub mirror: https://github.com/FreeRTOS/FreeRTOS-Kernel


the beauty and power of simplicity


I thought forking webservers was too slow?


It depends how complex a process you are forking and what OS you are running on. It has been some time since I wrote any real code for Linux or Windows that would be significantly affected by such things, but it used to be that forking could be almost as efficient under Linux as starting a new thread in an existing process. Under Windows this was very much not the case. Threads make communication between parts easier (no IPC needed) but that isn't usually an issue for the work a web server does.

Compared to purely event-based web servers, forking can still be better: no request should fully block another, and one request usually can't crash another (which is more likely with threads). Thread- or fork-based servers can also make better use of concurrency, which is significant for CPU-heavy jobs.

So swings & roundabouts. Each type (event, thread, process, some hybrid of the above) has strengths, and of course weaknesses.


> but it used to be that forking could be almost as efficient under Linux as starting a new thread in an existing process.

That's because internally it's nearly the same thing. Both forking and starting a new thread on Linux are variants of the clone() system call, the only difference being which things are shared between parent and child.
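
A rough sketch of that (Linux-specific, using the glibc clone() wrapper; error handling trimmed):

    /* fork_vs_thread.c -- illustrative only */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child(void *arg) {
        printf("child pid=%d\n", (int)getpid());
        return 0;
    }

    int main(void) {
        enum { STACK_SIZE = 1024 * 1024 };
        char *stack = malloc(STACK_SIZE);

        /* fork()-like: no sharing flags, so the child gets a copy-on-write
           duplicate of the address space.  A thread would instead pass
           CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD
           so that memory, file descriptors, etc. are shared. */
        pid_t pid = clone(child, stack + STACK_SIZE, SIGCHLD, NULL);

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }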


Too slow for what?


Take a look at Fossil too: https://www.fossil-scm.org/

It's the distributed version control system (and more) used by SQLite. Most people have no idea how cool the SQLite ecosystem is, and how it's used even in avionics!


The lack of ability to squash PRs into single commits is a deal breaker for many, myself included.

No, having a commit "fix typo" in the main branch's history is not at all useful and won't ever be. It's noise.

In a work setting it's much better to reduce noise.


Fossil does have the ability to squash commits.

The difference is that Fossil does not promote the use of commit-squashing. While it can be done, it takes a little work and knowledge of the system. Consider the premature-merge problem in which a feature branch is merged into trunk before it is ready, and subsequent typo fixes need to be added. To do this in Fossil you first move the errant merge onto a new branch (accomplished by adding a tag to the merge check-in) then fix the typo on the original feature branch, then merge again. So in Fossil it is a multi-step process. Fossil does not have a "rebase" command to do all that in one convenient step. Also, Fossil preserves the original errant check-in on the error branch, rather than just "disappearing" the check-in as Git tends to do.

The difference here is a question of priorities. What is more important to you, an accurate history or a clean history that tells a story? Fossil prioritizes truth over beauty. If you prefer a retouched or "photoshopped" history over an auditable record of what really happened, Fossil might not be the right choice for you.

To put it another way, Fossil can squash commits, but another system might work better for you if commit-squashing is your go-to method of dealing with configuration management problems.
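
For the curious, that multi-step dance looks roughly like this in Fossil commands (hypothetical hash and branch names, following the description above):

    fossil amend abc123 --branch mistake   # move the errant merge off trunk
    fossil update feature-x                # back to the feature branch
    # ...fix the typo, then:
    fossil commit -m "Fix typo"
    fossil update trunk
    fossil merge feature-x
    fossil commit -m "Merge feature-x into trunk"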


> Consider the premature-merge problem in which a feature branch is merged into trunk before it is ready

That isn't the normal use case for commit squashing though. Generally, trunk/master/main isn't ever rewritten. Squashing is usually done on feature branches _before_ merging. What does that look like in fossil?

It seems like part of the problem is that fossil is designed for a very different workflow. See https://www.fossil-scm.org/home/doc/trunk/www/fossil-v-git.w.... The autosync, don't commit until it is ready to be merged workflow might work well for a small flat organization, but I'm not sure how that scales to large organizations that have different privilege levels, formal review requirements, and hundreds or thousands of contributors.


I appreciate the different approaches and I'm grateful for the info (and the opinionated take) right from the source. Thank you.

In terms of Git and the usual commercial SCM practices, I'm speaking empirically. Everywhere I worked in a team, leaders and managers wanted main branch's history to be a bird's-eye view, to have every commit fully build in CI/CD, and be able to find who introduced a problem (admittedly this requires a little more digging compared to Fossil, though). Squashed commits help when auditing for production breakages, and apparently also helps managers do release management (as well as billing customers sometimes).

Do I have all sorts of minor commits in my own projects? Sure! I actually pondered using fossil for them but alas, learning new tools just never gets enough priority due to busy life. I'd love to learn and use it one day. I'm sick of Git.

But I don't think your analogy with a photoshopped / retouched picture is fair. Squashing PRs into a single commit is not done for aesthetic reasons or for deliberately disappearing information -- a link to the original PR with its branch and all commits in it remains and can be fully audited after all. No information actually disappeared.

I believe a better analogy would be with someone who prefers to have one big photo album that contains smaller albums which in turn contain actual photos of separate life events that are mostly (but not exactly) in chronological order -- as opposed to Fossil's approach which can be likened to a classic big photo album with all semantically unrelated photos put in strict chronological order.

I'll reiterate that my observations and opinions are mostly empirical. And let me say that I don't like Git at all. But the practice I described does help in a classic team of programmers and managers.

I concede that Git and its quirks represent a local maximum that absolutely can be improved upon, but at least to me the jury is still out on what's the better approach -- and I'm not sure a flat history is it.


We probably use the tools differently. I squash minor commits in my local branch, so I never want the review system squashing branches. If I put a branch for review, it is a series of patches which should be applied as-is. See the kernel mailing list and associated patch sets for the sort of thing I mean.

I frequently use git rebase in interactive mode to rearrange and curate my commits to form whatever narrative I'm aiming for. Commits are semi-independent stories which can be merged, in order, at any rate and still make sense. Each commit makes sense with respect to history, but doesn't care about the future.

I squash and rearrange and fixup commits until they look the way I want, and would want to see if I was looking at a history, and then send them for review.

Whether you merge my branches, or fast-forward and rebase the individual patches, makes little difference to me. But please don't squash my hard work.


We definitely do use tools differently.

Not looking to pick a fight here, mind you, but the Linux kernel is hardly a representative demonstration of how to consume Git out there in the wild.

The way you describe your usage, it already seems you kinda sorta do your own squashed commits, only you want several of them to get merged into the main branch, not just one. So you're still rewriting history a bit, no?


Yes! Absolutely I am rewriting history. I don't care about my messy history, and nor should anyone else. I routinely force-push to my branches because, again, it's my mess. Tools like mailing lists obviously handle this fine because every revision of my patches is published there, so who cares. Tools like Gerrit handle this well because they store patch sets regardless of branch state -- as they are unrelated -- which feels like the right model. Tools like Github just suck at this in general and fall apart when you start force-pushing PR branches, but whatever.

The problem with squashing at review time is that it is incompatible with my model. I'd rather teach engineers to not push shitty "fix the thing" or "oops" commits that are ultimately useless for me, the reviewer or reader. Just use git commit --fixup and git rebase --autosquash as $DEITY intended and force push away. It's your world, you'll be deleting the branch after you land it anyways.

The tool works so well when used as intended: a distributed version control system. The centralized model adds so much pain and suffering.
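
For anyone who hasn't tried that flow, it's only three commands (hypothetical commit hash):

    git commit --fixup=a1b2c3d       # records a "fixup! ..." commit targeting a1b2c3d
    git rebase -i --autosquash main  # reorders and squashes the fixup automatically
    git push --force-with-lease      # safer than a bare --force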


I never understood the point of squashing: if you just want the short version, only read the merge commits; if you want the full details to figure out some bug or whatever, then yes, I most definitely want those fix-typo commits, because as often as not, those are at fault.

Squashing buys you next to nothing, and costs you the ability to dive into the history in greater detail.

I suppose if your project is truly huge, it becomes worth it to reduce load on your VCS, but beyond that...


Squashing for example allows me to have a history where each commit builds. This has been very useful for bisecting for me. I wouldn't call it "next to nothing".

The "greater detail" part can cost me a lot of time.


Having commits that do not build represents the history more accurately. It could very well be a “fix” to some build error that silently introduces an issue; that context is lost when you squash.


> Having commits that do not build represents the history more accurately.

Sure it does, but sometimes that level of detail in history is not helpful. Individual keystrokes are an even finer/"more accurate" representation of history; but who wants that? At some point, having more granular detail becomes noise - the root of the disconnect is that people have a difference in opinion on which level that is: for some (like you), it's at individual commit-level. For others (like me), it's at merge-level: inspecting individual commits is like trying to parse someone's stream-of-consciousness garbage from 2 years ago. I really don't care to know you were "fixing a typo" in a0d353 on 2019-07-15 17:43:32, but your commit is just tripping up my git-bisect for no good reason.


Can you not just skip all non-merge commits?


Sure, I can - but should I? That's the fundamental difference in opinion (which I don't think can be reconciled). I don't need to know what the developer was thinking or follow the individual steps when they were developing a feature or fixing a bug; for me, the merge is the fundamental unit of work, not individual commits. Caveat: I'm the commit-as-you-go type of developer, as most developers are (branching really is cheap in Git). If everyone was disciplined enough not to make commits out of WIP code, and every commit was self-contained and complete, I'd be all for taking commits as the fundamental unit of code change.

If the author did something edgy or hard-to-understand with the change-set, I expect to see an explanation of why it was done that way as a comment near the code in question, rather than as a sequence of commit messages; that is the last place I will look - but that's just me.


But the problem with this approach is that you're making it impossible to extract the actual changes when you do want them, whereas simply skipping non-merge commits is a minor inconvenience (`--first-parent` tends to cover it).
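
Concretely (the bisect variant needs a fairly recent git, if I remember right):

    git log --first-parent main       # merge-level view of mainline history
    git bisect start --first-parent   # bisect along the mainline only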

I mean, granted, it's not ideal. I think this is a bit of a problem with the low-level nature of git - ideally it'd be easier to semantically bundle such sequences of commits so that they'd be more reliably dealt with in the broader ecosystem (not every tool supports --first-parent), and in any case, there's nothing forcing you to maintain the first-parent-is-linear-mainline-history convention; that's just a tradition which, again, many common tools follow. Then of course there's the poor integration between git hosting (such as github) and git - I can blame a file, but I can't easily correlate that with the discussions in the PRs, and whatever correlation there is is purely online, with all the limitations that a single-vendor, non-distributed system like that entails.

Ideally this wouldn't even be a tradeoff at all; it would be obvious how to track history both at the small scale and the larger scale (and perhaps even more?), but alas, it's what we have.

Out of curiosity - when you merge via squash, what kind of commit messages do you retain? Do you mostly concatenate the commit messages, or rewrite the whole thing?


> when you merge via squash, what kind of commit messages do you retain? Do you mostly concatenate the commit messages, or rewrite the whole thing?

Context-specific. For a bigger PR that deals with an extensive refactor I'll prefer to have a descriptive title and hand-curated task list below (so definitely not 1:1 to commit messages). For smaller PRs -- or more focused ones, like those dealing with a single feature or bug -- I'll only leave a descriptive title.

But I usually never leave a list of commit messages. Not because I have no discipline; sometimes some refactoring requires 4-5 steps and all commits have 99% identical messages which is not useful when you aggregate those in a single list of bullet points in the end.

---

> But the problem with this approach is that you're making it impossible to extract the actual changes when you do want them, whereas simply skipping non-merge commits is a minor inconvenience (`--first-parent` tends to cover it).

Again, that's not the issue here. The issue is that when you work on a big project (like many of us do) you get something like 4-7 merged PRs a day; don't pull/fetch for 3 days and you'll get 60+ lines in your terminal when you get to it.

There are people who manage releases and people who chase subtle regressions. Having git bisect narrow it down to a big PR squashed commit is actually a win; it gives them a localized area inside which they can work with other tools (not bisect).

In the end I suppose we can say it's a subjective taste. But I always appreciated the main branch's history only consisting of squashed commits. Again, it gives you a good bird's-eye view.


Storing every single version of the file which ever hit disk locally on my machine in the history would be the most accurate, yet no one seems to advocate for that. Even with immutable history, which versions go into the history is a choice the developer makes.


>Storing every single version of the file which ever hit disk locally on my machine in the history would be the most accurate, yet no one seems to advocate for that.

You'd be surprised. It's only because we understand (and are used to) tool limitations (regarding storage, load, etc) that we don't advocate for that, not because some other way is philosophically better.

I'd absolutely like to have "every single version of the file which ever hit disk locally on my machine in the history".


Indeed, I'd be very surprised if you used such a feature on a daily basis.

I understand the rationale but the balance tilts too far into the "too much details" territory for me and that can slow me down while digging.

What I found most productive for myself is that searching for a problematic piece should happen on a two-tiered tree, not a flat list. What I mean is: first find the big squashed PR commit that introduces the problem, then dig in more details inside of it.

Not claiming my way is better, but in almost 20 years of my career I observed it was good for many others as well, so I am not exactly an aberration either.

To me a very detailed history is mostly a distraction. Sure `git-bisect` works best on such a detailed micro-history but that's a sacrifice I am willing to make. I first use bisect to find the problematic squashed commit and then work on its details until I narrow down the issue.


I mean, this isn't even really all that far-fetched; other systems do work like that, such as e.g. Word's track changes or Google Docs' history - or even a database's transaction log.

And while those histories are typically unreadable, it is possible to label (even retroactively) relevant moments in "history"; and in any case, just because a consumer-level word processor doesn't export the history in a practical way doesn't mean a technical VCS couldn't do better - it just means we can't take git-as-is or google-docs-as-is, add millions of tiny meaningless "commits", and hope it does anything useful.


All of that is completely valid and I appreciate it. But you are addressing a group of people most of whom are already overbooked and have too much on their plate every day. How viable is it to preach this approach to them?


Is that with or without auto-save? Though, that might actually be interesting for some academic research if enough data was collected.


Representing the history more accurately is not a useful design goal


Why not? I do not see any reason to have a history at all for anything except to be able to go back to a specific version to track down a problem. Inaccurate history makes that less useful.


Typo commits and the usual iteration during development aren't "accurate history". Noise in commit logs provides negative value.

Ideally, each commit should be something that you could submit as a stand-alone patch to a mailing list; whether it's a single commit that was perfect from the get-go or fifty that you had to re-order and rewrite twenty times does not matter at all; the final commit message should contain any necessary background information.

It would be needlessly restrictive to prevent users from making intermediate commits if that helps their workflow: I want to be able to use my source-code management tool locally in whichever way I please, and what you see publicly does not have to have anything to do with my local workflow. Thus, being able to "rewrite history" is a necessary feature.


Patch-perfect commits are an idealistic goal. The truth is that, as already mentioned, many of those typos and “dirty” commits can be the source of bugs that you’re looking for. Hiding them hides the history.


> Patch-perfect commits are an idealistic goal.

Indeed they are, hence the need to rewrite the history. Managers, tech leads, or users of your OSS project don't care about the "fix typo" comments. They are interested in a meaningful history that tells a bigger story.

And to be frank, I am interested in the same, mid-term and long-term. While I am grappling with a very interesting problem for a month then yes, I'd love my messy history! But after I nail the problem and introduce the feature I'll absolutely rewrite history so the squashed PR commit simply says "add feature X" or "fix bug Y".


> Managers, tech leads, or users of your OSS project don't care about the "fix typo" comments. They are interested in a meaningful history that tells a bigger story.

Why would they be looking at the version control system for this? That is not what it's there for.


GitHub in particular is widely used by managers -- not the higher-level managers of course, but a lot of engineering managers have mastered the usage of GitHub issues, Markdown task lists inside PR descriptions, and reviewing results of CI/CD pipelines.

And many tech leads simply don't have the time to review every single WIP commit. They want a meaningful message/description for the big squashed PR commit. If you just post a merged list of all commit messages with 10x "fix stuff" inside, you'll be in big trouble the next time around and your work will be inspected very closely.

The practices I am describing to you are reality in many tech companies. Writing code there is not about you at all. And almost nobody will read your code and PR descriptions unless they really have to. Hence it's a professional courtesy to make those as small and meaningful as possible.


> And many tech leads simply don't have the time to review every single WIP commit

Do you usually review every commit? I usually just review the diff between the PR'd branch and master, as does everyone I work with.


Exactly, and that's why we use squashed commits.


Why? You don't need to. GitHub will show you that diff, git itself will show you that diff. There is no need to permanently rewrite history to do this. That makes no sense.


For the last time: squashed commits help teams who need the main branch to be a high-level history of delivered features and fixed bugs, where one commit is one delivered feature or one fixed bug. That's the idea.

Whether you think that "makes no sense" is inconsequential. Many people find it very meaningful.


That is absolutely what it's there for! git blame and git log are extremely commonly used; git bisect is only convenient in a commit history where breakage is uncommon, &c.


None of those require a "meaningful history that tells a bigger story".


Problem is, when you're debugging later on you need to understand what a breaking commit does (or is intended to do) before knowing what it did wrong.

Let alone the code review issue. In any multi-person project, it is just as important that your commit history be readable by a third party for information as that it be useful for your own personal debugging.


That’s solved by writing good commit messages from the start. Committing junk or WIP is an easily fixable culture problem.


That's not realistic at all. Programmers are oftentimes "in the zone", and having to craft perfect messages while you're rushing on to the next problem kills productivity.

Compromises with human nature must be made. Hence -- we need to be able to rewrite history.

Your academic purism is out of place, dude. Real humans don't work like you say they do. Some do -- most don't.


Exactly. You should not commit WIP.

By definition the actual history of your work includes WIP, so this means your commit history should not reflect the actual history of your work.


I'm of the opinion that you should commit WIP stuff. Use the SCM for managing your source code, damn it!

Just don't publish WIP crap; fortunately, you can have your cake and eat it too, with git.

The biggest reason git (and any similarly advanced SCM) is superior to non-distributed alternatives like Subversion is that I can use it to manage my own workflow, instead of just as the final off-site backup of whatever I decide to publish. I get to actually use everything git offers for shuffling commits and code around while coding.

Want to switch contexts quickly? git commit the whole thing and just switch a branch.

How about untangling a hairy merge? Do it piecemeal and commit when you're done with each bit; it's trivial to then undo mistakes, redo, combine or reorder stuff and you cannot lose any work by accident because git commits are immutable.

All of these features essentially require history rewriting; sure, you're free to rebrand and not call "store WIP state in repository" a commit even though it is one, but I would consider any SCM without these features nigh useless for most work.


That was indeed my point; this started out with someone saying git is bad because rewriting history is bad, and me pushing back on that.


> Ideally, each commit should be something that you could submit as a stand-alone patch to a mailing list

Why?


Because when debugging with said history a week or a month down the line, or when someone else is debugging with said history, their human comprehension is essential for understanding why a certain event in the history is causing problems and what that event was intended to do.


That seems connected to the initial claim in only the most tenuous way.


In the Linux kernel mailing lists, which are the submit-by-email culture I'm most familiar with and with the best documentation of norms, the main criterion for individual patches is that they be comprehensible to a reviewer. Reviewers and bugfixers face similar reading comprehension constraints.


But why should any of that apply to my version control system? I'm not developing the Linux kernel on a mailing list.


It shouldn't for your personal stuff.

But it absolutely should when you work in a team. I'll personally scold you if you waste my time with merging a 20-commit PR of which 10 commits are "fix stuff" or "fix typo" or "oops forgot variable" etc. It won't pass code review.

You are free to disagree. I am only telling you how it is in many companies.


But why? What problem are you actually solving with this?


1. Waste of code reviewers' time

2. Waste of later coders' time when running git blame on a line when trying to figure out the purpose of code

3. Waste of later debuggers' time when they need to decide whether this error is the thing they're looking for, an unrelated error to bisect skip, or an unrelated error in possibly-related code that they need to manually fix and then re-test.


As the other sibling poster says, it's about not wasting other people's time. Make the messages/descriptions ruthlessly short and to the point and your colleagues will like you.


Leaving aside the readability and code-review concerns, bisection as a process is supremely painful when you have to separate out your targeted bug from the usual intermittent compilation and runtime issues that show up during local development.


You're not saying anything new as far as I can tell. Your grandparent already said what you said. I only disputed the "this costs next to nothing" part, which you don't seem to comment on.


I was commenting exactly on that. You think having every commit build, at the cost of destroying history, is gaining something.

I think representing history correctly is best, and agree that “squashing buys you next to nothing” other than visually pleasant output. Clearer?

My favorite approach is rebase + non-ff merge. Best of both worlds.


No. They think having every commit build is gaining something.

You think the disadvantage (destroying history) is more important, but you can't say that it "buys you nothing".


Ideally your branch commits would usually build too; but admittedly, I tend to bisect by hand, following the first parent. I suppose if the set of commits were truly huge that might be more of an issue. And of course, there are people that have hacked their way to success here: https://stackoverflow.com/a/5652323, but that sounds a little fragile.
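
A less fragile variant of that hack: `git bisect run` treats exit code 125 as "skip this commit", so unbuildable commits can be stepped over automatically:

    git bisect run sh -c 'make || exit 125; ./run-tests'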


Who squashes like this? I rebase all of my PRs not because I want to trash history, but because I want history to be meaningful. If I include all of my "whoops typo fix" and "pay respect to the linter gods" commits, I have made my default branch history much less readable.

I would say what you're describing is a breakdown in CI/CD and code review. How is code that broken getting into your default branch in the first place?


I certainly don't, but it's a standard option in git merge (and IIRC github supports it), so I'm pretty sure some teams do.

As to rebases to clean up history (and not just the PR itself)... personally, I don't think that's worth it. My experience with history like this is that it's relevant during review, and then around 95% of it is irrelevant - you may not know which 5% is relevant beforehand, but it's always some small minority. It's worth cleaning up a PR for review, but not for posterity.

And when it comes to review, I like commits like "pay respect to the linter gods" and the like, because they're easy to ignore, whereas if you touch code and reformat even slightly in one commit, it's often harder to skip the boring bits; to the point that I'll even intentionally commit poorly formatted code such that the diff is easy to read, and then do the reformat later.

Removing clear noise (as in code that changes back and forth for no good reason) is of course nice, but it's easy to overdo; a few typo commits barely impact review-ability (imho), and rebases can and do introduce bugs - you must have encountered semantic merge conflicts before, and those are 10 times as bad with rebases, because they're generally silent (assuming you don't test each commit post-rebase) but leave the code in a really confusing situation, especially when people fix the final commit in the PR but not the one where the semantic merge conflict was introduced, and laziness certainly encourages that.

It also depends on how proficient you are with merge conflicts and git chicanery. If you are, then the history is yours to reshape; but not everybody is, and then I'd rather review an honest history with some cruft than a Frankenstein history with odd seams and mismatched stuff in a commit that basically exists because "I kept on prodding git till it worked".


Having full history in the feature branch is mandatory -- nobody is disputing that, myself included.

All due diligence is done there, not in the main branch.

The main branch only needs to have one big commit saying "merging PR #2169". If you need more details you'll go to that PR/branch and get your info.

The "fix typo" commit being in the main branch buys you nothing. It's only useful in its separate branch.


But that's what a merge commit is - the diff along the merge commit is what the squashed commit would be; the only thing squashing does is "forget" that the second parent exists (terminology for non-git VCS's may be slightly different).

Why not merge?


Because in big commercial projects several people or teams merge inside the main branch on a regular basis. It's easier to look at a history only including squashed commits each encompassing an entire PR (feature or a bug).

It gives you better high-level observability and a good bird's-eye view. And again -- if you need more details you can go and check all separate commits in the PR/branch anyway.

And finally, squashed commits are kind of atomic commits. Imagine history of three separate PRs happening at roughly the same time. And now all those commits are interspersed in the history of the main branch.

How is that useful or informative? It's chaos.

EDIT: my bad, I conflated merging with rebasing. Still, I prefer a single squashed commit for most of the reasons above, plus those of the other two commenters (useful git blame output and buildable history).
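
Concretely, the two options being debated look like this (hypothetical branch name and PR number):

    git merge --no-ff feature-x      # merge commit; branch history kept as the 2nd parent

    git merge --squash feature-x     # stages the combined diff only...
    git commit -m "Add feature X (#1234)"   # ...lands as one commit, no link to the branch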


Oh yeah, rebasing branches as a "merge" policy is definitely tricky like that. (I mean, I'm sure some people do that, perhaps with good reason, but it makes this kind of stuff clearly worse.)


This is fine if it's how your team develops, but not for everyone. We don't care about full history in branches; maybe it has more detail than main/master, but it should still be contextually meaningful. I'd never approve a commit into main with the message "merging PR #xxx" either; it's redundant (merging), has no summary of what it actually does, and relies on an external system (your PR/MR process) for details. I do agree that keeping noise out of your main is key, but would go even further than you to keep it clean AND self-contained.


Well sure, the title was just an example. It usually is more like this:

"Encrypt project's sensitive fields (#1234)"

With the number being a PR or an issue # (which does contain a link to the PR).

I do care about history in branches though. And many others do. I agree that it varies from team to team.


Greater detail can ultimately lead to less information if noisy commits crowd out the important ones. IME you want a commit to mean something, and that usually leads to tweaking the work/change/commit relationship, which is where squashing helps.


I want to be able to tell why a given line of code was introduced. Seeing "Fix indentation" in the output of `git blame` won't help me with that.


I also want to be able to tell _why_, which is why I dislike working on codebases that squash commits. Too many times I've done a blame to see why a change was made, and it's a giant (> 10) list of commit messages. Oftentimes, the macro description of what was going on does not help me with the line-level detail.

Also, in case it helps you in the future, `blame -wC` is what I use when doing blame; it ignores whitespace changes and tracks changes across files (changes happened before a rename, for example.)


Neither does squashing though: you still can't tell if that line was introduced or modified.

I've come across "fix indentation" or "fix typo" commits where a bug was introduced, like someone accidentally comitted a change (maybe they were debugging something, or just accidentally modified it).

For example: I'm tracing a bug where a value isn't staying cached. I find a line of code DefaultCacheAge=10 (which looks way too short) and git blame shows the last change was modifying that value from 86400. What I do next will be very different if the commit message says "fix indentation" vs "added new foobar feature" or "reduced default cache time for (reason)".


git blame --first-parent


I agree with you that such a commit has nothing to do in the main branch. But it has nothing to do in any branch that is shared with anyone either. Git has enough ways to keep your own history clean at all times to not require a hack like squash PRs to compensate for a team's lack of discipline. With squash PRs you lose so much valuable information that it becomes impossible to use commands like bisect or to have proper context on a blame.


I agree that you lose the benefit of a direct bisect but this is usually shrugged off with "you can go to the PR and inspect the commits one by one" which, while not ideal, is deemed a good tradeoff if you want your main branch to only contain big commits each laser-focused on one feature or bug.

As I replied to @SQLite above, I am not saying this is the optimal state of affairs -- not at all. But it's what is required everywhere I ever worked for 19.5 years (and similar output was desired when we worked with CVS and Subversion and it was harder to achieve there).

But I'll disagree that this is a lack of discipline. It's not that at all. It's a compromise between programmers, managers/supervisors, CTOs / directors of engineering, and release engineers. They want a bird's-eye view of the project in the main branch.


Consider editing. Yes, a paper goes through revisions, which are holistic, and represent one iteration to the next. But in between those revisions, you can see markups, and editor notes as to why a change was performed.

Sometimes those insights are just as useful as the packaged whole.


Everybody who ever wrote their own diffing program has considered that and that's not the problem. The problem is how does this approach scale in a repo with 50+ contributors? It doesn't, sadly.


> how does this approach scale in a repo with 50+ contributors?

You surely aren't going to have 50 contributors all simultaneously working on the same mainline or feature (and if you ever do, I'd say that's poor project management). The reality is a portion of the developers work on this feature in this part of the code base, a few over here, they'll be on their own branches or lines, and everything will be fine.

This scenario where we have 50+ devs all crashing and bumping into each other rarely happens, if ever. I have personally only seen one instance of it happen in kernel development. And even then it was relatively straightforward to sort out.

To go further, in a hypothetical scenario where there is one feature and 50+ open source developers are all vying to push their patches in, there is still going to be one reference point to work off of, and reviewers are going to base everything off that. It's a sequential process, not concurrent.


A long, long, long time ago Fossil destroyed code for Zed Shaw.

And since that day, no one looked at Fossil the same way; at least not the same way they looked at git.

https://www.mail-archive.com/fossil-users@lists.fossil-scm.o...

I can't tell for sure how much of an impact this had on Fossil's adoption, since it's hard to beat git no matter how good you are, but I think it was a big hit.


> long long long time ago Fossil destroyed code for Zed Shaw

1) No, it didn't. Please read the part of the thread after Zed's initial panic attack.

2) To the best of our[1] knowledge, fossil itself has never caused a single byte of data loss. When fossil checks in new data, it reads that data back (in the same SQL transaction the data was written in) to ensure that it can read what it wrote with 100% fidelity, so it's nearly impossible to get corrupted data into fossil without going into the db and massaging it by hand. People have lost data by storing their only copy of a repository on failing/failed storage or on a network drive, but no software can protect against hardware failure, and nobody in their right mind tries to maintain an active sqlite db over a network drive (plenty of people do it, despite the repeated warnings of anyone who knows anything about sqlite, and they have only themselves to blame when it goes pear-shaped). Fossil makes syncing to/from a remote copy absolutely trivial, so any failure to regularly sync copies to a backup is end-user error.

[1] = the fossil developers.
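
For reference, the syncing really is about one command each way (hypothetical remote URL):

    fossil clone https://example.org/project.fossil project.fossil
    fossil open project.fossil    # check out a working copy; autosync is on by default
    fossil sync                   # explicit two-way sync, e.g. against a backup remote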


Heh, this reminded me of _why's comments from the Zed drama waaaaay back: https://gist.github.com/brianjlandau/186701

> Let me put it this way. Suppose you’ve got Zed Shaw. No, wait, say you’ve got “a person.” (We’ll call this person “Hannah Montana” for the sake of this exercise.) And you look outside and this young teen sensation is yelling, throwing darts at your house and peeing in your mailbox. For reals. You can see it all. Your mailbox is soaked. Defiled. The flag is up.

> Now, stop and think about this. This is a very tough situation. This young lady has written one of THE premiere web servers in the whole wide world. Totally, insanely RFC complaint. They give it away on the street, but everyone knows its secretly worth like a thousand dollars. And there was nothing in that web server that hinted to these postal urinations.


Which is a shame, because from the follow-up emails it becomes clear that the loss was actually Zed Shaw's fault (he explicitly typed "fossil revert", which is what deleted his code) and not the bug's (there was a bug he encountered, but it didn't result in data loss).


Technically, Fossil didn't destroy the code; Zed destroyed it by running `fossil revert`. All Fossil did was run off into la la land where nothing makes sense and everything is invisible and the working directory is empty.

Still an impressive bug – but the first rule of “my VCS has broken” is “stop running commands, and copy the VCS directory somewhere else”. (I've needed to do this to a .git twice.) Had he done this, he wouldn't've lost work. While he shouldn't've had to, “stop doing everything and take a read-only copy” is the first step when any database containing important data has become corrupted.


I'm pretty familiar with a lot of Zed's doings but I had missed this one. I doubt it impacted the adoption much; losing one advocate like that isn't going to doom your project.

I finally gave Fossil a serious try last year for a few months, just on my own, but I don't think my opinions would change if I tried it in a team setting. I still love the idea, but the execution is... ghetto. Serviceable, certainly, but ghetto is the best overarching description I have for it. You have to be willing to look past a lot of things (sure, many of them petty) in order to take it over dedicated, polished services for the things it combines. And git+[choice of issue tracker]+[choice of forum]+[choice of wiki]+[choice of project website]+etc. is the real competition against Fossil, not git+nothing, so even if it was better on the pure version control bits it would still be a tough battle. (Not to mention the elephant in the room: git+GitHub is a pretty good kitchen sink on its own if you don't want to think about choices and want something serviceable that's also a lot less ghetto.)

I also realized how much I love git's staging area concept once it was gone -- even when I had to use Perforce a lot, at least Perforce has the concept of pending changelists so you have something similar. I've never been a big fan of git's rebase, but it's also brought up a lot as a feature people are unwilling to give up, and I see the appeal. In summary, I think the adoption issue is just that people who do eventually give it a shot find usability issues/missing functionality they aren't willing to put up with.


And git+[choice of issue tracker]+[choice of forum]+[choice of wiki]+[choice of project website]+etc. is the real competition against Fossil

I think that this is only true for some projects. For some people, the ease of self-hosted setup (one executable) and the fact that you can change the documentation, edit code, and close bugs offline is a big win that no centralized service can compete with.


Sure, one executable is nice, that might be enough of a payoff for some people to overlook the rest. I think you may be underestimating the amount of choice that's out there though. For documentation alone there's countless options that don't require a centralized online thing. Three I'd use over Fossil again are 1) simple project doc/ folder with .md files inside (which you can render locally to look just like github with https://github.com/joeyespo/grip) 2) Doxygen or a language-specific equivalent if it's better 3) libreoffice docs stored either in the same repo or another one (perhaps a submodule).

For project management I'm less familiar with the options out there but I'd be surprised if there was nothing that gives a really stellar offline experience. I'd give a preemptive win to Fossil on the narrow aspect that your issue tracking changes can be synced and merged automatically with a collaborative server when you come back online, whereas if you stood up your own instance of Trac for instance I'm not sure if they have any support for syncing. If you're working by yourself, though, then there's no problem, Trac and many others work just like Fossil and stand up a local server (or are dedicated standalone programs) and work the same whether you're offline or online. But when I'm working solo I prefer low-tech over anything that resembles Jira (and I don't even really dislike Jira) -- I've played with https://github.com/dspinellis/git-issue as another offline/off-platform option but in my most recent ongoing solo project I'm quite happy with a super low-tech issues text file that has entries like (easy to make with https://github.com/dhruvasagar/vim-table-mode)

    +--------------+
    | Add thing    |
    +==============+
    | Done whens / |
    | other info   |
    +--------------+
and when I'm closing one I just move it to the issues-closed file as part of the closing commit. I might give it an identifier if I need to reference it in the code/over multiple commits.


from the linked thread:

> I just cloned your repo. Everything is working fine. Breath, Zed.

seems like he panicked and made it worse, then rage-quit


Right. Reading the entire conversation shows the parent messaging is FUD.


I've just read it myself and was very impressed by Richard Hipp's calm, courteous and honest demeanour. I would recommend this to anyone as a paragon of how to respond to a difficult situation.


It's been a few years but my one and only interaction with Zed via an email exchange showed me he was quite unstable and prone to outbursts. Never meet your heroes, they say.


I had exactly one interaction with him, too, whereby he shared with me a very long list of companies that he knew of which were hiring at the time, this would've been around 2012 or so. I got quite a few interviews as a result, and quite a few offers, FWIW, though I ended up taking a job with a company that wasn't on this list.

I understand both that he is and why he is such a polarizing figure. Just wanted to put my positive anecdote on the pile, since they seem to be less common when he comes up in online comments.


I believe it. Some people exhibit pretty extreme behavior. It sounds like that was extremely nice of him :)


Fossil has enough wiki and theming features to power a customized website and to let you edit its contents in the browser. I've joked that Fossil SCM is secretly "Fossil CMS".

My personal website, https://dbohdan.com/, is powered by Fossil. A year ago I was shopping for a wiki engine, didn't love any I looked at, and realized I could try something I was already familiar with: Fossil. It did take a few hacks to make it work how I wanted. The wiki lacks category and transclusion features and, at least for now [1], can't generate tables of contents. I've invented a simple notation for tags and generate a "tag page" [2] using a Tcl script [3]. The script runs every time I synchronize my local repository with dbohdan.com. The TOC is generated in IE11-compatible JavaScript in the reader's browser [4]. The redirects are in the Caddyfile (not in the repo). Maybe I'll migrate to a more full-featured wiki later [5], but I am enjoying this setup right now. I am happy I gave Fossil a try.

Fossil also has a built-in forum engine [6]. I am thinking of migrating a forum running on deprecated software to it.

Edit: My favorite music page and sitemap are generated on sync, too. [7] The sitemap uses Fossil's "unversioned content" feature to avoid polluting the timeline (commit history). [8]
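
The unversioned-content feature is itself just a couple of commands (hypothetical file name):

    fossil uv add sitemap.xml    # stored in the repo, but no timeline entry
    fossil uv sync               # unversioned files sync separately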

-----

[1] In the forum thread https://fossil-scm.org/forum/forumpost/b635dc56cb?t=h DRH talks about implementing a server-side TOC.

[2] The page lists the tags and what pages are tagged with each. Tags on other pages link to their section of the tag page. https://dbohdan.com/wiki/special:tags.

[3] https://dbohdan.com/artifact/8297b54f5d

[4] https://dbohdan.com/artifact/d81bb60a0e

[5] PmWiki seems like a nice lightweight option—an order of magnitude less code than its closest competitor DokuWiki, very stable, and has a better page history view. Caveat: it is written in old school PHP. https://pmwiki.org/.

[6] https://fossil-scm.org/home/doc/trunk/www/forum.wiki

[7] https://dbohdan.com/wiki/music-links with https://dbohdan.com/artifact/053d0ff993, https://dbohdan.com/uv/sitemap.xml with https://dbohdan.com/artifact/c21444f7c9.

[8] https://fossil-scm.org/home/doc/trunk/www/unvers.wiki


> The wiki lacks category ...

Just FYI: we recently improved the internals to be able to add propagating tags to wiki pages[1], so it will eventually be possible to use those to categorize/group your wiki pages. What's missing now is UIs which can make use of that feature. The CLI tag command can make use of them, but that doesn't help your UI much.

> ... and transclusion features

For the wiki it seems unlikely to me that transclusion will ever be a thing. It can hypothetically be done with the embedded docs feature if the fossil binary is built with "th1-docs" support, but, alas, we can't currently support propagating tags on file-level content. (I have an idea how it might be integrated, but figuring out whether or not it internally makes sense requires trying it out (and that doesn't have a high priority).)

[1] https://fossil-scm.org/forum/forumpost/3d4b79a3f9?t=h


This is excellent news! I hope Fossil can eventually obsolete both my tag script and the JavaScript TOC.

As for transclusions, I don't expect Fossil to implement them. While something like https://www.pmwiki.org/wiki/PmWiki/IncludeOtherPages would be cool, it seems probably out of scope for Fossil.


I'm glad others are using Fossil this way as well! I use Fossil for hosting my small projects, but have also built a forum/wiki site for my family using it, and I'm working on porting my blog to Fossil as well. I'm very impressed with its design and flexibility while still remaining small, fast, and understandable.

Side note: I've been exploring the ecosystem around Fossil/SQLite as well. I've been working with Pikchr a lot recently as a way to create diagrams that can be version controlled. Because Pikchr is implemented as a single C file, I was able to compile it with Emscripten to a WASM file, and embed that file in a single HTML page that gives me a "live editing" experience (basically I call a render method when the text area gets updated). The way these pieces of software are written minimizes dependencies and allows them to be used in a huge variety of environments. I've really enjoyed working with them.

https://pikchr.org/home/doc/trunk/homepage.md
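
The Emscripten build for a single-file C library like that is pleasantly short; roughly something like this (flag syntax from memory, so treat it as a sketch):

    emcc pikchr.c -O2 \
      -sEXPORTED_FUNCTIONS=_pikchr,_malloc,_free \
      -sEXPORTED_RUNTIME_METHODS=ccall \
      -o pikchr.js

after which the page can call the exported pikchr() function via Module.ccall() on every textarea input event.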


I don't want forum/web software built into my dvcs. I don't see how this improves over git.


> As of 2018, the althttpd instance for sqlite.org answers about 500,000 HTTP requests per day (about 5 or 6 per second) delivering about 50GB of content per day (about 4.6 megabits/second) on a $40/month Linode. The load average on this machine normally stays around 0.1 or 0.2
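
(Those figures are self-consistent, for what it's worth: 500,000 requests / 86,400 seconds ≈ 5.8 requests/second, and 50 GB × 8 bits/byte / 86,400 s ≈ 4.6 Mbit/s.)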

Interesting. If the load avg is consistently low, it could mean they're over-paying for CPU. If this were a non-dedicated AWS instance you might want low load so you don't chew up CPU credits, but you'd also want to use an instance type that creates some load so you're utilizing what you're paying for. Linode VPSes don't use CPU credits, so the calculation is a bit simpler. I'm also curious how much of that bandwidth couldn't be offset by a CDN or mirrors.

If you were using a serverless platform, you'd ideally want to use something like a static site hosting feature where you're mostly just paying for storage and egress. Or a serverless application platform to auto-scale with traffic as needed. The main problem with doing this, of course, is the cost of egress: cloud providers with fancy serverless platforms often have redonkulous egress costs, so using a plain old VM from a VPS provider can be cheaper if you have more bandwidth demands than compute.

(I am aware none of this is a concern if you'd rather just spend $40 and forget about it. I am a nerd.)


Linode doesn't follow the cafeteria pricing style. You buy a package. $40/month is the minimum for us to get the disk space and I/O bandwidth we need. We could get by with less CPU, perhaps, but the extra memory and extra cores do reduce latency, and they are nice to have on days when SQLite is a top story at HN. (Load avg has been running at about 0.95 all day today.)


> I'm also curious how much of that bandwidth couldn't be offset by a CDN or mirrors

As you say, at $40/m it's academic for a lot of people, but AFAIK, the whole site is static, so presumably if you put it behind Cloudflare’s free tier it would serve all but the file downloads from the edge. A pure guess, but I'd imagine that would mean serving 75% of requests from the edge.


Look again. The entire Althttpd website is 100% dynamic. Notice that the hyperlink at the very top of this HN article is to a Markdown file (althttpd.md). A CGI runs to convert this into HTML for your web-browser.

The core SQLite website has a lot of static content, but there are dynamic elements, such as Search (https://www.sqlite.org/search?s=d&q=sqlite) and the source code repository (https://www.sqlite.org/src/timeline?n=100&y=ci).

So far today, 23.48% of HTTP requests to the sqlite.org domain are for dynamic content, according to server logs.


I'd be putting this in 1000 layers of sandboxing since it's a C program with network access.


Your point being? It's been serving sqlite.org just fine for all this time. You seem to be making some pretty big assumptions about security here without actually explaining what your specific concerns are.


I think the original comment is partially a joke. Maybe there isn't specifically anything wrong with this, but there is the fact that it is written in C. Historically, there is ample precedent for this being an issue. C does not guarantee correctness to the same level as more modern languages.

If it was written in Haskell or Rust, for example, you could be more sure about correctness. I believe for something like this, correctness is fairly important. Not to mention you probably won't even lose speed. As for whether the code is understandable, it is 2600 lines of terse-ish C [0]. Do you really think about the entire blob at the same time?

[0] https://sqlite.org/althttpd/file?name=althttpd.c


People need to stop assuming that memory safety == functional safety. They are two very different things. Rust ensures memory safety, but it won't stop you from making logical mistakes. You can't be "sure about correctness" unless proven mathematically.


I agree, but it is an extra level above C. Same with using a more advanced type system: more easily encoding problem constraints in that type system.


If you run GNU/Linux and are worried by C code running... I have bad news for you.


His operating system is probably GNU/Docker.


Not far off it. I run Fedora Silverblue with Flatpak and Podman.


The Google security team is working on fixing this with Rust.


Indirectly, they're also pushing for "fixing" of Firefox with Rust (only 84%[?] to go!)


I hate to tell you but half the internet runs on C. You can't type ".com" without passing through C code


Does it being written by the SQLite devs not improve the outlook for you?


How many bugs might the 1000 layers bring? If each layer has 3 LOC, that's more code than the webserver in the first place.


Maybe I'm getting old but these days I'm having a hard time telling if comments like these are serious or sarcastic.


It's definitely not you. There is a general term for this called "Poe's Law" that says something like: "a sufficiently thorough parody is indistinguishable from the original". That GP might be an example of this.


It's both. Of course we have to rely on C now because most stuff is C, but this is not an ideal situation, and C shouldn't be used for new software.



