SQLite with a Fine-Toothed Comb (2016) (regehr.org)
109 points by jxub 33 days ago | 31 comments

Richard Hipp (creator of SQLite) had this to say about Rust and SQLite in the comments:

> Rewriting SQLite in Rust, or some other trendy “safe” language, would not help. In fact it might hurt.

Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way. We know this, because we have tested the SQLite machine code – every single instruction – using many different compilers, on many different CPU architectures and operating systems and with many different compile-time options. So there is nothing wrong with the sqlite3.so or sqlite3.dylib or winsqlite3.dll library that is happily running on your computer. Those files contain no source code, and hence no UB.

The point of Prof. Regehr’s post (as I understand it) is that the C programming language has evolved to contain such byzantine rules that even experts find it difficult to write complex programs that do not contain UB.

The rules of rust are less byzantine (so far – give it time :-)) and so in theory it should be easier to write programs in rust that do not contain UB. That’s all well and good. But it does not relieve the programmer of the responsibility of testing the machine code to make sure it really does work as intended. The rust compiler contains bugs. (I don’t know what they are but I feel sure there must be some.) Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected. In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”. But whatever you call it, the outcome is the same: the program does not work. And the only way to find these problems is to thoroughly test the actual machine code.

And that is where rust falls down. Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing. Nor are there multiple independently-developed rust compilers for diversity testing. Perhaps that situation will change as rust becomes more popular, but that is the situation for now.
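As a concrete illustration of the kind of UB Hipp is describing (my own sketch, not from SQLite or the article): a naive signed-overflow check that a strictly conforming optimizer is allowed to delete outright.

```c
#include <limits.h>

/* Hypothetical example: `a + 100 < a` can only be true if the signed
 * addition overflows, which is UB, so an optimizer may legally compile
 * this function to `return 0;`. The machine code may pass every test
 * with one compiler version and change with the next. */
int will_overflow_naive(int a) {
    return a + 100 < a;            /* UB when a > INT_MAX - 100 */
}

/* Well-defined rewrite: no overflow can occur, so no UB to exploit. */
int will_overflow_safe(int a) {
    return a > INT_MAX - 100;
}
```

The safe version expresses the same intent without ever performing the overflowing addition, so no compiler upgrade can silently change its meaning.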

One problem with this argument is that SQLite is primarily used as a source-level embeddable library. That is, most users of SQLite don't use the official binaries and instead build the source code themselves. So, in practice, the source code, not the official blessed binary, is what matters. When upgrading compilers, developers of apps that embed SQLite don't typically check to ensure that upstream SQLite has tested the new version of their compiler. They just upgrade their compiler and assume that the new version will continue to compile their SQLite source properly. If the new version of the compiler happens to compile programs with undefined behavior differently, then problems can arise.

Right, and the much-vaunted test suite is proprietary, so the typical end user can't reproduce this test of "every single instruction."

Not that I'm taking sides here. I'm really interested in both the extensive testing that the SQLite team does, and the analysis that John blogs about.

He's not wrong; SQLite is a very well tested piece of software and probably the best that can be done safety-wise in C. Still, as pcwalton pointed out below, and also the last time this topic was discussed, there are some use cases where tests done on the entirety of the machine code do not guarantee that UB will not occur.

For one, it's quite likely that an embedded platform's toolchain will not be part of the SQLite test configurations. Secondly, SQLite can be and is compiled into application binaries, and this means that all bets are off, especially if LTO is enabled. Thirdly, there are products that build on SQLite, such as its own commercial encryption extension and other extensions from third parties. The former probably enjoys the same level of testing, but it's not clear how the latter are tested.

The conclusion is that it's humanly impossible to write memory-safe C, even with 100% test coverage, static and dynamic analysis. Something like Frama-C is required, which is virtually unheard of for the majority of open source and commercial software.

> In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”.

C has both compiler bugs and undefined behaviour. Undefined behaviour is an inherent property of the C standard, while a compiler bug is a property of the implementation (a place where it doesn't match the standard).

A valid argument along the same lines might be that the Rust compiler has existed for less time and is used less than C compilers, and therefore is more likely to contain bugs.

> Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing.

Coverage tools work on Rust, such as kcov. I'm not sure of the state of gcov itself though.

> Nor are there multiple independently-developed rust compilers for diversity testing.

Isn't diversity testing only necessary/good because there are many C compilers? Using your phrasing, if the code compiles and runs correctly (i.e. every single machine instruction is checked) with the one Rust compiler that exists, then it works.

There's definitely many reasons why a language having multiple compilers is good, but I think "diversity testing" is circular logic.

Undefined behaviour is not a compiler bug - it is deliberate.

And having undefined behaviour in your C code is definitely not a good thing, even if it is basically unavoidable.

The real problem is that the C and C++ standards cop out to UB in too many places, e.g. with things like type aliasing, and people reasonably think "weeeell, it may be UB but it works now and I need it, so screw it", and then you have a mess of programs relying on de facto non-standard behaviour, which is shit.

The C people just need to officially define some of the de facto behaviours.

Rust doesn't have this problem because it doesn't leave so many basic things undefined.
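To make the aliasing case above concrete (my sketch, assuming IEEE-754 single-precision floats): the "works now but UB" pattern is `*(uint32_t *)&f`, while the standard-blessed spelling of the same bit-level read goes through memcpy, which mainstream compilers optimize down to the same single move.

```c
#include <stdint.h>
#include <string.h>

/* The de facto pattern: `uint32_t bits = *(uint32_t *)&f;` violates
 * strict aliasing and is UB. The well-defined equivalent: */
static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* defined behaviour */
    return bits;
}
```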

> And having undefined behaviour in your C code is definitely not a good thing, even if it is basically unavoidable.

If it's literally unavoidable, then the language specification is BROKEN.

Now, most C UB is avoidable, but it's very difficult to notice some UB, and most compilers aren't that good at telling you about the UB they exploit. In this sense UB is unavoidable in that human programmers may often write code with UB without noticing.

If it's only "practically unavoidable", not literally, then the language specification and/or the compilers (by failing to warn about it) are BROKEN.

You cannot blame C programmers, not anymore. The committee has been much too aggressive in its zeal to speed up C by adding more UB cases. We've reached the point where compiler outputs run very fast because all the important bits have been elided by the optimizer, breaking the program in the process. We, the users of the language, have been pushed to the breaking point by the committee and the compiler groups. Please stop. And don't just stop, revisit some of the worst UB decisions.

Yes, even C89 had lots of footguns, but UB was much more manageable.

The only reasons I myself have not yet abandoned C are: a) I haven't learned Rust yet, b) many codebases I work with are C codebases and won't get rewritten in Rust anytime soon, c) it takes time to get enough critical mass. (c) is happening though, and (a) is, for me, just a matter of time; (b) I can solve by moving on to new things, but the world is full of legacy code that we can't just abandon/rewrite, so moving on isn't exactly likely.

> The C people just need to officially define some of the de facto behaviours

Sure, as soon as all the different ISA people officially define some of the de facto behaviors. UB isn't in the standard "just because"; it's in the standard because there is no apparent underlying standard.

Aliasing rules, for example, have nothing to do with ISAs. Neither do pointer comparison rules, and many others besides.

The rule about calling memcmp() with invalid pointers and a length of zero does have to do with actual systems, but it can still be standardized, and the vendors with now-non-compliant memcmp() implementations just have to fix it. This has happened before (e.g., snprintf()), so the ISA thing is a total cop-out.
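For context, the rule in question is that the pointer arguments to memcmp() must be valid even when the length is zero, so memcmp(NULL, NULL, 0) is formally UB. A defensive wrapper (name is mine, purely illustrative) that makes the zero-length case well-defined:

```c
#include <stddef.h>
#include <string.h>

/* memcmp() requires valid pointers even for n == 0, so guard that
 * case instead of relying on the library tolerating NULL. */
static int memcmp_safe(const void *a, const void *b, size_t n) {
    if (n == 0)
        return 0;              /* nothing to compare; avoid UB on NULL */
    return memcmp(a, b, n);
}
```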

Weird ISAs are exactly why you cannot compare pointers. Segmented memory for one. Or imagine an OS and compiler that implemented automatic overlay switching. With that and PAE on 32-bit x86 systems you could have special "far overlay" pointers returned from malloc calls which would map in different 1 GB overlay sections when accessed.

Aliasing rules are important in some ISAs too. Like weird DSPs. Imagine a system where 32-bit objects can't even share the same memory space as 8-bit objects. Casting a pointer to a different sized type is completely meaningless there. Of course programming such a weird thing is usually done in assembly, but there are C compilers.
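Concretely (my sketch): relational comparison of pointers is only defined when both point into the same array object, which is exactly the latitude that lets a segmented implementation compare just the offset part.

```c
/* Comparing pointers within one array is defined (both point into
 * `a` below). Comparing pointers into *different* objects, e.g.
 * `&a[0] < &b[0]` for a separate array b, would be UB -- on a
 * segmented target the segment halves could compare arbitrarily. */
static int ordered_in_array(void) {
    int a[5];
    return &a[1] < &a[4];      /* defined: same array; yields 1 */
}
```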

I'm not familiar enough with C on segmented architectures, so I can't quite speak to that, but I was referring to [0], which clearly has nothing to do with segmented architectures.

As to aliasing, the aliasing rules likewise had nothing to do with ISAs, but rather with optimizations for functions like memcpy() (as opposed to memmove()).

[0] https://news.ycombinator.com/item?id=17439467

There are other aliasing rules that have big performance impacts on certain architectures. The Xbox 360 and PS3 Power cores for example had a severe load-hit-store performance penalty that tended to be triggered by code that moved data between floating point and integer registers via memory. Strict aliasing rules that allow the compiler to assume float and int pointers don't alias could make a huge performance difference but those rules are also the source of much troublesome undefined behavior for code that does intentional type punning.

The ISA in this case requires going via memory to move data between fp and integer registers and certain implementations of that ISA had major performance impacts associated with that. In this case UB rules really did allow for valuable optimizations but really did cause trouble elsewhere.

The ISA you describe doesn't require aliasing rules. It merely gives you an incentive to have them.

C and other languages need much better control over aliasing than the 'restrict' keyword and compiler command-line switches.
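For reference, `restrict` is the blunt instrument currently on offer (function name is mine): a per-parameter promise that the pointers don't alias, which enables the vectorization discussed above, but whose violation is itself UB.

```c
#include <stddef.h>

/* With `restrict`, the compiler may reorder and vectorize freely
 * because it may assume dst and src never overlap; calling this
 * with overlapping buffers is UB. */
static void scale(float *restrict dst, const float *restrict src,
                  size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```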

I don't find that comment very impressive, but further down he has commented some more, and I'm fully in agreement there:

"The disagreement is not over whether or not UB is a problem, but rather how serious of a problem. Is it like “Emergency patch – update immediately!” or more like “We fixed a compiler warning” or is it something in between."

Compiler bugs are supposed to get fixed.

In C, where UB is concerned, anything goes every single time the compiler gets upgraded.

In many cases, what is considered UB in the C standard is a widely accepted and documented extension in the vast majority of compilers. For example, non-strict aliasing is UB in ISO C, but is a documented language extension in MSVC, GCC, Clang, and compilers from many other vendors.

Now try to maintain a large C codebase stable and safe across such variety of compilers.
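One such de facto extension that GCC's manual documents explicitly is type-punning through a union (a sketch; assumes IEEE-754 floats, function name is mine):

```c
#include <stdint.h>

/* Reading a different union member than the one last written has a
 * debated status under a strict reading of ISO C, but GCC documents
 * it as supported even with -fstrict-aliasing, and Clang and MSVC
 * accept it in practice. */
static uint32_t bits_via_union(float f) {
    union { float f; uint32_t u; } p;
    p.f = f;
    return p.u;
}
```

Code relying on this compiles everywhere today, but it sits in exactly the grey zone the parent describes: defined by vendor documentation, not by the standard.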

Completely unrelated, but maybe I can tap the deep knowledge of HN.

Browsing from the article I ended up on the page of tis-interpreter, which says "You can also use TrustInSoft to maintain compliance or reach certification according to the norms EN-50128/IEEE 1558." [1]

How does this compliance process work? Do any of you have experience with it?

[1]: https://trust-in-soft.com/industries/rail/

You should contact them[1]. TrustInSoft has a document that maps the classes of defects that its analyzer can guarantee the absence of to the vocabulary and recommendations of EN 50128. If you are concerned with this standard in particular, I'm sure that this document would make things clearer.

N.B.: You had drifted away from the description of tis-interpreter at some point before you arrived at https://trust-in-soft.com/industries/rail/ . While nothing prevents you from finding tis-interpreter useful in application of EN 50128, the page was written with TrustInSoft Analyzer in mind. TrustInSoft Analyzer is a static analyzer that propagates sets of values (“abstract values” is the technical term[2]) through the program, instead of the concrete values corresponding to specific inputs that tis-interpreter propagates. As a result, TIS Analyzer can guarantee the absence of unwanted behaviors for all possible inputs. This is what makes it valuable from the point of view of software safety.

[1] disclosure: by “them”, I mean “us”, since I'm a co-founder and work there

[2] see https://en.wikipedia.org/wiki/Abstract_interpretation
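A toy sketch of the idea behind [2] (mine, not TrustInSoft's): propagate a whole interval of values through each operation instead of one concrete value, so a single analysis pass covers every concrete input in the range.

```c
/* Toy interval domain: each "abstract value" is a range [lo, hi].
 * Propagating intervals through arithmetic lets an analyzer prove
 * properties (e.g. "no overflow here") for all inputs at once. */
typedef struct { long lo, hi; } interval;

static interval iv_add(interval a, interval b) {
    return (interval){ a.lo + b.lo, a.hi + b.hi };
}

/* An analyzer would raise an alarm if the resulting interval could
 * leave the target type's range. */
static int iv_contains(interval a, long x) {
    return a.lo <= x && x <= a.hi;
}
```

Real abstract domains are far richer than plain intervals, but the principle is the same: one symbolic "run" stands in for infinitely many concrete ones.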

I wonder if there are enough statically compiled languages out there.

Not to be arrogant, but it doesn't seem like new and/or recent languages are catching on fast enough, because their syntax is just not simple enough. I see many languages, and I never really find one that is interesting to me.

I recently found volt, and I thought this language was really awesome. Then I realized that it is garbage collected.

D is fine, but it has too many high-level constructs (OOP et al.) which I don't find useful. Rust is okay, but to me syntax matters most, and to me Rust is too far from C.

All of this shows that you cannot beat C.

I just want a language that picks up from C, has the ease of use and readability of Python, doesn't include high-level constructs as first-class citizens, and compiles via LLVM or straight to machine code. I wish I had the skills to build such a language. C++ is close, but its slow compilation and its backward compatibility with C are not good.

Nim? nim-lang.org

It's garbage-collected, but the GC can be turned off and you can use manual memory management, or you can simply use one of a handful of GC methods that works for you (in my experience, the GC is very performant). It has easy syntax like Python. It can be compiled to C, C++, ObjC, or JS (currently). It has tons of meta-programming features like templates, macros, etc. It can be cross-compiled fairly easily to almost any platform. It has whatever high-level stuff you want to use, or just don't use it. It's not 1.0 yet, but it's pretty stable overall and the current release is feature-full enough that you could do just about anything with it that you want, right now. It's also got a really easy to use FFI. And it's about as fast as C, in practice.

> All of this shows that you cannot beat C.

No, all of this shows that you have very peculiar preferences and demands, some of which are more emotional than technical.

> I just want a language that picks up from C, has the ease of use and readability of python

These two requirements are mutually exclusive.

Nim does that. But I myself don't want Python-style syntax. I like braces and dislike semantic indentation -- partly because I like showmatch, and partly out of aesthetic sense.

Before UNIX got widespread adoption in the industry, C was just yet another systems programming language.

We had quite a few alternatives, with saner defaults, but some did not come with an OS and others died alongside their OS.

So it is not that C won somehow magically, rather just like the browser comes with JavaScript, UNIX came with C.

And getting rid of UNIX and POSIX will be a very hard thing to ever achieve, other than being a Google, Microsoft, or Apple and pushing something else forward no matter whom it hurts and how much it costs.

> D is fine, but it has too many high level constructs (OOP and al) which I don't find useful.

D has betterC mode. It's hard to get closer to C than that.


I can understand if you make the argument that D isn't as portable because it doesn't support as many platforms, but you certainly don't need to use classes, which are not even supported in betterC.

You can write Object Pascal code like that - you just ignore the "object" part and write it like traditional imperative code with functions/procedures, records (structs), manual memory management, etc.


You do want C++.

No other language is going to be more backwards compatible with C, because no other language has that goal. If slow compilation is what is bothering you, you might want to look at your compiler options for your target architecture and, of course, update your hardware. What type of software are you writing? What is your development platform, and what are you targeting? Are you using a good IDE, like CLion?

I have an i5 which is 2 years old, 16GB of ram and a SSD.

I want to write games, and usually the libraries involved can be a little large, like bullet or Ogre3D. I remember MSVC 2012 stopping compilation because of a lack of memory.

Accelerating the step of rebuilding a single file requires precompiled headers, which are atrocious to manage, or specific decomposition of the code into several files, which is quite demanding.

I'm aware modules might speed up compilation (I hope I'm right), but I'm not even sure they will be part of C++20.

I like C++ too, but the slow compilation times are really getting on my nerves, and I wish there was a good solution for that, even if it breaks backward compatibility.

Sounds like you want Rust with Python's syntax

Another call for boringcc[0]. [0] https://news.ycombinator.com/item?id=10772841
