Hacker News new | comments | show | ask | jobs | submit login
Elevate your C programming skills by exploring the SQLite codebase (cppdepend.com)
105 points by cpprepository on July 9, 2017 | hide | past | web | favorite | 73 comments

It is a nice/ok code base. I find Redis' a bit better.

One thing that stood out in the article was calling out the percentage of functions with fewer than X lines. I don't think that is a good goal to work toward. Having functions that do one thing and do it well is, and function length can be a very rough proxy measure for that. However "doing one thing" almost always means "perform more than one step in an algorithm." I've seen too many code bases where "keep functions short" is used as an excuse to break locality by piecing functions out into smaller functions that are only ever called in one place (the function they split up). It can make code harder to read and maintain and be less efficient besides.

Eh, IMO it depends on how many locals there are more than the number of lines. Reducing the state space definitely improves readability.

If you have a function with 15 locals, it could be hard to keep straight in your head, even if that function were 50 lines or so. If you can split it up into 3 functions with 5 locals each (and a small number of arguments), that is usually more readable.

As long as the functions are named well and communicate (only) with params and return values, it doesn't hurt readability to split it up, even if it's only used in one place.

On the other hand, I suppose you can have a 500 line function and if has few locals, then you probably don't need to split it up. The canonical example would be a function that only prints stuff to the screen, maybe based on a couple parameters. But I don't really run into those situations very much -- more often I run into the big functions that can be split up.

I don't know. My absolutely favorite codebase ever was dpkg. It isn't as good now as it was before, but essentially they arranged so that anything "unusual" (error handling and the unwind process) happened in a side channel via setjmp and then the core of the program was in a file called processarc.c which contained one giant function that you could use to see, in order, all of the middly steps that dpkg performed. Doing it like that meant that the code read like a specification.

I am highly skeptical of codebases with lots of functions that are called from only one place; I would even put that forward, along with "code duplication (not just in the small but also structurally: where you should be building higher-order primitives, including your own control structures or scaffold) as its foil (as it usually isn't OK to not have a function if it would be called twice), as core metrics for "bad code". If you have too many locals... use a block! That's what they are for :/.

I'm definitely a fan of global coherence, but it only goes so far. I briefly worked on Rob Landley's toybox project, and he takes it to an extreme. There are good things about the project, but it will eventually fall down because adding code is an an O(n^2) algorithm.

That is, every new piece of code has to be compared with all the existing code, in order to maintain global coherence. So there is a hard limit on the size of a program you can write this way.

Also, new contributors are forced to understand the whole thing.

This is why "local reasoning" is a desirable property in software architecture -- so the program can grow smoothly. Most programs grow badly, but you want the cost to be closer to O(n) rather than O(n^2).

So these types of program eventually either rot or die -- that's why dpkg "isn't as good as it was before".

As always the answer is somewhere in the middle. Neither extreme is good. You want some modularity but not too much. I guess people are disagreeing on how much there should be.

I concede that modularity has a much greater cost in C than in other languages, e.g. because char* in a function parameter list tells me nothing about ownership. See the sibling comment where I was puzzling through Python interpreter source code like Py_SetPath(char *).

So in C it might make a lot more sense to have huge functions because there is a cost of tracking ownership across function calls. In languages with garbage collection that's much less of a problem.

> As long as the functions are named well and communicate (only) with params and return values, it doesn't hurt readability to split it up, even if it's only used in one place.

It's interesting how even in the context of C the virtues that functional programming dials up to eleves come up.

(Git is similar: it's a dirty C program with lots of mutation under the hood, but when you use it you can sort-of pretend it purely functional, since the main things you care about are append-only, and even rewriting history just creates a new slightly altered copy of history, and the old one is available until the GC kicks in. Similar for the recent craze of copy-on-write in filesystems.)

Yes it's it's a common pattern, especially in distributed systems. What concern does functional programming share with distributed systems? Being disciplined about state. A distributed system is no longer a Von Neumann machine, because you don't have globally consistent state (without paying for it). Procedural languages assume you're using the Von Neumann model.

Also, what does OOP share with functional programming? Being disciplined about state. When OOP is used well, you're protecting invariants on state inside classes.

Basically the anti-pattern is global mutable state, which most (old) codebases still have. It works poorly for concurrency too (e.g. Python and the GIL).

Here are a couple related comments which touch on OOP and FP, which a lot of people seemed to like:

https://news.ycombinator.com/item?id=14523728 -- Go without globals; pure dependency injection is really functional programming; state as an explicit parameter

https://news.ycombinator.com/item?id=11841893 -- there's no real advantage to expressing something like "split a string by a delimiter" in a purely functional style. Either way, you have a trivial referentially transparent function you can reuse without causing complexity in your program. You might as well do the obvious imperative thing.

Also, I neglected to mention that this style actually doesn't work all that well in C. It works better in C++.

Not only because C++ has classes and C doesn't, but because of char. I was recently hacking on the Python interpreter, and wondering about functions like Py_SetPath(char ) and setenv(char, char) (from libc). The problem is that I have no idea whether those functions copy their arguments or they store pointers to them. I had to go read three levels deep.

So in that sense, functions in C fail miserably... they are not functions in the mathematical sense, so you can't reason about them the same way without strong conventions.

> So in that sense, functions in C fail miserably... they are not functions in the mathematical sense, so you can't reason about them the same way without strong conventions.

Yes. Hence my astonishment and use of "even in C".

Indeed, the most important bit for functional programming in the large are well-defined interfaces and a very disciplined use of state. In the languages I use I see no need to resort to imperative features for "split a string by a delimiter"---but if your language is a bit clunky that might be the path of least resistance.

(To give a counterexample to my point, even in Haskell you might want to use eg a mutable hashtable in an imperative fashion. Even if that means going out of your way to do so.)

Avoiding global (or singleton) state is only the start of what's necessary. Even then OO programs tends to have lots of little pockets of state whose network of interactions goes to a large scale again and doesn't compose well.

One point to complement your argument:

Breaking out a function that's only ever called once can still be a good idea: functions---especially pure ones---communicate with their environment via narrow interfaces, their arguments and return values.

Compare that to inlining the same block of code: the reader will have to carefully check that the inlined block is self-contained, instead of having the syntax provide the hint.

Granted, this argument mostly makes sense for pure functions, and doesn't apply nearly as often in C as in eg OCaml. Haskell's usual style is perhaps the most extreme in that regard, any individual definition only very seldom goes on for longer than one screen. (Forth people strive for a similar style of ravioli code, but I don't have the experience with that language to say much.)

The first example made my skin crawl. No spacing between related blocks of code, no indentation of conditional statements to help pick them out against the background, and since when is a lack of comments in the body of a method a good thing?

I fully agree with you. Here is how I'd write the first example. Note that I don't know anything about SQLite and that I may have introduced typos. The comments 'resize ...' are just place holders.


I like your formatting of the code, but...

You introduce comments that are basically just noise. The name of the method already states what it does. The rest is just very simple details, actually memcpy's. Most of the comments are totally unnecessary.

Code comments are not pure good, they are not Schindler's List. They are necessary evil. Every time you need to add a code comment you have failed to write clean code that it easy to understand.

Every comment that you add has a cost. Someone needs to read it _in_addition_ to the code that it tries to describe. The code itself either is correct or isn't. The comment cannot fix that. But it can slow down the reader to understand what the code does. Whether there is a comment or not, the reader still needs to read the code.

Of course you can omit the comments in the body of the function. Yet I do not think they are "just noise". In my point of view they help with two things when going over the code: First they reduce the cognitive load, second they help maintaining a red thread throughout the code.

Let me explain that in more detail: while `malloc((sizeof(char) + sizeof(i16) + 1) N)` tells me how big the buffer is, it does not tell how it is organized / how it will be used. Of course you could introduce a struct for this but that would get out of hand quickly. Having the comment one line prior illustrating the layout helps one to understand what comes next. Hence offsetting `zExtra` after each `memcpy` part becomes obvious.

`resize azColl` (and similar) serves as a heading for the following block, allowing one to easily find a specific step in a list of multiple steps. Scaning the whole function with your eyes for such headings is much faster than parsing for `pIdx->azColl` inside a call to `memcpy`.

In the end, whether to put these kinds of comments into the code-base or not should be part of the coding convention chosen for the project. People have different opinions and after all it is a trade-off. The way I did it in this example is the way I'd do it in a big (community driven) code base like SQLite. If the convention would state otherwise, I wouldn't add them.

Have you considered that you can structure your code with something else than "header comments"? For example, if you find that you need to split a function into multiple steps and you end up adding "header comments", you could also split that code into subroutines and name those functions accordingly?

Also, I usually find that just adding one or two line feeds between code blocks helps to add enough structure.

I often consider that in languages like Python or Haskell. There you can easily do this in very few lines of code and with limited scope (inner function, Haskell's where). In C this is often too much overhead in my opinion.

Comments are salt, just enough enhances, too much spoils.

I'll take a codebase with no comments over pages of

    // assign 10 to a
    a = 10;
particularly when why you are assigning 10 would actually be a (more) useful comment.

I support the statement that comments should explain why something is done and not how something is done, in general. See my other reply why I chose to do it like this.

> Every time you need to add a code comment you have failed to write clean code that it easy to understand.

As far as possible the code should tell you 'what' it is doing. But pure code alone has a hard time telling you 'why' it does what it does, and an even harder time telling 'why not'?

(Ie 'Why are we using this complicated approach, and not a simpler one?', 'Why can we get away with this simple approach, why don't we have to worry about corner case x?')

What we used to do when I worked at a consulting firm that took contracts whose deliverables included source code is when writing code just put in the comments that we needed for ourselves during development.

For the final delivery of source code we'd make a binder that had a code listing on the pages that were on the right side, with a detailed running commentary on the pages that were on the left. On the left hand pages we could go into great detail (we'd add vertical space on the right hand pages as needed to keep things lined up).

The left hand pages in essence were almost a book on how to do whatever it was the code was doing, making it easy for the client to hand over the project to their own programmers after we were gone. We did not rush writing the left hand pages. Sometimes I spent more time writing the left hand pages than I had spend writing and debugging the code. (Our contracts were all fixed bid contracts, so spending all that time didn't raise the price for the client).

> Every time you need to add a code comment you have failed to write clean code that it easy to understand.

I find this to be a rather curious statement. I don't believe "clean code" and "easy to understand" are necessarily related as strictly as you seem to.

Sometimes solutions to problems are messy and difficult to reason through with just the text of the solution ("code"). Code comments that elucidate meaning when the solution is thorny make code clean. The absence of comments might make code pure (in some sense), but not clean.

>> Every time you need to add a code comment you have failed to write clean code that it easy to understand. > I find this to be a rather curious statement. I don't believe "clean code" and "easy to understand" are necessarily related as strictly as you seem to.

Well, if we select a definition for clean code to be what Robert C. Martin defines in his book "Clean Code", it very much is related to code that is maintainable, that is readable (i.e. easy to understand). I'm not saying they are "strictly" related as you seem to suggest I think.

The absence of comments is not a necessary property of clean code. But the absence of unnecessary comments is. Most comments are unnecessary.

Also, most importantly, the presence of comments is not a sufficient criteria for clean code. Code can be of very high quality without any comments.

You seem to say that _good_ comments may improve the quality of code and I agree.

This is a nice example.

It's not that we are not capable of understanding the original code block, but the 'breathing space' and just some minor sparse comments now allow much of that understanding in less than a few seconds -- a real practical side when working on a large body of code.

And it does that without interfering with the 'good practices' highlighted in the article.

One of the most influential few paragraphs I've ever read in my life was this FAQ entry:


It must not be left out of any discussions of style in the SQlite codebase.

Without this approach, many stylistic choices would not be nearly as acceptable.

(But one thing I never looked into: do the developers use a more readable, better-commented and -spaced style, fewer tricks that produce fewer warnings, to write their test suites?

Or are they written in the same style?)

Testing in sqlite is great! That still isn't a reason to write unreadable code though. It may just be a "Linux Torvald" kind of approach: if you can't read the code as is we don't want you in the community.

Do you know the answer to my question? Is the testing code (which I can only assume is not itself tested) is written in the same style and way as the code it is testing? Or is it written in a different style?

> since when is a lack of comments in the body of a method a good thing?

I think that's the rationale: you are supposed to explain what you do, not how you do it. The function is short enough that a global comment is enough.

I agree with you, that a couple of comments in the function body wouldn't have hurt.

To be fair to the article, it 'remarks' on the lack of comments there but doesn't specifically call it out as a best practice.

I actually found it a welcome and refreshing change from the excessively verbose and spread-out style that seems to have become popular these days (of which Java and C# are the main examples.)

I'm not sure all the casts are necessary, however, and I'd probably reduce usage of some sizeofs just using the constant (sizeof(i16) should obviously be 2, if the name is any indication.)

> I'd probably reduce usage of some sizeofs just using the constant

Why? The sizeof is resolved into a constant at compile time, ie, no runtime cost. I see the 'sizeof(i16)' as conveying more information than a '2'.

i16 seems to be a typedef for short int, which may be actually be larger than 2 bytes depending on the compiler.

Even more so:

-- the fact that the code didn't use the int16_t from stddef.h but rather defined their own, could hint in the sense that a signed-int-over-exactly-16-bits might be unavailable in the ABI. People defining their own exact-width integral types (instead of pulling in stddef.h), rarely think about how that makes the code harder to interpret for a reader.

Now, if the type i16 might indeed be wider than 16 bits, sizeof becomes perfectly justified but '16' is not.

If it's specified as i16, it would be exactly two bytes in all situations - there would be no logic in including a size parameter that didn't actually mean anything, and sized ints are specifically meant to be platform-independent.

But then why not use stdint ?

Here, have a window manager I wrote, maybe it'll make your skin crawl right off: https://github.com/serprex/nobox/blob/master/nobox.c

That is COMPACT. I always thought that a window manager was supposed to be big. This fits in one file!

(A note on code style though: some of your lines are a bit too long. Have pity on those with small screens. Seriously, it's not that hard to break up line 94.)

Where can I learn enough about X to read this code?

I'd suggest reading the dwm source: http://git.suckless.org/dwm/tree/dwm.c

It's still one file, but it has more features & is less obfuscated

Otherwise I mostly learnt by reading XCB header files

That said things are going the way of Wayland, so I've since switched: https://github.com/serprex/noway

Thanks! Though I'd rather read yours. A couple of hundred lines seems less intimidating than a couple of thousand. Though still I doubt I'll get around to it.

Have pity on those with small screens.

How small are your screen(s)? What resolution do you run them at? What size font? Do you have eyesight issues?

I have decent vision at monitor distance. On a 13-inch Macbook Pro, using a font of reasonable size, I can easily diff two files with a line length of 108 - side-by-side.

I'm curious, then, why others can't do that. (Eyesight issues aside, of course.)

(Whether code lines should be 108 characters long is a separate issue.)

At home, my desktop has a resolution of 1366x768. (I usually use my laptop, which is higher than that, but whatever.) At work, I have monitors in the "portrait" position, so I only have 1080 horizontal pixels unless I'm willing to split something across two screens. (Never!)

But in this case, all that is not relevant. The link was to Github, and Github enforces a width of less than a thousand pixels, regardless of screen size.

I have monitors in the "portrait" position

Excellent, thanks. That's something I hadn't considered.

But in this case, all that is not relevant. The link was to Github, and Github enforces...

Ah. I assumed you were speaking generally.

Well, I mostly was. Long lines are _generally_ inconvenient. Everything from portrait screens to websites that won't use the full screen width to someone who wants to a printout of the code for some reason (even printing in landscape is not enough).

You should read 'Clean Code', comments mean a failure of clean understanding code

And you should avoid this kind of strong assertion: there are many cases where comment are useful.. To fix a bug, I replaced a binary serialiser by a text serialiser, how could the code explain this? Also with threading, comments are often useful.

In c# , an obsolete attribute on the binaryserialiser, in the same namespace as the text serializer. With the reason why you switched and why it's obsolete now

Perhaps should have added a 'mostly' in my original reply though. But I want to keep comments always to a minimum

That code doesn't seem to require any comments. Code comments are usually a smell. The author couldn't write code that's easy to understand so he had to write a comment to describe it.

I second this. For more info about why usually code comments are a code smell, I recommend reading the Clean Code book (Robert Martin), the chapter on Comments.

I've worked with people who believe this, including at least one devout follower of Clean Code and I totally disagree. One developer's obvious self-documenting code is another developer's WTF. Yeah, some projects have a lot of code that doesn't need to be commented, especially web and GUI projects that have a lot of similar code for performing CRUD operations on various database objects. That shouldn't make you get out of the habit of commenting things that actually need to be commented.

One major WTF in this code that stood out for me is the place where char zExtra is cast to const char * * . (Pointer-to-pointer casts should be alarming for any C programmer.) Only later, I realised that zExtra was a block of memory containing three arrays of different type. There should at least be a brief comment explaining the memory format of zExtra, and maybe one explaining why the code makes a single allocation instead of three separate allocations (is it performance critical?)

I don't believe you _totally_ disagree. You must understand that littering every single like of code with comments describing minute details is unnecessary. There must be a balance.

I'm not a purist by any means. I don't think comments are pure evil either. They have their uses but to say that code without comments is

Clean Code principles only state that code comments are not good _per se_. Most comments are just noise. Yet they are sometimes needed and bring value. It's not black and white.

But reading that amalgamation sqlite3.c file wasn't the best C reading-for-learning experience I had.

I found Redis and OpenBSD much more organized and documented, it just made more sense since the structure of those projects is really nicely organized. Of course there are reason why sqlite3 is one C file, I am well aware of that, but I just wanted to point out 2 more awesome sources for reading and learning C.

One of my favorite C reading experience is maru http://piumarta.com/software/maru/

(self bootstrapping lisp)

IIRC, the original source is split into many files and you can build a one file version, and the reasoning behind doing so has to do with compile and link performance.

When the 'return' statement is put on the same line as the parent 'if' statement, setting a breakpoint on returning requires dealing with the assembly code and hex values (identifying the instruction that implements the return (could be a ret or a jump), then using its address to set the breakpoint.

When the return statement is kept on its own line, this is much easier.

Or using advanced GUI debuggers that allow middle statement breakpoints.

Which debuggers for C or C++ do this, and which compilers do they work with?

Visual C++, if I am not mistaken it was introduced on VS 2015.

C/C++ breakpoints still appear to be line-based in VS2017. (C# breakpoints have been statement-based for a long time...)

I see, my failure then.

When on VS, I spend most of my time on the .NET side, so I thought it was on the C++ side as well.

I feel like TFA makes a very bad case for it (they use structs? only 8 functions have many parameters? your commercial tool can query this kind of thing with SQL?), but regardless, SQLite is a very good codebase to delve into as a seasoned programmer wanting to improve even more.

What worked for me was diving in with a specific question (I think it was "I wonder how indexes are represented on disk?") until I found the answer, picking up a lot on the way.

Not an easy codebase to read, but full of good architecture, good ideas and best practices.

> The function body is well indented

To be fair, there is only one level of indentation. Kind of difficult to screw that up.

Is it me or is the scrolling behavior on this site awful?

Yeap. I wanted to swipe back on the trackpad of my backbook and it just scrolled down. Not ok.

Same here (on a very high-DPI display). Apparently smoothscroll.js is responsible.

This strikes me as programming like it's 1999 (1998 actually). It's not the worst by far. But then I wouldn't call it an elevation either.

1. Elegant use of white space makes for an easier read.

2. Intermingled declarations and code showed up in C99.

3. C++ style comments rather than block comments. There should be a commentary (and not a recital).

4. camelCase for local variables? Really?

5. Single line if statements make for a hard read.

For Java I would recommend to study H2 SQL database. It is small self-contained project, with well written and consistent code.

I think the Linux kernel source code is also very clean and it's worth reading. I prefer its coding style.

While we are talking about C skills, recently I got hit by this gotcha. I never knew a simple comparision could be this complex.

Question: Whats ru, rs ?

    unsigned int u0 = 0;
    unsigned int u1 = 1;
    int s0 = 0;
    int s1 = 1;
    int ru = 0 > u0 - u1;
    int rs = 0 > s0 - s1;
Answer: ru is 0. rs is 1.

For this specific case, warnings can save you:


  prog.cc: In function 'int main()':
  prog.cc:8:16: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
       int ru = 0 > u0 - u1;

I had -Wall set but I forgot that its misleading. -Wextra is needed for -Wtype-limits and -Wsign-compare.

I wouldn't say it is complex, that we have unsigned comparison for unsigned numbers, and signed comparison for signed numbers.

Const-correctness is really difficult to get right. One person's const is another person's mutable.

An example from the top of my head, a function which finds an integer in an array "should" get that array as a "const int *". But if the caller wants to modify the found element, it needs to do ugly casting (really bad practice).

Similar problems exist in most other programming languages. Most have only "shallow" const. You can mark an Java array member as final, but you can't disallow mutations to that array.

Const-correctness is similar to types - it breaks very quickly, and when it breaks it gets in the way.

That means I resort to indices when possible. When pointers are needed I typically avoid const.

> a function which finds an integer in an array "should" get that array as a "const int ". But if the caller wants to modify the found element, it needs to do ugly casting (really bad practice).

My friend, your example is wrong! The caller can happily work with a mutable array; the function accepting a "const" pointer is just a guarantee that the function itself* won't modify the content.

  size_t look4 (const int *ptr, int val);
  void caller (int *ptr, int old_val, int new_val)
      size_t index = look4(ptr, old_val);
      ptr[index] = new_val;

Sure, if you return an index. I bet the parent poster was discussing returning a pointer, though.

Imagine look4 returned a pointer to the found element. Making look4's signature const-correct in C is not possible.

(Think of strchr).

As my sibling poster explained, that's not why meant.

Try to reduce some of the problems these people have: https://stackoverflow.com/questions/856542/elegant-solution-...

I know you are just giving an example, but caller() should check that what came from look4() is a valid index into ptr.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact