Hacker News new | past | comments | ask | show | jobs | submit login
SQLite 3.8.7 is 50% faster than 3.7.17 (gmane.org)
579 points by anon1385 on Oct 7, 2014 | hide | past | web | favorite | 164 comments

I'm glad to see something like this. People often take the typical optimization mantra too far. The minute you mention optimization on most internet forums (especially StackOverflow) you get slammed for promoting "premature optimization."

That's not to say premature optimization doesn't exist. Sometimes writing faster code is the difference between using a hash instead of a list. Sometimes it's spending an entire day optimizing a function that's called once. The former is performance aware, the later is premature. It seems as though most would have you believe that they both fall into the "premature" bucket.

It's as though people are using it as an excuse for their personal ignorance and would rather continue to revel in this ignorance than expand what they know. As far as I am concerned these "unimportant tidbits" about performance are gifts that keep giving.

/somewhat offtopic rant

Yes, but it's unlikely that someone asking "is it faster to concatenate strings in C# using the + operator or StringBuilder" on StackOverflow is someone that is writing a low-level database engine like SQLite. In my (totally anecdotal) experience the people who are more concerned about performance are beginners, because that is something that they can grasp, and this is where the "don't prematurely optimize" people are coming from. Obviously someone who is writing a low-level database engine knows that performance matters, and likewise, is likely to be running benchmarks rather than asking vague questions on SO.

Indeed. However, keep in mind that as this beginner "ramps up" to the knowledge level of someone who is working on a SQL database he will need a source to find out where costs lie. He's done some profiling, finds an expensive method with massive amounts of string concats, heads to Google and all he can find is "don't prematurely optimize."

You can supplement an answer with "don't prematurely optimize," but the presumption that premature optimization is occurring is a premature assumption.

He who asks a question is a fool for five minutes; he who does not ask a question remains a fool forever. ~ Chinese Proverb

I would work with beginners that are not fools, than experts who are.

> He's done some profiling

In my experience on Stack Overflow, the people getting hit with accusations of "premature optimization" have not gone this far.

Nope, if he got to that level, he'll know that the only way to really know it is testing.

He may still ask why one is slower than the other (if it is, I don't know that), but that's a completely different question that in most contexts will get a nice answer.

Suppose, for a moment, that our hero is seeing this situation for the first time and doesn't know what the alternatives are.

Tests that don't cover the best option can't possibly identify the best option. So if you want to do a really good test, going somewhere to ask, "What else should I try?" is hardly an asinine course of action. On the contrary, if our hero is new to this stuff, or is new to the platform in question, or anything like that, then asking others for advice should probably qualify as due diligence.

And that's decent advice in response to the "is it faster to concatenate strings in C# using the + operator or StringBuilder" questions. But I've also asked Stack Overflow questions that drowned under a tsunami of premature optimization tut-tutting from people who I suspect didn't even fully understand the question, and then sighed and proceeded to take a couple days to figure it out myself and improve application performance in a critical section by 50%. Which is fine, but once upon a time I thought Stack Overflow might be a resource where people might occasionally save me a bunch of time by saying, "Hey, don't waste your time, I've been there and it's a blind alley."

I think part of what's going on there is that SO's gamification can create some kind of perverse incentives for answerers. If you want to get the upvotes, you have two good courses of action: You can either already have 5 digits' worth of karma, or you can answer first. The first option isn't available to the vast majority of SO users, so most go with the latter one. This drives down the quality of answers. It also creates perverse incentives for behaviors such as users getting in a quick low-quality response that's just good enough to get a few upvotes from people who already understand what you're talking about, and then (maybe) following it up with an edit that provides a high-quality response that would be useful to the asker.

> But I've also asked Stack Overflow questions that drowned under a tsunami of premature optimization tut-tutting from people who I suspect didn't even fully understand the question, and then sighed and proceeded to take a couple days to figure it out myself and improve application performance in a critical section by 50%.

Me too. It's infuriating how presumptuous people can be.

Here is someone insisting to me over and over that virtual function call overhead is not measurable: http://stackoverflow.com/a/16603295/77070

Here is someone telling me that my attempts to avoid a virtual function call "definitely fall under the category of premature optimization": http://stackoverflow.com/a/5100072/77070

It's interesting to me that people will shout "premature optimization" before they have any idea what you are doing or how long you have worked on this problem before trying to optimize.

I'm also deeply skeptical of the pre-canned response that people should always profile - in the spirit that it's hopeless to optimize without a profiler.

In my experience, profilers are really low-level tools that often highlight the wrong problems, aren't very good at telling you where the optimization opportunities are, and tend to encourage microoptimization.

It's not that profilers are bad, it's that I get the feeling the people parroting this advice have no idea what they're talking about, and somehow believe that a profiler is "the solution", when profilers are neither necessary nor (most problematically) sufficient to fix most performance problems.

For me, profilers are most useful for

1. quickly seeing which high level functions or methods take the longest time, 2. which methods are called a lot.

Even 2. can be of marginal use. A lot of times, it's not surprising a method is called a lot, and not clear whether those calls are indicative of a performance problem. There are times, though, where a method is called a lot more than you expected, in which case it might be the clue to solving the performance problem.

For 1., it's usually click down through the call stack for the most expensive top level method, while there is still a single method taking up a good fraction of the overall time.

Hopefully the stopping point will still be a method you wrote, and not a library call. If it's your code, you likely have the ability to fix the performance problem by changing your code. If it's a library call, you might need to be a little more creative, in finding an alternative approach, or better understanding what that library call is doing under the hood.

So for me, the profiler just tells me I'm looking at the right general part of the code when trying to improve performance, and that's about it.

FWIW, both 1 and 2 can still lead you in the wrong direction sometimes. For example, if you're working in a garbage-collected language most profilers (that I've used, anyway) won't give you good insight into how much time you're wasting on excessive GC load because you're creating too $#@$% many unnecessary objects (see: boxing), or creating objects with just about exactly the wrong life span. If you're working on parallel code, many profilers are darn near useless for figuring out how much time you're losing to blocking on synchronization.

I just replaced a SQL view that was performing 24,000 reads with a procedure that performs 45 reads. Yes, it's a little different to use, but overall don't listen to the premature optimization people.

I could wait until the data in those tables grows to a crazy size and the reads are out of control (and spilling to disk) or I could just fix it now. Hmmm...

It helps to preempt the typical response. If I post a question that's "premature optimization bait" I end it with:

> I'm optimizing this due to profiler results.

You'd be surprised how many people still question whether I'm prematurely optimizing in the comments, but the actual answers tend to be more helpful.

Have you tried something more along the lines of:

I'm trying to do X.

Currently, I'm using Y mechanism, and it produces correct results.

However, the performance sucks -- profiling shows me this is a big time sink in my program. So... (actual question.)

It may help with kneejerk "premature optimization" if the profiler results are mentioned before the request about how to optimize (at least, it should to the extent that the problem is people stopping reading because they see the optimization request so that they never read the comment about profiler results.)

And then you post your own solution for others who might stumble upon it and 15 minutes later the entire thread is deleted by a mod for "this is a discussion thread and doesn't fit the Stack Overflow Modus Operandi"

> If you want to get the upvotes, you have two good courses of action: You can either already have 5 digits' worth of karma, or you can answer first.

How does having karma help you get more upvotes on a brand new answer?

Answers that already have upvotes get more views, and are therefore more likely to get even more upvotes. But that's distinct from a user's reputation score.

To be fair, I once fixed a java application that completely crashed due to using string concatenation for data export, rather than a StringBuffer. It does matter sometimes even in workaday applications.

It's not a completely useless question. I wrote an API data validation gem for Ruby that needs to be as unobtrusive as possible, so it needs to be fast. I found that building strings by interpolation (e.g. "#{str1}#{str2}") is quite a bit faster than addition (e.g. str1 + str2).

The other response you'll see all the time? "Measure it!"

Low-level optimization is not a task to be undertaken casually. If you don't know where to look, you can end up just wasting a lot of programmer time for zero real-world benefit. If you don't measure before and after, you may even make things worse...

The SQLite folks didn't make these changes with a hope and a prayer; they carefully profiled, changed, and profiled again... And they invested years in building a test suite to ensure that such micro-optimizations don't inadvertently break the logic. The results are impressive, but so was the time and effort invested in achieving them. If you're not willing to be so diligent, "premature optimization" is exactly what you're doing.

> "Measure it!"

And for the love of all that you care about, measure it with realistic data and use cases.

I've seen people optimise code while working with small test databases and making changes that are more efficient over the smaller data-sets but are hideously bad on larger ones (usually due to memory use growth, for instance hitting the limits where SQL Server starts spooling temporary data to disk instead of just keeping it in RAM for the duration of its existence).

Or just as bad: working with data that has the wrong balance and (to give another SQL example) giving index use "hints" that shave a little off the time to run over their test data but hamper the query planner on real data by making it ignore what might be better options.

Can you expand on what you mean by index use "hints"? Or is this a feature I just haven't stumbled on yet?

As lmz mentioned, they inform the query planner that you'd prefer it to take a particular route to address your instructions where it might have a number of choices given your current table+index structure and data patterns (and has chosen a bad one of the options in the past).

I use the quotes around the word hint because it implies that it is a fairly informal pointer that the planner might choose to ignore where in reality they usually take it as an order and go that way without question (assuming that you know better than it might otherwise chose to do, or you would not have given the hint).

Telling / forcing the query planner to use a particular index in a query e.g.:

SQL Server: http://msdn.microsoft.com/en-us/library/ms187373.aspx MySQL: http://dev.mysql.com/doc/refman/5.6/en/index-hints.html

Oh yeah, definitely. In some situations performance implications are obvious. E.g. Adding to a list of unique values in a loop? Use a hash, not a list. That's the kind of thing you do up-front, it sounds obvious but you would be amazed at the bad responses you get on the internet for pointing this kind of stuff out.

What I guess I'm trying to say is that you can go in a write obviously slow code "which is fine because you shouldn't prematurely optimize," or you can be informed and, with sometimes no extra development time, write something that's at least competent from the get-go.

For everything else? Profile, profile, profile, profile! And make sure you are using a competent profiler. I've wasted days because of inferior tools, switched to better ones (cough VTune cough) and had major improvements in mere hours.

This is where knowing the full Knuth quote helps...

> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Another common mistake relates to the phrase "penny wise and pound foolish", which applied to programming helps us identify those scenarios where we might waste time micro-optimizing array searches rather than choosing a hashtable (or even pre-sorting the array). Time spent optimizing your architecture will nearly always offer a bigger payoff than micro-optimizations - but once that's done, Knuth's critical 3% is worth remembering.

It goes beyond the full quote --- the whole paper is great, showing Knuth's stature as both a thinker and a writer. If you've ever quoted the "premature optimization" line to someone, or nodded when hearing it, you'd do yourself justice by understanding the context of the quote.


Here's some more from it to entice you:

  My own programming style has of course changed during the 
  last decade, according to the trends of the times (e.g., 
  I'm not quite so tricky anymore, and I use fewer go to's), 
  but the major change in my style has been due to this inner 
  loop phenomenon. I now look with an extremely jaundiced eye 
  at every operation in a critical inner loop, seeking to 
  modify my program and data structure (as in the change from 
  Example 1 to Example 2) so that some of the operations can 
  be eliminated. The reasons for this approach are that: a) 
  it doesn't take long, since the inner loop is short; b) the 
  payoff is real; and c) I can then afford to be less 
  efficient in the other parts of my programs, which 
  therefore are more readable and more easily written and 

For real, the SQLite codebase is some of the most well-tested open-source software out there.

The argument behind premature optimization only really stands when first developing a product. Once you have an established product, it's no longer premature, it's just optimization.

Premature optimization is that which is done before the need for the optimization is clear. Thus debugged, feature complete (for now) ... 'Established' really has nothing to do with it, if that established product has bugs then those should be addressed before attempting to optimize. Unless of course, the lack of optimization is itself the bug.

> Once you have an established product, it's no longer premature, it's just optimization.

Its premature even if the product is completely feature complete and bug-free if the particular optimization being done is being done before more significant optimizations, which, if you aren't using some measurement more sophisticated than "this looks like it could be optimized" to choose what to optimize, it probably is.

Well, only if you never add any new features to it.

As long as you are not optimizing those new features as you are adding them, it's not premature optimization.

I'm late to the discussion, but I will say that when you have an established product you have a much better perspective on how the code is organized than when you're still cobbling together V1.0. A firm grasp of the code's organization makes it much easier to contain the optimized part and slap on a /* HERE BE DRAGONS */ header.

With a full suite of automated regression tests you'll have a great deal of confidence that any changes you've made didn't break anything. Go nuts and experiment!

> The minute you mention optimization on most internet forums (especially StackOverflow) you get slammed for promoting "premature optimization."

Typically, that's because the people asking are asking questions like "which is faster in PHP, print or echo?" to which the answer is generally "there's about a half a second difference if you run it a million times, but you shouldn't be doing that anyways".

That's because so many people asking are focusing on micro-optimizations without having profiled and while ignoring bigger factors, or else don't have working code yet.

On a mature project like SQLite, nothing is premature anymore - including optimisations.

The internet is full of dumbass developers unfortunately. Both on HN, StackOverflow, and well everywhere.

There are few absolutes in coding. Some times optimizing a function that's called once is worth it. People often perceive software that takes a long time to load is slow even if everything else is fast. Though even just changing the loading screen can do a lot even if the actual time something takes to load does not change.

PS: If all someone wants to do is rotate an image loading photoshop really is slow.

A common advice is to focus optimization in the few places where it matters, such as in the middle of tight loops. Sqlite is in an interesting situation in than almost all their code could be called in the middle of somebody's tight loop. The same applies to many libraries, operating systems, VMs, etc, but not to most applications.

Well, I think the imporatant part to note is that SQLite has been through basic performance review, it's been heavily tested and used at this point. It is a mature piece of software.

So optimizations like this can certainly no longer be considered "premature".

Premature optimization is also not a true waste of time (IMO) if it's part of the learning process.

Man, did I ever 'waste' a lot of time on optimization when I was first learning to code (in C), but:

a) I learned what the low-hanging fruit was, so that I was eventually able to make all my code fairly fast without explicit optimization.

b) Man, where some of those early programs I wrote blindingly fast... especially compared to the higher-level python/numpy/scipy stuff I write a lot today. It's also nice to repurpose some of that old code (ie as a library) for new applications since every potential bottleneck has already been ironed out.

> Sometimes writing faster code is the difference between using a hash instead of a list. Sometimes it's spending an entire day optimizing a function that's called once. The former is performance aware, the later is premature.

I disagree, it depends on the context. If using your original list is so fast that nobody human can notice the difference, it's a waste of time to turn into a hash, not "performance aware".

Actually, whether it's a "waste of time" also depends on the context. It could be worse. It could make your code harder to understand and maintain.

> "premature optimization."

Once a feature works as intended, optimizations aren't premature anymore. You're then in a position where you can profile, measure, and compare.

SQLite has an advantage because it is a small, well defined program (for a DB) that as is much defined by what it doesn't do as what it does. The typical problem with premature optimisation is that you optimize something that is either subject to having functionality requirements altered in a significant way and the optimisations will make it brittle when you need to do this.

Grace Hopper on inefficiency, etc


They already optimised everything else, so this isn't premature.

I think people under-estimate incremental improvements a lot. My tai-chi teacher told me that eastern culture is more about "putting one sheet of paper on another", there is slow and steady improvement. Western is much more about "big bang approach", make it or break it.

SQLite has been around for 15 years, they never shifted their purpose and were constantly improving. You get almost perfect product at this point.

I work on in-memory db which was evolving similar way. Now after more than decade of small improvements, it is faster than java collections, despite doing serialization and its own memory management.

Not only eastern cultures have the concept. In German there is the phrase 'a steady water drop hollows a stone' and it actually traces back to the Roman author Ovid 'gutta cavat lapidem'. Perseverance is just so much harder than instant gratification.

Just to show Yanks have it too, "slow and steady wins the race".

Which in my mind has always been linked with The Tortoise and the Hare -- which is then from Aesop / Greece.

Well you can't really reject all elements of American culture that have their roots in Europe as not American or else you end up without a whole lot to work with.

My pappy always said, if you shoot it and it's still standing, keep shooting it! Small, incremental changes!

One rain does not make a crop. — Creole

I always thought story-myths like Star Wars and The Karate Kid fulfilled the role of fairytales and parables for the comparatively young culture of United States. I'm only half-kidding. I bet Yoda has a few choice quotes related to exactly this subject?

Well that wouldn't be very good! Let's embrace them then.

It seems to me that that's lost the idea of incremental improvement. It's about taking the time to do things right, but there's no sense of making it better.

Hm, that's a good point. A favorite saying around here is, "If you don't have time to do it right, you don't have time to do it over!" There's a big emphasis on doing something right, and if you have to go back and improve it later, that's a failure.

Interesting. The same phrase 'a steady water drop hollows a stone' also exists in Chinese: 水滴石穿, and it also traces back to 2000 years old classics.

I don't want to disappoint you, but the same phrase exists in so many languages that I'm afraid it isn't something culture-specific.

Little known fact, water had lots of sediment and grit in it back then. Wore lots of things down.

In india the phrase is: "Continuous movement of rope can make mark on stone, like it makes on the wall of a well".

HINDI: "Rasari aavat javat re, sil par parat nishaan"

In (Brazilian) Portuguese: "Água mole em pedra dura, tanto bate até que fura".

There's nothing specifically Brazilian about that sentence. In fact, it's a quite common saying in Portugal.

Thanks! I did not know that.

I usually prefer to err on the side of being too specific. :)

Gutta cavat lapidem (Ovidium)

I'm not sure I understand the tradeoff. Given two tasks requiring an equal amount of work, are you saying that it's better to spend it on a 5% improvement than a 150% improvement? Sure, if all you have available are lots of 5% improvements, they'll also eventually add up. But won't you prefer to work on the biggest improvement/cost tasks first?

Perhaps it should be seen versus the "throw it all away and start over" tendency? Like in steady refactors vs. big rewrites.

Here's one way to look at it.

Say you can either spend X amount of effort on a 50% improvement, or you can spend X/10 on each of 10 5% improvements. Sounds the same, right?

It's not. If the smaller increases compound, you get a (1.05^10=) 63% increase out of the latter situation. Plus you don't have to wait for everything to be done to get visible improvements.

>Say you can either spend X amount of effort on a 50% improvement, or you can spend X/10 on each of 10 5% improvements. Sounds the same, right?

This is the same thing. The point is that you say "10 5% improvements of state S0" in that sentence and then later on you use 10 5% improvements of state S0, S1... S9. This doesn't make any sense, could you explain to me how this would work?

> you say "10 5% improvements of state S0" in that sentence and then later on you use 10 5% improvements of state S0, S1... S9.

No, I don't.

Here's a very simplistic example. Say I've got a complicated computation with 10 nested loops, and there's a potential for a 5% speedup tweak at every nesting level, each of which takes about the same amount of effort to find; or, for the same overall effort I can speed up the whole thing by 50% by replacing the whole algorithm with a faster one.

The former case gives you a better outcome than the latter, and it doesn't matter what order you do the tweaks in, or whether you do them all at once. The point is that they can compound without caring about the previous or latter state of anything other than their own level.

Yes but the point is that the (at least I assumed so) 5% increase in speed was based on the initial performance of the system and that it didn't scale. Isn't that basically what you're talking about?

If you take two speed-ups which individually would be 5% over the initial performance and apply them both, you get 10.25%, not 10%. The point is that it can scale.

A number of discrete improvements is also more resilient. If a single perf improvement needs to be removed, only a 5% gain is removed. If the big 50% gain is monolithic it is much more susceptible to disruption.

you're statement needs a lot of qualifications. It is in only very specialized cases where this is true.

Let's say my program takes time 1.0 and is made up of part A taking 0.5 and part B taking 0.5

Defining 5% speedup is weird in the first place, but let's say 5% speedup is taking 0.95 for the program and 10% speedup is taking 0.9 for the program.

5% speedup by improving part A => 0.45 + 0.5 => 0.95 runtime 5% speedup by improving part B => 0.5 + 0.45 => 0.95 runtime

making both improvements: 0.45 + 0.45 => 0.9

So, two speed-ups which individually would be 5% over the initial performance gives us a 10% speed up.

percentage speed ups do not necessarily compound. And in practice, usually they do not.

Imagine you like something and have 40 hours implement it. Now spread those hours over one week, or over several months. Second cas will be much better because people will contribute ideas, you spend some time thinking about this task outside your job and so on. There is some overhead with constantly restarting your work, but that is negleable with good methodology.

The business jargon word for this is "kaizen" - process of improvement through gradual change.


I think you misspelled buzzword.

What you say is called Kaizen. I practice software development, computer gaming, and several other things on these principles, and it really works. Always focus on something small, don't worry about how long it takes or how well you are doing, and when you feel it's internalized go to the next step.

We underestimate the long-term improvements and overestimate the short-term improvements, not sure who said it first.

And yet, a big bang approach with a better data representation and code generation for queries can yield a 10x improvement http://infoscience.epfl.ch/record/198693/files/801-klonatos....

On the other hand, this one looks quite big bang for me. From a major version to the next 50%.

Also, natural selection tends to have periods of fast adaptation and periods of plateau quasi-equilibrium.

I like Tai-chi and all, but I'm still a skeptic :)

What does putting one sheet of paper on another give you? Death by a thousand paper cuts.*

* Yeah, this is a pretty bad post, I'm sorry.

Ah, the good old 100/100 rule: 100% of the execution time is somewhere in 100% of the code.

Sometimes when you profile code you're dismayed to find that there are no clumps where the time is spent. It's everywhere. This function here takes 3%, this one 5%, that one another 3%, ... you either have to make some radical change, or else make all of these a little faster.

And if you make the radical change, you'll see that it will also have these 3% there and 5% somewhere else.

And it will probably break something along the way :-)

Deep within another thread, I had a conversation with 'nnethercote' regarding Richard's use of Cachegrind to optimize performance. My position is that it is no longer a useful approach, as the actual behaviour of modern processors has diverged so much from Cachegrind's simple simulation. Instead, I think it is much more productive to use tools like 'perf' that use the CPU's built in performance counters to measure the actual performance.

My conviction inspired me to repeat Richard's test to see how the numbers compared when run on a current Haswell i7-4770 processor. Here's what I got for the versions he compared when compiled with GCC 4.8.1 and using 'perf stat speedtest1 --size 5'[1] (sizes are for the sqlite.o lib only):

  Cachegrind per Richard:
  3.8.7a -Os: 953,861,485 cycles
  3.7.17 -Os: 1,432,835,574 cycles

  3.8.7 -O3: 469,956,519 cycles (870,767,673 instructions, 1,069,032 bytes)
  3.8.7 -Os: 542,228,740 cycles (884,541,860 instructions, 648,360 bytes)

  3.17.7 -O3: 651,680,545 cycles (1,320,867,830 instructions, 1,002,192 bytes)
  3.17.7 -Os: 771,304,471 cycles (1,406,527,795 instructions, 605,976 bytes)
One can make of the numbers what you will, but here are some conclusions I drew:

While the improvement is about the same magnitude that Richard sees, the actual number of cycles is off by about a factor of 2.

Despite having a larger binary, and thus theoretically worse instruction cache behavior, -O3 beats -Os by about 20% in speed in both cases, at a cost of 40% (400K/1MB) in size.

'perf record -F100000 speedtest1 --size 5; perf report' is a really slick and easy way to figure out where the program is spending it's time. If I were optimizing this, I remain fairly certain it would be more a more effective approach than using Cachegrind.

[1] I had to comment out the "sqlite3_rtree_geometry_callback" line in speedtest1.c to get it to compile, which might affect my numbers slightly, but from the source comments I don't think it was actually being used.

The cycle-counts returned by cachegrind are repeatable, to 7 or 8 significant figures. That means that I can make a small change and rerun the test and know whether or not the change helped or hurt even if the difference is only 0.01%. I don't think perf is quite so repeatable, is it?

Also, the cg_annotate utility gives me a complete program listing showing me the cycle counts spent on each line of code, which is invaluable in tracking down hotspots in need of work. If perf provides such a tool, I am unaware of it.

Remember that I'm not trying to optimize for a specific CPU. SQLite is cross-platform. I want to do optimizations that help on all CPUs using all compilers. I'm measuring the performance on the "cachegrind virtual CPU" of a binary prepared using GCC and -Os because that combination gives repeatable measurements that are easy to map into specific lines of source code. But the optimizations themselves should usually apply across all CPUs and all compilers and all compiler optimization settings.

Nkruz is, of course, welcomed to use any tool he likes to optimize his projects. But, at least for the moment, I'm finding cachegrind to be a better tool to help with implementing micro-optimizations.

> The cycle-counts returned by cachegrind are repeatable, to 7 or 8 significant figures. ... I don't think perf is quite so repeatable

It can come close. I just tried, and for 'speedtest1' seemed to be getting 3 significant digits for the cycle counts, and 4 for the instruction count. You'd probably gain another one or two if you were to measure computation only and remove the printf() and other I/O statements. The underlying performance counters are pretty much cycle accurate.

> cg_annotate utility gives me a complete program listing showing me the cycle counts spent on each line of code... If perf provides such a tool, I am unaware of it.

Yes, that record/report combination I quoted above does just this, with insignificant runtime overhead. Unlike the total counts, this one is sampled, so you might get a little more variation. It's definitely good enough for quickly finding hotspots, and there are other (harder to use) tools that can use the precise "PEBS" events you need complete counts. These even allow you to do nifty things like track the number of times each branch statement is mispredicted.

> I'm measuring the performance on the "cachegrind virtual CPU" of a binary prepared using GCC and -Os because that combination gives repeatable measurements that are easy to map into specific lines of source code.

Absolutely, this is the right way to view cachegrind. My question would be whether it is the right generic CPU to be using, and whether optimizations made on it translate well to other modern CPUs. Many of them will, but I think you'd have faster turnaround time even better success with an approach that uses a real CPU and its performance counters.

> I'm finding cachegrind to be a better tool to help with implementing micro-optimizations.

Please realize I have the utmost respect for your work on SQLite. It's my most frequent answer when asked for an example of C code to study, learn from, and pattern after. I'm certain you will manage to optimize it with any tool you choose, but having spent many hours with GProf and cachegrind myself, I (will the zeal of a recent convert) think you'll be amazed with some of the things that are now possible with performance counters.

I find perf a very useful tool, and use it frequently.

At the same time, I don't see how your data supports your conclusion that perf is more effective than cachegrind here. The absolute cycle count is of limited interest. The most important thing these tools do is tell programmers where to look, and cachegrind seems to have done a good job of that here.

In my experience, perf and cachegrind are two tools in the toolbox. perf is stronger at optimizing for the particular CPU I'm on, and it runs faster. Cachegrind collects more detailed information, and while the model it uses isn't perfect for the CPU I'm on today, it's usually good enough to be useful, and it's good for optimizing things likely to matter on other CPUs too.

One problem with perf counters, in my experience, is that the results vary a lot in a quasi-random way. Trying to get repeatable results is a painful experience.

How is this work funded? Are these developers working full time on SQLite, or is this work done in developers' spare time?

The main SQLite developers have a company that has a few revenue streams. Mainly they offers support and maintenance contracts. There are also a couple of closed source extensions for SQLite for handling compressed and encrypted databases that they sell. Finally the complete test suite is closed source, so if you want to make closed source changes to SQLite and still make sure you don't break compatibility then you probably want to license that.

not going to say it is 100% of their time, but they are paid to work on it.

You can see five consortium members on the front page of sqlite, the cost of a consortium membership appears to $75,000/year[1]. That's enough to pay developers.

[1] http://sqlite.org/consortium_agreement-20071201.html

And that doesn't even include Apple, who I have to believe has a big investment in sqlite, and probably has some form of support arrangement with them.

I wonder what most of those optimisations were? In particular, are there any that are general enough to be worth knowing about, or even worth including at the compiler level?

> We have achieved this by incorporating hundreds of micro-optimizations. Each micro-optimization might improve the performance by as little as 0.05%. If we get one that improves performance by 0.25%, that is considered a huge win. Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up.

Interesting! That reminds me of my own experience in a different context (from https://blog.mozilla.org/nnethercote/2011/07/01/faster-javas...):

> Cachegrind [does] does event-based profiling, i.e. it counts instructions, memory accesses, etc, rather than time. When making a lot of very small improvements, noise variations often swamp the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful.

Of course since processors are out of order and superscalar reducing instructions isn't always a worthy goal either. You might end up with less instructions but less things happening in parallel. And then of course you have to balance this with the cache efficiency of having to look at less instructions to complete a task.

Basically, optimization at this level is really hard.

And yet, in practice, I have found that optimizing for instruction counts works really well. I've gotten way more mileage out of Cachegrind's instruction counts than I ever have out of its cache or branch prediction simulations.

I was tempted to respond after your first comment, but this followup provoked me to action. When you wrote your post several years ago, modelling with Cachegrind may still have been a defensible approach. But the divergence between its generic processor simulation and actual real world performance continues to diverge. As you note, this divergence first became very apparent with the cache and branch predictions. Currently, I'd argue that your time will almost always be better spent checking the CPU's built-in performance monitors rather than using Cachegrind. For the cases where Cachegrind used to be useful, 'perf' is a joy!

I don't have any experience with Cachegrind, so I don't have any reason to doubt what you're saying.

However, if it's true, what do you say about the real-world performance gains the SQLite folks have achieved by using Cachegrind?

Well, the SQLite folks are clearly still getting good usage from Cachegrind.

I'm not sure that's a strong argument for your case. Similarly to the way that Usain Bolt could probably beat me in the 100m even if he was in a wheelchair with flat tires, I'm sure Richard Hipp could probably do a decent job of optimizing SQLite with just a stub of pencil and a scrap of napkin. I just think he'd be more effective with better tools.

But you inspired me to test out some real numbers, which I posted in a new thread: https://news.ycombinator.com/item?id=8426302

How long will we have to wait to get this version usable for our CoreData apps in ios ? Is it possible to include it manually into our app or are we dependent on ios releases ?

If you want to use it with CoreData, you might have to wait. But you can always use sqlite directly and link with the latest library. It might well be possible to specify a different library version you can use with CoreData, but I've never tried this.

By "50% faster" what they actually mean is this new version requires 50% less CPU cycles than the former for doing the same task. In real world I think you will not get your program to run "50% faster": disk io is often the bottleneck in the database area.

disk io is often the bottleneck in the database area.

SQLite usually gets used in situations where the data easily fits into memory (multiple times). It isn't really postgres ;)

Right. But by no fault of Postgres either because disk-based io offers durability and in-memory classically does not. Apples and oranges.

In fact all the recent discussions of "performance" for RDBMSs has to be understood in terms of "what's being sacrificed for one or more aspects of ACID?". RDBMS technology is incredibly mature and optimisations that keep the ACID aspects intact are also incredibly difficult now. Marginal returns.

You're of course right, what I meant to say: In the typical use cases of sqlite the whole db content usually is cached in RAM anyway, so 50% faster really is 50% faster. I found that out quite recently when researching whether caching pythons sqlite in a web app prototype would make it faster: It won't, It's usually completely cached anyway.

This is no where near true, might be true for spinning disks, but we haven't even really started to optimize RDBMSs for SSDs.

Also sqlite is IO and ACID compliant, it's just that the data generally fits in RAM where as many Postgres installs have data that doesn't fit in RAM.

There is still plenty of room for improvement. See eg http://arxiv.org/abs/1210.0481

True, but as SQLite is heavily used in mobile applications it still has the advantage of consumes less CPU and thus less battery.

Percentages are multiplicative:

50% fewer cycles for the same work would be 100% faster.

50% more work for the same cycles would be 33% fewer CPU cycles for the same work.

50% fewer cycles might also mean, not faster at all, but your CPU stalls/idles more.

Which can lead to better battery usage or the ability to use slower CPUs.

Which I believe is in fact the case for many of these improvements as per discussion on the thread.

Strictly speaking they only claim 10% improvement from the previous release, the 50% improvement is from a release 16 months ago.

SQLIte mostly runs on slow CPU devices with relative fast SSDs. Reads are parallel, but writes are single threaded, so there is no advantage from multi core ARM CPUs.

Sqlite mostly runs everywhere, especially on normal desktop PCs, including all those old ones with slow HDDs and relatively powerful CPUs.

But I don't get the ARM CPU remark, did parent change his comment?

The in-memory mode is also the default for many APIs. You can basically just use it like a collection for your language of choice but use SQL for putting and fetching things from the collection.

Not for the same database obviously, but SQLite databases are cheap. If your data can be split into different databases, you can pull off parallel writes.

I really like SQLite and posts like this always makes me wonder: Could I run a webshop on SQLite, if I did it right and how many users could I potentially support?

Of cause it's not that hard to deploy a MySQL or Postgresql server, but it's still pretty hard to beat SQLite in ease of deployment and management. Running a simply webshop could be a easy as having a database for inventory and one for orders. Even very large stores aren't going to have thousands or even hundreds of orders a minute, so I don't think that writes would be an issue and reads certainly isn't an issue.

I ran a forum on SQLite for a few years before switching to Postgres fairly recently. This is not due to any shortcoming of SQLite, which I believe could have handled far more traffic than what we were already getting at the time (6K - 10K unique hits per day), but it allowed us to have nicer search features.

I honestly don't know the upper limit of users you can support concurrently, but with modern hard drives, I imagine it's an order of magnitude beyond this. It has more to do with your application code than the database in my experience.

I had an informal rule: Read early, write late. When the visitor first hits the page, I'll read for the first queries and only write to the database after I've sent the response back to the user. Writes are always the slowest, but no where near as slow as the network so the visitor never had to sit and wait for writes.

The thing is, a lot of the criticisms of SQLite didn't actually apply if you limit the number of writes to < 10 and queries < 50 per hit and keep everything on one server. It's only when you try to write across the network a lot of the locking issues and such come up. Also, by 3.7, there was Write Ahead Logging (WAL) which was perfect timing as that allowed us to skip the DIY write queue entirely.

Depends. For read-heavy sites, it's possible. If you have a relatively small working dataset, you can load everything into RAM and periodically sync it to disk.

Dealing with multiple writing threads with SQLite is really annoying.

If you need to scale beyond a single host to serve the workload, something else will probably be easier to manage.

You certainly could, but it's not really built with concurrent reading and writing. http://www.sqlite.org/faq.html#q5

That FAQ entry is out-of-date as it doesn't mention the write-ahead logging support.


Neat, I didn't realize that had landed. Thanks for the link.

Is it possible to compile this version of sqlite for ARM and replace it on an Android device to improve performance? Android mainly uses SQLite for internal stuff and allmost all Apps use it...

As stated here: http://forum.xda-developers.com/android/software-hacking/sql...

Most definitely. Download AOSP, overwrite ./external/sqlite/dist with the new version, and build libsqlite.so and deploy on a rooted device (note that Chromium, and the Chrome browser, use their own copy of sqlite).

Having said that, while Android uses sqlite copiously, I doubt this would make a perceptible difference in usage, performance or battery life. It is a 50% improvement by itself, but I doubt sqlite3 usage is even measurable compared to everything else going on in the system.

Yeah, thanks. That was the thing I was looking for.

About your doubts, I'd like to quote the sqlite improvement comment: "Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up."

Therefore, I will build this library for my ROM, even though it maybe only gives a 0.02% performance boost, but constantly improving my ROM with those micro-optimizations will probably lead to a 50% boost one day :P

Anyone using newer versions of SQLite out of homebrew on OSX? If so are you having any trouble with conflicts of any sort with the system version? OSX heavily relies on SQLite in many places and I'm curious if it causes trouble to try and use both (and have it in your path).

I'm on 3.8.6. So will have most of the benefits. The temptation with sqlite is that it is so well maintained to only do cursory testing with it, so might drop this in.


> A full 10% of the performance gain has come since the previous release. >There have been a lot of changes.

This is lovely. I believe the most important thing that we learn about this is how to measure your effort correctly. If you do micro optimization, but you don't track it, people will say that " DON'T PREMATURE OPTIMIZATION "

can we expect this version to be included in android L ??

Just ship it in your own binary.

very good news for android developer

> Small. Fast. Reliable. Choose any three.

Nice tag line.

Cute, but inaccurate. Small? Not quite.


Fast? No again.


Note that LevelDB itself isn't exactly the speed king here. It's just the clearest example. To its credit, SQLite does fare well when it comes to reliability.


So ... choose one?

(Downvotes from those who despise facts in ... oops, already happened.)

Within the space of sql databases they get all three.

Writing to /dev/null beats leveldb anyday, only problem is I can't get the data back out. A leveldb to sqlite comparison doesn't is unfair to both leveldb and sqlite.

Ah, "no true Scotsman" rears its head again. Why limit the space to SQL databases? Many applications that need an embeddable database don't specifically need SQL. You're right that comparing LevelDB to SQLite isn't quite fair to either, but they invited the comparison with a slogan that makes comparative claims. It's not unfair to provide citations relevant to those claims, which you don't seem to have done BTW.

> Ah, "no true Scotsman" rears its head again. Why limit the space to SQL databases?

"SQL" is probably generally less important than relational, but either might be an important category.

> Many applications that need an embeddable database don't specifically need SQL.

And yet, many do, or at least need a relational store. To make comparisons, you have to define the category you are making them within; of course, any such category may not be the appropriate one for some comparisons -- either too narrow, too broad, or the right size but divided on the wrong axis -- but that's a matter of where the comparison is relevant. OTOH, its simply wrong to say that a claim is wrong simply because it isn't based on your preferred categorization.

It's equally wrong to say the claim is correct because it's based on somebody else's preferred categorization. Do you see "SQL" anywhere in "small, fast, reliable"? They made a broad claim. They should be the ones to provide evidence or clarification. Why are those standards only applied to those who point out that the claim is unproven? That seems a lot like changing the terms of the argument (asymmetrically!) to fit a preordained conclusion.

>Do you see "SQL" anywhere in "small, fast, reliable"?

You're right! It hasn't baked me a cake, what the hell?

Context is key. It's a database, it does database things. A key-value store with no relations is a different beast.

> Do you see "SQL" anywhere in "small, fast, reliable"?

I see it all over the place -- the front page of SQLite.org -- where the motto "Small. Fast. Reliable. Choose any three." is presented. Here's the first paragraph of body text from the page:

(emphasis added) "SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain."

The place they are telling you that the context is "SQL database engines" rather than just "database engines" is the same place they tell you that the context is something related to database engines as opposed to, say, automobiles.

They make the claim within the broader context of being a sql database. It's perfectly reasonable to consider the claim within that context.

Just taking the single phrase without considering the broader context the phrase was made in is not fair to sqlite.

> Do you see "SQL" anywhere in "small, fast, reliable"?

No. But I do see it in "SQLite".

FWIW: I had never heard of MDB, and saw it performed really well on the benchmark, so I decided to look it up. I found this page:


They don't even mention sqlite in the comparison section on that page because they compare themselves to other embeddable "key value stores" which sqlite isn't.

If you look closely you'll even see that there is even an MDB backend for sqlite.

It's not a "no true Scotsman" when the products are so clearly different.

Of course they do compare it with SQLite 3: http://symas.com/mdb/microbench/

Yes, if you compare it only to things that are exactly like it then it will be in the top percentile. Also the bottom. That doesn't seem very useful. What other embeddable SQL databases should it be compared to? If you don't think this comparison is the correct one, tell me which comparison is. I'll go get the numbers, and I bet I'll still find that SQLite is still neither small nor fast by comparison. Otherwise all this goalpost-moving makes it impossible for anyone to prove anything (not that the pro-SQLite contingent is even trying).

Balancing full SQL semantics with being small enough, fast enough, and commendably reliable is an impressive achievement. I respect the SQLite authors for that. If they had picked any three of "small, fast, reliable, SQL" I would never have objected. BUT. THEY. DIDN'T. They left "SQL" out, opening the comparison to others. For many people who need an embeddable database, a key/value or generically table-oriented (but not fully relational/SQL) database is quite sufficient. For them, a claim based on an unstated and irrelevant fourth criterion, against equally unnamed alternatives, is either useless or misleading. But that's apparently the only kind of comparison the herd will accept.

I was comparing it to other relational stores, not other embeddable stores.

It works both ways too though, it's very small fast and reliable compared to Chrome or Outlook, which both can store data...

I disagree. You can be not the smallest and still be small, etc. Still I think people should not downvote because they disagree. Especially the links in that post are quite valuable!

Yes, it can be small without being the smallest, but such a claim should at least require being smaller than average/median for the category. In that example, it had a much larger footprint than the majority. If not for the inclusion of Brobdingnagian BDB, it would have been the largest. If I want to claim I'm strong, it's not sufficient for there to be only one who is weaker. Among embeddable databases, SQLite has no real claim to be small or fast. Its strengths lie elsewhere.

Seems reasonable. What is its strength if not being small?

And this is one of the problems with "optimize later".

If you have a large project in Python, for example, far too often there isn't any one spot that has massive gains if you translate it to C. It's the death of a thousand cuts.

Optimizing later, after three major releases and a ton of minor ones, and after getting deployed basically everywhere, seems to be working out pretty well for them though?

Only because it is designed in such a way to allow for it.

The problem is architectural issues, which are far too often glossed over.

You don't know which optimization require which architecture. And your "extensible, optimizable architecture" may be the cause of the performance problem.

What are you talking about?

(At a guess...)

It's related, I think, to the standard piece of received wisdom, that you shouldn't optimize until some unspecified "later". This isn't a bad rule of thumb, but efficiency and convenience aren't always happy bedfellows, and, when coding, people tend to choose convenience. The result can be that when this later date arrives, and you go to make the code run more quickly, you're a bit stuck, because it would require changes more sweeping than can be accommodated with the resources available. But had performance been taken into account from the start, the changes would have been easy to make, or the whole thing might not even have been an issue.

An example might be writing a compute-heavy program in python rather than C.

Another standard piece of received wisdom is that once you get to optimize later, you'll just run the profiler and there'll be some hot spot due to poor choice of algorithm. So you optimize that bit and then you're done. But in practice this happens quite rarely; it's more common for the code to just be generally slow. A common reason it's generally slow is because performance wasn't considered during development, so the end result is the same.

This is what the reference to translating into C relates to, I suspect - the idea that in your slow python program there's this one bit that can be turned into C and suddenly your program is super fast.

(Of course, for many programs, python's speed is irrelevant. And any compute-heavy python program is likely only to be improved by translation into C - though your average python program would probably be much harder to make any quicker. Maybe that just goes to show that you actually have to think about this stuff on a case-by-case basis rather than relying on received wisdom. Anyway, I pick python only because it was mentioned.)

Belief in the improve-algorithm-later theory usually causes people to dismiss micro-optimizations as pointless, because rather than get 0.1% here and 0.1% there you can just put it off until later and then get 25% all in one go. But this attitude actually makes it more likely that the code will paint itself into a corner, leaving micro-optimizations the only option.

There isn't really any good quick easy solution to any of this, which is why I always liked it when computers got faster every year :(

You will still typically achieve a better result (take less time building the software, produce software that works better and runs faster) if you "optimize later" than if you try to make everything fast from day one.

If the application is properly decomposed into distinct subsystems one could take an entire subsystem and move it into C/Rust or Julia. Esp so if those subsystems are functionally pure.

Python might not be the fastest player on the block... but the increase in coding productivity will more than pay for extra equiptment to handle and speed-shortcomings.

Except when it doesn't. I'm facing this problem with an embedded-ish project I'm working on where Python initially seemed like a good choice, but now we're regretting it for both performance and power usage reasons. If we'd gone with C or C++ initially we'd be having a much easier time optimizing.

Not that I'm implying Python is bad, but if your target use case doesn't allow for a desktop or server, think carefully about what tech you use.

Also achievable in JVM, .NET, ML family, Lisp family, Julia, D, Rust, Dylan with the added benefit of performance.

I can't really envision using it for anything other than scripting.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact