
SQLite 3.8.7 is 50% faster than 3.7.17 - anon1385
http://permalink.gmane.org/gmane.comp.db.sqlite.general/90549
======
zamalek
I'm glad to see something like this. People often take the typical
optimization mantra too far. The minute you mention optimization on most
internet forums (especially StackOverflow) you get slammed for promoting
"premature optimization."

That's not to say premature optimization doesn't exist. Sometimes writing
faster code is the difference between using a hash instead of a list.
Sometimes it's spending an entire day optimizing a function that's called
once. The former is performance-aware, the latter is premature. It seems as
though most would have you believe that both fall into the "premature"
bucket.

It's as though people are using it as an excuse for their personal ignorance
and would rather continue to revel in this ignorance than expand what they
know. As far as I am concerned these "unimportant tidbits" about performance
are gifts that keep giving.

/somewhat offtopic rant

~~~
ownagefool
The argument behind premature optimization only really stands when first
developing a product. Once you have an established product, it's no longer
premature, it's just optimization.

~~~
emodendroket
Well, only if you never add any new features to it.

~~~
wvenable
As long as you are not optimizing those new features as you are adding them,
it's not premature optimization.

------
qwerta
I think people underestimate incremental improvements a lot. My tai-chi
teacher told me that Eastern culture is more about "putting one sheet of paper
on another": slow and steady improvement. Western culture is much more about
the "big bang" approach: make it or break it.

SQLite has been around for 15 years; they never shifted their purpose and were
constantly improving. You get an almost perfect product at this point.

I work on an in-memory db which evolved in a similar way. Now, after more than
a decade of small improvements, it is faster than Java collections, despite
doing serialization and its own memory management.

~~~
italophil
Eastern cultures aren't the only ones with the concept. In German there is the
phrase 'a steady water drop hollows a stone', and it actually traces back to
the Roman author Ovid: 'gutta cavat lapidem'. Perseverance is just so much
harder than instant gratification.

~~~
hyh1048576
Interesting. The same phrase 'a steady water drop hollows a stone' also exists
in Chinese: 水滴石穿, and it also traces back to 2000-year-old classics.

~~~
krick
I don't want to disappoint you, but the same phrase exists in so many
languages that I'm afraid it isn't something culture-specific.

------
kazinator
Ah, the good old 100/100 rule: 100% of the execution time is somewhere in 100%
of the code.

Sometimes when you profile code you're dismayed to find that there are no
clumps where the time is spent. It's everywhere. This function here takes 3%,
this one 5%, that one another 3%, ... you either have to make some radical
change, or else make all of these a little faster.

~~~
erikb
And if you make the radical change, you'll see that it will also have these 3%
there and 5% somewhere else.

~~~
scuba7183
And it will probably break something along the way :-)

------
nkurz
Deep within another thread, I had a conversation with 'nnethercote' regarding
Richard's use of Cachegrind to optimize performance. My position is that it is
no longer a useful approach, as the actual behaviour of modern processors has
diverged so much from Cachegrind's simple simulation. Instead, I think it is
much more productive to use tools like 'perf' that use the CPU's built-in
performance counters to measure the actual performance.

My conviction inspired me to repeat Richard's test to see how the numbers
compared when run on a current Haswell i7-4770 processor. Here's what I got
for the versions he compared when compiled with GCC 4.8.1 and using 'perf stat
speedtest1 --size 5'[1] (sizes are for the sqlite.o lib only):

    
    
      Cachegrind per Richard:
      3.8.7a -Os: 953,861,485 cycles
      3.7.17 -Os: 1,432,835,574 cycles
    
      Haswell:
      3.8.7 -O3: 469,956,519 cycles (870,767,673 instructions, 1,069,032 bytes)
      3.8.7 -Os: 542,228,740 cycles (884,541,860 instructions, 648,360 bytes)
    
      3.7.17 -O3: 651,680,545 cycles (1,320,867,830 instructions, 1,002,192 bytes)
      3.7.17 -Os: 771,304,471 cycles (1,406,527,795 instructions, 605,976 bytes)
    

Make of the numbers what you will, but here are some conclusions I
drew:

While the improvement is about the same magnitude that Richard sees, the
actual number of cycles is off by about a factor of 2.

Despite having a larger binary, and thus theoretically worse instruction cache
behavior, -O3 beats -Os by about 20% in speed in both cases, at a cost of 40%
(400K/1MB) in size.

'perf record -F100000 speedtest1 --size 5; perf report' is a really slick and
easy way to figure out where the program is spending its time. If I were
optimizing this, I remain fairly certain it would be a more effective approach
than using Cachegrind.

[1] I had to comment out the "sqlite3_rtree_geometry_callback" line in
speedtest1.c to get it to compile, which might affect my numbers slightly, but
from the source comments I don't think it was actually being used.
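
For anyone wanting to reproduce numbers along these lines, here's a minimal
sketch of the commands involved (the exact flags and file layout are my
assumptions, not taken from the thread; speedtest1.c and the sqlite3.c
amalgamation come from the SQLite source tree):

    # build the library at a given optimization level and check its size
    gcc -O3 -c sqlite3.c -o sqlite3.o
    size sqlite3.o

    # link the benchmark driver against it
    gcc -O3 -o speedtest1 speedtest1.c sqlite3.o -lpthread -ldl

    # count cycles and instructions using the CPU's performance counters
    perf stat -e cycles,instructions ./speedtest1 --size 5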

~~~
SQLite
The cycle-counts returned by cachegrind are repeatable to 7 or 8 significant
figures. That means that I can make a small change, rerun the test, and know
whether the change helped or hurt even if the difference is only 0.01%. I
don't think perf is quite so repeatable, is it?

Also, the cg_annotate utility gives me a complete program listing showing me
the cycle counts spent on each line of code, which is invaluable in tracking
down hotspots in need of work. If perf provides such a tool, I am unaware of
it.
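
(For concreteness, the cachegrind workflow looks roughly like this; the
output file name is generated by valgrind, so the pid suffix below is a
placeholder:)

    # run on cachegrind's simulated CPU; the counts are deterministic
    valgrind --tool=cachegrind ./speedtest1 --size 5

    # annotate each source line with its instruction counts
    cg_annotate --auto=yes cachegrind.out.<pid>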

Remember that I'm not trying to optimize for a specific CPU. SQLite is cross-
platform. I want to do optimizations that help on all CPUs using all
compilers. I'm _measuring_ the performance on the "cachegrind virtual CPU" of
a binary prepared using GCC and -Os because that combination gives repeatable
measurements that are easy to map into specific lines of source code. But the
optimizations themselves should usually apply across all CPUs and all
compilers and all compiler optimization settings.

Nkurz is, of course, welcome to use any tool he likes to optimize his
projects. But, at least for the moment, I'm finding cachegrind to be a better
tool to help with implementing micro-optimizations.

~~~
nkurz
> The cycle-counts returned by cachegrind are repeatable, to 7 or 8
> significant figures. ... I don't think perf is quite so repeatable

It can come close. I just tried, and for 'speedtest1' I seemed to be getting 3
significant digits for the cycle counts, and 4 for the instruction count.
You'd probably gain another one or two if you were to measure computation only
and remove the printf() and other I/O statements. The underlying performance
counters are pretty much cycle-accurate.
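
(perf can also quantify its own repeatability by repeating the run and
reporting the spread; a small sketch:)

    # run 10 times; perf prints mean and stddev for each counter
    perf stat -r 10 -e cycles,instructions ./speedtest1 --size 5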

> cg_annotate utility gives me a complete program listing showing me the cycle
> counts spent on each line of code... If perf provides such a tool, I am
> unaware of it.

Yes, the record/report combination I quoted above does just this, with
insignificant runtime overhead. Unlike the total counts, this one is sampled,
so you might get a little more variation. It's definitely good enough for
quickly finding hotspots, and there are other (harder to use) tools that can
use the precise "PEBS" events if you need complete counts. These even allow
you to do nifty things like track the number of times each branch statement is
mispredicted.
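
(As a sketch of what I mean, using perf's ":pp" precise-event modifier, which
maps onto PEBS on Intel hardware:)

    # sample mispredicted branches precisely, then view them per source line
    perf record -e branch-misses:pp ./speedtest1 --size 5
    perf report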

> I'm measuring the performance on the "cachegrind virtual CPU" of a binary
> prepared using GCC and -Os because that combination gives repeatable
> measurements that are easy to map into specific lines of source code.

Absolutely, this is the right way to view cachegrind. My question would be
whether it is the right generic CPU to be using, and whether optimizations
made on it translate well to other modern CPUs. Many of them will, but I think
you'd have faster turnaround time and even better success with an approach
that uses a real CPU and its performance counters.

> I'm finding cachegrind to be a better tool to help with implementing micro-
> optimizations.

Please realize I have the utmost respect for your work on SQLite. It's my most
frequent answer when asked for an example of C code to study, learn from, and
pattern after. I'm certain you will manage to optimize it with any tool you
choose, but having spent many hours with GProf and cachegrind myself, I (with
the zeal of a recent convert) think you'll be amazed by some of the things
that are now possible with performance counters.

------
rlpb
How is this work funded? Are these developers working full time on SQLite, or
is this work done in developers' spare time?

~~~
cowsandmilk
Not going to say it is 100% of their time, but they are paid to work on it.

You can see five consortium members on the front page of sqlite; the cost of a
consortium membership appears to be $75,000/year[1]. That's enough to pay
developers.
developers.

[1]
[http://sqlite.org/consortium_agreement-20071201.html](http://sqlite.org/consortium_agreement-20071201.html)

~~~
ghshephard
And that doesn't even include Apple, who I have to believe has a big
investment in sqlite, and probably has some form of support arrangement with
them.

------
Shish2k
I wonder what most of those optimisations _were_? In particular, are there any
that are general enough to be worth knowing about, or even worth including at
the compiler level?

~~~
easytiger
Everything is detailed here:

[http://www.sqlite.org/changes.html](http://www.sqlite.org/changes.html)

------
nnethercote
> We have achieved this by incorporating hundreds of micro-optimizations. Each
> micro-optimization might improve the performance by as little as 0.05%. If
> we get one that improves performance by 0.25%, that is considered a huge
> win. Each of these optimizations is unmeasurable on a real-world system (we
> have to use cachegrind to get repeatable run-times) but if you do enough of
> them, they add up.

Interesting! That reminds me of my own experience in a different context (from
[https://blog.mozilla.org/nnethercote/2011/07/01/faster-
javas...](https://blog.mozilla.org/nnethercote/2011/07/01/faster-javascript-
parsing/)):

> Cachegrind does event-based profiling, i.e. it counts instructions,
> memory accesses, etc., rather than time. When making a lot of very small
> improvements, noise variations often swamp the effects of the improvements,
> so being able to see that instruction counts are going down by 0.2% here,
> 0.3% there, is very helpful.

~~~
amaranth
Of course, since processors are out-of-order and superscalar, reducing
instruction counts isn't always a worthy goal either. You might end up with
fewer instructions but fewer things happening in parallel. And then of course
you have to balance this with the cache efficiency of having fewer
instructions to look at to complete a task.

Basically, optimization at this level is really hard.

~~~
nnethercote
And yet, in practice, I have found that optimizing for instruction counts
works really well. I've gotten _way_ more mileage out of Cachegrind's
instruction counts than I ever have out of its cache or branch prediction
simulations.

~~~
nkurz
I was tempted to respond after your first comment, but this followup provoked
me to action. When you wrote your post several years ago, modelling with
Cachegrind may still have been a defensible approach. But the gap between its
generic processor simulation and actual real-world performance continues to
widen. As you note, this divergence first became very apparent with the cache
and branch predictions. Currently, I'd argue that your time will almost always
be better spent checking the CPU's built-in performance monitors rather than
using Cachegrind. For the cases where Cachegrind used to be useful, 'perf' is
a joy!

~~~
nnethercote
Well, the SQLite folks are clearly still getting good usage from Cachegrind.

~~~
nkurz
I'm not sure that's a strong argument for your case. Similarly to the way that
Usain Bolt could probably beat me in the 100m even if he was in a wheelchair
with flat tires, I'm sure Richard Hipp could probably do a decent job of
optimizing SQLite with just a stub of pencil and a scrap of napkin. I just
think he'd be more effective with better tools.

But you inspired me to test out some real numbers, which I posted in a new
thread:
[https://news.ycombinator.com/item?id=8426302](https://news.ycombinator.com/item?id=8426302)

------
bsaul
How long will we have to wait before this version is usable for our CoreData
apps on iOS? Is it possible to include it manually in our app, or are we
dependent on iOS releases?

~~~
michaelmior
If you want to use it with CoreData, you might have to wait. But you can
always use sqlite directly and link with the latest library. It might well be
possible to specify a different library version you can use with CoreData, but
I've never tried this.

------
ponytech
By "50% faster" what they actually mean is this new version requires 50% less
CPU cycles than the former for doing the same task. In real world I think you
will not get your program to run "50% faster": disk io is often the bottleneck
in the database area.

~~~
qwerta
SQLite mostly runs on slow-CPU devices with relatively fast SSDs. Reads are
parallel, but writes are single-threaded, so there is no advantage from
multi-core ARM CPUs.

~~~
onli
Sqlite mostly runs everywhere, especially on normal desktop PCs, including all
those old ones with slow HDDs and relatively powerful CPUs.

But I don't get the ARM CPU remark; did the parent change their comment?

------
mrweasel
I really like SQLite, and posts like this always make me wonder: could I run a
webshop on SQLite if I did it right, and how many users could I potentially
support?

Of course it's not that hard to deploy a MySQL or PostgreSQL server, but it's
still pretty hard to beat SQLite in ease of deployment and management. Running
a simple webshop could be as easy as having a database for inventory and one
for orders. Even very large stores aren't going to have thousands or even
hundreds of orders a minute, so I don't think that writes would be an issue,
and reads certainly aren't.

~~~
eli
You certainly could, but it's not really built for concurrent reading and
writing.
[http://www.sqlite.org/faq.html#q5](http://www.sqlite.org/faq.html#q5)

~~~
strcat
That FAQ entry is out-of-date as it doesn't mention the write-ahead logging
support.

[https://www.sqlite.org/wal.html](https://www.sqlite.org/wal.html)
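
(Turning it on is a single pragma that persists with the database file; for
example, from the command-line shell, with a hypothetical shop.db:)

    # switch the database to write-ahead logging
    sqlite3 shop.db 'PRAGMA journal_mode=WAL;'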

~~~
eli
Neat, I didn't realize that had landed. Thanks for the link.

------
underlines
Is it possible to compile this version of sqlite for ARM and replace it on an
Android device to improve performance? Android mainly uses SQLite for internal
stuff, and almost all apps use it...

As stated here: [http://forum.xda-developers.com/android/software-
hacking/sql...](http://forum.xda-developers.com/android/software-
hacking/sqlite-50-faster-performance-boost-t2898180)

~~~
personZ
Most definitely. Download AOSP, overwrite ./external/sqlite/dist with the new
version, and build libsqlite.so and deploy on a rooted device (note that
Chromium, and the Chrome browser, use their own copy of sqlite).
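
Roughly, the steps look like this (a sketch only; the lunch target and output
path depend on your device, and I'm assuming a standard AOSP checkout):

    # drop the new amalgamation into the AOSP tree
    cp sqlite3.c sqlite3.h external/sqlite/dist/

    # set up the build environment and build just the library
    source build/envsetup.sh
    lunch aosp_arm-eng            # hypothetical target; pick yours
    make libsqlite

    # push it to a rooted device
    adb root && adb remount
    adb push out/target/product/generic/system/lib/libsqlite.so /system/lib/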

Having said that, while Android uses sqlite copiously, I doubt this would make
a perceptible difference in usage, performance or battery life. It is a 50%
improvement by itself, but I doubt sqlite3 usage is even measurable compared
to everything else going on in the system.

~~~
underlines
Yeah, thanks. That was the thing I was looking for.

About your doubts, I'd like to quote the sqlite improvement comment: "Each of
these optimizations is unmeasurable on a real-world system (we have to use
cachegrind to get repeatable run-times) but if you do enough of them, they add
up."

Therefore, I will build this library for my ROM, even though it may only give
a 0.02% performance boost; constantly improving my ROM with those
micro-optimizations will probably lead to a 50% boost one day :P

------
tbrock
Anyone using newer versions of SQLite out of homebrew on OSX? If so, are you
having any trouble with conflicts of any sort with the system version? OSX
relies heavily on SQLite in many places, and I'm curious if it causes trouble
to try and use both (and have it in your path).

------
easytiger
I'm on 3.8.6, so I will have most of the benefits. The temptation with sqlite,
because it is so well maintained, is to only do cursory testing with it, so I
might drop this in.

Interestingly:

> A full 10% of the performance gain has come since the previous release.
> There have been a lot of changes.

------
nXqd
This is lovely. I believe the most important thing we learn from this is how
to measure your effort correctly. If you do micro-optimization but don't
track it, people will just shout "PREMATURE OPTIMIZATION!" at you.

------
pranayairan
Can we expect this version to be included in Android L?

~~~
sitkack
Just ship it in your own binary.

------
c4pt0r
Very good news for Android developers.

------
rbobby
> Small. Fast. Reliable. Choose any three.

Nice tag line.

~~~
notacoward
Cute, but inaccurate. Small? Not quite.

[http://symas.com/mdb/microbench/](http://symas.com/mdb/microbench/)

Fast? No again.

[http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html](http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html)

Note that LevelDB itself isn't exactly the speed king here. It's just the
clearest example. To its credit, SQLite _does_ fare well when it comes to
reliability.

[https://www.usenix.org/system/files/conference/osdi14/osdi14...](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-
pillai.pdf)

So ... choose one?

(Downvotes from those who despise facts in ... oops, already happened.)

~~~
zaphar
Within the space of SQL databases they get all three.

Writing to /dev/null beats leveldb any day; the only problem is I can't get
the data back out. A leveldb-to-sqlite comparison is unfair to both leveldb
and sqlite.

~~~
notacoward
Ah, "no true Scotsman" rears its head again. Why limit the space to SQL
databases? Many applications that need an embeddable database don't
specifically need SQL. You're right that comparing LevelDB to SQLite isn't
quite fair to either, but _they invited the comparison_ with a slogan that
makes comparative claims. It's not unfair to provide citations relevant to
those claims, which you don't seem to have done BTW.

~~~
aidenn0
FWIW: I had never heard of MDB, and saw it performed really well on the
benchmark, so I decided to look it up. I found this page:

[http://symas.com/mdb/](http://symas.com/mdb/)

They don't even mention sqlite in the comparison section on that page, because
they compare themselves to other embeddable "key value stores", which sqlite
isn't.

If you look closely you'll even see that there is an MDB backend for sqlite.

It's not a "no true Scotsman" when the products are so clearly different.

~~~
notacoward
Yes, if you compare it only to things that are _exactly like it_ then it will
be in the top percentile. Also the bottom. That doesn't seem very useful. What
other _embeddable SQL_ databases should it be compared to? If you don't think
this comparison is the correct one, tell me which one is. I'll go get the
numbers, and I bet I'll find that SQLite is still neither small nor fast by
comparison. Otherwise all this goalpost-moving makes it impossible for anyone
to prove anything (not that the pro-SQLite contingent is even trying).

Balancing full SQL semantics with being small _enough_, fast _enough_, and
commendably reliable is an impressive achievement. I respect the SQLite
authors for that. If they had picked any three of "small, fast, reliable, SQL"
I would never have objected. BUT. THEY. DIDN'T. They left "SQL" out, opening
the comparison to others. For many people who need an embeddable database, a
key/value or generically table-oriented (but not fully relational/SQL)
database is quite sufficient. For them, a claim based on an unstated and
irrelevant fourth criterion, against equally unnamed alternatives, is either
useless or misleading. But that's apparently the only kind of comparison the
herd will accept.

~~~
aidenn0
I was comparing it to other relational stores, not other embeddable stores.

It works both ways too, though: it's very small, fast, and reliable compared
to Chrome or Outlook, which both can store data...

------
TheLoneWolfling
And this is one of the problems with "optimize later".

If you have a large project in Python, for example, far too often there isn't
any one spot that has massive gains if you translate it to C. It's the death
of a thousand cuts.

~~~
sauere
Python might not be the fastest player on the block... but the increase in
coding productivity will more than pay for the extra equipment needed to
handle any speed shortcomings.

~~~
ramidarigaz
Except when it doesn't. I'm facing this problem with an embedded-ish project
I'm working on, where Python initially seemed like a good choice, but now
we're regretting it for both performance and power-usage reasons. If we'd gone
with C or C++ initially, we'd be having a much easier time optimizing.

Not that I'm implying Python is bad, but if your target use case doesn't allow
for a desktop or server, think carefully about what tech you use.

