That's not to say premature optimization doesn't exist. Sometimes writing faster code is a matter of choosing a hash instead of a list. Sometimes it's spending an entire day optimizing a function that's called once. The former is performance-aware; the latter is premature. It seems as though most would have you believe that both fall into the "premature" bucket.
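To make the hash-vs-list distinction concrete, here is a minimal Python sketch (collection sizes and names are invented for illustration) showing why the choice of structure, not a day of tuning, is usually what matters:

```python
import timeit

# Membership tests against a list scan linearly; a set uses hashing.
items_list = list(range(10_000))
items_set = set(items_list)

# Search for an element near the end of the collection, many times.
t_list = timeit.timeit(lambda: 9_999 in items_list, number=1_000)
t_set = timeit.timeit(lambda: 9_999 in items_set, number=1_000)

print(f"list lookup: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

Picking the right structure up front costs nothing extra to write, which is exactly what makes it "performance aware" rather than premature.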
It's as though people are using it as an excuse for their personal ignorance and would rather continue to revel in this ignorance than expand what they know. As far as I am concerned these "unimportant tidbits" about performance are gifts that keep giving.
/somewhat offtopic rant
You can supplement an answer with "don't prematurely optimize," but the presumption that premature optimization is occurring is a premature assumption.
He who asks a question is a fool for five minutes; he who does not ask a question remains a fool forever. ~ Chinese Proverb
I would rather work with beginners who are not fools than with experts who are.
In my experience on Stack Overflow, the people getting hit with accusations of "premature optimization" have not gone this far.
He may still ask why one is slower than the other (if it is, I don't know that), but that's a completely different question that in most contexts will get a nice answer.
Tests that don't cover the best option can't possibly identify the best option. So if you want to do a really good test, going somewhere to ask, "What else should I try?" is hardly an asinine course of action. On the contrary, if our hero is new to this stuff, or is new to the platform in question, or anything like that, then asking others for advice should probably qualify as due diligence.
I think part of what's going on there is that SO's gamification can create some kind of perverse incentives for answerers. If you want to get the upvotes, you have two good courses of action: You can either already have 5 digits' worth of karma, or you can answer first. The first option isn't available to the vast majority of SO users, so most go with the latter one. This drives down the quality of answers. It also creates perverse incentives for behaviors such as users getting in a quick low-quality response that's just good enough to get a few upvotes from people who already understand what you're talking about, and then (maybe) following it up with an edit that provides a high-quality response that would be useful to the asker.
Me too. It's infuriating how presumptuous people can be.
Here is someone insisting to me over and over that virtual function call overhead is not measurable: http://stackoverflow.com/a/16603295/77070
Here is someone telling me that my attempts to avoid a virtual function call "definitely fall under the category of premature optimization": http://stackoverflow.com/a/5100072/77070
It's interesting to me that people will shout "premature optimization" before they have any idea what you are doing or how long you have worked on this problem before trying to optimize.
In my experience, profilers are really low-level tools that often highlight the wrong problems, aren't very good at telling you where the optimization opportunities are, and tend to encourage microoptimization.
It's not that profilers are bad, it's that I get the feeling the people parroting this advice have no idea what they're talking about, and somehow believe that a profiler is "the solution", when profilers are neither necessary nor (most problematically) sufficient to fix most performance problems.
1. quickly seeing which high-level functions or methods take the longest time, and
2. seeing which methods are called a lot.
Even 2. can be of marginal use. A lot of the time, it's not surprising that a method is called a lot, and not clear whether those calls are indicative of a performance problem. There are times, though, when a method is called a lot more than you expected, in which case it might be the clue to solving the performance problem.
For 1., the usual procedure is to click down through the call stack from the most expensive top-level method, for as long as there is still a single method taking up a good fraction of the overall time.
Hopefully the stopping point will still be a method you wrote, and not a library call. If it's your code, you likely have the ability to fix the performance problem by changing your code. If it's a library call, you might need to be a little more creative, in finding an alternative approach, or better understanding what that library call is doing under the hood.
So for me, the profiler just tells me I'm looking at the right general part of the code when trying to improve performance, and that's about it.
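The top-down drill-down described above can be reproduced with, for example, Python's built-in cProfile (the function names here are invented):

```python
import cProfile
import io
import pstats

def expensive_inner():
    # Stand-in for a hot leaf function.
    return sum(i * i for i in range(50_000))

def top_level():
    # Stand-in for the expensive top-level method you'd drill into.
    return [expensive_inner() for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()
top_level()
profiler.disable()

# Sorting by cumulative time puts the most expensive top-level calls
# first, so you can read down the call tree from there.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```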
I could wait until the data in those tables grows to a crazy size and the reads are out of control (and spilling to disk) or I could just fix it now. Hmmm...
> I'm optimizing this due to profiler results.
You'd be surprised how many people still question whether I'm prematurely optimizing in the comments, but the actual answers tend to be more helpful.
I'm trying to do X.
Currently, I'm using Y mechanism, and it produces correct results.
However, the performance sucks -- profiling shows me this is a big time sink in my program. So... (actual question.)
It may help with kneejerk "premature optimization" responses if the profiler results are mentioned before the request about how to optimize (at least, it should to the extent that the problem is people stopping reading once they see the optimization request, so that they never reach the comment about profiler results).
How does having karma help you get more upvotes on a brand new answer?
Answers that already have upvotes get more views, and are therefore more likely to get even more upvotes. But that's distinct from a user's reputation score.
Low-level optimization is not a task to be undertaken casually. If you don't know where to look, you can end up just wasting a lot of programmer time for zero real-world benefit. If you don't measure before and after, you may even make things worse...
The SQLite folks didn't make these changes with a hope and a prayer; they carefully profiled, changed, and profiled again... And they invested years in building a test suite to ensure that such micro-optimizations don't inadvertently break the logic. The results are impressive, but so was the time and effort invested in achieving them. If you're not willing to be so diligent, "premature optimization" is exactly what you're doing.
And for the love of all that you care about, measure it with realistic data and use cases.
I've seen people optimise code while working with small test databases and making changes that are more efficient over the smaller data-sets but are hideously bad on larger ones (usually due to memory use growth, for instance hitting the limits where SQL Server starts spooling temporary data to disk instead of just keeping it in RAM for the duration of its existence).
Or just as bad: working with data that has the wrong balance and (to give another SQL example) giving index use "hints" that shave a little off the time to run over their test data but hamper the query planner on real data by making it ignore what might be better options.
I use the quotes around the word "hint" because it implies a fairly informal pointer that the planner might choose to ignore, whereas in reality they usually take it as an order and go that way without question (assuming that you know better than it might otherwise choose, or you would not have given the hint).
SQL Server: http://msdn.microsoft.com/en-us/library/ms187373.aspx
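The link above is for SQL Server, but SQLite has an analogous `INDEXED BY` clause with the same "order, not suggestion" behavior; a small sketch (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE INDEX ix_customer ON orders(customer)")
conn.executemany("INSERT INTO orders (customer) VALUES (?)",
                 [("alice",), ("bob",), ("alice",)])

# INDEXED BY is an order, not a suggestion: if the named index can't be
# used to satisfy the query, SQLite raises an error instead of silently
# falling back to a better plan.
rows = conn.execute(
    "SELECT id FROM orders INDEXED BY ix_customer WHERE customer = ?",
    ("alice",),
).fetchall()
print(rows)
```

If your "hinted" plan was only good on small test data, the planner has no freedom to save you on production data.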
What I guess I'm trying to say is that you can go in and write obviously slow code "which is fine because you shouldn't prematurely optimize," or you can be informed and, sometimes with no extra development time, write something that's at least competent from the get-go.
For everything else? Profile, profile, profile, profile! And make sure you are using a competent profiler. I've wasted days because of inferior tools, switched to better ones (cough VTune cough) and had major improvements in mere hours.
> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Another common mistake relates to the phrase "penny wise and pound foolish", which applied to programming helps us identify those scenarios where we might waste time micro-optimizing array searches rather than choosing a hashtable (or even pre-sorting the array). Time spent optimizing your architecture will nearly always offer a bigger payoff than micro-optimizations - but once that's done, Knuth's critical 3% is worth remembering.
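A small illustration of "penny wise and pound foolish" in this sense: micro-optimizing a linear scan versus simply switching structure (data and names invented):

```python
import bisect

data = list(range(0, 100_000, 2))   # sorted even numbers

def linear_search(xs, target):
    # Micro-optimized version of the wrong approach: still O(n).
    for x in xs:
        if x == target:
            return True
        if x > target:      # early exit on sorted data, a classic tweak
            return False
    return False

lookup = set(data)                  # hashtable: O(1) average membership
found_scan = linear_search(data, 99_998)
found_set = 99_998 in lookup
# Pre-sorted array + binary search: O(log n).
found_sorted = data[bisect.bisect_left(data, 99_998)] == 99_998
print(found_scan, found_set, found_sorted)
```

All three agree on the answer; the architectural choice (set or binary search) dwarfs anything the early-exit tweak can recover.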
Here's some more from it to entice you:
My own programming style has of course changed during the
last decade, according to the trends of the times (e.g.,
I'm not quite so tricky anymore, and I use fewer go to's),
but the major change in my style has been due to this inner
loop phenomenon. I now look with an extremely jaundiced eye
at every operation in a critical inner loop, seeking to
modify my program and data structure (as in the change from
Example 1 to Example 2) so that some of the operations can
be eliminated. The reasons for this approach are that: a)
it doesn't take long, since the inner loop is short; b) the
payoff is real; and c) I can then afford to be less
efficient in the other parts of my programs, which
therefore are more readable and more easily written and debugged.
It's premature, even if the product is completely feature-complete and bug-free, if the particular optimization is being done before more significant optimizations -- which, if you aren't using some measurement more sophisticated than "this looks like it could be optimized" to choose what to optimize, it probably is.
Typically, that's because the people asking are asking questions like "which is faster in PHP, print or echo?" to which the answer is generally "there's about a half a second difference if you run it a million times, but you shouldn't be doing that anyways".
The internet is unfortunately full of dumbass developers. On HN, on Stack Overflow, and, well, everywhere.
PS: If all someone wants to do is rotate an image, loading Photoshop really is slow.
So optimizations like this can certainly no longer be considered "premature".
Man, did I ever 'waste' a lot of time on optimization when I was first learning to code (in C), but:
a) I learned what the low-hanging fruit was, so that I was eventually able to make all my code fairly fast without explicit optimization.
b) Man, were some of those early programs I wrote blindingly fast... especially compared to the higher-level python/numpy/scipy stuff I write a lot of today. It's also nice to repurpose some of that old code (i.e., as a library) for new applications, since every potential bottleneck has already been ironed out.
I disagree; it depends on the context. If using your original list is so fast that no human can notice the difference, it's a waste of time to turn it into a hash, not "performance aware".
Actually, whether it's a "waste of time" also depends on the context. It could be worse. It could make your code harder to understand and maintain.
Once a feature works as intended, optimizations aren't premature anymore. You're then in a position where you can profile, measure, and compare.
SQLite has been around for 15 years; they never shifted their purpose and were constantly improving. You get an almost perfect product at this point.
I work on an in-memory db which evolved in a similar way. Now, after more than a decade of small improvements, it is faster than Java collections, despite doing serialization and its own memory management.
HINDI: "Rasari aavat javat re, sil par parat nishaan" (roughly: the rope's constant coming and going leaves a mark even on stone).
I usually prefer to err on the side of being too specific. :)
Perhaps it should be seen versus the "throw it all away and start over" tendency? Like in steady refactors vs. big rewrites.
Say you can either spend X amount of effort on a 50% improvement, or you can spend X/10 on each of 10 5% improvements. Sounds the same, right?
It's not. If the smaller increases compound, you get a (1.05^10=) 63% increase out of the latter situation. Plus you don't have to wait for everything to be done to get visible improvements.
This is the same thing. The point is that you say "10 5% improvements of state S0" in that sentence, and then later on you use 10 5% improvements of states S0, S1, ... S9. This doesn't make any sense; could you explain to me how this would work?
No, I don't.
Here's a very simplistic example. Say I've got a complicated computation with 10 nested loops, and there's a potential for a 5% speedup tweak at every nesting level, each of which takes about the same amount of effort to find; or, for the same overall effort I can speed up the whole thing by 50% by replacing the whole algorithm with a faster one.
The former case gives you a better outcome than the latter, and it doesn't matter what order you do the tweaks in, or whether you do them all at once. The point is that they can compound without caring about the previous or latter state of anything other than their own level.
Let's say my program takes time 1.0 and is made up of part A taking 0.5 and part B taking 0.5
Defining 5% speedup is weird in the first place, but let's say 5% speedup is taking 0.95 for the program and 10% speedup is taking 0.9 for the program.
5% speedup by improving part A => 0.45 + 0.5 => 0.95 runtime
5% speedup by improving part B => 0.5 + 0.45 => 0.95 runtime
making both improvements:
0.45 + 0.45 => 0.9
So, two speed-ups which individually would be 5% over the initial performance gives us a 10% speed up.
Percentage speed-ups do not necessarily compound. And in practice, usually they do not.
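The worked example above can be checked in a few lines (runtimes are the arbitrary 0.5 + 0.5 split from the example):

```python
# Two 5%-of-total speedups on disjoint halves of a program ADD to 10%;
# they do not compound multiplicatively.
total = 1.0
part_a = part_b = 0.5

# Each improvement shaves 0.05 of the ORIGINAL total off its own part.
after_a = (part_a - 0.05) + part_b                # one improvement
after_both = (part_a - 0.05) + (part_b - 0.05)    # both improvements

# What multiplicative compounding would predict instead:
compounded = total * 0.95 * 0.95
print(after_a, after_both, compounded)
```

The additive result (0.90) and the compounded prediction (0.9025) differ, which is the whole point: the second 5% was measured against the original total, not against the already-improved program.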
Also, natural selection tends to have periods of fast adaptation and periods of plateau quasi-equilibrium.
I like Tai-chi and all, but I'm still a skeptic :)
* Yeah, this is a pretty bad post, I'm sorry.
Sometimes when you profile code you're dismayed to find that there are no clumps where the time is spent. It's everywhere. This function here takes 3%, this one 5%, that one another 3%, ... you either have to make some radical change, or else make all of these a little faster.
My conviction inspired me to repeat Richard's test to see how the numbers compared when run on a current Haswell i7-4770 processor. Here's what I got for the versions he compared when compiled with GCC 4.8.1 and using 'perf stat speedtest1 --size 5' (sizes are for the sqlite.o lib only):
Cachegrind per Richard:
3.8.7a -Os: 953,861,485 cycles
3.7.17 -Os: 1,432,835,574 cycles
3.8.7 -O3: 469,956,519 cycles (870,767,673 instructions, 1,069,032 bytes)
3.8.7 -Os: 542,228,740 cycles (884,541,860 instructions, 648,360 bytes)
3.7.17 -O3: 651,680,545 cycles (1,320,867,830 instructions, 1,002,192 bytes)
3.7.17 -Os: 771,304,471 cycles (1,406,527,795 instructions, 605,976 bytes)
While the improvement is about the same magnitude that Richard sees, the actual number of cycles is off by about a factor of 2.
Despite having a larger binary, and thus theoretically worse instruction cache behavior, -O3 beats -Os by about 20% in speed in both cases, at a cost of 40% (400K/1MB) in size.
'perf record -F100000 speedtest1 --size 5; perf report' is a really slick and easy way to figure out where the program is spending its time. If I were optimizing this, I remain fairly certain it would be a more effective approach than using Cachegrind.
 I had to comment out the "sqlite3_rtree_geometry_callback" line in speedtest1.c to get it to compile, which might affect my numbers slightly, but from the source comments I don't think it was actually being used.
Also, the cg_annotate utility gives me a complete program listing showing me the cycle counts spent on each line of code, which is invaluable in tracking down hotspots in need of work. If perf provides such a tool, I am unaware of it.
Remember that I'm not trying to optimize for a specific CPU. SQLite is cross-platform. I want to do optimizations that help on all CPUs using all compilers. I'm measuring the performance on the "cachegrind virtual CPU" of a binary prepared using GCC and -Os because that combination gives repeatable measurements that are easy to map into specific lines of source code. But the optimizations themselves should usually apply across all CPUs and all compilers and all compiler optimization settings.
Nkruz is, of course, welcome to use any tool he likes to optimize his projects. But, at least for the moment, I'm finding cachegrind to be a better tool to help with implementing micro-optimizations.
It can come close. I just tried, and for 'speedtest1' I seemed to be getting 3 significant digits for the cycle counts, and 4 for the instruction count. You'd probably gain another one or two if you were to measure computation only and remove the printf() and other I/O statements. The underlying performance counters are pretty much cycle-accurate.
> cg_annotate utility gives me a complete program listing showing me the cycle counts spent on each line of code... If perf provides such a tool, I am unaware of it.
Yes, the record/report combination I quoted above does just this, with insignificant runtime overhead. Unlike the total counts, this one is sampled, so you might get a little more variation. It's definitely good enough for quickly finding hotspots, and there are other (harder to use) tools that can use the precise "PEBS" events if you need complete counts. These even allow you to do nifty things like track the number of times each branch statement is mispredicted.
> I'm measuring the performance on the "cachegrind virtual CPU" of a binary prepared using GCC and -Os because that combination gives repeatable measurements that are easy to map into specific lines of source code.
Absolutely, this is the right way to view cachegrind. My question would be whether it is the right generic CPU to be using, and whether optimizations made on it translate well to other modern CPUs. Many of them will, but I think you'd have faster turnaround time and even better success with an approach that uses a real CPU and its performance counters.
> I'm finding cachegrind to be a better tool to help with implementing micro-optimizations.
Please realize I have the utmost respect for your work on SQLite. It's my most frequent answer when asked for an example of C code to study, learn from, and pattern after. I'm certain you will manage to optimize it with any tool you choose, but having spent many hours with GProf and cachegrind myself, I (with the zeal of a recent convert) think you'll be amazed by some of the things that are now possible with performance counters.
At the same time, I don't see how your data supports your conclusion that perf is more effective than cachegrind here. The absolute cycle count is of limited interest. The most important thing these tools do is tell programmers where to look, and cachegrind seems to have done a good job of that here.
In my experience, perf and cachegrind are two tools in the toolbox. perf is stronger at optimizing for the particular CPU I'm on, and it runs faster. Cachegrind collects more detailed information, and while the model it uses isn't perfect for the CPU I'm on today, it's usually good enough to be useful, and it's good for optimizing things likely to matter on other CPUs too.
You can see five consortium members on the front page of sqlite; the cost of a consortium membership appears to be $75,000/year. That's enough to pay developers.
Interesting! That reminds me of my own experience in a different context (from https://blog.mozilla.org/nnethercote/2011/07/01/faster-javas...):
> Cachegrind [does] event-based profiling, i.e. it counts instructions, memory accesses, etc., rather than time. When making a lot of very small improvements, noise variations often swamp the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful.
Basically, optimization at this level is really hard.
However, if it's true, what do you say about the real-world performance gains the SQLite folks have achieved by using Cachegrind?
But you inspired me to test out some real numbers, which I posted in a new thread: https://news.ycombinator.com/item?id=8426302
SQLite usually gets used in situations where the data easily fits into memory (multiple times). It isn't really postgres ;)
In fact all the recent discussions of "performance" for RDBMSs has to be understood in terms of "what's being sacrificed for one or more aspects of ACID?". RDBMS technology is incredibly mature and optimisations that keep the ACID aspects intact are also incredibly difficult now. Marginal returns.
Also, sqlite does real I/O and is ACID-compliant; it's just that the data generally fits in RAM, whereas many Postgres installs have data that doesn't fit in RAM.
50% fewer cycles for the same work would be 100% faster.
50% more work for the same cycles would be 33% fewer CPU cycles for the same work.
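Both of those statements check out with a few lines of arithmetic (the cycle and work counts are arbitrary):

```python
cycles, work = 100.0, 1.0

# Same work in half the cycles -> throughput doubles -> 100% faster.
speedup = (work / (cycles * 0.5)) / (work / cycles)

# 50% more work in the same cycles -> cycles per unit of work
# drop by one third.
cycles_per_work_before = cycles / work
cycles_per_work_after = cycles / (work * 1.5)
reduction = 1 - cycles_per_work_after / cycles_per_work_before
print(speedup, reduction)
```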
But I don't get the ARM CPU remark, did parent change his comment?
Of course it's not that hard to deploy a MySQL or PostgreSQL server, but it's still pretty hard to beat SQLite in ease of deployment and management. Running a simple webshop could be as easy as having a database for inventory and one for orders. Even very large stores aren't going to have thousands or even hundreds of orders a minute, so I don't think that writes would be an issue, and reads certainly aren't.
I honestly don't know the upper limit of users you can support concurrently, but with modern hard drives, I imagine it's an order of magnitude beyond this. It has more to do with your application code than the database in my experience.
I had an informal rule: read early, write late. When the visitor first hits the page, I'll read for the first queries, and only write to the database after I've sent the response back to the user. Writes are always the slowest, but nowhere near as slow as the network, so the visitor never had to sit and wait for writes.
The thing is, a lot of the criticisms of SQLite didn't actually apply if you limited the number of writes to < 10 and queries to < 50 per hit and kept everything on one server. It's only when you try to write across the network that a lot of the locking issues and such come up. Also, by 3.7 there was Write-Ahead Logging (WAL), which was perfect timing, as that allowed us to skip the DIY write queue entirely.
Dealing with multiple writing threads with SQLite is really annoying.
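For what it's worth, enabling the WAL mode mentioned above is a one-liner; here's a minimal Python sketch (file path and table names are made up):

```python
import os
import sqlite3
import tempfile

# WAL mode requires a file-backed database; it lets readers proceed
# while a single writer appends to the write-ahead log.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)

conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('a', '1')")
conn.commit()

# A second connection can read concurrently without blocking on the writer.
reader = sqlite3.connect(path)
rows = reader.execute("SELECT v FROM kv WHERE k = 'a'").fetchall()
print(rows)
```

WAL removes the reader/writer lock contention, but there is still only one writer at a time, which is why multiple writing threads remain annoying.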
If you need to scale beyond a single host to serve the workload, something else will probably be easier to manage.
As stated here:
Having said that, while Android uses sqlite copiously, I doubt this would make a perceptible difference in usage, performance or battery life. It is a 50% improvement by itself, but I doubt sqlite3 usage is even measurable compared to everything else going on in the system.
About your doubts, I'd like to quote the sqlite improvement comment:
"Each of these optimizations is unmeasurable
on a real-world system (we have to use cachegrind to get repeatable
run-times) but if you do enough of them, they add up."
Therefore, I will build this library for my ROM; even though it may only give a 0.02% performance boost, constantly improving my ROM with those micro-optimizations will probably lead to a 50% boost one day :P
> A full 10% of the performance gain has come since the previous release.
> There have been a lot of changes.
Nice tag line.
Fast? No again.
Note that LevelDB itself isn't exactly the speed king here. It's just the clearest example. To its credit, SQLite does fare well when it comes to reliability.
So ... choose one?
(Downvotes from those who despise facts in ... oops, already happened.)
Writing to /dev/null beats leveldb any day; the only problem is I can't get the data back out. A leveldb-to-sqlite comparison is unfair to both leveldb and sqlite.
"SQL" is probably generally less important than relational, but either might be an important category.
> Many applications that need an embeddable database don't specifically need SQL.
And yet, many do, or at least need a relational store. To make comparisons, you have to define the category you are making them within; of course, any such category may not be the appropriate one for some comparisons -- either too narrow, too broad, or the right size but divided on the wrong axis -- but that's a matter of where the comparison is relevant. OTOH, it's simply wrong to say that a claim is wrong simply because it isn't based on your preferred categorization.
You're right! It hasn't baked me a cake, what the hell?
Context is key. It's a database, it does database things. A key-value store with no relations is a different beast.
I see it all over the place -- the front page of SQLite.org -- where the motto "Small. Fast. Reliable. Choose any three." is presented. Here's the first paragraph of body text from the page:
(emphasis added) "SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain."
The place they are telling you that the context is "SQL database engines" rather than just "database engines" is the same place they tell you that the context is something related to database engines as opposed to, say, automobiles.
Just taking the single phrase without considering the broader context the phrase was made in is not fair to sqlite.
No. But I do see it in "SQLite".
They don't even mention sqlite in the comparison section on that page because they compare themselves to other embeddable "key value stores" which sqlite isn't.
If you look closely you'll even see that there is an MDB backend for sqlite.
It's not a "no true Scotsman" when the products are so clearly different.
Balancing full SQL semantics with being small enough, fast enough, and commendably reliable is an impressive achievement. I respect the SQLite authors for that. If they had picked any three of "small, fast, reliable, SQL" I would never have objected. BUT. THEY. DIDN'T. They left "SQL" out, opening the comparison to others. For many people who need an embeddable database, a key/value or generically table-oriented (but not fully relational/SQL) database is quite sufficient. For them, a claim based on an unstated and irrelevant fourth criterion, against equally unnamed alternatives, is either useless or misleading. But that's apparently the only kind of comparison the herd will accept.
It works both ways too, though: it's very small, fast, and reliable compared to Chrome or Outlook, which both can store data...
If you have a large project in Python, for example, far too often there isn't any one spot that has massive gains if you translate it to C. It's the death of a thousand cuts.
The problem is architectural issues, which are far too often glossed over.
It's related, I think, to the standard piece of received wisdom, that you shouldn't optimize until some unspecified "later". This isn't a bad rule of thumb, but efficiency and convenience aren't always happy bedfellows, and, when coding, people tend to choose convenience. The result can be that when this later date arrives, and you go to make the code run more quickly, you're a bit stuck, because it would require changes more sweeping than can be accommodated with the resources available. But had performance been taken into account from the start, the changes would have been easy to make, or the whole thing might not even have been an issue.
An example might be writing a compute-heavy program in python rather than C.
Another standard piece of received wisdom is that once you get to optimize later, you'll just run the profiler and there'll be some hot spot due to poor choice of algorithm. So you optimize that bit and then you're done. But in practice this happens quite rarely; it's more common for the code to just be generally slow. A common reason it's generally slow is because performance wasn't considered during development, so the end result is the same.
This is what the reference to translating into C relates to, I suspect - the idea that in your slow python program there's this one bit that can be turned into C and suddenly your program is super fast.
(Of course, for many programs, python's speed is irrelevant. And any compute-heavy python program is likely only to be improved by translation into C - though your average python program would probably be much harder to make any quicker. Maybe that just goes to show that you actually have to think about this stuff on a case-by-case basis rather than relying on received wisdom. Anyway, I pick python only because it was mentioned.)
Belief in the improve-algorithm-later theory usually causes people to dismiss micro-optimizations as pointless, because rather than get 0.1% here and 0.1% there you can just put it off until later and then get 25% all in one go. But this attitude actually makes it more likely that the code will paint itself into a corner, leaving micro-optimizations the only option.
There isn't really any good quick easy solution to any of this, which is why I always liked it when computers got faster every year :(
Not that I'm implying Python is bad, but if your target use case doesn't allow for a desktop or server, think carefully about what tech you use.
I can't really envision using it for anything other than scripting.