Unfortunately for real memory (caches, DIMMs, swap, etc.) that's not true. It's O(log n) for any non-trivial size of memory, and can have bad (constant?) factors for mutating memory that is shared between processors.
Of course functional languages have hidden and not-so-hidden costs too, starting with the garbage collector, but including the time taken to get the larger working set into cache.
I'm interested in any studies that look at these total costs in real systems.
This once again raises an often-neglected aspect of "premature optimisation": blindly picking an algorithm that looks better on paper may in reality be much slower than a naive algorithm that better matches the characteristics of the machine.
Some algorithms are O(1), and you could easily be fooled into believing that they are therefore always better than counterparts that are asymptotically worse; that may not be so, though, if the "constant-time" computation required for non-trivial inputs takes longer than, say, a linear one.
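To make that concrete, here's a toy Python sketch (my own illustration, not from any study): a hash-based "O(1)" lookup whose per-operation constant is deliberately inflated, next to a plain O(n) scan that wins in practice for small inputs.

```python
# Illustrative sketch: a "constant-time" lookup whose constant dwarfs
# a linear scan for small n. SlowHashKey is a hypothetical stand-in for
# any expensive per-operation constant (e.g. a cryptographic hash).

class SlowHashKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Deliberately expensive but deterministic "constant" work.
        h = self.value
        for _ in range(10_000):
            h = (h * 1103515245 + 12345) % 2**31
        return h

    def __eq__(self, other):
        return self.value == other.value

def linear_contains(items, value):
    """O(n) scan with a tiny constant factor."""
    return any(x == value for x in items)

values = list(range(20))                   # small n: the scan wins in practice
table = {SlowHashKey(v) for v in values}   # "O(1)" lookup, huge constant

print(linear_contains(values, 7))          # → True
print(SlowHashKey(7) in table)             # → True (but each probe pays the hash cost)
```

Both lookups give the same answer; the point is only that the big-O labels say nothing about which one is faster at n = 20.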
It may not make it into the basics, but locality of data is absolutely considered in a lot of real work on algorithms and data structures. For example, the theoretical advantages of hopscotch hashing rest on data locality concerns. On a hypothetical machine for which memory access always takes the same amount of time, I believe - but have not bothered to prove - that hopscotch hashing and other open addressing techniques should actually be slower than other forms of collision resolution.
As for how it affects the question of imperative vs. functional data structure performance, it should tend to favor the imperative ones. The issue is that purely functional data structures aren't allowed to just grab a block of memory all at once and insert data into it at leisure. (That would be a form of mutation.) Instead space has to be allocated as it is needed, in a piecemeal fashion. As a result, longer-lived and more active purely functional data structures will tend to scatter across the heap over time.
Claims about data structures in the context of how they're allocated in a modern machine architecture / memory model are not the same thing as claims regarding the semantics that are exposed to the user of a programming language.
I think you do raise an interesting question, namely what's the layout in memory of a tree-like data structure in e.g. Scheme or Haskell, when the tree is built from scratch, but also when it's been updated and a garbage collection run has compacted the data. Oooo, I totally want to investigate this now! (I don't have the time to, but I totally want to.)
Point being, depending on the memory-management semantics for heap data in a programming language, any claimed locality for tree data may or may not actually happen.
This actually raises a really beautiful question: what is the complexity of any garbage collection algorithm that exactly or approximately maximizes the locality of tree or directed-graph-like data structures in memory? I want to speculate that perhaps the data structures which people find fast in practice are actually in some sense enabling the GC to preserve locality. I wonder how optimal the layout can be when only linear or near-linear time algorithms are considered!
But I'm sure you'll also agree that a freshly allocated tree probably hasn't been alive for long and hasn't seen much activity. :)
It clearly has some relation to optimal caching problems, but with more explicit interdependencies. It's also seemingly both an online and an offline problem in terms of how to analyze it, but I need to think on it more.
It's a bit more complicated than that. To talk about runtime (and memory) of an algorithm or a data structure, you need to have a machine model. If you look at what happens in, say, Haskell: We write our algorithms in a purely functional way, but we are interested in their runtime after compilation into native code. The compiler is free to do any transformation that preserves correctness, so if it's smart enough, it might grab a block of memory all at once. Or if it can prove that data won't be used again, it can safely change variables in place. (The language Clean even has type-system support for that kind of mutation; see https://en.wikipedia.org/wiki/Linear_type_system.)
You might be able to analyse functional algorithms with a machine model that doesn't involve compilation to some form of imperative code. Perhaps by counting reductions in lambda calculus. But then it's hard to compare that to asymptotic performance of imperative algorithms.
That, and in a sense it's side-stepping the theoretical issue more than anything else. A compiler that's smart enough to find spots where a performance improvement can safely be gained by replacing a pure structure with a mutable one in the background doesn't really counter the idea that impure structures can be faster so much as acknowledge it.
(For an example of the distinction--straight out of Okasaki's book: the standard purely functional queue implementation as two linked lists has O(1) amortized runtime for adding and removing only if used 'single-threaded' / ephemerally. As a persistent data structure you get a worst case of O(n).)
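Okasaki's two-list queue is easy to sketch; here's a rough Python rendering with immutable tuples (function names and details are mine, not the book's code):

```python
# Sketch of the classic two-list purely functional queue (after Okasaki).
# A queue is a pair (front, back); invariant: front is empty only if back is.

def empty():
    return ((), ())

def check(front, back):
    # Restore the invariant: if front is exhausted, reverse back into front.
    # This reversal is the O(n) work that amortization has to pay for.
    if not front:
        return (tuple(reversed(back)), ())
    return (front, back)

def snoc(q, x):
    """Enqueue: O(1) amortized under ephemeral use."""
    front, back = q
    return check(front, (x,) + back)

def head(q):
    return q[0][0]

def tail(q):
    front, back = q
    return check(front[1:], back)

# Ephemeral ("single-threaded") use: each version is consumed exactly once.
q = empty()
for i in range(5):
    q = snoc(q, i)
out = []
while q != empty():
    out.append(head(q))
    q = tail(q)
print(out)  # → [0, 1, 2, 3, 4]
```

If instead a client keeps the version just before a reversal and calls `tail` on it repeatedly (persistent use), that O(n) reversal is repaid on every call, which is exactly the O(n) worst case the parenthetical mentions.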
In a way that kind of dependence isn't new. If you look at old versions of Knuth's TAOCP, for many algorithms he has separate analyses for the in-RAM case and the on-tape case, using two different memory models, and in some cases which algorithm is better differs a lot between the two cases. But it may be that algorithms reference books are doing a worse job keeping up now than they used to.
I think aside from things like cache coherency, memory models and things like SIMD, computer scientists will have to think of their algorithms (or at least cover it when they write books for undergrads or professionals) in terms of commutativity and associativity so people can see how parallelisable an algorithm is or can be.
which do fully analyse the hardware.
The important takeaway I think, is that O(log n) approaches O(1) for all practical values of n.
To the downvoters: gee golly, would it hurt you to reply to my comment? If you think I'm wrong then say why!
Also if memory is O(log n) then arrays are O(log n) but BSTs are O((log n)^2).
(Edit II: Let's be clear that BSTs are already O((log n)^2) if you're counting CPU cycles with O(1) memory access. They're O(log n) if you count cache misses or comparisons; or you might say they're O((log n)^2) either way, but the first log is base 2 and the second is base 2^256, so only one really counts. (The second manifests as a cache-line miss if your key comparison walks past the first 32 bytes of your key.))
(Actually memory access is at best O(n^(1/3)) thanks to space being three dimensional, but really worse thanks to cooling concerns, and uh, for sufficiently large n, worse because of gravity.)
Why yes, it is worse in some theoretical manner. But if you think it is necessarily worse as a practical manner, you have failed to understand the math fully.
Suppose that n is 1000 in some toy problem. Then you scale it up to a huge data set, a billion items! The log(n) factor only got worse by a factor of 3. Maybe we didn't go high enough. Perhaps we should take a larger data set. Let's see, how about all of the data produced at CERN's Large Hadron Collider in a year? That's 15 petabytes. Now the log(n) factor is a whopping 5.39.
In other words while log(n) represents unbounded possible growth, it gets worse in practice by at most a fairly small constant factor.
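The arithmetic above is easy to check (treating the 15 petabytes as 15 × 10^15 items for the sake of the ratio):

```python
# Ratios of log(n) factors relative to the n = 1000 toy problem.
import math

toy = 1_000
big = 1_000_000_000       # a billion items
cern = 15 * 10**15        # ~15 petabytes, counted as items

ratio_big = math.log(big) / math.log(toy)
ratio_cern = math.log(cern) / math.log(toy)
print(round(ratio_big, 2), round(ratio_cern, 2))  # → 3.0 5.39
```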
How about O(1)? O(1) does not mean fast. It just means constant relative to the data size. It can be a bad constant, but as long as it is a constant it qualifies as O(1).
Let's take a practical example. Wavelets are cool in lots of ways, but one of the properties people noticed quickly is that doing a basic wavelet transform on a data set with n elements takes time O(n). Doing a FFT (Fast Fourier Transform) on the same data set takes time O(n log(n)). Yay, we're faster!
But hold on a minute. The interesting wavelets that we like to use are O(n) with a worse constant than the FFT. So the FFT tends to be faster on practical data sets. (There are lots of reasons why one would prefer wavelets over the FFT, but speed of calculation is not generally one of them.)
Congratulations, I don't think one algorithm is necessarily worse, your entire post was a waste of effort.
I mean seriously you just made a post teaching about how logarithmic constants can be low enough that they don't matter in reply to my post which provided an example of that very thing.
If you understood this, then why did you start off your post arguing that O(log(n)) was worse than O(1)?
Incidentally, comparing a logarithmic lookup to a hash lookup: a lot of NoSQL solutions use hash tables because O(1) is good, right? For instance, look at Apache Cassandra. By contrast, Google's BigTable uses an ordered data structure which is logarithmic.
Guess what, if you've used both, BigTable is better. Because the extra log cost is irrelevant for the normal use case, and being ordered reduces the operational cost of things like range searches.
Yes, even pure Haskell has unboxed, mutable O(1) arrays. They're just a bit hidden from the beginner.
I sort of view lazily-evaluated data structures (like generator expressions) as not data structures at all, but rather control flow constructs. They control when, how, and if expressions are evaluated and results are returned. As a result, I feel that using lazy equivalents for your data structures qualifies as an algorithmic change. In fact, one could manually code lazy data access in an eager language, and the resulting algorithm would look very different from the original.
On the other hand, the laziness is enabled by the purity of the algorithm, so... well, it depends on your perspective. You could say that laziness makes the functional equivalent just as fast, or you could say that the functional equivalent is slower, but there's a different functional algorithm that is just as fast.
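In Python terms (my example, not the parent's), the "lazy structure as control flow" view looks like this: the generator is nominally a sequence, but what it really encodes is when and whether evaluation happens.

```python
# A generator expression as a "data structure" that is really control flow.
import itertools

def squares_eager(xs):
    return [x * x for x in xs]     # evaluates everything now

def squares_lazy(xs):
    return (x * x for x in xs)     # evaluates nothing until consumed

# The lazy version works even on an unbounded input, because evaluation
# is driven by the consumer, not the producer:
naturals = itertools.count(0)
first_three = list(itertools.islice(squares_lazy(naturals), 3))
print(first_three)  # → [0, 1, 4]
```

Calling `squares_eager` on `naturals` would never terminate; swapping in the lazy version is exactly the kind of "algorithmic change" the comment above describes.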
 Except they're still data structures. They're like, a quantum superposition of data structure and control-flow operator. Which one it is depends on how you look at it.
I don't think anyone is claiming that it is the same algorithm. Rather, they're claiming that using lazy evaluation gives you more choice of algorithms; and one of these algorithms is faster than the best algorithm under strict evaluation. I haven't been able to download Bird et al.'s paper, but I wouldn't assume it's simply "implementing the same algorithm in Haskell instead of strict ML".
Nobody advocates purely functional programming, at least not in the implementation of a language itself, but it can be an incredibly valuable model for a language interface. In the same way, a stack (or better yet, a semi-infinite tape) can be an incredibly valuable model for a language interface as well. Functional programming is, by definition, just another way of saying that your model of computation is based on the lambda calculus instead of a Turing machine. You can just as easily ask what the asymptotic 'cost' of using a non-random-access Turing machine is.
If people really want a functional language with a purely functional implementation, then they have to say 'farewell' even to things like tail recursion, since you can make the argument that converting recursion to a loop is anti-lambda-calculus.
EDIT: It seems there's some confusion over the distinction between implementing a purely functional algorithm in a language and implementing a language in a purely functional manner. My point is that, even if you implement your algorithm in a purely functional manner, if your compiler is simply converting it to (well-optimized) imperative, destructive code behind the scenes and making it appear to be immutable, then this entire question is rather beside the point.
 (With apologies to Simon Peyton Jones).
If we want to get really pedantic, I guess we could say that we're using one Turing machine (our computer) to emulate another equivalent model of computation (the lambda calculus), so the computer warms up because the Turing Machine is an inefficient emulator of the instantaneously-evaluating lambda calculus evaluator. But the lambda calculus evaluator doesn't change state.
...so, what we really need is to wait for someone to invent a lambda calculus evaluator. Then we wouldn't need to emulate them with these silly Turing machines, and we could get instant evaluations of our programs based solely on β-reductions, with no side-effects (thermal side-effects included). That would put an end to this debate!
(Furthermore, who cares if P=NP if β-reductions can be evaluated 'directly' instead of being emulated by these obsolete Turing machines? Even the hardest decidable problems would be solved instantaneously!)
Here's hardware that does the first: http://www.cs.york.ac.uk/fp/reduceron/
Pedantry is infinitely recursive, with no base case in sight and no tail optimization - my tolerance/stack for these things overflows rather quickly. :-)
But thanks for the link! I'll be sure to check it out later; it looks interesting.
Precisely. For example, you can ask whether the best algorithm for some problem on a 1-tape Turing machine is asymptotically slower than the best algorithm on a 2-tape Turing machine.
And at the end of the day, my original point stands: we're talking about writing purely functional code, but if the code is being compiled by a compiler that takes advantage of non-functional code optimization (and as far as I know, nearly all general-purpose compilers do to some degree), then it doesn't make sense to compare the functional code vs. non-functional code. Even if it appears that data structures are immutable, if the compiler is mutating them behind-the-scenes, it all depends on how well-optimized the compiler is, and that's a very different discussion than the one implied by the title.
You can describe the performance of a functional program in something like asymptotic number of reductions. You can write compilers that will run those programs on real hardware in time related to the performance bound you get by working in the source model. Also, perfect optimization is undecidable. So, the question is whether there are problems where you necessarily lost asymptotic performance by working with a purely functional programming language.
In the general case, not necessarily for any subset, and this would apply to non-functional algorithms as well. Just because I write a 'simple' for/while loop, that doesn't necessarily imply anything about the actual machine code that gets executed.
At the end of the day, an algorithm is simply an unambiguous, executable, and terminating specification of a way to solve a problem. We may implement an algorithm by creating a physical or virtual machine that conforms to that specification, but even abstractions like Turing machines and lambda calculi are just attempts at creating a deterministic model that we can manipulate in order to understand the algorithm, which is an abstract specification.
The problem is that in general, algorithms are not specified in perfect detail, or they are implemented in ways that are logically equivalent, but not exactly identical, to the specification. (Or oftentimes both.) When dealing with any specific algorithm, we can usually infer that analyses of one will translate to the other, but in the general case, this assumption is murky.
The difference between functional and non-functional for an algorithm happens at the level of the specification, which is different from the level of the implementation. The translation from one to the other involves optimization. And if perfect optimization is undecidable, as you point out, so is the question of whether a perfectly optimized implementation of one class of algorithms is 'better' than a perfectly optimized implementation of another class, since we're no longer dealing with individual instances.
Lazy evaluation is great in that the actual computation might not be needed in the end, but there are non-trivial costs to track and evaluate all the thunks.
It's an extra dimension to consider when comparing the cost of doing simple computation with strict evaluation vs utilizing lazy evaluation with the added thunking cost.
Something may look like it is O(1) but in reality it may be much slower. The only way to find out is to put the program through its paces with various values for 'n', and to measure the time it took to compute the answer.
Prepare to be surprised.
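A minimal sketch of that kind of measurement in Python (my own harness; the absolute numbers are machine-dependent, so treat the printout as illustrative rather than as a benchmark):

```python
# Time the same lookup at several values of n rather than trusting the
# big-O label: an "O(1)" hash lookup vs an "O(log n)" binary search.
import bisect
import timeit

for n in (1_000, 1_000_000):
    data = list(range(n))
    lookup_set = set(data)
    target = n - 1
    t_hash = timeit.timeit(lambda: target in lookup_set, number=10_000)
    t_tree = timeit.timeit(lambda: bisect.bisect_left(data, target), number=10_000)
    print(f"n={n}: 'O(1)' hash {t_hash:.4f}s vs 'O(log n)' bisect {t_tree:.4f}s")
```

Whichever wins on your machine, the gap between n = 1,000 and n = 1,000,000 is far smaller than the asymptotic labels might suggest.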
I think we can agree that:
1) you can write most software in a functional programming language without any (noticeable) loss of performance
2) there will always be software (video encoders, HPC stuff) that will always be done in Fortran/C :D
However, I think the question is very interesting from a theoretical point of view: if you model imperative algorithms as programs for a RAM machine (a good theoretical approximation of real single-core computers) and purely functional algorithms as programs for a TBI machine (a good theoretical approximation of a hypothetical machine executing pure algorithms), is there a problem for which a faster solution exists for a RAM than for a TBI?
Note. RAM machines exist and are standard in literature; to my knowledge TBI (to-be-invented) machines do not (yet) exist.
> 2) there will always be software (video encoders, HPC stuff) that will always be done in Fortran/C :D
Alas, yes. But there are efforts underway to challenge that. Theoretically, your compiler should be able to do a better job, if it knows more about your intent. I'm learning about some techniques for really high performance Haskell. E.g.:
* Iteratee/Enumeratee: http://okmij.org/ftp/Streams.html
* Stream fusion: https://donsbot.wordpress.com/2008/06/04/haskell-as-fast-as-...
The C equivalent of stream fusion would be loop fusion: if you have two for loops one after another iterating over the same array, you can fuse them into one loop. That same trick works in Haskell, even more often and with more benefits.
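A manual Python rendering of the fusion idea (illustrative only; GHC's stream fusion does the Haskell version automatically):

```python
# Two passes over the same array, fused into one pass with no
# intermediate allocation. Both produce identical results.

def unfused(xs):
    ys = [x + 1 for x in xs]          # first loop: allocates an intermediate list
    return [y * 2 for y in ys]        # second loop over the intermediate

def fused(xs):
    return [(x + 1) * 2 for x in xs]  # one loop, no intermediate

data = list(range(5))
print(unfused(data))  # → [2, 4, 6, 8, 10]
print(fused(data))    # → [2, 4, 6, 8, 10]
```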
Lots of algorithms maintain mutable state. When these algorithms are translated into an immutable style, their complexity (big-O) shoots up. The SO question is asking what the increased complexity is when moving to a PURE functional programming style.
Equivalently, I can write an x86 emulator in a functional language whose time per instruction is independent of the amount of memory used. Then I can just compile an imperative program and run it on my emulator without any asymptotic slowdown.
How, you ask? Either (a) my language provides me O(1) functional arrays (e.g. Mercury does this), or (b) I use an O(log n) structure (such as a binary tree): since n is bounded by the size of physically addressable memory, this reduces to O(1). (See also rwmj's comment about memory access times.)
The real question to ask is, what is the asymptotic cost of programming in a language without an O(1) array structure? (A property which unfortunately has falsely come to be associated with purely functional programming, due to early functional languages being merely naïve interpreters of the lambda calculus.) The answer to this question of course is likely O(log n).
Uh, no. By definition, an O(log n) structure takes, on average, log n operations to get to the node you want to get to. That never reduces down to constant time, because any given access will take log n reads to get to the correct place.
Also, keep in mind that O(1) array lookup isn't really constant time. It's still variable: depending on where the information actually is stored (i.e. in the L2 cache, RAM, on HDD, etc.), the practical time can vary from double-digit nanoseconds to several milliseconds. Cache misses can be hugely expensive.
Big-O notation is for asymptotic upper bounds, not the average case. That is, it's a guarantee that using the algorithm will never take longer than the function given. It is not a statement of the average case. In order to do that, you must make some analysis on the distribution of the inputs.
> By definition, an O(log n) structure takes, for most inputs, no more than log n operations to get to the node you want to get to.
However, from a CS perspective, I still think that it's important to stress that the inputs do matter.
Big-O is not intended to be a worst-case analysis on running time. There are a lot of assumptions that go into the analysis on the inputs.
For example, it's generally accepted that binary search tree lookup is O(log n). However, if you have a very badly unbalanced tree, the lookup time could be O(n).
The only binary search trees that people use in practice are balanced. That is, the algorithms that control insertion and deletion in the tree guarantee that the tree is always balanced, and they themselves run in O(log n) time. In my experience, most trees used in practice are in fact red-black trees, but a much easier to understand balanced binary tree is an AVL tree. So when we say that a red-black tree has O(log n) lookup, that is a guarantee across all inputs. There is no input to a red-black tree that will yield a linear lookup time.
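A quick Python sketch of the balance point (my own illustration, using a plain unbalanced BST rather than a red-black or AVL tree): the same 127 keys give depth 127 or depth 7 depending only on insertion order.

```python
# A deliberately naive, unbalanced BST: insertion order determines depth.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def depth(root):
    return 0 if root is None else 1 + max(depth(root.left), depth(root.right))

def median_order(keys):
    # Yield the median first, then recurse on each half:
    # produces a perfectly balancing insertion order for sorted input.
    if keys:
        mid = len(keys) // 2
        yield keys[mid]
        yield from median_order(keys[:mid])
        yield from median_order(keys[mid + 1:])

keys = list(range(127))
worst = None
for k in keys:                 # sorted order: degenerates to a linked list
    worst = insert(worst, k)
best = None
for k in median_order(keys):   # median-first order: perfectly balanced
    best = insert(best, k)
print(depth(worst), depth(best))  # → 127 7
```

Self-balancing trees like red-black or AVL trees exist precisely to rule out the first outcome regardless of input order.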
O(f(n)) means that an operation takes less than k·f(n) time to complete (where k is some constant) for all sufficiently large n.
I predicated my assertion on bounded memory, hence n is constant. Therefore O(log n) means that the operation is bounded in time by k log n. But since n is also a constant, this is equal to some other constant k'. This is equivalent to saying that the operation is bounded by O(1).
I can see why it's especially tempting to make that jump in the case of O(log n), but that's just not the way this type of analysis is performed - a more conventional way of stating your point would be to say "yes, it's theoretically worse than O(1) but for practical purposes the constant factors matter far more".
You're fundamentally misunderstanding what big-O notation means. You cannot say that since n has an upper limit, everything is constant. By your argument, the actual speed of an algorithm doesn't matter, period, since the sample size is bounded, which means that everything evaluates to a constant.
Big-O notation is used to give a ballpark estimate of an algorithm's efficiency. In this case, the metric being used is the number of data fetches the CPU needs to get to the specific location that it's looking for.
O(1) means that the number of possible CPU instructions is constant, no matter how large n is. So a static array lookup is O(1) because, no matter how large n is, the CPU only has to do a fetch. That's 1 instruction, and while the actual time is variable due to cache misses, the total number of instructions is constant.
O(log n) means that it will take log n operations to get the result. Binary trees are a great example. Assuming a perfectly balanced BST, the worst possible number of lookups (i.e. to the farthest leaf) will be log n different fetches. If you have a system that creates 127 buckets, the worst possible number of lookup operations is 7, which is log base 2 of 128.
No it's not. Big-O notation is used to denote asymptotic running times. The OP's question is invalid because his assumption of O(1) array accesses is only valid on hardware with finite memory size, which makes what is actually a O(log n) or O(n^(1/3)) memory read/write appear to be O(1).
I made the same invalid assumption and came to the same invalid conclusion as the OP (i.e. that language X has O(1) array accesses). You can't deny me this assumption (and therefore conclusion, which I reached logically) without denying it to the OP as well.
O(1) means that the number of possible CPU instructions is constant, no matter how large n is.
This is simply false. If n is larger than 2^32, the number of instructions (and more importantly, time) grows immensely.
Now if on the other hand you want to posit an infinite O(1) memory: OK, I posit a lambda calculus reducer which can treat infinitely sized terms as a single redex. Bam, I've got O(1) array accesses in lambda calculus (and thus functional programming) now.