

Arrays considered somewhat harmful - jonpaul
http://blogs.msdn.com/b/ericlippert/archive/2008/09/22/arrays-considered-somewhat-harmful.aspx

======
jasonwatkinspdx
I strongly disagree with many of the claims in this post:

 _"An array’s whole purpose is to be a mass of mutable state. Mutable state is
hard for both humans and compilers to reason about. It will be hard for us to
write compilers in the future that generate performant multi-core programs if
developers use a lot of arrays."_

How does this conclusion follow from the premise? It also completely ignores
that arrays are the natural expression of data in data-parallel programs.
Nearly all high-performance parallel numerical code leverages arrays heavily
(OpenMP, MATLAB, various GPU languages).

 _"If you are writing such an API, wrap the array in a ReadOnlyCollection <T>
and return an IEnumerable<T> or an IList<T> or something, but not an array.
(And of course, do not simply cast the array to IEnumerable<T> and think
you’re done! That is still passing out variables; the caller can simply cast
back to array! Only pass out an array if it is wrapped up by a read-only
object.)

That’s the situation at present. What are the implications of array
characteristics for the future of programming and programming languages?"_

This appears to be a fair criticism of the mechanics of arrays in C#, but I
don't see how it generalizes to programming languages in general. For example,
a language might use some sort of copy-on-write approach to preserve
single-copy performance when leaving the array unchanged is the common case.
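
To sketch what I mean (CowArray is a hypothetical type, not anything C#
offers today; a real version would also need reference counting and thread
safety, which are elided here):

    // A minimal copy-on-write sketch: reads share the original buffer;
    // the first write pays for a private copy.
    public sealed class CowArray<T>
    {
        private T[] buffer;
        private bool owned;   // false while the buffer may be shared

        public CowArray(T[] source)
        {
            buffer = source;
            owned = false;
        }

        public int Length { get { return buffer.Length; } }

        public T this[int i]
        {
            get { return buffer[i]; }
            set
            {
                if (!owned)
                {
                    buffer = (T[])buffer.Clone();   // copy only on first write
                    owned = true;
                }
                buffer[i] = value;
            }
        }
    }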

 _"Clock speeds have stopped increasing, transistor density has stopped
increasing"_

AFAIK transistor density is still tracking Moore's law and is expected to do
so for quite some time (see SIA roadmaps).

 _"Parallelizing in a world with observable side effects means locks"_

This ignores a whole class of wait-free data structures. Some of the most
interesting parallel language tools are built around such structures. For
example, Cilk is some extremely smart code centered on wait-free deques.
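
To make that concrete, here is a minimal sketch of a Treiber-style stack in
C# (strictly speaking lock-free rather than wait-free, but it makes the point
that shared mutable state need not mean locks):

    using System.Threading;

    // Push and pop synchronize with a single compare-and-swap; no locks.
    public sealed class LockFreeStack<T>
    {
        private sealed class Node
        {
            public T Value;
            public Node Next;
        }

        private Node head;

        public void Push(T value)
        {
            var node = new Node { Value = value };
            Node snapshot;
            do
            {
                snapshot = head;
                node.Next = snapshot;
            }   // retry if another thread moved head underneath us
            while (Interlocked.CompareExchange(ref head, node, snapshot) != snapshot);
        }

        public bool TryPop(out T value)
        {
            Node snapshot;
            do
            {
                snapshot = head;
                if (snapshot == null) { value = default(T); return false; }
            }
            while (Interlocked.CompareExchange(ref head, snapshot.Next, snapshot) != snapshot);
            value = snapshot.Value;
            return true;
        }
    }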

 _"The “C” programming language is all about mechanisms. It lays bare almost
exactly what the processor is actually doing, providing only the thinnest
abstraction over the memory model."_

This is a common misconception that leads to many bugs and performance
problems. Modern compilers make dramatic transformations of programs. The
connection between what we assume based on the C abstract machine and what
the processor actually does is rather complex. My favorite example of this is
how copying pointer arguments into temporary variables at the top of a
function body actually generates shorter, faster code, because it tells the
compiler that aliasing between the pointers is impossible. If you take the
misguided "my C is close to what the processor does" perspective, you'd
incorrectly assume these assignments result in a lot of wasted work.
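
The claim is about C, but the same effect can be sketched in C#'s unsafe
subset (hypothetical functions; compile with /unsafe; how far a given
compiler exploits this varies):

    static class AliasDemo
    {
        // The compiler must assume data[i] and *scale may alias, so *scale
        // is reloaded from memory on every iteration:
        static unsafe void Scale(double* data, double* scale, int n)
        {
            for (int i = 0; i < n; i++)
                data[i] *= *scale;
        }

        // Copying the argument into a local rules the aliasing out, so the
        // value can live in a register: the "extra" assignment yields
        // shorter, faster code, not wasted work.
        static unsafe void ScaleFast(double* data, double* scale, int n)
        {
            double s = *scale;
            for (int i = 0; i < n; i++)
                data[i] *= s;
        }
    }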

I'm not even a C programmer and I know this. I'm quite tired of seeing this
conceptual mistake from otherwise extremely knowledgeable people.

 _"Arrays demand imperative code, arrays are all about side effects, arrays
make you write code which emphasizes how the code works, not what the code is
doing or why it is doing it."_

Entirely incorrect. For example, one can elegantly express many parallel
algorithms as binary recursion with an associative combining operator over
trees of arrays. This is a very declarative way of programming that is both
elegant and maximally expressive of parallelism. See the various talks by
Guy Steele for a very clear exposition of this perspective.
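
A minimal C# sketch of that style, using plain summation as the associative
operator (TreeReduce and Threshold are illustrative names):

    using System.Threading.Tasks;

    static class TreeReduce
    {
        private const int Threshold = 4096;   // below this, loop sequentially

        // Binary recursion with an associative combiner: the code states
        // the shape of the computation, and the two halves can run in
        // parallel.
        public static long Sum(int[] a, int lo, int hi)
        {
            if (hi - lo <= Threshold)
            {
                long acc = 0;
                for (int i = lo; i < hi; i++) acc += a[i];
                return acc;
            }
            int mid = lo + (hi - lo) / 2;
            var left = Task.Factory.StartNew(() => Sum(a, lo, mid));
            long right = Sum(a, mid, hi);
            return left.Result + right;   // associativity is all we rely on
        }
    }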

In summary, I'm impressed by the poster's knowledge of C# and how deeply he's
considered this problem, but I think at the root of his thinking is an
assumption that nearly all languages will be similar to C#. I disagree, and I
think languages that center on nested data parallelism over arrays will play a
major role in the future of programming.

~~~
timrobinson
The languages that use arrays efficiently for parallel programming (i.e. GPU
languages) aren't general-purpose programming languages; programs in those
languages are written with data parallelism in mind. C# is a general-purpose
imperative language; it's very hard to take a program in a language such as
C#, where mutation is interspersed throughout the code, and parallelise it
automatically.

 _"I'm impressed by the posters knowledge of C#"_ \- Eric Lippert designed and
implemented large parts of the last couple of versions of the C# language and
compiler.

~~~
jasonwatkinspdx
_"The languages that use arrays efficiently for parallel programming (i.e. GPU
languages) aren't general-purpose programming languages"_

I disagree. Typically these languages are supersets of general-purpose
languages like C. That is, they are C code with additional annotations
(OpenMP), or use annotations plus some initialization API (GPU languages). On
top of that, there are purely array-based general-purpose languages (APL, J,
K, NESL, etc.).

 _"it's very hard to take a program in a language such as C#, where mutation
is interspersed throughout the code, and parallelise it automatically."_

I agree, with qualifications. It's not so much the mutation that is at fault
as the over-specification of dependence. For example, map() makes clear that
the iteration is data parallel, so it is trivial to convert map() to pmap().
for() says nothing about the iteration being data parallel, so to convert it
to par() the compiler must do an analysis, which depending on the program may
require alias analysis. While that may be possible, and perhaps even easy in
common cases, doing it in general requires quite complex analysis.
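
In C# terms (PLINQ, which shipped with .NET 4), the contrast looks roughly
like this (MapDemo is a hypothetical container):

    using System.Linq;

    static class MapDemo
    {
        static void Demo(int[] xs)
        {
            // map: the iteration is data parallel by construction...
            var squares = xs.Select(x => x * x).ToArray();

            // ...so converting map to pmap is one extra call:
            var psquares = xs.AsParallel().Select(x => x * x).ToArray();

            // for: nothing says the iterations are independent; a compiler
            // must prove it (possibly via alias analysis) to parallelize.
            var ys = new int[xs.Length];
            for (int i = 0; i < xs.Length; i++)
                ys[i] = xs[i] * xs[i];
        }
    }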

In any case, I think it's perfectly possible to create a parallel mutable
language. In fact, the agent model readily encapsulates mutability in a
parallel and distributed way. So while I agree with the critique in the
context of C#, I don't agree with how the author generalizes it to the topic
of languages as a whole.

------
Aaronontheweb
For .NET devs: Always use something that implements IList instead of a
primitive array. As Eric points out, you can make it read-only and avoid the
allocation frenzy that occurs under the hood when you call methods that
return defensive copies of primitive arrays.
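
A minimal sketch of the pattern (TweetCache is a hypothetical class;
Array.AsReadOnly has been in the framework since .NET 2.0):

    using System;
    using System.Collections.ObjectModel;

    class TweetCache
    {
        private readonly string[] tweets = { "a", "b", "c" };

        // Callers get a read-only view over the same buffer: no defensive
        // copy per call, and no cast back to the underlying array. Cache
        // the wrapper if the property is hot.
        public ReadOnlyCollection<string> Tweets
        {
            get { return Array.AsReadOnly(tweets); }
        }
    }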

You can also inherit from any of the IList / IEnumerable / ICollection types
and make customizations to better support parallelism, memory usage, etc...
I've done implementations on top of IQueue (a cousin of IList) before to
support custom Reader/Writer scenarios for a parallelized polling service (in-
memory job queue) and it worked well.

------
Chris_Newton
Arrays are a perfectly good tool when you need a fast, random access, mutable
data structure with dense indices.

When any of those qualifiers is not mandatory, some form of indirection-based
data structure will probably be advantageous in some way, such as offering
persistence, more flexible index types, more efficient access patterns, or
lower risk of incorrect usage.

In practice, most situations do _not_ require all of those qualifiers.
However, the culture around many programming languages still promotes arrays
as the default choice, rather than encouraging developers to think about
alternatives and choose the most appropriate one in context. It's not arrays
that are harmful, it's using them when they're the wrong tool for the job.

~~~
Retric
What data structure has a more efficient access pattern?

People default to arrays because 95% of the time they are either the best
option or good enough that it's not worth worrying about.

~~~
Chris_Newton
> What data structure has a more efficient access pattern?

That depends on what kind of access you will need.

For example, if you are going to be doing a lot of searching to determine
whether a value is an element of a set, an array will take O( _n_ ) in the
number of elements to do a linear search, while various self-balancing tree
structures will be O(log _n_ ), and a hash table with a good choice of hash
function will be even better. Unless you are working with _very_ small data
sets, arrays will rapidly lose out here.

If we're talking about a data structure that gets modified over time, then
suppose you need to store a sequence of values in order, and you will often
insert or remove elements in the middle of the sequence. A linked list or
similar structure can perform these operations for O(1) cost, but an array is
O( _n_ ) in the number of elements.
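
To make the comparison concrete, a toy sketch with .NET types standing in for
the general structures (hypothetical data):

    using System;
    using System.Collections.Generic;

    static class MembershipDemo
    {
        static void Main()
        {
            int[] values = { 5, 3, 8, 1, 9 };

            bool viaScan = Array.IndexOf(values, 8) >= 0;   // O(n) per query

            var set = new HashSet<int>(values);    // built once in O(n)
            bool viaHash = set.Contains(8);        // O(1) expected per query

            var tree = new SortedSet<int>(values); // red-black tree inside
            bool viaTree = tree.Contains(8);       // O(log n) per query

            Console.WriteLine("{0} {1} {2}", viaScan, viaHash, viaTree);
        }
    }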

Of course, all data structures are ultimately kept in one big array of
memory/drive, so you could argue that the array can be just as good as the
other data structures in the examples above. However, I think that is a
circular argument: you're just reinventing the alternative data structures.

> People default to arrays because 95% of the time they are either the best
> option or good enough that it's not worth worrying about.

Perhaps. Then again, perhaps this attitude of declaring the default, easy tool
"good enough" instead of learning to use the right tool for the job is why in
2011 it still takes nearly a minute for my PC to boot in the morning and
sometimes I still type faster than my state-of-the-art DTP software can manage
to draw the letters on the screen.

~~~
Retric
You're making some major assumptions. Big O is great when dealing with large
data sets, but beware the k.

Deleting random elements from a linked list is an O(n) operation because you
need to find the element(s). They also waste a fair amount of memory.

Determining whether a value is an element of a set is an O(log n) operation
on an ordered array, and faster than a self-balancing tree because of cache
locality issues.

Small hash tables are slower than arrays because of that hash function. (I
love hash tables, but they make a poor default choice.) Edit: Set operations
were not what I would call an "access pattern," but it's a valid point.

PS: Buffers and bitmaps are both well suited to an array. You can't use a
self-balancing tree in a hard real-time system. And so it goes: when arrays
don't suit your needs there are great options, but those cases are less
common than you might think.

~~~
Chris_Newton
> Big O is great when dealing with large data sets but beware the K.

Sure, you always have to be careful of the constant factor in reality. That's
why I noted that arrays will typically lose out to alternatives like trees and
hash tables _unless_ you are working with very small numbers of elements.

Still, logarithmic beats linear very quickly as the amount of data increases.
Consider that 2^10=1024, so by the time you have 1,000 or so elements to
consider, you need a _k_ that is about two orders of magnitude better for a
linear scan to beat a binary tree.

For hash tables, it mostly comes down to how long it takes to compute the hash
function, which in general should be a constant factor itself.

> Deleting random elements from a linked list is an O(n) operation because you
> need to find the element(s).

Why would you want to delete a _random_ element? You're moving the goalposts.

If you really did need that functionality, then a simple linked list wouldn't
be the data structure you wanted. Use the right tool for the job instead.

> They also waste a fair amount of memory.

They "waste" a small number of pointer-sized chunks for each element. If you
have so much data that this might matter, you're probably running on hardware
with silly specifications so it doesn't and/or using a different approach to
scaling anyway.

> Determining whether a value is an element of a set is an O(Log(N)) operation
> on an ordered array.

Sure it is, after you've done your O( _n_ log _n_ ) sort. You're moving the
goalposts again.

> And faster than a self-balancing tree because of cache locality issues.

Cache locality is certainly a valid concern in some applications, particularly
with recent generations of PC processors/memory architectures. Once again, you
should pick the right tool for the job.

In this case, the basic problem is that fetch time dominates comparison time.
This is something database engineers have been learning to deal with since
forever, and block-aware data structures like B-trees have been known for 40
years.

> Small Hash tables are slower than arrays because of that hash function.

Are you sure? Consider this: if you wanted to sort an array of 1,000,000
numbers between 1 and 100 as fast as possible, how would you do it?
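
The answer I'm fishing for, sketched below, is a counting sort: the array
itself plays the part of the hash table, with the identity function as its
hash.

    // Values are known to lie in 1..100, so an array of counts sorts in
    // O(n) time regardless of how many elements there are:
    static int[] CountingSort(int[] input)
    {
        var counts = new int[101];           // index == value, 1..100
        foreach (int v in input)
            counts[v]++;                     // tally: O(n)

        var output = new int[input.Length];
        int k = 0;
        for (int v = 1; v <= 100; v++)       // write back: O(range + n)
            while (counts[v]-- > 0)
                output[k++] = v;
        return output;
    }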

But yes, if you really are within the constant factor required for the hash
function, then a hash table is probably not a good choice.

> And so it goes, when arrays don't suit your needs there are great options,
> but they are less common than you might think.

Don't tell that to anyone who works with list-based programming languages!

~~~
Retric
As I said, there are plenty of cases where arrays suck.

However, arrays of 1,000,000 elements, while not small, are generally sorted
so fast in memory that O(n log n) is fine. I could write a custom hash
function that would store the counts in an array ;) but it's not something I
want to maintain, so the fallback for numbers outside the range would be a
sort.

"Sure it is, after you've done your O(nlogn) sort." Creating a binary search
tree is also an O(n log n) operation...

Binary search trees win when you want a constantly updated list that you also
search, but that's a smaller case than just looking for elements in sets. For
example: if you start with an unordered list and want to look up 50 elements
to see which ones are in it, you are better off with a linear search through
an array, unless you can sort the list while waiting for those 50 elements.

PS: I am bending over backward to support that 95%, but when given all the
constants it is surprising how often an array works just as well as or better
than the "optimal" data structure. Generally speaking you want large data
sets in a database somewhere; locally you rarely deal with large data sets,
and when you do it's often a temporary thing or a lookup table.

------
cosgroveb
I like how "Arrays Considered Somewhat Harmful" becomes "Arrays are Harmful."
Do away with those weasel words to make your title more catchy! (not link-
baity, no no).

~~~
jonpaul
As the submitter and not the author, I have nothing to gain by sharing the
article. But I see your point. I was thinking of the point where he mentions
not having any problem that a fixed-size mutable collection models well; to
me that translated to "arrays are harmful," and "somewhat" then seemed
pointless.

~~~
jbri
Except that by removing the word "considered", you also removed the shout-out
to the original "goto considered harmful" article.

------
warmfuzzykitten
> Arrays simply do not model any problem that I have at all well

Then don't use arrays. However, arrays (vectors and matrices) do model
problems that scientists and engineers have, and nothing else models them as
well or as efficiently. So to call arrays harmful is to exclude a large and
important class of applications simply because you don't happen to work on
them. It would be more reasonable to note that there is a need for immutable
arrays not met by our current languages.

~~~
jbri
Is it an array that models your problem efficiently, or is it "an
n-dimensional data structure with O(1) lookups by index", of which an array is
merely one implementation?

~~~
chancho
Are there any other such data structures besides arrays and hash tables?
(That's only half-rhetorical. I'd love to hear about one.) If so, are you
seriously suggesting that when an array fits the bill, we should stop and
think "hmm... maybe another data structure would be a better container for
this dense, contiguous set of integer keys"?

------
ot
About the first point: why doesn't C# have const arrays of const values, like
C++?

~~~
munificent
Const != immutable. Const in C++ gives you a read-only view into an object.
You can accomplish the same goal with more fine-grained control by using
interfaces in C# (or Java, which also lacks const). Instead of returning a
const reference to your object, you return your object cast to an interface
that doesn't have any mutating methods.
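
A minimal sketch of the pattern (the types are hypothetical):

    // The read-only interface exposes only getters; the mutable class
    // implements it, and public APIs hand out the interface type.
    interface IReadOnlyPoint
    {
        int X { get; }
        int Y { get; }
    }

    class Point : IReadOnlyPoint
    {
        public int X { get; set; }
        public int Y { get; set; }
    }

    class Shape
    {
        private readonly Point origin = new Point();

        // Callers see no setters through this view.
        public IReadOnlyPoint Origin
        {
            get { return origin; }
        }
    }

As with the array-to-IEnumerable cast the article warns about, a determined
caller can still downcast back to Point, so this expresses intent more than
it enforces it.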

~~~
chancho
What's an example of "more fine-grained control"? I've used C++ more than
Java or C#, but it's not apparent to me how removing a language feature gives
you more fine-grained control.

A const reference in C++ is like an interface that doesn't have any mutating
methods, except that all the methods of the interface keep the same names and
signatures as the original, and I don't have to type out a separate interface
definition: I just choose which methods appear in both the immutable and
mutable interfaces by appending "const" to their signatures (bonus: if the
implementation of the const version differs, I can overload on it).

Sure, C++'s const has tons of problems, but it seems like the Java and C#
designers looked at it and said "const is probably more trouble than it's
worth so let's not bother." Which is a fair assessment, I guess.

~~~
Chris_Newton
> What's an example of "more fine-grained control"? I've used C++ more than
> Java or C#, but it's not apparent to me how removing a language feature
> gives you more fine-grained control.

The point isn't the loss of "const", it's that the only access you do have to
a class-type object in C++/Java/C# is via its interface.

If you hold all of an object's current state in private data members, then the
only way you can change that state -- even if you can see the object and it
isn't const -- is via the interface functions provided. If, as class
designer, you happen not to provide any interface functions that mutate the
object (no set_whatever() calls, no assignment operators, etc.), then your
object becomes immutable by implication.

The finer control the GP mentioned probably refers to the way you can also
choose to permit certain specific types of mutation, by providing a limited
set of functionality in the interface rather than direct access to the
underlying values.
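
A quick sketch of both points (hypothetical classes):

    // Immutable by implication: all state private, no mutating members.
    class Timestamp
    {
        private readonly long ticks;
        public Timestamp(long ticks) { this.ticks = ticks; }
        public long Ticks { get { return ticks; } }
    }

    // Finer-grained control: exactly one kind of mutation is permitted.
    class Counter
    {
        private int count;
        public int Count { get { return count; } }
        public void Increment() { count++; }   // the only write path
    }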

------
yason
As soon as you start protecting yourself from the dangers of the unidentified
dumbfool programmer, you get exactly this. In that case you can't do much of
anything without first wrapping it in read-only collections or iterators or
accessors, whereas what you really want is just _to return a list of damn
things_.

Java seems to be the breeding pit for such practices: it's a supposedly secure
virtual machine that allows you to hide stuff so deep in your rear end that
you can barely access it yourself any longer without a delicate assessment of
numerous design patterns. That sort of capability lets you get far enough
down the encapsulation road that it might never be feasible to back down and
rewrite things in a sensible way.

But the truth is, all you need is the array holding the important values and
all you need to do is document the function appropriately. "Don't modify the
returned list" would be a good start.

Then you want your developers to have the skills to understand each part of
that sentence. On the other hand, if someone doesn't, it's not your problem,
since you documented the behaviour.

Some languages, such as C, let you state that more explicitly with the
'const' keyword and even have the compiler act as a gatekeeper. The developer
can still copy the returned value into a non-const variable and mess around
with that, but he's already broken the contract by that point. In Java, it
should be more than sufficient to cast it to a type denoting a read-only
array or collection, but I don't see a common, real reason not to eventually
return that single array.

~~~
Chris_Newton
> But the truth is, all you need is the array holding the important values and
> all you need to do is document the function appropriately. "Don't modify the
> returned list" would be a good start.

I agree that over-engineering is both common and unhelpful in software
development. On the other hand, I think you are too quick to dismiss
encapsulation techniques.

The sort of defensive programming tools you mentioned aren't there because
dumb programmers don't know any better. They are there because smart
programmers know that sometimes they will still make mistakes anyway.

See also: failing to free dynamically allocated memory, failing to check
return codes or handle exceptions, failing to exit a loop under the correct
conditions, creating deadlocks in multithreaded code, etc.

------
btipling
Why would sending an array back for something like the Twitter News Feed not
be appropriate? Am I missing something? It's a list of tweets, so wouldn't
you send back a list? That is, an array?

~~~
Aaronontheweb
It's not the concept of arrays that is the problem - it's the implementation
of them. As the author points out, it's easy to use arrays to accidentally
create a memory-hog application that doesn't perform well in multi-core
environments.

It's better to abstract arrays through the use of more sophisticated classes
like IList in .NET, which can mitigate the issues with traditional primitive
arrays and provide developers with something that works better in modern
multi-core scenarios.

~~~
flatline
Technically System.Array also implements IList, which is an interface behind
which sit several implementations. The generic List type is probably worse
than a plain old array, as it is really a mutable, resizable array
implementation. Even the author states "one might reasonably point out that
List<T> is a mass of mutable state too". Future-proofing yourself through the
IList abstraction is a good practice, but I'm not sure off-hand what you
would use to replace it that would be better, for now.

I got the impression that the author was arguing for strictly immutable data
structures, which you can achieve through diligent use of the readonly
modifier and private setters, but which are not otherwise natively supported
in C# or VB. I hope more of the F# immutable structures make it into the core
framework in the future.

~~~
Aaronontheweb
As a general practice it's bad to return List<T> for that reason - it's
bloated with additional members and state to support just about anything out
of the box.

Returning IList instead of List on any public interface you define is the
rule I try to obey.

Basically it's all about how you implement arrays in your called classes - do
they expose internal state by returning a deep clone of the array each time a
method is called (which is what the examples Eric used were doing), or do
they expose their contents via a read-only list?

~~~
euroclydon
Yeah, but which hard class do you return as an IList? Does the bloat go away
if you return a List from an IList typed method?

------
swannodette
Precisely. Persistent Data Structures are the future.

------
listrophy
In ruby:

    
    
      class Type
        def self.get_constructors
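          # memoize on first call; freeze so callers can't mutate the cache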
          @ctorInfos ||= go_get_constructor_infos_from_metadata.freeze # go_get_WTF?
        end
      end
    

There you go. Language-level object freezing. Apply, as needed, to the
language of your choice.

------
dspeyer
This is _very_ C#-specific.

In C++, you can return a const reference. In JavaScript, Array is a class
(and you can add functions to the prototype, but probably shouldn't).

