Should array length be stored into a local variable in C#? (habr.com)
94 points by atomlib on June 7, 2019 | 68 comments



This is good to know, but fairly unsurprising. The interesting case is List<T>. The .Count property is actually a function call, and the value could change during the loop. If you don’t mutate the list, is it smart enough to both inline the function call and hoist the value out as an invariant?
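Concretely, the manual hoist in question looks something like this (a sketch; the class and method names are mine, not from the article):

```csharp
using System.Collections.Generic;

static class CountHoistSketch
{
    // Count is read through the property getter on every iteration,
    // unless the JIT both inlines the getter and hoists it as an invariant.
    public static int SumPerIteration(List<int> list)
    {
        int sum = 0;
        for (int i = 0; i < list.Count; i++)
            sum += list[i];
        return sum;
    }

    // The manual hoist: correct only if the loop body never mutates the list.
    public static int SumHoisted(List<int> list)
    {
        int sum = 0;
        int count = list.Count;  // read once, before the loop
        for (int i = 0; i < count; i++)
            sum += list[i];
        return sum;
    }
}
```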


It doesn't even need to be a function call: accessing a class field repeatedly in a tight loop is slower than accessing a local variable.


There's no mathematical way for it to know that you won't modify (especially remove). The developer would have to express that by, say, iterating over the internal array directly.


A good compiler can in a large number of cases infer that it won't modify.

The compiler can explore the call graph of the loop body and find out which methods might be called on the list, and then check that none of those methods write to the list's internal length variable.

How deeply and precisely the compiler (or JIT) analyzes the code in an attempt to prove that the length won't change involves a tradeoff in compilation speed, but I'd expect .NET to have found a pretty good sweet spot by now.


> A good compiler can in a large number of cases infer that it won't modify.

For a single thread that should be doable. But that’s not mathematically covering all bases is it?

And how can the compiler know if the relevant list is not accessed from multiple threads? I’m pretty sure that’s next to impossible and definitely not worth the complexity for such a small optimisation.


Compilers have been assuming single-threaded memory access by default since, forever? If you want your code to work with multiple threads, then you have to explicitly write code to do so. Otherwise, the compiler is liable to store things in registers, shuffle instructions around, and do whatever it feels like to optimize performance for the single-threaded case. If the compiler didn't do this, then every program would run many many times slower than it does.
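A minimal C# sketch of that "single-threaded by default" assumption (the class and field names here are made up for illustration):

```csharp
class FlagSketch
{
    public bool PlainFlag;               // plain field: no cross-thread visibility guarantee
    public volatile bool VolatileFlag;   // volatile: every read is a real memory read

    // Without volatile, the JIT is free to cache the field in a register and
    // hoist the read out of the loop, so this can spin forever even after
    // another thread sets the flag.
    public void SpinPlain()    { while (!PlainFlag) { } }

    // With volatile, the field is re-read on every iteration.
    public void SpinVolatile() { while (!VolatileFlag) { } }
}
```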

Doing some quick googling, this article seems to explain things well from the perspective of a C# developer: https://afana.me/archive/2015/07/10/memory-barriers-in-dot-n...


It is often easy to know it isn't accessed elsewhere (escape analysis can prove it for local variables), and compilers do check the simple cases (search term: "lock elision").

For example, see https://shipilev.net/jvm/anatomy-quarks/19-lock-elision/.

This can also help when using Java's StringBuffer, which is thread-safe but can often be 'converted' into a non-thread-safe variant by the compiler (the existence of StringBuilder suggests that, at least in some versions of Java, 'often' wasn't often enough, though).


> There's no mathematical way for it to know that you won't modify (especially remove).

This is not true. Here are three examples:

A positive result from escape analysis and then scalar replacement of aggregates.

A positive result from partial escape analysis until after this point in the program.

Speculation that list is not modified, and deoptimisation if it is.


>There's no mathematical way for it to know that you won't modify

That's an overstatement. That's currently impossible in C#, but Microsoft is working on it in F*: https://www.fstar-lang.org I hope some of that will trickle down to F# and C# some day.


This is veering into C and undefined behavior territory, but with List<T> there's a guarantee that you cannot modify the list while enumerating over it, so could the compiler theoretically assume that when using foreach?


I assume it does, if you try to do so it throws an exception. It's a rather common footgun with multi-threaded code until you've lost a few toes to it. People are always writing naive producer-consumer queues with an unlocked List<T> backing it until they learn better.


> with List<T> there's a guarantee that you cannot modify the list while enumerating over it, so could the compiler theoretically assume that when using foreach?

No such guarantee exists.

The only guarantee there is is that enumeration will throw on next iteration after a modification.

The compiler can assume nothing.
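That guarantee is easy to demonstrate: the exception comes from the enumerator's next MoveNext(), not from the mutation itself.

```csharp
using System;
using System.Collections.Generic;

var list = new List<int> { 1, 2, 3 };
bool threw = false;
try
{
    foreach (var x in list)
        list.Add(x);   // bumps the list's internal version counter
}
catch (InvalidOperationException)
{
    threw = true;      // thrown by the NEXT MoveNext(), not by Add itself
}
Console.WriteLine(threw);  // True
```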


Nice benchmarking on Microsoft .Net CLR. Looks like the JIT compiler is smart enough to recognize that array.Length is an invariant and hoist it out of the loop, which is awesome for the common use cases!

One nitpick about the title: C# runs on more runtimes than Microsoft .Net CLR, and those may behave very differently. For example: Mono CLR, or Unity's IL2CPP which is an ahead-of-time compiler.

Specifically, I'd expect IL2CPP would not hoist length out, because it would not recognize it as an invariant. (Some great examples of IL2CPP cross compilation are here: https://jacksondunstan.com/articles/4749 )

TLDR: the Microsoft JIT compiler makes the local variable unnecessary, but this is a property of the JIT, not of C#. Developers on non-MS platforms shouldn't assume this.


> those may behave very differently

They shouldn't behave any differently, should they? There's a single language spec.


"Behave very differently" is meant with regards to performance, here; or in other words, things that are not observable in the language semantics.


Java's language spec also doesn't dictate how escape analysis or remapping from classes into structs gets done, for example.


Always accessing array.Length is defensive coding. In the event your array is mutated (say, the field is reassigned to a shorter array), always accessing array.Length ensures you won't run into an IndexOutOfRangeException.

Even better is to just avoid accessing the array's length. I almost always use foreach or Linq.


That is orders of magnitude slower and allocates. Of course you can use: https://github.com/jackmott/LinqFaster


I write first for readability and performance second. Not to say that performance isn't important (and of course I pay attention to O notation), but throwing a little more memory and CPU power at a web page is far cheaper and easier than spending developer time reading code.


Wrong

From the article:

> It also turned out that Foreach often walks through the array faster than For


1. `ForEach` is slower. 2. `foreach` is faster. 3. Linq is slower.
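Spelled out, those three are (a sketch; the relative speeds are this thread's claims, not measured here):

```csharp
using System;
using System.Linq;

int[] arr = Enumerable.Range(1, 100).ToArray();

// 1. List<T>.ForEach: a delegate invocation per element (plus a list copy here).
int sum1 = 0;
arr.ToList().ForEach(x => sum1 += x);

// 2. The foreach statement: over arrays this compiles down to an indexed loop.
int sum2 = 0;
foreach (var x in arr) sum2 += x;

// 3. LINQ: enumerator plus interface dispatch per element, and it allocates.
int sum3 = arr.Sum();

Console.WriteLine(sum1 == sum2 && sum2 == sum3);  // True (all 5050)
```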


I'm not a C# developer, but this kind of thing seems to permeate almost all languages in one form or another.

Maybe it is just a style and readability thing; or maybe (as suggested elsewhere as well) it is meant to be reused elsewhere in the system, so it is cached in a variable for later use.

Or, it's possible that at one time - maybe early in the early days of .NET - doing it this way was more optimized, and the habit stuck with developers (perhaps they all read the same article in the knowledge base about it?). If that's the case, it's a bit of "premature optimization", but one that doesn't apparently harm anything.

What I do wonder is if certain other changes could change the speed?

At least it might be interesting to see in these trivial cases; I admit that in more complex loops it might not be advisable.

But - for instance, what if rather than iterating thru the array from the 0th element to the length of the array, you instead started from the last element and iterated backwards, until you hit zero? That way, you wouldn't be checking the length of the array, but rather for zero?

The code for such a test might look like:

    public int WithoutVariable() {
        int sum = 0;
        for (int i = array.Length - 1; i > -1; i--) {
            sum += array[i];
        }
        return sum;
    }
I'm not sure that a "with variable" version would make much difference (or sense), but here it is for completeness sake:

    public int WithVariable() {
        int sum = 0;
        int length = array.Length - 1;
        for (int i = length; i > -1; i--) {
            sum += array[i];
        }
        return sum;
    }
Again - I'm not a C# developer - maybe my code is wrong above, but hopefully it gets the idea across.

Would this work better? Would it be faster? What would the JIT compiler create? Maybe it wouldn't be any faster or better than the ForEach examples?

I honestly don't know - but if anybody wants to give it a shot, I'd be curious as to the results...

EDIT: I noticed that I said "checking for zero" - but I modified my code to check against -1 as the boundary; the condition could equally be written "i >= 0;" instead (note "i == 0;" wouldn't work, since a for loop runs while its condition is true). I'm not sure whether "i >= 0;" or "i > -1;" is faster - another thing to check, I suppose...


"what if rather than iterating thru the array from the 0th element to the length of the array, you instead started from the last element and iterated backwards, until you hit zero?"

That used to be common in assembly, as it leads to smaller and faster code on many systems.

See https://stackoverflow.com/questions/2823043/is-it-faster-to-..., which also shows how times have changed, with many answers calling this premature optimization.


Was also common in javascript a few years ago when the runtimes were more simplistic than they are now.

Iterating from the tail also made the terminal condition a comparison to 0, which IIRC was faster in at least one browser.

The gains were mostly masturbatory of course, the overhead of the iteration loop is not going to matter much when you have 5 method calls in the body.


Iterating from the tail has other advantages, such as if you want to remove items from the list. With negative iteration you don't need to do any special handling. But that is an optimisation for development time rather than run-time, I'd imagine multiple in-place removals is not so efficient.
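For example (a small sketch of the removal case):

```csharp
using System;
using System.Collections.Generic;

var list = new List<int> { 1, 2, 3, 4, 5 };

// Walking backwards means RemoveAt only shifts elements we have already
// visited, so no index fix-ups are needed.
for (int i = list.Count - 1; i >= 0; i--)
    if (list[i] % 2 == 0)
        list.RemoveAt(i);

Console.WriteLine(string.Join(",", list));  // 1,3,5
```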


I wouldn't naturally think about summing lists in reverse so it becomes more cognitive effort to understand the code. I've seen plenty of termination condition bugs on reverse iterations that might back that up.

A related thought is how modern code clean-up tools are doing things like reducing if-nesting e.g. turning this:

    if (open) 
    { 
        stuff();
        close();
    }
into this, with early returns:

    if (!open) return;

    stuff();

    close();
My feeling is that the first more naturally represents the idea I have of the behaviour and the second, like your reverse iteration, is an encoded version of that idea. I feel I have to make an extra cognitive step decode and reassemble it to create an idea of the behaviour.

I suspect the further you stray from the natural idea, the harder the code is to read and validate by eye, and the more likely errors are to creep in. I don't know how subjective this is. I generally don't have a strong feeling about early returns; it's just that I have been noticing the slightly greater cognitive effort they cause me compared to the logical chunking that nested ifs provide.


In the above example I tend to agree with you, but I'd definitely prefer early return over multiple levels of nested if statements unless all of the nested statements are quite small.


I personally find early returns to be easier to understand (I think of them as exit conditions) than nested if statements but to each their own.


There are definitely cases like pre-conditions and validation that are natural early exits. There are other situations where phrasing instructions for a human that way would sound really off and confusing. Not an easy task for an automatic code-clean-up tool to discern!


I feel like I saw something about how the JIT only checks the bounds once when going in reverse, but I can't find anything concrete to back that up, so we'll have to go with the 10 year old write-up for now saying: no, it is not optimized compared to the Length optimization. https://blogs.msdn.microsoft.com/clrcodegeneration/2009/08/1...


`i > -1` isn't quite right. I'm not sure about how C# does implicit casting, but arrays should be indexed by `usize` and comparing signed w/ unsigned is a bug waiting to happen or UB in other languages.


Whilst I would personally use i >= 0 and find the use of -1 here slightly jarring, it is correct. .NET libraries usually use int even in places where uint would be best (like Array.Length) and indexers like this[int i] for Array. So, this will work fine. Well, it'd work fine anyway, because the negative value is never used, but yeah...


Yeah I meant it more as a bad habit, because when you move to other languages you'll have problems. In rust I don't think the style would even compile, and in C you're just playing with fire.


The difference is that the article is investigating using a property in the boundary check. The boundary check is evaluated with every iteration. Your example is only using it in the initialization expression which, by definition, is evaluated only once. This holds true for virtually any language with a for construct.
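To make that concrete (a sketch; the array contents are arbitrary):

```csharp
using System;

int[] array = { 1, 2, 3, 4 };

// Initializer: array.Length is evaluated exactly once, before the first iteration.
int sumDown = 0;
for (int i = array.Length - 1; i >= 0; i--)
    sumDown += array[i];

// Condition: array.Length appears in the test, which is evaluated on every
// iteration (whether that actually costs anything is up to the JIT, per the article).
int sumUp = 0;
for (int i = 0; i < array.Length; i++)
    sumUp += array[i];

Console.WriteLine(sumDown == sumUp);  // True (both 10)
```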


Saving the array length to a variable is one of those things that inexperienced programmers love to do, thinking it will optimise something.


Avoiding unnecessary work is something all developers should strive for, experienced or not. Maybe in this case it won't make any difference. But, say, a method call in a loop should often be done outside the loop. Doing it here but not there leads to inconsistent style. Also, depending on the language and compiler, it might actually make a difference sometimes. I'm not going to write code that assumes a certain interpreter/compiler behavior.


It honestly depends on the compiler.


It's a great default assumption when you don't know all the quirks of the specific language or compiler, because either it will help you or at least won't hurt you.


It might hurt if you add this extra variable to every loop and at some point the array length changes within the loop. It also makes the code more verbose. This optimisation should be done like all optimisations: first you benchmark and then see if it makes sense to make this change. Most of the time there's no point doing so.


Use Linq and forget about array length.

It's an interesting analysis and all, but why bother when the language has such an elegant collections API.


Linq is orders of magnitude slower and allocates. It is not always an appropriate choice.


It's the appropriate first choice.

If things are slow, instrument and investigate. Only bother fiddling with Array.length if profiling with production workloads exposes that as the bottleneck. More likely the problem will be elsewhere.


Do you have data for that?

Whenever I tried to find actual data for this Linq seemed to be slightly slower, but not too much.

Examples: https://codereview.stackexchange.com/questions/14197/is-the-... https://wheresmykeyboard.com/2015/06/linq-lambda-loop-perfor...


Yes. The total slowdown you get will depend on how much of the work being done is the actual iteration vs the work inside the loop. If you are doing a long running thing each time you go through the loop, then the overhead of the linq iteration is not so bad

But something like:

    var sum = values.Sum(x => x * x);
will take ~10x longer than a for or foreach loop equivalent.

(260ms vs 36ms on 32 million float32s on my machine, measured by benchmarkdotnet)

plus a small allocation
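For reference, the loop side of that comparison might look like this (a sketch, not the exact benchmark code):

```csharp
// Sum of squares without LINQ: no delegate invocation per element, no allocation.
float SumOfSquares(float[] values)
{
    float sum = 0f;
    foreach (var x in values)
        sum += x * x;
    return sum;
}
```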


Thanks for your reply. I'm fairly new to C#, and was hoping that linq would make my life easier. Oh well.

I made some benchmarks myself: https://pastebin.com/5vQNpbPC

And esp. the Sum is a lot slower in linq. Not quite an order of magnitude, but pretty bad.

Even worse:

        float sum = 0;
        arr.Select(x => (sum += x * x)).ToList();
        return sum;
This is somehow still a lot faster than the normal linq Sum. What does .Sum() do to be this slow?

Edit: I just noticed you also wrote this blog post on the topic: https://jackmott.github.io/programming/2016/07/22/making-obv... I should have read that earlier!


You're doing more work though. You're converting to a List<T> (in order to coax the lazy enumerable to enumerate). You should use Aggregate for a more generalised way to reduce/fold collections into a value:

    var sum = arr.Aggregate(0, (t, x) => t + (x * x));
On the whole though it's better to use Linq until it's not. It's more declarative which will lead to more reliable code. Optimise when you find performance issues, don't write bad code just because you may gain a few nanoseconds here and there.


Bear in mind that the operation is trivial and so the overhead cost of invoking the lambda per iteration becomes more significant. I'm not trying to dismiss the performance differences, but summing 32 million floats is not a common use-case and I think the performance differences are mostly irrelevant for those using it to (for example) sum a list of line-items for an invoice or whatever.

For anyone that needs maximum performance then they should clearly be wary of anything that requires invocation of lambdas, but the usefulness of Linq shouldn't be written off because of that - the declarative style leads to fewer bugs, more stable, maintainable and easier to read code - that in itself is a performance boost.


You're right, but for most people writing C#, the developer time saved by Linq (being easier to read and understand quickly) is much more valuable than the tiny amount of CPU time that's lost.


If your workload is summing 32 million float32s use foreach. Otherwise pay for readability with CPU.


At university, every time I put the array length into a local variable for a loop, I got a 2-point deduction on my homework.


Why?


Probably because his teacher's arrogance-to-life-experience ratio was too high.


Ahh, nice to see some c# without a new line before the braces.


That is probably my least favorite thing about C# coding. The preeminent style wastes so much vertical space. K&R braces for me.


Now if they would only drop Pascal case so that I can distinguish a method from a class I might give it a shot.


archive.org has a mirror in case the site is still hugged to death: https://web.archive.org/web/20190606120130/https://habr.com/...


If I remember correctly, the runtime keeps the size of the array in the object header along with the sync block, etc. If you have VS, you can view the object in memory to see the sync block value, array size value, etc.


It is very possible the reason is not speed, but readability. If you simply named it 'length', then sure, there is no point. If it is given a better, more descriptive name, and then gets used in an equation elsewhere in the code, then it may be very useful because it is easier to read.


something like:

    var widgetLength = widget.Length;

?


I think that's confusing - it could be referencing a different object named "widgetLength". How about widgetDotLength, to be more clear?


You're right. Let's go with yours. When can you push that?


Oh that'll be at least next week, I'm currently busy changing all the "int" declarations to "wholeNumberDataType" for clarity.


TL;DR: Yes, and use foreach rather than for to skip array bounds checks.


It will skip the array bounds check as long as myArray.Length is the termination condition in the for loop, rather than a local variable with the length stored.
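That is, the two shapes being distinguished are (a sketch; whether the second shape also gets the optimization depends on the JIT version, per this thread):

```csharp
int SumChecked(int[] myArray)
{
    int sum = 0;
    // Pattern the JIT recognizes: i is provably within [0, myArray.Length),
    // so the per-element bounds check can be elided.
    for (int i = 0; i < myArray.Length; i++)
        sum += myArray[i];
    return sum;
}

int SumViaLocal(int[] myArray)
{
    int sum = 0;
    int length = myArray.Length;  // per the parent comment, this shape may keep the checks
    for (int i = 0; i < length; i++)
        sum += myArray[i];
    return sum;
}
```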


It is odd that C# needs a foreach loop for bounds check elimination.

Java has supported this optimization for a wide range of loop types... since Java 7, I think. Normally I would argue that Java is just ahead of the curve, but Android also gained support for bounds check elimination in 2014-2015.

Either the article does not tell us the whole truth, or the Microsoft JIT is subpar by modern standards.


c# does not need a foreach loop for bound check elimination.


Yeah. A slight oops from the CLR optimizer, but nothing serious. It seemed to miss a hoisting optimization opportunity in the foreach case.


Long time ago I decided that you shouldn't second guess the compiler without a reason. So just avoid worrying about local optimizations and focus on your data structures and algorithms.



