
That indicates numerically unstable code on their part, though. Scientific results shouldn't depend on unspecified epsilon-level differences that fall within the spec.

If they're getting different results on different CPUs that both implement the spec correctly, they need to fix their code anyway.




Besides using a deterministic floating-point implementation, how can you avoid that in sims like this? A slightly different force on the particle at the current time step will cause it to end up in a slightly different place.

edit: They do run sims with slightly different initial conditions to make sure they're seeing physics rather than simulation artifacts. The issue here is that for a given set of initial conditions, results computed on different machines would not agree.
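This is the crux of it: in a chaotic system, a one-ulp difference in a force grows exponentially. A toy sketch using the logistic map as a stand-in for a tracking simulation (nothing to do with SixTrack's actual integrator):

```python
# Logistic map at r = 4 (fully chaotic): perturb the initial condition
# by a few ulps and watch the two trajectories decorrelate.
def trajectory(x0, r=4.0, steps=80):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = trajectory(0.3)
b = trajectory(0.3 + 1e-15)  # last-bits-of-the-double difference

gap = max(abs(x - y) for x, y in zip(a, b))
print(abs(a[1] - b[1]) < 1e-12)  # True: indistinguishable at first
print(gap > 0.1)                 # True: fully diverged within 80 steps
```

The per-step gap grows roughly like 2^n here, so any implementation-level difference in the last bit eventually dominates the trajectory. That's exactly why "within-spec" differences across CPUs show up as macroscopically different results.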


All floating-point implementations are deterministic. You get the same results on every run on a given CPU. There is no stochastic floating-point implementation from AMD or Intel.

I understand that they don't agree when run on different CPUs. But both work correctly, because IEEE 754 defines floats and operations only up to a certain level of precision.

What I'm saying is, if they obtain different results across CPUs within their accuracy target for results, their code is simply broken, and they need to change it to either use a more numerically stable algorithm or higher-precision floats. No scientific software should rely on unspecified numerical behavior that falls outside the spec and can change in the future even with the same vendor.

You also have to remember that there are all sorts of people at CERN, from undergrads learning to code on the go to software engineers with no physics background. Just because a piece of code made it into a CERN repository at some point doesn't mean it's a gold standard or the word of the gods (which appears to be your premise), or that CPU vendors are to blame for any problems.


> All floating point implementations are deterministic.

The entire point here, though, is that implementations are allowed to produce different results, and code that needs to produce the exact same result across different implementations needs to take this into account.

The application (SixTrack) had, AFAIK, only been used in compute clusters before. When they started using it in LHC@Home, running on a large variety of user hardware, this issue was exposed.

> What I'm saying is, if they obtain different results across CPUs within their accuracy target for results, their code is simply broken and they need to change their code to either use a numerically stable algorithm or higher-precision floats.

Right, that's what they discovered and that's what their solution was as I mentioned in my original post.


> and that code that needs to produce the exact same result

... is broken. "The same result" is a complicated question when floating point computations are involved. Even in an IEEE 754 environment, the compiler can easily cause non-bit-exact differences over time on the same chip.
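Even without changing chips or compilers, the non-associativity of floating-point addition is enough: a compiler that reorders a reduction (e.g. to vectorize it) changes the bits of the result while every individual operation remains spec-conformant. A minimal demonstration:

```python
# IEEE 754 addition is correctly rounded but not associative, so two
# legal evaluation orders of the same expression give different bits.
left = (0.1 + 0.2) + 0.3   # one evaluation order
right = 0.1 + (0.2 + 0.3)  # another evaluation order

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

Each individual addition is the correctly rounded result of its operands; it's the grouping that differs, and nothing in the standard forbids a compiler from picking either one for an unparenthesized chain in languages that allow reassociation.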


I agree that code that depends on exact results from things that are known to depend on implementation defined behavior is broken code.

But the reality is that a ton of scientific and high performance code does it, and simply declaring it broken and not dealing with it will not do you any favors if you are trying to work with such systems.

It's similar to how Microsoft has to maintain decades of bugs because a ton of software depends on them and would break if they were fixed. Yes, Microsoft would be "correct" to fix them and declare the dependent code buggy. But that would anger a lot of people who don't care about this argument and would probably lose Microsoft money. It's the same situation here.


I deal with it by using the numerical skills I used in grad school. There are no surprises here for anyone who's studied scientific computing. You appear to not understand the issues involved. Robust code doesn't have a problem with different compilers or implementation-defined behavior that's within bounds.


Good for you? I didn't originally write most of the code I have to work with. Just like Intel doesn't write the code that depends on MKL.



