ROOT – Data Analysis Framework (cern.ch)
76 points by michaelsbradley on Apr 20, 2016 | 32 comments

An HN post on software I actually use ;) Those of us in the particle physics world use ROOT quite often. ROOT is pretty old (for me at least; it started in the mid-90s). The most recent version (ROOT 6) is a great step forward for modern C++ use. It's very far down the line, but the experimental ROOT 7 code I've seen is even better.

Glad to hear it's getting better. I've heard not-so-nice things in the past:

ROOT can't be said to be OO because it breaks the encapsulation in the guts. There is a massive usage of "g" global pointers : gROOT, gDirectory, gTree, gEnv, gSystem, gPad, etc... Around one hundred in v5-18-00, a disaster. And this definitely breaks the fundamental OO principle of encapsulation.

Then since ROOT violates three basic principles of OO (encapsulation, inheritance, virtuality) we are compelled to conclude that ROOT can't be considered as an OO software. ROOT is a bright example of people having jump to C++ but missed totally the point of OO. At least it will probably stay in the history of software because of that.

What could be the improvements in a ROOT major revision ?

o at least fix the name ! Is it ROOT, Root, root ? (Hell, we are pretty sure that any Bazaar model software have at least converged on that !)

o have then a correct namespacing of classes and libs.

o restore encapsulation (then get rid of the g pointers).

o revisit the inheritances. At least have a good histogram class. And arrange the storage area to be stable (then "fix" the TTree). And please, have an introspection class that looks like an introspection class.

o use pure abstract interfaces to separate domains. And stick strongly to the idea to have them pure.

o etc, etc, etc, etc, etc, etc, etc, etc, etc, etc,...


Before beginning, I should point out that these are simply my own views and that I hold no animosity against the developers — their design simply doesn't work for me. Presumably there are many people "out there" who think ROOT an excellent piece of software. In complete honesty, though, I have yet to meet any of them. In fact, I've never had any complaints that this article mis-represents ROOT, and I've had a fair bit of "fan mail", not mention discussions with well-respected developers and physicists who hold precisely the same views :-)




ROOT was the product of Fons and Rene porting PAW from Fortran, learning C++, OO, flirting with Taligent coding styles, and a bunch of other things all at the same time.

It was okay for a time, but that time has long passed.

Next, someone will be posting SuperMongo http://www.astro.princeton.edu/~rhl/sm/

I work with RHL.

I do data analysis of ATLAS data, and it's everywhere. Everyone knows that ROOT kind of sucks, and some people have moved to matplotlib to at least do the plotting for them. However, that brings a slew of other problems: for example, the plot guidelines for ATLAS publications are formulated in ROOT terms, so other kinds of plots sometimes do not get approved. On the other hand, there are literally millions of lines of code in the analysis framework that are heavily based on ROOT, so there is no real way to switch it out.

There is quite a lot I could say about ROOT, ROOT files, CINT, but I won't.

There are better options. Don't use it unless you are in HEP.

Even then, it tends to be used for everything whether it really needs to or not. I switched to matplotlib for plotting and was much happier/more productive.

What options are better?

Almost anything, to be honest. Matplotlib, R, Matlab, Mathematica etc. are all much nicer. Those will do most things ROOT does and be much less delicate. In a lot of places (especially outside CERN) Matplotlib is taking over where ROOT might have been used, but it's a slow process.

The problem is that ROOT still has a few very specialized features that its users still need and you can't get elsewhere. And there are a ton of legacy analysis tools built on top of it that are difficult to port because of how ROOT is. And a lot of its more extensive users are comfortable with it and have no motive to change (they're busy with being scientists).

I don't know anybody who actually likes ROOT, but it also won't be going away any time soon.

The one thing I am missing in the non-ROOT universe is a powerful fitting framework that can do multidimensional and simultaneous fits in disjoint function domains.
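To make "simultaneous fit in disjoint domains" concrete, here is a toy sketch in plain Python. It assumes the simplest possible case: two straight-line datasets on non-overlapping x ranges that share one slope but have separate intercepts, fitted by ordinary least squares. The function name `fit_shared_slope` and the data are made up for illustration; this is nowhere near what RooFit offers, only the core idea of one parameter constrained by several datasets at once.

```python
from statistics import fmean


def fit_shared_slope(datasets):
    """Least-squares fit of y = c_k + s * x to several datasets at once.

    Each dataset k gets its own intercept c_k; the slope s is shared.
    `datasets` is a list of (xs, ys) pairs (hypothetical input layout).
    """
    sxy = 0.0
    sxx = 0.0
    means = []
    for xs, ys in datasets:
        mx, my = fmean(xs), fmean(ys)
        means.append((mx, my))
        # Centering each dataset removes its own intercept from the problem,
        # so all datasets contribute to a single estimate of the slope.
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercepts = [my - slope * mx for mx, my in means]
    return slope, intercepts


# Two toy datasets living on disjoint x ranges.
low = ([0.0, 0.5, 1.0], [1.0, 2.0, 3.0])    # generated from y = 1 + 2x
high = ([5.0, 5.5, 6.0], [7.0, 8.0, 9.0])   # generated from y = -3 + 2x
slope, intercepts = fit_shared_slope([low, high])
```

The closed form only works because the model is linear; for anything realistic you would hand a combined chi-square or likelihood to a numerical minimizer.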

I am a particle physicist, and used to use ROOT every working day. It is still used daily by thousands of other particle physicists, though, and is a core part of many high-energy physics experiments.

I think there are a few objectively neat features of ROOT:

* Versioned persistency of C++ objects deriving from the TObject base class [1];

* Script-like execution of C++ and a C++ REPL based on clang [2]; and

* Dynamic bindings of the C++ classes to Python [3].

There's an accompanying, but independently developed, file access protocol for reading and writing ROOT files over a network, too [4].

On the other (subjective) hand, ROOT is regarded as a pain to use by ‘analysts’, the people who use ROOT to make the results that go into physics papers. There are already some good, old-but-still-valid critiques [5, 6], so I won't say too much, but I think a large part of the problem comes from two things:

1. ROOT tries its best to do everything that a particle physicist might want to do. This encompasses a very wide range of things, and this has led to ROOT having a very large, often intractable codebase that cannot be modularised.

2. It has failed to keep up with contemporary coding techniques and analysis methods. Most of the PhD students I know use the Python interface to ROOT, and yet the ROOT developers are planning to drop Python support for the next major version (ROOT 7, which is expected in 2018). Those that do use C++ aren't able to use even C++11 effectively with ROOT, as its interfaces aren't compatible.

Luckily, I'm confident that analysts will move to a better way. I've been very encouraged by the astrophysics and machine learning communities in particular, who are using Python to do low- and high-level analysis on large datasets, as we do in particle physics, and are producing fantastic results. Tools like pandas, matplotlib, and scikit-learn are an absolute joy to use in comparison with ROOT, and the communities within the Python ecosystem are wonderful: they foster very open code development, and value readable, well-documented, fast code.

I don't need ROOT to get any better, because I think the future is already here.

[1]: https://root.cern.ch/root/html534/guides/users-guide/InputOu...

[2]: https://root.cern.ch/cint-prompt

[3]: https://root.cern.ch/pyroot

[4]: http://xrootd.org

[5]: http://www.insectnation.org/articles/problems-with-root.html

[6]: http://www.insectnation.org/articles/root-wishlist.html

Background upfront: I'm the guy behind the C++ interpreter and ROOT's new interfaces. I'm the co-author of the only surviving C++ reflection proposal and the author of the std::variant proposal. I have contributed to the C++ Core Guidelines (http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines https://youtu.be/1OEu9C51K2A).

* HEP stores about 0.5 exabytes of data in ROOT format, that's almost exclusively serialized objects that do not know anything about TObject.

* XRootD is not really specific for ROOT files. A better example would maybe be our JavaScript de-serialization library, https://root.cern.ch/js/

* No way will the Python binding be dropped. I wonder where you got that rumor from. About one third of our users are using it.

* HEP is limited by CPU resources, which is part of the reason why HEP decided to use a close-to-bare-metal language for the number crunching part.

* We just made the use of python and R multivariate analysis tools with ROOT data more straightforward.

* We have people from genomics etc coming to ask for help, because they cannot find a system that scales as well as ROOT does.

And then we have a different perception of the direction out there. I see that Hadoop was nice but slow, Spark is nice but slow, so now things are moving to C++, see e.g. ScyllaDB. There is no reason for us to move away from it, but every reason to make it more usable.

And yes, I agree that this is an issue. But many physicists do not.

* ROOT files still have terrible documentation. Rene throws up his arms in protest anytime people say this (I've personally witnessed this).

* Physicists still don't like pyroot interfaces, otherwise rootpy wouldn't exist.

* astropy is proof that you can be performant and user friendly. Julia is proof that you don't even need a C++ library underneath.

* Saying ROOT scales well is weird. It is true that ROOT and the ROOT IO/ROOT files are efficient, but additional services have helped it scale (dCache, XRootD, batch farm/grid/DIRAC, etc.).

* Not sure what the ScyllaDB tangent has to do with anything. There are scalable open source RDBMS options out there too like CitusDB, Greenplum which support UDFs. Hadoop and Spark with HDFS are still great for certain applications, and as general data analysis tools are great, but it's tricky to really get them to perform well without HDFS and the grid model of computing doesn't lend itself well to that paradigm.

* I've heard the C++ interpreter is much better with Cling (if that's you, I applaud your effort!) CINT was a gun that fired in both directions for every grad student I ever had to help.

* XRootD has little to do with ROOT anymore other than it also implements the original root protocol.

* ROOT is not modular. It is both an application and a collection of libraries and somewhat of a VM. That does make some things convenient, but it also makes some things extremely hard.

There are many reasons to move away from ROOT, and the astrophysics community is a prime example of that!

Thanks for clarifying. You're right that I was too broad, and it's certainly true that many physicists don't share my opinion (I'm working on that).

Speed is always a concern, but I don't think it dictates that C++ should be the primary ‘user-facing’ interface. Numpy is fast, but it doesn't sacrifice a nice API to achieve it.

Personally, a big difference is that a lot of the Python packages feel fast to use and, most importantly, to write. ROOT can be fast to execute, no question, but I feel like I'm fighting against it (and I'm sorry that's very vague and qualitative).

It would be very interesting to hear more about the genomics use-case, and how they evaluated the other options.

I'm using Python for analysis, and I'm running into performance issues constantly.

If you want easy scale-out and scale-up with Python, check out the (relatively) new library Dask: http://dask.readthedocs.org

The thing that bothers me most about root is that some parts of it are basically not maintained at all.

There are serious bugs in RooFit which haven't been fixed in years. Wouter Verkerke has abandoned it (from what I can tell). Lorenzo Moneta is fixing the worst potholes, but it seems he has no authority or no time to tackle the misleading interface and the broken scaffolding of RooFit.

Maybe ROOT7 will be a chance to take ownership of RooFit again.

Have there been any success stories in regard to genomics and ROOT? About 10-15 years ago the group I was with then explored ROOT as the alternatives (Perl, early versions of R, etc.) weren't very attractive. We didn't end up going with ROOT ourselves for a variety of reasons, but did anyone else in the field do so?

Although I never used ROOT while at CERN, it surely was part of many of our discussion subjects at ATLAS-DAQ.

Nice to see it on HN.

I was in ATLAS DAQ, too, and I'm happy I never had to use too much of ROOT, too. PAW, on the other hand, it was... charming!

My area was L2PU and Dataflow related, a decade ago.

So also not that much use of PAW as well.

I was in the online monitoring group, ever heard about GNAM? And it was around a decade ago, too.

But PAW was earlier, in the KLOE experiment for my graduation thesis.

My biggest advice about ROOT is: Don't use it really.

Look, ROOT is a very complex framework for data gathering and analysis built by physicists, and it shows every step of the way. The bugs are everywhere, and it does really weird things like setting global variables when you analyze some piece of data, changing your results for all subsequent analyses (this particular bug cost me about 2 weeks).
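For anyone who hasn't been bitten by this class of bug, here is a minimal sketch of the failure mode in plain Python. None of these names are ROOT APIs; `calibrate`, `analyse_global` and `analyse_explicit` are hypothetical, and the point is only how hidden module-level state makes the same call return different answers.

```python
# Hypothetical illustration; none of these names come from ROOT.
_current_scale = 1.0  # module-level state shared by every caller


def calibrate(scale):
    """Side effect: silently changes the state used by analyse_global()."""
    global _current_scale
    _current_scale = scale


def analyse_global(values):
    """Mean of the scaled values; the scale is read from hidden global state."""
    return sum(v * _current_scale for v in values) / len(values)


def analyse_explicit(values, scale=1.0):
    """Same computation, but the scale is an explicit argument."""
    return sum(v * scale for v in values) / len(values)


data = [1.0, 2.0, 3.0]
before = analyse_global(data)      # uses the initial scale of 1.0
calibrate(2.0)                     # some unrelated step changes the global
after = analyse_global(data)       # same call, same data, different answer
explicit = analyse_explicit(data)  # unaffected by the global
```

Passing the state explicitly is more typing, but the result of a call then depends only on what you can see at the call site.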

And in the end, there isn't really any point in using ROOT.

- Data gathering can be done with a simple CSV (binary if you wish), a more advanced SQL database, or in the realm of research with the venerable HDF5 format.

- Data analysis in C++ or any compiled language, just doesn't make much sense. You can use Python or R. The libraries to read and treat data are optimized and will make the process much less error prone and probably faster in the end.

Seriously, don't make the same mistakes as I did just because some older people in your lab use ROOT and you feel compelled to do it as well. There are much better tools for the job and I regret not searching for them before wasting about 6 months of my PhD thesis trying to integrate ROOT in my research workflow.

Agreed. ROOT is an idiosyncratic mess that hasn't really benefitted from the developments in data processing from other fields. Much better off with Python, numpy, pandas and friends. CSV for simple tables, SQL for complex ones, and HDF5 for n-dimensional arrays. Cython or numba to speed up the slow bits.
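A rough sketch of the CSV and SQL parts of that workflow, using only the Python standard library so it runs anywhere. The column names and values are invented for illustration, and HDF5 is left out because it needs a third-party package such as h5py.

```python
import csv
import io
import sqlite3

# Hypothetical event table; column names and values are made up.
rows = [("run", "event", "energy_gev"), (1, 1, 12.5), (1, 2, 40.0), (2, 1, 7.25)]

# Simple table: write CSV and read it back (in memory here, a file in practice).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
events = [
    (int(r["run"]), int(r["event"]), float(r["energy_gev"]))
    for r in csv.DictReader(buffer)
]

# More complex queries: load the same rows into SQLite and aggregate per run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (run INTEGER, event INTEGER, energy_gev REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
mean_by_run = dict(
    conn.execute("SELECT run, AVG(energy_gev) FROM events GROUP BY run")
)
conn.close()
```

With pandas installed, the same steps shrink to a `read_csv` and a `groupby`, but the stdlib version shows there is nothing exotic underneath.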

> My biggest advice about ROOT is: Don't use it really.

I partially agree: don't use it as a framework, but do use its libraries, they are good!

A good chunk of its libraries are re-exported open source libraries exporting alternate/C++ interfaces though! For example, GSL, FFTW3, and more than a few others.

I will say that it is nice that it has most any math function you will need. I know people who get super frustrated when they can't find a landau distribution in whatever language/library they are using and then just go back to ROOT at the end of the day.
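If the Landau distribution is the only thing keeping you there, a slow but serviceable version can be sketched from its integral representation, p(x) = (1/pi) * integral from 0 to infinity of exp(-t ln t - x t) sin(pi t) dt, using only the standard library. The function name `landau_pdf`, the cutoff `t_max` and the step count are my own choices, not anything from ROOT or CERNLIB, and the integrand becomes numerically unstable for strongly negative x, so treat this as a sketch for moderate x only.

```python
import math


def landau_pdf(x, t_max=40.0, steps=40000):
    """Standard Landau density evaluated with Simpson's rule.

    Integrates exp(-t*ln(t) - x*t) * sin(pi*t) over [0, t_max] and divides
    by pi. `steps` must be even. Only intended for moderate x (roughly
    x > -3); for very negative x the oscillating integrand grows large
    and the cancellation becomes inaccurate.
    """
    def integrand(t):
        if t == 0.0:
            return 0.0  # sin(0) = 0 and t*ln(t) -> 0 as t -> 0
        return math.exp(-t * math.log(t) - x * t) * math.sin(math.pi * t)

    h = t_max / steps
    total = integrand(0.0) + integrand(t_max)
    for i in range(1, steps):
        # Simpson weights: 4 on odd interior points, 2 on even ones.
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / (3.0 * math.pi)


near_peak = landau_pdf(-0.2228)  # close to the most probable value
left_of_peak = landau_pdf(-1.0)
right_tail = landau_pdf(5.0)
```

A production version would use a rational approximation for speed, which is why having it prepackaged in a library is convenient.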

> Data analysis in C++ or any compiled language, just doesn't make much sense.

I'd rather use a programming language with a REPL that gives me the option to compile to native code, instead of being forced to write extensions in another language.

Plenty to choose from, doesn't need to be C++.

And I am still reaching for CERNLIB (PAW) any time I need to plot something. Could never understand why Rene Brun, Perevozchikov, et al. got so attracted to the OOP back then.

Back then OOP was everywhere in C++ world.

I think that HNers that bash J2EE and JEE designs never had the "pleasure" to enjoy mid-90's C++ OO frameworks.

Yet, it was really upsetting for me to see the smart and very experienced guys who wrote the beautiful CERNLIB fall into this.
