CERN Open Data Portal: Explore more than 2 PB of open data from particle physics (cern.ch)
125 points by lelf 63 days ago | 12 comments

What software is being used to return the results? I tried it and the response feels very quick. 12 PB is such a huge amount of data.

First of all, it's 2 PB, not 12. But more importantly, the search doesn't go through those 2 PB. The search goes through the different 'experiments' (or whatever you call them), and the dataset for one of those experiments may easily be hundreds of GB. Your hits are 'experiments', not the content of the large datasets obtained during the experiment. This reduces the size of the search index by several orders of magnitude compared to the datasets themselves. So if, for instance, each dataset were 10 GB on average, you'd 'only' be going through roughly 200,000 entries. So let's make a very conservative estimate of the lower and upper bounds, say something like 2,000 to 2,000,000 entries (although 2 million datasets/experiments would be A LOT, like 500 experiments each day since 2008). Anyway, that being said, the search feels snappy indeed, which is nice. I agree with the other comment that ES is a good guess.
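To spell out the back-of-the-envelope arithmetic, here is a minimal sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and an 11-year span for "since 2008"; the average dataset sizes are guesses for illustration, not figures from the portal, and the function names are made up for this example.

```python
# Rough estimate of how many records a metadata search would cover.
# Assumptions: decimal units, and average dataset sizes that are pure guesses.
PB = 10**15
GB = 10**9


def estimated_entries(total_bytes, avg_dataset_bytes):
    """Number of datasets if the total is split into equally sized datasets."""
    return total_bytes // avg_dataset_bytes


def entries_per_day(entries, years):
    """Average number of new entries per day over the given number of years."""
    return entries / (years * 365)


total = 2 * PB
for avg_gb in (1, 10, 100, 1000):
    count = estimated_entries(total, avg_gb * GB)
    print(f"avg {avg_gb:>4} GB per dataset: about {count:,} entries")

# Upper bound from the comment: 2 million entries spread over roughly 11 years.
print(f"{entries_per_day(2_000_000, 11):.0f} entries per day")
```

With a 10 GB average this gives 200,000 entries, and even the 2 million upper bound is a small index by search-engine standards.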

Judging by the shape of the API responses, it looks like Elasticsearch is handling the metadata querying.

CERN Open Data is built using Invenio (https://inveniosoftware.org/). Invenio has a search module (https://invenio-search.readthedocs.io/en/latest/) that uses Elasticsearch.
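For anyone unfamiliar with what a metadata query looks like, here is a minimal sketch of an Elasticsearch request body using the standard `query_string` query. The helper name and the search text are hypothetical, and the portal's actual index layout and queries are not known from this thread; the point is only that such a query runs over small metadata records, not the datasets themselves.

```python
import json


def build_metadata_query(text, size=10):
    """Build an Elasticsearch search body that matches free text against record metadata.

    Hypothetical helper for illustration; not taken from Invenio or the portal.
    """
    return {
        "query": {"query_string": {"query": text}},
        "size": size,  # number of hits to return
    }


body = build_metadata_query("dark matter")
print(json.dumps(body, indent=2))
```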

ELI5: what should anyone be looking for, how, and with what tools?

Also explain why they don't already have an API/tool/integration with *.edu. Looker, Tableau, Wolfram?

Looks like the site has some resources for helping to get started.

I've also got a separate resource from a Meetup talk I went to a while back. The speaker is an ML engineer who looked into some LHC datasets and posted a writeup of her talk here: https://lavanya.ai/2019/05/31/searching-for-dark-matter/

Tools are provided, along with some benchmark analyses as starting points. I know for sure that is the case with ATLAS open data.

Geant4 should have been open source a decade ago.

What do you mean? Geant4 was always open source. I did my bachelor thesis 15 years ago using it.

Now, FLUKA on the other hand... no idea what their deal is.

That it's very useful for studying ultra-small, highly metallic supernovas is probably part of the issue.

I just downloaded and installed Geant4 from their website yesterday. It's also on GitHub.
