2. In data warehouse terms, would you classify it as a MOLAP, ROLAP or HOLAP kind of engine? I'm not sure whether all data is held in memory and whether aggregates are cached. Can you preprocess the dataset in batch mode (say, overnight) and repopulate aggregate caches for faster retrieval later on? (I mean something similar to SQL Server Analysis Services in MOLAP mode.)
3. Can you compare it to Apache Drill?
We have a pull request for partitioned hash aggregations, but currently all the groups must fit within the memory limit specified by the "task.max-memory" configuration parameter (see http://prestodb.io/docs/current/installation/deployment.html for details).
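For intuition, here's a minimal sketch of why partitioning helps (plain illustrative Python, not Presto's actual implementation): by hashing keys into partitions and aggregating one partition per pass, peak memory is bounded by one partition's groups rather than all groups at once.

```python
from collections import defaultdict

def partitioned_hash_aggregation(rows, num_partitions, agg=sum):
    """Aggregate (key, value) rows while bounding peak memory.

    Instead of one hash table over every group, hash each key into a
    partition and aggregate one partition per pass, so only a single
    partition's groups are resident at a time. `rows` must be
    re-scannable (e.g. a list), since we make one pass per partition.
    """
    results = {}
    for p in range(num_partitions):
        groups = defaultdict(list)  # only this partition's groups in memory
        for key, value in rows:
            if hash(key) % num_partitions == p:
                groups[key].append(value)
        for key, values in groups.items():
            results[key] = agg(values)
    return results
```

The trade-off is the classic one: num_partitions passes over the input in exchange for a 1/num_partitions memory footprint.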
Regarding approximate queries, we are working with the author of BlinkDB (http://blinkdb.org/) to add it to Presto. BlinkDB allows very fast approximate queries with bounded errors (an important requirement for statisticians / data scientists).
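The core trick behind that kind of bounded-error answer can be sketched in a few lines (this shows only the sampling idea, not BlinkDB's or Presto's actual code; `approx_sum` and its CLT-style bound are illustrative names):

```python
import math
import random

def approx_sum(values, sample_rate, z=1.96, seed=0):
    """Estimate sum(values) from a Bernoulli sample.

    Returns (estimate, error_bound), where error_bound is a rough
    95% CLT-style confidence half-width -- the bounded-error idea
    behind BlinkDB-style approximate queries.
    """
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < sample_rate]
    if not sample:
        return 0.0, float("inf")
    # Scale the sample sum up by the inverse sampling rate.
    estimate = sum(sample) / sample_rate
    # Variance of the estimator under Bernoulli(p) sampling, with the
    # population sum of squares itself estimated from the sample.
    var = (1 - sample_rate) / sample_rate ** 2 * sum(v * v for v in sample)
    return estimate, z * math.sqrt(var)
```

Scanning a 10% sample costs roughly a tenth of the I/O while the reported bound tells the analyst how much the answer might be off.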
2) Presto is a traditional SQL engine, so it would be ROLAP. We don't yet have any support for building cubes or other specialized structures (though full materialized view support with rewrite is on the roadmap).
The Presto query engine is actually agnostic to the data source. Data is queried via pluggable connectors. We don't currently have any caching ready for production.
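As a rough sketch of what "pluggable connectors" means (illustrative Python, not Presto's real SPI): the engine programs against an abstract interface and never cares where the rows physically live.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Tuple

class Connector(ABC):
    """What the engine sees: every data source implements this."""

    @abstractmethod
    def columns(self, table: str) -> List[str]: ...

    @abstractmethod
    def scan(self, table: str) -> Iterator[Tuple]: ...

class InMemoryConnector(Connector):
    """A toy connector; a Hive or Cassandra connector would plug in
    the same way."""

    def __init__(self, tables):
        self._tables = tables  # {name: (columns, rows)}

    def columns(self, table):
        return self._tables[table][0]

    def scan(self, table):
        yield from self._tables[table][1]

def execute_scan(connector: Connector, table: str):
    # The engine is agnostic to the data source behind the interface.
    return list(connector.scan(table))
```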
There is a very alpha quality native store that we plan to use soon for query acceleration. The idea is you create a materialized view against a Hive table which loads the data into the native store. The view then gets used transparently when you query the Hive table, so the user doesn't have to rewrite their queries or even know about it. All they see is that their query is 100x faster. (All this code is there today in the open source release but needs work to productionize it.)
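The transparent part boils down to a planning-time decision along these lines (a hypothetical sketch with made-up names; Presto's real planner is far more involved):

```python
def plan_scan(table, materialized_views, native_store):
    """Pick where to read `table` from.

    If a materialized view backing `table` has been loaded into the
    fast native store, read from there; the user's query text never
    changes either way.
    """
    view = materialized_views.get(table)
    if view is not None and view in native_store:
        return ("native", view)   # accelerated path
    return ("hive", table)        # fall back to the source table
```

Because the substitution happens inside the planner, existing dashboards and ad hoc queries speed up without anyone rewriting them.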
We have a dashboard system at Facebook that uses Presto. For large tables, users typically set up pipelines in Hive that run daily to compute summary tables, then write the dashboard queries against these summary tables. In the future, we would like to be able to handle all of this within Presto as materialized views.
3) We're excited about the Drill project. They have some interesting ideas about integrating unstructured data processing (like arbitrary JSON documents) with standard SQL. However, last I looked they were still in early development, whereas Presto is in production at Facebook and is usable today. Please also see this comment: https://news.ycombinator.com/item?id=6684785
Additionally, in the long term, we want to enable Presto to be a completely standalone system that is not dependent on HDFS or the Hive metastore, while enabling next-generation features such as full transaction support, writable snapshots, tiered storage, etc.
One big advantage we have that speeds up development is the Facebook culture of moving fast and shipping often. Facebook employees are used to working with software as it's being built and refined. This kept us focused on the key subset of features that matter to users and gave us near-realtime feedback from them. Our development cycle is typically one release and push to production per week.
Also, from the beginning Presto had to work with current Facebook infrastructure (100s of machines, 100s of petabytes), so we faced and solved all the associated scaling challenges up-front.
I'm missing something very basic here. The core idea of Presto seems to be to scan data from dumber systems -- Hive, non-relational stores, whatever -- and do SQL processing on it. So isn't its speed bounded by the speed of the underlying systems? I know you're working on various solutions to that, but what parts are actually in production today?
Unfortunately, we don't have any docs, so you'll just have to peruse the code. There's a minimal server in the codebase that demonstrates usage of some of its features: https://github.com/airlift/airlift/tree/master/sample-server
We're also working on an example connector that can read from files/URLs. We should have that code up soon.
Do you think this would be a decent candidate for click stream analysis?
We have over a thousand employees at Facebook using it daily for anything you can imagine, so I recommend trying it and letting us know what you find useful and how it can be improved.
We have a pull request open right now that will allow range predicates to be pushed down into the connectors. This will let connectors implement range/skip scans.
The core Presto engine does not take advantage of indexes right now.
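To illustrate what that pushdown buys a connector, here's a toy skip scan over a sorted key index (illustrative Python, not Presto's connector API): once the range predicate is known up front, the connector can jump straight to the matching rows instead of scanning everything and filtering afterward.

```python
import bisect

def range_scan(sorted_keys, rows_by_key, low, high):
    """Skip scan over a sorted key index.

    With the range [low, high] pushed down, binary-search to the first
    matching key and stop at the last one, touching only the rows in
    range rather than the whole table.
    """
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    for key in sorted_keys[start:end]:
        yield rows_by_key[key]
```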
For freshmen we recommend FBU: https://www.facebook.com/careers/university/fbu
Everyone else should check out regular internships or new grad positions: https://www.facebook.com/careers/university
If you're younger: at Google, we offer something called Engineering Practicum internships. In this program, you get paired up with another freshman or sophomore and work on an intern project together.
Feel free to email me if you're interested in this. I'm doing some heavy data analysis and would be more than happy to host some interns next summer.
If you're a freshman or sophomore, apply!
As I understand it you just need to be going back to school at the end of your internship.
2. In which DB do you store the text and media?
MySQL is used for storing textual user content like comments, etc. See https://www.facebook.com/MySQLatFacebook
Photos are stored using specialized systems: https://www.facebook.com/note.php?note_id=76191543919 http://www.stanford.edu/class/cs240/readings/haystack.pdf
And you can see some pictures of our new data center used for cold storage of older photos: http://readwrite.com/2013/10/16/facebook-prineville-cold-sto...