Hacker News

Hi, I work on Presto at Facebook and would be happy to answer any questions about it.

1. Can you force it to do exact (not approximate) counts and aggregations? If so, what are the preconditions (size of tables?)

2. In data warehouse terms, would you fit it as a MOLAP, ROLAP or HOLAP kind of engine? I'm not sure whether all data is held in memory and whether aggregates are cached. Can you preprocess the dataset in batch mode (let's say overnight), repopulate aggregate caches for faster retrieval later on? (I mean that as something similar to SQL Server Analysis Services in MOLAP mode).

3. Can you compare it to Apache Drill?

1) Presto always returns exact answers, with the exception of the approximate aggregate functions approx_distinct() and approx_percentile() (see http://prestodb.io/docs/current/functions/aggregate.html). We do not currently support COUNT(DISTINCT); if you really need an exact distinct count, you can use a subquery with SELECT DISTINCT.
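
A sketch of that workaround, using a made-up `events` table with a `user_id` column (names are illustrative, not from the Presto docs):

```sql
-- COUNT(DISTINCT) is not supported yet, but an exact distinct
-- count can be computed with SELECT DISTINCT in a subquery:
SELECT COUNT(*) AS exact_uniques
FROM (SELECT DISTINCT user_id FROM events) t;

-- The fast approximate alternative (HyperLogLog-based):
SELECT approx_distinct(user_id) AS approx_uniques FROM events;
```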

We have a pull request for partitioned hash aggregations, but currently all the groups must fit within the memory limit specified by the "task.max-memory" configuration parameter (see http://prestodb.io/docs/current/installation/deployment.html for details).
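
For reference, that limit lives in the node configuration described in the deployment docs; a sketch (the value below is illustrative, not a recommendation):

```properties
# etc/config.properties (excerpt)
# Upper bound on memory a single task may use (e.g. for
# aggregation hash tables); queries that exceed it fail.
task.max-memory=1GB
```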

Regarding approximate queries, we are working with the author of BlinkDB (http://blinkdb.org/) to add it to Presto. BlinkDB allows very fast approximate queries with bounded errors (an important requirement for statisticians / data scientists).

2) Presto is a traditional SQL engine, so it would be ROLAP. We don't yet have any support for building cubes or other specialized structures (though full materialized view support with rewrite is on the roadmap).

The Presto query engine is actually agnostic to the data source. Data is queried via pluggable connectors. We don't currently have any caching ready for production.

There is a very alpha-quality native store that we plan to use soon for query acceleration. The idea is that you create a materialized view against a Hive table, which loads the data into the native store. The view then gets used transparently when you query the Hive table, so users don't have to rewrite their queries or even know about it. All they see is that their query is 100x faster. (All of this code is in the open source release today, but it needs work to productionize.)

We have a dashboard system at Facebook that uses Presto. For large tables, users typically set up pipelines in Hive that run daily to compute summary tables, then write the dashboard queries against these summary tables. In the future, we would like to be able to handle all of this within Presto as materialized views.

3) We're excited about the Drill project. They have some interesting ideas about integrating unstructured data processing (like arbitrary JSON documents) with standard SQL. However, last I looked they were still in early development, whereas Presto is in production at Facebook and is usable today. Please also see this comment: https://news.ycombinator.com/item?id=6684785

Thanks a lot for the answers, and good luck with further development. I'm a regular SQL Server user, and the MOLAP goodies can be very useful. From what I understand, it should be very easy to join a new data source to an existing Presto warehouse. So I suppose you could "upload" additional mappings from Excel/CSV into the warehouse and run new queries live (instead of waiting for a batch reprocess). This can be very beneficial to the user/analyst who often likes to define new data categorizations on the fly and use them to dig further into the data. Great job! I'm putting this project on my must-try shortlist right now!

If this is a hot-button item, seriously look into MicroStrategy 9.2.1 or later. You build the project against some warehouse, and can import other data sources like Excel, Access, or even other ODBC sources. The imported data gets converted to an in-memory cube, which you can then combine with analysis against the larger warehouse. No reprocessing of the warehouse needed.

How complete is the SQL dialect? Are nested queries supported? I was surprised to see that the ANTLR grammar is only ~900 lines; a quick perusal didn't suggest to me that nested queries are supported, but I'm not a guru...

> It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).
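
As a sketch of what that subset looks like in practice (the tables and columns below are invented):

```sql
-- Outer join plus the approximate aggregates mentioned above:
SELECT c.segment,
       approx_distinct(o.customer_id) AS uniq_customers,   -- HyperLogLog
       approx_percentile(o.total, 0.95) AS p95_order_total -- quantile digest
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
GROUP BY c.segment;
```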

2) Can't you directly read the RCFiles/etc. bypassing Hive? Why implement another storage format?

Presto does read Hive data directly out of HDFS, bypassing Hive/MapReduce. However, neither HDFS nor the file formats like RCFile are efficient. The entire design of HDFS makes it difficult to build a real column store. Having a native store for Presto allows us to have tight control over the exact data format and placement, enabling 10-100x performance increases over the 10x that Presto already has over Hive.

Additionally, in the long term, we want to enable Presto to be a completely standalone system that is not dependent on HDFS or the Hive metastore, while enabling next-generation features such as full transaction support, writable snapshots, tiered storage, etc.

HDFS now supports placement groups with one use case being columnar storage.

Any metrics you can share about the team size, any esoteric skill sets found to be crucial, or overall cost? Some sense for those who wonder "what would it take to create or recreate something like this?"

We started the project with 4 full-time engineers and added a fifth person recently. We began working on Presto in August of last year, deployed our first working version in January, and we've been iterating on it ever since.

One big advantage that speeds up our development is the Facebook culture of moving fast and shipping often. Facebook employees are used to working with software as it's being built and refined. This kept us focused on the key subset of features that matter to users and gave us close-to-realtime feedback from them. Our development cycle is typically one release and push to production per week.

Also, from the beginning Presto had to work with current Facebook infrastructure (100s of machines, 100s of petabytes), so we faced and solved all the associated scaling challenges up-front.

So, the obvious question is does it connect easily with Scuba? From the paper, Scuba doesn't support joins, so would that logic be handled in Presto?

We are working on a Scuba connector.

Sorry, I didn't answer the core question. We are working on a Scuba connector specifically to bring joins and other powerful SQL features to Scuba. These features are handled by the Presto query engine, so data in existing non-relational systems can be easily analyzed. Even better, Presto can join data across connectors, linking formerly isolated systems.


I'm missing something very basic here. The core idea of Presto seems to be to scan data from dumber systems -- Hive, non-relational stores, whatever -- and do SQL processing on it. So isn't its speed bounded by the speed of the underlying systems? I know you're working on various solutions to that, but what parts are actually in production today?

Can you point me to any documentation on the airlift framework that Presto uses? It seems to be a REST server and service discovery framework. I can see the framework on github but there is very little documentation.

Airlift is a toolkit for building distributed services and it is the foundation of Presto. For the most part, it's a convenience layer on top of some well-known libraries (Jackson, Jersey, Jetty, Guice, etc.), but it also has support for various things you'd typically need to build a deployable server, such as packaging, configuration, logging, instrumentation (stats), etc.

Unfortunately, we don't have any docs, so you'll just have to peruse the code. There's a minimal server in the codebase that demonstrates usage of some of its features: https://github.com/airlift/airlift/tree/master/sample-server

Thanks! It reminds me a bit of Dropwizard at first glance. Doing REST services in Java has been really easy and enjoyable lately. The Jersey, Jackson, Jetty combo is great.

But why did you use this particular one? And how is it related to https://github.com/proofpoint/platform ? Thanks in advance.

Can you give a pointer to some sample code implementing a custom connector to a file containing flat binary records? If not, sample code for a connector to HDFS...

We don't have a connector for flat binary records or plain HDFS, but you can take a look at the Hive connector here: https://github.com/facebook/presto/tree/master/presto-hive/s...

We're also working on an example connector that can read from files/URLs. We should have that code up soon.

Congrats on open sourcing this. It sounds really great and I'm excited to get a chance to play with it.

Do you think this would be a decent candidate for click stream analysis?

Thanks! Presto supports arbitrary SQL including some basic window functions (full support is on our roadmap). I've used window functions at previous companies with commercial data warehouses to do some basic click stream analysis: http://prestodb.io/docs/current/functions/window.html
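
For illustration, a simple sessionization sketch with lag() (the `clicks` table, its columns, and the 30-minute session gap are all hypothetical, and the exact timestamp arithmetic available may vary by version):

```sql
-- Mark the start of a new session whenever the gap since the
-- user's previous click exceeds 30 minutes:
SELECT user_id,
       click_time,
       CASE
         WHEN click_time - lag(click_time)
                OVER (PARTITION BY user_id ORDER BY click_time)
              > INTERVAL '30' MINUTE
         THEN 1 ELSE 0
       END AS session_start
FROM clicks;
```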

We have over a thousand employees at Facebook using it daily for anything you can imagine, so I recommend trying it and letting us know what you find useful and how it can be improved.

Is there any benefit to use Presto on top of Riak or ElasticSearch (assuming someone writes the required connectors)? What are the main differences there?

The benefits would be that you could query them using full ANSI SQL and could join them with other data sources such as Hive. One of the benefits of Presto's connector architecture is that it allows queries across multiple disparate data sources (after the someone writes the connectors).
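
Schematically, once such connectors exist, a federated query could address each system through its own catalog (the catalog, schema, table, and column names below are all invented):

```sql
-- Join Hive data with a hypothetical Elasticsearch catalog:
SELECT h.user_id, e.score
FROM hive.web.events h
JOIN elasticsearch.default.docs e ON h.doc_id = e.doc_id
WHERE h.ds = '2013-11-06';
```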

What are the access paths that presto uses (full scan, range scan, skip scan, ...)? Does it understand indexes?

Presto supports full scans right now, but pushes down key-value equality predicates to the connectors for additional connector-side pruning (e.g. partitions in Hive). The engine itself will also do some additional partition pruning at planning time, using the partitions returned by the connector.

We currently have a pull request open that will allow range predicates to be pushed into the connectors. This will allow connectors to implement range/skip scans.

The core Presto engine does not take advantage of indexes right now.
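
To make the pruning concrete (assuming a Hive table partitioned on a date column `ds`; names are illustrative):

```sql
-- The equality predicate on the partition key is pushed down to
-- the Hive connector, so only one partition is scanned:
SELECT COUNT(*) FROM events WHERE ds = '2013-11-06';

-- Without a partition predicate, all partitions are scanned:
SELECT COUNT(*) FROM events;
```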

Is Presto meant mainly for ad hoc internal analysis or is it fast enough for user-facing features as well?

Presto is designed for OLAP-type workloads and is thus used heavily internally at Facebook for data analysis and reporting. It won't be replacing traditional databases (like MySQL) that are typically used for user-facing features, but it may be suitable for serving user-facing traffic doing OLAP-type queries, such as generating dashboards. We are actually looking into doing something like that right now at Facebook.

Can it be installed and played around with on the virtual machine Cloudera provides for Hadoop?

Wow, I didn't realize they had a VM image available. We haven't tried it, but there's no reason why it wouldn't work. Try it and let us know!

Are you looking for interns?

Facebook is always looking for interns and has a great intern program. See https://www.facebook.com/careers/department?dept=interns&req...

Thanks for the link, I'll try that. At the career fairs, the recruiters have indicated they filter out people like me who aren't juniors in college, so I'm looking for someone on the inside to help me get past this barrier.

I do a bunch of on-campus recruiting at Facebook.

For freshmen we recommend FBU: https://www.facebook.com/careers/university/fbu

Everyone else should check out regular internships or new grad positions: https://www.facebook.com/careers/university

Is the problem that you're younger than that or older than that?

In the case that you're younger: at Google, we offer something called Engineering Practicum internships. In this program, you get paired up with another freshman or sophomore and work on an intern project together.


Feel free to email me if you're interested in this. I'm doing some heavy data analysis and would be more than happy to host some interns next summer.

As an Eng Practicum alum, I can't recommend this internship enough. You'll write code that matters, learn a ton about 'real-world' engineering at scale, and meet some amazing people.

If you're a freshman or sophomore, apply!

We (I'm another eng at FB) definitely hire interns who are not juniors in college. You can be younger or older (or you could be a junior - I did my internship here after junior year).

As I understand it you just need to be going back to school at the end of your internship.

1. Which GraphDB do you use at Facebook?

2. In which DB do you store the text and media?

Facebook uses TAO for graphs: https://www.facebook.com/notes/facebook-engineering/tao-the-...

MySQL is used for storing textual user content like comments, etc. See https://www.facebook.com/MySQLatFacebook

Photos are stored using specialized systems: https://www.facebook.com/note.php?note_id=76191543919 http://www.stanford.edu/class/cs240/readings/haystack.pdf

And you can see some pictures of our new data center used for cold storage of older photos: http://readwrite.com/2013/10/16/facebook-prineville-cold-sto...

Wow, thank you! Is there an open-source alternative comparable to TAO for graphs? Neo4j is dead slow...

How does Presto compare to Impala?
