Hacker News

It would be extremely kind if someone could explain the major differences, or the features Presto offers that are different from Impala or other similar products.



I can't speak to what other products like Impala, Apache Drill, etc. offer, but Presto supports the following:

- Standard ANSI SQL syntax, including all the basic features you'd expect from a SQL engine (aggregations, joins, etc.) and more advanced features like analytic window functions, common table expressions (WITH), and approximate distinct counts and percentiles.

- It's extensible. The open source code base includes a connector for Hive, but we also have some custom connectors for internal data stores at Facebook. We're working on a connector for HBase, too.

- In comparison to Hive, it's very fast and efficient. For our workloads it's at least 10x more CPU-efficient. Airbnb is using it and has had a similar experience.

- Most importantly, Presto has been battle-tested. It's been in production at Facebook since January, where 1,000 employees use it every day, running 30,000 queries daily. We've hit every edge case you can imagine.
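As a sketch of those SQL features (the table and column names here are made up), a single Presto query might combine a CTE, an approximate aggregate, and a window function:

```sql
-- Hypothetical Hive table of page views, partitioned by ds (date string).
WITH daily AS (
  SELECT ds, country, approx_distinct(user_id) AS users
  FROM page_views
  GROUP BY ds, country
)
SELECT ds, country, users,
       rank() OVER (PARTITION BY ds ORDER BY users DESC) AS country_rank
FROM daily;
```

approx_distinct() trades a small, bounded error for a much cheaper computation than an exact COUNT(DISTINCT ...) at scale.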


Any best practices or gotchas about query authoring in comparison to a typical relational database? What part of the traditional relational DB mentality needs to be changed or thrown out as I write queries? For example, should I avoid using this aggregate function or that join in a way I'm familiar with in the context of something like Postgres?


When running against Hive data, Presto is similar to many analytic databases in that queries typically perform a full table or partition scan, so "point lookups" that search for one or a few records will be much less efficient than they would be in an OLTP system like PostgreSQL that has a precomputed index for that query. (This is actually a property of the data source and not the Presto query engine. For example, we are writing an HBase connector that can take advantage of HBase's indexes for very efficient queries.)
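To make that concrete (hypothetical table and columns again), compare a partition-pruned scan with a point lookup:

```sql
-- Efficient: the filter on the partition column (ds) prunes the scan
-- down to a single partition's files.
SELECT count(*) FROM page_views WHERE ds = '2013-11-01';

-- Much less efficient over Hive data: with no index on user_id, this
-- "point lookup" still has to scan the whole table or partition.
SELECT * FROM page_views WHERE user_id = 12345;
```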

In general, you should be able to write your query in the simplest and most readable way, and Presto should execute it efficiently. We already have the start of an advanced optimizer that supports equality inference, full predicate move-around, etc. This means that you don't need to write redundant predicates everywhere as is required with some query engines.
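For example, with equality inference a predicate written against one side of a join is propagated to the other side through the join condition, so you don't have to repeat it by hand (table names here are illustrative):

```sql
-- The optimizer infers b.ds = '2013-11-01' from the join condition
-- a.ds = b.ds, so b's scan is pruned without writing the predicate twice.
SELECT a.user_id, b.total
FROM page_views a
JOIN purchases b
  ON a.ds = b.ds AND a.user_id = b.user_id
WHERE a.ds = '2013-11-01';
```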

Also, if you are familiar with PostgreSQL, you should feel right at home using Presto. When making decisions for things not covered by ANSI SQL, the first thing we look at is "what does PostgreSQL do".


What about connectors to other relational databases: PostgreSQL, MySQL, etc?


Connecting Presto to a relational database is a tricky question. If you just want Presto to scan the table and perform all computation in Presto, it is pretty easy, but for this to perform well you would want to push down some of the computation to the database. The other problem is that if you only have one database, you would have hundreds of cores hammering that single database for data.

That said, earlier this year, during a hackathon, we built a prototype connector that could split a query and push down the relevant parts to a distributed database that supports simple aggregations. It would be more work to clean this up and integrate it, so if a lot of people are interested in this we can prioritize it.


Makes sense. I was thinking for the case where you have data in multiple databases/servers and want to do aggregation or joins without first doing some ETL step to bring the data into another format. Unless there is something Presto can do that a relational database can't, I would assume you just use normal SQL if you have a single database.


If you did that, wouldn't you just be re-inventing DATAllegro or early versions of Greenplum or early versions of Aster? I.e., better than nothing, but still far short of a modern analytic relational DBMS?


Are those things open-source?


Actually, there was an early version of Greenplum that was open source. Nobody seemed to care much.


The data still lives in HDFS, so Sqoop is the preferred way to export to RDBMS as far as I know.
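For reference, a typical Sqoop export invocation looks something like the following (the JDBC URL, table, and HDFS path are placeholders, and the field delimiter depends on how Hive wrote the data):

```
sqoop export \
  --connect jdbc:mysql://db.example.com/analytics \
  --username etl \
  --table daily_summary \
  --export-dir /user/hive/warehouse/daily_summary \
  --input-fields-terminated-by '\001'
```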



