Hacker News
PartiQL: One query language for all your data (amazon.com)
241 points by portmanteaufu on Aug 1, 2019 | 84 comments



    PartiQL> SELECT * FROM [1,2,3]
       | 
    ===' 
    <<
      {
        '_1': 1
      },
      {
        '_1': 2
      },
      {
        '_1': 3
      }
    >>
    --- 
    OK! (86 ms)
Jeez. 86ms for this query on this data set? Hope that's not representative of the general performance!


Other queries involving joins, aggregates, unrolls, and pivots on schemaless, nested, multi-path documents performed way better than the example cherry-picked from the main blog post: queries completed in 5ms to 25ms, albeit on a toy dataset.

https://partiql.org/tutorial.html
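
As a rough illustration of the "pivot/unpivot" family of queries mentioned above (the literal struct here is a made-up stand-in for real data), UNPIVOT turns a struct's attribute/value pairs into rows:

    PartiQL> SELECT attr, val FROM UNPIVOT {'a': 1, 'b': 2} AS val AT attr

The result is a bag of structs, one per attribute of the input struct.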


You missed the point. Modern infrastructure's primary value is scalability. This number is of course bad for this amount of data, but it will be more impressive when the data is a million times bigger.


You missed the point. Modern infrastructure's scalability is irrelevant if even one user's experience is poor.

In the era of 64-core processors, scaling horizontally is meaningless for 99.9% of architecture designs. Latency matters to everyone, always.

Trivial queries taking nearly 1/10th of a second on modern kit is absolutely atrocious, and shows a total lack of awareness of performance as a feature.


It looks like the first query run in the REPL takes more time (startup cost); subsequent runs of the same query return in around 5ms. Even a query like `SELECT * FROM 1`, or a non-query expression such as `1 + 1`, will take longer than usual when executed immediately after starting the REPL.

The time reported by the REPL can be misleading.

There is work that we definitely need to do on performance as we develop PartiQL. Performance is something we have been considering since inception and we will keep considering as we go forward.


I mean, modern infrastructure's user experience is very poor compared to the PC-based software of decades ago...


> Latency matters to everyone, always.

I used to do a lot of BigQuery for analytics. Latency in BigQuery is crap and clearly not its selling point; we're not talking milliseconds here, we're talking seconds at a minimum. Yet it's a really nice database for its use cases.


We've improved our floor latency by a factor of 5 over the last two years, we've introduced clustering, and the newly introduced BI Engine gets you into the double-digit millisecond range, so give it a try again :)

(Product manager on BigQuery)


This is on the JVM so the JIT's optimizations probably haven't kicked in yet.


Who are we kidding? Even without JIT optimizations that seems absurdly slow.


Also, the PartiQL compiler makes heavy use of closures, each of which becomes a class, so the first time a query executes, the JVM has to load a few dozen classes; this probably explains the 86ms more than a lack of JIT optimizations alone.


> each of which becomes a class

Wasn't part of the whole thing with lambdas that they don't all become individual classes?


No, the point of lambdas is that you don't have to write out the full class syntax.

Like in other OO languages, every value in Java is an object [1], and every object has a class.

[1] Except for the 8 primitive types.


I didn't say that was 'the point of lambdas'. But the lambda implementation, as far as I know, doesn't create new classes. It's not syntactic sugar around anonymous/inner classes.


You're right that it's not represented as a class in bytecode.

At runtime, LambdaMetafactory creates function objects that implement the required interfaces.

The objects are instances of java.lang.Object, and getClass() returns the same value for every lambda object created from the same expression.

They're not compile-time classes, but they are runtime instances of classes.

It uses different machinery while keeping compatibility.


They're not classes in source code. As an implementation detail, certain compilation strategies do generate them anyway.


Which ones?


I don't know much about compilers, but basically the ones that don't have anywhere else to put the context of a closure.


I don't think that's what happens in the JVM; you can google the invokevirtual + other supporting gunk/mechanisms for this.


https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.ht... makes plenty of references to dispatching based on class.


D'oh, sorry I sent you on a wild goose chase - you want invokedynamic. Invokedynamic and lambdas.


They use the invokedynamic instruction, so it's a little more efficient than a class once JIT'ed (for HotSpot; not sure about the other JVMs). There might be some cases where class generation is needed, but I don't think so.


Did you just copy/paste from the tutorial?

It's an early reference implementation and demo program to show off the language syntax. It doesn't have much to do with whatever engine actually executes the query, and that engine will account for the majority of any real query's timing.

Regardless, even the query parsing and compilation should be much faster if it moves into a lower-level language like C++ or Rust.


A common query language, while appealing, is unlikely to fully abstract over different types of databases with different features and performance trade-offs. It will be a leaky abstraction.

Now, in practice, perhaps with sufficient adoption and integration, PartiQL might be good enough for 80% of use cases.


(I'm part of the PartiQL effort.) You are right about the challenge you point out and we are realistic about it. Thus this line in the charter:

> While the adopting query engines generally may not support all features of PartiQL, a database engine that “supports PartiQL” is expected to be consistent with the PartiQL specification in the syntax subset it supports.


> SQL’s ORDER BY orders the output data. Similarly, the PartiQL ORDER BY is responsible for turning its input bag into an array.

That is the most important thing for my uses. I deal mostly in time-series data, and SQL windowing queries are too slow. Turning the set into an array to allow indexing and support easy time-series queries is enough for me to use it.
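
As a sketch of what that quoted sentence means (the table and column names here are hypothetical), a query with ORDER BY returns an array rather than a bag, so in the REPL the result would print in `[ ... ]` array brackets instead of the `<< ... >>` bag brackets shown at the top of the thread:

    PartiQL> SELECT t.ts, t.value FROM readings AS t ORDER BY t.ts

Because the output is an array, its elements have stable positions, which is what makes index-based time-series access possible.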



What does this offer over Hive SQL, which Spark also supports?

Below are the reasons given in the blog post, and I am trying to compare them with Hive SQL + Spark:

SQL compatibility - I need to check this as I am not a SQL expert, but Hive SQL seems compatible

First-class nested data - supported

Optional schema and query stability - supported

Minimal extensions - feels like the same goals as Hive SQL

Format independence - yes

Data store independence - yes.


There is one phrase that every vendor hates: "vendor agnostic". Minor differences in SQL dialects are not a bug; they are features for most vendors.

Most customers running on Amazon (or any cloud) want to move from having to maintain their own databases (which takes a lot of effort) to paying someone else to do it. Amazon knows this.

This move looks like Amazon has everything to win and every other vendor has everything to lose. Even if they say the opposite (you can switch from Amazon to your own), they know that extremely few customers have the will to operationalize their own databases. So they know that only the opposite will happen: customers will switch from self-hosted to Amazon services. They have also been openly predatory towards other open-source databases (e.g. AWS Elasticsearch and Mongo). No wonder all Amazon services already support this.

In that context, who is the target audience and what is the deployment model here? Are vendors going to integrate this directly into their databases? Or do users have to run their own proxy instances? Or is it compiled into the application as a library?


Is there a specification in anything besides PDF easily available to link to?


(I'm a member of the PartiQL team.) The language spec source will be open-sourced, as well, early next week (week of Aug 5). As I said above to @ahl: Overall, we look forward to a community effort and participants that are interested in making significant investments to achieve the project's goals. We invite diverse opinions and viewpoints. As PartiQL grows towards a diverse community, we expect to add maintainers (for code and spec) that have non-Amazon affiliations and explore more formalized methods of governance.



OP asked for anything besides PDF.


AWS is all-in on data lock-in.

This may be powerful and useful, but it is proprietary, nontransparent, unstandardized, and nonportable.

I get that every database has some platform lock-in, but it's getting ridiculous. At least Amazon's relational offerings need to adhere to binary driver protocols.


Anyone know how this compares to Presto and ZetaSQL?


Presto is a distributed query engine that can run queries across different datasources. It accepts a basic ANSI SQL syntax.

ZetaSQL is a custom SQL dialect, along with parser and analyzer, that Google uses for products like BigQuery and Spanner.

PartiQL is a new query language extended from SQL to work with various non-relational data sources and schemaless data formats in a more natural and idiomatic way.


One big difference is native support for nested data that's built right into the syntax of the language. Most other SQL implementations support nested data through functions with non-intuitive syntax.
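
As an illustrative sketch of that syntax (the names here are hypothetical, in the style of the PartiQL tutorial), a nested collection can appear directly as a FROM source, correlated with its parent:

    PartiQL> SELECT e.name AS employee, p.name AS project
             FROM hr.employees AS e, e.projects AS p

Each employee's nested projects collection is unnested by the comma in the FROM clause, with no special function call required.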


We generally build views to unnest the arrays, maps, and structs and query from them (or build other tables from the views in Hive), but something like this is certainly a bit easier.


I also am trying to figure out how this stacks up against Apache Drill.


It's just a language, not a query engine. You can add PartiQL to Drill and Presto so that they can support a richer querying syntax over the unstructured/schemaless data sources they handle.


@dlurton since you seem to be speaking for the PartiQL team on this (congrats on the launch!): The reference implementation is open source; what's the plan for the language spec? Is that something that AWS is going to own and control? The website references the PartiQL Steering Committee -- is that just AWS folks or is the intention to make it more broadly composed of members of the community you build?

I'm interested in adopting PartiQL for our product, but would we get to participate in the evolution of the language or would we purely be downstream of the decisions made to benefit AWS products and services?


Hi @ahl, I'm a member of PartiQL's steering committee and glad to see your interest in participating in PartiQL's evolution. The language spec source will be open-sourced, as well, early next week (week of Aug 5). Overall, we look forward to a community effort and participants that are interested in making significant investments to achieve the project's goals. Diverse opinions and viewpoints, both on the language and on the process, are very welcome.

At this point, the maintainers/committee members are all from Amazon. As PartiQL grows towards a diverse community, we expect to add maintainers/committee members (for code and spec) who have non-Amazon affiliations and to explore more formalized methods of governance, as they emerge from our community discussions.

Please email us at partiql-committee@amazon.com to further coordinate.


Disclosure: I work for AWS and provided some opinions and non-prescriptive advice to the team behind PartiQL about Open Source.

The same question was raised on Twitter, and I put my thoughts there: https://twitter.com/_msw_/status/1157405984823758848

TL;DR, my advice is that successful open source projects and open specifications usually have diverse communities. You will have a hard time attracting people to your community if they do not share goals with the rest of the community. We should have some bounding boxes, through clear tenets, around how the spec evolves. Otherwise, welcome diverse opinions, experience, and problems to solve collaboratively.


Love the codebase. I had never written any Kotlin (and very little Java) and was able to (hopefully) complete a good first issue very quickly.


This is pretty nice, if only because it uses SQL dotted syntax seamlessly with JSON data.


Postgres has had this for years. It's arrows instead of dots, but that's the only visual difference.


It's not the same at all; it gets much more verbose with even minor complexity, and it lacks functionality.

PG is working on adding SQL/JSON support for JSON Path queries for the next version. It'll be a major improvement but still not as nice as what PartiQL has here.


The SQL standard includes JSON support: https://modern-sql.com/blog/2017-06/whats-new-in-sql-2016


The SQL standard is different from what databases actually support. No RDBMS supports this yet; only Elasticsearch, Couchbase, and managed services like Rockset and CosmosDB do.


Oracle 12.2+ reportedly supports ISO SQL/JSON, though I haven't tried it. Which, btw, looks like mostly just renames of Postgres functions.

https://docs.oracle.com/en/database/oracle/oracle-database/1...


Will this be integrated into AWS Athena? The blog post doesn't mention it.


AWS Athena is basically managed Presto, so AWS will have to modify Presto to support it. They might, and hopefully they'll upstream the changes.


Awesome! I’d be very interested to see DynamoDB support this language and a MongoDB-like query builder. Then I might sell all my MDB shares...


I wonder how this deals with nested parquet data, and whether it's able to optimise on the things parquet provides.


It would be possible to integrate Parquet data with PartiQL.

Here is an example of integrating PartiQL with CSV files: https://github.com/partiql/partiql-lang-kotlin/blob/master/e.... Integrating with Parquet would of course be more complex than that.


(PartiQL team member) AWS Redshift Spectrum has supported PartiQL on Parquet since last year, except that the language had not yet been named and was referred to as "SQL extensions for nested data."


Is this a GraphQL alternative or more for accessing DBs in the backend?


It's a query language like SQL, but designed to handle more unstructured and complex data models natively. You wouldn't want to expose this publicly, for the same reasons you wouldn't expose a SQL interface directly to your database.

GraphQL has some similarities in handling complex queries across multiple data sources, but it also has lots of functionality and a large ecosystem around offering it as a public API to clients.


This is not a GraphQL alternative. APIs that expose GraphQL can potentially use this in the backend to fetch the data, however.


Correct me if I'm wrong, but I believe the answer is "both", at least as far as I've read.


GraphQL is intended for public APIs which are interacted with by arbitrary, possibly malicious, queries.

This appears to be for known queries. Unless it is designed for arbitrary queries, DoS is a likely problem.


Interesting that they opted for a relational query language rather than a categorical one; the latter is proving to be more flexible [0].

[0] https://www.categoricaldata.net/


Interesting that they opted for the single most popular query language on the planet, versus somebody's hobby project? Why is that interesting?


Looks like this package is maintained by a single person: https://github.com/CategoricalData/CQL

Not sure it's comparable to something from Amazon, which probably has dedicated funding.


How is it proving to be more flexible?


I'd be happy to showcase our recent progress. But let's connect offline, so as not to hijack the conversation. Feel free to drop me a line at ryan@conexus.ai.


Interesting project. Thanks for sharing.


Quote: "PartiQL requires the Java Runtime (JVM) to be installed on your machine."

And that right there is where they lost me. Nooo thank you.



So I assume this is a rebranding of some other open-source project with the Amazon brand stuck on it, or is it actually something distinct?


The posted article says it was designed and built in-house, where it is currently dogfooded, and the specification doc is dated today (2019-08-01):

https://partiql.org/assets/PartiQL-Specification.pdf


This is neat. Anyone want to add support for TreeBase/Tree Notation? http://treenotation.org/treeBase/. Querying TreeBases in SQL without first converting the TreeBase to SQL is currently on the back burner. Seems like it would be relatively straightforward to use this to do that.


Most of your recent comments link to your project's website. Please stop with the abusive promotion.


I have real problems with TreeNotation advertising itself as "a software-less database system". It's basically abusing your filesystem to store a tree data structure and using git to handle concurrent updates. That's great, except for the part where they ask "does it scale?"; the answer given is yes, but it should be no.

A database is so much more than just a schema and validator, but this is being advertised as a database replacement.

And, I want to stress, this doesn't mean I think it isn't useful. I bet there are lots of times when you want to enforce some sort of structure on a bunch of folders with files in them. That's not a database, though.


Thanks for the feedback! I did not expect that confusion. I just made an update (and will push shortly) to be more explicit that it scales for collaborative knowledge bases. But I'm not talking about something like real-time transactional DBs, etc.

I have not used TreeBase for anything other than collaborative knowledge bases. Haven't even thought much beyond that. Thanks for letting me know that wasn't clear.


There's nothing special about collaborative knowledge bases that makes them immune from scaling problems. You can't use TreeNotation to run Wikipedia. Fundamentally it isn't a database, period. It's a schema for files in a file system.


> Fundamentally it isn't a database, period. It's a schema for files in a file system.

This is false. It is both.

A database is merely an application that provides an interface to structured data on disk.

I know a thing or two about databases, having contributed to a few of the larger open source ones.

There's a lot more to TreeBase than is on the website right now. As the website says "We have been using TreeBase for over 2 years in systems with millions of rows and dozens of collaborators." For all you know, you may have actually used a website that is powered by TreeBase (well, a TreeBase application written in a different host language, but the file system semantics are the same).


Not sure why you're getting negativity. Though I haven't gone back and reviewed your past comments, I don't see any harm in mentioning it here... It seems somewhat on topic and looks interesting for certain use cases.


I don't find it abusive; the comments seem relevant, and it's an open-source library.


FWIW this guy does actually spam his project a lot. I recognize him from Reddit, where people have also complained about his constant promotion of this project.


Sure, some people complain that I mention TreeBase and Tree Notation, but others have started using it, are contributing to the project, and some have even written thanks. You can't please all the people all the time.

Can you point to a single place where I posted about TreeBase or TreeNotation that wasn't relevant to the context? I don't think you'll find a single example.


It's extremely relevant to the OP.


Still, it is always appropriate to add a disclosure that you're affiliated with the project you're linking to.


That's a good point. I usually make it clearer but I guess the original comment in this thread was a little ambiguous. Thanks for the feedback.



