Other queries involving joins, aggregates, unrolls, and pivots on schemaless, nested, multi-path documents performed far better than the example cherry-picked from the main blog post, with queries completing in 5ms to 25ms, albeit on a toy dataset.
You missed the point. Modern infrastructure's primary value is scalability. This number is of course bad for this dataset, but it will be far more impressive when the data is a million times bigger.
You missed the point. Modern infrastructure's scalability is irrelevant if even one user's experience is poor.
In the era of 64-core processors, scaling horizontally is meaningless for 99.9% of architecture designs. Latency matters to everyone, always.
Trivial queries taking nearly 1/10th of a second on modern kit is absolutely atrocious, and shows a total lack of awareness of performance as a feature.
It looks like the first query run in the REPL takes more time (startup cost); subsequent runs of the same query return in around 5ms. Even a query like `SELECT * FROM 1` executed immediately after starting the REPL takes longer than usual, as does a non-query expression like `1 + 1`.
The time reported by the REPL can be misleading.
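As a rough illustration of why (a self-contained Java sketch, not PartiQL code, since the reference implementation runs on the JVM): the first execution of any code path pays class-loading and interpretation costs that later executions of the same path don't.

```java
// Sketch: the first run of a code path pays class-loading and
// interpretation costs; repeated runs are much faster once warm.
public class WarmupDemo {
    static long timeOnce(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        Runnable work = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum == 42) System.out.println(sum); // defeat dead-code elimination
        };
        long first = timeOnce(work);               // cold: includes lambda class spin-up
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 100; i++) {
            best = Math.min(best, timeOnce(work)); // warm: interpreted less, then JIT-compiled
        }
        System.out.println("cold run (ns): " + first);
        System.out.println("best warm run (ns): " + best);
    }
}
```

The same effect is why timing a trivial query immediately after REPL startup isn't representative; benchmark after a warmup pass.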
There is work that we definitely need to do on performance as we develop PartiQL. Performance is something we have been considering since inception and we will keep considering as we go forward.
I used to do a lot of BigQuery work for analytics. Latency in BigQuery is crap and clearly not its selling point; we're not talking ms here, we're talking seconds at a minimum. Yet it's a really nice database for its use cases.
We've improved our floor latency by a factor of 5 over the past two years, we've introduced clustering, and the newly introduced BI Engine gets you into the double-digit millisecond range, so give it a try again :)
Also, the PartiQL compiler makes heavy use of closures, each of which becomes a class, so the first time a query executes the JVM has to load a few dozen classes; this probably explains the 86ms more than a lack of JIT optimization alone.
I didn't say that was 'the point of lambdas'. But the lambda implementation, as far as I know, doesn't make new classes. It's not syntactic sugar around anonymous/inner classes.
You're right that it's not represented as a class in byte code.
At runtime, LambdaMetafactory creates function objects that implement the required interfaces.
The objects are plain Java objects; getClass() returns the same class for every lambda object created from the same expression.
Not compile time classes, but runtime instances of classes.
It uses different machinery while keeping compatibility.
they use the invokedynamic instruction, so they're a little more efficient than a class once JIT'ed (for HotSpot; not sure about the other JVMs). there might be some cases where class generation is needed, but i don't think so
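A quick way to see both sides of this (a minimal Java sketch; the behavior described in the comments assumes HotSpot's LambdaMetafactory): no class file for the lambda exists at compile time, yet every lambda object created from the same expression reports the same runtime-generated class.

```java
import java.util.function.Supplier;

public class LambdaClassDemo {
    // Each call evaluates the same lambda expression; the invokedynamic
    // call site is linked once by LambdaMetafactory.
    static Supplier<String> make() {
        return () -> "hello";
    }

    public static void main(String[] args) {
        Supplier<String> a = make();
        Supplier<String> b = make();
        // No LambdaClassDemo$1.class was emitted at compile time (unlike an
        // anonymous class), yet both objects share one runtime-generated class.
        System.out.println(a.getClass() == b.getClass());              // true
        System.out.println(a.getClass().getName().contains("Lambda")); // typically true on HotSpot
    }
}
```

So "not compile-time classes, but runtime instances of classes" is the accurate framing: the class is spun up (and cached) by the JVM the first time the call site is linked.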
It's an early reference implementation and demo program to show off the language syntax; it doesn't have much to do with whatever engine actually executes the query, and that execution will account for the majority of any real query's timing.
Regardless, even the query parsing and compilation should be much faster if it moved to a lower-level language like C++ or Rust.
A common query language, while appealing, is unlikely to fully abstract over different types of databases with different features and performance trade-offs. It will be a leaky abstraction.
Now, in practice, perhaps with sufficient adoption and integration, PartiQL might be good enough for 80% of use cases.
(I'm part of the PartiQL effort.)
You are right about the challenge you point out and we are realistic about it. Thus this line in the charter:
{{{
While the adopting query engines generally may not support all features of PartiQL, a database engine that “supports PartiQL” is expected to be consistent with the PartiQL specification in the syntax subset it supports.
}}}
'SQL’s ORDER BY orders the output data. Similarly, the PartiQL ORDER BY is responsible for turning its input bag into an array.'
That is the most important thing for my uses. I deal mostly in time-series data, and SQL window queries are too slow. Turning the set into an array to allow indexing and support easy time-series queries is enough for me to use it.
There is one phrase that every vendor hates: "vendor agnostic". Minor differences in SQL dialects are not a bug; they are features for most vendors.
Most customers running on Amazon (or any cloud) want to move from maintaining their own databases (which takes a lot of effort) to paying someone else to do it. Amazon knows this.
This move looks like Amazon has everything to win and every other vendor has everything to lose. Even if they say the opposite (you can switch from Amazon to your own), they know that extremely few customers have the will to operationalize their own databases. So they know that only the opposite will happen: customers will switch from self-hosted to Amazon services. They have also been openly predatory towards other open-source databases (e.g. AWS Elasticsearch and Mongo). No wonder all Amazon services already support this.
In that context, who is the target audience and what is the deployment model here? Are vendors going to integrate this directly into their databases? Do users have to run their own proxy instances? Or is it compiled into the application as a library?
(I'm a member of the PartiQL team.)
The language spec source will be open-sourced, as well, early next week (week of Aug 5).
As I said above, to @ahl:
Overall, we look forward to a community effort and participants that are interested in making significant investments to achieve the project's goals. We invite diverse opinions and viewpoints.
As PartiQL grows towards a diverse community, we expect to add maintainers (for code and spec) that have non-Amazon affiliations and explore more formalized methods of governance.
This may be powerful and useful, but it is proprietary, nontransparent, unstandardized, and nonportable.
I get that every database has some platform lock-in, but it's getting ridiculous. At least Amazon's relational offerings need to adhere to binary driver protocols.
Presto is a distributed query engine that can run queries across different datasources. It accepts a basic ANSI SQL syntax.
ZetaSQL is a custom SQL dialect, along with parser and analyzer, that Google uses for products like BigQuery and Spanner.
PartiQL is a new query language extended from SQL to work with various non-relational data sources and schemaless data formats in a more natural and idiomatic way.
One big difference is native support for nested data that's built right into the syntax of the language. Most other SQL implementations allow support for nested data through functions which have non-intuitive syntax.
We generally build views to unnest the arrays, maps, and structs and query from them (or build other tables from the views in Hive), but something like this is certainly a bit easier.
It's just a language, not a query engine. You can add PartiQL to Drill and Presto so that they can support a richer querying syntax over the unstructured/schemaless data sources they handle.
@dlurton since you seem to be speaking for the PartiQL team on this (congrats on the launch!): The reference implementation is open source; what's the plan for the language spec? Is that something that AWS is going to own and control? The website references the PartiQL Steering Committee -- is that just AWS folks or is the intention to make it more broadly composed of members of the community you build?
I'm interested in adopting PartiQL for our product, but would we get to participate in the evolution of the language or would we purely be downstream of the decisions made to benefit AWS products and services?
hi @ahl,
I'm a member of PartiQL's steering committee and glad to see your interest to participate in PartiQL's evolution.
The language spec source will be open-sourced, as well, early next week (week of Aug 5).
Overall, we look forward to a community effort and participants that are interested in making significant investments to achieve the project's goals. Diverse opinions and viewpoints, both on the language and on the process, are very welcome.
At this point, the maintainers/committee are all Amazon members. As PartiQL grows towards a diverse community, we expect to add maintainers/committee members (for code and spec) that have non-Amazon affiliations and to explore more formalized methods of governance, as they emerge from our community discussions.
Please email us at partiql-committee@amazon.com to further coordinate.
TL;DR: my advice is that successful open-source projects and open specifications usually have diverse communities. You will have a hard time attracting people to your community if they do not share goals with the rest of the community. There should be some bounding boxes around how the spec evolves, through clear tenets; otherwise, welcome diverse opinions, experience, and problems to solve collaboratively.
It's not the same at all; it gets much more verbose with even minor complexity, and it lacks functionality.
PG is working on adding SQL/JSON support for JSON Path queries for the next version. It'll be a major improvement but still not as nice as what PartiQL has here.
The SQL standard is different from what databases actually support. No RDBMS supports this yet. Only Elasticsearch, Couchbase, and managed services like Rockset and CosmosDB.
(PartiQL team member)
AWS Redshift Spectrum has supported PartiQL on Parquet since last year, except that the language didn't yet have a name and was referred to as "SQL extensions for nested data".
It's a query language like SQL, but designed to handle more unstructured and complex data models natively. You wouldn't want to expose this publicly for the same reasons you wouldn't expose a SQL interface directly to your database.
GraphQL has some similarities in handling complex queries across multiple data sources, but it also has lots of functionality and a large ecosystem around offering it as a public API to clients.
I'd be happy to showcase our recent progress. But let's connect offline, so as not to hijack the conversation. Feel free to drop me a line at ryan@conexus.ai.
This is neat. Anyone want to add support for TreeBase/Tree Notation? http://treenotation.org/treeBase/. Querying TreeBases in SQL without first having to convert the TreeBase to SQL is currently on the backburner. Seems like it would be relatively straightforward to use this to do that.
I have real problems with TreeNotation advertising itself as "a software-less database system". It's basically abusing your filesystem to store a tree data structure and using git to handle concurrent updates. That's great, except for the part where they ask "does it scale"; the answer is yes, but it should be no.
A database is so much more than just a schema and validator, but this is being advertised as a database replacement.
And, I want to stress, this doesn't mean I think it isn't useful. I bet there are lots of times where you want to enforce some sort of structure on a bunch of folders with files in them. That's not a database, though.
Thanks for the feedback! I did not expect that confusion. I just made an update (and will push shortly) to be more explicit that it scales for collaborative knowledge bases. But I'm not talking about something like real time transactional DBs, etc.
I have not used TreeBase for anything other than collaborative knowledge bases. Haven't even thought much beyond that. Thanks for letting me know that wasn't clear.
There's nothing special about collaborative knowledge bases that makes them immune from scaling problems. You can't use TreeNotation to run Wikipedia. Fundamentally it isn't a database, period. It's a schema for files in a file system.
> Fundamentally it isn't a database, period. It's a schema for files in a file system.
This is false. It is both.
A database is merely an application that provides an interface to structured data on disk.
I know a thing or two about databases, having contributed to a few of the larger open source ones.
There's a lot more to TreeBase than is on the website right now. As the website says "We have been using TreeBase for over 2 years in systems with millions of rows and dozens of collaborators." For all you know, you may have actually used a website that is powered by TreeBase (well, a TreeBase application written in a different host language, but the file system semantics are the same).
Not sure why you're getting negativity. Though I haven't gone back and reviewed your past comments, I don't see any harm in mentioning it here... It seems somewhat on topic and looks interesting for certain use cases.
FWIW this guy does actually spam his project a lot. I recognize him from reddit where people have also complained about his constant promotion of this project.
Sure some people complain that I mention TreeBase and Tree Notation, but others have started using it, are contributing to the project, and some have even written thanks. You can't please all the people all the time.
Can you point to a single place where I posted about TreeBase or TreeNotation that wasn't relevant to the context? I don't think you'll find a single example.