Sometimes the result of a function call in Datomic is a copy of the database with the updates applied. It lets you preview how your transaction would affect the data, were that choice to be made. Datalog allows for this kind of return.
Were Postgres to add something like this (and if I'm not inferring too much from the title and commentary here), it wouldn't be the first time the industry took a lesson from Rich Hickey's resurrection of older tech (Datalog).
> No. It does not return a copy of the database with the updates applied.
At first I thought you were talking about Datomic (which does, indeed, "return a copy of the database with updates applied"). But you are talking about the paper, which is about something else apparently.
Ordinary "day to day" operations should have the highest possible isolation level, and a few specific operations should run with loose isolation levels after they have been proved to be both necessary for good performance and harmless. Do you prefer to put your data at risk?
If I understand it correctly, for a query with joins, instead of returning a single combined set of results with the information you need, it returns multiple sets of results based on each separate table.
But I don't understand why that would ever be desirable.
I always write my queries to return the information I need, per whatever "key" I need, in each row. Splitting that up in multiple sets of results is not helpful.
Or if I really need multiple sets of data, I write multiple queries.
So I really don't get what this is for. It's like it takes one of the biggest features of databases -- joins -- and removes it? Why?
First the authors show that for certain use cases, a single query is not ideal. Either the result set will be larger than it needs to be, or it will have to use something like ARRAY_AGG which "discard[s] all schema and relational information on the way." So let's assume that we are in a situation where a single query is not desirable.
Then the authors show ways to use multiple queries (in section 3, "SQL-BASED REWRITE METHODS"). Each method has its drawbacks. If you've done enough SQL, these patterns will look familiar to you. Compare the syntax needed here vs the proposed "SELECT RESULTDB" syntax in Listing 3 and you should see that their proposal is easier to write. Then in section 4, the authors "present an algorithm that can be integrated into a DBMS to efficiently compute the result set of our SELECT RESULTDB queries." Presumably this would be more efficient than any of the alternatives from section 3.
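For a flavor of what the hand-rolled alternative looks like (not necessarily the exact rewrite from section 3; the professors/courses names come from the paper's running example, the column names are assumed), you end up writing one query per base table, each restricted to the tuples that would have participated in the join:

    -- one query per table, restricted to the joinable tuples
    SELECT p.* FROM professors p
    WHERE EXISTS (SELECT 1 FROM courses c WHERE c.taught_by = p.prof_id);

    SELECT c.* FROM courses c
    WHERE EXISTS (SELECT 1 FROM professors p WHERE p.prof_id = c.taught_by);

As I read it, SELECT RESULTDB is meant to give you those same per-table subsets from a single statement, without hand-writing one query per table.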
> First the authors show that for certain use cases, a single query is not ideal.
Reading it closely, this is what I already disagree with. You do a good job of summarizing their two main arguments, so allow me to rebut:
> Either the result set will be larger than it needs to be
Yes, they're using the example of repeating information (denormalization), such as multiple courses taught by the same professor. But it's genuinely hard to see this as a drawback -- that's a feature. Data should be stored as normalized as possible, but queries are supposed to denormalize to present information in the desired format.
(And if you're dealing with what would be an overly-large amount of repeated values, you just run multiple queries yourself instead of one. And if round-trip latency is some kind of issue with running queries sequentially, you can always issue queries in parallel instead.)
> which "discard[s] all schema and relational information on the way."
Again, this is a feature. You're not supposed to retrieve all possible relational information in query results. You write your query to retrieve and differentiate precisely what you need and no more. Discarding irrelevant information is a feature, not a bug. More than that -- you want your query to define and adhere to its output format regardless of the underlying database structure, precisely so you can refactor things in the database and rewrite the query but not need to rewrite the code that uses the query results.
I guess my overall bafflement is that the things they describe as "not ideal" seem to me like features rather than problems, and these features have been highly beneficial in my practical experience of writing a lot of database-driven apps.
> denormalize to present information in the desired format
What if my "desired format" is a structure that maintains the relationships specified in the database?
This is an incredibly common pattern, particularly with ORMs.
    # pseudocode
    for each recipe in Recipe.joins(Ingredient).select_all() do
      for each ingredient in recipe.ingredients do
        print(f"{recipe.name} requires {ingredient.name}")
Depending on the ORM and the query options specified, this results in one of the following:
1. Everything getting denormalized into a single, wasteful result table and then teased back out into Recipes and Ingredients by the ORM
2. Two SQL queries: one to fetch the recipes, and then one to fetch the ingredients once we know the IDs of the recipes (see the sketch below).
3. A big ol' N+1 situation with 1 query to fetch recipes and then N additional queries to fetch the ingredients.
They all have various inefficiencies. What if we just returned the relational structures directly?
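For concreteness, option 2 above might look roughly like this in SQL (table and column names made up for illustration):

    -- first fetch the parents
    SELECT id, name FROM recipes;

    -- then fetch the children for the IDs we just saw
    SELECT recipe_id, name FROM ingredients WHERE recipe_id IN (1, 2, 3);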
1. You have given up on any sort of consistency.
2. The database implementation will never be able to optimize a lookup workload which is pretty close to pessimal. Not caring about 99% is only viable if you can actually avoid the other 1%.
Another answer might be: don't use an ORM for performance-critical DB operations, and write the SQL by hand. Case in point: I recently improved the performance of my email app by about 250x by replacing SwiftData with a native SQLite database (and hand-written SQL queries). The message data is in the under-1M-rows range, plus a lot of blobs/attachments (under 10MB each).
> Data should be stored as normalized as possible, but queries are supposed to denormalize to present information in the desired format.
But the desired format usually isn't a square, flat table. Usually the data has structure and you want to preserve that structure when you're displaying, not just having it hidden in the database. E.g. you might want to show a heading for each professor and then a list of classes taught by that professor. Getting back a flat table with n duplicate copies of the professor's information is a step backwards.
> Data should be stored as normalized as possible, but queries are supposed to denormalize to present information in the desired format.
The desired format depends on the application.
At this point a substantial portion of all SQL queries are generated by (and the results consumed by) ORMs.
For that use case having results that include Products and Categories separately (so you can instantiate Product and Category objects) is more useful than a single table.
If the transfer time was really slow, I could see wanting to return a subset of data without duplication, and then allow the application to query locally among that data for all the various things it needs.
That seems so unlikely that I'd think a distributed sqlite database would be a better idea instead, and just have it sync up when it can, and always query local.
I think the point is if you move that denormalisation to the application (the ORM say) then it's less data on the wire.
Similarly, keeping schema information gives the ORM more to work with, I suppose? Helpful for typing, and for foreign key links where the result no longer has the original table structure, but professor_id in the query result is still ultimately a foreign key to the professor table, etc.
Less data on the wire and a consistent snapshot at a given point in time, without the possible inconsistencies allowed by performing successive queries with intervening writes.
> Data should be stored as normalized as possible, but queries are supposed to denormalize to present information in the desired format.
Nit: data should be stored as normalized as possible, if your use-case prioritizes fast writes over fast reads. There are other benefits to normalization, but in practice they pale in comparison to the performance shift.
I'm guessing they're referring to the fact that when you use `GROUP BY` in SQL (aside from aggregate functions such as `ARRAY_AGG`), you're typically not returning a "group" of results, but a summary of the results in the group.
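Something like the difference between these two (professors/courses example from the paper, column names assumed):

    -- a summary per group
    SELECT professor_id, COUNT(*) AS course_count
    FROM courses
    GROUP BY professor_id;

    -- the group's values, flattened into an array
    SELECT professor_id, ARRAY_AGG(title) AS course_titles
    FROM courses
    GROUP BY professor_id;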
I think the authors' work is interesting but they shouldn't have said 'relational database' in the paper, just 'sql database' instead ("We keep an SQL database storing information about professors ..." and so on).
In "industry", I don't argue with people saying Oracle/MySQL/PostgreSQL are relational databases because that would make me an insufferable colleague and would hardly add any value to the discussion, but for an academic paper on databases, I would prefer more accurate language.
I'm in industry too, though I did go through formal studies at university and databases was one of the topics that interested me the most back then (late 90s). I dropped out without graduating so I'm definitely not in academia, but I do keep up with some of what's going on though my ACM membership, and it is within that context that I made my comment about the authors' choice of words in their paper.
With that said, some examples of differences between an SQL database and a relational one (which, to be honest, AFAIK, is something that doesn't exist in a production-ready available implementation):
- In a relational database everything is a relation (including the result of applying any operator). Within the Relational Model, this is called the Information Principle, and among other characteristics, the header and tuples in a relation are a set and therefore have no order. SQL is not set oriented, evidenced by the fact that columns and rows do have an order (to the point that most, if not all SQL products let you specify the place in a table's definition in which you want to insert a new column, something that makes no sense whatsoever for a relation).
- In line with the previous item, SQL allows duplicates, which relations don't have (because sets don't have duplicates).
- According to some (notably, Codd disagreed with this), NULLs and three-valued logic are not part of the relational model, while SQL obviously supports NULLs.
- In a relational database, values are stored in relation variables, and these change by being replaced completely with the relational assignment operator. Say you have a relation users, and you want to remove fipar from it, the relational way to do that would be to say users := users minus (users where username = fipar) (pardon the crude pseudocode, hopefully the intent is clear). This means there's no way for the update to be done partially. SQL databases used transactionally comply with this, but some let you relax ACID properties for performance, and when that is done, the universe of possible results for operations includes outcomes that would not happen with relational assignment.
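For contrast, the usual SQL spelling of that same removal is a partial, in-place modification of the table rather than an assignment of the whole relation variable:

    DELETE FROM users WHERE username = 'fipar';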
The list is much bigger, and there are lots of edge cases. For a proper treatment from someone who knows what they're talking about I'd recommend this book for a thorough answer to your question: https://www.oreilly.com/library/view/sql-and-relational/9781...
Not a difference, but since this is a common misconception, the "relational" in a relational database is not about relationships between tables. If we simplify by saying that a relation in a relational database is analogue to a table in an SQL database, you can have a relational database with a single relation (i.e., a single table). The relation is between the header and the tuples (rows in SQL). The idea is that the headers form a statement about the world modeled by the database, and every tuple is a combination of values for which the statement is true. In light of this, it should be obvious why duplicates make no sense in a relation. Saying something twice doesn't make it more true!
Another misconception is that relational databases don't scale. That makes no sense, because the relational model has nothing to say about the implementation. Saying the model doesn't scale because you can't use Foreign Keys after certain data size and throughput threshold are crossed in MySQL (say) makes as much sense as saying that arithmetic doesn't scale because you hit an overflow while using a specific model of calculator.
> In light of this, it should be obvious why duplicates make no sense in a relation. Saying something twice doesn't make it more true!
Saying something twice doesn't make it more true, but that could also entail that duplicates should be benign. After all, "true AND true" is still true. So I don't think the conclusion that duplicates make no sense follows from this fact, it rather entails that duplicates should have no effect (idempotency).
In fact, a higher performance implementation is possible if we permit duplicates at some levels of the system because we don't need to check for duplicates. If duplicates have to be removed, that can arguably be done at the final materialization stage, or some other stage where it makes the most sense and overall work is minimized.
It becomes pretty clear that `order` is a significant property for making useful (and performant!) programs. `Duplicates` are also required to make useful programs.
One non-obvious reason for this: you want to report that a `customer` has a duplicated key `1`. If you CAN'T model `[(customer.id = 1), (customer.id = 1)]` then you can't report errors! And `erroneous` data is VITAL to making useful programs, because otherwise the only possibility is "perfect" data, and that is not possible!
Another reason is that we want to `count` duplicates, to see `duplicates`, and there is another question that is NON-obvious at first: "what is a duplicate?". Have fun with floats, Unicode, combining case-sensitive and case-insensitive input... and it becomes obvious that useful programs REQUIRE support for bags in an extended version of the relational model.
And yet...
It IS very important to remember `set semantics` and try to adhere to them when it makes sense. Your query planner will like it. Your "valid" constraints will like it. And your `unique index` will like it. And so on...
It could be useful with ORMs. For example, in Rails, ActiveRecord instantiates the class for each row. But when you need to write a complex query with many joins and instantiate a tree of AR models from it, it becomes either cumbersome or leads to N+1 queries. Still a pretty niche use case, but it looks interesting.
Right, but fetching rows along with a bunch of related rows from other tables isn't niche, it's literally the _raison d'etre_ of GraphQL (simplifying a bit because GraphQL types don't necessarily map onto db tables.)
Static-language clients need to create a class (or collection of classes) for each GraphQL query, or forgo type safety (typically just null-safety, but I suppose you could treat each result as a dictionary type). Creating just the schema-level types, with relationships modeled separately, does throw out some of the benefits ("just the data you need") but simplifies the client, moreso if the client may perform multiple complex queries on the same data.
so, if you want to join two things, a parent and children, you get duplication of parent details
eg, select P.name, C.* from Parent P inner join Child C on C.Parent_id = P.Id where P.Id = 1234
So from what I can tell, this idea would produce a result set consisting of a Parent table and child tables, based on the join and projection, with no duplication? At least that's what I understand it to be. That seems useful.
Let's say you're querying orders, and each order has a customer name in another table that you want to pull through. If you do a join and include the customer name in the result set it's going to amount to a lot of repeated data that has to be sent across the network. Instead, you could receive two result sets, one with only the relevant customers and the other with the orders containing only the customer ID. The amount of data getting sent across the network will be massively reduced and you can still reference the customer names at the other end.
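Roughly, the two result sets would correspond to something like this (orders/customers schema and the filter are made up for illustration), except issued as one statement, with the server doing the de-duplication of customers for you:

    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE order_date >= '2024-01-01';

    SELECT customer_id, customer_name
    FROM customers
    WHERE customer_id IN (SELECT customer_id
                          FROM orders
                          WHERE order_date >= '2024-01-01');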
I guess it's kind of like doing all the things a join query would do, like mapping rows together, filtering, sorting, and grouping, but the final output stays normalized, split into several "tables" without duplication. From what I gather, the main "win" is the size reduction in the result set, which can impact network and memory usage. Yes you can do this yourself by running multiple queries, one on each table. Presumably by paying the latency of multiple round trips.
Sometimes you just want the data to do with as you please and don't want to bother dealing with relational database overhead (and instead prefer to deal with that overhead yourself).
Not really a new idea, just new to SQL databases. Usually if you find yourself needing to dump a whole ton of data, you don't use SQL, you use any number of other database solutions that are better suited to that model. So I suppose this is really meant to give you a little bit of that for when you need it.
I can imagine a use case. For complicated queries, you end up performing a Cronenberg on your data in order to smash it all together with large join chains, or even creating new json columns that effectively return other "tables". If its performance critical, this is often the best option.
For most queries, a single table makes perfect sense. But in the real world things aren't usually so neat.
I may be misremembering this but I think ODBC does allow for more than one table to be returned at a time? But it has been nearly 30 years since I messed with that API. I think it has to do with stored procedures and that they sometimes can return more than one table. Maybe someone with more recent exp in it can chime in.
If I wanted to get some patients who had condition x, y, or z, and they likely had more than one of those conditions, this would be a useful way to return the data.
The list of patients remains distinct, but which of the conditions they had would still be referenceable.
(humor) If the system is not set up appropriately, this is an expanded long-form version of a shell fork bomb[0] that drains your bank account in pay-per-use proportion to the resources consumed.
Reading the abstract... it feels to me like the author doesn't know what a relational database is.
Chris Date does discuss the concept of nested tuples / relation-valued tuples of some kind in his books as an extension to the relational model. But that's something else entirely.
My interpretation is that this enables a query to return many relations, in order to overcome the high latency associated with most RDBMSes, which often prevents using relational databases in a "theoretically pure" way. It's a hack, but possibly a better hack than the hacks we currently resort to.
This is to have multiple relations returned. Traditional unions and joins only see one relation returned.
You can pack multiple relations into one relation using unions/joins and then unpack it again in the client to ultimately achieve the same effect, which is a workaround that some utilize, but that is really hacky (to be clear, unions and joins are not hacks when used for their intended purpose, but the data packing is).
The proposal would be much less hacky - although still far from an ideal database that has no latency constraints forcing deviation from what is "pure". But that's engineering for you.
IIRC this (returning a full database with a schema as a result of a query) has been proposed before in categorical databases, see https://www.categoricaldata.net/
The purpose of this is to address the annoying limitation that results must be a rectangular matrix. I tweeted this on Oct 28th, so clearly it solves a problem practitioners already agree exists https://x.com/tomlarkworthy/status/1718329733274824783?s=20
Many useful result sets are not rectangular: trees, graphs, etc. There is no way to efficiently transfer these result sets in a single query. You either do multiple round trips or pad the leaves with repeated information, which eats network bandwidth.
When I think of SQL, I think of relational databases where they were created and most commonly used. What queries return are not tables, but a relation that can be composed of data from one or more tables.
What this paper is then saying is "What if SQL returned something other than a relation?" That question is being answered by all non-relational DBs being made. An example of one might be a graphdb that could return a graph with nodes, edges and properties thereof.
If we want to stay within relational DBs, then we're saying that we want a single statement to return multiple relations. That sounds something like a GraphQL query to me, which is a stitching together of the relational parts.
SQL queries return data from tables.
The abstract does mention that one downside of this is that the result of the query has to be constructed in memory, resulting in a lot of copying.
This is particularly apparent for joins, where some join types can produce much more data than all their input tables combined.
I don't see how this is "return something other than a relation"; they are pretty specific about what they mean and what benefits it can bring:
> Our approach has clear semantics, i.e. our extended SQL returns subsets of all tables with only those tuples that would be part of the traditional (single-table) query result set, however without performing any denormalization through joins.
Yes, that sounds just like a GraphQL connection result with associated nested connections. I believe that there's link deduplication for GraphQL which can also be enabled to avoid the copies.
The main difference is avoiding multiple round-trips--one for each table--which within a datacenter network doesn't add a whole lot of latency compared to the time serving those queries and some can also be done in parallel.
I think this solves the problem of Cartesian explosions in ORMs. The requirement of related data needing to be shoved into a single table means data shared by many entities has to be duplicated. The result set being split into distinct tables removes this constraint.
The only way I know how to deal with this in Postgres while maintaining a single query is to lateral join into a json aggregate. However, I don’t think the tooling around that approach is mature enough yet.
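For reference, the lateral-join-into-JSON pattern I mean looks roughly like this in Postgres (recipes/ingredients names assumed); it works, but the client then has to parse JSON back out instead of getting typed rows:

    SELECT r.id, r.name, i.ingredients
    FROM recipes r
    LEFT JOIN LATERAL (
        SELECT jsonb_agg(jsonb_build_object('id', ing.id, 'name', ing.name)) AS ingredients
        FROM ingredients ing
        WHERE ing.recipe_id = r.id
    ) i ON true;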
I love postgres json aggregates, but since postgres can't do nested aggregates it is kinda limiting when I would like to build my whole query response in a single postgres query
I'm new to web-dev (and databases), so my apologies for the potentially silly question, but what's an example situation where you would use nested aggregates and what kind of effort would it save, if it were possible?
Was curious myself so asked chatgpt and this what I got:
Nested aggregates in databases are useful when you want to perform calculations or summaries on groups of data within larger groups. For example, imagine you have a database of sales transactions with products and categories. You could use nested aggregates to find the total sales for each category and then calculate the average sales per category. It saves effort by allowing you to perform calculations at different levels of granularity in a single query.
Personally, I prefer raw sql over something like an ORM. So many times I want to get all users, with all posts, with all post comments (example).
You can do an array aggregate to get all users and their posts in the array; with jsonb_build_object you can make the posts an object of, say, id and content. But then you can't really make another array aggregate per post.
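The single level that does work looks roughly like this (users/posts schema assumed); it's the next level down, comments per post, that has nowhere to go:

    SELECT u.id,
           u.name,
           ARRAY_AGG(jsonb_build_object('id', p.id, 'content', p.content)) AS posts
    FROM users u
    JOIN posts p ON p.user_id = u.id
    GROUP BY u.id, u.name;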
Shell out for nested aggregates[0]: one 'SQL query' to run a batch shell script containing all the multiple aggregate levels in a file, then import back in to return a table reference.
I feel like we could build something like this with SQLite, application defined functions and in-memory DB pretty quickly.
But then I ask myself... Why would I want to do any of this? If I do Database=>sql command=>Database... Don't I have to write yet-more-SQL (or something approximating SQL) to get at my final items? I can't write a foreach loop over pile-o-tables. Virtually all programming abstractions expect some serial flow and enumerations of types.
At some point, you are going to have to map A to B, and I'd argue adding another database in the middle probably isn't strictly headed in the right direction.
If you want to select 2 different result shapes, just write 2 different queries. Use some lateral concepts in your programming environment to make the SQL part not suck. If you stand on some weird "one query only" principle every time you have to populate a practical view, you are going to have a super rough time with life.
Maybe if the client did the JOINs and other logic that were in the query, it would shed load from the upstream DB? That would be helpful for RDBMS which tend to be harder to scale when they are CPU bound.
I have seen a lot of services move to microservices and do client-side joins to help distribute load. Maybe there is a way for a DB built around this approach to run each table on a dedicated instance and have each stream the data back async? That would reduce cross-talk between the DBs serving the query and effectively create a client-side JOIN again, but be transparent to the client (if it were handled by the DB driver instead).
I imagine that this could be useful when you want to perform multiple operations on a subset of the data from the database at the client-side. So you load the RESULTDB to client and then perform subsequent queries client-side and those actually do the JOINs
This could dramatically reduce the load on the database for some applications as well as simplifying some queries that can be awkward to do fully in SQL (some GROUP BYs can get quite complicated to understand)
Another useful thing I can see is using this to simplify the creation of edge databases backed by a central database, like imagine creating a RESULTDB "view" at the edge using this syntax.
Essentially it allows for querying over all kinds (aka collection/table).
One use case where these can be useful is implementing account wipeout. If all entities (aka documents/rows) have a commonly named field like "owner", which is an account ID, you can delete all their records in a single query. I mean, for deletes you can do something similar with cascading deletes and foreign keys, but those don't always scale well because they all have to happen in one transaction (but then it's not transparent to the application).
I'm surprised they didn't mention SQL Server's "FOR JSON", which doesn't solve the problem they're solving entirely (n:n cannot be deduplicated AFAIK) but it can get you 90% of the way
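A rough sketch of what I mean (table names assumed), which nests the joined rows into JSON automatically:

    SELECT o.OrderID, o.OrderDate, c.CustomerName
    FROM Orders o
    JOIN Customers c ON c.CustomerID = o.CustomerID
    FOR JSON AUTO;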
This seems similar to how Eloquent (Laravel's ORM) will, by default, make separate IN queries when you ask for a collection plus its relationships (instead of using a JOIN).
I have thought about how you might return normalised data to your front-end and, with support from a front-end library, rebuild the object relationships so you can access them naturally.
One place I got stuck with JSON:API is that support for updating a row and its relationships in a single API request seems to have been kicked around for a while and is unresolved:
https://github.com/json-api/json-api/issues/795
I did some contract work for a company called Akiban, which was bought by FoundationDB, which was bought by Apple that took the whole product off the market at the time. What Akiban offered was SQL with an extra datatype called a "nested result set", the idea here seems similar if not more generalized; you didn't need a traditional join anymore and your ORM-like library could build up hierarchical models from a single SELECT statement that included nested queries. there is still logic in SQLAlchemy to accommodate this concept.
Duplicating the complete database is not so helpful if you need to grant access to only a specific subset of the data in the database (note: a database incorporates multiple tables).
There's no duplication involved. It's an immutable database with an eavt (entity, attribute, value, time) index model. Pick a point in time to filter entities existing only at that point. It's effectively a read-only pointer to the full db created for free after every transaction.
> Our approach has clear semantics, i.e. our extended SQL returns subsets of all tables with only those tuples that would be part of the traditional (single-table) query result set, however without performing any denormalization through joins.
Hmm, so if I understand this right, the query
    SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
    FROM Orders
    INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
would return two tables: (1) an Orders table with only the OrderID and OrderDate columns present; and (2) a Customers table with only the CustomerName column present — without any connection between CustomerName and either Orders.OrderID or Orders.OrderDate?
EDIT: Or, perhaps both Orders.CustomerID and Customers.CustomerID would be present in the respective tables since they're referenced by the JOIN?
The two parts of this I find really attractive are:
> What is also irritating about the result table is the fact that the notion of a ‘key’ is not defined anymore, that notion is simply dropped when processing data in SQL or relational algebra (→Problem 5: key information loss).
> Back to the result table, SQL discards even more information from the underlying relations: where did a specific attribute in the result table come from in the first place? Does that attribute value correspond to an attribute in one of the base relations? Or was it computed? (→Problem 6: schema information loss)
I've definitely had cases in ETL where being able to track where data came from would have helped (e.g. `columnA` is derived from a window function over `a.columnB` and `b.columnC`). This doesn't quite get there (what if `a.columnB` is itself derived!) but it's a neat idea.
Rant: we should avoid the term "database" in contexts like this. The standard terms are "schema" (for the definition of "database" that sqlite and mysql use) and "catalog" (for the definition of "database" that postgresql uses). It's further overloaded to sometimes mean DBMS (database management system, i.e. an implementation), and sometimes also "cluster" or "daemon instance" (which sometimes overlap with one of the previous senses, but in different ways per implementation).
The word schema is the format that the data must adhere to, not the data itself, at least in the rest of computing, and the literal meaning is very close to that. So that is a bad word to use for a bunch of data, whereas database clearly means a bunch of data.
I think “multiple tables” would be clearest for the submission. Or if it is the case “all tables in the database schema” or “all tables in all schemas of the database the user can access”
The overloading horse has bolted so might as well aim for clarity for the intended audience.
When I see just the word “Database” my brain treats it as a wildcard could be many things (like State could be a country or a state of a country)
Schema covers the definition of a bunch of related tables.
A database can run as many schemas at it wants, for as many applications as it wants. I might not be able to see them all.
"My database" doesn't mean anything in an enterprise environment. It might not even be 'mine' for any of several competing definitions.
Also there are several ways I could break "my database" which would result in every single employee at the company being pissed off, at the same time. Even if they've never heard of my project.
> Schema covers the definition of a bunch of related tables. A database can run as many schemas at it wants, ...
You're really getting hung up on the way that many DBMSs conflate the traditional meaning of schema (the names and types of the columns) with something slightly different (the database schema along with the actual data in it and some other metadata).
In fact your own comment conflates those meanings too! You say "the definition of a bunch of tables" but you seem to be talking about the data in them as well as their definition. This seems like extra evidence that it is better to avoid this term.
At least when you're talking about the multiple different schemas (in the DBMS sense) in a database, they very often (not always) have different sets of tables with different column names and types, so they really do have different schemas (in the traditional sense). But here, the focus is really on the data itself being returned (albeit the benefit of it is that it has a more usefully rich schema than a traditional SQL result), so I think calling that data a schema is a bad word choice.
> Schema covers the definition of a bunch of related tables.
> A database can run as many schemas at it wants, for as many applications as it wants.
Even in the scope of a single application, the word “schema” can be ambiguous. For example, I have a Spring Boot application, which stores its data in Postgres. Inside the JAR there is a collection of SQL files which are executed by Flyway at startup, and we call those files “the schema”. At the same time, we have a configuration option to tell the application which Postgres schema to use within its database, and we actually use that so multiple QA environments can all share the same database. [0] So even in this single app the word “schema” is being used in two different (albeit related) ways. Maybe we just need to accept that language is like that, many words have multiple meanings (sometimes even related meanings) and the context determines which of those meanings applies in any particular case.
[0] maybe should have put them in separate catalogs, but ended up going with separate schemas in the same catalog because that was the simplest configuration that met the need
A long time ago I learned that a database is any kind of structured, machine readable data. A database management system is the software that manages these databases, and databases + database management system = database system. Dunno if I recall that correctly.
Directly afterwards I got told that no one ever uses these words correctly, but the professor felt obliged to at least tell us the academic definition once.
Since this was a German lecture, some time was spent on the correct plural of schema (german: Schema). The options presented were Schema, Schemas, Schemata and everything was allowed as long as it is not Schemen (multiple shadowy figures).
Dunno why I remember this. It's the only part of the lecture I really remember.
you must have visited the same school as i did.
maybe the prof also said: there is no science in informatics, only in an RDBMS can you prove something. pulls out ms access
So, other than having the SQL abstraction,
I don't see how that's any different from a DBMS system call to create a RAM disk, copy the current database over to the virtual disk, delete the irrelevant data, and fork a process to use the database on the RAM disk.
I suspect the article is about returning the contributing tables in a join as multiple relations... while not possible in SQL proper this is possible with stored procedures (at least in T-SQL)
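A minimal T-SQL sketch of what I mean (procedure and table names made up); the client then reads two result sets off the same call:

    CREATE PROCEDURE dbo.GetOrderBundle @OrderID int
    AS
    BEGIN
        -- first result set: the order itself
        SELECT OrderID, OrderDate
        FROM Orders
        WHERE OrderID = @OrderID;

        -- second result set: the related customer rows
        SELECT c.CustomerID, c.CustomerName
        FROM Customers c
        JOIN Orders o ON o.CustomerID = c.CustomerID
        WHERE o.OrderID = @OrderID;
    END;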
Yeah. Sorry, I was being sarcastic. Was thinking of the time a junior dev json serialized an NHibernate entity and it was essentially serializing the entire database.
It is like a CTE but you can take a subset of a database that satisfies the join condition. I could see this being used in a select for update scenario.
I'd wager that any performance gains achieved by returning a database file instead of a (well-known, long-established, well-understood) resultset is going to be negated by literally the rest of the owl.
By that I mean, you then need to instrument loading the database file that is returned, applying specific local security and performance considerations that come with that choice, and then querying the normalized data within that returned file.
Could that be fast? Maybe. The best bet would be scenarios where the wire speed / bandwidth / system memory of the consuming device are very very limited. In those situations there's already an established way of handling it though: cursors.
A resultset is, for the most part, highly compressible already (barring scenarios where you're just returning blobs, which incidentally this is doing as well), so it's not clear to me this is even saving you that much in terms of the "data duplication" they're discussing.
I don't think it's talking about returning a database per se. Their contribution seems to be a new type of query that, instead of using joins to join together results from many tables into a single denormalized output "table," it actually returns multiple output "tables" whose rows you can associate together in your code. It's kind of a "normalized join," if you will.
That is a valid use case, but IMHO the proper mental model for it would be not "SQL statement that returns a database" but rather "a synced partial mirror database instance", i.e. you have a local database server (likely in-memory SQLite or something) that gets synced with the master DB (perhaps even both ways) but only for a specific subset of rows.
One thing to note is that with a row-level security implementation like Postgres you can pretty much do this already. For example, if you have a table `users` with appropriate RLS policies, `SELECT * FROM users` will only return the rows that whoever's doing the selecting is allowed to see.
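A minimal sketch of what that setup can look like (policy, column, and setting names made up for illustration):

    ALTER TABLE users ENABLE ROW LEVEL SECURITY;

    CREATE POLICY users_visible ON users
        FOR SELECT
        USING (tenant_id = current_setting('app.tenant_id')::int);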
AFAICT that’s limited to single table calls though.
I guess you could use to recreate the DB at the ORM level since you know the schema as well. But I still like this idea living at the database layer and not an extra layer of abstraction on top of it.
I really don't see the need for it. Perhaps export tools could use it. For the rest, if you have correct tables and joins you can get all the data you need in one query.
You could already use multiple subqueries if you wanted.
I want to read the paper further, but I have done similar things in the past using JSONB: multiple CTEs to pull the individual parts, convert to JSON, and then combine into an object whose keys hold arrays.
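Roughly this shape, in Postgres (table names and the filter are made up):

    WITH o AS (SELECT order_id, order_date FROM orders WHERE customer_id = 42),
         c AS (SELECT customer_id, customer_name FROM customers WHERE customer_id = 42)
    SELECT jsonb_build_object(
             'orders',    (SELECT jsonb_agg(to_jsonb(o)) FROM o),
             'customers', (SELECT jsonb_agg(to_jsonb(c)) FROM c)
           ) AS result;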
> Also be aware that if one of the inputs to ARRAY_AGG is NULL, the entire output will be NULL and thus none of the non-NULL attribute values will be shown in the output! So extra logic is required to avoid these cases. Good luck with that one!
that's how null works in 3VL, use isnull or coalesce accordingly.
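For example, in Postgres flavor you can usually get away with something like this (professors/courses names assumed), so missing or NULL values come back as an empty array rather than poisoning the whole aggregate:

    SELECT p.name,
           COALESCE(ARRAY_AGG(c.title) FILTER (WHERE c.title IS NOT NULL), '{}') AS courses
    FROM professors p
    LEFT JOIN courses c ON c.professor_id = p.id
    GROUP BY p.name;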
> We present a backward compatible SQL extension (SELECT RESULTDB) to allow SQL SELECT statements to return a clearly defined subset of a database rather than just a single table.
Just use SQL/JSON, it's standardized and supported. You guys are way out of touch with modern SQL. Please go and read modern-sql.com
This is going to age me, but a client CEO drank the CORBA kool-aid and insisted on a CORBA interface with a vendor famous for lock-in and their underlying database not being open. The implementation essentially returned the entire DB in its byzantine format without a decoder ring.
I don't get why this is a paper. Just make multiple requests: SELECT ... FROM TABLE. You can batch them if you like, or just send out multiple requests in parallel.
In fact, this is what some shitty ORMs do when you don't want them to and you actually wanted a join. (Grrrr!) Sometimes you want the opposite. The complaint is more that ORMs do silly things and there is less control.
I am guessing the paper is really about some nice sugar for doing this?
The main application for this is where you have detail data for parent records in a snowflake pattern. In that case SQL tends to require a ridiculous number of queries, where common formats like JSON and XML are capable of transferring hierarchical data like that in a single response. That is a major weakness of SQL and the common inability to return a hierarchical set of relations in one response in particular.
Also, running a ridiculous number of queries in parallel is not practical on many databases due to per connection overhead, a problem that is so severe that many databases have internal many-to-one connection demultiplexers already, i.e. they have N execution engines for M connections. That should sound familiar to those who are familiar with threading models.
It's either 'select groups of tables in the database and make a new database', or an API for a behind-the-scenes 'duplicate the database and remove the data you're not interested in'.
That would be convenient, and some programming environments have support for that kind of thing already. A hierarchical object valued expression would also be convenient in a different way.
let’s just call it drm, have an ai figure out the interface and do the plumbing, and throw some decoupled compute and storage at the problem, behind a serverless veneer.
I think it's an API wrapper for the old-school NP complete/unbounded delimited query, aka system("dump all database tables in delimited format, grep for results, create a new database with the grep'd results while keeping the original datatypes/functions/etc"). Hopefully with full UTF-8/16/32/64 support.
Perhaps sql take on 'view to a kill' aka create view / kill -9 database view
It's currying, for SQL. Cute. The functional people have discovered SQL.
This is presumably lazy evaluation - the "returned database" is just some object that implies a selection from the database. You don't want to copy all the records.
Yes, but a working concept for SQL 'duplicate database and populate' was to automate the 'populate with just the relevant data needed' part.
Potential for GitHub-style pulls/pushes of databases, with 'select'-ed relevant data to push/provide via SQL.
It's a bit easier to think of it as the method(s) needed to make a DBMS a replacement for an OS, where duplicating the DBMS is equivalent to exec'ing a shell with a relevant copy of the files.