Wasn't that the promise of SQL in the first place? I don't believe it. Why would people (only) want to access structured data with something as ambiguous as natural language?
Whenever precision is required, we end up with methodologies like technical writing, or specialized language like legalese, to provide the necessary precision.
If we didn't have "programming languages", we'd end up with some sort of "computerese" analogous to legalese. It would still take years to become proficient in it, and there would still be either people or AI translating natural language into "computerese". (We'd likely prefer human oversight over the AI anyway, given that a human who needs an AI to convert their language into something more precise is probably not capable of judging whether the output is precisely what they meant.)
I agree with this, I don't think we'll be programming in natural languages until the computer is intelligent enough to pick out unintended ambiguity and ask clarifying questions. What I'm unsure about is whether or not the computer will be that intelligent in 50 years. Whenever it is that intelligent, I expect us to transition quickly.
There are going to be many tasks for which we don't want any independent judgement to be exercised by the machine. And for those tasks, we're going to still want a highly structured language.
So: yes, most companies already don't require people designing things to be able to fully spec every edge case in advance, without any dialogue.
Moreover, human communication is continuous and conversational. Programming is not. Most existing code editors are not designed for having a conversation. Jupyter notebooks come close but aren't there yet. I bet that with a conversational-agent-style code editor plus good auto-suggestion and auto-completion, we don't need to invent technology to use natural language to communicate with a machine. We just need a well-designed formal language with precise and concise syntax.
Some people write news stories tailored to Google Translate's capabilities, just to make sure the story can be auto-translated into many other languages. Most of these stories have a pretty normalized vocabulary. That's how much humans can adapt to the new world.
One day, when machine intelligence surpasses human intelligence, we will all speak Python or whatever language is most popular among machines.
In a sense, that's one definition of language power: to what extent can you specify the what without having to specify the how, assuming that the machine can execute the how much more optimally than you can most or all of the time.
I'm not sure what you mean by the outcome, but you can absolutely over-specify the output you want, and it's a relatively common problem. For example, you could say 'I want a system that prints lines with a saffron-based pigment', when all you wanted was an intense yellowish color and you don't care about the exact pigment being used. Or, you could say 'I want a program that lets a user choose red or blue or yellow and then paints a line in that color', when you actually mean the user can choose any color.
I've seen many problems of over-specification when talking about high-level details with less technical people. They tend to give details to help explain what they mean, without caring about the specific details; or they use metonymy, expecting everyone to understand what they mean.
For example, in the case of SQL, the GROUP BY clause is nearly always noise.
So much so that the SQL database will tell you exactly what you need to add to the GROUP BY to make it valid SQL. Those things are extraneous information that the database could (and already did) know.
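For instance (the table and column names here are purely illustrative), the engine already knows which selected columns are not aggregated:

    -- Rejected by ANSI SQL: region and country are selected but neither
    -- aggregated nor grouped.
    SELECT region, country, SUM(revenue)
    FROM sales;

    -- The error message spells out exactly what to add, so the clause
    -- carries no information the database didn't already have:
    SELECT region, country, SUM(revenue)
    FROM sales
    GROUP BY region, country;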
(IMO, in an aggregated query, SQL ought to implicitly add all non-aggregated fields to the GROUP BY, the user may explicitly add fields to the GROUP BY if desired.)
Maybe a similar syntax, "GROUP BY AUTO", could be useful, but I suspect it would quickly become a code smell - some people would slap "AUTO" in when they get an error in much the same way they slap "DISTINCT" in to fix an accidental cartesian product without actually thinking about the problem.
As irritating as having to list many columns in the GROUP BY clause of a statement with a wide output is, I like that it has to be explicit: if I have the column list wrong, I've made an error and I don't want SQL Server to guess how to fix that error. Perhaps instead of missing a column in the grouping list I've instead messed up in the projection clause. And if the statement is getting inconvenient enough that the grouping clause is a significant irritation, perhaps it needs to be refactored more generally.
One further thing to note is that some DBs do automagically decide what to do when columns are neither in an aggregate nor in the grouping list: MySQL, for instance, will just return an arbitrary value from those that exist in the current group for such columns (IIRC the first value hit as the data is read). This leads to situations where the query seems to work fine until something changes in either the data balance or the available indexes (or some change is made to the query itself) that makes the processing happen a different way around, so a different arbitrary value starts to get selected for the same group in the same query.
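A sketch of that trap, with a hypothetical orders table and assuming MySQL's permissive mode (ONLY_FULL_GROUP_BY disabled):

    -- order_date is neither aggregated nor grouped.
    SELECT customer_id, order_date, COUNT(*)
    FROM orders
    GROUP BY customer_id;
    -- MySQL returns *some* order_date per customer. Which one depends on the
    -- access path, so a new index or shifted data can silently change the output.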
That's a great point
And as the name suggests, it's somewhat magic/implicit in its behavior.
No ANSI SQL result is altered by changing the order of GROUP BY.
I have literally never, ever wanted this. Why in the world would you group by any field not appearing in an aggregate?
Also, the GROUP BY can specify cubes and rollups; moreover, you may not want to group by a field at all, but rather by an expression.
SELECT count(*), favorite_color, favorite_song
GROUP BY ...
Whether you want this or not, ANSI SQL requires that you group by at least these.
You could group by more fields, such as favorite_pet, though that would be somewhat odd since you did not include it in the result set.
I still prefer DRY, but I can appreciate the WET perspective.
* Except for the special case when it's non-aggregated and I want duplicate rows in the result set. SQL allowing duplicate rows is a departure from the underlying relational paradigm.
Lots of companies and organizations have made that promise. Even legendary game company Infocom way back in 1985. So far, none have delivered.
Explaining exactly what we want and covering all of the edge cases is just as hard as using a regular programming language. That won't change until we can have a full back-and-forth conversation with computers that have a deep understanding of what people want from a program and can ask the user questions like "But what should happen when x happens?"
Before we use these 5th generation languages to write natural language queries, we need an API that is sufficiently high level and possible to logically optimize without resorting to hacks.
The SQL server presents itself as a black box, saying "just say what you want and I'll figure out how to get it", and then completely fails to deliver on this; you have to open up the black box and find that its insides are dark and scary and full of jagged sharp ends.
and other wondrous promises of overselling dreams and delivering something that many have now forgotten ever existed. Let's be fair, C is still alive and kicking despite the promises of that 4GL era.
I guess my point is, there's nothing "natural" about language.
I have this article printed out in a drawer for these cases:
With DAX, Microsoft has extended the PivotTable model (instead of the relational model), and they deserve full respect for trying something new in this space.
Is it a simpler model? I kinda like it, but don’t think it is simpler.
My initial thought is highly specific human language is actually harder to write than most code, and always more verbose. Well written philosophy comes to mind.
I'd argue that this is part of the reason ORMs gained so much prominence. They solve the problem that SQL introduces by trying to emulate natural language. Most developers will gladly risk sacrificing some specificity of the query in exchange for reducing the verbosity of a SQL query. Obviously if the dev knows their ORM/SQL well, there is no lost precision.
I'm pretty sure Attempto Controlled English or Lojban (zero-ambiguity natural languages) have no computable implementations either.
"the product of pi and the square of r" is no less clear than the mathematical transcription of it.
3245 is perhaps not English, but there's nothing wrong with "three thousand, two hundred and forty-five".
Some more examples taken from Wikipedia:
The sum of zero and zero. Five less than the product of eight and some value. Seven times the square of some value plus four times the aforesaid value minus ten.
I mean I could go on; these are either not ambiguous, or they're ambiguous only because of a lack of convention.
We don't do it because our attention span doesn't last long enough. We need to keep looking back at what we've already read to really understand what we're reading. Mathematical formulae give us the ability to do that quickly, since the graphical, multidimensional concise layout quickly allows us to find the place we're looking for.
We are not here concerned with so-called computer 'languages', which resemble human languages (English, Sanskrit, Malayalam, Mandarin Chinese, Twi or Shoshone etc.) in some ways but are forever totally unlike human languages in that they do not grow out of the unconscious but directly out of consciousness. Computer language rules ('grammar') are stated first and thereafter used. The 'rules' of grammar in natural human languages are used first and can be abstracted from usage and stated explicitly in words only with difficulty and never completely.
--Walter Ong [Orality And Literacy]
Disagree, it's almost certainly more or less than half; it's incredibly unlikely that these statements would account for exactly half of disagreements. Furthermore the statements in question are rather specific and very unambiguous, not the other way around.
Or rather, the ambiguity is around how specific the statements are.
(Big time commitment though, FYI.)
I've seen two disasters where someone mistakenly thought they could use an ORM to make a database look like a giant, in-memory data structure.
To make a long story short, in both cases, working with a trivial amount of data in an embedded database was painfully slow.
In general, I think we'll see application programming languages expose more semantics that streamline data access. The problem is we keep thinking the way to solve the impedance mismatch is to make the database look like traditional data structures, when the real solution requires languages that have the high degree of flexibility that a SQL database normally has.
For example: A table T might have columns Id, A, B, and C. We can "select Id, B from T". But, if I make a class to map to table T, I probably have fields Id, A, B, and C. Then, if I want to make a little utility function that just looks at Id and B, it's probably taking a full T object, with all of the fields loaded.
In such a case, the programming language needs to evolve. My function doesn't need a full T object, just an object of a class that we, semantically, know is a "T", but also has the fields that I care about. Then, we need enough automation within the compiler to know that, if I add a call to my new utility method, it has to change some code from "select Id, A from T" to "select Id, A, B from T", and leave other code unchanged.
For example, I recently discovered that I can connect PyCharm to a database. Interestingly, after I did just that, the IDE downloaded the database schema and suddenly I got highlighting, autocomplete and refactoring (as well as migrations).
Suddenly I felt like I didn't actually need an ORM.
C# also seems to have a thing called LINQ. I'm not as familiar with it, but my understanding is that it is a language within the language that can represent SQL statements. I'm guessing this might be what you're talking about. Ultimately I think the solution is that:
- there's a way to efficiently express what data we need (ORM has the issue you mentioned)
- an IDE can understand this syntax so it can help with it the same as with the rest of the code.
The first time it had the "update" problem, meaning, it appeared that I had to always load an object into memory before updating it. (2008)
The second time, it wasn't available in Mono. (2011)
Granted, a lot has changed since then.
Dunno, seems like as long as there is crappy data that humans need to clean, enterprise and financial firms will continue to use XL as a critical part of their data infrastructure.
And as long as XL reigns supreme in finance and consulting, seems a bit far fetched to talk about infinitely scalable, sentient and 'self-healing' DBMSs...
Let's focus on getting the data out of XL, then work on the genie in the bottle.
Someone trialed an idea with directly interpreting user text queries at my previous job. They gave up after seeing what people typed in.
People have no concept of what the computer needs to do its job, so you get terse gibberish. At a minimum you'd need the computer to talk back and get them to clarify, but I rather suspect people would hate that. It'd feel like you spent 20 minutes arguing with a computer to run a report.
We spend so much time and energy copying data around into specific data stores to solve specific problems / answer specific questions / enable specific features... It's messy, complex, and adds a ton of overhead.
It could simplify a lot of technology if an operational database could also handle non-trivial analytical workloads.
This quote happens to be in the context of video frames but I think temporal indexing in general is widely under-utilised, both for providing consistent queries (i.e. the database-as-a-value) and for integration of late-arriving data (i.e. bitemporality). It seems particularly relevant when considering how best to synchronise information across the "omnipresent Internet of Things" (not to mention inter-planetary devices, which also get a mention in the post!).
First-off, there’s no history of schema changes. While you can easily go from a NOT NULL to a NULL column, you can’t go from NULL to NOT NULL. This is a deal-breaker when using TT for LoB data when schema changes happen somewhat regularly. TT should have been designed with schema versioning from the start.
The second main issue is the continued lack of tooling and ORM support for TT. While EF6 is compatible with TT (it will safely ignore the SYSTEM TIME columns), it doesn’t let you directly query the history table.
Third - any UPDATE statement, even when it doesn’t actually change any data, causes a new history row to be added, including a full copy of any nvarchar(max) values. Copy-on-write does not seem to be used. That’s a huge waste of space.
Finally, you cannot exclude columns from temporal tracking - so if you have frequently-updated columns containing inconsequential data (e.g. display sort-order columns) then you’ll end up with history table spam.
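For reference, a minimal sketch of the feature in question, using the stock T-SQL system-versioning syntax (table and column names are illustrative):

    CREATE TABLE dbo.Product
    (
        Id        INT IDENTITY PRIMARY KEY,
        Name      NVARCHAR(100) NOT NULL,
        SortOrder INT NOT NULL,  -- versioned like everything else; no way to exclude it
        ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

    -- History has to be queried in raw SQL, e.g. via FOR SYSTEM_TIME,
    -- since EF6 won't surface the history table:
    SELECT Id, Name, ValidFrom, ValidTo
    FROM dbo.Product FOR SYSTEM_TIME ALL
    WHERE Id = 42;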
I don’t know why the SQL Server team invests massive amounts of engineering effort into features like these when other massive priorities exist - like modernising T-SQL’s syntax to be less verbose (parameterised object identifiers and optional predicate clauses, please!) or adding support for column array-types, which would greatly improve the platform.
Other data models with schema after write semantics (for instance trees or graphs) don't have this problem of having to store schema changes.
As I'm working myself on a temporal storage system in my spare time (as well as during my student times at the University of Konstanz), I'd say that you should index each revision instead, storing the timestamp of the transaction commit in a RevisionRootPage for instance, which is the root and main entry to deserialize a revision. Unchanged pages should be shared amongst the different revisions.
Furthermore, due to the fast random access of flash drives and even parallel I/O with PCIe SSDs, you can store page fragments instead of whole pages. The SSD might fetch too much information due to misalignment, but I'd say that access to persistent storage will also become more fine-grained in the future (byte-level granularity instead of block-level, or smaller blocks -- I think even 512 bytes is not that unusual nowadays).
Storing not only the incremental or differential changes to a database page, but also the changes that fall out of a sliding window, makes intermediate full dumps of pages unnecessary, so write and read peaks don't occur and a predictable number of page fragments can be read to reconstruct a page in memory, preferably fetching the fragments in parallel if possible.
It makes zero sense.
Meanwhile, the Windows version, which is used for 99.999% of installs, still does not support Vista-era APIs like elliptic curve certificates because of legacy code nobody has touched in 10+ years.
There's a crazy amount of low-hanging fruit in that product that just isn't being addressed.
There's one strategic reason: developers using it in containers.
I like to think a lot about Kdb, partly because I've used it extensively. Kdb, aside from its core database functionality, can also be an app server. There's a natural concept of being able to open a listen socket or connect to other listeners. Very quickly you can build distributed database systems with just a few lines of code. It's very powerful... I think the "way of thinking" in Kdb has yet to permeate into other database technologies.
And I'll be writing a number 8 to fill the Age box :(
Hahahahahahahaha - such optimism.
Autonomous, self-healing, horizontally-scalable stateless systems are hard-but-doable.
Autonomous, self-healing, horizontally-scalable stateful systems are the stuff nightmares are made of, if your application layer doesn't relax its expectations re: ACID properties of the system.
Alternatively, perhaps you have a business model which is amenable to document-based storage. If you can reduce your business modeling to:
Serializable GUID-keyed objects - Any arbitrary business type instance that can be turned into JSON bytes.
GUID-keyed transactions - Document the specific object GUID(s) to be locked for a transaction.
You can very easily entertain a resilient, clustered solution. Going wide on GUID keys is trivial compared to alternative approaches. If you need to support any degree of SQL grammar or row-based storage, you are in for a completely different dimension of pain. I would never subject myself to this. Instead, I would reach for SQLite or Postgres.
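A rough sketch of that GUID-keyed model in plain SQL (Postgres flavor; all names are illustrative):

    -- Every business object is just JSON bytes behind a GUID.
    CREATE TABLE documents (
        id   UUID PRIMARY KEY,
        body JSONB NOT NULL
    );

    -- A "transaction" names the GUIDs it touches and locks exactly those rows.
    BEGIN;
    SELECT id, body
    FROM documents
    WHERE id = ANY($1)  -- $1: array of GUIDs participating in this unit of work
    FOR UPDATE;
    -- ...rewrite the affected bodies, then:
    COMMIT;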
Essentially, there is a long history of developing tools to make databases more "autonomous". Pavlo's research centers around accurately capturing the state of the database, and training a system to learn the impact of tuning one of the many available knobs.
Tuning == getting the most out of your hardware
Scaling == my perfectly tuned node, with the lowest-latency/mostest IOPS IO subsystem, and highest number of CPUs, and the mostest amount of RAM can't handle my workloads. What now?
Re-architect your monolith and re-consider your transactional boundaries - that's what.
I also see the scenario of radical data ownership not addressed, which may be a black swan event. Open-source competitors / legislation may enforce the use of e.g. data pods, as digital identity storage with managed access right distribution. It's worth a thought what Tim Berners-Lee's Solid / Inrupt would mean for the future of data storage systems. In this scenario, the transmission of personal data from pods needs to be optimised in a secure way.
I also believe this is the first kind of ACID NoSQL database, which goes along with Andy Pavlo's "All programs will execute using strongly consistent ACID transactions".
Definitely going to switch over from PostgreSQL at some point.
> The relational model will still dominant for most applications...Likewise, SQL (or some dialect of it) will remain the de facto language for interacting with a DBMS
The author does believe RDBMS systems will continue to dominate for the next 50 years. I have no particular reason to cast serious doubt about that, but it will certainly be interesting to see what the role and prominence of NoSQL databases is in that future.
And so not everything is relational, which is what SQL is for.
(This is also the first time I've heard that term, but it seems to be several years old and that's roughly what I got from a few quick searches)
Even if the data problem is graph, document, time series, wide table, EAVT etc.
But this is no panacea. The complexity will still be there, and ops folks will have to deal with it through extra layers of indirection and obscurity (which are other ways to spell "abstraction").
That might be the best compromise, but it is still a compromise.
Nope they won't. Natural languages are not precise enough for this.
Sorry, but this feels hilariously wrong.
Databases that work performantly are all about making key architectural decisions about which information is indexed and/or denormalized and/or preprocessed, how, and ensuring that queries are written only in forms that take advantage of that design.
It's the difference between queries taking milliseconds or hours on reasonably-sized datasets of, say, 10GB+.
Because it's shockingly easy to write a query that, otherwise, will not only read every row of your database, but be required to read every row once for every row, because that's literally the only way to calculate the result if the right data isn't suitably indexed or preprocessed. (E.g. "find the number of users who have ever ordered the same item (any item) more than once" on a table indexed only by order ID.)
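To make that parenthetical example concrete (hypothetical orders table, indexed only by order ID):

    -- "Number of users who have ever ordered the same item more than once."
    SELECT COUNT(DISTINCT user_id)
    FROM (
        SELECT user_id, item_id
        FROM orders
        GROUP BY user_id, item_id
        HAVING COUNT(*) > 1
    ) AS repeats;
    -- With only an index on order_id, the engine has to scan and re-sort/hash
    -- the whole table; with a composite index on (user_id, item_id), the
    -- duplicates fall out of a single ordered index walk. Same declarative
    -- query, wildly different cost.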
I don't see how that can ever ever be made "transparent", and it certainly doesn't have anything to do with tighter coupling with a programming framework.
I also liked "They will instead ask questions about data in a natural language". Humans can't answer questions about data in natural language - when data questions come up they have to be couched in formal terms and statistics. I look forward to the day an AI can answer "what is the best decision we can make with this data?" but it runs into a problem we already have in spades. If the asker of the question isn't thinking in formal terms I question whether they are going to be able to consistently pose meaningful questions and interpret the answers. Applied Math is not easy to implement in industry because a company's ability to perform is limited by its management's understanding.
It is easy to nitpick. I'm also glad we have people thinking about the big picture of how data gets handled.
If you prepackage your queries, or do some sort of profiling-guided optimization, the database could learn what it needs to index to answer your queries efficiently. There's already some precedent for this, with how Facebook, for instance, packages their GraphQL queries ahead of time to avoid parsing them at runtime.
Alternatively, you could maybe imagine models where storage is sharded along with the code for running it, so you, say, declaratively specify the data you need (count for this grouped by that, etc.) and then at runtime the data is pre-aggregated per server hosting a certain shard.
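A hand-wavy sketch of what the profiling-guided variant could emit (all names hypothetical):

    -- The recurring, prepackaged query the system observes:
    SELECT customer_id, COUNT(*) AS orders_placed
    FROM orders
    WHERE placed_at >= DATE '2024-01-01'
    GROUP BY customer_id;

    -- The supporting index it could decide to create on its own:
    CREATE INDEX ix_orders_placed_at_customer
        ON orders (placed_at, customer_id);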
However, I doubt we will ever derive mappings from application code or (retroactive) profile guided optimization. The biggest problem will become deployments.
We would need to couple application code deployments to deployments of databases even more than we do today (which is a giant problem).
When a new index is needed or new data types we would need to “prime” the datastore in order to get optimal queries from the start.
Even if this application-database deployment issue is fixed through automation it’s likely we wouldn’t want it to be coupled. Database index creation or schema change can take hours or longer. Application code needs to be very nimble. In the days of CI we don’t want Feature A’s database migration to block the deployment of Feature B or BugFix C.
- The Next 50 Years of Databases (2015)
- The Next 46 Years of Databases
I'm not wrong. It's a thing.