Hacker News new | past | comments | ask | show | jobs | submit login

I worked for Cleargraph, a startup which built a natural language query layer on top of RDBMSes and which we sold to Tableau. We reached the same conclusions as you: that such a system must properly handle ambiguous queries, and users need to explicitly understand the results they're viewing.

The way our system worked, broadly, was with a computational linguistics approach rather than an NLP one. We used an augmented bottom-up parsing algorithm to assemble a user's query into an AST, then translated that AST into both SQL (though that changed post-acquisition) and back into a canonical English-like query.

For ambiguous inputs, the parser would produce multiple candidate ASTs which we could offer to the user in the form of search results, for them to select the expression they preferred (solving your point 1).

The parser was constructed to use semantic information about the database schema to infer fields in the case of underspecification (handling your point 2). For example, we might understand about the `Series` table that `name` was the "display column", and should be the default field by which record-level results were displayed. Semantic information could, to various degrees, be inferred - always with the option to be manually overridden.

It seemed clear to us that a schema-agnostic approach to SQL generation would be untenable for a real product. Annotation of the schema over time (starting with basic information like field names, synonyms, and types, then adding semantic information like currency or unit labels) is key to letting a query generator produce sensible results to underspecified queries. For this reason alone, I'm skeptical of the quality of results that something like Photon can produce. Every schema differs in syntax and semantics. Perhaps a big enough corpus of schemas can allow for a generalized model. The more you can constrain the domain in building such a system, the better the results. (look at Statmuse [0]: pretty impressive query capability over sports facts).

[0]: https://www.statmuse.com/

Statmuse works a lot better than I expected. It needs to be able to answer which player is the hottest/cutest/most handsome though. Huge gap that needs to be corrected, tout suite.

Just threw some 'curveballs' at statsmuse and it held up pretty well. Good argument for restricting the domain as you say.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact