Hacker News
TaBERT: A new model for understanding queries over tabular data (facebook.com)
153 points by speculator on July 3, 2020 | 25 comments



I'd like to see something that could do this while handling the awfulness of real world tabular data. "What country has the highest GDP? Okay, which table has GDP? Is it the country_gdp table? No, that's an old one that hasn't been written to in 3 years. Ah, here it is, but you need to join against `geopolitics`, but first dedup the Crimea data, since it's showing up in two places; we can't remember why it got written to twice there. Also, you need to exclude June 21 because we had an outage on the Brazil data that day. What do you mean some of the country_id rows are NULL?" And so on. I dream that someday there's a solution for that. That's a looooong ways away, I'd bet.
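
To make it concrete, all that tribal knowledge boils down to something like this hypothetical sketch (every table and column name here is invented):

    SELECT g.country_name, e.gdp
    FROM economy_metrics AS e           -- the *current* GDP table, not the stale country_gdp
    JOIN geopolitics AS g
      ON g.country_id = e.country_id    -- the inner join also drops the NULL country_id rows
    WHERE e.report_date <> DATE '2020-06-21'                 -- the Brazil outage day
      AND NOT (g.region = 'Crimea' AND e.source = 'legacy')  -- drop the duplicate Crimea write
    ORDER BY e.gdp DESC
    LIMIT 1;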


The tough thing is that a common failure mode of many modern AI solutions is output that looks superficially correct but doesn't actually map to the real world. When you want a table of data, the danger is high that the table looks right but isn't actually accurate. The problem here is keeping sloppy data out of your table, which is tough for a statistical AI.

So yeah, I would expect this to be a long ways away.


For more interesting problems than "which country has the highest GDP?", it's about more than just sloppy data. If you want to include any covariates, how do you know which ones to include? You could try to include everything predictive, but then you'll use the client margin column to predict client revenue or something. Or you'll control for a column causally downstream, biasing your estimates, like estimating revenue differences and controlling for page views in an experiment that affects page views. There's so much that we just don't include in our databases that's crucial to using them, and it's not just about sloppiness.
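
A toy linear version of that mediator trap (all coefficients invented for illustration): suppose the experiment assigns treatment T, page views follow V = a*T + noise, and revenue follows R = b*V + noise. The total effect of T on revenue is a*b, but a regression of R on both T and V puts a coefficient of roughly zero on T, because V fully mediates the effect, so you'd wrongly conclude the experiment moved nothing.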


tbf that problem is also quite tough for data scientists; the model doesn't need to be flawless, just better


This was my experience trying to work with the Johns Hopkins COVID data.

I don't know if Johns Hopkins became the canonical data source because they were amongst the first to have public data and charts, but honestly I was kinda surprised at the low quality, coming from a group called "Center for Systems Science and Engineering". Their data was far harder to use than it needed to be, even months into the pandemic.

Fortunately there were a handful of other projects dedicated to making it sane: resolving the inconsistencies, unreconciled format changes, etc. That was really helpful.


Which projects do you think have particularly high-quality, easy-to-work-with data? I was using the Johns Hopkins data but it’s so messy...


Metabase[0] lets you ask the question step by step. You pick the table, pick the filter, pick the aggregation and so on. I have been working in BI long enough to know that even this isn't going to answer all questions. It is cleaning data, filtering out stuff that somehow made it into the data set but shouldn't have, and other issues like that which make this difficult to automate. There is usually a bunch of things not documented. I will be watching this with interest.

[0]https://www.metabase.com/
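
Each of those step-by-step choices maps onto a clause in the query the tool builds for you; roughly, against a made-up schema:

    SELECT region, COUNT(*) AS n_orders   -- the aggregation you picked
    FROM orders                           -- the table you picked
    WHERE status = 'completed'            -- the filter you picked
    GROUP BY region;                      -- the breakout you picked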


The problem TaBERT supposedly solves is the speed of this process: you can have hundreds of tables, with dozens of columns and many unobvious relationships, and constructing a query manually can take a lot of effort even with UI automation. The idea is that you have a system which builds this query instantly from a short description.


This could be seriously addressed with a configurable rule system similar to email filters + search. You have to store any of the metadata factors you want to consider in a companion index that can allow complex filtering or decision tree splits, then for introspection of SQL-like data sources, you can follow key and type relationships to determine what’s joinable.

Perhaps outputting several potential answers at the end, each explaining the “pathway” it chose to use (filters / decision tree splits + graphical path through keys / joinable types in the underlying data), and allow the user to select one or more results that they believe are valid pathways of criteria, or perhaps tweak individual filters and joins in the listed pathway for a given result.

I think this would offer a lot more value than trying to get a full natural language interface that “just works” on complex filtering conditions, where getting just one answer back (instead of seeing the variety of pathways the system could choose and what influence each step has on the end result) leaves too many cases where the ML system fails with unrealistic results.
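
For the SQL-introspection piece, the declared key relationships are already queryable; a minimal sketch against the standard information_schema (PostgreSQL-flavored, and it assumes the foreign keys were actually declared):

    SELECT tc.table_name   AS from_table,
           kcu.column_name AS from_column,
           ccu.table_name  AS to_table,
           ccu.column_name AS to_column
    FROM information_schema.table_constraints AS tc
    JOIN information_schema.key_column_usage AS kcu
      ON kcu.constraint_name = tc.constraint_name
    JOIN information_schema.constraint_column_usage AS ccu
      ON ccu.constraint_name = tc.constraint_name
    WHERE tc.constraint_type = 'FOREIGN KEY';

Columns that merely share a name and type, with no declared key, would still need the heuristic matching described above.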


Out of curiosity, have you tried any of the data catalogs? Alation, Informatica, Collibra? Not that they solve this problem for you, but they make it solvable.


What do these do? IT constantly tells me we need data catalogues; they talk about it and invest money, but I have never seen the results. My team and I keep most table metadata in JSON files; not pretty, but at least I can see what each table does.


Does anyone know how it relates/compares to Google's TaPaS? [1] I notice this paper doesn't refer to it.

[1] https://ai.googleblog.com/2020/04/using-neural-networks-to-f...


They both report performance on WikiTableQuestions (called WikiTQ in the TAPAS paper).

Wang et al. (previous SOTA): 44.5

TAPAS: 48.8

TaBERT: 52.3


Git repo or it doesn't exist ;-)

Seriously, if this is not available, what are the alternatives?

I've seen some NLP + storage projects in the past, but I don't recall them. (Even remotely connected: there was something to convert PDFs into machine-readable data.)

Is this AwesomeNLP https://github.com/keon/awesome-nlp a good starting point there?



No pretrained models, though, unfortunately.


They do provide them; the link is in the GitHub repo. Google Drive shared folder: https://drive.google.com/drive/folders/1fDW9rLssgDAv19OMcFGg...


Seems similar to this work out of Salesforce a few years ago: https://www.salesforce.com/blog/2017/08/salesforce-research-...


Is TaBERT no longer on the Spider leaderboard? https://yale-lily.github.io/spider . The top entry is "RATSQL v2 + BERT", testing at 65.6 exact match.


NLP has come pretty far: "Released by Symantec in 1985 for MS-DOS computers, Q&A's flat-file database and integrated word processing application is cited as a significant step towards making computers less intimidating and more user friendly. Among its features was a natural language search function based on a 600 word internal vocabulary." https://en.wikipedia.org/wiki/Q%26A_(Symantec)


Does the following mean that one can map/train to executable programs that give proper results based on the underlying data?

"A representative example is semantic parsing over databases, where a natural language question (e.g., “Which country has the highest GDP?”) is mapped to a program executable over database (DB) tables."

Could it be thought of in the same fashion as Resolvers in GraphQL integrated into BERT?


Does anything like this exist for XML documents? Wonder if it could be used for identifying interesting information in web pages.


I thought Google already did something similar?

Are we entering deep copycat culture?


Honest version:

> Why it matters:

> Improving NLP allows us to create better, more seamless human-to-machine interactions for tasks ranging from identifying dissidents to querying for desperate laid-off software engineers. TaBERT enables business development executives to improve their accuracy in answering questions like “Which hot app should we buy next?” and “Which politicians will take our bribes?” where the answer can be found in different databases or tables.

> Someday, TaBERT could also be applied toward identifying illegal immigrants and automated fact checking. Third parties often check claims by relying on statistical data from existing knowledge bases. In the future, TaBERT could be used to map Facebook posts to relevant databases, thus not only verifying whether a claim is true, but also rejecting false, divisive and defamatory information before it's shared.


Why are you making fake quotes here?

Most of the data in the world is in tables, and most people don't speak SQL. A very large purpose of computers has been for managing and querying this data and there is nothing nefarious at all about that.

There have been many attempts at translating between the two.



