
TaBERT: A new model for understanding queries over tabular data - speculator
https://ai.facebook.com/blog/tabert-a-new-model-for-understanding-queries-over-tabular-data/
======
ianhorn
I'd like to see something that could do this, handling the awfulness of real
world tabular data. "What country has the highest GDP? Okay, which table has
GDP? Is it the country_gdp table? No, that's an old one that hasn't been
written to in 3 years. Ah here it is, but you need to join against
`geopolitics`, but first dedup the Crimea data, since it's showing up in two
places, and we can't remember why it got written twice there. Also, you need
to exclude June 21 because we had an outage on the brazil data that day. What
do you mean some of the country_id rows are NULL?" And so on. I dream that
someday there's a solution for that. That's a looooong ways away, I'd bet.
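
A minimal sketch, in Python + sqlite3, of the kind of hand-written query this describes. Everything here is invented for illustration (the schemas, columns like `country_id` and `as_of_date`, and the data itself); the point is the dedup/exclusion/NULL-filtering ceremony around the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country_gdp (country_id INTEGER, gdp REAL, as_of_date TEXT);
CREATE TABLE geopolitics (country_id INTEGER, country TEXT);
-- hypothetical data: a doubled row, an outage day, an orphan row
INSERT INTO geopolitics VALUES (1, 'Brazil'), (2, 'Ukraine'), (2, 'Ukraine');
INSERT INTO country_gdp VALUES
  (1, 1.8,  '2020-06-20'),
  (1, 9.9,  '2020-06-21'),   -- outage day: bogus numbers
  (2, 0.15, '2020-06-20'),
  (NULL, 3.0, '2020-06-20'); -- NULL country_id, can't be joined
""")

sql = """
SELECT geo.country, MAX(g.gdp) AS gdp
FROM country_gdp AS g
JOIN (SELECT DISTINCT country_id, country
      FROM geopolitics) AS geo           -- dedup before joining
  ON geo.country_id = g.country_id
WHERE g.country_id IS NOT NULL           -- drop the orphan rows
  AND g.as_of_date <> '2020-06-21'       -- exclude the outage day
GROUP BY geo.country
ORDER BY gdp DESC
LIMIT 1
"""
country, gdp = conn.execute(sql).fetchone()  # top country after cleanup
```

None of the tribal knowledge here (which day was an outage, why a row is doubled) lives in the schema, which is exactly why a model looking only at table contents can't recover it.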

~~~
lacker
The tough thing is that a common failure mode of many of the modern AI
solutions is some output that looks superficially correct, but doesn't
actually map correctly to the real world. When you want a table of data, it
seems like the danger will be high that the table looks correct but isn't
actually accurate. The problem here is about keeping sloppy data out of your
table, which is tough for a statistical AI.

So yeah, I would expect this to be a long ways away.

~~~
ianhorn
For more interesting problems than "which country has the highest GDP?", it's
about more than just sloppy data. If you want to include any covariates, how
do you know which ones to include? You could try to include everything
predictive, but then you'll use the client margin column to predict client
revenue or something. Or you'll control for a column causally downstream,
biasing your estimates, like estimating revenue differences and controlling
for page views in an experiment that affects page views. There's so much that
we just don't include in our databases that's crucial to using them, and it's
not just about sloppiness.
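
The page-views example can be made concrete with a toy simulation (the linear model and all numbers are invented): the treatment moves revenue only through page views, so the total effect is real, but "controlling for" page views, a post-treatment variable, makes it vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = rng.integers(0, 2, n).astype(float)  # experiment arm (0/1)
p = 2.0 * t + rng.normal(size=n)         # page views: affected by the experiment
r = 3.0 * p + rng.normal(size=n)         # revenue: driven by page views

def ols_coef(y, *cols):
    """OLS coefficients [intercept, *cols] via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = ols_coef(r, t)[1]        # ~6: the true total effect of t on revenue
adjusted = ols_coef(r, t, p)[1]  # ~0: controlling for page views erases it
```

Nothing in the tables themselves says which regression is the right one; that's the causal knowledge the comment is pointing at.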

------
abhgh
Does anyone know how it relates/compares to Google's TaPaS? [1] I notice this
paper doesn't refer to it.

[1] [https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html)

~~~
nl
They both report performance on WikiTableQuestions (called WikiTQ in the TAPAS
paper).

Wang et al. (previous SOTA): 44.5

TAPAS: 48.8

TaBERT: 52.3

------
philprx
Git repo or it doesn't exist ;-)

Seriously, if this is not available, what are the alternatives?

I've seen some NLP + storage projects in the past but I don't recall them.
(Even remotely connected: there was something to convert PDFs into machine-
readable data.)

Is awesome-nlp [https://github.com/keon/awesome-nlp](https://github.com/keon/awesome-nlp) a good starting point there?

~~~
jhj
It's in the paper:

[https://github.com/facebookresearch/tabert](https://github.com/facebookresearch/tabert)

~~~
IfOnlyYouKnew
No pretrained models, though, unfortunately.

~~~
myth_drannon
They do provide them. The link is in the GitHub repo; it's a Google Drive shared folder:
[https://drive.google.com/drive/folders/1fDW9rLssgDAv19OMcFGg...](https://drive.google.com/drive/folders/1fDW9rLssgDAv19OMcFGgFJ5iyd9p7flg)

------
neeeeees
Seems similar to this work out of Salesforce a few years ago:
[https://www.salesforce.com/blog/2017/08/salesforce-research-ai-talk-to-data.html](https://www.salesforce.com/blog/2017/08/salesforce-research-ai-talk-to-data.html)

------
sriku
TaBERT is no longer on the Spider leaderboard? [https://yale-lily.github.io/spider](https://yale-lily.github.io/spider). The top entry is
"RATSQL v2 + BERT", testing at 65.6 for exact matches.

------
KasianFranks
NLP has come pretty far: "Released by Symantec in 1985 for MS-DOS computers,
Q&A's flat-file database and integrated word processing application is cited
as a significant step towards making computers less intimidating and more user
friendly. Among its features was a natural language search function based on a
600 word internal vocabulary."
[https://en.wikipedia.org/wiki/Q%26A_(Symantec)](https://en.wikipedia.org/wiki/Q%26A_\(Symantec\))

------
j4ah4n
Does the following mean that one can map/train to runtimes that give proper
results based on the underlying data?

"A representative example is semantic parsing over databases, where a natural
language question (e.g., “Which country has the highest GDP?”) is mapped to a
program executable over database (DB) tables."

Could it be thought of in the same fashion as Resolvers in GraphQL integrated
into BERT?
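
Roughly, yes: the model's output is a program (typically SQL) that is then executed over the tables, so correctness is judged by the result the program returns. A toy sketch of that pipeline, where the table, data, and the "parsed" query are all invented; the hard part, producing `sql` from the question, is what these semantic parsers learn:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_gdp (country TEXT, gdp REAL)")
conn.executemany("INSERT INTO country_gdp VALUES (?, ?)",
                 [("USA", 21.4), ("China", 14.3), ("Japan", 5.1)])

question = "Which country has the highest GDP?"
# A semantic parser's job is to emit this program from the question:
sql = "SELECT country FROM country_gdp ORDER BY gdp DESC LIMIT 1"
(answer,) = conn.execute(sql).fetchone()  # executed over the DB table
```

The GraphQL-resolver analogy is loose at best: a resolver is fixed code bound to a schema field, whereas here the program itself is generated fresh for each question.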

------
louisstow
Does anything like this exist for XML documents? Wonder if it could be used
for identifying interesting information in web pages.

------
runawaybottle
I thought Google already did something similar?

Are we entering deep copycat culture?

------
SheinhardtWigCo
Honest version:

> Why it matters:

> Improving NLP allows us to create better, more seamless human-to-machine
> interactions for tasks ranging from identifying dissidents to querying for
> desperate laid-off software engineers. TaBERT enables business development
> executives to improve their accuracy in answering questions like “Which hot
> app should we buy next?” and “Which politicians will take our bribes?” where
> the answer can be found in different databases or tables.

> Someday, TaBERT could also be applied toward identifying illegal immigrants
> and automated fact checking. Third parties often check claims by relying on
> statistical data from existing knowledge bases. In the future, TaBERT could
> be used to map Facebook posts to relevant databases, thus not only verifying
> whether a claim is true, but also rejecting false, divisive and defamatory
> information before it's shared.

~~~
nl
Why are you making fake quotes here?

Most of the data in the world is in tables, and most people don't speak SQL.
Managing and querying that data has been one of the main purposes of computers,
and there is nothing nefarious at all about that.

There have been many attempts at translating between the two.

