
NoDB: Efficient Query Execution on Raw Data Files - pdq
http://stratos.seas.harvard.edu/publications/nodb-efficient-query-execution-raw-data-files-0
======
hnruss
Searched around for the PostgresRaw repo, think this is it?
[https://github.com/HBPSP8Repo/NoDB](https://github.com/HBPSP8Repo/NoDB)

------
victorNicollet
My company is working [#] on a domain-specific language with similar premises.
When running a script, we load data from CSV files and use a clever caching
structure to avoid redoing any work already done on a previous run.

[https://www.lokad.com/envision-more-technical-overview](https://www.lokad.com/envision-more-technical-overview)

We've actually encountered two kinds of work with such data sets: exploratory
work (a data scientist running brand-new queries on a fixed data set) and
production work (a pre-written script run on new data on a daily or weekly
basis, maybe with some user-provided parameters). These are separate
optimization problems: the first is usually solved by caching the results of
intermediate processing, while the second is mostly about optimizing the
script execution itself.
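
To make the exploratory case concrete, here's a minimal sketch (plain Python
rather than our DSL, with made-up names) of caching an intermediate result
keyed by a content hash of the input file, so an unchanged input means no
recomputation on the next run:

    import hashlib
    import os
    import pickle

    CACHE_DIR = ".query_cache"  # illustrative cache location

    def cached_step(step_name, source_path, compute):
        """Memoize an intermediate result, keyed by the step name and the
        content hash of the input file, so reruns on unchanged data skip the work."""
        with open(source_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        os.makedirs(CACHE_DIR, exist_ok=True)
        cache_file = os.path.join(CACHE_DIR, "%s-%s.pkl" % (step_name, digest))
        if os.path.exists(cache_file):
            with open(cache_file, "rb") as f:
                return pickle.load(f)      # hit: reuse the result from a previous run
        result = compute(source_path)      # miss: do the expensive work once
        with open(cache_file, "wb") as f:
            pickle.dump(result, f)
        return result

A call like cached_step("monthly_totals", "orders.csv", aggregate_orders)
(where aggregate_orders is whatever function computes the intermediate result)
then only pays the parsing and aggregation cost when orders.csv actually
changes.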

In the latter case, I think one of the main difficulties with SQL is that
queries are independent and a typical data analysis session requires many
queries. With typical database pipelines, there's no obvious spot to perform
global static analysis and optimize the query sequence as a whole (for
instance, finding common subexpressions to share, or indices worth building).
Stored procedures are a good start, but typical query optimizers just aren't
designed to perform any kind of procedure-level analysis.

[#] and we're looking to hire another compiler expert, if anyone's interested.

------
sktrdie
I wonder how this compares to HDT [1], a compressed format for RDF which can
be queried directly from the HDT files without having to parse/index them.
It's sad that research coming out of Harvard doesn't mention other quite
closely related work such as HDT.

[1] [http://www.rdfhdt.org/what-is-hdt/](http://www.rdfhdt.org/what-is-hdt/)

~~~
mkaufmann
It does not really compare; I guess that's also why it's not mentioned.

The whole premise of the paper is that data can be analyzed in situ. That
means loading it from its original place, in its original format, without any
prior transformations. This is in contrast to the traditional approach of
database systems, where the data has to be loaded into the database first.

This paper describes how unprocessed, unindexed data can be used efficiently
to answer queries with a database system. The novelty is mostly that they
build an index on the fly that can be reused later, and that they examine the
idea of querying the raw files directly. As for efficient loading of CSV
files, which the NoDB paper also touches on, I think those details are better
described in a later paper from TU München [1] that examines this aspect in
more detail.
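
To illustrate the on-the-fly index idea (the paper's positional map), here is
a rough sketch in Python. It only records byte positions for one field per
row, ignores quoting/escaping, and the names and structure are my own
assumptions rather than the paper's implementation:

    def scan_field(path, field_index, positional_map):
        """Return the values of one CSV column; build or reuse a map of
        (start, length) byte positions for that column."""
        values = []
        with open(path, "rb") as f:
            if field_index not in positional_map:
                # First query touching this field: full scan, record positions.
                positions = []
                offset = 0
                for line in f:
                    fields = line.rstrip(b"\r\n").split(b",")
                    start = offset + sum(len(x) + 1 for x in fields[:field_index])
                    positions.append((start, len(fields[field_index])))
                    values.append(fields[field_index].decode())
                    offset += len(line)
                positional_map[field_index] = positions
            else:
                # Later queries: jump straight to the recorded positions and
                # skip tokenizing the rest of each row.
                for start, length in positional_map[field_index]:
                    f.seek(start)
                    values.append(f.read(length).decode())
        return values

A dict shared across queries (pmap = {}) then lets the second and third query
over the same raw file avoid most of the parsing work, which is roughly the
effect the paper is after.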

HDT for RDF, or Parquet [2] and ORC files [3] in the Big Data space, are
(binary) formats where the data is already processed and stored more
efficiently than plain old text CSV files. Creating these files is already
comparable to just loading the data into a database; the only difference is
that the storage format is open and can be used by many systems. So it's a
completely different setting.

Still, it's an interesting thought to make databases aware of the indexed
information in those file formats besides CSV, so that these can also be used
directly without loading.

[1] [http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf](http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf)

[2] [https://github.com/Parquet/parquet-format](https://github.com/Parquet/parquet-format)

[3] [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC)

------
morebetterer
Sounds a bit like a cross between the (now defunct) Google Sawzall project and
the sqlite csvtable module:

[https://github.com/softace/sqliteodbc/blob/master/csvtable.c](https://github.com/softace/sqliteodbc/blob/master/csvtable.c)

~~~
danso
Never heard of the SQLite csvtable module, but there is the excellent csvsql,
part of csvkit:

[http://csvkit.readthedocs.org/en/0.9.1/scripts/csvsql.html](http://csvkit.readthedocs.org/en/0.9.1/scripts/csvsql.html)

Here's an example of how to download the SF health inspection data, unzip it,
build the SQLite schema (the dataset contains multiple tables), insert the
CSVs into their respective tables, and query them, all from the command line.

csvsql handles the inference of the schema from the CSV data and the import
work; the SQL commands can obviously all be done by just echoing into SQLite
directly. But csvsql will also let you load a CSV into an in-memory database
and run a SQL query against it, for those times when you just need to quickly
filter the data. Here's an example:

[http://2015.padjo.org/tutorials/mapping/077-ok-schools-quakes/#trimming-the-schools-data](http://2015.padjo.org/tutorials/mapping/077-ok-schools-quakes/#trimming-the-schools-data)

Tremendously useful tool. I can say without hyperbole that it has changed the
scope of how I do data science, even though it literally does nothing that I
could not do via SQLite and Python alone (csvkit is basically a command-line
wrapper written in Python around csv, SQLAlchemy, and so forth).
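
For the curious, a rough Python-only equivalent of what csvsql --query does
(minus the type inference, so every column ends up as TEXT; the file and table
names here are made up) is only a few lines:

    import csv
    import sqlite3

    def query_csv(csv_path, sql, table="data"):
        """Load a CSV into an in-memory SQLite table and run one query against it."""
        conn = sqlite3.connect(":memory:")
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            cols = ", ".join('"%s"' % c for c in header)
            conn.execute("CREATE TABLE %s (%s)" % (table, cols))
            placeholders = ", ".join("?" for _ in header)
            conn.executemany("INSERT INTO %s VALUES (%s)" % (table, placeholders), reader)
        return conn.execute(sql).fetchall()

    # e.g. query_csv("inspections.csv", "SELECT COUNT(*) FROM data")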

------
rxin
You can already use Spark SQL to query files (e.g. CSV, JSON, Parquet)
directly on various storage systems without ETL. It doesn't have as much
research novelty as NoDB, but it is used in a lot of production data pipelines
and ad-hoc analysis on "raw data files".
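
For instance (the path and column names below are placeholders), pointing
Spark SQL at a raw CSV file takes a few lines of PySpark and no load step:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-files").getOrCreate()

    # Register the raw file as a temporary view and query it in place.
    events = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("s3a://bucket/events.csv"))
    events.createOrReplaceTempView("events")

    spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()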

------
jslakro
Isn't this what COBOL has been doing for decades?

------
monstrado
Great paper.

This has a lot of goals in common with Apache Impala, mainly around making raw
data access as efficient as possible.

[http://pandis.net/resources/cidr15impala.pdf](http://pandis.net/resources/cidr15impala.pdf)

