NoDB: Efficient Query Execution on Raw Data Files (seas.harvard.edu)
76 points by pdq on Dec 18, 2015 | 9 comments



Searched around for the PostgresRaw repo, think this is it? https://github.com/HBPSP8Repo/NoDB


My company is working [#] on a domain-specific language with similar premises. When running a script, we load data from CSV, and use a clever caching structure to avoid redoing any work already done on a previous run.

https://www.lokad.com/envision-more-technical-overview

We've actually encountered two kinds of work with such data sets: exploratory work (when a data scientist is running brand new queries on a fixed data set) and production work (when a pre-written script is run on new data on a daily or weekly basis, maybe with some user-provided parameters). These are separate optimization problems: the first is usually solved by caching the results of intermediate processing, while the second is mostly about optimizing the script execution.
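
To make the first case concrete, here is a minimal Python sketch of the intermediate-result caching idea, keyed on the input file plus the processing step. This is not Envision's actual mechanism; the cache location and function names are made up.

    import hashlib
    import os
    import pickle

    CACHE_DIR = "/tmp/query_cache"  # hypothetical cache location
    os.makedirs(CACHE_DIR, exist_ok=True)

    def cached(step_name, input_path, compute):
        """Return compute()'s result, reusing a cached copy while the input is unchanged."""
        with open(input_path, "rb") as f:
            digest = hashlib.sha256(step_name.encode() + f.read()).hexdigest()
        cache_path = os.path.join(CACHE_DIR, digest + ".pkl")
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                return pickle.load(f)  # same data, same step: skip the work
        result = compute()
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result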

In the latter case, I think one of the main difficulties with SQL is that queries are independent, while a typical data analysis session requires many queries. With typical database pipelines, there's no obvious spot to perform global static analysis and optimize the query sequence as a whole (for instance, finding common subexpressions or shared indices). Stored procedures are a good start, but typical query optimizers just aren't designed to perform any kind of procedure-level analysis.
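
As a toy illustration of the kind of rewrite a procedure-level analysis could do automatically, here is a Python/sqlite3 sketch where an aggregation shared by two queries is materialized once by hand; the table and column names are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                     [("EU", "a", 10.0), ("EU", "b", 5.0), ("US", "a", 7.0)])

    # The expensive aggregation both queries need, computed once instead of twice.
    conn.execute("""CREATE TEMP TABLE region_totals AS
                    SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

    top_region = conn.execute(
        "SELECT region FROM region_totals ORDER BY total DESC LIMIT 1").fetchone()
    grand_total = conn.execute(
        "SELECT SUM(total) FROM region_totals").fetchone()[0]
    print(top_region, grand_total)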

[#] and we're looking to hire another compiler expert, if anyone's interested.


I wonder how this compares to HDT [1], a compressed format for RDF which can be queried directly from the HDT files without having to parse or index them. It's sad that research coming out of Harvard doesn't mention closely related work such as HDT.

1. http://www.rdfhdt.org/what-is-hdt/


It does not really compare; I guess that's also why it's not mentioned.

The whole premise of the paper is that data can be analyzed in situ. That means loading it from its original place, in its original format, without any prior transformation. This is in contrast to the traditional approach of database systems, where the data has to be loaded into the database first.

This paper describes how unprocessed, unindexed data can be used efficiently to answer queries with a database system. The novelty of the approach lies mostly in building an index on the fly that can be reused by later queries, and in examining the idea of operating directly on the raw files. As for the efficient loading of CSV files, which the NoDB paper also mentions, I think those details are better described in a later paper from TU München [1] that examines this aspect in more detail.
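
A rough Python sketch of the "index built on the fly over the raw file" idea: the first scan remembers the byte offset of every row, so later queries can seek straight to the rows they need instead of re-parsing the whole file. This only conveys the flavor of NoDB's positional maps, not the paper's actual data structures; headers, encodings and other details are ignored.

    import csv

    positional_map = []  # row number -> byte offset into the raw CSV file

    def scan_and_index(path, predicate):
        """First query: full scan that records row offsets as a side effect."""
        matches = []
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                positional_map.append(offset)
                fields = next(csv.reader([line.decode("utf-8")]))
                if predicate(fields):
                    matches.append(fields)
        return matches

    def fetch_row(path, row_number):
        """Later queries reuse the positional map to jump directly to a row."""
        with open(path, "rb") as f:
            f.seek(positional_map[row_number])
            return next(csv.reader([f.readline().decode("utf-8")]))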

HDT for RDF, or Parquet [2] and ORC files [3] in the Big Data space, are binary formats where the data has already been processed and stored more efficiently than in plain old text CSV files. Creating these files is already comparable to loading the data into a database; the only difference is that the storage format is open and can be used by many systems. So it's a completely different setting.
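
A tiny illustration of that point, assuming pandas with a Parquet engine such as pyarrow is available (file names are made up): converting a CSV to Parquet does the parsing, typing, and columnar encoding up front, much like a database load, and later readers get typed columns without re-parsing any text.

    import pandas as pd

    df = pd.read_csv("orders.csv")        # parse and type the raw text once
    df.to_parquet("orders.parquet")       # write the pre-processed columnar file

    orders = pd.read_parquet("orders.parquet")  # later reads skip text parsing entirely
    print(orders.dtypes)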

Still, it's an interesting thought to make databases aware of the index information in those file formats beyond CSV, so that they can also be used directly without loading.

[1] http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf

[2] https://github.com/Parquet/parquet-format

[3] https://cwiki.apache.org/confluence/display/Hive/LanguageMan...


Sounds a bit like a cross between the (now defunct) Google Sawzall project and the SQLite csvtable module:

https://github.com/softace/sqliteodbc/blob/master/csvtable.c


Never heard of the SQLite csvtable module, but there is the excellent csvsql, part of csvkit:

http://csvkit.readthedocs.org/en/0.9.1/scripts/csvsql.html

Here's an example of how to download the SF health inspection data, unzip it, build the SQLite schema (the dataset contains multiple tables), insert the CSVs into their respective tables, and query them, all from the command line.

csvsql handles inferring the schema from the CSV data and doing the import work; the SQL commands could obviously all be done by echoing into SQLite directly. But csvsql will also let you load a CSV into an in-memory database and run a SQL query against it, for those times when you just need to quickly filter the data. Here's an example:

http://2015.padjo.org/tutorials/mapping/077-ok-schools-quake...

Tremendously useful tool. I can say without hyperbole that it has changed the scope of how I do data science, even though it literally does nothing that I could not do via SQLite and Python alone (csvkit is basically a command-line wrapper, written in Python, around csv, SQLAlchemy, and so forth).
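
For comparison, a bare-bones sketch of what csvsql automates, using only Python's stdlib: create a matching SQLite table, insert the rows, and query them. Column types are left as TEXT here, since csvkit's type inference is exactly the part being skipped, and the file name is made up.

    import csv
    import sqlite3

    def load_csv_into_sqlite(csv_path, table, conn):
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            columns = ", ".join(f'"{name}" TEXT' for name in header)
            conn.execute(f'CREATE TABLE "{table}" ({columns})')
            placeholders = ", ".join("?" for _ in header)
            conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

    conn = sqlite3.connect(":memory:")
    load_csv_into_sqlite("inspections.csv", "inspections", conn)  # hypothetical file
    print(conn.execute('SELECT COUNT(*) FROM "inspections"').fetchone())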


You can already use Spark SQL to query files (e.g. CSV, JSON, Parquet) directly on various storage systems without ETL. It doesn't have as much research novelty as NoDB, but it is used in a lot of production data pipelines and ad-hoc analysis on "raw data files".
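
For instance, something along these lines with PySpark (paths and column names are made up; this uses the newer SparkSession API, whereas the Spark 1.x releases of 2015 needed SQLContext plus the spark-csv package):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-files").getOrCreate()

    # Query the CSV files in place; no load step into a database.
    events = spark.read.csv("/data/events/*.csv", header=True, inferSchema=True)
    events.createOrReplaceTempView("events")

    spark.sql("""
        SELECT user_id, COUNT(*) AS n
        FROM events
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).show()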


Isn't this what COBOL has been doing for decades?


Great paper.

This shares a lot of goals with Apache Impala, mainly around making raw data access as efficient as possible.

http://pandis.net/resources/cidr15impala.pdf



