
This point is interesting:

So indeed, it does not matter whether the data is stored on disk or in memory --- column-stores are a win for these types of workloads. However, the reason is totally different.

He says that for tables on disk, column layout is a win because less data is transferred from disk. But for tables in memory, the win comes from being able to vectorize operations on adjacent values.
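A rough sketch of the in-memory point, using NumPy as a stand-in (my own illustration, not from the article):

    # Column layout: one contiguous array per column; a sum is a single
    # vectorized (SIMD-friendly) pass over adjacent values.
    import numpy as np

    n = 1_000_000
    prices = np.random.rand(n)
    col_total = prices.sum()

    # Row layout: the same values scattered across per-row records; summing
    # means a Python-level loop chasing a pointer for every record.
    rows = [{"id": i, "price": p} for i, p in enumerate(prices)]
    row_total = sum(r["price"] for r in rows)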

I also think the convergence of these two approaches is interesting: one is from a database POV; the other is from a programming language POV.

Column Databases came out of RDBMS research. They want to speed up SQL-like queries.

Apache Arrow / Pandas came at it from the point of view of R, which descended from S [1]. It's basically an algebra for dealing with scientific data in an interactive programming language. In this view, rows are observations and columns are variables.

It does seem to make sense for there to be two separate projects now, but I wonder if eventually they will converge. Probably one big difference is what write operations are supported, if any. R, and I think Pandas, do let you mutate a value in the middle of a column.
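For example (pandas vs. pyarrow; just an illustration of the difference, not a claim about where either project will end up):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3]})
    df.loc[1, "x"] = 99   # pandas: mutate a value in the middle of a column

    arr = pa.array([1, 2, 3])
    # Arrow arrays are immutable; an "update" means building a new array.
    new_arr = pa.array([99 if i == 1 else v.as_py() for i, v in enumerate(arr)])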

-----

On another note, I have been looking at how to implement data frames. I believe the idea came from S, which dates back to 1976, but I only know of a few open source implementations:

- R itself -- the code is known to be pretty old and slow, although rich with functionality.

- Hadley Wickham's dplyr (the tbl type is an enhanced version of data.frame). In C++.

- The data.table package in R. In C.

- Pandas. In a mix of Python, Cython, and C.

- Apache Arrow. In C++.

- Julia DataFrames.jl.

If anyone knows of others, I'm interested.

[1] https://en.wikipedia.org/wiki/S_(programming_language)



My objective, in fact, is to see the data science world unify data frame representations around Apache Arrow, so code written for Python, Julia, R, C++, etc. will all be portable across programming languages. See https://www.youtube.com/watch?v=wdmf1msbtVs


For data frame implementations, you can also look at Spark's Dataset/DataFrame and Graphlab Create's SFrame.

The company/team behind Graphlab Create was bought by Apple and the open sourced components haven't been updated since then. Because of that, I wouldn't use it in production, but if you are just looking for functioning implementations to compare, that gives you one more.


Thanks, I'd never heard of GraphLab Create. It's a substantial piece of code! It says it's "out of core", which means it's probably more similar to Parquet/ORC than to Arrow. But still interesting.

https://github.com/turi-code/SFrame/tree/master/oss_src/sfra...

For comparison, dplyr and arrow:

https://github.com/tidyverse/dplyr/tree/master/src

https://github.com/apache/arrow/tree/master/cpp/src/arrow

C++ does seem to be useful for stuff like this.


I thought GraphLab Create had some nice functionality, it is too bad that the Apple acquisition appears to have effectively killed the publicly available product. It doesn't look like you can even pay for the commercial part anymore.


Bcolz is interesting for its use of compression both in RAM and on disk. The in-RAM compression is great when your data is just a little too big to fit in memory.


It's not publicly available, but I'm also working on an implementation of data-frame-esque structures.

You've got most of the ones I'm familiar with there, barring arguably the SAS data set. That one is "row based" in terms of storage, like the database tables of yore, although formats/types are applied conceptually at the column level.


Oh cool, what resources are you using to design the data structures? Is it just what you know from using data frames, existing code, etc.? Or any books or journal articles?

The main thing I know of is that R's data.table package has the concept of indices on columns, like SQL. Base R data frames don't have that.

I don't know if any of the others like Pandas, Arrow, or dplyr have it.

I just Googled and found this article and a very old gist:

http://appsilondatascience.com/blog/rstats/2017/03/02/r-fast...

https://gist.github.com/arunsrinivasan/7997521

I would guess that most implementations don't have indices, and that might be an indication that they're not essential. But for someone with a computer science background it seems annoying not to use hash tables when they are so obviously applicable.
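A sketch of what I mean, using pandas as an example (as far as I know its Index is hash-backed, but treat that as my assumption; the data here is made up):

    import pandas as pd

    df = pd.DataFrame({"user_id": range(1_000_000), "score": 0.5})

    # Without an index: retrieving one row is an O(n) scan of the column.
    row_scan = df[df["user_id"] == 123_456]

    # With an index: build it once, then each lookup is an (amortized) O(1) hash probe.
    indexed = df.set_index("user_id")
    row_fast = indexed.loc[123_456]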


The short answer is I'm using my own experience in the field and the limitations/frustrations I've had working with the existing ones :)

As you've said, I think the inclusion of indices comes from people used to working with more of a database background, speeding up frequent common queries. SAS data sets can be indexed, but I don't think there's anything conceptually to be gained there beyond general database indexes, since SAS data sets are designed to be on-disk data structures and operate very similarly to database tables.

Talking completely off the cuff here, I imagine it has to do with the fact that these are mostly designed to be in-memory, column-based structures, by people doing particular kinds of analysis on those structures. I'm not saying indices can't play a part, but that given the context, they're mainly in the realms of micro- and query optimisations.

I think one has to ask what kind of use case would an index support, and what does it gain in efficiency vs take away in complexity?

For in-memory columnar analysis on highly mutable data structures, it's not quite so clear that an index gains you that much.

First of all, if you're scanning an entire column, you're not gaining anything from indexes. Indeed, you're losing out because of the cost/space to store one.

Second of all, if you're updating or mutating the data structure frequently in memory, then the cost of re-indexing the data structure needs to be paid each time. A cost even more questionable when in-memory access is already pretty fast. And in a lot of this field, that's a common use case.

Thirdly, if you're just randomly accessing points within the column, well, usually columns are laid out as collections of contiguous arrays in memory. Offsetting into the array at a particular index i is likely faster than the double indirection of looking up an index (also in memory) and then offsetting into another point in the original array.

R and the like already have the concept of indexing into a data frame using an array of integers, which you could argue not only covers the concept of an index, but might even be more efficient than storing them in a hash.
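In Python terms, the equivalent would be something like this (numpy/pandas stand-ins for R's df[idx, ]; purely illustrative):

    import numpy as np
    import pandas as pd

    col = np.arange(10) * 10
    idx = np.array([2, 5, 7])

    subset = col[idx]            # fancy indexing: direct offsets into one contiguous array

    df = pd.DataFrame({"a": col, "b": col * 2})
    rows = df.iloc[idx]          # positional row selection, no hash lookup involved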

Let's say you want to index on a string or a complex combination of objects/values. There's a potential use case there (since a string doesn't usually map straight onto an offset into a low-level contiguous array), but the likes of R do have the concept of row names (although I can't remember how efficiently it's implemented), which already lets you access a data frame notionally by the name of a row.

Indeed, that's one of the roles hashes do have in my own implementation: figuring out how to index into the objects via string labels. But I deliberately left the inclusion of row/column names optional (though column names are included by default, since that seems to be the de facto norm).

In practice, many programmers/programs in this space either legitimately don't need or don't use the concept.

Indeed, I struggled quite a bit over whether to include row-name indexing as a concept at all, since I personally have barely used it in my day-to-day work for the last 10+ years. But I can't say there isn't a use for it. I suppose transition matrices on real-world entities come to mind...


Yeah row names are another good example. I've never used them, so I would be inclined to leave them out. But I'm not sure what other people use them for.

Maybe they are used for fast lookup? In other words, it's like an index? If you want to retrieve a particular row, it's usually an O(n) scan. But maybe row names are a limited kind of index and you can retrieve by name in O(1) time.

I tend to follow Hadley Wickham's model -- column labels are essential, but I'm pretty sure he would say row names are not in the model? I haven't seen it addressed actually.

If I'm understanding the transition matrix example correctly, I think data frames and matrices are vastly different. And it makes sense to have labels for every dimension of a matrix, but not of a data frame.

What language are you implementing your data frame library in? I'm curious about both the API and the algorithms. I looked at Hadley's newer dplyr code, and it's mostly in C++, as opposed to plyr, which was mostly in R.

Are you an R user? If I were doing this (which I may, but not for like a year), I would probably use Hadley's API and model the code after his. I'm curious if you have any criticisms of the way he does things.

FWIW the summary of using indices is that they change the computational complexity of an algorithm -- an O(n + m) algorithm can be turned into an O(m log n) one, and this makes an orders-of-magnitude difference in running time. It's not a small difference. I've used this to good effect in real programs.
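To make the numbers concrete (my own toy example, with bisect standing in for a sorted index):

    import bisect

    sorted_ids = list(range(0, 2_000_000, 2))      # n sorted values ("the index")
    wanted = [10, 424_242, 1_999_998]              # m keys to retrieve
    wanted_set = set(wanted)

    # Without an index: one pass over the whole table, roughly O(n + m) work.
    scan_hits = [i for i, v in enumerate(sorted_ids) if v in wanted_set]

    # With a sorted index: m binary searches, O(m log n) work. With n = 1e8 and
    # m = 1e3 that's ~1e3 * 27 ≈ 3e4 steps instead of ~1e8.
    bisect_hits = [bisect.bisect_left(sorted_ids, k) for k in wanted]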

It's more like data preparation than analysis, but that is a big part of what people use data frames for.


Do you have a working definition for what constitutes a 'data frame'? API/operations you can do to it, or maybe performance requirements of some sort? It seems like, for instance, a 2d array might qualify as a 'data frame' under some definitions.


Data frame is fairly well defined, but it's a common source of confusion. In a data frame, the columns are of heterogeneous types (string, integer, bool, double, factor or "enum", etc.). In a vector/matrix, every cell has the same type.
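In pandas/numpy terms, the distinction looks like this (just a minimal illustration):

    import numpy as np
    import pandas as pd

    frame = pd.DataFrame({
        "name":   ["a", "b", "c"],         # string column
        "count":  [1, 2, 3],               # integer column
        "weight": [0.5, 1.5, 2.5],         # double column
        "flag":   [True, False, True],     # bool column
    })
    print(frame.dtypes)                    # each column keeps its own type

    matrix = np.array([[1, 2], [3, 4]])
    print(matrix.dtype)                    # every cell shares one dtype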

A short way to understand it is that R and Matlab aren't the same thing. They are not substitutes for one another.

In R, the core type is the data frame. In Matlab, it is the matrix, of which a 2D array is a special case. (Both Matlab and R have grown some of the other's functionality, but it's not good at all. You can tell which one is primary.)

Another way to say it: Julia at its core isn't an R replacement; it's a Matlab replacement. Apparently, the representation of missing values was a problem for DataFrames.jl. Whether or not individual entries in a vector are "nullable" has a large impact on how you write vectorized mathematical operations and how well it performs.
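You can see the same kind of issue in pandas (my illustration; I don't know the DataFrames.jl internals):

    import pandas as pd

    s = pd.Series([1, 2, None])
    print(s.dtype)    # float64: the integer column is upcast so NaN can mark the hole

    s2 = pd.Series([1, 2, None], dtype="Int64")
    print(s2.dtype)   # nullable extension type: values plus a separate validity mask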

And if you notice that data frames sound like SQL tables, well that's exactly why these Apache projects are converging/overlapping :) They are coming from different points of view, but they have the same underlying data model.

One of my slogans is that "R is the only language without the ORM problem", because its native data model matches the data model of a relational database. I'm designing a shell which will feature data frames (eventually, probably not for a year or more):

https://github.com/oilshell/oil/wiki/Oil-and-the-R-Language

SQL is a very limited language, with essentially no capabilities for abstraction. Writing programs using the relational model in a Turing-complete language like R or Python is very powerful.


A 2D array has a different algebra from a dataframe (though some operators are analogous, e.g. a transpose = a pivot).

A dataframe basically behaves like a database table, with all the attendant operations (aggregation, pivots, filters, etc.). The basic algebra for dataframes is relational algebra plus a few other useful database operations.

The algebra for a 2D array (matrix) is linear algebra.
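A toy version of that table-like algebra, in pandas syntax (column names made up):

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["east", "east", "west", "west"],
        "product": ["a", "b", "a", "b"],
        "revenue": [10, 20, 30, 40],
    })

    filtered   = sales[sales["revenue"] > 15]                        # selection / filter
    aggregated = sales.groupby("region")["revenue"].sum()            # aggregation
    pivoted    = sales.pivot_table(index="region", columns="product",
                                   values="revenue", aggfunc="sum")  # pivot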


Okay, but on the surface at least these seem computationally equivalent; so is one of these just a special case of the other?


I think there are similarities, but I don't think one is a special case of the other. At a very high level, they are both collection data structures that share certain properties but that are optimized for very different use cases.

There are instances where the difference can be palpable. I work with legacy MATLAB code where database type data are stored in cell arrays. They are tremendously difficult to reason with for data science work because the primitive operations are all based on positional cell coordinates. (on the other hand, arrays are much easier to work with for numerical type algorithms)

If you understand the theory of coordinatized data, choosing the right coordinate system can greatly simplify math/data manipulation. A data frame is a type of dataset where the coordinates are simply dimensions in other columns, which makes it amenable to aggregations, pivots, slicing/dicing and such on mixed categorical and numerical data.


Apache Spark. Scala


Do you have any experience with data frames in Spark? Do they work well?

My impression is that it's more of an API bolting the data frame abstraction onto whatever Apache Spark already did, rather than a native data frame representation. But I could be wrong -- I haven't used it.

Is the query language Scala or something else? As far as I remember, native Spark is like MapReduce where you write the distributed functions in the same language that the framework is written in -- Scala.

Also, I'm pretty sure Apache Spark is more immutable/functional, whereas in R and Python, data frames are mutable. But that model also has benefits, for redundancy and resilience to nodes being down, which I believe is the whole point of Spark.

Although, I have a pet-theory that you don't really need distributed data frames. For 95% or 99% of data science, use stratified sampling to cut your data down to the size of a single machine. Then load it up in R or Python. You can make almost any decision with say 4 GiB of data. Often you can make it with 4 MB or 4 KB of data. It's not the amount of data that is the problem, but the quality.


> Do you have any experience with data frames in Spark? Do they work well?

Yes, for the most part.

> Is the query language Scala or something else?

You can use Spark SQL, which is essentially SQL; Spark parses it and pumps through a query planner and optimizer. You can also call methods in the Spark API (from Python, Scala, etc.)
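Roughly, the two styles look like this in PySpark (assumes a running SparkSession; just a sketch):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.createOrReplaceTempView("t")

    # Spark SQL: plain SQL text, parsed and run through the planner/optimizer.
    via_sql = spark.sql("SELECT label, COUNT(*) AS n FROM t GROUP BY label")

    # DataFrame API: the same query expressed as method calls.
    via_api = df.groupBy("label").agg(F.count("*").alias("n"))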

> I'm pretty sure Apache Spark is more immutable/functional.

Yes, but that can be a good thing. It makes data transformations easier to reason about.

> Although, I have a pet-theory that you don't really need distributed data frames.

Yes, for most data science work on small datasets, you really don't. That said, there is some work being done in the Spark community to enable Spark to work as a computation engine for local datasets. That way one can standardize on one interface for both small data and big data. As of right now though, there is a fair bit of overhead if you want to use Spark on smaller datasets -- it's way slower than R/Python dataframes. Hopefully some of those overhead issues will be addressed.

This is why I think Arrow is such an exciting development. The promise is that you can use any compute engine on the same dataframe representation with no serialization overheads. So you can use nimbler tools like R and Python when the data is small, and Spark (or PySpark) on the same data when it needs to scale up. It abstracts your data away from your tooling while maintaining high performance.
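For instance, with pyarrow (illustrative file name; the conversion details are my rough understanding of typical usage):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    table = pa.Table.from_pandas(df)               # columnar Arrow representation
    feather.write_feather(table, "shared.feather")

    # The same Arrow data can then be read or memory-mapped from R, Spark, etc.
    back = feather.read_table("shared.feather").to_pandas()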


OK thanks for the information. What I'm suggesting is that a two-phase workflow for analytics may actually make more sense than trying to unify things. (Here, I'm considering analytics as distinct from big data in products, where you really need to compute on every user's data or every web page).

Basically you have a disk-backed distributed column store with petabytes of data. At Google this is Dremel:

"Dremel: interactive analysis of web-scale datasets"

https://scholar.google.com/scholar?cluster=15217893017571539...

And then you write SQL queries against it to load a sample of gigabytes onto a single machine. Stratified sampling is useful here, so you can get representative samples of both big groups and small groups, a mix that tends to come up a lot. And then you use something like R, Python, or Julia to analyze that subset.
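The local half of that might look like this (pandas >= 1.1 for the grouped sampling; file and column names are made up):

    import pandas as pd

    events = pd.read_parquet("extract.parquet")    # the slice pulled out via SQL

    # Sample within each group so small groups are still represented.
    sample = events.groupby("country").sample(frac=0.01, random_state=0)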

Despite the title of the Dremel Paper, it's not really good for interactive analysis. It's good for a quick "top K" query, which is sometimes all you need. But for anything beyond that, it's too slow.

And SQL is a horrible language for writing analytics in. The data frame model is much better. SQL is awesome for loading data though.

So I'm basically suggesting that the SQL/disk/distributed vs. the Programming language/memory/local distinction actually works well for most problems. You can cleanly divide your workflow into "what data do I need to answer this question?" vs. all the detailed and iterative analysis on a smaller data set.

We do need a much better interchange format than CSV though.


I'm not sure I understand exactly what you are proposing that is new. Could you help me see what I'm missing? (not being snarky, genuinely curious)

1) 2-phase workflow for analytics - Yes, this is being done today with different distributed column stores (Vertica, SQL Server columnstores etc.), queried into local dataframes. Spark can be used for that.

2) SQL good for data retrieval, dataframes better for analytics - OK, Spark can retrieve content via SQL into Spark DataFrames, and then apply RDD operations which are similar to dplyr-type operations.

3) Better interchange format than CSV? OK, you can always write your SQL resultset as a Parquet file which can be natively accessed by Spark or Pandas (via PyArrow - there's Arrow again)
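e.g. with pyarrow.parquet (file names are just placeholders):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    result = pd.DataFrame({"user": [1, 2], "total": [9.5, 3.2]})   # stand-in for a SQL resultset

    pq.write_table(pa.Table.from_pandas(result), "resultset.parquet")

    # Natively readable from Pandas (via pyarrow) or from Spark.
    df = pd.read_parquet("resultset.parquet")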


It's not so much proposing something new, but calling into question the assumption that this is desirable:

> That way one can standardize on one interface for both small data and big data.

For big data, I believe you want a set of operations that are efficient for data extraction:

- Selecting and filtering.

- Explicit and efficient joins. Disallow pathological joins not related to data extraction!

- Sampling. Especially efficient and accurate stratified sampling over groups. I believe this is missing from most databases, although I could be wrong since there are many of them I'm not familiar with. They typically have either a fast inaccurate algorithm or a slow accurate one. Or they just let you sample records from an entire table, without "group by"-like semantics. I did what you would now call "data engineering" for many years and this was probably the #1 thing that came up over and over.

That's pretty much it as far as I can tell. The goal is just to cut your data down to manageable size, not do anything overly smart.

On the other hand, for small data, you want an expressive interface, not necessarily an efficient one. You want all the bells and whistles of R and dplyr (and I suppose Pandas, although I'm less familiar with it).

A lot of those things are hard to implement efficiently, but you need them for scripting. Moreover sometimes it's just a lot easier to use an inefficient algorithm on small data, since most of the code is throwaway.

I spent a while looking at Apache Arrow a few weeks ago. I like the idea a lot. I feel like it may be trying to serve too many masters (but that's just a feeling). It definitely seems possible to underestimate the difference in requirements between small data and big data.

I'm saying those differences are fundamental and not accidental, and you want different APIs so that programs are aware of the differences.

Also, in most other domains there is a distinction between "interchange format" and "application format" (e.g. png/tiff files vs. photoshop files). It would be nice if that didn't exist, but it's there for a reason. This is a long argument, but some problems are more social than technical. Dumb, inefficient interchange formats solve a social problem.


Ah, I wasn't sure which part of my reply you were referring to. I think there are two different phases of the data science process: (1) EDA and exploratory modeling; (2) productionizing models at scale.

I believe Arrow can be part of (1) and (2). It is an in-memory format that aims to avoid serialization to/from different use points, so it is tooling and data-size agnostic. You can still sample from it and perform your exploration and analysis.

Spark, on the other hand, is mostly (2) -- it can be used for (1) but it is fairly weak at it due to overheads. It also doesn't support the full range of capabilities that dplyr (and other R packages) bring to the table. I would not use Spark for EDA or throwaway modeling.

I think we can agree to disagree, but I see a lot of advantages to having a common in-memory abstraction (such as Arrow) for small and big data whose only difference is cardinality. We don't have to mandate the use of such a format of course, but I think having such a format solves a large class of problems for a lot of people.

At the same time, no one will stop using CSV (tabular)/JSON (hierarchical) even though these are inefficient, "dumb" formats. They are good enough for most applications and are what I consider local-optimum solutions. Even Excel XLSX is a pretty good format for interchange.

I am open to changing my mind but my experiences so far have not convinced me otherwise yet.


Right, EDA is all about making quick iterations, and my point is that I can't think of ANY use cases for EDA on more than a machine's worth of data. (Other than "top" queries, which are useful, but SQL will give you that. You don't need data frames for simple analysis.)

In my experience, when people think they want to "explore" hundreds of terabytes of data (if that's even possible), they either don't have a good method of sampling (due to database limitations), or they haven't considered what kind of sampling they need to answer the question. I'm open to counterexamples where you need terabytes or petabytes of data and you can't sample, but I can't think of any myself.

And yes R is too inefficient to handle some problems on a single machine (or multiple machines), but data.table and dplyr are more efficient. I guess one anti-pattern I would see is -- R is too inefficient, let's run R on multiple machines! I think the better solution is to make use of your machine better, with packages like data.table and dplyr (or a future platform based on Arrow). I think it's easy to underestimate the practical problems of distributed computing.

You should really avoid distributed computing when you can!

And I agree that once you reach production, you do need different tools. In my experience, when you are dealing with hundreds of terabytes or petabytes of data, you're writing a fairly optimized program with a big data framework. NOT using data frames. Big data frameworks are already pretty inefficient -- as you mention, if you run them on a local machine, you see evidence of that. This is true of all the frameworks at Google written in C++ too. When you add the data frame abstraction, it's only going to get worse. Every layer adds some overhead.

So I am very big on data frames for data analysis obviously, but I question the value of distributed data frames.

Bringing it back to the article, having 2 abstractions definitely makes sense. You want a distributed disk-backed column store for storing huge amounts of data permanently. And then you want an in-memory format like Arrow for iteratively analyzing data using the data frame abstraction.

What's not clear to me is that you want a distributed in-memory abstraction as well. It's possible but I think there are so many other problems in the tools for data science -- I can easily name 10 problems with both R and Python ecosystems that are more important than distributed data frames.

To play devil's advocate, I do see the value in not rewriting your analysis when you go from prototype to production. I'm pretty big on that -- I like to ship the prototypes. That's pretty much the philosophy of Julia -- the prototype should be efficient.

But I see multiple challenges there for data science. Even if you had distributed data frames, it wouldn't solve that problem. For one thing, most analysts don't write their code with things like indices in mind, see my other comments here:

https://news.ycombinator.com/item?id=15597376

If the code could benefit from indices, and is not using them (like most programs that use data frames), then IMO they need major surgery anyway. It's not just a matter of flipping a switch and magically going from small to big. But yes I think the general idea of shipping your prototypes is appealing.


> Although, I have a pet-theory that you don't really need distributed data frames. For 95% or 99% of data science, use stratified sampling to cut your data down to the size of a single machine. Then load it up in R or Python. You can make almost any decision with say 4 GiB of data. Often you can make it with 4 MB or 4 KB of data. It's not the amount of data that is the problem, but the quality.

To add fuel to that fire, you can get AWS EC2 instances with 4TB of main system RAM now. At least where inference is the goal, the times that you'd need exotic systems are truly rare indeed.


Data frames do work well. You are correct that it is mainly an abstraction on top of the existing Spark RDD representation. The language is Scala, though with a slightly strange string/macro-based SQL-ish API. Perhaps you can tell from that description that I'm a bit lukewarm on it.

There are a few big benefits to immutability. Redundancy and resilience are a big one, but you also get some improved performance because the compiler and planner can optimize more aggressively and do things like predicate push-down, since they don't have to materialize everything. It also gets you a lot of the more general programming benefits of immutability/FP in preventing bugs and such.
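A loose PySpark sketch of the push-down point (paths and columns are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The filter is declared on an immutable, lazily-evaluated DataFrame, so the
    # planner can push it into the Parquet scan instead of materializing everything.
    events = spark.read.parquet("s3://bucket/events/")
    recent = events.filter(events["day"] >= "2017-10-01").select("user", "day")
    recent.explain()   # the physical plan should show PushedFilters on the scan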

I somewhat agree with your point about data volume. The situations that require Spark are relatively infrequent. But they are becoming more common. A lot of data is required when you start getting into building more complex machine learning models. But for a lot of your run-of-the-mill regression or basic understanding analysis, Spark is way overkill.



