
Dato open sources SFrame – a disk-backed, compressed columnar data frame - infinite8s
http://blog.dato.com/sframe-open-source-release
======
sandGorgon
This comment is very interesting:

 _I really appreciate you guys doing this. There's now serious consideration
of building the Julia (julialang.org) data-analysis ecosystem on top of
SFrames. It will hopefully allow our community to focus on the unique
strengths of Julia, such as the ability to write arbitrary JITed user-defined
aggregation and transformation functions, while not reinventing the wheel of
low-level scalable data carpentry primitives._

Does anybody know how accurate this is? I would love to compare R data frames
or Pandas to SFrame in terms of features and performance.

~~~
shele
He actually wrote a Julia wrapper exploring this a bit more
[https://github.com/malmaud/SFrames.jl](https://github.com/malmaud/SFrames.jl)

------
haijieg
For those who are interested in the technical details, this blog post by
Yucheng Low explains the architecture of SFrame very well:
[http://blog.dato.com/data-processing-architecture-of-graphla...](http://blog.dato.com/data-processing-architecture-of-graphlab-create)

------
shoyer
The title here should have a (2015) in it -- this blog post is from September
25, 2015.

------
elyase
What I don't get is: can you use SFrame without installing the graphlab
framework at all? What are the limitations? Is there an example notebook using
only SFrame and not graphlab?

~~~
haijieg
Yes, SFrame is a subset of GraphLab Create. SFrame provides the core data
structures to work with tabular and graph data at scale on a single machine.
In addition, using GraphLab Create allows you to create/use/evaluate machine
learning models.

Here is a user guide to get you started:
[https://dato.com/learn/userguide/sframe/introduction.html](https://dato.com/learn/userguide/sframe/introduction.html)
Just replace "import graphlab" with "import sframe".

------
rixed
Can someone post a link to a comparison with other columnar database storage
layers?

~~~
RyanHamilton
Most other stores are databases:
[http://www.timestored.com/time-series-data/column-oriented-d...](http://www.timestored.com/time-series-data/column-oriented-databases)

------
prodigal_erik
Is a "data frame" just a table with (edit: a lot of) storage optimization?

~~~
chubot
As far as I know, the term "data frame" comes from R and its predecessors like
S, where a data frame is the core data structure. Logically, it is indeed like
an SQL table -- a column has a single type whereas a row has heterogeneous
types.

AFAIK, all implementations are column-oriented, which admits certain kinds of
implementation and optimization. SQL databases are mostly row-oriented,
probably since updating a row at a time is a common operation.
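To make the row-versus-column distinction concrete, here is a minimal Python sketch using only the standard library. The table contents and variable names are made up for illustration, and this is not how SFrame, R, or any SQL engine actually stores data; it only shows why a per-column aggregate favors one layout and a whole-row update favors the other.

```python
# Hypothetical three-row table, stored two ways.

# Row-oriented: one record per row; each record mixes types.
rows = [
    {"name": "ann", "age": 34},
    {"name": "bob", "age": 27},
    {"name": "cy", "age": 45},
]

# Column-oriented: one homogeneous list per column.
columns = {
    "name": ["ann", "bob", "cy"],
    "age": [34, 27, 45],
}

# Aggregating one column reads a single list in the column layout...
mean_age_columns = sum(columns["age"]) / len(columns["age"])

# ...but has to visit every record in the row layout.
mean_age_rows = sum(r["age"] for r in rows) / len(rows)

# Replacing a whole row is one write in the row layout...
new_row = {"name": "bob", "age": 28}
rows[1] = new_row

# ...but one write per column in the column layout.
for col, value in new_row.items():
    columns[col][1] = value
```

Both layouts hold the same logical table; only the access pattern that is cheap differs.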

I would think of it as a table, but embedded in a programming language rather
than a database (so you don't use SQL), with more operations, and which is
very often used in a read-only fashion.

The syntax in R is nicer than SQL in my opinion. It's more algebraic and
composable. Instead of "SELECT name, address FROM foo WHERE age > 30", you can
write foo[foo$age > 30, c('name', 'address')].
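For readers who don't use R, the same filter-then-project idea can be sketched in plain Python over a column-oriented table. This assumes a made-up table `foo` and a hypothetical helper `select`; it is not the API of pandas, SFrame, or R, just an illustration of how a boolean mask composes with a column list.

```python
# Hypothetical table, stored as one list per column.
foo = {
    "name": ["ann", "bob", "cy"],
    "address": ["1 Elm St", "2 Oak Ave", "3 Pine Rd"],
    "age": [34, 27, 45],
}


def select(table, mask, wanted):
    """Keep the rows where mask is True and only the wanted columns."""
    return {
        col: [value for value, keep in zip(table[col], mask) if keep]
        for col in wanted
    }


# Plays the role of `foo$age > 30`: one boolean per row.
mask = [age > 30 for age in foo["age"]]

# Plays the role of `foo[mask, c('name', 'address')]`.
result = select(foo, mask, ["name", "address"])
```

The mask is an ordinary value, so it can be built once, combined with other masks, and reused, which is the composability the R syntax offers.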

Some links:

[http://www.r-bloggers.com/select-operations-on-r-data-frames...](http://www.r-bloggers.com/select-operations-on-r-data-frames/)

Pandas is a data frame library for Python, modeled on R's data frames:

[http://pandas.pydata.org/pandas-docs/stable/basics.html](http://pandas.pydata.org/pandas-docs/stable/basics.html)

This article explains the relevance of the relational model to data analysis /
statistics (rows are observations, columns are variables):

[https://scholar.google.com/scholar?cluster=77966238326629329...](https://scholar.google.com/scholar?cluster=7796623832662932979&hl=en&as_sdt=0,5&sciodt=0,5)

~~~
stewbrew
"data frame is the core data structure"

AFAIK a data.frame in R is actually a list of vectors (i.e. columns) with some
constraints.

~~~
chubot
That's not true:

    
    
        > d=data.frame(a=c(1,2,3),b=c(4,5,6))
        > e=list(a=c(1,2,3),b=c(4,5,6))
    
        > class(d)
        [1] "data.frame"
        > class(e)
        [1] "list"
    
        > d[c(TRUE,FALSE),]
          a b
        1 1 4
        3 3 6
    
        > e[c(TRUE,FALSE),]
        Error in e[c(TRUE, FALSE), ] : incorrect number of dimensions
    

They are represented similarly in R, but they are distinct data types. The
data frame is the core data structure in the sense that many functions in R
operate on data frames (but not lists of vectors).

~~~
stewbrew
1\. You shouldn't use `=` for assignments in R but `<-`. `=` does late
binding.

2\. You shouldn't use `class()` here but `mode()` to check the actual
underlying data structure.

    
    
        > mode(d)
        [1] "list"
        > mode(e)
        [1] "list"
    

3\. The reason `[` works differently is because it is an S3 method which invokes
different functions for lists and data.frames -- that's why class(d) doesn't
return "list". See `methods("[")`.

See
[https://cran.r-project.org/doc/manuals/r-release/R-lang.html...](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Data-frame-objects)
for details.
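As a rough analogy only (this is Python, not R, and Python subclassing is not the same mechanism as S3 dispatch), the "same underlying storage, different class, different indexing" point can be sketched like this. The `DataFrame` class below is hypothetical and exists purely for illustration; unlike R, it does not recycle a short mask.

```python
class DataFrame(dict):
    """A dict of equal-length column lists with its own indexing behaviour."""

    def __init__(self, **columns):
        # The "some constraints" part: every column has the same length.
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        super().__init__(columns)

    def __getitem__(self, key):
        # A list of booleans selects rows; anything else falls through to dict.
        if isinstance(key, list) and all(isinstance(k, bool) for k in key):
            return DataFrame(**{
                name: [value for value, keep in zip(col, key) if keep]
                for name, col in self.items()
            })
        return super().__getitem__(key)


d = DataFrame(a=[1, 2, 3], b=[4, 5, 6])
e = dict(a=[1, 2, 3], b=[4, 5, 6])

kept = d[[True, False, True]]  # row selection works on the subclass

try:
    e[[True, False, True]]     # a plain dict rejects a list as a key
    plain_dict_failed = False
except TypeError:
    plain_dict_failed = True
```

Here `isinstance(d, dict)` is true, much as `mode(d)` reports "list" in R, while indexing behaves differently because the lookup is chosen by the object's class.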

------
asdfologist
It doesn't support Python 3 :(

~~~
bsg75
Hopefully soon:
[http://blog.dato.com/state-of-the-sframe-2016](http://blog.dato.com/state-of-the-sframe-2016)

