
Show HN: Bistro – A light-weight column-oriented data processing engine - asavinov
https://github.com/asavinov/bistro
======
agibsonccc
The core looks close enough to dataframes that I'd be curious to know how you
compare to tablesaw:
[https://github.com/jtablesaw/tablesaw](https://github.com/jtablesaw/tablesaw)

This looks neat, but I'm not sure why I would care about it. There are already a ton of
solutions in the ecosystem with a columnar-like interface.

Granted, we wrote our own as well[1]. It uses the builder pattern: you build a
series of transforms that you then toss to an executor (our main backend for
this is Spark). One reason we wrote it is persistence: being able to encode and
persist a series of transforms that you can then load remotely has been very
helpful for us in machine learning.

We've since migrated this project to the Eclipse Foundation and intend to
rewrite the interface as well as integrate our baked-in tensor library[2] into
certain parts of the pipeline, for speed and for handling things like computer
vision workloads.

In general, I always like seeing new takes on columnar processing, but I'm
just not seeing anything novel here. Clarification of intent would be great!

[1]:
[https://github.com/deeplearning4j/DataVec](https://github.com/deeplearning4j/DataVec)
[2]:
[https://github.com/deeplearning4j/nd4j](https://github.com/deeplearning4j/nd4j)

~~~
asavinov
Bistro is not about the physical model or a columnar (physical)
representation, although it relies on one. It is about the logical level of
representation and processing.

There are at least three major questions when we want to introduce a new
logical data model:

* How to define columns within one table. Conceptually, it is easy, e.g., SELECT x, y, c = a+b FROM T. Yet even in this simple case we see a tension: this statement will create a table, but our goal is not to create a table - we want to create a column (a function). Bistro uses the _calc_ operation for that purpose. It is of course not new; in pandas, for example, one can use df.apply.

* How to connect several tables. The relational model, map-reduce and other set-oriented approaches use join, which produces a new table. Here we have a similar tension [1]: I do not want to produce a new table; my goal is to link these two tables, and that means creating a new column. Bistro introduces the _link_ operation for such columns.

* How to aggregate data. Here the typical operation is group-by. It is an eclectic operation which combines several other operations, and such an approach has problems of its own [2]. Bistro changes the way data is aggregated by introducing accumulate functions, which take one input value (not a subset) and return one output value. An accumulate function is called for each element of the group, _updating_ the current aggregate instead of computing the aggregate from the whole group at once.
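
The three operations above can be sketched in plain Python (the tables and names below are made up for illustration; this is not Bistro's API):

```python
# Toy tables as plain column lists: a "facts" table and a "groups" table.
facts_group = ["a", "b", "a"]
facts_amount = [10.0, 20.0, 5.0]
groups_name = ["a", "b"]

# calc: a new column defined as a function of columns in the same table
facts_double = [x * 2 for x in facts_amount]

# link: a column of row references into another table (instead of a join)
index = {name: i for i, name in enumerate(groups_name)}
facts_group_ref = [index[g] for g in facts_group]

# accumulate: build each aggregate one input value at a time (instead of
# handing the whole group to an aggregate function, as group-by does)
groups_total = [0.0] * len(groups_name)
for ref, amount in zip(facts_group_ref, facts_amount):
    groups_total[ref] += amount  # update step: (aggregate, value) -> aggregate
```

Note that no intermediate table is ever produced: each operation only adds a column to an existing table.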

So linking and aggregation are what distinguish Bistro from other approaches
and frameworks for data processing, including pandas, SQL and map-reduce.

[1]
[https://www.researchgate.net/publication/301764816_Joins_vs_...](https://www.researchgate.net/publication/301764816_Joins_vs_Links_or_Relational_Join_Considered_Harmful)

[2]
[https://www.researchgate.net/publication/316551218_From_Grou...](https://www.researchgate.net/publication/316551218_From_Group-by_to_Accumulation_Data_Aggregation_Revisited)

~~~
agibsonccc
Ok, so you're defining new operations on top of existing primitives. Makes
sense! The concepts look more interesting than the library right now (it
doesn't move the needle for me production-wise yet), but it has potential!
Every project starts somewhere. I'm glad you wrote this in Java, at least.

There are a ton of things I'd be missing before I could start using this.

1. Backend agnostic: let me run on different backends like Flink/Spark.

2. Give me off-heap memory, please. Let me play dangerous and use pointers
directly to optimize interactions with transforms. We wrote our own GC, among
other things, for our tensor lib due to GC bottlenecks, copying and the like.
That's personally what I kind of like about tablesaw.

A library for one-off ad-hoc analysis in memory isn't a bad start, though,
especially since most folks don't actually have problems that large.

~~~
asavinov
You are right - it is an MVP, and the goal is to choose a direction. In fact,
I am still not sure in which direction to go:

* JavaScript data processing framework for in-browser data processing

* Python framework like pandas

* Big data processing framework like Spark

* Database management system

* Data integration system like typical ETL and BI

* Stream analytics like Kafka Streams

* IoT (light-weight) stream processing engine

* Something else?

I would be very thankful for any suggestions from people who know the market
and the (acute) needs of customers. What is the best niche for this kind of
technology?

~~~
munro
I'm working on a columnar Python DSL right now; I think of it like SQLAlchemy
for Pandas/Spark/Flink. My goal is to create a language that makes the
cumbersome parts of the PySpark API much easier to express. I started off with
the intent of replicating R's dataframe API because it feels more fluid, but
what you're doing feels eerie, because I came to the same conclusions as you
around focusing on a language for columnar manipulations and letting the
"linking" become implicit. Then I want to transition to rethinking the ML
pipeline; I like Spark's more than sklearn's, but there are still cumbersome
parts that my intuition says can be solved by a columnar API.

So if I were you, I would do Python DSL, because it addresses my immediate
needs of increasing productivity. :)

------
buremba
Is it in-memory? Does it support replication or sharding? What's the main use-
case? How does it differ from ORC, Parquet or Arrow? The repository doesn't
have any information.

~~~
asavinov
At the physical (storage) level it is in-memory and organized as a column
store; that is, internally it stores a list of tables (with no data) and a
list of columns (each being a Java array).
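
A rough sketch of that layout (plain Python rather than Java, and not Bistro's actual classes):

```python
class ColumnStore:
    """Minimal in-memory column store: a table is just a row count,
    and each column is a flat array keyed by (table, column) name."""

    def __init__(self):
        self.tables = {}   # table name -> number of rows
        self.columns = {}  # (table name, column name) -> list of values

    def add_table(self, name, rows):
        self.tables[name] = rows

    def add_column(self, table, name, values):
        # A column must have exactly one value per row of its table.
        assert len(values) == self.tables[table], "column/table length mismatch"
        self.columns[(table, name)] = values
```

The point is that the table objects hold no data themselves; all data lives in the column arrays.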

It is definitely interesting and important to implement persistence (for
example, using Parquet or Arrow) as well as other mechanisms like sharding or
replication (for big data processing, fault tolerance, etc.). Yet currently
this direction has lower priority, because the next task I want to focus on is
in-stream analytics (an alternative to Kafka Streams).

In general, the whole approach is focused on the _logical level_ of data
modeling and processing; that is, the goal is to increase performance and
simplicity of development. The general idea (and hypothesis) is that defining
how data is processed using _column operations_ is easier, more intuitive,
less error-prone and easier to maintain than using purely _set operations_.

In other words, at the logical level it is an alternative to map-reduce, SQL,
pandas and other models and frameworks where set operations are used to
process data.

~~~
jnordwick
> defining how data is being processed using column operations is easier, more
> intuitive, less error-prone and easier to maintain than using purely set-
> operations.

And every APL programmer just nodded in agreement.

~~~
PeCaN
All 6 of us. :(

Really though, I spent the time reading the readme thinking “This looks very
cool, shame it doesn't have a nice language to go with it…”

------
dgudkov
Interesting idea. Columnar ETL can be quite efficient in some scenarios,
because frequently an ETL transformation (e.g. calculating a new column)
effectively modifies an existing table rather than creating a new one. This
allows calculating only the delta instead of re-building a new table from
scratch, which helps optimize performance and do calculations in-memory
without slow disk I/O.

Another advantage is that it allows performing many transformations (e.g.
filtering) directly on dictionary compressed data, without decompressing it.
This works well in Vertica [1] (based on C-Store DB [2]) which was our
inspiration for building a light-weight ETL for business users that also uses
a columnar in-memory data transformation engine [3].
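
To illustrate the second point, here is a minimal plain-Python sketch of filtering directly on dictionary-compressed data (illustrative only; not how Vertica or easymorph actually implement it):

```python
# Dictionary compression: a column of strings becomes a small dictionary
# plus an array of integer codes, one per row.
values = ["red", "green", "red", "blue", "red"]
dictionary = sorted(set(values))               # ['blue', 'green', 'red']
codes = [dictionary.index(v) for v in values]  # [2, 1, 2, 0, 2]

# A predicate like `value == "red"` is translated once into a code lookup,
# then evaluated on the integer codes without decompressing any strings.
target = dictionary.index("red")
matching_rows = [i for i, c in enumerate(codes) if c == target]
```

The filter scans only small integers, never the original strings, which is where the speed-up comes from.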

[1] [https://www.vertica.com/](https://www.vertica.com/)

[2]
[http://db.csail.mit.edu/projects/cstore/](http://db.csail.mit.edu/projects/cstore/)

[3] [http://easymorph.com/in-memory-engine.html](http://easymorph.com/in-memory-engine.html)

------
krat0sprakhar
Sorry for being that guy, but I just clicked into a random file in src to read
the code, and found the code style (indentation etc.) to be quite weird
[https://github.com/asavinov/bistro/blob/master/core/src/main...](https://github.com/asavinov/bistro/blob/master/core/src/main/java/org/conceptoriented/bistro/core/ColumnData.java).

Might I suggest using [https://github.com/google/google-java-format](https://github.com/google/google-java-format) for formatting?

~~~
jahewson
This is what happens when tabs and spaces get mixed.

~~~
tomcam
So you’re a spacist?

~~~
jasonkostempski
If not, I run a space supremacist group out of my basement. Weekly meetings
devoted entirely to tab-shaming.

------
jitl
An example would be great. Can you show how to do a given task with SQL,
map/reduce, and your framework?

Because right now I have no idea why I’d choose to learn this new stuff over
using google-able tools I already know.

Make your value proposition _really_ clear.

~~~
hoprocker
Agreed -- a motivating example in the Readme would really help people totally
new to this concept grasp the how/why of the project.

~~~
sgolestane
On [http://conceptoriented.org](http://conceptoriented.org) there is a link to
[http://conceptoriented.com](http://conceptoriented.com), which seems to be a
demo.

~~~
asavinov
Source code for this web-app and previous projects:

* UI (Angular 2): [https://github.com/asavinov/sc-web](https://github.com/asavinov/sc-web)

* REST server: [https://github.com/asavinov/sc-rest](https://github.com/asavinov/sc-rest)

* Core engine: [https://github.com/asavinov/sc-core](https://github.com/asavinov/sc-core) - Bistro is a complete rewrite of this project

There is an (old) implementation of this idea in C#:

* UI (WPF): [https://bitbucket.org/asavinov/dc-wpf](https://bitbucket.org/asavinov/dc-wpf)

* Core engine (C#): [https://bitbucket.org/asavinov/dce-csharp](https://bitbucket.org/asavinov/dce-csharp)

There is also an old implementation of the engine in Java:

* [https://bitbucket.org/asavinov/dc-core](https://bitbucket.org/asavinov/dc-core)

------
jnordwick
Might be a cool idea, but it's not nearly fleshed out enough. I think a larger
example, instead of just individual lines of code, would be useful. Show a toy
widget-sales spreadsheet.

What is the use case? Does it support time series? How would you do a moving
average or pivot table?

~~~
asavinov
> What is the use case?

This is what I am trying to understand myself :) Previously I implemented
this approach as a web app (a self-service tool for working with tables where
users can define columns as formulas, similar to spreadsheets). Discussed
here:
[https://news.ycombinator.com/item?id=14351461](https://news.ycombinator.com/item?id=14351461)
But it is too difficult to implement (many more resources are needed), so I
switched to developing a library.

Now I want to implement a server for in-stream analytics (an alternative to
Kafka Streams), hopefully in the next version of Bistro. It will have quite
significant portions of the data processing logic in UDFs (retention policy,
when to evaluate, what to add, when to do what, etc.).

> Does it support time series? How works you do a moving average or pivot
> table?

I am designing it now as a tool for stream processing (Bistro Streams), and
hence it will support time series (each new row will get a timestamp and the
system will "know" how to deal with time).

The current approach to user-defined functions (the Evaluator interface) does
not support moving averages or other rolling aggregations. A new API will be
defined. This task has high priority, since it is very important for stream
analytics.
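
For what it's worth, a moving average does fit the incremental style if the update step can also retract values leaving the window. A minimal plain-Python sketch (illustrative only, not Bistro's API):

```python
from collections import deque

def moving_average(stream, window):
    # Maintain a running total incrementally: each new value is added, and
    # values that fall out of the window are subtracted (a retractable
    # "accumulate" step), so no window is ever re-summed from scratch.
    buf, total, out = deque(), 0.0, []
    for x in stream:
        buf.append(x)
        total += x
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out
```

For example, `moving_average([1, 2, 3, 4], 2)` yields `[1.0, 1.5, 2.5, 3.5]` (partial windows at the start are averaged over the values seen so far).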

Pivoting is conceptually more difficult because it is not an operation on
data - it is an operation on the schema (using data). Maybe some kind of
ad-hoc solution will work.

~~~
jnordwick
Honestly, there are probably solid use cases for simplifying data exploration,
especially when dealing with large amounts of data. I myself (like probably
every developer in finance, and many others) have tried to think of ways to
build upon the intuitiveness and usefulness of the spreadsheet paradigm while
making it more powerful.

And if you can find a good way to do this along with a solid interface that
non-programmers can use (Excel is the master at getting non-programmers to
program), there is probably a lot of money to be made.

Honestly, this looks like a query and analysis system that could be built on
top of a column store, so you don't need to deal with the storage part of the
system yet. (I'm a big kdb+ fan.)

Good luck. I starred and watched the repo just to see how it progresses.

~~~
asavinov
_Honestly, this looks like a query and analysis system that could be built on
top of a column store so you don't need to deal with the storage part of the
system yet. (I'm a big kdb+ fan.)_

I like this idea because it allows for reusing an existing engine and,
importantly, integrating it with many different engines (which might already
have quite sophisticated data management mechanisms). Yet such an engine has
to expose quite low-level operations on data. Also, there has to be support
for user-defined functions (lambdas).

------
nickpeterson
I skimmed the readme but didn't see the answer to what I regard as a basic
question. How is this different from a view? I can easily make derived columns
based on functions and reference those in other views (performance issues
aside).

------
julienfr112
How does that compare to SAS software
([https://en.wikipedia.org/wiki/SAS_(software)](https://en.wikipedia.org/wiki/SAS_\(software\)))?
Particularly the "DATA" steps.

~~~
philkrylov
SAS DATA steps create new datasets (tables) from one or more input datasets.
As far as I understand the author, Bistro just creates a new column in an
existing table.

~~~
asavinov
> Bistro creates just a new column in an existing table.

What is new is that this column can use data from other tables, not
necessarily its own. And eventually Bistro gets rid of joins and group-bys,
which are not difficult to understand, but their use is frequently
inappropriate: for example, in most cases joins are used where we actually
want to link two tables (a new relation is not needed).

Note also that we cannot avoid set operations and creating tables altogether,
so Bistro also supports table creation. Yet it does so in a functional way,
that is, by defining new columns.

------
KasianFranks
This is neat. Vector-space-based AI calculations will benefit from this
approach. Great work!

