This looks neat, but I'm not sure why I would care about it. There are already a ton of solutions in the ecosystem with a columnar-like interface.
Granted, we wrote our own as well: it uses the builder pattern, and you then toss the result to an executor (our main backend for this is Spark). One reason we wrote it is persistence. Being able to encode and persist a series of transforms that you can then load remotely has been very helpful for us in machine learning.
We've since migrated the project to the Eclipse Foundation and intend to rewrite the interface as well as integrate our baked-in tensor library into certain parts of the pipeline, both for speed and for handling things like computer vision workloads.
In general, I always like seeing new takes on columnar processing, but I'm just not seeing anything novel here. Clarification of intent would be great!
There are at least three major questions when we want to introduce a new logical data model:
* How to define columns within one table. Conceptually it is easy, e.g., SELECT x, y, c = a+b FROM T. Yet even in this simple case we see a controversy: this statement will create a table, but our goal is not to create a table - we want to create a column (a function). Bistro uses the calc operation for that purpose. It is of course not new; in pandas, for example, one can use df.apply.
* How to connect several tables. The RM, map-reduce and other set-oriented approaches use join, which produces a new table. Here we have a similar controversy: I do not want to produce a new table - that is not my goal. My goal is to link these two tables, and this means creating a new column. Bistro introduces the link operation for such columns.
* How to aggregate data. Here the typical operation is group-by. It is an eclectic operation which combines several other operations, and such an approach also has some problems. Bistro changes the way data is aggregated by introducing accumulate functions, which get one input (not a subset) and return one output value. An accumulate function is called for each element of the group, updating the current aggregate instead of computing the aggregate from the whole group.
So linking and aggregation are what distinguish Bistro from other approaches and frameworks for data processing, including pandas, SQL and map-reduce.
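As a plain-Python sketch of the calc and accumulate ideas (illustrative only - these names and structures are my own, not Bistro's actual API):

```python
# Hypothetical sketch of column-oriented operations; not Bistro's real API.
# A "table" is just a dict of equal-length columns.
T = {"a": [1, 2, 3], "b": [10, 20, 30], "g": ["x", "y", "x"]}

# calc: define a new column as a function of other columns of the same table,
# without producing a new table.
T["c"] = [a + b for a, b in zip(T["a"], T["b"])]

# accumulate: aggregate by updating a running value once per element,
# instead of materializing each group and reducing it as a whole.
totals = {}
for g, a in zip(T["g"], T["a"]):
    totals[g] = totals.get(g, 0) + a  # accumulate: (current, value) -> new

print(T["c"])   # [11, 22, 33]
print(totals)   # {'x': 4, 'y': 2}
```

The point of the accumulate style is that the aggregate never needs the whole group in hand at once, which is what makes it a natural fit for incremental and streaming evaluation.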
There are a ton of things I'd need before I'd start looking at this:
1. Backend agnostic: let me run on different backends like Flink/Spark.
2. Give me off-heap memory, please. Let me play dangerous and use pointers directly to optimize interactions with transforms.
We wrote our own GC, among other things, for our tensor lib due to GC bottlenecks, copying and the like.
That's personally what I kind of like about Tablesaw.
A library for one-off ad-hoc analysis in memory isn't a bad start though, especially since most folks don't actually have problems that large.
* Python framework like pandas
* Big data processing framework like Spark
* Database management system
* Data integration system like typical ETL and BI
* Stream analytics like Kafka Streams
* IoT (lightweight) stream processing engine
* Something else?
I would be very thankful for any suggestion from people who know the market and (acute) needs of the customers. What is the best niche for this kind of technology?
So if I were you, I would do Python DSL, because it addresses my immediate needs of increasing productivity. :)
Maybe you could use Calcite as an engine, or as a base, kind of like Arrow.
If you do Python, maybe look into pyjnius.
There are a lot of things that already have connectors. I would continue along your MVP route, letting folks do basic things with your framework first; then you can improve it as you go. SQL databases aren't a bad initial target - most folks can do SQL.
It is definitely interesting and important to implement persistence (for example, using Parquet or Arrow) as well as other mechanisms like sharding or replication (for big data processing, fault tolerance, etc.). Yet this direction currently has lower priority, because the next task I want to focus on is in-stream analytics (an alternative to Kafka Streams).
In general, the whole approach is focused on the logical level of data modeling and processing; that is, the goal is to increase the productivity and simplicity of development. The general idea (and hypothesis) is that defining how data is processed using column operations is easier, more intuitive, less error-prone and easier to maintain than using purely set operations.
In other words, at the logical level it is an alternative to map-reduce, SQL, pandas and other models and frameworks where set operations are used to process data.
And every APL programmer just nodded in agreement.
Really though, I spent the time reading the readme thinking “This looks very cool, shame it doesn't have a nice language to go with it…”
Another advantage is that it allows performing many transformations (e.g., filtering) directly on dictionary-compressed data, without decompressing it. This works well in Vertica (based on the C-Store DB), which was our inspiration for building a lightweight ETL for business users that also uses a columnar in-memory data transformation engine.
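The idea of filtering directly on dictionary-compressed data can be sketched in plain Python (illustrative only - real engines operate on packed code arrays, but the principle is the same):

```python
# A dictionary-encoded column: values stored as small integer codes
# plus a dictionary mapping code -> value.
dictionary = ["red", "green", "blue"]
codes = [0, 2, 1, 0, 2, 2]  # encoded column, never decompressed below

# To filter rows where value == "blue", translate the predicate once
# into code space, then compare integers only - no string comparisons,
# no decompression of the column.
target = dictionary.index("blue")  # a single dictionary lookup
matching_rows = [i for i, c in enumerate(codes) if c == target]

print(matching_rows)  # [1, 4, 5]
```

Equality and membership predicates translate directly into code space like this; range predicates additionally need an order-preserving dictionary.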
Might I suggest using https://github.com/google/google-java-format for formatting?
Because right now I have no idea why I’d choose to learn this new stuff over using google-able tools I already know.
Make your value proposition really clear.
* UI (Angular 2): https://github.com/asavinov/sc-web
* REST server: https://github.com/asavinov/sc-rest
* Core engine: https://github.com/asavinov/sc-core - Bistro is a complete rewrite of this project
There is an (old) implementation of this idea in C#:
* UI (WPF): https://bitbucket.org/asavinov/dc-wpf
* Core engine (C#): https://bitbucket.org/asavinov/dce-csharp
There is also an old Java implementation of the engine:
What is the use case? Does it support time series? How would you do a moving average or a pivot table?
This is what I am trying to understand myself :) Previously I implemented this approach as a web app (a self-service tool for working with tables where users can define columns as formulas - similar to spreadsheets). Discussed here: https://news.ycombinator.com/item?id=14351461 But it was too difficult to implement (much more resources are needed), so I switched to developing a library.
Now I want to implement a server for in-stream analytics (an alternative to Kafka Streams), hopefully in the next version of Bistro. It will have quite significant portions of the data processing logic in UDFs (retention policy, when to evaluate, how to add, etc. - when to do what).
> Does it support time series? How would you do a moving average or a pivot table?
I am designing it now as a tool for stream processing (Bistro Streams), and hence it will support time series (each new row will get a timestamp and the system will "know" how to deal with time).
The current approach to user-defined functions (the Evaluator interface) does not support moving averages or other rolling aggregations. A new API will be defined. This task has high priority, since it is very important for stream analytics.
Pivoting is conceptually more difficult because it is not an operation on data - it is an operation on the schema (using data). Maybe some kind of ad-hoc solution will work.
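A moving average fits the accumulate style naturally, since each new value can update the running aggregate rather than forcing a recomputation over the whole window. A minimal plain-Python sketch (illustrative only, not the planned Bistro API):

```python
from collections import deque

# Incremental moving average over a stream: each arriving value updates
# the current aggregate; the window buffer only exists to retire old values.
def moving_average(stream, window):
    buf, total = deque(), 0.0
    for x in stream:
        buf.append(x)
        total += x
        if len(buf) > window:
            total -= buf.popleft()  # retire the oldest value
        yield total / len(buf)

print(list(moving_average([1, 2, 3, 4, 5], window=3)))
# [1.0, 1.5, 2.0, 3.0, 4.0]
```

The accumulate function here is "add the new value, subtract the retired one" - one input, one output, no access to the whole group, which is exactly the property that makes it streamable.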
And if you can find a good way to do this along with a solid interface that non-programmers can use (Excel is the master at getting non-programmers to program), there is probably a lot of money to be made.
Honestly, this looks like a query and analysis system that could be built on top of a column store, so you don't need to deal with the storage part of the system yet. (I'm a big kdb+ fan.)
Good luck. I starred and watched the repo just to see how it progresses.
I like this idea because it allows reusing an existing engine and, importantly, integrating with many different engines (which might already have quite sophisticated data management mechanisms). Yet such an engine has to expose quite low-level operations on data. Also, there has to be support for user-defined functions (lambdas).
What is new is that this column can use data from other tables - not necessarily this one. And eventually Bistro gets rid of joins and group-bys, which are not difficult to understand, but whose use is frequently inappropriate; for example, in most cases joins are used where we actually want to link two tables (a new relation is not needed).
Note also that we cannot avoid set operations and creating tables. Therefore, Bistro also supports table creation. Yet it does so in a functional way, that is, by defining new columns.
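In plain Python (illustrative only - these dict-of-columns "tables" are my own stand-in, not Bistro code), the difference between joining and linking might look like this:

```python
# Two "tables" represented as dicts of equal-length columns.
orders = {"id": [1, 2, 3], "cust": ["a", "b", "a"]}
customers = {"cust": ["a", "b"], "city": ["Berlin", "Paris"]}

# join-style would build a brand-new table combining both relations.
# link-style instead adds one column to `orders` that references the
# matching row of `customers` (here, by row index).
index = {c: i for i, c in enumerate(customers["cust"])}
orders["cust_row"] = [index[c] for c in orders["cust"]]

# Downstream, the link column gives access to any customer attribute
# without a new table ever having been produced:
cities = [customers["city"][r] for r in orders["cust_row"]]
print(cities)  # ['Berlin', 'Paris', 'Berlin']
```

The link column plays the role of a stored foreign-key lookup: the two tables stay as they are, and navigation replaces materializing the joined relation.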