
Show HN: Empirical – a language for time-series analysis - chrisaycock
https://www.empirical-soft.com/
======
chrisaycock
I wrote Empirical to address issues I had routinely faced in my career.

I spent ten years in quantitative finance, primarily statistical arbitrage and
high-frequency trading. I always ran into problems working with time-series
data, from fetching the data to expressing the algorithms to watching my
backtests fail from a type error after four hours.

As a result, Empirical has statically typed Dataframes and built-in timestamp
types. It can infer the types from a file as long as the input source is known
at compile-time, such as in a REPL.

Today's release is the very first public beta. There is still a ton of work to
do; see the roadmap for further details:

[https://github.com/empirical-soft/empirical-lang/issues/1](https://github.com/empirical-soft/empirical-lang/issues/1)

I have released the source code under the AGPL with the Commons Clause. A
proprietary license is available for users who need more commercial-friendly
terms.

It was a long journey to get here. I would like to thank everyone who
participated in the private beta. And lastly, I would like to thank Y
Combinator; I won a Startup School grant last year.

~~~
hx2a
I like how SQL syntax is integrated into the language. That is very
convenient.

Can you compare/contrast this language with q? Can I write complex queries as
I can do in q? How about performance?

~~~
chrisaycock
q was definitely an inspiration.

The biggest difference is that Empirical is statically typed. Also, arbitrary
expressions are allowed anywhere when dealing with Dataframes.

For example, I can sort by anything:

    
    
        sort my_table by col1 - col2, foo(col3)
    

And I can aggregate by an external array as long as the lengths are the same:

    
    
        from my_table select foo(col1) by some_array_with_repetition
    

As for performance, I've tried to make Empirical "reasonable" at this stage,
though I haven't put too much effort into it beyond that. I'm more worried
about the ergonomics of the language right now, so I don't have SIMD,
dependency analysis, etc.

One of the biggest things Empirical lacks compared to q is nested arrays.
That's a major issue I have to tackle in the virtual machine.
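
To make the two queries above concrete, here is a minimal Python sketch of what
"sort by arbitrary expressions" and "aggregate by an external array" mean. It
is not Empirical code and says nothing about how the VM does it. The table
contents are made up, `sort_table` and `aggregate_by` are hypothetical helpers,
and `foo` is assumed to be an elementwise `abs` in the sort and a plain `sum`
in the aggregation.

```python
# Illustrative only: a "table" is a dict of equal-length columns.
my_table = {
    "col1": [5, 3, 9, 1],
    "col2": [2, 3, 1, 4],
    "col3": [-1, 7, -4, 2],
}

def foo(column):
    """Hypothetical elementwise function; here it is just abs()."""
    return [abs(x) for x in column]

def sort_table(table, *keys):
    """Sort every column by one or more computed key columns."""
    n = len(next(iter(table.values())))
    order = sorted(range(n), key=lambda i: tuple(k[i] for k in keys))
    return {name: [col[i] for i in order] for name, col in table.items()}

def aggregate_by(values, groups, reducer):
    """Reduce `values` per distinct entry of `groups`, an external array."""
    if len(values) != len(groups):
        raise ValueError("group array must match the column length")
    buckets = {}
    for value, group in zip(values, groups):
        buckets.setdefault(group, []).append(value)
    return {group: reducer(vals) for group, vals in buckets.items()}

# sort my_table by col1 - col2, foo(col3)   (foo = elementwise abs here)
diff = [a - b for a, b in zip(my_table["col1"], my_table["col2"])]
sorted_table = sort_table(my_table, diff, foo(my_table["col3"]))

# from my_table select foo(col1) by some_array_with_repetition   (foo = sum here)
some_array_with_repetition = ["a", "b", "a", "b"]
totals = aggregate_by(my_table["col1"], some_array_with_repetition, sum)
```

The only requirement on the external array is the one stated above: it has to
be the same length as the column being aggregated.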

------
dmix
This is a great programming language homepage, content-wise. The copy and
examples are good and get right to the point.

But the navigation needs some work. The problem is the top logo isn't
clickable, so I can't go back to the homepage from any of the subpages without
clicking the back button, and the subpages don't have the primary navigation
at the top either.

Edit: also the navigation should be repeated in the footer.

~~~
chrisaycock
Thanks for the tips. Web design is a complete mystery to me, so I'll gladly
take all the feedback I can get.

------
jaupe
I really like that it feels like a dynamically typed language but with the
security of type inference. That's really cool.

~~~
chrisaycock
That's exactly the feel I've been going for. Instead of "gradual typing", I
wondered if there was a way to make everything statically typed but still read
from a file. I settled on a combination of _type providers_ (F#) and _compile-
time function evaluation_ (D).
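
A rough Python sketch of the general idea, for anyone who hasn't seen type
providers: read the file's header and a sample of rows up front, then fix a
schema before any query runs. This is only an illustration, not Empirical's
implementation. `infer_schema` and the candidate list are hypothetical, and the
type names other than Int64 are assumed for the example.

```python
import csv
import io
from datetime import datetime

def _parses(convert, text):
    """True if `convert(text)` succeeds without a ValueError."""
    try:
        convert(text)
        return True
    except ValueError:
        return False

# Ordered from most to least specific; String accepts everything.
CANDIDATES = [
    ("Int64", lambda text: _parses(int, text)),
    ("Float64", lambda text: _parses(float, text)),
    ("Timestamp", lambda text: _parses(datetime.fromisoformat, text)),
    ("String", lambda text: True),
]

def infer_schema(source, sample_rows=100):
    """Return {column name: type name} from a header row plus a sample."""
    reader = csv.reader(source)
    header = next(reader)
    alive = [list(CANDIDATES) for _ in header]
    for row_number, row in enumerate(reader):
        if row_number >= sample_rows:
            break
        for i, cell in enumerate(row):
            # Drop every candidate type that cannot represent this cell.
            alive[i] = [(name, ok) for name, ok in alive[i] if ok(cell)]
    return {col: cands[0][0] for col, cands in zip(header, alive)}

sample = io.StringIO(
    "symbol,timestamp,price,size\n"
    "AAPL,2019-05-01 09:30:00,210.52,100\n"
    "BAC,2019-05-01 09:30:01,30.5,3400\n"
)
schema = infer_schema(sample)
```

Once the schema is known, a later expression that misuses a column can be
rejected before any data is processed, which is the property I care about.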

------
mamcx
Pretty cool. And it is similar to my idea for a relational language:

[https://bitbucket.org/tablam/tablam/wiki/Syntax](https://bitbucket.org/tablam/tablam/wiki/Syntax)

Only, this has shipped!

------
cedricd
This looks super interesting. We're doing a startup right now that transforms
data into time series tables. Building datasets from those in SQL has been
challenging enough that we built out a UI to do it. This could be another
elegant approach.

~~~
chrisaycock
Narrator looks interesting. I'm only reading CSV files for now, but I do want
to handle SQL pushdown at some point in the future. Feel free to ping me if
you want to swap war stories.

christopher.aycock (AT) empirical-soft.com

------
e12e
Looks very interesting. Is there (currently) any facility for saving work?
Like writing dataframes and/or functions to disk? I had a quick look at the
tutorial and source - but only found the stuff handling csv input.

~~~
chrisaycock
It's pretty rudimentary, but you can save CSV files from a Dataframe with:

    
    
        store(df, "some_file.csv")
    

I don't have modules yet, but you can load an Empirical code file from the
REPL with a "magic command":

    
    
        >>> \l my_functions.emp
    

The full list of magic commands is available with:

    
    
        >>> \help

------
victorNicollet
I really like the `asof` keyword, so I'll be stealing it for my own language
:-)

I suppose that `from ..` does not print the result, but rather returns a new
dataframe that just happens to be printed by the REPL?

~~~
chrisaycock
You are correct about "from". Empirical is a normal programming language; when
an expression is evaluated in the REPL and the result isn't stored, then the
result is printed to the screen.
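
For readers who haven't met an as-of join like the `asof` keyword mentioned
above, here is a minimal Python sketch of the usual "backward" flavor: each
left row picks up the most recent right row at or before its timestamp. This
assumes the semantics commonly used elsewhere (q's `aj`, pandas' `merge_asof`);
it is not a statement about Empirical's exact behavior or options. `asof_join`
and the sample data are hypothetical, and the right-hand times must already be
sorted.

```python
from bisect import bisect_right

def asof_join(left_times, right_times, right_values):
    """For each left time, take the right value at or before it (or None)."""
    result = []
    for t in left_times:
        # Number of right rows whose time is <= t.
        pos = bisect_right(right_times, t)
        result.append(right_values[pos - 1] if pos else None)
    return result

trade_times = [0, 3, 7, 12]
quote_times = [1, 5, 7, 10]          # must be sorted ascending
quote_bids = [100.0, 100.5, 101.0, 100.8]

prevailing_bid = asof_join(trade_times, quote_times, quote_bids)
```

The first trade happens before any quote exists, so it gets no match; the
trade at time 7 matches the quote with the identical timestamp.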

------
floki999
Very nice. I realize that it currently runs in a shell, but one ingredient I
would absolutely want is built-in charting capabilities - especially when
running back-tests etc.

~~~
chrisaycock
Visualization is definitely important and I would love to hear anybody's
thoughts on it. I think I should make a wrapper to an existing library, like
matplotlib or ggplot2.

~~~
X6S1x6Okd1st
I've really been enjoying vegalite, but that does require a browser.

------
pnichols
Any comments on performance today or in the near future? Any features which
should provide a big speedup in the future as compared to competitors (kdb,
pandas)?

~~~
chrisaycock
I've primarily been focused on the ergonomics of the language, so I've only
tried to make performance "reasonable" for now.

Longer-term performance objectives are:

1\. JIT - I designed the VM's byte code to be both interpretable and a
mid-level IR for LLVM. Currently I just interpret everything since there is
almost no runtime overhead for vector operations. However, compiled code will
greatly speed up any scalars in a loop.

2\. SIMD - Since the VM's opcodes are already statically typed and vector-
aware, integrating OpenBLAS and SLEEF (or Intel's MKL and VML) should be
straightforward.

3\. MIMD - Ideally I can just lean on existing libraries, though I'm not above
embedding OpenMP if that gets the job done.

4\. Distributed - Now comes the hard part. If we want MPI-level performance, I
need to have more sophisticated scheduling. Which leads us to...

5\. Streaming - This is the real holy grail. There has been a ton of research
in the database community to get away from the "Volcano model" (iterators). I
want to have the compiler generate streaming-aware opcodes for the VM based on
the nature of how the data is to be consumed. I believe this will require a
type system that can track the "context" of the computation, similar to how
Koka and F* track side effects. I'm not aware of any general-purpose language
that has compiled streaming.
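
As background on the "Volcano model" mentioned in item 5, here is a small
Python sketch of a pull-based iterator pipeline next to a column-at-a-time
version of the same query. Generators stand in for the classic
open/next/close operator interface. The operator names and sample rows are
made up for illustration, and none of this reflects how VVM is implemented.

```python
def scan(rows):
    """Leaf operator: yields one row (a dict) at a time."""
    for row in rows:
        yield row

def select(child, predicate):
    """Filter operator: pulls from its child and passes matching rows up."""
    for row in child:
        if predicate(row):
            yield row

def project(child, expression):
    """Projection operator: computes one value per row pulled from the child."""
    for row in child:
        yield expression(row)

rows = [
    {"price": 10.0, "size": 100},
    {"price": 10.5, "size": 0},
    {"price": 11.0, "size": 300},
]

# Row-at-a-time pipeline: every value passes through a chain of next() calls.
plan = project(
    select(scan(rows), lambda r: r["size"] > 0),
    lambda r: r["price"] * r["size"],
)
volcano_result = list(plan)

# Column-at-a-time equivalent: each step makes one pass over whole columns.
prices = [r["price"] for r in rows]
sizes = [r["size"] for r in rows]
keep = [s > 0 for s in sizes]
vector_result = [p * s for p, s, k in zip(prices, sizes, keep) if k]
```

Both forms give the same answer; the difference is the per-row call overhead
in the first one, which is what the research I mentioned tries to avoid.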

~~~
corysama
Looking at interpret.cpp for SIMD potential: I bet you could add an allocator
for std::vector that aligns and pads everything to 32 bytes then just replace
all of the scalar op loops with loops over AVX intrinsics. No need for an
external library.

~~~
chrisaycock
That's a possibility to get something running near term. I'm trying to avoid
CPU-specific intrinsics since I have a fantasy that this might be run on ARM
in the future, though that may be getting really ahead of myself.

~~~
corysama
NEON intrinsics are pretty easy as well ;) As long as you are doing simple
+-*&| ops they work the same as SSE.

------
atemerev
Thank you! Aside from non-free kdb, there are not many products available in
the field. I will test it and see if it works for my tasks.

~~~
chrisaycock
Thanks for trying it. As I mention elsewhere, Empirical is pretty limited
right now because this is the first beta release. If you run into something
specific that's missing, please let me know about it, ideally on the issue
tracker:

[https://github.com/empirical-soft/empirical-lang/issues](https://github.com/empirical-soft/empirical-lang/issues)

That way I'll know what targets to hit.

------
mvcalder
My corporate overlords are blocking access to your site. They say your cert is
invalid.

~~~
chrisaycock
Are you able to post any error messages? I haven't gotten notice from anyone
else, and my own browser doesn't have any complaints. But if there's a
problem, I want to get it fixed.

~~~
mvcalder
I sent them your reply and magically all is good. Thanks for the "ammo" and the
work.

------
chubot
This looks cool! A few comments:

\- I peeked at the VVM implementation, since I've been looking for an
implementation of data frames for my shell Oil [1]. I've looked at Hadley
Wickham's dplyr code, R's data.table library, Pandas (which is somewhat
awkwardly based on NumPy), and a little bit at Apache Arrow. I also remember R
has a "zoo" library though I haven't used it much.

Was your design influenced by any system in particular? I've had a hard time
finding any descriptions of data frames other than the code. I have less
experience with time series, but I believe the main issue on top of data
frames is having joins by time columns (e.g. your "asof" operator).

But otherwise, could VVM be used for dplyr-style analysis? dplyr has a
very rich set of operators.

[https://www.rstudio.com/wp-content/uploads/2015/02/data-wran...](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

Hadley does a good job of describing the high level philosophy, but I've been
looking for low level advice, like how to do vectorized math quickly with
overflow checks and so forth. (Do you have ints or is everything a float?)
Maybe it's not a big deal, but it's not something I have experience with. I'd
like to read about someone's implementation, especially in portable C / C++. I
think a lot of earlier systems were in Fortran/assembly.

I guess your implementation is fairly different because the language is
statically typed. I peeked and it looks like DataFrame is std::vector<void*>,
which makes sense for static typing.

Have you looked at how Julia does things? It has macros and fast code
generation. I imagine you started this project before Julia 1.0, where they
added NA for data frame support. I'm more of an R and Python user, but I find
Julia pretty interesting, e.g. something like this approach is impossible in R
or Python AFAIK:

[http://scattered-thoughts.net/blog/2016/10/11/a-practical-re...](http://scattered-thoughts.net/blog/2016/10/11/a-practical-relational-query-compiler-in-500-lines/)

[http://scattered-thoughts.net/blog/2018/08/16/julia-as-a-pla...](http://scattered-thoughts.net/blog/2018/08/16/julia-as-a-platform-for-language-development/)

\- I watched your video, which is a nice demo. My feedback: if you want to
maximize the number of people that get through it, I would make the font
bigger and also raise the bottom of the window so the typed code is more
readable. The code is nearly clipped off which causes some friction for
viewers. Hope that's helpful.

\- Nice to see someone else using Zephyr ASDL! I have linked these blog posts
a few times here:
[http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL](http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL)

Anyway I hope to have time to play with this a bit more. I don't have that
many time series use cases but I'm definitely interested in data frames!

[1] The slogan for why a shell could use data frames is: "the output of ls and
ps is a table". For those unfamiliar with data frames, here's my intro: _What
Is a Data Frame? (In Python, R, and SQL)_
[http://www.oilshell.org/blog/2018/11/30.html](http://www.oilshell.org/blog/2018/11/30.html)

~~~
chrisaycock
VVM is column-oriented, which is how pretty much every Dataframe
implementation works. Each column is a vector of whatever the user's type
represents; Int64 in Empirical is i64 in VVM and int64_t in C++.

VVM has its own statically typed assembly language. You can see examples of it
in the regression tests; here's one that sorts a table:

[https://github.com/empirical-soft/empirical-lang/blob/master...](https://github.com/empirical-soft/empirical-lang/blob/master/tests/VVM/sort.vvm)

Since it's a virtual machine, VVM is pretty low level and really only meant as
a compilation target. While it does some of the heavy lifting to match keys or
determine the order of indices in a vector, Empirical is needed to coordinate
the moving pieces.
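
As a rough picture of that division of labor, here is a Python sketch of a
column-oriented table where each column is a typed vector, one low-level step
computes the order of indices, and a higher level applies that order to every
column. It is an illustration under my own naming (`order_of` and `gather` are
hypothetical), not VVM's actual opcodes or C++ data structures.

```python
from array import array

# Each column is a homogeneous typed vector; in Python's array module,
# 'q' is a signed 64-bit integer and 'd' is a 64-bit float.
table = {
    "id": array("q", [3, 1, 2]),
    "price": array("d", [9.5, 7.25, 8.0]),
}

def order_of(column):
    """Return the index permutation that sorts `column` (an argsort)."""
    return sorted(range(len(column)), key=column.__getitem__)

def gather(column, indices):
    """Build a new column of the same type by picking rows in index order."""
    return array(column.typecode, (column[i] for i in indices))

# Low-level step: determine the order of indices from one key column.
indices = order_of(table["id"])

# Coordinating step: apply the same permutation to every column.
sorted_table = {name: gather(col, indices) for name, col in table.items()}
```

Because the sort key only produces a permutation, the same indices can be
reused for any number of columns without comparing their values again.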

Empirical takes a very different approach from Julia. Empirical is statically
typed (not "gradually" typed), is focused around Dataframes, and compiles to a
VM that is then interpreted.

As I mention elsewhere, I haven't done much on the performance side of things.
I eventually want SIMD and JIT, but my priority for now is getting the
Empirical language right.

