How does it handle dirty data? Does it assign an "any" type? Also, why do you th...

chrisaycock · on Dec 30, 2021

Missing and poorly formatted input is given a type-specific value. E.g., a bad Float64 becomes nan and a bad Int64 becomes nil.

  >>> Int64("5")
  5

  >>> Int64("5b")
  nil

If type inference cannot determine a consistent type for a column in a CSV file, then the column will just be a String.
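For readers who don't know Empirical, here is a rough Python sketch of the parsing and fallback behavior described above. It is not Empirical's implementation: `parse_int64`, `parse_float64`, and `infer_column` are hypothetical names, `None` stands in for nil, and skipping empty cells during inference is an assumption.

```python
import math

def parse_int64(text):
    """Return an int, or None (standing in for nil) when the text is malformed."""
    try:
        return int(text.strip())
    except ValueError:
        return None

def parse_float64(text):
    """Return a float, or nan when the text is malformed."""
    try:
        return float(text.strip())
    except ValueError:
        return math.nan

def _parses(convert, text):
    """True if convert() accepts the text."""
    try:
        convert(text.strip())
        return True
    except ValueError:
        return False

def infer_column(cells):
    """Pick Int64 or Float64 if every non-empty cell parses; otherwise String."""
    present = [c for c in cells if c.strip() != ""]  # assumption: blanks are "missing"
    if present and all(_parses(int, c) for c in present):
        return "Int64"
    if present and all(_parses(float, c) for c in present):
        return "Float64"
    return "String"
```

With this sketch, a column of `["1", "2", ""]` would be inferred as Int64, while `["1", "5b"]` falls back to String.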

I don't know what you mean by "embedding" a Dataframe.

rscho · on Dec 30, 2021

On the website you linked:

"Embedding Dataframes into an existing language would not be possible."

I don't think it would be an issue for languages with good metaprogramming facilities.

chrisaycock · on Dec 30, 2021

Ah, I see what you're referring to.

The hardest thing is the load() function, particularly in the REPL. It looks dynamic, but is actually static. Pulling off this sleight of hand requires both type providers and automatic compile-time function evaluation on arbitrary expressions.

F# is the only other language I know of that has type providers. They invented it.

As for CTFE, languages like Zig and D require the user to indicate when to evaluate something ahead of time. I wanted this to happen automatically and still be available for compound expressions, user-defined functions, user-defined types, etc. Doing that requires tracking purity (no state or IO) in an expression, plus a mechanism to actually do the evaluation. I've never seen a language take it to the extreme that Empirical does.

So an existing statically typed language would need (1) a REPL interface, (2) purity tracking, (3) compile-time function evaluation, (4) some kind of types-as-parameters setup, and (5) array notation. Most existing statically typed languages don't have a REPL; the ones that do generally lack array notation. I couldn't find a language that did all of that plus type providers and automated CTFE on arbitrary expressions.
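To make the "purity tracking plus automatic evaluation" idea concrete, here is a toy Python sketch. It is only an illustration of the general technique, not how Empirical does it: the `PURE_FUNCTIONS` whitelist, `is_pure`, and `fold` are all hypothetical, and a real compiler would track purity through user-defined functions and types rather than a fixed table.

```python
import ast

# Hypothetical registry of functions known to have no state and no IO.
PURE_FUNCTIONS = {"abs": abs, "min": min, "max": max}

def is_pure(node):
    """True if the expression uses only constants, operators, and whitelisted calls."""
    if isinstance(node, ast.Expression):
        return is_pure(node.body)
    if isinstance(node, ast.Constant):
        return True
    if isinstance(node, ast.BinOp):
        return is_pure(node.left) and is_pure(node.right)
    if isinstance(node, ast.UnaryOp):
        return is_pure(node.operand)
    if isinstance(node, ast.Call):
        return (
            isinstance(node.func, ast.Name)
            and node.func.id in PURE_FUNCTIONS
            and not node.keywords
            and all(is_pure(arg) for arg in node.args)
        )
    return False  # variables, attribute access, etc. are treated as impure here

def fold(source):
    """Evaluate a pure expression ahead of time.

    Returns (True, value) if the expression was pure and has been evaluated,
    or (False, None) if it must be deferred to run time.
    """
    tree = ast.parse(source, mode="eval")
    if not is_pure(tree):
        return (False, None)
    code = compile(tree, "<ctfe>", "eval")
    return (True, eval(code, {"__builtins__": {}}, dict(PURE_FUNCTIONS)))
```

Here `fold("1 + abs(-2) * 3")` is evaluated immediately, while `fold("open('data.csv')")` is left for run time because `open` does IO. The user never marks anything as compile-time; the purity check decides.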

Hence, I had to create my own language.

mdcfrancis · on Dec 30, 2021

I've written something similar in Julia; you can see the record type used in https://www.juliapackages.com/p/namedtuples. The full library, which is not open source, uses this type for time series analysis. It's all type safe and allows expressions such as x = vwap( ts, 5) - l1( vwap( ts, 5)) through to a time-moving PCA. Julia makes writing this sort of thing short and quick. The total implementation was only a thousand lines or so of code.

chrisaycock · on Dec 30, 2021

I checked your website; do you have an example of how to load data from a file into NamedTuples? Specifically, can NamedTuples infer type from an external source?

Also, do you have an example of what a displayed table looks like? Julia has a DataFrames package that can display a table. I am curious to know how your time-series library displays a table.

mdcfrancis · on Dec 30, 2021

Unfortunately, I don't have access to that code anymore. I wrote a number of loaders for different data set types, including CSV. The time series were all modeled as forward-iterating streams of tuples, so there is no specific table abstraction. There is an implicit assumption that the stream is ordered by the join key (in a time series, the timestamp), though nothing in the implementation enforced that.

Joins are always n-way merge joins, so you can write something like y = 2x^2 - 3z + c and fold that into a single streaming operation y = f( x, z, c ) where y, x, z and c are time streams.
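Since the original code isn't available, here is a minimal Python sketch of what such a streaming n-way merge join could look like. The names `merge_join` and `y_stream` are hypothetical, and it assumes an inner join on exact timestamps over streams of (timestamp, value) pairs that are already sorted; the real library may well have used different join semantics.

```python
def merge_join(*streams):
    """Yield (timestamp, values) for each timestamp present in every stream.

    Each stream must yield (timestamp, value) pairs in ascending timestamp order.
    """
    iterators = [iter(s) for s in streams]
    try:
        heads = [next(it) for it in iterators]
    except StopIteration:
        return  # at least one stream is empty, so nothing can match
    while True:
        # Advance every stream until it reaches the largest current timestamp.
        target = max(ts for ts, _ in heads)
        try:
            for i, it in enumerate(iterators):
                while heads[i][0] < target:
                    heads[i] = next(it)
        except StopIteration:
            return
        if all(ts == target for ts, _ in heads):
            yield target, tuple(value for _, value in heads)
            try:
                heads = [next(it) for it in iterators]
            except StopIteration:
                return

def y_stream(xs, zs, cs):
    """Fold y = 2x^2 - 3z + c into one pass over the joined streams."""
    for ts, (x, z, c) in merge_join(xs, zs, cs):
        yield ts, 2 * x**2 - 3 * z + c

x = [(1, 1.0), (2, 2.0), (3, 3.0)]
z = [(1, 1.0), (3, 2.0)]
c = [(1, 10.0), (2, 10.0), (3, 10.0)]
print(list(y_stream(x, z, c)))  # [(1, 9.0), (3, 22.0)]
```

Timestamp 2 is dropped because z has no value there. Nothing is materialized as a table; each input is consumed once, in order.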

When rendered to screen they looked very similar to your examples. With plugins in the IDE you could directly plot an array of time series as a chart.

Since I wrote NamedTuples, the Julia core team has folded the functionality into the core of Julia: https://docs.julialang.org/en/v1/manual/types/#Named-Tuple-T.... This is also the core of JuliaDB (https://juliadb.org/). All credit to the Julia core team.

xwolfi · on Dec 30, 2021

I don't think I get it. I do a lot of pandas in a bank so I recognize your dataframes for what they are, but what advantage do you have over python+pandas?

I hate Python (I'm a Java dev helping Quants), but it's that or KDB, and I think I could murder the creator of KDB :D And I have to admit Pandas is instinctive and Python is easy enough to extend. What are you doing that's so important you made a language for it?

chrisaycock · on Dec 30, 2021

Empirical is statically typed. Python and q/kdb+ are dynamically typed.

I spent years using those products in finance. I would set up a simulation that would crash after four hours because of a misspelled column name. Empirical prevents that by refusing to run a script that has a type error or unresolved identifier. No more crashed overnight sims!
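A rough way to picture the difference, in Python: check every column reference against the table's schema before any work starts, instead of discovering the typo hours in. This is only a hand-rolled sketch with hypothetical names (`validate`, `run`, and the step format); Empirical does this in its type checker, not with a runtime pre-pass like this one.

```python
def validate(columns, steps):
    """Check every step's inputs against the known columns before running anything."""
    known = set(columns)
    for output, inputs, _ in steps:
        missing = [name for name in inputs if name not in known]
        if missing:
            raise NameError(f"step '{output}' refers to unknown column(s): {missing}")
        known.add(output)  # later steps may use this step's output

def run(table, steps):
    """Run (output, inputs, func) steps over a dict of column lists."""
    validate(table.keys(), steps)  # fail here, before any expensive work
    result = dict(table)
    for output, inputs, func in steps:
        rows = zip(*(result[name] for name in inputs))
        result[output] = [func(*row) for row in rows]
    return result
```

With a plan whose last step misspells "price" as "prise", `run` raises immediately and none of the earlier, slow steps are executed.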

iamwil · on Dec 30, 2021

You should say this under the question of how it's different than Julia.

It's not enough to say it's statically typed, since not everyone is convinced of the benefits based on the context they're coming from.

I just saw a talk by Rich Hickey about Clojure, and he eschews static typing, since he thinks of it as a coupling in a language. And based on the types of programs he writes and runs, he hasn't seen a benefit.

So I think if you're specific about what static typing buys you in the context of the job Empirical does, it's more convincing.

mdcfrancis · on Dec 30, 2021

I can answer for the type-stable Julia case: if you have a struct in Julia that is composed only of primitive types, it is stored as a C struct with zero overhead and a fixed byte length. An array of these is then crazy efficient when it comes to streaming into the CPU etc. If you dig around the GPU support in Julia you can see this used to good effect.
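The same layout idea can be shown from Python with ctypes, as an analogy only (this is not Julia, and the `Trade` record is a made-up example): a record of primitive fields has a fixed size, and an array of them is one contiguous block with no per-element boxing.

```python
import ctypes

class Trade(ctypes.Structure):
    # Hypothetical record made only of primitive fields.
    _fields_ = [("timestamp", ctypes.c_int64), ("price", ctypes.c_double)]

# A fixed-length array of three records, stored back to back in memory.
trades = (Trade * 3)(Trade(1, 10.0), Trade(2, 10.5), Trade(3, 11.0))

record_size = ctypes.sizeof(Trade)   # two 8-byte fields
array_size = ctypes.sizeof(trades)   # exactly three records, no extra overhead
print(record_size, array_size)
```

Because every element has the same byte length, the address of element i is just the base address plus i times the record size, which is what makes this kind of array cheap to stream through.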

maest · on Dec 30, 2021

> python+pandas

Another advantage is supporting sql-like syntax natively (and not having to use pandas' awkward, bolted-on API)