
JuliaDB - dunefox
https://juliadata.github.io/JuliaDB.jl/latest/
======
KenoFischer
Just to set expectations: while this package is decently stable and used by a
number of people, both commercially and in open source, it is basically in
maintenance mode at this point in time. The intention a few years back was to
build a complete analytical database on top of Julia, but have it be a mostly
independent project. That vision didn't quite materialize, partly because not
enough people wanted to do distributed analytical work (often a single big box
is enough, particularly with a language as fast as Julia) and partly because
the main developer left to pursue a PhD. I wouldn't be surprised if it got
revived in the future, but for the moment, it's basically just the distributed
tabular computing part (which is useful if you have that problem, but more
limited than the name might imply).

~~~
mumblemumble
> not enough people wanted to do distributed analytical work (often a single
> big box is enough, particularly with a language as fast as Julia)

As someone who's in the midst of a plan to migrate off of Apache Spark,
because we've discovered that, for our particular purposes, a single-machine
implementation in a language with better crunching and multithreading chops
than Java can easily achieve comparable performance to a Spark cluster while
being a good deal easier to develop and maintain, this comment strikes me as
being very poignant.

I have come to suspect that the big data ecosystem is a castle that was
largely built on a nice, thick technical foundation that is largely composed
of pointer chasing.

~~~
smabie
Yeah, just look at kdb+/q performance compared to Spark. The former is orders
of magnitude more efficient and you realize that a cluster is only necessary
because the performance and memory usage of the JVM are very poor on numerical
workloads. Same goes for Hadoop.

The amount of money and engineering resources thrown at a problem that
actually has a pretty simple solution (integrate your db and language, make
your program and language runtime fit in L1 i-cache, optimize the hell out of
your language) is a little disheartening. But hey, it keeps people employed
managing and debugging an unholy mess of DBs, K8s clusters, load balancers, KV
stores, and message brokers.

~~~
AzzieElbab
Sure, but there are 12 people in the world who know how to use k and q
efficiently, and there is no failure recovery or any way to deal with data
that actually does not fit in RAM.

~~~
snicker7
That is not true. RAM is cheap (terabytes per node is not uncommon). Moreover,
a common pattern for kdb+ databases is to have two parts: a real-time database
(rdb) that is typically memory backed and a historical database (hdb) that is
serialized to disk (mmap'd for near-RAM speed). In practice, most data
fetching is sequential (cache friendly), so you get good performance
regardless. kdb+ is commonly used in the financial industry, where data
volumes are large.
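
The "serialized to disk, mmap'd, scanned sequentially" idea can be sketched in a few lines. This is plain Python rather than q, and it is only an illustration of the access pattern, not of how kdb+ lays out its files; the function names and the `price.col` file name are made up for the example.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Serialize one column of float64 values to disk as raw bytes."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def sum_column(path):
    """Memory-map the column file and scan it front to back.

    The OS pages data in on demand, so the file is never copied
    into a Python list. (mmap cannot map an empty file.)
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as float64 without copying.
            with memoryview(mm) as raw, raw.cast("d") as view:
                total = 0.0
                for x in view:  # sequential, cache-friendly access
                    total += x
                return total


# Demo: a tiny stand-in for a historical column on disk.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "price.col")
    write_column(path, [1.0, 2.0, 3.5])
    print(sum_column(path))  # 6.5
```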

The language itself is extremely minimal, so the learning curve is actually
much lower than most languages (it forces you to write in a "vectorized"
style, which is not too different from how numpy/R/MATLAB work). I was able
to get up to speed in a couple of days.

Source: Used to work with kdb+ (still do sometimes). Not a shill (I don't even
like k/j/q/APL).

~~~
AzzieElbab
I am a kdb user, I am not arguing against it. More power to you if you can
afford to endlessly scale kdb vertically. The rest of the world has to deal
with distributed systems, because people do not want to pay for kdb or the
machines to run it on.

------
smabie
How does JuliaDB compare to DataFrames.jl? I know JuliaDB has row indices,
which is a very nice feature that DataFrames.jl lacks. Is there any reason I
shouldn't use JuliaDB even if I don't need persistence?

I wish the Julia ecosystem was a little more integrated: there are a lot of
different competing libraries that ostensibly do the same thing. Python has
the advantage that it's obvious what you should use: numpy, Pandas, scipy,
statsmodels, matplotlib, etc.

With Julia, it's less clear. Though I think part of the reason is that
actually releasing a new scientific computing Python library is incredibly
difficult and requires a lot of expertise.

Julia makes it pretty trivial for anyone to contribute a model that has
excellent performance. This fragmentation is a common problem among expressive
languages.

~~~
xiaodai
Partly, it is because Julia is newish and partly because of the Lisp curse
that Julia suffers from.

But unsustained efforts will fall by the wayside and true gems, like
DataFrames.jl, will emerge as the clear front-runners.

~~~
porker
> and partly because of the Lisp curse that Julia suffers from.

The Lisp curse?

~~~
oxinabox
http://winestockwebdesign.com/Essays/Lisp_Curse.html

------
eigenspace
I always found it rather ironic that this package, which was developed by
JuliaComputing, manages to violate two common conventions for package names in
the Julia ecosystem:

1) don't put "julia" in the package name

2) don't use abbreviations in the package name

Of course, JuliaDB is probably older than these conventions, but it's amusing
nonetheless.

~~~
elcomet
It's not only a Julia package though, is it?

If it's a full database engine, it might be usable from other languages. So
the name makes sense.

~~~
eigenspace
It's 'just' a Julia package.

------
nine_k
So, it's a Julia implementation of something similar to Pandas / numpy. It's
not a "DB" in the sense that Postgres, SQLite, or dbm are "DBs": it's not
persistent, and data must fit in RAM.

It's cool that it's pure Julia, so it's instantly portable everywhere Julia
runs, and the code is safer than it could be were portions of it written in C.

~~~
electriccello
Difference from DataFrames: it's possible to use it on larger-than-memory
data, as long as you have it in separate CSV files, which can be super useful
at times!
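
A rough sketch of why splitting data across CSV files helps with larger-than-memory work: each file can be streamed and folded into a running aggregate, so only one row is held at a time. This is plain standard-library Python, not JuliaDB's API; `streaming_mean`, the file names, and the column `x` are hypothetical.

```python
import csv
import glob
import os
import tempfile


def streaming_mean(pattern, column):
    """Mean of `column` over every CSV file matching `pattern`.

    Rows are read one at a time, so memory use does not grow
    with the total size of the dataset.
    """
    count = 0
    total = 0.0
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row[column])
                count += 1
    if count == 0:
        raise ValueError("no rows found")
    return total / count


# Demo: two tiny files standing in for a large, pre-split dataset.
with tempfile.TemporaryDirectory() as d:
    for name, values in [("part1.csv", [1, 2]), ("part2.csv", [3])]:
        with open(os.path.join(d, name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["x"])
            writer.writerows([v] for v in values)
    print(streaming_mean(os.path.join(d, "*.csv"), "x"))  # 2.0
```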

------
j88439h84
It looks barely faster than Pandas?

~~~
nine_k
Hey, Pandas wraps heavily optimized C and Fortran code, with a thin layer of
Python interface on top.

This thing is pure Julia.

This being _any_ faster than Pandas is a huge compliment to Julia the language
in general and its JIT compiler in particular.

~~~
srean
> This being any faster than Pandas is a huge compliment to Julia the language
> in general and its JIT compiler in particular.

Indeed. This 'whole stack under the same language' is an important feature to
have. You get to reap the advantages of an improved JIT. End to end autodiff
is easier. Some of these things are a problem, for example, in PyPy because of
the Python <-> C bridge.

------
lma21
Why is it being compared to Pandas? Is it a data analysis library?

~~~
unixhero
Apparently yes

But it's called a db

~~~
smabie
It does support persistence, but it's essentially a DataFrame library.

------
Tarrosion
I wish this website spent its banner space on "what it is" rather than "star
us on GitHub!". I doubly wish this because it's pretty confusing what JuliaDB
is! From the name, I expect it to be a database, but the website immediately
compares it to Pandas. I think of Pandas as a library for in-memory
manipulation and analysis of tabular data, not a database / persistence
engine. So is JuliaDB not a database at all, but in fact an alternative to
Pandas / DataFrames.jl?

In fact, I've been using Julia for work and following the ecosystem since
version 0.4 (we're at 1.5 now), and I'm _still_ not sure what JuliaDB is. No
doubt this is mostly due to me not having reason to look very deeply (and/or
not being very perceptive), but it certainly doesn't feel like the marketing
copy is giving me any help...

~~~
vosper
I think it is actually a database, so it seems they haven't committed the
cardinal sin of putting "DB" in the name of something that's not a database.

> JuliaDB is a pure Julia analytical database. It makes loading large datasets
> and playing with them easy and fast. JuliaDB needs to support a number of
> features: relational database operations, quickly parsing text files,
> parallel computing, data storage and compression.

Got this from
https://juliadb.org/talk/juliacon2018shashi/
which appears to go into more detail:

> This talk is a bottom-up look at the construction of JuliaDB. We will talk
> about the scope and implementation of underlying building block packages,
> namely IndexedTables, TextParse, Dagger, OnlineStats and PooledArrays.

~~~
gopalv
> they haven't committed the cardinal sin of putting "DB" in the name of
> something that's not a database

It doesn't seem to store data in any way - so it is definitely a data
processing engine, but a database without an "INSERT" command feels a little
off.

From that point of view, this looks a lot like what the original MapReduce
did - the data lives outside it & is referenced as URLs, but the engine itself
does processing out-of-core and in-memory for very large datasets.

~~~
wtallis
"It doesn't seem to store data" is a complaint that doesn't really make sense
to lodge against a _library_ rather than a standalone program. If your program
is using a database library, it's your job to write the line of code that
tells the library to load data from a particular file into memory. The library
cannot persist as a running process of its own across multiple executions of
your program that is using the library. This is as true of SQLite as it is of
JuliaDB.

I agree that it's a bit odd to not have a direct analog of SQL's INSERT, but
you can definitely add rows to an existing table by making a new one and doing
a merge operation.
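
The "make a new table, then merge" pattern can be sketched generically. This is plain Python, not JuliaDB's API: a table is modeled as an immutable tuple of `(key, value)` rows kept sorted by key, and `make_table` / `merge_tables` are hypothetical names for this illustration only.

```python
from heapq import merge


def make_table(rows):
    """Build an immutable 'table': (key, value) rows sorted by key."""
    return tuple(sorted(rows, key=lambda r: r[0]))


def merge_tables(a, b):
    """Return a new sorted table with the rows of both inputs.

    Neither input is modified; "inserting" rows means building a
    small table from them and merging it with the existing one.
    """
    return tuple(merge(a, b, key=lambda r: r[0]))


existing = make_table([(3, "c"), (1, "a")])
new_rows = make_table([(2, "b")])
combined = merge_tables(existing, new_rows)
print(combined)  # ((1, 'a'), (2, 'b'), (3, 'c'))
```

Because both inputs are already sorted, the merge is a single linear pass, which is part of why an index-sorted table can get by without a row-at-a-time insert.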

