
Dataset: Databases for lazy people - ikuyamada
https://dataset.readthedocs.org/
======
akrymski
This is excellent; I've always wanted a simple DB library that makes defining
a schema upfront _optional_. After all, a schema is a constraint that can be
set in stone later on, just like adding indices for optimisation purposes.

A few questions:

\- what's the performance like? Is there any overhead from SQLAlchemy, e.g.
comparing the schema every time you do an insert?

\- no way to specify primary keys as an alternative to the auto-generated id
column?

\- no table.remove()?

\- what about inserting more complicated data structures, eg a dict with a
nested dict or a list? would be great if those were serialized auto-magically
into a blob type (or used to create another table with a foreign key?)

\- would be nice to be able to freeze a schema with table.freeze() for
example: from then on new columns don't get created automatically, or get
stored in an extra blob column (this is a very common scenario for python devs
where non-indexable columns just get stuck in a blob)

\- let me optionally define a schema and specify defaults, e.g.
table.schema(name='', price=0.0)

\- would love to see table.ensureIndex('column', 'unique') similar to mongo
for quickly creating indices

\- db = Dataset() should do dataset.create('sqlite:///:memory:') for me -
would be nice to have that as the default connector, so that Dataset() acts as
LINQ for Python by default

\- dataset.freeze is nice but I'd rather have dataset.export() &
dataset.import() letting me easily copy rows from one db to another (after
inspection for example)
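The "extra fields stored in a blob column" idea from the bullets above can be sketched with the stdlib alone. This is a hypothetical illustration (the `items` table and `insert_item` helper are made up, not part of the dataset API): known columns get real columns, everything else is JSON-serialized into one blob column.

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, extra TEXT)')

def insert_item(row):
    """Known columns go into real columns; everything else into a JSON blob."""
    known = {'name'}
    extra = {k: v for k, v in row.items() if k not in known}
    conn.execute('INSERT INTO items (name, extra) VALUES (?, ?)',
                 (row.get('name'), json.dumps(extra)))

# nested dicts and lists survive the round trip via the JSON blob
insert_item({'name': 'widget', 'tags': ['a', 'b'], 'meta': {'color': 'red'}})
name, extra = conn.execute('SELECT name, extra FROM items').fetchone()
# extra now holds the JSON-serialized leftover fields
```

The trade-off is exactly the one the comment notes: blob fields can't be indexed or queried directly in SQL, so they suit non-indexable leftovers only.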

Thanks for creating this!

------
adelevie
This looks like a really useful library. A few questions:

1\. What size limits/practical constraints are there on freezefiles (and
accompanying JSON files)?

2\. Are there any code samples for consuming freezefiles, or should I just
assume it's simple JSON/YML parsing?

3\. Has any thought been given to using this to expose database contents via
a static REST API?

Final thought: this seems like a great step towards solving the age-old
version controlling data problem.

~~~
aet
I'm curious about the relative advantages/disadvantages over something like
SQLAlchemy.

~~~
piqufoh
From an end-user point of view, SQLAlchemy relies on you first defining your
models in the ORM (object-relational mapping); SQLAlchemy then takes control
of issuing the SQL to create, update and drop tables depending on your
interactions with your Python ORM models.

From what I can read, it seems that this cool-looking tool lets you use SQL
as a kind of object-free data store, maybe not unlike a NoSQL DB Python
wrapper (freeing you from first defining your models and then ensuring that
the SQLAlchemy functions have updated your DB).

------
alokv28
I'm trying to figure out where something like this fits into the python data
ecosystem.

For datasets that fit in memory, Pandas seems like the best bet. Good I/O
functions (JSON, CSV), easy slicing (numpy array-like syntax), and some sql-
like operations (groupby, join).

For large datasets, you'd need a proper db.

So is Dataset then useful for datasets that cannot fit in memory but aren't
_too_ large?

~~~
joshguthrie
Simple.

Recently, I was tasked with mapping all of our clients' addresses to
lat/long. I could have read the CSV and appended the results to each line, or
used a JSON file that I would have to read/write every time.

Instead, I wrote some pseudo-helper to dump all the CSV data into a SQLite DB.
Then I ran my script. Every time I found a lat/long, I could mark the client
as "done" and add the lat/long for that client and every client that shared
this address. When I had to cut my script because I saw one result from Google
Maps was wrong, I could just edit it straight in SQL, mark it as "invalid" and
relaunch my script: it started right back at the first undone row. Then I just
had to select all the "invalid" results and search them manually or refine
them so Google Maps would give me a proper result.

Dataset is useful for small data that is constantly being worked on.

(This answer is from a Ruby POV and the dataset I was working on had about 4K
rows, which explains why a) some Python magic wasn't available to me, maybe it
would have been perfect in Python world and b) I didn't want to play with
streams on my files)

Of course I still need some automation to use my "DataMiner" (as I called it)
to the fullest. I'll use Dataset's API as a basis to rewrite it properly.
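In Python terms, the resumable workflow described above could be sketched with just the stdlib. This is a hypothetical sketch: `fake_geocode` stands in for the real Google Maps lookup, and the table/column names are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, address TEXT, "
             "lat REAL, lng REAL, status TEXT DEFAULT 'todo')")
conn.executemany("INSERT INTO clients (address) VALUES (?)",
                 [('1 Main St',), ('2 Oak Ave',), ('1 Main St',)])

def fake_geocode(address):
    # placeholder for the real geocoding call
    return (51.5, -0.1)

# Process only rows not yet done; resuming after a crash or manual SQL edit
# is just re-running this loop.
for row_id, address in conn.execute(
        "SELECT id, address FROM clients WHERE status = 'todo'").fetchall():
    lat, lng = fake_geocode(address)
    # mark every client sharing this address in one go
    conn.execute("UPDATE clients SET lat = ?, lng = ?, status = 'done' "
                 "WHERE address = ?", (lat, lng, address))
```

A bad result can then be flagged by hand (`UPDATE clients SET status = 'invalid' WHERE ...`) and picked up later with a plain `SELECT`, exactly as described.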

~~~
alokv28
I know very little about what's available in Ruby, but I would have used the
Pandas library to accomplish this task in python. Their in-memory data
structure, a DataFrame, is more than capable of handling those operations.

------
dykesa
I don't know that the solution to "programmers are lazy and databases are
hard" is for them to learn a niche tool instead of taking the time to
actually learn about data storage, or finding someone who knows it well and
working with them.

~~~
dragonwriter
Where I see this being useful isn't as a solution to programmers being "lazy"
about learning DBs, but to programmers being "lazy" about dealing with RDBMS
schema while rapidly iterating on a new project. Something like this lets you
_discover_ the schema as you iterate on the program, which would seem to be a
win for agility.

~~~
virmundi
What I don't get is why I need a "stable" data store to iterate over. On one
project I didn't get a database until two weeks before the project was due.
Fortunately I was using Spring, so I just mocked the DAL until I actually got
a real one. The whole time, I changed the contract on the DAL, changed how I
used it, etc. Then, when the DBA finally got around to making my DB, I
presented him with a decent, thought-out ERD.

Sure, the mocks had to do some work, but a simple cache allowed me to perform
all of the CRUD operations in memory. I can see doing something similar with
Mongo/Couch, but having done the DAL with a pure mock set injected via Spring,
I don't really see the point. The same goes for HQL or another lightweight
in-memory DB + Hibernate/JPA. I assume this model of interaction would work
with Python or similar languages too.

~~~
dragonwriter
Because sometimes you want a product to be available to real users while it is
still rapidly developing, especially in an environment using Lean principles.

It's kind of the opposite of the delivery-date-and-it's-done style of
project.

------
JunkDNA
I have been working on an ETL domain specific language using Scala for a while
now (DataExpress for the curious:
[http://dataexpress.research.chop.edu](http://dataexpress.research.chop.edu))
trying to address similar problems to this.

Namely, when doing ETL you don't _want_ to have to map all your tables and
relationships into models that an ORM likes to have. IMO, there is such a
dearth of tools in this space of "quick and dirty" database work. People are
either using highly custom scripts on one end or things like Kettle or
commercial analogs for "big serious work" on the other. There's almost no in-
between.

Having something at a slightly higher level of abstraction than the database
driver itself is really, really nice and makes for cleaner, more readable
code. Makes me wonder about my continued work on DataExpress!

------
VLM
"for lazy people" should probably be "for lazy individuals not groups" as I've
often seen the schema and its DB become a natural demarc point between groups,
so changing the demarc on the fly amounts to forcing everyone else's API to
change.

Typical example: "Say what, who decided the name column is now two columns,
first name and last name?"

And sometimes there's absolutely nothing wrong with that, if the natural
demarc point in a project isn't the database and its schema.

------
huskyr
Nice. For PHP there is Idiorm, which is a really lightweight ORM wrapper that
makes dealing with SQL databases a breeze:

[https://idiorm.readthedocs.org/en/latest/](https://idiorm.readthedocs.org/en/latest/)

~~~
borplk
what is the most robust PHP ORM out there for large projects?

~~~
tariqr
Doctrine / Eloquent.

------
masklinn
That could be nice, unless `find` is limited to equality relations with a
constant. And all examples are equality relations with constants...

~~~
thruflo
Looking at the source[1] `find` supports `==` and `in_()`. Beyond that, it
supports custom sql queries[2].

For more power, drop down to SQLAlchemy.

[1]:
[https://github.com/pudo/dataset/blob/dc144a27b01ff404a528275...](https://github.com/pudo/dataset/blob/dc144a27b01ff404a5282758f010f00bad87cbe9/dataset/persistence/table.py#L200)
[2]:
[https://dataset.readthedocs.org/en/latest/quickstart.html#ru...](https://dataset.readthedocs.org/en/latest/quickstart.html#running-
custom-sql-queries)
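In plain SQL terms, the distinction being discussed maps roughly to the following (a stdlib `sqlite3` sketch, not the dataset API itself; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (name TEXT, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [('a', 1.0), ('b', 2.0), ('c', 3.0)])

# find(name='a') corresponds to a simple equality filter:
eq_rows = conn.execute("SELECT * FROM t WHERE name = ?", ('a',)).fetchall()

# a list value corresponds to an IN filter (in_() in SQLAlchemy terms):
in_rows = conn.execute("SELECT * FROM t WHERE name IN (?, ?)",
                       ('a', 'b')).fetchall()

# anything richer (ranges, OR, joins) falls through to a hand-written query:
custom_rows = conn.execute("SELECT * FROM t WHERE price > ? OR name = ?",
                           (2.5, 'a')).fetchall()
```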

~~~
masklinn
> Beyond that, it supports custom sql queries[2].

That's more than a bit unsatisfactory if I'm using a query builder or ORM to
avoid writing custom SQL queries.

> For more power, drop down to SQLAlchemy.

It's closer to stepping sideways, even the expression language is at a similar
level of abstraction.

~~~
lucian1900
It pretty much _is_ SQLAlchemy. Look at the code, there's very little there.

------
st0ne_c0ld
Looks like it doesn't work on Python 3.3 for some reason ;( It would be
better to put some information on the website about required software/module
versions, etc.

      ..., in <module>
        import dataset
      File "C:\Python33\lib\site-packages\dataset\__init__.py", line 7, in <module>
        from dataset.persistence.database import Database
      File "C:\Python33\lib\site-packages\dataset\persistence\database.py", line 3, in <module>
        from urlparse import parse_qs
    ImportError: No module named 'urlparse'
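The error above comes from Python 3 renaming the `urlparse` module to `urllib.parse`. A common compatibility shim looks like this (a sketch of the usual fix, not necessarily what the library ended up doing):

```python
# urlparse was renamed to urllib.parse in Python 3; try the new location
# first and fall back to the old one on Python 2.
try:
    from urllib.parse import parse_qs  # Python 3
except ImportError:
    from urlparse import parse_qs      # Python 2
```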

------
chrismmccomas
This looks wonderful. I have a side project that uses BeautifulSoup to get
sports scores and then computes standings based on those results and prints
the results out to a text file. Honestly, for me, being a simple and personal
program SQL Alchemy was just overkill I felt like, but this project looks like
it'll do exactly what I need.

I'll have to look into it more in-depth later, but I love the idea behind it.

------
nubela
Interesting. Am I right to say that it turns NoSQL-like queries/inserts into
a relational structure? What about joins and the complicated stuff that bogs
down writing queries as the complexity grows; how does this library support
the __advanced__ stuff?

~~~
pudo
Well, the idea is to make the simple stuff really simple and keep the
complicated stuff around. So if you want to write a JOIN, use SQL or
SQLAlchemy's core constructs - both are accessible, and neither has been
reinvented :)

------
bravura
I actually think it's time for object databases (e.g. ZODB) to make a
comeback.

Sometimes it's useful to persist a mass of crap, without thinking through the
format _at all_. Webscraping is a good example offered by the project author.

------
po84
This is great. I was just looking for a humane way to play with databases in
IPython Notebook the other day. I was able to pull a sample of data from a DB
into a Pandas dataframe with just a couple lines of code. Perfect.

------
egonschiele
This looks cool. My main question: why not just use Redis? It's a NoSQL
solution that's proven to be fast and reliable. I assume dataset isn't meant
for datasets larger than Redis can already handle. The biggest advantage is
the `datafreeze` command, which could've been written for Redis instead.

------
lcasela
Anything like this for Ruby?

~~~
kcen
This looks pretty close to a clone of the Sequel gem.

[http://sequel.jeremyevans.net/](http://sequel.jeremyevans.net/)
[https://github.com/jeremyevans/sequel](https://github.com/jeremyevans/sequel)

~~~
kennymeyer
Sequel has much more advanced features than this Python library.

------
plumeria
A nice idea, but please change the logo.

~~~
hmottestad
Naked mole rats are the coolest creatures in the entire world. They hardly
seem to age and they very seldom get cancer. Also, there is a really funny
video about them:
[http://www.youtube.com/watch?v=eHi9FvUPSdQ](http://www.youtube.com/watch?v=eHi9FvUPSdQ)

~~~
dgesang
thanks ymmd

