
When should you store serialized objects in the database? (2010) - harshasrinivas
https://www.percona.com/blog/2010/01/21/when-should-you-store-serialized-objects-in-the-database/
======
exabrial
Bad ideas from 5, 10, er, 20, er, 30 years ago are still bad ideas.

I know the HN police will cite me for no citation, so I'd say it comes with
experience. The law changed at one point, and we were legally bound to be able
to locate a customer's record by a piece of data in a blob. The only way to fix
the problem was to dump the massive (1 TB+) table and reinsert the records into
a real schema. The engineering effort took 9 months to get right, because other
people had changed the way blobs were written out over the course of years.

Being clever doesn't pay, again.

~~~
LyndsySimon
> Bad ideas from 5, 10, er, 20, er, 30 years ago are still bad ideas.

I agree 100%.

I read that title and expected the post to begin with "Never. You should never
serialize objects into a single field." I was disappointed.

If you need schemaless storage, use a schemaless DB. I don't understand what's
so difficult about that. I wouldn't try to shove unstructured data into
PostgreSQL any more than I'd try to shove relational data into MongoDB.

~~~
spacemanmatt
HAHA, I tend to agree that 'never' is the only good answer.

------
orf
I think it's rarely a good idea to store blobs of any kind in the database.
I've seen systems that store pretty large files as blobs (even base64-encoded
ones, once), then do 'select *' on the table and wonder why their query
performance is so terrible. Use a filesystem; that's what it's for.

For stuff like this, I'd say it's always preferable to store a JSON-encoded
representation rather than a format like pickle (Python's object
serialization format). If you don't, and some clever chap works out a way to
write input to that field, then you've got an easy RCE. Plus, it's easier to
debug JSON, and databases like PG have a native data type for it.
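
To illustrate the RCE point, here's a minimal sketch (plain Python, with a
hypothetical payload) of why loading an attacker-written pickle is game over:

    import os
    import pickle

    # pickle calls whatever __reduce__ hands it at load time, so an
    # attacker who can write to the field controls what runs on read.
    class Exploit:
        def __reduce__(self):
            return (os.system, ("echo pwned",))  # any shell command

    payload = pickle.dumps(Exploit())
    pickle.loads(payload)  # prints "pwned" -- this is the RCE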

~~~
ams6110
_it's always preferable to store a JSON-encoded representation_

That's just the format du jour. Ten years ago it would have been "store an XML
encoded representation" and ten years before that it would have been some
delimited representation. Tomorrow it may be YAML or something even more
hideous.

Blobs in the DB can make sense in some situations but they should really be
blobs: images, or other binary/raw data. But beyond a certain size it's almost
always easier to manage things like these in a filesystem and just store a
pointer (filename) in the DB.

~~~
mattnewton
But databases are emerging with tools to deal with JSON, like Postgres (and
MongoDB). I haven't used Mongo in production, but Postgres's JSON support is
fantastic and materially better than anything I have seen for XML. Sometimes
you really do need to store nearly schemaless data and operate on it for some
time.

~~~
cakes
For reference, there are/were tools in MSSQL for dealing with XML.

~~~
mgkimsal
MySQL has had XML/XPath query functions for a while too, IIRC.

------
rwultsch
I was on the DBA team at FB and I spent the better part of a year working on
the deployment system for online schema change. It was a pain. Other companies
have done quite a bit of work on this as well (Shift from Square, etc...).

Later on I joined Pinterest as their first MySQL DBA. They had copied the
sharding system from FB, but instead of having a bunch of columns, they just
stored a JSON blob. This saved them from having to learn how to perform schema
changes until I joined the company. This is a pretty incredible feature.

We have a new feature under development (which will be open sourced as part of
Percona MySQL) that allows column-level compression with an optional
predefined dictionary. During testing, this resulted in a 30% additional
reduction in space consumed versus InnoDB page compression AND doubled our
peak QPS at lower latency. This would not work well with many individual
columns, but kicks ass for JSON blobs.

[http://www.slideshare.net/denshikarasu/less-is-more-novel-approaches-to-mysql-compression-for-modern-data-sets](http://www.slideshare.net/denshikarasu/less-is-more-novel-approaches-to-mysql-compression-for-modern-data-sets) (slides 37, 40, 41, 42)
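
For a sense of how a predefined dictionary helps, here's an illustrative
sketch using stock zlib's preset-dictionary support (not the Percona
implementation or its API, just the general idea): seed the compressor with
the strings every blob shares.

    import zlib

    # Keys repeated in every JSON blob make a good preset dictionary.
    shared = b'{"user_id": "", "event": "", "timestamp": ""}'
    blob = b'{"user_id": "42", "event": "click", "timestamp": "2016-07-01"}'

    plain = zlib.compress(blob)
    c = zlib.compressobj(zdict=shared)
    seeded = c.compress(blob) + c.flush()

    # The seeded output is typically smaller for short, repetitive
    # payloads; decompression must supply the same zdict.
    print(len(blob), len(plain), len(seeded))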

------
TheSoftwareGuy
For anybody using SQLite, they have good documentation about this very
question: [https://www.sqlite.org/intern-v-extern-blob.html](https://www.sqlite.org/intern-v-extern-blob.html)

~~~
bvinc
Is this really the same thing? This page compares storing blobs in SQLite vs.
in a separate file.

I think a better SQLite reference for the concept of serializing things into
your database is its JSON support:

[https://www.sqlite.org/json1.html](https://www.sqlite.org/json1.html)
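
For example (a minimal sketch, assuming a SQLite build with JSON1 compiled
in -- true of most modern builds, including the one bundled with recent
CPython):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, data TEXT)")
    conn.execute("INSERT INTO users (data) VALUES (?)",
                 ('{"name": "alice", "tags": ["admin"]}',))

    # Query inside the serialized blob without deserializing it in
    # application code.
    name = conn.execute(
        "SELECT json_extract(data, '$.name') FROM users").fetchone()[0]
    print(name)  # -> alice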

~~~
distances
That's provided by the loadable JSON1 extension, which one won't find, e.g.,
on Android.

Though, checking now, someone has packaged a later version of SQLite with
this extension [1]. I wonder if there is any performance advantage to using
the system-provided SQLite vs. one shipped with the application?

[1] [https://github.com/requery/sqlite-android](https://github.com/requery/sqlite-android)

~~~
distances
Note to self: should have read the whole project description, there's a
performance chapter answering this exact question.

------
garethrees
It makes sense to store serialized data structures in the database when these
conditions apply:

1. There are no use cases that would require you to SELECT on the fields in
the serialized data structures.

2. You anticipate that the data structures are going to change frequently
during development, so that turning them into relations is going to involve a
lot of schema migrations.

Basically you give up the possibility of being able to SELECT on some of the
data in return for being able to change its format rapidly and cheaply.

I worked on a project recently where this was helpful -- when I designed the
database schema I didn't know the details of many of the data structures that
were going to have to be stored there. From the use cases I could deduce the
set of fields that would need to be SELECTed on, but the other fields were
ill-defined. By storing them as blobs (actually as JSONB fields, since this
was PostgreSQL) I could safely defer the decision about how to design these
parts of the database, without incurring lots of schema migrations along the
way.
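
As a concrete sketch of that layout (PostgreSQL via psycopg2; the table and
column names here are hypothetical): the fields the use cases SELECT on
become real columns, and the ill-defined remainder goes into a single JSONB
column.

    import psycopg2  # assumes a reachable PostgreSQL instance

    conn = psycopg2.connect("dbname=test")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS items (
                id         serial PRIMARY KEY,
                kind       text NOT NULL,  -- SELECTed on: a real column
                created_at timestamptz NOT NULL DEFAULT now(),
                params     jsonb NOT NULL  -- free to evolve, no migrations
            )
        """)
        cur.execute(
            "INSERT INTO items (kind, params) VALUES (%s, %s::jsonb)",
            ("camera", '{"field_of_view_deg": 42}'),
        )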

~~~
vec
Be very careful with this logic. I'm in the middle of cleaning up a project
where the original developer did this calculus and stored a lot of metadata as
JSON strings in text fields. Now, three years out, we have new reporting
requirements that need to be able to filter on some of that metadata.

Experience has taught me that it's next to impossible to know what types of
queries will be needed for the entire lifetime of a project. If you absolutely
must store data like this, please leave a clear migration path for moving some
or all of it into well structured tables when (not if) it becomes necessary.

~~~
garethrees
I understand your concern. Perhaps if I give an example, it will be clear why
I followed the approach I did.

The project is an industrial control system. Different kinds of item are
processed by the system. Each kind of item needs image acquisition parameters,
for example field of view. There's no use case that needs to select kinds of
item whose image capture parameters specify a particular field of view, and
it's clear that such a use case is very unlikely to arise — the image
acquisition parameters are part of the internal specification of the system,
not something that any end user is interested in.

At database design time it wasn't settled which model of camera was going to
be used, and so it was far from settled which parameters were going to be
needed. Using a JSONB field for the image acquisition parameters meant that I
could leave the specification of these parameters to the camera programmer,
without having to incur a series of database migrations as the correct set of
parameters was worked out.

Even in your project, where new reporting requirements eventually arose, it
doesn't mean the original decision was necessarily a bad one. Sometimes it
makes sense to incur technical debt in order to bring a product to market in a
timely fashion.

------
rhinoceraptor
You can get away with it in Postgres. The app I work on stores phone numbers
in a JSONB array in the following format:

    
    
        [{
          "tags": ["cell"],
          "number": "1231231234"
        }]
    

Here's a snippet demonstrating how you can do a lateral left join on the
column to find the number tagged 'cell' in the tags array:

    
    
        select * from mytable t
        left join lateral (
          select phone->'number' as cell_phone from
          jsonb_array_elements(t.phone_numbers) phone
          where phone->'tags' @> '["cell"]'
        ) p on true;

------
collyw
I am working on a system where the data is serialized using Python's pickle
and stored in the database. It's an absolute nightmare for debugging, as it's
basically unreadable.

~~~
unlinker
Can't you open a Python shell and unpickle it?

~~~
collyw
It gave me an error the last time I tried that.

Plus, even if it did work, it involves logging into the server, activating my
Python virtualenv, pulling the data out via the Python/Django shell, and
unpickling and printing it. As opposed to just running a query against the
database from my local machine. When you are debugging a problem and just
want to get an overview of what is happening, that is a hell of a lot of
hassle.

------
Zikes
PostgreSQL and jsonb:
[https://www.postgresql.org/docs/current/static/datatype-json.html](https://www.postgresql.org/docs/current/static/datatype-json.html)

~~~
andy_ppp
Yes, maybe use a database that gives you the best of both worlds: a
serialised blob that happens to be queryable, with generally really high
performance.

~~~
Zikes
Queryable and even indexable! Hardly a trade-off at all, really.
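
For instance (a minimal psycopg2 sketch, assuming a hypothetical "events"
table with a JSONB "payload" column):

    import psycopg2  # assumes a reachable PostgreSQL instance

    conn = psycopg2.connect("dbname=test")
    with conn, conn.cursor() as cur:
        # A GIN index makes containment queries (@>) on the blob fast.
        cur.execute("CREATE INDEX IF NOT EXISTS events_payload_gin "
                    "ON events USING gin (payload)")
        cur.execute("SELECT id FROM events WHERE payload @> %s::jsonb",
                    ('{"type": "click"}',))
        print(cur.fetchall())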

------
blowski
I was working on a massive CRUD project - hundreds of end-user customisable
textarea fields. Despite this sounding like a project perfect for a NoSQL
database, the in-house team that would be maintaining it were MySQL experts,
and didn't want to support MongoDB or anything like that.

So, yep, we stored everything as serialized objects in the database. We had a
separate table for 'change events', and whenever someone changed the contents
of one of the textareas we stored it in that table. A worker would eventually
update the serialized object, but in the meantime, we would load the
serialized object and apply all of the changes that had happened since it was
last updated. Basically, an Event Sourcing pattern.
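
A minimal sketch of that read path (sqlite3 here as a stand-in for MySQL; the
table and column names are hypothetical): load the last materialized
snapshot, then fold in any change events recorded since it was written.

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY,
                                body TEXT, version INTEGER);
        CREATE TABLE change_events (doc_id INTEGER, seq INTEGER,
                                    field TEXT, value TEXT);
        INSERT INTO documents VALUES (1, '{"title": "old"}', 0);
        INSERT INTO change_events VALUES (1, 1, 'title', 'new');
    """)

    def load_document(doc_id):
        body, version = conn.execute(
            "SELECT body, version FROM documents WHERE id = ?",
            (doc_id,)).fetchone()
        doc = json.loads(body)
        # Apply events the background worker hasn't folded in yet.
        for field, value in conn.execute(
                "SELECT field, value FROM change_events "
                "WHERE doc_id = ? AND seq > ? ORDER BY seq",
                (doc_id, version)):
            doc[field] = value
        return doc

    print(load_document(1))  # -> {'title': 'new'}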

So it was either that, or the EAV route, or the 'end user altering the
database' route. The latter two options sounded even worse. Our solution
worked out pretty well. Admittedly, it only had a few hundred concurrent
users, and even the biggest document was never more than 100K.

So, it can work, but YMMV.

------
Friedduck
We serialized some XML as a backup to the data we were extracting (and
properly modeling), given that the vendor was prone to changing the schema
without proper notification, and that there were some data elements we weren't
using at that time.

We also built the necessary tools to extract/re-process records easily, and
the architecture worked well for us. As our needs or the schema changed we
could easily accommodate those changes without undue effort.

It doesn't directly address the question, and I'm not sure that I'd use the
same solution if the volume were predicted to be significantly higher, but in
our case it worked beautifully. (Happily our volume was predictably within a
known range, for reasons I won't go into.)

------
jbyers
(2010)

Uber moved to a similar architecture in 2014-2015, ~5 years after the article
and the original FriendFeed post. Being able to operate MySQL predictably at
scale is extremely valuable to high-growth companies, enough to tilt in favor
of unconventional schema choices versus less proven NoSQL alternatives.

[https://eng.uber.com/schemaless-part-one/](https://eng.uber.com/schemaless-part-one/)

[https://eng.uber.com/schemaless-part-two/](https://eng.uber.com/schemaless-part-two/)

[https://eng.uber.com/schemaless-part-three/](https://eng.uber.com/schemaless-part-three/)

------
forinti
In my experience, databases outlast the applications built on top of them, so
it makes no sense to cut corners on the data modelling.

Except, of course, if the data only exists to support the application (some
sort of buffer, cache, or session storage).

~~~
jb613
> databases outlast the applications built on top of them

Sure, but this needs to be balanced with performance _until_ then.

------
Arwill
This is a consequence of the chosen programming language not being adapted to
work with a relational database. The language is made to work with objects,
while access to the DB is clumsy, through library functions that are deeply
encapsulated; the language and the DB are two different worlds.

In SAP's ABAP language, relational database access is integrated into the
language. When you create a database table, the structure of that table is
automatically available to any program as a structure datatype. So if you
change a table definition, you also change the data type used by the
programs. Table changes are supported by a database tool that automatically
copies records from the old table to the new one if necessary. It's easy to
find all references to a DB table and recompile the sources; this is actually
done automatically when both program and DB structure changes are deployed.
Whatever change the developer makes in the development system is
automatically applied to the production system on deployment. This makes any
table structure change pretty easy; the development environment takes care of
it.

Wherever SAP applications use blobs to store data (for example HR payroll),
those are the worst to develop with. Doing a non-trivial change on a TB-sized
table would surely cause disruption in an SAP system too, but other
techniques are available for those cases.

------
fauria
_" If the application really is schema-less and has a lot of optional
parameters that do not appear in every record, serializing the data in one
column can be a better idea than having many extra columns that are NULL."_

Why not just use a document-oriented database instead? Seems like a good use
case for MongoDB, for example: [https://www.mongodb.com/compare/mongodb-mysql](https://www.mongodb.com/compare/mongodb-mysql)

~~~
wesd
Also, the assumption is that you don't need to report on the data. If you do
need to report on the data, then you might need to create indexes on those
columns for performance, which you can't do on a blob.

------
autogn0me
Agreed, storing binaries in a database is generally a bad idea. It would be a
really miserable idea if it were done without a sane persistence API. In the
Python world, ZODB
([http://www.zodb.org/en/latest/](http://www.zodb.org/en/latest/)) is tightly
coupled with the language but works reasonably well in practice. The storage
layer is pluggable, and
[https://pypi.python.org/pypi/RelStorage](https://pypi.python.org/pypi/RelStorage)
supports storing pickles in an RDBMS.

ZODB is arguably a novel approach to persistence using Python, and it's
certainly worth taking some time just to play with -- the barrier to entry is
low, e.g. `pip install`. But for each positive there are negatives...

"You got it buddy: the large print giveth, and the small print taketh away"

~~~
prashnts
From my personal experience, I completely agree.

For an academic project, I was calculating ~2 million RNA-RNA interactions
from their sequences. Since this calculation was a prerequisite for all
further calculations, being the naïve kid I was, I started pickling the
results.

To feel like the cool kid, I wanted to involve a database somehow -- so after
trying out a bunch of options, I finally settled on ZODB. As the project
scaled up, ZODB soon became a big pain because, as I recall, it only allows a
limited number of connections, even for read operations.

Lesson learned, though; the data now resides as a lookup table in a PgSQL
instance.

------
harshasrinivas
Link about FriendFeed (mentioned in the blog):
[http://web.archive.org/web/20100314211658/http://bret.appspot.com/entry/how-friendfeed-uses-mysql](http://web.archive.org/web/20100314211658/http://bret.appspot.com/entry/how-friendfeed-uses-mysql)

------
bvinc
What about this reason? What if your program is pretty much entirely used
through a JSON REST service? What if these JSON objects also need to be sent
between machines? What if they sometimes need to be exportable to files?

Now imagine the same program also has an internal database where these JSON
objects can be imported and used. Does it make sense that, when actually in
use, these objects are relational and split across 10 complicated tables? Why
should someone bother writing complex import/export conversion functions,
maintaining them in the future, and accepting worse performance? Wouldn't it
be much simpler, more maintainable, and faster to just plop the JSON in the
database?

~~~
PretzelFisch
So, my question, after storing XML in a database and using its XML features
to build indexes, is this: are you looking for a database or a search engine?
What will you gain from a database, if you are not using its features, over
saving to disk and building a query index?

~~~
bvinc
What you would gain is the ease, simplicity, and stability of using your
favorite SQL engine. You get ACID transactions, and the ability to add to and
remove from large lists using little memory, for free.

------
joesmo
Proper databases have JSON or other serialized field types. Even MySQL 5.7
has some support for this. There is no reason you should hit a limit on one
of your tables at a few hundred thousand records because you're an idiot and
stored a ton of serialized data in MySQL. It happens all the time, though.

------
jcoffland
The answer is, of course, almost never.

