
The user data is most likely in rows instead of columns. Instead of having

    id, name, age, gender
    1213, fake, 60, female
they would have

    property_id, user_id, value
    1 (assume age), 1213,    60
    2 (gender),     1213,    female
This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
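To make that concrete, a lookup against such a layout might look roughly like this (the table and column names here are just guesses based on the example above):

    -- find the age of user 1213 by joining the property rows back in
    SELECT p.value AS age
    FROM users u
    JOIN user_properties p ON p.user_id = u.id
    WHERE u.id = 1213
      AND p.property_id = 1;   -- 1 = 'age' in the example mapping above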



Entity-attribute-value anti-pattern - this has been well known for at least 20 years. It can be tempting when you want to design a "flexible" system, but it really needs to be used sparingly. I was BI team lead on a product where the architect insisted that it be used on every entity (>300), as you never knew when you might want to add some bit of data. It led to some interesting (multi-page) SQL and the project ultimately failed; this was one of the reasons: slow performance, and frequent runtime errors when expected data wasn't present and the application layer couldn't cope. It was a good learning experience. https://mikesmithers.wordpress.com/2013/12/22/the-anti-patte...


We have this as a "meta" field (because MySQL is balls at adding new columns to big tables without multi-hour downtime) with some arcane nonsense format. Totally unqueryable with any efficiency.


The EAV pattern has trade-offs you need to compensate for (performance). Production systems that use EAV also maintain flat tables and heavy caching to stay flexible while keeping acceptable performance.


You could argue that there are cases for it. Datomic[0] is basically EAV on steroids.

https://www.datomic.com/


Oh gosh, this pattern. The first time I encountered it was in my first job, where we used Magento. Super flexible. Also super slow. Does anyone have any advice on how to make a DB design like this work faster? Generally I thought that when data is arranged like this it might be a prime candidate for document-based storage. But I'm no DBA, so I have no idea if that would be correct.


If you are using Postgres, the JSONB datatype will let you do exactly this while still using the full power of SQL. Simply create a column where you keep a JSON object full of random user properties, if flexibility is what you want. You can even index properties.
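A rough sketch of what that can look like in Postgres (the table and column names here are made up for illustration):

    -- a catch-all JSONB column for ad-hoc user properties
    ALTER TABLE users ADD COLUMN properties jsonb;

    -- a GIN index lets containment queries (@>) use an index
    CREATE INDEX users_properties_idx ON users USING gin (properties);

    -- e.g. find all 60-year-old female users
    SELECT id FROM users
    WHERE properties @> '{"gender": "female", "age": 60}';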


Or just create ad hoc tables with user fields. Quite often it's not that a customer has n different fields for n entities, but a few that apply to the majority (like internal ERP ids, classification etc.). Put them in a few tables, index them, join them. If you don't want to parse internal DB descriptors, create a set of "schema" tables to build queries from.


The question is whether it can be stored like this while allowing for fast queries. For example, unless it changed recently, Postgres doesn't calculate statistics to help the query planner on jsonb fields.


JSONB columns should behave just like any datatype with a GIN index in the recent releases, to my knowledge.

Still, a JSON column will arguably be faster than a pure KV table since you can query it more efficiently, especially in combination with the non-JSON columns.


IIRC JSONB still has problems with index statistics.

So values in JSONB columns can be indexed nicely, but the statistics can be much worse than for non-JSONB columns, which can lead the query planner astray.


Postgres will let you create an index on an expression that extracts a value from the JSON column, so querying should still be very quick.
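For example, something along these lines (assuming a 'properties' jsonb column as in the sketch above):

    -- index one extracted property rather than the whole document
    CREATE INDEX users_age_idx ON users ((properties->>'age'));

    -- this filter can now use the expression index
    SELECT id FROM users WHERE properties->>'age' = '60';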


The answer would make a great blog post.


This just isn't that hard. They don't have that much data. It is really late for me, but: put it all in memory and figure it out. These just aren't hard problems. DBAs have been solving performance issues with a clever index on the right column for 30+ years. Sorry if this is get-off-my-lawn-ish, but I have been on too many projects where I added a DB index and solved a major bottleneck. Too many new developers are ignorant of the nuances of RDBMS tuning. I am not even a DBA.


I am not a DBA, but I could see this crap when the NoSQL hype took off.

"Relational databases don't scale"

Well they worked fine for decades before Mongo's marketing claimed so.


> Does anyone have any advice how to make a db design like this work faster?

Normalize it properly. If this is not possible, ensure that both the user id and the property are indexed.
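A minimal sketch, assuming the property-table layout from the top comment (names are illustrative):

    -- lets lookups by user go straight to its property rows
    CREATE INDEX user_properties_user_prop_idx
        ON user_properties (user_id, property_id);

    -- helps queries that filter by property and value
    CREATE INDEX user_properties_prop_value_idx
        ON user_properties (property_id, value);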


If what you actually want is data arranged like that, Datomic is probably a prime candidate.

https://www.datomic.com/


If they are using some sort of middleware ORM, which they may well be because of their model, they are most likely using an EAV[0] schema, which, although flexible for writes, is horrendous for reads. The join-plus-pivot is a disaster on virtually any relational system.

[0]https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Hmm, that does seem probable. In fact that could make the SQL even more efficient as you'd only need a combined index on the 'prop' and 'value' columns, rather than N arbitrary combinations of indexes that may or may not be used.

Edit: Had some bad attempt at writing this query but it's rather late and it made no sense.


You would need to have a new join for each new property:

    SELECT DISTINCT m.loyaltyMemberID
    FROM members AS m
    INNER JOIN properties AS p1 ON m.id = p1.user_id
    INNER JOIN properties AS p2 ON m.id = p2.user_id
    INNER JOIN properties AS p3 ON m.id = p3.user_id
    WHERE (p1.prop = 'gender' AND p1.value = x)
      AND ((p2.prop = 'age' AND p2.value = y) OR (p3.prop = 'censor' AND p3.value = z))


There's no need for the extra joins, you can just do the one join and then filter everything in the WHERE clause:

  SELECT DISTINCT loyaltyMemberID
  from members as m
  INNER JOIN properties as p on m.id=p.user_id
  WHERE (p.prop='name' AND p.value = value) AND

  ...etc.


But how would you do exclusions with your approach?


I'm not sure what you're asking - could you give me an example of what you're envisioning that couldn't be satisfied with a combination of Boolean expressions in the WHERE clause?


With this query approach, how do you find people that have a prop1 but do not have a prop2?

If you get records back with prop1 then you have to remove those records from results based on another record.

There are multiple ways to accomplish this but it can't be done with a single join and simple where clause.


Sorry, I missed your reply. Yes, you are correct, in that case you would need to use an EXCEPT, a sub-query, a derived table, etc.
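For example, "has prop1 but not prop2" could be written with a correlated sub-query along these lines (table names follow the earlier examples):

    SELECT DISTINCT m.loyaltyMemberID
    FROM members AS m
    INNER JOIN properties AS p1 ON m.id = p1.user_id AND p1.prop = 'prop1'
    WHERE NOT EXISTS (
        SELECT 1 FROM properties p2
        WHERE p2.user_id = m.id AND p2.prop = 'prop2'
    );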


Especially with partial indexes, I still feel like this structure will be significantly faster than the original UNION ALL ... GROUP BY on calculated fields.

And they mention in the post that most queries don't use that many fields.


Confession time: in my first job, I built something like this (and it worked pretty well in the sense that it was very flexible), but then I also had to do a 'select' based on, IIRC, 14 such properties. I don't really recall the exact problem I had at first, but my solution was to create two separate (temporary) tables, select 7 of the properties into one and 7 into the other, run a select on both of those tables, then join the results in code. This ran at an acceptable speed (I must have done something so that adding criteria made the run time increase non-linearly - doing it on 14 was orders of magnitude slower than on 7).

Then years later I ran into the guy who had to do some work on it after I left that company. I must have scarred him pretty badly, because he remembered it enough to bring it up as pretty much the first topic after the obligatory 'hey so what are you up to nowadays'. When I think back about it now, it was a cringey solution - then again, this was at a company where nobody had ever heard of a 'database index' (or if they did, never mentioned or implemented them).


Christ do people do this??


This is a pretty popular pattern known as Entity-Attribute-Value [0]. It's used by many products where a) the data model needs to be very flexible and allow new attributes without schema changes, or b) a typical entity has a large number of possible attributes that may or may not be set for every entity ("sparse" attributes). WordPress uses this to store post metadata, Magento uses this to store product attributes and most of its other data, Drupal uses a variation of this to store all the posts and other content you create… I have too much experience with this model to be surprised.

[0]: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Sounds like a great case for Postgres hstore (like OpenStreetMap does it)?


I mean, not really, there's no magic in Postgres' implementation of the pattern. They even spell it out for you:

"can be useful in various scenarios, such as rows with many attributes that are rarely examined"

They are querying these quite heavily; these aren't just random attributes that they rarely need to retrieve.


You're right, I think most OSM databases extract the most commonly used keys into dedicated, potentially indexed columns.


hstore querying is quite slow (and GIN indexes on hstores are pretty massive). I'd always go jsonb over hstores these days, but jsonb has the same indexing problem. JSON has a well-optimized/spec compliant serializer/deserializer in every language you can imagine as a baseline, whereas hstore does not.


i think that’s basically what it was designed for :p


Is `value` a string here?


I once implemented a variation of this where there was a column called 'data_type', the valid values were the various SQL data types, and in code I would do a switch() on the (string) value of that column and then cast the contents of the 'value' column based on that... Ah the folly of youth...


Depends

You could have

    property_id, user_id, value(string)
    1 (assume age), 1213,    60
    2 (gender),     1213,    female
or

    property_id, user_id, value_str, value_int
    1 (assume age), 1213,    null, 60
    2 (gender),     1213,    female, null
or have a mapping in the application to get the type of the property. Plenty of ways to handle it.
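For what it's worth, a rough sketch of the second variant as DDL (all names hypothetical):

    CREATE TABLE user_properties (
        property_id integer NOT NULL,
        user_id     integer NOT NULL,
        value_str   text,        -- used when the property is textual (e.g. gender)
        value_int   integer,     -- used when the property is numeric (e.g. age)
        PRIMARY KEY (user_id, property_id)
    );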


> This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.

I think you're right. Oh ... my ... god ...

I wish I could say this is the worst example of a database schema I've ever seen, but it isn't.

Technology cycle:

X gets invented -> idiots abuse it -> X "is bad" -> Y (strictly worse than X) is "so much better" -> idiots abuse it -> Y "is bad" -> ...


EAV is a valid pattern if the keys are dynamic. For example, in a CRM, the user might want to store properties of their clients that you haven't thought of. In our platform, we use different schemas for each company, so we can actually do an ADD COLUMN ..., but you don't want to do that if you have a multi-tenant DB :)


It makes sense if each client gets to add their own fields.


Using the built-in type for that purpose is going to work way better. This depends on the DB you're using, but it is generally referred to as a "JSON" field (why? Because they're a response to MongoDB, which calls it that). Oracle and SQL Server have very similar things.

In Mysql, it is JSON data type [1], in Postgres JSON/JSONB [2].

Creating indexes across them is doable, through a workaround (involving what is generally referred to as generated columns, but can be called computed or calculated columns or something like that).
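In MySQL 5.7, for instance, that workaround looks roughly like this (the column names are assumptions):

    -- extract one JSON property into a generated column, then index it
    ALTER TABLE users
        ADD COLUMN age INT GENERATED ALWAYS AS (properties->'$.age') VIRTUAL;

    CREATE INDEX users_age_idx ON users (age);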

And, frankly, in the worst case for indexing, these databases still perform comparably to key-value stores in speed (especially SQLite).

[1] https://dev.mysql.com/doc/refman/5.7/en/json.html#json-paths

[2] https://www.postgresql.org/docs/9.4/static/datatype-json.htm...


They may be generally a better option, but they have their own disadvantages. For example, JSONB fields in Postgres won't deduplicate keys, so if you have large keys, your table size will increase quite a bit (which also makes it harder to keep it in memory).

Using EAV, you can have a "keys (id, key_name)" table, and then only use the IDs in the values table, reducing that waste.
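A rough sketch of that layout (names are purely illustrative):

    CREATE TABLE property_keys (
        id       serial PRIMARY KEY,
        key_name text UNIQUE NOT NULL
    );

    CREATE TABLE property_values (
        user_id integer NOT NULL,
        key_id  integer NOT NULL REFERENCES property_keys (id),
        value   text,
        PRIMARY KEY (user_id, key_id)
    );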

By the way, you don't need views for indexing on JSONB fields, it's supported out of the box in PG.


At least in MySQL, the JSON field type is rather new. MySQL 5.7 is not even an option yet with AWS Aurora or Google Cloud SQL.

And I don't think you will necessarily get better performance with JSON fields vs. an EAV model. Yes, you can index JSON fields by creating virtual (generated) columns, but that requires that you know the field ahead of time. With an EAV model, you can have your values table indexed and then join.

But I am excited to start using the json field types. In many cases, it will really simplify things over the traditional EAV stuff.



