
The user data is most likely in rows instead of columns. Instead of having

    id, name, age, gender
    1213, fake, 60, female
they would have

    property_id, user_id, value
    1 (assume age), 1213,    60
    2 (gender),     1213,    female
This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
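To make that concrete, a lookup against such a layout might look roughly like this (the table and column names here are just guesses based on the example above):

    -- find the age of user 1213 by joining the property rows back in
    SELECT p.value AS age
    FROM users u
    JOIN user_properties p ON p.user_id = u.id
    WHERE u.id = 1213
      AND p.property_id = 1;   -- 1 = 'age' in the example mapping above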



Entity-attribute-value anti-pattern - this has been well known for at least 20 years. It can be tempting when you want to design a "flexible" system, but it really needs to be used sparingly. I was BI team lead on a product where the architect insisted that it be used on every entity (>300), as you never knew when you might want to add some bit of data. It led to some interesting (multi-page) SQL and the project ultimately failed; this was one of the reasons: slow performance, and frequent runtime errors when expected data wasn't present and the application layer couldn't cope. It was a good learning experience. https://mikesmithers.wordpress.com/2013/12/22/the-anti-patte...


We have this as a "meta" field (because MySQL is balls at adding new columns to big tables without multi-hour downtime) with some arcane nonsense format. Totally unqueryable with any efficiency.


The EAV pattern has trade-offs you need to compensate for (performance). Production systems that use EAV also maintain flat tables and heavy caching to stay flexible while keeping acceptable performance.


You could argue that there are cases for it. Datomic[0] is basically EAV on steroids.

https://www.datomic.com/


Oh gosh, this pattern. The first time I encountered it was in my first job, where we used Magento. Super flexible. Also super slow. Does anyone have any advice on how to make a DB design like this work faster? Generally I thought that when data is arranged like this it might be a prime candidate for document-based storage. But I'm no DBA, so I have no idea if that would be correct.


If you are using Postgres, the JSONB datatype will let you do exactly this while still using the full power of SQL. Simply create a column where you keep a JSON object full of random user properties, if flexibility is what you want. You can even index properties.
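A rough sketch of what that can look like in Postgres (the table and column names here are made up for illustration):

    -- a catch-all JSONB column for ad-hoc user properties
    ALTER TABLE users ADD COLUMN properties jsonb;

    -- a GIN index lets containment queries (@>) use an index
    CREATE INDEX users_properties_idx ON users USING gin (properties);

    -- e.g. find all 60-year-old female users
    SELECT id FROM users
    WHERE properties @> '{"gender": "female", "age": 60}';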


Or just create ad hoc tables with user fields. Quite often it's not that a customer has n different fields for n entities, but a few that apply to the majority (like internal ERP ids, classification etc.). Put them in a few tables, index them, join them. If you don't want to parse internal DB descriptors, create a set of "schema" tables to build queries from.


The question is whether it can be stored like this while allowing for fast queries. For example, unless it changed recently, Postgres doesn't calculate statistics to help the query planner on jsonb fields.


JSONB columns should behave just like any datatype with a GIN index in the recent releases, to my knowledge.

Still, a JSON column will arguably be faster than a pure KV table since you can query it more efficiently, especially in combination with the non-JSON columns.


IIRC JSONB still has problems with index statistics.

So values in JSONB columns can be indexed nicely, but the statistics can be much worse than for non-JSONB columns, which can lead the query planner astray.


Postgres will let you create an index on an expression that extracts a value from the JSON column, so querying should still be very quick.
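For example, something along these lines (assuming a 'properties' jsonb column as in the sketch above):

    -- index one extracted property rather than the whole document
    CREATE INDEX users_age_idx ON users ((properties->>'age'));

    -- this filter can now use the expression index
    SELECT id FROM users WHERE properties->>'age' = '60';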


The answer would make a great blog post.


This just isn't that hard. They don't have that much data. It is really late for me, but: put it all in memory and figure it out. These just aren't hard problems. DBAs have been solving performance issues with a clever index on the right column for 30+ years. Sorry if this is get-off-my-lawn-ish, but I have been on too many projects where I added a DB index and solved a major bottleneck. Too many new developers are ignorant of the nuances of RDBMS tuning. I am not even a DBA.


I am not a DBA, but I could see this crap when the NoSQL hype took off.

"Relational databases don't scale"

Well they worked fine for decades before Mongo's marketing claimed so.


> Does anyone have any advice how to make a db design like this work faster?

Normalize it properly. If this is not possible, ensure that both the user id and the property are indexed.
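A minimal sketch, assuming the property-table layout from the top comment (names are illustrative):

    -- lets lookups by user go straight to its property rows
    CREATE INDEX user_properties_user_prop_idx
        ON user_properties (user_id, property_id);

    -- helps queries that filter by property and value
    CREATE INDEX user_properties_prop_value_idx
        ON user_properties (property_id, value);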


If what you actually want is data arranged like that, Datomic is probably a prime candidate.

https://www.datomic.com/


If they are using some sort of middleware ORM, which they may well be because of their model, they are most likely using an EAV[0] schema, which, although flexible for writes, is horrendous for reads. The join-plus-pivot is a disaster on virtually any relational system.

[0]https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Hmm, that does seem probable. In fact that could make the SQL even more efficient as you'd only need a combined index on the 'prop' and 'value' columns, rather than N arbitrary combinations of indexes that may or may not be used.

Edit: Had some bad attempt at writing this query but it's rather late and it made no sense.


You would need to have a new join for each new property:

    SELECT DISTINCT m.loyaltyMemberID
    FROM members AS m
    INNER JOIN properties AS p1 ON m.id = p1.user_id
    INNER JOIN properties AS p2 ON m.id = p2.user_id
    INNER JOIN properties AS p3 ON m.id = p3.user_id
    WHERE (p1.prop = 'gender' AND p1.value = x)
      AND ((p2.prop = 'age' AND p2.value = y) OR (p3.prop = 'censor' AND p3.value = z))


There's no need for the extra joins, you can just do the one join and then filter everything in the WHERE clause:

  SELECT DISTINCT loyaltyMemberID
  from members as m
  INNER JOIN properties as p on m.id=p.user_id
  WHERE (p.prop='name' AND p.value = value) AND

  ...etc.


But how would you do exclusions with your approach?


I'm not sure what you're asking - could you give me an example of what you're envisioning that couldn't be satisfied with a combination of Boolean expressions in the WHERE clause?


With this query approach, how do you find people that have a prop1 but do not have a prop2?

If you get records back with prop1 then you have to remove those records from results based on another record.

There are multiple ways to accomplish this but it can't be done with a single join and simple where clause.


Sorry, I missed your reply. Yes, you are correct, in that case you would need to use an EXCEPT, a sub-query, a derived table, etc.
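For example, "has prop1 but not prop2" could be written with a correlated sub-query along these lines (table names follow the earlier examples):

    SELECT DISTINCT m.loyaltyMemberID
    FROM members AS m
    INNER JOIN properties AS p1 ON m.id = p1.user_id AND p1.prop = 'prop1'
    WHERE NOT EXISTS (
        SELECT 1 FROM properties p2
        WHERE p2.user_id = m.id AND p2.prop = 'prop2'
    );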


Especially with partial indexes, I still feel like this structure will be significantly faster than the original UNION ALL ... GROUP BY on calculated fields.

And they mention in the post that most queries don't use that many fields.


Confession time: in my first job, I built something like this (and it worked pretty well in the sense that it was very flexible), but then I also had to do a 'select' based on, IIRC, 14 such properties. I don't really recall the exact problem I had at first, but my solution was to create two separate (temporary) tables, select 7 of the properties into one and 7 into the other, run a select on both of those tables, then join the results in code. This ran at an acceptable speed (I must have done something so that adding criteria made the run time increase non-linearly - doing it on 14 was orders of magnitude slower than on 7).

Then years later I ran into the guy who had to do some work on it after I left that company. I must have scarred him pretty badly, because he remembered it enough to bring it up as pretty much the first topic after the obligatory 'hey so what are you up to nowadays'. When I think back about it now, it was a cringey solution - then again, this was at a company where nobody had ever heard of a 'database index' (or if they did, never mentioned or implemented them).


Christ do people do this??


This is a pretty popular pattern known as Entity-Attribute-Value [0]. It's used by many products where a) the data model needs to be very flexible and allow new attributes without schema changes, or b) a typical entity has a large number of possible attributes that may or may not be set for every entity ("sparse" attributes). WordPress uses this to store post metadata, Magento uses this to store product attributes and most of its other data, Drupal uses a variation of this to store all the posts and other content you create… I have too much experience with this model to be surprised.

[0]: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Sounds like a great case for Postgres hstore (like OpenStreetMap does it)?


I mean, not really, there's no magic in Postgres' implementation of the pattern. They even spell it out for you:

"can be useful in various scenarios, such as rows with many attributes that are rarely examined"

They are querying these quite heavily; these aren't just random attributes that they rarely need to retrieve.


You're right, I think most OSM databases extract the most commonly used keys into dedicated, potentially indexed columns.


hstore querying is quite slow (and GIN indexes on hstores are pretty massive). I'd always go jsonb over hstores these days, but jsonb has the same indexing problem. JSON has a well-optimized/spec compliant serializer/deserializer in every language you can imagine as a baseline, whereas hstore does not.


i think that’s basically what it was designed for :p


Is `value` a string here?


I once implemented a variation of this where there was a column called 'data_type', the valid values were the various SQL data types, and in code I would do a switch() on the (string) value of that column and then cast the contents of the 'value' column based on that... Ah the folly of youth...


Depends

You could have

    property_id, user_id, value(string)
    1 (assume age), 1213,    60
    2 (gender),     1213,    female
or

    property_id, user_id, value_str, value_int
    1 (assume age), 1213,    null, 60
    2 (gender),     1213,    female, null
or have a mapping in the application to get the type of the property. Plenty of ways to handle it.
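For what it's worth, a rough sketch of the second variant as DDL (all names hypothetical):

    CREATE TABLE user_properties (
        property_id integer NOT NULL,
        user_id     integer NOT NULL,
        value_str   text,        -- used when the property is textual (e.g. gender)
        value_int   integer,     -- used when the property is numeric (e.g. age)
        PRIMARY KEY (user_id, property_id)
    );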


> This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.

I think you're right. Oh ... my ... god ...

I wish I could say this is the worst example of a database schema I've ever seen, but it isn't.

Technology cycle:

X gets invented -> idiots abuse it -> X "is bad" -> Y (strictly worse than X) is "so much better" -> idiots abuse it -> Y "is bad" -> ...


EAV is a valid pattern if the keys are dynamic. For example, in a CRM, the user might want to store properties of their clients that you haven't thought of. In our platform, we use different schemas for each company, so we can actually do an ADD COLUMN ..., but you don't want to do that if you have a multi-tenant DB :)


It makes sense if each client gets to add their own fields.


Using the built-in type for that purpose is going to work way better. This depends on the DB you're using, but it is generally referred to as a "JSON" field (why? Because they're a response to MongoDB, which calls it that). Oracle and SQL Server have very similar things.

In Mysql, it is JSON data type [1], in Postgres JSON/JSONB [2].

Creating indexes across them is doable, through a workaround (involving what is generally referred to as generated columns, but can be called computed or calculated columns or something like that).
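In MySQL 5.7, for instance, that workaround looks roughly like this (the column names are assumptions):

    -- extract one JSON property into a generated column, then index it
    ALTER TABLE users
        ADD COLUMN age INT GENERATED ALWAYS AS (properties->'$.age') VIRTUAL;

    CREATE INDEX users_age_idx ON users (age);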

And, frankly, in the worst case for indexing, these databases still perform comparably to key-value stores in speed (especially SQLite).

[1] https://dev.mysql.com/doc/refman/5.7/en/json.html#json-paths

[2] https://www.postgresql.org/docs/9.4/static/datatype-json.htm...


They may be generally a better option, but they have their own disadvantages. For example, JSONB fields in Postgres won't deduplicate keys, so if you have large keys, your table size will increase quite a bit (which also makes it harder to keep it in memory).

Using EAV, you can have a "keys (id, key_name)" table, and then only use the IDs in the values table, reducing that waste.
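A rough sketch of that layout (names are purely illustrative):

    CREATE TABLE property_keys (
        id       serial PRIMARY KEY,
        key_name text UNIQUE NOT NULL
    );

    CREATE TABLE property_values (
        user_id integer NOT NULL,
        key_id  integer NOT NULL REFERENCES property_keys (id),
        value   text,
        PRIMARY KEY (user_id, key_id)
    );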

By the way, you don't need views for indexing on JSONB fields, it's supported out of the box in PG.


At least in MySQL, the JSON field type is rather new. MySQL 5.7 is not even an option yet with AWS Aurora or Google Cloud SQL.

And I don't think you will necessarily get better performance with JSON fields vs. an EAV model. Yes, you can index JSON fields by creating virtual (generated) columns, but that requires that you know the field ahead of time. With an EAV model, you can have your values table indexed and then join.

But I am excited to start using the json field types. In many cases, it will really simplify things over the traditional EAV stuff.



