Corelan Cybersecurity Research[1] is a good starting point. There haven't been any new articles in ~5 years, but it covers some of the techniques in the linked article. The /r/netsec subreddit is also a surprisingly good source of security info. A big part as well is just experimenting on your own. Vulnerability research is a very "creative" field, so being able to think outside the box is just as important as having technical knowledge.
This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
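Roughly this kind of layout, as a hypothetical sketch (not their actual table or column names):

CREATE TABLE users (
  id   BIGINT PRIMARY KEY,
  name TEXT
);

CREATE TABLE user_properties (
  user_id BIGINT REFERENCES users (id),
  prop    TEXT,
  value   TEXT,
  PRIMARY KEY (user_id, prop)
);

-- reading N properties back means N joins (or a UNION-style pivot)
SELECT u.id,
       age.value    AS age,
       gender.value AS gender
FROM users u
LEFT JOIN user_properties age    ON age.user_id = u.id AND age.prop = 'age'
LEFT JOIN user_properties gender ON gender.user_id = u.id AND gender.prop = 'gender';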
Entity-attribute-value anti-pattern - this has been well known for at least 20 years. It can be tempting when you want to design a "flexible" system, but it really needs to be used sparingly. I was BI team lead on a product where the architect insisted that it be used on every entity (>300), as you never knew when you might want to add some bit of data. It led to some interesting (multi-page) SQL queries, and the project ultimately failed. This was one of the reasons: slow performance, and frequent runtime errors when expected data wasn't present and the application layer couldn't cope. It was a good learning experience.
https://mikesmithers.wordpress.com/2013/12/22/the-anti-patte...
We have this as a "meta" field (because MySQL is balls at adding new columns to big tables without multi-hour downtime) with some arcane nonsense format. Totally unqueryable with any efficiency.
The EAV pattern has trade-offs you need to compensate for (performance). Production systems that use EAV keep flattened tables and heavy caching to stay flexible with acceptable performance.
Oh gosh this pattern. The first time I encountered it was in my first job, where we used Magento. Super flexible. Also super slow. Does anyone have any advice on how to make a db design like this work faster? Generally I thought that when data is arranged like this it might be a prime candidate for document-based storage. But I'm no DBA, so I have no idea if that would be correct.
If you are using Postgres, the JSONB datatype will let you do exactly this while still using the full power of SQL. Simply create a column where you keep a JSON object full of random user properties, if flexibility is what you want. You can even index properties.
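A minimal sketch of what that looks like, with made-up column and property names:

-- hypothetical props column holding arbitrary per-user attributes
ALTER TABLE users ADD COLUMN props JSONB;

-- a GIN index supports containment queries across all keys
CREATE INDEX users_props_gin ON users USING GIN (props);

SELECT id FROM users WHERE props @> '{"gender": "f"}';

-- or index one specific key with an expression index
CREATE INDEX users_props_age ON users ((props->>'age'));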
Or just create ad hoc tables with user fields. Quite often it's not that a customer has n different fields for n entities, but a few that apply to the majority (like internal ERP ids, classification etc.). Put them in a few tables, index them, join them. If you don't want to parse internal DB descriptors, create a set of "schema" tables to build queries from.
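A rough sketch of that approach (table and column names are just illustrative):

-- one narrow, indexed table per family of customer-specific fields
CREATE TABLE user_erp_ids (
  user_id BIGINT PRIMARY KEY REFERENCES users (id),
  erp_id  TEXT
);
CREATE INDEX user_erp_ids_erp_id ON user_erp_ids (erp_id);

-- a small "schema" table describing the extra tables, used to build queries
CREATE TABLE custom_field_tables (
  table_name  TEXT,
  column_name TEXT,
  label       TEXT
);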
The question is whether it can be stored like this while allowing for fast queries. For example, unless it changed recently, Postgres doesn't calculate statistics to help the query planner on jsonb fields.
IIRC JSONB still has problems with index statistics
So values in JSONB columns can be indexed nicely, but the statistics can be much worse than for non-JSONB columns, which can lead the query planner astray.
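One partial workaround, assuming Postgres and a hypothetical props JSONB column: an expression index on a specific key, since ANALYZE collects statistics for indexed expressions and the planner can use them when the query uses that same expression:

CREATE INDEX users_age_expr ON users ((props->>'age'));
ANALYZE users;
-- the planner now has statistics for props->>'age',
-- but still nothing for keys that are only queried ad hoc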
This just isn't that hard. They don't have that much data. It is really late for me, but: put it all in memory and figure it out. These just aren't hard problems. DBAs have been solving performance issues with a clever index on the right column for 30+ years. Sorry if this is get-off-my-lawn-ish, but I have been on too many projects where I made a DB index and solved a major bottleneck. Too many new developers are ignorant of the nuances of RDBMS tuning. I am not even a DBA.
If they are using some sort of middleware orm, which they may well be because of their model, they are most likely using an EAV[0] schema which, although flexible for writes, is horrendous for reads. The join plus pivot is a disaster on virtually any relational system.
Hmm, that does seem probable. In fact that could make the SQL even more efficient as you'd only need a combined index on the 'prop' and 'value' columns, rather than N arbitrary combinations of indexes that may or may not be used.
Edit: Had some bad attempt at writing this query but it's rather late and it made no sense.
You would need to have a new join for each new property
SELECT DISTINCT m.loyaltyMemberID
FROM members AS m
INNER JOIN properties AS p1 ON m.id = p1.user_id
INNER JOIN properties AS p2 ON m.id = p2.user_id
INNER JOIN properties AS p3 ON m.id = p3.user_id
WHERE (p1.prop = 'gender' AND p1.value = x)
  AND ((p2.prop = 'age' AND p2.value = y) OR (p3.prop = 'censor' AND p3.value = z))
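Presumably something like this single composite index (a sketch against the hypothetical properties table above, not the actual schema from the post) would serve every aliased join:

-- one index covers p1, p2, p3; including user_id lets the lookups
-- be answered from the index alone in most engines
CREATE INDEX properties_prop_value_user ON properties (prop, value, user_id);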
I'm not sure what you're asking - could you give me an example of what you're envisioning that couldn't be satisfied with a combination of Boolean expressions in the WHERE clause?
Especially with partial indexes, I still feel like this structure will be significantly faster than the original UNION ALL ... GROUP BY on calculated fields.
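e.g. (a sketch, assuming Postgres and reusing the hypothetical properties table from the query above):

-- a small per-property index that only covers the 'gender' rows
CREATE INDEX properties_gender ON properties (value, user_id) WHERE prop = 'gender';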
And they mention in the post that most queries don't use that many fields.
Confession time: in my first job, I built something like this (and it worked pretty well in the sense that it was very flexible), but then I also had to do a 'select' based on iirc 14 of such properties. I don't really recall the exact problem I had at first, but my solution was to create two separate (temporary) tables, select 7 of the properties into one and 7 into the other, run a select on both of those tables, then join the results in code. This ran at an acceptable speed (I must have done something so that adding criteria made the run time increase non-linearly - doing it on 14 was orders of magnitude slower than on 7).
Then years later I ran into the guy who had to do some work on it after I left that company. I must have scarred him pretty badly, because he remembered it enough to bring it up as pretty much the first topic after the obligatory 'hey so what are you up to nowadays'. When I think back about it now, it was a cringey solution - then again, this was at a company where nobody had ever heard of a 'database index' (or if they did, never mentioned or implemented them).
This is a pretty popular pattern known as Entity-Attribute-Value [0]. It's used by many products where a) data model needs to be very flexible and allow new attributes without schema changes, or b) a typical entity has a large number of possible attributes that may or may not be set for all entities ("sparse" attributes). WordPress uses this to store post metadata, Magento uses this to store product attributes and most of other data, Drupal uses a variation of this to store all the posts and other content you create… I have too much experience with this model to be surprised.
hstore querying is quite slow (and GIN indexes on hstores are pretty massive). I'd always go jsonb over hstores these days, but jsonb has the same indexing problem. JSON has a well-optimized/spec compliant serializer/deserializer in every language you can imagine as a baseline, whereas hstore does not.
I once implemented a variation of this where there was a column called 'data_type', the valid values were the various SQL data types, and in code I would do a switch() on the (string) value of that column and then cast the contents of the 'value' column based on that... Ah the folly of youth...
> This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
I think you're right. Oh ... my ... god ...
I wish I could say this is the worst example of a database schema I've ever seen, but it isn't.
Technology cycle:
X gets invented -> idiots abuse it -> X "is bad" -> Y (strictly worse than X) is "so much better" -> idiots abuse it -> Y "is bad" -> ...
EAV is a valid pattern if the keys are dynamic. For example, in a CRM, the user might want to store properties of their clients that you haven't thought of. In our platform, we use different schemas for each company, so we can actually do an ADD COLUMN ..., but you don't want to do that if you have a multi-tenant DB :)
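i.e. roughly this, with an invented tenant schema and column name:

-- each company has its own schema, so a custom field becomes a real column
ALTER TABLE acme_corp.clients ADD COLUMN referral_source TEXT;

-- in a shared multi-tenant table you can't ALTER per customer,
-- which is where EAV (or a JSON column) comes back in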
Using the built-in type for that purpose is going to work way better. This depends on the DB you're using, but it is generally referred to as a "JSON" field (why? Because they're a response to MongoDB, which calls it that). Oracle and SQL Server have very similar things.
In Mysql, it is JSON data type [1], in Postgres JSON/JSONB [2].
Creating indexes across them is doable, through a workaround (involving what is generally referred to as "VIEWS", but can be called calculated columns or something like that).
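In MySQL the workaround is a generated (virtual) column rather than a view; a sketch, assuming a hypothetical meta JSON column with an age key:

-- MySQL 5.7+: expose one JSON key as an indexable virtual column
ALTER TABLE users
  ADD COLUMN age_virtual INT
    GENERATED ALWAYS AS (CAST(JSON_UNQUOTE(JSON_EXTRACT(meta, '$.age')) AS SIGNED)) VIRTUAL,
  ADD INDEX users_age_virtual (age_virtual);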
And, frankly, even in the worst case for indexing, these databases still perform comparably to key-value stores in speed (especially SQLite).
They may generally be a better option, but they have their own disadvantages. For example, JSONB fields in Postgres won't deduplicate keys, so if you have large keys, your table size will increase quite a bit (which also makes it harder to keep in memory).
Using an EAV model, you can have a "keys (id, key_name)" table and then only use the IDs in the values table, reducing that waste.
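i.e. something like this (invented table names):

-- attribute names are stored once, values reference them by id
CREATE TABLE attr_keys (
  id       INT PRIMARY KEY,
  key_name TEXT UNIQUE
);

CREATE TABLE attr_values (
  entity_id BIGINT,
  key_id    INT REFERENCES attr_keys (id),
  value     TEXT,
  PRIMARY KEY (entity_id, key_id)
);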
By the way, you don't need views for indexing on JSONB fields, it's supported out of the box in PG.
At least in MySQL, json field types are rather new. MySQL 5.7 is not yet an option with AWS Aurora or Google cloud SQL even.
And I don't think you will necessarily get better performance with json fields vs an EAV model. Yes, you can index json fields by creating virtual (generated) columns, but that requires that you know the field ahead of time. With an EAV model, you can have your values table indexed and then join.
But I am excited to start using the json field types. In many cases, it will really simplify things over the traditional EAV stuff.