
Ask HN: How do I represent a hierarchy in RDBMS? - m33k44
Say, I have following hierarchy:<p><pre><code>   Animal ==+==&gt; Dog
            |
            +==&gt; Cat
</code></pre>
and another &quot;class&quot;, Pet, that references Animal.<p><pre><code>   Pet --1-----+--&gt; Animal
</code></pre>
The way I can represent this in RDBMS is by creating tables for each class, Pet, Animal, Cat and Dog. And then additional tables for the relationships, i.e.<p><pre><code>   PetAnimal(pet_id, animal_id) [one-to-many]
   AnimalDog(animal_id, dog_id) [one-to-one]
   AnimalCat(animal_id, cat_id) [one-to-one]
</code></pre>
This way I am able to represent the hierarchy of Animal, Dog and Cat. So, tomorrow if I want to add Rabbit, then I will just add Rabbit and AnimalRabbit tables.<p>My only concern is that the Animal table will grow rapidly as more pets and animal types are added and will be a performance issue. What is a better way to represent hierarchical structures in RDBMS to avoid both storage space explosion and performance issues?
======
gdulli
You've asked about storage size and performance, which are very unlikely to be
an issue for storing rows of two integers.

What you really want to be thinking about is the data access patterns.

Are you going to be querying the entire data set in a way that can only be
practically expressed or executed as a graph algorithm? My guess is no,
because that's specialized and uncommon. What you want to avoid is giving up
the reliability, maturity, predictability of an RDBMS unless you really have
to, because you really need the secret sauce of a graph (or other specialized)
database.

Even if you do have graph algorithms to run, I would still recommend an RDBMS
for primary storage and day to day single-animal lookups if that's a frequent
use case. Load data into a secondary graph database to handle offline or one-
off analysis. Unless your entire application is truly centered around graph
algorithms.

Switching off of an RDBMS very quickly becomes a tail-wagging-the-dog
situation if you only look at the data and not the access use cases.

The important thing to note is that the entity relationships alone aren't the
right information to allow anyone to make a meaningful recommendation.

------
personpersonper
Your choices are typically one of
[https://martinfowler.com/eaaCatalog/singleTableInheritance.h...](https://martinfowler.com/eaaCatalog/singleTableInheritance.html),
[https://www.martinfowler.com/eaaCatalog/classTableInheritanc...](https://www.martinfowler.com/eaaCatalog/classTableInheritance.html)
or
[https://martinfowler.com/eaaCatalog/concreteTableInheritance...](https://martinfowler.com/eaaCatalog/concreteTableInheritance.html)

------
dillonmckay
Your storage space concern is unfounded, particularly without denormalizing
your relationships (to gain performance).

It seems you have ‘animals’, some of which can be ‘pets’, but not all.

You also have ‘dogs’, ‘cats’, ‘rabbits’, etc., which may or may not be pets,
but are all animals.

So, my questions to you, what attributes do all animals share?

What attributes do all pets share?

What attributes are unique to dogs?

What attributes are unique to cats?

There are two main ways I could model this in an RDBMS, but I need to know
more specifics.

~~~
m33k44

      > What attributes do all pets share?
    

Pets have owner, feeding time, outdoor time etc.

    
    
      > What attributes are unique to dogs?
    

Type, Height, Medical record, Sex, Barking pitch etc.

    
    
      > What attributes are unique to cats?
    

Type, Colour, Weight, Medical record, Sex etc.

~~~
dillonmckay
So, since you listed ‘sex’ as a unique attribute for both cats and dogs, I
think you need to alter your perspective.

We want to model the relationships between entities, not necessarily represent
OOP ‘objects’ in a DB.

So, you should have an ‘animal’ table that would have fields for any attribute
that is shared by all animals, such as ‘sex’, ‘medical record’, and ‘weight’
(I am assuming all animals will have a gender).

Medical record would be a lookup key, probably not an integer, so you cannot
enumerate or get an off by one error, use a UUID or hashid.

I am assuming non-pets can have medical records.

Your dog and cat tables would reference the animal table via a foreign key for
those details.

For cat type and dog type, I would make those foreign key references to lookup
tables for each of the respective types.

Animal table would also have a foreign key reference to the pet table, but
this can be NULL, if the animal is not a pet.

Thoughts?

~~~
m33k44
So something similar to what @rodeoclown described below?

~~~
dillonmckay
Exactly.

------
stonecolddevin
If you're using Postgres, you can use recursive queries using common table
expressions. Disqus implemented this sort of thing 10 years ago:
[https://pastebin.com/Fe2twMRr](https://pastebin.com/Fe2twMRr) (I can't find
the original talk for some reason)

Here's the meat of the solution:

    
    
      CREATE TABLE comments (
            id SERIAL PRIMARY KEY,
            message VARCHAR,
            author VARCHAR,
            parent_id INTEGER REFERENCES comments(id)
      );
      INSERT INTO comments (message, author, parent_id)
      VALUES ('This thread is really cool!', 'David', NULL), 
      ('Ya David, we love it!', 'Jason', 1), ('I agree David!', 'Daniel', 1), 
      ('gift Jason', 'Anton', 2),
      ('Very interesting post!', 'thedz', NULL), 
      ('You sir, are wrong', 'Chris', 5), 
      ('Agreed', 'G', 5), ('Fo sho, Yall', 'Mac', 5);
    

What we’ve done now, is setup a basic comment model. We’ve included the
message, the author, and the parent comment (which is optional). Now let’s
learn how to use a recursive query to easily re-order this date, showing us a
fully threaded view, sorted in ascending order by id.

    
    
      WITH RECURSIVE cte (id, message, author, path, parent_id, depth)  AS (
      SELECT  id,
              message,
              author,
              array[id] AS path,
              parent_id,
              1 AS depth
      FROM    comments
      WHERE   parent_id IS NULL
    
      UNION ALL
     
      SELECT  comments.id,
              comments.message,
              comments.author,
              cte.path || comments.id,
              comments.parent_id,
              cte.depth + 1 AS depth
      FROM    comments
              JOIN cte ON comments.parent_id = cte.id
      )
      SELECT id, message, author, path, depth FROM cte
      ORDER BY path;

------
bpyne
As other people are saying, you haven't really given enough information for
someone to suggest a definitive data model.

By "Pet" class, do you mean a table for pet records?

Pets are animals that humans decide to take care of. Should a "Pet" be a state
of an Animal? So, should it be a simple boolean column on the Animal table?

Having an AnimalRabbit table implies a many-to-many relationship between
Animal and Rabbit. Are you storing different rabbit types as parent records?
If so, why not avoid the join table, store "rabbit" as a record in the Animal
table, and have the different rabbit types in the Rabbit table.

Without more detail on what you're trying to do from an application
standpoint, it's tough to give recommendations. Entity-Relationship Modeling
purists believe in defining entities and their relationships without
implementation detail. I've always found that to be a crock. You need a
certain amount of detail to model better.

~~~
m33k44

      > By "Pet" class, do you mean a table for pet records?
    

Yes.

    
    
      > Having an AnimalRabbit table implies a many-to-many relationship between Animal and Rabbit. Are you storing different rabbit types as parent records?
    

The whole point on making an Animal table was so that it could be referenced
by the Pet table, because Pets can be more than one type. If I include Dog and
Cat directly in Pet table then there will be lots of fields in the Pet table
empty; also tomorrow if I want to add Rabbit as a Pet then I need to modify
the Pet table to include Rabbit fields with null values in the previous rows.
Having a separate Animal table allows me to add new types of Pets without
modifying the Pet table and also saves table space.

~~~
bpyne
To reiterate, without knowing about your application, it isn't easy to make
recommendations.

It sounds like the Animal table is going to be the join table between Pet and
the different types of animal tables. Correct?

You haven't mentioned a need to store differentiating information - e.g.
species, genus, family - in the different animal tables (Dog, Cat, etc.). So
it sounds like you could skip the one table per animal notion and just have
the Animal table contain animal types like rabbit, dog, etc. Then it becomes
the "one" in the one-to-many relationship with Pet.

~~~
m33k44

      > It sounds like the Animal table is going to be the join table between Pet and the different types of animal tables. Correct?
    

Yes, that is correct.

------
maxk42
A simple solution is to have a single table with the fields: id, parent, name

id is your primary key. parent is the id of the parent (if one exists) or a
sentinel value (NULL or 0) if no parent exists, and name is "Dog", "Cat",
"Pet", etc.

If you structure your queries well this will be very performant up to 1 - 2
million records on commodity hardware (e.g. an EC2 t2-large instance) and may
be suitable for your needs for much larger data-sets, depending on your
application.

For improved lookup performance, you might split the fixed-width and variable-
width (varchar) columns into separate tables. Then you'd end up with an
"animal_names" table consisting of (id, name) and an "animal_relationships"
table consisting of (id, parent_id, name_id) -- but again, that depends on
usage patterns and which specific queries you need to be performant.

~~~
tabtab
The question is not really about hierarchies, but more about representing
diverse-but-similar domain entities.

It's a problem found in GUI modeling also, where various GUI widgets will or
can have different properties. Some properties will be shared, such as
"parent_ID" (AKA "container"), but many are specific to a particular widget or
subset of widgets.

Further, new widgets may be added down the road, and we don't want to have to
invent a new table for each one. Current RDBMS modelling techniques handle
these situations poorly.

I've been a fan of "dynamic relational" to handle these kinds of problems, but
so far it hasn't been implemented:
[https://news.ycombinator.com/item?id=15352786](https://news.ycombinator.com/item?id=15352786)

------
rhizome
There is no RDBMS that will choke on a table with every animal in existence in
it.

To my knowledge, the two most common ways of handling this: self-referential
nested set[1] and with a join table[2]

1\.
[https://en.wikipedia.org/wiki/Nested_set_model](https://en.wikipedia.org/wiki/Nested_set_model)

2\. [https://coderwall.com/p/lixing/closure-tables-for-
browsing-t...](https://coderwall.com/p/lixing/closure-tables-for-browsing-
trees-in-sql)

------
paktek123
Are there any significant differences between AnimalCat and AnimalDog? Feels
like there can just be an "animals" table with an enum "type" field or a one
to many animaltype(id, type).

Normally with such hierarchies it doesn't neccesary need to represent the same
structure unless there is a significant difference as we can filter by
animaltype with a where clause. The same approach cannot be applied to rmdbs
design as to oo because the paradigms are different.

~~~
m33k44

      > Are there any significant differences between AnimalCat and AnimalDog? Feels like there can just be an "animals" table with an enum "type" field
    

The AnimalCat and AnimalDog are just relationship tables. The alternative to
not using separate relationship tables is, just as you said, to use an enum
type field in the Animal table. I was actually trying to avoid using an extra
field in the table to indicate type, hence my question on what are better
approaches to avoid both table size increase and performance issues.

~~~
paktek123
What is the reason for avoiding an extra field considering the alternatives
mentioned throughout this post?

~~~
m33k44
Just one 64 bit size field in a 1 billion row table will amount to about 8 GB
data for that table.

Edit: Basically what I am looking for is a trade-off between space and
performance.

~~~
paktek123
8gb seems quite small in the grand scheme of things. Storage is fairly cheap
nowadays. I'm sure there will be other concerns with scaling when you do hit
the 1 billion mark.

------
rodeoclown
Start with the animal table:

Animal(animal_id PK, ... animal attributes)

No need for join tables or surrogate keys on these tables. The animal_id can
serve as both the primary key and foreign key.

Dog(animal_id PK FK, ... dog only attributes) Cat(animal_id PK FK, ... cat
only attributes)

No need for surrogate key on pet, and don't forget the owner key:

Pet(owner_id PK FK, animal_id PK FK)

This allows multiple owners of one animal. If you don't want that you need a
unique constraint on animal_id.

~~~
m33k44
What are the pros and cons of this approach?

~~~
dragonwriter
The pros are you aren't creating unnecessary sequences and tables, which has
both storage space and join complexity savings.

The cons are...none, really; the extra surrogate keys aren't buying you
anything.

------
vinceguidry
This is really a domain modeling question masquerading as a data modeling one.
If you really were modeling animals, then you could put the whole thing in one
animals table, with self-referencing columns. This is because you only have
one 'type' of domain concept, with details that not all instances of the class
will share. So you leave those columns NULL.

So you do one animals table with an animal_id and species_type columns, the
species_type column could be a enumerated string field or foreign key to a
species table, and pet as a boolean field. Now soon you might want to know
'whose' pet it is, so you'll need a human table to store the humans, and then
you'd need a human_id field on the animals table to store that information.
You 'could' use a pets join table if you want pets to be a first class domain
concept.

All these decisions revolve around what your app will be doing. Ideally you'd
have a REST web architecture, this makes domain modeling a bit easier.
Anything that has it's own web page, give a database table.

------
pier25
In Postgres you can use tree data with LTREE.

[http://patshaughnessy.net/2017/12/11/trying-to-represent-
a-t...](http://patshaughnessy.net/2017/12/11/trying-to-represent-a-tree-
structure-using-postgres)

~~~
metapsj
great for data that is mostly read intensive. not so great if you want lots of
writes.

@maxk42 approach with the relationship table works well for both read and
write intensive applications.

------
NicoJuicy
Depending the DB, some databases have build-in support for HierarchyId or
ParentId ( eg. SQL server : [https://docs.microsoft.com/en-us/sql/relational-
databases/ta...](https://docs.microsoft.com/en-us/sql/relational-
databases/tables/tutorial-using-the-hierarchyid-data-type?view=sql-
server-2017) )

Otherwise nested sets of closure tables are popular db patterns.

I love the trick of nested sets though. Calculating a number for each side of
a type do your RightBower and LeftBower can be used (
[https://images.app.goo.gl/Bv2wiC3gG6DW869r5](https://images.app.goo.gl/Bv2wiC3gG6DW869r5)
)

------
dragonwriter
You can just use Animal_id as the PK for all the tables (in the other tables
it will also be an FK to Animals) and dump the relationship tables.

> My only concern is that the Animal table will grow rapidly as more pets and
> animal types are added and will be a performance issue.

Why? You aren't duplicating data anywhere (except perhaps the unnecessary
IDs), so why do you think there is a more compact representation? It doesn't
matter much what table the data is in as long as it appears only once.

------
Pamar
Sorry, I am embarking on a flight so maybe I will have to come back to you
later.

For the time being: [https://stackoverflow.com/questions/3413459/designing-
sql-da...](https://stackoverflow.com/questions/3413459/designing-sql-database-
to-represent-oo-class-hierarchy) \- mine is the "chosen" answer and I hope
will give you a pointer in the right direction.

------
mcv
Does this need to be tabular data? If you want to model relationships, graph
databases are great.

Despite their name, relational databases are not great at relationships.
They're great at tables, but can also handle relationships if you have to.
Many problems can be solved by an RDBMS, but as this question and many others
like it show, it's not always a natural fit.

~~~
m33k44
> Does this need to be tabular data?

No, not a requirement. But what will I lose if I choose graph DB over RDBMS?

~~~
henryfjordan
Graphdbs are excellent at joins when a relationship exists (they are O(1), not
O(logn) like SQL). The query language many of them use (Cypher) is very
expressive for these types of queries. Take a look, the basics are very easy
to pick up.

The biggest draw-back is that you likely already know SQL and would need to
learn a lot up front. The resources out there are good, but there are so many
more good SQL references. If you want to do aggregations and the like, SQL
Databases might be a little more suited for those tasks. Popular SQL DBs are
also a lot more stable and optimized, just given their age.

To make use of these features though, all your data must fit on one box.

~~~
mcv
The difference in experience is really the only major downside to graph DBs.
They're an excellent tool for many of the situations where we currently use
RDBMSs, mostly because that's what we're familiar with while graph DBs are new
and unfamiliar.

I've now been using Neo4J for a couple of months, and I don't want to go back.
If relations between different items are what matters to you, then graph DBs
make that so painless compared to the joins from relational DBs.

The main thing relational DBs still have over graph DBs is tables. Give me a
table with all items from this table that meet this condition. Graph DBs can
do that too, but it's less natural, because they don't store their data in
tables, and nothing guarantees that all items of a certain type actually have
the same properties. RDBMSs are statically typed, Graph DBs dynamically typed.

------
dm03514
[https://stackoverflow.com/questions/4048151/what-are-the-
opt...](https://stackoverflow.com/questions/4048151/what-are-the-options-for-
storing-hierarchical-data-in-a-relational-database)

------
mindcrash
Use the right data model and - for example - plug Agensgraph into Postgres.
Plug your data in Agensgraph. Problem solved.

[https://github.com/bitnine-oss/agensgraph](https://github.com/bitnine-
oss/agensgraph)

~~~
superdimwit
He's not talking about an object hierarachy, but more of a type hierarchy

------
ahsd1
[http://patshaughnessy.net/2017/12/13/saving-a-tree-in-
postgr...](http://patshaughnessy.net/2017/12/13/saving-a-tree-in-postgres-
using-ltree)

Look into ltree in postgres or similar

~~~
metapsj
great for data that is mostly read intensive. not so great if you have lots of
writes.

@maxk42 approach with the relationship table works well for both read and
write intensive applications.

------
eb0la
I agree if you can use a recursive table, do it. If you're using an ORM do it
as the ORM allows you and think about performance later as soon as it grows
big.

If tables are small (or you have good indexes), a non-optimal structure will
still work for you.

~~~
m33k44

      > I agree if you can use a recursive table
    

Could you please elaborate on that? Which table will have recursive relation
in this case?

~~~
sokoloff

      Category ID  Category Name Category Parent
      -----------  ------------- ---------------
      1            Animal        NULL
      2            Dog           1
      3            Cat           1
      4            Rabbit        1
      5            Beagle        2
      6            Lion          3
      7            Tiger         3
      8            Bengal Tiger  7
    

A Bengal Tiger (8) is a Tiger (7) is a Cat (3) is an Animal (1).

~~~
protomyth
just for your sanity in a relational database alway put a root id for
hierarchies. Also, you might want to skip the null and use a self reference
depending on how you need to build the hierarchy. Root only matters if you
have multiple hierarchies in a table.

~~~
tracker1
+1 for root-id, this is a huge failing in a lot of threaded discussion table
designs.

~~~
protomyth
Yeah, it really helps if you have to do some processing in the stored
procedures and it helps with the query speed if properly indexed. I have seen
folks put a level number in the table also, but that really depends on what
the hierarchy represents.

------
iblaine
One possible solution is a junction table:
[https://en.wikipedia.org/wiki/Associative_entity](https://en.wikipedia.org/wiki/Associative_entity)

An RDBMs within indexes is probably fine.

------
khurram75
Recursive relation within the same table. A single table can solve this
problem. Some animal is pet and some are not this is not their class but these
are their property.

~~~
m33k44

      > Recursive relation within the same table. A single table can solve this problem.
    

How? Which table will recurse?

~~~
pier25
Not parent commenter but a single table is enough. Each row pointing to its
parent row.

------
sokoloff
There will never be so many animals that they won't fit into RAM on any
reasonable DB server (even including a smartphone).

------
maxdemarzi
If you used Neo4j you would get an easy hierarchy. If you need multiple
inheritance then no problem. If all of your classes have completely different
properties then no problem. If some discovery reclassified the tree somehow
then no problem. But it all depends on how important this use case is to the
overall application.

------
thunderbong
Have a table for Types

The types table have references among them

Types Table

id name parent_id

1 Animal null

2 Pet 1

3 Dog 2

4 Cat 2

Have the actual animals reference the Types

id name type_id

1 Rover 3

2 Oscar 4

Now, you can get the relationships out of these.

------
jenkstom
A self referencing foreign key (to the same table).

------
gwbas1c
I agree with the general sentiment that this is a better stack overflow
question. But, it's made it to the HN front page, so it deserves some
discussion.

In general, there is no one-size-fits-all answer, because the answer is really
driven by your data. Which brings up the most important point:

Your database is your application's source of truth. It must have the best
data model, and will most likely outlast all source code that you write.

So don't try to model your data as if you're writing an object oriented
application. Model your data at the database first, and then write your
application to fit the model of your data. (Don't put the cart before the
horse.)

Anyway, the general patterns in this case are:

\- One table with lots of null columns

\- Lots of tables that join: (Animal joins pet joins Dog/Cat/Fish)

\- Separate tables (Dog/Cat/Fish, no joins)

All approaches have tradeoffs. The one big table approach will have "waste"
with the null columns, but remember that joining pet to animal to dog has a
penalty on read, and that separate tables has implementation complexity.

IMO: If your entities are known, (IE, you won't add a new type of animal every
month,) and your database is read-heavy, go with a single table with null
columns. Otherwise, you need to justify the complexity of multiple tables.

~~~
cjblomqvist
One "trick" applicable to the mega table solution (one table with all data) to
preserve performance (kind of) is to do indicies that spans each type, or do
views (which technically will be almost the same as indicies to the DB
internally). I'm by no means a DB pro, but I've looked up the same answer and
been working with these cases (with these twists).

------
AstralStorm
This is a question better posted to Stack Overflow. With more technical
implementation details.

~~~
dang
There's no problem with posting questions like this to HN! If it gratifies
intellectual curiosity, it's just fine.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

Edit: I should add something important, though. This submission works because
there haven't been similar questions recently. If it started a pattern of
people posting questions of this sort, that would quickly be less interesting,
and therefore off topic by the curiosity principle.

~~~
tabtab
If it triggers heated debate among experienced people, it probably qualifies
as an "intellectual curiosity". If most come to a consensus about the
solution(s), then it's more of a technical how-to question. The problem is
that it can be hard to know ahead of time; newbies often stumble into age-old
conundrums. So far, I find this question in the first camp: it's not something
the current RDBMS are well-equipped to handle for reasons described in a
nearby reply.

