Or, you could use a graph database and stop having the frustrating relational impedance mismatch, nonlocality, etc. You can have O(1) lookups instead of O(log N) for almost everything.
That will depend on which graph database you use, since a graph database might just store the graph in an underlying relational database. It will also depend on what kind of data you have and what kind of queries you want to perform. A graph database might be faster at navigating along links in the graph, but I would guess you pay a big performance penalty as soon as you have to operate orthogonally to your links, like aggregating across all instances of some entity.
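To make that concrete, here's a toy Python sketch of the trade-off I have in mind (made-up classes, purely in memory, nothing like an actual graph engine): following links costs only the node's degree, but an aggregation that cuts across the links still has to touch everything or lean on a separate index.

    # Toy in-memory "graph": nodes hold direct references to their
    # neighbours, which is roughly what a native graph store sells you.
    class Article:
        def __init__(self, title, word_count):
            self.title = title
            self.word_count = word_count

    class User:
        def __init__(self, name):
            self.name = name
            self.articles = []  # direct pointers, no index probe needed

    alice = User("alice")
    alice.articles += [Article("Intro", 1200), Article("Follow-up", 800)]

    # Navigating along links: cost proportional to alice's degree only.
    titles = [a.title for a in alice.articles]

    # Aggregating orthogonally to the links (average word count over ALL
    # articles, regardless of author) has no cheap path: scan every node
    # or maintain a separate index, same as any other database.
    all_users = [alice]
    all_articles = [a for u in all_users for a in u.articles]
    avg_words = sum(a.word_count for a in all_articles) / len(all_articles)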
Graph databases don't solve that. All databases, whether document, graph, or relational, ALL implement indexes to find specific things in exactly the same way: very well-known tree, hash, and other techniques.
The representation (outside of indexing) has properties that make your USE CASE better or worse. EGreg would not be someone to hire to architect a solution. He'll put your 1-trillion-row-per-month use case in a graph DB like Neo4J and you'll watch it fall over when you run billing.
When you’re talking about data sets so large they dictate what hardware you use, and introduce terms like “cluster”, then 1 = √n
Which is why we need a version 2 of complexity theory that doesn't treat memory access or arithmetic on arbitrary-precision numbers (i.e., as n actually goes to infinity) as O(1) operations. They aren't. Every large-system engineer knows this, but few will talk about it.
And that complexity theory already exists. Typical whiteboard engineering uses transdichotomous models to gloss over some polylogarithmic factors (as does much of the literature), but more accurate models exist.
The difference isn't usually very relevant when comparing multiple solutions under the same model of computation, though, since the extra terms don't tend to bump one complexity class above another when you switch models. And if you cared about actual runtime characteristics, you wouldn't be relying on the log factors of such a crude tool anyway.
Imagine a data center containing exabytes of data. How long does it take to access an arbitrary bit of that data?
We use clusters because computers cannot contain an infinite amount of memory, storage, or CPUs, because of physics. You see the same thing play out at smaller scales, but it's more obvious at the macro scale. More addresses take log n time to sort out, but time to access is measured in radii, not gate depth.
In a world where clusters were rare, Knuth's approximations were decent. In a world where clusters are not only de rigueur but hosted on multitenant hardware spread over upwards of 100 miles, those approximations are bullshit and need to change.
Integer arithmetic really has quantized logarithmic complexity. If your hardware has a bucket your number fits into, you're calculating n+1 or n×n in constant time. But if your data set doubles in size (especially for multiplication), you may find yourself in a bucket that doesn't exist, or a more expensive one. Contemporary code is more likely to reach for bignum, which is log n, but again stairstepped by the number of machine words it takes to represent the entire number. It's a bigger problem when your data set is sparse, so that values grow faster than the population (e.g., UUIDs).
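A quick illustration of that stairstep, in Python only because its ints are arbitrary precision; the 64-bit word size is an assumption:

    import math

    # Number of 64-bit machine words needed to hold an integer; arithmetic
    # is only "constant time" while this stays at 1.
    def words_needed(value, word_bits=64):
        return math.ceil(max(1, value.bit_length()) / word_bits)

    for exp in (32, 63, 64, 127, 128, 1024):
        print(f"2^{exp}: {words_needed(1 << exp)} word(s)")
    # Past one word you are in bignum territory: cost grows with the word
    # count, i.e. with log of the value, stairstepped at each word boundary.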
But on the topic of hash tables, you can only reach 'O(1)' if you can avoid collisions. And to avoid collisions you need keys of length m, which grows as n grows. You cannot put 51 pigeons in 50 pigeonholes. So your hash key length is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time. Actual access time for a hash table on non-imaginary hardware is √n · log n, not O(1).
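Back-of-the-envelope for the key-length half of that claim (my own arithmetic, assuming 64-bit words; the √n half is the physical access latency discussed above):

    import math

    # To give n items distinct keys you need at least log2(n) bits per key,
    # so comparing or hashing a key is at least log(n) bit operations once
    # n outgrows a single machine word.
    for n in (10**6, 10**9, 10**12, 10**20):
        min_key_bits = math.ceil(math.log2(n))
        words = math.ceil(min_key_bits / 64)
        print(f"n = {n:.0e}: key >= {min_key_bits} bits ({words} word(s))")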
If you consider that we have applications that are just a single hash table occupying the entire RAM of a single machine, then this is not picking nits. It's capacity planning.
> You cannot put 51 pigeons in 50 pigeonholes. So your hash key length is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time.
I am not sure I am following this argument. You are not going to have more than 2^64 pigeons and pigeonholes on any system soon, and I almost dare to say you will never ever get to 2^128. And for 64- or 128-bit keys, comparisons and many other operations are for all practical purposes constant time. I guess you could argue that this is sweeping a factor of log(n) under the rug because of things like carry chains, which could be faster for smaller bit sizes, but I am not sure that this really matters on common hardware; an addition takes one clock cycle independent of the operand values.
Sure, keys can be and often are longer, but not because you were forced to use long keys; it just happens that the thing you want to store in a hash table is a long string. The way you worded it made it sound to me like you were saying that one has to use long keys, not that in practice one often has to deal with long keys. But even then I am not convinced that this should add a factor to the complexity analysis; I think I would argue, at least in some situations, that calculating hashes of long keys should still be considered constant time, even for the longest keys. But I can also imagine taking this into account if the key length is not only large but also highly variable.
Look, if you have N items related to X, then at insert time you store them in an array and have X point to that array, instead of using foreign keys.
For example, when a user has 7 articles: do you want to just point to where the articles are stored, or do you want to do an O(log n) lookup for each article?
And if you have many-to-many relationships, do you want to join through an intermediate table for even more processing, or just follow a pointer to a range of an intermediate node directly and traverse?
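Here's a rough in-memory Python sketch of the many-to-many case (names are made up for illustration; a real engine obviously has to deal with everything the replies below bring up):

    # Many-to-many without a join table: each side keeps direct references.
    class Tag:
        def __init__(self, name):
            self.name = name
            self.users = []   # back-pointers for the reverse direction

    class User:
        def __init__(self, name):
            self.name = name
            self.tags = []    # forward pointers

    def link(user, tag):
        user.tags.append(tag)
        tag.users.append(user)

    alice, bob, python = User("alice"), User("bob"), Tag("python")
    link(alice, python)
    link(bob, python)

    # Traversal is pointer-chasing, no join or index probe:
    print([u.name for u in python.users])   # ['alice', 'bob']

    # The cost moves elsewhere: deleting alice means finding and removing
    # her from every tag's back-pointer list, which is exactly the kind of
    # maintenance the replies below poke at.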
What about when you delete rows? Do you just leave the space unused now? Or what if you update a row to be larger? Do you rewrite the whole array (so possibly O(n) updates)?
How do you deal with data that gets accessed in different orders based on relationships from multiple other entities? Duplicate the data? If so, updates now get amplified and you can fit less data in RAM so you're more likely to require disk IO. If not, you need a layer of indirection (so you have an array of pointers instead of an array of data).
Even with a layer of indirection, updates that grow a row and require a reallocation force you to go update all the pointers (and those pointers need to be indexed so you can find who to update). To avoid that write amplification, you can use an array of ids instead of an array of pointers. Now you want an id <-> pointer lookup, which could be done with a hash map in O(1), or with a tree in O(log N) if you also want efficient range queries.
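A minimal sketch of that id-indirection scheme, assuming an in-memory Python store with a plain dict standing in for the hash map (names are invented for illustration):

    records = {}            # id -> record; O(1) expected lookup
    user_article_ids = {}   # user id -> list of article ids, not raw pointers

    def insert_article(article_id, user_id, body):
        records[article_id] = {"user": user_id, "body": body}
        user_article_ids.setdefault(user_id, []).append(article_id)

    def grow_article(article_id, extra):
        # A "reallocation" only rebinds the id; nothing that holds the id
        # has to be rewritten, so growth causes no write amplification.
        old = records[article_id]
        records[article_id] = {**old, "body": old["body"] + extra}

    insert_article(1, user_id=42, body="draft")
    grow_article(1, " ...several more paragraphs...")
    print([records[i]["body"] for i in user_article_ids[42]])

    # Swap the dict for a sorted structure (B-tree style) and the O(1)
    # expected lookup becomes O(log N), but you gain efficient range scans.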
I'm not exactly sure about the implementation details, but for ballpark numbers: with an 8 kB page size, an 80% fill factor, and 16 B entries (8 B key + 8 B pointer), at 10E9 rows log(N) ought to be ~4 (page IOs). Not ideal, but the point is that for a lot of real-world engineering, O(log N) is effectively O(1) (with a small enough constant factor that it's fine for general use).
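Quick sanity check on those figures (my arithmetic, reading the row count as roughly a billion):

    import math

    page_bytes = 8 * 1024
    fill_factor = 0.80
    entry_bytes = 16                    # 8 B key + 8 B pointer
    fanout = page_bytes * fill_factor / entry_bytes   # ~410 entries/page

    rows = 1e9                          # reading "10E9" as ~a billion rows
    depth = math.log(rows, fanout)      # ~3.4, i.e. ~4 page reads
    print(f"fanout ~{fanout:.0f}, tree depth ~{depth:.1f}")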
So if your data is not write-once, trees are a well-rounded default data structure.