I'm not sure of the exact statistic, but something like 95% of all production databases are less than 10GB. There seems to be a 'FAANG hacker' fascination with 'extreme scale', which probably comes from seeing the challenges faced by the handful of organizations working at that level. Most of the time, graph database users want a DB that allows them to flexibly model their data and run complex queries - that's why they're there. They probably also want some sort of interoperability. If you can do that well for 10GB, that is holy grail enough. We certainly found this while developing the graph database TerminusDB [1]: most users have smaller production DBs, make lighter use of bells-and-whistles features, and really want things like easy schema evolution.
This research paper is talking about performance whilst you're talking about scalability.
Those are related but are distinct from each other.
And sure, about 95% of companies would have their needs met by a simpler system, but that still leaves a lot of companies that won't. And for those of us in, say, finance doing customer/fraud analytics, I would welcome all the performance I can get.
> This research paper is talking about performance whilst you're talking about scalability. Those are related but are distinct from each other.
The paper has "Scale to Hundreds of Thousands of Cores" in the title. I have not yet read the paper but it seems unlikely it doesn't talk about scalability.
If your data is small enough to easily fit in RAM, you kind of can't have that slow a query on it (or at least you're no longer talking about a database problem).
If you end up having to scan the 10 GB graph many times per query without acceleration structures helping you (like indices), it will be slow. I'd say it's still a DB problem.
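A toy illustration of the point (hypothetical data, not from the paper): even with everything in RAM, a query that rescans the whole dataset does far more work than one backed by an acceleration structure.

```python
# Toy illustration: repeated lookups over in-memory data, with and
# without an index. Everything fits in RAM, yet the full-scan version
# does O(n) work per query while the indexed one does O(1).
rows = [(i, f"node-{i}") for i in range(100_000)]

def scan_lookup(key):
    for k, v in rows:            # full scan: O(n) per query
        if k == key:
            return v

index = {k: v for k, v in rows}  # built once: the "acceleration structure"

def indexed_lookup(key):
    return index[key]            # hash lookup: O(1) per query

assert scan_lookup(99_999) == indexed_lookup(99_999) == "node-99999"
```

Multiply the scan cost by many lookups per query and a 10 GB graph gets slow fast, index or no RAM.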
I'm guessing that, when the paper's authors mentioned "hundreds of thousands of cores", they didn't have 10GB of data in mind. That works out to less than a typical L1 cache's worth of data per core.
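The back-of-the-envelope arithmetic (assuming 200,000 as a stand-in for "hundreds of thousands"):

```python
# 10 GiB spread over an assumed 200,000 cores leaves each core with
# roughly a typical 32-64 KiB L1 data cache's worth of data.
data_bytes = 10 * 1024**3        # 10 GiB
cores = 200_000                  # assumed core count
per_core_kib = data_bytes / cores / 1024
print(f"{per_core_kib:.1f} KiB per core")  # ~52.4 KiB
```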
This is really common across article-comment platforms; is anyone interested in discussing how to incentivise comment sections that have read the paper?
This isn't a graph database like neo4j. This is a graph database like I hoped neo4j would be. It's not about having an easier time working with schemas. It's about analyzing graphs that are too big to fit in RAM. Transaction analysis for banks, traffic analysis of roads, failure resilience of utility networks, etc.
In these kinds of workloads you quickly run into performance bottlenecks. Even in-memory analyses need care to avoid complete pointer-chasing slowdowns.
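One common form that care takes (a hypothetical sketch, not the paper's design): laying adjacency out as contiguous CSR arrays instead of per-node pointer structures, so neighbour scans read memory sequentially.

```python
# Hypothetical sketch: a CSR (compressed sparse row) adjacency layout.
# Each node's neighbours sit in one contiguous slice of a flat array,
# so traversals scan memory sequentially instead of chasing heap pointers.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]  # toy directed graph
n = 3

# Build CSR: offsets[i]..offsets[i+1] indexes node i's neighbours.
counts = [0] * n
for src, _ in edges:
    counts[src] += 1
offsets = [0] * (n + 1)
for i in range(n):
    offsets[i + 1] = offsets[i] + counts[i]
neighbours = [0] * len(edges)
fill = offsets[:-1].copy()
for src, dst in edges:
    neighbours[fill[src]] = dst
    fill[src] += 1

def out_neighbours(v):
    return neighbours[offsets[v]:offsets[v + 1]]

print(out_neighbours(0))  # [1, 2]
```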
I do still hope this is fast on, like, a single-CPU, 32-core, 64GB system with an SSD. But if this takes a cluster to be useful, then I will still love it.
But the 5% of places where that kind of scale is needed are the ones paying the top 1% salary band, so this is the content distributed systems engineers like to read about and work on.
Yeah, but the hacker fascination is what drives progress. You could have made the same type of argument about ML, and we would have been content with MNIST.
One of the simpler supported backends for our Modality product (https://auxon.io/products/modality) is built using SQLite. It results in a data model that's a special case of a DAG, for modeling big piles of causally correlated events from piles and piles of distributed components in "system of systems" use cases. The scaling limiter is almost always how efficiently the traces & telemetry can be exfiltrated from the systems under test/observation, long before how fast the ingest path can actually record things becomes a problem.
That said, I do love me some RDMA action. 10 years ago I was fiddling with getting Erlang clustering working via RDMA on a little 5 node Infiniband cluster. To mixed results.
I agree with your sentiment but I suppose you're considering the wrong statistics. Instead you should consider:
- how many jobs have interviews that necessitate knowing how to handle extreme scale
- proportion of jobs (not companies) requiring extreme scale - the fact that non-extreme scales are the long tail doesn't mean it's a fat tail
- proportion of buyers/potential users that walk away from the inability to handle extreme scale
Only one anecdote, but I found out a while after starting at my current job that directly questioning, during a technical interview, the extent to which scale-out was actually needed to solve the problem is what made me stand out from the rest of the crowd and landed me the job. Being able to constructively challenge assumptions is an incredibly valuable job skill, and good managers know that.
Counter-anecdote: directly questioning "scale-out fantasies" has contributed to my early departure from a handful of jobs and contracts. One place was obsessed with getting everything into AWS auto-scaling groups when the problem was actually that they were running on MySQL with a godawful schema, dumbass session management, and horrific queries that we weren't allowed to fix because they were "migrating to node microservices anyway" (pretty sure that still hasn't happened years later.)
> Being able to constructively challenge assumptions is an incredibly valuable job skill
The best people challenge bad assumptions, and the worst bosses get mad.
Had one boss get mad that I reduced the database footprint by 94% - why?
Because he wrote the initial implementation and refused to believe that his baby, which cost so much space because of how awesome it was, could fit into 5GB.
But challenging the status quo has gotten me to where I am, so I won't stop anytime soon :)
I get that angle, but I also see orgs capturing too much data. What's the use case for it? Not sure, but "if we ever do need it, we'll have it" is the typical answer.
How about some succinct data structures and delta encoding for modern databases [1]? Succinct data structures are a family of data structures that are close in size to the information-theoretic minimum representation (while still being queryable).
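A minimal sketch of the "queryable while near-minimal" idea (illustrative only, far from a production-grade succinct structure): a bit vector that answers rank queries by storing only small cumulative popcounts alongside the raw bits.

```python
# Hypothetical sketch of a rank-supporting bit vector: besides the bits
# themselves we keep one cumulative popcount per 64-bit block, a small
# overhead that turns rank from an O(n) scan into an O(block) query.
class RankBitVector:
    def __init__(self, bits):
        self.bits = bits           # list of 0/1
        self.block = 64
        self.block_ranks = []      # popcount of everything before each block
        total = 0
        for i in range(0, len(bits), self.block):
            self.block_ranks.append(total)
            total += sum(bits[i:i + self.block])

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.block
        return self.block_ranks[b] + sum(self.bits[b * self.block:i])

bv = RankBitVector([1, 0, 1, 1, 0] * 40)  # 200 bits, 3 ones per 5 bits
print(bv.rank1(10))  # 6
```

Real implementations use machine-word popcounts and two-level rank directories, but the space/query trade-off is the same shape.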
There is a Prolog-based database that takes away many of the foundational problems in implementing a rules system or other logic-programming artifact. It is called TerminusDB [1].
'LoRA is an incredibly powerful technique we should probably be paying more attention to
LoRA works by representing model updates as low-rank factorizations, which reduces the size of the update matrices by a factor of up to several thousand. This allows model fine-tuning at a fraction of the cost and time. Being able to personalize a language model in a few hours on consumer hardware is a big deal, particularly for aspirations that involve incorporating new and diverse knowledge in near real-time. The fact that this technology exists is underexploited inside Google, even though it directly impacts some of our most ambitious projects.' [1]
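The parameter arithmetic behind that reduction claim can be sketched like this (the layer width and rank are hypothetical examples, not from the quoted memo):

```python
# Illustrative LoRA arithmetic: a dense update to a d x d weight matrix
# is replaced by a rank-r factorization delta_W = B @ A, where B is
# d x r and A is r x d.
d, r = 4096, 8                      # assumed layer width and LoRA rank
full_params = d * d                 # dense update: 16,777,216 params
lora_params = d * r + r * d         # low-rank update: 65,536 params
print(full_params // lora_params)   # 256x fewer trainable parameters
```

At rank 1 the same arithmetic gives a ~2048x reduction for this layer width, which is where figures like "up to several thousand" come from.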
There are already some open-source alternatives to Datomic. TerminusDB (https://github.com/terminusdb/terminusdb), for example, is implemented in Prolog (and Rust), so it has the Datalog-variant query power that makes Datomic so powerful. If you want free as in speech (though I love free beer).
XTDB is also worth mentioning, especially since they’re on the HN front page with a v2 early access announcement. There are differences in how they do things. I can’t meaningfully comment on business usage of either or what the trade-offs between them are.
There are already a few open-source alternatives that run Datalog-variant query languages. I'd point the curious towards TerminusDB [1] and TypeDB [2]. TerminusDB is implemented in Prolog (and Rust), so it's an alternative with Datalog at its heart.
So the only way to not have your code used as a learning source for Copilot is to move all your code off GitHub? And even in that case, your past/existing code would be source material?