I'm not sure of the exact statistic, but something like 95% of all production databases are less than 10GB. There seems to be a 'FAANG hacker' fascination with 'extreme scale', which probably comes from seeing the challenges faced by the handful of organizations working at that level. Most of the time, graph database users want a DB that allows them to flexibly model their data and run complex queries - that's why they're there. They probably also want some sort of interoperability. If you can do that well for 10GB, that is holy grail enough. We certainly found this while developing the graph database TerminusDB [1]: most users have smaller production DBs, make lighter use of bells-and-whistles features, and really want things like easy schema evolution.
This research paper is talking about performance whilst you're talking about scalability.
Those are related but are distinct from each other.
And sure, about 95% of companies would have their needs met by a simpler system, but that still leaves a lot of companies that won't. And for those of us in, say, finance doing customer/fraud analytics, I would welcome all the performance I can get.
> This research paper is talking about performance whilst you're talking about scalability. Those are related but are distinct from each other.
The paper has "Scale to Hundreds of Thousands of Cores" in the title. I have not yet read the paper but it seems unlikely it doesn't talk about scalability.
If your data is small enough to easily fit in RAM, you kind of can't have that slow a query on it (or at least you're no longer talking about a database problem).
If you end up having to scan the 10 GB graph many times per query without acceleration structures helping you (like indices), it will be slow. I'd say it's still a DB problem.
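A toy illustration of the point (hypothetical data, not from the paper): even with everything in RAM, a query that rescans the whole dataset does far more work than one backed by an acceleration structure.

```python
# Toy illustration: repeated lookups over in-memory data, with and
# without an index. Everything fits in RAM, yet the full-scan version
# does O(n) work per query while the indexed one does O(1).
rows = [(i, f"node-{i}") for i in range(100_000)]

def scan_lookup(key):
    for k, v in rows:            # full scan: O(n) per query
        if k == key:
            return v

index = {k: v for k, v in rows}  # built once: the "acceleration structure"

def indexed_lookup(key):
    return index[key]            # hash lookup: O(1) per query

assert scan_lookup(99_999) == indexed_lookup(99_999) == "node-99999"
```

Multiply the scan cost by many lookups per query and a 10 GB graph gets slow fast, index or no RAM.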
I'm guessing that, when the paper's authors mentioned "hundreds of thousands of cores", they didn't have 10GB of data in mind. That works out to less than a typical L1 cache's worth of data per core.
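The back-of-the-envelope arithmetic (assuming 200,000 as a stand-in for "hundreds of thousands"):

```python
# 10 GiB spread over an assumed 200,000 cores leaves each core with
# roughly a typical 32-64 KiB L1 data cache's worth of data.
data_bytes = 10 * 1024**3        # 10 GiB
cores = 200_000                  # assumed core count
per_core_kib = data_bytes / cores / 1024
print(f"{per_core_kib:.1f} KiB per core")  # ~52.4 KiB
```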
This is really common across article-comment platforms; is anyone interested in discussing how to incentivise comment sections that have read the paper?
This isn't a graph database like neo4j. This is a graph database like I hoped neo4j would be. It's not about having an easier time working with schemas. It's about analyzing graphs that are too big to fit in RAM. Transaction analysis for banks, traffic analysis of roads, failure resilience of utility networks, etc.
In these kinds of workloads you quickly run into performance bottlenecks. Even in-memory analyses need care to avoid complete pointer-chasing slowdowns.
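One common form that care takes (a hypothetical sketch, not the paper's design): laying adjacency out as contiguous CSR arrays instead of per-node pointer structures, so neighbour scans read memory sequentially.

```python
# Hypothetical sketch: a CSR (compressed sparse row) adjacency layout.
# Each node's neighbours sit in one contiguous slice of a flat array,
# so traversals scan memory sequentially instead of chasing heap pointers.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]  # toy directed graph
n = 3

# Build CSR: offsets[i]..offsets[i+1] indexes node i's neighbours.
counts = [0] * n
for src, _ in edges:
    counts[src] += 1
offsets = [0] * (n + 1)
for i in range(n):
    offsets[i + 1] = offsets[i] + counts[i]
neighbours = [0] * len(edges)
fill = offsets[:-1].copy()
for src, dst in edges:
    neighbours[fill[src]] = dst
    fill[src] += 1

def out_neighbours(v):
    return neighbours[offsets[v]:offsets[v + 1]]

print(out_neighbours(0))  # [1, 2]
```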
I do still hope this is fast on, like, a single-CPU, 32-core, 64GB system with an SSD. But if this takes a cluster to be useful, then I will still love it.
But the 5% of places where that kind of scale is needed are the ones paying the top 1% salary band, so this is the content distributed systems engineers like to read about and work on.
Yeah, but the hacker fascination is what drives progress. You could have made the same type of argument about ML, and we would have been content with MNIST.
One of the simpler supported backends for our Modality product (https://auxon.io/products/modality) is built using SQLite. It results in a data model that's a special case of a DAG, for modeling big piles of causally correlated events from piles and piles of distributed components in "system of systems" use cases. The scaling limiter is almost always how efficiently the traces & telemetry can be exfiltrated from the systems under test/observation, long before how fast the ingest path can actually record things becomes a problem.
That said, I do love me some RDMA action. 10 years ago I was fiddling with getting Erlang clustering working via RDMA on a little 5 node Infiniband cluster. To mixed results.
I agree with your sentiment but I suppose you're considering the wrong statistics. Instead you should consider:
- how many jobs have interviews that necessitate knowing how to handle extreme scale
- proportion of jobs (not companies) requiring extreme scale - the fact that non-extreme scales are the long tail doesn't mean it's a fat tail
- proportion of buyers/potential users that walk away from the inability to handle extreme scale
Only one anecdote, but I found out a while after starting at my current job that directly questioning, during a technical interview, the extent to which scale-out was actually needed to solve the problem is what made me stand out from the rest of the crowd and landed me the job. Being able to constructively challenge assumptions is an incredibly valuable job skill, and good managers know that.
Counter-anecdote: directly questioning "scale-out fantasies" has contributed to my early departure from a handful of jobs and contracts. One place was obsessed with getting everything into AWS auto-scaling groups when the problem was actually that they were running on MySQL with a godawful schema, dumbass session management, and horrific queries that we weren't allowed to fix because they were "migrating to node microservices anyway" (pretty sure that still hasn't happened years later.)
> Being able to constructively challenge assumptions is an incredibly valuable job skill
The best people challenge bad assumptions, and the worst bosses get mad.
Had one boss get mad that I reduced the database footprint by 94% - why?
Because he wrote the initial implementation and refused to believe that his baby, which cost so much space because of how awesome it was, could fit into 5GB.
But challenging the status quo has gotten me to where I am, so I won't stop anytime soon :)
I get that angle, but I also see orgs capturing too much data. What's the use case for it? Not sure, but "if we ever do need it, we'll have it" is the typical answer.
How about some succinct data structures and delta encoding for modern databases [1]? Succinct data structures are a family of data structures that are close in size to the information-theoretic minimum representation (while still being queryable).
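A minimal sketch of the "queryable while near-minimal" idea (illustrative only, far from a production-grade succinct structure): a bit vector that answers rank queries by storing only small cumulative popcounts alongside the raw bits.

```python
# Hypothetical sketch of a rank-supporting bit vector: besides the bits
# themselves we keep one cumulative popcount per 64-bit block, a small
# overhead that turns rank from an O(n) scan into an O(block) query.
class RankBitVector:
    def __init__(self, bits):
        self.bits = bits           # list of 0/1
        self.block = 64
        self.block_ranks = []      # popcount of everything before each block
        total = 0
        for i in range(0, len(bits), self.block):
            self.block_ranks.append(total)
            total += sum(bits[i:i + self.block])

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.block
        return self.block_ranks[b] + sum(self.bits[b * self.block:i])

bv = RankBitVector([1, 0, 1, 1, 0] * 40)  # 200 bits, 3 ones per 5 bits
print(bv.rank1(10))  # 6
```

Real implementations use machine-word popcounts and two-level rank directories, but the space/query trade-off is the same shape.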
There is a Prolog-based database that takes away many of the foundational problems in implementing a rules system or other logic-programming artifact. It is called TerminusDB [1].
'LoRA is an incredibly powerful technique we should probably be paying more attention to
LoRA works by representing model updates as low-rank factorizations, which reduces the size of the update matrices by a factor of up to several thousand. This allows model fine-tuning at a fraction of the cost and time. Being able to personalize a language model in a few hours on consumer hardware is a big deal, particularly for aspirations that involve incorporating new and diverse knowledge in near real-time. The fact that this technology exists is underexploited inside Google, even though it directly impacts some of our most ambitious projects.' [1]
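The parameter arithmetic behind that reduction claim can be sketched like this (the layer width and rank are hypothetical examples, not from the quoted memo):

```python
# Illustrative LoRA arithmetic: a dense update to a d x d weight matrix
# is replaced by a rank-r factorization delta_W = B @ A, where B is
# d x r and A is r x d.
d, r = 4096, 8                      # assumed layer width and LoRA rank
full_params = d * d                 # dense update: 16,777,216 params
lora_params = d * r + r * d         # low-rank update: 65,536 params
print(full_params // lora_params)   # 256x fewer trainable parameters
```

At rank 1 the same arithmetic gives a ~2048x reduction for this layer width, which is where figures like "up to several thousand" come from.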
There are already some open-source alternatives to Datomic. TerminusDB (https://github.com/terminusdb/terminusdb), for example, is implemented in Prolog (and Rust), so it has the Datalog-variant query power that makes Datomic so powerful. If you want free as in speech (though I love free beer).
XTDB is also worth mentioning, especially since they’re on the HN front page with a v2 early access announcement. There are differences in how they do things. I can’t meaningfully comment on business usage of either or what the trade-offs between them are.
There are already a few open-source alternatives that run Datalog-variant query languages. I'd point the curious towards TerminusDB [1] and TypeDB [2]. TerminusDB is implemented in Prolog (and Rust), so it's an alternative with Datalog at its heart.
So the only way to not have your code used as a learning source for Copilot is to move all your code off GitHub? And even in that case, your past/existing code would be source material?