Am I able to upload a book and have it answer questions about that book truthfully, in a way that's superior to NotebookLM or similar? Generally most long-context solutions are very poor. Or does the data have to be in a specific format?
To get the outcome you want, RAG (retrieval-augmented generation) would be the way to go, not fine-tuning. Fine-tuning doesn't make the model memorize specific content like a book; it teaches new behaviors or styles. RAG lets the model access and reference the book during inference. Our platform focuses on fine-tuning with structured datasets, so data does need to be in a specific format.
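For a sense of what that means in practice, here's a minimal RAG sketch: chunk the book, embed the chunks, retrieve the most relevant ones for a question, and prepend them to the prompt. Everything here is illustrative; the toy hashing embedder stands in for a real embedding model, and `book.txt` and the question are made up:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy bag-of-words hashing embedder; swap in a real embedding model.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Overlapping character windows; real systems chunk on sentences/sections.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank chunks by cosine similarity to the query (vectors are unit-norm).
    q = embed(query)
    return sorted(chunks, key=lambda c: -float(q @ embed(c)))[:k]

question = "Who betrays the protagonist?"
book = open("book.txt").read()                      # the uploaded book
context = "\n---\n".join(retrieve(question, chunk(book)))
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
# `prompt` then goes to whatever LLM you're calling.
```

Note that none of this touches the model's weights; the book stays in the prompt.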
The magic behind NotebookLM can't be replicated with fine-tuning alone. It's all about the workflow, from the chunking strategy to the retrieval step and so on.
For a defined, specific use case it's certainly possible to beat their performance, but things get harder when you try to build a general solution.
To answer your question: the format of the data depends entirely on the use case and how many examples you have. The more examples you have, the more flexible you can be.
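As a concrete illustration, many fine-tuning platforms expect chat-style JSONL records along these lines (this is a generic example, not necessarily our exact schema, so check the docs for the required fields):

```python
import json

# One chat-style training example; a real dataset needs hundreds or more.
example = {
    "messages": [
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")   # one JSON object per line
```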
Nobody is taking Sam Altman at his word lol. These ideas about intelligence have been believed for a long time in the tech world, and the guy is just the best at monetizing them. People are pursuing this path because of a general conviction in these ideas themselves. I guess for people like Atlantic writers, Sam Altman is the first place they've encountered them, but it really has nothing to do with Sam Altman.
100% this. He isn't even saying anything novel (I mean that in a good way).
On top of that, the advances in models for language and for physical simulation (protein structure prediction and weather forecasting, for example) have been so rapid and unexpected that even folks who were previously very skeptical of "AI" are believers; it ain't because Sam Altman is up there talking a lot. I went from AI skeptic to zealot in about 18 months, and I'm in good company.
He was literally invited to Congress to speak about AI safety. Sure, perhaps people who have a longer memory of the tech world don't trust him, but that's actually not a lot of people. A lot of people just aren't following tech (like my in-laws).
Did they go with ScyllaDB just because it was compatible with Cassandra? Would it have made sense to use a totally different solution altogether if they hadn't started with that?
Yes, we wanted to migrate all our data stores away from Cassandra due to stability and performance issues. Moving to something that didn't have those issues (or at least had a different set of less severe issues) while also not having to rewrite a bunch of code was a positive.
Did you guys end up redesigning the partitioning scheme to fit within Scylla's recommended partition sizes? I assume the tombstone issue didn't disappear with the move to Scylla, but incremental compaction and/or SCTS might have helped a bunch?
Nope. Didn't change the schema; mainly added read coalescing and used ICS. I think the big thing is that when Scylla is processing a bunch of tombstones, it's able to do so in a way that doesn't choke the whole server. The latest Scylla version can also send back partial/empty pages to the client to limit the amount of work per query.
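For anyone wondering what read coalescing means here (assuming it's the usual sense of deduplicating identical concurrent reads): N callers asking for the same key share a single underlying query instead of issuing N of them. A rough Python sketch of the concept, not the actual implementation, with made-up names:

```python
import asyncio

class ReadCoalescer:
    """Share one in-flight query among all concurrent callers
    asking for the same key, instead of issuing N identical reads."""

    def __init__(self, fetch):
        self._fetch = fetch          # async function: key -> rows
        self._inflight = {}          # key -> asyncio.Task

    async def get(self, key):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.create_task(self._fetch(key))
            self._inflight[key] = task
            # Drop the entry once the read finishes so later calls refetch.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        return await task
```

Hot partitions then get hit once per burst of identical requests instead of once per caller.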
Oh that's pretty neat. Did you just end up being okay with large partitions? I've been really afraid to let partition sizes grow beyond 100k rows even if the rows themselves are tiny but I'm not really sure how much of a real-world performance impact it has. It definitely complicates the data model to break the partitions up though.
It's on the leaderboard: tied with Qwen 2.5 72B and far below the SOTA of o1, Claude Sonnet, and DeepSeek (also below very old models like gpt-4-0314 lol).
There are still baseline filters even after you turn off every filter the UI lets you disable. It still isn't capable of summarizing some YA novels I tried to feed it because of those filters.
It's also comparing prices on Google Cloud, which has its own markup and is a lot more expensive than, say, Runpod: Runpod is $1.64/hr for an A100 on secure cloud, while the A100 on Google is $4.44/hr. So in that context, a 30% price beat is actually a huge loss overall.
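Quick back-of-envelope on those numbers:

```python
google = 4.44               # $/hr for an A100 on Google Cloud (from above)
runpod = 1.64               # $/hr for an A100 on Runpod secure cloud
discounted = google * 0.70  # the "30% price beat" applied to Google's rate
print(f"${discounted:.2f}/hr")               # $3.11/hr
print(f"{discounted / runpod:.1f}x Runpod")  # still ~1.9x Runpod's price
```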