What is your exact question? To me it makes sense that you’d not want to use NoSQL if you’re dealing with data that’s already relational, and heavily leveraging features common in relational DBs that may not come out of the box with NoSQL DBs.
They’re saying basically that NoSQL DBs solve a lot of horizontal scaling problems but aren’t a good fit for their highly relational data, is my understanding. Not that they can’t get NoSQL functionality at eg the query level in relational DBs.
My perspective from working both inside and outside of Google:
The external spanner documentation doesn’t seem as good as the internal documentation, in my opinion. Because it’s not generally well known outside of G, they ought to do a better job explaining it and its benefits. It truly is magical technology but you have to be a database nerd to see why.
It’s also pretty expensive, and because you generally need to rewrite your applications to work with it, there is a degree of lock-in. So taking on Spanner is a risky proposition: if prices get hiked or it starts costing more than you want, you’ll have to spend even more time and money migrating off it. Spanner’s advantages over other DBs (trying to “solve” the CAP theorem) then become a curse, because it’s hard to find any other DB that gives you horizontal scaling, ACID, and high availability out of the box, and you might have to solve those problems yourself/redesign the rest of your system.
Personally I would consider using Cloud Spanner, but I wouldn’t bet my business on it.
If you really have that much data and traffic, the $ costs start to add up to multiple engineer comp costs. At that point it’s cheaper to move to something you have good control over.
I.e., sharding at the application layer and connecting to the DB instance replica where the customer data is hosted.
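To make the idea concrete, here's a minimal sketch of what application-layer sharding can look like: hash the tenant id to pick a shard, then connect to that shard's instance. The shard names/DSNs are made up for illustration, and real setups would add a lookup table, rebalancing, etc.

```python
import hashlib

# Hypothetical shard connection strings; in practice these would come
# from config/service discovery, and each might be a regional replica.
SHARD_DSNS = [
    "postgres://db-us-east-0/app",
    "postgres://db-us-east-1/app",
    "postgres://db-eu-west-0/app",
]

def shard_for_tenant(tenant_id: str, num_shards: int = len(SHARD_DSNS)) -> int:
    # Stable hash so the same tenant always maps to the same shard.
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

def dsn_for_tenant(tenant_id: str) -> str:
    return SHARD_DSNS[shard_for_tenant(tenant_id)]
```

The catch, as the thread notes, is that everything Spanner does for you (resharding, cross-shard transactions, global consistency) becomes your problem in this design.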
Depends. The cost may pay for itself but the engineers you have already may have higher ROI things to do. It's also nice to have operational stuff managed for you. Personally I'd be happy to pay extra for the kinds of problems Spanner solves to free myself up to do other things (to a point, ofc).
> sharding at application layer and connecting to the DB instance replica where the customer data is hosted.
Spanner does global consistency/replication. If having good performance per-tenant globally is a concern, this helps a lot, and is hard to implement on your own. It can also ultimately save you money by limiting cross-region traffic.
This is the lump of labor fallacy in reverse isn't it? Those people would eventually be employed by other industries - both the labor and $ formerly going to yachts would be used somewhere else.
I think this is a fair approach when things work well enough that a typical user doesn’t need to worry about whether they’ll trigger some kind of special content/moderation logic. If you shadowban spammers and real users almost never get flagged as spammers, the benefits of being tight-lipped outweigh those of the very few users who get improperly flagged or are just curious.
With some of these models the guardrails are so clumsy and forced that I think almost any typical user will notice them. Because they include outright work-refusal it’s a very frustrating UX to have to “discover” the policy for yourself through trial and error.
And because they’re more about brand management than preventing fraud/bad UX for other users, the failure modes are “someone deliberately engineered a way to get objectionable content generated in spite of our policies.” Obviously some kinds of content are objectionable enough for this to be worth it still, but those are mostly in the porn area - if somebody figures out a way to generate an image that’s just not PC, despite all the safety features, shouldn’t that be on them rather than the provider?
Even tuning the model for political correctness is not the end of the world in my opinion; a lot of LLMs do a perfectly reasonable job for my regular use cases. But with image generators they are going so far as to obviously (there’s no other way that makes sense) insert diversity sub-prompts for some fraction of images, which is simply confusing and amateur. Everybody who uses these products even a little will notice it. It’s also so cautious that even mild stuff gets caught in the filters (I tried the “now make it even more X” thing with “American” and it stopped at one iteration). You’re going to find out the policies anyway because they’re so broad and likely to be encountered while using the product innocently - anything a real non-malicious user is likely to get blocked by should be documented.
I think you are using energy drinks in your example for effect, only because people overrate their caffeine content relative to coffee. Coffee drinkers who have a full tumbler/togo cup in the morning and a sbux or continual refills from the office pot at work can far exceed the caffeine intake of 3 energy drinks. Meanwhile, Red Bulls actually have deceptively little caffeine and are barely worse than just a regular Coca Cola.
It’s very easy to get into the habit of extremely high caffeine intake, speaking from experience. Not only is caffeine in many drinks and extremely well distributed, its psychological effects are pretty mild. But I think the worst part is that people don’t really think about how much caffeine is in the coffee they’re drinking because it’s hard to measure.
For people with high daily intake it’s not so much that they want to get “high” on coffee to enjoy the conversation, as it is that their brain is in acute caffeine withdrawal and they will feel so shitty without coffee that they won’t be able to participate in conversation very well.
The internet is already mostly filled with low quality bullshit though, and GPT-4/Gemini are much better writers than whoever is churning out SEO as we know it now.
It's a lazy argument to imply this 1. invalidates the technological achievement 2. prevents iterative improvement a la the singularity. For one, the Internet itself is not bullshit just because a lot of spammers/hustlers put bad content on it to try to make money. And secondly, you can curate datasets... nothing's stopping researchers from training LLMs on shitty SEO now, and if they wanted to they could curate datasets going forward to try to prevent LLM spam from entering the training sets of future models.
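The curation point above can be sketched very simply: filter documents before training. The heuristics here (boilerplate phrases, low lexical diversity) are purely illustrative, not a real production filter - actual pipelines use classifiers, dedup, and quality scoring.

```python
# Phrases often associated with SEO/LLM filler; illustrative only.
SPAM_PHRASES = (
    "best products of",
    "in this article we will",
    "as an ai language model",
)

def looks_like_spam(text: str) -> bool:
    lower = text.lower()
    if any(phrase in lower for phrase in SPAM_PHRASES):
        return True
    words = lower.split()
    # Very repetitive pages have few unique words relative to length.
    return len(words) > 50 and len(set(words)) / len(words) < 0.3

def curate(docs: list[str]) -> list[str]:
    # Keep only documents that pass the spam check.
    return [doc for doc in docs if not looks_like_spam(doc)]
```

Even a crude filter like this shows why "the training data will be poisoned" isn't a fatal objection: curation is a choice, not a lost cause.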
And finally, people already use reputation/identity/branding and proxies for it as quality filters on the internet. For example, this is an unfamiliar blogger to me and so I entered it with skepticism I wouldn't have with people like Gwern or Lynn Alden. Good writing from people like Gwern and Lynn Alden won't disappear just because LLM content exists on the internet - it just makes reputations and identity (eg to a real human) more important.
You are spot-on (I think) with the point that identity and reputation will become much more important. I hope this will end up with systems that help cultivate and verify reputation. I fear, however, that the identity leg is much easier to tackle.
What does me telling LLMs to write articles for "best vacuum cleaners 2024" and putting it on the Internet have to do with the ability of LLMs to improve themselves? Humans write those kinds of articles for the Internet as it is, and yet humans are the ones designing and improving LLMs now.
The Internet is a pull model. I can go to Gwern's website directly and not care that most other websites have crap on them.
People choose to use push models for content through meta properties, tiktok, and aggregators like reddit and HN, but nothing is forcing them to. If they push enough bad content, people won't keep using them. Already happened with Facebook and Reddit predecessors, probably happening to Reddit now.
It doesn't matter how big the haystack is when you have the ability to go directly to the needle.
I've seen this exact same fallacy happen several times throughout my career, which isn't even very long.
I think in many cases it boils down to some subtype not being identified and evaluated on its own. As in your case, it's especially impactful (and yet, IME, also usually where these kinds of things get improperly prioritized) when it's a user's first impression, or when it forces a user to just sit and wait on the other end, since these are often "special" cases with different logic in your application code.
OTOH sometimes users try weird/wrong/adversarial shit and so their high failure rate is working as intended. But it pollutes your stats such that it can hide real issues with similar symptoms and skew distributions.
Not everybody wants to sell their startup though. Accessing that liquidity without selling can be very very risky. Also, small companies like this can’t command big premiums because too much operational/institutional knowledge is held by the owners, so a lot of that “profit” is really more like a wage for the founder.
I’m in the infancy of a bootstrapped company while witnessing the general enshittification of all these formerly beloved VC-backed companies as they prepare for/adjust to being public companies. I also recall many decent products like Quora (yes, there was a period where it was actually good) going to shit and failing while chasing $$. I feel like the culture is shifting for new founders and a lot of us want to aim for sustainable businesses that deliver real value rather than VC moonshots that say yes to everything that makes them more money.
I think “lifestyle” business carries negative connotations and bootstrapping/roofshots don’t capture this mindset yet. For me personally I guess it’s like, I’d rather have a $100m business that does cool shit and is focused on doing one thing, than be forced into chasing growth at all costs in all directions to get a potentially much bigger exit. If I had such an exit I’d draw it down at a sustainable rate anyway, which ends up not being different from the earnings of a private company of similar value.
I guess with similar luck and results maybe I could be a billionaire instead of a hundred-millionaire, but idk if that really matters, and I’d then get the typical SV ennui while being glad to leave the shitshow I created before it got even shittier. Running something I’m proud of for the long term seems like a more meaningful and rewarding way to spend my time, and it can actually make the world better even if it doesn’t capture as much of that value.
Like, what if the dominant social media company had refused to serve ads and optimize for watch time? What if major cloud providers had a cohesive vision instead of cobbling together every stupid thing a $100mm spend customer wanted? What if YouTube could serve recommendations based on similarity/enjoyment? They could still be major successful companies. But selling your company, taking on investors, and maybe even lending against your equity threatens to destroy that.
News media is in direct competition with big tech companies for advertising. The more eyeballs go towards big tech and not news companies, the less market share and relevancy they have.
Even worse for them, when people want news now they generally go to aggregators, search for it on Google, or get it served up by a Meta property. It used to be that instead people would read the newspaper or go to a news channel on TV. So news media is furious that big tech controls their top of funnel and distribution channels, as consumers typically prefer it that way vs directly seeking out news by going to cnn.com. In some places they’ve pushed link taxes which tech companies strongly criticized for entirely legit reasons/threatened to pull services, which upped the animosity.
Also, because news is monetized through advertising they need stories and narratives that capture people’s interests and attention. Nobody would care about a story like “Google Scholar revolutionized research discovery and accessibility and improved geographical collaboration a billion %” or “Waymo actually works pretty well no complaints” or “most SF residents actually like Waymo”. But controversy like “Waymo ran into something” is more attention grabbing the more they spin it as evil. Additionally, “good thing continues to be good” is not news but “good thing is actually bad” and “recognizable company X did a bad thing” are news. Similarly “fall from grace” “David vs Goliath” and “these people made a lot of money so you should blame them for not having money” are consistently popular narratives people like.
So news media have literally every reason to drag big tech through the mud and pretty much no reason to ever say anything good about them. For sure these companies have problems but you don’t hear about the good things (IMO Meta has made huge improvements in organizational/data security, and their products drive a lot of commerce in developing countries; Amazon warehouses are usually in places where $15-20/hr is actually a huge step up for local inhabitants; big tech is much better than Microsoft and other old school players at fighting unreasonable law enforcement requests) and the bad things are often overplayed/slanted.