Ratio/quantity is important, but quality is even more so.
In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.
The issue is that when there's a difference in training data quality between languages, LLMs likely associate that difference with the languages if not explicitly compensated for.
IMO it would be far more impactful to generate and publish high-quality data in minority languages for current model trainers than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings in those languages.
It requires tax increases, and the average earner's UBI will typically balance out the tax increase, meaning they don't directly profit.
UBI isn't about giving everyone free money. It's about giving everyone a safety net, so that they can take bigger economic risks and aren't pushed into crime or bullshit work.
The upper half of society will only see the indirect benefits, like having greater employment/investment choices due to more entrepreneurialism.
(1) Retirees with skills don't suddenly decide to become entrepreneurs when they reach 65.
(2) people on the dole don't suddenly become entrepreneurs. We even used to have a specific programme in New Zealand for the unemployed to start their own business . . . I'm fairly sure it didn't work.
(3) mothers on the DPB get a good whack of money even with kids that don't need huge time investment. It is rare to see them do anything more entrepreneurial than an under-the-table job.
> It requires tax increases, and the average earner's UBI will typically balance out the tax increase, meaning they don't directly profit.
A good portion of my salary is already taken in tax, and the government wastes it. I've seen the waste first hand when contracting for both local and national government. I was so disgusted by it that I have made every effort to avoid working with them since.
I've also seen this waste happen in large charities and ossified corporations. The former also disgusted me, as I knew they would simply piss away a few thousand on complete BS, money that took a whole village to collect and never went towards the charity's stated purpose. As a result I don't donate to any charities that aren't local.
Every time someone suggests a tax increase, I know for a fact they haven't seen the waste happen first hand.
> UBI isn't about giving everyone free money. It's about giving everyone a safety net, so that they can take bigger economic risks and aren't pushed into crime or bullshit work.
Giving everyone a safety net will require giving people money that is taken from others. To the people who benefit, it will be seen as "free", will become "expected", and won't be treated as a safety net.
Being a responsible adult is about reducing the amount of risk you are taking, not increasing it.
So what you will be doing is teaching people to essentially gamble, and people did something similar during COVID. Some took their cheques and put them into crypto, meme stocks or whatever. Some won big; most didn't.
I've met people in my local area that have lost huge amounts of money on risky investments, everything from property developments, to bitcoin. Creating an incentive for risk taking without the consequences is actually reckless, a massive moral hazard and will simply create perverse incentives.
> The upper half of society will only see the indirect benefits, like having greater employment/investment choices due to more entrepreneurialism.
You will be taxing those people more and they will have less to invest. The reason why many people invest is because they have disposable income that they can afford to risk.
By taxing people more (which you admit would have to happen), they will have less disposable income and will be inclined to invest less as a result.
That discussion also makes me worry that they may try to use LLMs or LLM-based metrics to measure the size of the gap as a proxy for the value of the content.
The landlord of the marketplace should probably not dabble in the appraisal of products, whether for factuality or value.
As a content consumer, I'm also hoping to be part of the ecosystem. I already use Patreon a lot as "AdBlock absolution", but it doesn't fix the market dynamics. Major content platforms tend to stagnate or worsen over time, because they prefer to sell impressions to advertisers than a good product to consumers.
What makes you think the secrets are small enough to fit inside people's heads, and aren't like a huge codebase of data scraping and filtering pipelines, or a DB of manual labels?
Please consider also describing the business model on the website, even if it's hidden away in a FAQ. I have so much subscription fatigue now that I just don't try things out if a subscription is inevitable. I'm happy to pay for good products, just not happy to be forced to pay a fixed rate for continued access even as my usage dwindles.
If you are thinking of adding a one-off-donation-style purchase method, consider giving annual reminders to renew it. At least in my case, I'm not unwilling to pay repeatedly if development continues, just unwilling to make an upfront ongoing commitment.
I don't think retrofitting existing languages/ecosystems is necessarily a lost cause. Static enforcement requires rewrites, but runtime enforcement gets you most of the benefit at a much lower cost.
As long as all library code is compiled/run from source, a compiler/runtime can replace system calls with wrappers that check caller-specific permissions, and it can refuse to compile or insert runtime panics if the language's escape hatches would be used. It can be as safe as the language is safe, so long as you're ok with panics when the rules are broken.
It'd take some work to document and distribute capability profiles for libraries that don't care to support it, but a similar effort was proven possible with TypeScript.
I actually started working on a tool like that for fun. At each syscall it would walk back up the stack, check which shared object each function came from, and compare that against a policy until it found something explicitly allowed or denied. I don't think it would necessarily be bulletproof enough to trust fully, but it was fun to write.
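The caller-sensitive check described above can be sketched in a few lines. This is a toy model, not the actual tool: it uses Python's `inspect` module to walk the call stack at a guarded operation and match each caller's module name against a policy, standing in for resolving return addresses to shared objects at a real syscall boundary. The policy contents and names (`POLICY`, `guarded_open`, `"trusted_mod"`) are all made up for illustration.

```python
import inspect

# Toy policy: map module names to the operations they may perform.
# Real enforcement would resolve stack return addresses to shared
# objects at the syscall boundary; module names are a stand-in here.
POLICY = {
    "trusted_mod": {"open_file"},
}

def check_caller(operation):
    """Walk up the call stack; allow if any frame's module is permitted."""
    for frame_info in inspect.stack()[1:]:
        module = frame_info.frame.f_globals.get("__name__", "")
        if operation in POLICY.get(module, set()):
            return True
    return False

def guarded_open(path):
    # The wrapper a compiler/runtime would substitute for the raw call.
    if not check_caller("open_file"):
        raise PermissionError(f"caller not permitted to open {path}")
    return open(path)
```

A real implementation would check the innermost policy-bearing frame rather than any frame (otherwise a permitted caller further up the stack launders permissions for untrusted code below it), which is exactly the kind of subtlety that makes "bulletproof" hard.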
The last major innovation as a product was PWA support starting in 2016.
Browsers used to try new ideas like RSS, widgets, shared and social browser sessions. Interfaces to facilitate low-friction integration with the rest of your life, and to multiplex data sources so that it's not a hassle to have many providers for [news, entertainment, social] experiences.
Likely no coincidence that this innovation languished once monopolies started pumping money into the ecosystem.
Wholeheartedly agree. Opera, before it pivoted to Chromium and was sold to Chinese investors, was I think the apex example of this. I will never stop singing the praises of Opera Unite, which was a brilliant and potentially revolutionary way of leveraging the browser for something that could have been the basis of a peer-to-peer web and social connection.
> It's interesting that there are no reasoning models yet
This may be merely a naming distinction, leaving the name open for a future release based on their recent research such as coconut[1]. They did RL post-training, and when fed logic problems it appears to do significant amounts of step-by-step thinking[2]. It seems it just doesn't wrap it in <thinking> tags.
> Or is Behemoth just going through post-training that takes longer than post-training the distilled versions?
This is likely the main explanation. RL fine-tuning repeatedly alternates between inference, to generate and score responses, and training on those responses. In inference mode they can parallelize across responses, but each response is still generated one token at a time. Likely 5+ minutes per iteration if they're aiming for 10k+ token CoTs like other reasoning models.
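The "5+ minutes per iteration" figure follows from a back-of-envelope calculation. All numbers below are illustrative assumptions, not anything Meta has published; the point is that wall-clock time per iteration is bounded by the sequential decode of the longest response, which parallelism across responses can't hide.

```python
# Rough, illustrative numbers; none are from Meta.
tokens_per_cot = 10_000          # assumed target CoT length
tokens_per_second = 30           # assumed sequential decode speed per response
responses_per_iteration = 8_192  # batch generated in parallel across the cluster

# Parallelism hides the batch size but not the response length:
# every response still decodes one token at a time.
seconds_per_iteration = tokens_per_cot / tokens_per_second
minutes_per_iteration = seconds_per_iteration / 60

total_tokens = tokens_per_cot * responses_per_iteration  # decode work per iteration
print(f"{minutes_per_iteration:.1f} minutes per RL iteration")  # prints "5.6 minutes per RL iteration"
```

Multiply that by thousands of RL iterations and the post-training schedule for a huge model stretches to weeks even before scoring and gradient updates are counted.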
There's also likely an element of strategy involved. We've already seen OpenAI hold back releases to time them to undermine competitors' releases (see o3-mini's release date & pricing vs R1's). Meta probably wants to keep that option open.