Specifically, I've seen that a common failure mode of the distilled DeepSeek models is that they don't know when they're going in circles. The distillation trains the smaller LLM to interrupt itself with "Wait.", which does induce a certain degree of reasoning, but it's far less powerful than the reasoning of the full model, and the distilled model can fall into cycles of saying "Wait." ad infinitum, second-guessing conclusions it has already reached rather than finding new nuance.
The full model also gets into these infinite cycles. I just tried asking it the old river-crossing boat problem, but with two goats and a cabbage, and it goes on and on forever.
I'd also add that AI code generation in non-formally-verifiable languages, at places with concurrency requirements like Nvidia, might end up creating more hard-to-spot concurrency bugs in the short term than before, as developers become incentivized to tab-complete code without fully thinking through the implications as they type.
But AI code generation for formally verifiable programs? And to assist in writing custom domain-specific verifiers? Now that's the sweet spot. The programming languages of the future, and the metaprogramming libraries on top of them, will look really, really cool.
Have you tried it with Rust, arguably a much lower bar than full formal verification? Sorry to say, but AI can't figure out the borrow checker. I'd speculate it does poorly on a lot of things that are implicit in the code, properties of the overall structure rather than of the incremental text.
There is a podcast and a Black Hat (I think) video about Nvidia choosing SPARK over Rust: not because of formal verification at all, but because it is a more mature option and offers better security even without any formal verification. It isn't mentioned there, but Ada is also far better at modelling registers or even network packets.
I might consider AI if it utilised SPARK's GNATprove, but I wouldn't use AI otherwise.
As an example of packet modelling, here's a declaration of a complex packet that is a bit-packed tagged union. I don't think many other languages make such packets so easy to declare: https://gist.github.com/liampwll/abaaa1f84827a1d81bcdb2f5f17...
Maybe better performance can be achieved with specialized models. There are some that were able to solve mathematical olympiad problems, e.g. AlphaProof.
> Sorry to say, but AI can't figure out the borrow checker. I'd speculate it does poorly on a lot of things that are implicit in the code, properties of the overall structure rather than of the incremental text.
I don’t really use AI tools, but in the past few weeks I’ve tried them with Rust, and while they had problems, borrow-checking errors were never among them.
I fully agree that LLMs don’t “understand,” but people also really oversell the amount of thinking needed to satisfy borrowing in a lot of cases.
I had a production bug (caught in canary) introduced into Google Play in just this manner. The AI code-review assist suggested switching a method to a more modern alternative, and both the author and the code reviewer thought the AI suggestion was a good one. But the new alternative was documented as not being safe to use in the onCreate() method of an Android Activity, and the code in question, while not in onCreate() itself, was in a method transitively called by onCreate() in a different Activity. Boom goes an unrelated Activity. It would've been caught trivially by an expressive type system, but the LLM doesn't know that.
I tested o3-mini yesterday, having it verify a bit-hack for "vectorizing" 8x8-bit integer addition using a single 64-bit int value[0] (roughly the trick sketched below). Suffice it to say, I am not impressed. I asked it to give me a counterexample that would make the function fail, or to tell me that it works if it does. It mentioned problems regarding endianness that weren't present, and it mentioned carries that would spill over which couldn't happen. I gave it chances to provide counterexamples, but the counterexamples it gave didn't actually fail.
Only after telling it that I had tested the code and that it works did it somewhat accept that the solution was correct.
I think a deterministic, unambiguous process is a lot more valuable for formal verification.
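For reference, the bit-hack in question is presumably the classic SWAR byte-wise add. This is an assumed reconstruction (the actual code behind [0] isn't shown here), written as a TypeScript sketch with BigInt standing in for the 64-bit value:

    // Add eight packed 8-bit lanes inside one 64-bit value, with no carry
    // leaking between lanes. Assumed reconstruction, not the tested code.
    const LOW7 = 0x7f7f7f7f7f7f7f7fn; // low 7 bits of every byte
    const HIGH = 0x8080808080808080n; // top bit of every byte

    function addPackedBytes(a: bigint, b: bigint): bigint {
      // Summing only the low 7 bits of each byte means a carry out of bit 6
      // stays inside its own byte, so lanes cannot contaminate each other.
      const low = (a & LOW7) + (b & LOW7);
      // The top bit of each byte adds without carry (mod 256), so XOR restores it.
      return low ^ ((a ^ b) & HIGH);
    }

Nothing in it depends on endianness, which is exactly the kind of phantom issue the model raised.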
A limitation I see with AI for coding is that your problem must be mainstream to get decent results.
In my experience, if I ask it to do web things in PHP or Java, data things in Python, and so on, it gives a good enough result. If I ask it a PostGIS question, I get an answer with hallucinated APIs, bugs, and code that, even when it works, doesn't do what I want.
I suspect the Ada/SPARK world and the formal verification world are too small to train the AI decently.
> A limitation I see with AI for coding is that your problem must be mainstream to get decent results.
Or to phrase it another way: there must be examples of the technique that you want to "generate" in the training set, otherwise the horrifically overfitted model cannot work.
Probably a little bit too cynical but this is my experience asking LLMs "unusual" questions.
No, that's about where we are. Today's LLM coding is basically automated Stack Overflow copy/paste - not too bad for anything simple and mainstream, but unable to reason about code from first principles, and fond of making up random shit that looks good but doesn't work.
I tried asking an LLM for a full code snippet for the first time yesterday (before that, I'd only been using them as search engines that are better than Google, which isn't saying much).
It produced code that compiled but failed to do anything, because the third API call it generated returned an error instead of doing what the LLM said it would do.
I didn't spend time figuring out what was wrong because we already had a library that did the same thing; I was mostly testing what the LLM could do.
I thought OP was saying the AI can generate, from your non-verifiable code, code that can then be formally verified, not that the AI itself is doing the verifying? I could be wrong, of course.
Perhaps I misread it then. I think that if you're at the point of using formal verification, you're likely to be very meticulous in whatever you're writing, so an AI would be used quite carefully.
> as developers become incentivized to tab-complete code without fully thinking through the implications as they type.
Is this true? There should always be a QA process, and you should always review carefully when committing. The writing process doesn't influence that unless you're rubber-stamping whole swaths of code you don't understand during review.
https://techstrongitsm.com/features/turn-rivers-4-4-billion-... provides more context here: "Turn/River, which specializes in optimizing software companies and cloud-based software-as-service companies, already has investments in cloud-based IT company Paessler and cybersecurity company Tufin."
More generally, this is a leveraged buyout: the acquiring firm uses debt from banks and other sources to pay for most of the acquisition. That debt then becomes the responsibility of the acquired company to service - often pushing the company towards cost-cutting measures and an emphasis on harvesting short-term value from accounts. This is sometimes done in an effective and customer-centric way... but that's not always the case. It's hard to know what the outcome will be here.
Paessler was purchased in May 2024, and by July 2024 they were moving to a subscription rather than perpetual licensing model, which came with a 100%-ish price increase.
Start the process for a SolarWinds alternative now.
> More concretely, the current average value life-time of a cookie is €2.52 or $2.7 [58]. Given that there have been at least 329 billion reCAPTCHAv2 sessions, which created tracking cookies, that would put the estimated value of those cookies at $888 billion dollars.
In practice, many of these cookies will have already been placed by other Google services on the site in question, with how ubiquitous Google's ad and analytics products are. And it's unclear whether Google uses the _GRECAPTCHA cookies for purposes other than the CAPTCHA itself (in the places where this isn't regulated).
But reCAPTCHA does give Google the ability to run scripts that fundamentally can't be ad-blocked without breaking site functionality, and it's an effective foot in the door if Google ever wanted to use it more broadly. It's absolutely something to be aware of.
There is a valid criticism that when you rely heavily on synthetic outputs, you bring along the precursor model's biases and assumptions without fully knowing the limitations of the data set the precursor model was trained on, as well as intentional adjustments made by the designers of the precursor model to favor certain geopolitical goals.
But that's not the criticism that I'm often seeing; it's more that there's an "unfair" amount of press coverage towards new models that rely, in the critics' views, more on distillation than on "true" innovation.
It's worth noting that there are many parties with significant motivation to build public sympathy that only "true" innovation should be valued, and it is only their highly-valued investments that can uniquely execute in that space. Cutting-edge models built in caves with a box of their scraps are counter to that narrative. It's worth considering https://paulgraham.com/submarine.html in this context, and understanding whether it is truly "everyone" that is critical in this way.
Side note about this (great) PG article: its conclusion is that readers are leaving print media to come read online blogs because online content is "more honest" and less formulaic.
After 2 years of widespread GPT slop at the top of search engine results, we've definitely come full circle.
Having been an avid net user since the early 90s, I can’t think of a time where that assertion wasn’t specious. In 2005— the year Gmail debuted and people started using the term “web 2.0”— most of the content on the net was still from traditional media sources— PR garbage and all. Most blogs were still people just rattling off their opinions, which were more likely based on the available content than on their own high-quality research. And lack of oversight is a double-edged sword: sure, you might have been less likely to get pure unfiltered marketing dreck, but you were way more likely to get straight-up bullshit, which is a different, but serious, problem. I think he was trying to champion the idealistic anti-establishment soul of the early net despite it essentially being an anachronism, even in 2005.
At the risk of wading into politics - consider a legal environment, in any country, where laws become increasingly strict, but where prosecutorial discretion, pardon powers, and a justice system designed to allow well-resourced law firms to delay cases indefinitely, are all transparently used for political purposes. Such an environment could easily exhibit a feedback loop that allows justice to be arbitrary and opposition voices to be silenced.
I'll refrain from value judgments on the above - but for heaven's sake, we're on a site called "Hacker News." We should understand that a machine like this could turn on any one of us in an instant for any reason.
Part of this is that YouTube makes this viable only for creators whose inbound viewers are likely to stay to watch a majority of the content; otherwise, the algorithm penalizes your content for every "bounce." A comedy short that'll attract people who like comedy shorts and will be over before many people bounce? A long-form science documentary that's likely only going to be clicked by someone who wants to watch a long-form science documentary? Both meet this criterion. But any kind of traditional filmmaking with longer character arcs will be penalized, and that's a really hard thing to see from your creation's primary distribution channel.
Curious about this approach! It looks like you're connecting to https://livekit.io/ under the hood - do you have custom server side components intercepting the video stream and doing video analysis & lip movement patching there?
I'm not exactly sure what the developer experience of this would be, but I expect it would make Portals significantly more powerful and able to be hot-swapped into different containers without disrupting either their internal state or even things like loaded iFrames. I expect others will be experimenting heavily with this now that there's platform support.
I'm the developer of react-reverse-portal, and this will definitely be a real help!
Iframes in particular have always been a pain point: people often want to reparent them to move things around the DOM, but they always reload completely when you do so, which causes all sorts of issues. This new API says it explicitly _doesn't_ reload iframes, so we'll be able to drop that caveat immediately (for supporting browsers, but hopefully Safari/FF will follow suit soon). Looks great!
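For the curious, here's a minimal sketch of how a portal library might adopt it, feature-detected since typings and non-Chromium support are still catching up (reparentPreservingState and portalNode are illustrative names, not the library's API):

    // Move a portal's DOM node to a new parent without resetting its state.
    function reparentPreservingState(portalNode: HTMLElement, newParent: HTMLElement) {
      if (typeof (newParent as any).moveBefore === 'function') {
        // Atomic move: iframes inside the subtree are not reloaded, and state
        // like focus or playing media is preserved.
        (newParent as any).moveBefore(portalNode, null); // null = append at the end
      } else {
        // Fallback: classic remove-and-insert, which still reloads iframes.
        newParent.insertBefore(portalNode, null);
      }
    }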
This will need refactoring before being enabled, and it might not be enabled at all, because it would introduce a perf regression due to the checks needed to branch between moveBefore and insertBefore.
This was brought up multiple times to the HTML spec authors, but no changes were made.
Notably, Recital 12 says the definition "should not cover systems that are based on the rules defined solely by natural persons to automatically execute operations."
It seems to me the key phrase in that definition is "that may exhibit adaptiveness after deployment": if your code doesn't change its own operation without being redeployed, it's not AI under this definition. If adaptation requires a deployment, such as pushing a new version, that's not AI.
I'm not sure what they intended this to apply to. LLM-based systems don't change their own operation (at least, not any more than anything with a database does).
We'll probably have to wait until they fine someone a zillion dollars to figure out what they actually meant.
For LLMs we have "for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments".
For either option you can trace the intention of the definitions to "was it a human coding the decision or not". Did a human decide the branches of the literal or figurative "if"?
The distinction is accountability: determining whether a human decided the outcome, or whether it was decided by an obscure black box in which data is algebraically twisted and turned in a way no human can fully predict today.
Legally that accountability makes all the difference. It's why companies scurry to use AI for all the crap they want to wash their hands of. "Unacceptable risk AI" will probably simply mean "AI where no human accepted the risk", and with it the legal repercussions for the AI's output.
> We'll probably have to wait until they fine someone a zillion dollars to figure out what they actually meant.
In reality, we will wait until someone violates the obvious spirit of this egregiously enough, and ignores enough warnings to that end, to wind up in court (a la the GDPR suits).
This seems pretty clear.
That means you can answer the question of whether they comply with the relevant law in the necessary jurisdiction, and can prove that to the regulator. That should be easy, right? If it's not, maybe it's better to use two regexps instead.
I understand that phrase to have the opposite meaning: Something _can_ adapt its behavior after deployment and still be considered AI under the definition. Of course this aspect is well known as online learning in machine learning terminology.
No offense but this is a good demonstration of a common mistake tech people (especially those used to common law systems like the US) engage in when looking at laws (especially in civil law systems like much of the rest of the world): you're thinking of technicalities, not intent.
If you use Copilot to generate code by essentially just letting it autocomplete the entire code base with little supervision, yeah, sure, that might maybe fall under this law somehow.
If you use Copilot like you would use autocomplete, i.e. by letting it fill in some sections but making step-by-step decisions about whether the code reflects your intent or not, it's not functionally different from having written that code by hand as far as this law is concerned.
But looking at these two options, nobody actually does the first one and then just leaves it at that. Letting an LLM generate code and then shipping it without having a human first reason about and verify it is not by itself a useful or complete process. It's far more likely this is just a part of a process that uses acceptance tests to verify the code and then feeds the results back into the system to generate new code and so on. But if you include this context, it's pretty obvious that this indeed would describe an "AI system" and the fact there's generated code involved is just a red herring.
So no, your gotcha doesn't work. You didn't find a loophole (or anti-loophole?) that brings down the entire legal system.
> Notably, Recital 12 says the definition "should not cover systems that are based on the rules defined solely by natural persons to automatically execute operations."
That's every AI system. It follows rules defined solely by the programmers (who I suppose might sometimes stretch the definition of "natural persons") who made PyTorch or whatever framework.
If the thinking machine rejects my mortgage application, it should be possible to point out which exact rule triggered the rejection. With rules explicitly set by an operator, that's possible. It's also possible to say that the rules in place comply with the law and stay compliant during operation, for example that the system doesn't unintentionally guess that I hold another citizenship based on my surname or postal code.
If the mortgage application evaluation system is deterministic so that the same input always produces the same output then it is easy to answer "Why was my application rejected?".
Just rerun the application with higher income until you get a pass. Then tell the person their application was rejected because income was not at least whatever that passing income amount was.
Maybe also vary some other inputs to see if it is possible to get a pass without raising income as much, and add to the explanation that they could lower the income needed by, say, getting a higher credit score, lowering their outstanding debt, or not changing jobs as often.
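A sketch of that probing loop against a hypothetical black-box approve() function (the field names, and the assumption that approval is monotone in income, are mine):

    interface Application {
      income: number;
      creditScore: number;
      outstandingDebt: number;
    }

    // Binary-search the smallest income that flips a rejection into an approval,
    // holding every other input constant.
    function minimumApprovedIncome(
      approve: (a: Application) => boolean,
      rejected: Application,
      maxIncome = 1_000_000,
    ): number | null {
      if (!approve({ ...rejected, income: maxIncome })) return null; // income alone won't fix it
      let lo = rejected.income;
      let hi = maxIncome;
      while (hi - lo > 100) {
        const mid = (lo + hi) / 2;
        if (approve({ ...rejected, income: mid })) hi = mid;
        else lo = mid;
      }
      return hi; // "you would have passed with an income of at least roughly this"
    }

The same loop can be repeated over credit score or debt to offer the cheaper alternatives mentioned above.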
That tells you how sensitive the model is along its decision boundary to perturbations in the input -- but it isn't a relevant kind of reason why the application was rejected, since this decision boundary wasn't crafted by any person. We're here looking for programs which express prior normative reasoning (e.g., "you should get a loan if...") -- whereas this decision boundary expresses no prior normative reason.
It is simply that, e.g., "on some historical dataset, this boundary most reliably predicted default" -- but this confers no normative reason to accept or reject any individual application (cf. the ecological fallacy). And so, in a very literal way, there is no normative reason the operator of this model has for accepting or rejecting any individual application.
> If the mortgage application evaluation system is deterministic so that the same input always produces the same output then it is easy to answer "Why was my application rejected?".
But banks, at least in my country (central EU), don't have to explain their reasons for rejecting a mortgage application. So why would their automated systems have to?
They don't have to explain to the applicant. They do have to explain to the regulator how exactly they stay compliant with the law.
There is a so-called three-lines system: the operational line does the actual thing (approves or rejects the mortgage), the second line gives the operational line the manual for doing it the right way, and internal audit keeps an eye on whether whatever the first line is doing is actually what the policy says they should be doing.
It's entirely plausible that the operational line is an actual LLM trained on a policy the compliance department drafted, with the audit department occasionally checking the model's outputs against the policy.
But at this point it's much easier to use an LLM to write a deterministic function in your favorite Lisp based on the policy, and run that to make the decisions.
In the US they do have to explain. It's a requirement of the Equal Credit Opportunity Act of 1974 [1]. Here is an article with more detail on what is required in the explanation [2].
>Just rerun the application with higher income until you get a pass. Then tell the person their application was rejected because income was not at least whatever that passing income amount was.
Why do you need an AI if what you are doing is "if X < N"?
It would not be just an "if X < N". Those decisions are going to depend on a lot of variables besides income such as credit history, assets, employment history, debts, and more.
For someone with a great credit history, lots of assets, a long term job in a stable position, and low debt they might be approved with a lower income than someone with a poor credit history whose income comes from a job in a volatile field.
There might be some absolute requirements, such as that the person has a certain minimum income independent of all those other factors, that they have a certain minimum credit score, and so on. If the application is rejected because it doesn't meet one of those, then sure, you can just do a simple check and report that.
But most applications will be above the absolute minimums in all parameters and the rejection is because some more complicated function of all the criteria didn't meet the requirements.
But you can't just tell the person "We put all your numbers into this black box and it said 'no'." You have to give them specific reasons their application was rejected.
Say a lender has used machine learning to train some sort of black box to take in loan applications and respond with an approve/reject response. If they reject an application using it, the Equal Credit Opportunity Act in the US requires that they tell the applicant a specific reason for the rejection. They can't just say "our machine learning model said no".
If they were not using any kind of machine learning system, they probably would have made the decision according to some series of rules, like "modified income must be X times the monthly payment on the loan", where modified income is the person's monthly income with adjustments for various things. Adjustments might be multipliers based on credit score, debt, and other things.
With that kind of system they would be able to tell you specifically why you were rejected. Say you need a modified income of $75k and you are a little short. They could look at their rules and figure out that you could get a modified income of $75k if you raised your income by a specific amount or lowered your debt by a specific amount, or by some combination of those.
That kind of feedback is useful to the applicant. It tells them specific things they can do to improve their chances.
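A sketch of that kind of rule-based check; the multipliers, thresholds, and field names here are invented for illustration:

    interface LoanApplication {
      monthlyIncome: number;  // gross monthly income
      creditScore: number;
      monthlyDebt: number;    // existing monthly debt payments
      monthlyPayment: number; // payment on the requested loan
    }

    // Invented adjustments: weaker credit and existing debt scale income down.
    function modifiedIncome(a: LoanApplication): number {
      const creditFactor = a.creditScore >= 740 ? 1.0 : a.creditScore >= 670 ? 0.9 : 0.75;
      return a.monthlyIncome * creditFactor - a.monthlyDebt;
    }

    function decide(a: LoanApplication): { approved: boolean; reason?: string } {
      const required = 3 * a.monthlyPayment; // "modified income must be X times the monthly payment"
      const actual = modifiedIncome(a);
      if (actual >= required) return { approved: true };
      // The specific reason falls straight out of the rule that failed.
      return {
        approved: false,
        reason: `modified income ${Math.round(actual)} is below the required ${required}`,
      };
    }

Because every number comes from an explicit rule, the rejection reason can be read straight off the failing comparison.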
With the company using a machine learning black box they don't know the rules that the machine has learned. Hence my suggestion of asking the black box what-if scenarios to figure out specific things the applicant can change to get approval.
> That kind of feedback is useful to the applicant. It tells them specific things they can do to improve their chances.
In that sense it's very practical, but it kicks the can down the road. Maybe the thing has a hidden parameter that represents the risk of the applicant being fired, which increases the threshold by 5% if the salary is a round number. Or it is more likely to reject everyone with an income between 73 and 75k because it learned that this is a proxy for a parameter you are explicitly forbidden to use.
Let's just say it doesn't have a discontinuity, and actually produces the threshold that is deterministically compared with your income. How does it come up with this threshold? You may not be required to disclose that to the applicant, but it would be a shame if people figured out that the threshold is consistently higher for a certain population (for example, people whose given name ends with a vowel).
It's fairly reasonable for a regulator to ask you to demonstrate that it doesn't do any of these things.
Sure, it's the rule that multiplies all the weights of matrix C1 with your transformed inputs. It's right there. What part of that rule don't you understand?
On a strict mathematical reading, maybe - depends on how you define "rules", "defined" and "solely" :P. Fortunately, legal language is more straightforward than that.
The obvious straightforward read is along the lines of: imagine you make some software, which then does something bad, and you end up in court defending yourself by arguing, "I didn't explicitly make it do it; this behavior was a possible outcome (i.e. not a bug) but wasn't something we intended or could've reasonably predicted" -- if that argument has a chance of holding water, then the system in question does not fall under the exception you quoted.
The overall point seems to be to make sure systems that can cause harm always have humans that can be held accountable. Software where it's possible to trace the bad outcome back to specific decisions made by specific people who should've known better is OK. Software that's adaptive to the point it can do harm "on its own" and leaves no one but "the system" to blame is not allowed in those applications.
It means a decision tree where every decision node is entered by humans is not an AI system, but an unsupervised random forest is. It’s not difficult to see the correct interpretation.