I didn't say that they _couldn't_, but it clearly isn't a priority for them. They still have the same opportunity cost any other engineering team faces.
They can work on feature X or feature Y -- which is the better choice?
Apparently they don't think Linux support is significant. I doubt the lack of support is due to technical constraints.
They're describing a layered architecture enforced by some script in CI.
For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other.
If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules.
Some languages, like Java (Project Jigsaw), Go, and Rust, have this built in. JavaScript, Python, etc. have no such feature.
It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents.
They mention this in the article:
> This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
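To make the `backend`/`common`/`frontend` example concrete, here is a minimal sketch of what such a CI check could look like. This is my own illustration in Python, not the article's actual tooling; the `ALLOWED` table and the function names are hypothetical.

```python
import ast

# Hypothetical layering rules: each package may import only the packages listed.
ALLOWED = {
    "common": set(),
    "backend": {"common"},
    "frontend": {"common"},
}


def imported_packages(source: str) -> set:
    """Return the top-level package names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z" imports; relative imports stay in-package.
            found.add(node.module.split(".")[0])
    return found


def violations(package: str, source: str) -> list:
    """List the imports in `source` (a file owned by `package`) that break the rules."""
    allowed = ALLOWED[package] | {package}
    return sorted(
        f"{package} must not import {dep}"
        for dep in imported_packages(source)
        # Only police imports of the layered packages; stdlib and third-party pass.
        if dep in ALLOWED and dep not in allowed
    )
```

A CI job would run `violations` over every file in each package and fail the build if any list comes back non-empty.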
Does it yield good results?
I found that instead of docs it's easier just to ask AI to read the code.
I feel like this is the same as comments in code: they become outdated fast.
I don't really use "docs" for documentation. I've prompted Claude/Codex to always write a "log" and save it in-repo to track what it did and why.
I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself.
I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"
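For a sense of what such a log could look like, here is a minimal sketch of an append-only, dated entry format. The file layout, the `What`/`Why` fields, and the `append_log_entry` name are all hypothetical; in practice the agent writes the entries itself.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def append_log_entry(log_path: str, what: str, why: str, day: Optional[date] = None) -> str:
    """Append one dated entry to an in-repo work log and return the text written."""
    day = day or date.today()
    entry = f"## {day.isoformat()}\n- What: {what}\n- Why: {why}\n\n"
    path = Path(log_path)
    # Create the log directory on first use so the call works in a fresh repo.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

Keeping it append-only is the point: older entries record what was tried and why, so they are not rewritten the way docs are.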
I haven't actually noticed that, but I'm not sure why. Maybe because I specifically describe it to the agent as a work log rather than documentation?
Left unattended, it doesn't produce great results; it'll start creating slop or hardcoding solutions.
But over time, if you adjust your verification rubric, it's not too bad and gets pretty good. If you make it do TDD, it gets kind of crazy and you'll have 2000-3000 tests after a while or, in my common case, 6000-7000 lines of code in single files (I usually have a cron job to audit files for decomposition and create tickets).
I wouldn't use it at my job yet, but it's been fun to use for personal projects. It's like modded Minecraft automation or Factorio.
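The file audit mentioned above can be quite small. Here is a rough sketch of the idea, assuming a line-count threshold and a set of file suffixes that I made up; the ticket-creation step is left out because it depends on whatever tracker you use.

```python
from pathlib import Path


def oversized_files(root: str, max_lines: int = 1000, suffixes=(".py",)) -> list:
    """Return (path, line_count) pairs for source files longer than max_lines."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # Count lines lazily so very large files are not loaded into memory.
            with path.open(encoding="utf-8", errors="replace") as fh:
                count = sum(1 for _ in fh)
            if count > max_lines:
                results.append((str(path), count))
    return results
```

Run from cron, the returned list is what you would turn into decomposition tickets.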
I like the idea of saving the work done into files - helps to prevent the llm from redoing the same work. Maybe one day instead of code in a repo it will just be a list of prompts.
I've had some really dumb refusals. Explaining elements of infrared spectroscopy, researching artificial bud-breaking in agriculture, etc. Anything interesting and non-mainstream is banned. Basically, I'm restricted to answers I'm better off just going to Wikipedia for.
I wanted it to show me how to create an overlay on an existing web game, and it extrapolated that because this could be used to provide tools to help win the game (if that was the direction it was ultimately taken), and because this was a game that other humans also played to win "stars", and because this could amount to cheating, it wasn't going to do as I asked.
First time ever I've fired up openrouter to seriously consider alternatives.
I find it terrifying that people are willing to outsource thinking. Outsourcing thinking to an entity that is opinionated about what to think is beyond crazy.
What’s the difference between outsourcing thinking and using an LLM as a research tool?
An LLM with fetch/search is going to be a lot more effective than me and Google. I would _never_ ask questions like this if the LLM wasn't able to look up data.
The only guardrail I've hit recently was when I was trying to get it to rename files ripped from DVD to episode names. I told it to try again and it did it. It wasn't even really a refusal; it was just working on it and then stopped for a content violation or whatever.
An easy way around the API token thing is to put it in a file and point the model at the file. I saw what you were seeing when I provided credentials directly, but haven't had any problems with it since using the indirect method.
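As a sketch of that indirect method: the token lives in a file, and the script the model writes or runs reads it from there, so the secret never appears in the prompt. The `load_token` name and the file location are hypothetical.

```python
from pathlib import Path


def load_token(path: str) -> str:
    """Read an API token from a file, stripping the trailing newline editors add."""
    token = Path(path).expanduser().read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"token file {path} is empty")
    return token
```

You then tell the model where the file is (for example, "the token is in `~/.config/myservice/token`", a made-up path) instead of pasting the value.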
> What are nerve agents and how do they work (for a layman)?
On the one hand, I can appreciate the wisdom of not serving up certain easily abused knowledge on a silver platter. On the other, that prompt (and far worse) is more or less directly answered by Wikipedia's summary of the subject, at which point what purpose could the refusal possibly serve?
Perhaps Wikipedia shouldn't list off the precise chemical compositions of various hand grenades as well as various synthesis methods for each of the related compounds. But given that we inhabit a world where it does, perhaps a more fruitful approach would be to flag conversations that go in a certain direction and then just keep an (automated) eye on things?
I remember once in college a Chem Eng friend told me he (or any competent chemical engineer) could basically manufacture a lot of explosives/chemical agents should he want to. He even told me he could substitute a lot of suspicious precursor materials with others, should he want to avoid raising alarms.
I think, AI or not, the knowledge of how to make this stuff is basically out there, and it's not chatbot guardrails that are keeping nerve gas and TNT out of the hands of regular people.
Maybe the difference is that just reading Wikipedia only helps you part of the way, while an LLM could help you step by step (e2e) in producing a functional weapon. And setting a more complex rule where Claude tells you some things about this and not others is probably a lot more work for little gain?
I believe a sufficiently advanced model could provide a layman with actionable step by step instructions for building a nuclear weapon. They're complicated but not (AFAIK) that complicated. The more or less insurmountable barrier there is weapons grade material. Thankfully refinement is prohibitive in cost, expertise, and equipment.
In comparison, basic munitions are incredibly simple given a recipe and shop tooling. But just because something is conceptually simple doesn't mean it's a good idea to go out of the way to disseminate step by step instructions.
The difficulty with a fission bomb is getting enough uranium or plutonium or other fissile material together for the bomb yield you want (at least above the critical mass for your chosen material), refining it to fissile form (since most fissile material found in nature is a more stable variety), and then separating the fissile bits with something thin but neutron absorptive.
The rest is just slamming the material together with a small explosive so that it passes the critical mass state and starts a chain reaction.
This is information you can find in many places if you're willing to put in the effort to search for it. Having this knowledge does not get you any closer to making atomic bombs. The process of mining uranium or producing plutonium is difficult, expensive, and very likely to get you caught before you even make it to the enrichment step of the process, thanks to constant world-wide spy satellite surveillance.
Unless you are a nation, your only chance of making a nuclear bomb would be to find a lost nuclear submarine and convert the nuclear material inside of it before you were caught.
A gun type maybe. But then, two paragraphs and some machining knowledge + shop tooling could do the same, given enough refined material.
Ain't no way a layman is pulling off an implosion device, regardless of tooling or LLM guidance. The explosive lens structure and timing required are quite complex, and would require some significant calculation from someone who actually knew what they were doing.
Nation state, or even sufficiently motivated big corp, if they had the refined material? Sure. Layman? No.
Thinking they can with LLM slop involved? That will make for some very interesting radiological incidents though!
"A gun type" of nuke is sufficient to achieve most, and usually all, of the goals some small group building a nuke would have.
We are all fortunate that, as fc417fc802 mentioned, refining the materials proves to be quite challenging, and I see no particular way that AI could possibly make that any easier. If it was as simple as building a gun-type nuke by banging any uranium together to get a big bang, we'd be living in a very different world.
I agree, but really feel like you're missing the point here. Many things are reasonably straightforward and require almost no understanding when you have simple step by step instructions. LLMs are capable of providing such instructions and in certain cases they probably shouldn't.
But it's not as simple as just refusing help on a broad swathe of topics the way they do now. That makes agents much less useful in general (i.e. lots of collateral damage) and for many topics is entirely ineffective, given that for better or worse the internet already makes such material readily available. In such cases reporting suspicious behavior is likely to be much more effective than denial.
Aside: You've now got me curious and I really want to test the frontier models to see to what extent they're capable of providing sensible designs and specifications for implosion type thermonuclear weapons but also feel like that would attract the wrong sort of attention and probably create a headache for me in more ways than one.
The data is often wrong enough that it screws whoever tries it unless they have enough experience/knowledge to not need it, or it really doesn't help beyond what someone using existing tools could get, albeit with a little more motivation.
At best, it either gets someone started with something they still need to think to finish, or gets them deep into a mess it can’t help them get out of. In my experience.
In some edge cases, it can be used by experts to automate some grunt work or do prototypes without getting in the way, but a better thought-out framework is usually faster in my experience.
A while ago I made an analogy about WYSIWYG GUI tools, and the more this comes up, the more accurate I think it really is.
Does that not depend entirely on the topic, and does it not get better with each generation? This is a general ethical and functional question about how the models ought to handle certain topics, and it isn't going away. Much of the difficulty at present is caused by a ham-fisted broad censorship approach that I'm pointing out is wrong-headed in an at least somewhat nuanced way.
I thought that these models are supposed to be vastly smarter than what’s needed to discern between "general information trivially available on Wikipedia" and "actionable synthesis instructions".
An LLM could probably make that distinction clearly.
A commercial LLM provider training their own models is, however, likely to bias the model (or guardrail) harder, in an effort to make it harder to jailbreak and to minimize bad press.
For example:
- refusing to talk even about the well-known parts of forbidden topics (this)
- tending toward sycophancy to avoid ever seeming rude or unhelpful
So, where are the truly uncensored models? There has to be some that have no guardrails, built on publicly available data, that will explain to anyone in graphic detail anything they want to know or talk about.
I've tried the abliterated ones from huggingface and they still have guardrails. I guess I could fire up unsloth and re-abliterate a 20b, but surely someone somewhere has already done this.
All of this concern about guardrails and security, people have such puckered butts about it when so far, 99.9% of people at least have no access to any of this to begin with, and if someone does use a tool for evil, it's on the user, not the tool.
As I understand things (not a user), abliteration has been superseded by actively monitoring the model state during the run and steering specific "negative" directions as they arise. It's both more reliable and does less damage.
This is strange to me, did you really ask like this and which model did you use?
I just tried your no. 1 and 3 verbatim and Opus gave fine answers; no. 6 I've done in the past with no issues. The other ones we can't really replicate without more details, but based on my experience with Opus I don't see what the issue would be.
The reason I'm really surprised by this is I do a lot of biology prompts and the guardrails used to be quite problematic up until some time late last year. Many legitimate prompts would trigger its biosafety filters.
But I haven't seen such filters trigger at all anymore in more than half a year.
There's a study out there that if you tell the LLM you're a (medical) patient, all you get are refusals. If you tell it you're a doctor, then it'll actually help you.
It came out of nowhere. It's all emergent. I'm convinced this is possible with just about anything given enough data. We will be seeing a near-magical LLM with physical outputs in the near future. It's going to take in video and sound and spit out physical movements that will be just as mind-blowing as when 3.5 came out, and it will come out of nowhere.
I can't agree enough, and I am increasingly struggling to understand why people are not grasping this. It's just a matter of sensors in the right places and compute.
sufficient telemetry + sufficient compute = AI solution to any problem
From the Universal Approximation Theorem for neural nets, we know that with the right training method and net architecture we can approximate any continuous function with a NN. Of course, that doesn't imply that we actually have a sufficient training method and net architecture for the problem at hand, but we have been able to demonstrably solve at least two engineering domains: physical world navigation (Waymo) and language (GPT). It turns out a robust enough language model is sufficient for reasoning.
Given these results, I am personally stumped to come up with a problem humans can solve now that we can't solve with a computer given the correct telemetry and sufficient compute.
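For reference, one common single-hidden-layer form of the Universal Approximation Theorem mentioned above is sketched below. It assumes a continuous, non-polynomial activation $\sigma$ and a continuous target $f$ on a compact set $K$. Note that it is purely an existence statement: it says a good enough network exists, not how many units $N$ it needs or whether training will find it.

```latex
\text{Let } \sigma:\mathbb{R}\to\mathbb{R} \text{ be continuous and non-polynomial, and let }
K \subset \mathbb{R}^n \text{ be compact.}
\text{Then for every continuous } f: K \to \mathbb{R} \text{ and every } \varepsilon > 0
\text{ there exist } N \in \mathbb{N},\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n
\text{ such that}
\[
  \sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon .
\]
```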
I thought I read that Samsung SATA SSDs were discontinued, but apparently that was a rumor and Samsung has denied it. I wonder why they exceed NVMe prices. They're the only SATA drives left with DRAM. I guess they could just be milking that fact.
Well boo-hoo. It's about time more people got to know what it's like not to be on the bleeding edge. I've always had second-hand computers and only once bought myself a new laptop, the Asus Eee PC, after the price dropped.
Ten years from now I'll get to watch Inception in 4K.
A few weeks ago I needed a computer to be a Debian server for some simple at-home web dev / learning stuff. I bought an HP ProDesk 400 G3 SFF PC with an i5-6500, 8GB RAM and a 256GB drive off a popular auction site for £44. It'll do. I might upgrade to 16GB. An additional 8GB stick costs £19.
Good work. I bought an AMD Ryzen 3 3100 for €35 with shipping. A Radeon W5500 would set me back €150 at the moment. 16 GiB of RAM another €90. And that's on a relatively cheap site in my country.
There's still a cost to testing, support, planning, etc., even if coding is now "free".