Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.
Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles.
I have to admit I'm seeing this for the first time and am somewhat impressed by the results; I even think they'll get better with more training, why not... But are these multimodal LLMs still LLMs though? I mean, they're still LLMs, but with a sidecar that does other things, and the image training takes place outside the LLM, so in a way the LLM still doesn't "know" anything about these images; it's just generating them on the fly upon request.
Some of the LLMs that can draw (bad) pelicans on bicycles are text-input-only LLMs.
The ones that have image input do tend to do better though, which I assume is because they have better "spatial awareness" as part of having been trained on images in addition to text.
I use the term vLLMs, or vision LLMs, to describe LLMs that are multimodal for image and text input. I still don't have a great name for the ones that can also accept audio.
The pelican test requires SVG output because asking a multimodal output model like Gemini Flash Image (aka Nano Banana) to create an image is a different test entirely.
> We wanted to download a clip using yt_dlp (a Python program). Terminal told us, this would require dev tools (which it doesn't).
It is offering to install Apple's developer tools package which includes Python. The download is ~900MB, much of which consists of large Swift and C compiler binaries. That's pretty large if you only need Python, but in practice you probably do want the full dev tools because Python packages often compile C extensions when installed.
> Except, that non-blessed python could not access the internet because of some MacOS "security" feature.
There is no such security feature. Perhaps a TLS issue?
> Another "security" feature requires all apps on Apple computers to be notarized, even the ones I built myself. This used to have a relatively easy workaround (right click, open, accept the risk). Now it needs a terminal command.
You can also do it from System Settings. Or if you are actually building on the same machine, you can avoid the problem as described at the bottom of this page:
> On some Apple systems, this fails to show any audio devices, "for security reasons".
While the implementation is somewhat janky, there are real and valid security reasons to require consent for using the microphone.
> There is no indication anywhere that the hard drive is getting full.
No proactive warnings (does any OS do that?), but there are plenty of ways to see how full the disk is, including the newish System Settings -> General -> Storage, which breaks down storage use and offers some ways to save space.
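If you'd rather check from a script, the Python standard library can report disk usage directly; a minimal sketch (works on macOS and elsewhere):

```python
# Report how full the root filesystem is, using only the standard library.
import shutil

usage = shutil.disk_usage("/")  # named tuple: (total, used, free), in bytes
percent_used = usage.used / usage.total * 100
print(f"{usage.free / 1e9:.1f} GB free ({percent_used:.0f}% used)")
```

The same call works on any mount point, so you can point it at an external drive's path too.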
> There is no simple way to reset the computer to factory conditions.
System Settings -> General -> Erase All Content and Settings.
> There is no such security feature. Perhaps a TLS issue?
Definitely user error. If you install Python from the website, instead of using the developer tools or Homebrew (which requires the developer tools), you also have to run the `Install Certificates.command` which comes with it.
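You can check where a given Python install is looking for root certificates before blaming the OS; a quick sketch using the standard `ssl` module (the exact paths vary by install, and a python.org build's `cafile` only works after `Install Certificates.command` has populated it):

```python
# Inspect this interpreter's TLS certificate configuration.
import ssl

paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile)  # may be None if no cert bundle is configured
print("capath:", paths.capath)

ctx = ssl.create_default_context()
print("verify mode:", ctx.verify_mode)  # CERT_REQUIRED by default
```

If `urllib` requests fail with `SSLCertVerificationError` while curl works fine, this is usually the first thing to look at.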
Not even the most extreme LGBT activist would accuse people who used the name Ellen Page in 2019 of having somehow been insensitive for failing to have a crystal ball. That is as absurd as it sounds. At most someone might be asked to change the name if they’re actively republishing the material in question.
Your point may be more valid when it comes to political attitudes, in cases where the issues were known at the time but the Overton window has shifted since.
Reply to self: I managed to get their code running, since they seemingly haven’t published their trajectories. At least in my run (using Opus 4.6), it turns out that Claude is able to find the backdoored function because it’s literally the first function Claude checks.
Before even looking at the binary, Claude announces it will “look at the authentication functions, especially password checking logic which is a common backdoor target.” It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.
I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…
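For what it's worth, the `strings`-style triage Claude did is trivial to reproduce; here's a sketch in Python, with a fake byte blob standing in for the real dropbear binary (I don't have their exact build):

```python
# Scan a binary blob for auth-related symbol names, roughly what
# `strings dropbear | grep auth` does. The blob below is a stand-in,
# not the actual binary.
import re

blob = b"\x7fELF...\x00svr_auth_password\x00svr_auth_pubkey\x00..."
hits = [m.decode() for m in re.findall(rb"svr_auth_[a-z]+", blob)]
print(hits)  # ['svr_auth_password', 'svr_auth_pubkey']
```

The point being: if the binary keeps its symbol names, "find the password-checking function" takes one grep, no disassembly required.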
They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.
On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.
But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.
Update: It did find the backdoor! It spent an hour and a half mostly barking up various wrong trees and was about to "give my final answer" identifying the wrong function, but then said: "Actually, wait. Let me reconsider once more. [..] Let me look at one more thing - the password auth function. I want to double-check if there's a subtle bypass I missed." It disassembled it again, and this time it knew what the callee functions did and noticed the wrong function being called after failure.
Amusingly, it cited some Dropbear function names that it had not seen before, so it must have been relying in part on memorized knowledge of the Dropbear codebase.
Technically, it’s not just Scheme-like but literally a Scheme interpreter (TinyScheme). However, the Scheme isn’t being executed to make individual sandboxing decisions. It’s just executed once while parsing the config, to build up a binary sandbox definition which is what the kernel ultimately uses to make decisions (using a much more limited-purpose, non-Turing-complete execution engine).
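For illustration, a sandbox profile really does read as Scheme. This is a hypothetical minimal fragment, not taken from any shipping profile (Apple's own .sb files live under /System/Library/Sandbox/Profiles if you want real ones):

```scheme
;; Hypothetical minimal sandbox profile fragment.
;; Evaluated once by TinyScheme at load time to build the binary
;; ruleset the kernel actually enforces at decision time.
(version 1)
(deny default)                     ; start from deny-everything
(allow file-read*
       (subpath "/usr/lib"))       ; allow reading system libraries
(allow process-exec
       (literal "/usr/bin/true"))  ; allow exec of one specific binary
```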
This project is hardly some emergent property of the Internet or even Internet culture. The existence of VPNs and proxies in general is. They are easy to set up and hard to block. But this project, if it launches, will be a single well-known target which, at a technical level, countries could easily block access to. Whether blocking actually occurs will depend on the whims of geopolitics, but it’s not exactly a robust situation.
Lots of apps like Slack and Discord will show you an Open Graph preview of a website if you post a link. I could of course be wrong, but I expect you could craft an exploit that just required you to be able to post the link: it would render the preview and trigger the problem.
Secondly, as a sibling pointed out, lots of apps have HTML ads, so if you show a malicious ad it could also trigger. I'm old enough to remember the early Google ads, which Google made text-only specifically because Google said ads were a possible vector for malware. Oh how the turns have tabled.
Open Graph is a standard for HTML meta tags. Apps like Slack and Discord just make a request to the given URL (locally or in their servers) and read those tags. Then they choose how that information should be displayed. No HTML injection occurs.
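To make that concrete, here's a sketch of the unfurling step, using only the standard library and a hardcoded sample page in place of a fetched URL:

```python
# Sketch of how a link-preview service reads Open Graph tags: fetch the
# page (here a hardcoded sample), collect <meta property="og:..."> tags,
# then render the preview from that data, never from the page's HTML.
from html.parser import HTMLParser

SAMPLE = """<html><head>
<meta property="og:title" content="Example Page">
<meta property="og:image" content="https://example.com/img.png">
</head><body></body></html>"""

class OGParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property", "")
            if prop.startswith("og:"):
                self.og[prop] = d.get("content", "")

p = OGParser()
p.feed(SAMPLE)
print(p.og)  # {'og:title': 'Example Page', 'og:image': 'https://example.com/img.png'}
```

Since only the extracted strings and image URL reach the preview UI, the page's own markup is never injected into the client.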
Except: Spotify (through ads), Microsoft Teams (through teams apps), Notion (through user embedded iframes), Obsidian (through user embedded iframes), VSCode (through extensions), etc...
https://news.ycombinator.com/item?id=47176209