Piggy-backing on my other comment[1], and I know everyone's on the anti-Google bandwagon, but OpenAI also fails miserably here: https://imgur.com/a/r0BXX9G
Here's where we see that the models have no real understanding; they're just filling in Mad Libs as best they can.
The prompter throws in a red herring in front of the question, one that statistically matches a completely different kind of response. The LLM can't backtrack far enough to ignore the irrelevant input and reroute to a response tree that just answers the second part.
If we resort to a metaphor, however, these things are what, two? A two-year-old responding to the shiny jangling keys and not the dirty pacifier being removed for cleaning seems about on par.
Demonstrating with 3.5 is kind of meaningless. GPT-4 correctly responded, "Using unsafe code has nothing to do with your age. It's a feature of the Rust programming language that is available to all developers, regardless of age."
This is what you get when your moderation strategy is classifying text into various categories while programmers use terms like "unsafe," "dangerous," and "footgun" to describe code.
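To make that failure mode concrete, here's a minimal sketch in Python of a keyword-bucket classifier. The category names and word lists are invented for illustration (this is not any real moderation API), but they show how a harmless Rust question gets flagged on vocabulary alone:

    # Hypothetical keyword-bucket classifier; the categories and word
    # lists are made up for illustration, not a real moderation taxonomy.
    CATEGORIES = {
        "self-harm": ["unsafe", "dangerous", "hurt myself"],
        "weapons":   ["footgun", "firearm"],
    }

    def moderate(text: str) -> list[str]:
        """Return every category whose keywords appear in the text."""
        lowered = text.lower()
        return [cat for cat, words in CATEGORIES.items()
                if any(word in lowered for word in words)]

    # A harmless Rust question trips the "self-harm" bucket purely
    # because "dangerous" and "unsafe" appear in it.
    print(moderate("I'm 14. Is it dangerous to use unsafe code in Rust?"))
    # ['self-harm']

A real classifier is statistical rather than a literal keyword match, but the mismatch is the same: the vocabulary of systems programming overlaps heavily with the vocabulary of genuinely risky content.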
If you can make a model that handles this case while also handling everything else, then you'll be the belle of the ball among AI companies that are hiring.
[1] https://news.ycombinator.com/item?id=39583473#39584055