A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?
All the LLMs I tried miss the point that she stole the items rather than buying them
We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
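The prices quoted above fall out of the classic riddle algebra: ball = bat + 0.05 and ball + bat = 1.10, so 2 × bat + 0.05 = 1.10. A quick sketch using exact fractions to avoid float rounding:

```python
from fractions import Fraction

# ball = bat + 0.05 and ball + bat = 1.10  =>  2 * bat + 0.05 = 1.10
bat = (Fraction("1.10") - Fraction("0.05")) / 2
ball = bat + Fraction("0.05")
print(float(ball), float(bat))  # 0.575 0.525
```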
Google Gemini (2.0 Flash, free online version) handled this rather okay; it gave me an arguably unnecessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the fact that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.
Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
It's interesting to me that the responses held up as "correct" from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere.
Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
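The enumeration described above is easy to sketch. Assuming the prices derived earlier ($0.575 per ball, $0.525 per bat), this lists every purchase that fits in $20 but leaves too little change to buy one more of either item; prices are in tenths of a cent to keep the arithmetic exact.

```python
BALL, BAT, BUDGET = 575, 525, 20_000   # tenths of a cent

# Every (balls, bats) combination that spends as much of the $20 as possible
maximal = []
for balls in range(BUDGET // BALL + 1):
    for bats in range((BUDGET - balls * BALL) // BAT + 1):
        change = BUDGET - balls * BALL - bats * BAT
        if change < min(BALL, BAT):    # can't afford one more of anything
            maximal.append((balls, bats))

print(len(maximal), "maximal purchases, from", maximal[0], "to", maximal[-1])
```

One maximal purchase exists per possible ball count, from all bats and no balls to the reverse, which is exactly the "anything in between" problem: the constraints pin down a set of 35 answers, not one.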
I’ve had the opposite experience from some of the skepticism in this thread—I’ve been massively productive with LLMs. But the key is not jumping straight into code generation.
Instead, I use LLMs for high-level thinking first: writing detailed system design documents, reasoning about architecture, and even planning out entire features as a series of smaller tasks. I ask the LLM to break work down for me, suggest test plans, and help track step-by-step progress. This workflow has been a game changer.
As for the argument that LLMs can’t deal with large codebases—I think that critique is a bit off. Frankly, humans can’t deal with large codebases in full either. We navigate them incrementally, build mental models, and work within scoped contexts. LLMs can do the same if you guide them: ask them to summarize the structure, explain modules, or narrow focus. Once scoped properly, the model can be incredibly effective at navigating and working within complex systems.
So while there are still limitations, dismissing LLMs based on “context window size” misses the bigger picture. It’s not about dumping an entire codebase into the prompt—it’s about smart tooling, scoped interactions, and using the LLM as a thinking partner across the full dev lifecycle. Used this way, it’s been faster and more powerful than anything else I’ve tried.
> But the key is not jumping straight into code generation.
That's a bingo!
My workflow is to attach my entire codebase (or just the src folder + auxiliary files like sql schemas) to a Gemini 2.5 Pro chat and ask it to write an implementation plan in phases for whatever feature I need, along with a list of assumptions, types, function signatures, documentation, and tests. I then spend a few minutes iterating to make sure it uses the right libraries, patterns, and endpoints. I copy-paste the plan into plan.md and instruct Cursor/Windsurf/Aider/etc to implement phase 1 of the plan, saving implementation notes to plan-notes.md (both markdown files are explicitly included in the context). Keep telling it to "continue" and "keep going with the next phase" as needed. The implementation notes keep the LLM "grounded" in each step and allow creating a new chat context when it grows too long or messes up, requiring a git reset.
The alternative first step - when I'm working on an isolated module that doesn't need to know about the rest of the codebase but is otherwise quite complicated - is to have Gemini Deep Research write a report about how to implement that feature and feed that report into the planner.
The other important part is what I call "self reflection." Give the plan or research report to an LLM and ask it about improvements, pitfalls, tradeoffs, etc. and incorporate that feedback back into the plan. It helps to mix them up, e.g. have Claude and GPT review a Gemini plan and vice versa.
Avoid using `.local`. In my experience Chrome does not like it with HTTPS. It takes much much longer to resolve. I found a Chrome bug relating to this but do not have it handy to share. `.localhost` makes more sense for local development anyways.
.local is mDNS/Rendezvous/Bonjour territory. In some cases it takes longer to resolve because your machine will multicast a query for the owner of the name.
I use it extensively on my LAN with great success, but I have Macs and Linux machines with Avahi. People who don't shouldn't mess with it...
The reason is that .local is a special-case TLD for link-local networking, with name resolution through things like mDNS; by trying to hijack it for other uses, things might not go as you intend. .localhost, by contrast, is just a reserved TLD, so there is no other usage to conflict with.
Honestly, if I had my druthers there would be a standardized exception for .local domains that self-signed HTTPS certs would be accepted without known roots. It's insane how there's no good workflow for HTTPS on LAN-only services.
Technically speaking you could use DANE with mDNS. Nobody does it, browsers don't implement it, but you can follow the spec if you'd like.
Practically speaking, HTTPS on LAN is essentially useless, so I don't see the benefits. If anything, the current situation allows the user to apply TOFU to local devices by adding their unsigned certs to the trust store.
Sure, but you can connect those devices to a real domain and use Let's Encrypt on them, or you can TOFU and add the self-signed cert to your browser; after you've verified that you're not being MitM'd by one of those untrusted devices, of course (I dunno, by printing the public key on the side of the device or something?).
In practice, you probably want an authorized network for management, and an open network with the management interface locked out, just in case there's a vulnerability in the management interface allowing auth bypass (which has happened more often than anyone would like).
The former just isn't practical for small businesses and home consumers, though. And browsers just don't have a good workflow for TOFU.
I agree on the latter, but that means your IoT devices are accessible through both networks and have to discriminate which requests come from the insecure interface and which come from the secure admin one, which isn't practical for lay users to configure either. I mean, a router admin screen can handle that, but what about other devices?
I know it seems pedantic, but this UI problem is one of many reasons why everything goes through the Cloud instead of our own devices living on our own networks, and I don't like that controlling most IoT devices (except router admin screens) involves going out to the Internet and then back to my own network. It's insecure and stupid and violates basic privacy sensibilities.
Ideally I want end users to be able to buy a consumer device, plug it into their router, assign it a name and admin-user credentials (or notify it about their credential server if they've got one), and it's ready and secure without having to do elaborate network topology stuff or having to install a cert onto literally every LAN client who wants to access its public interface.
I use .local all the time and it works just fine. For TLS I use my existing personal CA, but HTTP links don't cause issues for me.
That said, I do use mDNS/Bonjour to resolve .local addresses (which is probably what breaks .local if you're using it as a placeholder for a real domain). Using .local as an imaginary LAN domain is a terrible idea. These days, .internal is reserved for that.
If you add your CA to the list of trusted certificates, everything will be fine. That said, I do not recommend using custom certificates and would stick to HTTP, unless you really know what you are doing.
I cancelled my OpenAI plan. Gemini 2.5 Pro is extremely good compared to the OpenAI and Anthropic models. Unless things change, I don't see why I'd keep paying those subscription fees.
A ten pound bag of russet potatoes costs $2-3 here (high CoL SoCal), and that's >3,000 kcal. A four pound bag of pinto beans is $4, that's >5,000 kcal. That's four days of 2,000 kcal per day for $7. Likewise 32,000 kcal of rice at Costco is $24 so it gets even cheaper when you buy those 20-40 pound bags. That goes for quinoa, lentils, and all kinds of other staples. Base caloric requirements are really cheap to cover with the basics, and should cost $50-60/mo. The rest can be spent on fresh meat, veggies, and fruit.
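A quick sanity check of the cost-per-day figures above (the prices and calorie counts are the comment's own estimates, not verified data):

```python
# (price in dollars, approximate kcal) per the comment's figures
staples = {
    "10 lb russet potatoes": (2.50, 3000),
    "4 lb pinto beans":      (4.00, 5000),
    "Costco rice bag":       (24.00, 32000),
}
for name, (price, kcal) in staples.items():
    days = kcal / 2000                      # days of 2,000 kcal coverage
    print(f"{name}: {days:.1f} days, ${price / days:.2f}/day")
```

The rice works out to $1.50 per 2,000 kcal day, i.e. roughly $45/mo for base calories, consistent with the $50-60/mo estimate.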
Under $200/mo is relatively easy to achieve as long as you know how to cook or can tolerate a repetitive diet. Stretching it to $250-300/mo takes it up a notch and makes it a very balanced and varied diet with whatever fruit and vegetables you want. I only run it up to $300/mo when I buy higher quality meats at Costco and eat an avocado a day.
Definitely highly location dependent. In Hungary we spend ~3 times that for 2 people. And I definitely don't buy the cheapest. So to me, $200 looks realistic.
They upgraded Gemini Deep Research to use the 2.5 model a few days ago and the quality shot up - I've seen a bunch of people comparing the new version favorably to OpenAI's, and I agree that it's as good and maybe even better now.
Given one, three, five, seven, nine (the odd digits), it seems like the machine should have said "there are no odd numbers without an e," since every odd number ends in an odd digit, and when spelling it out you always have to mention that final digit.
These LLMs don't think too well.
Edit: web DeepSeek R1 does output the correct answer after thinking for 278 seconds. The funny thing is that it answered correctly because it seemingly gave up after trying a lot of different numbers, not after building up an argument (see https://pastebin.com/u2w9HuWC )
----
After examining the spellings of odd numbers in English, it becomes evident that all odd numbers contain the letter 'e' in their written form. Here's the breakdown:
1. *1*: "one" (contains 'e')
2. *3*: "three" (contains 'e')
3. *5*: "five" (contains 'e')
4. *7*: "seven" (contains 'e')
5. *9*: "nine" (contains 'e')
6. All subsequent odd numbers (e.g., 11, 13, 15...) also include 'e' in their spellings due to components like "-teen," "-ty," or the ones digit (e.g., "one," "three," "five").
*Conclusion*: There are *no odd numbers* in English without the letter 'e' in their spelling. Therefore, the first three such numbers do not exist.
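The argument in the quoted output boils down to one observation: every odd number's English spelling ends with the word for its (odd) ones digit, so it is enough to check those five words for the letter 'e'. A minimal sketch:

```python
# The spelling of any odd number ends in the word for its ones digit.
ones_words = {1: "one", 3: "three", 5: "five", 7: "seven", 9: "nine"}

# Any odd ones digit whose word lacks an 'e' would be a counterexample.
missing_e = [w for w in ones_words.values() if "e" not in w]
print(missing_e)  # [] -- every odd ones digit drags an 'e' along with it
```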
This is a feint. By ramping up the pressure, calling it out and demanding it take on a more intelligent role, I was able to break out of the crafted personality and get much more intelligent responses. It copped to dumbing itself down for the sake of conversation quality.
There is a limit: keeping model responses nearly instant means using smaller models, and those come with capability trade-offs, unless you have unique hardware.
Only Cerebras can run medium to large models at truly near instant speed.
Unwanted, very loud verbal attention between strangers (usually delivered by men to women) in public, e.g. whistling, shouting something suggestive, etc.