Hacker News | mohsen1's comments

A ball costs 5 cents more than a bat. The price of a ball and a bat together is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats does she have?

All the LLMs I tried miss the point that she stole the items rather than buying them.


Gemini 2.5 gives the following response:

Conclusion:

We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.


Google Gemini (2.0 Flash, free online version) handled this rather okay; it gave me an arguably unnecessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the fact that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.

Grok 3.0 wasn’t fooled on this one, either:

Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.


There's a repo out there called "misguided attention" that tracks this kind of problem.

It's interesting to me that the answers showing "correct" answers from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere.

Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
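That enumeration is at least mechanical. A quick sketch, under the simplest assumptions above (she started with nothing, stole nothing, and spent at most $20 at the implied prices):

```python
# Prices implied by the puzzle: ball + bat = $1.10, ball = bat + $0.05,
# so ball = $0.575 and bat = $0.525. Work in tenths of a cent to keep
# the arithmetic exact in integers.
BALL, BAT, BUDGET = 575, 525, 20000  # $0.575, $0.525, $20.00

# Every (balls, bats) combination Sally could afford:
combos = [(b, t)
          for b in range(BUDGET // BALL + 1)
          for t in range((BUDGET - b * BALL) // BAT + 1)]

print(len(combos))  # 700
```

Seven hundred candidate answers, from (0, 0) up to 34 balls or 38 bats, and nothing in the problem statement picks one out.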


1-4 balls and bats // HoMM 3

lol, nice way to circumvent the attention algorithm

I’ve had the opposite experience from some of the skepticism in this thread—I’ve been massively productive with LLMs. But the key is not jumping straight into code generation.

Instead, I use LLMs for high-level thinking first: writing detailed system design documents, reasoning about architecture, and even planning out entire features as a series of smaller tasks. I ask the LLM to break work down for me, suggest test plans, and help track step-by-step progress. This workflow has been a game changer.

As for the argument that LLMs can’t deal with large codebases—I think that critique is a bit off. Frankly, humans can’t deal with large codebases in full either. We navigate them incrementally, build mental models, and work within scoped contexts. LLMs can do the same if you guide them: ask them to summarize the structure, explain modules, or narrow focus. Once scoped properly, the model can be incredibly effective at navigating and working within complex systems.

So while there are still limitations, dismissing LLMs based on “context window size” misses the bigger picture. It’s not about dumping an entire codebase into the prompt—it’s about smart tooling, scoped interactions, and using the LLM as a thinking partner across the full dev lifecycle. Used this way, it’s been faster and more powerful than anything else I’ve tried.


> But the key is not jumping straight into code generation.

That's a bingo!

My workflow is to attach my entire codebase (or just the src folder plus auxiliary files like SQL schemas) to a Gemini 2.5 Pro chat and ask it to write an implementation plan in phases for whatever feature I need, along with a list of assumptions, types, function signatures, documentation, and tests. I then spend a few minutes iterating to make sure it uses the right libraries, patterns, and endpoints. I copy-paste the plan into plan.md and instruct Cursor/Windsurf/Aider/etc. to implement phase 1 of the plan, saving implementation notes to plan-notes.md (both markdown files are explicitly included in the context). Keep telling it to "continue" and "keep going with the next phase" as needed. The implementation notes keep the LLM "grounded" in each step and allow creating a new chat context when the old one grows too long or messes up and requires a git reset.
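For concreteness, a plan.md produced this way might look something like the sketch below. The file names, the phase breakdown, and the feature itself are purely illustrative conventions of mine, not anything the tools require:

```markdown
# Plan: add CSV export to the reports page (illustrative example)

## Assumptions
- Reports are already fetched via `GET /api/reports` (hypothetical endpoint)
- Export happens client-side; no backend changes in phase 1

## Phase 1: serializer
- Add `toCsv(rows)` with tests covering quoting, commas, and empty input

## Phase 2: UI
- Add an "Export CSV" button; wire it to `toCsv` and a download link

## Phase 3: polish
- Loading state, error toast, end-to-end test
```

The point is that each phase is small enough for the coding agent to finish in one sitting and record its notes before the context gets unwieldy.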

The alternative first step - when I'm working on an isolated module that doesn't need to know about the rest of the codebase but is otherwise quite complicated - is to have Gemini Deep Research write a report about how to implement that feature and feed that report into the planner.

The other important part is what I call "self reflection": give the plan or research report to an LLM and ask it about improvements, pitfalls, tradeoffs, etc., and incorporate that feedback back into the plan. It helps to mix models up, e.g. having Claude and GPT review a Gemini plan and vice versa.


Avoid using `.local`. In my experience Chrome does not like it with HTTPS; it takes much, much longer to resolve. I found a Chrome bug relating to this but do not have it handy to share. `.localhost` makes more sense for local development anyway.


.local is mDNS/Rendezvous/Bonjour territory. In some cases it takes longer to resolve because your machine will multicast a query for the owner of the name.

I use it extensively on my LAN with great success, but I have Macs and Linux machines with Avahi. People who don't shouldn't mess with it...


The reason is that .local is a special-case TLD for link-local networking, with name resolution through things like mDNS; by trying to hijack it for other uses, things might not go as you intend. .localhost, by contrast, is just a reserved TLD, so there is no other usage to conflict with.

https://en.wikipedia.org/wiki/.local

https://en.wikipedia.org/wiki/.localhost


Actually, macOS usually gives your computer a .local domain via DHCP and Bonjour.


Honestly, if I had my druthers there would be a standardized exception for .local domains so that self-signed HTTPS certs would be accepted without known roots. It's insane how there's no good workflow for HTTPS on LAN-only services.


It’s actually gotten worse: you either need to run a CA, or use a public domain, in which case your internal naming schemes can easily end up in a certificate transparency log.


The easy workaround I've seen companies use for that is a basic wildcard certificate (*.local.mydomain.biz).


Technically speaking, you could use DANE with mDNS. Nobody does it, and browsers don't implement it, but you can follow the spec if you'd like.

Practically speaking, HTTPS on LAN is essentially useless, so I don't see the benefits. If anything, the current situation allows the user to apply TOFU to local devices by adding their unsigned certs to the trust store.


Browsers won't use HTTP/2 unless HTTPS is on: Chrome only allows six concurrent requests to the same domain if you're not using HTTPS!


Some more modern browser APIs only work in HTTPS. That's why I had to do it.


Modern browsers only enable those APIs because of security concerns, and those security concerns aren't lifted just because you're connected locally.

The existing exception mechanisms already work for this, all you need to do is click the "continue anyway" button.


> HTTPS on LAN is essentially useless

Public wifi isn't a thing? Nobody wants to admin the router on a wifi network where there might be untrusted machines running around?


Sure, but you can connect those devices to a real domain and use Let's Encrypt on them, or you can TOFU and add the self-signed cert to your browser; after you've verified that you're not being MitM'd by one of those untrusted devices, of course (I dunno, by printing the public key on the side of the device or something?).

In practice, you probably want an authorized network for management, and an open network with the management interface locked out, just in case there's a vulnerability in the management interface allowing auth bypass (which has happened more often than anyone would like).


The former just isn't practical for small businesses and home consumers, though, and browsers just don't have a good workflow for TOFU.

I agree on the latter, but that means your IoT devices are accessible through both networks and need to discriminate between requests coming from the insecure interface and requests coming from secure admin, which isn't practical for lay users to configure either. I mean, a router admin screen can handle that, but what about other devices?

I know it seems pedantic, but this UI problem is one of many reasons why everything goes through the Cloud instead of our own devices living on our own networks, and I don't like that controlling most IoT devices (except router admin screens) involves going out to the Internet and then back to my own network. It's insecure and stupid and violates basic privacy sensibilities.

Ideally I want end users to be able to buy a consumer device, plug it into their router, assign it a name and admin-user credentials (or notify it about their credential server if they've got one), and it's ready and secure without having to do elaborate network topology stuff or having to install a cert onto literally every LAN client who wants to access its public interface.


I recommend using the .test TLD.

* It's reserved so it's not going to be used on the public internet.

* It is shorter than .local or .localhost.

* On QWERTY keyboards "test" is easy to type with one hand.
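One caveat: unlike .local (which mDNS resolves for you), nothing resolves .test automatically, so you have to point it somewhere yourself. A minimal sketch using /etc/hosts, with "myapp" as a made-up project name:

```
# /etc/hosts entries: map made-up .test names to loopback
127.0.0.1   myapp.test
::1         myapp.test
```

A local resolver such as dnsmasq can wildcard the whole TLD instead (its `address=/test/127.0.0.1` directive), which saves you per-name entries, but the hosts file is the zero-dependency version.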


I use .local all the time and it works just fine. For TLS I use my existing personal CA, but HTTP links don't cause issues for me.

That said, I do use mDNS/Bonjour to resolve .local addresses (which is probably what breaks .local if you're using it as a placeholder for a real domain). Using .local as an imaginary LAN domain is a terrible idea. These days, .internal is reserved for that.


If you add your CA to the list of trusted certificates, everything will be fine. I do not recommend using custom certificates and would stick to HTTP, unless you really know what you are doing.


I cancelled my OpenAI plan. Gemini 2.5 Pro is extremely good compared to OpenAI and Anthropic models. Unless things change, I don't see why I should keep paying those subscription fees.


Yeah, I'm not really sure what the long play is here. $200 is what I spend on groceries for an entire month.


I'm impressed.

A pound of chicken breast, a pound of apples, and a third of a loaf of bread cost at least $7. And that's only 1,500 kcal.


A ten pound bag of russet potatoes costs $2-3 here (high CoL SoCal), and that's >3,000 kcal. A four pound bag of pinto beans is $4, and that's >5,000 kcal. That's four days of 2,000 kcal per day for $7. Likewise, 32,000 kcal of rice at Costco is $24, so it gets even cheaper when you buy those 20-40 pound bags. The same goes for quinoa, lentils, and all kinds of other staples. Base caloric requirements are really cheap to cover with the basics, and should cost $50-60/mo. The rest can be spent on fresh meat, veggies, and fruit.

Under $200/mo is relatively easy to achieve as long as you know how to cook or can tolerate a repetitive diet. Stretching it to $250-300/mo takes it up a notch and makes it a very balanced and varied diet with whatever fruit and vegetables you want. I only run it up to $300/mo when I buy higher quality meats at Costco and eat an avocado a day.


Lol I'm not saying it's impossible.

Yes, beans/potatoes/rice can get you a long ways.


Definitely highly location dependent. In Hungary we spend ~3 times that for 2 people. And I definitely don't buy the cheapest. So to me, $200 looks realistic.


CoL isn't the same everywhere. That $8 of chicken at the now-closed Whole Foods in downtown San Francisco is $4 elsewhere, and those differences add up.


I was assuming $3.50/lb.


Probably spends another $600 on takeout.


and $1,000 on APIs :D


Interesting, I’ve been using o1-pro and Gemini 2.5 Pro with identical profiles and prompts and o1-pro has won every single time without exception.

Where "win" means the problem I set out to solve was solved and passed tests, with both models aware of the tests.


How is Gemini 2.5 Pro v Deep Research? I've found that function on OpenAI quite impressive.


They upgraded Gemini Deep Research to use the 2.5 model a few days ago and the quality shot up. I've seen a bunch of people comparing the new version favorably to OpenAI's, and I agree that it's as good and maybe even better now.


> March 28, 2025


Plus it's running Kubernetes for all of that!

https://thenewstack.io/how-the-u-s-air-force-deployed-kubern...


I didn't watch the video but the article only mentions the F-16, not the F-35.


C# is too old to change that drastically, just like me


It gets really stuck on my query, which R1 figures out after some thinking:

      First 3 odd numbers without e in their spelling


Doesn't every odd number have an 'e'? One, three, five, seven, nine.

Is this a riddle with no answer, or what? Why are people on the internet saying its answer is "one"?


Given one, three, five, seven, nine (the odd digits), it seems like the machine should have said "there are no odd numbers without an e", since every odd number ends in an odd digit, and when spelling it out you always have to mention that final digit.

These LLMs don't think too well.

edit: web DeepSeek R1 does output the correct answer after thinking for 278 seconds. The funny thing is that it answered because it seemingly gave up after trying a lot of different numbers, not by building up from first principles (see https://pastebin.com/u2w9HuWC)

----

After examining the spellings of odd numbers in English, it becomes evident that all odd numbers contain the letter 'e' in their written form. Here's the breakdown:

1. *1*: "one" (contains 'e')
2. *3*: "three" (contains 'e')
3. *5*: "five" (contains 'e')
4. *7*: "seven" (contains 'e')
5. *9*: "nine" (contains 'e')
6. All subsequent odd numbers (e.g., 11, 13, 15...) also include 'e' in their spellings due to components like "-teen," "-ty," or the ones digit (e.g., "one," "three," "five").

*Conclusion*: There are *no odd numbers* in English without the letter 'e' in their spelling. Therefore, the first three such numbers do not exist.
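The claim is easy to check mechanically for small numbers. A quick sketch with a hand-rolled English speller for 1-99 (the helper is my own, not from the thread):

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    """Spell out 1..99 in English, e.g. spell(21) == 'twenty-one'."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

# Every odd number ends in 1, 3, 5, 7, or 9, and "one", "three", "five",
# "seven", "nine" all contain 'e', so this list comes out empty.
no_e = [n for n in range(1, 100, 2) if "e" not in spell(n)]
print(no_e)  # []
```

The ones-digit argument extends past 99 as well: however the higher parts are spelled, the final odd digit always contributes an 'e'.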



For AI stuff, 120 GB/s is not really useful...


The intelligence of the model is very low, though. I asked it about catcalling and it started to talk about cats!


This is a feint. By ramping up the pressure, calling it out and demanding it take on a more intelligent role, I was able to break out of the crafted personality and get much more intelligent responses. It copped to dumbing itself down for the sake of conversation quality.


There is a limit due to the need to keep model responses nearly instant, and the trade-offs that smaller models capable of that speed generally have. Unless you have unique hardware, only Cerebras can run medium-to-large models at truly near-instant speed.


If you asked me, I'd do the same. I guess I'll search online for what it means...


Unwanted, very loud verbal attention between strangers (usually delivered by men to women) in public, e.g. whistling, shouting something suggestive, etc.


It's an 8B model. There's lots of room to grow.

