Hacker News | fullstackchris's comments

The prose in the post is what I've been shouting from a rooftop since the LLM hype started.

Just tokens produced by weights.

Useful, but never forget that ground truth!


It's a parody of an original story that reversed the premise. In it, machine intelligences were marveling that an apparently intelligent species could in fact be created without any sort of reliable digital substrate, and instead just simulate the capabilities of real minds by using protein synthesis.

modern agents already do this via content negotiation and will attempt to retrieve the markdown version of a given site

https://www.sanity.io/learn/course/markdown-routes-with-next...
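For anyone unfamiliar with what that negotiation looks like on the wire, here is a minimal sketch in Python. It only builds the request and does not send it. The URL and the exact q-values are assumptions for illustration, not taken from any particular agent.

```python
import urllib.request


def build_markdown_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that prefers Markdown over HTML."""
    return urllib.request.Request(
        url,
        headers={
            # q-values rank the preferences: Markdown first, HTML as a fallback.
            "Accept": "text/markdown, text/html;q=0.8",
        },
    )


# Hypothetical URL, used only to show the header that would be sent.
req = build_markdown_request("https://example.com/page")
print(req.get_header("Accept"))
```

A server that supports this can answer with Markdown; one that doesn't just ignores the preference and returns HTML as usual.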


But that isn't that different from requesting the llms.txt version. Why not just make it so the useful content you want the LLM to focus on is easily retrievable from the same HTML the user's browser gets?

The sanity.io page writes:

> serving agents a bunch of HTML might just bloat their context window.

That's only true if you assume the agent can't extract the useful text before it goes into the model as tokens. Your browser's reader mode uses heuristics to identify what the actual content is in a large HTML response and strips away the rest.

To me this is a far better approach than worrying about an llms.txt file or looking at HTTP headers to see if markdown is preferred. That effort could instead be directed at ensuring the useful content on your site carries the appropriate markup for an agent or any other tool to extract it. And it would require less work to implement for the publisher of the content.
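A rough sketch of what I mean, using only Python's standard library. This is not how reader mode actually works (that relies on scoring heuristics); it is a much simpler stand-in that assumes the publisher wraps the content in `<article>` or `<main>`. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text inside <article> or <main>, ignoring scripts, styles and nav."""

    CONTENT = {"article", "main"}
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self.content_depth = 0  # > 0 while inside a content element
        self.skip_depth = 0     # > 0 while inside an element to ignore
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT:
            self.content_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT and self.content_depth:
            self.content_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside content and outside any skipped element.
        if self.content_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_content(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav>Home | About</nav>"
    "<article><h1>Title</h1><p>Useful text.</p><script>track()</script></article>"
    "<footer>(c)</footer></body></html>"
)
print(extract_content(page))
```

The navigation, script and footer never reach the output, so nothing extra has to be published for the agent's benefit.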


How can it know which tokens not to read without reading them? And llms.txt is a single file for the whole site... not the same

I was using llms.txt as the general idea of providing an alternative version of your content for agents - whether that's llms.txt for the entire site, my-article.md instead of my-article.html for a specific page, or via content-negotiation as your link prefers.

The content (HTML or Markdown) only becomes tokens when given to the model. Agents use parameters to limit the output from their tool calls all the time, precisely to reduce the number of tokens they have to pass to the model. So when an agent requests content for example.com/page and gets an 800KB response, those are not tokens yet. It could simply call a tool to extract the useful info before it gives the content to the model. That would effectively produce the same number of tokens as requesting example.com/page.md or example.com/page with request headers preferring markdown.

So why not just make sure the useful info is easily extractable from the same HTML? Less work, no content negotiation on the server side, no worrying about maintaining two similar versions of the same content.
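To make the "not tokens yet" point concrete, here is a toy sketch in Python. `page_tool`, `between_article_tags` and the `max_chars` parameter are hypothetical names for illustration, not any real agent framework's API, and the extractor is deliberately naive.

```python
from typing import Callable


def page_tool(raw_html: str, extract: Callable[[str], str], max_chars: int = 2_000) -> str:
    """Hypothetical agent tool: shrink a raw response before the model sees it."""
    text = extract(raw_html)
    # Cap the output, the same way agents limit other tool results.
    return text[:max_chars]


def between_article_tags(raw_html: str) -> str:
    """Toy extractor for the demo: the text between <article> and </article>."""
    _, _, rest = raw_html.partition("<article>")
    body, _, _ = rest.partition("</article>")
    return body.strip()


# Roughly 800KB of navigation noise around one useful sentence.
raw = "<nav>" + "menu " * 160_000 + "</nav><article>Useful info.</article>"
model_input = page_tool(raw, between_article_tags)

# Compare bytes received with characters actually handed to the model.
print(len(raw), len(model_input))
```

Only `model_input` ever gets tokenized; the size of the raw response is a bandwidth cost, not a context-window cost.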

As an aside, I've always been against content negotiation for different representations of content. So if you really must maintain two different versions of your content (HTML and Markdown, say) make them different URLs. I agree with Roy Fielding on this[1]:

> It is a bad design trade-off to send a bunch of header fields on every request just to tell the server all of the possible variations of preference held by the user, particularly when there is a very small chance that any of those dimensions are applicable to the target resource. It has been a bad design trade-off ever since the very brief period in 1993-94 when folks didn't know which image format would be usable on all UAs and there was no CSS or javascript to allow for client-side adaptation.

> ...The caching impact of proactive negotiation is far worse than the one extra round trip per site for reactive negotiation, and even that round-trip isn't necessary in formats that support client-side adaptation.

On the caching impact, see this from Simon Willison[2]:

> ...you can’t deploy an application that uses content negotiation via the Accept header behind the Cloudflare CDN — for example serving JSON or HTML for the same URL depending on the incoming Accept header. If you do, Cloudflare may serve cached JSON to an HTML client or vice-versa.
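The failure in that quote is easy to reproduce with a toy cache. This Python sketch is an assumption-laden illustration, not a model of how Cloudflare or any real CDN is implemented: `origin` and `Cache` are invented names, and the only thing that varies is whether the Accept header is part of the cache key.

```python
def origin(url: str, accept: str) -> str:
    """Stand-in origin server that negotiates on the Accept header."""
    return '{"page": 1}' if "application/json" in accept else "<html>page</html>"


class Cache:
    """Toy shared cache; vary_on_accept=False mimics keying on the URL alone."""

    def __init__(self, vary_on_accept: bool) -> None:
        self.vary_on_accept = vary_on_accept
        self.store = {}

    def get(self, url: str, accept: str) -> str:
        key = (url, accept) if self.vary_on_accept else (url,)
        if key not in self.store:
            self.store[key] = origin(url, accept)
        return self.store[key]


naive = Cache(vary_on_accept=False)
naive.get("/page", "application/json")          # a JSON client warms the cache
html_client_gets = naive.get("/page", "text/html")
print(html_client_gets)                          # cached JSON served to the HTML client

aware = Cache(vary_on_accept=True)
aware.get("/page", "application/json")
print(aware.get("/page", "text/html"))           # HTML, because Accept is in the key
```

With separate URLs for each representation the key is unambiguous and the problem never comes up.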

[Edited to add: if the source of truth is already Markdown in your system, by all means expose that. What I'm discussing here is related to efforts to produce new Markdown or plain text output, in addition to HTML, specifically for agents]

[1] https://lists.w3.org/Archives/Public/ietf-http-wg/2013JanMar...

[2] https://simonwillison.net/2023/Nov/20/cloudflare-does-not-co...


Hi HN,

To my pleasant surprise, I’ve been noticing a sort of revival of posts related to writing here and on the web at large. This is either due to my own return to the craft of writing, or it really is a broader trend of AI power users realizing that AI is really, really bad at generating good prose.

Regardless, I’ve found AI to be a useful tool for filling in the gaps in creative writing craft and terminology that I never learned in school. I am an engineer by training and trade, so most of the writing “skills” we learned were for technical writing, lab reports, and the like.

This site was also heavily inspired by the “Laws of Software Engineering” post that was quite popular a few weeks back:

https://news.ycombinator.com/item?id=47847179


except AI writing is near 100% detectable. check out something like pangram. no matter what you generate, the cadence of the word choices, sentence structures, etc. is always the same and often blatantly visible in the prose. in fact i doubt an LLM of any size, now or in the future, can properly write without a "fingerprint". in almost any language, the number of possible ways to write even just a few grammatically valid sentences already exceeds the number of atoms in the universe. because LLMs are transformers, they will always leave behind clues.
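a back-of-the-envelope check of the atoms comparison, in Python. every number here is an assumption: a 10,000-word vocabulary, 30 words for "a few sentences", and the commonly cited order-of-magnitude estimate of 10^80 atoms in the observable universe. it counts every word sequence, grammatical or not, so it is only an upper bound on the valid ones.

```python
vocabulary = 10_000      # assumed working vocabulary size
sentence_words = 30      # assumed length of "a few sentences"
atoms = 10 ** 80         # commonly cited order-of-magnitude estimate

# Number of distinct word sequences of that length (ignores grammar).
sequences = vocabulary ** sentence_words

print(sequences > atoms)
print(len(str(sequences)) - 1)  # the exponent, i.e. sequences == 10**120
```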

It's wild to me that working 80% (1 day off a week) or even 60% (2 days off a week) isn't even a concept in the US, while in Europe such part-time arrangements make up a huge share of the workforce.

In short, people have been having the day off for decades now. It's called part-time work.


There are also widely accepted standards for the written word.

The best example is when an abbreviation can be expanded to more than one phrase, and both are widely used.


> The world doesn't make sense. It's always been this way, so we don't even know another way to exist.

This is the main line for me. Even around the bonfire with fellow grugs, if you got eaten by a tiger, I'm not sure you would fully understand that either. So I'm not exactly sure what this post is getting at? That human history so far is "bad" and we "did it the wrong way"? I'd argue 99% of human adults are just folks trying to do their best to provide for their family. Maybe I'm too much of an optimist though.


Agreed. Do we have any information on what these "vulnerabilities" actually are? Vulnerabilities are typically reported immediately to CVE or NIST... are these "so destructive" they have to be kept behind closed doors? Give me a break...


I don't see the problem - everything the author describes has always been, and will always be, true. You can't vibe code anything of value in a weekend exactly because anyone _else_ with the same level of experience can do the exact same thing in the same weekend! This has always been true across all trades and technologies. Once again, the domain expertise, wisdom, and simply _time_ of doing something always win. LLMs literally don't change that at all.


Have you used Gemini models for code work? Claude and Codex are miles ahead in terms of quality and how thorough they are.

