I think there are two main reasons you tend to be a slow thinker:
1. You're not focused enough. (You pay too much attention to what people think about what you're going to do, instead of focusing on the question itself.)
2. You haven't practiced enough. (This is the first time you've encountered this kind of problem.)
Why bother using human language to communicate with a computer? You interact with a computer using a programming language—code—which is more precise and effective.
Specifically:
→ In 1.0, you communicate with computers using compiled code.
→ In 2.0, you communicate with compilers using high-level programming languages.
→ In 3.0, you interact with LLMs using prompts, which arguably should not be in natural human language.
Nonetheless, you should communicate with AGIs using human language, just as you would with other human beings.
Why bother using higher-level programming languages to communicate with a computer? You interact with a computer using assembly - raw bit shifting and memory addresses - which is more precise and effective.
Using assembly is not really more precise in terms of solving the problem. You can definitely make an argument that using a higher-level language is equally if not more precise. Especially since your low-level assembly is limited in which architectures it can run on, you can argue that the C++ that generates that assembly "more precisely defines a calculator program".
I agree with your general point, but C++ isn't a great example, as it is so underspecified. Imagine as part of our calculator we wrote the function:
int add(int a, int b) {
return a + b;
}
What is the result of add(32767, 1)? The width of int is itself implementation-defined (it may be as narrow as 16 bits), and when the sum overflows, C++ does not presume to define just one meaning for such an expression. Or even any meaning at all: signed overflow is undefined behavior, so what to do when the program tries to add ints that large is left to the personal conscience of compiler authors.
Precision is not boolean (present or absent, 0 or 1); there are many numbers between 0 and 1. Compared to human languages, programming languages are much more precise, which makes the results much more predictable in practice.
I can imagine an OS being written in C++ and working most of the time. I don't think you can replace Linux, written in C, with any number of LLM prompts.
An LLM can be a [bad, so far] programmer, but a prompt is not a program.
Using code may not be more precise than English in terms of solving a problem. Take the NHS. With better AI, saying "build a good IT system for the NHS" might have worked better than this stuff: https://www.theguardian.com/society/2013/sep/18/nhs-records-...
You can express dang near anything you wish to express in assembly in a higher-level programming language, because it is designed to allow that level of clarity and specificity. In fact, most have compile-time checks to stop you if you have not properly specified certain behavior.
The English language is not comparable. It is a language designed to capture all the ambiguity of human thought, and as such is not appropriate for computation.
TLDR: There's a reason why programmers still exist after the dawn of 4GL / 'no code' frameworks. Otherwise we'd all be product managers typing specs into JIRA and getting fully formed applications out the other side.
That’s what this is. It’s caching the state of the model after the prompt tokens have been processed. Reduces latency and cost dramatically. The cache usually has a ~5 minute TTL.
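As a concrete sketch (assuming an Anthropic-style API, which is where the ~5 minute TTL figure comes from; the model name and the LONG_SHARED_CONTEXT placeholder are mine, not anything from this thread), you mark the long shared prefix as cacheable and reuse it across calls:

import anthropic

# Illustrative only: stand-in for a large system prompt / document dump.
LONG_SHARED_CONTEXT = "...many thousands of tokens of reference material..."

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SHARED_CONTEXT,
            # Ask the provider to cache the processed state of this prefix
            # for a few minutes, so repeat calls don't re-process it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "A question about the shared context."}],
)
print(response.content[0].text)

Calls within the TTL that repeat the exact same prefix hit the cache, which is where the latency and cost savings come from.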
Interesting! I’m wondering, does caching the model state mean the tokens are no longer directly visible to the model? i.e. if you asked it to print out the input tokens perfectly (assuming there’s no security layer blocking this, and assuming it has no ‘tool’ available to pull in the input tokens), could it do it?
The model state encodes the past tokens (in some lossy way that the model has chosen for itself). You can ask it to try, and assuming its attention is well-trained, it will probably do a pretty good job. Being able to refer to what is in its context window is an important part of being able to predict the next token, after all.
There's no difference between feeding an LLM a prompt and feeding it half the prompt, saving the state, restoring the state, and feeding it the other half of the prompt.
I.e., the data processed by the LLM is prompt P.
P can be composed of any number of segments.
Any number of segments can be cached, as long as all preceding segments are cached.
The final input is P, regardless.
So; tl;dr: yes. Anything you can do with a prompt you can do, because it's just a prompt.
When the prompt is processed, there is an internal key-value cache that gets updated with each token processed, and is ultimately used for inference of the new token. If you process the prompt first and then dump that internal cache, you can effectively resume prompt processing (and thus inference) from that point more or less for free.
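Here's a minimal sketch of that with Hugging Face transformers (gpt2 and the example strings are arbitrary choices of mine; past_key_values is the internal key-value cache being described):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

first = tok("The quick brown fox", return_tensors="pt")
second = tok(" jumps over the lazy", return_tensors="pt")

with torch.no_grad():
    # Process the first half of the prompt once and keep its key-value cache.
    out1 = model(**first, use_cache=True)
    cache = out1.past_key_values  # the state you could dump and later restore

    # Resume from the cache: only the second half's tokens are processed now.
    attn = torch.cat([first.attention_mask, second.attention_mask], dim=-1)
    out2 = model(second.input_ids, attention_mask=attn,
                 past_key_values=cache, use_cache=True)

    # Compare against processing the whole prompt in one go.
    full = tok("The quick brown fox jumps over the lazy", return_tensors="pt")
    out_full = model(**full, use_cache=True)

# The next-token logits should match up to floating-point noise, i.e.
# split-save-restore is equivalent to feeding the full prompt at once.
print(torch.allclose(out2.logits[:, -1], out_full.logits[:, -1], atol=1e-4))

Provider-side prompt caching is the same trick, just with that state held on the provider's servers and keyed by the cached prefix.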
Depends on what front end you use. But in text-generation-webui, for example, Prompt Caching is simply a checkbox under the Model tab that you can select before you click "load model".