Hacker News | throwaw12's comments

How do you run it reliably/durably?

Does it have automatic checkpoints to be able to resume conversations?


Is this another way of saying: HTML will increase revenue by increasing token usage, so let's advocate for it?

Europe is already moving towards having its own payment system with Wero.

Most European countries I frequent already have their own local equivalent to Pix. In Spain almost everyone has Bizum, and it's not uncommon for vendors at markets to accept it too; Sweden has Swish, which is the same deal for the Swedes. I think this lists the most prominent ones in wide usage today: https://en.wikipedia.org/wiki/European_Mobile_Payment_System...

Apparently, some of these already have interoperability between each other too (https://en.wikipedia.org/wiki/European_Payments_Alliance), happy to learn I can now send/receive money to/from my Portuguese brethren :)


From May 5th, most shops will begin integrating Bizum into their payment systems.

In Spanish:

https://cincodias.elpais.com/companias/2026-04-01/a-partir-d...


Since Bizum became widespread, depending on the size of the business, some shops would always accept Bizum if you asked nicely :)

Would be amazing for me to use Swish to pay for things in Brazil as if I had Pix! Could that work somehow?

Wero is a private effort, the digital euro is closer to Pix but it keeps being stalled (given the article in question, one could make some guesses as to why).

I was shocked that none of my cards worked for lots of stuff in the Netherlands. They also have their own system.

Curious to know why they are not hitting their limits.

In the organization I work at, things are crazy at the moment; we are drinking tokens as if we were in a hot desert, and 1k is barely enough for a week for some people.


On heavy coding days it can go up. But most people aren’t coding all day. And research and docs tend to be pretty gentle on the tokens IME. Only time I hit my limit was coding for 80 hours in a war-room. (Also I mostly use cursor which is more efficient with its implicit use of light models and indexed workspace).

People who lived through 2001 and 2008 crashes, did it look like this or was it even worse than what's happening these days with so many layoffs?

2001 affected tech mainly, so a lot of folks went to other industries still hiring. And 2008 affected other industries more than tech, so the inverse.

2008 really wasn't that bad if you were in tech...

No idea about 2001, but I've heard it was fairly rough. More recently I've seen people say it's harder to find work today, I think in part because in 2001 it was mostly tech companies laying off talent, while corporates less impacted by the dot-com bubble were still building out their engineering teams.


The problem with the 2001 dot com bust was that it came on the heels of the telecoms downturn, so the two biggest (at the time) tech sectors were in trouble simultaneously.

Yes, there was still corporate IT - and some areas like finance were positively booming. But for online retail, media, advertising, etc it was a wasteland for 4-5 years. Plenty of people never found a way back into the industry.

To me, it felt much worse then than it does now (though perhaps the USA is being hit much harder at the moment).

2008 did hit tech but, outside of finance, the shock was over much quicker. The effects on the bricks & mortar economy were more obvious, though, so it got covered more in the media.


I agree. It was still ok for most people in tech. Maybe rough for that single year. After that it was nothing but boom times.

I can't speak to 2001.

This feels like something much worse and weirder.


The problem today is we had 10 years of “learn to code” and a lot of people did. Similar to 2001. 2009 wasn’t really bad for tech since we had a huge shortage of workers in the space. Companies would hire straight out of mid-tier universities. CS departments were desperate for students.

From memory, this is the only recent layoff where the company was not only profitable but also growing.

It's not clear if we are at the beginning of a crash or if the everything-bubble still has some pressure left to expand. But whatever impressions people report here are about the past, and thus not about any crash.


2001 set tech jobs back a few years. 2008 wasn't nearly as bad for tech because as money got tight, there was an impetus for companies to invest in tech to cut costs. I think a similar thing might be unfolding now with non-tech companies investing in AI to streamline processes.

I don't remember 2001. But the 2008 tech job market was way, way better than today's.

In 2008 the global economy came somewhere between hours and days from completely crashing if AIG hadn't been bailed out. Other than Covid, it's only the second time in the past 50 years unemployment hit double digits, the other time being the early-80s recession in the wake of the 1979 energy crisis, which saw inflation go as high as 13.5% and the prime interest rate hit 21.5%. You're probably only concerned about your own industry, but even now, unemployment is still around the lowest it's been since WWII, outside of the past couple of years and the late 50s.

It'll hopefully be another 40 years before I have a full lifetime of experience and can see how I ultimately feel about this, but right now, my sense is software saw a huge boom in the 2010s, a la aerospace in the 60s and finance in the 90s, and it isn't going to die, but that boom was never going to last forever, either. Being a specialist surgeon was always the only thing even close to a guarantee that you'll make half a mil annually with supreme job security. Everything else sees booms, busts, regional disparities, and power laws that make it hard to even talk to each other about it, because nobody's experience is universal. Even now, in my particular niche of the industry, I don't know anybody who's been laid off. My own company and our competitors are not exactly drowning in cash (I work largely on commission and it's been a terrible quarter), but we're expanding headcount, not reducing.

Conversely, in the 2010s as software boomed and I did terrifically, basically my entire family is in trades and it was totally different for them. Drastic cyclical instability, projects started but then canceled all over the place, injuries, bankruptcies, drug addiction, prison terms. But that's also in California. I live in Texas and construction here seemingly mostly stayed in the boom state. All the tradesmen I know from here rather than family did much better. We also had roughneck as a lucrative fallback option for anyone that didn't mind living in the middle of nowhere thanks to the fracking and shale booms. Computer geeks from 2006 to 2021 or so also had that kind of easy skill transfer fallback thing thanks to the boom in computational data analytics due to advances in data storage and machine learning technologies.

We might even do well to remember that hyperscalers drowning in ad money for the past 20 years had a practice of intentionally overhiring to hoard talent but not give them anything productive to do, putting them into restrictive NDAs and non-competes largely to prevent them from starting their own companies or working for competitors. If that practice is ending, it floods the labor market, driving down wages, and reduces industry-wide employment metrics, but it's not death of the profession so much as ending a market distortion. Maybe it even supercharges entrepreneurialism, but right now we just seem to see a boom in the "solo indie dev" putting out reams of slop. At some point, people have to actually work together and have a real product vision that solves a problem other than using AI to make dev tools to harness AI for making more dev tools.


it was worse

> The math and coding part is impressive but the agentic one is not.

I think this is very important if these are to eventually become a viable replacement for coding models, because most of the time coding harnesses leverage tool calls to gather context and then write a solution.

I am hopeful that one day we can replace Claude and OpenAI models with local SOTA LLMs.


It's pretty close already. Check qwen3.6 27b if you haven't already. People are vibe and agentic coding with it on a single GPU.

It is more finicky than Claude but if you hand hold it a bit it's crazy.


I see that going around, and either the test cases are too simplistic or I'm doing something wrong. I have a server with a 3090 in it, enough to run qwen3.6, but I haven't had much luck using it with either codex or oh-my-pi. They work, but the model gets really slow with ~64k context and the attention degrades quickly. You'll sometimes execute a prompt, and the model will load a test file and say something like "I was presented with a test file but no command. What should I do with it?".

So yeah, while it's true that qwen3.6 is good for agentic coding, it's not very good for exploring the codebase and coming up with plans. You need to pair it today with a model capable of ingesting the whole context and providing a detailed plan, and even then the implementation might take 10x the amount of time it'd take for sonnet or Gemini 3 to crunch through the plan.

EDIT:

My setup is really as simple as possible. I run ollama on a remote server on my local network. In my laptop I set OLLAMA_HOST and do `ollama pull qwen3.6:27b`, which then becomes available to the agent harnesses. I am not sure now how I set the context, but I think it was directly in oh-my-pi. So server config- and quantization-wise, it's the defaults.
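
The remote setup described above boils down to something like this (the host/IP is hypothetical; 11434 is ollama's default port):

```shell
# Laptop side: point the ollama CLI at the daemon running on the server
# on the local network (IP here is a made-up example).
export OLLAMA_HOST=http://192.168.1.50:11434

# Pull the model through the remote daemon; agent harnesses pointed at
# the same OLLAMA_HOST will then see it.
ollama pull qwen3.6:27b
```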


I have an old Supermicro X10DRU-i server with two Tesla V100s (48 GB VRAM) and 128GB RAM and have been running qwen3.6-27B with a lot of success. I would say its performance on my use case (modifying and extending a ~70kloc C++ code base) has been excellent. I have no benchmarks, but it seems comparable to claude sonnet 4.6 in capabilities. I run it with llama.cpp:

    llama-server -m Qwen3.6-27B-Q8_0.gguf -c 131072 --tensor-split 0.4,0.6 --batch-size 256 --cont-batching --flash-attn on -ngl 999 --threads 16 --jinja

I regularly get ~22 tok/s when context utilization is below 65k, but it does slow down to ~13 tok/s when the context is nearly full (lots of swapping to RAM). I have been using the qwen-code harness, though, since it is far more token efficient than claude-code, which injects massive prompts that chew up the context window. I plan on trying it with pi next.

I'm keeping my ~$20/mo claude subscription for the planning prompts, and then handing off to qwen for implementation. It's been working well so far.


This link [1] features some good insight into how to adapt your usage to smaller models, which require more explicit or deliberate prompting. I have been using Gemma 4 31B a lot and have found it very competent. It can be a bit unstable and start spiraling or end up in infinite loops that you need to reset, but for the most part it's been really good.

[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...


Thanks for the link! Interesting paper, and little-coder looks purpose-built for my own local-model agentic experiments.

I can see that and I don't know your setup, but there are people pushing >70t/s with MTP on a single 3090, with big contexts still >50t/s. 64k is not a lot for agentic coding, and IIRC 128k with turboquant and the likes should be possible for you. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.

EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090

EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk


club-3090 with llamacpp did it. Full 262k context, usable in oh-my-pi. Still testing it, but initial results are promising.

I had to make a couple of adjustments though. After downloading the model with hf, I needed to move the mmproj-F16.gguf to the parent folder:

    tree /media/fast-storage/club-3090-models/qwen3.6-27b/
    /media/fast-storage/club-3090-models/qwen3.6-27b/
    ├── mmproj-F16.gguf
    └── unsloth-q3kxl
        └── Qwen3.6-27B-UD-Q3_K_XL.gguf
Then, on starting the server, the container complained that llama-server wasn't a known binary, so I needed to add PATH="/app:$PATH" to the entrypoint of the llama service.

The only thing that's missing is for llama to emit thinking blocks that oh-my-pi can parse, but it's running alright. That's mostly cosmetic.


I managed to execute with vllm successfully, but it breaks opencode on a simple "what's this repo?" task. On oh-my-pi it won't even execute because omp sends multiple system prompts. I'll try with llama.cpp later and see if it works more reliably.

will give more info in the post

EDIT: thanks for the links!


For context, I feel like I have a "free Sonnet" now that I've got Qwen3.6 35B running on my 5070ti at home (I connect to it via Tailscale). I run it _almost exactly_ the same as this Reddit post, which found a good way to squeeze the 35B model onto a GPU with 16GB of VRAM: https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_507... I really like it: it's slightly more operationally complex (I had to write a script to start it), but now that I have it, I literally never have to change it. It's a folder with llama-server and the model.gguf in it; I run the script, which starts serving the model, done.

Like that post, I get 75 tokens/second. The exact model is Qwen3.6-35B-A3B-UD-Q4_K_M.gguf, and I get 128k of context.

I run it on my home machine and connect to it from anywhere over Tailscale, through the opencode CLI, which I configure by adding the following provider to my `~/.config/opencode/opencode.json`:

    {
      "provider": {
        "vllm": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "local-llm-qwen3.6-35B",
          "options": {
            "baseURL": "http://homepc.tail987654.ts.net:8033/v1"
          },
          "models": {
            "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf": {
              "name": "Qwen3.6-35B"
            }
          }
        }
      }
    }
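For the curious, the start script mentioned above might look roughly like this; the model path and flags are my assumptions, with port 8033 chosen to match the baseURL in the opencode config:

```shell
#!/bin/sh
# Hypothetical launcher, run from the folder holding llama-server and
# the gguf; paths and flags are illustrative, not the poster's exact setup.
MODEL=./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

# -c 131072      -> the ~128k context mentioned above
# -ngl 999       -> offload all layers to the GPU
# --host 0.0.0.0 -> listen on all interfaces so Tailscale peers can connect
exec ./llama-server -m "$MODEL" -c 131072 -ngl 999 --host 0.0.0.0 --port 8033
```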

You're not sharing what quantization you're using. In my experience, anything below Q8 and less than ~30B tends to be basically useless locally, at least for what you typically use codex et al. for; I'm sure it works for very simple prompts.

But as soon as you go below Q8, the models get stuck in repeating loops, get the tool-calling syntax wrong, or just start outputting gibberish after a short while.


will do that in an edit to the post

Sure, waiting :)

In the meantime, Ollama seems to default to "Q4_K_M", which is barely usable for anything and really won't be useful for agentic coding; the quantization level is just too low. Not sure why Ollama defaults to basically unusable quantizations, but that train left a long time ago; they're more interested in people thinking they can run stuff than in flagging things up front, and have been since day 1.


Ollama is definitely not the way to go once your interest moves beyond "how quickly can I run a new LLM" to "how do I use a local LLM in a remotely optimal way".

I'm currently giving club3090 a try, it seems to have lots of pre-configured setups depending on the workflow. I'm trying vllm first, then with llama.cpp.

Yeah. Context size matters a lot. With OpenCode dumping like 10k tokens into the system prompt, it takes like 4 rounds before it has to compact at, say, 64k. It's not really worth it to run at anything below 100k, and even then the models aren't all that useful.

They're also pretty terrible at summarization. Pretty much always, some file read or write in the middle of the task would cross the context margin and it would get marked as completed in the summary. I think leaving the first prompt as well as the last few turns intact would improve this issue quite a lot, but at low context sizes that's pretty much the whole context...


I see your updated post. Switch over to llamacpp and look up recommended quants and settings. A good place for this info is on /r/localllama

Yep! I'm currently trying vllm, then I'll give llamacpp a try too

Something promising I found is "DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090" - https://github.com/Luce-Org/lucebox-hub

Didn't know 3.6 was available on Ollama outside of macOS!


When you run ollama serve, make sure you override the context size to about 32K. Also, I give the model a useful short README.md on the code it is writing or modifying, and a Makefile with useful targets for the agent to use. I usually use Claude Code with qwen3.6

I also go outside for fresh air while I wait for a session to run.
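
Two ways to do the context override above (the env var needs a reasonably recent ollama; the Modelfile route is older; both are sketches, not the poster's exact commands):

```shell
# Option 1: set a server-wide default context window
# (supported in recent ollama releases).
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Option 2: bake the context size into a named model variant.
cat > Modelfile <<'EOF'
FROM qwen3.6:27b
PARAMETER num_ctx 32768
EOF
ollama create qwen3.6-32k -f Modelfile
```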


The (unsloth dynamic) 4-bit quants of Qwen 3.6 kept getting stuck in circles for me. Even though it doesn't benchmark as well, GLM 4.7 Flash at least keeps making progress if I keep nudging it, so I have actually been able to finish some apps with it.

Qwen3.6 supports 266k context out of the box. Try using a q8 KV cache to enable more of it.
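
With llama.cpp, the q8 KV-cache suggestion maps to flags like these (model path and context size are illustrative; quantizing the KV cache roughly halves its memory versus f16, and the quantized V cache needs flash attention enabled):

```shell
# Quantize the KV cache to q8_0 so more context fits in 24GB of VRAM.
llama-server -m Qwen3.6-27B-Q8_0.gguf -c 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on -ngl 999
```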

I limited it to 64k, expecting 24GB of VRAM not to be enough for the entire context window, but I'll try the others' suggestions.

I agree that for planning it's not there yet. But I wouldn't be surprised if something came out in a similar weight class.

> qwen

I only have luck with pi and qwen bashing out 100-line scripts. Everything real needs a planful model. To your point:

> You need to pair it today with a model capable of ingesting the whole context and providing a detailed plan…

Curiously, ANTHROP\C seems determined to ensure you don't use your Opus 4.7 Max 1M tokens for this any more; instead it sics Haiku on your context to "sample" using a weird pile of inchoate regex to return "no more than 50 lines" or similar uselessness, then finally Opus goes and burns tokens cogitating a solve for a problem shape that doesn't have anything to do with the areas of interest, inevitably unsampled.

I really, really, really want a "no subagents, no sampling" mode (telling it all subagents are Opus in env vars doesn't seem to persuade it to go ahead and use those 1M tokens to just, you know, read the damn file). Ironic if getting the best out of Opus cannot be done in their own harness.

All this said, it seems most people think AI saves them time and money so long as it costs no money — feels like ANTHROP\C is optimizing for that. I get it.

But can we also have a `ANTHROPIC_ENABLE_HIGH_ROI=1` mode please?

It costs more fixing all these unnecessary oversights than it would cost to just do the toil the machine is here to do.


Try oh-my-openagent plan mode.

Vibe coding on consumer hardware is still very limited; this is especially true on GPUs, whose RAM limit is around 16 (maybe 24) GB for the vast majority (although Macs change the equation).

These are two real-world experiments whose results are disappointing for those expecting performance comparable to cloud services:

- https://deploy.live/blog/running-local-llms-offline-on-a-ten...

- https://betweentheprompts.com/40000-feet/

The first even uses the 35b version of qwen3.6.


I don't disagree; I just want to say I've been really liking DeepseekV4, which is on par with Sonnet 4.6 in my coding tests.

I can't vibe code on an M3 Max (48GB) like I do with claude or codex. Far from it.

I don't see how it's disappointing. 95% correct using the 35b model, before the right quants even came out, on a laptop? And they still got tons of code written for them.

On a real GPU, using 27b with the latest quants, the experience is better. It's still not the same as opus running on a subsidized GPU farm. Well, it is better for privacy at least.

I find it interesting how 2 people can read the same thing and come to very different conclusions.


I'm just using Qwen3-coder-next; I tried the new ones and the thinking mode is just too much. I'm still ending up with 'vibe' coding that's slow enough to catch when it does stupid things.

Eh. It is good in terms of results (accuracy, good recommendations, and so on), but slow when it comes to actual inference. On a local 128gb machine, it took over 5 minutes to brainstorm a garage door opening mechanism with some additional restrictions for spice.

I find it hilarious that waiting 5 whole minutes to design software is considered so slow that people call it not useful. My god, lol.

Is that 128gb RAM or VRAM?


It's the unified memory in this case (Ryzen AI Max), so obviously there is some room for improvement there. Still, I would not dismiss the speed out of hand. Remember, we are trying to argue here that 'it is pretty close already'. In some ways, it is. In others, it is not yet.

That's absolutely possible. As the field advances, we'll soon see small models smart enough to be judged not by parameter count but by their reasoning and intelligence. You can see examples like Qwen 3.6 27B.

Yeah, this is key; a lot of people are still just looking at the number of params and thinking these models are toys. What Qwen 3.6 has shown is that reasoning and tool calling are just as important, if not more so.

Setting aside the "AI is making us productive" talk:

Can anyone share how and when they see the market getting into better shape?

Specifically, I'm curious how we would be working with AIs even if the market does get into better shape.


> If any store used dynamic pricing to expand their margins, the others would just do the same and compete away those margins once again, with the marginal gain being handed back to consumers.

How would you do that if pricing is dynamic and changes every day?

By the time a competitor finds out about the price, you might have already reduced it, making it look like theirs is more expensive even after they applied a discount.


Because the exact same is true in reverse

(no impact == no staffing, no resource allocation) -> Leadership: "please, stop working on it"

> How about we turn down the heat, everyone?

How about Anthropic turns down the heat and refunds money to everyone for every bug its LLM created?

