milansuk's comments

No need to use Ollama. llama.cpp has its own OpenAI-compatible server[0] and it works great.

[0] https://github.com/ggerganov/llama.cpp#web-server
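
For what it's worth, once the server is running (something like ./llama-server -m model.gguf --port 8080, though the binary name and flags have moved around between versions), any OpenAI-style client can talk to it. A minimal sketch in Go, assuming the default local endpoint and no API key:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // llama.cpp's server exposes an OpenAI-compatible endpoint;
        // the "model" field is mostly ignored when serving a single local model.
        body, _ := json.Marshal(map[string]any{
            "model": "local",
            "messages": []map[string]string{
                {"role": "user", "content": "Say hello in one sentence."},
            },
        })
        resp, err := http.Post("http://localhost:8080/v1/chat/completions",
            "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        out, _ := io.ReadAll(resp.Body)
        fmt.Println(string(out)) // raw JSON, same shape as OpenAI's chat API
    }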


Thanks, I didn't know that.

Do you happen to know the reason to use Ollama rather than the built-in server? How much work is required to get similar functionality? It looks like just downloading the models? I find it odd that Ollama took off so quickly if llama.cpp had the same functionality built in.


Yes, I'm aware. I was contrasting the general use of an inference server vs. calling llama.cpp directly (not via HTTP request).

And among servers Ollama seems to be more popular, so it's worth mentioning when talking about support for local LLMs.


Higher-volume legal actions can only succeed as "peer-to-peer". If they go through a courtroom, there is no way the courts can handle this kind of volume (even with AI tools). Imagine the CEO being informed that there are 20 new legal actions and that the first court hearing will start 10-20 years from now, when the CEO will no longer be part of the company.


> It doesn't do that fancy "make this text more professional"

I looked into the Scramble code[0] and it seems there are a few pre-defined prompts (const DEFAULT_PROMPTS).

[0] https://github.com/zlwaterfield/scramble/blob/main/backgroun...


True. The only solution is to keep your data outside the cloud (aka someone else's computer), no matter what encryption you use.


Also means it can’t transit the internet. So actually, only on airgapped networks.


If we're going to extremes like that, airgapped networks aren't truly safe either.


Could you explain why that is? If I have an airgapped smart home network, someone has to come physically sniff the packets. If it’s only over ethernet, they have to physically plug in. That’s not a scalable attack strategy.


There are tons of ways to exfiltrate data from air-gapped systems if you can manage to get something installed on them. Ones I've read about include toggling the Caps Lock LED and recording it with a camera, and encoding data into the CPU fan speed and capturing the sound with a microphone for analysis (run a spin loop for a 1, thread.sleep for a 0). Variations of these also work with screen brightness or by monitoring power lines.

My personal favourite is one where they send specific patterns of data over USB: the EM fields generated by the "data" flowing over the wire form a carrier signal onto which the real payload can be encoded, and it can be received up to 5 m away. This requires no additional hardware.

All of these require some malware installed on the system and offer only a tiny amount of bandwidth, but if there's a man on the inside, all they have to do is install the malware, without having to worry about additional hardware to get the data out of the machine.
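
To make the fan-speed channel concrete, the transmitter side really is this simple; a hedged sketch in Go (the timing constant is made up, and the receiver's audio analysis is the hard part and is omitted):

    package main

    import (
        "fmt"
        "time"
    )

    // transmitBit burns CPU for a "1" (fan spins up) and sleeps for a "0" (fan spins down).
    // The bit period has to be long enough for the fan controller to react, which is why
    // these channels only reach a few bits per minute.
    func transmitBit(bit byte, period time.Duration) {
        if bit == 1 {
            deadline := time.Now().Add(period)
            for time.Now().Before(deadline) {
                // busy-loop to keep the CPU loaded
            }
        } else {
            time.Sleep(period)
        }
    }

    func main() {
        message := []byte{1, 0, 1, 1, 0, 0, 1, 0} // e.g. one byte of exfiltrated data
        for _, b := range message {
            fmt.Println("sending bit", b)
            transmitBit(b, 30*time.Second)
        }
    }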


Also, the safest data is the data that was never sampled into digital form and stored in computer systems.


> But file syncing is a “dumb” protocol. You can’t “hook” into sync events, or update notifications, or conflict resolution. There isn’t much API; you just save files and they get synced. In case of conflict, best case, you get two files. Worst — you get only one :)

Sync services haven't evolved much. I guess a service that provided lower-level APIs and richer data structures (CRDTs, etc.) would be a hacker's dream. Also, E2EE would be nice.

And if they closed up shop, I would still have all the files on my devices.
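
As an illustration of what "smarter than two conflicting files" could mean, here's a toy last-writer-wins register in Go. Real CRDT libraries are far more involved, and the sync-service API around it is purely hypothetical:

    package main

    import (
        "fmt"
        "time"
    )

    // LWWRegister is a toy last-writer-wins CRDT: merging two replicas always
    // converges to the value with the newest timestamp, instead of producing
    // a "conflicted copy" file.
    type LWWRegister struct {
        Value     string
        Timestamp time.Time
    }

    func (r *LWWRegister) Set(v string) {
        r.Value = v
        r.Timestamp = time.Now()
    }

    // Merge folds another replica into this one; both sides end up identical
    // regardless of the order in which the merges happen.
    func (r *LWWRegister) Merge(other LWWRegister) {
        if other.Timestamp.After(r.Timestamp) {
            r.Value = other.Value
            r.Timestamp = other.Timestamp
        }
    }

    func main() {
        var laptop, phone LWWRegister
        laptop.Set("draft v1")
        phone.Set("draft v2 (edited offline)")

        laptop.Merge(phone)
        phone.Merge(laptop)
        fmt.Println(laptop.Value == phone.Value) // true: both replicas converged
    }

The point is that the merge is deterministic, so every device converges to the same state instead of spawning conflict copies.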


This looks cool for a v1! The only problem I see is that most devices don't have much RAM, so local models are small and most requests will go to the servers.

Apple could use it to sell more devices: every new generation can have more RAM, which means more privacy. People will have a real reason to buy a new phone more often.


Apple seems to be anticipating higher RAM needs in its M4+ silicon chips: there are rumors they are including more RAM than specified in their entry-level computers.

https://forums.macrumors.com/threads/do-m4-ipad-pros-with-8g...

One reason could be future AI models.

I'm not sure if this has been verified independently, but interesting nonetheless and would make sense in an AI era.


I don't see any explanation for why they trained 8B instead of 7B. I thought that if you have a 16GB GPU, you can fit a 14GB (7B * 16 bits) model into it, but how does it fit if the model is exactly 16GB?


The bigger size probably comes from the bigger vocabulary in the tokenizer. But most people are running this model quantized to at least 8 bits, and it still holds up reasonably well down to 3-4 bpw.
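
A rough back-of-envelope, ignoring the KV cache and activation overhead: 8B params x 2 bytes (fp16) ≈ 16 GB, which already doesn't fit on a 16GB card once anything else needs memory; at 8 bits it's ≈ 8 GB, and at 4 bpw it's ≈ 4 GB, which is why the quantized versions fit comfortably.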


> The bigger size is probably from the bigger vocabulary in the tokenizer.

How does that affect anything? It still uses 16-bit floats in the model, doesn't it?


Upgrade to a 24GB GPU?


Any recommendations?


3090, trivially.

No reason to go 4090 as it's no more capable, and the 5090 is probably not going to have more than 24GB on it either, simply because Nvidia wants to maintain their margins through market segmentation (and adding more VRAM to that card would obsolete their low-end enterprise AI cards that cost $6,000+).


Appreciate the info!

In another thread I saw a recommendation for dual 3090s if you're not doing anything gaming-related, so it's good to have some confirmation there.


I'd also consider dual A6000 48GB (96GB total) if you have a budget of $8,000, or dual V100 32GB (64GB) if you have a budget of $4,000.

The V100 is old and slower, but for AI applications RAM is king, and there are lots of enterprise V100s coming off racks and being sold cheap on eBay.


The progress is insane. A few days ago I started being really impressed with LLMs' coding skills. I wanted Golang code instead of the Python you see in many demos. The prompt was:

Write a Golang func, which accepts the path to a .gpx file and outputs a JSON string with points (x = total distance in km, y = elevation). Don't use any library.
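
For reference, a hand-written version of what that prompt asks for looks roughly like this, using only the standard library (the GPX handling is simplified to track points with lat/lon/ele, and the file name in main is just an example):

    package main

    import (
        "encoding/json"
        "encoding/xml"
        "fmt"
        "math"
        "os"
    )

    // gpx mirrors only the parts of the GPX schema we need:
    // track points with latitude, longitude and elevation.
    type gpx struct {
        Points []struct {
            Lat float64 `xml:"lat,attr"`
            Lon float64 `xml:"lon,attr"`
            Ele float64 `xml:"ele"`
        } `xml:"trk>trkseg>trkpt"`
    }

    type point struct {
        X float64 `json:"x"` // total distance from start, km
        Y float64 `json:"y"` // elevation, m
    }

    // haversineKm returns the great-circle distance between two coordinates in km.
    func haversineKm(lat1, lon1, lat2, lon2 float64) float64 {
        const r = 6371.0
        toRad := func(d float64) float64 { return d * math.Pi / 180 }
        dLat := toRad(lat2 - lat1)
        dLon := toRad(lon2 - lon1)
        a := math.Sin(dLat/2)*math.Sin(dLat/2) +
            math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
        return 2 * r * math.Asin(math.Sqrt(a))
    }

    // GpxToJSON reads a .gpx file and returns a JSON array of {x: distance km, y: elevation}.
    func GpxToJSON(path string) (string, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return "", err
        }
        var g gpx
        if err := xml.Unmarshal(data, &g); err != nil {
            return "", err
        }
        points := make([]point, 0, len(g.Points))
        total := 0.0
        for i, p := range g.Points {
            if i > 0 {
                prev := g.Points[i-1]
                total += haversineKm(prev.Lat, prev.Lon, p.Lat, p.Lon)
            }
            points = append(points, point{X: total, Y: p.Ele})
        }
        out, err := json.Marshal(points)
        return string(out), err
    }

    func main() {
        s, err := GpxToJSON("track.gpx") // example path
        if err != nil {
            panic(err)
        }
        fmt.Println(s)
    }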


This is an implementation of a transformer, and in the README it's presented as text -> text. Tokens are just integers going in and out.

Is it possible to use it to train other types of LLMs (text->image, image->text, speech->text, etc.)?


Yes, anything can be an input token.

Patch of pixels ---> token
Fragment of input audio ---> token
etc.


The transformer itself just takes arrays of numbers and turns them into arrays of numbers. What you are interested in is the process that happens before and after the transformer.
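
Concretely, the "before" step for images is usually just a linear projection of flattened pixel patches into the same embedding space the text tokens live in (ViT-style patch embedding). A toy sketch with made-up weights:

    package main

    import "fmt"

    // embedPatch flattens a small grayscale patch and projects it into the model's
    // embedding space with a weight matrix ("patch embedding"). The weights here are
    // placeholders; in a real model they are learned during training.
    func embedPatch(patch [][]float32, weights [][]float32) []float32 {
        // flatten the patch into a vector of length p*p
        var flat []float32
        for _, row := range patch {
            flat = append(flat, row...)
        }
        // matrix-vector product: one output value per embedding dimension
        embedding := make([]float32, len(weights))
        for d, w := range weights {
            var sum float32
            for i, v := range flat {
                sum += w[i] * v
            }
            embedding[d] = sum
        }
        return embedding
    }

    func main() {
        patch := [][]float32{{0.1, 0.2}, {0.3, 0.4}}      // a 2x2 "image patch"
        weights := [][]float32{{1, 0, 0, 0}, {0, 0, 0, 1}} // 2-dim embedding, toy weights
        fmt.Println(embedPatch(patch, weights))            // this vector is the "token" the transformer sees
    }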


You can run Gemma and hundreds of other models (many fine-tuned) in llama.cpp. It's easy to swap to a different model.

It's important that there are companies publishing models that run locally. If some stop and others are born, that's fine. The worst thing that could happen is having AI only in the cloud.

