Do you happen to know the reason to use ollama rather than the built-in server? How much work is required to get similar functionality? Looks like just downloading the models? I find it odd that ollama took off so quickly if llama.cpp had the same built-in functionality.
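For what it's worth, the built-in server really is just a binary plus a GGUF model file. A quick smoke test from Go might look roughly like this sketch (the /completion endpoint, the n_predict field and the default port 8080 are taken from the server README in the llama.cpp repo; adjust for your build and setup):

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net/http"
    )

    // Minimal smoke test against llama.cpp's built-in HTTP server, assuming it
    // was started with something like `./server -m model.gguf` on the default
    // port 8080. Endpoint and field names are from the server README; check
    // them against your build.
    func main() {
        body := []byte(`{"prompt": "Building a website can be done in 10 steps:", "n_predict": 64}`)
        resp, err := http.Post("http://localhost:8080/completion", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        out, _ := io.ReadAll(resp.Body)
        fmt.Println(string(out))
    }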
Higher-volume legal actions can only be successful as "peer-to-peer". If it goes through a courtroom, there is no way they can handle this kind of volume (even with AI tools). Imagine that the CEO is informed there are 20 new legal actions and the first court hearing will start 10-20 years from now, when the CEO will no longer be part of the company.
Could you explain why that is? If I have an air-gapped smart home network, someone has to physically come and sniff the packets. If it's only over Ethernet, they have to physically plug in. That's not a scalable attack strategy.
There are tons of ways to exfiltrate data from air-gapped systems if you can manage to get something installed on them. Ones I've read about include toggling the Caps Lock LED and recording it with a camera, or encoding data into the CPU fan speed and capturing the sound with a microphone for analysis (run a spin loop for a 1, sleep for a 0; rough sketch of that below). Variations of these can also be used, such as modulating screen brightness or monitoring power lines.
My personal favourite is one where they send specific patterns of data over USB: the EM fields generated by the "data" flowing over the wire form a carrier signal onto which the real payload is encoded, and it can be received up to 5 m away. This requires no additional hardware.
All of these require some malware installed on the system and offer only a tiny amount of bandwidth, but if there's a man on the inside, all they have to do is install the malware, without having to worry about additional hardware for getting the data out of the machine.
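A toy, transmit-side-only sketch of that fan-speed/timing trick (nothing here is from a real attack toolkit; the bit period and encoding are made up for illustration):

    package main

    import "time"

    // Toy transmit side of a timing covert channel: burn CPU for a "1" so the
    // fan spins up / power draw rises, idle for a "0". Real attacks modulate
    // fan noise, LEDs or EM emissions the same way; the receiver is a
    // microphone or camera outside the box. The bit period here is arbitrary.
    const bitPeriod = 2 * time.Second

    func sendBit(b byte) {
        deadline := time.Now().Add(bitPeriod)
        if b == 1 {
            for time.Now().Before(deadline) {
                // spin: keep the CPU busy so the fan speeds up
            }
        } else {
            time.Sleep(bitPeriod) // idle: let the CPU cool down again
        }
    }

    func sendByte(v byte) {
        for i := 7; i >= 0; i-- {
            sendBit((v >> uint(i)) & 1)
        }
    }

    func main() {
        for _, c := range []byte("hi") {
            sendByte(c)
        }
    }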
> But file syncing is a “dumb” protocol. You can’t “hook” into sync events, or update notifications, or conflict resolution. There isn’t much API; you just save files and they get synced. In case of conflict, best case, you get two files. Worst — you get only one :)
Sync services haven't evolved much. I guess a service that provided lower-level APIs and richer data structures (CRDTs, etc.) would be a hacker's dream (toy sketch below). Also, E2EE would be nice.
And if they closed up shop, I would still have all the files on my devices.
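To make the CRDT point concrete, here's a toy grow-only counter in Go. The point is that any two replicas merge deterministically instead of producing a "conflicted copy"; all names here are illustrative, not any real sync API:

    package main

    import "fmt"

    // Toy grow-only counter CRDT: each device only increments its own slot,
    // and merging takes the per-device maximum, so replicas converge no matter
    // how many times or in what order syncs happen.
    type GCounter map[string]int // device ID -> count from that device

    func (c GCounter) Inc(device string) { c[device]++ }

    func (c GCounter) Value() int {
        total := 0
        for _, v := range c {
            total += v
        }
        return total
    }

    func Merge(a, b GCounter) GCounter {
        out := GCounter{}
        for k, v := range a {
            out[k] = v
        }
        for k, v := range b {
            if v > out[k] {
                out[k] = v
            }
        }
        return out
    }

    func main() {
        phone, laptop := GCounter{}, GCounter{}
        phone.Inc("phone")
        laptop.Inc("laptop")
        laptop.Inc("laptop")
        fmt.Println(Merge(phone, laptop).Value()) // 3, regardless of sync order
    }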
This looks cool for a v1! The only problem I see is that most devices don't have much RAM, so local models are small and most requests will go to the servers.
Apple could use it to sell more devices: every new generation can have more RAM, which means more privacy. People will have a real reason to buy a new phone more often.
Apple seems to be anticipating higher RAM needs in its M4+ silicon: there are rumors they are including more RAM than specified in their entry-level computers.
I don't see any explanation for why they trained 8B instead of 7B.
I thought that if you have a 16 GB GPU, you can fit a 14 GB (7B * 16 bits) model into it, but how does it fit if the model is exactly 16 GB?
The bigger size probably comes from the larger vocabulary in the tokenizer. But most people run this model quantized to at least 8 bits, and it's still reasonable down to 3-4 bpw.
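Back-of-the-envelope arithmetic for the weight memory (decimal GB, weights only; KV cache and runtime overhead come on top, so treat these as lower bounds):

    package main

    import "fmt"

    // Rough weight-only memory estimate: params * bits-per-weight / 8, in
    // decimal GB. KV cache, activations and runtime overhead are not included.
    func weightGB(params, bitsPerWeight float64) float64 {
        return params * bitsPerWeight / 8 / 1e9
    }

    func main() {
        for _, bpw := range []float64{16, 8, 4} {
            fmt.Printf("7B @ %2.0f bpw: %5.1f GB   8B @ %2.0f bpw: %5.1f GB\n",
                bpw, weightGB(7e9, bpw), bpw, weightGB(8e9, bpw))
        }
    }

That's where the 14 GB figure for a 7B model at 16 bits comes from, and why an 8B model at 8 bits or below fits comfortably in 16 GB of VRAM.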
No reason to go for a 4090 as it's no more capable, and the 5090 is probably not going to have more than 24 GB on it either, simply because Nvidia wants to maintain its margins through market segmentation (adding more VRAM to that card would obsolete their low-end enterprise AI cards that cost $6000+).
I'd also consider dual A6000 48 GB cards (96 GB total) if you have a budget of $8000, or dual V100 32 GB cards (64 GB total) if you have a budget of $4000.
The V100 is old and slower, but for AI applications RAM is king, and there are lots of enterprise V100s coming off racks and being sold cheap on eBay.
The progress is insane. A few days ago I started being very impressed with LLMs' coding skills. I wanted Golang code instead of the Python you see in many demos. The prompt was:
Write a Golang func which accepts the path to a .gpx file and outputs a JSON string with points (x = total distance in km, y = elevation). Don't use any library.
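For comparison, a hand-written version of that func can be done with just the standard library (encoding/xml and encoding/json, haversine for the point-to-point distance); roughly this sketch:

    package main

    import (
        "encoding/json"
        "encoding/xml"
        "fmt"
        "math"
        "os"
    )

    // Just enough GPX structure to pull out track points.
    type gpx struct {
        Tracks []struct {
            Segments []struct {
                Points []struct {
                    Lat float64 `xml:"lat,attr"`
                    Lon float64 `xml:"lon,attr"`
                    Ele float64 `xml:"ele"`
                } `xml:"trkpt"`
            } `xml:"trkseg"`
        } `xml:"trk"`
    }

    type point struct {
        X float64 `json:"x"` // cumulative distance in km
        Y float64 `json:"y"` // elevation in m
    }

    // haversine returns the great-circle distance between two coordinates in km.
    func haversine(lat1, lon1, lat2, lon2 float64) float64 {
        const r = 6371.0
        rad := func(d float64) float64 { return d * math.Pi / 180 }
        dLat, dLon := rad(lat2-lat1), rad(lon2-lon1)
        a := math.Sin(dLat/2)*math.Sin(dLat/2) +
            math.Cos(rad(lat1))*math.Cos(rad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
        return 2 * r * math.Atan2(math.Sqrt(a), math.Sqrt(1-a))
    }

    // GpxToJSON reads a .gpx file and returns a JSON array of {x, y} points.
    func GpxToJSON(path string) (string, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return "", err
        }
        var g gpx
        if err := xml.Unmarshal(data, &g); err != nil {
            return "", err
        }
        var pts []point
        total, first := 0.0, true
        var prevLat, prevLon float64
        for _, trk := range g.Tracks {
            for _, seg := range trk.Segments {
                for _, p := range seg.Points {
                    if !first {
                        total += haversine(prevLat, prevLon, p.Lat, p.Lon)
                    }
                    first = false
                    prevLat, prevLon = p.Lat, p.Lon
                    pts = append(pts, point{X: total, Y: p.Ele})
                }
            }
        }
        out, err := json.Marshal(pts)
        return string(out), err
    }

    func main() {
        s, err := GpxToJSON(os.Args[1])
        if err != nil {
            panic(err)
        }
        fmt.Println(s)
    }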
The transformer itself just takes arrays of numbers and turns them into arrays of numbers. What you are interested in is the process that happens before and after the transformer.
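A toy sketch of that pipeline shape (made-up interfaces, not any real library): tokenization before, greedy sampling and detokenization after, with the model core just mapping token IDs to logits:

    package pipeline

    // Toy pipeline shape: the model core only maps token IDs to logits;
    // tokenization before and sampling/decoding after turn that into text.
    // These interfaces are made up for illustration, not any real library.

    type Tokenizer interface {
        Encode(text string) []int // text -> token IDs
        Decode(ids []int) string  // token IDs -> text
    }

    type Model interface {
        Forward(ids []int) []float32 // token IDs -> logits over the vocabulary
    }

    // Generate greedily extends a prompt by n tokens.
    func Generate(tok Tokenizer, m Model, prompt string, n int) string {
        ids := tok.Encode(prompt)
        for i := 0; i < n; i++ {
            logits := m.Forward(ids)
            best := 0
            for j, v := range logits {
                if v > logits[best] {
                    best = j
                }
            }
            ids = append(ids, best)
        }
        return tok.Decode(ids)
    }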
You can run Gemma and hundreds of other models (many fine-tuned) in llama.cpp. It's easy to swap to a different model.
It's important that there are companies publishing models (that run locally). If some stop and others are born, that's OK. The worst thing that could happen is having AI only in the cloud.
[0] https://github.com/ggerganov/llama.cpp#web-server