
I assume this would require a jailbroken iPhone? Or is it using accessibility features to control the device?


I wish the password manager app allowed you to set a custom password to open the app. The iPhone passcode is something that people often share with those around them, and with just the passcode, a person with bad intentions could easily gain access to all the passwords.


>By fine-tuning only the adapter layers, the original parameters of the base pre-trained model remain unchanged, preserving the general knowledge of the model while tailoring the adapter layers to support specific tasks.

From my (ML noob) understanding of this, does this mean that the final matrix is regularly fine-tuned instead of fine-tuning the main model? Is this similar to how ChatGPT now remembers things with memory [1]?

[1] https://help.openai.com/en/articles/8590148-memory-faq


The base model is frozen. It's the smaller adapter matrices that are fine-tuned with new data. During inference, the weights from the adapter matrices "shadow" the weights in the base model. Since the adapter matrices are much smaller, it's quite efficient to fine-tune them.

The advantage of the adapter matrices is that you can have different sets of them for different tasks, all based off the same base model.
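
Since a LoRA layer is just "frozen base plus a tiny trainable delta", here is a minimal PyTorch sketch of the idea (the rank, scaling, and init are illustrative, not Apple's actual implementation):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a small trainable low-rank adapter."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)        # base weights stay frozen
            if self.base.bias is not None:
                self.base.bias.requires_grad_(False)
            # Only these two small matrices get trained.
            self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Base output plus the low-rank "shadow" update: W x + scale * B (A x)
            return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

Swapping tasks is then just loading a different (lora_A, lora_B) pair; the base weights never change.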


ChatGPT memory is just a database with everything you told it to remember.

Low-Rank Adaptation (LoRA) is a way of changing the function of a model by only having to load a delta for a tiny percentage of the weights, rather than all the weights of an entirely new model.

No fine-tuning is going to happen on Apple computers or phones at any point. They are just swapping out Apple's pre-made LoRAs so that they can store one LLM and dozens of LoRAs in a fraction of the space it would take to store dozens of LLMs.
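
To put rough numbers on "a tiny percentage of the weights" (the sizes below are illustrative, not Apple's):

    # For one weight matrix of shape (d_out, d_in), full fine-tuning touches
    # d_out * d_in parameters, while a rank-r LoRA only adds r * (d_in + d_out).
    d_in, d_out, r = 4096, 4096, 16              # illustrative transformer-ish sizes
    full_params = d_out * d_in                   # 16,777,216
    lora_params = r * (d_in + d_out)             # 131,072
    print(f"LoRA delta is {lora_params / full_params:.2%} of the matrix")  # ~0.78%

That ratio is why one base model plus dozens of LoRAs fits in a fraction of the space of dozens of full models.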


Do you recommend any good running shoes with Velcro?


Can they recover the booster back to shore for an investigation?


I was able to successfully run Llama 3 8B, Mistral 7B, Phi, and other 7B models using Ollama [1] on my M1 MacBook Air.

[1] https://ollama.com
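
In case it's useful, a minimal way to hit a locally running Ollama instance from Python (this assumes the default port 11434 and that the model has already been pulled):

    import requests

    # Ollama exposes a local HTTP API; with streaming disabled, /api/generate
    # returns the whole completion as one JSON object.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Say hi in five words.", "stream": False},
    )
    print(resp.json()["response"])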


Are they able to run at a good speed? I'm just wondering what the economics would look like if I wanted to create agents in my games. I don't think many players are going to be willing to put up with usage-based / token-based pricing. That's the biggest roadblock to building LLM-based games right now.

Is there a way to reliably package these models with existing games and make them run locally? This would make inference virtually free, right?

From my limited understanding of this field, I think that if smaller models can run reliably and quickly on consumer hardware, it would be a game changer.


> Are they able to run at a good speed?

Not on most consumer computers, which likely lack a dedicated GPU. My M2 struggles with a 7B model (it's the only thing that makes it warm), and the token speed is unbearable. I switched to remote APIs for the speed.

If you are targeting gamers with a GPU, the answer may change, but as others have pointed out, there are numerous issues here.

> This would make inference virtually free, right?

Yes-ish, if you are only counting your dollars. However, it will slow the player's computer down and have slow response times, which will impact adoption of your game.

If you want to go this route, I'd start with a 2B-sized model and not worry about shipping it nicely. Get some early users to see if this is the way forward.

I suspect that remote LLM calls with sophisticated caching (cross-user / per-convo / pre-gen'd) are something worth exploring as well. IIRC, people suspected gpt-3.5-turbo was caching common queries and avoiding the LLM when it could, for the speed.
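
A toy version of that caching idea, keyed on a normalized prompt (a real system would need fuzzier matching and invalidation):

    import hashlib

    cache = {}  # prompt hash -> previously generated reply

    def cached_reply(prompt: str, generate) -> str:
        """Reuse a stored reply for repeated prompts; otherwise call the LLM."""
        key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
        if key not in cache:
            cache[key] = generate(prompt)   # the expensive remote LLM call
        return cache[key]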


You could also ship a couple of them and let the game/user choose which one to run depending on the hardware.
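
Something like this would be a first pass at that; the thresholds and file names below are guesses, and psutil is just one convenient way to read system RAM:

    import psutil

    def pick_model() -> str:
        """Pick the largest bundled model that plausibly fits on this machine."""
        ram_gb = psutil.virtual_memory().total / 1e9
        if ram_gb >= 32:
            return "llama3-8b-q4.gguf"       # hypothetical bundled file names
        if ram_gb >= 16:
            return "phi3-mini-q4.gguf"
        return "tinyllama-1.1b-q4.gguf"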


This is something I was considering as well - thanks


There isn't. For games you would need vLLM, because batch size is more important than latency. Something that people don't seem to understand is that an NPC doesn't need to generate tokens faster than its TTS can speak. You only need to minimize the time to first token.
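
A rough way to check that against a local model: measure time to first token while streaming, then compare the overall rate to speaking speed (roughly 2-3 words/second for TTS). This sketch assumes a default Ollama setup; the endpoint and model name are placeholders:

    import json, time, requests

    start = time.time()
    first_token_at = None
    text = ""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Greet the player as a gruff blacksmith."},
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if first_token_at is None:
                first_token_at = time.time() - start   # the latency the player actually feels
            text += chunk.get("response", "")
            if chunk.get("done"):
                break
    rate = len(text.split()) / (time.time() - start)
    print(f"time to first token: {first_token_at:.2f}s, ~{rate:.1f} words/s")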


The biggest roadblock is not running the model on the user's machine, that's barely an issue with 7B models on a gaming PC. The difficulty is in getting the NPC to take interesting actions with a tangible effect on the game world as a result of their conversation with the player.


The generative agents paper takes a pretty decent shot at this, I think.


Here [1] is a reference for the tokens/sec of Llama 3 on different Apple hardware. You can evaluate whether this is acceptable performance for your agents. I would assume the tokens/sec would be much lower if the LLM agent is running alongside the game, since the game would also be using a portion of the CPU and GPU. I think this is something you need to test on your own to determine its usability.

You can also look into lower-parameter models (3B, for example) to determine whether the balance between accuracy and performance fits your use case.

> Is there a way to reliably package these models with existing games and make them run locally? This would make inference virtually free, right?

I don't have any knowledge of game dev, so I can't comment on that part, but yes, packaging it locally would make inference free.

[1] https://github.com/ggerganov/llama.cpp/discussions/4167


Thanks! This is helpful. I was thinking about the phi models - those might be useful for this task. Will look into how those can be run locally as well


I just ran phi3:mini[1] with Ollama on an Apple M3 Max laptop, on battery set to "Low power" (mentioned because that makes some things run more slowly). phi3:mini output roughly 15-25 words/second. The token rate is higher but I don't have an easy way to measure that.

Then llama3:8b[2]. It output 28 words/second. This is higher despite the larger model, perhaps because llama3 obeyed my request to use short words.

Then mixtral:8x7b[3]. That output 10.5 words/second. It looked like 2 tokens/word, as the pattern was quite repetitive and visible, but again I have no easy way to measure it.

That was on battery, set to "Low power" mode, and I was impressed that even with mixtral:8x7b, the fans didn't come on at all for the first 2 minutes of continuous output. Total system power usage peaked at 44W, of which about 38W was attributable to the GPU.

[1] https://ollama.com/library/phi3 [2] https://ollama.com/library/llama3 [3] https://ollama.com/library/mixtral
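
Re "I don't have an easy way to measure that": Ollama's HTTP API reports token counts and timings in the final response object, so something like this prints tokens/second directly (durations are in nanoseconds; default port assumed):

    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3:mini", "prompt": "Explain LoRA in one paragraph.", "stream": False},
    ).json()

    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.1f} tokens/s")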


Well, since OP doesn't seem to want to: thank you for your response.

I came across this thread while doing some research, and it's been helpful.

(I hate how common Tragedy of the Commons is. =/)


What? Chill out, buddy - there's such a thing as time zones. I was just sleeping.


Just want to add a recent experience I had with this issue. I accidentally deleted a bunch of photos on my Mac (including from "Recently Deleted") and didn't realize it deletes them from my entire iCloud photo library. In a state of panic, I contacted Apple, and they were able to restore all the photos that had been "permanently deleted" within the last 60 days. This surprised me quite a bit, as I was not expecting them to be able to get them back. Makes me wonder whether any pictures in the cloud are ever truly deleted.


Deleted files are kept in a hidden “expunged” folder both on device and (presumably) in the Cloud for up to 40 days pending actual deletion.

You can sometimes recover “deleted” files and media by dumping the entire phone and recovering files via the expunged folder for iCloud Drive and iCloud Photo Library — these were already deleted from the “Recently Deleted” folder.


Could you share your blog?



Could you share instructions on how you did that?


At a very high level, you chain it together like so:

llama.cpp >> OpenAI translation server (included in llama git) >> Continue extension in vscode
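
For the middle link: recent llama.cpp builds ship a server with an OpenAI-compatible /v1 endpoint, so you can sanity-check it with the stock OpenAI Python client before pointing Continue at the same base URL (the port and model name below are placeholders for whatever you launched the server with):

    from openai import OpenAI

    # Point the standard OpenAI client at the local llama.cpp server instead of api.openai.com.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

    reply = client.chat.completions.create(
        model="local-model",  # most local servers accept any model name here
        messages=[{"role": "user", "content": "Write a docstring for a bubble sort."}],
    )
    print(reply.choices[0].message.content)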


Sucks that they won't announce who is affected until the end of April. I'm already struggling with anxiety, and this is gonna make it difficult to concentrate on work.

