I have been researching methods and projects for training and running LLMs locally, and I'm interested in what others have been using on this front (including straight-up PyTorch/Transformers). Here's what I've gathered so far:
Engines/APIs:
- vLLM: Inference and serving engine for LLMs (non-quantized models only?) [1]
- ollama: Go project to run, create and share LLMs [2]
- llama.cpp: Inference of LLaMA models in C/C++ w/UI (including quantized models) [3]
- llama-cpp-python: Python bindings for llama.cpp with an OpenAI-compatible API server [4] (see the sketch after this list)
- llm-engine: engine for fine-tuning and serving LLMs [5]
- Lamini: hosted (closed-source?) platform for training LLMs [6]
- GPT4All: free, locally run chatbot [7]
- SkyPilot: framework for running LLMs, AI, and batch jobs [8]
- Hugging Face Transformers: APIs and tools to download and train models (via PyTorch) [9]
- RAGStack: Deploy a private ChatGPT alternative hosted within your VPC [14]
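
For a taste of what the llama-cpp-python route looks like, here's a minimal sketch; the model path, filename, and quantization level are assumptions, so swap in whatever GGUF/GGML file you have locally:

```python
from llama_cpp import Llama

# Load a locally downloaded quantized model file
# (path and filename here are placeholders for illustration).
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

The same package also ships an OpenAI-compatible server (`python -m llama_cpp.server`), so existing OpenAI client code can be pointed at a local model with little more than a base URL change.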
UI/Interface:
- Simon Willison's `llm` CLI tool [15]
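
Besides the CLI, `llm` exposes a small Python API. A minimal sketch; the model name below is an assumption (it should be whatever you have available, e.g. an OpenAI model or a local Llama 2 installed via a plugin per [15]):

```python
import llm

# Pick any model registered with the llm tool; the name below is a
# placeholder ("llm models" on the CLI lists what's actually available).
model = llm.get_model("gpt-3.5-turbo")
response = model.prompt("Three good names for a pet pelican")
print(response.text())
```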
Quantization Bits:
- AutoGPTQ: GPTQ algorithm based quantization package [10]
- QLoRA: efficient finetuning of quantized LLMs [11] (see the sketch after this list)
- bitsandbytes: 8-bit quantization and optimizers (CUDA functions) for PyTorch [12]
- SkyPilot QLoRA example (open PR against the qlora repo) [13]
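
To make the QLoRA pieces above concrete, here's a hedged sketch of the usual bitsandbytes + PEFT recipe from the Transformers ecosystem; the model name, target modules, and hyperparameters are assumptions for illustration, not anything prescribed by the repos above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization config -- the core of the QLoRA setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Model name is a placeholder; any causal LM on the Hub should work.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; target modules vary by architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From there, training proceeds with a standard Trainer loop; only the adapter weights are updated, which is what keeps the memory footprint small enough for a single consumer GPU.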
Video Guides:
- https://www.youtube.com/watch?v=eeM6V5aPjhk
- https://www.youtube.com/watch?v=TYgtG2Th6fI (jawerty's example)
Reference:
- [1] https://github.com/vllm-project/vllm
- [2] https://github.com/jmorganca/ollama
- [3] https://github.com/ggerganov/llama.cpp
- [4] https://github.com/abetlen/llama-cpp-python
- [5] https://github.com/scaleapi/llm-engine
- [6] https://www.lamini.ai/
- [7] https://github.com/nomic-ai/gpt4all
- [8] https://github.com/skypilot-org/skypilot
- [9] https://github.com/huggingface/transformers/releases
- [10] https://github.com/PanQiWei/AutoGPTQ
- [11] https://github.com/artidoro/qlora
- [12] https://github.com/TimDettmers/bitsandbytes
- [13] https://github.com/artidoro/qlora/pull/132
- [14] https://github.com/psychic-api/rag-stack
- [15] https://simonwillison.net/2023/Aug/1/llama-2-mac/