Know-how for running LLMs on your own hardware: local inference with open-source models as an alternative to cloud APIs.
Advantages: full control over your data, no per-token API costs at high volume, and easier GDPR compliance, since data never leaves your infrastructure. Hands-on experience with various model sizes, quantization, GPU requirements, and integration into existing applications.
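The memory math behind quantization is simple enough to sketch. The helper below is illustrative only (it is not part of any tool mentioned here) and estimates weight memory alone; real deployments also need room for the KV cache and runtime overhead.

```python
# Rough sizing: model weights need (parameters × bits per weight) / 8 bytes.
# Illustrative helper, ignores KV cache and runtime overhead.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at 4-bit quantization needs roughly 3.5 GB for weights,
# so it fits on an 8 GB GPU with headroom; at 16-bit it needs ~14 GB.
print(f"{weight_memory_gb(7e9, 4):.1f} GB")
print(f"{weight_memory_gb(7e9, 16):.1f} GB")
```

This back-of-the-envelope number is usually the first input to both model selection and hardware sizing.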
Components
- Inference — Ollama, llama.cpp, vLLM — from quick local setup to high-performance production serving on macOS and Linux
- Security & access control — API gateway in front of the inference server for authentication, rate limiting, and routing
- Model selection — Choosing the right model for the use case: small & fast for classification, large & capable for generation
- Hardware consulting — Advice on GPU, RAM, and infrastructure requirements depending on model size and throughput needs
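Integration into an existing application can be as small as one HTTP call. The sketch below targets Ollama's local `/api/generate` endpoint on its default port, using only the standard library; the model name `llama3` is just an example and must be pulled beforehand.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint; streaming is disabled
    so the full completion arrives as a single JSON response."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the completion text."""
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires a running Ollama instance with the model available,
    # e.g. after `ollama pull llama3`.
    print(generate("llama3", "Summarize: local inference keeps data on-premise."))
```

Because the call is plain HTTP on localhost, swapping in another backend later (e.g. a vLLM server) mostly means changing the URL and payload shape, not the application code around it.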
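The rate-limiting piece of the gateway component can be sketched as a token bucket, the usual algorithm for allowing short bursts while capping sustained throughput. This is a minimal illustration under assumed parameters, not the implementation of any particular gateway product.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: allows a burst of `capacity`
    requests, refilled continuously at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per API key: this client may burst 3 requests, then ~1 per second.
bucket = TokenBucket(capacity=3, rate=1.0)
print([bucket.allow() for _ in range(4)])  # first three allowed, fourth denied
```

In front of an inference server this matters more than for typical REST APIs: a single uncapped client can monopolize the GPU for everyone else.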