How to deploy DeepSeek locally

lang: en
date: Mar 14, 2025
slug: Post-12-en
status: Published
tags: Tech Sharing
summary: Lightweight local large language model
type: Post
Recently I've been working on an AI dialogue system for game NPCs and tried deploying a lightweight DeepSeek large language model locally, so today I'm going to share the process and the pitfalls I ran into.
First, install Ollama: open the official Ollama homepage, click "Download", and run the installer as usual.
Or install it from the terminal:
brew install ollama
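Note that with a Homebrew install, the server doesn't start automatically the way the desktop app does. Two standard ways to start it:
brew services start ollama   # run Ollama as a background service
ollama serve                 # or run the server in the foreground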
Once Ollama is installed, we can start deploying DeepSeek.
For choosing a suitable DeepSeek model, you can refer to: https://huggingface.co/deepseek-ai
My device is a MacBook Air (M2, 8 GB RAM). Since memory is tight, I chose the smallest model, deepseek-r1:1.5b.
Let's open a terminal and start pulling the DeepSeek model:
ollama pull deepseek-r1:1.5b
When it's done, let's go into interactive mode:
ollama run deepseek-r1:1.5b
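In the interactive session you just type your question and the model streams back a reply. Ollama's REPL also has a few built-in slash commands worth knowing (type /? inside the session for the full, version-specific list):
/?    # show help
/bye  # exit the session
/set parameter num_ctx 4096   # adjust a runtime parameter, e.g. the context window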
A quick summary of the common commands:
  • View downloaded models: ollama list
  • Delete a model: ollama rm <model-name>
  • Run a model (interactive): ollama run <model-name>
  • Ask a one-off question: ollama run <model-name> "Your question"
  • Download a model: ollama pull <model-name>
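Besides the CLI, Ollama exposes a local REST API on port 11434, which is handy if you want to wire the model into an application (like the NPC dialogue system mentioned at the start). A minimal sketch against the documented /api/chat endpoint:
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:1.5b",
  "messages": [{ "role": "user", "content": "Hello, who are you?" }],
  "stream": false
}'
With "stream": false the server returns one complete JSON response instead of a token stream.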
Optimizing performance
1. Create or edit the configuration file:
nano ~/.ollama/config
2. Add the configuration (~/.ollama/config):
{ "gpu_layers": 35, // The number of GPU layers is adjusted according to the graphics card performance "cpu_threads": 6, // The number of CPU threads is recommended to be set to the number of CPU cores "batch_size": 512, // Batch size, affecting memory usage "context_size": 4096 // Context window size affects conversation length }
3. Restart the Ollama service for the changes to take effect, then run the model again:
brew services restart ollama   # if installed via Homebrew; with the desktop app, quit and reopen it
ollama run deepseek-r1:1.5b
4. Performance tuning suggestions:
  • If your computer gets hot or laggy: reduce gpu_layers and batch_size.
  • If memory is insufficient: reduce batch_size.
  • If you need longer conversations: increase context_size (this consumes more memory).
  • Set cpu_threads to your actual CPU core count minus 2 (see the command below to check your core count).
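To check your core count on macOS, the standard sysctl query works:
sysctl -n hw.ncpu   # number of logical CPU cores
On an M2 MacBook Air this prints 8, which is how the cpu_threads: 6 above follows the "cores minus 2" rule.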
5. Performance reference (see below for measuring your own numbers):
  • Memory usage: ~12-14GB
  • First load: 30-60 seconds
  • Dialog delay: 1-3 seconds
  • Context window: 4096 tokens
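To get concrete numbers on your own machine, Ollama can print timing statistics after each reply via the --verbose flag:
ollama run deepseek-r1:1.5b --verbose
# after each response it reports stats such as total duration and eval rate (tokens/s)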
6. Verify the optimization takes effect
Enter a longer question in the model's dialogue screen, for example:
💡 Please explain the basics of quantum computing to me in detail, with an answer of more than 500 words.
Then view CPU and memory usage in another terminal:
top
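You can also use Ollama's own status command to see which models are currently loaded and how much memory they occupy:
ollama ps   # shows running models, their size, and whether they run on CPU or GPU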
 
Web UI access
If you want to interact through a web page, ChatGPT-style, you can pair Ollama with a Web UI such as:
open-webui
docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=http://host.docker.internal:11434 --name open-webui ghcr.io/open-webui/open-webui:main
and then open http://localhost:3000 in a browser.
Once you are logged in, it will automatically connect to your local Ollama and list the models you've downloaded (e.g. deepseek-r1:1.5b).
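A few standard Docker commands for managing the container afterwards:
docker logs -f open-webui   # follow the container's logs
docker stop open-webui      # stop the Web UI
docker start open-webui     # start it again later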
If you have any questions about this article, feel free to contact me.