How to deploy DeepSeek locally

lang: en
date: Mar 14, 2025
slug: Post-12-en
status: Published
tags: Tech Sharing
summary: Lightweight local large language model
type: Post
Recently I've been working on an AI dialogue system for game NPCs and tried deploying a lightweight DeepSeek large language model locally, so today I'm going to share the process and the pitfalls I ran into.
First, install Ollama: open the official Ollama homepage, click "Download", and run the installer as usual.
Or install it from the terminal:
brew install ollama
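Note that with a Homebrew install, the server doesn't start automatically the way the desktop app does. Two standard ways to start it:
brew services start ollama   # run Ollama as a background service
ollama serve                 # or run the server in the foreground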
Once Ollama is installed, we can start deploying DeepSeek.
For choosing a suitable DeepSeek model, you can refer to: https://huggingface.co/deepseek-ai
My device is a MacBook Air (M2, 8 GB RAM). Since memory is tight, I chose the smallest model, deepseek-r1:1.5b.
Let's open a terminal and start pulling the DeepSeek model:
ollama pull deepseek-r1:1.5b
When it's done, let's go into interactive mode:
ollama run deepseek-r1:1.5b
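In the interactive session you just type your question and the model streams back a reply. Ollama's REPL also has a few built-in slash commands worth knowing (type /? inside the session for the full, version-specific list):
/?    # show help
/bye  # exit the session
/set parameter num_ctx 4096   # adjust a runtime parameter, e.g. the context window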
A quick summary of the common commands:
  • View downloaded models: ollama list
  • Delete a model: ollama rm <model-name>
  • Run a model (interactive): ollama run <model-name>
  • Ask a one-off question: ollama run <model-name> "Your question"
  • Download a model: ollama pull <model-name>
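Besides the CLI, Ollama exposes a local REST API on port 11434, which is handy if you want to wire the model into an application (like the NPC dialogue system mentioned at the start). A minimal sketch against the documented /api/chat endpoint:
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:1.5b",
  "messages": [{ "role": "user", "content": "Hello, who are you?" }],
  "stream": false
}'
With "stream": false the server returns one complete JSON response instead of a token stream.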
Optimizing performance
1. Create or edit the configuration file:
nano ~/.ollama/config
2. Add the configuration (~/.ollama/config):
{ "gpu_layers": 35, // The number of GPU layers is adjusted according to the graphics card performance "cpu_threads": 6, // The number of CPU threads is recommended to be set to the number of CPU cores "batch_size": 512, // Batch size, affecting memory usage "context_size": 4096 // Context window size affects conversation length }
3. Restart the Ollama service for the changes to take effect, then run the model again:
brew services restart ollama   # if installed via Homebrew; with the desktop app, quit and reopen it
ollama run deepseek-r1:1.5b
4. Performance tuning suggestions:
  • If your computer gets hot or laggy: reduce gpu_layers and batch_size.
  • If memory is insufficient: reduce batch_size.
  • If you need longer conversations: increase context_size (this consumes more memory).
  • Set cpu_threads to your actual CPU core count minus 2 (see the command below to check your core count).
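To check your core count on macOS, the standard sysctl query works:
sysctl -n hw.ncpu   # number of logical CPU cores
On an M2 MacBook Air this prints 8, which is how the cpu_threads: 6 above follows the "cores minus 2" rule.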
5. Performance reference (see below for measuring your own numbers):
  • Memory usage: ~12-14GB
  • First load: 30-60 seconds
  • Dialog delay: 1-3 seconds
  • Context window: 4096 tokens
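To get concrete numbers on your own machine, Ollama can print timing statistics after each reply via the --verbose flag:
ollama run deepseek-r1:1.5b --verbose
# after each response it reports stats such as total duration and eval rate (tokens/s)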
6. Verify the optimization takes effect
Enter a longer question in the model's dialogue screen, for example:
💡 Please explain the basics of quantum computing to me in detail, with an answer of more than 500 words.
Then view CPU and memory usage in another terminal:
top
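You can also use Ollama's own status command to see which models are currently loaded and how much memory they occupy:
ollama ps   # shows running models, their size, and whether they run on CPU or GPU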
 
Web UI access
If you want to interact through a web page, ChatGPT-style, you can pair Ollama with a Web UI such as:
open-webui
docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=http://host.docker.internal:11434 --name open-webui ghcr.io/open-webui/open-webui:main
and then open http://localhost:3000 in a browser.
Once you are logged in, it will automatically connect to your local Ollama and list the models you've downloaded (e.g. deepseek-r1:1.5b).
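A few standard Docker commands for managing the container afterwards:
docker logs -f open-webui   # follow the container's logs
docker stop open-webui      # stop the Web UI
docker start open-webui     # start it again later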
If you have any questions about this article, feel free to contact me.