Running LLMs Locally with Ollama: CPU Mode vs. GPU Mode
I recently tried deploying a local large model on a server, and in practice the gap between CPU mode and GPU mode is huge: the GPU's performance advantage for running LLMs is overwhelming, but so is its power draw. The ceiling for future LLM progress may well turn out to be energy, which is why OpenAI and Microsoft have both started investing in nuclear-fusion companies; positioning in energy is positioning for the future of AI.
Llama3.1 8B + CPU Mode
Running an LLM in CPU mode takes almost no setup: just start Ollama with Docker. The generation speed, however, is painfully slow.
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
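Once the container is up, a quick sanity check is to hit the API root; the Ollama server answers with a short plain-text status message:
curl http://localhost:11434
# expected reply: "Ollama is running"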
Pull the model files:
ollama pull llama3.1
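With the model pulled, a quick interactive smoke test can be run inside the container (assuming the container name ollama from the docker run command above):
docker exec -it ollama ollama run llama3.1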
Results
Hardware: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 64 cores, 128 GB RAM. During generation the process used about 6 GB of RAM and roughly 3000% CPU (30 of the 64 cores), yet the LLM produced only about 10 tokens/s.
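The tokens/s figures quoted here can be reproduced through Ollama's HTTP API: when called with "stream": false, /api/generate returns a single JSON object whose eval_count (generated tokens) and eval_duration (nanoseconds) fields give the generation rate as eval_count * 1e9 / eval_duration. A minimal sketch using only curl and grep:
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "Why is the sky blue?", "stream": false}' \
  | grep -oE '"(eval_count|eval_duration)":[0-9]+'
# tokens/s = eval_count * 1e9 / eval_duration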
Llama3.1 8B + GPU Mode
Environment setup
Server OS: CentOS 7.8 (Linux kernel 3.10.0-1127)
GPU: Nvidia Tesla T4 16 GB
Nvidia driver installation
- Update system packages and prepare the build toolchain
yum update
yum install gcc kernel-devel kernel-headers -y
yum group install "Development Tools"  # GCC and related build tools
- Disable the default driver
Blacklist the default nouveau driver, then reboot and confirm that lsmod | grep nouveau produces no output:
mkdir -p /etc/modprobe.d
tee /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF
sudo dracut --force
reboot
- Download the Nvidia driver package for Linux-x86_64 (note: not the RHEL build; the latest release cannot be installed either and may complain that the kernel version is too old, so an older release is required) and install it:
chmod +x NVIDIA-Linux-x86_64-390.157.run
./NVIDIA-Linux-x86_64-390.157.run
After the installation succeeds,
nvidia-smi
should display the GPU status correctly.
- Install the Ollama image and its companion toolkit via Docker
1. Install the Nvidia Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
| sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
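After the package is installed, the toolkit still has to be registered as a runtime with Docker before --gpus=all takes effect; following the Nvidia Container Toolkit documentation, this is done with nvidia-ctk and a Docker restart:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker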
Ollama configuration and startup
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Start Ollama in GPU mode, then check the container logs to confirm that the GPU is correctly detected and used:
-- Logs begin at Tue 2024-09-03 16:26:04 CST, end at Tue 2024-09-03 16:28:29 CST. --
Sep 03 16:26:32 local.novalocal systemd[1]: Started Ollama Service.
Sep 03 16:26:34 local.novalocal ollama[1280]: 2024/09/03 16:26:34 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
Sep 03 16:26:34 local.novalocal ollama[1280]: time=2024-09-03T16:26:34.798+08:00 level=INFO source=images.go:753 msg="total blobs: 0"
Sep 03 16:26:34 local.novalocal ollama[1280]: time=2024-09-03T16:26:34.816+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 03 16:26:34 local.novalocal ollama[1280]: time=2024-09-03T16:26:34.832+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.9)"
Sep 03 16:26:34 local.novalocal ollama[1280]: time=2024-09-03T16:26:34.848+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama512843583/runners
Sep 03 16:27:07 local.novalocal ollama[1280]: time=2024-09-03T16:27:07.069+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]"
Sep 03 16:27:07 local.novalocal ollama[1280]: time=2024-09-03T16:27:07.073+08:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 03 16:27:08 local.novalocal ollama[1280]: time=2024-09-03T16:27:08.782+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-4dae52f8-1e3a-64d6-c51f-b1add94c9e1d library=cuda variant=v11 compute=6.1 driver=11.4 name="Tesla T4"
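Besides reading the log, it is worth confirming that the GPU is actually visible from inside the container; with the container toolkit configured, the runtime mounts the nvidia-smi utility into the container, so (assuming the container name ollama):
docker exec -it ollama nvidia-smi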
Results
[root@local ~]# nvidia-smi
Fri Sep 6 17:11:31 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   58C    P0    66W /  70W |   6294MiB / 15109MiB |     88%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     115856      C   ...a_v11/ollama_llama_server    6291MiB |
+-----------------------------------------------------------------------------+
In testing, llama3.1 8B runs at roughly 30 tokens/s, which is just about usable, but the card's power draw holds steady above 90% of its limit the whole time, which shows how much electricity LLM inference consumes.
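To watch that power draw while a generation is in progress, nvidia-smi can sample power, utilization, and memory once per second using its standard query flags:
nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used --format=csv -l 1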