Xinference model inference framework

Warning

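Do not install Xinference directly on Windows; use WSL instead.
When running under WSL, an nginx proxy is required before other servers can reach the service.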

Warning

Xinference and Ollama can conflict when both use the GPU.

echo $env:CUDA_VISIBLE_DEVICES

Ollama selects the GPU by UUID
Xinference selects the GPU by id (index)
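
To see which index and which UUID belong to the same card, both can be listed with nvidia-smi. The sketch below is only an illustration: the helper names parse_gpu_list and query_gpus are made up for these notes, and it assumes nvidia-smi is on PATH.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse `nvidia-smi --query-gpu=index,uuid,name --format=csv,noheader` output.

    Returns a list of dicts holding the GPU index (int), UUID and name.
    """
    gpus = []
    for line in csv_text.strip().splitlines():
        # Each row looks like: "0, GPU-xxxxxxxx-..., NVIDIA GeForce RTX 4090"
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            continue  # skip anything that is not a full index/uuid/name row
        index, uuid, name = parts
        gpus.append({"index": int(index), "uuid": uuid, "name": name})
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid,name", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_list(out)
```

With this mapping, the UUID of one card could go into CUDA_VISIBLE_DEVICES for Ollama and the index of another card for Xinference, following the note above.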

Warning

If the default download site is unreachable (no unrestricted internet access), set ModelScope as the model download source.

install

conda create -n xinference python=3.10 pip -y
conda activate xinference

cpu

pip3 install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple

gpu

Warning

Install PyTorch first.
Pick the build that matches the CUDA version reported by nvidia-smi:
https://pytorch.org/get-started/locally/
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

pip3 install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple
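
The step of matching the wheel index to the CUDA version can be sketched as below. This is a rough helper, not an official tool: cuda_wheel_index is a hypothetical name, and it simply reads the "CUDA Version: X.Y" field from the nvidia-smi header and builds a URL in the same form as the cu128 one above. nvidia-smi reports the highest CUDA version the driver supports, and PyTorch only publishes wheels for some CUDA versions, so the result should still be checked against the selector on pytorch.org.

```python
import re


def cuda_wheel_index(nvidia_smi_output):
    """Return a PyTorch wheel index URL for the CUDA version in nvidia-smi's header.

    Returns None when no "CUDA Version: X.Y" field is found.
    """
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    major, minor = match.groups()
    # Same pattern as the cu128 URL in these notes; the index may not exist
    # for every CUDA version.
    return f"https://download.pytorch.org/whl/cu{major}{minor}"
```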

use

start

xinference-local --host 0.0.0.0 --port 9997

set model src & start

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

command

Description	Command
List all running models	xinference list
Stop a running model you no longer need and release its resources	xinference terminate --model-uid "qwen2.5-instruct"
Query the parameter combinations available for the qwen-chat model	xinference engine -e http://localhost:9997 --model-name qwen-chat

environment variable

variable	comment
XINFERENCE_MODEL_SRC	Site to download models from (e.g. modelscope)
XINFERENCE_HOME	Storage location
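
When the server is started from a script rather than a shell, the two variables can be passed through the process environment. A minimal sketch, assuming xinference-local is on PATH; the helper names xinference_env and launch_command and the /data/xinference path are hypothetical.

```python
import os


def xinference_env(model_src=None, home=None, base_env=None):
    """Return a copy of the environment with the Xinference variables set."""
    env = dict(os.environ if base_env is None else base_env)
    if model_src is not None:
        env["XINFERENCE_MODEL_SRC"] = model_src  # e.g. "modelscope"
    if home is not None:
        env["XINFERENCE_HOME"] = home  # storage location
    return env


def launch_command(host="0.0.0.0", port=9997):
    """Build the xinference-local command line used in these notes."""
    return ["xinference-local", "--host", host, "--port", str(port)]


# Example (starts the server, so it is left commented out;
# /data/xinference is a made-up path):
# import subprocess
# subprocess.run(launch_command(),
#                env=xinference_env("modelscope", "/data/xinference"),
#                check=True)
```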

Models

embedding

jina-embeddings-v2-base-zh

pip install -U sentence-transformers

xinference launch --model-name jina-embeddings-v2-base-zh --model-type embedding

curl -X 'POST' \
  'http://192.168.3.89:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "http://192.168.3.89:9997",
  "model_name": "jina-embeddings-v2-base-zh",
  "model_type": "embedding"
}'
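
Once the model is running, it can be queried through the OpenAI-compatible /v1/embeddings endpoint. The sketch below uses only the standard library; build_embedding_request and fetch_embedding are hypothetical helper names, and it assumes the model UID equals the model name (check xinference list for the actual UID).

```python
import json
import urllib.request


def build_embedding_request(base_url, model_uid, text):
    """Build a POST request for the OpenAI-compatible /v1/embeddings endpoint."""
    payload = {"model": model_uid, "input": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def fetch_embedding(base_url, model_uid, text, timeout=30):
    """Send the request and return the first embedding vector as a list of floats."""
    request = build_embedding_request(base_url, model_uid, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]


# Example (needs a running server, so it is left commented out):
# vector = fetch_embedding("http://192.168.3.89:9997",
#                          "jina-embeddings-v2-base-zh", "hello")
# print(len(vector))
```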

Q&A

cannot import name 'HybridCache' from 'transformers'

pip install --upgrade peft
You may also need: pip install --upgrade mistral_common


No module named 'transformers.onnx'

pip install transformers==4.57.6

libgomp.so.1, needed by vendor/llama.cpp/ggml/src/libggml.so, not found

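Locate the library with find /usr -name libgomp.so.1, which here returned /usr/lib/x86_64-linux-gnu/libgomp.so.1

Before running the install command, point LD_LIBRARY_PATH at that directory: export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

The following command then succeeded: pip install "xinference[all]"

By default, xinference[all] also installs every module that needs GPU acceleration, which is why the installation failed. If only the CPU version is needed, install the base package: pip3 install "xinference" -i https://pypi.tuna.tsinghua.edu.cn/simple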
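
The libgomp lookup can also be done from Python before installing. A small sketch with hypothetical helper names find_libgomp and prepend_library_dir; ctypes.util.find_library is the standard-library lookup, and on Linux it falls back to LD_LIBRARY_PATH when the library is not found by other means.

```python
import ctypes.util


def find_libgomp():
    """Return the name the system resolves for libgomp, or None if it is missing."""
    return ctypes.util.find_library("gomp")


def prepend_library_dir(directory, current=""):
    """Build an LD_LIBRARY_PATH value with `directory` in front.

    Avoids a trailing colon when the current value is empty.
    """
    return directory if not current else f"{directory}:{current}"
```

For example, prepend_library_dir("/usr/lib/x86_64-linux-gnu", os.environ.get("LD_LIBRARY_PATH", "")) gives the value to export when find_libgomp() returns None.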

Models only run on the CPU

python -c "import torch; print(torch.cuda.is_available())"

If torch.cuda.is_available() returns False, the installed PyTorch is most likely the CPU-only build and the GPU build needs to be installed instead.

1. Check the CUDA environment
# Check the NVIDIA driver
nvidia-smi

2. Uninstall the current PyTorch
pip uninstall torch torchvision torchaudio
3. Install the GPU build of PyTorch
Choose the install command that matches your CUDA version:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
4. Verify the installation
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('CUDA version:', torch.version.cuda)"
python -c "import torch; print('GPU count:', torch.cuda.device_count())"
5. Restart Xinference
xinference-local --host 192.168.3.89 --port 9997
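
The three verification commands in step 4 can be folded into a single helper that also copes with PyTorch being absent. This is a sketch rather than part of Xinference; torch_cuda_report is a hypothetical name.

```python
import importlib.util


def torch_cuda_report():
    """Collect the facts printed by the step 4 one-liners into one dict."""
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "cuda_version": None,
        "gpu_count": 0,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment
    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_version"] = torch.version.cuda  # None for CPU-only builds
    report["gpu_count"] = torch.cuda.device_count()
    return report
```

A report with torch_installed set to True and cuda_version set to None points at a CPU-only build, which is the case steps 2 and 3 address.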
