RTX-3090 Qwen3-8B Dify RAG Environment Setup
- I. Environment Configuration
- II. Steps
    - 1. Create the container
    - 2. Download `Qwen3-8B` and the embedding model
    - 3. Install `transformers`
    - 4. Install `vllm`
    - 5. Install `flash-attention`
    - 6. Start an OpenAI-API-compatible service
        - 1. Option 1: start the `vllm` server [does not support multiple tasks]
        - 2. Option 2: a Flask + PyTorch service exposing Qwen3-8B and embeddings via an OpenAI-compatible API
    - 7. Test the API service
    - 8. Install `Dify`
I. Environment Configuration
| Property | Value |
| --- | --- |
| CUDA Driver Version | 555.42.02 |
| CUDA Version | 12.5 |
| OS | Ubuntu 20.04.6 LTS |
| Docker version | 24.0.5, build 24.0.5-0ubuntu1~20.04.1 |
| GPU | NVIDIA GeForce RTX 3090, 24 GB VRAM |
II. Steps
1. Create the container
docker run --runtime nvidia --gpus all -ti \
    -v $PWD:/home -w /home \
    -p 8000:8000 --ipc=host nvcr.io/nvidia/pytorch:24.03-py3 bash
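Before continuing, it can help to confirm the container actually sees the GPU. The following quick check is not part of the original steps, just a suggested sanity test to run inside the container:

```python
# Suggested sanity check inside the container: confirm PyTorch sees the RTX 3090.
import torch

print(torch.__version__)              # PyTorch bundled with the NGC 24.03 image
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.get_device_name(0))  # expected: "NVIDIA GeForce RTX 3090"
print(round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))  # ~24 GB
```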
2. Download `Qwen3-8B` and the embedding model
cd /home
pip install modelscope
modelscope download --model Qwen/Qwen3-8B --local_dir Qwen3-8B
modelscope download --model maidalun/bce-embedding-base_v1 --local_dir bce-embedding-base_v1
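As an optional verification that both downloads completed, check that the config and weight files exist under the target directories. The paths below assume the `--local_dir` values used above:

```python
# Optional check that the downloaded model directories look complete.
from pathlib import Path

for model_dir in ("Qwen3-8B", "bce-embedding-base_v1"):
    p = Path("/home") / model_dir
    weights = list(p.glob("*.safetensors")) + list(p.glob("*.bin"))
    print(f"{model_dir}: config.json={(p / 'config.json').exists()}, weight files={len(weights)}")
```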
3. Install `transformers`
cd /home
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.51.0
pip install tokenizers==0.21
python3 setup.py install
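A short, optional check that the source build is the one actually imported, matching the versions pinned above:

```python
# Verify the installed versions match the pinned ones (transformers v4.51.0, tokenizers 0.21.x).
import tokenizers
import transformers

print(transformers.__version__)  # expected: 4.51.0
print(tokenizers.__version__)    # expected: 0.21.x
```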
4. Install `vllm`
pip install vllm
pip install flashinfer-python==v0.2.2
python3 -m pip install --upgrade 'optree>=0.13.0'
pip install 'bitsandbytes>=0.45.3' -i https://pypi.tuna.tsinghua.edu.cn/simple
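Optionally confirm that the vLLM stack imports cleanly before moving on; this is just a sanity check, not part of the original procedure:

```python
# Import check for the packages installed above.
import bitsandbytes
import vllm

print(vllm.__version__)
print(bitsandbytes.__version__)
```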
5. Install `flash-attention`
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/
git checkout fd2fc9d85c8e54e5c20436465bca709bc1a6c5a1
python setup.py build_ext
python setup.py bdist_wheel
pip install dist/flash_attn-*.whl
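The service in step 6 loads Qwen3-8B with the default attention implementation; if you want the locally built flash-attention to be used through `transformers`, the standard switch is `attn_implementation="flash_attention_2"`. A minimal sketch (my addition, assuming bfloat16 on the 3090):

```python
# Sketch: load Qwen3-8B with the flash-attention 2 backend built above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./Qwen3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```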
6. Start an OpenAI-API-compatible service
1. Option 1: start the `vllm` server [does not support multiple tasks]
cd /home
export TORCH_CUDA_ARCH_LIST="8.6+PTX"
vllm serve ./Qwen3-8B/ --quantization bitsandbytes --enable-prefix-caching --dtype bfloat16
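A minimal way to confirm the vLLM server is answering (step 7 covers testing in more detail). The model name below is just the served path; adjust it to whatever `GET /v1/models` reports:

```python
# Quick request against the OpenAI-compatible completions endpoint on port 8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "./Qwen3-8B/",   # served model id; check GET /v1/models if unsure
        "prompt": "Say hello in one sentence.",
        "max_tokens": 64,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```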
2. Option 2: a Flask + PyTorch service exposing Qwen3-8B and embeddings via an OpenAI-compatible API
cat > dify_api_srv.py <<-'EOF'
import json
import time
import uuid
from typing import List

import numpy as np
import torch
from flask import Flask, request, jsonify, Response
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, TextStreamer

app = Flask(__name__)
model_name = "./Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,torch_dtype="auto",device_map="auto"
)
MODEL_PATH = "./bce-embedding-base_v1"
rerank_tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
rerank_model = AutoModel.from_pretrained(MODEL_PATH)@app.route('/v1/completions', methods=['POST'])
def handle_completion():
    """Handle text completion requests."""
    data = request.get_json()
    print(data)
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 32768)
    temperature = float(data.get('temperature', 1.0))
    top_p = float(data.get('top_p', 1.0))
    messages = [{"role": "user", "content": prompt}]
    # Apply the Qwen3 chat template; thinking mode is disabled for plain completions
    formatted_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    inputs = tokenizer(formatted_text, return_tensors="pt").to(model.device)
    # Note: with do_sample=False decoding is greedy, so temperature/top_p have no effect
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=False,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id,
    )
    output_ids = generated_ids[0][len(inputs.input_ids[0]):]
    try:
        think_token_id = tokenizer.convert_tokens_to_ids("</think>")
        index = len(output_ids