
PyTorch Multi-GPU Inference and Troubleshooting

This post documents running the Yi-34B-Chat-4bits model on dual RTX 4080 GPUs, along with the troubleshooting done on the way.

The setup is based on Ubuntu Server 20.04.6 LTS, with a working proxy for reaching sites that are otherwise hard to access from mainland China.

Install CUDA

Since each CUDA release bundles a matching GPU driver, you can simply follow the prompts on the official download page. For example, I chose version 12.1:

https://developer.nvidia.com/cuda-12-1-0-download-archive

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2004-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Install Docker

for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

If any of the commands above fail, the most likely cause is a flaky network path to Docker's servers.

Install NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Dockerfile and compose.yaml

This Dockerfile also adds chatglm-cpp support (which seemingly cannot use multiple GPUs by default and may require code changes?).

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04

ARG TIMEZONE=Asia/Shanghai
ARG UID=1000
ARG GID=1000
ENV DEBIAN_FRONTEND=noninteractive

# Switch apt to the TUNA mirror (faster from mainland China)
RUN \
    sed -e "s/archive.ubuntu.com/mirrors.tuna.tsinghua.edu.cn/g" \
        -e "s/security.ubuntu.com/mirrors.tuna.tsinghua.edu.cn/g" -i /etc/apt/sources.list

# Set timezone
RUN ln -snf /usr/share/zoneinfo/${TIMEZONE} /etc/localtime && echo ${TIMEZONE} > /etc/timezone \
    && date

RUN apt-get update && apt-get install -y \
    curl \
    apt-utils \
    wget \
    build-essential \
    libssl-dev \
    zlib1g-dev \
    libncurses5-dev \
    libncursesw5-dev \
    libreadline-dev \
    libsqlite3-dev \
    libgdbm-dev \
    libdb5.3-dev \
    libbz2-dev \
    libexpat1-dev \
    liblzma-dev \
    git \
    vim \
    libffi-dev \
    libgdbm-compat-dev \
    ffmpeg libsm6 libxext6
    
RUN wget https://www.python.org/ftp/python/3.10.12/Python-3.10.12.tar.xz -O /tmp/Python-3.10.12.tar.xz \
    && tar xvf /tmp/Python-3.10.12.tar.xz -C /tmp \
    && cd /tmp/Python-3.10.12 \
    && ./configure --enable-optimizations --with-ensurepip=install \
    && make -j $(nproc) \
    && make altinstall \
    && update-alternatives --install /usr/bin/python python /usr/local/bin/python3.10 1 \
    && update-alternatives --install /usr/bin/pip pip /usr/local/bin/pip3.10 1 \
    && rm -rf /tmp/Python-3.10.12.tar.xz /tmp/Python-3.10.12

# "RUN export" only lasts for that single layer; use ENV so the variable persists
ENV LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# PyTorch built against CUDA 12.1, matching the base image
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Use the matching cu121 index for xformers as well (the cu118 index would pull in a mismatched torch build)
RUN pip install xformers --index-url https://download.pytorch.org/whl/cu121
RUN pip install wheel streamlit transformers_stream_generator cpm_kernels astunparse accelerate tiktoken einops scipy peft auto-gptq optimum

# Build chatglm-cpp with cuBLAS (CUDA) acceleration; the [api] extra adds the OpenAI-compatible server dependencies
RUN CMAKE_ARGS="-DGGML_CUBLAS=ON" pip install -U 'chatglm-cpp[api]'

# autoawq is needed for the AWQ-quantized Yi-34B-Chat-4bits weights
RUN pip install autoawq

RUN groupadd -g $GID -o chatglm
RUN useradd -m -u $UID -g $GID -o -s /bin/bash chatglm
USER chatglm
WORKDIR /home/chatglm/work


# Keep the container alive so we can exec into it for interactive work
ENTRYPOINT ["tail", "-f", "/dev/null"]

compose.yaml

services:
  chatglm:
    build:
      context: ./docker
      network: host
    container_name: chatglm
    restart: always
    # entrypoint: ["uvicorn", "chatglm_cpp.openai_api:app", "--host", "0.0.0.0"]
    dns:
      - 119.29.29.29
    volumes:
      - ./:/home/chatglm/work/
      - ./data/cache:/home/chatglm/.cache
      - ./data/config:/home/chatglm/.config
      - ./data/triton:/home/chatglm/.triton
      - /usr/local/nvidia:/usr/local/nvidia
    ports:
      - 8000:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
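
The entrypoint line above is left commented out; if you enable it, chatglm-cpp's openai_api module serves an OpenAI-compatible API on the mapped port 8000. A minimal client sketch (the request shape follows the standard chat-completions convention; the model field is just a placeholder):

# Minimal client for the OpenAI-compatible endpoint (sketch; assumes the
# uvicorn entrypoint above is enabled and listening on localhost:8000)
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "default",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])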

Run docker compose up -d --build. The devel base image is large, so the first build may take quite a while.

Once it is up, enter the container and run nvidia-smi to confirm the GPUs are mounted properly.
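
Beyond nvidia-smi, it is worth confirming that PyTorch itself sees both cards; a minimal check to run inside the container:

# Confirm PyTorch can see both GPUs inside the container
import torch

print(torch.__version__, torch.version.cuda)
print("device count:", torch.cuda.device_count())  # expect 2 for the dual RTX 4080 setup
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))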

Download the model with git clone https://huggingface.co/01-ai/Yi-34B-Chat-4bits (the Git LFS extension must be installed first).

Inside the container, write and run the inference script following the official docs; since AutoModelForCausalLM already supports multi-GPU placement, no code changes are usually needed.
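
For reference, a sketch of such a script, modeled on the official Yi example; the local model path assumes the clone from the previous step, and max_new_tokens is an arbitrary choice. device_map="auto" is what lets accelerate shard the 4-bit weights across both cards:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./Yi-34B-Chat-4bits"  # local clone from the previous step

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
# device_map="auto" splits the layers across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
).eval()

messages = [{"role": "user", "content": "hi"}]
input_ids = tokenizer.apply_chat_template(
    conversation=messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
output_ids = model.generate(input_ids.to("cuda"), max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))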

Troubleshooting

In a multi-GPU environment, first make sure data can be transferred between the GPUs correctly:

import torch

x = torch.tensor([1.0, 2.0], device=0)  # device=0 is shorthand for 'cuda:0'
x.to(1)             # direct GPU0 -> GPU1 copy; check the values it returns
x.to('cpu').to(1)   # the same transfer staged through host memory, for comparison

Check the value returned by x.to(1): if it comes back as [0., 0.] instead of [1., 2.], GPU-to-GPU data transfer is failing, which usually comes down to one of two causes (a quick peer-to-peer check follows the list):

  1. IOMMU / PCI ACS, which can be disabled in the BIOS; see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

  2. A GPU driver problem: look for a suitable driver version. For example, the driver bundled with CUDA 12.1 could not transfer data between the GPUs here, while the CUDA 12.2 driver worked (the container is still based on the 12.1 image, which tested fine against the newer host driver).
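
To narrow things down, you can also ask PyTorch directly whether it believes peer-to-peer access works between the cards, and then verify an actual round trip; a small sketch:

# Check CUDA's view of peer-to-peer (P2P) support, then verify a real copy
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            # False means copies will be staged through host memory
            print(f"P2P {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))

# Round-trip a tensor and make sure the values survive the transfer
x = torch.tensor([1.0, 2.0], device="cuda:0")
y = x.to("cuda:1")
assert torch.equal(y.cpu(), x.cpu()), "GPU-to-GPU copy corrupted the data"
print("GPU-to-GPU copy OK:", y)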