Installing CUDA and cuDNN on CentOS 7

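This guide walks through installing CUDA on CentOS 7 and applies to CUDA 10.0 and CUDA 10.1. It is organized as follows:

Part 1: Checking and Verifying the Environment

Part 2: Uninstalling the NVIDIA Driver and CUDA

Part 3: Installing CUDA

Part 4: Installing cuDNN

Part 5: Problems Encountered and Solutions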

Part 1: Checking and Verifying the Environment

# CUDA download links:

https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm

https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64.rpm

# cuDNN download link:

https://developer.nvidia.com/rdp/cudnn-download

# Check the operating system version

[root@localhost ~]# lsb_release -a

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch

Distributor ID: CentOS

Description:    CentOS Linux release 7.5.1804 (Core)

Release:  7.5.1804

Codename:     Core

[root@localhost ~]#

# lspci | grep -i vga  (show the graphics adapter; Tesla cards are listed as "3D controller" and do not appear in this output)

[root@localhost nvidia_soft]# lspci | grep -i vga

03:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)

[root@localhost nvidia_soft]#

# nvidia-smi  (show detailed GPU information; this only works once the driver is installed)

[root@localhost nvidia_soft]# nvidia-smi

Wed Jul 31 16:42:22 2019

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |

|-------------------------------+----------------------+----------------------+

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |

| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |

|===============================+======================+======================|

|   0  Tesla P4            Off  | 00000000:3B:00.0 Off |                    0 |

| N/A   36C    P0    23W /  75W |      0MiB /  7611MiB |      0%      Default |

+-------------------------------+----------------------+----------------------+

|   1  Tesla P4            Off  | 00000000:AF:00.0 Off |                    0 |

| N/A   35C    P0    22W /  75W |      0MiB /  7611MiB |      4%      Default |

+-------------------------------+----------------------+----------------------+

 

+-----------------------------------------------------------------------------+

| Processes:                                                       GPU Memory |

|  GPU       PID   Type   Process name                             Usage      |

|=============================================================================|

|  No running processes found                                                 |

+-----------------------------------------------------------------------------+

[root@localhost nvidia_soft]#

# cat /proc/version  (show the kernel version and the compiler it was built with)

[root@localhost nvidia_soft]# cat /proc/version

Linux version 3.10.0-957.21.3.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Tue Jun 18 16:35:19 UTC 2019

[root@localhost nvidia_soft]#

# gcc --version

[root@localhost nvidia_soft]# gcc --version

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)

Copyright (C) 2015 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@localhost nvidia_soft]#

# Check whether the machine has an NVIDIA GPU (any output means a GPU is present; no output means none was found)

[root@localhost nvidia_soft]# lspci | grep -i nvidia

3b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)

af:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)

[root@localhost nvidia_soft]#
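The check above can be turned into a small gate that runs before any installation step. This is a minimal sketch, assuming `lspci` lists each GPU on one line containing "NVIDIA" as in the output above; the function name `count_nvidia_devices` is made up for illustration.

```shell
# Hypothetical helper: count lines mentioning NVIDIA in lspci output read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_nvidia_devices() {
    grep -ci 'nvidia' || true
}

# Only query the hardware when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    n=$(lspci | count_nvidia_devices)
    if [ "$n" -eq 0 ]; then
        echo "No NVIDIA device found; there is nothing for the driver to manage." >&2
    else
        echo "Found $n NVIDIA device(s)."
    fi
fi
```

On the server used in this guide, the two Tesla P4 lines would give a count of 2.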

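Part 2: Uninstalling the NVIDIA Driver and CUDA

2.1 Ways to install the NVIDIA driver

There are three ways to install the graphics driver:

A. Through the CUDA .run installer, which bundles the driver, for example cuda_10.1.168_418.67_linux.run

B. Through the standalone driver .run installer, for example NVIDIA-Linux-x86_64-430.34.run

C. Through a package manager (yum and similar)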

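2.2 Ways to uninstall the NVIDIA driver

A. Driver installed from the standalone .run file:

Locate the .run file that was used for the installation,

then run it with --uninstall, for example: sh NVIDIA-Linux-x86_64-430.34.run --uninstall

B. Driver installed with yum:

Confirm that the driver was installed with yum by running yum list installed and looking for packages whose names start with nvidia. Pick the package whose version matches the driver version queried earlier, then remove it with yum remove, for example: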

yum remove "*cublas*" "cuda*"

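or:

yum remove "*nvidia*"

or:

rpm -qa | grep -i nvidia | sort

yum remove "*nvidia*"

# Find the CUDA packages and remove all of them; the glob "*cuda*" usually covers them (quote it so the shell does not expand it against local file names)

rpm -qa | grep -i cuda | sort

yum remove "*cuda*"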
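Because these globs can match more than intended, it may be worth previewing the affected packages first. This is a minimal sketch that filters a package list down to NVIDIA and CUDA entries; the function name `nvidia_cuda_packages` is hypothetical and the patterns simply mirror the globs used above.

```shell
# Hypothetical helper: keep only NVIDIA, CUDA and cuBLAS packages from a
# package list on stdin (one name per line), sorted and de-duplicated.
nvidia_cuda_packages() {
    grep -iE 'nvidia|cuda|cublas' | sort -u || true
}

# On an RPM-based system, preview what the yum remove commands would touch.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | nvidia_cuda_packages
fi
```

If the list contains anything unexpected, removing packages by their exact names is the safer route.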

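C. Driver installed through the CUDA installer: see the next section.

2.3 Uninstalling CUDA

Change into the bin directory of the CUDA installation and run the uninstaller:

cd /usr/local/cuda-10.1/bin

[root@localhost bin]# ./cuda-uninstaller

Note: uninstalling this way removes the driver as well. If you then want to install a different CUDA version, reboot the server first; otherwise the installation fails. The error message is listed in Part 5.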

Part 3: Installing CUDA

# Install kernel-devel and kernel-headers

yum install kernel-devel

yum install kernel-headers
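The driver's kernel module is built against the headers of the running kernel, so the two packages should match the version reported by `uname -r`. A plain `yum install kernel-devel` installs the newest available version, which can differ from the running kernel on a system that has not been updated and rebooted. This is a minimal sketch that builds version-pinned package names; the function name `kernel_build_packages` is hypothetical.

```shell
# Hypothetical helper: print the yum package specs that match a kernel release.
# With no argument it uses the running kernel as reported by `uname -r`.
kernel_build_packages() {
    release=${1:-$(uname -r)}
    printf 'kernel-devel-%s kernel-headers-%s\n' "$release" "$release"
}

# Print the install command instead of running it, so it can be reviewed first.
echo "yum install -y $(kernel_build_packages)"
```

If yum cannot find the pinned version, updating the kernel and rebooting before installing the unpinned packages is an alternative.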

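# Make the installer executable (shown for CUDA 10.1; the CUDA 10.0 installer prompts differ)

[root@localhost nvidia_soft]# chmod 775 cuda_10.1.168_418.67_linux.run

# Run the installer (CUDA 10.1)

[root@localhost nvidia_soft]# ./cuda_10.1.168_418.67_linux.run

# When the installation finishes, the installer prints a summary like the following (taken from CUDA 10.0) and asks you to set the environment variables: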

===========

= Summary =

===========

 

Driver:   Installed

Toolkit:  Installed in /usr/local/cuda-10.0

Samples:  Installed in /root

 

Please make sure that

-   PATH includes /usr/local/cuda-10.0/bin

-   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

 

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin

To uninstall the NVIDIA Driver, run nvidia-uninstall

 

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

 

Logfile is /tmp/cuda_install_2537.log
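The summary asks for two environment variables to be set. This is a minimal sketch for CUDA 10.0 that adds each directory only once, so it stays harmless if sourced repeatedly. The function name `prepend_once` and the variable `CUDA_HOME` are chosen for illustration, and placing these lines in ~/.bashrc or in a file under /etc/profile.d/ is an assumption about how you want to persist them.

```shell
# Hypothetical helper: print the colon-separated list $2 with directory $1
# prepended, unless $1 is already part of the list.
prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;          # already present, keep as is
        *)        printf '%s\n' "$1${2:+:$2}" ;; # prepend, adding ":" only if needed
    esac
}

# Toolkit location taken from the installer summary above.
CUDA_HOME=/usr/local/cuda-10.0

PATH=$(prepend_once "$CUDA_HOME/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_once "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

For CUDA 10.1, change `CUDA_HOME` to /usr/local/cuda-10.1.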

Part 4: Installing cuDNN

4.1 Download links

- cuDNN download page: https://developer.nvidia.com/rdp/cudnn-download
- Direct link (cuDNN 7.6.2.24 for CUDA 10.1): https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.6.2.24/prod/10.1_20190719/cudnn-10.1-linux-x64-v7.6.2.24.tgz

4.2 Extract cuDNN

# Build for CUDA 10.1

tar -xzvf cudnn-10.1-linux-x64-v7.6.2.24.tgz

4.3 Copy the files into the CUDA directories

# For CUDA 10.1

sudo cp /root/nvidia_soft/cuda/include/cudnn.h  /usr/local/cuda-10.1/include

sudo cp /root/nvidia_soft/cuda/lib64/libcudnn*  /usr/local/cuda-10.1/lib64

# For CUDA 10.0

sudo cp /root/nvidia_soft/cuda_cudnn-10.0-linux-x64-v7.6.2.24/include/cudnn.h  /usr/local/cuda-10.0/include

sudo cp /root/nvidia_soft/cuda_cudnn-10.0-linux-x64-v7.6.2.24/lib64/libcudnn*  /usr/local/cuda-10.0/lib64

4.4 Set file permissions

# CUDA 10.1

sudo chmod a+r /usr/local/cuda-10.1/include/cudnn.h /usr/local/cuda-10.1/lib64/libcudnn*

# CUDA 10.0

sudo chmod a+r /usr/local/cuda-10.0/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*
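To confirm that the copy succeeded, the version can be read back from the installed header. This is a minimal sketch, assuming cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integer macros, which is the case for cuDNN 7.x; the function name `cudnn_version` is hypothetical.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL from the cudnn.h file given as $1.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { if (major != "") print major "." minor "." patch }' "$1"
}

# Header location used in the copy step above (CUDA 10.1).
header=/usr/local/cuda-10.1/include/cudnn.h
if [ -r "$header" ]; then
    cudnn_version "$header"
fi
```

For the archive used in this guide the first three components should read 7.6.2.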

Part 5: Problems Encountered and Solutions

5.1 Installing the NVIDIA driver on its own

# sh NVIDIA-Linux-x86_64-430.34.run

[root@localhost nvidia_soft]# sh NVIDIA-Linux-x86_64-430.34.run

5.2 Checking the nvcc version

[root@localhost local]# nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2019 NVIDIA Corporation

Built on Wed_Apr_24_19:10:27_PDT_2019

Cuda compilation tools, release 10.1, V10.1.168

[root@localhost local]#
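nvcc -V reports the toolkit that comes first on PATH, which is not necessarily the "CUDA Version" shown by nvidia-smi (that value is the highest version the driver supports). With CUDA 10.0 and 10.1 installed side by side, a quick check of which one is active can avoid confusion. This is a minimal sketch, assuming the "release X.Y," wording shown above; the function name `nvcc_release` is hypothetical.

```shell
# Hypothetical helper: read `nvcc -V` output on stdin and print the release, e.g. 10.1.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Report which nvcc binary is picked up from PATH and its release.
if command -v nvcc >/dev/null 2>&1; then
    echo "nvcc on PATH: $(command -v nvcc), release $(nvcc -V | nvcc_release)"
fi
```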

5.3 mesa-libGLES package conflict

Transaction check error:

file /usr/lib64/libEGL.so.1 from install of libglvnd-egl-1:1.0.1-0.8.git5baa1e5.el7.x86_64 conflicts with file from package mesa-libEGL-17.2.3-8.20171019.el7.x86_64

file /usr/lib64/libGLESv2.so.2 from install of libglvnd-gles-1:1.0.1-0.8.git5baa1e5.el7.x86_64 conflicts with file from package mesa-libGLES-17.2.3-8.20171019.el7.x86_64

file /usr/lib64/libGL.so.1 from install of libglvnd-glx-1:1.0.1-0.8.git5baa1e5.el7.x86_64 conflicts with file from package mesa-libGL-17.2.3-8.20171019.el7.x86_64

Error Summary

Command tried (it did not help): sudo yum -y install kmod-nvidia

Reference: https://www.centos.org/forums/viewtopic.php?t=69630&start=10

Update command: yum update mesa-libEGL

Reference: https://www.cnblogs.com/caoshousong/p/10642478.html

5.4 Error when re-running ./NVIDIA-Linux-x86_64-430.34.run

[root@localhost nvidia_soft]# vim /var/log/nvidia-installer.log

 

nvidia-installer log file '/var/log/nvidia-installer.log'

creation time: Wed Jul 31 15:42:17 2019

installer version: 430.34

 

PATH: /root/anaconda3/condabin:/usr/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin

 

nvidia-installer command line:

./nvidia-installer

 

Unable to load: nvidia-installer ncurses v6 user interface

 

Using: nvidia-installer ncurses user interface

-> Detected 24 CPUs online; setting concurrency level to 24.

ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

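Solution: this error appears when the driver is reinstalled right after being uninstalled. The uninstall only takes full effect after the server is rebooted, so reboot before installing again.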
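Before re-running the installer, it is possible to check whether any NVIDIA kernel module is still loaded. This is a minimal sketch that reads /proc/modules, where the module name is the first field and names use underscores (nvidia_uvm rather than nvidia-uvm); the function name `loaded_nvidia_modules` is hypothetical.

```shell
# Hypothetical helper: print the names of loaded kernel modules starting with "nvidia".
# Reads a file in /proc/modules format; defaults to /proc/modules itself.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }' "${1:-/proc/modules}"
}

if [ -r /proc/modules ]; then
    mods=$(loaded_nvidia_modules)
    if [ -n "$mods" ]; then
        echo "NVIDIA modules still loaded; reboot before reinstalling:"
        echo "$mods"
    else
        echo "No NVIDIA kernel modules loaded."
    fi
fi
```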

5.5 After switching to CUDA 10.0, TensorFlow cannot find libcudart.so.10.0

Running TensorFlow produces the following error:

2019-07-31 18:22:39.191575: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory

Locate the file:

[root@localhost cuda-10.0]# find / -name "*libcudart.so.10.0*"

/usr/local/cuda-10.0/lib64/libcudart.so.10.0.130

/usr/local/cuda-10.0/lib64/libcudart.so.10.0

[root@localhost cuda-10.0]#

Solution:

sudo cp /usr/local/cuda-10.0/lib64/libcudart.so.10.0 /usr/local/lib/libcudart.so.10.0 && sudo ldconfig

Reference: https://blog.csdn.net/chenjiyou363753068/article/details/84374661
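Copying the library works, but the copy goes stale if the toolkit is later updated. The installer summary in Part 3 mentions another route: register /usr/local/cuda-10.0/lib64 with the dynamic linker and run ldconfig. This is a minimal sketch of that route using a drop-in file under /etc/ld.so.conf.d/; the file name cuda-10-0.conf and the function name `write_ld_conf` are arbitrary choices made for illustration.

```shell
# Hypothetical helper: write library directory $1 into linker drop-in file $2.
write_ld_conf() {
    printf '%s\n' "$1" > "$2"
}

# Needs root, and only makes sense when the toolkit is installed.
if [ "$(id -u)" -eq 0 ] && [ -d /usr/local/cuda-10.0/lib64 ] && [ -d /etc/ld.so.conf.d ]; then
    write_ld_conf /usr/local/cuda-10.0/lib64 /etc/ld.so.conf.d/cuda-10-0.conf
    ldconfig
    # Confirm that the linker cache now resolves the CUDA runtime library.
    ldconfig -p | grep libcudart || echo "libcudart not found in linker cache" >&2
fi
```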
