Configuring a deep-learning machine

Edit 2017-12-22: I’ve updated the guide for CUDA 9.1 and cuDNN 7.

Pre-requisites

What’s covered

Installing & Configuring:

  • Drivers
  • GPU-enabled tensor frameworks and classical data science software.
    • PyTorch is a deep learning library from Facebook focused on research. It uses dynamic graphs. I find PyTorch much easier to work with than TensorFlow.
    • TensorFlow is a deep learning library from Google, suitable for both production and research. It uses static graphs and has good support for embedded devices and production deployments, but I find it trickier to work with and debug.
    • Anaconda includes classical statistical learning and data science tools like numpy, scipy, scikit-learn, pandas and many others. Anaconda also has virtual-environment-like capabilities for managing dependencies across projects.
  • Convenience tweaks for remote access.
  • Configuring JupyterHub for remote Jupyter programming.

GPU Drivers & Configuration

First, install common dependencies using the apt-get package manager.

sudo apt-get update
sudo apt-get install -y --no-install-recommends \
        build-essential \
        curl \
        git \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        software-properties-common \
        swig \
        zip \
        zlib1g-dev \
        libcurl3-dev \
        wget \
        python3-pip \
        python3-dev \
        python-pip \
        python-dev \
        python-virtualenv \
        libcupti-dev \
        vim-nox

Install latest GPU drivers

# Add NVIDIA's graphics PPA repository
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
# (re-run if you see any warning/error messages)
# To list the available driver versions, type the following and press tab:
#   sudo apt-get install nvidia-
# Do not use 378; it causes login loops.
# 384 was the latest driver at the time of writing.
sudo apt-get install nvidia-384

Check installation by running nvidia-smi.
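
If you'd rather script that check, here is a minimal sketch. The function name check_gpu_driver is my own invention; the --query-gpu and --format options are standard nvidia-smi flags.

```shell
# Report whether the NVIDIA driver is usable.
# check_gpu_driver is a hypothetical helper name, not part of any NVIDIA tool.
check_gpu_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: driver is not installed or not on PATH"
        return 1
    fi
    # Print one line per GPU: model name and driver version.
    if ! nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; then
        echo "nvidia-smi failed: the driver may not be loaded (try rebooting)"
        return 1
    fi
}

# Usage:
# check_gpu_driver && echo "driver looks fine"
```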

Install NVIDIA’s CUDA 9.1

CUDA is an API that lets deep learning frameworks do GPU computations.

wget "http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb"
sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
# check version
cat /usr/local/cuda/version.txt
vi ~/.bashrc
# add the following to the bottom of your bashrc
# export PATH="/usr/local/cuda-9.1/bin/:$PATH"
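
If you prefer not to edit bashrc by hand, the same line can be appended from a script. This is a sketch only: add_line_once is a made-up helper name, and it assumes you want the line added exactly once even if you re-run the script.

```shell
# Append a line to a file only if that exact line is not already present.
# add_line_once is a hypothetical helper name used for illustration.
add_line_once() {
    local file="$1" line="$2"
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $PATH unexpanded until bash reads the file):
# add_line_once ~/.bashrc 'export PATH="/usr/local/cuda-9.1/bin/:$PATH"'
```

Open a new shell (or run exec -l $SHELL) afterwards so the new PATH takes effect.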

Check installation by running nvcc --version. You should see:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
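
To check the release number from a script instead of by eye, something like the following should work. It is a sketch that assumes the "release X.Y" wording shown in the output above; cuda_release is a hypothetical helper name.

```shell
# Extract the CUDA release number (e.g. 9.1) from `nvcc --version` output on stdin.
# cuda_release is a hypothetical helper name.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage:
# [ "$(nvcc --version | cuda_release)" = "9.1" ] && echo "CUDA 9.1 toolkit found"
```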

Install cuDNN 7

The NVIDIA CUDA Deep Neural Network library (cuDNN) provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

Traditionally, you are instructed to sign up on NVIDIA’s website and agree to their terms, which is a pain in the ass. This guide assumes you’ve already done that, just like in their Docker images. 😀

# become root
sudo su
echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
CUDNN_VERSION="7.0.5.15"
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
        libcudnn7=$CUDNN_VERSION-1+cuda9.1 \
        libcudnn7-dev=$CUDNN_VERSION-1+cuda9.1

# symlink the files to where TF expects them
ls -lah /usr/local/cuda/lib64/*
mkdir -p /usr/lib/x86_64-linux-gnu/include/
ln -s /usr/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h
ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda/lib64/libcudnn.so.7

# confirm your version
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
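
The grep prints the three version defines on separate lines. If you want them as one dotted string, here is a rough sketch. It assumes the header contains plain "#define CUDNN_MAJOR", "#define CUDNN_MINOR" and "#define CUDNN_PATCHLEVEL" lines, as the cuDNN 7 header does; cudnn_header_version is a hypothetical helper name.

```shell
# Print the cuDNN version as MAJOR.MINOR.PATCH from a cudnn.h header.
# cudnn_header_version is a hypothetical helper name.
cudnn_header_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}

# Usage:
# cudnn_header_version /usr/local/cuda/include/cudnn.h
```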

Reboot.

Software

Anaconda

Anaconda is a package manager, a virtual-environment manager, and a collection of common data science tools rolled into one.

wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
bash Anaconda3-4.4.0-Linux-x86_64.sh
# follow the install prompts
# restart your bash session
exec -l $SHELL
# check to make sure python is anaconda
which python
# should return $HOME/anaconda3/bin/python
# install pip via conda
conda install pip
# to update 
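
The "which python" check above can also be scripted. A small sketch, assuming Anaconda was installed to its default $HOME/anaconda3 prefix; is_anaconda_python is a hypothetical helper name.

```shell
# Succeed if the given interpreter path lives under the Anaconda install directory.
# is_anaconda_python is a hypothetical helper; $HOME/anaconda3 is assumed to be the install prefix.
is_anaconda_python() {
    case "$1" in
        "$HOME"/anaconda3/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# is_anaconda_python "$(command -v python)" || echo "python is not Anaconda's; check your PATH"
```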

PyTorch

conda install pytorch torchvision cuda90 -c pytorch
# clone the examples repository to test
git clone https://github.com/pytorch/examples $HOME/pytorch-examples
cd $HOME/pytorch-examples/mnist
python main.py
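
The MNIST example runs on the CPU too, so it doesn't prove the GPU is in use. A quicker check is to ask PyTorch directly. This is a sketch: torch_cuda_status is a made-up wrapper name, while torch.cuda.is_available() is PyTorch's own call.

```shell
# Ask PyTorch whether it can see a CUDA device.
# torch_cuda_status is a hypothetical helper name.
torch_cuda_status() {
    python -c 'import torch; print("CUDA available:", torch.cuda.is_available())' 2>/dev/null \
        || echo "could not import torch with this python"
}

# Usage:
# torch_cuda_status
```

If it reports that torch could not be imported, drop the 2>/dev/null to see the underlying error.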

TensorFlow

pip install tensorflow-gpu --upgrade
# from $HOME
git clone https://github.com/tensorflow/tensorflow.git $HOME/tensorflow
# run an example to test
python $HOME/tensorflow/tensorflow/examples/tutorials/mnist/fully_connected_feed.py
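
As with PyTorch, you can ask TensorFlow whether it sees the GPU without running a full example. A sketch, where tf_gpu_device is a hypothetical wrapper name; tf.test.gpu_device_name() is TensorFlow's own call and returns an empty string when no GPU is visible.

```shell
# Print the GPU device TensorFlow would use, if any.
# tf_gpu_device is a hypothetical helper name.
tf_gpu_device() {
    python -c 'import tensorflow as tf; print("GPU device:", tf.test.gpu_device_name() or "none")' 2>/dev/null \
        || echo "could not import tensorflow with this python"
}

# Usage:
# tf_gpu_device
```

TensorFlow logs a lot to stderr, which is why it is silenced here; remove the 2>/dev/null if the import fails and you need the details.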

(Optional) Make remoting great again

Things I do to make my remote day-to-day easier.

Secure remote access over SSH

On your machine learning machine.

sudo apt-get install openssh-server

On your development machine, connect over SSH, where bdd is your username on the remote machine and mlbox.bdd.io is the hostname or IP address of that server.

ssh bdd@mlbox.bdd.io
# optionally, drop your public key on that server
ssh-copy-id bdd@mlbox.bdd.io
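
To avoid typing the user and hostname every time, you can add a host alias to ~/.ssh/config. The sketch below just prints the stanza. ssh_config_entry and the alias "mlbox" are names I made up; Host, HostName and User are standard ssh_config keywords.

```shell
# Print an ssh_config stanza for a host alias.
# ssh_config_entry is a hypothetical helper name.
ssh_config_entry() {
    local name="$1" host="$2" user="$3"
    printf 'Host %s\n    HostName %s\n    User %s\n' "$name" "$host" "$user"
}

# Usage:
# ssh_config_entry mlbox mlbox.bdd.io bdd >> ~/.ssh/config
# ssh mlbox
```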

Remote file-system over ssh

Install sshfs. If your main machine is a Mac, you can use brew and simply run brew install sshfs.

# the mount point must exist before mounting
mkdir -p ~/mlbox
sshfs bdd@mlbox.bdd.io:/home/bdd/ ~/mlbox

~/mlbox will now point to the remote file system.
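
When you're done, the mount should be released. A minimal sketch, assuming fusermount is available on Linux and plain umount is used on a Mac; is_mounted and unmount_remote are hypothetical helper names.

```shell
# Succeed if something is mounted at the given directory.
# is_mounted and unmount_remote are hypothetical helper names.
is_mounted() {
    mount | grep -qF " on $1 "
}

unmount_remote() {
    local dir="$1"
    is_mounted "$dir" || { echo "$dir is not mounted"; return 0; }
    # Linux FUSE mounts are released with fusermount; macOS uses plain umount.
    if command -v fusermount >/dev/null 2>&1; then
        fusermount -u "$dir"
    else
        umount "$dir"
    fi
}

# Usage (pass the absolute path, without a trailing slash):
# unmount_remote "$HOME/mlbox"
```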