Wednesday, July 22, 2015

My EC2 Theano Keras Cluster Development Setup

Setting up my EC2 environment for Machine Learning with GPU acceleration took a bit of learning. Setting up EC2 itself was simple; I had to figure out how to do the following:
  1. Set up an EC2 cluster of nodes
  2. Make sure there is shared storage in EBS where the home directories are stored, storage that will persist and be reused across multiple EC2 cluster starts and stops.
  3. Set up the networking between them so they can talk to each other and have passwordless SSH between nodes in the cluster
  4. Set them up to use their GPU, Theano and Keras
  5. Set the master up with a GUI desktop for developer convenience
So I thought I should document my steps for my own use in the future. But hopefully this will help others who come looking for a guide, just as I was a few days ago.
This is my development setup. I plan on building Machine Learning models, running them on the GPU, and eventually running them on a cluster of GPUs. So I am planning ahead to make sure I have all the pieces I need for that development.

My Local Machine setup

  1. Make sure StarCluster is installed and is configured to use my EC2 account.
  2. That it can be used to create clusters in my region
  3. Create a volume where home directories will be stored and will persist across cluster starts/stops

My Node Setup

  1. Make sure the Ubuntu image being used is up to date and secure.
  2. An EC2 GPU-enabled StarCluster Ubuntu 14.04 image for cluster development
  3. An EC2 VPC and Security Group to bring the nodes in the cluster together and allow them to be accessible.
  4. Set up passwordless SSH access between nodes in the cluster
  5. Numpy, Scipy and other libraries
  6. Nvidia GPU tooling
  7. Python VirtualEnv
  8. Theano
  9. Keras
  10. EC2 Instance Setup
  11. XFCE Desktop with X2GO

My Master Setup

  1. All the steps from My Node Setup above
  2. An XFCE desktop connected with X2Go for GUI access to the master node.

Install StarCluster on your local machine, Mac OS X in my case

Install StarCluster by following the instructions at
http://star.mit.edu/cluster/docs/latest/installation.html
Then follow the quick start steps at
http://star.mit.edu/cluster/docs/latest/quickstart.html
to make sure you can start a basic default cluster using
starcluster start mycluster
starcluster sshmaster mycluster -u ubuntu
starcluster terminate mycluster
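For reference, the relevant parts of my ~/.starcluster/config look roughly like the sketch below. The key name, key path and AMI ID are placeholders, not values from my account; later sections create the actual GPU AMI to plug into NODE_IMAGE_ID.
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>
AWS_USER_ID = <your-aws-user-id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
CLUSTER_USER = sgeadmin
NODE_IMAGE_ID = <ami-id>
NODE_INSTANCE_TYPE = g2.2xlarge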

EC2 VPC and Security Group Setup

Create your own VPC from the VPC menu, and enable the following (an equivalent AWS CLI sketch appears after these lists)
  1. VPC CIDR: Pick a range. Block sizes between /16 to /28. Example: 172.30.0.0/16
  2. DNS Resolution: Yes
  3. DNS Hostnames: Yes
  4. Classic Link: Yes
Add a Security Group, and do the following
  1. Give it a name
  2. Add the VPC to the security group
  3. Edit the Inbound Rules with
    1. ALL TCP, TCP(6), ALL, 0.0.0.0/0
    2. SSH(22), SSH, 22, 0.0.0.0/0
    3. ALL ICMP, ICMP(1), ALL, 0.0.0.0/0
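If you prefer scripting these console steps, roughly the same VPC and Security Group setup can be done with the AWS CLI. This is a sketch only, assuming the AWS CLI is installed and configured; the VPC and security group IDs are placeholders taken from the output of the earlier commands.
# Create the VPC and enable DNS support, DNS hostnames and ClassicLink
aws ec2 create-vpc --cidr-block 172.30.0.0/16
aws ec2 modify-vpc-attribute --vpc-id <vpc-id> --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id <vpc-id> --enable-dns-hostnames "{\"Value\":true}"
aws ec2 enable-vpc-classic-link --vpc-id <vpc-id>
# Create the security group inside the VPC and open the same inbound rules as in the list above
aws ec2 create-security-group --group-name cluster-sg --description "StarCluster nodes" --vpc-id <vpc-id>
aws ec2 authorize-security-group-ingress --group-id <sg-id> --protocol tcp --port 0-65535 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id <sg-id> --protocol icmp --port -1 --cidr 0.0.0.0/0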

Ubuntu 14.04 updated to confirm the Shellshock bug is fixed

Create an EC2 instance of type g2.2xlarge to start with, using the latest standard Ubuntu AMI and the VPC created above.
1. Confirm linux kernel information
uname -mrs
cat /etc/lsb-release  
2. Confirm that the Shellshock bug does not exist in this image; the following command should not print "vulnerable".
env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
3. Upgrade packages
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
4. Upgrade the kernel as follows. Go to http://kernel.ubuntu.com/~kernel-ppa/mainline/ and pick the latest kernel version within the same major version number.
mkdir kernel
cd kernel/
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-headers-3.19.8-031908_3.19.8-031908.201505110938_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-headers-3.19.8-031908-generic_3.19.8-031908.201505110938_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-image-3.19.8-031908-generic_3.19.8-031908.201505110938_amd64.deb
sudo dpkg -i *.deb
cd ..
rm -rf kernel
sudo shutdown -r now
5. Verify again that the Shellshock bug is no longer present
env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
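At this point you can also confirm that the new kernel from step 4 is the one running; with the 3.19.8 packages above, uname should report the matching version.
uname -r
3.19.8-031908-generic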
6. Create and save the AMI for future use.
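If you prefer the command line to the console for this step, the AWS CLI equivalent looks roughly like this; the instance ID and image name are placeholders.
aws ec2 create-image --instance-id <instance-id> --name "ubuntu-14.04-patched" --description "Ubuntu 14.04 with updated kernel and Shellshock fix"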

Create StarCluster enabled Ubuntu 14.04 AMI

The next step is to create a StarCluster enabled image based on the updated Ubuntu 14.04 AMI we created in the previous step.
1. Create new EC2 instance with the AMI created above or continue from the last section.
2. Edit /etc/apt/sources.list to uncomment the lines that add multiverse as a source, then update
sudo vi /etc/apt/sources.list
sudo apt-get update
3. Install nfs-kernel-server and its dependencies, along with portmap. Ubuntu 14.04 uses rpcbind, but we can install portmap and make it work.
sudo apt-get install nfs-kernel-server nfs-common portmap
sudo ln -s /etc/init.d/nfs-kernel-server /etc/init.d/nfs
sudo ln -s /lib/init/upstart-job /etc/init.d/portmap
sudo ln -s /lib/init/upstart-job /etc/init.d/portmap-wait
4. Use the customized scimage_14_04.py script from my fork of StarCluster
git clone https://github.com/sarvi/StarCluster.git
sudo python StarCluster/utils/scimage_14_04.py  
5. Download sge6.tar.gz from the following URL into /home/ubuntu/
https://drive.google.com/folderview?id=0BwXqXe5m8cbWflY1UEpnVUpScVozbFVuMERaOE9sMktrX1dFQmhCU0tLbnItUEo0VkZxZFE&usp=sharing
6. Untar it into /opt
cd /opt  
sudo tar -zxvf /home/ubuntu/sge6.tar.gz
cd
rm sge6.tar.gz
rm -rf StarCluster
7. Create and save the AMI that can now be used in a StarCluster configuration

Setup Numpy, Scipy, CUDA and other libraries

The next step is to install Numpy, Scipy, the CUDA compilers and tools, etc. It is recommended to have the Python virtualenv tooling to allow you to have different custom virtual Python environments for developing software. The following commands should get them installed.
sudo apt-get update
sudo apt-get -y dist-upgrade


sudo apt-get install -y gcc g++ gfortran build-essential git wget linux-image-generic libopenblas-dev python-dev python-pip python-nose python-numpy python-scipy


sudo apt-get install -y python-virtualenv
sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb


sudo apt-get update
sudo apt-get install -y cuda


echo -e "\nexport PATH=/usr/local/cuda/bin:$PATH\n\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64" >> .bashrc


sudo shutdown -r now
Wait for the machine to reboot, log back in, and continue the installation and setup as follows
cuda-install-samples-7.0.sh ~/
cd NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
make
The following will make sure that CUDA was installed correctly and verify that the GPU is accessible and ready for use.
./deviceQuery
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS
cd ~/
rm -rf cuda-repo-ubuntu1404_7.0-28_amd64.deb
rm -rf NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
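As an extra sanity check that the driver and toolkit are installed and on the PATH, the following should report the GRID K520 and the CUDA 7.0 compiler respectively.
nvidia-smi
nvcc --version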
Create a virtual Python environment that has access to the global system packages, and activate it by sourcing the activation script
virtualenv --system-site-packages theanoenv
source theanoenv/bin/activate
Install Theano within the virtual environment. I do it this way so that I can work on Theano itself to help fix bugs in the code; I can install it in an editable format and modify its code if needed. If you have no intention of modifying or updating Theano, you can install it outside the virtualenv.
As a rule, I tend to install all tools that are bleeding edge, and stuff they depend on inside the virtual python environment.
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
echo -e "\n[global]\nfloatX=float32\ndevice=gpu\n[mode]=FAST_RUN\n\n[nvcc]\nfastmath=True\n\n[cuda]\nroot=/usr/local/cuda" >> ~/.theanorc
Make sure the Theano installation works and can use the GPU. The following will acquire the GPU and start running the Theano test suite on it, which takes a while. You can interrupt it once you know the GPU is being used and at least some of the tests are running and passing.
python -c "import theano; theano.test()"
Next pull Keras, the modular machine learning library that builds on Theano, so its sources are available to you, and pip install the code as editable so that your changes to the Keras sources can be run, debugged and tested easily.
mkdir Workspace
cd Workspace
git clone https://github.com/fchollet/keras.git
pip install -e keras/
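A quick smoke test that the editable install is picked up inside the virtualenv; this just imports the package and nothing more.
python -c "import keras"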
Next set up passwordless SSH between nodes in the cluster. For this you need the key file (*.pem) that you downloaded from Amazon and use to SSH into your EC2 instance from your local machine. Copy it over to the instance you are working with, then ssh-add the *.pem key.
chmod 644 .ssh/authorized_keys
scp -i <your-public-encryption-key>.pem <your-public-encryption-key>.pem ubuntu@<public-ip-address-ec2-instance>:/home/ubuntu/
eval `ssh-agent`
ssh-add <your-public-encryption-key>.pem
Next verify that you can do passwordless SSH by trying to SSH into the same EC2 instance through its local IP address.
ssh ubuntu@<local-ip-address>
At this stage, you have everything installed and configured for working with the GPU using Theano and Keras. This software configuration can be used for both masters and slaves in the cluster.

Create a slave AMI

This would be a good point to go to the AWS console and create an AMI, i.e. an image based on the software and configuration of your current instance. You can launch future EC2 instances from the AMI you create here. I call this a slave AMI since I would like to add GUI desktop functionality to my master.

XFCE Desktop with X2GO for GUI access

I prefer to have the master machine running a GUI desktop, with xterms, to do my development on the master node; a setup that I can disconnect from and reconnect to as needed, where my development environment stays intact and gives me some continuity of development.
For this I set up an XFCE desktop, known for its light footprint, and X2Go for remote GUI access, chosen for its low bandwidth usage.
Add the X2Go Stable PPA
sudo add-apt-repository ppa:x2go/stable
sudo apt-get update
Install the XFCE packages and X2Go. Feel free to add other packages, but I purposely kept this selection small.
Installing "x2goserver-xsession" enables X2Go to launch any utilities specified under /etc/X11/Xsession.d/ , which is how a local X11 display or an XDMCP display would behave. This maximizes compatibility with applications.
sudo apt-get install xfce4 xfce4-goodies xfce4-artwork xubuntu-icon-theme firefox x2goserver x2goserver-xsession
Install X2Go Client and connect with it. In the X2Go Client "Session Preferences":
Specify "XFCE" as the "Session type."
If you have the SSH key in OpenSSH/PEM format, specify it in "Use RSA/DSA key for ssh connection".
If you have the ssh key in PuTTY .PPK format, convert it using PuTTYgen, and then specify it.
Or even better, just launch Pageant (part of the PuTTY suite), load the .PPK key in Pageant, then in X2Go Client select "Try auto login (ssh-agent or default ssh key)".
Disable Screen Saver to minimize CPU usage
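One way to do this from a terminal inside the X2Go session, assuming the stock X screensaver and DPMS settings (the XFCE Power Manager and Screensaver settings dialogs work too):
xset s off
xset -dpms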

Create EBS storage to be mounted on all cluster nodes

The next step is to create an EBS storage volume using a standard StarCluster enabled image, so that it is created, formatted and made available, and then to move the home directory, /home/ubuntu, onto this storage. This volume will get mounted as /home in clusters and hence will act as storage that persists between clusters being created and destroyed.
Create an EBS storage volume of the desired size. Use an image-id from the list shown by the "starcluster listpublic" command. Specify the availability zone where you want the volume to be created; in my case us-west-2c, where GPU nodes are available and cheap. 100 is the size in gigabytes.
starcluster listpublic
starcluster createvolume --name=myhome --image-id=ami-04bedf34 100 us-west-2c
Note down the volume id that is created.
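If you did not catch the volume id in the command output, it can be listed again with the following (assuming your StarCluster version provides the listvolumes command); it also appears in the EC2 console under Elastic Block Store > Volumes.
starcluster listvolumes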
You will need to temporarily configure the StarCluster config file to use the just-created EBS volume as shared storage at /myhome.
VOLUMES = myhome
..........................
[volume myhome]
VOLUME_ID = vol-595b0fbe
MOUNT_PATH = /myhome
This will mount the created EBS volume onto /myhome on the StarCluster master and slave nodes.
Now start a new cluster and log into the master node as user ubuntu
starcluster start mycluster
starcluster sshmaster mycluster -u ubuntu
Move the home directory to mounted storage
First make sure the .gnupg directory is owned by user ubuntu; if not, change its ownership as follows. Then, from /home, tar the ubuntu directory and save the archive into /myhome
cd /home
sudo chown -R $(whoami) ubuntu/.gnupg
sudo tar -czvf /myhome/ubuntu.owner.tar.gz --same-owner ubuntu
cd /myhome
sudo tar -zxvf ubuntu.owner.tar.gz
Now change the StarCluster configuration to mount the shared EBS volume on /home instead of /myhome, then terminate and restart the cluster. You should now have an ubuntu home directory in EBS storage that is not lost when you stop and restart your cluster.
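The volume section of the config then becomes the following; same volume id, only the mount path changes.
[volume myhome]
VOLUME_ID = vol-595b0fbe
MOUNT_PATH = /home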

Create a master AMI

This would be a good point to go to the AWS console and create the master AMI, with the GUI desktop functionality included.

Updates

I plan on keeping this page updated as my setup evolves and I refine the environment.

Copyright (c) Sarvi Shanmugham
