Wednesday, July 22, 2015

My EC2 Theano Keras Cluster Development Setup

Setting up my EC2 environment for Machine Learning with GPU acceleration took a bit of learning. Setting up EC2 itself was simple, but I had to figure out how to do the following:
  1. Set up an EC2 cluster of nodes
  2. Make sure there is shared storage in EBS where the home directories are stored, storage that will persist and be reused across multiple EC2 cluster starts and stops
  3. Set up the networking between the nodes so that they can talk to each other and have passwordless SSH between nodes in the cluster
  4. Set them up to use their GPU, Theano and Keras
  5. Set the master up with a GUI desktop for developer convenience
So I thought I should document my steps for my own use in the future. But hopefully this will help others who come looking for a guide, just as I was a few days ago.
This is my development setup. I plan on building Machine Learning models, running them on the GPU and eventually running them on a cluster of GPUs. So I am planning ahead to make sure I have all the pieces I need to do that development.

My Local Machine setup

  1. Make sure StarCluster is installed and is configured to use my EC2 account.
  2. Make sure it can be used to create clusters in my region
  3. Create a volume where home directories will be stored and will persist across cluster starts/stops

My Node Setup

  1. Make sure the Ubuntu image being used is up to date and secure.
  2. EC2 GPU enabled StarCluster Ubuntu 14.04 image for cluster development
  3. An EC2 VPC and Security Group to bring the nodes in the cluster together and allow them to be accessible.
  4. Setup passwordless ssh access between nodes in the cluster
  5. Numpy, Scipy and other libraries
  6. Nvidia GPU tooling
  7. Python VirtualEnv
  8. Theano
  9. Keras
  10. EC2 Instance Setup
  11. XFCE Desktop with X2GO

My Master Setup

  1. All the steps from My Node Setup above
  2. An XFCE desktop connected with X2GO for GUI access to the master node.

Install StarCluster on your local machine, MACOS in my case

Install StarCluster by following the installation instructions at
http://star.mit.edu/cluster/docs/latest/installation.html
Follow the quick start steps at
http://star.mit.edu/cluster/docs/latest/quickstart.html
to make sure you can start a basic default cluster using
starcluster start mycluster
starcluster sshmaster mycluster -u ubuntu
starcluster terminate mycluster

EC2 VPC and Security Group Setup

Create your own VPC in the VPC menu, and enable the following
  1. VPC CIDR: Pick a range. Block sizes must be between /16 and /28. Example: 172.30.0.0/16
  2. DNS Resolution: Yes
  3. DNS Hostnames: Yes
  4. Classic Link: Yes
Add a Security Group, and do the following
  1. Give it a name
  2. Add the VPC to the security group
  3. Edit the Inbound Rules with (Type, Protocol, Port Range, Source)
    1. ALL TCP, TCP(6), ALL, 0.0.0.0/0
    2. SSH(22), TCP(6), 22, 0.0.0.0/0
    3. ALL ICMP, ICMP(1), ALL, 0.0.0.0/0
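The CIDR rule above (block sizes between /16 and /28) can be checked before you type the range into the AWS console. This is a minimal Python 3 sketch using the standard ipaddress module. The function name check_vpc_cidr is my own and is not part of any AWS tooling.

```python
import ipaddress


def check_vpc_cidr(cidr):
    """Parse an IPv4 CIDR block and confirm its size is between /16 and /28.

    Hypothetical helper for illustration only. Raises ValueError when the
    text is not a valid network (for example, host bits are set) or when
    the prefix length falls outside the range quoted in the post.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("expected an IPv4 block, got %s" % cidr)
    if not 16 <= network.prefixlen <= 28:
        raise ValueError(
            "block size must be between /16 and /28, got /%d" % network.prefixlen
        )
    return network


# The example range from the list above.
example = check_vpc_cidr("172.30.0.0/16")
print(example.prefixlen, example.num_addresses)
```

A block such as 10.0.0.0/8 would be rejected by this check because it is larger than /16.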

Ubuntu 14.04 updated to confirm the Shellshock bug is fixed

Create an EC2 instance of type g2.2xlarge to start with, using the latest standard Ubuntu AMI and the VPC created above.
1. Confirm linux kernel information
uname -mrs
cat /etc/lsb-release  
2. Confirm that the Shellshock bug does not exist in this image. The following command should not print "vulnerable".
env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
3. Upgrade packages
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
4. Upgrade the kernel as follows. Go to http://kernel.ubuntu.com/~kernel-ppa/mainline/ and pick the latest version of the kernel within the same major version number.
mkdir kernel
cd kernel/
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-headers-3.19.8-031908_3.19.8-031908.201505110938_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-headers-3.19.8-031908-generic_3.19.8-031908.201505110938_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-image-3.19.8-031908-generic_3.19.8-031908.201505110938_amd64.deb
sudo dpkg -i *.deb
cd ..
rm -rf kernel
sudo shutdown -r now
5. Verify again, after the reboot, that the Shellshock bug does not exist
env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
6. Create and save the AMI for future use.
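If you build images often, the Shellshock probe used in steps 2 and 5 can be wrapped in a small script. The following is a rough Python 3 sketch. It sets the same environment variable x as the env command above, runs the same bash -c command, and looks for a line that says "vulnerable". Both function names are hypothetical helpers of mine.

```python
import os
import subprocess

# Same function-definition payload that the env command in the post exports as x.
PROBE_VALUE = "() { :;}; echo vulnerable"


def looks_vulnerable(output):
    """Return True when the probe output contains a line reading 'vulnerable'."""
    return "vulnerable" in output.splitlines()


def run_shellshock_probe():
    """Run the probe with bash and return its combined stdout and stderr.

    Assumes bash is on the PATH of the machine where this runs.
    """
    env = dict(os.environ, x=PROBE_VALUE)
    result = subprocess.run(
        ["bash", "-c", "echo this is a test"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

On the instance, calling looks_vulnerable(run_shellshock_probe()) should return False once the image is patched.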

Create StarCluster enabled Ubuntu 14.04 AMI

The next step is to create a StarCluster enabled image based on the updated Ubuntu 14.04 AMI we created in the previous step.
1. Create new EC2 instance with the AMI created above or continue from the last section.
2. Update apt-get sources.list to uncomment the lines that add multiverse as a source and update
sudo vi /etc/apt/sources.list
sudo apt-get update
3. Install nfs-kernel-server and dependencies along with portmap. Ubuntu 14.04 uses rpcbind, but we can install portmap and make it work.
sudo apt-get install nfs-kernel-server nfs-common portmap
sudo ln -s /etc/init.d/nfs-kernel-server /etc/init.d/nfs
sudo ln -s /lib/init/upstart-job /etc/init.d/portmap
sudo ln -s /lib/init/upstart-job /etc/init.d/portmap-wait
4. Use the customized scimage_14_04.py script from my fork of StarCluster
git clone https://github.com/sarvi/StarCluster.git
sudo python StarCluster/utils/scimage_14_04.py  
5. Download sge6.tar.gz from the following URL into /home/ubuntu/
https://drive.google.com/folderview?id=0BwXqXe5m8cbWflY1UEpnVUpScVozbFVuMERaOE9sMktrX1dFQmhCU0tLbnItUEo0VkZxZFE&usp=sharing
6. Untar it into /opt
cd /opt  
sudo tar -zxvf /home/ubuntu/sge6.tar.gz
cd
rm sge6.tar.gz
rm -rf StarCluster
7. Create and save the AMI that can now be used in a StarCluster configuration

Setup Numpy, Scipy, CUDA and other libraries

The next step is to install Numpy, Scipy, the CUDA compilers and tools, etc. It is recommended to have the Python virtualenv tooling, which lets you keep different custom virtual Python environments for developing software. The following commands should get them installed.
sudo apt-get update
sudo apt-get -y dist-upgrade


sudo apt-get install -y gcc g++ gfortran build-essential git wget linux-image-generic libopenblas-dev python-dev python-pip python-nose python-numpy python-scipy


sudo apt-get install -y python-virtualenv
sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb


sudo apt-get update
sudo apt-get install -y cuda


echo -e "\nexport PATH=/usr/local/cuda/bin:\$PATH\n\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64" >> ~/.bashrc


sudo shutdown -r now
Wait for the machine to reboot, log back in and continue the installation and setup as follows
cuda-install-samples-7.0.sh ~/
cd NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
make
The following confirms that CUDA was installed correctly and verifies that the GPU is accessible and ready for use. The last lines of its output should look like the two lines shown after the command.
./deviceQuery
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS
cd ~/
rm -rf cuda-repo-ubuntu1404_7.0-28_amd64.deb
rm -rf NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
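When checking several nodes, it can help to parse the deviceQuery summary instead of reading it by eye. The sketch below is Python and assumes the summary has the same shape as the two output lines shown above: a comma-separated "key = value" line that starts with "deviceQuery," followed by a "Result =" line. The function name is my own, and a device name containing a comma would confuse this simple split.

```python
def parse_device_query_summary(output):
    """Extract the key/value summary and the PASS/FAIL result from deviceQuery output.

    Hypothetical helper: returns (summary_dict, result_string). The result
    is None when no "Result =" line is present.
    """
    summary = {}
    result = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("deviceQuery,"):
            # Skip the leading "deviceQuery" token, then split each "key = value" field.
            for field in line.split(",")[1:]:
                key, _, value = field.partition("=")
                summary[key.strip()] = value.strip()
        elif line.startswith("Result ="):
            result = line.split("=", 1)[1].strip()
    return summary, result


# The sample output quoted in the post.
sample = (
    "deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, "
    "CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520\n"
    "Result = PASS\n"
)
summary, result = parse_device_query_summary(sample)
print(summary["Device0"], summary["NumDevs"], result)
```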
Create a virtual Python environment that has access to the globally installed system packages, and activate it by sourcing the activation script
virtualenv --system-site-packages theanoenv
source theanoenv/bin/activate
Install Theano within the virtual environment. I do it this way so that I can work on Theano itself to help fix bugs in the code. I can install it in an editable format and modify its code if needed. If you have no intention of modifying or updating Theano, you can install it outside the virtualenv.
As a rule, I tend to install all tools that are bleeding edge, and stuff they depend on inside the virtual python environment.
pip install --upgrade --no-deps git+https://github.com/Theano/Theano.git
echo -e "\n[global]\nfloatX=float32\ndevice=gpu\nmode=FAST_RUN\n\n[nvcc]\nfastmath=True\n\n[cuda]\nroot=/usr/local/cuda" >> ~/.theanorc
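The same settings can also be generated with Python instead of a long echo string, which makes the section layout easier to see. This is a Python 3 sketch using the standard configparser module. It assumes mode is meant to be a key under [global], and the function name build_theanorc and its parameters are my own.

```python
import configparser
import io


def build_theanorc(device="gpu", cuda_root="/usr/local/cuda"):
    """Return the text of a .theanorc with the settings used in this post.

    Hypothetical helper for illustration only.
    """
    config = configparser.ConfigParser()
    # Keep option names case-sensitive so floatX is not lowercased.
    config.optionxform = str
    config["global"] = {"floatX": "float32", "device": device, "mode": "FAST_RUN"}
    config["nvcc"] = {"fastmath": "True"}
    config["cuda"] = {"root": cuda_root}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


theanorc_text = build_theanorc()
print(theanorc_text)
```

The printed text could be written to ~/.theanorc by hand. Passing device="cpu" gives a variant for a machine without a GPU.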
Make sure the Theano installation works and can use the GPU. This will acquire the GPU and start running the Theano tests on it, which takes a while. You can interrupt it once you know the GPU is being used and at least some of the tests are running and passing
python -c "import theano; theano.test()"
Next, pull Keras, the modular machine learning library that builds on Theano, so that its sources are available to you. Then pip install the code as editable, so that your changes to the Keras sources can be run, debugged and tested easily.
mkdir Workspace
cd Workspace
git clone https://github.com/fchollet/keras.git
pip install -e keras/
Next, set up passwordless SSH between nodes in the cluster. For this you need the key file (*.pem) that you downloaded from Amazon and that you use to SSH into your EC2 instance from your local machine. Copy it over to the instance you are working with, then ssh-add it there. The scp command below runs on your local machine; the other commands run on the EC2 instance.
chmod 644 .ssh/authorized_keys
scp -i <your-public-encryption-key>.pem <your-public-encryption-key>.pem ubuntu@<public-ip-address-ec2-instance>:/home/ubuntu/
eval `ssh-agent`
ssh-add <your-public-encryption-key>.pem
Next verify that you can do a passwordless SSH, by trying an ssh into the same EC2 instance through its local IP address.
ssh ubuntu@<local-ip-address>
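To check this from a script rather than interactively, ssh can be run in batch mode so that it fails instead of prompting for a password. The following Python sketch builds such a command. BatchMode and ConnectTimeout are standard OpenSSH client options, while the function names, the default user and the example address 172.30.0.10 are my own assumptions.

```python
import subprocess


def build_ssh_check(host, user="ubuntu", timeout=5):
    """Build an ssh command that exits non-zero rather than asking for a password.

    Hypothetical helper. The remote command is just `true`, so success
    means only that key-based login worked.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=%d" % timeout,
        "%s@%s" % (user, host),
        "true",
    ]


def passwordless_ssh_works(host, user="ubuntu"):
    """Return True when the batch-mode login to host succeeds."""
    return subprocess.call(build_ssh_check(host, user)) == 0


# Show the command that would be run for an example private address.
print(" ".join(build_ssh_check("172.30.0.10")))
```

Note that in batch mode a first connection to a host whose key has not been accepted yet is likely to fail too, so it is worth connecting once by hand first.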
At this stage, you have everything installed and configured for working with the GPU using Theano and Keras. This software configuration can be used for both masters and slaves in the cluster.
Create a slave AMI
This is a good point to go to the AWS menu and create an AMI, i.e. an image based on the software and configuration of your current instance. You can launch future EC2 instances with the AMI that you create here. I call this a slave AMI since I would like to add GUI desktop functionality to my master.
XFCE Desktop with X2GO for GUI access
I prefer to have a master machine running a GUI desktop, with xterms to do my development on my master node. This is a setup that I can disconnect from and connect back to as needed, where my development environment stays intact and gives me some continuity of development.
For this I set up an XFCE desktop, which is known for its light footprint, and X2GO for remote GUI access, which is known for its low bandwidth use.
Add the X2Go Stable PPA
sudo add-apt-repository ppa:x2go/stable
sudo apt-get update
Install the XFCE packages and X2Go. Feel free to add other packages, but I purposely kept this selection small.
Installing "x2goserver-xsession" enables X2Go to launch any utilities specified under /etc/X11/Xsession.d/ , which is how a local X11 display or an XDMCP display would behave. This maximizes compatibility with applications.
sudo apt-get install xfce4 xfce4-goodies xfce4-artwork xubuntu-icon-theme firefox x2goserver x2goserver-xsession
Install X2Go Client and connect with it. In the X2Go Client "Session Preferences":
Specify "XFCE" as the "Session type."
If you have the SSH key in OpenSSH/PEM format, specify it in "Use RSA/DSA key for ssh connection".
If you have the ssh key in PuTTY .PPK format, convert it using PuTTYgen, and then specify it.
Or even better, just launch Pageant (part of the PuTTY suite,) load the .PPK key in Pageant, then in X2Go Client select "Try auto login (ssh-agent or default ssh key)".
Disable Screen Saver to minimize CPU usage

Create EBS storage to be mounted on all cluster nodes

The next step is to create an EBS storage volume using a standard StarCluster enabled image, so that it is created, formatted and made available, and to move the home directory, /home/ubuntu, onto this storage. It will get mounted as /home in clusters and hence will act as storage that persists between clusters that are created and destroyed.
Create an EBS storage volume of the desired size. Use an image ID from the list shown by the "starcluster listpublic" command. Specify the availability zone where you want the volume to be created, in my case us-west-2c, where GPU nodes are available and cheap. The 100 is the size in gigabytes.
starcluster listpublic
starcluster createvolume --name=myhome --image-id=ami-04bedf34 100 us-west-2c
Note down the volume id that is created.
You will need to temporarily configure the StarCluster config file to use the newly created EBS volume as shared storage at /myhome.
VOLUMES = myhome
..........................
[volume myhome]
VOLUME_ID = vol-595b0fbe
MOUNT_PATH = /myhome
This will mount the created EBS volume onto /myhome on the StarCluster master and slave nodes
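Since the mount path changes from /myhome to /home later on, it can be handy to generate the volume section rather than edit it by hand. The Python sketch below renders the [volume ...] section shown above from a name, a volume ID and a mount path. The function name is hypothetical. It only produces text and does not touch the StarCluster config file itself.

```python
def render_volume_section(name, volume_id, mount_path):
    """Render a StarCluster [volume ...] config section as text.

    Hypothetical helper. The keys match the snippet in the post
    (VOLUME_ID and MOUNT_PATH).
    """
    if not mount_path.startswith("/"):
        raise ValueError("mount path must be absolute, got %r" % mount_path)
    lines = [
        "[volume %s]" % name,
        "VOLUME_ID = %s" % volume_id,
        "MOUNT_PATH = %s" % mount_path,
    ]
    return "\n".join(lines) + "\n"


# Temporary mount point used while copying the home directory.
temporary = render_volume_section("myhome", "vol-595b0fbe", "/myhome")
# Final mount point once the home directory lives on the volume.
final = render_volume_section("myhome", "vol-595b0fbe", "/home")
print(temporary)
print(final)
```

The cluster template still needs its VOLUMES = myhome line, as in the snippet above.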
Now start a new cluster and log into the master node as user ubuntu
starcluster start mycluster
starcluster sshmaster mycluster -u ubuntu
Move the home directory to mounted storage
Make sure the .gnupg directory is owned by user ubuntu; if not, change its ownership as follows. Then tar the ubuntu directory, save it into /myhome and extract it there. The commands use relative paths, so run them from /home.
cd /home
sudo chown -R $(whoami) ubuntu/.gnupg
sudo tar -czvf /myhome/ubuntu.owner.tar.gz --same-owner ubuntu
cd /myhome
sudo tar -zxvf ubuntu.owner.tar.gz
Now change the StarCluster configuration to mount the shared EBS volume on /home instead of /myhome, then terminate and restart the cluster. You should now have an ubuntu home directory in EBS storage that is not lost when you stop and restart your cluster.
Create a master AMI
This is a good point to go to the AWS menu and create the master AMI, with GUI desktop functionality

Updates

I plan on keeping this page updated as my setup evolves and I refine the environment.

Copyright (c) Sarvi Shanmugham