Saturday, December 21, 2013

My Machine Learning Tool Box

I've been meaning, for some time, to write about the science and technology I am passionate about. What better way to kick it all off than to talk about my machine learning development toolbox?

Programming Languages

I believe the programming languages you use are among the most important tools in your toolbox. Why? Because they affect the way you develop code: the simplicity with which you can express a solution, the readability of the written code, the availability of libraries for the wide range of problems you might encounter, and the productivity with which you can develop software, all without trading away too much performance when it matters. I also believe in a small but powerful toolbox that you know expertly rather than a very large one you know shallowly. For an average developer like myself, it is easier to be an expert in one or two languages than to claim an average understanding of ten. Hence I choose the languages in my toolbox carefully.

My toolbox consists of the following:

  1. Python
  2. C/C++

Python is a scripting language comparable to Perl and Ruby but superior in many respects. It has the same expressive power as Perl, yet it is far more readable, which matters greatly in software development: code gets read many more times than it is written. It is an interpreted language, but a JIT implementation (PyPy) is also available for performance. And when further performance is needed, it is easy to convert the performance-critical code into C/C++ directly, or with a tool like Cython as an assist. With libraries like CFFI it is easy to wrap C libraries and make them available from Python. And the range of libraries and tools available in Python is legendary, spanning web development, mathematics, and scientific computing.
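As a small illustration of that readability, here is a toy snippet that counts word frequencies in a file; the file name is made up for the example:

    from collections import Counter

    # count how often each word appears in a (hypothetical) text file
    with open("corpus.txt") as f:
        counts = Counter(word for line in f for word in line.split())

    for word, n in counts.most_common(10):
        print(word, n)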


C/C++ is a compiled, lower-level language that lets me implement the sections of code that need to be fast. Generally speaking, 90% of the performance problems can be narrowed down to 10% of the code. So my development philosophy has always been to write what I need in Python and drop down to C/C++ to improve the 10% of the code where performance matters; tools like Cython and CFFI make that brain-dead simple to do, as the sketch below shows. And for the sort of code where a scripting language is not an option at all, C/C++ fills the space completely.
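As a rough illustration of that drop-down step, here is a minimal Cython sketch of a hot inner loop; the file and function names are my own invention for the example:

    # fast_dot.pyx -- a hypothetical Cython version of an inner-loop dot product
    def dot(double[:] a, double[:] b):
        cdef int i
        cdef double s = 0.0
        for i in range(a.shape[0]):
            s += a[i] * b[i]
        return s

Once compiled with Cython, this imports like any other Python module, but the typed loop runs at C speed.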

The Go language from Google has potential as a future alternative to C/C++, and I am keeping an eye on it. I would like to see how it works and interfaces with Python. If it keeps improving like it has been, it may well replace C/C++ in my toolbox.

Development Environment

IPython is a more advanced Python shell and development environment that lets you develop faster. It includes the IPython notebook, which provides a browser-based environment for experimenting with code and generating and viewing graphs interactively. Additionally, the code and generated graphs can be interspersed with documentation, which makes it a great tool for communicating ideas or producing interactive documentation.
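For instance, a notebook cell like the following (a made-up toy example) renders its plot inline, right below the code:

    %matplotlib inline
    import matplotlib.pyplot as plt

    # a toy training curve, drawn inline in the notebook
    errors = [0.9, 0.5, 0.3, 0.22, 0.18, 0.16]
    plt.plot(errors)
    plt.xlabel("epoch")
    plt.ylabel("training error")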

virtualenv/pip is a pair of Python tools that let you create virtual, private instances of a Python installation, thereby allowing you to develop Python modules without messing with the central installation. These virtual instances are very lightweight. Further, pip, a more advanced package-management tool, allows a "development mode" install of the library you are developing, which is convenient: it links the Python installation directory to your development directory, letting you modify your Python sources and run them without any additional installation step.
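A typical session looks something like this (the directory names are illustrative):

    $ virtualenv env               # create a private Python environment
    $ source env/bin/activate      # point this shell at it
    (env)$ pip install numpy       # installs into env, not the system
    (env)$ pip install -e .        # "development mode" install of the
                                   #   project in the current directory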

CFFI is a Python wrapper generator for C code. It allows you to take any C library and generate a Python wrapper for it from the C header files, with little or no change.
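The canonical example from the CFFI documentation gives the flavor: declare the C signature, open the library, and call straight into it:

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int printf(const char *format, ...);")  # pasted from a C header
    C = ffi.dlopen(None)               # load the standard C library

    arg = ffi.new("char[]", b"world")  # a char* owned by Python
    C.printf(b"hi there, %s.\n", arg)  # call C's printf directly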

Mayavi is a 3D scientific visualization tool with a Python interface that is very handy for viewing the error surfaces of neurons or neural networks. It lets me take training errors, neuron signals, etc. as they evolve during training and plot them as 3D surfaces, which gives a better visual understanding of how things are performing or where the problems are. Understanding machine learning requires understanding how error surfaces behave during training.
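A minimal sketch, with a made-up bowl-with-ripples function standing in for a real error surface:

    import numpy as np
    from mayavi import mlab

    # a toy "error surface": a bowl with ripples
    x, y = np.mgrid[-3:3:100j, -3:3:100j]
    z = x**2 + y**2 + np.sin(3 * x) * np.cos(3 * y)

    mlab.surf(x, y, z)  # opens an interactive 3D surface view
    mlab.show()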

Libraries

Numpy/SciPy - Machine learning involves a lot of floating-point vector arithmetic. NumPy lets you do fast vector arithmetic from Python, and SciPy adds a full library of statistical and scientific routines, implemented and ready for use.
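For example, evaluating a layer of logistic neurons over a whole batch of inputs is a couple of lines of vector math (the sizes are arbitrary):

    import numpy as np

    X = np.random.randn(1000, 50)   # 1000 input vectors of 50 features each
    W = np.random.randn(50, 10)     # weights for 10 neurons
    b = np.zeros(10)                # biases

    # activations of all 10 neurons for all 1000 inputs at once
    A = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))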

Pandas - Think of Pandas as filling the holes in NumPy specifically when dealing with time-series data.
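A small made-up example with today's pandas API: hourly sensor readings resampled down to daily averages:

    import numpy as np
    import pandas as pd

    # a week of hourly readings against a real time index
    idx = pd.date_range("2013-12-01", periods=7 * 24, freq="H")
    s = pd.Series(np.random.randn(len(idx)), index=idx)

    daily = s.resample("D").mean()  # one aggregated value per day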

Hardware Acceleration

Theano - GPGPU Acceleration - When working on machine learning algorithms, you quickly find that they are very vector-math intensive, meaning you are going to be doing vector calculations a lot. This means you will soon hit the limits of your main CPU, even if it has many cores. Though many vector-math libraries can use multiple CPUs to accelerate your machine learning algorithms, being able to leverage one or more GPGPU units on the system is indispensable. GPGPU stands for General-Purpose Graphics Processing Unit. These are graphics accelerators from vendors like NVIDIA; graphics processing requires a lot of vector arithmetic, so they have been designed to be good array/vector processors, which is exactly what is needed for machine learning.

A Python library like Theano lets you express your machine learning formulae and have them compiled to run on the GPU, achieving a 10-100 fold increase in performance. If a GPU is not available, it is smart enough to fall back on vector-acceleration libraries that use multiple CPU cores, so you still get better performance than plain Python.
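A minimal sketch of that workflow: build the formula symbolically, then compile it into a callable function (which Theano places on the GPU when one is configured):

    import numpy as np
    import theano
    import theano.tensor as T

    X = T.matrix("X")   # symbolic batch of inputs
    w = T.vector("w")   # symbolic weight vector
    b = T.scalar("b")   # symbolic bias

    p = 1.0 / (1.0 + T.exp(-(T.dot(X, w) + b)))  # a logistic neuron
    predict = theano.function([X, w, b], p)      # compile the expression graph

    out = predict(np.random.randn(100, 50), np.random.randn(50), 0.0)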

The Future

There is still a scale issue if you want to build really massive models that require distributed computing. GPGPUs are evolving to become network-capable, and I hope that in the future Theano evolves to distribute its computation across a cluster of GPGPU nodes.

Machine learning involves developing mathematical models of neurons and neural networks, as well as creating networks of networks. It would be nice to have an object-oriented library where I could implement neuron and network models with the power of Theano driving them underneath. That would allow for more intuitive experimentation with machine learning models and networks.
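Purely as a sketch of what I mean (the class and its names are my own invention, not an existing library):

    import numpy as np
    import theano
    import theano.tensor as T

    class Layer(object):
        # one layer of logistic neurons; parameters live in Theano shared variables
        def __init__(self, n_in, n_out):
            self.W = theano.shared(0.01 * np.random.randn(n_in, n_out))
            self.b = theano.shared(np.zeros(n_out))

        def output(self, X):
            # returns a symbolic expression; Theano compiles and places it
            return 1.0 / (1.0 + T.exp(-(T.dot(X, self.W) + self.b)))

Layers like this could then be composed into networks, and networks into networks of networks, with Theano handling compilation and GPU placement behind the scenes.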

In general, the language you are comfortable with and the tools you like to use are quite personal. Over the years I have tried many of the language choices, including Perl, Tcl, Java, bash, etc., and have settled on the ones above for the near term. Technology changes rapidly and these choices might change in the future. But this is my machine learning toolbox as it stands now.