Series on Deep Learning

Setting up GPU Learning on EC2

We will be putting together a series of practical tutorials on deep learning techniques and tools.  This first post will focus on getting the GPU environment set up on EC2. 

Step 1: Launching a GPU instance

Launch an Amazon EC2 g2.2xlarge GPU instance, which has an NVIDIA GRID K520 with 3,072 cores.  If you are just testing and playing around, you can start with an EC2 spot instance and save yourself some money.  

Recommended Settings

  • Amazon Linux 64bit (based on Red Hat)
  • 300GB+ of EBS backed storage

After a few minutes of waiting (assuming you bid high enough) you'll see your spot instance show up.  Connect to the instance via ssh.

Step 2: Configuring the Instance

Update System

sudo yum update -y
sudo reboot  # yes, it is necessary: the driver build needs to run against the updated kernel
sudo yum groupinstall -y "Development tools"

Load Nvidia Drivers


# download the NVIDIA driver and CUDA 7.0 installers from nvidia.com, then run:
sudo /bin/bash ./<nvidia-driver-installer>.run
sudo /bin/bash ./<cuda-toolkit-installer>.run

# ignore the message about an unsupported environment; it's caused by the Amazon Linux flavor.

Verify Nvidia Drivers/Hardware

# you should see GRID K520 in the output
nvidia-smi -q | head

Step 3: Loading Deep Learning Software

Install Python Libraries

sudo yum install python-devel python-nose python-setuptools gcc gcc-gfortran gcc-c++ blas-devel lapack-devel atlas-devel
sudo easy_install pip
sudo easy_install ipython
sudo pip install numpy==1.6.1
sudo pip install scipy==0.10.1
sudo pip install Theano

Configure Environment

# add to .bash_profile
export PATH=$PATH:$HOME/bin:/usr/local/cuda-7.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64
export THEANO_FLAGS='cuda.root=/usr/local/cuda-7.0,device=gpu,floatX=float32'

Select and Install Deep Learning Library (choose one)

# Keras
git clone https://github.com/fchollet/keras.git
cd keras
sudo python setup.py install

# Lasagne
git clone https://github.com/Lasagne/Lasagne.git
cd Lasagne
pip install -r requirements.txt
python setup.py install

# Pylearn2
git clone git://github.com/lisa-lab/pylearn2.git
cd pylearn2
python setup.py develop

Step 4: Verifying Training on GPU

cd keras/examples

# run any of the included example scripts, e.g.:
python <example_script>.py

# output should show: Using gpu device 0: GRID K520


State of Hyperparameter Selection

Historically, hyperparameter determination has been a woefully neglected aspect of machine learning. With the rise of neural nets - which require more hyperparameters, tuned more precisely, than many other models - there has been a recent surge of interest in intelligent methods for selection; however, the average practitioner still commonly uses default hyperparameters, grid search, random search, or (believe it or not) manual search.

For the readers who don’t know, hyperparameter selection boils down to a conceptually simple problem: you have a set of variables (your hyperparameters) and an objective function (a measure of how good your model is). As you add hyperparameters, the search space of this problem explodes.

Grid Search is (often) Stupid

One method for finding optimal hyperparameters is grid search (divide the space into even increments and test them exhaustively).

When presented with the above plot, a human would instinctively detect the pattern present and, if looking for the lowest point, would be able to make an intelligent guess about where to begin.  Most would not choose to evenly divide the space into a grid and test every point, yet this is precisely what grid search does.  Humans looking at that image have a fundamental intuition that there are areas where the minimum is more likely to be.  By exhaustively searching the space, you're wasting your time on the areas where the function is obviously (excluding an improbable fluke) not going to be at its minimum, and ignoring any information you have from the points you already know.

Random Search Isn’t Much Better

The next most common method is random search, which is exactly what it sounds like.  Given the same plot, I doubt anybody would decide to pick random points.  Random search is not quite stupid - anybody who has studied statistics knows the power of randomness from techniques like bootstrapping or Monte Carlo methods.  In fact, randomly picking parameters often outperforms grid search.  Random hyperparameters can find points that grid search would either skip over (if the granularity of the grid is too coarse) or cost a tremendous amount to find (as the grid becomes finer).  It can similarly outperform manual search in some situations: a human will generally focus on the domain in which they have seen the lowest points, whereas random search finds new domains neglected by intuition.
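To make the contrast concrete, here is a minimal sketch (the two-hyperparameter toy objective is my own invention, not from any cited source) of the classic situation where random search shines: one hyperparameter matters far more than the other, and a grid wastes most of its budget re-testing the same few values of the important dimension.

```python
import numpy as np

# Toy objective: depends strongly on x, barely on y - a common situation
# when hyperparameters have unequal importance.  Minimum at (0.5, 0.3).
def objective(x, y):
    return (x - 0.5) ** 2 + 0.01 * (y - 0.3) ** 2

rng = np.random.RandomState(0)

# Grid search: 16 evaluations, but only 4 distinct values of x get tested.
grid = [(x, y) for x in np.linspace(0, 1, 4) for y in np.linspace(0, 1, 4)]
best_grid = min(objective(x, y) for x, y in grid)

# Random search: the same 16 evaluations cover 16 distinct values of x.
points = rng.uniform(0, 1, size=(16, 2))
best_random = min(objective(x, y) for x, y in points)
```

With the same budget of 16 evaluations, the grid never gets closer to the important dimension's optimum than its fixed spacing allows, while the random draw almost always does.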

A Different Kind of Optimization

Outside of cases where finding the absolute global minimum is required and an exhaustive search is necessary, grid search is only really reasonable when the cost of evaluating the objective function is so low it can be considered non-existent; even then, the only excuse for implementing it is laziness (i.e. it costs more to implement and run a sophisticated method than to perform grid search). In those cases grid search is preferred to random search because with a fine enough grid you are guaranteed to find a near-optimal point, whereas random search offers no such guarantee.  

But what if you are in a situation where evaluating every point is too costly? Training models is often expensive, both in time and computational power, and this expense skyrockets with increasing data and model complexity.  What we need is an intelligent way to traverse our parameter space while searching for a minimum.

Upon first impression, this might seem like an easy problem.  Finding the minimum of an objective function is pretty much all we ever do in machine learning, right?  Well, here's the rub: you don't know this function.  Fitting a regression, you choose your cost function.  Training a neural net, you choose your activation function.  If you know what these functions are, then you know their derivatives (assuming you picked differentiable functions); this means you know which direction points "down". This knowledge is fundamental to most optimization techniques, like stochastic gradient descent or momentum methods.

There are other techniques, like binary search or golden-ratio search, which don't require the derivative directly but do require the knowledge that your objective is unimodal - any local, non-global minima can make this search entirely ineffective. Yet other optimization methods depend on no knowledge of the underlying function at all (simulated annealing, coordinate descent) but require a large number of samples from the objective function to find an approximate minimum.
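As a concrete illustration of the unimodality requirement, here is a sketch of golden-ratio (golden-section) search; the objective is a made-up stand-in for a validation-error curve. If the curve had a second, local dip, the bracketing logic below could discard the interval containing the true minimum.

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] without derivatives.

    Each iteration shrinks the bracket by the golden ratio, so only one
    new function evaluation is needed per step.
    """
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:          # minimum must lie in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                # minimum must lie in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2

# A unimodal stand-in for a validation-error curve, minimum at x = 2.0.
x_min = golden_section_search(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0)
```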

So the question is: what do you do when the cost of evaluation is high? How do we intelligently guess where the minima are likely to be in a high-dimensional space?  We need a method which doesn't waste time on points with no expected return, but also won't get caught in local minima.

Being Smart with Hyperparameters

So now that we know how bad grid search and random search are, the question is: how can we do better?  Here we discuss one technique for intelligent hyperparameter search, known as Bayesian optimization.  We will cover the concept of how this technique can be used to traverse hyperparameter space; there is an associated IPython Notebook which further illustrates the practice.

Bayesian Optimization on a Conceptual Level

The basic idea is this: you assume there is a smooth response surface (i.e. the curve of your objective function) connecting all your known hyperparameter-objective function points, and then you model all of these possible surfaces. Averaging over these surfaces, we acquire an expected objective function value for any given point and an associated variance (in more mathematical terms, we are modeling our response as a Gaussian Process). The point where the variance on this surface dips lowest is the point of highest 'expected improvement'. This is a black-box model; we need no actual knowledge of the underlying process producing the response curve.

This concept is illustrated clearly in Yelp’s MOE documentation for the case of a single hyperparameter.  On top, you see our objective function response surface when we have no previous data points.  It is flat, as we don’t yet know anything. We can see that the variance (the shaded region) is also flat.  In the bottom plot we see the maximum Expected Improvement (the lowest the variance dips).  This plot is also flat, so our point of highest Expected Improvement is random.


Next we acquire a single data value, effectively pinning our expectation and collapsing its variance around a given point.  Everywhere else the objective function remains unknown, but we are modeling the response surface as smooth.  Additionally, you can see how the variance of our sample point can easily be incorporated into this model - the variance simply isn’t pinched quite as tightly.


We can see that the objective function value for our acquired point is high (which for this example we will say is undesirable).  We pick our next point to sample as far from here as possible.


We’ve now ‘pinched’ our response curve on both sides, and begin to explore the middle. However, since the lower hyperparameter value had a lower objective value, we will favor lower hyperparameter values.  The red line above shows the point of maximum expected improvement, i.e. our next point to sample.


Now that we’ve pinched the middle of the curve, we have a choice to make - exploration or exploitation.  You can see this trade-off is automatically made in our model - where the modeled variance dips the lowest is where our highest expected improvement point lies (the one dimensional example isn’t ideal for illustrating this, but you can imagine in more dimensions having large unexplored domains and the need to balance between exploiting better points near the low points you have and exploring these unknown domains).

If you have the capability to carry out multiple evaluations of the response curve in parallel (i.e. can train multiple models at once), a simple approach for sampling multiple points would be to assume the expected response curve value for your current point and sample a new point based upon that.  When you get the actual values back, you update your model and keep sampling.

Our hyper-hyperparameter, the variance of the Gaussian Process, is actually very important in determining exploration vs. exploitation. Below you can see two examples of identical expected response surfaces where only the variance magnitude differs (1 on the left, 5 on the right), giving different next points to sample (note that the scale of the y-axis has changed). The greater the variance is set, the more the model favors exploration; the lower it is set, the more the model favors exploitation. This can be tuned automatically, however: we can test the certainty of our Gaussian Process model based on either a Leave-One-Out cross-validation score or the marginal likelihood.  Our response curve is fit so as to maximize one of these measures, and from the value we achieve we can estimate how certain we are of the resultant model and set the variance accordingly.


With Bayesian optimization, in the worst case (when you have no history) you get random search.  As you gain information, it becomes less random and more intelligent, picking points where the maximum improvement is expected, trading off between refining minima around previously sampled points and exploring new domains.

There are two prominent open source packages which implement Bayesian optimization: the above-mentioned MOE (Metric Optimization Engine, produced by Yelp and the source of all the pretty pictures featured above) and Spearmint (from the Harvard research group HIPS).  These packages are so easy to use (see the attached IPython Notebook) that there's practically no reason not to use them on every hyperparameter search you perform (the argument that they take computing power to run is valid; however, their computing cost is often negligible compared to that of training almost any non-toy model).

So don’t waste your time looking places which won’t yield results.  And don’t search randomly when you can search intelligently.

A Note on Overfitting

As always, by tweaking based on a function of your data, there is a danger of overfitting. The two easiest ways to avoid overfitting your hyperparameters are to either tune your hyperparameters to an in-sample metric or to keep a third data split for final validation. Additionally, regularization can always be used to bound the potential complexity of your model.
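For instance, a third split can be carved out up front so the final test set never influences hyperparameter tuning. A sketch using scikit-learn's train_test_split; the 60/20/20 proportions and the toy data are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 examples with 2 features each.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First split off a final test set that hyperparameter tuning never sees.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into train and validation sets for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

# 60/20/20: tune on (train, val), report final numbers on test only once.
```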

Footnote:  Animated plots of MOE exploring various objective functions to find the minimum


Acknowledgement: Scott Clark, SigOpt


Notes on eXtreme Gradient Boosting

Trevor Hastie hypothesizes (slide 17) that in terms of model accuracy: 

Boosting > Random Forest > Bagging > Single Tree

Thanks to XGBoost, it's become computationally feasible to test Hastie's claim about boosting's superiority on pretty much any size or shape of dataset.  The darling of many recent Kaggle competitions, XGBoost is a library for fast, general-purpose gradient boosting. It is parallelized using OpenMP and also provides an AllReduce-based distributed version. It implements generalized linear models and gradient boosted regression trees. 

In a quick laptop-scale experiment on covertype, we see that boosting can indeed achieve slightly better prediction accuracy (2.5% error vs. the Random Forest benchmark of 2.6%). Covertype is a classic dataset used for multi-class, non-linear algorithm benchmarking. The data consists of 54 variables and 581,012 observations. There are 7 classes (some of which are minority classes). 

Some crib notes from building this model:

  • Setting the shrinkage parameter low (eta < 0.1) will yield a significant improvement in model generalization; however, it must be offset computationally (by doing more rounds of boosting to capture the residuals).
  • XGBoost expects multi-class problems to be 0 (zero) indexed.  The dataset comes with class labels of 1-7, so we need to shift the class labels by -1.
  • Watchlist does not affect model training.  It is simply a way to assess prediction error on an independent sample during the training process (that isn't used for training).
  • The categorical variables in this particular dataset have already been converted to boolean indicators, but if that weren't the case, one-hot encoding would need to be applied to all categoricals (see Orchestra.jl).
  • XGBoost will grow 1 tree per class per round of boosting (in this case 7 x 500, or 3,500).  As model complexity grows, prediction time increases: this model scored ~110k observations in 18 seconds (which may be fine for some use cases but is perhaps too slow for adtech).
  • The shrinkage parameter (eta) only affects the prediction scores of the leaf nodes and not the shape of the tree, whereas the other parameters (gamma, max_depth, min_child_weight, max_delta_step, subsample, colsample_bytree) will influence the tree shape.  

Figure: left, trees built with eta=0.1 vs. eta=0.5; right, colsample_bytree=1 vs. colsample_bytree=0.5.



Finally here are some great resources if you want to continue learning about boosting.

Tianqi Chen - Introduction to Boosted Trees

Trevor Hastie - Gradient Boosting Machine Learning

Mikhail Golovnya - Advances In Gradient Boosting: The Power of Post-Processing

Hastie, Tibshirani & Friedman - Elements of Statistical Learning (Chapter 10)

Hastie & Tibshirani - Stanford Statistical Learning Course

Reflections on Julia


Julia is a new language that could become the go-to choice for scientific computing, machine learning, data mining, large-scale linear algebra, and distributed and parallel computing.  It uses LLVM-based just-in-time (JIT) compilation, and combines the speed of C with the dynamism of Ruby.  

The contributors of Julia wrote a manifesto to explain their motivation for creating yet another programming language.  Jeff Bezanson, Stefan Karpinski, Viral Shah and Alan Edelman highlight Python's annoying dependencies, the JVM's unnecessary overhead, and the debugging pain of distributed systems like Hadoop as just a few of the reasons why Julia exists.

Julia holds a lot of promise because of a few fundamental design choices:

  • Almost everything in Julia is written in Julia.  This will get us out of the C/C++ and Fortran dependency-hell of scikit-learn.
  • The type system makes it possible to rapidly experiment and iterate on data science problems.  The documentation claims that "Julia's type system is designed to be powerful and expressive, yet clear, intuitive and unobtrusive."  This is in fact the case.  For example, if we build a Hidden Markov Model and our initial attempt was to treat all hidden states as Gaussian distributions, and now we want to try out Exponential, we won't need to refactor the HMM code.  If the HMM was designed correctly and references the Distributions type, either Normal or Exponential can be used. 
  • Our limited testing suggests that identically constructed code often runs at 2-3 times the speed of Python.
  • Github is used for tracking all the Julia source code and for installing packages. Goodbye PyPi and Maven repos!
  • Julia supports metaprogramming.  This makes it possible for a program to transform and generate its own code, resulting in a new level of flexibility and powerful reflection capabilities.  

What's Missing?

  • Pandas is significantly more mature than Julia DataFrames.
  • For NLP problems, Python is still a better choice. TextAnalysis.jl is very basic.
  • John Myles White points out some challenges with the current Julia stats functionality that will be improved in v0.4.
  • Julia community is still small (but hopefully growing).

Getting Started on OS X

Download and install Anaconda (only if you want to run Julia in IPython Notebook)

Download and install Julia

The Mac OS X package (.dmg) contains the Julia app.  Drag the Julia icon to Applications.

sudo ln -s /Applications/Julia.app/Contents/Resources/julia/bin/julia /usr/bin/julia  # adjust the app name to match your installed version

Run julia in a terminal (you should see the beautiful ASCII version of the logo).



Start IPython Notebook with a Julia profile (in terminal)

ipython notebook --profile julia

Useful Packages

Gadfly.jl - plotting and data visualization package that conveniently installs most of the frequently used packages like DataFrames, Iterators, Distributions,  etc.

Cairo.jl - Cairo graphics library used among other things to render PDFs from Gadfly charts

DecisionTree.jl, Clustering.jl, MultivariateStats.jl - stats / machine learning tools

DSP.jl - provides a number of common Digital Signal Processing (DSP) routines

Graph.jl - provides graph types and algorithms like centrality, connected components, cycle detection, etc.

Mocha.jl - deep learning framework inspired by the C++ framework Caffe

Optim.jl - basic optimization algorithms in pure Julia

Morsel.jl - a Sinatra-like micro framework for declaring routes and handling requests. It is built on top of HttpServer.jl and Meddle.jl.

PyCall.jl - if all else fails, call some Python library

JavaCall.jl - reuse the millions of lines of Java code that are out there

~500 more packages

If you find a package that isn't registered, you can install it by:

Pkg.clone("<git URL of the package>")
To update packages:

Pkg.update()  # for all packages

Examples and Tutorials

Introduction to Julia tutorial at SciPy 2014

YouTube videos

Implementing Digital Filters in Julia

Videos from the Julia tutorial at MIT

Learn Bayes Theorem with Julia

Data Analysis in Julia with Data Frames

Is it Ready for Production?

Yes!  We run Julia against massive volumes of data and process tens of thousands of transactions per second.  We have successfully deployed Julia for graph analytics, non-parametric probability density functions, graphical models, DSP problems, etc.

We also use Julia in our fellowship.  While we encourage fellows to check out Julia, we certainly do not insist on using it for every problem.    

Getting Answers to Questions

The julia-users mailing list is for discussion around the usage of Julia.

JuliaCon 2015 will be held at the MIT Stata Center June 24 - June 28.  



Startup.ML Fellowship

You've made up your mind to become a data scientist. You've taken every data science MOOC, you've eaten a lifetime of pizza at machine learning meetups, you've even attended a data science "academy." Why hasn't it worked?

Data Science is not knowledge to be acquired but rather a skill that can be learned and improved through practice. The number one qualification employers look for when hiring a data science candidate is previous experience.

Startup.ML is launching a fellowship to give aspiring data scientists the chance to hone their skills by building real machine learning applications for startups and established data science teams.  Upon completing the program, fellows will have direct access to a network of hiring partners, a letter of recommendation, and a portfolio of real world projects.

A Fellowship Designed to Maximize Practical Experience

  • Full time for 4 months at our San Francisco location

  • Fellows build scalable machine learning models that integrate into real products

  • We follow an agile development process in groups of 3 (pair programming plus agile team lead).  This means weekly iteration planning, daily scrum and every Friday is demo day!

We strive to have each fellow work on two separate projects to give them exposure to a wider range of problems.

Work on a Project for an Established Data Science Team

We ensure that our fellows work on real world problems sourced through partnerships with established data science teams. This aspect of the work is well defined, datasets have already been gathered, and there is an opportunity to learn from leading practitioners.

The challenge fellows will typically confront is how to get results when working in a large team with lots of differing opinions, approaches, skills and priorities.  

Another key skill that fellows learn by working with an established data science team is the ability to communicate ideas and build consensus.  We ask fellows to present their work to outside team members on a weekly basis, which hones their ability to deliver crisp results.

Work on a Nebulous Startup Problem

Startup.ML works with early stage companies to help them incorporate machine learning into their products.  We afford fellows the opportunity to work on these problems along with us.  The learning opportunity for this part of the fellowship is different from the previous experience of working with an established data science team.  

There may be a lack of clarity about what exactly needs to be done, the data may not exist (or, if it does, it often doesn't have the necessary signal), and the founders may be uncertain about the direction they want to take their product.  

Fellows need to learn how to iterate quickly and have a laser-focus on producing a solid minimum viable product.

Pairing Software Engineers with Quants

Data Science is an interdisciplinary field which combines aspects of computer science, mathematics and statistics. Rarely do we see someone that has a deep background in all of these areas.  We encounter software engineers that are not familiar with probabilistic approaches to problem solving and quants (mathematicians, statisticians, physicists, computational biologists, etc.) that can't write a recursive function.

Instead of holding out for the unicorn, we pair fellows from each of these backgrounds to work on a problem together.  The pair-programming methodology has already proven to be extremely effective when building software and we believe it’s also the right approach for data science.

Software Tools

While it is true that the problem should dictate the choice of the tool, we have developed some preferences.

Julia is a new but very promising language for scientific computing.  We have successfully used Julia for digital signal processing, state-space tracking and anomaly detection.

Fellows are free to pick their choice of tools but using a tool that mentors are familiar with, makes it easier to get help.

After the Program

At the completion of the program, we help fellows find the right career opportunity. Fellows will leave the program with intimate knowledge of the startups they are working with, while also having numerous opportunities to engage with guest speakers from established data science teams.

How to Apply

Our first cohort starts on March 9, 2015.  We are accepting applications for future cohorts.

Why We Created Conf.Startup.ML


An estimated 11 million people are involved in the software industry today (according to IDC). They write complicated, brittle software using a declarative approach that often fails to recognize the complexity of the real world and the intricacy of user behavior.

The future of software engineering is not declarative, it's probabilistic!  Conf.Startup.ML is a one-of-a-kind event that brings together leading machine learning engineers from industry and academia to kick off the transition away from the declarative era of software development.  Organized by Startup.ML, the event features leading machine learning researchers from Facebook, the Université de Montréal LISA Lab, and Stanford, and industry innovators including indico, Domino Data Lab, MapR, and GraphLab, among others.

Facebook’s Soumith Chintala will be leading a workshop on video classification using convolutional neural networks, the same algorithm that powered this recent Stanford research project on image annotation.  Soumith will take attendees through the process of training deep neural nets using GPUs on AWS. Other workshops will cover image analysis, sentiment analysis, probabilistic programming, and the new scientific computing language, Julia.

Conf is not a series of research papers turned into PowerPoint presentations; rather, it is a chance to equip yourself with the intuition and skills necessary to solve problems through machine learning.  

What every machine learning package can learn from Vowpal Wabbit

Vowpal Wabbit (VW) is one of the overlooked gems of machine learning. The open source brainchild of John Langford and his collaborators at Yahoo and Microsoft Research, VW can teach us a lot about modern, scalable learning.

Certain conscious design choices set VW apart from many of the other popular packages:

  • Online Learner: learning is done by streaming the data through a fixed-size memory window, which means VW can learn from datasets much larger than system memory. Progressive validation is at the heart of this approach: it forces the model to make a prediction before seeing the true label of each example, yielding a surprisingly reliable estimate of generalization error.
  • Feature Hashing: the hashing trick is implemented in VW as the core representation which results in significant storage compression for parameter vectors. In practical terms this allows VW to handle sparse data sets with millions of dimensions.
  • Learning Reductions: VW basically knows how to do one thing (weighted binary classification) and it does it extremely well. Everything else, including regression, multi-class, multi-label, and structured prediction, is handled through a stack of reductions down to a weighted binary classification problem. This manifests as consistency in how features work and what gets stored in a model. Perhaps this is also the reason why, for all its power, VW has less than 20k lines of code.
  • Distributed Learning: to go beyond the amount of data that fits on a single machine, the authors have implemented a Hadoop-compatible computational model called AllReduce. Their goal was to eliminate the drawbacks of MPI and MapReduce as they relate to ML. In the case of MPI that often means too much data movement and no fault tolerance; in the case of MapReduce it usually means completely refactoring the algorithm code to fit the MapReduce paradigm. The initial results of AllReduce look rather promising: a 1,000-node cluster can learn billions of examples over millions of features in a matter of hours.
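The feature-hashing idea is simple enough to sketch in a few lines. This toy version uses md5 rather than VW's murmurhash (just to stay dependency-free), and the feature names are invented for illustration:

```python
import hashlib

def hash_features(pairs, num_bits=18):
    """Map (feature_name, value) pairs into a fixed-size sparse vector.

    As in VW, the parameter vector size is 2**num_bits regardless of how
    many distinct feature names the data contains, so no dictionary of
    names ever needs to be stored.
    """
    size = 1 << num_bits
    vec = {}
    for name, value in pairs:
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        idx = h % size
        vec[idx] = vec.get(idx, 0.0) + value  # collisions simply add up
    return vec

# Hypothetical sparse features from an ad-click example.
sparse = hash_features([("user=alice", 1.0), ("ad=42", 1.0), ("hour=13", 1.0)])
```

Collisions are the price of the fixed memory footprint; with enough hash bits they are rare, and in practice they act as a mild form of regularization.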

So while other packages are chasing the greatest number of algorithms (often poorly implemented), clever hooks into R or the best Lambda architecture story, VW is quietly moving the ball on scalable machine learning forward.  Just don't ask about the name!