What every machine learning package can learn from Vowpal Wabbit

Vowpal Wabbit (VW) is one of the overlooked gems of machine learning. The open source brainchild of John Langford and his collaborators at Yahoo and Microsoft Research, VW can teach us a lot about modern, scalable learning.

A few deliberate design choices set VW apart from many other popular packages:

  • Online Learner: data is streamed through a fixed-size memory footprint, so VW can learn from datasets far larger than system memory. Progressive validation is at the heart of this approach: the model must predict each example before seeing its true label, which yields a surprisingly reliable estimate of generalization error.
  • Feature Hashing: VW uses the hashing trick as its core feature representation, which significantly compresses the parameter vector. In practical terms, this lets VW handle sparse datasets with millions of dimensions.
  • Learning Reductions: VW natively knows how to do one thing, weighted binary classification, and it does it extremely well. Everything else, including regression, multi-class, multi-label, and structured prediction, is handled through a stack of reductions down to that weighted binary problem. The payoff is consistency in how features work and what gets stored in a model; it may also be why, for all its power, VW has fewer than 20k lines of code.
  • Distributed Learning: to go beyond the amount of data that fits on a single machine, the authors implemented a Hadoop-compatible communication primitive called AllReduce. Their goal was to eliminate the drawbacks of MPI and MapReduce as they relate to ML: with MPI, that often means too much data movement and no fault tolerance; with MapReduce, it usually means completely refactoring the algorithm to fit the paradigm. The initial results of AllReduce look rather promising: a 1,000-node cluster can learn from billions of examples over millions of features in a matter of hours.
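The predict-before-update loop behind progressive validation is easy to sketch. The following is a toy online logistic learner, not VW's actual update rule: each example is scored first, that loss is recorded, and only then does the label drive a weight update, so the running average behaves like held-out error without a separate validation set.

```python
import math

def online_logistic(stream, dim, lr=0.5):
    """Online SGD on logistic loss with progressive validation.

    stream yields (x, y) pairs where x is a list of active feature
    indices and y is 0 or 1. Returns the weights and the average
    progressive loss, i.e. loss measured BEFORE each update.
    """
    w = [0.0] * dim
    total_loss, n = 0.0, 0
    for x, y in stream:
        # 1) Predict before seeing the label counts toward validation.
        p = 1.0 / (1.0 + math.exp(-sum(w[i] for i in x)))
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n += 1
        # 2) Only now use the label to update the weights.
        g = p - y                      # gradient of log loss w.r.t. the margin
        for i in x:
            w[i] -= lr * g
    return w, total_loss / n

# Feature 0 predicts label 1, feature 1 predicts label 0.
stream = [([0], 1), ([1], 0)] * 100
w, avg_loss = online_logistic(stream, dim=2)
```

On this separable toy stream the average progressive loss falls below the ln 2 that a constant 0.5 prediction would score, which is exactly the signal one watches in VW's default console output.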
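The hashing trick itself is a few lines. This is an illustrative sketch, not VW's implementation (VW uses murmurhash internally): feature names are hashed into a fixed-size slot space, so the weight vector never grows with the vocabulary. The 18-bit default matches VW's `-b 18`.

```python
import hashlib

NUM_BITS = 18                 # VW's default precision: 2**18 = 262,144 slots
NUM_SLOTS = 1 << NUM_BITS

def hash_feature(name: str) -> int:
    """Map an arbitrary feature name to a fixed slot index."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLOTS

def featurize(tokens):
    """Turn raw string features into (index, value) pairs
    over a fixed-size space; collisions are simply tolerated."""
    return [(hash_feature(tok), 1.0) for tok in tokens]

# Arbitrary, never-before-seen feature names still land in bounded storage.
indices = featurize(["user_id=12345", "country=DE", "hour=23"])
```

The storage win is that the model is a flat array of `NUM_SLOTS` floats regardless of how many distinct feature strings the data contains; occasional hash collisions act as a mild, usually harmless regularizer.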
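The reductions idea can be made concrete with the simplest example, multi-class via one-against-all. The `BinaryLearner` below is a hypothetical stand-in for the base learner, not VW's internal API; the point is that the reduction only touches the binary interface, which is why features and model storage behave identically at every level of the stack.

```python
import math

class BinaryLearner:
    """Tiny online logistic learner: the one thing the core knows how to do."""
    def __init__(self, dim):
        self.w = [0.0] * dim
    def predict(self, x):                    # x: list of active feature indices
        return 1.0 / (1.0 + math.exp(-sum(self.w[i] for i in x)))
    def learn(self, x, y, lr=0.5):           # y in {0, 1}
        g = self.predict(x) - y
        for i in x:
            self.w[i] -= lr * g

class OneAgainstAll:
    """Reduce k-class classification to k binary problems."""
    def __init__(self, k, dim):
        self.learners = [BinaryLearner(dim) for _ in range(k)]
    def learn(self, x, label):
        # Class c sees a positive example iff label == c.
        for c, learner in enumerate(self.learners):
            learner.learn(x, 1 if c == label else 0)
    def predict(self, x):
        scores = [learner.predict(x) for learner in self.learners]
        return scores.index(max(scores))

oaa = OneAgainstAll(k=3, dim=3)
for _ in range(50):
    for c in range(3):
        oaa.learn([c], c)        # feature c perfectly indicates class c
```

VW's actual reduction stack is richer (cost-sensitive layers, structured prediction, etc.), but each layer follows this pattern: translate its problem into calls against the learner one level down.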
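The AllReduce contract is simple to state in code. This is a flat single-process simulation, not VW's tree-structured Hadoop implementation: every node contributes a partial vector (say, a local gradient over its data shard) and every node receives the identical elementwise sum, so the learning code remains an ordinary loop rather than being refactored into map and reduce phases.

```python
def allreduce_sum(partials):
    """Elementwise-sum each node's vector, then broadcast the total
    back so every node holds the same reduced result."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

# Four simulated nodes, each holding a local gradient over three weights.
local_gradients = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.5, 1.0],
    [1.5, 0.5, 1.0],
]
reduced = allreduce_sum(local_gradients)
```

After the call, each node can apply the same global update locally, which is how a cluster of online learners stays synchronized without shipping raw examples around.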

So while other packages chase the largest number of algorithms (often poorly implemented), clever hooks into R, or the best Lambda-architecture story, VW is quietly moving scalable machine learning forward. Just don't ask about the name!