How to Pick Magic Numbers

aka a (hopefully) gentle introduction to Bayesian optimization

What are the magic numbers in machine learning?

The machine learning “pipeline,” if there is such a thing, is deceptively elegant. In my view, and at a high level: you first find a dataset that you’re interested in. Next, you specify what questions you’d like to answer about that dataset. Then, you pick an appropriate model to answer those questions (or develop your own!). Next, you set your computer to learn the parameters of your model with respect to the dataset. Finally, you evaluate your results and draw conclusions.

Of course, every piece of this pipeline is fraught with complexities. One such complexity has to do with picking particular “magic” numbers that underlie machine learning algorithms. While there are lots of model parameters that you can learn directly from data, models will often require you to specify a special set of numbers, often called “hyperparameters,” before you can learn your normal, not-so-hyper parameters.

So, the question du jour: if you need to specify these hyperparameters before you can learn the rest of your model from data, how do you pick them?

TL;DR

There have been a lot of great papers on Bayesian Optimization in the last several years, including (but not limited to) this, this, this, and this overview. I really like Bayesian Optimization – it helps me squeeze a bit more accuracy out of my models when I am able to do a proper parameter search. Furthermore, I am of the opinion that many folks (including myself) often do not sweep hyperparameters adequately. The purpose of this blog post is to provide an introduction to Bayesian Optimization and related methods for aspiring data scientists.

There’s a great library called Spearmint that performs this parameter search. However, the implementation is geared towards large-scale parameter searches, and provides a lot of (not optional) functionality that I don’t really need or want for small-scale experiments (e.g. the MongoDB backend). Previous releases included “spearmint lite,” a smaller, simpler version of the program, but this seems to have been dropped. As such, I found myself writing a few Python scripts to do local, small-scale Bayesian Optimization with Spearmint, and I decided to throw my scripts on GitHub and write a blog post about it.

The optimists and pessimists: A running example

Image courtesy Randall Munroe, https://what-if.xkcd.com/6/

I don't know which I am!

To demonstrate smart ways of choosing hyperparameters, I am going to use a popular machine learning algorithm, the kernel support vector machine, as a running example. Sounds fancy, right? In general, however, hyperparameters are ubiquitous in machine learning. One could use the methods I describe here to figure out how many layers to have in a neural network, what weights to use as priors in Latent Dirichlet Allocation, etc. We’ll stick with the SVM, though.

I’m someone who likes to start with concrete problems, so I’ll give one here. The problem setting: imagine you are working at Twitter, and you have a dataset consisting of many, many tweets (say 50,000). Being the empathetic software engineer you are, you might be interested in which tweets have a positive sentiment (such a beautiful day today! :) #sunny) and which tweets have a negative sentiment (rain again? this is stupid #norainnoshine). This setting is referred to as sentiment analysis, a very popular task in natural language processing.

The good news: you know the sentiment of 20% of these tweets because of the emoticons included in the text body – you’ll assume that any tweet with a :) is a positive one, and any tweet with a :( is a negative one. Your task: group the tweets without emoticons (the unlabeled tweets) into positive or negative classes. At a high level, you’ll be learning linguistic patterns from the tweets you know the sentiment of a priori, and then applying what you learn to the tweets that you’d like to categorize.

Most machine learning algorithms operate on numbers rather than tweets, so we’ll need to convert our tweets to something that a computer can more easily understand. More specifically, a majority of machine learning algorithms assume that, for each tweet, we will be able to come up with a list of numbers (a vector) that represents the language it contains. The task of mapping from complex, structured inputs (cant wait to go play in the #sun today, I hope it doesn’t rain!) to vectors ([2.23,-1.63,1.11,...]) is an area of active research in machine learning, and I don’t intend to get into the details here. Indeed, this process is an art form, and there’s no one right way to do it. If you’re curious, you can read about a simple method called “bag of words” here.
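For concreteness, here’s a minimal bag-of-words sketch using scikit-learn’s CountVectorizer; the two tweets are made up, and this is just one of many ways to do the mapping:

```python
# A minimal bag-of-words sketch (hypothetical tweets; assumes scikit-learn is installed).
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "such a beautiful day today! #sunny",
    "rain again? this is stupid #norainnoshine",
]

vectorizer = CountVectorizer()        # one dimension per word in the vocabulary
X = vectorizer.fit_transform(tweets)  # sparse matrix: one row (vector) per tweet

print(X.toarray())                    # each tweet is now a list of word counts
```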

For now, though, we’ll assume that we have a vector of some fixed length for each tweet in our dataset. Imagine plotting these tweets as follows…

[Figure: the labeled tweet vectors plotted, before training]

One thing we could learn from this data is a line that separates the positive tweets from the negative ones. Then, for our unlabeled examples, we could see which side of that line they fall on to determine whether they are positive or negative. A kernel support vector machine (SVM) is one way of defining such a decision boundary, though one could, in principle, use any number of other supervised methods. I choose to use a kernel SVM for this blog post because it has hyperparameters.

Say, for a particular hyperparameter setting, we learn the rest of our parameters from data, and end up drawing the following line:

[Figure: a trained decision boundary separating positive from negative tweets]

Given a new testing example, represented by the green question mark, we can make a classification based on what side of the line it falls on (in this case, we would say that the new example is a negative tweet). I could talk in more detail about normal parameter learning for SVMs, but the purpose of this blog post is to introduce smart ways of sweeping hyperparameters, so on we go.

There are two main hyperparameters that govern how this line is drawn in the case of kernel SVMs. First, it turns out that we need to define a similarity function (also called a kernel function, in this setting) between tweet vectors. A common choice is the radial basis function (RBF) kernel, which takes in two vectors x_1 and x_2 and assigns them the similarity K(x_1, x_2) = exp(-gamma * ||x_1 - x_2||^2). Note that vectors that are closer together get a similarity nearer to 1 under the RBF kernel – that’s a property we would like to have with any sort of similarity function, no? :)

Here, we find our first hyperparameter: gamma! In later sections, we will discuss exactly how we assign a value for gamma. The short of it: we need to let our data determine our hyperparameter setting, but we need to be careful about how we do it.
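To make gamma’s role concrete, here’s a tiny sketch of the RBF similarity on two made-up tweet vectors (the numbers are arbitrary):

```python
# A small sketch of the RBF similarity; the vectors and gamma values are made up.
import numpy as np

def rbf_similarity(x1, x2, gamma):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2); closer vectors => similarity nearer 1."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([2.23, -1.63, 1.11])
x2 = np.array([2.10, -1.70, 0.95])

print(rbf_similarity(x1, x2, gamma=0.1))   # mild decay: similarity close to 1
print(rbf_similarity(x1, x2, gamma=10.0))  # larger gamma: similarity falls off much faster
```

Larger gamma means that only very close vectors count as meaningfully similar, which (together with C, below) influences how wiggly the learned boundary ends up.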

The other hyperparameter we need to address comes from the SVM optimization objective. C, as it’s usually denoted, describes how bad it is for a training example to fall on the wrong side of the line. For large C, the SVM optimizer will try very hard to have every training example fall on the correct side of the line. This sounds like a good plan, but if we penalize points being on the wrong side too heavily, we will end up with a very jagged line that is very sensitive to our particular training data. As such, it will not generalize well. On the other hand, we would still like to assign some penalty for being on the wrong side of the line. This hyperparameter C addresses what is known as the bias vs. variance trade-off.

Picking numbers

So, we have some numbers to pick (C and gamma). Our strategy is the following: we will divide our 10K labeled tweets into 3 sets: a training set (of 5K tweets or so), a validation set (of 2K tweets or so), and a testing set (of 3K tweets or so). We will place our testing set aside, and we won’t touch it until we have picked our hyperparameters (really, don’t touch it!).

We will train an SVM on our training data, given one setting of hyperparameters, and measure its performance on our validation set by counting how many tweets the resulting decision boundary classifies correctly. The idea is: settings of hyperparameters that do well on our validation set are likely to do well on our unseen testing set. So, we will try a bunch of hyperparameter settings, training a new SVM for each setting, and then pick a set of hyperparameters that performs well on the validation set.

Why do we have a validation set, and not just a test set? Couldn’t we directly pick parameters that do well for the test set? Well, we would like to have a sense of how well our models generalize to never-before-seen data. If we choose hyperparameters based on the testing set, we won’t even have a guess as to how we will generalize beyond our training examples. At the very, very end, to see how well we did, we evaluate our model on the test set. There are a lot of subtleties underlying this process, but I hope the mechanical process is relatively straightforward.
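Mechanically, evaluating a single hyperparameter setting might look like the sketch below. The data is a synthetic stand-in for the 10K tweet vectors and labels, the 5K/2K/3K split matches the scheme above, and the helper name is my own:

```python
# A sketch of scoring one (C, gamma) setting on a validation set.
# Synthetic stand-in data; in practice X would hold the tweet vectors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(10000, 20)                      # pretend tweet vectors
y = (X[:, 0] + 0.5 * rng.randn(10000)) > 0    # pretend sentiment labels

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=5000, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=2000, random_state=0)

def validation_accuracy(C, gamma):
    """Train an RBF-kernel SVM with this setting and score it on the validation set."""
    model = SVC(C=C, gamma=gamma, kernel="rbf")
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)          # fraction of validation tweets classified correctly

print(validation_accuracy(C=1.0, gamma=0.01))
# X_test / y_test stay untouched until the very end.
```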

So, which hyperparameter settings should we evaluate? Unfortunately, we can’t do them all – in the case of C and gamma, the hyperparameters live in a continuous space, and we can’t train uncountably many SVMs! Further complicating things: our parameters depend upon each other in unknown ways. This means we cannot optimize one parameter at a time and then combine the results.

In practice, there are three main strategies folks use. Here’s an overview of the two basic strategies, which will lead us into the actual topic of this post: a third strategy called Bayesian Optimization.

Grid search

Idea: make a uniformly spaced grid in your search space, and evaluate hyperparameter settings at the points you’ve introduced.

Pros:

  1. Easy to implement
  2. The space is searched uniformly, so you’ll likely not miss anything too obvious if your underlying function is smooth.

Cons:

  1. Sensitive to parameter scaling – log scaling your parameters, for instance, will cause your search to change drastically if you don’t account for it.
  2. Often doesn’t perform that well compared to other methods.
  3. Scales poorly with number of parameters.
  4. If your underlying function is periodic with constant frequency, you might sample in a very unlucky fashion and only sample the mins/maxes.
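Here’s what grid search might look like for our C/gamma problem, reusing the hypothetical validation_accuracy helper from the earlier sketch (the log-spaced ranges are just an example):

```python
# A minimal grid-search sketch over C and gamma.
# Assumes the validation_accuracy(C, gamma) helper defined in the earlier sketch.
import itertools
import numpy as np

C_grid = np.logspace(-2, 5, 8)       # 0.01 ... 100000
gamma_grid = np.logspace(-4, 3, 8)   # 0.0001 ... 1000

best = max(
    (validation_accuracy(C, g), C, g)
    for C, g in itertools.product(C_grid, gamma_grid)
)
print("best validation accuracy %.4f at C=%.4f, gamma=%.4f" % best)
```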

Random search

Idea: pick some number of random points in your parameter space, and evaluate.

Pros

  1. Easy to implement.
  2. Performs better than grid search in a lot of cases.
  3. Performs just about as well as anything else (including Bayesian Optimization) if you’re doing a relatively low number of parameter evaluations.

Cons

  1. No guarantee you’ll explore the space adequately.
  2. Might waste computation evaluating points that are very close together.
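And the random-search analogue, again reusing the hypothetical validation_accuracy helper; the 64-evaluation budget and the log-uniform sampling ranges are assumptions:

```python
# A minimal random-search sketch: sample (C, gamma) uniformly in log space.
# Assumes the validation_accuracy(C, gamma) helper defined in the earlier sketch.
import numpy as np

rng = np.random.RandomState(0)
budget = 64                              # number of settings we can afford to try

log_C = rng.uniform(-2, 5, budget)       # log10(C)
log_gamma = rng.uniform(-4, 3, budget)   # log10(gamma)

best = max(
    (validation_accuracy(10 ** lc, 10 ** lg), 10 ** lc, 10 ** lg)
    for lc, lg in zip(log_C, log_gamma)
)
print("best validation accuracy %.4f at C=%.4f, gamma=%.4f" % best)
```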

Bayesian Optimization: Overview

At a high level, Bayesian Optimization offers a principled method of hyperparameter searching that takes advantage of information one learns during the optimization process. The idea is as follows: you pick some prior belief (it is Bayesian, after all) about how your parameters behave, and search the parameter space of interest by enforcing and updating that prior belief based on your ongoing measurements. The trade-off between exploration (making sure that you’ve visited most relevant corners of your space) and exploitation (once you’ve found a promising region of your space, finding a good setting of parameters within it) is handled in a structured and intelligent way.

Slightly more specifically, Bayesian Optimization algorithms place a Gaussian Process (GP) prior on your function of interest (in this case, the function that maps from parameter settings to validation set performance). By assuming this prior and updating it after each evaluation of parameter settings, one can infer the shape and structure of the underlying function one is attempting to optimize.

Basically: by assuming that functions behave in a particular way, you can make smart guesses about where in your parameter space you should search next. As a point of contact, I have used Vincent Dubourg and Jake Vanderplas’ Gaussian process plotting code (via scikit-learn) to produce the following image:

[Figure: a Gaussian process fit – observations, the true function, the posterior mean, and the 95% confidence region]

Here, we are trying to minimize the unknown function 5*sin(x) over the region from x=0 to x=10. This function is given by the red dashed line. The red points represent evaluations we have already made (the observations). The blue line is where the Gaussian Process believes the function most likely resides. The blue shaded region represents the 95% confidence interval of the blue line for a given x, assuming the GP model. Note how this region gets wider the further away one gets from an evaluated point; this is because we are less sure about where the underlying function lives in regions far away from any point where we have evaluated before.
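Something like this figure can be reproduced with scikit-learn’s current GP interface (the original plot used an older API); this sketch fits a GP to a handful of observations of 5*sin(x) and reads off the posterior mean and the 95% band:

```python
# A sketch of the figure above: fit a GP to a few observations of 5*sin(x) on [0, 10].
# Uses scikit-learn's GaussianProcessRegressor with its default kernel (an assumption;
# the original plot was made with an older scikit-learn GP interface).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

f = lambda x: 5.0 * np.sin(x)                    # the "unknown" function (red dashed line)
X_obs = np.array([[1.0], [3.0], [5.5], [8.0]])   # points we have already evaluated (red points)
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor().fit(X_obs, y_obs)

X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_plot, return_std=True)      # blue line and its uncertainty
lower, upper = mean - 1.96 * std, mean + 1.96 * std   # the shaded 95% region
```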

The fundamental question of Bayesian Optimization is: given this set of evaluations, where should we look next? The answer is not obvious. We are fairly confident about the value of the function in regions close to where we evaluate, so perhaps it is best to evaluate in these regions? On the other hand, the unknown function may have a smaller value we’d like to find further away from the region we’ve already explored, though we are less confident in that expectation, as illustrated by the wide confidence intervals.

In practice, different algorithms answer this question differently (based on their so-called acquisition function) but all address the balance between exploration (going to new, unknown regions of parameter space) and exploitation (optimizing within regions where you have higher confidence) to determine good settings of hyperparameters.
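To make that concrete, here is a toy Bayesian Optimization loop for our C/gamma search – not how Spearmint works internally, just an illustration. It reuses the hypothetical validation_accuracy helper, searches in log10 space, and uses expected improvement (one common acquisition function) scored over random candidate points:

```python
# A toy Bayesian Optimization loop (illustrative only; Spearmint is more sophisticated).
# Assumes the validation_accuracy(C, gamma) helper defined in the earlier sketch.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)

# We minimize, so the objective is negated validation accuracy (as Spearmint does by default).
def objective(p):
    return -validation_accuracy(C=10 ** p[0], gamma=10 ** p[1])

# Start from a few random settings in [log10(C), log10(gamma)] space.
X_tried = rng.uniform([-2, -4], [5, 3], size=(5, 2))
y_tried = np.array([objective(p) for p in X_tried])

for _ in range(25):
    # Fit a GP surrogate to the settings evaluated so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_tried, y_tried)

    # Score random candidate settings with expected improvement.
    candidates = rng.uniform([-2, -4], [5, 3], size=(1000, 2))
    mean, std = gp.predict(candidates, return_std=True)
    improvement = y_tried.min() - mean
    z = improvement / (std + 1e-9)
    ei = improvement * norm.cdf(z) + std * norm.pdf(z)

    # Evaluate the most promising candidate and add it to our observations.
    next_point = candidates[np.argmax(ei)]
    X_tried = np.vstack([X_tried, next_point])
    y_tried = np.append(y_tried, objective(next_point))

best = X_tried[np.argmin(y_tried)]
print("best found: C=%.4f, gamma=%.4f" % (10 ** best[0], 10 ** best[1]))
```

The GP surrogate plus the acquisition function decide where to evaluate next, which is exactly the explore/exploit trade-off described above.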

The Spearmint Wrapper I Wrote

I decided to write [a few rough Python scripts](https://github.com/jmhessel/BasicSpearmint) around the Spearmint code that are (hopefully) more accessible for folks just starting out with hyperparameter optimization beyond random/grid search. Older versions of Spearmint included spearmint-lite, which had fewer features (e.g. no MongoDB backend, outputs saved directly in files, etc.), but it seems that the newest versions do not. Ergo, I had to write my own scripts to interact with Spearmint if I wanted to use it in this way.

The documentation is provided in the git repository, and comments/bugfixes are welcome. It’s very rough because it’s mostly intended for my personal use (and I optimized for code shortness/my coding time over code generalizability/readability/design) but if it saves someone else some time as well, that’s awesome!

Here’s an example output – I’ve created 10 random splits of my data (train/validation/test) in a 31-class classification setting (I’ve included this data in the examples section with my Spearmint wrappers). I optimized each split separately for 30 minutes. The objective function I use for parameter optimization is accuracy on the validation set (ValObj). Note that the objectives are negated accuracies, because Spearmint minimizes by default (and I wouldn’t want to minimize accuracy!).

Included in the table are the number of evaluations performed per split, the best parameter settings found (C and gamma), and the testing accuracy of those settings (TestObj). The scripts are geared towards people who are running small-scale classification problems that require lots of splits of their data.

+-----+----------+---------+------------+-----------+---------+
| Run | NumEvals | ValObj  | C          | gamma     | TestObj |
+-----+----------+---------+------------+-----------+---------+
| 0   | 106      | -0.4681 | 76998.3369 | 499.9362  | -0.4575 |
| 1   | 119      | -0.5036 | 15203.7916 | 606.7886  | -0.4749 |
| 2   | 116      | -0.4826 | 19946.5472 | 327.9233  | -0.4676 |
| 3   | 100      | -0.4754 | 51506.0979 | 366.8840  | -0.4560 |
| 4   | 100      | -0.4580 | 60692.2926 | 833.9274  | -0.4700 |
| 5   | 115      | -0.4935 | 99525.2766 | 81.6870   | -0.4488 |
| 6   | 119      | -0.4746 | 13874.5225 | 416.0896  | -0.4845 |
| 7   | 112      | -0.4790 | 92094.4707 | 53.4785   | -0.4507 |
| 8   | 117      | -0.4870 | 41714.7042 | 213.3730  | -0.4725 |
| 9   | 115      | -0.4667 | 6104.2698  | 3782.7797 | -0.4536 |
+-----+----------+---------+------------+-----------+---------+

Note that, in 8/10 cases, the validation accuracy is higher than the testing accuracy. This is to be expected because the hyperparameters are being selected based on their performance on the validation set, and not the testing set.

Written on June 17, 2015