Not that long ago I wrote an introductory post on Gaussian processes (GPs), a regression technique where we condition a Gaussian prior distribution over functions on observed data. GPs can model any function that is plausible under the chosen prior. And we don't get a single function f: we get a whole posterior distribution over functions P(f|X).
This, of course, sounds very cool, but there is no free lunch. GPs have O(N³) complexity, where N is the number of data points, so in practice they work well up to roughly 1,000 data points (I should note that there are approximate methods that scale better). This led me to believe that I wouldn't be using them much in real-world problems, but luckily I was very wrong!
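To make that concrete, here is a minimal sketch of GP regression with scikit-learn (the toy data and kernel choice here are illustrative assumptions, not taken from the earlier post). Fitting requires factorizing the N × N kernel matrix, which is where the O(N³) cost comes from:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy observations: a noisy sine wave
X = np.linspace(0, 10, 25)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(25)

# alpha is the assumed observation noise variance
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1 ** 2)
gp.fit(X, y)  # O(N³): factorizes the N x N kernel matrix

# The posterior over functions, summarized by a mean and a standard deviation
X_new = np.linspace(0, 10, 200)[:, None]
mean, std = gp.predict(X_new, return_std=True)
```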
Bayesian Optimization
This post is about Bayesian optimization (BO), an optimization technique that has gained traction over the past few years as it is being used to search for optimal hyperparameters in neural networks. BO is a useful optimization algorithm for any black-box function that is costly to evaluate. 'Black box' here means that we only observe the (noisy) outputs of the function, and no extra information that could work to our advantage, such as first- or second-order derivatives. By 'costly' we mean that function evaluations come out of a limited budget: evaluating the function takes resources, often time or money.
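For illustration, a hypothetical black-box objective might look like the sketch below; the body is a stand-in for something genuinely expensive, like a training run or a simulation:

```python
import time
import numpy as np

def black_box(x):
    """We only observe a noisy scalar output: no gradients, no structure."""
    time.sleep(60)  # stand-in for an expensive evaluation (training run, simulation, A/B test)
    return (x - 2.0) ** 2 + 0.1 * np.random.randn()
```

Every call costs real time, so the goal is to find a good x in as few evaluations as possible.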
Some examples where Bayesian optimization can be useful are:
- Hyperparameter search for machine learning
- Parameter search for physics models
- Calibration of environmental models
- Increasing conversion rates
These problems are time-consuming, have large parameter spaces, and their implementations are 'black box', as we can't compute derivatives.
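As a taste of the first use case, here is a sketch of hyperparameter search using the scikit-optimize library (we won't use it later; this post builds its own implementation, and the model and search space below are just illustrative assumptions):

```python
from skopt import gp_minimize
from skopt.space import Real, Integer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

def objective(params):
    """Costly black box: trains and cross-validates a model."""
    learning_rate, n_estimators = params
    model = GradientBoostingRegressor(learning_rate=learning_rate,
                                      n_estimators=n_estimators,
                                      random_state=0)
    return -cross_val_score(model, X, y, cv=3).mean()  # minimize negative R²

result = gp_minimize(objective,
                     [Real(1e-3, 1.0, prior="log-uniform"),  # learning_rate
                      Integer(10, 200)],                     # n_estimators
                     n_calls=25, random_state=0)
print(result.x, result.fun)
```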
Some examples where you shouldn't use Bayesian optimization:
- Curve fitting
- Linear programming
For these kinds of problems there are better optimization algorithms, ones that can exploit the structure of the problem, for instance convexity or cheaply available gradients.
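Curve fitting, for example, is better served by least squares, which uses the smooth, differentiable structure of the residuals. A quick sketch with SciPy (the exponential model is just an illustrative choice):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)

x = np.linspace(0, 1, 50)
y = model(x, 2.0, 1.5) + 0.05 * np.random.randn(50)

# Levenberg-Marquardt least squares: uses gradients, converges in milliseconds
popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0])
```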
In this post, we are going to implement a Bayesian optimization algorithm in Python, and along the way we'll explore some properties of the algorithm.
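As a preview, here is a minimal sketch of the kind of loop we will build, assuming a GP surrogate and an expected-improvement acquisition function (the implementation later in the post may differ in its details):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    """Expected improvement acquisition function (for minimization)."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    imp = y_best - mu - xi           # how much we expect to improve on the best value
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimize(f, bounds, n_init=5, n_iter=20):
    """Minimize a 1D black-box function f over bounds = (low, high)."""
    low, high = bounds
    X = np.random.uniform(low, high, size=(n_init, 1))  # initial random design
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)  # refit the surrogate on everything observed so far
        X_cand = np.random.uniform(low, high, size=(1000, 1))
        ei = expected_improvement(X_cand, gp, y.min())
        x_next = X_cand[np.argmax(ei)]  # most promising candidate
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y)], y.min()
```

With the toy `black_box` function from earlier, `bayesian_optimize(black_box, (-5.0, 5.0))` would home in on x ≈ 2 in a few dozen evaluations.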