November 12, 2008

Tips for Solving the Dating Game
(or, a machine learning primer)

The Problem

In the dating game, you play the role of a matchmaker trying to satisfy a picky dater. Much like in real life, suitors (or suitresses) are reduced to a real-valued vector of n (≤ 100) features. Each element of the vector is in the range [0..1] and corresponds to some characteristic of the candidate. The picky dater (called "person" or P in the above link) judges the quality of each candidate based on an unknown weighting of their features. Each feature can be weighted positively or negatively (between -1 and 1) and we are guaranteed that the negative feature weights will sum to -1 and the positive weights will sum to 1. Additionally, we know that each weight will use no more than 2 decimal places, which I'll ignore for the rest of this page.

Initially, you are given P's dating history in the form of m (≈ 20) feature vectors along with the score that P assigned to each vector. Formally, we can describe this as:

		w ∈ [-1..1]ⁿ (P's weight vector of length n)
		
		∀w_i < 0 : ∑w_i  = -1
		∀w_j > 0 : ∑w_j  = 1
		
		dating history = { x_k, y_k | 1 ≤ k ≤ m }

Your job is to generate an ideal match for P, which we'll call x^*, such that w^T · x^* = 1. To do this, you will probably want to figure out the weighting function by which your candidates are being judged. If we approximate w using only the dating history we are first given, we are doing a type of function estimation called linear regression. Additionally, every candidate you submit (paired with the score you get back) can be used as evidence to refine your estimate of w. This puts us in an interesting machine learning setting called active learning (an area of current research). Since I don't know of any canonical techniques for active learning, I'll leave that part to you. Instead, I'll focus on the easier problem: given some data, how do we derive the best linear model (ie, the optimal estimate of w).

Loss and Optimization (the story of life?)

To find our optimal w (which we'll call w^* from now on), we need to pick a learning algorithm. Most machine learning algorithms boil down to different definitions of the following two concepts:

The Cost Function: Denoted C(w), is the function that tells us how undesirable a particular weight vector is. The cost function consists of two parts:
1. The Loss Function L(w) computes how well the weights predict the data we already have.
2. The Regularization Term R(w) imposes some assumptions on what w should look like in order to improve performance on unseen data. Regularization terms are optional, but often helpful.
The Optimization Method: Given a risk function, how do we go about finding the weights which result in minimum risk.

In summary, to say that we have a "learning algorithm" means we have defined

	C(w) = L(w) + R(w)	
	optimize :: (sample data X, sample labels Y, cost function C) → best parameters

Ordinary Least Squares

The simplest cost function worth talking about is called mean squared error (MSE). MSE ignores regularization and thus consists of just a loss function: the average squared difference between the value our estimate of w gives to each candidate and the value we know to be correct.

	C(w) = (1/m) * ∑(x_i^T·w - y_i)²
	
	or in matrix form, where the rows of X are x_i and Y is a column vector
	
	C(w) = (1/m) * (Xw - Y)^T(Xw - Y)

Minimizing MSE for a linear system (of the form Y = X·w) gives us Ordinary Least Squares (OLS) linear regression.
OLS regression is appealing because it's arguably the simplest sort of optimization we can imagine doing. We don't have to differentiate anything and there's no iterative update process. Just a little bit of term shuffling and we'll have a solution.
First, let's assume we have this desired w^* in hand. Let's further assume that it's a perfect fit, resulting in no error.

	X ∈ R^{m × n}	(previous candidates)
	Y ∈ R^{m × 1}	(scores of previous candidates)
	w ∈ R^{n × 1}	(picky dater's weights)
	
	Xw^* = Y
	X^TXw^* = X^TY 	(multiply both sides by X^T)
	w^* = (X^TX)^-1X^TY	(multiply both sides by (X^TX)^-1)

By algebraic magic, we now know how to calculate w^*. OK, good job guys. Can we go home? Well, not so fast. There's a few things worth noting here:

The OLS derivation relies on taking a matrix inverse. Most languages don't come with built in linear algebra routines so you'll have to find an appropriate library. If you don't want to use such a library, you'll need to use a different optimization technique.
The MSE cost function doesn't take into account any of the constraints we know the weight vector must obey. Perhaps a different cost function might perform better.
In the presence of noise (which we luckily don't have in the Dating Game) or with an underconstrained linear system (which we probably will have) OLS can suffer from overfitting, meaning it will give you no error on the candidates you've seen but might perform terribly on an new candidate. A regularization term could help improve performance.

Changing the Optimization Algorithm

If we want to keep the same cost function (mean squared error) but ditch the matrices we can instead use gradient descent. The idea here is that at every value of w we choose, we can calculate the gradient of the loss function and then move down the gradient towards the minimum. To do this we, of course, have to actually figure out what the gradient looks like. To make the math a little easier, let's drop the 1/m from in front of C(w) and replace it with a 1/2. Since constants won't affect the location of a minimum, this is kosher.

	C(w) = 1/2 * ∑(x_i^T·w - y_i)²
	∇C =  ∑∇C_i
	∇C_i = (x_i^T·w - y_i)x_i

Now that we know how to calculate the gradient, we can initialize the weight vector w₀ to some random value and proceed to iteratively refine that value.

	Until some stopping condition 
		G_t = vector of length n, initialized to all 0s
		foreach candidate c
			dotprod = ∑_i x_c,i*w_t,i
			diff = dotprod - y_c
			foreach gradient index i
				G_t,i =  G_t,i + diff * x_c,i
		foreach weight index i
			w_t+1,i = w_t,i - η*G_t,i

The two key decisions to make with the above algorithm are the stopping criterion and the learning rate (denoted η).

Stopping Criterion: You can choose to stop either at a sufficiently small error or when the magnitude of your updates to w starts to vanish.
Learning Rate: The learning rate ought to be a small value (somewhere between 0.00001 and 0.1) which is often determined using cross validation. Specifically, Leave One Out Cross Validation (described below) is best for small samples such as ours.

The above gradient descent method is called "batch learning" since it aggregates the gradients from each candidate before updating the weights. A variant called "stochastic gradient descent" or "online learning" updates the weight vector using the individual gradient from each candidate. Stochastic gradient descent is often preferable for large samples and/or complicated cost surfaces. Since in the Dating Game we have neither, I can't see a reason to use the stochastic variant over the batch algorithm. For a more thorough explanation of gradient descent (both batch and stochastic), see this page on Linear Neural Networks.

Leave One Out Cross Validation

Here's a very simple algorithm for choosing the best learning rate (η) to use with gradient descent.

	bestCost = ∞
	bestEta = 0
	foreach η in eta_range
		currCost = 0
		foreach candidate i
			w_i = train(X without x_i, Y without y_i)
			currCost = currCost + C(w_i)
		if currCost < bestCost then 
			bestLoss = currCost
			bestEta = η

Adding a Regularization Term

When trying to solve for parameters in an underconstrained or noisy system a smart thing to do is regularization. Regularization imposes some soft constraints on your weight vector, it lets you express an assumption about what a good solution should look like. Actually, even with sufficient data, regularization turns out (for reasons too complicated to discuss here) to be beneficial.

A common regularized form of linear regression is ridge regression, which uses the cost function:

	C(w) =  ∑(x_i^T·w - y_i)² + α∑w_j²

You'll notice that the first term (the loss) is the same loss function we used in OLS regression. The regularization term (on the right) penalizes the norm of w and thus guides the algorithm to prefer smaller weights. In practice, this works quite well. The choice of a regularization term may seem arbitrary, but ridge regression can actually be derived from some simple probabilistic assumptions. (derivation care of Sergey Feldman).

A Note on Support Vector Machines

Support vector machines have been something of a machine learning wunderkind for 10 or so years (ie, you might have heard of them). At first, they seem very intimidating and complicated. You might be relieved to discover that for linear systems (like the one we're dealing with in the dating game) SVMs are nothing more than the combination of a smart loss function (hinge loss), the same regularization term we use for ridge regression, and a quick optimization method (SMO). That's it. In their usual form, however, an SVM wouldn't be a good fit for the Dating Game. SVMs generate classifiers which label each candiate as +1 or -1. Since we're dealing with continuous values of y_i, a classifier wouldn't do us much good. You can do regression with an SVM variant, but it's definitely overkill for the Dating Game.

More Info

Generalization, Overfitting, and Regularization: http://dis.unal.edu.co/~fgonza/courses/2007-I/ml/regularization.pdf
Cross validation: http://www.autonlab.org/tutorials/overfit.html
Estimating linear models with gradient descent: http://www.autonlab.org/tutorials/neural.html
Introductoin to regression: http://www.autonlab.org/tutorials/introreg.html

Tips for Solving the Dating Game (or, a machine learning primer)