November 12, 2008

(

In the dating game, you play the role of a matchmaker trying to satisfy
a picky dater. Much like in real life, suitors (or suitresses) are reduced to a real-valued vector of `n`

(≤ 100) features. Each element of the vector is in the range `[0..1]`

and corresponds to some characteristic of the candidate. The picky dater (called "person" or `P`

in the above link) judges the quality of each candidate based on an unknown weighting of their features. Each feature can be weighted positively or negatively (between `-1`

and `1`

) and we are guaranteed that the negative feature weights will sum to `-1`

and the positive weights will sum to `1`

. Additionally, we know that each weight will use no more than 2 decimal places, which I'll ignore for the rest of this page.

Initially, you are given `P`

's dating history in the form of `m`

(≈ 20) feature vectors along with the score that `P`

assigned to each vector. Formally, we can describe this as:

w ∈ [-1..1]^{n}(P's weight vector of length n) ∀w_{i}< 0 : ∑w_{i}= -1 ∀w_{j}> 0 : ∑w_{j}= 1 dating history = { x_{k}, y_{k}| 1 ≤ k ≤ m }

Your job is to generate an ideal match for `P`

, which we'll call `x`

, such that ^{*}`w`

. To do this, you will probably want to figure out the weighting function by which your candidates are being judged. If we approximate ^{T} · x^{*} = 1`w`

using only the dating history we are first given, we are doing a type of function estimation called linear regression. Additionally, every candidate you submit (paired with the score you get back) can be used as evidence to refine your estimate of `w`

. This puts us in an interesting machine learning setting called active learning (an area of current research). Since I don't know of any canonical techniques for active learning, I'll leave that part to you. Instead, I'll focus on the easier problem: given some data, how do we derive the best linear model (ie, the optimal estimate of `w`

).

To find our optimal `w`

(which we'll call `w`

^{*} from now on), we need to pick a learning algorithm. Most machine learning algorithms boil down to different definitions of the following two concepts:

**The Cost Function**: Denoted`C(w)`

, is the function that tells us how undesirable a particular weight vector is. The cost function consists of two parts:*The Loss Function*`L(w)`

computes how well the weights predict the data we already have.*The Regularization Term*`R(w)`

imposes some assumptions on what`w`

should look like in order to improve performance on unseen data. Regularization terms are optional, but often helpful.

**The Optimization Method**: Given a risk function, how do we go about finding the weights which result in minimum risk.

In summary, to say that we have a "learning algorithm" means we have defined

C(w) = L(w) + R(w) optimize :: (sample data X, sample labels Y, cost function C) → best parameters

The simplest cost function worth talking about is called mean squared error (MSE). MSE ignores regularization and thus consists of just a loss function: the average squared difference between the value our estimate of `w`

gives to each candidate and the value we know to be correct.

C(w) = (1/m) * ∑(x_{i}^{T}·w - y_{i})^{2}or in matrix form, where the rows of X are xC(w) = (1/m) * (Xw - Y)_{i}and Y is a column vector^{T}(Xw - Y)

Minimizing MSE for a linear system (of the form `Y = X·w`

) gives us Ordinary Least Squares (OLS) linear regression.

OLS regression is appealing because it's arguably the simplest sort of optimization we can imagine doing. We don't have to differentiate anything and there's no iterative update process. Just a little bit of term shuffling and we'll have a solution.

First, let's assume we have this desired `w`

^{*} in hand. Let's further assume that it's a perfect fit, resulting in no error.

X ∈R^{m × n}(previous candidates)Y ∈R^{m × 1}(scores of previous candidates)w ∈R^{n × 1}(picky dater's weights)Xw^{*}= Y X^{T}Xw^{*}= X^{T}Y(multiply both sides by Xw^{T})^{*}= (X^{T}X)^{-1}X^{T}Y(multiply both sides by (X^{T}X)^{-1})

By algebraic magic, we now know how to calculate `w`

^{*}. OK, good job guys. Can we go home? Well, not so fast. There's a few things worth noting here:

- The OLS derivation relies on taking a matrix inverse. Most languages don't come with built in linear algebra routines so you'll have to find an appropriate library. If you don't want to use such a library, you'll need to use a different optimization technique.
- The MSE cost function doesn't take into account any of the constraints we know the weight vector must obey. Perhaps a different cost function might perform better.
- In the presence of noise (which we luckily don't have in the Dating Game) or with an underconstrained linear system (which we probably will have) OLS can suffer from overfitting, meaning it will give you no error on the candidates you've seen but might perform terribly on an new candidate. A regularization term could help improve performance.

If we want to keep the same cost function (mean squared error) but ditch the matrices we can instead use gradient descent. The idea here is that at every value of `w`

we choose, we can calculate the gradient of the loss function and then move down the gradient towards the minimum. To do this we, of course, have to actually figure out what the gradient looks like. To make the math a little easier,
let's drop the 1/m from in front of `C(w)`

and replace it with a `1/2`

. Since constants won't affect the location of a minimum, this is kosher.

C(w) = 1/2 * ∑(x_{i}^{T}·w - y_{i})^{2}∇C = ∑∇C_{i}∇C_{i}= (x_{i}^{T}·w - y_{i})x_{i}

Now that we know how to calculate the gradient, we can initialize the weight vector `w`

to some random value and proceed to iteratively refine that value.
_{0}

Until some stopping condition G_{t}=vector of length n, initialized to all 0sforeach candidate c dotprod = ∑_{i}x_{c,i}*w_{t,i}diff = dotprod - y_{c}foreach gradient index i G_{t,i}= G_{t,i}+ diff * x_{c,i}foreach weight index i w_{t+1,i}= w_{t,i}- η*G_{t,i}

The two key decisions to make with the above algorithm are the stopping criterion and the learning rate (denoted η).

**Stopping Criterion**: You can choose to stop either at a sufficiently small error or when the magnitude of your updates to`w`

starts to vanish.**Learning Rate**: The learning rate ought to be a small value (somewhere between 0.00001 and 0.1) which is often determined using cross validation. Specifically, Leave One Out Cross Validation (described below) is best for small samples such as ours.

The above gradient descent method is called "batch learning" since it aggregates the gradients from each candidate before updating the weights. A variant called "stochastic gradient descent" or "online learning" updates the weight vector using the individual gradient from each candidate. Stochastic gradient descent is often preferable for large samples and/or complicated cost surfaces. Since in the Dating Game we have neither, I can't see a reason to use the stochastic variant over the batch algorithm. For a more thorough explanation of gradient descent (both batch and stochastic), see this page on Linear Neural Networks.

Here's a very simple algorithm for choosing the best learning rate (η) to use with gradient descent.

bestCost = ∞ bestEta = 0 foreach η in eta_range currCost = 0 foreach candidate i w_{i}= train(X without x_{i}, Y without y_{i}) currCost = currCost + C(w_{i}) if currCost < bestCost then bestLoss = currCost bestEta = η

When trying to solve for parameters in an underconstrained or noisy system a smart thing to do is regularization. Regularization imposes some soft constraints on your weight vector, it lets you express an assumption about what a good solution should look like. Actually, even with sufficient data, regularization turns out (for reasons too complicated to discuss here) to be beneficial.

A common regularized form of linear regression is ridge regression, which uses the cost function:

C(w) = ∑(xYou'll notice that the first term (the loss) is the same loss function we used in OLS regression. The regularization term (on the right) penalizes the norm of_{i}^{T}·w - y_{i})^{2}+ α∑w_{j}^{2}

`w`

and thus guides the algorithm to prefer smaller weights. In practice, this works quite well. The choice of a regularization term may seem arbitrary, but ridge regression can actually be derived from some simple probabilistic assumptions. (`+1`

or `-1`

. Since we're dealing with continuous values of `y`_{i}

, a classifier wouldn't do us much good. You - Generalization, Overfitting, and Regularization: http://dis.unal.edu.co/~fgonza/courses/2007-I/ml/regularization.pdf
- Cross validation: http://www.autonlab.org/tutorials/overfit.html
- Estimating linear models with gradient descent: http://www.autonlab.org/tutorials/neural.html
- Introductoin to regression: http://www.autonlab.org/tutorials/introreg.html