
The Kernel Trick (SVM)

[Slide: switching from the primal form (using $\mathbf{w}$) to the dual form (using $\alpha$)]

This slide shows the switch from a model's "primal form" (using $\mathbf{w}$) to its "dual form" (using $\alpha$), which is the key to using the kernel trick. Let's break down your questions intuitively.


🤔 What is Alpha ($\alpha$) Intuitively?

Intuitively, $\alpha$ is a vector of weights, where each weight $\alpha_i$ tells you the importance of the i-th training example in defining the decision boundary.

This is a fundamental shift in perspective:

In an SVM, most of the $\alpha_i$ values will be zero. The only non-zero $\alpha_i$ values belong to the support vectors. This means the model is defined only by the most critical points on the margin.
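
To make the sparsity concrete, here is a minimal sketch using scikit-learn's `SVC` on made-up toy data (assuming scikit-learn and numpy are available); the only point is that just a small subset of the training points end up with non-zero dual weights.

```python
# Sketch: fit a linear SVM on toy data and count how many training points
# end up as support vectors. Illustrative only; the data is arbitrary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors carry non-zero dual weights (alpha_i * y_i).
print("training points:", len(X))                     # 200
print("support vectors:", len(clf.support_))          # typically a small subset
print("non-zero dual coefficients:", clf.dual_coef_.shape[1])
```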

So, to answer your question, $\alpha$ does not replace $\mathbf{w}$ directly, but it provides an alternative way to construct it. The mathematical relationship is:

$$\mathbf{w} = X^T \alpha = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$$

This formula says that the feature-weight vector $\mathbf{w}$ is simply a linear combination of the training data points, where the coefficients of that combination are the importance weights in $\alpha$.
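
As a quick illustration of that identity, here is a minimal numpy sketch; the $\alpha$ values are arbitrary stand-ins, not the result of any training procedure.

```python
# Sketch of w = X^T alpha: the primal weight vector is a weighted sum of
# the training points. The alpha values are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                      # 5 training points, 3 features
alpha = np.array([0.0, 0.7, 0.0, -0.3, 0.0])     # sparse "importance" weights

w = X.T @ alpha                                  # w = sum_i alpha_i * x_i
w_loop = sum(a * x for a, x in zip(alpha, X))    # the same sum, written out

print(np.allclose(w, w_loop))                    # True
```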


🔎 Why is Inference $\hat{y} = \mathbf{k}_0^T \alpha^*$?

This formula looks complex, but it's just the original prediction formula, $\hat{y} = \mathbf{x}_0^T \mathbf{w}$, rewritten using the relationship we just defined. Let's walk through it.

1. Start with the original prediction formula: To predict a new point $\mathbf{x}_0$, we use: $\hat{y} = \mathbf{x}_0^T \mathbf{w}^*$

2. Substitute the dual form of w: We know that $\mathbf{w}^* = X^T \alpha^*$. Let's plug that in: $\hat{y} = \mathbf{x}_0^T (X^T \alpha^*)$

3. Rearrange using linear algebra: Because of the properties of the transpose, we can re-group the terms: $\hat{y} = (\mathbf{x}_0^T X^T) \alpha^* \implies \hat{y} = (X \mathbf{x}_0)^T \alpha^*$

4. Define $\mathbf{k}_0$: The slide defines the term $\mathbf{k}_0 = X \mathbf{x}_0$. Let's see what that actually is.

So, $\mathbf{k}_0$ is a vector containing the dot products of your new point ($\mathbf{x}_0$) with every single point in the training set: its i-th entry is $(\mathbf{k}_0)_i = \mathbf{x}_i^T \mathbf{x}_0$. It's a measure of similarity between the new point and all the old points.

5. The Final Formula: By substituting $\mathbf{k}_0$ back into the equation from step 3, we get: $\hat{y} = \mathbf{k}_0^T \alpha^*$. The reason it's $\mathbf{k}_0^T$ (transpose) is a notational convention for the dot product between two column vectors. The formula simply calculates a weighted sum, where you are summing the "similarity scores" in $\mathbf{k}_0$, weighted by the "importance scores" in $\alpha^*$. The short sketch below checks this numerically.
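
Here is that numerical check, with an arbitrary stand-in for $\alpha^*$; the only claim is that the formulas from steps 1 and 5 produce the same number.

```python
# Numerical check of steps 2-5: predicting with k0^T alpha gives exactly the
# same answer as predicting with x0^T w, where w = X^T alpha.
# (alpha is an arbitrary illustrative vector, not a trained solution.)
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))        # 6 training points, 4 features
alpha = rng.normal(size=6)         # stand-in for the learned alpha*
x0 = rng.normal(size=4)            # new point to predict

w = X.T @ alpha                    # primal weights from the dual weights
k0 = X @ x0                        # similarities of x0 to every training point

print(x0 @ w, k0 @ alpha)                 # identical values
print(np.isclose(x0 @ w, k0 @ alpha))     # True
```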

This is powerful because the entire process, both training and prediction, now depends only on inner products (dot products), which allows us to swap them out for kernels.
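
To see that swap in action, here is a hedged sketch of kernel ridge regression (a different model from the SVM on the slide, but one with the same dual structure): every dot product is replaced by an RBF kernel, and both training and prediction only ever touch kernel values.

```python
# Kernel ridge regression as an illustration of the kernel trick:
# training solves (K + lambda*I) alpha = y and prediction is k0^T alpha,
# with every dot product replaced by an RBF kernel. Sketch only; the data
# and hyperparameters are arbitrary.
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(50, 1))               # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)    # noisy targets

lam = 0.1
K = rbf_kernel(X, X)                               # Gram matrix: all pairwise "dot products"
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual weights

x0 = np.array([[1.0]])                             # a new query point
k0 = rbf_kernel(X, x0)[:, 0]                       # similarities of x0 to every training point
y_hat = k0 @ alpha                                 # prediction = k0^T alpha

print(y_hat, np.sin(1.0))                          # prediction vs. the true function value
```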


© 2025 James Yap

Personal Website and Knowledge Base