
The Kernel Trick (SVM)

[Slide: switching from the primal form (using $\mathbf{w}$) to the dual form (using $\alpha$)]

This slide shows the switch from a model's "primal form" (using $\mathbf{w}$) to its "dual form" (using $\alpha$), which is the key to using the kernel trick. Let's break down your questions intuitively.


🤔 What is Alpha ($\alpha$) Intuitively?

Intuitively, $\alpha$ is a vector of weights, where each weight $\alpha_i$ tells you the importance of the i-th training example in defining the decision boundary.

This is a fundamental shift in perspective:

In an SVM, most of the $\alpha_i$ values will be zero. The only non-zero $\alpha_i$ values belong to the support vectors. This means the model is defined only by the most critical points on the margin.
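
To make the sparsity concrete, here is a minimal sketch using scikit-learn's `SVC` on made-up toy data (assuming scikit-learn and numpy are available); the only point is that just a small subset of the training points end up with non-zero dual weights.

```python
# Sketch: fit a linear SVM on toy data and count how many training points
# end up as support vectors. Illustrative only; the data is arbitrary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors carry non-zero dual weights (alpha_i * y_i).
print("training points:", len(X))                     # 200
print("support vectors:", len(clf.support_))          # typically a small subset
print("non-zero dual coefficients:", clf.dual_coef_.shape[1])
```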

So, to answer your question, $\alpha$ does not replace $\mathbf{w}$ directly, but it provides an alternative way to construct it. The mathematical relationship is:

$$\mathbf{w} = X^T \alpha = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$$

This formula says that the feature-weight vector $\mathbf{w}$ is simply a linear combination of the training data points, where the coefficients of that combination are the importance weights in $\alpha$.
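
As a quick illustration of that identity, here is a minimal numpy sketch; the $\alpha$ values are arbitrary stand-ins, not the result of any training procedure.

```python
# Sketch of w = X^T alpha: the primal weight vector is a weighted sum of
# the training points. The alpha values are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                      # 5 training points, 3 features
alpha = np.array([0.0, 0.7, 0.0, -0.3, 0.0])     # sparse "importance" weights

w = X.T @ alpha                                  # w = sum_i alpha_i * x_i
w_loop = sum(a * x for a, x in zip(alpha, X))    # the same sum, written out

print(np.allclose(w, w_loop))                    # True
```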


🔎 Why is Inference $\hat{y} = \mathbf{k}_0^T \alpha^*$?

This formula looks complex, but it's just the original prediction formula, $\hat{y} = \mathbf{x}_0^T \mathbf{w}$, rewritten using the relationship we just defined. Let's walk through it.

1. Start with the original prediction formula: To predict a new point $\mathbf{x}_0$, we use: $\hat{y} = \mathbf{x}_0^T \mathbf{w}^*$

2. Substitute the dual form of w: We know that $\mathbf{w}^* = X^T \alpha^*$. Let's plug that in: $\hat{y} = \mathbf{x}_0^T (X^T \alpha^*)$

3. Rearrange using linear algebra: Because of the properties of the transpose, we can re-group the terms: $\hat{y} = (\mathbf{x}_0^T X^T) \alpha^* \implies \hat{y} = (X \mathbf{x}_0)^T \alpha^*$

4. Define $\mathbf{k}_0$: The slide defines the term $\mathbf{k}_0 = X \mathbf{x}_0$. Let's see what that actually is.

So, $\mathbf{k}_0$ is a vector containing the dot products of your new point ($\mathbf{x}_0$) with every single point in the training set: its i-th entry is $(\mathbf{k}_0)_i = \mathbf{x}_i^T \mathbf{x}_0$. It's a measure of similarity between the new point and all the old points.

5. The Final Formula: By substituting $\mathbf{k}_0$ back into the equation from step 3, we get: $\hat{y} = \mathbf{k}_0^T \alpha^*$. The reason it's $\mathbf{k}_0^T$ (transpose) is a notational convention for the dot product between two column vectors. The formula simply calculates a weighted sum, where you are summing the "similarity scores" in $\mathbf{k}_0$, weighted by the "importance scores" in $\alpha^*$. The short sketch below checks this numerically.
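
Here is that numerical check, with an arbitrary stand-in for $\alpha^*$; the only claim is that the formulas from steps 1 and 5 produce the same number.

```python
# Numerical check of steps 2-5: predicting with k0^T alpha gives exactly the
# same answer as predicting with x0^T w, where w = X^T alpha.
# (alpha is an arbitrary illustrative vector, not a trained solution.)
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))        # 6 training points, 4 features
alpha = rng.normal(size=6)         # stand-in for the learned alpha*
x0 = rng.normal(size=4)            # new point to predict

w = X.T @ alpha                    # primal weights from the dual weights
k0 = X @ x0                        # similarities of x0 to every training point

print(x0 @ w, k0 @ alpha)                 # identical values
print(np.isclose(x0 @ w, k0 @ alpha))     # True
```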

This is powerful because the entire process, both training and prediction, now depends only on inner products (dot products), which allows us to swap them out for kernels.
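
To see that swap in action, here is a hedged sketch of kernel ridge regression (a different model from the SVM on the slide, but one with the same dual structure): every dot product is replaced by an RBF kernel, and both training and prediction only ever touch kernel values.

```python
# Kernel ridge regression as an illustration of the kernel trick:
# training solves (K + lambda*I) alpha = y and prediction is k0^T alpha,
# with every dot product replaced by an RBF kernel. Sketch only; the data
# and hyperparameters are arbitrary.
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(50, 1))               # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)    # noisy targets

lam = 0.1
K = rbf_kernel(X, X)                               # Gram matrix: all pairwise "dot products"
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual weights

x0 = np.array([[1.0]])                             # a new query point
k0 = rbf_kernel(X, x0)[:, 0]                       # similarities of x0 to every training point
y_hat = k0 @ alpha                                 # prediction = k0^T alpha

print(y_hat, np.sin(1.0))                          # prediction vs. the true function value
```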


© 2025 James Yap

Personal Website and Knowledge Base