The Mathematical Principles of Support Vector Machines (SVM)

A support vector machine constructs a hyper-plane or set of hyper-planes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

1 SVC

1.1 $\text{n_classes}=2$

For a data feature matrix $\mathbf{X}_{n \times m}$ and a binary target vector $\vec{y}$:

Given training vectors $\vec{x}_i \in \mathbb{R}^m, \ i=1,2,\ldots,n$, in two classes, and a vector $\vec{y} \in \{1,-1\}^n$ (for intuition: $\vec{x}_i$ is the $i$-th row of the dataset $\mathbf{X}$, $\vec{y}$ is the binary target vector, and $y_i$ is its $i$-th entry), SVC solves the following primal problem: $$\begin{align} \begin{aligned} & \min_ {\vec{w}, b, \vec{\zeta}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_i \\ & \begin{split}\textrm {subject to } & y_i (\vec{w}^T \phi (\vec{x}_i) + b) \geq 1 - \zeta_i,\\ & \zeta_i \geq 0, \\ & \zeta_i \text{ is the i-th element of the vector $\vec{\zeta}$,} \\ & i=1, ..., n\end{split}\end{aligned}\end{align}$$ This problem determines the parameters that define the support vectors: $\vec{w}$ (i.e. $w_1, w_2, \ldots, w_m$), $b$, and $\vec{\zeta}$ (i.e. $\zeta_1, \zeta_2, \ldots, \zeta_n$).

This optimization maximizes the margin between the two classes. Note that, because a kernel is used, $\vec{x}_i$ is transformed into $\phi(\vec{x}_i)$. For further reading see Support-Vector Networks.

Its dual is: $$\begin{align}\begin{aligned} & \min_{\vec{\alpha}} \frac{1}{2} \vec{\alpha}^T \mathbf{Q} \vec{\alpha} - \vec{e}^T \vec{\alpha}\\ & \begin{split} \textrm {subject to } & \vec{y}^T \vec{\alpha} = 0\\ & 0 \leq \alpha_i \leq C, \\ & i=1, ..., n\end{split}\end{aligned}\end{align}$$ This problem determines the kernel-related parameters $\vec{\alpha}$, i.e. $\alpha_1, \alpha_2, \ldots, \alpha_n$.

Equations 40-48 of the Support-Vector Networks paper give the detailed derivation. Again, because a kernel is used, $\vec{x}_i$ is transformed into $\phi(\vec{x}_i)$.

where $\vec{e}$ is the vector of all ones, $C > 0$ is the upper bound, $\mathbf{Q}$ is an $n$ by $n$ positive semidefinite matrix, $\mathbf{Q}_{ij} \equiv y_i y_j K(\vec{x}_i, \vec{x}_j)$, where $K(\vec{x}_i, \vec{x}_j) = \phi (\vec{x}_i)^T \phi (\vec{x}_j)$ is the kernel. Here the training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.
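As a concrete check of the identity $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^T \phi(\vec{x}_j)$, here is a small sketch of my own (not from the sklearn docs) using the degree-2 polynomial kernel on 2-dimensional inputs, whose explicit feature map is known in closed form:

```python
# Numeric check of K(x, z) = phi(x)^T phi(z) for the degree-2 polynomial
# kernel (x^T z + 1)^2; phi below is the standard explicit map for 2-d inputs
# (illustrative example, not part of the original text).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

K = (x @ z + 1.0) ** 2                  # kernel value computed in the input space
assert np.isclose(K, phi(x) @ phi(z))   # same value via the explicit feature map
```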

The decision function is: $$\operatorname{sgn}\left(\sum_{i=1}^n y_i \alpha_i K(\vec{x}_i, \vec{x}) + \rho\right)$$ The decision function determines the value of $y$; since this is classification, it determines the sign of $y$. As for $\rho$, this is the intercept_ attribute in sklearn; it is determined during fitting. When the number of classes $\text{n_classes}=2$, $shape(\rho)=[1]$; when $\text{n_classes} \ne 2$, intercept_ holds one $\rho$ per fitted sub-classifier. (Note that the coef0 parameter of sklearn.svm.SVC is a different quantity: it is the independent term inside the polynomial and sigmoid kernels, not this intercept.)

Of the three formulas above, the first (the primal) optimizes $\vec{w}$, $b$ and $\vec{\zeta}$, which determine the support vectors; the second and third optimize the kernel-related parameters $\vec{\alpha}$, which are used to predict data points outside the training set and do not themselves define the support vectors. In the third formula, since $\vec{x}_i$ is known, $K(\vec{x}_i, \vec{x})$ is a surface over the $m$-dimensional feature space whose height at $\vec{x}$ is $K(\vec{x}_i, \vec{x})$. In the first formula, $\phi(\vec{x}_i)$ maps $\vec{x}_i$ into a space $\mathcal{V}$; applying the kernel trick in this way turns the SVM into a nonlinear model. For more on kernel functions, see my post 核方法kernel_method.
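A quick way to see the effect of the kernel trick mentioned above is to compare a linear kernel with an RBF kernel on data that is not linearly separable. The sketch below is my own example (the make_circles dataset and parameter values are arbitrary choices):

```python
# On concentric circles, a linear-kernel SVC cannot separate the classes,
# while an RBF-kernel SVC fits them almost perfectly (illustrative sketch).
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel, C=1.0).fit(X, y).score(X, y)
    print(f"{kernel:6s} training accuracy = {acc:.2f}")
# linear stays near 0.5, rbf is close to 1.0
```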

See Eq. 33 of the Support-Vector Networks paper.

These parameters can be accessed through the attributes dual_coef_, which holds the products $y_i \alpha_i$, support_vectors_, which holds the support vectors, and intercept_, which holds the independent term $\rho$.
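To make the relationship between the decision function and these attributes concrete, here is a minimal sketch of my own (the toy dataset and the gamma value are arbitrary assumptions) that rebuilds the argument of $\operatorname{sgn}$ from dual_coef_, support_vectors_ and intercept_ and compares it with decision_function:

```python
# Reconstruct sum_i y_i*alpha_i*K(x_i, x) + rho from the fitted attributes
# of a binary RBF SVC and compare it with decision_function (sketch).
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for each support vector (shape (1, n_SV) in
# the binary case); support_vectors_ holds the x_i; intercept_ holds rho.
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)   # K(x_i, x)
decision = clf.dual_coef_ @ K + clf.intercept_       # shape (1, n_samples)

assert np.allclose(decision.ravel(), clf.decision_function(X))
```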

1.2 $\text{n_classes} > 2$ (my own derivation)

For a multi-class target vector $\vec{y}$ (write $p=\text{n_classes}$), encoding each $y_i$ turns $\vec{y}$ into a matrix $\mathbf{Y}_{n \times p}$.

The three formulas above are rewritten accordingly as: $$\begin{align}\begin{aligned} & \min_ {\vec{w}, b, \mathbf{Z}_{n \times p}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_{i1} \\ & \min_ {\vec{w}, b, \mathbf{Z}_{n \times p}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_{i2} \\ & \vdots \\ & \min_ {\vec{w}, b, \mathbf{Z}_{n \times p}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_{ip} \\ & \begin{split}\textrm {subject to } & y_{ik} (\vec{w}^T \phi (\vec{x}_i) + b) \geq 1 - \zeta_{ik},\\ & \zeta_{ik} \geq 0, \\ & \zeta_{ik} \text{ is the element in row i, column k of $\mathbf{Z}_{n \times p}$,} \\ & y_{ik} \text{ is the element in row i, column k of $\mathbf{Y}_{n \times p}$,} \\ & i=1, ..., n, \\ & k=1, ..., p\end{split}\end{aligned}\end{align}$$ These problems determine the parameters that define the support vectors: $\vec{w}$ (i.e. $\{w_1, w_2, \ldots, w_m\}$), $b$, and $\mathbf{Z}_{n \times p}$ (i.e. $\{\zeta_{11}, \zeta_{21}, \ldots, \zeta_{n1}, \quad \zeta_{12}, \zeta_{22}, \ldots, \zeta_{n2}, \quad \ldots, \quad \zeta_{1p}, \zeta_{2p}, \ldots, \zeta_{np}\}$). $$\begin{align}\begin{aligned} & \min_{\mathbf{A}_{n \times p}} \frac{1}{2} \vec{\alpha}_{1}^T \mathbf{Q} \vec{\alpha}_{1} - \vec{e}^T \vec{\alpha}_{1}\\ & \min_{\mathbf{A}_{n \times p}} \frac{1}{2} \vec{\alpha}_{2}^T \mathbf{Q} \vec{\alpha}_{2} - \vec{e}^T \vec{\alpha}_{2}\\ & \vdots \\ & \min_{\mathbf{A}_{n \times p}} \frac{1}{2} \vec{\alpha}_{p}^T \mathbf{Q} \vec{\alpha}_{p} - \vec{e}^T \vec{\alpha}_{p}\\ & \begin{split} \textrm {subject to } & \vec{y}_{k}^T \vec{\alpha}_k = 0\\ & \vec{y}_k \text{ is column k of $\mathbf{Y}_{n \times p}$,} \\ & \vec{\alpha}_k \text{ is column k of $\mathbf{A}_{n \times p}$,} \\ & 0 \leq \alpha_{ik} \leq C, \\ & i=1, ..., n \\ & k=1, ..., p \end{split}\end{aligned}\end{align}$$ These problems determine the kernel-related parameters $\mathbf{A}_{n \times p}$, i.e. $\{\alpha_{11}, \alpha_{21}, \ldots, \alpha_{n1}, \quad \alpha_{12}, \alpha_{22}, \ldots, \alpha_{n2}, \quad \ldots, \quad \alpha_{1p}, \alpha_{2p}, \ldots, \alpha_{np}\}$.

$$\operatorname{sgn}(\sum_{i=1}^n y_{i1} \alpha_{i1} K(\vec{x}_i, \vec{x}) + \rho_1) \\ \operatorname{sgn}(\sum_{i=1}^n y_{i2} \alpha_{i2} K(\vec{x}_i, \vec{x}) + \rho_2) \\ \vdots \\ \operatorname{sgn}(\sum_{i=1}^n y_{ip} \alpha_{ip} K(\vec{x}_i, \vec{x}) + \rho_p) \\ $$ These decision functions determine the value of each column of $\mathbf{Y}$; since this is classification, they determine its sign. Here $\vec{\rho}$ (intercept_) collects one intercept per decision function, $\rho_1, \ldots, \rho_p$; as noted above, when $\text{n_classes}=2$ sklearn stores a single intercept.

The three multi-class formulas above may well be wrong: the columns of $\mathbf{Y}$ are not independent, and the formulation above completely severs the links between them. I will keep them for now, since they still help build intuition.
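As a sanity check on how sklearn actually organizes these quantities, the snippet below (my own inspection sketch on the iris dataset) simply prints the attribute shapes. Note that SVC trains one-vs-one sub-classifiers for multi-class data, so the shapes follow that scheme rather than the per-column formulation sketched above:

```python
# Inspect the multi-class attribute shapes of a fitted SVC (sketch).
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf").fit(X, y)

print(clf.classes_)                # the p class labels
print(clf.support_vectors_.shape)  # (n_SV, n_features)
print(clf.dual_coef_.shape)        # (n_classes - 1, n_SV)
print(clf.intercept_.shape)        # one rho per one-vs-one sub-classifier
```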

2 NuSVC

We introduce a new parameter $\nu$ which controls the number of support vectors and training errors. The parameter $\nu \in (0, 1]$ is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

It can be shown that the $\nu$-SVC formulation is a reparameterization of the $C$-SVC and is therefore mathematically equivalent (judging from the two parameters' value ranges, their relationship appears to be roughly reciprocal).
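A small empirical sketch of the bound (my own example; the toy dataset and the $\nu$ values are arbitrary assumptions): fit NuSVC with increasing $\nu$ and check that the fraction of support vectors is at least roughly $\nu$.

```python
# nu is a lower bound on the fraction of support vectors, so the fraction of
# support vectors should grow with nu (illustrative sketch).
from sklearn.svm import NuSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)
for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
    frac_sv = clf.support_.size / X.shape[0]   # support_ holds the SV indices
    print(f"nu = {nu:.1f}  fraction of support vectors = {frac_sv:.2f}")
```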

3 SVR

Given training vectors $\vec{x}_i \in \mathbb{R}^m, \ i=1,2,\ldots,n$, and a vector $\vec{y} \in \mathbb{R}^n$ ($\vec{y} \in \mathbb{R}^n$ because there are $n$ training samples), $\varepsilon$-SVR solves the following primal problem: $$\begin{align}\begin{aligned} & \min_ {\vec{w}, b, \vec{\zeta}, \vec{\zeta}^*} \frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)\\ & \begin{split}\textrm {subject to } & y_i - \vec{w}^T \phi (\vec{x}_i) - b \leq \varepsilon + \zeta_i,\\ & \vec{w}^T \phi (\vec{x}_i) + b - y_i \leq \varepsilon + \zeta_i^*,\\ & \zeta_i, \zeta_i^* \geq 0, \\ & i=1, ..., n\end{split}\end{aligned}\end{align}$$ The constraints say that the difference $y_i - (\vec{w}^T \phi (\vec{x}_i) + b)$ between the true value and the model's prediction on the training data lies in $[-(\varepsilon + \zeta_i^*), \ \varepsilon + \zeta_i]$. The slack variables $\zeta_i$ and $\zeta_i^*$ are penalized so that they stay as small as possible. As for $\varepsilon$, the sklearn documentation shows that it is a hyperparameter: it is not adjusted by the learner and has to be tuned, e.g. by grid search. According to the documentation, the loss function gives no penalty when the difference between the predicted and true values is smaller than $\varepsilon$; as the formulation above shows, $\varepsilon$ itself is not constrained.
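The $\varepsilon$-insensitive behaviour described above can be written as the loss $\max(0, |y - f(\vec{x})| - \varepsilon)$; the tiny sketch below (my own illustration) shows that residuals smaller than $\varepsilon$ cost nothing:

```python
# Epsilon-insensitive loss: residuals below epsilon are not penalized (sketch).
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

print(epsilon_insensitive(np.array([1.0, 1.0]), np.array([1.05, 1.5])))
# -> [0.  0.4]  (the first residual, 0.05 < epsilon, costs nothing)
```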

Its dual is: $$\begin{align}\begin{aligned} & \min_{\vec{\alpha}, \vec{\alpha}^*} \frac{1}{2} (\vec{\alpha} - \vec{\alpha}^*)^T \mathbf{Q} (\vec{\alpha} - \vec{\alpha}^*) + \varepsilon \vec{e}^T (\vec{\alpha} + \vec{\alpha}^*) - \vec{y}^T (\vec{\alpha} - \vec{\alpha}^*)\\ & \begin{split} \textrm {subject to } & \vec{e}^T (\vec{\alpha} - \vec{\alpha}^*) = 0\\ & 0 \leq \alpha_i, \alpha_i^* \leq C, i=1, ..., n\end{split}\end{aligned}\end{align}$$ where $\vec{e}$ is the vector of all ones, $C>0$ is the upper bound, $\mathbf{Q}$ is an $n$ by $n$ positive semidefinite matrix, and $\mathbf{Q}_{ij} \equiv K(\vec{x}_i, \vec{x}_j) = \phi (\vec{x}_i)^T \phi (\vec{x}_j)$ is the kernel. Here the training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.

The decision function is: $$\sum_{i=1}^n (\alpha_i - \alpha_i^*) K(\vec{x}_i, \vec{x}) + \rho$$ These parameters can be accessed through the attributes dual_coef_, which holds the differences $\alpha_i - \alpha_i^*$, support_vectors_, which holds the support vectors, and intercept_, which holds the independent term $\rho$.
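Analogously to the SVC case, the sketch below (my own example; the toy dataset and the gamma, C and epsilon values are arbitrary assumptions) rebuilds the SVR prediction from dual_coef_, support_vectors_ and intercept_ and compares it with predict:

```python
# Reconstruct sum_i (alpha_i - alpha_i^*) K(x_i, x) + rho from the fitted
# attributes of an RBF SVR and compare it with predict (sketch).
import numpy as np
from sklearn.svm import SVR
from sklearn.datasets import make_regression
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
reg = SVR(kernel="rbf", gamma=0.5, C=1.0, epsilon=0.1).fit(X, y)

# dual_coef_ holds alpha_i - alpha_i^* for each support vector.
K = rbf_kernel(reg.support_vectors_, X, gamma=0.5)
pred = (reg.dual_coef_ @ K + reg.intercept_).ravel()

assert np.allclose(pred, reg.predict(X))
```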

References:

  1. sklearn documentation, section 1.4.7 "Mathematical formulation".
  2. “Automatic Capacity Tuning of Very Large VC-dimension Classifiers”, I. Guyon, B. Boser, V. Vapnik - Advances in neural information processing 1993.
  3. “Support-vector networks”, C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).
  4. “A Tutorial on Support Vector Regression”, Alex J. Smola, Bernhard Schölkopf - Statistics and Computing archive Volume 14 Issue 3, August 2004, p. 199-222.