Mathematical Principles of Support Vector Machines (SVM)
A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
1 SVC
1.1 $\text{n_classes}=2$
For a data feature matrix $\mathbf{X}_{n \times m}$ and a binary target vector $\vec{y}$:
Given training vectors $\vec{x}_i \in \mathbb{R}^p$, $i=1,2,\ldots,n$, in two classes (for clarity: $\vec{x}_i$ is in fact the $i$-th row of the data matrix $\mathbf{X}$), and a vector $\vec{y} \in \{1,-1\}^n$ ($\vec{y}$ is the binary target vector and $y_i$ is its $i$-th entry), SVC solves the following primal problem:
$$\begin{align} \begin{aligned} & \min_ {\vec{w}, b, \vec{\zeta}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_i \\
& \begin{split}\textrm {subject to } & y_i (\vec{w}^T \phi (\vec{x}_i) + b) \geq 1 - \zeta_i,\\
& \zeta_i \geq 0, \\
& \zeta_i \text{ is the i-th element of vector $\vec{\zeta}$} \\
& i=1, ..., n\end{split}\end{aligned}\end{align}$$
The optimization above determines the support-vector parameters: $\vec{w}$, i.e. $w_1, w_2, \ldots, w_m$; $b$; and $\vec{\zeta}$, i.e. $\zeta_1, \zeta_2, \ldots, \zeta_n$.
This optimization maximizes the margin between the support vectors. Note that, because a kernel is used, $\vec{x}_i$ is transformed into $\phi(\vec{x}_i)$. See Support-Vector Networks for further details.
Its dual is:
$$\begin{align}\begin{aligned} & \min_{\vec{\alpha}} \frac{1}{2} \vec{\alpha}^T \mathbf{Q} \vec{\alpha} - \vec{e}^T \vec{\alpha}\\
& \begin{split}
\textrm {subject to } & \vec{y}^T \vec{\alpha} = 0\\
& 0 \leq \alpha_i \leq C, \\
& i=1, ..., n\end{split}\end{aligned}\end{align}$$
The optimization above determines the kernel-related parameters $\vec{\alpha}$, i.e. $\alpha_1, \alpha_2, \ldots, \alpha_n$.
Equations 40-48 in the Support-Vector Networks paper give the detailed derivation. Again, because a kernel is used, $\vec{x}_i$ is transformed into $\phi(\vec{x}_i)$.
where $\vec{e}$ is the vector of all ones, $C > 0$ is the upper bound, $\mathbf{Q}$ is an $n$ by $n$ positive semidefinite matrix, $\mathbf{Q}_{ij} \equiv y_i y_j K(\vec{x}_i, \vec{x}_j)$, where $K(\vec{x}_i, \vec{x}_j) = \phi (\vec{x}_i)^T \phi (\vec{x}_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.
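As a concrete illustration, here is a minimal sketch (my own example; the synthetic dataset, the RBF kernel and `gamma=0.5` are assumptions, not from the original text) of how the dual-problem matrix $\mathbf{Q}$ can be built numerically:

```python
# Minimal sketch: building Q_ij = y_i * y_j * K(x_i, x_j) with an RBF kernel.
# The dataset and gamma value are arbitrary choices for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_classification(n_samples=20, n_features=4, random_state=0)
y = np.where(y == 0, -1, 1)        # encode the two classes as {-1, +1}

K = rbf_kernel(X, X, gamma=0.5)    # K_ij = phi(x_i)^T phi(x_j)
Q = np.outer(y, y) * K             # Q_ij = y_i * y_j * K(x_i, x_j)

# Q is symmetric and positive semidefinite (up to numerical error).
print(Q.shape, np.linalg.eigvalsh(Q).min() >= -1e-8)
```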
The decision function is:
$$\operatorname{sgn}(\sum_{i=1}^n y_i \alpha_i K(\vec{x}_i, \vec{x}) + \rho)$$
The decision function determines the value of $y$; since this is classification, it determines the sign of $y$. As for $\rho$, exposed in sklearn as the fitted attribute `intercept_`: from the sklearn source, when the number of target classes $\text{n_classes}=2$, $shape(\rho)=[1]$; when $\text{n_classes} \ne 2$, $shape(\rho)=[\text{n_classes}]$. (The `coef0` parameter of `sklearn.svm.SVC` is a different quantity: it is the independent term of the polynomial and sigmoid kernels, not $\rho$.)
Of the three formulas above, formula 1 optimizes $\vec{w}$, $b$ and $\vec{\zeta}$, which determine the support vectors; formulas 2 and 3 optimize the kernel-related parameters $\vec{\alpha}$, which are used to predict data points outside the training set; the optimized $\vec{\alpha}$ plays no part in determining the support vectors. In formula 3, since $\vec{x}_i$ is known, $K(\vec{x}_i, \vec{x})$ is a surface in the $m$-dimensional space, and the height of the surface is $K(\vec{x}_i, \vec{x})$. In formula 1, $\phi(\vec{x}_i)$ maps $\vec{x}_i$ into a space $\mathcal{V}$; applying the kernel trick turns the SVM into a nonlinear model. For kernel functions, see my post 核方法kernel_method.
See equation 33 in the Support-Vector Networks paper.
- Note: While SVM models derived from `libsvm` and `liblinear` use C as the regularization parameter, most other estimators use alpha. The exact equivalence between the amount of regularization of two models depends on the exact objective function optimized by the model. For example, when the estimator used is `sklearn.linear_model.Ridge` regression, the relation between them is given as $C = \frac{1}{alpha}$.
These parameters can be accessed through the members `dual_coef_`, which holds the product $y_i \alpha_i$, `support_vectors_`, which holds the support vectors, and `intercept_`, which holds the independent term $\rho$.
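To make the relationship concrete, here is a small sketch (my own example; the synthetic dataset and the `gamma`/`C` values are assumptions) that fits a binary `SVC` and rebuilds $\sum_i y_i \alpha_i K(\vec{x}_i, \vec{x}) + \rho$ directly from `dual_coef_`, `support_vectors_` and `intercept_`:

```python
# Sketch: reproduce the SVC decision function from the fitted attributes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = SVC(kernel="rbf", gamma=0.2, C=1.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i, one entry per support vector.
K = rbf_kernel(X, clf.support_vectors_, gamma=0.2)      # K(x, x_i)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_    # sum_i y_i alpha_i K(x_i, x) + rho
print(np.allclose(manual, clf.decision_function(X)))    # expected: True
```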
1.2 $\text{n_classes} > 2$
(the author's own derivation)
For a multiclass target vector $\vec{y}$ (write $p = \text{n_classes}$), encoding each $y_i$ turns $\vec{y}$ into a matrix $\mathbf{Y}_{n \times p}$.
The three formulas above are then rewritten accordingly as:
$$\begin{align}\begin{aligned} & \min_ {\vec{w}, b, \mathbf{Z}_{n \times p}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_{i1} \\
& \min_ {\vec{w}, b, \mathbf{Z}_{n \times p}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_{i2} \\
& \vdots \\
& \min_ {\vec{w}, b, \mathbf{Z}_{n \times p}}\frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \zeta_{ip} \\
& \begin{split}\textrm {subject to } & y_{ik} (\vec{w}^T \phi (\vec{x}_i) + b) \geq 1 - \zeta_{ik},\\
& \zeta_{ik} \geq 0, \\
& \zeta_{ik} \text{ is the element in row i, column k of $\mathbf{Z}_{n \times p}$,} \\
& y_{ik} \text{ is the element in row i, column k of $\mathbf{Y}_{n \times p}$,} \\
& i=1, ..., n, \\
& k=1, ..., p\end{split}\end{aligned}\end{align}$$
The above determines the support-vector parameters $\vec{w}$, i.e. $\{w_1, w_2, \ldots, w_m\}$, $b$, and $\mathbf{Z}_{n \times p}$, i.e. $\{\zeta_{11}, \zeta_{21}, \ldots, \zeta_{n1}, \quad \zeta_{12}, \zeta_{22}, \ldots, \zeta_{n2}, \quad \ldots, \quad \zeta_{1p}, \zeta_{2p}, \ldots, \zeta_{np}\}$.
$$\begin{align}\begin{aligned} & \min_{\mathbf{A}_{n \times p}} \frac{1}{2} \vec{\alpha}_{1}^T \mathbf{Q} \vec{\alpha}_{1} - \vec{e}^T \vec{\alpha}_{1}\\
& \min_{\mathbf{A}_{n \times p}} \frac{1}{2} \vec{\alpha}_{2}^T \mathbf{Q} \vec{\alpha}_{2} - \vec{e}^T \vec{\alpha}_{2}\\
& \vdots \\
& \min_{\mathbf{A}_{n \times p}} \frac{1}{2} \vec{\alpha}_{p}^T \mathbf{Q} \vec{\alpha}_{p} - \vec{e}^T \vec{\alpha}_{p}\\
& \begin{split}
\textrm {subject to } & \vec{y}_{k}^T \vec{\alpha}_k = 0\\
& \vec{y}_k \text{ is column k of $\mathbf{Y}_{n \times p}$,} \\
& \vec{\alpha}_k \text{ is column k of $\mathbf{A}_{n \times p}$,} \\
& 0 \leq \alpha_{ik} \leq C, \\
& i=1, ..., n \\
& k=1, ..., p \end{split}\end{aligned}\end{align}$$
The above determines the kernel-related parameters $\mathbf{A}_{n \times p}$, i.e. $\{\alpha_{11}, \alpha_{21}, \ldots, \alpha_{n1}, \quad \alpha_{12}, \alpha_{22}, \ldots, \alpha_{n2}, \quad \ldots, \quad \alpha_{1p}, \alpha_{2p}, \ldots, \alpha_{np}\}$.
$$\operatorname{sgn}(\sum_{i=1}^n y_{i1} \alpha_{i1} K(\vec{x}_i, \vec{x}) + \rho_1) \\
\operatorname{sgn}(\sum_{i=1}^n y_{i2} \alpha_{i2} K(\vec{x}_i, \vec{x}) + \rho_2) \\
\vdots \\
\operatorname{sgn}(\sum_{i=1}^n y_{ip} \alpha_{ip} K(\vec{x}_i, \vec{x}) + \rho_p) \\
$$
These decision functions determine the value of $y$; since this is classification, they determine the sign of $y$. As for $\vec{\rho}$, i.e. `intercept_`: from the sklearn source, when the number of target classes $\text{n_classes}=2$, $shape(\rho)=[1]$; when $\text{n_classes} \ne 2$, $shape(\rho)=[\text{n_classes}]$.
The three multiclass formulas above are my own attempt and may well be wrong: the columns of $\mathbf{Y}$ are not independent, yet the formulas treat them as completely separate. I keep them here for now, since they still make the idea easier to grasp.
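Regardless of how the formulas are written, the shapes sklearn actually produces for a multiclass problem are easy to check empirically. A quick sketch (my own example; the iris dataset is an assumption); note that `SVC` trains one-vs-one sub-problems internally:

```python
# Sketch: inspect the multiclass attribute shapes of a fitted SVC.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # n_classes = 3
clf = SVC(kernel="rbf").fit(X, y)

print(clf.intercept_.shape)         # one rho per one-vs-one sub-problem
print(clf.dual_coef_.shape)         # (n_classes - 1, n_support_vectors)
print(clf.support_vectors_.shape)   # (n_support_vectors, n_features)
```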
2 NuSVC
We introduce a new parameter $\nu$ which controls the number of support vectors and the number of training errors. The parameter $\nu \in (0, 1]$ is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
It can be shown that the $\nu$-SVC formulation is a reparameterization of the $C$-SVC and is therefore mathematically equivalent (judging from the two value ranges, the relationship is presumably reciprocal).
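A small sketch (my own example; the synthetic dataset and the `nu` grid are assumptions) showing the bound in practice: the fraction of support vectors should not fall below `nu`:

```python
# Sketch: nu acts as a lower bound on the fraction of support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
    frac_sv = clf.support_vectors_.shape[0] / len(X)
    print(f"nu={nu:.1f}  fraction of support vectors={frac_sv:.2f}")
```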
3 SVR
Given training vectors $\vec{x}_i \in \mathbb{R}^p$, $i=1,2,\ldots,n$, and a vector $\vec{y} \in \mathbb{R}^n$ ($\vec{y} \in \mathbb{R}^n$ because the training data has $n$ rows), $\varepsilon$-SVR solves the following primal problem:
$$\begin{align}\begin{aligned} & \min_ {\vec{w}, b, \vec{\zeta}, \vec{\zeta}^*} \frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)\\ & \begin{split}\textrm {subject to } & y_i - \vec{w}^T \phi (\vec{x}_i) - b \leq \varepsilon + \zeta_i,\\
& \vec{w}^T \phi (\vec{x}_i) + b - y_i \leq \varepsilon + \zeta_i^*,\\
& \zeta_i, \zeta_i^* \geq 0, \\
& i=1, ..., n\end{split}\end{aligned}\end{align}$$
The difference between the model's prediction $\vec{w}^T \phi (\vec{x}_i) + b$ on the training data and the true value $y_i$ is therefore confined to the interval $[-(\varepsilon + \zeta_i), \varepsilon + \zeta_i^*]$. The slacks $\zeta_i$ and $\zeta_i^*$ are penalized so that they stay as small as possible, whereas $\varepsilon$, according to the sklearn documentation, is a hyperparameter: it cannot be learned by the estimator and can only be tuned, e.g. by grid search. Per the documentation, the loss function gives no penalty when the difference between the prediction and the true value is smaller than $\varepsilon$; as the formulas above show, $\varepsilon$ itself is not constrained.
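Since $\varepsilon$ has to be tuned rather than learned, a grid search is the usual approach. A minimal sketch (my own example; the synthetic data and the candidate values are assumptions):

```python
# Sketch: tuning epsilon of an SVR by cross-validated grid search.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
search = GridSearchCV(
    SVR(kernel="rbf", C=1.0),
    param_grid={"epsilon": [0.01, 0.1, 0.5, 1.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```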
Its dual is:
$$\begin{align}\begin{aligned} & \min_{\vec{\alpha}, \vec{\alpha}^*} \frac{1}{2} (\vec{\alpha} - \vec{\alpha}^*)^T \mathbf{Q} (\vec{\alpha} - \vec{\alpha}^*) + \varepsilon \vec{e}^T (\vec{\alpha} + \vec{\alpha}^*) - \vec{y}^T (\vec{\alpha} - \vec{\alpha}^*)\\
& \begin{split}
\textrm {subject to } & \vec{e}^T (\vec{\alpha} - \vec{\alpha}^*) = 0\\
& 0 \leq \alpha_i, \alpha_i^* \leq C, i=1, ..., n\end{split}\end{aligned}\end{align}$$
where $\vec{e}$ is the vector of all ones, $C > 0$ is the upper bound, $\mathbf{Q}$ is an $n$ by $n$ positive semidefinite matrix, and $\mathbf{Q}_{ij} \equiv K(\vec{x}_i, \vec{x}_j) = \phi (\vec{x}_i)^T \phi (\vec{x}_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.
The decision function is:
$$\sum_{i=1}^n (\alpha_i - \alpha_i^*) K(\vec{x}_i, \vec{x}) + \rho$$
These parameters can be accessed through the members `dual_coef_`, which holds the difference $\alpha_i - \alpha_i^*$, `support_vectors_`, which holds the support vectors, and `intercept_`, which holds the independent term $\rho$.
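Analogously to the SVC case, here is a small sketch (my own example; the dataset and kernel settings are assumptions) that rebuilds $\sum_i (\alpha_i - \alpha_i^*) K(\vec{x}_i, \vec{x}) + \rho$ from the fitted attributes:

```python
# Sketch: reproduce SVR predictions from dual_coef_, support_vectors_, intercept_.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=4, noise=2.0, random_state=0)
reg = SVR(kernel="rbf", gamma=0.3, C=10.0, epsilon=0.1).fit(X, y)

# dual_coef_ holds alpha_i - alpha_i^*, one entry per support vector.
K = rbf_kernel(X, reg.support_vectors_, gamma=0.3)
manual = K @ reg.dual_coef_.ravel() + reg.intercept_
print(np.allclose(manual, reg.predict(X)))   # expected: True
```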
References:
- scikit-learn documentation, 1.4.7. Mathematical formulation;
- "Automatic Capacity Tuning of Very Large VC-dimension Classifiers", I. Guyon, B. Boser, V. Vapnik - Advances in Neural Information Processing, 1993;
- "Support-vector networks", C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995);
- "A Tutorial on Support Vector Regression", Alex J. Smola, Bernhard Schölkopf - Statistics and Computing archive, Volume 14, Issue 3, August 2004, p. 199-222.