Naive Bayes (naive_bayes)

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes' theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_m$:

$$P(y \mid x_1, x_2, \dots, x_m) = \frac{P(y) P(x_1, x_2, \dots, x_m \mid y)} {P(x_1, x_2, \dots, x_m)} \tag{1}$$

Here $x_1, \ldots, x_m$ is a single row of data; these $m$ values form the row vector $[x_1, x_2, \ldots, x_m]$.

Using the naive conditional independence assumption that

$$P(x_1, x_2, \dots, x_m \mid y) = \prod_{i=1}^{m}{P(x_i \mid y)} \tag{2}$$

for all $i$, this relationship is simplified to

$$P(y \mid x_1, x_2, \dots, x_m) = \frac{P(y) \prod_{i=1}^{m} P(x_i \mid y)} {P(x_1, x_2, \dots, x_m)} \tag{3}$$

Since $P(x_1, x_2, \dots, x_m)$ is constant given the input, we can use the following classification rule:

$$\begin{align}\begin{aligned}P(y \mid x_1, x_2, \dots, x_m) \propto P(y) \prod_{i=1}^{m} P(x_i \mid y)\\\Downarrow\\\hat{y} = \arg\max_y P(y) \prod_{i=1}^{m} P(x_i \mid y),\end{aligned}\end{align} \tag{4}$$

and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
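
The product in Eq. (4) is usually evaluated in log space to avoid numerical underflow. Below is a minimal sketch of that MAP decision rule for a single sample; the priors, the conditional probabilities, and the binary feature encoding are all made-up assumptions for illustration only:

```python
import numpy as np

# Minimal sketch of the MAP decision rule in Eq. (4), evaluated in log space
# to avoid underflow. The priors and conditional probabilities are made-up
# numbers; binary features are used purely for brevity.
prior = np.array([0.6, 0.4])                    # P(y) for two classes
p1 = np.array([[0.8, 0.1, 0.5],                 # P(x_i = 1 | y), shape (n_classes, n_features)
               [0.3, 0.7, 0.5]])

x = np.array([1, 0, 1])                         # one sample of binary features

# Score = log P(y) + sum_i log P(x_i | y); proportional to log P(y | x) up to a constant.
scores = np.log(prior) + (x * np.log(p1) + (1 - x) * np.log(1 - p1)).sum(axis=1)
y_hat = np.argmax(scores)
print(y_hat, scores)
```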

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.

In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see reference 2 below.)

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

1 Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right) \tag{5}$$

The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.
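
As a quick usage sketch (not part of the original text), GaussianNB fits like any other scikit-learn classifier; the iris data and the 50/50 train/test split below are just illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Fit GaussianNB on the iris dataset and count the misclassified test points.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Mislabeled points out of %d: %d" % (X_test.shape[0], (y_test != y_pred).sum()))
```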

2 Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{ym})$ for each class $y$, where $m$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.

The parameters $\theta_{y}$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha m} \tag{6}$$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_{y} = \sum_{i=1}^{m} N_{yi}$ is the total count of all features for class $y$ (in other words, over all samples whose target class is $y$, the counts of every feature added together; same meaning, just phrased in a way that is easier to follow).

The smoothing priors $\alpha \ge 0$ account for features not present in the learning samples and prevent zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
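
As a brief usage sketch (not part of the original text; the random count data is illustrative), MultinomialNB exposes the smoothing prior through its alpha parameter:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Illustrative count data: 6 samples, 10 count-valued features, 3 classes.
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 10))
y = np.array([1, 2, 3, 1, 2, 3])

# alpha=1.0 gives Laplace smoothing; 0 < alpha < 1 would be Lidstone smoothing.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)
print(clf.predict(X[2:3]))
```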

【 Here $m$ is the number of feature columns. For the moment it is not obvious why the $\alpha$ in the denominator has to be multiplied by the number of feature columns, but checking the sklearn documentation and the Additive smoothing entry confirms that the denominator does indeed use $\alpha$ times the number of feature columns.

The Wikipedia article on Additive smoothing defines it as follows:

In statistics, additive smoothing, also called Laplace smoothing (not to be confused with Laplacian smoothing as used in image processing), or Lidstone smoothing, is a technique used to smooth categorical data. Given an observation $\mathbf {x} = \left\langle x_{1},\,x_{2},\,\ldots ,\,x_{m}\right\rangle$ from a multinomial distribution with $n$ trials, a "smoothed" version of the data gives the estimator:

$${\hat {\theta }}_{i}={\frac {n_{x_{i}}+\alpha }{n+\alpha m}}\qquad (i=1,\ldots ,m) \tag{7}$$

where the "pseudocount" $\alpha > 0$ is a smoothing parameter. $\alpha = 0$ corresponds to no smoothing. Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical probability (relative frequency) $\frac {n_{x_{i}}}{n}$, and the uniform probability $\frac {1}{m}$. Invoking Laplace's rule of succession, some authors have argued that $\alpha$ should be 1 (in which case the term add-one smoothing is also used), though in practice a smaller value is typically chosen.

Below is a proof that the resulting estimate lies between the empirical probability (relative frequency) $\frac {n_{x_{i}}}{n}$ and the uniform probability $\frac {1}{m}$:

$$\begin{align} &(\frac{x_1+x_2}{y_1+y_2}-\frac{x_1}{y_1})\cdot(\frac{x_2}{y_2}-\frac{x_1+x_2}{y_1+y_2}) \\ &=\frac{(x_1+x_2)y_1-x_1(y_1+y_2)}{(y_1+y_2)y_1}\cdot\frac{x_2(y_1+y_2)-(x_1+x_2)y_2}{(y_1+y_2)y_2} \\ &=\frac{(x_2y_1-x_1y_2)^2}{(y_1+y_2)^2y_1y_2} \end{align}$$

When $y_1 y_2 > 0$ this expression is nonnegative, so $\frac{x_1+x_2}{y_1+y_2}$ lies between $\frac{x_1}{y_1}$ and $\frac{x_2}{y_2}$. Taking $x_1 = n_{x_i}$, $y_1 = n$, $x_2 = \alpha$, $y_2 = \alpha m$, so that $\frac{x_1}{y_1}$ is the empirical probability, $\frac{x_2}{y_2} = \frac{1}{m}$ is the uniform probability, and $\frac{x_1+x_2}{y_1+y_2}$ is the smoothed estimate, and noting that $n$ and $\alpha m$ are both positive, the resulting estimate indeed lies between the empirical probability (relative frequency) $\frac {n_{x_{i}}}{n}$ and the uniform probability $\frac {1}{m}$.

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter $\alpha$ as a prior distribution. In the special case where the number of categories is 2, this is equivalent to using a Beta distribution as the conjugate prior for the parameters of Binomial distribution.

In the Wikipedia excerpt above, the notation has been adjusted, both to keep the narrative consistent and because a small error was found in the symbols.

Because Eq. (6) computes the additive-smoothing estimate of $P(x_i \mid y)$ rather than of $P(x_i)$, the $n_{x_{i}}$ and $n$ of Eq. (7) become $N_{yi}$ and $N_{y}$ there. Note that $n_{x_{i}}$, $n$, $N_{yi}$ and $N_{y}$ are all counts; do not be misled by the $x$ and $y$ in the subscripts. 】
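
As a quick numeric check of Eq. (7) with made-up counts (not from the original): every smoothed value lands between the empirical and uniform probabilities, and the $\alpha m$ in the denominator matches the $\alpha$ added to each of the $m$ numerators, which keeps the smoothed estimates summing to 1:

```python
import numpy as np

# Numeric check of the additive-smoothing estimator in Eq. (7), using made-up counts.
counts = np.array([5, 0, 3, 2])          # n_xi for m = 4 categories
n, m, alpha = counts.sum(), counts.size, 0.5

empirical = counts / n                   # n_xi / n
uniform = 1.0 / m                        # 1 / m
smoothed = (counts + alpha) / (n + alpha * m)

# Shrinkage: each smoothed value sits between the empirical and uniform probability.
for e, s in zip(empirical, smoothed):
    lo, hi = sorted((e, uniform))
    assert lo <= s <= hi

# sum(counts + alpha) == n + alpha * m, so the smoothed estimates still sum to 1.
print(smoothed, smoothed.sum())
```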

3 Complement Naive Bayes

ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model's weights. The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB. Further, CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks. The procedure for calculating the weights is as follows:

$$\begin{align}\begin{aligned}\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}} {\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}\\w_{ci} = \log \hat{\theta}_{ci}\\w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}\end{aligned}\end{align} \tag{8}$$

The three lines above are three steps: the first computes $\hat{\theta}_{ci}$, the second takes its logarithm to obtain $w_{ci}$, and the third normalizes the weights by dividing by $\sum_{j} |w_{cj}|$. Note that the roles of $i$ and $j$ here are the opposite of the usual machine-learning convention ($i$ indexes terms and $j$ indexes documents); do not let that confuse you.

where the summations are over all documents $j$ not in class $c$, $d_{ij}$ is either the count or tf-idf value of term $i$ in document $j$, $\alpha_i$ is a smoothing hyperparameter like that found in MNB, and $\alpha = \sum_{i} \alpha_i$. The second normalization addresses the tendency for longer documents to dominate parameter estimates in MNB. The classification rule is:

$$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci} \tag{9}$$

This is essentially Eq. (4), except that because complement naive Bayes works with the complement of each class, we want the probability that a sample matches the complement of the correct class (i.e. the wrong classes) to be as small as possible; minimizing that complement score is what places the sample in its correct class.

i.e., a document is assigned to the class that is the poorest complement match.
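
A minimal usage sketch (not from the original); the toy documents, labels, and the CountVectorizer pipeline below are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus with a tiny two-class labeling.
docs = ["free money offer now", "meeting agenda attached",
        "win a free prize today", "project status update attached"]
labels = ["spam", "ham", "spam", "ham"]

# Word-count vectors feeding ComplementNB, as described above.
clf = make_pipeline(CountVectorizer(), ComplementNB(alpha=1.0))
clf.fit(docs, labels)
print(clf.predict(["free prize offer", "status of the project"]))
```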

The mechanics of the algorithm are now more or less clear, but why complement naive Bayes handles imbalanced data sets well is still not clear; when time allows, read Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive bayes text classifiers. In ICML (Vol. 3, pp. 616-623).

4 Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

The decision rule for Bernoulli naive Bayes is based on:

$$P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i) \tag{10}$$

which differs from multinomial NB's rule in that it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$, whereas the multinomial variant (MultinomialNB) would simply ignore a non-occurring feature.

In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents. It is advisable to evaluate both models (BernoulliNB and MultinomialNB), if time permits.
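
A short sketch of BernoulliNB (not from the original); the random count data is illustrative, and binarize=0.0 is spelled out to show where count features become occurrence indicators:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Illustrative data: 6 samples with 4 count-valued features and two classes.
rng = np.random.RandomState(0)
X = rng.randint(5, size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])

# binarize=0.0 maps every non-zero count to 1 (word occurrence rather than
# word count) before fitting the multivariate Bernoulli model.
clf = BernoulliNB(binarize=0.0)
clf.fit(X, y)
print(clf.predict(X[2:3]))
```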

5 Out-of-core naive Bayes model fitting

Naive Bayes models can be used to tackle large scale classification problems for which the full training set might not fit in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit method that can be used incrementally as done with other classifiers as demonstrated in Out-of-core classification of text documents. All naive Bayes classifiers support sample weighting.

Contrary to the fit method, the first call to partial_fit needs to be passed the list of all the expected class labels.
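
A minimal sketch of incremental fitting (not from the original); the batches of count data below are made up:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

classes = np.array([0, 1, 2])            # every label the model will ever see
clf = MultinomialNB()

batches = [
    (np.array([[1, 0, 2], [0, 3, 1]]), np.array([0, 1])),
    (np.array([[2, 2, 0], [0, 1, 4]]), np.array([2, 1])),
]

for i, (X_batch, y_batch) in enumerate(batches):
    if i == 0:
        # The first call must be given the full list of expected class labels.
        clf.partial_fit(X_batch, y_batch, classes=classes)
    else:
        clf.partial_fit(X_batch, y_batch)

print(clf.predict(np.array([[1, 0, 3]])))
```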

For an overview of available strategies in scikit-learn, see also the out-of-core learning documentation.

References:

  1. scikit-learn documentation, 1.9. Naive Bayes
  2. H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS.
  3. Wikipedia, Laplacian smoothing
  4. Wikipedia, Additive smoothing
  5. Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive bayes text classifiers. In ICML (Vol. 3, pp. 616-623).
  6. Out-of-core classification of text documents
  7. out-of-core learning