# KeepNotes blog

Stay hungry, Stay Foolish.


## 4 Applications of Naive Bayes Algorithms

• Real-time prediction: Naive Bayes is an eager learning classifier, and it is fast, so it can be used to make predictions in real time.
• Multi-class prediction: the algorithm is also well known for multi-class prediction; it can estimate the probability of each class of the target variable.
• Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text classification (thanks to good results on multi-class problems and the independence assumption) and often achieve a higher success rate than other algorithms. As a result, Naive Bayes is popular for spam filtering (identifying spam e-mail) and sentiment analysis (e.g. identifying positive and negative customer sentiment in social-media analysis).
• Recommendation systems: a Naive Bayes classifier combined with collaborative filtering builds a recommendation system that uses machine learning and data-mining techniques to filter unseen information and predict whether a user would like a given resource.
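The text-classification use case above can be sketched in a few lines with scikit-learn; the toy messages and labels below are made up for illustration, and a real spam filter would train on far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: two spam and two ham messages
messages = [
    "win a free prize now", "free cash claim now",      # spam
    "meeting at noon today", "see you at the office",   # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(messages, labels)

print(clf.predict(["claim your free prize"]))   # expected: ['spam']
print(clf.predict(["office meeting today"]))    # expected: ['ham']
```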

Naive Bayes comes in three common variants:

• Multinomial model: the most common variant, generally used when the feature values are discrete. Laplace smoothing (a Bayesian estimate) is applied, because plain maximum-likelihood estimation can yield estimated probabilities of exactly 0.
• Gaussian model: generally used when the feature values are continuous, such as height or weight. It assumes each feature dimension follows a Gaussian distribution, so the mean and variance must be computed for each class.
• Bernoulli model: also used for discrete feature values, but unlike the multinomial model the features are boolean (0 or 1), so an extra binarization step is required.
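The three variants correspond directly to classes in scikit-learn; a minimal sketch on made-up toy data (all feature values here are invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Multinomial: discrete counts (e.g. word frequencies); alpha is the Laplace smoothing constant
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 1, 4]])
print(MultinomialNB(alpha=1.0).fit(X_counts, y).predict([[1, 0, 0]]))  # expected: [0]

# Gaussian: continuous features (e.g. height, weight); fits a mean and variance per class
X_cont = np.array([[170.0, 60.0], [165.0, 55.0], [180.0, 80.0], [185.0, 85.0]])
print(GaussianNB().fit(X_cont, y).predict([[168.0, 58.0]]))  # expected: [0]

# Bernoulli: boolean features; binarize=0.0 thresholds the inputs to 0/1 first
X_bool = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 1]])
print(BernoulliNB(binarize=0.0).fit(X_bool, y).predict([[1, 0, 1]]))  # expected: [0]
```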

```python
class MultinomialNB:
    """
    fit parameters:
        X     training feature matrix
        y     training labels
        alpha positive smoothing constant λ for the Bayesian estimate
    predict parameters:
        test  a single test sample (1-D feature vector)
    """
    def fit(self, X, y, alpha=0):
        # Group the samples by class label
        feature_data = defaultdict(list)
        label_data = defaultdict(int)
        for feature, lab in zip(X, y):
            feature_data[lab].append(feature)
            label_data[lab] += 1

        # Prior probabilities (smoothed when alpha > 0)
        self.label = y
        self.pri_p_label = {
            k: (v + alpha) / (len(self.label) + len(np.unique(self.label)) * alpha)
            for k, v in label_data.items()}

        # Conditional probabilities for each (dimension, value) pair per class
        X = np.asarray(X)
        self.cond_p_feature = defaultdict(dict)
        for i, sub in feature_data.items():
            sub = np.array(sub)
            for f_dim in range(sub.shape[1]):
                for feature in np.unique(X[:, f_dim]):
                    self.cond_p_feature[i][(f_dim, feature)] = (
                        (np.sum(sub[:, f_dim] == feature) + alpha)
                        / (sub.shape[0] + len(np.unique(X[:, f_dim])) * alpha))

    def predict(self, test):
        p_data = {}
        for sub_label in np.unique(self.label):
            # Sum log-probabilities to avoid floating-point underflow
            log_p = np.log(self.pri_p_label[sub_label])
            for i in range(len(test)):
                p = self.cond_p_feature[sub_label].get((i, test[i]))
                if p:  # unseen (dimension, value) pairs are skipped
                    log_p += np.log(p)
            p_data[sub_label] = log_p
        opt_label = max(p_data, key=p_data.get)
        # The returned score is the unnormalized log-probability of the chosen class
        return [opt_label, p_data[opt_label]]
```

```python
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.model_selection import train_test_split

# `dataset` is assumed to be loaded beforehand; column 0 holds the label
# and the remaining columns hold the feature values
dataset = np.array(dataset)
# Binarize the features: any non-zero value becomes 1
dataset[:, 1:][dataset[:, 1:] != 0] = 1
label = dataset[:, 0]
# Split into training and test sets
train_dat, test_dat, train_label, test_label = train_test_split(
    dataset[:, 1:], label, test_size=0.2, random_state=123456)
# Build the NB model
model = MultinomialNB()
model.fit(X=train_dat, y=train_label, alpha=1)
# Predict on the test set with the NB model
pl = {}
for i, test in enumerate(test_dat):
    pl[i] = model.predict(test=test)
# Print the test error rate (%)
error = 0
for k, v in pl.items():
    if test_label[k] != v[0]:
        error += 1
print(error / len(test_label) * 100)
```