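Notes on the Kaggle Machine Learning Micro-Course

While browsing Kaggle recently I came across a machine learning micro-course. It gives a concise, easy-to-follow first look at what machine learning is and how to apply its algorithms to real problems. Every lesson comes with an exercise that runs on Kernels (Kernels: Explore and run machine learning code with Kaggle Kernels, a cloud computational environment that enables reproducible and collaborative analysis). The exercises are templated Jupyter notebooks, nicely interactive and fun to work through. Everything below is my notes on that course: https://www.kaggle.com/learn/machine-learning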

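Scenario:

Your cousin has made millions of dollars speculating on real estate. Because of your interest in data science, he has offered to become your business partner: he supplies the money, and you supply models that predict how much various houses are worth.

We start with the decision tree model. Other models can give more accurate predictions, but decision trees are easier to understand, and they are the basic building block of some of the best models in data science, such as random forests.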

Explore Your Data

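The first step in any machine learning project is to get to know the data: look at it and work out what each row and column represents. This uses the pandas package. The test dataset is at https://www.kaggle.com/dansbecker/how-models-work/data

Read the data in and look at summary statistics for each numeric column: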

import pandas as pd

melbourne_data = pd.read_csv("./melbourne-housing-snapshot/melb_data.csv")
melbourne_data.describe()

Selecting Data for Modeling

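Pick a few features (columns) to build the decision tree on. The dataset has missing values; for now, simply drop every row that contains one (with dropna). Then take the prediction target Price as the model output y, and select the chosen columns as the training data X.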

melbourne_data = melbourne_data.dropna(axis=0)

y = melbourne_data.Price
melbourne_data.columns
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.head()

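Next, use Python's scikit-learn library to build the model. There is no need to implement each machine learning algorithm yourself: the estimators are already there with all their parameters exposed, so the remaining work is tuning them, which is very convenient (more than you'll want or need for a long time). Building a model generally follows these steps:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like.
Evaluate: Determine how accurate the model's predictions are.

As a first try, fit a regression tree (DecisionTreeRegressor) on X and y and look at its predictions for the first 5 training samples: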

from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=12345)
melbourne_model.fit(X, y)
print(melbourne_model.predict(X.head()))

Model Validation

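Once the model is built it has to be evaluated. There are many ways to do this. A common first check is to compare the predictions against the true values in the training set (here, the model's predicted prices against the actual prices in the dataset) and summarize the gap as the Mean Absolute Error (MAE), which the mean_absolute_error function computes. Continuing from the code above: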

from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
# 1115.7

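The weakness of this approach is that there is only a training set and no validation set. The MAE on the training data alone tells us very little, because it gives no assurance that the model will predict well on new data. The simple remedy is to split the dataset into two parts, a training set and a validation set (a hold-out split, the simplest form of cross-validation).

So we split the dataset with train_test_split and repeat the same fit, predict, evaluate steps: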

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 12345)

melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
# 243803.6

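The MAE on the held-out validation set is clearly much higher than the in-sample MAE, which means the model and the choice of features still need work.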
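To make the gap between in-sample and validation error concrete, here is a minimal sketch on synthetic data. The data and the names X_demo, y_demo, train_mae, and val_mae are assumptions made up for illustration; this is not the Melbourne dataset. An unrestricted tree can memorize its training rows, so its training MAE says almost nothing about new data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the Melbourne dataset):
# the target is a noisy linear function of two features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 2, size=500)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# No size limit, so the tree keeps splitting until it reproduces the training rows.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(train_X_demo, train_y_demo)

train_mae = mean_absolute_error(train_y_demo, tree.predict(train_X_demo))
val_mae = mean_absolute_error(val_y_demo, tree.predict(val_X_demo))
print("training MAE:  ", train_mae)
print("validation MAE:", val_mae)
```

The training MAE comes out at essentially zero while the validation MAE stays well above it, which is the same pattern as in the results above.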

Underfitting and Overfitting

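The predictions in the previous section were not very good, so the model's parameters need tuning. For a decision tree, the size of the tree (its depth, controlled in the code below through max_leaf_nodes) can strongly affect prediction accuracy. Before tuning, two concepts need to be clear, overfitting and underfitting: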

Overfitting: a model matches the training data almost perfectly, but does poorly in validation and other new data
Underfitting: a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data

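The code is straightforward: wrap the earlier steps in a helper function, then loop over candidate max_leaf_nodes values and record the validation MAE for each: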

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=12345)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
mae = {}
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae[max_leaf_nodes] = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)

best_tree_size = min(mae, key = mae.get)

With the best tree size chosen on the validation set, refit the model with that parameter on all of the data:

final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)
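The underfitting and overfitting trade-off is easier to see when training and validation MAE are printed side by side for each tree size. The following is only a sketch on synthetic data; the sine-shaped target, the candidate sizes, and the names tr_X, va_X, and results are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): a noisy sine curve in one feature.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(600, 1))
y_demo = 10 * np.sin(X_demo[:, 0]) + rng.normal(0, 2, size=600)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

results = {}
for n_leaves in [2, 10, 50, 400]:
    model = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=1)
    model.fit(tr_X, tr_y)
    mae_train = mean_absolute_error(tr_y, model.predict(tr_X))
    mae_val = mean_absolute_error(va_y, model.predict(va_X))
    results[n_leaves] = (mae_train, mae_val)
    print(f"{n_leaves:>4} leaves  train MAE {mae_train:6.2f}  validation MAE {mae_val:6.2f}")
```

A very small tree does badly on both sets (underfitting). As the tree grows, the training MAE keeps shrinking, while the validation MAE typically stops improving and then creeps back up (overfitting).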

Random Forests

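As shown above, a decision tree needs its depth tuned to predict reasonably well: a tree that is too deep tends to overfit, and one that is too shallow tends to underfit. A random forest is a good alternative here. With default parameters it is more robust than a single decision tree, so it needs far less tuning.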

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_mae = mean_absolute_error(val_y, rf_model.predict(val_X))
print("Validation MAE for Random Forest Model: {:.2f}".format(rf_val_mae))
# Validation MAE for Random Forest Model: 22762.43

With default parameters, the random forest predicts somewhat better than the decision tree.

Handling Missing Values

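As mentioned above, the dataset has missing values, and so far we simply dropped the samples that contain them. A useful first step is to count the missing values in each feature: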

import pandas as pd
melbourne_data = pd.read_csv("./melbourne-housing-snapshot/melb_data.csv")
missing_val_count_by_column = (melbourne_data.isnull().sum())
print(missing_val_count_by_column)

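Common strategies for handling missing values include:

Drop the features (columns) that contain missing values, or alternatively the samples (rows):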

melbourne_data_without_nan = melbourne_data.dropna(axis=1)

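Imputation: fill in the missing values. The imputed values will not exactly match the true ones, but this usually gives a better model than dropping the column. PS: sometimes the fact that a value is missing is itself informative, in which case a new feature is added to record whether each value was originally missing.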
```
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
my_imputer = SimpleImputer()
# Keep only the numeric columns
melbourne_data_numeric = melbourne_data.select_dtypes(exclude=['object'])
melbourne_data_imputation = my_imputer.fit_transform(melbourne_data_numeric)
```
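The strategies above can be compared on a tiny hand-made table. This is a sketch under assumptions: the toy DataFrame and its values are invented for illustration, and the "record whether the value was missing" idea is implemented here with SimpleImputer's add_indicator option, which is one possible way to do it rather than necessarily what the course uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (an assumption): two columns have one missing value each.
toy = pd.DataFrame({
    "Rooms": [1.0, 3.0, np.nan],
    "Landsize": [np.nan, 4.0, 6.0],
    "Price": [10.0, 20.0, 30.0],
})

# Strategy 1: drop whatever contains a NaN.
cols_dropped = toy.dropna(axis=1)  # drops columns, only "Price" survives
rows_dropped = toy.dropna(axis=0)  # drops rows, only the middle row survives

# Strategy 2: fill NaNs with the column mean and append one 0/1 indicator
# column for every feature that had missing values during fit.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(toy)

print(list(cols_dropped.columns))
print(rows_dropped.shape)
print(imputed)
```

The imputed array has five columns: the three original features with the gaps filled by 2.0 and 5.0, followed by the indicator columns for Rooms and Landsize.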

Using Categorical Data with One Hot Encoding

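Training data usually has to be converted into inputs the model can accept while keeping as much information as possible; this process is commonly called feature engineering. In LR models, for example, discrete features are typically used. The different values of such a feature are just labels with no meaningful ordering, so they need to be converted with one-hot encoding.

One-hot encoding uses an N-bit state register to encode N states: each state has its own register bit, and at any time exactly one bit is active. In other words, a feature with N categories becomes N variables, each one responsible for a single category, which yields a sparse vector of length N.

An alternative to one-hot encoding is dummy coding. The difference is that for a feature with N states, dummy coding uses only N-1 bits, so one-hot encoding has one more bit than dummy coding. One-hot encoding combined with regularization is generally reported to do somewhat better than dummy coding in nonlinear settings.

One-hot encoding with pandas looks like this (adding drop_first=True turns it into dummy coding):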

one_hot_encoded_melb_data = pd.get_dummies(melbourne_data)

The same can be done with the OneHotEncoder class in sklearn.preprocessing; see the examples at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
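Here is a small sketch of the difference between one-hot and dummy coding, plus the scikit-learn route. The toy column "Type" and its three category values are assumptions made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (an assumption) with N = 3 categories.
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

one_hot = pd.get_dummies(toy)                 # N columns, one per category
dummy = pd.get_dummies(toy, drop_first=True)  # N-1 columns, first category dropped

print(list(one_hot.columns))
print(list(dummy.columns))

# scikit-learn equivalent; handle_unknown="ignore" encodes categories
# unseen during fit as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["Type"]]).toarray()
print(encoder.categories_)
print(encoded)
```

With three categories, get_dummies gives three columns and drop_first=True gives two; every row of the OneHotEncoder output contains exactly one 1.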

XGBoost

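XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm and has performed very well in many Kaggle competitions. It builds on the concepts of decision trees, boosting, and gradient boosting. Whereas random forests need very little tuning, XGBoost still has to be tuned to reach its best performance. What follows is only a brief introduction to leave a first impression.

First, split the dataset and impute the missing values: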

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

data = pd.read_csv('./home-data-for-ml-course/train.csv')
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# to_numpy() replaces as_matrix(), which newer pandas releases no longer provide
train_X, test_X, train_y, test_y = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.25)

# Impute missing values: fit on the training split only, then reuse on the test split
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)

Then build the model with the XGBRegressor class from the xgboost library:

from xgboost import XGBRegressor

my_model = XGBRegressor()
# verbose=False keeps fit from printing evaluation messages for each boosting round
my_model.fit(train_X, train_y, verbose=False)

Evaluate the model with the mean absolute error:

predictions = my_model.predict(test_X)

from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : {:.1f}".format(mean_absolute_error(test_y, predictions)))

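In my view the details of parameter tuning can wait until you actually need them. The most common XGBoost parameters include n_estimators, learning_rate, and n_jobs.

Coincidentally, today I came across a WeChat public-account article titled 线性模型已退场，XGBoost时代早已来 (roughly, "Linear models have left the stage; the XGBoost era arrived long ago").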

Partial Dependence Plots

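Partial Dependence Plots (PDP) are, put simply, a tool for visualizing the effect of one or more chosen input variables on the predicted outcome (the output). They show how a feature influences the predictions and can in principle be used with any model. A PDP can show the relationship between the target and the selected features as a 1D or 2D plot.

A PDP can only be computed after the model has been trained. You then pick some features and plot their relationship with the predicted value. For example, to see how Distance and BuildingArea relate to the house price after training a Gradient Boosting model: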

from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import GradientBoostingRegressor

# Older scikit-learn releases exposed this as sklearn.ensemble.partial_dependence.plot_partial_dependence
# and supported only Gradient Boosting models; sklearn.inspection now handles other fitted estimators too.
my_model = GradientBoostingRegressor()
my_model.fit(train_X, train_y)
my_plots = PartialDependenceDisplay.from_estimator(my_model, train_X, features=[0, 1], feature_names=['Distance', 'BuildingArea'], grid_resolution=10)
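What a partial dependence curve actually computes can be sketched by hand: force one feature to each value on a grid, predict for every row, and average. The helper partial_dependence_1d below is hypothetical (written for this note, not a scikit-learn function), and a plain LinearRegression on synthetic data stands in for the gradient boosting model so that the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

# Synthetic stand-in data (an assumption): y = 3 * x0 - 2 * x1, no noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)
grid = np.linspace(0, 1, 5)
pd_values = partial_dependence_1d(model, X_demo, feature=0, grid=grid)

# For a linear model the curve is a straight line whose slope is the coefficient of x0.
slopes = np.diff(pd_values) / np.diff(grid)
print(pd_values)
print(slopes)
```

Because the helper only calls predict, the same function works with any fitted regressor, which matches the point above that PDPs are in principle model-agnostic.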

Pipelines && Cross-Validation

Pipelines exist to make data processing more convenient. The course lists these benefits:

Cleaner Code: You won't need to keep track of your training (and validation) data at each step of processing. Accounting for data at each step of processing can get messy. With a pipeline, you don't need to manually keep track of each step
Fewer Bugs: There are fewer opportunities to mis-apply a step or forget a pre-processing step
Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help
More Options For Model Testing: You will see an example in the next tutorial, which covers cross-validation
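As a rough illustration of the "cleaner code" point, here is a minimal pipeline that bundles imputation and a random forest into one object. The synthetic data, the 10% missing rate, and the names my_pipeline, tr_X, and va_X are assumptions for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption) with roughly 10% of entries missing.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1]
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# One object holds both steps: fit() fits the imputer and then the model,
# predict() applies the already-fitted imputer before predicting.
my_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
my_pipeline.fit(tr_X, tr_y)
predictions = my_pipeline.predict(va_X)
print(predictions[:5])
```

There is no separately imputed copy of the training or validation data to keep track of, which is where the "fewer bugs" benefit comes from.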

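Cross-validation is a standard technique for building models and validating model parameters in machine learning. Splitting the dataset into two parts, as in the earlier examples, is the simplest form (hold-out validation). Other variants are K-fold cross-validation, leave-one-out cross-validation, and a special resampling-based variant, bootstrapping.

On small datasets the computational cost of cross-validation is low, so cross-validation is recommended. On large datasets a simple train-test split (the hold-out validation above) is good enough and fast, so K-fold cross-validation is unnecessary. You can also try K-fold cross-validation first and, if the scores barely differ between folds, switch to a plain train-test split.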
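K-fold cross-validation takes only a few lines with scikit-learn's cross_val_score. The sketch below uses synthetic data, 5 folds, and a size-limited decision tree; all of these are assumptions picked for illustration. scikit-learn reports MAE as a negative score, so the sign is flipped afterwards.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 2, size=300)

model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=0)

# cv=5: each fold serves once as the validation set while the other four train the model.
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_mean_absolute_error")
mae_per_fold = -scores
print(mae_per_fold)
print("mean MAE:", mae_per_fold.mean())
```

If the five per-fold values are close to each other, a single train-test split would probably have told a similar story.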

Data Leakage

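This lesson covers data leakage in modeling. It does not mean a data breach in the literal sense, but a leak of causality: some features are not causes of the target but consequences of it. Take an example from the course:

Suppose you want to predict whether someone will get pneumonia, and one feature in the training set is whether the person took antibiotics. People take antibiotics to recover after getting pneumonia, so "took antibiotics" is not a cause of "got pneumonia" but rather a consequence of it, or a marker for it. A model trained with this feature will look very accurate on the training data but will be of little use on real-world data.

Leakage comes in two kinds:

Leaky Predictors: the example above is of this kind. The model uses features that will not be available at prediction time (for example, features that are a direct consequence of the target).
Leaky Validation Strategies: the model pipeline uses parameters or data from the validation set. For example, if you preprocess the dataset first (like fitting the Imputer for missing values) and only then do the train-test split, the values imputed into the training set already contain some information from the validation set (in other words, the validation data was used when imputing the training set).

How do you avoid leakage? It generally takes some understanding of the data. Before training, check the correlation between each feature and the target, and keep the validation set from influencing the training set as much as possible (Pipelines help with this).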
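The leaky validation case comes down to the order of two calls. This sketch, on synthetic data with invented names (leaky_imputer, clean_imputer), contrasts fitting the imputer on the whole dataset with fitting it on the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): one feature, about 20% missing.
rng = np.random.default_rng(0)
X_demo = rng.normal(50, 10, size=(100, 1))
X_demo[rng.random(100) < 0.2, 0] = np.nan

tr_X, va_X = train_test_split(X_demo, random_state=0)

# Leaky: the fill value is computed from all rows, validation rows included.
leaky_imputer = SimpleImputer().fit(X_demo)

# Leak-free: the fill value comes from the training rows only
# and is then reused unchanged on the validation rows.
clean_imputer = SimpleImputer().fit(tr_X)
tr_filled = clean_imputer.transform(tr_X)
va_filled = clean_imputer.transform(va_X)

print("fill value from all rows:     ", leaky_imputer.statistics_[0])
print("fill value from training rows:", clean_imputer.statistics_[0])
```

Putting the imputer inside a pipeline gives the leak-free behavior automatically, because the pipeline refits every step on whatever training data it is given.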

This article originally appeared at http://www.bioinfo-scrounger.com; please credit the source when reposting.