Scikit-learn Pipeline and ColumnTransformer

Pipeline

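A Pipeline simplifies building a chain of transformations followed by a model.

Benefits of a Pipeline:

Once the Pipeline is built, a single fit and a single predict are enough; there is no need to call fit and transform on every estimator separately.
With grid search, the parameters of all estimators in the Pipeline can be searched in one pass.
It prevents information from the test data leaking into the training folds of cross-validation (the classic mistake is scaling the data before cross-validation, which leaks statistics).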
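
The first and third points can be illustrated with a minimal sketch. The dataset (load_breast_cancer) and the number of folds are assumptions chosen only to make the example runnable; the point is that the scaler lives inside the pipeline, so cross-validation refits it on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumed toy dataset, used only for illustration.
X, y = load_breast_cancer(return_X_y=True)

# One object holds both the scaler and the classifier.
pipe = make_pipeline(MinMaxScaler(), SVC())

# cross_val_score clones and fits the whole pipeline on each training fold,
# so the held-out fold never contributes to the scaler's min/max statistics.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling X once up front and then cross-validating only the SVC would let every fold see statistics computed from its own held-out data, which is the leak described above.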

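There are two common ways to build one: Pipeline and make_pipeline. With make_pipeline you do not have to name each step; the names are generated automatically from the lowercased class names. For example: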

pipe = Pipeline([("Scale", MinMaxScaler()), ("SVM", SVC())])
pipe = make_pipeline(MinMaxScaler(), SVC())

For the pipeline built with make_pipeline above, individual steps can be accessed by position through the steps list, or by name through named_steps:

pipe.steps[0]
pipe.named_steps.svc
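
As a small sketch of the different access styles (the variable names below are only illustrative), the following shows the names that make_pipeline generates and several equivalent ways of reaching the same step:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = make_pipeline(MinMaxScaler(), SVC())

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['minmaxscaler', 'svc']

# pipe.steps is a list of (name, estimator) tuples.
first_name, first_step = pipe.steps[0]

# The same SVC object can be reached in several ways.
svc_by_attr = pipe.named_steps.svc      # attribute access
svc_by_key = pipe.named_steps["svc"]    # dictionary-style access
svc_by_index = pipe[-1]                 # indexing the pipeline itself
```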

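Besides accessing steps, you can also change their parameters, using the format <estimator>__<parameter> (step name, two underscores, parameter name). For example: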

pipe.set_params(svc__C=10)
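
To find out which names are valid, get_params lists every parameter of every step with the same double-underscore naming. A minimal sketch, assuming the MinMaxScaler plus SVC pipeline from earlier:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = make_pipeline(MinMaxScaler(), SVC())

# Every nested parameter appears as <step name>__<parameter>.
svc_param_names = sorted(k for k in pipe.get_params() if k.startswith("svc__"))
print(svc_param_names)

c_before = pipe.get_params()["svc__C"]  # SVC's default C is 1.0

# set_params changes the parameter on the step inside the pipeline.
pipe.set_params(svc__C=10)
c_after = pipe.named_steps.svc.C
print(c_before, c_after)
```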

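When a pipeline is used in a grid search, the parameters of different estimators can all go into the same parameter grid. For example, with PCA as the preprocessing step and SVC as the classifier: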

pipe = make_pipeline(PCA(), SVC())
param_grid = {'pca__n_components': [2, 5, 10],
              'svc__C': [0.1, 10, 100]}
grid_search = GridSearchCV(pipe, param_grid=param_grid)
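
A runnable version of this search might look like the sketch below. The dataset, the train/test split and cv=5 are assumptions added for illustration; only the parameter grid comes from the snippet above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumed toy dataset, used only for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step names are generated by make_pipeline: 'pca' and 'svc'.
pipe = make_pipeline(PCA(), SVC())

param_grid = {
    "pca__n_components": [2, 5, 10],
    "svc__C": [0.1, 10, 100],
}

# 3 x 3 = 9 candidate settings, each evaluated with 5-fold cross-validation.
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))
```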

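A step in the pipeline can also be skipped. Suppose the candidate classifiers are a scale-sensitive model (a linear model, or the SVC used here) and a tree-based model: the first should get scaling as preprocessing, the tree-based one should not. In that case the step can be set to 'passthrough' or None: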

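from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('prep', StandardScaler()), ('clf', SVC())])

# Each dict is its own sub-grid; 'prep': None skips the scaling step.
param_grid = [{'clf': [SVC()], 'prep': [StandardScaler(), None], 'clf__C': [0.01, 0.1, 1]},
              {'clf': [RandomForestClassifier()], 'prep': [None], 'clf__max_features': [1, 2, 3]}]

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)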
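
After fitting, the search object reports which combination of classifier and preprocessing won. The sketch below repeats the setup so that it runs on its own, uses the string 'passthrough' in place of None, and fixes random_state on the forest for reproducibility (both are choices made for this illustration, not part of the code above).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("prep", StandardScaler()), ("clf", SVC())])

param_grid = [
    # SVC: try with and without scaling.
    {"clf": [SVC()],
     "prep": [StandardScaler(), "passthrough"],
     "clf__C": [0.01, 0.1, 1]},
    # Random forest: never scale.
    {"clf": [RandomForestClassifier(random_state=0)],
     "prep": ["passthrough"],
     "clf__max_features": [1, 2, 3]},
]

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# 2 * 3 candidates from the first dict plus 1 * 3 from the second.
n_candidates = len(grid.cv_results_["params"])
best_clf_name = type(grid.best_estimator_.named_steps["clf"]).__name__

print(n_candidates, best_clf_name)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```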

ColumnTransformer

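ColumnTransformer was added in scikit-learn 0.20. It applies different preprocessing to different columns (the data is usually held in a pandas DataFrame), for example separate encodings for numeric and categorical features. One benefit, when it is placed inside a Pipeline, is that it helps avoid data leakage during cross-validation.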

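Reading this reminded me of a time when I ran cross-validation on a training set and got a final score of 1. That clearly pointed to overfitting, but back then I did not think of data leakage (too little experience...). Looking back, the leak probably came from encoding one of the columns before cross-validation.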

So ColumnTransformer is a good way to avoid this situation; without it, you have to guard against data leakage by hand.

Its usage is similar to Pipeline, and the two can be combined, for example:

column_trans = make_column_transformer(
    (OneHotEncoder(), ['city']),
    (CountVectorizer(), 'title'),
    remainder=MinMaxScaler())
pipe = Pipeline([('prep', column_trans), ('clf', SVC())])

With make_column_transformer the transformer names can be omitted.
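
To see the transformer at work, here is a minimal sketch on a tiny DataFrame. The rows, the two rating columns and the labels y are made up for illustration; only the 'city' and 'title' columns and the transformer setup come from the snippet above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: one categorical, one text and two numeric columns.
df = pd.DataFrame({
    "city": ["London", "London", "Paris", "Sallisaw"],
    "title": ["His Last Bow", "How Watson Learned the Trick",
              "A Moveable Feast", "The Grapes of Wrath"],
    "expert_rating": [5, 3, 4, 5],
    "user_rating": [4, 5, 4, 3],
})
y = [0, 1, 0, 1]  # made-up labels

column_trans = make_column_transformer(
    (OneHotEncoder(), ["city"]),    # list of columns -> 2D input
    (CountVectorizer(), "title"),   # single string -> 1D input, as text needs
    remainder=MinMaxScaler(),       # every remaining column gets scaled
)

# Names are generated from the lowercased class names.
names = [name for name, _, _ in column_trans.transformers]
print(names)  # ['onehotencoder', 'countvectorizer']

# 3 one-hot city columns + 13 title tokens + 2 scaled rating columns.
Xt = column_trans.fit_transform(df)
print(Xt.shape)

# Inside a Pipeline, the encoders are refit whenever the pipeline is fit,
# so under cross-validation they only ever see the training folds.
pipe = Pipeline([("prep", column_trans), ("clf", SVC())])
pipe.fit(df, y)
pred = pipe.predict(df)
```

Note that 'title' is passed as a plain string rather than a list, because CountVectorizer expects a one-dimensional sequence of documents, while OneHotEncoder expects two-dimensional input.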

References

"Introduction to Machine Learning with Python" (《Python机器学习基础教程》), Building Pipelines (make_pipeline)
Pipelines and composite estimators

This article comes from http://www.bioinfo-scrounger.com; please credit the source when reposting.