Scikit-learn Feature selection

From Machine Learning (Zhou Zhihua): attributes that are useful for the task at hand are called "relevant features", while useless ones are called "irrelevant features". The process of selecting a subset of relevant features from a given feature set is called "feature selection".

I used to prefer R for data processing, but Python's scikit-learn library is quite friendly for machine learning, so I plan to learn more of it. Personally, I think most of these methods can be implemented in either R or Python, since the underlying principles are the same; it just comes down to which one you can use faster and more fluently.

sklearn.feature_selection is the module for feature selection (to improve model accuracy) and dimensionality reduction (to improve performance on high-dimensional data).

Removing features with low variance

We can use the VarianceThreshold transformer to simply filter out features whose variance falls below a threshold (0 by default). The example uses boolean features (0/1) and aims to remove features that are 0 or 1 in more than 80% of the samples. Since these are Bernoulli variables (variance p(1-p), mean p), the VarianceThreshold parameter can be set to .8 * (1 - .8), and then fit_transform is used to fit and transform the data.

from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
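As a quick sketch of what happens above: after fitting, the selector exposes variances_, so we can check which column falls below the 0.8 * (1 - 0.8) = 0.16 threshold. Here the first column is 1 in only one of six samples (p = 1/6), so its variance 5/36 ≈ 0.14 is below the cutoff and it gets dropped:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_sel = sel.fit_transform(X)

# per-column variances, i.e. p(1-p) for each Bernoulli column
print(sel.variances_)  # approx. [0.139, 0.222, 0.25]
print(X_sel.shape)     # (6, 2): the low-variance first column is removed
```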

Univariate feature selection

Univariate feature selection scores each feature with a univariate statistical test and removes the low-scoring ones; it is generally used as a preprocessing step. Common selectors include:

  • SelectKBest: keeps the K highest-scoring features
  • SelectPercentile: keeps the highest-scoring features by percentile
  • SelectFdr: selects features by p-value after FDR (false discovery rate) correction
  • GenericUnivariateSelect: a configurable selector covering the strategies above, so the best univariate strategy can be chosen during hyperparameter search

Taking SelectKBest and GenericUnivariateSelect as examples, the usage is as follows:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, GenericUnivariateSelect, chi2

iris = load_iris()
X, y = iris.data, iris.target
# SelectKBest
X_new1 = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new1.shape
# GenericUnivariateSelect
X_new2 = GenericUnivariateSelect(chi2, 'k_best', param=2).fit_transform(X, y)
X_new2.shape

The score_func here is not limited to chi2; for regression and classification, the built-in options include:

  • For regression: f_regression, mutual_info_regression
  • For classification: chi2, f_classif, mutual_info_classif
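As a minimal sketch with f_classif instead of chi2 (still on iris): the fitted selector exposes scores_ and get_support(), which show why particular features were kept.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(f_classif, k=2).fit(X, y)

print(selector.scores_)        # one ANOVA F-score per feature
print(selector.get_support())  # boolean mask of the k best features
```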

Recursive feature elimination

Recursive feature elimination (RFE) trains a given model repeatedly (you specify the model as needed), removing the least important features each round; the remaining features form a new feature set for the next round of training, and this continues until the number of features reaches the requested size.

Taking an SVM classifier as an example, the usage is as follows:

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE

digits = load_digits()
# reshape flattens each image into one row (1797 rows, columns computed automatically via -1)
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=3, step=1)
rfe.fit(X, y)
print(rfe.n_features_)
print(rfe.support_)
print(rfe.ranking_)

ranking_ gives the rank of each feature; the features ranked 1 are the selected (best) ones.

If RFE is combined with cross-validation, it becomes RFECV; see the RFECV documentation for details. It adds a few parameters used by the cross-validation.
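A minimal RFECV sketch, using iris rather than digits to keep it fast (the cv and scoring parameters are the main additions over plain RFE); here cross-validation decides how many features to keep:

```python
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV

X, y = load_iris(return_X_y=True)
svc = SVC(kernel="linear", C=1)
# cv controls the cross-validation splitting; scoring picks the metric
rfecv = RFECV(estimator=svc, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features selected by cross-validation
print(rfecv.support_)
```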

Feature selection using SelectFromModel

SelectFromModel is a class I seem to load every time I do feature extraction; it gets used quite often. For example, after fitting a random forest or a linear SVM, we use it to filter out features whose score falls below a threshold and to build a new feature set.

L1-based feature selection

Feature selection based on an L1 penalty: the SelectFromModel class is combined with a linear model carrying an L1 penalty term. Taking LassoCV as an example (lasso with cross-validation to pick the best model), the SelectFromModel threshold is set to 0.25 (the default is 1e-5; other settings are possible, see the parameter documentation).

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# note: load_boston was removed in scikit-learn 1.2; on newer versions
# substitute another regression dataset (e.g. fetch_california_housing)
boston = load_boston()
X, y = boston['data'], boston['target']

clf = LassoCV(cv=5)
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
print(sfm.transform(X).shape[1])
print(sfm.get_support())

In general there is no universal rule for choosing the alpha value. Lasso with cross-validation may leave the model under-penalized (including a small number of non-relevant features is not detrimental to the prediction score), whereas BIC-based selection (LassoLarsIC) tends to set higher alpha values.
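A minimal sketch of the BIC alternative mentioned above, using the diabetes dataset (a stand-in chosen here since load_boston is deprecated): LassoLarsIC picks alpha from an information criterion instead of cross-validation, and the surviving non-zero coefficients are the selected features.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoLarsIC

X, y = load_diabetes(return_X_y=True)
# BIC-based alpha selection, as an alternative to cross-validation
lasso_bic = LassoLarsIC(criterion="bic").fit(X, y)

print(lasso_bic.alpha_)               # alpha chosen by minimizing BIC
print((lasso_bic.coef_ != 0).sum())   # features kept by the L1 penalty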

Tree-based feature selection

Feature selection based on the scores of tree models (such as DecisionTreeClassifier, RandomForestClassifier, and so on), combined with the SelectFromModel class. Taking the ExtraTreesClassifier model as an example:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import fetch_olivetti_faces
import numpy as np

data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target

forest = ExtraTreesClassifier(n_estimators=1000, max_features=128, random_state=0)
forest.fit(X, y)
sfm = SelectFromModel(forest, prefit=True, threshold=0.001)
print(sfm.transform(X).shape)
print(np.shape(np.where(forest.feature_importances_ > 0.001)))
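Instead of a fixed number like 0.001, the threshold can also be given as a string computed from the importances themselves. A small sketch on iris (a lighter dataset than the faces above), using "mean":

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# threshold accepts strings like "mean" or "median" of feature_importances_
sfm = SelectFromModel(rf, prefit=True, threshold="mean")
print(sfm.transform(X).shape)  # only above-average-importance features remain
```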

Feature selection as part of a pipeline

Some code you come across writes the modelling process in pipeline form, for example feeding the result of the feature selection above (the LassoCV model) into a model (RandomForestClassifier) for fitting and training. With a pipeline it looks like this:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
        ('feature_selection', SelectFromModel(LassoCV(cv=5))),
        ('classification', RandomForestClassifier())
])
clf.fit(X, y)
# getting the selected features chosen by the LassoCV filter
clf.named_steps.feature_selection.get_support()
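One advantage of the pipeline form is that the selection step is re-fit inside each cross-validation fold, avoiding information leakage. A self-contained runnable variant on iris (using SelectKBest as a stand-in for the LassoCV filter above, since iris is a classification task):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = Pipeline([
    ('feature_selection', SelectKBest(f_classif, k=2)),
    ('classification', RandomForestClassifier(random_state=0)),
])
# the whole pipeline is cross-validated as a single estimator
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```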

References:
Feature selection
Introduction to Machine Learning with Python (《Python机器学习基础教程》)

This article is from http://www.bioinfo-scrounger.com; please cite the source when reposting.
