Scikit-learn Feature Extraction from Text

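The attribute types we meet most often are continuous and categorical, and both are usually converted into a numerical matrix before being fed to ML algorithms. Besides these two, there is also text data, which likewise has to be converted into a numerical matrix by some method.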

Bag of words is a common way to turn text documents into numerical features. The approach can be roughly divided into:

tokenization
counting
normalization

Two of the more basic implementations are CountVectorizer and TfidfVectorizer.

Text data comes with some specific terms:

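document: a single data point represented by one piece of text, for example a text message, an email, or a book
corpus: the collection of documents, i.e. the whole dataset
token/tokenization: splitting each document into words (tokens); for example, "How are you" yields the tokens "how", "are" and "you"
vocabulary building: collecting a vocabulary that contains every word appearing in any document, and numbering the entries
encoding: counting, for each document, how often each word in the vocabulary appears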
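The three steps (tokenization, vocabulary building, encoding) can be sketched by hand in plain Python. This is only an illustration: the regular expression and the helper names `tokenize` and `encode` are assumptions chosen to mimic CountVectorizer's default behaviour, not scikit-learn's actual code.

```python
import re
from collections import Counter

messages = ["Hey hey hey lets go get lunch today :)",
            "Did you go home?",
            "Hey!!! I need a favor"]

def tokenize(document):
    # Lowercase, then keep runs of two or more word characters
    # (assumed here to approximate CountVectorizer's defaults)
    return re.findall(r"\b\w\w+\b", document.lower())

# Tokenization: one list of tokens per document
tokens_per_doc = [tokenize(doc) for doc in messages]

# Vocabulary building: every distinct token in the corpus, numbered in sorted order
distinct_tokens = sorted({tok for toks in tokens_per_doc for tok in toks})
vocabulary = {tok: i for i, tok in enumerate(distinct_tokens)}

def encode(tokens):
    # Encoding: count how often each vocabulary entry appears in one document
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocabulary]

matrix = [encode(toks) for toks in tokens_per_doc]
for row in matrix:
    print(row)
```

Each printed row is one document and each column is one vocabulary entry, which is the document-term layout that CountVectorizer produces.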

CountVectorizer

CountVectorizer is a class in scikit-learn (in sklearn.feature_extraction.text).

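With this approach we discard most of the structure of the input text, such as chapters, paragraphs, sentences and formatting, and only count how often each token in the corpus appears in each document.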

Take a few messages as an example:

messages = ["Hey hey hey lets go get lunch today :)",
           "Did you go home?",
           "Hey!!! I need a favor"]

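After calling fit, get_feature_names_out shows the features extracted from the corpus (older tutorials use get_feature_names, which was removed in scikit-learn 1.2)

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# learn the vocabulary from the corpus
vect.fit(messages)
# list the learned tokens in column order
vect.get_feature_names_out()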
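To see what the default tokenization does to a single message, one option is to call the vectorizer's analyzer directly. The sketch below assumes default CountVectorizer settings, under which text is lowercased and punctuation and single-character tokens are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
# build_analyzer returns the callable that preprocesses and tokenizes a document
analyzer = vect.build_analyzer()

tokens = analyzer("Hey!!! I need a favor")
print(tokens)  # → ['hey', 'need', 'favor']
```

This explains why "I" and "a" never show up as features for these messages.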

To feed the result into an ML model, call transform on the fitted CountVectorizer to turn the documents into a document-term matrix

dtm = vect.transform(messages)
print(dtm)
>>(0, 2)        1
  (0, 3)        1
  (0, 4)        3
  (0, 6)        1
  (0, 7)        1
  (0, 9)        1
  (1, 0)        1
  (1, 3)        1
  (1, 5)        1
  (1, 10)       1
  (2, 1)        1
  (2, 4)        1
  (2, 8)        1

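By default dtm is a sparse matrix; this is the bag-of-words representation described above. Such data is a typical high-dimensional sparse dataset: with 100000 features and 10000 documents, the bag of words is a 10000×100000 matrix, which would need about 4 GB of memory as a dense array (at 4 bytes per entry), yet most of its values are 0, because the tokens appearing in any one document are only a very small part of the vocabulary. A sparse matrix is therefore used to save memory.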

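In the output above, (0, 2) refers to the first document (row 0) and the third token in the vocabulary (column 2, "get"), whose count is 1. A sparse matrix stores only the non-zero values and their positions in the matrix.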
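The memory argument can be checked with a little arithmetic. The figures below are assumptions for illustration only: 4-byte values and indices, a CSR-style layout (one value and one column index per non-zero entry, plus one row pointer per row), and a hypothetical average of 100 distinct tokens per document.

```python
n_docs, n_features = 10_000, 100_000
bytes_per_value = 4   # assumption: 32-bit integer counts
bytes_per_index = 4   # assumption: 32-bit indices

# Dense storage keeps every cell, zero or not
dense_bytes = n_docs * n_features * bytes_per_value

# Hypothetical: each document contains 100 distinct tokens on average
nnz = n_docs * 100

# CSR-style storage: value + column index per non-zero, plus n_docs + 1 row pointers
csr_bytes = nnz * (bytes_per_value + bytes_per_index) + (n_docs + 1) * bytes_per_index

print(f"dense: {dense_bytes / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
```

Under these assumptions the sparse layout is several hundred times smaller than the dense one.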

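To convert it into a dense matrix, use the sparse matrix's toarray method; wrapping the result in a pandas DataFrame makes it easier to read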

import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())
>> did  favor  get  go  hey  home  lets  lunch  need  today  you
0    0      0    1   1    3     0     1      1     0      1    0
1    1      0    0   1    0     1     0      0     0      0    1
2    0      1    0   0    1     0     0      0     1      0    0

vocabulary_.get returns the column index of a given token in the matrix (not its count)

vect.vocabulary_.get("hey")
>> 4
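Since vocabulary_ maps a token to its column, the actual counts can be read by slicing that column out of the document-term matrix. A minimal sketch, assuming the same three messages and default CountVectorizer settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ["Hey hey hey lets go get lunch today :)",
            "Did you go home?",
            "Hey!!! I need a favor"]

vect = CountVectorizer()
dtm = vect.fit_transform(messages)

# Column index of the token in the document-term matrix
col = vect.vocabulary_["hey"]

# Counts of "hey" in each of the three documents
counts = dtm[:, col].toarray().ravel()
print(col, counts)
```

Here the index is 4, while the per-document counts are 3, 0 and 1.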

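As shown above, every feature is a single token, i.e. a unigram. The ngram_range parameter switches to bigrams or longer n-grams: (2, 2) keeps bigrams only, while (1, 2) keeps both unigrams and bigrams. In many cases adding bigrams helps model performance.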

vect = CountVectorizer(ngram_range=(2,2))
vect.fit_transform(messages)
vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

To drop uncommon words or stop words, use parameters such as max_features, min_df, max_df and stop_words

vect = CountVectorizer(min_df=2, stop_words="english")
vect.fit_transform(messages)
vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

TfidfVectorizer

TfidfVectorizer is similar to CountVectorizer in that it also converts text into a matrix, but the values are tf-idf weights rather than raw counts.

Definition of TF-IDF (there are several variants, but the core is similar; see the Wikipedia article on tf-idf):

TF-IDF stands for term frequency-inverse document frequency. It is made up of two parts, TF and IDF: TF-IDF = term frequency * inverse document frequency

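TF, the term frequency: how often a token appears in a document
IDF, the inverse document frequency: a measure of how common a token is across all documents; a token that appears in every document gets a low IDF (roughly 1 / document frequency, usually on a log scale)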

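This method gives a high weight to tokens that appear often in a particular document, and a low weight to tokens that appear often in many documents of the corpus. If a token appears frequently in one particular document but rarely in the others, it probably describes the content of that document well.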
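A hand computation makes the weighting concrete. The sketch below uses one textbook variant (raw count as tf, natural log of N / df as idf); this is an assumption for illustration and is not the exact formula scikit-learn applies. The token lists are the three messages after tokenization.

```python
import math

docs = [["hey", "hey", "hey", "lets", "go", "get", "lunch", "today"],
        ["did", "you", "go", "home"],
        ["hey", "need", "favor"]]

def tf(token, doc):
    # Term frequency: raw count of the token in one document
    return doc.count(token)

def idf(token):
    # Inverse document frequency: ln(N / number of documents containing the token)
    df = sum(token in doc for doc in docs)
    return math.log(len(docs) / df)

def tfidf(token, doc):
    return tf(token, doc) * idf(token)

for token in ("hey", "go", "lunch"):
    print(token, round(tfidf(token, docs[0]), 3))
```

In the first message, "go" and "lunch" both occur once, but "lunch" gets the higher weight because it appears in only one document while "go" appears in two.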

scikit-learn has two classes for computing TF-IDF:

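TfidfTransformer: takes the sparse matrix produced by CountVectorizer and applies the TF-IDF transformation to it
TfidfVectorizer: performs feature extraction directly on the text and then applies the TF-IDF transformation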

Taking TfidfVectorizer as an example:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(messages)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())
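The two-step route (CountVectorizer followed by TfidfTransformer) is expected to give the same matrix as TfidfVectorizer when both use default parameters. A small check along those lines, reusing the same messages:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

messages = ["Hey hey hey lets go get lunch today :)",
            "Did you go home?",
            "Hey!!! I need a favor"]

# Two steps: raw counts first, then the TF-IDF transformation
counts = CountVectorizer().fit_transform(messages)
two_step = TfidfTransformer().fit_transform(counts)

# One step: straight from text to TF-IDF
one_step = TfidfVectorizer().fit_transform(messages)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step form is handy when the raw counts are needed as well, for example to inspect them before weighting.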

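By default, TfidfVectorizer and TfidfTransformer rescale each row to unit L2 norm after computing TF-IDF. The idf formula also differs slightly from the textbook one: scikit-learn always adds 1 to the idf, and with the default smooth_idf=True it additionally adds 1 to both the numerator and the denominator inside the logarithm, i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1; with smooth_idf=False it is idf(t) = ln(n / df(t)) + 1.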
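These defaults can be verified numerically against the fitted idf_ attribute. The sketch below recomputes the document frequencies from a CountVectorizer matrix and compares both smoothing settings; the formulas in the comments follow the scikit-learn user guide.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

messages = ["Hey hey hey lets go get lunch today :)",
            "Did you go home?",
            "Hey!!! I need a favor"]

counts = CountVectorizer().fit_transform(messages).toarray()
n = counts.shape[0]               # number of documents
df = (counts > 0).sum(axis=0)     # number of documents containing each token

smooth = TfidfVectorizer(smooth_idf=True).fit(messages)
plain = TfidfVectorizer(smooth_idf=False).fit(messages)

# smooth_idf=True:  idf = ln((1 + n) / (1 + df)) + 1
smooth_ok = np.allclose(smooth.idf_, np.log((1 + n) / (1 + df)) + 1)
# smooth_idf=False: idf = ln(n / df) + 1
plain_ok = np.allclose(plain.idf_, np.log(n / df) + 1)

# With the default norm="l2", every non-empty row has unit length
row_norms = np.linalg.norm(smooth.transform(messages).toarray(), axis=1)

print(smooth_ok, plain_ok, row_norms)
```

The constant +1 means a token that occurs in every document still gets a non-zero weight instead of being ignored entirely.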

Summary

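CountVectorizer and TfidfVectorizer are among the more basic bag-of-words methods. Real-world text data is more complex, so some adjustments are needed. For example, documents may contain both the singular and plural forms of a token, or different inflections of a verb, and it may be useful to merge these into the same word stem. Normalization methods for this include: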

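Stemming: a rule-based heuristic, for example removing common suffixes
Lemmatization: uses a (human-curated) dictionary of known word forms and takes into account the role of the word in the sentence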
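To give a feel for the rule-based idea behind stemming, here is a deliberately tiny toy stemmer. The suffix list and the `toy_stem` helper are hypothetical and far cruder than real stemmers such as Porter's; it only shows how different surface forms can collapse into one count.

```python
from collections import Counter

# Hypothetical, deliberately tiny rule set
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    # Strip the first matching suffix, but keep at least 3 characters of stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["need", "needs", "needed", "needing", "favor", "favors"]
stem_counts = Counter(toy_stem(tok) for tok in tokens)
print(stem_counts)
```

Rules this simple make mistakes on many real words, which is one reason dictionary-based lemmatization is often preferred when accuracy matters.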

References:

Feature Extraction from Text
Working With Text Data

This article comes from http://www.bioinfo-scrounger.com; please credit the source when reposting.