Machine Learning - Text Classification (1): One-Hot Encoding, Bag of Words, N-gram, TF-IDF
1. One-Hot Encoding
One-hot encoding is usually applied to labels. Suppose we have five classes: cat: 0, dog: 1, person: 2, boat: 3, car: 4. Then:
cat: [1,0,0,0,0]
dog: [0,1,0,0,0]
person: [0,0,1,0,0]
boat: [0,0,0,1,0]
car: [0,0,0,0,1]
from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(sparse=False)  # on scikit-learn >= 1.2 use sparse_output=False
labels = [0, 1, 2, 3, 4]
# OneHotEncoder expects a 2-D array: one sample per row
labels = np.array(labels).reshape(len(labels), -1)
ans = enc.fit_transform(labels)
Result:
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
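For integer class labels like these, the same encoding can also be built directly with NumPy; a minimal sketch, assuming the labels are consecutive integers starting at 0:

import numpy as np

labels = np.array([0, 1, 2, 3, 4])
num_classes = labels.max() + 1
# Row i of the identity matrix is exactly the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot)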
2. Bag of Words
Count how many times each word occurs in a document and use those counts as the feature values.
import re """ corpus = [ 'This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?', ] """ corpus = [ 'Bob likes to play basketball, Jim likes too.', 'Bob also likes to play football games.' ]#所有單片語成的列表 words=[] for sentence in corpus: #過濾掉標點符號 sentence=re.sub(r'[^\w\s]','',sentence.lower()) #拆分句子為單詞 for word in sentence.split(" "): if word not in words: words.append(word) else: continue word2idx={} #idx2word={} for i in range(len(words)): word2idx[words[i]]=i #idx2word[i]=words[i] #按字典的值排序 word2idx=sorted(word2idx.items(),key=lambda x:x[1])
BOW = []
for sentence in corpus:
    sentence = re.sub(r'[^\w\s]', '', sentence.lower())
    print(sentence)
    # One count slot per vocabulary entry
    tmp = [0 for _ in range(len(word2idx))]
    for word in sentence.split(" "):
        for k, v in word2idx:
            if k == word:
                tmp[v] += 1
    BOW.append(tmp)

print(word2idx)
print(BOW)
Output:
bob likes to play basketball jim likes too
bob also likes to play football games
[('bob', 0), ('likes', 1), ('to', 2), ('play', 3), ('basketball', 4), ('jim', 5), ('too', 6), ('also', 7), ('football', 8), ('games', 9)]
[[1, 2, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]
Note that we iterate over the vocabulary and, for each entry, count how many times it appears in the sentence.
The sklearn implementation:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()
Result:
array([[0, 1, 1, 0, 0, 1, 2, 1, 1, 1],
       [1, 0, 1, 1, 1, 0, 1, 1, 1, 0]])
The counts match our manual result; only the column order differs, because CountVectorizer sorts its vocabulary alphabetically while we indexed words in order of first appearance.
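To check the column ordering, we can inspect the fitted vocabulary (a small sketch; get_feature_names_out is available in scikit-learn >= 1.0, older versions use get_feature_names):

print(vectorizer.get_feature_names_out())
# ['also' 'basketball' 'bob' 'football' 'games' 'jim' 'likes' 'play' 'to' 'too']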
3. N-gram
Core idea: slide a fixed-size window over the text so that each feature captures a short run of adjacent words, preserving local context that single-word counts lose. A hand-rolled version is sketched below, followed by the sklearn implementation.
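A minimal sketch of the sliding window for bigrams (2-grams), reusing the tokenization from the BoW example:

import re

corpus = [
    'Bob likes to play basketball, Jim likes too.',
    'Bob also likes to play football games.'
]

for sentence in corpus:
    tokens = re.sub(r'[^\w\s]', '', sentence.lower()).split()
    # Slide a window of size 2 across the token list
    bigrams = [' '.join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    print(bigrams)
# ['bob likes', 'likes to', 'to play', 'play basketball', 'basketball jim', 'jim likes', 'likes too']
# ['bob also', 'also likes', 'likes to', 'to play', 'play football', 'football games']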
The sklearn implementation:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Bob likes to play basketball, Jim likes too.',
    'Bob also likes to play football games.'
]
# ngram_range=(2, 2) extracts 2-grams only, decode_error="ignore" skips
# undecodable characters, token_pattern=r'\b\w+\b' tokenizes on word boundaries
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                   token_pattern=r'\b\w+\b', min_df=1)
x1 = ngram_vectorizer.fit_transform(corpus)
(0, 3)  1
(0, 6)  1
(0, 10) 1
(0, 8)  1
(0, 1)  1
(0, 5)  1
(0, 7)  1
(1, 6)  1
(1, 10) 1
(1, 2)  1
(1, 0)  1
(1, 9)  1
(1, 4)  1
This is the sparse representation printed by print(x1): in each (i, j) pair, the first value is the sentence index and the second is the bigram's column index in the vocabulary; the right-hand value is the count of that window, just as in BoW. x1.toarray() gives the dense matrix:
[[0 1 0 1 0 1 1 1 1 0 1]
 [1 0 1 0 1 0 1 0 0 1 1]]
# Inspect the generated vocabulary
print(ngram_vectorizer.vocabulary_)
{
'bob likes': 3,
'likes to': 6,
'to play': 10,
'play basketball': 8,
'basketball jim': 1,
'jim likes': 5,
'likes too': 7,
'bob also': 2,
'also likes': 0,
'play football': 9,
'football games': 4
}
4. TF-IDF
The TF-IDF score is the product of two parts: term frequency (TF), which measures how often a term occurs in a document, and inverse document frequency (IDF), which down-weights terms that occur in many documents. In the classic formulation, TF-IDF(t, d) = TF(t, d) * IDF(t) with IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
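A minimal sklearn sketch on the same corpus (note that TfidfVectorizer uses a smoothed IDF, log((1 + N) / (1 + df(t))) + 1, and L2-normalizes each row by default, so its values differ slightly from the textbook formula):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Bob likes to play basketball, Jim likes too.',
    'Bob also likes to play football games.'
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())  # vocabulary, sorted alphabetically
print(weights.toarray())              # one row of TF-IDF weights per document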
References:
https://blog.csdn.net/u011311291/article/details/79164289
https://mp.weixin.qq.com/s/6vkz18Xw4USZ3fldd_wf5g