
Machine Learning - Text Classification (1): One-Hot Encoding, Bag of Words, N-gram, TF-IDF

1. One-hot

One-hot encoding is usually applied to labels. Suppose we have five classes, cat: 0, dog: 1, person: 2, boat: 3, car: 4. Then:

cat: [1,0,0,0,0]

dog: [0,1,0,0,0]

person: [0,0,1,0,0]

boat: [0,0,0,1,0]

car: [0,0,0,0,1]
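
These vectors are easy to build by hand; a minimal pure-Python sketch (no library assumptions):

# One-hot by hand: a 1 at the label's index, 0 everywhere else
classes = ['cat', 'dog', 'person', 'boat', 'car']
one_hot = {name: [1 if i == idx else 0 for i in range(len(classes))]
           for idx, name in enumerate(classes)}
print(one_hot['dog'])  # [0, 1, 0, 0, 0]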

The same encoding with scikit-learn:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(sparse=False)  # renamed to sparse_output in scikit-learn >= 1.2
labels = [0, 1, 2, 3, 4]
# OneHotEncoder expects a 2-D array: one sample per row
labels = np.array(labels).reshape(len(labels), -1)
ans = enc.fit_transform(labels)

Result:

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

2. Bag of Words

Count how many times each word occurs and use the counts as feature values.

import re
"""
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
"""
corpus = [
'Bob likes to play basketball, Jim likes too.',
'Bob also likes to play football games.'
]
# Build the vocabulary: a list of all distinct words in the corpus
words = []
for sentence in corpus:
    # Strip punctuation
    sentence = re.sub(r'[^\w\s]', '', sentence.lower())
    # Split the sentence into words
    for word in sentence.split(" "):
        if word not in words:
            words.append(word)

word2idx = {}
# idx2word = {}
for i in range(len(words)):
    word2idx[words[i]] = i
    # idx2word[i] = words[i]
# Sort by index value; note this turns the dict into a list of (word, index) tuples
word2idx = sorted(word2idx.items(), key=lambda x: x[1])
# Build the bag-of-words vector for each sentence
BOW = []
for sentence in corpus:
    sentence = re.sub(r'[^\w\s]', '', sentence.lower())
    print(sentence)
    # One counter slot per vocabulary word
    tmp = [0 for _ in range(len(word2idx))]
    for word in sentence.split(" "):
        for k, v in word2idx:
            if k == word:
                tmp[v] += 1
    BOW.append(tmp)
print(word2idx)
print(BOW)

Output:

bob likes to play basketball jim likes too
bob also likes to play football games
[('bob', 0), ('likes', 1), ('to', 2), ('play', 3), ('basketball', 4), ('jim', 5), ('too', 6), ('also', 7), ('football', 8), ('games', 9)]
[[1, 2, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]

Note that we iterate over the vocabulary list and count, for each word in it, how many times it appears in the sentence.
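
The same counts can be produced more idiomatically with collections.Counter; a minimal sketch that reuses the corpus and the words list built above:

import re
from collections import Counter

BOW = []
for sentence in corpus:
    tokens = re.sub(r'[^\w\s]', '', sentence.lower()).split(" ")
    counts = Counter(tokens)                # word -> count in this sentence
    BOW.append([counts[w] for w in words])  # follow the vocabulary order
print(BOW)  # same result as the nested-loop version above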

Implementation in sklearn:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()

Result:

array([[0, 1, 1, 0, 0, 1, 2, 1, 1, 1],
       [1, 0, 1, 1, 1, 0, 1, 1, 1, 0]])

The counts are the same, but the column order differs: CountVectorizer sorts its vocabulary alphabetically, while our manual list keeps words in order of first appearance.
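
You can confirm the column ordering by inspecting the fitted vocabulary (get_feature_names_out exists in scikit-learn >= 1.0; older versions use get_feature_names):

print(vectorizer.get_feature_names_out())
# ['also' 'basketball' 'bob' 'football' 'games' 'jim' 'likes' 'play' 'to' 'too']
print(vectorizer.vocabulary_)  # word -> column index, alphabetical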

3. N-gram

Core idea: slide a fixed-size window over the sentence, so each feature is a short run of adjacent words. This preserves some of the local word-order (context) information that bag-of-words throws away.
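
A minimal sketch of the sliding window itself (pure Python; ngrams is a hypothetical helper name):

def ngrams(tokens, n):
    # Slide a window of n tokens across the list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("bob likes to play basketball".split(" "), 2))
# ['bob likes', 'likes to', 'to play', 'play basketball']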

sklearn implementation:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'Bob likes to play basketball, Jim likes too.',
'Bob also likes to play football games.'
]
# ngram_range=(2, 2) means use bigrams only; decode_error="ignore" skips
# characters that cannot be decoded; token_pattern=r'\b\w+\b' treats every
# word (including single characters) as a token
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                   token_pattern=r'\b\w+\b', min_df=1)
x1 = ngram_vectorizer.fit_transform(corpus)
print(x1)
  (0, 3)    1
  (0, 6)    1
  (0, 10)   1
  (0, 8)    1
  (0, 1)    1
  (0, 5)    1
  (0, 7)    1
  (1, 6)    1
  (1, 10)   1
  (1, 2)    1
  (1, 0)    1
  (1, 9)    1
  (1, 4)    1

In each (row, column) pair above, the first value is the document (sentence) index and the second is the bigram's index in the vocabulary; the trailing number is how many times that bigram occurs in the document, exactly as in BOW. The dense form, x1.toarray():

[[0 1 0 1 0 1 1 1 1 0 1]
 [1 0 1 0 1 0 1 0 0 1 1]]

# Inspect the generated vocabulary
print(ngram_vectorizer.vocabulary_)

{'bob likes': 3, 'likes to': 6, 'to play': 10, 'play basketball': 8,
 'basketball jim': 1, 'jim likes': 5, 'likes too': 7, 'bob also': 2,
 'also likes': 0, 'play football': 9, 'football games': 4}
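
Setting ngram_range=(1, 2) instead would keep the unigram (single-word) features alongside the bigrams, a common middle ground between pure BOW and pure 2-grams; a quick sketch:

uni_bi = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b')
print(len(uni_bi.fit(corpus).vocabulary_))  # 10 unigrams + 11 bigrams = 21 features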

4. TF-IDF

A TF-IDF score is the product of two parts: term frequency (TF), how often a term occurs in a document, and inverse document frequency (IDF), which down-weights terms that appear in many documents. A common formulation is TF-IDF(t, d) = TF(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t; words frequent in one document but rare across the corpus score highest.
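
A minimal sklearn sketch on the same corpus (note that TfidfVectorizer's defaults differ slightly from the textbook formula: smooth_idf=True gives idf = log((1 + N) / (1 + df)) + 1, and each row is L2-normalized):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Bob likes to play basketball, Jim likes too.',
    'Bob also likes to play football games.'
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # words shared by both documents get lower weights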

References:

https://blog.csdn.net/u011311291/article/details/79164289

https://mp.weixin.qq.com/s/6vkz18Xw4USZ3fldd_wf5g

https://blog.csdn.net/jyz4mfc/article/details/81223572