20. [Advanced] Popular Library Models: NLTK (Natural Language Toolkit)
阿新 • Published: 2018-12-30
#-*- coding:utf-8 -*-
# How can we vectorize the following two sentences?
sentence1 = 'The cat is walking in the bedroom.'
sentence2 = 'A dog was running across the kitchen.'
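# 1. Vectorization with the bag-of-words model
# As the name suggests, the bag-of-words model collects every word that appears in any sample into a single vocabulary,
# and then represents each training sample numerically by how many times it contains each vocabulary word.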
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
sentences = [sentence1,sentence2]
sentences = vec.fit_transform(sentences)
print(sentences.toarray())
# [[0 1 1 0 1 1 0 0 2 1 0]
# [1 0 0 1 0 0 1 1 1 0 1]]
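# Print the feature (word) that each dimension of the vectors stands for
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
print(vec.get_feature_names())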
# [u'across', u'bedroom', u'cat', u'dog', u'in', u'is', u'kitchen', u'running', u'the', u'walking', u'was']
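# To make the counting step concrete, here is a minimal hand-written sketch of what the
# vectorizer does above. The helper name bag_of_words is hypothetical; the lowercasing and
# the "two or more word characters" token rule are assumptions chosen to mirror
# CountVectorizer's defaults, which is why the one-letter word 'A' is dropped.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for a list of strings."""
    # Lowercase each document and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted(set(token for tokens in tokenized for token in tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


bow_vocab, bow_vectors = bag_of_words([
    'The cat is walking in the bedroom.',
    'A dog was running across the kitchen.',
])
print(bow_vocab)
print(bow_vectors)
```

# The vocabulary and the two count vectors match the CountVectorizer output shown above.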
#*************************************************************************************
# 2. Processing the sentences with NLTK
import nltk
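# (1) Tokenize and normalize the sentences; contractions are split, e.g. aren't -> are + n't, I'm -> I + 'm
#     (word_tokenize needs NLTK's Punkt data: nltk.download('punkt'); newer NLTK releases may ask for 'punkt_tab' instead)
tokens_1 = nltk.word_tokenize(sentence1)
print(tokens_1)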
#['The', 'cat', 'is', 'walking', 'in', 'the', 'bedroom', '.']
tokens_2 = nltk.word_tokenize(sentence2)
print(tokens_2)
#['A', 'dog', 'was', 'running', 'across', 'the', 'kitchen', '.']
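# The contraction splitting described in step (1) can be approximated with a single regular
# expression. This is only a rough sketch: TOKEN_PATTERN and simple_tokenize are hypothetical
# names, and the pattern is NOT the algorithm nltk.word_tokenize uses internally. It just
# reproduces the same splits on simple sentences like the ones here.

```python
import re

# Alternatives are tried left to right at each position:
#   \w+(?=n't)  a word directly followed by "n't"   (are | n't)
#   n't         the negation clitic itself
#   '\w+        other clitics such as 'm, 's, 're
#   \w+         ordinary words
#   [^\w\s]     any single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("They aren't here, I'm sure."))
# ['They', 'are', "n't", 'here', ',', 'I', "'m", 'sure', '.']
print(simple_tokenize('The cat is walking in the bedroom.'))
# ['The', 'cat', 'is', 'walking', 'in', 'the', 'bedroom', '.']
```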
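# (2) Build the vocabulary of each sentence, sorted in ASCII order (punctuation and uppercase letters sort before lowercase)
vocab_1 = sorted(set(tokens_1))
print(vocab_1)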
#['.', 'The', 'bedroom', 'cat', 'in', 'is', 'the', 'walking']
vocab_2 = sorted(set(tokens_2))
print(vocab_2)
#['.', 'A', 'across', 'dog', 'kitchen', 'running', 'the', 'was']
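# (3) Initialize a stemmer to reduce each token to its stem (e.g. walking -> walk, running -> run ...).
#     The Porter stemmer strips suffixes by rule, so a stem need not be a dictionary word (was -> wa).
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
print(stem_1)
#['the', 'cat', 'is', u'walk', 'in', 'the', 'bedroom', '.']
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_2)
#['A', 'dog', u'wa', u'run', u'across', 'the', 'kitchen', '.']
# Note: whether capitalization survives stemming depends on the NLTK version; recent releases lowercase their input, so 'A' is printed as 'a'.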
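# (4) Use the part-of-speech tagger to label every token with its part of speech (noun, verb, preposition, ...)
#     (pos_tag needs tagger data, typically nltk.download('averaged_perceptron_tagger'))
pos_tag_1 = nltk.tag.pos_tag(tokens_1)
print(pos_tag_1)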
#[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('bedroom', 'NN'), ('.', '.')]
pos_tag_2 = nltk.tag.pos_tag(tokens_2)
print(pos_tag_2)
#[('A', 'DT'), ('dog', 'NN'), ('was', 'VBD'), ('running', 'VBG'), ('across', 'IN'), ('the', 'DT'), ('kitchen', 'NN'), ('.', '.')]
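# As a small illustration of what the tags are good for, the sketch below works directly on
# the two tagged lists printed above (copied in by hand, so NLTK is not needed to run it).
# The helper words_with_tag and the "first two characters" coarse tag are assumptions made
# for this example, not part of NLTK.

```python
tagged_1 = [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'),
            ('in', 'IN'), ('the', 'DT'), ('bedroom', 'NN'), ('.', '.')]
tagged_2 = [('A', 'DT'), ('dog', 'NN'), ('was', 'VBD'), ('running', 'VBG'),
            ('across', 'IN'), ('the', 'DT'), ('kitchen', 'NN'), ('.', '.')]


def words_with_tag(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns_1 = words_with_tag(tagged_1, 'NN')
verbs_2 = words_with_tag(tagged_2, 'VB')
print(nouns_1)  # ['cat', 'bedroom']
print(verbs_2)  # ['was', 'running']

# Collapse fine-grained tags (VBZ, VBD, VBG ...) to a coarse two-letter class
# and compare the two sentences position by position.
coarse_1 = [tag[:2] for _, tag in tagged_1]
coarse_2 = [tag[:2] for _, tag in tagged_2]
same_structure = coarse_1 == coarse_2
print(same_structure)  # True
```

# The two sentences share the same coarse tag sequence, yet the tags alone say nothing about
# whether 'cat' and 'dog' or 'bedroom' and 'kitchen' are close in meaning.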
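# Summary:
# 1. NLTK can not only tag each word with its specific part of speech, it can even analyze the syntactic structure of a sentence.
# 2. Its limitation is that this analysis stays at the grammatical level: it cannot measure whether two words are similar in meaning.
# 3. The two sentences in this example describe very similar scenes semantically; to capture that, words must be represented as vectors,
#    which is what the word2vec technique covered next provides.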
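# To put a number on the limitation in point 2, the sketch below computes the cosine
# similarity of the two bag-of-words vectors from section 1. The function cosine_similarity
# is written by hand for this illustration; the vectors are copied from the CountVectorizer
# output above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0]  # 'The cat is walking in the bedroom.'
v2 = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]  # 'A dog was running across the kitchen.'
similarity = cosine_similarity(v1, v2)
print(round(similarity, 3))  # 0.272
```

# The only dimension the two vectors share is 'the', so the score is low even though the
# scenes are alike; count vectors treat 'cat' and 'dog' as unrelated dimensions.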