An Example of Keyword Extraction from Article Titles with Python 3
Approach:
1. Read all the article titles;
2. Segment each title into words with the "jieba" Chinese word-segmentation toolkit;
3. Compute Tf-idf (term frequency-inverse document frequency) with the "sklearn" toolkit;
4. Keep the words whose keyword weight meets a given threshold.
For details on jieba, see: jieba on GitHub
For details on sklearn, see: Text feature extraction, section 4.2.3.4, Tf-idf term weighting
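Before the full script, it may help to see step 2 in isolation. Below is a minimal sketch of segmenting a single title with jieba; the exact splits depend on jieba's built-in dictionary and on any user dictionary loaded, so the output noted in the comment is only indicative:

import jieba

# Without a user dictionary, jieba may split compound terms differently,
# e.g. '大資料' may come out as '大' / '資料'
print('/'.join(jieba.cut('農業大資料研究與應用進展綜述')))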
import os
import sys
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

sys.path.append("../")
jieba.load_userdict('userdictTest.txt')

STOP_WORDS = set((
    "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
))

def getFileList(path):
    # Collect every visible file name under path; the article titles are the file names
    filelist = []
    for f in os.listdir(path):
        if f[0] != '.':
            filelist.append(f)
    return filelist, path

def fenci(filename, segPath):
    # Folder that stores the segmentation results
    if not os.path.exists(segPath):
        os.mkdir(segPath)
    # Segment the title (i.e. the file name) with jieba
    seg_list = jieba.cut(filename)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        # Keep only words of at least two characters that are not stop words
        if len(seg.strip()) >= 2 and seg.lower() not in STOP_WORDS:
            result.append(seg)
    # Join the segmented words with spaces and save them locally
    f = open(segPath + "/" + filename + "-seg.txt", "w+", encoding="utf-8")
    f.write(' '.join(result))
    f.close()

def Tfidf(filelist, segPath, sFilePath, tfidfw):
    corpus = []
    for ff in filelist:
        f = open(segPath + "/" + ff + "-seg.txt", 'r', encoding="utf-8")
        corpus.append(f.read())
        f.close()
    vectorizer = TfidfVectorizer()  # implements both vectorization and Tf-idf weighting
    tfidf = vectorizer.fit_transform(corpus)
    word = vectorizer.get_feature_names()  # get_feature_names_out() in newer sklearn
    weight = tfidf.toarray()
    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)
    for i in range(len(weight)):
        print('----------writing all the tf-idf in the', i, 'file into', sFilePath + '/', i, '.txt----------')
        f = open(sFilePath + "/" + str(i) + ".txt", 'w+', encoding="utf-8")
        result = {}
        for j in range(len(word)):
            # Keep only the words whose weight meets the threshold
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        for z in range(len(resultsort)):
            f.write(resultsort[z][0] + " " + str(resultsort[z][1]) + '\r\n')
            print(resultsort[z][0] + " " + str(resultsort[z][1]))
        f.close()
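The post does not show the driver code that produces the run below; here is a minimal sketch of how the three functions above might be wired together. The folder names './title/', './segfile', './keywords' and the 0.3 weight threshold are assumptions, not taken from the original:

if __name__ == "__main__":
    # Assumed layout: one (possibly empty) file per article under ./title/,
    # named after its title, so the file names are the corpus
    filelist, _ = getFileList('./title/')
    segPath = './segfile'
    for filename in filelist:
        print("Using jieba on " + filename)
        fenci(filename, segPath)
    # Keep only words whose Tf-idf weight is at least 0.3 (assumed threshold)
    Tfidf(filelist, segPath, './keywords', 0.3)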
The TfidfVectorizer() class implements both word vectorization and the Tf-idf weight computation.
Word vectorization: vectorizer.fit_transform converts the segmented, space-joined words stored in corpus into a term-frequency matrix. It first scans all the words split from the titles to build the feature set and its column indices, stored as {'feature': index, ...} in a dictionary, e.g. {'農業': 0, '大資料': 1, ...}; in the sparse (CSR) matrix it records, for every title, the (title index, feature index) pair together with the term frequency tf. The words in the dictionary are then sorted and renumbered, and the feature indices in the sparse matrix are updated to match, yielding a term-frequency matrix of feature vectors. Finally, the idf weight of every feature is computed; with sklearn's default smoothing the formula is

idf(t) = ln((1 + n) / (1 + df(t))) + 1

where n is the number of documents (here, titles) and df(t) is the number of documents containing the term t. Each entry is then weighted as tf-idf(t, d) = tf(t, d) * idf(t), and every row of the resulting matrix is normalized to unit Euclidean (L2) length.
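A quick sketch that reproduces this behaviour on a toy corpus (the two pre-segmented titles here are assumptions, not the article's data):

import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Two pre-segmented, space-joined "titles"
corpus = ["農業 大資料 應用", "大資料 平臺 監測"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse CSR Tf-idf matrix

# vocabulary_ is the sorted {'feature': column index, ...} dictionary described above
print(vectorizer.vocabulary_)

# Reproduce the smoothed idf of '大資料' by hand: it appears in both of the n = 2 titles
n, df = 2, 2
print(math.log((1 + n) / (1 + df)) + 1)                   # 1.0
print(vectorizer.idf_[vectorizer.vocabulary_['大資料']])  # also 1.0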
Let us take the following six article titles as an example of keyword extraction:
Using jieba on 農業大資料研究與應用進展綜述.txt
Using jieba on 基於Hadoop的分散式並行增量爬蟲技術研究.txt
Using jieba on 基於RPA的財務共享服務中心賬表核對流程優化.txt
Using jieba on 基於大資料的特徵趨勢統計系統設計.txt
Using jieba on 網路大資料平臺異常風險監測系統設計.txt
Using jieba on 面向資料中心的多源異構資料統一訪問框架.txt
----------writing all the tf-idf in the 0 file into ./keywords/ 0 .txt----------
農業 0.773262366783
大資料 0.634086202434
----------writing all the tf-idf in the 1 file into ./keywords/ 1 .txt----------
hadoop 0.5
分散式 0.5
並行增量 0.5
爬蟲 0.5
----------writing all the tf-idf in the 2 file into ./keywords/ 2 .txt----------
rpa 0.408248290464
優化 0.408248290464
服務中心 0.408248290464
流程 0.408248290464
財務共享 0.408248290464
賬表核對 0.408248290464
----------writing all the tf-idf in the 3 file into ./keywords/ 3 .txt----------
特徵 0.521823488025
統計 0.521823488025
趨勢 0.521823488025
大資料 0.427902724969
----------writing all the tf-idf in the 4 file into ./keywords/ 4 .txt----------
大資料平臺 0.4472135955
異常 0.4472135955
監測 0.4472135955
網路 0.4472135955
風險 0.4472135955
----------writing all the tf-idf in the 5 file into ./keywords/ 5 .txt----------
多源異構資料 0.57735026919
資料中心 0.57735026919
統一訪問 0.57735026919
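These weights also show TfidfVectorizer's default L2 row normalization at work: when the k keywords of a title end up with equal raw tf-idf values, each normalized weight is 1/sqrt(k). A quick check against the output above:

import math
print(1 / math.sqrt(4))  # 0.5        -> the four keywords of title 1
print(1 / math.sqrt(3))  # 0.57735... -> the three keywords of title 5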
That is everything in this example of keyword extraction from article titles with Python 3. I hope it serves as a useful reference, and I hope you will keep supporting us.