An Example of Keyword Extraction from Article Titles with Python 3
Approach:
1. Read all the article titles;
2. Segment each title into words with the "jieba" Chinese word-segmentation toolkit;
3. Compute Tf-idf (term frequency-inverse document frequency) with the "sklearn" toolkit;
4. Keep the words whose keyword weight meets a given threshold.
For details on jieba, see: jieba on GitHub
For details on sklearn, see: Text feature extraction, section 4.2.3.4, Tf-idf term weighting
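Before the full script, it may help to see step 2 in isolation. Below is a minimal sketch of segmenting a single title with jieba; the exact splits depend on jieba's built-in dictionary and on any user dictionary loaded, so the output noted in the comment is only indicative:

import jieba

# Without a user dictionary, jieba may split compound terms differently,
# e.g. '大資料' may come out as '大' / '資料'
print('/'.join(jieba.cut('農業大資料研究與應用進展綜述')))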
import os
import sys
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

sys.path.append("../")
jieba.load_userdict('userdictTest.txt')

STOP_WORDS = set((
    "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
))

def getFileList(path):
    # Collect every visible file name under path; the article titles are the file names
    filelist = []
    for f in os.listdir(path):
        if f[0] != '.':
            filelist.append(f)
    return filelist, path

def fenci(filename, segPath):
    # Folder that stores the segmentation results
    if not os.path.exists(segPath):
        os.mkdir(segPath)
    # Segment the title (i.e. the file name) with jieba
    seg_list = jieba.cut(filename)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        # Keep only words of at least two characters that are not stop words
        if len(seg.strip()) >= 2 and seg.lower() not in STOP_WORDS:
            result.append(seg)
    # Join the segmented words with spaces and save them locally
    f = open(segPath + "/" + filename + "-seg.txt", "w+", encoding="utf-8")
    f.write(' '.join(result))
    f.close()

def Tfidf(filelist, segPath, sFilePath, tfidfw):
    corpus = []
    for ff in filelist:
        f = open(segPath + "/" + ff + "-seg.txt", 'r', encoding="utf-8")
        corpus.append(f.read())
        f.close()
    vectorizer = TfidfVectorizer()  # implements both vectorization and Tf-idf weighting
    tfidf = vectorizer.fit_transform(corpus)
    word = vectorizer.get_feature_names()  # get_feature_names_out() in newer sklearn
    weight = tfidf.toarray()
    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)
    for i in range(len(weight)):
        print('----------writing all the tf-idf in the', i, 'file into', sFilePath + '/', i, '.txt----------')
        f = open(sFilePath + "/" + str(i) + ".txt", 'w+', encoding="utf-8")
        result = {}
        for j in range(len(word)):
            # Keep only the words whose weight meets the threshold
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        for z in range(len(resultsort)):
            f.write(resultsort[z][0] + " " + str(resultsort[z][1]) + '\r\n')
            print(resultsort[z][0] + " " + str(resultsort[z][1]))
        f.close()
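The post does not show the driver code that produces the run below; here is a minimal sketch of how the three functions above might be wired together. The folder names './title/', './segfile', './keywords' and the 0.3 weight threshold are assumptions, not taken from the original:

if __name__ == "__main__":
    # Assumed layout: one (possibly empty) file per article under ./title/,
    # named after its title, so the file names are the corpus
    filelist, _ = getFileList('./title/')
    segPath = './segfile'
    for filename in filelist:
        print("Using jieba on " + filename)
        fenci(filename, segPath)
    # Keep only words whose Tf-idf weight is at least 0.3 (assumed threshold)
    Tfidf(filelist, segPath, './keywords', 0.3)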
The TfidfVectorizer() class implements both word vectorization and the Tf-idf weight computation.
Word vectorization: vectorizer.fit_transform converts the segmented, space-joined words stored in corpus into a term-frequency matrix. It first scans all the words split from the titles to build the feature set and its column indices, stored as {'feature': index, ...} in a dictionary, e.g. {'農業': 0, '大資料': 1, ...}; in the sparse (CSR) matrix it records, for every title, the (title index, feature index) pair together with the term frequency tf. The words in the dictionary are then sorted and renumbered, and the feature indices in the sparse matrix are updated to match, yielding a term-frequency matrix of feature vectors. Finally, the idf weight of every feature is computed; with sklearn's default smoothing the formula is

idf(t) = ln((1 + n) / (1 + df(t))) + 1

where n is the number of documents (here, titles) and df(t) is the number of documents containing the term t. Each entry is then weighted as tf-idf(t, d) = tf(t, d) * idf(t), and every row of the resulting matrix is normalized to unit Euclidean (L2) length.
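A quick sketch that reproduces this behaviour on a toy corpus (the two pre-segmented titles here are assumptions, not the article's data):

import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Two pre-segmented, space-joined "titles"
corpus = ["農業 大資料 應用", "大資料 平臺 監測"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse CSR Tf-idf matrix

# vocabulary_ is the sorted {'feature': column index, ...} dictionary described above
print(vectorizer.vocabulary_)

# Reproduce the smoothed idf of '大資料' by hand: it appears in both of the n = 2 titles
n, df = 2, 2
print(math.log((1 + n) / (1 + df)) + 1)                   # 1.0
print(vectorizer.idf_[vectorizer.vocabulary_['大資料']])  # also 1.0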
Let us take the following six article titles as an example of keyword extraction:
Using jieba on 農業大資料研究與應用進展綜述.txt
Using jieba on 基於Hadoop的分散式並行增量爬蟲技術研究.txt
Using jieba on 基於RPA的財務共享服務中心賬表核對流程優化.txt
Using jieba on 基於大資料的特徵趨勢統計系統設計.txt
Using jieba on 網路大資料平臺異常風險監測系統設計.txt
Using jieba on 面向資料中心的多源異構資料統一訪問框架.txt
----------writing all the tf-idf in the 0 file into ./keywords/ 0 .txt----------
農業 0.773262366783
大資料 0.634086202434
----------writing all the tf-idf in the 1 file into ./keywords/ 1 .txt----------
hadoop 0.5
分散式 0.5
並行增量 0.5
爬蟲 0.5
----------writing all the tf-idf in the 2 file into ./keywords/ 2 .txt----------
rpa 0.408248290464
優化 0.408248290464
服務中心 0.408248290464
流程 0.408248290464
財務共享 0.408248290464
賬表核對 0.408248290464
----------writing all the tf-idf in the 3 file into ./keywords/ 3 .txt----------
特徵 0.521823488025
統計 0.521823488025
趨勢 0.521823488025
大資料 0.427902724969
----------writing all the tf-idf in the 4 file into ./keywords/ 4 .txt----------
大資料平臺 0.4472135955
異常 0.4472135955
監測 0.4472135955
網路 0.4472135955
風險 0.4472135955
----------writing all the tf-idf in the 5 file into ./keywords/ 5 .txt----------
多源異構資料 0.57735026919
資料中心 0.57735026919
統一訪問 0.57735026919
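These weights also show TfidfVectorizer's default L2 row normalization at work: when the k keywords of a title end up with equal raw tf-idf values, each normalized weight is 1/sqrt(k). A quick check against the output above:

import math
print(1 / math.sqrt(4))  # 0.5        -> the four keywords of title 1
print(1 / math.sqrt(3))  # 0.57735... -> the three keywords of title 5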
That is everything in this example of keyword extraction from article titles with Python 3. I hope it serves as a useful reference, and I hope you will keep supporting us.