用NLTK對英文語料做預處理,用gensim計算相似度
阿新 • 發佈:2019-01-05
# Preprocess an English corpus with NLTK, then build a gensim dictionary,
# bag-of-words corpus, and TF-IDF model over it.
import logging

import nltk
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from gensim import corpora, models, similarities
# gensim tutorial reference: http://blog.csdn.net/questionfish/article/details/46715795

# Read the corpus: one document per line.
# Fix: the original opened the file but never read it (the read was commented
# out), leaving `testtext` undefined; also closed the handle via `with`.
with open('F:/iPython/newsfortfidf.txt') as f:
    testtext = [line.strip() for line in f]
print(testtext)

# Tokenize each document (the original tokenized the whole list at once).
texts_tokenized = [nltk.word_tokenize(document) for document in testtext]
print(texts_tokenized)

# Remove English stopwords.
english_stopwords = stopwords.words('english')
print(english_stopwords)
print(len(english_stopwords))
texts_filtered_stopwords = [
    [word for word in document if word not in english_stopwords]
    for document in texts_tokenized
]
print(texts_filtered_stopwords)

# Remove punctuation tokens.
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']',
                        '&', '!', '*', '@', '#', '$', '%', '-']
texts_filtered = [
    [word for word in document if word not in english_punctuations]
    for document in texts_filtered_stopwords
]
print(texts_filtered)

# Stem every token with the Lancaster stemmer.
st = LancasterStemmer()
texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
print(texts_stemmed)

# Drop stems that occur exactly once in the whole corpus — they carry no
# similarity signal.
all_stems = sum(texts_stemmed, [])
stems_once = {stem for stem in set(all_stems) if all_stems.count(stem) == 1}
texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]

# Configure gensim's logging output format and level.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Dictionary: assigns a unique integer id to every word in the corpus.
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)  # word -> id mapping

# doc2bow() counts occurrences of each distinct word, maps words to their ids,
# and returns each document as a sparse vector.
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

# TF-IDF background: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
print(tfidf.dfs)   # dict: word id -> number of documents the word appears in
print(tfidf.idfs)  # dict: word id -> inverse document frequency
# idfs semantics: the value measures how representative a word is. A word that
# appears in every document says nothing about any of them (value 0); the
# fewer documents a word appears in, the more representative it is of those
# documents.

# Train a 10-topic LSI model on the TF-IDF corpus and project the corpus
# into the LSI space.
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lsi = lsi[corpus_tfidf]
for doc in corpus_lsi:
    print(doc)

# Build a dense similarity index over the LSI-transformed corpus.
index = similarities.MatrixSimilarity(lsi[corpus])

# Query: find the documents most similar to document 210.
# NOTE(review): `courses_name` (expected output: "Machine Learning") is never
# defined in this snippet — it must be loaded elsewhere; confirm before running.
print(courses_name[210])
ml_course = texts[210]
ml_bow = dictionary.doc2bow(ml_course)  # fixed typo: was `dicionary`
ml_lsi = lsi[ml_bow]
print(ml_lsi)
sims = index[ml_lsi]
# Sort all documents by descending similarity (fixed: the original line was
# missing its closing parenthesis).
sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])