Encoding Problems in Python Text Word-Frequency Counting (MOOC, 嵩天)

1 Python word-frequency code

1.1 Hamlet word-frequency count (with the original Hamlet text)

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
    return txt
 
hamletTxt = getText()
words  = hamletTxt.split()
counts = {}
for word in words:           
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
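
The same counting loop can be written more compactly with the standard library. The sketch below is an alternative, not the course code: the helper name top_words is hypothetical, and it takes the text as a string so it can be tried without hamlet.txt. It assumes the intended punctuation list is the one used in getText().

```python
from collections import Counter

# Punctuation characters to replace with spaces before splitting.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~'

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Map every punctuation character to a space in a single pass.
    table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)

sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common replaces the manual list, sort, and slice steps, and it does not fail when the text has fewer than n distinct words.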

1.2 Character appearance counts in Romance of the Three Kingdoms, part 1 (with the original text)

#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

1.3 Character appearance counts in Romance of the Three Kingdoms, part 2 (with the original text)

#CalThreeKingdomsV2.py
import jieba
excludes = {"將軍","卻說","荊州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)   # ignore excluded words that never appeared instead of raising KeyError
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
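
The elif chain in CalThreeKingdomsV2.py can also be expressed as a lookup table, which is easier to extend when more aliases turn up. This is only a sketch of that idea: the names ALIASES, EXCLUDES, and count_names are hypothetical, and the function takes an already segmented word list so it runs without jieba or the text file.

```python
# Hypothetical alias table: each alternative name maps to one canonical name.
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"將軍", "卻說", "荊州", "二人", "不可", "不能", "如此"}

def count_names(words):
    """Count words of length >= 2, merging aliases and dropping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        name = ALIASES.get(word, word)   # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    for word in EXCLUDES:
        counts.pop(word, None)           # no error if the word never appeared
    return counts

# Stand-in for the list that jieba.lcut(txt) would return.
tokens = ["孔明", "諸葛亮", "曰", "玄德", "將軍", "劉備", "孔明曰"]
print(count_names(tokens))
```

For the sample tokens, the result is {'孔明': 3, '劉備': 2}: the single-character token is skipped, the aliases are merged, and the excluded word is removed.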

1.4 Text files for the word-frequency count

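This resource contains the text files used in this post. The package includes the Chinese TXT of Romance of the Three Kingdoms and the English TXT of Hamlet.
Resource link: Text files for the word-frequency count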

2 Encoding problems in the word-frequency count

2.1 The lines of code that set the encoding

Put the text files and the code in the same folder and run the code above. The following errors appear:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 14: illegal multibyte sequence
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 3: invalid start byte

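As the messages indicate, the problem is the encoding of the TXT files: the encoding Python uses to read a file does not match the encoding the file was saved in. The first error typically appears when open() is called without an encoding argument and Python falls back to the system default, which is GBK here. The second appears when encoding='utf-8' is given but the file is not UTF-8. So only the line that opens the file needs to change. The relevant lines in the Hamlet and Three Kingdoms scripts are: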
Hamlet

txt = open("hamlet.txt", "r").read()

ThreeKingdomV1

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()

ThreeKingdomV2

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
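
Both kinds of mismatch can be reproduced without the text files by encoding a short string one way and decoding it the other way. The following is a minimal sketch; the helper name try_decode and the sample string are assumptions made for illustration.

```python
def try_decode(data, encoding):
    """Decode bytes with the given encoding; return None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        print("{0}: {1}".format(encoding, err.reason))
        return None

text = "三國演義"

# GBK bytes read as UTF-8: decoding fails, like the second error above.
print(try_decode(text.encode("gbk"), "utf-8"))

# UTF-8 bytes read as GBK: this may fail like the first error, or it may
# decode without error into unrelated characters, depending on the bytes.
print(try_decode(text.encode("utf-8"), "gbk") == text)

# Matching encodings round-trip cleanly.
print(try_decode(text.encode("utf-8"), "utf-8") == text)
```

The middle case is the reason a read that raises no error is not proof that the encoding is right: the text can come back as mojibake instead.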

2.2 Check the TXT encoding and change the code

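To check a TXT file's encoding, open the file and choose Save As: the dialog shows the document's current encoding. The encoding can also be changed there. The key point is that the encoding of the file and the encoding the code reads it with must be the same.
The TXT files I uploaded are all UTF-8 encoded, so the code only needs to be changed to the following lines to run successfully.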
Hamlet

txt = open("hamlet.txt", "r", encoding='utf-8').read()

ThreeKingdomV1

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()

ThreeKingdomV2

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
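
When the encoding of a file is not known in advance, one option is to try a short list of candidate encodings in code. The sketch below assumes the file is either UTF-8 or GBK; the function name read_text and the candidate list are hypothetical, and a temporary file stands in for the real text.

```python
import os
import tempfile

def read_text(path, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:          # read raw bytes, decode by hand
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue                      # try the next candidate
    raise ValueError("none of {0} could decode {1}".format(encodings, path))

# Usage: write a GBK file to a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="gbk") as f:
        f.write("三國演義")
    text, enc = read_text(path)
    print(enc, text)
```

UTF-8 is tried first because it is strict and rejects most non-UTF-8 Chinese text, whereas GBK accepts many byte sequences that are not really GBK. A successful decode is therefore a good hint, not a guarantee.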