
Implementing Keyword Matching with the DFA Algorithm


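Motivation: for pages crawled from the web, we need to decide whether each page matches a preset list of keywords (that is, whether it contains any of them) and return every keyword that matched.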
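To make the requirement concrete, here is a naive baseline, offered only as a sketch: it tests each keyword with Python's `in` operator. The function name `naive_match` and the sample strings are hypothetical and not part of the original post. It returns the right answer but scans the text once per keyword, which is the cost an automaton-based matcher is meant to avoid.

```python
def naive_match(text, keywords):
    """Return the set of keywords that occur anywhere in text.

    Hypothetical baseline: one substring scan per keyword.
    Empty keywords are skipped so they never count as a match.
    """
    return {kw for kw in keywords if kw and kw in text}


if __name__ == "__main__":
    page = "the police station sent an officer"
    for kw in sorted(naive_match(page, ["police", "officer", "friend"])):
        print(kw)
```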


PyPI currently has two implementations:

ahocorasick
https://pypi.python.org/pypi/ahocorasick/0.9
esmre
https://pypi.python.org/pypi/esmre/0.3.1

In practice, though, both packages are built on a DFA.
The source code for a simple version is given below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


class Node(object):
    def __init__(self):
        # Maps the next character to its child Node; None until a child is added
        self.children = None
        # True when the path from the root to this node spells a keyword
        self.flag = False


# word is a str (Unicode text)
def add_word(root, word):
    if len(word) <= 0:
        return
    node = root
    for ch in word:
        if node.children is None:
            node.children = {}
        if ch not in node.children:
            node.children[ch] = Node()
        node = node.children[ch]
    node.flag = True


def init(word_list):
    root = Node()
    for line in word_list:
        add_word(root, line)
    return root


# message is a str (Unicode text)
def key_contain(message, root):
    res = set()
    for i in range(len(message)):
        p = root
        j = i
        while j < len(message) and p.children is not None and message[j] in p.children:
            p = p.children[message[j]]
            j = j + 1
            # Check the flag after every step, so a keyword is still reported
            # when it is a prefix of a longer keyword that fails to match
            if p.flag:
                res.add(message[i:j])
    return res


def dfa():
    print('----------------dfa-----------')
    word_list = ['hello', '民警', '朋友', '女兒', '派出所', '派出所民警']
    root = init(word_list)
    message = '四處亂咬亂吠,嚇得家中11歲的女兒躲在屋裏不敢出來,直到轄區派出所民警趕到後,才將孩子從屋中救出。最後在征得主人允許後,民警和村民合力將這僅僅發瘋的狗打死'
    x = key_contain(message, root)
    for item in x:
        print(item)


if __name__ == '__main__':
    dfa()
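As a possible variation on the listing above, the following sketch builds the same kind of character trie from nested dicts and also records where each hit starts. The names `build_trie`, `find_all`, and the `END` sentinel are hypothetical and do not come from the original post. Like the listing, it restarts from the root at every offset; it does not add the failure links of a full Aho-Corasick automaton.

```python
END = object()  # hypothetical sentinel key: "a keyword ends at this node"


def build_trie(keywords):
    """Build a nested-dict trie; each key is one character of a keyword."""
    root = {}
    for kw in keywords:
        if not kw:
            continue  # ignore empty keywords
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_all(text, trie):
    """Return (start_offset, keyword) for every keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no keyword continues with this character
            if END in node:
                hits.append((start, text[start:end + 1]))
    return hits


if __name__ == "__main__":
    trie = build_trie(["he", "she", "hers"])
    for start, kw in find_all("ushers", trie):
        print(start, kw)
```

Returning offsets instead of a set keeps repeated occurrences distinct; wrapping the result in `{kw for _, kw in hits}` recovers the set-of-keywords form used in the listing.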

See also this other article of mine:
http://blog.csdn.net/woshiaotian/article/details/10047675
