Hierarchical Attention Network for Document Classification -- TensorFlow Implementation

阿新 • Published: 2019-01-12

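Last week we walked through the model architecture from the paper Hierarchical Attention Network for Document Classification. This week I found time to implement it in TensorFlow, so this post explains, from the code's point of view, how to build the HAN model for text classification.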

Dataset

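First, the dataset. The paper uses several fairly large datasets, including IMDB movie ratings and Yelp restaurant reviews. I settled on yelp2013, but finding it left me completely lost at first: every download link in the related papers and materials points to the official Yelp site, where I could not find the data anywhere, and searching elsewhere turned up nothing either. I finally found a copy on GitHub. (What am I even writing here? This is not padding the word count, just venting about how hard the data was to find.) The link is: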

https://github.com/rekiksab/Yelp/tree/master/yelp_challenge/yelp_phoenix_academic_dataset
The repository seems to hold more than one dataset (user, business, and a few others), but they are not needed here. Each line of the review file is one review stored as JSON, with the fields votes, user_id, review_id, stars, date, text, type, and business_id. For this task only stars (the rating) and text (the review body) are needed, so I extract those and save them as the dataset. A sample record looks like this:

{"votes": {"funny": 0, "useful": 5, "cool": 2}, "user_id": "rLtl8ZkDX5vH5nAx9C3q5Q", "review_id": "fWKvX83p0-ka4JS3dc6E5A", "stars": 5, "date": "2011-01-26", "text": "My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best \"toast\" I've ever had.\n\nAnyway, I can't wait to go back!", "type": "review", "business_id": "9yKzy9PApeiPPOUJEtnvkg"}
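
As a small illustration of the point above, pulling out just the two fields used in this task might look like the following. This is only a sketch: the record is a shortened, made-up line in the same shape as the sample, and extract_example is a hypothetical helper name, not part of the original code.

```python
import json

# A shortened, made-up record with the same field layout as the sample above
line = ('{"votes": {"funny": 0, "useful": 5, "cool": 2}, "stars": 5, '
        '"date": "2011-01-26", "text": "It was excellent.", "type": "review"}')

def extract_example(json_line):
    """Return (text, stars) from one JSON-encoded review line (hypothetical helper)."""
    review = json.loads(json_line)
    return review["text"], int(review["stars"])

text, stars = extract_example(line)
print(stars, text)
```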

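For preprocessing I simplified things: every review is converted to a 30*30 matrix. This is not strictly necessary. It would be enough to truncate anything longer than 30 and leave anything shorter than 30 unpadded, but then each batch needs its own maximum length and the real size of every sample has to be tracked. I have not fully worked that part out yet; I will add it when I have time. This will do for now.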
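
One possible way to do the per-batch variant mentioned above is sketched here. It is not what the code in this post does: pad_batch and its parameters are hypothetical names, and it assumes each document is a list of sentences and each sentence a list of word indices, with 0 as the padding index.

```python
def pad_batch(docs, pad_id=0, max_sents=30, max_words=30):
    """Truncate, then pad a batch to the largest size found in this batch.

    Returns (padded, sent_counts, word_counts):
      padded      -- [batch][n_sents][n_words] nested list of word ids
      sent_counts -- real number of sentences in each document
      word_counts -- real number of words in each sentence (0 for padded sentences)
    """
    # Truncate long documents and long sentences first
    docs = [[sent[:max_words] for sent in doc[:max_sents]] for doc in docs]
    # Batch-level maxima; the extra [1] keeps the shapes non-empty for empty input
    n_sents = max([len(doc) for doc in docs] + [1])
    n_words = max([len(sent) for doc in docs for sent in doc] + [1])

    padded, sent_counts, word_counts = [], [], []
    for doc in docs:
        sent_counts.append(len(doc))
        word_counts.append([len(sent) for sent in doc] + [0] * (n_sents - len(doc)))
        rows = [sent + [pad_id] * (n_words - len(sent)) for sent in doc]
        rows += [[pad_id] * n_words for _ in range(n_sents - len(doc))]
        padded.append(rows)
    return padded, sent_counts, word_counts

batch = [[[4, 7, 9], [3]], [[5, 2]]]
padded, sent_counts, word_counts = pad_batch(batch)
print(padded)       # every document is now 2 sentences x 3 words
print(sent_counts)  # real sentence counts
print(word_counts)  # real word counts per sentence
```

The returned counts are what a model would need as sequence lengths if the padding itself cannot be detected from the inputs.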

# coding=utf-8
import json
import pickle
import nltk
from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Use NLTK's sentence splitter and word tokenizer
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
word_tokenizer = WordPunctTokenizer()

# Map each word to the number of times it occurs
word_freq = defaultdict(int)

# Read the dataset, tokenize each review, and count word occurrences in word_freq
with open('yelp_academic_dataset_review.json', 'rb') as f:
    for line in f:
        review = json.loads(line)
        words = word_tokenizer.tokenize(review['text'])
        for word in words:
            word_freq[word] += 1

    print("load finished")

# Save the word-frequency table
with open('word_freq.pickle', 'wb') as g:
    pickle.dump(word_freq, g)
    print(len(word_freq))  # 159654
    print("word_freq save finished")

num_classes = 5
# Sort words by descending frequency and show the 10 most and 10 least frequent
sort_words = list(sorted(word_freq.items(), key=lambda x:-x[1]))
print(sort_words[:10], sort_words[-10:])

# Build the vocabulary; words that occur 5 times or fewer are dropped and treated as UNKNOWN
vocab = {}
i = 1
vocab['UNKNOW_TOKEN'] = 0
for word, freq in word_freq.items():
    if freq > 5:
        vocab[word] = i
        i += 1
print(i)
UNKNOWN = 0

data_x = []
data_y = []
max_sent_in_doc = 30
max_word_in_sent = 30

# Convert every review into a 30*30 index matrix: 30 sentences per document, 30 words per sentence.
# Short ones are zero-padded, long ones are truncated, and the result is saved to the final dataset file.
# Note that the padding index 0 is shared with UNKNOWN.
with open('yelp_academic_dataset_review.json', 'rb') as f:
    for line in f:
        review = json.loads(line)
        sents = sent_tokenizer.tokenize(review['text'])
        doc = []
        for sent in sents[:max_sent_in_doc]:
            words = word_tokenizer.tokenize(sent)[:max_word_in_sent]
            word_to_index = [vocab.get(word, UNKNOWN) for word in words]
            # Pad the sentence with zeros up to max_word_in_sent
            word_to_index += [0] * (max_word_in_sent - len(word_to_index))
            doc.append(word_to_index)
        # Pad the document with all-zero sentences up to max_sent_in_doc
        doc += [[0] * max_word_in_sent for _ in range(max_sent_in_doc - len(doc))]

        # One-hot encode the 1-5 star rating
        label = int(review['stars'])
        labels = [0] * num_classes
        labels[label-1] = 1
        data_y.append(labels)
        data_x.append(doc)
    with open('yelp_data', 'wb') as out:
        pickle.dump((data_x, data_y), out)
    print(len(data_x))  # 229907

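After preprocessing we have 229907 documents in total, each a 30*30 matrix of word indices, so at read time the embedding matrix E turns the indices straight into word vectors, which saves a lot of trouble. We still need a function that loads the saved data into memory. It is very simple: a pickle load followed by a 9:1 split into a training set and a validation set. I think 9:1 leaves the validation set rather large (more than 20,000 samples), but that is how the paper does it, so I will ignore this detail and follow the paper's setting. The code is as follows: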

import pickle

def read_dataset():
    # Load the preprocessed data and split it 9:1 into training and dev sets
    with open('yelp_data', 'rb') as f:
        data_x, data_y = pickle.load(f)
    split = int(len(data_x) * 0.9)
    # Slice at the same index on both sides so that no sample is skipped
    train_x, dev_x = data_x[:split], data_x[split:]
    train_y, dev_y = data_y[:split], data_y[split:]
    return train_x, train_y, dev_x, dev_y

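With this function the whole dataset can be loaded in one call at training time. Next, let's look at how the model architecture is implemented.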

Model Implementation

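Following the architecture described in the previous post, and with the two figures below for reference, it is easy to see that the model has two levels, the sentence level and the document level, and that each level consists of an encoder and an attention layer.
[Figure: model architecture]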

The code is shown below. It mainly uses tf.nn.bidirectional_dynamic_rnn() to build the bidirectional GRU, and the attention layer is simply an MLP followed by a softmax, which is also easy to follow.

#coding=utf8

import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import layers

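def length(sequences):
    # Return the length of each sequence in the batch. A time step counts as used
    # when its feature vector has any non-zero entry, so padding must be all zeros.
    used = tf.sign(tf.reduce_max(tf.abs(sequences), axis=2))
    seq_len = tf.reduce_sum(used, axis=1)
    return tf.cast(seq_len, tf.int32)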

class HAN():

    def __init__(self, vocab_size, num_classes, embedding_size=200, hidden_size=50):

        self.vocab_size = vocab_size
        self.num_classes = num_classes
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size

        with tf.name_scope('placeholder'):
            self.max_sentence_num = tf.placeholder(tf.int32, name='max_sentence_num')
            self.max_sentence_length = tf.placeholder(tf.int32, name='max_sentence_length')
            self.batch_size = tf.placeholder(tf.int32, name='batch_size')
            # x has shape [batch_size, number of sentences, sentence length (number of words)];
            # these sizes differ from sample to sample, so they are left as None
            # y has shape [batch_size, num_classes]
            self.input_x = tf.placeholder(tf.int32, [None, None, None], name='input_x')
            self.input_y = tf.placeholder(tf.float32, [None, num_classes], name='input_y')

        # Build the model
        word_embedded = self.word2vec()
        sent_vec = self.sent2vec(word_embedded)
        doc_vec = self.doc2vec(sent_vec)
        out = self.classifer(doc_vec)

        self.out = out


    def word2vec(self):
        # Embedding layer
        with tf.name_scope("embedding"):
            embedding_mat = tf.Variable(tf.truncated_normal((self.vocab_size, self.embedding_size)))
            # shape: [batch_size, sent_in_doc, word_in_sent, embedding_size]
            word_embedded = tf.nn.embedding_lookup(embedding_mat, self.input_x)
        return word_embedded

    def sent2vec(self, word_embedded):
        with tf.name_scope("sent2vec"):
            # The GRU input tensor is [batch_size, max_time, ...]. When building sentence vectors,
            # max_time should be the sentence length, so batch_size * sent_in_doc is treated as the
            # batch size. Each GRU step then processes the vector of one word, and attention finally
            # merges all word vectors of a sentence into one sentence vector.

            # shape: [batch_size*sent_in_doc, word_in_sent, embedding_size]
            word_embedded = tf.reshape(word_embedded, [-1, self.max_sentence_length, self.embedding_size])
            # shape: [batch_size*sent_in_doc, word_in_sent, hidden_size*2]
            word_encoded = self.BidirectionalGRUEncoder(word_embedded, name='word_encoder')
            # shape: [batch_size*sent_in_doc, hidden_size*2]
            sent_vec = self.AttentionLayer(word_encoded, name='word_attention')
            return sent_vec

    def doc2vec(self, sent_vec):
        # Same idea as sent2vec: build one document vector from the vectors of all its sentences
        with tf.name_scope("doc2vec"):
            sent_vec = tf.reshape(sent_vec, [-1, self.max_sentence_num, self.hidden_size*2])
            # shape: [batch_size, sent_in_doc, hidden_size*2]
            doc_encoded = self.BidirectionalGRUEncoder(sent_vec, name='sent_encoder')
            # shape: [batch_size, hidden_size*2]
            doc_vec = self.AttentionLayer(doc_encoded, name='sent_attention')
            return doc_vec

    def classifer(self, doc_vec):
        # Final output layer: a fully connected layer producing the class logits
        with tf.name_scope('doc_classification'):
            out = layers.fully_connected(inputs=doc_vec, num_outputs=self.num_classes, activation_fn=None)
            return out

    def BidirectionalGRUEncoder(self, inputs, name):
        # Bidirectional GRU encoder. It encodes all word vectors of a sentence (or all sentence
        # vectors of a document) into one 2*hidden_size output vector per time step. The attention
        # layer then takes a weighted sum of these outputs to form the final sentence/document vector.
        # inputs has shape [batch_size, max_time, input_size]
        with tf.variable_scope(name):
            GRU_cell_fw = rnn.GRUCell(self.hidden_size)
            GRU_cell_bw = rnn.GRUCell(self.hidden_size)
            # fw_outputs and bw_outputs both have shape [batch_size, max_time, hidden_size]
            ((fw_outputs, bw_outputs), (_, _)) = tf.nn.bidirectional_dynamic_rnn(cell_fw=GRU_cell_fw,
                                                                                 cell_bw=GRU_cell_bw,
                                                                                 inputs=inputs,
                                                                                 sequence_length=length(inputs),
                                                                                 dtype=tf.float32)
            # outputs has shape [batch_size, max_time, hidden_size*2]
            outputs = tf.concat((fw_outputs, bw_outputs), 2)
            return outputs

    def AttentionLayer(self, inputs, name):
        # inputs is the GRU output, with shape [batch_size, max_time, encoder_size (hidden_size * 2)]
        with tf.variable_scope(name):
            # u_context is the context (importance) vector used to tell how much each word/sentence
            # matters to its sentence/document. Because the GRU is bidirectional, its length is 2*hidden_size.
            u_context = tf.Variable(tf.truncated_normal([self.hidden_size * 2]), name='u_context')
            # A fully connected layer maps the GRU output to its hidden representation;
            # h has shape [batch_size, max_time, hidden_size * 2]
            h = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh)
            # shape: [batch_size, max_time, 1]
            alpha = tf.nn.softmax(tf.reduce_sum(tf.multiply(h, u_context), axis=2, keep_dims=True), dim=1)
            # Before reduce_sum the shape is [batch_size, max_time, hidden_size*2]; after it, [batch_size, hidden_size*2]
            atten_output = tf.reduce_sum(tf.multiply(inputs, alpha), axis=1)
            return atten_output
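
To make the "MLP + softmax" attention step concrete, here is a small framework-free sketch of the same computation on a single sequence. It is an illustration only: attention_pool and the toy weights are made up for the example, and the real layer learns its weight matrix, bias, and context vector during training.

```python
import math

def attention_pool(states, weight, bias, u_context):
    """Pool T encoder states (each of length d) into one vector.

    states    -- list of T vectors of length d
    weight    -- d x d matrix as a list of rows; bias -- vector of length d
    u_context -- context vector of length d
    Returns (pooled_vector, alphas).
    """
    scores = []
    for h in states:
        # Hidden representation: u_t = tanh(W h_t + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weight, bias)]
        # Importance score: dot product with the context vector
        scores.append(sum(a * c for a, c in zip(u, u_context)))
    # Softmax over the time steps (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the original states
    d = len(states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, states)) for k in range(d)]
    return pooled, alphas

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled, alphas = attention_pool(states, identity, [0.0, 0.0], [1.0, 1.0])
print(alphas)  # the third state gets the largest weight
print(pooled)
```

The same function is applied twice in the model: once over the words of each sentence and once over the sentences of each document.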

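That covers the main model architecture. The idea is quite simple; the main goal here was to get familiar with how some of these operations are used. Next comes the training part.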
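
One detail of length() is worth checking. It counts a time step as real only if its feature vector contains a non-zero value. The sketch below reproduces that logic on plain nested lists (sequence_lengths is a hypothetical name). In the model above, index 0 is looked up in a randomly initialised embedding matrix, so padded positions probably do not stay all-zero, and in that case every sequence would be reported at full length.

```python
def sequence_lengths(batch):
    """batch: [batch_size][max_time][features] nested list of numbers.
    Returns, per sequence, how many time steps have any non-zero feature."""
    lengths = []
    for seq in batch:
        used = [1 if max(abs(v) for v in step) > 0 else 0 for step in seq]
        lengths.append(sum(used))
    return lengths

zero_padded = [
    [[0.5, -1.0], [0.2, 0.0], [0.0, 0.0]],  # 2 real steps, 1 all-zero pad
    [[0.1, 0.3], [0.0, 0.0], [0.0, 0.0]],   # 1 real step, 2 all-zero pads
]
print(sequence_lengths(zero_padded))  # → [2, 1]

# If padded steps carry non-zero values, they are counted as real steps
nonzero_padded = [[[0.5, -1.0], [0.2, 0.0], [0.3, 0.1]]]
print(sequence_lengths(nonzero_padded))  # → [3]
```

Two common ways around this are to compute the lengths from the integer indices in input_x, or to feed them in explicitly; which one suits this model is left open here.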

Model Training

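For the data input I originally planned to use the TFRecords approach mentioned in an earlier post, but when I actually tried it I found I was still not familiar enough with it: several attempts ended in small errors. I had read and understood other people's code, yet writing it myself was still hard, so I need to find time to study it more. In the end I went back to the old way of feeding the data batch by batch, which is at least simple and easy to understand.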
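
The training script imports a batch_iter helper from load_data, but its code is not shown in this post. A minimal sketch of what such a batch-by-batch reader could look like follows. The signature and the yielded tuple are assumptions made for illustration, not the original helper.

```python
import random

def batch_iter(data_x, data_y, batch_size, num_epochs, shuffle=True, seed=None):
    """Yield (epoch, x_batch, y_batch) tuples (hypothetical signature).

    The sample order is reshuffled at the start of every epoch when
    shuffle is True; x and y always stay aligned.
    """
    rng = random.Random(seed)
    indices = list(range(len(data_x)))
    for epoch in range(num_epochs):
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            chosen = indices[start:start + batch_size]
            yield epoch, [data_x[i] for i in chosen], [data_y[i] for i in chosen]

xs = list(range(10))
ys = [x * 10 for x in xs]
batches = list(batch_iter(xs, ys, batch_size=4, num_epochs=2, seed=0))
print(len(batches))  # 3 batches per epoch: sizes 4, 4 and 2
```

The last batch of an epoch can be smaller than batch_size, which is fine for a model whose placeholders leave the batch dimension as None.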

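Most of this part is repetitive code, so I will not go through it in detail. If anything is unclear, see the explanations of the training code in my earlier posts.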

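The one thing worth highlighting is gradient clipping. RNN models often suffer from exploding and vanishing gradients during training, so gradient clipping is commonly used when training them to keep overly large gradients from breaking the updates. Apart from that, the script is basically taken from dennybritz's CNN code.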

#coding=utf-8
import tensorflow as tf
import model
import time
import os
from load_data import read_dataset, batch_iter


# Data loading params
tf.flags.DEFINE_string("data_dir", "data/data.dat", "data directory")
tf.flags.DEFINE_integer("vocab_size", 46960, "vocabulary size")
tf.flags.DEFINE_integer("num_classes", 5, "number of classes")
tf.flags.DEFINE_integer("embedding_size", 200, "Dimensionality of word embedding (default: 200)")
tf.flags.DEFINE_integer("hidden_size", 50, "Dimensionality of GRU hidden layer (default: 50)")
tf.flags.DEFINE_integer("batch_size", 32, "Batch Size (default: 32)")
tf.flags.DEFINE_integer("num_epochs", 10, "Number of training epochs (default: 10)")
tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")
tf.flags.DEFINE_integer("evaluate_every", 100, "evaluate every this many batches")
tf.flags.DEFINE_float("learning_rate", 0.01, "learning rate")
tf.flags.DEFINE_float("grad_clip", 5, "grad clip to prevent gradient explode")

FLAGS = tf.flags.FLAGS

train_x, train_y, dev_x, dev_y = read_dataset()
print("data load finished")

with tf.Session() as sess:
    han = model.HAN(vocab_size=FLAGS.vocab_size,
                    num_classes=FLAGS.num_classes,
                    embedding_size=FLAGS.embedding_size,
                    hidden_size=FLAGS.hidden_size)

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=han.input_y,
                                                                      logits=han.out,
                                                                      name='loss'))
    with tf.name_scope('accuracy'):
        predict = tf.argmax(han.out, axis=1, name='predict')
        label = tf.argmax(han.input_y, axis=1, name='label')
        acc = tf.reduce_mean(tf.cast(tf.equal(predict, label), tf.float32))

    timestamp = str(int(time.time()))
    out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
    print("Writing to {}\n".format(out_dir))

    global_step = tf.Variable(0, trainable=False)
    optimizer = tf.train.AdamOptimizer(FLAGS.learning_rate)
    # Gradient clipping, commonly used with RNNs to keep gradients from growing too large
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), FLAGS.grad_clip)
    grads_and_vars = tuple(zip(grads, tvars))
    train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

    # Keep track of gradient values and sparsity (optional)
    grad_summaries = []
    for g, v in grads_and_vars:
        if g is not None:
            grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
            grad_summaries.append(grad_hist_summary)

    grad_summaries_merged = tf.summary.merge(grad_summaries)

    loss_summary = tf.summary.scalar('loss', loss)
    acc_summary = tf.summary.scalar('accuracy', acc)


    train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
    train_summary_dir = os.path.join(out_dir, "summaries", "train")
    train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

    dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
    dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
    dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

    checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
    checkpoint_prefix = os.path.join(checkpoint_dir, "model")
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)

    sess.run(tf.global_variables_initializer())

    def train_step(x_batch, y_batch):
        feed_dict = {
            han.input_x: x_batch,
            han.input_y: y_batch,
            han.max_sentence_num: 30,
            han.max_sentence_length: 30,
            han.batch_size: len(x_batch)
        }
        _, step, summaries, cost, accuracy = sess.run([train_op, global_step, train_summary_op, loss, acc], feed_dict)

        time_str = str(int(time.time()))
        print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, cost, accuracy))
        train_summary_writer.add_summary(summaries, step)

        return step

    def dev_step(x_batch, y_batch, writer=None):
        feed_dict = {
            han.input_x: x_batch,
            han.input_y: y_batch,
            han.max_sentence_num: 30,
            han.max_sentence_length: 30,
            han.batch_size: len(x_batch)
        }
        step, summaries, cost, accuracy = sess.run([global_step, dev_summary_op, loss, acc], feed_dict)
        time_str = str(int(time.time()))
        print("++++++++++++++++++dev++++++++++++++{}: step {}, loss {:g}, acc {:g}".format(time_str, step, cost, accuracy))
        if writer:
            writer.add_summary(summaries, step)

    for epoch in range(FLAGS.num_epochs):
        print('current epoch %s' % (epoch + 1))
        for i in range(0, len(train_x), FLAGS.batch_size):
            x = train_x[i:i + FLAGS.batch_size]
            y = train_y[i:i + FLAGS.batch_size]
            step = train_step(x, y)
            if step % FLAGS.evaluate_every == 0:
                dev_step(dev_x, dev_y, dev_summary_writer)
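
For intuition about the clipping step in the script above, here is a plain-Python sketch of global-norm clipping. It mirrors the idea behind tf.clip_by_global_norm (all gradients are rescaled by one common factor when their joint norm exceeds the threshold), but clip_by_global_norm_sketch is an illustrative stand-in, not the TensorFlow implementation.

```python
import math

def clip_by_global_norm_sketch(grads, clip_norm):
    """grads: list of gradient vectors (lists of floats).
    Returns (clipped_grads, global_norm)."""
    # Joint L2 norm over every element of every gradient
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale factor is 1 when the norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm_sketch(grads, clip_norm=5.0)
print(norm)     # 13.0
print(clipped)  # every entry scaled by 5/13
```

Because all gradients share one scale factor, the direction of the overall update is preserved; only its size is capped at grad_clip.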

Once the model has been trained, we can open TensorBoard to see how the training went.

Training Results

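Training is not slow, but not exactly fast either: on the lab server (64 GB of RAM) it gets through about 3 batches every 2 seconds. I started the run last night and went back to the dorm; when I came back I found I had forgotten to write the dev data to the summary. There is also no shuffle in each epoch yet, the run was not long, and I did no hyperparameter tuning, so the results only show a rough trend. In a few days, when I have time, I will run it again and adjust the parameters to see whether it improves. For now, here are a few screenshots.
[Figure: TensorBoard screenshots of the training run]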
