
Fear the REAPER: A System for Automatic Multi-Document Summarization with Reinforcement Learning

Cody Rioux, Sadid A. Hasan, Yllias Chali

##Abstract

  • Achieve the largest coverage of the documents' content.
  • Concentrate distributed information into hidden units layer by layer.
  • The whole deep architecture is fine-tuned by minimizing the information loss of reconstruction validation.
  • According to the concentrated information, dynamic programming is used to seek the most informative set of sentences as the summary.
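The last bullet mentions dynamic programming for picking the most informative sentence set. As a rough illustration only, here is a minimal knapsack-style DP that selects sentences under a word budget; the per-sentence informativeness scores, word counts, and budget are hypothetical stand-ins, not the formulation used in the paper.

```python
from typing import List, Tuple

def select_sentences(scores: List[float], lengths: List[int], budget: int) -> List[int]:
    """Knapsack-style DP: choose the sentence subset with the highest total
    informativeness score without exceeding the word budget."""
    best: List[Tuple[float, List[int]]] = [(0.0, []) for _ in range(budget + 1)]
    for i, (score, length) in enumerate(zip(scores, lengths)):
        # Iterate budgets backwards so each sentence is used at most once.
        for b in range(budget, length - 1, -1):
            candidate = best[b - length][0] + score
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - length][1] + [i])
    return best[budget][1]

# Toy usage: three sentences with scores and word counts, and a 20-word budget.
print(select_sentences([3.0, 2.0, 2.5], [12, 8, 10], 20))  # -> [0, 1]
```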
##Related Work
  • We explore the use of SARSA, which is a derivative of TD(λ) that models the action space in addition to the state space modelled by TD(λ). Furthermore, we explore the use of an algorithm that is not based on temporal difference methods, but instead on policy iteration techniques.
  • REAPER (Relatedness-focused Extractive Automatic summary Preparation Exploiting Reinforcement learning)
##Motivation
TD(λ) is relatively old as far as reinforcement learning (RL) algorithms are concerned, and the optimal ILP did not outperform ASRL using the same reward function.
Reinforcement learning therefore still has a lot of room for improvement here.
Query-focused summarization has received widespread attention.
The effect of sentence compression is not explored any further.
##Model
  • TD(λ)
    Temporal difference (TD) learning is a prediction-based machine learning method. It is mainly used for reinforcement learning problems and has been described as "a combination of Monte Carlo ideas and dynamic programming (DP) ideas".[1] TD resembles Monte Carlo methods in that it learns by sampling the environment according to some policy, and it is related to dynamic programming techniques in that it approximates its current estimate from previously learned estimates (known as bootstrapping). TD learning algorithms are also related to the temporal difference model of animal learning.[2]
    (Source: temporal difference methods, Wikipedia. A minimal tabular update sketch follows after this list.)
  • Approximate Policy Iteration
    Approximate policy iteration (API) follows a different paradigm: it iteratively improves the policy of a Markov decision process until the policy converges. (A tabular policy-iteration sketch follows after this list.)
  • SARSA
    When choosing the next step, Q-learning looks for the best action to take (the one with the largest Q value), whereas SARSA uses the action that the same policy actually takes at the next step; in both cases the estimate for the previous step is then updated, which is how learning takes place.
    (Reference: comparison of on-policy SARSA and off-policy Q-learning. A minimal SARSA update sketch follows after this list.)
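To make the TD(λ) item above concrete, here is a minimal tabular TD(λ) value update with accumulating eligibility traces. The state representation, rewards, and hyper-parameters below are generic placeholders, not REAPER's actual feature space or reward function.

```python
from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating eligibility traces.

    episode: list of (state, reward, next_state) transitions.
    V: defaultdict(float) mapping state -> estimated value, updated in place.
    """
    traces = defaultdict(float)
    for state, reward, next_state in episode:
        delta = reward + gamma * V[next_state] - V[state]  # one-step TD error
        traces[state] += 1.0                                # bump the eligibility of the visited state
        for s in list(traces):
            V[s] += alpha * delta * traces[s]               # spread the error over recently visited states
            traces[s] *= gamma * lam                        # decay all traces
    return V

# Toy usage: a 3-step episode over string-labelled states ending in a terminal state.
V = defaultdict(float)
td_lambda_episode([("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, "end")], V)
print(dict(V))
```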
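For the Approximate Policy Iteration item, the classic tabular policy iteration loop (evaluation followed by greedy improvement, repeated until the policy stops changing) conveys the paradigm. The toy MDP below is an assumption for illustration; the paper's API variant works with approximation and sampled rollouts rather than a fully known model.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Tabular policy iteration on a fully known MDP.

    P[s][a]: list of (probability, next_state) pairs; R[s][a]: immediate reward.
    """
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: apply the Bellman expectation update until convergence.
        while True:
            delta = 0.0
            for s in states:
                v = R[s][policy[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V

# Toy 2-state MDP: "go" jumps to s1 for reward 1, "stay" keeps the current state for reward 0.
S, A = ["s0", "s1"], ["stay", "go"]
P = {s: {"stay": [(1.0, s)], "go": [(1.0, "s1")]} for s in S}
R = {s: {"stay": 0.0, "go": 1.0} for s in S}
print(policy_iteration(S, A, P, R))
```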
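And for the SARSA item, a minimal on-policy tabular update over state-action pairs. The epsilon-greedy selection, the `env_step` interface, and the tabular Q dictionary are illustrative assumptions rather than the paper's summary-construction state/action space.

```python
import random
from collections import defaultdict

def sarsa_episode(env_step, start_state, actions, Q,
                  alpha=0.1, gamma=1.0, epsilon=0.1, max_steps=100):
    """Run one episode of tabular SARSA.

    env_step(state, action) -> (reward, next_state, done)
    Q: defaultdict(float) keyed by (state, action), updated in place.
    """
    def choose(state):
        # Epsilon-greedy action selection over the current Q estimates.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    state = start_state
    action = choose(state)
    for _ in range(max_steps):
        reward, next_state, done = env_step(state, action)
        next_action = choose(next_state)
        # On-policy target: uses the action actually taken next (Q-learning would use the max instead).
        target = reward + gamma * (0.0 if done else Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        if done:
            break
        state, action = next_state, next_action
    return Q
```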
##Experiment
  • Feature space: depends on the presence of top bigrams rather than on tf*idf words (see the feature sketch after this list).
  • Reward function: based on the n-gram co-occurrence score metric and the longest-common-subsequence recall metric (see the reward sketch after this list).
  1. Immediate Rewards
  2. Query Focused Rewards
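The feature-space bullet says the feature vector depends on the presence of top bigrams rather than tf*idf words. A rough sketch of extracting the most frequent document bigrams and checking which of them a candidate summary covers; the whitespace tokenization and the cutoff of 100 bigrams are assumptions, not the paper's exact preprocessing.

```python
from collections import Counter
from typing import List, Tuple

def top_bigrams(documents: List[str], k: int = 100) -> List[Tuple[str, str]]:
    """Most frequent word bigrams across the document cluster."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return [bigram for bigram, _ in counts.most_common(k)]

def bigram_presence_features(summary: str, bigrams: List[Tuple[str, str]]) -> List[int]:
    """Binary feature vector: 1 if a top bigram appears in the candidate summary, else 0."""
    tokens = summary.lower().split()
    present = set(zip(tokens, tokens[1:]))
    return [1 if bigram in present else 0 for bigram in bigrams]

# Toy usage on two short "documents" and one candidate summary.
docs = ["the agent builds the summary from the documents", "the agent selects sentences"]
features = bigram_presence_features("the agent selects the summary", top_bigrams(docs, k=5))
print(features)
```

The reward bullet references the n-gram co-occurrence score and the longest-common-subsequence recall metric, i.e. ROUGE-N-style and ROUGE-L-style recall against a reference summary. A simplified, hedged version of both on whitespace tokens; the real metrics involve stemming, stopword handling, and multiple references.

```python
from collections import Counter

def ngram_recall(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N-style recall: clipped n-gram overlap divided by the reference n-gram count."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(zip(*[tokens[i:] for i in range(n)]))
    cand, ref = ngrams(candidate), ngrams(reference)
    if not ref:
        return 0.0
    overlap = sum(min(cand[g], count) for g, count in ref.items())
    return overlap / sum(ref.values())

def lcs_recall(candidate: str, reference: str) -> float:
    """ROUGE-L-style recall: longest common subsequence length over the reference length."""
    a, b = candidate.lower().split(), reference.lower().split()
    if not b:
        return 0.0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)] / len(b)

# Toy usage: a candidate summary scored against a single reference.
print(ngram_recall("the cat sat on the mat", "the cat is on the mat"))  # bigram recall
print(lcs_recall("the cat sat on the mat", "the cat is on the mat"))
```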
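In the query-focused setting, a natural (assumed) extension of the sketch above is to mix the reference-based recall with a term-overlap score between the summary and the query, which is one common way such rewards are built; the paper's exact weighting is not reproduced here.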