fastText: a library for efficient learning of word representations and sentence classification

阿新 • Published: 2017-07-14

fastText

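fastText is an open-source library developed by Facebook for efficient learning of word representations and sentence classification.

Requirements

fastText uses C++11 features, so it requires a compiler with good C++11 support. These include:

(gcc-4.6.3 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need a working make. To run the word-similarity evaluation script you will also need: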

python 2.6 or newer
numpy & scipy

Building fastText

To build the fastText library, use the following commands:

$ git clone git@github.com:facebookresearch/fastText.git
$ cd fastText
$ make

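This will produce object files for all the classes as well as the main binary fasttext. If you do not plan to use the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Example use cases

This library has two main use cases: word representation learning and text classification. These are described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do: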

$ ./fasttext skipgram -input data.txt -output model

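where data.txt is a training file containing UTF-8 encoded text. By default the word vectors take into account character n-grams of 3 to 6 characters. At the end of optimization the program saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.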

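Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command: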

$ ./fasttext print-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-vectors model.bin

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

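will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. To train a text classifier using the method described in [2], use: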

$ ./fasttext supervised -input train.txt -output model

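where train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__. This command will output two files: model.bin and model.vec. Once the model has been trained, you can evaluate it by computing the precision at 1 on a test set using: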

$ ./fasttext test model.bin test.txt

In order to obtain the most likely label for a piece of text, use:

$ ./fasttext predict model.bin test.txt

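where test.txt contains one piece of text to classify per line. Doing so will print the most likely label for each line to the standard output. See classification-example.sh for an example use case. To reproduce the results from paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.

Full documentation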

The following arguments are mandatory:
  -input      training file path
  -output     output file path

The following arguments are optional:
  -lr         learning rate [0.05]
  -dim        size of word vectors [100]
  -ws         size of the context window [5]
  -epoch      number of epochs [5]
  -minCount   minimal number of word occurences [1]
  -neg        number of negatives sampled [5]
  -wordNgrams max length of word ngram [1]
  -loss       loss function {ns, hs, softmax} [ns]
  -bucket     number of buckets [2000000]
  -minn       min length of char ngram [3]
  -maxn       max length of char ngram [6]
  -thread     number of threads [12]
  -verbose    how often to print to stdout [1000]
  -t          sampling threshold [0.0001]
  -label      labels prefix [__label__]

References

Please cite [1] if using this code for learning word representations or [2] if using it for text classification.

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

Facebook page: https://www.facebook.com/groups/1174547215919768
Contact: [email protected], [email protected], [email protected], [email protected]

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Requirements

fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. These include:

(gcc-4.6.3 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. For the word-similarity evaluation script you will need:

python 2.6 or newer
numpy & scipy

Building fastText

In order to build fastText, use the following:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Example use cases

This library has two main use cases: word representation learning and text classification. These are described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
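To make the "character n-grams from 3 to 6 characters" default more concrete, here is a minimal Python sketch of subword extraction in the spirit of [1]. It assumes the word is wrapped in the boundary markers < and > before slicing; the function name char_ngrams is hypothetical, and the real implementation may differ in details such as how multi-byte UTF-8 characters are handled.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` with lengths min_n..max_n."""
    # Assumed convention: boundary markers let prefixes and suffixes
    # be told apart from substrings in the middle of a word.
    token = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the marked-up word.
        for start in range(len(token) - n + 1):
            ngrams.append(token[start:start + n])
    return ngrams


# With n fixed to 3, "where" yields: <wh, whe, her, ere, re>
print(char_ngrams("where", 3, 3))
```

Because a word's vector is built from its n-grams, a word that never appeared in training can still be given a vector from the n-grams it shares with known words.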

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-vectors model.bin
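The one-vector-per-line output can be loaded into a program with a few lines of Python. The sketch below assumes each line is a word followed by its space-separated components; that layout is an assumption rather than something stated above, and the helper names parse_vector_line and read_vectors are hypothetical.

```python
def parse_vector_line(line):
    """Split one 'word v1 v2 ... vN' line into (word, [floats])."""
    parts = line.strip().split(" ")
    word, values = parts[0], parts[1:]
    # Skip empty fields that repeated spaces would otherwise produce.
    return word, [float(v) for v in values if v]


def read_vectors(lines):
    """Build a {word: vector} dict from an iterable of output lines."""
    vectors = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, vector = parse_vector_line(line)
        vectors[word] = vector
    return vectors


# Hypothetical sample lines standing in for real program output.
sample = ["king 0.1 -0.2 0.3 \n", "queen 0.0 0.5 -1.5\n"]
vectors = read_vectors(sample)
```

Passing sys.stdin to read_vectors in place of the sample list would let the script sit at the end of the pipe shown above.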

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].
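Word-similarity benchmarks of this kind are commonly scored by taking the cosine similarity of the two word vectors in each pair and comparing those scores with human ratings. The evaluation script itself is not shown here, so the following is only a sketch of the cosine step, using the standard library; the function name cosine_similarity and the zero-vector convention are choices made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)
```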

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model has been trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
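As a rough illustration of what precision and recall at k measure, the Python sketch below pools counts over all test examples: precision is the share of the top-k predicted labels that are correct, and recall is the share of true labels that appear among the top-k predictions. Pooling the counts this way is an assumption about the metric, and the function name precision_recall_at_k is hypothetical.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels.
    predicted: list of label lists, ranked most likely first."""
    correct = 0      # top-k predictions that are true labels
    n_predicted = 0  # total number of top-k predictions made
    n_gold = 0       # total number of true labels
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall
```

With exactly one true label per example and k equal to 1, the two numbers coincide and both equal plain accuracy.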

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

where test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. In order to reproduce the results from paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.
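The labelled line format used for the training and test files (one example per line, labels marked with the __label__ prefix) can be produced and read back with a small amount of Python. This is a sketch only: placing the labels at the start of the line is a convention chosen here, and the helper names format_example and split_example are hypothetical.

```python
LABEL_PREFIX = "__label__"


def format_example(text, labels, prefix=LABEL_PREFIX):
    """Render one example as a single line: labels first, then the text."""
    # Collapse all whitespace so an embedded newline cannot split the example.
    clean = " ".join(text.split())
    tagged = [prefix + label for label in labels]
    return " ".join(tagged + [clean])


def split_example(line, prefix=LABEL_PREFIX):
    """Recover (labels, text) from a line written by format_example."""
    tokens = line.split()
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, " ".join(words)


line = format_example("great  movie\n", ["positive"])
```

If a different prefix is passed to the trainer through the -label option, the same prefix would need to be passed to both helpers.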

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -minCount           minimal number of word occurences [1]
  -minCountLabel      minimal number of label occurences [0]
  -neg                number of negatives sampled [5]
  -wordNgrams         max length of word ngram [1]
  -loss               loss function {ns, hs, softmax} [ns]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -thread             number of threads [12]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]
  -verbose            verbosity level [2]
  -pretrainedVectors  pretrained word vectors for supervised learning []

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
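When training is driven from a script, the argument list can be assembled programmatically. The Python sketch below builds the argv list for the binary using only option names that appear in the listing above; the helper name build_command is hypothetical, and it does not validate option names or values.

```python
def build_command(mode, input_path, output_path, binary="./fasttext", **options):
    """Return an argv list, e.g. ['./fasttext', 'supervised', '-input', ...]."""
    argv = [binary, mode, "-input", input_path, "-output", output_path]
    # Each keyword argument becomes '-name value', in the order given.
    for name, value in options.items():
        argv += ["-" + name, str(value)]
    return argv


cmd = build_command("supervised", "train.txt", "model",
                    lr=0.1, epoch=5, wordNgrams=2)
# To launch training, pass the list to subprocess.run(cmd, check=True).
```

Building a list rather than a single shell string avoids quoting problems when file paths contain spaces.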

References

Please cite [1] if using this code for learning word representations or [2] if using it for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

(* These authors contributed equally.)

Resources

You can find the preprocessed YFCC100M data used in [2] at https://research.facebook.com/research/fasttext/

Join the fastText community

Facebook page: https://www.facebook.com/groups/1174547215919768
Google group: https://groups.google.com/forum/#!forum/fasttext-library
Contact: [email protected], [email protected], [email protected], [email protected]

See the CONTRIBUTING file for information about how to help out.
