Weka Text Clustering (3): Converting Text to ARFF

阿新 • Published: 2019-01-11

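To run a cluster analysis with Weka, the text data must first be converted into the ARFF format that Weka understands. Instances is Weka's dataset class, and its toString method returns the data in ARFF format. For text clustering, an ARFF file looks like this: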

@relation patent

@attribute text string

@data
'Content of the first article'

'Content of the second article'

......
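
To make the layout above concrete, here is a minimal plain-Java sketch that writes the same structure by hand from a list of documents. It is only an illustration of the format, not Weka's own writer: the class name ArffSketch and its methods are hypothetical, and it assumes the relation name contains no spaces. Each document is wrapped in single quotes, with backslashes, quotes and line breaks escaped.

```java
import java.util.Arrays;
import java.util.List;

/** Hand-rolled illustration of the ARFF layout for a single string attribute (hypothetical helper). */
final class ArffSketch {

    /** Quotes one document as an ARFF string value, escaping backslashes, quotes and line breaks. */
    static String quote(String text) {
        String escaped = text
                .replace("\\", "\\\\")   // backslash first, so later escapes are not doubled
                .replace("'", "\\'")
                .replace("\r", "\\r")
                .replace("\n", "\\n");
        return "'" + escaped + "'";
    }

    /** Builds the full ARFF text: header, one string attribute, then one row per document. */
    static String toArff(String relation, List<String> documents) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n\n");
        sb.append("@data\n");
        for (String doc : documents) {
            sb.append(quote(doc)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Content of the first article", "It's the second article");
        System.out.print(toArff("patent", docs));
    }
}
```

In practice the Instances class produces this text for you; the sketch is only meant to show what each line of the file means.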

After some experimenting, I found three main ways to turn text into an Instances object.

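(1) Connect to a database. Weka's database support is poor: in my experience the Weka jar has to be unpacked, the parameters inside it edited, and the jar repacked before the connection works properly. Examples of these edits are easy to find on Baidu; here is one link: [link missing in the original]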

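(2) Call TextDirectoryLoader. This Loader ships with Weka; it reads the text files under a directory and converts them to ARFF format. It is very simple to call, but a few points need attention. The first is how the files are laid out: each document is stored in its own file, but the files cannot sit directly in the top-level directory. Instead, create one subdirectory per category of text and put the files there. For example, under "d:\text" you would create subdirectories such as "class1" and "class2" and place the text files inside them. Since the goal here is text clustering, all the texts can simply go into a single subdirectory. Continuing the example, here is the code that uses TextDirectoryLoader.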

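import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File("d:\\text"));   // top-level directory that contains the subdirectories
Instances dataRaw = loader.getDataSet();
dataRaw.setClassIndex(-1);                   // unset the class before clustering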

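This Instances object is the data in the form Weka needs. Instances loaded through TextDirectoryLoader carry a class attribute (taken from the subdirectory names), and the k-means clustering algorithm does not accept Instances with a class set, so the class index must be set to -1 before k-means can process the data. Note that setting the index to -1 only unsets the class; the attribute itself stays in the data.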
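
The directory layout described for TextDirectoryLoader can be prepared with a few lines of plain Java. The sketch below is only an illustration: the class name CorpusLayoutSketch, the file names doc1.txt, doc2.txt and the use of a temporary directory are all assumptions, and UTF-8 is assumed as the file encoding, so match it to whatever encoding your loader setup actually reads.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Lays out documents as root/subfolder/one-file-per-document (hypothetical helper). */
final class CorpusLayoutSketch {

    /** Writes each document to its own UTF-8 file under root/subfolder and returns that subfolder. */
    static Path writeCorpus(Path root, String subfolder, List<String> documents) throws IOException {
        Path dir = Files.createDirectories(root.resolve(subfolder));
        for (int i = 0; i < documents.size(); i++) {
            Path file = dir.resolve("doc" + (i + 1) + ".txt");
            Files.write(file, documents.get(i).getBytes(StandardCharsets.UTF_8));
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for d:\text
        Path root = Files.createTempDirectory("text");
        Path dir = writeCorpus(root, "class1", Arrays.asList("first article", "second article"));
        System.out.println("Top-level directory for the loader: " + root);
        System.out.println("Documents written to: " + dir);
    }
}
```

The top-level directory, not the subfolder, is the path to hand to setDirectory; for clustering a single subfolder such as class1 is enough.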

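(3) Build the Instances directly. This approach comes from reading the TextDirectoryLoader source: since it can read a directory and produce ARFF, it must build an Instances object internally. Looking at the source, it does so as follows: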

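import java.util.List;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

// Written against the Weka 3.6 API. From Weka 3.7 on, Instance is an interface,
// so new Instance(...) becomes new DenseInstance(...).
public Instances getStruct(List<String> list) {
    FastVector atts = new FastVector();
    // A null value list makes "text" a string attribute
    atts.addElement(new Attribute("text", (FastVector) null));
    Instances data = new Instances("patent", atts, 0);

    for (String str : list) {
        double[] newInst = new double[1];
        // For clarity, the word-segmentation code is omitted here
        newInst[0] = (double) data.attribute(0).addStringValue(str);
        data.add(new Instance(1.0, newInst));
    }

    return data;
}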

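The code above takes the texts to be clustered in list, with each String holding one document. To keep the structure clear, the word-segmentation step is omitted; in practice, segmentation should happen at that point. Below is a version that segments the text before converting it to ARFF.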

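// Agriculture is not defined in this article; it is assumed to be the author's document
// class with a getContent() method. ToAnalysis and FilterRecognition come from ansj.
public Instances getStruct(List<Agriculture> agList) {
    FastVector atts = new FastVector();
    atts.addElement(new Attribute("text", (FastVector) null));
    Instances data = new Instances("patent", atts, 0);

    // Initialize the FilterRecognition, the ansj utility class used to filter out stop words
    FilterRecognition stopFilter;
    try {
        stopFilter = InitStopWords();
    } catch (Exception e) {
        // Fail fast: a null filter would likely break the recognition() call below
        throw new IllegalStateException("Failed to load stop words", e);
    }

    for (Agriculture ag : agList) {
        System.out.println(ag.getContent());
        double[] newInst = new double[1]; // the class attribute is not counted

        // Segment the text, drop stop words, and join the remaining words with spaces
        String content = ToAnalysis.parse(ag.getContent())
                .recognition(stopFilter).toStringWithOutNature(" ");
        System.out.println("Segmentation result: " + content);
        newInst[0] = (double) data.attribute(0).addStringValue(content);
        data.add(new Instance(1.0, newInst));
    }
    return data;
}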

// Build the stop-word filter
public FilterRecognition InitStopWords() throws Exception {
    ArrayList<String> stopList = new ArrayList<String>();
    String stopWordTable = "src/stopwords.txt";

    // Read the stop-word file, one word per line; try-with-resources closes the reader even on error
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(stopWordTable), "UTF-8"))) {
        String stopWord;
        while ((stopWord = reader.readLine()) != null) {
            stopList.add(stopWord);
        }
    }

    FilterRecognition filterRecognition = new FilterRecognition();
    filterRecognition.insertStopWords(stopList);

    return filterRecognition;
}
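
For readers without ansj at hand, the stop-word step can be pictured with a plain-Java stand-in. This sketch does not use ansj and does not segment Chinese text; it assumes the tokens are already segmented, and the class name StopWordSketch and its method are hypothetical. It only shows the idea of dropping stop words and joining the remaining words with spaces, which is the shape of string the code above stores in the text attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java stand-in for the stop-word step; it does not use ansj (hypothetical helper). */
final class StopWordSketch {

    /** Drops stop words from already-segmented tokens and joins the rest with single spaces. */
    static String removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "a"));
        List<String> tokens = Arrays.asList("the", "content", "of", "a", "patent");
        System.out.println(removeStopWords(tokens, stopWords)); // prints: content patent
    }
}
```

A HashSet keeps each stop-word lookup cheap, which matters once the stop-word file holds thousands of entries.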

That is how to convert text into ARFF format. Once you have reached this point, you are through the door to using Weka, a big step toward success.
