
Using LIBSVM 3.20 on Windows


I. Official resources

LIBSVM homepage: https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html

LIBSVM reference paper: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf

Official guide on using LIBSVM more effectively: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (well worth reading)

Datasets: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Binary-classification examples: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Multi-class classification examples: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

FAQ: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html (answers many common questions)

Useful tools: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ (the LIBLINEAR package mentioned in the guide is listed here)


II. Required software

libsvm-3.20: http://www.csie.ntu.edu.tw/~cjlin/libsvm/libsvm-3.20.zip

python-2.7.10: https://www.python.org/ftp/python/2.7.10/python-2.7.10.msi (needed to run the Python helper scripts)
gnuplot 5.0.1: http://jaist.dl.sourceforge.net/project/gnuplot/gnuplot/5.0.1/gp501-win32-mingw.exe (used to plot the parameter-search process)



III. Training workflow
(the commands below can be collected into a .bat file and run as a batch)

1. Extract the features into the LIBSVM data format (class_label feature_index:feature_value ...):
1 1:2.111 2:3.567 3:-0.125
...
0 1:2.156 2:3.259 3:0.258
...
Save the training samples to a file named train and the test samples to a file named test (the names are only for easy reference).
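This format can be produced from any in-memory feature matrix; below is a minimal Python sketch (the labels, features and output file name are hypothetical placeholders):

labels = [1, 0]
features = [[2.111, 3.567, -0.125],
            [2.156, 3.259, 0.258]]
with open('train', 'w') as f:
    for y, x in zip(labels, features):
        # indices are 1-based; zero-valued features may be omitted to keep the file sparse
        pairs = ' '.join('%d:%g' % (i + 1, v) for i, v in enumerate(x) if v != 0)
        f.write('%d %s\n' % (y, pairs))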


2. Scale the feature data (improves training efficiency)
svm-scale -l -1 -u 1 -s range train > train.scale (the scaling range is -1 to 1: -l sets the lower limit, -u the upper limit; -s saves the scaling parameters to a file named range; train is the raw feature file and train.scale is the scaled output)
svm-scale -r range test > test.scale (-r reads the saved parameters from range so that the test data are scaled with the same transformation)
Note: the intervals [0,1] and [-1,1] give the same results; [0,1] is merely more efficient computationally (it preserves sparsity, see FAQ Q4 below).
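If you prefer driving these two commands from Python rather than a .bat file, here is a minimal sketch (assuming svm-scale.exe from libsvm-3.20\windows is on the PATH):

import subprocess

with open('train.scale', 'w') as out:
    subprocess.check_call(['svm-scale', '-l', '-1', '-u', '1', '-s', 'range', 'train'], stdout=out)
with open('test.scale', 'w') as out:
    # reuse the saved parameters so the test data are scaled identically
    subprocess.check_call(['svm-scale', '-r', 'range', 'test'], stdout=out)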


3. Search for the optimal C and g parameters
python grid.py train.scale (when the search finishes, the best C and g are reported; for example, an output of 2.0 1.0 96.8922 means C=2.0, g=1.0, with a cross-validation accuracy of 96.8922%)
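grid.py performs a grid search over C and g on a log2 scale, scoring each pair by cross-validation. The same idea can be sketched with the svmutil interface shipped in libsvm-3.20\python (the coarse grid below is only illustrative):

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.scale')
best = (0.0, None, None)
for log2c in range(-5, 16, 2):
    for log2g in range(-15, 4, 2):
        c, g = 2.0 ** log2c, 2.0 ** log2g
        # with -v, svm_train returns the cross-validation accuracy instead of a model
        acc = svm_train(y, x, '-c %g -g %g -v 5 -q' % (c, g))
        if acc > best[0]:
            best = (acc, c, g)
print('best CV accuracy %.4f%% at C=%g, g=%g' % best)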

4. Train with the optimal parameters
svm-train -c 2 -g 1 train.scale (this generates a file named train.scale.model; its fields are explained in the supplementary notes below. The default RBF kernel is used here; RBF usually gives the best results.)

5. Test with the trained model
svm-predict test.scale train.scale.model test.predict (this writes the predictions to test.predict and reports the accuracy)
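Steps 4 and 5 can also be run through the Python interface in libsvm-3.20\python; a minimal sketch, assuming train.scale and test.scale already exist:

from svmutil import svm_read_problem, svm_train, svm_predict, svm_save_model

y, x = svm_read_problem('train.scale')
model = svm_train(y, x, '-c 2 -g 1')   # parameters found by the grid search
svm_save_model('train.scale.model', model)

yt, xt = svm_read_problem('test.scale')
p_labels, p_acc, p_vals = svm_predict(yt, xt, model)
print('test accuracy: %.4f%%' % p_acc[0])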



IV. Supplementary notes

1. Using cross-validation
svm-scale -l -1 -u 1 train > train.scale

svm-train -v 6 train.scale (6-fold cross-validation; cross-validation is used to obtain better parameters)

python grid.py train.scale
svm-train -c 2 -g 2 train.scale

2. About easy.py and grid.py in /libsvm-3.20/tools/
After installing Python and gnuplot, add the three directories E:\Program Files\Python, F:\libsvm-3.20\windows and E:\Program Files\gnuplot\bin to the system PATH, then edit the libsvm and gnuplot paths in the two .py files:
In easy.py: gnuplot_exe = r"e:\Program Files\gnuplot\bin\gnuplot.exe"
In grid.py: self.svmtrain_pathname = r'f:\libsvm-3.20\windows\svm-train.exe'
            self.gnuplot_pathname = r'e:\Program Files\gnuplot\bin\gnuplot.exe'
You can follow guide.pdf and use easy.py to reproduce the examples in the guide. Data for the guide's experiments: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/
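Once the paths are set, easy.py automates the whole workflow (scaling, grid search, training and prediction); a typical invocation with the files used above is:

python easy.py train test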

3. Fields in the model file
svm_type c_svc (svc means the SVM is used as a classifier, svr means it is used for regression; c_svc performs soft-margin classification with the outlier penalty parameter C)
kernel_type rbf (radial basis function kernel, a good choice in most cases: K(x,y) = exp(-gamma*|x-y|^2))
gamma 0.03125 (kernel parameter)
nr_class 2 (number of classes)
total_sv 287 (total number of support vectors)
rho 102.102 (constant term b of the decision function)
label 1 0 (class labels)
nr_sv 144 143 (number of support vectors in each class)
SV (all support vectors are listed below)
8192 1:-1 2:-0.688314 3:0.595954 4:0.416735

...
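A saved model can also be inspected programmatically; a minimal sketch using the Python interface (the method names are those of libsvm's svm_model wrapper):

from svmutil import svm_load_model

model = svm_load_model('train.scale.model')
print('number of classes :', model.get_nr_class())
print('class labels      :', model.get_labels())
print('total support vectors:', len(model.get_SV()))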


4. svm-scale.exe options

-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename


5. svm-train.exe options

"-s svm_type : set type of SVM (default 0)\n"
" 0 -- C-SVC(multi-class classification)\n"
" 1 -- nu-SVC(multi-class classification)\n"
" 2 -- one-class SVM\n"
" 3 -- epsilon-SVR(regression)\n"
" 4 -- nu-SVR(regression)\n"
"-t kernel_type : set type of kernel function (default 2)\n"
" 0 -- linear: u‘*v\n"
" 1 -- polynomial: (gamma*u‘*v + coef0)^degree\n"
" 2 -- radial basis function: exp(-gamma*|u-v|^2)\n"
" 3 -- sigmoid: tanh(gamma*u‘*v + coef0)\n"
" 4 -- precomputed kernel (kernel values in training_set_file)\n"
"-d degree : set degree in kernel function (default 3)\n"
"-g gamma : set gamma in kernel function (default 1/num_features)\n"
"-r coef0 : set coef0 in kernel function (default 0)\n"
"-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)\n"
"-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)\n"
"-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)\n"
"-m cachesize : set cache memory size in MB (default 100)\n"
"-e epsilon : set tolerance of termination criterion (default 0.001)\n"
"-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)\n"
"-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)\n"
"-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)\n"
"-v n: n-fold cross validation mode\n"
"-q : quiet mode (no outputs)\n"


6. Frequently used FAQ entries

Q1: Is there a program to check if my data are in the correct format?
The svm-train program in libsvm conducts only a simple check of the input data. To do a detailed check, after libsvm 2.85, you can use the python script tools/checkdata.py. See tools/README for details.

Q2: The output of training C-SVM is like the following. What do they mean?
optimization finished, #iter = 219
nu = 0.431030
obj = -100.877286, rho = 0.424632
nSV = 132, nBSV = 107
Total nSV = 132
obj is the optimal objective value of the dual SVM problem. rho is the bias term in the decision function sgn(w^Tx - rho). nSV and nBSV are number of support vectors and bounded support vectors (i.e., alpha_i = C). nu-svm is a somewhat equivalent form of C-SVM where C is replaced by nu. nu simply shows the corresponding parameter. More details are in libsvm document.

Q3: Should I use float or double to store numbers in the cache?
We have float as the default as you can store more numbers in the cache. In general this is good enough but for few difficult cases (e.g. C very very large) where solutions are huge numbers, it might be possible that the numerical precision is not enough using only float.

Q4: Does it make a big difference if I scale each attribute to [0,1] instead of [-1,1]?

For the linear scaling method, if the RBF kernel is used and parameter selection is conducted, there is no difference. Assume Mi and mi are respectively the maximal and minimal values of the ith attribute. Scaling to [0,1] means
    x' = (x - mi)/(Mi - mi)
For [-1,1],
    x'' = 2(x - mi)/(Mi - mi) - 1.
In the RBF kernel,
    x' - y' = (x - y)/(Mi - mi),   x'' - y'' = 2(x - y)/(Mi - mi).
Hence, using (C,g) on the [0,1]-scaled data is the same as (C,g/2) on the [-1,1]-scaled data.
Though the performance is the same, the computational time may be different. For data with many zero entries, [0,1]-scaling keeps the sparsity of input data and hence may save the time.

Q5: My data are unbalanced. Could libsvm handle such problems?

Yes, there is a -wi option. For example, if you use
> svm-train -s 0 -c 10 -w1 1 -w-1 5 data_file
the penalty for class "-1" is larger. Note that this -w option is for C-SVC only.
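The same weighting options can be passed through the Python interface; a minimal sketch using svmutil:

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('data_file')
model = svm_train(y, x, '-s 0 -c 10 -w1 1 -w-1 5')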

Q6: How can I use OpenMP to parallelize LIBSVM on a multicore/shared-memory computer?
It is very easy if you are using GCC 4.2 or after.
In Makefile, add -fopenmp to CFLAGS.
In class SVC_Q of svm.cpp, modify the for loop of get_Q to:
#pragma omp parallel for private(j)
for(j=start;j<len;j++)
In the subroutine svm_predict_values of svm.cpp, add one line to the for loop:
#pragma omp parallel for private(i)
for(i=0;i<l;i++)
kvalue[i] = Kernel::k_function(x,model->SV[i],model->param);
For regression, you need to modify class SVR_Q instead. The loop in svm_predict_values is also different because you need a reduction clause for the variable sum:
#pragma omp parallel for private(i) reduction(+:sum)
for(i=0;i<model->l;i++)
sum += sv_coef[i] * Kernel::k_function(x,model->SV[i],model->param);
Then rebuild the package. Kernel evaluations in training/testing will be parallelized. An example of running this modification on an 8-core machine using the data set ijcnn1:
8 cores:
%setenv OMP_NUM_THREADS 8
%time svm-train -c 16 -g 4 -m 400 ijcnn1
27.1sec
1 core:
%setenv OMP_NUM_THREADS 1
%time svm-train -c 16 -g 4 -m 400 ijcnn1
79.8sec
For this data, kernel evaluations take 80% of training time. In the above example, we assume you use csh. For bash, use
export OMP_NUM_THREADS=8
instead.
For Python interface, you need to add the -lgomp link option:
$(CXX) -lgomp -shared -dynamiclib svm.o -o libsvm.so.$(SHVER)
For MS Windows, you need to add /openmp in CFLAGS of Makefile.win

Q7: How could I know which training instances are support vectors?

It's very simple. Since version 3.13, you can use the function
void svm_get_sv_indices(const struct svm_model *model, int *sv_indices)
to get indices of support vectors. For example, in svm-train.c, after
model = svm_train(&prob, &param);
you can add
int nr_sv = svm_get_nr_sv(model);
int *sv_indices = Malloc(int, nr_sv);
svm_get_sv_indices(model, sv_indices);
for (int i=0; i<nr_sv; i++)
printf("instance %d is a support vector\n", sv_indices[i]);
If you use the MATLAB interface, you can directly check
model.sv_indices

Q8: After doing cross validation, why is there no model file outputted?
Cross validation is used for selecting good parameters. After finding them, you want to re-train the whole data without the -v option.

Q9: How do I choose the kernel?
In general we suggest you try the RBF kernel first. A result by Keerthi and Lin shows that if RBF is used with model selection, then there is no need to consider the linear kernel. The kernel matrix using sigmoid may not be positive definite, and in general its accuracy is not better than RBF (see the paper by Lin and Lin). Polynomial kernels are ok, but if a high degree is used, numerical difficulties tend to happen (think of the d-th power of a number: values below 1 go to 0 and values above 1 go to infinity).

Q10: I press the "load" button to load data points, but why does svm-toy not draw them?

The program svm-toy assumes both attributes (i.e. x-axis and y-axis values) are in (0,1). Hence you want to scale your data to between a small positive number and a number less than but very close to 1. Moreover, class labels must be 1, 2, or 3 (not 1.0, 2.0 or anything else).


Q11: Feature selection tool
This is a simple Python script (available from the LIBSVM tools page) that uses the F-score to select features. To run it, put it in the sub-directory "tools" of LIBSVM.
Usage: ./fselect.py training_file [testing_file]
Output files: .fscore shows the importance of features, .select gives the running log, and .pred gives the testing results.
More information about this implementation can be found in Y.-W. Chen and C.-J. Lin, Combining SVMs with various feature selection strategies, in "Feature Extraction, Foundations and Applications", 2005. This implementation is still preliminary. More comments are very welcome.


Q12: Weights for data instances
Users can give a weight to each data instance. For LIBSVM users, please download the zip file (MATLAB and Python interfaces are included). You must store the weights in a separate file and specify -W your_weight_file. This setting is different from earlier versions, where the weights were in the first column of the training data.
1) Training/testing sets are the same as those for standard LIBSVM/LIBLINEAR.
2) We do not support weights for test data.
3) All solvers are supported.
4) MATLAB/Python interfaces for both LIBSVM/LIBLINEAR are supported.
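Following the description above, a hypothetical invocation of the modified tool would look like:

svm-train -c 10 -W your_weight_file data_file

where your_weight_file is expected to hold one weight per training instance (see the README shipped with the zip file for the exact format).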


Q13: Binary-class cross-validation with different criteria

Reference: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/eval/index.html
