1. 程式人生 > >(十)訓練資料集建立

(十)訓練資料集建立

Caffe2 - 訓練資料集建立

caffe2 使用二值 DB 儲存模型訓練的資料,以 key-value 格式儲存,

key1 value1 key2 value2 key3 value3 ...

DB 中,將 keys 和 values 儲存為 strings 形式;可以通過 TensorProtos protocol buffer 來轉換為結構化的資料:

TensorProtos protocol buffer:

記錄 Tensors,也叫多維陣列(multi-dimensional arrays, together),tensor 資料型別及資料 shape 資訊.

故,採用 TensorProtosDBInput Operator 來載入資料,以進行 SGD 訓練.

UCI Iris 資料集為例,Iris 花朵分類資料集,其包括 4 種實值特徵來表示花,對三種類型的花進行分類.

資料集格式:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
...
import urllib2 
import numpy as np
import
matplotlib.pyplot as plt from StringIO import StringIO from caffe2.python import core, utils, workspace from caffe2.proto import caffe2_pb2 print("Necessities imported!") # Load txtdata # https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data raw_datas = open('iris_data.txt').readlines() num_datas = len(raw_datas) features = np.zeros((num_datas, 4
), dtype=np.float32) # 每一行一個樣本 labels = np.zeros((num_datas, ), dtype=np.int) #label_dict = {'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2} label_converter = lambda s : {'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}[s] for idx in range(num_datas): data = raw_datas[idx].strip() print data feature = np.loadtxt(StringIO(data), dtype=np.float32, delimiter=',', usecols=(0, 1, 2, 3)) label = np.loadtxt(StringIO(data), dtype=np.int, delimiter=',', usecols=(4,), converters={4: label_converter}) features[idx] = feature labels[idx] = label # train: 100 # test:50 random_index = np.random.permutation(150) # 打亂順序 features = features[random_index] labels = labels[random_index] train_features = features[:100] train_labels = labels[:100] test_features = features[100:] test_labels = labels[100:] # 視覺化下特徵 # first two features 和 label. legend = ['rx', 'b+', 'go'] plt.title("Training data distribution, feature 0 and 1") for i in range(3): plt.plot(train_features[train_labels==i, 0], train_features[train_labels==i, 1], legend[i]) plt.figure() plt.title("Testing data distribution, feature 0 and 1") for i in range(3): plt.plot(test_features[test_labels==i, 0], test_features[test_labels==i, 1], legend[i]) plt.show()

這裡寫圖片描述

將資料放入 Caffe2 DB,key - train_xxx,value - 使用 TensorProtos 來儲存每個資料樣本的兩個 tensor,feature 和 label.

# 測試
# 從 numpy arrays 建立 TensorProtos protocol buffer
feature_and_label = caffe2_pb2.TensorProtos()
feature_and_label.protos.extend([utils.NumpyArrayToCaffe2Tensor(features[0]), utils.NumpyArrayToCaffe2Tensor(labels[0])])
print('This is what the tensor proto looks like for a feature and its label:')
print(str(feature_and_label))
print('This is the compact string that gets written into the db:')
print(feature_and_label.SerializeToString())

# 資料寫入 DB
def write_db(db_type, db_name, features, labels):
    db = core.C.create_db(db_type, db_name, core.C.Mode.write)
    transaction = db.new_transaction()
    for i in range(features.shape[0]):
        feature_and_label = caffe2_pb2.TensorProtos()
        feature_and_label.protos.extend([utils.NumpyArrayToCaffe2Tensor(features[i]), utils.NumpyArrayToCaffe2Tensor(labels[i])])
        transaction.put('train_%03d'.format(i), feature_and_label.SerializeToString())

    del transaction
    del db

write_db("minidb", "iris_train.minidb", train_features, train_labels)
write_db("minidb", "iris_test.minidb", test_features, test_labels)


# 建立網路,測試 DB 載入
net_proto = core.Net("example_iris_net")
dbreader = net_proto.CreateDB([], "dbreader", db="iris_train.minidb", db_type="minidb")
net_proto.TensorProtosDBInput([dbreader], ["X", "Y"], batch_size=16)

print("The net looks like this:")
print(str(net_proto.Proto()))

workspace.CreateNet(net_proto)

workspace.RunNet(net_proto.Proto().name)
print("The first batch of feature is:")
print(workspace.FetchBlob("X"))
print("The first batch of label is:")
print(workspace.FetchBlob("Y"))

Reference