淺談tensorflow中dataset.shuffle和dataset.batch dataset.repeat注意點

阿新 • • 發佈：2020-06-09

batch很好理解，就是batch size。注意在一個epoch中最後一個batch大小可能小於等於batch size

dataset.repeat就是俗稱epoch，但在tf中與dataset.shuffle的使用順序可能會導致個epoch的混合

dataset.shuffle就是說維持一個buffer size 大小的 shuffle buffer，圖中所需的每個樣本從shuffle buffer中獲取，取得一個樣本後，就從源資料集中加入一個樣本到shuffle buffer中。

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import numpy as np
import tensorflow as tf
np.random.seed(0)
x = np.random.sample((11,2))
# make a dataset from a numpy array
print(x)
print()
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(3)
dataset = dataset.batch(4)
dataset = dataset.repeat(2)

# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))

#源資料集
[[ 0.5488135  0.71518937]
 [ 0.60276338 0.54488318]
 [ 0.4236548  0.64589411]
 [ 0.43758721 0.891773 ]
 [ 0.96366276 0.38344152]
 [ 0.79172504 0.52889492]
 [ 0.56804456 0.92559664]
 [ 0.07103606 0.0871293 ]
 [ 0.0202184  0.83261985]
 [ 0.77815675 0.87001215]
 [ 0.97861834 0.79915856]]

# 通過shuffle batch後取得的樣本
[[ 0.4236548  0.64589411]
 [ 0.60276338 0.54488318]
 [ 0.43758721 0.891773 ]
 [ 0.5488135  0.71518937]]
[[ 0.96366276 0.38344152]
 [ 0.56804456 0.92559664]
 [ 0.0202184  0.83261985]
 [ 0.79172504 0.52889492]]
[[ 0.07103606 0.0871293 ]
 [ 0.97861834 0.79915856]
 [ 0.77815675 0.87001215]] #最後一個batch樣本個數為3
[[ 0.60276338 0.54488318]
 [ 0.5488135  0.71518937]
 [ 0.43758721 0.891773 ]
 [ 0.79172504 0.52889492]]
[[ 0.4236548  0.64589411]
 [ 0.56804456 0.92559664]
 [ 0.0202184  0.83261985]
 [ 0.07103606 0.0871293 ]]
[[ 0.77815675 0.87001215]
 [ 0.96366276 0.38344152]
 [ 0.97861834 0.79915856]] #最後一個batch樣本個數為3

1、按照shuffle中設定的buffer size，首先從源資料集取得三個樣本：
shuffle buffer：
[ 0.5488135 0.71518937]
[ 0.60276338 0.54488318]
[ 0.4236548 0.64589411]
2、從buffer中取一個樣本到batch中得：
shuffle buffer：
[ 0.5488135 0.71518937]
[ 0.60276338 0.54488318]
batch：
[ 0.4236548 0.64589411]
3、shuffle buffer不足三個樣本，從源資料集提取一個樣本：
shuffle buffer：
[ 0.5488135 0.71518937]

[ 0.60276338 0.54488318]
[ 0.43758721 0.891773 ]
4、從buffer中取一個樣本到batch中得：
shuffle buffer：
[ 0.5488135 0.71518937]
[ 0.43758721 0.891773 ]
batch：
[ 0.4236548 0.64589411]
[ 0.60276338 0.54488318]
5、如此反覆。這就意味中如果shuffle 的buffer size=1，資料集不打亂。如果shuffle 的buffer size=資料集樣本數量，隨機打亂整個資料集

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import numpy as np
import tensorflow as tf
np.random.seed(0)
x = np.random.sample((11,2))
# make a dataset from a numpy array
print(x)
print()
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(1)
dataset = dataset.batch(4)
dataset = dataset.repeat(2)

# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))

[[ 0.5488135  0.71518937]
 [ 0.60276338 0.54488318]
 [ 0.4236548  0.64589411]
 [ 0.43758721 0.891773 ]
 [ 0.96366276 0.38344152]
 [ 0.79172504 0.52889492]
 [ 0.56804456 0.92559664]
 [ 0.07103606 0.0871293 ]
 [ 0.0202184  0.83261985]
 [ 0.77815675 0.87001215]
 [ 0.97861834 0.79915856]]

[[ 0.5488135  0.71518937]
 [ 0.60276338 0.54488318]
 [ 0.4236548  0.64589411]
 [ 0.43758721 0.891773 ]]
[[ 0.96366276 0.38344152]
 [ 0.79172504 0.52889492]
 [ 0.56804456 0.92559664]
 [ 0.07103606 0.0871293 ]]
[[ 0.0202184  0.83261985]
 [ 0.77815675 0.87001215]
 [ 0.97861834 0.79915856]]
[[ 0.5488135  0.71518937]
 [ 0.60276338 0.54488318]
 [ 0.4236548  0.64589411]
 [ 0.43758721 0.891773 ]]
[[ 0.96366276 0.38344152]
 [ 0.79172504 0.52889492]
 [ 0.56804456 0.92559664]
 [ 0.07103606 0.0871293 ]]
[[ 0.0202184  0.83261985]
 [ 0.77815675 0.87001215]
 [ 0.97861834 0.79915856]]

注意如果repeat在shuffle之前使用：

官方說repeat在shuffle之前使用能提高效能，但模糊了資料樣本的epoch關係

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import numpy as np
import tensorflow as tf
np.random.seed(0)
x = np.random.sample((11,2))
# make a dataset from a numpy array
print(x)
print()
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.repeat(2)
dataset = dataset.shuffle(11)
dataset = dataset.batch(4)

# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))
  print(sess.run(el))

[[ 0.5488135  0.71518937]
 [ 0.60276338 0.54488318]
 [ 0.4236548  0.64589411]
 [ 0.43758721 0.891773 ]
 [ 0.96366276 0.38344152]
 [ 0.79172504 0.52889492]
 [ 0.56804456 0.92559664]
 [ 0.07103606 0.0871293 ]
 [ 0.0202184  0.83261985]
 [ 0.77815675 0.87001215]
 [ 0.97861834 0.79915856]]

[[ 0.56804456 0.92559664]
 [ 0.5488135  0.71518937]
 [ 0.60276338 0.54488318]
 [ 0.07103606 0.0871293 ]]
[[ 0.96366276 0.38344152]
 [ 0.43758721 0.891773 ]
 [ 0.43758721 0.891773 ]
 [ 0.77815675 0.87001215]]
[[ 0.79172504 0.52889492]  #出現相同樣本出現在同一個batch中
 [ 0.79172504 0.52889492]
 [ 0.60276338 0.54488318]
 [ 0.4236548  0.64589411]]
[[ 0.07103606 0.0871293 ]
 [ 0.4236548  0.64589411]
 [ 0.96366276 0.38344152]
 [ 0.5488135  0.71518937]]
[[ 0.97861834 0.79915856]
 [ 0.0202184  0.83261985]
 [ 0.77815675 0.87001215]
 [ 0.56804456 0.92559664]]
[[ 0.0202184  0.83261985]
 [ 0.97861834 0.79915856]]     #可以看到最後個batch為2，而前面都是4

使用案例：

def input_fn(filenames,batch_size=32,num_epochs=1,perform_shuffle=False):
  print('Parsing',filenames)
  def decode_libsvm(line):
    #columns = tf.decode_csv(value,record_defaults=CSV_COLUMN_DEFAULTS)
    #features = dict(zip(CSV_COLUMNS,columns))
    #labels = features.pop(LABEL_COLUMN)
    columns = tf.string_split([line],' ')
    labels = tf.string_to_number(columns.values[0],out_type=tf.float32)
    splits = tf.string_split(columns.values[1:],':')
    id_vals = tf.reshape(splits.values,splits.dense_shape)
    feat_ids,feat_vals = tf.split(id_vals,num_or_size_splits=2,axis=1)
    feat_ids = tf.string_to_number(feat_ids,out_type=tf.int32)
    feat_vals = tf.string_to_number(feat_vals,out_type=tf.float32)
    #feat_ids = tf.reshape(feat_ids,shape=[-1,FLAGS.field_size])
    #for i in range(splits.dense_shape.eval()[0]):
    #  feat_ids.append(tf.string_to_number(splits.values[2*i],out_type=tf.int32))
    #  feat_vals.append(tf.string_to_number(splits.values[2*i+1]))
    #return tf.reshape(feat_ids,field_size]),tf.reshape(feat_vals,labels
    return {"feat_ids": feat_ids,"feat_vals": feat_vals},labels

  # Extract lines from input files using the Dataset API,can pass one filename or filename list
  dataset = tf.data.TextLineDataset(filenames).map(decode_libsvm,num_parallel_calls=10).prefetch(500000)  # multi-thread pre-process then prefetch

  # Randomizes input using a window of 256 elements (read into memory)
  if perform_shuffle:
    dataset = dataset.shuffle(buffer_size=256)

  # epochs from blending together.
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size) # Batch size to use

  #return dataset.make_one_shot_iterator()
  iterator = dataset.make_one_shot_iterator()
  batch_features,batch_labels = iterator.get_next()
  #return tf.reshape(batch_ids,tf.reshape(batch_vals,batch_labels
  return batch_features,batch_labels

到此這篇關於淺談tensorflow中dataset.shuffle和dataset.batch dataset.repeat注意點的文章就介紹到這了,更多相關tensorflow中dataset.shuffle和dataset.batch dataset.repeat內容請搜尋我們以前的文章或繼續瀏覽下面的相關文章希望大家以後多多支援我們！

淺談tensorflow中dataset.shuffle和dataset.batch dataset.repeat注意點

淺談tensorflow中dataset.shuffle和dataset.batch dataset.repeat注意點

淺談tensorflow中Dataset圖片的批量讀取及維度的操作詳解

淺談tensorflow中張量的提取值和賦值

淺談tensorflow 中的圖片讀取和裁剪方式

淺談tensorflow 中tf.concat()的使用

淺談Python中的異常和JSON讀寫資料的實現

淺談Python中re.match()和re.search()的使用及區別

淺談Python中threading join和setDaemon用法及區別說明

淺談Python中資料夾和python package包的區別

淺談Python中的生成器和迭代器

淺談pytorch中torch.max和F.softmax函式的維度解釋

淺談TensorFlow中讀取影象資料的三種方式

淺談js中的attributes和Attribute的用法與區別

淺談vue中$event理解和框架中在包含預設值外傳參

淺談Redis中的雪崩和穿透及擊穿。

淺談SQL中的 where 和 having 的區別

淺談C中的malloc和free

淺談Tensorflow載入Vgg預訓練模型的幾個注意事項

淺談MySQL中float、double、decimal三個浮點型別的區別與總結

淺談keras中的batch_dot,dot方法和TensorFlow的matmul

淺談tensorflow中dataset.shuffle和dataset.batch dataset.repeat注意點

相關推薦