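10. Data-Processing Ops in TensorFlow

This post collects the TensorFlow operations that deal with data processing.

Data format transformation (changing the shape of data)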

tf.one_hot

Converts a list of indices into one-hot vectors.

For example, take the index list [0, 2, -1, 1]:


import tensorflow as tf
sess = tf.Session()

one_hot = tf.one_hot(indices = [0, 2, -1, 1],
                     depth = 3,
                     on_value = 1.0,
                     off_value = 0.0,
                     axis = -1)

print(sess.run(one_hot))

[[ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  0.  0.]
 [ 0.  1.  0.]]

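This op works much like an embedding lookup, except that each index is mapped to a one-hot vector of length depth. An index outside [0, depth), such as -1 above, produces a row filled with off_value. It is commonly used to represent labels in NLP.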
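
To make the mapping concrete, here is a minimal plain-Python sketch of what one-hot encoding along the last axis computes. The helper `one_hot` below is a hypothetical stand-in written for illustration, not TensorFlow's implementation; it only handles a flat list of indices.

```python
def one_hot(indices, depth, on_value=1.0, off_value=0.0):
    """Plain-Python sketch of one-hot encoding for a flat list of indices."""
    rows = []
    for idx in indices:
        # Every row starts as all off_value.
        row = [off_value] * depth
        # Only an index inside [0, depth) switches one position to on_value;
        # an out-of-range index such as -1 leaves the row untouched.
        if 0 <= idx < depth:
            row[idx] = on_value
        rows.append(row)
    return rows


print(one_hot([0, 2, -1, 1], depth=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

The result has one row per index and depth columns, which matches the 4 x 3 output shown above.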

tf.sequence_mask

This op is similar to one_hot, but instead of an index you give the number of leading True values in each row. The result is a tensor of True and False.

sq_mask = tf.sequence_mask([1, 3, 2], 5)

print(sess.run(sq_mask))

[[ True False False False False]
 [ True  True  True False False]
 [ True  True False False False]]
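
The rule behind this output is simply "position < length". A minimal plain-Python sketch, assuming a flat list of lengths and an explicit maximum length (the helper name `sequence_mask` is reused here only for illustration):

```python
def sequence_mask(lengths, maxlen):
    """Row i is True at positions 0 .. lengths[i] - 1 and False afterwards."""
    return [[pos < n for pos in range(maxlen)] for n in lengths]


for row in sequence_mask([1, 3, 2], 5):
    print(row)
# [True, False, False, False, False]
# [True, True, True, False, False]
# [True, True, False, False, False]
```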

tf.boolean_mask

This op keeps only the selected elements, similar to boolean indexing in NumPy.

import numpy as np
tensor = tf.range(4)
mask = np.array([True, False, True, False])
bool_mask = tf.boolean_mask(tensor, mask)

print(sess.run(bool_mask))

[0 2]

You can also pass the mask in as numbers and cast it to bool, which makes it possible to build the mask with one_hot.

num_mask = np.array([1,0,1,0])
num_mask = tf.cast(num_mask, tf.bool)
bool_num_mask = tf.boolean_mask(tensor, num_mask)
print(sess.run(bool_num_mask))

[0 2]

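The mask's shape must match the leading dimensions of the tensor being masked: a K-dimensional mask is applied to the first K dimensions of the tensor. The code below raises an error because the mask has 4 elements while the tensor's first dimension has size 2:

tensor = tf.reshape(tf.range(8), [2,4])
mask = np.array([True, False, True, False])
bool_mask = tf.boolean_mask(tensor, mask)
print(sess.run(bool_mask))
# ValueError: Shapes (2,) and (4,) are incompatible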
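
The shape rule is easier to see in a plain-Python sketch. The helper below is a hypothetical illustration that covers only the simplest case, a 1-D mask applied to the first dimension of a nested list; its error message is made up for the sketch and is not TensorFlow's.

```python
def boolean_mask(tensor, mask):
    """Keep tensor[i] wherever mask[i] is True (1-D mask over the first dimension)."""
    if len(tensor) != len(mask):
        raise ValueError(
            "mask length %d does not match first dimension %d" % (len(mask), len(tensor))
        )
    return [item for item, keep in zip(tensor, mask) if keep]


tensor = [[0, 1, 2, 3], [4, 5, 6, 7]]  # shape (2, 4)

# A mask of length 2 matches the first dimension and selects whole rows.
print(boolean_mask(tensor, [True, False]))  # [[0, 1, 2, 3]]

# A mask of length 4 does not match the first dimension (2) and is rejected.
try:
    boolean_mask(tensor, [True, False, True, False])
except ValueError as err:
    print("rejected:", err)
```

So for the 2 x 4 tensor above, a mask of shape (2,) or (2, 4) would be acceptable, while a mask of shape (4,) is not.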

tf.split

Splits a tensor into pieces.

m1 = tf.reshape(tf.range(24), [2,3,4])
m1

<tf.Tensor 'Reshape:0' shape=(2, 3, 4) dtype=int32>

# tf.split(value, num_or_size_splits, axis=0, num=None, name='split')
split0, split1, split2 = tf.split(m1, 3, 1)

split0.get_shape()

TensorShape([Dimension(2), Dimension(1), Dimension(4)])
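
To see which values end up in each piece, the split along axis 1 can be imitated with nested Python lists. This is a rough sketch under the assumption of an even split along axis 1 only; `shape` and `split_axis1` are hypothetical helpers, not TensorFlow functions.

```python
def shape(nested):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def split_axis1(tensor, num):
    """Split a 3-D nested list into num equal pieces along axis 1."""
    size = len(tensor[0])
    if size % num != 0:
        raise ValueError("axis 1 has size %d, not divisible by %d" % (size, num))
    step = size // num
    return [[block[i * step:(i + 1) * step] for block in tensor] for i in range(num)]


# Same values as tf.reshape(tf.range(24), [2, 3, 4]).
m1 = [[[b * 12 + r * 4 + c for c in range(4)] for r in range(3)] for b in range(2)]

parts = split_axis1(m1, 3)
print([shape(p) for p in parts])  # [(2, 1, 4), (2, 1, 4), (2, 1, 4)]
print(parts[0])                   # [[[0, 1, 2, 3]], [[12, 13, 14, 15]]]
```

Each piece keeps all three dimensions; only the split axis shrinks from 3 to 1.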

tf.concat

Concatenates tensors.

m2 = tf.reshape(tf.range(24), [2,3,4])
m2

<tf.Tensor 'Reshape_1:0' shape=(2, 3, 4) dtype=int32>

tf.concat([m1, m2], 0) # concatenate along dimension 0

<tf.Tensor 'concat:0' shape=(4, 3, 4) dtype=int32>

tf.concat([m1, m2], 1) # concatenate along dimension 1

<tf.Tensor 'concat_1:0' shape=(2, 6, 4) dtype=int32>

tf.squeeze

Removes dimensions of size 1.

arr = tf.truncated_normal([3,4,1,6,1], stddev=0.1)

arr.get_shape()

TensorShape([Dimension(3), Dimension(4), Dimension(1), Dimension(6), Dimension(1)])

tf.squeeze(arr).get_shape()

TensorShape([Dimension(3), Dimension(4), Dimension(6)])

tf.expand_dims

The opposite of squeeze: it inserts a dimension of size 1 at the given position.

tf.expand_dims(arr, 0).get_shape()

TensorShape([Dimension(1), Dimension(3), Dimension(4), Dimension(1), Dimension(6), Dimension(1)])
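
Since squeeze and expand_dims only change the shape and never the values, their effect can be sketched as plain list arithmetic on the shape. The two helpers below are hypothetical and work on shape lists rather than tensors; `squeeze_shape` models only the default case where every size-1 dimension is removed.

```python
def squeeze_shape(shape):
    """Drop every dimension of size 1."""
    return [d for d in shape if d != 1]


def expand_dims_shape(shape, axis):
    """Insert a dimension of size 1 at position axis."""
    # A negative axis counts from the end, so -1 appends a trailing dimension.
    if axis < 0:
        axis += len(shape) + 1
    return shape[:axis] + [1] + shape[axis:]


arr_shape = [3, 4, 1, 6, 1]
print(squeeze_shape(arr_shape))          # [3, 4, 6]
print(expand_dims_shape(arr_shape, 0))   # [1, 3, 4, 1, 6, 1]
print(expand_dims_shape(arr_shape, -1))  # [3, 4, 1, 6, 1, 1]
```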

tf.gather

Takes one tensor as the source data and another as indices, and returns the elements at those indices.

indices = tf.placeholder(tf.int32, [5])
arr = tf.range(10, 20)
g = tf.gather(arr, indices)

print(sess.run(g, feed_dict={indices:[4,5,7,1,2]}))

[14 15 17 11 12]
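
For the 1-D case, gathering amounts to indexing the source once per entry in the index list. A minimal sketch in plain Python follows; the helper `gather` is written for illustration, and rejecting negative or out-of-range indices (instead of letting Python wrap them around) is an assumption of this sketch.

```python
def gather(params, indices):
    """Pick params[i] for every i in indices, in order."""
    result = []
    for i in indices:
        # Assumption of this sketch: reject anything outside [0, len(params)).
        if not 0 <= i < len(params):
            raise IndexError("index %d is out of range [0, %d)" % (i, len(params)))
        result.append(params[i])
    return result


arr = list(range(10, 20))
print(gather(arr, [4, 5, 7, 1, 2]))  # [14, 15, 17, 11, 12]
print(gather(arr, [0, 0, 9]))        # [10, 10, 19]
```

The output always has as many entries as the index list, and an index may appear more than once.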

tf.tile

Given a tensor, repeats it to build a larger tensor.

tf.tile(input, multiples, name=None)

input: the tensor to tile; multiples: how many times to repeat it along each dimension.

t_simple = tf.range(10)
t_complex = tf.tile(t_simple, [2])
sess.run(t_complex)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

t_simple = tf.reshape(tf.range(10), [2, 5]) # multiples needs one entry per dimension of the input
t_complex = tf.tile(t_simple, [2, 3])
sess.run(t_complex)

array([[0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9]], dtype=int32)
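
The 2-D result above can be reproduced with list repetition, which shows the order in which the copies are laid out. This is only a sketch for 2-D nested lists; `tile_2d` is a hypothetical helper, not part of TensorFlow.

```python
def tile_2d(rows, multiples):
    """Repeat each row reps1 times along axis 1, then the whole block reps0 times along axis 0."""
    reps0, reps1 = multiples
    return [list(row) * reps1 for _ in range(reps0) for row in rows]


t_simple = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]  # shape (2, 5)
for row in tile_2d(t_simple, [2, 3]):
    print(row)
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9]
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9]
```

The output shape is (2 * 2, 5 * 3) = (4, 15): each dimension of the input is multiplied by the matching entry of multiples.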