7. TensorFlow Multi-GPU Operations

Notes on various aspects of multi-GPU operation in TensorFlow.

Specifying the device in TensorFlow

"/cpu:0": the machine's CPU.
"/gpu:0": the machine's GPU, if it has one.
"/gpu:1": the machine's second GPU, and so on.

import tensorflow as tf
import os

# Build a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Create a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(c))

[[ 22. 28.]
 [ 49. 64.]]

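The device placement log does not appear in the notebook output. It is written to the standard error of the kernel process, so it typically shows up in the terminal that launched the notebook server.

One issue observed: GPU memory is fully occupied. By default, TensorFlow maps nearly all of the memory of every visible GPU.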

# Build a graph, pinning the constants to the CPU.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Create a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(c))

[[ 22. 28.]
 [ 49. 64.]]

Even though a device was specified for the computation, GPU memory is still fully occupied.

# Use multiple GPUs.
# Build a graph with one matmul per GPU.
c = []
for d in ['/gpu:0', '/gpu:1']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
# Sum the per-GPU results on the CPU, once, after the loop.
with tf.device('/cpu:0'):
    sum = tf.add_n(c)

# Create a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(sum))
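
To make clear what the multi-GPU graph above computes, here is a minimal plain-Python sketch of the same arithmetic: each device produces one 2x3 by 3x2 product, and the CPU adds the partial results. It does not use TensorFlow or any GPU; the helper names matmul and add_n are hypothetical stand-ins written only for this illustration.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(xv * yv for xv, yv in zip(row, col)) for col in zip(*y)]
            for row in x]


def add_n(matrices):
    """Element-wise sum of a list of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # shape [2, 3]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # shape [3, 2]

# One partial product per device name; here they are simply computed in turn.
partials = [matmul(a, b) for _ in ['/gpu:0', '/gpu:1']]
total = add_n(partials)
print(total)  # [[44.0, 56.0], [98.0, 128.0]]
```

Each partial product is [[22, 28], [49, 64]], so the sum over two devices is twice that.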

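Conclusion: changing the device does not change GPU memory usage.

Ways to avoid occupying all GPU memory

1. Set a memory allocation fraction for all GPUs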

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)  
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
print(sess.run(sum))

[[ 44. 56.]
 [ 98. 128.]]

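Roughly 30% of GPU memory is shown as occupied. The fraction is applied to every visible GPU, not to a single one.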
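
As a rough sanity check on what the fraction means, the per-GPU cap can be estimated as fraction times total memory. The sketch below is plain Python and does not call TensorFlow; the 12288 MiB card size and the helper name memory_cap_mib are hypothetical, chosen only for illustration, and the amount TensorFlow actually reserves may differ somewhat.

```python
def memory_cap_mib(total_mib, fraction):
    """Approximate per-GPU cap implied by per_process_gpu_memory_fraction."""
    return int(total_mib * fraction)


# Hypothetical 12 GiB card (12288 MiB) with the fraction used above.
cap = memory_cap_mib(12288, 0.333)
print("approximate cap per GPU: %d MiB" % cap)
```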

2. Automatic growth: allocate GPU memory on demand

config = tf.ConfigProto()  
config.gpu_options.allow_growth=True  
sess = tf.Session(config=config)  
print(sess.run(sum))

[[ 44. 56.]
 [ 98. 128.]]

Both methods above still touch every GPU, which is not clean enough. When sharing GPUs with other people, use the method below instead.

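3. Specify the visible GPUs

On the command line, run export CUDA_VISIBLE_DEVICES="8,9,10,11,12,13,14,15" (the IDs of the GPUs you use; the shell does not allow spaces around the = sign).

Alternatively, add the line to ~/.bashrc (if you and the other users log in with different accounts).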

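# Set the environment variable from Python. This must happen before
# TensorFlow first initializes the GPUs, e.g. before the first session is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "8,9,10,11,12,13,14,15"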

sess = tf.Session()
print(sess.run(sum))

[[ 44. 56.]
 [ 98. 128.]]

The result shows that only the last 8 GPUs have memory occupied.
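
One consequence worth keeping in mind: once CUDA_VISIBLE_DEVICES is set, the process sees the listed GPUs renumbered from 0, so '/gpu:0' inside the program refers to the first ID in the list. The sketch below illustrates that mapping in plain Python. It only parses a string and does not query any GPU; the helper name visible_gpu_mapping is hypothetical, and it assumes a well-formed, comma-separated list of integer IDs.

```python
def visible_gpu_mapping(cuda_visible_devices):
    """Map the logical device names a process sees to physical GPU IDs."""
    physical_ids = [int(tok) for tok in cuda_visible_devices.split(",")
                    if tok.strip()]
    return {"/gpu:%d" % logical: physical
            for logical, physical in enumerate(physical_ids)}


mapping = visible_gpu_mapping("8,9,10,11,12,13,14,15")
for name, physical in mapping.items():
    print(name, "-> physical GPU", physical)
```

With the value used above, '/gpu:0' maps to physical GPU 8 and '/gpu:7' to physical GPU 15, so device strings in the graph should use the renumbered indices.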