TensorFlow Pitfalls: Steadily Growing Memory Usage and Latency

Problem Description

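While using a fine-tuned image classification model to extract features from a batch of images, I noticed that the per-image processing time kept growing and memory usage kept climbing the longer the job ran. Similar issues have been reported against TensorFlow.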

Problem Code

...
with tf.Graph().as_default():
    with slim.arg_scope(inception_resnet_v2.inception_resnet_v2_arg_scope()):
        # build graph
        preprocessed_image = tf.placeholder(tf.float32, shape=(image_size,image_size,3), name="preprocessed_images")
        processed_image = tf.expand_dims(preprocessed_image, 0)
        logits, end_points = inception_resnet_v2.inception_resnet_v2(processed_image, num_classes=_NUM_CLASSES, is_training=False)
        probabilities = logits
        init_fn = slim.assign_from_checkpoint_fn(sys.argv[1], slim.get_model_variables('InceptionResnetV2'))
        with tf.Session() as sess:
            # initialize graph
            init_fn(sess)
            # run graph
            for line in sys.stdin:
                start_time = time.time()
                line = line.strip(" \r\n")
                if len(line) == 0:
                    continue
                try:
                    image_string_tmp = tf.gfile.FastGFile(line, 'rb').read()
                    # PROBLEM: the next two calls add new ops to the graph on every iteration
                    image_decode_tmp = tf.image.decode_image(image_string_tmp, channels=3)
                    preprocessed_image_tmp = inception_preprocessing.preprocess_image(image_decode_tmp, image_size, image_size, is_training=False)
                    preprocessed_image_tmp_val = sess.run([preprocessed_image_tmp])
                    # feed the placeholder tensor itself rather than looking it up by name
                    np_probabilities = sess.run(probabilities, {preprocessed_image: preprocessed_image_tmp_val[0]})
                    np_probabilities = np_probabilities[0, 0:]
                    imgfea = np_probabilities.tolist()
                    sys.stdout.write("%s\t%s\n" % (line, " ".join(["%.17f"%x for x in imgfea])))
                except Exception:
                    # NOTE: errors are silently swallowed here
                    pass
                # per-image cost in milliseconds
                sys.stderr.write("%f\n" % ((time.time() - start_time) * 1000))

Troubleshooting

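TensorFlow builds the graph ahead of time, uses placeholders to stand in for the inputs, and then runs it: build once, run many times. Intuitively, the code above has one suspicious spot: inception_preprocessing.preprocess_image, which builds graph ops, sits in the run phase. So my first attempt was to move inception_preprocessing.preprocess_image from the run phase to the graph-construction phase, but that did not solve the problem. I then looked up related reports and, following the approach suggested in the issue, logged the time and memory usage of every step in detail. Specifically, I used time.time() and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 to record the time cost and memory usage of each step (on Linux, ru_maxrss is the peak resident set size in kilobytes, so dividing by 1024 gives megabytes). For example: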

...
import time
import resource
...
end_read_time = time.time()
image_decode_tmp = tf.image.decode_image(image_string_tmp, channels=3)
end_decode_time = time.time()
# time in seconds, peak memory in MB (ru_maxrss is reported in KB on Linux)
sys.stderr.write("[decode image] timecost=%f memory_usage=%f\n" % (end_decode_time - end_read_time, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0))
...
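
The same measurement can be wrapped in a small helper so that every step is logged in the same way. This is only a sketch and not part of the original script: the names peak_rss_mb and log_step are hypothetical, and it assumes a Unix-like system where the resource module is available and ru_maxrss is reported in kilobytes (as on Linux).

```python
import resource
import sys
import time
from contextlib import contextmanager


def peak_rss_mb():
    """Peak resident set size of this process in MB (assumes ru_maxrss is in KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


@contextmanager
def log_step(name, out=sys.stderr):
    """Log wall-clock time and peak memory for the enclosed block."""
    start = time.time()
    try:
        yield
    finally:
        elapsed_ms = (time.time() - start) * 1000
        out.write("[%s] timecost=%.3fms memory_usage=%.1fMB\n"
                  % (name, elapsed_ms, peak_rss_mb()))


# Usage: wrap each step of the loop in its own block.
with log_step("example step"):
    sum(range(100000))
```

Because ru_maxrss is a peak value, it never goes down. A step whose reported memory keeps rising from one image to the next is the one that keeps allocating.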

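The logs showed that the growth in both time and memory came mainly from the tf.image.decode_image step, so this step also has to be moved into the graph-construction phase.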
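
One way to cross-check such a diagnosis is to watch the number of operations in the graph. The sketch below is not from the original script. It assumes a TensorFlow build that ships tensorflow.compat.v1 (1.14+ or 2.x), and count_ops_after_repeated_calls is a hypothetical helper name. It only builds ops and never runs them, which is enough to show that each call to tf.image.decode_image makes the graph larger.

```python
# Assumption: TensorFlow 1.14+ or 2.x, which provide the compat.v1 module.
import tensorflow.compat.v1 as tf


def count_ops_after_repeated_calls(num_calls):
    """Call tf.image.decode_image repeatedly and record the graph's op count."""
    counts = []
    with tf.Graph().as_default() as graph:
        for _ in range(num_calls):
            # Nothing is executed here. The call only adds new ops to the graph,
            # including a constant that holds the byte string passed in.
            tf.image.decode_image(tf.constant(b""), channels=3)
            counts.append(len(graph.get_operations()))
    return counts


counts = count_ops_after_repeated_calls(3)
print(counts)  # the op count grows with every call
```

In the problem code the argument is the raw content of each image file, so every iteration would also store a copy of those bytes as a constant inside the graph, which fits the observed memory growth.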

Solution

with tf.Graph().as_default():
    with slim.arg_scope(inception_resnet_v2.inception_resnet_v2_arg_scope()):
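        # build graph (once): decoding is now a graph op fed through a string placeholder
        image_str = tf.placeholder(tf.string)
        image_decode = tf.image.decode_image(image_str, channels=3)
        # the decoded image is fed back in through a second placeholder with a known rank
        image_tensor = tf.placeholder(tf.uint8, shape=[None, None, 3])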
        preprocessed_image = inception_preprocessing.preprocess_image(image_tensor, image_size, image_size, is_training=False)
        processed_image = tf.expand_dims(preprocessed_image, 0)
        logits, end_points = inception_resnet_v2.inception_resnet_v2(processed_image, num_classes=_NUM_CLASSES, is_training=False)
        init_fn = slim.assign_from_checkpoint_fn(sys.argv[1], slim.get_model_variables('InceptionResnetV2'))
        with tf.Session() as sess:
            # initialize graph
            init_fn(sess)
            # run graph
            for line in sys.stdin:
                start_time = time.time()
                line = line.strip(" \r\n")
                if len(line) == 0:
                    continue
                try:
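                    # read the raw bytes; image files must be opened in binary mode
                    with open(line, "rb") as f:
                        image_string_tmp = f.read()
                    # both runs only execute ops that already exist in the graph
                    image_decode_tmp = sess.run([image_decode], {image_str: image_string_tmp})
                    image_feature = sess.run(logits, {image_tensor: image_decode_tmp[0]})
                    image_feature = image_feature[0, 0:]
                    imgfea = image_feature.tolist()
                    sys.stdout.write("%s\t%s\n" % (line, " ".join(["%.17f" % x for x in imgfea])))
                except Exception:
                    # needs "import traceback" at the top of the script
                    sys.stderr.write("%s" % traceback.format_exc())
                # log the per-image cost (ms) to stderr so it does not mix with the features on stdout
                sys.stderr.write("cost: %f\n" % ((time.time() - start_time) * 1000))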

Summary

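tf.image.decode_image merely decodes an image (it turns the encoded bytes into a tensor) and looks harmless, yet it hides a trap. My guess is that memory is allocated for tensors every time graph ops are constructed, so constructing them over and over at run time makes memory climb sharply. Why the time also grows still needs investigating; one plausible factor is that every call adds new nodes to the default graph, so the session has an ever larger graph to handle on each run. The takeaway: when using TensorFlow, define all tensor operations once in the graph and avoid building graph ops during the run phase.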
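
A guard that may help catch this class of mistake early is tf.Graph.finalize(), which makes the graph read-only so that any later attempt to add an op raises an error instead of silently growing the graph. The sketch below is an illustration under assumptions, not the original code: it uses tensorflow.compat.v1 (TensorFlow 1.14+ or 2.x), and build_decode_graph is a hypothetical helper name.

```python
# Assumption: TensorFlow 1.14+ or 2.x, which provide the compat.v1 module.
import tensorflow.compat.v1 as tf


def build_decode_graph():
    """Build the decoding ops once, then freeze the graph."""
    graph = tf.Graph()
    with graph.as_default():
        image_str = tf.placeholder(tf.string, name="image_str")
        image_decode = tf.image.decode_image(image_str, channels=3)
    # From here on the graph is read-only.
    graph.finalize()
    return graph, image_str, image_decode


graph, image_str, image_decode = build_decode_graph()

try:
    with graph.as_default():
        # Simulates the mistake of building a new op during the run phase.
        tf.image.decode_image(tf.constant(b""), channels=3)
    graph_was_modified = True
except RuntimeError as err:
    graph_was_modified = False
    print("caught: %s" % err)
```

In a script shaped like the solution above, the finalize() call would presumably go right before the loop over sys.stdin, once all ops (including the checkpoint restore function) have been created.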