Text Summarization using Sequence-to-Sequence model in Tensorflow and GPU computing: Part 2 – AWS P2 Instance Installation

Merry Christmas!

I finally got my toys for the Christmas – AWS P2 instances, Yay!  This weekend Chicago snowed again, so a perfect timing to try Cloud!  This article was a footnote of my first touch with multi-GPU platform.  I discussed the advantage of multi-GPU platform in Deep Learning package Tensorflow, and tried Seq2Seq attention model and Convolutional Neural Network and their applications in text summarization and image classification.

This article was mainly about architecture, and a follow up of my previous post:

Text Summarization using Sequence-to-Sequence model in Tensorflow and GPU computing: Part I – How to get things running

Operating System: Ubuntu Linux 16.04 64-bit
Tensorflow r 0.11  (At the time this blog is written, TF r 0.12 has been released but I downgrade to 0.11 because of some compatibility issues)
Anaconda Python 2.7 64-bit
NVIDIA CUDA 8.0
NVIDIA cuDNN 5.1

AWS Platform

P2-xlarge  / 1 GPU / 4 vCPU / 61 G RAM

P2-8xlarge / 8GPU / 32 vCPU / 488 G RAM

P2-16xlarge / 16GPU / 64 vCPU / 732 G RAM

1. Install CUDA on P2 instance

aws_p2.png

I followed my early post to setup CUDA and Tensorflow.  Though I never expected a home run cause I knew there would always be some hurdles,  still I ended up spending half of day figuring out all the problems. Here are some major ones and solutions:

1.1 Disable Nouveau

To disable Nouveau on AWS, I added the following lines to the /etc/modprobe.d/blacklist-nouveau.conf  file:

blacklist nouveau 
blacklist lbm-nouveau 
options nouveau modeset=0 
alias nouveau off 
alias lbm-nouveau off

1.2 Update the kernel

$ apt-get install linux-source 
$ apt-get install linux-headers-3.13.0-37-generic

1.3 Test CUDA

Once CUDA installed correctly, I used ./deviceQuery to determine that the GPU model in P2 instance is Tesla K80.  I expected Titan Pascal :O.

p2_gpu

2. Install Tensorflow r 0.11.0

2.1  Some issues in compilation and installation

While installing r 0.12.0, at the first time, the ./configure successfully finished but at

"bazel build -c opt //tensorflow/tools/pip_package:build_pip_package"

it reported an error:

 ......./tensorflow/tensorflow/python/BUILD:1806:1: in cc_library rule //tensorflow/python:tf_session_helper: non-test target '//tensorflow/python:tf_session_helper' depends on testonly target '//tensorflow/python:construction_fails_op' and doesn't have testonly attribute set

The workaround was commenting out these two lines of “…./python/BUILD”:

 1811 ":construction_fails_op",
 1813 ":test_ops_kernels",
 then the build is fine.

2.2  CXXABI_1.3.8 not found error

When testing TF installation in Python, ran into an error like following:

CXXABI_error.pngand resolved by the following solution:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep CXXABI_1.3.8
cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /home/ubuntu/anaconda2/lib/

2.3  Why I still use r 0.11.0

At the time I tried this, Tensorflow r 0.12 has been released, but it popped up too many warnings and error messages when running the text summarization models. For example:

scalar_summary.png

So I decided to stick to r 0.11.0.

3.  Tensorflow on multi-GPU instance

3.1 Tensorflow supports no more than 8 GPUs

When running text summarization model on P16-xlarge instance, I ran into the following error:

CUDA_peer_error.png

A dig on github (https://github.com/tensorflow/tensorflow/issues/5789) indicates that current releases of Tensorflow do not fully support more than 8 GPU devices.  To avoid error on platform which has more than 8 GPU,  I restricted the number of GPU in Python code:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7,8"

Of course, the other GPUs are wasted therefore it seems there is no advantage of using P16-xlarge instance for Tensorflow.

3.2 Text summarization performance comparison

My ultimate goal was originally to optimize the platform to carry large scale text summarization training.  With that in mind, I reran the experiment using the toy data mentioned in my early post.  The model was trained on 20 articles, and epoch was set to 5000.

To my surprise, Multi-GPU architectures did not bring significant improvement to Text Summarization model.  I mainly compared 7 settings:

  • P1 GPU:  AWS 2xlarge instance equipped with Single TELSA K80 GPU
  • P8 n_gpu=8:  AWS 8xlarge instance equipped with Eight TELSA K80 GPU, setting the —num_gpus=8 explicitly
  • P8 n_gpu=2:  AWS 8xlarge instance equipped with Eight TELSA K80 GPU, setting the —num_gpus=2 explicitly
  • P8 n_gpu=0:  AWS 8xlarge instance equipped with Eight TELSA K80 GPU, setting the —num_gpus=0 explicitly
  • P16 GPU:  AWS 16xlarge instance equipped with Eight TELSA K80 GPU, setting the —num_gpus=0 explicitly.  Because Tensorflow has difficulty handling > 8 GPUs, I set the os.environ[“CUDA_VISIBLE_DEVICES”]=”0,1,2,3,4,5,6,7,8″
  • P16 CPU: AWS 16xlarge instance equipped with Eight TELSA K80 GPU, with the train function of seq2seq_attention model uses tf.device('/cpu:0'All other settings use tf.device('/gpu:0'
  • GTX 970: My own desktop machine with single GTX 970 GPU and 6-core AMD CPU
    textsum_compare.png

    As the picture shows, multi-GPU architecture does not bring significant improvement on efficiency. A further look of the model code gives me a feeling that the text summarization model does not exploit the parallelization of GPU.  The “_Train” function seems only using the first GPU / CPU device.   As known, Tensor flow was originally deployed on Google Cloud Platform and was not optimized for GPU.  So maybe the Text Summarization model hasn’t fully exploited the advantages yet.  It ought to be rewritten in a GPU-specified divide-and-conquer manner, and to exploit parallelism at all stages on multiple GPUs.  Sounds a lots of work, and I am not an expert in low-level programming.  Maybe for the same effort, it will be easier for me to create a similar model structure in Caffe because its better support for multi-GPU?

def _Train(model, data_batcher):
   """Runs model training."""
  with tf.device('/gpu:0'):  # was ('/cpu:0')
      model.build_graph()
      saver = tf.train.Saver()
      summary_writer = tf.train.SummaryWriter(FLAGS.train_dir)
      sv = tf.train.Supervisor(logdir=FLAGS.log_root, is_chief=True,<span class="Apple-converted-space"></span>saver=saver, summary_op=None,<span class="Apple-converted-space"> </span>save_summaries_secs=60, save_model_secs=FLAGS.checkpoint_secs,global_step=model.global_step)
      sess = sv.prepare_or_wait_for_session(config=tf.ConfigProto(allow_soft_placement=True)
      running_avg_loss = 0
      step = 0
   ...

With a bit disappointment, I continued to explore other models.  In next section, I will show another (better) example in Tensorflow that can actually benefit from high-end multi-GPU architecture.

3.3 CNN Image classification benchmark CIFAR-10

CIFAR-10 is a CNN (convolutional neural network) demo included in Tensorflow as an image classification benchmark. The problem is to classify RGB 32×32 pixel images across 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

cafir10.jpg

Tensorflow provides a multi-GPU version algorithm for model training. According to Tensorflow documentation:

” In a workstation with multiple GPU cards, each GPU will have similar speed and contain enough memory to run an entire CIFAR-10 model. Thus, we opt to design our training system in the following manner:

  • Place an individual model replica on each GPU.
  • Update model parameters synchronously by waiting for all GPUs to finish processing a batch of data.

Note that each GPU computes inference as well as the gradients for a unique batch of data. This setup effectively permits dividing up a larger batch of data across the GPUs. 

Parallelism_CNN.png

Therefore, the advantage of multi-GPU architecture here is to build model on larger datasets within approximately the same amount of time. More training data can lead to more accurate model predictions.  In other words, for a same amount of data, more GPU can reduce the training time but keep the model performance more or less the same. Due to the time limitations, I only trained models for 10K steps.

CIFAR10 CNN model is located at

~/tensorflow/tensorflow/models/image/cifar10

and

cifar10_multi_gpu_train.py

is the training algorithm for multi-GPU and

cifar10_eval.py

evaluates the trained model (auto-saved as checkpoint) as precision value of 10-class image classification on 10K held-out CIFAR test data.  To specify the number of GPU in training, I used:

python cifar10_multi_gpu_train.py --num_gpus=2

Tensorflow r.12.0 has made changes on these script from r.11.0, if the Tensorflow package was compiled and installed as r.11.0, there will be compatibility issues.  I extract these script from archive r 0.11.0 version.

1GPU  (20k images) 2GPU (40K images) 3GPU (60K images) 4GPU (80K images) 6GPU (120K images) 8GPU (160K images)
Time(min) 20 23 30 32 42 52
Precision 0.815 0.826 0.826 0.832 0.832 0.836
Images/Sec 1022 2064 2422 2719 3000 3347
Sec/batch 0.125 0.068 0.047 0.045 0.04 0.035

CNN_CIFAR_image_size.png

As shown in the figure above,  with more images used in training, the model performance increased from 0.816 (20K images in training) to 0.836 (160K images in training). While the training data increased 8 times, due to the parallelization of multi-GPU, the training time only increased 2.5 times.  The author of CIFAR model reported a best precision score about 0.86 trained for 1M steps.

GPU_ratio.png

The second figure compared the ratio of data processing speed (Images/sec) / (Second/batch) vs. the increase of GPU numbers. It is very difficult to obtain linear speed-up in m-GPU platform.  As shown, an 8 GPU platform only led to 3.5 times speed up comparing to single GPU system.

4. Conclusion

AWS P2 instances are powerful platforms to exploit the power of GPU computation. Unfortunately, the software package seems lagging behind, and cannot benefit from the high-end multi-GPU architecture provided in P2.16xlarge instances.

The current text summarization model in Tensorflow has not exploited the full potential of multi-GPU system, and hopefully smart contributors will restructure the code in future releases.

One side note was, regarding AWS P2 instance, each time I stopped/restarted, or created a new instance from a saved AMI image with CUDA/Tensorflow preinstalled, the environment was broken again and again.  I ended up recompiling / reinstalling the environment so many times and now I can install everything with my eyes covered.  It was a big fun, and more fun when I see the coming Amazon AWS bill :).

4 thoughts on “Text Summarization using Sequence-to-Sequence model in Tensorflow and GPU computing: Part 2 – AWS P2 Instance Installation

  1. Shee Yu thanks for putting this together! I was wanting to make the move with textsum from my local machine to AWS and I found this to be quite insightful and will be a great reference when I make the jump.

    Have you had any luck exporting the decoding portion of the textsum model for Tensorflow serving yet? It’s an issue I have been having a problem with and have yet to get something working. Regardless, thanks again!

    Like

    • Hey Daniel.

      You are welcome! No, I haven’t dig into decoding yet. Though I am not sure about the exact problem you had, but Is that something related to Tensorflow “Checkpoint”?

      Like

      • It’s the overall process of exporting a model for tensorflow serving. I finally got the model trained against data where my decode results are somewhat valid, so I now was wanting to try sending in an article and getting a headline response. It is my understanding that I must use Tensorflow Serving to accomplish this. In order to serve the Tensorflow model on a server you must export the model similar to the tutorials for Tensorflow Serving. This requires setting up signatures, class references and output params, but I have run into problem upon problem attempting to export. I too thought I could just go about using the checkpoint files created by the model, however it seems that this is not the case and in order to host the model, one must actually perform an export.

        Should I ever figure it out I will reference the answer. Thanks for the response and I look forward to seeing what you work on next.

        Like

Leave a comment