Torch vs TensorFlow vs Theano
For an ongoing project at GA-CCRi, we wanted to determine whether remaining with Torch (used for the GPU-based Phase I of the project) or switching to TensorFlow or Theano made the most sense for Phase II. We ultimately found that TensorFlow’s combination of performance and usability made it the best choice as we move into Phase II.
As always with such tests, newer versions of any components used can make these results increasingly dated, but it was interesting to compare the current state of the art of the three frameworks.
All benchmarks were run on boxes using a single Pascal Titan X graphics card with CUDA 8 and version 5.1 of cuDNN, NVIDIA’s CUDA Deep Neural Network library.
The GitHub page Setting up a Deep Learning Machine from Scratch (Software) has good background on installing many of the tools described here.
Modeling Metrics
Half-precision floating point (fp16) support
Torch
- Torch’s cunn library (the standard CUDA neural network backend for Torch) recently finished its support for fp16 computation.
- Problem: that support doesn’t cover Recurrent Neural Networks (RNNs) or even basic layers such as Linear.
TensorFlow
- Has fp16 storage support, but no fp16 computation at the moment (a rough sketch of the distinction follows). TensorFlow has a GitHub issue tracking this, but Google has been largely silent.
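To illustrate what “storage but not computation” means, here is a minimal, hypothetical sketch (shapes invented for illustration): weights are kept in fp16 to halve memory, but cast up to fp32 before the actual math.

```python
import tensorflow as tf

# Hypothetical sketch: fp16 storage with fp32 computation. The weight matrix
# lives in float16 to save memory, but is cast to float32 before the matmul,
# since fp16 math kernels are not yet available.
w = tf.Variable(tf.zeros([1024, 1024], dtype=tf.float16))   # fp16 storage
x = tf.placeholder(tf.float32, [None, 1024])
y = tf.matmul(x, tf.cast(w, tf.float32))                    # fp32 compute
```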
Theano
- Has limited fp16 support in beta. See Theano GitHub ticket 2908.
Note on fp16 on Pascal Titan X
According to the AnandTech article The NVIDIA GeForce GTX 1080 & GTX 1070 Founders Editions Review: Kicking Off the FinFET Generation, it is unclear how much fp16 support actually helps on the Pascal Titan X architecture. To evaluate this, we ran tests on one of the boxes described above using Torch7, measuring both memory usage and speed, with speed reported in samples / sec.
| Model: | VGG-16 with fully connected layers removed |
|---|---|
| GPU: | 1 Pascal Titan X |
| Drivers: | CUDA 8 with cuDNN 5.1 |
Summary: fp16 uses less GPU memory but is slower per sample. The only case I can see for using fp16 on a Pascal Titan X is when the model plus a single batch would otherwise be too big to fit in memory.
| precision | batch size | forward samples / sec | forward + backward samples / sec | max memory usage (MB) |
|---|---|---|---|---|
| fp32 | 32 | 191 | 55 | 4269 |
| fp16 | 32 | 149 | 38 | 2373 |
| fp32 | 64 | 194 | 55 | 7981 |
| fp16 | 64 | 151 | 38 | 4239 |
| fp32 | 128 | out of memory | out of memory | out of memory |
| fp16 | 128 | 150 | 36 | 7971 |
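The samples/sec numbers are simply the number of input samples processed divided by wall-clock time. The fp16 numbers above were produced with a Torch7 script running VGG-16; the following is just a minimal, hypothetical Python/Keras sketch of the same measurement approach, with a small stand-in model:

```python
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Hypothetical sketch of the samples/sec measurement: run a fixed number of
# batches and divide total samples by elapsed time. The model here is a
# small stand-in, not the VGG-16 network actually benchmarked.
batch_size, n_batches, input_dim = 64, 100, 1024

model = Sequential()
model.add(Dense(512, input_dim=input_dim, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')

x = np.random.rand(batch_size, input_dim).astype('float32')
y = np.zeros((batch_size, 10), dtype='float32')
y[:, 0] = 1.0

start = time.time()
for _ in range(n_batches):
    model.predict_on_batch(x)                    # forward only
forward_sps = batch_size * n_batches / (time.time() - start)

start = time.time()
for _ in range(n_batches):
    model.train_on_batch(x, y)                   # forward + backward
train_sps = batch_size * n_batches / (time.time() - start)

print("forward: %d samples/sec, forward+backward: %d samples/sec"
      % (forward_sps, train_sps))
```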
CNN Benchmarks
Summary: unless you’re Nervana, if you use cuDNN, performance is basically the same across frameworks.
- For CNN layers, the soumith convnet benchmarks are well documented and regularly updated. Unfortunately, they don’t include Theano in the more recent runs. I am not entirely sure what the fp16 benchmark mentioned there is measuring, exactly; my guess (since it is running on an older Titan X) is that it is simulated and hacked in rather than using native fp16 support.
LSTM Benchmarks
- For RNN layers (including LSTMs), there are the glample rnn benchmarks, which at this writing date back to May 2016 and use TensorFlow 0.8, even though 0.11 is currently available. We reran these tests on more recent software; the results are below.
| Model: | A single LSTM layer |
|---|---|
| GPU: | 1 Pascal Titan X |
| Drivers: | CUDA 8 with cuDNN 5.1 |
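For concreteness, here is a minimal, hypothetical Keras sketch of a single-LSTM-layer model sized like one row of the table below. This is not the exact benchmark script, and the input dimension is an assumption made for illustration.

```python
from keras.models import Sequential
from keras.layers import LSTM

# Hypothetical sketch: one LSTM layer with sequence length 30 and 128 hidden
# units, matching one configuration in the benchmark table. The input
# dimension (128) is an assumption made for illustration.
seq_len, input_dim, hidden_size = 30, 128, 128

model = Sequential()
model.add(LSTM(hidden_size, input_shape=(seq_len, input_dim),
               return_sequences=True))
model.compile(optimizer='sgd', loss='mse')
model.summary()
```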
Summary:
- Generally, for just the forward pass, Torch > Theano > TensorFlow.
- For forward + backward, it seems that Theano > Torch > TensorFlow. Torch and Theano are generally about the same here, except at smaller batch sizes with larger numbers of hidden units, where Theano handily beats both Torch and TensorFlow.
- As the batch size and hidden layer size grow, the differences between the frameworks shrink. This is not surprising, as more of the work is handed off to CUDA (via cuDNN), which is the same across the board.
| framework | sequence length | batch size | hidden layer size | forward samples / sec | forward + backward samples / sec |
|---|---|---|---|---|---|
| Torch | 30 | 32 | 128 | 22110 | 4849 |
| TensorFlow | 30 | 32 | 128 | 2778 | 1410 |
| Theano | 30 | 32 | 128 | 15462 | 5440 |
| Torch | 30 | 32 | 512 | 6722 | 1582 |
| TensorFlow | 30 | 32 | 512 | 2155 | 1285 |
| Theano | 30 | 32 | 512 | 7127 | 1874 |
| Torch | 30 | 32 | 1024 | 3618 | 864 |
| TensorFlow | 30 | 32 | 1024 | 1790 | 888 |
| Theano | 30 | 32 | 1024 | 4421 | 1143 |
| Torch | 30 | 128 | 128 | 74897 | 15131 |
| TensorFlow | 30 | 128 | 128 | 8656 | 5411 |
| Theano | 30 | 128 | 128 | 53953 | 14491 |
| Torch | 30 | 128 | 512 | 27781 | 7335 |
| TensorFlow | 30 | 128 | 512 | 6421 | 4238 |
| Theano | 30 | 128 | 512 | 23037 | 6514 |
| Torch | 30 | 128 | 1024 | 10524 | 3090 |
| TensorFlow | 30 | 128 | 1024 | 4753 | 2702 |
| Theano | 30 | 128 | 1024 | 9679 | 2751 |
| Torch | 60 | 32 | 128 | 11126 | 2364 |
| TensorFlow | 60 | 32 | 128 | 1353 | 879 |
| Theano | 60 | 32 | 128 | 5538 | 3092 |
| Torch | 60 | 32 | 512 | 3344 | 785 |
| TensorFlow | 60 | 32 | 512 | 1272 | 811 |
| Theano | 60 | 32 | 512 | 3951 | 1060 |
| Torch | 60 | 32 | 1024 | 1810 | 428 |
| TensorFlow | 60 | 32 | 1024 | 1009 | 467 |
| Theano | 60 | 32 | 1024 | 2339 | 613 |
| Torch | 60 | 128 | 128 | 37693 | 7575 |
| TensorFlow | 60 | 128 | 128 | 5278 | 3328 |
| Theano | 60 | 128 | 128 | 31076 | 8702 |
| Torch | 60 | 128 | 512 | 13966 | 3676 |
| TensorFlow | 60 | 128 | 512 | 4057 | 2691 |
| Theano | 60 | 128 | 512 | 12505 | 3649 |
| Torch | 60 | 128 | 1024 | 5248 | 1543 |
| TensorFlow | 60 | 128 | 1024 | 2695 | 1423 |
| Theano | 60 | 128 | 1024 | 4366 | 1409 |
Fluffy Metrics
Usability
Because these are developer tools, we reviewed usability in terms of the Python interfaces for TensorFlow and Theano and the Lua interface for Torch.
Writing Code
Generally, the ease or difficulty of using Torch versus TensorFlow comes down to the choice of language. Everyone seems to have Python experience nowadays, whereas Lua experience is rarer, and the lack of many basic functions in the Lua language raises the barrier to entry for new users picking up and coding in the environment. Theano requires a paradigm shift in how you think about writing the code (sketched below), which makes it more verbose and complicated in general.
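As a small, hypothetical illustration of that paradigm shift: in Theano you first build a symbolic expression graph and then compile it into a callable function before any actual numbers are involved.

```python
import numpy as np
import theano
import theano.tensor as T

# Hypothetical sketch of Theano's symbolic style: declare symbolic variables,
# build an expression graph, then compile it into a function.
x = T.dmatrix('x')                                 # symbolic input
w = theano.shared(np.zeros((4, 1)), name='w')      # shared weight matrix
y = T.nnet.sigmoid(T.dot(x, w))                    # symbolic expression
predict = theano.function(inputs=[x], outputs=y)   # compile to a callable

print(predict(np.ones((2, 4))))                    # only now do numbers flow
```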
The neural network libraries built on top of Torch (nn, rnn, …) and of TensorFlow/Theano (Keras), however, seem roughly equivalent in structure, so we expect them to present a similar barrier to entry for new users constructing their own models.
Reading Code
With the exception of the raw Theano library, both the raw frameworks and the neural network libraries built on top of them are relatively straightforward to read and understand. There are small syntactic differences here and there, plus the occasional confusing detail that the reader just has to take on faith is there for a good reason (“Why does collectgarbage() get called twice in a row here?”). Of course, if you only ever interact with Theano through Keras, then it doesn’t really matter how different raw Theano is.
Debugging
- TensorFlow: Lots of tools. You can fetch whichever elements of the graph you want and inspect them directly (see the sketch after this list), and you can watch multiple values in TensorBoard, which also handles visualization and organization.
- Torch: Debugging can be done using standard debugging tools. Breakpoints can be set in your own code and in library code, and variables can be inspected whenever one is hit.
- Theano: My experience with this is not recent, but Theano has historically been known to be a pain to debug.
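As a small illustration of the TensorFlow point above, here is a minimal, hypothetical sketch (a toy graph, not from the benchmarks) of fetching an intermediate tensor alongside the value you actually care about:

```python
import tensorflow as tf

# Hypothetical sketch: any node in the graph can be fetched in sess.run(),
# which makes inspecting intermediate values straightforward.
x = tf.placeholder(tf.float32, [None, 4])
w = tf.Variable(tf.zeros([4, 1]))
hidden = tf.matmul(x, w)                    # intermediate value to inspect
loss = tf.reduce_mean(tf.square(hidden))

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    loss_val, hidden_val = sess.run([loss, hidden],
                                    feed_dict={x: [[1., 2., 3., 4.]]})
    print(loss_val, hidden_val)
```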
Moving to Production
My understanding is that any of these could be run in a Docker container, which probably makes for the easiest deployment. Aside from that, one of the biggest difficulties with Torch is that the project doesn’t actually cut releases of its code, so your dependency is effectively “whatever copy of Torch I have right now.”