Torch vs TensorFlow vs Theano
For an ongoing project at GA-CCRi, we wanted to determine whether remaining with Torch (used for the GPU-based Phase I of the project) or switching to TensorFlow or Theano made the most sense for Phase II. We ultimately found that TensorFlow’s combination of performance and usability made it the best choice as we move into Phase II.
As always with such tests, newer versions of any components used can make these results increasingly dated, but it was interesting to compare the current state of the art of the three frameworks.
All benchmarks were run on boxes using a single Pascal Titan X graphics card with CUDA 8 and version 5.1 of cuDNN, NVIDIA’s CUDA Deep Neural Network library.
The GitHub page Setting up a Deep Learning Machine from Scratch (Software) has good background on installing many of the tools described here.
Modeling Metrics
Half-precision floating point (fp16) support
Torch
- Torch’s cunn library (the standard CUDA neural network backend for Torch) recently finished its support for fp16 computation.
- Problem: that support doesn’t cover Recurrent Neural Networks (RNNs) or even basic layers such as Linear.
TensorFlow
- Has fp16 storage support, but no fp16 computation at the moment (a rough sketch of the distinction follows). TensorFlow has a GitHub issue tracking this, but Google has been largely silent.
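To illustrate what “storage but not computation” means, here is a minimal, hypothetical sketch (shapes invented for illustration): weights are kept in fp16 to halve memory, but cast up to fp32 before the actual math.

```python
import tensorflow as tf

# Hypothetical sketch: fp16 storage with fp32 computation. The weight matrix
# lives in float16 to save memory, but is cast to float32 before the matmul,
# since fp16 math kernels are not yet available.
w = tf.Variable(tf.zeros([1024, 1024], dtype=tf.float16))   # fp16 storage
x = tf.placeholder(tf.float32, [None, 1024])
y = tf.matmul(x, tf.cast(w, tf.float32))                    # fp32 compute
```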
Theano
- Has limited fp16 support in beta. See Theano GitHub ticket 2908.
Note on fp16 on Pascal Titan X
According to the AnandTech article The NVIDIA GeForce GTX 1080 & GTX 1070 Founders Editions Review: Kicking Off the FinFET Generation, it is unclear how much fp16 support actually helps on the Pascal Titan X architecture. To evaluate this, we ran tests on one of the boxes described above using Torch7, measuring both memory usage and speed, with speed reported in samples / sec.
| Model: | VGG-16 with fully connected layers removed |
|---|---|
| GPU: | 1 Pascal Titan X |
| Drivers: | CUDA 8 with cuDNN 5.1 |
Summary: fp16 uses less GPU memory but is slower per sample. The only case I can see for using fp16 on a Pascal Titan X is when the model plus a single batch would otherwise be too big to fit in memory.
| precision | batch size | forward samples / sec | forward + backward samples / sec | max memory usage (MB) |
|---|---|---|---|---|
| fp32 | 32 | 191 | 55 | 4269 |
| fp16 | 32 | 149 | 38 | 2373 |
| fp32 | 64 | 194 | 55 | 7981 |
| fp16 | 64 | 151 | 38 | 4239 |
| fp32 | 128 | out of memory | out of memory | out of memory |
| fp16 | 128 | 150 | 36 | 7971 |
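The samples/sec numbers are simply the number of input samples processed divided by wall-clock time. The fp16 numbers above were produced with a Torch7 script running VGG-16; the following is just a minimal, hypothetical Python/Keras sketch of the same measurement approach, with a small stand-in model:

```python
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Hypothetical sketch of the samples/sec measurement: run a fixed number of
# batches and divide total samples by elapsed time. The model here is a
# small stand-in, not the VGG-16 network actually benchmarked.
batch_size, n_batches, input_dim = 64, 100, 1024

model = Sequential()
model.add(Dense(512, input_dim=input_dim, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')

x = np.random.rand(batch_size, input_dim).astype('float32')
y = np.zeros((batch_size, 10), dtype='float32')
y[:, 0] = 1.0

start = time.time()
for _ in range(n_batches):
    model.predict_on_batch(x)                    # forward only
forward_sps = batch_size * n_batches / (time.time() - start)

start = time.time()
for _ in range(n_batches):
    model.train_on_batch(x, y)                   # forward + backward
train_sps = batch_size * n_batches / (time.time() - start)

print("forward: %d samples/sec, forward+backward: %d samples/sec"
      % (forward_sps, train_sps))
```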
CNN Benchmarks
Summary: unless you’re Nervana, if you use cuDNN, performance is basically the same across frameworks.
- For CNN layers, the soumith convnet benchmarks are well documented and regularly updated. Unfortunately, they don’t include Theano in the more recent runs. I am not entirely sure what the fp16 benchmark mentioned there is measuring, exactly; my guess (since it is running on an older Titan X) is that it is simulated and hacked in rather than using native fp16 support.
LSTM Benchmarks
- For RNN layers (including LSTMs), there are the glample rnn benchmarks, which at this writing date back to May 2016 and use TensorFlow 0.8, even though 0.11 is currently available. We reran these tests on more recent software; the results are below.
| Model: | A single LSTM layer |
|---|---|
| GPU: | 1 Pascal Titan X |
| Drivers: | CUDA 8 with cuDNN 5.1 |
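For concreteness, here is a minimal, hypothetical Keras sketch of a single-LSTM-layer model sized like one row of the table below. This is not the exact benchmark script, and the input dimension is an assumption made for illustration.

```python
from keras.models import Sequential
from keras.layers import LSTM

# Hypothetical sketch: one LSTM layer with sequence length 30 and 128 hidden
# units, matching one configuration in the benchmark table. The input
# dimension (128) is an assumption made for illustration.
seq_len, input_dim, hidden_size = 30, 128, 128

model = Sequential()
model.add(LSTM(hidden_size, input_shape=(seq_len, input_dim),
               return_sequences=True))
model.compile(optimizer='sgd', loss='mse')
model.summary()
```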
Summary:
- Generally, for just the forward pass, Torch > Theano > TensorFlow.
- For forward + backward, it seems that Theano > Torch > TensorFlow. Torch and Theano are generally about the same here, except at smaller batch sizes with larger numbers of hidden units, where Theano handily beats both Torch and TensorFlow.
- As the batch size and hidden layer size grow, the differences between the frameworks shrink. This is not surprising, as more of the work is handed off to CUDA (via cuDNN), which is the same across the board.
| framework | sequence length | batch size | hidden layer size | forward samples / sec | forward + backward samples / sec |
|---|---|---|---|---|---|
| Torch | 30 | 32 | 128 | 22110 | 4849 |
| TensorFlow | 30 | 32 | 128 | 2778 | 1410 |
| Theano | 30 | 32 | 128 | 15462 | 5440 |
| Torch | 30 | 32 | 512 | 6722 | 1582 |
| TensorFlow | 30 | 32 | 512 | 2155 | 1285 |
| Theano | 30 | 32 | 512 | 7127 | 1874 |
| Torch | 30 | 32 | 1024 | 3618 | 864 |
| TensorFlow | 30 | 32 | 1024 | 1790 | 888 |
| Theano | 30 | 32 | 1024 | 4421 | 1143 |
| Torch | 30 | 128 | 128 | 74897 | 15131 |
| TensorFlow | 30 | 128 | 128 | 8656 | 5411 |
| Theano | 30 | 128 | 128 | 53953 | 14491 |
| Torch | 30 | 128 | 512 | 27781 | 7335 |
| TensorFlow | 30 | 128 | 512 | 6421 | 4238 |
| Theano | 30 | 128 | 512 | 23037 | 6514 |
| Torch | 30 | 128 | 1024 | 10524 | 3090 |
| TensorFlow | 30 | 128 | 1024 | 4753 | 2702 |
| Theano | 30 | 128 | 1024 | 9679 | 2751 |
| Torch | 60 | 32 | 128 | 11126 | 2364 |
| TensorFlow | 60 | 32 | 128 | 1353 | 879 |
| Theano | 60 | 32 | 128 | 5538 | 3092 |
| Torch | 60 | 32 | 512 | 3344 | 785 |
| TensorFlow | 60 | 32 | 512 | 1272 | 811 |
| Theano | 60 | 32 | 512 | 3951 | 1060 |
| Torch | 60 | 32 | 1024 | 1810 | 428 |
| TensorFlow | 60 | 32 | 1024 | 1009 | 467 |
| Theano | 60 | 32 | 1024 | 2339 | 613 |
| Torch | 60 | 128 | 128 | 37693 | 7575 |
| TensorFlow | 60 | 128 | 128 | 5278 | 3328 |
| Theano | 60 | 128 | 128 | 31076 | 8702 |
| Torch | 60 | 128 | 512 | 13966 | 3676 |
| TensorFlow | 60 | 128 | 512 | 4057 | 2691 |
| Theano | 60 | 128 | 512 | 12505 | 3649 |
| Torch | 60 | 128 | 1024 | 5248 | 1543 |
| TensorFlow | 60 | 128 | 1024 | 2695 | 1423 |
| Theano | 60 | 128 | 1024 | 4366 | 1409 |
Fluffy Metrics
Usability
Because these are developer tools, we reviewed usability in terms of the Python interfaces for TensorFlow and Theano and the Lua interface for Torch.
Writing Code
Generally, the ease or difficulty of using Torch versus TensorFlow comes down to the choice of language. Everyone seems to have Python experience nowadays, whereas Lua experience is rarer, and the lack of many basic functions in the Lua language raises the barrier to entry for new users picking up and coding in the environment. Theano requires a paradigm shift in how you think about writing the code (sketched below), which makes it more verbose and complicated in general.
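As a small, hypothetical illustration of that paradigm shift: in Theano you first build a symbolic expression graph and then compile it into a callable function before any actual numbers are involved.

```python
import numpy as np
import theano
import theano.tensor as T

# Hypothetical sketch of Theano's symbolic style: declare symbolic variables,
# build an expression graph, then compile it into a function.
x = T.dmatrix('x')                                 # symbolic input
w = theano.shared(np.zeros((4, 1)), name='w')      # shared weight matrix
y = T.nnet.sigmoid(T.dot(x, w))                    # symbolic expression
predict = theano.function(inputs=[x], outputs=y)   # compile to a callable

print(predict(np.ones((2, 4))))                    # only now do numbers flow
```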
The neural network libraries built on top of Torch (nn, rnn, …) and of TensorFlow/Theano (Keras), however, seem roughly equivalent in structure, so we expect them to present a similar barrier to entry for new users constructing their own models.
Reading Code
With the exception of the raw Theano library, both the raw frameworks and the neural network libraries built on top of them are relatively straightforward to read and understand. There are small syntactic differences here and there, plus the occasional confusing detail that the reader just has to take on faith is there for a good reason (“Why does collectgarbage() get called twice in a row here?”). Of course, if you only ever interact with Theano through Keras, then it doesn’t really matter how different raw Theano is.
Debugging
- TensorFlow: Lots of tools. You can fetch whichever elements of the graph you want and inspect them directly (see the sketch after this list), and you can watch multiple values in TensorBoard, which also handles visualization and organization.
- Torch: Debugging can be done using standard debugging tools. Breakpoints can be set in your own code and in library code, and variables can be inspected whenever one is hit.
- Theano: My experience with this is not recent, but Theano has historically been known to be a pain to debug.
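As a small illustration of the TensorFlow point above, here is a minimal, hypothetical sketch (a toy graph, not from the benchmarks) of fetching an intermediate tensor alongside the value you actually care about:

```python
import tensorflow as tf

# Hypothetical sketch: any node in the graph can be fetched in sess.run(),
# which makes inspecting intermediate values straightforward.
x = tf.placeholder(tf.float32, [None, 4])
w = tf.Variable(tf.zeros([4, 1]))
hidden = tf.matmul(x, w)                    # intermediate value to inspect
loss = tf.reduce_mean(tf.square(hidden))

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    loss_val, hidden_val = sess.run([loss, hidden],
                                    feed_dict={x: [[1., 2., 3., 4.]]})
    print(loss_val, hidden_val)
```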
Moving to Production
My understanding is that any of these could be run in a Docker container, which probably makes for the easiest deployment. Aside from that, one of the biggest difficulties with Torch is that the project doesn’t actually cut releases of its code, so your dependency is effectively “whatever copy of Torch I have right now.”