
Published on January 19th, 2016


WARP-CTC — Open Source Artificial Intelligence

Chinese web services company Baidu has released new open-source artificial intelligence software called WARP-CTC. The code is apparently capable of speech recognition, particularly for short segments, that exceeds human capability. It uses an approach called ‘connectionist temporal classification’ (CTC).


A fast parallel implementation of CTC, on both CPU and GPU.

Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu’s Silicon Valley AI Lab.


The illustration above shows CTC computing the probability of an output sequence “THE CAT “, as a sum over all possible alignments of input sequences that could map to “THE CAT “, taking into account that labels may be duplicated because they may stretch over several time steps of the input data (represented by the spectrogram at the bottom of the image). Computing the sum of all such probabilities explicitly would be prohibitively costly due to the combinatorics involved, but CTC uses dynamic programming to dramatically reduce the complexity of the computation. Because CTC is a differentiable function, it can be used during standard SGD training of deep neural networks.
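The saving from dynamic programming can be seen on a toy example. The sketch below is not Baidu's code: it works on plain probabilities rather than in log space, and the function names and the choice of index 0 as the blank symbol are assumptions made for illustration. It computes the same sequence probability twice, once by enumerating every alignment and once with the standard CTC forward recursion.

```python
import itertools
import math

BLANK = 0  # assumption for this sketch: symbol index 0 is the CTC blank


def collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return [k for k in merged if k != BLANK]


def ctc_prob_brute_force(probs, labels):
    """Sum the probability of every alignment that collapses to `labels`.

    probs[t][k] is the probability of symbol k at time step t.
    The loop visits A**T paths, so this only works for tiny inputs.
    """
    T, A = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(A), repeat=T):
        if collapse(path) == list(labels):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


def ctc_prob_forward(probs, labels):
    """Same quantity via the forward recursion, in O(T * L) time."""
    # Interleave blanks with the labels: [blank, l1, blank, l2, ..., blank]
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    S = len(ext)

    # At t = 0 a path may start on the leading blank or on the first label.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                  # stay on the same symbol
            if s >= 1:
                a += alpha[s - 1]         # advance by one position
            # Skipping a blank is only allowed between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)
```

With 4 time steps and 3 symbols the brute-force version already visits 81 paths, while the recursion touches only a small table per time step. Both return the same value up to rounding.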

At Baidu's lab, the team has focused on scaling up recurrent neural networks, and CTC loss is an important component. To make the system efficient, Baidu parallelized the CTC algorithm, as described in this paper. This project contains Baidu's high-performance CPU and CUDA versions of the CTC loss, along with bindings for Torch. The library provides a simple C interface, so that it is easy to integrate into deep learning frameworks.

This implementation has improved training scalability beyond the performance improvement from a faster parallel CTC implementation. For GPU-focused training pipelines, the ability to keep all data local to GPU memory allows us to spend interconnect bandwidth on increased data parallelism.



This CTC implementation is much more efficient than many of the other publicly available implementations. It is also written to be as numerically stable as possible. The algorithm is numerically sensitive, and the Baidu team observed catastrophic underflow even in double precision with the standard calculation: the result of dividing two numbers on the order of 1e-324, which should have been approximately one, instead became infinity when the denominator underflowed to 0. Performing the calculation in log space makes it numerically stable even in single-precision floating point, at the cost of significantly more expensive operations: instead of one machine instruction, addition requires the evaluation of multiple transcendental functions. Because of this, the speed of CTC implementations can only be fairly compared if they perform the calculation the same way.
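To make the trade-off concrete, here is a minimal Python sketch of addition in log space. It illustrates the general log-sum-exp technique and is not taken from warp-ctc; the helper name log_add and the sample values are assumptions chosen for the example.

```python
import math

NEG_INF = float("-inf")  # log(0)


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    # One exp and one log1p replace what is a single add in linear space.
    return hi + math.log1p(math.exp(lo - hi))


# Linear space: multiplying 80 per-step probabilities of 1e-5 should give
# 1e-400, which is below the double-precision range and underflows to 0.0.
p = 1e-5
linear = 1.0
for _ in range(80):
    linear *= p

# Log space: the same product is just a sum of logs and stays finite.
log_value = 80 * math.log(p)

log_ratio = log_value - log_value        # log(x / x), i.e. log(1)
log_sum = log_add(log_value, log_value)  # log(x + x), i.e. log(x) + log(2)
```

In linear space the ratio of that product to itself cannot be computed, because both operands are already 0.0. In log space it is a subtraction that yields exactly 0.0, which is log(1).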

Baidu compared performance against three other implementations: Eesen; a CTC implementation built on Theano; and Stanford-CTC, a Cython CPU-only implementation. The Theano implementation was benchmarked operating on 32-bit floating-point numbers and doing the calculation in log space, in order to match the other implementations. Stanford-CTC was modified to perform the calculation in log space, as it did not support it natively. It also does not support minibatches larger than 1, so it would require an awkward memory layout to use in a real training pipeline; the comparison assumes its cost increases linearly with minibatch size.

Baidu provided the results on two problem sizes relevant to English and Mandarin end-to-end models, respectively, where T represents the number of timesteps in the input to CTC, L represents the length of the labels for each example, and A represents the alphabet size.

On the GPU, performance at a minibatch of 64 examples ranges from 7x to 155x faster than Eesen, and from 46x to 68x faster than the Theano implementation.



The interface is in include/ctc.h. It supports CPU or GPU execution, and you can specify OpenMP parallelism if running on the CPU, or the CUDA stream if running on the GPU. Baidu took care to ensure that the library does not perform memory allocation internally, in order to avoid synchronizations and overheads caused by memory allocation.



warp-ctc has been tested on Ubuntu 14.04 and OSX 10.10. Windows is not supported at this time.

First get the code:

git clone
cd warp-ctc

Create a build directory:

mkdir build
cd build

If you have a non-standard CUDA install, export CUDA_BIN_PATH=/path_to_cuda so that CMake detects CUDA. To ensure Torch is detected, make sure th is in $PATH.

Run cmake and build:

cmake ../
make

The C library and torch shared libraries should now be built along with test executables. If CUDA was detected, then test_gpu will be built; test_cpu will always be built.



To run the tests, make sure the CUDA libraries are in LD_LIBRARY_PATH (DYLD_LIBRARY_PATH for OSX).

The Torch tests must be run from the torch_binding/tests/ directory.


Torch Installation

luarocks make torch_binding/rocks/warp-ctc-scm-1.rockspec

You can also install without cloning the repository using

luarocks install

There is a Torch CTC tutorial.



