Twitter is a challenge for deep learning.
It’s easy to forget that whilst Baidu, Google and Facebook elbow-barge each other to get to the front of the line, Twitter is also a perpetual, real-time stream of unstructured and growing content. It presents a weird and wonderful storm of challenges: millions of made-up words, and images that get photoshopped, retweeted, photoshopped again, overlaid with text, retweeted once more and hashtagged.
Instead of heading off towards an IPO, deep learning start-up Mad Bits, co-founded by Clément Farabet, was swallowed up by Twitter in 2014 for the ground it had made in bringing order to the chaos of unstructured information.
Farabet has a heavyweight pedigree in deep learning research, having completed his PhD under the supervision of Facebook AI Research’s Yann LeCun; the core of his thesis was a deep learning framework for automatically parsing and understanding images and video.
Farabet recently described the challenges facing Twitter at the GPU Technology Conference:
Deep learning is really an amazing solution right now. So what’s the challenge, really? For raw media (images, video, audio), the big challenge is that we don’t have a concept of vocabulary. In general, researchers have spent decades hand-crafting features to solve those types of tasks, but that only takes you so far, and coming up with hand-crafted features for certain tasks can be extremely challenging.
For text, you do have a vocabulary, because the number of words or characters used to form language is finite. The issue is that the vocabulary can also change: on a platform like Twitter, people constantly invent words, and hashtags in particular keep changing; some hashtags are very relevant one week, then become completely useless.

Usernames that get cited a lot can also be treated as words in a vocabulary that keeps changing. Beyond the vocabulary, you have the issue of language structure: the exact sequencing of words can change the meaning radically.
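The churn Farabet describes, with new words, hashtags and usernames appearing constantly, is often handled with a fixed-size hashed vocabulary rather than a static word list. The sketch below is purely illustrative (not Twitter’s actual pipeline); the bucket count and hash scheme are assumptions.

```python
import hashlib

def hashed_bag_of_words(tokens, num_buckets=1024):
    """Map tokens into a fixed-size count vector via hashing.

    New tokens (hashtags, usernames, invented slang) need no
    vocabulary update: they simply fall into some bucket.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        # Stable hash so the same token always lands in the same bucket.
        h = int(hashlib.md5(tok.lower().encode("utf-8")).hexdigest(), 16)
        vec[h % num_buckets] += 1
    return vec

features = hashed_bag_of_words(["#YOLO", "new-word-2015", "@user"])
```

The trade-off is that unrelated tokens can collide in a bucket, but the feature space stays bounded no matter how fast the vocabulary mutates.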
So at Twitter we started using deep learning around six months ago to address some of these issues; it’s one general toolbox we can use for these problems. The challenge is that the relation between the raw data, the observed variables, and the prediction space is highly non-linear, and it has been shown empirically, by multiple people in these communities, that you can go beyond previous state-of-the-art performance for image, video, text, speech and audio problems.
Twitter uses three families of deep learners (supervised, unsupervised and various flavours of semi-supervised techniques) depending on how the data is labelled. Supervised neural nets account for around 80–90% of its work with images and videos, and Torch underpins it all.
Farabet is a key collaborator on Torch, a powerful scientific computing framework with simple-to-use neural network and optimisation libraries, along with the flexibility to implement complex neural network topologies. Farabet goes on to describe how Torch is central to Twitter’s efforts:
In terms of software, we do everything using an open-source stack called Torch, which is now used by a couple of companies like Facebook and Google, and also Twitter. It’s something we have been using for a while; I was using it before Twitter too.
One of the key characteristics of this stack is that it uses Lua as a scripting language, which is a bit esoteric compared to Python, but it gives a lot of advantages in terms of scripting, and it’s a very clean stack with a clean interface to C.
At Twitter, we have built an internal stack that relies on MPI to parallelise everything. We currently use two nodes with four GPUs per node to parallelise our training, and thanks to CUDA-aware MPI, things are surprisingly simple.
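Conceptually, in this kind of data-parallel setup each GPU computes gradients on its own shard of a batch, and an allreduce averages them so every replica applies the same update. The pure-Python sketch below simulates just that averaging step (no MPI or GPUs involved); the function names and learning rate are invented for illustration, not taken from Twitter’s stack.

```python
def allreduce_average(worker_grads):
    """Average per-worker gradient vectors, as a CUDA-aware MPI
    allreduce would do across GPUs (simulated in plain Python)."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    # Sum element-wise across workers, then divide by the worker count.
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

def sgd_step(params, worker_grads, lr=0.1):
    """One synchronous data-parallel SGD step: every worker applies
    the same averaged gradient, so all model replicas stay identical."""
    avg = allreduce_average(worker_grads)
    return [p - lr * g for p, g in zip(params, avg)]

# Two nodes x four GPUs = eight workers, as in the talk.
grads = [[float(i + 1)] * 3 for i in range(8)]  # each worker's gradient
params = sgd_step([0.0, 0.0, 0.0], grads)
```

With CUDA-aware MPI the reduction can operate directly on GPU buffers, which is what makes the real version of this “surprisingly simple”.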
We have a low-level abstraction that lets us divide jobs across multiple GPUs and multiple boxes, with InfiniBand connecting pairs of boxes. The way we then parallelise is to allocate one job per GPU and then n jobs per CPU that are used as data slaves.
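The “data slave” arrangement Farabet mentions is a producer/consumer pattern: several CPU workers load and preprocess samples while a single training job per GPU drains them from a queue. A minimal thread-based sketch of that pattern follows; it is a generic illustration, not Twitter’s implementation, and the doubling stands in for real preprocessing.

```python
import queue
import threading

def data_slave(shard, out_q):
    """CPU data-loading worker: preprocesses samples from its shard
    and feeds the queue consumed by the GPU training job."""
    for sample in shard:
        out_q.put(sample * 2)  # stand-in for real preprocessing
    out_q.put(None)  # sentinel: this slave has finished its shard

def train_job(num_slaves, in_q):
    """The single per-GPU training job drains the queue until every
    data slave has signalled completion."""
    seen, done = [], 0
    while done < num_slaves:
        item = in_q.get()
        if item is None:
            done += 1
        else:
            seen.append(item)
    return seen

q = queue.Queue(maxsize=32)
shards = [[1, 2], [3, 4], [5, 6]]  # one shard per data slave
threads = [threading.Thread(target=data_slave, args=(s, q)) for s in shards]
for t in threads:
    t.start()
batch = train_job(len(shards), q)
for t in threads:
    t.join()
```

Keeping the loaders on CPUs means the GPU job is never starved waiting on disk or decoding, which is the usual motivation for this split.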