How Andrew Ng Is Developing Deep Learning at Baidu, SF Deep Learning Summit 2015

One of the highlights of January’s Deep Learning Summit in San Francisco was Andrew Ng’s fireside chat with Derrick Harris (fireplace not included), in which Ng gave a fantastic insight into the culture of learning and empowerment at Baidu. Shaped partly by Ng’s involvement with Coursera, it is a culture in which Deep Learning has taken on a life of its own. Baidu is seeing the returns on its investment in educating and supporting employees: people are volunteering to use the Deep Learning platform to further their own development and the success of the company, through efforts that are essentially outside of Ng’s remit.


DH: Given the audience and the subject of the conference, which is Deep Learning, I don’t think we need to dive into ‘what is Deep Learning’, but I do want to ask you about something we talked about earlier. Maybe you can talk about shifting the analogy that we usually use to talk about Deep Learning?

AN: I think about five or so years ago, when a bunch of us were starting on this, we used to explain deep learning as a neural network that kind of simulates the brain. One of the problems with that analogy is that it has led to excessive amounts of hype, where people think about super intelligences similar to the human brain, and of course all of us that work in deep learning know that is not true.

Mike Jordan likes to say that Deep Learning algorithms are really like a cartoon of the brain, and I think that’s accurate. So, for lay audiences we tried out a new analogy. You can’t tell people ‘think of it as not a brain’, that’s not helpful, so now when I think about building deep learning products I think of building a rocket ship. What is a rocket or a shuttle? It’s a giant engine and a lot of fuel; that’s basically what a rocket is.

So I think of deep learning as something similar. We need a giant engine, and that’s the giant neural networks we build, which can absorb huge amounts of data; the data is the rocket fuel. For a lot of computer systems work we are now able to build these giant machines, these giant engines, and because of the digitisation of society and other trends we now have huge amounts of data. The combination of these two things helps us launch more and more rockets and lets them go further and further.

DH: I want to talk about both those things. First of all, on the data front, the fuel: is it kind of a virtuous cycle, where the more people you get doing voice search, the more people you can convince to use these interfaces, and the more data you get to train the systems? How do you get that much fuel?

AN: One interesting thing that has happened with the rise of Deep Learning is that for many applications we are now able to build these giant machines that can absorb more data than even leading tech companies can get access to. We used to have this idea of the virtuous circle of AI: you build a great product, it gets you lots of users, the users generate data, the data helps you make your product even better, which gets you even more users, and you go round and round, these things having a positive feedback effect on each other.

That hadn’t worked until recently, because if you look at the older generations of AI algorithms, even as you got more data, performance would get better but then plateau, as if the older generation of algorithms didn’t know what to do with all the data we fed it. Deep Learning algorithms are really a class of algorithms, maybe the first I’ve seen in my career, where the more data you feed them, the better they get, and I think that is today letting tech companies start on this virtuous cycle.

In this interim period, while we are just powering up this positive feedback loop, I am seeing a number of efforts use very innovative ways to acquire huge data sets. In computer vision and speech recognition, Baidu and other organisations as well have been aggressively pushing the envelope on data augmentation. Take speech recognition: in academia, the largest data set is about two thousand hours. We started out with seven thousand hours of audio data when building our speech recognition system, quite a bit bigger than what you see in academic data sets, but we said seven thousand hours wasn’t enough. So what we did was add all sorts of noise to it: car driving noise, restaurant noise, crowd noise. We took that seven thousand hours of data and synthesised about a hundred thousand hours of total data, so we train our speech systems on around a hundred thousand hours of speech data, which is way more than you see in any academic paper. And that was the secret sauce really, the rocket fuel. The hundred thousand hours of data, together with our large investment in GPU clusters, supercomputers for Deep Learning, is the combination of the rocket engine and the rocket fuel that allowed us to build what I think is a state of the art speech recognition system today.
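The noise-mixing trick Ng describes can be sketched in a few lines. This is not Baidu’s actual pipeline, just a minimal illustration of additive-noise augmentation: scale a noise clip to a target signal-to-noise ratio and mix it into the clean audio, so that one recording yields many distinct training variants.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at a target signal-to-noise ratio.

    Both inputs are 1-D float arrays of audio samples; the noise clip is
    tiled or trimmed to match the speech length before mixing.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
cafe = rng.normal(size=8000)                                # stand-in "restaurant" noise

# Each (noise type, SNR) pair yields a new training utterance from one clean clip.
augmented = [add_noise(clean, cafe, snr) for snr in (0, 5, 10, 20)]
```

Repeating this over several noise types and SNR levels is how seven thousand hours of audio can plausibly become on the order of a hundred thousand hours of training data.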

DH: I want to ask about the supercomputer you just mentioned, because I think that’s one of the more interesting papers I have read in a while. What was the impetus? At Google you were building more of a distributed cloud system, then all of a sudden you’re building this supercomputer with high speed interconnects. What was the rationale behind the supercomputer?

AN: I have been seeing a shift in the centre of gravity of the technology that powers deep learning. Several years ago, in 2011, I started and was leading the Google Brain team, Google’s main Deep Learning team. Later on you’ll hear from Greg Corrado and Quoc Le, who were two of the earliest members that I recruited into the team I was leading at Google, so I am excited about the work they continue to do. When I started the Google Brain team, the centre of gravity of that team was using cloud computing in order to scale up Deep Learning. We said Google has more computers, let’s use Google’s computers to build giant neural networks. Google is a fantastic cloud company, and we used fundamentally cloud computing technologies in order to scale. It was very successful, and I am very proud of the work that the team did when I was there and continues to do today without me.

More recently I have been taking a different approach: instead of relying on cloud technologies, we are relying on high performance computing, supercomputing technologies. That is a different group of people, a different skill set; you recruit from different venues. Baidu, as many of you may know, was the first company to build a GPU cluster for Deep Learning. It turns out that these technologies are pretty different. If you are running a cloud computing shop, you might distribute your jobs across a thousand servers, and if the mean time to failure of a computer is three years, running a thousand computers means that about one computer will die per day. So if you are using cloud technologies, you need to worry about computers going down: if you have a thousand computers and one fails, how do you recover from that? A lot of the complexity of the work that my team did at Google was dealing with these cases of machine failures.
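Ng’s one-failure-per-day figure is just expected-value arithmetic, which a few lines make concrete (the 32-node HPC cluster size below is an illustrative assumption, not a number from the talk):

```python
# Back-of-the-envelope: with a 3-year mean time to failure per machine,
# the expected number of failures per day scales linearly with cluster size.
MTTF_YEARS = 3
DAYS_PER_YEAR = 365

def failures_per_day(num_machines):
    return num_machines / (MTTF_YEARS * DAYS_PER_YEAR)

cloud = failures_per_day(1000)  # ~0.91 failures/day: expect one dead machine daily
hpc = failures_per_day(32)      # ~0.03 failures/day: one failure every month or so
```

At cloud scale, failure handling dominates the engineering effort; at HPC scale it is rare enough to mostly ignore, which is the trade-off Ng describes.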

In the HPC world, you have a much more modest number of servers, but you kind of don’t worry about them failing. That’s the different approach we have taken at Baidu: using HPC technologies, but then worrying about other things like latency hiding and fast networking, fast interconnects. I think that latter approach has allowed us to scale up our Deep Learning algorithms pretty aggressively at Baidu.

DH: How does the low latency affect how the algorithms are running?

AN: One of the things I learned, both when I was running the Google team and now building a new team at Baidu, is that computer systems are really important for Deep Learning. On both the teams I have been leading there has been tight collaboration between great systems researchers and great AI researchers. What is the mental model for the job of a machine learning researcher; what do you do? You have an idea, then you have to express your idea in code. Expressing your idea in code allows you to run an experiment; you see how your experiment goes, and this gives you additional information that causes you to have better ideas, and you kind of go round that cycle.

When I think about my job in terms of designing the deep learning team at Baidu, I obsess a lot about having the systems team make it really efficient for machine learning researchers to progress through this iterative empirical process of designing learning algorithms. So what does that mean? One, it means we obsess about writing great developer tools, so that it’s really efficient for a machine learning researcher to express your ideas in code. Two, some of the experiments we run used to take two weeks, so we also obsess about building HPC supercomputers in order to bring down that iteration time, so that you can get a result back in three hours instead of a week. That again speeds up your progress as a machine learning researcher in terms of how quickly you get your results back, get better intuitions, get better ideas and iterate.

The last thing I do is ask: where do ideas come from? The last thing I do at Baidu (I think I did a good job of this at Google too, actually!) is invest heavily in employee development. Everyone talks about employee development, like, you know, train employees, but maybe because of my background as a Co-Founder of Coursera I have a certain understanding of how to develop employees and make sure that everyone learns. I have never seen any other organisation train employees as intensely as we do at Baidu. It’s one of the things we obsess about: people read papers all the time, share ideas all the time; we really value employees teaching each other. We’ve designed tons of processes, based on the lessons I learned working on education at Coursera, to make sure that every member of the team is learning rapidly.

DH: That’s kind of a two-pronged thing. First, how and where do you actually recruit? Because eighteen months ago it seemed to me the story would read that Deep Learning is so hard and so new that only a handful of students understand it, and they get paid a gazillion dollars, and now everyone seems to have an opinion or an understanding. When you look for talent, how do you get the right people? Second of all, if you are a novice, if you are trying to learn Deep Learning, is there a place to go to get the fundamental skills that maybe can get you a job, or at least get you functional?

AN: It turns out that from a Machine Learning Researcher’s perspective, we offer the best platform, because of our systems investments and our employee development, for a Researcher to make rapid progress in Machine Learning, and that’s actually an appealing prospect to a lot of Machine Learning Researchers. For Systems Researchers, for a lot of HPC supercomputer Researchers, there is a sense that this is the time. Traditionally supercomputers were used to do atomic bomb simulations or whatever, and maybe that is OK, but now supercomputer technology, the GPU clusters, fast interconnects, latency hiding, the complicated scheduling work being done, is having a huge impact on AI, which I think will change the world. For supercomputer Researchers, the opportunity to do that work and have an impact on the world has been very appealing as well. So we’ve been attracting people from industry and from top universities; we’ve not really had a huge problem recruiting. I think people also know that if they join Baidu we are very efficient. With my background in education, I can vouch for this: we are one of the most efficient organisations in the world at teaching you to be great at Deep Learning. So the combination of those things has made quite a lot of people want to join us.

DH: So it’s on-the-job training? If you get a smart enough person, you can teach them what they need to know, right?

AN: Yeah, and teaching them all the lessons I learned. It turns out that, because of Coursera, we are pretty good at teaching facts: what’s the state capital of Montana, facts like that. We are pretty good at teaching procedures: how do you implement this thing, how to implement Quicksort. We know how to teach those things.

In education, what we find harder to do is teach strategic skills. By strategic skills I mean: as a Deep Learning Researcher, you wake up, your neural network trained the night before, and something happened, some totally new situation that no other human has been in before. So what do you do? Do you get more data? Do you plot the data? Do you visualise it this way? Do you read a paper? That’s a strategic skill. The education system has so far found it relatively challenging to teach strategic skills, as opposed to facts and procedures, and I think the best way, frankly the only way, we know how to teach strategic skills is to show you example after example.

This is why MBA programmes, Business Schools, use case studies, showing you example after example of corporate strategy, you kind of pick it up. So, what we’ve been trying to do at Baidu is systemise the process that lets employees join us and see example after example so that after a relatively short period of time you will have seen tons of these examples and be able to make these strategic decisions. 

The other way I like to think about this is: we train airplane pilots in flight simulators, because if you stick a pilot in a normal airplane, you might need to fly for years or decades before you see the emergency scenarios and have to decide what to do. But if you put a pilot in an aircraft simulator, you can show them tons of examples of things that can go wrong in the air: the wing falls off (hopefully not!), the engine dies, some fuel problem, some brake problem. So we try to show tons of examples to the employees that join us, so that in a relatively short period of time you can be in that compressed learning environment, like a flight simulator for Deep Learning, and acquire these skills deeply.

DH: So if you want to get into this space and learn Machine Learning, then hopefully you’ll go somewhere and they’ll teach you. Can you go to Coursera and learn Deep Learning?

AN: If you are starting out and you want to learn the foundations of Deep Learning, my team at Stanford put a lot of work into designing a tutorial. Geoff Hinton also had a course on Coursera; I think his videos are still up, and they teach the basics, the foundations of Deep Learning. Tons of people have worked through the Stanford tutorial. But to learn the strategic skills, sadly, I think the most efficient way is probably still to join one of the top Deep Learning groups. It will be interesting to see if there are ways to teach that more properly.

DH: My last question before we open it up here (questions) is about the other thing one reads, tied to this super intelligence thing you mentioned before. You get this idea that Deep Learning is over-hyped a lot; on the other hand, at Baidu you are actually running stuff in production based on Deep Learning. Can you tell us about the type of stuff you are doing at Baidu?

AN: Baidu has tons of products and services in production that use Deep Learning. We have a lot of image services. Actually, I think that is a nice sweater; if I take a picture of your sweater with my cellphone, Baidu has a relatively unique feature that will use Deep Learning to recognise the sweater and try to tell me where to buy a similar one. So, tons of image products, all powered by Deep Learning. I think we were the first company to figure out how to do Deep Learning on advertising and make it work really well, and as we talked about before, it has had a significant impact on revenue.

One of the things I think Baidu did really well early on was that, internally, a team built a very effective Deep Learning platform that opened up Deep Learning all across the company, so that any engineer could use state of the art Deep Learning tools, run them on our GPU servers, either in training or in production and testing, and apply Deep Learning algorithms to very, very diverse sets of applications. One of the examples we were chatting about earlier: one of the engineers in our infrastructure team decided to use Deep Learning to try to predict when hardware would fail, so today we can predict, with 75% recall or something, whether a hard disk in our data centre is about to fail. This helps us reduce our data centre operations costs and increase the reliability of service to users.
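Recall here means the fraction of disks that actually failed that the model flagged in advance. A toy sketch makes the metric concrete (the labels below are made up for illustration; 1 means a disk failed, or was predicted to fail):

```python
def recall(y_true, y_pred):
    """Fraction of actual positives the predictor caught (true-positive rate)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical labels: first four disks actually failed.
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
recall(actual, predicted)  # 3 of 4 real failures caught -> 0.75
```

Catching three out of every four impending failures is enough to migrate data off most at-risk disks before they die, which is where the operational savings come from.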

This is an example of the sort of application that I, as a Machine Learning guy, would never have thought of doing, but building a platform and empowering engineers all across the company to use these tools enables all sorts of applications to flourish. Deep Learning today is used by a lot more products all across Baidu than I have been able to keep track of, but there are a few, like speech recognition, image products and advertising, that I have been able to keep track of.



About Gary Donovan

Machine Learning and Data Science blogger, hacker, consultant living in Melbourne, Australia. Passionate about the people and communities that drive forward the evolution of technology.