Last week saw the New York City instalment of MLConf 2015, a day bursting with insightful talks from respected experts at the leading edge of the commercial application of machine learning, with more than fifteen sessions delivered over the event.
Corinna Cortes, Head of Research at Google, kicked off the day with ‘Finding Structured Data at Scale and Scoring its Quality’. The talk looked at how Google introduced Structured Snippets to search results: a collaboration between Google Research and the Web Search team that uses algorithms to determine the quality and relevance of tabled data before presenting it to users.
For the more paranoid among us, Ted Willke, Senior Principal Engineer at Intel Labs, gave a talk on some of the machine learning approaches being used to learn what people are thinking in real time. Intel have partnered with pioneers of brain decoding using MRI to discover more about real neural networks and how real-time cognitive processes can be accurately uncovered.
Facebook’s Jeff Johnson spoke about using the right tools for the job in ‘Hacking GPUs for Deep Learning’. Johnson argued that many deep learning workloads are too small to warrant the use of GPUs, delved into the use of GPUs in current architectures, and spoke about deep learning at Facebook AI Research.
Prior to working at Yahoo, Alina Beygelzimer worked at the IBM Watson Research Center, receiving the Pat Goldberg Best Paper Award for her work on nearest neighbour search. One of the challenges she faces at Yahoo is optimising and evaluating models from data that is collected from biased user input, since what users are shown shapes how they respond. Beygelzimer’s talk discussed the importance of collecting the right data for supervised learning.
Bryan Thompson is Founder of SYSTAP LLC and discussed research on SYSTAP’s MapGraph platform. MapGraph is an API that exploits data-level parallelism to deliver high-performance graph analytics on GPUs. Thompson discussed the exciting future of MapGraph, which on 64 NVIDIA K20 GPUs can traverse a scale-free graph of 4.3 billion directed edges in 0.13 seconds, a throughput of roughly 32 billion traversed edges per second.
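The quoted throughput follows directly from the traversal figures: dividing edges traversed by elapsed time gives traversed edges per second (TEPS), a standard graph-analytics metric.

```python
# Sanity check on the quoted MapGraph numbers: edges traversed
# divided by elapsed time gives traversed edges per second (TEPS).
edges = 4.3e9    # directed edges in the scale-free graph
seconds = 0.13   # reported traversal time on 64 NVIDIA K20 GPUs

teps = edges / seconds
print(f"{teps / 1e9:.1f} billion traversed edges per second")
```

This works out to roughly 33 billion TEPS, consistent with the ~32 billion figure cited in the talk.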
Then there is machine learning with Sparkling Water. H2O Software Engineer Michal Malohlava talked about the tool that integrates H2O’s fast, scalable machine learning engine with Spark, bringing machine learning to the doorstep of the wider developer community.
The frequent challenge of explaining models to the business can lead practitioners back to older tools that are easier to interpret, despite the advances in technology. Dan Mallinger from Think Big Analytics addressed the issues surrounding machine learning from a communication perspective and how they affect the choice of tools.
Ilona Murynets gave a session on applying machine learning algorithms to detect the large volumes of spam text messages and fraudulent voice calls made across telephone networks. Murynets is a Scientist at the AT&T Research Center and delved into how combining the predictions of multiple classifiers can combat fraudulent activity.
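Combining classifiers can be as simple as a majority vote across their individual predictions. The sketch below is a minimal illustration of that general idea, not AT&T's actual system; the labels and the `majority_vote` helper are hypothetical.

```python
# Illustrative sketch (not AT&T's actual system): merging several
# spam detectors' predictions by majority vote.
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among individual classifiers' outputs."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers scoring the same message:
votes = ["spam", "spam", "ham"]
print(majority_vote(votes))  # -> spam
```

In practice ensembles often weight each classifier's vote by its confidence or historical accuracy rather than counting votes equally.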
‘All the Data and Still Not Enough!’ was presented by Claudia Perlich, Chief Scientist at Dstillery, who described the application of machine learning in scenarios where there may not be enough of the right data available, yet specific techniques can still produce surprisingly accurate results. Perlich delved into instances where it is simply impossible to gather the required data, such as niche areas where collection is naturally sparse or scenarios that cannot be observed simultaneously.
Restaurant booking engine OpenTable gathers a lot of data from diners about their experiences, and Senior Manager of Data Science Jeremy Schiff discussed the architecture behind its recommendation systems. The talk was split into three parts: ‘(1) how A/B testing works with machine learning to iterate toward better recommendations, (2) how to couple an information-retrieval based search stack with collaborative filtering to capture user intent in a personalized way, and (3) making recommendations more relevant and interpretable.’
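Collaborative filtering, the technique named in part (2), scores unseen items for a user based on the preferences of similar users. Below is a minimal user-based sketch of the general technique (not OpenTable's stack); the ratings data and helper functions are hypothetical.

```python
# Minimal user-based collaborative filtering sketch (an illustration of
# the general technique, not OpenTable's production system).
import math

ratings = {  # hypothetical diner -> {restaurant: rating}
    "ana":  {"bistro": 5, "taqueria": 4},
    "ben":  {"bistro": 4, "ramen": 5},
    "cara": {"taqueria": 5, "ramen": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(user):
    """Recommend the unrated item with the highest similarity-weighted score."""
    scores = {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        for item, r in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get)

print(recommend("ana"))  # -> ramen
```

A production system would blend scores like these with search relevance signals and validate changes through the A/B testing loop described in part (1).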
Internet radio Goliath Pandora also made an appearance at the event, with Director of Research Òscar Celma talking about how his interdisciplinary team unravels vast data sets to make recommendations to the more than 80 million active users of the service each month. Although every single track served up by Pandora is catalogued manually according to criteria like tempo and instrumentation, Pandora develops its own machine learning algorithms to run the recommendation service and grow user engagement.
From the retail sector, Ronald Menich, Chief Data Scientist at Predictix, which helps retailers model their business by using machine learning to forecast sales, talked about the rise of machine learning and how the process of replenishment and planning has changed in recent years.
Juliet Hougland is a Data Scientist at Cloudera and, in ‘Decomposition at Scale’, discussed the challenges of using distributed systems when matrices are simply too big. Specifically, Hougland covered LanczosSolver and StochasticSVD in Mahout and the SVD implementation in Spark MLlib, discussing the tradeoffs from the perspective of real-world performance and accuracy, and giving guidelines for choosing an implementation based on requirements.
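All of these implementations compute the same underlying decomposition, just distributed across machines: a truncated SVD that approximates a matrix A by its top-k singular triples, A ≈ UₖSₖVₖᵀ. A small single-machine illustration with NumPy (my example, not code from the talk):

```python
# Single-machine illustration of the truncated SVD that the distributed
# implementations (Mahout's LanczosSolver/StochasticSVD, Spark MLlib's
# SVD) compute at scale: A is approximated by its top-k singular triples.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))

# Full (thin) SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5  # keep only the k largest singular values
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Relative reconstruction error shrinks as k grows toward full rank.
err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(round(err, 3))
```

The tradeoffs Hougland described arise because, at distributed scale, even forming U and V explicitly can be prohibitive, so Lanczos-style iteration and randomised (stochastic) methods trade exactness for far fewer passes over the data.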
Data Scientist Jeremy Stanley also explained how marketing engine Sailthru scales its machine learning systems in the cloud to deliver personalised marketing messages across different media and platforms whilst keeping costs down. Sailthru’s Sightlines product predicts the future behaviour of users, and Stanley dissected the architecture and tools behind it, which uses Amazon spot instances and Apache Mesos whilst employing automated infrastructure and A/B production environments to handle iterative changes.