Last week’s SF Big Analytics Meetup featured two talks, from Christopher Berner of Facebook and from Mohitdeep Singh of Rdio, who spoke about non-parametric Bayesian approaches that allow a potentially infinite number of clusters to be learned.
Berner explained how his plugin, which adds machine learning capabilities to the distributed SQL query engine Presto, started out as a hackathon project at Facebook. PrestoML joins the power of machine learning to the simplicity of SQL and is now used in the company’s Hadoop warehouse.
Before Presto, Facebook had traditionally used Hive for data processing. Presto offers more flexibility: through its connector APIs, queries can read data where it lives and from multiple sources. As a result, it is being used for a range of products at Facebook, including advertising.
PrestoML adds new types and new functions to Presto for building a model. Because it uses the connector APIs, the model can then be written back to the data store.
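As a rough illustration of what training and scoring through SQL functions can look like, the sketch below assembles a query string in Python. The function names `learn_classifier`, `classify` and `features` follow the machine learning functions described in the Presto documentation; the table and column names are hypothetical, the query has not been run against a cluster, and this is not necessarily the form Berner showed in his talk. Writing the model back to a data store is not shown.

```python
# Hypothetical table and column names, used for illustration only.
TRAINING_TABLE = "ad_clicks_training"
SCORING_TABLE = "ad_clicks_new"
LABEL_COLUMN = "clicked"
FEATURE_COLUMNS = ["impressions", "age_days", "bid"]


def features_expr(columns):
    """Build a features(...) call over the given column names."""
    return "features({})".format(", ".join(columns))


def build_train_and_score_query(training_table, scoring_table, label, columns):
    """Return SQL that trains a classifier in a subquery and scores new rows.

    The subquery aggregates the training table into a single model value;
    the outer query applies that model to every row of the scoring table.
    """
    feats = features_expr(columns)
    return (
        "SELECT classify({feats}, model) AS predicted_label\n"
        "FROM {scoring}\n"
        "CROSS JOIN (\n"
        "  SELECT learn_classifier({label}, {feats}) AS model\n"
        "  FROM {training}\n"
        ") t"
    ).format(feats=feats, scoring=scoring_table, label=label, training=training_table)


if __name__ == "__main__":
    print(build_train_and_score_query(
        TRAINING_TABLE, SCORING_TABLE, LABEL_COLUMN, FEATURE_COLUMNS))
```

The point of the sketch is that both steps are ordinary SQL expressions, so an analyst who can write a `SELECT` can train and apply a model without leaving the query engine.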
Berner says the distributed nature of Presto also means that the model can be saved: “I have frequently done some sort of model prototyping where I’ve built a model on my laptop and then six months later, after I’ve given the predictions to someone, they come back and they ask, ‘Do you still have that model? We want to run some new data that we have,’ and I’m like, ‘Oh, yeah, I deleted that!’ If I can just save it into a distributed database that’s just great because it’s never going to go away and these models tend to not be very big... they are not going to take up much space on a distributed data store whereas they would on my laptop.”
All of this means that machine learning can easily be brought to the hundreds of analysts at Facebook who, Berner says, would not necessarily be trained in ML or be able to write, say, C++, but are very capable of writing SQL and understand data.
For the future development of PrestoML, Berner says they will be looking to add more models as well as different workloads such as collaborative filtering or clustering. In addition, modelling is currently done on a single machine because they are using LIBSVM, so distributed training is also something they are looking at.