Exactly How Do You Define the Data Scientist Skillset?

I look at a lot of characteristics in Software Developers for my clients, and I still find it extremely challenging to find the right balance of skills and character in someone I would represent.

It’s usually the small things that are the most telling: perhaps the attention to detail, or lack thereof, but almost always the information beyond the resume, such as the GitHub and StackOverflow profiles, the blogs, and talk about interests outside of development.

For nearly two years now, I have been ‘about to change’ from recruiting in Software Development to recruiting in Data Science. In those two years I feel like I have been looking at Data Science through a kaleidoscope, trying to find my bearings but left bewildered by the research and the relentless advancement of the field. I have wanted a respectable level of understanding of Data Science, and of what companies need, before making a start.

I found that Martin Goodson’s preamble as the Chair at a recent debate at the Royal Statistical Society in London added some useful pointers to help me get my bearings. Here is what he said:

These days there is so much written about data science, there is so much spoken about data science and yet lots of people don’t seem to be very clear about what data science actually is. So, I just want to say why I think this is an important question and something that is worth discussing.

So firstly, I think that the profession of data science needs to define itself in order to develop and actually just to survive. So, for instance, I run the data science team for a start-up in London called Skimlinks, and in common with loads of companies in London we find it very, very difficult to find Data Scientists. So there are lots of people, because of this demand, thinking about changing career and going into the career of data science.

The problem is there is no real consensus about the core skill set. Lots of companies are hiring Software Engineers with no real statistical training as Data Scientists. I think we need to be a bit clear about this, and I think programming and statistics are core skills. It is a fact that organizations are collecting so much data now that you do need serious programming skills just to manipulate it.

I do think that Statisticians are the right people to analyze data but they can’t analyze data unless they can actually access it and manipulate it.

Secondly, I think we need to develop some new methodologies. Working on large data sets now is really the norm, not the exception, and I think many aspects of doing statistics change when you work at scale. Just to take an example: after running a single logistic regression, let’s say, a Statistician might run some diagnostic plots to gain a feel for the goodness of fit of the model and look for potential outliers. This approach is just not scalable when you run 20,000 logistic regressions on a single data set, and this is typical these days.

I think visualization, hypothesis testing, outlier detection and feature selection all require different approaches when you are working at scale, and I think there is a need for an automated, large-scale version of the Statistician’s approach.

It’s not just data analysis: industry and academia face huge problems in the storing and processing of data. Statisticians and Computer Scientists really need to collaborate to develop the methodology that we need in order to deal with all of the data that we have these days.

How do you best foster this collaboration? Should we consider data science to be a new field? Should it have its own journals, textbooks and University departments?

Thirdly, I think we need to educate people. As I said, there is a massive demand for Data Scientists and educators are trying to train enough people to meet the demand, but we are in a kind of absurd position at the moment where every company knows they need a Data Scientist but they don’t really know what skills a Data Scientist needs.

There are Data Science MSc programs and they are starting to produce their first graduates. Some of these courses do not contain a single data analysis course as part of their core curriculum, whilst the course can be dominated by big data frameworks. I think that new graduates from these programs who are trained in a hot big data framework are going to be shocked when they graduate and discover that industry has moved on to the next big thing, the next big software framework.
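
Going back to Martin’s example of 20,000 logistic regressions, here is a rough sketch of what an automated stand-in for eyeballing diagnostic plots might look like. This is my own illustration, not something from the talk. It fits a one-predictor logistic regression by Newton-Raphson, counts observations with a large Pearson residual, and flags only the models that fail to converge or have too many such points. The function names (fit_logistic, diagnose, screen), the residual cutoff of 3 and the simulated data are all assumptions made up for the example; a real pipeline would more likely lean on an established statistics library.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Fit y ~ b0 + b1 * x by Newton-Raphson. Returns (b0, b1, converged)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            return b0, b1, False   # near-singular, e.g. perfect separation
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            return b0, b1, True
    return b0, b1, False


def diagnose(xs, ys, resid_cutoff=3.0):
    """Numeric summary standing in for a visual residual plot."""
    b0, b1, converged = fit_logistic(xs, ys)
    n_extreme = 0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(b0 + b1 * x), 1e-12), 1.0 - 1e-12)
        pearson = (y - p) / math.sqrt(p * (1.0 - p))
        if abs(pearson) > resid_cutoff:
            n_extreme += 1
    return {"converged": converged, "n_extreme": n_extreme, "slope": b1}


def screen(datasets, max_extreme=2):
    """Return the names of models that deserve a human look."""
    flagged = []
    for name, (xs, ys) in datasets.items():
        d = diagnose(xs, ys)
        if not d["converged"] or d["n_extreme"] > max_extreme:
            flagged.append(name)
    return flagged


def simulate(rng, n=200, slope=1.0, n_corrupt=0):
    # Hypothetical data: one Gaussian predictor, optionally with a few
    # planted points whose label contradicts an extreme predictor value.
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(slope * x) else 0 for x in xs]
    for i in range(n_corrupt):
        xs[i], ys[i] = 6.0, 0
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    datasets = {}
    for i in range(100):
        datasets[f"model_{i}"] = simulate(rng, n_corrupt=5 if i % 25 == 0 else 0)
    flagged = screen(datasets)
    print(f"{len(flagged)} of {len(datasets)} models flagged for review")
```

The point of the design is that the expensive resource, the Statistician’s attention, is only spent on the short list that screen returns. The thresholds are arbitrary here and would need tuning against whatever a team actually considers a worrying fit.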




About Gary Donovan

Machine Learning and Data Science blogger, hacker, consultant living in Melbourne, Australia. Passionate about the people and communities that drive forward the evolution of technology.