Why do data scientist propose complex solutions to simple problems?
I’ve recently seen a lot of criticism of data science of the following two types:
- A data scientists once said something silly, therefore data science is broken
- A hypothetical caricature data scientist does this or that, therefore the whole field is all hype and no substance
These types of arguments - and their popularity - is no doubt a normal reaction to the fact that one occupation is suddenly receiving so much publicity. But having met and worked with hundreds of data scientists in the past few years, I can reveal that they are one of the most level-headed bunch of people you’re likely to meet.
That’s not to say there aren’t legitimate criticisms. For example, this morning I saw the following tweet:
The recent paper out from Google, "Scalable and accurate deep learning with electronic health records", has an notable result in the supplement: regularized logistic regression essentially performs just as well as Deep Netshttps://t.co/2vYzZiBoWRhttps://t.co/IStdZQOAe0 pic.twitter.com/U2qWwCb63p— Uri Shalit (@ShalitUri) June 20, 2018
This reminded me of something I’ve seen in a few projects, which is that when data scientists are working on a problem, their starting point is some kind of ensemble method or support vector machine. They simply do not consider simpler approaches like the generalized linear model. This not only goes against basic scientific principles like Occam’s razor but it’s also wasteful in terms of time and computing resources.
Why are data scientists sometimes preferring complex approaches over simple ones? I think it’s partly a matter of conflicting interest. On the one hand, for an individual data scientist it makes a lot of sense to be knowledgeable of all the latest tools. First, it is intellectually satisfying to learn new things and most data scientists are researchy types. Second, it’s more helpful for your job prospects to say you built this AI application with Tensorflow than it is to say you ran a logistic regression and it was sufficient for your purposes.
For organisations, on the other hand, it doesn’t really matter what the method is. What matters is that the objective of the project is achieved and preferably using as little resources as possible. Sometimes achieving the objective requires the latest deep learning algorithm with all the bells and whistles. Sometimes a linear regression model or something even simpler works fine. Hence the occasional conflict of interest between individual data scientists and organisations.
So what can be done? Organisations would do well to mandate the use of simple models as baselines for prediction and classification tasks. The improved accuracy (if any) from using the fancy model should be assessed against the time and resource it takes to train that model. What about individual data scientists, what should they do? I don’t know what the correct answer is here, but here’s what I’d like it to be: data scientists should only focus on whatever tools they need to get the job done because whether or not they get the job done is how they are going to be assessed as data scientists. There’s hoping the field moves towards this type of thinking in the years to come.