Is It Data or Model in Machine Learning?

When we talk about machine learning, one thing comes to mind: a model that can predict an outcome from data. But what matters more for getting those predictions right, the model or the data it is trained on? Andrew Ng’s DeepLearning.AI session on data-centric AI has ignited the question of where we should concentrate our effort in order to get a better, practically applicable model.

Photo by Isaac Smith on Unsplash

Now, when we start to talk about data, figures, rows, columns and similar things come to mind. So is data just about those rows and columns, or other forms such as images, or is it more than that?

We generate a lot of data in everyday life. In fact, an IBM article stated that every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. Which is insane!

This huge volume of data is the result of exponential technological improvement and evolution, and that’s where the term Big Data comes from. Such large amounts of data can contain patterns, and machine learning models help to recognize those patterns. But what if the data taken is not a true representation of the actual pattern?

The data generated can be skewed or evenly distributed. If it is normally distributed, that is, evenly collected, it can yield a pattern that helps explain future patterns. Here, “normally distributed” is meant in a very general sense: not a normal distribution as in statistics, but data that is collected evenly and in a normal, unforced way. The data should capture most of the information in the scene. If the collection is consciously or unconsciously biased or uneven, the model may underfit or overfit, and the pattern it learns may not hold outside the sample.
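One rough way to check whether a collection is "even" in this sense is to compare how often each group appears in the sample with how often you would expect it to appear in the real world. The sketch below is only an illustration under stated assumptions: the function name `representation_gap`, the "urban"/"rural" groups and the 50/50 reference shares are all made up for the example, and in practice the reference shares would have to come from a source you trust.

```python
from collections import Counter


def representation_gap(sample_groups, reference_shares):
    """Return (sample share - reference share) for each reference group.

    A positive value means the group is over-represented in the sample,
    a negative value means it is under-represented.
    """
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


# Hypothetical data: 80 urban and 20 rural records were collected,
# while the (assumed) real population is split evenly.
sample = ["urban"] * 80 + ["rural"] * 20
reference = {"urban": 0.5, "rural": 0.5}

gaps = representation_gap(sample, reference)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A large gap for any group is a hint, not a proof, that the collection leans one way and that the resulting model may not generalize to the under-represented group.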

A lack of proper collection results in bad data. A bad data set gives faulty results and models that can’t be deployed in practice. That’s one reason many ML models trained today never make it into real-life applications.

This is a serious problem. If the data collected doesn’t represent the true picture, then everything built on that foundation will be weak and brittle. Even if the trained model reports high accuracy, that doesn’t mean it can predict anything beyond the data it was tested on. Many factors come into play, and an unchecked model can be faulty enough to disrupt the very application it is deployed in.

A person in the US can’t rely on data from another country to back the patterns used for prediction. They have to use data that is collected responsibly and is diverse enough to represent the true picture for that area. Diverse data also helps address AI ethics principles.
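To make the last two points concrete, here is a toy sketch in plain Python. It is not a real model: the "model" is just a single threshold learned from one region's data, and all the numbers, as well as the names `fit_threshold`, `home` and `abroad`, are invented for illustration. It shows how a model can score perfectly on test data from the place it was trained on and still do worse on data from somewhere the pattern is different.

```python
def accuracy(threshold, xs, ys):
    """Share of points where 'x >= threshold' matches the 0/1 label."""
    correct = sum((x >= threshold) == bool(y) for x, y in zip(xs, ys))
    return correct / len(xs)


def fit_threshold(xs, ys):
    """Pick the observed value that gives the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = accuracy(t, xs, ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Hypothetical "home" region: the label flips from 0 to 1 at x = 5.
home_train_x = [1, 2, 3, 4, 5, 6, 7, 8]
home_train_y = [0, 0, 0, 0, 1, 1, 1, 1]
home_test_x = [2, 3, 6, 7]
home_test_y = [0, 0, 1, 1]

# Hypothetical "abroad" region: the same feature, but the label flips at x = 8.
abroad_x = [2, 4, 6, 7, 9, 10]
abroad_y = [0, 0, 0, 0, 1, 1]

threshold = fit_threshold(home_train_x, home_train_y)
acc_home = accuracy(threshold, home_test_x, home_test_y)
acc_abroad = accuracy(threshold, abroad_x, abroad_y)

print(f"learned threshold: {threshold}")
print(f"accuracy on home test data: {acc_home:.2f}")
print(f"accuracy on data from abroad: {acc_abroad:.2f}")
```

Nothing about the model changed between the two evaluations. Only the data did, which is the point: a high score on the tested data says little about data collected under different conditions.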

So, even if you tune your model very finely, it will fail in real life if the data behind it was uneven or biased. Hence, the next phase of the machine learning era should be data-centric rather than model-centric.