In 1941, Hitler repeated what Napoleon had done in 1812. Both invaded Russia and ended up losing not just the battle but also the war. Both armies were far superior on all fronts to the Russian army of their respective periods. What, then, resulted in these history-defining losses?
“Conditions.” Both the armies and their leaders were masters on their own turf. However, Russia proved to be a very different terrain, and the elements disrupted what would otherwise have been a routine operation. A combination of unfamiliar terrain, ill-preparedness, and bad judgment cost them defining moments in their conquests.
What do Napoleon and Hitler have to do with Machine Learning, you might wonder? Machine Learning models are built and trained on historical data in a benign environment and are then deployed to make predictions on new data, where they are often ill-prepared to deal with changes in that data. In our last post, we discussed monitoring in Machine Learning. In this article, let’s explore the two most significant changes that can occur in data: concept drift and data drift.
Concept drift refers to a change, over time and in unforeseeable ways, in the statistical properties of the target variable. It arises when our interpretation of the data changes even though the data itself may not have. Here, the word “concept” refers to the quantity to be predicted.
There are different ways in which concept drift can happen.
What we agreed belonged to class A in the past, we now claim belongs to class B, because our understanding of the properties of A and B has changed since. This is pure concept drift.
A piece of text (Corona) could legitimately be labeled as belonging to one class (Beer) in the past but to a different class (Corona Virus) now. So the predictions from a model built in the past are going to be largely in error for the same data now.
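To make the Corona example concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the tiny labeled datasets, the helper names `train` and `accuracy`, and the "model" itself, which is just a lookup of the most frequent past label per text rather than a real classifier. The point it tries to show is that the inputs stay identical while the labels we consider correct change, so a model frozen in the past loses accuracy.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn the most frequent label for each text (a stand-in for a real model)."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[text][label] += 1
    return {text: c.most_common(1)[0][0] for text, c in counts.items()}


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the given label."""
    hits = sum(model.get(text) == label for text, label in examples)
    return hits / len(examples)


# Hypothetical labeled data: the same texts, labeled at two points in time.
past = [("corona", "beer"), ("corona", "beer"),
        ("heineken", "beer"), ("influenza", "virus")]
present = [("corona", "virus"), ("corona", "virus"),
           ("heineken", "beer"), ("influenza", "virus")]

model = train(past)                           # model built in the past
past_accuracy = accuracy(model, past)         # evaluated against past labels
present_accuracy = accuracy(model, present)   # evaluated against current labels

print(f"accuracy on past labels:    {past_accuracy:.2f}")
print(f"accuracy on present labels: {present_accuracy:.2f}")
```

In this toy setup the model still maps "corona" to "beer", so every "corona" example counts as an error once the accepted label has changed, even though nothing about the input text is different.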
Data drift refers to a change in the statistical properties of the features used to train the model. Formally, data drift between a source distribution S and a target distribution T can be defined as a change in the joint distribution of features and target, that is, P_S(X, y) ≠ P_T(X, y).
Data drift can occur due to a change in the distribution of data. For example, an e-commerce apparel company might launch a product that wins a lot of teenage customers, a group significantly different from its target market of 25 to 35 year olds.
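One way to picture this kind of distribution change is to compare the customer ages seen at training time with the ages seen after launch. The sketch below is only an illustration under stated assumptions: the age samples are invented, the function name `ks_statistic` is our own, and the 0.3 threshold is an arbitrary placeholder rather than a recommended value. It computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs are step functions, so checking the observed
    # points is enough to find the largest gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical customer ages: training-time market vs. traffic after launch.
training_ages = [25, 27, 28, 30, 31, 33, 34, 35]
live_ages = [15, 16, 17, 18, 19, 26, 29, 32]

DRIFT_THRESHOLD = 0.3  # assumed cut-off for illustration only

statistic = ks_statistic(training_ages, live_ages)
drift_detected = statistic > DRIFT_THRESHOLD

print(f"KS statistic: {statistic:.3f}, drift detected: {drift_detected}")
```

A statistic near 0 suggests the two samples look alike, while a value near 1 suggests they barely overlap. In practice the threshold would be chosen from the sample sizes and the false-alarm rate you can tolerate.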
Data drift can also occur when the data schema changes at the source. Typical schema changes include a new feature being added, an existing feature being deleted, or the type of a field being changed. Continuing the example above, the e-commerce company might change its pricing from individual items to predefined sets.
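The three kinds of schema change mentioned above (a field added, a field removed, a field whose type changed) can be checked by comparing an expected schema with what actually arrives. The following is a rough sketch, not a production validator: the field names (`item_id`, `price`, `size`, `bundle_id`) and the helper functions `schema_of` and `schema_changes` are hypothetical and chosen only for illustration.

```python
def schema_of(record):
    """Map each field name to the name of its Python type."""
    return {field: type(value).__name__ for field, value in record.items()}


def schema_changes(expected, observed):
    """Report fields that were added, removed, or changed type."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(f for f in shared if expected[f] != observed[f]),
    }


# Hypothetical schema the model was trained against.
expected_schema = {"item_id": "str", "price": "float", "size": "str"}

# Hypothetical incoming record after the source system changed.
incoming = {"item_id": "A-17", "price": "19.99", "bundle_id": "B-3"}

changes = schema_changes(expected_schema, schema_of(incoming))
print(changes)
```

Here `bundle_id` is reported as added, `size` as removed, and `price` as retyped because it now arrives as a string instead of a float.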
Drift, whether concept or data, cannot be avoided. What you can do is detect it early and take corrective action. We will talk about this in detail in a future post.