Data Drift

¿What is the meaning of data drift in pydrift?

Definition

Data drifting is all related to your features drift, it checks when the features in a dataset are not aligned with other dataset ones.

The most common use case is when you train a model with a dataset, let's call it df_train and you want to apply your model to other dataset, for example df_test.

If the data you used for training your model is different that the data when you apply the model, then your model performance will be poor.

So data drift part of pydrift helps you to detect this changes that every feature has and understanding why your model is not working well.

General Use Case

Your model takes into account patterns and features distributions from the training data features.

For example if you train a model with a feature that ranges between 0 and 10 and then in your test data has a different distribution, for example 100 to 1000, then your model won't work as expected.

Types of data drift

  • Numerical features drift: pydrift.DataDriftChecker.check_numerical_columns
  • Categorical features drift: pydrift.DataDriftChecker.check_categorical_columns
  • Interaction between features drift (discriminative model): pydrift.DataDriftChecker.ml_model_can_discriminate

Numerical columns drift

For checking drift in numerical columns, pydrift takes, one by one, all of the numerical features and apply a Kolmogorov-Smirnov Test for checking if the distributions are unbiased.

This is a statistical method that checks for this differences and pydrift uses stats.ks_2samp from the famous scipy python library to easily check it.

Categorical columns drift

For checking drift in categorical columns, pydrift takes, one by one, all of the categorical features and apply a chi Test for checking if the distributions are unbiased.

This is a statistical method that checks for this differences and pydrift uses stats.chisquare from the famous scipy python library to easily check it.

Interaction between features drift (discriminative model)

For a more general drift checking, pydrift trains a machine learning model (catboost by default, but you can use any model from sklearn API) creating a new target that indicates if a register comes from training distibution or from the testing one.

If the model can discriminate and obtains a good metric (AUC in this case) that means that your date is easily distinguishable, so you have a drift in your data.

If on the contrary the model is not able to differentiate between train and test data it means that you do not have any drift problem and that the data is not biased, so you will be able to apply your model without problems (waiting for the model drift checker step, that relies in your features relation with the correct target feature).

results matching ""

    No results matching ""