Class Imbalance in Multiclass Classification: Simplified!!

Shikha Garg
3 min read · May 19, 2021

Co-author: Subarna Rana

In the data science world, everything revolves around the data: how it is structured, what story it tells, and what it is targeting. The target is usually the prediction one makes using machine learning.

For example, if the weather is sunny, John plays in Central Park in the evening; otherwise he stays at home. Here the target is whether John plays or stays at home, and the data includes the weather, which can be sunny or rainy. The data is divided into two parts, one for each outcome, and the split can be of any proportion: 50:50, 80:20, and so on.

If the ratio is 50:50 or close to it, the data is called balanced; otherwise it is imbalanced. Imbalanced data is data in which the number of observations in favour of one class is significantly higher than in the other(s), which throws the model off balance. This is called the class imbalance problem.

There are various techniques for handling the class imbalance problem, such as simple over- and under-sampling, SMOTE, ROSE, etc. However, these don't work well in multiclass classification problems, where the target has more than 2 classes. For example, if we have 4 classes spread disproportionately across our data, it is much easier to assign a weight to each class manually: the majority class gets a smaller weight and the minority classes get higher weights, as sketched below. This way, balance can be obtained between the different classes.
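For instance, such a manual weight dictionary might look like this (the labels and values below are made up purely for illustration):

manual_weights = {
    0: 0.5,  # majority class: smaller weight
    1: 0.7,
    2: 2.0,  # minority classes: higher weights
    3: 4.0,
}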

Fortunately, there is a function in sklearn that does exactly this, called compute_class_weight. This function assigns a weight to each class according to the number of observations that class has in its favour.

So, if a class holds the majority of observations, it is given a smaller weight so that the other, minority classes get equal coverage.

sklearn.utils.class_weight.compute_class_weight(class_weight, *, classes, y)

Parameters:

1. class_weight: 'balanced', a dict mapping class labels to weights, or None. With 'balanced', the weights are computed automatically from the class frequencies.

2. classes: array of the unique class labels. E.g. np.array([1, 2]) or np.unique(df['Target']).

3. y: the target column itself. E.g. df['Target'].
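Putting the three parameters together, here is a minimal sketch (the toy y below is made up; in practice it would be something like df['Target']):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target column with three classes, purely for illustration
y = np.array([0, 0, 0, 1, 1, 2])

classes = np.unique(y)  # unique class labels: [0 1 2]
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

for c, w in zip(classes, weights):
    print(c, round(w, 3))  # 0 0.667, 1 1.0, 2 2.0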

Basic maths when the 'balanced' class weight is used:

Let's say you have 4 target classes (0, 1, 2, 3) and 3923 rows in total. With 'balanced', each weight is computed as n_samples / (n_classes * count of that class). You counted the number of rows for each class, and it came out like:

0: 1093

1: 2218

2: 398

3: 214

class_weight[0] = 3923 / (4 * 1093) = 0.897

class_weight[1] = 3923 / (4 * 2218) = 0.442

class_weight[2] = 3923 / (4 * 398) = 2.464

class_weight[3] = 3923 / (4 * 214) = 4.583

Your class weights come out as [0.897, 0.442, 2.464, 4.583].

The weight assigned to the majority class is the smallest. That's how compute_class_weight penalizes the classes that have more observations.
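The same arithmetic can be reproduced in a few lines of numpy, using the hypothetical counts from above:

import numpy as np

counts = np.array([1093, 2218, 398, 214])  # rows per class
n_samples = counts.sum()                   # 3923
n_classes = len(counts)                    # 4

weights = n_samples / (n_classes * counts)
print(np.round(weights, 3))  # [0.897 0.442 2.464 4.583]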

These weights can then be passed to the classifier, either through the class_weight parameter of models like RandomForestClassifier or, converted to per-sample weights, through the sample_weight argument of fit(). Even if the data is highly imbalanced, this approach gives a good starting point for the class weights, and we can make slight modifications to them as well.
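Both routes might look like the sketch below; the synthetic dataset is made up just so the snippet runs on its own:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced 4-class data, purely for illustration
X, y = make_classification(
    n_samples=3923, n_classes=4, n_informative=6,
    weights=[0.28, 0.56, 0.10, 0.06], random_state=42,
)

# Option 1: pass class weights directly; 'balanced' computes them for you,
# or pass a manually tweaked dict such as {0: 0.9, 1: 0.44, 2: 2.5, 3: 4.6}
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X, y)

# Option 2: convert class weights into per-sample weights for fit()
sample_weights = compute_sample_weight(class_weight='balanced', y=y)
clf = RandomForestClassifier(random_state=42)
clf.fit(X, y, sample_weight=sample_weights)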

I hope this helped you understand the calculation behind class weights and how they help in handling the class imbalance problem, especially in multiclass classification.
