Co-Author : Subarna Rana
In the data science world, everything revolves around data: how it is structured, what story it tells, and what it is targeting. The target is usually a prediction that one infers using machine learning.
For example, if the weather is sunny, John plays in Central Park in the evening; otherwise he stays at home. Here the target is whether John plays or stays at home, and the data includes the weather, which can be sunny or rainy. The observations can be divided into two parts, one for each kind of weather, in any proportion: 50:50, 80:20, and so on.
If the ratio is 50:50 or close to it, the data is called balanced; otherwise it is imbalanced. In imbalanced data, the number of observations in one class is significantly higher than in the other(s). This biases the model toward the majority class, and is known as the class imbalance problem.
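A quick way to see how balanced a target is, sketched here with made-up labels (the 80:20 split and the label names are assumptions for illustration only), is to count each class and turn the counts into proportions:

```python
from collections import Counter

# Hypothetical target column: John plays on 80 evenings and stays home on 20.
target = ["plays"] * 80 + ["stays_home"] * 20

counts = Counter(target)
total = sum(counts.values())

# Share of observations that falls into each class.
proportions = {label: n / total for label, n in counts.items()}
print(proportions)  # {'plays': 0.8, 'stays_home': 0.2}
```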
There are various techniques for handling the class imbalance problem, such as simple over- and under-sampling, SMOTE, and ROSE. However, these do not always work well in multiclass classification problems, where the target has more than 2 classes. For example, if we have 4 classes spread disproportionately in our data, it is much easier to assign a weight to each class directly. The majority class can be given a smaller weight and the minority classes a higher weight. This way, balance can be restored between the different classes.
Fortunately, sklearn has a function that does exactly this, called compute_class_weight. It assigns a weight to each class according to the number of observations that class has in its favour.
So, if a class holds the majority of observations, it is given less weight so that the minority classes get equal coverage.
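sklearn.utils.class_weight.compute_class_weight(class_weight, *, classes, y)
The function takes three parameters:
1. class_weight: How the weights are computed. E.g.: 'balanced', a dict mapping class labels to weights, or None (every class gets a weight of 1).
2. classes: Unique class labels. E.g.: np.array([1, 2]) or np.unique(df['Target']).
3. y: Target variable, i.e. the target column. E.g.: df['Target'] or Y.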
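As a minimal sketch of how these parameters fit together, the call below uses a small made-up target array (the data and the variable names are assumptions, not taken from a real data set):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy target: class 0 appears 6 times, class 1 appears twice.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# 'balanced' weights each class inversely to its frequency in y.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Pair each label with its weight for easier reading and later reuse.
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

The returned array follows the order of classes, so the rarer class 1 ends up with the larger weight.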
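Basic maths when ‘balanced’ class weight is used:
Let's say you have 4 target classes (0, 1, 2, 3) and 3923 rows in total. You count the number of rows for each class, and it comes out like this:
Class 0: 1093 rows
Class 1: 2218 rows
Class 2: 398 rows
Class 3: 214 rows
With ‘balanced’, the weight of each class is total rows / (number of classes * rows in that class):
class_weight (class 0) = 3923 / (4 * 1093) = 0.897
class_weight (class 1) = 3923 / (4 * 2218) = 0.44
class_weight (class 2) = 3923 / (4 * 398) = 2.46
class_weight (class 3) = 3923 / (4 * 214) = 4.58
So your class weights come out as [0.897, 0.44, 2.46, 4.58].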
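The arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch of the ‘balanced’ formula, and it assumes the four counts belong to classes 0 to 3 in the order they are listed:

```python
# Row counts per class, assumed to map to classes 0-3 in the order listed.
counts = {0: 1093, 1: 2218, 2: 398, 3: 214}

n_samples = sum(counts.values())  # 3923 rows in total
n_classes = len(counts)           # 4 classes

# 'balanced' weight: total rows / (number of classes * rows in the class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: {weight:.3f}")
```

A useful side effect of this formula is that count times weight is the same for every class (here 3923 / 4 = 980.75), which is what puts the classes on an equal footing.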
The weight assigned to the majority class is the smallest. That is how compute_class_weight penalizes the class that has more observations.
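These weights can be passed, as a dict mapping each class label to its weight, to the class_weight parameter of classifiers such as RandomForestClassifier. For models that only accept per-row weights, each row can be given the weight of its class and the result passed to the sample_weight argument of fit. Even if the data is highly imbalanced, this approach gives a good starting point for the class weights, and we can make slight modifications to them as well.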
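One way to hand the computed weights to a model is the class_weight constructor parameter of RandomForestClassifier, which accepts a dict of label to weight. The sketch below uses randomly generated data, and the 1.5 multiplier is an arbitrary example of a manual tweak; both are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced 3-class data set, generated only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.repeat([0, 1, 2], [150, 40, 10])

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Optional manual adjustment: push the rarest class a little harder.
class_weight[2] *= 1.5

# The dict is passed to the constructor, not to fit().
model = RandomForestClassifier(
    n_estimators=50, class_weight=class_weight, random_state=0
)
model.fit(X, y)
```

Passing class_weight="balanced" directly to the classifier should give the unadjusted weights without the extra call; computing them separately is mainly useful when you want to inspect or modify them first.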
I hope this helped you better understand the calculation behind class weights and how it helps in handling the class imbalance problem, especially in multiclass classification.