Multicollinearity

 

Multicollinearity generally occurs when there is high correlation between two or more independent variables. It is a common issue in regression models: real-world data often carries collinearity. Before fitting any model, the collinearity needs to be addressed, or it can lead to misleading results and lower accuracy.

Consider a daily-life example to explain multicollinearity better. Tom usually likes sweets, and he enjoys them while watching television. How can we determine Tom's happiness rating? It can rise in two ways: from watching TV and from eating sweets. Those two variables are correlated with one another, and that is where multicollinearity comes into the picture.

Once you find that multicollinearity exists in the data, how do you remove it? Is there any method available? Yes, there is: the Variance Inflation Factor (VIF).

Variance Inflation Factor

For each independent variable, an R² value is computed by regressing it on the other independent variables; this measures how well the variable is described by the others. A high R² means the variable is highly correlated with the rest. The VIF captures this as VIF = 1 / (1 − R²):

  • VIF starts at 1 and has no upper limit
  • VIF = 1: no correlation between this independent variable and the others
  • VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others

Let's see an implementation on a dataset for predicting house prices. The snippet below computes the VIF for each independent variable in X.

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def applyVIF(X):
        # One VIF per column: regress column i on all the other columns
        vif = pd.DataFrame()
        vif["Features"] = X.columns
        vif["VIF"] = [variance_inflation_factor(X.values, i)
                      for i in range(X.shape[1])]
        print(vif)

In the output you can clearly see that bedrooms, mainroad, bathrooms, stories, and area have high VIF values, i.e. high multicollinearity.
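The snippet above assumes a housing DataFrame `X` that is not shown in the post. As a minimal, self-contained sketch of the same VIF computation, here is a version on synthetic data; the housing-style column names and the generated numbers are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 200

# Synthetic housing-style data: 'bedrooms' and 'bathrooms' are derived
# from 'area', so these three columns are strongly collinear.
area = rng.normal(1500, 300, n)
bedrooms = area / 500 + rng.normal(0, 0.1, n)
bathrooms = area / 750 + rng.normal(0, 0.1, n)
parking = rng.integers(0, 4, n).astype(float)  # mostly independent

X = pd.DataFrame({"area": area, "bedrooms": bedrooms,
                  "bathrooms": bathrooms, "parking": parking})

vif = pd.DataFrame()
vif["Features"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
```

Note that statsmodels' `variance_inflation_factor` does not add an intercept for you; in practice many people append a constant column (e.g. with `statsmodels.api.add_constant`) before computing VIFs and ignore the constant's own VIF.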


Fixing Multicollinearity

Remove the columns whose VIF value is greater than 5:

    X.drop(['area','bedrooms','bathrooms','stories','mainroad'], axis=1, inplace=True)
We can also use correlation matrices and plots to find the correlation between all the X variables.

A correlation plot shows the extent of correlation between the independent variables. Generally, a correlation greater than 0.9 or less than -0.9 should be avoided.
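This check can be sketched as follows; the actual dataset is not shown in the post, so the columns below are synthetic and for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic illustrative data: 'bedrooms' is built from 'area',
# so that pair is almost perfectly correlated.
area = rng.normal(1500, 300, n)
bedrooms = area / 500 + rng.normal(0, 0.1, n)
parking = rng.integers(0, 4, n).astype(float)

X = pd.DataFrame({"area": area, "bedrooms": bedrooms, "parking": parking})

# Pairwise Pearson correlations between all the X variables
corr = X.corr()
print(corr.round(2))

# Flag the pairs whose |correlation| exceeds the 0.9 threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.9]
print(high)
```

A heatmap of `corr` (for example with `seaborn.heatmap(corr, annot=True)`) gives the kind of correlation plot described above.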




Remedies for Multicollinearity


  • Do nothing: if the correlation is not extreme, we can ignore it. If the correlated variables are not used in answering our business question, they can also be ignored.
  • Remove one variable: as in the dummy-variable trap.
  • Combine the correlated variables: e.g., creating a seniority score from Age and Years of Experience.
  • Principal Component Analysis (PCA).
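The "combine" and PCA remedies can be sketched together: standardize the correlated columns, then project them onto the first principal component, which acts as a single combined score. The Age/Years-of-experience columns and the generated numbers below are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Two strongly correlated columns: experience roughly tracks age.
age = rng.normal(40, 8, n)
years_exp = age - 22 + rng.normal(0, 2, n)

X = pd.DataFrame({"age": age, "years_exp": years_exp})

# Standardize, then project onto the first principal component:
# the two correlated columns collapse into one "seniority" score.
Z = (X - X.mean()) / X.std(ddof=0)
cov = np.cov(Z.T)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                     # direction of largest variance
seniority = Z.values @ pc1

explained = eigvals[-1] / eigvals.sum()
print(f"variance explained by PC1: {explained:.2%}")
```

Replacing the two columns with `seniority` removes the collinear pair while keeping most of the variance; with more columns, a library routine such as scikit-learn's `PCA` does the same projection.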




