Linear Regression

What is Linear Regression?

It is the linear relationship between the dependent variable (the target variable) and the independent variable(s) (the feature variables). Suppose we have a dataset of the heights and weights of the students of a school. We know there is a positive relationship between height and weight, so with the help of height we can estimate weight. Here height will be our independent variable and weight will be the dependent variable. With the help of linear regression, we can build a simple linear equation that helps us predict a weight value close to the actual value.
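To make this concrete, here is a minimal sketch (with made-up height/weight numbers, not from any real dataset) that fits such a line using NumPy:

    import numpy as np

    # Hypothetical heights (cm) and weights (kg) of a few students
    heights = np.array([150, 155, 160, 165, 170, 175, 180])
    weights = np.array([45, 50, 54, 60, 66, 70, 75])

    # Fit a straight line: weight = m * height + b
    m, b = np.polyfit(heights, weights, 1)

    # Use the fitted line to estimate the weight of a 172 cm student
    predicted = m * 172 + b
    print(f"slope m = {m:.3f}, intercept b = {b:.3f}, predicted weight = {predicted:.1f} kg")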




Assumptions of Linear Regression:-

  • Linearity: The relationship between X and the mean of Y is linear. We can check linearity with a scatter plot.
  • Homoscedasticity: The variance of the residuals (the difference between the predicted and actual values) is constant for any value of X.
  • Independence: There should be no multicollinearity among the independent variables, i.e. the independent variables should not be highly correlated with each other. We can identify multicollinearity by looking at the correlation matrix or at the variance inflation factor (VIF).
  • Normality: The error (residual) terms are normally distributed. This assumption can be checked with a histogram or a Q-Q plot (a small sketch of these checks follows this list).
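Here is a minimal sketch of how these checks could look in Python (the data is made up, and pandas, statsmodels, scipy, and matplotlib are assumptions of mine, not tools named in this post):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical data: two features and a target
    df = pd.DataFrame({"height": rng.normal(165, 10, 100),
                       "age": rng.normal(15, 2, 100)})
    y = 0.8 * df["height"] + rng.normal(0, 5, 100)   # made-up target
    residuals = rng.normal(0, 1, 100)                # placeholder for (actual - predicted)

    # Linearity: scatter plot of a feature against the target
    plt.scatter(df["height"], y)
    plt.show()

    # Independence: correlation matrix and VIF of the features
    print(df.corr())
    X = sm.add_constant(df)
    for i, col in enumerate(X.columns):
        print(col, variance_inflation_factor(X.values, i))

    # Normality: Q-Q plot of the residuals (a roughly straight line supports normality)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()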
Equation of Linear Regression:-

                           Y = m·X + b

    Here, Y = Dependent variable (the target variable)
                 X = Independent variable (the feature)
                 m = Coefficient of X (slope of the regression line), which represents the relationship between X and Y
                 b = Intercept of the regression line (the value of Y when X = 0)

    Now the question is: how can we find the best-fitted line for our regression model?




    In the above diagram,

  • Blue dots = the data points.
  • Red line = the regression line.
  • Blue lines = the residuals or errors: the difference between the predicted value given by the regression line and the actual value. We can denote this difference as D.
  • Now we need to calculate the sum of the squares of the residuals, which we can call the LOSS FUNCTION of the regression model. We need to minimize this loss, and the line for which we get the minimum value should be selected as the best-fit line.

    Residual Sum of Squares (RSS) = Σ(Yi – Ŷi)² = Σ(Yi – (m·Xi + b))²,   where Ŷi is the predicted value for Xi.
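    To see the loss function in action, here is a minimal sketch (with made-up points) that computes the RSS of two candidate lines and keeps the one with the smaller loss:

        import numpy as np

        # Made-up data points (the blue dots)
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.0, 4.1, 5.9, 8.2, 10.1])

        def rss(m, b):
            """Sum of squared residuals for the candidate line y = m*x + b."""
            return np.sum((y - (m * x + b)) ** 2)

        candidates = [(1.5, 1.0), (2.0, 0.0)]        # two candidate (m, b) pairs
        losses = {c: rss(*c) for c in candidates}
        best = min(losses, key=losses.get)
        print(losses, "-> best line:", best)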

    The visual below shows the loss being minimized and the best-fit line being found for the regression model.



    Let's try to understand these things mathematically (the Least Squares Method).

    In the language of statistics, we call this the Least Squares Method, which is shown below.
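    In standard notation, for n data points (xᵢ, yᵢ) the quantity being minimized is the residual sum of squares:

        \[ \mathrm{RSS}(m, b) = \sum_{i=1}^{n} \big( y_i - (m x_i + b) \big)^2 \]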

    According to the first-order condition, to minimize this equation we take the first-order partial derivatives with respect to m and b separately and then equate them to zero. From there we can get the ideal values of m and b.
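    The two partial derivatives are:

        \[ \frac{\partial \mathrm{RSS}}{\partial m} = -2 \sum_{i=1}^{n} x_i \big( y_i - (m x_i + b) \big), \qquad \frac{\partial \mathrm{RSS}}{\partial b} = -2 \sum_{i=1}^{n} \big( y_i - (m x_i + b) \big) \]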

    Now, equating these equations to zero:
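    This gives the normal equations, which can then be solved for m and b:

        \[ m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i, \qquad m \sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i \]

        \[ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - m \bar{x} \]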

    The same equations can be written in matrix form as:
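    With the same notation as above:

        \[ \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix} \]

    (or, in the general multiple-regression case, \( X^{\top} X \, \beta = X^{\top} y \)).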

    According to the second-order condition, the second-order derivative (here, the Hessian matrix of the RSS) should be positive (positive definite), which confirms that the stationary point is a minimum.
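    For the simple regression case the Hessian of the RSS is:

        \[ H = 2 \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix} \]

    Its determinant, \( 4\big( n \sum x_i^2 - (\sum x_i)^2 \big) \), is non-negative by the Cauchy–Schwarz inequality and strictly positive whenever the xᵢ are not all equal, so H is positive definite and the solution above is indeed the minimum.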

    When we have two independent variables and one dependent variable, we get the gradient descent diagram shown below (the loss surface forms a bowl, and gradient descent walks downhill to its lowest point).
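    As a companion to that diagram, here is a minimal gradient descent sketch (on made-up data with a single feature, so the loss is a function of m and b) that walks downhill on the RSS surface:

        import numpy as np

        # Made-up data for a simple one-feature regression
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

        m, b = 0.0, 0.0            # start from an arbitrary point on the loss surface
        learning_rate = 0.01

        for _ in range(2000):
            residuals = y - (m * x + b)            # actual minus predicted
            grad_m = -2 * np.sum(x * residuals)    # dRSS/dm
            grad_b = -2 * np.sum(residuals)        # dRSS/db
            m -= learning_rate * grad_m            # step downhill
            b -= learning_rate * grad_b

        print(f"m = {m:.3f}, b = {b:.3f}, RSS = {np.sum((y - (m * x + b))**2):.4f}")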

    Again another question comes our way: how do we calculate the accuracy of the regression model?

    R-Squared Statistics:-

    R-squared is a statistical measure that explains how close the data points are to the regression line. The R-squared value lies between 0 and 1.

    The formula of the R-squared statistic:  R² = 1 – (RSS / TSS)

    To understand the formula of the R-squared statistic, we first need to be aware of RSS and TSS.

    RSS and TSS, shown diagrammatically.

      The residual sum of squares (RSS) measures the amount of variance in a data set that is not explained by the regression model.

    Residual Sum of Squares (RSS) = Σ(Yi – Ŷi)²      (Yi = actual value, Ŷi = predicted value)

      The total sum of squares (TSS) tells you how much variation there is in the dependent variable (Y).

    Total Sum of Squares (TSS) = Σ(Yi – Ȳ)²      (Ȳ = mean of Y)

    So when RSS = TSS,
    then R-square = 1 – (TSS / TSS) = 0.
    This implies that the regression line is unable to explain any of the variation in the dataset.

    Again, when RSS = 0,
    then R-square = 1.
    This implies that the regression line explains 100% of the variation in the data.
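    Here is a minimal sketch (with made-up actual and predicted values) of computing R-squared from RSS and TSS:

        import numpy as np

        y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
        y_pred   = np.array([2.8, 5.3, 6.9, 9.2, 10.8])

        rss = np.sum((y_actual - y_pred) ** 2)            # unexplained variation
        tss = np.sum((y_actual - y_actual.mean()) ** 2)   # total variation in Y
        r_squared = 1 - rss / tss
        print(f"RSS = {rss:.3f}, TSS = {tss:.3f}, R-squared = {r_squared:.4f}")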

    Adjusted R-squared statistical method:-

    Adding a new independent variable to our model will always increase the R-squared value (or at least never decrease it), even if the new independent variable has no correlation with the dependent variable. So we cannot always count on the R-squared statistic. Let's try to understand this mathematically.
    Suppose we have two models, with one and two independent variables respectively. The second model can always reproduce the first by setting the coefficient of the extra variable to zero, so its RSS can never be larger, and therefore its R-squared can never be smaller.


    So we can conclude that whenever we increase the number of independent variables, the R-squared statistic will automatically increase (or at least never decrease).

    To rectify this problem, we use the Adjusted R-squared statistic, which penalizes independent variables that do not correlate with the dependent variable:

                 Adjusted R² = 1 – [ (1 – R²) · (n – 1) / (n – p – 1) ]

    where n is the number of observations and p is the number of independent variables. In the above equation we can see that when p = 0, Adjusted R-square = R-square.
    Thus, Adjusted R-square <= R-square.

    So we can count on the Adjusted R-squared statistic at any time to check the accuracy of a regression model.
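    A minimal sketch of the Adjusted R-squared calculation (the R², n, and p values here are made up), showing how the penalty grows as predictors are added while R² stays the same:

        def adjusted_r2(r2, n, p):
            """Adjusted R-squared for n observations and p independent variables."""
            return 1 - (1 - r2) * (n - 1) / (n - p - 1)

        # Same R-squared, increasing number of predictors: the adjusted value only drops
        for p in (0, 1, 3, 5):
            print(p, round(adjusted_r2(0.90, n=30, p=p), 4))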





