Linear Regression
What is Linear Regression?
Linear regression models the linear relationship between the dependent variable (the target variable) and the independent variable(s) (the feature variables). Suppose we have a dataset of the heights and weights of students in a school. We know there is a positive relationship between height and weight, so with the help of height we can estimate weight. Here height is the independent variable and weight is the dependent variable. With the help of linear regression, we can build a simple linear equation that predicts a weight value close to the actual value.
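As a quick illustration, here is a minimal sketch of fitting such a model. The height/weight numbers are made up for demonstration, and scikit-learn is just one convenient way to fit it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical heights (cm) and weights (kg) of a few students
heights = np.array([150, 155, 160, 165, 170, 175, 180]).reshape(-1, 1)
weights = np.array([45, 50, 54, 60, 64, 70, 75])

model = LinearRegression()
model.fit(heights, weights)

print("slope (m):", model.coef_[0])        # change in weight per cm of height
print("intercept (b):", model.intercept_)  # predicted weight when height = 0
print("prediction for 172 cm:", model.predict([[172]])[0])
```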
Assumptions of Linear Regression:-
The usual assumptions are: (1) the relationship between the independent and dependent variables is linear; (2) the observations are independent of each other; (3) the residuals have constant variance (homoscedasticity); (4) the residuals are normally distributed; and (5) there is little or no multicollinearity among the independent variables.
Equation of Linear Regression:-
Y = m . X + b
Here, Y = Dependent variable (the target variable)
X = Independent variable (the feature)
m = Coefficient of X (slope of the regression line), which represents the relationship between X and Y.
b = The intercept, i.e. the predicted value of Y when X = 0.
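As a quick worked example with made-up numbers: if m = 0.9 and b = -90, a student who is 170 cm tall would get a predicted weight of

```latex
Y = 0.9 \times 170 - 90 = 63 \text{ kg}
```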
Now the question is: how do we find the best-fit line for our regression model?
In the above diagram, the vertical distances between the data points and the regression line are the residuals.
We need to calculate the sum of the squared residuals; we can call this the loss function of the regression model. Our goal is to minimize this loss, and the line that gives the minimum value is selected as the best-fit line.
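Written out for n data points (x_i, y_i), the loss is the standard sum of squared residuals:

```latex
L(m, b) = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2
        = \sum_{i=1}^{n} \big(y_i - (m x_i + b)\big)^2
```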


The visual below shows how the loss is minimized and the best-fit line is obtained in the regression model.
Let's try to understand this mathematically (the Least Squares Method).
In the language of statistics, this is called the Least Squares Method, which is shown below.
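As a sketch of that objective in the notation used above, least squares chooses the slope and intercept that minimize the sum of squared residuals:

```latex
(\hat{m}, \hat{b}) = \underset{m,\, b}{\arg\min} \; \sum_{i=1}^{n} \big(y_i - m x_i - b\big)^2
```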

According to the first-order condition for minimizing the equation, we take the first-order partial derivatives with respect to m and b separately and set each equal to zero. From these we can obtain the optimal values of m and b.
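For the loss written above, the two partial derivatives work out to:

```latex
\frac{\partial L}{\partial m} = -2 \sum_{i=1}^{n} x_i \big(y_i - m x_i - b\big), \qquad
\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} \big(y_i - m x_i - b\big)
```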

Now we set these equations equal to zero.
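Setting both derivatives to zero gives the normal equations; solving them (the standard derivation, using the same sums) yields the familiar closed-form estimates:

```latex
\sum_{i=1}^{n} x_i y_i = m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i, \qquad
\sum_{i=1}^{n} y_i = m \sum_{i=1}^{n} x_i + n b
\;\;\Longrightarrow\;\;
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad
b = \bar{y} - m \bar{x}
```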

The same equations can be written in matrix form as:
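Using the same sums as above, one common arrangement is:

```latex
\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}
\begin{bmatrix} b \\ m \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}
```

This is the two-parameter case of the general normal equations X^T X β = X^T y used in multiple regression.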

According to the second-order condition, the matrix of second-order partial derivatives (the Hessian) should be positive definite, which confirms that the point we found is a minimum rather than a maximum.
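For this loss, the second-order partial derivatives give the following Hessian, which is positive definite whenever the x_i values are not all identical:

```latex
H = \begin{bmatrix}
\dfrac{\partial^2 L}{\partial m^2} & \dfrac{\partial^2 L}{\partial m \, \partial b} \\[4pt]
\dfrac{\partial^2 L}{\partial b \, \partial m} & \dfrac{\partial^2 L}{\partial b^2}
\end{bmatrix}
= 2 \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}
```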
When we have two independent variables and one dependent variable, we get the gradient descent diagram shown below.
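As a complement to the closed-form solution above, the same loss can also be minimized iteratively by gradient descent. Here is a minimal sketch; the data, learning rate, and epoch count are made-up choices for illustration:

```python
import numpy as np

def gradient_descent(x, y, lr=1e-3, epochs=5000):
    """Fit y ~ m*x + b by gradient descent on the sum of squared residuals.
    x is centered first so the two gradients have comparable scales."""
    x_mean = x.mean()
    xc = x - x_mean
    m, b = 0.0, 0.0
    for _ in range(epochs):
        residuals = y - (m * xc + b)
        grad_m = -2 * np.sum(xc * residuals)   # dL/dm
        grad_b = -2 * np.sum(residuals)        # dL/db
        m -= lr * grad_m
        b -= lr * grad_b
    # Convert the intercept back to the original (uncentered) x scale
    return m, b - m * x_mean

# Hypothetical height/weight data (made-up numbers for illustration)
heights = np.array([150, 155, 160, 165, 170, 175, 180], dtype=float)
weights = np.array([45, 50, 54, 60, 64, 70, 75], dtype=float)

m, b = gradient_descent(heights, weights)
print(f"m = {m:.3f}, b = {b:.3f}")  # should be close to the least squares solution
```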

Another question now comes our way: how do we measure the accuracy of the regression model?
R-Squared Statistics:-
R-squared is a statistical measure that explains how close the data points are to the regression line. The R-squared value lies between 0 and 1.

To understand the formula of the R-squared statistic, we first need to be aware of RSS and TSS.
The residual sum of squares (RSS) measures the amount of variance in a dataset that is not explained by the regression model.
The total sum of squares (TSS) tells you how much total variation there is in the dependent variable (Y).
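In symbols, with the predicted values written as y-hat and the mean of Y as y-bar, these quantities and R-squared are:

```latex
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2 = 1 - \frac{RSS}{TSS}
```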
So when RSS = TSS, then R-square = 0. This implies that the regression line is unable to explain any of the variation in the dataset.
Again, when RSS = 0, then R-square = 1. This implies that the regression line explains 100% of the variation in the data.
Adjusted R-Squared Statistics:-
Whenever we add a new independent variable to our dataset, the R-squared value will increase (or at least never decrease), even if the new independent variable has no correlation with the dependent variable. So we cannot always rely on the R-squared statistic. Let's try to understand this mathematically.
Suppose we have two models with one and two independent variables respectively, as compared below.
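One way to sketch the usual argument: the second model can always reproduce the first by setting the extra coefficient to zero, so its minimized RSS can only be smaller or equal, and therefore its R-squared can only be larger or equal:

```latex
RSS_2 = \min_{m_1, m_2, b} \sum_i \big(y_i - m_1 x_{1i} - m_2 x_{2i} - b\big)^2
      \;\le\; \min_{m_1, b} \sum_i \big(y_i - m_1 x_{1i} - b\big)^2 = RSS_1
\;\;\Longrightarrow\;\;
R^2_2 = 1 - \frac{RSS_2}{TSS} \;\ge\; 1 - \frac{RSS_1}{TSS} = R^2_1
```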
So we can conclude that whenever we increase the number of independent variables, the R-squared statistic will automatically increase (or at least never decrease).
To rectify this problem, we use the adjusted R-square statistic, which penalizes independent variables that do not correlate with the dependent variable. Its formula is shown below.
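The standard formula, where n is the number of observations and p is the number of independent variables, is:

```latex
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
```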
In the above equation we can see that when p = 0, the adjusted R-square equals the R-square value.
Thus, adjusted R-square <= R-square.
So we can rely on the adjusted R-square statistic at any time to check the accuracy of a regression model.




