Linear Regression Model – Prerequisites, Case Study, Goodness-of-Fit

February 12, 2025

MSG Team's other articles

The subprime mortgage crisis is known for the spectacular collapse of some of the largest financial institutions in almost no time. Bear Stearns is a prime example of this. This was an institution that was built on decades of hard work and innovation. However, certain risky bets made in the markets led to the famous […]

Top Five Factors That Spur Economic Growth

The world has witnesses the rise and fall of many great economic empires. It is safe to say that we have enough empirical data that we can analyze what spurs economic growth. In this article, we will look at some of the important factors that lead to economic growth. Natural Resources Natural resources are the […]

The Big Bust Washington Mutual (WaMu)

Washington Mutual (WaMu) was one of the oldest and biggest banks in the United States. Like many of its peers, this bank had survived several financial shocks such as World Wars, Great Depression etc. Also, like many of its peers it became a big ticket casualty of the subprime mortgage crash. Washington Mutual (WaMu)’s fall […]

Proptech: The Future of Real Estate

The rise of the internet and mobile technology has fundamentally changed the way we live. All sectors of the economy have been greatly impacted by this internet boom. Real estate sector has been one of the laggards when it comes to this technological adoption. However, of late, proptech has become a buzzword in the startup […]

Why Does the Definition of Inflation Matter?

We discussed the definition of inflation in a lot of detail in the previous article. The previous article was meant to bring to the readers notice that the current definition of inflation is flawed. Instead a previously used definition was capable of defining the concept in a much better manner. This brings up the question, […]

Search with tags

No tags available.

Suppose you are an HR professional and want to determine:

Whether age of an employee has a substantial effect on their maturity
The importance of experience and capability on remuneration
The importance of IQ (Intelligence Quotient) vs. EQ (Emotional Quotient) on problem handling capability
How sedentary lifestyle at workplace affects employee output
If a specific physical activity makes employees more energetic and lively at the workplace

All these are routine scenarios in an organization. But their impact is huge. How, as an HR professional, can you determine which variables have what impact on employee productivity?

Regression analysis offers you the answer. It helps you explain the relationship between two or more variables.

With Explanation Comes Consideration!

However, before we go into details and understand how regression models can be employed to derive a cause and effect relationship, there are several important considerations to take into account:

Not all assumptions can be tested and validated.
You need to carefully establish a hypothesis, for it to work out correctly.
Accuracy of results depends upon authenticity of data.
Ignoring important variables can bias your coefficients.
The econometric model you choose should fit the type of data you’re using.

Linear Regression Model

Regression analysis simplifies some very complex situations, almost magically. It helps researchers and professionals correlate intertwined variables. Linear regression is one of the simplest and most commonly used regression models. It predicts the cause and effect relationship between two variables.

The model uses Ordinary Least Squares (OLS) method, which determines the value of unknown parameters in a linear regression equation. Its goal is to minimize the difference between observed responses and the ones predicted using linear regression model. There are certain requirements that you need to fulfill, in order to use this model. Otherwise the results can be confusing and ambiguous.

Prerequisites of using a linear regression model

The number of observations is finite.
The primary assumption is that there are negligible errors in the value of independent variable (X) or regressor variables. It follows the principle of strict exogeneity, which means zero error.
Regressors or independent variables should be predefined constants or random variables.
Smaller the differences between data point corresponding to the regression line, better an econometric model fits the data.
It provides the maximum likelihood estimator, maximizing the agreement of the selected econometric model with the observed data.

Let’s understand it with the help of an example:

The phenomenon is: Work experience and remuneration are related variables. The linear regression model can help predict the remuneration slab of an employee given his/her work experience.

Statistical Representation

Now the problem arises how to represent it statistically.

There are two lines of regression – Y on X and X on Y.

Y on X is when the value of Y is unknown. X on Y is when the value of X is unknown.

Here are their statistical representations:

Suppose, the value of remuneration is = Y and the value of experience is = X

The Line of Regression of Y on X	The Line of Regression of X on Y
Then, to predict the unknown value of remuneration (Y), the statistical representation will be:	If it is the other way around or the value of work experience is unknown, the statistical representation will be:
Y = a + b(X)	X = c + d(Y)
Where ‘b’ is the coefficient of X and ‘a’ is the intercept of Y.	Where ‘d’ is the coefficient of Y and ‘c’ is the intercept of X.

Selection of Line of Regression

The statistical representation above is an example to show how to develop econometric models when the value of one of the variables is known and another’s unknown. However, this doesn’t mean that both the representations are correct.

This is because remuneration may depend on the work experience of an individual but the vice versa is not true. Experience doesn’t depend on remuneration. Therefore, you’ll have to carefully choose the dependent variable and then the line of regression.

What Does a Regression Equation Mean?

A linear regression equation shows the percentage increase or decrease in the value of dependent variable (Y) with the percentage increase or decrease in the value of independent variable (X).

Let us suppose the values of X and Y are known.

Table 1

X	Y
1	1
2	1.5
3	2.5
4	2.8
5	4

Graphically it’s represented as:

Graph 1

The white line connecting all the dots in the graph above represents the error or prediction. But you now want to find the best-fitted line of regression to minimize the error of prediction. The aim is to help find the best-fitted line of regression.

How to Find the Best-Fitted Line of Regression?

By using the ordinary least squares method!

Let us continue with the above example:

Table 2

	X	Y	XY	X-X’	Y-Y’	(X-X’)(Y-Y’)	(X-X’)²	(Y-Y’)²
	1	1	1	-2	-1.16	2.32	4	1.346
	2	1.5	3	-1	-0.66	0.66	1	0.436
	3	2.5	7.5	0	0.34	0.34	0	0.116
	4	2.8	11.2	1	0.64	0.64	1	0.410
	5	4	20	2	1.84	3.68	4	3.386
Sum	15	10.8	42.7			7.64	10	5.69
Mean	X’ = 3 (15/5)	Y’ = 2.16 (10.8/5)

Graph 2

The primary equation is:

Y = a + b(X) +e (error term)

In this case, e is zero because it is assumed that the independent variable (X) has negligible errors.

Therefore, it remains Y = a + b(X).

Let us now find the value of b.

b = [ ∑ XY – (∑Y)(∑X)/n ] / ∑(X-X’)²

Substitute the values in the above formula:

b = [ 42.7 – (15*10.8)/5 ]/10 = [ 42.7 – 162/5 ]/10 = [ 42.7 – 32.5 ]/10 = 10.2/10 = 1.02

b = 1.02

Therefore,

a = Y – b(X)

a = Y – 1.02(X) or a = ∑Y/n – 1.02 (∑X/n)

a = 2.16 – 1.02*3 = 2.16 – 3.06

a = -1.06

By substituting the value of a, b and X, we can find the corresponding value of Y.

Y = a + b(X) = -1.06 + 1.02X

When X = 1

Y = -1.06 + 1.02*1 = -0.04

When X = 2

Y = - 1.06 + 1.02*2 = 0.98

When X = 3, Y will be 2

When X = 4, Y will be 3.02

When X = 5, Y will be 4.04

Table 3

X	Y
1	-0.04
2	0.98
3	2
4	3.02
5	4.04

The best-fitted regression line will be:

Graph 3

Properties of the Best-Fitted Regression Line/Estimators

The regression line passes through the X’, which is 3 in this case. (Refer to Graph 3)
b, the regression coefficient of X, is an average change in Y. In this case, b = 1.02 which is the average change in the values of y: -0.04, 0.98, 2, 3.02 and 4.04. (Refer Table 3)
The regression line passes through Y’, which is 2.16 in this case. (Refer Graph 3)

Coefficient of Determination – Assessing Goodness-of-Fit

Whenever you use a regression model, the first thing you should consider is – how well an econometric model fits the data or how well a regression equation fits the data.

This is where the concept of coefficient of determination comes in. The regression models are generally fitted using this approach.

It determines the extent to which dependent variable can be predicted from independent variable. It assesses the goodness-of-fit of the Ordinary Least Square Regression Model.

Denoted by R², its value lies between 0 and 1. When:

R² = 0: the value of the dependent variable cannot be predicted from the independent variable.
R² = 1, the value of dependent variable can be easily predicted from the independent variable. There are no errors in the data.

Higher the value of R², better fit the model is to the data.

Let’s now understand how to calculate R². The formula for finding R² is:

R² = { (1 / n) * ∑ (X-X’) * (Y-Y’) } / (σx * σy)²

Where n = number of observations = 5

∑ (X-X’) * (Y-Y’) = 7.64 (Ref Table 2)

σx is the standard deviation of X and σy is the standard deviation of Y

σx = square root of ∑ (X-X’)²/n = √10/5 = √ 2 = 1.414

σy = square root of ∑ (Y-Y’)²/n = √5.69/5 = √1.138 = 1.067

Now let’s determine the value of R²

R² = { 1/5 (7.64) } / (1.414 * 1.067)² = 1.528 / (1.509)² = 1.528/2.277 = 0.67

Hence R² = 0.67

Higher the value of coefficient of determination, lower the standard error. The result indicates that about 67% of the variation in remuneration can be explained by the work experience. It shows that the work experience plays a major role in determining the remuneration.

Properties of Estimators

b is unbiased and independent.
σx is unbiased because of strict exogeneity. In case of no-strict exogeneity, the value will be biased in finite samples.
(σy)² is biased but square root minimizes the error.

The linear regression model is used when there is a linear relationship between dependent and independent variables. When the value of a dependent variable is based on multiple variables (more than one), we use multiple regression analysis. We’ll study about this in the next article. Stay tuned!

Article Written by

MSG Team

An insightful writer passionate about sharing expertise, trends, and tips, dedicated to inspiring and informing readers through engaging and thoughtful content.

Your email address will not be published. Required fields are marked *

Comment *

First Name

Last Name

Phone

Econometrics of Human Resources

Applied Econometrics – Steps to Carry Out an Empirical Study

MSG Team

February 12, 2025

Econometrics of Human Resources

Hypothesis Testing – Writing, Examples and Steps

MSG Team

February 12, 2025

Econometrics of Human Resources

Econometrics – Meaning, Elements, Techniques & its Application

MSG Team

February 12, 2025

Linear Regression Model – Prerequisites, Case Study, Goodness-of-Fit

MSG Team's other articles

Categories

Search with tags

Linear Regression Model

Prerequisites of using a linear regression model

Statistical Representation

What Does a Regression Equation Mean?

Coefficient of Determination – Assessing Goodness-of-Fit

Leave a reply

Linear Regression Model – Prerequisites, Case Study, Goodness-of-Fit

MSG Team's other articles

Categories

Search with tags

Linear Regression Model

Prerequisites of using a linear regression model

Statistical Representation

What Does a Regression Equation Mean?

Coefficient of Determination – Assessing Goodness-of-Fit

Leave a reply

Related Articles