Linear regression is a supervised machine learning algorithm that models the relationship between a dependent variable (y) and one or more independent variables (X). Its goal is to find the line of best fit: the line (or, with several independent variables, the hyperplane) that minimizes the sum of the squared vertical distances between the data points and the line. This line is called the regression line, and it is used to predict the dependent variable from the independent variables.
How Does Linear Regression Work?
Linear regression models the relationship between the dependent and independent variables as a linear equation of the form:
y = β0 + β1X1 + β2X2 + … + βnXn
Here y is the dependent variable, X1, X2, …, Xn are the independent variables, β0 is the intercept, and β1, β2, …, βn are the coefficients of the independent variables. The coefficients are estimated by least squares, which chooses the values that minimize the sum of the squared differences (residuals) between the observed values of y and the values predicted by the regression line.
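The least squares step can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the helper name `fit_least_squares` and the toy data are assumptions made for the example, and `np.linalg.lstsq` is used as one numerically stable way to solve the minimization.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals.

    Hypothetical helper written for this example.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        # A single independent variable: turn it into one column
        X = X.reshape(-1, 1)
    # Prepend a column of ones so the intercept b0 is estimated as well
    design = np.column_stack([np.ones(len(X)), X])
    # lstsq solves min ||design @ beta - y||^2
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Hypothetical noise-free data generated from y = 1 + 2*X1 + 3*X2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

beta = fit_least_squares(X, y)
print(beta)  # approximately [1. 2. 3.], up to floating-point error
```

Because the toy data contains no noise, the recovered values match the coefficients used to generate it; with real data the estimates would only approximate the underlying relationship.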
Types of Linear Regression
There are two main types of linear regression:
Simple Linear Regression: This type of linear regression models the relationship between the dependent variable and one independent variable.
Multiple Linear Regression: This type of linear regression models the relationship between the dependent variable and multiple independent variables.
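The difference between the two types can be seen by fitting both to the same data. The sketch below assumes scikit-learn is installed and uses synthetic data invented for the example, in which y depends on two independent variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y depends on two features plus a little noise
X = rng.normal(size=(200, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Simple linear regression: one independent variable (first column only)
simple = LinearRegression().fit(X[:, [0]], y)

# Multiple linear regression: both independent variables
multiple = LinearRegression().fit(X, y)

print("simple:  ", simple.intercept_, simple.coef_)
print("multiple:", multiple.intercept_, multiple.coef_)

# R^2 on the training data for each model
r2_simple = simple.score(X[:, [0]], y)
r2_multiple = multiple.score(X, y)
print("R^2 simple:", r2_simple, "R^2 multiple:", r2_multiple)
```

The simple model has a single coefficient and cannot account for the second variable, so it explains less of the variation in y than the multiple model does on this data.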
Advantages of Linear Regression
Simple and Easy to Implement: Linear regression is a straightforward algorithm that is easy to implement, and each coefficient can be read directly as the change in y for a one-unit change in that variable, holding the others fixed.
Provides a Good Starting Point: Linear regression is a useful baseline against which more complex models can be compared.
Fast and Efficient: Linear regression is fast to train because the least squares coefficients have a closed-form solution, which makes it suitable for large datasets.
Disadvantages of Linear Regression
Linearity Assumption: Linear regression assumes that the relationship between the dependent and independent variables is linear. When that assumption does not hold, the model's predictions can be inaccurate.
Outliers: Linear regression is sensitive to outliers. Because the errors are squared, a single extreme point can pull the regression line noticeably away from the rest of the data.
Non-Linear Relationships: Linear regression cannot capture curved relationships between the dependent and independent variables unless the independent variables are transformed first (for example, by adding polynomial terms).
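Two of these limitations are easy to reproduce on toy data. The sketch below is illustrative only: the helpers `fit_line` and `r_squared` and all of the data values are assumptions made for the example.

```python
import numpy as np

def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1). Hypothetical helper."""
    b1, b0 = np.polyfit(x, y, deg=1)  # polyfit returns the highest degree first
    return b0, b1

def r_squared(x, y, b0, b1):
    """Fraction of the variance in y explained by the fitted line."""
    residuals = y - (b0 + b1 * x)
    return 1.0 - np.sum(residuals**2) / np.sum((y - np.mean(y)) ** 2)

x = np.arange(10, dtype=float)

# Outliers: ten points exactly on y = 2x, then one value corrupted
y_clean = 2.0 * x
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0  # the uncorrupted value would be 18

_, slope_clean = fit_line(x, y_clean)
_, slope_outlier = fit_line(x, y_outlier)
print("slope without outlier:", slope_clean)
print("slope with one outlier:", slope_outlier)

# Non-linear relationship: a parabola centered on the mean of x
y_curve = (x - 4.5) ** 2
b0, b1 = fit_line(x, y_curve)
r2_curve = r_squared(x, y_curve, b0, b1)
print("slope on curved data:", b1, "R^2:", r2_curve)
```

A single corrupted point more than triples the fitted slope, and on the parabola the best straight line is essentially flat and explains almost none of the variation, even though y is completely determined by x.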
    import pandas as pd
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Load the California Housing dataset
    # (load_boston was removed in scikit-learn 1.2, so it is replaced here)
    housing = fetch_california_housing()

    # Convert the dataset into a pandas dataframe
    df = pd.DataFrame(housing.data, columns=housing.feature_names)
    df["Target"] = housing.target

    # Assign the features and target
    X = df.drop("Target", axis=1)
    y = df["Target"]

    # Split the dataset into training and test sets (70% / 30%)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train the Linear Regression model
    reg = LinearRegression()
    reg.fit(X_train, y_train)

    # Predict the target on the test set
    y_pred = reg.predict(X_test)

    # Calculate the mean squared error
    mse = mean_squared_error(y_test, y_pred)
    print("Mean Squared Error:", mse)
Linear regression is a simple and efficient machine learning algorithm that is widely used to model the relationship between a dependent variable and one or more independent variables. Despite its limitations, it remains a valuable tool for data scientists and a good starting point for more complex models. Understanding its strengths and limitations makes it easier to decide when to use linear regression and when to consider alternative models.