Linear Regression

Concept, Mathematics, and Implementation

What is Regression?

  • Statistical method for modeling the relationship between variables
  • Predicts a dependent variable based on independent variable(s)
  • Linear regression is the simplest form

Linear Regression

  • Models a linear relationship between variables
  • Assumes a straight line can approximate the relationship
  • Goal: Find the best-fitting line

Simple Linear Regression Equation

\[ y = mx + b \]

Where:

  • \( y \) is the dependent variable
  • \( x \) is the independent variable
  • \( m \) is the slope
  • \( b \) is the y-intercept
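
For one predictor, least squares (introduced below) gives closed-form estimates: \( m = \mathrm{cov}(x, y) / \mathrm{var}(x) \) and \( b = \bar{y} - m\bar{x} \). A minimal sketch with toy data (values assumed for illustration):

import numpy as np

# Toy sample points (assumed for illustration)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Closed-form least-squares estimates for one predictor:
# m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - m * x.mean()

print(f"Slope m = {m}, Intercept b = {b}")  # expect m = 0.6, b = 2.2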

Multiple Linear Regression

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon \]

Where:

  • \( y \) is the dependent variable
  • \( x_1, x_2, ..., x_n \) are independent variables
  • \( \beta_0, \beta_1, ..., \beta_n \) are coefficients
  • \( \epsilon \) is the error term
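
A minimal sketch with two assumed features, solved by least squares (toy data; np.linalg.lstsq stands in for the estimation method covered next):

import numpy as np

# Toy design matrix with two features x1, x2, built so y = 1 + x1 + 2*x2 exactly
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# Prepend an intercept column, then solve for [beta_0, beta_1, beta_2]
X_b = np.c_[np.ones(len(X)), X]
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Coefficients:", beta)  # expect approximately [1. 1. 2.]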

Ordinary Least Squares (OLS)

  • Method to estimate coefficients
  • Minimizes the sum of squared residuals
  • Residual: Difference between observed and predicted values

OLS Objective Function

Minimize:

\[ \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Where:

  • \( y_i \) is the observed value
  • \( \hat{y}_i \) is the predicted value
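
As a concrete check, the objective evaluated for one candidate line (toy data and parameters assumed):

import numpy as np

# Toy observations and a candidate line y_hat = m*x + b
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
m, b = 0.6, 2.2

y_hat = m * x + b
ssr = np.sum((y - y_hat) ** 2)  # sum of squared residuals

print(f"Sum of squared residuals: {ssr}")  # 2.4 for these values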

Assumptions of Linear Regression

  • Linearity: Relationship between X and Y is linear
  • Independence: Observations are independent
  • Homoscedasticity: Constant variance of residuals
  • Normality: Residuals are normally distributed (quick checks for both are sketched below)
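
A minimal diagnostic sketch for the last two assumptions (the residuals and fitted values here are toy stand-ins; in practice take them from your fitted model):

import numpy as np
from scipy import stats

# Toy residuals and fitted values (assumed for illustration)
residuals = np.array([-0.8, 0.6, 1.0, -0.6, -0.2])
fitted = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

# Normality: Shapiro-Wilk test (null hypothesis: residuals are normal)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Homoscedasticity (rough check): |residuals| should not trend with fitted values
corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print(f"Corr(|residual|, fitted): {corr:.3f}")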

Implementing Linear Regression with NumPy

import numpy as np

# Generate sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Add intercept term (column of ones) to X
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Solve the normal equation: theta = (X^T X)^(-1) X^T y
# (np.linalg.lstsq or np.linalg.pinv is more numerically stable in practice)
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Print results
print(f"Intercept: {theta[0]}, Slope: {theta[1]}")

Prediction with NumPy Implementation

# Function to make predictions
def predict(X, theta):
    return np.dot(np.c_[np.ones((X.shape[0], 1)), X], theta)

# Make predictions
X_new = np.array([[0], [2], [4], [6]])
y_pred = predict(X_new, theta)

print("Predictions:", y_pred)
					

Linear Regression with scikit-learn


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Prepare toy data (10 points so the held-out split below is large enough to
# evaluate; with only 5 points, test_size=0.2 leaves a single test sample,
# for which R-squared is undefined)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2.2, 3.9, 5.1, 5.8, 7.3, 8.1, 9.4, 10.2, 11.0, 12.5])

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Print results
print(f"Intercept: {model.intercept_}, Slope: {model.coef_[0]}")

Prediction with scikit-learn

# Make predictions
X_new = np.array([[0], [2], [4], [6]])
y_pred = model.predict(X_new)

print("Predictions:", y_pred)

# Evaluate model
from sklearn.metrics import mean_squared_error, r2_score

y_pred_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_test)
r2 = r2_score(y_test, y_pred_test)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
					

Advantages of scikit-learn

  • Easy to use and understand
  • Consistent API across different models
  • Built-in cross-validation and model selection tools
  • Efficient implementation for large datasets

Evaluating Linear Regression

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (Coefficient of Determination)
  • Adjusted R-squared
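
RMSE and adjusted R-squared are not defined on the following slides; their standard definitions are:

\[ RMSE = \sqrt{MSE} \qquad R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

where \( n \) is the number of observations and \( p \) the number of predictors.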

Mean Squared Error (MSE)

\[ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

  • Measures average squared difference between predicted and actual values
  • Lower values indicate better fit
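
The formula maps directly to NumPy (toy arrays assumed):

import numpy as np

y_true = np.array([2, 4, 5, 4, 5], dtype=float)  # toy observed values
y_hat = np.array([2.8, 3.4, 4.0, 4.6, 5.2])      # toy predictions

mse = np.mean((y_true - y_hat) ** 2)
print(f"MSE: {mse}")  # 0.48 for these values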

R-squared

\[ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]

  • Proportion of variance in dependent variable explained by model
  • Ranges from 0 to 1 for in-sample fit with an intercept (higher is better); can be negative on held-out data
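
Computed directly from the definition (same toy arrays as in the MSE sketch):

import numpy as np

y_true = np.array([2, 4, 5, 4, 5], dtype=float)
y_hat = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"R-squared: {r2}")  # 0.6 for these values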

Limitations of Linear Regression

  • Assumes linear relationship (may not always hold)
  • Sensitive to outliers
  • Can overfit with too many features
  • Assumes predictors are not strongly correlated (multicollinearity inflates coefficient variance)

Conclusion

  • Linear regression is a fundamental statistical technique
  • Useful for understanding relationships between variables
  • Easy to implement with NumPy or scikit-learn
  • Important to understand assumptions and limitations

Thank You for Attending!
