Linear Regression

Concept, Mathematics, and Implementation

What is Regression?

  • Statistical method for modeling the relationship between variables
  • Predicts a dependent variable based on independent variable(s)
  • Linear regression is the simplest form

Linear Regression

  • Models a linear relationship between variables
  • Assumes a straight line can approximate the relationship
  • Goal: Find the best-fitting line

Simple Linear Regression Equation

\[ y = mx + b \]

Where:

  • \( y \) is the dependent variable
  • \( x \) is the independent variable
  • \( m \) is the slope
  • \( b \) is the y-intercept
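
For one predictor, least squares (introduced below) gives closed-form estimates: \( m = \mathrm{cov}(x, y) / \mathrm{var}(x) \) and \( b = \bar{y} - m\bar{x} \). A minimal sketch with toy data (values assumed for illustration):

import numpy as np

# Toy sample points (assumed for illustration)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Closed-form least-squares estimates for one predictor:
# m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - m * x.mean()

print(f"Slope m = {m}, Intercept b = {b}")  # expect m = 0.6, b = 2.2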

Multiple Linear Regression

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon \]

Where:

  • \( y \) is the dependent variable
  • \( x_1, x_2, ..., x_n \) are independent variables
  • \( \beta_0, \beta_1, ..., \beta_n \) are coefficients
  • \( \epsilon \) is the error term
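
A minimal sketch with two assumed features, solved by least squares (toy data; np.linalg.lstsq stands in for the estimation method covered next):

import numpy as np

# Toy design matrix with two features x1, x2, built so y = 1 + x1 + 2*x2 exactly
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# Prepend an intercept column, then solve for [beta_0, beta_1, beta_2]
X_b = np.c_[np.ones(len(X)), X]
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Coefficients:", beta)  # expect approximately [1. 1. 2.]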

Ordinary Least Squares (OLS)

  • Method to estimate coefficients
  • Minimizes the sum of squared residuals
  • Residual: Difference between observed and predicted values

OLS Objective Function

Minimize:

\[ \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Where:

  • \( y_i \) is the observed value
  • \( \hat{y}_i \) is the predicted value
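
As a concrete check, the objective evaluated for one candidate line (toy data and parameters assumed):

import numpy as np

# Toy observations and a candidate line y_hat = m*x + b
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
m, b = 0.6, 2.2

y_hat = m * x + b
ssr = np.sum((y - y_hat) ** 2)  # sum of squared residuals

print(f"Sum of squared residuals: {ssr}")  # 2.4 for these values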

Assumptions of Linear Regression

  • Linearity: Relationship between X and Y is linear
  • Independence: Observations are independent
  • Homoscedasticity: Constant variance of residuals
  • Normality: Residuals are normally distributed (quick checks for both are sketched below)
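
A minimal diagnostic sketch for the last two assumptions (the residuals and fitted values here are toy stand-ins; in practice take them from your fitted model):

import numpy as np
from scipy import stats

# Toy residuals and fitted values (assumed for illustration)
residuals = np.array([-0.8, 0.6, 1.0, -0.6, -0.2])
fitted = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

# Normality: Shapiro-Wilk test (null hypothesis: residuals are normal)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Homoscedasticity (rough check): |residuals| should not trend with fitted values
corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print(f"Corr(|residual|, fitted): {corr:.3f}")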

Implementing Linear Regression with NumPy

import numpy as np

# Generate sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Add intercept term (column of ones) to X
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Solve the normal equation: theta = (X^T X)^(-1) X^T y
# (np.linalg.lstsq or np.linalg.pinv is more numerically stable in practice)
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Print results
print(f"Intercept: {theta[0]}, Slope: {theta[1]}")

Prediction with NumPy Implementation

# Function to make predictions
def predict(X, theta):
    return np.dot(np.c_[np.ones((X.shape[0], 1)), X], theta)

# Make predictions
X_new = np.array([[0], [2], [4], [6]])
y_pred = predict(X_new, theta)

print("Predictions:", y_pred)
					

Linear Regression with scikit-learn


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Prepare toy data (10 points so the held-out split below is large enough to
# evaluate; with only 5 points, test_size=0.2 leaves a single test sample,
# for which R-squared is undefined)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2.2, 3.9, 5.1, 5.8, 7.3, 8.1, 9.4, 10.2, 11.0, 12.5])

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Print results
print(f"Intercept: {model.intercept_}, Slope: {model.coef_[0]}")

Prediction with scikit-learn

# Make predictions
X_new = np.array([[0], [2], [4], [6]])
y_pred = model.predict(X_new)

print("Predictions:", y_pred)

# Evaluate model
from sklearn.metrics import mean_squared_error, r2_score

y_pred_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_test)
r2 = r2_score(y_test, y_pred_test)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
					

Advantages of scikit-learn

  • Easy to use and understand
  • Consistent API across different models
  • Built-in cross-validation and model selection tools
  • Efficient implementation for large datasets

Evaluating Linear Regression

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (Coefficient of Determination)
  • Adjusted R-squared
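
RMSE and adjusted R-squared are not defined on the following slides; their standard definitions are:

\[ RMSE = \sqrt{MSE} \qquad R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

where \( n \) is the number of observations and \( p \) the number of predictors.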

Mean Squared Error (MSE)

\[ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

  • Measures average squared difference between predicted and actual values
  • Lower values indicate better fit
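
The formula maps directly to NumPy (toy arrays assumed):

import numpy as np

y_true = np.array([2, 4, 5, 4, 5], dtype=float)  # toy observed values
y_hat = np.array([2.8, 3.4, 4.0, 4.6, 5.2])      # toy predictions

mse = np.mean((y_true - y_hat) ** 2)
print(f"MSE: {mse}")  # 0.48 for these values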

R-squared

\[ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]

  • Proportion of variance in dependent variable explained by model
  • Ranges from 0 to 1 for in-sample fit with an intercept (higher is better); can be negative on held-out data
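
Computed directly from the definition (same toy arrays as in the MSE sketch):

import numpy as np

y_true = np.array([2, 4, 5, 4, 5], dtype=float)
y_hat = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"R-squared: {r2}")  # 0.6 for these values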

Limitations of Linear Regression

  • Assumes linear relationship (may not always hold)
  • Sensitive to outliers
  • Can overfit with too many features
  • Assumes predictors are not strongly correlated (multicollinearity inflates coefficient variance)

Conclusion

  • Linear regression is a fundamental statistical technique
  • Useful for understanding relationships between variables
  • Easy to implement with NumPy or scikit-learn
  • Important to understand assumptions and limitations

Thank You for Attending!
