Linear Regression
Concept, Mathematics, and Implementation
What is Regression?
- Statistical method for modeling the relationship between variables
- Predicts a dependent variable based on independent variable(s)
- Linear regression is the simplest form
Linear Regression
- Models linear relationship between variables
- Assumes a straight line can approximate the relationship
- Goal: Find the best-fitting line
Simple Linear Regression Equation
\[ y = mx + b \]
Where:
- \( y \) is the dependent variable
- \( x \) is the independent variable
- \( m \) is the slope
- \( b \) is the y-intercept
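A minimal sketch of estimating \( m \) and \( b \) from data with the closed-form least-squares formulas (the sample points are the same ones used in the implementation slides later):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Closed-form least-squares estimates for simple linear regression
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"Slope m: {m}, Intercept b: {b}")  # slope ≈ 0.6, intercept ≈ 2.2 for these points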
Multiple Linear Regression
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \]
Where:
- \( y \) is the dependent variable
- \( x_1, x_2, ..., x_n \) are independent variables
- \( \beta_0, \beta_1, ..., \beta_n \) are coefficients
- \( \epsilon \) is the error term
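A rough sketch of the multiple case with two made-up features (the data below is purely illustrative), estimating the coefficients with NumPy's least-squares solver:
import numpy as np
# Illustrative data: 5 observations, 2 features (generated as y = 1 + 2*x1 + x2)
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([5, 6, 11, 12, 17])
# Prepend a column of ones so beta_0 (the intercept) is estimated as well
X_b = np.c_[np.ones(len(X)), X]
# Least-squares estimates of [beta_0, beta_1, beta_2]
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print("Coefficients:", beta)  # approximately [1, 2, 1]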
Ordinary Least Squares (OLS)
- Method to estimate coefficients
- Minimizes the sum of squared residuals
- Residual: Difference between observed and predicted values
OLS Objective Function
Minimize:
\[ \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
Where:
- \( y_i \) is the observed value
- \( \hat{y}_i \) is the predicted value
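A quick sketch of this quantity in code, using the small dataset from the implementation slides and the fitted line \( \hat{y} = 0.6x + 2.2 \):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y_obs = np.array([2, 4, 5, 4, 5])        # observed values y_i
y_hat = 0.6 * x + 2.2                    # predicted values from the fitted line
residuals = y_obs - y_hat
ssr = np.sum(residuals ** 2)             # the sum OLS minimizes
print("Sum of squared residuals:", ssr)  # ≈ 2.4 for this data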
Assumptions of Linear Regression
- Linearity: Relationship between X and Y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed
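A rough sketch of checking the normality assumption on the residuals (the Shapiro-Wilk test from SciPy is one common choice; the interpretation of a large p-value as "no evidence against normality" is a convention, not part of these slides):
import numpy as np
from scipy import stats
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
residuals = y - (0.6 * x + 2.2)  # residuals from the fitted line
# Shapiro-Wilk test: a large p-value means no evidence against normality
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.3f}")
Homoscedasticity and independence are usually checked visually, e.g. with a residuals-vs-fitted plot.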
Implementing Linear Regression with NumPy
import numpy as np
# Generate sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])
# Add intercept term to X
X_b = np.c_[np.ones((X.shape[0], 1)), X]
# Calculate coefficients via the normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
# Print results
print(f"Intercept: {theta[0]}, Slope: {theta[1]}")
Prediction with NumPy Implementation
# Function to make predictions
def predict(X, theta):
    return np.dot(np.c_[np.ones((X.shape[0], 1)), X], theta)
# Make predictions
X_new = np.array([[0], [2], [4], [6]])
y_pred = predict(X_new, theta)
print("Predictions:", y_pred)
Linear Regression with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Prepare data (a slightly larger illustrative sample than before, so the
# 20% test split contains more than one point and R-squared below is well-defined)
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 9, 11])
# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Print results
print(f"Intercept: {model.intercept_}, Slope: {model.coef_[0]}")
Prediction with scikit-learn
# Make predictions
X_new = np.array([[0], [2], [4], [6]])
y_pred = model.predict(X_new)
print("Predictions:", y_pred)
# Evaluate model
from sklearn.metrics import mean_squared_error, r2_score
y_pred_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_test)
r2 = r2_score(y_test, y_pred_test)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Advantages of scikit-learn
- Easy to use and understand
- Consistent API across different models
- Built-in cross-validation and model selection tools (see the sketch after this list)
- Efficient implementation for large datasets
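As a brief illustration of the cross-validation tools mentioned above, reusing X and y from the scikit-learn example (the 3-fold choice is arbitrary for such a small dataset):
from sklearn.model_selection import cross_val_score
# 3-fold cross-validated R^2 scores for a fresh LinearRegression on the data above
scores = cross_val_score(LinearRegression(), X, y, cv=3, scoring="r2")
print("Cross-validated R^2 scores:", scores)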
Evaluating Linear Regression
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (Coefficient of Determination)
- Adjusted R-squared
Mean Squared Error (MSE)
\[ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
- Measures average squared difference between predicted and actual values
- Lower values indicate better fit
R-squared
\[ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]
- Proportion of variance in dependent variable explained by model
- Typically between 0 and 1 (higher is better); can be negative when the model fits worse than simply predicting the mean
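A small sketch computing these metrics directly from the formulas above (NumPy only; the observed and predicted values are the illustrative five-point example again, and the adjusted R-squared formula is the standard one rather than something defined on these slides):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y_true = np.array([2, 4, 5, 4, 5])
y_pred = 0.6 * x + 2.2                         # predictions from the fitted line
mse = np.mean((y_true - y_pred) ** 2)          # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                       # R-squared
n, p = len(y_true), 1                          # n observations, p features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared
print(f"MSE: {mse}, RMSE: {rmse}, R2: {r2}, Adjusted R2: {adj_r2}")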
Limitations of Linear Regression
- Assumes linear relationship (may not always hold)
- Sensitive to outliers (see the sketch after this list)
- Can overfit with too many features
- Multicollinearity (strongly correlated features) makes coefficient estimates unstable
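A tiny sketch of the outlier sensitivity mentioned above: refitting after corrupting a single point (the corrupted value is made up) shifts the slope substantially.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y_clean = np.array([2, 4, 5, 4, 5])
y_outlier = np.array([2, 4, 5, 4, 15])   # last point corrupted
for label, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    fit = LinearRegression().fit(X, y)
    print(f"{label}: slope={fit.coef_[0]:.2f}, intercept={fit.intercept_:.2f}")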
Conclusion
- Linear regression is a fundamental statistical technique
- Useful for understanding relationships between variables
- Easy to implement with NumPy or scikit-learn
- Important to understand assumptions and limitations
Thank You for Attending!