AI Generated image: Bias vs. Variance in Regularized Regression
This project explores how bias and variance influence model performance and how regularization can be used to strike the right balance in polynomial regression models.
Image 1: High Bias
In this project, we explore the bias-variance trade-off using regularized polynomial regression. The goal is to build a regression model that balances underfitting (high bias) and overfitting (high variance) by adjusting the regularization parameter (λ).
The project follows a structured approach: fit a simple linear model, diagnose bias and variance with learning curves, add polynomial features, and then tune the regularization parameter λ.
By systematically analyzing the impact of λ on model performance, the project shows how to control overfitting and underfitting in machine learning models.
Importing Required Libraries:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.io #Used to load the OCTAVE *.mat files
import scipy.optimize #fmin_cg to train the linear regression
import warnings
warnings.filterwarnings('ignore')
- %matplotlib inline ensures that plots are displayed directly inside the Jupyter Notebook.
- numpy is imported as np for numerical computations.
- matplotlib.pyplot is imported as plt for data visualization.
- scipy.io is used to load the MATLAB .mat files that contain the dataset.
- scipy.optimize provides the optimization functions (such as fmin_cg) used to train the models.
- warnings.filterwarnings('ignore') suppresses warning messages for cleaner output.

Loading and Preparing the Dataset
datafile = r'd:\mlprojects\data\ex5data1.mat'
mat = scipy.io.loadmat(datafile)
# Training set
X, y = mat['X'], mat['y']
# Cross-validation set
Xval, yval = mat['Xval'], mat['yval']
# Test set
Xtest, ytest = mat['Xtest'], mat['ytest']
# Insert a column of 1's to all of the X's, as usual
X = np.insert(X, 0, 1, axis=1)
Xval = np.insert(Xval, 0, 1, axis=1)
Xtest = np.insert(Xtest, 0, 1, axis=1)
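As a quick check (not part of the original code), printing the array shapes confirms how many examples each split contains and that the bias column was added:
#Quick look at the sizes of the three data splits (bias column already included)
for name, arr in [('X', X), ('y', y), ('Xval', Xval), ('yval', yval), ('Xtest', Xtest), ('ytest', ytest)]:
    print(name, arr.shape)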
The dataset is loaded from ex5data1.mat, which is in MATLAB format. It is split into a training set (X, y), a cross-validation set (Xval, yval), and a test set (Xtest, ytest). As usual, a column of ones (the bias term) is inserted into all of the X matrices.

Plotting the Training Data
def plotData():
    plt.figure(figsize=(8,5))
    plt.ylabel('Water flowing out of the dam (y)')
    plt.xlabel('Change in water level (x)')
    plt.plot(X[:,1], y, 'rx')
    plt.grid(True)

plotData()
OUTPUT: Running this code produces the following graph.
Image 2: Graph 1
1. Description of the Dataset: Each training example records the change in water level (x) and the amount of water flowing out of a dam (y).
2. Data Preprocessing: A column of ones (the bias term) has already been added, so X[:,1] holds the original feature values.
3. Graph Explanation: The function plotData() is defined to visualize the training data; each red 'x' marks one of the 12 training examples.
4. Observations from the Graph: The relationship between water level and outflow is clearly non-linear, so a straight line is likely to underfit.
def h(theta,X): #Linear hypothesis function
    return np.dot(X,theta)

def computeCost(mytheta,myX,myy,mylambda=0.): #Cost function
    """
    theta_start is an n-dimensional vector of initial theta guess
    X is a matrix with n columns and m rows
    y is a matrix with m rows and 1 column
    """
    m = myX.shape[0]
    myh = h(mytheta,myX).reshape((m,1))
    mycost = float((1./(2*m)) * np.dot((myh-myy).T,(myh-myy)))
    regterm = (float(mylambda)/(2*m)) * float(mytheta[1:].T.dot(mytheta[1:]))
    return mycost + regterm
Explanation:
- h(theta, X): Implements the linear hypothesis function \( h_{\theta}(x) = \theta^{T}x \).
- computeCost(mytheta, myX, myy, mylambda=0.): Computes the regularized squared-error cost of the hypothesis on (myX, myy). (Note: Regularization does not apply to \( \theta_0 \), the bias term.)
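For reference, the quantity returned by computeCost() is the regularized cost
\[ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2} + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}, \]
where the regularization sum starts at j = 1, matching the mytheta[1:] slice in the code.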
Computing the Cost for Given \( \theta \) Values
Using codes:
#Using theta initialized at [1; 1], and lambda = 1.
mytheta = np.array([[1.],[1.]])
print(computeCost(mytheta,X,y,mylambda=1.))
OUTPUT:
303.9931922202643
Explanation:
The cost for \( \theta = [1, 1] \) with λ = 1 is 303.9931922202643. This value represents the error of the model before training.
def computeGradient(mytheta,myX,myy,mylambda=0.):
    mytheta = mytheta.reshape((mytheta.shape[0],1))
    m = myX.shape[0]
    #grad has same shape as mytheta (2x1)
    myh = h(mytheta,myX).reshape((m,1))
    grad = (1./float(m))*myX.T.dot(h(mytheta,myX)-myy)
    regterm = (float(mylambda)/m)*mytheta
    regterm[0] = 0 #don't regularize the bias term
    regterm.reshape((grad.shape[0],1))
    return grad + regterm

#Here's a wrapper for computeGradient that flattens the output
#This is for the minimization routine that wants everything flattened
def computeGradientFlattened(mytheta,myX,myy,mylambda=0.):
    return computeGradient(mytheta,myX,myy,mylambda).flatten()
Explanation:
1. Function computeGradient(mytheta, myX, myy, mylambda=0.): Computes the gradient of the regularized cost with respect to each parameter; the bias term \( \theta_0 \) is excluded from regularization.
2. Function computeGradientFlattened(): A thin wrapper that flattens the gradient into a 1-D array, which is the format expected by scipy.optimize.
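For reference, the gradient computed by computeGradient() is
\[ \frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)} + \frac{\lambda}{m}\theta_{j} \quad (j \ge 1), \]
with no regularization term for \( \theta_0 \), matching the regterm[0] = 0 line in the code.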
Computing the Gradient for \( \theta \) = [1, 1]
# Using theta initialized at [1; 1] we would expect to see a
# gradient of [-15.303016; 598.250744] (with lambda=1)
mytheta = np.array([[1.],[1.]])
print(computeGradient(mytheta,X,y,1.))
OUTPUT:
\( \left[
\begin{array}{c}
-15.303016 \\
598.250744
\end{array}
\right] \)
Note: This gradient tells us how each \( \theta_j \) should be updated to minimize the cost.
Fitting linear regression:
Codes used:
def optimizeTheta(myTheta_initial, myX, myy, mylambda=0., print_output=True):
    fit_theta = scipy.optimize.fmin_cg(computeCost,
                                       x0=myTheta_initial,
                                       fprime=computeGradientFlattened,
                                       args=(myX,myy,mylambda),
                                       disp=print_output,
                                       epsilon=1.49e-12,
                                       maxiter=1000)
    fit_theta = fit_theta.reshape((myTheta_initial.shape[0],1))
    return fit_theta
Function optimizeTheta(): Uses SciPy's conjugate-gradient optimizer (fmin_cg) to find the \( \theta \) values that minimize the cost function.

Running Optimization
mytheta = np.array([[1.],[1.]])
fit_theta = optimizeTheta(mytheta,X,y,0.)
OUTPUT:
Optimization terminated successfully.
Current function value: 22.373906
Iterations: 18
Function evaluations: 28
Gradient evaluations: 28
Explanation: The optimizer converges after 18 iterations to a cost of about 22.37; fit_theta now holds the parameters of the best-fit straight line (with λ = 0).
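As a side note (not part of the original notebook), the same training step can also be written with the newer scipy.optimize.minimize interface. A minimal sketch, assuming the computeCost and computeGradientFlattened functions defined above:
def optimizeThetaAlt(theta_init, myX, myy, mylambda=0.):
    #Equivalent conjugate-gradient minimization via the scipy.optimize.minimize API
    res = scipy.optimize.minimize(computeCost, x0=theta_init.flatten(),
                                  jac=computeGradientFlattened,
                                  args=(myX, myy, mylambda),
                                  method='CG', options={'maxiter': 1000, 'disp': False})
    return res.x.reshape((theta_init.shape[0], 1))

#e.g. fit_theta_alt = optimizeThetaAlt(np.array([[1.],[1.]]), X, y, 0.)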
Plotting the Fitted Line
plotData()
plt.plot(X[:,1],h(fit_theta,X).flatten())
Image 3: Graph 2
Graph Description: The code calls plotData() to re-plot the training points and then overlays the fitted straight line h(fit_theta, X).

Key Takeaway: The straight line cannot follow the curvature of the data, so the linear model underfits (high bias).
Generating the Learning Curve:
Python codes:
def plotLearningCurve():
    """
    Loop over first training point, then first 2 training points, then first 3 ...
    and use each training-set-subset to find trained parameters.
    With those parameters, compute the cost on that subset (Jtrain),
    remembering that for Jtrain, lambda = 0 (even if you are using regularization).
    Then, use the trained parameters to compute Jval on the entire validation set,
    again forcing lambda = 0 even if using regularization.
    Store the computed errors, error_train and error_val, and plot them.
    """
    initial_theta = np.array([[1.],[1.]])
    mym, error_train, error_val = [], [], []
    for x in range(1,13,1):
        train_subset = X[:x,:]
        y_subset = y[:x]
        mym.append(y_subset.shape[0])
        fit_theta = optimizeTheta(initial_theta,train_subset,y_subset,mylambda=0.,print_output=False)
        error_train.append(computeCost(fit_theta,train_subset,y_subset,mylambda=0.))
        error_val.append(computeCost(fit_theta,Xval,yval,mylambda=0.))
    plt.figure(figsize=(8,5))
    plt.plot(mym,error_train,label='Train')
    plt.plot(mym,error_val,label='Cross Validation')
    plt.legend()
    plt.title('Learning curve for linear regression')
    plt.xlabel('Number of training examples')
    plt.ylabel('Error')
    plt.grid(True)
The function plotLearningCurve() is designed to visualize the bias-variance tradeoff by plotting the learning curve. Here's what it does:
- Trains the model on subsets of 1, 2, ..., 12 training examples using optimizeTheta().
- Training error (Jtrain): computed with computeCost() on each training subset, with lambda = 0.
- Validation error (Jval): computed with computeCost() on the entire validation set, again with lambda = 0.
- Stores the computed errors (error_train and error_val) for plotting.
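Expressed as formulas, the two curves plot the unregularized costs
\[ J_{\text{train}}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}, \qquad J_{\text{cv}}(\theta) = \frac{1}{2m_{\text{cv}}}\sum_{i=1}^{m_{\text{cv}}}\left(h_{\theta}(x_{\text{cv}}^{(i)}) - y_{\text{cv}}^{(i)}\right)^{2}, \]
each evaluated with the \( \theta \) learned from the corresponding training subset.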
Running the Learning Curve Function
plotLearningCurve()
OUTPUT:
Image 4: Graph 3
Graph Description: The call to plotLearningCurve() generates and displays the learning curve. Both the training error and the cross-validation error remain high as more examples are added, which is the signature of high bias: the straight-line model underfits the data.

Polynomial regression allows us to capture complex, non-linear relationships
that simple linear regression cannot.
In this section, we apply polynomial regression (global_d = 5, giving polynomial terms up to x⁶) and observe how it fits the data.
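Concretely, with global_d = 5 the model fitted below has the form
\[ h_{\theta}(x) = \theta_{0} + \theta_{1}x + \theta_{2}x^{2} + \theta_{3}x^{3} + \theta_{4}x^{4} + \theta_{5}x^{5} + \theta_{6}x^{6}, \]
where the powers \( x^{2} \) through \( x^{6} \) are the extra columns produced by genPolyFeatures().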
This involves three steps: generating polynomial features and applying feature scaling, running the optimization process, and fitting the polynomial regression model.

Python codes:
def genPolyFeatures(myX,p):
    """
    Function takes in the X matrix (with bias term already included as the first column)
    and returns an X matrix with "p" additional columns.
    The first additional column will be the 2nd column (first non-bias column) squared,
    the next additional column will be the 2nd column cubed, etc.
    """
    newX = myX.copy()
    for i in range(p):
        dim = i+2
        newX = np.insert(newX,newX.shape[1],np.power(newX[:,1],dim),axis=1)
    return newX

def featureNormalize(myX):
    """
    Takes as input the X array (with bias "1" first column), does
    feature normalizing on the columns (subtract mean, divide by standard deviation).
    Returns the feature-normalized X, and feature means and stds in a list.
    Note this is different than my implementation in assignment 1...
    I didn't realize you should subtract the means, THEN compute the std of the
    mean-subtracted columns.
    Doesn't make a huge difference, I've found.
    """
    Xnorm = myX.copy()
    stored_feature_means = np.mean(Xnorm,axis=0) #column-by-column
    Xnorm[:,1:] = Xnorm[:,1:] - stored_feature_means[1:]
    stored_feature_stds = np.std(Xnorm,axis=0,ddof=1)
    Xnorm[:,1:] = Xnorm[:,1:] / stored_feature_stds[1:]
    return Xnorm, stored_feature_means, stored_feature_stds
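As a quick sanity check (a hypothetical cell, not in the original notebook), a single row with bias term 1 and feature value 3 can be expanded with p = 3 extra columns:
toy = np.array([[1., 3.]])       #[bias, x]
print(genPolyFeatures(toy, 3))   #-> [[ 1.  3.  9. 27. 81.]]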
Function 1: genPolyFeatures(myX, p): Takes the X matrix (bias column included) and appends p additional columns containing the powers x², x³, ..., x^(p+1) of the first non-bias column.

Function 2: featureNormalize(myX): Normalizes each feature column by subtracting its mean and dividing by its standard deviation (the bias column is left untouched), and returns the normalized matrix together with the stored means and standard deviations.
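Because featureNormalize() returns the training means and standard deviations, the same scaling can later be re-applied to new raw data before making predictions. A minimal sketch with a hypothetical helper (applyNormalization is not part of the original notebook):
def applyNormalization(myX, p, means, stds):
    #Hypothetical helper: expand new raw inputs (bias column included) with the same
    #polynomial degree and re-apply the means/stds stored from the training set.
    Xout = genPolyFeatures(myX, p).astype(float)
    Xout[:,1:] = (Xout[:,1:] - means[1:]) / stds[1:]
    return Xout

#e.g. applyNormalization(Xtest, global_d, stored_means, stored_stds) once the next cell has run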
Python codes:
#Generate an X matrix with terms up through x^6
#(5 additional columns to the X matrix)
global_d = 5
newX = genPolyFeatures(X,global_d)
newX_norm, stored_means, stored_stds = featureNormalize(newX)

#Find fit parameters starting with 1's as the initial guess
mytheta = np.ones((newX_norm.shape[1],1))
fit_theta = optimizeTheta(mytheta,newX_norm,y,0.)
OUTPUT:
Optimization terminated successfully.
Current function value: 0.198053
Iterations: 76
Function evaluations: 152
Gradient evaluations: 152
Code Explanation:
- global_d = 5 → adds polynomial features x² through x⁶ (5 additional columns).
- genPolyFeatures(X, global_d) → generates the new X with the polynomial terms.
- featureNormalize(newX) → normalizes the expanded dataset and stores the means and standard deviations.
- optimizeTheta() → minimizes the cost and finds the best theta values (with λ = 0).
Interpretation:
The cost function reached 0.198 after 76 iterations, meaning the polynomial regression model is learning well.
Python codes:
def plotFit(fit_theta,means,stds):
    """
    Function that takes in some learned fit values (trained on feature-normalized data).
    It sets x-points as a linspace, constructs an appropriate X matrix,
    applies the stored feature normalization, computes the hypothesis values,
    and plots them on top of the data.
    """
    n_points_to_plot = 50
    xvals = np.linspace(-55,55,n_points_to_plot)
    xmat = np.ones((n_points_to_plot,1))
    xmat = np.insert(xmat,xmat.shape[1],xvals.T,axis=1)
    xmat = genPolyFeatures(xmat,len(fit_theta)-2)
    #Apply the stored feature normalization to the plotting points
    xmat[:,1:] = xmat[:,1:] - means[1:]
    xmat[:,1:] = xmat[:,1:] / stds[1:]
    plotData()
    plt.plot(xvals,h(fit_theta,xmat),'b--')

plotFit(fit_theta,stored_means,stored_stds)
Output Graph:
Image 5: Graph 4
Graph Description:
Function plotFit(fit_theta, means, stds):
- Creates 50 evenly spaced x-values (xvals from -55 to 55).
- Builds the corresponding feature matrix with genPolyFeatures().
- Re-applies the stored feature scaling (means and standard deviations) so the points match the scale the model was trained on.
- Computes the hypothesis values with h() and plots the curve over the data.
✅ Final Output: A Graph Showing the Polynomial Regression Fit
The function plotPolyLearningCurve() is used to plot the learning curve for polynomial regression. The learning curve helps analyze the bias-variance tradeoff by comparing the training error and cross-validation error as the number of training examples increases.
Python codes:
def plotPolyLearningCurve(mylambda=0.):
    initial_theta = np.ones((global_d+2,1))
    mym, error_train, error_val = [], [], []
    myXval, dummy1, dummy2 = featureNormalize(genPolyFeatures(Xval,global_d))
    for x in range(1,13,1):
        train_subset = X[:x,:]
        y_subset = y[:x]
        mym.append(y_subset.shape[0])
        train_subset = genPolyFeatures(train_subset,global_d)
        train_subset, dummy1, dummy2 = featureNormalize(train_subset)
        fit_theta = optimizeTheta(initial_theta,train_subset,y_subset,mylambda=mylambda,print_output=False)
        error_train.append(computeCost(fit_theta,train_subset,y_subset,mylambda=mylambda))
        error_val.append(computeCost(fit_theta,myXval,yval,mylambda=mylambda))
    plt.figure(figsize=(8,5))
    plt.plot(mym,error_train,label='Train')
    plt.plot(mym,error_val,label='Cross Validation')
    plt.legend()
    plt.title('Polynomial Regression Learning Curve (lambda = 0)')
    plt.xlabel('Number of training examples')
    plt.ylabel('Error')
    plt.ylim([0,100])
    plt.grid(True)

plotPolyLearningCurve()
1. Initialize Variables:
Python:
initial_theta = np.ones((global_d+2,1))
mym, error_train, error_val = [], [], []
- initial_theta: A vector of ones, used as the initial guess for optimization.
- mym: Stores the number of training examples used in each iteration.
- error_train & error_val: Lists that store the training and validation errors.

2. Prepare the Validation Set (Xval):
Python:
myXval, dummy1, dummy2 = featureNormalize(genPolyFeatures(Xval,global_d))
Expands Xval with the same polynomial features (up to global_d extra columns) and normalizes it once, outside the loop.

3. Loop Over Different Training Set Sizes:
Python:
for x in range(1,13,1):
4. Generate Polynomial Features & Normalize:
Python:
train_subset = X[:x,:]
y_subset = y[:x]
train_subset = genPolyFeatures(train_subset,global_d)
train_subset, dummy1, dummy2 = featureNormalize(train_subset)
Selects the first x training examples (X[:x,:] and y[:x]), generates their polynomial features, and normalizes them.

5. Optimize the Model (Train Parameters):
Python:
fit_theta = optimizeTheta(initial_theta,train_subset,y_subset,mylambda=mylambda,print_output=False)
Trains the model on the current subset and returns the optimized parameters fit_theta.

6. Compute Training and Validation Errors:
Python:
error_train.append(computeCost(fit_theta,train_subset,y_subset,mylambda=mylambda))
error_val.append(computeCost(fit_theta,myXval,yval,mylambda=mylambda))
- error_train: Computes the cost on the subset of training data.
- error_val: Computes the cost on the full validation set.

7. Plot the Learning Curve:
Python:
plt.figure(figsize=(8,5))
plt.plot(mym,error_train,label='Train')
plt.plot(mym,error_val,label='Cross Validation')
plt.legend()
plt.title('Polynomial Regression Learning Curve (lambda = 0)')
plt.xlabel('Number of training examples')
plt.ylabel('Error')
plt.ylim([0,100])
plt.grid(True)
Output:
Image 6: Graph 5
Explanation of the Learning Curve (Lambda = 0)
This graph represents the learning curve for polynomial regression when λ (lambda) = 0 (no regularization). It shows how the cross-validation error changes as the number of training examples increases.
1. High Initial Error: With only a few training examples, the cross-validation error is very high because the model has not seen enough data to generalize.
2. Error Drops Quickly: As more examples are added, the cross-validation error falls rapidly.
3. Fluctuations in Error: The cross-validation error fluctuates as examples are added, while the training error stays close to zero.
4. Indication of Overfitting: A near-zero training error combined with a much larger cross-validation error is the signature of high variance (overfitting), which is expected for an unregularized high-degree polynomial.
Conclusion: Without regularization, the polynomial model fits the training data almost perfectly but generalizes poorly; adding regularization should reduce the variance.
This section adjusts the regularization parameter (lambda = 1) and evaluates its effect on polynomial regression. The process includes:
Python codes:
#Try Lambda = 1
mytheta = np.zeros((newX_norm.shape[1],1))
fit_theta = optimizeTheta(mytheta,newX_norm,y,1)
plotFit(fit_theta,stored_means,stored_stds)
plotPolyLearningCurve(1.)
OUTPUT:
Current function value: 8.042488
Iterations: 5
Function evaluations: 71
Gradient evaluations: 60
The two resulting graphs are shown in the explanation below.
Explanation of the Codes
1. Initializing Theta:
Python:
mytheta = np.zeros((newX_norm.shape[1],1))
A zero vector mytheta with one entry per model parameter is created as the initial guess.

2. Optimizing Theta for Lambda = 1:
Python:
fit_theta = optimizeTheta(mytheta, newX_norm, y, 1)
3. Plotting the Polynomial Fit:
Python:
plotFit(fit_theta, stored_means, stored_stds)
Output: Polynomial Regression Fit Curve with λ = 1
Image 7: Polynomial Regression Fit Curve with λ = 1
Explanation of the Polynomial Regression Fit Curve with λ = 1
The polynomial regression model is fitted to the dataset with regularization (λ = 1). The red markers represent the actual data points, while the blue dashed curve represents the model's prediction. Regularization helps control overfitting, producing a smoother and more generalized fit.
4. Plotting the Learning Curve for λ = 1:
Python:
plotPolyLearningCurve(1.)
Output: Polynomial Regression Learning Curve, λ = 1
Image 8: Polynomial Regression Learning Curve, λ = 1
Explanation of the Polynomial Regression Learning Curve, λ = 1
Learning curve analysis for polynomial regression with regularization (λ = 1). The training error (blue) and cross-validation error (orange) converge as the number of training examples increases, demonstrating improved generalization and reduced variance.
Python codes:
#Try Lambda = 100
#Note after one iteration, the lambda of 100 penalizes the theta params so hard
#that the minimizer loses precision and gives up...
#so the plot below is NOT indicative of a successful fit
mytheta = np.random.rand(newX_norm.shape[1],1)
fit_theta = optimizeTheta(mytheta,newX_norm,y,100.)
plotFit(fit_theta,stored_means,stored_stds)
OUTPUT:
Current function value: 131.414491
Iterations: 0
Function evaluations: 38
Gradient evaluations: 27
Explanation of the Codes
In this cell, you are testing the impact of a high regularization parameter λ = 100 on polynomial regression.
Python:
mytheta = np.random.rand(newX_norm.shape[1],1)
This initializes θ with random values, unlike the earlier code where θ was initialized to zeros or ones.
Python:
fit_theta = optimizeTheta(mytheta, newX_norm, y, 100.)
The optimizer stops after zero iterations (Iterations: 0), meaning it failed to improve the model.
Python:
plotFit(fit_theta, stored_means, stored_stds)
Output: Polynomial Regression Fit Curve with λ = 100
Image 9: Polynomial regression model with high regularization (λ = 100)
Explanation of the graph
The graph shows the impact of high regularization on polynomial regression:
Polynomial regression model with high regularization (λ = 100). The model fails to capture the data trend due to excessive penalization of parameters, resulting in severe underfitting. The optimization process does not complete successfully, demonstrating the negative impact of overly strong regularization.
This is a classic example of over-regularization, where the model becomes too simple and fails to capture important patterns in the data.
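This behaviour follows directly from the cost function: for very large λ the penalty term dominates,
\[ J(\theta) \approx \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}, \]
so the optimizer is pushed to shrink \( \theta_{1}, \dots, \theta_{n} \) toward zero and the fit degenerates toward the nearly constant prediction \( h_{\theta}(x) \approx \theta_{0} \).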
Training and Validation Error Calculation:
Python codes:
#lambdas = [0., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1., 3., 10.]
lambdas = np.linspace(0,5,20)
errors_train, errors_val = [], []
for mylambda in lambdas:
    newXtrain = genPolyFeatures(X,global_d)
    newXtrain_norm, dummy1, dummy2 = featureNormalize(newXtrain)
    newXval = genPolyFeatures(Xval,global_d)
    newXval_norm, dummy1, dummy2 = featureNormalize(newXval)
    init_theta = np.ones((newX_norm.shape[1],1))
    fit_theta = optimizeTheta(init_theta,newXtrain_norm,y,mylambda,False)
    errors_train.append(computeCost(fit_theta,newXtrain_norm,y,mylambda=mylambda))
    errors_val.append(computeCost(fit_theta,newXval_norm,yval,mylambda=mylambda))
Explanation of the code:
The code runs polynomial regression for different values of the regularization parameter λ (lambda) and records the corresponding training and cross-validation errors.
- lambdas = np.linspace(0,5,20) → generates 20 evenly spaced values of λ between 0 and 5.

Plotting the Errors vs. Lambda
Python:
plt.figure(figsize=(8,5))
plt.plot(lambdas,errors_train,label='Train')
plt.plot(lambdas,errors_val,label='Cross Validation')
plt.legend()
plt.xlabel('lambda')
plt.ylabel('Error')
plt.grid(True)
This code plots the errors from the previous step to visualize the impact of different λ values on model performance.
Output: Optimal Regularization Parameter (λ) Using Cross-Validation
Image 10: Effect of regularization (λ) on polynomial regression performance.
This graph illustrates the effect of different regularization strengths (λ) on training and validation errors.
The blue line represents training error, while the orange line represents cross-validation error. The optimal λ value is where validation error is minimized, balancing bias and variance.
The graph typically follows this trend: the training error increases steadily with λ, while the cross-validation error first decreases (as overfitting is reduced) and then rises again once the model starts to underfit; the best λ lies near the minimum of the cross-validation curve.
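As a possible follow-up (not in the original code), the λ that minimizes the cross-validation error can be selected programmatically and used to estimate the error on the held-out test set. A minimal sketch, assuming the variables from the loop above:
best_lambda = lambdas[np.argmin(errors_val)]
print('Best lambda:', best_lambda)

#Re-train on the full (polynomial, normalized) training set with the selected lambda,
#then evaluate on the test set without regularization.
#(Here the test set is normalized with its own statistics, following the pattern used above.)
newXtest_norm, dummy1, dummy2 = featureNormalize(genPolyFeatures(Xtest,global_d))
fit_theta = optimizeTheta(np.ones((newXtrain_norm.shape[1],1)),newXtrain_norm,y,best_lambda,False)
print('Test error:', computeCost(fit_theta,newXtest_norm,ytest,mylambda=0.))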
This project highlights the importance of regularization in polynomial regression. Key takeaways include: a model with too little regularization overfits (high variance), a model with too much regularization underfits (high bias), and learning curves together with validation curves make both failure modes visible.
Through this exercise, we demonstrated how cross-validation helps in selecting the right λ, making the model more generalizable to unseen data. This method is essential for improving model robustness in real-world machine learning applications.
I sincerely thank Prof. Andrew Ng (DeepLearning.AI, Stanford University) for his inspiring courses, which laid the foundation for this project.