Building Blocks of Machine Learning: Feature Normalization & Cost Optimization


Feature Normalization and Cost Optimization




Featured Project: Building Blocks of Machine Learning

This project explores the foundational concepts of machine learning by diving into feature normalization and cost optimization using linear regression techniques.


Introduction

This project explores the application of linear regression using machine learning techniques. In particular, we examine how the linear regression cost function converges under gradient descent. The same technique is applied in practice to predict house prices and in many other fields.



Understanding Linear Regression


Linear regression is a technique used to model the relationship between a dependent variable (target) and one or more independent variables (predictors). It assumes a linear relationship between the input variables and the output variable.

Objective of Linear Regression

  • To find the best-fit line that minimizes the difference between the actual and predicted values.
  • Uses Gradient Descent to minimize the cost function, as formalized below.
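For a single feature, the objective can be stated compactly: find the parameters \( \theta_0 \) and \( \theta_1 \) of the line \( h_{\theta}(x) = \theta_0 + \theta_1 x \) that minimize the cost

$$ \min_{\theta_0, \theta_1} \; J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(X_i) - y_i \right)^2 $$

where \( m \) is the number of training examples; this is the same cost function used throughout the rest of this project.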

Applications of Linear Regression

  • Predicting house prices based on square footage and number of bedrooms.
  • Estimating sales revenue based on marketing spend.
  • Analyzing trends and relationships in economic or scientific data.

Example Code

Before moving on to house-price prediction, let us first plot a scatter graph to see how a city's population affects profit. The inputs X and y are loaded into NumPy arrays.

      
        %matplotlib inline
        import numpy as np
        import matplotlib.pyplot as plt
        
        datafile = 'd:/mlprojects/data/ex1data1.txt'
        cols = np.loadtxt(datafile,delimiter=',',usecols=(0,1),unpack=True) # Read in comma separated data
        
        # Form the usual "X" matrix and "y" vector
        X = np.transpose(np.array(cols[:-1]))
        y = np.transpose(np.array(cols[-1:]))
        m = y.size # Number of training examples
        
        # Insert the usual column of 1's into the "X" matrix
        X = np.insert(X,0,1,axis=1)
        
        # Plot the data to see what it looks like
        plt.figure(figsize=(10,6))
        plt.plot(X[:,1],y[:,0],'rx',markersize=10)
        plt.grid(True) # Always plot.grid true!
        plt.ylabel('Profit in $10,000s')
        plt.xlabel('Population of City in 10,000s')
      
    
Linear Regression Scatter Graph

Graph 1: Linear regression scatter graph

Explanation of the Graph

This is a scatter plot of the given dataset, where:

  • X-axis (Horizontal): Represents the population of a city (in 10,000s).
  • Y-axis (Vertical): Represents the profit (in $10,000s).
  • Red 'x' markers: Represent individual data points.

Observations

  • General Trend: There seems to be a positive correlation between population and profit.
  • Density of Data Points: More data points are clustered around smaller populations (~5 to 10), meaning most cities in the dataset have smaller populations.
  • Possible Outliers: Some cities with small populations have higher profits, and some with large populations have relatively lower profits.

Next Steps for Linear Regression

  • This scatter plot visually justifies using linear regression, as there is an upward trend.
  • The goal of linear regression is to find the best-fit line that represents this trend and can be used for predictions.


Gradient Descent & Cost Computation


Gradient descent is an optimization algorithm used to minimize the cost function in machine learning models, particularly in linear regression. It iteratively adjusts the model parameters (theta) to find the best-fit line that minimizes the error between predicted and actual values.
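In each iteration, every parameter \( \theta_j \) is updated simultaneously using the rule below, which is exactly what the inner loop of the code that follows implements:

$$ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_{\theta}(X_i) - y_i \right) X_{i,j} $$

where \( \alpha \) is the learning rate, \( m \) is the number of training examples, and \( X_{i,j} \) is the value of feature \( j \) for training example \( i \).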


We use the following code in a Jupyter Notebook cell:

      
        iterations = 1500
        alpha = 0.01
                                    
        def h(theta,X): #Linear hypothesis function
            return np.dot(X,theta)

        def computeCost(mytheta,X,y): #Cost function
            """
            mytheta is an (n+1)-dimensional column vector of parameters
            X is a matrix with m rows and n+1 columns (including the column of 1's)
            y is a matrix with m rows and 1 column
            """
            #note to self: *.shape is (rows, columns)
            # return float((1./(2*m)) * np.dot((h(mytheta,X)-y).T,(h(mytheta,X)-y)))
            return float((1./(2*m)) * np.dot((h(mytheta, X) - y).T, (h(mytheta, X) - y)).item())

        #Test that running computeCost with 0's as theta returns 32.07:
        initial_theta = np.zeros((X.shape[1],1)) #theta is a vector with n+1 rows and 1 column (if X has n features)
        print(computeCost(initial_theta,X,y))
      
    


After running the above code, the result is:

32.072733877455676

1. Understanding the above Gradient Descent Parameters

We have used two important parameters:

  • iterations = 1500 → The number of times the algorithm updates the parameters. More iterations allow the model to refine its predictions.
  • alpha = 0.01 → The learning rate, which controls the step size in each iteration. A small value ensures gradual improvements, preventing overshooting of the minimum cost (the sketch below illustrates this trade-off).
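To build intuition for the learning rate, here is a small, self-contained sketch (not part of the project code; the helper run_gd, the toy function \( f(w) = w^2 \) with gradient \( 2w \), and the sample alpha values are purely illustrative):

        def run_gd(alpha, steps=10, w0=5.0):
            #Gradient descent on the toy function f(w) = w**2, whose gradient is 2*w
            w = w0
            for _ in range(steps):
                w = w - alpha * 2 * w  #one gradient step
            return w

        print(run_gd(alpha=0.01))  #small alpha: slow but steady progress toward the minimum at 0
        print(run_gd(alpha=0.45))  #larger alpha: reaches (almost) 0 within a few steps
        print(run_gd(alpha=1.10))  #alpha too large: the iterates overshoot and grow, i.e. diverge

The same trade-off applies to the linear regression cost function: a small learning rate converges slowly but reliably, while one that is too large can make the cost increase instead of decrease.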

2. Linear Hypothesis Function

We have defined the function h(theta, X) as:

    
        def h(theta, X):  # Linear hypothesis function
            return np.dot(X, theta)
    
  
  • This function computes the predicted output (h or hypothesis) by taking the dot product of the feature matrix X and parameter vector theta.
  • Initially, theta is set to all zeros.
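As a quick sanity check (a minimal sketch assuming the X, y, and h defined in the cells above): with all-zero parameters every prediction is 0, so the initial cost reduces to half the mean of \( y^2 \), matching the 32.07 value reported earlier.

        theta_zero = np.zeros((X.shape[1], 1))
        print(h(theta_zero, X)[:5])        #first five predictions: all zeros
        print(float(np.mean(y ** 2) / 2))  #equals computeCost(theta_zero, X, y), about 32.07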

3. Cost Function Computation

The function computeCost(mytheta, X, y) calculates the Mean Squared Error (MSE), which quantifies how far the model’s predictions are from actual values.

    
      def computeCost(mytheta, X, y):  # Cost function
          return float((1./(2*m)) * np.dot((h(mytheta, X) - y).T, (h(mytheta, X) - y)).item())
  
  
  • The cost function used for linear regression is:
    $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(X_i) - y_i)^2 $$
  • h(mytheta, X) - y calculates the difference between predicted and actual values.
  • Squaring the differences ensures that errors don’t cancel out.
  • m is the number of training samples.
  • The sum of squared errors is scaled by \( \frac{1}{2m} \) to normalize it.
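Equivalently, in vectorized form, which is exactly what the np.dot expression inside computeCost evaluates:

$$ J(\theta) = \frac{1}{2m} \, (X\theta - y)^{T} (X\theta - y) $$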

4. Testing the Cost Function with Zero Thetas

      
          initial_theta = np.zeros((X.shape[1],1))
          print(computeCost(initial_theta, X, y))
      
    
  • initial_theta is initialized with zeros.
  • When we run the cost function, the output is 32.0727, meaning that with untrained parameters (all zeros) the cost, i.e. half the mean squared error, is about 32.07.
  • As we apply gradient descent, this cost will decrease, meaning the model is improving.


Convergence of the Cost Function in Gradient Descent


We run the following code in a Jupyter Notebook cell:

          
              #Actual gradient descent minimizing routine
              def descendGradient(X, theta_start=np.zeros(2)):
                  theta = theta_start
                  jvec = []  # Stores cost values for each iteration
                  thetahistory = []  # Stores theta updates for visualization

                  for _ in range(iterations):  # Use range instead of xrange
                      tmptheta = theta.copy()  # Explicitly copy theta
                      jvec.append(computeCost(theta, X, y))
                      thetahistory.append(list(theta[:, 0]))

                      for j in range(len(tmptheta)):
                          tmptheta[j] = theta[j] - (alpha / m) * np.sum((h(theta, X) - y) * np.array(X[:, j]).reshape(m, 1))
                      theta = tmptheta
                  return theta, thetahistory, jvec


              #Actually run gradient descent to get the best-fit theta values
              initial_theta = np.zeros((X.shape[1],1))
              theta, thetahistory, jvec = descendGradient(X,initial_theta)

              #Plot the convergence of the cost function
              def plotConvergence(jvec):
                  plt.figure(figsize=(10,6))
                  plt.plot(range(len(jvec)),jvec,'bo')
                  plt.grid(True)
                  plt.title("Convergence of Cost Function")
                  plt.xlabel("Iteration number")
                  plt.ylabel("Cost function")
                  dummy = plt.xlim([-0.05*iterations,1.05*iterations])
                  #dummy = plt.ylim([4,8])

              plotConvergence(jvec)
              dummy = plt.ylim([4,7])
          
        


After running the above code, the following graph is produced:

Cost Function Convergence Graph

Graph 2: Showing how costs are decreasing with the increase of training iteration

Understanding the Graph

The plotted graph represents how the cost function \( J(\theta) \) decreases over multiple iterations of the gradient descent algorithm.

1. Axes Explanation

  • X-axis (Iteration Number): Represents the number of training iterations (from 0 to ~1500). Each iteration updates the model’s parameters.
  • Y-axis (Cost Function): Measures the error between the model’s predictions and the actual values. A lower cost means better performance.

2. Curve Analysis

  • Initial High Cost: At the start, the cost is around 6.8, indicating poor model performance.
  • Rapid Initial Decline (0–400 Iterations): The cost decreases rapidly, showing the model is learning effectively.
  • Slowing Down Convergence (400–1000 Iterations): As iterations progress, the cost function flattens, meaning the model is approaching an optimal solution and further improvements become smaller.
  • Convergence (~1000+ Iterations): The curve flattens completely as the cost function levels off around 4.5, indicating that gradient descent has converged; additional iterations won't significantly reduce the cost.

3. Observations

  • Smooth Curve: This graph visually confirms that gradient descent effectively optimizes the cost function, moving the model towards the best-fit line for linear regression. The decreasing cost ensures that the predicted values align closely with the actual data.
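For a quick numeric confirmation (a minimal check assuming the theta and jvec returned by descendGradient above; the values in the comments are approximate):

        print(theta.ravel())  #learned parameters, roughly [-3.63, 1.17] for this dataset
        print(jvec[-1])       #final cost, close to the ~4.5 plateau visible in the graph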

Linear Regression Model: Best-Fit Line Over Training Data


We run the following code in a Jupyter Notebook cell:


      
      #Plot the line on top of the data to ensure it looks correct
      def myfit(xval):
          return theta[0] + theta[1]*xval
      plt.figure(figsize=(10,6))
      plt.plot(X[:,1],y[:,0],'rx',markersize=10,label='Training Data')
      #plt.plot(X[:,1],myfit(X[:,1]),'b-',label = 'Hypothesis: h(x) = %0.2f + %0.2fx'%(theta[0],theta[1]))
      
      plt.plot(X[:,1], myfit(X[:,1]), 'b-', 
               label = 'Hypothesis: h(x) = %0.2f + %0.2fx' % (theta[0].item(), theta[1].item()))
      
      plt.grid(True) #Always plot.grid true!
      plt.ylabel('Profit in $10,000s')
      plt.xlabel('Population of City in 10,000s')
      plt.legend()
      
      
      

After running the above code, the following graph is drawn:

Data distribution Graph

Graph 3: Showing Linear Regression Best-Fit Line Over Training Data

Explanation of the Graph:

This graph represents the fitted linear regression model plotted over the training data. The objective is to visualize how well the model captures the relationship between the population of a city and its profit.

Key Observations:

1. Red 'x' Markers (Training Data):

  • Each point represents a city, with population on the X-axis and profit on the Y-axis.
  • The scatter plot shows an overall upward trend, suggesting a positive correlation.

2. Blue Line (Linear Regression Hypothesis \( h(x)\)):

  • This line represents the best-fit equation:
    \( h(x) = \theta_0 + \theta_1 x \), where \( x \) represents the input feature.
  • It is obtained using gradient descent, minimizing the cost function to fit the data optimally.
  • The line closely follows the pattern of the data, validating the effectiveness of linear regression.

Conclusion:

  • The model successfully captures the general trend between population and profit.
  • This straight line serves as a predictive model, allowing us to estimate future profits for cities with different populations.
  • Any deviations from the line could indicate outliers or data points where additional factors influence profit.
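As a usage example (a small sketch assuming the theta learned above; the population value of 7.0, i.e. 70,000 people, is chosen purely for illustration), the fitted line can be used to make a prediction:

        population = 7.0  #population in units of 10,000s
        predicted_profit = theta[0].item() + theta[1].item() * population  #profit in $10,000s
        print(predicted_profit * 10000)  #predicted profit in dollars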


3D Visualization of the Cost Function


We run the following code in a Jupyter Notebook cell:


        
            #Import necessary matplotlib tools for 3d plots
            from mpl_toolkits.mplot3d import axes3d, Axes3D
            from matplotlib import cm
            import itertools

            fig = plt.figure(figsize=(12,12))
            #ax = fig.gca(projection='3d')
            ax = fig.add_subplot(111, projection='3d')


            xvals = np.arange(-10,10,.5)
            yvals = np.arange(-1,4,.1)
            myxs, myys, myzs = [], [], []
            #Evaluate the cost over a grid of (theta_0, theta_1) values
            for theta0 in xvals:
                for theta1 in yvals:
                    myxs.append(theta0)
                    myys.append(theta1)
                    myzs.append(computeCost(np.array([[theta0], [theta1]]),X,y))

            scat = ax.scatter(myxs,myys,myzs,c=np.abs(myzs),cmap=plt.get_cmap('YlOrRd'))

            plt.xlabel(r'$\theta_0$',fontsize=30)
            plt.ylabel(r'$\theta_1$',fontsize=30)
            plt.title('Cost (Minimization Path Shown in Blue)',fontsize=30)
            plt.plot([x[0] for x in thetahistory],[x[1] for x in thetahistory],jvec,'bo-')
            plt.show()
        
    

The following 3D Visualization Graph is generated:

3D Visualization Graph

Graph 4: 3D Visualization of the Cost function

Graph Description

This 3D plot represents the cost function \( J(\theta) \) with respect to the parameters \( \theta_0 \) and \( \theta_1 \). The surface demonstrates how the cost varies across different parameter values, with the color intensity indicating the magnitude of the cost. The blue trajectory illustrates the optimization path taken by the gradient descent algorithm as it iteratively minimizes the cost, converging toward the optimal parameter values.

This graph provides a visual representation of the cost function landscape in a three-dimensional space. The X and Y axes correspond to the parameter values \( \theta_0 \) and \( \theta_1 \), while the Z-axis represents the computed cost \( J(\theta) \). The scattered points on the surface denote different cost values, color-coded to indicate their magnitude. The plotted blue path shows the step-by-step progression of gradient descent as it moves towards the minimum cost.

This visualization is crucial for understanding how gradient descent optimizes the parameters and ensures convergence towards the best-fit hypothesis.



Distribution of Feature Values Before Normalization


We run the following code in a Jupyter Notebook cell:


      
          datafile = r'd:\mlprojects\data\ex1data2.txt'
          #Read into the data file
          cols = np.loadtxt(datafile,delimiter=',',usecols=(0,1,2),unpack=True) #Read in comma separated data
          #Form the usual "X" matrix and "y" vector
          X = np.transpose(np.array(cols[:-1]))
          y = np.transpose(np.array(cols[-1:]))
          m = y.size # number of training examples
          #Insert the usual column of 1's into the "X" matrix
          X = np.insert(X,0,1,axis=1)

          #Quick visualize data
          plt.grid(True)
          plt.xlim([-100,5000])
          dummy = plt.hist(X[:,0],label = 'col1')
          dummy = plt.hist(X[:,1],label = 'col2')
          dummy = plt.hist(X[:,2],label = 'col3')
          plt.title('Clearly we need feature normalization.')
          plt.xlabel('Column Value')
          plt.ylabel('Counts')
          dummy = plt.legend()
      
    


The following histograms are generated:

Pre-Normalization Graph

Graph 5: Distribution of Feature Values Before Normalization

Description:

Feature scaling is crucial for gradient descent optimization. This histogram represents the dataset before normalization.

The histogram visualizes the distribution of feature values in the dataset before applying feature normalization. The three histograms correspond to different columns of the dataset, showing a wide range of values. The need for feature scaling is evident, as the features vary significantly in magnitude, which can impact the performance of gradient descent.


More Explanation of the Histogram

This figure illustrates the distribution of dataset features before normalization. The X-axis represents the feature values, while the Y-axis indicates the count of occurrences. The histograms for the different columns show that some feature values are much larger than others.

Since gradient descent is sensitive to feature magnitudes, feature normalization is essential to bring all features to a similar scale. This improves convergence speed and ensures that the optimization algorithm progresses efficiently without being dominated by features with larger numerical ranges.

One issue with this histogram is that all the bars appear in the same orange color, which makes the individual features hard to distinguish; this plot needs improvement.

Feature normalization (standardization or min-max scaling) is required to bring all features to the same scale.
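For a feature column \( x \) with mean \( \mu \), standard deviation \( \sigma \), minimum \( x_{\min} \), and maximum \( x_{\max} \), the two scaling schemes mentioned above are:

$$ x_{\text{norm}} = \frac{x - \mu}{\sigma} \quad \text{(standardization)}, \qquad x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad \text{(min-max scaling)} $$

The next section applies standardization, storing each column's mean and standard deviation for later use.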



Feature Normalization


Feature normalization is carried out by modifying the code and running it in the Jupyter Notebook:



    
        #Feature normalizing the columns (subtract mean, divide by standard deviation)
        #Store the mean and std for later use
        #Note don't modify the original X matrix, use a copy
        stored_feature_means, stored_feature_stds = [], []
        Xnorm = X.copy()
        for icol in range(Xnorm.shape[1]):
            stored_feature_means.append(np.mean(Xnorm[:,icol]))
            stored_feature_stds.append(np.std(Xnorm[:,icol]))
            #Skip the first column (the column of 1's)
            if not icol: continue
            #Faster to not recompute the mean and std again, just use the stored values
            Xnorm[:,icol] = (Xnorm[:,icol] - stored_feature_means[-1])/stored_feature_stds[-1]

        #Quick visualize the feature-normalized data
        plt.grid(True)
        plt.xlim([-5,5])
        dummy = plt.hist(Xnorm[:,0],label = 'col1')
        dummy = plt.hist(Xnorm[:,1],label = 'col2')
        dummy = plt.hist(Xnorm[:,2],label = 'col3')
        plt.title('Feature Normalization Accomplished')
        plt.xlabel('Column Value')
        plt.ylabel('Counts')
        dummy = plt.legend()
    


The following histograms are generated:

Post-Normalization Graph

Graph 6: Distribution of Features after Normalization

Description:

Feature normalization ensures a stable and efficient training process. The transformed features are now on a similar scale.

This histogram illustrates the distribution of feature values after applying feature normalization. The green, orange, and blue bars represent different features of the dataset. The transformation scales the data to a common range, ensuring that all features have a mean of zero and a standard deviation of one, which improves the efficiency of gradient descent.

Normalization Process: The code normalizes each column in the dataset X by subtracting the column mean and dividing by the standard deviation. This process ensures that each column has a mean of 0 and a standard deviation of 1.

Storage of Means and Stds: The code calculates the mean and standard deviation for each column in X and stores them in stored_feature_means and stored_feature_stds for later use.

Normalization: The Xnorm array stores the normalized values. Each column (except the first) in Xnorm is adjusted using the corresponding mean and standard deviation from stored_feature_means and stored_feature_stds.



Explanation for Project Page:

Feature normalization is a crucial preprocessing step in machine learning, particularly for gradient descent optimization. This figure visualizes the effect of normalization on feature values.



This step ensures that features contribute equally to the learning process, preventing any single feature from dominating the optimization.

Why Feature Normalization is Necessary:

Feature normalization is critical in gradient descent, especially with multiple variables, because it ensures that all features (columns of Xnorm) are on a similar scale. Without normalization, the gradient steps could be uneven or "overflow" could occur during multiplication, making the algorithm fail.



Convergence of the Cost Function in Multi-Variable Gradient Descent


Code for convergence:

    
        #Run gradient descent with multiple variables, initial theta still set to zeros
        #(Note! This doesn't work unless we feature normalize! "overflow encountered in multiply")
        initial_theta = np.zeros((Xnorm.shape[1],1))
        theta, thetahistory, jvec = descendGradient(Xnorm,initial_theta)
    
        #Plot convergence of cost function:
        plotConvergence(jvec)
    
    
    

After running the code, the following graph is generated:


Cost Function Convergence Graph

Graph 7: Convergence of the Cost Function in Multi-Variable Gradient Descent

Gradient descent minimizes the cost function iteratively. This graph shows the decrease in cost over time.


Description

This plot illustrates the convergence behavior of the cost function over multiple iterations of gradient descent. The sharply bending curve shows a rapid initial decrease in cost, followed by a gradual stabilization as the algorithm approaches the optimal parameters.

More Explanation of the graph

In multi-variable linear regression, gradient descent optimizes the cost function by iteratively updating the model parameters. This figure visualizes that process.



This graph confirms that the gradient descent algorithm is functioning correctly, efficiently minimizing the cost function to improve model accuracy.
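To show why the stored means and standard deviations matter, here is a hypothetical prediction sketch (the 1650 square-foot, 3-bedroom example values are made up for illustration and are not taken from the project): a new input must be normalized with the same stored statistics before the learned theta can be applied.

        #Hypothetical new example: a 1650 sq-ft house with 3 bedrooms
        #(assumes theta from descendGradient(Xnorm, initial_theta) and the stored_* lists above)
        new_features = [1650.0, 3.0]
        x = [1.0]  #bias term, which was never normalized
        for i, value in enumerate(new_features, start=1):
            x.append((value - stored_feature_means[i]) / stored_feature_stds[i])
        predicted_price = np.dot(np.array(x), theta).item()
        print(predicted_price)  #predicted house price for the hypothetical input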



Conclusion


The convergence curve shows how the cost function decreases during the gradient descent process. Over time, as the algorithm converges, the cost approaches its minimum, which is where the model parameters (theta) are optimized.



Acknowledgments


I sincerely thank Prof. Andrew Ng (DeepLearning.AI, Stanford University) for his inspiring courses that laid the foundation for this project.




