Logistic regression




Featured Project: Logistic Regression and Feature Optimization

This project showcases how logistic regression can be optimized to make accurate binary classifications, visualized through plots and performance metrics.


Shedding Light on Logistic Regression


Logistic regression is a fundamental classification algorithm used in machine learning and statistics to predict categorical outcomes. Unlike linear regression, which models continuous values, logistic regression estimates the probability of an event occurring by applying the sigmoid function to a linear combination of input features. This ensures the output remains between 0 and 1, making it ideal for binary classification problems such as spam detection, medical diagnosis, and fraud detection. The model is trained using maximum likelihood estimation and optimized through gradient descent to minimize the loss function, typically cross-entropy loss. Logistic regression can also be extended to multi-class classification using techniques like one-vs-all (OvA) or softmax regression.
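To make the mechanics concrete, the minimal sketch below (not part of the lab code; the parameter and feature values are made up for illustration) shows how a linear combination of features is passed through the sigmoid to produce a probability, which is then thresholded at 0.5 for a binary decision:

    import numpy as np
    from scipy.special import expit  # numerically stable sigmoid function

    theta = np.array([-4.0, 0.05, 0.04])   # hypothetical parameters [bias, weight1, weight2]
    x = np.array([1.0, 45.0, 85.0])        # hypothetical example, with a leading 1 for the bias term

    z = np.dot(theta, x)                   # linear combination theta^T x
    probability = expit(z)                 # sigmoid squashes z into the range (0, 1)
    prediction = int(probability >= 0.5)   # thresholding at 0.5 gives the binary class

    print(z, probability, prediction)      # e.g. 1.65, ~0.84, 1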



Lab Exercise

1. Scatter Plot of Exam Scores for Admission Classification


Let us run the following code:

    
      %matplotlib inline
      import numpy as np
      import matplotlib.pyplot as plt
      import pandas as pd
      
      datafile = r'd:\mlprojects\data\ex2data1.txt'
      #!head $datafile
      cols = np.loadtxt(datafile,delimiter=',',usecols=(0,1,2),unpack=True) #Read in comma separated data
      ##Form the usual "X" matrix and "y" vector
      X = np.transpose(np.array(cols[:-1]))
      y = np.transpose(np.array(cols[-1:]))
      m = y.size # number of training examples
      ##Insert the usual column of 1's into the "X" matrix
      X = np.insert(X,0,1,axis=1)
      
      #Divide the sample into two: ones with positive classification, one with null classification
      pos = np.array([X[i] for i in range(X.shape[0]) if y[i] == 1])
      neg = np.array([X[i] for i in range(X.shape[0]) if y[i] == 0])
      #Check to make sure I included all entries
      #print "Included everything? ",(len(pos)+len(neg) == X.shape[0])
      
      def plotData():
          plt.figure(figsize=(10,6))
          plt.plot(pos[:,1],pos[:,2],'k+',label='Admitted')
          plt.plot(neg[:,1],neg[:,2],'yo',label='Not admitted')
          plt.xlabel('Exam 1 score')
          plt.ylabel('Exam 2 score')
          plt.legend()
          plt.grid(True)
          
      plotData()     
    
  


After running the code in Jupyter Notebook, the following graph is generated:

Logistic Regression Scatter Graph

Graph 1: Scatter Plot of Exam Scores for Admission Classification

Explanation of the Graph

The scatter plot visualizes the classification of students based on their exam scores. Each point represents an individual student, where the x-axis corresponds to Exam 1 scores and the y-axis to Exam 2 scores. The black plus markers (‘+’) indicate students who were admitted, while the yellow circles (‘o’) represent those who were not admitted. This dataset serves as the foundation for training a logistic regression model to predict future admissions based on exam performance.



2. Visualization of the Sigmoid Function in Logistic Regression


Code to run in a Jupyter Notebook cell:

    
          from scipy.special import expit #Vectorized sigmoid function

          #Code to generate the sigmoid graph
          #Quick check that expit is what I think it is
          myx = np.arange(-10,10,.1)
          
          plt.figure(figsize=(8,6))  # Adjust figure size for better centering
          plt.plot(myx, expit(myx))
          plt.title("The 'S-shaped' Sigmoid Function")
          plt.grid(True)
          plt.axhline(0.5, color='r', linestyle='--', alpha=0.6)  # Optional reference line
          plt.xlim([-10, 10])  # Set x-axis limits
          plt.ylim([-0.1, 1.1])  # Set y-axis limits
          plt.gca().set_position([0.15, 0.15, 0.7, 0.7])  # Adjust plot positioning
          plt.show()
    
  
Sigmoid Graph

Graph 2: Sigmoid function graph

Graph description:

The graph displays the Sigmoid function, a fundamental component of logistic regression. The function maps input values to a range between 0 and 1, making it suitable for probability-based classification. The S-shaped curve ensures that values far from zero approach either 0 or 1 asymptotically, which is crucial for decision-making in binary classification problems.


Purpose of the Red Dotted Line:


The dotted red horizontal line in the middle of the graph, at y = 0.5, is a decision boundary reference line.




Initial Cost Function Evaluation in Logistic Regression

We run the following code in a Jupyter Notebook cell:

      
        #Hypothesis function and cost function for logistic regression
        def h(mytheta,myX): #Logistic hypothesis function
            return expit(np.dot(myX,mytheta))
        
        #Cost function, default lambda (regularization) 0
        def computeCost(mytheta,myX,myy,mylambda = 0.): 
            """
            theta_start is an n- dimensional vector of initial theta guess
            X is matrix with n- columns and m- rows
            y is a matrix with m- rows and 1 column
            Note this includes regularization, if you set mylambda to nonzero
            For the first part of the homework, the default 0. is used for mylambda
            """
            #note to self: *.shape is (rows, columns)
            term1 = np.dot(-np.array(myy).T,np.log(h(mytheta,myX)))
            term2 = np.dot((1-np.array(myy)).T,np.log(1-h(mytheta,myX)))
            regterm = (mylambda/2) * np.sum(np.dot(mytheta[1:].T,mytheta[1:])) #Skip theta0
            return float( (1./m) * ( np.sum(term1 - term2) + regterm ) )
        #Check that with theta as zeros, cost returns about 0.693:
        initial_theta = np.zeros((X.shape[1],1))
        computeCost(initial_theta,X,y)
      
    

Output: 0.6931471805599453


Understanding the Output of the Logistic Regression Cost Function:


The output 0.6931471805599453 is the initial cost (or loss) of the logistic regression model when all the parameters (theta) are set to zero. This value represents how well (or poorly) the model is performing before training begins.


Explanation of the Code and Its Purpose:


This code defines two key functions used in logistic regression:


1. Hypothesis Function (h(mytheta, myX)):


  • Computes the sigmoid function of X * theta.
  • The sigmoid function maps values between 0 and 1, making it ideal for binary classification.

2. Cost Function (computeCost(mytheta, myX, myy, mylambda))


  • Computes the logistic regression cost function, which measures the model’s performance.
  • The cost function for logistic regression is given by:

$$ J(\theta) = \frac{1}{m} \sum \left[-y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x)) \right] + \frac{\lambda}{2m} \sum \theta_j^2 $$
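
Plugging the all-zeros initial θ into this formula explains the initial cost of 0.693 reported above: with θ = 0 we have \( h_\theta(x) = \text{sigmoid}(0) = 0.5 \) for every training example, so each example contributes \( -y\log(0.5) - (1-y)\log(0.5) = -\log(0.5) \) to the sum, and therefore

$$ J(\mathbf{0}) = \frac{1}{m} \sum_{i=1}^{m} -\log(0.5) = \log 2 \approx 0.693 $$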


Optimizing Logistic Regression: Finding Optimal Parameters Using Scipy


We run the following code in a Jupyter Notebook cell:


      
          #An alternative to OCTAVE's 'fminunc' we'll use some scipy.optimize function, "fmin"
          #Note "fmin" does not need to be told explicitly the derivative terms
          #It only needs the cost function, and it minimizes with the "downhill simplex algorithm."
          #http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.optimize.fmin.html
          from scipy import optimize
          
          def optimizeTheta(mytheta,myX,myy,mylambda=0.):
              result = optimize.fmin(computeCost, x0=mytheta, args=(myX, myy, mylambda), maxiter=400, full_output=True)
              return result[0], result[1]
          
          theta, mincost = optimizeTheta(initial_theta,X,y)
          #That's pretty cool. Black boxes ftw
      
    


OUTPUT:

    
      Optimization terminated successfully.
               Current function value: 0.203498
               Iterations: 157
               Function evaluations: 287
    
  

Explanation of Code and Output


The optimization process successfully minimized the logistic regression cost function to 0.203498 after 157 iterations and 287 function evaluations, ensuring an optimal set of parameters for the model.


This code optimizes the logistic regression cost function using Scipy’s fmin function, which implements the downhill simplex algorithm to minimize the cost function. Unlike gradient-based optimization, fmin does not require explicit derivative calculations.


1. Function optimizeTheta:

  • Calls optimize.fmin to minimize the cost function (computeCost).
  • Takes an initial guess for θ (theta) and refines it iteratively.
  • Uses a maximum of 400 iterations to converge to an optimal value.

2. Output Interpretation:

  • "Optimization terminated successfully."→ The algorithm successfully found an optimal solution.
  • "Current function value: 0.203498" → The final cost (loss) after optimization. A lower value indicates better optimization.
  • "Iterations: 157" → The number of iterations needed to reach convergence.
  • "Function evaluations: 287" → The number of times the cost function was evaluated during optimization.

Understanding the Convergence of the Cost Function

The optimization output above shows how the cost function \( J(\theta) \) decreases as the downhill simplex algorithm iteratively refines the parameters.


1. What Is Being Minimized:

  • The cost \( J(\theta) \) measures the error between the model's predicted probabilities and the actual class labels. A lower cost means better performance.
  • Each iteration adjusts θ so that this error becomes smaller.

2. Convergence Analysis:

  • Initial High Cost: With θ initialized to zeros, the cost starts at about 0.693, reflecting an untrained model.
  • Steady Decline: As fmin explores the parameter space, the cost drops quickly at first and then more slowly as the parameters approach their optimal values.
  • Convergence (157 Iterations): The algorithm stops once further adjustments no longer reduce the cost meaningfully, settling at about 0.2035.

3. Observations:

  • The steady reduction in cost confirms that the optimizer effectively fits the logistic regression model, so the predicted probabilities align closely with the actual admission labels.

Optimized Cost Function Value After Logistic Regression Training


We run the following code in a Jupyter Notebook cell:


    
        #"Call our costFunction function using the optimal parameters of θ. 
        #We should see that the cost is about 0.203."
        print(computeCost(theta,X,y))
        
        OUTPUT:
        0.20349770159021513
    
    

Explanation of the Output:


After running the optimization, the cost function converged to approximately 0.2035, indicating that the model has effectively minimized the classification error. This validates the effectiveness of logistic regression in finding the optimal decision boundary.


The output 0.20349770159021513 represents the optimized cost value after training the logistic regression model. This cost function measures the difference between the predicted and actual classifications, with lower values indicating a better fit. Since the initial cost was 0.693, this significant reduction suggests that the model has successfully learned from the data and improved its predictions.


Decision Boundary Visualization for Logistic Regression


We run the following code in a Jupyter Notebook cell:


      
        #Plotting the decision boundary: two points, draw a line between
        #Decision boundary occurs when h = 0.5, i.e. when
        #theta0 + theta1*x1 + theta2*x2 = 0
        #y=mx+b is replaced by x2 = (-1/theta2)*(theta0 + theta1*x1)
        
        boundary_xs = np.array([np.min(X[:,1]), np.max(X[:,1])])
        boundary_ys = (-1./theta[2])*(theta[0] + theta[1]*boundary_xs)
        plotData()
        plt.plot(boundary_xs,boundary_ys,'b-',label='Decision Boundary')
        plt.legend()
      
    

The following Decision boundary graph is generated:


Decision Boundary Graph

Graph 3: Decision boundary graph


Explanation of the graph:


The blue line is the decision boundary learned by the logistic regression model: the threshold at which the model switches from predicting one class to the other. It is computed from the optimal parameters θ found during training, and it separates the two classes (admitted vs. not admitted): points on one side of the line are classified as admitted, while points on the other side are classified as not admitted.


Since logistic regression with these features is a linear classifier, the decision boundary appears as a straight line. It is obtained by setting \( \theta^T x = 0 \) and solving for \( x_2 \):

\[ x_2 = -\frac{1}{\theta_2} (\theta_0 + \theta_1 x_1) \]

With the learned parameters, points above this line are predicted as class 1 (admitted) and points below it as class 0 (not admitted).




Predicting Admission Probability Using Logistic Regression


We run the following code in a Jupyter Notebook cell:


    #For a student with an Exam 1 score of 45 and an Exam 2 score of 85, 
    #you should expect to see an admission probability of 0.776.
    print(h(theta,np.array([1, 45.,85.])))
    
    OUTPUT:
    0.7762915904112412


Explanation


In this step, we use the trained logistic regression model to predict the probability of admission for a student based on their exam scores. The logistic regression hypothesis function is given by:

\[ h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \]

where

  • θ represents the optimized model parameters,
  • x is the input feature vector, including a bias term (1 for intercept),
  • hθ(x) outputs the probability of admission.

The given input corresponds to a student with an Exam 1 score of 45 and an Exam 2 score of 85. When we pass these values through our logistic regression model, the predicted probability of admission is 0.776, meaning there is a 77.6% chance that this student will be admitted.



Model Accuracy on Training Data for Predictions

We have run the following code in a Jupyter Notebook cell:

    
        def makePrediction(mytheta, myx):
            return h(mytheta,myx) >= 0.5

        #Compute the percentage of samples I got correct:
        pos_correct = float(np.sum(makePrediction(theta,pos)))
        neg_correct = float(np.sum(np.invert(makePrediction(theta,neg))))
        tot = len(pos)+len(neg)
        prcnt_correct = float(pos_correct+neg_correct)/tot
        print("Fraction of training samples correctly predicted: %f." % prcnt_correct)

      
  


Output:

  Fraction of training samples correctly predicted: 0.890000.

Explanation

To evaluate the performance of our logistic regression model, we calculated the fraction of correctly predicted training samples. The prediction function makePrediction() assigns a class label (1 or 0) based on whether the hypothesis function outputs a probability ≥ 0.5.

The model correctly classified 89% of the training samples, demonstrating a strong ability to distinguish between the two classes. This accuracy metric is essential for assessing the effectiveness of our model before deploying it on unseen data.
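
As a cross-check, the same fraction can be computed in a single vectorized comparison over the whole training set; this sketch assumes theta, X, y, and h from the earlier cells and should reproduce the 0.89 figure.

        #Vectorized accuracy check (equivalent to the per-class counts above)
        predictions = (h(theta, X) >= 0.5).astype(int)   # predicted class for every training sample
        accuracy = np.mean(predictions == y.flatten())   # fraction of predictions matching the labels
        print("Training accuracy: %f" % accuracy)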




Regularized Logistic Regression

Visualizing the Dataset: Microchip Test Results:


We have run the following code:


      
        datafile = r'd:\mlprojects\data\ex2data2.txt'
        #!head $datafile
        cols = np.loadtxt(datafile,delimiter=',',usecols=(0,1,2),unpack=True) #Read in comma separated data
        ##Form the usual "X" matrix and "y" vector
        X = np.transpose(np.array(cols[:-1]))
        y = np.transpose(np.array(cols[-1:]))
        m = y.size # number of training examples
        ##Insert the usual column of 1's into the "X" matrix
        X = np.insert(X,0,1,axis=1)

        #Divide the sample into two: ones with positive classification, one with null classification
        pos = np.array([X[i] for i in range(X.shape[0]) if y[i] == 1])
        neg = np.array([X[i] for i in range(X.shape[0]) if y[i] == 0])
        #Check to make sure I included all entries
        #print("Included everything? ",(len(pos)+len(neg) == X.shape[0]))

        def plotData():
            plt.plot(pos[:,1],pos[:,2],'k+',label='y=1')
            plt.plot(neg[:,1],neg[:,2],'yo',label='y=0')
            plt.xlabel('Microchip Test 1')
            plt.ylabel('Microchip Test 2')
            plt.legend()
            plt.grid(True)

        #Draw it square to emphasize circular features
        plt.figure(figsize=(6,6))
        plotData()

    
  

After running the code, the following scatter plot is generated:

Microchip Test Scatter Plot

Graph 4: Scatter Plot of Microchip Test Results: Visualizing Accepted and Rejected Samples


Overview:


This visualization represents the dataset used to train a logistic regression model for microchip quality classification. Each microchip undergoes two tests, and the goal is to predict whether it should be accepted (y=1) or rejected (y=0).


Key Observations:

  • The accepted microchips (y=1) and rejected microchips (y=0) are not linearly separable (i.e., we cannot use a simple straight-line decision boundary).
  • This suggests that feature engineering (polynomial features) or non-linear decision boundaries will be necessary to improve classification performance.


Cost Function Evaluation for Logistic Regression with Feature Mapping

We have run the following code:


        #This code I took from someone else (the OCTAVE equivalent was provided in the HW)
        def mapFeature( x1col, x2col ):
            """ 
            Function that takes in a column of n- x1's, a column of n- x2s, and builds
            an n- x 28-dim matrix of features as described in the homework assignment
            """
            degrees = 6
            out = np.ones( (x1col.shape[0], 1) )

            for i in range(1, degrees+1):
                for j in range(0, i+1):
                    term1 = x1col ** (i-j)
                    term2 = x2col ** (j)
                    term  = (term1 * term2).reshape( term1.shape[0], 1 )
                    out   = np.hstack(( out, term ))
            return out

        #Create feature-mapped X matrix
        mappedX = mapFeature(X[:,1],X[:,2])

        #Cost function is the same as the one implemented above, as I included the regularization
        #toggled off for the default function call (lambda = 0)
        #I do not need a separate implementation of the derivative term of the cost function
        #because the scipy optimization function I'm using only needs the cost function itself.
        #Let's check that the cost function returns a cost of 0.693 with zeros for initial theta
        #and the feature-mapped x values
        initial_theta = np.zeros((mappedX.shape[1],1))
        computeCost(initial_theta,mappedX,y)



Output:
0.6931471805599454


Overview:

This section demonstrates the calculation of the initial cost function for a logistic regression model with polynomial feature mapping. The function mapFeature expands the dataset into a higher-dimensional space by generating polynomial features up to the 6th degree.


  • The output 0.693 is the expected cost when initializing θ to zeros, meaning the model is yet to learn meaningful decision boundaries.
  • Feature mapping helps capture non-linear decision boundaries, which are essential for separating complex datasets like this one.
  • The next step is to optimize θ using a numerical solver, as sketched below, to find the parameters that minimize this cost and improve classification accuracy.
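
As a possible version of that next step (a sketch only, under the assumption that we use scipy.optimize.minimize rather than the earlier fmin helper, since the simplex search copes poorly with 28 parameters), the regularized cost over the mapped features could be minimized like this; λ = 1.0 is an arbitrary illustrative choice:

        #Sketch: minimize the regularized cost over the 28 mapped features.
        #BFGS approximates the gradient numerically, so only computeCost is needed.
        from scipy import optimize

        mylambda = 1.0   # regularization strength; larger values give a smoother, simpler boundary

        result = optimize.minimize(computeCost, x0=initial_theta.flatten(),
                                   args=(mappedX, y, mylambda), method='BFGS',
                                   options={'maxiter': 500})
        theta_reg = result.x   # optimized parameters for the regularized model
        print(result.fun)      # regularized training cost at the optimum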

Mathematical Explanation:


The cost function for logistic regression with regularization is given by:

$$ J(\theta) = \frac{1}{m} \sum \left[-y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x)) \right] + \frac{\lambda}{2m} \sum \theta_j^2 $$

where:

  • \( h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \) is the sigmoid (logistic) function, which outputs the probability that a microchip is accepted (y = 1),
  • θ represents the model parameters being optimized,
  • x is the feature-mapped input vector, including a bias term (1 for the intercept),
  • m is the number of training samples,
  • λ is the regularization parameter,
  • The first term measures how well the model fits the data,
  • The second term (the regularization term) penalizes large parameter values and prevents overfitting.

Acknowledgments

I sincerely thank Prof. Andrew Ng (DeepLearning.AI, Stanford University) for his inspiring courses that laid the foundation for this project.




