Logistic Regression Introduction

Introduction

You’ve come to the right place for a Logistic Regression introduction! First and foremost, let’s clear up an ongoing debate that is doing the rounds between machine learning practitioners and statisticians.

Logistic Regression is often referred to as a ‘classification algorithm’ in the machine learning community. The reason for this is that ML practitioners tend to group algorithms by their utility, i.e. classification, clustering, etc. Taking the classification utility, we see distance-based methods (k-NN, etc.), regression methods (Logistic Regression, etc.), neural networks with a softmax output, and so on. All of the above are in an ML practitioner’s toolkit when they embark on solving a classification task, which is why Logistic Regression ends up labelled as a classifier in the ML community.

Clarity on Logistic Regression

To be clear, Logistic Regression is most certainly a regression algorithm, in the same way that k-NN is a distance-based algorithm. In actual fact, every ML classification task is in some way a specific use case of an underlying statistical or mathematical model. Unfortunately, the way the subject has evolved, some (not all) ML students are taught that Logistic Regression is not a regression algorithm. This is flat out wrong! What they should be taught is this: Logistic Regression, as applied to classification, takes the same linear combination of inputs used in Linear Regression, passes it through a logistic function to obtain a probability, and then applies a threshold to achieve a binary classification.

Don’t worry if this doesn’t make sense! Just remember that in machine learning, Logistic Regression is a regression model with a classifier on top. To avoid having to caveat each and every reference, for the remainder of this blog post any mention of Logistic Regression should be read in the context of machine learning classification tasks.

Logistic Regression Technical Overview

To elaborate, Logistic Regression is used in machine learning to perform classification. Class membership is achieved through the use of probability. Specifically, Logistic Regression outputs the probability of a datapoint belonging to a particular class and we as Machine Learning practitioners need to decide what to do with the probability. Since probability is measured on a scale from 0 to 1, a widely accepted approach is to classify a datapoint as true if the probability of the datapoint being true is greater than a threshold of 0.5. The term ‘true’ here is a collective term that can represent a category, a state, a hypothesis, a truth value or whatever the ML practitioner decides. For example, for classifying defective products in manufacturing we could use two categories: Defective / Not Defective, with the corresponding truth values as TRUE / FALSE and binary encodings as 1 / 0.
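
As a small illustration of the thresholding idea above, here is a minimal sketch that maps a probability to one of the two manufacturing categories. The function name `classify`, the `LABELS` dictionary and the sample probabilities are made up for this example; only the 0.5 threshold and the Defective / Not Defective encoding come from the text.

```python
def classify(probability, threshold=0.5):
    """Return 1 (true) if the probability is greater than the threshold, else 0."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return 1 if probability > threshold else 0


# Binary encodings from the manufacturing example: 1 = Defective, 0 = Not Defective
LABELS = {1: "Defective", 0: "Not Defective"}

# Made-up probabilities, purely for illustration
for p in (0.08, 0.50, 0.93):
    print(p, "->", LABELS[classify(p)])
```

Note that 0.5 is a convention rather than a rule. If missing a defective product is costly, a practitioner might choose a lower threshold so that more products are flagged.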

(As a side note, the term ‘Logistic’ in Logistic Regression gets its name from the function that calculates the probabilities, i.e. the Logistic Function).

Logistic Regression & Probability

Linear Regression Recap

As previously mentioned, Logistic Regression uses probability to classify datapoints. Let’s explore that a little further with some visuals. To motivate the idea, recall that Linear Regression is a regression technique that fits a line through a set of datapoints. More formally, the aim of Linear Regression is to find the best estimates of the parameters β₀ and β₁ that minimize the error in

Ŷ = β₀ + β₁X

We can see this depicted in the following graph:

Figure 1 – Linear Regression Recap
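
To make the recap concrete, here is a minimal sketch of how β₀ and β₁ can be estimated. It assumes that "error" means the sum of squared residuals (ordinary least squares), which the post does not state explicitly, and the five datapoints are invented for illustration.

```python
def fit_line(xs, ys):
    """Estimate beta0 (intercept) and beta1 (slope) by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # The fitted line always passes through the point of means
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Invented datapoints, roughly following y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0, beta1 = fit_line(xs, ys)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
```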

So, can we use Linear Regression for classification? Answer: no. Linear Regression provides us with estimates of y on a continuous scale. Since we are interested in binary classification (0/1), our desired output scale should be discrete. The following graph depicts how Linear Regression is problematic for this task.

Figure 2 – Linear Regression: Age vs Needs Care

In the above example, we are attempting to use Linear Regression to predict whether someone needs care based on their age. However, there are issues. What does the y-axis mean? How can a person be 0.25 needing care in a yes/no scenario, and how can 0.25 even be interpreted? People younger than 40 are estimated to have a negative need for care, and people older than 100 are estimated to need more than full care, for want of a better phrase. Even if we were to use a threshold of, say, 0.5, and say that anyone with an estimated y value >= 0.5 needs care and the rest don’t, one might be fooled into thinking: wait, this works! But watch what happens when we include people 18 years old and younger:

Figure 3 – Linear Regression: Age vs Needs Care – Not Suitable

Where do we go from here? Every value for Age results in a y value of less than 0.5. What happened to our perfect 0.5 threshold estimator? It’s almost as if there is a feeling Linear Regression could and should work, but something is missing. Also, we are not interested in a continuous value for y. Instead, we are interested in binary classification. Step forward, Logistic Regression!
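
The out-of-range problem described for Figure 2 can be sketched numerically. The line below is an assumption read off the description (an estimate of 0 at age 40 and 1 at age 100); it is not fitted to the actual data behind the figures, and `linear_estimate` is a name made up for this example.

```python
def linear_estimate(age):
    """Hypothetical fitted line: estimate is 0 at age 40 and 1 at age 100."""
    return (age - 40) / 60


for age in (10, 40, 70, 100, 120):
    y_hat = linear_estimate(age)
    in_range = 0 <= y_hat <= 1
    print(f"age {age:>3}: estimate {y_hat:+.2f}, valid as a probability: {in_range}")
```

Nothing in a straight line prevents estimates below 0 or above 1, which is why they cannot be read as probabilities.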

Segue to Logistic Regression

What if we were to undergo a slight paradigm shift and focus on the probability that y is true, as opposed to a predicted continuous value for y? How would that help? Well, straight away we can make an important assumption: as stated earlier, a threshold probability of 0.5 can be used to make classifications. And how do we go from a continuous value of y to the probability that y is true? What’s the function? The function is called the Logistic Function and is written as follows:

f(z) = 1 / (1 + e⁻ᶻ)

where z = β₀ + β₁X

Note that β₀ + β₁X is the same linear expression that gives ŷ in Linear Regression, hence the ‘Regression’ element of the term Logistic Regression.

It’s worth noting that as z becomes arbitrarily large, the e⁻ᶻ term tends towards 0 and f(z) tends towards 1. As z becomes arbitrarily large and negative, the e⁻ᶻ term becomes arbitrarily large and f(z) tends towards 0. At z = 0, e⁻ᶻ equals 1 and f(z) is exactly 0.5. To help understand this concept, here is a range of values before and after being passed through the logistic function:

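| z    | e^-z    | 1/(1+e^-z) |
|------|---------|------------|
| -5   | 148.413 | 0.007      |
| -1   | 2.718   | 0.269      |
| -0.5 | 1.649   | 0.378      |
| -0.1 | 1.105   | 0.475      |
| 0    | 1       | 0.5        |
| 0.1  | 0.905   | 0.525      |
| 0.5  | 0.607   | 0.622      |
| 1    | 0.368   | 0.731      |
| 5    | 0.007   | 0.993      |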
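
The values in the table can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `logistic` is my own choice.

```python
import math


def logistic(z):
    """Logistic function: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Print z, e^-z and f(z) for the same z values used in the table
print(f"{'z':>5} {'e^-z':>9} {'f(z)':>6}")
for z in (-5, -1, -0.5, -0.1, 0, 0.1, 0.5, 1, 5):
    print(f"{z:>5} {math.exp(-z):>9.3f} {logistic(z):>6.3f}")
```

One caveat with this simple form: for very large negative z, `math.exp(-z)` can overflow, so production code usually relies on a numerically stable implementation from a library.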

Logistic Regression Example

Let’s now motivate Logistic Regression further with a different example. Say we have a scatter of 2D points where each point belongs to one of two categories as follows:

Figure 4 – Scatterplot of dots.csv

We would like to build a classifier to which we can pass the coordinates of a datapoint and obtain the correct category. We can see from the plot that the data is linearly separable, so a simple Logistic Regression model should be suitable. Each point has two features (x1, x2) and there are two categories present in the dataset. Since there are only two categories, this is a binary classification task, and hence we can encode the categories as 0 and 1 respectively.

Next we use the Logistic Function described earlier and substitute z as follows:

z = β₀ + β₁X₁ + β₂X₂

where β₀ refers to the intercept, or in this case, the bias, and β₁ and β₂ are the coefficients (weights) applied to the feature values x₁ and x₂ respectively.
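
To see the two-feature version in action, here is a minimal sketch that computes z and passes it through the Logistic Function. The parameter values are hypothetical, chosen only so the arithmetic is easy to follow; they are not the values learned from dots.csv.

```python
import math


def predict_probability(x1, x2, beta0, beta1, beta2):
    """Probability that the point (x1, x2) belongs to category 1."""
    z = beta0 + beta1 * x1 + beta2 * x2
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters, for illustration only
beta0, beta1, beta2 = -4.0, 3.0, 5.0

for x1, x2 in [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)]:
    p = predict_probability(x1, x2, beta0, beta1, beta2)
    print(f"({x1}, {x2}) -> probability of category 1: {p:.3f}")
```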

On Weight Estimation

You might be wondering where do β₀, β₁ and β₂ come from? After all, the dataset only contains readings for x1 and x2. This is a whole area of research in itself but suffice it to say that one of the most common methods for obtaining these values is called Stochastic Gradient Descent (SGD) and is outside the scope of this walkthrough. This article covers SGD very well. In a nutshell, SGD is an iterative process whereby the parameters of the equation are found by minimizing a loss function.
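
For the curious, here is a rough sketch of what SGD looks like for this model. It assumes the loss being minimized is the log loss (cross-entropy), which the post does not specify, and it uses a tiny made-up dataset rather than dots.csv. The function name `fit_sgd`, the learning rate and the number of epochs are arbitrary choices for illustration. It is also not necessarily what scikit-learn does internally, since `LogisticRegression` uses its own solver.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_sgd(points, labels, learning_rate=0.5, epochs=500, seed=0):
    """Estimate (b0, b1, b2) by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    b0, b1, b2 = 0.0, 0.0, 0.0
    indices = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the datapoints in a random order each epoch
        for i in indices:
            x1, x2 = points[i]
            # For the log loss, the gradient with respect to z is (prediction - label)
            error = logistic(b0 + b1 * x1 + b2 * x2) - labels[i]
            b0 -= learning_rate * error
            b1 -= learning_rate * error * x1
            b2 -= learning_rate * error * x2
    return b0, b1, b2


# Made-up, linearly separable toy data
points = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.7, 0.8), (0.8, 0.6), (0.9, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

b0, b1, b2 = fit_sgd(points, labels)
predictions = [1 if logistic(b0 + b1 * x1 + b2 * x2) > 0.5 else 0 for x1, x2 in points]
print("parameters:", b0, b1, b2)
print("predictions:", predictions)
```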
 
Great, so the output of the Logistic Function provides us with the probability of each datapoint being true, i.e. belonging to category 1. Finally, we apply a threshold function to obtain the classifications of y:

ŷ = 1 if f(z) > 0.5, otherwise ŷ = 0

And that’s how Logistic Regression works! In the example above, wouldn’t it be great if we could see the Logistic Regression decision boundary? Thankfully, this is quite straightforward. The full code for this example, including plotting the decision boundary is provided in the Python code below. The source file, dots.csv is available for download here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Read in data
df_dots = pd.read_csv("dots.csv")

# Quick look (print so the preview also shows when run as a script)
print(df_dots.head())

# Parse attributes and classes
X = np.array(df_dots[['x1','x2']])
y = np.array(df_dots['y'])

# Plot data
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax.scatter(df_dots['x1'], df_dots['x2'],c=df_dots['y'])
ax.set_title('2-Category 2D Scatter Plot', fontsize=14, pad=15)
ax.set_xlabel("x1", fontsize=12)
ax.set_ylabel("x2", fontsize=12)

plt.show()

# Fit logistic regression model with dots data
lr = LogisticRegression(random_state=0).fit(X, y)

# Glance at the coefficients and intercept (bias)
# Notice that there are two coefficients, one for each feature
print(lr.coef_)
print(lr.intercept_)

# Parse weights and bias as plain scalars
# (coef_ has shape (1, 2) for binary classification with two features)
w1, w2 = lr.coef_[0]
b = lr.intercept_[0]

# Calculate the intercept and gradient of the decision boundary
c = -b/w2
m = -w1/w2

# Plot the data and the classification with the decision boundary.
xmin, xmax = 0, 1
ymin, ymax = 0, 1
xd = np.array([xmin, xmax])
yd = m*xd + c

# Plot data
fig_db, ax_db = plt.subplots(1,1, figsize=(8,8))
ax_db.scatter(df_dots['x1'], df_dots['x2'],c=df_dots['y'])
ax_db.plot(xd, yd, 'k', lw=1, ls='--')
ax_db.set_xlim(xmin, xmax)
ax_db.set_ylim(ymin, ymax)
ax_db.set_xlabel("x1", fontsize=12)
ax_db.set_ylabel("x2", fontsize=12)
ax_db.set_title('Logistic Regression with Decision Boundary', fontsize=14, pad=15)

plt.show()

Here is the plot with the Logistic Regression decision boundary:

Figure 5 – Logistic Regression Decision Boundary
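
Where do the gradient and intercept of the decision boundary in the code come from? The boundary is the set of points where the predicted probability is exactly 0.5, which happens when z = 0. Setting b + w1·x1 + w2·x2 = 0 and solving for x2 gives x2 = -(w1/w2)·x1 - b/w2, i.e. a straight line with gradient m = -w1/w2 and intercept c = -b/w2. The sketch below checks this numerically, using hypothetical weights rather than the ones fitted to dots.csv.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and bias; in the full example these come from
# lr.coef_ and lr.intercept_
w1, w2, b = 3.0, 5.0, -4.0

m = -w1 / w2  # gradient of the decision boundary
c = -b / w2   # intercept of the decision boundary

# Every point on the line x2 = m*x1 + c should have probability 0.5
for x1 in (0.0, 0.5, 1.0):
    x2 = m * x1 + c
    z = b + w1 * x1 + w2 * x2
    print(f"x1 = {x1}, x2 = {x2:.3f}, probability = {logistic(z):.3f}")
```

Points on one side of this line get a probability above 0.5 and points on the other side get a probability below 0.5.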

Closing Thoughts

Logistic Regression is a simple yet effective method for classification. It works well when the data is linearly separable, or close to it, and should be considered before more complicated techniques. Logistic Regression units can also be combined in layers to classify more complex data, and this is the basis of Neural Networks / Deep Learning.

If you enjoyed this post, please leave a comment below. Likewise, and for more content and news, why not follow me on Twitter or subscribe to my YouTube channel. For direct contact, feel free to use the contact form.
