Part 1: Introduction to Logistic Regression
Logistic Regression is a fundamental statistical method for binary classification problems, where the goal is to predict the probability that an event occurs (coded as 1) or does not occur (coded as 0) based on one or more predictor variables. It's widely used across various fields, including healthcare, economics, social sciences, and machine learning, due to its simplicity, interpretability, and effectiveness in binary outcome prediction.
Logistic regression is an essential tool in classification problems where the dependent variable is binary. Unlike linear regression, which predicts continuous outcomes, logistic regression predicts the probability of a binary outcome. This makes it ideal for cases where we want to classify observations into one of two groups, such as:
- Diagnosing a disease (e.g., cancer detection: 1 for disease, 0 for no disease)
- Predicting customer behavior (e.g., 1 for purchase, 0 for no purchase)
- Predicting whether a loan will be approved (1 for approval, 0 for denial)
Overview of Logistic Regression
Unlike linear regression, which is used to predict continuous outcomes, logistic regression predicts the probability of a binary outcome. The model is based on the logistic function (also known as the sigmoid function), which maps any input value to a value between 0 and 1. This mapping ensures that the predicted probabilities always lie between 0 and 1, making the model suitable for binary classification tasks.
Mathematically, logistic regression is represented as:

P(Y=1) = 1 / (1 + e^(-(β0 + β1X1 + β2X2 + ... + βkXk)))

Where:
- P(Y=1) is the probability of the event occurring (i.e., the target variable equals 1).
- X1, X2, ..., Xk are the predictor variables.
- β0, β1, ..., βk are the coefficients estimated by the model.

This equation ensures that the predicted probability, P(Y=1), always lies between 0 and 1, making it suitable for binary classification. The logistic function maps any real-valued input to a probability, thus transforming the linear combination of predictor variables into a probability estimate.
The core idea of logistic regression is to model the log-odds of the outcome as a linear function of the predictors. The log-odds are the natural logarithm of the ratio of the probability that the event occurs to the probability that it does not occur. By transforming the linear relationship between the predictors and the log-odds into a probability, the logistic regression model can make predictions for binary outcomes (e.g., success/failure, disease/no disease, etc.).
Key Terms and Concepts
Before diving deeper, let's explore some key concepts in logistic regression:
- Odds: The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur.
- Log-Odds (Logit): The log of the odds is the log-odds of the event occurring. This is a crucial component of the logistic regression model, as it relates the linear combination of the predictors to the odds of the outcome.
- Sigmoid Function: The logistic (or sigmoid) function maps the log-odds to a probability, ensuring that the predicted values are between 0 and 1. This function is key to turning the linear equation into a probability prediction for the binary outcome.
In summary, logistic regression works by modeling the log-odds of the binary outcome as a linear function of the predictor variables. The model is built around the sigmoid function, which ensures that the predicted values are probabilities that range between 0 and 1. This makes it an ideal tool for binary classification problems.
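To make these definitions concrete, here is a tiny R sketch (the probability value is chosen arbitrarily, not taken from any fitted model) that converts a probability into odds and log-odds and then maps the log-odds back to a probability with the sigmoid:

# Illustrative probability (chosen arbitrarily)
p <- 0.8
odds <- p / (1 - p)          # odds = 4
log_odds <- log(odds)        # logit, about 1.386
1 / (1 + exp(-log_odds))     # the sigmoid recovers the original probability, 0.8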
In the next part, we will dive deeper into the logistic regression model itself, examine the assumptions that underpin the method, and explore how it can be implemented effectively in practice.
Part 2: Mathematical Foundation of Logistic Regression
In this part, we will explore the underlying mathematics of logistic regression in greater detail. Understanding the mathematical structure of the model is essential for interpreting how logistic regression works and how it produces probability predictions. We'll discuss the log-odds (logit), the logistic function, and the process of estimating the model parameters using Maximum Likelihood Estimation (MLE).
Log-Odds (Logit)
The core idea of logistic regression is to model the log-odds of the outcome as a linear function of the predictor variables. To begin, we need to understand what "log-odds" means. The odds are the ratio of the probability that an event occurs to the probability that it does not occur:

Odds = P(Y=1) / (1 - P(Y=1))

In logistic regression, we take the natural logarithm (ln) of the odds, which is called the log-odds (or logit). The logit is given by:

logit(P) = ln( P(Y=1) / (1 - P(Y=1)) )

Thus, the log-odds of the outcome occurring is modeled as a linear function of the predictor variables. Specifically, we model the log-odds as:

ln( P(Y=1) / (1 - P(Y=1)) ) = β0 + β1X1 + β2X2 + ... + βkXk

Where:
- ln( P(Y=1) / (1 - P(Y=1)) ) is the log-odds of the event occurring.
- β0 is the intercept term (also called the bias).
- β1, ..., βk are the coefficients that determine the effect of the predictor variables on the log-odds.
- X1, ..., Xk are the predictor variables.
The Logistic Function (Sigmoid Function)
Once we have modeled the log-odds as a linear function of the predictor variables, we need to convert this value back into a probability (between 0 and 1). This is done using the logistic function, also known as the sigmoid function. The logistic function is given by:

P(Y=1) = 1 / (1 + e^(-z)), where z = β0 + β1X1 + β2X2 + ... + βkXk

Where:
- P(Y=1) is the probability that the event occurs (i.e., the target variable equals 1).
- e is the base of the natural logarithm (approximately 2.718).
- z is the linear combination of predictor variables, which we transform into a probability.
Notice that this function maps any input to a value between 0 and 1, ensuring that the predicted probability lies within the appropriate range for binary classification.
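As a quick illustration (not part of the derivation above), the following R sketch defines the sigmoid and shows that it squashes any real-valued input into the (0, 1) interval:

# The logistic (sigmoid) function
sigmoid <- function(z) 1 / (1 + exp(-z))
# Any real-valued input is mapped to a probability between 0 and 1
sigmoid(c(-10, -1, 0, 1, 10))
# Plot the characteristic S-shaped curve
curve(sigmoid, from = -6, to = 6, xlab = "z (log-odds)", ylab = "P(Y = 1)")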
Maximum Likelihood Estimation (MLE)
In logistic regression, we estimate the model coefficients (β0, β1, ..., βk) using a method called Maximum Likelihood Estimation (MLE). MLE seeks to find the values of the coefficients that maximize the likelihood of observing the data, given the model. This means that MLE chooses the coefficients that make the observed outcomes (0 or 1) most likely according to the logistic regression model.

The likelihood function for logistic regression is based on the Bernoulli distribution, which models binary outcomes. Given a set of observations Y_i and their corresponding predictor values X_i, the likelihood function is the product of individual probabilities for each observation:

L(β) = ∏_i P(Y_i = 1 | X_i)^(Y_i) × (1 - P(Y_i = 1 | X_i))^(1 - Y_i)
The log-likelihood is the natural logarithm of the likelihood function, and maximizing the log-likelihood is equivalent to maximizing the likelihood function itself. The coefficients that maximize the log-likelihood are the estimated parameters of the logistic regression model.
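To show what MLE is doing under the hood, here is a minimal sketch that maximizes the log-likelihood numerically with optim() for a one-predictor model on the mtcars data used later in this tutorial; glm() performs an equivalent (but more refined) optimization internally, and the variable names here are chosen purely for illustration:

# Negative log-likelihood of a logistic regression model
neg_log_lik <- function(beta, X, y) {
  eta <- X %*% beta                         # linear predictor (log-odds)
  p <- 1 / (1 + exp(-eta))                  # sigmoid turns log-odds into probabilities
  -sum(y * log(p) + (1 - y) * log(1 - p))   # Bernoulli log-likelihood, negated
}

X <- cbind(1, mtcars$wt)                    # design matrix: intercept column plus wt
y <- mtcars$vs
fit <- optim(c(0, 0), neg_log_lik, X = X, y = y, method = "BFGS")
fit$par   # close to coef(glm(vs ~ wt, data = mtcars, family = "binomial"))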
Conclusion of Part 2
In this part, we have covered the key mathematical foundations of logistic regression, including the log-odds (logit) and how the logistic function transforms the log-odds into a probability. We also introduced Maximum Likelihood Estimation (MLE), the method used to estimate the parameters of the model. These concepts are the cornerstone of understanding how logistic regression works at a deeper level.
In the next section, we will explore how to implement logistic regression in practice, including coding examples and model evaluation techniques.
Part 3: Estimation and Interpretation of Logistic Regression Coefficients
In this part, we will discuss how the coefficients in a logistic regression model are estimated and interpreted. Understanding how to interpret the coefficients is essential for explaining the relationship between the predictor variables and the outcome, as well as for making predictions using the model.
Model Estimation: Coefficients of the Logistic Regression Model
The coefficients in the logistic regression model, denoted by β0, β1, ..., βk, are estimated using Maximum Likelihood Estimation (MLE), as discussed in the previous part. These coefficients determine the strength and direction of the relationship between each predictor variable and the log-odds of the outcome. The model itself can be written as:

ln( P(Y=1) / (1 - P(Y=1)) ) = β0 + β1X1 + β2X2 + ... + βkXk

Where:
- P(Y=1) is the probability that the event occurs (i.e., the target variable equals 1).
- β0 is the intercept (or bias) term, which represents the log-odds of the event occurring when all predictor variables are equal to zero.
- β1, ..., βk are the coefficients for the predictor variables, and they represent the change in the log-odds of the outcome for a one-unit increase in the respective predictor variable, holding all other variables constant.
- X1, ..., Xk are the predictor variables (covariates).

The goal of logistic regression is to estimate the values of the coefficients β0, β1, ..., βk that maximize the likelihood of the observed data.
Interpretation of Coefficients
Once the coefficients are estimated, we need to interpret their meaning. The interpretation of the coefficients in logistic regression is slightly different from that in linear regression due to the log-odds transformation. Here's how we interpret them:
- Intercept (β0): The intercept term β0 represents the log-odds of the event occurring when all predictor variables are equal to zero. It is the baseline log-odds, which corresponds to the probability of the outcome when no predictors are considered.
- Slope Coefficients (βj): The slope coefficient βj (for any predictor Xj) represents the change in the log-odds of the outcome for a one-unit increase in Xj, while holding all other predictors constant. In other words, for each one-unit increase in Xj, the log-odds of the event occurring change by βj.
To make the interpretation more intuitive, we can transform the coefficients from the log-odds scale to the odds scale by exponentiating the coefficients. The exponentiated coefficients are called odds ratios and provide a more interpretable measure of the effect of each predictor on the outcome.
Odds Ratios
The odds ratio for each predictor variable is obtained by exponentiating the corresponding coefficient:

OR_j = e^(βj)

Where OR_j is the odds ratio for predictor Xj, and βj is the estimated coefficient for Xj. The odds ratio tells us how the odds of the event occurring change with a one-unit increase in Xj:

- If OR_j > 1, the predictor Xj increases the odds of the event occurring (i.e., the predictor is positively associated with the outcome).
- If OR_j < 1, the predictor Xj decreases the odds of the event occurring (i.e., the predictor is negatively associated with the outcome).
- If OR_j = 1, there is no effect of Xj on the odds of the event occurring (i.e., the predictor has no effect on the outcome).
Example: Interpretation of Coefficients
Let's say we fit a logistic regression model to predict whether a patient has a particular disease based on their age (X1) and gender (X2). After fitting the model, we obtain the following coefficients:

- Intercept: β0 = -3.5
- Age: β1 = 0.05
- Gender (Male = 1, Female = 0): β2 = 1.2

The interpretation of the coefficients would be as follows:

- The intercept term β0 = -3.5 represents the log-odds of the disease occurring when both age and gender are zero (which is not a meaningful scenario but serves as a baseline).
- The coefficient β1 = 0.05 represents the change in the log-odds of having the disease for each one-year increase in age. For each additional year of age, the log-odds of having the disease increase by 0.05.
- The coefficient β2 = 1.2 represents the change in the log-odds of having the disease for males (with gender coded as 1 for male and 0 for female). For males, the log-odds of having the disease increase by 1.2 compared to females.
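A short sketch of the odds-ratio conversion in R, using the purely hypothetical coefficients from the example above:

# Hypothetical coefficients from the example (not from a fitted model)
coefs <- c(intercept = -3.5, age = 0.05, gender_male = 1.2)
# Exponentiating log-odds coefficients gives odds ratios
exp(coefs)
# age: about 1.05, so each extra year multiplies the odds of disease by roughly 5%
# gender_male: about 3.32, so males have roughly 3.3 times the odds, holding age constant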
Conclusion of Part 3
In this part, we learned how the coefficients in a logistic regression model are estimated and interpreted. The coefficients represent the change in the log-odds of the outcome for a one-unit change in the respective predictor variable. We also explored how to interpret the coefficients in terms of odds ratios, which provide a more intuitive understanding of the effect of each predictor on the outcome. In the next section, we will cover how to evaluate the performance of a logistic regression model using different metrics.
Part 4: Model Evaluation and Diagnostics
In this part, we will discuss how to evaluate the performance of a logistic regression model. The evaluation process is essential for understanding how well the model generalizes to new, unseen data and identifying areas for improvement. We'll also cover various diagnostic tools to assess the fit of the model and detect potential issues like overfitting.
1. Evaluation Metrics for Logistic Regression
When evaluating the performance of a logistic regression model, we rely on several key metrics. These metrics allow us to understand the trade-off between correctly predicting positive and negative outcomes. The most common evaluation metrics are:
- Accuracy: Accuracy is the percentage of correct predictions made by the model. It is calculated as:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)

  Where:
  - TP = True Positives: correctly predicted positive outcomes
  - TN = True Negatives: correctly predicted negative outcomes
  - FP = False Positives: incorrectly predicted positive outcomes
  - FN = False Negatives: incorrectly predicted negative outcomes

- Precision: Precision (also called positive predictive value) measures the proportion of true positive predictions among all positive predictions made by the model. It is calculated as:

  Precision = TP / (TP + FP)

- Recall: Recall (also called sensitivity or true positive rate) measures the proportion of actual positive cases that were correctly identified by the model. It is calculated as:

  Recall = TP / (TP + FN)

- F1 Score: The F1 score is the harmonic mean of precision and recall. It is particularly useful when the class distribution is imbalanced. The F1 score is calculated as:

  F1 = 2 × (Precision × Recall) / (Precision + Recall)
- ROC Curve and AUC: The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model's performance across different thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity). The AUC (Area Under the Curve) is a summary metric that quantifies the overall ability of the model to discriminate between the positive and negative classes. AUC values range from 0 to 1, where 1 indicates perfect discrimination and 0.5 indicates random guessing.
2. Confusion Matrix
To calculate many of the evaluation metrics listed above (such as accuracy, precision, recall, and F1 score), we use a confusion matrix. The confusion matrix is a table that summarizes the performance of the model by showing the counts of true positives, true negatives, false positives, and false negatives. The confusion matrix looks like this:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)
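As a quick illustration, the sketch below computes the four metrics from hypothetical confusion-matrix counts (the numbers are made up for demonstration, not taken from a fitted model):

# Hypothetical confusion-matrix counts
TP <- 40; TN <- 45; FP <- 10; FN <- 5
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)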
3. Model Fit and Overfitting
In addition to the evaluation metrics, it is important to assess the fit of the logistic regression model and identify potential overfitting. Overfitting occurs when the model is too complex and captures noise in the training data, leading to poor generalization to new data.
- Train-Test Split: A simple technique to assess overfitting is to split the data into a training set and a test set. The model is trained on the training set, and its performance is evaluated on the test set. A large discrepancy between the performance on the training and test sets can indicate overfitting.
- Cross-Validation: Cross-validation is a more robust method for evaluating model performance. It involves splitting the data into several subsets (folds) and training and testing the model on different folds to get a more reliable estimate of model performance (see the sketch after this list).
- Regularization: Regularization techniques, such as L1 (Lasso) and L2 (Ridge), can be used to prevent overfitting by penalizing large coefficients. Regularization helps in simplifying the model and improving generalization.
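A minimal sketch of 5-fold cross-validation for the vs ~ wt + disp model fitted later in this tutorial; the seed, the number of folds, and the 0.5 threshold are illustrative choices, not requirements:

# Manual 5-fold cross-validation on mtcars (illustrative)
set.seed(42)
folds <- sample(rep(1:5, length.out = nrow(mtcars)))    # random fold labels
cv_acc <- sapply(1:5, function(k) {
  train <- mtcars[folds != k, ]
  test  <- mtcars[folds == k, ]
  fit   <- glm(vs ~ wt + disp, data = train, family = "binomial")
  pred  <- ifelse(predict(fit, test, type = "response") > 0.5, 1, 0)
  mean(pred == test$vs)                                 # accuracy on the held-out fold
})
mean(cv_acc)                                            # cross-validated accuracy estimate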
4. Model Diagnostics
Finally, model diagnostics allow us to assess whether the logistic regression assumptions are met and whether the model is appropriate for the data.
- Linearity of Log-Odds: Logistic regression assumes a linear relationship between the predictor variables and the log-odds of the outcome. We can check this assumption using plots of the log-odds against the predictor variables or by using statistical tests.
- No Multicollinearity: Multicollinearity occurs when predictor variables are highly correlated with each other. This can lead to unstable coefficient estimates. We can check for multicollinearity using variance inflation factors (VIFs) or correlation matrices (see the sketch after this list).
- Independence of Errors: Logistic regression assumes that the errors are independent. We can check for this assumption by looking at residual plots and testing for autocorrelation.
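A minimal sketch of the multicollinearity check on the mtcars data used later in this tutorial, assuming the car package (which provides vif()) is available:

# Checking multicollinearity among the predictors (illustrative)
install.packages("car")
library(car)
m <- glm(vs ~ wt + disp, data = mtcars, family = "binomial")
vif(m)                           # large VIFs (roughly above 5-10) suggest problematic collinearity
cor(mtcars[, c("wt", "disp")])   # pairwise correlation of the predictors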
Conclusion of Part 4
In this part, we covered the evaluation metrics and diagnostic tools used to assess the performance of a logistic regression model. We discussed key metrics such as accuracy, precision, recall, and the ROC curve. We also looked at techniques for detecting overfitting and diagnosing potential issues with the model. In the next section, we will dive into how to implement logistic regression in practice using R code examples.
Part 5: Implementing Logistic Regression in R
In this section, we will implement logistic regression using R on the mtcars dataset, which ships with base R (in the datasets package). This dataset contains information on fuel consumption, performance, and various aspects of automobile design for 32 cars.
1. Installing Required Packages
We need the following packages for logistic regression:
- dplyr: For data manipulation.
- caTools: For splitting the dataset into training and testing sets.
- ROCR: For plotting the ROC curve and calculating the AUC (Area Under the Curve).
To install the required packages and load them into the environment, use the following code:
# Installing the necessary packages
install.packages("dplyr")
install.packages("caTools")
install.packages("ROCR")
# Loading the libraries
library(dplyr)
library(caTools)
library(ROCR)
2. Exploring the Dataset
The mtcars dataset is included with base R. To view a summary of the dataset, use the following code:
# Summary of mtcars dataset
summary(mtcars)
This will provide you with summary statistics for each variable, such as the minimum, maximum, mean, and median values.
3. Splitting the Dataset
Next, we split the dataset into a training set and a test set using the sample.split()
function from the caTools
package. The training set will be used to build the model, while the test set will evaluate its performance.
# Splitting the data into training and testing sets
split <- sample.split(mtcars$vs, SplitRatio = 0.8)
train_reg <- subset(mtcars, split == TRUE)
test_reg <- subset(mtcars, split == FALSE)
4. Fitting the Logistic Regression Model
We now fit a logistic regression model to predict the vs
(engine type) variable based on wt
(weight of the car) and disp
(displacement) using the glm()
function. The family = "binomial"
argument specifies that we are performing logistic regression.
# Fitting the logistic regression model
logistic_model <- glm(vs ~ wt + disp, data = train_reg, family = "binomial")
# Display model summary
summary(logistic_model)
5. Interpreting the Model Output
The model summary will provide information about the estimated coefficients, their standard errors, z-values, and p-values. Here's a breakdown of the output:
Call:
glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6552 -0.4051 0.4446 0.6180 1.9191
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.58781 2.60087 0.610 0.5415
wt 1.36958 1.60524 0.853 0.3936
disp -0.02969 0.01577 -1.882 0.0598 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null deviance: 34.617 on 24 degrees of freedom
Residual deviance: 20.212 on 22 degrees of freedom
AIC: 26.212
The key components of the output include:
- Coefficients: These represent the estimated change in the log-odds of the outcome for each predictor variable. For example, the coefficient for wt is 1.36958, which suggests that as the weight of the car increases, the log-odds of the engine being straight (vs = 1) increase, holding disp constant.
- Significance of Coefficients: The p-value for wt is 0.3936, which is greater than 0.05, indicating that wt is not statistically significant in predicting vs. The p-value for disp is 0.0598, which is close to 0.05, suggesting borderline significance.
- Deviance Residuals: These show the discrepancy between the actual and predicted values. Smaller residuals indicate a better fit.
- Null Deviance and Residual Deviance: The null deviance (34.617) represents the model’s deviation when only the intercept is considered. The residual deviance (20.212) represents the model’s deviation after including the predictors. A decrease in deviance suggests that the model with predictors is a better fit than the null model.
- AIC: The Akaike Information Criterion (AIC) is used for model comparison. Lower AIC values indicate better models. The AIC for this model is 26.212.
6. Predicting Test Data
We can now use the fitted logistic regression model to predict the vs
values on the test data:
# Predicting test data based on the logistic regression model
predict_reg <- predict(logistic_model, test_reg, type = "response")
# Converting probabilities to binary outcomes (1 or 0)
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
The predict() function provides the predicted probabilities. By converting these probabilities to binary outcomes (0 or 1), we can predict whether a car has a straight engine (vs = 1) or a V-shaped engine (vs = 0).
7. Evaluating Model Accuracy
To evaluate the model's performance, we can use a confusion matrix, which compares the actual and predicted values. We can also calculate the accuracy of the model:
# Confusion matrix
table(test_reg$vs, predict_reg)
# Accuracy calculation
missing_classerr <- mean(predict_reg != test_reg$vs)
print(paste('Accuracy =', 1 - missing_classerr))
The confusion matrix will help assess how well the model performs. If the accuracy is high, it indicates that the model makes correct predictions most of the time.
8. ROC Curve and AUC
To further evaluate the model's performance, we can plot the ROC curve and calculate the AUC (Area Under the Curve). The AUC gives us a measure of the model's ability to distinguish between the two classes:
# ROC curve and AUC (computed from the predicted probabilities, not the 0/1 labels)
test_prob <- predict(logistic_model, test_reg, type = "response")
ROCPred <- prediction(test_prob, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr", x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
# Plot ROC curve
plot(ROCPer, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1), main = "ROC Curve")
abline(a = 0, b = 1)
# Display AUC
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various threshold levels. The AUC measures the area under the ROC curve, with values closer to 1 indicating a better model. In this example, the AUC value will be displayed in the plot legend.
Conclusion of Part 5
In this section, we have implemented a logistic regression model in R using the mtcars dataset. We covered the steps of splitting the data, fitting the model, evaluating its performance, and interpreting the results. We also plotted the ROC curve and computed the AUC to assess the model's ability to distinguish between classes.
Part 6: Advanced Topics and Extensions
6.1. Regularization (Lasso, Ridge) in Logistic Regression
Regularization is used to prevent overfitting by adding a penalty to the regression model. In logistic regression, we can use Lasso (L1) and Ridge (L2) regularization to achieve this. Regularization modifies the cost function by adding a penalty term on the coefficients:

Cost(β) = -Σ_i [ y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i)) ] + λ × Penalty(β)

where Penalty(β) = Σ_j |βj| for Lasso (L1) and Penalty(β) = Σ_j βj² for Ridge (L2).

In this equation:
- p(x_i) is the predicted probability of the positive class for input x_i.
- βj are the model's coefficients.
- λ is the regularization parameter that controls the strength of the penalty.
The code below demonstrates Lasso (L1) and Ridge (L2) regularization using the glmnet
package:
# Installing the package for regularization
install.packages("glmnet")
library(glmnet)
# Preparing the dataset (since we're working with mtcars, let's focus on numeric variables)
mtcars_reg <- mtcars[, c("wt", "disp", "hp", "qsec", "vs")]
# Converting the target variable to factor (necessary for classification)
mtcars_reg$vs <- as.factor(mtcars_reg$vs)
# Lasso regularization (L1)
lasso_model <- glmnet(as.matrix(mtcars_reg[, -ncol(mtcars_reg)]), mtcars_reg$vs, alpha = 1, family = "binomial")
plot(lasso_model)
# Ridge regularization (L2)
ridge_model <- glmnet(as.matrix(mtcars_reg[, -ncol(mtcars_reg)]), mtcars_reg$vs, alpha = 0, family = "binomial")
plot(ridge_model)
Ridge Model Example

The plot() function will show you the regularization path, which plots the coefficients as a function of the regularization parameter λ. In Lasso (L1), the coefficients shrink to zero as λ increases, performing feature selection. In Ridge (L2), the coefficients shrink but do not reach zero, helping to reduce overfitting.
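In practice, λ is usually chosen by cross-validation. Below is a minimal sketch using glmnet's cv.glmnet(), continuing from the mtcars_reg objects created above; the seed and number of folds are illustrative choices:

# Choosing the regularization strength lambda by cross-validation
set.seed(123)
x <- as.matrix(mtcars_reg[, -ncol(mtcars_reg)])
y <- mtcars_reg$vs
cv_lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", nfolds = 5)
cv_lasso$lambda.min                # lambda with the lowest cross-validated deviance
coef(cv_lasso, s = "lambda.min")   # coefficients at that lambda; some may be exactly zero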
6.2. Multi-class Logistic Regression (Softmax Function)
When you have more than two classes, you can use the softmax function to perform multi-class (multinomial) logistic regression. The softmax function calculates the probability of each class as follows:

P(y = j | x) = e^(βj · x) / Σ_{c=1}^{K} e^(βc · x)

In this equation:
- P(y = j | x) is the probability of class j given the input x.
- βj represents the model's coefficients for class j.
- K is the number of classes.
The code below demonstrates how to implement multi-class logistic regression using the nnet
package:
# Converting the 'cyl' variable to factor to simulate multi-class classification
mtcars$cyl <- as.factor(mtcars$cyl)
# Training a multinomial logistic regression model
install.packages("nnet")
library(nnet)
# Fit a multi-class logistic regression model
multi_logistic_model <- multinom(cyl ~ wt + disp + hp + qsec, data = mtcars)
# Summary of the model
summary(multi_logistic_model)
# Making predictions
multi_pred <- predict(multi_logistic_model, mtcars)
table(mtcars$cyl, multi_pred)
The summary() function will provide the coefficients for each class relative to a reference class. The predictions are compared to the actual values using a confusion matrix. Note that the very large standard errors and near-zero residual deviance in the output below indicate that these predictors separate the three cyl classes almost perfectly in this small dataset, so the individual coefficients should be interpreted with caution.
# Summary of the model
summary(multi_logistic_model)
Call:
multinom(formula = cyl ~ wt + disp + hp + qsec, data = mtcars)
Coefficients:
(Intercept) wt disp hp qsec
6 16.324641 18.77777 0.7936079 0.3893764 -12.59293
8 -5.332788 31.14284 0.9402801 0.4786161 -16.50727
Std. Errors:
(Intercept) wt disp hp qsec
6 176.5115 543.7406 359.5155 267.7063 57.06707
8 176.4970 543.7468 359.5071 267.7005 57.10338
Residual Deviance: 0.0001646752
AIC: 20.00016
# Making predictions
multi_pred <- predict(multi_logistic_model, mtcars)
table(mtcars$cyl, multi_pred)
multi_pred
4 6 8
4 11 0 0
6 0 7 0
8 0 0 14
6.3. Handling Imbalanced Datasets (Under-sampling, Over-sampling)
When dealing with imbalanced datasets, we can use techniques such as under-sampling the majority class or over-sampling the minority class. Here we'll demonstrate under-sampling using the ROSE package:
The model itself is the same logistic regression function as before, P(Y=1) = 1 / (1 + e^(-z)), where z = β0 + β1X1 + ... + βkXk is the linear predictor; the focus here is on correcting the class imbalance before fitting.
The code for handling imbalanced datasets using under-sampling is shown below:
# Installing ROSE package for handling imbalanced data
install.packages("ROSE")
library(ROSE)
# Under-sampling the majority class
data_balanced <- ovun.sample(vs ~ wt + disp, data = mtcars, method = "under", N = 32)$data
# Fit a logistic regression model on the balanced data
balanced_logistic_model <- glm(vs ~ wt + disp, data = data_balanced, family = "binomial")
The output from the summary() function call is listed below:
summary(balanced_logistic_model)
Call:
glm(formula = vs ~ wt + disp, family = "binomial", data = data_balanced)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.67506 -0.28444 -0.08401 0.57281 2.08234
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.60859 2.43903 0.660 0.510
wt 1.62635 1.49068 1.091 0.275
disp -0.03443 0.01536 -2.241 0.025 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.86 on 31 degrees of freedom
Residual deviance: 21.40 on 29 degrees of freedom
AIC: 27.4
Number of Fisher Scoring iterations: 6
Model Summary
The summary of the logistic regression model balanced_logistic_model
provides valuable information about the model's fit and the significance of the predictors. Below is a detailed breakdown:
Model Call: The model was fit using the formula vs ~ wt + disp
, meaning the target variable vs
(engine type) was modeled using the predictors wt
(weight) and disp
(displacement). The family used is binomial
, suitable for binary outcomes like vs
.
Deviance Residuals: These residuals measure the difference between observed and predicted values. The distribution of the residuals is as follows:
- Min: -1.67506
- 1st Quartile (1Q): -0.28444
- Median: -0.08401
- 3rd Quartile (3Q): 0.57281
- Max: 2.08234
This indicates that most residuals are close to zero, with a few larger residuals, suggesting that the model fits the data reasonably well.
Coefficients: The coefficients represent the estimated effects of the predictors wt and disp on the log-odds of the outcome vs being 1 (in mtcars, vs = 1 denotes a straight engine and vs = 0 a V-shaped engine). Here are the coefficient values:
- (Intercept): 1.60859
- wt: 1.62635
- disp: -0.03443
Significance: The disp
predictor has a p-value of 0.025, which is less than the 0.05 threshold, indicating that it is statistically significant. In contrast, wt
and the intercept are not significant, with p-values of 0.275 and 0.510, respectively.
Deviance and AIC:
- Null deviance: 43.86 — The deviance of the model with no predictors.
- Residual deviance: 21.40 — The deviance of the model with the predictors.
- AIC: 27.4 — The Akaike Information Criterion penalizes the complexity of the model. Lower AIC values indicate a better fit, but the AIC must be compared across models.
Conclusion: The model suggests that the disp
variable is a significant predictor of the outcome, while wt
is not. The residual deviance and AIC indicate a reasonable fit, but further refinement (such as adding or removing predictors) might be necessary for improved model performance.
Recall that the ovun.sample() function performs the under-sampling that balances the classes in the dataset. After the balanced data is created, we fit the logistic regression model and calculate accuracy by comparing the predicted and actual class labels.
Model Prediction and Evaluation
This section of the code is used to predict the outcomes using the fitted logistic regression model (balanced_logistic_model
) and evaluate its performance. The predict()
function generates predicted probabilities for each data point. These probabilities are then converted to binary values (0 or 1) using the ifelse()
function, where a threshold of 0.5 is used to classify the predictions.
Accuracy Calculation
The confusion matrix is generated using the table()
function, which compares the true values of the target variable (vs
) against the predicted values. The accuracy of the model is then calculated as the proportion of correct predictions by dividing the sum of the diagonal elements of the confusion matrix (true positives and true negatives) by the total number of observations.
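A minimal sketch of this evaluation step, assuming the balanced_logistic_model and data_balanced objects created above; the exact counts (and the 0.88 accuracy reported below) depend on the data used:

# Predicted probabilities on the balanced data, converted to 0/1 labels
balanced_pred <- predict(balanced_logistic_model, data_balanced, type = "response")
balanced_pred_class <- ifelse(balanced_pred > 0.5, 1, 0)
# Confusion matrix of actual versus predicted classes
conf_mat <- table(data_balanced$vs, balanced_pred_class)
conf_mat
# Accuracy: correct predictions (the diagonal) divided by all observations
sum(diag(conf_mat)) / sum(conf_mat)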
Results: The model achieved an accuracy of 0.88
, indicating that 88% of the predictions were correct. This is a strong result, suggesting that the model is performing well on the balanced dataset.
6.4. Model Deployment and Making Predictions on New Data
Once the logistic regression model is trained, you can deploy the model and use it to make predictions on new, unseen data. Here’s an example:
# New data (simulating a new car data)
new_data <- data.frame(wt = c(3.0, 4.0), disp = c(160, 300))
# Predicting the class labels for new data
new_pred <- predict(balanced_logistic_model, new_data, type = "response")
new_pred_class <- ifelse(new_pred > 0.5, 1, 0)
new_pred_class
The predict() function is used to make predictions for new data. Here, we simulate new data with specific values for the predictors; the model returns predicted probabilities, which we convert to class labels using a 0.5 threshold.