Predictive Modelling Using Logistic Regression


RAJAT PANCHOTIA

Published in Analytics Vidhya · 8 min read · Mar 9, 2021


Logistic Regression

Regression allows us to predict an output based on some input parameters. For instance, we can predict someone's height based on their parents' heights and their age. This type of regression is called linear regression because the outcome variable is a continuous real number.

But what if we wanted to predict something that is not a continuous number?

Why Not Linear Regression?

Let us say we want to predict the likelihood that a candidate will pass the Maths Olympiad for Class X. Ordinary linear regression will not work in this scenario because it does not make sense to treat our outcome as a continuous number: it is either pass or fail. In this case, we use logistic regression, because the outcome variable is a binary response variable.

Why will linear regression not work in such scenarios?

• A binary variable does not have a normal distribution, which is a primary condition for linear regression analysis.

• The predicted value of the target variable can fall outside 0 and 1, which violates the definition of a probability.

• Probabilities are often nonlinear in the predictors and can be U-shaped due to the extreme-value effect of the x variables.
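To make the first two points concrete, here is a minimal sketch (synthetic data, with a hypothetical predictor "hours_studied") of fitting ordinary linear regression to a pass/fail outcome; because the fitted line is unbounded, the "probabilities" it predicts can fall below 0 or above 1:

```python
# Minimal sketch: ordinary linear regression on a binary outcome.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 20, 200).reshape(-1, 1)
# Pass/fail generated from an underlying S-shaped probability
p_true = 1 / (1 + np.exp(-(hours_studied.ravel() - 10)))
passed = rng.binomial(1, p_true)

lin = LinearRegression().fit(hours_studied, passed)
# The fitted line keeps going outside the data range, producing
# "probabilities" below 0 and above 1 -- the violation described above.
print(lin.predict(np.array([[-5.0], [10.0], [25.0]])))
```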

Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.

Logistic regression seeks to:

• Model the probability of an event occurring depending on the values of one or more nominal, ordinal, interval, or ratio-level independent variables.

• Estimate the probability that an event occurs for a randomly selected set of observations versus the probability of non-occurrence of the event.

• Predict the effect of a series of variables on a binary response variable.

• Classify observations by estimating the probability that each belongs to a particular category (in the above example, pass or fail).

Log Odds

Before getting into the details of logistic regression, let us define the "odds" of an event. Suppose p is the probability of an event occurring. The event could be anything: a customer not paying a loan, a customer turning up to a retail store during a promotion, a customer complaint being escalated, or anything else. The odds of the event are defined as the ratio of the likelihood of the event occurring to the likelihood of it not occurring. Hence, the odds are given by p/(1-p). Note that p, being a probability, lies between zero and one, and hence the odds can take any non-negative value.

Recall that in linear regression we assume the target variable is a linear function of the predictors. In logistic regression, we assume the log of the odds (i.e., log of p/(1-p)) of the event is a linear function of the predictors. Note that the log of odds can take any real value: it is positive if p is greater than 0.5 and negative otherwise.
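As a quick numeric illustration of these definitions:

```python
# Odds and log-odds for a few probabilities, per the definitions above:
# odds = p / (1 - p), log-odds = ln(p / (1 - p)).
import math

def odds(p):
    return p / (1 - p)

def log_odds(p):
    return math.log(odds(p))

for p in (0.1, 0.5, 0.9):
    print(f"p={p}: odds={odds(p):.3f}, log-odds={log_odds(p):+.3f}")
# p=0.5 gives odds 1 and log-odds 0; p>0.5 gives positive log-odds,
# p<0.5 gives negative log-odds, matching the note above.
```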

Logistic Regression Model

In logistic regression we model the log of the odds, log(p/(1-p)), where p is the probability of the event occurring and 1-p is the probability of non-occurrence. We then estimate p through a linear combination of the independent variables.

The odds ratio for a variable in logistic regression represents how the odds change with a one-unit change in that variable, keeping the others constant.

In other words, logistic regression generates the coefficients (along with their standard errors and significance levels) of a formula that predicts a logit transformation of the probability of the presence of the characteristic of interest:

logit(p) = ln(p / (1 - p)) = β0 + β1x1 + β2x2 + … + e

Where β0 is the Y-intercept, e is the error term in the model, β1 is the coefficient (slope) for independent variable x1, β2 is the coefficient (slope) for independent variable x2, and so on.

The above equation gives the probability of the event as below:

p = 1 / (1 + e^-(b0 + b1x1 + b2x2 + …))

Where b0, b1, b2, etc. are the estimates of β0, β1, β2, etc., respectively, and e here denotes the base of the natural logarithm.
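A minimal sketch of these two equations, with made-up coefficient values standing in for the estimates b0, b1, b2:

```python
# Compute the linear predictor on the log-odds scale, then invert it
# with the sigmoid to recover a probability. Coefficients are illustrative.
import numpy as np

b0, b1, b2 = -1.5, 0.8, 0.3                 # hypothetical estimates
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 1.0, 0.0, 0.0])

log_odds = b0 + b1 * x1 + b2 * x2           # logit(p) = b0 + b1*x1 + b2*x2
p = 1 / (1 + np.exp(-log_odds))             # p = 1 / (1 + e^-(b0 + b1*x1 + ...))
print(p.round(3))                           # always strictly between 0 and 1
```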

Since the dependent variable follows a binomial distribution, we need to choose a link function best suited for this distribution: the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors (as in ordinary regression).

Recall that in linear regression if we plot the predictor variable vs. the target variable we should get a plot closer to a straight line. In the case of logistic regression, if we plot the probability of the event vs. the predictor, we will get an S-shaped curve.
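A short sketch of fitting such a model, assuming scikit-learn; note that its LogisticRegression applies L2 regularization by default, so a large C is used here to approximate plain maximum likelihood:

```python
# Fit a logistic regression and show the S-shaped predicted probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 20, 300).reshape(-1, 1)
y = rng.binomial(1, 1 / (1 + np.exp(-(x.ravel() - 10))))

clf = LogisticRegression(C=1e6).fit(x, y)      # large C ~ plain MLE
grid = np.linspace(0, 20, 5).reshape(-1, 1)
print(clf.predict_proba(grid)[:, 1].round(3))  # rises along an S-shaped curve
```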

Example of Logistic Regression

Let us discuss an application of logistic regression in the telecom industry. An analyst at a telecom company wants to predict the probability of customer churn. The target variable is customer churn, where zero represents no-churn and one represents churn. The independent variables are income, credit limit, age, outstanding amount, current bill, unbilled amount, last month’s billed amount, calls per day, last used, current usage, data usage proportion, etc. The analyst now models for either a non-event, that is, no-churn, or for an event, that is, churn. Once a logistic regression model is built, the output is interpreted as follows:

Check whether the right probability, that is, churn or no-churn, is modeled.

Check whether convergence is satisfied. If convergence is not achieved, re-run the model with another combination of independent variables.

Test the global null hypothesis by analyzing the p-values of the Likelihood Ratio, Wald, and Score tests. All of these should be significant.

Check the Maximum Likelihood Estimates for the intercept and all the variables, since these should be significant. The logistic regression coefficients (estimates) show the change (increase when bi > 0, decrease when bi < 0) in the predicted log odds of having the characteristic of interest for a one-unit change in the corresponding independent variable.

At every stage remove insignificant variables and re-run the model until only significant variables remain. Also examine the odds-ratio estimates, though these are not particularly important in terms of model fit.

Check the concordance (that is, percent concordant, percent discordant, and percent tied). The higher the concordance, the better the model. Ideally, the concordance should be greater than 0.5.

Check the Hosmer-Lemeshow test statistic for the goodness of fit of the logistic model.

Check the KS statistic of the logistic model. The higher the KS, the better the model.

Check the rank ordering of the response variable at the decile level. The event rate should rank-order consistently across the 10 deciles.

The outcome of the procedure is a probability associated with each individual observation, indicating the likelihood that the customer will churn or not (according to which probability is modeled).
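A hedged sketch of this workflow, assuming statsmodels (whose summary reports the maximum likelihood estimates, standard errors, and Wald p-values used in the checks above); the data and column names here are synthetic stand-ins, not the analyst's real variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "income": rng.normal(50, 15, n),
    "calls_per_day": rng.poisson(5, n).astype(float),
    "outstanding_amount": rng.exponential(100, n),
})
# Synthetic churn driven by calls_per_day and outstanding_amount
true_logit = -2 + 0.3 * df["calls_per_day"] + 0.004 * df["outstanding_amount"]
df["churn"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(df[["income", "calls_per_day", "outstanding_amount"]])
model = sm.Logit(df["churn"], X).fit()   # reports convergence status
print(model.summary())                   # MLEs, std errors, Wald p-values
print(np.exp(model.params))              # odds ratios per one-unit change
```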

Model Validation

Once any predictive model is developed, it is important to evaluate its predictive power and effectiveness on the required target sample at regular intervals. This process is known as "predictive model validation". Over time, the predictive strength of any model can drift from that observed on the original development sample.

This can happen due to:

Change in the target population.

Unavailability of one or more of the independent variable(s).

Change in the interpretation of a particular variable (e.g. due to inflation, the price of a consumer good can vary significantly over time, and hence the interpretation of its absolute value would change).

If the model has significantly deteriorated, it is advisable that either the model is re-calibrated (with the same set of variables) or a completely new model is developed.

How is Model Validation Done?

Validation is usually done by evaluating certain key indicator statistics on the validation sample and comparing them with those from the development sample. It is crucial to validate on both in-time and out-of-time samples. An "in-time sample" is a data sample mirroring the model development data, selected from the same vintage (cohort). An "out-of-time sample" likewise mirrors the model development data, but is selected from a vintage (cohort) at a different point in time.
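A small sketch of carving out the two samples, assuming a hypothetical "vintage_month" column marks each account's cohort:

```python
import pandas as pd

def split_validation_samples(df, dev_months, oot_months):
    dev_pool = df[df["vintage_month"].isin(dev_months)]
    # Out-of-time: same structure, different vintages
    out_of_time = df[df["vintage_month"].isin(oot_months)]
    # In-time: random holdout from the development vintages
    in_time = dev_pool.sample(frac=0.3, random_state=0)
    development = dev_pool.drop(in_time.index)
    return development, in_time, out_of_time
```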

Check the K-S statistic:

One of the most relevant of these statistics is the "K-S statistic", which estimates the extent of differentiation between responders and non-responders achieved by a particular model. If the indicator statistics are close enough to those of the development sample, the model is considered stable and valid. What level of difference is acceptable is a subjective criterion and may vary from case to case. Typically, if the K-S statistic for a validation sample is within 10% of the development sample's, it is considered acceptable.
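One way to compute it is as the two-sample KS statistic between the model scores of responders and non-responders; a sketch on synthetic scores:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(scores, outcomes):
    # Max gap between the cumulative score distributions of the two groups
    return ks_2samp(scores[outcomes == 1], scores[outcomes == 0]).statistic

rng = np.random.default_rng(7)
outcomes = rng.binomial(1, 0.3, 5000)
# Synthetic scores: responders score higher on average
scores = rng.normal(0.3 + 0.4 * outcomes, 0.15)
print(round(ks_statistic(scores, outcomes), 3))
# Per the text: a validation-sample KS within ~10% of the
# development-sample KS is typically considered acceptable.
```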

Population Stability Index (PSI):

PSI gives a view of whether the target population for the model has remained "stable" over time with respect to the model score. For example, in an ideal scenario the model score cut-off for the top 10% of the population in the development sample should capture 10% of accounts in the validation samples as well. In most real-life scenarios, however, this is not the case, and the population can become skewed with respect to the model score. Where this skew goes beyond an acceptable threshold, it is advisable to re-calibrate or re-develop the model. PSI is one method of quantifying it.

To calculate PSI, the development sample is divided into equal groups and model score cut-offs are calculated for each. These cut-offs are then applied to the validation sample, and the actual % of accounts in each group is obtained.

PSI is calculated in the following manner (here "n" is the number of bins the score has been divided into):

PSI = Σi=1..n (Actual%i - Expected%i) × ln(Actual%i / Expected%i)

where Expected%i is the share of the development sample and Actual%i the share of the validation sample falling in bin i.
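A minimal sketch of that calculation, assuming a score array from each sample and equal-sized development bins:

```python
import numpy as np

def psi(dev_scores, val_scores, n_bins=10):
    # Cut-offs from equal-sized development groups (deciles for n_bins=10)
    cuts = np.quantile(dev_scores, np.linspace(0, 1, n_bins + 1)[1:-1])
    expected = np.bincount(np.searchsorted(cuts, dev_scores),
                           minlength=n_bins) / len(dev_scores)
    actual = np.bincount(np.searchsorted(cuts, val_scores),
                         minlength=n_bins) / len(val_scores)
    actual = np.maximum(actual, 1e-6)   # guard against empty bins
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(3)
dev = rng.normal(0.50, 0.10, 10000)
val = rng.normal(0.53, 0.10, 10000)     # mildly shifted population
print(round(psi(dev, val), 4))          # small shift -> modest PSI
```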

As a thumb rule, PSI < 0.1 indicates "insignificant change" between the development and validation populations. PSI between 0.1 and 0.25 indicates "moderate change", and the analyst should look at other validation parameters before deciding the next steps. PSI > 0.25 indicates "significant change" in the population, and ideally the model score should be re-calibrated or redeveloped.
