1 Logistic regression

We use logistic regression so that we can apply our simple linear regression methods to categorical data.

1.1 Linear Regression

In a linear regression, the data we are trying to predict is a continuous value that varies with our predictors.
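
As a minimal illustration (the variable names and simulated values here are assumptions, not taken from any real dataset), continuous data of this kind could be generated and plotted with:

library(ggplot2)

# Simulated continuous outcome: y varies smoothly with x, plus some noise
x_lin <- runif(100)
y_lin <- 2 * x_lin + rnorm(100, sd = 0.2)
qplot(x_lin, y_lin)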

1.2 Binomial class

In a logistic regression, the data we are trying to predict is a categorical class, typically coded as 0 or 1.
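
As a rough sketch (again with assumed variable names and simulated values, and with ggplot2 loaded above), a binary outcome plotted against a predictor could be produced like this:

# Simulated binary outcome: the 0/1 class depends on x
x_bin <- runif(100)
y_bin <- rbinom(100, size = 1, prob = x_bin)
qplot(x_bin, y_bin)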

1.3 Probability

Fitting a standard linear regression to this data would result in a lot of error; instead, we need some way of transforming the output of a linear regression so that its values map onto the two binomial outcomes.

We could use the probability of a given outcome, since this gives us a continuous range of values rather than just the two classes.

# Probabilities lie in the range 0 to 1
prob_y <- seq(0, 1, by = .001)
qplot(prob_y, prob_y)

The trouble with probabilities is that they must lie between 0 and 1 to make sense, and it would be tough to constrain a linear regression to this range.

1.4 Odds

To break out of this range, we could use odds: the probability of something happening divided by the probability of it not happening. This gives us a more dispersed set of values.

# Odds: probability of the outcome divided by the probability of not the outcome
odds_y <- prob_y / (1 - prob_y)
qplot(prob_y, odds_y)

Using odds allows us to exceed 1, but negative values are still not possible. Additionally, the distribution of values is difficult to model linearly.

1.5 Logit

To get both positive and negative values, we can take the log of the odds.

qplot(prob_y, log(odds_y))

This now gives us a strong dispersal of values into positive and negative ranges, with a distribution much more suited to a linear regression model.

We can take a nuanced probabilistic approach to the values, or simply say that if the logit is positive we predict outcome 1, and otherwise outcome 0.

1.6 Transforming logits

Getting back to a probability from a logit (or vice versa) is pretty simple, but I wrote some helper functions in the optiRum package to facilitate this.

library(optiRum)

logits     <- -4:4                # a range of logit values
odds       <- logit.odd(logits)   # logit -> odds
probs      <- odd.prob(odds)      # odds -> probability
pred_class <- logits >= 0         # simple classification rule

knitr::kable(data.frame(logits, odds, probs, pred_class))
 logits        odds      probs  pred_class
     -4   0.0183156  0.0179862       FALSE
     -3   0.0497871  0.0474259       FALSE
     -2   0.1353353  0.1192029       FALSE
     -1   0.3678794  0.2689414       FALSE
      0   1.0000000  0.5000000        TRUE
      1   2.7182818  0.7310586        TRUE
      2   7.3890561  0.8807971        TRUE
      3  20.0855369  0.9525741        TRUE
      4  54.5981500  0.9820138        TRUE
prob.odd
## function (prob) 
## {
##     prob/(1 - prob)
## }
## <environment: namespace:optiRum>
odd.logit
## function (odds) 
## {
##     log(odds)
## }
## <environment: namespace:optiRum>
logit.odd
## function (logit) 
## {
##     exp(logit)
## }
## <environment: namespace:optiRum>
odd.prob
## function (odds) 
## {
##     odds/(1 + odds)
## }
## <environment: namespace:optiRum>
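
For comparison, base R provides plogis() and qlogis(), which perform the same probability/logit conversions; a quick check against the vectors above (a sketch, assuming the logits and probs objects created earlier):

# qlogis(p) is log(p / (1 - p)); plogis(x) is exp(x) / (1 + exp(x))
all.equal(plogis(logits), probs)
all.equal(qlogis(probs), logits)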

2 Analysis workflow

  • Understand the problem
  • Gather some data
  • Clean the data
  • Clean the data some more
  • Exploration and visualisation
  • Feature Reduction
  • Feature Engineering
  • (Depending on models) Feature Selection
  • Candidate model builds
  • Evaluate models
  • In-depth evaluation of selected model
  • Productionising

3 Sources of change in analysis

3.1 Exercise

What sort of things can alter the results of a piece of analysis?

3.2 Answers

  • Changes in data
  • Changes in code behaviours
  • Changes in behaviours in dependencies
  • Randomness

4 Accounting for change

4.1 Exercise

What sort of things can we do to prevent changes creeping into our analysis that stop it from being “deterministic”?

4.2 Answers

  • Checksums to flag if anything has changed
  • Keeping a separate copy of the data
  • Keeping dependencies the same over time
  • Source control
  • Unit testing and validating code
  • set.seed to fix random number generation (see the sketch after this list)
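
As a small sketch of two of these safeguards (the file path is a placeholder, not a real file from this project), a checksum can flag data changes and set.seed makes any sampling repeatable:

# Record a checksum of the raw data so any later change is flagged
tools::md5sum("data/raw_data.csv")

# Fix the random number generator so sampling steps give the same result each run
set.seed(2016)
sample(10, 3)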

5 GLM step-by-step – Project setup

5.1 Project checklist

  • Git
  • Project options
    • No Rdata or history!
    • Insert spaces for tabs
  • Packrat + packrat::init()
  • Folder structure
    • data
    • processeddata
    • analysis
    • outputs
    • docs
  • DESCRIPTION
  • LICENSE
  • .Rbuildignore
  • README.Rmd
  • Makefile
  • .travis.yml

5.2 Travis setup

5.3 Github setup

6 GLM step-by-step – Data

  • Source
  • Verification steps
  • Multiple outputs?
  • Main report
  • Supplementary data quality report
  • Shiny?

7 GLM step-by-step – Data processing

  • Cleaning steps
  • Sampling (a sketch of this and feature scaling follows this list)
  • Feature scaling
  • Univariate analysis
  • Bivariate analysis
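
A hedged sketch of the sampling and feature scaling steps (the data frame and column names are invented for illustration):

# Toy data frame standing in for the cleaned data
set.seed(2016)
policies <- data.frame(premium = rnorm(100, mean = 500, sd = 100),
                       claimed = rbinom(100, size = 1, prob = 0.2))

# Sampling: hold out 30% of rows for testing
test_rows <- sample(nrow(policies), size = floor(0.3 * nrow(policies)))
train     <- policies[-test_rows, ]
test      <- policies[test_rows, ]

# Feature scaling: centre and scale a numeric column
train$premium_scaled <- as.numeric(scale(train$premium))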

8 GLM step-by-step – Candidate models

  • Feature selection
  • Various glm* models (a minimal sketch follows this list)
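
As a minimal sketch of fitting one candidate logistic regression with base R's glm (continuing the invented train/test split from the data processing sketch above):

# Logistic regression: binomial family, logit link by default
candidate_1 <- glm(claimed ~ premium, data = train, family = binomial())
summary(candidate_1)

# Predicted probabilities for the held-out rows
test$pred_prob <- predict(candidate_1, newdata = test, type = "response")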

9 GLM step-by-step – Evaluation

  • Scaling sample
  • Single model evaluation techniques (see the sketch after this list)
  • Comparing multiple models
  • Cross-validation
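
A very small sketch of single-model evaluation, continuing the example above and using the simple 0.5 probability (logit >= 0) cut-off from section 1:

# Classify with the 0.5 probability cut-off and compare against the actual class
test$pred_class <- test$pred_prob >= 0.5
table(actual = test$claimed, predicted = test$pred_class)

# Overall accuracy of that cut-off
mean(test$pred_class == (test$claimed == 1))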

10 GLM step-by-step – Model selection

  • Using evaluation metrics to select best model
  • Presenting model
  • In-depth evaluation of best model

11 GLM step-by-step – Supplementary materials

  • Data lineage
  • Data quality
  • Feature analysis in-depth
  • Candidate model evaluations
  • Code
  • Reproducibility info