Multinomial logistic regression - understanding program choices made by students

Let's assume that high school students are to be enrolled in a program, and that each student is given the opportunity to choose the program they prefer. There are three options: a general program, a vocational program, and an academic program. The choice each student makes is modeled using that student's writing score and socioeconomic status.

Getting ready

To complete this recipe we shall be using a dataset of high school students. The first step is collecting the data.

Step 1 - collecting data

The student dataset titled hsbdemo is used. The dataset is available at http://voia.yolasite.com/resources/hsbdemo.csv in CSV format. There are 201 data rows and 13 variables in the dataset. The eight numeric measurements are as follows (a quick way to verify these types in R is shown after the two lists below):

  • id
  • read
  • write
  • math
  • science
  • socst
  • awards
  • cid

The non-numeric measurements are as follows:

  • gender
  • ses
  • schtyp
  • prog
  • honors
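
A quick way to verify these variable types after downloading the file is to inspect its structure. This is a minimal sketch, assuming the file has been saved as d:/hsbdemo.csv (the same local path used later in this recipe); the raw object is only a temporary copy for inspection:

> raw <- read.csv("d:/hsbdemo.csv")   # assumed local path
> str(raw)                            # lists the 13 variables and their types
> dim(raw)                            # number of rows and columns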

How to do it...

Let's get into the details.

Step 2 - exploring data

The first step is loading the packages. The library() function returns an error if a package is not installed. Use the following commands:

 > library(foreign)
 > library(nnet)
 > library(ggplot2)
 > library(reshape2)
Note

Version info: the code in this recipe was tested in R version 3.2.3 (2015-12-10).
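
If any of these packages are not yet installed, library() will fail. A minimal sketch that installs the missing ones first (assuming an internet connection; the pkgs and to.install names are only illustrative) is:

> pkgs <- c("foreign", "nnet", "ggplot2", "reshape2")
> to.install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
> if (length(to.install) > 0) install.packages(to.install)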

Exploring the data throws some light on the relationships among the variables. The CSV file titled hsbdemo.csv needs to be imported into the R environment. The imported data is stored in a data frame titled ml as follows:

> ml <- read.table("d:/hsbdemo.csv", header=TRUE, sep=",", row.names="id")
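
Alternatively, if the file has not been saved to a local path such as d:/, read.csv() can load it directly from the URL given in Step 1; this is a sketch, assuming an internet connection is available:

> ml <- read.csv("http://voia.yolasite.com/resources/hsbdemo.csv", row.names = "id")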

Descriptive statistics for the variables of interest are obtained using the with() and table() functions as follows:

> with(ml, table(ses, prog))

The results are as follows:

         prog
 ses      academic general vocation
   high         42       9        7
   low          19      16       12
   middle       44      20       31
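
The same cross-tabulation is often easier to compare as proportions within each ses level; a minimal sketch using prop.table():

> with(ml, prop.table(table(ses, prog), margin = 1))   # row-wise proportions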

Let us obtain the mean and the standard deviation of the writing score within each program as follows:

> with(ml, do.call(rbind, tapply(write, prog, function(x) c(M = mean(x), SD = sd(x)))))

The results are as follows:

                 M       SD
 academic 56.25714 7.943343
 general  51.33333 9.397775
 vocation 46.76000 9.318754

The mean writing score is highest for the academic program, while the standard deviation is highest for the general program.
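
The same group-wise summaries can also be computed with aggregate(), which some readers may find more compact; a sketch:

> aggregate(write ~ prog, data = ml, FUN = function(x) c(M = mean(x), SD = sd(x)))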

Step 3 - training the model

To estimate the multinomial logistic regression, the multinom() function from the nnet package is used. The multinom() function does not require reshaping of the data.
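
Because relevel(), used in the next command, works on factors, it is worth confirming that prog was imported as a factor (read.table() converts character columns to factors by default in the R version noted earlier); a minimal check:

> is.factor(ml$prog)           # should be TRUE
> ml$prog <- factor(ml$prog)   # convert explicitly if it was read as character
> levels(ml$prog)              # "academic" "general" "vocation"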

It is important to choose a reference group for the outcome: we select the level of the outcome that we wish to use as the baseline and specify it with the relevel() function. Then we run the model using the multinom() function. Since multinom() does not compute p-values for the regression coefficients, significance is assessed afterwards using Wald tests (z-tests). The formula passed to multinom() is of the form response ~ predictors, and the data argument names the data frame (ml) in which the variables in the formula are interpreted, as follows:

 > ml$prog2 <- relevel(ml$prog, ref = "academic") 
 > test <- multinom(prog2 ~ ses + write, data = ml)

The results are as follows:

# weights: 15 (8 variable)
initial value 219.722458 
iter 10 value 179.983731
final value 179.981726 
converged
 > summary(test)

The results are as follows:

 Call:
 multinom(formula = prog2 ~ ses + write, data = ml)

 Coefficients:
          (Intercept)    seslow sesmiddle       write
 general     1.689478 1.1628411 0.6295638 -0.05793086
 vocation    4.235574 0.9827182 1.2740985 -0.11360389

 Std. Errors:
          (Intercept)    seslow sesmiddle      write
 general     1.226939 0.5142211 0.4650289 0.02141101
 vocation    1.204690 0.5955688 0.5111119 0.02222000

 Residual Deviance: 359.9635
 AIC: 375.9635

Next, the coefficient estimates from the model summary are divided by their standard errors to obtain Wald z statistics, as follows:

> z <- summary(test)$coefficients/summary(test)$standard.errors

Display the value of z as follows:

> z

The results are as follows:

          (Intercept)   seslow sesmiddle     write
 general     1.376987 2.261364  1.353816 -2.705658
 vocation    3.515904 1.650050  2.492798 -5.112687

Step 4 - testing the results of the model

A two-tailed z test is carried out as follows:

> p <- (1 - pnorm(abs(z), 0, 1))*2

Display the value of p as follows:

> p

The results are as follows:

           (Intercept)     seslow sesmiddle        write
 general  0.1685163893 0.02373673 0.1757949 6.816914e-03
 vocation 0.0004382601 0.09893276 0.0126741 3.176088e-07
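
In addition to these coefficient-level Wald tests, the model as a whole can be compared against an intercept-only model with a likelihood-ratio test; a sketch, assuming the fitted object test from Step 3 (test0 is an illustrative name for the null model, and nnet's anova() method for multinom fits reports the likelihood-ratio chi-square):

> test0 <- multinom(prog2 ~ 1, data = ml)   # intercept-only (null) model
> anova(test0, test)                        # likelihood-ratio test of the two nested models
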
Step 5 - interpreting the results of the model

Relative risk is defined as the ratio of the probability of choosing one outcome category to the probability of choosing the baseline category. The relative risk is the exponential of the right-hand side of the linear predictor, so the exponentiated regression coefficients are relative risk ratios for a one-unit change in the corresponding predictor variable.

Extract the coefficients from the model and exponentiate them as follows:

> exp(coef(test))

The results are as follows:

          (Intercept)   seslow sesmiddle     write
 general     5.416653 3.199009  1.876792 0.9437152
 vocation   69.101326 2.671709  3.575477 0.8926115
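
As a quick check, any entry in this table can be reproduced by exponentiating the corresponding coefficient from the earlier summary; for example, for write in the general equation:

> exp(coef(test)["general", "write"])   # approximately 0.9437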

The relative risk ratio for a one-unit increase in the write variable is .9437 for being in a general program versus an academic program. The relative risk ratio for switching from ses = "low" to ses = "high" is .3126 (the reciprocal of the 3.199 shown for seslow) for being in a general program versus an academic program. The predicted probabilities can be used to gain further insight into the model. The fitted() function calculates predicted probabilities for each of the outcome levels as follows:

> head(pp <- fitted(test))

The results are as follows:

       academic   general  vocation
 45   0.1482721 0.3382509 0.5134769
 108  0.1201988 0.1806335 0.6991678
 15   0.4186768 0.2368137 0.3445095
 67   0.1726839 0.3508433 0.4764728
 153  0.1001206 0.1689428 0.7309367
 51   0.3533583 0.2378047 0.4088370
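
The same fitted model can also assign each student to the most probable program, which gives a rough view of in-sample classification performance; this is a minimal sketch (evaluated on the training data, so it paints an optimistic picture, and pred.class is only an illustrative name):

> pred.class <- predict(test)                                # most probable program per student
> table(predicted = pred.class, actual = ml$prog2)           # confusion matrix
> mean(as.character(pred.class) == as.character(ml$prog2))   # overall training accuracy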

Let's examine the changes in predicted probability associated with the two predictors, ses and write. Create small datasets that vary one variable while holding the other constant. First, hold write at its mean and examine the predicted probabilities for each level of ses as follows:

 > dses <- data.frame(ses = c("low", "middle", "high"), write = mean(ml$write))
 > predict(test, newdata = dses, "probs")

The results are as follows:

    academic   general  vocation
 1 0.4396813 0.3581915 0.2021272
 2 0.4777451 0.2283359 0.2939190
 3 0.7009046 0.1784928 0.1206026

Next, look at the average predicted probabilities over a range of values of the continuous predictor write within each level of ses. Create a grid of write values (30 to 70) for each level of ses as follows:

> dwrite <- data.frame(ses = rep(c("low", "middle", "high"), each = 41), write = rep(c(30:70), 3))

Store the predicted probabilities for each value of ses and write as follows:

> pp.write <- cbind(dwrite, predict(test, newdata = dwrite, type = "probs", se = TRUE))

Calculate the mean probabilities within each level of ses as follows:

> by(pp.write[, 3:5], pp.write$ses, colMeans)

The results are as follows:

pp.write$ses: high
 academic general vocation 
 0.6164348 0.1808049 0.2027603 
-------------------------------------------------------------------------- 
pp.write$ses: low
 academic general vocation 
 0.3972955 0.3278180 0.2748864 
-------------------------------------------------------------------------- 
pp.write$ses: middle 
 academic general vocation 
 0.4256172 0.2010877 0.3732951 

Sometimes, a couple of plots can convey a good deal of information. Using the predictions generated for the pp.write object previously, we can plot the predicted probabilities against the writing score, by level of ses, for each level of the outcome variable. The melt() function takes data in wide format and stacks a set of columns into a single column. Store the result in a data frame named lpp as follows:

> lpp <- melt(pp.write, id.vars = c("ses", "write"), value.name = "probability")

Print the first few rows of lpp using head() as follows:

> head(lpp)

The results are as follows:

   ses write variable probability
 1 low    30 academic  0.09843258
 2 low    31 academic  0.10716517
 3 low    32 academic  0.11650018
 4 low    33 academic  0.12645441
 5 low    34 academic  0.13704163
 6 low    35 academic  0.14827211

Next, we plot the predicted probabilities across write values for each level of ses, faceted by program type, as follows:

> ggplot(lpp, aes(x = write, y = probability, colour = ses)) +
+ geom_line() +
+ facet_grid(variable ~ ., scales="free")
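
If the plot is needed for a report, ggsave() writes the most recently displayed ggplot to a file; a brief sketch, with an illustrative file name and size:

> ggsave("program_probabilities.png", width = 7, height = 5)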