- Practical Machine Learning Cookbook
- Atul Tripathi
Multinomial logistic regression - understanding program choices made by students
Let's assume that high school students are to be enrolled in a program. Each student is given the opportunity to choose one of three programs: a general program, a vocational program, or an academic program. The choice each student makes is modeled as a function of that student's writing score and socioeconomic status.
Getting ready
In order to complete this recipe we shall be using a students' dataset. The first step is collecting the data.
Step 1 - collecting data
The students' dataset, titled hsbdemo, is used. The dataset is available at http://voia.yolasite.com/resources/hsbdemo.csv in CSV format. There are 200 data rows (201 lines including the header) and 13 variables in the dataset. The eight numeric measurements are as follows:
id
read
write
math
science
socst
awards
cid
The non-numeric measurements are as follows:
gender
ses
schtyp
prog
honors
How to do it...
Let's get into the details.
Step 2 - exploring data
The first step is loading the packages. The library() function returns an error if the package does not exist. Use the following commands:
> library(foreign)
> library(nnet)
> library(ggplot2)
> library(reshape2)
Note
Version info: Code for this page was tested in R version 3.2.3 (2015-12-10).
Exploring the data throws some light on the relationships in the data. The CSV file titled hsbdemo.csv needs to be imported into the R environment. The imported data is saved in a data frame titled ml as follows:
> ml <- read.table("d:/hsbdemo.csv", header=TRUE, sep=",", row.names="id")
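To confirm that the import worked as expected, a few quick checks can be run (a suggested addition, not part of the original recipe; since id is used for row names, the frame should hold the remaining 12 variables):

```r
# Optional sanity checks on the imported data frame
str(ml)    # variable names and types
head(ml)   # first six rows
dim(ml)    # number of rows and columns (expected: 200 rows, 12 columns)
```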
Explore the descriptive statistics of the variables of interest using the with() function as follows:
> with(ml, table(ses, prog))
The results are as follows:
        prog
ses      academic general vocation
  high         42       9        7
  low          19      16       12
  middle       44      20       31
Let us obtain the mean and the standard deviation as follows:
> with(ml, do.call(rbind, tapply(write, prog, function(x) c(M = mean(x), SD = sd(x)))))
The results are as follows:
                M       SD
academic 56.25714 7.943343
general  51.33333 9.397775
vocation 46.76000 9.318754
The mean is the highest for academic and the standard deviation is the highest for general.
Step 3 - training the model
In order to estimate a multinomial logistic regression, the multinom() function is used. The multinom() function does not require reshaping of the data.
It is important to choose a reference group for the outcome; we can choose whichever level of the outcome we wish to use as the baseline. This is specified with the relevel() function. Then we run the model using the multinom() function. Since multinom() does not compute p-values for the regression coefficients, significance tests are carried out using Wald tests (z-tests). The formula passed to multinom() is of the form response ~ predictors, and the data frame ml supplies the variables occurring in the formula, as follows:
> ml$prog2 <- relevel(ml$prog, ref = "academic")
> test <- multinom(prog2 ~ ses + write, data = ml)
The results are as follows:
# weights:  15 (8 variable)
initial  value 219.722458
iter  10 value 179.983731
final  value 179.981726
converged
> summary(test)
The results are as follows:
Call:
multinom(formula = prog2 ~ ses + write, data = ml)

Coefficients:
         (Intercept)    seslow sesmiddle       write
general     1.689478 1.1628411 0.6295638 -0.05793086
vocation    4.235574 0.9827182 1.2740985 -0.11360389

Std. Errors:
         (Intercept)    seslow sesmiddle      write
general     1.226939 0.5142211 0.4650289 0.02141101
vocation    1.204690 0.5955688 0.5111119 0.02222000

Residual Deviance: 359.9635
AIC: 375.9635
Next, the coefficients from the model summary are divided by their standard errors to obtain z statistics, as follows:
> z <- summary(test)$coefficients/summary(test)$standard.errors
Display the value of z as follows:
> z
The results are as follows:
         (Intercept)   seslow sesmiddle     write
general     1.376987 2.261364  1.353816 -2.705658
vocation    3.515904 1.650050  2.492798 -5.112687
Step 4 - testing the results of the model
A two-tailed z test is carried out as follows:
> p <- (1 - pnorm(abs(z), 0, 1))*2
Display the value of p as follows:
> p
The results are as follows:
          (Intercept)     seslow sesmiddle        write
general  0.1685163893 0.02373673 0.1757949 6.816914e-03
vocation 0.0004382601 0.09893276 0.0126741 3.176088e-07
Step 5 - improving model performance
Relative risk is defined as the ratio between choosing one outcome category and choosing the baseline category. The relative risk is the exponential of the right-hand side of the linear equation. The exponential regression coefficients are relative risk ratios for a unit change in the predictor variable.
Extract the coefficients from the model and perform an exponential on it as follows:
> exp(coef(test))
The results are as follows:
         (Intercept)   seslow sesmiddle     write
general     5.416653 3.199009  1.876792 0.9437152
vocation   69.101326 2.671709  3.575477 0.8926115
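These ratios are simply the exponentials of the fitted coefficients, so they can be checked by hand. The following sketch recomputes two of them from the coefficient values printed by summary(test) above (note that "high" is the baseline level of ses, so the high-versus-low ratio is the exponential of the negated seslow coefficient):

```r
# Recompute relative risk ratios from the printed coefficients
b_write  <- -0.05793086  # coefficient of write, general vs. academic
b_seslow <-  1.1628411   # coefficient of seslow, general vs. academic

exp(b_write)    # ~0.9437: RRR per one-unit increase in write
exp(-b_seslow)  # ~0.3126: RRR of high vs. low ses
```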
The relative risk ratio for a one-unit increase in the variable write is .9437 for being in a general program versus an academic program. The relative risk ratio of switching from ses = 1 (low) to ses = 3 (high) is .3126 for being in a general program versus an academic program. Use the predicted probabilities to gain insight into the model. The fitted() function calculates predicted probabilities for each of our outcome levels as follows:
> head(pp <- fitted(test))
The results are as follows:
     academic   general  vocation
45  0.1482721 0.3382509 0.5134769
108 0.1201988 0.1806335 0.6991678
15  0.4186768 0.2368137 0.3445095
67  0.1726839 0.3508433 0.4764728
153 0.1001206 0.1689428 0.7309367
51  0.3533583 0.2378047 0.4088370
Examine the changes in probability associated with each of the two variables, ses and write. Create small datasets that vary one variable while holding the other constant. First, hold the write variable at its mean and examine the predicted probability for each level of the ses variable as follows:
> dses <- data.frame(ses = c("low", "middle", "high"), write = mean(ml$write))
> predict(test, newdata = dses, "probs")
The results are as follows:
   academic   general  vocation
1 0.4396813 0.3581915 0.2021272
2 0.4777451 0.2283359 0.2939190
3 0.7009046 0.1784928 0.1206026
Next, look at the average predicted probabilities for different values of the continuous predictor variable write. First, create a grid of values to predict over, as follows:
> dwrite <- data.frame(ses = rep(c("low", "middle", "high"), each = 41), write = rep(c(30:70), 3))
Store the predicted probabilities for each value of ses and write as follows:
> pp.write <- cbind(dwrite, predict(test, newdata = dwrite, type = "probs", se = TRUE))
Calculate the mean probabilities within each level of ses as follows:
> by(pp.write[, 3:5], pp.write$ses, colMeans)
The results are as follows:
pp.write$ses: high
 academic   general  vocation
0.6164348 0.1808049 0.2027603
------------------------------------------------------------
pp.write$ses: low
 academic   general  vocation
0.3972955 0.3278180 0.2748864
------------------------------------------------------------
pp.write$ses: middle
 academic   general  vocation
0.4256172 0.2010877 0.3732951
Sometimes a couple of plots can convey a good deal of information. Using the predictions generated for the pp.write object previously, we can plot the predicted probabilities against the writing score, by level of ses, for each level of the outcome variable. The melt() function takes data in wide format and stacks a set of columns into a single column; the result is stored in the lpp data frame as follows:
> lpp <- melt(pp.write, id.vars = c("ses", "write"), value.name = "probability")
Print the first rows of lpp using head() as follows:
> head(lpp)
The results are as follows:
  ses write variable probability
1 low    30 academic  0.09843258
2 low    31 academic  0.10716517
3 low    32 academic  0.11650018
4 low    33 academic  0.12645441
5 low    34 academic  0.13704163
6 low    35 academic  0.14827211
Next, we plot the predicted probabilities across write values for each level of ses, faceted by program type, as follows:
> ggplot(lpp, aes(x = write, y = probability, colour = ses)) +
+   geom_line() +
+   facet_grid(variable ~ ., scales = "free")
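As an optional final check (not part of the original recipe), the fitted model can also classify each student into the most probable program, which makes it easy to tabulate predictions against the observed choices. This sketch assumes the test model and ml data frame from the earlier steps are still in the session:

```r
# Predict the most probable program for every student in the training data
pred <- predict(test, type = "class")

# Confusion table: observed program vs. predicted program
table(observed = ml$prog2, predicted = pred)

# In-sample accuracy (proportion of students classified correctly)
mean(as.character(pred) == as.character(ml$prog2))
```

Note that this is in-sample accuracy; for an honest estimate of performance, the same comparison would be made on held-out data.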
