1 Foundations

Does everyone have a working laptop?

This course is designed to bring one up to speed on the tools necessary to follow all the examples in Frank Harrell’s ``Regression Modeling Strategies’’. It assumes a rudimentary knowledge of R or at least having some experience with coding.

If there are any questions or areas you’d like covered in more detail please do not hesitate to speak up.

RStudio has multiple panes. One can customize what’s open at any point. In general the four quadrants are:

  • Upper Left: Open Files
  • Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
  • Lower Left: Console. The interactive command to the running R interpreter.
  • Lower Right: Directory browser, Plots, Packages, Help.

If one accidentally closes one, it’s easily opened again from the View menu.

Note Menu items have underlines for certain characters. These are shortcuts for Alt-letter that one can access quickly from the keyboard.

1.1 Creating scripts

R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.

From RStudio, select File -> New File and there is a large number of options. The important ones for this class are:

  • R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click source in the upper right of the file and it will execute in the console below.
  • R Notebook. An older version that allows for a preview of the document. The preview is rebuilt on every save.
  • R Markdown. Same as R Notebook, but requires a manual Knit to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.

A common question is “Which file type should I choose?” It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. For writing code that will import data or perform data manipulation, I prefer R scripts. For teaching with interactive examples, I prefer using R notebook. I like to use R markdown when creating output that will be shared (either as HTML or PDF).

1.2 Creating projects

A project brings together several scripts as a project. One can create a project using File -> New Project and select an existing directory or create a blank one as desired. Existing projects can be opened with File -> Open Project.

1.3 Session & Working Directory

There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.

It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command sessionInfo() provides a nice overview of this that is good for checking package versions loaded.

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.18    digest_0.6.25    R6_2.4.1         jsonlite_1.7.1   magrittr_2.0.1   evaluate_0.14    rmdformats_0.3.7 stringi_1.5.3   
##  [9] rlang_0.4.11     jquerylib_0.1.4  bslib_0.3.1      rmarkdown_2.11   tools_4.2.0      stringr_1.4.0    xfun_0.30        yaml_2.2.1      
## [17] fastmap_1.1.0    compiler_4.2.0   htmltools_0.5.2  knitr_1.36       sass_0.4.0

The current working directory is available via getwd() and settable via setwd("directory"). All file loads and saves are relative to the current working directory.

1.4 Environment

R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.

x <- 3
f <- function()
{ # One can view the parenthesis as capturing a frame
  x <- 4   # Local variable inner frame
  cat("This is the inner 'x'", x, "\n") 
}
f()
## This is the inner 'x' 4
cat("This is the outer 'x'", x, "\n")
## This is the outer 'x' 3

1.5 Packages

Install the packages Hmisc, rms, knitr, rmarkdown, and plotly through RStudio.

Packages contain R code and may often require the installation of other packages. When I install Hmisc, I see that its dependency acepack is automatically installed:

Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb

* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)

1.5.1 Install and updating via CRAN

Install through CRAN repositories

# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown', 'plotly'))
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)

1.5.2 Install and update via Github

Github is a code sharing platform. Packages are typically in a rawer state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the remotes package.

install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')

1.5.3 Loading Packages

Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.

library(Hmisc)
require(Hmisc)

1.6 Help

From the R console one can get help about any function by appending it with a ?.

?lm

The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.

1.7 Rmarkdown

Earlier versions of this course focused on LaTeX document generation. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Given that the focus of this course is on Regression Modeling strategies and not document generation, the current focus is on using Rmarkdown and showing some interactive plots via HTML. Full support and possibilities are shown at https://rmarkdown.rstudio.com/.

The cheat sheet is available at https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf, or through Rstudio’s menu bar: Help -> Cheetsheets -> R Markdown Cheet Sheet.

Quarto is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative, though it must be installed in addition to R and RStudio.

1.8 Best Practices

A blog post RMarkdown Driven Development by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.

One takeaway is to avoid the general pitfall of putting large amounts of data processing inside ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts outside the report. The report should load the results of this process.

Another great reference is Dr. Harrell’s analysis project workflow. His content mirrors much of this workshop and will reinforce its material. It can also be used as a template for quarto as its source code is available (by clicking the “</> Code” link).

1.9 Caching

Chunks in Rmarkdown can be cached. I.e., the results of the last computation are saved and it’s not rerun until a manual override by clicking on that section is done. It is recommended that one not do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes.

A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It’s very similar to caching, but is not prone to later confusion as it’s a bit more direct that the report is loading the processed data file. In short do large data processing tasks outside of reports.

1.10 Harrell’s RMS

See chapters 1 and 9 of Biostatistics for Biomedical Research.

1.11 Setup

Most Rmarkdown documents have a setup section that loads required libraries and sets up some options. The following snippet is the basic setup for this course. It loads the Hmisc package. Sets knitr to use markdown by default. Sets the graphics option to use plotly. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable mu.

require(Hmisc)
knitrSet(lang='markdown', h=4.5, fig.path='png/', fig.align='left')
# knitrSet redirects all messages to messages.txt
options(grType='plotly') # for certain graphics functions
mu <- markupSpecs$html   # markupSpecs is in Hmisc

1.12 Data

The getHdata function is used to fetch a dataset from the Vanderbilt Biostatistics DataSets web site. upData is used to

  • create a new variable from an old one
  • add labels to 2 variables
  • add units to the new variable
  • remove the old variable
  • automatically move units of measurements from parenthetical expressions in labels to separate units attributes used by Hmisc and rms functions for table making and graphics

contents is used to print a data dictionary, run through an html method for nicer output.

getHdata(pbc)
pbc <- upData(pbc,
              fu.yrs = fu.days / 365.25,
              labels = c(fu.yrs = 'Follow-up Time',
                         status = 'Death or Liver Transplantation'),
              units = c(fu.yrs = 'year'),
              drop  = 'fu.days',
              moveUnits=TRUE, html=TRUE)
Input object size:   80952 bytes;    19 variables    418 observations
 Label for bili changed from Serum Bilirubin (mg/dl) to Serum Bilirubin 
    units set to mg/dl 
 Label for albumin changed from Albumin (gm/dl) to Albumin 
    units set to gm/dl 
 Label for protime changed from Prothrombin Time (sec.) to Prothrombin Time 
    units set to sec. 
 Label for alk.phos changed from Alkaline Phosphatase (U/liter) to Alkaline Phosphatase 
    units set to U/liter 
 Label for sgot changed from SGOT (U/ml) to SGOT 
    units set to U/ml 
 Label for chol changed from Cholesterol (mg/dl) to Cholesterol 
    units set to mg/dl 
 Label for trig changed from Triglycerides (mg/dl) to Triglycerides 
    units set to mg/dl 
 Label for platelet changed from Platelets (per cm^3/1000) to Platelets 
    units set to per cm^3/1000 
 Label for copper changed from Urine Copper (ug/day) to Urine Copper 
    units set to ug/day 
 Added variable     fu.yrs
 Dropped variable   fu.days
 New object size:   84744 bytes;    19 variables    418 observations
html(contents(pbc), maxlevels=10, levelType='table')

Data frame:pbc

418 observations and 19 variables, maximum # NAs:136  
NameLabelsUnitsLevelsStorageNAs
biliSerum Bilirubinmg/dldouble 0
albuminAlbumingm/dldouble 0
stageHistologic Stage, Ludwig Criteriainteger 6
protimeProthrombin Timesec.double 2
sex2integer 0
ageAgedouble 0
spiders2integer106
hepatom2integer106
ascites2integer106
alk.phosAlkaline PhosphataseU/literdouble106
sgotSGOTU/mldouble106
cholCholesterolmg/dlinteger134
trigTriglyceridesmg/dlinteger136
plateletPlateletsper cm^3/1000integer110
drug3integer 0
statusDeath or Liver Transplantationinteger 0
edema3integer 0
copperUrine Copperug/dayinteger108
fu.yrsFollow-up Timeyeardouble 0

VariableLevels
sexmale
female
spiders, hepatomabsent
 ascitespresent
drugD-penicillamine
placebo
not randomized
edemano edema
edema, no diuretic therapy
edema despite diuretic therapy

1.13 Exercise 1

Do exercise 1. Pull the example exercise.Rmd from github as your starting template.

1.14 Descriptive Statistics Without Stratification

# did have results='asis' above
d <- describe(pbc)
html(d, size=80, scroll=TRUE)
pbc

19 Variables   418 Observations

bili: Serum Bilirubin mg/dl
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
4180980.9983.2213.742 0.50 0.60 0.80 1.40 3.40 8.0314.00
lowest : 0.3 0.4 0.5 0.6 0.7 , highest: 21.6 22.5 24.5 25.5 28.0
albumin: Albumin gm/dl
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
418015413.4970.4732.7502.9673.2433.5303.7704.0104.141
lowest : 1.96 2.10 2.23 2.27 2.31 , highest: 4.30 4.38 4.40 4.52 4.64
stage: Histologic Stage, Ludwig Criteria
image
nmissingdistinctInfoMeanGmd
412640.8933.0240.9519
 Value          1     2     3     4
 Frequency     21    92   155   144
 Proportion 0.051 0.223 0.376 0.350
 

protime: Prothrombin Time sec.
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
4162480.99810.731.029 9.60 9.8010.0010.6011.1012.0012.45
lowest : 9.0 9.1 9.2 9.3 9.4 , highest: 13.8 14.1 15.2 17.1 18.0
sex
nmissingdistinct
41802
 Value        male female
 Frequency      44    374
 Proportion  0.105  0.895
 

age: Age
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
4180345150.7411.9633.8436.3742.8351.0058.2464.3067.92
lowest : 26.27789 28.88433 29.55510 30.27515 30.57358 , highest: 74.52430 75.00000 75.01164 76.70910 78.43943
spiders
nmissingdistinct
3121062
 Value       absent present
 Frequency      222      90
 Proportion   0.712   0.288
 

hepatom
nmissingdistinct
3121062
 Value       absent present
 Frequency      152     160
 Proportion   0.487   0.513
 

ascites
nmissingdistinct
3121062
 Value       absent present
 Frequency      288      24
 Proportion   0.923   0.077
 

alk.phos: Alkaline Phosphatase U/liter
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
312106295119831760 599.6 663.0 871.51259.01980.03826.46669.9
lowest : 289.0 310.0 369.0 377.0 414.0 , highest: 11046.6 11320.2 11552.0 12258.8 13862.4
sgot: SGOT U/ml
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3121061791122.660.45 54.25 60.45 80.60114.70151.90196.47219.25
lowest : 26.35 28.38 41.85 43.40 45.00 , highest: 288.00 299.15 328.60 338.00 457.25
chol: Cholesterol mg/dl
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
2841342011369.5194.5188.4213.6249.5309.5400.0560.8674.0
lowest : 120 127 132 149 151 , highest: 1336 1480 1600 1712 1775
trig: Triglycerides mg/dl
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
2821361461124.764.07 56.00 63.10 84.25108.00151.00195.00230.95
lowest : 33 44 46 49 50 , highest: 319 322 382 432 598
platelet: Platelets per cm3/1000
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3081102101261.9107.8117.7139.7199.8257.0322.5386.5430.6
lowest : 62 70 71 79 80 , highest: 493 514 518 539 563
drug
image
nmissingdistinct
41803
 Value      D-penicillamine         placebo  not randomized
 Frequency              154             158             106
 Proportion           0.368           0.378           0.254
 

status: Death or Liver Transplantation
nmissingdistinctInfoSumMeanGmd
418020.711610.38520.4748

edema
image
nmissingdistinct
41803
 Value                            no edema     edema, no diuretic therapy
 Frequency                             354                             44
 Proportion                          0.847                          0.105
                                          
 Value      edema despite diuretic therapy
 Frequency                              20
 Proportion                          0.048
 

copper: Urine Copper ug/day
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
310108158197.6583.16 17.45 24.00 41.25 73.00123.00208.10249.20
lowest : 4 9 10 11 12 , highest: 412 444 464 558 588
fu.yrs: Follow-up Time year
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
418039915.2513.429 0.671 1.661 2.992 4.736 7.155 9.64911.063
lowest : 0.1122519 0.1177276 0.1396304 0.1943874 0.2108145
highest:12.320328512.344969212.383299112.473648213.1279945

p <- plot(d)
# "p" has two named elements that we can access with "$"
p$Categorical # or, p[['Categorical']]
p$Continuous

1.15 Stratified Descriptive Statistics

Produce stratified quantiles, means/SD, and proportions by treatment group. Plot the results before rendering as an advanced HTML table:

  • categorical variables: a single dot chart
  • continuous variables: a series of extended box plots
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders +
              alk.phos + sgot + chol ~ drug, data=pbc,
                            overall=FALSE, test=TRUE)

2 Plotting

plot(s, which='categorical')
plot(s, which='continuous', vars=1:4)
plot(s, which='continuous', vars=5:7)
html(s, caption='Baseline characteristics by randomized treatment',
     exclude1=TRUE, npct='both', digits=3,
     prmsd=TRUE, brmsd=TRUE, msdsize=mu$smaller2)
Baseline characteristics by randomized treatment.
N
D-penicillamine
N=154
placebo
N=158
not randomized
N=106
Test Statistic
Serum Bilirubin
mg/dl
418 0.725 1.300 3.600
3.649 ± 5.282
0.800 1.400 3.200
2.873 ± 3.629
0.725 1.400 3.075
3.117 ± 4.043
F2 415=0.03, P=0.9721
Albumin
gm/dl
418 3.342 3.545 3.777
3.524 ± 0.396
3.212 3.565 3.830
3.516 ± 0.443
3.125 3.470 3.720
3.431 ± 0.435
F2 415=2.13, P=0.121
Histologic Stage, Ludwig Criteria : 1 412 0.03 4154 0.08 12158 0.05 5100 χ26=5.33, P=0.5022
  2 0.21 32154 0.22 35158 0.25 25100
  3 0.42 64154 0.35 56158 0.35 35100
  4 0.35 54154 0.35 55158 0.35 35100
Prothrombin Time
sec.
416 10.000 10.600 11.400
10.800 ±  1.138
10.025 10.600 11.000
10.653 ±  0.851
10.100 10.600 11.000
10.750 ±  1.078
F2 413=0.23, P=0.7951
sex : female 418 0.90 139154 0.87 137158 0.92 98106 χ22=2.38, P=0.3042
Age 418 41.43 48.11 55.80
48.58 ±  9.96
42.98 51.93 58.90
51.42 ± 11.01
46.00 53.00 61.00
52.87 ±  9.78
F2 415=6.1, P=0.0021
spiders 312 0.29 45154 0.28 45158 χ21=0.02, P=0.8852
Alkaline Phosphatase
U/liter
312 922 1283 1950
1943 ± 2102
841 1214 2028
2021 ± 2183
F1 310=0.06, P=0.8123
SGOT
U/ml
312 83.8 117.4 151.9
125.0 ±  58.9
76.7 111.6 151.5
120.2 ±  54.5
F1 310=0.55, P=0.463
Cholesterol
mg/dl
284 254 304 377
374 ± 252
248 316 417
365 ± 210
F1 282=0.37, P=0.5453
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD.   N is the number of non-missing values.
Tests used: 1Kruskal-Wallis test; 2Pearson test; 3Wilcoxon test .

2.1 Spike Histogram

Show data for continuous variable stratified by treatment.

p <- with(pbc, histboxp(x=sgot, group=drug, sd=TRUE))
p

2.2 Data Visualization

The very low birthweight data set contains data on 671 infants born with a birth weight of under 1600 grams. We’ll plot gestational age by birthweight using three graphics systems: base graphics, ggplot, and plotly.

getHdata(vlbw)
# remove missing values
vlbw <- vlbw[complete.cases(vlbw[,c('sex','dead','gest','bwt')]),]

2.3 Base Graphics Example

Empty canvas - add each element into your plot.

# "grps" is a list with each combination of sex/dead
grps <- split(vlbw[,c('gest','bwt')], vlbw[,c('sex','dead')])
# default "plot" function; set up parameters but with no lines/points because type 'n'
plot(c(22,40), c(400,1600), type='n', xlab='Gestational Age', ylab='Birth Weight (grams)', axes=FALSE)
# axis 1 is below plot
axis(1, at=c(22,28,34,40), labels=c(22,28,34,40))
# axis 2 is left of plot
axis(2, at=seq(400,1600,by=400), labels=seq(400,1600,by=400))
## four different methods to add points to plot area
# data.frame with two columns: 1st column is associated with x variable
points(grps[['female.0']], col='black', pch=1)
# two arguments: first vector is associated with x variable
points(grps[['female.1']][,'gest'], grps[['female.1']][,'bwt'], col='gray', pch=0)
# like above, but jitter adds noise so points have less overlap
points(jitter(grps[['male.0']][,'gest'], 2), grps[['male.0']][,'bwt'], col='black', pch=2)
# LHS ~ RHS: with tilde, the left hand side is associated with y variable
points(grps[['male.1']][,'bwt'] ~ jitter(grps[['male.1']][,'gest'], 2), col='gray', pch=3)
legend(x=37, y=1000, legend=c('F:alive','F:dead','M:alive','M:dead'), col=c('black','gray','black','gray'), pch=c(1,0,2,3))

2.4 ggplot2 Example

Given a data set, choose the aesthetic mapping and geometry layer.

# `as.factor(dead)` makes categorical variable rather than continuous
p <- ggplot(data=vlbw) + aes(x=gest, y=bwt, color=sex, shape=as.factor(dead)) + geom_point()
p

p <- p + geom_jitter() + scale_x_continuous() + scale_y_continuous()
p

2.5 Plotly

Add interactive graphics, which is trivial when also using ggplot.

require(plotly)
ggplotly(p)
# plotly uses pipe operator (%>%) from magrittr package
# in this context it works much like "+" does with ggplot
plot_ly(type="box") %>%
  add_boxplot(y = grps[['female.0']][,'bwt'], name='F:0') %>%
  add_boxplot(y = grps[['female.1']][,'bwt'], name='F:1') %>%
  add_boxplot(y = grps[['male.0']][,'bwt'], name='M:0') %>%
  add_boxplot(y = grps[['male.1']][,'bwt'], name='M:1')

3 Models

3.1 Cardiovascular risk factor data

getHdata(diabetes)
html(contents(diabetes), levelType='table')

Data frame:diabetes

403 observations and 19 variables, maximum # NAs:262  
NameLabelsUnitsLevelsStorageNAs
idSubject IDinteger 0
cholTotal Cholesterolinteger 1
stab.gluStabilized Glucoseinteger 0
hdlHigh Density Lipoproteininteger 1
ratioCholesterol/HDL Ratiodouble 1
glyhbGlycosolated Hemoglobindouble 13
location2integer 0
ageyearsinteger 0
gender2integer 0
heightinchesinteger 5
weightpoundsinteger 1
frame3integer 12
bp.1sFirst Systolic Blood Pressureinteger 5
bp.1dFirst Diastolic Blood Pressureinteger 5
bp.2sSecond Systolic Blood Pressureinteger262
bp.2dSecond Diastolic Blood Pressureinteger262
waistinchesinteger 2
hipinchesinteger 2
time.ppnPostprandial Time when Labs were Drawnminutesinteger 3

VariableLevels
locationBuckingham
Louisa
gendermale
female
framesmall
medium
large

3.2 Formulas

The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.

  • y ~ a + b, + indicates to include both a and b as terms
  • y ~ a:b, : indicates the interaction of a and b
  • y ~ a*b, equivalent to y ~ a+b+a:b
  • y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
  • y ~ a + I(b+c), I indicates to use + in the arithmetic sense

See ?formula for more examples.

3.3 Model Fitting

The most simple model-fitting function is lm, which is used to fit linear models. It’s primary argument is a formula. Using the diabetes data set, we can fit waist size by weight.

m <- lm(waist ~ weight, data=diabetes)

This creates an lm object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.

  • coef
  • fitted
  • predict
  • residuals
  • vcov
  • summary
  • anova
names(m)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"         "model"        
# can access named elements with "$"
m$coefficients
(Intercept)      weight 
 16.5050329   0.1204712 
coef(m)
(Intercept)      weight 
 16.5050329   0.1204712 
head(fitted(m))
       1        2        3        4        5        6 
31.08205 42.76776 47.34567 30.84111 38.55127 39.39457 
# defaut "predict" gives same results
# sum(abs(fitted(m) - predict(m)) > 1e-8)
predict(m, data.frame(weight=c(150, 200, 250)))
       1        2        3 
34.57572 40.59928 46.62284 
head(residuals(m))
        1         2         3         4         5         6 
-2.082053  3.232237  1.654330  2.158890  5.448731 -3.394568 
vcov(m) # variance-covariance matrix
             (Intercept)        weight
(Intercept)  0.465908502 -0.0024927458
weight      -0.002492746  0.0000140231
# "summary" produces additional information about model
summary(m)

Call:
lm(formula = waist ~ weight, data = diabetes)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6835 -2.0850 -0.2021  1.7741 11.5330 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.505033   0.682575   24.18   <2e-16 ***
weight       0.120471   0.003745   32.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.02 on 398 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.7223,    Adjusted R-squared:  0.7216 
F-statistic:  1035 on 1 and 398 DF,  p-value: < 2.2e-16
coef(summary(m))
              Estimate  Std. Error  t value      Pr(>|t|)
(Intercept) 16.5050329 0.682574906 24.18054  3.970142e-80
weight       0.1204712 0.003744743 32.17077 9.082965e-113
anova(m)
Analysis of Variance Table

Response: waist
           Df Sum Sq Mean Sq F value    Pr(>F)    
weight      1 9438.0  9438.0    1035 < 2.2e-16 ***
Residuals 398 3629.4     9.1                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 Visualization

Create a scatterplot of weight and waist size.

# `qplot` can create default ggplot output
p <- qplot(weight, waist, data=diabetes)
p + geom_smooth(method="lm")
# the above uses the implied formula y~x
# p + geom_smooth(method="lm", formula = y ~ x)

Remove missing values and fit with LOESS curve.

diab <- diabetes[complete.cases(diabetes[,c('waist','weight')]),]
p <- qplot(weight, waist, data=diab)
p + geom_smooth(method="loess")

3.5 Data Manipulation

The Dominican Republic Hypertension dataset contains data on gender, age, and systolic and diastolic blood pressure from several villages in the DR.

getHdata(DominicanHTN)
html(describe(DominicanHTN))
DominicanHTN

5 Variables   381 Observations

village
image
nmissingdistinct
38108
lowest :Bare' Nuevo Batey Verde Carmona Cojobal Juan Sanchez
highest:Cojobal Juan Sanchez La AltagraciaLos Gueneos San Antonio
 Value        Bare' Nuevo   Batey Verde       Carmona       Cojobal  Juan Sanchez
 Frequency             32            41            64            59            53
 Proportion         0.084         0.108         0.168         0.155         0.139
                                                     
 Value      La Altagracia   Los Gueneos   San Antonio
 Frequency             40            57            35
 Proportion         0.105         0.150         0.092
 

gender
nmissingdistinct
38102
 Value      Female   Male
 Frequency     258    123
 Proportion  0.677  0.323
 

age
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3810690.99947.9716.9823303847596873
lowest : 15 17 18 19 20 , highest: 85 87 90 95 100
sbp: Systolic Blood Pressure mmHg
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3810570.99613328.35 98105118130150170180
lowest : 80 90 95 98 99 , highest: 196 200 204 210 236
dbp: Diastolic Blood Pressure mmHg
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3810410.99384.2315.58 64 70 76 82 92100110
lowest : 20 40 50 60 62 , highest: 115 118 120 130 152

3.6 Adding variables or transformations

The data.frame bracket syntax is data.frame[row_selection, column_selection].

# create "mean arterial pressure"
DominicanHTN[,'map'] <- (DominicanHTN[,'sbp'] + DominicanHTN[,'dbp'] * 2) / 3
# use `as.numeric` to convert TRUE/FALSE into 1/0
DominicanHTN[,'male'] <- as.numeric(DominicanHTN[,'gender'] == 'Male')
DominicanHTN[1:5,]
       village gender age sbp dbp      map male
1 Juan Sanchez   Male  56 150 100 116.6667    1
2 Juan Sanchez   Male  42 120  90 100.0000    1
3 Juan Sanchez   Male  69 120  90 100.0000    1
4 Juan Sanchez   Male  70 180  80 113.3333    1
5 Juan Sanchez   Male  62 138  78  98.0000    1

In this example, I collapse a variable into smaller bins - which should generally be avoided.

qage <- quantile(DominicanHTN[,'age'])
qage[1] <- qage[1]-1
qage
  0%  25%  50%  75% 100% 
  14   38   47   59  100 
DominicanHTN[,'ageGrp'] <- cut(DominicanHTN[,'age'], breaks=qage)
# could easily specify other breakpoints
# levels(cut(DominicanHTN[,'age'], breaks = c(0, 18, 100)))
# levels(cut(DominicanHTN[,'age'], breaks = seq(10, 100, by = 10)))

3.7 Filter

nrow(DominicanHTN)
[1] 381
nrow(DominicanHTN[DominicanHTN[,'gender'] == "Male",])
[1] 123
which(DominicanHTN[,'village'] %in% c('Carmona','San Antonio'))
 [1] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
[35] 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 249 250 251 252
[69] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283
cojMen <- DominicanHTN[DominicanHTN[,'village'] == 'Cojobal' & DominicanHTN[,'gender'] == "Male",]
cojMen
   village gender age sbp dbp       map male   ageGrp
54 Cojobal   Male  38 132  94 106.66667    1  (14,38]
55 Cojobal   Male  77 140  88 105.33333    1 (59,100]
56 Cojobal   Male  45 140  92 108.00000    1  (38,47]
57 Cojobal   Male  45 122  86  98.00000    1  (38,47]
58 Cojobal   Male  68 128  92 104.00000    1 (59,100]
59 Cojobal   Male  56  90  72  78.00000    1  (47,59]
60 Cojobal   Male  70 118  78  91.33333    1 (59,100]
61 Cojobal   Male  60 120  90 100.00000    1 (59,100]
62 Cojobal   Male  30 110  70  83.33333    1  (14,38]
63 Cojobal   Male  48 122  84  96.66667    1  (47,59]
64 Cojobal   Male  50 200 100 133.33333    1  (47,59]
65 Cojobal   Male  43 110  75  86.66667    1  (38,47]

3.8 Sort

order(cojMen[,'age'], decreasing=TRUE) # first value is element number with highest age
 [1]  2  7  5  8  6 11 10  3  4 12  1  9
# use row numbers to re-order a data.frame
cojMen[order(cojMen[,'age'], decreasing=TRUE),]
   village gender age sbp dbp       map male   ageGrp
55 Cojobal   Male  77 140  88 105.33333    1 (59,100]
60 Cojobal   Male  70 118  78  91.33333    1 (59,100]
58 Cojobal   Male  68 128  92 104.00000    1 (59,100]
61 Cojobal   Male  60 120  90 100.00000    1 (59,100]
59 Cojobal   Male  56  90  72  78.00000    1  (47,59]
64 Cojobal   Male  50 200 100 133.33333    1  (47,59]
63 Cojobal   Male  48 122  84  96.66667    1  (47,59]
56 Cojobal   Male  45 140  92 108.00000    1  (38,47]
57 Cojobal   Male  45 122  86  98.00000    1  (38,47]
65 Cojobal   Male  43 110  75  86.66667    1  (38,47]
54 Cojobal   Male  38 132  94 106.66667    1  (14,38]
62 Cojobal   Male  30 110  70  83.33333    1  (14,38]

3.9 Aggregate and counts

table(DominicanHTN[,'gender'])

Female   Male 
   258    123 
addmargins(with(DominicanHTN, table(village, gender)))
               gender
village         Female Male Sum
  Bare' Nuevo       19   13  32
  Batey Verde       30   11  41
  Carmona           39   25  64
  Cojobal           47   12  59
  Juan Sanchez      40   13  53
  La Altagracia     22   18  40
  Los Gueneos       41   16  57
  San Antonio       20   15  35
  Sum              258  123 381
with(DominicanHTN, tapply(age, village, mean))
  Bare' Nuevo   Batey Verde       Carmona       Cojobal  Juan Sanchez La Altagracia   Los Gueneos   San Antonio 
     47.53125      48.02439      49.00000      47.15254      47.45283      49.65000      45.15789      51.25714 
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
  gender      sbp
1 Female 133.1977
2   Male 132.5691
aggregate(age ~ village + gender, DominicanHTN, median)
         village gender  age
1    Bare' Nuevo Female 40.0
2    Batey Verde Female 43.0
3        Carmona Female 40.0
4        Cojobal Female 43.0
5   Juan Sanchez Female 47.0
6  La Altagracia Female 48.0
7    Los Gueneos Female 40.0
8    San Antonio Female 42.5
9    Bare' Nuevo   Male 48.0
10   Batey Verde   Male 52.0
11       Carmona   Male 53.0
12       Cojobal   Male 49.0
13  Juan Sanchez   Male 56.0
14 La Altagracia   Male 57.5
15   Los Gueneos   Male 51.0
16   San Antonio   Male 61.0
# easy to change statistical measure, or function
# aggregate(age ~ village + gender, DominicanHTN, max)

3.10 Exercise 2 & 3

Do exercise 2 & 3.

3.11 Rosner’s lead data

The lead exposure dataset was collected to study the well-being of children who lived near a lead smelting plant. The following analysis considers the lead exposure levels in 1972 and 1973, as well as age and the finger-wrist tapping score maxfwt.

getHdata(lead)
lead <- upData(lead,
       keep = c('ld72','ld73','age','maxfwt'),
       labels = c(age = 'Age'),
       units = c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))

Input object size: 53928 bytes; 39 variables 124 observations Kept variables ld72,ld73,age,maxfwt New object size: 14096 bytes; 4 variables 124 observations

html(contents(lead), maxlevels=10, levelType='table')

Data frame:lead

124 observations and 4 variables, maximum # NAs:25  
NameLabelsUnitsStorageNAs
ageAgeyearsdouble 0
ld72Blood Lead Levels, 1972mg/100*mlinteger 0
ld73Blood Lead Levels, 1973mg/100*mlinteger 0
maxfwtMaximum mean finger-wrist tapping scoreinteger25

html(describe(lead))
lead

4 Variables   124 Observations

age: Age years
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
12407318.9354.074 3.929 4.333 6.167 8.37512.02114.00015.000
lowest : 3.750000 3.833333 3.916667 4.000000 4.166667
highest:14.25000015.00000015.25000015.41666715.916667

ld72: Blood Lead Levels, 1972 mg/100*ml
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
1240470.99936.1617.2318.0021.0027.0034.0043.0057.0061.85
lowest : 1 2 10 14 18 , highest: 62 64 66 68 99
ld73: Blood Lead Levels, 1973 mg/100*ml
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
1240370.99831.7111.0618.1521.0024.0030.5037.0047.0050.85
lowest : 15 16 18 19 20 , highest: 52 53 54 57 58
maxfwt: Maximum mean finger-wrist tapping score
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
9925400.99851.9613.833.238.046.052.059.065.072.2
lowest : 13 14 23 26 34 , highest: 74 76 79 83 84

4 Introduction to RMS (Ordinary least squares regression)

The rms package includes the ols fitting function. Use datadist to compute summaries of the distributional characteristics of the predictors - or simply give it an entire data frame. datadist must be re-run if you add a new predictor or recode an old one.

Note: The ols function is not covered in the main course. Generalized least squares is. Ordinary least squares does not handle data that has correlation between residuals. It is covered here to give a simple overview of the structure of models in RMS.

require(rms)
dd <- datadist(lead)
options(datadist='dd')
f <- ols(maxfwt ~ age + ld72 + ld73, data=lead)
f
Frequencies of Missing Values Due to Each Variable
maxfwt    age   ld72   ld73 
    25      0      0      0 

Linear Regression Model
 
 ols(formula = maxfwt ~ age + ld72 + ld73, data = lead)
 
 
                 Model Likelihood    Discrimination    
                       Ratio Test           Indexes    
 Obs      99    LR chi2     62.25    R2       0.467    
 sigma9.5221    d.f.            3    R2 adj   0.450    
 d.f.     95    Pr(> chi2) 0.0000    g       10.104    
 
 Residuals
 
      Min       1Q   Median       3Q      Max 
 -33.9958  -4.9214   0.7596   5.1106  33.2590 
 
 
           Coef    S.E.   t     Pr(>|t|)
 Intercept 34.1059 4.8438  7.04 <0.0001 
 age        2.6078 0.3231  8.07 <0.0001 
 ld72      -0.0246 0.0782 -0.31 0.7538  
 ld73      -0.2390 0.1325 -1.80 0.0744  
 

rms provides many methods to work with the ols fit object.

  • coef
  • fitted
  • predict
  • resid
  • anova
  • summary
  • Predict
  • ggplot
  • Function
  • nomogram

4.1 Coefficient Estimates

coef(f)
 Intercept        age       ld72       ld73 
34.1058551  2.6078450 -0.0245978 -0.2389695 

4.2 Define R function representing fitted model

The default values for function arguments are medians.

g <- Function(f)
g
function (age = 8.375, ld72 = 34, ld73 = 30.5) 
{
    34.105855 + 2.607845 * age - 0.024597799 * ld72 - 0.23896951 * 
        ld73
}
<environment: 0x5568f55801a8>
g(age=10, ld72=21, ld73=c(21, 47))
[1] 54.64939 48.43618

4.3 Predicted values with standard errors

predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)
$linear.predictors
       1        2 
54.64939 48.43618 

$se.fit
       1        2 
1.391858 3.140361 

4.4 Residuals

r <- resid(f)
par(mfrow=c(2,2))
plot(fitted(f), r); abline(h=0)
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
# q-q plot to check normality
qqnorm(r)
qqline(r)

4.5 Plotting partial effects

Predict and plotp make one plot for each predictor. Predictors not shown in plot are set to constants (continuous:median, categorical:mode). 0.95 pointwise confidence limits for \(\hat{E}(y|x)\) are shown; use conf.int=FALSE to suppress CLs.

plotp(Predict(f))

Specify which predictors are plotted.

plotp(Predict(f, age))
plotp(Predict(f, age=3:15))
plotp(Predict(f, age=seq(3,16,length=150)))

Obtain confidence limits for \(\hat{y}\).

plotp(Predict(f, age, conf.type='individual'))

Combine plots.

p1 <- Predict(f, age, conf.int=0.99, conf.type='individual')
p2 <- Predict(f, age, conf.int=0.99, conf.type='mean')
p <- rbind(Individual=p1, Mean=p2)
ggplot(p)
plotp(Predict(f, ld73, age=3))
ggplot(Predict(f, ld73, age=c(3,9)))

3-d surface for two continuous predictors against \(\hat{y}\).

bplot(Predict(f, ld72, ld73))

4.6 Nomograms

plot(nomogram(f))

4.7 Point Estimates for Partial Effects

summary(f)
             Effects              Response : maxfwt 

 Factor Low     High   Diff.   Effect   S.E.   Lower 0.95 Upper 0.95
 age     6.1667 12.021  5.8542 15.26700 1.8914 11.5120    19.02200  
 ld72   27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773     2.09010  
 ld73   24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255     0.31234  

Adjust age to 5, which has no effect as the model does not include any interactions.

summary(f, age=5)
             Effects              Response : maxfwt 

 Factor Low     High   Diff.   Effect   S.E.   Lower 0.95 Upper 0.95
 age     6.1667 12.021  5.8542 15.26700 1.8914 11.5120    19.02200  
 ld72   27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773     2.09010  
 ld73   24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255     0.31234  

Effect of changing ld73 from 20 to 40.

summary(f, ld73=c(20,40))
             Effects              Response : maxfwt 

 Factor Low     High   Diff.   Effect   S.E.   Lower 0.95 Upper 0.95
 age     6.1667 12.021  5.8542 15.26700 1.8914  11.5120   19.02200  
 ld72   27.0000 43.000 16.0000 -0.39356 1.2511  -2.8773    2.09010  
 ld73   20.0000 40.000 20.0000 -4.77940 2.6495 -10.0390    0.48052  

If a predictor has a linear effect, a one-unit change can be used to get the confidence interval of its slope.

summary(f, age=5:6)
             Effects              Response : maxfwt 

 Factor Low High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
 age     5   6    1     2.60780 0.32308  1.9664    3.24920   
 ld72   27  43   16    -0.39356 1.25110 -2.8773    2.09010   
 ld73   24  37   13    -3.10660 1.72220 -6.5255    0.31234   

There is also a plot method for summary results.

plot(summary(f))

4.8 Getting Predicted Values

Give predict a fit object and data.frame.

predict(f, data.frame(age=3, ld72=21, ld73=21))
       1 
36.39448 
predict(f, data.frame(age=c(3,10), ld72=21, ld73=c(21,47)))
       1        2 
36.39448 48.43618 
newdat <- expand.grid(age=c(4,8), ld72=c(21, 47), ld73=c(21, 47))
newdat
  age ld72 ld73
1   4   21   21
2   8   21   21
3   4   47   21
4   8   47   21
5   4   21   47
6   8   21   47
7   4   47   47
8   8   47   47
predict(f, newdat)
       1        2        3        4        5        6        7        8 
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095 

Include confidence levels.

predict(f, newdat, conf.int=0.95)
$linear.predictors
       1        2        3        4        5        6        7        8 
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095 

$lower
       1        2        3        4        5        6        7        8 
33.97441 46.23595 32.15468 43.94736 25.68920 36.94167 27.17060 38.86475 

$upper
       1        2        3        4        5        6        7        8 
44.03023 52.63145 44.57088 53.64095 39.88902 49.49932 37.12854 46.29716 
predict(f, newdat, conf.int=0.95, conf.type='individual')
$linear.predictors
       1        2        3        4        5        6        7        8 
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095 

$lower
       1        2        3        4        5        6        7        8 
19.44127 30.26132 18.46566 29.27888 12.59596 23.30120 12.60105 23.31531 

$upper
       1        2        3        4        5        6        7        8 
58.56337 68.60609 58.25989 68.30944 52.98227 63.13979 51.69810 61.84659 

Brute-force predicted values.

b <- coef(f)
b[1] + b[2]*3 + b[3]*21 + b[4]*21
Intercept 
 36.39448 

Predicted values with Function.

# g <- Function(f)
g(age = c(3,8), ld72 = 21, ld73 = 21)
[1] 36.39448 49.43370
g(age = 3)
[1] 33.80449

4.9 ANOVA

Use anova to get all total effects and individual partial effects.

an <- anova(f)
an
                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P     
 age         1   5907.535742 5907.535742 65.15 <.0001
 ld72        1      8.972994    8.972994  0.10 0.7538
 ld73        1    295.044370  295.044370  3.25 0.0744
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001
 ERROR      95   8613.750674   90.671060             

Include corresponding variable names.

print(an, 'names')
                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P      Tested       
 age         1   5907.535742 5907.535742 65.15 <.0001 age          
 ld72        1      8.972994    8.972994  0.10 0.7538 ld72         
 ld73        1    295.044370  295.044370  3.25 0.0744 ld73         
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001 age,ld72,ld73
 ERROR      95   8613.750674   90.671060                           
print(an, 'subscripts')
                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P      Tested
 age         1   5907.535742 5907.535742 65.15 <.0001 1     
 ld72        1      8.972994    8.972994  0.10 0.7538 2     
 ld73        1    295.044370  295.044370  3.25 0.0744 3     
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001 1-3   
 ERROR      95   8613.750674   90.671060                    

Subscripts correspond to:
[1] age  ld72 ld73
print(an, 'dots')
                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P      Tested
 age         1   5907.535742 5907.535742 65.15 <.0001 .     
 ld72        1      8.972994    8.972994  0.10 0.7538  .    
 ld73        1    295.044370  295.044370  3.25 0.0744   .   
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001 ...   
 ERROR      95   8613.750674   90.671060                    

Subscripts correspond to:
[1] age  ld72 ld73

Combined partial effects.

anova(f, ld72, ld73)
                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS         F    P     
 ld72        1      8.972994   8.972994 0.10 0.7538
 ld73        1    295.044370 295.044370 3.25 0.0744
 REGRESSION  2    747.283558 373.641779 4.12 0.0192
 ERROR      95   8613.750674  90.671060            

4.10 Exercise 4

Time for exercise 4.

5 Cleaning and Importing Data

5.1 Text files

  • read.table - work with data in table format
  • read.csv - CSV files
  • read.delim - tab-delimited files
  • scan - more general function for reading in data

5.2 Binary representation of R objects

Compressed data and loads faster.

  • RData
  • feather package

5.3 Database connections

Allows data queries.

  • RODBC package
  • RSQLite package

5.4 Other statistical packages

  • Hmisc package: sas.get, sasxport.get
  • haven package: read_sas, read_sav, read_dta
  • foreign package
  • Stat/Transfer - not free nor open source

5.5 Data Cleaning

An R data.frame consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). Data sets often need to be manipulated to become well-structured.

Hadley Wickham’s Tidy Data provides an excellent overview on how to clean messy data sets.

5.6 Column header contains values

# politics and religion
senateReligion <- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
                          Democrats=c(20,15,8,1,1,3),
                          Republicans=c(38,9,0,5,0,0))
senateReligion
            religion Democrats Republicans
1         Protestant        20          38
2           Catholic        15           9
3             Jewish         8           0
4             Mormon         1           5
5           Buddhist         1           0
6 Don't Know/Refused         3           0
senRel <- cbind(senateReligion[,'religion'], stack(senateReligion[,c('Democrats','Republicans')]))
names(senRel) <- c('religion','count','party')
senRel
             religion count       party
1          Protestant    20   Democrats
2            Catholic    15   Democrats
3              Jewish     8   Democrats
4              Mormon     1   Democrats
5            Buddhist     1   Democrats
6  Don't Know/Refused     3   Democrats
7          Protestant    38 Republicans
8            Catholic     9 Republicans
9              Jewish     0 Republicans
10             Mormon     5 Republicans
11           Buddhist     0 Republicans
12 Don't Know/Refused     0 Republicans

5.7 Multiple variables in a column

pets <- data.frame(county=c('Davidson','Rutherford','Cannon'),
                   male.dog=c(50,150,70), female.dog=c(60,150,70),
                   male.cat=c(30,100,50), female.cat=c(30,70,40),
                   male.horse=c(6,30,20), female.horse=c(6,28,19))
pets
      county male.dog female.dog male.cat female.cat male.horse female.horse
1   Davidson       50         60       30         30          6            6
2 Rutherford      150        150      100         70         30           28
3     Cannon       70         70       50         40         20           19
pets2 <- cbind(pets[,'county'], stack(pets[,-1]))
names(pets2) <- c('county','count','ind')
pets2
       county count          ind
1    Davidson    50     male.dog
2  Rutherford   150     male.dog
3      Cannon    70     male.dog
4    Davidson    60   female.dog
5  Rutherford   150   female.dog
6      Cannon    70   female.dog
7    Davidson    30     male.cat
8  Rutherford   100     male.cat
9      Cannon    50     male.cat
10   Davidson    30   female.cat
11 Rutherford    70   female.cat
12     Cannon    40   female.cat
13   Davidson     6   male.horse
14 Rutherford    30   male.horse
15     Cannon    20   male.horse
16   Davidson     6 female.horse
17 Rutherford    28 female.horse
18     Cannon    19 female.horse

Watch out for factor variables.

tryCatch(strsplit(pets2[,'ind'], '\\.'), error=function(e){ e })
<simpleError in strsplit(pets2[, "ind"], "\\."): non-character argument>

First for loop, allows iterating over a sequence.

sexAnimal <- strsplit(as.character(pets2[,'ind']), '\\.')
pets2[,c('sex','animal')] <- NA
for(i in seq(nrow(pets2))) {
  pets2[i, c('sex','animal')] <- sexAnimal[[i]]
}
pets2[,'ind'] <- NULL
pets2
       county count    sex animal
1    Davidson    50   male    dog
2  Rutherford   150   male    dog
3      Cannon    70   male    dog
4    Davidson    60 female    dog
5  Rutherford   150 female    dog
6      Cannon    70 female    dog
7    Davidson    30   male    cat
8  Rutherford   100   male    cat
9      Cannon    50   male    cat
10   Davidson    30 female    cat
11 Rutherford    70 female    cat
12     Cannon    40 female    cat
13   Davidson     6   male  horse
14 Rutherford    30   male  horse
15     Cannon    20   male  horse
16   Davidson     6 female  horse
17 Rutherford    28 female  horse
18     Cannon    19 female  horse

5.8 Variables in a row

set.seed(1000)
d1 <- rnorm(1000, 5, 2)
d2 <- runif(1000, 0, 10)
d3 <- rpois(1000, 5)
d4 <- rbinom(1000, 10, 0.5)
x <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
           stat=rep(c('mean','sd'),4),
           value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x
  distribution stat    value
1       normal mean 4.975048
2       normal   sd 1.963494
3      uniform mean 4.958261
4      uniform   sd 2.870631
5      poisson mean 5.015000
6      poisson   sd 2.232882
7     binomial mean 5.054000
8     binomial   sd 1.542551
cbind(distribution=x[c(1,3,5,7),'distribution'], unstack(x, form=value~stat))
  distribution     mean       sd
1       normal 4.975048 1.963494
2      uniform 4.958261 2.870631
3      poisson 5.015000 2.232882
4     binomial 5.054000 1.542551

5.9 Normalization and merging

employees <- data.frame(id=1:4, Name=c('Eddie','Andrea','Steve','Theresa'),
                        job=c('engineer','accountant','statistician','technician'))
set.seed(11)
hours <- data.frame(id=sample(4, 10, replace=TRUE),
                    week=1:10, hours=sample(30:60, 10, replace=TRUE))
merge(employees, hours)
   id    Name        job week hours
1   1   Eddie   engineer    6    59
2   1   Eddie   engineer    7    50
3   1   Eddie   engineer    4    59
4   2  Andrea accountant    1    50
5   2  Andrea accountant    9    58
6   2  Andrea accountant    2    51
7   4 Theresa technician    3    59
8   4 Theresa technician    8    32
9   4 Theresa technician    5    36
10  4 Theresa technician   10    40
# merge(employees, hours, by.x='id', by.y='id')
empHours <- merge(employees, hours, all.x=TRUE)
empHours
   id    Name          job week hours
1   1   Eddie     engineer    6    59
2   1   Eddie     engineer    7    50
3   1   Eddie     engineer    4    59
4   2  Andrea   accountant    1    50
5   2  Andrea   accountant    9    58
6   2  Andrea   accountant    2    51
7   3   Steve statistician   NA    NA
8   4 Theresa   technician    3    59
9   4 Theresa   technician    8    32
10  4 Theresa   technician    5    36
11  4 Theresa   technician   10    40
unique(empHours[,c('id','Name','job')])
  id    Name          job
1  1   Eddie     engineer
4  2  Andrea   accountant
7  3   Steve statistician
8  4 Theresa   technician
empHours[,c('id','week','hours')]
   id week hours
1   1    6    59
2   1    7    50
3   1    4    59
4   2    1    50
5   2    9    58
6   2    2    51
7   3   NA    NA
8   4    3    59
9   4    8    32
10  4    5    36
11  4   10    40

Exclude rows with missing data.

empHours[!is.na(empHours[,'week']), c('id','week','hours')]
   id week hours
1   1    6    59
2   1    7    50
3   1    4    59
4   2    1    50
5   2    9    58
6   2    2    51
8   4    3    59
9   4    8    32
10  4    5    36
11  4   10    40

5.10 Combining data

rbind(sexAnimal[[1]], sexAnimal[[2]], sexAnimal[[3]])
     [,1]   [,2] 
[1,] "male" "dog"
[2,] "male" "dog"
[3,] "male" "dog"
do.call(rbind, sexAnimal)
      [,1]     [,2]   
 [1,] "male"   "dog"  
 [2,] "male"   "dog"  
 [3,] "male"   "dog"  
 [4,] "female" "dog"  
 [5,] "female" "dog"  
 [6,] "female" "dog"  
 [7,] "male"   "cat"  
 [8,] "male"   "cat"  
 [9,] "male"   "cat"  
[10,] "female" "cat"  
[11,] "female" "cat"  
[12,] "female" "cat"  
[13,] "male"   "horse"
[14,] "male"   "horse"
[15,] "male"   "horse"
[16,] "female" "horse"
[17,] "female" "horse"
[18,] "female" "horse"

5.11 Exercise 5

Time for exercise 5.

6 Simulation

Plot mean while sampling from normal distribution

n <- 100
imean <- numeric(n)
smean <- numeric(n)
for(i in seq(n)) {
  imean[i] <- mean(rnorm(10))
  smean[i] <- mean(imean[seq(i)])
}
plot(imean, ylab='Sample Mean')
abline(h=0, lty=3)
lines(smean)

n <- 1000
sig <- numeric(n)
for(i in seq(n)) {
  grp1 <- rnorm(15, 60, 5)
  grp2 <- rnorm(15, 65, 5)
  sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)
[1] 0.75

Add some flexibility to our simulation by creating a function.

tTestPower <- function(n=15, mu1=60, mu2=65, sd=5, nsim=1000) {
  sig <- numeric(nsim)
  for(i in seq(nsim)) {
    grp1 <- rnorm(n, mu1, sd)
    grp2 <- rnorm(n, mu2, sd)
    sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
  }
  mean(sig)
}
tTestPower()
[1] 0.755
tTestPower(25, nsim=10000)
[1] 0.9336

There’s already a function to compute the power of a two-sample t test.

power.t.test(n=25, delta=5, sd=5)$power
[1] 0.9337076

7 Simulation and Bootstrapping

Simulation is a wonderful tool for really understanding just how well a method performs. Sometimes the modeling and techniques are complex enough that analytical and numerical properties are difficult to ascertain. Simulation can help answer that question by doing a procedure on random data repeatedly and examining how well the technique performs.

Bootstrap sampling is a related area that takes multiple random samples from a data set and performs estimates of statistical properties. It is robust in that it makes few assumptions about the structure of the data.

7.1 Bootstrap sample

true_mu <- 0
x <- rnorm(100, true_mu)
R <- 999
res <- matrix(nrow=R, ncol=6)
colnames(res) <- c('mu','se','lb','ub','coverage','bias')
for(i in seq(R)) {
  r <- sample(x, replace=TRUE)
  res[i,'mu'] <- mean(r)
  res[i,'se'] <- sd(r) / sqrt(length(r))
}
res[,'lb'] <- res[,'mu'] + qnorm(0.025) * res[,'se']
res[,'ub'] <- res[,'mu'] + qnorm(0.975) * res[,'se']
res[,'coverage'] <- res[,'lb'] < true_mu & res[,'ub'] > true_mu
res[,'bias'] <- res[,'mu'] - true_mu
res[1:5,]
               mu         se          lb        ub coverage         bias
[1,] 0.0004065727 0.09355070 -0.18294943 0.1837626        1 0.0004065727
[2,] 0.0450594243 0.10109567 -0.15308446 0.2432033        1 0.0450594243
[3,] 0.0153579674 0.09152029 -0.16401851 0.1947344        1 0.0153579674
[4,] 0.1331559513 0.10133702 -0.06546097 0.3317729        1 0.1331559513
[5,] 0.1953979673 0.09007560  0.01885303 0.3719429        0 0.1953979673
vals <- colMeans(res)
basicCI <- 2 * mean(x) - quantile(res[,'mu'], probs=c(0.975, 0.025))
out <- sprintf("empirical SE: %.3f
MSE: %.3f
basic bootstrap confidence interval: (%.3f, %.3f)
coverage probability: %.3f
mean bias: %.3f
sample mean: %.3f
", sd(res[,'mu']), vals['se'], basicCI[1], basicCI[2], vals['coverage'], vals['bias'], mean(x))
cat(out)
empirical SE: 0.096
MSE: 0.096
basic bootstrap confidence interval: (-0.175, 0.197)
coverage probability: 0.931
mean bias: 0.020
sample mean: 0.023

7.2 Control Structures

Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.

7.3 Root finding with the Newton-Raphson algorithm

The for loop has already been introduced. Two other important control structures are while loops and if statements.

f <- function(x) x^3 + x^2 - 3*x - 3
fp <- function(x) 3*x^2 + 2*x - 3
plot(f, xlim=c(-4,4))

x <- -2
i <- 1
while(abs(f(x)) > 1e-8) {
  x <- x - f(x) / fp(x)
  i <- i + 1
  if(i > 20) {
    print("does not converge")
    break
  }
}
x
[1] -1.732051
f(x)
[1] -4.440892e-16

7.4 Exercise 6 & 7

Final exercises, 6 & 7.

8 Worth Mentioning

8.1 tidyverse

Wickham’s “Tidy Data” philosophy was concurrent with the development of a suite of R packages useful for data management. This includes ggplot2 and haven as well as many others such as dplyr, stringr, and tidyr.

8.2 data.table

data.table is another beneficial package for data manipulation. It provides similar functionality to the data.frame, but works well with large data sets. It minimizes memory operations. It is best to develop ones code using either data.frame or data.table from the beginning as they are not exactly equivalent and surprises can happen upon dropping in one for the other.

9 Computing Environment1

 R version 4.2.0 (2022-04-22)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 18.04.6 LTS
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] rms_6.3-0       SparseM_1.78    plotly_4.10.0   Hmisc_4.7-0     ggplot2_3.3.5   Formula_1.2-3   survival_3.2-13 lattice_0.20-45
 
To cite R in publications use:

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-0, https://CRAN.R-project.org/package=Hmisc.

To cite the rms package in publications use:

Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, https://CRAN.R-project.org/package=rms.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

To cite the survival package in publications use:

Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.


  1. mu is a copy of the part of the Hmisc package object markupSpecs that is for html. It includes a function session that renders the session environment (including package versions) in html.↩︎