Foundations

Does everyone have a working laptop?

This course is designed to bring one up to speed on the tools necessary to follow all the examples in Frank Harrell’s ``Regression Modeling Strategies’’. It assumes a rudimentary knowledge of R or at least having some experience with coding.

If there are any questions or areas you’d like covered in more detail please do not hesitate to speak up.

RStudio has multiple panes. One can customize what’s open at any point. In general the four quadrants are:

Upper Left: Open Files
Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
Lower Left: Console. The interactive command to the running R interpreter.
Lower Right: Directory browser, Plots, Packages, Help.

If one accidentally closes one, it’s easily opened again from the View menu.

Note Menu items have underlines for certain characters. These are shortcuts for Alt-letter that one can access quickly from the keyboard.

Creating scripts

R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.

From RStudio, select File -> New File and there is a large number of options. The important ones for this class are:

R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click source in the upper right of the file and it will execute in the console below.
R Notebook. An older version that allows for a preview of the document. The preview is rebuilt on every save.
R Markdown. Same as R Notebook, but requires a manual Knit to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.

A common question is “Which file type should I choose?” It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. For writing code that will import data or perform data manipulation, I prefer R scripts. For teaching with interactive examples, I prefer using R notebook. I like to use R markdown when creating output that will be shared (either as HTML or PDF).

Creating projects

A project brings together several scripts as a project. One can create a project using File -> New Project and select an existing directory or create a blank one as desired. Existing projects can be opened with File -> Open Project.

Session & Working Directory

There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.

It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command sessionInfo() provides a nice overview of this that is good for checking package versions loaded.

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.25   R6_2.4.1        jsonlite_1.7.1  magrittr_2.0.1  evaluate_0.14   rlang_0.4.11    stringi_1.5.3   jquerylib_0.1.4
##  [9] bslib_0.3.1     rmarkdown_2.11  tools_4.2.0     stringr_1.4.0   xfun_0.30       yaml_2.2.1      fastmap_1.1.0   compiler_4.2.0 
## [17] htmltools_0.5.2 knitr_1.36      sass_0.4.0

The current working directory is available via getwd() and settable via setwd("directory"). All file loads and saves are relative to the current working directory.

Environment

R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.

x <- 3
f <- function()
{ # One can view the parenthesis as capturing a frame
  x <- 4   # Local variable inner frame
  cat("This is the inner 'x'", x, "\n") 
}
f()

## This is the inner 'x' 4

cat("This is the outer 'x'", x, "\n")

## This is the outer 'x' 3

Packages

Install the packages Hmisc, rms, knitr, rmarkdown, and plotly through RStudio.

Packages contain R code and may often require the installation of other packages. When I install Hmisc, I see that its dependency acepack is automatically installed:

Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb

* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)

Install and updating via CRAN

Install through CRAN repositories

# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown', 'plotly'))
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)

Install and update via Github

Github is a code sharing platform. Packages are typically in a rawer state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the remotes package.

install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')

Loading Packages

Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.

library(Hmisc)
require(Hmisc)

Help

From the R console one can get help about any function by appending it with a ?.

?lm

The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.

Rmarkdown

Earlier versions of this course focused on LaTeX document generation. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Given that the focus of this course is on Regression Modeling strategies and not document generation, the current focus is on using Rmarkdown and showing some interactive plots via HTML. Full support and possibilities are shown at https://rmarkdown.rstudio.com/.

The cheat sheet is available at https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf, or through Rstudio’s menu bar: Help -> Cheetsheets -> R Markdown Cheet Sheet.

Quarto is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative, though it must be installed in addition to R and RStudio.

Best Practices

A blog post RMarkdown Driven Development by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.

One takeaway is to avoid the general pitfall of putting large amounts of data processing inside ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts outside the report. The report should load the results of this process.

Another great reference is Dr. Harrell’s analysis project workflow. His content mirrors much of this workshop and will reinforce its material. It can also be used as a template for quarto as its source code is available (by clicking the “</> Code” link).

Caching

Chunks in Rmarkdown can be cached. I.e., the results of the last computation are saved and it’s not rerun until a manual override by clicking on that section is done. It is recommended that one not do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes.

A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It’s very similar to caching, but is not prone to later confusion as it’s a bit more direct that the report is loading the processed data file. In short do large data processing tasks outside of reports.

Harrell’s RMS

See chapters 1 and 9 of Biostatistics for Biomedical Research.

Setup

Most Rmarkdown documents have a setup section that loads required libraries and sets up some options. The following snippet is the basic setup for this course. It loads the Hmisc package. Sets knitr to use markdown by default. Sets the graphics option to use plotly. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable mu.

require(Hmisc)

knitrSet(lang='markdown', h=4.5, fig.path='png/', fig.align='left')
# knitrSet redirects all messages to messages.txt
options(grType='plotly') # for certain graphics functions
mu <- markupSpecs$html   # markupSpecs is in Hmisc

Data

The getHdata function is used to fetch a dataset from the Vanderbilt Biostatistics DataSets web site. upData is used to

create a new variable from an old one
add labels to 2 variables
add units to the new variable
remove the old variable
automatically move units of measurements from parenthetical expressions in labels to separate units attributes used by Hmisc and rms functions for table making and graphics

contents is used to print a data dictionary, run through an html method for nicer output.

getHdata(pbc)
pbc <- upData(pbc,
              fu.yrs = fu.days / 365.25,
              labels = c(fu.yrs = 'Follow-up Time',
                         status = 'Death or Liver Transplantation'),
              units = c(fu.yrs = 'year'),
              drop  = 'fu.days',
              moveUnits=TRUE, html=TRUE)

Input object size:   80952 bytes;    19 variables    418 observations
 Label for bili changed from Serum Bilirubin (mg/dl) to Serum Bilirubin 
    units set to mg/dl 
 Label for albumin changed from Albumin (gm/dl) to Albumin 
    units set to gm/dl 
 Label for protime changed from Prothrombin Time (sec.) to Prothrombin Time 
    units set to sec. 
 Label for alk.phos changed from Alkaline Phosphatase (U/liter) to Alkaline Phosphatase 
    units set to U/liter 
 Label for sgot changed from SGOT (U/ml) to SGOT 
    units set to U/ml 
 Label for chol changed from Cholesterol (mg/dl) to Cholesterol 
    units set to mg/dl 
 Label for trig changed from Triglycerides (mg/dl) to Triglycerides 
    units set to mg/dl 
 Label for platelet changed from Platelets (per cm^3/1000) to Platelets 
    units set to per cm^3/1000 
 Label for copper changed from Urine Copper (ug/day) to Urine Copper 
    units set to ug/day 
 Added variable     fu.yrs
 Dropped variable   fu.days
 New object size:   84744 bytes;    19 variables    418 observations

html(contents(pbc), maxlevels=10, levelType='table')

Data frame:pbc

418 observations and 19 variables, maximum # NAs:136

Name	Labels	Units	Levels	Storage	NAs
bili	Serum Bilirubin	mg/dl		double	0
albumin	Albumin	gm/dl		double	0
stage	Histologic Stage, Ludwig Criteria			integer	6
protime	Prothrombin Time	sec.		double	2
sex			2	integer	0
age	Age			double	0
spiders			2	integer	106
hepatom			2	integer	106
ascites			2	integer	106
alk.phos	Alkaline Phosphatase	U/liter		double	106
sgot	SGOT	U/ml		double	106
chol	Cholesterol	mg/dl		integer	134
trig	Triglycerides	mg/dl		integer	136
platelet	Platelets	per cm^3/1000		integer	110
drug			3	integer	0
status	Death or Liver Transplantation			integer	0
edema			3	integer	0
copper	Urine Copper	ug/day		integer	108
fu.yrs	Follow-up Time	year		double	0

Variable	Levels
sex	male
	female
spiders, hepatom	absent
ascites	present
drug	D-penicillamine
	placebo
	not randomized
edema	no edema
	edema, no diuretic therapy
	edema despite diuretic therapy

Exercise 1

Do exercise 1. Pull the example exercise.Rmd from github as your starting template.

Descriptive Statistics Without Stratification

# did have results='asis' above
d <- describe(pbc)
html(d, size=80, scroll=TRUE)

pbc

19 Variables 418 Observations

bili: Serum Bilirubin mg/dl

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
418	0	98	0.998	3.221	3.742	0.50	0.60	0.80	1.40	3.40	8.03	14.00

lowest : 0.3 0.4 0.5 0.6 0.7 , highest: 21.6 22.5 24.5 25.5 28.0

albumin: Albumin gm/dl

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
418	0	154	1	3.497	0.473	2.750	2.967	3.243	3.530	3.770	4.010	4.141

lowest : 1.96 2.10 2.23 2.27 2.31 , highest: 4.30 4.38 4.40 4.52 4.64

stage: Histologic Stage, Ludwig Criteria

n	missing	distinct	Info	Mean	Gmd
412	6	4	0.893	3.024	0.9519

 Value          1     2     3     4
 Frequency     21    92   155   144
 Proportion 0.051 0.223 0.376 0.350

protime: Prothrombin Time sec.

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
416	2	48	0.998	10.73	1.029	9.60	9.80	10.00	10.60	11.10	12.00	12.45

lowest : 9.0 9.1 9.2 9.3 9.4 , highest: 13.8 14.1 15.2 17.1 18.0

sex

n	missing	distinct
418	0	2

 Value        male female
 Frequency      44    374
 Proportion  0.105  0.895

age: Age

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
418	0	345	1	50.74	11.96	33.84	36.37	42.83	51.00	58.24	64.30	67.92

lowest : 26.27789 28.88433 29.55510 30.27515 30.57358 , highest: 74.52430 75.00000 75.01164 76.70910 78.43943

spiders

n	missing	distinct
312	106	2

 Value       absent present
 Frequency      222      90
 Proportion   0.712   0.288

hepatom

n	missing	distinct
312	106	2

 Value       absent present
 Frequency      152     160
 Proportion   0.487   0.513

ascites

n	missing	distinct
312	106	2

 Value       absent present
 Frequency      288      24
 Proportion   0.923   0.077

alk.phos: Alkaline Phosphatase U/liter

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
312	106	295	1	1983	1760	599.6	663.0	871.5	1259.0	1980.0	3826.4	6669.9

lowest : 289.0 310.0 369.0 377.0 414.0 , highest: 11046.6 11320.2 11552.0 12258.8 13862.4

sgot: SGOT U/ml

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
312	106	179	1	122.6	60.45	54.25	60.45	80.60	114.70	151.90	196.47	219.25

lowest : 26.35 28.38 41.85 43.40 45.00 , highest: 288.00 299.15 328.60 338.00 457.25

chol: Cholesterol mg/dl

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
284	134	201	1	369.5	194.5	188.4	213.6	249.5	309.5	400.0	560.8	674.0

lowest : 120 127 132 149 151 , highest: 1336 1480 1600 1712 1775

trig: Triglycerides mg/dl

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
282	136	146	1	124.7	64.07	56.00	63.10	84.25	108.00	151.00	195.00	230.95

lowest : 33 44 46 49 50 , highest: 319 322 382 432 598

platelet: Platelets per cm³/1000

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
308	110	210	1	261.9	107.8	117.7	139.7	199.8	257.0	322.5	386.5	430.6

lowest : 62 70 71 79 80 , highest: 493 514 518 539 563

drug

n	missing	distinct
418	0	3

 Value      D-penicillamine         placebo  not randomized
 Frequency              154             158             106
 Proportion           0.368           0.378           0.254

status: Death or Liver Transplantation

n	missing	distinct	Info	Sum	Mean	Gmd
418	0	2	0.71	161	0.3852	0.4748

edema

n	missing	distinct
418	0	3

 Value                            no edema     edema, no diuretic therapy
 Frequency                             354                             44
 Proportion                          0.847                          0.105
                                          
 Value      edema despite diuretic therapy
 Frequency                              20
 Proportion                          0.048

copper: Urine Copper ug/day

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
310	108	158	1	97.65	83.16	17.45	24.00	41.25	73.00	123.00	208.10	249.20

lowest : 4 9 10 11 12 , highest: 412 444 464 558 588

fu.yrs: Follow-up Time year

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
418	0	399	1	5.251	3.429	0.671	1.661	2.992	4.736	7.155	9.649	11.063

lowest :	0.1122519	0.1177276	0.1396304	0.1943874	0.2108145
highest:	12.3203285	12.3449692	12.3832991	12.4736482	13.1279945

p <- plot(d)

# "p" has two named elements that we can access with "$"
p$Categorical # or, p[['Categorical']]

p$Continuous

Stratified Descriptive Statistics

Produce stratified quantiles, means/SD, and proportions by treatment group. Plot the results before rendering as an advanced HTML table:

categorical variables: a single dot chart
continuous variables: a series of extended box plots

s <- summaryM(bili + albumin + stage + protime + sex + age + spiders +
              alk.phos + sgot + chol ~ drug, data=pbc,
                            overall=FALSE, test=TRUE)

Plotting

plot(s, which='categorical')

plot(s, which='continuous', vars=1:4)

plot(s, which='continuous', vars=5:7)

html(s, caption='Baseline characteristics by randomized treatment',
     exclude1=TRUE, npct='both', digits=3,
     prmsd=TRUE, brmsd=TRUE, msdsize=mu$smaller2)

	N	D-penicillamine N=154	placebo N=158	not randomized N=106	Test Statistic
Baseline characteristics by randomized treatment.
Serum Bilirubin mg/dl	418	0.725 1.300 3.600 3.649 ± 5.282	0.800 1.400 3.200 2.873 ± 3.629	0.725 1.400 3.075 3.117 ± 4.043	F_2 415=0.03, P=0.972¹
Albumin gm/dl	418	3.342 3.545 3.777 3.524 ± 0.396	3.212 3.565 3.830 3.516 ± 0.443	3.125 3.470 3.720 3.431 ± 0.435	F_2 415=2.13, P=0.12¹
Histologic Stage, Ludwig Criteria : 1	412	0.03 ⁴⁄₁₅₄	0.08 ¹²⁄₁₅₈	0.05 ⁵⁄₁₀₀	χ²₆=5.33, P=0.502²
2		0.21 ³²⁄₁₅₄	0.22 ³⁵⁄₁₅₈	0.25 ²⁵⁄₁₀₀
3		0.42 ⁶⁴⁄₁₅₄	0.35 ⁵⁶⁄₁₅₈	0.35 ³⁵⁄₁₀₀
4		0.35 ⁵⁴⁄₁₅₄	0.35 ⁵⁵⁄₁₅₈	0.35 ³⁵⁄₁₀₀
Prothrombin Time sec.	416	10.000 10.600 11.400 10.800 ± 1.138	10.025 10.600 11.000 10.653 ± 0.851	10.100 10.600 11.000 10.750 ± 1.078	F_2 413=0.23, P=0.795¹
sex : female	418	0.90 ¹³⁹⁄₁₅₄	0.87 ¹³⁷⁄₁₅₈	0.92 ⁹⁸⁄₁₀₆	χ²₂=2.38, P=0.304²
Age	418	41.43 48.11 55.80 48.58 ± 9.96	42.98 51.93 58.90 51.42 ± 11.01	46.00 53.00 61.00 52.87 ± 9.78	F_2 415=6.1, P=0.002¹
spiders	312	0.29 ⁴⁵⁄₁₅₄	0.28 ⁴⁵⁄₁₅₈		χ²₁=0.02, P=0.885²
Alkaline Phosphatase U/liter	312	922 1283 1950 1943 ± 2102	841 1214 2028 2021 ± 2183		F_1 310=0.06, P=0.812³
SGOT U/ml	312	83.8 117.4 151.9 125.0 ± 58.9	76.7 111.6 151.5 120.2 ± 54.5		F_1 310=0.55, P=0.46³
Cholesterol mg/dl	284	254 304 377 374 ± 252	248 316 417 365 ± 210		F_1 282=0.37, P=0.545³
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD. N is the number of non-missing values. Tests used: ¹Kruskal-Wallis test; ²Pearson test; ³Wilcoxon test .

Spike Histogram

Show data for continuous variable stratified by treatment.

p <- with(pbc, histboxp(x=sgot, group=drug, sd=TRUE))
p

Data Visualization

The very low birthweight data set contains data on 671 infants born with a birth weight of under 1600 grams. We’ll plot gestational age by birthweight using three graphics systems: base graphics, ggplot, and plotly.

getHdata(vlbw)
# remove missing values
vlbw <- vlbw[complete.cases(vlbw[,c('sex','dead','gest','bwt')]),]

Base Graphics Example

Empty canvas - add each element into your plot.

# "grps" is a list with each combination of sex/dead
grps <- split(vlbw[,c('gest','bwt')], vlbw[,c('sex','dead')])
# default "plot" function; set up parameters but with no lines/points because type 'n'
plot(c(22,40), c(400,1600), type='n', xlab='Gestational Age', ylab='Birth Weight (grams)', axes=FALSE)
# axis 1 is below plot
axis(1, at=c(22,28,34,40), labels=c(22,28,34,40))
# axis 2 is left of plot
axis(2, at=seq(400,1600,by=400), labels=seq(400,1600,by=400))
## four different methods to add points to plot area
# data.frame with two columns: 1st column is associated with x variable
points(grps[['female.0']], col='black', pch=1)
# two arguments: first vector is associated with x variable
points(grps[['female.1']][,'gest'], grps[['female.1']][,'bwt'], col='gray', pch=0)
# like above, but jitter adds noise so points have less overlap
points(jitter(grps[['male.0']][,'gest'], 2), grps[['male.0']][,'bwt'], col='black', pch=2)
# LHS ~ RHS: with tilde, the left hand side is associated with y variable
points(grps[['male.1']][,'bwt'] ~ jitter(grps[['male.1']][,'gest'], 2), col='gray', pch=3)
legend(x=37, y=1000, legend=c('F:alive','F:dead','M:alive','M:dead'), col=c('black','gray','black','gray'), pch=c(1,0,2,3))

ggplot2 Example

Given a data set, choose the aesthetic mapping and geometry layer.

# `as.factor(dead)` makes categorical variable rather than continuous
p <- ggplot(data=vlbw) + aes(x=gest, y=bwt, color=sex, shape=as.factor(dead)) + geom_point()
p

p <- p + geom_jitter() + scale_x_continuous() + scale_y_continuous()
p

Plotly

Add interactive graphics, which is trivial when also using ggplot.

require(plotly)

ggplotly(p)

# plotly uses pipe operator (%>%) from magrittr package
# in this context it works much like "+" does with ggplot
plot_ly(type="box") %>%
  add_boxplot(y = grps[['female.0']][,'bwt'], name='F:0') %>%
  add_boxplot(y = grps[['female.1']][,'bwt'], name='F:1') %>%
  add_boxplot(y = grps[['male.0']][,'bwt'], name='M:0') %>%
  add_boxplot(y = grps[['male.1']][,'bwt'], name='M:1')

Models

Cardiovascular risk factor data

getHdata(diabetes)
html(contents(diabetes), levelType='table')

Data frame:diabetes

403 observations and 19 variables, maximum # NAs:262

Name	Labels	Units	Levels	Storage	NAs
id	Subject ID			integer	0
chol	Total Cholesterol			integer	1
stab.glu	Stabilized Glucose			integer	0
hdl	High Density Lipoprotein			integer	1
ratio	Cholesterol/HDL Ratio			double	1
glyhb	Glycosolated Hemoglobin			double	13
location			2	integer	0
age		years		integer	0
gender			2	integer	0
height		inches		integer	5
weight		pounds		integer	1
frame			3	integer	12
bp.1s	First Systolic Blood Pressure			integer	5
bp.1d	First Diastolic Blood Pressure			integer	5
bp.2s	Second Systolic Blood Pressure			integer	262
bp.2d	Second Diastolic Blood Pressure			integer	262
waist		inches		integer	2
hip		inches		integer	2
time.ppn	Postprandial Time when Labs were Drawn	minutes		integer	3

Variable	Levels
location	Buckingham
	Louisa
gender	male
	female
frame	small
	medium
	large

Formulas

The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.

y ~ a + b, + indicates to include both a and b as terms
y ~ a:b, : indicates the interaction of a and b
y ~ a*b, equivalent to y ~ a+b+a:b
y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
y ~ a + I(b+c), I indicates to use + in the arithmetic sense

See ?formula for more examples.

Model Fitting

The most simple model-fitting function is lm, which is used to fit linear models. It’s primary argument is a formula. Using the diabetes data set, we can fit waist size by weight.

m <- lm(waist ~ weight, data=diabetes)

This creates an lm object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.

coef
fitted
predict
residuals
vcov
summary
anova

names(m)

 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"         "model"

# can access named elements with "$"
m$coefficients

(Intercept)      weight 
 16.5050329   0.1204712

coef(m)

(Intercept)      weight 
 16.5050329   0.1204712

head(fitted(m))

       1        2        3        4        5        6 
31.08205 42.76776 47.34567 30.84111 38.55127 39.39457

# defaut "predict" gives same results
# sum(abs(fitted(m) - predict(m)) > 1e-8)
predict(m, data.frame(weight=c(150, 200, 250)))

       1        2        3 
34.57572 40.59928 46.62284

head(residuals(m))

        1         2         3         4         5         6 
-2.082053  3.232237  1.654330  2.158890  5.448731 -3.394568

vcov(m) # variance-covariance matrix

             (Intercept)        weight
(Intercept)  0.465908502 -0.0024927458
weight      -0.002492746  0.0000140231

# "summary" produces additional information about model
summary(m)


Call:
lm(formula = waist ~ weight, data = diabetes)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6835 -2.0850 -0.2021  1.7741 11.5330 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.505033   0.682575   24.18   <2e-16 ***
weight       0.120471   0.003745   32.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.02 on 398 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.7223,    Adjusted R-squared:  0.7216 
F-statistic:  1035 on 1 and 398 DF,  p-value: < 2.2e-16

coef(summary(m))

              Estimate  Std. Error  t value      Pr(>|t|)
(Intercept) 16.5050329 0.682574906 24.18054  3.970142e-80
weight       0.1204712 0.003744743 32.17077 9.082965e-113

anova(m)

Analysis of Variance Table

Response: waist
           Df Sum Sq Mean Sq F value    Pr(>F)    
weight      1 9438.0  9438.0    1035 < 2.2e-16 ***
Residuals 398 3629.4     9.1                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualization

Create a scatterplot of weight and waist size.

# `qplot` can create default ggplot output
p <- qplot(weight, waist, data=diabetes)
p + geom_smooth(method="lm")

# the above uses the implied formula y~x
# p + geom_smooth(method="lm", formula = y ~ x)

Remove missing values and fit with LOESS curve.

diab <- diabetes[complete.cases(diabetes[,c('waist','weight')]),]
p <- qplot(weight, waist, data=diab)
p + geom_smooth(method="loess")

Data Manipulation

The Dominican Republic Hypertension dataset contains data on gender, age, and systolic and diastolic blood pressure from several villages in the DR.

getHdata(DominicanHTN)
html(describe(DominicanHTN))

DominicanHTN

5 Variables 381 Observations

village

n	missing	distinct
381	0	8

lowest :	Bare' Nuevo	Batey Verde	Carmona	Cojobal	Juan Sanchez
highest:	Cojobal	Juan Sanchez	La Altagracia	Los Gueneos	San Antonio

 Value        Bare' Nuevo   Batey Verde       Carmona       Cojobal  Juan Sanchez
 Frequency             32            41            64            59            53
 Proportion         0.084         0.108         0.168         0.155         0.139
                                                     
 Value      La Altagracia   Los Gueneos   San Antonio
 Frequency             40            57            35
 Proportion         0.105         0.150         0.092

gender

n	missing	distinct
381	0	2

 Value      Female   Male
 Frequency     258    123
 Proportion  0.677  0.323

age

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
381	0	69	0.999	47.97	16.98	23	30	38	47	59	68	73

lowest : 15 17 18 19 20 , highest: 85 87 90 95 100

sbp: Systolic Blood Pressure mmHg

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
381	0	57	0.996	133	28.35	98	105	118	130	150	170	180

lowest : 80 90 95 98 99 , highest: 196 200 204 210 236

dbp: Diastolic Blood Pressure mmHg

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
381	0	41	0.993	84.23	15.58	64	70	76	82	92	100	110

lowest : 20 40 50 60 62 , highest: 115 118 120 130 152

Adding variables or transformations

The data.frame bracket syntax is data.frame[row_selection, column_selection].

# create "mean arterial pressure"
DominicanHTN[,'map'] <- (DominicanHTN[,'sbp'] + DominicanHTN[,'dbp'] * 2) / 3
# use `as.numeric` to convert TRUE/FALSE into 1/0
DominicanHTN[,'male'] <- as.numeric(DominicanHTN[,'gender'] == 'Male')
DominicanHTN[1:5,]

       village gender age sbp dbp      map male
1 Juan Sanchez   Male  56 150 100 116.6667    1
2 Juan Sanchez   Male  42 120  90 100.0000    1
3 Juan Sanchez   Male  69 120  90 100.0000    1
4 Juan Sanchez   Male  70 180  80 113.3333    1
5 Juan Sanchez   Male  62 138  78  98.0000    1

In this example, I collapse a variable into smaller bins - which should generally be avoided.

qage <- quantile(DominicanHTN[,'age'])
qage[1] <- qage[1]-1
qage

  0%  25%  50%  75% 100% 
  14   38   47   59  100

DominicanHTN[,'ageGrp'] <- cut(DominicanHTN[,'age'], breaks=qage)
# could easily specify other breakpoints
# levels(cut(DominicanHTN[,'age'], breaks = c(0, 18, 100)))
# levels(cut(DominicanHTN[,'age'], breaks = seq(10, 100, by = 10)))

Filter

nrow(DominicanHTN)

[1] 381

nrow(DominicanHTN[DominicanHTN[,'gender'] == "Male",])

[1] 123

which(DominicanHTN[,'village'] %in% c('Carmona','San Antonio'))

 [1] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
[35] 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 249 250 251 252
[69] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283

cojMen <- DominicanHTN[DominicanHTN[,'village'] == 'Cojobal' & DominicanHTN[,'gender'] == "Male",]
cojMen

   village gender age sbp dbp       map male   ageGrp
54 Cojobal   Male  38 132  94 106.66667    1  (14,38]
55 Cojobal   Male  77 140  88 105.33333    1 (59,100]
56 Cojobal   Male  45 140  92 108.00000    1  (38,47]
57 Cojobal   Male  45 122  86  98.00000    1  (38,47]
58 Cojobal   Male  68 128  92 104.00000    1 (59,100]
59 Cojobal   Male  56  90  72  78.00000    1  (47,59]
60 Cojobal   Male  70 118  78  91.33333    1 (59,100]
61 Cojobal   Male  60 120  90 100.00000    1 (59,100]
62 Cojobal   Male  30 110  70  83.33333    1  (14,38]
63 Cojobal   Male  48 122  84  96.66667    1  (47,59]
64 Cojobal   Male  50 200 100 133.33333    1  (47,59]
65 Cojobal   Male  43 110  75  86.66667    1  (38,47]

Sort

order(cojMen[,'age'], decreasing=TRUE) # first value is element number with highest age

 [1]  2  7  5  8  6 11 10  3  4 12  1  9

# use row numbers to re-order a data.frame
cojMen[order(cojMen[,'age'], decreasing=TRUE),]

   village gender age sbp dbp       map male   ageGrp
55 Cojobal   Male  77 140  88 105.33333    1 (59,100]
60 Cojobal   Male  70 118  78  91.33333    1 (59,100]
58 Cojobal   Male  68 128  92 104.00000    1 (59,100]
61 Cojobal   Male  60 120  90 100.00000    1 (59,100]
59 Cojobal   Male  56  90  72  78.00000    1  (47,59]
64 Cojobal   Male  50 200 100 133.33333    1  (47,59]
63 Cojobal   Male  48 122  84  96.66667    1  (47,59]
56 Cojobal   Male  45 140  92 108.00000    1  (38,47]
57 Cojobal   Male  45 122  86  98.00000    1  (38,47]
65 Cojobal   Male  43 110  75  86.66667    1  (38,47]
54 Cojobal   Male  38 132  94 106.66667    1  (14,38]
62 Cojobal   Male  30 110  70  83.33333    1  (14,38]

Aggregate and counts

table(DominicanHTN[,'gender'])


Female   Male 
   258    123

addmargins(with(DominicanHTN, table(village, gender)))

               gender
village         Female Male Sum
  Bare' Nuevo       19   13  32
  Batey Verde       30   11  41
  Carmona           39   25  64
  Cojobal           47   12  59
  Juan Sanchez      40   13  53
  La Altagracia     22   18  40
  Los Gueneos       41   16  57
  San Antonio       20   15  35
  Sum              258  123 381

with(DominicanHTN, tapply(age, village, mean))

  Bare' Nuevo   Batey Verde       Carmona       Cojobal  Juan Sanchez La Altagracia   Los Gueneos   San Antonio 
     47.53125      48.02439      49.00000      47.15254      47.45283      49.65000      45.15789      51.25714

aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)

  gender      sbp
1 Female 133.1977
2   Male 132.5691

aggregate(age ~ village + gender, DominicanHTN, median)

         village gender  age
1    Bare' Nuevo Female 40.0
2    Batey Verde Female 43.0
3        Carmona Female 40.0
4        Cojobal Female 43.0
5   Juan Sanchez Female 47.0
6  La Altagracia Female 48.0
7    Los Gueneos Female 40.0
8    San Antonio Female 42.5
9    Bare' Nuevo   Male 48.0
10   Batey Verde   Male 52.0
11       Carmona   Male 53.0
12       Cojobal   Male 49.0
13  Juan Sanchez   Male 56.0
14 La Altagracia   Male 57.5
15   Los Gueneos   Male 51.0
16   San Antonio   Male 61.0

# easy to change statistical measure, or function
# aggregate(age ~ village + gender, DominicanHTN, max)

Exercise 2 & 3

Do exercise 2 & 3.

Rosner’s lead data

The lead exposure dataset was collected to study the well-being of children who lived near a lead smelting plant. The following analysis considers the lead exposure levels in 1972 and 1973, as well as age and the finger-wrist tapping score maxfwt.

getHdata(lead)
lead <- upData(lead,
       keep = c('ld72','ld73','age','maxfwt'),
       labels = c(age = 'Age'),
       units = c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))

Input object size: 53928 bytes; 39 variables 124 observations Kept variables ld72,ld73,age,maxfwt New object size: 14096 bytes; 4 variables 124 observations

html(contents(lead), maxlevels=10, levelType='table')

Data frame:lead

124 observations and 4 variables, maximum # NAs:25

Name	Labels	Units	Storage	NAs
age	Age	years	double	0
ld72	Blood Lead Levels, 1972	mg/100*ml	integer	0
ld73	Blood Lead Levels, 1973	mg/100*ml	integer	0
maxfwt	Maximum mean finger-wrist tapping score		integer	25

html(describe(lead))

lead

4 Variables 124 Observations

age: Age years

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
124	0	73	1	8.935	4.074	3.929	4.333	6.167	8.375	12.021	14.000	15.000

lowest :	3.750000	3.833333	3.916667	4.000000	4.166667
highest:	14.250000	15.000000	15.250000	15.416667	15.916667

ld72: Blood Lead Levels, 1972 mg/100*ml

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
124	0	47	0.999	36.16	17.23	18.00	21.00	27.00	34.00	43.00	57.00	61.85

lowest : 1 2 10 14 18 , highest: 62 64 66 68 99

ld73: Blood Lead Levels, 1973 mg/100*ml

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
124	0	37	0.998	31.71	11.06	18.15	21.00	24.00	30.50	37.00	47.00	50.85

lowest : 15 16 18 19 20 , highest: 52 53 54 57 58

maxfwt: Maximum mean finger-wrist tapping score

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
99	25	40	0.998	51.96	13.8	33.2	38.0	46.0	52.0	59.0	65.0	72.2

lowest : 13 14 23 26 34 , highest: 74 76 79 83 84

Introduction to RMS (Ordinary least squares regression)

The rms package includes the ols fitting function. Use datadist to compute summaries of the distributional characteristics of the predictors - or simply give it an entire data frame. datadist must be re-run if you add a new predictor or recode an old one.

Note: The ols function is not covered in the main course. Generalized least squares is. Ordinary least squares does not handle data that has correlation between residuals. It is covered here to give a simple overview of the structure of models in RMS.

require(rms)

dd <- datadist(lead)
options(datadist='dd')
f <- ols(maxfwt ~ age + ld72 + ld73, data=lead)
f

Frequencies of Missing Values Due to Each Variable
maxfwt    age   ld72   ld73 
    25      0      0      0 

Linear Regression Model
 
 ols(formula = maxfwt ~ age + ld72 + ld73, data = lead)
 
 
                 Model Likelihood    Discrimination    
                       Ratio Test           Indexes    
 Obs      99    LR chi2     62.25    R2       0.467    
 sigma9.5221    d.f.            3    R2 adj   0.450    
 d.f.     95    Pr(> chi2) 0.0000    g       10.104    
 
 Residuals
 
      Min       1Q   Median       3Q      Max 
 -33.9958  -4.9214   0.7596   5.1106  33.2590 
 
 
           Coef    S.E.   t     Pr(>|t|)
 Intercept 34.1059 4.8438  7.04 <0.0001 
 age        2.6078 0.3231  8.07 <0.0001 
 ld72      -0.0246 0.0782 -0.31 0.7538  
 ld73      -0.2390 0.1325 -1.80 0.0744

rms provides many methods to work with the ols fit object.

coef
fitted
predict
resid
anova
summary
Predict
ggplot
Function
nomogram

Coefficient Estimates

coef(f)

 Intercept        age       ld72       ld73 
34.1058551  2.6078450 -0.0245978 -0.2389695

Define R function representing fitted model

The default values for function arguments are medians.

g <- Function(f)
g

function (age = 8.375, ld72 = 34, ld73 = 30.5) 
{
    34.105855 + 2.607845 * age - 0.024597799 * ld72 - 0.23896951 * 
        ld73
}
<environment: 0x5569673767b8>

g(age=10, ld72=21, ld73=c(21, 47))

[1] 54.64939 48.43618

Predicted values with standard errors

predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)

$linear.predictors
       1        2 
54.64939 48.43618 

$se.fit
       1        2 
1.391858 3.140361

Residuals

r <- resid(f)
par(mfrow=c(2,2))
plot(fitted(f), r); abline(h=0)
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
# q-q plot to check normality
qqnorm(r)
qqline(r)

Plotting partial effects

Predict and plotp make one plot for each predictor. Predictors not shown in plot are set to constants (continuous:median, categorical:mode). 0.95 pointwise confidence limits for \(\hat{E}(y|x)\) are shown; use conf.int=FALSE to suppress CLs.

plotp(Predict(f))

Specify which predictors are plotted.

plotp(Predict(f, age))

plotp(Predict(f, age=3:15))

plotp(Predict(f, age=seq(3,16,length=150)))

Obtain confidence limits for \(\hat{y}\).

plotp(Predict(f, age, conf.type='individual'))

Combine plots.

p1 <- Predict(f, age, conf.int=0.99, conf.type='individual')
p2 <- Predict(f, age, conf.int=0.99, conf.type='mean')
p <- rbind(Individual=p1, Mean=p2)
ggplot(p)

plotp(Predict(f, ld73, age=3))

ggplot(Predict(f, ld73, age=c(3,9)))

3-d surface for two continuous predictors against \(\hat{y}\).

bplot(Predict(f, ld72, ld73))

Nomograms

plot(nomogram(f))

Point Estimates for Partial Effects

summary(f)

             Effects              Response : maxfwt 

 Factor Low     High   Diff.   Effect   S.E.   Lower 0.95 Upper 0.95
 age     6.1667 12.021  5.8542 15.26700 1.8914 11.5120    19.02200  
 ld72   27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773     2.09010  
 ld73   24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255     0.31234

Adjust age to 5, which has no effect as the model does not include any interactions.

summary(f, age=5)

             Effects              Response : maxfwt 

 Factor Low     High   Diff.   Effect   S.E.   Lower 0.95 Upper 0.95
 age     6.1667 12.021  5.8542 15.26700 1.8914 11.5120    19.02200  
 ld72   27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773     2.09010  
 ld73   24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255     0.31234

Effect of changing ld73 from 20 to 40.

summary(f, ld73=c(20,40))

             Effects              Response : maxfwt 

 Factor Low     High   Diff.   Effect   S.E.   Lower 0.95 Upper 0.95
 age     6.1667 12.021  5.8542 15.26700 1.8914  11.5120   19.02200  
 ld72   27.0000 43.000 16.0000 -0.39356 1.2511  -2.8773    2.09010  
 ld73   20.0000 40.000 20.0000 -4.77940 2.6495 -10.0390    0.48052

If a predictor has a linear effect, a one-unit change can be used to get the confidence interval of its slope.

summary(f, age=5:6)

             Effects              Response : maxfwt 

 Factor Low High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
 age     5   6    1     2.60780 0.32308  1.9664    3.24920   
 ld72   27  43   16    -0.39356 1.25110 -2.8773    2.09010   
 ld73   24  37   13    -3.10660 1.72220 -6.5255    0.31234

There is also a plot method for summary results.

plot(summary(f))

Getting Predicted Values

Give predict a fit object and data.frame.

predict(f, data.frame(age=3, ld72=21, ld73=21))

       1 
36.39448

predict(f, data.frame(age=c(3,10), ld72=21, ld73=c(21,47)))

       1        2 
36.39448 48.43618

newdat <- expand.grid(age=c(4,8), ld72=c(21, 47), ld73=c(21, 47))
newdat

  age ld72 ld73
1   4   21   21
2   8   21   21
3   4   47   21
4   8   47   21
5   4   21   47
6   8   21   47
7   4   47   47
8   8   47   47

predict(f, newdat)

       1        2        3        4        5        6        7        8 
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095

Include confidence levels.

predict(f, newdat, conf.int=0.95)

$linear.predictors
       1        2        3        4        5        6        7        8 
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095 

$lower
       1        2        3        4        5        6        7        8 
33.97441 46.23595 32.15468 43.94736 25.68920 36.94167 27.17060 38.86475 

$upper
       1        2        3        4        5        6        7        8 
44.03023 52.63145 44.57088 53.64095 39.88902 49.49932 37.12854 46.29716

predict(f, newdat, conf.int=0.95, conf.type='individual')

$linear.predictors
       1        2        3        4        5        6        7        8 
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095 

$lower
       1        2        3        4        5        6        7        8 
19.44127 30.26132 18.46566 29.27888 12.59596 23.30120 12.60105 23.31531 

$upper
       1        2        3        4        5        6        7        8 
58.56337 68.60609 58.25989 68.30944 52.98227 63.13979 51.69810 61.84659

Brute-force predicted values.

b <- coef(f)
b[1] + b[2]*3 + b[3]*21 + b[4]*21

Intercept 
 36.39448

Predicted values with Function.

# g <- Function(f)
g(age = c(3,8), ld72 = 21, ld73 = 21)

[1] 36.39448 49.43370

g(age = 3)

[1] 33.80449

ANOVA

Use anova to get all total effects and individual partial effects.

an <- anova(f)
an

                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P     
 age         1   5907.535742 5907.535742 65.15 <.0001
 ld72        1      8.972994    8.972994  0.10 0.7538
 ld73        1    295.044370  295.044370  3.25 0.0744
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001
 ERROR      95   8613.750674   90.671060

Include corresponding variable names.

print(an, 'names')

                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P      Tested       
 age         1   5907.535742 5907.535742 65.15 <.0001 age          
 ld72        1      8.972994    8.972994  0.10 0.7538 ld72         
 ld73        1    295.044370  295.044370  3.25 0.0744 ld73         
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001 age,ld72,ld73
 ERROR      95   8613.750674   90.671060

print(an, 'subscripts')

                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P      Tested
 age         1   5907.535742 5907.535742 65.15 <.0001 1     
 ld72        1      8.972994    8.972994  0.10 0.7538 2     
 ld73        1    295.044370  295.044370  3.25 0.0744 3     
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001 1-3   
 ERROR      95   8613.750674   90.671060                    

Subscripts correspond to:
[1] age  ld72 ld73

print(an, 'dots')

                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS          F     P      Tested
 age         1   5907.535742 5907.535742 65.15 <.0001 .     
 ld72        1      8.972994    8.972994  0.10 0.7538  .    
 ld73        1    295.044370  295.044370  3.25 0.0744   .   
 REGRESSION  3   7540.087710 2513.362570 27.72 <.0001 ...   
 ERROR      95   8613.750674   90.671060                    

Subscripts correspond to:
[1] age  ld72 ld73

Combined partial effects.

anova(f, ld72, ld73)

                Analysis of Variance          Response: maxfwt 

 Factor     d.f. Partial SS  MS         F    P     
 ld72        1      8.972994   8.972994 0.10 0.7538
 ld73        1    295.044370 295.044370 3.25 0.0744
 REGRESSION  2    747.283558 373.641779 4.12 0.0192
 ERROR      95   8613.750674  90.671060

Exercise 4

Time for exercise 4.

Cleaning and Importing Data

Text files

read.table - work with data in table format
read.csv - CSV files
read.delim - tab-delimited files
scan - more general function for reading in data

Binary representation of R objects

Compressed data and loads faster.

RData
feather package

Database connections

Allows data queries.

RODBC package
RSQLite package

Other statistical packages

Hmisc package: sas.get, sasxport.get
haven package: read_sas, read_sav, read_dta
foreign package
Stat/Transfer - not free nor open source

Data Cleaning

An R data.frame consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). Data sets often need to be manipulated to become well-structured.

Hadley Wickham’s Tidy Data provides an excellent overview on how to clean messy data sets.

Column header contains values

# politics and religion
senateReligion <- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
                          Democrats=c(20,15,8,1,1,3),
                          Republicans=c(38,9,0,5,0,0))
senateReligion

            religion Democrats Republicans
1         Protestant        20          38
2           Catholic        15           9
3             Jewish         8           0
4             Mormon         1           5
5           Buddhist         1           0
6 Don't Know/Refused         3           0

senRel <- cbind(senateReligion[,'religion'], stack(senateReligion[,c('Democrats','Republicans')]))
names(senRel) <- c('religion','count','party')
senRel

             religion count       party
1          Protestant    20   Democrats
2            Catholic    15   Democrats
3              Jewish     8   Democrats
4              Mormon     1   Democrats
5            Buddhist     1   Democrats
6  Don't Know/Refused     3   Democrats
7          Protestant    38 Republicans
8            Catholic     9 Republicans
9              Jewish     0 Republicans
10             Mormon     5 Republicans
11           Buddhist     0 Republicans
12 Don't Know/Refused     0 Republicans

Multiple variables in a column

pets <- data.frame(county=c('Davidson','Rutherford','Cannon'),
                   male.dog=c(50,150,70), female.dog=c(60,150,70),
                   male.cat=c(30,100,50), female.cat=c(30,70,40),
                   male.horse=c(6,30,20), female.horse=c(6,28,19))
pets

      county male.dog female.dog male.cat female.cat male.horse female.horse
1   Davidson       50         60       30         30          6            6
2 Rutherford      150        150      100         70         30           28
3     Cannon       70         70       50         40         20           19

pets2 <- cbind(pets[,'county'], stack(pets[,-1]))
names(pets2) <- c('county','count','ind')
pets2

       county count          ind
1    Davidson    50     male.dog
2  Rutherford   150     male.dog
3      Cannon    70     male.dog
4    Davidson    60   female.dog
5  Rutherford   150   female.dog
6      Cannon    70   female.dog
7    Davidson    30     male.cat
8  Rutherford   100     male.cat
9      Cannon    50     male.cat
10   Davidson    30   female.cat
11 Rutherford    70   female.cat
12     Cannon    40   female.cat
13   Davidson     6   male.horse
14 Rutherford    30   male.horse
15     Cannon    20   male.horse
16   Davidson     6 female.horse
17 Rutherford    28 female.horse
18     Cannon    19 female.horse

Watch out for factor variables.

tryCatch(strsplit(pets2[,'ind'], '\\.'), error=function(e){ e })

<simpleError in strsplit(pets2[, "ind"], "\\."): non-character argument>

First for loop, allows iterating over a sequence.

sexAnimal <- strsplit(as.character(pets2[,'ind']), '\\.')
pets2[,c('sex','animal')] <- NA
for(i in seq(nrow(pets2))) {
  pets2[i, c('sex','animal')] <- sexAnimal[[i]]
}
pets2[,'ind'] <- NULL
pets2

       county count    sex animal
1    Davidson    50   male    dog
2  Rutherford   150   male    dog
3      Cannon    70   male    dog
4    Davidson    60 female    dog
5  Rutherford   150 female    dog
6      Cannon    70 female    dog
7    Davidson    30   male    cat
8  Rutherford   100   male    cat
9      Cannon    50   male    cat
10   Davidson    30 female    cat
11 Rutherford    70 female    cat
12     Cannon    40 female    cat
13   Davidson     6   male  horse
14 Rutherford    30   male  horse
15     Cannon    20   male  horse
16   Davidson     6 female  horse
17 Rutherford    28 female  horse
18     Cannon    19 female  horse

Variables in a row

set.seed(1000)
d1 <- rnorm(1000, 5, 2)
d2 <- runif(1000, 0, 10)
d3 <- rpois(1000, 5)
d4 <- rbinom(1000, 10, 0.5)
x <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
           stat=rep(c('mean','sd'),4),
           value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x

  distribution stat    value
1       normal mean 4.975048
2       normal   sd 1.963494
3      uniform mean 4.958261
4      uniform   sd 2.870631
5      poisson mean 5.015000
6      poisson   sd 2.232882
7     binomial mean 5.054000
8     binomial   sd 1.542551

cbind(distribution=x[c(1,3,5,7),'distribution'], unstack(x, form=value~stat))

  distribution     mean       sd
1       normal 4.975048 1.963494
2      uniform 4.958261 2.870631
3      poisson 5.015000 2.232882
4     binomial 5.054000 1.542551

Normalization and merging

employees <- data.frame(id=1:4, Name=c('Eddie','Andrea','Steve','Theresa'),
                        job=c('engineer','accountant','statistician','technician'))
set.seed(11)
hours <- data.frame(id=sample(4, 10, replace=TRUE),
                    week=1:10, hours=sample(30:60, 10, replace=TRUE))
merge(employees, hours)

   id    Name        job week hours
1   1   Eddie   engineer    6    59
2   1   Eddie   engineer    7    50
3   1   Eddie   engineer    4    59
4   2  Andrea accountant    1    50
5   2  Andrea accountant    9    58
6   2  Andrea accountant    2    51
7   4 Theresa technician    3    59
8   4 Theresa technician    8    32
9   4 Theresa technician    5    36
10  4 Theresa technician   10    40

# merge(employees, hours, by.x='id', by.y='id')

empHours <- merge(employees, hours, all.x=TRUE)
empHours

   id    Name          job week hours
1   1   Eddie     engineer    6    59
2   1   Eddie     engineer    7    50
3   1   Eddie     engineer    4    59
4   2  Andrea   accountant    1    50
5   2  Andrea   accountant    9    58
6   2  Andrea   accountant    2    51
7   3   Steve statistician   NA    NA
8   4 Theresa   technician    3    59
9   4 Theresa   technician    8    32
10  4 Theresa   technician    5    36
11  4 Theresa   technician   10    40

unique(empHours[,c('id','Name','job')])

  id    Name          job
1  1   Eddie     engineer
4  2  Andrea   accountant
7  3   Steve statistician
8  4 Theresa   technician

empHours[,c('id','week','hours')]

   id week hours
1   1    6    59
2   1    7    50
3   1    4    59
4   2    1    50
5   2    9    58
6   2    2    51
7   3   NA    NA
8   4    3    59
9   4    8    32
10  4    5    36
11  4   10    40

Exclude rows with missing data.

empHours[!is.na(empHours[,'week']), c('id','week','hours')]

   id week hours
1   1    6    59
2   1    7    50
3   1    4    59
4   2    1    50
5   2    9    58
6   2    2    51
8   4    3    59
9   4    8    32
10  4    5    36
11  4   10    40

Combining data

rbind(sexAnimal[[1]], sexAnimal[[2]], sexAnimal[[3]])

     [,1]   [,2] 
[1,] "male" "dog"
[2,] "male" "dog"
[3,] "male" "dog"

do.call(rbind, sexAnimal)

      [,1]     [,2]   
 [1,] "male"   "dog"  
 [2,] "male"   "dog"  
 [3,] "male"   "dog"  
 [4,] "female" "dog"  
 [5,] "female" "dog"  
 [6,] "female" "dog"  
 [7,] "male"   "cat"  
 [8,] "male"   "cat"  
 [9,] "male"   "cat"  
[10,] "female" "cat"  
[11,] "female" "cat"  
[12,] "female" "cat"  
[13,] "male"   "horse"
[14,] "male"   "horse"
[15,] "male"   "horse"
[16,] "female" "horse"
[17,] "female" "horse"
[18,] "female" "horse"

Exercise 5

Time for exercise 5.

Simulation

Plot mean while sampling from normal distribution

n <- 100
imean <- numeric(n)
smean <- numeric(n)
for(i in seq(n)) {
  imean[i] <- mean(rnorm(10))
  smean[i] <- mean(imean[seq(i)])
}
plot(imean, ylab='Sample Mean')
abline(h=0, lty=3)
lines(smean)

n <- 1000
sig <- numeric(n)
for(i in seq(n)) {
  grp1 <- rnorm(15, 60, 5)
  grp2 <- rnorm(15, 65, 5)
  sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)

[1] 0.75

Add some flexibility to our simulation by creating a function.

tTestPower <- function(n=15, mu1=60, mu2=65, sd=5, nsim=1000) {
  sig <- numeric(nsim)
  for(i in seq(nsim)) {
    grp1 <- rnorm(n, mu1, sd)
    grp2 <- rnorm(n, mu2, sd)
    sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
  }
  mean(sig)
}
tTestPower()

[1] 0.755

tTestPower(25, nsim=10000)

[1] 0.9336

There’s already a function to compute the power of a two-sample t test.

power.t.test(n=25, delta=5, sd=5)$power

[1] 0.9337076

Simulation and Bootstrapping

Simulation is a wonderful tool for really understanding just how well a method performs. Sometimes the modeling and techniques are complex enough that analytical and numerical properties are difficult to ascertain. Simulation can help answer that question by doing a procedure on random data repeatedly and examining how well the technique performs.

Bootstrap sampling is a related area that takes multiple random samples from a data set and performs estimates of statistical properties. It is robust in that it makes few assumptions about the structure of the data.

Bootstrap sample

true_mu <- 0
x <- rnorm(100, true_mu)
R <- 999
res <- matrix(nrow=R, ncol=6)
colnames(res) <- c('mu','se','lb','ub','coverage','bias')
for(i in seq(R)) {
  r <- sample(x, replace=TRUE)
  res[i,'mu'] <- mean(r)
  res[i,'se'] <- sd(r) / sqrt(length(r))
}
res[,'lb'] <- res[,'mu'] + qnorm(0.025) * res[,'se']
res[,'ub'] <- res[,'mu'] + qnorm(0.975) * res[,'se']
res[,'coverage'] <- res[,'lb'] < true_mu & res[,'ub'] > true_mu
res[,'bias'] <- res[,'mu'] - true_mu

res[1:5,]

               mu         se          lb        ub coverage         bias
[1,] 0.0004065727 0.09355070 -0.18294943 0.1837626        1 0.0004065727
[2,] 0.0450594243 0.10109567 -0.15308446 0.2432033        1 0.0450594243
[3,] 0.0153579674 0.09152029 -0.16401851 0.1947344        1 0.0153579674
[4,] 0.1331559513 0.10133702 -0.06546097 0.3317729        1 0.1331559513
[5,] 0.1953979673 0.09007560  0.01885303 0.3719429        0 0.1953979673

vals <- colMeans(res)
basicCI <- 2 * mean(x) - quantile(res[,'mu'], probs=c(0.975, 0.025))
out <- sprintf("empirical SE: %.3f
MSE: %.3f
basic bootstrap confidence interval: (%.3f, %.3f)
coverage probability: %.3f
mean bias: %.3f
sample mean: %.3f
", sd(res[,'mu']), vals['se'], basicCI[1], basicCI[2], vals['coverage'], vals['bias'], mean(x))
cat(out)

empirical SE: 0.096
MSE: 0.096
basic bootstrap confidence interval: (-0.175, 0.197)
coverage probability: 0.931
mean bias: 0.020
sample mean: 0.023

Control Structures

Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.

Root finding with the Newton-Raphson algorithm

The for loop has already been introduced. Two other important control structures are while loops and if statements.

f <- function(x) x^3 + x^2 - 3*x - 3
fp <- function(x) 3*x^2 + 2*x - 3
plot(f, xlim=c(-4,4))

x <- -2
i <- 1
while(abs(f(x)) > 1e-8) {
  x <- x - f(x) / fp(x)
  i <- i + 1
  if(i > 20) {
    print("does not converge")
    break
  }
}
x

[1] -1.732051

f(x)

[1] -4.440892e-16

Exercise 6 & 7

Final exercises, 6 & 7.

Worth Mentioning

tidyverse

Wickham’s “Tidy Data” philosophy was concurrent with the development of a suite of R packages useful for data management. This includes ggplot2 and haven as well as many others such as dplyr, stringr, and tidyr.

data.table

data.table is another beneficial package for data manipulation. It provides similar functionality to the data.frame, but works well with large data sets. It minimizes memory operations. It is best to develop ones code using either data.frame or data.table from the beginning as they are not exactly equivalent and surprises can happen upon dropping in one for the other.

Computing Environment¹

 R version 4.2.0 (2022-04-22)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 18.04.6 LTS
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] rms_6.3-0       SparseM_1.78    plotly_4.10.0   Hmisc_4.7-0     ggplot2_3.3.5   Formula_1.2-3   survival_3.2-13 lattice_0.20-45

To cite R in publications use:

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-0, https://CRAN.R-project.org/package=Hmisc.

To cite the rms package in publications use:

Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, https://CRAN.R-project.org/package=rms.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

To cite the survival package in publications use:

Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.

mu is a copy of the part of the Hmisc package object markupSpecs that is for html. It includes a function session that renders the session environment (including package versions) in html.↩︎

R Workshop

Cole Beck, Shawn Garbett

5/09/2022