R Workshop
1 Foundations
Does everyone have a working laptop?
This course is designed to bring one up to speed on the tools necessary to follow all the examples in Frank Harrell’s ``Regression Modeling Strategies’’. It assumes a rudimentary knowledge of R or at least having some experience with coding.
If there are any questions or areas you’d like covered in more detail please do not hesitate to speak up.
RStudio has multiple panes. One can customize what’s open at any point. In general the four quadrants are:
- Upper Left: Open Files
- Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
- Lower Left: Console. The interactive command to the running R interpreter.
- Lower Right: Directory browser, Plots, Packages, Help.
If one accidentally closes one, it’s easily opened again from the View
menu.
Note Menu items have underlines for certain characters. These are shortcuts for Alt-
letter that one can access quickly from the keyboard.
1.1 Creating scripts
R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.
From RStudio, select File -> New File
and there is a large number of options. The important ones for this class are:
- R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click
source
in the upper right of the file and it will execute in the console below. - R Notebook. An older version that allows for a preview of the document. The preview is rebuilt on every save.
- R Markdown. Same as R Notebook, but requires a manual
Knit
to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.
A common question is “Which file type should I choose?” It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. For writing code that will import data or perform data manipulation, I prefer R scripts. For teaching with interactive examples, I prefer using R notebook. I like to use R markdown when creating output that will be shared (either as HTML or PDF).
1.2 Creating projects
A project brings together several scripts as a project. One can create a project using File -> New Project
and select an existing directory or create a blank one as desired. Existing projects can be opened with File -> Open Project
.
1.3 Session & Working Directory
There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks
restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.
It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command sessionInfo()
provides a nice overview of this that is good for checking package versions loaded.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
## [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] bookdown_0.18 digest_0.6.25 R6_2.4.1 jsonlite_1.7.1 magrittr_2.0.1 evaluate_0.14 rmdformats_0.3.7 stringi_1.5.3
## [9] rlang_0.4.11 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.11 tools_4.2.0 stringr_1.4.0 xfun_0.30 yaml_2.2.1
## [17] fastmap_1.1.0 compiler_4.2.0 htmltools_0.5.2 knitr_1.36 sass_0.4.0
The current working directory is available via getwd()
and settable via setwd("directory")
. All file loads and saves are relative to the current working directory.
1.4 Environment
R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.
3
x <- function()
f <-# One can view the parenthesis as capturing a frame
{ 4 # Local variable inner frame
x <-cat("This is the inner 'x'", x, "\n")
}f()
## This is the inner 'x' 4
cat("This is the outer 'x'", x, "\n")
## This is the outer 'x' 3
1.5 Packages
Install the packages Hmisc, rms, knitr, rmarkdown
, and plotly
through RStudio.
Packages contain R code and may often require the installation of other packages. When I install Hmisc
, I see that its dependency acepack
is automatically installed:
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb
* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
1.5.1 Install and updating via CRAN
Install through CRAN repositories
# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown', 'plotly'))
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)
1.5.2 Install and update via Github
Github is a code sharing platform. Packages are typically in a rawer state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the remotes
package.
install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')
1.5.3 Loading Packages
Note that library
is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require
used in its place. If the package is not installed, library
will trigger an error while require
will return FALSE. Once loaded the functions created in the package are available to your R session.
library(Hmisc)
require(Hmisc)
1.6 Help
From the R console one can get help about any function by appending it with a ?
.
?lm
The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.
1.7 Rmarkdown
Earlier versions of this course focused on LaTeX document generation. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Given that the focus of this course is on Regression Modeling strategies and not document generation, the current focus is on using Rmarkdown and showing some interactive plots via HTML. Full support and possibilities are shown at https://rmarkdown.rstudio.com/.
The cheat sheet is available at https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf, or through Rstudio’s menu bar: Help -> Cheetsheets -> R Markdown Cheet Sheet
.
Quarto is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative, though it must be installed in addition to R and RStudio.
1.8 Best Practices
A blog post RMarkdown Driven Development by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.
One takeaway is to avoid the general pitfall of putting large amounts of data processing inside ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts outside the report. The report should load the results of this process.
Another great reference is Dr. Harrell’s analysis project workflow. His content mirrors much of this workshop and will reinforce its material. It can also be used as a template for quarto as its source code is available (by clicking the “</> Code” link).
1.9 Caching
Chunks in Rmarkdown can be cached. I.e., the results of the last computation are saved and it’s not rerun until a manual override by clicking on that section is done. It is recommended that one not do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes.
A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It’s very similar to caching, but is not prone to later confusion as it’s a bit more direct that the report is loading the processed data file. In short do large data processing tasks outside of reports.
1.10 Harrell’s RMS
See chapters 1 and 9 of Biostatistics for Biomedical Research.
1.11 Setup
Most Rmarkdown documents have a setup section that loads required libraries and sets up some options. The following snippet is the basic setup for this course. It loads the Hmisc
package. Sets knitr
to use markdown by default. Sets the graphics option to use plotly
. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable mu
.
require(Hmisc)
knitrSet(lang='markdown', h=4.5, fig.path='png/', fig.align='left')
# knitrSet redirects all messages to messages.txt
options(grType='plotly') # for certain graphics functions
markupSpecs$html # markupSpecs is in Hmisc mu <-
1.12 Data
The getHdata
function is used to fetch a dataset from the Vanderbilt Biostatistics DataSets web site. upData
is used to
- create a new variable from an old one
- add labels to 2 variables
- add units to the new variable
- remove the old variable
- automatically move units of measurements from parenthetical expressions in labels to separate
units
attributes used byHmisc
andrms
functions for table making and graphics
contents
is used to print a data dictionary, run through an html
method for nicer output.
getHdata(pbc)
upData(pbc,
pbc <-fu.yrs = fu.days / 365.25,
labels = c(fu.yrs = 'Follow-up Time',
status = 'Death or Liver Transplantation'),
units = c(fu.yrs = 'year'),
drop = 'fu.days',
moveUnits=TRUE, html=TRUE)
Input object size: 80952 bytes; 19 variables 418 observations Label for bili changed from Serum Bilirubin (mg/dl) to Serum Bilirubin units set to mg/dl Label for albumin changed from Albumin (gm/dl) to Albumin units set to gm/dl Label for protime changed from Prothrombin Time (sec.) to Prothrombin Time units set to sec. Label for alk.phos changed from Alkaline Phosphatase (U/liter) to Alkaline Phosphatase units set to U/liter Label for sgot changed from SGOT (U/ml) to SGOT units set to U/ml Label for chol changed from Cholesterol (mg/dl) to Cholesterol units set to mg/dl Label for trig changed from Triglycerides (mg/dl) to Triglycerides units set to mg/dl Label for platelet changed from Platelets (per cm^3/1000) to Platelets units set to per cm^3/1000 Label for copper changed from Urine Copper (ug/day) to Urine Copper units set to ug/day Added variable fu.yrs Dropped variable fu.days New object size: 84744 bytes; 19 variables 418 observations
html(contents(pbc), maxlevels=10, levelType='table')
Data frame:pbc
418 observations and 19 variables, maximum # NAs:136Name | Labels | Units | Levels | Storage | NAs |
---|---|---|---|---|---|
bili | Serum Bilirubin | mg/dl | double | 0 | |
albumin | Albumin | gm/dl | double | 0 | |
stage | Histologic Stage, Ludwig Criteria | integer | 6 | ||
protime | Prothrombin Time | sec. | double | 2 | |
sex | 2 | integer | 0 | ||
age | Age | double | 0 | ||
spiders | 2 | integer | 106 | ||
hepatom | 2 | integer | 106 | ||
ascites | 2 | integer | 106 | ||
alk.phos | Alkaline Phosphatase | U/liter | double | 106 | |
sgot | SGOT | U/ml | double | 106 | |
chol | Cholesterol | mg/dl | integer | 134 | |
trig | Triglycerides | mg/dl | integer | 136 | |
platelet | Platelets | per cm^3/1000 | integer | 110 | |
drug | 3 | integer | 0 | ||
status | Death or Liver Transplantation | integer | 0 | ||
edema | 3 | integer | 0 | ||
copper | Urine Copper | ug/day | integer | 108 | |
fu.yrs | Follow-up Time | year | double | 0 |
Variable | Levels |
---|---|
sex | male |
female | |
spiders, hepatom | absent |
ascites | present |
drug | D-penicillamine |
placebo | |
not randomized | |
edema | no edema |
edema, no diuretic therapy | |
edema despite diuretic therapy |
1.13 Exercise 1
Do exercise 1. Pull the example exercise.Rmd from github as your starting template.
1.14 Descriptive Statistics Without Stratification
# did have results='asis' above
describe(pbc)
d <-html(d, size=80, scroll=TRUE)
19 Variables 418 Observations
bili: Serum Bilirubin mg/dl
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
418 | 0 | 98 | 0.998 | 3.221 | 3.742 | 0.50 | 0.60 | 0.80 | 1.40 | 3.40 | 8.03 | 14.00 |
albumin: Albumin gm/dl
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
418 | 0 | 154 | 1 | 3.497 | 0.473 | 2.750 | 2.967 | 3.243 | 3.530 | 3.770 | 4.010 | 4.141 |
stage: Histologic Stage, Ludwig Criteria
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
412 | 6 | 4 | 0.893 | 3.024 | 0.9519 |
Value 1 2 3 4 Frequency 21 92 155 144 Proportion 0.051 0.223 0.376 0.350
protime: Prothrombin Time sec.
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
416 | 2 | 48 | 0.998 | 10.73 | 1.029 | 9.60 | 9.80 | 10.00 | 10.60 | 11.10 | 12.00 | 12.45 |
sex
n | missing | distinct |
---|---|---|
418 | 0 | 2 |
Value male female Frequency 44 374 Proportion 0.105 0.895
age: Age
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
418 | 0 | 345 | 1 | 50.74 | 11.96 | 33.84 | 36.37 | 42.83 | 51.00 | 58.24 | 64.30 | 67.92 |
spiders
n | missing | distinct |
---|---|---|
312 | 106 | 2 |
Value absent present Frequency 222 90 Proportion 0.712 0.288
hepatom
n | missing | distinct |
---|---|---|
312 | 106 | 2 |
Value absent present Frequency 152 160 Proportion 0.487 0.513
ascites
n | missing | distinct |
---|---|---|
312 | 106 | 2 |
Value absent present Frequency 288 24 Proportion 0.923 0.077
alk.phos: Alkaline Phosphatase U/liter
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
312 | 106 | 295 | 1 | 1983 | 1760 | 599.6 | 663.0 | 871.5 | 1259.0 | 1980.0 | 3826.4 | 6669.9 |
sgot: SGOT U/ml
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
312 | 106 | 179 | 1 | 122.6 | 60.45 | 54.25 | 60.45 | 80.60 | 114.70 | 151.90 | 196.47 | 219.25 |
chol: Cholesterol mg/dl
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
284 | 134 | 201 | 1 | 369.5 | 194.5 | 188.4 | 213.6 | 249.5 | 309.5 | 400.0 | 560.8 | 674.0 |
trig: Triglycerides mg/dl
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
282 | 136 | 146 | 1 | 124.7 | 64.07 | 56.00 | 63.10 | 84.25 | 108.00 | 151.00 | 195.00 | 230.95 |
platelet: Platelets per cm3/1000
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
308 | 110 | 210 | 1 | 261.9 | 107.8 | 117.7 | 139.7 | 199.8 | 257.0 | 322.5 | 386.5 | 430.6 |
drug
n | missing | distinct |
---|---|---|
418 | 0 | 3 |
Value D-penicillamine placebo not randomized Frequency 154 158 106 Proportion 0.368 0.378 0.254
status: Death or Liver Transplantation
n | missing | distinct | Info | Sum | Mean | Gmd |
---|---|---|---|---|---|---|
418 | 0 | 2 | 0.71 | 161 | 0.3852 | 0.4748 |
edema
n | missing | distinct |
---|---|---|
418 | 0 | 3 |
Value no edema edema, no diuretic therapy Frequency 354 44 Proportion 0.847 0.105 Value edema despite diuretic therapy Frequency 20 Proportion 0.048
copper: Urine Copper ug/day
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
310 | 108 | 158 | 1 | 97.65 | 83.16 | 17.45 | 24.00 | 41.25 | 73.00 | 123.00 | 208.10 | 249.20 |
fu.yrs: Follow-up Time year
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
418 | 0 | 399 | 1 | 5.251 | 3.429 | 0.671 | 1.661 | 2.992 | 4.736 | 7.155 | 9.649 | 11.063 |
lowest : | 0.1122519 | 0.1177276 | 0.1396304 | 0.1943874 | 0.2108145 |
highest: | 12.3203285 | 12.3449692 | 12.3832991 | 12.4736482 | 13.1279945 |
plot(d) p <-
# "p" has two named elements that we can access with "$"
$Categorical # or, p[['Categorical']] p
$Continuous p
1.15 Stratified Descriptive Statistics
Produce stratified quantiles, means/SD, and proportions by treatment group. Plot the results before rendering as an advanced HTML table:
- categorical variables: a single dot chart
- continuous variables: a series of extended box plots
summaryM(bili + albumin + stage + protime + sex + age + spiders +
s <- alk.phos + sgot + chol ~ drug, data=pbc,
overall=FALSE, test=TRUE)
2 Plotting
plot(s, which='categorical')
plot(s, which='continuous', vars=1:4)
plot(s, which='continuous', vars=5:7)
html(s, caption='Baseline characteristics by randomized treatment',
exclude1=TRUE, npct='both', digits=3,
prmsd=TRUE, brmsd=TRUE, msdsize=mu$smaller2)
Baseline characteristics by randomized treatment. | |||||
N |
D-penicillamine N=154 |
placebo N=158 |
not randomized N=106 |
Test Statistic |
|
---|---|---|---|---|---|
Serum Bilirubin mg/dl |
418 | 0.725 1.300 3.600 3.649 ± 5.282 |
0.800 1.400 3.200 2.873 ± 3.629 |
0.725 1.400 3.075 3.117 ± 4.043 |
F2 415=0.03, P=0.9721 |
Albumin gm/dl |
418 | 3.342 3.545 3.777 3.524 ± 0.396 |
3.212 3.565 3.830 3.516 ± 0.443 |
3.125 3.470 3.720 3.431 ± 0.435 |
F2 415=2.13, P=0.121 |
Histologic Stage, Ludwig Criteria : 1 | 412 | 0.03 4⁄154 | 0.08 12⁄158 | 0.05 5⁄100 | χ26=5.33, P=0.5022 |
2 | 0.21 32⁄154 | 0.22 35⁄158 | 0.25 25⁄100 | ||
3 | 0.42 64⁄154 | 0.35 56⁄158 | 0.35 35⁄100 | ||
4 | 0.35 54⁄154 | 0.35 55⁄158 | 0.35 35⁄100 | ||
Prothrombin Time sec. |
416 | 10.000 10.600 11.400 10.800 ± 1.138 |
10.025 10.600 11.000 10.653 ± 0.851 |
10.100 10.600 11.000 10.750 ± 1.078 |
F2 413=0.23, P=0.7951 |
sex : female | 418 | 0.90 139⁄154 | 0.87 137⁄158 | 0.92 98⁄106 | χ22=2.38, P=0.3042 |
Age | 418 | 41.43 48.11 55.80 48.58 ± 9.96 |
42.98 51.93 58.90 51.42 ± 11.01 |
46.00 53.00 61.00 52.87 ± 9.78 |
F2 415=6.1, P=0.0021 |
spiders | 312 | 0.29 45⁄154 | 0.28 45⁄158 | χ21=0.02, P=0.8852 | |
Alkaline Phosphatase U/liter |
312 | 922 1283 1950 1943 ± 2102 |
841 1214 2028 2021 ± 2183 |
F1 310=0.06, P=0.8123 | |
SGOT U/ml |
312 | 83.8 117.4 151.9 125.0 ± 58.9 |
76.7 111.6 151.5 120.2 ± 54.5 |
F1 310=0.55, P=0.463 | |
Cholesterol mg/dl |
284 | 254 304 377 374 ± 252 |
248 316 417 365 ± 210 |
F1 282=0.37, P=0.5453 | |
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD. N is the number of non-missing values. Tests used: 1Kruskal-Wallis test; 2Pearson test; 3Wilcoxon test . |
2.1 Spike Histogram
Show data for continuous variable stratified by treatment.
with(pbc, histboxp(x=sgot, group=drug, sd=TRUE))
p <- p
2.2 Data Visualization
The very low birthweight data set contains data on 671 infants born with a birth weight of under 1600 grams. We’ll plot gestational age by birthweight using three graphics systems: base graphics, ggplot, and plotly.
getHdata(vlbw)
# remove missing values
vlbw[complete.cases(vlbw[,c('sex','dead','gest','bwt')]),] vlbw <-
2.3 Base Graphics Example
Empty canvas - add each element into your plot.
# "grps" is a list with each combination of sex/dead
split(vlbw[,c('gest','bwt')], vlbw[,c('sex','dead')])
grps <-# default "plot" function; set up parameters but with no lines/points because type 'n'
plot(c(22,40), c(400,1600), type='n', xlab='Gestational Age', ylab='Birth Weight (grams)', axes=FALSE)
# axis 1 is below plot
axis(1, at=c(22,28,34,40), labels=c(22,28,34,40))
# axis 2 is left of plot
axis(2, at=seq(400,1600,by=400), labels=seq(400,1600,by=400))
## four different methods to add points to plot area
# data.frame with two columns: 1st column is associated with x variable
points(grps[['female.0']], col='black', pch=1)
# two arguments: first vector is associated with x variable
points(grps[['female.1']][,'gest'], grps[['female.1']][,'bwt'], col='gray', pch=0)
# like above, but jitter adds noise so points have less overlap
points(jitter(grps[['male.0']][,'gest'], 2), grps[['male.0']][,'bwt'], col='black', pch=2)
# LHS ~ RHS: with tilde, the left hand side is associated with y variable
points(grps[['male.1']][,'bwt'] ~ jitter(grps[['male.1']][,'gest'], 2), col='gray', pch=3)
legend(x=37, y=1000, legend=c('F:alive','F:dead','M:alive','M:dead'), col=c('black','gray','black','gray'), pch=c(1,0,2,3))
2.4 ggplot2 Example
Given a data set, choose the aesthetic mapping and geometry layer.
# `as.factor(dead)` makes categorical variable rather than continuous
ggplot(data=vlbw) + aes(x=gest, y=bwt, color=sex, shape=as.factor(dead)) + geom_point()
p <- p
p + geom_jitter() + scale_x_continuous() + scale_y_continuous()
p <- p
2.5 Plotly
Add interactive graphics, which is trivial when also using ggplot
.
require(plotly)
ggplotly(p)
# plotly uses pipe operator (%>%) from magrittr package
# in this context it works much like "+" does with ggplot
plot_ly(type="box") %>%
add_boxplot(y = grps[['female.0']][,'bwt'], name='F:0') %>%
add_boxplot(y = grps[['female.1']][,'bwt'], name='F:1') %>%
add_boxplot(y = grps[['male.0']][,'bwt'], name='M:0') %>%
add_boxplot(y = grps[['male.1']][,'bwt'], name='M:1')
3 Models
3.1 Cardiovascular risk factor data
getHdata(diabetes)
html(contents(diabetes), levelType='table')
Data frame:diabetes
403 observations and 19 variables, maximum # NAs:262Name | Labels | Units | Levels | Storage | NAs |
---|---|---|---|---|---|
id | Subject ID | integer | 0 | ||
chol | Total Cholesterol | integer | 1 | ||
stab.glu | Stabilized Glucose | integer | 0 | ||
hdl | High Density Lipoprotein | integer | 1 | ||
ratio | Cholesterol/HDL Ratio | double | 1 | ||
glyhb | Glycosolated Hemoglobin | double | 13 | ||
location | 2 | integer | 0 | ||
age | years | integer | 0 | ||
gender | 2 | integer | 0 | ||
height | inches | integer | 5 | ||
weight | pounds | integer | 1 | ||
frame | 3 | integer | 12 | ||
bp.1s | First Systolic Blood Pressure | integer | 5 | ||
bp.1d | First Diastolic Blood Pressure | integer | 5 | ||
bp.2s | Second Systolic Blood Pressure | integer | 262 | ||
bp.2d | Second Diastolic Blood Pressure | integer | 262 | ||
waist | inches | integer | 2 | ||
hip | inches | integer | 2 | ||
time.ppn | Postprandial Time when Labs were Drawn | minutes | integer | 3 |
Variable | Levels |
---|---|
location | Buckingham |
Louisa | |
gender | male |
female | |
frame | small |
medium | |
large |
3.2 Formulas
The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.
- y ~ a + b,
+
indicates to include both a and b as terms - y ~ a:b,
:
indicates the interaction of a and b - y ~ a*b, equivalent to y ~ a+b+a:b
- y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
- y ~ a + I(b+c),
I
indicates to use+
in the arithmetic sense
See ?formula for more examples.
3.3 Model Fitting
The most simple model-fitting function is lm
, which is used to fit linear models. It’s primary argument is a formula. Using the diabetes data set, we can fit waist size by weight.
lm(waist ~ weight, data=diabetes) m <-
This creates an lm
object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.
- coef
- fitted
- predict
- residuals
- vcov
- summary
- anova
names(m)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual"
[9] "na.action" "xlevels" "call" "terms" "model"
# can access named elements with "$"
$coefficients m
(Intercept) weight
16.5050329 0.1204712
coef(m)
(Intercept) weight
16.5050329 0.1204712
head(fitted(m))
1 2 3 4 5 6
31.08205 42.76776 47.34567 30.84111 38.55127 39.39457
# defaut "predict" gives same results
# sum(abs(fitted(m) - predict(m)) > 1e-8)
predict(m, data.frame(weight=c(150, 200, 250)))
1 2 3
34.57572 40.59928 46.62284
head(residuals(m))
1 2 3 4 5 6
-2.082053 3.232237 1.654330 2.158890 5.448731 -3.394568
vcov(m) # variance-covariance matrix
(Intercept) weight
(Intercept) 0.465908502 -0.0024927458
weight -0.002492746 0.0000140231
# "summary" produces additional information about model
summary(m)
Call:
lm(formula = waist ~ weight, data = diabetes)
Residuals:
Min 1Q Median 3Q Max
-7.6835 -2.0850 -0.2021 1.7741 11.5330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.505033 0.682575 24.18 <2e-16 ***
weight 0.120471 0.003745 32.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.02 on 398 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.7223, Adjusted R-squared: 0.7216
F-statistic: 1035 on 1 and 398 DF, p-value: < 2.2e-16
coef(summary(m))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.5050329 0.682574906 24.18054 3.970142e-80
weight 0.1204712 0.003744743 32.17077 9.082965e-113
anova(m)
Analysis of Variance Table
Response: waist
Df Sum Sq Mean Sq F value Pr(>F)
weight 1 9438.0 9438.0 1035 < 2.2e-16 ***
Residuals 398 3629.4 9.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.4 Visualization
Create a scatterplot of weight and waist size.
# `qplot` can create default ggplot output
qplot(weight, waist, data=diabetes)
p <-+ geom_smooth(method="lm") p
# the above uses the implied formula y~x
# p + geom_smooth(method="lm", formula = y ~ x)
Remove missing values and fit with LOESS curve.
diabetes[complete.cases(diabetes[,c('waist','weight')]),]
diab <- qplot(weight, waist, data=diab)
p <-+ geom_smooth(method="loess") p
3.5 Data Manipulation
The Dominican Republic Hypertension dataset contains data on gender, age, and systolic and diastolic blood pressure from several villages in the DR.
getHdata(DominicanHTN)
html(describe(DominicanHTN))
5 Variables 381 Observations
village
n | missing | distinct |
---|---|---|
381 | 0 | 8 |
lowest : | Bare' Nuevo | Batey Verde | Carmona | Cojobal | Juan Sanchez |
highest: | Cojobal | Juan Sanchez | La Altagracia | Los Gueneos | San Antonio |
Value Bare' Nuevo Batey Verde Carmona Cojobal Juan Sanchez Frequency 32 41 64 59 53 Proportion 0.084 0.108 0.168 0.155 0.139 Value La Altagracia Los Gueneos San Antonio Frequency 40 57 35 Proportion 0.105 0.150 0.092
gender
n | missing | distinct |
---|---|---|
381 | 0 | 2 |
Value Female Male Frequency 258 123 Proportion 0.677 0.323
age
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
381 | 0 | 69 | 0.999 | 47.97 | 16.98 | 23 | 30 | 38 | 47 | 59 | 68 | 73 |
sbp: Systolic Blood Pressure mmHg
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
381 | 0 | 57 | 0.996 | 133 | 28.35 | 98 | 105 | 118 | 130 | 150 | 170 | 180 |
dbp: Diastolic Blood Pressure mmHg
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
381 | 0 | 41 | 0.993 | 84.23 | 15.58 | 64 | 70 | 76 | 82 | 92 | 100 | 110 |
3.6 Adding variables or transformations
The data.frame bracket syntax is data.frame[row_selection, column_selection].
# create "mean arterial pressure"
'map'] <- (DominicanHTN[,'sbp'] + DominicanHTN[,'dbp'] * 2) / 3
DominicanHTN[,# use `as.numeric` to convert TRUE/FALSE into 1/0
'male'] <- as.numeric(DominicanHTN[,'gender'] == 'Male')
DominicanHTN[,1:5,] DominicanHTN[
village gender age sbp dbp map male
1 Juan Sanchez Male 56 150 100 116.6667 1
2 Juan Sanchez Male 42 120 90 100.0000 1
3 Juan Sanchez Male 69 120 90 100.0000 1
4 Juan Sanchez Male 70 180 80 113.3333 1
5 Juan Sanchez Male 62 138 78 98.0000 1
In this example, I collapse a variable into smaller bins - which should generally be avoided.
quantile(DominicanHTN[,'age'])
qage <-1] <- qage[1]-1
qage[ qage
0% 25% 50% 75% 100%
14 38 47 59 100
'ageGrp'] <- cut(DominicanHTN[,'age'], breaks=qage)
DominicanHTN[,# could easily specify other breakpoints
# levels(cut(DominicanHTN[,'age'], breaks = c(0, 18, 100)))
# levels(cut(DominicanHTN[,'age'], breaks = seq(10, 100, by = 10)))
3.7 Filter
nrow(DominicanHTN)
[1] 381
nrow(DominicanHTN[DominicanHTN[,'gender'] == "Male",])
[1] 123
which(DominicanHTN[,'village'] %in% c('Carmona','San Antonio'))
[1] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
[35] 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 249 250 251 252
[69] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283
DominicanHTN[DominicanHTN[,'village'] == 'Cojobal' & DominicanHTN[,'gender'] == "Male",]
cojMen <- cojMen
village gender age sbp dbp map male ageGrp
54 Cojobal Male 38 132 94 106.66667 1 (14,38]
55 Cojobal Male 77 140 88 105.33333 1 (59,100]
56 Cojobal Male 45 140 92 108.00000 1 (38,47]
57 Cojobal Male 45 122 86 98.00000 1 (38,47]
58 Cojobal Male 68 128 92 104.00000 1 (59,100]
59 Cojobal Male 56 90 72 78.00000 1 (47,59]
60 Cojobal Male 70 118 78 91.33333 1 (59,100]
61 Cojobal Male 60 120 90 100.00000 1 (59,100]
62 Cojobal Male 30 110 70 83.33333 1 (14,38]
63 Cojobal Male 48 122 84 96.66667 1 (47,59]
64 Cojobal Male 50 200 100 133.33333 1 (47,59]
65 Cojobal Male 43 110 75 86.66667 1 (38,47]
3.8 Sort
order(cojMen[,'age'], decreasing=TRUE) # first value is element number with highest age
[1] 2 7 5 8 6 11 10 3 4 12 1 9
# use row numbers to re-order a data.frame
order(cojMen[,'age'], decreasing=TRUE),] cojMen[
village gender age sbp dbp map male ageGrp
55 Cojobal Male 77 140 88 105.33333 1 (59,100]
60 Cojobal Male 70 118 78 91.33333 1 (59,100]
58 Cojobal Male 68 128 92 104.00000 1 (59,100]
61 Cojobal Male 60 120 90 100.00000 1 (59,100]
59 Cojobal Male 56 90 72 78.00000 1 (47,59]
64 Cojobal Male 50 200 100 133.33333 1 (47,59]
63 Cojobal Male 48 122 84 96.66667 1 (47,59]
56 Cojobal Male 45 140 92 108.00000 1 (38,47]
57 Cojobal Male 45 122 86 98.00000 1 (38,47]
65 Cojobal Male 43 110 75 86.66667 1 (38,47]
54 Cojobal Male 38 132 94 106.66667 1 (14,38]
62 Cojobal Male 30 110 70 83.33333 1 (14,38]
3.9 Aggregate and counts
table(DominicanHTN[,'gender'])
Female Male
258 123
addmargins(with(DominicanHTN, table(village, gender)))
gender
village Female Male Sum
Bare' Nuevo 19 13 32
Batey Verde 30 11 41
Carmona 39 25 64
Cojobal 47 12 59
Juan Sanchez 40 13 53
La Altagracia 22 18 40
Los Gueneos 41 16 57
San Antonio 20 15 35
Sum 258 123 381
with(DominicanHTN, tapply(age, village, mean))
Bare' Nuevo Batey Verde Carmona Cojobal Juan Sanchez La Altagracia Los Gueneos San Antonio
47.53125 48.02439 49.00000 47.15254 47.45283 49.65000 45.15789 51.25714
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
gender sbp
1 Female 133.1977
2 Male 132.5691
aggregate(age ~ village + gender, DominicanHTN, median)
village gender age
1 Bare' Nuevo Female 40.0
2 Batey Verde Female 43.0
3 Carmona Female 40.0
4 Cojobal Female 43.0
5 Juan Sanchez Female 47.0
6 La Altagracia Female 48.0
7 Los Gueneos Female 40.0
8 San Antonio Female 42.5
9 Bare' Nuevo Male 48.0
10 Batey Verde Male 52.0
11 Carmona Male 53.0
12 Cojobal Male 49.0
13 Juan Sanchez Male 56.0
14 La Altagracia Male 57.5
15 Los Gueneos Male 51.0
16 San Antonio Male 61.0
# easy to change statistical measure, or function
# aggregate(age ~ village + gender, DominicanHTN, max)
3.10 Exercise 2 & 3
Do exercise 2 & 3.
3.11 Rosner’s lead data
The lead
exposure dataset was collected to study the well-being of children who lived near a lead smelting plant. The following analysis considers the lead exposure levels in 1972 and 1973, as well as age and the finger-wrist tapping score maxfwt
.
getHdata(lead)
upData(lead,
lead <-keep = c('ld72','ld73','age','maxfwt'),
labels = c(age = 'Age'),
units = c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))
Input object size: 53928 bytes; 39 variables 124 observations Kept variables ld72,ld73,age,maxfwt New object size: 14096 bytes; 4 variables 124 observations
html(contents(lead), maxlevels=10, levelType='table')
Data frame:lead
124 observations and 4 variables, maximum # NAs:25Name | Labels | Units | Storage | NAs |
---|---|---|---|---|
age | Age | years | double | 0 |
ld72 | Blood Lead Levels, 1972 | mg/100*ml | integer | 0 |
ld73 | Blood Lead Levels, 1973 | mg/100*ml | integer | 0 |
maxfwt | Maximum mean finger-wrist tapping score | integer | 25 |
html(describe(lead))
4 Variables 124 Observations
age: Age years
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 0 | 73 | 1 | 8.935 | 4.074 | 3.929 | 4.333 | 6.167 | 8.375 | 12.021 | 14.000 | 15.000 |
lowest : | 3.750000 | 3.833333 | 3.916667 | 4.000000 | 4.166667 |
highest: | 14.250000 | 15.000000 | 15.250000 | 15.416667 | 15.916667 |
ld72: Blood Lead Levels, 1972 mg/100*ml
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 0 | 47 | 0.999 | 36.16 | 17.23 | 18.00 | 21.00 | 27.00 | 34.00 | 43.00 | 57.00 | 61.85 |
ld73: Blood Lead Levels, 1973 mg/100*ml
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 0 | 37 | 0.998 | 31.71 | 11.06 | 18.15 | 21.00 | 24.00 | 30.50 | 37.00 | 47.00 | 50.85 |
maxfwt: Maximum mean finger-wrist tapping score
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
99 | 25 | 40 | 0.998 | 51.96 | 13.8 | 33.2 | 38.0 | 46.0 | 52.0 | 59.0 | 65.0 | 72.2 |
4 Introduction to RMS (Ordinary least squares regression)
The rms
package includes the ols
fitting function. Use datadist
to compute summaries of the distributional characteristics of the predictors - or simply give it an entire data frame. datadist
must be re-run if you add a new predictor or recode an old one.
Note: The ols
function is not covered in the main course. Generalized least squares is. Ordinary least squares does not handle data that has correlation between residuals. It is covered here to give a simple overview of the structure of models in RMS.
require(rms)
datadist(lead)
dd <-options(datadist='dd')
ols(maxfwt ~ age + ld72 + ld73, data=lead)
f <- f
Frequencies of Missing Values Due to Each Variable
maxfwt age ld72 ld73
25 0 0 0
Linear Regression Model
ols(formula = maxfwt ~ age + ld72 + ld73, data = lead)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 99 LR chi2 62.25 R2 0.467
sigma9.5221 d.f. 3 R2 adj 0.450
d.f. 95 Pr(> chi2) 0.0000 g 10.104
Residuals
Min 1Q Median 3Q Max
-33.9958 -4.9214 0.7596 5.1106 33.2590
Coef S.E. t Pr(>|t|)
Intercept 34.1059 4.8438 7.04 <0.0001
age 2.6078 0.3231 8.07 <0.0001
ld72 -0.0246 0.0782 -0.31 0.7538
ld73 -0.2390 0.1325 -1.80 0.0744
rms
provides many methods to work with the ols
fit object.
- coef
- fitted
- predict
- resid
- anova
- summary
- Predict
- ggplot
- Function
- nomogram
4.1 Coefficient Estimates
coef(f)
Intercept age ld72 ld73
34.1058551 2.6078450 -0.0245978 -0.2389695
4.2 Define R function representing fitted model
The default values for function arguments are medians.
Function(f)
g <- g
function (age = 8.375, ld72 = 34, ld73 = 30.5)
{
34.105855 + 2.607845 * age - 0.024597799 * ld72 - 0.23896951 *
ld73
}
<environment: 0x5568f55801a8>
g(age=10, ld72=21, ld73=c(21, 47))
[1] 54.64939 48.43618
4.3 Predicted values with standard errors
predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)
$linear.predictors
1 2
54.64939 48.43618
$se.fit
1 2
1.391858 3.140361
4.4 Residuals
resid(f)
r <-par(mfrow=c(2,2))
plot(fitted(f), r); abline(h=0)
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
# q-q plot to check normality
qqnorm(r)
qqline(r)
4.5 Plotting partial effects
Predict
and plotp
make one plot for each predictor. Predictors not shown in plot are set to constants (continuous:median, categorical:mode). 0.95 pointwise confidence limits for \(\hat{E}(y|x)\) are shown; use conf.int=FALSE
to suppress CLs.
plotp(Predict(f))
Specify which predictors are plotted.
plotp(Predict(f, age))
plotp(Predict(f, age=3:15))
plotp(Predict(f, age=seq(3,16,length=150)))
Obtain confidence limits for \(\hat{y}\).
plotp(Predict(f, age, conf.type='individual'))
Combine plots.
Predict(f, age, conf.int=0.99, conf.type='individual')
p1 <- Predict(f, age, conf.int=0.99, conf.type='mean')
p2 <- rbind(Individual=p1, Mean=p2)
p <-ggplot(p)
plotp(Predict(f, ld73, age=3))
ggplot(Predict(f, ld73, age=c(3,9)))
3-d surface for two continuous predictors against \(\hat{y}\).
bplot(Predict(f, ld72, ld73))
4.6 Nomograms
plot(nomogram(f))
4.7 Point Estimates for Partial Effects
summary(f)
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 6.1667 12.021 5.8542 15.26700 1.8914 11.5120 19.02200
ld72 27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773 2.09010
ld73 24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255 0.31234
Adjust age
to 5, which has no effect as the model does not include any interactions.
summary(f, age=5)
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 6.1667 12.021 5.8542 15.26700 1.8914 11.5120 19.02200
ld72 27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773 2.09010
ld73 24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255 0.31234
Effect of changing ld73
from 20 to 40.
summary(f, ld73=c(20,40))
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 6.1667 12.021 5.8542 15.26700 1.8914 11.5120 19.02200
ld72 27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773 2.09010
ld73 20.0000 40.000 20.0000 -4.77940 2.6495 -10.0390 0.48052
If a predictor has a linear effect, a one-unit change can be used to get the confidence interval of its slope.
summary(f, age=5:6)
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 5 6 1 2.60780 0.32308 1.9664 3.24920
ld72 27 43 16 -0.39356 1.25110 -2.8773 2.09010
ld73 24 37 13 -3.10660 1.72220 -6.5255 0.31234
There is also a plot
method for summary
results.
plot(summary(f))
4.8 Getting Predicted Values
Give predict
a fit object and data.frame
.
predict(f, data.frame(age=3, ld72=21, ld73=21))
1
36.39448
predict(f, data.frame(age=c(3,10), ld72=21, ld73=c(21,47)))
1 2
36.39448 48.43618
expand.grid(age=c(4,8), ld72=c(21, 47), ld73=c(21, 47))
newdat <- newdat
age ld72 ld73
1 4 21 21
2 8 21 21
3 4 47 21
4 8 47 21
5 4 21 47
6 8 21 47
7 4 47 47
8 8 47 47
predict(f, newdat)
1 2 3 4 5 6 7 8
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095
Include confidence levels.
predict(f, newdat, conf.int=0.95)
$linear.predictors
1 2 3 4 5 6 7 8
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095
$lower
1 2 3 4 5 6 7 8
33.97441 46.23595 32.15468 43.94736 25.68920 36.94167 27.17060 38.86475
$upper
1 2 3 4 5 6 7 8
44.03023 52.63145 44.57088 53.64095 39.88902 49.49932 37.12854 46.29716
predict(f, newdat, conf.int=0.95, conf.type='individual')
$linear.predictors
1 2 3 4 5 6 7 8
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095
$lower
1 2 3 4 5 6 7 8
19.44127 30.26132 18.46566 29.27888 12.59596 23.30120 12.60105 23.31531
$upper
1 2 3 4 5 6 7 8
58.56337 68.60609 58.25989 68.30944 52.98227 63.13979 51.69810 61.84659
Brute-force predicted values.
coef(f)
b <-1] + b[2]*3 + b[3]*21 + b[4]*21 b[
Intercept
36.39448
Predicted values with Function
.
# g <- Function(f)
g(age = c(3,8), ld72 = 21, ld73 = 21)
[1] 36.39448 49.43370
g(age = 3)
[1] 33.80449
4.9 ANOVA
Use anova
to get all total effects and individual partial effects.
anova(f)
an <- an
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P
age 1 5907.535742 5907.535742 65.15 <.0001
ld72 1 8.972994 8.972994 0.10 0.7538
ld73 1 295.044370 295.044370 3.25 0.0744
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001
ERROR 95 8613.750674 90.671060
Include corresponding variable names.
print(an, 'names')
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P Tested
age 1 5907.535742 5907.535742 65.15 <.0001 age
ld72 1 8.972994 8.972994 0.10 0.7538 ld72
ld73 1 295.044370 295.044370 3.25 0.0744 ld73
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001 age,ld72,ld73
ERROR 95 8613.750674 90.671060
print(an, 'subscripts')
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P Tested
age 1 5907.535742 5907.535742 65.15 <.0001 1
ld72 1 8.972994 8.972994 0.10 0.7538 2
ld73 1 295.044370 295.044370 3.25 0.0744 3
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001 1-3
ERROR 95 8613.750674 90.671060
Subscripts correspond to:
[1] age ld72 ld73
print(an, 'dots')
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P Tested
age 1 5907.535742 5907.535742 65.15 <.0001 .
ld72 1 8.972994 8.972994 0.10 0.7538 .
ld73 1 295.044370 295.044370 3.25 0.0744 .
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001 ...
ERROR 95 8613.750674 90.671060
Subscripts correspond to:
[1] age ld72 ld73
Combined partial effects.
anova(f, ld72, ld73)
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P
ld72 1 8.972994 8.972994 0.10 0.7538
ld73 1 295.044370 295.044370 3.25 0.0744
REGRESSION 2 747.283558 373.641779 4.12 0.0192
ERROR 95 8613.750674 90.671060
4.10 Exercise 4
Time for exercise 4.
5 Cleaning and Importing Data
5.1 Text files
- read.table - work with data in table format
- read.csv - CSV files
- read.delim - tab-delimited files
- scan - more general function for reading in data
5.2 Binary representation of R objects
Compressed data and loads faster.
- RData
feather
package
5.3 Database connections
Allows data queries.
RODBC
packageRSQLite
package
5.4 Other statistical packages
Hmisc
package:sas.get
,sasxport.get
haven
package:read_sas
,read_sav
,read_dta
foreign
package- Stat/Transfer - not free nor open source
5.5 Data Cleaning
An R data.frame
consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). Data sets often need to be manipulated to become well-structured.
Hadley Wickham’s Tidy Data provides an excellent overview on how to clean messy data sets.
5.6 Column header contains values
# politics and religion
data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
senateReligion <-Democrats=c(20,15,8,1,1,3),
Republicans=c(38,9,0,5,0,0))
senateReligion
religion Democrats Republicans
1 Protestant 20 38
2 Catholic 15 9
3 Jewish 8 0
4 Mormon 1 5
5 Buddhist 1 0
6 Don't Know/Refused 3 0
cbind(senateReligion[,'religion'], stack(senateReligion[,c('Democrats','Republicans')]))
senRel <-names(senRel) <- c('religion','count','party')
senRel
religion count party
1 Protestant 20 Democrats
2 Catholic 15 Democrats
3 Jewish 8 Democrats
4 Mormon 1 Democrats
5 Buddhist 1 Democrats
6 Don't Know/Refused 3 Democrats
7 Protestant 38 Republicans
8 Catholic 9 Republicans
9 Jewish 0 Republicans
10 Mormon 5 Republicans
11 Buddhist 0 Republicans
12 Don't Know/Refused 0 Republicans
5.7 Multiple variables in a column
data.frame(county=c('Davidson','Rutherford','Cannon'),
pets <-male.dog=c(50,150,70), female.dog=c(60,150,70),
male.cat=c(30,100,50), female.cat=c(30,70,40),
male.horse=c(6,30,20), female.horse=c(6,28,19))
pets
county male.dog female.dog male.cat female.cat male.horse female.horse
1 Davidson 50 60 30 30 6 6
2 Rutherford 150 150 100 70 30 28
3 Cannon 70 70 50 40 20 19
cbind(pets[,'county'], stack(pets[,-1]))
pets2 <-names(pets2) <- c('county','count','ind')
pets2
county count ind
1 Davidson 50 male.dog
2 Rutherford 150 male.dog
3 Cannon 70 male.dog
4 Davidson 60 female.dog
5 Rutherford 150 female.dog
6 Cannon 70 female.dog
7 Davidson 30 male.cat
8 Rutherford 100 male.cat
9 Cannon 50 male.cat
10 Davidson 30 female.cat
11 Rutherford 70 female.cat
12 Cannon 40 female.cat
13 Davidson 6 male.horse
14 Rutherford 30 male.horse
15 Cannon 20 male.horse
16 Davidson 6 female.horse
17 Rutherford 28 female.horse
18 Cannon 19 female.horse
Watch out for factor variables.
tryCatch(strsplit(pets2[,'ind'], '\\.'), error=function(e){ e })
<simpleError in strsplit(pets2[, "ind"], "\\."): non-character argument>
First for
loop, allows iterating over a sequence.
strsplit(as.character(pets2[,'ind']), '\\.')
sexAnimal <-c('sex','animal')] <- NA
pets2[,for(i in seq(nrow(pets2))) {
c('sex','animal')] <- sexAnimal[[i]]
pets2[i,
}'ind'] <- NULL
pets2[, pets2
county count sex animal
1 Davidson 50 male dog
2 Rutherford 150 male dog
3 Cannon 70 male dog
4 Davidson 60 female dog
5 Rutherford 150 female dog
6 Cannon 70 female dog
7 Davidson 30 male cat
8 Rutherford 100 male cat
9 Cannon 50 male cat
10 Davidson 30 female cat
11 Rutherford 70 female cat
12 Cannon 40 female cat
13 Davidson 6 male horse
14 Rutherford 30 male horse
15 Cannon 20 male horse
16 Davidson 6 female horse
17 Rutherford 28 female horse
18 Cannon 19 female horse
5.8 Variables in a row
set.seed(1000)
rnorm(1000, 5, 2)
d1 <- runif(1000, 0, 10)
d2 <- rpois(1000, 5)
d3 <- rbinom(1000, 10, 0.5)
d4 <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
x <-stat=rep(c('mean','sd'),4),
value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x
distribution stat value
1 normal mean 4.975048
2 normal sd 1.963494
3 uniform mean 4.958261
4 uniform sd 2.870631
5 poisson mean 5.015000
6 poisson sd 2.232882
7 binomial mean 5.054000
8 binomial sd 1.542551
cbind(distribution=x[c(1,3,5,7),'distribution'], unstack(x, form=value~stat))
distribution mean sd
1 normal 4.975048 1.963494
2 uniform 4.958261 2.870631
3 poisson 5.015000 2.232882
4 binomial 5.054000 1.542551
5.9 Normalization and merging
data.frame(id=1:4, Name=c('Eddie','Andrea','Steve','Theresa'),
employees <-job=c('engineer','accountant','statistician','technician'))
set.seed(11)
data.frame(id=sample(4, 10, replace=TRUE),
hours <-week=1:10, hours=sample(30:60, 10, replace=TRUE))
merge(employees, hours)
id Name job week hours
1 1 Eddie engineer 6 59
2 1 Eddie engineer 7 50
3 1 Eddie engineer 4 59
4 2 Andrea accountant 1 50
5 2 Andrea accountant 9 58
6 2 Andrea accountant 2 51
7 4 Theresa technician 3 59
8 4 Theresa technician 8 32
9 4 Theresa technician 5 36
10 4 Theresa technician 10 40
# merge(employees, hours, by.x='id', by.y='id')
merge(employees, hours, all.x=TRUE)
empHours <- empHours
id Name job week hours
1 1 Eddie engineer 6 59
2 1 Eddie engineer 7 50
3 1 Eddie engineer 4 59
4 2 Andrea accountant 1 50
5 2 Andrea accountant 9 58
6 2 Andrea accountant 2 51
7 3 Steve statistician NA NA
8 4 Theresa technician 3 59
9 4 Theresa technician 8 32
10 4 Theresa technician 5 36
11 4 Theresa technician 10 40
unique(empHours[,c('id','Name','job')])
id Name job
1 1 Eddie engineer
4 2 Andrea accountant
7 3 Steve statistician
8 4 Theresa technician
c('id','week','hours')] empHours[,
id week hours
1 1 6 59
2 1 7 50
3 1 4 59
4 2 1 50
5 2 9 58
6 2 2 51
7 3 NA NA
8 4 3 59
9 4 8 32
10 4 5 36
11 4 10 40
Exclude rows with missing data.
!is.na(empHours[,'week']), c('id','week','hours')] empHours[
id week hours
1 1 6 59
2 1 7 50
3 1 4 59
4 2 1 50
5 2 9 58
6 2 2 51
8 4 3 59
9 4 8 32
10 4 5 36
11 4 10 40
5.10 Combining data
rbind(sexAnimal[[1]], sexAnimal[[2]], sexAnimal[[3]])
[,1] [,2]
[1,] "male" "dog"
[2,] "male" "dog"
[3,] "male" "dog"
do.call(rbind, sexAnimal)
[,1] [,2]
[1,] "male" "dog"
[2,] "male" "dog"
[3,] "male" "dog"
[4,] "female" "dog"
[5,] "female" "dog"
[6,] "female" "dog"
[7,] "male" "cat"
[8,] "male" "cat"
[9,] "male" "cat"
[10,] "female" "cat"
[11,] "female" "cat"
[12,] "female" "cat"
[13,] "male" "horse"
[14,] "male" "horse"
[15,] "male" "horse"
[16,] "female" "horse"
[17,] "female" "horse"
[18,] "female" "horse"
5.11 Exercise 5
Time for exercise 5.
6 Simulation
Plot mean while sampling from normal distribution
100
n <- numeric(n)
imean <- numeric(n)
smean <-for(i in seq(n)) {
mean(rnorm(10))
imean[i] <- mean(imean[seq(i)])
smean[i] <-
}plot(imean, ylab='Sample Mean')
abline(h=0, lty=3)
lines(smean)
1000
n <- numeric(n)
sig <-for(i in seq(n)) {
rnorm(15, 60, 5)
grp1 <- rnorm(15, 65, 5)
grp2 <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
sig[i] <-
}mean(sig)
[1] 0.75
Add some flexibility to our simulation by creating a function.
function(n=15, mu1=60, mu2=65, sd=5, nsim=1000) {
tTestPower <- numeric(nsim)
sig <-for(i in seq(nsim)) {
rnorm(n, mu1, sd)
grp1 <- rnorm(n, mu2, sd)
grp2 <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
sig[i] <-
}mean(sig)
}tTestPower()
[1] 0.755
tTestPower(25, nsim=10000)
[1] 0.9336
There’s already a function to compute the power of a two-sample t test.
power.t.test(n=25, delta=5, sd=5)$power
[1] 0.9337076
7 Simulation and Bootstrapping
Simulation is a wonderful tool for really understanding just how well a method performs. Sometimes the modeling and techniques are complex enough that analytical and numerical properties are difficult to ascertain. Simulation can help answer that question by doing a procedure on random data repeatedly and examining how well the technique performs.
Bootstrap sampling is a related area that takes multiple random samples from a data set and performs estimates of statistical properties. It is robust in that it makes few assumptions about the structure of the data.
7.1 Bootstrap sample
0
true_mu <- rnorm(100, true_mu)
x <- 999
R <- matrix(nrow=R, ncol=6)
res <-colnames(res) <- c('mu','se','lb','ub','coverage','bias')
for(i in seq(R)) {
sample(x, replace=TRUE)
r <-'mu'] <- mean(r)
res[i,'se'] <- sd(r) / sqrt(length(r))
res[i,
}'lb'] <- res[,'mu'] + qnorm(0.025) * res[,'se']
res[,'ub'] <- res[,'mu'] + qnorm(0.975) * res[,'se']
res[,'coverage'] <- res[,'lb'] < true_mu & res[,'ub'] > true_mu
res[,'bias'] <- res[,'mu'] - true_mu res[,
1:5,] res[
mu se lb ub coverage bias
[1,] 0.0004065727 0.09355070 -0.18294943 0.1837626 1 0.0004065727
[2,] 0.0450594243 0.10109567 -0.15308446 0.2432033 1 0.0450594243
[3,] 0.0153579674 0.09152029 -0.16401851 0.1947344 1 0.0153579674
[4,] 0.1331559513 0.10133702 -0.06546097 0.3317729 1 0.1331559513
[5,] 0.1953979673 0.09007560 0.01885303 0.3719429 0 0.1953979673
colMeans(res)
vals <- 2 * mean(x) - quantile(res[,'mu'], probs=c(0.975, 0.025))
basicCI <- sprintf("empirical SE: %.3f
out <-MSE: %.3f
basic bootstrap confidence interval: (%.3f, %.3f)
coverage probability: %.3f
mean bias: %.3f
sample mean: %.3f
", sd(res[,'mu']), vals['se'], basicCI[1], basicCI[2], vals['coverage'], vals['bias'], mean(x))
cat(out)
empirical SE: 0.096
MSE: 0.096
basic bootstrap confidence interval: (-0.175, 0.197)
coverage probability: 0.931
mean bias: 0.020
sample mean: 0.023
7.2 Control Structures
Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.
7.3 Root finding with the Newton-Raphson algorithm
The for
loop has already been introduced. Two other important control structures are while
loops and if
statements.
function(x) x^3 + x^2 - 3*x - 3
f <- function(x) 3*x^2 + 2*x - 3
fp <-plot(f, xlim=c(-4,4))
-2
x <- 1
i <-while(abs(f(x)) > 1e-8) {
x - f(x) / fp(x)
x <- i + 1
i <-if(i > 20) {
print("does not converge")
break
}
} x
[1] -1.732051
f(x)
[1] -4.440892e-16
7.4 Exercise 6 & 7
Final exercises, 6 & 7.
8 Worth Mentioning
8.1 tidyverse
Wickham’s “Tidy Data” philosophy was concurrent with the development of a suite of R packages useful for data management. This includes ggplot2
and haven
as well as many others such as dplyr
, stringr
, and tidyr
.
8.2 data.table
data.table
is another beneficial package for data manipulation. It provides similar functionality to the data.frame
, but works well with large data sets. It minimizes memory operations. It is best to develop ones code using either data.frame or data.table from the beginning as they are not exactly equivalent and surprises can happen upon dropping in one for the other.
9 Computing Environment1
R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.6 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rms_6.3-0 SparseM_1.78 plotly_4.10.0 Hmisc_4.7-0 ggplot2_3.3.5 Formula_1.2-3 survival_3.2-13 lattice_0.20-45To cite R in publications use:
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-0, https://CRAN.R-project.org/package=Hmisc.
To cite the rms package in publications use:Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, https://CRAN.R-project.org/package=rms.
To cite the ggplot2 package in publications use:Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
To cite the survival package in publications use:Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.
mu
is a copy of the part of theHmisc
package objectmarkupSpecs
that is for html. It includes a functionsession
that renders the session environment (including package versions) in html.↩︎