MSCI: R Bootcamp

Author

Cole Beck

Published

September 9, 2022

Installation

In today’s R bootcamp, we want to install three things.

  1. R (version 4.2.1): https://cran.r-project.org/
  2. RStudio (version 2022.07.1-554): https://www.rstudio.com/
  3. R packages (rms, ggplot2, plotly)

We will then run an example that should produce an HTML document. This will let us know each part is working correctly.

Windows

On Windows you should install R and RStudio.

macOS

On a Mac, you probably have an arm-based processor. If so, install R-ARM.
Alternatively if you have an Intel-based processor, use R-Intel.

Then install RStudio.

RStudio

RStudio is like a text editor where we can run our R code.

The application window has multiple panes. One can customize what’s open at any point. In general the four quadrants are:

  • Upper Left: Open Files
  • Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
  • Lower Left: Console. The interactive command to the running R interpreter.
  • Lower Right: Directory browser, Plots, Packages, Help.

If one accidentally closes one, it’s easily opened again from the View menu.

Install the packages rms, ggplot2, and plotly within RStudio. From the “Packages” tab, click “Install”. Alternatively run the following command from the “Console” pane.

install.packages(c('rms', 'ggplot2', 'plotly'), type = 'binary')

Creating scripts

R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.

From RStudio, select File -> New File and there are a large number of options. The important ones for this class are:

  • R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click source in the upper right of the file and it will execute in the console below.
  • R Markdown. A file containing a mix of R code and markdown. Using Knit produces an HTML or PDF document. A preview is available when switching to visual editing mode.
  • R Notebook. Uses R Markdown to provide an interactive version with inline previews.
  • Quarto Document. Same as R Markdown but supports additional features and output formats.

A common question is “Which file type should I choose?” It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. Typically you should pick R Markdown or Quarto, especially when creating output that will be shared.

Exercise 1

We need to run an example to ensure everything is working together.

From RStudio, create a new R Markdown file. Select “HTML” as the output format. The document will contain text from the default template.

An R code chunk starts with three backticks, followed by a lowercase “r” surrounded by braces:

```{r}
# example R chunk
```

What line has the first R code chunk? Add two lines to the first code chunk: library(Hmisc) and library(ggplot2). Note that the first chunk has been labelled “setup” - this is the text following the lowercase “r”.

The second R code chunk shows the summary of the “cars” data set. Copy the chunk and make a new one below it. Change summary to describe. Change the chunk label from “cars” to “cars2”.

The third (now 4th) R code chunks plots the pressure data set. Copy the chunk and make a new one below it. Replace the text with qplot(temperature, pressure, data = pressure). Change the chunk label from “pressure” to “pressure2”.

Your R Markdown file should now have five code chunks. Do you know how to navigate straight to a code chunk?

Use the “Knit” button to create an HTML document.

Creating projects

A project brings together several scripts as a project. One can create a project using File -> New Project and select an existing directory or create a blank one as desired. Existing projects can be opened with File -> Open Project.

Session & Working Directory

There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.

It is sometimes helpful to communicate which versions of packages are loaded when discussing issues. The command sessionInfo() provides a nice overview of this that is good for checking package versions loaded.

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.5.4 compiler_4.2.0    magrittr_2.0.1    fastmap_1.1.0     cli_3.3.0         tools_4.2.0       htmltools_0.5.2  
 [8] yaml_2.2.1        stringi_1.5.3     rmarkdown_2.11    knitr_1.36        stringr_1.4.0     xfun_0.30         digest_0.6.25    
[15] jsonlite_1.8.0    rlang_1.0.3       evaluate_0.14    

The current working directory is available via getwd() and settable via setwd("directory"). All file loads and saves are relative to the current working directory.

The menu option Session -> Set Working Directory -> To Source File Location is especially helpful.

Environment

R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.

x <- 3
f <- function()
{ # One can view the parenthesis as capturing a frame
  x <- 4   # Local variable inner frame
  cat("This is the inner 'x'", x, "\n") 
}
f()
This is the inner 'x' 4 
cat("This is the outer 'x'", x, "\n")
This is the outer 'x' 3 

Packages

Packages contain R code and may often require the installation of other packages. When I install Hmisc, I see that its dependency acepack is automatically installed:

Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb

* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)

Install and updating via CRAN

Install through CRAN (official) repositories.

It is easiest to install and update from the “Packages” tab on RStudio.

# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages('Hmisc')
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)

Install and update via Github

Github is a code sharing platform. Packages are typically in a rawer state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the remotes package.

install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')

Loading Packages

Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.

library(Hmisc)
require(Hmisc)

Help

From the R console one can get help about any function by appending it with a ?.

?lm

The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.

Rmarkdown

Prior to Rmarkdown’s availability, document generation was performed with LaTeX. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Document generation is an essential part of reproducible research and as such students are encouraged to learn more about markdown and Rmarkdown. Full support and possibilities are shown at https://rmarkdown.rstudio.com/. Markdown Basics is a good place to start.

The cheat sheet is available at https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown.pdf, or through Rstudio’s menu bar: Help -> Cheetsheets -> R Markdown Cheet Sheet.

Quarto is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative.

Best Practices

R and its packages will always be free. By using these tools, you ensure that you can produce the same results today, tomorrow, and in the future, and that others can reproduce these results as well.

A blog post RMarkdown Driven Development by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.

One takeaway is to avoid the general pitfall of putting large amounts of data processing inside ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts outside the report. The report should load the results of this process.

Another great reference is Dr. Harrell’s analysis project workflow. It can also be used as a template for Quarto as its source code is available (by clicking the “</> Code” link).

Caching

Chunks in R Markdown can be cached. I.e., the results of the last computation are saved and it’s not rerun until a manual override by clicking on that section is done. It is recommended that one not do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes.

A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It’s very similar to caching, but is not prone to later confusion as it’s a bit more direct that the report is loading the processed data file. In short do large data processing tasks outside of reports.

Setup

Most R Markdown documents have a setup section that loads required libraries and sets up some options.

This example the Hmisc package. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable mu.

require(Hmisc)
mu <- markupSpecs$html   # markupSpecs is in Hmisc

Data

An R data.frame consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). A column contains values of a single data type.

Types of data

  • numeric
  • character
  • logical
  • date (time)

Special types

  • Undefined: NULL
  • Missing: NA
  • Infinite: Inf, -Inf
  • Not a Number: NaN

Categorical variables

In R, categorical variables are called “factors”. Values look like character strings but are stored as numeric.

race <- factor(1:3, labels = c('white','black','other'))
race
[1] white black other
Levels: white black other
unclass(race)
[1] 1 2 3
attr(,"levels")
[1] "white" "black" "other"
factor(sample(c('light','moderate','vigorous'), 10, replace = TRUE), ordered = TRUE)
 [1] vigorous vigorous vigorous light    vigorous moderate vigorous moderate vigorous moderate
Levels: light < moderate < vigorous

Numeric vectors

Sometimes you will need to create a collection of numeric values. A one-dimensional collection of values is called a vector. Here are a few ways to create numeric vectors.

numeric(3)
[1] 0 0 0
rep(1, 3)
[1] 1 1 1
c(1, 2, 3)
[1] 1 2 3
1:3
[1] 1 2 3
seq(3)
[1] 1 2 3
seq(1, 10, by = 2)
[1] 1 3 5 7 9
seq(1, 21, length.out = 6)
[1]  1  5  9 13 17 21
sample(10, 5)
[1] 7 9 8 5 3
sample(3, 5, replace = TRUE)
[1] 2 2 2 2 1

Updating with upData

The getHdata function is used to fetch a dataset from the Vanderbilt Biostatistics DataSets web site. We can use the upData function to make updates to a data set. Here it is used to

  • create a new variable from an old one
  • add labels to 2 variables
  • add units to the new variable
  • remove the old variable
  • automatically move units of measurements from parenthetical expressions in labels to separate units attributes used by Hmisc and rms functions for table making and graphics
getHdata(pbc)
pbc <- upData(pbc,
              fu.yrs = fu.days / 365.25,
              labels = c(fu.yrs = 'Follow-up Time',
                         status = 'Death or Liver Transplantation'),
              units = c(fu.yrs = 'year'),
              drop  = 'fu.days',
              moveUnits=TRUE, html=TRUE)
Input object size:   80952 bytes;    19 variables    418 observations
 Label for bili changed from Serum Bilirubin (mg/dl) to Serum Bilirubin 
    units set to mg/dl 
 Label for albumin changed from Albumin (gm/dl) to Albumin 
    units set to gm/dl 
 Label for protime changed from Prothrombin Time (sec.) to Prothrombin Time 
    units set to sec. 
 Label for alk.phos changed from Alkaline Phosphatase (U/liter) to Alkaline Phosphatase 
    units set to U/liter 
 Label for sgot changed from SGOT (U/ml) to SGOT 
    units set to U/ml 
 Label for chol changed from Cholesterol (mg/dl) to Cholesterol 
    units set to mg/dl 
 Label for trig changed from Triglycerides (mg/dl) to Triglycerides 
    units set to mg/dl 
 Label for platelet changed from Platelets (per cm^3/1000) to Platelets 
    units set to per cm^3/1000 
 Label for copper changed from Urine Copper (ug/day) to Urine Copper 
    units set to ug/day 
 Added variable     fu.yrs
 Dropped variable   fu.days
 New object size:   84744 bytes;    19 variables    418 observations

Data Dictionary

contents is used to print a data dictionary, run through an html method for nicer output.

html(contents(pbc), maxlevels=10, levelType='table')

Data frame:pbc

418 observations and 19 variables, maximum # NAs:136  
NameLabelsUnitsLevelsStorageNAs
biliSerum Bilirubinmg/dldouble 0
albuminAlbumingm/dldouble 0
stageHistologic Stage, Ludwig Criteriainteger 6
protimeProthrombin Timesec.double 2
sex2integer 0
ageAgedouble 0
spiders2integer106
hepatom2integer106
ascites2integer106
alk.phosAlkaline PhosphataseU/literdouble106
sgotSGOTU/mldouble106
cholCholesterolmg/dlinteger134
trigTriglyceridesmg/dlinteger136
plateletPlateletsper cm^3/1000integer110
drug3integer 0
statusDeath or Liver Transplantationinteger 0
edema3integer 0
copperUrine Copperug/dayinteger108
fu.yrsFollow-up Timeyeardouble 0

VariableLevels
sexmale
female
spiders, hepatomabsent
 ascitespresent
drugD-penicillamine
placebo
not randomized
edemano edema
edema, no diuretic therapy
edema despite diuretic therapy

Retrieving a row

Put brackets after the data set name. The row number should be to the left of the comma.

pbc[1,]
  bili albumin stage protime    sex      age spiders hepatom ascites alk.phos   sgot chol trig platelet    drug status
1 14.5     2.6     4    12.2 female 58.76523 present present present     1718 137.95  261  172      190 placebo      1
                           edema copper  fu.yrs
1 edema despite diuretic therapy    156 1.09514
pbc[100,]
    bili albumin stage protime  sex      age spiders hepatom ascites alk.phos   sgot chol trig platelet            drug status    edema
100  2.3       3     4      12 male 51.46886  absent present  absent      746 178.25  178  122      119 D-penicillamine      1 no edema
    copper   fu.yrs
100    145 1.511294
pbc[nrow(pbc),]
    bili albumin stage protime    sex age spiders hepatom ascites alk.phos sgot chol trig platelet           drug status    edema copper
418  0.7    3.29     4    10.6 female  53    <NA>    <NA>    <NA>       NA   NA   NA   NA       NA not randomized      0 no edema     NA
      fu.yrs
418 2.672142

Retrieving a column

Put brackets after the data set name. The column name or number should be to the right of the comma. In this example, only the first 10 results are returned.

pbc[1:10,1]
Serum Bilirubin [mg/dl] 
 [1] 14.5  1.1  1.4  1.8  3.4  0.8  1.0  0.3  3.2 12.6
pbc[1:10,'age']
Age 
 [1] 58.76523 56.44627 70.07255 54.74059 38.10541 66.25873 55.53457 53.05681 42.50787 70.55989
# dollar-sign also works
pbc$chol[1:10]
Cholesterol [mg/dl] 
 [1] 261 302 176 244 279 248 322 280 562 200

Here’s a few useful functions we can use on a variable. Watch out for missing values.

mean(pbc$age)
[1] 50.74146
# testing if variable equals a value returns TRUE/FALSE
# using "sum" will count TRUE values
sum(pbc[,'spiders'] == 'present', na.rm = TRUE)
[1] 90
table(pbc$sex)

  male female 
    44    374 
with(pbc, table(sex, spiders, useNA = 'always'))
        spiders
sex      absent present <NA>
  male       32       4    8
  female    190      86   98
  <NA>        0       0    0

Exercise 2

getHdata(vlbw)

Use contents to answer these questions about the “vlbw” data set.

  1. How many rows are in this data set?
  2. Which variable is missing the most data?
  3. Which variables are categorical?

Use mean, sum, or table to answer these questions.

  1. What race appears in this data set the most?
  2. How many twins (multiple gestation) are in the data set?
  3. What is the mean gestational age?

Cleaning and Importing Data

Loading a data set into R typically involves reading a text file. RStudio has an “Import Dataset” button.

Text files

  • read.table - work with data in table format
  • read.csv - CSV files
  • read.delim - tab-delimited files
  • scan - more general function for reading in data

Binary representation of R objects

Compressed data and loads faster.

  • RData
  • feather package

Database connections

Allows data queries.

  • RODBC package
  • RSQLite package

Other statistical packages

  • Hmisc package: sas.get, sasxport.get
  • haven package: read_sas, read_sav, read_dta
  • foreign package
  • Stat/Transfer - not free nor open source

Data Cleaning

While learning R you’ll generally be working with ready-to-analyze data sets.

Data sets often need to be manipulated to become well-structured.

Hadley Wickham’s Tidy Data provides an excellent overview on how to clean messy data sets.

Column header contains values

# politics and religion
senateReligion <- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
                          Democrats=c(21,15,9,0,1,4),
                          Republicans=c(37,10,0,3,0,0))
senateReligion
            religion Democrats Republicans
1         Protestant        21          37
2           Catholic        15          10
3             Jewish         9           0
4             Mormon         0           3
5           Buddhist         1           0
6 Don't Know/Refused         4           0

Democrats/Republicans is a value that should be stored in a column named “party”.

cntParty <- stack(senateReligion[,c('Democrats','Republicans')])
# since "cntParty" is longer, "1:6" is repeated as necessary
cbind(cntParty, 1:6)
   values         ind 1:6
1      21   Democrats   1
2      15   Democrats   2
3       9   Democrats   3
4       0   Democrats   4
5       1   Democrats   5
6       4   Democrats   6
7      37 Republicans   1
8      10 Republicans   2
9       0 Republicans   3
10      3 Republicans   4
11      0 Republicans   5
12      0 Republicans   6
senRel <- cbind(senateReligion[,'religion'], cntParty)
names(senRel) <- c('religion','count','party')
senRel
             religion count       party
1          Protestant    21   Democrats
2            Catholic    15   Democrats
3              Jewish     9   Democrats
4              Mormon     0   Democrats
5            Buddhist     1   Democrats
6  Don't Know/Refused     4   Democrats
7          Protestant    37 Republicans
8            Catholic    10 Republicans
9              Jewish     0 Republicans
10             Mormon     3 Republicans
11           Buddhist     0 Republicans
12 Don't Know/Refused     0 Republicans

Multiple variables in a column

pets <- data.frame(county=c('Davidson','Rutherford','Cannon'),
                   male.dog=c(50,150,70), female.dog=c(60,150,70),
                   male.cat=c(30,100,50), female.cat=c(30,70,40),
                   male.horse=c(6,30,20), female.horse=c(6,28,19))
pets
      county male.dog female.dog male.cat female.cat male.horse female.horse
1   Davidson       50         60       30         30          6            6
2 Rutherford      150        150      100         70         30           28
3     Cannon       70         70       50         40         20           19

Column names have combined “sex” and “animal” - these should be variables (separate columns).

cntPets <- stack(pets[,-1])
pets2 <- cbind(pets[,'county'], cntPets)
names(pets2) <- c('county','count','ind')
pets2
       county count          ind
1    Davidson    50     male.dog
2  Rutherford   150     male.dog
3      Cannon    70     male.dog
4    Davidson    60   female.dog
5  Rutherford   150   female.dog
6      Cannon    70   female.dog
7    Davidson    30     male.cat
8  Rutherford   100     male.cat
9      Cannon    50     male.cat
10   Davidson    30   female.cat
11 Rutherford    70   female.cat
12     Cannon    40   female.cat
13   Davidson     6   male.horse
14 Rutherford    30   male.horse
15     Cannon    20   male.horse
16   Davidson     6 female.horse
17 Rutherford    28 female.horse
18     Cannon    19 female.horse

Note that we can find sex and animal by using period (.) as a delimiter. For example in “male.dog”, “male” is left of the period and “dog” is right of the period. In cases like this we can use the strsplit function. It can be hard to use though.

We have to “escape” the period with slashes. Also, the “ind” column is a factor variable, not a character string, which produces an error.

tryCatch(strsplit(pets2[,'ind'], '\\.'), error = function(e){ e })
<simpleError in strsplit(pets2[, "ind"], "\\."): non-character argument>

It also produces “list” output.

saType <- strsplit(as.character(pets2[,'ind']), '\\.')
saType
[[1]]
[1] "male" "dog" 

[[2]]
[1] "male" "dog" 

[[3]]
[1] "male" "dog" 

[[4]]
[1] "female" "dog"   

[[5]]
[1] "female" "dog"   

[[6]]
[1] "female" "dog"   

[[7]]
[1] "male" "cat" 

[[8]]
[1] "male" "cat" 

[[9]]
[1] "male" "cat" 

[[10]]
[1] "female" "cat"   

[[11]]
[1] "female" "cat"   

[[12]]
[1] "female" "cat"   

[[13]]
[1] "male"  "horse"

[[14]]
[1] "male"  "horse"

[[15]]
[1] "male"  "horse"

[[16]]
[1] "female" "horse" 

[[17]]
[1] "female" "horse" 

[[18]]
[1] "female" "horse" 
saType[[1]]
[1] "male" "dog" 

First for loop, allows iterating over a sequence.

# create two columns, set to missing/NA
pets2[,c('sex','animal')] <- NA
# iterate over the number of rows
# "i" is essentially the row number
for(i in seq(nrow(pets2))) {
  pets2[i, c('sex','animal')] <- saType[[i]]
}
# delete "ind" column
pets2[,'ind'] <- NULL
pets2
       county count    sex animal
1    Davidson    50   male    dog
2  Rutherford   150   male    dog
3      Cannon    70   male    dog
4    Davidson    60 female    dog
5  Rutherford   150 female    dog
6      Cannon    70 female    dog
7    Davidson    30   male    cat
8  Rutherford   100   male    cat
9      Cannon    50   male    cat
10   Davidson    30 female    cat
11 Rutherford    70 female    cat
12     Cannon    40 female    cat
13   Davidson     6   male  horse
14 Rutherford    30   male  horse
15     Cannon    20   male  horse
16   Davidson     6 female  horse
17 Rutherford    28 female  horse
18     Cannon    19 female  horse

Variables in a row

set.seed(1000)
d1 <- rnorm(1000, 5, 2)
d2 <- runif(1000, 0, 10)
d3 <- rpois(1000, 5)
d4 <- rbinom(1000, 10, 0.5)
x <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
           stat=rep(c('mean','sd'),4),
           value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x
  distribution stat    value
1       normal mean 4.975048
2       normal   sd 1.963494
3      uniform mean 4.958261
4      uniform   sd 2.870631
5      poisson mean 5.015000
6      poisson   sd 2.232882
7     binomial mean 5.054000
8     binomial   sd 1.542551

The stat variable is either “mean” or “sd”. These should actually be columns not values.

distStats <- unstack(x, form = value~stat)
distStats
      mean       sd
1 4.975048 1.963494
2 4.958261 2.870631
3 5.015000 2.232882
4 5.054000 1.542551
# "unstack"-ing reduces the number of rows from 8 to 4
# re-combining the data will require removing the duplicate rows
# where are the first occurrences of each distribution?
which(!duplicated(x[,'distribution']))
[1] 1 3 5 7
cbind(distribution = x[c(1,3,5,7),'distribution'], distStats)
  distribution     mean       sd
1       normal 4.975048 1.963494
2      uniform 4.958261 2.870631
3      poisson 5.015000 2.232882
4     binomial 5.054000 1.542551
# a similar approach exists with the "reshape" function
reshape(x, timevar = 'stat', idvar = 'distribution', direction = 'wide')
  distribution value.mean value.sd
1       normal   4.975048 1.963494
3      uniform   4.958261 2.870631
5      poisson   5.015000 2.232882
7     binomial   5.054000 1.542551

Normalization and merging

Merging two data sets requires a common variable that can link the two together.

employees <- data.frame(id = 1:4, Name = c('Eddie','Andrea','Steve','Theresa'),
                        job = c('engineer','accountant','statistician','technician'))
set.seed(11)
hours <- data.frame(id = sample(4, 10, replace = TRUE),
                    week = 1:10, hours = sample(30:60, 10, replace=TRUE))
merge(employees, hours)
   id    Name        job week hours
1   1   Eddie   engineer    6    59
2   1   Eddie   engineer    7    50
3   1   Eddie   engineer    4    59
4   2  Andrea accountant    1    50
5   2  Andrea accountant    9    58
6   2  Andrea accountant    2    51
7   4 Theresa technician    3    59
8   4 Theresa technician    8    32
9   4 Theresa technician    5    36
10  4 Theresa technician   10    40
# the linking variable may need to be specified
# "by.x" refers to the first data set, "by.y" the second
# merge(employees, hours, by.x = 'id', by.y = 'id')

The default merge will only keep observations found in both data sets. Use “all.x”, “all.y”, or “all” to keep observations without links.

empHours <- merge(employees, hours, all.x = TRUE)
empHours
   id    Name          job week hours
1   1   Eddie     engineer    6    59
2   1   Eddie     engineer    7    50
3   1   Eddie     engineer    4    59
4   2  Andrea   accountant    1    50
5   2  Andrea   accountant    9    58
6   2  Andrea   accountant    2    51
7   3   Steve statistician   NA    NA
8   4 Theresa   technician    3    59
9   4 Theresa   technician    8    32
10  4 Theresa   technician    5    36
11  4 Theresa   technician   10    40

To remove a merge, use the unique function with the columns specified.

unique(empHours[,c('id','Name','job')])
  id    Name          job
1  1   Eddie     engineer
4  2  Andrea   accountant
7  3   Steve statistician
8  4 Theresa   technician
empHours[,c('id','week','hours')]
   id week hours
1   1    6    59
2   1    7    50
3   1    4    59
4   2    1    50
5   2    9    58
6   2    2    51
7   3   NA    NA
8   4    3    59
9   4    8    32
10  4    5    36
11  4   10    40

Exclude rows with missing data.

empHours[!is.na(empHours[,'week']), c('id','week','hours')]
   id week hours
1   1    6    59
2   1    7    50
3   1    4    59
4   2    1    50
5   2    9    58
6   2    2    51
8   4    3    59
9   4    8    32
10  4    5    36
11  4   10    40

Full Example with Quarto

Frank has provided a short example to build a reproducible report with RStudio and Quarto. The final output can be viewed here.

The example can be opened in RStudio with the following command. Clicking the “Render” button with produce HTML output.

require(Hmisc)
getRs('stressecho.qmd', put='rstudio')

Exercise 3

Create a R Markdown file that will create output that looks like this.

Refer to this document, exercise 1, Markdown Basics and the stressecho example.

Data Visualization

In ggplot, the type of plot references a “geom”.

geom variable 1 variable 2
density continuous NA
histogram continuous NA
bar discrete NA
point continuous continuous
boxplot discrete continuous
dotplot discrete continuous
violin discrete continuous

Examples

We’ll return to the “vlbw” data set used in the second exercise.

qplot(bwt, data=vlbw, geom='density')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_density).

qplot(bwt, data=vlbw, geom='histogram')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (stat_bin).

qplot(race, data=vlbw, geom='bar')

qplot(gest, bwt, data=vlbw, geom='point')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 5 rows containing missing values (geom_point).

qplot(sex, bwt, data=vlbw, geom='boxplot')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_boxplot).

qplot(sex, bwt, data=vlbw, geom='dotplot', binaxis = "y", binwidth = 10)
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_bindot).

qplot(sex, bwt, data=vlbw, geom='violin')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_ydensity).

Overlapping with jitter

We can provide overlapping geometeries. Order matters.

qplot(sex, bwt, data=vlbw, geom=c('jitter','boxplot'))
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Warning: Removed 2 rows containing missing values (geom_point).

qplot(sex, bwt, data=vlbw, geom=c('boxplot','jitter'))
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Warning: Removed 2 rows containing missing values (geom_point).

Aesthetics

We can use aesthetics to map variables to “x” and “y” as well as color, size, and shape.

qplot(gest, bwt, color=ivh, data=vlbw)
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 5 rows containing missing values (geom_point).

qplot(gest, bwt, color=sex, shape=sex, data=vlbw)
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 23 rows containing missing values (geom_point).

ggplot(data=vlbw) + geom_point(mapping = aes(x=gest, y=bwt, size=lol))
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 383 rows containing missing values (geom_point).

Faceting

If we want to compare plots by a categorical variable, we can use faceting.

Faceting introduces the “~” (tilde) character. Think of it as having a left-hand side (LHS) and right-hand side (RHS).

bestRows <- complete.cases(vlbw[,c('gest','bwt','sex')])
p <- ggplot(data=vlbw[bestRows,]) + geom_point(mapping = aes(x=gest, y=bwt)) +
  scale_x_continuous() + scale_y_continuous()
p

A plot for each race.

p + facet_wrap(~ race)

A plot for each race/sex combination.

p + facet_wrap(~ race + sex)

A plot for each race/sex combination, organized into row (LHS, sex) and column (RHS, race).

p + facet_grid(sex ~ race)

A plot for each race, organized into columns (RHS).

p + facet_grid(. ~ race)

A plot for each race, organized into rows (LHS).

p + facet_grid(race ~ .)

Interactive graphics

Many plots created with ggplot can be made interactive.

library(plotly)
ggplotly(p)

Learning Resources

type site link
list stackoverflow https://stackoverflow.com/tags/r/info
forum stackoverflow https://stackoverflow.com/questions/tagged/r
tutorial Matloff https://github.com/matloff/fasteR
tutorial swirl https://swirlstats.com/students.html
book R for Data Science http://r4ds.had.co.nz/

Also: Frank, Dandan, Cole, and each other

Worth Mentioning

tidyverse

Wickham’s “Tidy Data” philosophy was concurrent with the development of a suite of R packages useful for data management. This includes ggplot2 and haven as well as many others such as dplyr, stringr, and tidyr.

data.table

data.table is another beneficial package for data manipulation. It provides similar functionality to the data.frame, but works well with large data sets. It minimizes memory operations. It is best to develop ones code using either data.frame or data.table from the beginning as they are not exactly equivalent and surprises can happen upon dropping in one for the other.

Computing Environment1

 R version 4.2.0 (2022-04-22)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 18.04.6 LTS
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] plotly_4.10.0   Hmisc_4.7-0     ggplot2_3.3.5   Formula_1.2-3   survival_3.2-13 lattice_0.20-45
 
To cite R in publications use:

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-0, https://CRAN.R-project.org/package=Hmisc.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

To cite the survival package in publications use:

Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.

Footnotes

  1. mu is a copy of the part of the Hmisc package object markupSpecs that is for html. It includes a function session that renders the session environment (including package versions) in html.↩︎