install.packages(c('rms', 'ggplot2', 'plotly'), type = 'binary')
MSCI: R Bootcamp
Installation
In today’s R bootcamp, we want to install three things.
- R (version 4.2.1): https://cran.r-project.org/
- RStudio (version 2022.07.1-554): https://www.rstudio.com/
- R packages (rms, ggplot2, plotly)
We will then run an example that should produce an HTML document. This will let us know each part is working correctly.
Windows
macOS
On a Mac, you probably have an arm-based processor. If so, install R-ARM.
Alternatively if you have an Intel-based processor, use R-Intel.
Then install RStudio.
RStudio
RStudio is like a text editor where we can run our R code.
The application window has multiple panes. One can customize what’s open at any point. In general the four quadrants are:
- Upper Left: Open Files
- Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
- Lower Left: Console. The interactive command to the running R interpreter.
- Lower Right: Directory browser, Plots, Packages, Help.
If one accidentally closes one, it’s easily opened again from the View
menu.
Install the packages rms
, ggplot2
, and plotly
within RStudio. From the “Packages” tab, click “Install”. Alternatively run the following command from the “Console” pane.
Creating scripts
R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.
From RStudio, select File -> New File
and there are a large number of options. The important ones for this class are:
- R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click
source
in the upper right of the file and it will execute in the console below. - R Markdown. A file containing a mix of R code and markdown. Using
Knit
produces an HTML or PDF document. A preview is available when switching to visual editing mode. - R Notebook. Uses R Markdown to provide an interactive version with inline previews.
- Quarto Document. Same as R Markdown but supports additional features and output formats.
A common question is “Which file type should I choose?” It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. Typically you should pick R Markdown or Quarto, especially when creating output that will be shared.
Exercise 1
We need to run an example to ensure everything is working together.
From RStudio, create a new R Markdown file. Select “HTML” as the output format. The document will contain text from the default template.
An R code chunk starts with three backticks, followed by a lowercase “r” surrounded by braces:
```{r}
# example R chunk
```
What line has the first R code chunk? Add two lines to the first code chunk: library(Hmisc)
and library(ggplot2)
. Note that the first chunk has been labelled “setup” - this is the text following the lowercase “r”.
The second R code chunk shows the summary of the “cars” data set. Copy the chunk and make a new one below it. Change summary
to describe
. Change the chunk label from “cars” to “cars2”.
The third (now 4th) R code chunks plots the pressure data set. Copy the chunk and make a new one below it. Replace the text with qplot(temperature, pressure, data = pressure)
. Change the chunk label from “pressure” to “pressure2”.
Your R Markdown file should now have five code chunks. Do you know how to navigate straight to a code chunk?
Use the “Knit” button to create an HTML document.
Creating projects
A project brings together several scripts as a project. One can create a project using File -> New Project
and select an existing directory or create a blank one as desired. Existing projects can be opened with File -> Open Project
.
Session & Working Directory
There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks
restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.
It is sometimes helpful to communicate which versions of packages are loaded when discussing issues. The command sessionInfo()
provides a nice overview of this that is good for checking package versions loaded.
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.5.4 compiler_4.2.0 magrittr_2.0.1 fastmap_1.1.0 cli_3.3.0 tools_4.2.0 htmltools_0.5.2
[8] yaml_2.2.1 stringi_1.5.3 rmarkdown_2.11 knitr_1.36 stringr_1.4.0 xfun_0.30 digest_0.6.25
[15] jsonlite_1.8.0 rlang_1.0.3 evaluate_0.14
The current working directory is available via getwd()
and settable via setwd("directory")
. All file loads and saves are relative to the current working directory.
The menu option Session -> Set Working Directory -> To Source File Location
is especially helpful.
Environment
R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.
<- 3
x <- function()
f # One can view the parenthesis as capturing a frame
{ <- 4 # Local variable inner frame
x cat("This is the inner 'x'", x, "\n")
}f()
This is the inner 'x' 4
cat("This is the outer 'x'", x, "\n")
This is the outer 'x' 3
Packages
Packages contain R code and may often require the installation of other packages. When I install Hmisc
, I see that its dependency acepack
is automatically installed:
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb
* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
Install and updating via CRAN
Install through CRAN (official) repositories.
It is easiest to install and update from the “Packages” tab on RStudio.
# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages('Hmisc')
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)
Install and update via Github
Github is a code sharing platform. Packages are typically in a rawer state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the remotes
package.
install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')
Loading Packages
Note that library
is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require
used in its place. If the package is not installed, library
will trigger an error while require
will return FALSE. Once loaded the functions created in the package are available to your R session.
library(Hmisc)
require(Hmisc)
Help
From the R console one can get help about any function by appending it with a ?
.
?lm
The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.
Rmarkdown
Prior to Rmarkdown’s availability, document generation was performed with LaTeX. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Document generation is an essential part of reproducible research and as such students are encouraged to learn more about markdown and Rmarkdown. Full support and possibilities are shown at https://rmarkdown.rstudio.com/. Markdown Basics is a good place to start.
The cheat sheet is available at https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown.pdf, or through Rstudio’s menu bar: Help -> Cheetsheets -> R Markdown Cheet Sheet
.
Quarto is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative.
Best Practices
R and its packages will always be free. By using these tools, you ensure that you can produce the same results today, tomorrow, and in the future, and that others can reproduce these results as well.
A blog post RMarkdown Driven Development by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.
One takeaway is to avoid the general pitfall of putting large amounts of data processing inside ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts outside the report. The report should load the results of this process.
Another great reference is Dr. Harrell’s analysis project workflow. It can also be used as a template for Quarto as its source code is available (by clicking the “</> Code” link).
Caching
Chunks in R Markdown can be cached. I.e., the results of the last computation are saved and it’s not rerun until a manual override by clicking on that section is done. It is recommended that one not do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes.
A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It’s very similar to caching, but is not prone to later confusion as it’s a bit more direct that the report is loading the processed data file. In short do large data processing tasks outside of reports.
Setup
Most R Markdown documents have a setup section that loads required libraries and sets up some options.
This example the Hmisc
package. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable mu
.
require(Hmisc)
<- markupSpecs$html # markupSpecs is in Hmisc mu
Data
An R data.frame
consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). A column contains values of a single data type.
Types of data
- numeric
- character
- logical
- date (time)
Special types
- Undefined: NULL
- Missing: NA
- Infinite: Inf, -Inf
- Not a Number: NaN
Categorical variables
In R, categorical variables are called “factors”. Values look like character strings but are stored as numeric.
<- factor(1:3, labels = c('white','black','other'))
race race
[1] white black other
Levels: white black other
unclass(race)
[1] 1 2 3
attr(,"levels")
[1] "white" "black" "other"
factor(sample(c('light','moderate','vigorous'), 10, replace = TRUE), ordered = TRUE)
[1] vigorous vigorous vigorous light vigorous moderate vigorous moderate vigorous moderate
Levels: light < moderate < vigorous
Numeric vectors
Sometimes you will need to create a collection of numeric values. A one-dimensional collection of values is called a vector. Here are a few ways to create numeric vectors.
numeric(3)
[1] 0 0 0
rep(1, 3)
[1] 1 1 1
c(1, 2, 3)
[1] 1 2 3
1:3
[1] 1 2 3
seq(3)
[1] 1 2 3
seq(1, 10, by = 2)
[1] 1 3 5 7 9
seq(1, 21, length.out = 6)
[1] 1 5 9 13 17 21
sample(10, 5)
[1] 7 9 8 5 3
sample(3, 5, replace = TRUE)
[1] 2 2 2 2 1
Updating with upData
The getHdata
function is used to fetch a dataset from the Vanderbilt Biostatistics DataSets web site. We can use the upData
function to make updates to a data set. Here it is used to
- create a new variable from an old one
- add labels to 2 variables
- add units to the new variable
- remove the old variable
- automatically move units of measurements from parenthetical expressions in labels to separate
units
attributes used byHmisc
andrms
functions for table making and graphics
getHdata(pbc)
<- upData(pbc,
pbc fu.yrs = fu.days / 365.25,
labels = c(fu.yrs = 'Follow-up Time',
status = 'Death or Liver Transplantation'),
units = c(fu.yrs = 'year'),
drop = 'fu.days',
moveUnits=TRUE, html=TRUE)
Input object size: 80952 bytes; 19 variables 418 observations Label for bili changed from Serum Bilirubin (mg/dl) to Serum Bilirubin units set to mg/dl Label for albumin changed from Albumin (gm/dl) to Albumin units set to gm/dl Label for protime changed from Prothrombin Time (sec.) to Prothrombin Time units set to sec. Label for alk.phos changed from Alkaline Phosphatase (U/liter) to Alkaline Phosphatase units set to U/liter Label for sgot changed from SGOT (U/ml) to SGOT units set to U/ml Label for chol changed from Cholesterol (mg/dl) to Cholesterol units set to mg/dl Label for trig changed from Triglycerides (mg/dl) to Triglycerides units set to mg/dl Label for platelet changed from Platelets (per cm^3/1000) to Platelets units set to per cm^3/1000 Label for copper changed from Urine Copper (ug/day) to Urine Copper units set to ug/day Added variable fu.yrs Dropped variable fu.days New object size: 84744 bytes; 19 variables 418 observations
Data Dictionary
contents
is used to print a data dictionary, run through an html
method for nicer output.
html(contents(pbc), maxlevels=10, levelType='table')
Data frame:pbc
418 observations and 19 variables, maximum # NAs:136Name | Labels | Units | Levels | Storage | NAs |
---|---|---|---|---|---|
bili | Serum Bilirubin | mg/dl | double | 0 | |
albumin | Albumin | gm/dl | double | 0 | |
stage | Histologic Stage, Ludwig Criteria | integer | 6 | ||
protime | Prothrombin Time | sec. | double | 2 | |
sex | 2 | integer | 0 | ||
age | Age | double | 0 | ||
spiders | 2 | integer | 106 | ||
hepatom | 2 | integer | 106 | ||
ascites | 2 | integer | 106 | ||
alk.phos | Alkaline Phosphatase | U/liter | double | 106 | |
sgot | SGOT | U/ml | double | 106 | |
chol | Cholesterol | mg/dl | integer | 134 | |
trig | Triglycerides | mg/dl | integer | 136 | |
platelet | Platelets | per cm^3/1000 | integer | 110 | |
drug | 3 | integer | 0 | ||
status | Death or Liver Transplantation | integer | 0 | ||
edema | 3 | integer | 0 | ||
copper | Urine Copper | ug/day | integer | 108 | |
fu.yrs | Follow-up Time | year | double | 0 |
Variable | Levels |
---|---|
sex | male |
female | |
spiders, hepatom | absent |
ascites | present |
drug | D-penicillamine |
placebo | |
not randomized | |
edema | no edema |
edema, no diuretic therapy | |
edema despite diuretic therapy |
Retrieving a row
Put brackets after the data set name. The row number should be to the left of the comma.
1,] pbc[
bili albumin stage protime sex age spiders hepatom ascites alk.phos sgot chol trig platelet drug status
1 14.5 2.6 4 12.2 female 58.76523 present present present 1718 137.95 261 172 190 placebo 1
edema copper fu.yrs
1 edema despite diuretic therapy 156 1.09514
100,] pbc[
bili albumin stage protime sex age spiders hepatom ascites alk.phos sgot chol trig platelet drug status edema
100 2.3 3 4 12 male 51.46886 absent present absent 746 178.25 178 122 119 D-penicillamine 1 no edema
copper fu.yrs
100 145 1.511294
nrow(pbc),] pbc[
bili albumin stage protime sex age spiders hepatom ascites alk.phos sgot chol trig platelet drug status edema copper
418 0.7 3.29 4 10.6 female 53 <NA> <NA> <NA> NA NA NA NA NA not randomized 0 no edema NA
fu.yrs
418 2.672142
Retrieving a column
Put brackets after the data set name. The column name or number should be to the right of the comma. In this example, only the first 10 results are returned.
1:10,1] pbc[
Serum Bilirubin [mg/dl]
[1] 14.5 1.1 1.4 1.8 3.4 0.8 1.0 0.3 3.2 12.6
1:10,'age'] pbc[
Age
[1] 58.76523 56.44627 70.07255 54.74059 38.10541 66.25873 55.53457 53.05681 42.50787 70.55989
# dollar-sign also works
$chol[1:10] pbc
Cholesterol [mg/dl]
[1] 261 302 176 244 279 248 322 280 562 200
Here’s a few useful functions we can use on a variable. Watch out for missing values.
mean(pbc$age)
[1] 50.74146
# testing if variable equals a value returns TRUE/FALSE
# using "sum" will count TRUE values
sum(pbc[,'spiders'] == 'present', na.rm = TRUE)
[1] 90
table(pbc$sex)
male female
44 374
with(pbc, table(sex, spiders, useNA = 'always'))
spiders
sex absent present <NA>
male 32 4 8
female 190 86 98
<NA> 0 0 0
Exercise 2
getHdata(vlbw)
Use contents
to answer these questions about the “vlbw” data set.
- How many rows are in this data set?
- Which variable is missing the most data?
- Which variables are categorical?
Use mean
, sum
, or table
to answer these questions.
- What race appears in this data set the most?
- How many twins (multiple gestation) are in the data set?
- What is the mean gestational age?
Cleaning and Importing Data
Loading a data set into R typically involves reading a text file. RStudio has an “Import Dataset” button.
Text files
- read.table - work with data in table format
- read.csv - CSV files
- read.delim - tab-delimited files
- scan - more general function for reading in data
Binary representation of R objects
Compressed data and loads faster.
- RData
feather
package
Database connections
Allows data queries.
RODBC
packageRSQLite
package
Other statistical packages
Hmisc
package:sas.get
,sasxport.get
haven
package:read_sas
,read_sav
,read_dta
foreign
package- Stat/Transfer - not free nor open source
Data Cleaning
While learning R you’ll generally be working with ready-to-analyze data sets.
Data sets often need to be manipulated to become well-structured.
Hadley Wickham’s Tidy Data provides an excellent overview on how to clean messy data sets.
Column header contains values
# politics and religion
<- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
senateReligion Democrats=c(21,15,9,0,1,4),
Republicans=c(37,10,0,3,0,0))
senateReligion
religion Democrats Republicans
1 Protestant 21 37
2 Catholic 15 10
3 Jewish 9 0
4 Mormon 0 3
5 Buddhist 1 0
6 Don't Know/Refused 4 0
Democrats/Republicans is a value that should be stored in a column named “party”.
<- stack(senateReligion[,c('Democrats','Republicans')])
cntParty # since "cntParty" is longer, "1:6" is repeated as necessary
cbind(cntParty, 1:6)
values ind 1:6
1 21 Democrats 1
2 15 Democrats 2
3 9 Democrats 3
4 0 Democrats 4
5 1 Democrats 5
6 4 Democrats 6
7 37 Republicans 1
8 10 Republicans 2
9 0 Republicans 3
10 3 Republicans 4
11 0 Republicans 5
12 0 Republicans 6
<- cbind(senateReligion[,'religion'], cntParty)
senRel names(senRel) <- c('religion','count','party')
senRel
religion count party
1 Protestant 21 Democrats
2 Catholic 15 Democrats
3 Jewish 9 Democrats
4 Mormon 0 Democrats
5 Buddhist 1 Democrats
6 Don't Know/Refused 4 Democrats
7 Protestant 37 Republicans
8 Catholic 10 Republicans
9 Jewish 0 Republicans
10 Mormon 3 Republicans
11 Buddhist 0 Republicans
12 Don't Know/Refused 0 Republicans
Multiple variables in a column
<- data.frame(county=c('Davidson','Rutherford','Cannon'),
pets male.dog=c(50,150,70), female.dog=c(60,150,70),
male.cat=c(30,100,50), female.cat=c(30,70,40),
male.horse=c(6,30,20), female.horse=c(6,28,19))
pets
county male.dog female.dog male.cat female.cat male.horse female.horse
1 Davidson 50 60 30 30 6 6
2 Rutherford 150 150 100 70 30 28
3 Cannon 70 70 50 40 20 19
Column names have combined “sex” and “animal” - these should be variables (separate columns).
<- stack(pets[,-1])
cntPets <- cbind(pets[,'county'], cntPets)
pets2 names(pets2) <- c('county','count','ind')
pets2
county count ind
1 Davidson 50 male.dog
2 Rutherford 150 male.dog
3 Cannon 70 male.dog
4 Davidson 60 female.dog
5 Rutherford 150 female.dog
6 Cannon 70 female.dog
7 Davidson 30 male.cat
8 Rutherford 100 male.cat
9 Cannon 50 male.cat
10 Davidson 30 female.cat
11 Rutherford 70 female.cat
12 Cannon 40 female.cat
13 Davidson 6 male.horse
14 Rutherford 30 male.horse
15 Cannon 20 male.horse
16 Davidson 6 female.horse
17 Rutherford 28 female.horse
18 Cannon 19 female.horse
Note that we can find sex and animal by using period (.) as a delimiter. For example in “male.dog”, “male” is left of the period and “dog” is right of the period. In cases like this we can use the strsplit
function. It can be hard to use though.
We have to “escape” the period with slashes. Also, the “ind” column is a factor variable, not a character string, which produces an error.
tryCatch(strsplit(pets2[,'ind'], '\\.'), error = function(e){ e })
<simpleError in strsplit(pets2[, "ind"], "\\."): non-character argument>
It also produces “list” output.
<- strsplit(as.character(pets2[,'ind']), '\\.')
saType saType
[[1]]
[1] "male" "dog"
[[2]]
[1] "male" "dog"
[[3]]
[1] "male" "dog"
[[4]]
[1] "female" "dog"
[[5]]
[1] "female" "dog"
[[6]]
[1] "female" "dog"
[[7]]
[1] "male" "cat"
[[8]]
[1] "male" "cat"
[[9]]
[1] "male" "cat"
[[10]]
[1] "female" "cat"
[[11]]
[1] "female" "cat"
[[12]]
[1] "female" "cat"
[[13]]
[1] "male" "horse"
[[14]]
[1] "male" "horse"
[[15]]
[1] "male" "horse"
[[16]]
[1] "female" "horse"
[[17]]
[1] "female" "horse"
[[18]]
[1] "female" "horse"
1]] saType[[
[1] "male" "dog"
First for
loop, allows iterating over a sequence.
# create two columns, set to missing/NA
c('sex','animal')] <- NA
pets2[,# iterate over the number of rows
# "i" is essentially the row number
for(i in seq(nrow(pets2))) {
c('sex','animal')] <- saType[[i]]
pets2[i,
}# delete "ind" column
'ind'] <- NULL
pets2[, pets2
county count sex animal
1 Davidson 50 male dog
2 Rutherford 150 male dog
3 Cannon 70 male dog
4 Davidson 60 female dog
5 Rutherford 150 female dog
6 Cannon 70 female dog
7 Davidson 30 male cat
8 Rutherford 100 male cat
9 Cannon 50 male cat
10 Davidson 30 female cat
11 Rutherford 70 female cat
12 Cannon 40 female cat
13 Davidson 6 male horse
14 Rutherford 30 male horse
15 Cannon 20 male horse
16 Davidson 6 female horse
17 Rutherford 28 female horse
18 Cannon 19 female horse
Variables in a row
set.seed(1000)
<- rnorm(1000, 5, 2)
d1 <- runif(1000, 0, 10)
d2 <- rpois(1000, 5)
d3 <- rbinom(1000, 10, 0.5)
d4 <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
x stat=rep(c('mean','sd'),4),
value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x
distribution stat value
1 normal mean 4.975048
2 normal sd 1.963494
3 uniform mean 4.958261
4 uniform sd 2.870631
5 poisson mean 5.015000
6 poisson sd 2.232882
7 binomial mean 5.054000
8 binomial sd 1.542551
The stat variable is either “mean” or “sd”. These should actually be columns not values.
<- unstack(x, form = value~stat)
distStats distStats
mean sd
1 4.975048 1.963494
2 4.958261 2.870631
3 5.015000 2.232882
4 5.054000 1.542551
# "unstack"-ing reduces the number of rows from 8 to 4
# re-combining the data will require removing the duplicate rows
# where are the first occurrences of each distribution?
which(!duplicated(x[,'distribution']))
[1] 1 3 5 7
cbind(distribution = x[c(1,3,5,7),'distribution'], distStats)
distribution mean sd
1 normal 4.975048 1.963494
2 uniform 4.958261 2.870631
3 poisson 5.015000 2.232882
4 binomial 5.054000 1.542551
# a similar approach exists with the "reshape" function
reshape(x, timevar = 'stat', idvar = 'distribution', direction = 'wide')
distribution value.mean value.sd
1 normal 4.975048 1.963494
3 uniform 4.958261 2.870631
5 poisson 5.015000 2.232882
7 binomial 5.054000 1.542551
Normalization and merging
Merging two data sets requires a common variable that can link the two together.
<- data.frame(id = 1:4, Name = c('Eddie','Andrea','Steve','Theresa'),
employees job = c('engineer','accountant','statistician','technician'))
set.seed(11)
<- data.frame(id = sample(4, 10, replace = TRUE),
hours week = 1:10, hours = sample(30:60, 10, replace=TRUE))
merge(employees, hours)
id Name job week hours
1 1 Eddie engineer 6 59
2 1 Eddie engineer 7 50
3 1 Eddie engineer 4 59
4 2 Andrea accountant 1 50
5 2 Andrea accountant 9 58
6 2 Andrea accountant 2 51
7 4 Theresa technician 3 59
8 4 Theresa technician 8 32
9 4 Theresa technician 5 36
10 4 Theresa technician 10 40
# the linking variable may need to be specified
# "by.x" refers to the first data set, "by.y" the second
# merge(employees, hours, by.x = 'id', by.y = 'id')
The default merge will only keep observations found in both data sets. Use “all.x”, “all.y”, or “all” to keep observations without links.
<- merge(employees, hours, all.x = TRUE)
empHours empHours
id Name job week hours
1 1 Eddie engineer 6 59
2 1 Eddie engineer 7 50
3 1 Eddie engineer 4 59
4 2 Andrea accountant 1 50
5 2 Andrea accountant 9 58
6 2 Andrea accountant 2 51
7 3 Steve statistician NA NA
8 4 Theresa technician 3 59
9 4 Theresa technician 8 32
10 4 Theresa technician 5 36
11 4 Theresa technician 10 40
To remove a merge, use the unique
function with the columns specified.
unique(empHours[,c('id','Name','job')])
id Name job
1 1 Eddie engineer
4 2 Andrea accountant
7 3 Steve statistician
8 4 Theresa technician
c('id','week','hours')] empHours[,
id week hours
1 1 6 59
2 1 7 50
3 1 4 59
4 2 1 50
5 2 9 58
6 2 2 51
7 3 NA NA
8 4 3 59
9 4 8 32
10 4 5 36
11 4 10 40
Exclude rows with missing data.
!is.na(empHours[,'week']), c('id','week','hours')] empHours[
id week hours
1 1 6 59
2 1 7 50
3 1 4 59
4 2 1 50
5 2 9 58
6 2 2 51
8 4 3 59
9 4 8 32
10 4 5 36
11 4 10 40
Full Example with Quarto
Frank has provided a short example to build a reproducible report with RStudio and Quarto. The final output can be viewed here.
The example can be opened in RStudio with the following command. Clicking the “Render” button with produce HTML output.
require(Hmisc)
getRs('stressecho.qmd', put='rstudio')
Exercise 3
Create a R Markdown file that will create output that looks like this.
Refer to this document, exercise 1, Markdown Basics and the stressecho example.
Data Visualization
In ggplot, the type of plot references a “geom”.
geom | variable 1 | variable 2 |
---|---|---|
density | continuous | NA |
histogram | continuous | NA |
bar | discrete | NA |
point | continuous | continuous |
boxplot | discrete | continuous |
dotplot | discrete | continuous |
violin | discrete | continuous |
Examples
We’ll return to the “vlbw” data set used in the second exercise.
qplot(bwt, data=vlbw, geom='density')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_density).
qplot(bwt, data=vlbw, geom='histogram')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (stat_bin).
qplot(race, data=vlbw, geom='bar')
qplot(gest, bwt, data=vlbw, geom='point')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 5 rows containing missing values (geom_point).
qplot(sex, bwt, data=vlbw, geom='boxplot')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_boxplot).
qplot(sex, bwt, data=vlbw, geom='dotplot', binaxis = "y", binwidth = 10)
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_bindot).
qplot(sex, bwt, data=vlbw, geom='violin')
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Overlapping with jitter
We can provide overlapping geometeries. Order matters.
qplot(sex, bwt, data=vlbw, geom=c('jitter','boxplot'))
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Warning: Removed 2 rows containing missing values (geom_point).
qplot(sex, bwt, data=vlbw, geom=c('boxplot','jitter'))
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Warning: Removed 2 rows containing missing values (geom_point).
Aesthetics
We can use aesthetics to map variables to “x” and “y” as well as color, size, and shape.
qplot(gest, bwt, color=ivh, data=vlbw)
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 5 rows containing missing values (geom_point).
qplot(gest, bwt, color=sex, shape=sex, data=vlbw)
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 23 rows containing missing values (geom_point).
ggplot(data=vlbw) + geom_point(mapping = aes(x=gest, y=bwt, size=lol))
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
Warning: Removed 383 rows containing missing values (geom_point).
Faceting
If we want to compare plots by a categorical variable, we can use faceting.
Faceting introduces the “~” (tilde) character. Think of it as having a left-hand side (LHS) and right-hand side (RHS).
<- complete.cases(vlbw[,c('gest','bwt','sex')])
bestRows <- ggplot(data=vlbw[bestRows,]) + geom_point(mapping = aes(x=gest, y=bwt)) +
p scale_x_continuous() + scale_y_continuous()
p
A plot for each race.
+ facet_wrap(~ race) p
A plot for each race/sex combination.
+ facet_wrap(~ race + sex) p
A plot for each race/sex combination, organized into row (LHS, sex) and column (RHS, race).
+ facet_grid(sex ~ race) p
A plot for each race, organized into columns (RHS).
+ facet_grid(. ~ race) p
A plot for each race, organized into rows (LHS).
+ facet_grid(race ~ .) p
Interactive graphics
Many plots created with ggplot can be made interactive.
library(plotly)
ggplotly(p)
Learning Resources
type | site | link |
---|---|---|
list | stackoverflow | https://stackoverflow.com/tags/r/info |
forum | stackoverflow | https://stackoverflow.com/questions/tagged/r |
tutorial | Matloff | https://github.com/matloff/fasteR |
tutorial | swirl | https://swirlstats.com/students.html |
book | R for Data Science | http://r4ds.had.co.nz/ |
Also: Frank, Dandan, Cole, and each other
Worth Mentioning
tidyverse
Wickham’s “Tidy Data” philosophy was concurrent with the development of a suite of R packages useful for data management. This includes ggplot2
and haven
as well as many others such as dplyr
, stringr
, and tidyr
.
data.table
data.table
is another beneficial package for data manipulation. It provides similar functionality to the data.frame
, but works well with large data sets. It minimizes memory operations. It is best to develop ones code using either data.frame or data.table from the beginning as they are not exactly equivalent and surprises can happen upon dropping in one for the other.
Computing Environment1
R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.6 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] plotly_4.10.0 Hmisc_4.7-0 ggplot2_3.3.5 Formula_1.2-3 survival_3.2-13 lattice_0.20-45To cite R in publications use:
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-0, https://CRAN.R-project.org/package=Hmisc.
To cite the ggplot2 package in publications use:Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
To cite the survival package in publications use:Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.
Footnotes
mu
is a copy of the part of theHmisc
package objectmarkupSpecs
that is for html. It includes a functionsession
that renders the session environment (including package versions) in html.↩︎