LaTeX is an open source typesetting system used for document preparation. It should be installed for R and RStudio to produce PDF reports. It is a tool for advanced users who wish to compose advanced tables, use cross-referencing, indexing, or have finer control over document layout. Others may use Markdown as an alternative to produce PDF and HTML reports.
Install MiKTeX (“1,” n.d.) - choose Basic MiKTeX Installer
Applications will likely be installed in C:\Program Files (x86)\MiKTeX 2.9\miktex\bin
.
Use the mpm
application to install additional LaTeX packages.
Install MacTeX (“2,” n.d.) - choose BasicTeX.pkg
Applications will likely be installed in /usr/texbin
.
Use the tlmgr
command-line program to install additional LaTeX packages. As an example, this installs the framed
style package:
sudo tlmgr install framed
Install TeXLive - Ubuntu flavored distributions may use the command-line program apt-get
:
sudo apt-get install texlive-base texlive-latex-recommended texlive-latex-extra
Applications will likely be installed in /usr/bin
.
Additional LaTeX packages may require manual installation in /usr/share/texlive/texmf-dist/tex/latex
.
After LaTeX has been installed, test that pdflatex
runs from the command line prompt:
pdflatex --version
If this fails to return information about pdflatex
, check that it is installed in the proper location. If found the LaTeX application path needs to be added to your environment path, or you may need to log out and back in.
Install the packages Hmisc, rms, knitr, rmarkdown
through RStudio.
Packages contain R code and may often require the installation of other packages. When I install Hmisc
, I see that its dependency acepack
is automatically installed:
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb
* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown'))
# update all packages
update.packages(checkBuilt=TRUE, ask=FALSE)
Note that library
is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require
used in its place. If the package is not installed, library
will trigger an error while require
will return FALSE. Once loaded the functions created in the package are available to your R session.
library(Hmisc)
require(Hmisc)
knitr (“3,” n.d.) was designed as a replacement for Sweave - it is used for dynamic report generation. It combines the content of your document with the R code that produces results. Blocks of code are called chunks.
By default RStudio uses Sweave, but this can be changed to knitr in the options.
Tools -> Global Options, Sweave panel, "Weave Rnw files using..."
Here’s an example of using the knitr
package to strip out all of this file’s R chunks and store them in an R script.
library(knitr)
# may need path to file
purl('intro.Rmd')
Reports can be generated by two methods, Rnw (noweb) files or Rmd (markdown) files. Syntax-wise there are two major differences:
LaTeX is much more difficult to learn but it offers superior control over document layout and formatting. Markdown offers more flexibility in the type of output, such as PDF, HTML and Word. How do you choose which to use? A good rule of thumb is to ask yourself “must I print the results nicely on paper?”. If yes, Rnw (Yihui Xie (“4,” n.d.)).
Some R objects are easily converted into LaTeX (see Hmisc::latex
). If you are using any LaTeX functions, you should also choose Rnw files.
You can see good examples of both methods here (“5,” n.d.).
Output is created by clicking the Compile PDF
or Knit HTML
button.
An Rnw chunk looks like this:
<<name, chunk-options>>=
# lines of R code
@
Here’s an R chunk in Rmd:
```{r, chunk-options}
# lines of R code
```
There are many chunk options, see the online documentation (“6,” n.d.).
Dr. Harrell has predefined several options and functions that are useful when creating Rnw files. Load the Hmisc
package to make the knitrSet
function available for use. In the past this function could be included by adding it to your .Rprofile
file (example Rprofile) (“7,” n.d.). If knitrSet
is included in your Rprofile
, it should be removed.
The following chunk will load Hmisc
and run this function.
<<echo=FALSE>>=
require(Hmisc)
knitrSet()
@
Any of the Rmd files found here (“8,” n.d.) would make a good template for a reproducible report. There are two chunks you should consider including in any report. As previously mentioned knitrSet
should be added to the beginning of the report. The computing environment should be added to the end.
It should look similar to this.
```{r setup,echo=FALSE}
require(Hmisc)
knitrSet(lang='markdown')
```
# Analysis Start
# content
# Analysis End
# Computing Environment
```{r rsession,echo=FALSE}
si <- sessionInfo(); si$loadedOnly <- NULL
print(si, locale=FALSE)
```
```{r cite,results='asis',echo=FALSE}
print(citation(), style='text')
```
The Hmisc::getRs
function can be used to open one of the sample rscripts within RStudio.
In R, a data set is called a data.frame
. Consisting of rows and columns of data, they are the fundamental data structure in R.
R reads in delimited files, such as comma-separated or tab-delimited, as data frames. Typically this is done with one of the following functions.
Several options are useful:
A data set may also be imported by clicking the Import Dataset
button in the Environment pane in RStudio.
The base package datasets
includes several data sets. In this example we’ll use one to create a file that can be read in as a CSV.
head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
write.csv(state.x77, file='state.csv')
state <- read.csv('state.csv', stringsAsFactors=FALSE)
Other methods are available to read in data sets created outside of R.
For example, Hmisc
includes the functions sas.get
and sasxport.get
to work with SAS data sets.
Stat/Transfer is a good option but is not free nor open source.
See the foreign
package for other options.
The Hmisc::getHdata
function downloads ready-to-use data sets from the web site for Hmisc
and rms
.
require(Hmisc, warn.conflicts=FALSE, quietly=TRUE)
# list available data sets
getHdata()
[1] "abm" "acath" "ari" "ari_other"
[5] "birth.estriol" "boston" "cdystonia" "counties"
[9] "diabetes" "dmd" "DominicanHTN" "esopH"
[13] "esopH2" "FEV" "hospital" "kprats"
[17] "lead" "nhgh" "olympics.1996" "pbc"
[21] "plasma" "prostate" "rhc" "sex.age.response"
[25] "stressEcho" "support2" "support" "titanic2"
[29] "titanic3" "titanic" "valung" "vlbw"
# download and load data set
getHdata(counties)
head(counties)
county state msa pmsa pop.density pop pop.change age6574 age75 crime college
1 Autauga AL 5240 NA 61 34222 11.9 5.7 4.1 4996 14.5
2 Baldwin AL 5160 NA 67 98280 35.4 9.2 6.0 3329 16.8
3 Barbour AL NA NA 29 25417 2.0 8.2 6.4 3192 11.8
4 Bibb AL NA NA 28 16576 9.2 6.7 6.0 0 4.7
5 Blount AL 1000 NA 62 39248 10.6 7.4 5.6 2052 7.0
6 Bullock AL NA NA 18 11042 3.7 8.5 7.6 3630 10.0
income farm democrat republican Perot white black turnout
1 32240 1.8 30.9 55.9 12.3 79.31740 20.001753 45.54088
2 30199 1.7 26.2 56.5 16.5 86.04498 12.861213 47.28938
3 23838 2.4 46.4 42.9 9.8 55.54550 44.041390 41.03946
4 23714 0.9 43.2 46.5 10.2 78.74035 20.982143 40.54054
5 26323 4.7 32.9 53.8 11.8 97.83173 1.327456 42.05310
6 17796 2.6 67.7 26.0 5.5 27.49502 72.323853 43.61529
# view data dictionary for file - opens in web browser
getHdata(counties, what="contents")
contents(counties)
Data frame:counties 3141 observations and 19 variables Maximum # NAs:2956
Labels Storage NAs
county character 0
state character 0
msa integer 2311
pmsa integer 2956
pop.density 1992 pop per 1990 miles^2 integer 0
pop 1990 population integer 0
pop.change % population change 1980-1992 double 0
age6574 % age 65-74, 1990 double 0
age75 % age >= 75, 1990 double 0
crime serious crimes per 100,000 1991 integer 0
college % with bachelor's degree or higher of those age>=25 double 0
income median family income, 1989 dollars integer 0
farm farm population, % of total, 1990 double 0
democrat % votes cast for democratic president double 27
republican % votes cast for republican president double 27
Perot % votes cast for Ross Perot double 27
white % white, 1990 double 0
black % black, 1990 double 0
turnout 1992 votes for president / 1990 pop x 100 double 25
A data.frame
consists of observations (rows) of variables (columns). Examine the hospital
data set.
getHdata(hospital)
contents(hospital)
Data frame:hospital 25 observations and 9 variables Maximum # NAs:0
Labels Levels Storage
id integer
duration Duration of Hospital Stay, Days integer
age integer
sex 2 integer
temp double
wbc integer
antibiotic 2 integer
bculture 2 integer
service 2 integer
+----------+-----------+
|Variable |Levels |
+----------+-----------+
|sex |male,female|
+----------+-----------+
|antibiotic|yes,no |
|bculture | |
+----------+-----------+
|service |med,surg |
+----------+-----------+
hospital
id duration age sex temp wbc antibiotic bculture service
1 1 5 30 female 99.0 8 no no med
2 2 10 73 female 98.0 5 no yes med
3 3 6 40 female 99.0 12 no no surg
4 4 11 47 female 98.2 4 no no surg
5 5 5 25 female 98.5 11 no no surg
6 6 14 82 male 96.8 6 yes no surg
7 7 30 60 male 99.5 8 yes yes med
8 8 11 56 female 98.6 7 no no med
9 9 17 43 female 98.0 7 no no med
10 10 3 50 male 98.0 12 no yes surg
11 11 9 59 female 97.6 7 no yes med
12 12 3 4 male 97.8 3 no no surg
13 13 8 22 female 99.5 11 yes no surg
14 14 8 33 female 98.4 14 yes yes surg
15 15 5 20 female 98.4 11 no yes surg
16 16 5 32 male 99.0 9 no no surg
17 17 7 36 male 99.2 6 yes no surg
18 18 4 69 male 98.0 6 no no surg
19 19 3 47 male 97.0 5 yes no med
20 20 7 22 male 98.2 6 no no surg
21 21 9 11 male 98.2 10 no no surg
22 22 11 19 male 98.6 14 yes no surg
23 23 11 67 female 97.6 4 no no med
24 24 9 43 female 98.6 5 no no surg
25 25 4 41 female 98.0 5 no no med
Once loaded, note that it is displayed in RStudio’s Environment
pane. It gives information about the data.frame
. This can also be shown through the following functions.
str(hospital)
'data.frame': 25 obs. of 9 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ duration : 'labelled' int 5 10 6 11 5 14 30 11 17 3 ...
..- attr(*, "label")= chr "Duration of Hospital Stay, Days"
$ age : int 30 73 40 47 25 82 60 56 43 50 ...
$ sex : Factor w/ 2 levels "male","female": 2 2 2 2 2 1 1 2 2 1 ...
$ temp : num 99 98 99 98.2 98.5 96.8 99.5 98.6 98 98 ...
$ wbc : int 8 5 12 4 11 6 8 7 7 12 ...
$ antibiotic: Factor w/ 2 levels "yes","no": 2 2 2 2 2 1 1 2 2 2 ...
$ bculture : Factor w/ 2 levels "yes","no": 2 1 2 2 2 2 1 2 2 1 ...
$ service : Factor w/ 2 levels "med","surg": 1 1 2 2 2 2 1 1 1 2 ...
names(hospital)
[1] "id" "duration" "age" "sex" "temp" "wbc"
[7] "antibiotic" "bculture" "service"
rownames(hospital)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16"
[17] "17" "18" "19" "20" "21" "22" "23" "24" "25"
dim(hospital)
[1] 25 9
The str
function returns the structure of an R object. In this case the hospital data set is made up of columns of different types: integer, character, factor, numeric. The Hmisc::describe
function is especially helpful when taking a first look at a data set.
describe(hospital)
hospital
9 Variables 25 Observations
-------------------------------------------------------------------------------------
id
n missing distinct Info Mean Gmd .05 .10 .25
25 0 25 1 13 8.667 2.2 3.4 7.0
.50 .75 .90 .95
13.0 19.0 22.6 23.8
lowest : 1 2 3 4 5, highest: 21 22 23 24 25
-------------------------------------------------------------------------------------
duration : Duration of Hospital Stay, Days
n missing distinct Info Mean Gmd .05 .10 .25
25 0 12 0.988 8.6 5.6 3.0 3.4 5.0
.50 .75 .90 .95
8.0 11.0 12.8 16.4
Value 3 4 5 6 7 8 9 10 11 14 17 30
Frequency 3 2 4 1 2 2 3 1 4 1 1 1
Proportion 0.12 0.08 0.16 0.04 0.08 0.08 0.12 0.04 0.16 0.04 0.04 0.04
-------------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10 .25
25 0 22 0.999 41.24 23.39 12.6 19.4 25.0
.50 .75 .90 .95
41.0 56.0 68.2 72.2
lowest : 4 11 19 20 22, highest: 60 67 69 73 82
-------------------------------------------------------------------------------------
sex
n missing distinct
25 0 2
Value male female
Frequency 11 14
Proportion 0.44 0.56
-------------------------------------------------------------------------------------
temp
n missing distinct Info Mean Gmd .05 .10 .25
25 0 12 0.987 98.31 0.7747 97.12 97.60 98.00
.50 .75 .90 .95
98.20 98.60 99.12 99.44
Value 96.8 97.0 97.6 97.8 98.0 98.2 98.4 98.5 98.6 99.0 99.2 99.5
Frequency 1 1 2 1 5 3 2 1 3 3 1 2
Proportion 0.04 0.04 0.08 0.04 0.20 0.12 0.08 0.04 0.12 0.12 0.04 0.08
-------------------------------------------------------------------------------------
wbc
n missing distinct Info Mean Gmd .05 .10 .25
25 0 11 0.988 7.84 3.687 4.0 4.4 5.0
.50 .75 .90 .95
7.0 11.0 12.0 13.6
Value 3 4 5 6 7 8 9 10 11 12 14
Frequency 1 2 4 4 3 2 1 1 3 2 2
Proportion 0.04 0.08 0.16 0.16 0.12 0.08 0.04 0.04 0.12 0.08 0.08
-------------------------------------------------------------------------------------
antibiotic
n missing distinct
25 0 2
Value yes no
Frequency 7 18
Proportion 0.28 0.72
-------------------------------------------------------------------------------------
bculture
n missing distinct
25 0 2
Value yes no
Frequency 6 19
Proportion 0.24 0.76
-------------------------------------------------------------------------------------
service
n missing distinct
25 0 2
Value med surg
Frequency 9 16
Proportion 0.36 0.64
-------------------------------------------------------------------------------------
Saving data sets as text files (like CSV) is not always a good idea. It is more efficient (time and space) to work with R binary objects. The save
function will store R objects (including the global environment). Save a single data set with the .rda
extension, and save a collection of objects with the .RData
extension. These objects can be restored with the load
function. By default these files are saved with compression.
# save single data set
save(hospital, file='hosp.rda', compress=TRUE)
# save everything
save.image(file='myintro.RData')
Restart R and the following will restore the hospital
data set.
load('hosp.rda')
Hmisc::Save
and Hmisc::Load
may also be used on single objects - the file will keep the name of the object unless otherwise provided. Compression is also enabled.
Save(hospital)
Load(hospital)
x <- "my string"
y = 5
print(x)
[1] "my string"
cat(x, "\n")
my string
x
[1] "my string"
sum(y)
[1] 5
(x <- 1)
[1] 1
Container of one type. Elements are indexed from 1 to n, and can be accessed by this index (more on this later).
z <- c(1,-5,20)
(z <- c(10, z, 36))
[1] 10 1 -5 20 36
z[2]
[1] 1
z + 1
[1] 11 2 -4 21 37
z - 2
[1] 8 -1 -7 18 34
z * 3
[1] 30 3 -15 60 108
z / 4
[1] 2.50 0.25 -1.25 5.00 9.00
z ^ 5
[1] 100000 1 -3125 3200000 60466176
# mod, the remainder
z %% 6
[1] 4 1 1 2 0
# integer division
z %/% 6
[1] 1 0 -1 3 6
# recreate z
(z %/% 6) * 6 + z %% 6
[1] 10 1 -5 20 36
sqrt(z)
Warning in sqrt(z): NaNs produced
[1] 3.162278 1.000000 NaN 4.472136 6.000000
Many other functions work with numeric values, see ?S3groupGeneric
abs(z)
[1] 10 1 5 20 36
sign(z)
[1] 1 1 -1 1 1
round(z / 7, 2)
[1] 1.43 0.14 -0.71 2.86 5.14
floor(z / 7)
[1] 1 0 -1 2 5
ceiling(z / 7)
[1] 2 1 0 3 6
# exponential function
exp(z)
[1] 2.202647e+04 2.718282e+00 6.737947e-03 4.851652e+08 4.311232e+15
# natural logarithms
log(z)
Warning in log(z): NaNs produced
[1] 2.302585 0.000000 NaN 2.995732 3.583519
# trig functions
sin(z)
[1] -0.5440211 0.8414710 0.9589243 0.9129453 -0.9917789
sum(z)/length(z)
[1] 12.4
mean(z)
[1] 12.4
median(z)
[1] 10
sd(z)
[1] 16.22652
min(z)
[1] -5
max(z)
[1] 36
range(z)
[1] -5 36
quantile(z)
0% 25% 50% 75% 100%
-5 1 10 20 36
quantile(z, probs=c(0.1, 0.5, 0.9))
10% 50% 90%
-2.6 10.0 29.6
See ?seq and ?Distributions
numeric(5)
[1] 0 0 0 0 0
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
1:10
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10, by=2)
[1] 1 3 5 7 9
rep(1, 5)
[1] 1 1 1 1 1
rep(1:3, each=3)
[1] 1 1 1 2 2 2 3 3 3
rep(1:3, times=3)
[1] 1 2 3 1 2 3 1 2 3
rep(1:3, times=1:3)
[1] 1 2 2 3 3 3
rnorm(5, mean=0, sd=1)
[1] -0.73296459 0.18679832 0.05218688 0.35698932 0.88248873
rbinom(5, size=1, prob=0.5)
[1] 0 1 0 1 1
rpois(5, lambda=3)
[1] 2 2 1 6 3
runif(5, min=0, max=10)
[1] 4.499876 2.212769 4.527861 8.437709 2.210349
Recreate random data with seeds - values are reproducible across computers
# the seed can be any integer value
set.seed(1)
rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
set.seed(1)
rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Beware precision errors with numeric values.
seq(10)/10 == seq(0.1, 1, by=0.1)
[1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
abs(seq(10)/10 - seq(0.1, 1, by=0.1)) < 10^-7
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Mostly unnecessary, useful to represent data exactly or to pass to C or Fortran code
as.integer(1)
[1] 1
class(1L)
[1] "integer"
class(c(1L, 1))
[1] "numeric"
logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
rep(TRUE, 5)
[1] TRUE TRUE TRUE TRUE TRUE
as.logical(c(-1,0,1))
[1] TRUE FALSE TRUE
1+1 == 2
[1] TRUE
3 < 2
[1] FALSE
# logical negation
!FALSE
[1] TRUE
x <- seq(10)
# logical or
x %% 2 == 0 | x %% 3 == 0
[1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
# logical and
x %% 2 == 0 & x %% 3 == 0
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
(x <- rnorm(5) > 0)
[1] FALSE TRUE TRUE TRUE FALSE
# return index for true elements
which(x)
[1] 2 3 4
any(x)
[1] TRUE
all(x)
[1] FALSE
character(5)
[1] "" "" "" "" ""
LETTERS[seq(5)]
[1] "A" "B" "C" "D" "E"
letters[seq(5)]
[1] "a" "b" "c" "d" "e"
x <- "My string"
toupper(x)
[1] "MY STRING"
tolower(x)
[1] "my string"
nchar(x)
[1] 9
substr(x, 4, 6)
[1] "str"
# [set of values]
grep("[aeiou]", letters)
[1] 1 5 9 15 21
grepl("[aeiou]", letters)
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[14] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
gsub("[a-m]", "_", x)
[1] "My str_n_"
(y <- paste(x, "is", x))
[1] "My string is My string"
paste(y, "!", sep='')
[1] "My string is My string!"
# this might not do what you think
paste(letters[seq(10)], sep='')
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
paste(letters[seq(10)], collapse='')
[1] "abcdefghij"
# strsplit turns vector into list
strsplit(c("1;2","3;4","5:6"), split=";")
[[1]]
[1] "1" "2"
[[2]]
[1] "3" "4"
[[3]]
[1] "5:6"
# use unlist if only one string
unlist(strsplit(y, split=" "))
[1] "My" "string" "is" "My" "string"
Factor variables are categorical variables; they may appear as character strings, but they are stored as numerics. They may be important to models or plotting functions.
set.seed(5)
race <- sample(c('white','black','other'), size=10, replace=TRUE, prob=c(0.5,0.3,0.2))
# factor values are created in alphabetical order
(r2 <- factor(race))
[1] white black other white white black black other other white
Levels: black other white
r2 <- factor(race, levels=c('white','black','asian','other'))
levels(r2)
[1] "white" "black" "asian" "other"
nlevels(r2)
[1] 4
levels(r2) <- list(white='white', black='black', other=c('other', 'asian'))
# this fails
c(r2, "hispanic")
[1] "1" "2" "3" "1" "1" "2" "2"
[8] "3" "3" "1" "hispanic"
# coerce to character string first
factor(c(as.character(r2), "hispanic"))
[1] white black other white white black black other other
[10] white hispanic
Levels: black hispanic other white
Internally, dates are stored as days from 01/01/1970 (this is day 0). Format options can be checked from the ?strptime help file.
Sys.Date()
[1] "2019-04-01"
# need consistent format
as.Date(c("05-03-01", "11-11-25"), format="%y-%m-%d")
[1] "2005-03-01" "2011-11-25"
# accepts numeric data, if origin is specified
# Excel typically uses an origin of "1900-01-01"
(x <- as.Date(seq(0, by=365, length.out=3), origin="2000-01-01"))
[1] "2000-01-01" "2000-12-31" "2001-12-31"
as.numeric(x)
[1] 10957 11322 11687
format(x, format="%m/%d/%Y")
[1] "01/01/2000" "12/31/2000" "12/31/2001"
# seq.Date(x[1], x[2], by=14)
seq(x[1], x[2], by=14)
[1] "2000-01-01" "2000-01-15" "2000-01-29" "2000-02-12" "2000-02-26" "2000-03-11"
[7] "2000-03-25" "2000-04-08" "2000-04-22" "2000-05-06" "2000-05-20" "2000-06-03"
[13] "2000-06-17" "2000-07-01" "2000-07-15" "2000-07-29" "2000-08-12" "2000-08-26"
[19] "2000-09-09" "2000-09-23" "2000-10-07" "2000-10-21" "2000-11-04" "2000-11-18"
[25] "2000-12-02" "2000-12-16" "2000-12-30"
diff(x)
Time differences in days
[1] 365 365
as.numeric(diff(x), units='weeks')
[1] 52.14286 52.14286
Date-time variables can be stored in two ways. POSIXct variables are stored as seconds from 01/01/1970. POSIXlt variables have each date/time component saved in a list container. Typically, you should use POSIXct.
Careful consideration of the timezone should be used when working with date-time variables. By default R will take the timezone from the operating system. It can be helpful to explicitly select your time zone, and additionally using UTC
whenever possible.
Sys.time()
[1] "2019-04-01 16:36:03 CDT"
Sys.timezone()
[1] "America/Chicago"
x <- as.POSIXct("2014-10-10 10:20")
# see ?strptime for format options
y <- as.POSIXlt("10/17/2014 20:10", tz="UTC", format="%m/%d/%Y %H:%M")
unclass(x)
[1] 1412954400
attr(,"tzone")
[1] ""
unclass(y)
$sec
[1] 0
$min
[1] 10
$hour
[1] 20
$mday
[1] 17
$mon
[1] 9
$year
[1] 114
$wday
[1] 5
$yday
[1] 289
$isdst
[1] 0
attr(,"tzone")
[1] "UTC"
difftime(Sys.time(), x, units="days")
Time difference of 1634.261 days
base::round.POSIXt(x, units="hours")
[1] "2014-10-10 10:00:00 CDT"
NULL
NA
Inf
and -Inf
NaN
x <- c(1,4,NA,0)
is.na(x)
[1] FALSE FALSE TRUE FALSE
# note calculations with NA result in NA
is.infinite(x / 0)
[1] TRUE TRUE FALSE FALSE
is.nan(x / 0)
[1] FALSE FALSE FALSE TRUE
Mixing types together in a vector will coerce all variables to be of the same, least flexible, type.
# str shows the structure of an R object, similar to Environment pane
str(c("a", 1))
chr [1:2] "a" "1"
str(c(5L, 1))
num [1:2] 5 1
str(c(TRUE, 0))
num [1:2] 1 0
str(c(FALSE, "false"))
chr [1:2] "FALSE" "false"
str(c(100, Sys.Date()))
num [1:2] 100 17987
str(c('today', Sys.time()))
chr [1:2] "today" "1554154565.14581"
str(c(factor(c("apple", "orange")), "banana"))
chr [1:3] "1" "2" "banana"
Each have two important attributes, dimension length and dimension names.
x <- c(1, 10, 100)
length(x)
[1] 3
names(x) <- c("ones", "tens", "hundreds")
# assign names during vector creation
(x <- c(ones=1, tens=10, hundreds=100))
ones tens hundreds
1 10 100
names(x)
[1] "ones" "tens" "hundreds"
# print vector without names
unname(x)
[1] 1 10 100
# remove names
names(x) <- NULL
mx <- matrix(1:6, nrow=2)
# number of rows
nrow(mx)
[1] 2
# number of columns
ncol(mx)
[1] 3
# display number of rows and columns
dim(mx)
[1] 2 3
rownames(mx) <- c('odd', 'even')
colnames(mx) <- c('a', 'b', 'c')
# assign names during matrix creation
(mx <- matrix(1:6, nrow=2, dimnames=list(c('odd', 'even'), c('a','b','c'))))
a b c
odd 1 3 5
even 2 4 6
str(mx)
int [1:2, 1:3] 1 2 3 4 5 6
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "odd" "even"
..$ : chr [1:3] "a" "b" "c"
Indexing is vital concept. It allows access to specific parts of a container, for extraction or replacement. There are four main ways to use indexing.
Note that in R, the first element is indexed at 1. Many programming languages start at 0 instead.
x <- c(royals=4, angels=0, tigers=1, orioles=7)
# index by number, returned in order specified
x[c(4,3)]
orioles tigers
7 1
# index by negation
x[-c(1,2)]
tigers orioles
1 7
# index by name
x[c("orioles", "angels")]
orioles angels
7 0
# index by truth
x[x > 3]
royals orioles
4 7
x[x > 3 & x < 7]
royals
4
# index replacement
x[c("royals","orioles")] <- c(7,4)
# create new element
x["as"] <- 0
# non-existent elements are NA
x[6]
<NA>
NA
%in%
# these examples use state data sets, see ?state
head(state.region)
[1] South West West South West West
Levels: Northeast South North Central West
tail(state.region, n=10)
[1] North Central South South West Northeast
[6] South West South North Central West
Levels: Northeast South North Central West
state.region %in% c("West", "South", "Central")
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[14] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
[27] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
[40] TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
c("West", "South", "Central") %in% state.region
[1] TRUE TRUE FALSE
match(state.region, c("West", "South", "Central"))
[1] 2 1 1 2 1 1 NA 2 2 2 1 1 NA NA NA NA 2 2 NA 2 NA NA NA 2 NA 1 NA
[28] 1 NA NA 1 NA 2 NA NA 2 1 NA NA 2 NA 2 2 1 NA 2 1 2 NA 1
summary(state.area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1214 37317 56222 72368 83234 589757
sort(state.area)
[1] 1214 2057 5009 6450 7836 8257 9304 9609 10577 24181 31055
[12] 33215 36291 40395 40815 41222 42244 45333 47716 48523 49576 51609
[23] 52586 53104 56154 56290 56400 58216 58560 58876 68192 69686 69919
[34] 70665 77047 77227 82264 83557 84068 84916 96981 97914 104247 110540
[45] 113909 121666 147138 158693 267339 589757
# display order by element index
order(state.area)
[1] 39 8 7 11 30 21 29 45 20 48 40 19 14 17 46 35 42 38 24 18 32 1 33 4 49 15 13
[28] 22 9 10 47 25 36 34 41 27 16 12 23 44 37 50 6 28 3 31 26 5 43 2
table(state.region)
state.region
Northeast South North Central West
9 16 12 13
table(state.division, state.region)
state.region
state.division Northeast South North Central West
New England 6 0 0 0
Middle Atlantic 3 0 0 0
South Atlantic 0 8 0 0
East South Central 0 4 0 0
West South Central 0 4 0 0
East North Central 0 0 5 0
West North Central 0 0 7 0
Mountain 0 0 0 8
Pacific 0 0 0 5
unique(state.region)
[1] South West Northeast North Central
Levels: Northeast South North Central West
duplicated(state.region)
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[14] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[27] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[40] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
set.seed(2)
x <- sample(20, 10)
y <- sample(20, 10)
union(x, y)
[1] 4 14 11 3 16 15 2 18 6 7 12 5 13 1
# union takes two arguments, unique takes one (a vector)
all(union(x, y) == unique(c(x, y)))
[1] TRUE
intersect(x, y)
[1] 4 14 3 18 6 7
# elements in x, not in y
setdiff(x, y)
[1] 11 16 15 2
# elements in y, not in x
setdiff(y, x)
[1] 12 5 13 1
# create list of state names by region
split(state.name, state.region)
$Northeast
[1] "Connecticut" "Maine" "Massachusetts" "New Hampshire" "New Jersey"
[6] "New York" "Pennsylvania" "Rhode Island" "Vermont"
$South
[1] "Alabama" "Arkansas" "Delaware" "Florida"
[5] "Georgia" "Kentucky" "Louisiana" "Maryland"
[9] "Mississippi" "North Carolina" "Oklahoma" "South Carolina"
[13] "Tennessee" "Texas" "Virginia" "West Virginia"
$`North Central`
[1] "Illinois" "Indiana" "Iowa" "Kansas" "Michigan"
[6] "Minnesota" "Missouri" "Nebraska" "North Dakota" "Ohio"
[11] "South Dakota" "Wisconsin"
$West
[1] "Alaska" "Arizona" "California" "Colorado" "Hawaii" "Idaho"
[7] "Montana" "Nevada" "New Mexico" "Oregon" "Utah" "Washington"
[13] "Wyoming"
Some vector operations require vectors of equal length. If the vectors are of unequal length, elements in the shorter vector may be recycled.
# c(10,1) becomes c(10,1,10,1,10)
c(1,10,100,1000,10000) / c(10,1)
Warning in c(1, 10, 100, 1000, 10000)/c(10, 1): longer object length is not a
multiple of shorter object length
[1] 1e-01 1e+01 1e+01 1e+03 1e+03
# letters[1:2] becomes letters[c(1,2,1,2)]
paste(letters[1:2], letters[23:26])
[1] "a w" "b x" "a y" "b z"
Two dimensional vector. Values are stored by column by default.
Useful functions:
%*%
- matrix multiplicationmatrix(1:5)
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
matrix(seq(8), nrow=4, ncol=2)
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
(x <- matrix(seq(8), nrow=4, ncol=2, byrow=TRUE))
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
rowSums(x)
[1] 3 7 11 15
colSums(x)
[1] 16 20
y <- matrix(c(1,1,0,0), nrow=2)
x %*% y
[,1] [,2]
[1,] 3 0
[2,] 7 0
[3,] 11 0
[4,] 15 0
x - y[rep(1,4),]
[,1] [,2]
[1,] 0 2
[2,] 2 4
[3,] 4 6
[4,] 6 8
# add a column, note the use of recycling
cbind(x, 3)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 3 4 3
[3,] 5 6 3
[4,] 7 8 3
rbind(x, c(9, 10))
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
x <- matrix(c(4,12,-5,-10), nrow=2)
all.equal(x %*% solve(x), diag(2))
[1] TRUE
solve(x, c(10, 40))
[1] 5 2
x <- matrix(seq(25), nrow=5)
t(x)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
[5,] 21 22 23 24 25
upper.tri(x)
[,1] [,2] [,3] [,4] [,5]
[1,] FALSE TRUE TRUE TRUE TRUE
[2,] FALSE FALSE TRUE TRUE TRUE
[3,] FALSE FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE TRUE
[5,] FALSE FALSE FALSE FALSE FALSE
Remember that the row and column index could be either the row/column number or the row/column name.
x[3,]
[1] 3 8 13 18 23
x[,3]
[1] 11 12 13 14 15
x[3,3]
[1] 13
x[c(1,5),]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 5 10 15 20 25
x[3,c(3,4)]
[1] 13 18
x[x %% 3 == 0] <- 0
x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 11 16 0
[2,] 2 7 0 17 22
[3,] 0 8 13 0 23
[4,] 4 0 14 19 0
[5,] 5 10 0 20 25
x[lower.tri(x)] <- 0
x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 11 16 0
[2,] 0 7 0 17 22
[3,] 0 0 13 0 23
[4,] 0 0 0 19 0
[5,] 0 0 0 0 25
Arrays are similar to matrices, however they support one, two or more dimensions.
Index by array[d1ix, d2ix, d3ix, ..., dnix]
# this creates 4 2x3 matrices
x <- array(1:24, dim=c(2,3,4))
x[2,3,4]
[1] 24
x[,,4]
[,1] [,2] [,3]
[1,] 19 21 23
[2,] 20 22 24
Data.frames are the fundamental data structures in R. They have rows and columns like matrices, however different columns may contain different variable types.
Remember that matrices use rownames
and colnames
to access row/column names. Data frames use row.names
and names
instead.
When creating a data.frame with a column of character values, make sure to decide if the values should be converted to factor variables. This can be set with the stringsAsFactors
argument.
x <- data.frame(id=1:3, animal=c('cat','dog','rabbit'),
speed=c('faster','fast','fastest'))
y <- data.frame(id=1:3, animal=c('cat','dog','rabbit'),
speed=c('faster','fast','fastest'), stringsAsFactors=FALSE)
y[3,]
id animal speed
3 3 rabbit fastest
y[,c(2,3)]
animal speed
1 cat faster
2 dog fast
3 rabbit fastest
y[,'speed']
[1] "faster" "fast" "fastest"
y[y[,'speed'] == 'fastest',]
id animal speed
3 3 rabbit fastest
y[y[,'speed'] == 'fastest', 'animal']
[1] "rabbit"
y[,'speed.factor'] <- factor(y[,'speed'])
There are other methods to select or create data.frame columns:
y[,3]
[1] "faster" "fast" "fastest"
y[,'speed']
[1] "faster" "fast" "fastest"
y$speed
[1] "faster" "fast" "fastest"
y[['speed']]
[1] "faster" "fast" "fastest"
y[,'speed', drop=FALSE]
speed
1 faster
2 fast
3 fastest
y['speed']
speed
1 faster
2 fast
3 fastest
y[,'new'] <- NA
y <- cbind(y, newer=0)
y
id animal speed speed.factor new newer
1 1 cat faster faster NA 0
2 2 dog fast fast NA 0
3 3 rabbit fastest fastest NA 0
It is good to write code that is consistent and easy to understand. For this reason I recommend using the df[,colname]
syntax.
There are also a couple of ways to remove columns:
y[,'new'] <- NULL
y <- y[,-1]
y <- y[,-match('newer', names(y))]
y
animal speed speed.factor
1 cat faster faster
2 dog fast fast
3 rabbit fastest fastest
Useful functions:
x <- data.frame(id=seq(4,24,4), gender=rbinom(6, 1, 0.5))
y <- data.frame(id=seq(6,24,6), smoker=rbinom(4, 1, 0.25))
merge(x, y)
id gender smoker
1 12 1 0
2 24 0 0
merge(x, y, by=0)
Row.names id.x gender id.y smoker
1 1 4 1 6 0
2 2 8 0 12 0
3 3 12 1 18 1
4 4 16 0 24 0
merge(x, y, by.x='id', by.y='id', all.x=TRUE, all.y=FALSE)
id gender smoker
1 4 1 NA
2 8 0 NA
3 12 1 0
4 16 0 NA
5 20 0 NA
6 24 0 0
merge(x, y, all=TRUE)
id gender smoker
1 4 1 NA
2 6 NA 0
3 8 0 NA
4 12 1 0
5 16 0 NA
6 18 NA 1
7 20 0 NA
8 24 0 0
# modify data frame, and reference columns without repeating data frame name
x <- within(x, {
weight <- round(rnorm(nrow(x), 120+gender*60, 10))
age <- sample(15:25, nrow(x), replace=TRUE)
})
# shorter than x[order(x[,'gender'], x[,'age']),]
x[with(x, order(gender, age)),]
id gender age weight
2 8 0 16 129
6 24 0 18 141
4 16 0 23 130
5 20 0 25 124
1 4 1 16 157
3 12 1 25 180
Lists are generic containers. Think of them as a vector where each element can by anything you want, including more lists.
emptylist <- vector('list', 10)
x <- list(abb=state.abb, area=state.area, region=state.region, animals=y, other=5)
Extracting with a single bracket returns a list. Extracting with two brackets returns the contents at the given index.
x[1]
$abb
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS"
[17] "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
[33] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"
x[1:2]
$abb
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS"
[17] "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
[33] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"
$area
[1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 6450
[12] 83557 56400 36291 56290 82264 40395 48523 33215 10577 8257 58216
[23] 84068 47716 69686 147138 77227 110540 9304 7836 121666 49576 52586
[34] 70665 41222 69919 96981 45333 1214 31055 77047 42244 267339 84916
[45] 9609 40815 68192 24181 56154 97914
x[[1]]
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS"
[17] "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
[33] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"
x[["area"]]
[1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 6450
[12] 83557 56400 36291 56290 82264 40395 48523 33215 10577 8257 58216
[23] 84068 47716 69686 147138 77227 110540 9304 7836 121666 49576 52586
[34] 70665 41222 69919 96981 45333 1214 31055 77047 42244 267339 84916
[45] 9609 40815 68192 24181 56154 97914
# while not recommended, $ provides short-cut
x$area
[1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 6450
[12] 83557 56400 36291 56290 82264 40395 48523 33215 10577 8257 58216
[23] 84068 47716 69686 147138 77227 110540 9304 7836 121666 49576 52586
[34] 70665 41222 69919 96981 45333 1214 31055 77047 42244 267339 84916
[45] 9609 40815 68192 24181 56154 97914
x[4:5] <- NULL
A list can be flattened to produce a vector with unlist
.
# notice that all values become character strings, because the first element
# contained character values (the least flexible variable type)
unlist(x)
abb1 abb2 abb3 abb4 abb5 abb6 abb7 abb8 abb9
"AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL"
abb10 abb11 abb12 abb13 abb14 abb15 abb16 abb17 abb18
"GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA"
abb19 abb20 abb21 abb22 abb23 abb24 abb25 abb26 abb27
"ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE"
abb28 abb29 abb30 abb31 abb32 abb33 abb34 abb35 abb36
"NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK"
abb37 abb38 abb39 abb40 abb41 abb42 abb43 abb44 abb45
"OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
abb46 abb47 abb48 abb49 abb50 area1 area2 area3 area4
"VA" "WA" "WV" "WI" "WY" "51609" "589757" "113909" "53104"
area5 area6 area7 area8 area9 area10 area11 area12 area13
"158693" "104247" "5009" "2057" "58560" "58876" "6450" "83557" "56400"
area14 area15 area16 area17 area18 area19 area20 area21 area22
"36291" "56290" "82264" "40395" "48523" "33215" "10577" "8257" "58216"
area23 area24 area25 area26 area27 area28 area29 area30 area31
"84068" "47716" "69686" "147138" "77227" "110540" "9304" "7836" "121666"
area32 area33 area34 area35 area36 area37 area38 area39 area40
"49576" "52586" "70665" "41222" "69919" "96981" "45333" "1214" "31055"
area41 area42 area43 area44 area45 area46 area47 area48 area49
"77047" "42244" "267339" "84916" "9609" "40815" "68192" "24181" "56154"
area50 region1 region2 region3 region4 region5 region6 region7 region8
"97914" "2" "4" "4" "2" "4" "4" "1" "2"
region9 region10 region11 region12 region13 region14 region15 region16 region17
"2" "2" "4" "4" "3" "3" "3" "3" "2"
region18 region19 region20 region21 region22 region23 region24 region25 region26
"2" "1" "2" "1" "3" "3" "2" "3" "4"
region27 region28 region29 region30 region31 region32 region33 region34 region35
"3" "4" "1" "1" "4" "1" "2" "3" "3"
region36 region37 region38 region39 region40 region41 region42 region43 region44
"2" "4" "1" "1" "2" "3" "2" "2" "4"
region45 region46 region47 region48 region49 region50
"1" "2" "4" "2" "3" "4"
Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.
if
statements execute a block of code once or not at all. Multiple conditions can be tested so that only the block under the first condition evaluated as true will be executed. A default block of code can be evaluated if none of the conditions pass.
Several lines of code can be combined in a block by surrounding them with braces {...}
. It is good practice to use braces, though they are not required for a single statement.
x <- 5
if(x >=0) {
print(sqrt(x))
}
[1] 2.236068
x <- -4
if(x >=0) {
print(sqrt(x))
} else if(x < 0) {
print(sqrt(x*-1))
}
[1] 2
x <- 0
if(x > 0) {
print(sqrt(x))
} else if(x < 0) {
print(sqrt(x*-1))
} else {
print(0)
}
[1] 0
The ifelse
function can be used to test a vector of values.
set.seed(3)
ifelse(rnorm(10) > 0, 1, 0)
[1] 0 0 1 0 1 1 1 1 0 1
Three mechanisms can be used to run blocks of code multiple times.
set.seed(5)
nc <- 5
x <- sample(nc, 100, replace=TRUE)
cnt <- numeric(nc)
for(i in seq(nc)) {
cnt[i] <- sum(x == i)
}
cnt
[1] 18 24 17 13 28
i <- 3
while(i <= 100) {
isprime <- all(i %% seq(2, sqrt(i)) != 0)
if(isprime) cat(i, "")
i <- i + 1
}
3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
a <- 1
b <- 2
repeat {
tmp <- a
a <- b
b <- a + tmp
if(b > 100) break
print(b)
}
[1] 3
[1] 5
[1] 8
[1] 13
[1] 21
[1] 34
[1] 55
[1] 89
Functions allow blocks of code to be run repeatedly. Variables are passed into functions as arguments. When the function completes, it may also return a variable (including a list, which may hold several elements).
Don’t access variables from within a function that were created outside of the function (unless they are passed into the function as arguments).
prQ <- function(x) {
qs <- quantile(x, probs=c(0.5, 0.25, 0.75))
return(sprintf("%s (%s, %s)", qs[1], qs[2], qs[3]))
}
prQ(1:9)
[1] "5 (3, 7)"
# return doesn't have to be explicitly called
factors <- function(x) {
Filter(function(i) x %% i == 0, seq(1, floor(x/2)))
}
factors(64)
[1] 1 2 4 8 16 32
# NULL can be returned as well
doNothing <- function(x) NULL
A function can have many arguments. A generic argument ...
can be used to specify any number of additional arguments. Arguments can be given a default value.
powerAdd <- function(p1=1, p2=1, p3=1) {
p1 + p2^2 + p3^3
}
powerAdd()
[1] 3
powerAdd(3,2,1)
[1] 8
powerAdd(5, p3=3)
[1] 33
powerAdd(p2=16)
[1] 258
# take any number of arguments
morePowerAdd <- function(p1=1, ...) {
extra <- unlist(list(...))
sum(p1, extra^(1+seq_along(extra)))
}
morePowerAdd()
[1] 1
morePowerAdd(5, 4, 3, 2, 1)
[1] 65
# use extra arguments for another function
callPlot <- function(msg, ...) {
print(sprintf("calling plot: %s", msg))
plot(...)
NULL
}
callPlot("log", 1:10, log(1:10))
[1] "calling plot: log"
NULL
This creates a sample data set with multiple visits for each patient:
size <- 1000
set.seed(475)
x <- data.frame(id=sample(100, size, replace=TRUE),
visitdate=sample(365*2, size, replace=TRUE)
)
male <- sample(c(0,1,NA), 100, replace=TRUE, prob=c(45, 45, 10))
age <- round(runif(100, 40, 80))
x <- merge(x, cbind(male, id=seq(100)))
x <- merge(x, cbind(age, id=seq(100)))
x[,'visitdate'] <- as.Date(x[,'visitdate'], origin='2010-01-01')
x <- x[with(x, order(id, visitdate)),]
row.names(x) <- NULL
It’s common to operate on a data.frame
by breaking it into small subsets. Calculate visit number and days since first visit:
uids <- unique(x[,'id'])
for(i in seq_along(uids)) {
# note x is already ordered by visitdate
id.x <- which(x[,'id'] == uids[i])
x[id.x, 'visitno'] <- seq_along(id.x)
x[id.x, 'daysOn'] <- x[id.x, 'visitdate'] - x[id.x[1], 'visitdate']
}
This method may work great, unless rows are added or removed. What if we wanted to add a last visit that’s three years after the first visit?
First consider a function that is given a subset.
# x should only have one unique ID value, and be ordered by visitdate
addLastVisit <- function(x) {
last <- x[1,]
last[,'visitdate'] <- last[,'visitdate'] + 365*3
last[,'visitno'] <- nrow(x) + 1
last[,'daysOn'] <- 365*3
rbind(x, last)
}
# quick example
addLastVisit(x[x[,'id'] == 100,])
id visitdate male age visitno daysOn
995 100 2010-03-20 0 57 1 0 days
996 100 2010-05-11 0 57 2 52 days
997 100 2010-09-14 0 57 3 178 days
998 100 2011-02-13 0 57 4 330 days
999 100 2011-02-17 0 57 5 334 days
1000 100 2011-08-31 0 57 6 529 days
9951 100 2013-03-19 0 57 7 1095 days
Because our new data.frame
may have a unknown size, we can store the intermediate subsets in a list. lapply
is a convenient function to use for this.
# now x is saved as 100 element list
x.list <- lapply(unique(x[,'id']), FUN=function(i) addLastVisit(x[x[,'id'] == i,]))
The finals steps are to convert the list back to a data.frame
, and, if we care, update row names. do.call
is powerful, but can be difficult to learn. The first argument is a function. The second argument should be a list, whose elements are the arguments to the referenced function. In this case, the 100 elements of x.list
are the arguments to the rbind
function, which should make sense as we are row-binding these data frames together.
x.df <- do.call(rbind, x.list)
row.names(x.df) <- NULL
The R package plyr
was written to handle similar tasks.
Variables can be transformed by applying functions or arithmetic.
getHdata(diabetes)
contents(diabetes)
Data frame:diabetes 403 observations and 19 variables Maximum # NAs:262
Labels Units Levels Storage NAs
id Subject ID integer 0
chol Total Cholesterol integer 1
stab.glu Stabilized Glucose integer 0
hdl High Density Lipoprotein integer 1
ratio Cholesterol/HDL Ratio double 1
glyhb Glycosolated Hemoglobin double 13
location 2 integer 0
age years integer 0
gender 2 integer 0
height inches integer 5
weight pounds integer 1
frame 3 integer 12
bp.1s First Systolic Blood Pressure integer 5
bp.1d First Diastolic Blood Pressure integer 5
bp.2s Second Systolic Blood Pressure integer 262
bp.2d Second Diastolic Blood Pressure integer 262
waist inches integer 2
hip inches integer 2
time.ppn Postprandial Time when Labs were Drawn minutes integer 3
+--------+------------------+
|Variable|Levels |
+--------+------------------+
|location|Buckingham,Louisa |
+--------+------------------+
|gender |male,female |
+--------+------------------+
|frame |small,medium,large|
+--------+------------------+
# create BMI
diabetes[,'bmi'] <- 703*diabetes[,'weight']/diabetes[,'height']^2
# scale cholesterol to [0,1]
diabetes[,'chol.scaled'] <- (diabetes[,'chol']-min(diabetes[,'chol'], na.rm=TRUE))/
(max(diabetes[,'chol'], na.rm=TRUE)-min(diabetes[,'chol'], na.rm=TRUE))
head(diabetes)
id chol stab.glu hdl ratio glyhb location age gender height weight frame
1 1000 203 82 56 3.6 4.31 Buckingham 46 female 62 121 medium
2 1001 165 97 24 6.9 4.44 Buckingham 29 female 64 218 large
3 1002 228 92 37 6.2 4.64 Buckingham 58 female 61 256 large
4 1003 78 93 12 6.5 4.63 Buckingham 67 male 67 119 large
5 1005 249 90 28 8.9 7.72 Buckingham 64 male 68 183 medium
6 1008 248 94 69 3.6 4.81 Buckingham 34 male 71 190 large
bp.1s bp.1d bp.2s bp.2d waist hip time.ppn bmi chol.scaled
1 118 59 NA NA 29 38 720 22.12877 0.3424658
2 112 68 NA NA 46 48 360 37.41553 0.2383562
3 190 92 185 92 49 57 180 48.36549 0.4109589
4 110 50 NA NA 33 38 480 18.63600 0.0000000
5 138 80 NA NA 44 41 300 27.82202 0.4684932
6 132 86 NA NA 36 42 195 26.49673 0.4657534
getHdata(olympics.1996)
contents(olympics.1996)
Data frame:olympics.1996 79 observations and 3 variables Maximum # NAs:0
Storage
country character
medals integer
population integer
# take log of population
olympics.1996[,'log.pop'] <- log(olympics.1996[,'population'])
# create factor variable using cut points
olympics.1996[,'medals.factor'] <- cut(olympics.1996[,'medals'],
c(0,10,20,50,100,Inf), right=FALSE)
head(olympics.1996)
country medals population log.pop medals.factor
1 Tonga 1 96165 11.47382 [0,10)
2 Bahamas 1 281584 12.54819 [0,10)
3 Jamaica 6 2589043 14.76680 [0,10)
4 Cuba 25 10952046 16.20904 [20,50)
5 Australia 41 18348078 16.72504 [20,50)
6 Hungary 21 10273590 16.14509 [20,50)
Several examples of subsetting or indexing the hospital
data set are below. Also shown is the subset
function, a convenience function for accomplishing this task.
# select rows by index
hospital[1:3,]
id duration age sex temp wbc antibiotic bculture service
1 1 5 30 female 99 8 no no med
2 2 10 73 female 98 5 no yes med
3 3 6 40 female 99 12 no no surg
# select rows by name
hospital[c("5","10","15","20"),]
id duration age sex temp wbc antibiotic bculture service
5 5 5 25 female 98.5 11 no no surg
10 10 3 50 male 98.0 12 no yes surg
15 15 5 20 female 98.4 11 no yes surg
20 20 7 22 male 98.2 6 no no surg
# select columns by index, and first 10 rows
hospital[seq(10),c(1,5)]
id temp
1 1 99.0
2 2 98.0
3 3 99.0
4 4 98.2
5 5 98.5
6 6 96.8
7 7 99.5
8 8 98.6
9 9 98.0
10 10 98.0
# select columns by name, and 5 random rows
hospital[sample(nrow(hospital), 5), c('age','sex')]
age sex
15 20 female
9 43 female
6 82 male
23 67 female
2 73 female
# be careful when selecting a single column
# no longer data.frame, but vector
hospital[seq(5),'service']
[1] med med surg surg surg
Levels: med surg
# maintain data.frame structure
hospital[seq(5),'service', drop=FALSE]
service
1 med
2 med
3 surg
4 surg
5 surg
# select rows that satisfy condition
# every 5th row
hospital[seq(nrow(hospital))%%5 == 0,]
id duration age sex temp wbc antibiotic bculture service
5 5 5 25 female 98.5 11 no no surg
10 10 3 50 male 98.0 12 no yes surg
15 15 5 20 female 98.4 11 no yes surg
20 20 7 22 male 98.2 6 no no surg
25 25 4 41 female 98.0 5 no no med
# men
hospital[hospital[,'sex'] == 'male',]
id duration age sex temp wbc antibiotic bculture service
6 6 14 82 male 96.8 6 yes no surg
7 7 30 60 male 99.5 8 yes yes med
10 10 3 50 male 98.0 12 no yes surg
12 12 3 4 male 97.8 3 no no surg
16 16 5 32 male 99.0 9 no no surg
17 17 7 36 male 99.2 6 yes no surg
18 18 4 69 male 98.0 6 no no surg
19 19 3 47 male 97.0 5 yes no med
20 20 7 22 male 98.2 6 no no surg
21 21 9 11 male 98.2 10 no no surg
22 22 11 19 male 98.6 14 yes no surg
subset(hospital, sex == 'male')
id duration age sex temp wbc antibiotic bculture service
6 6 14 82 male 96.8 6 yes no surg
7 7 30 60 male 99.5 8 yes yes med
10 10 3 50 male 98.0 12 no yes surg
12 12 3 4 male 97.8 3 no no surg
16 16 5 32 male 99.0 9 no no surg
17 17 7 36 male 99.2 6 yes no surg
18 18 4 69 male 98.0 6 no no surg
19 19 3 47 male 97.0 5 yes no med
20 20 7 22 male 98.2 6 no no surg
21 21 9 11 male 98.2 10 no no surg
22 22 11 19 male 98.6 14 yes no surg
# 30-40 year olds, but only first 5
hospital[hospital[,'age'] %in% 30:49, c('id','age')][1:5,]
id age
1 1 30
3 3 40
4 4 47
9 9 43
14 14 33
subset(hospital, age %in% 30:49, select=c(id, age))[1:5,]
id age
1 1 30
3 3 40
4 4 47
9 9 43
14 14 33
# temp greater than 99
hospital[hospital[,'temp'] > 99, c('id','temp')]
id temp
7 7 99.5
13 13 99.5
17 17 99.2
subset(hospital, temp > 99, c(id, temp))
id temp
7 7 99.5
13 13 99.5
17 17 99.2
# combining conditions with AND
# women with duration over 10
hospital[hospital[,'sex'] == 'female' & hospital[,'duration'] > 10,]
id duration age sex temp wbc antibiotic bculture service
4 4 11 47 female 98.2 4 no no surg
8 8 11 56 female 98.6 7 no no med
9 9 17 43 female 98.0 7 no no med
23 23 11 67 female 97.6 4 no no med
# `with` saves some typing
hospital[with(hospital, sex == 'female' & duration > 10),]
id duration age sex temp wbc antibiotic bculture service
4 4 11 47 female 98.2 4 no no surg
8 8 11 56 female 98.6 7 no no med
9 9 17 43 female 98.0 7 no no med
23 23 11 67 female 97.6 4 no no med
# `subset` saves even more
subset(hospital, sex == 'female' & duration > 10)
id duration age sex temp wbc antibiotic bculture service
4 4 11 47 female 98.2 4 no no surg
8 8 11 56 female 98.6 7 no no med
9 9 17 43 female 98.0 7 no no med
23 23 11 67 female 97.6 4 no no med
# combining conditions with OR
# antibiotic or bculture
hospital[hospital[,'antibiotic'] == 'yes' | hospital[,'bculture'] == 'yes',]
id duration age sex temp wbc antibiotic bculture service
2 2 10 73 female 98.0 5 no yes med
6 6 14 82 male 96.8 6 yes no surg
7 7 30 60 male 99.5 8 yes yes med
10 10 3 50 male 98.0 12 no yes surg
11 11 9 59 female 97.6 7 no yes med
13 13 8 22 female 99.5 11 yes no surg
14 14 8 33 female 98.4 14 yes yes surg
15 15 5 20 female 98.4 11 no yes surg
17 17 7 36 male 99.2 6 yes no surg
19 19 3 47 male 97.0 5 yes no med
22 22 11 19 male 98.6 14 yes no surg
# combining conditions with complicated combinations
# men younger than 30 or women older than 60
hospital[with(hospital, (sex == 'male' & age < 30) | (sex == 'female' & age > 60)),]
id duration age sex temp wbc antibiotic bculture service
2 2 10 73 female 98.0 5 no yes med
12 12 3 4 male 97.8 3 no no surg
20 20 7 22 male 98.2 6 no no surg
21 21 9 11 male 98.2 10 no no surg
22 22 11 19 male 98.6 14 yes no surg
23 23 11 67 female 97.6 4 no no med
# select columns that match pattern
hospital[seq(10), grep("^s", names(hospital))]
sex service
1 female med
2 female med
3 female surg
4 female surg
5 female surg
6 male surg
7 male med
8 female med
9 female med
10 male surg
# select empty data.frame
hospital[FALSE,]
[1] id duration age sex temp wbc antibiotic
[8] bculture service
<0 rows> (or 0-length row.names)
Summary statistics can be computed with tapply
, aggregate
, and Hmisc::summarize
.
getHdata(DominicanHTN)
contents(DominicanHTN)
Data frame:DominicanHTN 381 observations and 5 variables Maximum # NAs:0
Labels Units Levels Storage
village 8 integer
gender 2 integer
age integer
sbp Systolic Blood Pressure mmHg integer
dbp Diastolic Blood Pressure mmHg integer
+--------+------------------------------------------------------------------+
|Variable|Levels |
+--------+------------------------------------------------------------------+
| village|Bare' Nuevo,Batey Verde,Carmona,Cojobal,Juan Sanchez,La Altagracia|
| |Los Gueneos,San Antonio |
+--------+------------------------------------------------------------------+
| gender |Female,Male |
+--------+------------------------------------------------------------------+
# calculate mean age by village
with(DominicanHTN, tapply(age, village, mean))
Bare' Nuevo Batey Verde Carmona Cojobal Juan Sanchez La Altagracia
47.53125 48.02439 49.00000 47.15254 47.45283 49.65000
Los Gueneos San Antonio
45.15789 51.25714
# calculate mean sbp by gender
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
gender sbp
1 Female 133.1977
2 Male 132.5691
# calculate median age by village and gender
aggregate(age ~ village + gender, DominicanHTN, median)
village gender age
1 Bare' Nuevo Female 40.0
2 Batey Verde Female 43.0
3 Carmona Female 40.0
4 Cojobal Female 43.0
5 Juan Sanchez Female 47.0
6 La Altagracia Female 48.0
7 Los Gueneos Female 40.0
8 San Antonio Female 42.5
9 Bare' Nuevo Male 48.0
10 Batey Verde Male 52.0
11 Carmona Male 53.0
12 Cojobal Male 49.0
13 Juan Sanchez Male 56.0
14 La Altagracia Male 57.5
15 Los Gueneos Male 51.0
16 San Antonio Male 61.0
# calculate quantile of sbp by village and gender
with(DominicanHTN, summarize(sbp, llist(village, gender), quantile))
village gender sbp X25. X50. X75. X100.
1 Bare' Nuevo Female 80 125.0 130.0 141.00 166
2 Bare' Nuevo Male 102 120.0 126.0 136.00 148
3 Batey Verde Female 110 120.0 132.0 155.00 190
4 Batey Verde Male 90 117.0 130.0 147.50 180
5 Carmona Female 95 114.0 130.0 150.00 204
6 Carmona Male 100 120.0 130.0 150.00 168
7 Cojobal Female 90 119.0 124.0 150.00 200
8 Cojobal Male 90 116.0 122.0 134.00 200
9 Juan Sanchez Female 100 118.0 130.0 140.00 200
10 Juan Sanchez Male 90 120.0 130.0 150.00 200
11 La Altagracia Female 90 118.0 125.0 156.00 210
12 La Altagracia Male 80 112.0 120.0 137.00 236
13 Los Gueneos Female 90 106.0 118.0 140.00 180
14 Los Gueneos Male 98 118.0 124.5 144.75 170
15 San Antonio Female 90 116.5 129.5 150.00 190
16 San Antonio Male 100 125.0 140.0 165.00 210
The previous aggregate
example introduced the ~
operator. The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.
+
indicates to include both a and b as terms:
indicates the interaction of a and bI
indicates to use +
in the arithmetic senseSee ?formula for more examples.
The most simple model-fitting function is lm
, which is used to fit linear models. It’s primary argument is a formula. Using the DominicanHTN data set, we can fit sbp by age.
(m <- lm(sbp ~ age, data=DominicanHTN))
Call:
lm(formula = sbp ~ age, data = DominicanHTN)
Coefficients:
(Intercept) age
108.3079 0.5146
This creates an lm
object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.
names(m)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
m$coefficients
(Intercept) age
108.3078571 0.5146198
coef(m)
(Intercept) age
108.3078571 0.5146198
head(fitted(m))
Systolic Blood Pressure [mmHg]
1 2 3 4 5 6
137.1266 129.9219 143.8166 144.3312 140.2143 140.7289
predict(m, data.frame(age=c(40, 60)))
1 2
128.8926 139.1850
head(residuals(m))
Systolic Blood Pressure [mmHg]
1 2 3 4 5 6
12.873433 -9.921890 -23.816625 35.668756 -2.214286 -25.728906
vcov(m)
(Intercept) age
(Intercept) 17.5571859 -0.333034803
age -0.3330348 0.006942401
summary(m)
Call:
lm(formula = sbp ~ age, data = DominicanHTN)
Residuals:
Systolic Blood Pressure [mmHg]
Min 1Q Median 3Q Max
-63.080 -16.688 -2.787 11.961 96.815
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 108.30786 4.19013 25.848 < 2e-16 ***
age 0.51462 0.08332 6.176 1.69e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.54 on 379 degrees of freedom
Multiple R-squared: 0.09145, Adjusted R-squared: 0.08905
F-statistic: 38.15 on 1 and 379 DF, p-value: 1.692e-09
coef(summary(m))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 108.3078571 4.19012958 25.848331 1.177025e-85
age 0.5146198 0.08332107 6.176347 1.691868e-09
anova(m)
Analysis of Variance Table
Response: sbp
Df Sum Sq Mean Sq F value Pr(>F)
age 1 22980 22980.3 38.147 1.692e-09 ***
Residuals 379 228314 602.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The examples here (“9,” n.d.) say it best.
library(ggplot2)
# create factors, categorical variables
m2 <- within(mtcars, {
gear.factor <- factor(gear, levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
am.factor <- factor(am, levels=c(0,1), labels=c("Automatic","Manual"))
cyl.factor <- factor(cyl, levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl"))
})
# kernel density
qplot(mpg, data=m2, geom="density", fill=gear.factor, alpha=I(.5),
main="Distribution of Gas Mileage", xlab="Miles Per Gallon", ylab="Density"
)
# scatterplot
qplot(hp, mpg, data=m2, shape=am.factor, color=am.factor,
facets=gear.factor~cyl.factor, size=I(3), xlab="Horsepower",
ylab="Miles per Gallon"
)
# boxplot
qplot(gear, mpg, data=m2, geom=c("boxplot", "jitter"), fill=gear.factor,
main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon"
)
Example with regression:
# scatterplot
p <- qplot(wt, mpg, data=m2, color=cyl.factor,
main="Regression of MPG on Weight", xlab="Weight", ylab="Miles per Gallon"
)
# add regression line
p + geom_smooth(method="lm")
p + geom_smooth(method="loess")
Plots created with ggplot2
can be saved to a file with the ggsave
function. The file format is determined by the file name. Some are vector file formats (pdf, svg), while others (bmp, jpeg, png) are raster formats. Vector files can be scaled without pixelation.
Create 6“x6” png at 300 dpi, useful for print publications:
qplot(hp, mpg, data=m2, shape=am.factor, color=am.factor,
facets=gear.factor~cyl.factor, size=I(3), xlab="Horsepower",
ylab="Miles per Gallon"
)
ggsave("mileageByHorses.png", width=6, height=6, dpi=300)
Plot to PDF:
qplot(gear, mpg, data=m2, geom=c("boxplot", "jitter"), fill=gear.factor,
main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon")
ggsave("mileageByGear.pdf")
Saving 7 x 5 in image
“1.” n.d. http://miktex.org/download.
“2.” n.d. http://www.tug.org/mactex/morepackages.html.
“3.” n.d. http://yihui.name/knitr/.
“5.” n.d. https://github.com/yihui/knitr-examples.
“8.” n.d. https://github.com/harrelfe/rscripts.