1 Installing LaTeX
2 Navigating RStudio
3 R Packages
4 knitr
5 Importing Data Sets
6 Saving and Loading R Binary Objects
7 R Basics
8 Transforming Variables
9 Subsetting data frames
10 Data Aggregation
11 Formulas
12 Models
13 R Graphics with ggplot2
Links

1 Installing LaTeX

LaTeX is an open source typesetting system used for document preparation. It should be installed for R and RStudio to produce PDF reports. It is a tool for advanced users who wish to compose advanced tables, use cross-referencing, indexing, or have finer control over document layout. Others may use Markdown as an alternative to produce PDF and HTML reports.

1.1 Windows

Install MiKTeX (“1,” n.d.) - choose Basic MiKTeX Installer

Applications will likely be installed in C:\Program Files (x86)\MiKTeX 2.9\miktex\bin.

Use the mpm application to install additional LaTeX packages.

1.2 OS X

Install MacTeX (“2,” n.d.) - choose BasicTeX.pkg

Applications will likely be installed in /usr/texbin.

Use the tlmgr command-line program to install additional LaTeX packages. As an example, this installs the framed style package:

sudo tlmgr install framed

1.3 Linux

Install TeXLive - Ubuntu flavored distributions may use the command-line program apt-get:

sudo apt-get install texlive-base texlive-latex-recommended texlive-latex-extra

Applications will likely be installed in /usr/bin.

Additional LaTeX packages may require manual installation in /usr/share/texlive/texmf-dist/tex/latex.

1.4 Test Installation

After LaTeX has been installed, test that pdflatex runs from the command line prompt:

pdflatex --version

If this fails to return information about pdflatex, check that it is installed in the proper location. If found the LaTeX application path needs to be added to your environment path, or you may need to log out and back in.

2 Navigating RStudio

Creating scripts (R Script, R Markdown, R Sweave)
Creating projects
Environment
Files
Plots
Packages
Help

2.1 Running a R script

When an R script is open, RStudio can run one line at a time, run all highlighted lines, or run the entire script.

A script can also be run in entirety by calling the source function, which works with local files as well as URLs.

source("https://github.com/harrelfe/rscripts/raw/master/introda.r")

3 R Packages

Install the packages Hmisc, rms, knitr, rmarkdown through RStudio.

Packages contain R code and may often require the installation of other packages. When I install Hmisc, I see that its dependency acepack is automatically installed:

Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb

* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)

3.1 Commands for Package Management

3.1.1 Install and updating

# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown'))
# update all packages
update.packages(checkBuilt=TRUE, ask=FALSE)

3.1.2 Loading

Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.

library(Hmisc)
require(Hmisc)

4 knitr

knitr (“3,” n.d.) was designed as a replacement for Sweave - it is used for dynamic report generation. It combines the content of your document with the R code that produces results. Blocks of code are called chunks.

By default RStudio uses Sweave, but this can be changed to knitr in the options.

Tools -> Global Options, Sweave panel, "Weave Rnw files using..."

Here’s an example of using the knitr package to strip out all of this file’s R chunks and store them in an R script.

library(knitr)
# may need path to file
purl('intro.Rmd')

4.1 Rnw/LaTeX/PDF or Rmd/markdown/HTML+

Reports can be generated by two methods, Rnw (noweb) files or Rmd (markdown) files. Syntax-wise there are two major differences:

Declaration of code chunks
Content is formatted with LaTeX (Rnw) or Markdown (Rmd)

LaTeX is much more difficult to learn but it offers superior control over document layout and formatting. Markdown offers more flexibility in the type of output, such as PDF, HTML and Word. How do you choose which to use? A good rule of thumb is to ask yourself “must I print the results nicely on paper?”. If yes, Rnw (Yihui Xie (“4,” n.d.)).

Some R objects are easily converted into LaTeX (see Hmisc::latex). If you are using any LaTeX functions, you should also choose Rnw files.

You can see good examples of both methods here (“5,” n.d.).

Output is created by clicking the Compile PDF or Knit HTML button.

4.1.1 Code Chunks

An Rnw chunk looks like this:

<<name, chunk-options>>=
# lines of R code
@

Here’s an R chunk in Rmd:

```{r, chunk-options}
# lines of R code
```

There are many chunk options, see the online documentation (“6,” n.d.).

4.1.2 Nice Defaults with knitrSet

Dr. Harrell has predefined several options and functions that are useful when creating Rnw files. Load the Hmisc package to make the knitrSet function available for use. In the past this function could be included by adding it to your .Rprofile file (example Rprofile) (“7,” n.d.). If knitrSet is included in your Rprofile, it should be removed.

The following chunk will load Hmisc and run this function.

<<echo=FALSE>>=
require(Hmisc)
knitrSet()
@

4.1.3 Rmarkdown Template

Any of the Rmd files found here (“8,” n.d.) would make a good template for a reproducible report. There are two chunks you should consider including in any report. As previously mentioned knitrSet should be added to the beginning of the report. The computing environment should be added to the end.

It should look similar to this.

```{r setup,echo=FALSE}
require(Hmisc)
knitrSet(lang='markdown')
```

# Analysis Start
# content
# Analysis End

# Computing Environment

```{r rsession,echo=FALSE}
si <- sessionInfo(); si$loadedOnly <- NULL
print(si, locale=FALSE)
```
```{r cite,results='asis',echo=FALSE}
print(citation(), style='text')
```

The Hmisc::getRs function can be used to open one of the sample rscripts within RStudio.

5 Importing Data Sets

In R, a data set is called a data.frame. Consisting of rows and columns of data, they are the fundamental data structure in R.

R reads in delimited files, such as comma-separated or tab-delimited, as data frames. Typically this is done with one of the following functions.

read.table - work with data in table format
read.csv - CSV files
read.delim - tab-delimited files
scan - more general function for reading in data

Several options are useful:

header - indicator if first line contains column headers
sep - field separator
quote - quote character
na.strings - strings to convert to NA value
nrows - number of rows to read
skip - number of rows to skip at start of file
stringsAsFactors - read strings as character or factor values

A data set may also be imported by clicking the Import Dataset button in the Environment pane in RStudio.

The base package datasets includes several data sets. In this example we’ll use one to create a file that can be read in as a CSV.

head(state.x77)

           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

write.csv(state.x77, file='state.csv')
state <- read.csv('state.csv', stringsAsFactors=FALSE)

5.1 Importing from Other Applications

Other methods are available to read in data sets created outside of R.

For example, Hmisc includes the functions sas.get and sasxport.get to work with SAS data sets.

Stat/Transfer is a good option but is not free nor open source.

See the foreign package for other options.

5.2 getHdata

The Hmisc::getHdata function downloads ready-to-use data sets from the web site for Hmisc and rms.

require(Hmisc, warn.conflicts=FALSE, quietly=TRUE)
# list available data sets
getHdata()

 [1] "abm"              "acath"            "ari"              "ari_other"       
 [5] "birth.estriol"    "boston"           "cdystonia"        "counties"        
 [9] "diabetes"         "dmd"              "DominicanHTN"     "esopH"           
[13] "esopH2"           "FEV"              "hospital"         "kprats"          
[17] "lead"             "nhgh"             "olympics.1996"    "pbc"             
[21] "plasma"           "prostate"         "rhc"              "sex.age.response"
[25] "stressEcho"       "support2"         "support"          "titanic2"        
[29] "titanic3"         "titanic"          "valung"           "vlbw"

# download and load data set
getHdata(counties)
head(counties)

   county state  msa pmsa pop.density   pop pop.change age6574 age75 crime college
1 Autauga    AL 5240   NA          61 34222       11.9     5.7   4.1  4996    14.5
2 Baldwin    AL 5160   NA          67 98280       35.4     9.2   6.0  3329    16.8
3 Barbour    AL   NA   NA          29 25417        2.0     8.2   6.4  3192    11.8
4    Bibb    AL   NA   NA          28 16576        9.2     6.7   6.0     0     4.7
5  Blount    AL 1000   NA          62 39248       10.6     7.4   5.6  2052     7.0
6 Bullock    AL   NA   NA          18 11042        3.7     8.5   7.6  3630    10.0
  income farm democrat republican Perot    white     black  turnout
1  32240  1.8     30.9       55.9  12.3 79.31740 20.001753 45.54088
2  30199  1.7     26.2       56.5  16.5 86.04498 12.861213 47.28938
3  23838  2.4     46.4       42.9   9.8 55.54550 44.041390 41.03946
4  23714  0.9     43.2       46.5  10.2 78.74035 20.982143 40.54054
5  26323  4.7     32.9       53.8  11.8 97.83173  1.327456 42.05310
6  17796  2.6     67.7       26.0   5.5 27.49502 72.323853 43.61529

# view data dictionary for file - opens in web browser
getHdata(counties, what="contents")
contents(counties)


Data frame:counties 3141 observations and 19 variables    Maximum # NAs:2956

                                                         Labels   Storage  NAs
county                                                          character    0
state                                                           character    0
msa                                                               integer 2311
pmsa                                                              integer 2956
pop.density                           1992 pop per 1990 miles^2   integer    0
pop                                             1990 population   integer    0
pop.change                        % population change 1980-1992    double    0
age6574                                       % age 65-74, 1990    double    0
age75                                         % age >= 75, 1990    double    0
crime                           serious crimes per 100,000 1991   integer    0
college     % with bachelor's degree or higher of those age>=25    double    0
income                       median family income, 1989 dollars   integer    0
farm                          farm population, % of total, 1990    double    0
democrat                  % votes cast for democratic president    double   27
republican                % votes cast for republican president    double   27
Perot                               % votes cast for Ross Perot    double   27
white                                             % white, 1990    double    0
black                                             % black, 1990    double    0
turnout               1992 votes for president / 1990 pop x 100    double   25

5.3 Structure of a data.frame

A data.frame consists of observations (rows) of variables (columns). Examine the hospital data set.

getHdata(hospital)
contents(hospital)


Data frame:hospital 25 observations and 9 variables    Maximum # NAs:0

                                    Labels Levels Storage
id                                                integer
duration   Duration of Hospital Stay, Days        integer
age                                               integer
sex                                             2 integer
temp                                               double
wbc                                               integer
antibiotic                                      2 integer
bculture                                        2 integer
service                                         2 integer

+----------+-----------+
|Variable  |Levels     |
+----------+-----------+
|sex       |male,female|
+----------+-----------+
|antibiotic|yes,no     |
|bculture  |           |
+----------+-----------+
|service   |med,surg   |
+----------+-----------+

hospital

   id duration age    sex temp wbc antibiotic bculture service
1   1        5  30 female 99.0   8         no       no     med
2   2       10  73 female 98.0   5         no      yes     med
3   3        6  40 female 99.0  12         no       no    surg
4   4       11  47 female 98.2   4         no       no    surg
5   5        5  25 female 98.5  11         no       no    surg
6   6       14  82   male 96.8   6        yes       no    surg
7   7       30  60   male 99.5   8        yes      yes     med
8   8       11  56 female 98.6   7         no       no     med
9   9       17  43 female 98.0   7         no       no     med
10 10        3  50   male 98.0  12         no      yes    surg
11 11        9  59 female 97.6   7         no      yes     med
12 12        3   4   male 97.8   3         no       no    surg
13 13        8  22 female 99.5  11        yes       no    surg
14 14        8  33 female 98.4  14        yes      yes    surg
15 15        5  20 female 98.4  11         no      yes    surg
16 16        5  32   male 99.0   9         no       no    surg
17 17        7  36   male 99.2   6        yes       no    surg
18 18        4  69   male 98.0   6         no       no    surg
19 19        3  47   male 97.0   5        yes       no     med
20 20        7  22   male 98.2   6         no       no    surg
21 21        9  11   male 98.2  10         no       no    surg
22 22       11  19   male 98.6  14        yes       no    surg
23 23       11  67 female 97.6   4         no       no     med
24 24        9  43 female 98.6   5         no       no    surg
25 25        4  41 female 98.0   5         no       no     med

Once loaded, note that it is displayed in RStudio’s Environment pane. It gives information about the data.frame. This can also be shown through the following functions.

str(hospital)

'data.frame':   25 obs. of  9 variables:
 $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ duration  : 'labelled' int  5 10 6 11 5 14 30 11 17 3 ...
  ..- attr(*, "label")= chr "Duration of Hospital Stay, Days"
 $ age       : int  30 73 40 47 25 82 60 56 43 50 ...
 $ sex       : Factor w/ 2 levels "male","female": 2 2 2 2 2 1 1 2 2 1 ...
 $ temp      : num  99 98 99 98.2 98.5 96.8 99.5 98.6 98 98 ...
 $ wbc       : int  8 5 12 4 11 6 8 7 7 12 ...
 $ antibiotic: Factor w/ 2 levels "yes","no": 2 2 2 2 2 1 1 2 2 2 ...
 $ bculture  : Factor w/ 2 levels "yes","no": 2 1 2 2 2 2 1 2 2 1 ...
 $ service   : Factor w/ 2 levels "med","surg": 1 1 2 2 2 2 1 1 1 2 ...

names(hospital)

[1] "id"         "duration"   "age"        "sex"        "temp"       "wbc"       
[7] "antibiotic" "bculture"   "service"

rownames(hospital)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16"
[17] "17" "18" "19" "20" "21" "22" "23" "24" "25"

dim(hospital)

[1] 25  9

The str function returns the structure of an R object. In this case the hospital data set is made up of columns of different types: integer, character, factor, numeric. The Hmisc::describe function is especially helpful when taking a first look at a data set.

describe(hospital)

hospital 

 9  Variables      25  Observations
-------------------------------------------------------------------------------------
id 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
      25        0       25        1       13    8.667      2.2      3.4      7.0 
     .50      .75      .90      .95 
    13.0     19.0     22.6     23.8 

lowest :  1  2  3  4  5, highest: 21 22 23 24 25
-------------------------------------------------------------------------------------
duration : Duration of Hospital Stay, Days 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
      25        0       12    0.988      8.6      5.6      3.0      3.4      5.0 
     .50      .75      .90      .95 
     8.0     11.0     12.8     16.4 
                                                                      
Value         3    4    5    6    7    8    9   10   11   14   17   30
Frequency     3    2    4    1    2    2    3    1    4    1    1    1
Proportion 0.12 0.08 0.16 0.04 0.08 0.08 0.12 0.04 0.16 0.04 0.04 0.04
-------------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
      25        0       22    0.999    41.24    23.39     12.6     19.4     25.0 
     .50      .75      .90      .95 
    41.0     56.0     68.2     72.2 

lowest :  4 11 19 20 22, highest: 60 67 69 73 82
-------------------------------------------------------------------------------------
sex 
       n  missing distinct 
      25        0        2 
                        
Value        male female
Frequency      11     14
Proportion   0.44   0.56
-------------------------------------------------------------------------------------
temp 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
      25        0       12    0.987    98.31   0.7747    97.12    97.60    98.00 
     .50      .75      .90      .95 
   98.20    98.60    99.12    99.44 
                                                                      
Value      96.8 97.0 97.6 97.8 98.0 98.2 98.4 98.5 98.6 99.0 99.2 99.5
Frequency     1    1    2    1    5    3    2    1    3    3    1    2
Proportion 0.04 0.04 0.08 0.04 0.20 0.12 0.08 0.04 0.12 0.12 0.04 0.08
-------------------------------------------------------------------------------------
wbc 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
      25        0       11    0.988     7.84    3.687      4.0      4.4      5.0 
     .50      .75      .90      .95 
     7.0     11.0     12.0     13.6 
                                                                 
Value         3    4    5    6    7    8    9   10   11   12   14
Frequency     1    2    4    4    3    2    1    1    3    2    2
Proportion 0.04 0.08 0.16 0.16 0.12 0.08 0.04 0.04 0.12 0.08 0.08
-------------------------------------------------------------------------------------
antibiotic 
       n  missing distinct 
      25        0        2 
                    
Value       yes   no
Frequency     7   18
Proportion 0.28 0.72
-------------------------------------------------------------------------------------
bculture 
       n  missing distinct 
      25        0        2 
                    
Value       yes   no
Frequency     6   19
Proportion 0.24 0.76
-------------------------------------------------------------------------------------
service 
       n  missing distinct 
      25        0        2 
                    
Value       med surg
Frequency     9   16
Proportion 0.36 0.64
-------------------------------------------------------------------------------------

6 Saving and Loading R Binary Objects

Saving data sets as text files (like CSV) is not always a good idea. It is more efficient (time and space) to work with R binary objects. The save function will store R objects (including the global environment). Save a single data set with the .rda extension, and save a collection of objects with the .RData extension. These objects can be restored with the load function. By default these files are saved with compression.

# save single data set
save(hospital, file='hosp.rda', compress=TRUE)
# save everything
save.image(file='myintro.RData')

Restart R and the following will restore the hospital data set.

load('hosp.rda')

Hmisc::Save and Hmisc::Load may also be used on single objects - the file will keep the name of the object unless otherwise provided. Compression is also enabled.

Save(hospital)
Load(hospital)

7 R Basics

7.1 Assignment

x <- "my string"
y = 5

7.2 Printing

print(x)

[1] "my string"

cat(x, "\n")

my string

[1] "my string"

sum(y)

[1] 5

(x <- 1)

[1] 1

7.3 Vectors

Container of one type. Elements are indexed from 1 to n, and can be accessed by this index (more on this later).

z <- c(1,-5,20)
(z <- c(10, z, 36))

[1] 10  1 -5 20 36

z[2]

[1] 1

7.4 Numeric

7.4.1 Basic arithmetic

z + 1

[1] 11  2 -4 21 37

z - 2

[1]  8 -1 -7 18 34

z * 3

[1]  30   3 -15  60 108

z / 4

[1]  2.50  0.25 -1.25  5.00  9.00

z ^ 5

[1]   100000        1    -3125  3200000 60466176

# mod, the remainder
z %% 6

[1] 4 1 1 2 0

# integer division
z %/% 6

[1]  1  0 -1  3  6

# recreate z
(z %/% 6) * 6 + z %% 6

[1] 10  1 -5 20 36

sqrt(z)

Warning in sqrt(z): NaNs produced

[1] 3.162278 1.000000      NaN 4.472136 6.000000

7.4.2 Numeric Functions

Many other functions work with numeric values, see ?S3groupGeneric

abs(z)

[1] 10  1  5 20 36

sign(z)

[1]  1  1 -1  1  1

round(z / 7, 2)

[1]  1.43  0.14 -0.71  2.86  5.14

floor(z / 7)

[1]  1  0 -1  2  5

ceiling(z / 7)

[1] 2 1 0 3 6

# exponential function
exp(z)

[1] 2.202647e+04 2.718282e+00 6.737947e-03 4.851652e+08 4.311232e+15

# natural logarithms
log(z)

Warning in log(z): NaNs produced

[1] 2.302585 0.000000      NaN 2.995732 3.583519

# trig functions
sin(z)

[1] -0.5440211  0.8414710  0.9589243  0.9129453 -0.9917789

7.4.3 Descriptive Statistics

sum(z)/length(z)

[1] 12.4

mean(z)

[1] 12.4

median(z)

[1] 10

sd(z)

[1] 16.22652

min(z)

[1] -5

max(z)

[1] 36

range(z)

[1] -5 36

quantile(z)

  0%  25%  50%  75% 100% 
  -5    1   10   20   36

quantile(z, probs=c(0.1, 0.5, 0.9))

 10%  50%  90% 
-2.6 10.0 29.6

7.4.4 Generating Sequences and Random Data

See ?seq and ?Distributions

numeric(5)

[1] 0 0 0 0 0

seq(1, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

seq(10)

 [1]  1  2  3  4  5  6  7  8  9 10

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10, by=2)

[1] 1 3 5 7 9

rep(1, 5)

[1] 1 1 1 1 1

rep(1:3, each=3)

[1] 1 1 1 2 2 2 3 3 3

rep(1:3, times=3)

[1] 1 2 3 1 2 3 1 2 3

rep(1:3, times=1:3)

[1] 1 2 2 3 3 3

rnorm(5, mean=0, sd=1)

[1] -0.73296459  0.18679832  0.05218688  0.35698932  0.88248873

rbinom(5, size=1, prob=0.5)

[1] 0 1 0 1 1

rpois(5, lambda=3)

[1] 2 2 1 6 3

runif(5, min=0, max=10)

[1] 4.499876 2.212769 4.527861 8.437709 2.210349

Recreate random data with seeds - values are reproducible across computers

# the seed can be any integer value
set.seed(1)
rnorm(5)

[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

set.seed(1)
rnorm(5)

[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

7.4.5 Floating-point precision

Beware precision errors with numeric values.

seq(10)/10 == seq(0.1, 1, by=0.1)

 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

abs(seq(10)/10 - seq(0.1, 1, by=0.1)) < 10^-7

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

7.5 Integer

Mostly unnecessary, useful to represent data exactly or to pass to C or Fortran code

as.integer(1)

[1] 1

class(1L)

[1] "integer"

class(c(1L, 1))

[1] "numeric"

7.6 Logical

logical(5)

[1] FALSE FALSE FALSE FALSE FALSE

rep(TRUE, 5)

[1] TRUE TRUE TRUE TRUE TRUE

as.logical(c(-1,0,1))

[1]  TRUE FALSE  TRUE

1+1 == 2

[1] TRUE

3 < 2

[1] FALSE

# logical negation
!FALSE

[1] TRUE

x <- seq(10)
# logical or
x %% 2 == 0 | x %% 3 == 0

 [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

# logical and
x %% 2 == 0 & x %% 3 == 0

 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

(x <- rnorm(5) > 0)

[1] FALSE  TRUE  TRUE  TRUE FALSE

# return index for true elements
which(x)

[1] 2 3 4

any(x)

[1] TRUE

all(x)

[1] FALSE

7.7 Character

character(5)

[1] "" "" "" "" ""

LETTERS[seq(5)]

[1] "A" "B" "C" "D" "E"

letters[seq(5)]

[1] "a" "b" "c" "d" "e"

x <- "My string"
toupper(x)

[1] "MY STRING"

tolower(x)

[1] "my string"

nchar(x)

[1] 9

substr(x, 4, 6)

[1] "str"

# [set of values]
grep("[aeiou]", letters)

[1]  1  5  9 15 21

grepl("[aeiou]", letters)

 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[14] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

gsub("[a-m]", "_", x)

[1] "My str_n_"

(y <- paste(x, "is", x))

[1] "My string is My string"

paste(y, "!", sep='')

[1] "My string is My string!"

# this might not do what you think
paste(letters[seq(10)], sep='')

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

paste(letters[seq(10)], collapse='')

[1] "abcdefghij"

# strsplit turns vector into list
strsplit(c("1;2","3;4","5:6"), split=";")

[[1]]
[1] "1" "2"

[[2]]
[1] "3" "4"

[[3]]
[1] "5:6"

# use unlist if only one string
unlist(strsplit(y, split=" "))

[1] "My"     "string" "is"     "My"     "string"

7.8 Factor

Factor variables are categorical variables; they may appear as character strings, but they are stored as numerics. They may be important to models or plotting functions.

set.seed(5)
race <- sample(c('white','black','other'), size=10, replace=TRUE, prob=c(0.5,0.3,0.2))
# factor values are created in alphabetical order
(r2 <- factor(race))

 [1] white black other white white black black other other white
Levels: black other white

r2 <- factor(race, levels=c('white','black','asian','other'))
levels(r2)

[1] "white" "black" "asian" "other"

nlevels(r2)

[1] 4

levels(r2) <- list(white='white', black='black', other=c('other', 'asian'))
# this fails
c(r2, "hispanic")

 [1] "1"        "2"        "3"        "1"        "1"        "2"        "2"       
 [8] "3"        "3"        "1"        "hispanic"

# coerce to character string first
factor(c(as.character(r2), "hispanic"))

 [1] white    black    other    white    white    black    black    other    other   
[10] white    hispanic
Levels: black hispanic other white

7.9 Date

Internally, dates are stored as days from 01/01/1970 (this is day 0). Format options can be checked from the ?strptime help file.

Sys.Date()

[1] "2019-04-01"

# need consistent format
as.Date(c("05-03-01", "11-11-25"), format="%y-%m-%d")

[1] "2005-03-01" "2011-11-25"

# accepts numeric data, if origin is specified
# Excel typically uses an origin of "1900-01-01"
(x <- as.Date(seq(0, by=365, length.out=3), origin="2000-01-01"))

[1] "2000-01-01" "2000-12-31" "2001-12-31"

as.numeric(x)

[1] 10957 11322 11687

format(x, format="%m/%d/%Y")

[1] "01/01/2000" "12/31/2000" "12/31/2001"

# seq.Date(x[1], x[2], by=14)
seq(x[1], x[2], by=14)

 [1] "2000-01-01" "2000-01-15" "2000-01-29" "2000-02-12" "2000-02-26" "2000-03-11"
 [7] "2000-03-25" "2000-04-08" "2000-04-22" "2000-05-06" "2000-05-20" "2000-06-03"
[13] "2000-06-17" "2000-07-01" "2000-07-15" "2000-07-29" "2000-08-12" "2000-08-26"
[19] "2000-09-09" "2000-09-23" "2000-10-07" "2000-10-21" "2000-11-04" "2000-11-18"
[25] "2000-12-02" "2000-12-16" "2000-12-30"

diff(x)

Time differences in days
[1] 365 365

as.numeric(diff(x), units='weeks')

[1] 52.14286 52.14286

7.10 Date-time

Date-time variables can be stored in two ways. POSIXct variables are stored as seconds from 01/01/1970. POSIXlt variables have each date/time component saved in a list container. Typically, you should use POSIXct.

Careful consideration of the timezone should be used when working with date-time variables. By default R will take the timezone from the operating system. It can be helpful to explicitly select your time zone, and additionally using UTC whenever possible.

Sys.time()

[1] "2019-04-01 16:36:03 CDT"

Sys.timezone()

[1] "America/Chicago"

x <- as.POSIXct("2014-10-10 10:20")
# see ?strptime for format options
y <- as.POSIXlt("10/17/2014 20:10", tz="UTC", format="%m/%d/%Y %H:%M")
unclass(x)

[1] 1412954400
attr(,"tzone")
[1] ""

unclass(y)

$sec
[1] 0

$min
[1] 10

$hour
[1] 20

$mday
[1] 17

$mon
[1] 9

$year
[1] 114

$wday
[1] 5

$yday
[1] 289

$isdst
[1] 0

attr(,"tzone")
[1] "UTC"

difftime(Sys.time(), x, units="days")

Time difference of 1634.261 days

base::round.POSIXt(x, units="hours")

[1] "2014-10-10 10:00:00 CDT"

7.11 Special Types

Undefined - NULL
Missing - NA
Infinite - Inf and -Inf
Not a Number - NaN

x <- c(1,4,NA,0)
is.na(x)

[1] FALSE FALSE  TRUE FALSE

# note calculations with NA result in NA
is.infinite(x / 0)

[1]  TRUE  TRUE FALSE FALSE

is.nan(x / 0)

[1] FALSE FALSE FALSE  TRUE

7.12 Mixing Types

Mixing types together in a vector will coerce all variables to be of the same, least flexible, type.

# str shows the structure of an R object, similar to Environment pane
str(c("a", 1))

 chr [1:2] "a" "1"

str(c(5L, 1))

 num [1:2] 5 1

str(c(TRUE, 0))

 num [1:2] 1 0

str(c(FALSE, "false"))

 chr [1:2] "FALSE" "false"

str(c(100, Sys.Date()))

 num [1:2] 100 17987

str(c('today', Sys.time()))

 chr [1:2] "today" "1554154565.14581"

str(c(factor(c("apple", "orange")), "banana"))

 chr [1:3] "1" "2" "banana"

7.13 Containers

vector - supports one type, with one dimension
matrix - supports one type, with two dimensions
array - supports one type, with many dimensions
data.frame - supports one type per column, but may vary by column
list - supports many types

Each have two important attributes, dimension length and dimension names.

x <- c(1, 10, 100)
length(x)

[1] 3

names(x) <- c("ones", "tens", "hundreds")
# assign names during vector creation
(x <- c(ones=1, tens=10, hundreds=100))

    ones     tens hundreds 
       1       10      100

names(x)

[1] "ones"     "tens"     "hundreds"

# print vector without names
unname(x)

[1]   1  10 100

# remove names
names(x) <- NULL
mx <- matrix(1:6, nrow=2)
# number of rows
nrow(mx)

[1] 2

# number of columns
ncol(mx)

[1] 3

# display number of rows and columns
dim(mx)

[1] 2 3

rownames(mx) <- c('odd', 'even')
colnames(mx) <- c('a', 'b', 'c')
# assign names during matrix creation
(mx <- matrix(1:6, nrow=2, dimnames=list(c('odd', 'even'), c('a','b','c'))))

     a b c
odd  1 3 5
even 2 4 6

str(mx)

 int [1:2, 1:3] 1 2 3 4 5 6
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "odd" "even"
  ..$ : chr [1:3] "a" "b" "c"

7.14 Vectors

7.14.1 Accessing Elements

Indexing is vital concept. It allows access to specific parts of a container, for extraction or replacement. There are four main ways to use indexing.

index by element number
index by negation
index by element name
index on truth of logical comparison

Note that in R, the first element is indexed at 1. Many programming languages start at 0 instead.

x <- c(royals=4, angels=0, tigers=1, orioles=7)
# index by number, returned in order specified
x[c(4,3)]

orioles  tigers 
      7       1

# index by negation
x[-c(1,2)]

 tigers orioles 
      1       7

# index by name
x[c("orioles", "angels")]

orioles  angels 
      7       0

# index by truth
x[x > 3]

 royals orioles 
      4       7

x[x > 3 & x < 7]

royals 
     4

# index replacement
x[c("royals","orioles")] <- c(7,4)
# create new element
x["as"] <- 0
# non-existent elements are NA
x[6]

<NA> 
  NA

7.14.2 Functions for Vectors

head, tail
rev
match, %in%
summary
sort, order
table - contingency table
unique
duplicated
union, intersect, setdiff
split, unsplit

# these examples use state data sets, see ?state
head(state.region)

[1] South West  West  South West  West 
Levels: Northeast South North Central West

tail(state.region, n=10)

 [1] North Central South         South         West          Northeast    
 [6] South         West          South         North Central West         
Levels: Northeast South North Central West

state.region %in% c("West", "South", "Central")

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[14] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
[27] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
[40]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

c("West", "South", "Central") %in% state.region

[1]  TRUE  TRUE FALSE

match(state.region, c("West", "South", "Central"))

 [1]  2  1  1  2  1  1 NA  2  2  2  1  1 NA NA NA NA  2  2 NA  2 NA NA NA  2 NA  1 NA
[28]  1 NA NA  1 NA  2 NA NA  2  1 NA NA  2 NA  2  2  1 NA  2  1  2 NA  1

summary(state.area)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1214   37317   56222   72368   83234  589757

sort(state.area)

 [1]   1214   2057   5009   6450   7836   8257   9304   9609  10577  24181  31055
[12]  33215  36291  40395  40815  41222  42244  45333  47716  48523  49576  51609
[23]  52586  53104  56154  56290  56400  58216  58560  58876  68192  69686  69919
[34]  70665  77047  77227  82264  83557  84068  84916  96981  97914 104247 110540
[45] 113909 121666 147138 158693 267339 589757

# display order by element index
order(state.area)

 [1] 39  8  7 11 30 21 29 45 20 48 40 19 14 17 46 35 42 38 24 18 32  1 33  4 49 15 13
[28] 22  9 10 47 25 36 34 41 27 16 12 23 44 37 50  6 28  3 31 26  5 43  2

table(state.region)

state.region
    Northeast         South North Central          West 
            9            16            12            13

table(state.division, state.region)

                    state.region
state.division       Northeast South North Central West
  New England                6     0             0    0
  Middle Atlantic            3     0             0    0
  South Atlantic             0     8             0    0
  East South Central         0     4             0    0
  West South Central         0     4             0    0
  East North Central         0     0             5    0
  West North Central         0     0             7    0
  Mountain                   0     0             0    8
  Pacific                    0     0             0    5

unique(state.region)

[1] South         West          Northeast     North Central
Levels: Northeast South North Central West

duplicated(state.region)

 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[14]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[27]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[40]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

set.seed(2)
x <- sample(20, 10)
y <- sample(20, 10)
union(x, y)

 [1]  4 14 11  3 16 15  2 18  6  7 12  5 13  1

# union takes two arguments, unique takes one (a vector)
all(union(x, y) == unique(c(x, y)))

[1] TRUE

intersect(x, y)

[1]  4 14  3 18  6  7

# elements in x, not in y
setdiff(x, y)

[1] 11 16 15  2

# elements in y, not in x
setdiff(y, x)

[1] 12  5 13  1

# create list of state names by region
split(state.name, state.region)

$Northeast
[1] "Connecticut"   "Maine"         "Massachusetts" "New Hampshire" "New Jersey"   
[6] "New York"      "Pennsylvania"  "Rhode Island"  "Vermont"      

$South
 [1] "Alabama"        "Arkansas"       "Delaware"       "Florida"       
 [5] "Georgia"        "Kentucky"       "Louisiana"      "Maryland"      
 [9] "Mississippi"    "North Carolina" "Oklahoma"       "South Carolina"
[13] "Tennessee"      "Texas"          "Virginia"       "West Virginia" 

$`North Central`
 [1] "Illinois"     "Indiana"      "Iowa"         "Kansas"       "Michigan"    
 [6] "Minnesota"    "Missouri"     "Nebraska"     "North Dakota" "Ohio"        
[11] "South Dakota" "Wisconsin"   

$West
 [1] "Alaska"     "Arizona"    "California" "Colorado"   "Hawaii"     "Idaho"     
 [7] "Montana"    "Nevada"     "New Mexico" "Oregon"     "Utah"       "Washington"
[13] "Wyoming"

7.14.3 Recycling

Some vector operations require vectors of equal length. If the vectors are of unequal length, elements in the shorter vector may be recycled.

# c(10,1) becomes c(10,1,10,1,10)
c(1,10,100,1000,10000) / c(10,1)

Warning in c(1, 10, 100, 1000, 10000)/c(10, 1): longer object length is not a
multiple of shorter object length

[1] 1e-01 1e+01 1e+01 1e+03 1e+03

# letters[1:2] becomes letters[c(1,2,1,2)]
paste(letters[1:2], letters[23:26])

[1] "a w" "b x" "a y" "b z"

7.15 Matrices

Two dimensional vector. Values are stored by column by default.

Useful functions:

%*% - matrix multiplication
rowSums, colSums
t - transpose
cbind - add column
rbind - add row
solve
upper.tri, lower.tri
diag - access diagonal, or create identity matrix
det - calculate the determinant

matrix(1:5)

     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5

matrix(seq(8), nrow=4, ncol=2)

     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

(x <- matrix(seq(8), nrow=4, ncol=2, byrow=TRUE))

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8

rowSums(x)

[1]  3  7 11 15

colSums(x)

[1] 16 20

y <- matrix(c(1,1,0,0), nrow=2)
x %*% y

     [,1] [,2]
[1,]    3    0
[2,]    7    0
[3,]   11    0
[4,]   15    0

x - y[rep(1,4),]

     [,1] [,2]
[1,]    0    2
[2,]    2    4
[3,]    4    6
[4,]    6    8

# add a column, note the use of recycling
cbind(x, 3)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    3    4    3
[3,]    5    6    3
[4,]    7    8    3

rbind(x, c(9, 10))

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

x <- matrix(c(4,12,-5,-10), nrow=2)
all.equal(x %*% solve(x), diag(2))

[1] TRUE

solve(x, c(10, 40))

[1] 5 2

x <- matrix(seq(25), nrow=5)
t(x)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25

upper.tri(x)

      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] FALSE  TRUE  TRUE  TRUE  TRUE
[2,] FALSE FALSE  TRUE  TRUE  TRUE
[3,] FALSE FALSE FALSE  TRUE  TRUE
[4,] FALSE FALSE FALSE FALSE  TRUE
[5,] FALSE FALSE FALSE FALSE FALSE

7.15.1 Indexing

by row(s)
by column(s)
by comparison

Remember that the row and column index could be either the row/column number or the row/column name.

x[3,]

[1]  3  8 13 18 23

x[,3]

[1] 11 12 13 14 15

x[3,3]

[1] 13

x[c(1,5),]

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    5   10   15   20   25

x[3,c(3,4)]

[1] 13 18

x[x %% 3 == 0] <- 0
x

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0   11   16    0
[2,]    2    7    0   17   22
[3,]    0    8   13    0   23
[4,]    4    0   14   19    0
[5,]    5   10    0   20   25

x[lower.tri(x)] <- 0
x

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0   11   16    0
[2,]    0    7    0   17   22
[3,]    0    0   13    0   23
[4,]    0    0    0   19    0
[5,]    0    0    0    0   25

7.16 Arrays

Arrays are similar to matrices, however they support one, two or more dimensions.

Index by array[d1ix, d2ix, d3ix, ..., dnix]

# this creates 4 2x3 matrices
x <- array(1:24, dim=c(2,3,4))
x[2,3,4]

[1] 24

x[,,4]

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24

7.17 Data Frames

Data.frames are the fundamental data structures in R. They have rows and columns like matrices, however different columns may contain different variable types.

Remember that matrices use rownames and colnames to access row/column names. Data frames use row.names and names instead.

When creating a data.frame with a column of character values, make sure to decide if the values should be converted to factor variables. This can be set with the stringsAsFactors argument.

x <- data.frame(id=1:3, animal=c('cat','dog','rabbit'), 
                speed=c('faster','fast','fastest'))
y <- data.frame(id=1:3, animal=c('cat','dog','rabbit'), 
                speed=c('faster','fast','fastest'), stringsAsFactors=FALSE)
y[3,]

  id animal   speed
3  3 rabbit fastest

y[,c(2,3)]

  animal   speed
1    cat  faster
2    dog    fast
3 rabbit fastest

y[,'speed']

[1] "faster"  "fast"    "fastest"

y[y[,'speed'] == 'fastest',]

  id animal   speed
3  3 rabbit fastest

y[y[,'speed'] == 'fastest', 'animal']

[1] "rabbit"

y[,'speed.factor'] <- factor(y[,'speed'])

There are other methods to select or create data.frame columns:

y[,3]

[1] "faster"  "fast"    "fastest"

y[,'speed']

[1] "faster"  "fast"    "fastest"

y$speed

[1] "faster"  "fast"    "fastest"

y[['speed']]

[1] "faster"  "fast"    "fastest"

y[,'speed', drop=FALSE]

    speed
1  faster
2    fast
3 fastest

y['speed']

    speed
1  faster
2    fast
3 fastest

y[,'new'] <- NA
y <- cbind(y, newer=0)
y

  id animal   speed speed.factor new newer
1  1    cat  faster       faster  NA     0
2  2    dog    fast         fast  NA     0
3  3 rabbit fastest      fastest  NA     0

It is good to write code that is consistent and easy to understand. For this reason I recommend using the df[,colname] syntax.

There are also a couple of ways to remove columns:

y[,'new'] <- NULL
y <- y[,-1]
y <- y[,-match('newer', names(y))]
y

  animal   speed speed.factor
1    cat  faster       faster
2    dog    fast         fast
3 rabbit fastest      fastest

Useful functions:

merge - join data frames by common columns
with, within - simplify expressions using column names

x <- data.frame(id=seq(4,24,4), gender=rbinom(6, 1, 0.5))
y <- data.frame(id=seq(6,24,6), smoker=rbinom(4, 1, 0.25))
merge(x, y)

  id gender smoker
1 12      1      0
2 24      0      0

merge(x, y, by=0)

  Row.names id.x gender id.y smoker
1         1    4      1    6      0
2         2    8      0   12      0
3         3   12      1   18      1
4         4   16      0   24      0

merge(x, y, by.x='id', by.y='id', all.x=TRUE, all.y=FALSE)

  id gender smoker
1  4      1     NA
2  8      0     NA
3 12      1      0
4 16      0     NA
5 20      0     NA
6 24      0      0

merge(x, y, all=TRUE)

  id gender smoker
1  4      1     NA
2  6     NA      0
3  8      0     NA
4 12      1      0
5 16      0     NA
6 18     NA      1
7 20      0     NA
8 24      0      0

# modify data frame, and reference columns without repeating data frame name
x <- within(x, {
  weight <- round(rnorm(nrow(x), 120+gender*60, 10))
  age <- sample(15:25, nrow(x), replace=TRUE)
})
# shorter than x[order(x[,'gender'], x[,'age']),]
x[with(x, order(gender, age)),]

  id gender age weight
2  8      0  16    129
6 24      0  18    141
4 16      0  23    130
5 20      0  25    124
1  4      1  16    157
3 12      1  25    180

7.18 Lists

Lists are generic containers. Think of them as a vector where each element can by anything you want, including more lists.

emptylist <- vector('list', 10)
x <- list(abb=state.abb, area=state.area, region=state.region, animals=y, other=5)

7.18.1 Indexing

Extracting with a single bracket returns a list. Extracting with two brackets returns the contents at the given index.

x[1]

$abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS"
[17] "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
[33] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"

x[1:2]

$abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS"
[17] "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
[33] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"

$area
 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450
[12]  83557  56400  36291  56290  82264  40395  48523  33215  10577   8257  58216
[23]  84068  47716  69686 147138  77227 110540   9304   7836 121666  49576  52586
[34]  70665  41222  69919  96981  45333   1214  31055  77047  42244 267339  84916
[45]   9609  40815  68192  24181  56154  97914

x[[1]]

 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS"
[17] "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
[33] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"

x[["area"]]

 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450
[12]  83557  56400  36291  56290  82264  40395  48523  33215  10577   8257  58216
[23]  84068  47716  69686 147138  77227 110540   9304   7836 121666  49576  52586
[34]  70665  41222  69919  96981  45333   1214  31055  77047  42244 267339  84916
[45]   9609  40815  68192  24181  56154  97914

# while not recommended, $ provides short-cut
x$area

 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450
[12]  83557  56400  36291  56290  82264  40395  48523  33215  10577   8257  58216
[23]  84068  47716  69686 147138  77227 110540   9304   7836 121666  49576  52586
[34]  70665  41222  69919  96981  45333   1214  31055  77047  42244 267339  84916
[45]   9609  40815  68192  24181  56154  97914

x[4:5] <- NULL

A list can be flattened to produce a vector with unlist.

# notice that all values become character strings, because the first element
# contained character values (the least flexible variable type)
unlist(x)

    abb1     abb2     abb3     abb4     abb5     abb6     abb7     abb8     abb9 
    "AL"     "AK"     "AZ"     "AR"     "CA"     "CO"     "CT"     "DE"     "FL" 
   abb10    abb11    abb12    abb13    abb14    abb15    abb16    abb17    abb18 
    "GA"     "HI"     "ID"     "IL"     "IN"     "IA"     "KS"     "KY"     "LA" 
   abb19    abb20    abb21    abb22    abb23    abb24    abb25    abb26    abb27 
    "ME"     "MD"     "MA"     "MI"     "MN"     "MS"     "MO"     "MT"     "NE" 
   abb28    abb29    abb30    abb31    abb32    abb33    abb34    abb35    abb36 
    "NV"     "NH"     "NJ"     "NM"     "NY"     "NC"     "ND"     "OH"     "OK" 
   abb37    abb38    abb39    abb40    abb41    abb42    abb43    abb44    abb45 
    "OR"     "PA"     "RI"     "SC"     "SD"     "TN"     "TX"     "UT"     "VT" 
   abb46    abb47    abb48    abb49    abb50    area1    area2    area3    area4 
    "VA"     "WA"     "WV"     "WI"     "WY"  "51609" "589757" "113909"  "53104" 
   area5    area6    area7    area8    area9   area10   area11   area12   area13 
"158693" "104247"   "5009"   "2057"  "58560"  "58876"   "6450"  "83557"  "56400" 
  area14   area15   area16   area17   area18   area19   area20   area21   area22 
 "36291"  "56290"  "82264"  "40395"  "48523"  "33215"  "10577"   "8257"  "58216" 
  area23   area24   area25   area26   area27   area28   area29   area30   area31 
 "84068"  "47716"  "69686" "147138"  "77227" "110540"   "9304"   "7836" "121666" 
  area32   area33   area34   area35   area36   area37   area38   area39   area40 
 "49576"  "52586"  "70665"  "41222"  "69919"  "96981"  "45333"   "1214"  "31055" 
  area41   area42   area43   area44   area45   area46   area47   area48   area49 
 "77047"  "42244" "267339"  "84916"   "9609"  "40815"  "68192"  "24181"  "56154" 
  area50  region1  region2  region3  region4  region5  region6  region7  region8 
 "97914"      "2"      "4"      "4"      "2"      "4"      "4"      "1"      "2" 
 region9 region10 region11 region12 region13 region14 region15 region16 region17 
     "2"      "2"      "4"      "4"      "3"      "3"      "3"      "3"      "2" 
region18 region19 region20 region21 region22 region23 region24 region25 region26 
     "2"      "1"      "2"      "1"      "3"      "3"      "2"      "3"      "4" 
region27 region28 region29 region30 region31 region32 region33 region34 region35 
     "3"      "4"      "1"      "1"      "4"      "1"      "2"      "3"      "3" 
region36 region37 region38 region39 region40 region41 region42 region43 region44 
     "2"      "4"      "1"      "1"      "2"      "3"      "2"      "2"      "4" 
region45 region46 region47 region48 region49 region50 
     "1"      "2"      "4"      "2"      "3"      "4"

7.19 Control Structure

Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.

7.19.1 Branching

if statements execute a block of code once or not at all. Multiple conditions can be tested so that only the block under the first condition evaluated as true will be executed. A default block of code can be evaluated if none of the conditions pass.

Several lines of code can be combined in a block by surrounding them with braces {...}. It is good practice to use braces, though they are not required for a single statement.

x <- 5
if(x >=0) {
  print(sqrt(x))
}

[1] 2.236068

x <- -4
if(x >=0) {
  print(sqrt(x))
} else if(x < 0) {
  print(sqrt(x*-1))
}

[1] 2

x <- 0
if(x > 0) {
  print(sqrt(x))
} else if(x < 0) {
  print(sqrt(x*-1))
} else {
  print(0)
}

[1] 0

The ifelse function can be used to test a vector of values.

set.seed(3)
ifelse(rnorm(10) > 0, 1, 0)

 [1] 0 0 1 0 1 1 1 1 0 1

7.19.2 Iteration

Three mechanisms can be used to run blocks of code multiple times.

for - repeat over sequence
while - repeat while condition passes
repeat - repeat until forced break

set.seed(5)
nc <- 5
x <- sample(nc, 100, replace=TRUE)
cnt <- numeric(nc)
for(i in seq(nc)) {
  cnt[i] <- sum(x == i)
}
cnt

[1] 18 24 17 13 28

i <- 3
while(i <= 100) {
  isprime <- all(i %% seq(2, sqrt(i)) != 0)
  if(isprime) cat(i, "")
  i <- i + 1
}

3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97

a <- 1
b <- 2
repeat {
  tmp <- a
  a <- b
  b <- a + tmp
  if(b > 100) break
  print(b)
}

[1] 3
[1] 5
[1] 8
[1] 13
[1] 21
[1] 34
[1] 55
[1] 89

7.20 Functions

Functions allow blocks of code to be run repeatedly. Variables are passed into functions as arguments. When the function completes, it may also return a variable (including a list, which may hold several elements).

Don’t access variables from within a function that were created outside of the function (unless they are passed into the function as arguments).

7.20.1 Return Values

prQ <- function(x) {
  qs <- quantile(x, probs=c(0.5, 0.25, 0.75))
  return(sprintf("%s (%s, %s)", qs[1], qs[2], qs[3]))
}
prQ(1:9)

[1] "5 (3, 7)"

# return doesn't have to be explicitly called
factors <- function(x) {
  Filter(function(i) x %% i == 0, seq(1, floor(x/2)))
}
factors(64)

[1]  1  2  4  8 16 32

# NULL can be returned as well
doNothing <- function(x) NULL

7.20.2 Arguments

A function can have many arguments. A generic argument ... can be used to specify any number of additional arguments. Arguments can be given a default value.

powerAdd <- function(p1=1, p2=1, p3=1) {
  p1 + p2^2 + p3^3
}
powerAdd()

[1] 3

powerAdd(3,2,1)

[1] 8

powerAdd(5, p3=3)

[1] 33

powerAdd(p2=16)

[1] 258

# take any number of arguments
morePowerAdd <- function(p1=1, ...) {
  extra <- unlist(list(...))
  sum(p1, extra^(1+seq_along(extra)))
}
morePowerAdd()

[1] 1

morePowerAdd(5, 4, 3, 2, 1)

[1] 65

# use extra arguments for another function
callPlot <- function(msg, ...) {
  print(sprintf("calling plot: %s", msg))
  plot(...)
  NULL
}
callPlot("log", 1:10, log(1:10))

[1] "calling plot: log"

NULL

7.21 Putting It Together

This creates a sample data set with multiple visits for each patient:

size <- 1000
set.seed(475)
x <- data.frame(id=sample(100, size, replace=TRUE),
                visitdate=sample(365*2, size, replace=TRUE)
                )
male <- sample(c(0,1,NA), 100, replace=TRUE, prob=c(45, 45, 10))
age <- round(runif(100, 40, 80))
x <- merge(x, cbind(male, id=seq(100)))
x <- merge(x, cbind(age, id=seq(100)))
x[,'visitdate'] <- as.Date(x[,'visitdate'], origin='2010-01-01')
x <- x[with(x, order(id, visitdate)),]
row.names(x) <- NULL

It’s common to operate on a data.frame by breaking it into small subsets. Calculate visit number and days since first visit:

uids <- unique(x[,'id'])
for(i in seq_along(uids)) {
  # note x is already ordered by visitdate
  id.x <- which(x[,'id'] == uids[i])
  x[id.x, 'visitno'] <- seq_along(id.x)
  x[id.x, 'daysOn'] <- x[id.x, 'visitdate'] - x[id.x[1], 'visitdate']
}

This method may work great, unless rows are added or removed. What if we wanted to add a last visit that’s three years after the first visit?

First consider a function that is given a subset.

# x should only have one unique ID value, and be ordered by visitdate
addLastVisit <- function(x) {
  last <- x[1,]
  last[,'visitdate'] <- last[,'visitdate'] + 365*3
  last[,'visitno'] <- nrow(x) + 1
  last[,'daysOn'] <- 365*3
  rbind(x, last)
}
# quick example
addLastVisit(x[x[,'id'] == 100,])

      id  visitdate male age visitno    daysOn
995  100 2010-03-20    0  57       1    0 days
996  100 2010-05-11    0  57       2   52 days
997  100 2010-09-14    0  57       3  178 days
998  100 2011-02-13    0  57       4  330 days
999  100 2011-02-17    0  57       5  334 days
1000 100 2011-08-31    0  57       6  529 days
9951 100 2013-03-19    0  57       7 1095 days

Because our new data.frame may have a unknown size, we can store the intermediate subsets in a list. lapply is a convenient function to use for this.

# now x is saved as 100 element list
x.list <- lapply(unique(x[,'id']), FUN=function(i) addLastVisit(x[x[,'id'] == i,]))

The finals steps are to convert the list back to a data.frame, and, if we care, update row names. do.call is powerful, but can be difficult to learn. The first argument is a function. The second argument should be a list, whose elements are the arguments to the referenced function. In this case, the 100 elements of x.list are the arguments to the rbind function, which should make sense as we are row-binding these data frames together.

x.df <- do.call(rbind, x.list)
row.names(x.df) <- NULL

The R package plyr was written to handle similar tasks.

8 Transforming Variables

Variables can be transformed by applying functions or arithmetic.

getHdata(diabetes)
contents(diabetes)


Data frame:diabetes 403 observations and 19 variables    Maximum # NAs:262

                                         Labels   Units Levels Storage NAs
id                                   Subject ID                integer   0
chol                          Total Cholesterol                integer   1
stab.glu                     Stabilized Glucose                integer   0
hdl                    High Density Lipoprotein                integer   1
ratio                     Cholesterol/HDL Ratio                 double   1
glyhb                   Glycosolated Hemoglobin                 double  13
location                                                     2 integer   0
age                                               years        integer   0
gender                                                       2 integer   0
height                                           inches        integer   5
weight                                           pounds        integer   1
frame                                                        3 integer  12
bp.1s             First Systolic Blood Pressure                integer   5
bp.1d            First Diastolic Blood Pressure                integer   5
bp.2s            Second Systolic Blood Pressure                integer 262
bp.2d           Second Diastolic Blood Pressure                integer 262
waist                                            inches        integer   2
hip                                              inches        integer   2
time.ppn Postprandial Time when Labs were Drawn minutes        integer   3

+--------+------------------+
|Variable|Levels            |
+--------+------------------+
|location|Buckingham,Louisa |
+--------+------------------+
|gender  |male,female       |
+--------+------------------+
|frame   |small,medium,large|
+--------+------------------+

# create BMI
diabetes[,'bmi'] <- 703*diabetes[,'weight']/diabetes[,'height']^2
# scale cholesterol to [0,1]
diabetes[,'chol.scaled'] <- (diabetes[,'chol']-min(diabetes[,'chol'], na.rm=TRUE))/
  (max(diabetes[,'chol'], na.rm=TRUE)-min(diabetes[,'chol'], na.rm=TRUE))
head(diabetes)

    id chol stab.glu hdl ratio glyhb   location age gender height weight  frame
1 1000  203       82  56   3.6  4.31 Buckingham  46 female     62    121 medium
2 1001  165       97  24   6.9  4.44 Buckingham  29 female     64    218  large
3 1002  228       92  37   6.2  4.64 Buckingham  58 female     61    256  large
4 1003   78       93  12   6.5  4.63 Buckingham  67   male     67    119  large
5 1005  249       90  28   8.9  7.72 Buckingham  64   male     68    183 medium
6 1008  248       94  69   3.6  4.81 Buckingham  34   male     71    190  large
  bp.1s bp.1d bp.2s bp.2d waist hip time.ppn      bmi chol.scaled
1   118    59    NA    NA    29  38      720 22.12877   0.3424658
2   112    68    NA    NA    46  48      360 37.41553   0.2383562
3   190    92   185    92    49  57      180 48.36549   0.4109589
4   110    50    NA    NA    33  38      480 18.63600   0.0000000
5   138    80    NA    NA    44  41      300 27.82202   0.4684932
6   132    86    NA    NA    36  42      195 26.49673   0.4657534

getHdata(olympics.1996)
contents(olympics.1996)


Data frame:olympics.1996    79 observations and 3 variables    Maximum # NAs:0

             Storage
country    character
medals       integer
population   integer

# take log of population
olympics.1996[,'log.pop'] <- log(olympics.1996[,'population'])
# create factor variable using cut points
olympics.1996[,'medals.factor'] <- cut(olympics.1996[,'medals'], 
                                       c(0,10,20,50,100,Inf), right=FALSE)
head(olympics.1996)

    country medals population  log.pop medals.factor
1     Tonga      1      96165 11.47382        [0,10)
2   Bahamas      1     281584 12.54819        [0,10)
3   Jamaica      6    2589043 14.76680        [0,10)
4      Cuba     25   10952046 16.20904       [20,50)
5 Australia     41   18348078 16.72504       [20,50)
6   Hungary     21   10273590 16.14509       [20,50)

9 Subsetting data frames

Several examples of subsetting or indexing the hospital data set are below. Also shown is the subset function, a convenience function for accomplishing this task.

# select rows by index
hospital[1:3,]

  id duration age    sex temp wbc antibiotic bculture service
1  1        5  30 female   99   8         no       no     med
2  2       10  73 female   98   5         no      yes     med
3  3        6  40 female   99  12         no       no    surg

# select rows by name
hospital[c("5","10","15","20"),]

   id duration age    sex temp wbc antibiotic bculture service
5   5        5  25 female 98.5  11         no       no    surg
10 10        3  50   male 98.0  12         no      yes    surg
15 15        5  20 female 98.4  11         no      yes    surg
20 20        7  22   male 98.2   6         no       no    surg

# select columns by index, and first 10 rows
hospital[seq(10),c(1,5)]

# select columns by name, and 5 random rows
hospital[sample(nrow(hospital), 5), c('age','sex')]

   age    sex
15  20 female
9   43 female
6   82   male
23  67 female
2   73 female

# be careful when selecting a single column
# no longer data.frame, but vector
hospital[seq(5),'service']

[1] med  med  surg surg surg
Levels: med surg

# maintain data.frame structure
hospital[seq(5),'service', drop=FALSE]

  service
1     med
2     med
3    surg
4    surg
5    surg

# select rows that satisfy condition
# every 5th row
hospital[seq(nrow(hospital))%%5 == 0,]

   id duration age    sex temp wbc antibiotic bculture service
5   5        5  25 female 98.5  11         no       no    surg
10 10        3  50   male 98.0  12         no      yes    surg
15 15        5  20 female 98.4  11         no      yes    surg
20 20        7  22   male 98.2   6         no       no    surg
25 25        4  41 female 98.0   5         no       no     med

# men
hospital[hospital[,'sex'] == 'male',]

   id duration age  sex temp wbc antibiotic bculture service
6   6       14  82 male 96.8   6        yes       no    surg
7   7       30  60 male 99.5   8        yes      yes     med
10 10        3  50 male 98.0  12         no      yes    surg
12 12        3   4 male 97.8   3         no       no    surg
16 16        5  32 male 99.0   9         no       no    surg
17 17        7  36 male 99.2   6        yes       no    surg
18 18        4  69 male 98.0   6         no       no    surg
19 19        3  47 male 97.0   5        yes       no     med
20 20        7  22 male 98.2   6         no       no    surg
21 21        9  11 male 98.2  10         no       no    surg
22 22       11  19 male 98.6  14        yes       no    surg

subset(hospital, sex == 'male')

   id duration age  sex temp wbc antibiotic bculture service
6   6       14  82 male 96.8   6        yes       no    surg
7   7       30  60 male 99.5   8        yes      yes     med
10 10        3  50 male 98.0  12         no      yes    surg
12 12        3   4 male 97.8   3         no       no    surg
16 16        5  32 male 99.0   9         no       no    surg
17 17        7  36 male 99.2   6        yes       no    surg
18 18        4  69 male 98.0   6         no       no    surg
19 19        3  47 male 97.0   5        yes       no     med
20 20        7  22 male 98.2   6         no       no    surg
21 21        9  11 male 98.2  10         no       no    surg
22 22       11  19 male 98.6  14        yes       no    surg

# 30-40 year olds, but only first 5
hospital[hospital[,'age'] %in% 30:49, c('id','age')][1:5,]

subset(hospital, age %in% 30:49, select=c(id, age))[1:5,]

# temp greater than 99
hospital[hospital[,'temp'] > 99, c('id','temp')]

   id temp
7   7 99.5
13 13 99.5
17 17 99.2

subset(hospital, temp > 99, c(id, temp))

   id temp
7   7 99.5
13 13 99.5
17 17 99.2

# combining conditions with AND
# women with duration over 10
hospital[hospital[,'sex'] == 'female' & hospital[,'duration'] > 10,]

   id duration age    sex temp wbc antibiotic bculture service
4   4       11  47 female 98.2   4         no       no    surg
8   8       11  56 female 98.6   7         no       no     med
9   9       17  43 female 98.0   7         no       no     med
23 23       11  67 female 97.6   4         no       no     med

# `with` saves some typing
hospital[with(hospital, sex == 'female' & duration > 10),]

   id duration age    sex temp wbc antibiotic bculture service
4   4       11  47 female 98.2   4         no       no    surg
8   8       11  56 female 98.6   7         no       no     med
9   9       17  43 female 98.0   7         no       no     med
23 23       11  67 female 97.6   4         no       no     med

# `subset` saves even more
subset(hospital, sex == 'female' & duration > 10)

   id duration age    sex temp wbc antibiotic bculture service
4   4       11  47 female 98.2   4         no       no    surg
8   8       11  56 female 98.6   7         no       no     med
9   9       17  43 female 98.0   7         no       no     med
23 23       11  67 female 97.6   4         no       no     med

# combining conditions with OR
# antibiotic or bculture
hospital[hospital[,'antibiotic'] == 'yes' | hospital[,'bculture'] == 'yes',]

   id duration age    sex temp wbc antibiotic bculture service
2   2       10  73 female 98.0   5         no      yes     med
6   6       14  82   male 96.8   6        yes       no    surg
7   7       30  60   male 99.5   8        yes      yes     med
10 10        3  50   male 98.0  12         no      yes    surg
11 11        9  59 female 97.6   7         no      yes     med
13 13        8  22 female 99.5  11        yes       no    surg
14 14        8  33 female 98.4  14        yes      yes    surg
15 15        5  20 female 98.4  11         no      yes    surg
17 17        7  36   male 99.2   6        yes       no    surg
19 19        3  47   male 97.0   5        yes       no     med
22 22       11  19   male 98.6  14        yes       no    surg

# combining conditions with complicated combinations
# men younger than 30 or women older than 60
hospital[with(hospital, (sex == 'male' & age < 30) | (sex == 'female' & age > 60)),]

   id duration age    sex temp wbc antibiotic bculture service
2   2       10  73 female 98.0   5         no      yes     med
12 12        3   4   male 97.8   3         no       no    surg
20 20        7  22   male 98.2   6         no       no    surg
21 21        9  11   male 98.2  10         no       no    surg
22 22       11  19   male 98.6  14        yes       no    surg
23 23       11  67 female 97.6   4         no       no     med

# select columns that match pattern
hospital[seq(10), grep("^s", names(hospital))]

      sex service
1  female     med
2  female     med
3  female    surg
4  female    surg
5  female    surg
6    male    surg
7    male     med
8  female     med
9  female     med
10   male    surg

# select empty data.frame
hospital[FALSE,]

[1] id         duration   age        sex        temp       wbc        antibiotic
[8] bculture   service   
<0 rows> (or 0-length row.names)

10 Data Aggregation

Summary statistics can be computed with tapply, aggregate, and Hmisc::summarize.

getHdata(DominicanHTN)
contents(DominicanHTN)


Data frame:DominicanHTN 381 observations and 5 variables    Maximum # NAs:0

                          Labels Units Levels Storage
village                                     8 integer
gender                                      2 integer
age                                           integer
sbp      Systolic Blood Pressure  mmHg        integer
dbp     Diastolic Blood Pressure  mmHg        integer

+--------+------------------------------------------------------------------+
|Variable|Levels                                                            |
+--------+------------------------------------------------------------------+
| village|Bare' Nuevo,Batey Verde,Carmona,Cojobal,Juan Sanchez,La Altagracia|
|        |Los Gueneos,San Antonio                                           |
+--------+------------------------------------------------------------------+
| gender |Female,Male                                                       |
+--------+------------------------------------------------------------------+

# calculate mean age by village
with(DominicanHTN, tapply(age, village, mean))

  Bare' Nuevo   Batey Verde       Carmona       Cojobal  Juan Sanchez La Altagracia 
     47.53125      48.02439      49.00000      47.15254      47.45283      49.65000 
  Los Gueneos   San Antonio 
     45.15789      51.25714

# calculate mean sbp by gender
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)

  gender      sbp
1 Female 133.1977
2   Male 132.5691

# calculate median age by village and gender
aggregate(age ~ village + gender, DominicanHTN, median)

         village gender  age
1    Bare' Nuevo Female 40.0
2    Batey Verde Female 43.0
3        Carmona Female 40.0
4        Cojobal Female 43.0
5   Juan Sanchez Female 47.0
6  La Altagracia Female 48.0
7    Los Gueneos Female 40.0
8    San Antonio Female 42.5
9    Bare' Nuevo   Male 48.0
10   Batey Verde   Male 52.0
11       Carmona   Male 53.0
12       Cojobal   Male 49.0
13  Juan Sanchez   Male 56.0
14 La Altagracia   Male 57.5
15   Los Gueneos   Male 51.0
16   San Antonio   Male 61.0

# calculate quantile of sbp by village and gender
with(DominicanHTN, summarize(sbp, llist(village, gender), quantile))

         village gender sbp  X25.  X50.   X75. X100.
1    Bare' Nuevo Female  80 125.0 130.0 141.00   166
2    Bare' Nuevo   Male 102 120.0 126.0 136.00   148
3    Batey Verde Female 110 120.0 132.0 155.00   190
4    Batey Verde   Male  90 117.0 130.0 147.50   180
5        Carmona Female  95 114.0 130.0 150.00   204
6        Carmona   Male 100 120.0 130.0 150.00   168
7        Cojobal Female  90 119.0 124.0 150.00   200
8        Cojobal   Male  90 116.0 122.0 134.00   200
9   Juan Sanchez Female 100 118.0 130.0 140.00   200
10  Juan Sanchez   Male  90 120.0 130.0 150.00   200
11 La Altagracia Female  90 118.0 125.0 156.00   210
12 La Altagracia   Male  80 112.0 120.0 137.00   236
13   Los Gueneos Female  90 106.0 118.0 140.00   180
14   Los Gueneos   Male  98 118.0 124.5 144.75   170
15   San Antonio Female  90 116.5 129.5 150.00   190
16   San Antonio   Male 100 125.0 140.0 165.00   210

11 Formulas

The previous aggregate example introduced the ~ operator. The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.

y ~ a + b, + indicates to include both a and b as terms
y ~ a:b, : indicates the interaction of a and b
y ~ a*b, equivalent to y ~ a+b+a:b
y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
y ~ a + I(b+c), I indicates to use + in the arithmetic sense

See ?formula for more examples.

12 Models

The most simple model-fitting function is lm, which is used to fit linear models. It’s primary argument is a formula. Using the DominicanHTN data set, we can fit sbp by age.

(m <- lm(sbp ~ age, data=DominicanHTN))


Call:
lm(formula = sbp ~ age, data = DominicanHTN)

Coefficients:
(Intercept)          age  
   108.3079       0.5146

This creates an lm object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.

coef
fitted
predict
residuals
vcov
summary
anova

names(m)

 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"

m$coefficients

(Intercept)         age 
108.3078571   0.5146198

coef(m)

(Intercept)         age 
108.3078571   0.5146198

head(fitted(m))

Systolic Blood Pressure [mmHg] 
       1        2        3        4        5        6 
137.1266 129.9219 143.8166 144.3312 140.2143 140.7289

predict(m, data.frame(age=c(40, 60)))

       1        2 
128.8926 139.1850

head(residuals(m))

Systolic Blood Pressure [mmHg] 
         1          2          3          4          5          6 
 12.873433  -9.921890 -23.816625  35.668756  -2.214286 -25.728906

vcov(m)

            (Intercept)          age
(Intercept)  17.5571859 -0.333034803
age          -0.3330348  0.006942401

summary(m)


Call:
lm(formula = sbp ~ age, data = DominicanHTN)

Residuals:
Systolic Blood Pressure [mmHg] 
    Min      1Q  Median      3Q     Max 
-63.080 -16.688  -2.787  11.961  96.815 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 108.30786    4.19013  25.848  < 2e-16 ***
age           0.51462    0.08332   6.176 1.69e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.54 on 379 degrees of freedom
Multiple R-squared:  0.09145,   Adjusted R-squared:  0.08905 
F-statistic: 38.15 on 1 and 379 DF,  p-value: 1.692e-09

coef(summary(m))

               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 108.3078571 4.19012958 25.848331 1.177025e-85
age           0.5146198 0.08332107  6.176347 1.691868e-09

anova(m)

Analysis of Variance Table

Response: sbp
           Df Sum Sq Mean Sq F value    Pr(>F)    
age         1  22980 22980.3  38.147 1.692e-09 ***
Residuals 379 228314   602.4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

13 R Graphics with ggplot2

The examples here (“9,” n.d.) say it best.

library(ggplot2)
# create factors, categorical variables
m2 <- within(mtcars, {
    gear.factor <- factor(gear, levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
    am.factor <- factor(am, levels=c(0,1), labels=c("Automatic","Manual"))
    cyl.factor <- factor(cyl, levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl"))
})

# kernel density
qplot(mpg, data=m2, geom="density", fill=gear.factor, alpha=I(.5), 
      main="Distribution of Gas Mileage", xlab="Miles Per Gallon", ylab="Density"
)

# scatterplot
qplot(hp, mpg, data=m2, shape=am.factor, color=am.factor, 
      facets=gear.factor~cyl.factor, size=I(3), xlab="Horsepower", 
      ylab="Miles per Gallon"
)

# boxplot
qplot(gear, mpg, data=m2, geom=c("boxplot", "jitter"), fill=gear.factor, 
      main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon"
)

Example with regression:

# scatterplot
p <- qplot(wt, mpg, data=m2, color=cyl.factor, 
      main="Regression of MPG on Weight", xlab="Weight", ylab="Miles per Gallon"
)
# add regression line
p + geom_smooth(method="lm")

p + geom_smooth(method="loess")

13.1 Graphics Devices

Plots created with ggplot2 can be saved to a file with the ggsave function. The file format is determined by the file name. Some are vector file formats (pdf, svg), while others (bmp, jpeg, png) are raster formats. Vector files can be scaled without pixelation.

Create 6“x6” png at 300 dpi, useful for print publications:

qplot(hp, mpg, data=m2, shape=am.factor, color=am.factor, 
      facets=gear.factor~cyl.factor, size=I(3), xlab="Horsepower", 
      ylab="Miles per Gallon"
)

ggsave("mileageByHorses.png", width=6, height=6, dpi=300)

Plot to PDF:

qplot(gear, mpg, data=m2, geom=c("boxplot", "jitter"), fill=gear.factor, 
            main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon")

ggsave("mileageByGear.pdf")

Saving 7 x 5 in image

Links

“1.” n.d. http://miktex.org/download.

“2.” n.d. http://www.tug.org/mactex/morepackages.html.

“3.” n.d. http://yihui.name/knitr/.

“4.” n.d. http://yihui.name/en/2013/10/markdown-or-latex/.

“5.” n.d. https://github.com/yihui/knitr-examples.

“6.” n.d. http://yihui.name/knitr/options/#chunk_options.

“7.” n.d. http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RConfiguration/Rprofile.

“8.” n.d. https://github.com/harrelfe/rscripts.

“9.” n.d. http://www.statmethods.net/advgraphs/ggplot2.html.

R Workshop

Cole Beck