"Hadley’s vocabulary: part 1 - the basics"

The endlessly productive Hadley Wickham has suggested an R vocabulary, a minimal set of common terms and operators that everyone using R should know. Surprisingly, I didn't know all of them, or didn't know all of their behaviours. So as a piece of deliberate practice, here's the first part of the list with some comments and examples. I've rearranged some items (so as to group similar terms), added a few that seemed missing and, conversely, left out one or two obscure ones.

Getting help

When in doubt, ? is the first resort, calling up the documentation for the following exact term. Doubling it - ?? - leads R to search through all documentation instead:

? str
?? str

str gives a breakdown of an object, listing its type and recursing into its structure:

# a simple, single object
str ("")
##  chr ""
# a vector
str (1:5)
##  int [1:5] 1 2 3 4 5
# a complex structure
str (list(a = "A", L = as.list(1:5)))
## List of 2
##  $ a: chr "A"
##  $ L:List of 5
##   ..$ : int 1
##   ..$ : int 2
##   ..$ : int 3
##   ..$ : int 4
##   ..$ : int 5

summary gives a less detailed (but more readable) breakdown of an object. Different classes of objects render their summaries in different ways:

# this just gives simple type info for the object
summary ("")
##    Length     Class      Mode 
##         1 character character
# numerical objects are rendered as summary statistics
summary (1:5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5
# complex objects may just give their structure
summary (list (a = "A", L = as.list(1:10)))
##   Length Class  Mode     
## a  1     -none- character
## L 10     -none- list

Assignment

Assignment and variable creation are traditionally done with <- in R. = is sometimes used and works the same in most common cases. The difference is that <- can be used anywhere, while = only acts as assignment at the top level or inside braces; within a function call it names an argument instead.

<<- is like <- but searches upwards through the enclosing environments for an existing variable to assign to, i.e. it doesn't create a variable in the local scope but goes looking for one elsewhere. If it finds none, it creates the variable in the global environment. So it's a bit like Python's global statement. Only use this if you really need to.

<- and <<- can also be written backwards, as -> and ->>, assigning left to right, but this is not recommended.

assign is a function that sets a variable given its name as a string, and thus is useful for programmatic assignment:

x <- 5
y <<- 6
z = 7

8 -> a
9 ->> b

nam <- paste("a", 'b', sep = ".")
assign (nam, 1)
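
To make the difference between <- and <<- concrete, here is a minimal sketch using a closure. The names make_counter and counter are made up for illustration and are not part of base R. The second half shows assign building variable names at run time, with get (the base R counterpart) reading them back:

```r
# make_counter is a hypothetical helper, not a base R function
make_counter <- function () {
   count <- 0
   function () {
      # <<- finds 'count' in the enclosing function's environment
      # and updates it there, rather than creating a local copy
      count <<- count + 1
      count
   }
}

counter <- make_counter ()
counter ()
## [1] 1
counter ()
## [1] 2

# assign creates variables whose names are built at run time;
# get looks a variable up again by name
for (i in 1:3) {
   assign (paste ("v", i, sep = "."), i * 10)
}
get ("v.2")
## [1] 20
```

Had the inner function used <- instead, every call would have created a fresh local count and returned 1.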

Operators

R contains the usual mathematical operators, most of which use the usual symbols: + - * / ^

^ is the exponentiation / power operator.

Note that unlike some other languages, / always does floating point division, regardless of the arguments. Even when both arguments are integers and the result is a whole number, the result is a double; it is simply printed without a decimal point. Integer division is done with %/% instead.

4 / 2
## [1] 2
4.0 / 2
## [1] 2
4.0 / 2.0
## [1] 2
3 / 2
## [1] 1.5

Modulo is expressed with %%.
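
A quick sketch of how division, integer division and modulo relate, using typeof to check what kind of number comes back. The L suffix marks an integer literal:

```r
# / gives a double even for integer arguments
typeof (4 / 2)
## [1] "double"
typeof (4L / 2L)
## [1] "double"

# %/% is integer division, %% is the remainder
7 %/% 2
## [1] 3
7 %% 2
## [1] 1
typeof (7L %/% 2L)
## [1] "integer"

# for negative numbers, %/% rounds down and %% takes the sign of the divisor
(-7) %/% 2
## [1] -4
(-7) %% 2
## [1] 1
```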

The usual comparison operators are there as well: != == > >= < <=

Basic search

%in% is the membership function, returning whether a value can be found inside another:

1 %in% 1:5
## [1] TRUE
'A' %in% list (a = "A", L = as.list(1:10))
## [1] TRUE

match, by contrast, returns the position of the first match for each value searched for, or NA if it isn't found:

match (1, 1:5)
## [1] 1
match (6, 1:5)
## [1] NA
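
Both match and %in% are vectorised over their first argument, and match only ever reports the first position. A small sketch, with which added as the usual way to get every position (the vector v is just example data):

```r
v <- c ("b", "a", "b", "c")

# only the first position is returned, even though "b" appears twice
match ("b", v)
## [1] 1
# which gives all positions where a condition holds
which (v == "b")
## [1] 1 3

# several values can be looked up at once
match (c ("c", "z"), v)
## [1]  4 NA
c ("c", "z") %in% v
## [1]  TRUE FALSE
```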

Browsing large datasets

Large and unwieldy pieces of data can be browsed with head and tail:

head (1:10000)
## [1] 1 2 3 4 5 6
head (1:10000, n=10)
##  [1]  1  2  3  4  5  6  7  8  9 10
tail (1:10000)
## [1]  9995  9996  9997  9998  9999 10000
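
One behaviour I didn't know: n can be negative, in which case head returns everything except the last |n| elements and tail everything except the first |n|. A short sketch:

```r
# all but the last 7
head (1:10, n = -7)
## [1] 1 2 3
# all but the first 7
tail (1:10, n = -7)
## [1]  8  9 10
```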

Type tests

R includes a large number of functions for testing the type of variables, mostly with obvious names like is.FOO. For example, is.na and is.null test for objects being NA and NULL respectively.

Which leads us to discuss the meaning and difference of NA and NULL.

  • NA is a logical value akin to TRUE and FALSE, signifying indeterminacy, unknown or missing data. You may find it in vectors or dataframes.

  • NULL is the absence of a value, a value that "doesn't exist". It can be returned by functions to signal "there is no answer". It cannot be stored as an element of an atomic vector, and assigning NULL to a list element in fact deletes that element.

The similar sounding but unrelated NaN is "not a number", the result of 0 / 0 for example. (Dividing a non-zero number by 0 gives Inf or -Inf instead.)

Other mathematical type tests include: is.finite is.infinite is.nan
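
The differences between NA, NULL, NaN and Inf are easiest to see side by side. A minimal sketch, where v and l are just example data:

```r
# NA is a placeholder inside a vector and propagates through calculations
v <- c (1, NA, 3)
is.na (v)
## [1] FALSE  TRUE FALSE
sum (v)
## [1] NA
sum (v, na.rm = TRUE)
## [1] 4

# NULL has no length, and assigning it to a list element deletes the element
length (NULL)
## [1] 0
l <- list (a = 1, b = 2)
l$a <- NULL
names (l)
## [1] "b"

# NaN and Inf come out of arithmetic
0 / 0
## [1] NaN
1 / 0
## [1] Inf
# is.na is TRUE for NaN, but is.nan is FALSE for NA
is.na (NaN)
## [1] TRUE
is.nan (NA)
## [1] FALSE
is.finite (c (1, Inf, NA, NaN))
## [1]  TRUE FALSE FALSE FALSE
```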

Mathematical functions

abs returns the absolute value of a number while sign returns its sign (-1, 0 or 1).

The usual trigonometric functions are there and do what you'd expect: acos, asin, atan, atan2, sin, cos, tan. Note that angles are in radians, not degrees.

The usual exponential and logarithmic functions are there: exp, log, log10, log2, along with sqrt for square roots. The natural log is, of course, log.
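
A few quick checks of the above. Since the trigonometric functions work in radians, degrees have to be converted by hand (the variable deg is just for illustration), and log takes an optional base argument for other bases:

```r
# convert degrees to radians before calling sin
deg <- 90
sin (deg * pi / 180)
## [1] 1

# log is the natural log, the inverse of exp
log (exp (2))
## [1] 2
log10 (1000)
## [1] 3
log2 (8)
## [1] 3
# any other base can be given explicitly
log (8, base = 2)
## [1] 3
```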

ceiling rounds up, while floor rounds down and round rounds to the nearest value. trunc rounds towards 0, while signif rounds to a given number of significant digits (6 by default):

ceiling (12.34)
## [1] 13
floor (12.34)
## [1] 12
floor (-12.34)
## [1] -13
round (12.34)
## [1] 12
round (12.54)
## [1] 13
trunc (12.34)
## [1] 12
trunc (-12.34)
## [1] -12

round and signif take an optional argument, digits, that lets you control where the rounding takes place. For round, digits is the number of decimal places; if it is negative, rounding takes place in the whole numbers. Think of it as rounding to the nearest multiple of 10^-digits:

round (12.54, digits=2)
## [1] 12.54
round (12.54, digits=1)
## [1] 12.5
round (12.54, digits=0)
## [1] 13
round (12.54, digits=-1)
## [1] 10
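
signif deserves its own sketch, since its digits argument counts significant digits rather than decimal places. The last two lines show a detail of round that surprised me: exact halves go to the nearest even number rather than always up:

```r
# the default is 6 significant digits
signif (123456.789)
## [1] 123457
signif (123456.789, digits = 2)
## [1] 120000
signif (0.00123456, digits = 2)
## [1] 0.0012

# negative digits for round work on the whole-number part
round (1234.5678, digits = -2)
## [1] 1200

# exact halves are rounded to the even neighbour
round (0.5)
## [1] 0
round (2.5)
## [1] 2
```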

Comparison

Comparing objects is actually a tricky thing, so R provides a few different methods for doing it.

all.equal tests equality in a slightly sloppy way, allowing for small numerical tolerances. Note that it returns TRUE if the objects are (nearly) equal, but a description of the differences otherwise, so wrap it in isTRUE when using it as a condition:

all.equal (pi, 355/113)
## [1] "Mean relative difference: 8.491368e-08"
all.equal (tan (pi* (1/4 + 1:10)), rep(1, 10))
## [1] TRUE

identical is the stricter version, which tests whether two objects are exactly the same.
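
A short sketch of where ==, all.equal and identical disagree. The classic case is floating point arithmetic, but identical also cares about type and attributes such as names:

```r
# floating point error makes == fail here
0.1 + 0.2 == 0.3
## [1] FALSE
isTRUE (all.equal (0.1 + 0.2, 0.3))
## [1] TRUE

# an integer and a double are "equal" but not identical
all.equal (1L, 1)
## [1] TRUE
identical (1L, 1)
## [1] FALSE

# same values, different names
identical (c (a = 1), c (b = 1))
## [1] FALSE
```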

Filtering & slicing

The bracket indexing in R works much like it does in other languages, but is a lot more powerful. Note that indices start at 1, not 0. Brackets can accept an index, a range of indices, a boolean vector or an expression that produces any of those:

d <- 1:10
d[4]
## [1] 4
d[4:6]
## [1] 4 5 6
d[d %% 2 == 0]
## [1]  2  4  6  8 10

[[ can be used to select a single element dropping names, effectively copying the value 'out' of the containing object.

$ can be used to select named elements out of an object.
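
The difference between [, [[ and $ shows up most clearly on a list: [ returns a smaller container, while [[ and $ return the element itself. A minimal sketch, with l and v as example data:

```r
l <- list (a = 1:3, b = "B")

# [ keeps the container: the result is still a list
class (l["a"])
## [1] "list"
# [[ extracts the element itself
class (l[["a"]])
## [1] "integer"
# $ is shorthand for [[ with a literal name
l$b
## [1] "B"

# on a named vector, [ keeps the name and [[ drops it
v <- c (x = 10, y = 20)
names (v["y"])
## [1] "y"
names (v[["y"]])
## NULL

# negative indices exclude elements
d <- 1:10
d[-(1:7)]
## [1]  8  9 10
```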

subset allows you to filter a dataset into just those rows meeting certain criteria. You can also select which columns appear in the output:

head (airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
head (subset(airquality, Temp > 80))
##    Ozone Solar.R Wind Temp Month Day
## 29    45     252 14.9   81     5  29
## 35    NA     186  9.2   84     6   4
## 36    NA     220  8.6   85     6   5
## 38    29     127  9.7   82     6   7
## 39    NA     273  6.9   87     6   8
## 40    71     291 13.8   90     6   9
head (subset(airquality, Temp > 80, select = c(Ozone, Temp)))
##    Ozone Temp
## 29    45   81
## 35    NA   84
## 36    NA   85
## 38    29   82
## 39    NA   87
## 40    71   90
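
For comparison, the same filter can be written with plain bracket indexing. This is a sketch of the equivalence rather than a recommendation; the variable names hot, kept and dropped are made up. One difference worth knowing: when the condition evaluates to NA, subset drops the row, while bracket indexing returns a row of NAs:

```r
# rows by a logical condition, columns by name
hot <- airquality[airquality$Temp > 80, c ("Ozone", "Temp")]
head (hot, n = 3)
##    Ozone Temp
## 29    45   81
## 35    NA   84
## 36    NA   85

# Ozone contains NAs, so the two approaches return different row counts
kept <- airquality[airquality$Ozone > 40, ]
dropped <- subset (airquality, Ozone > 40)
nrow (kept) - nrow (dropped) == sum (is.na (airquality$Ozone))
## [1] TRUE
```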

Simple analysis

Several simple calculations over a sequence of data are available, most of which have obvious names like max, min, prod and sum:

d <- 1:5
max (d)
## [1] 5
min (d)
## [1] 1
prod (d)
## [1] 120
sum (d)
## [1] 15

Usefully, R provides "cumulative" versions that produce answers as you move along the sequence:

cummax (d)
## [1] 1 2 3 4 5
cummin (d)
## [1] 1 1 1 1 1
cumprod (d)
## [1]   1   2   6  24 120
cumsum (d)
## [1]  1  3  6 10 15

pmax and pmin are a bit strange but useful in the right circumstances. They give the maximum (or minimum) across a set of sequences, for every position. Note that as is "standard" (well, common) in R, if the sequence lengths don't match, then the shorter one is "recycled" (repeated):

pmax (1:5, 2:6)
## [1] 2 3 4 5 6
pmin (1:5, 2:3)
## Warning in pmin(1:5, 2:3): an argument will be fractionally recycled
## [1] 1 2 2 3 2
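
One of those "right circumstances" is clamping values to a range, which relies on the recycling of a single bound against the whole vector. A small sketch with made-up data x, limiting values to between 0 and 10:

```r
x <- c (-5, 3, 12, 7, 40)

# pmax raises anything below 0, pmin lowers anything above 10
pmin (pmax (x, 0), 10)
## [1]  0  3 10  7 10

# plain max, by contrast, collapses everything to a single value
max (x, 0)
## [1] 40
```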

range returns the minimum and maximum of a sequence:

range (1:5)
## [1] 1 5

The mean, median, standard deviation and variance of a sequence can be calculated by mean, median, sd, var:

mean (1:5)
## [1] 3
median (1:5)
## [1] 3
sd (1:5)
## [1] 1.581139
var (1:5)
## [1] 2.5

var can also be used to compute the covariance of one sequence with another, in which case it gives the same answer as cov. cov and cor compute covariance and correlation, respectively. All three can also be used over matrices:

var (1:5, 2:6)
## [1] 2.5
cov (1:5, 2:6)
## [1] 2.5
cor (1:5, 2:6)
## [1] 1
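
To check my understanding, here is the sample covariance worked by hand on less trivial data (x and y are made up for illustration), compared against the built-ins:

```r
x <- c (1, 2, 3, 4, 5)
y <- c (2, 4, 5, 4, 5)
n <- length (x)

# sample covariance: sum of products of deviations over n - 1
sum ((x - mean (x)) * (y - mean (y))) / (n - 1)
## [1] 1.5
cov (x, y)
## [1] 1.5
var (x, y)
## [1] 1.5

# correlation is covariance scaled by both standard deviations
cov (x, y) / (sd (x) * sd (y))
## [1] 0.7745967
cor (x, y)
## [1] 0.7745967

# given a matrix, the result is a matrix over all pairs of columns
dim (cor (cbind (x, y)))
## [1] 2 2
```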

rle counts "runs", consecutive appearances of the same value:

rle (c(1, 2, 2, 3, 2, 4, 4, 5, 1, 5, 5, 5))
## Run Length Encoding
##   lengths: int [1:8] 1 2 1 1 2 1 1 3
##   values : num [1:8] 1 2 3 2 4 5 1 5
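
The result of rle is a list with lengths and values components, which makes questions like "what is the longest run?" easy to answer. A short sketch on the same data, with inverse.rle undoing the encoding:

```r
r <- rle (c (1, 2, 2, 3, 2, 4, 4, 5, 1, 5, 5, 5))

# length of the longest run, and the value that was repeated
max (r$lengths)
## [1] 3
r$values[which.max (r$lengths)]
## [1] 5

# inverse.rle reconstructs the original sequence
inverse.rle (r)
##  [1] 1 2 2 3 2 4 4 5 1 5 5 5
```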

Functions

Unsurprisingly, function is used for constructing functions and return for returning values from functions. It's worth pointing out that return is a function itself, so the returned value must appear in parentheses after it, i.e. return (x), not return x. You could just leave R to use the implicit return, i.e. the value of the last expression evaluated in the function, but it's best to be explicit:

double_x <- function (x) {
   return (2 * x)
}

missing can be used in function bodies to check for missing argument values, which might be used to check for optional arguments:

double_x <- function (x) {
   if (missing (x)) {
      return ("no argument!")
   } else {
      return (2 * x)
   }
}

I tend to think that sensible default values for arguments are a cleaner way to handle this, and actually signal that an argument is optional. NULL is the usual default value used in these cases:

double_x <- function (x=NULL) {
   if (is.null (x)) {
      return ("no argument!")
   } else {
      return (2 * x)
   }
}
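
When there is a sensible non-NULL default, it can go straight into the signature and no check is needed at all. A minimal sketch, where scale_x and its times argument are hypothetical names for illustration:

```r
# 'times' is optional and defaults to doubling
scale_x <- function (x, times = 2) {
   return (x * times)
}

scale_x (5)
## [1] 10
scale_x (5, times = 3)
## [1] 15
```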