## Monday, July 14, 2014

### R Notes: Functions

R's semantics are modeled after Scheme. Scheme is a functional language, and R is functional too. This note is about functions in R; many of R's seemingly strange usages are just syntactic sugar for special function calls.

# What is `rownames(x) <- c('A','B','C')`?

```r
y <- c(1, 2, 3, 4, 5, 6)
x <- matrix(y, nrow = 3, ncol = 2)
rownames(x) <- c("A", "B", "C")
x
```

```
##   [,1] [,2]
## A    1    4
## B    2    5
## C    3    6
```

How can we assign `c('A','B','C')` to the return value of `rownames(x)` and yet have the assignment take effect on `x`? Because the statement is syntactic sugar for the following function call:

```r
x <- `rownames<-`(x, c("A1", "B1", "C1"))
```

First, `rownames<-` is indeed a function! Because its name contains special characters, we need to put backticks or quotes around the name to call it. Second, `"rownames<-"(x, c('A1','B1','C1'))` is a pure function call: it returns a modified copy of `x` and leaves the row names of `x` itself unchanged. For the assignment to take effect, we must assign the return value back to `x` explicitly.
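A quick way to check this claim is to call the replacement function directly and store the result somewhere else. This is a minimal sketch; the names `m` and `m2` are chosen here only for illustration.

```r
m <- matrix(1:6, nrow = 3, ncol = 2)

# Call the replacement function directly and keep the result in m2.
m2 <- `rownames<-`(m, c("A1", "B1", "C1"))

rownames(m)   # m itself is untouched, so it still has no row names
rownames(m2)  # the returned copy carries the new row names
```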

The technical term for a function named `functionname<-` is a replacement function. The general rule:

`f(x) <- y` is transformed into `x <- "f<-"(x, y)`.

Besides `rownames`, there are many other functions that have a twin replacement function, for example, `diag` and `length`.
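We can also write our own replacement functions. The sketch below defines a hypothetical `second<-` (not part of R) that sets the second element of a vector; the only requirement I rely on is that the last argument is named `value`.

```r
# Hypothetical replacement function: sets the second element of a vector.
# The last argument of a replacement function must be named `value`.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

v <- c(10, 20, 30)
second(v) <- 99   # sugar for: v <- `second<-`(v, value = 99)
v
```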

A special case is the index operator `"["`. Its replacement function is `"[<-"`:

```r
y <- c(1, 2, 3)
"["(y, 1)
```

```
## [1] 1
```

```r
`[<-`(y, 2, 100)
```

```
## [1]   1 100   3
```

More examples can be found in the R Language Definition, section 3.4.4.

# Operators are functions

Operators in R are indeed function calls.

```r
"+"(c(1,2), 2)  # same as c(1,2) + 2
```

```
## [1] 3 4
```

Because operators are functions, we can define new operators using function syntax:

```r
# strict vector add
"%++%" <- function(x, y) {
    n <- length(x)
    m <- length(y)
    if (n != m) {
        stop("length does not match")
    } else {
        x + y
    }
}
```
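Here is a short usage sketch of the operator. The definition is repeated in compact form so the snippet runs on its own; `ok` and `msg` are just illustrative names.

```r
# Same strict vector add as above, repeated so the snippet is self-contained.
`%++%` <- function(x, y) {
  if (length(x) != length(y)) stop("length does not match")
  x + y
}

ok <- c(1, 2) %++% c(3, 4)   # lengths match, so this is ordinary addition

# Mismatched lengths raise an error instead of silently recycling.
msg <- tryCatch(c(1, 2) %++% 1, error = function(e) conditionMessage(e))
```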

Self-defined operators become more powerful when used with S3 objects (see below).

# S3 functions

There are two main object systems in R, S3 and S4. S3 objects are usually lists, so their fields are accessed with the same operator used for list fields, `$`. S4 objects are intended to be safer than S3 objects; their fields are accessed with `@`. Many packages, especially those with a long history, such as `lars`, `rpart` and `randomForest`, use S3 objects. S3 objects are easier to use and understand.

Knowing how to implement an S3 object is very useful when we need to check the source code of a specific R package. And when we want to release a small piece of code for others to use, S3 objects provide a simple interface.

The easiest way to understand S3 objects is the following analogy:

| R | ANSI C |
| --- | --- |
| S3 object / R list object | C `struct` |
| S3 functions | functions on a struct |

```c
#include <stdio.h>
#include <stdlib.h>

struct MyVec {
  int *A;
  int n;
};

int safe_get(struct MyVec vec, int i) {
  if (i < 0 || i >= vec.n) {
    fprintf(stderr, "index error");
    exit(1);
  }
  return vec.A[i];
}
```

In R, the S3 object is implemented as:

```r
vec <- list()
vec$A <- c(1, 2, 3)
vec$n <- length(vec$A)
class(vec) <- "MyVec"
```

In the S3 object system, method names cannot be chosen as freely as function names in C. They must follow the pattern “functionname.classname”. Here my class name is `MyVec`, so all the method names must end with `.MyVec`.

```r
"[.MyVec" <- function(vec, i) {
    if (i <= 0 || i > vec$n) {
        stop("index error")
    }
    vec$A[i]
}
```

Let's implement the replacement function too:

```r
"[<-.MyVec" <- function(vec, i, value) {
    if (i <= 0 || i > vec$n)
        stop("index error")
    vec$A[i] <- value
    vec
}
```

Let's play with MyVec objects:

```r
vec[3]
```

```
## [1] 3
```

```r
vec[2] <- 100
vec[30]
```

```
## Error: index error
```

We can also add other member functions for `MyVec` such as `summary.MyVec`, `print.MyVec`, `plot.MyVec`, etc. To use these functions, we don't have to spell out the full function names; we can just call `summary`, `print`, and `plot`. R inspects the S3 class of the object (in our case, `MyVec`) and finds the corresponding function automatically.
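The dispatch described above can be sketched as follows. The constructor `new_myvec` and the exact output format of the two methods are my own choices for this illustration, not something the class requires.

```r
# Hypothetical constructor for the illustrative MyVec class.
new_myvec <- function(values) {
  structure(list(A = values, n = length(values)), class = "MyVec")
}

# Methods follow the functionname.classname pattern.
print.MyVec <- function(x, ...) {
  cat("MyVec of length", x$n, ":", x$A, "\n")
  invisible(x)
}

summary.MyVec <- function(object, ...) {
  c(n = object$n, mean = mean(object$A))
}

v <- new_myvec(c(1, 2, 3))
print(v)          # dispatches to print.MyVec
s <- summary(v)   # dispatches to summary.MyVec
```

Note that both methods accept `...` so that their signatures stay compatible with the generics `print` and `summary`.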

# Default parameters and `...`

Many functions in R have a long list of parameters, for example the `plot` function. It would become tedious, even impractical, for the end user to assign a value to every parameter. So, to keep the interface clean, R supports default parameters. A simple example:

```r
add2 <- function(a, b = 10) {
    a + b
}
add2(5)
```

```
## [1] 15
```

What I want to emphasize in this section is `...`, which stands for a variable number of arguments. It is used throughout R to build good function interfaces. If you read the documentation of R's generic functions such as `summary` and `plot`, most of their interfaces include `...`.

Consider the following case: I am implementing a statistical function `garchFit`, say for GARCH model calibration, and I use the optimizer `optim`, which has a lot of parameters. Now I need to think about the API of my GARCH calibration function because I want others to use it as well. Shall I expose the parameters of `optim` among `garchFit`'s parameters? Yes, since I want to give the users of my function some freedom in the optimization. But a single method in `optim`, such as `L-BFGS-B`, has many parameters of its own. On the one hand, I want to give the user the option to specify these parameters; on the other hand, if I expose all of them in `garchFit`, the parameter list grows too long. `...` comes to the rescue! See the following example:

```r
f1 <- function(a = 1, ...) {
    a * f2(...)
}
f2 <- function(b = 5, ...) {
    b * f3(...)
}
f3 <- function(c = 10) {
    c
}
f1()
```

```
## [1] 50
```

```r
f1(a = 1, b = 2)
```

```
## [1] 20
```

```r
f1(c = 3)
```

```
## [1] 15
```

A simple user of `f1` would only need to study its exposed parameter `a`, while advanced users have options to specify parameters in `f2` and `f3` when calling `f1`.
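To connect this back to the `optim` scenario, here is a minimal sketch of forwarding `...` to `optim`. It is not a GARCH fit: `fit_center` and `loss` are hypothetical names, and the toy objective is just a least-squares location estimate. `method` and `control` are real arguments of `optim`.

```r
# Hypothetical wrapper: minimizes a sum of squares and forwards any extra
# arguments (method, control, ...) straight to optim().
fit_center <- function(data, start = 0, ...) {
  loss <- function(mu) sum((data - mu)^2)
  optim(par = start, fn = loss, ...)
}

# The caller tunes the optimizer without fit_center listing those options.
res <- fit_center(c(1, 2, 3, 6), method = "BFGS", control = list(maxit = 200))
res$par   # should be close to the sample mean
```

The wrapper's own signature stays short, while everything `optim` understands remains reachable.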

# Global variables

Global variables are readily accessible from any R function. However, to change the value of a global variable, we need the special assignment operator `<<-`. Python has similar rules. See the following example:

```r
a <- 100
foo <- function() {
    b <- a  # get a's value
    a <- 10  # does not change the global a; creates a local variable a and assigns 10
    c(a, b)
}
foo()
```

```
## [1]  10 100
```

```r
a
```

```
## [1] 100
```

```r
boo <- function() {
    a <<- 10
}
boo()
a
```

```
## [1] 10
```

Here our scopes have two layers: “global” and top-level functions (`foo` and `boo`). When there are more layers, i.e., nested functions, the `<<-` operator assigns to the variable with the same name in the closest enclosing layer; if no such variable exists, it creates one at the global level. It is generally bad practice to reuse the same variable name across different function layers (except for variables like `i`, `j`, `x`). `assign` is more general; see `?assign` for its documentation.
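A common use of `<<-` with nested functions is a counter closure. The sketch below uses the illustrative names `make_counter` and `count`; the inner function updates `count` in the enclosing function's layer, not in the global one.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in make_counter's environment
    count
  }
}

counter <- make_counter()
first <- counter()
second <- counter()
c(first, second)
```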

# Variable scope

I think this is the trickiest part of R programming for C programmers, because the block syntax `{...}` does not introduce a new scope in R, while C/C++/Java/C#/etc. all introduce a new scope for each block. In R, only a function introduces a new scope.
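A small sketch of the difference, with variable names made up for the illustration:

```r
if (TRUE) {
  inside_block <- 1   # a block does not create a new scope
}
inside_block          # still visible after the block

f <- function() {
  inside_function <- 2   # local to this call of f
  inside_function
}
f()
exists("inside_function")   # the function's local did not leak out
```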

Please see my previous post: Subtle Variable Scoping in R

# `quote`, `substitute`, `eval`, etc.

Many language-level features of R, such as the debugging function `trace`, are implemented in R itself rather than by hacking the interpreter, because R supports meta-programming. I will write another post about these special functions.
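As a small taste, and only as a sketch (the helper `show_arg` is a made-up name), these functions let code be captured and evaluated as data:

```r
e <- quote(1 + 2)   # capture the expression without evaluating it
class(e)            # a "call" object
eval(e)             # evaluate it now

# substitute() gives a function access to its unevaluated argument.
show_arg <- function(x) deparse(substitute(x))
label <- show_arg(a + b)   # a and b do not need to exist
label
```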

## Saturday, July 12, 2014

### R Notes: vectors

R is different from the C family of languages. It has a C-like syntax but Lisp-like semantics. Programmers from the C/C++/Java world find many usages in R ad hoc and feel they need to memorize special cases. This is because they look at R from a C perspective. R is a very elegant language if we unlearn some C concepts and learn R's rules. I am writing several R notes to explain some important rules of the R language. This is the first note.

## The atomicity of R vectors

The atomic data structure in R is the vector. This is very different from any C family language. In C/C++, built-in types such as `int` and `char` are the atomic data structures, while a C array (a contiguous block of memory) is obviously not the simplest type. In R, the vector is indeed the most basic data structure. There is no scalar data structure in R: you cannot have a scalar `int` in R the way `int x = 10` gives you one in C.

The atomicity of R vectors is stated in many documents. It is usually overlooked by R learners because many R users come from C, in which an array is a composite data structure. Many seemingly special cases in the R language all come from the atomicity of R vectors, and I will try to cover them coherently.

### Display

```r
x <- 10  # equivalent to x <- c(10)
x  # or equivalent to print(x)
```

```
## [1] 10
```

```r
y <- c(10, 20)
y
```

```
## [1] 10 20
```

What does `[1]` mean in the output? It means that the output is a vector and that the values printed on that line start at index 1. `x` is a vector of length 1, so it prints as `[1] 10`, while `y` is a vector of length 2, so it prints as `[1] 10 20`. For a longer vector, the output contains more indices to assist human reading:

```r
z <- 1:25
print(z)
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25
```

### Vectors with different types

Though vectors in R are atomic, there are different types of vectors: integer, double (floating-point), complex, character and logical. Integer and double vectors are both numeric vectors. Above, we have seen numeric vectors. Let's see more types of vectors below:

```r
x <- c(1, 2.1)
mode(x)
```

```
## [1] "numeric"
```

```r
y <- c("a", "bb", "123")
mode(y)
```

```
## [1] "character"
```

```r
z <- complex(real = 1, imaginary = 1)
mode(z)
```

```
## [1] "complex"
```
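`mode` groups integer and double vectors together as "numeric". As a small sketch, `typeof` shows the underlying storage type, and `length` confirms that a "scalar" is just a vector of length 1:

```r
typeof(1:3)              # "integer"
typeof(c(1, 2.1))        # "double"
typeof(10)               # "double": a plain number literal is not an integer
typeof(c(TRUE, FALSE))   # "logical"
mode(1:3)                # "numeric"
length(10)               # 1
is.vector(10)            # TRUE
```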

Notice that in R, a string (in R's terms: the character type) is a basic type like the integer, double and logical types. It is not a vector of chars. R does not differentiate between a single character and a sequence of characters. R has a set of special functions such as `paste` and `strsplit` for string processing; however, R's character type is not a composite type, and it is not a `vector` of chars either!
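A short sketch of what this means in practice (the variable names are illustrative): the length of a string is 1, and its number of characters has to be asked for separately.

```r
s <- "hello"
length(s)   # 1: one element in the character vector
nchar(s)    # 5: number of characters in that element

# Splitting into single characters yields a character vector of length 5.
chars <- strsplit(s, "")[[1]]
joined <- paste(chars, collapse = "")
```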

### matrix and array

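A matrix is a vector with additional attributes, and this makes matrix an R class. Its core data structure is still a vector. See the example below (the `class(x)` output shown is from an R version before 4.0.0; newer versions print `"matrix" "array"`):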

```r
y <- c(1, 2, 3, 4, 5, 6)
x <- matrix(y, nrow = 3, ncol = 2)
class(x)
```

```
## [1] "matrix"
```

```r
rownames(x) <- c("A", "B", "C")
colnames(x) <- c("V1", "V2")
attributes(x)
```

```
## $dim
## [1] 3 2
## 
## $dimnames
## $dimnames[[1]]
## [1] "A" "B" "C"
## 
## $dimnames[[2]]
## [1] "V1" "V2"
```

```r
x
```

```
##   V1 V2
## A  1  4
## B  2  5
## C  3  6
```

```r
as.vector(x)
```

```
## [1] 1 2 3 4 5 6
```

In R, arrays are less frequently used. A 2D array is indeed a matrix. To find out more, see `?array`. We can say that an array/matrix is a vector (augmented with `dim` and other attributes), but we cannot say that a vector is an array. In OOP terminology, array/matrix is a subtype of vector.
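One way to see this is to turn a plain vector into a matrix just by giving it a `dim` attribute. A minimal sketch:

```r
v <- 1:6
is.matrix(v)        # FALSE: a plain vector

dim(v) <- c(2, 3)   # attach the dim attribute
is.matrix(v)        # TRUE: same data, now seen as a 2 x 3 matrix

v[4]                # vector-style indexing still works (column by column)
v[2, 2]             # the same element through matrix-style indexing
```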

### operators

Because the fundamental data structure in R is vector, all the basic operators are defined on vectors. For example, `+` is indeed vector addition while adding two vectors with length 1 is just a special case.

When the two vectors do not have the same length, the shorter one is repeated (recycled) to the length of the longer one. For example:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1)
x + y  # y is repeated to (1,1,1,1,1)
```

```
## [1] 2 3 4 5 6
```

```r
z <- c(1, 2)
x + z  # z is repeated to (1,2,1,2,1), a warning is triggered
```

```
## Warning: longer object length is not a multiple of shorter object length
```

```
## [1] 2 4 4 6 6
```

`+`, `-`, `*`, `/`, etc. are vector operators. When they are used on matrices, their semantics are the same as for vectors: a matrix is treated as a long vector concatenated column by column. So do not expect all of them to work as matrix operators! For example:

```r
x <- c(1, 2)
y <- matrix(1:6, nrow = 2)
x * y
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    4    8   12
```

For matrix multiplication, we shall use the dedicated operator:

```r
x %*% y  # 1 x 2 * 2 x 3 = 1 x 3
```

```
##      [,1] [,2] [,3]
## [1,]    5   11   17
```

```r
y %*% x  # dimensions do not match: y is 2 x 3, so the right operand needs 3 rows, but x has length 2
```

```
## Error: non-conformable arguments
```
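For comparison, here is a sketch of products that do conform. `%*%` treats a plain vector as a row or a column, whichever makes the dimensions fit; the names `a`, `b` and `d` are only for illustration.

```r
x <- c(1, 2)
y <- matrix(1:6, nrow = 2)   # 2 x 3

a <- x %*% y            # x is treated as 1 x 2, result is 1 x 3
b <- y %*% c(1, 2, 3)   # the length-3 vector is treated as 3 x 1, result is 2 x 1
d <- t(y) %*% x         # 3 x 2 times 2 x 1, result is 3 x 1

dim(a)
dim(b)
dim(d)
```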

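The single-character operators all operate on vectors and generate a vector of the same length. So `&`, `|`, etc. are element-wise logical operators, while `&&` and `||` are special operators that produce a logical vector of length 1 (usually used in `if` clauses). Older versions of R silently used only the first element of each operand; since R 4.3.0, passing an operand of length greater than one to `&&` or `||` is an error, so the example below indexes the first elements explicitly.

```r
x <- c(T, T, F)
y <- c(T, F, F)
x & y
```

```
## [1]  TRUE FALSE FALSE
```

```r
x[1] && y[1]
```

```
## [1] TRUE
```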
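When a single `TRUE`/`FALSE` is needed from whole vectors, `all` and `any` reduce an element-wise result to length 1. The sketch below also shows the short-circuit behavior of `&&`; the variable names are illustrative.

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)

all_true <- all(x & y)   # are all pairs TRUE?
any_true <- any(x & y)   # is at least one pair TRUE?

# && short-circuits: the right-hand side is not evaluated when the left is FALSE,
# so the out-of-range v[1] is never touched here.
v <- numeric(0)
safe <- length(v) > 0 && v[1] > 0
```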

### math functions

All R math functions take vector inputs and generate vector outputs. For example:

```r
exp(1)
```

```
## [1] 2.718
```

```r
exp(c(1))
```

```
## [1] 2.718
```

```r
exp(c(1, 2))
```

```
## [1] 2.718 7.389
```

```r
sum(matrix(1:6, nrow = 2))  # matrix is a vector, for row/col sums, use rowSums/colSums
```

```
## [1] 21
```

```r
cumsum(c(1, 2, 3))
```

```
## [1] 1 3 6
```

```r
which.min(c(3, 1, 2))
```

```
## [1] 2
```

```r
sqrt(c(3, 2))
```

```
## [1] 1.732 1.414
```
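The `rowSums`/`colSums` functions mentioned in the comment above can be sketched on the same matrix:

```r
m <- matrix(1:6, nrow = 2)

sum(m)       # treats m as one long vector
rowSums(m)   # one sum per row
colSums(m)   # one sum per column
```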

### `NA` and `NULL`

`NA` is a valid value that stands for a missing element. `NULL` means empty.

```r
print(NA)
```

```
## [1] NA
```

```r
print(NULL)
```

```
## NULL
```

```r
c(NA, 1)
```

```
## [1] NA  1
```

```r
c(NULL, 1)
```

```
## [1] 1
```
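A few more checks make the difference concrete. This is only a small sketch using base functions:

```r
length(NA)     # 1: NA occupies a slot in a vector
length(NULL)   # 0: NULL has no elements

is.na(c(NA, 1))   # element-wise test for missing values
is.null(NULL)     # test for the empty object

sum(c(NA, 1))                 # NA propagates through arithmetic
sum(c(NA, 1), na.rm = TRUE)   # drop missing values first
```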

*I find that knitr, integrated with the RStudio IDE, is very helpful for writing tutorials.