Wednesday, July 17, 2013

Subtle Variable Scoping in R

 

A language manual usually defines how the language behaves, but it does not warn you about cases where you assume a feature is supported and it isn't. As an example, I will talk about subtle variable scoping in the R language.

{} code blocks

Many programmers coming from C/C++/Java assume that a code block inside {} introduces a new scope. However, in dynamic languages like R/Python/JavaScript/Matlab, code blocks do not introduce new scopes; only functions do (for JavaScript, this applies to variables declared with var). This difference can cause some subtle bugs.
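As a minimal sketch of this point (the names block_demo and y are made up for illustration), a variable assigned inside a bare {} block in R is still visible after the block ends:

```r
# Hypothetical demo function: the inner {} is only a grouping construct.
block_demo <- function() {
  {
    y <- 10  # assigned inside a bare {} block
  }
  y          # still visible here: the block did not create a new scope
}

block_demo()
```

The same holds for the body of a for loop or an if statement, since they use the same {} grouping.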

For example, the following R function returns a list of quadratic function objects:

make_funcs <- function(a, b, c_){
  n <- length(a)
  fs <- list()
  for (i in 1:n) {
    fs[[i]] <- function(x) {
      cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[i], b[i], c_[i], x))
      a[i]*x*x + b[i]*x + c_[i]
    }
  }
  cat(sprintf('variable i is still in the scope and has value %d\n', i))
  fs
}


The input to this function is three vectors of numbers, which hold the three coefficients of each quadratic form. Let's make three function objects using the following coefficients:



a <- c(1,2,3)
b <- c(4,5,6)
c_ <- c(-1,-1,-1)


fs <- make_funcs(a, b, c_)

fs[[1]](1)
fs[[2]](1)
fs[[3]](1)


We expect to get three different function values. However, the output shows that all three functions behave the same:



> fs[[1]](1)
eval 3.0*x^2+6.0*x+-1.0 where x = 1.0
[1] 8
> fs[[2]](1)
eval 3.0*x^2+6.0*x+-1.0 where x = 1.0
[1] 8
> fs[[3]](1)
eval 3.0*x^2+6.0*x+-1.0 where x = 1.0
[1] 8


It seems that the three fs[[i]] all use the same variable i when they are evaluated. That is, when the three functions are created, the R interpreter does not copy the value of i; it only remembers that i is a variable in the enclosing function make_funcs, and looks it up when the function is called. This explains the result: after the loop finishes, the variable i has the value 3, and it is still in the scope of make_funcs.
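A small sketch can make the shared variable visible. The names make_adders and adders are hypothetical, and the body is a simplified stand-in for make_funcs; it checks that the closures share one enclosing environment and that i in that environment holds the last loop value:

```r
# Simplified stand-in for make_funcs: each closure refers to the loop variable i.
make_adders <- function(n) {
  fs <- list()
  for (i in 1:n) {
    fs[[i]] <- function(x) x + i
  }
  fs
}

adders <- make_adders(3)

# All closures were created in the same call of make_adders,
# so they share a single enclosing environment.
same_env <- identical(environment(adders[[1]]), environment(adders[[2]]))

# The shared environment holds the final value of i.
i_seen <- get("i", envir = environment(adders[[1]]))

same_env        # TRUE
i_seen          # 3
adders[[1]](10) # 13, not 11
```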



Let's see how we would write make_funcs in F#:



// a direct translation
let make_funcs (a: int array, b: int array, c: int array) =
    let n = Array.length a
    let fs = new ResizeArray<(int -> int)>()
    for i = 0 to n - 1 do
        fs.Add(fun x -> a.[i]*x*x + b.[i]*x + c.[i])
    fs.ToArray()


// a more functional translation
let make_funcs2 (a: int array, b: int array, c: int array) =
    Array.zip3 a b c
    |> Array.map (fun (a0, b0, c0) ->
        (fun x -> a0*x*x + b0*x + c0))


The following code would make three different functions as we expect:



let a = [| 1; 2; 3 |]
let b = [| 4; 5; 6 |]
let c = [| -1; -1; -1 |]

let fs = make_funcs (a, b, c)

fs.[0](1) // 4
fs.[1](1) // 6
fs.[2](1) // 8


Why does the F# code work as expected? When the three functions are created, they also know that the variable i is to be found in the parent scope; however, each loop iteration has its own binding of i, so the three i's live in three independent scopes!



The behavior of the R version is definitely not what we want. How do we make three different functions? The answer is to create three different i's, each inside a new function:



make_funcs2 <- function(a, b, c_){
  n <- length(a)
  fs <- list()
  for (i in 1:n) {
    fs[[i]] <-
      (function() {
        j <- i
        function(x) {
          cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[j], b[j], c_[j], x))
          a[j]*x*x + b[j]*x + c_[j]
        }
      }) ()
  }
  fs
}


In each iteration of the for loop, I create a new function that defines a variable j inside it and returns the function(x). Notice that this new function is executed n times in the for loop and therefore creates n different j's.



This trick is ubiquitous in JavaScript. For example, instead of writing



{
    var a = 1;
    // code blocks
}


we define a function and execute it immediately to make local variables:



(function() {
    var a_is_hidden_from_outside = 1;
    // in other words, no new variable in the global space is introduced.
}) ()
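Back in R, the same immediately-invoked wrapper can also be spelled with the base function local(), which evaluates its argument in a fresh environment. The following is only a sketch of that alternative, not the version used in this post; the name make_funcs_local is made up, and the cat() tracing is dropped for brevity:

```r
# Sketch: use local() instead of (function() { ... }) () to get a fresh j per iteration.
make_funcs_local <- function(a, b, c_) {
  fs <- list()
  for (i in seq_along(a)) {
    fs[[i]] <- local({
      j <- i  # j lives in the new environment created by local()
      function(x) a[j]*x*x + b[j]*x + c_[j]
    })
  }
  fs
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)
c_ <- c(-1, -1, -1)

fs_local <- make_funcs_local(a, b, c_)
fs_local[[1]](1)  # 4
fs_local[[2]](1)  # 6
fs_local[[3]](1)  # 8
```

Each closure keeps its own j, so the three functions evaluate three different quadratics.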


The assignment operators <- and <<-



It seems that the block syntax {} can be translated as (function() {}) () in R/JavaScript. But in R, things can be more subtle. See the third version of make_funcs:



make_funcs3 <- function(a, b, c_){
  n <- length(a)
  fs <- list()
  for (i in 1:n)
    (function() {
      j <- i
      fs[[i]] <-
        function(x) {
          cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[j], b[j], c_[j], x))
          a[j]*x*x + b[j]*x + c_[j]
        }
      cat(sprintf('ls length = %d\n', length(fs)))
    }) ()

  fs
}

a <- c(1,2,3)
b <- c(4,5,6)
c_ <- c(-1,-1,-1)

fs <- make_funcs3(a, b, c_)

fs[[1]](1)


We translate the {} after the for loop as (function () {}) (), so the assignment to fs[[i]] is now inside the function wrapper. However, the code does not run correctly:



> fs <- make_funcs3(a, b, c_)
ls length = 1
ls length = 2
ls length = 3

>
> fs[[1]](1)
Error in fs[[1]] : subscript out of bounds
> length(fs)
[1] 0


The variable fs appears to grow as the for loop executes; however, the fs inside the wrapper function is a different variable from the fs outside it. In fact, each call of the wrapper makes its own local copy of fs, and assigning to element i pads that copy to length i, which is why the printed lengths are 1, 2, 3. The fs outside the loop is only initialized and never gets any new elements.



The assignment operator <- creates a new variable inside a function! It does not check whether a variable with the same name exists in the parent environments! To modify the variable in the parent environment, which is what we expected R to do, we have to use the <<- operator:



make_funcs3 <- function(a, b, c_){
  n <- length(a)
  fs <- list()
  for (i in 1:n)
    (function() {
      j <- i
      fs[[i]] <<-
        function(x) {
          cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[j], b[j], c_[j], x))
          a[j]*x*x + b[j]*x + c_[j]
        }
      cat(sprintf('ls length = %d\n', length(fs)))
    }) ()
  fs
}


Now this function runs as we expect.
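The difference between the two operators can be isolated in a smaller sketch. The names counter_demo, bump_local and bump_outer are hypothetical and only serve the illustration:

```r
# Sketch: <- assigns in the current function, <<- assigns in an enclosing environment.
counter_demo <- function() {
  count <- 0

  bump_local <- function() {
    count <- count + 1   # reads the outer count, then creates a new local count
    count
  }
  bump_outer <- function() {
    count <<- count + 1  # finds count in the enclosing function and modifies it
    count
  }

  bump_local()
  after_local <- count   # outer count is unchanged
  bump_outer()
  after_outer <- count   # outer count was incremented

  c(after_local = after_local, after_outer = after_outer)
}

res <- counter_demo()
res
# after_local after_outer
#           0           1
```

One caveat worth keeping in mind: if <<- finds no variable of that name in any enclosing environment, it assigns in the global environment instead.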



Summary



When using R for interactive data analysis and plotting, most of the time we don't deal with these subtle language features. We just copy and paste a code snippet from the R help or from online sources and modify it to suit our own data analysis. However, once we get into R programming, these issues do occur, and they will bite us if we assume that our experience in C++/Java carries over to R.
