## Wednesday, July 17, 2013

### Subtle Variable Scoping in R

A language manual usually defines how the language behaves, but it does not warn you about cases where you assume a feature is supported when it isn't. As an example, I will talk about the subtle variable scoping rules of the R language.

# {} code blocks

Many programmers coming from C/C++/Java will assume that code blocks inside {} also introduce a new scope. However, in dynamic languages like R/Python/JavaScript/Matlab, code blocks do not introduce new scopes; only functions do. This difference may cause some subtle bugs.
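Python behaves the same way as R here, so a quick Python sketch can illustrate the point before we move to R:

```python
def block_scope_demo():
    for i in range(3):
        inside = i * i
    # In C/C++/Java both names would be out of scope here;
    # in Python (as in R) they survive the block, because only
    # the function itself introduced a scope.
    return i, inside

print(block_scope_demo())  # (2, 4)
```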

For example, the following R function returns a list of quadratic function objects:

```r
make_funcs <- function(a, b, c_) {
  n <- length(a)
  fs <- list()
  for (i in 1:n) {
    fs[[i]] <- function(x) {
      cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[i], b[i], c_[i], x))
      a[i]*x*x + b[i]*x + c_[i]
    }
  }
  cat(sprintf('variable i is still in the scope and has value %d\n', i))
  fs
}
```

The input to this function is three vectors of numbers, which represent the three coefficients of the quadratic forms. Let's make three function objects using the following coefficients:

```r
a <- c(1,2,3)
b <- c(4,5,6)
c_ <- c(-1,-1,-1)
```

```r
fs <- make_funcs(a, b, c_)
fs[[1]](1)
fs[[2]](1)
fs[[3]](1)
```

We are supposed to get three different function values. However, checking the output shows that the three functions are all the same:

```
> fs[[1]](1)
eval 3.0*x^2+6.0*x+-1.0 where x = 1.0
[1] 8
> fs[[2]](1)
eval 3.0*x^2+6.0*x+-1.0 where x = 1.0
[1] 8
> fs[[3]](1)
eval 3.0*x^2+6.0*x+-1.0 where x = 1.0
[1] 8
```

It seems that the three fs[[i]] all use the same variable i when they are evaluated. That is, when the three functions are created, the R interpreter just remembers i as a variable in the parent function. The result can then be explained: after the loop finishes, the variable i has the value 3, and it is still in the scope of make_funcs.
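The same late-binding behavior shows up in Python, which makes for a compact side-by-side check:

```python
def make_funcs(a, b, c):
    # Same bug as the R version above: each closure captures the
    # variable i itself, not its value at creation time.
    fs = []
    for i in range(len(a)):
        fs.append(lambda x: a[i] * x * x + b[i] * x + c[i])
    return fs

fs = make_funcs([1, 2, 3], [4, 5, 6], [-1, -1, -1])
# After the loop i == 2, so every closure evaluates 3*x^2 + 6*x - 1:
print([f(1) for f in fs])  # [8, 8, 8]
```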

Let's see how we would write make_funcs in F#:

```fsharp
// a direct translation
let make_funcs (a: int array, b: int array, c: int array) =
    let n = Array.length a
    let fs = new ResizeArray<(int -> int)>()
    for i = 0 to n-1 do
        fs.Add(fun x -> a.[i]*x*x + b.[i]*x + c.[i])
    fs.ToArray()

// a more functional translation
let make_funcs2 (a: int array, b: int array, c: int array) =
    Array.zip3 a b c
    |> Array.map (fun (a0, b0, c0) ->
        (fun x -> a0*x*x + b0*x + c0))
```

The following code would make three different functions as we expect:

```fsharp
let a = [| 1; 2; 3 |]
let b = [| 4; 5; 6 |]
let c = [| -1; -1; -1 |]
let fs = make_funcs (a, b, c)
fs.[0] 1 // 4
fs.[1] 1 // 6
fs.[2] 1 // 8
```

Why does the F# code work as expected? When the three functions are created, they also know that the variable i is to be found in the parent scope; however, in F# each loop iteration gives i its own scope, so the three i's live in three independent scopes!

The behavior of the R version is definitely not what we want, so how do we make three different functions? The answer is to make three different i's, each inside a new function:

```r
make_funcs2 <- function(a, b, c_) {
  n <- length(a)
  fs <- list()
  for (i in 1:n) {
    fs[[i]] <-
      (function() {
        j <- i
        function(x) {
          cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[j], b[j], c_[j], x))
          a[j]*x*x + b[j]*x + c_[j]
        }
      })()
  }
  fs
}
```

Each time through the for loop, I create a new function that defines a variable j inside itself and returns the inner function(x). Notice that this wrapper function is executed n times in the loop, and therefore creates n different j's.
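Python offers the same wrapper-function fix, but the idiomatic shortcut is to snapshot the loop variable as a default argument; a small sketch:

```python
def make_funcs2(a, b, c):
    fs = []
    for i in range(len(a)):
        # Binding i as a default argument snapshots its current value,
        # playing the same role as j in the R wrapper function.
        fs.append(lambda x, j=i: a[j] * x * x + b[j] * x + c[j])
    return fs

fs = make_funcs2([1, 2, 3], [4, 5, 6], [-1, -1, -1])
print([f(1) for f in fs])  # [4, 6, 8]
```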

This trick is used ubiquitously in JavaScript. For example, instead of writing

```javascript
{
  var a = 1;
  // code block
}
```

we define a function and execute it immediately to make local variables:

```javascript
(function() {
  var a_is_hidden_from_outside = 1;
  // in other words, no new variable is introduced in the global scope
})()
```

# The assignment operators <- and <<-

It seems that the block syntax {} can always be translated as (function() {})() in R/JavaScript. But in R, things can be more subtle. See the third version of make_funcs:

```r
make_funcs3 <- function(a, b, c_) {
  n <- length(a)
  fs <- list()
  for (i in 1:n)
    (function() {
      j <- i
      fs[[i]] <-
        function(x) {
          cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[j], b[j], c_[j], x))
          a[j]*x*x + b[j]*x + c_[j]
        }
      cat(sprintf('ls length = %d\n', length(fs)))
    })()
  fs
}

a <- c(1,2,3)
b <- c(4,5,6)
c_ <- c(-1,-1,-1)
fs <- make_funcs3(a, b, c_)
fs[[1]](1)
```

We translate the {} after the for loop as (function() {})(), and now the assignment to fs[[i]] is inside the function wrapper. However, the code does not run correctly:

```
> fs <- make_funcs3(a, b, c_)
ls length = 1
ls length = 2
ls length = 3
> fs[[1]](1)
Error in fs[[1]] : subscript out of bounds
> length(fs)
[1] 0
```

Apparently a variable fs grows as the for loop executes; however, the fs inside the for loop is a different variable from the one outside it. The fs outside the loop is only initialized and never receives any new elements.

The assignment operator <- creates a new variable inside the function! It does not search for an existing variable of the same name in the parent environments. To get the behavior we expected, we have to use the <<- operator, which assigns to the variable in the enclosing environment:

```r
make_funcs3 <- function(a, b, c_) {
  n <- length(a)
  fs <- list()
  for (i in 1:n)
    (function() {
      j <- i
      fs[[i]] <<-
        function(x) {
          cat(sprintf('eval %.1f*x^2+%.1f*x+%.1f where x = %.1f\n', a[j], b[j], c_[j], x))
          a[j]*x*x + b[j]*x + c_[j]
        }
      cat(sprintf('ls length = %d\n', length(fs)))
    })()
  fs
}
```

And now this function should run as we expected.
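Python 2 has exactly the same pitfall: plain assignment inside a nested function creates a new local variable rather than updating the enclosing one. A minimal sketch:

```python
def outer():
    fs = []
    def inner():
        # This rebinds a brand-new local fs, just like R's <- inside
        # the wrapper function; the outer fs is untouched.
        fs = ['shadowed']
        return fs
    inner()
    return fs

print(outer())  # [] -- the outer fs was never modified
```

In Python 3, the `nonlocal` keyword plays a role similar to R's `<<-`; in Python 2 the usual workaround is to mutate the outer object in place (e.g. `fs.append(...)`) instead of rebinding the name.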

# Summary

When using R for interactive data analysis and plotting, most of the time we do not deal with these subtle language features. We just copy/paste code snippets from the R help and from the web and modify them to suit our own data analysis. However, when we get into serious R programming, these issues do occur, and they will bite us when we assume our experience with C++/Java carries over to R.

## Tuesday, July 9, 2013

### Unicode Tips in Python 2 and R

Most of the time, I don't need to deal with different encodings at all. When possible, I use ASCII characters. And when there is a little processing of Chinese characters or other Unicode characters, I use .NET languages or JVM languages, in which every string is Unicode, and of course the characters are displayed as characters (not as unreadable escaped strings or Unicode code points).

However, recently I have been working on projects with Chinese Weibo data, and I encountered some Unicode problems when using Python and R. (I use Python for data crawling and processing, and R for modeling and visualization.)

This post is a note I am starting today; I will update it as I encounter new Unicode problems…

# Python 2.7

Lesson 1: Normal strings and Unicode strings are two types. Do not mix them.

When writing Unicode string literals, we put the prefix u before the string:

```python
>>> type(u'')
<type 'unicode'>
>>> type('')
<type 'str'>
```

Notice that the types of the two strings are different: one is 'unicode' and the other is 'str'. Python is a dynamic language and sometimes does smart conversions between the two types. But we should use Python as if it were a strictly statically typed language: never mix the two types, and when needed, do conversions explicitly. See the following example:

```python
>>> u'abcd'.find('b')
1
>>> u'你好'.find('好')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    u'你好'.find('好')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 0: ordinal not in range(128)
```

The first find operation works fine because it does not involve any non-ASCII characters. The second fails because the two types don't work with each other.

The error in this example seems simple to avoid. However, when you load a set of text files from different sources, you may end up mixing Unicode strings and normal strings.
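The safe pattern is to decode byte strings at the boundary and operate only on the unicode type afterwards. A minimal sketch (assuming the bytes are UTF-8 encoded):

```python
# -*- coding: utf-8 -*-
# Keep byte strings and unicode strings apart, converting explicitly
# at the boundary instead of relying on implicit ASCII coercion.
raw = u'你好'.encode('utf-8')   # a byte string (type 'str' in Python 2)
text = raw.decode('utf-8')      # back to the unicode type
print(text.find(u'好'))  # 1
```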

Lesson 2: When passing a Unicode string to a function not written by you, check carefully whether that function supports Unicode.

It seems that if we keep all strings as Unicode strings in Python 2.7, we won't have any problems. We can add the u prefix to all non-ASCII string literals in the file, and we can use the codecs package to read Unicode files:

```python
with codecs.open(unicode_text_file, 'r', encoding='utf-8') as f:
    # read the file; all strings you get are of unicode type
    content = f.read()
```

Everything seems fine until we call functions/packages that are written only for normal strings and that we wish would work for Unicode strings. For example, the csv package in the official Python 2.7 release does not work with Unicode files. You cannot do things like:

```python
with codecs.open(file_profile, 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    head = reader.next()
```

The third line throws a UnicodeDecodeError just as above. So this is the pain point in Python 2.7: you cannot expect all the common libraries that work nicely with normal strings to also work nicely with Unicode strings. A script that used to work on ASCII files can suddenly fail on a Unicode file.

For library writers, it is a pain too. Sometimes they need to write a special version for Unicode. I once worked with a human name parser in Python. European names can have accents on letters, but that library only accepts ASCII strings. To use the name parser, I had to:

1. Convert the name string into a normal Python string by removing all accents

2. Parse it using the library

3. Convert the result back to a Unicode string using the .decode method
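The accent-removal step (the name parser itself is not shown here) can be sketched with the standard unicodedata module:

```python
import unicodedata

def strip_accents(s):
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks (Unicode category 'Mn').
    decomposed = unicodedata.normalize('NFD', s)
    return u''.join(ch for ch in decomposed
                    if unicodedata.category(ch) != 'Mn')

print(strip_accents(u'José Müller'))  # Jose Muller
```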

# R

Lesson 1: In R, there is only one type of string: character. But this type can store whatever can be represented as a byte array. Think of R's character type as C's char[] and nothing more.

In Python, we have two types for two kinds of strings. In R, we have only one type for all kinds of strings.

My first problem was displaying Unicode characters in R. R's default GUI and command-line console cannot even display Chinese on a non-Chinese Windows. (On a Chinese Windows, you can only get a Chinese version of R. Chinese characters are fine; however, all info/error messages are also in Chinese, translated from English, and the translations are just weird Chinese.)

When R is installed, it checks the system encoding. On Linux and Mac OS, the system encoding is usually UTF-8, and R uses that. Windows is different, however; the following is my R session info:

```
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252 LC_NUMERIC=C
LC_TIME=English_United States.1252
```

R's default GUI then uses this encoding to display all strings. Of course, it fails on most Unicode characters.

Luckily, the editor and the console in RStudio can display Chinese characters. RStudio must be doing something smart:

```r
> as.character(china_mapdata$NAME)
"黑龙江省"
> "黑龙江省"
"黑龙江省"
> as.character(china_mapdata$NAME) == "黑龙江省"
FALSE
```

The string in the data frame displays the same as the literal (the literal is represented as Unicode in the RStudio console), yet they are not equal in many ways. This is because two strings that display the same can differ in their internal representations:

```r
> nchar("黑龙江省")
4
> nchar(as.character(china_mapdata$NAME))
8
```

Their types are both character. However, the representations are different: the literal is represented as 4 Unicode characters (4 chars * 3 bytes/char = 12 bytes in UTF-8) in memory, while the string read from the data file is represented as 8 bytes (4 chars * 2 bytes/char = 8 bytes) in memory:

```r
> charToRaw("黑龙江省")
e9 bb 91 e9 be 99 e6 b1 9f e7 9c 81
> charToRaw(as.character(china_mapdata$NAME))
ba da c1 fa bd ad ca a1
```

I find that the Chinese characters in the file from which I load the data frame use the GB2312 encoding. And because it is a binary file, I don't have a simple way to change its encoding. But here is a method that works:

```r
# First write the data frame to disk
write.csv(china_mapdata, file = 'china_map.csv')
# In EmEditor, open it as GB2312, and Save As UTF-8
# Load the UTF-8 file
china_mapdata <- read.csv('china_map.utf8.csv', encoding = 'UTF-8')
# Test
as.character(china_mapdata$NAME) == "黑龙江省"  # should be TRUE now
```
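The byte counts above can be double-checked in Python, where the same four characters come out as 12 bytes in UTF-8 and 8 bytes in GB2312:

```python
# -*- coding: utf-8 -*-
name = u'黑龙江省'
print(len(name))                   # 4 characters
print(len(name.encode('utf-8')))   # 12 bytes (3 bytes per character)
print(len(name.encode('gb2312')))  # 8 bytes (2 bytes per character)
```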

After converting all Chinese characters to Unicode, I can now follow the map example here (the author of that post, I believe, uses a Chinese Windows and therefore does not have my problem; all systems except Chinese Windows will encounter the encoding problem, though):

```r
ggplot(zhejiang, aes(x = long, y = lat, group = group, fill = NAME)) +
  geom_polygon(fill = "beige") +
  geom_path(colour = "grey40") +
  ggtitle("中华人民共和国浙江省") +
  geom_point(x = 120.12, y = 30.16, fill = FALSE) +
  annotate("text", x = 120.3, y = 30, label = "杭州市")
```

But when exporting the plot as PDF, the Chinese characters do not display correctly. After searching online, I found the solution in this SO question: specifying the font explicitly solves the problem:

```r
cairo_pdf("example.pdf", family = "FangSong")
ggplot(zhejiang, aes(x = long, y = lat, group = group, fill = NAME)) +
  geom_polygon(fill = "beige") +
  geom_path(colour = "grey40") +
  ggtitle("中华人民共和国浙江省") +
  geom_point(x = 120.12, y = 30.16, fill = FALSE) +
  annotate("text", x = 120.3, y = 30, label = "杭州市")
dev.off()
```

The PDF output is now correct. We can also change the font to other Chinese fonts; refer to their names here.

Some of R's functions can recognize Unicode strings, e.g. Encoding. I think this recognition is based on the first few bytes of the string. But it does not recognize GB2312 strings (the Encoding function outputs 'unknown' for GB2312 strings). Magically, RStudio on Windows (English version, locale set to Simplified Chinese) can recognize both kinds of strings and display them correctly.

Lesson 2: When dealing with Unicode in R, use a Chinese Windows (for Chinese only) or use Linux/Mac OS (which are UTF-8 by default); otherwise you cannot display Unicode well.

See how Unicode characters may be displayed in R's console (English Windows):

```
> head(zhejiang)
    X.U.FEFF.  AREA PERIMETER BOU2_4M_ BOU2_4M_ID ADCODE93 ADCODE99                     NAME
223       222 9.277    26.825      224         33   330000   330000 <U+6D59><U+6C5F><U+7701>
224       223 0.000     0.103      225       1626   330000   330000 <U+6D59><U+6C5F><U+7701>
225       224 0.000     0.052      226       1634   330000   330000 <U+6D59><U+6C5F><U+7701>
```

The <U+xxxx> symbols indicate that R knows they are Unicode characters; however, R cannot display them correctly. These symbols also appear verbatim in your plots, which is just meaningless. If R cannot display these characters as characters, how can we know what they are and how to use them in a plot?

To solve this problem, my solution is to use Linux with UTF-8 encoding. This is the session info in my Linux virtual machine:

```
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
 LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
```

# Summary

Conceptually, Unicode in Python and R is quite simple. In Python, Unicode and normal strings have different types. In R, all strings are represented as byte arrays (the good old C char[]!), so from its type you cannot tell whether a string is Unicode or not. I think Python's two-type design is good because the runtime exceptions force the programmer to separate normal strings and Unicode strings clearly (though it is a headache, which is why Python 3 treats all strings as Unicode strings).

What adds to the Unicode complexity is functions that manipulate Unicode strings, and especially the display of Unicode characters on screens. A Unicode character is, after all, just 2-4 bytes in memory; nothing special. When using Python and R for data analysis, we usually work in a REPL (read-eval-print loop), where it is meaningless to display Unicode as escaped strings. One cannot even judge whether a character is Chinese by reading the escaped string! What's worse, different terminals may display the same character differently. For example, in a Unicode-enabled Python console:

```python
>>> print u'你'
你
>>> u'你'
u'\u4f60'
```

In a non-Unicode console:

```python
>>> u'你'
u'\xc4\xe3'
>>> print u'你'
Äã
>>> print u'\u4f60'
你
>>> u'你' == u'\u4f60'
False
```

(In this example, the problem is not that the terminal cannot display Unicode characters; it is that the input characters typed into the terminal are not Unicode!)
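One terminal-independent way to check what a character really is, is to inspect its code point rather than trusting the rendering; a small sketch:

```python
import unicodedata

ch = u'\u4f60'
# The code point and the official character name do not depend on
# what the terminal can render:
print(hex(ord(ch)))          # 0x4f60
print(unicodedata.name(ch))  # CJK UNIFIED IDEOGRAPH-4F60
```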

My suggestion for the terminal and editor:

Use a Unicode terminal and a Unicode text editor when working with Python and R. For example, RStudio is Unicode-capable, while Rgui.exe isn't; the PyDev plugin and PyScripter are, while the default IDLE isn't.

And in Python, always put

```python
# -*- coding: utf-8 -*-
```

at the beginning of the file, and save all source code files as UTF-8.