An R Mumble

This weekend I attended beCamp, “Charlottesville’s version of the BarCamp unconference phenomenon”. Basically a tech meetup with talks, talking outside of talks, food, and drinks. All around, a good time. This was my first beCamp, but I enjoyed it, and will probably try to go to others in the future.

The way the camp goes, you show up Friday night and people stand up and say things they could talk about, if there’s interest. Then people mark which talks they’d attend, and a schedule of talks for Saturday comes together. When I signed up for the camp a week beforehand, my intention was to spend some free time during the week preparing a talk about R. Didn’t happen, but I stood up Friday and said I could maybe say a few words about it anyway. I got a few ‘Learn’ checkmarks, indicating a little interest. Lucky for all involved, though, I didn’t get officially scheduled to talk. Of course, I didn’t know that until I showed up Saturday, having spent about 2 hours that morning trying to throw something together, just in case. I can’t say I was too disappointed with not having to talk, though it could have been fun. At lunch, there was about an hour of “lightning talks”, just 5-minute talks. While I was sitting there, I figured that would actually be a good amount of time for me. Just as the line of talkers was starting to wind down, and I was thinking about joining it, a handful more queued up. Those talks used all the remaining time, so I was, again, off the hook.

But, hey, I’ve got this venue, I can talk about stuff whenever I want, right? So here’s some notes about R I was just about prepared to mumble about Saturday.

According to the webpage, “R is a free software environment for statistical computing and graphics.” It’s something like an open source clone of the S software for stats processing. The language has some genuinely interesting aspects, and also some things that really get me sometimes. R has some good built-in “call for help” utilities, making it sort of easy to pick up and do fairly interesting things fairly quickly. Perhaps one of the best things is the huge collection of freely available libraries. I’ll try to talk about some or all of these things, showing by example and not being too formal (me, informal?), but hopefully still fairly close to correct (or, at least, usably incorrect). Folks wishing for more can certainly jump to the official introduction. John D. Cook has some good notes about picking up R if you know some programming (I’m assuming you do), and I also found “The R Inferno” [pdf] informative.

Right, so let’s look at some code. I always feel, with a language like R, you don’t start off with “Hello World”, but with calculations. Because it’s just so novel to have a calculator:

> 2+2
[1] 4
> exp(5)
[1] 148.4132
> (1i)^(1i)
[1] 0.2078796+0i
> exp(pi*(1i))
[1] -1+0i

Those lines with the arrow in front are the input lines, where you type, and the other lines are the output. Anyway, pretty exciting, no?

I should mention that you can reproduce this, or other examples, by running R from the command line, or using R-specific editors (or emacs add-ons). I’ve been using RStudio recently, and, while it’s got its issues, it’s got some nice features too. Worth a check-out.

Ok, so R does the normal hello world too, and has if-statements (imagine!) and for loops (how are you not using it already!?). The ‘:’ operator is pretty helpful for making ranges, as the example shows:

> print("hello world")
[1] "hello world"
> if (TRUE) { print("yep") } else { print("nope") }
[1] "yep"
> for (n in 1:10) { print(log(n)) }
[1] 0
[1] 0.6931472
[1] 1.098612
[1] 1.386294
[1] 1.609438
[1] 1.791759
[1] 1.94591
[1] 2.079442
[1] 2.197225
[1] 2.302585

Here’s an example of writing our own function. Pretty straightforward… note that '<-' is the assignment operator ('=' would also work here, but I think '<-' is more R-ey). There’s another assignment operator, '<<-', which, I think, is a little bit of a bad habit to use. It’s along the lines of assigning to a variable in an enclosing (often the global) environment, but I’d rather not get too much into that sort of thing (i.e., things I don’t understand at all, instead of things I can at least pretend a little about). I seem to recall reading somewhere that R uses lexical scoping rules. If you know what that means, good for you. I think it applies here, because if we hadn’t assigned the function to fact, the recursive call to fact at the end would fail. Oh, and note you can return things in R with return(), but more typically results get returned by just having them be the last expression of the function (in my (limited) experience).

> fact <- function(n) {
+     if (n <= 1) {
+        return(1)
+     }
+     n*fact(n-1)
+ }
> fact(3)
[1] 6
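
Since I brought up scoping and '<<-', here’s a quick sketch of the difference (this is the usual closure/counter demo, not anything specific to fact):

> make.counter <- function() {
+     count <- 0
+     function() {
+         count <<- count + 1  # '<<-' reaches the count inside make.counter
+         count                # '<-' here would just make a new local variable
+     }
+ }
> tick <- make.counter()
> tick()
[1] 1
> tick()
[1] 2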

There are a few ways to iterate over a bunch of things and apply a function to each object in turn (i.e., “map”). A simple way is with sapply. In the following example, we use sapply to get the factorial of the values 0 to 4, then plot the results with lines connecting the values. We add a title to the plot, and re-plot the points. Note that we pass 0:4 in to specify the x-coordinates of the values; R indexes from 1 by default (I don’t hold R personally responsible for this, since they’re trying to be S-compatible, but still). Anyway, example (run it yourself for the picture):

> first.facts <- sapply(0:4, fact)
> plot(0:4, first.facts, xlab="n", ylab="fact(n)", type="l")
> title("Factorial")
> points(0:4, first.facts)
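
By the way, sapply has relatives; lapply does the same job but always hands back a list, and mapply walks several vectors in parallel. A quick sketch, reusing fact from above:

> lapply(0:2, fact)  # like sapply, but the result stays a list
[[1]]
[1] 1

[[2]]
[1] 1

[[3]]
[1] 2

> mapply("*", 1:3, 4:6)  # apply a function across two vectors at once
[1]  4 10 18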

Plotting can get just about as complicated and involved as you’d like. So now’s probably a good place to introduce R’s help system. If you want to find out more about a command, just use ‘?’:

> ?plot

There’s another help operator, '??', which I’ll come back to later. (I think I actually saw there’s a third, '???', but I haven’t used it). Another cool thing about R is that you can look at the source for functions. Just type the function name without the parentheses:

> title
function (main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
    line = NA, outer = FALSE, ...)
{
    main <- as.graphicsAnnot(main)
    sub <- as.graphicsAnnot(sub)
    xlab <- as.graphicsAnnot(xlab)
    ylab <- as.graphicsAnnot(ylab)
    .Internal(title(main, sub, xlab, ylab, line, outer, ...))
}
<environment: namespace:graphics>

Ok, only so informative, since it passes work off with that .Internal call, but the principle is there.
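
One caveat I’ve run into: lots of common functions (plot among them) are “generic”, and typing their name just shows a one-line UseMethod dispatch stub. In that case, I believe methods() and getAnywhere() are the tools for tracking down the real source:

> print  # a generic; shows just UseMethod("print")
> methods(print)  # lists the print.* implementations
> getAnywhere(print.factor)  # digs up the source for one of them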

I want to do one more function example, because it shows sort of a fun thing you can do. When you define a function, you can give arguments default values, and R lets you define a default in terms of the other arguments. When you call the function, you can pass in values by naming the arguments, so you don’t have to worry about order. And, when you do, you can actually cheat and use initial substrings of the argument names (R does “partial matching” on them). An example is in order:

> feet.to.yards <- function(feet) {
+   feet/3
+ }
> yards.to.feet <- function(yards) {
+   yards*3
+ }
> feet.to.yards(3)
[1] 1
> yards.to.feet(8)
[1] 24
> convert <- function(feet=yards.to.feet(yards), yards=feet.to.yards(feet)) {
+ 	print(paste(feet, "feet is", yards, "yards"))
+ }
>
> convert(3)
[1] "3 feet is 1 yards"
> convert(yards=8)
[1] "24 feet is 8 yards"
> convert(y=5)
[1] "15 feet is 5 yards"

(that example is based on something I read in a blog post, but I can’t find the link… it was talking about the things I said in the last paragraph, and did an example of converting polar to cartesian coordinates, if memory serves…) (Update 20111004 – found it).

Anyway, I said R was for stats… we should do some of that.

> # random sample of size 100 from normal with mean 7, s.d. 3
> nums <- rnorm(100, 7, 3)
> # calculate sample standard deviation
> sd(nums)
[1] 2.655835
> # plot a histogram
> hist(nums)
> # have a quick look at nums
> summary(nums)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.6268  4.8390  6.8200  6.5500  8.4870 11.8500

One of the really fun things about R is “vectorized” operations (terminology taken from The R Inferno, mentioned above… could be standard-ish though, I dunno). In particular, the usual arithmetic operations are applied componentwise to vectors (and a bare '2' is itself a vector of length one, for example). Shorter vectors are recycled to match longer ones. There are lots of ways to make vectors: ':' was used above, c() is easy, as is seq(). Anyway, here’s an example:

> (1:10)*c(2,3)
 [1]  2  6  6 12 10 18 14 24 18 30
> c(1*2, 2*3, 3*2, 4*3, 5*2, 6*3, 7*2, 8*3, 9*2, 10*3)
 [1]  2  6  6 12 10 18 14 24 18 30
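
Since I name-dropped seq() without showing it, here it is, along with what happens when the lengths don’t divide evenly in the recycling (R warns, but carries on):

> seq(0, 1, by=0.25)
[1] 0.00 0.25 0.50 0.75 1.00
> (1:3)*c(10, 100)
[1]  10 200  30
Warning message:
In (1:3) * c(10, 100) :
  longer object length is not a multiple of shorter object length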

Fitting lines to data sounds like a statsy thing to do, and R is for stats, so let’s do that. Let’s take the nums we made above and use them as offsets from the line y=5x. Then we’ll fit a line to that data, and its slope should be pretty close to 5. Note that '.' isn’t special in R like it is in many other languages, so it’s typically used to separate words in variable names (although it can be used in formulas, see later). Sort of the equivalent of other languages’ '.' is '$', exemplified below.

> sloped.nums <- 5*(1:100) + nums
> sloped.line <- lsfit(1:100, sloped.nums)
> print(sloped.line$coefficients)
Intercept         X
 6.256159  5.005810

I’ll leave it for the curious to sort out why the intercept is 6ish. Of course, if you’re running this at home, you’ll get different numbers, owing to the random sampling. Anyway, we should probably plot the thing and make sure it seems ok:

> plot(1:100, sloped.nums)
> abline(sloped.line, col="RED")

Looks fine to me. Of course, it’d probably be cool to try something with actual data. If you’ve got a csv sitting around, R is quite good at reading them in. If you don’t have a csv sitting around, I found some census data, and it seems R will open csvs across the wire (which is kinda hot).

> census <- read.csv("http://www.census.gov/popest/national/files/NST_EST2009_ALLDATA.csv")

So… what’d we just get? Well, we could read about it, or just mess with it. We’ve already seen summary(), so we can try that. Below, I’ve only shown the top of the output.

> summary(census)
     SUMLEV      REGION    DIVISION      STATE               NAME
 Min.   :10.00   0: 1   5      : 9   Min.   : 0.00   Alabama   : 1
 1st Qu.:40.00   1:10   8      : 8   1st Qu.:12.00   Alaska    : 1
 Median :40.00   2:13   4      : 7   Median :27.00   Arizona   : 1
 Mean   :38.07   3:18   1      : 6   Mean   :27.18   Arkansas  : 1
 3rd Qu.:40.00   4:14   0      : 5   3rd Qu.:41.00   California: 1
 Max.   :40.00   X: 1   3      : 5   Max.   :72.00   Colorado  : 1
                        (Other):17                   (Other)   :51
 CENSUS2000POP       ESTIMATESBASE2000   POPESTIMATE2000
 Min.   :   493782   Min.   :   493783   Min.   :   493958
 1st Qu.:  1808344   1st Qu.:  1808344   1st Qu.:  1806962
 Median :  4301261   Median :  4302015   Median :  4328070
 Mean   : 14878497   Mean   : 14878639   Mean   : 14918075
 3rd Qu.:  8186453   3rd Qu.:  8186781   3rd Qu.:  8230161
 Max.   :281421906   Max.   :281424602   Max.   :282171957

Each of these headers is a name we can use to index into the census object. It’s technically a “data frame”, one of the types in R. That means it’s basically a (not-necessarily-numeric) matrix (as you might expect from a csv table), each column has the same number of entries, and within a column, all of the entries are the same type (no mixing strings with numbers). The names are the column names, and you can use the ‘$’ operator to get at any particular column.

> head(census$NAME)
[1] United States Northeast     Midwest       South         West
[6] Alabama
57 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

The last line of output indicates that census$NAME is a “factor”, one of the types in R. Basically a vector of values all taken from a fixed set of “levels”. I don’t want to say too much about it. While I’m at it, though, we might as well talk about indexing into R objects. It’s one of the cool things about R. Let’s grab that NAME column, and convert it to strings:

> # $NAME and [['NAME']] do the same thing
> states <- as.character(census[['NAME']])
> states[c(5, 18, 37)]
[1] "West"       "Idaho"      "New Mexico"
> states[-(c(1:10, 15:50))] # negative indices = remove those ones
 [1] "Colorado"                 "Connecticut"
 [3] "Delaware"                 "District of Columbia"
 [5] "Vermont"                  "Virginia"
 [7] "Washington"               "West Virginia"
 [9] "Wisconsin"                "Wyoming"
[11] "Puerto Rico Commonwealth"
> states[c(11:14,51:57)]
 [1] "Colorado"                 "Connecticut"
 [3] "Delaware"                 "District of Columbia"
 [5] "Vermont"                  "Virginia"
 [7] "Washington"               "West Virginia"
 [9] "Wisconsin"                "Wyoming"
[11] "Puerto Rico Commonwealth"
> which(substr(states, 1, 1) == "P") # substr is, apparently, "vectorized"
[1] 44 57
> states[which(substr(states, 1, 1) == "P")]
[1] "Pennsylvania"             "Puerto Rico Commonwealth"

Like I said, census is basically a table. You can pull out rectangular bits of it easily, as the example below shows. And it’s easy enough to generalize that a little, and start pulling out whatever bits you want. If you leave one of the coordinates in the ‘[]’ empty, it means “everything”. So census[,] is the same as census (for some definition of “same as”).

> census[1:6,1:5] # rows 1 to 6, columns 1:5
  SUMLEV REGION DIVISION STATE          NAME
1     10      0        0     0 United States
2     20      1        0     0     Northeast
3     20      2        0     0       Midwest
4     20      3        0     0         South
5     20      4        0     0          West
6     40      3        6     1       Alabama
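
That generalizes nicely; a logical condition picks out rows, too. A quick sketch (the == works because comparing a factor compares against its printed labels, as I understand it; note the region-level “South” summary row tags along):

> south <- census[census$REGION == 3, c("STATE", "NAME")]
> head(south, 3)
  STATE     NAME
4     0    South
6     1  Alabama
9     5 Arkansas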

Right, we’re supposed to be doing statsy things. We should actually be playing with the data… Let’s pick out an easy bit: the “POPESTIMATE2009”, “BIRTHS2009”, and “DEATHS2009” values for just the states.

> state.rows <- c(6:13, 15:56)
> some.data <- census[state.rows, c("POPESTIMATE2009", "BIRTHS2009", "DEATHS2009")]
> summary(some.data)
 POPESTIMATE2009      BIRTHS2009       DEATHS2009
 Min.   :  544270   Min.   :  6407   Min.   :  3140
 1st Qu.: 1802408   1st Qu.: 25748   1st Qu.: 14667
 Median : 4403094   Median : 58800   Median : 36536
 Mean   : 6128138   Mean   : 85094   Mean   : 49617
 3rd Qu.: 6647091   3rd Qu.:100143   3rd Qu.: 58657
 Max.   :36961664   Max.   :558912   Max.   :241733
> dim(some.data)
[1] 50  3
> colnames(some.data)
[1] "POPESTIMATE2009" "BIRTHS2009"      "DEATHS2009"
> rownames(some.data)
 [1] "6"  "7"  "8"  "9"  "10" "11" "12" "13" "15" "16" "17" "18" "19" "20" "21"
[16] "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
[31] "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51"
[46] "52" "53" "54" "55" "56"
> rownames(some.data) <- as.character(census$NAME[state.rows])
> head(some.data)
           POPESTIMATE2009 BIRTHS2009 DEATHS2009
Alabama            4708708      62265      47157
Alaska              698473      11307       3140
Arizona            6595778     103956      49657
Arkansas           2889450      40539      28003
California        36961664     558912     241733
Colorado           5024748      72537      31679

To get an idea how the variables relate to each other, you can plot each of them against each of the others quickly, with pairs():

> pairs(some.data)

We fit a line to data earlier, and can expand on that here. Let’s model the population estimate given the births and deaths. I’m not trying to display some dazzling insight into the data, just show how easy it is in R to do this sort of thing.

> l <- lm(POPESTIMATE2009 ~ ., data=some.data)
> print(l)

Call:
lm(formula = POPESTIMATE2009 ~ ., data = some.data)

Coefficients:
(Intercept)   BIRTHS2009   DEATHS2009
  -73663.83        42.52        52.07

> summary(l)

Call:
lm(formula = POPESTIMATE2009 ~ ., data = some.data)

Residuals:
     Min       1Q   Median       3Q      Max
-1074015  -205561   -12694   171717   997020

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -73663.831  70178.173   -1.05    0.299
BIRTHS2009      42.518      1.540   27.62   <2e-16 ***
DEATHS2009      52.075      3.099   16.80   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 342400 on 47 degrees of freedom
Multiple R-squared: 0.9976,	Adjusted R-squared: 0.9975
F-statistic:  9652 on 2 and 47 DF,  p-value: < 2.2e-16

That looks good and statsy.
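
And if you wanted to actually use the fit, predict() takes new values in a data frame. The birth and death numbers here are made up, just to show the shape of the call:

> predict(l, newdata=data.frame(BIRTHS2009=50000, DEATHS2009=30000))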

If you don’t have your own data, by the way, R comes with some, and many libraries bring in their own, for examples. Just poking around a little (here’s to tab completion), I found, for example, that R comes with a data frame of state-by-state arrest counts by crime (USArrests). I mentioned earlier that '?' is good for getting help on a command. If you want to find a command to use, use '??'. This frequently returns lists of matching topics, and the packages they live in, that you can poke at. One of the great things about R is that many of the libraries you can install come with 'vignettes', providing a little tutorial on some of the tools in the package. R makes it very easy to install packages (install.packages()).
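
To make those concrete, here’s the sort of thing I mean (ggplot2 is just an example package name; substitute your favorite, and grid is just a package I know ships vignettes):

> ??regression  # fuzzy search for help topics across installed packages
> data()  # list the data sets that come bundled with R
> head(USArrests)  # peek at one of them
> install.packages("ggplot2")  # fetch a package from CRAN
> library(ggplot2)  # and load it
> vignette(package="grid")  # list the vignettes a package ships with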

I’m sort of running out of steam for the evening, so think I may wrap this up. I had sort of envisioned talking about a couple of fun packages. Guess I’ll save that for another post (this has gotten long anyway), and try to do some cool examples, with prettier pictures.
