Tuesday, 23 August 2016

Reading R Code. Factors in R

R factors are a common source of confusion for R beginners, even in their most basic use. In this post I try to shed some light on them. Rather than relying only on external sources of information or on the R documentation, I will run some experiments and look at the implementation to understand what a factor is and how it behaves in the simplest use cases.

First of all, R factors are intended for representing categorical variables. A categorical variable is a variable that can take values from a limited set of so-called categories. For instance, to encode the gender of a population we would use a categorical variable with two possible categories, male and female. Or, to encode the educational level of a population under 18, we can use a categorical variable with three categories: elementary, middle and high.

In addition, the categories of a categorical variable may or may not have an intrinsic ordering. We cannot ascribe any sound ordering to the categories in the first example, but we should in the second (elementary < middle < high).

In R, the term levels stands for the categories of the categorical variable (the factor). labels, in turn, suggests something like custom names for those categories.

This distinction might seem blurry, and it is. Think of some data you are going to analyze, say, a record of the gender of 6 people:

> x <- c("F", "M", "F", "F", "F", "M")

If we want to convert this into a factor, what are the levels? Maybe "male" and "female"?

> factor(x, c("male", "female"))
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: male female

Yikes! This doesn't work. Interestingly, no error message has been triggered, and levels are stored as we intended. But what are those <NA>s there?

Let's try without supplying levels and see what happens:

> factor(x)
[1] F M F F F M
Levels: F M

This looks better; at least the nasty <NA>s have disappeared. So, when no levels are supplied, the levels end up being the distinct elements of the original x. It seems that we could have passed those instead of "male" and "female":

> factor(x, levels = c("M", "F"))
[1] F M F F F M
Levels: M F

Fine! This works. The only difference is that the levels now appear in the order we supplied.

Still, we would prefer the more human-friendly "male" and "female". Perhaps these words are labels rather than levels. Let's try:

> factor(x, labels = c("male", "female"))
[1] male   female male   male   male   female
Levels: male female

Much better, but wait: our initial vector said otherwise. The first person was "F", yet is now labelled male. This totally wreaks havoc with x!

It looks like the order in which we pass the labels has a serious effect. And it certainly does:

> factor(x, labels = c("female", "male"))
[1] female male   female female female male
Levels: female male

Great! If for some hidden reason (maybe, say, because we expect that in a plot of this factor the legend shows male scores before female ones) we still want the other ordering of levels, we may try to combine both parameters:

> factor(x, levels = c("M", "F"), labels = c("male", "female"))
[1] female male   female female female male
Levels: male female

That's it! Note in particular how each label corresponds to the level in the same position of the values supplied. To verify, something like the following (where the orderings of levels and labels are mismatched)

> factor(x, levels = c("F", "M"), labels = c("male", "female"))
[1] male   female male   male   male   female
Levels: male female

ruins x again.

From these experiments we can draw some intuitive conclusions:

  1. levels are the distinct elements in the object to be converted to a factor.
  2. If levels are not supplied, they are constructed from those unique elements, sorted (in alphabetical order for a character vector such as ours).
  3. labels are the "names" we want to give to our levels.
  4. If not supplied, labels are the same as the levels.
  5. The order of the labels passed must match the order of the levels, whether the default levels or the levels we supply.
  6. The least risky way to create factors with custom labels is to supply both parameters.
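
As a minimal sketch of the last point, one way to keep levels and labels paired is to store them together in a single named vector and pass both parameters from it. The gender_labels lookup below is a hypothetical helper introduced only for illustration; it is not part of the experiments above.

```r
x <- c("F", "M", "F", "F", "F", "M")

# Hypothetical lookup: names are the raw codes, values are the display labels.
# Keeping each pair together makes a level/label mismatch much harder to write.
gender_labels <- c(M = "male", F = "female")

g <- factor(x,
            levels = names(gender_labels),
            labels = unname(gender_labels))

as.character(g)  # "female" "male" "female" "female" "female" "male"
levels(g)        # "male" "female"
```

Reordering the entries of gender_labels changes the order of the levels but never which label goes with which code.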

Looking at the implementation makes what we have found out crystal clear.

  1 function (x = character(), levels, labels = levels, exclude = NA,               
  2     ordered = is.ordered(x), nmax = NA)                                         
  3 {       
  4     if (is.null(x)) 
  5         x <- character()
  6     nx <- names(x)                                                              
  7     if (missing(levels)) {                                                      
  8         y <- unique(x, nmax = nmax)
  9         ind <- sort.list(y)
 10         y <- as.character(y)
 11         levels <- unique(y[ind])
 12     }
 13     force(ordered)
 14     exclude <- as.vector(exclude, typeof(x))
 15     x <- as.character(x)
 16     levels <- levels[is.na(match(levels, exclude))]
 17     f <- match(x, levels)
 18     if (!is.null(nx)) 
 19         names(f) <- nx
 20     nl <- length(labels)
 21     nL <- length(levels)
 22     if (!any(nl == c(1L, nL)))
 23         stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
 24             nl, nL), domain = NA)
 25     levels(f) <- if (nl == nL)
 26         as.character(labels)
 27     else paste0(labels, seq_along(levels))
 28     class(f) <- c(if (ordered) "ordered", "factor")
 29     f
 30 }

Let us consider the second point above. If levels are not supplied they will be the distinct elements of the given vector:

if (missing(levels)) {
    y <- unique(x, nmax = nmax)
    ind <- sort.list(y)
    y <- as.character(y)
    levels <- unique(y[ind])
}

If levels are missing, we create a vector y of unique elements of x:

> y <- unique(x)
> y 
[1] "F" "M"

Note that nmax is set by default to NA in the function header:

function (x = character(), levels, labels = levels, exclude = NA, 
    ordered = is.ordered(x), nmax = NA) 

and unique(x, nmax = NA) is equivalent to the call just above.

Also, note that if the order of elements in x were different, say, c("M", "F", "M"), unique would produce the distinct elements just by removing duplicates in the given vector:

> unique(c("M", "F", "M"))
[1] "M" "F"

So unique doesn't sort the result.

To sort the result we need a sorting function, here sort.list. This function produces the indices suitable for subsetting y:

> sort.list(y)
[1] 1 2

> y[sort.list(y)]
[1] "F" "M"

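Note also that y is converted to a character vector, but only after the sort indices have been computed. So the levels produced are always a character vector, ordered according to the sort order of the original values (alphabetical order for a character x like ours).

If levels are supplied, they are expected to match the unique elements of x. Pay attention to this crucial line: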
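
Since our y happened to be sorted already, sort.list(y) returned 1 2, which hides what it does. The sketch below repeats the same steps with an unsorted vector, and then with a made-up numeric vector n to show that sort.list runs before as.character, so numbers are ordered numerically rather than as strings.

```r
# Same steps as in the implementation, with an unsorted character vector
y <- unique(c("M", "F", "M"))   # "M" "F"
ind <- sort.list(y)             # 2 1
y[ind]                          # "F" "M"

# Hypothetical numeric input: the sort happens on the numbers,
# the conversion to character happens afterwards
n <- c(10, 2, 1, 2)
num_levels <- levels(factor(n)) # "1" "2" "10"
is.character(num_levels)        # TRUE
```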

f <- match(x, levels)

match is a nice but perhaps tricky function to understand: for each element of x, it produces the position (= index) of its match in levels.

Let's try some examples. Recall that x is:

> x
[1] "F" "M" "F" "F" "F" "M"


> match(x, c("M", "F"))
[1] 2 1 2 2 2 1

Indeed, the first element of x matches the 2nd element of levels; the second element of x matches the 1st element of levels, and so on.

If there is no match, an NA is produced by default:

> match(x, c("M", "O"))
[1] NA  1 NA NA NA  1

Now we see where the NAs in our initial wrong attempt with "male" and "female" as levels came from.
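
To check this directly, here is a small sketch that applies match to the levels of that first attempt, and then confirms that the integer codes stored in a factor are exactly what match returns.

```r
x <- c("F", "M", "F", "F", "F", "M")

# None of "F" or "M" occurs in c("male", "female"), so every position is NA
first_try <- match(x, c("male", "female"))
all(is.na(first_try))   # TRUE

# With matching levels, the codes of the factor are the match positions
codes <- match(x, c("M", "F"))
codes                   # 2 1 2 2 2 1
identical(codes, as.integer(factor(x, levels = c("M", "F"))))  # TRUE
```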

What about labels?

If labels are not supplied they are just the levels as the function header states (levels is the default value for labels).

If they are supplied, these lines give labels to levels:

levels(f) <- if (nl == nL) 
    as.character(labels)
else paste0(labels, seq_along(levels))

Actually, the line that is executed for our previous examples with labels is just:

levels(f) <- as.character(labels)

since in our examples the numbers of levels and labels are the same (nl == nL, where nl <- length(labels) and nL <- length(levels)).

Note again that the levels, even with those new labels, are always a character vector.

The last line in the snippet above deals with the case where a single label is supplied. We haven't tried an example of that yet, so let's do it now:
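
A quick sketch of what this implies in practice: even numeric labels end up as character levels, because of the as.character(labels) call. The 0/1 coding below is a made-up example, not something used earlier in the post.

```r
x <- c("F", "M", "F", "F", "F", "M")

# Numeric labels (hypothetical 0/1 coding) are stored as character levels
f01 <- factor(x, labels = c(0, 1))
levels(f01)                        # "0" "1"

# as.numeric() on the factor returns the integer codes, not the labels
as.numeric(f01)                    # 1 2 1 1 1 2

# To recover the numbers behind the labels, go through character first
as.numeric(as.character(f01))      # 0 1 0 0 0 1
```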

> factor(x, labels = "gender")
[1] gender1 gender2 gender1 gender1 gender1 gender2
Levels: gender1 gender2

This is a valid way to assign labels to levels, though not so good for this example. The code responsible for producing this vector is:

paste0(labels, seq_along(levels))

seq_along(levels) produces a sequence 1, 2, ... n where n is the length of levels. paste0 just concatenates the single label with those digits in the sequence via recycling.
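
The two pieces can be tried on their own. The sketch below rebuilds the default levels of x by hand (lv is just an illustrative name) and also shows the length check from the implementation rejecting a labels vector whose length is neither 1 nor the number of levels.

```r
x <- c("F", "M", "F", "F", "F", "M")
lv <- sort(unique(x))                  # "F" "M", the default levels

seq_along(lv)                          # 1 2
single <- paste0("gender", seq_along(lv))
single                                 # "gender1" "gender2"

# Three labels for two levels: factor() stops with an error,
# captured here as a message instead of aborting the script
res <- tryCatch(factor(x, labels = c("a", "b", "c")),
                error = function(e) conditionMessage(e))
res
```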

However, the most remarkable fact we may notice from looking at the implementation is that a factor in R is just an integer vector whose attributes "levels" (as we have seen), "names" (as names of x [see line 6 and lines 18-19 in the implementation]) and, above all, "class" are set appropriately.

As for the latter, the last but one line in the implementation always includes "factor" in the "class" attribute. In addition, if x is already an ordered factor (the default value of the ordered argument is is.ordered(x)), or if we pass TRUE to the ordered argument, "class" is set to c("ordered", "factor"):

class(f) <- c(if (ordered) "ordered", "factor") 

Note the order of elements in the last vector for "class". The order here matters, and means that in such a case f is like an instance of the class "ordered", which in turn is somewhat like a subclass of "factor". [See ?class for more details about this kind of inheritance mechanism.]

A nice way of finishing this post is to demonstrate what we have discovered: that a factor is, from R's internal point of view, just an integer vector with certain attributes set appropriately.
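
As a small illustration, using the educational levels from the beginning of the post with made-up data, we can inspect the class vector of an ordered factor and see what the extra "ordered" class buys us.

```r
# Hypothetical sample of educational levels
edu <- c("elementary", "high", "high", "middle")
e <- factor(edu, levels = c("elementary", "middle", "high"), ordered = TRUE)

class(e)                  # "ordered" "factor"
inherits(e, "factor")     # TRUE: an ordered factor is still a factor
is.ordered(factor(edu))   # FALSE: plain factors are unordered by default

# Comparisons follow the order of the levels, not alphabetical order
below_high <- e < "high"
below_high                # TRUE FALSE FALSE TRUE
```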

> f <- c(1L, 2L, 1L, 1L, 1L, 2L)
> levels(f) <- c("female", "male")
> class(f) <- "factor"
> str(f)
 Factor w/ 2 levels "female","male": 1 2 1 1 1 2
> f
[1] female male   female female female male  
Levels: female male

> f2 <- c(1L, 3L, 3L, 2L, 1L, 2L)
> levels(f2) <- c("elem", "middle", "high")
> class(f2) <- c("ordered", "factor")
> str(f2)
 Ord.factor w/ 3 levels "elem"<"middle"<..: 1 3 3 2 1 2
> f2
[1] elem   high   high   middle elem   middle
Levels: elem < middle < high
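
The same fact can be seen from the opposite direction. As a short sketch, we can take a factor built with factor() and strip it back down to its integer codes and attributes:

```r
x <- c("F", "M", "F", "F", "F", "M")
f <- factor(x, labels = c("female", "male"))

typeof(f)                      # "integer"
codes <- as.integer(unclass(f))
codes                          # 1 2 1 1 1 2

# The only attributes are the ones set at the end of the implementation
attr_names <- names(attributes(f))
attr_names                     # includes "levels" and "class"
```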
