codeRclub | bioCEED R coding club


When data are imported into R, they generally arrive as a data.frame, a type of object that contains named columns. We often want to access the contents of individual columns, which can be done with dollar or square-bracket notation.

attach() is used by some R users to make the columns of a data.frame directly accessible, rather than using dollar notation or data arguments.
Unfortunately, although attach() saves a little typing and perhaps makes the very first R experience a tiny bit easier, it comes at a large cost, which I explore below.

The data.frame beaver1 holds body temperatures of a beaver (it is found in the datasets package, which is installed by default). These are the first few rows.

head(beaver1)
  ##   day time  temp activ
  ## 1 346  840 36.33     0
  ## 2 346  850 36.34     0
  ## 3 346  900 36.35     0
  ## 4 346  910 36.42     0
  ## 5 346  920 36.55     0
  ## 6 346  930 36.69     0

If we attach() beaver1, we can access any of the columns of the data.frame directly.

attach(beaver1)
quantile(temp)
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
boxplot(temp~activ)

Boxplot of body temperature against activity

Simple. What could possibly go wrong?

Let’s attach a second beaver dataset, beaver2, which has the same column names as the first.

attach(beaver2)
  ## The following objects are masked from beaver1:
  ## 
  ##     activ, day, temp, time

That generated a message saying that various objects from beaver1 were masked. Now if we run quantile() or some other function, we get the data from beaver2.

quantile(temp)
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500

If we want to use beaver1 again, we have to detach() beaver2 first.

detach(beaver2)
quantile(temp)
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
detach(beaver1)

Attaching and detaching data.frames is obviously going to go horribly wrong very quickly unless you are very careful. Always.

One alternative would be to change the column names on one or both of the data.frames so that all column names are unique.

colnames(beaver1) <- paste("beaver1", colnames(beaver1), sep = "_")
colnames(beaver2) <- paste("beaver2", colnames(beaver2), sep = "_")
attach(beaver1)
attach(beaver2)
quantile(beaver1_temp)
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2_temp)
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500
detach(beaver2)
detach(beaver1)
rm(beaver1, beaver2) # clean up: remove the renamed copies so the originals in datasets are visible again

Possible, but hardly elegant, and this is now as much typing as the dollar notation shown below.

Another solution would be to combine the two datasets. We need to add a column identifying which beaver is which.

beavers <- rbind(cbind(id = "beaver1", beaver1), cbind(id = "beaver2", beaver2))
attach(beavers)
quantile(temp[id == "beaver1"])
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(temp[id == "beaver2"])
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500

Having combined the data.frames, we now need to subset the objects to get the data for each beaver. This isn’t a simple solution (in some cases it will be horribly complicated), but combining the data.frames might be very useful for some analyses (for example if you wanted to run an ANOVA).
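As a rough sketch of why the combined data.frame can be convenient, per-beaver summaries can be computed in a single call, without attaching anything. The name beavers_demo is only used here for illustration, so that nothing defined elsewhere is overwritten; aggregate() and tapply() are base R functions.

```r
# Build the combined data.frame as above (beavers_demo is an illustrative name)
beavers_demo <- rbind(cbind(id = "beaver1", beaver1), cbind(id = "beaver2", beaver2))

# Median temperature for each beaver, via the formula interface and the data argument
medians <- aggregate(temp ~ id, data = beavers_demo, FUN = median)
medians

# All the quantiles at once, returned as one list element per beaver
tapply(beavers_demo$temp, beavers_demo$id, quantile)
```

The medians agree with the 50% values in the quantile() output shown earlier.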

attach() will also cause problems if there are objects with the same name as one of the columns of the data.frame.

temp <- 7 # a new object in the global environment
quantile(temp)
  ##   0%  25%  50%  75% 100% 
  ##    7    7    7    7    7
rm(temp) # clean up
detach(beavers)

If we attach the data.frame when an object called temp already exists, we get a message that the column is masked. If we create temp afterwards, no message is given and the new object silently masks the column in the data.frame. Obviously, this could cause some nasty bugs. Good luck with them.
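The silent case can be checked directly. This is a minimal sketch, assuming a fresh R session with nothing else attached; find() (from the utils package) lists every place on the search path where a name is defined, and the first one listed is the one R uses.

```r
attach(beaver1)
temp <- 7                     # created after attach(): nothing is printed
where_temp <- find("temp")    # every location that defines temp, in search order
masked_mean <- mean(temp)     # uses the new object, not the column
rm(temp)                      # remove the masking object
column_mean <- mean(temp)     # the column from beaver1 is visible again
detach(beaver1)

where_temp
masked_mean
column_mean
```

Running find() on a suspicious name is a quick way to see whether it is defined in more than one place.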

It is of course possible to avoid these problems with masking if we are very careful with naming objects. In practice, it is very easy to make mistakes, which then cause bugs that are difficult to interpret.

Even if we manage to avoid errors when using attach(), it makes the code difficult to read, as it is not obvious which data.frame each object is coming from.

Avoiding attach()

Fortunately there are alternatives to attach().

Referencing the columns in the data.frame

We can reference the columns in the data.frame by using either dollar notation or square bracket notation. Dollar notation is generally neater looking and needs less typing.

quantile(beaver1[,"temp"])
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2$temp)
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500
plot(beaver2$time %/% 100 + 24*(beaver2$day - beaver2$day[1]) + (beaver2$time %% 100)/60, 
                            y = beaver2$temp, col = ifelse(beaver2$activ, 2, 1), xlab = "Time")
# %/% is integer division: 5%/%2 = 2
# %% gives the modulus: 5%%2 = 1

Beaver body temperature against time

When using dollar notation, if there are spaces in the column names (a bad idea), the column name needs to be quoted, with either backticks or quotation marks.
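A minimal sketch with a made-up data.frame shows the quoting (odd_names and its column are hypothetical). Note the check.names = FALSE argument: without it, data.frame() would change the column name to body.temp.

```r
# A hypothetical data.frame with a space in the column name
odd_names <- data.frame(`body temp` = c(36.3, 36.9), check.names = FALSE)

odd_names$`body temp`      # backticks
odd_names$"body temp"      # quotation marks also work with $
odd_names[["body temp"]]   # square brackets take a quoted name anyway
```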

with() and within()

If the command you are using makes many references to a data.frame, the code can become rather messy, as in the previous example where the day and time columns are combined to give hours.

In such cases the with() function can be useful. It is equivalent to attaching the data.frame for this block of code only (without problems with masking).

with(beaver1, {
  hours <- time %/% 100 + 24*(day - day[1]) + (time %% 100)/60
  plot(hours, temp, col = ifelse(activ, 2, 1))
})

with() can make code much easier to read.
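A small sketch of the "this block only" behaviour, assuming you do not already have an object called is_active in your workspace: the search path is unchanged after with() returns, and variables created inside the block do not linger.

```r
path_before <- search()

active_mean <- with(beaver1, {
  is_active <- activ == 1    # temporary variable, local to this block
  mean(temp[is_active])      # the value of the last expression is returned
})

identical(path_before, search())   # nothing was attached
exists("is_active")                # the temporary variable is gone
active_mean
```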

within() is useful if we want to modify the contents of a data.frame. Here, I’m using the lubridate package to make the date and time of each measurement.

library(lubridate)
beavers <- within(beavers, {
  day <- dmy("01-01-1990") + days(x = day-1) + hours(time %/% 100) + minutes(time %% 100)
  rm(time)# remove unwanted column
})
head(beavers)
  ##        id                 day  temp activ
  ## 1 beaver1 1990-12-12 08:40:00 36.33     0
  ## 2 beaver1 1990-12-12 08:50:00 36.34     0
  ## 3 beaver1 1990-12-12 09:00:00 36.35     0
  ## 4 beaver1 1990-12-12 09:10:00 36.42     0
  ## 5 beaver1 1990-12-12 09:20:00 36.55     0
  ## 6 beaver1 1990-12-12 09:30:00 36.69     0

The data argument

Most functions that take a formula argument (think y ~ x) have a data argument that can be given a data.frame.

library(mgcv)
mod <- gam(temp ~ s(unclass(day)), data  = beavers, subset = id == "beaver1")
#converting the date to seconds as gam doesn't like dates
plot(temp ~ day, data = beavers, subset = id == "beaver1", col = ifelse(activ, 2, 1), xlab = "Hours")
with(beavers, lines(day[id == "beaver1"], fitted(mod), col = 2))

Beaver body temperature against time

This is much better than attaching data as it makes it explicit which data are being used.
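For comparison, the boxplot drawn earlier with attach() can be written with the data argument instead. This is a short sketch; bp1 and bp2 are just illustrative names for the summary statistics that boxplot() returns invisibly.

```r
# Same plot as before, but each call states which data.frame it uses
bp1 <- boxplot(temp ~ activ, data = beaver1, main = "beaver1")
bp2 <- boxplot(temp ~ activ, data = beaver2, main = "beaver2")

bp1$n   # number of observations in each activity group
```

Both beavers can be plotted in the same session, with no masking and no need to detach anything.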

Do not be tempted to use dollar notation in a formula.

# don't do this
# lm(beavers$temp ~ beavers$day)

This makes the code difficult to read, especially if there are multiple predictors, and will cause problems when making predictions.
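The prediction problem can be seen in a small sketch using beaver1 (the object names here are illustrative). With dollar notation in the formula, the model stores the predictor under the name beaver1$activ, so predict() cannot match it to a column in newdata and falls back on the original data.

```r
fit_data   <- lm(temp ~ activ, data = beaver1)      # recommended
fit_dollar <- lm(beaver1$temp ~ beaver1$activ)      # don't do this

new_obs <- data.frame(activ = c(0, 1))

pred_data <- predict(fit_data, newdata = new_obs)
# suppressWarnings() only keeps this example quiet: R warns here that the
# number of rows in newdata does not match the variables it found
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_obs))

length(pred_data)     # one prediction per row of new_obs
length(pred_dollar)   # one value per row of beaver1: new_obs was ignored
```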

ggplot

ggplot(), an alternative system for plotting data from the ggplot2 package, wants to have the data in a data.frame. It does not need or want the data to be attached.

library(ggplot2)
ggplot(data =  beavers, mapping = aes(x = day, y = temp)) + 
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x)) +
  facet_wrap(~id, dir = "v", scales = "free_x") + 
  xlab("Hours")

Beaver body temperature against time

Does attach() have any uses?

attach() is not completely useless: attach() can also attach environments. This allows some nifty tricks for having utility functions etc. available without cluttering up your global environment (but it would probably be better to make a package – this is not hard).
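A minimal sketch of this trick, with a made-up helper function (std_error and the name my_helpers are hypothetical): the function is stored in its own environment, which is then attached under a chosen name.

```r
helpers <- new.env()
helpers$std_error <- function(x) sd(x) / sqrt(length(x))  # hypothetical utility function

attach(helpers, name = "my_helpers")  # puts a copy of the environment on the search path
rm(helpers)                           # the original environment object is no longer needed

se_temp <- std_error(beaver1$temp)    # the helper is found via the search path
in_global <- "std_error" %in% ls(globalenv())
in_global                             # the helper is not in the global environment

detach("my_helpers", character.only = TRUE)  # remove it again when finished
```

The masking problems described above still apply: an object called std_error in the global environment would hide the attached helper.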
