codeRclub | bioCEED R coding club

11 May 2017

Why use ggplot2?

Over the last year or so, I have changed from using base R graphics for all my plots to using ggplot2 for almost everything.

I want to show you why.

I’m going to use the iris dataset which is available within R.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

I want to start by plotting a histogram of Sepal.Length.

With base R this is easy.

hist(iris$Sepal.Length)

[Figure: histogram of Sepal.Length (base R)]

That was easy, and with a little extra work, I could give the plot better axis labels and so on.

The ggplot2 code is a little longer.

library("ggplot2") # load ggplot2 package
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram()

[Figure: histogram of Sepal.Length (ggplot2)]

This histogram looks a little different from the base R version because it is using many more bins. Probably too many. This could be controlled by using the bins argument to geom_histogram.
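For example, to get something closer to the base R histogram we could ask for fewer bins (geom_histogram defaults to 30; 8 here is an arbitrary choice):

```r
library("ggplot2")
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 8) # fewer, wider bins
```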

In the iris data.frame, there are three species. I want to have separate histograms for each. Of course this is possible with base R.

par(mfrow = c(1, 3))
xlim <- range(iris$Sepal.Length)#use to ensure plots span same x
by(iris, iris$Species, function(x){
  hist(x$Sepal.Length, main = x$Species[1], 
       xlab = "Sepal length", xlim = xlim)
})

[Figure: histograms of Sepal.Length for each species (base R)]

But it is not trivial – this code loops over the species and plots a histogram of each. Looks like I forgot to specify the number of bins for the histograms to use so they would all be the same width, and I might also want to standardise the y-axis scale.

To make the equivalent plot in ggplot2 is much simpler. We just need to add a facet to the code.

ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 12) +
  facet_wrap(~ Species)

[Figure: faceted histograms of Sepal.Length by species (ggplot2)]

But I’ve changed my mind about how to present these data. I now want a single histogram, with the different species shown by different colour fills.

This too is possible in base R, but it will take a lot of effort (or cunning) to get it to work. In ggplot2, I simply need to add fill = Species to the aesthetics.

ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 12)

[Figure: stacked histogram of Sepal.Length coloured by species]

ggplot even adds a legend by default.

Perhaps a density plot would look better – with ggplot2 it is easy to try, just change the geom to geom_density.

ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.4) # alpha gives transparency

[Figure: density plot of Sepal.Length coloured by species]

There are several other geometries that could be used to display these data. With ggplot2 it is easy to try them just by changing the geom; with base R, we would probably have to rewrite the code completely.

jt <- ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_jitter(height = 0, width = 0.3) #height and width specify the amount of jitter
bx <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(notch = TRUE) #notch gives uncertainty of the median
vi <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin() #density plots on their side
gridExtra::grid.arrange(jt, bx, vi)

[Figure: jitter, box and violin plots of Sepal.Length by species]

This ease of changing the geometries (or aesthetics) used in a plot, with consistent code, is one of the main advantages of ggplot2.


For most analyses, a large proportion of the code is used to import, reformat and clean the data, and only a small portion is used to run the statistical tests. Despite this, university courses tend to focus on statistical analyses and neglect the data processing steps.

Base R (the packages that are automatically installed) has many very powerful tools for data processing, but an extra package called dplyr can make this processing much easier.

In this post, I’m going to compare how some data.frame manipulations can be done in base R and with dplyr.

First, we need to install the dplyr package, and also the gapminder package, which contains the data set I want to process. You only need to do this once, and some of these packages may already be installed.

install.packages(c("dplyr", "gapminder"))

Then we need to load the packages.

library("dplyr")
library("gapminder")

The gapminder data set contains life expectancy, population and GDP per capita data for different countries from 1952 to 2007. These are the first few rows:

gapminder
## # A tibble: 1,704 × 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1  Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2  Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3  Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4  Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5  Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6  Afghanistan      Asia  1977  38.438 14880372  786.1134
## 7  Afghanistan      Asia  1982  39.854 12881816  978.0114
## 8  Afghanistan      Asia  1987  40.822 13867957  852.3959
## 9  Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

The gapminder data is in a special type of data.frame called a tibble. For most functions, this behaves just like a data.frame except that it has a better print function – see ?tibble for details.

Filtering rows

I want to extract from the gapminder data.frame the data for the African countries with a GDP of less than $500 per capita in 1967.

With base R, we can use square brackets to select the rows. Remember these work like this: myDataFrame[rowSelector, columnSelector] where rowSelector is a logical vector (TRUE/FALSE) of rows to return, or a vector of the row numbers to return.

gapminder1967 <- gapminder[gapminder$continent == "Africa" & gapminder$year == 1967 & gapminder$gdpPercap < 500, ]
gapminder1967
## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Burundi    Africa  1967  43.548 3330989  412.9775
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Lesotho    Africa  1967  48.492  996380  498.6390
## 4  Malawi    Africa  1967  39.487 4147252  495.5148

In this code, the & represents AND, so only rows that are TRUE for all three tests are returned. Remember to use == as a test of equality.
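These operators are easy to check on small vectors (a quick base R illustration):

```r
# & combines logical tests element-wise: both must be TRUE
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
## [1]  TRUE FALSE FALSE

# == tests equality; a single = is assignment, not a test
1:5 == 3
## [1] FALSE FALSE  TRUE FALSE FALSE
```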

With dplyr, we can use filter().

filter(gapminder, continent == "Africa", year == 1967, gdpPercap < 500)
## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Burundi    Africa  1967  43.548 3330989  412.9775
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Lesotho    Africa  1967  48.492  996380  498.6390
## 4  Malawi    Africa  1967  39.487 4147252  495.5148

I think this is cleaner, more explicit code. The first argument is the name of the data.frame; subsequent arguments are the logical tests. If the logical tests need to be combined with AND, they can be separated by commas.

Use filter to extract

  • the data for Norway
  • the data for 2007 in Asia or Africa
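One possible solution sketch (OR is written with | or, for set membership, %in%):

```r
filter(gapminder, country == "Norway")
filter(gapminder, year == 2007, continent %in% c("Asia", "Africa"))
# equivalently: continent == "Asia" | continent == "Africa"
```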

Selecting columns

We can select a single column using the dollar notation

gapminder1967$country
## [1] Burundi Eritrea Lesotho Malawi 
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe

If we want more columns, we need to use the myDataFrame[rowSelector, columnSelector] notation.

gapminder1967[, c("country", "lifeExp")]
## # A tibble: 4 × 2
##   country lifeExp
##    <fctr>   <dbl>
## 1 Burundi  43.548
## 2 Eritrea  42.189
## 3 Lesotho  48.492
## 4  Malawi  39.487

With dplyr, we can use select()

select(gapminder1967, country, lifeExp)
## # A tibble: 4 × 2
##   country lifeExp
##    <fctr>   <dbl>
## 1 Burundi  43.548
## 2 Eritrea  42.189
## 3 Lesotho  48.492
## 4  Malawi  39.487

The first argument to select is the data.frame, the subsequent columns are the required columns. The column names are not quoted.

This is often enough, but sometimes other methods for selecting columns with select are useful.

Adjacent columns can be selected by using firstColumnName:lastColumnName notation

select(gapminder1967, country:pop)
## # A tibble: 4 × 5
##   country continent  year lifeExp     pop
##    <fctr>    <fctr> <int>   <dbl>   <int>
## 1 Burundi    Africa  1967  43.548 3330989
## 2 Eritrea    Africa  1967  42.189 1820319
## 3 Lesotho    Africa  1967  48.492  996380
## 4  Malawi    Africa  1967  39.487 4147252

Columns can be removed by putting a minus sign in front of their name

select(gapminder1967, -continent)
## # A tibble: 4 × 5
##   country  year lifeExp     pop gdpPercap
##    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Burundi  1967  43.548 3330989  412.9775
## 2 Eritrea  1967  42.189 1820319  468.7950
## 3 Lesotho  1967  48.492  996380  498.6390
## 4  Malawi  1967  39.487 4147252  495.5148

Alternatively, a helper function can be used

select(gapminder1967, starts_with("c"))
## # A tibble: 4 × 2
##   country continent
##    <fctr>    <fctr>
## 1 Burundi    Africa
## 2 Eritrea    Africa
## 3 Lesotho    Africa
## 4  Malawi    Africa

This will select all columns that start with a “c”. See ?starts_with for other helper functions.

Using select() extract

  • the columns lifeExp and gdpPercap
  • all columns except continent

Arranging rows

Rows can be sorted in base R using order(). This code orders the data by life expectancy.

gapminder1967[order(gapminder1967$lifeExp), ]
## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1  Malawi    Africa  1967  39.487 4147252  495.5148
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Burundi    Africa  1967  43.548 3330989  412.9775
## 4 Lesotho    Africa  1967  48.492  996380  498.6390

In dplyr, we can use arrange().

arrange(gapminder1967, lifeExp)
## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1  Malawi    Africa  1967  39.487 4147252  495.5148
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Burundi    Africa  1967  43.548 3330989  412.9775
## 4 Lesotho    Africa  1967  48.492  996380  498.6390

To arrange rows in the reverse direction use desc().

arrange(gapminder1967, desc(lifeExp))
## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Lesotho    Africa  1967  48.492  996380  498.6390
## 2 Burundi    Africa  1967  43.548 3330989  412.9775
## 3 Eritrea    Africa  1967  42.189 1820319  468.7950
## 4  Malawi    Africa  1967  39.487 4147252  495.5148

Using filter and arrange

  • Which African country had the highest life expectancy in 2007?

Chaining dplyr functions

dplyr functions can be used separately, with multiple commands run on a data.frame via intermediate objects.

gapminder1967 <- filter(gapminder, continent == "Africa", year == 1967, gdpPercap < 500)
result <- select(gapminder1967, country, lifeExp)
result <- arrange(result, lifeExp)

dplyr functions can also be chained together with pipes which look like %>%. Pipes pass the output of one function to the next function in the chain as the first argument. The above code can be written with pipes as

result <- gapminder %>%
  filter(continent == "Africa", year == 1967, gdpPercap < 500) %>%
  select(country, lifeExp) %>%
  arrange(lifeExp)

As the number of processing steps increases, this notation becomes more and more useful. When writing a dplyr chain, it is best to write one command at a time and then test whether it works. Remember to put the pipe %>% after each command.

Using pipes %>%

  • Which Asian country had the lowest life expectancy in 2002?
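A sketch of one approach: filter to the year and continent, then sort so the lowest life expectancy comes first.

```r
gapminder %>%
  filter(continent == "Asia", year == 2002) %>%
  arrange(lifeExp) %>%
  select(country, lifeExp)
```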

Mutating columns

We can add a new column to a data.frame with mutate().

This code will add GDP to the data.frame

gapminder1967 %>%
  mutate(GDP = gdpPercap * pop)
## # A tibble: 4 × 7
##   country continent  year lifeExp     pop gdpPercap        GDP
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>      <dbl>
## 1 Burundi    Africa  1967  43.548 3330989  412.9775 1375623555
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950  853356391
## 3 Lesotho    Africa  1967  48.492  996380  498.6390  496833953
## 4  Malawi    Africa  1967  39.487 4147252  495.5148 2055024665

To replace a column, use an existing column name rather than making a new one.
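For example, continuing with gapminder1967 from above, this replaces pop with the population in millions rather than adding a new column:

```r
gapminder1967 %>%
  mutate(pop = pop / 1e6) # pop is overwritten, not added
```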

Using mutate()

  • Add new columns holding the log of population and GDP per capita
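A possible answer (log() gives natural logarithms; use log10() if you prefer base 10):

```r
gapminder %>%
  mutate(log_pop = log(pop), log_gdpPercap = log(gdpPercap))
```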

Summarising data

All the above processing can be done in base R, but is neater and more explicit in dplyr. Some things are much easier to do in dplyr.

For example, calculating the mean and standard deviation of life expectancy per continent for 2007 gets quite complicated with base R.

With dplyr, we need to declare which column (or columns) is to be used to group the data with group_by, and then use summarise.

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    n = n(), 
    mean_lifeExp = mean(lifeExp), 
    sd_lifeExp = sd(lifeExp)
  )
## # A tibble: 5 × 4
##   continent     n mean_lifeExp sd_lifeExp
##      <fctr> <int>        <dbl>      <dbl>
## 1    Africa    52     54.80604  9.6307807
## 2  Americas    25     73.60812  4.4409476
## 3      Asia    33     70.72848  7.9637245
## 4    Europe    30     77.64860  2.9798127
## 5   Oceania     2     80.71950  0.7290271

n() reports how many cases there are for each level of the grouping variable.

If there are many columns to summarise or mutate with the same functions, summarise_all or mutate_all and related functions can be useful.

What happens if you use mutate() instead of summarise() in this code?

Which countries have a life expectancy higher than the mean for their continent?
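One way to tackle the second question: after group_by(), mutate() keeps every row (unlike summarise()), so each country can be compared with its continent's mean. A sketch:

```r
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  mutate(mean_lifeExp = mean(lifeExp)) %>% # repeated within each continent
  filter(lifeExp > mean_lifeExp) %>%
  select(country, continent, lifeExp)
```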

Other useful functions

rename for renaming columns

gapminder %>%
  rename(population = pop) # newName = oldName
## # A tibble: 1,704 × 6
##        country continent  year lifeExp population gdpPercap
##         <fctr>    <fctr> <int>   <dbl>      <int>     <dbl>
## 1  Afghanistan      Asia  1952  28.801    8425333  779.4453
## 2  Afghanistan      Asia  1957  30.332    9240934  820.8530
## 3  Afghanistan      Asia  1962  31.997   10267083  853.1007
## 4  Afghanistan      Asia  1967  34.020   11537966  836.1971
## 5  Afghanistan      Asia  1972  36.088   13079460  739.9811
## 6  Afghanistan      Asia  1977  38.438   14880372  786.1134
## 7  Afghanistan      Asia  1982  39.854   12881816  978.0114
## 8  Afghanistan      Asia  1987  40.822   13867957  852.3959
## 9  Afghanistan      Asia  1992  41.674   16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763   22227415  635.3414
## # ... with 1,694 more rows

slice selects rows by position

gapminder %>%
  group_by(continent) %>%
  slice(1:2)
## Source: local data frame [10 x 6]
## Groups: continent [5]
## 
##        country continent  year lifeExp      pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
## 1      Algeria    Africa  1952  43.077  9279525  2449.0082
## 2      Algeria    Africa  1957  45.685 10270856  3013.9760
## 3    Argentina  Americas  1952  62.485 17876956  5911.3151
## 4    Argentina  Americas  1957  64.399 19610538  6856.8562
## 5  Afghanistan      Asia  1952  28.801  8425333   779.4453
## 6  Afghanistan      Asia  1957  30.332  9240934   820.8530
## 7      Albania    Europe  1952  55.230  1282697  1601.0561
## 8      Albania    Europe  1957  59.280  1476505  1942.2842
## 9    Australia   Oceania  1952  69.120  8691212 10039.5956
## 10   Australia   Oceania  1957  70.330  9712569 10949.6496

This will return the first two rows for each continent.

distinct removes duplicate rows

gapminder %>%
  distinct(continent)
## # A tibble: 5 × 1
##   continent
##      <fctr>
## 1      Asia
## 2    Europe
## 3    Africa
## 4  Americas
## 5   Oceania

left_join joins two data.frames by one or more columns in common. For example, to find which countries had a reduction in life expectancy between 2002 and 2007, we can generate subsets of gapminder for each year and then join them by country and continent.

left_join(
  filter(gapminder, year == 2002),
  filter(gapminder, year == 2007),
  by = c("continent" = "continent","country" = "country"), 
  suffix = c(".2002", ".2007")
) %>%
  filter(lifeExp.2002 > lifeExp.2007) %>%
  select(country, continent, starts_with("lifeExp")) %>%
  mutate(diff = lifeExp.2002 - lifeExp.2007)
## # A tibble: 5 × 5
##        country continent lifeExp.2002 lifeExp.2007  diff
##         <fctr>    <fctr>        <dbl>        <dbl> <dbl>
## 1        Gabon    Africa       56.761       56.735 0.026
## 2      Lesotho    Africa       44.593       42.592 2.001
## 3   Mozambique    Africa       44.026       42.082 1.944
## 4 South Africa    Africa       53.365       49.339 4.026
## 5    Swaziland    Africa       43.869       39.613 4.256

Further reading

dplyr cheat sheet

dplyr vignette

ggplot2 is a very powerful plotting package for R, but sometimes you just want more: maybe you want to make your plots more accessible to colour-blind audiences. Or maybe you just don’t like the included themes. Or maybe you just want more colour in your life, like some of the students coming to the R club and wondering how to make pink plots.

Before we start, a good tip is to look at the syntax for the built-in themes such as theme_bw, theme_classic or just our good friend theme_grey (just type the name into the R console and press return). Another good tip is to use the help files, they can tell you more about all the possibilities of each theme option – see ?theme.

Some other things that are important to know to understand this guide:

  1. All the highlighted lines in this guide are read as one command by R. Still, I would recommend that you copy this code into your editor of choice.
  2. You can use names of colours (see colours()) rather than hex codes to change the colours in a theme or when you are plotting. In a hex code such as “#e600e6”, the first two digits give the amount of red in hexadecimal (so ff is full-intensity red), the next two digits give the amount of green, and the last two the amount of blue. I have used hex codes here to get more control over the different shades of pink.
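As an aside, base R can convert a colour name to its hex code, which is handy when mixing the two styles (a quick check):

```r
# the hex code for the built-in colour "hotpink"
rgb(t(col2rgb("hotpink")), maxColorValue = 255)
## [1] "#FF69B4"
```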

And with that in mind let’s make a pink theme!

#Importing library
library(ggplot2)

#Making a pink theme by modifying theme_bw
theme_pink <- theme_bw() +
  theme(
    #Setting all text to size 20 and colour to purple/pink
    text = element_text(size = 20, colour = "#b300b3"),
    #Changing the colour of the axis text to pink
    axis.text = element_text(colour = "#e600e6"),
    #..and the axis ticks..
    axis.ticks = element_line(colour = "#B404AE"),
    #..and the panel border to purple/pink
    panel.border = element_rect(colour = "#B404AE"),
    #Changing the colour of the major..
    panel.grid.major = element_line(colour = "#ffe6ff"),
    #..and minor grid lines to light pink
    panel.grid.minor = element_line(colour = "#ffe6ff"),
    #Setting the legend title to size 16..
    legend.title = element_text(size = 16),
    #..and the legend text to size 14
    legend.text = element_text(size = 14),
    #Centering the plot title and making it bold
    plot.title = element_text(hjust = 0.5, face = "bold"),
    #Centering the subtitle
    plot.subtitle = element_text(hjust = 0.5)
  )

The next step is optional and that is to specify which colours we want to use for plotting.

#Pastel colours for each continent
#From http://www.colourlovers.com/palette/396247/Melting_Puppies)
melting_puppies <-c("#93DFB8","#FFC8BA","#E3AAD6",
                    "#B5D8EB","#FFBDD8") 

And now let’s use our new theme to make an actual plot, using the dataset from the gapminder package.


#Importing library with dataset
library(gapminder) 

#Subset with data from year 2007
year.2007.df <- subset(gapminder, year == 2007)

#Making the plot

#A plot of the 2007 data, with gdpPercap on the x-axis and lifeExp..
#..on the y-axis, using colour to identify continent and scaling..
#..point size by pop
pink.plot <- ggplot(year.2007.df,
                    aes(x = gdpPercap, y = lifeExp,
                        colour = continent, size = pop)) +
  #A scatterplot with semi-transparent points
  geom_point(alpha = 5/8) +
  #Using the pink theme we just made
  theme_pink +
  #Using the colours saved in melting_puppies for the points, with..
  #.."Continent" as the legend title
  scale_colour_manual(values = melting_puppies, name = "Continent") +
  #Scaling the point sizes to the range 2 to 22 and removing the..
  #..point-size guide from the legend
  scale_size_continuous(range = c(2, 22), guide = FALSE) +
  #Changing the labels of the x- and y-axes
  labs(x = "GDP per capita", y = "Life expectancy") +
  #Even though you have made a theme you can still edit it: here we..
  #..change the plot margins and move the legend into the lower-right..
  #..corner of the plot
  theme(plot.margin = unit(c(1, 1, 0.5, 0.5), "cm"),
        legend.position = c(1, 0),
        legend.justification = c(1.05, -0.05)) +
  #Adding a title and a subtitle
  ggtitle(label = "Life expectancy in 2007",
          subtitle = "(size of circles indicates population size)") +
  #Increasing the point size in the legend to 4 and making the points opaque
  guides(colour = guide_legend(override.aes = list(size = 4, alpha = 1)))

#Looking at the plot
pink.plot

#Saving plot
ggsave("pink-lifeexp.png")

This is what your example plot should now look like:

[Figure: pink-themed scatterplot of life expectancy against GDP per capita]

Congratulations, you now know the basics of how to make your own ggplot2 themes!


When data are imported into R it generally arrives as a data.frame, a type of object that contains named columns. We often want to access the contents of each column which can be done with the dollar or square-bracket notation.

attach() is used by some R users to make the columns of a data.frame directly accessible, rather than using dollar notation or data arguments.
Unfortunately, although attach() saves a little typing and maybe makes the very first R experience a tiny bit easier, there is a large cost, which I explore below.

The data.frame beaver1 holds body temperatures of a beaver (it is found in the datasets package, which is installed by default). These are the first few rows.

head(beaver1)
##   day time  temp activ
## 1 346  840 36.33     0
## 2 346  850 36.34     0
## 3 346  900 36.35     0
## 4 346  910 36.42     0
## 5 346  920 36.55     0
## 6 346  930 36.69     0

If we attach() beaver1, we can access any of the columns of the data.frame directly.

attach(beaver1)
quantile(temp)
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
boxplot(temp~activ)

Boxplot of body temperature against activity

Simple – what could possibly go wrong?

Let’s attach the second beaver dataset, which has the same column names as the first.

attach(beaver2)
## The following objects are masked from beaver1:
## 
##     activ, day, temp, time

That generated a load of warnings that various objects from beaver1 were masked. Now if we run quantile() or some other function, we get the data from beaver2.

quantile(temp)
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500

If we want to use beaver1 again, we have to detach() beaver2 first.

detach(beaver2)
quantile(temp)
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
detach(beaver1)

Attaching and detaching data.frames is obviously going to go horribly wrong very quickly unless you are very careful. Always.

One alternative would be to change the column names on one or both of the data.frames so that all column names are unique.

colnames(beaver1) <- paste("beaver1", colnames(beaver1), sep = "_")
colnames(beaver2) <- paste("beaver2", colnames(beaver2), sep = "_")
attach(beaver1)
attach(beaver2)
quantile(beaver1_temp)
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2_temp)
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500
rm(beaver1, beaver2)#clean up
detach(beaver1)
detach(beaver2)

Possible, but hardly elegant, and this is now as much typing as the dollar notation shown below.

Another solution would be to combine the two datasets. We need to add a column identifying which beaver is which.

beavers <- rbind(cbind(id = "beaver1", beaver1), cbind(id = "beaver2", beaver2))
attach(beavers)
quantile(temp[id == "beaver1"])
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(temp[id == "beaver2"])
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500

Having combined the data.frames, we now need to subset the objects to get the data for each beaver. This isn’t a simple solution (in some cases it will be horribly complicated), but combining the data.frames might be very useful for some analyses (for example if you wanted to run an ANOVA).

attach will also cause problems if there are objects with the same name as one of the columns of the data.frame.

temp <- 7 
quantile(temp)
##   0%  25%  50%  75% 100% 
##    7    7    7    7    7

If we attach the data.frames when an object temp already exists, we get a warning; if we make temp afterwards, no warning is given and this object masks the column in the data.frame. Obviously, this could cause some nasty bugs. Good luck with them.

It is of course possible to avoid these problems with masking if we are very careful with naming objects. In practice, it is very easy to make mistakes which then cause difficult to interpret bugs.

Even if we manage to avoid errors when using attach() it makes the code difficult to read as it is not obvious which data.frame each object is coming from.

Avoiding attach()

Fortunately there are alternatives to attach().

Referencing the columns in the data.frame

We can reference the columns in the data.frame by using either dollar notation or square bracket notation. Dollar notation is generally neater looking and needs less typing.

quantile(beaver1[,"temp"])
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2$temp)
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500
plot(x = beaver2$time %/% 100 + 24 * (beaver2$day - beaver2$day[1]) +
       (beaver2$time %% 100) / 60,
     y = beaver2$temp, col = ifelse(beaver2$activ, 2, 1), xlab = "Time")
# %/% is integer division: 5 %/% 2 is 2
# %% gives the modulus: 5 %% 2 is 1

Beaver body temperature against time
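The %/% and %% arithmetic used to convert clock times to decimal hours can be checked on a single value (e.g. 1530, i.e. 15:30):

```r
time <- 1530
time %/% 100                       # 15, the hour
time %% 100                        # 30, the minutes
time %/% 100 + (time %% 100) / 60  # 15.5 decimal hours
```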

When using the dollar notation, if there are spaces in the column names (a bad idea), the column name needs to be quoted.
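For example (a made-up one-column data.frame; backquotes work as well as double quotes):

```r
df <- data.frame("sepal length" = c(5.1, 4.9), check.names = FALSE)
df$"sepal length"
## [1] 5.1 4.9
df$`sepal length` # backquotes work too
```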

with() and within()

If the command you are using makes many references to a data.frame, the code can become rather messy, as in the previous example where the day and time columns are combined to give hours.

In such cases the with() function can be useful. It is equivalent to attaching the data.frame for this block of code only (without problems with masking).

with(beaver1, {
  hours <- time %/% 100 + 24 * (day - day[1]) + (time %% 100) / 60
  plot(hours, temp, col = ifelse(activ, 2, 1))
})

with() can make code much easier to read.

within() is useful if we want to modify the contents of a data.frame. Here, I’m using the lubridate package to make the date and time of each measurement.

library(lubridate)
beavers <- within(beavers, {
  day <- dmy("01-01-1990") + days(x = day-1) + hours(time %/% 100) + minutes(time %% 100)
  rm(time)# remove unwanted column
})
head(beavers)
##        id                 day  temp activ
## 1 beaver1 1990-12-12 08:40:00 36.33     0
## 2 beaver1 1990-12-12 08:50:00 36.34     0
## 3 beaver1 1990-12-12 09:00:00 36.35     0
## 4 beaver1 1990-12-12 09:10:00 36.42     0
## 5 beaver1 1990-12-12 09:20:00 36.55     0
## 6 beaver1 1990-12-12 09:30:00 36.69     0

The data argument

Any function that takes a formula argument (think y ~ x) has a data argument that can be given a data.frame.

library(mgcv)
mod <- gam(temp ~ s(unclass(day)), data  = beavers, subset = id == "beaver1")
#converting the date to seconds as gam doesn't like dates
plot(temp ~ day, data = beavers, subset = id == "beaver1", col = ifelse(activ, 2, 1), xlab = "Hours")
with(beavers, lines(day[id == "beaver1"], fitted(mod), col = 2))

Beaver body temperature against time

This is much better than attaching data as it makes it explicit which data are being used.

Do not be tempted to use the dollar notation in a formula.

## don't do this
  #lm(beavers$temp~beavers$day)
  

This makes the code difficult to read, especially if there are multiple predictors, and will cause problems when making predictions.
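A minimal sketch of the prediction problem, using beaver1: when the model is fitted with a data argument, predict() can be given new data by column name; with dollar notation in the formula, the newdata column name can never match.

```r
mod <- lm(temp ~ time, data = beaver1)
# predict at a new time; works because the model knows the column name
predict(mod, newdata = data.frame(time = 1200))

# fitted with dollar notation, the model's variable is "beaver1$time",
# so newdata is silently ignored:
# bad <- lm(beaver1$temp ~ beaver1$time)
```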

ggplot

ggplot(), an alternative system for plotting data from the ggplot2 package, wants to have the data in a data.frame. It does not need or want the data to be attached.

library(ggplot2)
ggplot(data =  beavers, mapping = aes(x = day, y = temp)) + 
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x)) +
  facet_wrap(~id, dir = "v", scales = "free_x") + 
  xlab("Hours")

Beaver body temperature against time

Does attach() have any uses?

attach() is not completely useless: it can also attach environments. This allows some nifty tricks for having utility functions etc. available without cluttering up your global environment (but it would probably be better to make a package – this is not hard).


30 November 2015

Importing excel data into R

Importing data is one of those things that is often much more effort than it should be. It is often convenient to enter data into excel, but the data then need to be imported into R. There are several ways to do this.

Probably the worst way to do this is to copy the data from excel and use read.table("clipboard"). It can lose precision (try importing 1/3 when the excel sheet has been set to display only one decimal place) and needs doing manually each time the data are imported. “clipboard” is OK for a quick peek at some data, but not as part of any serious analysis that needs to be repeatable.

I used to recommend that people save their data as a tab-delimited file and then import this with read.table(). The main problem with this is that it is easy to get the excel file and the tab-delimited file out of sync. Every time an edit is made to the excel file, the tab-delimited file needs to be re-created, and any edits that have been accidentally made to the tab-delimited file will be lost. Having two versions of the raw data is simply asking for trouble.

Importing the excel file directly is a much better strategy. There are several functions in R that will do this.

  • The RODBC package will read excel files, but only works with the 32-bit version of R which is a bit tedious.
  • The function read.xlsx() in the xlsx package. This needs java to be installed.
  • The function read_excel() in the readxl package. This is probably the fastest method.
  • Functions in the openxlsx package.
  • read.xls() in the gdata package, which is reported to be very slow
  • There are other solutions

See the stackoverflow discussion of the merits of different methods.

Most of these methods require that the excel worksheet(s) are clean. For example, only one table of data per sheet, columns are all of the same type (so no comments in an otherwise numeric field).

My current recommendation is to use the readxl package. It will only read data from excel, but this is what you want to do most of the time.

To install this package, run install.packages("readxl"). This only needs to be done once. Then run library(readxl) every time you start R.
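A minimal sketch of reading a workbook with readxl – the file name and sheet number here are hypothetical, so substitute your own:

```r
library("readxl")

# Hypothetical file: adjust the path and sheet to match your workbook
mydata <- read_excel("mydata.xlsx", sheet = 1)
head(mydata)
```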

Oct/15

26

Avoid using T to mean TRUE

In R code, it is legal to use T and F to mean TRUE and FALSE respectively. However, TRUE and FALSE are reserved words – they can only ever mean TRUE or FALSE. Code like

TRUE <- 7

will throw an error.

T and F are not so protected. This means that code like

T <- FALSE

is completely legal.

Of course, you wouldn’t deliberately do this (except perhaps at the start of April), but it is possible to do it accidentally. Perhaps, for example, a column of a data.frame is named T or F and you have attached it. It is best to be safe and always use TRUE and FALSE. This also makes code easier to read.
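A minimal demonstration of the difference, safe to run in a fresh session:

```r
T <- FALSE   # legal: T is an ordinary variable, here masking the default
isTRUE(T)    # FALSE - any code relying on T now misbehaves
rm(T)        # remove the masking variable
isTRUE(T)    # TRUE - T falls back to its default value again
# TRUE <- FALSE   # by contrast, this throws an error
```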


Here is a guide for the kind of person who needs to get their data into R and has never done so, or is struggling to get their data to load. I’ve tried to explain using simple words and lots of detail – the post is aimed at people who are not comfortable with code, so programmers might find it too simplified!

So, let’s start with the basics. Why are we bothering? At the beginning, we bother for three main reasons, IMHO.

  1. When you write some code, you’ve got a record of exactly what you’ve done to your data. This means that the methods section is much easier to write.
  2. This also means that if you have to change one little annoying thing in your data file, you can run all the analyses again with one mouse click. For me, this is the most important reason!
  3. You have much better control over settings, particularly in graphics. That way, you don’t end up with a horrible excel-default-settings figure (which a journal will reject).

Of course, advanced users will also highlight that R has any function you can dream of, or at least the building blocks to make that function. And best of all R is free!

Before you begin, you really really should use a text editor for writing your code, not the R GUI (graphical user interface – the windows which open when you select R in your list of programs). This is because text editors do useful things like save your code, and colour it in pretty colours for the names of functions and arguments – the same as the code in this blog. Just believe me, this is extremely useful for finding mistakes. I use Tinn-R but nearly everyone is now using RStudio – both are free and open-source. Find a text editor that works with R and use it!

Now, R thinks about the world a particular way. R divides the world into objects, functions and arguments. R commands are very often in the form “Make me an object that is made by doing these functions to that other object, according to those arguments”. Arguments are kind of like settings. Your data, as far as R sees it, is an object. But it’s probably an Excel or Open Office spreadsheet, which R does not deal directly with and cannot use as an object. R would much rather that your data was a data object, so you want to make a data object. Many first-timers use the clipboard function, and we often do that in teaching because we have the datasets ready and in the right format. Chances are that you don’t have your data in the right format, so we’re going to take a slightly longer route which we have much more control over.
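As a tiny illustration of this object/function/argument pattern (the names here are made up for the example):

```r
heights <- c(1.6, 1.7, NA, 1.8)     # an object: a vector of data
avg <- mean(heights, na.rm = TRUE)  # a function (mean) applied to it,
                                    # with an argument (na.rm) as a setting
avg                                 # a new object holding the result: 1.7
```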

I’m going to describe how to load a “fat” (variables x samples) table. Examples might be quadrats and plant species, or water samples and chemical measurements (pH, salinity, temperature). You should prepare the Excel file with the variables along the top row, and the site or sample number down the first column. There should not be any special characters (stuff like / ! #) though Norwegian letters should be ok. Variable names need to start with a letter – microbiologists using numbers to identify genes should just add an “a” in front, it can be taken out later if needed. Numbers are fine for sample names (as are number-letter combinations). Save the spreadsheet as a tab-delimited text file. Here’s the top corner of an excel spreadsheet set up correctly:

Screenshot of example data in Excel

A screenshot of the top left hand few rows and columns of an example excel file.

Next, open your text editor. The first thing to do is write a line of code to tell R where you keep your data. Then R will always look there to open files, and it will always put figures and so on in the same place. This place is called the ‘working directory’. So on my computer at home, I do this:


setwd("C:\\MyDocuments\\Rstuff")

Rstuff is the folder where I’m storing that text file of data, and where I want the graphs to end up. See all those \\ where the file path normally just has the one \? They are because R sees \ as a ‘special character’, so you need to tell it “no R, that really is a file directory marker, not a special thingy.” The absolute easiest way to get the path right is with the function file.choose(), which opens a regular window and lets you browse to any file in the folder that you want. That prints the file path in the R console in the right format, and you can copy the folder part of that file path into the setwd() function.


file.choose()

The brackets are important, but leave them empty like I’ve shown.

Now we make a data object in R by using the read.table function. I’ve made an example of data that you can download by clicking: community_data

but you should try this with your own dataset. We made our Excel spreadsheet into a .txt file, because it makes this next step a lot simpler and more reliable.

 


community.df <- read.table("Community_data.txt", sep = "\t", header = TRUE, row.names = 1)

The .df means data frame, a kind of data object. Putting that in the new object’s name doesn’t force R to make it a data frame, but helps you remember what you were intending. The <- (an arrow made out of a “less than” sign and a hyphen) tells R that on the left of the arrow is what you want it to make, and on the right is how you want it made. The read.table is the function – what it is you are asking R to do. "Community_data.txt" is the name of the data file – don’t forget the "…" and the .txt at the end. The sep = "\t" tells R that this is a tab-delimited file. The header = TRUE tells R that your data has a header, that is, that the first row of the .txt file is the column names rather than data. Finally, row.names = 1 tells R that you have sample numbers in the first column.

So now we have two lines of code in our text editor. Most text editors will send code to R by clicking a button. I’m going to let the text editor help files tell you how to do that, because it varies between text editors and operating systems. So now send those two lines.

Did it work? Let’s look at the data. We can ask R what it thinks the data object is.


str(community.df)

This command does not change your data, but just displays the structure of the data. You can check that the file has read in correctly, and has the right number of species (columns) and samples (rows). The example data should say 'data.frame': 10 obs. of 11 variables: and then list the species names.

You can look at all the data at once, by just typing the name of the data object


community.df

But this is a bad idea if you have more than ten samples. Instead


head(community.df)

shows you the first six rows.

Here is an example of how it should look.

Screenshot of R with example code run

A screenshot of how the code in this post should look in R.

Not working?
A common error message comes at the read.table stage.

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'Communitydata.txt': No such file or directory

This means that either you have set the working directory wrong, or typed the filename wrong. Remember to type the suffix for the file type, here .txt!
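If you hit that error, a couple of base R commands run from the console quickly show where R is looking and what it can actually see:

```r
getwd()                           # where R is currently looking for files
list.files(pattern = "\\.txt$")   # the .txt files visible in that directory
file.exists("Community_data.txt") # TRUE only if R can find this exact name
```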

· · ·

May/15

21

Progress bars and for loops.

Today somebody asked me about building a progress bar into a for loop. This can be really useful if you are running lots of bootstrapping or Monte Carlo simulations, and you want some peace of mind that the loop is still running as the computer chugs away in the background. It’s good to know whether it’s worth hanging around for the code to finish, whether it’s better to go climbing or skiing for the weekend, or whether there is just time for a cup of tea.

I’ve written a dummy script here to show how this can be done. Fortunately the built-in utils package contains a function, txtProgressBar(), for displaying a progress bar on the R console. I thought it would be fun for the for loop to actually do something, so I made it write out a message (stored in the cdRclb object) in an empty plot window. Here is the script:

# Write a message to plot and make an empty plot window
cdRclb <- c("c", "o", "d", "e", "R", "c", "l", "u", "b")
plot(x = 1:9, y = 1:9, type = "n", axes = FALSE, frame.plot = TRUE, ann = FALSE)

So that the progress bar takes at least some time to finish, I decided to run the for loop for 900000 iterations (100000 x the length() of the object cdRclb).

# Set the number of iterations for the loop
nIts <- length(cdRclb) * 100000

Then make the progress bar and write the loop. You call txtProgressBar() first, outside of the loop, and everything within the {} is repeated for nIts, the number of iterations. Here, for each iteration we find a new value, k, the iteration number in the loop divided by 100000.

Still within the for loop, we use an if statement to decide whether we should write out some of the message stored in cdRclb. When k is a whole number, we write an item of cdRclb using text(). At the end of each iteration we update the progress bar using setTxtProgressBar() and the loop starts over. Check what happens in the plot window as the loop progresses, and also check the R console. Fun and extremely satisfying!

# create the progress bar
progBar <- txtProgressBar(min = 0, max = nIts, style = 3)

# Start the for loop.
for(i in 1:nIts){

  # Find k. NB, lots of other functions/commands could go here.
  k <- i/100000

  # Use an if statement to decide whether to plot part of the message.
  # Only do this when k is a whole number
  if(floor(k) == k) text(x = k, y = 5, labels = cdRclb[k], col = k + 1, cex = 4)

  # Update the progress bar
  setTxtProgressBar(progBar, i)
}

# Close the progress bar
close(progBar)

· · ·

In Friday’s codeRclub, we had a problem which involved finding the row and column names for items in a matrix greater than a specified value (e.g. finding the names of the pairs of samples in a correlation matrix with correlation coefficient greater than 0.5). The problem is that with standard sub-setting methods you can find the locations or values of the cells within the matrix, but not the row or column names. We solved the problem using the arr.ind argument of the which() function. We wrote a function to do this, returning the row and column names and the correlation coefficients in a data.frame.

First simulate a correlation matrix and set our correlation cut-off value:

x <- matrix(c(1, .8, .2, .8, 1, .7, .2, .7, 1), nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c"))) # Simulate the 3x3 matrix and give the matrix row and column names of the samples

Then make a function, which.names.matrix, to return the row and column names of interest. x is a correlation matrix, cutVal is your correlation cut-off value.

which.names.matrix <- function(x, cutVal = 0.5){

  # Because it's a correlation matrix, we are only interested in one half
  # of it, so set the lower triangle to NA
  x[lower.tri(x)] <- NA

  # Set the diagonal to NA
  diag(x) <- NA

  # Find the locations of the cells in the matrix > cutVal
  locs <- which(x > cutVal, arr.ind = TRUE)

  # Get the scores of the cells > cutVal
  scores <- na.omit(x[x > cutVal])

  # Return a data.frame with the row and column names, plus the scores
  data.frame(row = rownames(x)[locs[, 1]], col = colnames(x)[locs[, 2]], value = scores)
}
which.names.matrix(x)
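Applied to the example matrix, the function should pick out the two correlations above the cut-off. Here is a minimal self-contained check of the key trick – which() with arr.ind = TRUE returns (row, column) indices that can be used to look up the dimnames:

```r
# Re-create the example 3x3 correlation matrix with sample names
m <- matrix(c(1, .8, .2, .8, 1, .7, .2, .7, 1), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
m[lower.tri(m, diag = TRUE)] <- NA      # keep only the upper triangle
locs <- which(m > 0.5, arr.ind = TRUE)  # (row, col) indices of cells > 0.5
res <- data.frame(row = rownames(m)[locs[, 1]],
                  col = colnames(m)[locs[, 2]],
                  value = m[locs])      # indexing with a 2-column matrix
res  # the pairs a-b (0.8) and b-c (0.7)
```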

· · ·

May/15

7

Expressions in R

expression() and related functions including bquote() are powerful tools for annotating figures with mathematical notation in R. This functionality is not obvious from their respective help files. demo(plotmath) nicely shows the huge potential of expression(), but does not help that much with getting the code needed for many real cases.

I tend to get my expressions to work by trial and lots of errors (although having put this together, I now understand them at least temporarily). I’ve just searched through my code library and extracted and annotated some examples of expression() being used. I hope someone finds it useful.

I’m going to use expression() with title(), but the same expressions can be used with any of the functions (text(), title(), mtext(), legend(), etc) used for putting text on plots.

x11(width = 4, height = 5, pointsize = 14)
par(mar = rep(0, 4), cex.main = 0.8)
plot(1, type = "n", axes = FALSE, ann = FALSE)

The simplest use of expression() is to take a character string, which will be added to the plot. If the string contains spaces, it must be enclosed in quotes (alternatively, the space can be replaced by a tilde ~, which probably gives better code).

title(line=-1, main=expression(fish))

This use of expression is entirely pointless, but is a useful starting point. Some strings have special meanings, for example infinity will draw the infinity symbol. If for some reason you want to have “infinity” written on your plot, it must be in quotes. Greek letters can be used by giving their name in lower-case or with the first letter capitalised to get the lower or upper case character respectively.

title(line=-2, main=expression(infinity))
title(line=-3, main=expression(pi))
title(line=-4, main=expression(Delta))

Superscript or subscript can be added to a string using ^ and [] notation respectively.

title(line=-5, main=expression(r^2))
title(line=-6, main=expression(beta[1]))

If the string we want to have as sub- or superscript contains a space, the string must be in quotes. Braces can be used to force multiple elements to all be superscript.

Strings can be separated by mathematical operators.

title(line=-7, main=expression(N[high]-N[low]))
title(line=-8, main=expression(N[2]==5))

To make more complicated expressions, build them up from separate parts by either using * or paste to join them together (if you want a multiplication symbol, use %*%). The * notation gives nicer code.

title(line=-9, main=expression(Delta*"R yr"))
title(line=-10, main=expression(paste(Delta,"R yr")))
title(line=-11, main=expression(paste("Two Year Minimum ",O[2])))
#title(line=-11, main=expression(Two~Year~Minimum~O[2]))
title(line=-12, main=expression(paste("Coefficient ", beta[1])))
#title(line=-12, main=expression(Coefficient~beta[1]))
title(line=-13, main=expression(paste("TP ", mu,"g l"^-1)))
#title(line=-13, main=expression(TP~mu*g~l^-1))
title(line=-14, main=expression(paste(delta^18,"O")))
#title(line=-14, main=expression(delta^18*O))
title(line=-15, main=expression(paste("Foram ", exp(H*minute[bc]))))
#title(line=-15, main=expression(Foram~exp(H*minute[bc])))

To start an expression() with a superscript (or subscript), I use an empty string (you can also use phantom()).

title(line=-16, main= expression(""^14*C*" years BP"))
#title(line=-16, main= expression(phantom()^14*C~years~BP))

So far so good. But sometimes, you want to use the value of an R-object in plot annotation.

For example, if we wanted to label a point with its x value, this will not work.

x<-5
title(line=-17, main= expression(x==x))

Instead of using expression(), we have to use bquote(), with the object we want written out inside .()

title(line=-18, main= bquote(x==.(x)))
title(line=-19, main= bquote(x==.(x)~mu*g~l^-1))

Plot annotations with expression and bquote

If you understand these examples, you should be able to use the remainder of the functionality demonstrated by demo(plotmath) and at ?plotmath.

· · ·

