codeRclub | bioCEED R coding club

CAT | For absolute beginners

ggplot2 is a very powerful plotting package available in R, but sometimes you just want more: maybe you want to make your plots more accessible to colour-blind audiences. Or maybe you just don’t like the included themes. Or maybe you just want more colour in your life, like some of the students coming to the R-club and wondering how you can make pink plots.

Before we start, a good tip is to look at the syntax for the built-in themes such as theme_bw, theme_classic or just our good friend theme_grey (just type the name into the R console and press return). Another good tip is to use the help files, which can tell you more about all the possibilities of each theme option – see ?theme.

Some other things that are important to know to understand this guide:

  1. All the highlighted lines in this guide are actually read into R as one command. Still, I would recommend that you copy this code into your editor of choice.
  2. You can use names of colours (see colours()) rather than hex codes to change the colours in a theme or when you are plotting. In a hex code such as “#e600e6”, the first two digits say how much red is needed in hexadecimal (so ff is full-intensity red), the next two digits are for green, and the last two are for blue. I have just used the hex codes here to get some more control over the different shades of pink.
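If you want to check which digits belong to which colour, base R's col2rgb() splits a hex code into its red, green and blue parts. This is only a small sketch for exploring colours; it is not part of the theme code.

```r
# Split a hex colour into its red, green and blue components (0-255)
pink <- col2rgb("#e600e6")
pink

# Named colours work the same way; colours() lists every name R knows
head(colours())
col2rgb("hotpink")

# rgb() goes the other way: build a hex code from the three components
pink_hex <- rgb(230, 0, 230, maxColorValue = 255)
pink_hex
```

Here the red and blue rows of pink are both 230 (e6 in hexadecimal) and the green row is 0, which is why the colour looks pink.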

And with that in mind let’s make a pink theme!

#Importing library
library(ggplot2)

#Making a pink theme by modifying theme_bw 
theme_pink <- theme_bw()+                    
#Setting all text to size 20 and colour to purple/pink                                  
theme(text=element_text(size=20, colour="#b300b3"),    
#Changing the colour of the axis text to pink  
axis.text = element_text(colour="#e600e6"),
#..and the axis ticks..        
axis.ticks = element_line(colour="#B404AE"),  
#..and the panel border to purple/pink       
panel.border = element_rect(colour="#B404AE"),  
#Changing the colour of the major..
panel.grid.major = element_line(colour="#ffe6ff"), 
#..and minor grid lines to light pink     
panel.grid.minor = element_line(colour="#ffe6ff"),    
#Setting legend title to size 16..
legend.title = element_text(size=16),   
#..and the legend text to size 14                
legend.text = element_text(size=14),  
#Centering plot title text and making it bold                     
plot.title = element_text(hjust = 0.5, face="bold"),  
#Centering subtitle                  
plot.subtitle = element_text(hjust = 0.5))              

The next step is optional and that is to specify which colours we want to use for plotting.

#Pastel colours for each continent
melting_puppies <-c("#93DFB8","#FFC8BA","#E3AAD6",

And now let’s use our new theme to make an actual plot using the dataset from the gapminder package.

#Importing library with dataset
library(gapminder)

#Subset with data from year 2007
year.2007.df <- subset(gapminder, year == 2007)

#Making plot

#Making a plot with the data from 2007, with gdpPercap on the x-axis..
#..and lifeExp on the y-axis, and using colours to identify them by..
#..continent, and want the point size scaled according to pop
pink.plot <- ggplot(year.2007.df, aes(x=gdpPercap, y=lifeExp, colour=continent, size = pop)) +              
#Telling R that we want a scatterplot with semi-transparent points
geom_point(alpha = (5/8)) +              
#Telling R to use the pink theme we just made                                                     
theme_pink + 
#Changing the colours of the points into the colours saved in.. 
#..melting_puppies, and setting the legend title for the points to Continent
scale_colour_manual(values=melting_puppies, name="Continent") + 
#Scaling the size of the points so they fit within the range of..
#..2 to 22 and removing description of point size from legend
scale_size_continuous(range=c(2,22), guide = "none") + 
#Changing the labels of the x- and y-axis
labs(x="GDP per capita", y="Life expectancy") +
#Even though you have made a theme you can still edit it:
#Here we are changing the plot margins..
#..and that we want to change the position of the legend so it fits..
#..lower right corner of the plot
theme(plot.margin = unit(c(1,1,0.5,0.5), "cm"), legend.position = c(1, 0), legend.justification = c(1.05, -0.05)) +
#Adding a title and a subtitle
ggtitle(label = "Life expectancy in 2007", subtitle = "(size of circles indicates population size)") + 
#Increase the points size in the legend to 4, and making them opaque
guides(colour = guide_legend(override.aes = list(size=4, alpha = 1)))

#Looking at the plot
pink.plot

#Saving plot

This is what your example plot now should look like:

Congratulations, you now know the basics of how to make your own ggplot2 themes!


When data are imported into R, they generally arrive as a data.frame, a type of object that contains named columns. We often want to access the contents of each column, which can be done with the dollar or square-bracket notation.

attach() is used by some R-users to make the columns of a data.frame directly accessible rather than using dollar notation or data arguments.
Unfortunately, although attach() saves a little typing and maybe makes the very first R experience a tiny bit easier, there is a large cost, which I explore below.

The data.frame beaver1 holds body temperatures of a beaver (it is found in the datasets package, which is installed by default). These are the first few rows.

day time temp activ
346 840 36.33 0
346 850 36.34 0
346 900 36.35 0
346 910 36.42 0
346 920 36.55 0
346 930 36.69 0

If we attach() beaver1, we can access any of the columns of the data.frame directly.
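The code for this step did not survive on this page, so the following is a reconstruction rather than the original: attach the data.frame, then use the column names as if they were ordinary objects. It should give the quantiles and the boxplot that follow.

```r
# beaver1 ships with the datasets package, which is loaded by default
attach(beaver1)

# temp and activ are now found without writing beaver1$...
quantile(temp)
boxplot(temp ~ activ, xlab = "Activity", ylab = "Body temperature")
```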

##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300

Boxplot of body temperature against activity

Simple. What could possibly go wrong?

Let’s attach a second beaver dataset, beaver2, which has the same column names as the first.

  ## The following objects are masked from beaver1:
  ##     activ, day, temp, time

That generated a load of warnings that various objects from beaver1 were masked. Now if we run quantile or some other function, we get the data from beaver2.

##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500

If we want to use beaver1 again, we have to detach() beaver2 first.

##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300

Attaching and detaching data.frames is obviously going to go horribly wrong very quickly unless you are very careful. Always.
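Put together, the sequence described above probably looked something like the sketch below. It is a reconstruction, since the original code is missing from this page, and it assumes a fresh R session with no object called temp in your workspace.

```r
attach(beaver1)
attach(beaver2)   # message: activ, day, temp, time are masked from beaver1

# beaver2 was attached last, so its columns are found first
from_beaver2 <- identical(temp, beaver2$temp)
from_beaver2

detach(beaver2)
# now the search path finds beaver1's columns again
from_beaver1 <- identical(temp, beaver1$temp)
from_beaver1

detach(beaver1)   # tidy up the search path
```

Both comparisons should return TRUE, which shows that the meaning of temp depends entirely on what happens to be attached at that moment.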

One alternative would be to change the column names on one or both of the data.frames so that all column names are unique.

colnames(beaver1) <- paste("beaver1", colnames(beaver1), sep = "_")
colnames(beaver2) <- paste("beaver2", colnames(beaver2), sep = "_")
attach(beaver1)
attach(beaver2)
quantile(beaver1_temp)
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2_temp)
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500
detach(beaver1)
detach(beaver2)
rm(beaver1, beaver2) # clean up: removes the renamed copies

Possible, but hardly elegant, and this is now as much typing as the dollar notation shown below.

Another solution would be to combine the two datasets. We need to add a column identifying which beaver is which.

beavers <- rbind(cbind(id = "beaver1", beaver1), cbind(id = "beaver2", beaver2))
attach(beavers)
quantile(temp[id == "beaver1"])
##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(temp[id == "beaver2"])
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500

Having combined the data.frames, we now need to subset the objects to get the data for each beaver. This isn’t a simple solution (in some cases it will be horribly complicated), but combining the data.frames might be very useful for some analyses (for example if you wanted to run an ANOVA).

attach will also cause problems if there are objects with the same name as one of the columns of the data.frame.

temp <- 7
quantile(temp)
##   0%  25%  50%  75% 100% 
##    7    7    7    7    7

If we attach the data.frames when the object temp already exists, we get a warning; if we make temp afterwards, no warning is given and this object masks the column in the data.frame. Obviously, this could cause some nasty bugs. Good luck with them.
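A minimal sketch of the silent case, where the object is created after attaching. It assumes you start with no object called temp in your workspace.

```r
attach(beaver1)
temp <- 7                  # a new object in the workspace; no warning is given

masked <- quantile(temp)   # uses the new temp, not the column
masked
quantile(beaver1$temp)     # the column is still there, just hidden

rm(temp)                   # removing the object uncovers the column again
detach(beaver1)
```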

It is of course possible to avoid these problems with masking if we are very careful with naming objects. In practice, it is very easy to make mistakes, which then cause difficult-to-interpret bugs.

Even if we manage to avoid errors when using attach() it makes the code difficult to read as it is not obvious which data.frame each object is coming from.

Avoiding attach()

Fortunately there are alternatives to attach().

Referencing the columns in the data.frame

We can reference the columns in the data.frame by using either dollar notation or square bracket notation. Dollar notation is generally neater looking and needs less typing.
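The calls themselves are missing from this page. As a sketch, the two notations look like this; the output that follows in the text would come from calls of this kind.

```r
# Dollar notation
quantile(beaver1$temp)

# Square-bracket notation: by column name, or with [[ ]] for a single column
quantile(beaver2[, "temp"])
quantile(beaver2[["temp"]])
```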

##      0%     25%     50%     75%    100% 
## 36.3300 36.7600 36.8700 36.9575 37.5300
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500
plot(beaver2$time %/% 100 + 24*(beaver2$day - beaver2$day[1]) + (beaver2$time %% 100)/60, 
     y = beaver2$temp, col = ifelse(beaver2$activ, 2, 1), xlab = "Time")
# %/% is integer division: 5%/%2 = 2
# %% gives the modulus: 5%%2 = 1

Beaver body temperature against time

When using the dollar notation, if there are spaces in the column names (a bad idea), the column name needs to be quoted.
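For example, with a made-up data.frame (messy and its column name are invented for this illustration):

```r
# check.names = FALSE stops data.frame() from changing the name to body.temp
messy <- data.frame(`body temp` = c(36.3, 36.4), check.names = FALSE)

messy$`body temp`      # backticks
messy$"body temp"      # ordinary quotes also work after $
messy[["body temp"]]   # square brackets always take a quoted name
```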

with() and within()

If the command you are using makes many references to a data.frame, the code can become rather messy, as in the previous example where the day and time columns are combined to give hours.

In such cases the with() function can be useful. It is equivalent to attaching the data.frame for this block of code only (without problems with masking).

with(beaver1, {
    hours <- time %/% 100 + 24*(day - day[1]) + (time %% 100)/60
    plot(hours, temp, col = ifelse(activ, 2, 1))
})

with() can make code much easier to read.

within() is useful if we want to modify the contents of a data.frame. Here, I’m using the lubridate package to make the date and time of each measurement.

library(lubridate)
beavers <- within(beavers, {
  day <- dmy("01-01-1990") + days(x = day-1) + hours(time %/% 100) + minutes(time %% 100)
  rm(time)# remove unwanted column
})
head(beavers)
##        id                 day  temp activ
## 1 beaver1 1990-12-12 08:40:00 36.33     0
## 2 beaver1 1990-12-12 08:50:00 36.34     0
## 3 beaver1 1990-12-12 09:00:00 36.35     0
## 4 beaver1 1990-12-12 09:10:00 36.42     0
## 5 beaver1 1990-12-12 09:20:00 36.55     0
## 6 beaver1 1990-12-12 09:30:00 36.69     0

The data argument

Any function that takes a formula argument (think y ~ x) has a data argument that can be given a data.frame.

library(mgcv)
#converting the date to seconds with unclass() as gam doesn't like dates
mod <- gam(temp ~ s(unclass(day)), data  = beavers, subset = id == "beaver1")
plot(temp ~ day, data = beavers, subset = id == "beaver1", col = ifelse(activ, 2, 1), xlab = "Hours")
with(beavers, lines(day[id == "beaver1"], fitted(mod), col = 2))

Beaver body temperature against time

This is much better than attaching data as it makes it explicit which data are being used.

Do not be tempted to use the dollar notation in a formula

## don't do this

This makes the code difficult to read, especially if there are multiple predictors, and will cause problems when making predictions.
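The prediction problem is easiest to see with a small example. This sketch uses lm() on beaver1 rather than the gam model above, and new_times is made-up data.

```r
# Hypothetical new data to predict for
new_times <- data.frame(time = c(1000, 1100, 1200))

# Preferred: column names in the formula, data.frame in the data argument
good <- lm(temp ~ time, data = beaver1)
predict(good, newdata = new_times)          # three predictions, one per new row

# Don't do this: the formula is tied to beaver1, so newdata cannot be used
bad <- lm(beaver1$temp ~ beaver1$time)
length(predict(bad, newdata = new_times))   # one value per row of beaver1, plus a warning
```

The second model still returns numbers, which is what makes this mistake easy to miss: they are fitted values for the original data, not predictions for the new times.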


ggplot(), an alternative system for plotting data from the ggplot2 package, wants to have the data in a data.frame. It does not need or want the data to be attached.

library(ggplot2)
ggplot(data =  beavers, mapping = aes(x = day, y = temp)) + 
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x)) +
  facet_wrap(~id, dir = "v", scales = "free_x")

Beaver body temperature against time

Does attach() have any uses?

attach() is not completely useless: attach() can also attach environments. This allows some nifty tricks for having utility functions etc. available without cluttering up your global environment (but it would probably be better to make a package – this is not hard).
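A minimal sketch of that trick. The environment name my_utils and the helper se() are invented for this example.

```r
# Put helper functions in their own environment
my_utils <- new.env()
my_utils$se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

# Attach the environment: se() becomes available...
attach(my_utils, name = "my_utils")
se_value <- se(beaver1$temp)
se_value

# ...but se itself is not listed among the objects in your workspace
"se" %in% ls()

detach("my_utils")
```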




Avoid using T to mean TRUE

In R code, it is legal to use T and F to mean TRUE and FALSE respectively. However, TRUE and FALSE are reserved words – they can only be used to mean TRUE or FALSE. Code like


will return an error.

T and F are not so protected. This means that code like


is completely legal.

Of course, you wouldn’t deliberately do this (except perhaps at the start of April), but it is possible to do it accidentally. Perhaps, for example, a column of a data.frame is named T or F and you have attached it. It is best to be safe and always use TRUE and FALSE. This also makes code easier to read.
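The code examples for this post are missing from the page. The sketch below shows the same two points: T can be overwritten, TRUE cannot. The failing assignment is wrapped in tryCatch() so that the script keeps running.

```r
# T is an ordinary variable that happens to start out as TRUE
T <- FALSE    # legal, and now T no longer means TRUE
isTRUE(T)
rm(T)         # remove our copy, so T is found in base R again

# TRUE is a reserved word, so the same assignment fails
result <- tryCatch(
  eval(parse(text = "TRUE <- FALSE")),
  error = function(e) "error"
)
result
```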


Here is a guide for the kind of person who needs to get their data into R and has never done so, or is struggling to get their data to load. I’ve tried to explain using simple words and lots of detail – the post is aimed at people who are not comfortable with code, so programmers might find it too simplified!

So, let’s start with the basics. Why are we bothering? At the beginning, we bother for three main reasons, IMHO.

  1. When you write some code, you’ve got a record of exactly what you’ve done to your data. This means that the methods section is much easier to write.
  2. This also means that if you have to change one little annoying thing in your data file, you can run all the analyses again with one mouse click. For me, this is the most important reason!
  3. You have much better control over settings, particularly in graphics. That way, you don’t end up with a horrible Excel-default-settings figure (which a journal will reject).

Of course, advanced users will also highlight that R has any function you can dream of, or at least the building blocks to make that function. And best of all R is free!

Before you begin, you really really should use a text editor for writing your code, not the R GUI (graphical user interface – the windows which open when you select R in your list of programs). This is because text editors do useful things like save your code, and colour it in pretty colours for the names of functions and arguments – the same as the code in this blog. Just believe me, this is extremely useful for finding mistakes. I use Tinn-R but nearly everyone is now using RStudio – both are free and open-source. Find a text editor that works with R and use it!

Now, R thinks about the world a particular way. R divides the world into objects, functions and arguments. R commands are very often in the form “Make me an object that is made by doing these functions to that other object, according to those arguments”. Arguments are kind of like settings. Your data, as far as R sees it, is an object. But it’s probably an Excel or Open Office spreadsheet, which R does not deal directly with and cannot use as an object. R would much rather that your data was a data object, so you want to make a data object. Many first-timers use the clipboard function, and we often do that in teaching because we have the datasets ready and in the right format. Chances are that you don’t have your data in the right format, so we’re going to take a slightly longer route which we have much more control over.

I’m going to describe how to load a “fat” (variables x samples) table. Examples might be quadrats and plant species, or water samples and chemical measurements (pH, salinity, temperature). You should prepare the Excel file with the variables along the top row, and the site or sample number down the first column. There should not be any special characters (stuff like / ! #), though Norwegian letters should be ok. Variable names need to start with a letter – microbiologists using numbers to identify genes should just add an “a” in front; it can be taken out later if needed. Numbers are fine for sample names (as are number-letter combinations). Save the spreadsheet as a tab-delimited text file. Here’s the top corner of an Excel spreadsheet set up correctly:

Screenshot of example data in Excel

A screenshot of the top left-hand few rows and columns of an example Excel file.

Next, open your text editor. The first thing to do is write a line of code to tell R where you keep your data. Then R will always look there to open files, and it will always put figures and so on in the same place. This place is called the ‘working directory’. So on my computer at home, I do this:

setwd("C:\\MyDocuments\\Rstuff")

Rstuff is the folder where I’m storing that text file of data, and where I want the graphs to end up. See all those \\ where the file path normally just has the one \? They are there because R sees \ as a ‘special character’, so you need to tell it “no R, that really is a file directory marker, not a special thingy.” The absolute easiest way to do this is with something called file.choose, which opens a regular window and lets you browse to any file in the folder that you want. That puts the file path in the R window in the right format, and you can copy the folders bit of that file path to the setwd function.

file.choose()

The brackets are important, but leave them empty like I’ve shown.
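If the doubled backslashes are a nuisance, there are a couple of alternatives. The sketch below uses the same example folder as above, so change the path to suit your own computer. The last part only runs when you are using R interactively, because it opens a file browser.

```r
# Forward slashes work in R paths on every operating system, Windows included,
# so the doubled backslashes can be avoided
my_folder <- "C:/MyDocuments/Rstuff"

# file.path() joins folder names with forward slashes for you
joined <- file.path("C:", "MyDocuments", "Rstuff")
joined

# Browse to any file in the folder, then use the folder part of its path
# as the working directory
if (interactive()) {
  setwd(dirname(file.choose()))
  getwd()
}
```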

Now we make a data object in R by using the read.table function. I’ve made an example of data that you can download by clicking: community_data

but you should try this with your own dataset. We made our Excel spreadsheet into a .txt file, because it makes this next step a lot simpler and more reliable.


community.df <- read.table("Community_data.txt", sep="\t", header=TRUE, row.names=1)

The .df means data frame, a kind of data object. Putting that in the new object’s name doesn’t force R to make it a data frame, but helps you remember what you were intending. The <- (an arrow made out of a “less than” sign and a hyphen) tells R that on the left of the arrow is what you want it to make, and on the right is how you want it made. The “read.table” is the function – what it is you are asking R to do. "Community_data.txt" is the name of the data file – don’t forget the "…" and the .txt at the end. The sep="\t" tells R that this is a tab-delimited file. The header=TRUE tells R that your data has a header, that is, that the first row of the .txt file is the column names rather than data. Finally, row.names=1 tells R that you have sample numbers in the first column.
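If you want to see what these arguments do before trying your own file, here is a small self-contained sketch. It writes a tiny made-up table to a temporary folder and reads it back; the file name and species names are invented for the example.

```r
# Write a tiny tab-delimited file to a temporary folder (made-up data)
example_file <- file.path(tempdir(), "tiny_example.txt")
writeLines(c("site\tspeciesA\tspeciesB",
             "1\t0\t3",
             "2\t5\t1"),
           example_file)

# Same arguments as in the text: tab separator, header row, sample names in column 1
tiny.df <- read.table(example_file, sep = "\t", header = TRUE, row.names = 1)

str(tiny.df)        # structure: 2 obs. of 2 variables
rownames(tiny.df)   # the sample numbers from the first column
```

Because of row.names=1, the site column becomes the row names rather than a variable, so only the two species columns are counted as variables.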

So now we have two lines of code in our text editor. Most text editors will send code to R by clicking a button. I’m going to let the text editor help files tell you how to do that, because it varies between text editors and operating systems. So now send those two lines.

Did it work? Let’s look at the data. We can ask R what it thinks the data object is.

str(community.df)

This command does not change your data, but just displays the structure of the data. You can check that the file has read in correctly and has the right number of species (columns) and samples (rows). The example data should say “'data.frame': 10 obs. of 11 variables:” and then list the species names.

You can look at all the data at once by just typing the name of the data object:

community.df

But this is a bad idea if you have more than ten samples. Instead,

head(community.df)

shows you the first six rows.

Here is an example of how it should look.

Screenshot of R with example code run

A screenshot of how the code in this post should look in R.

Not working?
A common error message comes at the read.table stage.

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'Communitydata.txt': No such file or directory

This means that either you have set the working directory wrong, or typed the filename wrong. Remember to type the suffix for the file type, here .txt!
