R Programming

Introduction

The purpose of this monograph is to provide a reference for scientists and programmers working on problems in bioinformatics and computational biology. It may also appeal to programmers who want to improve their programming skills or programmers who have been working in bioinformatics and computational biology but are familiar with languages other than R.

A reasonable level of programming skill is presumed, as is some familiarity with the basic tasks that need to be carried out in bioinformatics. We concentrate on programming tools, and there is no discussion of graphics or of the multitude of software for fitting models or carrying out machine learning.

Reasonable coverage of these topics would result in a much longer monograph, and to some extent they are orthogonal to our purpose.

Bioinformatics blossomed as a scientific discipline in the 1990s when a number of technological innovations appeared that revolutionized biology.

Suddenly, data on the complete genomic sequence of many different organisms were available, microarrays could measure the abundance of tens of thousands of mRNA species, and other arrays and technologies made it possible to study protein interactions and many other cellular processes at the molecular level.

Biology moved, virtually overnight, from a small-data discipline to one with large, complex data sets. Faced with these sudden challenges, scientific programmers grabbed whatever tools were available and used them to address some of the many problems.

Perl was perhaps the most widely used of these tools, and it remains a dominant player to this date. Other popular programming languages such as Java and Python are also used. R is an implementation of the S language (Becker et al., 1988; Chambers and Hastie, 1992; Chambers, 1998). S has been a favorite tool for statisticians and data analysts since the early 1980s, when John Chambers and colleagues started to release versions of it from Bell Labs. R is now becoming one of the most widely used software tools for bioinformatics.

This is mainly due to its flexibility and its data-handling and modeling capabilities. Some of these have been exposed through the Bioconductor Project (Gentleman et al., 2004), but many users simply find it a useful tool for doing analyses