md files from swcarpentry#256
chendaniely committed Jul 16, 2017
1 parent 1b1ef5a commit 8d04ace
Showing 1 changed file with 58 additions and 23 deletions.
81 changes: 58 additions & 23 deletions _episodes/06-best-practices-R.md
@@ -26,8 +26,9 @@ keypoints:
---


### Keep track of who wrote your code and its intended purpose

1. Start your code with an annotated description of what the code does:
Starting your code with an annotated description of what the code does when it is run will help you when you have to look at or change it in the future. Just one or two lines at the beginning of the file can save you or someone else a lot of time and effort when trying to understand what a particular script does.


~~~
@@ -36,7 +37,11 @@ keypoints:
~~~
{: .r}

2. Next, load all of the packages needed to run your code (using `library()`):
### Be explicit about the requirements and dependencies of your code


Loading all of the packages your code needs (using `library()`) at the top of the script shows at a glance what must be installed before the code can run. It can be frustrating to make it two-thirds of the way through a long-running script only to find out that a dependency hasn't been installed.



~~~
@@ -45,42 +50,71 @@ library(reshape)
library(vegan)
~~~
{: .r}
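
One way to catch a missing dependency before any long-running work starts is to check for it at the top of the script. The following is a minimal sketch, not part of the original lesson: the helper name `find_missing_packages()` is hypothetical, and the package names passed to it are only placeholders (in this lesson's example you would pass `c("reshape", "vegan")`).

```r
# Hypothetical helper: returns the names of packages that are not installed.
# requireNamespace() returns FALSE (rather than raising an error) when a
# package cannot be found.
find_missing_packages <- function(packages) {
  is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!is_installed]
}

# Placeholder package list; replace with the packages your script needs.
missing_packages <- find_missing_packages(c("stats", "utils"))

# Stop right away, with a readable message, if anything is missing.
if (length(missing_packages) > 0) {
  stop("Please install: ", paste(missing_packages, collapse = ", "))
}
```

Running the check first means the script fails in seconds instead of partway through the analysis.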
Another way you can be explicit about the requirements of your code and improve its reproducibility is to limit the "hard-coding" of the input and output files for your script. If your code will read in data from a file, define a variable early in your code that stores the path to that file. For example:

If you use only one or two functions from a package, it is sometimes useful to note that fact in a comment, e.g. `library(reshape2) ## for melt()`

3. Set your working directory before `source()`ing a script, or start `R` inside your project folder:
~~~
input_file <- "data/data.csv"
output_file <- "data/results.csv"
# read input
input_data <- read.csv(input_file)
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function(input_file, sample_number)
# write results
write.table(results, output_file)
~~~
{: .r}

Exercise caution when using `setwd()`. Changing directories in a script file can limit reproducibility:
is preferable to


* `setwd()` will return an error if the directory you're trying to change to doesn't exist or if the user doesn't have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently.
* If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, which will cause further problems if they then call the script again. If you must use `setwd()`, it is best to put it at the top of the script to avoid these problems. Putting a commented-out `setwd()` call at the top of your code can be a reasonable compromise: it reminds you where on your machine your material is living, is easy to copy-and-paste if necessary, but doesn't commit other users.
~~~
# read input
input_data <- read.csv("data/data.csv")
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function("data/data.csv", sample_number)
# write results
write.table(results, "data/results.csv")
~~~
{: .r}
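
Defining paths once at the top also makes it easy to confirm that the input exists before doing any work. The sketch below is an illustration under stated assumptions: it uses a temporary directory and a small made-up data frame so that it can run on its own, and the column names `sample` and `value` are hypothetical.

```r
# Paths are defined once, at the top (a temporary directory is used here
# only so the sketch is self-contained).
project_dir <- tempdir()
input_file  <- file.path(project_dir, "data.csv")
output_file <- file.path(project_dir, "results.csv")

# Create a small stand-in input file (made-up values, for illustration only).
write.csv(data.frame(sample = 1:3, value = c(2.5, 3.1, 4.8)),
          input_file, row.names = FALSE)

# Fail early with a clear message if the input file is missing.
if (!file.exists(input_file)) {
  stop("Input file not found: ", input_file)
}

# Read input, summarise it, and write the results.
input_data <- read.csv(input_file)
sample_number <- nrow(input_data)
results <- data.frame(n_samples = sample_number,
                      mean_value = mean(input_data$value))
write.csv(results, output_file, row.names = FALSE)
```

Changing where the data lives then means editing the path definitions at the top and nothing else.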

This error message indicates that R has failed to set the working directory you specified:
It is also worth considering what the working directory is. If the working directory must change, it is best to do that at the beginning of the script.

```
Error in setwd("~/path/to/working/directory") : cannot change working directory
```
> ## Be careful when using `setwd()`
It is best practice to have the user running the script begin in a consistent directory on their machine and then use relative file paths from that directory to access files (see below).
> One should exercise caution when using `setwd()`. Changing directories in a script file can limit reproducibility:
> * `setwd()` will return an error if the directory to which you're trying to change doesn't exist or if the user doesn't have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently.
> * If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, and if they then call the script again, this will cause further problems. If you must use `setwd()`, it is best to put it at the top of the script to avoid these problems.
> The following error message indicates that R has failed to set the working directory you specified:
> ```
> Error in setwd("~/path/to/working/directory") : cannot change working directory
> ```
> It is best practice to have the user running the script begin in a consistent directory on their machine and then use relative file paths from that directory to access files (see below).
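
If a directory change truly cannot be avoided, one possible safeguard is to restore the original directory automatically, even when the code fails. This is a sketch rather than the lesson's recommended approach: the wrapper `run_in_dir()` is a hypothetical name, and it relies on `setwd()` returning the previous working directory and on `on.exit()` running when the function exits for any reason.

```r
# Hypothetical wrapper: run fun() inside dir, then always change back.
run_in_dir <- function(dir, fun) {
  old_wd <- setwd(dir)                 # setwd() returns the previous directory
  on.exit(setwd(old_wd), add = TRUE)   # runs even if fun() raises an error
  fun()
}

start_dir <- getwd()

# Simulate a script that fails partway through.
outcome <- try(
  run_in_dir(tempdir(), function() stop("something went wrong")),
  silent = TRUE
)

# The user is back in the directory they started from.
back_where_we_started <- identical(getwd(), start_dir)
```

Relative paths from a consistent starting directory remain the simpler option; this pattern only limits the damage when `setwd()` is unavoidable.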
4. Annotate and mark your code using `#` or `#-` to set off sections of your code and to make finding specific parts of your code easier.
### Identify and segregate distinct components in your code
It's easy to annotate and mark your code using `#` or `#-` to set off sections of your code and to make finding specific parts of your code easier. For example, if you create only one or a few custom functions in your script, put them toward the top of your code. If you have written many functions, put them all in their own `.R` file and then `source` those files. `source` will define all of these functions so that your code can make use of them as needed.
5. If you create only a few custom functions in your script, put them toward the top of your code so they are among the first objects created. If you have written many functions, put them all in their own .R file and then `source()` those files. `source()` will define all of these functions so that your code can use them as needed. For the reasons listed above, avoid using `setwd()` (or other functions that have side-effects in the user's workspace) in scripts you `source()`.
~~~
source("my_genius_fxns.R")
~~~
{: .r}
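
As a rough illustration of this layout, a short script might be divided into marked sections with its functions defined first. The function `standard_error()` and the measurement values are made up for this sketch.

```r
#- Functions ---------------------------------------------------------------
# Hypothetical helper, defined near the top so it exists before it is used.
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

#- Data --------------------------------------------------------------------
# Made-up values standing in for data read from a file.
measurements <- c(4.1, 3.9, 4.4, 4.0)

#- Analysis ----------------------------------------------------------------
se <- standard_error(measurements)
```

Once the number of helpers grows, the functions section can move into its own file and be replaced by a single `source()` call.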
6. Use a consistent style within your code. For example, name all matrices something ending in `.mat`. Indent consistently and decide on a scheme for multi-word variable names (e.g. `resource.use`, `resource_use`, or `resourceUse`). Consistency makes code easier to read and problems easier to spot.
### Other ideas
1. Use a consistent style within your code. For example, name all matrices something ending in `.mat`. Consistency makes code easier to read and problems easier to spot.
7. Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces.
2. Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces.
8. Don't repeat yourself--automate! If you are repeating the same code over and over, use a loop or a function to repeat that code for you. Needless repetition doesn't just waste time--it also increases the likelihood you'll make a costly mistake!
3. Don't repeat yourself--automate! If you are repeating the same code over and over, use a loop or a function to repeat that code for you. Needless repetition doesn't just waste time--it also increases the likelihood you'll make a costly mistake!
9. Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them. For example, use
4. Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them. For example, use
~~~
@@ -96,7 +130,8 @@ dat <- read.csv(file = "/Users/Karthik/Documents/sannic-project/files/dataset-20
~~~
{: .r}
10. R can run into memory issues. R scripts that run for a long time often run out of memory. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long chunks of computationally intensive code is to remove temporary objects after they have served their purpose. However, R will not always clean up unused memory immediately after you delete objects. You can force R to tidy up its memory by using `gc()`.
5. R can run into memory issues. It is a common problem to run out of memory after running R scripts for a long time. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long lines of computationally intensive code is to remove temporary objects after they have served their purpose. However, sometimes, R will not clean up unused memory for a while after you delete objects. You can force R to tidy up its memory by using `gc()`.
~~~
@@ -109,13 +144,13 @@ rm(list = ls()) # If you want to delete all the objects in the workspace and sta
~~~
{: .r}
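
The pattern of removing a temporary object once it has served its purpose might look like the following sketch. The object names and the matrix size are arbitrary choices for illustration.

```r
# A large temporary object (size chosen arbitrarily for this sketch).
big_matrix <- matrix(rnorm(1e6), ncol = 100)

# Inspect how much memory it occupies.
print(object.size(big_matrix), units = "Mb")

# Keep only the small summary that is needed later.
column_means <- colMeans(big_matrix)

# Remove the temporary object, then ask R to release unused memory.
rm(big_matrix)
invisible(gc())
```

Only the specific temporary object is removed here, so the rest of the workspace is left untouched.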
11. Don't save your workspace (the default option in R, when it asks if you want to "Save workspace image [y/n/c]?"). Instead, start in a clean workspace without old objects cluttering it. Leftover objects from previous sessions can lead to unexpected, hard-to-debug results. Do *not* put `rm(list=ls())` (which removes all objects in your current workspace, as shown in the previous code example) at the top of your code, as this is a trap for other users who might `source()` or copy-and-paste your code in the course of their R session. Instead, restart R when you want to start fresh.
6. Don't save your workspace when you quit (R asks whether to save the workspace image, which is stored in an `.RData` file). Instead, start in a clean environment so that objects from earlier sessions don't linger any longer than they need to. Leftover objects can lead to unexpected results.
12. Wherever possible, keep track of `sessionInfo()` somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived). For more complex projects, you may want to use the [packrat](https://CRAN.R-project.org/package=packrat) package.
7. Wherever possible, keep track of `sessionInfo()` somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived).
13. Collaborate. Grab a buddy and practice "code review". Review is used for preparing experiments and manuscripts; why not use it for code as well? Our code is also a major scientific achievement and the product of lots of hard work!
8. Collaborate. Grab a buddy and practice "code review". Review is used for preparing experiments and manuscripts; why not use it for code as well? Our code is also a major scientific achievement and the product of lots of hard work!
14. Develop your code using version control and frequent updates!
9. Develop your code using version control and frequent updates!
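
For the point about keeping track of `sessionInfo()`, one simple option is to write its printed output to a text file that lives with the project. In this sketch the file name `session_info.txt` is an assumption, and a temporary directory stands in for the project folder so the example can run anywhere.

```r
# Assumed file name; tempdir() stands in for your project folder.
session_file <- file.path(tempdir(), "session_info.txt")

# capture.output() collects the printed session information as text lines.
writeLines(capture.output(sessionInfo()), session_file)

# Peek at the first few lines of the saved record.
head(readLines(session_file), 3)
```

Committing this file alongside the code records which R and package versions produced a given set of results.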
> ## Best Practice
>