diff --git a/Accelerate_with_automation/img/Job_dependencies.png b/Accelerate_with_automation/img/Job_dependencies.png new file mode 100644 index 00000000..39699fae Binary files /dev/null and b/Accelerate_with_automation/img/Job_dependencies.png differ diff --git a/Accelerate_with_automation/img/associative_array.png b/Accelerate_with_automation/img/associative_array.png new file mode 100644 index 00000000..28a885ed Binary files /dev/null and b/Accelerate_with_automation/img/associative_array.png differ diff --git a/Accelerate_with_automation/img/new_image.txt b/Accelerate_with_automation/img/new_image.txt new file mode 100644 index 00000000..c2403e23 --- /dev/null +++ b/Accelerate_with_automation/img/new_image.txt @@ -0,0 +1 @@ +new placeholder file diff --git a/Accelerate_with_automation/img/positional-parameter.jpg b/Accelerate_with_automation/img/positional-parameter.jpg new file mode 100644 index 00000000..c278b121 Binary files /dev/null and b/Accelerate_with_automation/img/positional-parameter.jpg differ diff --git a/Accelerate_with_automation/img/simpsons.gif b/Accelerate_with_automation/img/simpsons.gif new file mode 100644 index 00000000..077c41d1 Binary files /dev/null and b/Accelerate_with_automation/img/simpsons.gif differ diff --git a/Accelerate_with_automation/img/vim_insert.png b/Accelerate_with_automation/img/vim_insert.png new file mode 100644 index 00000000..4ed28c7f Binary files /dev/null and b/Accelerate_with_automation/img/vim_insert.png differ diff --git a/Accelerate_with_automation/img/vim_postsave.png b/Accelerate_with_automation/img/vim_postsave.png new file mode 100644 index 00000000..976b286c Binary files /dev/null and b/Accelerate_with_automation/img/vim_postsave.png differ diff --git a/Accelerate_with_automation/img/vim_quit.png b/Accelerate_with_automation/img/vim_quit.png new file mode 100644 index 00000000..8980329c Binary files /dev/null and b/Accelerate_with_automation/img/vim_quit.png differ diff --git 
a/Accelerate_with_automation/img/vim_save.png b/Accelerate_with_automation/img/vim_save.png new file mode 100644 index 00000000..641f31b6 Binary files /dev/null and b/Accelerate_with_automation/img/vim_save.png differ diff --git a/Accelerate_with_automation/img/vim_spider.png b/Accelerate_with_automation/img/vim_spider.png new file mode 100644 index 00000000..0432645c Binary files /dev/null and b/Accelerate_with_automation/img/vim_spider.png differ diff --git a/Accelerate_with_automation/img/vim_spider_number.png b/Accelerate_with_automation/img/vim_spider_number.png new file mode 100644 index 00000000..67694e17 Binary files /dev/null and b/Accelerate_with_automation/img/vim_spider_number.png differ diff --git a/Accelerate_with_automation/lessons/arrays_in_slurm.md b/Accelerate_with_automation/lessons/arrays_in_slurm.md new file mode 100644 index 00000000..d8121c8b --- /dev/null +++ b/Accelerate_with_automation/lessons/arrays_in_slurm.md @@ -0,0 +1,120 @@ + +# Arrays in Slurm + +When I am working on large data sets my mind often drifts back to an old Simpsons episode. Bart is in France and being taught to pick grapes. They show him a detailed technique and he does it successfully. Then they say: + + +

+ +

+ +

+We've all been here +

+ +A pipeline or process may seem easy or fast when you have 1-3 samples but totally daunting when you have 50. When scaling up you need to consider file overwriting, computational resources, and time. + +One easy way to scale up is to use the array feature in Slurm. + +## What is a job array? + +The O2 wiki (hosted on Atlassian) says this about job arrays: "Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks." [link](https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Job-Arrays). + +The tasks in a job array can run simultaneously rather than one at a time (as cluster resources allow), which can make them very fast! Additionally, running a job array is very simple! + +```bash +sbatch --array=1-10 my_script.sh +``` + +This will run my_script.sh 10 times with the job IDs 1,2,3,4,5,6,7,8,9,10. + +We can also put this directly into the bash script itself as an `#SBATCH` directive (although we will continue with the command line version here). +```bash +#SBATCH --array=1-10 +``` + +We can specify any job IDs we want. + +```bash +sbatch --array=1,7,12 my_script.sh +``` +This will run my_script.sh 3 times with the job IDs 1,7,12. + +Of course we don't want to run the same job on the same input files over and over; that would be pointless. We can use the job IDs within our script to specify different input or output files. Within the script, Slurm stores the job ID in a special variable, `${SLURM_ARRAY_TASK_ID}`. + + +## How can I use ${SLURM_ARRAY_TASK_ID}? + +The value of `${SLURM_ARRAY_TASK_ID}` is simply the job ID. If I run + +```bash +sbatch --array=1,7 my_script.sh +``` +This will start two jobs, one where `${SLURM_ARRAY_TASK_ID}` is 1 and one where it is 7. + +There are several ways we can use this. 
If we plan ahead and name our files with these numbers (e.g., sample_1.fastq, sample_2.fastq) we can directly refer to these files in our script: `sample_${SLURM_ARRAY_TASK_ID}.fastq`. However, using the ID for input files is often not a great idea as it means you need to strip away most of the information that you might put in these names. + +Instead we can keep our sample names in a separate file and use [awk](awk.md) to pull the file names. + +Here is our complete list of long sample names, which is found in our file `samples.txt`: + +``` +DMSO_control_day1_rep1 +DMSO_control_day1_rep2 +DMSO_control_day2_rep1 +DMSO_control_day2_rep2 +DMSO_KO_day1_rep1 +DMSO_KO_day1_rep2 +DMSO_KO_day2_rep1 +DMSO_KO_day2_rep2 +Drug_control_day1_rep1 +Drug_control_day1_rep2 +Drug_control_day2_rep1 +Drug_control_day2_rep2 +Drug_KO_day1_rep1 +Drug_KO_day1_rep2 +Drug_KO_day2_rep1 +Drug_KO_day2_rep2 +``` + +If we renamed all of these to 1-16 we would lose a lot of information that may be helpful to have on hand. If these are all sam files and we want to convert them to bam files our script could look like this: + +```bash + +file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) + +samtools view -S -b ${file}.sam > ${file}.bam + +``` + +Since we have sixteen samples we would run this as + +```bash +sbatch --array=1-16 my_script.sh +``` + +So what is this script doing? `file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)` pulls the line of `samples.txt` whose line number matches the job ID. We assign that line to a variable called `file` and then use `${file}` to run our command. + +Job IDs can also be helpful for output files or folders. We saw above how we used the job ID to help name our output bam file. But creating and naming folders is helpful in some instances as well. 
+ +```bash + +file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) + +PREFIX="Folder_${SLURM_ARRAY_TASK_ID}" + mkdir $PREFIX + cd $PREFIX + +samtools view -S -b ../${file}.sam > ${file}.bam + +``` + +This script differs from our previous one in that it makes a folder named with the job ID (Folder_1 for job ID 1) then moves inside of it to execute the command. Instead of getting all 16 of our bam files output in a single folder, each of them will be in its own folder labeled Folder_1 to Folder_16. + +**NOTE:** We define `${file}` BEFORE we move into our new folder, as samples.txt is only present in the main directory. + + + + + + diff --git a/Accelerate_with_automation/lessons/loops_and_scripts.md b/Accelerate_with_automation/lessons/loops_and_scripts.md new file mode 100644 index 00000000..cc478953 --- /dev/null +++ b/Accelerate_with_automation/lessons/loops_and_scripts.md @@ -0,0 +1,357 @@ +--- +title: "The Shell: Loops & Scripts" +author: "Bob Freeman, Mary Piper, Radhika Khetani" +--- + +Approximate time: 60 minutes + +## Learning Objectives + +* Capture multiple commands into a script to re-run as one single command +* Understand variables and how to store information in them +* Learn how to use variables to operate on multiple files + +## Shell scripts + +Within the command-line interface we have, at our fingertips, access to various commands which allow us to interrogate our data (e.g. `cat`, `less`, `wc`). + +> **NOTE:** If you are unsure about any of these commands and what they do, you may want to review the [Exploring Basics lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/exploring_basics.html). + +When you are working with data, it can often be useful to run a set of commands one after another. Further, you may want to re-run this set of commands on every single set of data that you have. Wouldn't it be great if you could do all of this by simply typing out one single command? 
+ +Welcome to the beauty and purpose of shell scripts. + +Shell scripts are **text files that contain commands we want to run**. As with any file, you can give a shell script any name, although shell scripts usually have the extension `.sh`. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake, this is actually a small program. + +Since we now know how to create text files in the command-line interface, we are going to use that knowledge to create a shell script and see what makes the shell such a powerful programming environment. We will use commands that you should be familiar with and save them into a file so that we can **re-run all those operations** again later by typing **one single command**. Let's write a shell script that will do two things: + +1. Tell us our current working directory +2. List the contents of the directory + +First open a new file using `vim`: + +```bash +$ vim listing.sh +``` + +Then type in the following lines in the `listing.sh` file: + +``` +echo "Your current working directory is:" +pwd +echo "These are the contents of this directory:" +ls -l +``` + +Save the file and exit `vim`. Now let's run the new script we have created. To run a shell script you usually use the `bash` or `sh` command. + +```bash +$ sh listing.sh +``` + +Now, let's run this script when we are in a different folder. + +```bash +$ cd ../raw_fastq/ + +$ sh ../other/listing.sh +``` + +> Did it work like you expected? +> +> Were the `echo` commands helpful in letting you know what came next? + +This is a very simple shell script. In this lesson, we will be learning how to write more complex ones and show you how to use the power of scripts to make our lives much easier. + +## Bash variables + +A *variable* is a common concept shared by many programming languages. Variables are essentially a symbolic/temporary name for, or a reference to, some information. 
Variables are analogous to "buckets", where information can be stored, maintained and modified without too much hassle. + +Extending the bucket analogy: the bucket has a name associated with it, i.e. the name of the variable, and when referring to the information in the bucket, we use the name of the bucket, and do not directly refer to the actual data stored in it. + +Let's start with a simple variable that has a single number stored in it: + +```bash +$ num=25 +``` + +*How do we know that we actually created the bash variable?* We can use the `echo` command to print to terminal: + +```bash +$ echo num +``` + +What do you see in the terminal? The `echo` utility takes whatever arguments you provide and prints them to the terminal. In this case it interpreted `num` as a character string and simply printed it back to us. This is because **when trying to retrieve the value stored in the variable, we explicitly use a `$` in front of it**: + +```bash +$ echo $num +``` + +Now you should see the number 25 returned to you. Did you notice that when we created the variable we just typed in the variable name? This is standard shell notation (syntax) for defining and using variables. When defining the variable (i.e. setting the value) you can just type it as is, but when **retrieving the value of a variable don't forget the `$`!** + +Variables can also store a string of character values. In the example below, we define a variable or a 'bucket' called `file`. We will put a filename `Mov10_oe_1.subset.fq` as the value inside the bucket. + +```bash +$ file=Mov10_oe_1.subset.fq +``` + +Once you press return, you should be back at the command prompt. Let's check what's stored inside `file`, but first move into the `raw_fastq` directory: + +```bash +$ cd ~/unix_lesson/raw_fastq +$ echo $file +``` + +Let's try another command using the variable that we have created. 
We can also count the number of lines in `Mov10_oe_1.subset.fq` by referencing the `file` variable: + +```bash +$ wc -l $file +``` + +> *NOTE:* The variables we create are available throughout the current shell session, independent of where you are in the filesystem. This is why we can reference them from any directory. However, they are only available for your current session. If you close your Terminal and come back again at a later time, the variables you have created will no longer exist. + +*** + +**Exercise** + +* Reuse the `$file` variable to store a different file name, and rerun the commands we ran above (`wc -l`, `echo`) + +*** + +Ok, so we know variables are like buckets, and so far we have seen that bucket filled with a single value. **Variables can store more than just a single value.** They can store multiple values and in this way can be useful to carry out many things at once. Let's create a new variable called `filenames` and this time we will store *all of the filenames* in the `raw_fastq` directory as values. + +To list all the filenames in the directory that have a `.fq` extension, we know the command is: + +```bash +$ ls *.fq +``` + +Now we want to *assign* the output of `ls` to the variable: + +```bash +$ filenames=`ls *.fq` +``` + +> Note the syntax for assigning output of commands to variables, i.e. the backticks around the `ls` command. + +Check and see what's stored inside our newly created variable using `echo`: + +```bash +$ echo $filenames +``` + +Let's try the `wc -l` command again, but this time using our new variable `filenames` as the argument: + +```bash +$ wc -l $filenames +``` + +What just happened? Because our variable contains multiple values, the shell passes each value stored in `filenames` to `wc` as a separate argument, and the line count for every file is printed to screen. + +*** + +**Exercise** + +* Use some of the other commands you are familiar with (e.g. `head`, `tail`) on the `filenames` variable. 
+ +*** + +## Loops + +Another powerful concept in the Unix shell and useful when writing scripts is the concept of "Loops". We have just shown you that you can run a single command on multiple files by creating a variable whose values are the filenames that you wish to work on. But what if you want to **run a sequence of multiple commands, on multiple files**? This is where loops come in handy! + +Looping is a concept shared by several programming languages, and its implementation in *bash* is very similar to other languages. + +The structure or the syntax of (*for*) loops in bash is as follows: + +```bash +for (variable_name) in (list) +do +(command1 $variable_name) +. +. +done +``` + +where the ***variable_name*** defines (or initializes) a variable that takes the value of every member of the specified ***list*** one at a time. At each iteration, the loop retrieves the value stored in the variable (which is a member of the input list) and runs through the commands indicated between the `do` and `done` one at a time. *This syntax/structure is virtually set in stone.* + + +#### What does this loop do? + +```bash +for x in *.fq + do + echo $x + wc -l $x + done +``` + +Most simply, it writes to the terminal (`echo`) the name of the file and the number of lines (`wc -l`) for each file that ends in `.fq` in the current directory. The output is almost identical to what we had before. + +In this case the list of files is specified using the asterisk wildcard: `*.fq`, i.e. all files that end in `.fq`. + +Then, we execute 2 commands between the `do` and `done`. With a loop, we execute these commands for one file at a time. Once the commands are executed for one file, the loop then executes the same commands on the next file in the list. + +Essentially, **the number of items in the list (variable name) == number of times the code will loop through**. 
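The relationship between list length and iteration count can be checked with a minimal sketch. It assumes a hypothetical three-item list (`alpha beta gamma`) rather than the lesson's FASTQ files, and a hypothetical counter variable called `count`:

```bash
# Hypothetical three-item list standing in for the lesson's files
count=0
for item in alpha beta gamma
do
    # add one to the counter on every pass through the loop
    count=$((count + 1))
    echo "iteration ${count}: ${item}"
done
echo "looped ${count} times"
```

The counter ends at 3 because the list has three members; a list of a different length would change the number of passes accordingly.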
+ +In our case that is 6 times since we have 6 files in `~/unix_lesson/raw_fastq` that end in `.fq`; these are the same filenames we stored in the `filenames` variable earlier. + +It doesn't matter what variable name we use in a loop structure, but it is advisable to make it something intuitive. In the long run, it's best to use a name that will help point out a variable's functionality, so your future self will understand what you are thinking now. + +### The `basename` command + +Before we get started on creating more complex scripts, we want to introduce you to a command that will be useful for future shell scripting. The `basename` command is used for extracting the base name of a file, which is accomplished using **string splitting to strip the directory and any suffix from filenames**. Let's try an example, by first moving back to your home directory: + +```bash +$ cd +``` + +Then we will run the `basename` command on one of the FASTQ files. Be sure to specify the path to the file: + +```bash +$ basename ~/unix_lesson/raw_fastq/Mov10_oe_1.subset.fq +``` + +What is returned to you? The filename was split into the path `unix_lesson/raw_fastq/` and the filename `Mov10_oe_1.subset.fq`. The command returns only the filename. Now, suppose we wanted to also trim off the file extension (i.e. remove `.fq` leaving only the file *base name*). We can do this by adding a parameter to the command to specify what string of characters we want trimmed. + +```bash +$ basename ~/unix_lesson/raw_fastq/Mov10_oe_1.subset.fq .fq +``` + +You should now see that only `Mov10_oe_1.subset` is returned. + +*** + +**Exercise** + +* How would you modify the above `basename` command to only return `Mov10_oe_1`? + +*** + +## Automating with Scripts + +Now that you've learned how to use loops and variables, let's put this processing power to work. 
Imagine, if you will, a script that will run a series of commands that would do the following for us each time we get a new data set: + +- Use the `for` loop to iterate over each FASTQ file +- Generate a prefix to use for naming our output files +- Dump out bad reads into a new file +- Get the count of the number of bad reads and generate a summary for each file + +You might not realize it, but this is something that you now know how to do. Let's get started... + +Rather than doing all of this in the terminal we are going to create a script file with all relevant commands. Move back into `unix_lesson` and use `vim` to create our new script file: + +```bash +$ cd ~/unix_lesson + +$ vim generate_bad_reads_summary.sh +``` + +We always want to start our scripts with a shebang line: + +```bash +#!/bin/bash +``` + +This line is the absolute path to the Bash interpreter on almost all computers. The shebang line ensures that the bash shell interprets the script even if it is executed using a different shell. + +> Your script will still work without the shebang line if you run it with the `sh` or `bash` commands, but it is best practice to have it in your shell script. + +After the shebang line, we enter the commands we want to execute. 
First, we want to move into our `raw_fastq` directory: + +```bash +# enter directory with raw FASTQs +cd ~/unix_lesson/raw_fastq +``` + +And now we loop over all the FASTQs: + +```bash +# count bad reads for each FASTQ file in our directory +for filename in *.fq +``` + +For each file that we process we can use `basename` to create a variable that will uniquely identify our output file based on where it originated from: + +```bash +do + # create a prefix for all output files + base=`basename $filename .subset.fq` +``` + +and then we execute the following commands for each file: + +```bash + # tell us what file we're working on + echo $filename + + # grab all the bad read records into new file + grep -B1 -A2 NNNNNNNNNN $filename > $base-badreads.fastq +``` + +We'll also count the number of these reads and put that in a new file, using the count flag of `grep`: + +```bash + # grab the number of bad reads and write it to a summary file + grep -cH NNNNNNNNNN $filename > $base-badreads.count.summary +done +``` + +> **NOTE:** If you've noticed, we used a new `grep` flag `-H` above; this flag will report the filename the search was performed on along with the matching string. + +Save and exit `vim`, and voila! You now have a script you can use to assess the quality of all your new datasets. 
Your finished script, complete with comments, should look like the following: + +```bash +#!/bin/bash + +# enter directory with raw FASTQs +cd ~/unix_lesson/raw_fastq + +# count bad reads for each FASTQ file in our directory +for filename in *.fq +do + + # create a prefix for all output files + base=`basename $filename .subset.fq` + + # tell us what file we're working on + echo $filename + + # grab all the bad read records + grep -B1 -A2 NNNNNNNNNN $filename > $base-badreads.fastq + + # grab the number of bad reads and write it to a summary file + grep -cH NNNNNNNNNN $filename > $base-badreads.count.summary +done + +``` + +To run this script, we simply enter the following command: + +```bash +$ sh generate_bad_reads_summary.sh +``` + +How do we know if the script worked? Take a look inside the `raw_fastq` directory, we should see that for every one of the original FASTQ files we have two associated bad read files. + +```bash +$ ls -l ~/unix_lesson/raw_fastq +``` + +To keep our data organized, let's move all of the bad read files out of the `raw_fastq` directory into a new directory called `other`, and the script to a new directory called `scripts`. + +```bash +$ mv raw_fastq/*bad* other/ + +$ mkdir scripts +$ mv *.sh scripts/ +``` + +--- +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* + +* *The materials used in this lesson were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). 
+All Data Carpentry instructional material is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0).* +* *Adapted from the lesson by Tracy Teal. Original contributors: Paul Wilson, Milad Fatenejad, Sasha Wood and Radhika Khetani for Software Carpentry (http://software-carpentry.org/)* + + diff --git a/Accelerate_with_automation/lessons/positional_params.md b/Accelerate_with_automation/lessons/positional_params.md new file mode 100644 index 00000000..9fbb7c33 --- /dev/null +++ b/Accelerate_with_automation/lessons/positional_params.md @@ -0,0 +1,319 @@ +--- +title: "Introduction to Positional Parameters and Variables" +author: "Emma Berdan" +--- + + +## Learning Objectives: + +* Distinguish between variables and positional parameters +* Recognize variables and positional parameters in code written by someone else +* Implement positional parameters and variables in a bash script +* Integrate for loops and variables + +## What is a variable? + +"A **variable** is a character string to which we assign a value. The value assigned could be a number, text, filename, device, or any other type of data. +A variable is nothing more than a pointer to the actual data. The shell enables you to create, assign, and delete variables." ([Source](https://www.tutorialspoint.com/unix/unix-using-variables.htm)) + +It is easy to identify a variable in a bash script: whenever its value is used, it has a `$` in front of it. Here is my very cleverly named variable: `$Variable` + +## Positional parameters are a special kind of variable + +“A **positional parameter** is an argument specified on the command line, used to launch the current process in a shell. 
Positional parameter values are stored in a special set of variables maintained by the shell.” ([Source](https://www.computerhope.com/jargon/p/positional-parameter.htm)) + +So rather than a variable that is identified inside the bash script, a positional parameter is given when you run your script. This makes it more flexible as it can be changed without modifying the script itself. + +
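As a minimal sketch of the idea, the snippet below uses the shell builtin `set --` to stand in for launching a script with two arguments. The argument values `apple` and `banana` and the variable names `first` and `second` are assumptions made up for illustration:

```bash
# Stand-in for running: ./myscript.sh apple banana
# `set --` replaces the current positional parameters with the words that follow
set -- apple banana

first=$1    # first word after the script name
second=$2   # second word after the script name

echo "first=${first} second=${second}"
```

Changing the words given after `set --` (or after the script name on a real command line) changes what `$1` and `$2` hold, without touching the rest of the code.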

+ +

+ +Here we can see that our command is the first positional parameter (`$0`) and that each of the strings afterwards is an additional positional parameter (here `$1` and `$2`). Generally when we refer to positional parameters we ignore `$0` and start with `$1`. + +It is crucial to note that different positional parameters are separated by whitespace and can be strings of any length. This means that: + +```bash +$ ./myscript.sh OneTwoThree +``` +has only given one positional parameter `$1=OneTwoThree` + +and + +```bash +$ ./myscript.sh O N E +``` +has given three positional parameters `$1=O` `$2=N` `$3=E` + +You can code your script to take as many positional parameters as you like but for any parameters greater than 9 you need to use curly brackets. So positional parameter 9 is `$9` but positional parameter 10 is `${10}`. We will come back to curly brackets later. + +Finally, the variable `$@` contains the value of all positional parameters except `$0`. + +## A simple example + +Let's make a script ourselves to see positional parameters in action. + +From your command line type `vim compliment.sh` then type `i` to go to insert mode. If you have never used vim before you can find the basics [HERE](https://github.com/hbctraining/Intro-to-shell-flipped/blob/047946559cbffc9cc24ccb10a4d630aa18fab558/lessons/03_working_with_files.md). + +Now copy and paste the following into your file + +```bash +#!/bin/bash + +echo $1 'is amazing at' $2 +``` +Then type esc to exit insert mode. Type and enter `:wq` to write and quit. + + +>**Note** +>We want to take a step back here and think about what we might do from here to actually use our script. +>We know our script is a bash script because we wrote our "shebang" line '#!/bin/bash' but the computer +>doesn't automatically know this. One option is to run this script via the bash interpreter: `sh`. Doing this +>actually makes our "shebang" line obsolete! The same script without the "shebang" line will still run!! Try it! 
+> +>`sh compliment.sh` +> +>Here we are telling the computer to use bash to execute the commands in our script which is why we don't +>need the "shebang" line. It is **NOT** best practice to write scripts without "shebang" lines as removing this +>will leave the next person scratching their head figuring out which language the script is in. +>**ALWAYS ALWAYS ALWAYS use a "shebang" line**. With that line in place, we can run this script without +>calling bash from the command line. But first we have to make the script executable. This tells the +>computer that this is a script and not just a text file. We do that by adding a file permission. +>Typing `chmod u+x` followed by the file name will make the file executable for the user (you!). Once this is done the script +>can be run this way +> +>`./compliment.sh` +> +>When a file is executable the computer will use the "shebang" line to figure out which interpreter to use. +>Different programs (perl, python, etc) will have different "shebang" lines. + + +For this lesson we will make all of our scripts executable. Now that you are back on the command line type `chmod u+x compliment.sh` to make the file executable for yourself. More on file permissions [HERE](https://github.com/hbctraining/Intro-to-shell-flipped/blob/master/lessons/07_permissions_and_environment_variables.md). + +You may have already guessed that our script takes two different positional parameters. The first one is your first name and the second is something you are good at. Here is an example: + +```bash +./compliment.sh OliviaC acting +``` + +This will print + +```bash +OliviaC is amazing at acting +``` +You may have already guessed that I am talking about award winning actress [Olivia Colman](https://en.wikipedia.org/wiki/Olivia_Colman) here. 
But if I typed + +```bash +./compliment.sh Olivia Colman acting +``` +I would get + +```bash +Olivia is amazing at Colman +``` +Technically I have just given three positional parameters `$1=Olivia` `$2=Colman` `$3=acting` +However, since our script does not contain `$3` this is ignored. + +In order to give Olivia her full due I would need to type + +```bash +./compliment.sh "Olivia Colman" acting +``` + +The quotes tell bash that "Olivia Colman" is a single string, `$1`. Both double quotes (") and single quotes (') will work. Olivia has enough accolades though, so go ahead and run the script with your name (just first or both first and last) and something you are good at! + + +## Naming variables + +My previous script was so short that it was easy to remember that `$1` represents a name and `$2` represents a skill. However, most scripts are much longer and may contain more positional parameters. To make it easier on yourself it is often a good idea to name your positional parameters. Here is the same script we just used but with named variables. + +```bash +#!/bin/bash + +name=$1 +skill=$2 + +echo $name 'is amazing at' $skill +``` + +It is critical that there is no space in our assignment statements; `name = $1` would not work. We can also assign new variables in this manner whether or not they are coming from positional parameters. Here is the same script with the variables defined within it. + +```bash +#!/bin/bash + +name="Olivia Colman" +skill="acting" + +echo $name 'is amazing at' $skill +``` +We will talk more about naming variables later, but note that defining variables within the script can make the script **less** flexible. If I want to change my sentence, I now need to edit my script directly rather than launching the same script but with different positional parameters. + + +## A useful example + +Now that we understand the basics of variables and positional parameters how can we make them work for us? 
One of the best ways to do this is when writing a shell script. A shell script is a "script that embeds a system command or utility, that saves a set of parameters passed to that command." + +As an example let's say that I want to add read groups to a series of bam files. Each bam file is one sample that I have sequenced and I need to add read groups to them all. Here is an example of my command for sample M1. + +```bash +java -jar picard.jar AddOrReplaceReadGroups I=M1.dedupped.bam \ +O=M1.final.bam RGID=M1 RGLB=M1 RGPL=illumina RGPU=unit1 RGSM=M1 +``` + +The string 'M1' occurs 5 times in this command. However, M1 is not my only sample; to make this code run for a different sample I would need to replace M1 5 times. I don't want to manually edit this line of code every time I run the command. Instead, using positional parameters I can make a shell script for this command. + + +```bash +#!/bin/bash + +java -jar picard.jar AddOrReplaceReadGroups I=$1.dedupped.bam \ +O=$1.final.bam RGID=$1 RGLB=$1 RGPL=illumina RGPU=unit1 RGSM=$1 +``` + +Here `$1` is my only positional parameter and is my sample name. **However**, this script is not written with best practices. It should actually look like this. + +```bash +#!/bin/bash + +java -jar picard.jar AddOrReplaceReadGroups I=${1}.dedupped.bam \ +O=${1}.final.bam RGID=${1} RGLB=${1} RGPL=illumina RGPU=unit1 RGSM=${1} +``` + +`$1`, which we have been using, is actually a short form of `${1}`. + +We can only rely on the short form of a named variable (for example `$name`) when it is **not** followed by a letter, digit or an underscore, and on `$1` when it is **not** followed by a digit, but we can always use the `${1}` form. + +If I wrote a script that set `name=$1` and then said `echo $name_is_awesome` I wouldn't actually get any output when I ran this with a positional parameter, because bash would look for a variable called `name_is_awesome`, even for our beloved [Olivia Colman](https://en.wikipedia.org/wiki/Olivia_Colman)! A script that said `echo $1_is_awesome` only works because a positional parameter without braces is read as a single digit, which is easy to get wrong. 
Instead this script would need to be written as `echo ${1}_is_awesome` + +As you write your own code it is good to remember that it is always safe to use `${VAR}` and that errors may result from using `$VAR` instead, even if it is convienent. As you navigate scripts written by other people you will see both forms. + + +Let's test out this script this ourselves without actually running picard by using `echo`. From your command line type `vim picard.sh` then type `i` to go to insert mode. + +now copy and paste the following into your file + +```bash +#!/bin/bash + +echo java -jar picard.jar AddOrReplaceReadGroups I=${1}.dedupped.bam \ +O=${1}.final.bam RGID=${1} RGLB=${1} RGPL=illumina RGPU=unit1 RGSM=${1} +``` +then type esc to exit insert mode. Type and enter `:wq` to write and quit. + +Now that you are back on the command line type `chmod u+x picard.sh` to make the file executable for yourself. + +You can try out the code yourself by using any sample name you want. Here I am running it for sample T23 + + +```bash +./picard.sh T23 +``` + +We have now significantly decrased our own workload. By using this script we can easily run this command for any sample we have. However, sometimes we have so many samples that even running this command manually for all of these will be time consuming. In this case we can turn to one of the most powerful ways to use positional parameters and other variables, by combining them with **for loops**. More on for loops [HERE](https://github.com/hbctraining/Intro-to-shell-flipped/blob/master/lessons/06_loops_and_automation.md). + +## Variables in for loops + +We are going to continue with our picard example. Let's say that I need to run my picard command for 10 different samples. I have all of my sample names in a text file. First let's put my sample name list on the cluster so we can access it with our script. + +From your command line type `vim samples.txt` then type `i` to go to insert mode. 
Copy and paste the following: + +```bash +M1 +M2 +M3 +O1 +O2 +O3 +O4 +S1 +S2 +S3 +``` + +Then press esc to exit insert mode. Type and enter `:wq` to write and quit. Each line is a single sample name. + +Now let's write our new script. Again we will use `echo` to avoid actually calling picard. We will write this with vim then go through it line by line. + +From your command line type `vim picard_loop.sh` then type `i` to go to insert mode. Copy and paste the following: + +```bash +#!/bin/bash + +for ((i=1; i<=10; i=i+1)) + do + +sample=$(awk -v awkvar="${i}" 'NR==awkvar' samples.txt) + +echo java -jar picard.jar AddOrReplaceReadGroups \ +I=${sample}.dedupped.bam O=${sample}.final.bam RGID=${sample} \ +RGLB=${sample} RGPL=illumina RGPU=unit1 RGSM=${sample} + +done +``` + +Then press esc to exit insert mode. Type and enter `:wq` to write and quit. Now that you are back on the command line type `chmod u+x picard_loop.sh` to make the file executable for yourself. + +Before we run this, let's go through it line by line. + + +`for ((i=1; i<=10; i=i+1))` + +This tells bash how our loop is working. We want to start at 1 (`i=1`) and end at 10 (`i<=10`), and each time we complete our loop the value of `i` should increase by 1 (`i=i+1`). Simple math tells us that means the loop will run 10 times. But we could make it run 100 times by changing `i<=10` to `i<=100`. `i` is our counter variable to track how many loops we have run. You will often see this called `i` or `j` but you can actually call it whatever you want. For example, here we are using it to track which line of samples.txt we are reading, so it may be more intuitive to write it like this: `for ((line=1; line<=10; line=line+1))`. `i` and `j` are used because they are shorter to write. + +`do` + +This means that whatever follows is what we want bash to do for each value of `i` (here 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10). 
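
To get a feel for the counter before adding awk and picard, the loop can be tried on its own. The block below is only a minimal sketch: it uses the more descriptive counter name `line` suggested above, stops at 3 instead of 10 so the output stays short, and adds a variable called `iterations` (a name made up for this illustration) to count how many times the loop body ran.

```shell
#!/bin/bash

# Hypothetical counter, used only in this illustration, to tally the loop passes.
iterations=0

# Same loop shape as picard_loop.sh, but with a descriptive counter name
# and an upper limit of 3 instead of 10.
for ((line=1; line<=3; line=line+1))
do
    echo "Now working on line ${line} of samples.txt"
    iterations=$((iterations+1))
done

echo "The loop ran ${iterations} times"
```

Changing `line<=3` to `line<=10` gives the same loop shape used in `picard_loop.sh`.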
+ +`sample=$(awk -v awkvar="${i}" 'NR==awkvar' samples.txt)` + +This line creates a variable called `sample` and assigns it the contents of line `i` of samples.txt. We won't go into the details of how this awk command is working but you can learn more about using awk [HERE](https://github.com/hbctraining/Training-modules/blob/f168114cce7ab9d35eddbf888b94f5a2fda0318a/Intermediate_shell/lessons/advanced_lessons.md). You may also notice that we have assigned the value of `$sample` differently here, using `$( )` instead of single ' or double " quotes. The syntax for assigning variables changes depending on what you are assigning. See **Syntax for assigning variables** below. + +If we look at samples.txt we can see that when `i=1` then `$sample` will be M1. What will `$sample` be when `i=5`? + + +The next line should look familiar + +```bash +echo java -jar picard.jar AddOrReplaceReadGroups \ +I=${sample}.dedupped.bam O=${sample}.final.bam RGID=${sample} \ +RGLB=${sample} RGPL=illumina RGPU=unit1 RGSM=${sample} +``` + +This is exactly the same as what we used above except `$1` is now `$sample`. We are assigning the value of `$sample` within our script instead of giving it externally as a positional parameter. + +Finally, we end our script with + +```bash +done +``` + +Here we are simply telling bash that this is the end of the commands that we need it to do for each value of `i`. + +Now let's run our script + + +```bash +./picard_loop.sh +``` + +Is the output what you expected? + +## Syntax for assigning variables + +Depending on what you are assigning as a variable, the syntax for doing so differs. + +`variable=$(command)` for output of a command. + + example: `variable=$(wc -l < file.txt)` will assign the number of lines in the file file.txt to `$variable` (without the `<`, `wc` would also print the file name) + +`variable='a string'` or `variable="a string"` for a string with spaces. + + example: `variable="Olivia Coleman"` as seen above. + +`variable=number` for a number or a string without spaces. 
+ + example: `variable=12` will assign the number 12 to `$variable` + +`variable=$1` for positional parameter 1. + + example: `variable=$9` will assign positional parameter 9 to `$variable` and `variable=${10}` will assign positional parameter 10 to `$variable` + diff --git a/Accelerate_with_automation/lessons/setting_up.md b/Accelerate_with_automation/lessons/setting_up.md new file mode 100644 index 00000000..d1f7b46f --- /dev/null +++ b/Accelerate_with_automation/lessons/setting_up.md @@ -0,0 +1,89 @@ +## Accessing the shell + +This workshop assumes that you have a working knowledge of *bash* and in turn know how to access it on your own computer. If you are not sure, we have this information below. + +> **With Macs** use the "**Terminal**" utility. +> +> **With Windows** you can use your favorite utility or follow our suggestion of using "**Git BASH**". Git BASH is part of the [Git for Windows](https://git-for-windows.github.io/) download, and is a *bash* emulator. + +## The command prompt + +Once again, you are likely familiar with what a command prompt is, but to ensure that everyone in the class is on the same page, we have a short description below. + +> It is a string of characters ending with `$` after which you enter the command to ask the shell to do something. +> +> The string of characters before the `$` usually represents information about the computer you are working on and your current directory; e.g. **`[MacBook-Pro-5:~]$`**. + +## Downloading data + +We will be exploring slightly more advanced capabilities of the shell by working with data from an RNA sequencing experiment. 
+ +> *NOTE: If you attended the [Intro to shell](https://hbctraining.github.io/Training-modules/Intro_shell/) workshop with us last month, you should have already downloaded this data.* + +Before we download the data, let's check the folder we are currently in: + +```bash +$ pwd +``` + +> On a **Mac** your current folder should be something starting with `/Users/`, like `/Users/marypiper/`. +> +> On a **Windows** machine your current folder should be something starting with `/c/Users/`, like `/c/Users/marypiper`. To find this in your File Explorer try clicking on PC and navigating to that path. + +Once you know which folder you are downloading your data to, click on the link below: + +**Download RNA-Seq data to your working directory:** [click here to download](https://github.com/hbctraining/Training-modules/blob/master/Intro_shell/data/unix_lesson.zip?raw=true). + +Type the 'list' command to check that you have downloaded the file to the correct location (your present working directory): + +```bash +$ ls -l +``` + +You should see `unix_lesson.zip` as part of the output to the screen. + +Now, let's decompress the folder, using the `unzip` command: + +```bash +$ unzip unix_lesson.zip +``` + +> When you run the unzip command, you are decompressing the zipped folder, just like you would by double-clicking on it outside the Terminal. As it decompresses, you will usually see "verbose output" listing the files and folders being decompressed or inflated. 
+> +> ```bash +> +> Archive: unix_lesson.zip +> creating: unix_lesson/ +> creating: unix_lesson/.my_hidden_directory/ +> inflating: unix_lesson/.my_hidden_directory/hidden.txt +> creating: unix_lesson/genomics_data/ +> creating: unix_lesson/other/ +> inflating: unix_lesson/other/Mov10_rnaseq_metadata.txt +> inflating: unix_lesson/other/sequences.fa +> creating: unix_lesson/raw_fastq/ +> inflating: unix_lesson/raw_fastq/Irrel_kd_1.subset.fq +> inflating: unix_lesson/raw_fastq/Irrel_kd_2.subset.fq +> inflating: unix_lesson/raw_fastq/Irrel_kd_3.subset.fq +> inflating: unix_lesson/raw_fastq/Mov10_oe_1.subset.fq +> inflating: unix_lesson/raw_fastq/Mov10_oe_2.subset.fq +> inflating: unix_lesson/raw_fastq/Mov10_oe_3.subset.fq +> inflating: unix_lesson/README.txt +> creating: unix_lesson/reference_data/ +> inflating: unix_lesson/reference_data/chr1-hg19_genes.gtf +> ``` + +Now, run the `ls` command again. + +```bash +$ ls -l +``` + +You should see a folder/directory called `unix_lesson`, which means you are all set with the data download! + +*** + +[Next Lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/exploring_basics.html) + +*** + +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*