diff --git a/Intermediate_shell/.DS_Store b/Intermediate_shell/.DS_Store deleted file mode 100644 index 11e1f3ea..00000000 Binary files a/Intermediate_shell/.DS_Store and /dev/null differ diff --git a/Intermediate_shell/Intro_current_topics.pdf b/Intermediate_shell/Intro_current_topics.pdf deleted file mode 100644 index 136c3379..00000000 Binary files a/Intermediate_shell/Intro_current_topics.pdf and /dev/null differ diff --git a/Intermediate_shell/README.md b/Intermediate_shell/README.md deleted file mode 100644 index 23e08ab0..00000000 --- a/Intermediate_shell/README.md +++ /dev/null @@ -1,42 +0,0 @@ -## Intermediate bash - -| Audience | Computational skills required | Duration | -:----------|:----------|:----------| -| Biologists | [Beginner bash](https://hbctraining.github.io/Training-modules/Intro_shell/) | 2-3 hour workshop (~2-3 hours of trainer-led time) | - - -### Description - -This repository has teaching materials for a **2 hour**, hands-on **Intermediate bash** workshop led at a relaxed pace. Many tools for the analysis of big data require knowledge of the command line, and this workshop will build on the basic skills taught in the **Introduction to the command-line interface** workshop to allow for greater automation. This workshop will include lessons on using the command-line text editor, Vim, to create and edit files, utilizing for-loops for automation, using variables to store information, and writing scripts to perform a series of commands in a sequential order. - -### Learning Objectives - -* Learning basic operations using the Vim text editor -* Capturing previous commands into a script to re-run with one single command -* Understanding variables and storing information -* Learning how to use variables to operate on multiple files - -> These materials are developed for a trainer-led workshop, but also amenable to self-guided learning. 
- - -### Contents - -| Lessons | Estimated Duration | -|:------------------------|:----------| -|[Setting up](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/setting_up.html) | 15 min | -|[Exploring the basics](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/exploring_basics.html) | 25 min | -|[Introduction to Vim](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/vim.html) | 25 min | -|[Shell scripts and `for` loops](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/loops_and_scripts.html) | 75 min | - -### Dataset - -[Introduction to Shell: Dataset](https://github.com/hbctraining/Training-modules/blob/master/Intro_shell/data/unix_lesson.zip?raw=true) - -### Installation Requirements - -***Mac users:*** -No installation requirements. - -***Windows users:*** -[GitBash](https://git-scm.com/download/win) - diff --git a/Intermediate_shell/img/Job_dependencies.png b/Intermediate_shell/img/Job_dependencies.png deleted file mode 100644 index 39699fae..00000000 Binary files a/Intermediate_shell/img/Job_dependencies.png and /dev/null differ diff --git a/Intermediate_shell/img/associative_array.png b/Intermediate_shell/img/associative_array.png deleted file mode 100644 index 28a885ed..00000000 Binary files a/Intermediate_shell/img/associative_array.png and /dev/null differ diff --git a/Intermediate_shell/img/new_image.txt b/Intermediate_shell/img/new_image.txt deleted file mode 100644 index c2403e23..00000000 --- a/Intermediate_shell/img/new_image.txt +++ /dev/null @@ -1 +0,0 @@ -new placeholder file diff --git a/Intermediate_shell/img/positional-parameter.jpg b/Intermediate_shell/img/positional-parameter.jpg deleted file mode 100644 index c278b121..00000000 Binary files a/Intermediate_shell/img/positional-parameter.jpg and /dev/null differ diff --git a/Intermediate_shell/img/simpsons.gif b/Intermediate_shell/img/simpsons.gif deleted file mode 100644 index 077c41d1..00000000 
Binary files a/Intermediate_shell/img/simpsons.gif and /dev/null differ diff --git a/Intermediate_shell/img/vim_insert.png b/Intermediate_shell/img/vim_insert.png deleted file mode 100644 index 4ed28c7f..00000000 Binary files a/Intermediate_shell/img/vim_insert.png and /dev/null differ diff --git a/Intermediate_shell/img/vim_postsave.png b/Intermediate_shell/img/vim_postsave.png deleted file mode 100644 index 976b286c..00000000 Binary files a/Intermediate_shell/img/vim_postsave.png and /dev/null differ diff --git a/Intermediate_shell/img/vim_quit.png b/Intermediate_shell/img/vim_quit.png deleted file mode 100644 index 8980329c..00000000 Binary files a/Intermediate_shell/img/vim_quit.png and /dev/null differ diff --git a/Intermediate_shell/img/vim_save.png b/Intermediate_shell/img/vim_save.png deleted file mode 100644 index 641f31b6..00000000 Binary files a/Intermediate_shell/img/vim_save.png and /dev/null differ diff --git a/Intermediate_shell/img/vim_spider.png b/Intermediate_shell/img/vim_spider.png deleted file mode 100644 index 0432645c..00000000 Binary files a/Intermediate_shell/img/vim_spider.png and /dev/null differ diff --git a/Intermediate_shell/img/vim_spider_number.png b/Intermediate_shell/img/vim_spider_number.png deleted file mode 100644 index 67694e17..00000000 Binary files a/Intermediate_shell/img/vim_spider_number.png and /dev/null differ diff --git a/Intermediate_shell/lessons/AWK_module.md b/Intermediate_shell/lessons/AWK_module.md deleted file mode 100644 index e1f4241e..00000000 --- a/Intermediate_shell/lessons/AWK_module.md +++ /dev/null @@ -1,332 +0,0 @@ -**NOTE: This module assumes that you are familiar with grep and sed** - -## What is awk? - -If you have ever looked up how to do a particular string manipulation using bash in [stackoverflow](https://stackoverflow.com/) or [biostars](https://www.biostars.org/) then you have probably seen someone give an `awk` command as a potential solution. 
-
-`awk` is an interpreted programming language designed for text processing; it is typically used as a data extraction and reporting tool and was especially designed to support one-liner programs. You will often see the phrase "awk one-liner". `awk` was created at Bell Labs in the 1970s and its name comes from the surnames of its authors: Alfred **A**ho, Peter **W**einberger, and Brian **K**ernighan. Because the name comes from initials you will often see it written as `AWK`. `awk` shares a common history with `sed` and even `grep`, dating back to `ed`. As a result, some of the syntax and functionality may feel familiar at times.
-
-## I already know grep and sed, why should I learn awk?
-
-`awk` can be seen as an intermediate between simple tools like `grep` and `sed` and more sophisticated approaches.
-
-```
-The Enlightened Ones say that...
-
-You should never use C if you can do it with a script;
-You should never use a script if you can do it with awk;
-Never use awk if you can do it with sed;
-Never use sed if you can do it with grep.
-```
-[Text source](http://awk.info/?whygawk)
-
-This is best understood if we start with `grep` and work our way up. We will use these tools on a complex file we have been given, `animal_observations.txt`.
-
-This file came to be when a park ranger named Parker asked rangers at other parks to make monthly observations of the animals they saw that day.
All of the other rangers sent Parker comma-separated lists and he collated them into the following file:
-
-```
-Date Yellowstone Yosemite Acadia Glacier
-1/15/01 bison,elk,coyote mountainlion,coyote seal,beaver,bobcat couger,grizzlybear,elk
-2/15/01 pronghorn blackbear,deer moose,hare otter,deer,mountainlion
-3/15/01 cougar,grizzlybear fox,coyote,deer deer,skunk beaver,elk,lynx
-4/15/01 moose,bison bobcat,coyote blackbear,deer mink,wolf
-5/15/01 coyote,deer blackbear,marmot otter,fox deer,blackbear
-6/15/01 pronghorn coyote,deer mink,deer bighornsheep,deer,otter
-7/15/01 cougar,grizzlybear fox,coyote,deer seal,porpoise,deer beaver,otter
-8/15/01 moose,bison bobcat,coyote hare,fox lynx,coyote
-9/15/01 blackbear,lynx,coyote coyote,deer seal,porpoise,deer elk,deer
-10/15/01 beaver,bison,wolf marmot,coyote coyote,seal,skunk mink,wolf
-11/15/01 bison,elk,coyote marmot,fox deer,skunk moose,blackbear
-12/15/01 crane,beaver,blackbear mountainlion,coyote mink,deer bighornsheep,beaver
-1/15/02 moose,bison coyote,deer coyote,seal,skunk couger,grizzlybear,elk
-2/15/02 cougar,grizzlybear marmot,fox otter,fox mountaingoat,deer,elk
-3/15/02 beaver,bison,wolf blackbear,deer moose,hare mountainlion,bighornsheep
-4/15/02 pronghorn fox,coyote,deer deer,skunk couger,grizzlybear,elk
-5/15/02 coyote,deer blackbear,marmot hare,fox mink,wolf
-6/15/02 crane,beaver,blackbear bobcat,coyote seal,porpoise,deer elk,deer
-7/15/02 bison,elk,coyote marmot,fox coyote,seal,skunk couger,grizzlybear,elk
-8/15/02 cougar,grizzlybear blackbear,marmot blackbear,deer mountaingoat,deer,elk
-9/15/02 moose,bison coyote,deer hare,fox elk,deer
-10/15/02 beaver,bison,wolf mountainlion,coyote deer,skunk bighornsheep,beaver
-11/15/02 moose,bison blackbear,marmot mink,deer couger,grizzlybear,elk
-12/15/02 coyote,deer fox,coyote,deer moose,hare moose,blackbear
-```
-
-We see the date of observation and then the animals observed at each of the 4 parks. Each column is separated by a tab.
-
-* Please copy and paste the above into a command-line document called animal_observations.txt.
-
-So let's say that we want to know how many dates a cougar was observed at any of the parks. We can easily use `grep` for that:
-
-```bash
-grep "cougar" animal_observations.txt
-```
-When we do that 4 lines pop up, so 4 dates. We could also pipe to `wc` to get a number:
-
-```bash
-grep "cougar" animal_observations.txt | wc -l
-```
-
-There seemed to be more instances of cougar though. 4 seems low compared to what I saw when glancing at the document. If we look at the document again, we can see that the park ranger from Glacier National Park cannot spell and put "couger" instead of "cougar". Come on man!
-
-Replacing those will be a bit hard with `grep`, but we can use `sed` instead!
-
-```bash
-sed 's/couger/cougar/g' animal_observations.txt > animal_observations_edited.txt
-```
-
-We are telling `sed` to replace all instances of "couger" with "cougar" and output the results to a new file called animal_observations_edited.txt. If we rerun our `grep` command we can see that we now have 9 lines (dates) instead of 4.
-
-So far so good. But let's now say that we want to know how many times a coyote was observed at Yosemite Park (ignoring all other parks) without editing our file...
-
-While this is *possible* with `grep`, it is actually easier to do with `awk`!
-
-
-## Ok you convinced me (I mean I signed up for this module...) how do I start with awk?
-
-Before we dive too deeply into `awk` we need to define two terms that `awk` will use a lot:
-
-- ***Field*** - This is a column of data
-- ***Record*** - This is a row of data
-
-For our first `awk` command let's mimic what we just did with `grep`.
To pull all instances of cougar from animal_observations_edited.txt using `awk`:
-
-```bash
-awk '/cougar/' animal_observations_edited.txt
-```
-
-Here '/cougar/' is the pattern we want to match, and **since we have not told `awk` anything else it performs its default behavior, which is to print the matched lines**.
-
-But we only care about coyotes from Yosemite Park! How do we do that?
-
-```bash
-awk '$3 ~ /coyote/' animal_observations_edited.txt
-```
-
-Let's break this down!
-
-* First, all `awk` programs are encased in single quotes (`' '`), so whatever you are telling `awk` to do needs to be in between those.
-
-* Then I have noted that I want to look at column 3 (the Yosemite observations) in particular. The columns are separated (defined) by white space (one or more consecutive blanks or tabs) and are denoted by the `$` sign. So `$1` is the value of the first column, `$2` the second, etc. `$0` contains the entire original line, including the separators.
-
-* So the Yosemite column is `$3` and we are asking for lines where the string "coyote" is present. We recognize the '/string/' part from our previous command.
-
-As we run this command we see that the output is super messy because Parker's original file is a bit of a mess. This is because the default behavior of `awk` is to print all matching lines. It is hard to even check if the command did the right thing. However, we can ask `awk` to only print the date and the Yosemite column (columns 1 and 3):
-
-
-```bash
-awk '$3 ~ /coyote/ {print $1,$3}' animal_observations_edited.txt
-```
-
-This shows a great feature of `awk`: chaining commands. The print command within the `{}` will ONLY be executed when the first criterion is met.
-
-We now know basic `awk` syntax:
-
-```
-awk ' /pattern/ {action} ' file1 file2 ... fileN
-```
-
-A few things to note before you try it yourself!
-
-> The action is performed on every line that matches the pattern.
-> If a pattern is not provided, the action is performed on every line of the file.
-> If an action is not provided, then all lines matching the pattern are printed (we already knew this one!)
-> Since both patterns and actions are optional, actions must be enclosed in curly brackets to distinguish them from patterns.
-
-****
-
-**Exercise**
-
-Can you print all of the times a seal was observed in Acadia Park? Did you print it the messy or neat way?
-
-Were seals ever observed in any of the other parks (hint: There are multiple ways to answer this question!)?
-
-****
-
-Before we move on, it is sometimes helpful to know that regular text can be added to `awk` print commands. For example, we can modify our earlier command to be:
-
-```bash
-awk '$3 ~ /coyote/ {print "On this date coyotes were observed in Yosemite Park", $1}' animal_observations_edited.txt
-```
-
-Did you notice what was modified from the previous command besides the addition of the string "On this date coyotes were observed in Yosemite Park"?
-
-## awk predefined variables
-
-Before we continue our `awk` journey we want to introduce you to some of the `awk` predefined variables. Although there are more than just the ones we cover, these are the most helpful to start. More can be found [here](https://www.gnu.org/software/gawk/manual/html_node/Built_002din-Variables.html)
-
-* NR - The number of records processed (i.e., rows)
-* FNR - The number of records processed in the current file. This is only needed if you give `awk` multiple files. For the first file FNR==NR, but for the second FNR will restart from 1 while NR will continue to increment.
-* NF - Number of fields in the current record (i.e., columns in the row)
-* FILENAME - name of current input file
-* FS - Field separator, which is space or TAB by default
-
-NR is particularly useful for skipping records (i.e., rows). For example, if we only care about coyotes observed in 2002 and not 2001 we can skip records 1-13 of `animal_observations_edited.txt`.
-
-```bash
-awk 'NR>13 && $3 ~ /coyote/ {print $1,$3}' animal_observations_edited.txt
-```
-Because we have given two patterns to match (record number greater than 13 and column 3 containing the string coyote) we need to put '&&' in between them to note that we need both fulfilled.
-
-You have probably already noticed that Parker's file contains both comma-separated and tab-separated fields. This is no problem for `awk` if we set the FS variable. Let's use both FS and NF to print the total number of kinds of animals observed in all the parks. Note that we will not delete duplicates (i.e., if coyotes are observed in both Yosemite and Acadia we will consider it to be 2 instead of 1).
-
-```bash
-awk -F '[[:blank:],]' '{print NF}' animal_observations_edited.txt
-```
-
-This is more complex than anything else we have done, so let's break it down:
-
-* First, you might be curious why we are using -F instead of -FS. FS represents the field separator and to CHANGE it we use -F. We can think of it as -F 'FS'. Here we have to do a bit of regex magic where we accept any white space or commas. Although understanding this regex is beyond this module, we include it here as many NGS formats include multiple kinds of field separators (e.g., VCF files).
-
-* We then skip denoting any pattern and ask `awk` to simply print the number of fields. After you run this command you might notice that there are two issues. First, because the date field is included, NF is always one higher than the number of animals. `awk` does math too and we can modify this command!
-
-```bash
-awk -F '[[:blank:],]' '{print NF-1}' animal_observations_edited.txt
-```
-
-Easy peasy!
-
-****
-
-**Exercise**
-
-The second issue is that we don't want to include the first record (row) as this is our header and not representative of any animals. How would you modify the command to skip the first record?
-
-****
-
-
-## Piping different separators
-
-We can do more advanced commands with our separators by piping `awk`.
For example, we can pull lines where coyote is the **SECOND** animal listed for Yosemite Park.
-
-Before we do that, let's take a step back. You may be wondering why on earth we need this kind of command. While something like this may not be particularly useful for Parker's data, this kind of command is key for looking at some complex NGS files!
-
-For example, take a look at this GFF file:
-
-```
-chr3 ENSEMBL five_prime_UTR 50252100 50252137 . + . ID=UTR5:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=2;exon_id=ENSE00003567505.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
-chr3 ENSEMBL three_prime_UTR 50257691 50257714 . + . ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=8;exon_id=ENSE00003524043.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
-chr3 ENSEMBL three_prime_UTR 50258368 50259339 . + . ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=9;exon_id=ENSE00001349779.3;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
-chr3 ENSEMBL gene 50227436 50227490 . + . ID=ENSG00000275334.1;gene_id=ENSG00000275334.1;gene_type=miRNA;gene_name=MIR5787;level=3;hgnc_id=HGNC:49930
-chr3 ENSEMBL gene 52560570 52560707 . + . 
ID=ENSG00000221518.1;gene_id=ENSG00000221518.1;gene_type=snRNA;gene_name=RNU6ATAC16P;level=3;hgnc_id=HGNC:46915
-chr3 ENSEMBL transcript 52560570 52560707 . + . ID=ENST00000408591.1;Parent=ENSG00000221518.1;gene_id=ENSG00000221518.1;transcript_id=ENST00000408591.1;gene_type=snRNA;gene_name=RNU6ATAC16P;transcript_type=snRNA;transcript_name=RNU6ATAC16P-201;level=3;transcript_support_level=NA;hgnc_id=HGNC:46915;tag=basic,Ensembl_canonical
-```
-
-We can see that all columns are tab-separated, but column 9 has a bunch of ;-separated items. This type of command would be useful for something like pulling out all lines where gene_type is snRNA. In fact, all of the commands we are teaching today are useful on one or another NGS-related file (VCF, GFF, GTF, BED, etc.). We are using Parker's data instead because we can use **ALL** of these types of commands on his dataset.
-
-Returning to our original task, pulling lines where coyote is the **SECOND** animal listed for Yosemite Park. We can do it like this:
-
-```bash
-awk '{ print $3 }' animal_observations_edited.txt | awk -F "," '$2 ~ "coyote"'
-```
-
-We use the first part:
-
-```bash
-awk '{ print $3 }' animal_observations_edited.txt
-```
-
-to simply extract the Yosemite data (column 3). We use the second part:
-
-```bash
-awk -F "," '$2 ~ "coyote"'
-```
-
-to separate the comma-separated fields of column 3 and ask which lines have the string coyote in field 2. We want to print the entire comma-separated list (i.e., column 3) to test our code, which is the default behavior of `awk` in this case.
-
-* You might have noticed that here we used "coyote" instead of /coyote/. With the `~` operator a quoted string is still treated as a regular expression, so the two forms behave the same here; if you wanted field 2 to be exactly coyote (and nothing more), you would use `$2 == "coyote"` instead.
-
-****
-
-**Exercise**
-
-What command would you give to print all of the observation dates that took place in May?
-
-****
-
-## Counting
-
-One of the best features of `awk` is that it can count up how many times a string occurs in a column. Let's use this to see how many times each set of animal observations occurs in Yellowstone Park.
-
-```bash
-awk ' { counter[$2] += 1 } END { for (animalgroup in counter){ print animalgroup, counter[animalgroup] } }' animal_observations_edited.txt
-```
-
-This command is complex and contains new syntax, so let's go through it bit by bit:
-
-* First we set up a variable that we called counter `{ counter[$2] += 1 }`. This variable is special because it is followed by brackets [ ], which makes it an associative array, a fancypants name for a variable that stores key-value pairs.
-
-* Here our keys will be our animal groups (i.e., the different values of column 2) and the values will be the counter for each of these. When we set up the counter the values are initialized to 0. For every line in the input, we add 1 to the value in the array whose key is equal to $2.
-
-* Note that we use the addition operator `+=` as a shortcut for `counter[$2] = counter[$2] + 1`.
-
-* We want this counter to run through every line of text before we look at the output. To do this we use the special pattern `END`, which matches once `awk` reaches the end of the file (we won't cover it here, but its counterpart is `BEGIN`).
-
-* After we tell `awk` to wait until the end of the file, we tell it what we want it to do when it gets there: `{ for (animalgroup in counter){ print animalgroup, counter[animalgroup] }}`
-
-* Here we have given a for loop. For each key in counter (`animalgroup in counter`) we want `awk` to print that key (`print animalgroup`) and its corresponding value (`counter[animalgroup]`). I have named this animalgroup because that is what we are counting, but this can be named whatever you want.
-
-Now that we understand our command, let's run it!
-
-It works!
We can see that "moose,bison" is the most commonly observed group of animals at Yellowstone! How thrilling!
-
-**Exercise**
-
-1. What was the most commonly observed group of animals at Glacier National Park?
-
-2. Our code also counts the number of times our header text (Yellowstone or Glacier) is repeated. How can you modify the code so that this is ignored?
-****
-
-
-## MFC
-
-We will end by taking a look at MFC (my favorite code). This is an `awk` one-liner I use all the time.
-
-```bash
-for ((i=1; i<=10; i+=1))
- do
-sam=$(awk -v awkvar="${i}" 'NR==awkvar' samples.txt)
-samtools view -S -b ${sam}.sam > ${sam}.bam
-
-done
-```
-
-This actually combines a number of basic and intermediate shell topics such as [positional parameters](positional_params.md), [for loops](loops_and_scripts.md), and `awk`!
-
-* We start with a for loop that counts from 1 to 10
-
-* Then for each value of `i` the awk command `awk -v awkvar="${i}" 'NR==awkvar' samples.txt` is run and the output is assigned to the variable `${sam}`.
-
-* Then, using the variable `${sam}`, a samtools command is run to convert a file from .sam to .bam
-
-With our new `awk` expertise let's take a look at that `awk` command alone!
-
-
-```bash
-awk -v awkvar="${i}" 'NR==awkvar' samples.txt
-```
-
-We have not encountered -v yet. The correct syntax is `-v var=val`, which assigns the value val to the variable var before execution of the program begins. So what we are doing is creating our own variable within our `awk` program, calling it `awkvar` and assigning it the value of `${i}`, which will be a number between 1 and 10 (see for loop above). `${i}` and thus `awkvar` will be different for each loop.
-
-Then we are simply asking for the record where the predefined variable `NR` equals `awkvar`, which equals ${i}.
-
-Here is what samples.txt looks like
-
-```
-my_sample1_rep1
-my_sample1_rep2
-my_sample2_rep1
-my_sample2_rep2
-...
-my_sample5_rep2
-```
-
-When `${i}` is equal to 3, what will our `awk` command spit out? Why?
-Why do you think that this is MFC?
-
-
-### With our new expertise, we can not only write our own `awk` commands but we can understand commands that others have written. Go forth and `awk`!
-
-
-
-
diff --git a/Intermediate_shell/lessons/advanced_lessons.md b/Intermediate_shell/lessons/advanced_lessons.md
deleted file mode 100644
index d51b6818..00000000
--- a/Intermediate_shell/lessons/advanced_lessons.md
+++ /dev/null
@@ -1,1050 +0,0 @@
-# Advanced Shell Outline
-
-The following sections use a sample text file (animals.txt) to demonstrate each command's impact
-
-## sed
-
-The ***s***tream ***ed***itor, `sed`, is a common tool used for text manipulation. `sed` takes input either from a file or piped from a previous command, and applies a transformation to it before outputting it to standard out.
-
-### substitution
-
-One common usage for `sed` is to replace one word with another. The syntax for doing this is:
-
-```
-sed 's/pattern/replacement/flag' file.txt
-```
-
-A few things to note here:
-
-1) The `s` in `'s/pattern/replacement/flag'` is directing `sed` to do a ***s***ubstitution.
-2) The `flag` in `'s/pattern/replacement/flag'` is directing `sed` to carry out this action in a specific manner. It is very common to use the flag `g` here, which will carry out the action ***g***lobally, or each time it matches the `pattern`. If `g` is not included, it will just replace the `pattern` the first time it is observed per line. If you would like to replace a particular occurrence, like the third time it is observed in a line, you would use `3`.
-
-Let's test this out on our sample file and see the output. First, we are interested in replacing 'jungle' with 'rainforest' throughout the file:
-
-```
-sed 's/jungle/rainforest/g' animals.txt
-```
-
-Notice how all instances of 'jungle' have been replaced with 'rainforest'.
However, if we don't include the global option:
-
-```
-sed 's/jungle/rainforest/' animals.txt
-```
-
-we will see that only the first instance of 'jungle' on each line was replaced with 'rainforest'. If we want to replace only the second occurrence of 'jungle' with 'rainforest' on a line, modify the occurrence flag to be `2`:
-
-```
-sed 's/jungle/rainforest/2' animals.txt
-```
-
-It is important to note that the pattern-matching in `sed` is case-sensitive. To make your pattern searches case-insensitive, you will need to add the `I` flag:
-
-```
-sed 's/Jungle/rainforest/Ig' animals.txt
-```
-
-This will now replace all instances of Jungle/jungle/JuNgLe/jUngle/etc. with 'rainforest'.
-
-***I don't know if you can do multiple occurrences, like 2nd and 4th***
-
-***Haven't discussed the 2g syntax that will replace the 2nd occurrence and all subsequent occurrences***
-
-#### Addresses
-
-##### Single lines
-
-One can also direct which line, the ***address***, `sed` should make an edit on by adding the line number in front of `s`. This is most common when one wants to make a substitution for a pattern in a header line and is worried that the pattern might be elsewhere in the file. It is best practice to wrap your substitution argument in curly brackets (`{}`) when using an address. To demonstrate this we can compare two commands:
-
-```
-sed 's/an/replacement/g' animals.txt
-sed '1{s/an/replacement/g}' animals.txt
-```
-
-In the first command, `sed 's/an/replacement/g' animals.txt`, we have replaced all instances of 'an' with 'replacement'. However, in the second command, `sed '1{s/an/replacement/g}' animals.txt`, we have only replaced instances on line 1.
-
-While wrapping the substitution in curly brackets isn't required when using a single line, it is necessary when defining an interval. As you can see:
-
-```
-sed '1s/an/replacement/g' animals.txt
-```
-
-produces the same output as above.
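To make the effect of a line address concrete, here is a minimal, self-contained sketch. It assumes GNU `sed` and builds a throwaway three-line file, `demo.txt` (a hypothetical file, not part of the lesson's dataset):

```shell
# Build a small throwaway file (hypothetical; not the lesson's animals.txt)
printf 'jungle cat\njungle dog\njungle fox\n' > demo.txt

# No address: the substitution is applied on every line
sed 's/jungle/rainforest/' demo.txt

# Address '2': the substitution is applied on line 2 only;
# lines 1 and 3 pass through unchanged
sed '2{s/jungle/rainforest/}' demo.txt
```

Because `-n` is not used here, every input line is printed once, whether it was edited or not.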
-
-##### Intervals
-
-If you only want to have this substitution carried out on the first three lines (`1,3`, this is giving an address interval, from line 1 to line 3), you would need to include the curly brackets:
-
-```
-sed '1,3{s/an/replacement/g}' animals.txt
-```
-
-You can also replace the second address with a `$` to indicate 'until the end of the file', like:
-
-```
-sed '5,${s/an/replacement/g}' animals.txt
-```
-
-This will carry out the substitution from the fifth line until the end of the file.
-
-You can also use regular expressions in the address field. For example, if you only wanted the substitution to happen between the first occurrence of 'monkey' and the first occurrence of 'alligator', you could do:
-
-```
-sed '/monkey/,/alligator/{s/an/replacement/g}' animals.txt
-```
-
-Alternatively, if you want a replacement to happen everywhere except on a given line, such as on all of your data lines but not on the header line, then you can use `!`, which tells sed 'not':
-
-```
-sed '1!{s/an/replacement/g}' animals.txt
-```
-
-You can even couple `!` with the regular expression intervals to do the substitution everywhere outside the interval:
-
-```
-sed '/monkey/,/alligator/!{s/an/replacement/g}' animals.txt
-```
-
-Lastly, you can use `N~n` in the address (a GNU `sed` extension) to indicate that you want to apply the substitution to every *n*th line starting on line *N*. In the below example, the substitution will occur on the first line and then on every 2nd line after it:
-
-```
-sed '1~2{s/an/replacement/g}' animals.txt
-```
-
-#### -n option
-
-In `sed` the `-n` option suppresses the automatic printing of each line. However, you can pair it with the `p` flag, and this will print out only the lines that were edited.
-
-```
-sed -n 's/an/replacement/p' animals.txt
-```
-
-The `-n` option has another useful purpose: you can use it to find the line number of a matched pattern by using `=` after the pattern you are searching for:
-
-```
-sed -n '/jungle/ =' animals.txt
-```
-
-### Deletion
-
-You can delete entire lines in `sed`. To delete lines, provide the address followed by `d`. To delete the first line from a file:
-
-```
-sed '1d' animals.txt
-```
-
-Like substitutions, you can provide an interval, and this will delete line 1 to line 3:
-
-```
-sed '1,3d' animals.txt
-```
-
-Also like substitution, you can use `!` to specify lines not to delete, like:
-
-```
-sed '1,3!d' animals.txt
-```
-
-Additionally, you can also use regular expressions to provide the addresses that define an interval to delete from. In this case we are interested in deleting from the first instance of 'alligator' until the end of the file:
-
-```
-sed '/alligator/,$d' animals.txt
-```
-
-The `N~n` syntax also works in deletion. If we want to delete every third line starting on line 2, we can do:
-
-```
-sed '2~3d' animals.txt
-```
-
-### Appending
-
-#### Appending text
-
-You can append a new line with the word 'ape' after the 2nd line using the `a` command in `sed`:
-
-```
-sed '2 a ape' animals.txt
-```
-
-If you want the new text to come before the address, you need to use the `i` (***i***nsert) command instead:
-
-```
-sed '2 i ape' animals.txt
-```
-
-You can also do this over an interval, like from the 2nd to 4th line:
-
-```
-sed '2,4 a ape' animals.txt
-```
-
-Additionally, you can append the text after every 3rd line beginning with the second line:
-
-```
-sed '2~3 a ape' animals.txt
-```
-
-Lastly, you can also append after a matched pattern:
-
-```
-sed '/monkey/ a ape' animals.txt
-```
-
-#### Appending a file
-
-You might be interested in inserting the contents of **file B** at a certain point in **file A**.
For example, if you wanted to insert the contents of `file_B.txt` after line `4` in `file_A.txt`, you could do:
-
-```
-sed '4 r file_B.txt' file_A.txt
-```
-
-Instead of line `4`, you can append the file after every line in the interval from line 2 to line 4 with:
-
-```
-sed '2,4 r file_B.txt' file_A.txt
-```
-
-You can also append the file after every single line by using the `1~1` syntax:
-
-```
-sed '1~1 r file_B.txt' file_A.txt
-```
-
-The `r` argument is telling `sed` to ***r***ead in `file_B.txt`.
-
-Instead of inserting at a specific line, you can also insert on a pattern:
-
-```
-sed '/pattern/ r file_B.txt' file_A.txt
-```
-
-Lastly, you could also insert a file at the end using `$`:
-
-```
-sed '$ r file_B.txt' file_A.txt
-```
-
-But this is the same result as simply concatenating two files together, like:
-
-```
-cat file_A.txt file_B.txt
-```
-
-### Replacing Lines
-
-You can also replace entire lines in `sed` using the `c` command. We could replace the first line with the word 'header' by:
-
-```
-sed '1 c header' animals.txt
-```
-
-This can also be utilized in conjunction with the `A,B` interval syntax, but be aware that it will replace ALL lines in that interval with a SINGLE line.
-
-```
-sed '1,3 c header' animals.txt
-```
-
-You can also replace every *n*th line starting at the *N*th line using the `N~n` address syntax:
-
-```
-sed '1~3 c header' animals.txt
-```
-
-Lastly, you can also replace lines matching a pattern:
-
-```
-sed '/animal/ c header' animals.txt
-```
-
-### Translation
-
-`sed` has a feature that allows you to translate characters, similarly to the `tr` command in `bash`. If you wanted to translate all of the lowercase a, b and c characters to their uppercase equivalents, you could do that with the `y` command:
-
-```
-sed 'y/abc/ABC/' animals.txt
-```
-
-In this case every 'a' is replaced with 'A', every 'b' with 'B' and every 'c' with 'C'.
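As a quick, self-contained illustration of `y` (using a throwaway file, `letters.txt`, which is hypothetical rather than part of the lesson's dataset), every occurrence on every line is translated, and `tr` gives the same result:

```shell
# Throwaway input (hypothetical; not the lesson's animals.txt)
printf 'cab\nbach\n' > letters.txt

# Position-wise translation: a->A, b->B, c->C
sed 'y/abc/ABC/' letters.txt

# The same translation with tr
tr 'abc' 'ABC' < letters.txt
```

Unlike `s`, the `y` command takes no regular expression or flags; each character in the first list is mapped to the character at the same position in the second list, and both lists must be the same length.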
### Multiple expressions

#### `-e` option

If you would like to carry out multiple `sed` expressions in the same command, you can use the `-e` option; after each `-e` you provide the expression you would like `sed` to evaluate. For example, one could change 'jungle' to 'rainforest' and 'grasslands' to 'Serengeti':

```
sed -e 's/jungle/rainforest/g' -e 's/grasslands/Serengeti/g' animals.txt
```

One can also combine different types of expressions. For instance, one could change 'jungle' to 'rainforest' using a substitution expression and then use a deletion expression to remove the header line:

```
sed -e 's/jungle/rainforest/g' -e '1d' animals.txt
```

#### `-f` option

If you have a large number of `sed` expressions, you can also place them in a text file with each expression on a separate line:

```
s/jungle/rainforest/g
s/grasslands/Serengeti/g
1d
```

If this file was named 'sed_expressions.txt', our command could look like:

```
sed -f sed_expressions.txt animals.txt
```

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/extra_bash_tools.md#sed

https://www.grymoire.com/Unix/Sed.html#uh-8

## Regular Expressions

Regular expressions (sometimes referred to as regex) are strings of characters that can be used as a pattern to match against. This can be very helpful when searching through a file, particularly in conjunction with `sed`, `grep` or `awk`.

### `[]`

Square brackets can be used to denote a set of acceptable characters in a position.
`[BPL]ATCH` could match 'BATCH', 'PATCH' or 'LATCH'

You can also use `-` to denote a range of characters:

`[A-Z]ATCH` would match 'AATCH', 'BATCH'...'ZATCH'

You can also merge different ranges by putting them right after each other or separating them with a `|`:

`[A-Za-z]ATCH` or `[A-Z|a-z]ATCH` would match 'AATCH', 'BATCH'...'ZATCH' and 'aATCH', 'bATCH'...'zATCH' (note that inside `[]` the `|` is treated as a literal character, so the first form is preferred)

In fact, regular expression ranges generally follow the [ASCII alphabet](https://en.wikipedia.org/wiki/ASCII) (though your local character encoding may vary), so:

`[0-z]ATCH` would match '0ATCH', '1ATCH', '2ATCH'...'AATCH'...'zATCH'. However, it is important to note that the ASCII alphabet has a few characters between the numbers and the uppercase letters, such as ':' and '>', so you would also match ':ATCH' and '>ATCH'. There are also a few characters between the upper and lowercase letters, such as '^' and ']'. If you want to match numbers, uppercase letters and lowercase letters, but NOT these in-between characters, you need to write the ranges out explicitly:

`[0-9A-Za-z]ATCH`

Also note that, because these characters follow the ASCII encoding order, `[Z-A]` will give you an error telling you it is an invalid range, since 'Z' comes after 'A'.

The `^` ***within*** `[]` functions as a 'not'. For example:

`[^C]ATCH` will match anything ending in 'ATCH' ***except*** 'CATCH'.

***IMPORTANT NOTE: `^` has a different function when used outside of `[]`, which is discussed below under anchors.***

### `*`

The `*` matches the preceding character any number of times, ***including*** zero.

`CA*TCH` would match 'CTCH', 'CATCH', 'CAATCH' ... 'CAAAAAAATCH' ...

### `?`

The `?` denotes that the previous character is optional. In the following example:

`C?ATCH` would match 'CATCH' and even 'ATCH'; and because the 'C' is optional, lines containing 'BATCH', '2ATCH' or '^ATCH' also match (they all contain 'ATCH')

### `.`

The `.` matches any single character except a newline. Notably, it ***does not*** match the absence of a character.
This is similar to the behavior of the wildcard `?` in Unix shell globbing.

`.ATCH` would match 'CATCH', 'BATCH', '2ATCH', '^ATCH', but ***not*** 'ATCH'

### `{}`

`{INTEGER}` matches the preceding character a number of times equal to INTEGER.

`CA{3}TCH` would match 'CAAATCH', but ***not*** 'CATCH', 'CAATCH' or 'CAAAATCH'.

### `+`

The `+` matches one or more occurrences of the preceding character.

`CA+TCH` would match 'CATCH', 'CAATCH' ... 'CAAAAAAAATCH' ...

### Anchors

#### `^`

The `^` character anchors the search criteria to the beginning of the line. For example:

`^CAT` would match lines that start with 'CAT', such as 'CATCH', but ***not*** 'BOBCAT'

***NOTE: `^` within `[]` behaves differently. Remember, it functions as 'not'!***

#### `$`

The `$` character anchors the search criteria to the end of the line. For example:

`CAT$` would match lines ending in 'CAT' or 'BOBCAT', but not 'CATCH'

### `\` for literal matches

One problem you will likely run into with the special characters above is that you may want to match one of them literally. For example, you may want to match '.' or '?', and this is what the escape character, `\`, is for.

`C\?TCH` would match 'C?TCH', but not 'CATCH' or 'CCTCH' like `C?TCH` would.

### Whitespace and new lines

You can search for tabs with `\t`, whitespace with `\s` and newlines with `\n`.

`CA\tTCH` would match 'CA' and 'TCH' separated by a tab

### Examples of Combining Special Characters

Much of the power of regular expressions comes from combining them to match the pattern you want.

If you want to find any line that starts with the uppercase letters 'A'-'G', you could do:

`^[A-G]`

Perhaps you want to find all lines ending with 'CA' followed by any single character except 'T'; then you could do:

`CA[^T]$`

Another thing you may be interested in is finding lines that start with 'C' and end with 'CH', with anything, including nothing, in between.
`^C.*CH$`

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/extra_bash_tools.md#regular-expressions-regex-in-bash-

## grep with Regular Expressions

As an alternative to writing out these ranges, there are predefined character classes such as:

`[[:alpha:]]ATCH` which is equivalent to `[A-Za-z]ATCH`

`[[:alnum:]]ATCH` which is equivalent to `[0-9A-Za-z]ATCH`

`[[:digit:]]ATCH` which is equivalent to `[0-9]ATCH`

`[[:punct:]]ATCH` which matches a set of punctuation marks.

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/extra_bash_tools.md#reintroducing-grep-gnu-regex-parser-

## awk

`awk` is a very powerful programming language in its own right, and it can do a lot more than is outlined here. `awk` shares a common history with `sed` and even `grep`, dating back to `ed`. As a result, some of the syntax and functionality may feel familiar. It is particularly useful when working with data tables in plain-text format (tab-delimited and comma-separated files). Before we dive too deeply into `awk`, we need to define two terms that `awk` uses a lot:

- ***Field*** - a column of data
- ***Record*** - a row of data

### Printing columns

Let's first look at one of its most basic and useful functions: printing columns. For this example we are going to use the tab-delimited file animals.txt that we used in the `sed` examples.

Let's first try to print just the first column:

```
awk '{print $1}' animals.txt
```

Here the `print` function tells `awk` that it should output the first field of each record. We can also choose to print out multiple columns in any order:

```
awk '{print $3,$1,$5}' animals.txt
```

By default, the output columns are separated by a space. However, this built-in variable can be modified using the `-v` option.
Once you have called the `-v` option, you need to tell it which built-in variable you are interested in modifying. In this case it is the ***O***utput ***F***ield ***S***eparator, or `OFS`, and you need to set it to whatever you would like it to be: `'\t'` for a tab, `,` for a comma or even `f` for a lowercase 'f'.

```
awk -v OFS='\t' '{print $3,$1,$5}' animals.txt
```

#### `RS` and `ORS`

As an aside, similarly to `OFS`, records are assumed to be read in and written out with a newline character by default. However, this behavior can be altered with the `RS` and `ORS` variables:

`RS` can be used to alter the input ***r***ecord ***s***eparator

`ORS` can be used to alter the ***o***utput ***r***ecord ***s***eparator

If we wanted to change the `ORS` to be a ';', we could do so like:

```
awk -v OFS='\t' -v ORS=';' '{print $3,$1,$5}' animals.txt
```

#### `-F`

The default behavior of `awk` is to split the data into columns based on whitespace (tabs or spaces). However, if you have a comma-separated file, then your fields are separated by commas, not whitespace. If we run a `sed` command to swap all of the tabs to commas and then pipe this output into `awk`, it will think there is only one field:

```
sed 's/\t/,/g' animals.txt | awk '{print $1}'
```

However, once we denote that the field separator is a comma, it will extract only the first column:

```
sed 's/\t/,/g' animals.txt | awk -F ',' '{print $1}'
```

As an alternative to `-F`, `FS` is the variable for the ***f***ield ***s***eparator and can be altered with the `-v` argument as well:

```
sed 's/\t/,/g' animals.txt | awk -v FS=',' '{print $1}'
```

### Skipping Records

Similarly to `sed`, you can also exclude records from your analysis in `awk`. `NR` is a variable holding the ***N***umber of the ***R***ecord (row) currently being processed; by the end of the file it equals the total number of records. As an aside, `NF` also exists and holds the ***N***umber of ***F***ields (columns) in the current record.
You can define the range that you want your print command to work over by specifying a condition on `NR` before your `{}`:

```
awk 'NR>1 {print $3,$1,$5}' animals.txt
```

This line, for example, is useful for removing the header. You can also set a range for the records you'd like `awk` to print out by separating your range requirements with `&&`, meaning 'and':

```
awk 'NR>1 && NR<=3 {print $3,$1,$5}' animals.txt
```

This command will print the third, first and fifth fields of animals.txt for records greater than 1 and less than or equal to 3.

### `BEGIN`

The `BEGIN` keyword will execute an `awk` expression once at the beginning of a command. This can be particularly useful if you want to give the output a header that it didn't previously have:

```
awk 'BEGIN {print "new_header"} NR>1 {print $1}' animals.txt
```

In this case we have told `awk` that we want "new_header" printed before anything else, then `NR>1` tells `awk` to skip the old header, and finally we print the first column of `animals.txt` with `{print $1}`.

### `END`

Related to `BEGIN` is the `END` keyword, which tells `awk` to run a command once at the end of the file. It is ***very*** useful when summing up columns (below), but we will first demonstrate how it works by adding a new record:

```
awk '{print $1} END {print "new_record"}' animals.txt
```

As you can see, this has simply added a new record to the end of the output. Furthermore, you can chain multiple `END` commands together to continuously add records if you wished:

```
awk '{print $1} END {print "new_record"} END {print "newer_record"}' animals.txt
```

This is equivalent to separating your `print` commands with a `;`:

```
awk '{print $1} END {print "new_record"; print "newer_record"}' animals.txt
```

### Variables

You can also use variables in `awk`.
Let's say we wanted to add 5cm to every organism's height:

```
awk 'BEGIN {print "old_height","new_height"} NR>1 {new_height=$5+5; print $5,new_height}' animals.txt
```

`BEGIN {print "old_height","new_height"}` gives us a new header

`NR>1` skips the old header

`new_height=$5+5;` creates a new variable called "new_height" and sets it equal to the height in the fifth field plus five. Note that separate commands within the same `{}` need to be separated by a `;`.

`print $5,new_height` prints the old height alongside the new height.

Lastly, you can also bring `bash` variables into `awk` using the `-v` option:

```
var=bash_variable
awk -v variable=$var '{print variable,$0}' animals.txt
```

### Calculations using columns

`awk` is also very good at handling calculations on columns. ***Remember, if your file has a header you will need to skip it, because you obviously can't divide one word by another.***

#### Column sum

Now that we understand how variables and `END` work, we can take the sum of a column, in this case the fourth column of our `animals.txt`:

```
awk 'NR>1 {sum=$4+sum} END {print sum}' animals.txt
```

`NR>1` skips our header line. While not strictly necessary here, because our header is not a number, it is considered best practice to exclude the header if there is one. If your file didn't have a header, you would omit this.

`{sum=$4+sum}` creates a variable named `sum` and updates it for each record by adding the fourth field to it.

***NOTE:*** This `{sum=$4+sum}` syntax can be, and often is, abbreviated to `{sum+=$4}`. They are equivalent, but in the context of learning, `{sum=$4+sum}` is a bit more explicit.

`END {print sum}` prints our variable `sum` once we get to the end of the file.
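To see the column sum work end to end, here is a self-contained sketch using a small stand-in file (`animals_demo.txt` and its columns are invented here; the workshop's animals.txt differs):

```shell
# Stand-in for animals.txt: a header plus three tab-delimited records,
# with a numeric value in the fourth (weight) field
printf 'name\tclass\tcolor\tweight\theight\n' >  animals_demo.txt
printf 'cat\tmammal\tblack\t4\t25\n'          >> animals_demo.txt
printf 'dog\tmammal\tbrown\t10\t50\n'         >> animals_demo.txt
printf 'hen\tbird\twhite\t2\t40\n'            >> animals_demo.txt

# Skip the header (NR>1), accumulate the fourth field, print the total at END
awk 'NR>1 {sum=$4+sum} END {print sum}' animals_demo.txt    # prints 16
```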
#### Column Average

Now that we understand how to take a column sum and retrieve the number of records, we can quite easily calculate the average for a column:

```
awk 'NR>1 {sum=$4+sum} END {records=NR-1; print sum/records}' animals.txt
```

`records=NR-1` is needed because `NR` includes our header line, so we make a new variable called `records` to hold the number of records in the file excluding the header.

If you didn't have a header line, you could get the average of a column with a command like:

```
awk '{sum=$4+sum} END {print sum/NR}' file.txt
```

#### Calculations between columns

If you wanted to divide the fifth field of animals.txt by the fourth field, you could do it like this:

```
awk 'NR>1 {print $5/$4}' animals.txt
```

You can, of course, print other columns around this calculation as well, such as:

```
awk 'NR>1 {print $1,$5/$4,$2}' animals.txt
```

Lastly, you can also set the output of a calculation equal to a new field and print that field:

```
awk 'NR>1 {$6=$5/$4; print $1,$6,$2}' animals.txt
```

`$6=$5/$4` creates a sixth field holding the fifth field divided by the fourth field. We then need to separate this from the `print` command with a `;`, after which we can call the new field we've created.

#### `$0`

There is a special variable `$0` that corresponds to the whole record. This is very useful when appending a new field to the front or end of a record, such as:

```
awk 'NR>1 {print $0,$5/$4}' animals.txt
```

***NOTE:*** If you create a new field such as `$6=$5/$4`, `$6` is now part of `$0` and will overwrite any values previously in `$6`. For example:

```
awk 'NR>1 {$6=$5/$4; print $0,$6}' animals.txt
```

You will get two copies of `$6` at the end of the output, because `$6` is now part of `$0` and you have also indicated that you want to print `$6` again.
Furthermore,

```
awk 'NR>1 {$5=$5/$4; print $0}' animals.txt
```

`$5=$5/$4` will overwrite the values previously held in `$5` once the calculation is made. Thus, the output no longer shows the original `$5`.

### `for` loops

Like many other programming languages, `awk` can also do loops. One type of loop is the basic `for` loop.

The basic syntax for a `for` loop in `awk` is:

```
awk '{for (initialize counter variable; end condition; increment) command}' file.txt
```

If you want to duplicate every entry in your file, you can do so like:

```
awk '{ for (i = 1; i <= 2; i=i+1) print $0}' animals.txt
```

`for (i = 1; i <= 2; i=i+1)` starts a `for` loop that:
- `i = 1;` starts at 1
- `i <= 2;` runs as long as the value of i is less than or equal to 2
- `i=i+1` increases the counter variable by one after each iteration. `++i` and `i++` are equivalent syntaxes to `i=i+1`.

Then we print the whole record with `print $0`.

While not discussed here, `awk` also supports `while` and `do-while` loops.

### `if` statements

Since `awk` is its own fully-fledged programming language, it also has conditional statements. A common time you might want an `if` statement in `awk` is when you have a file with tens or even hundreds of fields and you want to figure out which field has the column header of interest, or when you are writing a script for broad use where the order of the input columns may not always be the same. To do that:

```
awk 'NR==1 {for (i=1; i<=NF; i=i+1) {if ($i == "height(cm)") print i}}' animals.txt
```

Note that the comparisons use `==`; a single `=` would be an assignment.

`NR==1` only looks at the header line

`for (i=1; i<=NF; i=i+1)` begins a `for` loop running from 1 to the number of fields, incrementing by one

`if ($i == "height(cm)")` checks whether `$i`, which in our case is $1, $2, ... $6, is equal to `height(cm)`.
If this condition is met, then:

`print i` prints out the value of `i`

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/extra_bash_tools.md#awk

## Creating shortcuts or aliases and the .bashrc profile

This resource covers it pretty well:

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/more_bash.md#alias

## Copying files to and from a cluster

Oftentimes, you'll find yourself wanting to copy a file between a computing cluster and your local computer, and this is where `scp` and `rsync` can be used.

### scp

The general syntax of `scp` is similar to that of `cp`:

```
scp <source> <destination>
```

However, since the source and/or the destination is on a computing cluster, you will need to provide host information followed by a `:` and then the full path to the file/directory. If we wanted to copy a file from O2 to our local machine, we could use:

```
scp username@transfer.rc.hms.harvard.edu:/path/to/file_on_O2 /path/to/directory/local_machine
```

Alternatively, if you wanted to copy a file from your local machine to O2, you would rearrange the command:

```
scp /path/to/directory/local_machine username@transfer.rc.hms.harvard.edu:/path/to/file_on_O2
```

You can also recursively copy an entire directory with the `-r` option:

```
scp -r /path/to/directory username@transfer.rc.hms.harvard.edu:/path/to/new_directory_on_O2
```

### rsync

`rsync` is a popular alternative to `scp`. One major reason for its popularity is that if a data transfer is interrupted, `scp` needs to begin again, while `rsync` can resume from where it left off.

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/more_bash.md#copying-files-to-and-from-the-cluster-

## Symbolic Links

Symbolic links, or "sym links", are an important kind of shortcut on the command line that can save you lots of space.
A symbolic link points to a location, and can be very useful when software wants to have a file in a particular location. Of course, you could simply copy the file and have it stored in two locations, but it could be a large file that would then take up space in two places. To avoid this, you can set up a symbolic link using the following syntax:

```
ln -s <path/to/target_file> <link_name>
```

Let's assume we have a file named `reads.fasta` inside the directory `raw_reads`, and we want a symbolic link named `link_to_reads.fasta` pointing to it from the current directory, which is the parent directory of `raw_reads`. The command would look like:

```
ln -s raw_reads/reads.fasta link_to_reads.fasta
```

When you now view this directory with `ls -l`, it will display the link like:

```
link_to_reads.fasta -> raw_reads/reads.fasta
```

If you want to keep the target's name for the link, you can use `.` as the link name.

***Importantly, if the original file is deleted or moved, the symbolic link will become broken.*** On many distributions, broken symbolic links blink or change color in the terminal output.

The `-s` option is necessary for creating a symbolic link. Without the `-s` option, a ***hard link*** is created: the link and the original refer to the same data, so modifications made through the link are carried over to the original. Generally speaking, hard links are not very commonly used.

https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/more_bash.md#symlink

## Math

There are two common ways of carrying out math on the command-line interface. One way uses a language called `bc` and the other utilizes `awk`. Let's look at these two methods.

### bc

`bc` stands for *basic/bench calculator* and is actually its own standalone language. In order for math to be carried out by `bc`, the expression needs to be piped into it. In this case, we are going to pipe in the equation we want it to calculate with a preceding `echo` command.
```
echo '6 + 2' | bc
```

*NOTE: The whitespace inside `'6 + 2'` above is arbitrary and does not impact calculations.*

It should return `8`. In fact you can do many basic math operations with integers like this:

```
# Subtraction
echo "6 - 2" | bc

# Multiplication
echo "6 * 2" | bc

# Division
echo "6 / 2" | bc

# Exponent
echo "6 ^ 2" | bc

# Square Root
echo "sqrt(4)" | bc
```

You can also do more complex math that involves parentheses:

```
echo "(3 + 1) * 4" | bc
```

*NOTE: You can use single or double quotes when carrying out math with `bc`, but if you want to use `bash` variables you will need double quotes. For this reason, it is best practice to always use double quotes.*

We can also feed `bc` variables, such as:

```
variable_1=4
variable_2=3

# Will return an error
echo '$variable_1 + $variable_2' | bc

# Will return the answer
echo "$variable_1 + $variable_2" | bc
```

This should return the correct answer of `7`.

While this seems great, `bc` has some limitations. The biggest issue is that it does not handle decimals well, particularly with division. Consider the following case:

```
echo '1 / 3' | bc
```

It should return `0`, which is clearly erroneous. This is because base `bc` defaults to keeping only the integer part. There are two ways to fix this behavior:

### `scale` parameter

Before the expression you would like `bc` to calculate, you can put a `scale` parameter, which tells `bc` how many decimal places to calculate to:

```
echo 'scale=3; 1 / 3' | bc
```

Now we can see that `bc` returns the appropriate answer of `.333`.

### -l option

Adding the `-l` option has `bc` load its standard math library, which also sets the scale parameter to 20 by default:

```
echo '1 / 3' | bc -l
```

This should return `.33333333333333333333`.
You can override this default scale parameter by adding `scale=`, just as in the previous example. In general, the `-l` option is worth including whenever your calculation involves decimals:

```
echo 'scale=3; 1 / 3' | bc -l
```

The `-l` option also opens up a few more functions, including:

```
# Natural log
echo 'l(1)' | bc -l

# Exponential function
echo 'e(1)' | bc -l
```

It also provides access to sine, cosine and arctangent, but those are outside of the scope of this course.

### Negative Numbers

`bc` can also handle negative numbers as input and output:

```
echo "-1 + 2" | bc -l

echo "1 - 2" | bc -l

echo "2 ^ -3" | bc -l
```

### awk

You can also do basic arithmetic in `awk`. To do arithmetic with `awk` on the command line, you will need the `BEGIN` keyword, which allows you to run an `awk` command without a file; then simply have it print your calculation:

```
awk 'BEGIN {print (2/3)^2}'
```

***
***

## `if` statements

## while read loops

## Associative Arrays in bash

## Arrays in bash

## Positional Parameters

## Searching history

## O2 Job Dependencies

## O2 Brew

## O2 Conda

## vim macros

## watch

Sometimes you may want to see the output of a command that continuously changes. The `watch` command is particularly useful for this. Add `watch` before your command and the terminal will take you to an output page that continually updates with your command's output. Common uses for `watch` include:

1) Watching files as they get created

```
watch ls -lh
```

2) Monitoring jobs on the cluster

```
watch squeue -u <username>
```

The default update interval is two seconds, but that can be altered with the `-n` option. Importantly, options used with the `watch` command need to be placed ***before*** the command that you are watching, or else the interpreter will evaluate the option as part of the watched command's options.
An example of this is below.

Update every 4 seconds:
```
watch -n 4 squeue -u <username>
```


## time

Sometimes you are interested in knowing how long a task takes to complete. Similarly to the `watch` command, you can place `time` in front of a command and it will tell you how long the command takes to run. This can be particularly useful if you have downsampled a dataset and are trying to estimate how long the full set will take to run. An example can be found below:

```
time ls -lh
```

The output will have three lines:

```
real	0m0.013s
user	0m0.002s
sys	0m0.007s
```

**real** is most likely the time you are interested in, since it displays the wall-clock time it takes to run a given command. **user** and **sys** represent CPU time used for various aspects of the computing and can be affected by multithreading.

## bg

Sometimes you may start a command that will take a few minutes, and you want your command prompt back to do other tasks while you wait for it to finish. To do this, you need to do two things:

1) Pause the command with `Ctrl` + `Z`.
2) Send the command to the ***b***ack***g***round with the command `bg`. When you do this, the command will continue from where it was paused.

If you later want to bring the task back to the ***f***ore***g***round, you can use the command `fg`.

In order to test this, we will briefly re-introduce the `sleep` command. `sleep` simply has the command line do nothing for a period of time, denoted in seconds by the integer following the `sleep` command. This is sometimes useful if you want a brief pause within a loop, such as between submitting a bunch of jobs to the cluster.
The syntax is:

```
sleep [integer for time in seconds]
```

So if you wanted there to be a five-second pause, you could use:

```
sleep 5
```

Now that we have re-introduced the `sleep` command, let's run it for 180 seconds to simulate a task that might take a few minutes to finish:

```
sleep 180
```

Now type `Ctrl` + `Z` to pause the command, followed by the command to move the task into the background:

```
bg
```

The `sleep` command is now running in the background and you have reclaimed your command-line prompt to use while it runs. If you want to bring the `sleep` command back to the foreground, type:

```
fg
```

And if it is still running, it will be brought to the foreground.

This can be really useful whenever you are running commands or scripts that take a few minutes but don't have large processing requirements. Examples could be:

- Indexing a fasta/bam file
- Executing a long command with many pipes
- Running something in the command line while needing to check something else

Oftentimes it is best just to submit these types of jobs to the cluster, but sometimes you don't mind running the task on your requested compute node and it is taking a bit longer than you anticipated, or something else came up.


## md5sum

Sometimes you are copying files between two locations and want to ensure the copying went smoothly, or you want to check whether two files are the same. Checksums can be thought of as an alphanumeric fingerprint for a file, and they are used to verify whether two files are identical. It is common for people and institutions to provide a list of md5sums for files that are available to download. `md5sum` is one common checksum.
***Importantly, it is theoretically possible for two different files to have the same md5sum, but in practice it is nearly impossible.*** The syntax for checking the md5sum of a file is:

```
md5sum <filename>
```

## Downloading external data

### `curl`

Oftentimes, you will find yourself wanting to download data from a website. There are two comparable commands that you can use to accomplish this task. The first one is `curl`, and the most common syntax for using it is:

```
curl -L -O [http://www.example.com/data]
```

The `-O` option will use the filename given on the website as the filename to write to. Alternatively, if you want to name it something different, you can use the `-o` option followed by the preferred name:

```
curl -L -o preferred_name [http://www.example.com/data]
```

The `-L` option tells `curl` to follow any redirects; without it, if the data has moved to a new address, `curl` may download the redirection page instead of the data itself.

Lastly, if your connection gets lost midway through a transfer, you can use the `-C` option followed by `-` to resume the download where it left off. For example:

```
curl -C - -L -O [http://www.example.com/data]
```

### `wget`

A common alternative to `curl` is `wget`. For many purposes they are extremely similar, and which one you use is a matter of personal preference. The general syntax of `wget` is a bit friendlier:

```
wget [http://www.example.com/data]
```

If you lose your connection during the download process and would like to resume it midway through, the `-c` option will ***c***ontinue the download where it left off:

```
wget -c [http://www.example.com/data]
```

### `curl` versus `wget`

For many purposes `curl` and `wget` are similar, but there are some small differences:

1) In `curl` you can use the `-O` option multiple times to download multiple files with a single command.
```
curl -L -O [http://www.example.com/data_file_1] -O [http://www.example.com/data_file_2]
```

2) In `wget` you can recursively download a directory (meaning that you also download all of its subdirectories) with the `-r` option. Typically this isn't super useful, because the source will usually pack everything into a compressed package, but it is nonetheless something that `wget` can do that `curl` cannot.

In general, `curl` has more options and flexibility than `wget`, but the vast majority, if not all, of those options are ***far*** beyond the scope of this course, so for our purposes the choice comes down to personal preference.

## Rscript

# Arrays in Slurm

When I am working on large data sets, my mind often drifts back to an old Simpsons episode. Bart is in France, being taught to pick grapes. They show him a detailed technique and he does it successfully. Then they say:
*(image: simpsons.gif — "We've all been here")*
A pipeline or process may seem easy or fast when you have 1-3 samples but totally daunting when you have 50. When scaling up, you need to consider file overwriting, computational resources, and time.

One easy way to scale up is to use the array feature in Slurm.

## What is a job array?

Atlassian says this about job arrays on O2: "Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks." [link](https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Job-Arrays)

Array tasks can run simultaneously rather than one at a time, which means they are very fast! Additionally, running a job array is very simple:

```bash
sbatch --array=1-10 my_script.sh
```

This will run my_script.sh 10 times, with the task IDs 1,2,3,4,5,6,7,8,9,10.

We can also put this directly into the bash script itself (although we will continue with the command-line version here):

```bash
#SBATCH --array=1-10
```

We can specify any task IDs we want:

```bash
sbatch --array=1,7,12 my_script.sh
```

This will run my_script.sh 3 times, with the task IDs 1, 7 and 12.

Of course, we don't want to run the same job on the same input files over and over; that would be pointless. We can use the task IDs within our script to specify different input or output files. In bash, the task ID is available in a special variable, `${SLURM_ARRAY_TASK_ID}`.


## How can I use ${SLURM_ARRAY_TASK_ID}?

The value of `${SLURM_ARRAY_TASK_ID}` is simply the task ID. If I run:

```bash
sbatch --array=1,7 my_script.sh
```

This will start two tasks, one where `${SLURM_ARRAY_TASK_ID}` is 1 and one where it is 7.

There are several ways we can use this.
If we plan ahead and name our files with these numbers (e.g., sample_1.fastq, sample_2.fastq) we can directly refer to them in our script as `sample_${SLURM_ARRAY_TASK_ID}.fastq`. However, using the ID for input files is often not a great idea, as it means you need to strip away most of the information that you might otherwise put in these names.

Instead we can keep our sample names in a separate file and use [awk](awk.md) to pull out the file names.

Here is our complete list of long sample names, found in our file `samples.txt`:

```
DMSO_control_day1_rep1
DMSO_control_day1_rep2
DMSO_control_day2_rep1
DMSO_control_day2_rep2
DMSO_KO_day1_rep1
DMSO_KO_day1_rep2
DMSO_KO_day2_rep1
DMSO_KO_day2_rep2
Drug_control_day1_rep1
Drug_control_day1_rep2
Drug_control_day2_rep1
Drug_control_day2_rep2
Drug_KO_day1_rep1
Drug_KO_day1_rep2
Drug_KO_day2_rep1
Drug_KO_day2_rep2
```

If we renamed all of these to 1-16 we would lose a lot of information that may be helpful to have on hand. If these are all sam files and we want to convert them to bam files, our script could look like this:

```bash
#!/bin/bash

file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)

samtools view -S -b ${file}.sam > ${file}.bam
```

Since we have sixteen samples we would run this as:

```bash
sbatch --array=1-16 my_script.sh
```

So what is this script doing? `file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)` pulls the line of `samples.txt` that matches the task ID and assigns it to a variable called `${file}`, which we then use to run our command.

Task IDs can also be helpful for output files or folders. We saw above how we used the task ID to help name our output bam file. But creating and naming folders is helpful in some instances as well.
```bash
#!/bin/bash

file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)

PREFIX="Folder_${SLURM_ARRAY_TASK_ID}"
mkdir $PREFIX
cd $PREFIX

samtools view -S -b ../${file}.sam > ${file}.bam
```

This script differs from our previous one in that it makes a folder named after the task ID (Folder_1 for task ID 1) and then moves inside of it to execute the command. Instead of getting all 16 of our bam files in a single folder, each will be in its own folder, labeled Folder_1 to Folder_16.

**NOTE:** We define `${file}` BEFORE we move into our new folder, as samples.txt is only present in the main directory.

diff --git a/Intermediate_shell/lessons/associative_arrays.md b/Intermediate_shell/lessons/associative_arrays.md deleted file mode 100644 index 1a33199b..00000000 --- a/Intermediate_shell/lessons/associative_arrays.md +++ /dev/null @@ -1,154 +0,0 @@

# Associative Arrays in bash

## Learning Objectives

In this lesson we will:

- Describe the structure and function of associative arrays in programming
- Implement an associative array in bash
- Use a loop to read a file into an associative array

## What is an Associative Array?

Associative arrays are a data structure in computing that store "key-value" pairs. A "key-value" pair refers to a "key" term that allows you to look up its associated "value". Think of them like a dictionary, where the "key" is the word you would like to define and the "value" is the definition. These data structures are oftentimes quite memory efficient and allow for rapid look-ups; the computer science behind how they achieve this is beyond the scope of this module. Importantly, all *keys* **MUST BE UNIQUE**, while *values* **DO NOT NEED TO BE UNIQUE**.
> NOTE: While "associative array" is the general term for this type of data structure and what they are called in bash, in other languages they have different names, such as maps, dictionaries, symbol tables, or hashes.

Let's consider a toy example for an associative array called `noises`:
| key   | value |
|:-----:|:-----:|
| dog   | woof  |
| goose | honk  |
| cat   | meow  |
| horn  | honk  |
In this toy example, the *key* "dog" is associated with the *value* "woof", "goose" is associated with "honk", and so on. As you can see, each *key* is unique, but both "goose" and "horn" have the same *value*.

## Implementing an Associative Array in bash

Using an associative array occurs in 3 steps:

1) Declaring the associative array
2) Populating the associative array
3) Querying the associative array

### Declaring the Associative Array

In `bash`, it is necessary to `declare` an associative array before you begin using it. To do so, you will need to use the `declare` function in `bash`. The two main cases where you will most likely use the `declare` function are when you are declaring either an associative array (`-A` option) or an array (sometimes just called an indexed array; `-a` option). We will talk more about arrays/indexed arrays in a different lesson.

> NOTE: If you're on a Mac, then you are likely running Bash version 3.2, in which associative arrays **do not exist**. Fortunately, the O2 cluster runs on bash version 4+.

Let's begin by declaring our `noises` array:

```
declare -A noises
```

### Populating the Associative Array

Now that we have declared our associative array, we can populate it with *key-value* pairs. The syntax for this is:

```
# Example Syntax
associative_array_name["key"]="value"
```

If we wanted to populate our *noises* associative array with the above noises, then it would look like:

```
noises["dog"]="woof"
noises["goose"]="honk"
noises["cat"]="meow"
noises["horn"]="honk"
```

Now your *noises* associative array is populated and ready to be queried.
### Querying your Associative Array

In order to retrieve the *value* associated with a *key*, we will need to query it with the following syntax:

```
# Example Syntax
echo ${associative_array_name["key"]}
```

For example, if we wanted the value associated with "cat", then we would use:

```
echo ${noises["cat"]}
```

And it should return:

```
meow
```

### Exercise

1. When might you use an associative array in your own data analysis?

## Reading a file into an Associative Array

Rarely will you ever manually enter the keys and values for an associative array like we have done above. Oftentimes your key-value lists will be quite long, and entering each of those manually would be tedious and error-prone. Below we will discuss how to read in a file to populate an associative array. This is a pretty common implementation of associative arrays. The other popular implementation is that you have a program running and you feed output from one part of it into an associative array to be queried later.

It is fairly common when one receives data to have an accompanying file that includes the `md5sum` values for the files. We have included this file for the contents of the `data/` directory and called it `md5sum.txt`. We can take a brief look at this file:

```
cat md5sum.txt
```

We can see that it is two columns, with the filenames on the right and their associated md5sums on the left. It should look like:

```
acff4581ddc671047f04eb1ed2f56b64 catch.txt
8c706987a93564cfde876540e76d52f1 ecosystems.csv
81c0b67ea957b70535f52c4b26871266 ecosystems.txt
b195297072d59ef8f08d53b3b857ca89 more_ecosystems.txt
fd27045abe101d2a5b39dc166a93c7d7 next_file.txt
3ef799ee455daf0eb1e3b6e593ce7423 sed_expressions.txt
```

Now let's write a sample script that reads this in and allows us to look up the md5sum of any file using a positional parameter.
First let's open up a new file:

```
vim md5sum_lookup.sh
```

Within this file, we can insert this commented code:

```bash
#!/bin/bash
# Written by [Insert name] on [Insert date]

# Declare an associative array to hold the md5sums
declare -A md5sum_associative_array

# Open a while loop to read each line, assigning the first column to the variable "provided_md5sum" and the second column to the variable "filename"
while read -r provided_md5sum filename; do
    # Populate the associative array with each filename (key) and its associated md5sum (value)
    md5sum_associative_array[$filename]=$provided_md5sum
# End the loop; the input file is provided here
done < /home/${USER}/advanced_shell/md5sum.txt

# Query the associative array using a positional parameter
echo ${md5sum_associative_array[$1]}
```

Now we can run this code to query the provided md5sum of any file in the provided list by using:

```
# Example syntax
sh md5sum_lookup.sh file.txt
```

So if we wanted to see what the provided md5sum for `ecosystems.txt` was, we could run:

```
sh md5sum_lookup.sh ecosystems.txt
```

This script on its own might not be the most useful, but we will build on it using conditional statements in further lessons to allow us to check the provided md5sum values against those we produce ourselves.

***

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/).
These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* diff --git a/Intermediate_shell/lessons/awk.md b/Intermediate_shell/lessons/awk.md deleted file mode 100644 index f249080e..00000000 --- a/Intermediate_shell/lessons/awk.md +++ /dev/null @@ -1,283 +0,0 @@

## awk

`awk` is a powerful programming language in its own right and can do a lot more than is outlined here. `awk` shares a common history with `sed` and even `grep`, dating back to `ed`, so some of its syntax and functionality may feel familiar at times. It is particularly useful when working with data tables in plain-text format (tab-delimited and comma-separated files). Before we dive too deeply into `awk` we need to define two terms that `awk` uses a lot:

- ***Field*** - a column of data
- ***Record*** - a row of data

**Topics discussed here are:**

[Printing Columns](awk.md#printing-columns)

[RS and ORS](awk.md#rs-and-ors)

[Field Separators](awk.md#-f)

[Skipping Records](awk.md#skipping-records)

[BEGIN](awk.md#begin)

[END](awk.md#end)

[Variables](awk.md#variables)

[Calculations Using Columns](awk.md#calculations-using-columns)

[The $0 variable](awk.md#0)

[For Loops](awk.md#for-loops)

[If Statements](awk.md#if-statements)

---
[Return to Table of Contents](toc.md)

### Printing columns

Let's first look at one of its most basic and useful functions, printing columns. For this example we are going to use the tab-delimited file animals.txt that we used in the `sed` examples.
Let's first try to just print the first column:

```
awk '{print $1}' animals.txt
```

Here the `print` function in `awk` is telling `awk` that it should output the first column of each line. We can also choose to print out multiple columns in any order:

```
awk '{print $3,$1,$5}' animals.txt
```

By default, the output columns are separated by a space. This is controlled by a built-in variable that can be modified using the `-v` option. Once you have called the `-v` option you need to tell it which built-in variable you are interested in modifying. In this case it is the ***O***utput ***F***ield ***S***eparator, or `OFS`, and you need to set it to whatever you would like it to be: a `'\t'` for tab, a `,` for a comma, or even an `f` for a lowercase 'f'.

```
awk -v OFS='\t' '{print $3,$1,$5}' animals.txt
```

#### `RS` and `ORS`

As an aside, similarly to `OFS`, records are assumed to be read in and written out with a newline character by default. However, this behavior can be altered with the `RS` and `ORS` variables:

`RS` can be used to alter the input ***r***ecord ***s***eparator
`ORS` can be used to alter the ***o***utput ***r***ecord ***s***eparator

If we wanted to change the `ORS` to be a ';' we could do so like:

```
awk -v OFS='\t' -v ORS=';' '{print $3,$1,$5}' animals.txt
```

#### `-F`

The default behavior of `awk` is to split the data into columns based on whitespace (tabs or spaces). However, if you have a comma-separated file, then your fields are separated by commas and not whitespace.
If we run a `sed` command to swap all of the tabs to commas and then pipe this output into awk, it will think there is only one field:

```
sed 's/\t/,/g' animals.txt | awk '{print $1}'
```

However, once we denote that the field separator is a comma, it will extract only the first column:

```
sed 's/\t/,/g' animals.txt | awk -F ',' '{print $1}'
```

As an alternative to `-F`, `FS` is the variable for the ***f***ield ***s***eparator and can be altered with the `-v` argument as well:

```
sed 's/\t/,/g' animals.txt | awk -v FS=',' '{print $1}'
```

### Skipping Records

Similarly to `sed`, you can also exclude records from your analysis in `awk`. `NR` is a variable equal to the ***N***umber of ***R***ecords (rows) read so far. As an aside, `NF` also exists and is a variable equal to the ***N***umber of ***F***ields (columns) in your file. You can define the range that you want your print command to work over by specifying the `NR` prior to your `{}`:

```
awk 'NR>1 {print $3,$1,$5}' animals.txt
```

This line, for example, is useful for removing the header. You can also set a range for records you'd like `awk` to print out by separating your range requirements with a `&&`, meaning 'and':

```
awk 'NR>1 && NR<=3 {print $3,$1,$5}' animals.txt
```

This command will print the third, first and fifth fields of animals.txt for records greater than 1 and less than or equal to 3.

### `BEGIN`

The `BEGIN` command will execute an `awk` expression once at the beginning of a command. This can be particularly useful if you want to give an output a header that it doesn't previously have:

```
awk 'BEGIN {print "new_header"} NR>1 {print $1}' animals.txt
```

In this case we have told `awk` that we want "new_header" printed before anything else, then `NR>1` tells `awk` to skip the old header, and finally we print the first column of `animals.txt` with `{print $1}`.
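A `BEGIN` block is also a handy place to set field separators before any records are read. Here is a minimal, self-contained sketch; `noises.csv` and its contents are made up just for this demonstration and are not part of the lesson data:

```bash
# Create a small comma-separated file for the demonstration
printf 'dog,woof\ncat,meow\n' > noises.csv

# Set the input and output field separators once in BEGIN,
# then print both fields of every record, tab-separated
awk 'BEGIN {FS=","; OFS="\t"} {print $1,$2}' noises.csv
```

Setting `FS` inside `BEGIN` is equivalent to using `-F ','` or `-v FS=','` on the command line; it just keeps everything inside the `awk` program itself.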
### `END`

Related to `BEGIN` is the `END` command, which tells `awk` to do a command once at the end of the file. It is ***very*** useful when summing up columns (below), but we will first demonstrate how it works by adding a new record:

```
awk '{print $1} END {print "new_record"}' animals.txt
```

As you can see, this has simply added a new record to the end of the output. Furthermore, you can chain multiple `END` commands together to continuously add records if you wished:

```
awk '{print $1} END {print "new_record"} END {print "newer_record"}' animals.txt
```

This is equivalent to separating your `print` commands with a `;`:

```
awk '{print $1} END {print "new_record"; print "newer_record"}' animals.txt
```

### Variables

You can also use variables in awk. Let's say we wanted to add 5cm to every organism's height:

```
awk 'BEGIN {print "old_height","new_height"} NR>1 {new_height=$5+5; print $5,new_height}' animals.txt
```

`BEGIN {print "old_height","new_height"}` is giving us a new header

`NR>1` is skipping the old header

`new_height=$5+5;` creates a new variable called "new_height" and sets it equal to the height in the fifth field plus five. Note that separate commands within the same `{}` need to be separated by a `;`.

`print $5,new_height` prints the old height along with the new height.

Lastly, you can also bring `bash` variables into `awk` using the `-v` option:

```
var=bash_variable
awk -v variable=$var '{print variable,$0}' animals.txt
```

### Calculations using columns

`awk` is also very good at handling calculations with respect to columns.
***Remember: if your file has a header you will need to skip it, because you obviously can't divide one word by another.***

#### Column sum

Now that we understand how variables and `END` work, we can take the sum of a column, in this case the fourth column of our `animals.txt`:

```
awk 'NR>1 {sum=$4+sum} END {print sum}' animals.txt
```

`NR>1` skips our header line. While not strictly necessary here because our header is not a number, it is considered best practice to exclude a header if you have one. If your file didn't have a header then you would omit this.

`{sum=$4+sum}` creates a variable named `sum` and updates it as it goes through each record by adding the fourth field to it.
***NOTE:*** This `{sum=$4+sum}` syntax can be, and often is, abbreviated to `{sum+=$4}`. They are equivalent syntaxes, but in the context of learning, `{sum=$4+sum}` is a bit more clear.

`END {print sum}` Once we get to the end of the file, we call `END` to print out our variable `sum`.

#### Column Average

Now that we understand how to take a column sum and retrieve the number of records, we can quite easily calculate the average for a column:

```
awk 'NR>1 {sum=$4+sum} END {records=NR-1; print sum/records}' animals.txt
```

`records=NR-1` is needed because `NR` includes our header line, so we need a new variable called `records` to hold the number of records in the file without the header line.
If you didn't have a header line you could get the average of a column with a command like:

```
awk '{sum=$4+sum} END {print sum/NR}' file.txt
```

#### Calculations between columns

If you wanted to divide the fifth field of animals.txt by the fourth field, you would do it like this:

```
awk 'NR>1 {print $5/$4}' animals.txt
```

You can, of course, print columns around this calculation as well, such as:

```
awk 'NR>1 {print $1,$5/$4,$2}' animals.txt
```

Lastly, you can also set the output of a calculation equal to a new variable and print that variable:

```
awk 'NR>1 {$6=$5/$4; print $1,$6,$2}' animals.txt
```

`$6=$5/$4` is making a sixth field with the fifth field divided by the fourth field. We then need to separate this from the `print` command with a `;`, but now we can call this new variable we've created.

#### `$0`

There is a special variable `$0` that corresponds to the whole record. This is very useful when appending a new field to the front or end of a record, such as:

```
awk 'NR>1 {print $0,$5/$4}' animals.txt
```

***NOTE:*** If you create a new variable such as `$6=$5/$4`, `$6` is now part of `$0` and will overwrite the values (if any) previously in `$6`. For example:

```
awk 'NR>1 {$6=$5/$4; print $0,$6}' animals.txt
```

You will get two `$6` fields at the end of the output, because `$6` is now a part of `$0` and you've also indicated that you want to print `$6` again.

Furthermore,

```
awk 'NR>1 {$5=$5/$4; print $0}' animals.txt
```

`$5=$5/$4` will overwrite the values previously held in `$5` after the calculation is made. Thus, the output no longer shows the original `$5`.

### `for` loops

Like many other programming languages, `awk` can also do loops. One type of loop is the basic `for` loop.
The basic syntax for a `for` loop in `awk` is:

```
awk '{for (initialize counter variable; end condition; increment) command}' file.txt
```

If you want to duplicate every entry in your file, you can do so like:

```
awk '{ for (i = 1; i <= 2; i=i+1) print $0}' animals.txt
```

`for (i = 1; i <= 2; i=i+1)` is starting a `for` loop that:
- `i = 1;` starts at 1
- `i <= 2;` runs as long as the value of i is less than or equal to 2
- `i=i+1` after each iteration increases the counter variable by one. `++i` and `i++` are equivalent syntaxes to `i=i+1`.

Then we print the whole line with `print $0`.

While not discussed here, `awk` also supports `while` and `do-while` loops.


### `if` statements

Since `awk` is its own fully-fledged programming language, it also has conditional statements. A common time you might want an `if` statement in `awk` is when you have a file with tens or even hundreds of fields and you want to figure out which field has the column header of interest, or when you are writing a script for broad use and the order of the input columns may not always be the same. To do that:

```
awk 'NR==1 {for (i=1; i<=NF; i=i+1) {if ($i == "height(cm)") print i}}' animals.txt
```

`NR==1` only looks at the header line (note the double equals sign: `==` tests for equality, while a single `=` would be an assignment)

`for (i=1; i<=NF; i=i+1)` begins a `for` loop that runs from 1 up to the number of fields, incrementing by one
`if ($i == "height(cm)")` checks whether `$i`, which in our case is $1, $2, ... $6, is equal to `height(cm)`.
If this condition is met, then:
`print i` prints out the value of `i`, i.e., the number of the matching field

diff --git a/Intermediate_shell/lessons/exploring_basics.md b/Intermediate_shell/lessons/exploring_basics.md deleted file mode 100644 index d3d19e8e..00000000 --- a/Intermediate_shell/lessons/exploring_basics.md +++ /dev/null @@ -1,196 +0,0 @@

---
title: "The Shell"
author: "Sheldon McKay, Mary Piper, Radhika Khetani"
---

## Learning Objectives
- This lesson briefly describes commands that will be used in this workshop. We are assuming that, as someone with a basic knowledge of *bash*, you are already familiar with most of these commands.

## The basics

Let's explore the contents of the data we just downloaded.

```bash
$ ls -l ~/unix_lesson

$ cd ~/unix_lesson

$ ls -l
```

> The tilde `~` is a short way of referring to your home directory
>
> `cd` = 'change directory'
>
> `ls` = 'list' and it lists the contents of a directory.
>
> The `-l` argument modifies the default output of `ls` and gives a lot more information (long listing)

**Wildcards**

The '*' character is a "wildcard" and is a shortcut for "everything". Thus, if you enter `ls *.sh`, you will see all the files in a given directory ending in `.sh`.

Now try these commands:

```bash
$ ls -l */

$ ls -l raw_fastq/M*fq
```

> An asterisk/star is only one of the many wildcards in UNIX, but this is the most powerful one.

**Tab completion**

```bash
$ ls -l raw_fastq/Mov10_oe_
```

When you hit the first tab, nothing happens. The reason is that there are multiple files that start with `Mov10_oe_`; the shell does not know which one to fill in. When you hit tab again, the shell will list the possible choices.

> **Tab completion is your friend!** It helps prevent spelling mistakes, and speeds up the process of typing in the full command.

**Examining Files**

The command `cat` (short for catenate) prints the contents of a file to the screen.
```bash
$ cat other/sequences.fa
```

The command `less` allows you to open up the file in a new buffer and scroll through it.

```bash
less raw_fastq/Mov10_oe_1.subset.fq
```

Shortcuts for `less`

| key              | action                 |
| ---------------- | ---------------------- |
| SPACE            | to go forward          |
| b                | to go backwards        |
| g                | to go to the beginning |
| G                | to go to the end       |
| q                | to quit                |


The commands `head` and `tail` allow you to look at the beginning or end of a file.

```bash
$ head raw_fastq/Mov10_oe_1.subset.fq

$ tail -n 8 raw_fastq/Mov10_oe_1.subset.fq
```
> The `-n` option to either of these commands can be used to specify the number of lines, `n`, to display.

To count how many lines are in a file you can use the `wc` command:

```bash
wc -l raw_fastq/Mov10_oe_1.subset.fq
```

**Searching within files**

The command `grep` allows us to search for a pattern within a file.

```bash
$ cd ~/unix_lesson/raw_fastq

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq
```

We can add *modifiers* to the command to alter the way it works. For example, the `-B` and `-A` arguments for `grep` return the matched line plus one line before (`-B 1`) and two lines after (`-A 2`):

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq
```

**Redirection**

To redirect output to a file rather than the screen you can use `>`:

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt

$ wc -l bad_reads.txt # How many lines are in the bad_reads.txt file?
```

We can also use the pipe `|` to redirect the output of a command as input for a different command.

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_3.subset.fq | wc -l
```

If we use `>>`, it will append to rather than overwrite a file. This can be useful for saving more than one search, for example.
```bash
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_3.subset.fq >> bad_reads.txt

$ wc -l bad_reads.txt # You should see an increase in the number of lines here compared to running this command earlier
```

> The philosophy behind these **redirection** commands is that none of them really do anything all that impressive on their own. BUT, when you start chaining them together, you can do some really powerful things very efficiently.

## Basic commands, shortcuts, and keystrokes of note
Below is a key of the basic commands you should be familiar with in order to be able to follow this module.

```
## Commands

cd # change directory to "~" or to specified directory
ls # list contents of current or specified directory
man # display manual for specified command
pwd # specify present working directory
echo "..." # display content in quotes on the standard output
history # display previous "historical" commands
cat # display all contents of a file on the standard output
less # open a buffer with the contents of a file
head # display the first 10 lines of a file
tail # display the last 10 lines of a file
cp <..> <..> # copy files or directories
mkdir # make a new directory/folder
mv <..> <..> # move or rename files or directories
rm # remove a file or a folder (-r)

## Other
~ # home directory
. # current directory
.. # parent directory
* # wildcard
ctrl + c # cancel current command
ctrl + a # start of line
ctrl + e # end of line
```

## Expand your knowledge of shell!

shell cheat sheets:

* [http://fosswire.com/post/2007/08/unixlinux-command-cheat-sheet/](http://fosswire.com/post/2007/08/unixlinux-command-cheat-sheet/)
* [https://github.com/swcarpentry/boot-camps/blob/master/shell/shell_cheatsheet.md](https://github.com/swcarpentry/boot-camps/blob/master/shell/shell_cheatsheet.md)

Explain shell - a web site where you can see what the different components of a shell command are doing.
* [http://explainshell.com](http://explainshell.com)
* [http://www.commandlinefu.com](http://www.commandlinefu.com)

Software Carpentry shell tutorial: [The Unix shell](http://software-carpentry.org/v4/shell/index.html)

General help:

- http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
- man bash
- Google - if you don't know how to do something, try Googling it. Other people have probably had the same question.
- Learn by doing. There's no other way to learn this than by trying it out.

***

[Next Lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/vim.html)

***

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* diff --git a/Intermediate_shell/lessons/exploring_basics_long.md b/Intermediate_shell/lessons/exploring_basics_long.md deleted file mode 100644 index 3cccf280..00000000 --- a/Intermediate_shell/lessons/exploring_basics_long.md +++ /dev/null @@ -1,208 +0,0 @@

---
title: "The Shell"
author: "Sheldon McKay, Mary Piper, Radhika Khetani"
---

## Learning Objectives
- Review basic commands in shell

## The basics

```bash
$ ls -l ~/unix_lesson

$ cd ~/unix_lesson

$ ls -l
```

> `cd` = 'change directory'
>
> `ls` = 'list' and it lists the contents of a directory.
>
> The `-l` argument modifies the default output of `ls` and gives a lot more information (long listing)

**Wildcards**

The '*' character is a "wildcard" and is a shortcut for "everything". Thus, if you enter `ls *.sh`, you will see all the files in a given directory ending in `.sh`.
Now try these commands:

```bash
$ ls -l */

$ ls raw_fastq/M*fq

$ ls raw_fastq/M*1*fq

$ ls raw_fastq/M*1.*fq
```

> An asterisk/star is only one of the many wildcards in UNIX, but this is the most powerful one.

**Tab completion**

```bash
$ ls raw_fastq/Mov10_oe_
```

When you hit the first tab, nothing happens. The reason is that there are multiple files that start with `Mov10_oe_`; the shell does not know which one to fill in. When you hit tab again, the shell will list the possible choices.

> **Tab completion is your friend!** It helps prevent spelling mistakes, and speeds up the process of typing in the full command.

**Examining Files**

We now know how to move around the file system and look at the contents of directories, but how do we look at the contents of files?

The easiest way to examine a file is to just print out all of its contents using the command `cat`. Print the contents of `unix_lesson/other/sequences.fa` by entering the following command:

```bash
$ cat other/sequences.fa
```

This prints out all the contents of `sequences.fa` to the screen.

`cat` is a terrific command, but when the file is really big, it can be annoying to use. The command `less` is useful in this case. Let's take a look at the raw_fastq files. These files are quite large, so we probably do not want to use the `cat` command to look at them.

```bash
less raw_fastq/Mov10_oe_1.subset.fq
```

Please note that FASTQ files have four lines of data associated with each sequenced "read", i.e., each record in the output from the sequencer. There is a header line, followed by the nucleotide sequence, similar to a FASTA file, but in addition there is information corresponding to the sequencing quality of each nucleotide.
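To make the four-line record structure concrete, here is a tiny sketch using a single made-up read — `example.fq` and the read itself are invented for illustration and are not part of the workshop data:

```bash
# Write one hypothetical FASTQ record:
#   line 1: header (starts with @)
#   line 2: nucleotide sequence
#   line 3: separator line (starts with +)
#   line 4: per-base quality scores
printf '@read_1\nGATTACA\n+\nIIIIIII\n' > example.fq

# One record = four lines
wc -l example.fq
```

Counting lines and dividing by four is a quick way to estimate how many reads a FASTQ file contains.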
Shortcuts for `less`

| key              | action                 |
| ---------------- | ---------------------- |
| SPACE            | to go forward          |
| b                | to go backwards        |
| g                | to go to the beginning |
| G                | to go to the end       |
| q                | to quit                |


The commands `head` and `tail` offer another way to look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it's formatted.

```bash
$ head raw_fastq/Mov10_oe_1.subset.fq

$ head -n 4 raw_fastq/Mov10_oe_1.subset.fq
```

```bash
$ tail -n 8 raw_fastq/Mov10_oe_1.subset.fq
```

The `-n` option to either of these commands can be used to print the first or last `n` lines of a file; to print just the first or last line, use `-n 1`.

**Searching within files**

Suppose we want to see how many reads in our file `Mov10_oe_1.subset.fq` are "bad", with 10 consecutive Ns (`NNNNNNNNNN`).

```bash
$ cd ~/unix_lesson/raw_fastq

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq
```

Since each record in the FASTQ file is four lines and the second line is the sequence, we can use the `-B` and `-A` arguments for `grep` to return the matched line plus one line before (`-B 1`) and two lines after (`-A 2`).

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq
```

**Redirection**

We've identified reads that don't look great and we want to capture them in a new file. We can do that with something called "redirection". The idea is that we're redirecting the output from the terminal (all the stuff that went whizzing by) to something else. In this case, we want to print it to a file, so that we can look at it later.

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt
```

The prompt should sit there for a little bit, and then it will look like nothing happened. But you should have a new file called `bad_reads.txt`.
```bash
$ ls -l

## check how many lines are in this new file with the "word count" command and the `-l` option to only show the number of lines

$ wc -l bad_reads.txt
```

If we use `>>`, it will append to rather than overwrite a file. This can be useful for saving more than one search, for example.

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Irrel_kd_1.subset.fq >> bad_reads.txt

$ wc -l bad_reads.txt
```

We can also use the pipe `|` to redirect the output of one command as input for a different command.

```bash
$ grep -B 1 -A 2 NNNNNNNNNN Irrel_kd_1.subset.fq | wc -l
```

The philosophy behind these **redirection** commands is that none of them really do anything all that impressive on their own. BUT, when you start chaining them together, you can do some really powerful things very efficiently.

## Basic commands, shortcuts, and keystrokes of note

```
## Commands

cd # change directory to "~" or to specified directory
ls # list contents of current or specified directory
man # display manual for specified command
pwd # print present working directory
echo "..." # display content in quotes on the standard output
history # display previous "historical" commands
cat # display all contents of a file on the standard output
less # open a buffer with the contents of a file
head # display the first 10 lines of a file
tail # display the last 10 lines of a file
cp <..> <..> # copy files or directories
mkdir # make a new directory/folder
mv <..> <..> # move or rename files or directories
rm # remove a file, or a folder with -r

## Other
~ # home directory
. # current directory
.. # parent directory
* # wildcard
ctrl + c # cancel current command
ctrl + a # go to start of line
ctrl + e # go to end of line
```

## Expand your knowledge of shell!
Shell cheat sheets:

* [http://fosswire.com/post/2007/08/unixlinux-command-cheat-sheet/](http://fosswire.com/post/2007/08/unixlinux-command-cheat-sheet/)
* [https://github.com/swcarpentry/boot-camps/blob/master/shell/shell_cheatsheet.md](https://github.com/swcarpentry/boot-camps/blob/master/shell/shell_cheatsheet.md)

Explain Shell, a web site where you can see what the different components of a shell command are doing:

* [http://explainshell.com](http://explainshell.com)
* [http://www.commandlinefu.com](http://www.commandlinefu.com)

Software Carpentry shell tutorial: [The Unix shell](http://software-carpentry.org/v4/shell/index.html)

General help:

- http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
- man bash
- Google: if you don't know how to do something, try Googling it. Other people have probably had the same question.
- Learn by doing. There's really no other way to learn this than by trying it out.

***

[Next Lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/vim.html)

***

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

diff --git a/Intermediate_shell/lessons/if_statements.md b/Intermediate_shell/lessons/if_statements.md deleted file mode 100644 index 5b3ef42c..00000000 --- a/Intermediate_shell/lessons/if_statements.md +++ /dev/null @@ -1,122 +0,0 @@

# If-Then-Else

Sometimes we need to make decisions in bash.
For example, let's say that our program writes a file, and we give our desired file name as a [positional parameter](positional_params.md) when calling our script:

```
sh myscript.sh FILE_NAME
```

However, if FILE_NAME already exists, we do not want to overwrite it. In fact, we want to warn the user that the file already exists. Within myscript.sh we can include the following code:

```
FILE=${1}

if [[ -e ${FILE} ]];
then
    echo "${FILE} already exists!";
    exit 1
fi
```

Let's go through this line by line.

* `FILE=${1}` assigns the first positional parameter to the variable `FILE`.
* `if [[ -e ${FILE} ]];` asks whether `${FILE}` exists. Note that there are spaces between `-e ${FILE}` and the square brackets. **NOTE: bash is VERY finicky about how whitespace is set up in conditional statements, and these spaces are crucial for the code to run correctly. Incorrect spacing is one of the most common script errors for conditional statements.** The `;` separates synchronous commands in bash.
* `then` tells bash that what follows is what it should do if it evaluates the previous clause as true. We could write something like `if [[ 1 -eq 1 ]];` (if one equals one) for a statement that is ALWAYS true, or `if [[ 1 -eq 2 ]];` (if one equals two) for a statement that is always FALSE.
* Next we tell bash what to do: `echo "${FILE} already exists!";`. In this case we want it to echo (print back onto the command line) the phrase "${FILE} already exists!" to alert the user.
* As the file exists, we want our script to stop instead of proceeding; `exit 1` tells bash to exit the script under an error condition.
* Finally, `fi` ends the if clause. If the original if statement is true, bash will execute all commands until it reaches `fi`.

Below we will go through what might be in your if statement. Then we will explore having multiple if/else if statements.
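To see this behavior without writing a full script, you can paste the same check directly at the prompt. This sketch uses a throwaway file name (`demo.txt`) and leaves out the `exit 1` so your interactive session is not closed:

```bash
# Create the file first so the condition evaluates as true
touch demo.txt
FILE=demo.txt

if [[ -e ${FILE} ]]
then
    echo "${FILE} already exists!"
fi
```

This prints `demo.txt already exists!`. Delete `demo.txt` and re-run it, and nothing is printed, because the condition is now false.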
## Crafting an if statement in bash

### File Conditions
Almost all if statements that you will use will have to do with file state (present, absent, is it a directory) or properties of files.

The main file conditions are:

* `-e` Does the file exist?
* `-d` Is it a directory?
* `-f` Is it a regular file?
* `-r` Does the file exist and is it readable?
* `-s` Does the file exist and is its size greater than zero (i.e., not an empty file)?

Each of these can be combined with `!` to indicate not. For example: `if [[ -e ${FILE} ]];` evaluates whether ${FILE} exists and `if [[ ! -e ${FILE} ]];` evaluates whether ${FILE} does NOT exist.

### Multiple Conditions

Sometimes you will want to evaluate multiple conditions at once. For example, maybe you need to check that two files exist before proceeding, or you need to make sure the files are a certain length. If statements can be modified to include multiple conditions.

```
FILE=${1}
FILE2=${2}

if [[ -e ${FILE} ]] && [[ -e ${FILE2} ]];
then
    echo "${FILE} and ${FILE2} already exist!";
    exit 1
fi
```

Here I am checking that both `${FILE}` and `${FILE2}` exist before proceeding. The `&&` indicates AND: the overall condition is only true if both conditions are true. There may also be times where I only need one of the two files to exist:


```
FILE=${1}
FILE2=${2}

if [[ -e ${FILE} ]] || [[ -e ${FILE2} ]];
then
    echo "Either ${FILE} or ${FILE2} already exists!";
    exit 1
fi
```

Here the `||` indicates OR: the condition is evaluated as true if EITHER file exists. Only one of them needs to exist, and if they both exist the condition is still evaluated as true.

### Else if

We can also ask bash multiple questions. Let's pretend that you are at a grocery store and ask if they have Cheez-Its; if they say no, you might ask for Goldfish; no again, and you may ask for saltines (but not the unsalted kind, never the unsalted kind). You can do the same thing with bash!

For a bioinformatic example, let's say we have a pipeline.
We start with sample.sam, then make sample.bam, then sample_sorted.bam, then index that file to make sample_sorted.bai. We want to check that our pipeline has completed. For maximum flexibility, we write our script to take the name of the sample as positional parameter 1.

```
sample=${1}


if [[ -e ${sample}_sorted.bai ]] ;
then
    echo "Pipeline finished!";
else
    echo "Pipeline failed :( ";
fi
```

Here the `else` tells bash what to do if it evaluates the condition as false. If that file does not exist, bash will give the second message.
However, it is not super helpful to just know the pipeline failed. We want to know where it failed! Just like with the grocery store, if there are no Cheez-Its we want to check for the next best thing. We can modify our script using `elif`, which stands for "else if". If the first condition is false, bash will check whether this second condition is true. We can go on and on with `elif` until we arrive at our last possible choice (unsalted saltines, yuck!).

```
if [[ -e ${sample}_sorted.bai ]] ;
then
    echo "Pipeline finished!";
elif [[ -e ${sample}_sorted.bam ]] ;
then
    echo "Indexing Failed";
elif [[ -e ${sample}.bam ]] ;
then
    echo "Sorting Failed";
elif [[ -e ${sample}.sam ]] ;
then
    echo "Sam to bam failed";
elif [[ ! -e ${sample}.sam ]] ;
then
    echo "Sam file is missing!";
else
    echo "Something is wrong but I don't know what it is!";
fi
```

This script is much more handy, as we can determine where in our pipeline we had an issue.

diff --git a/Intermediate_shell/lessons/job_dependencies.md b/Intermediate_shell/lessons/job_dependencies.md deleted file mode 100644 index 8befce63..00000000 --- a/Intermediate_shell/lessons/job_dependencies.md +++ /dev/null @@ -1,138 +0,0 @@

# Job Dependencies

## Learning Objectives

In this lesson we will:
- Discuss the advantages of utilizing job dependencies
- Implement a job dependency

## What is a job dependency and why would I use it?
Most, if not all, high performance computer clusters (HPCCs) utilize a job scheduler. As we have previously discussed, O2 uses one of the most popular ones, SLURM. These job schedulers aim to allow for fair use of limited computational resources in a shared network. To make the most efficient use of these limited resources, it is oftentimes best practice to place programs that require different computational resources in different job submissions. For example, perhaps program A needs 12 CPUs, 36GB of memory and 6 hours, while program B, which uses the output of program A, requires only 1 CPU, 4GB of memory and 8 hours. In this case one *could* request 12 CPUs, 36GB and 14 hours of compute time, but while program B is running you would be wasting 11 CPUs and 32GB of memory. As a result, this type of behavior is *strongly discouraged*.

Now you might be wondering, "Okay, so I need to make two separate jobs, but what if Job 1 running program A finishes at 2am? Do I need to log onto the cluster to submit Job 2 running program B?" The good news is that you don't, and this is the exact goal of job dependencies! Job dependencies allow you to queue jobs that are dependent on other jobs.

Job dependencies are very useful:
- When you have consecutive jobs you want to run
- When you don't want to (or can't) manage the timing of submitting consecutive jobs yourself

> NOTE: Job dependencies are not unique to SLURM; many other job schedulers, like PBS, also have a similar feature!

## Job dependencies on SLURM

Job dependencies in SLURM can be specified in two ways:
1) As part of an `SBATCH` directive in your job submission script
2) As an option in your `sbatch` command

We will be doing the latter, but either way will use the `--dependency` option. This option has several arguments that it can accept, but the most useful one for the overwhelming majority of circumstances is `afterok`.
Using `afterok` means that the dependent job is held until the specified job ends without an error, at which point the hold is removed and the dependent job can start. After `afterok`, you separate one or more job IDs with colons to signify which jobs the submission depends on.

The syntax for a dependent job submission could look like:

```bash
# Not a real job submission
sbatch Job_1.sh
Submitted batch job 351
sbatch --dependency=afterok:351 Job_2.sbatch
```

We can visualize a sample workflow below:

- -

Multiple jobs can be dependent on a single job. Conversely, we can have a single job dependent on multiple jobs. If this is the case, then we just separate each job ID with colons like:

```bash
sbatch --dependency=afterok:353:354 Job_5.sbatch
```

> NOTE: While the behavior can change between implementations of SLURM, on O2, when a job exits with an error, it removes all `afterok`-dependent jobs from the queue. Some other implementations of SLURM will not remove these jobs from the queue, but the reason provided when you check will be `DependencyNeverSatisfied`. In this case, you will need to manually cancel these jobs.

## Implementing a job dependency

Let's utilize a toy example so that we can see this in action:

```
vim sleep_step_1.sbatch
```

This is going to be a simple script that runs two `echo` commands and pauses for 30 seconds in between them.

```bash
#!/bin/bash
# Toy example for understanding how job dependencies work
#SBATCH -t 0-00:01
#SBATCH -p priority
#SBATCH -c 1
#SBATCH --mem 1M
#SBATCH -o run_sleep_step_1_%j.out
#SBATCH -e run_sleep_step_1_%j.err

echo "I am going to take a nap." > sleep.out
sleep 30
echo "I have woken up." >> sleep.out
```

Now, before we submit this, let's create the second script, which will be dependent on the first script running:

```
vim sleep_step_2.sbatch
```

Within the script we can insert:

```bash
#!/bin/bash
#SBATCH -t 0-00:01
#SBATCH -p priority
#SBATCH -c 1
#SBATCH --mem 1M
#SBATCH -o run_sleep_step_2_%j.out
#SBATCH -e run_sleep_step_2_%j.err

echo "I am going to take another nap." >> sleep.out
sleep 30
echo "I have woken up again."
>> sleep.out
```

Once we have written these two scripts we can go ahead and submit the first one:

```bash
sbatch sleep_step_1.sbatch
```

It should return some text that says:

```
Submitted batch job [Job_ID]
```

Now we can submit the second job, while making it dependent on the above Job ID. In the command below, replace `Job_ID` with the Job ID from above.

```bash
sbatch --dependency=afterok:Job_ID sleep_step_2.sbatch
```

You can view your jobs by using:

```bash
squeue -u $USER
```

And hopefully, if you were quick enough, you should be able to see that one of the jobs is not running and the `Reason` is `(Dependency)`. This lets you know that it is not running because it is waiting on a dependency.

Once both jobs finish, you can inspect the `sleep.out` file and it should look like:

```
I am going to take a nap.
I have woken up.
I am going to take another nap.
I have woken up again.
```

While this is just an example, it highlights how you can create workflows, and it also allows you to optimize your job submissions to accommodate being away from the cluster.

***

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

diff --git a/Intermediate_shell/lessons/keeping_track_of_time.md b/Intermediate_shell/lessons/keeping_track_of_time.md deleted file mode 100644 index e7686a7d..00000000 --- a/Intermediate_shell/lessons/keeping_track_of_time.md +++ /dev/null @@ -1,104 +0,0 @@

# Keeping Track of Time

We don't always just submit a command and come back later.
There are times when you want to keep track of what is going on, see how long a task takes for future use, or run a command in the background while you continue to use the command line.

**Topics discussed here are:**

[Watch](keeping_track_of_time.md#watch)

[Time](keeping_track_of_time.md#time)

[bg](keeping_track_of_time.md#bg)


***************

## Watch

Sometimes one may want to see the output of a command that continuously changes. The `watch` command is particularly useful for this. Add `watch` before your command, and your command line will take you to an output page that continually re-runs your command and updates its output. Common uses for `watch` could be:

1) Viewing files as they get created

```
watch ls -lh
```

2) Monitoring jobs on the cluster

```
watch squeue -u $USER
```

The default update interval is two seconds, but that can be altered with the `-n` option. Importantly, the options used with the `watch` command need to be placed ***before*** the command that you are watching, or else the interpreter will evaluate the option as part of the watched command's options. An example of this is below:

Update every 4 seconds
```
watch -n 4 squeue -u $USER
```


## Time

Sometimes you are interested in knowing how long a task takes to complete. Similarly to the `watch` command, you can place `time` in front of a command and it will tell you how long the command takes to run. This can be particularly useful if you have downsampled a dataset and you are trying to estimate how long the full set will take to run. An example can be found below:

```
time ls -lh
```

The output will have three lines:

```
real 0m0.013s
user 0m0.002s
sys 0m0.007s
```

**real** is most likely the time you are interested in, since it displays the wall-clock time it takes to run a given command. **user** and **sys** represent CPU time used for various aspects of the computing and can be impacted by multithreading.
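One practical detail worth knowing (a small sketch, not part of the original lesson): in bash, `time` writes its report to standard error, so capturing the report in a file requires redirecting the stderr of a command group rather than using plain `>`:

```bash
# `time` is a shell keyword, so wrap the command in { ...; } and
# redirect the group's standard error to capture the timing report
{ time sleep 2 ; } 2> time_report.txt

cat time_report.txt   # contains the real/user/sys lines
```

This is handy when you want to save the timing of a long command for future reference rather than just reading it off the screen.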
## bg

Sometimes you may start a command that will take a few minutes, and you want to have your command prompt back to do other tasks while you wait for the initial command to finish. To do this:

1) Pause the command with `Ctrl` + `Z`.
2) Send the command to the ***b***ack***g***round with the command `bg`. When you do this, the command will continue from where it was paused.
3) If you want to bring the task back to the ***f***ore***g***round, you can use the command `fg`.

In order to test this, we will briefly re-introduce the `sleep` command. `sleep` just has the command line do nothing for a period of time, denoted in seconds by the integer following the `sleep` command. This is sometimes useful if you want a brief pause within a loop, such as between submitting a bunch of jobs to the cluster (as we did in [insert lesson here]). The syntax is:

```
sleep [integer for time in seconds]
```

So if you wanted there to be a five second pause, you could use:

```
sleep 5
```

Now that we have re-introduced the `sleep` command, let's go ahead and pause the command line for 180 seconds to simulate a running task that might take a few minutes to finish.

```
sleep 180
```

Now type `Ctrl` + `Z` to pause that command, then move the task to the background with:

```
bg
```

The `sleep` command is now running in the background and you have re-claimed your command-line prompt to use while the `sleep` command runs. If you want to bring the `sleep` command back to the foreground, type:

```
fg
```

And if it is still running, it will be brought to the foreground.

The place that this can be really useful is whenever you are running commands/scripts that take a few minutes to run and don't have large processing requirements.
Examples could be:

- Indexing a fasta/bam file
- Executing a long command with many pipes
- Running something in the command line while you need to check something else

Oftentimes it is best just to submit these types of jobs to the cluster, but sometimes you don't mind running the task on your requested compute node, and it is simply taking a bit longer than you anticipated, or something else came up.

diff --git a/Intermediate_shell/lessons/loops_and_scripts.md b/Intermediate_shell/lessons/loops_and_scripts.md deleted file mode 100644 index cc478953..00000000 --- a/Intermediate_shell/lessons/loops_and_scripts.md +++ /dev/null @@ -1,357 +0,0 @@

---
title: "The Shell: Loops & Scripts"
author: "Bob Freeman, Mary Piper, Radhika Khetani"
---

Approximate time: 60 minutes

## Learning Objectives

* Capture multiple commands into a script to re-run as one single command
* Understand variables and how to store information in them
* Learn how to use variables to operate on multiple files

## Shell scripts

Within the command-line interface, we have access to various commands at our fingertips which allow us to interrogate our data (e.g. `cat`, `less`, `wc`).

> **NOTE:** If you are unsure about any of these commands and what they do, you may want to review the [Exploring Basics lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/exploring_basics.html).

When you are working with data, it can often be useful to run a set of commands one after another. Further, you may want to re-run this set of commands on every single set of data that you have. Wouldn't it be great if you could do all of this by simply typing out one single command?

Welcome to the beauty and purpose of shell scripts.

Shell scripts are **text files that contain commands we want to run**. As with any file, you can give a shell script any name, and they usually have the extension `.sh`.
For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: this is actually a small program.

Since we now know how to create text files in the command-line interface, we are going to use that knowledge to create a shell script and see what makes the shell such a powerful programming environment. We will use commands that you should be familiar with and save them into a file so that we can **re-run all those operations** again later by typing **one single command**. Let's write a shell script that will do two things:

1. Tell us our current working directory
2. List the contents of the directory

First open a new file using `vim`:

```bash
$ vim listing.sh
```

Then type in the following lines in the `listing.sh` file:

```
echo "Your current working directory is:"
pwd
echo "These are the contents of this directory:"
ls -l
```

Save the file and exit `vim`. Now let's run the new script we have created. To run a shell script you usually use the `bash` or `sh` command.

```bash
$ sh listing.sh
```

Now, let's run this script when we are in a different folder.

```bash
$ cd ../raw_fastq/

$ sh ../other/listing.sh
```

> Did it work like you expected?
>
> Were the `echo` commands helpful in letting you know what came next?

This is a very simple shell script. In this lesson, we will be learning how to write more complex ones and show you how to use the power of scripts to make our lives much easier.

## Bash variables

A *variable* is a common concept shared by many programming languages. Variables are essentially a symbolic/temporary name for, or a reference to, some information. Variables are analogous to "buckets", where information can be stored, maintained and modified without too much hassle.

Extending the bucket analogy: the bucket has a name associated with it, i.e.
the name of the variable, and when referring to the information in the bucket, we use the name of the bucket, and do not directly refer to the actual data stored in it.

Let's start with a simple variable that has a single number stored in it:

```bash
$ num=25
```

*How do we know that we actually created the bash variable?* We can use the `echo` command to print to terminal:

```bash
$ echo num
```

What do you see in the terminal? The `echo` utility takes the arguments you provide and prints them to the terminal. In this case it interpreted `num` as a character string and simply printed it back to us. This is because **when trying to retrieve the value stored in the variable, we explicitly use a `$` in front of it**:

```bash
$ echo $num
```

Now you should see the number 25 returned to you. Did you notice that when we created the variable we just typed in the variable name? This is standard shell notation (syntax) for defining and using variables. When defining the variable (i.e. setting the value) you can just type it as is, but when **retrieving the value of a variable don't forget the `$`!**

Variables can also store a string of character values. In the example below, we define a variable or a 'bucket' called `file`. We will put a filename, `Mov10_oe_1.subset.fq`, as the value inside the bucket.

```bash
$ file=Mov10_oe_1.subset.fq
```

Once you press return, you should be back at the command prompt. Let's check what's stored inside `file`, but first move into the `raw_fastq` directory:

```bash
$ cd ~/unix_lesson/raw_fastq
$ echo $file
```

Let's try another command using the variable that we have created. We can also count the number of lines in `Mov10_oe_1.subset.fq` by referencing the `file` variable:

```bash
$ wc -l $file
```

> *NOTE:* The variables we create in a session are independent of where you are in the filesystem. This is why we can reference them from any directory.
However, they are only available in your current session. If you close your Terminal and come back again at a later time, the variables you have created will no longer exist.

***

**Exercise**

* Reuse the `$file` variable to store a different file name, and rerun the commands we ran above (`wc -l`, `echo`)

***

Ok, so we know variables are like buckets, and so far we have seen that bucket filled with a single value. **Variables can store more than just a single value.** They can store multiple values, and in this way can be useful to carry out many things at once. Let's create a new variable called `filenames`, and this time we will store *all of the filenames* in the `raw_fastq` directory as values.

To list all the filenames in the directory that have a `.fq` extension, we know the command is:

```bash
$ ls *.fq
```

Now we want to *assign* the output of `ls` to the variable:

```bash
$ filenames=`ls *.fq`
```

> Note the syntax for assigning the output of a command to a variable, i.e. the backticks around the `ls` command.

Check and see what's stored inside our newly created variable using `echo`:

```bash
$ echo $filenames
```

Let's try the `wc -l` command again, but this time using our new variable `filenames` as the argument:

```bash
$ wc -l $filenames
```

What just happened? Because our variable contains multiple values, the shell runs the command on each value stored in `filenames` and prints the results to screen.

***

**Exercise**

* Use some of the other commands you are familiar with (i.e. `head`, `tail`) on the `filenames` variable.

***

## Loops

Another powerful concept in the Unix shell, and one that is useful when writing scripts, is the concept of "Loops". We have just shown you that you can run a single command on multiple files by creating a variable whose values are the filenames that you wish to work on. But what if you want to **run a sequence of multiple commands, on multiple files**?
This is where loops come in handy!

Looping is a concept shared by several programming languages, and its implementation in *bash* is very similar to other languages.

The structure or the syntax of (*for*) loops in bash is as follows:

```bash
for (variable_name) in (list)
do
(command1 $variable_name)
.
.
done
```

where the ***variable_name*** defines (or initializes) a variable that takes the value of every member of the specified ***list*** one at a time. At each iteration, the loop retrieves the value stored in the variable (which is a member of the input list) and runs through the commands indicated between the `do` and `done` one at a time. *This syntax/structure is virtually set in stone.*


#### What does this loop do?

```bash
for x in *.fq
  do
  echo $x
  wc -l $x
  done
```

Most simply, it writes to the terminal (`echo`) the name of the file and the number of lines (`wc -l`) for each file that ends in `.fq` in the current directory. The output is almost identical to what we had before.

In this case the list of files is specified using the asterisk wildcard: `*.fq`, i.e. all files that end in `.fq`.

Then, we execute 2 commands between the `do` and `done`. With a loop, we execute these commands for each file at a time. Once the commands are executed for one file, the loop then executes the same commands on the next file in the list.

Essentially, **the number of items in the list == the number of times the code will loop through**.

In our case that is 6 times, since we have 6 files in `~/unix_lesson/raw_fastq` that end in `.fq`; at each iteration, the current filename is stored in the `x` variable.

It doesn't matter what variable name we use in a loop structure, but it is advisable to make it something intuitive. In the long run, it's best to use a name that will help point out a variable's functionality, so your future self will understand what you are thinking now.
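The same structure works for any list, not just filenames matched by a wildcard. A minimal sketch you can run anywhere (the chromosome names here are just made-up values):

```bash
# Loop over an explicit list of values instead of a file glob;
# the loop body runs once per item, with the current item in $chr
for chr in chr1 chr2 chr3
do
    echo "Processing $chr"
done
```

This prints `Processing chr1`, `Processing chr2`, and `Processing chr3`, one line per item in the list.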
### The `basename` command

Before we get started on creating more complex scripts, we want to introduce you to a command that will be useful for future shell scripting. The `basename` command is used for extracting the base name of a file, which is accomplished using **string splitting to strip the directory and any suffix from filenames**. Let's try an example by first moving back to your home directory:

```bash
$ cd
```

Then we will run the `basename` command on one of the FASTQ files. Be sure to specify the path to the file:

```bash
$ basename ~/unix_lesson/raw_fastq/Mov10_oe_1.subset.fq
```

What is returned to you? The filename was split into the path `unix_lesson/raw_fastq/` and the filename `Mov10_oe_1.subset.fq`. The command returns only the filename. Now, suppose we wanted to also trim off the file extension (i.e. remove `.fq`, leaving only the file *base name*). We can do this by adding a parameter to the command to specify what string of characters we want trimmed.

```bash
$ basename ~/unix_lesson/raw_fastq/Mov10_oe_1.subset.fq .fq
```

You should now see that only `Mov10_oe_1.subset` is returned.

***

**Exercise**

* How would you modify the above `basename` command to only return `Mov10_oe_1`?

***

## Automating with Scripts

Now that you've learned how to use loops and variables, let's put this processing power to work. Imagine, if you will, a script that will run a series of commands that would do the following for us each time we get a new data set:

- Use the `for` loop to iterate over each FASTQ file
- Generate a prefix to use for naming our output files
- Dump out bad reads into a new file
- Get the count of the number of bad reads and generate a summary for each file

You might not realize it, but this is something that you now know how to do. Let's get started...

Rather than doing all of this in the terminal, we are going to create a script file with all relevant commands.
Move back into `unix_lesson` and use `vim` to create our new script file:

```bash
$ cd ~/unix_lesson

$ vim generate_bad_reads_summary.sh
```

We always want to start our scripts with a shebang line:

```bash
#!/bin/bash
```

This line gives the absolute path to the Bash interpreter on almost all computers. The shebang line ensures that the bash shell interprets the script even if it is executed using a different shell.

> Your script will still work without the shebang line if you run it with the `sh` or `bash` commands, but it is best practice to have it in your shell script.

After the shebang line, we enter the commands we want to execute. First, we want to move into our `raw_fastq` directory:

```bash
# enter directory with raw FASTQs
cd ~/unix_lesson/raw_fastq
```

And now we loop over all the FASTQs:

```bash
# count bad reads for each FASTQ file in our directory
for filename in *.fq
```

For each file that we process, we can use `basename` to create a variable that will uniquely identify our output file based on where it originated from:

```bash
do
  # create a prefix for all output files
  base=`basename $filename .subset.fq`
```

and then we execute the following commands for each file:

```bash
  # tell us what file we're working on
  echo $filename

  # grab all the bad read records into new file
  grep -B1 -A2 NNNNNNNNNN $filename > $base-badreads.fastq
```

We'll also count the number of these reads and put that in a new file, using the count (`-c`) flag of `grep`:

```bash
  # grab the number of bad reads and write it to a summary file
  grep -cH NNNNNNNNNN $filename > $base-badreads.count.summary
done
```

> **NOTE:** As you may have noticed, we used a new `grep` flag `-H` above; this flag will report the filename the search was performed on along with the matching string.

Save and exit `vim`, and voila! You now have a script you can use to assess the quality of all your new datasets.
Your finished script, complete with comments, should look like the following:

```bash
#!/bin/bash

# enter directory with raw FASTQs
cd ~/unix_lesson/raw_fastq

# count bad reads for each FASTQ file in our directory
for filename in *.fq
do

  # create a prefix for all output files
  base=`basename $filename .subset.fq`

  # tell us what file we're working on
  echo $filename

  # grab all the bad read records
  grep -B1 -A2 NNNNNNNNNN $filename > $base-badreads.fastq

  # grab the number of bad reads and write it to a summary file
  grep -cH NNNNNNNNNN $filename > $base-badreads.count.summary
done
```

To run this script, we simply enter the following command:

```bash
$ sh generate_bad_reads_summary.sh
```

How do we know if the script worked? Take a look inside the `raw_fastq` directory; for every one of the original FASTQ files we should see two associated bad-read files.

```bash
$ ls -l ~/unix_lesson/raw_fastq
```

To keep our data organized, let's move all of the bad-read files out of the `raw_fastq` directory into the `other` directory, and the script into a new directory called `scripts`.

```bash
$ mv raw_fastq/*bad* other/

$ mkdir scripts
$ mv *.sh scripts/
```

---
*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

* *The materials used in this lesson were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/).
All Data Carpentry instructional material is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0).*
* *Adapted from the lesson by Tracy Teal. Original contributors: Paul Wilson, Milad Fatenejad, Sasha Wood and Radhika Khetani for Software Carpentry (http://software-carpentry.org/)*

diff --git a/Intermediate_shell/lessons/math_on_the_cluster.md b/Intermediate_shell/lessons/math_on_the_cluster.md
deleted file mode 100644
index 158fdb60..00000000
--- a/Intermediate_shell/lessons/math_on_the_cluster.md
+++ /dev/null
@@ -1,119 +0,0 @@

## Math

There are two common ways of carrying out math on the command-line interface. One way uses a language called `bc` and the other utilizes `awk`. Let's look at these two methods.

### bc

`bc` stands for *basic/bench calculator* and is actually its own standalone language. In order for math to be carried out by `bc`, the expression needs to first be piped into `bc`. In this case, we are going to pipe in the equation we want it to calculate with a preceding `echo` command.

```
echo '6 + 2' | bc
```

*NOTE: The whitespace inside `'6 + 2'` above is arbitrary and does not impact calculations.*

It should return `8`. In fact you can do many basic math operations with integers like this.

```
# Subtraction
echo "6 - 2" | bc

# Multiplication
echo "6 * 2" | bc

# Division
echo "6 / 2" | bc

# Exponent
echo "6 ^ 2" | bc

# Square Root
echo "sqrt(4)" | bc
```

You can also do more complex math that involves parentheses:

```
echo "(3 + 1) * 4" | bc
```

*NOTE: You can use single or double quotes when carrying out math with `bc`, but if you want to use `bash` variables you will need to use double quotes. For this reason, it is best practice just to always use double quotes.*
We can also feed `bc` variables, such as:

```
variable_1=4
variable_2=3

# Will return an error
echo '$variable_1 + $variable_2' | bc

# Will return the answer
echo "$variable_1 + $variable_2" | bc
```

This should return the correct answer of `7`.

While this seems great, `bc` has some limitations. The biggest issue is that it does not handle decimals well by default, particularly with division. Let's look at the following case:

```
echo '1 / 3' | bc
```

It returns `0`, which is clearly erroneous. This is because base `bc` defaults to returning only the integer part of the answer. There are two ways to fix this behavior:

### `scale` parameter

Before the equation you would like `bc` to calculate, you can put a `scale` parameter, which tells `bc` how many decimal places to calculate to.

```
echo 'scale=3; 1 / 3' | bc
```

Now we can see that `bc` returns the appropriate answer of `.333`.

### -l option

Adding the `-l` option loads `bc`'s standard math library and automatically sets the `scale` parameter to 20 by default.

```
echo '1 / 3' | bc -l
```

This should return `.33333333333333333333`. You can override this default scale by adding `scale=` just as in the previous example:

```
echo 'scale=3; 1 / 3' | bc -l
```

The `-l` option also opens up a few more functions, including:

```
# Natural log
echo 'l(1)' | bc -l

# Exponential function
echo 'e(1)' | bc -l
```

It also provides access to sine, cosine and arctangent, but those are outside the scope of this course.

### Negative Numbers

`bc` can also handle negative numbers as input and output:

```
echo "-1 + 2" | bc -l

echo "1 - 2" | bc -l

echo "2 ^ -3" | bc -l
```

### awk

You can also do basic arithmetic in `awk`.
In order to do arithmetic with `awk` on the command line, you will need to use a `BEGIN` block, which allows you to run an `awk` command without an input file; then simply have it print your calculation:

```
awk 'BEGIN {print (2/3)^2}'
```

diff --git a/Intermediate_shell/lessons/mdbook.js b/Intermediate_shell/lessons/mdbook.js
deleted file mode 100644
index e1d3e511..00000000
--- a/Intermediate_shell/lessons/mdbook.js
+++ /dev/null
@@ -1,18 +0,0 @@

#!/usr/bin/env runhaskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.JSON
import Data.Text

latex::Format
latex = Format "latex"

highlight :: Block -> IO Block
highlight cb@(Div (id, (cls:_), _) (contents:_)) =
    case (unpack cls) of "warn" -> return $ Div ("", [], []) ((RawBlock latex "\\begin{tcolorbox}[colframe=yellow!90!white, colback=yellow!20!white]Warning: ") : contents : (RawBlock latex "\\end{tcolorbox}") : [])
                         "tips" -> return $ Div ("", [], []) ((RawBlock latex "\\begin{tcolorbox}[colframe=blue!20!white, colback=blue!10!white]Tips: ") : contents : (RawBlock latex "\\end{tcolorbox}") : [])
                         _ -> return cb

highlight x = return x

main :: IO ()
main = toJSONFilter highlight

diff --git a/Intermediate_shell/lessons/moving_files.md b/Intermediate_shell/lessons/moving_files.md
deleted file mode 100644
index 1ec1ab98..00000000
--- a/Intermediate_shell/lessons/moving_files.md
+++ /dev/null
@@ -1,167 +0,0 @@

# Moving files on and off the cluster

At some point you will need to [move files to and from a cluster](moving_files.md#copying-files-to-and-from-the-cluster), [get files from a website](moving_files.md#downloading-external-data), [create a symbolic link](moving_files.md#symbolic-links-or-sym-links), or [check that your transfers have worked](moving_files.md#md5sum).

## Copying files to and from the cluster

You can use a program like FileZilla to copy files over, but there are other ways to do so using the command line interface.
When you obtain your data from the sequencing facility, it will likely be stored on some remote computer, and you will be given login credentials which will allow you to access it. There are various commands that can be used to help you copy those files from the remote computer over to 1) your local computer, 2) O2, or 3) whatever cluster environment you plan to work on. We present a few options here.

### `scp`

Similar to the `cp` command for copying, there is a command that allows you to **securely copy files between computers**. The command is called `scp` and allows files to be copied to, from, or between different hosts. It uses `ssh` for data transfer and provides the same authentication and same level of security as `ssh`.

In the example below, the first argument is the **location on the remote server** and the second argument is the **destination on your local machine**.

> *You can also do this in the opposite direction by swapping the arguments.*

```bash
$ scp username@transfer.rc.hms.harvard.edu:/path/to/file_on_O2 /path/to/directory/on/local_machine
```

Let's try copying over the `draft.txt` file from your `unix_lesson` folder. **First open up a new terminal window.** Look and see where you currently are:

```bash
$ pwd
```

Then type in:

```bash
$ scp rc_trainingXX@transfer.rc.hms.harvard.edu:~/unix_lesson/other/draft.txt .
```

Now see that the file has transferred over:

```bash
$ less draft.txt
```

> **NOTE:** Windows users may encounter a permissions error when using `scp` to copy over locally. We are not sure how to troubleshoot this, but will update materials as we obtain more information.

### `rsync`

`rsync` is used to copy or synchronize data between directories. It has many advantages over `cp`, `scp`, etc. It works in a specific direction, i.e. from the first directory **to** the second directory, similar to `cp`.
- -**Salient Features of `rsync`** - -* If the command (or transfer) is interrupted, you can start it again and *it will restart from where it was interrupted*. -* Once a folder has been synced between 2 locations, the next time you run `rsync` it will *only update and not copy everything over again*. -* It runs a check to ensure that every file it is "syncing" over is the exact same in both locations. This check is run using a version of ["checksum"](https://en.wikipedia.org/wiki/Checksum) which ensures the data integrity during the data transfer process. - -> You can run the checksum function yourself when transferring large datasets without `rsync` using one of the following commands (or similar): `md5`, `md5sum`. - - -### Between directories on the same machine - -```bash -#DO NOT RUN -$ rsync -av ~/large_dataset/. /n/groups/dir/groupdata/ -``` - -### Between different machines - -When copying over large datasets to or from a remote machine, `rsync` works similarly to `scp`. - -```bash -#DO NOT RUN -$ rsync -av -e ssh testfile username@transfer.rc.hms.harvard.edu:~/large_files/ -``` - -* `a` is for archive - means it preserves permissions (owners, groups), times, symbolic links, and devices. -* `v` is for verbosity - means that it prints on the screen what is being copied -* `-e ssh` is for encryption - means that we want to use the ssh protocol for encryption of the file transfer - -*More helpful information and examples using rsync can be found [at this link](https://www.comentum.com/rsync.html)* - -> Please do not use O2’s login nodes for transferring large datasets (like fastq files) between your computer and O2 with `rsync` or `scp`. Instead, use the transfer nodes `ssh eCommons@transfer.rc.hms.harvard.edu`. - - - -## Downloading external data - -### `curl` - -Oftentimes, you will find yourself wanting to download data from a website. There are two comparable commands that you can use to accomplish this task. 
The first one is `curl`, and the most common syntax for using it is:

```
curl -L -O [http://www.example.com/data]
```

The `-O` option will use the filename given on the website as the filename to write to. Alternatively, if you want to name the file something different, you can use the `-o` option followed by the preferred name, like:

```
curl -L -o preferred_name [http://www.example.com/data]
```

The `-L` option tells `curl` to follow any redirects, so the download still succeeds if the data has been moved to a different address.

Lastly, if your connection gets lost midway through a transfer, you can use the `-C` option followed by `-` to resume the download where it left off. For example:

```
curl -C - -L -O [http://www.example.com/data]
```

### `wget`

A common alternative to `curl` is `wget`. For many purposes they are extremely similar, and which one you use is a matter of personal preference. The general syntax is a bit friendlier for `wget`:

```
wget [http://www.example.com/data]
```

If you lose your connection during the download process and would like to resume it midway through, the `-c` option will ***c***ontinue the download where it left off:

```
wget -c [http://www.example.com/data]
```

### `curl` versus `wget`

For many purposes `curl` and `wget` are similar, but there are some small differences:

1) In `curl` you can use the `-O` option multiple times to carry out multiple downloads in a single command.

```
curl -L -O [http://www.example.com/data_file_1] -O [http://www.example.com/data_file_2]
```

2) In `wget` you can recursively download a directory (meaning that you also download all of its subdirectories) with the `-r` option. In practice this often isn't needed, because the source will typically pack everything up into a single compressed package, but nonetheless it is something that `wget` can do that `curl` cannot.
In general, `curl` has more options and flexibility than `wget`, but the vast majority of those options are ***far*** beyond the scope of this course; for our purposes the choice comes down to personal preference.

## Symbolic Links or "sym links"

Symbolic links are like the shortcuts you may create on your laptop. A sym link makes it appear as if the linked object is actually there. It can be useful for accessing a file from multiple locations without creating copies and without using much disk space. (Symlinks are only a few bytes in size.)

Let's check out an example of a folder with lots of symlinks:

```bash
ls -l /n/app/bcbio/tools/bin/
```

Now, let's create a sym link in our home directory for the same `unix_lesson` folder we had originally copied over.

```bash
$ cd

$ ln -s /n/groups/hbctraining/unix_lesson/ unix_lesson_sym

$ ls -l
```

We recommend that you create something like this for your raw data so it does not accidentally get corrupted or overwritten.

> Note: a "hard" link (just `ln` without the `-s` option) is very different. Always use `ln -s` unless you really know what you're doing!

## md5sum

Sometimes you are copying files between two locations and you want to ensure the copying went smoothly, or you are interested in whether two files are the same. Checksums can be thought of as an alphanumeric fingerprint for a file, and they are used to check whether two files are identical. It is common for people and institutions to provide a list of md5sums for files that are available to download. `md5sum` is one common checksum.
***Importantly, it is theoretically possible for two different files to have the same md5sum, but in practice it is nearly impossible.*** The syntax for checking the md5sum of a file is:

```
md5sum
```

diff --git a/Intermediate_shell/lessons/positional_params.md b/Intermediate_shell/lessons/positional_params.md
deleted file mode 100644
index 9fbb7c33..00000000
--- a/Intermediate_shell/lessons/positional_params.md
+++ /dev/null
@@ -1,319 +0,0 @@

---
title: "Introduction to Positional Parameters and Variables"
author: "Emma Berdan"
---

## Learning Objectives:

* Distinguish between variables and positional parameters
* Recognize variables and positional parameters in code written by someone else
* Implement positional parameters and variables in a bash script
* Integrate for loops and variables

## What is a variable?

"A **variable** is a character string to which we assign a value. The value assigned could be a number, text, filename, device, or any other type of data. A variable is nothing more than a pointer to the actual data. The shell enables you to create, assign, and delete variables." ([Source](https://www.tutorialspoint.com/unix/unix-using-variables.htm))

It is easy to identify a variable in any bash script, as it will always have a `$` in front of it. Here is my very cleverly named variable: `$Variable`

## Positional parameters are a special kind of variable

"A **positional parameter** is an argument specified on the command line, used to launch the current process in a shell. Positional parameter values are stored in a special set of variables maintained by the shell." ([Source](https://www.computerhope.com/jargon/p/positional-parameter.htm))

So rather than a variable that is assigned inside the bash script, a positional parameter is given when you run your script. This makes the script more flexible, as the parameter can be changed without modifying the script itself.
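You can experiment with positional parameters without writing a script at all: the shell builtin `set --` assigns its arguments to `$1`, `$2`, and so on in the current shell. A quick sketch:

```shell
# Assign positional parameters directly in the current shell
set -- apple banana

echo "$1"   # apple
echo "$2"   # banana
```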

- -

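As a concrete sketch, imagine a minimal, hypothetical script, `myscript.sh`, that simply reports its own positional parameters:

```shell
#!/bin/bash
# myscript.sh (hypothetical): print the command and its first two positional parameters
echo "Command (\$0): $0"
echo "Parameter 1 (\$1): $1"
echo "Parameter 2 (\$2): $2"
```

Running `./myscript.sh hello world` would report `./myscript.sh`, `hello`, and `world`.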
Here we can see that our command is the first positional parameter (`$0`) and that each of the strings afterwards is an additional positional parameter (here `$1` and `$2`). Generally when we refer to positional parameters we ignore `$0` and start with `$1`.

It is crucial to note that different positional parameters are separated by whitespace and can be strings of any length. This means that:

```bash
$ ./myscript.sh OneTwoThree
```

has given only one positional parameter, `$1=OneTwoThree`,

and

```bash
$ ./myscript.sh O N E
```

has given three positional parameters: `$1=O` `$2=N` `$3=E`

You can code your script to take as many positional parameters as you like, but for any parameter greater than 9 you need to use curly brackets. So positional parameter 9 is `$9`, but positional parameter 10 is `${10}`. We will come back to curly brackets later.

Finally, the variable `$@` contains the values of all positional parameters except `$0`.

## A simple example

Let's make a script ourselves to see positional parameters in action.

From your command line, type `vim compliment.sh`, then type `i` to go to insert mode. If you have never used vim before you can find the basics [HERE](https://github.com/hbctraining/Intro-to-shell-flipped/blob/047946559cbffc9cc24ccb10a4d630aa18fab558/lessons/03_working_with_files.md).

Now copy and paste the following into your file:

```bash
#!/bin/bash

echo $1 'is amazing at' $2
```
Then type Esc to exit insert mode, and type `:wq` and Enter to write and quit.

>**Note**
>We want to take a step back here and think about what we might do from here to actually use our script.
>We know our script is a bash script because we wrote our "shebang" line `#!/bin/bash`, but the computer
>doesn't automatically know this. One option is to run this script via the shell interpreter `sh`. Doing this
>actually makes our "shebang" line obsolete! The same script without the "shebang" line will still run!! Try it!
>
>`sh compliment.sh`
>
>Here we are telling the computer to use a shell to execute the commands in our script, which is why we don't
>need the "shebang" line. It is **NOT** best practice to write scripts without "shebang" lines, as removing this
>will leave the next person scratching their head figuring out which language the script is in.
>**ALWAYS ALWAYS ALWAYS use a "shebang" line**. With that line in place, we can run this script without
>calling bash from the command line. But first we have to make the script executable. This tells the
>computer that this is a script and not just a text file. We do that by adding a file permission:
>typing `chmod u+x compliment.sh` will make the file executable for the user (you!). Once this is done, the script
>can be run this way:
>
>`./compliment.sh`
>
>When a file is executable, the computer will use the "shebang" line to figure out which interpreter to use.
>Different languages (Perl, Python, etc.) will have different "shebang" lines.

For this lesson we will make all of our scripts executable. Now that you are back on the command line, type `chmod u+x compliment.sh` to make the file executable for yourself. More on file permissions [HERE](https://github.com/hbctraining/Intro-to-shell-flipped/blob/master/lessons/07_permissions_and_environment_variables.md).

You may have already guessed that our script takes two different positional parameters. The first one is your first name and the second is something you are good at. Here is an example:

```bash
./compliment.sh OliviaC acting
```

This will print

```bash
OliviaC is amazing at acting
```
You may have already guessed that I am talking about award-winning actress [Olivia Colman](https://en.wikipedia.org/wiki/Olivia_Colman) here.
But if I typed

```bash
./compliment.sh Olivia Colman acting
```
I would get

```bash
Olivia is amazing at Colman
```
Technically I have just given three positional parameters: `$1=Olivia` `$2=Colman` `$3=acting`.
However, since our script does not contain `$3`, this is ignored.

In order to give Olivia her full due, I would need to type

```bash
./compliment.sh "Olivia Colman" acting
```

The quotes tell bash that "Olivia Colman" is a single string, `$1`. Both double quotes (") and single quotes (') will work. Olivia has enough accolades though, so go ahead and run the script with your name (just first, or both first and last) and something you are good at!

## Naming variables

My previous script was so short that it was easy to remember that `$1` represents a name and `$2` represents a skill. However, most scripts are much longer and may contain more positional parameters. To make it easier on yourself, it is often a good idea to name your positional parameters. Here is the same script we just used, but with named variables:

```bash
#!/bin/bash

name=$1
skill=$2

echo $name 'is amazing at' $skill
```

It is critical that there is no space in our assignment statements; `name = $1` would not work. We can also assign new variables in this manner whether or not they are coming from positional parameters. Here is the same script with the variables defined within it:

```bash
#!/bin/bash

name="Olivia Colman"
skill="acting"

echo $name 'is amazing at' $skill
```
We will talk more about naming variables later, but note that defining variables within the script can make the script **less** flexible. If I want to change my sentence, I now need to edit my script directly rather than launching the same script with different positional parameters.

## A useful example

Now that we understand the basics of variables and positional parameters, how can we make them work for us?
One of the best ways to do this is when writing a shell script. A shell script is a script that embeds a system command or utility and saves a set of parameters passed to that command.

As an example, let's say that I want to add read groups to a series of bam files. Each bam file is one sample that I have sequenced, and I need to add read groups to all of them. Here is an example of my command for sample M1:

```bash
java -jar picard.jar AddOrReplaceReadGroups I=M1.dedupped.bam \
O=M1.final.bam RGID=M1 RGLB=M1 RGPL=illumina RGPU=unit1 RGSM=M1
```

The string 'M1' occurs 5 times in this command. However, M1 is not my only sample; to make this code run for a different sample, I would need to replace M1 5 times. I don't want to manually edit this line of code every time I run the command. Instead, using positional parameters, I can make a shell script for this command.

```bash
#!/bin/bash

java -jar picard.jar AddOrReplaceReadGroups I=$1.dedupped.bam \
O=$1.final.bam RGID=$1 RGLB=$1 RGPL=illumina RGPU=unit1 RGSM=$1
```

Here `$1` is my only positional parameter and is my sample name. **However**, this script is not written with best practices. It should actually look like this:

```bash
#!/bin/bash

java -jar picard.jar AddOrReplaceReadGroups I=${1}.dedupped.bam \
O=${1}.final.bam RGID=${1} RGLB=${1} RGPL=illumina RGPU=unit1 RGSM=${1}
```

`$1`, which we have been using, is actually a short form of `${1}`.

We can only use `$1` when it is **not** followed by a letter, digit or an underscore, but we can always use `${1}`.

If we wrote a script that said `echo $1_is_awesome`, we wouldn't actually get any output when we ran it with a positional parameter, not even our beloved [Olivia Colman](https://en.wikipedia.org/wiki/Olivia_Colman)!
Instead, this script would need to be written as `echo ${1}_is_awesome`.

As you write your own code, it is good to remember that it is always safe to use `${VAR}` and that errors may result from using `$VAR` instead, even if it is convenient. As you navigate scripts written by other people, you will see both forms.

Let's test out this script ourselves, without actually running picard, by using `echo`. From your command line, type `vim picard.sh`, then type `i` to go to insert mode.

Now copy and paste the following into your file:

```bash
#!/bin/bash

echo java -jar picard.jar AddOrReplaceReadGroups I=${1}.dedupped.bam \
O=${1}.final.bam RGID=${1} RGLB=${1} RGPL=illumina RGPU=unit1 RGSM=${1}
```
Then type Esc to exit insert mode, and type `:wq` and Enter to write and quit.

Now that you are back on the command line, type `chmod u+x picard.sh` to make the file executable for yourself.

You can try out the code yourself by using any sample name you want. Here I am running it for sample T23:

```bash
./picard.sh T23
```

We have now significantly decreased our own workload. By using this script, we can easily run this command for any sample we have. However, sometimes we have so many samples that even running this command manually for all of them will be time consuming. In this case we can turn to one of the most powerful ways to use positional parameters and other variables: combining them with **for loops**. More on for loops [HERE](https://github.com/hbctraining/Intro-to-shell-flipped/blob/master/lessons/06_loops_and_automation.md).

## Variables in for loops

We are going to continue with our picard example. Let's say that I need to run my picard command for 10 different samples, and I have all of my sample names in a text file. First, let's put my sample name list on the cluster so we can access it with our script.

From your command line, type `vim samples.txt`, then type `i` to go to insert mode.
Copy and paste the following:

```bash
M1
M2
M3
O1
O2
O3
O4
S1
S2
S3
```

Then type Esc to exit insert mode, and type `:wq` and Enter to write and quit. Each line is a single sample name.

Now let's write our new script. Again we will use `echo` to avoid actually calling picard. We will write this with vim and then go through it line by line.

From your command line, type `vim picard_loop.sh`, then type `i` to go to insert mode. Copy and paste the following:

```bash
#!/bin/bash

for ((i=1; i<=10; i=i+1))
do

sample=$(awk -v awkvar="${i}" 'NR==awkvar' samples.txt)

echo java -jar picard.jar AddOrReplaceReadGroups \
I=${sample}.dedupped.bam O=${sample}.final.bam RGID=${sample} \
RGLB=${sample} RGPL=illumina RGPU=unit1 RGSM=${sample}

done
```

Then type Esc to exit insert mode, and type `:wq` and Enter to write and quit. Now that you are back on the command line, type `chmod u+x picard_loop.sh` to make the file executable for yourself.

Before we run this, let's go through it line by line.

`for ((i=1; i<=10; i=i+1))`

This tells bash how our loop is working. We want to start at 1 (`i=1`) and end at 10 (`i<=10`), and each time we complete our loop the value of `i` should increase by 1 (`i=i+1`). Simple math tells us that means the loop will run 10 times, but we could make it run 100 times by changing `i<=10` to `i<=100`. `i` is our counter variable, used to track how many loops we have run. You will often see this called `i` or `j`, but you can actually call it whatever you want. For example, here we are using it to track which line of samples.txt we are reading, so it may be more intuitive to write it like this: `for ((line=1; line<=10; line=line+1))`. `i` and `j` are used because they are shorter to write.

`do`

This means that whatever follows is what we want bash to do for each value of `i` (here 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10).
`sample=$(awk -v awkvar="${i}" 'NR==awkvar' samples.txt)`

This line creates a variable called `$sample` and assigns it the value of line `i` of samples.txt. We won't go into the details of how this awk command works, but you can learn more about using awk [HERE](https://github.com/hbctraining/Training-modules/blob/f168114cce7ab9d35eddbf888b94f5a2fda0318a/Intermediate_shell/lessons/advanced_lessons.md). You may also notice that we have assigned the value of `$sample` differently here, using `$(...)` instead of single ' or double " quotes. The syntax for assigning variables changes depending on what you are assigning. See **Syntax for assigning variables** below.

If we look at samples.txt, we can see that when `i=1`, `$sample` will be M1. What will `$sample` be when `i=5`?

The next line should look familiar:

```bash
echo java -jar picard.jar AddOrReplaceReadGroups \
I=${sample}.dedupped.bam O=${sample}.final.bam RGID=${sample} \
RGLB=${sample} RGPL=illumina RGPU=unit1 RGSM=${sample}
```

This is exactly the same as what we used above, except `$1` is now `$sample`. We are assigning the value of `$sample` within our script instead of giving it externally as a positional parameter.

Finally, we end our script with

```bash
done
```

Here we are simply telling bash that this is the end of the commands that we need it to do for each value of `i`.

Now let's run our script:

```bash
./picard_loop.sh
```

Is the output what you expected?

## Syntax for assigning variables

Depending on what you are assigning as a variable, the syntax for doing so differs.

`variable=$(command)` for the output of a command.

 example: `variable=$(wc -l file.txt)` will assign the number of lines in the file file.txt to `$variable`

`variable='a string'` or `variable="a string"` for a string with spaces.

 example: `variable="Olivia Colman"` as seen above.

`variable=number` for a number or a string without spaces.
 example: `variable=12` will assign the number 12 to `$variable`

`variable=$1` for positional parameter 1.

 example: `variable=$9` will assign positional parameter 9 to `$variable`, and `variable=${10}` will assign positional parameter 10 to `$variable`

diff --git a/Intermediate_shell/lessons/regular_expressions.md b/Intermediate_shell/lessons/regular_expressions.md
deleted file mode 100644
index 2ce61153..00000000
--- a/Intermediate_shell/lessons/regular_expressions.md
+++ /dev/null
@@ -1,149 +0,0 @@

# Regular Expressions

Regular expressions (sometimes referred to as regex) are strings of characters that can be used as a pattern to match against. This can be very helpful when searching through a file, particularly in conjunction with `sed`, `grep` or `awk`. The topics covered will be:

[Ranges](regular_expressions.md#ranges)

[Special Characters](regular_expressions.md#special-characters)

[Quantifiers](regular_expressions.md#quantifiers)

[Anchors](regular_expressions.md#anchors)

[Literal Matches](regular_expressions.md#literal-matches)

[Whitespace and new lines](regular_expressions.md#whitespace-and-new-lines)

[Examples of Combining Special Characters](regular_expressions.md#examples-of-combining-special-characters)

[Additional Resources](regular_expressions.md#additional-resources)

---

[Return to Table of Contents](toc.md)

## Ranges

Square brackets, `[]`, can be used to notate a range of acceptable characters in a position.
`[BPL]ATCH` could match 'BATCH', 'PATCH' or 'LATCH'

You can also use `-` to denote a range of characters:

`[A-Z]ATCH` would match 'AATCH', 'BATCH'...'ZATCH'

You can also merge different ranges together by putting them right after each other or separating them with a `|`:

`[A-Za-z]ATCH` or `[A-Z|a-z]ATCH` would match 'AATCH', 'BATCH'...'ZATCH' and 'aATCH', 'bATCH'...'zATCH' (note that inside `[]` the `|` is treated as just another literal character, so `[A-Z|a-z]ATCH` would also match '|ATCH'; placing the ranges right after each other is the safer form)

In fact, regular expression ranges generally follow the [ASCII alphabet](https://en.wikipedia.org/wiki/ASCII) (but your local character encoding may vary), so:

`[0-z]ATCH` would match '0ATCH', '1ATCH', '2ATCH'...'AATCH'...'zATCH'. However, it is important to also note that the ASCII alphabet has a few characters between the numbers and the uppercase letters, such as ':' and '>', so you would also match ':ATCH' and '>ATCH', respectively. There are also a few characters between the uppercase and lowercase letters, such as '^' and ']'. If you wanted to search for numbers, uppercase letters and lowercase letters, but NOT these characters in between, you would need to modify the range:

`[0-9A-Za-z]ATCH`

Also note that because these characters follow the ASCII encoding order, `[Z-A]` will give you an error telling you that it is an invalid range, because 'Z' comes after 'A'.

The `^` ***within*** `[]` functions as a 'not' operator. For example:

`[^C]ATCH` will match anything ending in 'ATCH' ***except*** 'CATCH'.

***IMPORTANT NOTE: `^` has a different function when used outside of the `[]`, which is discussed below under anchors.***

[Back to the top](regular_expressions.md#regular-expressions)

## Special Characters

### `.`

The `.` matches any character except a newline. Notably, it ***does not*** match no character. This is similar to the behavior of the wildcard `?` in Unix.

[Back to the top](regular_expressions.md#regular-expressions)

## Quantifiers

### `*`

The `*` matches the preceding character any number of times, ***including*** zero.
- -`CA*TCH` would match 'CTCH', 'CATCH', 'CAATCH' ... 'CAAAAAAATCH' ... - -### `?` - -The `?` denotes that the preceding character is optional (it may appear zero or one times). For example: - -`C?ATCH` would match 'CATCH' and also 'ATCH', but ***not*** 'BATCH' - -Compare this with the `.` special character: `.ATCH` would match 'CATCH', 'BATCH', '2ATCH' and '^ATCH', but ***not*** 'ATCH' - -### `{}` - -The `{INTEGER}` quantifier matches the preceding character exactly INTEGER times. - -`CA{3}TCH` would match 'CAAATCH', but ***not*** 'CATCH', 'CAATCH' or 'CAAAATCH'. - -### `+` - -The `+` matches one or more occurrences of the preceding character. - -`CA+TCH` would match 'CATCH', 'CAATCH' ... 'CAAAAAAAATCH' ... - -[Back to the top](regular_expressions.md#regular-expressions) - -## Anchors - -### `^` - -The `^` character anchors the search criteria to the beginning of the line. For example: - -`^CAT` would match lines that start with 'CAT' (including 'CATCH'), but ***not*** 'BOBCAT' - -***NOTE: `^` within `[]` behaves differently. Remember it functions as 'not'!*** - -### `$` - -The `$` character anchors the search criteria to the end of the line. For example: - -`CAT$` would match lines ending in 'CAT' or 'BOBCAT', but not 'CATCH' - -[Back to the top](regular_expressions.md#regular-expressions) - -## Literal matches - -One problem you will likely run into with the special characters above is that you may sometimes want to match one of them literally, for example an actual '.' or '?'. This is what the escape character, `\`, is for. - -`C\?TCH` would match 'C?TCH', but not 'CTCH' or 'TCH' as the unescaped `C?TCH` would. - -[Back to the top](regular_expressions.md#regular-expressions) - -## Whitespace and new lines - -You can search for tabs with `\t`, whitespace with `\s` and newlines with `\n` (note that support for these escape sequences varies between tools). - -`CA\tTCH` would match 'CA TCH' (where the gap is a tab character) - -[Back to the top](regular_expressions.md#regular-expressions) - -## Examples of Combining Special Characters - -Much of the power of regular expressions comes from combining them to match exactly the pattern you want.
- -If you want to find any line that starts with the uppercase letters 'A-G', you could do: - -`^[A-G]` - -Perhaps you want to find all lines ending with 'CA' followed by any character except 'T'; then you could do: - -`CA[^T]$` - -Another thing you may be interested in is finding lines that start with 'C' and end with 'CH', with anything, including nothing, in between. - -`^C.*CH$` - -[Back to the top](regular_expressions.md#regular-expressions) - -## Additional Resources - -https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/extra_bash_tools.md#regular-expressions-regex-in-bash- - -[Back to the top](regular_expressions.md#regular-expressions) diff --git a/Intermediate_shell/lessons/sed.md b/Intermediate_shell/lessons/sed.md deleted file mode 100644 index e996af37..00000000 --- a/Intermediate_shell/lessons/sed.md +++ /dev/null @@ -1,339 +0,0 @@ -# sed - -The ***s***tream ***ed***itor, `sed`, is a common tool used for text manipulation. `sed` takes input from either a file or piped from a previous command and applies a transformation to it before outputting it to standard out. - -**Topics discussed here are:** - -[Substitution](sed.md#substitution) - -[Addresses](sed.md#addresses) - -[Deletion](sed.md#deletion) - -[Appending](sed.md#appending) - -[Replacing Lines](sed.md#replacing-lines) - -[Translation](sed.md#translation) - -[Multiple Expressions](sed.md#multiple-expressions) - -[Additional Resources](sed.md#additional-resources) - --- - -[Return to Table of Contents](toc.md) - -## Substitution - -One common usage for `sed` is to replace one word with another. The syntax for doing this is: - -``` -sed 's/pattern/replacement/flag' file.txt -``` - -A few things to note here: - -1) The `s` in `'s/pattern/replacement/flag'` is directing `sed` to do a ***s***ubstitution. -2) The `flag` in `'s/pattern/replacement/flag'` tells `sed` how the substitution should be carried out.
It is very common to use the flag `g` here, which will carry out the action ***g***lobally, i.e. each time it matches the `pattern`. If `g` is not included, it will just replace the `pattern` the first time it is observed per line. If you would like to replace a particular occurrence, like the third time it is observed in a line, you would use `3`. - -Let's test this out on our sample file and see the output. First, we are interested in replacing 'jungle' with 'rainforest' throughout the file: - -``` -sed 's/jungle/rainforest/g' animals.txt -``` - -Notice how all instances of 'jungle' have been replaced with 'rainforest'. However, if we don't include the global option: - -``` -sed 's/jungle/rainforest/' animals.txt -``` - -Only the first instance of 'jungle' on each line is replaced with 'rainforest'. If we want to replace only the second occurrence of 'jungle' with 'rainforest' on a line, modify the occurrence to be `2`: - -``` -sed 's/jungle/rainforest/2' animals.txt -``` - -It is important to note that the pattern-matching in `sed` is case-sensitive. To make your pattern searches case-insensitive, you will need to add the `I` flag (a GNU `sed` extension): - -``` -sed 's/Jungle/rainforest/Ig' animals.txt -``` - -This will now replace all instances of Jungle/jungle/JuNgLe/jUngle/etc. with 'rainforest'. - -> ***NOTE:*** In GNU `sed`, a number can be combined with `g` (e.g. `2g`) to replace the 2nd occurrence and all subsequent occurrences on each line. - -### -n option - -In `sed`, the `-n` option suppresses the automatic printing of each line to standard output. However, you can pair it with the `p` flag, and this will print out only the lines that were edited.
- -``` -sed -n 's/an/replacement/p' animals.txt -``` - -The `-n` option has another useful purpose: you can use it to find the line number of a matched pattern by using `=` after the pattern you are searching for: - -``` -sed -n '/jungle/ =' animals.txt -``` - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Addresses - -### Single lines - -One can also direct which line, the ***address***, `sed` should make an edit on by adding the line number in front of `s`. This is most common when one wants to make a substitution for a pattern in a header line and is worried that the pattern might be elsewhere in the file. It is best practice to wrap your substitution argument in curly brackets (`{}`) when using an address. To demonstrate this we can compare two commands: - -``` -sed 's/an/replacement/g' animals.txt -sed '1{s/an/replacement/g}' animals.txt -``` - -In the first command, `sed 's/an/replacement/g' animals.txt`, we have replaced all instances of 'an' with 'replacement'. However, in the second command, `sed '1{s/an/replacement/g}' animals.txt`, we have only replaced instances on line 1. - -While wrapping the substitution in curly brackets isn't required when using a single line, it is necessary when defining an interval. As you can see: - -``` -sed '1s/an/replacement/g' animals.txt -``` - -Produces the same output as above. - -### Intervals - -If you only want to have this substitution carried out on the first three lines (`1,3`, this is giving an address interval, from line 1 to line 3) we need to include the curly brackets: - -``` -sed '1,3{s/an/replacement/g}' animals.txt -``` - -You can also replace the second address with a `$` to indicate the end of the file, like: - -``` -sed '5,${s/an/replacement/g}' animals.txt -``` - -This will carry out the substitution from the fifth line until the end of the file. - -You can also use regular expressions in the address field.
For example, if you only wanted the substitution to happen between the first occurrence of 'monkey' and the first occurrence of 'alligator', you could do: - -``` -sed '/monkey/,/alligator/{s/an/replacement/g}' animals.txt -``` - -Alternatively, if you want a replacement to happen everywhere except on a given line, such as on all of your data lines but not on the header line, then one could use `!`, which tells `sed` 'not': - -``` -sed '1!{s/an/replacement/g}' animals.txt -``` - -You can even couple `!` with the regular expression intervals to do the substitution everywhere outside the interval: - -``` -sed '/monkey/,/alligator/!{s/an/replacement/g}' animals.txt -``` - -Lastly, you can use `N~n` in the address (a GNU `sed` extension) to indicate that you want to apply the substitution every *n*th line starting on line *N*. In the example below, the substitution will occur on the first line and every 2nd line thereafter: - -``` -sed '1~2{s/an/replacement/g}' animals.txt -``` - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Deletion - -You can delete entire lines in `sed`. To delete lines, provide the address followed by `d`. To delete the first line from a file: - -``` -sed '1d' animals.txt -``` - -Like substitutions, you can provide an interval, and this will delete from line 1 to line 3: - -``` -sed '1,3d' animals.txt -``` - -Also like substitution, you can use `!` to specify lines not to delete, like: - -``` -sed '1,3!d' animals.txt -``` - -Additionally, you can also use regular expressions to provide the addresses that define an interval to delete from. In this case we are interested in deleting from the first instance of 'alligator' until the end of the file: - -``` -sed '/alligator/,$d' animals.txt -``` - -The `N~n` syntax also works in deletion.
If we want to delete every third line starting on line 2, we can do: - -``` -sed '2~3d' animals.txt -``` - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Appending - -### Appending text - -You can append a new line with the word 'ape' after the 2nd line using the `a` command in `sed`: - -``` -sed '2 a ape' animals.txt -``` - -If you want the appended text to come before the address, you need to use the `i` (insert) command: - -``` -sed '2 i ape' animals.txt -``` - -You can also do this over an interval, like from the 2nd to 4th line: - -``` -sed '2,4 a ape' animals.txt -``` - -Additionally, you can append the text every 3rd line beginning with the second line: - -``` -sed '2~3 a ape' animals.txt -``` - -Lastly, you can also append after a matched pattern: - -``` -sed '/monkey/ a ape' animals.txt -``` - -### Appending a file - -You might be interested in inserting the contents of **file B** at a certain point in **file A**. For example, if you wanted to insert the contents of `file_B.txt` after line `4` in `file_A.txt`, you could do: - -``` -sed '4 r file_B.txt' file_A.txt -``` - -Instead of line `4`, you can append the file after every line in the interval from line 2 to line 4 with: - -``` -sed '2,4 r file_B.txt' file_A.txt -``` - -You could also append the file after each line by using the `1~1` syntax: - -``` -sed '1~1 r file_B.txt' file_A.txt -```
- -Instead of inserting on a line specific line, you can also insert on a pattern: - -``` -sed '/pattern/ r file_B.txt' file_A.txt -``` - -Lastly, you could also insert a file to the end using `$`: - -``` -sed '$ r file_B.txt' file_A.txt -``` - -But this is the same result as simply concatenating two files together like: - -``` -cat file_A.txt file_B.txt -``` - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Replacing Lines - -You can also replace entire lines in `sed` using the `c` command. We could replace the first line with the word 'header' by: - -``` -sed '1 c header' animals.txt -``` - -This can also be utilized in conjustion with the `A,B` interval syntax, but we aware that it will replace ALL lines in that interval with a SINGLE line. - -``` -sed '1,3 c header' animals.txt -``` - -You can also replace every *n*th line starting at *N*th line using the `N~n` address syntax: - -``` -sed '1~3 c header' animals.text -``` - -Lastly, you can also replace lines match a pattern: - -``` -sed '/animal/ c header' animals.txt -``` - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Translation - -`sed` has a feature that allows you to translate characters similiarly to the `tr` function in `bash`. If you wanted to translate all of the lowercase a, b and c characters to their uppercase equivalents you could do that with the `y` command: - -``` -sed 'y/abc/ABC/' animals.txt -``` - -In this case the first letter 'a' is replaced with 'A', 'b' with 'B' and 'c' with 'C'. 
- -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Multiple expressions - -### `-e` option - -If you would like to carry out multiple `sed` expressions in the same command, you can use the `-e` option, providing the expression you would like `sed` to evaluate after each `-e`. For example, one could change 'jungle' to 'rainforest' and 'grasslands' to 'Serengeti': - -``` -sed -e 's/jungle/rainforest/g' -e 's/grasslands/Serengeti/g' animals.txt -``` - -One can also combine different types of expressions. For instance, one could change 'jungle' to 'rainforest' using a substitution expression and then use a deletion expression to remove the header line: - -``` -sed -e 's/jungle/rainforest/g' -e '1d' animals.txt -``` - -### `-f` option - -If you have a large number of `sed` expressions, you can also place them in a text file with each expression on a separate line: - -``` -s/jungle/rainforest/g -s/grasslands/Serengeti/g -1d -``` - -If this file was named 'sed_expressions.txt', our command could look like: - -``` -sed -f sed_expressions.txt animals.txt -``` - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) - -## Additional Resources - -https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/extra_bash_tools.md#sed - -https://www.grymoire.com/Unix/Sed.html#uh-8 - -[Back to the top](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/sed.md#sed) diff --git a/Intermediate_shell/lessons/setting_up.md b/Intermediate_shell/lessons/setting_up.md deleted file mode 100644 index d1f7b46f..00000000 --- a/Intermediate_shell/lessons/setting_up.md +++ /dev/null @@ -1,89 +0,0 @@ -## Accessing the shell - -This workshop assumes that you have a working knowledge of *bash* and in turn know how to access it on your own computer.
If you are not sure, we have this information below. - -> **With Macs** use the "**Terminal**" utility. -> -> **With Windows** you can use your favorite utility or follow our suggestion of using "**Git BASH**". Git BASH is part of the [Git for Windows](https://git-for-windows.github.io/) download, and is a *bash* emulator. - -## The command prompt - -Once again, you are likely familiar with what a command prompt is, but to ensure that everyone in the class is on the same page, we have a short description below. - -> It is a string of characters ending with `$` after which you enter a command to ask the shell to do something. -> -> The string of characters before the `$` usually represents information about the computer you are working on and your current directory; e.g. **`[MacBook-Pro-5:~]$`**. - -## Downloading data - -We will be exploring slightly more advanced capabilities of the shell by working with data from an RNA sequencing experiment. - -> *NOTE: If you attended the [Intro to shell](https://hbctraining.github.io/Training-modules/Intro_shell/) workshop with us last month, you should have already downloaded this data.* - -Before we download the data, let's check the folder we are currently in: - -```bash -$ pwd -``` - -> On a **Mac** your current folder should be something starting with `/Users/`, like `/Users/marypiper/`. -> -> On a **Windows** machine your current folder should be something starting with `/c/Users/marypiper`. To find this in your File explorer, try clicking on PC and navigating to that path. - -Once you know which folder you are downloading your data to, click on the link below: - -**Download RNA-Seq data to your working directory:** [click here to download](https://github.com/hbctraining/Training-modules/blob/master/Intro_shell/data/unix_lesson.zip?raw=true).
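If you prefer to stay in the terminal, the same archive can be downloaded straight into your present working directory from the command line. This is a sketch assuming `curl` is available (it ships with macOS and Git BASH); the `-L` option follows GitHub's redirect to the raw file:

```bash
# Download the lesson dataset into the current directory as unix_lesson.zip
curl -L -o unix_lesson.zip "https://github.com/hbctraining/Training-modules/blob/master/Intro_shell/data/unix_lesson.zip?raw=true"
```

Either way, you should end up with `unix_lesson.zip` in your present working directory.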
- -Type the 'list' command to check that you have downloaded the file to the correct location (your present working directory): - -```bash -$ ls -l -``` - -You should see `unix_lesson.zip` as part of the output to the screen. - -Now, let's decompress the folder, using the `unzip` command: - -```bash -$ unzip unix_lesson.zip -``` - -> When you run the unzip command, you are decompressing the zipped folder, just like you would by double-clicking on it outside the Terminal. As it decompresses, you will usually see "verbose output" listing the files and folders being decompressed or inflated. -> -> ```bash -> -> Archive: unix_lesson.zip -> creating: unix_lesson/ -> creating: unix_lesson/.my_hidden_directory/ -> inflating: unix_lesson/.my_hidden_directory/hidden.txt -> creating: unix_lesson/genomics_data/ -> creating: unix_lesson/other/ -> inflating: unix_lesson/other/Mov10_rnaseq_metadata.txt -> inflating: unix_lesson/other/sequences.fa -> creating: unix_lesson/raw_fastq/ -> inflating: unix_lesson/raw_fastq/Irrel_kd_1.subset.fq -> inflating: unix_lesson/raw_fastq/Irrel_kd_2.subset.fq -> inflating: unix_lesson/raw_fastq/Irrel_kd_3.subset.fq -> inflating: unix_lesson/raw_fastq/Mov10_oe_1.subset.fq -> inflating: unix_lesson/raw_fastq/Mov10_oe_2.subset.fq -> inflating: unix_lesson/raw_fastq/Mov10_oe_3.subset.fq -> inflating: unix_lesson/README.txt -> creating: unix_lesson/reference_data/ -> inflating: unix_lesson/reference_data/chr1-hg19_genes.gtf -> ``` - -Now, run the `ls` command again. - -```bash -$ ls -l -``` - -You should see a folder/directory called `unix_lesson`, which means you are all set with the data download! - -*** - -[Next Lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/exploring_basics.html) - -*** - -*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). 
These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* diff --git a/Intermediate_shell/lessons/toc.md b/Intermediate_shell/lessons/toc.md deleted file mode 100644 index b378d5cc..00000000 --- a/Intermediate_shell/lessons/toc.md +++ /dev/null @@ -1,30 +0,0 @@ -# Advanced Unix - -Below is the table of contents for the Advanced Unix lessons. - -| Topic | Status | -|:-----------:|:----------:| -| [String manipulation](https://github.com/hbctraining/Training-modules/blob/master/Advanced_shell/lessons/02_String_manipulation.md) | Drafted | -| [sed](sed.md) | Drafted | -| [Regular Expressions](https://github.com/hbctraining/Training-modules/blob/master/Intermediate_shell/lessons/regular_expressions.md) | Drafted | -| [grep]() | Sort of Undrafted - Advanced Module has Regex using grep | -| [awk](awk.md) | Drafted | -| [Aliases, Shortcuts and .bashrc](https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/more_bash.md#alias) | Drafted | -| [Copying files to and from a cluster](moving_files.md)| Drafted | -| [Symbolic Links](https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionVI/lessons/more_bash.md#symlink) | Drafted | -| [Arithmetic](math_on_the_cluster.md) | Drafted | -| [if statements](if_statements.md) | Drafted | -| [while read loops]() | Undrafted | -| [Associative Arrays in bash](associative_arrays.md) | Drafted | -| [Arrays in bash]() | Undrafted | -| [Positional Parameters](positional_params.md) | Drafted | -| [Searching history]() | Undrafted | -| [O2 Job Dependencies](job_dependencies.md) | Drafted | -| [O2 Brew]() | Undrafted | -| [O2 Conda]() | Undrafted | -| [vim macros]() | Undrafted | -| [Keeping track of time](keeping_track_of_time.md) |
Drafted | -| [Arrays in Slurm](arrays_in_slurm.md) | Drafted | -| [Rscript]() | Undrafted | - - diff --git a/Intermediate_shell/lessons/vim.md b/Intermediate_shell/lessons/vim.md deleted file mode 100644 index 8a3500cc..00000000 --- a/Intermediate_shell/lessons/vim.md +++ /dev/null @@ -1,140 +0,0 @@ - -## Learning Objectives - -* Learn basic operations using the Vim text editor - -## Creating text files - - -### GUI text editors - -You can easily create text files on your computer by opening up a text editor program such as [TextWrangler](http://www.barebones.com/products/textwrangler/), [Sublime](http://www.sublimetext.com/), or [Notepad++](http://notepad-plus-plus.org/), and starting to type. These text editors often have features to easily search text, extract text, and highlight syntax from multiple programming languages. We refer to these as **GUI text editors**, since they have a **G**raphical **U**ser **I**nterface with buttons and menus that you can click on to issue commands to the computer, and you can move about the interface just by pointing and clicking. - -> **NOTE:** When we say, "text editor," we really do mean "text": these editors can only work with plain character data, not tables, images, or any other media; this explicitly excludes *Microsoft Word* or *TextEdit*. - - -### Command-line text editors - -But what if we need **a text editor that functions from the command line interface**? If we are working on a remote computer (e.g. a high-performance computing environment) we don't have access to a GUI, and so we need to use **command-line editors** to create, modify and save files. When using these types of editors, you cannot 'point-and-click'; you must navigate the interface using only the keyboard. - -Some popular command-line editors include [nano](http://www.nano-editor.org/), [Emacs](http://www.gnu.org/software/emacs/) and [Vim](http://www.vim.org/).
These editors are available by default in most shell environments, including high-performance compute environments (local or cloud). - -### Introduction to Vim - -Today, we are going to introduce the text editor 'vim'. It is a powerful text editor with extensive text editing options; however, in this introduction we are going to focus on exploring some of the more basic functions. We hope that after this introduction you will become more comfortable using it and will explore the advanced functionality as needed. - -To help you remember some of the keyboard shortcuts that are introduced below and to allow you to explore additional functionality on your own, we have compiled [a cheatsheet](https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/resources/VI_CommandReference.pdf). - -#### Vim Interface - -You can create a document by calling a text editor and providing the name of the document you wish to create. Change directories to the `unix_lesson/other` folder and create a document using `vim` entitled `draft.txt`: - -```bash -$ cd ~/unix_lesson/other - -$ vim draft.txt -``` - -Notice the `"draft.txt" [New File]` typed at the bottom left-hand section of the screen. This tells you that you just created a new file in vim. - - -#### Vim Modes - -Vim has **_two basic modes_** that will allow you to create documents and edit your text: - -- **_command mode (default mode):_** will allow you to save and quit the program (and execute other more advanced commands). - -- **_insert (or edit) mode:_** will allow you to write and edit text - - -Upon creation of a file, `vim` is automatically in command mode. Let's _change to insert mode_ by typing `i`. Notice the `--INSERT--` at the bottom left hand of the screen. - -Now let's type in a few lines of text: -``` -While vim offers great functionality, it takes time to get familiar with it and to learn the shortcuts. -``` - -After you have finished typing, press `esc` to enter command mode.
Notice the `--INSERT--` disappeared from the bottom of the screen. - -### Vim Saving and Quitting -To **write to file (save)**, type `:w`. You can see the commands you type in the bottom left-hand corner of the screen. - -After you have saved the file, the total number of lines and characters in the file will print out at the bottom left-hand section of the screen. - -Alternatively, we can **write to file (save) and quit** all at once. Let's do that by typing `:wq`. Now, you should have exited `vim` and returned back to the command prompt. - -To edit your `draft.txt` document, open up the file again using the same command you used to create the file: `vim draft.txt`. - -Change into insert mode and type a few more lines (you can move around the lines of text using the arrows on the keyboard). This time we decide to **quit without saving** by typing `:q!` - -### Vim Editing -Create the document `spider.txt` in vim. Enter the text as follows: - -![image](../img/vim_spider.png) - -To make it easier to refer to distinct lines, we can add line numbers by typing `:set number`. Note that you have to do this in the *command mode*. - -![image](../img/vim_spider_number.png) - -**Save the document.** If you choose to remove the line numbers later, you can type `:set nonumber`. - -While we cannot point and click to navigate the document, we can use the arrow keys to move around. Navigating with arrow keys can be very slow, so `vim` has shortcuts (which are completely unintuitive, but very useful as you get used to them over time). Check to see what mode you are currently in.
While in command mode, try moving around the screen and familiarizing yourself with some of these shortcuts: - -| key | action | -| ---------------- | ---------------------- | -| `gg` | to move to top of file | -| `G` | to move to bottom of file | -| `$` | to move to end of line | -| `0` | to move to beginning of line | - -In addition to shortcuts for navigation, vim also offers editing shortcuts such as: - -| key | action | -| ---------------- | ---------------------- | -| `dd` | to delete line | -| `u` | to undo | -| `ctrl + r` | to redo | - -*** - -**Exercise** - -We have covered some basic commands in `vim`, but practice is key for getting comfortable with the program. Let's practice what we just learned in a brief challenge. - -1. Open `spider.txt`, and delete line #2. -2. Quit without saving. -3. Open `spider.txt` again, go to the last line and delete it. -4. Undo your previous deletion. -5. Redo your previous deletion. -6. Save the file and see whether your results match your neighbors. - -*** - -### Overview of vim commands - -**Vim modes:** - -| key | action | -| ---------------- | ---------------------- | -| `i` | insert mode - to write and edit text | -| `esc` | command mode - to issue commands / shortcuts | - - -**Saving and quitting:** - -| key | action | -| ---------------- | ---------------------- | -| `:w` | to write to file (save) | -| `:wq` | to write to file and quit | -| `:q!` | to quit without saving | -| `:set number` | to display line numbers | -| `:set nonumber` | to not display line numbers | - -*** - -[Next Lesson](https://hbctraining.github.io/Training-modules/Intermediate_shell/lessons/loops_and_scripts.html) - -*** - -*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/).
These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*