diff --git a/archive/Intermediate_shell/lessons/AWK_module.md b/Extra_shell_materials/AWK_module.md similarity index 100% rename from archive/Intermediate_shell/lessons/AWK_module.md rename to Extra_shell_materials/AWK_module.md diff --git a/archive/Intermediate_shell/lessons/advanced_lessons.md b/Extra_shell_materials/advanced_lessons.md similarity index 100% rename from archive/Intermediate_shell/lessons/advanced_lessons.md rename to Extra_shell_materials/advanced_lessons.md diff --git a/archive/Intermediate_shell/lessons/arrays_in_slurm.md b/Extra_shell_materials/arrays_in_slurm.md similarity index 100% rename from archive/Intermediate_shell/lessons/arrays_in_slurm.md rename to Extra_shell_materials/arrays_in_slurm.md diff --git a/archive/Intermediate_shell/lessons/associative_arrays.md b/Extra_shell_materials/associative_arrays.md similarity index 100% rename from archive/Intermediate_shell/lessons/associative_arrays.md rename to Extra_shell_materials/associative_arrays.md diff --git a/archive/Intermediate_shell/lessons/awk.md b/Extra_shell_materials/awk.md similarity index 100% rename from archive/Intermediate_shell/lessons/awk.md rename to Extra_shell_materials/awk.md diff --git a/archive/Intermediate_shell/lessons/exploring_basics.md b/Extra_shell_materials/exploring_basics.md similarity index 100% rename from archive/Intermediate_shell/lessons/exploring_basics.md rename to Extra_shell_materials/exploring_basics.md diff --git a/archive/Intermediate_shell/lessons/exploring_basics_long.md b/Extra_shell_materials/exploring_basics_long.md similarity index 100% rename from archive/Intermediate_shell/lessons/exploring_basics_long.md rename to Extra_shell_materials/exploring_basics_long.md diff --git a/archive/Intermediate_shell/lessons/if_statements.md b/Extra_shell_materials/if_statements.md similarity index 100% rename from archive/Intermediate_shell/lessons/if_statements.md rename to Extra_shell_materials/if_statements.md diff --git a/archive/Intermediate_shell/lessons/job_dependencies.md b/Extra_shell_materials/job_dependencies.md similarity index 100% rename from archive/Intermediate_shell/lessons/job_dependencies.md rename to Extra_shell_materials/job_dependencies.md diff --git a/archive/Intermediate_shell/lessons/keeping_track_of_time.md b/Extra_shell_materials/keeping_track_of_time.md similarity index 100% rename from archive/Intermediate_shell/lessons/keeping_track_of_time.md rename to Extra_shell_materials/keeping_track_of_time.md diff --git a/archive/Intermediate_shell/lessons/loops_and_scripts.md b/Extra_shell_materials/loops_and_scripts.md similarity index 100% rename from archive/Intermediate_shell/lessons/loops_and_scripts.md rename to Extra_shell_materials/loops_and_scripts.md diff --git a/archive/Intermediate_shell/lessons/math_on_the_cluster.md b/Extra_shell_materials/math_on_the_cluster.md similarity index 100% rename from archive/Intermediate_shell/lessons/math_on_the_cluster.md rename to Extra_shell_materials/math_on_the_cluster.md diff --git a/archive/Intermediate_shell/lessons/moving_files.md b/Extra_shell_materials/moving_files.md similarity index 100% rename from archive/Intermediate_shell/lessons/moving_files.md rename to Extra_shell_materials/moving_files.md diff --git a/archive/Intermediate_shell/lessons/positional_params.md b/Extra_shell_materials/positional_params.md similarity index 100% rename from archive/Intermediate_shell/lessons/positional_params.md rename to Extra_shell_materials/positional_params.md diff --git 
a/archive/Intermediate_shell/lessons/regular_expressions.md b/Extra_shell_materials/regular_expressions.md similarity index 100% rename from archive/Intermediate_shell/lessons/regular_expressions.md rename to Extra_shell_materials/regular_expressions.md diff --git a/archive/Intermediate_shell/lessons/sed.md b/Extra_shell_materials/sed.md similarity index 100% rename from archive/Intermediate_shell/lessons/sed.md rename to Extra_shell_materials/sed.md diff --git a/archive/Intermediate_shell/lessons/setting_up.md b/Extra_shell_materials/setting_up.md similarity index 100% rename from archive/Intermediate_shell/lessons/setting_up.md rename to Extra_shell_materials/setting_up.md diff --git a/archive/Intermediate_shell/lessons/toc.md b/Extra_shell_materials/toc.md similarity index 100% rename from archive/Intermediate_shell/lessons/toc.md rename to Extra_shell_materials/toc.md diff --git a/archive/Intermediate_shell/lessons/vim.md b/Extra_shell_materials/vim.md similarity index 100% rename from archive/Intermediate_shell/lessons/vim.md rename to Extra_shell_materials/vim.md diff --git a/Finding_and_summarizing_colossal_files/README.md b/Finding_and_summarizing_colossal_files/README.md new file mode 100644 index 00000000..e69de29b diff --git a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md new file mode 100644 index 00000000..ef974c35 --- /dev/null +++ b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md @@ -0,0 +1,103 @@ +# Setting up + +## Disclaimers + +**Disclaimer 1:** Before we start this Current Topics module, it is important to highlight that because we will be doing more advanced commands and options for those commands, some commands or their options might not work for you. We have tried our best to pick commands and options for those commands that are widely used, but since we are all on our own computers, each of us has a different implementation of certain commands and there may be options that your specific implementation doesn't have. We have run this code on the [O2 cluster](https://it.hms.harvard.edu/our-services/research-computing/services/high-performance-computing) and each of these commands works there. + +**Disclaimer 2:** The contents of this module contain many examples. Some of these examples will have use-cases that you may frequently use and others you may rarely, if ever, use. We will try to highlight commands and use-cases that we frequently use where applicable, but it is important to note that few people, if any, have memorized all of the contents of this module. Some of the examples are here to just to provide familiarity that a concept simply exists. Try not to get hung up on memorizing the syntax except for perhaps the most frequently used commands. This is a resource that we anticipate you will look back to when you come across a problem that you know can be solved with these materials, but you don't remember the specific syntax. + +## Learning Objective + +In this lesson, you will: +- Download data via the command-line interface using `curl` +- Uncompress a `.zip` file via the command-line interface + +## Getting the dataset + +Before starting the course we asked that Windows users have successfully, installed [GitBash](https://git-scm.com/download/win). Mac users can just use the built-in Terminal application. Go ahead and launch either GitBash or Terminal. 
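+Since we will be relying on `curl` in a moment, it is worth quickly confirming that it is available in your shell. This is just a sanity check; the exact version information reported will vary from system to system:
+
+```
+curl --version
+```
+
+If this prints version information rather than a "command not found" message, you are all set.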
+ +In order for this Current Topics module to be successful, we need to make sure that we are working from the same datasets. We have developed a handful of datasets for us to use as toy examples during this workshop. In previous workshops, we have used the point-and-click interface for downloading datasets. However, since this is the advanced workshop, we are going to do it a bit different because if you are working on a high-performance computing (HPC) cluster, a point-and-click option likely won't be availible to you. Let's go ahead and move to our home directory: + +``` +cd ~ +``` + +Now the dataset that we are going to use today can be found [here](https://github.com/hbctraining/Training-modules/raw/master/Advanced_shell/data/advanced_shell.zip). If you want, you *could* use the method we have previously taught where you can right-click on the link and select "Download Linked File As..."(Mac) or "Save Linked File As..."(Windows). However, imagine we are on a HPC and we don't have the folder to save it to on our computer, how would we do this? + +## curl + +There are several similar tools for doing the task of downloading linked files. Two common tools for this are `curl` and `wget`. For this application and most applications they will provide the same functionality. There are minor differences between `curl` and `wget` that are beyond the scope of this course and, in reality, most uses. We will be using `curl` here, but we will give an example of `wget` in a dropdown menu at the end of this section along with an additonal dropdown providing a brief overview of the differences between `curl` and `wget` for those that wish to dig a bit deeper. Okay, so let's start by downloading the file. We can right-click the hyperlink above or [here](https://github.com/hbctraining/Training-modules/raw/master/Advanced_shell/data/advanced_shell.zip) and select "Copy Link". Next, we will go to our command line and type: + +``` +curl -O -L [paste your link here] +``` + +> **NOTE:** The `-O` above is a capital letter O and ***NOT*** a zero! + +It should look like: + +``` +curl -O -L https://github.com/hbctraining/Training-modules/raw/master/Advanced_shell/data/advanced_shell.zip +``` + +Let's briefly talk about the options of this command: + +- `-O` This tells curl to keep the same name for the file as it appears on the host server. In this case, `advanced_shell.zip`. + +> **NOTE:** If for some reason you wanted to change the name to something different as you were downloading the file, then you would use the `-o` option instead of `-O` and provide the name you'd prefer to use instead. This is sometimes useful if the URL you are downloading from the internet has a parameter (has an ending of `?` followed by some text, something like `?raw=true` is particularly common when downloading files from GitHub. Of course, you can always just use the `mv` to rename a file after you have downloaded the file as well) + +- `-L https://github.com/hbctraining/Training-modules/raw/master/Advanced_shell/data/advanced_shell.zip` This is the linked file that you would like to download from the host server. + +Now if we look inside our directory, we should see an `advanced_shell.zip` file: + +``` +ls +``` + +Let's unpack this zip file and move inside of it by using: + +``` +unzip advanced_shell.zip +cd advanced_shell +``` + +Now you should be able to see the toy text files that we will be using in this module by using: + +``` +ls +``` + +
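+As a quick aside before we look at `wget`: if you had wanted to save the file under a different name as it downloaded (the lowercase `-o` option mentioned above), the command might have looked something like the sketch below, where the output filename is just an example:
+
+```
+curl -o my_shell_data.zip -L https://github.com/hbctraining/Training-modules/raw/master/Advanced_shell/data/advanced_shell.zip
+```
+
+Only the name of the saved file changes; the download itself is identical.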
+ Click here for an example of doing this in wget + The command below is an example `wget` command that accomplishes the same task we just did with `curl`:
+  wget  https://github.com/hbctraining/Training-modules/raw/master/Advanced_shell/data/advanced_shell.zip
+ This code should be pretty self-explanatory. You are calling the wget command and providing it with the link that you would like to download. +
+
+ +
+ Click here for a bit more depth into the difference between `curl` and `wget` + One advantage that `curl` has is that you can provide it with multiple files to download by providing multiple `-O` options like:
+  curl -L -O http://www.example.com/data_file_1.txt -O http://www.example.com/data_file_2.txt
+ But you can also just accomplish this task by running `curl` on each linked file separately. `wget` sort of has this ability as well, but it requires that you make a text file listing the linked files and use the `-i` option. Overall, this benefit feels pretty minor.
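+ For reference, a rough sketch of what that `wget` approach could look like (the `links.txt` file name and the URLs here are just placeholders):
+
+ ```
+ # a plain text file with one URL per line
+ echo "http://www.example.com/data_file_1.txt" > links.txt
+ echo "http://www.example.com/data_file_2.txt" >> links.txt
+
+ # the -i option tells wget to download every URL listed in that file
+ wget -i links.txt
+ ```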

+ wget has the nice perk of being able to recursively download a directory. What that means is that if a directory that you're downloading has subdirectories, it will download those subdirectories' contents as well. For this you would use:
+  wget -r http://www.example.com/data_directory/
+ The `-r`, or `--recursive`, option tells `wget` to recursively download the directory. At first glance, this would appear to be REALLY useful; however, most of the time a directory that you download from a link has already been compressed into a `.zip` file or another compressed format. In that case, you don't need a recursive download because you are downloading a single file, not a directory.

+ Also, some Macs don't have `wget` pre-installed, while all Macs come with `curl`. So, if you are on a Mac and want to use `wget`, you will need to install it first.
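+ One common way to install it, assuming you already use the [Homebrew](https://brew.sh/) package manager, is:
+
+ ```
+ brew install wget
+ ```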

+ In summary, either of these commands will do what you need them to do in the overwhelming majority of cases, so it is mostly personal preference as to which one you use. +
+
+ +Now that we have downloaded our toy datasets, we are ready to dive into learning more advanced bash! + +[Next Lesson >>](02_String_manipulation.md) + +[Back to Schedule](../README.md) + +*** + +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* + diff --git a/Finding_and_summarizing_colossal_files/lessons/02_String_manipulation.md b/Finding_and_summarizing_colossal_files/lessons/02_String_manipulation.md new file mode 100644 index 00000000..8a175385 --- /dev/null +++ b/Finding_and_summarizing_colossal_files/lessons/02_String_manipulation.md @@ -0,0 +1,539 @@ +# String Manipulation + +While the syntax differs, one feature that is common is most programming languages is the process of string manipulation. Before we can introduce string manipulation, we first need to introduce strings! + +## Learning Objectives + +In this lesson, you will: +- Describe a string +- Differentiate between 0-based and 1-based indexing +- Manipulate strings in `bash` + +## Strings + +A string is a term for any sequence of characters. Some examples of strings are: + +``` +Happy_birthday +this_module_is_a_blast.txt +/path/to/my/favorite/photo.jeg +``` + +Strings have whitespace (spaces or tabs) separating them from anything else. + +> **NOTE:** While generally discouraged, strings can also have spaces in them along with other special characters. Special characters are characters that have special meaning in a language. For example, `>` is a character used for redirection or `$` is a character used with variables. You can use them if you must by "escaping" them. Escaping a special character requires putting a `\` infront of the special character, which tell bash to interpret this next character literally, not as a special character. Naturally, `\` is also a special character. Because different software tools interpret special characters differently, it is generally advised just to stay away from them in strings unless it is necessary (which sometimes it is). Many of these special characters are symbols, so general, just be leary of using non-alphanumerical characters in your strings. + +## String manipulation + +### Indexing + +Before we can explore string manipulation, we need to have some background on indexing.There are two major forms of indexing: + +- 0-based indexing counts in between the characters and starts at 0 before the first character +- 1-based indexing counts each character and start at 1 at the first character + +
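+The figure that normally illustrates these two numbering schemes is not reproduced here, so below is a rough text sketch instead. It assumes the pictured word is `STRING`, which is consistent with the `R`-to-`N` arithmetic discussed next:
+
+```
+# 0-based indexing labels the boundaries between characters:
+#   0   1   2   3   4   5   6
+#   | S | T | R | I | N | G |
+# 1-based indexing labels the characters themselves:
+#     1   2   3   4   5   6
+```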

+ +

+ +One advantage of 0-based indexing is that figuring out distances a bit easier. If you want to know the distance from `R` to `N` in the example above, you just need to do to 5 - 2 and you get the length of that string is 3. In 1-based indexing, you need to add 1 after you do the substraction. So in the case of `R` to `N`, it would be 5 - 3 + 1 = 3. **Many of the built-in `bash` commands use 0-based indexing**, but other programs not in this module may run on 1-based indexing, so you should be aware of how strings are indexed when analyzing them. + +### Subsetting strings + +The first lesson in manipulating strings is simply subsetting a string. Here, we are trying to take our string and only extract a portion of that string. First let's set a string, like our name, equal to a variable, in this case `name`: + +``` +name=Will +``` + +As we've seen before, we could print this `name` variable like: + +``` +echo $name +# OR +echo ${name} +``` + +> **NOTE:** Generally speaking, it's not a bad idea to always start putting your bash variables in `{}`. It's not necessary in some cases, like if the bash variable is followed by a space or other specific characters like `.` or `/`. However, it can save you a headache when debugging and using if you use them when they aren't necessary, `bash` will still interpret the variable just fine. + +Now, if we want to subset the string saved to a variable. We need to need to use the following syntax: + +``` +# DON'T RUN +${variable_name:start:length} +``` + +

+ +

+ +In this case, our variable name is `name`, where we start at is the `start` position (0-based) and we continue for a given `length`. If we want the second and third letter of the variable `$name` it would look like: + +``` +echo ${name:1:2} +``` + +### Application + +The O2 cluster at Harvard has a special space reserved for each person's "scratch" work that is deleted after 30 days of not being used. The path to this space is: + +``` +# DON'T RUN +/n/scratch3/users/[users_first_letter]/[username]/ +``` + +You should also be aware that O2 like many clusters has a special built-in variable called `$USER` that holds a username (which we will assume is `will`). I could change directories to this scratch space by using: + +``` +# DON'T RUN +cd /n/scratch3/users/w/will/ +``` + +However, if I was developing code or materials for other people in my group or lab to use, then they would have to manually change each instance of it. However, you can use substrings and variables to help you here. Instead of writing out your user information you could instead write: + +``` +# DON'T RUN +cd /n/scratch3/users/${USER:0:1}/${USER}/ +``` + +Now this would universally apply to anyone using your code on O2! + +## Substring from a position to the end of the string + +There is a special case of the above example where you might want to trim a certain amount characters from the beginning of a string. The syntax for this would be: + +``` +# DON'T RUN +${variable_name:start} +``` + +

+ +

+ +If we want to trim the first two letters off of out `$name` variable then it would look like: + +``` +echo ${name:2} +``` + +## Substring counting from the end of a string + +You may have a situation where you want to remove the last characaters from a string, the syntax for this would look similiar: + +``` +# DON'T RUN +${variable:start:-length_from_the_end} +``` + +

+ +

+ +If you wanted to trim the last two letters off of the `$name` variable: + +``` +echo ${name:0:-2} +``` + +This would still start at zero and keep everything but the last two positions. + +You could trim the first and last letter like: + +``` +echo ${name:1:-1} +``` + +Here, you are telling `bash` to start in the first position and also take everything except the last position. + +## String Addition + +You can also add character to strings. The syntax for this is pretty straightforward: + +``` +# DON'T RUN +string_to_add_to_beginning${variable_name}string_to_add_to_end +``` + +`${variable_name}` is the string assigned to `${variable_name}` and `string_to_add_to_beginning` and `string_to_add_to_end` are strings you want to add to the beginning and/or end, respecitively. + +

+ +

+ +For example, we can add onto the end of the `$name` variable we designated to make it into a legal name: + +``` +real_name=${name}iam +echo ${real_name} +``` + +### Bioinformatics Application + +You can see this could be very useful if you had a path saved to a variable and you wanted to use that path variable to create paths to files within that directory. For example: + +``` +alignment_directory=/my/alignment/files/are/here/ +SAM_alignment=${alignment_directory}file.sam +BAM_alignment=${alignment_directory}file.bam +``` + +So now if you look at `$SAM_alignment`: + +``` +echo ${SAM_alignment} +``` + +It will return: + +``` +/my/alignment/files/are/here/file.sam +``` + +Or the `$BAM_alignment`: + +``` +echo ${BAM_alignment} +``` + +It will return: + +``` +/my/alignment/files/are/here/file.bam +``` + +If you have a script where you use a path multiple times, this can be really helpful for minimizing typos and make it easier to repurpose the script for different uses. + +## Substring Removal + +Let's imagine a case where we wanted to remove some part of a string and let's start by defining a string named `slingshot`: + +``` +slingshot=slinging_slyly +``` + +### Remove the shortest match from the end + +The first thing we might want to do is remove a substring from the end of a string. The syntax for removing the shortest substring from the end of a string is: + +``` +# DON'T RUN +echo ${variable_name%substring_to_remove} +``` + +

+ +

+ +In the case below, we want to remove `ly` from the end of our `$slingshot` string: + +``` +echo ${slingshot%ly} +``` + +This will return: + +``` +slinging_sly +``` + +This example is a bit simple because our example ended with `ly`, so instead let's remove `ing` and anything after it from the end of our `$slingshot` string: + +``` +echo ${slingshot%ing*} +``` + +Notice the addition of the wildcard `*` character. This allows us to remove `ing` **and** anything after the shortest match of `ing` from the end of the string. + +#### Bioinformatics Application + +Removing the end of string is very common in bioinformatics when you want to remove the extension from a file name. Consider the case where you have a variable named, `file`, that is set equal to `/path/to/myfile.txt` and you want to remove the `.txt` extension: + +``` +file=/path/to/myfile.txt +echo ${file%.txt} +``` + +This will return: + +``` +/path/to/myfile +``` + +This can very really nice when compared to the `basename` function, which can also a strip file extension. However, `basename` also strips path information. You may have a case where you have a full path and filename, but you don't want to strip the path information, but rather just the extension. + +### Remove the longest match from the end + +We have discussed removing the shortest match from the end of a string, but we can also remove the longest match from the end of a string and the syntax for this is: + +``` +# DON'T RUN +echo ${variable_name%%substring_to_remove} +``` + +

+ +

+ +In order to differentiate the longest match from the end and the shortest match from the end, we will need to utilize the `*` wildcard. Let's remind ourselves of what the shortest match from the end would look like when using a `*`: + +``` +echo ${slingshot%ly*} +``` + +This returns: + +``` +slinging_sly +``` + +Now, let's change the `%` to `%%`: + +``` +echo ${slingshot%%ly*} +``` + +However, this returns: + +``` +slinging_s +``` + +> NOTE: It is important to note that without the use of a `*` wildcard, `echo ${slingshot%ly}` and `echo ${slingshot%%ly}` will both return `slinging_sly` + +### Remove the shortest match from the beginning + +Instead of removing matches from the end of the string we can also remove matches from the beginning of the string by using `#` instead of `%`. Excitingly, like the shebang line, this is one of the few times that `#` doesn't function as a comment in `bash`. The syntax for remove the shortest match from the beginning of a string is: + +``` +# DON'T RUN +${variable_name#substring_to_remove} +``` + +

+ +

+ +If we want to remove `sl` from the beginning of our `$slingshot` variable string, then we could use: + +``` +echo ${slingshot#sl} +``` + +This would return: + +``` +inging_slyly +``` + +Like removing matches from the end, this example isn't as interesting without the use of wildcards. Perhaps instead, we want to remove anything to and including the first match of `ing` from the beginning. We could do that like: + +``` +echo ${slingshot#*ing} +``` + +This would return: + +``` +ing_slyly +``` + +### Remove the longest match from the beginning + +We can also remove the longest match from the beginning using the following syntax: + +``` +# DON'T RUN +${variable_name##substring_to_remove} +``` + +

+ +

+ +Let's remove the longest match that contains `ing` from the beginning: + +``` +echo ${slingshot##*ing} +``` + +This would return: + +``` +_slyly +``` + +> **NOTE:** Similiarly to removing strings from the end, there isn't any difference between using `#` and `##` when removing strings from the beginning if you don't use the `*` wildcard. + +#### Bioinformatics Application-*ish* + +You could use this to strip path information. For example: + +``` +path=/my/path/to/file.txt +echo ${path##*/} +``` + +However, the `basename` function provides this exact function, so either way is synonymous. However, using `basename` might be a bit more readable. + +### Substring Removal Overview + +The table below is a summary of substring removal: + +| Shortcut | Effect | +|------|------| +| % | Remove shortest match from the end of the string | +| %% | Remove longest match from the end of the string| +| # | Remove the shortest match from the beginning of the string| +| ## | Remove the longest match from the beginning of the string| + +### Miscellanous + +#### Length of string + +The length of a string can be determined by using the following syntax: + +``` +# DON'T RUN +${#variable_name} +``` + +Once again, this is another interesting case where `#` is not used as a comment and actually has a function in `bash`. In this case, we could see the length of the `$slingshot` variable by using: + +``` +echo ${#slingshot} +``` + +Which will return a length of: + +``` +14 +``` + +#### Case changing + +**NOTE: The ability to change cases is only availible on versions of `bash` that are version 4.0+!** + +> If your version of **`bash` is too old** to change cases, the **error message** will look like: +> +> ``` +> -bash: ${variable_name^^}: bad substitution +> ``` + + +##### All Uppercase + +If you want all uppercase letters you can do: + +``` +# DON'T RUN +${variable_name^^} +``` + +For example, if we wanted `$slingshot` to be all uppercase letters, we can do: + +``` +echo ${slingshot^^} +``` + +And it would return: + +``` +SLINGING_SLYLY +``` + +##### Leading Uppercase + +If you want the leading character to be uppercase, then we can use this syntax: + +``` +# DON'T RUN +echo ${variable_name^} +``` + +If we do this to `$slingshot`, it would look like: + +``` +echo ${slingshot^} +``` + +And it would return: + +``` +Slinging_slyly +``` + +##### All lowercase + +We can also make a string entirely lowercase. Let's consider the following string: + +``` +dog=FIDO +``` + +We could force all of the letters to be lowercase using the following syntax: + +``` +# DON'T RUN +${variable_name,,} +``` + +We can apply this syntax to our `$dog` variable: + +``` +echo ${dog,,} +``` + +The output would look like: + +``` +fido +``` + +##### Leading lowercase + +We can also just make the leading character lowercase with the following syntax: + +``` +# DON'T RUN +${variable_name,} +``` + +We can apply this syntax to our `$dog` variable: + +``` +echo ${dog,} +``` + +The output would look like: + +``` +fIDO +``` + +## Exercises + +For these exercises, use the following file path: + +``` +filepath=/path/to/my/file.sam +``` + +**1)** Strip the file extension from this variable. + + +**2)** Strip the file extension from this variable and assign a new extension of `.bam` + + +**3)** Strip the file extension from this variable, assign a new extension of `.bam` and assign it to a new variable called `bam_filename`. 
Then print this new `bash` variable + + +*** + +[Next Lesson >>](03_Regular_expressions.md) + +[Back to Schedule](../README.md) + +*** + +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* diff --git a/Finding_and_summarizing_colossal_files/lessons/03_Regular_expressions.md b/Finding_and_summarizing_colossal_files/lessons/03_Regular_expressions.md new file mode 100644 index 00000000..31d87f06 --- /dev/null +++ b/Finding_and_summarizing_colossal_files/lessons/03_Regular_expressions.md @@ -0,0 +1,489 @@ +# Regular Expressions + +Regular expressions (sometimes referred to as regex) are a string of characters that can be used as a pattern to match against. This can be very helpful when searching through a file, particularly in conjunction with `sed`, `grep` or `awk`. Since we have an understanding of `grep` from previous workshops we are going to use `grep` and the `catch.txt` file that we downloaded to demonstrate the principles of regular expressions. As we said though, many of these principles are also useful in `sed` and `awk`. + +## Learning Objectives + +In this lesson, we will: +- Differentiate between single and double quotes in `bash` +- Implement regular expressions to within `grep` + +## Getting Started + +Before we get started, let's take a briefly look at the `catch.txt` file in a `less` buffer in order to get an idea of what the file looks like: + +``` +less catch.txt +``` + +In here, you can see that we have a variety of case differences and misspellings. These differences are not exhaustive, but they will be helpful in exploring how regular expressions are implemented in `grep`. + +## A bit more depth on grep + +There are two principles that we should discuss more, the `-E` option and the use of quotation marks. + +### The `-E` option + +There is a `-E` option when using `grep` that allows the user to use what is considered "extended regular expressons". We won't use too many of these types of regular expressions and we will point them out when we need them. If you want to make it a habit to always use the `-E` option when using regular expressions in `grep` it is a bit more safe. + +### Quotations + +When using grep it is usually not required to put your search term in quotes. However, if you would like to use `grep` to do certain types of searches, it is better or *safer* to wrap your search term in quotations, and likely double quotations. Let's briefly discuss the differences: + +#### No quotation + +If you are using `grep` to search and have whitespace (space or tabs) in your search, `grep` will treat the expression before the whitespace as the search term and the expression after the whitespace(s) as a file(s). As a result, if your search term doesn't have whitespace it doesn't matter if you put quotations, but if it does, then it won't behave the way you'd like it to behave. + +#### Single quotations + +So `grep` doesn't ever "see" quotation marks, but rather quotation marks are interpreted by `bash` first and then the result is passed to `grep`. 
The big advantage of using quotation marks, single or double, when using `grep` is that it allows you to use search expressions with whitespace in them. However, within bash, single-quotation marks (`'`) are intepreted *literally*, meaning that the expression within the quotation marks will be interpreted by `bash` *EXACTLY* the way it is written. Notably, `bash` variables within single-quotations are **NOT** expanded. What we mean by this is that if you were to have a variable named `at` that holds `AT`: + +``` +at=AT +``` + +If you used grep while using single quotes like: + +``` +grep 'C${at}CH' catch.txt +``` + +It would only return: + +``` +C${at}CH +``` + +This is because it searches for the term without expanding (replacing the `bash` variable with what it stands for) the `${at}` variable. + +#### Double Quotations + +Double quotations are typically the most useful because they allow the user to search for whitespace **AND** allows for `bash` to expand variables, so that now: + +``` +grep "C${at}CH" catch.txt +``` + +Returns: + +``` +CATCH +``` + +Additionally, if you would like to be able to literally search something that looked like a `bash` variable, you can do this just by adding a `\` before the `${variable}` to "escape" it from `bash` expansion. For example: + +``` +grep "C\${at}CH" catch.txt +``` + +Will return: + +``` +C${at}CH +``` + +### grep Depth Take-Home + +In conclusion, while these are all mostly edge cases, we believe that it is generally a good habit to wrap the expressions that you use for `grep` in double quotations and also use the `-E` option. This practice will not matter for the overwhelming number of cases, but it is sometimes difficult to remember these edge cases and thus it is mofe safe to just build them into a habit. Of course, your preferences may vary. + +## Ranges + +Now that we have gotten some background information out of the way, let's start implementing some regular expressions into our `grep` commands. + +A **range** of acceptable characters can be given to `grep` with `[]`. Square brackets can be used to notate a range of acceptable characters in a position. For example: + +``` +grep -E "[BPL]ATCH" catch.txt +``` + +Will return: + +``` +PATCH +BATCH +``` + +It would have also returned `LATCH` had it been in the file, but it wasn't. + +You can also use `-` to denote a range of characters like: + +``` +grep -E "[A-Z]ATCH" catch.txt +``` + +Which will return every match that has an uppercase A through Z in it followed by "ATCH": + +``` +PATCH +BATCH +CATCH +CAATCH +CAAATCH +CAAAATCH +``` + +You can also merge different ranges together by putting them right after each other or separating them by a `|` (in this case `|` stands for "or" and is **not a pipe**): + +``` +grep -E "[A-Za-z]ATCH" catch.txt +# OR +grep -E "[A-Z|a-z]ATCH" catch.txt +``` + +This will return: + +``` +PATCH +BATCH +CATCH +pATCH +bATCH +cATCH +CAATCH +CAAATCH +CAAAATCH +``` + +In fact, regular expression ranges generally follow the [ASCII alphabet](https://en.wikipedia.org/wiki/ASCII), (but your local character encoding may vary) so: + +``` +grep -E "[0-z]ATCH" catch.txt +``` + +Will return: + +``` +PATCH +BATCH +CATCH +pATCH +bATCH +cATCH +2ATCH +:ATCH +^ATCH +CAATCH +CAAATCH +CAAAATCH +``` + +However, it is important to also note that the ASCII alphabet has a few characters between numbers and uppercase letters such as `:` and `>`, so you would also match `:ATCH` and `>ATCH` (if it was in the file), repectively. 
There are also a few symbols between upper and lowercase letters such as `^` and `]`, which match `^ATCH` and `]ATCH` (if it was in the file), respectively. + +Thus, if you would only want to search for numbers, uppercase letters and lowercase letters, but NOT these characters in between, you would need to modify the range: + +``` +grep -E "[0-9A-Za-z]ATCH" catch.txt +``` + +Which will return: + +``` +PATCH +BATCH +CATCH +pATCH +bATCH +cATCH +2ATCH +CAATCH +CAAATCH +CAAAATCH +``` + +You can also note that since these characters follow the ASCII character encoding order, `[Z-A]` will give you an error telling you that it is an invalid range because `Z` comes after `A`, thus you can't search from `Z` forward to `A`. + +``` +# THIS WILL PRODUCE AN ERROR +grep -E "[Z-A]ATCH" catch.txt +``` + +Another trick with ranges is the use of `^` ***within*** `[]` functions as a "not" function. For example: + + +``` +grep -E "[^C]ATCH" catch.txt +``` + +Will return: + +``` +PATCH +BATCH +pATCH +bATCH +cATCH +2ATCH +:ATCH +^ATCH +CAATCH +CAAATCH +CAAAATCH +``` + +This will match anything ending in `ATCH` ***except*** a string containing `CATCH`. + +***IMPORTANT NOTE: `^` has a different function when used outside of the `[]` that is discussed below in anchoring.*** + + +## Special Characters + +### . + +The `.` matches any character except new line. Notably, it also ***does not*** match no character. This is similar to the behavior of the wildcard `?` in `bash`. For example: + +``` +grep -E ".ATCH" catch.txt +``` + +Will return: + +``` +PATCH +BATCH +CATCH +pATCH +bATCH +cATCH +2ATCH +:ATCH +^ATCH +CAATCH +CAAATCH +CAAAATCH +``` + +But this result **will not** include `ATCH`. + +### Quantifiers + +#### * + +The `*` matches the preceeding character any number of times ***including*** zero times. For example: + +``` +grep -E "CA*TCH" catch.txt +``` + +Will return: + +``` +CATCH +CTCH +CAATCH +CAAATCH +CAAAATCH +``` + +#### ? + +The `?` denotes that the previous character is optional, in the following example: + +``` +grep -E "CA?TCH" catch.txt +``` + +Will return: + +``` +CATCH +CTCH +``` + +Since the "A" is optional, it will only match `CATCH` or `CTCH`, but not anything else, including `COTCH` which was also in our file. + +#### {} + +The `{INTEGER}` matches the preceeding character the number of times equal to INTEGER. For example: + +``` +grep -E "CA{3}TCH" catch.txt +``` + +Will return only: + +``` +CAAATCH +``` + +> NOTE: This is one of the cases that needs the `-E` option, otherwise it won't return anything. Alternatively, you can also escape the curly brackets and then you don't need the `-E` option. +> ``` +> grep "CA\{3\}TCH" catch.txt +> ``` + +#### + + +The `+` matches one or more occurrances of the preceeding character. For example: + +``` +grep -E "CA+TCH" catch.txt +``` + +Will return: + +``` +CATCH +CAATCH +CAAATCH +CAAAATCH +``` + +### Anchors + +Anchors are really useful tools in regular expressions because they specify if a pattern has to be found at the beginning or end of a line. + +#### ^ + +The `^` character anchors the search criteria to the beginning of the line. For example: + +``` +grep -E "^CAT" catch.txt +``` + +Will return: + +``` +CATCH +CAT +``` + +Importantly, it won't return `BOBCAT`, which is also in the file, because that line doesn't start with `CAT`. + +***REMINDER: `^` within `[]` functions acts as "not"!*** + +#### $ + +The `$` character anchors the search criteria to the end of the line. 
For example: + +``` +grep -E "CAT$" catch.txt +``` + +Will return: + +``` +CAT +BOBCAT +``` + +This won't match `CATCH` because the line doesn't end with `CAT`. + + +## Literal matches + +One problem you will likely run into with these above special characters is that you may want to match one. For example, you may want to match `.` or `?` and this is what the escape, `\`, is for. For example: + +``` +grep -E "C\?TCH" catch.txt +``` + +Will return: + +``` +C?TCH +``` + +It will not return `CATCH` or `COTCH` or others like `C?TCH` would do. + + +## Whitespace and new lines + +You can search for a tab with `\t`, a space with `\s` and a newline with `\n`. For example: + +``` +grep -E "CA\tTCH" catch.txt +``` + +Will return: + +``` +CA TCH +``` + +## Examples of Combining Special Characters + +Much of the power from regular expression comes from how you can combine them to match the pattern you want. Below are a few examples of such: + +**1)** If you want to find any line that starts with uppercase letters `A-G`, then you could do: + +``` +grep -E "^[A-G]" catch.txt +``` + +Which will return: + +``` +BATCH +CATCH +CTCH +CAATCH +CAAATCH +CAAAATCH +ATCH +CAT +BOBCAT +C?TCH +CA TCH +C${at}CH +COTCH +``` + +**2)** Perhaps you want to see find all lines ending with `CA` followed by any character except `T`, then you could do: + +``` +grep -E "CA[^T]$" catch.txt +``` + +This will return: + +``` +TAXICAB +TINCAN +``` + +**3)** We could be interersted in finding lines that start with `C` and end with `CH` with anything, including nothing, in between. + +``` +grep -E "^C.*CH$" catch.txt +``` + +This will return: + +``` +CATCH +CTCH +CAATCH +CAAATCH +CAAAATCH +C?TCH +CA TCH +C${at}CH +COTCH +``` + +*** + +## Exercises + +**1)** Use `grep` to find all matches in `catch.txt` that start with "B" and have a "T" anywhere in the string after the "B". + + +**2)** Use `grep` to find all matches in `catch.txt` that don't start with "C" and don't end with "H" + + +**3)** Use `grep` to find all matches in `catch.txt` that have atleast two "A"s in them + + +## Additional Resources + +- https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/extra_bash_tools.html#regular-expressions-regex-in-bash- + +*** + +[Next Lesson >>](04_sed.md) + +[Back to Schedule](../README.md) + +*** + +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* + diff --git a/Finding_and_summarizing_colossal_files/lessons/04_sed.md b/Finding_and_summarizing_colossal_files/lessons/04_sed.md new file mode 100644 index 00000000..4a281d7a --- /dev/null +++ b/Finding_and_summarizing_colossal_files/lessons/04_sed.md @@ -0,0 +1,382 @@ +# sed + +The ***s***tream ***ed***itor, `sed`, is a common tool used for text manipulation. `sed` takes input from either a file or piped from a previous command and applies a transformation to it before outputting it to standard output. 
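+In practice, those two ways of feeding `sed` look like the sketch below. This is schematic only: `file.txt`, `pattern`, and `replacement` are placeholders, and the substitution syntax itself is explained later in this lesson:
+
+```
+# DON'T RUN -- file.txt, pattern and replacement are placeholders
+sed 's/pattern/replacement/' file.txt        # input read from a file
+cat file.txt | sed 's/pattern/replacement/'  # input piped from a previous command
+```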
+ +## Learning Objectives + +In this lesson, we will: + +- Substitute characters using `sed` +- Utilize addresses in our `sed` commands +- Delete lines from output using `sed` + +## Quick Note on Quotation Marks + +Many of the same arguments that were made for using single vs. double quotation marks in `grep` also apply to `sed`. However, the `$` has some non-variable functionality in `sed` that we will discuss, particularly with reference to addresses. For this reason, it's more common to see `sed` commands wrapped in single-quotes rather than double-quotes. Of course, if you want to use a `bash` variable in `sed` you are going to need to wrap it in double-quotes, but if you do then you need be cautious of any non-variable usage of `$` and be sure to escape it (`\`). + +## substitution + +Before we get started let's take a brief look at the main dataset that will be using for this lesson, the `ecosystems.txt` file: + +``` +less ecosystems.txt +``` + +One common usage for `sed` is to replace `pattern` with `replacement`. The syntax for doing this is: + +``` +# DON'T RUN +sed 's/pattern/replacement/flag' file.txt +``` + +A few things to note here: + +1) The `s` in `'s/pattern/replacement/flag'` is directing `sed` to do a ***s***ubstitution. + +2) The `flag` in `'s/pattern/replacement/flag'` is directing `sed` that you want this action to be carried out in a specific manner. It is very common to use the flag `g` here which will carry out the action ***g***lobally, or each time it matches the `pattern` in a line. If `g`, is not included it wil just replace the `pattern` the first time it is observed per line. If you would like to replace a particular occurance like the third time it is observed in a line, then you would use `3`. + +Let's test this out on our `ecosystems.txt` sample file and see that output. First, we are interested in replacing `jungle` with `rainforest` throughout the file: + +``` +sed 's/jungle/rainforest/g' ecosystems.txt +``` + +Notice how all instances of `jungle` have been replaced with `rainforest`. However, if we don't include the global option: + +``` +sed 's/jungle/rainforest/' ecosystems.txt +``` + +We will only recover the first instance of `jungle` was replaced with `rainforest`. If we want to replace only the second occurance of `jungle` with `rainforest` on a line, modify the occurance flag to be `2`: + +``` +sed 's/jungle/rainforest/2' ecosystems.txt +``` + +**It is important to note that the pattern-matching in `sed` is case-sensitive.** To make your pattern searches case-insensitive, you will need to add at the `I` flag: + +``` +sed 's/jungle/rainforest/Ig' ecosystems.txt +``` + +This will now replace all instances of `Jungle`/`jungle`/`JuNgLe`/`jUngle`/etc. with `rainforest`. + +You can also replace all instances of a match starting at *n*-th match and continuing for the rest of the line. For instance if we want the second match and all subsequent matches of `jungle` to be replaced with `rainforest`, then we could use: + +``` +sed 's/jungle/rainforest/2g' ecosystems.txt +``` + +> NOTE: Depending on your implementation of `sed`, this command may give the error that too many flags have been provided. + +### -n option + +In `sed` the `-n` option will create no standard output. However, you can pair it with the flag `p` and this will print out only lines that were were edited. 
For example: + +``` +sed -n 's/desert/Sahara/pg' ecosystems.txt +``` + +The `-n` option has another useful purpose, you can use it to find the line number of a matched pattern by using `=` after the pattern you are searching for: + +``` +sed -n '/jungle/ =' ecosystems.txt +``` + +## Addresses + +### Single lines + +One can also direct which line, the ***address***, `sed` should make an edit on by adding the line number in front of `s`. This is most common when one wants to make a substituion for a pattern in a header line and we are concerned that the pattern might be elsewhere in the file. For example, we can compare the following commands: + +``` +sed 's/an/replacement/g' ecosystems.txt +sed '1 s/an/replacement/g' ecosystems.txt +``` + +* In the **first command**, `sed 's/an/replacement/g' animals.txt` we have replaced all instances of `an` with `replacement`. + * `animal` changed to `replacementimal` + * `toucan` also changed to `toucreplacement` + * `anaconda` changed to `replacementaconda` +* In the **second command**, `sed '1 s/an/replacement/g' ecosystems.txt`, we have only replaced instances on line 1. + +If you only want to replace an occurence in the final line of a file you can use `$` like: + +``` +sed '$ s/jag/replacement/g' ecosystems.txt +``` + +### Intervals + +If we only want to have this substitution carried out on the first five lines, then we would need to do include this interval (`1,5`, this is giving an address interval, from line 1 to line 5): + +``` +sed '1,5 s/an/replacement/g' ecosystems.txt +``` + +You can also replace the second address with a `$` to indicate until end of the file like: + +``` +sed '5,$ s/an/replacement/g' ecosystems.txt +``` + +This will carry out the substitution from the fifth line until the end of the file. + +You can also use regular expressions in the address field. For example, if you only wanted the substitution happening between your first occurence of `camel` and your first occurrance of `cichlid`, you could do: + +``` +sed '/camel/,/cichlid/ s/an/replacement/g' ecosystems.txt +``` + +Additionally, if you want a replacement every occurance except a given line, such as all of you data fields but not on the header line, then one could use `!` which tells sed "not", like: + +``` +sed '1! s/an/replacement/g' ecosystems.txt +``` + +You can even couple `!` with the regular expression intervals to do the substitution everywhere outside the interval: + +``` +sed '/camel/,/cichlid/! s/an/replacement/g' ecosystems.txt +``` + +Lastly, you can use `N~n` in the address to indicate that you want to apply the substitution every *n*-th line starting on line *N*. In the below example, starting on the first line and every 2nd line, the substitution will occur + +``` +sed '1~2 s/an/replacement/g' ecosystems.txt +``` + +## Deletion + +You can delete entire lines in `sed`. To delete lines we will need to provide the address followed by `d`. To delete the first line from a file: + +``` +sed '1d' ecosystems.txt +``` + +Like substitutions, you can provide an interval and this will delete line 1 to line 3: + +``` +sed '1,3d' ecosystems.txt +``` + +Also like substitution, you can use `!` to specify lines not to delete like: + +``` +sed '1,3!d' ecosystems.txt +``` + +Additionally, you can also use regular expressions to provide the addresses to define an interval to delete from. 
In this case we are interested in deleting from the first instance of `cichlid` until the end of the file: + +``` +sed '/cichlid/,$d' ecosystems.txt +``` + +The `N~n` syntax also works in deletion. If we want to delete every third line starting on line 2, we can do: + +``` +sed '2~3d' ecosystems.txt +``` + +## Appending + +### Appending, Inserting and Changing Text + +You can append a new line with the word `starfish` after the 2nd line using the `a` command in `sed`: + +``` +sed '2 a starfish' ecosystems.txt +``` + +You can also append over an interval, like from the 2nd to 4th line: + +``` +sed '2,4 a starfish' ecosystems.txt +``` + +Additionally, you can append the text every 3rd line begining with the second line: + +``` +sed '2~3 a starfish' ecosystems.txt +``` + +You can also append after a matched pattern: + +``` +sed '/monkey/ a starfish' ecosystems.txt +``` + +If you want the ***i***nsert text to come before the address, you need to use the `i` command: + +``` +sed '2 i starfish' ecosystems.txt +``` + +### Appending a file + +We could be interested in inserting the contents of **file B** inside at a certain point of **file A**. Before we append a file, let's briefly inspect the file we will try to append: + +``` +less more_ecosystems.txt +``` + +For example, if you wanted to insert the contents `more_ecosystems.txt` after line `4` in `ecosystems.txt`, you could do: + +``` +sed '4 r more_ecosystems.txt' ecosystems.txt +``` + +The `r` argument is telling `sed` to ***r***ead in `more_ecosystems.txt`. + +Instead of line `4`, you can append the file between every line in the interval from line 2 to line 4 with: + +``` +sed '2,4 r more_ecosystems.txt' ecosystems.txt +``` + +You could also append the line after each line by using `1~1` syntax: + +``` +sed '1~1 r more_ecosystems.txt' ecosystems.txt +``` + +Instead of inserting on a line specific line, you can also insert on a pattern: + +``` +sed '/camel/ r more_ecosystems.txt' ecosystems.txt +``` + +Lastly, you could also insert a file to the end using `$`: + +``` +sed '$ r more_ecosystems.txt' ecosystems.txt +``` + +But this is the same result as simply concatenating two files together like: + +``` +cat ecosystems.txt more_ecosystems.txt +``` + +### Changing Lines + +You can also ***c***hange entire lines in `sed` using the `c` command. We could replace the first line with the word 'header' by: + +``` +sed '1 c header' ecosystems.txt +``` + +This can also be utilized in conjunction with the `A,B` interval syntax, but we should be aware that it will replace ALL lines in that interval with a SINGLE line. + +``` +sed '1,3 c header' ecosystems.txt +``` + +You can also replace every *n*-th line starting at *N*-th line using the `N~n` address syntax: + +``` +sed '1~3 c header' ecosystems.txt +``` + +Lastly, you can also replace lines match a pattern: + +``` +sed '/jaguar/ c header' ecosystems.txt +``` + +## Transform + +`sed` has a feature that allows you to transform characters similiarly to the `tr` function in `bash`. If you wanted to transform all of the lowercase a, b and c characters to their uppercase equivalents you could do that with the `y` command: + +``` +sed 'y/abc/ABC/' ecosystems.txt +``` + +In this case the first letter 'a' is replaced with 'A', 'b' with 'B' and 'c' with 'C'. + +## Multiple expressions + +### `-e` option + +If you would like to carry out multiple `sed` expressions in the same command you can use the `-e` option and after each `-e` option you can provide the expression you would like `sed` to evaluate. 
For example, one could change `jungle` to `rainforest` and `lake` to `freshwater`: + +``` +sed -e 's/jungle/rainforest/g' -e 's/lake/freshwater/g' ecosystems.txt +``` + +One can also combine different type of expressions. For instance, one could change `jungle` to `rainforest` using a substitution expression and then use a deletion expression to remove the header line: + +``` +sed -e 's/jungle/rainforest/g' -e '1d' ecosystems.txt +``` + +If you want to use different flags to mark the occurence of a substitution, you will need to use the `-e` option: + +``` +sed -e 's/jungle/rainforest/3' -e 's/jungle/rainforest/1' ecosystems.txt +``` + +> **NOTE:** The occurences flag needs to go in decreasing order from the end of the line to the beginning of the line. Notice how `-e 's/jungle/rainforest/3'` comes before `-e 's/jungle/rainforest/1'`. + +### `-f` option + +If you have a large number of `sed` expressions you can also place them in a text file, like the `sed_expressions.txt` file, where each expression is on a separate line: + +``` +s/jungle/rainforest/g +s/lake/freshwater/g +1d +``` + +Then we can use the `-f` option to provide this file of `sed` expressions by using: + +``` +sed -f sed_expressions.txt ecosystems.txt +``` + +## Exercise + +Within your directory, there is a FASTQ file called, `Mov10_oe_1.subset.fq`. We would like to create a file of `sed` commands to convert this FASTQ file into a FASTA file. In order to do this, we need to briefly outline the difference between a FASTQ and FASTA file. + +**FASTQ files** + +There are four lines in a FASTQ file per entry that correspond to: + +- Line 1: The header line that starts with `@` +- Line 2: The sequence line +- Line 3: Usually just holds a `+` +- Line 4: Base scores corresponding to the bases in Line 2 + +**FASTA files** + +There are only two lines in a FASTA file per entry that correspond to: + +- Line 1: The header line that starts with `>` +- Line 2: The sequence line + +Let's do this task in a few parts: + +**1)** Create a new file in `vim` called `fastq_to_fasta.txt` to put our `sed` commands within + +**2)** Make the first `sed` command within this file be the one that implements a `>` at the *start* of the first line of each entry. *Hint: A regex tool could be helpful for this task* + +**3)** Make the next two `sed` commands within this file delete the third and fourth lines of each entry + +**4)** Run this file of `sed` commands on our FASTQ file and redirect the output to a new file called `Mov10_oe_1.subset.fa` + +## Additional Resources + +- https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/extra_bash_tools.html##sed + +- https://www.grymoire.com/Unix/Sed.html + +*** + +[Next Lesson >>](05_awk.md) + +[Back to Schedule](../README.md) + +*** + +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). 
These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* + diff --git a/Finding_and_summarizing_colossal_files/lessons/05_awk.md b/Finding_and_summarizing_colossal_files/lessons/05_awk.md new file mode 100644 index 00000000..4c931574 --- /dev/null +++ b/Finding_and_summarizing_colossal_files/lessons/05_awk.md @@ -0,0 +1,283 @@ +# awk + +`awk` is a very powerful programming language in its own right and it can do a lot more than is outlined here. `awk` shares a common history with `sed` and even `grep` dating back to `ed`. As a result, some of the syntax and functionality can be a bit familiar at times. However, it is particularly useful when working with datatables in plain text format (tab-delimited files and comma-separated files). Before we dive too deeply into `awk` we need to define two terms that `awk` will use a lot: + +- ***Field*** - This is a column of data +- ***Record*** - This is a row of data + +### Printing columns + +Let's first look at one of it's most basic and useful functions, printing columns. For this example we are going to use the tab-delimited file `ecosystems.txt` that we used in the `sed` examples. + +Let's first try to just print the first column: + +``` +awk '{print $1}' ecosystems.txt +``` + +Here the `print` function in `awk` is telling `awk` that it it should output the first column of each line. We can also choose to print out multiple columns in any order. + +``` +awk '{print $3,$1,$5}' ecosystems.txt +``` + +The default output is to have the column separated by a space. However, built-in variables can be modified using the `-v` option. Once you have called the `-v` option you need to tell `awk` which built-in variable you are interested in modfying. In this case, it is the ***O***utput ***F***ield ***S***eparator, or `OFS`, and you need to set it to what you would like it to be equal to; a `'\t'` for tab, a `,` for a comma or even an `f` for a lowercase `f`. + +``` +awk -v OFS='\t' '{print $3,$1,$5}' ecosystems.txt +``` + +#### $0 + +There is a special variable `$0` that corresponds to the whole record. This is very useful when appending a new field to the front or end of a record, such as. + +``` +awk '{print $1,$0}' ecosystems.txt +``` + +#### RS and ORS + +As an aside, similarly to `OFS`, records are assumed to be read in and written out with a newline character as default. However, this behavior can be altered with `RS` and `ORS` variables. + +`RS` can be used to alter the input ***r***ecord ***s***eparator +`ORS` can be used to alter the ***o***utput ***r***ecord ***s***eparator + +If we wanted to change the `ORS` to be a `;` we could do so with: + +``` +awk -v OFS='\t' -v ORS=';' '{print $3,$1,$5}' ecosystems.txt +``` + +#### -F + +The default behavior of `awk` is to split the data into columns based on whitespace (tabs or spaces). However, if you have a comma-separated file, then your fields are separated by commas and not whitespace. 
+ +#### -F + +The default behavior of `awk` is to split the data into columns based on whitespace (tabs or spaces). However, if you have a comma-separated file, then your fields are separated by commas and not whitespace. If we run `awk` on a comma-separated file and call for the first column with the default field separator, it will print the entire line: + +``` +awk '{print $1}' ecosystems.csv +``` + +However, once we denote that the field separator is a comma, it will extract only the first column: + +``` +awk -F ',' '{print $1}' ecosystems.csv +``` + +As an alternative to using `-F`, `FS` is the built-in variable for the ***f***ield ***s***eparator and can be altered with the `-v` option as well: + +``` +awk -v FS=',' '{print $1}' ecosystems.csv +``` + +### Skipping Records + +Similarly to `sed`, you can also exclude records from your analysis in `awk`. `NR` is a built-in variable that holds the ***N***umber of the ***R***ecord (row) currently being processed, so by the last line it equals the total number of records in your file. `NF` also exists and holds the ***N***umber of ***F***ields (columns) in the current record. You can define the range that you want your `print` command to work over by specifying a condition on `NR` prior to your `{}`. For example, if we wanted to remove the header, then we could do something like: + +``` +awk 'NR>1 {print $3,$1,$5}' ecosystems.txt +``` + +You can also set a range for records you'd like `awk` to print out by separating your range requirements with a `&&`, meaning "and": + +``` +awk 'NR>1 && NR<=3 {print $3,$1,$5}' ecosystems.txt +``` + +This command will print the third, first and fifth fields of `ecosystems.txt` for records greater than record one and less than or equal to record three. + +### BEGIN + +The `BEGIN` command will execute an `awk` expression once at the beginning of a command. This can be particularly useful if you want to give the output a header that it doesn't previously have. + +``` +awk 'BEGIN {print "new_header"} NR>1 {print $1}' ecosystems.txt +``` + +In this case we have told `awk` that we want "new_header" printed before anything else, then `NR>1` tells `awk` to skip the old header, and finally we print the first column of `ecosystems.txt` with `{print $1}`. + +### END + +Related to the `BEGIN` command, the `END` command tells `awk` to run a command once at the end of the file. It is ***very*** useful when summing up columns (below), but we will first demonstrate how it works by adding a new record: + +``` +awk '{print $1} END {print "new_record"}' ecosystems.txt +``` + +As you can see, this has simply added a new record to the end of the output. Furthermore, you can chain multiple `END` commands together to keep adding records if you wished: + +``` +awk '{print $1} END {print "new_record"} END {print "newer_record"}' ecosystems.txt +``` + +This is equivalent to separating your `print` commands with a `;`: + +``` +awk '{print $1} END {print "new_record"; print "newer_record"}' ecosystems.txt +``` + +### Variables + +You can also use variables in `awk`. Let's say we wanted to add 5cm to each organism's height: + +``` +awk 'BEGIN {print "old_height","new_height"} NR>1 {new_height=$5+5; print $5,new_height}' ecosystems.txt +``` + +There's a lot going on in this command, so let's break it down a bit: + +- `BEGIN {print "old_height","new_height"}` is giving us a new header + +- `NR>1` is skipping the old header + +- `new_height=$5+5;` creates a new variable called "new_height" and sets it equal to the height in the fifth field plus five. Note that separate commands within the same `{}` need to be separated by a `;`. + +- `print $5,new_height` prints the old height with the new height.
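+ +If you want more control over how a computed value is printed, `awk` also has a `printf` function that takes a C-style format string. Here is a small sketch building on the command above (the "organism" label is simply a made-up name for whatever happens to be in the first field): + +``` +# %s prints a value as a string; %.1f rounds the new height to one decimal place +awk 'BEGIN {print "organism","old_height","new_height"} NR>1 {new_height=$5+5; printf "%s %s %.1f\n", $1, $5, new_height}' ecosystems.txt +``` + +Unlike `print`, `printf` does not add a newline (or `ORS`) for you, which is why the format string ends in `\n`.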
+ +Lastly, you can also bring `bash` variables into `awk` using the `-v` option: + +``` +var=first_field +awk -v variable=$var '{print variable,$0}' ecosystems.txt +``` + +### Calculations using columns + +`awk` is also very good at handling calculations involving columns. + +#### Column sum + +Now that we understand how variables and `END` work, we can take the sum of a column, in this case the fifth column of our `ecosystems.txt`: + +``` +awk 'NR>1 {sum=$5+sum} END {print sum}' ecosystems.txt +``` + +- `NR>1` skips our header line. While not strictly necessary here (a non-numeric header is treated as zero in arithmetic), it is considered best practice to exclude the header if you have one. If your file didn't have a header, you would omit this. + +- `{sum=$5+sum}` is creating a variable named `sum` and updating it as it goes through each record by adding the fifth field to it. + +> **NOTE:** This `{sum=$5+sum}` syntax can be, and often is, abbreviated to `{sum+=$5}`. They are equivalent syntaxes, but in a learning context we think `{sum=$5+sum}` is a bit clearer. + +- `END {print sum}` Once we get to the end of the file, we call `END` to print out our variable `sum`. + +#### Column Average + +Now that we understand how to take a column sum and retrieve the number of records, we can quite easily calculate the average for a column like so: + +``` +awk 'NR>1 {sum=$5+sum} END {records=NR-1; print sum/records}' ecosystems.txt +``` + +- `records=NR-1` is needed because `NR` contains the total number of records, which includes our header line. As a result, we make a new variable called `records` to hold the number of records in the file without the header line. + +If you didn't have a header line, you could get the average of a column with a command like: + +``` +awk '{sum=$5+sum} END {print sum/NR}' ecosystems.txt +``` + +#### Calculations between columns + +If you wanted to divide the sixth field of `ecosystems.txt` by the fifth field, you could do: + +``` +awk 'NR>1 {print $6/$5}' ecosystems.txt +``` + +> **NOTE:** Here it is particularly important to skip the header line, because otherwise you would try to divide the string `weight(g)` by the string `height(cm)` and `awk` will give you an error. + +You can, of course, add columns around this calculation as well, such as: + +``` +awk 'NR>1 {print $1,$6/$5,$2}' ecosystems.txt +``` + +Lastly, you can also set the output of a calculation equal to a new field and print that field: + +``` +awk 'NR>1 {$7=$6/$5; print $1,$7,$2}' ecosystems.txt +``` + +`$7=$6/$5` creates a seventh field containing the sixth field divided by the fifth field. We then need to separate this from the `print` command with a `;`, but we can now refer to this new field we've created. + +***NOTE:*** If you create a new field such as `$7=$6/$5`, `$7` is now part of `$0` and the assignment will overwrite any values previously in `$7`. For example: + +``` +awk 'NR>1 {$7=$6/$5; print $0,$7}' ecosystems.txt +``` + +You will see the new `$7` value twice at the end of each output record, because `$7` is now part of `$0` and you have also asked for `$7` to be printed again. + +However, if you have: + +``` +awk 'NR>1 {$6=$6/$5; print $0}' ecosystems.txt +``` + +`$6=$6/$5` will overwrite the values previously held in `$6` after the calculation is made. Thus, the output no longer shows the original `$6`.
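+ +One practical wrinkle worth knowing (a hedged aside rather than something the examples above depend on): whenever you assign to a field, as in `$7=$6/$5`, `awk` rebuilds `$0` from its fields using `OFS`. You can take advantage of this to get cleanly delimited output that includes your new column: + +``` +# assigning to $7 makes awk rebuild $0 with OFS (here a tab) between every field +awk -v OFS='\t' 'NR>1 {$7=$6/$5; print $0}' ecosystems.txt +``` + +Because the assignment triggers the rebuild, the whole record is printed tab-separated, regardless of how the input happened to be spaced.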
+ +### `for` loops + +Like many other programming languages, `awk` can also do loops. One type of loop is the basic `for` loop. The basic syntax for a `for` loop in `awk` is: + +``` +awk '{for (initialize counter variable; end condition; increment) command}' file.txt +``` + +If you want to duplicate every record in your file, you can do so like this: + +``` +awk '{ for (i = 1; i <= 2; i=i+1) print $0}' ecosystems.txt +``` + +`for (i = 1; i <= 2; i=i+1)` is starting a `for` loop that: +- `i = 1` starts a counter variable at 1 +- `i <= 2` runs as long as the value of `i` is less than or equal to 2 +- `i=i+1` increases the counter variable by one after each iteration. `++i` and `i++` are equivalent syntaxes to `i=i+1`. + +Then we print the whole line with `print $0`. + +While not discussed here, `awk` does support `while` and `do-while` loops. + +### `if` statements + +Since `awk` is its own fully fledged programming language, it also has conditional statements. A common time you might want an `if` statement in `awk` is when you have a file with tens or even hundreds of fields and you need to figure out which field carries the column header of interest, or when you are writing a script for broad use and the order of the input columns may not always be the same, so you need to work out which column has a certain header. To do that: + +``` +awk 'NR==1 {for (i=1; i<=NF; i=i+1) {if ($i == "height(cm)") print i}}' ecosystems.txt +``` + +We can break this code down a bit: + +- `NR==1` only looks at the header line (note the double equals sign: `==` tests for equality, while a single `=` would assign a value and match every record) + +- `for (i=1; i<=NF; i=i+1)` begins a `for` loop starting at field one and continuing as long as `i` is less than or equal to the number of fields, incrementing `i` by one on each iteration + +- `if ($i == "height(cm)")` checks `$i`, which in our case is $1, $2, ... $6, to see whether it is equal to `height(cm)`. If this condition is met, then: + +- `print i` prints out `i`, the number of the matching field + +## Exercises + +Within our directory there should be a comma-separated file called `raw_counts_mouseKO.csv`. This is a raw counts matrix from a bulk RNA-seq experiment. Let's use `awk` to accomplish a few tasks with it. + +**1)** How could we exclude the 5th through 9th columns of `raw_counts_mouseKO.csv` and pipe the output into a `less` buffer? + + +**2)** How could we have `awk` return the column number for the field named "WT2"? + + +**3)** How could we measure the average counts for the sample in the 7th column? + + +**4)** How could we calculate the average counts across samples WT1, WT2, WT3 and WT4 for the first twenty genes, printing the gene name, then the count for each of those samples and then the average across the four samples? + + +*** + +[Back to Schedule](../README.md) + +*** + +*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* +