diff --git a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md
index a1db3a7a..90cbbde0 100644
--- a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md
+++ b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md
@@ -51,6 +51,8 @@ _You should see `advanced_shell.zip` as part of the output to the screen._
 **4.** Finally, to **decompress the folder**:
+## Comment-Upen: As we are already in the terminal, it might be easier to type `unzip advanced_shell.zip` rather than going back to the GUI.
+
 * Double click on advanced_shell.zip on a mac. This will automatically inflate the folder.
 * If you are on windows, press and hold (or right-click) the folder, select Extract All..., and then follow the instructions.
diff --git a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md
index 9a8e47bf..fb86f35e 100644
--- a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md
+++ b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md
@@ -15,6 +15,8 @@ In this lesson, we will:
 ## Getting Started
+## Comment-Upen: Even though we mention that grep was introduced in the previous workshops, I think participants will find it useful to get a brief introduction to grep before we go in depth. We can then introduce our toy file `catch.txt`.
+
 Before we get started, let's take a briefly look at the `catch.txt` file in a `less` buffer in order to get an idea of what the file looks like:
 ```
@@ -23,6 +25,9 @@ less catch.txt
 In here, you can see that we have a variety of case differences and misspellings. These differences are not exhaustive, but they will be helpful in exploring how regular expressions are implemented in `grep`.
+
+## Comment-Upen: Before introducing the cautions and extended regular expressions (which we say we won't use much), I think a beginner-level participant would be more interested in simply trying grep on the `catch.txt` file with a few simple examples first. We could then explain the difference between no quotes, single quotes, and double quotes further down, using dummy errors that we produce on purpose. Maybe we can demonstrate a few simple grep flags, such as -c for counting matching lines, -n for printing line numbers, and -v for printing the lines that do not match, among others. We can use double quotes in all the examples and ask participants what will happen if we use no quotes or single quotes, have them try it as practice with the different flags, and then introduce why quoting matters and the cases where it is useful. Just a thought!
+
 ## A bit more depth on grep
 There are two principles that we should discuss more, the `-E` option and the use of quotation marks.
@@ -31,6 +36,8 @@ There are two principles that we should discuss more, the `-E` option and the us
 There is a `-E` option when using `grep` that allows the user to use what is considered "extended regular expressons". We won't use too many of these types of regular expressions and we will point them out when we need them. If you want to make it a habit to always use the `-E` option when using regular expressions in `grep` it is a bit more safe.
+## Comment-Upen: I would explain what we mean by "safe".
+
 ### Quotations
 When using grep it is usually not required to put your search term in quotes. However, if you would like to use `grep` to do certain types of searches, it is better or *safer* to wrap your search term in quotations, and likely double quotations. Let's briefly discuss the differences:
@@ -86,7 +93,7 @@ Will return:
 ```
 C${at}CH
 ```
-
+## Comment-Upen: Maybe this take-home message can go to the bottom of the page and be bullet point 1.
 ### grep Depth Take-Home
 In conclusion, while these are all mostly edge cases, we believe that it is generally a good habit to wrap the expressions that you use for `grep` in double quotations and also use the `-E` option. This practice will not matter for the overwhelming number of cases, but it is sometimes difficult to remember these edge cases and thus it is mofe safe to just build them into a habit. Of course, your preferences may vary.
@@ -465,6 +472,8 @@ C${at}CH
 COTCH
 ```
+## Comment-Upen: Having a multi-FASTA or multi-FASTQ file in our demo data would give participants some basic real-world examples of grep in use: counting the number of sequences with `grep -c "^>" my.fasta`; finding the start codon "ATG" or the stop codon "TAA", or extracting the "cds" between ATG and TAA; using grep with -A 1 and -B 1 to get the header and sequence information for a sequence from a small part of it; and maybe using primer pairs to locate the PCR amplicon region. I mean a few of these examples, not too many. I think this would align well with the bioinformatics examples in the other lessons in this workshop. Just a thought.
+
 ***
 ## Exercises
diff --git a/Finding_and_summarizing_colossal_files/lessons/03_sed.md b/Finding_and_summarizing_colossal_files/lessons/03_sed.md
index 95002b65..06422b5e 100644
--- a/Finding_and_summarizing_colossal_files/lessons/03_sed.md
+++ b/Finding_and_summarizing_colossal_files/lessons/03_sed.md
@@ -159,6 +159,7 @@ Lastly, you can use `N~n` in the address to indicate that you want to apply the
 ```
 sed '1~2 s/an/replacement/g' ecosystems.txt
 ```
+## Comment-Upen: The tilde didn't work on my computer in the code above; it says it is not a valid command. I am using a Mac with an Apple M3 chip (the latest), and I suppose many of my participants will have a similar configuration?
 ## Bioinformatics Example
@@ -178,6 +179,7 @@ cat my_fastq.fq.gz | sed -n '1~4p' > quality_scores.txt
 ```
 The first half of the pipe prints the file and the sed command grabs every forth line. Try it with the `Mov10_oe_1.subset.fq` file in the advanced_shell directory!
+## Comment-Upen: There is no my_fastq.fq.gz in our training material folder. Also, just my_fastq.gz or my_fq.gz would be fine as a file name. Again, the tilde won't work on mine.
 ## Deletion
@@ -262,24 +264,26 @@ You can also ***c***hange entire lines in `sed` using the `c` command. We could
 sed '1 c header' ecosystems.txt
 ```
+## Comment-Upen: The above command doesn't work on my laptop. Instead it prints: `sed: 1: "1 c header": command c expects \ followed by text`
 This can also be utilized in conjunction with the `A,B` interval syntax, but we should be aware that it will replace ALL lines in that interval with a SINGLE line.
 ```
 sed '1,3 c header' ecosystems.txt
 ```
-
+## Comment-Upen: Same as above; it doesn't work on my Mac.
 You can also replace every *n*-th line starting at *N*-th line using the `N~n` address syntax:
 ```
 sed '1~3 c header' ecosystems.txt
 ```
+## Comment-Upen: The ~ in the above command is reported as invalid on my Mac.
 Lastly, you can also replace lines match a pattern:
 ```
 sed '/jaguar/ c header' ecosystems.txt
 ```
-
+## Comment-Upen: Error on the above command: `sed: 1: "/jaguar/ c header": command c expects \ followed by text`
 ## Multiple expressions
diff --git a/Finding_and_summarizing_colossal_files/lessons/AWK_module.md b/Finding_and_summarizing_colossal_files/lessons/AWK_module.md
index 997bb95e..d3bf28c2 100644
--- a/Finding_and_summarizing_colossal_files/lessons/AWK_module.md
+++ b/Finding_and_summarizing_colossal_files/lessons/AWK_module.md
@@ -166,6 +166,7 @@ Were seals ever observed in any of the other parks, note that `||` is or in awk
+## Comment-Upen: Neither of the options above prints anything on my laptop.
 ****
@@ -266,6 +267,7 @@ To simply extract the Yosemite data (column 3). We use the second part:
 ```bash
 awk -F "," '$2 ~ "coyote"'
 ```
+## Comment-Upen: Maybe add the file name animal_observations_edited.txt at the end of the above command. If someone enters it as written, the terminal will just hang.
 to separate the comma separated fields of column 3 and ask which lines have the string coyote in field 2. We want to print the entire comma separated list (i.e., column 3) to test our code which is the default behavior of `awk` in this case.
@@ -377,6 +379,8 @@ samtools view -S -b ${sam}.sam > ${sam}.bam
 done
 ```
+## Comment-Upen: We are not running this workshop on a cluster, right? Running the above chunk with samtools might be a problem?
+
 This actually combines a number of basic and intermediate shell topics such as [positional parameters]([positional_params.md](https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/positional_params.html)), [for loops](https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/loops_and_scripts.html), and `awk`!
 * We start with a for loop that counts from 1 to 10
@@ -391,6 +395,8 @@ With our new `awk` expertise let's take a look at that `awk` command alone!
 ```bash
 awk -v awkvar="${i}" 'NR==awkvar' samples.txt
 ```
+## Comment-Upen: There is no samples.txt in the workshop material folder?
+
 We have not encountered -v yet. The correct syntax is `-v var=val` which assign the value val to the variable var, before execution of the program begins. So what we are doing is creating our own variable within our `awk` program, calling it `awkvar` and assigning it the value of `${i}` which will be a number between 1 and 10 (see for loop above). `${i}` and thus `awkvar` will be different for each loop.