From 86cdf62f18c80efbb79eee82082b2474315b0e4d Mon Sep 17 00:00:00 2001
From: Gregory Owens <5419829+owensgl@users.noreply.github.com>
Date: Wed, 11 Oct 2023 17:07:47 -0700
Subject: [PATCH] Update lab_1.md
---
labs/lab_1.md | 584 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 584 insertions(+)
diff --git a/labs/lab_1.md b/labs/lab_1.md
index 732a8fc72..fac9fc353 100644
--- a/labs/lab_1.md
+++ b/labs/lab_1.md
@@ -635,6 +635,590 @@ The commands `cd`, and `cd ~` are very useful for quickly navigating back to you
***
+## Objectives
+
+- View, search within, copy, move, and rename files. Create new directories.
+- Use wildcards (`*`) to perform operations on multiple files.
+- Make a file read only.
+- Use the `history` command to view and repeat recently used commands.
+
+***
+
+## Questions
+
+- How can I view and search file contents?
+- How can I create, copy and delete files and directories?
+- How can I control who has permission to modify a file?
+- How can I repeat recently used commands?
+
+***
+
+## Working with Files
+
+### Our data set: FASTQ files
+
+Now that we know how to navigate around our directory structure, let's
+start working with our sequencing files. We did a sequencing experiment and
+have two results files, which are stored in our `untrimmed_fastq` directory.
+
+### Wildcards
+
+Navigate to your `untrimmed_fastq` directory:
+
+```bash
+[grego@indri ~]$ cd ~/shell_data/untrimmed_fastq
+```
+
+We are interested in looking at the FASTQ files in this directory. We can list
+all files with the `.fastq.gz` extension using the command:
+
+```bash
+[grego@indri untrimmed_fastq]$ ls *.fastq.gz
+```
+
+```output
+bullkelp_001_R1.fastq.gz bullkelp_001_R2.fastq.gz
+```
+
+The `*` character is a special type of character called a wildcard, which can be used to represent any number of any type of character.
+Thus, `*.fastq.gz` matches every file that ends with `.fastq.gz`.
+
+This command:
+
+```bash
+[grego@indri untrimmed_fastq]$ ls *R2.fastq.gz
+```
+
+```output
+bullkelp_001_R2.fastq.gz
+```
+
+lists only the file that ends with `R2.fastq.gz`.
+
+This command:
+
+```bash
+[grego@indri untrimmed_fastq]$ ls /usr/bin/*.sh
+```
+
+```output
+/usr/bin/gettext.sh /usr/bin/lesspipe.sh /usr/bin/rescan-scsi-bus.sh /usr/bin/setup-nsssysinit.sh
+```
+
+lists every file in `/usr/bin` that ends in the characters `.sh`.
+Note that the output displays **full** paths to files (each result
+starts with `/`), because the pattern we gave `ls` included the full path.
+
+## Challenge
+
+## Exercise
+
+Do each of the following tasks from your current directory using a single
+`ls` command for each:
+
+1. List all of the files in `/usr/bin` that start with the letter 'c'.
+2. List all of the files in `/usr/bin` that contain the letter 'a'.
+3. List all of the files in `/usr/bin` that end with the letter 'o'.
+
+Bonus: List all of the files in `/usr/bin` that contain the letter 'a' or the
+letter 'c'.
+
+Hint: The bonus question requires a Unix wildcard that we haven't talked about
+yet. Try searching the internet for information about Unix wildcards to find
+what you need to solve the bonus problem.
+
+## Challenge
+
+## Exercise
+
+`echo` is a built-in shell command that writes its arguments, such as a line of text, to standard output.
+The `echo` command can also be used with pattern matching characters, such as wildcard characters.
+Here we will use the `echo` command to see how the wildcard character is interpreted by the shell.
+
+```bash
+[grego@indri untrimmed_fastq]$ echo *.fastq.gz
+```
+
+```output
+bullkelp_001_R1.fastq.gz bullkelp_001_R2.fastq.gz
+```
+
+The `*` is expanded to include any file that ends with `.fastq.gz`. We can see that the output of
+`echo *.fastq.gz` is the same as that of `ls *.fastq.gz`.
+
+What would the output look like if the wildcard could *not* be matched? Compare the outputs of
+`echo *.missing` and `ls *.missing`.
+
+
+## Command History
+
+If you want to repeat a command that you've run recently, you can access previous
+commands using the up arrow on your keyboard to go back to the most recent
+command. Likewise, the down arrow takes you forward in the command history.
+
+A few more useful shortcuts:
+
+- Ctrl\+C will cancel the command you are writing, and give you a
+ fresh prompt.
+- Ctrl\+R will do a reverse-search through your command history. This
+ is very useful.
+- Ctrl\+L or the `clear` command will clear your screen.
+
+You can also review your recent commands with the `history` command, by entering:
+
+```bash
+[grego@indri untrimmed_fastq]$ history
+```
+
+to see a numbered list of recent commands. You can reuse one of these commands
+directly by referring to the number of that command.
+
+For example, if your history looked like this:
+
+```output
+259 ls *
+260 ls /usr/bin/*.sh
+261 ls *R1*fastq
+```
+
+then you could repeat command #260 by entering:
+
+```bash
+[grego@indri untrimmed_fastq]$ !260
+```
+
+Type `!` (exclamation point) and then the number of the command from your history.
+You will be glad you learned this when you need to re-run very complicated commands.
+For more information on advanced usage of `history`, read section 9.3 of the
+[Bash manual](https://www.gnu.org/software/bash/manual/html_node/index.html).
+
+## Challenge
+
+## Exercise
+
+Find the line number in your history for the command that listed all the .sh
+files in `/usr/bin`. Rerun that command.
+
+
+
+## Examining Files
+
+We now know how to switch directories, run programs, and look at the
+contents of directories, but how do we look at the contents of files?
+
+One way to examine a file is to print out the first 10 lines using the program `head`.
+
+Enter the following command from within the `untrimmed_fastq` directory:
+
+```bash
+[grego@indri untrimmed_fastq]$ head bullkelp_001_R1.fastq.gz
+```
+
+This will print out the first 10 lines of `bullkelp_001_R1.fastq.gz` to the screen.
+
+Notice anything weird? The output is all unintelligible characters.
+That's because this file is gzipped. That means it is compressed and
+no longer a normal text file. To convert it into a format you can actually
+read, first we need to unzip it using `gunzip`.
+
+```bash
+[grego@indri untrimmed_fastq]$ gunzip *.fastq.gz
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+1. Print out the contents of the `~/shell_data/untrimmed_fastq/SRR097977.fastq` file. What is the last line of the file?
+2. From your home directory, and without changing directories,
+ use one short command to print the contents of all of the files in
+ the `~/shell_data/untrimmed_fastq` directory.
+
+::::::::::::::: solution
+
+## Solution
+
+1. The last line of the file is `C:CCC::CCCCCCCC<8?6A:C28C<608'&&&,'$`.
+2. `cat ~/shell_data/untrimmed_fastq/*`
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+`cat`, which prints the entire contents of a file to the screen, is a terrific program, but when the file
+is really big, it can be annoying to use. The program `less` is useful for this
+case. `less` opens the file as read only and lets you navigate through it. The navigation commands
+are identical to those of the `man` program.
+
+Enter the following command:
+
+```bash
+$ less SRR097977.fastq
+```
+
+Some navigation commands in `less`:
+
+| key | action |
+| ----- | ------------------------------------------------------------------------------------------------------------ |
+| Space | to go forward |
+| b | to go backward |
+| g | to go to the beginning |
+| G | to go to the end |
+| q | to quit |
+
+`less` also gives you a way of searching through files. Use the
+"/" key to begin a search. Enter the word you would like
+to search for and press `enter`. The screen will jump to the next location where
+that word is found.
+
+**Shortcut:** If you hit "/" then "enter", `less` will repeat
+the previous search. `less` searches from the current location and
+works its way forward, so scroll up a couple of lines in your terminal to verify
+you are at the beginning of the file. Note that if you are at the end of the file and search
+for the sequence "CAA", `less` will not find it. You need to either go to the
+beginning of the file (by typing `g`) and search again using `/`, or
+use `?` to search backwards in the same way you used `/` previously.
+
+For instance, let's search forward for the sequence `TTTTT` in our file.
+`less` jumps straight to the first match, so you can see what the sequence looks like
+and where it is in the file. If you continue to type `/` and hit return, you will move
+forward to the next instance of this sequence motif. If you instead type `?` and hit
+return, you will search backwards and move up the file to previous examples of this motif.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+What are the next three nucleotides (characters) after the first instance of the sequence quoted above?
+
+::::::::::::::: solution
+
+## Solution
+
+`CAC`
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Remember, the `man` program actually uses `less` internally and
+therefore uses the same commands, so you can search documentation
+using "/" as well!
+
+There's another way that we can look at files, and in this case, just
+look at part of them. This can be particularly useful if we just want
+to see the beginning or end of the file, or see how it's formatted.
+
+The commands are `head` and `tail` and they let you look at
+the beginning and end of a file, respectively.
+
+```bash
+$ head SRR098026.fastq
+```
+
+```output
+@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
++SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
+@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
+NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
++SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
+!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
+@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
+NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+```
+
+```bash
+$ tail SRR098026.fastq
+```
+
+```output
++SRR098026.247 HWUSI-EAS1599_1:2:1:2:1311 length=35
+#!##!#################!!!!!!!######
+@SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
+GNTGNGGTCATCATACGCGCCCNNNNNNNGGCATG
++SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
+B!;?!A=5922:##########!!!!!!!######
+@SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
+CNCTNTATGCGTACGGCAGTGANNNNNNNGGAGAT
++SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
+A!@B!BBB@ABAB#########!!!!!!!######
+```
+
+The `-n` option to either of these commands can be used to print the
+first or last `n` lines of a file.
+
+```bash
+$ head -n 1 SRR098026.fastq
+```
+
+```output
+@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+```
+
+```bash
+$ tail -n 1 SRR098026.fastq
+```
+
+```output
+A!@B!BBB@ABAB#########!!!!!!!######
+```
+
+## Details on the FASTQ format
+
+Although it looks complicated (and it is), the
+[fastq](https://en.wikipedia.org/wiki/FASTQ_format) format is easy to understand with a little decoding.
+Each read is stored as four lines:
+
+| Line | Description |
+| ----- | ------------------------------------------------------------------------------------------------------------ |
+| 1 | Always begins with '@' and then information about the read |
+| 2 | The actual DNA sequence |
+| 3 | Always begins with a '+', sometimes followed by the same info as in line 1 |
+| 4 | Has a string of characters which represent the quality scores; must have same number of characters as line 2 |
+
+We can view the first complete read in one of the files in our dataset by using `head` to look at
+the first four lines.
+
+```bash
+$ head -n 4 SRR098026.fastq
+```
+
+```output
+@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
++SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
+```
+
+All but one of the nucleotides in this read are unknown (`N`). This is a pretty bad read!
+
+Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the
+probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call
+accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality
+score, the numerical score is converted into a code where each individual character
+represents the numerical quality score for an individual nucleotide. For example, in the line
+above, the quality score line is:
+
+```output
+!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
+```
+
+The `#` character and each of the `!` characters represent the encoded quality for an
+individual nucleotide. The numerical value assigned to each of these characters depends on the
+sequencing platform that generated the reads. The sequencing machine used to generate our data
+uses the standard Sanger quality PHRED score encoding, Illumina version 1.8 onwards.
+Each character is assigned a quality score between 0 and 42 as shown in the chart below.
+
+```output
+Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
+                  |         |         |         |         |
+Quality score:    0........10........20........30........40..
+```
+
+Each quality score represents the probability that the corresponding nucleotide call is
+incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a
+base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%.
+These probability values are the results of the base calling algorithm and depend on how
+much signal was captured for the base incorporation.
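+
+As a rough sketch of this decoding (not part of the original lesson), the snippet below turns one quality character into its score and error probability. It assumes the Phred+33 encoding shown in the chart, where the score is the character's ASCII code minus 33, and the standard Phred relationship P = 10^(-Q/10). The variable names are illustrative, and `awk` is used here only as a calculator.
+
+```bash
+# Decode one Phred+33 quality character (assumed encoding: ASCII code minus 33).
+qual_char='#'
+
+# A leading single quote makes printf print the character's numeric ASCII code.
+ascii_code=$(printf '%d' "'$qual_char")
+score=$((ascii_code - 33))
+echo "$qual_char has quality score $score"   # prints: # has quality score 2
+
+# Convert the score to the probability of an incorrect base call: P = 10^(-Q/10).
+error_prob=$(awk -v q="$score" 'BEGIN { printf "%.3f", 10^(-q/10) }')
+echo "error probability: $error_prob"
+```
+
+Changing `qual_char` to another character from the chart, such as `I`, decodes it the same way (`I` corresponds to a score of 40).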
+
+Looking back at our read:
+
+```output
+@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
++SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
+!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
+```
+
+we can now see that the quality of each of the `N`s is 0 and the quality of the only
+nucleotide call (`C`) is also very poor (`#` = a quality score of 2). This is indeed a very
+bad read.
+
+## Creating, moving, copying, and removing
+
+Now we can move around in the file structure, look at files, and search files. But what if we want to copy files or move
+them around or get rid of them? Most of the time, you can do these sorts of file manipulations without the command line,
+but there will be some cases (like when you're working with a remote computer like we are for this lesson) where it will be
+impossible. You'll also find that you may be working with hundreds of files and want to do similar manipulations to all
+of those files. In cases like this, it's much faster to do these operations at the command line.
+
+### Copying Files
+
+When working with computational data, it's important to keep a safe copy of that data that can't be accidentally overwritten or deleted.
+For this lesson, our raw data is our FASTQ files. We don't want to accidentally change the original files, so we'll make a copy of them
+and change the file permissions so that we can read from, but not write to, the files.
+
+First, let's make a copy of one of our FASTQ files using the `cp` command.
+
+Navigate to the `shell_data/untrimmed_fastq` directory and enter:
+
+```bash
+$ cp SRR098026.fastq SRR098026-copy.fastq
+$ ls -F
+```
+
+```output
+SRR097977.fastq SRR098026-copy.fastq SRR098026.fastq
+```
+
+We now have two copies of the `SRR098026.fastq` file, one of them named `SRR098026-copy.fastq`. We'll move this file to a new directory
+called `backup` where we'll store our backup data files.
+
+### Creating Directories
+
+The `mkdir` command is used to make a directory. Enter `mkdir`
+followed by a space, then the directory name you want to create:
+
+```bash
+$ mkdir backup
+```
+
+### Moving / Renaming
+
+We can now move our backup file to this directory. We can
+move files around using the command `mv`:
+
+```bash
+$ mv SRR098026-copy.fastq backup
+$ ls backup
+```
+
+```output
+SRR098026-copy.fastq
+```
+
+The `mv` command is also how you rename files. Let's rename this file to make it clear that this is a backup:
+
+```bash
+$ cd backup
+$ mv SRR098026-copy.fastq SRR098026-backup.fastq
+$ ls
+```
+
+```output
+SRR098026-backup.fastq
+```
+
+### File Permissions
+
+We've now made a backup copy of our file, but just because we have two copies, it doesn't make us safe. We can still accidentally delete or
+overwrite both copies. To make sure we can't accidentally mess up this backup file, we're going to change the permissions on the file so
+that we're only allowed to read (i.e. view) the file, not write to it (i.e. make new changes).
+
+View the current permissions on a file using the `-l` (long) flag for the `ls` command:
+
+```bash
+$ ls -l
+```
+
+```output
+-rw-r--r-- 1 dcuser dcuser 43332 Nov 15 23:02 SRR098026-backup.fastq
+```
+
+The first part of the output for the `-l` flag gives you information about the file's current permissions. There are ten slots in the
+permissions list. The first character in this list is related to file type, not permissions, so we'll ignore it for now. The next three
+characters relate to the permissions that the file owner has, the next three relate to the permissions for group members, and the final
+three characters specify what other users outside of your group can do with the file. We're going to concentrate on the three positions
+that deal with your permissions (as the file owner).
+
+![](fig/rwx_figure.svg){alt='Permissions breakdown'}
+
+Here the three positions that relate to the file owner are `rw-`. The `r` means that you have permission to read the file, the `w`
+indicates that you have permission to write to (i.e. make changes to) the file, and the third position is a `-`, indicating that you
+don't have permission to carry out the ability encoded by that space. This is the space where `x`, or executable ability, is stored; we'll
+talk more about this in [a later lesson](05-writing-scripts.md).
+
+Our goal for now is to change permissions on this file so that you no longer have `w` or write permissions. We can do this using the `chmod` (change mode) command and subtracting (`-`) the write permission `-w`.
+
+```bash
+$ chmod -w SRR098026-backup.fastq
+$ ls -l
+```
+
+```output
+-r--r--r-- 1 dcuser dcuser 43332 Nov 15 23:02 SRR098026-backup.fastq
+```
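+
+If you later need to edit a protected file, the owner's write permission can be added back with `chmod u+w`. The sketch below is an illustration rather than a lesson step: it practices on a scratch file whose name (`practice.fastq`) and temporary location are made up for the example, so no real data is touched.
+
+```bash
+# Work in a throwaway directory so the real FASTQ files are never modified.
+scratch_dir=$(mktemp -d)
+touch "$scratch_dir/practice.fastq"
+
+# Remove write permission, then record and show the long listing.
+chmod -w "$scratch_dir/practice.fastq"
+locked=$(ls -l "$scratch_dir/practice.fastq")
+echo "$locked"      # the owner slot now reads r--
+
+# Add write permission back for the owner (u) only.
+chmod u+w "$scratch_dir/practice.fastq"
+unlocked=$(ls -l "$scratch_dir/practice.fastq")
+echo "$unlocked"    # the owner slot reads rw- again
+
+# Clean up the scratch directory and the file inside it.
+rm -r "$scratch_dir"
+```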
+
+### Removing
+
+To prove to ourselves that we no longer have the ability to modify this file, try deleting it with the `rm` command:
+
+```bash
+$ rm SRR098026-backup.fastq
+```
+
+You'll be asked if you want to override your file permissions:
+
+```output
+rm: remove write-protected regular file ‘SRR098026-backup.fastq’?
+```
+
+You should enter `n` for no, and the file will not be deleted. If you enter `y`, you will delete the file. This prompt gives us an extra
+measure of security, as there is one more step between us and deleting our data files.
+
+**Important**: The `rm` command permanently removes the file. Be careful with this command. It doesn't
+just nicely put the files in the Trash. They're really gone.
+
+By default, `rm` will not delete directories. You can tell `rm` to
+delete a directory using the `-r` (recursive) option. Let's delete the backup directory
+we just made.
+
+Enter the following command:
+
+```bash
+$ cd ..
+$ rm -r backup
+```
+
+This will delete not only the directory, but all files within the directory. If you have write-protected files in the directory,
+you will be asked whether you want to override your permission settings.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+Starting in the `shell_data/untrimmed_fastq/` directory, do the following:
+
+1. Make sure that you have deleted your backup directory and all files it contains.
+2. Create a backup of each of your FASTQ files using `cp`. (Note: You'll need to do this individually for each of the two FASTQ files.
+   We haven't learned yet how to do this with a wildcard.)
+3. Use a wildcard to move all of your backup files to a new backup directory.
+4. Change the permissions on all of your backup files to be write-protected.
+
+::::::::::::::: solution
+
+## Solution
+
+1. `rm -r backup`
+2. `cp SRR098026.fastq SRR098026-backup.fastq` and `cp SRR097977.fastq SRR097977-backup.fastq`
+3. `mkdir backup` and `mv *-backup.fastq backup`
+4. `chmod -w backup/*-backup.fastq`
+ It's always a good idea to check your work with `ls -l backup`. You should see something like:
+
+```output
+-r--r--r-- 1 dcuser dcuser 47552 Nov 15 23:06 SRR097977-backup.fastq
+-r--r--r-- 1 dcuser dcuser 43332 Nov 15 23:06 SRR098026-backup.fastq
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- You can view file contents using `less`, `cat`, `head` or `tail`.
+- The commands `cp`, `mv`, and `mkdir` are useful for manipulating existing files and creating new directories.
+- You can view file permissions using `ls -l` and change permissions using `chmod`.
+- The `history` command and the up arrow on your keyboard can be used to repeat recently used commands.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
## Credit
This material is adapted from Becker et al. 2019, under CC-BY 4.0 licence.