Note: Tick mark means completed tasks. Will be removed in the latter project update documents.
Summary: Finding mutations in HIV RT sequences resistant to drugs from clinical trails and MP samples
- 1. Priority1: Select/Provide concensus seq of random sampling of UMIs (50) from s25 (DS6) then provide UPGMA tree for analysis in ITOL.
- 2. Priority2: Supervise William S. to reanalize Serial Dilution I and II datasets with the current HIV mutation pipeline and in generating table for read counts per virion (all & assembled ones) for these samples.
- 3. Priority 3 (MTN): Build consensus of Assembled virions for each sample in MTN study (with IUPAC base names and 80% major base cutoff) and build UPGMA Tree of Sanger vs. Consensus for all samples in the study.
- 4. Priority 3 (IPM): Build consensus of Assembled virions for each sample in IPM study (with IUPAC base names and 80% major base cutoff) and build UPGMA Tree of Sanger vs. Consensus for all samples in the study.
- 5. Run Contamination Analysis on MTN including DS23 MTN samples (Status: Between and Within Sample analysis Finished, contaminatant detection and UPGMA validation left)
- 6. Run Contamination Analysis on IPM including DS23 IPM samples (Status: Between and Within Sample analysis Finished, contaminatant detection and UPGMA validation left)
- 7. HIV Mutation Pipeline Update1: Include the code of collecting random virion sequences with condtion of Sample > 25K virions, then randomly select 25K virions.
- 8. HIV Mutation Pipeline Update2: Include the steps for overlap pairs in paired end library. Reference for virion assembly still need to be provided by Kerri to test run this update.
- 9. Complete rest of the Priorities: 4,5 and 6.
Summary: Finding HIV human junctions and assembly of pro-virus
- 1. Write code to generate statistics on alignment of reads to HIV, such that it includes #Reads paritally aligned and carry signature sequence in the same place where reference genome has.
- 2. Generate pipleline for finding Human+HIV junction region:
Reads --> Align to HIV reference --> Collect partially aligned read ids (softclipped) with signature sequence in them --> Obtain R1 and R2 sequences for such read ids --> Align such reads to HIV+hg19 genome --> a. Find region with max coverage --> Traverse upstream and downstream of the max coverage region until coverage >= 5% of max coverage --> Assemble reads in such region using samtools/bcftools --> Count # reads in fwd and reverse direction in such region to get strand of assembly --> Reverse complement assembly if (-) strand: Human region of integration site b. Assemble reads in region from start of HIV reference to the last base after which the alignment begins. Use consensus building tool with 75% major base cutoff c. Assemble reads from 1st base of alignment on HIV reference to base where coverage >= 5% of max coverage. d. Join b and c assemblies to get provirus region of integration site.
- 3. Construct bash script pipeline for above workflow
- 4. Test on GKISA2_S6 (works well)
- 5. Rerun pipeline on GKISA samples
- 6. Once established the pipeline, run the pipeline on other datasets for HIV+Human junction
- 7. Conduct test on cutoff position for HIV+Human integration at different percentage coverage vs. length of integration region
- 8. Run Pipeline on other KISA samples
- 1. For MS6369218-500v2, and probably for other datasets with KISA and NFL assays, once decided on results of Human+HIV integration region and provirus assembly, discuss strategy to bring them together as one assembly.
- 1. Continue with the remainder of HIV mutation pipeline
- 2. Discuss if they were able to run and understand each step of the pipeline
- 3. Introduce them with HIV Contamination Analysis pipeline
- 4. Make Training ppt on Human+HIV integration project
- 5. Make Training ppt on Provirus assembly project