From 037a0d168bacb6dcbf6819e411a1d43bcbf968ce Mon Sep 17 00:00:00 2001
From: Brendan Smith
Date: Thu, 7 Mar 2024 09:36:30 -0600
Subject: [PATCH] Update README

---
 README.md | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/README.md b/README.md
index 5f46ec2..43a649c 100644
--- a/README.md
+++ b/README.md
@@ -17,27 +17,3 @@ A common task for bioinformaticians is to compare variants, whether to compare V
 This script takes as input two VCFs and performs a comparison of the variants found in each file. The script outputs 3 VCFs, reflecting those variants that are shared and unique to each individual.
 
 **NOTE**: An example VCF is provided at `tests/resources/sample.vcf`. VCFs can grow up to 4 million variants in size, as in the case of whole genome sequencing.
-
-## Part 2) Problem Solving Challenge (to be discussed in person at 2nd interview)
-
-Carefully read the following questions. In your next interview, be prepared to discuss each question and scenario. During the discussion, be as specific as possible regarding how the mentioned technology would be used. Our goal is to find out how quickly you grasp novel technologies, with respect to both their benefits and drawbacks.
-
-### Scenario A:
-
-In the current system large JSON Lines files (100,000 JSON dictionaries per file) are uploaded to S3. Each file is then read line by line and the JSON dictionaries sent to a MongoDB collection using Apache Airflow. The client would like to remove Airflow from this process. The entire contents of the JSON Lines file must be uploaded to MongoDB. Files that are in the process of being uploaded and those that have errored must be tracked.
-
-#### Questions:
-
-1. Take 5-10 minutes to present an architecture that would eliminate the need to use Apache Airflow. Feel free to use any combination of AWS services you wish to solve the problem. Be prepared to justify your technology and architectural decisions.
-2. How would you retry the process if MongoDB went offline in the middle of an upload?
-
-### Scenario B
-
-Linux nodes continuously watch a RabbitMQ queue to see if any jobs have been submitted. When a job appears in a queue, one node will pick up the job, thereby preventing a different node from picking up the same job. RabbitMQ’s ack late feature is used to ensure jobs are retried when a node abruptly fails. Only one node is supposed to execute a job at one time, however, under some conditions, it is possible for two nodes to pick up the same job which will cause issues. The only shared systems between the nodes are the RabbitMQ queue and a MongoDB database.
-
-#### Questions:
-
-1. Come up with a strategy on how to keep two jobs from running at the same time. The strategy can only use the given resources.
-2. If a new shared resource or technology could be added to the system, what would you add to ensure that two jobs do not run at the same time?
-
-### Good luck!
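
For reference while discussing Scenario A from the section removed above, here is a minimal sketch of the pipeline step that scenario describes: reading a JSON Lines object from S3 line by line and inserting the dictionaries into a MongoDB collection. It is illustrative only; the bucket, key, connection string, database, and collection names are hypothetical placeholders, not values from the actual system.

```python
"""Sketch of the Scenario A pipeline step: JSON Lines from S3 into MongoDB.

All names below (bucket, key, URI, database, collection) are hypothetical.
"""
import json

import boto3
from pymongo import MongoClient


def load_jsonl_to_mongo(bucket: str, key: str, mongo_uri: str) -> int:
    """Stream a JSON Lines object from S3 and insert each record into MongoDB."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    collection = MongoClient(mongo_uri)["etl"]["records"]

    batch, inserted = [], 0
    for raw_line in body.iter_lines():
        if not raw_line:
            continue  # skip blank lines
        batch.append(json.loads(raw_line))
        if len(batch) == 1000:  # insert in batches to limit round trips
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


if __name__ == "__main__":
    n = load_jsonl_to_mongo("example-bucket", "uploads/records.jsonl",
                            "mongodb://localhost:27017")
    print(f"inserted {n} documents")
```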
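
Similarly, for Scenario B, a minimal sketch of the consumer behaviour that scenario describes, interpreting "ack late" as manual acknowledgement after the work completes so that RabbitMQ redelivers a job if a node dies mid-run. The queue name and broker address are hypothetical placeholders.

```python
"""Sketch of a Scenario B worker: consume jobs, acknowledge only on completion.

Queue name and broker address are hypothetical placeholders.
"""
import json

import pika


def handle_job(job: dict) -> None:
    """Placeholder for the real work a node performs."""
    print(f"processing {job}")


def main() -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="jobs", durable=True)
    channel.basic_qos(prefetch_count=1)  # hold at most one unacked job per node

    def on_message(ch, method, properties, body):
        handle_job(json.loads(body))
        # Late ack: acknowledge only after the job finishes, so an abrupt
        # node failure leaves the message unacked and RabbitMQ redelivers it.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="jobs", on_message_callback=on_message)
    channel.start_consuming()


if __name__ == "__main__":
    main()
```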