Skip to content

Commit

Permalink
Update README
Browse files Browse the repository at this point in the history
  • Loading branch information
brendancsmith committed Mar 7, 2024
1 parent c8ae437 commit 037a0d1
Showing 1 changed file with 0 additions and 24 deletions.
24 changes: 0 additions & 24 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,27 +17,3 @@ A common task for bioinformaticians is to compare variants, whether to compare V
This script takes as input two VCFs and performs a comparison of the variants found in each file. The script outputs 3 VCFs, reflecting those variants that are shared and unique to each individual.

**NOTE**: An example VCF is provided at `tests/resources/sample.vcf`. VCFs can grow up to 4 million variants in size, as in the case of whole genome sequencing.

## Part 2) Problem Solving Challenge (to be discussed in person at 2nd interview)

Carefully read the following questions. In your next interview, be prepared to discuss each question and scenario. During the discussion, be as specific as possible regarding how the mentioned technology would be used. Our goal is to find out how quickly you grasp novel technologies, with respect to both their benefits and drawbacks.

### Scenario A:

In the current system large JSON Lines files (100,000 JSON dictionaries per file) are uploaded to S3. Each file is then read line by line and the JSON dictionaries sent to a MongoDB collection using Apache Airflow. The client would like to remove Airflow from this process. The entire contents of the JSON Lines file must be uploaded to MongoDB. Files that are in the process of being uploaded and those that have errored must be tracked.

#### Questions:

1. Take 5-10 minutes to present an architecture that would eliminate the need to use Apache Airflow. Feel free to use any combination of AWS services you wish to solve the problem. Be prepared to justify your technology and architectural decisions.
2. How would you retry the process if MongoDB went offline in the middle of an upload?

### Scenario B

Linux nodes continuously watch a RabbitMQ queue to see if any jobs have been submitted. When a job appears in a queue, one node will pick up the job, thereby preventing a different node from picking up the same job. RabbitMQ’s ack late feature is used to ensure jobs are retried when a node abruptly fails. Only one node is supposed to execute a job at one time, however, under some conditions, it is possible for two nodes to pick up the same job which will cause issues. The only shared systems between the nodes are the RabbitMQ queue and a MongoDB database.

#### Questions:

1. Come up with a strategy on how to keep two jobs from running at the same time. The strategy can only use the given resources.
2. If a new shared resource or technology could be added to the system, what would you add to ensure that two jobs do not run at the same time?

### Good luck!

0 comments on commit 037a0d1

Please sign in to comment.