Generalize the seqrepo interface and implement new backends #136

Open
reece opened this issue Feb 5, 2024 · 5 comments
Labels
  • keep alive: exempt issue from staleness checks
  • project proposal: project proposal for interns and GSoC students

Comments

@reece
Member

reece commented Feb 5, 2024

Difficulty | Expected Duration | Possible Mentors
Medium | 175h | @reece

Summary

SeqRepo provides a simple interface to biological sequences and subsequences. It currently has a single backend, which provides fast random access to local, non-redundant, compressed, and journaled sequences. The original use case for SeqRepo was to provide fast and reliable access to sequences in a clinical genetics reporting pipeline. (See design)

The goal of this issue is to create an abstract interface that supports other storage backends, as well as caching and federation layers as depicted here:

Image

See #61 for additional information.
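As a rough illustration of what such an abstract interface could look like, here is a minimal sketch. The names `SequenceBackend` and `InMemoryBackend`, and the exact `fetch` signature, are assumptions made for this example and are not the current SeqRepo API. Coordinates are assumed to be 0-based and end-exclusive, like Python slices.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Optional


class SequenceBackend(ABC):
    """Hypothetical abstract interface (an assumption, not SeqRepo's API)."""

    @abstractmethod
    def fetch(self, seq_id: str, start: Optional[int] = None,
              end: Optional[int] = None) -> str:
        """Return the subsequence [start, end) of seq_id.

        Implementations are expected to raise KeyError for unknown ids.
        """


class InMemoryBackend(SequenceBackend):
    """Toy backend that keeps sequences in a dict, for illustration only."""

    def __init__(self, sequences: dict[str, str]) -> None:
        self._sequences = dict(sequences)

    def fetch(self, seq_id: str, start: Optional[int] = None,
              end: Optional[int] = None) -> str:
        # Slicing with None bounds returns the whole sequence.
        return self._sequences[seq_id][start:end]


backend = InMemoryBackend({"seq1": "ACGTACGT"})
print(backend.fetch("seq1", 2, 6))  # prints GTAC
```

Under this assumption, any storage layer (local files, a key-value store, a remote service) would only need to implement `fetch` to be usable behind the same interface.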

Community Benefits

When implemented, this project will enable the following (and ideally implement a few of them):

  • Network sources: The ability to store sequences in, for example, a private Redis database
  • Federation: The ability to merge sets of sequences from different sources or species
  • Caching: The ability to transparently cache sequences locally on first use, rather than downloading a database
  • Façade: Provide a consistent interface for existing network sequence sources
  • REST API: Provide a REST API over any of these sources, including federated sources
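The federation and caching items above could be layered as wrappers around any object exposing a common `fetch` method. The following is only a sketch under that assumption: `Backend`, `DictBackend`, `FederatedBackend`, and `CachingBackend` are hypothetical names invented for this example, and the cache here is a plain in-process dict rather than a persistent local store.

```python
from __future__ import annotations

from typing import Optional, Protocol, Sequence


class Backend(Protocol):
    """Hypothetical minimal interface shared by all layers."""

    def fetch(self, seq_id: str, start: Optional[int] = None,
              end: Optional[int] = None) -> str: ...


class DictBackend:
    """Stand-in for a real source; counts calls to show cache behaviour."""

    def __init__(self, sequences: dict[str, str]) -> None:
        self.sequences = dict(sequences)
        self.calls = 0

    def fetch(self, seq_id, start=None, end=None):
        self.calls += 1
        return self.sequences[seq_id][start:end]  # KeyError if unknown


class FederatedBackend:
    """Tries each source in order and returns the first hit."""

    def __init__(self, sources: Sequence[Backend]) -> None:
        self.sources = list(sources)

    def fetch(self, seq_id, start=None, end=None):
        for source in self.sources:
            try:
                return source.fetch(seq_id, start, end)
            except KeyError:
                continue
        raise KeyError(seq_id)


class CachingBackend:
    """Fetches the full sequence once from upstream, then slices locally."""

    def __init__(self, upstream: Backend) -> None:
        self.upstream = upstream
        self._cache: dict[str, str] = {}

    def fetch(self, seq_id, start=None, end=None):
        if seq_id not in self._cache:
            self._cache[seq_id] = self.upstream.fetch(seq_id)
        return self._cache[seq_id][start:end]


human = DictBackend({"human:1": "ACGTACGT"})
mouse = DictBackend({"mouse:1": "TTGGCCAA"})
repo = CachingBackend(FederatedBackend([human, mouse]))

first = repo.fetch("mouse:1", 0, 4)
second = repo.fetch("mouse:1", 4, 8)  # served from the cache
print(first, second, mouse.calls)
```

Because every layer exposes the same `fetch` method, the layers compose in any order; caching whole sequences is a simplification, and a real implementation might cache fixed-size blocks instead.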

Expected Results / Deliverables

  • Define and implement an abstract interface
  • Adapt the current Fastadir to use the interface
  • Incorporate and adapt the REST interface
  • Implement a local sequence cache
  • Implement Redis, S3, or other backends as able

Required and Desired Skills

  • Python
  • Relational database design, SQL
  • REST interface design
  • Caching techniques and modes
  • Backend-specific experience, such as Redis and/or AWS S3

Benefits to Intern

The intern will gain software architecture and interface abstraction skills while solving a contemporary, practical problem in modern bioinformatics.

How to apply

Students applying to this project should briefly describe their vision for this project, highlight their existing skills and the skills they would need to learn, and estimate an implementation timeline.

@reece added the project proposal label (project proposal for interns and GSoC students) on Feb 5, 2024
@manulpatel

Hello @reece! I am Manul, from India, working as a backend engineer building RESTful APIs in TypeScript and NestJS, with PostgreSQL as the database. In my current project, I am implementing Redis for session management in my organisation. I have also contributed to Python-based open source projects.

I am interested in implementing these storage backends for SeqRepo and in becoming part of the biocommons community. I couldn't find much information here, so could you please suggest what steps or tasks, other than proposal preparation, I should follow to become a contributor to the biocommons org? Also, is there another communication channel I should join, since I can't enter the official Slack without the domain email?

@Harsh-2004

Dear @reece ,

I hope this message finds you well. I am Harsha Aditya, a third-year undergraduate student at IIT Kanpur, majoring in Bioengineering. I am excited to apply for the SeqRepo project internship opportunity and contribute to its development.

Vision for the Project:
My vision for SeqRepo is to extend its capabilities by implementing an abstract interface that supports various storage backends, caching mechanisms, and federation layers. I aim to create a flexible and scalable solution that seamlessly integrates with different data sources while ensuring fast and reliable access to biological sequences. Leveraging my expertise in C++ and Python, along with my knowledge of sequence alignment algorithms, I intend to enhance SeqRepo's functionality to meet the evolving needs of bioinformatics research and clinical genetics reporting.

Existing Skills:
As a Quant developer and researcher at Devine Group and WorldQuant, I have gained significant experience in Python programming and utilizing common libraries. My background in quantitative finance has honed my skills in data analysis, algorithm development, and software engineering. Additionally, my knowledge of sequence alignment software and algorithms will be instrumental in understanding the domain-specific requirements of SeqRepo and designing efficient solutions.

Skills to Learn:
While I am proficient in Python, I recognize the importance of expanding my skills to include backend-specific technologies such as Redis and AWS S3 for this project. I am committed to dedicating time to self-study and practical application to acquire the necessary skills. Furthermore, I am eager to deepen my understanding of caching techniques and explore how they can be applied to optimize SeqRepo's performance.

Implementation Timeline:
Based on my initial assessment, I estimate that defining and implementing the abstract interface will take approximately 50 hours. Adapting the Fastadir to use the interface and incorporating the REST interface could require around 70 hours. Implementing a local sequence cache may take 40 hours, while integrating Redis, S3, or other backends could vary depending on their complexity, requiring around 55-60 hours each.

Conclusion:
I am enthusiastic about the opportunity to contribute to SeqRepo and leverage my skills to address contemporary challenges in bioinformatics. I am confident that my background in C++, Python, and bioengineering, combined with my research experience, make me well-suited for this project. I am eager to collaborate with you and the team to achieve our objectives and advance SeqRepo's capabilities.

Thank you for considering my application. I look forward to the possibility of working together on this exciting project. Please direct me to the next steps.

Warm regards,
Harsha Aditya

@jsstevenson
Contributor

Also linking #61 to this

@manulpatel

Hi @jsstevenson! Is there any plan to implement new backends in the project soon? I would like to work on this outside GSoC, and I would be happy to learn the new technology if you could suggest some starting points.


github-actions bot commented Sep 3, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions bot added the stale label (Issue is stale and subject to automatic closing) on Sep 3, 2024
@jsstevenson added the keep alive label (exempt issue from staleness checks) and removed the stale label (Issue is stale and subject to automatic closing) on Sep 3, 2024
Projects
Status: No status
Development

No branches or pull requests

4 participants