GSoC 2022 Projects
This page contains project ideas for students applying to the Google Summer of Code 2022. We recommend that prospective students join our Slack workspace to discuss project proposals. Be sure to read our Code of Conduct - respect is important and you will be working with a team from many backgrounds. Also see the signac project's Community page with our Contributing guide.
signac is a data management framework named after the painter Paul Signac, whose colorful pointillist style resembles a collection of data "points". The signac framework is designed to help researchers design, manage, and execute computational studies. The core data management package signac helps users track data and metadata for file-based workflows (e.g. large molecular simulations) with features for searchability, collaboration, reproducibility, and archival.
The companion package signac-flow automates workflow submission on high performance computing clusters operated by universities, companies, and federal research labs. The architecture of signac is specifically aimed at research, where questions change rapidly, data models are always in flux, and computing infrastructure varies widely from project to project. Portability and fast prototyping are signac's strong suits -- compute some jobs, analyze the outputs, write a paper, and archive the data. The signac framework is available for Python 3.6+, can be installed with pip or conda, and is released under the BSD 3-Clause license.
To learn more about signac, check out the signac website and framework documentation. You can also follow @signacdata on Twitter.
Above all else, we are looking for an enthusiastic student who is willing to learn and works well with our team. The signac framework is written in Python 3 and our organization relies on git, so basic familiarity with both Python and git is valuable.
We recommend you take a look at a "good first issue" to acquaint yourself with the project and our development process.
Note that the signac framework has a few separate repositories where issues are filed:
- signac, core data management package
- signac-flow, workflow automation
- signac-dashboard, rapid data visualization in a browser
- signac-docs, the central documentation repository
- signac-examples, a set of example projects
Difficulty: Medium. Time commitment: 175 hours.
The core functionality of signac is the management of a database on the filesystem. The database is called a project, and the items in the database are called jobs. Metadata associated with a job is stored in the job's state point and document JSON files, which are distributed throughout the database in subdirectories corresponding to each job. One of signac's core features is the ability to interact with data and metadata through a Python API and a command-line interface (CLI). Users can perform the standard database operations (create, read, update, delete) as well as execute search queries, import/export data, synchronize multiple projects, and create "views" using symbolic links.
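To make the idea of a file-based database concrete, here is a minimal sketch that uses only the Python standard library. It is not signac's implementation: the `job_id`, `open_job`, and `find_jobs` helpers, the `statepoint.json` file name, and the SHA-256 hashing scheme are all assumptions chosen for illustration. It shows one subdirectory per job, a state point stored as JSON, and a simple search over those files.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def job_id(statepoint: dict) -> str:
    """Derive a stable identifier from a state point (illustrative scheme)."""
    # Sorting keys makes the id independent of dictionary ordering.
    blob = json.dumps(statepoint, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def open_job(workspace: Path, statepoint: dict) -> Path:
    """Create (or reuse) the job directory and write its state point file."""
    job_dir = workspace / job_id(statepoint)
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "statepoint.json").write_text(json.dumps(statepoint, sort_keys=True))
    return job_dir


def find_jobs(workspace: Path, query: dict) -> list:
    """Return the state points that match every key/value pair in the query."""
    matches = []
    for sp_file in sorted(workspace.glob("*/statepoint.json")):
        statepoint = json.loads(sp_file.read_text())
        if all(statepoint.get(key) == value for key, value in query.items()):
            matches.append(statepoint)
    return matches


def demo() -> list:
    """Build a small parameter sweep in a temporary directory and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        for pressure in (1.0, 2.0):
            for temperature in (300, 350):
                open_job(workspace, {"pressure": pressure, "temperature": temperature})
        return find_jobs(workspace, {"temperature": 300})


if __name__ == "__main__":
    print(demo())
```

Because the identifier is derived from the state point itself, opening the same state point twice resolves to the same directory, which is what makes the filesystem usable as a database in this sketch.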
In this project, you will develop a new feature for the signac framework: a REST API that allows users to interface with a signac project through HTTP requests like GET/PUT/POST/DELETE in a similar way to the existing Python and CLI interfaces. There are 3 overarching goals for this project:
- Develop an Internationalized Resource Identifier (IRI) schema for representing core signac data structures like projects, jobs, state points, and search queries over the project. This will serve as the basis for the REST API endpoints. For previous discussions of this, see #96 and #189. The addition of IRIs will allow provenance tracking and stable cross-links between signac data structures, strengthening the framework’s core abilities to manage, track, and archive research data.
- Using an industry standard Python API framework (such as FastAPI), implement, document, and test an application that can respond to requests for data from signac projects with data in a standard format, such as JSON (signac already uses JSON internally). This API will be documented with Sphinx and tested with pytest, like the existing Python and CLI interfaces.
- Build an example client for the REST API that demonstrates its features. This can be simple, using a library like requests, or a web-based application if time allows.
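As a rough illustration of how the goals above fit together, the sketch below maps HTTP-style requests onto a toy in-memory "project" using only the standard library. Everything here is an assumption made for the example: the `/jobs` and `/jobs/<id>/statepoint` paths are a hypothetical resource layout (not the schema discussed in #96 or #189), the `handle` function and the `JOBS` dictionary are invented stand-ins, and a real implementation would use a framework such as FastAPI and read from an actual signac project.

```python
import json
from urllib.parse import parse_qs, urlsplit

# In-memory stand-in for a project: job id -> state point (hypothetical data).
JOBS = {
    "a1": {"temperature": 300, "pressure": 1.0},
    "b2": {"temperature": 350, "pressure": 1.0},
}


def handle(method: str, url: str, body: str = "") -> tuple:
    """Dispatch a request to the toy project and return (status, JSON text)."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]

    # /jobs -> the collection of job ids, optionally filtered by ?key=value.
    if segments == ["jobs"]:
        if method != "GET":
            return 405, json.dumps({"error": "method not allowed"})
        query = {key: values[0] for key, values in parse_qs(parts.query).items()}
        ids = [
            jid
            for jid, statepoint in JOBS.items()
            if all(str(statepoint.get(k)) == v for k, v in query.items())
        ]
        return 200, json.dumps(sorted(ids))

    # /jobs/<id>/statepoint -> the state point of a single job.
    if len(segments) == 3 and segments[0] == "jobs" and segments[2] == "statepoint":
        jid = segments[1]
        if method == "PUT":
            status = 200 if jid in JOBS else 201  # 201 signals a newly created job
            JOBS[jid] = json.loads(body)
            return status, json.dumps(JOBS[jid])
        if jid not in JOBS:
            return 404, json.dumps({"error": "unknown job"})
        if method == "GET":
            return 200, json.dumps(JOBS[jid])
        if method == "DELETE":
            del JOBS[jid]
            return 204, ""
        return 405, json.dumps({"error": "method not allowed"})

    return 404, json.dumps({"error": "unknown resource"})


if __name__ == "__main__":
    print(handle("GET", "http://localhost/jobs?temperature=300"))
    print(handle("GET", "http://localhost/jobs/a1/statepoint"))
```

A client built with a library like requests would issue the same methods against the same paths; keeping the path layout stable is what lets those identifiers double as cross-links between data structures.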
- Required: Python 3 programming experience
- Required: Experience using git for version control (ideally with GitHub workflows like issues and pull requests)
- Preferred: Knowledge of JSON format
- Preferred: Knowledge of a package/tool/framework/library in any of the following fields (any programming language is acceptable - the links below list Python packages in those fields):
See below.
Difficulty: Medium. Time commitment: 175 hours.
The signac-flow workflow model is designed around the concept of operations acting on jobs. It relies on signac to manage job data, allowing signac-flow to focus on workflow definition and execution. Recent expansions of this model have enabled groups of multiple operations to act on jobs, as well as operations that can act on aggregates of multiple jobs. Specific operation requirements (such as the required number of CPUs or GPUs to run on) are expressed using directives, while the execution graph itself (e.g. operation A must be completed before operation B) is expressed using conditions. All of these core concepts -- operations, groups, aggregates, directives, and conditions -- are defined using a convenient and expressive decorator-based API. However, the expansion of the workflow model has exposed some limitations with this API, particularly with respect to the fragility of composing many decorators that each have their own set of options.
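The toy model below may help make these concepts concrete. It is a deliberately simplified sketch written with the standard library only and is not how signac-flow is implemented: the `operation`, `pre`, `post`, and `directives` decorators, the `OPERATIONS` registry, and the `eligible` helper are stand-ins invented for this example, and jobs are plain dictionaries. It also shows one way that stacked decorators can become order-sensitive.

```python
OPERATIONS = {}  # operation name -> {"func", "pre", "post", "directives"}


def operation(func):
    """Register a function, snapshotting whatever metadata is attached so far."""
    OPERATIONS[func.__name__] = {
        "func": func,
        "pre": getattr(func, "_pre", []),
        "post": getattr(func, "_post", []),
        "directives": getattr(func, "_directives", {}),
    }
    return func


def pre(condition):
    """Attach a condition that must hold before the operation may run."""
    def decorator(func):
        func._pre = getattr(func, "_pre", []) + [condition]
        return func
    return decorator


def post(condition):
    """Attach a condition that marks the operation as complete."""
    def decorator(func):
        func._post = getattr(func, "_post", []) + [condition]
        return func
    return decorator


def directives(**kwargs):
    """Attach resource requests such as the number of processes."""
    def decorator(func):
        func._directives = {**getattr(func, "_directives", {}), **kwargs}
        return func
    return decorator


def eligible(name, job):
    """Eligible when all pre-conditions hold and the post-conditions do not all hold."""
    op = OPERATIONS[name]
    complete = bool(op["post"]) and all(cond(job) for cond in op["post"])
    return all(cond(job) for cond in op["pre"]) and not complete


def initialized(job):
    return "x" in job


def squared(job):
    return "x_squared" in job


@operation  # outermost, so it runs last and sees all attached metadata
@pre(initialized)
@post(squared)
@directives(np=4)
def square(job):
    job["x_squared"] = job["x"] ** 2


@pre(initialized)  # applied after registration, so the registry never sees it
@operation
def mis_ordered(job):
    job["touched"] = True


if __name__ == "__main__":
    job = {"x": 3}
    print(eligible("square", job))
    square(job)
    print(eligible("square", job))
    print(OPERATIONS["mis_ordered"]["pre"])
```

In this toy, `mis_ordered` silently loses its pre-condition because decorators are applied bottom-up and registration happens before the condition is attached. The real framework's ordering constraints differ in detail, but the example shows why composing many independent decorators can be fragile.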
In this project, you will enable users to define workflows leveraging the expressive and powerful features of the signac-flow framework with a more succinct and readable syntax. The three major goals of this project are:
- Study the design constraints of the existing API. Some known issues and design flaws include: decorator ordering (discussed in #600 and #538), limited ability to re-use operation decorator logic, problems in combining module-level decorators and Project-level decorators, inability to cache the results of condition functions (#125), and internal complexity with operator/group registration (#237).
- Design a new API, which may be based on the existing proposals in #600. The final API design will be developed with GSoC mentors and other members of the signac development team. The API should focus on addressing user needs such as reducing complexity for first-time users while continuing to enable the expressive and powerful features of the framework. The new API will be documented with Sphinx and tested with pytest, like the existing features of the package.
- Develop Jupyter notebook examples and updated tutorials to reflect the new API design.
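To give a feel for the kind of design space involved, here is one hypothetical shape such an API could take. It is not the proposal from #600 and not an agreed design: the `Workflow` class, its keyword-only `operation` decorator, and the per-job condition cache are all assumptions made for this sketch, and jobs are plain dictionaries. Passing every option in a single call removes any dependence on decorator order, and caching condition results avoids re-evaluating them on every status check.

```python
class Workflow:
    """Hypothetical registry in which one decorator call carries every option."""

    def __init__(self):
        self.operations = {}
        self._cache = {}  # (condition name, job id) -> cached boolean result

    def operation(self, *, pre=(), post=(), directives=None):
        """Register an operation together with its conditions and directives."""
        def decorator(func):
            self.operations[func.__name__] = {
                "func": func,
                "pre": tuple(pre),
                "post": tuple(post),
                "directives": dict(directives or {}),
            }
            return func
        return decorator

    def _check(self, condition, job_id, job):
        """Evaluate a condition at most once per job until the cache is cleared."""
        key = (condition.__name__, job_id)
        if key not in self._cache:
            self._cache[key] = condition(job)
        return self._cache[key]

    def eligible(self, name, job_id, job):
        op = self.operations[name]
        ready = all(self._check(cond, job_id, job) for cond in op["pre"])
        complete = bool(op["post"]) and all(
            self._check(cond, job_id, job) for cond in op["post"]
        )
        return ready and not complete

    def run(self, name, job_id, job):
        self.operations[name]["func"](job)
        # Running an operation can change condition results, so drop stale entries.
        self._cache = {k: v for k, v in self._cache.items() if k[1] != job_id}


CALLS = {"has_x": 0}  # counts evaluations so the effect of caching is visible


def has_x(job):
    CALLS["has_x"] += 1
    return "x" in job


def doubled(job):
    return "x_doubled" in job


flow = Workflow()


@flow.operation(pre=[has_x], post=[doubled], directives={"np": 2})
def double(job):
    job["x_doubled"] = 2 * job["x"]


if __name__ == "__main__":
    job = {"x": 2}
    print(flow.eligible("double", "j1", job), flow.eligible("double", "j1", job))
    print(CALLS)
```

The trade-off in this sketch is that a cache must be invalidated whenever job data changes; here that happens only in `run`, which would not be sufficient if jobs were modified outside the workflow.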
- Required: Python 3 programming experience
- Required: Experience using git for version control (ideally with GitHub workflows like issues and pull requests)
- Preferred: Knowledge of a package/tool/framework/library in any of the following fields (any programming language is acceptable - the links below list Python packages in those fields):
- high-performance computing
- job schedulers
- parallel computing or distributed computing
- task queues
- template engines
- testing
See below.
- Learn to automate and scale computational workflows from laptops to the world's largest supercomputers
- Improve your skills in designing user-centered APIs, working on collaborative teams, and using scientific Python
- Work on a project that will be used by scientific researchers at institutions around the globe
- Work with a friendly team!
Our development team is distributed across several time zones, and we have an active Slack workspace, biweekly video calls, and biweekly development "sprints" to coordinate our efforts.
The following signac team members are interested in mentoring GSoC 2022 participants:
- Bradley Dice (@bdice)
- Brandon Butler (@b-butler)
- Mike Henry (@mikemhenry)
- Hardik Ojha (@kidrahahjo)
- Carl Simon Adorf (@csadorf)
- Vyas Ramasubramani (@vyasr)