Spring 2017
Extreme scale data science at the convergence of big data and massively parallel computing is enabling simulation, modelling and real-time analysis of complex natural and social phenomena at unprecedented scales. The aim of the project is to gain practical experience of this interplay by applying parallel computation principles to solve a compute- and data-intensive problem. Applying interdisciplinary principles and skills of parallel computation and data science from CS205 and other courses, the goal is to construct a novel parallel software solution for an open-ended data science application that requires orders-of-magnitude compute scaling using Harvard’s supercomputer, Odyssey. Additionally, the project provides an opportunity to apply novel concepts and technologies to create niche applications and research outputs.
As a project team (4 to 5 members) you will identify a data science problem, analyse its compute scaling requirements, collect the data, design and implement parallel software, and demonstrate scaled performance of an end-to-end application.
- should be implemented on a heterogeneous distributed-memory architecture with either many-core or multi-core compute nodes and evaluated on 8 compute nodes (note: each compute node on Odyssey is either a multi-core node with 32 (or 64) cores or a many-core GPU node with hundreds of cores).
- as a hybrid parallel program in one of:
  - MPI+OpenMP
  - MPI+OpenACC (or CUDA)
  - PGAS + X
  - Spark (with GPU acceleration)
- its performance evaluated on large data sets to demonstrate both weak and strong scaling using appropriate metrics (throughput, efficiency, iso-efficiency).
- and should solve a problem for a non-trivial computation graph and with hierarchical parallelism.
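The scaling metrics named above can be computed from measured wall-clock times. The following is a minimal sketch using the standard textbook definitions; the function names and the timings in the usage lines are hypothetical and are not part of the course material.

```python
def strong_scaling(t1, tp, p):
    """Strong scaling: total problem size is fixed while p workers are used.

    Returns (speed-up, efficiency) with S = T1 / Tp and E = S / p.
    """
    speedup = t1 / tp
    return speedup, speedup / p


def weak_scaling_efficiency(t1, tp):
    """Weak scaling: work per worker is fixed, so the problem grows with p.

    Ideal run time stays constant, hence E = T1 / Tp.
    """
    return t1 / tp


def throughput(items, seconds):
    """Items (records, grid cells, edges, ...) processed per second."""
    return items / seconds


# Hypothetical timings in seconds, for illustration only.
s, e = strong_scaling(t1=100.0, tp=25.0, p=8)
print(f"strong scaling: speed-up {s:.1f}, efficiency {e:.2f}")
print(f"weak scaling efficiency: {weak_scaling_efficiency(10.0, 12.5):.2f}")
print(f"throughput: {throughput(1_000_000, 4.0):.0f} items/s")
```

Iso-efficiency analysis then asks how fast the problem size must grow with the number of workers to hold the strong-scaling efficiency constant.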
To create novel parallel software solutions, or to undertake a research-oriented outcome, you can make use of the advanced concepts and technologies that were explored in the course. Implementations in the form of libraries and open source software are available to build niche applications on top of them:
- ‘Cutting edge’ P100 GPU architecture at 10 TeraFlops throughput from a national resource (https://www.xsede.org/). Create an account in the XSEDE portal (https://portal.xsede.org/) if you wish to go down the MPI+OpenACC path in the project.
- Communication avoiding algorithms to achieve strong scaling (installed on Odyssey as modules: https://portal.rc.fas.harvard.edu/apps/modules/CANDMC/2016-fasrc01, https://portal.rc.fas.harvard.edu/apps/modules/CombBLAS/15.0-fasrc01)
- Polyhedral model software to synthesise tiled parallel programs (OpenMP). https://sourceforge.net/projects/pocc/
- Parallel algorithm formulation in the semi-ring (libraries).
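As an illustration of the semi-ring formulation mentioned in the last item, the sketch below writes a matrix product generically over an "add" and a "multiply" operation. Swapping in (min, +) turns the same kernel into a shortest-path relaxation step. This is a serial, illustrative version only: the function name `semiring_matmul` and the example graph are assumptions, and the code does not use the API of any of the libraries listed above.

```python
import math
import operator


def semiring_matmul(A, B, add, mul, zero):
    """Matrix product over a semiring (add, mul) with additive identity zero."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[zero] * m for _ in range(n)]
    # Each (i, j) entry is independent of the others, so these two loops
    # are the natural place to distribute work in a parallel version.
    for i in range(n):
        for j in range(m):
            acc = zero
            for t in range(k):
                acc = add(acc, mul(A[i][t], B[t][j]))
            C[i][j] = acc
    return C


INF = math.inf

# Edge weights of a small hypothetical directed graph (INF = no edge).
W = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]

# (min, +) semiring: shortest paths that use at most two edges.
D2 = semiring_matmul(W, W, min, operator.add, INF)

# (+, *) semiring: the ordinary matrix product, from the same kernel.
P = semiring_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                    operator.add, operator.mul, 0)
```

Expressing the algorithm this way means that one parallel matrix kernel can be reused across graph and linear-algebra problems by changing only the pair of operations.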
- Web site (max. 5 pages)
- Introduction which should include comparison with existing work on the problem.
- Technical description of the parallel software solution, programming models, with links to code repository.
- Application scaling plots (speed-up, throughput, weak and strong scaling).
- Advanced Features
- Citations
- Software with evaluation data sets and test cases (in a GitHub repo)
- Presentation to the students and staff
The milestones for your final project will be graded at each step according to the grading criteria given below. It is important to adhere to the deadlines as the late date submission policy does not apply to projects.
Milestones | Deadline |
---|---|
Project Team announcement (sign up document deposited in Git repo) | 22nd March 2017 |
Project Proposal (1 page web site) | 25th March 2017 |
Interim progress report (website populated with preliminary results) | 15th April 2017 |
Project deliverables (web site, code, README) | 1st May 2017 |
Project presentation to class (10 mins. + Qs) | 2nd May 2017 |
Weekly meetings with project supervisors | 20th March to 28th April 2017 |
- All submissions are per group. Make your own repository on GitHub with a link to your project web page.
- All project deliverables, including those related to milestones, should be deposited in GitHub repos for peer evaluation.
- All project related correspondence should be posted on Piazza.
- Send project-related emails (meeting schedules) to [email protected] and general queries to Piazza. Email the project supervisors directly only for critical or personal matters:
- Manju
- Charles
- Rafael
- WeiWeiPan
- You can choose any data/computational science problem that you have already worked on in any other course: AC 209a, AC 209b, AM205, AM207, AC297R.
- Alternatively, your own research work with advanced concepts as above would be suitable to generate research output.
- Supervisors may offer projects based on their research interests.
- You can re-use any code from the CS205 homework sets and build your application software on top of it. However, to gain further credit, the code must be augmented with additional parallel code that meets the requirements specified above.
- You can implement the solution in any programming language of your choice (discuss with supervisors).
- In the interdisciplinary spirit of the subject area, and of the CS205 course, projects and project teams should span multiple disciplines.
Optionally, the project may take the path of research and generate a research paper as output. In this case the project requirements are:
- To implement a parallel algorithm as above but to support the research problem being addressed.
- To generate as final output a technical paper of journal quality comparable in depth to papers published in leading journals in computational/data science or parallel computing.
- To choose a parallel solution, which could range from a novel parallel graph algorithm to optimising a scientific application on Odyssey with new insights, or anything in between these two bounds of theory and experimentation.
The project will be graded on the depth of work undertaken, communication (web site, presentation) and participation.
- 10%: Project review (Peer and Supervisor meetings).
- 40%: Project software, README, overall quality (baseline).
- 30%: Advanced features.
- 10%: Project web site.
- 10%: Presentation to class.
To assess the different projects uniformly, the following criteria will be applied:
- A team of 2 will be expected to generate output equivalent to at least 2 homeworks for the baseline outcome.
- A team of 4 will therefore be expected to generate double the output in qualitative terms (not necessarily code, but features, analysis, evaluation, innovation). A team of 5 should be exceeding this threshold.
- Advanced features can include half on modelling and half on parallel software.
- Harvard Computational and Data Science: references in courses listed above in project choices
- NERSC data science projects: http://www.nersc.gov/science/science-highlights-presentations/
- Exascale Computing Project: https://exascaleproject.org/researchareas
- Previous years' projects: http://iacs-courses.seas.harvard.edu/courses/cs205/projects.html