AustereCuriosity edited this page Mar 29, 2017 · 26 revisions

Organization: OpenAstronomy

Sub-Organization: AstroPy

1. Student Information

2. University Information

  • University: Pune Institute of Computer Technology
  • Major: Computer Science & Engineering
  • Current Year: Third
  • Expected Graduation Date: June 2018
  • Degree: Bachelor of Engineering (Honors)

3. Project Proposal Information

Proposal title

AstroPy: A mixin protocol for seamless interoperability

Proposal abstract

AstroPy has emerged as an integration of several independent packages (PyFits, asciitable), each responsible for an individual task in the storage, reduction and analysis pipeline of astronomical data sets. It has mostly unified their interfaces so that these entities coalesce. However, for AstroPy to shine as a single package and not just a collection of functionally independent modules, it needs seamless interoperability between three of its special and powerful constituents: coordinates, time and units. We seek to develop a protocol that allows these objects, the mixin columns, to be stored in FITS and ASCII tables while still ensuring round-tripping. Abiding by the rules set by the FITS standard requires mapping the special AstroPy objects, with their extensive associated metadata, to the standard keywords. We thus want to achieve the ideal goal of preserving complete information while still conforming to the defined standard.

I intend to work on the above project, Seamless Combination of SkyCoord, Table, WCS, and FITS. This would involve defining a protocol which could solve common astronomical problems, one of which is described below:

I have two source catalogues of a star cluster, extracted from images of the cluster taken at two different times, through two different filters. I then match the two catalogues on the basis of the RA, DEC values of the sources (stars). Each star then has an associated magnitude in two different filters. This catalogue can then be matched with another standard catalogue to calculate certain values like the cluster's distance, age and mass. I store all that information in the catalogue Table. AstroPy offers a lot of classes to make this easier, for example coordinates that know how to transform between different coordinate systems, which is required for matching the RA, DEC of two catalogues with possibly different WCS. Once I write my publication, I need to store that Table as a FITS file and print it out to LaTeX. This is where the problem arises: each of the special objects (coordinates, times, units) has metadata that does not easily fit into the data column.

I have quoted a problem that I faced while working on a previous astronomical photometry project. This is one of the common problems that astronomers face and I propose to work on solving it.

I should say that this project is fairly open. I have suggested one scenario where the different parts of AstroPy could work together better, but that is not the only thing that can be done in order to improve the situation.

Proposal Detailed Description

I would like to divide this project in the following stages:

  • API definition for writing SkyCoord and Time as normal columns in FITS tables

    This involves devising a standard API definition for writing SkyCoord objects (RA and DEC as table columns, with the appropriate WCS in the FITS header) and Time objects (either jd1 and jd2, a pair of doubles, or an offset, a single double, as a FITS table column, with the scale and optionally an appropriate reference time in the FITS header). These objects can then be written as normal columns in FITS tables, with the most important metadata associated with each stored in the relevant FITS keywords. I have been working on doing the same for Quantity objects in PR #5910, where the associated metadata being stored is their unit.

  • Defining protocols for metadata that make it work between different packages

    This could be solved using two approaches:

    1. Retrieve the (sub)class the object belongs to, followed by its relevant metadata. Map these to the existing FITS standard keywords. Advantage: This will conform to the existing standard and will make the stored metadata work with other astronomical packages. This would be helpful for someone who does not use AstroPy. Disadvantage: We will have to compromise on storing the entire metadata and thus round-tripping would not be possible.

    2. Retrieve the (sub)class the object belongs to and store the fully featured objects in the FITS tables, i.e. the complete metadata information, by defining a new FITS standard which will add keywords relevant to AstroPy. Advantage: This would be useful for round-tripping and thus would ensure that AstroPy users can preserve 100% of the information. Disadvantage: Defining a standard ensures interoperability between different communities and software packages. The FITS standard does this job. Defining a new standard for AstroPy would be helpful solely for the users of AstroPy, and other packages would be blind to this kind of metadata.

My approach towards solving this problem involves combining the advantages of both these approaches. This would involve:

Retrieving the (sub)class the object belongs to, followed by its relevant metadata. All the metadata that can be mapped to existing FITS standard keywords will be mapped and stored in the standard FITS encoding. We will then use Tom Aldcroft's ECSV definition to serialise all the metadata into a single YAML string, which can be stored directly in the FITS header to represent the AstroPy objects. This defines a new encoding that lets us associate conventional WCS and Time information with FITS data, and store alongside it any other information that can be expressed in terms of AstroPy objects. It provides a lossless way of transferring AstroPy objects from program to program, but based on FITS headers instead of free-format text. Although this approach provides better capabilities, it requires sophisticated routines to distinguish metadata that can be stored with existing keywords from metadata that can only be stored as serialised AstroPy objects in COMMENT cards.
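The combined approach could be sketched roughly as follows. This is only an illustration under stated assumptions: the attribute-to-keyword table `STANDARD_KEYWORDS` and the helper `split_metadata` are hypothetical names, the keyword choices are examples rather than a worked-out convention, and `json` stands in for the YAML serialisation purely to keep the sketch free of dependencies.

```python
import json

# Hypothetical mapping from mixin attribute names to FITS keywords.
# The choices below are illustrative, not a finished convention.
STANDARD_KEYWORDS = {
    "frame": "RADESYS",
    "equinox": "EQUINOX",
    "scale": "TIMESYS",
}


def split_metadata(meta):
    """Return (header_cards, serialised) for one mixin column's metadata.

    header_cards holds the entries that fit standard keywords, so other
    packages can read them. serialised holds every entry, so that nothing
    is lost when AstroPy reads the file back.
    """
    header_cards = {
        STANDARD_KEYWORDS[name]: value
        for name, value in meta.items()
        if name in STANDARD_KEYWORDS
    }
    # Assumption: JSON is used here only as a stand-in for YAML.
    serialised = json.dumps(meta, sort_keys=True)
    return header_cards, serialised


meta = {"frame": "FK5", "equinox": 2000.0, "obstime": "2017-03-29T00:00:00"}
cards, blob = split_metadata(meta)
print(cards)  # the part any FITS reader understands
print(blob)   # the complete metadata, for lossless recovery
```

Here `obstime` has no entry in the mapping, so it survives only in the serialised string, which is exactly the kind of attribute the second storage path is meant for.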

Timeline

  • 22nd April - 23rd May (Community bonding period)

I would spend this time understanding the mixin columns and the relevant metadata associated with them, along with a detailed study of the FITS standard keywords. I will generate a list of all the features of SkyCoord, Time and Quantity (and their subclasses) which are not found in the FITS standard. Since I already have a fair understanding of mixin columns, I propose to work on resolving the Details and Caveats of mixin columns listed in the docs during this period. By tackling these, I would get a better idea of the internal working of mixin columns, and there would be less distinction between mixin and normal columns.

  • 23rd May - 20th June (4 weeks)

I would utilise this peak time for devising and writing a standard API definition for writing SkyCoord and Time objects with relevant metadata to the existing FITS standard. This involves making mixin columns a bit smarter about representing themselves. This would be done by adding a method to the mixin protocol that returns a list of normal table columns and some meta information that goes into the table header. Table.write() would then check if mixin columns provide this method and make a temporary copy of the Table where mixin columns are replaced with the new regular table columns before writing.
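A minimal sketch of this mechanism, using toy classes rather than AstroPy's own. The method name `represent_as_columns`, the class `ToyCoordColumn` and the helper `plain_copy` are all hypothetical; the real protocol method and its return type are still to be designed.

```python
class ToyCoordColumn:
    """Stand-in for a mixin column. This is not an AstroPy class."""

    def __init__(self, name, ra, dec, frame):
        self.name = name
        self.ra = ra
        self.dec = dec
        self.frame = frame

    def represent_as_columns(self):
        """Hypothetical protocol method: plain columns plus header meta."""
        columns = {
            f"{self.name}.ra": list(self.ra),
            f"{self.name}.dec": list(self.dec),
        }
        meta = {f"{self.name}.frame": self.frame}
        return columns, meta


def plain_copy(table_columns):
    """Build the temporary, mixin-free table that a writer would see.

    table_columns maps column names to plain lists or mixin-like objects.
    The input mapping is left untouched.
    """
    out_columns, out_meta = {}, {}
    for name, col in table_columns.items():
        if hasattr(col, "represent_as_columns"):
            cols, meta = col.represent_as_columns()
            out_columns.update(cols)
            out_meta.update(meta)
        else:
            out_columns[name] = col
    return out_columns, out_meta


table = {
    "mag": [12.1, 13.4],
    "pos": ToyCoordColumn("pos", [10.0, 11.0], [-5.0, -6.0], "icrs"),
}
columns, meta = plain_copy(table)
print(sorted(columns))
print(meta)
```

Checking for the method with `hasattr` keeps the writer unaware of specific mixin classes, so new mixin types could opt in simply by providing the method.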

  1. Saving SkyCoord:

For writing, the default representation would transform a SkyCoord mixin column to normal (non-mixin) columns for RA and DEC and then add the appropriate WCS to the FITS header. Alternatively, this could be implemented as a 2D column inside a FITS table. Thus we would have to declare to the parent table that the SkyCoord column should be considered as effectively an embedded table (dict of columns) for purposes of representation. Then, the frame attributes would be written in appropriately-named meta entries and the WCS can be stored in appropriate header keywords.

  2. Saving Time:

We can currently save Time instances as FITS-compliant strings through the implementation in PR #3547. I would be implementing the remaining part, which is to store Time objects as binary columns with appropriate header information, the most important being the scale. The binary data can either be two doubles (jd1, jd2) or an offset from a reference time using a single double. The FITS standard explicitly allows giving the offset from a reference time by setting an appropriate reference time in the header, so we can use the DATEREF or MJDREF mechanism.

Thus, this would allow users to write SkyCoord and Time mixin columns as normal columns to FITS and to store crucial metadata using available keywords.

  • 20th June - 27th June (1 week, Mid-Term evaluation)

Although I would frequently be asking for feedback from my mentors and the AstroPy community, I would also solicit feedback from the wider astronomical community on the protocols defined above, in order to get a broader perspective. This will help make the protocol robust and easy to use. This time will also be used to implement test cases and fix any bugs encountered in the earlier step.

  • 27th June - 18th July (3 weeks)

During this period, I will be implementing a format for ECSV to write mixin metadata that is easily readable and parsable without AstroPy.

Serialising SkyCoord in ECSV will be a comparatively simple task once we have all the relevant frame attributes supplied by the mixin protocol defined above. I will be using and upgrading Tom Aldcroft's ECSV definition for serialising metadata and returning a complete YAML string.

Serialising Time would be a little more complicated. The AstroPy Time object has some features not found in the FITS standard (and the converse is most certainly true):

  1. Attributes location, delta_ut1_utc and delta_tdb_tt. In AstroPy Time these can all be arrays which are broadcastable to the Time values. In the FITS standard, location is a scalar expressed via keyword(s), and thus we need a way to store a vector location. I would need my mentors' advice on this. The delta_* attributes can most likely be recalculated and can thus be dropped for storage.
  2. The FITS standard appears to only consider three formats, ISO string, JD and MJD. Time allows several other formats like unix or cxcsec, etc. with a reference time, for representing the values. Some users may explicitly want to serialise their Time object in the format of their choosing.

Thus, I would utilise this time for serialising Time to ECSV using one of the two methods:

  1. Store jd1 and jd2, regardless of format. This will result in storing the time values as 128 bits, double-double precision and will prove to be a lossless version.
  2. Store the representation values for the format. This is lossy but meets the common-use needs. This will be used as the default representation with the option of overriding this with (1), when specified by the user.
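To make the precision trade-off between the two methods concrete, here is a small dependency-free sketch. It does not use AstroPy; it only mimics the idea of splitting a Julian Date into two doubles, with `fractions.Fraction` supplying an exact reference value. The date chosen is arbitrary.

```python
from fractions import Fraction

SECONDS_PER_DAY = 86400

# An exact Julian Date, kept as a rational number so it can act as the truth.
exact_jd = Fraction(2457841) + Fraction(123456789, 10**9)

# Single-value storage: one float64 holds the whole Julian Date.
single = float(exact_jd)

# Two-value storage: whole days in one float64, the remainder in another,
# in the spirit of the jd1 and jd2 pair.
whole_days = round(exact_jd)
jd1 = float(whole_days)
jd2 = float(exact_jd - whole_days)

error_single = abs(Fraction(single) - exact_jd) * SECONDS_PER_DAY
error_pair = abs(Fraction(jd1) + Fraction(jd2) - exact_jd) * SECONDS_PER_DAY

print(f"one double : {float(error_single):.3e} s")
print(f"two doubles: {float(error_pair):.3e} s")
```

The single double loses precision at roughly the tens-of-microseconds level for present-day Julian Dates, while the pair stays many orders of magnitude below that, which is why method (1) is the lossless option.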

  • 18th July - 15th August (4 weeks)

Now comes a tricky part, which is to exploit the FITS keywords to preserve as much metadata as possible in a form that other software packages can use. The various DATE-* header strings and the WCS keywords in the existing standard will be stored using the mixin protocol.

This would be followed by an approach to 100% round-tripping, as described in the detailed description. Once 100% round-tripping of any table to ECSV is supported, which will be done in the earlier step, we will take the single YAML string and store it using carefully constructed COMMENT cards, as described above. Thus, we will be defining a new standard for storing AstroPy objects in FITS, and an API will be designed to store them.
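One way the COMMENT-card storage could work is sketched below. The helper names `to_comment_cards` and `from_comment_cards` are hypothetical, and the metadata string is just an example assumed to come from the serialisation step. The sketch relies on the fact that a FITS card is 80 characters long with the keyword occupying the first 8, which leaves 72 characters of free text per COMMENT card.

```python
CARD_TEXT_WIDTH = 72  # 80-character card minus the 8-character keyword field


def to_comment_cards(text, width=CARD_TEXT_WIDTH):
    """Split one long string into COMMENT card images of at most 80 chars."""
    return ["COMMENT " + text[start:start + width]
            for start in range(0, len(text), width)]


def from_comment_cards(cards):
    """Reassemble the original string from the card images, in order."""
    return "".join(card[8:] for card in cards)


# Example metadata string (assumed output of the serialisation step).
serialised = ("{class: SkyCoord, frame: fk5, equinox: J2000.0, "
              "obstime: 2017-03-29T00:00:00, representation: spherical, "
              "columns: [ra, dec], units: [deg, deg]}")

cards = to_comment_cards(serialised)
for card in cards:
    print(card)
print(from_comment_cards(cards) == serialised)
```

A real implementation would also have to cope with line breaks in the YAML, with readers that strip trailing whitespace from cards, and with unrelated COMMENT cards already present in the header, which is part of what "carefully constructed" has to cover.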

For storing time in FITS, instead of the serialised time strings, the lossless version of storing jd1 and jd2 would be preferred.

  • 15th August - 23rd August (1 week)

The final week. I would add AstroPy tutorials and update the AstroPy core package documentation to show the new capabilities. I would also clean up the code (make it PEP8 compliant) and fix/add tests if required. This can also serve as a buffer week to work on adding support for FITS Schema, a simple data modelling and validation language for FITS, in order to conform to the new AstroPy standard (on discussion with mentor). This would ensure easier extensibility of this project.

  • 23rd August - 29th August (1 week)

Mentors submit final project reports to Google.

  • 30th August

Final results of Google Summer of Code 2017 announced.

Note: I would be writing documentation, tests and doc-tests along with the code for the protocol proposed in the project. This would make sure that mentors are able to understand my ideas clearly and would ensure correctness and ease of maintenance. I would be committing changes and pushing them regularly to my remote development branches, so that AstroPy developers can review my code and provide valuable feedback.

Software packages to be used:

astropy, numpy, re, yaml, scipy (maybe)

Links to patch/code samples:

Links to additional information:

4. Other Schedule Information

Commitments:

I would be interning elsewhere from 15th June - 15th July. The office timings are from UTC 5:00 A.M - UTC 12:00 A.M. The time difference should not be a problem because I would be finishing my internship day when AstroPy developers wake up, so I can communicate with my mentors without any problems. I have familiarity with the initial part of my project (allowing mixin columns to be written as normal columns), having worked on the PR (#5910). My familiarity with the mixin protocol should compensate here. Nevertheless, I would be able to devote 6-7 hours on weekdays and as many as I want on weekends. This way, I would be able to devote the required minimum of ~35-40 hours per week to Summer of Code.

I would not be multi-tasking; I would reserve 6-7 hours of uninterrupted time every weekday (during the 4 weeks of my other internship) for the work. I would be entirely free on weekends, so I would face no hurdles there. Any extra time, if required, would be devoted to Summer of Code. I am certain that I will be able to use my familiarity with this project to work at a faster pace.

I am completely free after the 15th of July and can devote time as I see fit after that. My classes for the final year begin from the first week of August. Having finished all discipline courses by the end of this academic year, my academic load would be extremely minimal.

I do not have any logistical issues: I have 24/7 access to high-speed internet and would not be travelling anywhere during the internship period.

GSOC participation:

  • I have not participated in GSoC before.
  • I am not applying to any other organization.

5. How do I propose to complete this project

I have been contributing to Astropy, especially to the astropy.io and astropy.table modules, for a while now. I have spent significant time understanding the codebase, the io and table functionalities, their interdependencies and the driving cause for the use of mixin columns. I am now comfortable working with mixin columns.

The starting point for this project was to tackle issue #3685, for which I have opened Pull Request #5910. Contributing to this PR made it easier for me to sketch the problem precisely, and reviews from one of the project mentors, Tom Aldcroft, and from other Astropy members and contributors helped me understand the internals of the underlying writers for Tables, the PEP8 conventions, the rule of minimising differences and the build-time errors due to imports. I have strived hard to work on them and I know I have so much more to learn.

The Travis build tests were an undisputed help to understand the various checks regarding PRs and how to solve them. Thanks to Brigitta Sipocz, who walked me through the procedure of rebasing and helped me solve many other issues, I realised how powerful git is and got a fair idea of the workflow of AstroPy, right from the build process to the coding conventions, to the norms of documentation.

Taking the advice of Hans Moritz Günther, also a mentor, to work on a few PRs to the fullest of my abilities helped me understand things better, and I now have enough understanding to start working on mixin columns and their interoperability. I have worked with FITS files and the standard for over a year now, due to which I can relate to the problem statement better and can work on this project at a faster pace. I will be sticking to the deadlines and incorporating all the feedback and changes requested by the AstroPy community. I will be regularly discussing the protocol definition with my mentors and the community, and getting as much feedback as possible, so that this project is useful and easy to adopt for the entire astronomical community.

I would consult Stack Overflow and the regular sources (Google, the Python docs, etc.) to get past hurdles in this project.

5. Deliverables

6. Benefits to The Community.

7. About Me.

The universe is expanding each day and so is our knowledge of its complexity. I do not understand it completely, but I would like to spend my life trying to know it a little better. This is me, quoting in my own words, a perception of myself that I believe in.

I was introduced to night sky observations at a very young age, thanks to my father's love for the stars, which I can now proudly say runs in our genes. In school, I developed an affinity towards physics, because it fed my curiosity about the clockwork of the universe, and towards mathematics, the language of physics. I came across the field of computing rather late. When in junior college, one of my friends asked me to be his partner in a programming competition. I did not know how to code, but the problems were enticing and those three hours transformed my life. I was absolutely amazed at how a difficult problem could be solved by tackling one logical component at a time, at least if it is not an NP-complete problem. And I knew just what I wanted to do.

Coming to my formal introduction, I am Aarya Patil, a third-year undergraduate student at Pune Institute of Computer Technology (India) majoring in computer engineering, and I am immensely passionate about the world of computational astrophysics. This very reason has led me towards taking up independent research. Last summer, I worked under the supervision of a Data Scientist at the Inter-University Centre for Astronomy and Astrophysics (IUCAA), on a research project titled ‘Estimating the Distance to Star Clusters (open and globular) using Automated Photometry and Unsupervised Learning’. I am currently working under a Post-Doctoral student at IUCAA to try and solve the problem of “Classification of Transient objects using Machine Learning techniques on Light Curve data”. Both projects being developed in Python, I am a fanatic of this language, of its ease, power and extensibility. I believe AstroPy could soon bring in all the strength IRAF has (and even more), and it would not be possible without this beautiful language. I have also been reading, writing and processing FITS files for over a year now. This project thus does not fail to excite me.

My introduction to the world of Open Source was rather recent. It has been a delight to converse with the Astropy developers, understand their ideologies and put across new ideas to this altogether collaborative environment. I would like to thank the developers, who help me learn new things every day and will continue to do so.

It has been rather stereotypical to either be a hopeless romantic of the arts or to develop a scientific skepticism. But I have never been a stickler for stereotypes. I like to believe great science requires greater imagination. So when I am not engrossed in the activities listed above, you can find me contemplating the world and its residents on my Art Blog.

8. Eligibility

Yes, I am eligible to receive payments from Google. In case you need any clarification or further explanation of any approach/feature, feel free to contact me at [email protected] and I would be happy to provide it to you.
