GSOC'17 Proposal
- Name: Aarya Patil
- Email: [email protected]
- Time Zone: +0530 GMT
- IRC Handle: Haikyu
- Github ID: AustereCuriosity
- Instant Messaging: Google Hangout ([email protected])
- University: Pune Institute of Computer Technology
- Major: Computer Science & Engineering
- Current Year: Third
- Expected Graduation Date: June 2018
- Degree: Bachelor of Engineering (Honors)
Astropy: A mixin protocol for seamless interoperability
Astropy has emerged as an integration of several independent packages (PyFits, asciitable), each responsible for an individual task in the storage, reduction and analysis pipeline for astronomical data sets. It has mostly unified their interfaces so that these pieces work together. However, for Astropy to shine as a single package rather than a collection of functionally independent modules, it needs seamless interoperability between three of its most powerful constituents (coordinates, time and units) and the underlying Astropy table writers. We seek to develop a protocol that allows these objects, the mixin columns, to be stored in FITS and ASCII tables while still round-tripping. Abiding by the rules of the FITS standard requires mapping the special Astropy objects, with their extensive associated metadata, to the standard keywords. The goal is thus to preserve complete information while remaining compatible with the defined standard.
I intend to work on the above project, Seamless Combination of SkyCoord, Table, WCS, and FITS. This would involve defining a protocol which could solve common astronomical problems, one of which is described below:
I have two source catalogues of a star cluster, extracted from images of the cluster taken at two different times, through two different filters. I then match the two catalogues on the basis of the RA, DEC values of the sources (stars). Each star then has an associated magnitude in two different filters. This catalogue can then be matched with another standard catalogue to calculate values such as the cluster's distance, age and mass. I store all that information in the catalogue Table. Astropy offers a lot of classes to make this easier, for example coordinates that know how to transform between different coordinate systems, which is required for matching the RA, DEC of two catalogues with possibly different WCS. Once I write my publication, I need to store that Table as a FITS file and print it out to LaTeX. This is where the problem arises: each of the special objects (coordinates, times, units) has metadata that does not easily fit into the data column.
I have quoted a problem that I faced while working on a previous astronomical photometry project. This is one of the common problems that astronomers face and I propose to work on solving it.
I should say that this project is fairly open-ended. I have suggested one scenario where the different parts of Astropy could work together better, but that is not the only thing that can be done in order to improve the situation.
I would like to divide this project into the following stages:
- API definition for writing SkyCoord and Time as normal columns in FITS tables
This involves devising a standard API definition for writing SkyCoord objects (RA and DEC columns, with the appropriate WCS in the FITS header) and Time objects (either jd1, jd2, a pair of doubles, or offset, a single double, as a FITS table column, with the scale and optionally a reference time in the FITS header). These objects would be written as normal columns in FITS tables, with the most important metadata associated with each stored in the relevant FITS keywords. I have been working on doing the same for Quantity objects in PR #5910, where the associated metadata being stored is the unit.
- Defining protocols for metadata that work across different packages
This could be solved using two approaches:
- Retrieve the (sub)class the object belongs to, followed by its relevant metadata. Map these to the existing FITS standard keywords.
Advantage: This will conform to the existing standard and will make the stored metadata work between other astronomical packages. This would be helpful for someone who does not use Astropy.
Disadvantage: We will have to compromise on storing the entire metadata and thus round-tripping would not be possible.
- Retrieve the (sub)class the object belongs to and store the fully featured objects in the FITS tables, i.e. the complete metadata information, by defining a new FITS standard which will add keywords relevant to Astropy.
Advantage: This would be useful for round-tripping and thus would ensure that Astropy users can preserve 100% information.
Disadvantage: A standard exists to ensure interoperability between different communities and software packages, and the FITS standard already does this job. A new standard defined for Astropy would help only Astropy users; other packages would be blind to this kind of metadata.
My approach towards solving this problem involves combining the advantages of both these approaches. This would involve:
Retrieving the (sub)class the object belongs to, followed by its relevant metadata. All the metadata that can be mapped to existing FITS standard keywords will be mapped and stored in the FITS encoding. We will then use Tom Aldcroft's ECSV definition to serialise all the metadata into a single YAML string, which can be stored in FITS headers directly in the form of Astropy objects. This defines a new encoding that lets us associate conventional WCS and Time information with FITS data, and also store any other information that can be expressed in terms of Astropy objects. It provides a lossless way of transferring Astropy objects from program to program, but based on FITS headers instead of free-format text. Although this approach provides better capabilities, it requires routines that distinguish between metadata that can be stored with existing keywords and metadata that can only be stored as Astropy objects in COMMENT cards.
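As a rough illustration of the combined approach, the sketch below splits a column's metadata into entries that map onto standard FITS keywords and a remainder that is serialised into one string. The mapping table, the function name `split_metadata` and the use of JSON as a stand-in for the YAML serialisation are assumptions made for this illustration; none of them is existing Astropy API.

```python
import json

# Hypothetical mapping from attribute names to standard FITS keywords.
# Only a few well-known keywords are listed; a real table would be longer.
FITS_KEYWORD_FOR = {
    "scale": "TIMESYS",    # time scale, e.g. UTC or TT
    "frame": "RADESYS",    # celestial reference frame, e.g. ICRS
    "equinox": "EQUINOX",
}


def split_metadata(meta):
    """Return (header_cards, leftover_string) for a metadata dict."""
    cards = {}
    leftover = {}
    for name, value in meta.items():
        keyword = FITS_KEYWORD_FOR.get(name)
        if keyword is not None:
            cards[keyword] = value      # representable in the FITS standard
        else:
            leftover[name] = value      # Astropy-specific, kept for round-tripping
    # JSON stands in here for the YAML string proposed in the text.
    return cards, json.dumps(leftover, sort_keys=True)


cards, leftover = split_metadata(
    {"scale": "UTC", "format": "unix", "location": None}
)
print(cards)
print(leftover)
```

Other packages would read only the standard cards, while Astropy could additionally parse the leftover string to restore the full object.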
- 5th May - 30th May (Community bonding period)
I would spend this time understanding the internals of mixin columns and the metadata associated with them. I would also do a detailed study of the FITS standard keywords. This would result in a list of all the features of SkyCoord, Time and Quantity (and its subclasses) which are not found in the FITS standard. Since I already have a fair understanding of mixin columns, I propose to work on solving the Details and Caveats of mixin columns, listed in the docs, during this period. With these tackled, there will be less distinction between mixin and normal columns. I would also work on storing the metadata of Quantity subclasses, which would be a rather simple task. (I would be working on this even before the community bonding period starts.)
- 31st May - 25th June (4 weeks)
I would utilise this peak time for devising and coding a standard API definition for writing SkyCoord and Time objects, with relevant metadata, to the existing FITS standard. This involves making mixin columns a bit smarter about representing themselves, by adding a method to the mixin protocol that returns a list of normal table columns and some meta information that goes into the FITS header. Table.write() would then check whether mixin columns provide this method and make a temporary copy of the Table in which mixin columns are replaced with the new regular table columns before writing.
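The mechanism described above can be sketched in plain Python. The class `SkyCoordLike`, the method name `represent_as_columns` and the helper `expand_mixins` are hypothetical names chosen for this illustration, and a dict stands in for the Table; the real protocol would be designed with the mentors.

```python
class SkyCoordLike:
    """Stand-in for a coordinate mixin column (not astropy's SkyCoord)."""

    def __init__(self, ra_deg, dec_deg, frame="icrs"):
        self.ra_deg = list(ra_deg)
        self.dec_deg = list(dec_deg)
        self.frame = frame

    def represent_as_columns(self, name):
        """Hypothetical protocol method: plain columns plus header metadata."""
        columns = {name + "_ra": self.ra_deg, name + "_dec": self.dec_deg}
        header = {"RADESYS": self.frame.upper()}
        return columns, header


def expand_mixins(table):
    """Return a copy of `table` with mixin columns replaced, plus header meta."""
    plain, header = {}, {}
    for name, col in table.items():
        if hasattr(col, "represent_as_columns"):
            cols, meta = col.represent_as_columns(name)
            plain.update(cols)
            header.update(meta)
        else:
            plain[name] = col  # ordinary column, passed through unchanged
    return plain, header


table = {"mag": [12.1, 13.4], "pos": SkyCoordLike([10.5, 11.0], [-3.2, -3.1])}
plain, header = expand_mixins(table)
print(sorted(plain))
print(header)
```

Because the writer works on the temporary copy, the original table keeps its mixin column and only the written file sees the plain columns.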
- Saving SkyCoord: (2 weeks)
For writing SkyCoord, the default representation would transform a SkyCoord mixin column into normal (non-mixin) columns for RA and DEC and then add the appropriate WCS to the FITS header. Alternatively, this could be implemented as a 2D column inside a FITS table. In that case, we would have to declare to the parent table that the SkyCoord column should be treated as an embedded table (dict of columns) for purposes of representation. The frame attributes would then be written to appropriately named meta entries.
- Saving Time: (2 weeks)
We can currently save Time instances as FITS-compliant strings through the implementation in PR #3547. I would implement the remaining part, which is to store Time objects as binary columns with appropriate header information, the most important being the scale. The binary data can either be two doubles (jd1, jd2) or an offset from a reference time, stored as a single double. The FITS standard explicitly allows giving the offset from a reference time by setting an appropriate keyword in the header. Thus, we can directly use the DATEREF or MJDREF mechanism.
Thus, this would allow users to write SkyCoord and Time mixin columns as normal columns to FITS and to store crucial metadata using available keywords.
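The two binary layouts mentioned for Time can be sketched as follows: a pair of doubles (jd1, jd2), and a single-double offset from a reference time recorded under MJDREF. The function names and the sample values are assumptions for this illustration, and exact rational arithmetic is used only so that the example does not depend on Astropy.

```python
from fractions import Fraction

MJD_ZERO_JD = Fraction("2400000.5")  # MJD = JD - 2400000.5


def split_jd(jd_exact):
    """Split an exact Julian Date into two doubles (jd1, jd2)."""
    jd1 = float(round(jd_exact))            # whole days, exactly representable
    jd2 = float(jd_exact - Fraction(jd1))   # small remainder keeps the precision
    return jd1, jd2


def offset_from_reference(jd_exact, mjdref):
    """Offset in days, as a single double, from the reference MJD."""
    return float(jd_exact - MJD_ZERO_JD - Fraction(mjdref))


jd_exact = Fraction("2457800.123456789012345")
jd1, jd2 = split_jd(jd_exact)

# Header entries that would accompany the offset column (illustrative values).
header = {"TIMESYS": "UTC", "MJDREF": 57799.0}
offset = offset_from_reference(jd_exact, header["MJDREF"])

print(jd1, jd2)
print(offset)
```

A reader that knows only the FITS standard can recover the time from `offset`, `MJDREF` and `TIMESYS`, while the (jd1, jd2) pair is closer to how Astropy holds the value internally.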
- 26th June - 30th June (1 week, Evaluation)
Although I would be asking my mentors and the Astropy community for feedback frequently, I would also solicit feedback from the wider astronomical community on the protocols defined above, in order to get a broader perspective and to make sure the protocol is useful and easy to adopt. This period will also be used to implement test cases and fix any bugs encountered in the earlier step.
- 1st July - 23rd July (3 weeks)
During this period, I will be sketching and implementing a format for writing mixin metadata in ECSV, that is easily readable/parsable without Astropy.
Serialising SkyCoord in ECSV will be a relatively simple task once we have all the relevant frame attributes supplied by the mixin protocol defined above. I will use and upgrade Tom Aldcroft's ECSV definition to serialise the metadata and return a complete YAML string. (1 week)
Serialising Time would be a little more complicated. The Astropy Time object has some features not found in the FITS standard (and the converse is most certainly true):
- Attributes location, delta_ut1_utc and delta_tdb_tt. In Astropy Time, these can all be arrays which are broadcastable to the Time values. In the FITS standard, location is a scalar expressed via keyword(s), so we need a way to store a vector location. I would need my mentors' advice on this. The delta_* attributes can most likely be recalculated and can thus be dropped for storage.
- The FITS standard appears to only consider three formats, ISO string, JD and MJD. Time allows several other formats like unix or cxcsec, etc. for representing the values. Some users may explicitly want to serialise their Time object in the format of their choosing.
Thus, I would utilise this time (2 weeks) for serialising Time to ECSV using one of two methods:
- Store jd1 and jd2, regardless of format. This stores the time values as 128 bits (double-double precision) and is a lossless version.
- Store the representation values for the format. This is lossy but meets common-use needs. It will be the default representation, with the option of overriding it with the first method when specified by the user.
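To give a feel for the difference between the two methods, the sketch below compares the rounding error of a Julian Date stored as one double (standing in for a stored format value) with the same date stored as a (jd1, jd2) pair. The sample date and the helper name `error_seconds` are assumptions for this illustration; exact rational arithmetic is used to measure the error.

```python
from fractions import Fraction

SECONDS_PER_DAY = 86400


def error_seconds(exact_jd, *doubles):
    """Absolute error, in seconds, of the sum of `doubles` against exact_jd."""
    stored = sum(Fraction(d) for d in doubles)
    return float(abs(stored - exact_jd) * SECONDS_PER_DAY)


exact = Fraction("2457800.123456789012345")

# Lossy: the whole value squeezed into a single double.
single = float(exact)

# Near-lossless: whole days in jd1, the small remainder in jd2.
jd1 = float(round(exact))
jd2 = float(exact - Fraction(jd1))

err_single = error_seconds(exact, single)
err_pair = error_seconds(exact, jd1, jd2)
print(err_single, err_pair)
```

For this date the single double is off by roughly ten microseconds, while the pair is accurate to well below a nanosecond, which is why the pair is the natural choice when exact round-tripping matters.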
- 24th July - 28th July (1 week, Evaluation)
This period would be solely used to implement test cases and solve any bugs encountered in the earlier steps. I would also be documenting all the work done.
- 29th July - 20th August (4 weeks)
Now comes a tricky part, which is to exploit the FITS keywords for preserving as much metadata as possible. This can then be used by other software packages. The various DATE-* header strings and the WCS keyword values in the existing standard will be stored using the mixin protocol. (2 weeks)
This would be followed by an approach to 100% round-tripping, as described in the detailed description. Once 100% round-tripping of any table to ECSV is supported (done in the earlier step), we will serialise the metadata into a single YAML string and store it in carefully constructed COMMENT cards. Thus, we will be defining a new standard for storing Astropy objects in FITS, and an API will be designed to do the same. (2 weeks)
For storing time in FITS, instead of the serialised time strings, the lossless version of storing jd1 and jd2 would be preferred.
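Storing one serialised string in COMMENT cards, as proposed for this stage, could look roughly like the sketch below. A FITS card is 80 characters long, so a COMMENT card leaves 72 characters for text. The function names and the sample metadata are assumptions for this illustration, and JSON stands in for the YAML string so that only the standard library is needed.

```python
import json

CARD_LENGTH = 80
PREFIX = "COMMENT "                       # keyword field, 8 characters
TEXT_WIDTH = CARD_LENGTH - len(PREFIX)    # 72 characters of free text


def to_comment_cards(text):
    """Split `text` into fixed-width COMMENT cards."""
    chunks = [text[i:i + TEXT_WIDTH] for i in range(0, len(text), TEXT_WIDTH)]
    return [(PREFIX + chunk).ljust(CARD_LENGTH) for chunk in chunks]


def from_comment_cards(cards):
    """Rebuild the string; assumes the text has no trailing spaces."""
    parts = [card[len(PREFIX):] for card in cards]
    if parts:
        parts[-1] = parts[-1].rstrip()    # only the last card carries padding
    return "".join(parts)


# Illustrative metadata for one Time column (not a defined Astropy schema).
meta = {
    "class": "Time",
    "format": "unix",
    "scale": "utc",
    "jd1_column": "t_jd1",
    "jd2_column": "t_jd2",
}
text = json.dumps(meta)   # ASCII-only and single-line by default
cards = to_comment_cards(text)
restored = json.loads(from_comment_cards(cards))
print(len(cards), restored == meta)
```

A real implementation would also need a marker so that these cards can be told apart from ordinary comments already present in the header.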
- 21st August - 29th August (1 week)
The final week. I would add Astropy tutorials and update the Astropy core package documentation to show the new capabilities. I would also clean up the code (make it PEP8 compliant) and fix/add tests if required. This can also serve as a buffer week to work on adding support for FITS Schema, a simple data modelling and validation language for FITS, in order to conform to the new Astropy standard (on discussion with mentor). This would ensure easier extensibility of this project.
- 29th August - 5th September (1 week)
Mentors submit final project reports to Google.
- 6th September
Final results of Google Summer of Code 2017 announced.
Note: I would be writing documentation, tests and doc-tests along with writing the code for the protocol proposed in the project. This would make sure that mentors are able to understand my ideas vividly and would ensure correctness and ease of maintainability. I would be committing changes and pushing them regularly to my remote development branches, so that Astropy developers can review my code and provide valuable feedback.
astropy, numpy, re, yaml, scipy (maybe)
Astropy
- Allow Quantity column to be written as a normal column to a FITS, votable or HDF5 file
- Specifying data types for columns when using fast ASCII readers, using the converter parameter
Merged PRs in other packages
- Sewpy
- Numpy
I have also opened the following issues in Astropy
- Adding Time column to already existing Table or QTable makes the column a Column object with dtype='object'
- Adding FITS_LDAC format for reading in astropy.table.Table.read()
Will be updating soon.
I do not plan to work elsewhere, nor do I plan to go on a vacation during this period. I will be having my sixth semester final exams during the period 15th May - 25th May. I am familiar with the initial part of this project (Allowing mixin columns to be written as normal columns), having worked on the PR (#5910). This would help me complete my work at a faster pace and compensate here. My classes for the final year begin from the first week of August. Having finished all the discipline courses by the end of this academic year, my academic load would be extremely minimal. Nevertheless, I would be able to devote as much time as needed for GSOC.
I do not have any logistical issues. I have 24/7 access to high-speed internet and I would not be travelling anywhere during the internship period.
- I have not participated in GSoC before.
- I am not applying to any other organization.
I have been contributing to Astropy, especially to the astropy.io and astropy.table modules, for a while now. I have spent significant time understanding the codebase, the io and table functionalities, their interdependencies and the motivation for the use of mixin columns. I am now comfortable working with mixin columns.
The starting point for this project was to tackle issue #3685, for which I have opened Pull Request #5910. Contributing to this PR made it easier for me to sketch the problem precisely. Reviews from one of the project mentors, Tom Aldcroft, and from other Astropy members and contributors helped me understand the internals of the underlying writers for Tables, the PEP8 conventions, the rule of minimising differences and the build-time errors caused by imports. I have worked hard on these and I know I have much more to learn.
The Travis build tests were an undisputed help to understand the various checks regarding PRs and how to solve them. Thanks to Brigitta Sipocz, who walked me through the procedure of rebasing and helped me solve many other issues, I realised how powerful git is and got a fair idea of the workflow of Astropy, right from the build process to the coding conventions, to the norms of documentation.
Taking the advice of Hans Moritz Günther, who is also a mentor for this project, to work on a few PRs to the fullest of my abilities helped me understand things better, and I now have enough knowledge to start working on mixin columns and their interoperability. I have worked with FITS files and the standard for over a year now, so I can relate to the problem statement better and can work on this project at a faster pace. I will stick to the deadlines and incorporate all the feedback and changes requested by the Astropy community. I will regularly discuss the protocol definition with my mentors and the community and gather as much feedback as possible, so that this project is useful and easy to adopt for the entire astronomical community.
I would consult Stack Overflow and the usual sources (Google, the Python docs, etc.) to overcome hurdles in this project.
Will be updating soon.
Will be updating soon.
The universe is expanding each day and so is our knowledge of its complexity. I do not understand it completely, but I would like to spend my life trying to know it a little better. This is me, quoting in my own words, a perception of myself that I believe in.
I was introduced to night sky observations at a very young age, thanks to my father's love for the stars, which I can now proudly say runs in our genes. In school, I developed an affinity towards physics, because it fed my curiosity about the clockwork of the universe, and towards mathematics, the language of physics. I came across the field of computing rather late. When in junior college, one of my friends asked me to be his partner in a programming competition. I did not know how to code, but the problems were enticing and those three hours transformed my life. I was absolutely amazed at how a difficult problem could be solved by tackling one logical component at a time, at least if it is not an NP-complete problem. And I knew just what I wanted to do.
Coming to my formal introduction, I am Aarya Patil, a third-year undergraduate student at Pune Institute of Computer Technology (India) majoring in computer engineering, and I am immensely passionate about the world of computational astrophysics. This very reason has led me towards taking up independent research. Last summer, I worked under the supervision of a Data Scientist at the Inter-University Centre for Astronomy and Astrophysics (IUCAA) on a research project titled ‘Estimating the Distance to Star Clusters (open and globular) using Automated Photometry and Unsupervised Learning’. I am currently working under a Post-Doctoral student at IUCAA, to try and solve the problem of “Classification of Transient objects using Machine Learning techniques on Light Curve data”. Since both projects are being developed in Python, I have become a great admirer of this language, of its ease, power and extensibility. I believe Astropy could soon bring in all the strength IRAF has (and even more), and that would not be possible without this beautiful language. I have also been reading, writing and processing FITS files for over a year now. This project thus does not fail to excite me.
My introduction to the world of Open Source was rather recent. It has been a delight to converse with the Astropy developers, understand their ideologies and put across new thoughts in this altogether collaborative environment. I would like to thank the developers, who help me learn new things every day and will continue to do so.
It has been rather stereotypical to either be a hopeless romantic of the arts or to develop a scientific skepticism. But I have never been a stickler for stereotypes. I like to believe great science requires greater imagination. So when I am not engrossed in the activities listed above, you can find me contemplating the world and its residents on my Art Blog.
Yes, I am eligible to receive payments from Google. In case you need any clarification or further explanation of any approach/feature, feel free to contact me at [email protected] and I would be happy to provide it to you.