Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RADPS (NRAO) Requirements and Feature Requests #64

Open
Jan-Willem opened this issue Aug 12, 2024 · 7 comments
Open

RADPS (NRAO) Requirements and Feature Requests #64

Jan-Willem opened this issue Aug 12, 2024 · 7 comments

Comments

@Jan-Willem
Copy link

Jan-Willem commented Aug 12, 2024

As mentioned in #63 here is what the NRAO would be interested in working on (casa-formats-io might already have some of these features):

  • Read-only CASA table access that is comparable in speed to python-casacore.
  • Lightweight and pip-installable:
    • Mac OS and Linux
    • Python 3.9-3.12
  • Single-threaded (no Dask)
  • Thread-safe
  • Example API (this demonstrates the required functionality but is not prescriptive):
ct = casa_table(table_name: str, row_ids: (list/np.array of int)/slice) # The row_ids that will be read (at this stage nothing is loaded into memory). 

ct.nrows()->int #Number of rows.

ct.get_key_words()->dict #Get the value of all table keywords. See https://casacore.github.io/python-casacore/casacore_tables.html#casacore.tables.table.getkeywords

ct.get_column_descriptions(col_names:list of str)->dict #Dictionary where the keys are the column names and the values are column descriptions. See https://casacore.github.io/python-casacore/casacore_tables.html#casacore.tables.table.getcoldesc.

ct.get_columns(col_names: list of str)->dict #Dictionary where the keys are the column names and values are n-dimensional arrays. Should throw an exception when the selected rows don’t have consistent dimensions. For example, selecting the DATA column in the main table and the rows span multiple DDIs. Note that selecting rows that span multiple DDIs for columns that have constant shape should be allowed, for example, SCAN_NUMBER, STATE_ID, DDI, etc.

@astrofrog, @keflavich, @e-koch, please let us know if you think something like this is feasible. We would, of course, be happy to contribute developer effort to achieve this.

@Jan-Willem
Copy link
Author

Here is an IPython notebook that compares the performance of casa-formats-io and python-casacore:

IPython notebook: casa_formats_io_vs_python_casacore.ipynb

Dataset (3.36 GB): VLASS3.2.sb45755730.eb46170641.60480.16266136574_spw10_split.ms.zip

On my Mac M3, casa-formats-io takes approximately 11 seconds and python-casacore takes approximately 3 seconds to read all of the main table data. Initial tests show that the time taken to read the data and perform reshaping (using np.fromfile and casa_formats_io._casa_chunking._combine_chunks) is comparable between the two libraries. Therefore, the performance difference is likely related to how the data gets organized.

@Jan-Willem Jan-Willem changed the title RADPS (NRAO) Requirements and Feature requests RADPS (NRAO) Requirements and Feature Requests Aug 12, 2024
@astrofrog
Copy link
Member

Thanks, this is very useful! I haven't really done any performance optimisation in casa-formats-io at this point so I am sure there is a lot of low hanging fruit. I will have a think about the requirements and will follow up soon.

@Jan-Willem
Copy link
Author

@astrofrog, any update on your thoughts about the requirements?

@astrofrog
Copy link
Member

Sorry for not getting back to you sooner, I was off work for a significant fraction of the summer. I have had a chance to think about the requirements you mention, and have a few follow-up questions/comments.

First, do you need to be able to access just part of the data, or would you always load an entire column into memory?

Second, you mention 'Single-threaded (no Dask)' - note that it is possible to use dask in single-threaded mode, so just to make sure we are on the same page, do you object in general to making use of the dask API (specifically the fact that the astropy table we currently return has dask arrays that require .compute() to be called) or are you ok with this as long as the computation is single-threaded? The main purpose of using dask is to be able to lazily load part of the data with a nice API.

The high-level API I was striving for here aims to completely hide away the details of a table to a user, and they would inspect the table using e.g. len(table) and use the standard astropy Table API to get column information and so on. However, that high-level API may not be suitable for all.

If the use of the dask API is a deal breaker, maybe we could agree on a public lower-level API that both you and the dask interface could use.

@Jan-Willem
Copy link
Author

  • We need to be able to load both entire columns and subselections of rows in a column. For example, we would like to load the entire SCAN_NUMBER column but only certain rows of the VISIBILITY column.
  • We want to call the casa-format-io API within dask.delayed() functions (using Dask within Dask should usually be avoided).
  • A public lower-level API would work.

@Jan-Willem
Copy link
Author

@astrofrog, we now have some developer time available and plan to start looking into this. Have you had a chance to consider it further?

@astrofrog
Copy link
Member

@Jan-Willem - sorry for the delay, I'll try and reply tonight! I'm sure we can find a way forward to avoid duplicating efforts, I'll write up some thoughts/suggestions this evening.

In any case, one thing that definitely needs doing is documenting the format, so that would be worthwhile starting if you have immediate time available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants