Merge pull request #44 from wcschultz/data_access_nb
Data Discovery and Access Notebook Update
tddesjardins authored Sep 18, 2024
2 parents 3d3de3e + 63fa4e2 commit bc86767
Showing 1 changed file with 56 additions and 34 deletions.
@@ -32,7 +32,7 @@
"source": [
"## Kernel Information\n",
"\n",
"To run this notebook, please select the \"Roman Calibration\" kernel at the top right of your window.\n",
"To run this notebook on the Roman Science Platform, please select the \"Roman Calibration\" kernel at the top right of your window.\n",
"\n",
"## Imports\n",
"Here we import the required packages for our data access examples including:\n",
@@ -74,13 +74,16 @@
},
"source": [
"## Introduction\n",
"This notebook is designed to provide examples of accessing data from the science platform. It demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is *HIGHLY* recommended. However, we understand that some use-cases will require downloading the data locally, so we also provide an example of how to do this at the end of the notebook.\n",
"This notebook is designed to provide examples of accessing data from the science platform. Due to the survey nature of the Roman Space Telescope, it will produce large data volumes of data that will need to be easily and quickly accessed to perform scientific tasks like creating catalogs, difference imaging, generating light curves, etc. Downloading all the required data would burden most users by requiring excessive data storage solutions (likely >10TB).\n",
"\n",
"This notebook demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is *HIGHLY* recommended. However, we understand that some use-cases will require downloading the data locally, so we also provide an example of how to do this at the end of the notebook.\n",
"\n",
"During operations, each Roman data file will be given a Unique Resource Identifier (URI), an analog to an online filepath that is similar to a URL, which points to where the data is hosted on the AWS cloud. Users will retrieve these URIs from one of several sources including MAST (see [Accessing WFI Data](https://roman-docs.stsci.edu/data-handbook-home/accessing-wfi-data) for more information) and will be able to use the URI to access the desired data from the cloud. \n",
"\n",
"Here-in we examine how to download data from two types of sources:\n",
"- The STScI MAST server which hosts data for in-flight telescopes including Hubble, TESS, and JWST\n",
"- The STScI MAST server which hosts data for in-flight telescopes including Hubble, TESS, and JWST and will host Roman data in the future\n",
"- Simulated Roman Space Telescope data hosted in storage containers on the AWS cloud\n",
"\n",
"\n",
"### Defining terms\n",
"- *Cloud computing*: the practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than using a local server or a personal computer.\n",
"- *AWS*: Amazon Web Services (AWS) is the cloud computing platform provided by Amazon.\n",
@@ -100,7 +103,7 @@
"metadata": {},
"source": [
"## Accessing MAST Data\n",
"In this section, we will go through the steps to retreive archived MAST data from the cloud including how to query the archive, stream the files directly from the cloud, as well as download them locally.\n",
"In this section, we will go through the steps to retreive archived MAST data from the cloud including how to query the archive and stream the files directly from the cloud, as well as download them locally.\n",
"\n",
"### Enabling Cloud Access\n",
"The most important step for accessing data from the cloud is to enable *astroquery* to retreive URIs and other relevant cloud information. Even if we are working locally and plan to download the data files (not recommended for Roman data), we need to use this command to copy the file locations."
@@ -124,7 +127,9 @@
"metadata": {},
"source": [
"### Querying MAST\n",
"Now we are ready to begin our query. This example is rather simple, but it is quick and easy to reproduce. We will be querying JWST NIRCAM data of M83. In our query, we specify that we want to look at JWST data using the F444W filter and NIRCAM. We also specify the proposal id to easily get the data of interest. Once we get the desired observations, we gather the list of products that go into the observations. We then filter the products to gather all the rate image data products which still leaves us with 144 filtered products. To reduce the number of URIs we filter through, we choose a single observation to continue with in this notebook."
"Now we are ready to begin our query. This example is rather simple, but it is quick and easy to reproduce. We will be querying HST WFC3/IR data of M83. In practice, the science platform should primarily be used for analyzing and exploring Roman data products. However due to the smaller file sizes, HST WFC3/IR data provides a nice example. The process is identical regardless of which space telescope is used.\n",
"\n",
"In our query, we specify that we want to look at HST data using the F160W filter and WFC3/IR. We also specify the proposal id to easily get the data of interest. Once we get the desired observations, we gather the list of products that go into the observations. We then filter the products to gather all the level 3 science data products associated with a specific project which still leaves us with 60 data products."
]
},
{
@@ -133,18 +138,17 @@
"metadata": {},
"outputs": [],
"source": [
"obs = Observations.query_criteria(obs_collection='JWST',\n",
" filters='F444W',\n",
" instrument_name='NIRCAM/IMAGE',\n",
" proposal_id=['1783'],\n",
" dataRights='PUBLIC')\n",
"obs = Observations.query_criteria(obs_collection='HST',\n",
" filters='F160W',\n",
" instrument_name='WFC3/IR',\n",
" proposal_id=['11360'],\n",
" dataRights='PUBLIC',\n",
" )\n",
"products = Observations.get_product_list(obs)\n",
"\n",
"filtered = Observations.filter_products(products,\n",
" productSubGroupDescription='RATE')\n",
"print('Filtered data products:\\n', filtered, '\\n')\n",
"single = Observations.filter_products(filtered,\n",
" obsID='87766440')\n",
"print('Single data product:\\n', single, '\\n')"
" calib_level=[3], productType=['SCIENCE'], dataproduct_type=['image'], project=['CALWF3'])\n",
"print('Filtered data products:\\n', filtered, '\\n')"
]
},
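As a quick sanity check of the query results, we can peek at the filtered table (a sketch; the `productFilename` column name assumes the standard MAST product-list table):

```python
# Show the first few product filenames from the filtered astropy Table
print(filtered['productFilename'][:5])
```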
{
@@ -160,15 +164,15 @@
"metadata": {},
"outputs": [],
"source": [
"uris = Observations.get_cloud_uris(single)\n",
"uris = Observations.get_cloud_uris(filtered)\n",
"uris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list. Thus, we need to extract the individual URI strings to access the files."
"The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list. Thus, we need to extract an individual URI string to access the file. Here we choose the first URI, but in practice you would select the URI associated with the desired file."
]
},
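The collapsed cell that follows presumably extracts the URI along these lines (a sketch):

```python
# get_cloud_uris always returns a list; take the first entry as our working URI
uri = uris[0]
print(uri)
```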
{
@@ -185,7 +189,7 @@
"metadata": {},
"source": [
"### Streaming files directly into memory\n",
"Here, we will use `s3fs` to directly access the data stored in the AWS S3 servers. Note that we must set `anon=True` to acces the files."
"Here, we will use `s3fs` to directly access the data stored in the AWS S3 servers. Typically to access data from AWS, authentication or log-in credentials need to be passed into `S3FileSystem`. This is primarily used to access private S3 servers. However to access publicly available data, `s3fs` can be used in \"anonymous\" mode by setting `anon=True`. As the data on MAST is publicly available, we will use the anonymous mode here."
]
},
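The collapsed cell below presumably initializes the filesystem like this (a sketch using the public `s3fs` API):

```python
import s3fs

# Anonymous access is sufficient for MAST's public cloud data
fs = s3fs.S3FileSystem(anon=True)
```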
{
@@ -211,12 +215,13 @@
"outputs": [],
"source": [
"# Open the file in AWS: 'F' is the S3 file\n",
"import numpy as np\n",
"with fs.open(uri, 'rb') as f:\n",
" # Now actually read in the FITS file \n",
" with fits.open(f, 'readonly') as HDUlist:\n",
" HDUlist.info()\n",
" sci = HDUlist[1].data\n",
"type(sci)"
"print(type(sci))"
]
},
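At this point `sci` is an ordinary NumPy array held in memory, so it can be inspected or analyzed without any local file (a sketch):

```python
# Basic checks on the streamed science array
print(sci.shape, sci.dtype)
print(np.nanmedian(sci))  # numpy was imported as np in the cell above
```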
{
@@ -232,18 +237,22 @@
"source": [
"## Streaming from the Roman Science Platform S3 Bucket\n",
"\n",
"Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available in a separate S3 bucket. These files can be streamed in exactly the same way as the JWST FITS file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the `s3fs` documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#)."
"Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available in a separate S3 bucket. These files can be streamed in exactly the same way as the HST FITS file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the `s3fs` documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#).\n",
"\n",
"The S3 bucket containing the data is currently only open to the public on the science platform where we have managed the permissions so none need to be specified explicitly. Because of the required permissions, the cell below will not work on a private comuter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"asdf_dir_uri = 's3://roman-sci-test-data-prod-summer-beta-test/'\n",
"fs = s3fs.S3FileSystem()\n",
"fs = s3fs.S3FileSystem(anon=True)\n",
"\n",
"asdf_dir_uri = 's3://roman-sci-test-data-prod-summer-beta-test/'\n",
"fs.ls(asdf_dir_uri)"
]
},
@@ -260,7 +269,9 @@
"- `DENSE_REGION`: contains calibrated and uncalibrated simulated data of dense stellar fields obtained with different filters for all the eighteen WFI detectors. The data are separarted into two directories, each with a different pointings. Filenames in these directories use the prefixes `r0000101001001001001*` and `r0000101001001001002*`, which correspond to the use of the F158 and F129 optical elements respectively.\n",
"- `GALAXIES`: contains one calibrated, simulated image of a galaxy field obtained using the F158 optical element.\n",
"\n",
"Below, we use `roman_datamodels` to read the ASDF file corresponding to the dense region as an example."
"Below, we use `roman_datamodels` to read the ASDF file corresponding to the dense region as an example. To simplify the workflow we are providing a URI to the sample Roman data. During operations, the data would be referenced using the URI when perform queries through MAST or other data access methods that are currently under development.\n",
"\n",
"The file naming convention for Roman is quite elaborate as each includes all the relevant information about the observation. Please see the [Data Levels and Products](https://roman-docs.stsci.edu/data-handbook-home/wfi-data-format/data-levels-and-products) Roman documentation page for more information on the file naming conventions."
]
},
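The hidden portion of the next cell defines `asdf_file_uri`; a hypothetical sketch of that assignment (the directory and filename below are illustrative only, not the actual path in the bucket):

```python
# Hypothetical URI built from the bucket root and the DENSE_REGION prefix described above
asdf_file_uri = asdf_dir_uri + 'DENSE_REGION/r0000101001001001001_01101_0001_WFI01_cal.asdf'
```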
{
@@ -274,8 +285,14 @@
"with fs.open(asdf_file_uri, 'rb') as f:\n",
" dm = rdm.open(f)\n",
" \n",
"print(type(dm))\n",
"print(dm.meta)"
"print(dm.info())"
]
},
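Once opened, the datamodel's metadata can be read through attribute access, for example (a sketch, assuming the standard `roman_datamodels` schema):

```python
# Read a couple of observation metadata values from the datamodel
print(dm.meta.instrument.name)
print(dm.meta.instrument.optical_element)
```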
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have loaded Roman data into a datamodel, please review the [Working with ASDF Notebook](../working_with_asdf/working_with_asdf.ipynb) notebook to explore how to use them."
]
},
{
@@ -289,9 +306,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading Files Locally (not recommended)\n",
"## Downloading Files (not recommended)\n",
"\n",
"It is **not recommended** for users to download Roman data products due to the large file size and the number of the files that are expected from the survey nature of the mission. Instead, users are encouraged to construct and adopt workflows that utilize the file streaming services described above for the best experience.\n",
"\n",
"However, there may be instances where data files must be downloaded for certain specific science cases. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get)). Running the below cell will download the data to your personal instance of the science platform. However, the preliminary, simulated sample of Roman data on the science platform are currently not accessible outside of the science platform.\n",
"\n",
"Though it is **not recommended**, there may be instances where data files must be downloaded locally for certain specific science cases. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get))."
"**NOTE**: MAST data can be downloaded on your private computer using `anon=True` in the `S3FileSystem` initialization. However, the preliminary, simulated sample of Roman data on the science platform are currently not accessible outside of the science platform."
]
},
{
@@ -304,7 +325,8 @@
"# from pathlib import Path\n",
"# local_file_path = Path('data/')\n",
"# local_file_path.mkdir(parents=True, exist_ok=True)\n",
"# fs.get(uri, local_file_path)"
"# fs = s3fs.S3FileSystem()\n",
"# fs.get(URI, local_file_path)"
]
},
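For MAST data on a private computer, an equivalent sketch uses anonymous access (kept commented out to mirror the cell above; `uri` is the MAST URI retrieved earlier, and the local path is illustrative):

```python
# import s3fs
# fs = s3fs.S3FileSystem(anon=True)
# fs.get(uri, 'data/')  # download the file into a local 'data' directory
```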
{
@@ -337,7 +359,7 @@
"The data streaming information from this notebook largely builds off of the TIKE data-acces notebook by Thomas Dutkiewicz.\n",
"\n",
"**Author:** Will C. Schultz \n",
"**Updated On:** 2024-05-14"
"**Updated On:** 2024-09-16"
]
},
{
@@ -358,9 +380,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Roman Calibration latest (2024-03-25)",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "roman-cal"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -372,7 +394,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.3"
}
},
"nbformat": 4,
