Data Discovery and Access Notebook Update #44

Merged — 4 commits, Sep 18, 2024
Changes from 3 commits
@@ -32,7 +32,7 @@
"source": [
"## Kernel Information\n",
"\n",
"To run this notebook, please select the \"Roman Calibration\" kernel at the top right of your window.\n",
"To run this notebook on the Roman Science Platform, please select the \"Roman Calibration\" kernel at the top right of your window.\n",
"\n",
"## Imports\n",
"Here we import the required packages for our data access examples including:\n",
@@ -74,13 +74,16 @@
},
"source": [
"## Introduction\n",
"This notebook is designed to provide examples of accessing data from the science platform. It demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is *HIGHLY* recommended. However, we understand that some use-cases will require downloading the data locally, so we also provide an example of how to do this at the end of the notebook.\n",
"This notebook is designed to provide examples of accessing data from the science platform. Due to the survey nature of the Roman Space Telescope, it will produce large volumes of data that will need to be accessed easily and quickly to perform scientific tasks like creating catalogs, difference imaging, and generating light curves. Downloading all of the required data would burden most users by requiring excessive data storage solutions (likely >10 TB).\n",
"\n",
"This notebook demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is *HIGHLY* recommended. However, we understand that some use-cases will require downloading the data locally, so we also provide an example of how to do this at the end of the notebook.\n",
"\n",
"During operations, each Roman data file will be given a Uniform Resource Identifier (URI), an analog to an online filepath that is similar to a URL, which points to where the data is hosted on the AWS cloud. Users will retrieve these URIs from one of several sources including MAST (see [Accessing WFI Data](https://roman-docs.stsci.edu/data-handbook-home/accessing-wfi-data) for more information) and will be able to use the URI to access the desired data from the cloud. \n",
"\n",
"Herein, we examine how to access data from two types of sources:\n",
"- The STScI MAST server which hosts data for in-flight telescopes including Hubble, TESS, and JWST\n",
"- The STScI MAST server, which hosts data for in-flight telescopes including Hubble, TESS, and JWST, and will host Roman data in the future\n",
"- Simulated Roman Space Telescope data hosted in storage containers on the AWS cloud\n",
"\n",
"\n",
"### Defining terms\n",
"- *Cloud computing*: the practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than using a local server or a personal computer.\n",
"- *AWS*: Amazon Web Services (AWS) is the cloud computing platform provided by Amazon.\n",
@@ -100,7 +103,7 @@
"metadata": {},
"source": [
"## Accessing MAST Data\n",
"In this section, we will go through the steps to retreive archived MAST data from the cloud including how to query the archive, stream the files directly from the cloud, as well as download them locally.\n",
"In this section, we will go through the steps to retrieve archived MAST data from the cloud, including how to query the archive, stream files directly from the cloud, and download them locally.\n",
"\n",
"### Enabling Cloud Access\n",
"The most important step for accessing data from the cloud is to enable *astroquery* to retrieve URIs and other relevant cloud information. Even if we are working locally and plan to download the data files (not recommended for Roman data), we need to use this command to retrieve the cloud file locations."
@@ -124,7 +127,9 @@
"metadata": {},
"source": [
"### Querying MAST\n",
"Now we are ready to begin our query. This example is rather simple, but it is quick and easy to reproduce. We will be querying JWST NIRCAM data of M83. In our query, we specify that we want to look at JWST data using the F444W filter and NIRCAM. We also specify the proposal id to easily get the data of interest. Once we get the desired observations, we gather the list of products that go into the observations. We then filter the products to gather all the rate image data products which still leaves us with 144 filtered products. To reduce the number of URIs we filter through, we choose a single observation to continue with in this notebook."
"Now we are ready to begin our query. This example is rather simple, but it is quick and easy to reproduce. We will be querying HST WFC3/IR data of M83. In practice, the science platform should primarily be used for analyzing and exploring Roman data products; however, due to their smaller file sizes, HST WFC3/IR data provide a nice example. The process is identical regardless of which space telescope is used.\n",
"\n",
"In our query, we specify that we want to look at HST data using the F160W filter and WFC3/IR. We also specify the proposal ID to easily get the data of interest. Once we get the desired observations, we gather the list of products that go into the observations. We then filter the products to gather all of the level 3 science data products associated with a specific project, which still leaves us with 60 data products."
]
},
{
@@ -133,18 +138,17 @@
"metadata": {},
"outputs": [],
"source": [
"obs = Observations.query_criteria(obs_collection='JWST',\n",
" filters='F444W',\n",
" instrument_name='NIRCAM/IMAGE',\n",
" proposal_id=['1783'],\n",
" dataRights='PUBLIC')\n",
"obs = Observations.query_criteria(obs_collection='HST',\n",
" filters='F160W',\n",
" instrument_name='WFC3/IR',\n",
" proposal_id=['11360'],\n",
" dataRights='PUBLIC',\n",
" )\n",
"products = Observations.get_product_list(obs)\n",
"\n",
"filtered = Observations.filter_products(products,\n",
" productSubGroupDescription='RATE')\n",
"print('Filtered data products:\\n', filtered, '\\n')\n",
"single = Observations.filter_products(filtered,\n",
" obsID='87766440')\n",
"print('Single data product:\\n', single, '\\n')"
" calib_level=[3], productType=['SCIENCE'], dataproduct_type=['image'], project=['CALWF3'])\n",
"print('Filtered data products:\\n', filtered, '\\n')"
]
},
{
@@ -160,15 +164,15 @@
"metadata": {},
"outputs": [],
"source": [
"uris = Observations.get_cloud_uris(single)\n",
"uris = Observations.get_cloud_uris(filtered)\n",
"uris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list. Thus, we need to extract the individual URI strings to access the files."
"The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list, so we need to extract an individual URI string to access each file. Here we choose the first URI, but in practice you would select the URI associated with the desired file."
]
},
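For illustration, pulling a single URI out of the returned list might look like this (the URI below is a placeholder, not a real MAST product):

```python
# get_cloud_uris() always returns a list, even for a single product.
# Placeholder value for illustration only.
uris = ["s3://stpubdata/hst/public/ib6w/ib6wb4010/ib6wb4010_drz.fits"]

uri = uris[0]  # pick the URI for the file you want; here, the first one
print(uri)
```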
{
@@ -185,7 +189,7 @@
"metadata": {},
"source": [
"### Streaming files directly into memory\n",
"Here, we will use `s3fs` to directly access the data stored in the AWS S3 servers. Note that we must set `anon=True` to acces the files."
"Here, we will use `s3fs` to directly access the data stored in the AWS S3 servers. Typically, authentication or login credentials need to be passed into `S3FileSystem` to access data on AWS; this is primarily used for private S3 servers. However, to access publicly available data, `s3fs` can be used in \"anonymous\" mode by setting `anon=True`. As the data on MAST are publicly available, we will use the anonymous mode here."
]
},
{
Expand All @@ -211,12 +215,13 @@
"outputs": [],
"source": [
"# Open the file in AWS: 'F' is the S3 file\n",
"import numpy as np\n",
"with fs.open(uri, 'rb') as f:\n",
" # Now actually read in the FITS file \n",
" with fits.open(f, 'readonly') as HDUlist:\n",
" HDUlist.info()\n",
" sci = HDUlist[1].data\n",
"type(sci)"
"print(type(sci))"
]
},
{
@@ -232,7 +237,9 @@
"source": [
"## Streaming from the Roman Science Platform S3 Bucket\n",
"\n",
"Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available in a separate S3 bucket. These files can be streamed in exactly the same way as the JWST FITS file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the `s3fs` documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#)."
"Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available in a separate S3 bucket. These files can be streamed in exactly the same way as the JWST FITS file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the `s3fs` documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#).\n",
"\n",
"Because the S3 bucket is specific to the science platform, it is not publicly available. We have managed the permissions so that none need to be specified explicitly; however, we do need to create a new `S3FileSystem` instance with those permissions. As a result, the cell below will not work on a private computer without modification to include the correct credentials."
]
},
{
@@ -241,9 +248,9 @@
"metadata": {},
"outputs": [],
"source": [
"asdf_dir_uri = 's3://roman-sci-test-data-prod-summer-beta-test/'\n",
"fs = s3fs.S3FileSystem()\n",
"\n",
"asdf_dir_uri = 's3://roman-sci-test-data-prod-summer-beta-test/'\n",
"fs.ls(asdf_dir_uri)"
]
},
@@ -260,7 +267,9 @@
"- `DENSE_REGION`: contains calibrated and uncalibrated simulated data of dense stellar fields obtained with different filters for all eighteen WFI detectors. The data are separated into two directories, each with a different pointing. Filenames in these directories use the prefixes `r0000101001001001001*` and `r0000101001001001002*`, which correspond to the use of the F158 and F129 optical elements, respectively.\n",
"- `GALAXIES`: contains one calibrated, simulated image of a galaxy field obtained using the F158 optical element.\n",
"\n",
"Below, we use `roman_datamodels` to read the ASDF file corresponding to the dense region as an example."
"Below, we use `roman_datamodels` to read the ASDF file corresponding to the dense region as an example. To simplify the workflow, we provide the URI to the sample Roman data directly. During operations, URIs will be retrieved by performing queries through MAST or other data access methods that are not yet set up.\n",
"\n",
"The file naming convention for Roman is quite elaborate, as each filename includes all of the relevant information about the observation. Please see the [Data Levels and Products](https://roman-docs.stsci.edu/data-handbook-home/wfi-data-format/data-levels-and-products) Roman documentation page for more information on the file naming conventions."
]
},
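As a rough illustration of how much information a filename carries (the filename below is hypothetical, invented for this sketch; the linked documentation is authoritative), the underscore-separated fields can be split apart in Python:

```python
# Hypothetical Roman-style filename, for illustration only.
filename = "r0000101001001001001_01101_0001_WFI01_cal.asdf"

stem, suffix = filename.rsplit(".", 1)  # separate the extension
fields = stem.split("_")                # underscore-delimited fields

print(fields)  # ['r0000101001001001001', '01101', '0001', 'WFI01', 'cal']
print(suffix)  # 'asdf'
```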
{
@@ -274,8 +283,14 @@
"with fs.open(asdf_file_uri, 'rb') as f:\n",
" dm = rdm.open(f)\n",
" \n",
"print(type(dm))\n",
"print(dm.meta)"
"print(dm.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have loaded Roman data into a datamodel, please review the [Working with ASDF](../working_with_asdf/working_with_asdf.ipynb) notebook to explore how to use datamodels."
]
},
{
@@ -289,9 +304,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading Files Locally (not recommended)\n",
"## Downloading Files (not recommended)\n",
"\n",
"Though it is **not recommended**, there may be instances where data files must be downloaded for specific science cases. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get)). Running the cell below will download the data to your personal instance of the science platform; however, the same code could be combined with some of the cells above and run on your local machine to download the data to a private computer.\n",
"\n",
"Though it is **not recommended**, there may be instances where data files must be downloaded locally for certain specific science cases. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get))."
"*NOTE*: MAST data can be downloaded to your private computer using `anon=True` in the `S3FileSystem` initialization. However, to download the preliminary sample of Roman data, you will need to set up the credentials correctly. This is done for you on the science platform, but it would need to be done manually on a private computer."
]
},
{
@@ -304,7 +321,8 @@
"# from pathlib import Path\n",
"# local_file_path = Path('data/')\n",
"# local_file_path.mkdir(parents=True, exist_ok=True)\n",
"# fs.get(uri, local_file_path)"
"# fs = s3fs.S3FileSystem()\n",
"# fs.get(URI, local_file_path)"
]
},
{
@@ -337,7 +355,7 @@
"The data streaming information from this notebook largely builds off of the TIKE data-access notebook by Thomas Dutkiewicz.\n",
"\n",
"**Author:** Will C. Schultz \n",
"**Updated On:** 2024-05-14"
"**Updated On:** 2024-09-16"
]
},
{
@@ -358,9 +376,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Roman Calibration latest (2024-03-25)",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "roman-cal"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -372,7 +390,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.3"
}
},
"nbformat": 4,