-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A NB draft showing an example of LSDB pipeline #18
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,236 @@ | ||
{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Line #4. %pip install aiohttp lsdb Just a note, I'm suggesting we add dask-expr as a dependency to this workflow, but it's already a requirement of the package so it's not needed here Reply via ReviewNB There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See below comment on from_legacy_dataframe.
Do we need to actually do the LSDB join for this use case? Would it be better to just do the join at the NestedFrame level:
from dask_expr import from_legacy_dataframe nf_object = NestedFrame.from_dask_dataframe(from_legacy_dataframe(lsdb_object._ddf)) nf_source = NestedFrame.from_dask_dataframe(from_legacy_dataframe(lsdb_source._ddf)) nf_joined = nf_object.add_nested(nf_source, "ztf_source", how="left") Reply via ReviewNB There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Were you planning on adding a bit of analysis to this? Maybe just a query and a function run would be helpful Reply via ReviewNB |
||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6171e5bbd47ce869", | ||
"metadata": {}, | ||
"source": [ | ||
"# Load large catalog data from the LSDB\n", | ||
"\n", | ||
"Here we load a small part of ZTF DR14 stored as HiPSCat catalog using the [LSDB](https://lsdb.readthedocs.io/)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c055a44b8ce3b34", | ||
"metadata": {}, | ||
"source": [ | ||
"## Install LSDB and its dependencies and import the necessary modules\n", | ||
"\n", | ||
"We also need `aiohttp`, which is an optional LSDB's dependency, needed to access the catalog data from the web." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "af1710055600582", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:00.759441Z", | ||
"start_time": "2024-05-24T12:53:58.854875Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"\n", | ||
"# Comment the following line to skip LSDB installation\n", | ||
"%pip install aiohttp lsdb" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e4d03aa76aeeb1c0", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:03.834087Z", | ||
"start_time": "2024-05-24T12:54:00.761330Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import nested_pandas as npd\n", | ||
"from lsdb import read_hipscat\n", | ||
"from nested_dask import NestedFrame\n", | ||
"from nested_pandas.series.packer import pack" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e169686259687cb2", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load ZTF DR14\n", | ||
"For the demonstration purposes we use a light version of the ZTF DR14 catalog distributed by LINCC Frameworks, a half-degree circle around RA=180, Dec=10.\n", | ||
"We load the data from HTTPS as two LSDB catalogs: objects (metadata catalog) and source (light curve catalog)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a403e00e2fd8d081", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:06.745405Z", | ||
"start_time": "2024-05-24T12:54:03.834904Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"catalogs_dir = \"https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf/\"\n", | ||
"\n", | ||
"lsdb_object = read_hipscat(\n", | ||
" f\"{catalogs_dir}/ztf_object\",\n", | ||
" columns=[\"ra\", \"dec\", \"ps1_objid\"],\n", | ||
")\n", | ||
"lsdb_source = read_hipscat(\n", | ||
" f\"{catalogs_dir}/ztf_source\",\n", | ||
" columns=[\"mjd\", \"mag\", \"magerr\", \"band\", \"ps1_objid\", \"catflags\"],\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "4ed4201f2c59f542", | ||
"metadata": {}, | ||
"source": [ | ||
"We need to merge these two catalogs to get the light curve data.\n", | ||
"It is done with LSDB's `.join()` method which would give us a new catalog with all the columns from both catalogs. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "b9b57bced7f810c0", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:06.770931Z", | ||
"start_time": "2024-05-24T12:54:06.746786Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# We can ignore warning here - for this particular case we don't need margin cache\n", | ||
"lsdb_joined = lsdb_object.join(\n", | ||
" lsdb_source,\n", | ||
" left_on=\"ps1_objid\",\n", | ||
" right_on=\"ps1_objid\",\n", | ||
" suffixes=(\"\", \"\"),\n", | ||
")\n", | ||
"joined_ddf = lsdb_joined._ddf\n", | ||
"joined_ddf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8ac9c6ceaf6bc3d2", | ||
"metadata": {}, | ||
"source": [ | ||
"## Convert LSDB joined catalog to `nested_dask.NestedFrame`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "347f97583a3c1ba4", | ||
"metadata": {}, | ||
"source": [ | ||
"First, we plan the computation to convert the joined Dask DataFrame to a NestedFrame." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9522ce0977ff9fdf", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:06.789983Z", | ||
"start_time": "2024-05-24T12:54:06.772721Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def convert_to_nested_frame(df: pd.DataFrame, nested_columns: list[str]):\n", | ||
" other_columns = [col for col in df.columns if col not in nested_columns]\n", | ||
"\n", | ||
" # Since object rows are repeated, we just drop duplicates\n", | ||
" object_df = df[other_columns].groupby(level=0).first()\n", | ||
" nested_frame = npd.NestedFrame(object_df)\n", | ||
"\n", | ||
" source_df = df[nested_columns]\n", | ||
" # lc is for light curve\n", | ||
" # https://github.com/lincc-frameworks/nested-pandas/issues/88\n", | ||
" # nested_frame.add_nested(source_df, 'lc')\n", | ||
" nested_frame[\"lc\"] = pack(source_df, name=\"lc\")\n", | ||
"\n", | ||
" return nested_frame\n", | ||
"\n", | ||
"\n", | ||
"ddf = joined_ddf.map_partitions(\n", | ||
" lambda df: convert_to_nested_frame(df, nested_columns=lsdb_source.columns),\n", | ||
" meta=convert_to_nested_frame(joined_ddf._meta, nested_columns=lsdb_source.columns),\n", | ||
")\n", | ||
"nested_ddf = NestedFrame.from_dask_dataframe(ddf)\n", | ||
"nested_ddf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "de6820724bcd4781", | ||
"metadata": {}, | ||
"source": [ | ||
"Second, we compute the NestedFrame." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a82bd831fc1f6f92", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:19.282406Z", | ||
"start_time": "2024-05-24T12:54:06.790699Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"ndf = nested_ddf.compute()\n", | ||
"ndf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9c82f593efca9d30", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-05-24T12:54:19.284710Z", | ||
"start_time": "2024-05-24T12:54:19.283179Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 2 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython2", | ||
"version": "2.7.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you getting that this is actually a Dask-Nested NestedFrame object? On my end it seems like this is stuck as a Dask DataFrame. Interestingly from_dask_dataframe works when loading the data directly using dask. Investigating...
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be an issue with LSDB using legacy frames, vs dask natively using dask-expr dataframes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.