Discuss pkey-fkey reindexing in tutorial #259

Merged
merged 1 commit into from
Aug 12, 2024
40 changes: 40 additions & 0 deletions tutorials/custom_dataset.ipynb
@@ -160,6 +160,46 @@
"Inside the `make_db` function we first download the raw files (or read them from the local filesystem) and then create a `relbench.base.Database` object from them. Thus, the `make_db` function serves as documentation for your pre-processing steps, while also letting you develop and debug them within the RelBench framework."
]
},
{
"cell_type": "markdown",
"id": "a9552931-ef4a-4c2b-b63e-6ec56738db31",
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-12T05:14:46.141436Z",
"iopub.status.busy": "2024-08-12T05:14:46.140960Z",
"iopub.status.idle": "2024-08-12T05:14:46.160386Z",
"shell.execute_reply": "2024-08-12T05:14:46.159281Z",
"shell.execute_reply.started": "2024-08-12T05:14:46.141396Z"
}
},
"source": [
"#### Pkey/Fkey Reindexing"
]
},
{
"cell_type": "markdown",
"id": "f9845d78-7943-47be-a1ce-da5b0b01a3d4",
"metadata": {},
"source": [
"The intended usage is not to call `make_db` directly but to use the `get_db` function, which internally calls `make_db` and adds further functionality such as caching."
]
},
{
"cell_type": "markdown",
"id": "c67714b3-f3f0-4841-b66f-3623242ff033",
"metadata": {},
"source": [
"Another important thing `get_db` does is call `db.reindex_pkeys_and_fkeys()` on the database `db` returned by `make_db`. This reindexes the primary-key and foreign-key columns so that each primary-key column holds consecutive integers starting from 0. This simplifies some downstream logic in RelBench, which can rely on the single assumption that pkeys and fkeys are integers, and consecutive ones at that."
]
},
{
"cell_type": "markdown",
"id": "dcc2dc99-128f-43dd-bb31-20752ee469d8",
"metadata": {},
"source": [
"If you want to preserve the original pkey values, either because you believe they are useful as features for predictive tasks or because you want to cross-reference prediction results with the original data source, add a duplicate of the column without marking it as `pkey_col`. The model designer is then free to decide whether to feed this duplicate column to the model."
]
},
{
"cell_type": "markdown",
"id": "fec0b02e-ce3e-426b-8fe2-38b8fa8d20c8",
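The reindexing and the duplicate-column advice described in the new cells can be illustrated with a small pandas sketch. This is not RelBench's implementation of `reindex_pkeys_and_fkeys`; the toy tables, the `reindex_pkey` helper, the `_orig` column suffix, and the choice to turn dangling foreign keys into missing values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables (not taken from RelBench).
users = pd.DataFrame({"user_id": ["u17", "u03", "u42"], "age": [31, 25, 47]})
orders = pd.DataFrame(
    {"order_id": [900, 901, 902, 903], "user_id": ["u03", "u42", "u03", "u99"]}
)


def reindex_pkey(parent, pkey_col, child, fkey_col):
    """Illustrative helper: rewrite a pkey column as 0..n-1 and remap the fkey."""
    parent = parent.copy()
    child = child.copy()

    # Keep the original key values in an ordinary (non-pkey) column.
    parent[f"{pkey_col}_orig"] = parent[pkey_col]

    # Map each original key to its new consecutive integer.
    new_ids = np.arange(len(parent))
    mapping = pd.Series(new_ids, index=parent[pkey_col].to_numpy())
    parent[pkey_col] = new_ids

    # Remap the foreign key; keys with no matching parent row become <NA>
    # (an assumption of this sketch).
    child[fkey_col] = child[fkey_col].map(mapping).astype("Int64")
    return parent, child


users_new, orders_new = reindex_pkey(users, "user_id", orders, "user_id")
print(users_new)
print(orders_new)
```

After the call, `users_new["user_id"]` holds the new integer keys while `users_new["user_id_orig"]` still holds the original strings, so predictions can be joined back to the original data source or the original values can be exposed as a feature.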