CLIA-1 - Specification for Uploading Dataset to MongoDB #29

gitstart-app · 2024-08-01T16:16:03Z

This PR was created by GitStart to address the requirements from this ticket: CLIA-1.

Specification for Uploading Dataset to MongoDB

Overview
The task involves downloading a 3GB CSV file, converting it to a more efficient format, and
then uploading the data to a MongoDB database using a defined schema. This is a one-off
task, but the MongoDB schema may change in the future, so the script may need to be
adapted before final use to match any schema changes.

Tasks and Requirements

Download the data
We want to upload the data in this 3GB CSV file from here: https://osf.io/2f857
(metadata also available at the link).
A smaller file that you can work from is available at this link. You can read this with
pd.read_feather("all_indicators.feather")


Dataframe Conversion Pipeline
Create a pipeline to convert the output of the previous step to a dataframe ideal for
use with the provided schema.
Abstract operations into functions as required to keep the command chain concise
and readable.
Utilize Polars or Pandas for the dataframe operations.
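
As a rough illustration of the "concise command chain" idea, the sketch below uses pandas `DataFrame.pipe` to string together small named functions. The column names (`pmid`, `is_open_data`) and the helper functions are hypothetical and are not taken from the dataset or from this PR.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from column names and lower-case them."""
    return df.rename(columns=lambda name: name.strip().lower())


def drop_rows_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows that have no value in the given column."""
    return df.dropna(subset=[column]).reset_index(drop=True)


def cast_types(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Cast the listed columns to the requested dtypes."""
    return df.astype(dtypes)


def to_records(df: pd.DataFrame) -> list:
    """Turn the dataframe into one dict per row, ready for document creation."""
    return df.to_dict(orient="records")


# Hypothetical stand-in for the real input; the actual file would be read with
# pd.read_feather("all_indicators.feather").
raw = pd.DataFrame(
    {" PMID ": [1, 2, None], "Is_Open_Data": [True, False, True]}
)

records = (
    raw.pipe(normalize_columns)
    .pipe(drop_rows_missing, column="pmid")
    .pipe(cast_types, dtypes={"pmid": "int64"})
    .pipe(to_records)
)
```

Each step is a plain function of a dataframe, so steps can be reordered, removed, or tested in isolation without touching the chain itself.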


Schema Mapping and Data Upload
Use the output of the previous step with the preliminary schema provided for
MongoDB. Specifically use the Invocation class for the document model.


Testing
Write tests for the database upload step using a stub for the database. These tests
do not need to cover unforeseen issues, because this dataset is the only one the script
will be used for.
Use pytest with fixtures as required.
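
One way to test the upload step without a live database is sketched below. `StubCollection`, `batched`, and `upload_records` are hypothetical names, and the stub only imitates a synchronous `insert_many` call; the merged script may use an async client and a different interface.

```python
from typing import Any, Dict, Iterator, List


class StubCollection:
    """In-memory stand-in for a database collection (assumed interface)."""

    def __init__(self) -> None:
        self.documents: List[Dict[str, Any]] = []
        self.insert_calls = 0

    def insert_many(self, documents: List[Dict[str, Any]]) -> None:
        # Record what would have been written instead of contacting a server.
        self.documents.extend(documents)
        self.insert_calls += 1


def batched(records: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def upload_records(records: List[Dict[str, Any]], collection: Any, batch_size: int = 1000) -> int:
    """Send records to the collection in batches and return the number uploaded."""
    uploaded = 0
    for batch in batched(records, batch_size):
        collection.insert_many(batch)
        uploaded += len(batch)
    return uploaded


def test_upload_sends_every_record_in_batches() -> None:
    stub = StubCollection()
    records = [{"pmid": i} for i in range(5)]

    assert upload_records(records, stub, batch_size=2) == 5
    assert stub.documents == records
    assert stub.insert_calls == 3  # batches of 2, 2, and 1
```

Under pytest, the stub would typically be supplied through a function decorated with `@pytest.fixture` rather than constructed inside each test.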


Module Structure

All code, including tests, should be contained within a single Python module.
Ensure the conversion/upload functionality is accessible from Python as an
importable function from the module.
Write the script in a way that it can easily be altered to deal with schema changes.
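
A possible way to keep schema changes cheap is to hold the column-to-field mapping in a single dictionary, so that a renamed schema field means editing one line. The sketch below assumes hypothetical column and field names; `FIELD_MAP` and `map_row` are illustrative and do not appear in the PR.

```python
from typing import Any, Dict

# Hypothetical mapping from dataframe column name to schema field name.
# A schema change would be handled by editing this dictionary only.
FIELD_MAP: Dict[str, str] = {
    "pmid": "pmid",
    "is_open_data": "open_data",
}


def map_row(row: Dict[str, Any], field_map: Dict[str, str] = FIELD_MAP) -> Dict[str, Any]:
    """Rename the mapped columns of one row and drop everything else."""
    missing = [column for column in field_map if column not in row]
    if missing:
        raise KeyError(f"row is missing expected columns: {missing}")
    return {field: row[column] for column, field in field_map.items()}
```

For example, `map_row({"pmid": 1, "is_open_data": True, "extra": 0})` keeps only the two mapped columns and renames `is_open_data` to `open_data`.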

gitstart-app · 2024-08-01T16:16:10Z

This PR is estimated to cost between 100 and 120 credits.
🟡 By merging this PR you agree to this estimate. If you disagree, click here.


gitstart-nimhdsst · 2024-08-07T15:03:30Z

@leej3 FYI, the unrelated changes in this PR come from fixing the lint failures in the pre-commit tests. The tox test also seems to be failing on main because of a Python versioning issue.

gitstart-nimhdsst · 2024-08-07T15:04:11Z

scripts/invocation_upload.py

+# NOTICE: items without mapping to the dataframe
+unmapped = {
+    "article": "",
+    "is_relevant": None,
+    "is_explicit": None,
+    "user_comment": "",
+    "osm_version": "",
+}


@leej3 I retained the unmapped dictionary because some of its values are needed outside the metrics. The unmapped dict holds all the values that have no mapping to the dataframe, and each one is referenced in the appropriate location.
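
To illustrate the pattern described in this comment, the sketch below builds one document from a dataframe row and fills the fields that have no source column from the `unmapped` dictionary. The two-level layout (top-level fields plus a nested `metrics` dict) and the choice of which key goes where are assumptions made for the example; they are not the schema or the code from this PR.

```python
from typing import Any, Dict

# Same keys and defaults as the `unmapped` dictionary in the diff above.
unmapped: Dict[str, Any] = {
    "article": "",
    "is_relevant": None,
    "is_explicit": None,
    "user_comment": "",
    "osm_version": "",
}


def build_document(row: Dict[str, Any]) -> Dict[str, Any]:
    """Combine one row with the unmapped defaults (hypothetical layout)."""
    return {
        # Assumed top-level fields, filled from the defaults.
        "osm_version": unmapped["osm_version"],
        "user_comment": unmapped["user_comment"],
        # Assumed nested metrics: row values plus the remaining defaults.
        "metrics": {
            **row,
            "article": unmapped["article"],
            "is_relevant": unmapped["is_relevant"],
            "is_explicit": unmapped["is_explicit"],
        },
    }
```

Keeping the defaults in one dictionary means a field that later gains a real source column only has to be removed from `unmapped` and read from the row instead.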

leej3 · 2024-08-07T16:35:41Z

Looks good thanks. We can look at speeding things up separately.

draft implementation for uploading dataset to MongoDB

4378d68

gitstart-nimhdsst marked this pull request as draft August 1, 2024 16:18

agt24 marked this pull request as ready for review August 2, 2024 17:51

gitstart-nimhdsst and others added 2 commits August 7, 2024 03:05

change from object programming to functional programming + update to …

a9fadc7

…tests

Update pyproject.toml

73927a3

gitstart-nimhdsst requested a review from leej3 August 7, 2024 04:53

gitstart-nimhdsst reviewed Aug 7, 2024


save intermediate, use all cols, use synchronous

3b8be19

leej3 force-pushed the CLIA-1 branch from 11e9871 to 3b8be19 Compare August 7, 2024 15:48

revert to async for now and formatting

037bdcb

leej3 approved these changes Aug 7, 2024


leej3 merged commit 22712b8 into main Aug 7, 2024
0 of 2 checks passed

agt24 mentioned this pull request Aug 21, 2024

CLIA-2 - Funder Mapping Specification #36

Closed
