CLIA-1 - Specification for Uploading Dataset to MongoDB #29

Merged: 5 commits, Aug 7, 2024

Conversation


@gitstart-app gitstart-app bot commented Aug 1, 2024

This PR was created by GitStart to address the requirements from this ticket: CLIA-1.


Specification for Uploading Dataset to MongoDB

Overview
The task involves downloading a 3GB CSV file, converting it to a more efficient format, and
then uploading the data to a MongoDB database using a defined schema. This is a one-off
task, but the MongoDB schema may change in the future, so the script may need to be
adapted to match any schema changes before final use.


Tasks and Requirements

  • Download the data
    We want to upload the data in this 3GB CSV file: https://osf.io/2f857 (metadata is
    also available at the link).
    A smaller file that you can work from is available at this link. You can read it with
    pd.read_feather("all_indicators.feather")
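One way to keep the full CSV manageable before converting it is to read only the columns that are needed and give them compact dtypes. The sketch below assumes pandas; the column names in `DTYPES` and the helper `read_indicators_csv` are hypothetical placeholders, and the real names would come from the dataset's metadata.

```python
import io

import pandas as pd

# Hypothetical column names and dtypes, for illustration only.
DTYPES = {"pmid": "Int64", "journal": "category", "is_open_data": "boolean"}


def read_indicators_csv(source, dtypes=DTYPES):
    """Read only the listed columns, using compact nullable dtypes."""
    return pd.read_csv(source, usecols=list(dtypes), dtype=dtypes)


# Tiny in-memory CSV standing in for the real 3GB file.
sample = io.StringIO(
    "pmid,journal,is_open_data,extra\n"
    "1,A,True,x\n"
    "2,B,False,y\n"
    "3,A,,z\n"
)
df = read_indicators_csv(sample)
```

The resulting frame could then be written out with `df.to_feather(...)`, which needs pyarrow installed, so that later steps read the smaller file instead of the CSV.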

  • Dataframe Conversion Pipeline
    Create a pipeline to convert the output of the previous step to a dataframe ideal for
    use with the provided schema.
    Abstract operations into functions as required to keep the command chain concise
    and readable.
    Utilize Polars or Pandas for the dataframe operations.
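A minimal sketch of what "abstracting operations into functions" could look like with pandas, using `DataFrame.pipe` to keep the chain short. The step functions, the `pmid` column, and the `osm_version` default are assumptions for illustration, not the PR's actual pipeline.

```python
import pandas as pd


def normalize_column_names(df):
    """Lower-case column names and replace spaces with underscores."""
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))


def drop_rows_missing(df, column):
    """Drop rows that have no value in the given column."""
    return df.dropna(subset=[column])


def add_constant(df, name, value):
    """Add a column holding the same value in every row."""
    return df.assign(**{name: value})


def to_schema_frame(df):
    """Chain the small steps; each one returns a new DataFrame."""
    return (
        df.pipe(normalize_column_names)
        .pipe(drop_rows_missing, "pmid")
        .pipe(add_constant, "osm_version", "")
    )


raw = pd.DataFrame({"PMID": [1, None, 3], "Is Open Data": [True, False, True]})
clean = to_schema_frame(raw)
```

Because every step takes and returns a DataFrame, a schema change usually means editing or swapping one small function rather than rewriting the chain.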

  • Schema Mapping and Data Upload
    Use the output of the previous step with the preliminary schema provided for
    MongoDB. Specifically use the Invocation class for the document model.
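The mapping step can be sketched as a small table from schema fields to dataframe columns, so that a schema change is mostly a change to that table. `InvocationSketch` below is a hypothetical stand-in written as a plain dataclass; it is not the project's Invocation class, and the field and column names are assumptions.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class InvocationSketch:
    """Hypothetical stand-in for the project's Invocation document model."""

    pmid: int
    metrics: dict = field(default_factory=dict)
    # Fields with no source column fall back to placeholder defaults.
    user_comment: str = ""
    osm_version: str = ""


# Schema field -> dataframe column (assumed names).
FIELD_TO_COLUMN = {"pmid": "pmid"}
# Columns collected under the nested "metrics" field (assumed names).
METRIC_COLUMNS = ["is_open_data", "is_open_code"]


def row_to_invocation(row):
    """Build one document from a single record (a dict keyed by column name)."""
    top_level = {name: row[col] for name, col in FIELD_TO_COLUMN.items()}
    metrics = {col: row[col] for col in METRIC_COLUMNS}
    return InvocationSketch(metrics=metrics, **top_level)


doc = row_to_invocation(
    {"pmid": 7, "is_open_data": True, "is_open_code": False, "journal": "X"}
)
```

With pandas, the input records could come from `df.to_dict(orient="records")`, which yields one dict per row.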

  • Testing
    Write tests for the database upload step using a stub for the database. These tests
    do not need to cover unforeseen issues, since this dataset is the only one the script
    will be used for.
    Use pytest with fixtures as required.
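A test along these lines might use an in-memory stub in place of the database. The sketch assumes the upload step only needs an object with an `insert_many` method (the method name pymongo collections use); `StubCollection` and `upload_records` are hypothetical names, and the real upload code may go through a different client.

```python
import pytest


class StubCollection:
    """In-memory stand-in that records whatever is passed to insert_many."""

    def __init__(self):
        self.documents = []

    def insert_many(self, docs):
        self.documents.extend(docs)


def upload_records(records, collection, batch_size=2):
    """Hypothetical upload step: send records in batches, return the count sent."""
    sent = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        collection.insert_many(batch)
        sent += len(batch)
    return sent


@pytest.fixture
def stub_collection():
    """Give each test a fresh, empty stub."""
    return StubCollection()


def test_upload_inserts_every_record(stub_collection):
    records = [{"pmid": i} for i in range(5)]
    assert upload_records(records, stub_collection) == 5
    assert stub_collection.documents == records
```

Passing the collection in as an argument is what makes the stub possible: the test never needs a running MongoDB instance.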

  • Module Structure
    All code, including tests, should be contained within a single Python module.
    Ensure the conversion/upload functionality is accessible from Python as an
    importable function from the module.
    Write the script so that it can easily be altered to deal with schema changes.


gitstart-app bot commented Aug 1, 2024

This PR is estimated to cost between 100 and 120 credits.
🟡 By merging this PR you agree to this estimate. If you disagree, click here.

@gitstart-nimhdsst gitstart-nimhdsst marked this pull request as draft August 1, 2024 16:18
@agt24 agt24 marked this pull request as ready for review August 2, 2024 17:51
@gitstart-nimhdsst gitstart-nimhdsst requested a review from leej3 August 7, 2024 04:53

gitstart-nimhdsst commented Aug 7, 2024

@leej3 FYI, the unrelated changes in this PR come from fixing lint failures in the pre-commit tests. The tox test also seems to be failing on main due to a Python versioning issue.

Comment on lines 21 to 26
# NOTICE: items without mapping to the dataframe
unmapped = {
    "article": "",
    "is_relevant": None,
    "is_explicit": None,
    "user_comment": "",
    "osm_version": "",
}

@leej3 I retained the unmapped dictionary because some of its values are needed outside the metrics. The unmapped dict holds all the values that have no mapping to the dataframe, and each is referenced in the appropriate location.


leej3 commented Aug 7, 2024

Looks good thanks. We can look at speeding things up separately.

@leej3 leej3 merged commit 22712b8 into main Aug 7, 2024
0 of 2 checks passed

2 participants