CLIA-1 - Specification for Uploading Dataset to MongoDB #29
Conversation
This PR is estimated to cost between 100 and 120 credits.
@leej3 FYI: the unrelated changes in this PR are due to fixing the lint failures in the pre-commit checks. The tox tests seem to be failing on main as well, due to a Python versioning issue.
scripts/invocation_upload.py (outdated)
# NOTICE: items without mapping to the dataframe
unmapped = {
    "article": "",
    "is_relevant": None,
    "is_explicit": None,
    "user_comment": "",
    "osm_version": "",
}
@leej3 I retained the unmapped dictionary because some of the values are needed outside the metrics. The unmapped dict holds all of the missing values, and they are referenced in the appropriate locations.
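For context, a minimal sketch of how such defaults might be merged into each document before upload; the field names mirror the unmapped dictionary above, while the helper function is hypothetical rather than the code actually used in this PR:

```python
# Defaults for fields that have no mapping in the dataframe.
UNMAPPED_DEFAULTS = {
    "article": "",
    "is_relevant": None,
    "is_explicit": None,
    "user_comment": "",
    "osm_version": "",
}

def row_to_document(row: dict) -> dict:
    # Hypothetical helper: values present in the dataframe row take precedence,
    # unmapped fields fall back to their defaults.
    return {**UNMAPPED_DEFAULTS, **row}
```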
Looks good, thanks. We can look at speeding things up separately.
This PR was created by GitStart to address the requirements from this ticket: CLIA-1.
Specification for Uploading Dataset to MongoDB
Overview
The task involves downloading a 3GB CSV file, converting it to a more efficient format, and then uploading the data to a MongoDB database using a defined schema. This is a one-off task, but the MongoDB schema may change in the future, so the script may need to be adapted to match any schema changes before final use.
Tasks and Requirements
We want to upload the data in this 3GB CSV file, available here: https://osf.io/2f857 (metadata is also available at that link). A smaller file that you can work from is available at this link; you can read it with pd.read_feather("all_indicators.feather").
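For example, a minimal sketch of loading the smaller development file (this assumes the file has already been downloaded locally and that pyarrow, pandas' feather backend, is installed):

```python
import pandas as pd

# Load the smaller development file; requires pyarrow as the feather backend.
df = pd.read_feather("all_indicators.feather")

# Quick look at what we are working with.
print(df.shape)
print(df.dtypes)
```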
Create a pipeline to convert the output of the previous step into a dataframe suited to the provided schema. Abstract operations into functions as required to keep the command chain concise and readable. Use Polars or Pandas for the dataframe operations.
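As an illustration only, the pipeline could be a chain of small, named steps over a pandas dataframe; the column names and casts below are hypothetical placeholders, not the actual schema fields:

```python
import pandas as pd

def select_schema_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the columns the document model needs (placeholder names).
    return df[["article_id", "is_open_code", "is_open_data"]]

def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    # Cast columns to the types the schema expects (placeholder casts).
    return df.astype({"is_open_code": "boolean", "is_open_data": "boolean"})

def to_records(df: pd.DataFrame) -> list[dict]:
    # One dict per row, ready to validate against the document model.
    return df.to_dict(orient="records")

def build_documents(df: pd.DataFrame) -> list[dict]:
    # Compose the small steps above so the command chain stays readable.
    return to_records(df.pipe(select_schema_columns).pipe(coerce_types))
```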
Use the output of the previous step with the preliminary schema provided for MongoDB. Specifically, use the Invocation class for the document model.
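A rough sketch of what the upload step could look like. It assumes Invocation is a pydantic-style model exposed by the preliminary schema (the import path below is a guess) and that pymongo is used for the connection; passing the collection in as an argument keeps the function easy to stub in tests:

```python
from pymongo import MongoClient

from osm_schema import Invocation  # hypothetical import path for the preliminary schema

def upload_documents(documents: list[dict], collection) -> int:
    # Validate each record against the Invocation document model, then bulk-insert.
    # .dict() may be .model_dump() depending on the pydantic version in use.
    invocations = [Invocation(**doc) for doc in documents]
    result = collection.insert_many([inv.dict() for inv in invocations])
    return len(result.inserted_ids)

# Example usage (connection details are placeholders):
# client = MongoClient("mongodb://localhost:27017")
# upload_documents(documents, client["osm"]["invocation"])
```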
Write tests for the database upload step using a stub for the database. These tests do not need to cover unforeseen issues; this dataset is the only one the script will be used for. Use pytest with fixtures as required.
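A sketch of what the stubbed test might look like with a pytest fixture. The StubCollection class is an in-memory stand-in for a pymongo collection, and the upload_documents import assumes the module exposes a function with that name and signature (an illustrative assumption):

```python
import types

import pytest

from invocation_upload import upload_documents  # hypothetical module/function name

class StubCollection:
    """In-memory stand-in for a pymongo collection."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        self.docs.extend(documents)
        # Mimic pymongo's InsertManyResult just enough for the code under test.
        return types.SimpleNamespace(inserted_ids=list(range(len(documents))))

@pytest.fixture
def stub_collection():
    return StubCollection()

def test_upload_inserts_all_rows(stub_collection):
    # Placeholder records; real tests would use rows shaped like the converted dataframe.
    records = [{"article": "", "is_relevant": None, "is_explicit": None}] * 3
    inserted = upload_documents(records, stub_collection)
    assert inserted == 3
    assert len(stub_collection.docs) == 3
```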
All code, including tests, should be contained within a single Python module. Ensure the conversion/upload functionality is accessible from Python as an importable function from the module. Write the script so that it can easily be adapted to schema changes.
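One way to satisfy the importable-function requirement is a thin entry point that chains the conversion and upload steps, plus a small CLI guard. This is only a structural sketch: the helper names stand in for the steps sketched above and are illustrative, not the repository's actual API:

```python
import argparse

from pymongo import MongoClient

def upload_dataset(input_path: str, mongo_uri: str) -> int:
    """Importable entry point: convert the dataset and upload it to MongoDB."""
    df = load_dataset(input_path)        # read the CSV/feather input (hypothetical helper)
    documents = build_documents(df)      # reshape rows to match the schema (sketched above)
    client = MongoClient(mongo_uri)
    return upload_documents(documents, client["osm"]["invocation"])

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Upload the indicators dataset to MongoDB")
    parser.add_argument("input_path")
    parser.add_argument("mongo_uri")
    args = parser.parse_args()
    print(upload_dataset(args.input_path, args.mongo_uri))
```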