Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: modify csv uploader job to remove dupe keywords #426

Merged
merged 2 commits into from
Sep 28, 2023

Conversation

0c0w3
Copy link
Collaborator

@0c0w3 0c0w3 commented Sep 26, 2023

References

Dupe keyword discussion

Description

There's some kind of unrelated new problem with the uploader where json.dumps() can't serialize the Url objects created by Pydantic. Nothing changed in the uploader so I don't know if this is due to upgrading Pydantic or what. This PR fixes that too.

PR Review Checklist

Put an x in the boxes that apply

  • This PR conforms to the Contribution Guidelines
  • The PR title starts with the JIRA issue reference, format [DISCO-####], and has the same title (if applicable)
  • [load test: (abort|warn)] keywords are applied (if applicable)
  • Documentation has been updated (if applicable)
  • Functional and performance test coverage has been expanded and maintained (if applicable)

There's some kind of unrelated new problem with the uploader where
`json.dumps()` can't serialize the `Url` objects created by Pydantic. Nothing
changed in the uploader so I don't know if this is due to upgrading Pydantic or
what. This PR fixes that too.
@0c0w3 0c0w3 requested a review from ncloudioj September 26, 2023 23:09
@taddes
Copy link
Contributor

taddes commented Sep 27, 2023

Hey @0c0w3 ! Pydantic V2 introduced a lot of changes and a significant portion of out of the box functionality is not as it was. The way values are serialized changed considerably and the way in which json data is serialized means there are some new methods that can be implemented. Urls for instance function differently when you dump them to dicts of json strings.

It's well after hours for me, but we can discuss this tomorrow and I could recommend a few things. I don't think you need to import pydantic_core, for example. Much the same can be achieved by using Pydantic's new native serialization methods. Either way, thanks for looking after this job!

Copy link
Contributor

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

merino/jobs/csv_rs_uploader/base.py Outdated Show resolved Hide resolved
@0c0w3
Copy link
Collaborator Author

0c0w3 commented Sep 27, 2023

Thanks @taddes. I'm looking at https://docs.pydantic.dev/latest/concepts/serialization/ and model.model_dump(...) seems like what I want but it still produces Url objects. model.model_dump_json(...) converts the whole suggestion object to a string, which isn't what I want either. I'll keep looking.

I wondered why the tests didn't catch this and it's because the chunked uploader test uses suggestion objects with a single string-valued int-valued property. When I add an HttpUrl property, the chunked uploader test fails in the same way. I'll make sure this is tested regardless of how Url ends up getting serialized.

@0c0w3
Copy link
Collaborator Author

0c0w3 commented Sep 27, 2023

Ah model_dump(mode="json") works.

@0c0w3
Copy link
Collaborator Author

0c0w3 commented Sep 27, 2023

OK, this is good! It's exposed a few related problems:

  • The CSV uploader should use model_dump(mode="json") on the suggestion dicts it passes to the chunked uploader, not vars()
  • The interface to the chunked uploader needed clarification: The suggestion dicts passed into it must be JSON serializable; it's the consumer's responsibility, not the chunked uploader's
  • The CSV uploader tests expect the suggestion dicts passed to the chunked uploader to contain HttpUrl values, but that's wrong, they should be strings

@0c0w3 0c0w3 requested a review from taddes September 27, 2023 02:02
@0c0w3
Copy link
Collaborator Author

0c0w3 commented Sep 27, 2023

ci/circleci: checks failed due to what looks like a connection error? I don't have permissions to re-run it.

@ncloudioj
Copy link
Contributor

@0c0w3 it's green now. Are you login on CircleCI? You should be able to trigger retry there. I noticed that you are not a member of the "mozilla-services" org on Github, maybe that's why you can't assign contextual-services as the reviewer. To fix those quirks, we can ask GH admins to add you to "mozilla-services" if you don't mind, let me know :)

@0c0w3
Copy link
Collaborator Author

0c0w3 commented Sep 27, 2023

@ncloudioj That would be fine, thanks!

@ncloudioj ncloudioj added this pull request to the merge queue Sep 28, 2023
Merged via the queue into main with commit 71f8ca3 Sep 28, 2023
@ncloudioj ncloudioj deleted the feat/csv-uploader-dupe-keywords branch September 28, 2023 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants