Feat add hivemind etl scripts #15
- The Discord vectorstore and Discord summary indexes are finished.
- The Discourse vectorstore is in its final stages; it only needs the data on staging to be updated.
- Some gdrive code is written, but it needs to be completed and its test cases should be updated.
the parts were related to discourse.
Great work! I left some small comments here and there, along with the metadata comment from our Discord conversation: https://discord.com/channels/915914985140531240/1184488640743735297/1184530840261251182.
For the rest, everything looks good, so I'll submit my approval already.
It seems Airflow in Docker was having trouble with that library. The cause was hdbscan, a sub-dependency of phoenix, failing to install.
We now have a prefix, and it is updated per thread, channel, and day.
It seems the `DiscoursePosts` with `raw` equal to NULL were deleted earlier.
With 256, documents with long metadata were producing an error.
- If a channel only has a single thread, the thread summary is stored as the channel summary as well.
- If a server only has a single channel, the channel summary is stored as the server summary as well.
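The single-child rule above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `roll_up` and `join_summaries` are hypothetical names, and `join_summaries` stands in for whatever summarizer call the ETL actually uses.

```python
from typing import Callable, Dict, List


def roll_up(
    child_summaries: Dict[str, str],
    summarize: Callable[[List[str]], str],
) -> str:
    """Build a parent-level summary from its children's summaries.

    Hypothetical helper: when there is exactly one child, its summary is
    reused as-is instead of calling the summarizer again.
    """
    if not child_summaries:
        return ""
    if len(child_summaries) == 1:
        # Single thread in a channel (or single channel in a server):
        # store the child's summary as the parent's summary.
        return next(iter(child_summaries.values()))
    return summarize(list(child_summaries.values()))


def join_summaries(texts: List[str]) -> str:
    # Stand-in for a real summarizer call (assumption for this sketch).
    return " | ".join(texts)


thread_summaries = {"thread-1": "Discussed the release plan."}
channel_summary = roll_up(thread_summaries, join_summaries)
server_summary = roll_up({"general": channel_summary}, join_summaries)
```

Here both `channel_summary` and `server_summary` end up identical to the single thread summary, and the summarizer is never invoked, which avoids an unnecessary call for each level.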
Moved some Discord summarizer code to utils, as it can be used across multiple summarizers.
In previous commits we were assuming a post can have multiple categories, as we were doing a COLLECT on DiscourseCategory.name
- Also, cleaned the code with isort and black
they are <@&[role_id]>
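For reference, a minimal sketch of how role mentions in that `<@&[role_id]>` form could be pulled out of message content, assuming the role ID is a run of digits. The names `ROLE_MENTION` and `extract_role_ids` are hypothetical and not taken from this PR.

```python
import re
from typing import List

# Role mentions look like <@&ROLE_ID>; the "&" distinguishes them from
# user mentions such as <@USER_ID>. Assumes the ID is purely numeric.
ROLE_MENTION = re.compile(r"<@&(\d+)>")


def extract_role_ids(content: str) -> List[str]:
    """Return the role IDs mentioned in a message, in order of appearance."""
    return ROLE_MENTION.findall(content)


role_ids = extract_role_ids("Hello <@&123>, please sync with <@456> and <@&789>")
```

In this example only the two role mentions are matched; the user mention `<@456>` is skipped because it lacks the `&`.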
The linter is giving an error for the PDF part of the lib. Trying to adjust the version to see if textlint ignores that.
it was needed for google drive etl which isn't fully implemented for now.
…rse-summary feat: hivemind etl discourse summary and CI
Are the Google Drive vector store and summary already fully implemented, or is this just a first start?
No, it isn't implemented. The written scripts were just a first start.
In this PR we're aiming to add
Notes: