Feat add hivemind etl scripts #15

Merged
merged 49 commits into from
Dec 21, 2023

Conversation

amindadgar
Member

In this PR we're aiming to add

  1. Discord vectorstore scripts
  2. Discord summary vectorstore scripts
  3. Discourse vectorstore scripts
  4. Gdrive vectorstore scripts

Notes:

  • We've finished the implementation and testing for the first two items.
  • For item 3, we still need to check with a newer version of the staging data (the staging data needed some updates).
  • Item 4 is a work in progress, and we will complete it in another PR.

- The discord vectorstore and discord summary indexes are finished.
- The discourse vectorstore is in its final stages; it just needs the data on staging to be updated.
- Some gdrive code is written, but it needs to be completed and its test cases should be updated.
the parts were related to discourse.
@amindadgar amindadgar requested a review from TjitsevdM December 13, 2023 12:05

@TjitsevdM TjitsevdM left a comment

Great work! I left some small comments here and there, and also added the metadata comment from our Discord conversation: https://discord.com/channels/915914985140531240/1184488640743735297/1184530840261251182.

For the rest, everything looks good, so I'll submit my approval now.

It seems Airflow in Docker was having trouble with that library: hdbscan, a sub-dependency of phoenix, was failing to install.
We now have a prefix, and it is updated per thread, channel, and day.
It seems the `DiscoursePosts` with raw equal to NULL were deleted earlier.
With 256, documents with long metadata were producing an error.
- If a channel only has a single thread, the thread summary is stored as the channel summary as well.
- If a server only has a single channel, the channel summary is stored as the server summary as well.
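The two single-child rules above can be sketched roughly as follows. This is only an illustration and not the code from this PR: `roll_up` and `join_summaries` are hypothetical names, and the real summarizer call is replaced by a plain string join.

```python
from typing import Callable, Dict, List


def roll_up(
    child_summaries: Dict[str, str],
    summarize: Callable[[List[str]], str],
) -> str:
    """Build a parent-level summary from its children's summaries.

    Hypothetical helper: with exactly one child, that child's summary is
    reused as the parent summary and the summarizer is not called.
    """
    if not child_summaries:
        raise ValueError("no child summaries to roll up")
    if len(child_summaries) == 1:
        # Single thread in a channel (or single channel in a server):
        # store the child's summary as the parent's summary as well.
        return next(iter(child_summaries.values()))
    return summarize(list(child_summaries.values()))


def join_summaries(texts: List[str]) -> str:
    # Stand-in for a real summarizer (assumption: the real one would call an LLM).
    return " / ".join(texts)


# Threads roll up into a channel summary, channels into a server summary.
channel_summary = roll_up({"thread-1": "Discussed the release."}, join_summaries)
server_summary = roll_up({"general": channel_summary}, join_summaries)
```

Here both `channel_summary` and `server_summary` end up identical to the single thread summary, since each level has only one child.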
Moved some discord summarizer code to utils, as it can be used across multiple summarizers.
In previous commits we were assuming a post can have multiple categories, as we were doing a COLLECT on DiscourseCategory.name
@amindadgar amindadgar requested a review from TjitsevdM December 14, 2023 15:08
@amindadgar amindadgar requested a review from cyri113 December 20, 2023 07:19
@amindadgar amindadgar requested a review from TjitsevdM December 20, 2023 12:40

Are the Google Drive vector store and summary already fully implemented, or is this just a first start?

Member Author

No, it isn't implemented. The written scripts were just a first start.

@cyri113 cyri113 merged commit a7cbaa7 into main Dec 21, 2023
15 checks passed
@amindadgar amindadgar deleted the feat-add-hivemind-etl-scripts branch December 25, 2023 08:01
3 participants