Feat add hivemind etl scripts #15

Merged 49 commits, Dec 21, 2023
8919742
feat: Adding the hivemind ETL scripts!
amindadgar Dec 13, 2023
0ee1e41
update: removing some parts that were for debugging!
amindadgar Dec 13, 2023
c7455de
update: removing phoenix llm monitoring tool for now!
amindadgar Dec 14, 2023
92f48b9
update: comment phoenix dags!
amindadgar Dec 14, 2023
8619ff3
update: Adding None values in case of channel, and day summaries!
amindadgar Dec 14, 2023
06fbf7c
Update: Discord summarization query!
amindadgar Dec 14, 2023
2f21fd9
fix: typo in help command of discourse_vectorstore_etl!
amindadgar Dec 14, 2023
eb08b06
update: Adding a condition to discourse data fetching!
amindadgar Dec 14, 2023
e3a6294
Update: Increased chunk size to 512!
amindadgar Dec 14, 2023
996d279
feat: Added the discord summary boundary case!
amindadgar Dec 14, 2023
c2f44c3
update: code cleaning with black!
amindadgar Dec 14, 2023
4926d60
fix: Updated roles id finding in text content!
amindadgar Dec 14, 2023
620d20e
feat: Updated the discord-vector-store interval!
amindadgar Dec 14, 2023
d1beb5e
feat: Adding discourse summarizer codes!
amindadgar Dec 14, 2023
690690c
udpate: moved the tests to its right directory!
amindadgar Dec 14, 2023
080a485
update: fixing the airflow image version to 2.7.3!
amindadgar Dec 14, 2023
bc26f1a
fix: each post always have 1 category!
amindadgar Dec 14, 2023
4e57b07
update: Added more test cases for discourse summary!
amindadgar Dec 14, 2023
d47e407
feat: Completing the discourse summary!
amindadgar Dec 18, 2023
ec5efa3
feat: commenting the debug parts and code cleaning!
amindadgar Dec 18, 2023
9657cc5
feat: For now excluding all metadata for discord summaries!
amindadgar Dec 19, 2023
7faf40c
feat: excluding all metadata in summaries!
amindadgar Dec 19, 2023
6045dfd
update: remove credentials printing!
amindadgar Dec 19, 2023
1a56e53
feat: Added logging to the iteration count of summaries!
amindadgar Dec 19, 2023
6006d49
feat: Added logs to summary preparation!
amindadgar Dec 19, 2023
b963c22
Merge branch 'main' into feat-add-hivemind-etl-discourse-summary
amindadgar Dec 19, 2023
142f0c4
update: removing duplicate codes!
amindadgar Dec 19, 2023
252eded
fix: linter issues based on super-linter rules!
amindadgar Dec 19, 2023
0ebdd2f
fix: more linter issues!
amindadgar Dec 19, 2023
b37aec6
fix: more linter issues!
amindadgar Dec 19, 2023
88338b2
fix: linter issues and the requiremnets.txt issue!
amindadgar Dec 19, 2023
d695bb2
feat: Added init files so pytest can find the tests!
amindadgar Dec 19, 2023
3ce2033
fix: pylint linter issue!
amindadgar Dec 19, 2023
200a401
trying more!
amindadgar Dec 19, 2023
7602d90
feat: added textlinter ignore for requirements.txt file!
amindadgar Dec 19, 2023
7b4cb79
trying more!
amindadgar Dec 19, 2023
761cf27
Merge branch 'main' into feat-add-hivemind-etl-discourse-summary
amindadgar Dec 19, 2023
f6a0d99
update: test cases with the latest code updates!
amindadgar Dec 20, 2023
5c55642
feat: Added new services to docker-compose!
amindadgar Dec 20, 2023
838ce68
fix: roles have different structure in text!
amindadgar Dec 20, 2023
4ff253e
update: test cases with latest code updates!
amindadgar Dec 20, 2023
242a43b
fix: docker-compose.test.yaml creds!
amindadgar Dec 20, 2023
2f27d52
trying to fix the textlinter error!
amindadgar Dec 20, 2023
c787ae3
update: removing the pypdf package for now!
amindadgar Dec 20, 2023
fafe587
Merge pull request #18 from TogetherCrew/feat-add-hivemind-etl-discou…
amindadgar Dec 20, 2023
c96d92d
feat: Added the embedding_dim and chunk_size as env variables!
amindadgar Dec 20, 2023
457c97d
fix: linter errors based on super-linter rules!
amindadgar Dec 20, 2023
d83d7d9
feat: Added the new env variables to the docker-compose!
amindadgar Dec 20, 2023
b27cac8
feat: reading embed dim from .env!
amindadgar Dec 20, 2023
trying more!
amindadgar committed Dec 19, 2023
commit 200a40176ec64dc943c4b199780b857fd2386c43
7 changes: 3 additions & 4 deletions dags/hivemind_etl_helpers/src/db/gdrive/db_utils.py
@@ -118,10 +118,9 @@ def fetch_files_date_field(
     query += ") AS distinct_results;"

     cursor.execute(query)
-    results = cursor.fetchone()
-    if results[0] is not None:
-        # TODO: check the type of results
-        results = postprocess_results(results[0])  # type: ignore
+    query_results = cursor.fetchone()
+    if query_results[0] is not None:
+        results = postprocess_results(query_results[0])
     else:
         results = {}
 except Exception as exp:
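The rename in this hunk keeps each variable at a single type: the fetched row is held in one name and the post-processed result in another, which is presumably why the `# type: ignore` could be dropped. A minimal sketch of the same pattern follows. It uses `sqlite3` only to stay self-contained (the driver used by the project is not shown in the diff), and `fetch_single_value` and the body of `postprocess_results` are hypothetical stand-ins. It also guards against `fetchone()` returning `None`, a case the hunk itself does not handle.

```python
import sqlite3
from typing import Any


def postprocess_results(raw: Any) -> dict[str, Any]:
    # Hypothetical stand-in: the real helper's behaviour is not shown in the diff.
    return {"raw": raw}


def fetch_single_value(cursor: sqlite3.Cursor, query: str) -> dict[str, Any]:
    # Hypothetical helper mirroring the shape of the changed lines.
    cursor.execute(query)
    # fetchone() returns a row tuple, or None when the query yields no rows.
    query_results = cursor.fetchone()
    # Separate names for the row and the processed value mean neither
    # variable changes type, so a type checker needs no "# type: ignore".
    if query_results is not None and query_results[0] is not None:
        return postprocess_results(query_results[0])
    return {}
```

With the extra `None` check, an empty result set falls through to the empty dict instead of raising `TypeError` on `query_results[0]`.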
6 changes: 3 additions & 3 deletions dags/hivemind_etl_helpers/src/utils/cohere_embedding.py
@@ -62,15 +62,15 @@ def _get_text_embeddings(self, texts: list[str]) -> list[list[float]]:
         Can be overridden for batch queries.

         """
-        return self.get_text_embedding(texts=texts)
+        return self.get_text_embedding(texts=texts)  # type: ignore

     def _get_text_embedding(self, text: str) -> list[float]:
         """Get text embedding."""
-        return self.get_text_embedding(text=text)
+        return self.get_text_embedding(text=text)  # type: ignore

     def _get_query_embedding(self, query: str) -> list[float]:
         """Get query embedding."""
-        return self.get_text_embedding(text=query)
+        return self.get_text_embedding(text=query)  # type: ignore

     async def _aget_query_embedding(self, query: str) -> list[float]:
         """The asynchronous version of _get_query_embedding."""