-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/302 telegram raw vectorize #303
Conversation
Also, added a test case for the extraction of it.
WalkthroughThe pull request introduces enhancements to the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 20
🧹 Outside diff range and nitpick comments (13)
dags/hivemind_etl_helpers/src/db/telegram/utility.py (1)
6-8
: Consider adding a docstring and evaluating the necessity of this class.The
TelegramUtils
class is well-defined and correctly extendsTelegramPlatform
. However, there are a couple of suggestions for improvement:
- Add a docstring to explain the purpose and intended use of this class.
- Evaluate whether this class is necessary. If it doesn't add any new functionality beyond what
TelegramPlatform
provides, consider usingTelegramPlatform
directly instead.Here's an example of how you might add a docstring:
class TelegramUtils(TelegramPlatform): """ Utility class for Telegram-related operations. This class extends TelegramPlatform to provide additional utility functions specific to our application's needs. """ def __init__(self, chat_id: str, chat_name: str) -> None: super().__init__(chat_id, chat_name)If you plan to add more methods to this class in the future, please disregard the second suggestion.
dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py (2)
10-33
: LGTM with a minor suggestion: Consider improving type hints.The
extract_chats
method is well-structured with appropriate error handling, efficient Neo4j query execution, and result processing. The docstring provides clear information about the method's purpose and return value.Consider updating the type hint for
chat_info
to match the method's return type:- chat_info: list[str] = [] + chat_info: list[tuple[str, str]] = []This change would make the type hint consistent with the method's return type annotation and the actual data structure being returned.
1-33
: Overall: Well-implemented and ready for integration.The
TelegramChats
class is well-designed for its purpose of extracting chat information from a Neo4j database. It demonstrates good practices in terms of error handling, database connection management, and code structure. The implementation is clean and efficient, making it suitable for integration into the larger system.As this class is part of a larger ETL process, ensure that it integrates smoothly with other components of the pipeline, particularly in terms of error handling and data flow. Consider implementing unit tests to verify the behavior of the
extract_chats
method with different Neo4j query results and potential error scenarios.dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py (1)
1-99
: Consider adding more edge cases and error scenarios.The test file provides good coverage for the main scenarios of the
TelegramPlatform
class. To further improve the test suite, consider adding the following test cases:
- Test for error handling when the database connection fails.
- Test for the scenario where a platform is disconnected (i.e., has a non-null
disconnectedAt
field).- Test for updating an existing platform.
- Test for handling invalid input data (e.g., invalid chat_id or chat_name).
Overall, the test structure is well-organized and follows good practices for unittest.
dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (1)
1-109
: Consider adding more edge cases and error scenario tests.The test suite is well-structured and covers the basic functionality of the
TelegramChats
class. To further improve test coverage, consider adding the following tests:
- Test with invalid chat data (e.g., missing required fields).
- Test for error handling when the database connection fails.
- Test for the behavior when extracting a large number of chats (performance test).
- Test for any specific error conditions that
extract_chats
might encounter.These additional tests would help ensure the robustness of the
TelegramChats
class under various scenarios.Would you like assistance in drafting these additional test cases?
dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (4)
10-22
: LGTM: Setup and teardown methods ensure proper test isolation.The
setUp
andtearDown
methods are well-implemented, ensuring a clean test environment for each test case. The_delete_everything()
method effectively clears the Neo4j database.Consider adding error handling in the
_delete_everything()
method to catch and log any potential database connection issues:def _delete_everything(self): """remove everything on neo4j db""" try: with self.extractor._connection.neo4j_driver.session() as session: session.execute_write(lambda tx: tx.run("MATCH (n) DETACH DELETE (n)")) except Exception as e: print(f"Error clearing database: {e}") raise
34-74
: LGTM: Single data extraction test is comprehensive.This test effectively creates a single message with associated users and reactions, and then verifies the extracted data against an expected
TelegramMessagesModel
instance. The test covers all relevant fields and uses appropriate timestamp representations.Consider adding a constant for the chat ID at the class level to improve maintainability:
class TestExtractTelegramMessages(TestCase): CHAT_ID = "1234567890" def setUp(self) -> None: load_dotenv() self.extractor = ExtractMessages() self._delete_everything() # ... rest of the class ... def test_extract_single_data(self): with self.extractor._connection.neo4j_driver.session() as session: session.run( # ... query ... { "chat_id": self.CHAT_ID, # ... other parameters ... }, ) # ... rest of the test ...This change would make it easier to update the chat ID across all tests if needed.
104-195
: LGTM: Multiple data extraction test is comprehensive, with room for improvement.This test effectively creates and verifies the extraction of multiple messages with various relationships (edits, replies, reactions). The test covers a complex scenario and checks all relevant fields in the extracted data.
Consider the following improvements:
- Sort the extracted and expected data before comparison to ensure the test is not dependent on the order of retrieval:
def test_extract_multiple_data(self): # ... existing code ... data = sorted(self.extractor.extract(from_date=datetime(2024, 1, 1)), key=lambda x: x.message_id) expected_data = sorted([ TelegramMessagesModel(...), TelegramMessagesModel(...), TelegramMessagesModel(...), ], key=lambda x: x.message_id) self.assertEqual(data, expected_data)
- Add assertions to verify the relationships between messages (e.g., replies, edits):
def test_extract_multiple_data(self): # ... existing code ... self.assertEqual(data[2].repliers, ["User Two"]) # Verify that message 5 replied to message 3 self.assertNotEqual(data[0].message_text, "🎉️️️️️️ Welcome to the TC Ingestion Pipeline") # Verify that message 3 was editedThese changes would make the test more robust and verify additional aspects of the data extraction.
1-195
: Overall, the test suite is well-structured and comprehensive.This integration test file for the
ExtractMessages
class provides excellent coverage of various scenarios, including empty data, single message extraction, and complex multi-message relationships. The tests are well-organized, consistent in structure, and follow good practices for test isolation.Some final suggestions for improvement:
- Consider adding more edge cases, such as messages with empty text or messages with a large number of reactions or replies.
- Add a test for error handling, e.g., when the database connection fails.
- Use parameterized tests for similar test cases to reduce code duplication.
Example of a parameterized test:
@parameterized.expand([ ("empty_data", datetime(2023, 1, 1), []), ("future_date", datetime(2024, 1, 1), []), ]) def test_extract_with_various_dates(self, name, from_date, expected): data = self.extractor.extract(from_date=from_date) self.assertEqual(data, expected)These improvements would further enhance the robustness and maintainability of the test suite.
dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py (1)
13-15
: Improve parameter descriptions in the__init__
docstring for clarity.The descriptions for
chat_id
andchat_name
are unclear or grammatically incorrect.Consider updating them to:
chat_id : str The ID of the Telegram chat. chat_name : str The name of the Telegram chat.
dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (2)
31-38
: Ensure consistency in metadata key naming conventionsThe metadata keys are currently in camelCase (e.g.,
"createdAt"
,"updatedAt"
), while Python typically uses snake_case. For consistency and adherence to Python conventions, consider using snake_case for the metadata keys:metadata={ - "author": message.author_username, - "createdAt": message.message_created_at, - "updatedAt": message.message_edited_at, - "mentions": message.mentions, - "replies": message.repliers, - "reactors": message.reactors, - "chat_name": self.chat_name, + "author": message.author_username, + "created_at": message.message_created_at, + "updated_at": message.message_edited_at, + "mentions": message.mentions, + "replies": message.repliers, + "reactors": message.reactors, + "chat_name": self.chat_name, },Also, update the lists in
excluded_embed_metadata_keys
andexcluded_llm_metadata_keys
accordingly.
5-58
: Recommend adding unit tests forTransformMessages
classTo ensure the correctness and reliability of the
TransformMessages
class, consider adding unit tests for thetransform
method.Would you like assistance in generating unit tests for this class or opening a GitHub issue to track this task?
dags/hivemind_telegram_etl.py (1)
93-98
: Handle potential time zone issues when calculatingfrom_date
When calculating
from_date
, consider time zone differences that might affectlatest_date
. Ensure that both dates are aware of time zones or are in UTC to prevent unexpected behavior.Optionally, you can adjust the date handling as follows:
from_date = latest_date - timedelta(days=30) + from_date = from_date.replace(tzinfo=datetime.timezone.utc)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (14)
- dags/hivemind_etl_helpers/ingestion_pipeline.py (4 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/extract/init.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/schema.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/transform/init.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/utility.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/utils/init.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py (1 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (1 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (1 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py (1 hunks)
- dags/hivemind_telegram_etl.py (1 hunks)
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/ingestion_pipeline.py
166-166:
return
insidefinally
blocks cause exceptions to be silenced(B012)
dags/hivemind_etl_helpers/src/db/telegram/extract/__init__.py
1-1:
.tc_chats.TelegramChats
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
2-2:
.messages.ExtractMessages
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py
32-34: f-string without any placeholders
Remove extraneous
f
prefix(F541)
dags/hivemind_etl_helpers/src/db/telegram/transform/__init__.py
1-1:
.messages.TransformMessages
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
dags/hivemind_etl_helpers/src/db/telegram/utils/__init__.py
1-1:
.platform.TelegramPlatform
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
dags/hivemind_telegram_etl.py
6-6:
dotenv.load_dotenv
imported but unusedRemove unused import:
dotenv.load_dotenv
(F401)
11-11:
qdrant_client.http.models
imported but unusedRemove unused import:
qdrant_client.http.models
(F401)
🔇 Additional comments (23)
dags/hivemind_etl_helpers/src/db/telegram/utility.py (1)
1-3
: LGTM: Import statement is well-structured.The import statement is correctly formatted and imports only the necessary class. This approach helps maintain clean and efficient code.
dags/hivemind_etl_helpers/src/db/telegram/schema.py (1)
1-1
: LGTM: Import statement is correct.The import of
BaseModel
frompydantic
is appropriate for defining the Pydantic model.dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py (2)
1-3
: LGTM: Imports are appropriate.The imports for
logging
andNeo4jOps
fromtc_neo4j_lib
are suitable for the class's functionality.
6-9
: LGTM: Class structure is well-defined.The
TelegramChats
class is correctly structured with an__init__
method that initializes the database connection using theNeo4jOps
singleton. This approach is efficient for managing database connections.dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py (5)
1-7
: LGTM: Import statements are appropriate and concise.The import statements are relevant to the test cases and there are no unused imports.
10-20
: LGTM: Class structure and setUp method are well-defined.The
TestTelegramPlatform
class correctly extendsTestCase
and thesetUp
method properly initializes the test environment, including cleaning the database for each test.
22-24
: LGTM: Test case for no platform available is clear and correct.The
test_check_no_platform_available
method effectively tests the scenario where no platform exists in the database.
26-44
: LGTM: Test case for single platform availability is comprehensive.The
test_single_platform_available
method effectively tests the scenario where a single platform exists in the database. It correctly inserts a platform document and verifies its retrieval.
94-99
: LGTM: Test case for platform creation is concise and effective.The
test_create_platform
method effectively tests the creation of a new platform and verifies its existence in the database.dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (8)
1-4
: LGTM: Appropriate imports for the test file.The import statements are well-structured and include all necessary components for the test suite.
7-7
: LGTM: Well-structured test class.The
TestTelegramChats
class is properly set up, inheriting fromTestCase
and following unittest conventions.
8-11
: LGTM: Proper test setup.The
setUp
method correctly initializes the test environment by loading environment variables, creating aTelegramChats
instance, and ensuring a clean state.
13-14
: LGTM: Proper test cleanup.The
tearDown
method ensures a clean state after each test by calling_delete_everything
, which is a good practice for test isolation.
16-19
: LGTM: Effective database cleanup method.The
_delete_everything
method correctly uses a Cypher query to remove all nodes from the Neo4j database, ensuring a clean state for each test.
21-27
: LGTM: Good edge case test for empty database.The
test_extract_chats_empty_db
method effectively tests the behavior ofextract_chats
when the database is empty. The assertion message is clear and descriptive.
29-55
: LGTM: Comprehensive test for single chat extraction.The
test_extract_chats_single_chat
method effectively tests theextract_chats
functionality for a single chat entry. It correctly creates a chat with appropriate parameters and verifies the extracted chat ID.
57-109
: LGTM: Thorough test for multiple chat extraction.The
test_extract_chats_multiple_chats
method effectively tests theextract_chats
functionality for multiple chat entries. It correctly creates multiple chats with appropriate parameters and verifies that all chat IDs are extracted.dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (3)
1-9
: LGTM: Imports and test class definition are appropriate.The necessary modules and classes are imported, and the test class is correctly defined as a subclass of
TestCase
.
24-32
: LGTM: Empty data extraction tests are well-implemented.These tests effectively cover the edge case of extracting data from an empty database, both with and without a
from_date
parameter. The assertions correctly check for an empty list as the expected result.
76-102
: LGTM: Test for extraction with future date is well-implemented.This test effectively verifies that no messages are returned when the
from_date
is set to a future date. The test structure is consistent with the previous test, which is good for readability and maintainability.dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (1)
39-55
: Confirm the inclusion ofauthor
in LLM metadataIn
excluded_embed_metadata_keys
, the"author"
key is excluded, but it is not excluded inexcluded_llm_metadata_keys
. This meansauthor
will be included in LLM metadata but not in embedding metadata. Is this intentional? If not, consider adding"author"
toexcluded_llm_metadata_keys
for consistency:excluded_llm_metadata_keys=[ + "author", "createdAt", "updatedAt", "mentions", "replies", "reactors", "chat_name", ],
dags/hivemind_telegram_etl.py (1)
102-104
:⚠️ Potential issueEnsure correct parameter passing to
processor
functionIn line 104, the
processor
function is called withchat_id=chat_id
, but the function expectschat_info
as the first parameter. Please update the function call to match the function signature.Apply this diff to correct the function call:
- processor(chat_id=chat_id, community_id=community_id) + processor(chat_info=chat_id, community_id=community_id)Additionally, verify that
chat_id
here is actually achat_info
tuple. If not, ensure that you're passing the correct data structure expected byprocessor
.dags/hivemind_etl_helpers/ingestion_pipeline.py (1)
93-122
: Method_create_payload_index
looks goodThe implementation of
_create_payload_index
method is clear and the docstring is well-documented. It effectively creates an index on the specified payload field.
dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py
Outdated
Show resolved
Hide resolved
and more fine-grained test case for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (1 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (1 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py
- dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py
35-37: f-string without any placeholders
Remove extraneous
f
prefix(F541)
🔇 Additional comments (5)
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (5)
1-10
: LGTM: Imports and class initialization look good.The imports are appropriate, and the
ExtractMessages
class is initialized correctly with achat_id
parameter and a Neo4j connection.
12-26
: LGTM: Well-defined method signature and informative docstring.The
extract
method is clearly defined with appropriate type hints. The docstring provides comprehensive information about the method's purpose, parameters, and return value.
71-82
: LGTM: Query execution and result transformation are well-implemented.The conditional parameter setup, query execution within a Neo4j session, and transformation of results into
TelegramMessagesModel
instances are all implemented correctly.
84-84
: LGTM: Correct return statement.The method correctly returns the list of
TelegramMessagesModel
instances, which is consistent with the method signature.
1-84
: Overall assessment: Good implementation with minor improvements needed.The
ExtractMessages
class provides a well-structured and mostly correct implementation for extracting Telegram messages from a Neo4j database. The code is generally clear, with appropriate type hints and informative docstrings.The main area for improvement is in the query construction, as highlighted in the previous comment. Addressing these issues will enhance the efficiency and robustness of the code.
Great job overall, and once the suggested changes are implemented, this will be a solid addition to the project.
🧰 Tools
🪛 Ruff
35-37: f-string without any placeholders
Remove extraneous
f
prefix(F541)
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests