
Feat/302 telegram raw vectorize #303

Merged: 13 commits, Oct 21, 2024

Conversation

amindadgar
Member

@amindadgar amindadgar commented Oct 21, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new ETL pipeline for extracting, transforming, and loading data from Telegram.
    • Added functionality to extract Telegram messages and chat information, with options to filter by date.
    • New classes for handling message extraction and transformation, along with chat management.
    • Enhanced logging in the pipeline to indicate document processing status.
  • Bug Fixes

    • Improved error handling and logging in various methods to enhance reliability and clarity.
  • Tests

    • Comprehensive test suites for new functionality, ensuring correct behavior for message and chat extraction processes.

@amindadgar amindadgar linked an issue Oct 21, 2024 that may be closed by this pull request
Contributor

coderabbitai bot commented Oct 21, 2024

Caution

Review failed

The head commit changed during the review from 1160935 to 401912c.

Walkthrough

The pull request introduces enhancements to the CustomIngestionPipeline class, including an updated run_pipeline method with a specified return type and improved logging. New methods for creating payload indices and retrieving the latest document date have been added. Additionally, new classes for extracting Telegram messages and chat information have been introduced, along with a Pydantic model for message structure. A new DAG for ETL processes related to Telegram data has also been created, integrating these functionalities into a cohesive workflow.
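The walkthrough mentions a Pydantic model describing the Telegram message structure. As a rough, dependency-free sketch of what such a model carries (field names are taken from attributes referenced later in this review; the actual model is a Pydantic `BaseModel`, so treat this stdlib dataclass as illustrative only):

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-in for TelegramMessagesModel; the PR uses Pydantic.
# Field names mirror attributes referenced elsewhere in this review.
@dataclass
class TelegramMessagesModelSketch:
    message_id: int
    message_text: str
    author_username: str
    message_created_at: float
    message_edited_at: Optional[float] = None
    mentions: list = field(default_factory=list)
    repliers: list = field(default_factory=list)
    reactors: list = field(default_factory=list)

msg = TelegramMessagesModelSketch(
    message_id=3,
    message_text="Welcome to the TC Ingestion Pipeline",
    author_username="user1",
    message_created_at=1.7e9,
)
print(msg.message_id, msg.repliers)
```

A Pydantic version adds validation on top of the same shape (e.g. rejecting a non-numeric `message_id`), which is why the PR reaches for it.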

Changes

File Change Summary
dags/hivemind_etl_helpers/ingestion_pipeline.py Updated run_pipeline method signature and logging; added _create_payload_index and get_latest_document_date methods.
dags/hivemind_etl_helpers/src/db/telegram/extract/__init__.py Added imports for TelegramChats and ExtractMessages.
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py Introduced ExtractMessages class with extract method for message retrieval.
dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py Added TelegramChats class with extract_chats method for chat information retrieval.
dags/hivemind_etl_helpers/src/db/telegram/schema.py New file defining TelegramMessagesModel Pydantic model for message structure.
dags/hivemind_etl_helpers/src/db/telegram/transform/__init__.py Added import for TransformMessages.
dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py Introduced TransformMessages class with transform method to convert messages to documents.
dags/hivemind_etl_helpers/src/db/telegram/utility.py Added TelegramUtils class extending TelegramPlatform.
dags/hivemind_etl_helpers/src/db/telegram/utils/__init__.py Imported TelegramPlatform class.
dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py Introduced TelegramPlatform class with methods for platform management.
dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py New test suite for ExtractMessages class covering various extraction scenarios.
dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py New test suite for TelegramChats class validating chat extraction.
dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py New test suite for TelegramPlatform class covering platform existence and creation.
dags/hivemind_telegram_etl.py New DAG for ETL processes with tasks for fetching chat IDs, checking existence, and processing messages.

Possibly related PRs

  • feat: using endpoint for all discourse posts! #297: Involves replacing forumUuid with forum_endpoint, aligning with modifications in the main PR that enhance the CustomIngestionPipeline class, particularly in data processing and logging.
  • Feat/analyzer helpers telegram #298: Introduces changes to the TransformRawInfo and TransformRawMembers classes, which may relate to the overall data transformation processes that are also part of the enhancements made in the main PR regarding the ingestion pipeline and document handling.

🐰 In the land of code where rabbits play,
New features hop in, brightening the day!
With messages extracted and chats to explore,
Our pipelines are dancing, oh, what a score!
So let’s cheer for the changes, both big and small,
In this wondrous world, we’re having a ball! 🎉


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?


Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 20

🧹 Outside diff range and nitpick comments (13)
dags/hivemind_etl_helpers/src/db/telegram/utility.py (1)

6-8: Consider adding a docstring and evaluating the necessity of this class.

The TelegramUtils class is well-defined and correctly extends TelegramPlatform. However, there are a couple of suggestions for improvement:

  1. Add a docstring to explain the purpose and intended use of this class.
  2. Evaluate whether this class is necessary. If it doesn't add any new functionality beyond what TelegramPlatform provides, consider using TelegramPlatform directly instead.

Here's an example of how you might add a docstring:

class TelegramUtils(TelegramPlatform):
    """
    Utility class for Telegram-related operations.

    This class extends TelegramPlatform to provide additional
    utility functions specific to our application's needs.
    """

    def __init__(self, chat_id: str, chat_name: str) -> None:
        super().__init__(chat_id, chat_name)

If you plan to add more methods to this class in the future, please disregard the second suggestion.

dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py (2)

10-33: LGTM with a minor suggestion: Consider improving type hints.

The extract_chats method is well-structured with appropriate error handling, efficient Neo4j query execution, and result processing. The docstring provides clear information about the method's purpose and return value.

Consider updating the type hint for chat_info to match the method's return type:

-        chat_info: list[str] = []
+        chat_info: list[tuple[str, str]] = []

This change would make the type hint consistent with the method's return type annotation and the actual data structure being returned.


1-33: Overall: Well-implemented and ready for integration.

The TelegramChats class is well-designed for its purpose of extracting chat information from a Neo4j database. It demonstrates good practices in terms of error handling, database connection management, and code structure. The implementation is clean and efficient, making it suitable for integration into the larger system.

As this class is part of a larger ETL process, ensure that it integrates smoothly with other components of the pipeline, particularly in terms of error handling and data flow. Consider implementing unit tests to verify the behavior of the extract_chats method with different Neo4j query results and potential error scenarios.
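The unit-testing suggestion above can be sketched without a live Neo4j instance by stubbing the session object. The helper below is a hypothetical stand-in (the real `extract_chats` lives on the `TelegramChats` class with its own query and connection handling); it only illustrates the mocking pattern:

```python
from unittest.mock import MagicMock

# Stub a Neo4j session: run() returns dict-like records, as the driver does.
session = MagicMock()
session.run.return_value = [
    {"chat_id": "1", "name": "Chat One"},
    {"chat_id": "2", "name": "Chat Two"},
]

def extract_chats(session):
    # Mirrors the query/transform shape described in this review:
    # fetch chat records, return (id, name) tuples.
    records = session.run(
        "MATCH (c:TGChat) RETURN c.id AS chat_id, c.title AS name"
    )
    return [(r["chat_id"], r["name"]) for r in records]

print(extract_chats(session))
```

With this pattern, error scenarios are easy to cover too: set `session.run.side_effect = Exception("connection lost")` and assert the class surfaces or logs the failure as intended.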

dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py (1)

1-99: Consider adding more edge cases and error scenarios.

The test file provides good coverage for the main scenarios of the TelegramPlatform class. To further improve the test suite, consider adding the following test cases:

  1. Test for error handling when the database connection fails.
  2. Test for the scenario where a platform is disconnected (i.e., has a non-null disconnectedAt field).
  3. Test for updating an existing platform.
  4. Test for handling invalid input data (e.g., invalid chat_id or chat_name).

Overall, the test structure is well-organized and follows good practices for unittest.

dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (1)

1-109: Consider adding more edge cases and error scenario tests.

The test suite is well-structured and covers the basic functionality of the TelegramChats class. To further improve test coverage, consider adding the following tests:

  1. Test with invalid chat data (e.g., missing required fields).
  2. Test for error handling when the database connection fails.
  3. Test for the behavior when extracting a large number of chats (performance test).
  4. Test for any specific error conditions that extract_chats might encounter.

These additional tests would help ensure the robustness of the TelegramChats class under various scenarios.

Would you like assistance in drafting these additional test cases?

dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (4)

10-22: LGTM: Setup and teardown methods ensure proper test isolation.

The setUp and tearDown methods are well-implemented, ensuring a clean test environment for each test case. The _delete_everything() method effectively clears the Neo4j database.

Consider adding error handling in the _delete_everything() method to catch and log any potential database connection issues:

def _delete_everything(self):
    """remove everything on neo4j db"""
    try:
        with self.extractor._connection.neo4j_driver.session() as session:
            session.execute_write(lambda tx: tx.run("MATCH (n) DETACH DELETE (n)"))
    except Exception as e:
        print(f"Error clearing database: {e}")
        raise

34-74: LGTM: Single data extraction test is comprehensive.

This test effectively creates a single message with associated users and reactions, and then verifies the extracted data against an expected TelegramMessagesModel instance. The test covers all relevant fields and uses appropriate timestamp representations.

Consider adding a constant for the chat ID at the class level to improve maintainability:

class TestExtractTelegramMessages(TestCase):
    CHAT_ID = "1234567890"

    def setUp(self) -> None:
        load_dotenv()
        self.extractor = ExtractMessages()
        self._delete_everything()

    # ... rest of the class ...

    def test_extract_single_data(self):
        with self.extractor._connection.neo4j_driver.session() as session:
            session.run(
                # ... query ...
                {
                    "chat_id": self.CHAT_ID,
                    # ... other parameters ...
                },
            )
        # ... rest of the test ...

This change would make it easier to update the chat ID across all tests if needed.


104-195: LGTM: Multiple data extraction test is comprehensive, with room for improvement.

This test effectively creates and verifies the extraction of multiple messages with various relationships (edits, replies, reactions). The test covers a complex scenario and checks all relevant fields in the extracted data.

Consider the following improvements:

  1. Sort the extracted and expected data before comparison to ensure the test is not dependent on the order of retrieval:
def test_extract_multiple_data(self):
    # ... existing code ...
    data = sorted(self.extractor.extract(from_date=datetime(2024, 1, 1)), key=lambda x: x.message_id)
    expected_data = sorted([
        TelegramMessagesModel(...),
        TelegramMessagesModel(...),
        TelegramMessagesModel(...),
    ], key=lambda x: x.message_id)
    self.assertEqual(data, expected_data)
  1. Add assertions to verify the relationships between messages (e.g., replies, edits):
def test_extract_multiple_data(self):
    # ... existing code ...
    self.assertEqual(data[2].repliers, ["User Two"])  # Verify that message 5 replied to message 3
    self.assertNotEqual(data[0].message_text, "🎉️️️️️️ Welcome to the TC Ingestion Pipeline")  # Verify that message 3 was edited

These changes would make the test more robust and verify additional aspects of the data extraction.


1-195: Overall, the test suite is well-structured and comprehensive.

This integration test file for the ExtractMessages class provides excellent coverage of various scenarios, including empty data, single message extraction, and complex multi-message relationships. The tests are well-organized, consistent in structure, and follow good practices for test isolation.

Some final suggestions for improvement:

  1. Consider adding more edge cases, such as messages with empty text or messages with a large number of reactions or replies.
  2. Add a test for error handling, e.g., when the database connection fails.
  3. Use parameterized tests for similar test cases to reduce code duplication.

Example of a parameterized test:

from parameterized import parameterized

@parameterized.expand([
    ("empty_data", datetime(2023, 1, 1), []),
    ("future_date", datetime(2024, 1, 1), []),
])
def test_extract_with_various_dates(self, name, from_date, expected):
    data = self.extractor.extract(from_date=from_date)
    self.assertEqual(data, expected)

These improvements would further enhance the robustness and maintainability of the test suite.

dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py (1)

13-15: Improve parameter descriptions in the __init__ docstring for clarity.

The descriptions for chat_id and chat_name are unclear or grammatically incorrect.

Consider updating them to:

chat_id : str
    The ID of the Telegram chat.
chat_name : str
    The name of the Telegram chat.
dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (2)

31-38: Ensure consistency in metadata key naming conventions

The metadata keys are currently in camelCase (e.g., "createdAt", "updatedAt"), while Python typically uses snake_case. For consistency and adherence to Python conventions, consider using snake_case for the metadata keys:

                     metadata={
                         "author": message.author_username,
-                        "createdAt": message.message_created_at,
-                        "updatedAt": message.message_edited_at,
+                        "created_at": message.message_created_at,
+                        "updated_at": message.message_edited_at,
                         "mentions": message.mentions,
                         "replies": message.repliers,
                         "reactors": message.reactors,
                         "chat_name": self.chat_name,
                     },

Also, update the lists in excluded_embed_metadata_keys and excluded_llm_metadata_keys accordingly.


5-58: Recommend adding unit tests for TransformMessages class

To ensure the correctness and reliability of the TransformMessages class, consider adding unit tests for the transform method.

Would you like assistance in generating unit tests for this class or opening a GitHub issue to track this task?

dags/hivemind_telegram_etl.py (1)

93-98: Handle potential time zone issues when calculating from_date

When calculating from_date, consider time zone differences that might affect latest_date. Ensure that both dates are aware of time zones or are in UTC to prevent unexpected behavior.

Optionally, you can adjust the date handling as follows:

             from_date = latest_date - timedelta(days=30)
+            from_date = from_date.replace(tzinfo=timezone.utc)

(assuming `timezone` is imported from the `datetime` module alongside `timedelta`).
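A dependency-free sketch of the normalization suggested here; the `latest_date` value is made up for illustration, and only naive datetimes are tagged so already-aware values are left untouched:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical latest document date returned naive (no tzinfo) by the store.
latest_date = datetime(2024, 10, 1, 12, 0, 0)

# Tag naive datetimes as UTC before arithmetic, so later comparisons
# against timezone-aware datetimes don't raise TypeError.
if latest_date.tzinfo is None:
    latest_date = latest_date.replace(tzinfo=timezone.utc)

from_date = latest_date - timedelta(days=30)
print(from_date.isoformat())  # → 2024-09-01T12:00:00+00:00
```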
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c4586cf and c563c19.

📒 Files selected for processing (14)
  • dags/hivemind_etl_helpers/ingestion_pipeline.py (4 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/extract/__init__.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/schema.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/transform/__init__.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/utility.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/utils/__init__.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py (1 hunks)
  • dags/hivemind_telegram_etl.py (1 hunks)
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/ingestion_pipeline.py

166-166: return inside finally blocks cause exceptions to be silenced

(B012)
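To illustrate why Ruff flags B012: a `return` inside a `finally` block runs even when the `try` body raised, silently discarding the in-flight exception. The fix is to keep `finally` for cleanup only and return after it:

```python
def cleanup_then_return():
    try:
        raise ValueError("boom")
    finally:
        # BAD (B012): this return swallows the ValueError raised above.
        return "done"

def cleanup_no_return():
    try:
        raise ValueError("boom")
    finally:
        pass  # cleanup only; no return/break/continue here
    return "done"  # unreachable when the try body raises

print(cleanup_then_return())  # the exception vanished
try:
    cleanup_no_return()
except ValueError:
    print("ValueError propagated as expected")
```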

dags/hivemind_etl_helpers/src/db/telegram/extract/__init__.py

1-1: .tc_chats.TelegramChats imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


2-2: .messages.ExtractMessages imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)
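For the F401 findings on the `__init__.py` files, declaring `__all__` marks the imports as intentional re-exports. The throwaway package below demonstrates the pattern end to end (class and module names mirror this PR, but the on-disk setup exists only for the demo):

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a disposable package to show the F401-friendly __init__.py pattern.
pkg = Path(tempfile.mkdtemp()) / "telegram_extract"
pkg.mkdir()
(pkg / "tc_chats.py").write_text("class TelegramChats:\n    pass\n")
(pkg / "messages.py").write_text("class ExtractMessages:\n    pass\n")
# __all__ tells linters the names are deliberately re-exported.
(pkg / "__init__.py").write_text(textwrap.dedent("""\
    from .tc_chats import TelegramChats
    from .messages import ExtractMessages

    __all__ = ["TelegramChats", "ExtractMessages"]
"""))

sys.path.insert(0, str(pkg.parent))
from telegram_extract import TelegramChats, ExtractMessages  # noqa: E402

print(TelegramChats.__name__, ExtractMessages.__name__)
```

The alternative Ruff mentions, a redundant alias (`from .tc_chats import TelegramChats as TelegramChats`), signals the same intent to type checkers without an `__all__` list.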

dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py

32-34: f-string without any placeholders

Remove extraneous f prefix

(F541)

dags/hivemind_etl_helpers/src/db/telegram/transform/__init__.py

1-1: .messages.TransformMessages imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

dags/hivemind_etl_helpers/src/db/telegram/utils/__init__.py

1-1: .platform.TelegramPlatform imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

dags/hivemind_telegram_etl.py

6-6: dotenv.load_dotenv imported but unused

Remove unused import: dotenv.load_dotenv

(F401)


11-11: qdrant_client.http.models imported but unused

Remove unused import: qdrant_client.http.models

(F401)

🔇 Additional comments (23)
dags/hivemind_etl_helpers/src/db/telegram/utility.py (1)

1-3: LGTM: Import statement is well-structured.

The import statement is correctly formatted and imports only the necessary class. This approach helps maintain clean and efficient code.

dags/hivemind_etl_helpers/src/db/telegram/schema.py (1)

1-1: LGTM: Import statement is correct.

The import of BaseModel from pydantic is appropriate for defining the Pydantic model.

dags/hivemind_etl_helpers/src/db/telegram/extract/tc_chats.py (2)

1-3: LGTM: Imports are appropriate.

The imports for logging and Neo4jOps from tc_neo4j_lib are suitable for the class's functionality.


6-9: LGTM: Class structure is well-defined.

The TelegramChats class is correctly structured with an __init__ method that initializes the database connection using the Neo4jOps singleton. This approach is efficient for managing database connections.

dags/hivemind_etl_helpers/tests/integration/test_telegram_comminity.py (5)

1-7: LGTM: Import statements are appropriate and concise.

The import statements are relevant to the test cases and there are no unused imports.


10-20: LGTM: Class structure and setUp method are well-defined.

The TestTelegramPlatform class correctly extends TestCase and the setUp method properly initializes the test environment, including cleaning the database for each test.


22-24: LGTM: Test case for no platform available is clear and correct.

The test_check_no_platform_available method effectively tests the scenario where no platform exists in the database.


26-44: LGTM: Test case for single platform availability is comprehensive.

The test_single_platform_available method effectively tests the scenario where a single platform exists in the database. It correctly inserts a platform document and verifies its retrieval.


94-99: LGTM: Test case for platform creation is concise and effective.

The test_create_platform method effectively tests the creation of a new platform and verifies its existence in the database.

dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (8)

1-4: LGTM: Appropriate imports for the test file.

The import statements are well-structured and include all necessary components for the test suite.


7-7: LGTM: Well-structured test class.

The TestTelegramChats class is properly set up, inheriting from TestCase and following unittest conventions.


8-11: LGTM: Proper test setup.

The setUp method correctly initializes the test environment by loading environment variables, creating a TelegramChats instance, and ensuring a clean state.


13-14: LGTM: Proper test cleanup.

The tearDown method ensures a clean state after each test by calling _delete_everything, which is a good practice for test isolation.


16-19: LGTM: Effective database cleanup method.

The _delete_everything method correctly uses a Cypher query to remove all nodes from the Neo4j database, ensuring a clean state for each test.


21-27: LGTM: Good edge case test for empty database.

The test_extract_chats_empty_db method effectively tests the behavior of extract_chats when the database is empty. The assertion message is clear and descriptive.


29-55: LGTM: Comprehensive test for single chat extraction.

The test_extract_chats_single_chat method effectively tests the extract_chats functionality for a single chat entry. It correctly creates a chat with appropriate parameters and verifies the extracted chat ID.


57-109: LGTM: Thorough test for multiple chat extraction.

The test_extract_chats_multiple_chats method effectively tests the extract_chats functionality for multiple chat entries. It correctly creates multiple chats with appropriate parameters and verifies that all chat IDs are extracted.

dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (3)

1-9: LGTM: Imports and test class definition are appropriate.

The necessary modules and classes are imported, and the test class is correctly defined as a subclass of TestCase.


24-32: LGTM: Empty data extraction tests are well-implemented.

These tests effectively cover the edge case of extracting data from an empty database, both with and without a from_date parameter. The assertions correctly check for an empty list as the expected result.


76-102: LGTM: Test for extraction with future date is well-implemented.

This test effectively verifies that no messages are returned when the from_date is set to a future date. The test structure is consistent with the previous test, which is good for readability and maintainability.

dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (1)

39-55: Confirm the inclusion of author in LLM metadata

In excluded_embed_metadata_keys, the "author" key is excluded, but it is not excluded in excluded_llm_metadata_keys. This means author will be included in LLM metadata but not in embedding metadata. Is this intentional? If not, consider adding "author" to excluded_llm_metadata_keys for consistency:

                     excluded_llm_metadata_keys=[
+                        "author",
                         "createdAt",
                         "updatedAt",
                         "mentions",
                         "replies",
                         "reactors",
                         "chat_name",
                     ],
dags/hivemind_telegram_etl.py (1)

102-104: ⚠️ Potential issue

Ensure correct parameter passing to processor function

In line 104, the processor function is called with chat_id=chat_id, but the function expects chat_info as the first parameter. Please update the function call to match the function signature.

Apply this diff to correct the function call:

-    processor(chat_id=chat_id, community_id=community_id)
+    processor(chat_info=chat_id, community_id=community_id)

Additionally, verify that chat_id here is actually a chat_info tuple. If not, ensure that you're passing the correct data structure expected by processor.

dags/hivemind_etl_helpers/ingestion_pipeline.py (1)

93-122: Method _create_payload_index looks good

The implementation of _create_payload_index method is clear and the docstring is well-documented. It effectively creates an index on the specified payload field.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c563c19 and dd5101b.

📒 Files selected for processing (3)
  • dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py
  • dags/hivemind_etl_helpers/tests/integration/test_telegram_chats.py
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py

35-37: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (5)
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (5)

1-10: LGTM: Imports and class initialization look good.

The imports are appropriate, and the ExtractMessages class is initialized correctly with a chat_id parameter and a Neo4j connection.


12-26: LGTM: Well-defined method signature and informative docstring.

The extract method is clearly defined with appropriate type hints. The docstring provides comprehensive information about the method's purpose, parameters, and return value.


71-82: LGTM: Query execution and result transformation are well-implemented.

The conditional parameter setup, query execution within a Neo4j session, and transformation of results into TelegramMessagesModel instances are all implemented correctly.


84-84: LGTM: Correct return statement.

The method correctly returns the list of TelegramMessagesModel instances, which is consistent with the method signature.


1-84: Overall assessment: Good implementation with minor improvements needed.

The ExtractMessages class provides a well-structured and mostly correct implementation for extracting Telegram messages from a Neo4j database. The code is generally clear, with appropriate type hints and informative docstrings.

The main area for improvement is in the query construction, as highlighted in the previous comment. Addressing these issues will enhance the efficiency and robustness of the code.

Great job overall, and once the suggested changes are implemented, this will be a solid addition to the project.

🧰 Tools
🪛 Ruff

35-37: f-string without any placeholders

Remove extraneous f prefix

(F541)

@amindadgar amindadgar merged commit ab5136f into main Oct 21, 2024
14 checks passed
Development

Successfully merging this pull request may close these issues.

feat: Telegram raw data vectorize