Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added word max limit! #335

Merged
merged 23 commits into from
Dec 10, 2024
Merged

Conversation

amindadgar
Copy link
Member

@amindadgar amindadgar commented Dec 5, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a minimum word limit for fetching messages.
    • Added support for filtering messages based on character length.
  • Bug Fixes

    • Adjusted error handling in message transformation to focus on content validity.
  • Tests

    • Enhanced test coverage for message fetching and transformation functionalities.
    • Updated assertions to reflect changes in message content and expected outputs.
  • Chores

    • Updated Dockerfile to include the English language model for spaCy.
    • Modified requirements.txt to update package versions and add new dependencies.

Copy link
Contributor

coderabbitai bot commented Dec 5, 2024

Walkthrough

This pull request introduces several modifications across multiple files, primarily focusing on the fetch_raw_messages function to include a new parameter for filtering messages based on a minimum word count. Additionally, new tests are added to verify this functionality. The Dockerfile is updated to include a language model for spaCy, and the requirements file reflects a version update for a package along with the addition of spaCy. Other changes involve adjustments to existing tests and the transformation logic for Discord messages.

Changes

File Change Summary
dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py - Updated fetch_raw_messages function to include **kwargs and a min_word_limit parameter.
- Modified logic to filter messages based on word count.
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py - Added tests test_fetch_raw_messages_fetch_limited_characters and test_fetch_raw_messages_fetch_limited_characters_specified to verify message filtering based on character length and specified word limits.
Dockerfile - Added command to download the en_core_web_sm language model for spaCy in both production and test stages.
requirements.txt - Updated tc-hivemind-backend version from 1.4.0 to 1.4.2.
- Added spacy==3.7.5.
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py - Updated message contents and expected outputs in test_transform_two_data to reflect new test cases.
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py - Modified message content in test_save_documents and updated assertions accordingly.
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py - Removed prepare_raw_message_urls import and related functionality.
- Adjusted error handling and content validation logic.
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py - Updated assertions in TestTransformRawMsgToDocument to reflect changes in URL representation in transformed messages.

Possibly related PRs

🐰 In the code, we hop and play,
Fetching messages in a new way.
With word limits set just right,
Our tests ensure we fetch with might!
Docker's ready, spaCy's here,
Transforming messages, let’s give a cheer! 🎉


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d3f8d7d and 67a3aae.

📒 Files selected for processing (2)
  • dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (6 hunks)
  • requirements.txt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • requirements.txt
  • dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (2)

24-27: Fix docstring inconsistencies and typos

The docstring has the following issues:

  • Typo: "shuold" should be "should"
  • Inconsistency: The parameter is named "min_word_limit" but the description mentions "characters"
     kwargs : dict
         min_word_limit : int
-            the minimum words that the messages shuold contain
-            default is 8 characters
+            the minimum words that the messages should contain
+            default is 8 words

Line range hint 4-53: Consider improving type safety and modularity

The function could benefit from the following improvements:

  1. Add type hints for kwargs to document expected parameters
  2. Extract word limit filtering logic to a separate function
+from typing import TypedDict
+
+class MessageFilterOptions(TypedDict, total=False):
+    min_word_limit: int
+
 def fetch_raw_messages(
     guild_id: str,
     selected_channels: list[str],
     from_date: datetime,
-    **kwargs,
+    **kwargs: MessageFilterOptions,
 ) -> list[dict]:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bc758f7 and ba4c527.

📒 Files selected for processing (2)
  • dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (4 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py

375-375: Redefinition of unused test_fetch_raw_messages_fetch_limited_characters from line 290

(F811)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1)

290-466: Refactor test methods to reduce duplication

Both test methods share identical setup code and could benefit from refactoring to improve maintainability and readability.

Consider these improvements:

  1. Extract common setup code into helper methods:
def _setup_test_messages(self, messages_data):
    """
    Helper method to setup test messages in the database
    
    Args:
        messages_data: List of tuples containing (author_index, content)
    """
    client = MongoSingleton.get_instance().client
    guild_id = "1234"
    channels = ["111111", "22222"]
    users_id = ["user1", "user2", "user3", "user4", "user5"]
    
    self.setup_db(channels=channels, guild_id=guild_id)
    self._setup_guild_members(guild_id, users_id)
    
    raw_data = [
        self._create_message(
            author=users_id[author_idx],
            content=content,
            channel=channels[idx % len(channels)]
        )
        for idx, (author_idx, content) in enumerate(messages_data)
    ]
    
    client[guild_id].drop_collection("rawinfos")
    client[guild_id]["rawinfos"].insert_many(raw_data)
    return guild_id, channels
  1. Refactor test methods to use the helper:
def test_fetch_raw_messages_fetch_limited_characters(self):
    """
    Test fetching messages with default word limit (8 words)
    """
    messages_data = [
        (0, "AA"),
        (1, "A sample text with more than 8 characters!")
    ]
    guild_id, channels = self._setup_test_messages(messages_data)
    
    messages = fetch_raw_messages(
        guild_id,
        selected_channels=channels,
        from_date=datetime(2023, 9, 20)
    )
    
    self.assertEqual(len(messages), 1)
    self.assertEqual(
        messages[0]["content"],
        "A sample text with more than 8 characters!"
    )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ba4c527 and 3a018eb.

📒 Files selected for processing (1)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1 hunks)
🔇 Additional comments (1)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1)

378-466: ⚠️ Potential issue

Fix duplicate test method and improve assertions

The test method has several issues:

  1. Name causes F811 error (redefinition of unused name)
  2. Incorrect usage of assertCountEqual
  3. Duplicate guild_id assignment
  4. Comment refers to incorrect date

Apply these changes:

-    def test_fetch_raw_messages_fetch_limited_characters_specified(self):
+    def test_fetch_raw_messages_with_custom_word_limit(self):
         """
-        fetch raw messages and do filtering with a specified value
+        Test fetching messages with a custom minimum word limit of 1 word
         """
         client = MongoSingleton.get_instance().client
 
         guild_id = "1234"
         channels = ["111111", "22222"]
         users_id = ["user1", "user2", "user3", "user4", "user5"]
-        guild_id = "1234"  # Remove duplicate assignment
         self.setup_db(
             channels=channels,
             guild_id=guild_id,
         )
         # ... rest of the setup ...
-        # Fetch messages from a specific date (October 3, 2023)
+        # Fetch messages with minimum word limit of 1
         messages = fetch_raw_messages(
             guild_id,
             selected_channels=channels,
             from_date=datetime(2023, 9, 20),
             min_word_limit=1,
         )
-        # Check if the fetched messages are equal to the expected messages
-        self.assertCountEqual(messages, 2)
+        # Verify both messages are returned with word limit of 1
+        self.assertEqual(len(messages), 2)
+        contents = {msg["content"] for msg in messages}
+        self.assertEqual(
+            contents,
+            {"AA", "A sample text with more than 8 characters!"}
+        )

@amindadgar amindadgar linked an issue Dec 5, 2024 that may be closed by this pull request
3 tasks
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (11)
Dockerfile (1)

7-7: Consider using a smaller spaCy model for better resource efficiency

The en_core_web_lg model is quite large (~560MB). Consider using en_core_web_sm (~12MB) or en_core_web_md (~40MB) if the full feature set of the large model isn't necessary. This would significantly reduce the container size and build time.

-RUN python -m spacy download en_core_web_lg
+RUN python -m spacy download en_core_web_md
dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py (2)

8-21: Improve method documentation

The docstring needs improvement:

  1. "a discord message text" is too specific as this is a base class
  2. Missing description of the cleaning process
  3. Missing example usage
     def extract_main_content(self, text: str) -> str:
         """
-        extract main content of a message
+        Extract the main content from text by removing noise and standardizing tokens.
+        
+        This method:
+        - Removes punctuation, whitespace, stop words
+        - Filters out URLs and numbers
+        - Converts tokens to their base form (lemmatization)
+        - Ensures ASCII compatibility

         Parameters
         ------------
         text : str
-            a discord message text
+            The input text to process

         Returns
         --------
         cleaned_text : str
+            Processed text containing only meaningful content
+
+        Example
+        --------
+        >>> preprocessor = BasePreprocessor()
+        >>> preprocessor.extract_main_content("Hello, how are you? https://example.com")
+        'hello'
         """

29-39: Consider making token filtering rules configurable

The token filtering rules are hardcoded. Consider making them configurable through constructor parameters to increase flexibility.

 class BasePreprocessor:
-    def __init__(self) -> None:
+    def __init__(
+        self,
+        *,
+        remove_punct: bool = True,
+        remove_space: bool = True,
+        remove_stop: bool = True,
+        remove_url: bool = True,
+        remove_num: bool = True,
+        ascii_only: bool = True
+    ) -> None:
         try:
             self.nlp = spacy.load("en_core_web_lg")
         except OSError as exp:
             raise OSError("Model spacy `en_core_web_lg` is not installed!") from exp
+        self.filters = {
+            'remove_punct': remove_punct,
+            'remove_space': remove_space,
+            'remove_stop': remove_stop,
+            'remove_url': remove_url,
+            'remove_num': remove_num,
+            'ascii_only': ascii_only
+        }
dags/hivemind_etl_helpers/tests/unit/test_discord_preprocess.py (3)

26-28: Add assertion message for better test failure debugging

When using assertEqual, include a message parameter to provide context when the test fails.

-        self.assertEqual(self.preprocessor.remove_ids(input_text), expected)
+        self.assertEqual(
+            self.preprocessor.remove_ids(input_text),
+            expected,
+            "Failed to remove multiple user and role IDs correctly"
+        )

11-43: Add more edge cases to test suite

The test suite could benefit from additional edge cases:

  1. Malformed IDs
  2. Nested IDs
  3. IDs with special characters

Example additional test:

def test_malformed_ids(self):
    """Test handling of malformed ID patterns"""
    input_text = "Hello <@12345 <@&> <@&abc>"
    expected = "Hello"
    self.assertEqual(
        self.preprocessor.remove_ids(input_text),
        expected,
        "Failed to handle malformed IDs correctly"
    )

6-10: Consider using pytest fixtures instead of setUp

While unittest's setUp works, pytest fixtures provide more flexibility and better reuse across test files.

-class TestDiscordPreprocessor(unittest.TestCase):
-    def setUp(self):
-        """Set up test fixtures before each test method."""
-        self.preprocessor = DiscordPreprocessor()
+import pytest
+
+@pytest.fixture
+def preprocessor():
+    """Fixture providing a DiscordPreprocessor instance."""
+    return DiscordPreprocessor()
+
+def test_remove_user_ids(preprocessor):
+    """Test removal of user IDs from text"""
+    input_text = "Hello <@123456> how are you?"
+    expected = "Hello how are you?"
+    assert preprocessor.remove_ids(input_text) == expected
dags/hivemind_etl_helpers/src/db/discord/preprocessor.py (3)

11-12: Remove unnecessary empty init

The empty __init__ method can be removed as it doesn't add any functionality beyond what's inherited from BasePreprocessor.

-    def __init__(self) -> None:
-        pass

14-35: Optimize clean_texts method implementation

The method could be more efficient using list comprehension and could benefit from better type hints.

Consider this improvement:

-    def clean_texts(self, texts: list[str]) -> list[str]:
+    def clean_texts(self, texts: list[str], *, skip_empty: bool = True) -> list[str]:
         """
         clean the given text

         Parameters
         ------------
         texts : list[str]
             a list of discord messages text
+        skip_empty : bool, optional
+            whether to skip empty results after cleaning (default: True)

         Returns
         ---------
         texts_cleaned : str
             the cleaned text
             (discord ids removed)
         """
-        texts_cleaned: list[str] = []
-
-        for text in texts:
-            text_cleaned = self.clean_text(text=text)
-            texts_cleaned.append(text_cleaned)
-
-        return texts_cleaned
+        cleaned = [self.clean_text(text=text) for text in texts]
+        return [text for text in cleaned if text or not skip_empty]

73-78: Enhance regex pattern and string cleaning

The current implementation has a few potential improvements:

  1. The regex pattern could be compiled once for better performance
  2. Multiple spaces handling could be more efficient

Consider these improvements:

+    # Class variable for better performance
+    _ID_PATTERN = re.compile(r"<@&?\d+>")
+
     def remove_ids(self, text: str) -> str:
-        pattern = r"<@&?\d+>"
-
         # Removing matches
-        cleaned_text = re.sub(pattern, "", text)
-
-        cleaned_text = " ".join(cleaned_text.split())
+        cleaned_text = self._ID_PATTERN.sub("", text).strip()
+        cleaned_text = " ".join(word for word in cleaned_text.split() if word)
         return cleaned_text
dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (2)

17-17: Remove unnecessary f-string prefix.

The string doesn't contain any placeholders, so the f prefix is not needed.

-            raise OSError(f"Model spacy `en_core_web_lg` is not installed!") from exp
+            raise OSError("Model spacy `en_core_web_lg` is not installed!") from exp
🧰 Tools
🪛 Ruff (0.8.0)

17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)


19-141: LGTM! Comprehensive test coverage with clear test cases.

The test suite provides good coverage of basic functionality with clear, focused test cases and descriptive docstrings.

Consider adding these test cases to improve coverage:

  1. Test word limit functionality (mentioned in PR title)
  2. Edge cases:
    • Messages with only mentions/no actual content
    • Very long messages
    • Messages with repeated mentions
  3. Message content normalization:
    • Whitespace handling
    • Case sensitivity
    • Unicode characters

Example test case for word limit:

def test_word_limit(self):
    """Test function respects word limit configuration"""
    test_data = [
        {"content": "This is a short message"},
        {"content": "This is a longer message with more than five words"},
    ]
    
    # Test with 5-word limit
    result = update_raw_messages(test_data, min_word_limit=5)
    self.assertEqual(len(result), 1, "Should only return message exceeding word limit")
    self.assertEqual(
        result[0]["content"],
        test_data[1]["content"],
        "Should return longer message"
    )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 70459fc and 33d0fb4.

📒 Files selected for processing (10)
  • Dockerfile (1 hunks)
  • dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py (2 hunks)
  • dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (4 hunks)
  • dags/hivemind_etl_helpers/src/db/discord/preprocessor.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (3 hunks)
  • dags/hivemind_etl_helpers/tests/unit/test_common_preprocessor.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/unit/test_discord_preprocess.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (1 hunks)
  • requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • requirements.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py
  • dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py

17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)

dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py

25-25: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (4)
dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py (2)

8-8: LGTM: Import statement is correctly placed

The import statement for DiscordPreprocessor is appropriately placed with other imports.


33-34: ⚠️ Potential issue

Verify error handling for empty processed messages

The code processes messages but doesn't handle the case where all messages might be filtered out during preprocessing.

Let's verify the behavior of transform_discord_raw_messages with empty input:

Consider adding a check:

     processed_messages = update_raw_messages(raw_data=raw_mongo_messages)
+    if not processed_messages:
+        return []
     messages_docuemnt = transform_discord_raw_messages(guild_id, processed_messages)
dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (2)

1-8: LGTM! Imports are clean and appropriate.

The imports cover all necessary dependencies for the test suite.


12-16: LGTM! Efficient test setup.

Loading the spaCy model once in setUpClass is a good practice for performance optimization.

Dockerfile Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (1)

127-127: Verify text preprocessing assumptions

The test message has been updated to demonstrate multiple preprocessing steps (stemming, stop word removal, case normalization). Consider adding a comment explaining these preprocessing expectations for better test documentation.

-                "content": "Apples are falling from trees 1",
+                # Test message to verify text preprocessing:
+                # - Word stemming (falling -> fall)
+                # - Stop word removal (are, from)
+                # - Case normalization (Apples -> apple)
+                # - Number removal (1)
+                "content": "Apples are falling from trees 1",
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2)

Line range hint 1-324: Consider improving test organization and documentation

The test file is well-structured but could benefit from:

  1. Docstrings explaining the test scenarios
  2. Breaking down the large test method into smaller, focused test cases
  3. Using test data factories or fixtures for common setup

Consider refactoring the test class like this:

@pytest.fixture
def discord_setup():
    """Setup basic Discord test environment with channels and roles."""
    # Move setup_db logic here
    pass

class TestDiscordMessageTransformation:
    """Test suite for Discord message transformation functionality."""
    
    def test_basic_message_transformation(self, discord_setup):
        """Test basic message content is correctly transformed."""
        pass
    
    def test_mentions_transformation(self, discord_setup):
        """Test messages with user/role mentions are correctly transformed."""
        pass
    
    def test_url_transformation(self, discord_setup):
        """Test messages with URLs are correctly transformed."""
        pass

Add proper test cleanup using tearDown to prevent test interference

The test file shares database collections with multiple other integration tests (17+ files found). While it does cleanup collections at the start of each test method, it doesn't guarantee cleanup after test execution, especially in case of test failures. This could lead to test pollution.

  • Add tearDown method to ensure cleanup even if tests fail:
def tearDown(self):
    """Clean up test database state."""
    client = MongoSingleton.get_instance().client
    client["Core"].drop_collection("modules")
    client["Core"].drop_collection("platforms")
    client[self.guild_id].drop_collection("guildmembers")
    client[self.guild_id].drop_collection("rawinfos")
    client[self.guild_id].drop_collection("roles")
🔗 Analysis chain

Line range hint 1-324: Verify test isolation and cleanup

The test uses direct database operations but doesn't ensure proper cleanup in case of test failures.

Let's check if there are any cleanup mechanisms in place:

Add proper cleanup using setUp/tearDown methods:

def setUp(self):
    """Set up test database state."""
    self.client = MongoSingleton.get_instance().client
    self.guild_id = "1234"
    self.cleanup_collections()
    
def tearDown(self):
    """Clean up test database state."""
    self.cleanup_collections()
    
def cleanup_collections(self):
    """Drop all test collections."""
    self.client["Core"].drop_collection("modules")
    self.client["Core"].drop_collection("platforms")
    self.client[self.guild_id].drop_collection("guildmembers")
    self.client[self.guild_id].drop_collection("rawinfos")
    self.client[self.guild_id].drop_collection("roles")

Based on the search results, I can see that several test files in the codebase have proper cleanup mechanisms implemented:

  1. Some tests use tearDown with proper cleanup:

    • test_telegram_comminity.py
    • test_extract_messages.py
    • test_telegram_chats.py
  2. The file in question (test_discord_prepare_document_from_db.py) uses drop_collection calls at the beginning of each test method, which is a form of setup cleanup, but lacks proper teardown.

Let me verify if there are any other test files that might be affected by this test's database state:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for cleanup patterns in test files
rg -A 5 "def tearDown|cleanup|drop_collection" "dags/hivemind_etl_helpers/tests"

Length of output: 31557


Script:

#!/bin/bash
# Check for other test files that might use the same collections
rg -l "Core.*modules|Core.*platforms|guildmembers|rawinfos" "dags/hivemind_etl_helpers/tests"

Length of output: 1526

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 33d0fb4 and 5b4057d.

📒 Files selected for processing (3)
  • Dockerfile (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile
🔇 Additional comments (1)
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (1)

188-188: Verify preprocessing pipeline consistency

The expected text demonstrates the preprocessing pipeline's behavior. Consider adding additional test cases to verify edge cases:

  1. Multiple numbers in text
  2. Special characters
  3. Mixed case words
  4. Common stop words

Let's verify if there are other test cases covering these scenarios:

the processor should move to the embedding part and we shouldn't change the actual content of the messages. They have been moved to hivemind_backend library.
the test case didn't belong to any code.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (2)

21-21: Consider documenting BasePreprocessor's validation criteria

The introduction of BasePreprocessor for content validation is a good architectural choice, but its validation criteria should be documented to ensure maintainability.

Consider adding a docstring or comment explaining what constitutes "valuable information" in the context of message processing.


240-240: Remove commented out metadata exclusions

Since URL reference handling has been removed, these commented lines should be deleted rather than left as comments.

Apply these changes:

-            # "url_reference",

Also applies to: 257-257

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cd4e0c7 and 866088d.

📒 Files selected for processing (5)
  • dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (6 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (2 hunks)
  • requirements.txt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • requirements.txt
  • dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py
  • dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py
🔇 Additional comments (1)
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (1)

222-222: LGTM: Proper input sanitization added

Good addition of control character removal for cleaner text processing.

@amindadgar amindadgar merged commit 96d6787 into main Dec 10, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Hivemind] Data processing improvements
1 participant