feat: Added word max limit! #335

amindadgar · 2024-12-05T10:01:34Z

Summary by CodeRabbit

New Features
- Introduced a minimum word limit for fetching messages.
- Added support for filtering messages based on character length.
Bug Fixes
- Adjusted error handling in message transformation to focus on content validity.
Tests
- Enhanced test coverage for message fetching and transformation functionalities.
- Updated assertions to reflect changes in message content and expected outputs.
Chores
- Updated Dockerfile to include the English language model for spaCy.
- Modified requirements.txt to update package versions and add new dependencies.

coderabbitai · 2024-12-05T10:01:43Z

Walkthrough

This pull request introduces several modifications across multiple files, primarily focusing on the fetch_raw_messages function to include a new parameter for filtering messages based on a minimum word count. Additionally, new tests are added to verify this functionality. The Dockerfile is updated to include a language model for spaCy, and the requirements file reflects a version update for a package along with the addition of spaCy. Other changes involve adjustments to existing tests and the transformation logic for Discord messages.

Changes

File	Change Summary
`dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py`	- Updated `fetch_raw_messages` function to include `**kwargs` and a `min_word_limit` parameter. - Modified logic to filter messages based on word count.
`dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py`	- Added tests `test_fetch_raw_messages_fetch_limited_characters` and `test_fetch_raw_messages_fetch_limited_characters_specified` to verify message filtering based on character length and specified word limits.
`Dockerfile`	- Added command to download the `en_core_web_sm` language model for spaCy in both production and test stages.
`requirements.txt`	- Updated `tc-hivemind-backend` version from `1.4.0` to `1.4.2`. - Added `spacy==3.7.5`.
`dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py`	- Updated message contents and expected outputs in `test_transform_two_data` to reflect new test cases.
`dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py`	- Modified message content in `test_save_documents` and updated assertions accordingly.
`dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py`	- Removed `prepare_raw_message_urls` import and related functionality. - Adjusted error handling and content validation logic.
`dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py`	- Updated assertions in `TestTransformRawMsgToDocument` to reflect changes in URL representation in transformed messages.

Possibly related PRs

fix: discord hivemind ETL added removal of null characters! #295: This PR introduces a removal of null characters in the transform_discord_raw_messages.py file, which is relevant to the changes made in the main PR that also modifies the same file and function.
Feat: Hivemind include url for documents #330: This PR enhances the transform_discord_raw_messages.py file by adding URL construction logic, which relates to the changes in the main PR that also involves modifications to the same file and function.

🐰 In the code, we hop and play,
Fetching messages in a new way.
With word limits set just right,
Our tests ensure we fetch with might!
Docker's ready, spaCy's here,
Transforming messages, let’s give a cheer! 🎉

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d3f8d7d and 67a3aae.

📒 Files selected for processing (2)

dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (6 hunks)
requirements.txt (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

requirements.txt
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)

dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (2)
24-27: Fix docstring inconsistencies and typos

The docstring has the following issues:

Typo: "shuold" should be "should"

Inconsistency: The parameter is named "min_word_limit" but the description mentions "characters"
     kwargs : dict
         min_word_limit : int
-            the minimum words that the messages shuold contain
-            default is 8 characters
+            the minimum words that the messages should contain
+            default is 8 words
Line range hint 4-53: Consider improving type safety and modularity

The function could benefit from the following improvements:

Add type hints for kwargs to document expected parameters

Extract word limit filtering logic to a separate function
+from typing import TypedDict
+
+class MessageFilterOptions(TypedDict, total=False):
+    min_word_limit: int
+
 def fetch_raw_messages(
     guild_id: str,
     selected_channels: list[str],
     from_date: datetime,
-    **kwargs,
+    **kwargs: MessageFilterOptions,
 ) -> list[dict]:

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bc758f7 and ba4c527.

📒 Files selected for processing (2)

dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (4 hunks)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.8.0)

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py

375-375: Redefinition of unused test_fetch_raw_messages_fetch_limited_characters from line 290

(F811)

dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1)

290-466: Refactor test methods to reduce duplication

Both test methods share identical setup code and could benefit from refactoring to improve maintainability and readability.

Consider these improvements:

Extract common setup code into helper methods:

def _setup_test_messages(self, messages_data):
    """
    Helper method to setup test messages in the database
    
    Args:
        messages_data: List of tuples containing (author_index, content)
    """
    client = MongoSingleton.get_instance().client
    guild_id = "1234"
    channels = ["111111", "22222"]
    users_id = ["user1", "user2", "user3", "user4", "user5"]
    
    self.setup_db(channels=channels, guild_id=guild_id)
    self._setup_guild_members(guild_id, users_id)
    
    raw_data = [
        self._create_message(
            author=users_id[author_idx],
            content=content,
            channel=channels[idx % len(channels)]
        )
        for idx, (author_idx, content) in enumerate(messages_data)
    ]
    
    client[guild_id].drop_collection("rawinfos")
    client[guild_id]["rawinfos"].insert_many(raw_data)
    return guild_id, channels

Refactor test methods to use the helper:

def test_fetch_raw_messages_fetch_limited_characters(self):
    """
    Test fetching messages with default word limit (8 words)
    """
    messages_data = [
        (0, "AA"),
        (1, "A sample text with more than 8 characters!")
    ]
    guild_id, channels = self._setup_test_messages(messages_data)
    
    messages = fetch_raw_messages(
        guild_id,
        selected_channels=channels,
        from_date=datetime(2023, 9, 20)
    )
    
    self.assertEqual(len(messages), 1)
    self.assertEqual(
        messages[0]["content"],
        "A sample text with more than 8 characters!"
    )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ba4c527 and 3a018eb.

📒 Files selected for processing (1)

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1 hunks)

🔇 Additional comments (1)

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1)

378-466: ⚠️ Potential issue

Fix duplicate test method and improve assertions

The test method has several issues:

Name causes F811 error (redefinition of unused name)
Incorrect usage of assertCountEqual
Duplicate guild_id assignment
Comment refers to incorrect date

Apply these changes:

-    def test_fetch_raw_messages_fetch_limited_characters_specified(self):
+    def test_fetch_raw_messages_with_custom_word_limit(self):
         """
-        fetch raw messages and do filtering with a specified value
+        Test fetching messages with a custom minimum word limit of 1 word
         """
         client = MongoSingleton.get_instance().client
 
         guild_id = "1234"
         channels = ["111111", "22222"]
         users_id = ["user1", "user2", "user3", "user4", "user5"]
-        guild_id = "1234"  # Remove duplicate assignment
         self.setup_db(
             channels=channels,
             guild_id=guild_id,
         )
         # ... rest of the setup ...
-        # Fetch messages from a specific date (October 3, 2023)
+        # Fetch messages with minimum word limit of 1
         messages = fetch_raw_messages(
             guild_id,
             selected_channels=channels,
             from_date=datetime(2023, 9, 20),
             min_word_limit=1,
         )
-        # Check if the fetched messages are equal to the expected messages
-        self.assertCountEqual(messages, 2)
+        # Verify both messages are returned with word limit of 1
+        self.assertEqual(len(messages), 2)
+        contents = {msg["content"] for msg in messages}
+        self.assertEqual(
+            contents,
+            {"AA", "A sample text with more than 8 characters!"}
+        )

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py

coderabbitai

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (11)

Dockerfile (1)

7-7: Consider using a smaller spaCy model for better resource efficiency

The en_core_web_lg model is quite large (~560MB). Consider using en_core_web_sm (~12MB) or en_core_web_md (~40MB) if the full feature set of the large model isn't necessary. This would significantly reduce the container size and build time.
-RUN python -m spacy download en_core_web_lg
+RUN python -m spacy download en_core_web_md

dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py (2)

8-21: Improve method documentation

The docstring needs improvement:

"a discord message text" is too specific as this is a base class
Missing description of the cleaning process
Missing example usage

     def extract_main_content(self, text: str) -> str:
         """
-        extract main content of a message
+        Extract the main content from text by removing noise and standardizing tokens.
+        
+        This method:
+        - Removes punctuation, whitespace, stop words
+        - Filters out URLs and numbers
+        - Converts tokens to their base form (lemmatization)
+        - Ensures ASCII compatibility

         Parameters
         ------------
         text : str
-            a discord message text
+            The input text to process

         Returns
         --------
         cleaned_text : str
+            Processed text containing only meaningful content
+
+        Example
+        --------
+        >>> preprocessor = BasePreprocessor()
+        >>> preprocessor.extract_main_content("Hello, how are you? https://example.com")
+        'hello'
         """

29-39: Consider making token filtering rules configurable

The token filtering rules are hardcoded. Consider making them configurable through constructor parameters to increase flexibility.

 class BasePreprocessor:
-    def __init__(self) -> None:
+    def __init__(
+        self,
+        *,
+        remove_punct: bool = True,
+        remove_space: bool = True,
+        remove_stop: bool = True,
+        remove_url: bool = True,
+        remove_num: bool = True,
+        ascii_only: bool = True
+    ) -> None:
         try:
             self.nlp = spacy.load("en_core_web_lg")
         except OSError as exp:
             raise OSError("Model spacy `en_core_web_lg` is not installed!") from exp
+        self.filters = {
+            'remove_punct': remove_punct,
+            'remove_space': remove_space,
+            'remove_stop': remove_stop,
+            'remove_url': remove_url,
+            'remove_num': remove_num,
+            'ascii_only': ascii_only
+        }

dags/hivemind_etl_helpers/tests/unit/test_discord_preprocess.py (3)

26-28: Add assertion message for better test failure debugging

When using assertEqual, include a message parameter to provide context when the test fails.

-        self.assertEqual(self.preprocessor.remove_ids(input_text), expected)
+        self.assertEqual(
+            self.preprocessor.remove_ids(input_text),
+            expected,
+            "Failed to remove multiple user and role IDs correctly"
+        )

11-43: Add more edge cases to test suite

The test suite could benefit from additional edge cases:

Malformed IDs
Nested IDs
IDs with special characters

Example additional test:

def test_malformed_ids(self):
    """Test handling of malformed ID patterns"""
    input_text = "Hello <@12345 <@&> <@&abc>"
    expected = "Hello"
    self.assertEqual(
        self.preprocessor.remove_ids(input_text),
        expected,
        "Failed to handle malformed IDs correctly"
    )

6-10: Consider using pytest fixtures instead of setUp

While unittest's setUp works, pytest fixtures provide more flexibility and better reuse across test files.

-class TestDiscordPreprocessor(unittest.TestCase):
-    def setUp(self):
-        """Set up test fixtures before each test method."""
-        self.preprocessor = DiscordPreprocessor()
+import pytest
+
+@pytest.fixture
+def preprocessor():
+    """Fixture providing a DiscordPreprocessor instance."""
+    return DiscordPreprocessor()
+
+def test_remove_user_ids(preprocessor):
+    """Test removal of user IDs from text"""
+    input_text = "Hello <@123456> how are you?"
+    expected = "Hello how are you?"
+    assert preprocessor.remove_ids(input_text) == expected

dags/hivemind_etl_helpers/src/db/discord/preprocessor.py (3)

11-12: Remove unnecessary empty init

The empty __init__ method can be removed as it doesn't add any functionality beyond what's inherited from BasePreprocessor.

-    def __init__(self) -> None:
-        pass

14-35: Optimize clean_texts method implementation

The method could be more efficient using list comprehension and could benefit from better type hints.

Consider this improvement:

-    def clean_texts(self, texts: list[str]) -> list[str]:
+    def clean_texts(self, texts: list[str], *, skip_empty: bool = True) -> list[str]:
         """
         clean the given text

         Parameters
         ------------
         texts : list[str]
             a list of discord messages text
+        skip_empty : bool, optional
+            whether to skip empty results after cleaning (default: True)

         Returns
         ---------
         texts_cleaned : str
             the cleaned text
             (discord ids removed)
         """
-        texts_cleaned: list[str] = []
-
-        for text in texts:
-            text_cleaned = self.clean_text(text=text)
-            texts_cleaned.append(text_cleaned)
-
-        return texts_cleaned
+        cleaned = [self.clean_text(text=text) for text in texts]
+        return [text for text in cleaned if text or not skip_empty]

73-78: Enhance regex pattern and string cleaning

The current implementation has a few potential improvements:

The regex pattern could be compiled once for better performance
Multiple spaces handling could be more efficient

Consider these improvements:

+    # Class variable for better performance
+    _ID_PATTERN = re.compile(r"<@&?\d+>")
+
     def remove_ids(self, text: str) -> str:
-        pattern = r"<@&?\d+>"
-
         # Removing matches
-        cleaned_text = re.sub(pattern, "", text)
-
-        cleaned_text = " ".join(cleaned_text.split())
+        cleaned_text = self._ID_PATTERN.sub("", text).strip()
+        cleaned_text = " ".join(word for word in cleaned_text.split() if word)
         return cleaned_text

dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (2)

17-17: Remove unnecessary f-string prefix.

The string doesn't contain any placeholders, so the f prefix is not needed.
-            raise OSError(f"Model spacy `en_core_web_lg` is not installed!") from exp
+            raise OSError("Model spacy `en_core_web_lg` is not installed!") from exp
🧰 Tools

🪛 Ruff (0.8.0)

17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)

19-141: LGTM! Comprehensive test coverage with clear test cases.

The test suite provides good coverage of basic functionality with clear, focused test cases and descriptive docstrings.

Consider adding these test cases to improve coverage:

Test word limit functionality (mentioned in PR title)

Edge cases:

Messages with only mentions/no actual content

Very long messages

Messages with repeated mentions

Message content normalization:

Whitespace handling

Case sensitivity

Unicode characters

Example test case for word limit:
def test_word_limit(self):
    """Test function respects word limit configuration"""
    test_data = [
        {"content": "This is a short message"},
        {"content": "This is a longer message with more than five words"},
    ]
    
    # Test with 5-word limit
    result = update_raw_messages(test_data, min_word_limit=5)
    self.assertEqual(len(result), 1, "Should only return message exceeding word limit")
    self.assertEqual(
        result[0]["content"],
        test_data[1]["content"],
        "Should return longer message"
    )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 70459fc and 33d0fb4.

📒 Files selected for processing (10)

Dockerfile (1 hunks)
dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py (1 hunks)
dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py (2 hunks)
dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (4 hunks)
dags/hivemind_etl_helpers/src/db/discord/preprocessor.py (1 hunks)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (3 hunks)
dags/hivemind_etl_helpers/tests/unit/test_common_preprocessor.py (1 hunks)
dags/hivemind_etl_helpers/tests/unit/test_discord_preprocess.py (1 hunks)
dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (1 hunks)
requirements.txt (1 hunks)

✅ Files skipped from review due to trivial changes (1)

requirements.txt

🚧 Files skipped from review as they are similar to previous changes (2)

dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py

🧰 Additional context used

🪛 Ruff (0.8.0)

dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py

17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)

dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py

25-25: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (4)

dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py (2)

8-8: LGTM: Import statement is correctly placed

The import statement for DiscordPreprocessor is appropriately placed with other imports.

33-34: ⚠️ Potential issue

Verify error handling for empty processed messages

The code processes messages but doesn't handle the case where all messages might be filtered out during preprocessing.

Let's verify the behavior of transform_discord_raw_messages with empty input:

Consider adding a check:

     processed_messages = update_raw_messages(raw_data=raw_mongo_messages)
+    if not processed_messages:
+        return []
     messages_docuemnt = transform_discord_raw_messages(guild_id, processed_messages)

dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (2)

1-8: LGTM! Imports are clean and appropriate.

The imports cover all necessary dependencies for the test suite.

12-16: LGTM! Efficient test setup.

Loading the spaCy model once in setUpClass is a good practice for performance optimization.

Dockerfile

dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py

dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py

dags/hivemind_etl_helpers/tests/unit/test_common_preprocessor.py

we would remove dummy data from database, so we added a message content that holds some information.

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)

dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (1)
127-127: Verify text preprocessing assumptions

The test message has been updated to demonstrate multiple preprocessing steps (stemming, stop word removal, case normalization). Consider adding a comment explaining these preprocessing expectations for better test documentation.
-                "content": "Apples are falling from trees 1",
+                # Test message to verify text preprocessing:
+                # - Word stemming (falling -> fall)
+                # - Stop word removal (are, from)
+                # - Case normalization (Apples -> apple)
+                # - Number removal (1)
+                "content": "Apples are falling from trees 1",
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2)
Line range hint 1-324: Consider improving test organization and documentation

The test file is well-structured but could benefit from:

Docstrings explaining the test scenarios

Breaking down the large test method into smaller, focused test cases

Using test data factories or fixtures for common setup

Consider refactoring the test class like this:
@pytest.fixture
def discord_setup():
    """Setup basic Discord test environment with channels and roles."""
    # Move setup_db logic here
    pass

class TestDiscordMessageTransformation:
    """Test suite for Discord message transformation functionality."""
    
    def test_basic_message_transformation(self, discord_setup):
        """Test basic message content is correctly transformed."""
        pass
    
    def test_mentions_transformation(self, discord_setup):
        """Test messages with user/role mentions are correctly transformed."""
        pass
    
    def test_url_transformation(self, discord_setup):
        """Test messages with URLs are correctly transformed."""
        pass
Add proper test cleanup using tearDown to prevent test interference

The test file shares database collections with multiple other integration tests (17+ files found). While it does cleanup collections at the start of each test method, it doesn't guarantee cleanup after test execution, especially in case of test failures. This could lead to test pollution.

Add tearDown method to ensure cleanup even if tests fail:
def tearDown(self):
    """Clean up test database state."""
    client = MongoSingleton.get_instance().client
    client["Core"].drop_collection("modules")
    client["Core"].drop_collection("platforms")
    client[self.guild_id].drop_collection("guildmembers")
    client[self.guild_id].drop_collection("rawinfos")
    client[self.guild_id].drop_collection("roles")
🔗 Analysis chain

Line range hint 1-324: Verify test isolation and cleanup

The test uses direct database operations but doesn't ensure proper cleanup in case of test failures.

Let's check if there are any cleanup mechanisms in place:

Add proper cleanup using setUp/tearDown methods:
def setUp(self):
    """Set up test database state."""
    self.client = MongoSingleton.get_instance().client
    self.guild_id = "1234"
    self.cleanup_collections()
    
def tearDown(self):
    """Clean up test database state."""
    self.cleanup_collections()
    
def cleanup_collections(self):
    """Drop all test collections."""
    self.client["Core"].drop_collection("modules")
    self.client["Core"].drop_collection("platforms")
    self.client[self.guild_id].drop_collection("guildmembers")
    self.client[self.guild_id].drop_collection("rawinfos")
    self.client[self.guild_id].drop_collection("roles")
Based on the search results, I can see that several test files in the codebase have proper cleanup mechanisms implemented:

Some tests use tearDown with proper cleanup:

test_telegram_comminity.py

test_extract_messages.py

test_telegram_chats.py

The file in question (test_discord_prepare_document_from_db.py) uses drop_collection calls at the beginning of each test method, which is a form of setup cleanup, but lacks proper teardown.

Let me verify if there are any other test files that might be affected by this test's database state:
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for cleanup patterns in test files
rg -A 5 "def tearDown|cleanup|drop_collection" "dags/hivemind_etl_helpers/tests"
Length of output: 31557

Script:
#!/bin/bash
# Check for other test files that might use the same collections
rg -l "Core.*modules|Core.*platforms|guildmembers|rawinfos" "dags/hivemind_etl_helpers/tests"
Length of output: 1526

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 33d0fb4 and 5b4057d.

📒 Files selected for processing (3)

Dockerfile (1 hunks)
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2 hunks)
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

Dockerfile

🔇 Additional comments (1)

dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (1)

188-188: Verify preprocessing pipeline consistency

The expected text demonstrates the preprocessing pipeline's behavior. Consider adding additional test cases to verify edge cases:

Multiple numbers in text
Special characters
Mixed case words
Common stop words

Let's verify if there are other test cases covering these scenarios:

dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py

the processor should move to the embedding part and we shouldn't change the actual content of the messages. They have been moved to hivemind_backend library.

the test case didn't belong to any code.

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)

dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (2)
21-21: Consider documenting BasePreprocessor's validation criteria

The introduction of BasePreprocessor for content validation is a good architectural choice, but its validation criteria should be documented to ensure maintainability.

Consider adding a docstring or comment explaining what constitutes "valuable information" in the context of message processing.

240-240: Remove commented out metadata exclusions

Since URL reference handling has been removed, these commented lines should be deleted rather than left as comments.

Apply these changes:
-            # "url_reference",
Also applies to: 257-257

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cd4e0c7 and 866088d.

📒 Files selected for processing (5)

dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (6 hunks)
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (1 hunks)
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py (1 hunks)
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (2 hunks)
requirements.txt (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

requirements.txt
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py

🔇 Additional comments (1)

dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (1)

222-222: LGTM: Proper input sanitization added

Good addition of control character removal for cleaner text processing.

dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py

dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py

feat: Added word max limit!

ba4c527

coderabbitai bot reviewed Dec 5, 2024

View reviewed changes

fix: test case to align with the latest codes!

3a018eb

coderabbitai bot reviewed Dec 5, 2024

View reviewed changes

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py Outdated Show resolved Hide resolved

dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py Outdated Show resolved Hide resolved

amindadgar added 2 commits December 5, 2024 14:43

fix: naming typo!

cdcea54

fix: test case assertions!

70459fc

amindadgar linked an issue Dec 5, 2024 that may be closed by this pull request

[Hivemind] Data processing improvements #334

Closed

3 tasks

amindadgar added 4 commits December 9, 2024 12:41

feat: Added noise cancelling for discord messages!

9b2bead

fix: remove unnecessary print!

3788374

fix: Added missing spacy dependency!

b007eec

feat: updating the messages content!

33d0fb4

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

amindadgar added 4 commits December 9, 2024 13:40

fix: Added spacy model download in test environment!

32eaafb

fix: aligning with latest codes!

b2b6468

we would remove dummy data from database, so we added a message content that holds some information.

fix: align input output data!

0a3a6df

fix: Added more informative messages!

5b4057d

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py Outdated Show resolved Hide resolved

amindadgar added 6 commits December 9, 2024 15:37

feat: removed the processors!

cd4e0c7

the processor should move to the embedding part and we shouldn't change the actual content of the messages. They have been moved to hivemind_backend library.

fix: remove the unrequired processor!

d0fb008

feat: removed unused test cases!

7dfc963

the test case didn't belong to any code.

feat: updated data with not removing the urls from actual content!

077ab8e

feat: using latest version of hivemind-backend lib!

ed99d65

feat: checking if there's any valid date in the text!

866088d

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

amindadgar added 5 commits December 10, 2024 10:08

feat: using smaller sized spacy model!

96d7ab3

fix: wrong if condition!

5634296

fix: aligning with latest codes!

d3f8d7d

fix: using the fixed version of hivemind-backend lib!

ce0a9c2

fix: isort linter issue!

67a3aae

amindadgar merged commit 96d6787 into main Dec 10, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Added word max limit! #335

feat: Added word max limit! #335

amindadgar commented Dec 5, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 5, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

feat: Added word max limit! #335

feat: Added word max limit! #335

Conversation

amindadgar commented Dec 5, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Dec 5, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

amindadgar commented Dec 5, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 5, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)