-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added word max limit! #335
Conversation
WalkthroughThis pull request introduces several modifications across multiple files, primarily focusing on the Changes
Possibly related PRs
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py (2)
24-27
: Fix docstring inconsistencies and typosThe docstring has the following issues:
- Typo: "shuold" should be "should"
- Inconsistency: The parameter is named "min_word_limit" but the description mentions "characters"
kwargs : dict min_word_limit : int - the minimum words that the messages shuold contain - default is 8 characters + the minimum words that the messages should contain + default is 8 words
Line range hint
4-53
: Consider improving type safety and modularityThe function could benefit from the following improvements:
- Add type hints for kwargs to document expected parameters
- Extract word limit filtering logic to a separate function
+from typing import TypedDict + +class MessageFilterOptions(TypedDict, total=False): + min_word_limit: int + def fetch_raw_messages( guild_id: str, selected_channels: list[str], from_date: datetime, - **kwargs, + **kwargs: MessageFilterOptions, ) -> list[dict]:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py
(4 hunks)dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
375-375: Redefinition of unused test_fetch_raw_messages_fetch_limited_characters
from line 290
(F811)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
Outdated
Show resolved
Hide resolved
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1)
290-466
: Refactor test methods to reduce duplicationBoth test methods share identical setup code and could benefit from refactoring to improve maintainability and readability.
Consider these improvements:
- Extract common setup code into helper methods:
def _setup_test_messages(self, messages_data): """ Helper method to setup test messages in the database Args: messages_data: List of tuples containing (author_index, content) """ client = MongoSingleton.get_instance().client guild_id = "1234" channels = ["111111", "22222"] users_id = ["user1", "user2", "user3", "user4", "user5"] self.setup_db(channels=channels, guild_id=guild_id) self._setup_guild_members(guild_id, users_id) raw_data = [ self._create_message( author=users_id[author_idx], content=content, channel=channels[idx % len(channels)] ) for idx, (author_idx, content) in enumerate(messages_data) ] client[guild_id].drop_collection("rawinfos") client[guild_id]["rawinfos"].insert_many(raw_data) return guild_id, channels
- Refactor test methods to use the helper:
def test_fetch_raw_messages_fetch_limited_characters(self): """ Test fetching messages with default word limit (8 words) """ messages_data = [ (0, "AA"), (1, "A sample text with more than 8 characters!") ] guild_id, channels = self._setup_test_messages(messages_data) messages = fetch_raw_messages( guild_id, selected_channels=channels, from_date=datetime(2023, 9, 20) ) self.assertEqual(len(messages), 1) self.assertEqual( messages[0]["content"], "A sample text with more than 8 characters!" )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
(1 hunks)
🔇 Additional comments (1)
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py (1)
378-466
:
Fix duplicate test method and improve assertions
The test method has several issues:
- Name causes F811 error (redefinition of unused name)
- Incorrect usage of assertCountEqual
- Duplicate guild_id assignment
- Comment refers to incorrect date
Apply these changes:
- def test_fetch_raw_messages_fetch_limited_characters_specified(self):
+ def test_fetch_raw_messages_with_custom_word_limit(self):
"""
- fetch raw messages and do filtering with a specified value
+ Test fetching messages with a custom minimum word limit of 1 word
"""
client = MongoSingleton.get_instance().client
guild_id = "1234"
channels = ["111111", "22222"]
users_id = ["user1", "user2", "user3", "user4", "user5"]
- guild_id = "1234" # Remove duplicate assignment
self.setup_db(
channels=channels,
guild_id=guild_id,
)
# ... rest of the setup ...
- # Fetch messages from a specific date (October 3, 2023)
+ # Fetch messages with minimum word limit of 1
messages = fetch_raw_messages(
guild_id,
selected_channels=channels,
from_date=datetime(2023, 9, 20),
min_word_limit=1,
)
- # Check if the fetched messages are equal to the expected messages
- self.assertCountEqual(messages, 2)
+ # Verify both messages are returned with word limit of 1
+ self.assertEqual(len(messages), 2)
+ contents = {msg["content"] for msg in messages}
+ self.assertEqual(
+ contents,
+ {"AA", "A sample text with more than 8 characters!"}
+ )
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
Outdated
Show resolved
Hide resolved
dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (11)
Dockerfile (1)
7-7
: Consider using a smaller spaCy model for better resource efficiencyThe
en_core_web_lg
model is quite large (~560MB). Consider usingen_core_web_sm
(~12MB) oren_core_web_md
(~40MB) if the full feature set of the large model isn't necessary. This would significantly reduce the container size and build time.-RUN python -m spacy download en_core_web_lg +RUN python -m spacy download en_core_web_mddags/hivemind_etl_helpers/src/db/common/base_preprocessor.py (2)
8-21
: Improve method documentationThe docstring needs improvement:
- "a discord message text" is too specific as this is a base class
- Missing description of the cleaning process
- Missing example usage
def extract_main_content(self, text: str) -> str: """ - extract main content of a message + Extract the main content from text by removing noise and standardizing tokens. + + This method: + - Removes punctuation, whitespace, stop words + - Filters out URLs and numbers + - Converts tokens to their base form (lemmatization) + - Ensures ASCII compatibility Parameters ------------ text : str - a discord message text + The input text to process Returns -------- cleaned_text : str + Processed text containing only meaningful content + + Example + -------- + >>> preprocessor = BasePreprocessor() + >>> preprocessor.extract_main_content("Hello, how are you? https://example.com") + 'hello' """
29-39
: Consider making token filtering rules configurableThe token filtering rules are hardcoded. Consider making them configurable through constructor parameters to increase flexibility.
class BasePreprocessor: - def __init__(self) -> None: + def __init__( + self, + *, + remove_punct: bool = True, + remove_space: bool = True, + remove_stop: bool = True, + remove_url: bool = True, + remove_num: bool = True, + ascii_only: bool = True + ) -> None: try: self.nlp = spacy.load("en_core_web_lg") except OSError as exp: raise OSError("Model spacy `en_core_web_lg` is not installed!") from exp + self.filters = { + 'remove_punct': remove_punct, + 'remove_space': remove_space, + 'remove_stop': remove_stop, + 'remove_url': remove_url, + 'remove_num': remove_num, + 'ascii_only': ascii_only + }dags/hivemind_etl_helpers/tests/unit/test_discord_preprocess.py (3)
26-28
: Add assertion message for better test failure debuggingWhen using assertEqual, include a message parameter to provide context when the test fails.
- self.assertEqual(self.preprocessor.remove_ids(input_text), expected) + self.assertEqual( + self.preprocessor.remove_ids(input_text), + expected, + "Failed to remove multiple user and role IDs correctly" + )
11-43
: Add more edge cases to test suiteThe test suite could benefit from additional edge cases:
- Malformed IDs
- Nested IDs
- IDs with special characters
Example additional test:
def test_malformed_ids(self): """Test handling of malformed ID patterns""" input_text = "Hello <@12345 <@&> <@&abc>" expected = "Hello" self.assertEqual( self.preprocessor.remove_ids(input_text), expected, "Failed to handle malformed IDs correctly" )
6-10
: Consider using pytest fixtures instead of setUpWhile unittest's setUp works, pytest fixtures provide more flexibility and better reuse across test files.
-class TestDiscordPreprocessor(unittest.TestCase): - def setUp(self): - """Set up test fixtures before each test method.""" - self.preprocessor = DiscordPreprocessor() +import pytest + +@pytest.fixture +def preprocessor(): + """Fixture providing a DiscordPreprocessor instance.""" + return DiscordPreprocessor() + +def test_remove_user_ids(preprocessor): + """Test removal of user IDs from text""" + input_text = "Hello <@123456> how are you?" + expected = "Hello how are you?" + assert preprocessor.remove_ids(input_text) == expecteddags/hivemind_etl_helpers/src/db/discord/preprocessor.py (3)
11-12
: Remove unnecessary empty initThe empty
__init__
method can be removed as it doesn't add any functionality beyond what's inherited fromBasePreprocessor
.- def __init__(self) -> None: - pass
14-35
: Optimize clean_texts method implementationThe method could be more efficient using list comprehension and could benefit from better type hints.
Consider this improvement:
- def clean_texts(self, texts: list[str]) -> list[str]: + def clean_texts(self, texts: list[str], *, skip_empty: bool = True) -> list[str]: """ clean the given text Parameters ------------ texts : list[str] a list of discord messages text + skip_empty : bool, optional + whether to skip empty results after cleaning (default: True) Returns --------- texts_cleaned : str the cleaned text (discord ids removed) """ - texts_cleaned: list[str] = [] - - for text in texts: - text_cleaned = self.clean_text(text=text) - texts_cleaned.append(text_cleaned) - - return texts_cleaned + cleaned = [self.clean_text(text=text) for text in texts] + return [text for text in cleaned if text or not skip_empty]
73-78
: Enhance regex pattern and string cleaningThe current implementation has a few potential improvements:
- The regex pattern could be compiled once for better performance
- Multiple spaces handling could be more efficient
Consider these improvements:
+ # Class variable for better performance + _ID_PATTERN = re.compile(r"<@&?\d+>") + def remove_ids(self, text: str) -> str: - pattern = r"<@&?\d+>" - # Removing matches - cleaned_text = re.sub(pattern, "", text) - - cleaned_text = " ".join(cleaned_text.split()) + cleaned_text = self._ID_PATTERN.sub("", text).strip() + cleaned_text = " ".join(word for word in cleaned_text.split() if word) return cleaned_textdags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (2)
17-17
: Remove unnecessary f-string prefix.The string doesn't contain any placeholders, so the
f
prefix is not needed.- raise OSError(f"Model spacy `en_core_web_lg` is not installed!") from exp + raise OSError("Model spacy `en_core_web_lg` is not installed!") from exp🧰 Tools
🪛 Ruff (0.8.0)
17-17: f-string without any placeholders
Remove extraneous
f
prefix(F541)
19-141
: LGTM! Comprehensive test coverage with clear test cases.The test suite provides good coverage of basic functionality with clear, focused test cases and descriptive docstrings.
Consider adding these test cases to improve coverage:
- Test word limit functionality (mentioned in PR title)
- Edge cases:
- Messages with only mentions/no actual content
- Very long messages
- Messages with repeated mentions
- Message content normalization:
- Whitespace handling
- Case sensitivity
- Unicode characters
Example test case for word limit:
def test_word_limit(self): """Test function respects word limit configuration""" test_data = [ {"content": "This is a short message"}, {"content": "This is a longer message with more than five words"}, ] # Test with 5-word limit result = update_raw_messages(test_data, min_word_limit=5) self.assertEqual(len(result), 1, "Should only return message exceeding word limit") self.assertEqual( result[0]["content"], test_data[1]["content"], "Should return longer message" )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
Dockerfile
(1 hunks)dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py
(1 hunks)dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py
(2 hunks)dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py
(4 hunks)dags/hivemind_etl_helpers/src/db/discord/preprocessor.py
(1 hunks)dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
(3 hunks)dags/hivemind_etl_helpers/tests/unit/test_common_preprocessor.py
(1 hunks)dags/hivemind_etl_helpers/tests/unit/test_discord_preprocess.py
(1 hunks)dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py
(1 hunks)requirements.txt
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- requirements.txt
🚧 Files skipped from review as they are similar to previous changes (2)
- dags/hivemind_etl_helpers/src/db/discord/fetch_raw_messages.py
- dags/hivemind_etl_helpers/tests/integration/test_discord_fetch_raw_messages.py
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py
17-17: f-string without any placeholders
Remove extraneous f
prefix
(F541)
dags/hivemind_etl_helpers/src/db/common/base_preprocessor.py
25-25: f-string without any placeholders
Remove extraneous f
prefix
(F541)
🔇 Additional comments (4)
dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py (2)
8-8
: LGTM: Import statement is correctly placed
The import statement for DiscordPreprocessor is appropriately placed with other imports.
33-34
:
Verify error handling for empty processed messages
The code processes messages but doesn't handle the case where all messages might be filtered out during preprocessing.
Let's verify the behavior of transform_discord_raw_messages with empty input:
Consider adding a check:
processed_messages = update_raw_messages(raw_data=raw_mongo_messages)
+ if not processed_messages:
+ return []
messages_docuemnt = transform_discord_raw_messages(guild_id, processed_messages)
dags/hivemind_etl_helpers/tests/unit/test_discord_update_raw_messages.py (2)
1-8
: LGTM! Imports are clean and appropriate.
The imports cover all necessary dependencies for the test suite.
12-16
: LGTM! Efficient test setup.
Loading the spaCy model once in setUpClass
is a good practice for performance optimization.
dags/hivemind_etl_helpers/src/db/discord/discord_raw_message_to_document.py
Outdated
Show resolved
Hide resolved
dags/hivemind_etl_helpers/tests/unit/test_common_preprocessor.py
Outdated
Show resolved
Hide resolved
dags/hivemind_etl_helpers/tests/unit/test_common_preprocessor.py
Outdated
Show resolved
Hide resolved
we would remove dummy data from database, so we added a message content that holds some information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (1)
127-127
: Verify text preprocessing assumptionsThe test message has been updated to demonstrate multiple preprocessing steps (stemming, stop word removal, case normalization). Consider adding a comment explaining these preprocessing expectations for better test documentation.
- "content": "Apples are falling from trees 1", + # Test message to verify text preprocessing: + # - Word stemming (falling -> fall) + # - Stop word removal (are, from) + # - Case normalization (Apples -> apple) + # - Number removal (1) + "content": "Apples are falling from trees 1",dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2)
Line range hint
1-324
: Consider improving test organization and documentationThe test file is well-structured but could benefit from:
- Docstrings explaining the test scenarios
- Breaking down the large test method into smaller, focused test cases
- Using test data factories or fixtures for common setup
Consider refactoring the test class like this:
@pytest.fixture def discord_setup(): """Setup basic Discord test environment with channels and roles.""" # Move setup_db logic here pass class TestDiscordMessageTransformation: """Test suite for Discord message transformation functionality.""" def test_basic_message_transformation(self, discord_setup): """Test basic message content is correctly transformed.""" pass def test_mentions_transformation(self, discord_setup): """Test messages with user/role mentions are correctly transformed.""" pass def test_url_transformation(self, discord_setup): """Test messages with URLs are correctly transformed.""" pass
Add proper test cleanup using tearDown to prevent test interference
The test file shares database collections with multiple other integration tests (17+ files found). While it does cleanup collections at the start of each test method, it doesn't guarantee cleanup after test execution, especially in case of test failures. This could lead to test pollution.
- Add
tearDown
method to ensure cleanup even if tests fail:def tearDown(self): """Clean up test database state.""" client = MongoSingleton.get_instance().client client["Core"].drop_collection("modules") client["Core"].drop_collection("platforms") client[self.guild_id].drop_collection("guildmembers") client[self.guild_id].drop_collection("rawinfos") client[self.guild_id].drop_collection("roles")🔗 Analysis chain
Line range hint
1-324
: Verify test isolation and cleanupThe test uses direct database operations but doesn't ensure proper cleanup in case of test failures.
Let's check if there are any cleanup mechanisms in place:
Add proper cleanup using
setUp
/tearDown
methods:def setUp(self): """Set up test database state.""" self.client = MongoSingleton.get_instance().client self.guild_id = "1234" self.cleanup_collections() def tearDown(self): """Clean up test database state.""" self.cleanup_collections() def cleanup_collections(self): """Drop all test collections.""" self.client["Core"].drop_collection("modules") self.client["Core"].drop_collection("platforms") self.client[self.guild_id].drop_collection("guildmembers") self.client[self.guild_id].drop_collection("rawinfos") self.client[self.guild_id].drop_collection("roles")
Based on the search results, I can see that several test files in the codebase have proper cleanup mechanisms implemented:
Some tests use
tearDown
with proper cleanup:
test_telegram_comminity.py
test_extract_messages.py
test_telegram_chats.py
The file in question (
test_discord_prepare_document_from_db.py
) usesdrop_collection
calls at the beginning of each test method, which is a form of setup cleanup, but lacks proper teardown.Let me verify if there are any other test files that might be affected by this test's database state:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for cleanup patterns in test files rg -A 5 "def tearDown|cleanup|drop_collection" "dags/hivemind_etl_helpers/tests"Length of output: 31557
Script:
#!/bin/bash # Check for other test files that might use the same collections rg -l "Core.*modules|Core.*platforms|guildmembers|rawinfos" "dags/hivemind_etl_helpers/tests"Length of output: 1526
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
Dockerfile
(1 hunks)dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py
(2 hunks)dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- Dockerfile
🔇 Additional comments (1)
dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py (1)
188-188
: Verify preprocessing pipeline consistency
The expected text demonstrates the preprocessing pipeline's behavior. Consider adding additional test cases to verify edge cases:
- Multiple numbers in text
- Special characters
- Mixed case words
- Common stop words
Let's verify if there are other test cases covering these scenarios:
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py
Outdated
Show resolved
Hide resolved
the processor should move to the embedding part and we shouldn't change the actual content of the messages. They have been moved to hivemind_backend library.
the test case didn't belong to any code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (2)
21-21
: Consider documenting BasePreprocessor's validation criteriaThe introduction of BasePreprocessor for content validation is a good architectural choice, but its validation criteria should be documented to ensure maintainability.
Consider adding a docstring or comment explaining what constitutes "valuable information" in the context of message processing.
240-240
: Remove commented out metadata exclusionsSince URL reference handling has been removed, these commented lines should be deleted rather than left as comments.
Apply these changes:
- # "url_reference",
Also applies to: 257-257
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py
(6 hunks)dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py
(1 hunks)dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py
(1 hunks)dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py
(2 hunks)requirements.txt
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- requirements.txt
- dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py
- dags/hivemind_etl_helpers/tests/integration/test_pg_vector_access_with_discord.py
🔇 Additional comments (1)
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (1)
222-222
: LGTM: Proper input sanitization added
Good addition of control character removal for cleaner text processing.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores