-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: hivemind telegram ETL, data types and timestamp conversion! #317
Conversation
Warning Rate limit exceeded@amindadgar has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 6 minutes and 15 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThe pull request modifies the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/tests/unit/test_telegram_extract_daily_messages.py (1)
35-36
: Add timestamp format documentationWhile the timestamp values are correct for the new seconds-based format, it would be helpful to add a comment explaining the timestamp format change from milliseconds to seconds for future maintainers.
- message_created_at=1633046400.0, # Equivalent to 2021-10-01T00:00:00 UTC - message_edited_at=1633046500.0, + message_created_at=1633046400.0, # 2021-10-01T00:00:00 UTC (in seconds, not milliseconds) + message_edited_at=1633046500.0, # 2021-10-01T00:01:40 UTC (in seconds, not milliseconds)Also applies to: 44-44
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py
(1 hunks)dags/hivemind_etl_helpers/src/db/telegram/extract/messages_daily.py
(2 hunks)dags/hivemind_etl_helpers/tests/unit/test_telegram_extract_daily_messages.py
(6 hunks)dags/hivemind_telegram_etl.py
(1 hunks)
🔇 Additional comments (9)
dags/hivemind_etl_helpers/src/db/telegram/extract/messages_daily.py (2)
10-10
: LGTM! Verify chat_id type usage across the codebase.
The type change from str
to int
aligns with Telegram's chat ID format and maintains consistency with ExtractMessages
class.
Let's verify all instantiations of this class use integer chat IDs:
✅ Verification successful
Type change from str
to int
for chat_id
is safe and correctly used
The verification shows that all instantiations of ExtractMessagesDaily
are using integer chat IDs:
- In
hivemind_telegram_etl.py
: Used withchat_id
variable passed directly - In tests: Explicitly initialized with integer value
12344321
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for instantiations of ExtractMessagesDaily to ensure they use integer chat_ids
rg -l "ExtractMessagesDaily\(" | xargs rg -A 2 "ExtractMessagesDaily\("
Length of output: 114
Script:
#!/bin/bash
# Let's try a different approach to find usage patterns
# First, find all Python files that import or reference ExtractMessagesDaily
rg -l "ExtractMessagesDaily" --type py
# Then search for actual instantiations with context
rg "ExtractMessagesDaily\s*\(" --type py -B2 -A2
Length of output: 1613
34-34
: Verify timestamp ranges and document the expected format.
The change from milliseconds to seconds is significant. To prevent potential issues:
- Verify that all incoming timestamps are indeed in seconds
- Add validation for reasonable timestamp ranges
- Document the expected timestamp format
Let's check the timestamp ranges in existing messages:
Consider adding timestamp validation:
def _validate_timestamp(timestamp: int) -> None:
# Reasonable range: 2020-01-01 to 1 year from now
min_timestamp = 1577836800 # 2020-01-01
max_timestamp = int(datetime.now().timestamp()) + 31536000 # now + 1 year
if not min_timestamp <= timestamp <= max_timestamp:
raise ValueError(f"Timestamp {timestamp} is outside reasonable range")
✅ Verification successful
This broader search should help us understand:
- How message_created_at is used across the codebase
- General timestamp handling patterns
- The complete context of telegram message processing
Based on the search results, I can now provide a definitive response about the timestamp handling:
Timestamps are consistently handled in seconds across the codebase
The evidence shows that:
- Integration tests use timestamps in seconds (e.g., 1672531200.0 for "2023-01-01")
- The schema defines
message_created_at
as float - The DateTimeFormatConverter utility consistently works with seconds-based timestamps
- The only millisecond timestamps appear in a different context (user_bot_checker tests where timestamps are explicitly multiplied by 1000)
The change to remove the division by 1000 was correct as the codebase consistently expects timestamps in seconds, not milliseconds.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for message_created_at usage to verify timestamp ranges
ast-grep --pattern 'message_created_at = $_' | xargs -I {} rg -A 2 "{}"
# Look for any remaining millisecond timestamp conversions
rg -l "/ ?1000" | xargs rg "/ ?1000"
Length of output: 114
Script:
#!/bin/bash
# Let's try a broader search for timestamp-related code
# Search for message_created_at usage
rg "message_created_at" -B 2 -A 2
# Search for timestamp conversions
rg "fromtimestamp|timestamp" -B 2 -A 2
# Search for any files containing telegram message processing
fd "telegram.*messages.*\.py$" --exec cat {}
Length of output: 60653
dags/hivemind_etl_helpers/src/db/telegram/extract/messages.py (2)
Line range hint 89-93
: LGTM! Query parameters are properly handled.
The Neo4j query correctly uses parameterized values, and the integer chat_id
is safely passed to the query execution.
Line range hint 31-32
: Verify timestamp conversion consistency.
The PR objective mentions timestamp conversion changes, but this file still converts timestamps to milliseconds (from_date.timestamp() * 1000
). Please verify if this should be updated to use seconds instead, as mentioned in the PR description.
Let's check timestamp handling across related files:
#!/bin/bash
# Description: Find all timestamp conversions in related files
# Expected: Consistent timestamp handling (either all milliseconds or all seconds)
# Search for timestamp conversions
rg -B 2 -A 2 "timestamp\(\)\s*\*\s*1000|timestamp\(\)"
# Search for specific timestamp-related patterns in Neo4j queries
ast-grep --pattern 'message.date'
dags/hivemind_etl_helpers/tests/unit/test_telegram_extract_daily_messages.py (3)
13-13
: LGTM: Chat ID type change is consistent
The change from string to integer type for chat_id aligns with the PR's objective of fixing Telegram data types.
62-63
: LGTM: Same-day test case properly updated
The timestamps are correctly configured to test same-day message grouping with the new seconds-based format.
Also applies to: 72-73, 81-81
97-98
: LGTM: Different-days test case properly updated
The timestamps are correctly set 24 hours apart to test message grouping across different days. Let's verify that all timestamp conversions in the codebase are consistent.
Also applies to: 107-108, 116-117
✅ Verification successful
Timestamp formats are consistent across test files
All timestamps in both test files are using Unix epoch seconds format (not milliseconds), and the values are consistent:
test_telegram_extract_daily_messages.py
: Uses timestamps like 1633046400.0 (Oct 2021)test_extract_messages.py
: Uses timestamps like 1672531200.0 (Jan 2023)
No inconsistencies found in timestamp handling across the test files.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining millisecond-based timestamps in test files
rg -l "message_created_at.*000\." "dags/hivemind_etl_helpers/tests"
# Search for timestamp division by 1000 which might indicate millisecond conversion
rg "/\s*1000" "dags/hivemind_etl_helpers/tests"
Length of output: 261
Script:
#!/bin/bash
# Let's check the actual content of these files to see the timestamp formats
echo "=== test_telegram_extract_daily_messages.py ==="
rg "message_created_at" "dags/hivemind_etl_helpers/tests/unit/test_telegram_extract_daily_messages.py" -B1 -A1
echo -e "\n=== test_extract_messages.py ==="
rg "message_created_at" "dags/hivemind_etl_helpers/tests/integration/test_extract_messages.py" -B1 -A1
Length of output: 2371
dags/hivemind_telegram_etl.py (2)
145-145
: LGTM! Improved logging clarity.
The logging now correctly shows the count of transformed documents rather than input messages, providing better visibility into the transformation process.
130-141
: Verify chat_id type conversion in the processor task.
The ExtractMessages and ExtractMessagesDaily classes now expect an integer chat_id, but the processor task might receive it as a string.
Let's verify the chat_id type handling:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py (1)
Line range hint
44-44
: Consider adding type conversion safety checks.While MongoDB is flexible with types, it's good practice to ensure consistent type handling. Consider adding validation to ensure
chat_id
is always an integer before database operations.def check_platform_existence(self) -> tuple[ObjectId | None, ObjectId | None]: + if not isinstance(self.chat_id, int): + raise TypeError("chat_id must be an integer") document = self._client[self.database][self.collection].find_one( {"metadata.id": self.chat_id}, { "community": 1, "_id": 1, }, )Also applies to: 71-71
dags/analyzer_helper/telegram/extract_raw_members.py (2)
Line range hint
39-44
: Consider using a query builder pattern for better maintainability.The current query construction using string concatenation works but could be improved. Consider implementing a query builder pattern for better maintainability and readability.
Example refactor:
def build_member_query(self, parameters: dict) -> str: base_query = """ MATCH (u:TGUser)-[r1:JOINED|LEFT]->(c:TGChat) WHERE c.id = $chat_id """ conditions = [] if "start_date" in parameters: conditions.append("r1.date > $start_date") where_clause = " AND ".join(conditions) if where_clause: base_query += f" AND {where_clause}" return base_query + """ MATCH (u:TGUser)-[r2:JOINED]->(c:TGChat {id: $chat_id}) WITH u, MAX(r2.date) AS joined_at OPTIONAL MATCH (u:TGUser)-[r3:LEFT]->(c:TGChat {id: $chat_id}) WITH u, joined_at, MAX(r3.date) AS left_at RETURN u.id AS id, joined_at, left_at, u.is_bot as is_bot, u.username AS username, u.first_name AS first_name, u.last_name AS last_name """
Line range hint
91-107
: Add error handling and logging for timestamp operations.The timestamp conversion and database operations could benefit from explicit error handling and logging for better observability and debugging.
Example improvement:
def extract(self, recompute: bool = False) -> list: """ Extract members data with improved error handling and logging. Args: recompute (bool): If True, extract all members; if False, extract incrementally. Returns: list: List of extracted member details. Raises: ValueError: If timestamp conversion fails. PyMongoError: If MongoDB operations fail. """ try: if not recompute: latest_rawmember = self.rawmembers_collection.find_one( sort=[("joined_at", DESCENDING)] ) if latest_rawmember: latest_joined_at = latest_rawmember["joined_at"] latest_joined_at_timestamp = self.converter.datetime_to_timestamp( latest_joined_at ) logging.info(f"Extracting members after timestamp: {latest_joined_at_timestamp}") return self.fetch_member_details(start_date=latest_joined_at_timestamp) logging.info("Performing full member extraction") return self.fetch_member_details() except Exception as e: logging.error(f"Failed to extract members: {str(e)}") raisedags/analyzer_helper/telegram/extract_raw_data.py (2)
Line range hint
12-19
: Add type validation in constructor.Consider adding explicit type validation to handle potential string inputs gracefully:
def __init__(self, chat_id: int, platform_id: str): """ Initialize the ExtractRawInfo with the forum endpoint, platform id and set up Neo4j and MongoDB connection. + + Args: + chat_id (int): The Telegram chat ID + platform_id (str): The platform identifier + + Raises: + TypeError: If chat_id is not an integer """ + if not isinstance(chat_id, int): + try: + chat_id = int(chat_id) + except (TypeError, ValueError): + raise TypeError("chat_id must be an integer") + self.neo4jConnection = Neo4jConnection() self.driver = self.neo4jConnection.connect_neo4j() self.chat_id = chat_id
Line range hint
132-146
: Update method documentation and type hints.The
fetch_message_reactions
method's documentation needs improvement:def fetch_message_reactions(self, message_id: int) -> List[Dict[str, Any]]: """ - fetch message reactions + Fetch reactions for a specific message. Parameters ------------- - message_id : str + message_id : int the id of message to fetch its reactions Returns --------- reactions : list[dict[str, any]] - the list of reactions pointed to a message + List of dictionaries containing reaction details with keys: + - reactor_id (int): ID of the user who reacted + - reaction_time (str): Timestamp of the reaction """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
dags/analyzer_helper/telegram/extract_raw_data.py
(1 hunks)dags/analyzer_helper/telegram/extract_raw_members.py
(1 hunks)dags/analyzer_helper/telegram/tests/integration/test_telegram_extract_raw_members.py
(1 hunks)dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py
(1 hunks)dags/hivemind_telegram_etl.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dags/hivemind_telegram_etl.py
🧰 Additional context used
📓 Learnings (1)
dags/analyzer_helper/telegram/tests/integration/test_telegram_extract_raw_members.py (2)
Learnt from: amindadgar
PR: TogetherCrew/airflow-dags#222
File: dags/analyzer_helper/telegram/tests/integration/test_telegram_extract_raw_members.py:40-85
Timestamp: 2024-10-03T07:00:00.992Z
Learning: In unit tests, avoid suggesting refactoring that doesn't affect efficiency in production code.
Learnt from: amindadgar
PR: TogetherCrew/airflow-dags#222
File: dags/analyzer_helper/telegram/tests/integration/test_telegram_extract_raw_members.py:40-85
Timestamp: 2024-10-09T08:57:20.288Z
Learning: In unit tests, avoid suggesting refactoring that doesn't affect efficiency in production code.
🔇 Additional comments (4)
dags/hivemind_etl_helpers/src/db/telegram/utils/platform.py (1)
Line range hint 8-71
: Verify type consistency in ETL pipeline.
The change to int
type for chat_id
is part of a broader standardization effort. Ensure that all ETL components (extractors, transformers, loaders) handle the integer type consistently.
#!/bin/bash
# Description: Check for potential type inconsistencies in related ETL files
# Search for chat_id type annotations and usage
echo "Checking chat_id type annotations in ETL files..."
rg -l "chat_id.*:.*(?:str|int)" --type python
# Look for potential type conversion points
echo "Checking for type conversion points..."
rg -l "str\(.*chat_id.*\)|int\(.*chat_id.*\)" --type python
dags/analyzer_helper/telegram/tests/integration/test_telegram_extract_raw_members.py (1)
12-12
: LGTM! The chat_id type change aligns with the codebase standardization.
The change from string to integer type is consistent with the broader changes across the codebase and maintains proper type handling in both MongoDB and Neo4j operations.
Let's verify the Neo4j schema expectations for the chat_id property:
✅ Verification successful
The chat_id type change is consistent with Neo4j schema usage
The verification confirms that:
- All TGChat node creations use the
id
property with numeric values via$chat_id
parameter binding - All MATCH clauses consistently use
TGChat {id: $chat_id}
pattern without any type conversions - The production code in
extract_raw_members.py
expects numeric chat_id values
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that TGChat.id is consistently defined as integer type
# across the codebase
# Search for TGChat node creation patterns to verify property type
rg -A 2 "CREATE.*TGChat.*id:"
# Search for any potential type conversion of chat_id in queries
rg -A 2 "MATCH.*TGChat.*id:"
Length of output: 4961
dags/analyzer_helper/telegram/extract_raw_data.py (2)
Line range hint 89-89
: Verify Neo4j schema compatibility with integer chat_id.
Ensure that the id
property of TGChat
nodes in Neo4j is stored as a numeric type to maintain consistency with the new int
type of chat_id
.
#!/bin/bash
# Search for Neo4j schema definitions or migrations
echo "Checking Neo4j schema definitions:"
rg "TGChat" --type py -B 2 -A 2
# Look for any explicit type definitions for chat_id
echo "\nChecking for explicit chat_id type definitions:"
rg "chat_id.*type" --type py
12-12
: Verify type compatibility across the codebase.
The change from str
to int
for chat_id
is logical as Telegram chat IDs are numeric. However, we should ensure this change is compatible with:
- All callers of this class
- Neo4j query parameter handling
- Any string-based chat ID storage in MongoDB
Summary by CodeRabbit
New Features
ExtractMessages
,ExtractMessagesDaily
, and other related classes to acceptchat_id
as an integer, enhancing data type consistency.ExtractMessagesDaily
to directly use timestamps in seconds.Bug Fixes
Tests
chat_id
type and timestamp handling.