Commit
feat(twitter): Scraper Enhancements Account Rotation and Rate Limit Handling (#576)

* chore: cleanup and delete old tests

* chore: delete old tests

* refactor(twitter): remove sentiment and trends handlers

- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers

* refactor(twitter): export Auth function, update scrapers, and remove objx dependency

- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.
- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.
- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.
- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.
- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.

* chore: delete scrape tweets by trends (deprecated)

* feat(tests): enhance Twitter auth and scraping tests

This commit improves the Twitter authentication and scraping tests in the pkg/tests/scrapers/twitter_auth_test.go file.
The changes include:

- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication process, session reuse, and scraping functionality, ensuring better coverage and reliability of the Twitter-related features.

* chore: rename files

* chore: add godoc dev notes

* feat(twitter): implement account rotation and rate limit handling

This commit introduces significant improvements to the Twitter scraping functionality:

1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking
2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures
3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations
4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env
5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors
6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.

* feat(twitter): centralize configuration management

- Extract sleep time configuration into TwitterConfig struct in config.go
- Update Auth function in auth.go to accept TwitterConfig parameter
- Remove hardcoded sleep time values from auth.go

This change improves modularity and flexibility by centralizing configuration management for the Twitter scraper. It allows for easier modification of sleep times and future expansion of configuration options without altering the core authentication logic.

BREAKING CHANGE: Auth function now requires a TwitterConfig parameter. Callers must create a TwitterConfig instance using NewTwitterConfig() before invoking Auth.

* Revert "feat(twitter): centralize configuration management"

This reverts commit 7ad1bfe.

* refactor(twitter): move sleep configs to config.go and fix import cycle

- Extracted sleep configurations from `auth.go` into `config.go`.
- Defined `ShortSleepDuration` and `RateLimitDuration` constants.
- Created `ShortSleep()` and `GetRateLimitDuration()` functions.
- Updated `auth.go` to use the new config functions.

This improves modularity by separating concerns and adheres to idiomatic Go practices.

* refactor(twitter): replace Auth with NewScraper in followers and tweets

- Updated `followers.go` and `tweets.go` to replace calls to `Auth` with `NewScraper`.
- Resolved `undefined: Auth` errors due to previous refactoring.
- Ensured all functionality and error handling remains consistent.
- Improved codebase consistency following the constructor pattern.

This completes the refactoring of the scraper creation process, enhancing code readability and maintainability.

* fix(twitter): resolve type incompatibility errors after Scraper refactoring

- Updated function signatures and variable types in `tweets.go` and `followers.go` to use the custom `*Scraper` type.
- Adjusted return statements and method calls to match the new `Scraper` type.
- Fixed `IncompatibleAssign` errors by ensuring consistency across all files.
- Ensured all methods utilize the embedded `*twitterscraper.Scraper` methods through the custom `Scraper` type.

This change finalizes the refactoring, ensuring all components work together seamlessly and conform to idiomatic Go practices.

* refactor(twitter): move retry logic to config.go

- Extract generic Retry function to config.go
- Remove specific retry functions from tweets.go
- Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use new Retry function
- Move MaxRetries constant to config.go

* refactor(twitter): modularize scraper components and improve error handling

- Move account management logic from auth.go to a separate file
- Centralize authentication and rate limit handling in common functions
- Simplify ScrapeFollowersForProfile and ScrapeTweetsByQuery using Retry
- Remove duplicate code and unnecessary initializations
- Increase WorkerResponseTimeout from 30 to 45 seconds in DefaultConfig
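
Illustrative note: the commit message describes account rotation, rate limit tracking, and a shared Retry helper, but the page does not show the code itself. The following is a minimal sketch of how those three pieces might fit together, not the repository's implementation. Every name here (`Account`, `AccountManager`, `ErrRateLimited`, `ScrapeWithRotation`, the signature of `Retry`, and the 15-minute cooldown) is an assumption chosen for the example; the commit's real types are `TwitterAccount`, `TwitterAccountManager`, and a `Retry` function in `config.go` whose signatures are not shown.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrRateLimited is a hypothetical sentinel error standing in for whatever
// rate limit signal the real scraper returns.
var ErrRateLimited = errors.New("rate limited")

// Account is an illustrative stand-in for the commit's TwitterAccount struct.
type Account struct {
	Username         string
	RateLimitedUntil time.Time
}

// AccountManager hands out accounts round-robin and skips any account that
// is still inside its rate limit cooldown.
type AccountManager struct {
	mu       sync.Mutex
	accounts []*Account
	next     int
}

func NewAccountManager(usernames ...string) *AccountManager {
	m := &AccountManager{}
	for _, u := range usernames {
		m.accounts = append(m.accounts, &Account{Username: u})
	}
	return m
}

// Next returns the next usable account, or nil if all are rate limited.
func (m *AccountManager) Next(now time.Time) *Account {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i := 0; i < len(m.accounts); i++ {
		a := m.accounts[(m.next+i)%len(m.accounts)]
		if !now.Before(a.RateLimitedUntil) {
			m.next = (m.next + i + 1) % len(m.accounts)
			return a
		}
	}
	return nil
}

// MarkRateLimited makes Next skip the account until now+cooldown.
func (m *AccountManager) MarkRateLimited(a *Account, now time.Time, cooldown time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	a.RateLimitedUntil = now.Add(cooldown)
}

// Retry runs op up to maxRetries times and returns nil on the first success,
// otherwise the last error seen.
func Retry(maxRetries int, op func() error) error {
	err := errors.New("retry: no attempts made")
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

// ScrapeWithRotation picks a fresh account on every attempt and benches any
// account that reports a rate limit. The 15-minute cooldown is an assumed value.
func ScrapeWithRotation(m *AccountManager, maxRetries int, scrape func(*Account) (string, error)) (string, error) {
	var out string
	err := Retry(maxRetries, func() error {
		now := time.Now()
		acct := m.Next(now)
		if acct == nil {
			return errors.New("all accounts are rate limited")
		}
		var scrapeErr error
		out, scrapeErr = scrape(acct)
		if errors.Is(scrapeErr, ErrRateLimited) {
			m.MarkRateLimited(acct, now, 15*time.Minute)
		}
		return scrapeErr
	})
	return out, err
}

func main() {
	m := NewAccountManager("alice", "bob")
	// Fake scrape function: the first account is always rate limited.
	out, err := ScrapeWithRotation(m, 3, func(a *Account) (string, error) {
		if a.Username == "alice" {
			return "", ErrRateLimited
		}
		return "scraped as " + a.Username, nil
	})
	fmt.Println(out, err)
}
```

In this sketch the first attempt uses the first account, which reports a rate limit and is benched; the retry then rotates to the second account and succeeds. Keeping the cooldown timestamp on the account rather than sleeping lets other accounts keep working while one waits out its limit.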