-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(twitter): Scraper Enhancements Account Rotation and Rate Limit Handling #576
Conversation
- Delete TwitterSentimentHandler and TwitterTrendsHandler structs - Remove corresponding HandleWork functions for sentiment and trends - Update WorkerType constants and related functions to exclude sentiment and trends - Adjust WorkHandlerManager initialization to remove sentiment and trends handlers
…objx dependency - **Exported Auth Function:** - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible. - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function. - **Removed Unused Dependency:** - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed. - **Optimized Sleep Durations:** - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance. - **Cleaned Up Codebase:** - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase. - **Enhanced Test Configuration:** - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`. - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.
This commit improves the Twitter authentication and scraping tests in the pkg/tests/scrapers/twitter_auth_test.go file. The changes include: - Add godotenv package to load environment variables - Implement a loadEnv function to handle .env file loading - Enhance "authenticates and logs in successfully" test: - Verify cookie file doesn't exist before authentication - Check cookie file creation after authentication - Perform a simple profile scrape to validate the session - Improve "reuses session from cookies" test: - Verify cookie file creation - Force cookie reuse by clearing the first scraper - Validate the reused session with a profile scrape - Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies": - Authenticate twice to ensure cookie reuse - Scrape the profile of user 'god' - Fetch and verify the last 3 tweets containing #Bitcoin - Log scraped data for manual inspection These changes provide more robust testing of the Twitter authentication process, session reuse, and scraping functionality, ensuring better coverage and reliability of the Twitter-related features.
This commit introduces significant improvements to the Twitter scraping functionality: 1. Account Management: - Add TwitterAccount struct to represent individual Twitter accounts - Implement TwitterAccountManager for managing multiple accounts - Create functions for account rotation and rate limit tracking 2. Authentication: - Refactor Auth function to use account rotation - Implement cookie-based session management for each account - Add retry logic for authentication failures 3. Scraping Functions: - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation - Implement rate limit detection and account switching - Add retry mechanisms for failed operations 4. Configuration: - Move from hardcoded credentials to .env file-based configuration - Implement loadAccountsFromConfig to read multiple accounts from .env 5. Error Handling: - Improve error logging and handling throughout the package - Add specific handling for rate limit errors 6. Performance: - Implement concurrent scraping with multiple accounts - Add delays between requests to avoid aggressive rate limiting These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
} | ||
}) | ||
|
||
AfterEach(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better placed at the top - more readable and it works as well
twoFACode string | ||
) | ||
|
||
loadEnv := func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit against loading .env
files from the system in the integration tests - ideally we should follow the rabbit hole and try to re-factor code to not depend on a config singleton that is initialized "the first time" that "something" access to it.
It makes it so hard to track it down where it originates and who initializes what. My suggestion here is to take a bit of time to break this down and remove dependencies from the configuration singleton in the twitter scraper code.
projectRoot := filepath.Join(filepath.Dir(filename), "..", "..", "..") | ||
envPath := filepath.Join(projectRoot, ".env") | ||
|
||
err := godotenv.Load(envPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to fail hard here:
err := godotenv.Load(envPath) | |
err := godotenv.Load(envPath) | |
Expect(err).ToNot(HaveOccurred()) |
Expect(twitterUsername).NotTo(BeEmpty(), "TWITTER_USERNAME environment variable is not set") | ||
Expect(twitterPassword).NotTo(BeEmpty(), "TWITTER_PASSWORD environment variable is not set") | ||
|
||
config.GetInstance().TwitterUsername = twitterUsername |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why needing both env file and configuring here username/pass? I guess the env is required because otherwise config fails hardly, but that's just signaling that the scraper code should be revisited
RunSpecs(t, "Scrapers Suite") | ||
} | ||
|
||
func TestMain(m *testing.M) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto here with the above - this happens because the config singleton tries to parse os.Args. And the Twitter scrapper calls the config singleton, which at the first load does parse flags from args.
My suggestion here would be: if the tests needs such a "forced" execution or "uncommon" setup it likely means the code under test needs adaptation and we can avoid these workarounds to get things set up correctly.
This also seems to supersede #573 |
pkg/scrapers/twitter/auth.go
Outdated
} | ||
} | ||
|
||
time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constant sleep time over here - shall we have this configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated here: 1797bfe
I am leaving in the package config.go for now until we structure the config better
pkg/scrapers/twitter/auth.go
Outdated
account.RateLimitedUntil = time.Now().Add(time.Hour) | ||
} | ||
|
||
func Auth(account *TwitterAccount) *twitterscraper.Scraper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest here to create a classic constructor for the scraper, instead of Auth creating a scraper and returning it:
func NewScraper(account *TwitterAccount) *twitterscraper.Scraper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/scrapers/twitter/auth.go
Outdated
logrus.Debugf("Login successful for %s", account.Username) | ||
return scraper | ||
} | ||
|
||
func Login(scraper *twitterscraper.Scraper, credentials ...string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not part of the changeset, however - as you are with this in refactoring: it would make sense for these methods to actually be part of scraper *twitterscraper.Scraper
:
func (scraper *twitterscraper.Scraper) Login( credentials ...string) error {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/scrapers/twitter/auth.go
Outdated
@@ -43,9 +106,8 @@ func IsLoggedIn(scraper *twitterscraper.Scraper) bool { | |||
} | |||
|
|||
func Logout(scraper *twitterscraper.Scraper) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto here:
func (scraper *twitterscraper.Scraper) Logout() error {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/scrapers/twitter/auth.go
Outdated
|
||
func Auth(account *TwitterAccount) *twitterscraper.Scraper { | ||
scraper := twitterscraper.New() | ||
baseDir := config.GetInstance().MasaDir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as this looks one of the entrypoint of the config - and this piece of code seems to depend only on the MasaDir, I'd suggest to have only the dir in the constructor when creating the twitterScraper object, for instance:
func NewScraper(account *TwitterAccount, cookiedir string) *twitterscraper.Scraper
In this way the scraper does not depend from other structures inside its own code - dependencies are predictable, and the caller have the responsibility to have all the configurations for the scraper when instantiating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Extract sleep time configuration into TwitterConfig struct in config.go - Update Auth function in auth.go to accept TwitterConfig parameter - Remove hardcoded sleep time values from auth.go This change improves modularity and flexibility by centralizing configuration management for the Twitter scraper. It allows for easier modification of sleep times and future expansion of configuration options without altering the core authentication logic. BREAKING CHANGE: Auth function now requires a TwitterConfig parameter. Callers must create a TwitterConfig instance using NewTwitterConfig() before invoking Auth.
This reverts commit 7ad1bfe.
- Extracted sleep configurations from `auth.go` into `config.go`. - Defined `ShortSleepDuration` and `RateLimitDuration` constants. - Created `ShortSleep()` and `GetRateLimitDuration()` functions. - Updated `auth.go` to use the new config functions. This improves modularity by separating concerns and adheres to idiomatic Go practices.
- Updated `followers.go` and `tweets.go` to replace calls to `Auth` with `NewScraper`. - Resolved `undefined: Auth` errors due to previous refactoring. - Ensured all functionality and error handling remains consistent. - Improved codebase consistency following the constructor pattern. This completes the refactoring of the scraper creation process, enhancing code readability and maintainability.
…toring - Updated function signatures and variable types in `tweets.go` and `followers.go` to use the custom `*Scraper` type. - Adjusted return statements and method calls to match the new `Scraper` type. - Fixed `IncompatibleAssign` errors by ensuring consistency across all files. - Ensured all methods utilize the embedded `*twitterscraper.Scraper` methods through the custom `Scraper` type. This change finalizes the refactoring, ensuring all components work together seamlessly and conform to idiomatic Go practices.
- Extract generic Retry function to config.go - Remove specific retry functions from tweets.go - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use new Retry function - Move MaxRetries constant to config.go
…ndling - Move account management logic from auth.go to a separate file - Centralize authentication and rate limit handling in common functions - Simplify ScrapeFollowersForProfile and ScrapeTweetsByQuery using Retry - Remove duplicate code and unnecessary initializations - Increase WorkerResponseTimeout from 30 to 45 seconds in DefaultConfig
I need to go back and re-write these tests in this PR |
@mudler I am merging this so we can ship the release of the testnet may you still review and leave any comments even though I am closing this. I made a lot of changes |
Overview
This PR introduces a robust account rotation system and improved rate limit handling for our Twitter scraping functionality. These enhancements significantly increase the reliability and efficiency of our Twitter data collection process.
Key Features
Benefits
Testing
The new features have been thoroughly tested with multiple Twitter accounts and various scraping scenarios. All existing functionality remains intact, with improved performance and reliability.