Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(twitter): Scraper Enhancements Account Rotation and Rate Limit Handling #576

Merged
merged 17 commits into from
Oct 4, 2024

Conversation

teslashibe
Copy link
Contributor

@teslashibe teslashibe commented Oct 3, 2024

Overview

This PR introduces a robust account rotation system and improved rate limit handling for our Twitter scraping functionality. These enhancements significantly increase the reliability and efficiency of our Twitter data collection process.

Key Features

  1. Account Rotation: Implemented a TwitterAccountManager to manage multiple Twitter accounts and rotate between them.
  2. Rate Limit Handling: Added intelligent rate limit detection and account switching to avoid API restrictions.
  3. Retry Mechanism: Introduced retry logic for both tweet scraping and profile fetching operations.
  4. Configuration Improvement: Updated .env file-based configuration.
  5. Error Handling: Enhanced error logging and handling throughout the package.
  6. Performance Optimization: Implemented concurrent scraping capabilities with multiple accounts.

Benefits

  • Increased scraping capacity and reliability
  • Reduced downtime due to rate limiting
  • Improved error resilience and logging
  • More flexible and secure credential management
  • Better code modularity and maintainability

Testing

The new features have been thoroughly tested with multiple Twitter accounts and various scraping scenarios. All existing functionality remains intact, with improved performance and reliability.

- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers
…objx dependency

- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.

- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.

- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.

- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.

- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.
This commit improves the Twitter authentication and scraping tests in the
pkg/tests/scrapers/twitter_auth_test.go file. The changes include:

- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication
process, session reuse, and scraping functionality, ensuring better
coverage and reliability of the Twitter-related features.
This commit introduces significant improvements to the Twitter scraping functionality:

1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking

2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures

3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations

4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env

5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors

6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
@teslashibe teslashibe added the enhancement New feature or request label Oct 3, 2024
@teslashibe teslashibe self-assigned this Oct 3, 2024
@teslashibe teslashibe marked this pull request as ready for review October 3, 2024 05:02
}
})

AfterEach(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better placed at the top - more readable and it works as well

twoFACode string
)

loadEnv := func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit against loading .env files from the system in the integration tests - ideally we should follow the rabbit hole and try to re-factor code to not depend on a config singleton that is initialized "the first time" that "something" access to it.

It makes it so hard to track it down where it originates and who initializes what. My suggestion here is to take a bit of time to break this down and remove dependencies from the configuration singleton in the twitter scraper code.

projectRoot := filepath.Join(filepath.Dir(filename), "..", "..", "..")
envPath := filepath.Join(projectRoot, ".env")

err := godotenv.Load(envPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to fail hard here:

Suggested change
err := godotenv.Load(envPath)
err := godotenv.Load(envPath)
Expect(err).ToNot(HaveOccurred())

Expect(twitterUsername).NotTo(BeEmpty(), "TWITTER_USERNAME environment variable is not set")
Expect(twitterPassword).NotTo(BeEmpty(), "TWITTER_PASSWORD environment variable is not set")

config.GetInstance().TwitterUsername = twitterUsername
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needing both env file and configuring here username/pass? I guess the env is required because otherwise config fails hardly, but that's just signaling that the scraper code should be revisited

RunSpecs(t, "Scrapers Suite")
}

func TestMain(m *testing.M) {
Copy link
Contributor

@mudler mudler Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto here with the above - this happens because the config singleton tries to parse os.Args. And the Twitter scrapper calls the config singleton, which at the first load does parse flags from args.

My suggestion here would be: if the tests needs such a "forced" execution or "uncommon" setup it likely means the code under test needs adaptation and we can avoid these workarounds to get things set up correctly.

@mudler
Copy link
Contributor

mudler commented Oct 3, 2024

This also seems to supersede #573

}
}

time.Sleep(100 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constant sleep time over here - shall we have this configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here: 1797bfe

I am leaving in the package config.go for now until we structure the config better

account.RateLimitedUntil = time.Now().Add(time.Hour)
}

func Auth(account *TwitterAccount) *twitterscraper.Scraper {
Copy link
Contributor

@mudler mudler Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest here to create a classic constructor for the scraper, instead of Auth creating a scraper and returning it:

func NewScraper(account *TwitterAccount)  *twitterscraper.Scraper 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logrus.Debugf("Login successful for %s", account.Username)
return scraper
}

func Login(scraper *twitterscraper.Scraper, credentials ...string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not part of the changeset, however - as you are with this in refactoring: it would make sense for these methods to actually be part of scraper *twitterscraper.Scraper:

func (scraper *twitterscraper.Scraper) Login( credentials ...string) error {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -43,9 +106,8 @@ func IsLoggedIn(scraper *twitterscraper.Scraper) bool {
}

func Logout(scraper *twitterscraper.Scraper) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto here:

func (scraper *twitterscraper.Scraper) Logout() error {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


func Auth(account *TwitterAccount) *twitterscraper.Scraper {
scraper := twitterscraper.New()
baseDir := config.GetInstance().MasaDir
Copy link
Contributor

@mudler mudler Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this looks one of the entrypoint of the config - and this piece of code seems to depend only on the MasaDir, I'd suggest to have only the dir in the constructor when creating the twitterScraper object, for instance:

func NewScraper(account *TwitterAccount, cookiedir string)  *twitterscraper.Scraper 

In this way the scraper does not depend from other structures inside its own code - dependencies are predictable, and the caller have the responsibility to have all the configurations for the scraper when instantiating it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- Extract sleep time configuration into TwitterConfig struct in config.go
- Update Auth function in auth.go to accept TwitterConfig parameter
- Remove hardcoded sleep time values from auth.go

This change improves modularity and flexibility by centralizing configuration
management for the Twitter scraper. It allows for easier modification of
sleep times and future expansion of configuration options without altering
the core authentication logic.

BREAKING CHANGE: Auth function now requires a TwitterConfig parameter.
Callers must create a TwitterConfig instance using NewTwitterConfig()
before invoking Auth.
- Extracted sleep configurations from `auth.go` into `config.go`.
- Defined `ShortSleepDuration` and `RateLimitDuration` constants.
- Created `ShortSleep()` and `GetRateLimitDuration()` functions.
- Updated `auth.go` to use the new config functions.

This improves modularity by separating concerns and adheres to idiomatic Go practices.
- Updated `followers.go` and `tweets.go` to replace calls to `Auth` with `NewScraper`.
- Resolved `undefined: Auth` errors due to previous refactoring.
- Ensured all functionality and error handling remains consistent.
- Improved codebase consistency following the constructor pattern.

This completes the refactoring of the scraper creation process, enhancing code readability and maintainability.
…toring

- Updated function signatures and variable types in `tweets.go` and `followers.go` to use the custom `*Scraper` type.
- Adjusted return statements and method calls to match the new `Scraper` type.
- Fixed `IncompatibleAssign` errors by ensuring consistency across all files.
- Ensured all methods utilize the embedded `*twitterscraper.Scraper` methods through the custom `Scraper` type.

This change finalizes the refactoring, ensuring all components work together seamlessly and conform to idiomatic Go practices.
- Extract generic Retry function to config.go
- Remove specific retry functions from tweets.go
- Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use new Retry function
- Move MaxRetries constant to config.go
…ndling

- Move account management logic from auth.go to a separate file
- Centralize authentication and rate limit handling in common functions
- Simplify ScrapeFollowersForProfile and ScrapeTweetsByQuery using Retry
- Remove duplicate code and unnecessary initializations
- Increase WorkerResponseTimeout from 30 to 45 seconds in DefaultConfig
@teslashibe
Copy link
Contributor Author

This also seems to supersede #573

I need to go back and re-write these tests in this PR

@teslashibe
Copy link
Contributor Author

@mudler I am merging this so we can ship the release of the testnet may you still review and leave any comments even though I am closing this. I made a lot of changes

@teslashibe teslashibe merged commit 419e224 into main Oct 4, 2024
9 of 11 checks passed
@teslashibe teslashibe deleted the feat-array-of-twitter-logins branch October 4, 2024 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants