feat(twitter): Scraper Enhancements: Account Rotation and Rate Limit Handling (#576)

* chore: cleanup and delete old tests

* chore: delete old tests

* refactor(twitter): remove sentiment and trends handlers

- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers

* refactor(twitter): export Auth function, update scrapers, and remove objx dependency

- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.

- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.

- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.

- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.

- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.

* chore: delete scrape tweets by trends (deprecated)

* feat(tests): enhance Twitter auth and scraping tests

This commit improves the Twitter authentication and scraping tests in the
pkg/tests/scrapers/twitter_auth_test.go file. The changes include:

- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication
process, session reuse, and scraping functionality, ensuring better
coverage and reliability of the Twitter-related features.
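
For illustration only, the session-reuse check might look roughly like this against the final API of this PR, assuming the suite uses Ginkgo/Gomega and that username, password, and cookieDir come from the suite setup (names here are illustrative, not the actual test code):

    It("reuses session from cookies", func() {
        account := &twitter.TwitterAccount{Username: username, Password: password}

        // First login authenticates with credentials and writes the cookie file.
        first := twitter.NewScraper(account, cookieDir)
        Expect(first).NotTo(BeNil())
        Expect(filepath.Join(cookieDir, username+"_twitter_cookies.json")).To(BeAnExistingFile())

        // A second scraper should come up logged in from the saved cookies alone.
        second := twitter.NewScraper(account, cookieDir)
        Expect(second).NotTo(BeNil())
        Expect(second.IsLoggedIn()).To(BeTrue())
    })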

* chore: rename files

* chore: add godoc dev notes

* feat(twitter): implement account rotation and rate limit handling

This commit introduces significant improvements to the Twitter scraping functionality:

1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking

2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures

3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations

4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env

5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors

6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
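
For reference, the .env-based account list described in point 4 is a single comma-separated variable of username:password pairs (format inferred from the parseAccounts code in common.go below; the values here are placeholders):

    TWITTER_ACCOUNTS=user1:pass1,user2:pass2,user3:pass3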

* feat(twitter): centralize configuration management

- Extract sleep time configuration into TwitterConfig struct in config.go
- Update Auth function in auth.go to accept TwitterConfig parameter
- Remove hardcoded sleep time values from auth.go

This change improves modularity and flexibility by centralizing configuration
management for the Twitter scraper. It allows for easier modification of
sleep times and future expansion of configuration options without altering
the core authentication logic.

BREAKING CHANGE: Auth function now requires a TwitterConfig parameter.
Callers must create a TwitterConfig instance using NewTwitterConfig()
before invoking Auth.
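
A sketch of the caller migration implied by the breaking change (the TwitterConfig fields and Auth's return value are assumptions, and this commit is reverted immediately below):

    cfg := twitter.NewTwitterConfig() // defaults, e.g. sleep durations
    scraper := twitter.Auth(cfg)      // Auth now takes the config explicitly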

* Revert "feat(twitter): centralize configuration management"

This reverts commit 7ad1bfe.

* refactor(twitter): move sleep configs to config.go and fix import cycle

- Extracted sleep configurations from `auth.go` into `config.go`.
- Defined `ShortSleepDuration` and `RateLimitDuration` constants.
- Created `ShortSleep()` and `GetRateLimitDuration()` functions.
- Updated `auth.go` to use the new config functions.

This improves modularity by separating concerns and adheres to idiomatic Go practices.

* refactor(twitter): replace Auth with NewScraper in followers and tweets

- Updated `followers.go` and `tweets.go` to replace calls to `Auth` with `NewScraper`.
- Resolved `undefined: Auth` errors due to previous refactoring.
- Ensured all functionality and error handling remains consistent.
- Improved codebase consistency following the constructor pattern.

This completes the refactoring of the scraper creation process, enhancing code readability and maintainability.
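
An illustrative call-site change (the "after" shape matches NewScraper in the auth.go diff below; the "before" call and surrounding variables are assumed):

    // before
    scraper := Auth()

    // after
    scraper := NewScraper(account, cookieDir)
    if scraper == nil {
        return fmt.Errorf("authentication failed for %s", account.Username)
    }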

* fix(twitter): resolve type incompatibility errors after Scraper refactoring

- Updated function signatures and variable types in `tweets.go` and `followers.go` to use the custom `*Scraper` type.
- Adjusted return statements and method calls to match the new `Scraper` type.
- Fixed `IncompatibleAssign` errors by ensuring consistency across all files.
- Ensured all methods utilize the embedded `*twitterscraper.Scraper` methods through the custom `Scraper` type.

This change finalizes the refactoring, ensuring all components work together seamlessly and conform to idiomatic Go practices.
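
The custom type itself is defined outside this diff; from its construction in auth.go (&Scraper{Scraper: newTwitterScraper()}), it is presumably a thin wrapper that embeds the library type, along these lines:

    type Scraper struct {
        *twitterscraper.Scraper
    }

Embedding promotes every twitterscraper.Scraper method (IsLoggedIn, GetProfile, FetchFollowers, and so on) onto *Scraper, so only the overridden Login and Logout need explicit definitions.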

* refactor(twitter): move retry logic to config.go

- Extract generic Retry function to config.go
- Remove specific retry functions from tweets.go
- Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use new Retry function
- Move MaxRetries constant to config.go

* refactor(twitter): modularize scraper components and improve error handling

- Move account management logic from auth.go to a separate file
- Centralize authentication and rate limit handling in common functions
- Simplify ScrapeFollowersForProfile and ScrapeTweetsByQuery using Retry
- Remove duplicate code and unnecessary initializations
- Increase WorkerResponseTimeout from 30 to 45 seconds in DefaultConfig
teslashibe authored Oct 4, 2024
1 parent 532ffab commit 419e224
Showing 24 changed files with 502 additions and 881 deletions.
1 change: 0 additions & 1 deletion go.mod
@@ -218,7 +218,6 @@ require (
 	github.com/spaolacci/murmur3 v1.1.0 // indirect
 	github.com/spf13/afero v1.11.0 // indirect
 	github.com/spf13/cast v1.6.0 // indirect
-	github.com/stretchr/objx v0.5.2 // indirect
 	github.com/subosito/gotenv v1.6.0 // indirect
 	github.com/supranational/blst v0.3.13 // indirect
 	github.com/syndtr/goleveldb v1.0.1-0.20210819022825-2ae1ddf74ef7 // indirect
1 change: 0 additions & 1 deletion go.sum
@@ -713,7 +713,6 @@ github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
 github.com/stretchr/objx v0.2.0/go.mod h1:qt09Ya8vawLte6SNmTgCsAVtYtaKzEcn8ATUoHMkEqE=
 github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
 github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
-github.com/stretchr/objx v0.5.2 h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY=
 github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA=
 github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
 github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
45 changes: 45 additions & 0 deletions pkg/scrapers/twitter/account.go
@@ -0,0 +1,45 @@
package twitter

import (
	"sync"
	"time"
)

type TwitterAccount struct {
	Username         string
	Password         string
	TwoFACode        string
	RateLimitedUntil time.Time
}

type TwitterAccountManager struct {
	accounts []*TwitterAccount
	index    int
	mutex    sync.Mutex
}

func NewTwitterAccountManager(accounts []*TwitterAccount) *TwitterAccountManager {
	return &TwitterAccountManager{
		accounts: accounts,
		index:    0,
	}
}

func (manager *TwitterAccountManager) GetNextAccount() *TwitterAccount {
	manager.mutex.Lock()
	defer manager.mutex.Unlock()
	for i := 0; i < len(manager.accounts); i++ {
		account := manager.accounts[manager.index]
		manager.index = (manager.index + 1) % len(manager.accounts)
		if time.Now().After(account.RateLimitedUntil) {
			return account
		}
	}
	return nil
}

func (manager *TwitterAccountManager) MarkAccountRateLimited(account *TwitterAccount) {
	manager.mutex.Lock()
	defer manager.mutex.Unlock()
	account.RateLimitedUntil = time.Now().Add(GetRateLimitDuration())
}
72 changes: 38 additions & 34 deletions pkg/scrapers/twitter/auth.go
@@ -3,49 +3,53 @@ package twitter
 import (
 	"fmt"
 
-	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
+	"github.com/sirupsen/logrus"
 )
 
-// Login attempts to log in to the Twitter scraper service.
-// It supports three modes of operation:
-// 1. Basic login using just a username and password.
-// 2. Login requiring an email confirmation, using a username, password, and email address.
-// 3. Login with two-factor authentication, using a username, password, and 2FA code.
-// Parameters:
-// - scraper: A pointer to an instance of the twitterscraper.Scraper.
-// - credentials: A variadic list of strings representing login credentials.
-// The function expects either two strings (username, password) for basic login,
-// or three strings (username, password, email/2FA code) for email confirmation or 2FA.
-//
-// Returns an error if login fails or if an invalid number of credentials is provided.
-func Login(scraper *twitterscraper.Scraper, credentials ...string) error {
+func NewScraper(account *TwitterAccount, cookieDir string) *Scraper {
+	scraper := &Scraper{Scraper: newTwitterScraper()}
+
+	if err := LoadCookies(scraper.Scraper, account, cookieDir); err == nil {
+		logrus.Debugf("Cookies loaded for user %s.", account.Username)
+		if scraper.IsLoggedIn() {
+			logrus.Debugf("Already logged in as %s.", account.Username)
+			return scraper
+		}
+	}
+
+	ShortSleep()
+
+	if err := scraper.Login(account.Username, account.Password, account.TwoFACode); err != nil {
+		logrus.WithError(err).Warnf("Login failed for %s", account.Username)
+		return nil
+	}
+
+	ShortSleep()
+
+	if err := SaveCookies(scraper.Scraper, account, cookieDir); err != nil {
+		logrus.WithError(err).Errorf("Failed to save cookies for %s", account.Username)
+	}
+
+	logrus.Debugf("Login successful for %s", account.Username)
+	return scraper
+}
 
+func (scraper *Scraper) Login(username, password string, twoFACode ...string) error {
 	var err error
-	switch len(credentials) {
-	case 2:
-		// Basic login with username and password.
-		err = scraper.Login(credentials[0], credentials[1])
-	case 3:
-		// The third parameter is used for either email confirmation or a 2FA code.
-		// This design assumes the Twitter scraper's Login method can contextually handle both cases.
-		err = scraper.Login(credentials[0], credentials[1], credentials[2])
-	default:
-		// Return an error if the number of provided credentials is neither 2 nor 3.
-		return fmt.Errorf("invalid number of login credentials provided")
+	if len(twoFACode) > 0 {
+		err = scraper.Scraper.Login(username, password, twoFACode[0])
+	} else {
+		err = scraper.Scraper.Login(username, password)
 	}
 	if err != nil {
-		return fmt.Errorf("%v", err)
+		return fmt.Errorf("login failed: %v", err)
 	}
 	return nil
 }
 
-func IsLoggedIn(scraper *twitterscraper.Scraper) bool {
-	return scraper.IsLoggedIn()
-}
-
-func Logout(scraper *twitterscraper.Scraper) error {
-	err := scraper.Logout()
-	if err != nil {
-		return fmt.Errorf("[-] Logout failed: %v", err)
+func (scraper *Scraper) Logout() error {
-	if err := scraper.Scraper.Logout(); err != nil {
+		return fmt.Errorf("logout failed: %v", err)
 	}
 	return nil
 }
85 changes: 85 additions & 0 deletions pkg/scrapers/twitter/common.go
@@ -0,0 +1,85 @@
package twitter

import (
	"fmt"
	"os"
	"strings"
	"sync"

	"github.com/joho/godotenv"
	"github.com/masa-finance/masa-oracle/pkg/config"
	"github.com/sirupsen/logrus"
)

var (
	accountManager *TwitterAccountManager
	once           sync.Once
)

func initializeAccountManager() {
	accounts := loadAccountsFromConfig()
	accountManager = NewTwitterAccountManager(accounts)
}

func loadAccountsFromConfig() []*TwitterAccount {
	err := godotenv.Load()
	if err != nil {
		logrus.Fatalf("error loading .env file: %v", err)
	}

	accountsEnv := os.Getenv("TWITTER_ACCOUNTS")
	if accountsEnv == "" {
		logrus.Fatal("TWITTER_ACCOUNTS not set in .env file")
	}

	return parseAccounts(strings.Split(accountsEnv, ","))
}

func parseAccounts(accountPairs []string) []*TwitterAccount {
	return filterMap(accountPairs, func(pair string) (*TwitterAccount, bool) {
		credentials := strings.Split(pair, ":")
		if len(credentials) != 2 {
			logrus.Warnf("invalid account credentials: %s", pair)
			return nil, false
		}
		return &TwitterAccount{
			Username: strings.TrimSpace(credentials[0]),
			Password: strings.TrimSpace(credentials[1]),
		}, true
	})
}

func getAuthenticatedScraper() (*Scraper, *TwitterAccount, error) {
	once.Do(initializeAccountManager)
	baseDir := config.GetInstance().MasaDir

	account := accountManager.GetNextAccount()
	if account == nil {
		return nil, nil, fmt.Errorf("all accounts are rate-limited")
	}
	scraper := NewScraper(account, baseDir)
	if scraper == nil {
		logrus.Errorf("Authentication failed for %s", account.Username)
		return nil, account, fmt.Errorf("Twitter authentication failed for %s", account.Username)
	}
	return scraper, account, nil
}

func handleRateLimit(err error, account *TwitterAccount) bool {
	if strings.Contains(err.Error(), "Rate limit exceeded") {
		accountManager.MarkAccountRateLimited(account)
		logrus.Warnf("rate limited: %s", account.Username)
		return true
	}
	return false
}

func filterMap[T any, R any](slice []T, f func(T) (R, bool)) []R {
	result := make([]R, 0, len(slice))
	for _, v := range slice {
		if r, ok := f(v); ok {
			result = append(result, r)
		}
	}
	return result
}
35 changes: 35 additions & 0 deletions pkg/scrapers/twitter/config.go
@@ -0,0 +1,35 @@
package twitter

import (
	"fmt"
	"time"

	"github.com/sirupsen/logrus"
)

const (
	ShortSleepDuration = 20 * time.Millisecond
	RateLimitDuration  = time.Hour
	MaxRetries         = 3
)

func ShortSleep() {
	time.Sleep(ShortSleepDuration)
}

func GetRateLimitDuration() time.Duration {
	return RateLimitDuration
}

func Retry[T any](operation func() (T, error), maxAttempts int) (T, error) {
	var zero T
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		result, err := operation()
		if err == nil {
			return result, nil
		}
		logrus.Errorf("retry attempt %d failed: %v", attempt, err)
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return zero, fmt.Errorf("operation failed after %d attempts", maxAttempts)
}
27 changes: 11 additions & 16 deletions pkg/scrapers/twitter/cookies.go
@@ -5,37 +5,32 @@ import (
 	"fmt"
 	"net/http"
 	"os"
+	"path/filepath"
 
 	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
 )
 
-func SaveCookies(scraper *twitterscraper.Scraper, filePath string) error {
+func SaveCookies(scraper *twitterscraper.Scraper, account *TwitterAccount, baseDir string) error {
+	cookieFile := filepath.Join(baseDir, fmt.Sprintf("%s_twitter_cookies.json", account.Username))
 	cookies := scraper.GetCookies()
-	js, err := json.Marshal(cookies)
+	data, err := json.Marshal(cookies)
 	if err != nil {
 		return fmt.Errorf("error marshaling cookies: %v", err)
 	}
-	err = os.WriteFile(filePath, js, 0644)
-	if err != nil {
-		return fmt.Errorf("error saving cookies to file: %v", err)
-	}
-
-	// Load the saved cookies back into the scraper
-	if err := LoadCookies(scraper, filePath); err != nil {
-		return fmt.Errorf("error loading saved cookies: %v", err)
+	if err = os.WriteFile(cookieFile, data, 0644); err != nil {
+		return fmt.Errorf("error saving cookies: %v", err)
 	}
-
 	return nil
 }
 
-func LoadCookies(scraper *twitterscraper.Scraper, filePath string) error {
-	js, err := os.ReadFile(filePath)
+func LoadCookies(scraper *twitterscraper.Scraper, account *TwitterAccount, baseDir string) error {
+	cookieFile := filepath.Join(baseDir, fmt.Sprintf("%s_twitter_cookies.json", account.Username))
+	data, err := os.ReadFile(cookieFile)
 	if err != nil {
-		return fmt.Errorf("error reading cookies from file: %v", err)
+		return fmt.Errorf("error reading cookies: %v", err)
 	}
 	var cookies []*http.Cookie
-	err = json.Unmarshal(js, &cookies)
-	if err != nil {
+	if err = json.Unmarshal(data, &cookies); err != nil {
 		return fmt.Errorf("error unmarshaling cookies: %v", err)
 	}
 	scraper.SetCookies(cookies)
40 changes: 15 additions & 25 deletions pkg/scrapers/twitter/followers.go
@@ -1,38 +1,28 @@
 package twitter
 
 import (
-	"encoding/json"
 	"fmt"
 
-	_ "github.com/lib/pq"
 	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
 	"github.com/sirupsen/logrus"
 )
 
 // ScrapeFollowersForProfile scrapes the profile and tweets of a specific Twitter user.
 // It takes the username as a parameter and returns the scraped profile information and an error if any.
 func ScrapeFollowersForProfile(username string, count int) ([]twitterscraper.Legacy, error) {
-	scraper := auth()
+	return Retry(func() ([]twitterscraper.Legacy, error) {
+		scraper, account, err := getAuthenticatedScraper()
+		if err != nil {
+			return nil, err
+		}
-
-	if scraper == nil {
-		return nil, fmt.Errorf("there was an error authenticating with your Twitter credentials")
-	}
+		followingResponse, errString, _ := scraper.FetchFollowers(username, count, "")
+		if errString != "" {
+			if handleRateLimit(fmt.Errorf(errString), account) {
+				return nil, fmt.Errorf("rate limited")
+			}
+			logrus.Errorf("Error fetching followers: %v", errString)
+			return nil, fmt.Errorf("%v", errString)
+		}
-
-	followingResponse, errString, _ := scraper.FetchFollowers(username, count, "")
-	if errString != "" {
-		logrus.Printf("Error fetching profile: %v", errString)
-		return nil, fmt.Errorf("%v", errString)
-	}
-
-	// Marshal the followingResponse into a JSON string for logging
-	responseJSON, err := json.Marshal(followingResponse)
-	if err != nil {
-		// Log the error if the marshaling fails
-		logrus.Errorf("[-] Error marshaling followingResponse: %v", err)
-	} else {
-		// Log the JSON string of followingResponse
-		logrus.Debugf("Following response: %s", responseJSON)
-	}
-
-	return followingResponse, nil
+		return followingResponse, nil
+	}, MaxRetries)
 }
23 changes: 23 additions & 0 deletions pkg/scrapers/twitter/profile.go
@@ -0,0 +1,23 @@
package twitter

import (
	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
)

func ScrapeTweetsProfile(username string) (twitterscraper.Profile, error) {
	return Retry(func() (twitterscraper.Profile, error) {
		scraper, account, err := getAuthenticatedScraper()
		if err != nil {
			return twitterscraper.Profile{}, err
		}

		profile, err := scraper.GetProfile(username)
		if err != nil {
			if handleRateLimit(err, account) {
				return twitterscraper.Profile{}, err
			}
			return twitterscraper.Profile{}, err
		}
		return profile, nil
	}, MaxRetries)
}
