feat(twitter): Scraper Enhancements: Account Rotation and Rate Limit Handling (#576)

* chore: cleanup and delete old tests

* chore: delete old tests

* refactor(twitter): remove sentiment and trends handlers

- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers

* refactor(twitter): export Auth function, update scrapers, and remove objx dependency

- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.

- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.

- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.

- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.

- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.

* chore: delete scrape tweets by trends (deprecated)

* feat(tests): enhance Twitter auth and scraping tests

This commit improves the Twitter authentication and scraping tests in the
pkg/tests/scrapers/twitter_auth_test.go file. The changes include:

- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication
process, session reuse, and scraping functionality, ensuring better
coverage and reliability of the Twitter-related features.
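
For illustration only, the session-reuse check might look roughly like this against the final API of this PR, assuming the suite uses Ginkgo/Gomega and that username, password, and cookieDir come from the suite setup (names here are illustrative, not the actual test code):

    It("reuses session from cookies", func() {
        account := &twitter.TwitterAccount{Username: username, Password: password}

        // First login authenticates with credentials and writes the cookie file.
        first := twitter.NewScraper(account, cookieDir)
        Expect(first).NotTo(BeNil())
        Expect(filepath.Join(cookieDir, username+"_twitter_cookies.json")).To(BeAnExistingFile())

        // A second scraper should come up logged in from the saved cookies alone.
        second := twitter.NewScraper(account, cookieDir)
        Expect(second).NotTo(BeNil())
        Expect(second.IsLoggedIn()).To(BeTrue())
    })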

* chore: rename files

* chore: add godoc dev notes

* feat(twitter): implement account rotation and rate limit handling

This commit introduces significant improvements to the Twitter scraping functionality:

1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking

2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures

3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations

4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env

5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors

6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
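
For reference, the .env-based account list described in point 4 is a single comma-separated variable of username:password pairs (format inferred from the parseAccounts code in common.go below; the values here are placeholders):

    TWITTER_ACCOUNTS=user1:pass1,user2:pass2,user3:pass3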

* feat(twitter): centralize configuration management

- Extract sleep time configuration into TwitterConfig struct in config.go
- Update Auth function in auth.go to accept TwitterConfig parameter
- Remove hardcoded sleep time values from auth.go

This change improves modularity and flexibility by centralizing configuration
management for the Twitter scraper. It allows for easier modification of
sleep times and future expansion of configuration options without altering
the core authentication logic.

BREAKING CHANGE: Auth function now requires a TwitterConfig parameter.
Callers must create a TwitterConfig instance using NewTwitterConfig()
before invoking Auth.
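
A sketch of the caller migration implied by the breaking change (the TwitterConfig fields and Auth's return value are assumptions, and this commit is reverted immediately below):

    cfg := twitter.NewTwitterConfig() // defaults, e.g. sleep durations
    scraper := twitter.Auth(cfg)      // Auth now takes the config explicitly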

* Revert "feat(twitter): centralize configuration management"

This reverts commit 7ad1bfe.

* refactor(twitter): move sleep configs to config.go and fix import cycle

- Extracted sleep configurations from `auth.go` into `config.go`.
- Defined `ShortSleepDuration` and `RateLimitDuration` constants.
- Created `ShortSleep()` and `GetRateLimitDuration()` functions.
- Updated `auth.go` to use the new config functions.

This improves modularity by separating concerns and adheres to idiomatic Go practices.

* refactor(twitter): replace Auth with NewScraper in followers and tweets

- Updated `followers.go` and `tweets.go` to replace calls to `Auth` with `NewScraper`.
- Resolved `undefined: Auth` errors due to previous refactoring.
- Ensured all functionality and error handling remains consistent.
- Improved codebase consistency following the constructor pattern.

This completes the refactoring of the scraper creation process, enhancing code readability and maintainability.
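
An illustrative call-site change (the "after" shape matches NewScraper in the auth.go diff below; the "before" call and surrounding variables are assumed):

    // before
    scraper := Auth()

    // after
    scraper := NewScraper(account, cookieDir)
    if scraper == nil {
        return fmt.Errorf("authentication failed for %s", account.Username)
    }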

* fix(twitter): resolve type incompatibility errors after Scraper refactoring

- Updated function signatures and variable types in `tweets.go` and `followers.go` to use the custom `*Scraper` type.
- Adjusted return statements and method calls to match the new `Scraper` type.
- Fixed `IncompatibleAssign` errors by ensuring consistency across all files.
- Ensured all methods utilize the embedded `*twitterscraper.Scraper` methods through the custom `Scraper` type.

This change finalizes the refactoring, ensuring all components work together seamlessly and conform to idiomatic Go practices.
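
The custom type itself is defined outside this diff; from its construction in auth.go (&Scraper{Scraper: newTwitterScraper()}), it is presumably a thin wrapper that embeds the library type, along these lines:

    type Scraper struct {
        *twitterscraper.Scraper
    }

Embedding promotes every twitterscraper.Scraper method (IsLoggedIn, GetProfile, FetchFollowers, and so on) onto *Scraper, so only the overridden Login and Logout need explicit definitions.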

* refactor(twitter): move retry logic to config.go

- Extract generic Retry function to config.go
- Remove specific retry functions from tweets.go
- Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use new Retry function
- Move MaxRetries constant to config.go

* refactor(twitter): modularize scraper components and improve error handling

- Move account management logic from auth.go to a separate file
- Centralize authentication and rate limit handling in common functions
- Simplify ScrapeFollowersForProfile and ScrapeTweetsByQuery using Retry
- Remove duplicate code and unnecessary initializations
- Increase WorkerResponseTimeout from 30 to 45 seconds in DefaultConfig
teslashibe authored Oct 4, 2024
1 parent 532ffab commit 419e224
Showing 24 changed files with 502 additions and 881 deletions.
1 change: 0 additions & 1 deletion go.mod
@@ -218,7 +218,6 @@ require (
 	github.com/spaolacci/murmur3 v1.1.0 // indirect
 	github.com/spf13/afero v1.11.0 // indirect
 	github.com/spf13/cast v1.6.0 // indirect
-	github.com/stretchr/objx v0.5.2 // indirect
 	github.com/subosito/gotenv v1.6.0 // indirect
 	github.com/supranational/blst v0.3.13 // indirect
 	github.com/syndtr/goleveldb v1.0.1-0.20210819022825-2ae1ddf74ef7 // indirect
1 change: 0 additions & 1 deletion go.sum
@@ -713,7 +713,6 @@ github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
 github.com/stretchr/objx v0.2.0/go.mod h1:qt09Ya8vawLte6SNmTgCsAVtYtaKzEcn8ATUoHMkEqE=
 github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
 github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
-github.com/stretchr/objx v0.5.2 h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY=
 github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA=
 github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
 github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
45 changes: 45 additions & 0 deletions pkg/scrapers/twitter/account.go
@@ -0,0 +1,45 @@
package twitter

import (
	"sync"
	"time"
)

type TwitterAccount struct {
	Username         string
	Password         string
	TwoFACode        string
	RateLimitedUntil time.Time
}

type TwitterAccountManager struct {
	accounts []*TwitterAccount
	index    int
	mutex    sync.Mutex
}

func NewTwitterAccountManager(accounts []*TwitterAccount) *TwitterAccountManager {
	return &TwitterAccountManager{
		accounts: accounts,
		index:    0,
	}
}

func (manager *TwitterAccountManager) GetNextAccount() *TwitterAccount {
	manager.mutex.Lock()
	defer manager.mutex.Unlock()
	for i := 0; i < len(manager.accounts); i++ {
		account := manager.accounts[manager.index]
		manager.index = (manager.index + 1) % len(manager.accounts)
		if time.Now().After(account.RateLimitedUntil) {
			return account
		}
	}
	return nil
}

func (manager *TwitterAccountManager) MarkAccountRateLimited(account *TwitterAccount) {
	manager.mutex.Lock()
	defer manager.mutex.Unlock()
	account.RateLimitedUntil = time.Now().Add(GetRateLimitDuration())
}
72 changes: 38 additions & 34 deletions pkg/scrapers/twitter/auth.go
@@ -3,49 +3,53 @@ package twitter
 import (
 	"fmt"
 
-	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
+	"github.com/sirupsen/logrus"
 )
 
-// Login attempts to log in to the Twitter scraper service.
-// It supports three modes of operation:
-// 1. Basic login using just a username and password.
-// 2. Login requiring an email confirmation, using a username, password, and email address.
-// 3. Login with two-factor authentication, using a username, password, and 2FA code.
-// Parameters:
-// - scraper: A pointer to an instance of the twitterscraper.Scraper.
-// - credentials: A variadic list of strings representing login credentials.
-// The function expects either two strings (username, password) for basic login,
-// or three strings (username, password, email/2FA code) for email confirmation or 2FA.
-//
-// Returns an error if login fails or if an invalid number of credentials is provided.
-func Login(scraper *twitterscraper.Scraper, credentials ...string) error {
+func NewScraper(account *TwitterAccount, cookieDir string) *Scraper {
+	scraper := &Scraper{Scraper: newTwitterScraper()}
+
+	if err := LoadCookies(scraper.Scraper, account, cookieDir); err == nil {
+		logrus.Debugf("Cookies loaded for user %s.", account.Username)
+		if scraper.IsLoggedIn() {
+			logrus.Debugf("Already logged in as %s.", account.Username)
+			return scraper
+		}
+	}
+
+	ShortSleep()
+
+	if err := scraper.Login(account.Username, account.Password, account.TwoFACode); err != nil {
+		logrus.WithError(err).Warnf("Login failed for %s", account.Username)
+		return nil
+	}
+
+	ShortSleep()
+
+	if err := SaveCookies(scraper.Scraper, account, cookieDir); err != nil {
+		logrus.WithError(err).Errorf("Failed to save cookies for %s", account.Username)
+	}
+
+	logrus.Debugf("Login successful for %s", account.Username)
+	return scraper
+}
 
+func (scraper *Scraper) Login(username, password string, twoFACode ...string) error {
 	var err error
-	switch len(credentials) {
-	case 2:
-		// Basic login with username and password.
-		err = scraper.Login(credentials[0], credentials[1])
-	case 3:
-		// The third parameter is used for either email confirmation or a 2FA code.
-		// This design assumes the Twitter scraper's Login method can contextually handle both cases.
-		err = scraper.Login(credentials[0], credentials[1], credentials[2])
-	default:
-		// Return an error if the number of provided credentials is neither 2 nor 3.
-		return fmt.Errorf("invalid number of login credentials provided")
+	if len(twoFACode) > 0 {
+		err = scraper.Scraper.Login(username, password, twoFACode[0])
+	} else {
+		err = scraper.Scraper.Login(username, password)
 	}
 	if err != nil {
-		return fmt.Errorf("%v", err)
+		return fmt.Errorf("login failed: %v", err)
 	}
 	return nil
 }
 
-func IsLoggedIn(scraper *twitterscraper.Scraper) bool {
-	return scraper.IsLoggedIn()
-}
-
-func Logout(scraper *twitterscraper.Scraper) error {
-	err := scraper.Logout()
-	if err != nil {
-		return fmt.Errorf("[-] Logout failed: %v", err)
+func (scraper *Scraper) Logout() error {
-	if err := scraper.Scraper.Logout(); err != nil {
+		return fmt.Errorf("logout failed: %v", err)
 	}
 	return nil
 }
85 changes: 85 additions & 0 deletions pkg/scrapers/twitter/common.go
@@ -0,0 +1,85 @@
package twitter

import (
	"fmt"
	"os"
	"strings"
	"sync"

	"github.com/joho/godotenv"
	"github.com/masa-finance/masa-oracle/pkg/config"
	"github.com/sirupsen/logrus"
)

var (
	accountManager *TwitterAccountManager
	once           sync.Once
)

func initializeAccountManager() {
	accounts := loadAccountsFromConfig()
	accountManager = NewTwitterAccountManager(accounts)
}

func loadAccountsFromConfig() []*TwitterAccount {
	err := godotenv.Load()
	if err != nil {
		logrus.Fatalf("error loading .env file: %v", err)
	}

	accountsEnv := os.Getenv("TWITTER_ACCOUNTS")
	if accountsEnv == "" {
		logrus.Fatal("TWITTER_ACCOUNTS not set in .env file")
	}

	return parseAccounts(strings.Split(accountsEnv, ","))
}

func parseAccounts(accountPairs []string) []*TwitterAccount {
	return filterMap(accountPairs, func(pair string) (*TwitterAccount, bool) {
		credentials := strings.Split(pair, ":")
		if len(credentials) != 2 {
			logrus.Warnf("invalid account credentials: %s", pair)
			return nil, false
		}
		return &TwitterAccount{
			Username: strings.TrimSpace(credentials[0]),
			Password: strings.TrimSpace(credentials[1]),
		}, true
	})
}

func getAuthenticatedScraper() (*Scraper, *TwitterAccount, error) {
	once.Do(initializeAccountManager)
	baseDir := config.GetInstance().MasaDir

	account := accountManager.GetNextAccount()
	if account == nil {
		return nil, nil, fmt.Errorf("all accounts are rate-limited")
	}
	scraper := NewScraper(account, baseDir)
	if scraper == nil {
		logrus.Errorf("Authentication failed for %s", account.Username)
		return nil, account, fmt.Errorf("Twitter authentication failed for %s", account.Username)
	}
	return scraper, account, nil
}

func handleRateLimit(err error, account *TwitterAccount) bool {
	if strings.Contains(err.Error(), "Rate limit exceeded") {
		accountManager.MarkAccountRateLimited(account)
		logrus.Warnf("rate limited: %s", account.Username)
		return true
	}
	return false
}

func filterMap[T any, R any](slice []T, f func(T) (R, bool)) []R {
	result := make([]R, 0, len(slice))
	for _, v := range slice {
		if r, ok := f(v); ok {
			result = append(result, r)
		}
	}
	return result
}
35 changes: 35 additions & 0 deletions pkg/scrapers/twitter/config.go
@@ -0,0 +1,35 @@
package twitter

import (
	"fmt"
	"time"

	"github.com/sirupsen/logrus"
)

const (
	ShortSleepDuration = 20 * time.Millisecond
	RateLimitDuration  = time.Hour
	MaxRetries         = 3
)

func ShortSleep() {
	time.Sleep(ShortSleepDuration)
}

func GetRateLimitDuration() time.Duration {
	return RateLimitDuration
}

func Retry[T any](operation func() (T, error), maxAttempts int) (T, error) {
	var zero T
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		result, err := operation()
		if err == nil {
			return result, nil
		}
		logrus.Errorf("retry attempt %d failed: %v", attempt, err)
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return zero, fmt.Errorf("operation failed after %d attempts", maxAttempts)
}
27 changes: 11 additions & 16 deletions pkg/scrapers/twitter/cookies.go
@@ -5,37 +5,32 @@ import (
 	"fmt"
 	"net/http"
 	"os"
+	"path/filepath"
 
 	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
 )
 
-func SaveCookies(scraper *twitterscraper.Scraper, filePath string) error {
+func SaveCookies(scraper *twitterscraper.Scraper, account *TwitterAccount, baseDir string) error {
+	cookieFile := filepath.Join(baseDir, fmt.Sprintf("%s_twitter_cookies.json", account.Username))
 	cookies := scraper.GetCookies()
-	js, err := json.Marshal(cookies)
+	data, err := json.Marshal(cookies)
 	if err != nil {
 		return fmt.Errorf("error marshaling cookies: %v", err)
 	}
-	err = os.WriteFile(filePath, js, 0644)
-	if err != nil {
-		return fmt.Errorf("error saving cookies to file: %v", err)
-	}
-
-	// Load the saved cookies back into the scraper
-	if err := LoadCookies(scraper, filePath); err != nil {
-		return fmt.Errorf("error loading saved cookies: %v", err)
+	if err = os.WriteFile(cookieFile, data, 0644); err != nil {
+		return fmt.Errorf("error saving cookies: %v", err)
 	}
-
 	return nil
 }
 
-func LoadCookies(scraper *twitterscraper.Scraper, filePath string) error {
-	js, err := os.ReadFile(filePath)
+func LoadCookies(scraper *twitterscraper.Scraper, account *TwitterAccount, baseDir string) error {
+	cookieFile := filepath.Join(baseDir, fmt.Sprintf("%s_twitter_cookies.json", account.Username))
+	data, err := os.ReadFile(cookieFile)
 	if err != nil {
-		return fmt.Errorf("error reading cookies from file: %v", err)
+		return fmt.Errorf("error reading cookies: %v", err)
 	}
 	var cookies []*http.Cookie
-	err = json.Unmarshal(js, &cookies)
-	if err != nil {
+	if err = json.Unmarshal(data, &cookies); err != nil {
 		return fmt.Errorf("error unmarshaling cookies: %v", err)
 	}
 	scraper.SetCookies(cookies)
40 changes: 15 additions & 25 deletions pkg/scrapers/twitter/followers.go
@@ -1,38 +1,28 @@
 package twitter
 
 import (
-	"encoding/json"
 	"fmt"
 
-	_ "github.com/lib/pq"
 	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
 	"github.com/sirupsen/logrus"
 )
 
 // ScrapeFollowersForProfile scrapes the profile and tweets of a specific Twitter user.
 // It takes the username as a parameter and returns the scraped profile information and an error if any.
 func ScrapeFollowersForProfile(username string, count int) ([]twitterscraper.Legacy, error) {
-	scraper := auth()
+	return Retry(func() ([]twitterscraper.Legacy, error) {
+		scraper, account, err := getAuthenticatedScraper()
+		if err != nil {
+			return nil, err
+		}
-
-	if scraper == nil {
-		return nil, fmt.Errorf("there was an error authenticating with your Twitter credentials")
-	}
+		followingResponse, errString, _ := scraper.FetchFollowers(username, count, "")
+		if errString != "" {
+			if handleRateLimit(fmt.Errorf(errString), account) {
+				return nil, fmt.Errorf("rate limited")
+			}
+			logrus.Errorf("Error fetching followers: %v", errString)
+			return nil, fmt.Errorf("%v", errString)
+		}
-
-	followingResponse, errString, _ := scraper.FetchFollowers(username, count, "")
-	if errString != "" {
-		logrus.Printf("Error fetching profile: %v", errString)
-		return nil, fmt.Errorf("%v", errString)
-	}
-
-	// Marshal the followingResponse into a JSON string for logging
-	responseJSON, err := json.Marshal(followingResponse)
-	if err != nil {
-		// Log the error if the marshaling fails
-		logrus.Errorf("[-] Error marshaling followingResponse: %v", err)
-	} else {
-		// Log the JSON string of followingResponse
-		logrus.Debugf("Following response: %s", responseJSON)
-	}
-
-	return followingResponse, nil
+		return followingResponse, nil
+	}, MaxRetries)
 }
23 changes: 23 additions & 0 deletions pkg/scrapers/twitter/profile.go
@@ -0,0 +1,23 @@
package twitter

import (
	twitterscraper "github.com/masa-finance/masa-twitter-scraper"
)

func ScrapeTweetsProfile(username string) (twitterscraper.Profile, error) {
	return Retry(func() (twitterscraper.Profile, error) {
		scraper, account, err := getAuthenticatedScraper()
		if err != nil {
			return twitterscraper.Profile{}, err
		}

		profile, err := scraper.GetProfile(username)
		if err != nil {
			if handleRateLimit(err, account) {
				return twitterscraper.Profile{}, err
			}
			return twitterscraper.Profile{}, err
		}
		return profile, nil
	}, MaxRetries)
}
