
test(scrapers): Add new Twitter scraper test suite #573

Open · wants to merge 10 commits into base: main

Conversation

@teslashibe (Contributor) commented Oct 1, 2024

Description

This pull request introduces a new test suite for the Twitter scraper functionality. The suite is designed to ensure that our Twitter scraping operations are reliable, secure, and function as expected under various conditions. The tests cover authentication, session reuse, and specific scraping capabilities.

Changes

  • Added scrapers_suite_test.go
    Sets up the test suite for all scraper-related tests, ensuring proper environment variables are loaded and the testing environment is correctly configured.

  • Added twitter_scraper_test.go
    Contains tests specifically for the Twitter scraper, focusing on authentication and scraping functionalities.

Tests Included

  1. Authenticates and Logs In Successfully

    • Purpose: Verifies that the scraper can authenticate with Twitter using valid credentials.
    • What it tests:
      • Ensures that the cookie file does not exist before authentication.
      • Authenticates with Twitter and checks that a new cookie file is created.
      • Checks that the scraper is in a logged-in state.
      • Performs a simple profile scrape to confirm the session is valid.
  2. Reuses Session from Cookies

    • Purpose: Confirms that the scraper can reuse an existing session from stored cookies.
    • What it tests:
      • Performs an initial authentication to generate a cookie file.
      • Clears the scraper instance to simulate a fresh start.
      • Authenticates again, expecting it to use the existing cookies.
      • Verifies that the scraper is still in a logged-in state.
      • Confirms the session by performing a profile scrape.
  3. Scrapes Profile and Recent Tweets Using Saved Cookies

    • Purpose: Tests the scraper's ability to perform complex scraping tasks using an authenticated session.
    • What it tests:
      • Uses the saved cookies to authenticate without re-entering credentials.
      • Scrapes the Twitter profile of 'god' and checks the retrieved data.
      • Scrapes recent tweets containing the #Bitcoin hashtag.
      • Verifies that the correct number of tweets are retrieved and outputs their content.

Notes

  • Environment Variables: The tests rely on environment variables for Twitter credentials. Ensure that TWITTER_USERNAME, TWITTER_PASSWORD, and optionally TWITTER_2FA_CODE are set in a .env file at the project root.
  • Temporary Directories: Tests create temporary directories to isolate cookie files and other data, preventing interference with existing sessions.
  • Cleanup: After each test, the temporary directories and any generated files are cleaned up to maintain a clean testing environment.

How to Test

  1. Set up the required environment variables in your .env file.

  2. From the project root, run the tests using the Ginkgo testing framework:

    ginkgo -v ./pkg/tests/scrapers
  3. Review the output to ensure all tests pass successfully.

https://www.loom.com/share/da177d585ab941edbfbf6fe5236878fe

- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers
…objx dependency

- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.

- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.

- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.

- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.

- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.
This commit improves the Twitter authentication and scraping tests in the
pkg/tests/scrapers/twitter_auth_test.go file. The changes include:

- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication
process, session reuse, and scraping functionality, ensuring better
coverage and reliability of the Twitter-related features.
@teslashibe teslashibe self-assigned this Oct 1, 2024
@teslashibe teslashibe added the enhancement New feature or request label Oct 1, 2024
@teslashibe (Contributor, Author) commented Oct 1, 2024

@mudler how do we configure CI to use environment variables to run tests?

Also I am having issues loading .env for some reason. Maybe you can take a look and help?

brendanplayford@Brendans-MacBook-Pro masa-oracle % ginkgo -v ./pkg/tests/scrapers
time="2024-10-01T14:21:23-07:00" level=warning msg="Error loading .env file" error="open .env: no such file or directory"
time="2024-10-01T14:21:23-07:00" level=warning msg="USER_AGENTS environment variable is not set. Using default user agent."
time="2024-10-01T14:21:23-07:00" level=info msg="Loaded .env from /Users/brendanplayford/masa/masa-oracle/.env"
Running Suite: Scrapers Suite - /Users/brendanplayford/masa/masa-oracle/pkg/tests/scrapers
==========================================================================================
Random Seed: 1727817683

Other than this, tests are working great!

@mudler mudler linked an issue Oct 2, 2024 that may be closed by this pull request
This commit introduces significant improvements to the Twitter scraping functionality:

1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking

2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures

3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations

4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env

5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors

6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
@mudler (Contributor) commented Oct 3, 2024

> @mudler how do we configure CI to use environment variables to run tests?
>
> Also I am having issues loading .env for some reason. Maybe you can take a look and help?

That just shows how tightly the code is tied to the configuration and to Viper generally. When building the first integration tests, what I did was try to move away from depending on a file and instrument things from code instead.

Ideally the code exercised in the integration tests shouldn't depend on a file on the system, because as you can see, things get hairy really quickly in tests.

My reasoning in this case is usually:

  • e2e tests should prepare everything necessary (e.g. .env files, configuration on the system, etc.), as close as possible to how you would run the software via the CLI
  • Integration tests and unit tests shouldn't depend on system settings such as configuration files - if the code does, it should be refactored to NOT depend on specific files, and to load configuration files only from specific entrypoints (e.g. in packages dedicated to configuration, which then pass the configuration down the line to the components under test)

Why: it forces a clear separation of domains in the code, which tends to help in writing code that is more "pluggable" and testable.

As we are not yet building e2e test suites, I'd recommend going ahead and refactoring the code to avoid config singletons all the way down in the methods, until you figure out which lines are calling the singleton and loading the config/env files.

This is basically what I did when untying the Masa node from the config instance. If you look closely, after my refactoring the Masa node no longer calls the config instance (which in turn loads the .env file and expects all the configuration). Now there are node options instrumented entirely via code: https://github.com/masa-finance/masa-oracle/blob/main/node/options.go#L12 and, for instance, the configuration is read and translated to node options here:

func initOptions(cfg *config.AppConfig) ([]node.Option, *workers.WorkHandlerManager, *pubsub.PublicKeySubscriptionHandler) {

It's still misplaced - that function shouldn't be there, but closer to the data that generates the options (in this case AppConfig) - though that's still WIP, as I didn't want to move too many things in one PR.
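The options pattern described here can be sketched as functional options over a plain config struct; tests then build configuration entirely in code, with no file on disk. Names below are illustrative, not the actual options in node/options.go:

```go
package main

import "fmt"

// Config holds what a component under test needs; tests construct it
// in code instead of reading a .env file.
type Config struct {
	TwitterUsername string
	CookieDir       string
	APIEnabled      bool
}

// Option mutates a Config, mirroring the functional-options style
// the masa node moved to.
type Option func(*Config)

func WithTwitterUsername(u string) Option { return func(c *Config) { c.TwitterUsername = u } }
func WithCookieDir(d string) Option       { return func(c *Config) { c.CookieDir = d } }
func WithAPIEnabled(on bool) Option       { return func(c *Config) { c.APIEnabled = on } }

// NewConfig applies options over defaults; nothing here touches the
// filesystem, so integration tests can instrument it entirely in code.
func NewConfig(opts ...Option) *Config {
	cfg := &Config{CookieDir: "/tmp", APIEnabled: false}
	for _, o := range opts {
		o(cfg)
	}
	return cfg
}

func main() {
	cfg := NewConfig(WithTwitterUsername("test-user"), WithAPIEnabled(true))
	fmt.Printf("%+v\n", *cfg)
}
```

A dedicated config package can still translate a .env file into these options at the CLI entrypoint; the components under test only ever see the `Config` value.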

@teslashibe (Contributor, Author) commented:

@mudler I need to re-write this test now. I will probably do this tomorrow.

mudler added a commit that referenced this pull request Oct 17, 2024
These are going to be taken care of as part of
#573

Signed-off-by: mudler <[email protected]>
mudler added a commit that referenced this pull request Oct 17, 2024
* fix(contracts): load config from embedded

There was still some code reading from the filesystem instead of the
embedded files in the binary.
Regression introduced in #523.

Fixes: #578

See also: #579

Signed-off-by: mudler <[email protected]>

* chore(tests): temporarly disable Twitter tests

These are going to be taken care of as part of
#573

Signed-off-by: mudler <[email protected]>

---------

Signed-off-by: mudler <[email protected]>
restevens402 added a commit that referenced this pull request Oct 22, 2024
commit d654c56
Author: smeb y <[email protected]>
Date:   Mon Oct 21 15:55:05 2024 +0800

    fix(Dockerfile): add ca-certificate (#604)

    Update Dockerfile

    Signed-off-by: smeb y <[email protected]>

commit c1e624a
Author: Ettore Di Giacinto <[email protected]>
Date:   Thu Oct 17 15:23:03 2024 +0200

    chore(docs): drop unnecessary duplicated content

    Signed-off-by: Ettore Di Giacinto <[email protected]>

commit 9529e4c
Author: Ettore Di Giacinto <[email protected]>
Date:   Thu Oct 17 15:21:07 2024 +0200

    chore(docs): update .env.example (#603)

    Signed-off-by: Ettore Di Giacinto <[email protected]>

commit 038fad6
Author: Ettore Di Giacinto <[email protected]>
Date:   Thu Oct 17 14:55:25 2024 +0200

    fix(contracts): load config from embedded (#602)

    * fix(contracts): load config from embedded

    There was still some code reading from the filesystem instead of the
    embedded files in the binary.
    Regression introduced in #523.

    Fixes: #578

    See also: #579

    Signed-off-by: mudler <[email protected]>

    * chore(tests): temporarly disable Twitter tests

    These are going to be taken care of as part of
    #573

    Signed-off-by: mudler <[email protected]>

    ---------

    Signed-off-by: mudler <[email protected]>

commit a8a77a6
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 15 14:32:29 2024 -0700

    feat(twitter): Implement random sleep and improve login process (#601)

    - Add RandomSleep function to introduce variability in request timing
    - Update NewScraper to use RandomSleep before and after login attempts
    - Adjust sleep duration range to 500ms - 2s for more natural behavior
    - Improve error handling and logging in the login process

commit 01ec8c4
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 15 08:59:08 2024 -0700

    chore(version): update protocol version and update twitter_cookies.example.json

commit 6f594e1
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 15 08:48:01 2024 -0700

    feat(twitter): Enhanced Twitter Worker Selection Algorithm (#591)

    * Add detailed error logging and track worker update time

    Enhanced the worker manager to append specific error messages to a list for better debugging. Additionally, updated node data to track the last update time, improving data consistency and traceability.

    * Update version.go

    * refactor(twitter): remove retry functionality from scraper

    - Remove Retry function and MaxRetries constant from config.go
    - Update ScrapeFollowersForProfile, ScrapeTweetsProfile, and ScrapeTweetsByQuery
      to remove Retry wrapper
    - Adjust error handling in each function to directly return errors
    - Simplify code structure and reduce complexity
    - Maintain rate limit handling functionality

    * chore(workers): update max workers to 50

    * chore(workers): upate to 25

    * feat(pubsub): improve node sorting algorithm for Twitter reliability

    - Prioritize nodes with more recent last returned tweets
    - Maintain high importance for total returned tweet count
    - Consider time since last timeout to allow recovery from temporary issues
    - Deprioritize nodes with recent "not found" occurrences
    - Remove NotFoundCount from sorting criteria

    This change aims to better balance node performance and recent activity,
    while allowing nodes to recover quickly from temporary issues like rate limiting.

    * feat(workers): improve Twitter worker selection algorithm

    - Modify GetEligibleWorkers to use a specialized selection for Twitter workers
    - Introduce controlled randomness in Twitter worker selection
    - Balance between prioritizing high-performing Twitter workers and fair distribution
    - Maintain existing behavior for non-Twitter worker selection
    - Preserve handling of local worker and respect original worker limit

    This change enhances the worker selection algorithm for Twitter tasks to provide
    a better balance between utilizing top-performing nodes and ensuring fair work
    distribution. It introduces a dynamic pool size calculation and controlled
    randomness for Twitter workers, while maintaining the existing round-robin
    approach for other worker types.

    ---------

    Co-authored-by: Bob Stevens <[email protected]>

commit f09fb20
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 8 15:38:45 2024 -0700

    Feat(workers) implement adaptive worker selection for improved task distribution (#589)

    * feat(worker-selection): Implement performance-based worker sorting

    - Add performance metrics fields to NodeData struct
    - Implement NodeSorter for flexible sorting of worker nodes
    - Create SortNodesByTwitterReliability function for Twitter workers
    - Update GetEligibleWorkerNodes to use category-specific sorting
    - Modify GetEligibleWorkers to use sorted workers and add worker limit

    This commit enhances the worker selection process by prioritizing workers
    based on their performance metrics. It introduces a flexible sorting
    mechanism that can be easily extended to other worker categories in the
    future. The changes improve reliability and efficiency in task allocation
    across the Masa Oracle network.

    * feat(worker-selection): Implement priority-based selection for Twitter work

    - Update DistributeWork to use priority selection for Twitter category
    - Maintain round-robin selection for other work categories by shuffling workers
    - Integrate new GetEligibleWorkers function with work type-specific behavior
    - Respect MaxRemoteWorkers limit for all work types
    - Add distinct logging for Twitter and non-Twitter worker selection

    This commit enhances the work distribution process by implementing
    priority-based worker selection for Twitter-related tasks while
    preserving the existing round-robin behavior for other work types.
    It leverages the newly added performance metrics to choose the most
    reliable workers for Twitter tasks, and ensures consistent behavior
    for other categories by shuffling the worker list. This hybrid approach
    improves efficiency for Twitter tasks while maintaining the expected
    behavior for all other work types.

    * Update .gitignore

    * feat(worker-selection): Implement priority-based sorting for Twitter workers

    - Add LastNotFoundTime and NotFoundCount fields to NodeData struct
    - Enhance SortNodesByTwitterReliability function with multi-criteria sorting:
      1. Prioritize nodes found more often (lower NotFoundCount)
      2. Consider recency of last not-found occurrence
      3. Sort by higher number of returned tweets
      4. Consider recency of last returned tweet
      5. Prioritize nodes with fewer timeouts
      6. Consider recency of last timeout
      7. Use PeerId for stable sorting when no performance data is available
    - Remove random shuffling from GetEligibleWorkers function

    This commit improves worker selection for Twitter tasks by implementing
    a more sophisticated sorting algorithm that takes into account node
    reliability and performance metrics. It aims to enhance the efficiency
    and reliability of task distribution in the Masa Oracle network.

    * feat(worker-selection): Update Twitter fields in NodeData and Worker Manager

    Add functions to update Twitter-related metrics in NodeData and integrate updates into Worker Manager processes. This ensures accurate tracking of tweet-related events and peer activity in the system.

    * feat(worker-selection): Add unit tests for NodeData and NodeDataTracker

    Introduce unit tests for the NodeData and NodeDataTracker functionalities, covering scenarios involving updates to Twitter-related fields. These tests ensure the correctness of the UpdateTwitterFields method in NodeData and the UpdateNodeDataTwitter method in NodeDataTracker.

    * chore(workers): update timeouts and bump version

    ---------

    Co-authored-by: Bob Stevens <[email protected]>
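The tiered sorting criteria listed in this commit can be sketched with `sort.SliceStable` over a reduced, illustrative subset of the NodeData metrics:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// NodeData carries a reduced subset of the per-worker metrics named
// in the commit (illustrative, not the full struct).
type NodeData struct {
	PeerId         string
	NotFoundCount  int
	ReturnedTweets int
	LastTimeout    time.Time
}

// SortNodesByTwitterReliability orders workers by tiered criteria:
// fewer not-founds first, then more returned tweets, then older
// (less recent) timeouts, with PeerId as a stable tiebreaker.
func SortNodesByTwitterReliability(nodes []NodeData) {
	sort.SliceStable(nodes, func(i, j int) bool {
		a, b := nodes[i], nodes[j]
		if a.NotFoundCount != b.NotFoundCount {
			return a.NotFoundCount < b.NotFoundCount
		}
		if a.ReturnedTweets != b.ReturnedTweets {
			return a.ReturnedTweets > b.ReturnedTweets
		}
		if !a.LastTimeout.Equal(b.LastTimeout) {
			return a.LastTimeout.Before(b.LastTimeout)
		}
		return a.PeerId < b.PeerId
	})
}

func main() {
	nodes := []NodeData{
		{PeerId: "B", NotFoundCount: 1, ReturnedTweets: 90},
		{PeerId: "A", NotFoundCount: 0, ReturnedTweets: 50},
		{PeerId: "C", NotFoundCount: 0, ReturnedTweets: 80},
	}
	SortNodesByTwitterReliability(nodes)
	for _, n := range nodes {
		fmt.Println(n.PeerId)
	}
}
```

Each earlier criterion dominates the later ones, so a node that is reliably found always outranks one with not-found occurrences, regardless of tweet counts.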

commit 0ef0df4
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 8 14:37:01 2024 -0700

    feat(api): Add configurable API server enablement (#586)

    * feat(api): Add configurable API server enablement

    This commit introduces a new feature that allows the API server to be
    conditionally enabled or disabled based on configuration. The changes
    include:

    1. In cmd/masa-node/main.go:
       - Refactored signal handling into a separate function `handleSignals`
       - Added conditional logic to start the API server only if enabled
       - Improved logging to indicate API server status

    2. In pkg/config/app.go:
       - Added `APIEnabled` field to the `AppConfig` struct
       - Set default value for `APIEnabled` to false in `setDefaultConfig`
       - Added command-line flag for `apiEnabled` in `setCommandLineConfig`

    3. In pkg/config/constants.go:
       - Added `APIEnabled` constant for environment variable configuration

    These changes provide more flexibility in node configuration, allowing
    users to run the node with or without the API server. This can be useful
    for security purposes or in scenarios where the API is not needed.

    The API can now be enabled via:
    - Environment variable: API_ENABLED=true
    - Command-line flag: --apiEnabled
    - Configuration file: apiEnabled: true

    By default, the API server will be disabled for enhanced security.

    * chore(config): update to take api-enabled=true and update Makefile with run-api case

    * Update Makefile
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues.

tests: Twitter package: auth, save cookies, scrape
2 participants