
test(scrapers): Add new Twitter scraper test suite #573

Open · wants to merge 10 commits into base: main

Conversation

@teslashibe (Contributor) commented Oct 1, 2024

Description

This pull request introduces a new test suite for the Twitter scraper functionality. The suite is designed to ensure that our Twitter scraping operations are reliable, secure, and function as expected under various conditions. The tests cover authentication, session reuse, and specific scraping capabilities.

Changes

  • Added scrapers_suite_test.go
    Sets up the test suite for all scraper-related tests, ensuring proper environment variables are loaded and the testing environment is correctly configured.

  • Added twitter_scraper_test.go
    Contains tests specifically for the Twitter scraper, focusing on authentication and scraping functionalities.

Tests Included

  1. Authenticates and Logs In Successfully

    • Purpose: Verifies that the scraper can authenticate with Twitter using valid credentials.
    • What it tests:
      • Ensures that the cookie file does not exist before authentication.
      • Authenticates with Twitter and checks that a new cookie file is created.
      • Checks that the scraper is in a logged-in state.
      • Performs a simple profile scrape to confirm the session is valid.
  2. Reuses Session from Cookies

    • Purpose: Confirms that the scraper can reuse an existing session from stored cookies.
    • What it tests:
      • Performs an initial authentication to generate a cookie file.
      • Clears the scraper instance to simulate a fresh start.
      • Authenticates again, expecting it to use the existing cookies.
      • Verifies that the scraper is still in a logged-in state.
      • Confirms the session by performing a profile scrape.
  3. Scrapes Profile and Recent Tweets Using Saved Cookies

    • Purpose: Tests the scraper's ability to perform complex scraping tasks using an authenticated session.
    • What it tests:
      • Uses the saved cookies to authenticate without re-entering credentials.
      • Scrapes the Twitter profile of 'god' and checks the retrieved data.
      • Scrapes recent tweets containing the #Bitcoin hashtag.
      • Verifies that the correct number of tweets are retrieved and outputs their content.

Notes

  • Environment Variables: The tests rely on environment variables for Twitter credentials. Ensure that TWITTER_USERNAME, TWITTER_PASSWORD, and optionally TWITTER_2FA_CODE are set in a .env file at the project root.
  • Temporary Directories: Tests create temporary directories to isolate cookie files and other data, preventing interference with existing sessions.
  • Cleanup: After each test, the temporary directories and any generated files are cleaned up to maintain a clean testing environment.

How to Test

  1. Set up the required environment variables in your .env file.

  2. From the project root, run the tests using the Ginkgo testing framework:

    ginkgo -v ./pkg/tests/scrapers
  3. Review the output to ensure all tests pass successfully.

https://www.loom.com/share/da177d585ab941edbfbf6fe5236878fe

- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers
…objx dependency

- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.

- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.

- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.

- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.

- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.
This commit improves the Twitter authentication and scraping tests in the
pkg/tests/scrapers/twitter_auth_test.go file. The changes include:

- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication
process, session reuse, and scraping functionality, ensuring better
coverage and reliability of the Twitter-related features.
@teslashibe teslashibe self-assigned this Oct 1, 2024
@teslashibe teslashibe added the enhancement New feature or request label Oct 1, 2024
@teslashibe (Contributor, Author) commented Oct 1, 2024

@mudler how do we configure CI to use environment variables to run tests?

Also I am having issues loading .env for some reason. Maybe you can take a look and help?

brendanplayford@Brendans-MacBook-Pro masa-oracle % ginkgo -v ./pkg/tests/scrapers
time="2024-10-01T14:21:23-07:00" level=warning msg="Error loading .env file" error="open .env: no such file or directory"
time="2024-10-01T14:21:23-07:00" level=warning msg="USER_AGENTS environment variable is not set. Using default user agent."
time="2024-10-01T14:21:23-07:00" level=info msg="Loaded .env from /Users/brendanplayford/masa/masa-oracle/.env"
Running Suite: Scrapers Suite - /Users/brendanplayford/masa/masa-oracle/pkg/tests/scrapers
==========================================================================================
Random Seed: 1727817683

Other than this, tests are working great!

@mudler mudler linked an issue Oct 2, 2024 that may be closed by this pull request
This commit introduces significant improvements to the Twitter scraping functionality:

1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking

2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures

3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations

4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env

5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors

6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
@mudler (Contributor) commented Oct 3, 2024

> @mudler how do we configure CI to use environment variables to run tests?
>
> Also I am having issues loading .env for some reason. Maybe you can take a look and help?

That just shows how tightly the code is tied to the configuration and to Viper generally. When building the first integration tests, what I did was try to move away from depending on a file and instrument things from code instead.

Ideally the code exercised in the integration tests shouldn't depend on a file on the system, because as you can see, things get hairy really quickly in tests.

My reasoning in this case is usually:

  • e2e tests should prepare everything necessary (e.g. .env files, configuration on the system, etc.), as close as possible to how you would run the software via the CLI
  • Integration tests and unit tests shouldn't depend on system settings such as configuration files - if the code does, it should be refactored to NOT depend on specific files, and to load configuration files only from specific entrypoints (e.g. in packages dedicated to configuration, which then pass the configuration down the line to the components under test)

Why: it forces a clear separation of domains in the code, which tends to help in writing code that is more "pluggable" and testable.

As we are not yet building e2e test suites, I'd recommend going ahead and refactoring the code to avoid config singletons all the way down in the methods, until you figure out which lines are calling the singleton and loading the config/env files.

This is basically what I did when untying the Masa node from the config instance. If you look closely, after my refactoring the Masa node no longer calls the config instance (which in turn loads the .env file and expects all the configuration). Now there are node options instrumented entirely via code: https://github.com/masa-finance/masa-oracle/blob/main/node/options.go#L12 and, for instance, the configuration is read and translated to node options here:

func initOptions(cfg *config.AppConfig) ([]node.Option, *workers.WorkHandlerManager, *pubsub.PublicKeySubscriptionHandler) {

It's still misplaced - that function shouldn't be there, but closer to the data that generates the options (in this case AppConfig) - though that's still WIP, as I didn't want to move too many things in one PR.
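The options pattern described here can be sketched as functional options over a plain config struct; tests then build configuration entirely in code, with no file on disk. Names below are illustrative, not the actual options in node/options.go:

```go
package main

import "fmt"

// Config holds what a component under test needs; tests construct it
// in code instead of reading a .env file.
type Config struct {
	TwitterUsername string
	CookieDir       string
	APIEnabled      bool
}

// Option mutates a Config, mirroring the functional-options style
// the masa node moved to.
type Option func(*Config)

func WithTwitterUsername(u string) Option { return func(c *Config) { c.TwitterUsername = u } }
func WithCookieDir(d string) Option       { return func(c *Config) { c.CookieDir = d } }
func WithAPIEnabled(on bool) Option       { return func(c *Config) { c.APIEnabled = on } }

// NewConfig applies options over defaults; nothing here touches the
// filesystem, so integration tests can instrument it entirely in code.
func NewConfig(opts ...Option) *Config {
	cfg := &Config{CookieDir: "/tmp", APIEnabled: false}
	for _, o := range opts {
		o(cfg)
	}
	return cfg
}

func main() {
	cfg := NewConfig(WithTwitterUsername("test-user"), WithAPIEnabled(true))
	fmt.Printf("%+v\n", *cfg)
}
```

A dedicated config package can still translate a .env file into these options at the CLI entrypoint; the components under test only ever see the `Config` value.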

@teslashibe (Contributor, Author) commented:

@mudler I need to re-write this test now. I will probably do this tomorrow.

mudler added a commit that referenced this pull request Oct 17, 2024
These are going to be taken care of as part of
#573

Signed-off-by: mudler <[email protected]>
mudler added a commit that referenced this pull request Oct 17, 2024
* fix(contracts): load config from embedded

There was still some code reading from the filesystem instead of the
embedded files in the binary.
Regression introduced in #523.

Fixes: #578

See also: #579

Signed-off-by: mudler <[email protected]>

* chore(tests): temporarly disable Twitter tests

These are going to be taken care of as part of
#573

Signed-off-by: mudler <[email protected]>

---------

Signed-off-by: mudler <[email protected]>
restevens402 added a commit that referenced this pull request Oct 22, 2024
commit d654c56
Author: smeb y <[email protected]>
Date:   Mon Oct 21 15:55:05 2024 +0800

    fix(Dockerfile): add ca-certificate (#604)

    Update Dockerfile

    Signed-off-by: smeb y <[email protected]>

commit c1e624a
Author: Ettore Di Giacinto <[email protected]>
Date:   Thu Oct 17 15:23:03 2024 +0200

    chore(docs): drop unnecessary duplicated content

    Signed-off-by: Ettore Di Giacinto <[email protected]>

commit 9529e4c
Author: Ettore Di Giacinto <[email protected]>
Date:   Thu Oct 17 15:21:07 2024 +0200

    chore(docs): update .env.example (#603)

    Signed-off-by: Ettore Di Giacinto <[email protected]>

commit 038fad6
Author: Ettore Di Giacinto <[email protected]>
Date:   Thu Oct 17 14:55:25 2024 +0200

    fix(contracts): load config from embedded (#602)

    * fix(contracts): load config from embedded

    There was still some code reading from the filesystem instead of the
    embedded files in the binary.
    Regression introduced in #523.

    Fixes: #578

    See also: #579

    Signed-off-by: mudler <[email protected]>

    * chore(tests): temporarly disable Twitter tests

    These are going to be taken care of as part of
    #573

    Signed-off-by: mudler <[email protected]>

    ---------

    Signed-off-by: mudler <[email protected]>

commit a8a77a6
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 15 14:32:29 2024 -0700

    feat(twitter): Implement random sleep and improve login process (#601)

    - Add RandomSleep function to introduce variability in request timing
    - Update NewScraper to use RandomSleep before and after login attempts
    - Adjust sleep duration range to 500ms - 2s for more natural behavior
    - Improve error handling and logging in the login process

commit 01ec8c4
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 15 08:59:08 2024 -0700

    chore(version): update protocol version and update twitter_cookies.example.json

commit 6f594e1
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 15 08:48:01 2024 -0700

    feat(twitter): Enhanced Twitter Worker Selection Algorithm (#591)

    * Add detailed error logging and track worker update time

    Enhanced the worker manager to append specific error messages to a list for better debugging. Additionally, updated node data to track the last update time, improving data consistency and traceability.

    * Update version.go

    * refactor(twitter): remove retry functionality from scraper

    - Remove Retry function and MaxRetries constant from config.go
    - Update ScrapeFollowersForProfile, ScrapeTweetsProfile, and ScrapeTweetsByQuery
      to remove Retry wrapper
    - Adjust error handling in each function to directly return errors
    - Simplify code structure and reduce complexity
    - Maintain rate limit handling functionality

    * chore(workers): update max workers to 50

    * chore(workers): upate to 25

    * feat(pubsub): improve node sorting algorithm for Twitter reliability

    - Prioritize nodes with more recent last returned tweets
    - Maintain high importance for total returned tweet count
    - Consider time since last timeout to allow recovery from temporary issues
    - Deprioritize nodes with recent "not found" occurrences
    - Remove NotFoundCount from sorting criteria

    This change aims to better balance node performance and recent activity,
    while allowing nodes to recover quickly from temporary issues like rate limiting.

    * feat(workers): improve Twitter worker selection algorithm

    - Modify GetEligibleWorkers to use a specialized selection for Twitter workers
    - Introduce controlled randomness in Twitter worker selection
    - Balance between prioritizing high-performing Twitter workers and fair distribution
    - Maintain existing behavior for non-Twitter worker selection
    - Preserve handling of local worker and respect original worker limit

    This change enhances the worker selection algorithm for Twitter tasks to provide
    a better balance between utilizing top-performing nodes and ensuring fair work
    distribution. It introduces a dynamic pool size calculation and controlled
    randomness for Twitter workers, while maintaining the existing round-robin
    approach for other worker types.

    ---------

    Co-authored-by: Bob Stevens <[email protected]>

commit f09fb20
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 8 15:38:45 2024 -0700

    Feat(workers) implement adaptive worker selection for improved task distribution (#589)

    * feat(worker-selection): Implement performance-based worker sorting

    - Add performance metrics fields to NodeData struct
    - Implement NodeSorter for flexible sorting of worker nodes
    - Create SortNodesByTwitterReliability function for Twitter workers
    - Update GetEligibleWorkerNodes to use category-specific sorting
    - Modify GetEligibleWorkers to use sorted workers and add worker limit

    This commit enhances the worker selection process by prioritizing workers
    based on their performance metrics. It introduces a flexible sorting
    mechanism that can be easily extended to other worker categories in the
    future. The changes improve reliability and efficiency in task allocation
    across the Masa Oracle network.

    * feat(worker-selection): Implement priority-based selection for Twitter work

    - Update DistributeWork to use priority selection for Twitter category
    - Maintain round-robin selection for other work categories by shuffling workers
    - Integrate new GetEligibleWorkers function with work type-specific behavior
    - Respect MaxRemoteWorkers limit for all work types
    - Add distinct logging for Twitter and non-Twitter worker selection

    This commit enhances the work distribution process by implementing
    priority-based worker selection for Twitter-related tasks while
    preserving the existing round-robin behavior for other work types.
    It leverages the newly added performance metrics to choose the most
    reliable workers for Twitter tasks, and ensures consistent behavior
    for other categories by shuffling the worker list. This hybrid approach
    improves efficiency for Twitter tasks while maintaining the expected
    behavior for all other work types.

    * Update .gitignore

    * feat(worker-selection): Implement priority-based sorting for Twitter workers

    - Add LastNotFoundTime and NotFoundCount fields to NodeData struct
    - Enhance SortNodesByTwitterReliability function with multi-criteria sorting:
      1. Prioritize nodes found more often (lower NotFoundCount)
      2. Consider recency of last not-found occurrence
      3. Sort by higher number of returned tweets
      4. Consider recency of last returned tweet
      5. Prioritize nodes with fewer timeouts
      6. Consider recency of last timeout
      7. Use PeerId for stable sorting when no performance data is available
    - Remove random shuffling from GetEligibleWorkers function

    This commit improves worker selection for Twitter tasks by implementing
    a more sophisticated sorting algorithm that takes into account node
    reliability and performance metrics. It aims to enhance the efficiency
    and reliability of task distribution in the Masa Oracle network.

    * feat(worker-selection): Update Twitter fields in NodeData and Worker Manager

    Add functions to update Twitter-related metrics in NodeData and integrate updates into Worker Manager processes. This ensures accurate tracking of tweet-related events and peer activity in the system.

    * feat(worker-selection): Add unit tests for NodeData and NodeDataTracker

    Introduce unit tests for the NodeData and NodeDataTracker functionalities, covering scenarios involving updates to Twitter-related fields. These tests ensure the correctness of the UpdateTwitterFields method in NodeData and the UpdateNodeDataTwitter method in NodeDataTracker.

    * chore(workers): update timeouts and bump version

    ---------

    Co-authored-by: Bob Stevens <[email protected]>
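The tiered sorting criteria listed in this commit can be sketched with `sort.SliceStable` over a reduced, illustrative subset of the NodeData metrics:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// NodeData carries a reduced subset of the per-worker metrics named
// in the commit (illustrative, not the full struct).
type NodeData struct {
	PeerId         string
	NotFoundCount  int
	ReturnedTweets int
	LastTimeout    time.Time
}

// SortNodesByTwitterReliability orders workers by tiered criteria:
// fewer not-founds first, then more returned tweets, then older
// (less recent) timeouts, with PeerId as a stable tiebreaker.
func SortNodesByTwitterReliability(nodes []NodeData) {
	sort.SliceStable(nodes, func(i, j int) bool {
		a, b := nodes[i], nodes[j]
		if a.NotFoundCount != b.NotFoundCount {
			return a.NotFoundCount < b.NotFoundCount
		}
		if a.ReturnedTweets != b.ReturnedTweets {
			return a.ReturnedTweets > b.ReturnedTweets
		}
		if !a.LastTimeout.Equal(b.LastTimeout) {
			return a.LastTimeout.Before(b.LastTimeout)
		}
		return a.PeerId < b.PeerId
	})
}

func main() {
	nodes := []NodeData{
		{PeerId: "B", NotFoundCount: 1, ReturnedTweets: 90},
		{PeerId: "A", NotFoundCount: 0, ReturnedTweets: 50},
		{PeerId: "C", NotFoundCount: 0, ReturnedTweets: 80},
	}
	SortNodesByTwitterReliability(nodes)
	for _, n := range nodes {
		fmt.Println(n.PeerId)
	}
}
```

Each earlier criterion dominates the later ones, so a node that is reliably found always outranks one with not-found occurrences, regardless of tweet counts.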

commit 0ef0df4
Author: Brendan Playford <[email protected]>
Date:   Tue Oct 8 14:37:01 2024 -0700

    feat(api): Add configurable API server enablement (#586)

    * feat(api): Add configurable API server enablement

    This commit introduces a new feature that allows the API server to be
    conditionally enabled or disabled based on configuration. The changes
    include:

    1. In cmd/masa-node/main.go:
       - Refactored signal handling into a separate function `handleSignals`
       - Added conditional logic to start the API server only if enabled
       - Improved logging to indicate API server status

    2. In pkg/config/app.go:
       - Added `APIEnabled` field to the `AppConfig` struct
       - Set default value for `APIEnabled` to false in `setDefaultConfig`
       - Added command-line flag for `apiEnabled` in `setCommandLineConfig`

    3. In pkg/config/constants.go:
       - Added `APIEnabled` constant for environment variable configuration

    These changes provide more flexibility in node configuration, allowing
    users to run the node with or without the API server. This can be useful
    for security purposes or in scenarios where the API is not needed.

    The API can now be enabled via:
    - Environment variable: API_ENABLED=true
    - Command-line flag: --apiEnabled
    - Configuration file: apiEnabled: true

    By default, the API server will be disabled for enhanced security.

    * chore(config): update to take api-enabled=true and update Makefile with run-api case

    * Update Makefile
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues.

tests: Twitter package: auth, save cookies, scrape
2 participants