test(scrapers): Add new Twitter scraper test suite #573
base: main
- Delete TwitterSentimentHandler and TwitterTrendsHandler structs
- Remove corresponding HandleWork functions for sentiment and trends
- Update WorkerType constants and related functions to exclude sentiment and trends
- Adjust WorkHandlerManager initialization to remove sentiment and trends handlers
…objx dependency
- **Exported Auth Function:**
  - Renamed `auth` to `Auth` in `tweets.go` to make it publicly accessible.
  - Updated all scraper files (e.g., `followers.go`, `tweets.go`) to use the exported `Auth` function.
- **Removed Unused Dependency:**
  - Eliminated `github.com/stretchr/objx` from `go.mod` as it was no longer needed.
- **Optimized Sleep Durations:**
  - Reduced sleep durations in the `Auth` function from `500ms` to `100ms` for better performance.
- **Cleaned Up Codebase:**
  - Removed obsolete sentiment analysis code from `tweets.go` to streamline the codebase.
- **Enhanced Test Configuration:**
  - Fixed environment variable loading in `twitter_auth_test.go` by ensuring `.env` is correctly loaded via `scrapers_suite_test.go`.
  - Added and updated tests in `twitter_auth_test.go` and `scrapers_suite_test.go` to validate Twitter authentication and session reuse.
This commit improves the Twitter authentication and scraping tests in the pkg/tests/scrapers/twitter_auth_test.go file. The changes include:
- Add godotenv package to load environment variables
- Implement a loadEnv function to handle .env file loading
- Enhance "authenticates and logs in successfully" test:
  - Verify cookie file doesn't exist before authentication
  - Check cookie file creation after authentication
  - Perform a simple profile scrape to validate the session
- Improve "reuses session from cookies" test:
  - Verify cookie file creation
  - Force cookie reuse by clearing the first scraper
  - Validate the reused session with a profile scrape
- Add new test "scrapes the profile of 'god' and recent #Bitcoin tweets using saved cookies":
  - Authenticate twice to ensure cookie reuse
  - Scrape the profile of user 'god'
  - Fetch and verify the last 3 tweets containing #Bitcoin
  - Log scraped data for manual inspection

These changes provide more robust testing of the Twitter authentication process, session reuse, and scraping functionality, ensuring better coverage and reliability of the Twitter-related features.
@mudler how do we configure CI to use environment variables to run tests? Also, I am having issues loading .env for some reason. Maybe you can take a look and help?
Other than this, the tests are working great!
This commit introduces significant improvements to the Twitter scraping functionality:
1. Account Management:
   - Add TwitterAccount struct to represent individual Twitter accounts
   - Implement TwitterAccountManager for managing multiple accounts
   - Create functions for account rotation and rate limit tracking
2. Authentication:
   - Refactor Auth function to use account rotation
   - Implement cookie-based session management for each account
   - Add retry logic for authentication failures
3. Scraping Functions:
   - Update ScrapeTweetsByQuery and ScrapeTweetsProfile to use account rotation
   - Implement rate limit detection and account switching
   - Add retry mechanisms for failed operations
4. Configuration:
   - Move from hardcoded credentials to .env file-based configuration
   - Implement loadAccountsFromConfig to read multiple accounts from .env
5. Error Handling:
   - Improve error logging and handling throughout the package
   - Add specific handling for rate limit errors
6. Performance:
   - Implement concurrent scraping with multiple accounts
   - Add delays between requests to avoid aggressive rate limiting

These changes significantly enhance the robustness and efficiency of the Twitter scraping functionality, allowing for better handling of rate limits and improved reliability through account rotation.
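The account rotation and rate limit tracking described above might look something like the sketch below. The type names `TwitterAccount` and `TwitterAccountManager` come from the commit message; the fields, the `Next` and `MarkRateLimited` methods, and the round-robin policy are assumptions made for illustration and may not match the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// TwitterAccount represents one account; the fields are assumed.
type TwitterAccount struct {
	Username         string
	RateLimitedUntil time.Time
}

// TwitterAccountManager hands out accounts in round-robin order.
type TwitterAccountManager struct {
	mu       sync.Mutex
	accounts []*TwitterAccount
	next     int
}

// ErrAllRateLimited is returned when no account is currently usable.
var ErrAllRateLimited = errors.New("all accounts are rate limited")

func NewTwitterAccountManager(accounts []*TwitterAccount) *TwitterAccountManager {
	return &TwitterAccountManager{accounts: accounts}
}

// Next returns the next account that is not rate limited at time now,
// visiting each account at most once per call.
func (m *TwitterAccountManager) Next(now time.Time) (*TwitterAccount, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i := 0; i < len(m.accounts); i++ {
		acc := m.accounts[m.next]
		m.next = (m.next + 1) % len(m.accounts)
		if !now.Before(acc.RateLimitedUntil) {
			return acc, nil
		}
	}
	return nil, ErrAllRateLimited
}

// MarkRateLimited records that acc must not be used again before until.
func (m *TwitterAccountManager) MarkRateLimited(acc *TwitterAccount, until time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	acc.RateLimitedUntil = until
}

func main() {
	m := NewTwitterAccountManager([]*TwitterAccount{
		{Username: "account-a"},
		{Username: "account-b"},
	})
	now := time.Now()

	acc, err := m.Next(now)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("scraping with", acc.Username)

	// Pretend the scrape hit a rate limit, then switch accounts.
	m.MarkRateLimited(acc, now.Add(15*time.Minute))
	if acc, err = m.Next(now); err == nil {
		fmt.Println("switched to", acc.Username)
	}
}
```

Passing `now` in explicitly, rather than calling `time.Now()` inside the manager, keeps the rotation logic testable without sleeping.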
…ling" This reverts commit b9d936f.
That just shows how tightly the code is tied to the configuration, and to viper in general. When building the first integration tests, I tried to move away from depending on a file and to instrument everything from code instead. Ideally, the code exercised in the integration tests shouldn't depend on a file on the system because, as you can see, things get hairy really quickly with tests. My reasoning in this case is usually:
Why: it forces a clear separation of domains in the code, which tends to produce code that is more "pluggable" and testable. As we are not yet building e2e test suites, I'd recommend going ahead and refactoring the code, removing config singletons all the way down to the methods, until you figure out which lines call the singleton and then load the config/env files. This is basically what I did when untying the Masa node from the config instance: if you look closely, after my refactoring the Masa node no longer calls the config instance, which in turn loads the .env file and expects all the configuration. Now there are node options that are instrumented entirely via code: https://github.com/masa-finance/masa-oracle/blob/main/node/options.go#L12 and, for instance, the configuration is read and translated to node options here: masa-oracle/cmd/masa-node/config.go Line 10 in 7e74904
It's still misplaced: that function shouldn't be there, but closer to the data needed to generate the options (in this case AppConfig). That's still WIP, as I didn't want to move too many things in one PR.
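To illustrate the idea of instrumenting a component through code instead of a config singleton, here is a minimal functional-options sketch. It is not the code in node/options.go: `ScraperOptions`, `WithCredentials`, `WithCookieDir`, and `OptionsFromEnv` are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"os"
)

// ScraperOptions holds everything the scraper needs. Nothing in this type
// reads files or environment variables. (Hypothetical type.)
type ScraperOptions struct {
	Username  string
	Password  string
	CookieDir string
}

// Option mutates ScraperOptions; callers compose as many as they need.
type Option func(*ScraperOptions)

func WithCredentials(username, password string) Option {
	return func(o *ScraperOptions) {
		o.Username = username
		o.Password = password
	}
}

func WithCookieDir(dir string) Option {
	return func(o *ScraperOptions) { o.CookieDir = dir }
}

// NewScraperOptions applies the given options on top of defaults.
func NewScraperOptions(opts ...Option) *ScraperOptions {
	o := &ScraperOptions{CookieDir: "."}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

// OptionsFromEnv is the single place that translates the environment into
// options, so the rest of the code never touches configuration directly.
func OptionsFromEnv() []Option {
	return []Option{
		WithCredentials(os.Getenv("TWITTER_USERNAME"), os.Getenv("TWITTER_PASSWORD")),
	}
}

func main() {
	// Production wiring: configuration is translated once, at the edge.
	prod := NewScraperOptions(OptionsFromEnv()...)

	// Test wiring: no .env file or singleton involved.
	test := NewScraperOptions(
		WithCredentials("test-user", "test-pass"),
		WithCookieDir(os.TempDir()),
	)

	fmt.Println("prod cookie dir:", prod.CookieDir)
	fmt.Println("test user:", test.Username)
}
```

With this shape, an integration test builds its options inline, and only the binary's entry point needs to know where the configuration comes from.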
@mudler I need to rewrite this test now. I will probably do this tomorrow.
These are going to be taken care of as part of #573 Signed-off-by: mudler <[email protected]>
* fix(contracts): load config from embedded

There was still some code reading from the filesystem instead of the embedded files in the binary. Regression introduced in #523.

Fixes: #578
See also: #579

Signed-off-by: mudler <[email protected]>

* chore(tests): temporarly disable Twitter tests

These are going to be taken care of as part of #573

Signed-off-by: mudler <[email protected]>

---------

Signed-off-by: mudler <[email protected]>
commit d654c56
Author: smeb y <[email protected]>
Date: Mon Oct 21 15:55:05 2024 +0800

fix(Dockerfile): add ca-certificate (#604)

Update Dockerfile

Signed-off-by: smeb y <[email protected]>

commit c1e624a
Author: Ettore Di Giacinto <[email protected]>
Date: Thu Oct 17 15:23:03 2024 +0200

chore(docs): drop unnecessary duplicated content

Signed-off-by: Ettore Di Giacinto <[email protected]>

commit 9529e4c
Author: Ettore Di Giacinto <[email protected]>
Date: Thu Oct 17 15:21:07 2024 +0200

chore(docs): update .env.example (#603)

Signed-off-by: Ettore Di Giacinto <[email protected]>

commit 038fad6
Author: Ettore Di Giacinto <[email protected]>
Date: Thu Oct 17 14:55:25 2024 +0200

fix(contracts): load config from embedded (#602)

* fix(contracts): load config from embedded

There was still some code reading from the filesystem instead of the embedded files in the binary. Regression introduced in #523.

Fixes: #578
See also: #579

Signed-off-by: mudler <[email protected]>

* chore(tests): temporarly disable Twitter tests

These are going to be taken care of as part of #573

Signed-off-by: mudler <[email protected]>

---------

Signed-off-by: mudler <[email protected]>

commit a8a77a6
Author: Brendan Playford <[email protected]>
Date: Tue Oct 15 14:32:29 2024 -0700

feat(twitter): Implement random sleep and improve login process (#601)

- Add RandomSleep function to introduce variability in request timing
- Update NewScraper to use RandomSleep before and after login attempts
- Adjust sleep duration range to 500ms - 2s for more natural behavior
- Improve error handling and logging in the login process

commit 01ec8c4
Author: Brendan Playford <[email protected]>
Date: Tue Oct 15 08:59:08 2024 -0700

chore(version): update protocol version and update twitter_cookies.example.json

commit 6f594e1
Author: Brendan Playford <[email protected]>
Date: Tue Oct 15 08:48:01 2024 -0700

feat(twitter): Enhanced Twitter Worker Selection Algorithm (#591)

* Add detailed error logging and track worker update time

Enhanced the worker manager to append specific error messages to a list for better debugging. Additionally, updated node data to track the last update time, improving data consistency and traceability.

* Update version.go

* refactor(twitter): remove retry functionality from scraper

- Remove Retry function and MaxRetries constant from config.go
- Update ScrapeFollowersForProfile, ScrapeTweetsProfile, and ScrapeTweetsByQuery to remove Retry wrapper
- Adjust error handling in each function to directly return errors
- Simplify code structure and reduce complexity
- Maintain rate limit handling functionality

* chore(workers): update max workers to 50

* chore(workers): upate to 25

* feat(pubsub): improve node sorting algorithm for Twitter reliability

- Prioritize nodes with more recent last returned tweets
- Maintain high importance for total returned tweet count
- Consider time since last timeout to allow recovery from temporary issues
- Deprioritize nodes with recent "not found" occurrences
- Remove NotFoundCount from sorting criteria

This change aims to better balance node performance and recent activity, while allowing nodes to recover quickly from temporary issues like rate limiting.

* feat(workers): improve Twitter worker selection algorithm

- Modify GetEligibleWorkers to use a specialized selection for Twitter workers
- Introduce controlled randomness in Twitter worker selection
- Balance between prioritizing high-performing Twitter workers and fair distribution
- Maintain existing behavior for non-Twitter worker selection
- Preserve handling of local worker and respect original worker limit

This change enhances the worker selection algorithm for Twitter tasks to provide a better balance between utilizing top-performing nodes and ensuring fair work distribution. It introduces a dynamic pool size calculation and controlled randomness for Twitter workers, while maintaining the existing round-robin approach for other worker types.

---------

Co-authored-by: Bob Stevens <[email protected]>

commit f09fb20
Author: Brendan Playford <[email protected]>
Date: Tue Oct 8 15:38:45 2024 -0700

Feat(workers) implement adaptive worker selection for improved task distribution (#589)

* feat(worker-selection): Implement performance-based worker sorting

- Add performance metrics fields to NodeData struct
- Implement NodeSorter for flexible sorting of worker nodes
- Create SortNodesByTwitterReliability function for Twitter workers
- Update GetEligibleWorkerNodes to use category-specific sorting
- Modify GetEligibleWorkers to use sorted workers and add worker limit

This commit enhances the worker selection process by prioritizing workers based on their performance metrics. It introduces a flexible sorting mechanism that can be easily extended to other worker categories in the future. The changes improve reliability and efficiency in task allocation across the Masa Oracle network.

* feat(worker-selection): Implement priority-based selection for Twitter work

- Update DistributeWork to use priority selection for Twitter category
- Maintain round-robin selection for other work categories by shuffling workers
- Integrate new GetEligibleWorkers function with work type-specific behavior
- Respect MaxRemoteWorkers limit for all work types
- Add distinct logging for Twitter and non-Twitter worker selection

This commit enhances the work distribution process by implementing priority-based worker selection for Twitter-related tasks while preserving the existing round-robin behavior for other work types. It leverages the newly added performance metrics to choose the most reliable workers for Twitter tasks, and ensures consistent behavior for other categories by shuffling the worker list. This hybrid approach improves efficiency for Twitter tasks while maintaining the expected behavior for all other work types.

* Update .gitignore

* feat(worker-selection): Implement priority-based sorting for Twitter workers

- Add LastNotFoundTime and NotFoundCount fields to NodeData struct
- Enhance SortNodesByTwitterReliability function with multi-criteria sorting:
  1. Prioritize nodes found more often (lower NotFoundCount)
  2. Consider recency of last not-found occurrence
  3. Sort by higher number of returned tweets
  4. Consider recency of last returned tweet
  5. Prioritize nodes with fewer timeouts
  6. Consider recency of last timeout
  7. Use PeerId for stable sorting when no performance data is available
- Remove random shuffling from GetEligibleWorkers function

This commit improves worker selection for Twitter tasks by implementing a more sophisticated sorting algorithm that takes into account node reliability and performance metrics. It aims to enhance the efficiency and reliability of task distribution in the Masa Oracle network.

* feat(worker-selection): Update Twitter fields in NodeData and Worker Manager

Add functions to update Twitter-related metrics in NodeData and integrate updates into Worker Manager processes. This ensures accurate tracking of tweet-related events and peer activity in the system.

* feat(worker-selection): Add unit tests for NodeData and NodeDataTracker

Introduce unit tests for the NodeData and NodeDataTracker functionalities, covering scenarios involving updates to Twitter-related fields. These tests ensure the correctness of the UpdateTwitterFields method in NodeData and the UpdateNodeDataTwitter method in NodeDataTracker.

* chore(workers): update timeouts and bump version

---------

Co-authored-by: Bob Stevens <[email protected]>

commit 0ef0df4
Author: Brendan Playford <[email protected]>
Date: Tue Oct 8 14:37:01 2024 -0700

feat(api): Add configurable API server enablement (#586)

* feat(api): Add configurable API server enablement

This commit introduces a new feature that allows the API server to be conditionally enabled or disabled based on configuration. The changes include:

1. In cmd/masa-node/main.go:
   - Refactored signal handling into a separate function `handleSignals`
   - Added conditional logic to start the API server only if enabled
   - Improved logging to indicate API server status
2. In pkg/config/app.go:
   - Added `APIEnabled` field to the `AppConfig` struct
   - Set default value for `APIEnabled` to false in `setDefaultConfig`
   - Added command-line flag for `apiEnabled` in `setCommandLineConfig`
3. In pkg/config/constants.go:
   - Added `APIEnabled` constant for environment variable configuration

These changes provide more flexibility in node configuration, allowing users to run the node with or without the API server. This can be useful for security purposes or in scenarios where the API is not needed. The API can now be enabled via:
- Environment variable: API_ENABLED=true
- Command-line flag: --apiEnabled
- Configuration file: apiEnabled: true

By default, the API server will be disabled for enhanced security.

* chore(config): update to take api-enabled=true and update Makefile with run-api case

* Update Makefile
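The multi-criteria ordering described for SortNodesByTwitterReliability in the commit log above can be illustrated with a reduced comparator. This sketch covers only a subset of the listed criteria (returned tweet count, recency of the last returned tweet, timeout count, and a peer ID tie-break). The struct fields and the function name `SortByTwitterReliability` are assumptions and do not reproduce the actual NodeData definition.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// NodeData carries the metrics used for ordering; field names are assumed.
type NodeData struct {
	PeerID         string
	ReturnedTweets int
	LastReturned   time.Time
	TimeoutCount   int
}

// SortByTwitterReliability orders nodes so the most reliable come first.
// Each criterion is consulted only when all earlier ones tie.
func SortByTwitterReliability(nodes []NodeData) {
	sort.SliceStable(nodes, func(i, j int) bool {
		a, b := nodes[i], nodes[j]
		if a.ReturnedTweets != b.ReturnedTweets {
			return a.ReturnedTweets > b.ReturnedTweets // more tweets first
		}
		if !a.LastReturned.Equal(b.LastReturned) {
			return a.LastReturned.After(b.LastReturned) // more recent first
		}
		if a.TimeoutCount != b.TimeoutCount {
			return a.TimeoutCount < b.TimeoutCount // fewer timeouts first
		}
		return a.PeerID < b.PeerID // deterministic tie-break
	})
}

func main() {
	now := time.Now()
	nodes := []NodeData{
		{PeerID: "peer-c", ReturnedTweets: 10, LastReturned: now.Add(-time.Hour)},
		{PeerID: "peer-a", ReturnedTweets: 10, LastReturned: now.Add(-time.Hour), TimeoutCount: 2},
		{PeerID: "peer-b", ReturnedTweets: 50, LastReturned: now.Add(-time.Hour), TimeoutCount: 5},
	}
	SortByTwitterReliability(nodes)
	for _, n := range nodes {
		fmt.Println(n.PeerID, n.ReturnedTweets, n.TimeoutCount)
	}
}
```

Chaining comparisons this way makes the priority of each criterion explicit, and the final peer ID comparison keeps the order stable when no performance data distinguishes two nodes.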
Description
This pull request introduces a new test suite for the Twitter scraper functionality. The suite is designed to ensure that our Twitter scraping operations are reliable, secure, and function as expected under various conditions. The tests cover authentication, session reuse, and specific scraping capabilities.
Changes
- Added `scrapers_suite_test.go`: sets up the test suite for all scraper-related tests, ensuring that the required environment variables are loaded and the testing environment is correctly configured.
- Added `twitter_scraper_test.go`: contains tests specifically for the Twitter scraper, focusing on authentication and scraping functionality.
Tests Included
- Authenticates and Logs In Successfully
- Reuses Session from Cookies
- Scrapes Profile and Recent Tweets Using Saved Cookies
  - Scrapes the profile of user `'god'` and checks the retrieved data.
  - Fetches recent tweets containing the `#Bitcoin` hashtag.

Notes
- Ensure that `TWITTER_USERNAME`, `TWITTER_PASSWORD`, and optionally `TWITTER_2FA_CODE` are set in a `.env` file at the project root.

How to Test
- Set up the required environment variables in your `.env` file.
- From the project root, run the tests using the Ginkgo testing framework.
- Review the output to ensure all tests pass successfully.
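Before running the suite, the environment variables named in the Notes can be checked up front so that a missing credential produces a clear message instead of a failed login. The sketch below is a hypothetical pre-flight check and not part of the PR; `missingVars` and `requiredVars` are illustrative names.

```go
package main

import (
	"fmt"
	"os"
)

// requiredVars lists the variables the PR description marks as mandatory.
var requiredVars = []string{"TWITTER_USERNAME", "TWITTER_PASSWORD"}

// missingVars returns the required variables that are unset or empty.
// The lookup function is injected so the check can be exercised without
// modifying the real process environment.
func missingVars(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range requiredVars {
		if value, ok := lookup(name); !ok || value == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingVars(os.LookupEnv); len(missing) > 0 {
		fmt.Println("skipping Twitter tests, missing variables:", missing)
		return
	}
	// TWITTER_2FA_CODE is optional according to the PR description.
	if _, ok := os.LookupEnv("TWITTER_2FA_CODE"); !ok {
		fmt.Println("TWITTER_2FA_CODE not set; continuing without 2FA")
	}
	fmt.Println("environment looks complete")
}
```

The same check could run in CI, where the variables would typically come from the CI system's secret store instead of a `.env` file.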
https://www.loom.com/share/da177d585ab941edbfbf6fe5236878fe