Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: worker selection and worker package refactor #482

Closed
wants to merge 9 commits into from

Conversation

teslashibe
Copy link
Contributor

Description

This PR fixes #

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

- Separate out localWorker to send local work requests
- Separate out remoteWorker to enable selection and handling of remote work requests
- Lay foundation for remote worker selection - even though queue is limited all remoteWorkers still get the same request which is inefficient
- Log number of responses in the queue
- Make logs verbose for easy debugging of localWork and remoteWork
…cessing

- Add round-robin iterator for fair worker distribution
- Prioritize local worker when eligible
- Cycle through remote workers in subsequent calls
- Improve error handling and logging for worker responses
- Enhance code readability and maintainability

This update ensures a balanced workload across all available workers
over time, while still prioritizing local processing when possible.
- Add maxSpawnAttempts constant to limit remote worker spawn attempts
- Implement retry mechanism for spawning remote workers with exponential backoff
- Introduce spawnRemoteWorker function for better organization and error handling
- Enhance logging for better visibility into worker spawning and processing
- Improve handling of dead letters and timeouts in remote worker operations
- Refactor handleRemoteWorker to be more robust against transient failures
- Update tryWorker function to handle both local and remote worker scenarios
- Implement round-robin worker selection with retries in SendWork function

These changes aim to increase the reliability of the worker system,
particularly when dealing with remote workers, and provide better
insights into error scenarios for easier debugging and monitoring.
- Created new file worker_selection.go to house worker selection logic
- Moved GetEligibleWorkers, isEligibleRemoteWorker, and RoundRobinIterator to worker_selection.go
- Updated send_work.go to use new exported functions from worker_selection.go
- Renamed newRoundRobinIterator to NewRoundRobinIterator for proper exporting
- Updated imports and function calls in send_work.go to reflect new structure
- Improved code organization and modularity
@teslashibe teslashibe added the DO NOT MERGE Something will break if you do label Aug 6, 2024
Copy link

github-actions bot commented Aug 6, 2024

PR description is too short and seems to not fulfill PR template, please fill in

@teslashibe teslashibe changed the title Teslashibe/workers refactor feat: worker selection and worker package refactor Aug 6, 2024
Copy link

github-actions bot commented Aug 6, 2024

PR description is too short and seems to not fulfill PR template, please fill in

Copy link

github-actions bot commented Aug 6, 2024

PR description is too short and seems to not fulfill PR template, please fill in

- Refactored `tryWorkersRoundRobin` to `tryWorkersConcurrently` to call `tryWorker` for each worker asynchronously.
- Simplified response collection using a single `responseCollector` channel.
- Updated `tryWorker` to directly process workers and send responses to `responseCollector`.

This change improves the concurrency model by allowing all workers to process simultaneously and collecting the first successful response.
Copy link

github-actions bot commented Aug 7, 2024

PR description is too short and seems to not fulfill PR template, please fill in

Refactored worker selection to use NodeTracker's new GetEligibleWorkerNodes method and introduced an eligibility check for staked workers. Added a new utility function for converting strings to WorkerTypes and a method in NodeEventTracker to retrieve eligible nodes based on work categories.
Copy link

github-actions bot commented Aug 7, 2024

PR description is too short and seems to not fulfill PR template, please fill in

@teslashibe
Copy link
Contributor Author

@restevens402

Hey
@Bob Stevens
I just saw your commit - thanks for diving in. The issue I have with this code as I see it as a product reversion, it takes us back to where we started which is having the work sent out to all workers and its a first to respond that gets to the API. This still means that the behavior is that all workers get the request which does not increase our network throughput if each worker has the same rate limit - as is the case with Twitter. Its also inefficient because every worker has an instance spawned and does the work even if they are not first to respond which is duplicating work on the network.
In testing tonight I did see the issue you mentioned but went a more different experimental route here: https://github.com/masa-finance/masa-oracle/tree/teslashibe/worker-remote-worker-selection-with-local-fallback
This selects a number of random workers from the list defined in the config file and then sends requests once to each worker if the worker before it doesn't respond in a given time (this could be concurrent or latency based selection in the future ) - I set a low timeout that is tunable in the config - it cycles through the map of remote workers until one responds or all fail then fails over to the local worker.
Even this example above doesn't create linear scaling and I need to test it more on a network with an open 4001. But this is the way we should be thinking with our direction. Concurrency across all workers is not what we should be doing anymore.

@5u6r054 5u6r054 changed the base branch from test to main August 7, 2024 03:43
Copy link

github-actions bot commented Aug 7, 2024

PR description is too short and seems to not fulfill PR template, please fill in

@teslashibe teslashibe changed the base branch from main to test August 9, 2024 13:24
@teslashibe teslashibe closed this Aug 9, 2024
Copy link

github-actions bot commented Aug 9, 2024

PR description is too short and seems to not fulfill PR template, please fill in

@teslashibe teslashibe deleted the teslashibe/workers-refactor branch August 22, 2024 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DO NOT MERGE Something will break if you do
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants