feat: remote worker selection with local worker fallback & configuration and fine tuning #489
Conversation
- Separate out localWorker to send local work requests
- Separate out remoteWorker to enable selection and handling of remote work requests
- Lay foundation for remote worker selection - even though the queue is limited, all remoteWorkers still get the same request, which is inefficient
- Log the number of responses in the queue
- Make logs verbose for easy debugging of localWork and remoteWork
…cessing
- Add round-robin iterator for fair worker distribution
- Prioritize local worker when eligible
- Cycle through remote workers in subsequent calls
- Improve error handling and logging for worker responses
- Enhance code readability and maintainability

This update ensures a balanced workload across all available workers over time, while still prioritizing local processing when possible.
- Add maxSpawnAttempts constant to limit remote worker spawn attempts
- Implement retry mechanism for spawning remote workers with exponential backoff
- Introduce spawnRemoteWorker function for better organization and error handling
- Enhance logging for better visibility into worker spawning and processing
- Improve handling of dead letters and timeouts in remote worker operations
- Refactor handleRemoteWorker to be more robust against transient failures
- Update tryWorker function to handle both local and remote worker scenarios
- Implement round-robin worker selection with retries in SendWork function

These changes aim to increase the reliability of the worker system, particularly when dealing with remote workers, and provide better insights into error scenarios for easier debugging and monitoring.
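A minimal sketch of the spawn-retry-with-backoff behavior described in this commit, assuming the spawn logic is passed in as a callback. The maxSpawnAttempts name comes from the commit message; the 100ms starting delay and doubling factor are illustrative assumptions, not values from the PR:

```go
package workers

import (
	"context"
	"fmt"
	"time"
)

const maxSpawnAttempts = 3 // constant named in the commit; value matches the later config default

// spawnWithRetry retries a spawn function with exponential backoff between
// attempts, giving up after maxSpawnAttempts failures.
func spawnWithRetry(ctx context.Context, spawn func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= maxSpawnAttempts; attempt++ {
		if lastErr = spawn(ctx); lastErr == nil {
			return nil
		}
		if attempt == maxSpawnAttempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff before the next attempt
		}
	}
	return fmt.Errorf("spawning remote worker failed after %d attempts: %w", maxSpawnAttempts, lastErr)
}
```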
- Created new file worker_selection.go to house worker selection logic
- Moved GetEligibleWorkers, isEligibleRemoteWorker, and RoundRobinIterator to worker_selection.go
- Updated send_work.go to use new exported functions from worker_selection.go
- Renamed newRoundRobinIterator to NewRoundRobinIterator for proper exporting
- Updated imports and function calls in send_work.go to reflect new structure
- Improved code organization and modularity
…r limit
- Add MaxRemoteWorkers to WorkerConfig in config.go
- Update tryWorkersRoundRobin function in send_work.go to respect the MaxRemoteWorkers limit
- Implement fallback mechanism to the local worker when remote workers fail or are unavailable
- Add detailed logging throughout the worker selection process for better debugging
- Ensure a last-resort local worker is always available if no other workers are found
@teslashibe can you take a look at this today? I didn't pull in your nodeData updates that you put in here (https://github.com/masa-finance/masa-oracle/pull/482/files#diff-6eb837343e9a3348891e7ffe57d5f47407edf4ee00d8d5f74018ca6640f9c58d). Should they go into this PR?
Refactored worker selection to use NodeTracker's new GetEligibleWorkerNodes method and introduced an eligibility check for staked workers. Added a new utility function for converting strings to WorkerTypes and a method in NodeEventTracker to retrieve eligible nodes based on work categories.
From a quick scan it looks ok - however I'd like to give it another look tomorrow with a fresh mind
Shuffle eligible nodes before selecting workers. Added context-based connection attempts to ensure we can connect to selected worker before returning. Adjusted AppConfig initialization order and added version flag for CLI configuration.
- Check and log errors from NewMultiaddr and AddrInfoFromP2pAddr
- Resolve "value of err is never used" linter warning
- Change timeouts on workers to 20s and 15s respectively
…th-local-fallback
- Add ConnectionTimeout to WorkerConfig in config.go
- Update GetEligibleWorkers to use the configurable timeout
…llback' of https://github.com/masa-finance/masa-oracle into teslashibe/worker-remote-worker-selection-with-local-fallback
Updated sleep durations to one minute for efficiency and commented out peer removal logic to retain node activity tracking. Also removed the callback for peer removal in DHT initialization.
This is a big update. It tests well. The targeted work distribution is a good improvement.
All comments covered and resolved. I think we are good to go.
I closed this PR: the worker selection code is fundamentally broken and refactoring is not an option. This should not go into a patch release because it will never work in the way it is intended: #504
Description
Final implementation

- Uses nodeData to filter peers that can do the work
- Checks that port 4001 is open, which is a requirement
- If we cannot connect to a worker in <1000ms, or if there is another error, we fail over to the local worker in <20s

In the GetEligibleWorkers function, the connection to eligible workers is checked using the following process. Here's a breakdown of the connection check process:

For each multiaddress of an eligible worker:
1. Convert the multiaddress to peer info.
2. Attempt to connect to the peer with node.Host.Connect() and a 1-second timeout.

If the connection succeeds:
- Add the worker to the workers slice.
- Set workerFound to true and break the loop.

This process ensures that only workers with successful connections are added to the list of eligible workers. The 1-second timeout prevents the function from hanging too long on unresponsive nodes.
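The check described above corresponds roughly to a loop like this sketch, which assumes libp2p's standard multiaddr and peer packages; the helper name and signature are illustrative, not the PR's exact code:

```go
package workers

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// isWorkerReachable reports whether any of a worker's multiaddresses accepts
// a connection within the given timeout (1 second in this PR's description).
func isWorkerReachable(h host.Host, addrs []string, timeout time.Duration) bool {
	for _, addr := range addrs {
		maddr, err := ma.NewMultiaddr(addr)
		if err != nil {
			continue // skip malformed addresses; the PR logs these errors
		}
		info, err := peer.AddrInfoFromP2pAddr(maddr)
		if err != nil {
			continue // skip addresses without a /p2p peer component
		}

		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err = h.Connect(ctx, *info)
		cancel()
		if err == nil {
			return true // caller adds this worker to the workers slice and sets workerFound
		}
	}
	return false
}
```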
Overview
This PR introduces a new worker selection strategy for our distributed task processing system. The main goal is to increase the overall throughput of our network by distributing work more effectively across all available nodes, while respecting individual node processing limits and accounting for node latency.
Key Changes
Worker Selection Algorithm
- GetEligibleWorkers function that collects both local and remote workers.

Configuration Updates
- WorkerConfig struct with the following parameters:
  - WorkerTimeout: 500ms
  - WorkerResponseTimeout: 250ms
  - MaxRetries: 1
  - MaxSpawnAttempts: 3
  - WorkerBufferSize: 100
  - MaxRemoteWorkers: 1

Remote Worker Handling
- A configurable limit (MaxRemoteWorkers) on the number of remote workers to attempt - this is currently synchronous, and each worker is given a set time to respond as defined in the config file. This means developers can tune their nodes to their own use cases (a rough sketch of this flow follows below).

Local Worker Fallback
Timeout and Retry Mechanism
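Taken together, the remote worker limit, the local worker fallback, and the per-worker timeout described above can be pictured with a sketch like the one below; the types and helper functions are placeholders for the real tryWorker/SendWork code, not the PR's implementation:

```go
package workers

import (
	"context"
	"fmt"
	"time"
)

// Worker, WorkRequest, and WorkResponse are simplified placeholder types for
// this sketch; the PR uses its own worker and request structures.
type Worker struct {
	ID      string
	IsLocal bool
}

type WorkRequest struct{ Payload []byte }
type WorkResponse struct{ Data []byte }

// sendFunc stands in for the real per-worker send logic.
type sendFunc func(ctx context.Context, w Worker, req WorkRequest) (WorkResponse, error)

// sendWithFallback tries up to maxRemoteWorkers remote workers in order,
// giving each responseTimeout to answer, and falls back to the local worker
// if none of them succeed.
func sendWithFallback(workers []Worker, local Worker, req WorkRequest,
	maxRemoteWorkers int, responseTimeout time.Duration, send sendFunc) (WorkResponse, error) {

	attempts := 0
	for _, w := range workers {
		if w.IsLocal {
			continue // remote workers are tried first; local is the fallback below
		}
		if attempts >= maxRemoteWorkers {
			break // respect the configured cap on remote attempts
		}
		attempts++

		ctx, cancel := context.WithTimeout(context.Background(), responseTimeout)
		resp, err := send(ctx, w, req)
		cancel()
		if err == nil {
			return resp, nil
		}
		// On timeout or any other error, move on to the next remote worker.
	}

	// Last resort: the local worker handles the request.
	ctx, cancel := context.WithTimeout(context.Background(), responseTimeout)
	defer cancel()
	resp, err := send(ctx, local, req)
	if err != nil {
		return WorkResponse{}, fmt.Errorf("no worker (remote or local) could process the request: %w", err)
	}
	return resp, nil
}
```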
Implementation Details
Worker Selection
This function now collects all eligible workers, including both local and remote, and shuffles the list for better load distribution.
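A condensed sketch of that collect-and-shuffle step, using a simplified stand-in for the real worker records (the actual GetEligibleWorkers also applies the eligibility and connection checks described earlier):

```go
package workers

import "math/rand"

// EligibleWorker is a simplified stand-in for the worker records returned by
// the real GetEligibleWorkers.
type EligibleWorker struct {
	PeerID  string
	IsLocal bool
}

// collectAndShuffle gathers the local worker (when eligible) together with the
// eligible remote workers and shuffles the combined list so repeated calls
// spread load across the network.
func collectAndShuffle(local *EligibleWorker, remotes []EligibleWorker) []EligibleWorker {
	workers := make([]EligibleWorker, 0, len(remotes)+1)
	if local != nil {
		workers = append(workers, *local)
	}
	workers = append(workers, remotes...)

	// Shuffle for better load distribution across calls.
	rand.Shuffle(len(workers), func(i, j int) {
		workers[i], workers[j] = workers[j], workers[i]
	})
	return workers
}
```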
Round-Robin Iterator
The round-robin iterator ensures a fair distribution of work among available workers.
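The exported NewRoundRobinIterator and RoundRobinIterator moved into worker_selection.go earlier in this PR presumably look roughly like this minimal version, reusing the EligibleWorker placeholder from the previous sketch; the real fields and methods may differ:

```go
package workers

// RoundRobinIterator cycles through a fixed slice of workers so that
// successive calls hand work to each worker in turn. Minimal sketch only.
type RoundRobinIterator struct {
	workers []EligibleWorker
	next    int
}

// NewRoundRobinIterator returns an iterator over the given workers.
func NewRoundRobinIterator(workers []EligibleWorker) *RoundRobinIterator {
	return &RoundRobinIterator{workers: workers}
}

// HasNext reports whether there is any worker to hand out.
func (it *RoundRobinIterator) HasNext() bool {
	return len(it.workers) > 0
}

// Next returns the next worker, wrapping around at the end of the slice.
func (it *RoundRobinIterator) Next() EligibleWorker {
	w := it.workers[it.next]
	it.next = (it.next + 1) % len(it.workers)
	return w
}
```

Callers would pair this with MaxRetries and MaxRemoteWorkers from the config to bound how many workers are actually attempted per request.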
Worker Configuration
This new configuration structure allows for easy tuning of worker behavior.
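Based on the parameter list under Configuration Updates, WorkerConfig and its defaults presumably look roughly like this; the field types are assumed, and the 1s ConnectionTimeout default is inferred from the 1-second connection check described above rather than stated explicitly in the PR:

```go
package config

import "time"

// WorkerConfig groups the tunables introduced in this PR. Values mirror the
// defaults listed in the description; exact field types are assumed.
type WorkerConfig struct {
	WorkerTimeout         time.Duration // overall time allowed per worker
	WorkerResponseTimeout time.Duration // time to wait for a worker's response
	ConnectionTimeout     time.Duration // time allowed to connect to a remote worker
	MaxRetries            int
	MaxSpawnAttempts      int
	WorkerBufferSize      int
	MaxRemoteWorkers      int
}

// DefaultWorkerConfig reflects the defaults listed in this PR's description.
var DefaultWorkerConfig = WorkerConfig{
	WorkerTimeout:         500 * time.Millisecond,
	WorkerResponseTimeout: 250 * time.Millisecond,
	ConnectionTimeout:     time.Second, // assumption: matches the 1s connection check above
	MaxRetries:            1,
	MaxSpawnAttempts:      3,
	WorkerBufferSize:      100,
	MaxRemoteWorkers:      1,
}
```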
Rationale
The new implementation aims to increase network throughput by:
Performance Implications
Future Improvements
- Increase MaxRemoteWorkers to further distribute work across the network concurrently between a limited set of peers, i.e. 3.

Notes for Reviewers
Signed commits