spike: Investigate and Design a Solution for NodeData Volatility #590

Open
teslashibe opened this issue Oct 10, 2024 · 1 comment


teslashibe commented Oct 10, 2024

Problem Statement:

Our current nodeData design suffers from volatility issues in our distributed network environment. Specifically:

  1. Data Inconsistency: Nodes in the network may have conflicting or outdated information about other nodes, leading to inconsistent network state across the system.

  2. Data Loss: When nodes restart or temporarily disconnect, they may lose valuable information about the network state, impacting the overall system reliability.

  3. Lack of Single Source of Truth: There's no authoritative source for node information, making it difficult to resolve conflicts and ensure data accuracy.

  4. Inefficient Data Propagation: The current system lacks an efficient mechanism to propagate node updates across the network, potentially leading to stale data and increased network overhead.

  5. Scalability Concerns: As the network grows, the current design may not efficiently handle hundreds of nodes, potentially causing performance degradation.

  6. Limited Persistence: The current system doesn't have robust persistence mechanisms, making it challenging to recover the network state after system-wide failures.

Objectives:

  1. Research and design a robust data consistency and persistence system for our distributed node network.
  2. Evaluate the feasibility of implementing a central authority node using a multiaddress approach.
  3. Explore efficient mechanisms for local caching, periodic synchronization, and gossip protocols.
  4. Consider thread-safety, efficient data structures, and conflict resolution strategies.
  5. Assess the impact of the proposed changes on the existing codebase and identify integration points.

Acceptance Criteria:

  1. A high-level design document outlining the proposed solution, including:
    • CentralAuthority struct and its responsibilities
    • Updated NodeEventTracker design
    • Data flow and synchronization mechanisms
    • Conflict resolution strategies
    • Persistence and recovery mechanisms
  2. Proof-of-concept code demonstrating key components of the proposed solution
  3. Analysis of potential performance impacts and scalability considerations
  4. Identification of major risks and mitigation strategies
  5. Estimation of effort required for full implementation

Outcome:

A comprehensive understanding of the problem space and a well-defined approach to address the nodeData volatility issues, setting the foundation for a more robust and scalable distributed network system.

==================================

Outcome (Proposed Design):

  1. High-Level Design Document:

a. CentralAuthority struct and its responsibilities:

  • Maintains a slice of NodeData objects as the primary storage
  • Defined using a multiaddress to specify a single, well-known node as the authority
  • Provides methods for adding, updating, retrieving, and removing NodeData
  • Implements thread-safe operations using sync.RWMutex
  • Handles persistence of NodeData to allow recovery after restarts
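
To make the thread-safety bullet concrete, here is a minimal sketch of add/update/retrieve/remove over a slice guarded by `sync.RWMutex`: readers take the shared lock, writers the exclusive one. The trimmed `NodeData` fields and the method names `Upsert`, `Get` and `Remove` are hypothetical, not the project's actual API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// NodeData is a hypothetical, trimmed-down stand-in for the project's type.
type NodeData struct {
	PeerID      string
	Version     uint64
	LastUpdated time.Time
}

// CentralAuthority keeps NodeData in a slice guarded by an RWMutex.
type CentralAuthority struct {
	mu    sync.RWMutex
	nodes []NodeData
}

// indexOf returns the slice position of peerID, or -1. Callers must hold mu.
func (ca *CentralAuthority) indexOf(peerID string) int {
	for i, n := range ca.nodes {
		if n.PeerID == peerID {
			return i
		}
	}
	return -1
}

// Upsert adds nd, or replaces the existing entry with the same PeerID.
func (ca *CentralAuthority) Upsert(nd NodeData) {
	ca.mu.Lock()
	defer ca.mu.Unlock()
	if i := ca.indexOf(nd.PeerID); i >= 0 {
		ca.nodes[i] = nd
		return
	}
	ca.nodes = append(ca.nodes, nd)
}

// Get returns a copy of the entry for peerID (shared lock only).
func (ca *CentralAuthority) Get(peerID string) (NodeData, bool) {
	ca.mu.RLock()
	defer ca.mu.RUnlock()
	if i := ca.indexOf(peerID); i >= 0 {
		return ca.nodes[i], true
	}
	return NodeData{}, false
}

// Remove deletes the entry for peerID and reports whether it existed.
func (ca *CentralAuthority) Remove(peerID string) bool {
	ca.mu.Lock()
	defer ca.mu.Unlock()
	i := ca.indexOf(peerID)
	if i < 0 {
		return false
	}
	ca.nodes = append(ca.nodes[:i], ca.nodes[i+1:]...)
	return true
}

func main() {
	ca := &CentralAuthority{}
	ca.Upsert(NodeData{PeerID: "peer-a", Version: 1})
	ca.Upsert(NodeData{PeerID: "peer-a", Version: 2})
	nd, ok := ca.Get("peer-a")
	fmt.Println(nd.Version, ok) // 2 true

	removed := ca.Remove("peer-a")
	_, found := ca.Get("peer-a")
	fmt.Println(removed, found) // true false
}
```

A linear scan is assumed to be acceptable at the scale discussed here (hundreds of nodes); if it is not, a `map[string]int` index from peer ID to slice position would make lookups O(1).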

b. Updated NodeEventTracker design:

  • Manages local copies of NodeData
  • Interacts with the CentralAuthority for data synchronization
  • Implements local caching for fast data access
  • Provides methods for updating local cache and triggering synchronization with CentralAuthority

c. Data flow and synchronization mechanisms:

  • Gossip protocol integrated with existing pubsub system for quick distribution of updates
  • Periodic synchronization between nodes and the central authority
  • Methods for non-authority nodes to fetch data from the central authority

d. Conflict resolution strategies:

  • Implement a merge function for NodeData that resolves conflicts and inconsistencies
  • Use timestamps and version numbers to determine the most up-to-date information

e. Persistence and recovery mechanisms:

  • Implement efficient JSON marshaling/unmarshaling for data persistence
  • Periodic saving of NodeData to disk
  • Recovery mechanisms to reload data after node restarts
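  2. Proof-of-Concept Code:
```go
import (
	"encoding/json"
	"os"
	"sync"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

// NodeData is assumed to be the existing node record type, defined elsewhere in the package.

// CentralAuthority holds the authoritative NodeData and persists it to dataFile.
type CentralAuthority struct {
	nodes     []NodeData
	mu        sync.RWMutex // guards nodes
	dataFile  string
	multiaddr multiaddr.Multiaddr // well-known address of the authority node
}

// NodeEventTracker keeps a local cache and syncs it with the CentralAuthority.
type NodeEventTracker struct {
	localCache  map[peer.ID]NodeData
	centralAuth *CentralAuthority
	pubsub      *pubsub.PubSub
	// ... other fields
}

// mergeNodeData resolves conflicts between two versions of a node record.
// Placeholder: returns incoming until the timestamp/version rules are implemented.
func mergeNodeData(existing, incoming NodeData) NodeData {
	return incoming
}

// handleGossipMessage processes NodeData updates received over pubsub.
func (net *NodeEventTracker) handleGossipMessage(msg *pubsub.Message) {
	// Handle incoming gossip messages (decode msg.Data, merge into localCache)
}

// saveData writes all NodeData to dataFile as JSON.
func (ca *CentralAuthority) saveData() error {
	ca.mu.RLock()
	data, err := json.Marshal(ca.nodes)
	ca.mu.RUnlock()
	if err != nil {
		return err
	}
	return os.WriteFile(ca.dataFile, data, 0o644)
}

// loadData restores NodeData from dataFile after a restart.
func (ca *CentralAuthority) loadData() error {
	data, err := os.ReadFile(ca.dataFile)
	if err != nil {
		return err
	}
	var nodes []NodeData
	if err := json.Unmarshal(data, &nodes); err != nil {
		return err
	}
	ca.mu.Lock()
	ca.nodes = nodes
	ca.mu.Unlock()
	return nil
}

// isCentralAuthority reports whether nodeAddr is the configured authority address.
func isCentralAuthority(nodeAddr, authorityAddr multiaddr.Multiaddr) bool {
	if nodeAddr == nil || authorityAddr == nil {
		return false
	}
	return nodeAddr.Equal(authorityAddr)
}
```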
  3. Performance and Scalability Analysis:

    • The use of a central authority provides a single source of truth, improving consistency
    • Local caching in each node reduces network overhead and improves read performance
    • The gossip protocol allows for efficient propagation of updates in large networks
    • Periodic synchronization helps maintain eventual consistency across the network
    • The solution should scale well to hundreds of nodes, with the central authority being the potential bottleneck
  4. Major Risks and Mitigation Strategies:

    • Risk: Central authority becomes a single point of failure
      Mitigation: Implement a failover mechanism or consider a multi-authority approach
    • Risk: Network partitions may lead to inconsistent states
      Mitigation: Implement conflict resolution strategies and eventual consistency mechanisms
    • Risk: High network overhead during synchronization
      Mitigation: Optimize synchronization frequency and implement delta updates
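
The "delta updates" mitigation could look like the following, assuming (hypothetically) that a syncing node sends a digest of the versions it already holds and the authority replies with only the entries that are newer or missing. `deltaSince` and the digest format are illustrative, not an existing API.

```go
package main

import "fmt"

// NodeData is a hypothetical stand-in for the project's type.
type NodeData struct {
	PeerID  string
	Version uint64
}

// deltaSince returns only the entries the requester is missing or holds an
// older version of. known is the requester's digest: PeerID -> Version.
// Exchanging a small digest keeps sync traffic proportional to what changed.
func deltaSince(authoritative []NodeData, known map[string]uint64) []NodeData {
	var delta []NodeData
	for _, nd := range authoritative {
		if v, ok := known[nd.PeerID]; !ok || nd.Version > v {
			delta = append(delta, nd)
		}
	}
	return delta
}

func main() {
	authority := []NodeData{{"peer-a", 3}, {"peer-b", 1}, {"peer-c", 2}}
	digest := map[string]uint64{"peer-a": 3, "peer-b": 0}
	fmt.Println(deltaSince(authority, digest)) // [{peer-b 1} {peer-c 2}]
}
```

This sketch does not cover node removals; propagating leaves through deltas would need something like tombstone entries.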

This solution addresses the current data volatility issues by providing a centralized authority, implementing efficient synchronization mechanisms, and ensuring data persistence. It can be integrated into the existing codebase by updating the NodeEventTracker and introducing the CentralAuthority component.

The design considers efficient searching and updating of NodeData, handles concurrent updates, ensures data consistency, minimizes network overhead, and provides graceful handling of node joins and leaves. It also addresses proper handling of the central authority role and considers edge cases such as network partitions or temporary unavailability of the central authority node.

teslashibe changed the title to "spike: Investigate and Design a Solution for NodeData Volatility" on Oct 10, 2024

mudler commented Oct 10, 2024

mmmm isn't this practically #518 ?
