Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitor, not pinging hosts, unstable results #17

Open
STiAT opened this issue Dec 3, 2020 · 3 comments
Open

Monitor, not pinging hosts, unstable results #17

STiAT opened this issue Dec 3, 2020 · 3 comments

Comments

@STiAT
Copy link

STiAT commented Dec 3, 2020

Hi,

I've used your example code to add 593 of our hosts to the monitor tool, since it basically should do what I need (I need a statistics every 60 seconds which I would want to log away).

  • Of those 593 hosts only 255 are ever reported in the ticker (always 255, always the first 255 I add as targets). It makes no difference weather I use AddTarget or AddTargetDelayed, it's always 255.

  • Hosts which are certainly available are being reported as PacketLoss 100 %, while Wireshark sees the successful UDP packages:
    image

The monitor reports the following:
2020/12/03 14:14:17 hanco: {PacketsSent:10 PacketsLost:10 Best:0 Worst:0 Median:NaN Mean:NaN StdDev:NaN}

Which is certainly the host and it can be pinged:
image

It is certainly true that several of the hosts are currently "down" because behind Firewalls and I'm not testing in the same zone, but I'd expect it to report the loss at the "right" hosts ... and all 593 instead of just 255 ofc ;-).

Current code, which pretty much reflects the example with some little changes:

package slaping

import (
	"net"
	"os"
	"time"
	"log"
	"os/signal"
	"syscall"

	"github.com/digineo/go-ping"
	"github.com/digineo/go-ping/monitor"
)

var (
	pingInterval        = 5 * time.Second
	pingTimeout         = 4 * time.Second
	reportInterval      = 60 * time.Second
	size           uint = 56
	pinger         *ping.Pinger
	targets        []string
)

func (p *SLAPing) PingHosts() {
	// setup, we need to populate cache once before we start to actually have host and maint info
	CMDBHosts, err := p.CMDB.GetSLAHosts()

	if err != nil {
		log.Println(err)
	}

	// Bind to sockets
	if pi, err := ping.New("0.0.0.0", "::"); err != nil {
		log.Printf("Unable to bind: %s\nRunning as root?\n", err)
		os.Exit(2)
	} else {
		pinger = pi
	}
	pinger.SetPayloadSize(uint16(size))
	defer pinger.Close()

	// Create monitor
	monitor := monitor.New(pinger, pingInterval, pingTimeout)
	defer monitor.Stop()

	// Add targets
	targets := CMDBHosts
	validtargets := 0
	for i, target := range CMDBHosts {
		ipAddr, err := net.ResolveIPAddr("", target)
		if err != nil {
			// currently irrelevant
			continue
		}

		err = monitor.AddTarget(string([]byte{byte(i)}), *ipAddr)
		//err = monitor.AddTargetDelayed(string([]byte{byte(i)}), *ipAddr, 10*time.Millisecond*time.Duration(i))

		if err != nil {
			// currently irrelevant
			continue
		}
		validtargets = validtargets + 1
	}

	log.Printf("Total targets: %d, Valid targets: %d, Errors: %d\n", len(targets), validtargets, len(targets)-validtargets)

	// Start report routine
	ticker := time.NewTicker(reportInterval)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			for i, metrics := range monitor.ExportAndClear() {
				log.Printf("%s: %+v\n", targets[[]byte(i)[0]], *metrics)
			}
		}
	}()

	// Handle SIGINT and SIGTERM.
	// in other words we wait here until interrupted and let the ticker tick
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
	log.Println("received", <-ch)
}

[edit]
My initial thought was that it may be connected to the ICMP pings getting blocked by firewalls and threads are getting locked up due to the timeouts not properly working there, since it's an uncommon case and may not be that well tested.
Now, due to the result "shuffle" I'm not sure but I have systems in the results which actually are blocked by firewalls, going through the slice sequentially that could explain why it's always the same number, but I didn't see any max threads number to get locked in the library, is there a limit where the channel / parallel work of the library blocking?

[edit2]
I could test it now on a host which has the necessary firewall permissions to actually run this without getting blocked, the results are still the same. Something is wrong here, just can't make out what.

@STiAT
Copy link
Author

STiAT commented Dec 21, 2020

Any idea on this issue?

@corny
Copy link
Member

corny commented Dec 21, 2020

Not really, but please try the rewrite-monitor branch.

@STiAT
Copy link
Author

STiAT commented Dec 22, 2020

Hey there, thanks for the answer.

Rewrote the code for now,, but it seems to "stall" overall. The issue is I can't find a way to set a timeout, and having 300/600 hosts behind a firewall it probably will wait forever.

In the monitor, AddTarget/AddTargetDelayed seeems to use p.timeout, but I don't see a way to set it, maybe I just overlooked that and it's there ( I really couldn't see it in that branch, if it's there sorry). But to test that I need a way to set the timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants