WebUI shows no nodes live when they're actually up and pass health checks #4

Open
seglo opened this issue Apr 26, 2016 · 14 comments

@seglo

seglo commented Apr 26, 2016

I was able to get the plugin working. I'm on CentOS and had to install the DataStax yum repo before anything would work (can this be automated?). My main issue now is that the UI reports inconsistent information.

The health checks for the "Cluster Nodes" are working (why are they called this? Shouldn't the name be more descriptive, like "C* Nodes"?), but the Ambari UI shows the following:

[Screenshot: ambari-cassandra]
(ignore the 4 warning alerts, they're not related to Cassandra)

When I run nodetool status, you can see that all my nodes are up:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.147.0.23  87.84 KB   256     51.4%             300c7c50-e1ca-4979-8fc4-0d7bf48e766b  RAC1
UN  10.147.0.22  192.27 KB  256     52.0%             521ffe0d-4a32-4e29-8862-d9297c53e8d2  RAC1
UN  10.147.0.21  234.35 KB  256     48.9%             3c1f75d3-c111-45f0-85bc-cc0a795c5cad  RAC1
UN  10.147.0.24  241.48 KB  256     47.7%             24b59f0b-24d4-4322-900c-4657f37e05af  RAC1
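
For comparison with what the UI reports, the live-node count can be read off this output programmatically. The following is a minimal sketch and not part of the plugin: `ROW` and `count_nodes` are hypothetical names, and the parsing assumes the column layout shown above.

```python
import re

# Matches data rows such as "UN  10.147.0.23  87.84 KB  256 ...".
# First letter: U(p) or D(own); second: N(ormal), L(eaving), J(oining), M(oving).
ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)\s")


def count_nodes(nodetool_output):
    """Return (up_and_normal, total) node counts from `nodetool status` text."""
    up_normal = 0
    total = 0
    for line in nodetool_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or blank line
        total += 1
        if match.group(1) == "U" and match.group(2) == "N":
            up_normal += 1
    return up_normal, total


# Sample copied from the output in this issue.
sample = """\
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.147.0.23  87.84 KB   256     51.4%             300c7c50-e1ca-4979-8fc4-0d7bf48e766b  RAC1
UN  10.147.0.22  192.27 KB  256     52.0%             521ffe0d-4a32-4e29-8862-d9297c53e8d2  RAC1
UN  10.147.0.21  234.35 KB  256     48.9%             3c1f75d3-c111-45f0-85bc-cc0a795c5cad  RAC1
UN  10.147.0.24  241.48 KB  256     47.7%             24b59f0b-24d4-4322-900c-4657f37e05af  RAC1
"""

print(count_nodes(sample))  # prints (4, 4)
```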

seglo changed the title from "WebUI shows all nodes down when they're actually up." to "WebUI shows no nodes live when they're actually up and pass health checks" on Apr 26, 2016
@seglo
Author

seglo commented May 5, 2016

I've just redeployed a cluster and the issue remains. Any suggestions?

@ajak6
Contributor

ajak6 commented May 5, 2016

This should not happen. Check the ambari-agent and ambari-server logs for any exceptions.

@seglo
Author

seglo commented May 5, 2016

I get these errors for my 3 C* nodes in ambari-agent.log.

2016-05-05 16:04:58,281 [CRITICAL] [Cassandra] [Cassandra_service] (Cassandra Service Process) Connection failed: [Errno 111] Connection refused to ip-10-147-0-23.ec2.internal:7000
2016-05-05 16:05:01,265 [CRITICAL] [Cassandra] [Cassandra_service] (Cassandra Service Process) Connection failed: [Errno 111] Connection refused to ip-10-147-0-22.ec2.internal:7000
2016-05-05 16:05:03,661 [CRITICAL] [Cassandra] [Cassandra_service] (Cassandra Service Process) Connection failed: [Errno 111] Connection refused to ip-10-147-0-21.ec2.internal:7000

Yet I can connect to each of these host:port pairs from the machine where ambari-server is installed.

[centos@ip-10-147-0-10 ambari-server]$ telnet ip-10-147-0-21.ec2.internal 7000
Trying 10.147.0.21...
Connected to ip-10-147-0-21.ec2.internal.

I also have no problem running cqlsh and connecting to the cluster.
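
The telnet test can also be scripted, which makes it easy to repeat from each host involved. This is a minimal sketch in Python 3 and not the code Ambari uses for its alert; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (Errno 111), timeouts, and DNS failures.
        return False
```

For example, `port_is_open("ip-10-147-0-21.ec2.internal", 7000)` repeats the telnet check above. Running it on the host whose ambari-agent.log contains the error would show whether the result differs from the ambari-server machine.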

@ajak6
Contributor

ajak6 commented May 26, 2016

Were you able to resolve the issue?

@seglo
Author

seglo commented May 26, 2016

No, I haven't. I was going to look into it some more soon. Has anyone else reported this problem?

@seglo
Author

seglo commented May 27, 2016

What's really strange is that the heartbeats seem to be working fine and Cassandra is indeed running (notice it says "No Alerts"), but this summary window says 0/3 nodes are live. Which part of the plugin code is responsible for indicating whether a Cluster Node is live in this view?

[Screenshot: ambari-cassanda-cluster-nodes]

@seglo
Author

seglo commented May 27, 2016

Probably a symptom of the same problem. When I go into a specific host it shows the Cassandra service as not started, even though it's running.

[Screenshot: ambari-cassandra-server-not-started]

@mithmatt

This might be an issue with the status function. Can you please confirm that no exceptions are being thrown here?

The recommended way to define the status function is as follows:
Run a command that checks whether the component is running.

  • If the component is running, do not raise any error; the command should return exit code 0.
  • If the component is not running, raise a ComponentIsNotRunning exception.
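
The two rules above can be sketched as follows. This is a standalone illustration, not the plugin's code: the `ComponentIsNotRunning` class here is a local stand-in for the Ambari exception of the same name, and `status` takes the check command as a hypothetical parameter so the sketch can run outside Ambari.

```python
import subprocess


class ComponentIsNotRunning(Exception):
    """Local stand-in for Ambari's exception of the same name (assumption)."""


def status(check_cmd):
    """Run check_cmd; return quietly on exit code 0, otherwise raise."""
    returncode = subprocess.call(check_cmd)
    if returncode != 0:
        raise ComponentIsNotRunning(
            "status command exited with code %d" % returncode)
```

With this shape, a status command that returns a non-zero code for a healthy service makes a running component look stopped, so the exit code of the check command is worth verifying by hand.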

@seglo
Author

seglo commented May 31, 2016

@mithmatt I'll add some exception handling and confirm the return code.

Earlier I did actually stick a debug statement in the status function, but it never appeared to be executed.

@ajak6
Contributor

ajak6 commented Jun 2, 2016

The status function in the Python file is executed by Ambari for the heartbeat. I tried reinstalling the service and I don't see the issue.
[Screenshot: screen shot 2016-06-02 at 11 53 45 am]

What OS version are you using?
Which HDP stack version are you using?
Which Ambari version are you using?
Try changing the status method in cassandra_master.py to check the pid file by passing the path of the pid file to the check_process_status method.
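
To illustrate what a pid-file check involves, here is a standalone sketch in Python 3. It is not Ambari's check_process_status implementation: `check_pid_file` is a hypothetical helper, and `ComponentIsNotRunning` is a local stand-in for the Ambari exception.

```python
import errno
import os


class ComponentIsNotRunning(Exception):
    """Local stand-in for Ambari's exception of the same name (assumption)."""


def check_pid_file(pid_file):
    """Return the pid if pid_file names a live process, otherwise raise."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable file, or contents that are not a number.
        raise ComponentIsNotRunning("no usable pid file at %s" % pid_file)
    try:
        os.kill(pid, 0)  # signal 0 tests for existence without signalling
    except OSError as err:
        if err.errno == errno.EPERM:
            return pid  # process exists but belongs to another user
        raise ComponentIsNotRunning("no process with pid %d" % pid)
    return pid
```

The pid file location depends on how Cassandra was packaged, so the path passed in would need to be confirmed on the host.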

ajak6 self-assigned this Jun 2, 2016
@seglo
Author

seglo commented Jun 3, 2016

For some reason, service cassandra status was returning exit code 3 even though the service was running.

I'm running CentOS 7, so I'm using systemd. The equivalent systemd command returned exit code 0. When I updated the status command in cassandra_master.py to systemctl status cassandra, the "warning" icon flipped to "ok".

[centos@ip-10-147-0-21 ~]$ ./saferuncommand.sh sudo systemctl status cassandra
● cassandra.service - SYSV: Starts and stops Cassandra
   Loaded: loaded (/etc/rc.d/init.d/cassandra)
   Active: active (exited) since Thu 2016-06-02 17:23:02 UTC; 1 day 2h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 32132 ExecStop=/etc/rc.d/init.d/cassandra stop (code=exited, status=1/FAILURE)
  Process: 32182 ExecStart=/etc/rc.d/init.d/cassandra start (code=exited, status=0/SUCCESS)

Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Starting SYSV: Starts and stops Cassandra...
Jun 02 17:23:02 ip-10-147-0-21 su[32189]: (to cassandra) root on none
Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Started SYSV: Starts and stops Cassandra.
Jun 02 17:23:02 ip-10-147-0-21 cassandra[32182]: Starting Cassandra: OK

0

[centos@ip-10-147-0-21 ~]$ ./saferuncommand.sh sudo service cassandra status
● cassandra.service - SYSV: Starts and stops Cassandra
   Loaded: loaded (/etc/rc.d/init.d/cassandra)
   Active: active (exited) since Thu 2016-06-02 17:23:02 UTC; 1 day 2h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 32132 ExecStop=/etc/rc.d/init.d/cassandra stop (code=exited, status=1/FAILURE)
  Process: 32182 ExecStart=/etc/rc.d/init.d/cassandra start (code=exited, status=0/SUCCESS)

Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Starting SYSV: Starts and stops Cassandra...
Jun 02 17:23:02 ip-10-147-0-21 su[32189]: (to cassandra) root on none
Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Started SYSV: Starts and stops Cassandra.
Jun 02 17:23:02 ip-10-147-0-21 cassandra[32182]: Starting Cassandra: OK

3
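
Exit code 3 matches the LSB init-script convention for "program is not running", which would explain why a status check built on this command reports the service as stopped. The sketch below maps the LSB status codes to their meanings; `LSB_STATUS`, `describe_status`, and `run_and_describe` are hypothetical names, and whether the Cassandra init script follows LSB exactly is an assumption.

```python
import subprocess

# Exit codes defined by LSB for the "status" action of an init script.
LSB_STATUS = {
    0: "program is running or service is OK",
    1: "program is dead and /var/run pid file exists",
    2: "program is dead and /var/lock lock file exists",
    3: "program is not running",
    4: "program or service status is unknown",
}


def describe_status(returncode):
    """Translate a status exit code into its LSB meaning."""
    return LSB_STATUS.get(returncode, "reserved or application-specific")


def run_and_describe(cmd):
    """Run cmd (a list of arguments) and return (exit code, meaning)."""
    returncode = subprocess.call(cmd)
    return returncode, describe_status(returncode)
```

For example, `run_and_describe(["service", "cassandra", "status"])` would return the code together with its meaning, which is easier to compare across hosts than the raw number.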

@ajak6
Contributor

ajak6 commented Jun 3, 2016

Yes, on CentOS it's good to use systemctl. If this is resolved, please close the issue.

@seglo
Author

seglo commented Jun 3, 2016

Would you accept a PR that switches based on whether systemctl is present?

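    def status(self, env):
        import params
        env.set_params(params)
        # Prefer systemctl where it is installed (e.g. CentOS 7);
        # otherwise fall back to the SysV service wrapper.
        status_cmd = format("""
            if hash systemctl 2>/dev/null; then
              systemctl status cassandra
            else
              service cassandra status
            fi""")
        # Execute raises if the command exits with a non-zero code.
        Execute(status_cmd)
        print('Status of the Master')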
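
The same switch could be made in Python instead of in shell. This is a minimal sketch under stated assumptions: it needs Python 3.3+ for `shutil.which`, which may not match the Python version the Ambari agent runs on, and `cassandra_status_command` is a hypothetical helper that is not part of the plugin.

```python
import shutil


def cassandra_status_command(which=shutil.which):
    """Pick the status command based on whether systemctl is on the PATH.

    `which` is injectable so the choice can be exercised without systemd.
    """
    if which("systemctl"):
        return ["systemctl", "status", "cassandra"]
    return ["service", "cassandra", "status"]
```

Returning the command as an argument list keeps the decision separate from how the command is eventually executed.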

seglo added a commit to boldradius/ambari-cassandra-service that referenced this issue Jun 6, 2016
@ghost

ghost commented Sep 7, 2016

@seglo's solution worked for me.

I had the same issue on the same OS (CentOS).

[Screenshot: capture]
