Discussion:
Topology hung in KILLED state when nimbus leader changes
Pradeep Badiger
2018-11-05 15:08:00 UTC
Hi All,

I am having an issue with a 3-node Nimbus cluster (Storm 1.1.0): a topology hangs in the KILLED state and is not removed even after the kill wait time has elapsed.

When I checked the nimbus logs on the host that was previously the leader, I saw the exception below.

2018-11-02 11:37:16.583 [timer] ERROR org.apache.storm.daemon.nimbus - Exception while trying transition for Test-2-1541172153 and event :remove
java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='example.test.com', port=6627, isLeader=true}
at org.apache.storm.daemon.nimbus$is_leader.doInvoke(nimbus.clj:142) ~[storm-core-1.1.0.jar:1.1.0]
at clojure.lang.RestFn.invoke(RestFn.java:410) ~[clojure-1.7.0.jar:?]
at org.apache.storm.daemon.nimbus$transition_BANG_.invoke(nimbus.clj:351) ~[storm-core-1.1.0.jar:1.1.0]
at org.apache.storm.daemon.nimbus$delay_event$fn__9852.invoke(nimbus.clj:398) ~[storm-core-1.1.0.jar:1.1.0]
at org.apache.storm.timer$mk_timer$fn__1720$fn__1721.invoke(timer.clj:50) ~[storm-core-1.1.0.jar:1.1.0]
at org.apache.storm.timer$mk_timer$fn__1720.invoke(timer.clj:42) ~[storm-core-1.1.0.jar:1.1.0]
at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
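
To illustrate how I read this trace, here is a toy Java model. It is not Storm's code, and every name in it (DelayedRemoveModel, simulate, nimbus-1, nimbus-2, the REMOVED state) is made up for illustration. It assumes the delayed :remove event sits on an in-memory timer of the nimbus that scheduled it, and that nothing else retries the event once the leader check fails. I have not confirmed that second assumption against the Storm source.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: none of these names come from Storm.
public class DelayedRemoveModel {

    // Returns the topology state after the delayed remove event has fired.
    public static String simulate(boolean leaderChangesDuringWait) throws Exception {
        AtomicReference<String> leader = new AtomicReference<>("nimbus-1");
        AtomicReference<String> topologyState = new AtomicReference<>("ACTIVE");

        // The timer lives in the memory of nimbus-1, the leader at kill time.
        ScheduledExecutorService timerOnNimbus1 =
                Executors.newSingleThreadScheduledExecutor();
        try {
            // "kill" marks the topology KILLED and schedules the remove
            // event to run after the wait time (shortened to 100 ms here).
            topologyState.set("KILLED");
            ScheduledFuture<?> removeEvent = timerOnNimbus1.schedule(() -> {
                if (!"nimbus-1".equals(leader.get())) {
                    // Stands in for the "not a leader" error in the log.
                    // Assumption: the event is dropped and never retried.
                    System.err.println(
                            "not a leader, current leader is " + leader.get());
                    return;
                }
                topologyState.set("REMOVED");
            }, 100, TimeUnit.MILLISECONDS);

            if (leaderChangesDuringWait) {
                // Leadership moves while the remove event is still pending.
                leader.set("nimbus-2");
            }

            // Wait for the delayed event to fire before reading the state.
            removeEvent.get();
            return topologyState.get();
        } finally {
            timerOnNimbus1.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("no leader change: " + simulate(false));
        System.out.println("leader change:    " + simulate(true));
    }
}
```

In this model the topology is removed only when the scheduling nimbus is still the leader at the moment the timer fires. If leadership moves during the wait, the state stays at KILLED, which matches what I observe.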

I am able to reproduce the issue consistently with the following steps.


1. Started the topology.
2. Checked the Storm UI to find the leader nimbus.
3. After keeping the topology in the ACTIVE state for some time, killed it with a wait time of 60 seconds.
4. Once the state changed to KILLED, restarted the leader nimbus process to force a leader re-election.
5. After a new leader was elected, the topology remained in the KILLED state indefinitely.

Am I missing some configuration in Storm to handle this situation? Is there anything I can look at in ZooKeeper to find out what state the topology is in?

I saw a JIRA issue about handling leadership changes during kill and rebalance. Is this still a known problem?

https://issues.apache.org/jira/browse/STORM-1604

Any help is appreciated.

Thanks,
Pradeep V.B.



