Implement http "ping" mechanism to ensure that ES tasks really are running, not just rely on Mesos to send us state updates #550
Comments
A bit strange. This line:
Is saying that there is an instance of ES in its state, i.e. the state it presumably recovered from ZooKeeper. But if you restarted, why is there old state in ZooKeeper? You say you restarted the Mesos cluster. How did you do that? Can you replicate your setup somehow to cause this issue again? I started by thinking there might be an issue when an executor fails while the scheduler is dead, but we have system tests that cover this exact scenario and don't have a problem. So it might be some edge case where you kill the entire cluster and Mesos itself doesn't have time to report the losses. Basically, I'm thinking the issue is that if you kill everything in the cluster, there is nothing left around to realise that it has died. So maybe not an ES issue, just a general cluster resiliency one. I'll keep the issue open though, because you might be able to prove that this is an ES issue.
Well, ZooKeeper has a data directory that it would recover state from across restarts, right? Or maybe I'm confused about how ZooKeeper stores state. I've replicated this issue a few times. I'm deploying on EC2, and all I do is reboot from the command line (or from the console). In addition, one thing I tried was to stop mesos-master, mesos-slave, and marathon before rebooting so as to shut Mesos down "cleanly", and this still didn't work. Although I suppose I didn't stop ZooKeeper prior to reboot. Also, I'd like to add that from the Elasticsearch console (e.g. the web UI on port 31100), in the "Tasks" tab, I can clearly see an Elasticsearch task still running, as if the "executor" were still there. But again, this task doesn't show up in the Mesos UI. The strange thing is, the other frameworks I'm running all show up correctly. I'm running the Kafka framework, and I had added a broker to it prior to rebooting; it does take a while (i.e. a few minutes) for the broker to show up, but it does appear again.
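(Editor's note: on the ZooKeeper question above, yes, ZooKeeper persists its data directory to disk, so anything a framework writes there survives a reboot of the whole box. Below is a minimal sketch of that behaviour, assuming the Apache Curator client; the znode path and plain-string serialization are illustrative assumptions, not the framework's actual persistence code.)

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.nio.charset.StandardCharsets;

// Illustration only: a scheduler that persists task state in ZooKeeper reads the
// same, possibly stale, status back after a full reboot, because ZooKeeper keeps
// its data on disk. Path and payload below are hypothetical.
public class StaleStateExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String path = "/elasticsearch-mesos/tasks/elasticsearch_54.215.156.251"; // hypothetical path

        // Simulates what the scheduler stored for the task before the reboot.
        if (zk.checkExists().forPath(path) == null) {
            zk.create().creatingParentsIfNeeded()
              .forPath(path, "TASK_RUNNING".getBytes(StandardCharsets.UTF_8));
        }

        // After the reboot the same bytes come back, even though the executor
        // this status refers to no longer exists anywhere in the cluster.
        String recovered = new String(zk.getData().forPath(path), StandardCharsets.UTF_8);
        System.out.println("Recovered state: " + recovered); // prints TASK_RUNNING

        zk.close();
    }
}
```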
Ok, thanks. I think the issue is the master failure/shutdown. The ES framework stores its state in ZooKeeper. Because you've persisted that information, it reloads the state once everything starts up again. So when it starts, it thinks there is an instance running, but of course there is not. There are several workarounds:
In production, the master should never fail. If it does, things that happen whilst the Master is down will not get propagated through to the frameworks. If we wanted to fix this, the best thing we could do is implement a "ping" mechanism that really pings the ES HTTP node upon startup (at the moment we rely on Mesos reporting state). So there are things we could do, but it is only a problem if both your Master and all your ES instances die. Since this should never happen in production, it's not an emergency. Thanks for pointing this out @luhkevin. I'll rename the issue to a more suitable task name.
Phil
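(Editor's note: to make the proposed "ping" concrete, here is a minimal sketch using plain java.net.HttpURLConnection against an ES node's HTTP port. The class name, method name, and how the result would feed back into the scheduler are assumptions for illustration, not the project's actual design.)

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of an HTTP "ping" a scheduler could run on startup before trusting a
// TASK_RUNNING status recovered from ZooKeeper. Names and integration point are
// assumptions, not the project's actual code.
public final class EsHttpPing {

    /**
     * Returns true only if the Elasticsearch HTTP endpoint answers with 200 OK,
     * e.g. GET http://54.215.156.251:31333/ within the given timeout.
     */
    public static boolean isNodeAlive(String host, int httpPort, int timeoutMs) {
        try {
            URL url = new URL("http", host, httpPort, "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode(); // throws on "connection refused"
            conn.disconnect();
            return status == 200;
        } catch (Exception e) {
            // Connection refused or timeout: the recovered task state is stale.
            return false;
        }
    }

    public static void main(String[] args) {
        // If this returns false for a task that ZooKeeper says is TASK_RUNNING,
        // the scheduler could mark the task lost and relaunch it instead of
        // declining offers indefinitely.
        System.out.println(isNodeAlive("54.215.156.251", 31333, 2000));
    }
}
```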
Happy to help!
Oh, I completely forgot to add this, but I was actually running ZooKeeper, the Mesos master, the slave, and Marathon all on one EC2 instance (not a long-term thing, just to experiment). So that might factor into all this restart/ping logic.
I'm using release 1.0.1 of the elasticsearch scheduler. I have it running with one instance of the "executor" (i.e. one instance of the actual Elasticsearch image). When I restart the Mesos cluster, the elasticsearch scheduler itself comes back up, since it's running on Marathon. However, I've waited at least 20 minutes, and the actual Elasticsearch executor/process never shows back up in the Mesos task list. All I'm getting is this in the scheduler logs:
[DEBUG] 2016-04-13 05:31:51,188 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:53,068 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler isHostnameResolveable - Attempting to resolve hostname: 54.215.156.251
[DEBUG] 2016-04-13 05:31:53,071 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:53,075 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "55afd3b4-648f-46fc-8bb4-4ebfdf3793cd-O305" }, framework_id { value: "5fe8ad28-f110-4bfd-a324-93264f3e5fbb-0000" }, slave_id { value: "55afd3b4-648f-46fc-8bb4-4ebfdf3793cd-S0" }, hostname: "54.215.156.251", resources { name: "cpus", type: SCALAR, scalar { value: 6.3, }, role: "*" }, resources { name: "mem", type: SCALAR, scalar { value: 44907.0, }, role: "*" }, resources { name: "disk", type: SCALAR, scalar { value: 44628.0, }, role: "*" }, resources { name: "ports", type: RANGES, ranges { range { begin: 31000, end: 31099, }, range { begin: 31101, end: 31202, }, range { begin: 31205, end: 31279, }, range { begin: 31281, end: 31287, }, range { begin: 31289, end: 31315, }, range { begin: 31317, end: 31392, }, range { begin: 31394, end: 31554, }, range { begin: 31556, end: 31699, }, range { begin: 31701, end: 31879, }, range { begin: 31881, end: 31887, }, range { begin: 31889, end: 31924, }, range { begin: 31926, end: 32000, }, }, role: "*" }, url { scheme: "http", address { hostname: "54.215.156.251", ip: "172.31.13.204", port: 5051, }, path: "/slave(1)" }, Reason: First ES node is not responding
[DEBUG] 2016-04-13 05:31:54,270 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:54,350 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[ERROR] 2016-04-13 05:31:54,352 org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet] log - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception org.apache.http.conn.HttpHostConnectException: Connect to 54.215.156.251:31333 [54.215.156.251/54.215.156.251] failed: Connection refused
I truncated the stack trace, but I can post more of it if necessary.
I will say that I'm specifying my own ports for the HTTP and TCP endpoints, rather than letting the framework choose them randomly as is recommended. But I don't see any errors about resource offers and ports in the log. Any ideas?