
Implement http "ping" mechanism to ensure that ES tasks really are running, not just rely on Mesos to send us state updates #550

Open
luhkevin opened this issue Apr 13, 2016 · 5 comments

@luhkevin

I'm using release 1.0.1 of the elasticsearch scheduler. I have it running with one instance of the "executor" (i.e. one instance of the actual elasticsearch image). When I restart the Mesos cluster, the elasticsearch-scheduler itself comes back up, since it's running on Marathon. However, I've waited at least 20 minutes, and the actual elasticsearch executor/process never shows back up in the Mesos task list. All I'm getting is this in the scheduler logs:

[DEBUG] 2016-04-13 05:31:51,188 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:53,068 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler isHostnameResolveable - Attempting to resolve hostname: 54.215.156.251
[DEBUG] 2016-04-13 05:31:53,071 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:53,075 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "55afd3b4-648f-46fc-8bb4-4ebfdf3793cd-O305" }, framework_id { value: "5fe8ad28-f110-4bfd-a324-93264f3e5fbb-0000" }, slave_id { value: "55afd3b4-648f-46fc-8bb4-4ebfdf3793cd-S0" }, hostname: "54.215.156.251", resources { name: "cpus", type: SCALAR, scalar { value: 6.3, }, role: "*" }, resources { name: "mem", type: SCALAR, scalar { value: 44907.0, }, role: "*" }, resources { name: "disk", type: SCALAR, scalar { value: 44628.0, }, role: "*" }, resources { name: "ports", type: RANGES, ranges { range { begin: 31000, end: 31099, }, range { begin: 31101, end: 31202, }, range { begin: 31205, end: 31279, }, range { begin: 31281, end: 31287, }, range { begin: 31289, end: 31315, }, range { begin: 31317, end: 31392, }, range { begin: 31394, end: 31554, }, range { begin: 31556, end: 31699, }, range { begin: 31701, end: 31879, }, range { begin: 31881, end: 31887, }, range { begin: 31889, end: 31924, }, range { begin: 31926, end: 32000, }, }, role: "*" }, url { scheme: "http", address { hostname: "54.215.156.251", ip: "172.31.13.204", port: 5051, }, path: "/slave(1)" }, Reason: First ES node is not responding
[DEBUG] 2016-04-13 05:31:54,270 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:54,350 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[ERROR] 2016-04-13 05:31:54,352 org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet] log - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception org.apache.http.conn.HttpHostConnectException: Connect to 54.215.156.251:31333 [54.215.156.251/54.215.156.251] failed: Connection refused

I truncated the stack trace, but I can post more of it if necessary.

I will say that I'm specifying my own ports for the HTTP and TCP endpoints, rather than letting the framework choose them randomly as is recommended. But I don't see any errors about resource offers or ports in the log. Any ideas?

@luhkevin luhkevin changed the title Tasks is seen as running when restarting mesos clusters Task is seen as running when restarting mesos clusters Apr 13, 2016
@luhkevin luhkevin changed the title Task is seen as running when restarting mesos clusters Elasticsearch doesn't start and task is seen as running when restarting mesos clusters Apr 13, 2016
@philwinder
Contributor

A bit strange. This line:

Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING

is saying that there is an instance of ES in its state, i.e. the state it presumably recovered from ZooKeeper. But if you restarted, why is there old state in ZooKeeper?

You say you restarted the mesos cluster. How did you do that? Can you replicate your setup somehow to cause this issue again?

My first thought was that there might be an issue when an executor fails while the scheduler is dead, but we have system tests covering that exact scenario and they pass.

So it might be an edge case where you kill the entire cluster and Mesos itself doesn't have time to report the losses. If you kill everything in the cluster at once, there is nothing left running to realise that anything has died. So it's maybe not an ES issue, just a general cluster-resiliency one.

I'll keep the issue open though, because you might be able to prove that this is an ES issue.

@luhkevin
Author

Well, zookeeper has a data directory that it would recover state from across restarts, right? Or maybe I'm confused about how zookeeper stores state.

I've replicated this issue a few times. I'm deploying on EC2, and all I do is reboot from the command line (or from the console). I also tried stopping mesos-master, mesos-slave, and marathon before rebooting so as to shut Mesos down "cleanly", and that still didn't work, although I suppose I didn't stop ZooKeeper prior to the reboot.

Also, I'd like to add that from the elasticsearch console (e.g. the web UI on port 31100), in the "Tasks" tab, I can clearly see an elasticsearch task still running, as if the "executor" were still there. But again, this task doesn't show up in the mesos UI.

The strange thing is that the other frameworks I'm running all come back correctly. For example, I'm running the Kafka framework and had added a broker to it prior to rebooting; it does take a while (a few minutes) for the broker to show up, but it does appear again.

@philwinder philwinder added the bug label Apr 14, 2016
@philwinder
Contributor

Ok, thanks. I think the issue is the master failure/shutdown.

The ES framework stores its state in ZooKeeper. Because you've persisted that information, it reloads the state once everything starts up again. So when it starts, it thinks there is an instance running. But of course it is not.

There are several workarounds:

  1. Delete the ES framework state from ZooKeeper before you restart (see the sketch below).
  2. Shut down the framework before you kill the master.
  3. Don't kill the master.
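
For option 1, here is a minimal sketch of what wiping the state could look like with the plain ZooKeeper Java client. The connect string and the `/elasticsearch` znode path are assumptions, not something the framework documents in this thread; the actual path depends on the framework name you configured, so check your ZooKeeper tree (e.g. with `zkCli.sh`) before deleting anything.

```java
import org.apache.zookeeper.ZKUtil;
import org.apache.zookeeper.ZooKeeper;

public class WipeFrameworkState {
    public static void main(String[] args) throws Exception {
        // Assumption: ZooKeeper is reachable on localhost:2181 and the ES
        // framework stores its state under /elasticsearch. Verify the real
        // path first (ls / in zkCli.sh) before running this.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });
        try {
            ZKUtil.deleteRecursive(zk, "/elasticsearch");
        } finally {
            zk.close();
        }
    }
}
```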

In production, the master should never fail. If it does, anything that happens whilst the master is down will not get propagated through to the frameworks.

If we wanted to fix this, the best thing we could do would be to implement a "ping" mechanism that really pings the ES node's HTTP endpoint upon startup, instead of only relying on Mesos reporting state. Something along the lines of the sketch below.
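
A minimal sketch of what such a ping could look like, assuming the scheduler already knows the hostname and client HTTP port it recorded for each task (the class and method names here are placeholders, not existing scheduler code):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class EsPing {
    /**
     * Returns true if the Elasticsearch HTTP endpoint answers with 200 OK.
     * The hostname and port are whatever the scheduler recorded for the task.
     */
    public static boolean isAlive(String hostname, int httpPort) {
        try {
            URL url = new URL("http", hostname, httpPort, "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status == 200;
        } catch (IOException e) {
            // Connection refused or timeout: the task is not really running.
            return false;
        }
    }
}
```

On startup, any recovered TASK_RUNNING entry whose node fails a check like this could be dropped instead of being trusted as-is from the reloaded ZooKeeper state.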

So there are things we could do, but it is only a problem if both your Master and all your ES instances die. Since this should never happen in production, it's not an emergency.

Thanks for pointing this out @luhkevin. I'll rename the issue to a more suitable task name.

Phil

@philwinder philwinder changed the title Elasticsearch doesn't start and task is seen as running when restarting mesos clusters Implement http "ping" mechanism to ensure that ES tasks really are running, not just rely on Mesos to send us state updates Apr 14, 2016
@philwinder philwinder added this to the Backlog milestone Apr 14, 2016
@luhkevin
Author

Happy to help!

@luhkevin
Author

Oh, I completely forgot to add this, but I was actually running ZooKeeper, the Mesos master, the Mesos slave, and Marathon all on one EC2 instance (not a long-term setup, just an experiment). So that might factor into all this restart/ping logic.
