Implement http "ping" mechanism to ensure that ES tasks really are running, not just rely on Mesos to send us state updates #550
Comments
A bit strange. This line:
Is saying that there is an instance of ES in its state, i.e. the state it presumably recovered from ZooKeeper. But if you restarted, why is there old state in ZooKeeper? You say you restarted the Mesos cluster. How did you do that? Can you replicate your setup somehow to cause this issue again? I started by thinking there might be an issue when an executor fails while the scheduler is dead, but we have system tests that cover this exact scenario and don't have a problem. So it might be some edge case where you kill the entire cluster and Mesos itself doesn't have time to report the losses. Basically, I'm thinking the issue is that if you kill everything in the cluster, there is nothing left around to realise that it has died. So maybe not an ES issue, just a general cluster resiliency one. I'll keep the issue open though, because you might be able to prove that this is an ES issue.
Well, ZooKeeper has a data directory that it would recover state from across restarts, right? Or maybe I'm confused about how ZooKeeper stores state. I've replicated this issue a few times. I'm deploying on EC2, and all I do is reboot from the command line (or from the console). In addition, one thing I tried was to stop mesos-master, mesos-slave, and marathon before rebooting so as to shut Mesos down "cleanly", and this still didn't work. Although I suppose I didn't stop ZooKeeper prior to reboot. Also, I'd like to add that from the Elasticsearch console (e.g. the web UI on port 31100), in the "Tasks" tab, I can clearly see an Elasticsearch task still running, as if the "executor" were still there. But again, this task doesn't show up in the Mesos UI. The strange thing is, the other frameworks I'm running all show up correctly. I'm running the Kafka framework, and I had added a broker to it prior to rebooting; it does take a while (i.e. a few minutes) for the broker to show up, but it does appear again.
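(Editor's note: on the ZooKeeper question above, yes, ZooKeeper persists its data directory to disk, so anything a framework writes there survives a reboot of the whole box. Below is a minimal sketch of that behaviour, assuming the Apache Curator client; the znode path and plain-string serialization are illustrative assumptions, not the framework's actual persistence code.)

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.nio.charset.StandardCharsets;

// Illustration only: a scheduler that persists task state in ZooKeeper reads the
// same, possibly stale, status back after a full reboot, because ZooKeeper keeps
// its data on disk. Path and payload below are hypothetical.
public class StaleStateExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String path = "/elasticsearch-mesos/tasks/elasticsearch_54.215.156.251"; // hypothetical path

        // Simulates what the scheduler stored for the task before the reboot.
        if (zk.checkExists().forPath(path) == null) {
            zk.create().creatingParentsIfNeeded()
              .forPath(path, "TASK_RUNNING".getBytes(StandardCharsets.UTF_8));
        }

        // After the reboot the same bytes come back, even though the executor
        // this status refers to no longer exists anywhere in the cluster.
        String recovered = new String(zk.getData().forPath(path), StandardCharsets.UTF_8);
        System.out.println("Recovered state: " + recovered); // prints TASK_RUNNING

        zk.close();
    }
}
```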
Ok, thanks. I think the issue is the master failure/shutdown. The ES framework stores its state in ZooKeeper. Because you've persisted that information, it reloads the state once everything starts up again. So when it starts, it thinks there is an instance running, but of course there is not. There are several workarounds:
In production, the master should never fail. If it does, things that happen whilst the Master is down will not get propagated through to the frameworks. If we wanted to fix this, the best thing we could do is implement a "ping" mechanism that really pings the ES HTTP node upon startup (at the moment we rely on Mesos reporting state). So there are things we could do, but it is only a problem if both your Master and all your ES instances die. Since this should never happen in production, it's not an emergency. Thanks for pointing this out @luhkevin. I'll rename the issue to a more suitable task name.
Phil
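(Editor's note: to make the proposed "ping" concrete, here is a minimal sketch using plain java.net.HttpURLConnection against an ES node's HTTP port. The class name, method name, and how the result would feed back into the scheduler are assumptions for illustration, not the project's actual design.)

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of an HTTP "ping" a scheduler could run on startup before trusting a
// TASK_RUNNING status recovered from ZooKeeper. Names and integration point are
// assumptions, not the project's actual code.
public final class EsHttpPing {

    /**
     * Returns true only if the Elasticsearch HTTP endpoint answers with 200 OK,
     * e.g. GET http://54.215.156.251:31333/ within the given timeout.
     */
    public static boolean isNodeAlive(String host, int httpPort, int timeoutMs) {
        try {
            URL url = new URL("http", host, httpPort, "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode(); // throws on "connection refused"
            conn.disconnect();
            return status == 200;
        } catch (Exception e) {
            // Connection refused or timeout: the recovered task state is stale.
            return false;
        }
    }

    public static void main(String[] args) {
        // If this returns false for a task that ZooKeeper says is TASK_RUNNING,
        // the scheduler could mark the task lost and relaunch it instead of
        // declining offers indefinitely.
        System.out.println(isNodeAlive("54.215.156.251", 31333, 2000));
    }
}
```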
Happy to help!
Oh, I completely forgot to add this, but I was actually running ZooKeeper, the Mesos master, the slave, and Marathon all on one EC2 instance (not a long-term thing, just to experiment). So that might factor into all this restart/ping logic.
I'm using release 1.0.1 of the elasticsearch scheduler. I have it running with one instance of the "executor" (i.e. one instance of the actual Elasticsearch image). When I restart the Mesos cluster, the elasticsearch scheduler itself comes back up, since it's running on Marathon. However, I've waited at least 20 minutes, and the actual Elasticsearch executor/process never shows back up in the Mesos task list. All I'm getting is this in the scheduler logs:
[DEBUG] 2016-04-13 05:31:51,188 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:53,068 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler isHostnameResolveable - Attempting to resolve hostname: 54.215.156.251
[DEBUG] 2016-04-13 05:31:53,071 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:53,075 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "55afd3b4-648f-46fc-8bb4-4ebfdf3793cd-O305" }, framework_id { value: "5fe8ad28-f110-4bfd-a324-93264f3e5fbb-0000" }, slave_id { value: "55afd3b4-648f-46fc-8bb4-4ebfdf3793cd-S0" }, hostname: "54.215.156.251", resources { name: "cpus", type: SCALAR, scalar { value: 6.3, }, role: "*" }, resources { name: "mem", type: SCALAR, scalar { value: 44907.0, }, role: "*" }, resources { name: "disk", type: SCALAR, scalar { value: 44628.0, }, role: "*" }, resources { name: "ports", type: RANGES, ranges { range { begin: 31000, end: 31099, }, range { begin: 31101, end: 31202, }, range { begin: 31205, end: 31279, }, range { begin: 31281, end: 31287, }, range { begin: 31289, end: 31315, }, range { begin: 31317, end: 31392, }, range { begin: 31394, end: 31554, }, range { begin: 31556, end: 31699, }, range { begin: 31701, end: 31879, }, range { begin: 31881, end: 31887, }, range { begin: 31889, end: 31924, }, range { begin: 31926, end: 32000, }, }, role: "*" }, url { scheme: "http", address { hostname: "54.215.156.251", ip: "172.31.13.204", port: 5051, }, path: "/slave(1)" }, Reason: First ES node is not responding
[DEBUG] 2016-04-13 05:31:54,270 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-04-13 05:31:54,350 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_54.215.156.251_20160413T044508.235Z exists, using old state: TASK_RUNNING
[ERROR] 2016-04-13 05:31:54,352 org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet] log - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception org.apache.http.conn.HttpHostConnectException: Connect to 54.215.156.251:31333 [54.215.156.251/54.215.156.251] failed: Connection refused
I truncated the stack trace, but I can post more of it if necessary.
I will say that I'm specifying my own ports for the HTTP and TCP endpoints, rather than letting the framework choose them randomly as is recommended. But I don't see any errors about resource offers and ports in the log. Any ideas?