Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unhandled Exception Error when server goes away #338

Closed
edgework opened this issue Mar 14, 2017 · 27 comments
Closed

Unhandled Exception Error when server goes away #338

edgework opened this issue Mar 14, 2017 · 27 comments

Comments

@edgework
Copy link

If the rabbitMq server goes down the client will crash with an unhandled exception ECONNRESET

@michaelklishin
Copy link

If you shut RabbitMQ down in such a way that it can close its end of client TCP connections first, the clients will eventually get an ECONNRESET: that's expected from TCP sockets.

Some clients provide automatic connection recovery but I don't think this one does. So it's up to application developer to handle connection failures.

@edgework
Copy link
Author

edgework commented Mar 15, 2017

A RabbitMQ server unexpectedly shutting down SHOULD NOT cause all its clients to crash!

The socket should implement an 'error' handler and deal with this. NOT take the whole application down.

socket.on("error", (err) =>
   ...
)

@michaelklishin
Copy link

michaelklishin commented Mar 15, 2017

@edgework there's more to an AMQP 0-9-1 connection than just a socket. I cannot comment on what may be taking your "entire application" down but if a library can take your application down, perhaps the application should be more defensive.

Clients that offer connection recovery have to worry about way more than reopening their TCP connection.

@edgework
Copy link
Author

Unhandled Exception will take anything down. Your library is not handling an exception in the socket. Its that simple.

In any case https://www.npmjs.com/package/amqp handles this failure correctly so I'll just use that.

@cressie176
Copy link
Collaborator

cressie176 commented Mar 15, 2017

@edgework I think it's reasonable for amqplib to emit an error event. It doesn't have sufficient context to make a universally acceptable decision for a fault of this severity, and as such I consider it good design for a low level library to bring this to your attention in the bluntest way possible.

The alternative, which is to enforce a set of opinions onto users would make the amqplib less flexible. e.g.

  • Should it log to STDERR making it hard for the user to take appropriate action?
  • Should it emit an alternative (non-error) event, risking the problem going unnoticed if there were no listeners
  • Should it emit an error event if there were no listeners to the alternative event (back to square 1)
  • Should it attempt to automatically reconnect? How often? How many times? Then what? Not to mention that auto reconnection could be bad if the problem was related to network saturation or a memory issue with with RabbitMQ server.
  • How should publishers / subscribers be given new channels if they hold references to broken ones?
  • Should subscribers be automatically resubscribed?

Unfortunately there isn't a "correct" answer. My approach was to write a library that sat on top of amqplib and did some of these things in a highly opinionated way which suited my exact situation. I'm very thankful that amqplib was flexible enough to enable me to do this.

Anyway good luck with amqp

@rick-rheo
Copy link

rick-rheo commented Mar 29, 2017

Hi. @michaelklishin

I too am running into this problem. I deal with it somewhat by grabbing the uncaught exception and if its ECONNREST I'm guessing that its ampq and closing all the ampq connections and then go into a loop retrying and that works until about a minute later when the ampq library throws another uncaught exception

Error: Heartbeat timeout at Heart.<anonymous> (/Volumes/code/agent/server/node_modules/amqplib/lib/connection.js:425:19) at emitNone (events.js:67:13) at Heart.emit (events.js:166:7) at Heart.runHeartbeat (/Volumes/code/agent/server/node_modules/amqplib/lib/heartbeat.js:88:17) at wrapper [as _onTimeout] (timers.js:275:11) at Timer.listOnTimeout (timers.js:92:15)

This one I can't do anything about. Ignoring it will HARD crash the application.

Given that I its a not trivial for me to restart my entire app because a connection to one server closed unexpectedly. How would you suggest I deal with this?

@rick-rheo
Copy link

@cressie176 @michaelklishin Anyone? Is this project active?

@cressie176
Copy link
Collaborator

cressie176 commented Apr 2, 2017

Hi @rick-rheo. Yes this project is still active.

WRT your issue I'm probably not going to be much use. I'm just a user of amqplib and don't know the internals very well. I suggest putting some debug in the source here, here, here and here. Maybe you'll see your dead connections trying to heartbeat for some reason.

You could also set the heartbeat option to 0 to disable heartbeats on both the client and server. I'm not sure what undesirable side effects that will have.

@cressie176
Copy link
Collaborator

cressie176 commented Apr 2, 2017

Thought I'd setup a test case using a rabbitmq container. @rick-rheo, @mateodelnorte #334 does this correlate to what you're seeing?

The biggest problem I see with the following is the uncaughtException if the consumer code throws an asynchronous error, but I'm not really sure what amqplib could do about that.

var amqp = require('./channel_api')
var Promise = require('bluebird')

process.on('error', err => {
  console.log('#338: process error', err)
})

process.on('uncaughtException', function (err) {
  console.log('#338: process uncaught exception', err)
})

process.on('unhandledRejection', (reason, p) => {
  console.log('#338: process unhandled rejection', p, 'reason', reason)
})

amqp.connect('amqp://localhost').then(conn => {

  console.log('#338: Connected')

  conn.on('error', (err) => {
      console.log('#338: TODO Handle connection error', err)
  })

  return conn.createChannel().then(ch => {
    return new Promise(resolve => {

      ch.assertQueue('test_q')
          .then(qok => qok.queue)
          .then(queue => {
            console.log('#338: Consuming')
            return ch.consume(queue, (msg) => {
                console.log('#338: Received msg')
                if (msg == null) return
                if (msg.content.toString() == 'crash now') throw new Error('Crashing Now!')
                if (msg.content.toString() == 'crash later') {
                    setImmediate(() => {
                        throw new Error('Crashing Later!')
                    })
                }
            }, { noAck: true }).then(function(_consumeOk) {
              console.log(' [*] Waiting for messages. To exit press CTRL+C');
            })
      })
    })
  }).finally(() => {
    console.log('#338: Closing connection')
    conn.close()
  })
}).catch(err => {
  console.log('#338: Caught Error from connect', err)
})
Trigger Handled by Connection Channel Consumer Heartbeat
Synchronous error in consumer code connecton.on('error') Closed Closed Unsubscribed Cleared
Asynchronous error in consumer code process.on('uncaughtException') OK OK Subscribed OK
Container paused, wait 2 minutes connecton.on('error') Closed Closed Unsubscribed Cleared
Container killed connecton.on('error') Closed Closed Unsubscribed Cleared
Queue deleted consumer received null message OK OK N/A OK
Container never running promise catch N/A N/A N/A N/A

@squaremo
Copy link
Collaborator

squaremo commented Apr 3, 2017

It looks like amqplib doesn't try to catch errors from the socket after the opening handshake is complete: https://github.com/squaremo/amqp.node/blob/master/lib/connect.js#L186. If RabbitMQ is killed and drops connections on the floor, that'll mean amqplib does not respond to those errors, so it'll keep heartbeating.

It's possible that just re-emitting the error on the connection (or doing some variety of connection shutdown-with-error) will fix this. It may be a short while before I get to try this.

@michaelklishin
Copy link

@squaremo in case you are looking for naming suggestions, Java client refers to such connection errors as "connection shutdown." Server-sent connection.close also triggers a shutdown and both automatic and manual recovery assumes applications handle shutdown events.

@blazerguns
Copy link

@squaremo Any chance you got to work on this fix? This still a constant problem in otherwise stable codebase.

@squaremo
Copy link
Collaborator

It looks like amqplib doesn't try to catch errors from the socket after the opening handshake is complete

That's not true; it passes the socket to the Connection constructor, and that registers a handler for 'error' (https://github.com/squaremo/amqp.node/blob/ead34d50e2c02dc79fdabdd9195cc9aec0e61e12/lib/connection.js#L264, and when it's gone through the handshake, https://github.com/squaremo/amqp.node/blob/ead34d50e2c02dc79fdabdd9195cc9aec0e61e12/lib/connection.js#L269).

If I run @cressie176's test program above, and kill -9 rabbitmq-server, I get the output

#338: TODO Handle connection error Error: Unexpected close                                                                                                                      
    at succeed (/Users/mikeb/space/amqp.node/lib/connection.js:271:13)                                                                                                          
    at onOpenOk (/Users/mikeb/space/amqp.node/lib/connection.js:253:5)                                                                                                          
    at /Users/mikeb/space/amqp.node/lib/connection.js:166:32                                                                                                                    
    at /Users/mikeb/space/amqp.node/lib/connection.js:160:12                                                                                                                        at Socket.recv (/Users/mikeb/space/amqp.node/lib/connection.js:498:12)                                                                                                          at Object.onceWrapper (events.js:272:13)                                                                                                                                        at Socket.emit (events.js:180:13)                                                                                                                                               at emitReadable_ (_stream_readable.js:525:10)                                                                                                                                   at emitReadable (_stream_readable.js:519:7)                                                                                                                                     at addChunk (_stream_readable.js:285:7)                                                                                                                                         at readableAddChunk (_stream_readable.js:261:11)                                                                                                                            
    at Socket.Readable.push (_stream_readable.js:218:10)                                                                                                                        
    at TCP.onread (net.js:581:20)

I.e., the handler for the connection 'error' event is called. Is there a better way to replicate the problem people seem to be having?

@blazerguns
Copy link

blazerguns commented Jul 25, 2018

@squaremo, this is what I get when I do a kill -9 to rabbit server.

 ERROR: master/21480 on dummy: Unexpected error occured to connection to messageQ workerQueue (err.code=ECONNRESET)
    Error: read ECONNRESET
        at TCP.onread (net.js:659:25)

In my code I have

 conn.on('error', err => {
    log.error(err, "Unexpected error occured to connection to messageQ %s", cfg.SENDTOWORKERSQ);
    prodMsgClient = null;
    prodMsgChannel = null;
    setTimeout(function() {
      log.info("Trying to connect for queue %s again...", cfg.SENDTOWORKERSQ);
      return initMessageQueue();
    }, 60 * 1000);
  });

I was hoping that the error would be caught and we will be able to write some logic to reconnect after a timeout. However this does not seem to work. As soon as I get this error, my app crashes. I want to prevent app crash and retry after a minute. Also I have 2 more instances of this client running, and they dont seem to get this error. It could be that, before the error is delivered to other client, the app crashes. My thought is, if you modified @cressie176 code and added one more connection obj say conn1 and conn2 and then do a kill -9 you might see this issue. Just my guess.

A sample code that can be used is listed in another bug #445
Am I doing something wrong here?

@squaremo
Copy link
Collaborator

I adapted the code above as https://gist.github.com/squaremo/76246899d5b68302eaeaf76ac917d33c#file-338-js

This makes two changes:

  • wrap the code that sets up the connection etc., in a function, and in the handler for connection.on('error', ...), set a timeout to call the function again in a minute (i.e., try to reconnect from scratch); and,
  • call the top-level function twice so there's two connections at any one time.

This is what the output looks like when I run it, kill RabbitMQ, then restart RabbitMQ:

app1: Connected                                                                                                                                                                 app2: Connected                                                                                                                                                                 app1: Consuming                                                                                                                                                                 app1 [*] Waiting for messages. To exit press CTRL+C                                                                                                                             app2: Consuming                                                                                                                                                                 app2 [*] Waiting for messages. To exit press CTRL+C                                                                                                                                                                                                                                                                                                             
app2: TODO Handle connection error Error: Unexpected close                                                                                                                      
    at succeed (/Users/mikeb/space/amqp.node/lib/connection.js:271:13)                                                                                                          
    at onOpenOk (/Users/mikeb/space/amqp.node/lib/connection.js:253:5)                                                                                                          
    at /Users/mikeb/space/amqp.node/lib/connection.js:166:32                                                                                                                    
    at /Users/mikeb/space/amqp.node/lib/connection.js:160:12                                                                                                                    
    at Socket.recv (/Users/mikeb/space/amqp.node/lib/connection.js:498:12)                                                                                                      
    at Object.onceWrapper (events.js:273:13)                                                                                                                                    
    at Socket.emit (events.js:182:13)                                                                                                                                           
    at emitReadable_ (_stream_readable.js:531:12)                                                                                                                               
    at process._tickCallback (internal/process/next_tick.js:63:19)                                                                                                              
app1: TODO Handle connection error Error: Unexpected close                                                                                                                      
    at succeed (/Users/mikeb/space/amqp.node/lib/connection.js:271:13)                                                                                                          
    at onOpenOk (/Users/mikeb/space/amqp.node/lib/connection.js:253:5)                                                                                                          
    at /Users/mikeb/space/amqp.node/lib/connection.js:166:32                                                                                                                    
    at /Users/mikeb/space/amqp.node/lib/connection.js:160:12                                                                                                                    
    at Socket.recv (/Users/mikeb/space/amqp.node/lib/connection.js:498:12)                                                                                                      
    at Object.onceWrapper (events.js:273:13)                                                                                                                                    
    at Socket.emit (events.js:182:13)                                                                                                                                           
    at emitReadable_ (_stream_readable.js:531:12)                                                                                                                               
    at process._tickCallback (internal/process/next_tick.js:63:19)                                                                                                              
                                                                                                                                                                                
app2: Connected                                                                                                                                                                 
app2: Consuming
app2 [*] Waiting for messages. To exit press CTRL+C
app1: Connected
app1: Consuming
app1 [*] Waiting for messages. To exit press CTRL+C
^C

Is there an element of your suggestion that I've missed, @blazerguns?

Perhaps it's just the version of NodeJS being used. Or the way RabbitMQ disconnects; I'm killing it with -9, but actual connection drops rather than the process terminating could provoke a different behaviour.

Doing a bit of archaeology, I see that the version released at the time of the original report did not have the socket error handling I remarked on above, so perhaps there's nothing mysterious here after all: it didn't work then, it does work now, and #445 is a different problem.

@squaremo
Copy link
Collaborator

I see that the version released at the time of the original report did not have the socket error handling I remarked on above

Ugh, no, I'm wrong again (looking in the wrong file) -- it's been there for ages. Scratch that idea.

@blazerguns
Copy link

Let me work on a sample code for you to test. I am running on node 10.5.0. May be that?

@blazerguns
Copy link

also, RabbitMQ 3.5.7, Erlang 18.3

@Or-Priesender
Copy link

Hi, were having the same problem here, using version 0.5.2 of amqplib, the error received :
{2632} [5]09:06:07:313 - {ERROR!} Uncaught exception - app is crashing :Error: Heartbeat timeout
at Heart. (D:\home\site\wwwroot\node_modules\amqplib\lib\connection.js:426:19)
at emitNone (events.js:106:13)
at Heart.emit (events.js:208:7)
at Heart.runHeartbeat (D:\home\site\wwwroot\node_modules\amqplib\lib\heartbeat.js:88:17)
at ontimeout (timers.js:475:11)
at tryOnTimeout (timers.js:310:5)
at Timer.listOnTimeout (timers.js:270:5)

it is important to notice that this error is caught from 'uncaughtException' event, and not the error event from amqplib's connection or channel, which means we cannot implement a retry logic.

Is there a possible fix or workaround?
Thanks!

@Lawrence-Windsor
Copy link

Did this issue ever get a solution, i am facing it as of 8th March 2019?

...
events.js:174
      throw er; // Unhandled 'error' event
      ^

Error: Heartbeat timeout
    at Heart.<anonymous> (/usr/src/app/node_modules/amqplib/lib/connection.js:427:19)
    at Heart.emit (events.js:189:13)
    at Heart.runHeartbeat (/usr/src/app/node_modules/amqplib/lib/heartbeat.js:88:17)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
Emitted 'error' event at:
    at Heart.<anonymous> (/usr/src/app/node_modules/amqplib/lib/connection.js:428:12)
    at Heart.emit (events.js:189:13)
    [... lines matching original stack trace ...]
    at Timer.processTimers (timers.js:223:10)
...

@Or-Priesender
Copy link

For me the problem was socket availability of thw node process I was running, on an azure cloud virtual machine. It wasnt a problem with amqpnode itself.

@squaremo
Copy link
Collaborator

squaremo commented Mar 9, 2019

@Lawrence-Windsor Are you adding an error handler to the connection? If so, would you provide a (small as possible) test case that fails reliably, please.

@luissajunior
Copy link

Seems the problem is the version of node. With node 10 works and node 8 dont in my tests

@cressie176
Copy link
Collaborator

Closing, as the reports are quite old (most recent 2019) and none provided a repeatable test case. amqplib no longer supports node 8, so if that was the issue as suggested by @luissajunior (thanks), then the solution is to upgrade.

@Tychus
Copy link

Tychus commented Jun 1, 2022

Opening this again as i'm having the same issue.

I'm handling close error and blocked everything works fine when the broker is closed properly and the app will keep trying to reconnect as expected. But when i kill the broker (i'm running it through docker) i get both the error and close events as described in the doc but the app will crash right after.

  private async connect(): Promise<void> {
    if (this.connected) {
      return;
    }
    await this.connectionLock.runExclusive(async () => {
      if (this.connected) {
        return;
      }
      await pRetry(
        async () => {
          logger.info('Attempting rabbitMQ connection');
          this.connection = await client.connect(this.connectionSettings.getConnectionString());
        },
        {
          onFailedAttempt: (error) => {
            logger.error(`RabbitMq connection failed. Attempt ${error.attemptNumber}. Error: ${error.message}`);
            if (error.retriesLeft === 0) {
              this.connected = false;
              throw Error(`Failed to establish rabbitMq connection after ${this.maxConnectionRetry} attempts`);
            }
          },
          factor: 2,
          minTimeout: 1000,
          maxTimeout: 5000,
          retries: this.maxConnectionRetry,
        },
      );
      logger.info('RabbitMq connection successful');
      this.setEventListeners();
      this.eventEmitter.emit('connected');
      this.connected = true;
    });
  }

private setEventListeners(): void {
    this.connection.on('close', async (message) => {
      logger.warn(`RabbitMq connection lost: ${message}`);
      this.eventEmitter.emit('disconnected');
      await this.handleErrorEvents();
    });

    this.connection.on('error', async (error) => {
      logger.error(`RabbitMq connection error: ${error}`);
      this.eventEmitter.emit('error');
      await this.handleErrorEvents();
    });

    this.connection.on('blocked', async (reason) => {
      logger.warn(`RabbitMq connection blocked: ${reason}`);
      this.eventEmitter.emit('blocked');
      await this.disconnect();
      await this.handleErrorEvents();
    });
}

private async handleErrorEvents(): Promise<void> {
    this.connected = false;
    await this.connect();
  }

async disconnect(): Promise<void> {
  await this.connection.close();
  this.eventEmitter.emit('disconnected');
  this.connected = false;
}

Exception looks something like this:

[Node] 2022-06-01T08:23:12.546Z [error]: RabbitMq connection error: Error: Unexpected close ( connection receiving error event)
[Node] 2022-06-01T08:23:12.549Z [warn]: RabbitMq connection lost: Error: Unexpected close (connection receiving close event)
[Node] 2022-06-01T08:23:12.551Z [warn]: Bus connection lost. saving events for publishing on reconnection (publisher handling disconnect)
[Node] 2022-06-01T08:23:12.555Z [info]: Attempting rabbitMQ connection
[Node] Error: Unhandled error. (undefined)
[Node]     at new NodeError (node:internal/errors:372:5)
[Node]     at EventEmitter.emit (node:events:516:17)
[Node]     at EventEmitter.emit (node:domain:475:12)
[Node]     at ChannelModel.<anonymous> (/home/ty/Projects/rmq/node_modules/rmqlib/dist/messaging/rabbitMqBus/rabbitMqConnection.js:80:31)
[Node]     at ChannelModel.emit (node:events:527:28)
[Node]     at ChannelModel.emit (node:domain:475:12)
[Node]     at Connection.emit (node:events:527:28)
[Node]     at Connection.emit (node:domain:475:12)
[Node]     at Connection.C.onSocketError (/home/ty/Projects/rmq/node_modules/amqplib/lib/connection.js:353:10)
[Node]     at Socket.emit (node:events:539:35)
[Node]     at Socket.emit (node:domain:475:12)
[Node]     at endReadableNT (node:internal/streams/readable:1345:12)
[Node]     at processTicksAndRejections (node:internal/process/task_queues:83:21)
[Node] [nodemon] app crashed - waiting for file changes before starting...

node -v -> v16.15.0
PS: excuse the code i'm no TS/JS dev just trying to make things work

@AmosSpark
Copy link

cc: #690 (comment)

  • you can handle error on connection like so;
 connection.on("error", (error) => {
      logger.error(`Error occurred on connection: ${error}`);
      connection.close();
    });
  • and then for channel;
channel.on("error", (error) => {
      logger.error(`Error occurred on channel: ${error}`);
      channel.close();
      connection.close();
    });

@michaelklishin
Copy link

Unhandled Exception will take anything down. Your library is not handling an exception in the socket. Its that simple.

In any case https://www.npmjs.com/package/amqp handles this failure correctly so I'll just use that.

Just an FYI for those looking to solve all of their connection recovery problems "by using that other lib". That package is abandonware and is known to cause reconnection storms (and leaks) so severe that it is on the black list of client libraries are at least two RabbitMQ-as-a-Service providers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests