Exposing 9094 tcp port for alertmanager #420
Comments
Cool, do you want to contribute this?
Sure, I will take care of it; I just need a bit of free time to write the code. You can assign it to me.
I started looking at the possible changes and I have some doubts (mostly due to my lack of experience with Cortex). In order to expose 9094 correctly, we need to make adjustments in
Comparing with the YAML generated by cortex-jsonnet, I have the following questions:
1# - Why, by default, do we gossip with other cortex micro-services using
I will obviously try / break / tweak things locally :), but if there is an explanation somewhere of how things are supposed to communicate, I would be very interested to read it.
Memberlist is enabled by default.
Just missing. No real reason I guess. See explanation below.
Yeah, about that: it's confusing. It is actually not really relevant, technically, what you edit in the Port section of a
Exposing a port here gives the system additional information about the network connections a container uses, but is primarily informational. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Cannot be updated.
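For illustration, here is a minimal sketch of what declaring the gossip port on the alertmanager container could look like. The port names `gossip-tcp` and `gossip-udp` are assumptions made for this example, not necessarily what the chart uses, and, as the quoted text says, the declaration is primarily informational.

```yaml
# Hypothetical excerpt of a container spec (port names are illustrative).
# Declaring 9094 documents the gossip port for both protocols; it does not
# by itself open or close anything at the network level.
ports:
  - name: gossip-tcp
    containerPort: 9094
    protocol: TCP
  - name: gossip-udp
    containerPort: 9094
    protocol: UDP
```

Two entries are listed because the port is meant to be reachable over both TCP and UDP for cluster gossip.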
To implement this all you really have to do is.
The information provided is very useful and interesting, thanks as always @nschad.
It does not seem that alertmanager needs more flags to enable cluster mode; in fact, cluster mode seems to be enabled by default. First config, without cluster key + no statefulset:
Second config, with cluster key + statefulset:
Am I missing something here?
@AlexandreRoux Uhm. I don't know 🤷. Sorry. Also reading this
I'm not really sure how to proceed... However, I think the missing piece might be the
@nschad - Same here, I am not too sure how to proceed... and I am thinking of following up on the #cortex or #prometheus Slack channel to dive further with the help of the community. In an attempt to answer the questions, let's compare the HA documentation with some observations:
A# First observation
Let's see what happens when we don't use the config.alertmanager.cluster key:
L214 and L255 of alertmanager main.go imply that the HA cluster is enabled by default unless an empty string is passed to
The log lines then show cluster.go as the caller, implying that alertmanager is running in cluster mode.
This is correct, and I believe that passing the list of alertmanager pods is the right way, BUT it has no bearing on which mode alertmanager runs in.
B# Second observation
Let's see what happens when we use the config.alertmanager.cluster key with an empty string:
I was able to notice that cluster.go is completely removed from the pod logs, and multitenant_alertmanager/status states that gossip-based clustering is disabled. This again seems to match the alertmanager documentation and code.
C# Conclusion
In conclusion, this leads me to believe the following:
@AlexandreRoux Thank you for your work. Okay, this is my take. Basically, alertmanager was and is already running in "cluster" mode. However, it does not know who its peers are, hence the occasional EOF. When I configured the alertmanager peers via a headless service, the members do show up in the
This does not happen when you do not configure it. It's likely that you now have 4 "alertmanager" clusters when using a
So to me it seems that all we have to do is configure the peers?
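As a rough sketch of the headless-service approach mentioned above (not the chart's actual template), a Service along these lines would let each alertmanager pod be discovered through DNS. The Service name, the selector label and the port names are all assumptions made for this example.

```yaml
# Hypothetical headless Service for alertmanager gossip.
# Name, selector and port names are illustrative, not taken from the chart.
apiVersion: v1
kind: Service
metadata:
  name: cortex-alertmanager-headless
spec:
  clusterIP: None                  # headless: DNS resolves to the pod IPs
  publishNotReadyAddresses: true   # let peers find each other before readiness
  selector:
    app.kubernetes.io/component: alertmanager
  ports:
    - name: gossip-tcp
      port: 9094
      targetPort: 9094
      protocol: TCP
    - name: gossip-udp
      port: 9094
      targetPort: 9094
      protocol: UDP
```

With a Service like this, the peers setting under config.alertmanager.cluster would presumably point at the Service's DNS name on port 9094, for example `cortex-alertmanager-headless.<namespace>.svc.cluster.local:9094`; the exact value depends on the release name and namespace.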
Sorry for the delay in coming back to this one...
I have pushed #435 to correct the headless service :)
I am 90% sure I was able to catch EOF while having all my alertmanagers listed under
I am (unfortunately) able to reproduce EOF with a correctly configured cluster setup:
Here I am running with PR #435 to get my headless service corrected:
Even though I really agree here, I am going to run more tests by tweaking the Port section, as friedrichg got his issue fixed this way.
I am also going to try replacing
I will keep you posted.
I have run further tests and I am (unfortunately again) able to reproduce EOF. Here are my changes / results:
1# Fix the alertmanager Port section by correctly exposing 9094 for cluster gossip over TCP + UDP
2# Change the value of
I came back 24 hours after the changes and, checking my ruler container logs (level: error), I can still find EOF:
It is worth mentioning that the alertmanager logs are empty, most likely because the EOF is a result of the ruler closing the connection before the alertmanager idle connection timeout (5 min). This has been discussed in: cortexproject/cortex#4958 (comment)
I will edit my PR to include the correction of the Port section with 9094 if we are interested in having it in the helm chart.
My understanding is that alertmanager uses port 9094 to communicate between peers in cluster mode, but master/templates/alertmanager has no reference to exposing this port.
Configuring alertmanager in cluster mode can be done via:
The above configuration seems to be working:
However, I believe we should expose 9094 correctly, as is done in cortex-jsonnet (https://github.com/cortexproject/cortex-jsonnet):
In addition, I raised an issue with the Cortex project about a known EOF error in the ruler when sending alerts to alertmanager, and the outcome seems to be to expose 9094 correctly via the statefulset. cortexproject/cortex#4958