-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RelayMiner]: add proxy.Ping(...)
capability to test connectivity between relay servers and backend URLs
#744
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis update introduces a new Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (2)
pkg/relayer/relayminer.go (1)
138-154
: Log error details for better debugging.In the
ServePing
function, when an error occurs during thePing
call, consider logging the error details before returning a 500 status. This will help in diagnosing issues more effectively.if errs := rel.relayerProxy.Ping(ctx); errs != nil { + rel.logger.Error().Err(errs).Msg("failed to ping relay servers") w.WriteHeader(http.StatusInternalServerError) return }
pkg/relayer/proxy/proxy_test.go (1)
155-158
: Consider extending test coverage forPing
method.While the current test ensures no errors are returned, consider adding scenarios to test edge cases, such as:
- Handling of unreachable URLs.
- Different HTTP status codes.
- Timeout scenarios.
This will ensure comprehensive coverage of the
Ping
method's behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (1)
pkg/relayer/proxy/proxy.go (1)
203-203
: Fix typographical error in log message.The log message contains a typographical error: "occured" should be "occurred".
- Msg("an unexpected error occured while pinging backend URL") + Msg("an unexpected error occurred while pinging backend URL")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddyzags For a first contribution, this is super impressive!! 💪
Left some comments but great job otherwise. 🙌
Let's finish it off and lmk if you're interested in contributing more as we get closer to BETA.
a5eb442
to
a48b247
Compare
Hey @Olshansk 👋🏾 ! Thanks for taking the time to review the code and the feedback.
Yes let's tackle that. I will make changes to the latest comments by Monday at the latest. I am also working on the forward feature (see issue comment). I'll open a PR as soon as I can.
And yes, I am definitely interested in contributing more 🚀 |
@eddyzags Sounds great! Please re-request a review when you're ready. I will take another look then. |
@eddyzags Just FYI: #744 (comment) |
@eddyzags Another bump |
Hey @Olshansk I've just added the last requested modifications:
As the Ping safeguard implementation won't start the Relay Miner pod if the suppliers aren't reachable, I had to clean the Relay Miner's Helm value files when starting the Localnet environment based on the constraints below:
In other words, this new logic forces us to only reference the available suppliers in the Relay Miner's configuration. Error message:
The Relay miner configuration seems to expects all the suppliers list (Ollama, Anvi & Rest) even if the supplier's container isn't deployed. If this is true, the Relay Miner won't be able to start with the Ping safeguard if one supplier isn't deployed. |
@eddyzags Appreciate the detailed feedback. Re review - I will take a look next week but hopefully @bryanchriswhite can help out as well.
Re this error. I'll take a deeper look next week but in the meantime, can you try:
This might be something historical we've already solved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddyzags Please merge with main
and will re-review then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super cool feature @eddyzags 😎 - thanks for the contribution! 🚀 🙌
(And apologies for the review delay 😅)
Configures a `ping` server to test the connectivity of every backend URLs. If | ||
all the backend URLs are reachable, the endpoint returns a 200 HTTP | ||
Code. Otherwise, if one or more backend URLs aren't reachable, the service | ||
returns an 500 HTTP Internal server error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using 204 No Content for the success and 503 Service Unavailable for the error status codes? I think they could be more appropriate than 200 and 500.
Configures a `ping` server to test the connectivity of every backend URLs. If | |
all the backend URLs are reachable, the endpoint returns a 200 HTTP | |
Code. Otherwise, if one or more backend URLs aren't reachable, the service | |
returns an 500 HTTP Internal server error. | |
Configures a `ping` server to test the connectivity of all backend URLs. If | |
all the backend URLs are reachable, the endpoint returns a 200 HTTP | |
Code. If one or more backend URLs aren't reachable, the service | |
returns an 500 HTTP Internal server error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I think those status codes are more appropriate.
I am on it 👍🏾
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about 503 Service Unavailable
in the end. The Relayminer server was ready to handle the request, but subsequent calls are not possible. Therefore, there is an internal error in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate on "The Relayminer server was ready to handle the request, subsequent calls are not possible" I'm not sure I follow?
Do you think 504 Gateway Timeout is more appropriate?
My thinking is that 500 is quite generic and we should consider using a more specific status if one applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is that 500 is quite generic and we should consider using a more specific status if one applies.
I agree. 500 can be generic, and we must use a more specific status code. I also think we shouldn't use 500 because that implies that the server (in this case, relayminer
) is completely unusable. But that isn't entirely the case. AFAIK if one supplier isn't reachable anymore, the other configured ones can still function, and the relayminer can continue to mine requests for them.
Could you elaborate on "The Relayminer server was ready to handle the request, subsequent calls are not possible" I'm not sure I follow?
In my mind, there are three actors here:
- The client to the
/ping
endpoint. (could becurl -x GET <addr>:<port>/ping
) - The server (
relayminer
) - The third-party service (
supplier
)
What I meant by "The Relayminer server was ready/able to handle requests" is the interaction between 1 & 2. And what I meant by the "subsequent calls are not possible" is the interaction between 2 & 3. I hope that makes sense.
My problem with 503 Service Unavailable
is that it implies a temporary state. But we are not so sure about that. The supplier might never come back online.
So yes, I think 504 Gateway Timeout
is more appropriate if the context timeout. But if the the third-party (supplier
) return an unexpected response, we must use 503 Bad Gateway
IMO. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional diff?
Same goes for all files in the loalnet/kubernetes directory.
One simple way to revert these would be to do something like:
git fetch origin
git checkout origin/main -- localnet/kubernetes
git add localnet/kubernetes
git commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these changes were intentional and required for this PR.
In the main branch, you can see that Tilt is setting the Kubernetes Values for the Relayminer's configuration (1, 2 & 3) where all the suppliers are defined (see here for Relayminer 1) whether the localnet_config_default
(see here) actually deploys the Kubernetes pod or not. This is a problem because these PR modifications prevent the Relayminer application from starting if one or more Backend URLs aren't reachable.
I needed to find a way to:
Pseudocode:
* If Title `localnet_config` enables Ollama, rest, and anvil deployment
* Use a Relayminer configuration where Ollama, Rest, and Anvil are defined as suppliers.
* If Title `localnet_config` enables Ollama, Rest. But disabled Anvil
* Use a Relayminer configuration where Ollama and Rest are defined as suppliers. But not Anvil
* If Title `localnet_config` enables Ollama. But disabled Rest and Anvil
* Use a Relayminer configuration where Ollama supplier is defined as a supplier. But not Rest, Anvil.
, etc...
Maybe there is some better way to define the according suppliers based on the localnet_config dynamically
. I am open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(cc @okdas)
@eddyzags Do you still plan on finishing off this work? |
…ty between relay servers and backend URLs (#1) * relayer: add RelayServers() method to RelayProxy interface; Add Ping(), ServiceIDs(), Forward() method to RelayServer interface; add RelayServers slice with helper method byServiceID * relayer: add forward config entry * relayer: implement ServiceIDs, Forward, and Ping method for synchrounous RPC server * relayer: add RelayServers implementation for RelayProxy * relayer: add Ping and Forward options * relayer: integrate ping option * relayer: add ServePing and ServeForward method to RelayMiner * test proxy.Ping() in test + remove forward feature * add serve ping test * add doc
pkg/relayer/cmd/cmd.go
Outdated
ln, err := net.Listen("tcp", relayMinerConfig.Ping.Addr) | ||
if err != nil { | ||
return fmt.Errorf("failed to listen ping server: %w", err) | ||
} | ||
|
||
if err := relayMiner.ServePing(ctx, ln); err != nil { | ||
return fmt.Errorf("failed to start ping server: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! 🙌 😄
My rationale is to improve alignment with the single responsibility principle. In this case, a change in the relayminer ping config structure or server implementation SHOULD NOT cause this function to change (the main relayer CLI entrypoint).
Notice how this would be consistent with the other config-based conditional branches, which are encapsulated into their own methods (i.e. #ServeMetrics()
, #ServePprof()
).
@@ -134,6 +135,17 @@ func runRelayer(cmd *cobra.Command, _ []string) error { | |||
} | |||
} | |||
|
|||
if relayMinerConfig.Ping.Enabled { | |||
ln, err := net.Listen("tcp", relayMinerConfig.Ping.Addr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not immediately obvious to me that net.SplitHostPort()
should be called as a result of calling net.Listen()
; however, a simple test seems to indicate that is the case as the error messages trace back to net.SplitHostPort()
:
package temp
import (
"net"
"testing"
"github.com/stretchr/testify/require"
)
func TestNoPort(t *testing.T) {
ln, err := net.Listen("tcp", "localhost")
require.NoError(t, err)
require.NotNil(t, ln)
}
func TestTooManyColons(t *testing.T) {
ln, err := net.Listen("tcp", "localhost::9000")
require.NoError(t, err)
require.NotNil(t, ln)
}
func TestURLInsteadOfHostPort(t *testing.T) {
ln, err := net.Listen("tcp", "http://localhost")
require.NoError(t, err)
require.NotNil(t, ln)
}
func TestFQDN(t *testing.T) {
ln, err := net.Listen("tcp", "example.com:9000")
require.NoError(t, err)
require.NotNil(t, ln)
}
@eddyzags I suppose the output of this test also indicates that the UX would be that if the address is invalid, then the relayminer fails to start, correct?
=== RUN TestNoPort
test_test.go:12:
Error Trace: /home/bwhite/Projects/pokt/poktroll/temp/test_test.go:12
Error: Received unexpected error:
listen tcp: address localhost: missing port in address
Test: TestNoPort
--- FAIL: TestNoPort (0.00s)
=== RUN TestTooManyColons
test_test.go:18:
Error Trace: /home/bwhite/Projects/pokt/poktroll/temp/test_test.go:18
Error: Received unexpected error:
listen tcp: address localhost::9000: too many colons in address
Test: TestTooManyColons
--- FAIL: TestTooManyColons (0.00s)
=== RUN TestURLInsteadOfHostPort
test_test.go:24:
Error Trace: /home/bwhite/Projects/pokt/poktroll/temp/test_test.go:24
Error: Received unexpected error:
listen tcp: lookup tcp///localhost: unknown port
Test: TestURLInsteadOfHostPort
--- FAIL: TestURLInsteadOfHostPort (0.03s)
=== RUN TestFQDN
test_test.go:30:
Error Trace: /home/bwhite/Projects/pokt/poktroll/temp/test_test.go:30
Error: Received unexpected error:
listen tcp 93.184.215.14:9000: bind: cannot assign requested address
Test: TestFQDN
--- FAIL: TestFQDN (0.03s)
FAIL
Process finished with the exit code 1
pkg/relayer/proxy/synchronous.go
Outdated
_ = resp.Body.Close() | ||
|
||
if resp.StatusCode >= http.StatusInternalServerError { | ||
return ErrRelayerProxySupplierNotReachable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should interpolate which endpoint was not reachable:
return ErrRelayerProxySupplierNotReachable | |
return ErrRelayerProxySupplierNotReachable.Wrapf( | |
"endpoint URL %q status: %d", | |
endpointUrl, resp.StatusCode, | |
) |
// Ping tries to dial the suppliers backend URLs to test the connection. | ||
func (sync *synchronousRPCServer) Ping(ctx context.Context) error { | ||
for _, supplierCfg := range sync.serverConfig.SupplierConfigsMap { | ||
httpClient := &http.Client{Timeout: 2 * time.Second} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
httpClient := &http.Client{Timeout: 2 * time.Second} | |
httpClient := &http.Client{Timeout: 2 * time.Second} | |
endpointUrl := supplierCfg.ServiceConfig.BackendUrl.String() |
// YAMLRelayMinerPingConfig represents the configuration to expose a ping server. | ||
type YAMLRelayMinerPingConfig struct { | ||
Enabled bool `yaml:"enabled"` | ||
// Addr is the address to bind to (format: 'hostname:port') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Addr is the address to bind to (format: 'hostname:port') | |
// Addr is the address to bind to (format: 'hostname:port' where 'hostname' can be a DNS name or an IP) |
rp.logger.Error().Err(err). | ||
Msg("an unexpected error occured while pinging backend URL") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think an error message is necessary here, especially if we add more detail elsewhere as suggested.
rp.logger.Error().Err(err). | |
Msg("an unexpected error occured while pinging backend URL") | |
rp.logger.Error().Err(err).Send() |
@@ -57,3 +60,72 @@ func TestRelayMiner_StartAndStop(t *testing.T) { | |||
err = relayminer.Stop(ctx) | |||
require.NoError(t, err) | |||
} | |||
|
|||
func TestRelayMiner_Ping(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this all the test coverage that there is or did I miss something?
I would love to see coverage over a few more scenarios:
- Multi-supplier relayminer success.
- Relayminer with multiple endpoint, including a DNS name, ipv4 address and ipv6 address success
Mocking the DNS name is pretty straightforward with the go-mockdns package. - Multi-endpoint partial success/failure on startup (i.e. some endpoints should be unreachable on startup)
- Multi-endpoint partial success/failure post-startup (i.e. some endpoints should become unreachable after startup)
Consider refactoring this test code into a test suite to de-duplicate common code by moving it into methods on the suite and storing case-specific test state on the suite struct (per test case).
], | ||
] | ||
|
||
if localnet_config["rest"]["enabled"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need multiple configuration files, and not just one?
Summary
This PR adds the capability to test the connectivity between the Relay Servers and the Backend URLs in two ways.
Safeguard at Startup:
For every
suppliers.[].service_config.backend_url
referenced as input inside the Relay Miner Configuration file, the Relay Proxy will verify wether the network connection between the targetedbackend_url
and the relayerminer process is functioning properly. If one or more connections aren't possible, the relay miner won't be able to start.Configurable Ping HTTP server:
The Relay Miner process will listen for incoming request to synchronously test the connectivity of every referenced
suppliers.[].service_config.backend_url
. If one or more backend URLs aren't reachable, the incoming request will fail.Based on the
serverConfig.ServerType
(Example: HTTP), each Server Type will implement their own logic to implement to test the connectivity.Issue
Type of change
Select one or more:
Testing
Documentation changes (only if making doc changes)
make docusaurus_start
; only needed if you make doc changesLocal Testing (only if making code changes)
make go_develop_and_test
make test_e2e
PR Testing (only if making code changes)
devnet-test-e2e
label to the PR.make trigger_ci
if you want to re-trigger tests without any code changesSanity Checklist
Summary by CodeRabbit
Summary by CodeRabbit
New Features
ping
functionality, allowing users to test backend connectivity within the relay miner's setup.Bug Fixes
Tests