feat(ai): add hardware info from Orchestrators and expand network information available. #3246

Open
wants to merge 16 commits into
base: master
from

Conversation

ad-astra-video
Collaborator

What does this pull request do? Explain your changes. (required)

This PR lets a Gateway report richer Livepeer network state for use in monitoring or analysis processes. It could also be a first step toward exporting all relevant information from a Gateway so that an external selection process can be implemented using the orch webhook.

The Orchestrator is updated in a couple of ways to provide more information to the Gateway during the polling process. The AI-worker and ai-runner are updated to retrieve hardware information when the runner starts (PR #273). The hardware information is stored in memory in the AI-worker list and is removed when the AI-worker disconnects. Orchestrators can also now send additional information to Gateways in the OrchestratorInfo gRPC message: hardware information and capabilities prices, so that the Gateway receives every price the Orchestrator has set at the time of the GetOrchestrator request.

The Gateway is updated to save additional information from Orchestrators during the polling process. The new /getNetworkCapabilities endpoint lets the Gateway export useful information on the state of the network to external monitoring or analysis processes.

example data for an orchestrator

Open to all feedback and design change suggestions. I went with the local db because this is not intended to serve a real-time process and the polling process already saved some data to the db.

cc @ecmulli @gioelecerati

Specific updates (required)

Gateway Updates

  • Shorten polling interval to 25 minutes to keep information fresh.
  • The local db now saves the OrchestratorInfo JSON received from each Orchestrator during polling in a new remoteInfo column of the orchestrators table.
  • A /getNetworkCapabilities endpoint is added to the local CLI webserver to export lightly formatted network information, including capabilities_prices and the hardware deployed by Orchestrators for each pipeline/model id.
  • A /getOrchestratorInfo endpoint is added to the local CLI webserver to get the raw OrchestratorInfo from an Orchestrator.
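
As a rough illustration of the per pipeline/model grouping described for /getNetworkCapabilities, the sketch below regroups stored Orchestrator records by (pipeline, model id). It is written in Python for brevity rather than in go-livepeer's Go, and every field name in it (`address`, `capabilities_prices`, `hardware`, `gpu_info`, `price_per_unit`, `pixels_per_unit`) is an assumption made for the example, not the endpoint's actual response schema.

```python
from collections import defaultdict


def summarize_capabilities(orchestrators):
    """Group price and hardware entries by (pipeline, model_id).

    `orchestrators` is a list of dicts shaped like a decoded remoteInfo
    record; the keys used here are hypothetical.
    """
    summary = defaultdict(list)
    for orch in orchestrators:
        # Index this orchestrator's hardware entries by pipeline/model.
        hardware_by_key = {
            (hw["pipeline"], hw["model_id"]): hw.get("gpu_info", {})
            for hw in orch.get("hardware", [])
        }
        for price in orch.get("capabilities_prices", []):
            key = (price["pipeline"], price["model_id"])
            summary[key].append(
                {
                    "orchestrator": orch["address"],
                    "price_per_unit": price["price_per_unit"],
                    "pixels_per_unit": price["pixels_per_unit"],
                    # Empty when the orchestrator reported no hardware.
                    "hardware": hardware_by_key.get(key, {}),
                }
            )
    return dict(summary)
```

An external monitoring process could apply the same regrouping to the endpoint's output, or the Gateway could do it before responding; which side does the formatting is a design choice this sketch does not settle.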

Orchestrator Updates

  • The Orchestrator provides the additional information in the OrchestratorInfo response only when the capabilities included in the GetOrchestrator request are nil. All transcoding and AI job requests include capabilities, so the additional information is left out of those responses; this avoids increasing response size and processing overhead during actual work.
  • Hardware information is obtained by the AI-Worker, which polls the runner containers at startup and keeps the result in memory to provide to the Orchestrator at time of connection.
  • Orchestrators store the hardware information for each AI-Worker in memory when the AI-Worker connects and remove it when the AI-Worker disconnects.
  • The PriceInfo gRPC message gains two optional fields, capability and constraint, to support sending capabilities_prices to the Gateway.
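
The Orchestrator-side behaviour above can be sketched as follows. This is a minimal Python illustration of the two ideas (per-worker hardware kept in memory for the lifetime of the connection, and extended fields attached only when the request carries no capabilities), not the go-livepeer implementation. `WorkerHardwareRegistry`, `build_orchestrator_info`, and the dict keys are hypothetical names chosen for the example.

```python
class WorkerHardwareRegistry:
    """In-memory hardware info, keyed by connected AI-Worker."""

    def __init__(self):
        self._hardware = {}  # worker address -> list of hardware entries

    def worker_connected(self, worker_addr, hardware):
        # Hardware is reported once, when the worker connects.
        self._hardware[worker_addr] = hardware

    def worker_disconnected(self, worker_addr):
        # Drop the entry so stale hardware is never advertised.
        self._hardware.pop(worker_addr, None)

    def all_hardware(self):
        return [hw for entries in self._hardware.values() for hw in entries]


def build_orchestrator_info(base_info, registry, capabilities_prices,
                            request_capabilities=None):
    """Attach extended fields only when the request has no capabilities."""
    info = dict(base_info)
    if request_capabilities is None:
        # Polling-style request: include the full picture.
        info["hardware"] = registry.all_hardware()
        info["capabilities_prices"] = list(capabilities_prices)
    # Job requests carry capabilities and get the lean response.
    return info
```

Keying the extended response on the absence of capabilities means existing job requests are unaffected without introducing a new request flag.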

AI Runner Updates PR #273

  • The AI-Runner container adds new endpoints that expose hardware information. /hardware/info provides basic information on the hardware at startup and remains available after startup. /hardware/stats provides more focused information on GPU utilization to help Orchestrators get current information for monitoring.
    • Stats reporting needs additional build-out in go-livepeer to track which worker and GPU are assigned to a request. I believe this would be a separate gRPC message and workflow, so I propose doing the go-livepeer build-out in a separate PR.
    • In the AI-Runner, the routes and pipelines need to be updated so that the /hardware/stats endpoint is not blocked while a pipeline is running. This can be achieved by having the pipelines call outputs = await asyncio.to_thread([pipeline object], **kwargs), having the routes return await pipeline([...]), and making the route and pipeline call functions async where needed. I tested on the text-to-image and audio-to-text pipelines and confirmed that, without this change, the /hardware/stats endpoint is blocked until the pipeline call returns.
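
The blocking problem and the `asyncio.to_thread` approach can be seen in isolation with the sketch below. It uses only the standard library: `blocking_pipeline` is a stand-in for a synchronous model call and `hardware_stats` is a stand-in for the /hardware/stats handler. Neither is the ai-runner's actual code, and the real routes would be web framework handlers rather than plain coroutines.

```python
import asyncio
import time


def blocking_pipeline(prompt, steps=3):
    """Stand-in for a synchronous pipeline call that holds its thread."""
    for _ in range(steps):
        time.sleep(0.05)
    return f"output for {prompt}"


async def run_pipeline(**kwargs):
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other requests in the meantime.
    return await asyncio.to_thread(blocking_pipeline, **kwargs)


async def hardware_stats(polls, done):
    """Stand-in for repeated /hardware/stats requests during a job."""
    while not done.is_set():
        polls.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main():
    polls = []
    done = asyncio.Event()
    stats_task = asyncio.create_task(hardware_stats(polls, done))
    output = await run_pipeline(prompt="a cat", steps=3)
    done.set()
    await stats_task
    # len(polls) counts stats calls served while the pipeline ran.
    return output, len(polls)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If `run_pipeline` called `blocking_pipeline` directly instead of going through `asyncio.to_thread`, the event loop would be held for the whole call and the stats coroutine would not run until it returned. Note that a thread only helps when the blocking call releases the GIL, as `time.sleep` does here; whether a given pipeline does so is worth checking per pipeline.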

How did you test each of these updates (required)

Built a Docker image and deployed it on my mainnet Orchestrator.

Does this pull request close any open issues?

No

Checklist:

ad-astra-video and others added 4 commits November 11, 2024 21:20
*expand net.PriceInfo to include optional capability and constraint information
*update /getNetworkCapabilities to return summary information relevant for data aggregation
*add /getOrchestratorInfo endpoint to get raw OrchestratorInfo data for one Orchestrator
@rickstaa
Member

Thanks! For future reference, this was a follow-up to #3052.

codecov bot commented Nov 12, 2024

Codecov Report

Attention: Patch coverage is 49.20000% with 127 lines in your changes missing coverage. Please review.

Project coverage is 34.86813%. Comparing base (b87c7c3) to head (b4144f0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/handlers.go | 39.58333% | 55 Missing and 3 partials ⚠️ |
| net/lp_rpc.pb.go | 0.00000% | 25 Missing ⚠️ |
| common/db.go | 44.82759% | 15 Missing and 1 partial ⚠️ |
| core/ai_worker.go | 52.17391% | 11 Missing ⚠️ |
| core/orchestrator.go | 72.22222% | 6 Missing and 4 partials ⚠️ |
| server/ai_worker.go | 0.00000% | 7 Missing ⚠️ |

Additional details and impacted files

@@                 Coverage Diff                 @@
##              master       #3246         +/-   ##
===================================================
+ Coverage   34.74775%   34.86813%   +0.12038%     
===================================================
  Files            136         136                 
  Lines          36175       36400        +225     
===================================================
+ Hits           12570       12692        +122     
- Misses         22893       22989         +96     
- Partials         712         719          +7     

| Files with missing lines | Coverage Δ |
|---|---|
| core/ai.go | 58.51852% <ø> (ø) |
| core/livepeernode.go | 75.55556% <100.00000%> (+1.95556%) ⬆️ |
| discovery/db_discovery.go | 70.58824% <100.00000%> (+0.32802%) ⬆️ |
| discovery/discovery.go | 92.12598% <ø> (ø) |
| eth/watchers/stub.go | 99.63636% <100.00000%> (ø) |
| net/lp_rpc_grpc.pb.go | 9.93789% <ø> (ø) |
| server/rpc.go | 70.05814% <100.00000%> (+2.19027%) ⬆️ |
| server/webserver.go | 95.91837% <100.00000%> (+0.08504%) ⬆️ |
| server/ai_worker.go | 49.35065% <0.00000%> (-0.53994%) ⬇️ |
| core/orchestrator.go | 72.49115% <72.22222%> (-0.01193%) ⬇️ |

... and 4 more

... and 1 file with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@thomshutt
Contributor

@ad-astra-video is this ready for review?

@ad-astra-video
Collaborator Author

Yep! I will fix the editor config issue real quick.


3 participants