Make descriptions of bacalhau config settings more understandable for non-experts #4606

Open
aronchick opened this issue Oct 8, 2024 · 0 comments
Labels
  • request/new (Request: Indicates a new request that has been submitted and awaits initial triage)
  • th/config (Theme: Related to configuration files and settings across the project)
  • th/oncall (Theme: Issues to be addressed by the oncall)
  • th/user-experience (Theme: Issues aimed at improving the end-user experience)
  • type/bug (Type: Something is not working as expected)

Comments

@aronchick
Collaborator

Here is the original:

❯ bacalhau config list
 KEY                                               VALUE                        DESCRIPTION
 API.Auth.AccessPolicyPath                                                      AccessPolicyPath is the path to a file or directory that will be loaded as the
                                                                                policy to apply to all inbound API requests. If unspecified, a policy that
                                                                                permits access to all API endpoints to both authenticated and unauthenticated
                                                                                users (the default as of v1.2.0) will be used.
 API.Auth.Methods                                  map[ClientKey:{challenge }]  Methods maps "method names" to authenticator implementations. A method name is a
                                                                                human-readable string chosen by the person configuring the system that is shown
                                                                                to users to help them pick the authentication method they want to use. There can
                                                                                be multiple usages of the same Authenticator *type* but with different configs
                                                                                and parameters, each identified with a unique method name. For example, if an
                                                                                implementation wants to allow users to log in with Github or Bitbucket, they
                                                                                might both use an authenticator implementation of type "oidc", and each would
                                                                                appear once on this provider with key / method name "github" and "bitbucket". By
                                                                                default, only a single authentication method that accepts authentication via
                                                                                client keys will be enabled.
 API.Host                                          0.0.0.0                      Host specifies the hostname or IP address on which the API server listens or the
                                                                                client connects.
 API.Port                                          1234                         Port specifies the port number on which the API server listens or the client
                                                                                connects.
 API.TLS.AutoCert                                                               AutoCert specifies the domain for automatic certificate generation.
 API.TLS.AutoCertCachePath                                                      AutoCertCachePath specifies the directory to cache auto-generated certificates.
 API.TLS.CAFile                                                                 CAFile specifies the path to the Certificate Authority file.
 API.TLS.CertFile                                                               CertFile specifies the path to the TLS certificate file.
 API.TLS.Insecure                                  false                        Insecure allows insecure TLS connections (e.g., self-signed certificates).
 API.TLS.KeyFile                                                                KeyFile specifies the path to the TLS private key file.
 API.TLS.SelfSigned                                false                        SelfSigned indicates whether to use a self-signed certificate.
 API.TLS.UseTLS                                    false                        UseTLS indicates whether to use TLS for client connections.
 Compute.AllocatedCapacity.CPU                     70%                          CPU specifies the amount of CPU a compute node allocates for running jobs. It
                                                                                can be expressed as a percentage (e.g., "85%") or a Kubernetes resource string
                                                                                (e.g., "100m").
 Compute.AllocatedCapacity.Disk                    70%                          Disk specifies the amount of Disk space a compute node allocates for running
                                                                                jobs. It can be expressed as a percentage (e.g., "85%") or a Kubernetes resource
                                                                                string (e.g., "10Gi").
 Compute.AllocatedCapacity.GPU                     100%                         GPU specifies the amount of GPU a compute node allocates for running jobs. It
                                                                                can be expressed as a percentage (e.g., "85%") or a Kubernetes resource string
                                                                                (e.g., "1"). Note: When using percentages, the result is always rounded up to
                                                                                the nearest whole GPU.
 Compute.AllocatedCapacity.Memory                  70%                          Memory specifies the amount of Memory a compute node allocates for running jobs.
                                                                                It can be expressed as a percentage (e.g., "85%") or a Kubernetes resource
                                                                                string (e.g., "1Gi").
 Compute.AllowListedLocalPaths                     [/tmp /data]                 AllowListedLocalPaths specifies a list of local file system paths that the
                                                                                compute node is allowed to access.
 Compute.Auth.Token                                                             Token specifies the key for compute nodes to be able to access the orchestrator.
 Compute.Enabled                                   false                        Enabled indicates whether the compute node is active and available for job
                                                                                execution.
 Compute.Heartbeat.InfoUpdateInterval              1m0s                         InfoUpdateInterval specifies the time between updates of non-resource
                                                                                information to the orchestrator.
 Compute.Heartbeat.Interval                        5s                           Interval specifies the time between heartbeat signals sent to the orchestrator.
 Compute.Heartbeat.ResourceUpdateInterval          30s                          ResourceUpdateInterval specifies the time between updates of resource
                                                                                information to the orchestrator.
 Compute.Orchestrators                             [nats://127.0.0.1:4222]      Orchestrators specifies a list of orchestrator endpoints that this compute node
                                                                                connects to.
 DataDir                                           /Users/daaronch/.bacalhau    DataDir specifies a location on disk where the bacalhau node will maintain
                                                                                state.
 DisableAnalytics                                  false                        No description available
 Engines.Disabled                                  []                           Disabled specifies a list of engines that are disabled.
 Engines.Types.Docker.ManifestCache.Refresh        1h0m0s                       Refresh specifies the refresh interval for cache entries.
 Engines.Types.Docker.ManifestCache.Size           1000                         Size specifies the size of the Docker manifest cache.
 Engines.Types.Docker.ManifestCache.TTL            1h0m0s                       TTL specifies the time-to-live duration for cache entries.
 FeatureFlags.ExecTranslation                      false                        ExecTranslation enables the execution translation feature.
 InputSources.Disabled                             []                           Disabled specifies a list of storages that are disabled.
 InputSources.MaxRetryCount                        3                            ReadTimeout specifies the maximum number of attempts for reading from a storage.
 InputSources.ReadTimeout                          5m0s                         ReadTimeout specifies the maximum time allowed for reading from a storage.
 InputSources.Types.IPFS.Endpoint                                               Endpoint specifies the multi-address to connect to for IPFS. e.g
                                                                                /ip4/127.0.0.1/tcp/5001
 JobAdmissionControl.AcceptNetworkedJobs           true                         AcceptNetworkedJobs indicates whether to accept jobs that require network
                                                                                access.
 JobAdmissionControl.Locality                      anywhere                     Locality specifies the locality of the job input data.
 JobAdmissionControl.ProbeExec                                                  ProbeExec specifies the command to execute for probing job submission.
 JobAdmissionControl.ProbeHTTP                                                  ProbeHTTP specifies the HTTP endpoint for probing job submission.
 JobAdmissionControl.RejectStatelessJobs           false                        RejectStatelessJobs indicates whether to reject stateless jobs, i.e. jobs
                                                                                without inputs.
 JobDefaults.Batch.Priority                        0                            Priority specifies the default priority allocated to a batch or ops job. This
                                                                                value is used when the job hasn't explicitly set its priority requirement.
 JobDefaults.Batch.Task.Publisher.Params           map[]                        Params specifies the publisher configuration data.
 JobDefaults.Batch.Task.Publisher.Type                                          Type specifies the publisher type. e.g. "s3", "local", "ipfs", etc.
 JobDefaults.Batch.Task.Resources.CPU              500m                         CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when
                                                                                the task hasn't explicitly set its CPU requirement.
 JobDefaults.Batch.Task.Resources.Disk                                          Disk specifies the default amount of disk space allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is
                                                                                used when the task hasn't explicitly set its disk space requirement.
 JobDefaults.Batch.Task.Resources.GPU                                           GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "1" for 1 GPU). This value is used when the task
                                                                                hasn't explicitly set its GPU requirement.
 JobDefaults.Batch.Task.Resources.Memory           1Gb                          Memory specifies the default amount of memory allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value
                                                                                is used when the task hasn't explicitly set its memory requirement.
 JobDefaults.Batch.Task.Timeouts.ExecutionTimeout  0s                           ExecutionTimeout is the maximum time allowed for task execution
 JobDefaults.Batch.Task.Timeouts.TotalTimeout      0s                           TotalTimeout is the maximum total time allowed for a task
 JobDefaults.Daemon.Priority                       0                            Priority specifies the default priority allocated to a service or daemon job.
                                                                                This value is used when the job hasn't explicitly set its priority requirement.
 JobDefaults.Daemon.Task.Resources.CPU             500m                         CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when
                                                                                the task hasn't explicitly set its CPU requirement.
 JobDefaults.Daemon.Task.Resources.Disk                                         Disk specifies the default amount of disk space allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is
                                                                                used when the task hasn't explicitly set its disk space requirement.
 JobDefaults.Daemon.Task.Resources.GPU                                          GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "1" for 1 GPU). This value is used when the task
                                                                                hasn't explicitly set its GPU requirement.
 JobDefaults.Daemon.Task.Resources.Memory          1Gb                          Memory specifies the default amount of memory allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value
                                                                                is used when the task hasn't explicitly set its memory requirement.
 JobDefaults.Ops.Priority                          0                            Priority specifies the default priority allocated to a batch or ops job. This
                                                                                value is used when the job hasn't explicitly set its priority requirement.
 JobDefaults.Ops.Task.Publisher.Params             map[]                        Params specifies the publisher configuration data.
 JobDefaults.Ops.Task.Publisher.Type                                            Type specifies the publisher type. e.g. "s3", "local", "ipfs", etc.
 JobDefaults.Ops.Task.Resources.CPU                500m                         CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when
                                                                                the task hasn't explicitly set its CPU requirement.
 JobDefaults.Ops.Task.Resources.Disk                                            Disk specifies the default amount of disk space allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is
                                                                                used when the task hasn't explicitly set its disk space requirement.
 JobDefaults.Ops.Task.Resources.GPU                                             GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "1" for 1 GPU). This value is used when the task
                                                                                hasn't explicitly set its GPU requirement.
 JobDefaults.Ops.Task.Resources.Memory             1Gb                          Memory specifies the default amount of memory allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value
                                                                                is used when the task hasn't explicitly set its memory requirement.
 JobDefaults.Ops.Task.Timeouts.ExecutionTimeout    0s                           ExecutionTimeout is the maximum time allowed for task execution
 JobDefaults.Ops.Task.Timeouts.TotalTimeout        0s                           TotalTimeout is the maximum total time allowed for a task
 JobDefaults.Service.Priority                      0                            Priority specifies the default priority allocated to a service or daemon job.
                                                                                This value is used when the job hasn't explicitly set its priority requirement.
 JobDefaults.Service.Task.Resources.CPU            500m                         CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when
                                                                                the task hasn't explicitly set its CPU requirement.
 JobDefaults.Service.Task.Resources.Disk                                        Disk specifies the default amount of disk space allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is
                                                                                used when the task hasn't explicitly set its disk space requirement.
 JobDefaults.Service.Task.Resources.GPU                                         GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes
                                                                                resource string format (e.g., "1" for 1 GPU). This value is used when the task
                                                                                hasn't explicitly set its GPU requirement.
 JobDefaults.Service.Task.Resources.Memory         1Gb                          Memory specifies the default amount of memory allocated to a task. It uses
                                                                                Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value
                                                                                is used when the task hasn't explicitly set its memory requirement.
 Labels                                            map[]                        Labels are key-value pairs used to describe and categorize the nodes.
 Logging.Level                                     info                         Level sets the logging level. One of: trace, debug, info, warn, error, fatal,
                                                                                panic.
 Logging.LogDebugInfoInterval                      30s                          LogDebugInfoInterval specifies the interval for logging debug information.
 Logging.Mode                                      default                      Mode specifies the logging mode. One of: default, json.
 NameProvider                                      puuid                        NameProvider specifies the method used to generate names for the node. One of:
                                                                                hostname, aws, gcp, uuid, puuid.
 Orchestrator.Advertise                                                         Advertise specifies URL to advertise to other servers.
 Orchestrator.Auth.Token                                                        Token specifies the key for compute nodes to be able to access the orchestrator
 Orchestrator.Cluster.Advertise                                                 Advertise specifies the address to advertise to other cluster members.
 Orchestrator.Cluster.Host                                                      Host specifies the hostname or IP address for cluster communication.
 Orchestrator.Cluster.Name                                                      Name specifies the unique identifier for this orchestrator cluster.
 Orchestrator.Cluster.Peers                        []                           Peers is a list of other cluster members to connect to on startup.
 Orchestrator.Cluster.Port                         0                            Port specifies the port number for cluster communication.
 Orchestrator.Enabled                              false                        Enabled indicates whether the orchestrator node is active and available for job
                                                                                submission.
 Orchestrator.EvaluationBroker.MaxRetryCount       10                           MaxRetryCount specifies the maximum number of times an evaluation can be retried
                                                                                before being marked as failed.
 Orchestrator.EvaluationBroker.VisibilityTimeout   1m0s                         VisibilityTimeout specifies how long an evaluation can be claimed before it's
                                                                                returned to the queue.
 Orchestrator.Host                                 0.0.0.0                      Host specifies the hostname or IP address on which the Orchestrator server
                                                                                listens for compute node connections.
 Orchestrator.NodeManager.DisconnectTimeout        5s                           DisconnectTimeout specifies how long to wait before considering a node
                                                                                disconnected.
 Orchestrator.NodeManager.ManualApproval           false                        ManualApproval, if true, requires manual approval for new compute nodes joining
                                                                                the cluster.
 Orchestrator.Port                                 4222                         Host specifies the port number on which the Orchestrator server listens for
                                                                                compute node connections.
 Orchestrator.Scheduler.HousekeepingInterval       30s                          HousekeepingInterval specifies how often to run housekeeping tasks.
 Orchestrator.Scheduler.HousekeepingTimeout        2m0s                         HousekeepingTimeout specifies the maximum time allowed for a single housekeeping
                                                                                run.
 Orchestrator.Scheduler.QueueBackoff               1m0s                         QueueBackoff specifies the time to wait before retrying a failed job.
 Orchestrator.Scheduler.WorkerCount                12                           WorkerCount specifies the number of concurrent workers for job scheduling.
 Publishers.Disabled                               []                           Disabled specifies a list of publishers that are disabled.
 Publishers.Types.IPFS.Endpoint                                                 Endpoint specifies the multi-address to connect to for IPFS. e.g
                                                                                /ip4/127.0.0.1/tcp/5001
 Publishers.Types.Local.Address                    127.0.0.1                    Address specifies the endpoint the publisher serves on.
 Publishers.Types.Local.Port                       6001                         Port specifies the port the publisher serves on.
 Publishers.Types.S3.PreSignedURLDisabled          false                        PreSignedURLDisabled specifies whether pre-signed URLs are enabled for the S3
                                                                                provider.
 Publishers.Types.S3.PreSignedURLExpiration        0s                           PreSignedURLExpiration specifies the duration before a pre-signed URL expires.
 ResultDownloaders.Disabled                        []                           Disabled is a list of downloaders that are disabled.
 ResultDownloaders.Timeout                         0s                           Timeout specifies the maximum time allowed for a download operation.
 ResultDownloaders.Types.IPFS.Endpoint                                          Endpoint specifies the multi-address to connect to for IPFS. e.g
                                                                                /ip4/127.0.0.1/tcp/5001
 StrictVersionMatch                                false                        StrictVersionMatch indicates whether to enforce strict version matching.
 UpdateConfig.Interval                             24h0m0s                      Interval specifies the time between update checks, when set to 0 update checks
                                                                                are not performed.
 WebUI.Backend                                                                  Backend specifies the address and port of the backend API server. If empty, the
                                                                                Web UI will use the same address and port as the API server.
 WebUI.Enabled                                     false                        Enabled indicates whether the Web UI is enabled.
 WebUI.Listen                                      0.0.0.0:8438                 Listen specifies the address and port on which the Web UI listens.
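
To make the two formats accepted by the `Compute.AllocatedCapacity.*` settings above more concrete, here is a minimal Python sketch of how a percentage or an absolute value could be turned into a number. It is an illustration only, not Bacalhau's implementation: `parse_quantity` and `resolve_capacity` are hypothetical names, and the parser covers only a small subset of Kubernetes resource strings (it does not handle a value like `1Gb`).

```python
import math

# Hypothetical helpers for illustration; not part of Bacalhau.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text: str) -> float:
    """Parse a small subset of Kubernetes resource strings ("100m", "1Gi", "2")."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    if text.endswith("m"):  # milli-units, e.g. "100m" is 0.1 CPU cores
        return float(text[:-1]) / 1000
    return float(text)


def resolve_capacity(setting: str, total: float, whole_units: bool = False) -> float:
    """Turn a percentage ("70%") or an absolute value ("1Gi") into an amount.

    `total` is what the machine has; `whole_units=True` mimics the GPU rule
    from the description above (percentages round up to a whole GPU).
    """
    if setting.endswith("%"):
        value = total * float(setting[:-1]) / 100
        return float(math.ceil(value)) if whole_units else value
    return parse_quantity(setting)
```

For example, `resolve_capacity("70%", 8)` treats `70%` as a share of 8 cores, while `resolve_capacity("85%", 2, whole_units=True)` rounds up to 2 whole GPUs.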

There are LOTS of terms that make no sense unless you're an expert. E.g.

  • Difference between InfoUpdateInterval & ResourceUpdateInterval
  • "DataDir specifies a location on disk where the bacalhau node will maintain state" - uh what? what is state? What else is here?
  • "ExecTranslation enables the execution translation feature." what is an execution translation?
  • "Engines.Disabled" why do we have disabled? Shouldn't this just be a list of enabled engines, with defaults?
  • "JobAdmissionControl.Locality anywhere Locality specifies the locality of the job input data." what does "locality of the job input data" mean?
  • "Labels map[] Labels are key-value pairs used to describe and categorize the nodes." - categorize the nodes? for what purpose?
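
On the first point, the descriptions only tell us that there are three separate timers. A minimal Python sketch of what that seems to imply, using the default values from the listing above. This assumes each kind of message is simply sent once per interval, independently of the others, which the descriptions don't actually confirm; `INTERVALS` and `messages_sent` are made-up names for illustration.

```python
from datetime import timedelta

# Default values copied from the `bacalhau config list` output above.
INTERVALS = {
    "heartbeat": timedelta(seconds=5),         # Compute.Heartbeat.Interval
    "resource_update": timedelta(seconds=30),  # Compute.Heartbeat.ResourceUpdateInterval
    "info_update": timedelta(minutes=1),       # Compute.Heartbeat.InfoUpdateInterval
}


def messages_sent(window: timedelta) -> dict[str, int]:
    """Count how many messages of each kind fit into `window`.

    Assumption: each message type fires once per full interval,
    independently of the other two.
    """
    return {name: window // interval for name, interval in INTERVALS.items()}
```

Calling `messages_sent(timedelta(minutes=2))` shows the difference in frequency between the three, but a good description should also say what each message contains and why the intervals differ.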

And lots more. Here's the updated one, which is probably also terrible, since it was just passed through ChatGPT. We've got to go through this and make it MUCH clearer.

KEY | VALUE | DESCRIPTION
-- | -- | --
API.Auth.AccessPolicyPath |   | Specifies the file or directory path to the access policy that governs inbound API requests. This policy defines who can access which API endpoints and with what permissions. If not set, a default policy permits all users (authenticated or not) full access to all API endpoints.
API.Auth.Methods | map[ClientKey:{challenge}] | Defines the available authentication methods for API access, mapping method names to their authenticator configurations. Each method name is a user-friendly identifier for an authentication mechanism (e.g., "password", "oauth"). This allows multiple authentication types or configurations of the same type (like "github" and "bitbucket" for OIDC authentication). By default, only client key authentication is enabled.
API.Host | 0.0.0.0 | Specifies the network interface (hostname or IP address) that the API server will bind to and listen for incoming connections, or that clients will use to connect to the API server.
API.Port | 1234 | Defines the network port number that the API server will use for incoming connections, or that clients will use to connect to the API server.
API.TLS.AutoCert |   | Specifies the domain name for which the API server should automatically generate TLS certificates (e.g., using Let's Encrypt). This facilitates secure HTTPS connections without manual certificate management.
API.TLS.AutoCertCachePath |   | Defines the filesystem directory where automatically generated TLS certificates are stored and cached. This helps in reusing certificates between restarts.
API.TLS.CAFile |   | Specifies the path to the Certificate Authority (CA) certificate file used to verify client certificates when mutual TLS is enabled. This enhances security by ensuring that only clients with certificates signed by a trusted CA can connect.
API.TLS.CertFile |   | Specifies the path to the TLS certificate file that the API server will use to establish secure (HTTPS) connections. Providing a valid certificate ensures encrypted communication between clients and the server.
API.TLS.Insecure | false | Indicates whether to allow insecure TLS connections, such as those using self-signed or invalid certificates. When set to true, the client will not verify the server's TLS certificate, which can expose the connection to man-in-the-middle attacks.
API.TLS.KeyFile |   | Specifies the path to the private key file associated with the TLS certificate used by the API server. This key is essential for establishing secure TLS connections.
API.TLS.SelfSigned | false | Indicates whether the API server should generate and use a self-signed TLS certificate for secure connections. Useful for testing or internal deployments where a trusted certificate is not necessary.
API.TLS.UseTLS | false | Determines whether the API server should use TLS (HTTPS) for client connections. If set to true, the server will require secure connections, enhancing data security during transmission.
Compute.AllocatedCapacity.CPU | 70% | Specifies the portion of CPU resources that the compute node dedicates to running jobs. Can be specified as a percentage of total CPU capacity (e.g., 70%) or as an absolute value using Kubernetes resource notation (e.g., 1000m for 1 CPU core). This ensures that the node reserves sufficient CPU for tasks while leaving resources for system operations.
Compute.AllocatedCapacity.Disk | 70% | Defines the amount of disk storage that the compute node reserves for job execution, either as a percentage of total disk space (e.g., 70%) or as an absolute value (e.g., 10Gi for 10 gibibytes). This allocation prevents jobs from consuming all disk space, which could affect node stability.
Compute.AllocatedCapacity.GPU | 100% | Specifies the amount of GPU resources allocated for running jobs. Can be a percentage (e.g., 85%) or an absolute number (e.g., 1 for one GPU). Note that when using percentages, the result is always rounded up to the nearest whole GPU. This setting is crucial for workloads that require GPU acceleration.
Compute.AllocatedCapacity.Memory | 70% | Specifies the amount of memory allocated for running jobs, either as a percentage of total memory (e.g., 70%) or as an absolute value (e.g., 1Gi for 1 gibibyte). Proper memory allocation helps prevent jobs from causing system memory exhaustion.
Compute.AllowListedLocalPaths | [/tmp, /data] | Specifies a list of directories on the local file system that the compute node is allowed to access when executing jobs. This is a security measure to limit jobs to specific paths and prevent unauthorized access to sensitive files.
Compute.Auth.Token |   | Specifies the authentication token required for compute nodes to access the orchestrator. This token ensures that only authorized compute nodes can join the network and receive jobs.
Compute.Enabled | false | Indicates whether the compute node is active and available for executing jobs. If set to false, the node will not participate in job execution. This can be used to temporarily disable a node without removing its configuration.
Compute.Heartbeat.InfoUpdateInterval | 1m0s | Defines the interval at which the compute node sends non-resource information updates (e.g., metadata, status) to the orchestrator. Regular updates help the orchestrator make informed scheduling decisions.
Compute.Heartbeat.Interval | 5s | Specifies how often the compute node sends heartbeat signals to the orchestrator to indicate that it is alive and functioning. Frequent heartbeats ensure timely detection of node failures.
Compute.Heartbeat.ResourceUpdateInterval | 30s | Specifies the interval at which the compute node sends resource usage updates (e.g., CPU, memory, disk utilization) to the orchestrator. This information assists in efficient resource scheduling.
Compute.Orchestrators | [nats://127.0.0.1:4222] | Lists the orchestrator endpoints that the compute node connects to for job scheduling and coordination. This can include multiple orchestrators for redundancy and load balancing.
DataDir | /Users/daaronch/.bacalhau | Specifies the directory on disk where the Bacalhau node stores its state, including logs, data, and other persistent information. Proper configuration ensures data persistence across restarts.
DisableAnalytics | false | When set to true, disables the collection and sending of analytics data. This can be used to enhance privacy if analytics are not desired.
Engines.Disabled | [] | Specifies a list of execution engines that are disabled. Execution engines define how tasks are run (e.g., Docker, WASM). Disabling unused engines can reduce resource consumption.
Engines.Types.Docker.ManifestCache.Refresh | 1h0m0s | Defines the interval at which the Docker manifest cache refreshes its entries. Regular refreshes ensure that the cache stays updated with the latest images.
Engines.Types.Docker.ManifestCache.Size | 1000 | Specifies the maximum number of entries that the Docker manifest cache can hold. A larger cache can improve performance but consumes more memory.
Engines.Types.Docker.ManifestCache.TTL | 1h0m0s | Sets the time-to-live duration for entries in the Docker manifest cache before they are considered stale. This helps in balancing between cache freshness and performance.
FeatureFlags.ExecTranslation | false | Enables the experimental feature for translating execution plans. This feature might change how tasks are executed and is intended for testing new functionalities.
InputSources.Disabled | [] | Specifies a list of input source types that are disabled. Input sources are methods for fetching input data (e.g., IPFS, HTTP). Disabling unused sources can enhance security and performance.
InputSources.MaxRetryCount | 3 | Specifies the maximum number of attempts to read from an input source before failing. This helps in handling transient errors during data fetching.
InputSources.ReadTimeout | 5m0s | Sets the maximum duration allowed for reading from an input source before timing out. This prevents tasks from hanging indefinitely due to slow or unresponsive data sources.
InputSources.Types.IPFS.Endpoint |   | Specifies the multi-address for connecting to an IPFS node (e.g., /ip4/127.0.0.1/tcp/5001). Configuring this allows the node to fetch data from IPFS networks.
JobAdmissionControl.AcceptNetworkedJobs | true | Determines whether the compute node accepts jobs that require network access. If set to false, jobs needing network connectivity will be rejected, enhancing security by isolating job execution.
JobAdmissionControl.Locality | anywhere | Specifies the required locality of the job input data, such as local or anywhere. Controls where jobs are scheduled based on data location to optimize performance and data access costs.
JobAdmissionControl.ProbeExec |   | Specifies a command or script to execute for probing or validating job submissions. This can be used to implement custom validation logic before accepting jobs.
JobAdmissionControl.ProbeHTTP |   | Specifies an HTTP endpoint to call for probing or validating job submissions. Useful for integrating external validation services or APIs.
JobAdmissionControl.RejectStatelessJobs | false | Indicates whether to reject jobs that have no input data (stateless jobs). If set to true, such jobs will be rejected, enforcing that all jobs must have inputs.
JobDefaults.Batch.Priority | 0 | Sets the default priority for batch jobs when not explicitly specified. Higher values indicate higher priority, influencing job scheduling order.
JobDefaults.Batch.Task.Publisher.Params | map[] | Provides default configuration parameters for the publisher used in batch tasks. Publishers handle how and where job results are stored or distributed.
JobDefaults.Batch.Task.Publisher.Type |   | Specifies the default publisher type for batch tasks (e.g., s3, local, ipfs). This determines the method used to publish job outputs.
JobDefaults.Batch.Task.Resources.CPU | 500m | Sets the default CPU allocation for tasks in batch jobs. Uses Kubernetes notation (e.g., 500m for 0.5 CPU cores). Ensures that tasks have sufficient CPU resources for execution.
JobDefaults.Batch.Task.Resources.Disk |   | Defines the default disk space allocation for tasks in batch jobs. Uses Kubernetes resource notation (e.g., 1Gi). Proper disk allocation prevents tasks from failing due to insufficient storage.
JobDefaults.Batch.Task.Resources.GPU |   | Sets the default number of GPUs allocated to tasks in batch jobs. Essential for tasks that require GPU acceleration.
JobDefaults.Batch.Task.Resources.Memory | 1Gb | Specifies the default memory allocation for tasks in batch jobs. Uses Kubernetes notation (e.g., 256Mi). Adequate memory allocation helps prevent out-of-memory errors during task execution.
JobDefaults.Batch.Task.Timeouts.ExecutionTimeout | 0s | Sets the maximum duration allowed for task execution in batch jobs. If set to 0, there is no execution timeout. A nonzero value prevents tasks from running indefinitely.
JobDefaults.Batch.Task.Timeouts.TotalTimeout | 0s | Sets the maximum total time allowed for a task in batch jobs, including retries. If set to 0, there is no total timeout. A nonzero value ensures that tasks complete within a bounded timeframe.
JobDefaults.Daemon.Priority | 0 | Specifies the default priority for daemon jobs. Priority influences scheduling and resource allocation.
JobDefaults.Daemon.Task.Resources.CPU | 500m | Sets the default CPU allocation for tasks in daemon jobs. Adequate CPU allocation ensures smooth operation of long-running services.
JobDefaults.Daemon.Task.Resources.Disk |   | Defines the default disk space allocation for tasks in daemon jobs.
JobDefaults.Daemon.Task.Resources.GPU |   | Sets the default GPU allocation for tasks in daemon jobs.
JobDefaults.Daemon.Task.Resources.Memory | 1Gb | Specifies the default memory allocation for tasks in daemon jobs. Sufficient memory allocation is crucial for the stability of services.
JobDefaults.Ops.Priority | 0 | Sets the default priority for operations (ops) jobs.
JobDefaults.Ops.Task.Publisher.Params | map[] | Provides default publisher configuration for tasks in ops jobs.
JobDefaults.Ops.Task.Publisher.Type |   | Specifies the default publisher type for ops tasks.
JobDefaults.Ops.Task.Resources.CPU | 500m | Sets the default CPU allocation for tasks in ops jobs.
JobDefaults.Ops.Task.Resources.Disk |   | Defines the default disk space allocation for tasks in ops jobs.
JobDefaults.Ops.Task.Resources.GPU |   | Sets the default GPU allocation for tasks in ops jobs.
JobDefaults.Ops.Task.Resources.Memory | 1Gb | Specifies the default memory allocation for tasks in ops jobs.
JobDefaults.Ops.Task.Timeouts.ExecutionTimeout | 0s | Sets the maximum execution time for tasks in ops jobs.
JobDefaults.Ops.Task.Timeouts.TotalTimeout | 0s | Sets the total maximum time for tasks in ops jobs, including retries.
JobDefaults.Service.Priority | 0 | Specifies the default priority for service jobs.
JobDefaults.Service.Task.Resources.CPU | 500m | Sets the default CPU allocation for tasks in service jobs.
JobDefaults.Service.Task.Resources.Disk |   | Defines the default disk space allocation for tasks in service jobs.
JobDefaults.Service.Task.Resources.GPU |   | Sets the default GPU allocation for tasks in service jobs.
JobDefaults.Service.Task.Resources.Memory | 1Gb | Specifies the default memory allocation for tasks in service jobs.
Labels | map[] | Key-value pairs used to describe and categorize the node, which can be used for scheduling and organization purposes. Labels help in selecting nodes based on specific criteria.
Logging.Level | info | Sets the logging verbosity level. Options, from most to least verbose: trace, debug, info, warn, error, fatal, panic. More verbose levels provide more detailed logs for troubleshooting.
Logging.LogDebugInfoInterval | 30s | Defines the interval at which the node logs debug information, useful for monitoring and troubleshooting. Regular debug logs can help in early detection of issues.
Logging.Mode | default | Specifies the logging output format. Options include default (human-readable) and json. JSON format is useful for machine parsing and log aggregation systems.
NameProvider | puuid | Determines the method used to generate the node's unique name or identifier. Options include hostname, aws, gcp, uuid, puuid. This setting helps in node identification and management.
Orchestrator.Advertise |   | Specifies the URL that the orchestrator advertises to other servers for connectivity. Proper configuration ensures that compute nodes and clients can discover the orchestrator.
Orchestrator.Auth.Token |   | Defines the authentication token required for compute nodes to access the orchestrator. This token secures the orchestrator from unauthorized access.
Orchestrator.Cluster.Advertise |   | Specifies the network address that the orchestrator cluster advertises to other cluster members. This is essential for orchestrator nodes to communicate and form a cluster.
Orchestrator.Cluster.Host |   | Sets the hostname or IP address used for cluster communication between orchestrator nodes. Proper configuration is crucial for cluster stability.
Orchestrator.Cluster.Name |   | Provides a unique identifier for the orchestrator cluster, used to distinguish between different clusters. This prevents cross-communication between separate clusters.
Orchestrator.Cluster.Peers | [] | A list of other cluster member addresses to connect to on startup for forming a cluster. Helps in bootstrapping the cluster network.
Orchestrator.Cluster.Port | 0 | Specifies the port number used for cluster communication between orchestrator nodes. Consistent port configuration is necessary for node discovery.
Orchestrator.Enabled | false | Indicates whether the orchestrator node is active and accepting job submissions. If set to false, the orchestrator will not schedule jobs or communicate with compute nodes.
Orchestrator.EvaluationBroker.MaxRetryCount | 10 | Sets the maximum number of times an evaluation can be retried before being marked as failed. This helps in preventing endless retry loops for problematic evaluations.
Orchestrator.EvaluationBroker.VisibilityTimeout | 1m0s | Defines how long an evaluation is locked by a worker before it's returned to the queue if not completed. This ensures that stalled evaluations can be retried by other workers.
Orchestrator.Host | 0.0.0.0 | Specifies the network interface that the orchestrator server listens on for compute node connections.
Orchestrator.NodeManager.DisconnectTimeout | 5s | Specifies the duration after which a compute node is considered disconnected if no heartbeat is received. This helps in quickly detecting and handling node failures.
Orchestrator.NodeManager.ManualApproval | false | If set to true, requires manual approval for new compute nodes joining the cluster. This enhances security by preventing unauthorized nodes from joining.
Orchestrator.Port | 4222 | Specifies the port number that the orchestrator server listens on for compute node connections.
Orchestrator.Scheduler.HousekeepingInterval | 30s | Defines how often the scheduler runs housekeeping tasks, such as cleaning up old jobs and releasing resources. Regular housekeeping ensures optimal system performance.
Orchestrator.Scheduler.HousekeepingTimeout | 2m0s | Sets the maximum duration allowed for a single housekeeping run. This prevents housekeeping tasks from running indefinitely and impacting system performance.
Orchestrator.Scheduler.QueueBackoff | 1m0s | Specifies the duration to wait before retrying a failed job in the scheduling queue. This helps in handling transient errors without overwhelming the system.
Orchestrator.Scheduler.WorkerCount | 12 | Determines the number of concurrent worker processes for job scheduling. Adjusting this can optimize scheduling performance based on system resources.
Publishers.Disabled | [] | Lists the publishers that are disabled. Publishers handle the output of jobs (e.g., saving results to S3, IPFS). Disabling unused publishers can improve security and reduce resource usage.
Publishers.Types.IPFS.Endpoint |   | Specifies the multi-address for connecting to an IPFS node for publishing job results. Configuring this allows job outputs to be stored on the IPFS network.
Publishers.Types.Local.Address | 127.0.0.1 | Sets the address where the local publisher serves job results. This is used when publishing outputs to a local storage endpoint.
Publishers.Types.Local.Port | 6001 | Specifies the port number where the local publisher serves job results.
Publishers.Types.S3.PreSignedURLDisabled | false | Indicates whether pre-signed URLs are disabled for the S3 publisher. Pre-signed URLs allow temporary, secure access to S3 objects without requiring credentials.
Publishers.Types.S3.PreSignedURLExpiration | 0s | Sets the expiration duration for pre-signed URLs generated by the S3 publisher. This controls how long the URLs remain valid for accessing job results.
ResultDownloaders.Disabled | [] | Lists the result downloaders that are disabled. Result downloaders retrieve job outputs from compute nodes. Disabling unused downloaders can enhance security.
ResultDownloaders.Timeout | 0s | Specifies the maximum duration allowed for a result download operation before timing out. This prevents downloads from hanging indefinitely.
ResultDownloaders.Types.IPFS.Endpoint |   | Specifies the multi-address for connecting to an IPFS node to download job results. Proper configuration enables retrieval of outputs stored on IPFS.
StrictVersionMatch | false | Determines whether to enforce strict version matching between nodes and clients. If set to true, nodes and clients whose versions differ are treated as incompatible, which keeps versions consistent across the network.
UpdateConfig.Interval | 24h0m0s | Specifies the interval at which the system checks for software updates. Set to 0 to disable update checks. Regular updates can bring new features and security patches.
WebUI.Backend |   | Specifies the address and port of the backend API server that the Web UI connects to. If empty, it uses the same address as the API server. Proper configuration ensures the Web UI can communicate with the backend services.
WebUI.Enabled | false | Indicates whether the Web UI is enabled and accessible. Enabling the Web UI allows users to interact with the system through a graphical interface.
WebUI.Listen | 0.0.0.0:8438 | Specifies the address and port on which the Web UI listens for incoming connections. Configuring this allows users to access the Web UI from different network interfaces.
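
To make the `Compute.AllocatedCapacity.*` descriptions above more concrete, here is a minimal Python sketch of how a percentage or an absolute value could be turned into a usable amount. It is not Bacalhau's implementation: the function names (`resolve_cpu`, `resolve_bytes`, `resolve_gpu`) are hypothetical, and the only behavior taken from the table is that `1000m` means 1 CPU core, that `Gi`-style suffixes are binary units, and that GPU percentages round up to a whole GPU.

```python
import math

# Binary suffixes from Kubernetes resource notation (powers of 1024).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def resolve_cpu(setting: str, total_cores: float) -> float:
    """Return CPU cores for a setting such as '70%', '1000m', or '2'."""
    setting = setting.strip()
    if setting.endswith("%"):
        return total_cores * float(setting[:-1]) / 100
    if setting.endswith("m"):
        # Millicores: 1000m is one full core.
        return float(setting[:-1]) / 1000
    return float(setting)


def resolve_bytes(setting: str, total_bytes: int) -> int:
    """Return bytes for a memory or disk setting such as '70%' or '1Gi'."""
    setting = setting.strip()
    if setting.endswith("%"):
        return int(total_bytes * float(setting[:-1]) / 100)
    for suffix, factor in _BINARY_SUFFIXES.items():
        if setting.endswith(suffix):
            return int(float(setting[: -len(suffix)]) * factor)
    # No suffix: treat the value as a plain byte count (an assumption).
    return int(setting)


def resolve_gpu(setting: str, total_gpus: int) -> int:
    """Return whole GPUs; percentages round up, as the table describes."""
    setting = setting.strip()
    if setting.endswith("%"):
        return math.ceil(total_gpus * float(setting[:-1]) / 100)
    return int(setting)
```

For example, on a hypothetical machine with 4 GPUs, `resolve_gpu("85%", 4)` yields 4 because 3.4 is rounded up, while `resolve_cpu("70%", 8)` leaves 5.6 of 8 cores for jobs.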

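Many values in the table are durations printed in forms like `5s`, `1m0s`, or `24h0m0s`. The sketch below shows one way to read those strings and to apply the rule described for `Orchestrator.NodeManager.DisconnectTimeout`. It is an illustration only: `parse_duration` and `node_is_disconnected` are hypothetical helpers, not part of Bacalhau, and the parser deliberately handles only the hour, minute, and second units that appear in the table.

```python
import re
from datetime import timedelta

# Only the units that appear in the table above are supported.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_duration(text: str) -> timedelta:
    """Parse values such as '5s', '1m0s', or '24h0m0s' into a timedelta."""
    pos = 0
    total_seconds = 0.0
    for match in _PART.finditer(text):
        if match.start() != pos:
            break  # Unparsed characters between parts.
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        # Rejects empty input and unsupported units such as 'ms'.
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=total_seconds)


def node_is_disconnected(since_last_heartbeat: timedelta,
                         disconnect_timeout: timedelta) -> bool:
    """A node counts as disconnected once no heartbeat has arrived
    for longer than the configured timeout."""
    return since_last_heartbeat > disconnect_timeout
```

With the listed default of `5s`, a node whose last heartbeat arrived 6 seconds ago would be reported as disconnected, and one heard from 3 seconds ago would not. Note that for the timeout settings the table gives `0s` a special meaning (no timeout), which a caller would need to check separately.
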
@MichaelHoepler

@aronchick added the type/bug and request/new labels on Oct 8, 2024
@wdbaruni added the th/config, th/user-experience, and th/oncall labels on Oct 12, 2024 (with Linear)