Configuration

Gradient is configured exclusively through its NixOS module. There is no configuration file and there are no command-line flags; all options are set in your NixOS configuration under services.gradient.

Minimal Setup

services.gradient = {
  enable            = true;
  frontend.enable   = true;
  configurePostgres = true;
  reverseProxy.nginx.enable = true;
  domain            = "gradient.example.com";
  jwtSecretFile     = "/run/secrets/gradient-jwt";
  cryptSecretFile   = "/run/secrets/gradient-crypt";
};

configurePostgres creates a local PostgreSQL database and user. reverseProxy adds a virtual host, served by either nginx or Caddy, that proxies /api/, /proto, and /cache/ to the backend and serves the frontend single-page app (SPA).

Secrets

Two secrets are required. Generate them with:

# JWT signing key (HS256, minimum 32 bytes)
openssl rand -base64 48 > /run/secrets/gradient-jwt

# Database encryption key
openssl rand -base64 48 > /run/secrets/gradient-crypt

Warning

Never commit secret files to version control. Use sops-nix or agenix to manage them.
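With sops-nix, for example, both files can be declared as secrets and their decrypted paths passed to the module. This is a minimal sketch that assumes sops-nix is already imported and that sops.defaultSopsFile points at an encrypted file containing the keys gradient-jwt and gradient-crypt (those key names are illustrative). sops-nix decrypts each secret to /run/secrets/<name> by default, so the resulting paths match the minimal setup above.

```nix
{ config, ... }:
{
  # Assumes sops-nix is imported and sops.defaultSopsFile contains
  # the (illustrative) keys "gradient-jwt" and "gradient-crypt".
  sops.secrets."gradient-jwt" = { };
  sops.secrets."gradient-crypt" = { };

  services.gradient = {
    # Decrypted to /run/secrets/gradient-jwt and /run/secrets/gradient-crypt.
    jwtSecretFile   = config.sops.secrets."gradient-jwt".path;
    cryptSecretFile = config.sops.secrets."gradient-crypt".path;
  };
}
```

sops-nix makes secrets readable only by root by default (owner root, mode 0400). If the Gradient service runs as an unprivileged user, set owner on each secret accordingly; the service user name is not documented on this page.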

All Options

Option names below are relative to services.gradient.

Option	Default	Description
`domain`	-	Public hostname (required)
`baseDir`	`/var/lib/gradient`	Data directory
`listenAddr`	`127.0.0.1`	Bind address
`port`	`3000`	HTTP port
`jwtSecretFile`	-	Path to JWT secret file (required)
`cryptSecretFile`	-	Path to encryption secret file (required)
`metricsTokenFile`	`null`	When set, enables `GET /metrics` (Prometheus exposition). The file's contents must be presented as `Authorization: Bearer <token>` by scrapers. When `null`, the endpoint returns 404.
`databaseUrlFile`	auto	Override the PostgreSQL connection string file
`databaseMaxConnections`	`32`	Max connections in the scheduler / worker pool (`GRADIENT_DATABASE_MAX_CONNECTIONS`). Total per process is `databaseMaxConnections + databaseWebMaxConnections + databaseCacheMaxConnections`; raise only if Postgres `max_connections` has headroom.
`databaseMinConnections`	`2`	Min connections kept warm in the scheduler / worker pool (`GRADIENT_DATABASE_MIN_CONNECTIONS`).
`databaseCacheMaxConnections`	`32`	Max connections in the dedicated cache-query pool (`GRADIENT_DATABASE_CACHE_MAX_CONNECTIONS`). Isolated from the scheduler/worker pool so a large eval's worker prefetch storm cannot starve dispatch.
`databaseCacheMinConnections`	`2`	Min connections kept warm in the cache-query pool (`GRADIENT_DATABASE_CACHE_MIN_CONNECTIONS`).
`databaseWebMaxConnections`	`16`	Max connections in the axum HTTP pool (`GRADIENT_DATABASE_WEB_MAX_CONNECTIONS`).
`databaseWebMinConnections`	`1`	Min connections kept warm in the axum HTTP pool (`GRADIENT_DATABASE_WEB_MIN_CONNECTIONS`).
`reportErrors`	`false`	Send errors to Sentry
`settings.sentryDsn`	`null`	Override the Sentry DSN used when `reportErrors` is true. `null` ships reports to the upstream Wavelens instance at `reports.wavelens.io`; set your own DSN to keep reports in-house. (`GRADIENT_SENTRY_DSN`)
`discoverable`	`true`	Accept incoming `/proto` WebSocket connections from workers
`settings.maxProtoConnections`	`256`	Max simultaneous worker WebSocket connections; further upgrades return `503 Service Unavailable` with `Retry-After: 10` until a slot frees
`settings.keepEvaluations`	`30`	Global maximum of evaluations kept per project (caps the per-project setting)
`settings.logChunkBytes`	`262144` (256 KiB)	Target uncompressed size for each zstd build-log chunk written on finalize. Chunks split on line boundaries, so an over-long line may exceed this. (`GRADIENT_LOG_CHUNK_BYTES`)
`settings.maxStorageGb`	`0`	Instance-wide cap on total cached NAR storage, in GB. When all writable caches for an org have less than 10 MiB headroom, new evaluations park in `Waiting`. `0` = unlimited; per-cache `max_storage_gb` limits still apply. (`GRADIENT_MAX_STORAGE_GB`)
`settings.evalCacheMaxTotalBytes`	`10737418240` (10 GiB)	Total byte cap for fleet-shared eval-cache blobs. The eviction sweep drops oldest-`updated_at` rows until the surviving total is at or under this. (`GRADIENT_EVAL_CACHE_MAX_TOTAL_BYTES`)
`settings.evalCacheMaxAgeDays`	`30`	Max age in days for an eval-cache blob; older blobs are evicted by the sweep regardless of the size cap. (`GRADIENT_EVAL_CACHE_MAX_AGE_DAYS`)
`settings.evalCacheSweepIntervalSecs`	`3600`	Interval in seconds between eval-cache eviction sweeps. (`GRADIENT_EVAL_CACHE_SWEEP_INTERVAL_SECS`)
`settings.maxRequestSize`	`2097152` (2 MiB)	Max HTTP request body in bytes for most endpoints (caps webhook/JSON payloads to prevent OOM). The per-chunk build-request blob endpoint uses a fixed 20 MiB cap.
`settings.maxNarUploadSize`	`536870912` (512 MiB)	Max body in bytes for `POST /caches/{cache}/nars`; overrides the general `maxRequestSize` cap for NAR uploads. (`GRADIENT_MAX_NAR_UPLOAD_SIZE`)
`settings.maxSourceUploadSize`	`536870912` (512 MiB)	Max total bytes for `gradient build` source uploads: the single-shot `POST /build-requests/source`, the staged total of chunked `source/{upload}/chunk` uploads, and the chunked manifest total. `gradient build` streams the source in bounded chunks, so this caps the assembled size, not a single request. The built-in reverse proxy's `client_max_body_size` is derived from the largest configured upload limit. (`GRADIENT_MAX_SOURCE_UPLOAD_SIZE`)
`settings.logLevel.default`	`info`	Log level: `trace` `debug` `info` `warn` `error`. Targets `gradient_*`; dependency noise (`hyper`, `sqlx`, …) is pinned to `warn`. `RUST_LOG` overrides everything.
`settings.logLevel.cache`	`null`	`gradient_cache` log level override (`null` inherits `default`)
`settings.logLevel.web`	`null`	`gradient_web` log level override (`null` inherits `default`)
`settings.logLevel.proto`	`null`	`gradient_proto` log level override (`null` inherits `default`)
`settings.logLevel.scheduler`	`null`	`gradient_scheduler` log level override (`null` inherits `default`)
`settings.enableRegistration`	`true`	Allow new user self-registration
`settings.createOrg`	`everyone`	Who may create organizations via the API: `none` (only the declarative state), `superusers`, or `everyone`. The frontend hides the "Create Organization" button accordingly. (`GRADIENT_CREATE_ORG`)
`settings.createCache`	`everyone`	Who may create caches via the API: `none` (only the declarative state), `superusers`, or `everyone`. The frontend hides the "Create Cache" button accordingly. (`GRADIENT_CREATE_CACHE`)
`settings.prCommitName`	`null`	Git author/committer name for the commits the `open_pr` action pushes. `null` lets each forge pick attribution: GitHub credits the App bot and signs it verified; Gitea/Forgejo and GitLab use the token owner (needs the `read:user` / `read_user` scope), falling back to a `Gradient <gradient@users.noreply.HOST>` bot identity when that scope is missing. Set both this and `prCommitEmail` to force an explicit identity. (`GRADIENT_PR_COMMIT_NAME`)
`settings.prCommitEmail`	`null`	Git author/committer email for `open_pr` commits; see `prCommitName`. (`GRADIENT_PR_COMMIT_EMAIL`)
`settings.deleteState`	`true`	Remove entities no longer in `state` (see below)
`settings.cacheTtlHours`	`336`	TTL in hours for cached NARs not fetched recently (0 = disabled)
`settings.cacheMaintenanceIntervalSecs`	`3600`	Interval in seconds between cache maintenance GC passes. (`GRADIENT_CACHE_MAINTENANCE_INTERVAL_SECS`)
`settings.signSweepIntervalSecs`	`3600`	Fallback interval between NAR signature backfill sweeps; a freshly uploaded NAR is signed in place by the upload handler, so this only covers subscription placeholders and any row left unsigned. (`GRADIENT_SIGN_SWEEP_INTERVAL_SECS`)
`settings.narVerifyDigest`	`false`	When set, the S3 presigned NAR commit path GETs the uploaded object and rehashes it against the reported `file_hash` before marking it cached, catching same-length corruption at the cost of a full object read. Off by default: the presigned path still HEAD-checks size; the relayed and REST upload paths always content-verify since they already hold the bytes in memory. (`GRADIENT_NAR_VERIFY_DIGEST`)
`settings.narUploadGraceHours`	`24`	Grace before the orphan-files GC reclaims a NAR object no DB row references (covers the upload commit window). (`GRADIENT_NAR_UPLOAD_GRACE_HOURS`)
`settings.gcWedgedEvalHours`	`24`	Hours after which an untouched active evaluation is presumed wedged and stops blocking evaluation GC; the wedged eval itself is never deleted (0 = block forever). (`GRADIENT_GC_WEDGED_EVAL_HOURS`)
`settings.narStorageOpenTimeoutSecs`	`60`	Max seconds the server will wait to open a NAR object stream (e.g. an S3 GET) before emitting `NarUnavailable`. Caps how long a stalled storage backend can block a `NarRequest`.
`settings.narSendChunkTimeoutSecs`	`30`	Max seconds a single outbound `NarPush` chunk may sit in the per-connection writer queue waiting for the WebSocket sink to drain before the transfer is aborted with `NarAbort`.
`settings.maxConcurrentNarServes`	`8`	Max NAR-serving tasks running concurrently per worker connection. Bounds memory and storage-backend fan-out when a worker requests many paths in a single batch.
`settings.maxNarBufferBytes`	`10737418240` (10 GiB)	Max total bytes the server may hold across open `*.partial` NAR upload files (unfinalized `NarPush` streams staged under `<baseDir>/nar-partial`). Prevents a rogue worker from filling the disk by opening many uploads without finalizing them (issue #109).
`settings.narPartialTtlSecs`	`86400` (24 h)	TTL in seconds for partially-received NAR uploads (`*.partial`) staged under `<baseDir>/nar-partial`. A periodic sweep deletes partials older than this so an abandoned resumable transfer can't pin disk forever (issue #225). `0` disables the sweep. (`GRADIENT_NAR_PARTIAL_TTL_SECS`)
`settings.workerHeartbeatTimeoutSecs`	`120`	Seconds a connected worker may go silent before the server declares it dead and re-queues its in-flight jobs. The worker heartbeats every 10 s; liveness is stamped when the server processes a frame, so the generous default keeps a briefly stalled server from false-declaring a healthy worker dead. This is the only detector for a worker that dies without a clean TCP close (hard OOM-kill, frozen host, network partition); a graceful disconnect is handled immediately regardless. `0` disables the liveness watchdog. (`GRADIENT_WORKER_HEARTBEAT_TIMEOUT_SECS`)
`settings.upstreamQueryConcurrency`	`32`	Max simultaneous outbound upstream narinfo probes server-wide (latency ordering). (`GRADIENT_UPSTREAM_QUERY_CONCURRENCY`)
`settings.allowAnonymousCache`	`true`	Allow unauthenticated clients to use `GET /cache/{cache}/proto` for public caches. When `false`, anonymous handshakes are rejected with 403. Private caches always require an API key regardless. (`GRADIENT_PROTO_ALLOW_ANONYMOUS_CACHE`)
`settings.anonMaxConnectionsPerIp`	`32`	Maximum simultaneous anonymous `/cache/proto` connections per client IP. (`GRADIENT_PROTO_ANON_MAX_CONNECTIONS_PER_IP`)
`settings.trustedProxies`	`127.0.0.1/32,::1/128`	Comma-separated CIDR allowlist of peers permitted to set `X-Forwarded-For` (`GRADIENT_TRUSTED_PROXIES`).
`settings.localIps`	`10.0.0.0/8`	Comma-separated CIDR allowlist whose resolved client IPs receive each cache's `local_priority` value (`GRADIENT_LOCAL_IPS`).
`settings.buildMaxAttempts`	`3`	Maximum number of build attempts before a transient failure is promoted to `FailedPermanent`. (`GRADIENT_BUILD_MAX_ATTEMPTS`)
`settings.substituteMissEscalationThreshold`	`2`	Substitute attempts before a substitutable build escalates to a real arch-bound build. (`GRADIENT_SUBSTITUTE_MISS_ESCALATION_THRESHOLD`)
`settings.inputsUnavailableMaxLoops`	`3`	Max `InputsUnavailable` self-heal loops per build before the circuit breaker opens and the build fails fast instead of churning the cache. (`GRADIENT_INPUTS_UNAVAILABLE_MAX_LOOPS`)
`settings.buildRetryBackoffSecs`	`30`	Base back-off in seconds before retrying a transient build failure; doubled after each prior attempt (exponential). (`GRADIENT_BUILD_RETRY_BACKOFF_SECS`)
`settings.buildDefaultTimeoutSecs`	`14400`	Default wall-clock timeout (seconds) for builds whose `.drv` does not set a `timeout` attribute. `0` disables. (`GRADIENT_BUILD_DEFAULT_TIMEOUT_SECS`)
`settings.buildDefaultMaxSilentSecs`	`3600`	Default silent-output timeout (seconds) for builds whose `.drv` does not set a `maxSilent` attribute. `0` disables. (`GRADIENT_BUILD_DEFAULT_MAX_SILENT_SECS`)
`settings.schedulerScoringPolicy`	`resource-aware`	Scheduler scoring policy ranking queued jobs against a requesting worker (`GRADIENT_SCHEDULER_SCORING_POLICY`). Values: `simple`, `resource-aware`. `simple` is the basic rule set, weighing path availability, NAR size, dependency count, wait-time anti-starvation, builtin de-prioritization and fetch-worker reservation. `resource-aware` adds RAM/OOM-fit, CPU affinity, preferLocalBuild affinity and per-org fair-share on top, and is the default. Unknown values fall back to `resource-aware`. See scheduler scoring.
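Because the three database pools are per process, raising one pool limit can push the total past PostgreSQL's own max_connections, which defaults to 100. With all Gradient defaults the total is 32 + 16 + 32 = 80. The sketch below doubles the scheduler/worker pool and raises the database limit to match; the value 200 is an arbitrary example, and it assumes the database is the local one created by configurePostgres.

```nix
{
  services.gradient = {
    configurePostgres      = true;
    databaseMaxConnections = 64;
    # Per-process total: 64 (scheduler/worker) + 16 (web) + 32 (cache) = 112,
    # which exceeds PostgreSQL's default max_connections of 100.
  };

  # Leave room for other clients as well (psql, backups, monitoring).
  services.postgresql.settings.max_connections = 200;
}
```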

Build failure states and retries

Builds can fail in three distinct ways:

Status	Terminal	Meaning
`FailedPermanent`	Yes	Builder exited non-zero; no retry will be attempted
`FailedTransient`	No	Transient error (OOM, disk full, network/substitution failure, builder crash); scheduler will re-queue automatically
`FailedTimeout`	Yes	Exceeded the wall-clock or silent-output timeout (`buildDefaultTimeoutSecs` / `buildDefaultMaxSilentSecs`, or the derivation's own `timeout` / `maxSilent`)

FailedTransient is non-terminal: the build is re-queued automatically with an exponential back-off until buildMaxAttempts is exhausted, at which point the status is promoted to FailedPermanent. API entry-point queries treat FailedTransient as in-progress; the frontend renders all three variants as "Failed".
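As a worked example of the retry schedule, assume the first retry waits exactly the base back-off b = buildRetryBackoffSecs and each later retry doubles it, so the wait before retry k is b · 2^(k-1). This reading of "doubled after each prior attempt" is an interpretation; the implementation's exponent may be offset by one. With the defaults b = 30 and buildMaxAttempts = 3 there are at most two retries:

```latex
\begin{aligned}
d_k &= b \cdot 2^{k-1}, \qquad k = 1, \dots, \texttt{buildMaxAttempts} - 1 \\
d_1 &= 30 \cdot 2^{0} = 30\ \text{s} \\
d_2 &= 30 \cdot 2^{1} = 60\ \text{s} \\
d_1 + d_2 &= 90\ \text{s}
\end{aligned}
```

If the third attempt also fails transiently, the status is promoted to FailedPermanent.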

Per-derivation .drv attributes timeout, maxSilent, and preferLocalBuild override the server defaults when present on a derivation. Note that Nix meta.* attributes do not propagate to the .drv; these must be set as top-level derivation attributes.
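For example, the following derivation sets all three as top-level attributes through pkgs.runCommand, whose attribute set is passed through to the .drv. The values are arbitrary and the sketch assumes <nixpkgs> is on the NIX_PATH. Hydra-style meta.timeout or meta.maxSilent would not be seen by Gradient, because meta is not written to the .drv.

```nix
{ pkgs ? import <nixpkgs> { } }:

pkgs.runCommand "slow-test" {
  # Top-level attributes are written into the .drv, where Gradient reads them.
  timeout          = 7200; # wall-clock limit: 2 hours instead of buildDefaultTimeoutSecs
  maxSilent        = 600;  # fail after 10 minutes without output
  preferLocalBuild = true;
} ''
  echo "running tests" > $out
''
```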

Reverse Proxies

The Gradient server has no built-in HTTP server for the frontend, so a reverse proxy or web server is needed to host it. The NixOS module provides two preconfigured reverse proxies:

- nginx
- Caddy

Nginx

Option	Default	Description
`reverseProxy.nginx.enable`	`false`	Whether to enable nginx as the reverse proxy
`reverseProxy.nginx.manageTls`	`true`	Let nginx obtain/serve the certificate (`enableACME` + `forceSSL`). Set `false` when an upstream proxy terminates TLS and forwards plain HTTP to nginx; keep `useTls = true` for correct `https://` URLs and secure cookies. No effect when `useTls = false`.

Caddy

Note

To match the upstream services.caddy configuration, you have to manage the ACME host certificate yourself (for example with security.acme.certs) and pass its name via reverseProxy.caddy.useACMEHost.

Option	Default	Description
`reverseProxy.caddy.enable`	`false`	Whether to enable caddy as the reverse proxy
`reverseProxy.caddy.useACMEHost`	`null`	Passed directly to `services.caddy.virtualHosts.<name>.useACMEHost`
`reverseProxy.caddy.extraConfig`	`""`	Caddy config options written to `services.caddy.virtualHosts.<name>.extraConfig` after the reverse proxy setup
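A Caddy setup therefore pairs useACMEHost with a certificate obtained by the NixOS security.acme module. This sketch assumes a DNS-01 challenge through Cloudflare with credentials in a hypothetical /run/secrets/acme-cloudflare file; any DNS provider supported by lego can be substituted. The group setting lets the Caddy service read the issued certificate.

```nix
{ config, ... }:
{
  security.acme = {
    acceptTerms    = true;
    defaults.email = "admin@example.com";
    certs."gradient.example.com" = {
      # Assumption: DNS-01 via Cloudflare; replace with your own provider.
      dnsProvider     = "cloudflare";
      environmentFile = "/run/secrets/acme-cloudflare";
      # Allow Caddy to read the issued certificate and key.
      group           = config.services.caddy.group;
    };
  };

  services.gradient = {
    domain = "gradient.example.com";
    reverseProxy.caddy = {
      enable      = true;
      useACMEHost = "gradient.example.com";
      # Appended after the generated reverse-proxy directives.
      extraConfig = ''
        encode zstd gzip
      '';
    };
  };
}
```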

Custom Reverse Proxy

If you want to use your own reverse proxy, you have to proxy (not redirect) the following routes to the backend, where ${ADDR} and ${PORT} are the server's listenAddr and port:

- https://example.com/api (with all subpaths) -> http://${ADDR}:${PORT}/api
- https://example.com/proto -> http://${ADDR}:${PORT}/proto (must support WebSockets)
- https://example.com/cache (with all subpaths) -> http://${ADDR}:${PORT}/cache

All other requests should be handled by a static web server serving the files at:

- ${pkgs.gradient-frontend}/share/gradient-frontend
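As an illustration, these routes can be expressed with a hand-written NixOS nginx virtual host instead of reverseProxy.nginx. This is a sketch under assumptions: the backend listens on the default 127.0.0.1:3000, security.acme is configured so enableACME works, the frontend needs an index.html fallback for client-side routing, and 512 MiB matches the default upload limits (maxNarUploadSize and maxSourceUploadSize). The built-in proxy derives that body limit automatically; a custom proxy has to set it itself.

```nix
{ pkgs, ... }:
{
  services.nginx = {
    enable = true;
    # Sets X-Forwarded-For etc.; 127.0.0.1 is in the default trustedProxies.
    recommendedProxySettings = true;

    virtualHosts."gradient.example.com" = {
      enableACME = true; # requires security.acme to be configured
      forceSSL   = true;
      # Allow large NAR and source uploads (512 MiB by default).
      extraConfig = "client_max_body_size 512m;";

      # Static frontend; the index.html fallback is an assumption for the SPA.
      root = "${pkgs.gradient-frontend}/share/gradient-frontend";
      locations."/".tryFiles = "$uri $uri/ /index.html";

      locations."/api/".proxyPass = "http://127.0.0.1:3000";
      locations."/cache/" = {
        proxyPass       = "http://127.0.0.1:3000";
        # Harmless for plain HTTP; assumed useful if /cache/{cache}/proto upgrades.
        proxyWebsockets = true;
      };
      locations."/proto" = {
        proxyPass       = "http://127.0.0.1:3000";
        proxyWebsockets = true; # /proto is a WebSocket endpoint
      };
    };
  };
}
```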

Metrics

Set services.gradient.metricsTokenFile to a file path to enable GET /metrics (Prometheus exposition format). When unset, the endpoint returns 404.

services.gradient.metricsTokenFile = "/run/secrets/gradient-metrics";

Generate a token with openssl rand -base64 32. Configure your Prometheus scraper with bearer_token_file: /run/secrets/gradient-metrics (or pass the token directly as Authorization: Bearer <token> for ad-hoc curl requests). The endpoint is rate-limited to 6 requests/s with a burst of 5, so a 15 s scrape interval is comfortable.
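A matching scrape job for the NixOS Prometheus module might look like the following. It assumes Prometheus runs on the same host and scrapes the backend directly at the default listenAddr and port, since the reverse-proxy routes listed above do not include /metrics. It also assumes the prometheus user can read the token file.

```nix
{
  services.prometheus = {
    enable = true;
    scrapeConfigs = [
      {
        job_name          = "gradient";
        scrape_interval   = "15s";
        scheme            = "http";
        metrics_path      = "/metrics";
        # Must be readable by the prometheus user.
        bearer_token_file = "/run/secrets/gradient-metrics";
        static_configs    = [ { targets = [ "127.0.0.1:3000" ]; } ];
      }
    ];
  };
}
```

One request every 15 seconds is far below the 6 requests/s limit, so a few redundant Prometheus replicas can scrape the same instance without being throttled.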

The MVP exposes build/evaluation status counts, scheduler queue depth, connected workers, and cache totals. Per-org/cache labels and histograms are tracked as a follow-up.

Metrics pipeline & retention

The Job Board records build/eval phase timings, dispatch decisions (with scoring breakdown), and worker statistics into dedicated tables. A background task prunes them so they stay bounded; all settings live under services.gradient.settings:

Option / env var	Default	Purpose
`metricsRollupIntervalSecs` / `GRADIENT_METRICS_ROLLUP_INTERVAL`	60	Rollup-aggregator pass interval.
`metricsRetentionRawDays` / `GRADIENT_METRICS_RETENTION_RAW_DAYS`	14	Retention for raw `phase_event` / `worker_sample` rows (0 = forever).
`metricsRetentionRollupDays` / `GRADIENT_METRICS_RETENTION_ROLLUP_DAYS`	400	Retention for minute and hour rollups; day and week rollups are always kept (0 = forever).
`dispatchRetentionDays` / `GRADIENT_DISPATCH_RETENTION_DAYS`	30	Retention for `dispatched_job` forensic rows (0 = forever).
`workerSampleIntervalSecs` / `GRADIENT_WORKER_SAMPLE_INTERVAL`	15	Worker live-metric sampling interval.
`metricsLabelTopn` / `GRADIENT_METRICS_LABEL_TOPN`	20	Per-dimension cardinality cap for rollup labels.
`otlpEndpoint` / `GRADIENT_OTLP_ENDPOINT`	null	OTLP collector endpoint for metric push (null disables).
`otlpPushIntervalSecs` / `GRADIENT_OTLP_PUSH_INTERVAL`	30	OTLP push interval.
`dispatchRecordCandidates` / `GRADIENT_DISPATCH_RECORD_CANDIDATES`	false	Persist runner-up scoring candidates per dispatch.
`instanceMetricsIntervalSecs` / `GRADIENT_INSTANCE_METRICS_INTERVAL`	30	InstanceContext window recomputation interval.
`graphConsistencyIntervalSecs` / `GRADIENT_GRAPH_CONSISTENCY_INTERVAL`	300	Read-only build-graph consistency sweep interval; violations log as warnings (0 disables).
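For instance, an instance that wants shorter raw retention, longer dispatch forensics and OTLP push could set the following. The collector URL is hypothetical, and this page does not say whether Gradient expects the OTLP/gRPC (4317) or OTLP/HTTP (4318) port, so check your collector's listener. Enabling dispatchRecordCandidates stores more rows per dispatch, which the dispatch retention then bounds.

```nix
{
  services.gradient.settings = {
    metricsRetentionRawDays    = 7;    # raw phase_event / worker_sample rows
    metricsRetentionRollupDays = 0;    # keep minute/hour rollups forever
    dispatchRetentionDays      = 90;   # dispatched_job forensic rows
    dispatchRecordCandidates   = true; # also persist runner-up scoring candidates
    otlpEndpoint               = "http://otel-collector.internal:4317"; # hypothetical
    otlpPushIntervalSecs       = 60;
  };
}
```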

OIDC

services.gradient.oidc = {
  enable           = true;
  required         = false;   # set true to disable username/password login and require OIDC for all users
  clientId         = "gradient";
  clientSecretFile = "/run/secrets/gradient-oidc-secret";
  discoveryUrl     = "https://auth.example.com";
  scopes           = [ "openid" "email" "profile" ];
  iconUrl          = null;    # optional provider logo URL
};

Gradient uses PKCE (S256) and discovers all provider endpoints from discoveryUrl/.well-known/openid-configuration. The callback URL is https://$domain/api/v1/auth/oidc/callback; register it with your provider. Set required to true to disable username/password login and require OIDC for all users. Because Gradient always sends PKCE, providers that enforce it (e.g. kanidm) do not need allowInsecureClientDisablePkce.

To map OIDC groups to organization roles, request the groups scope (add "groups" to scopes) so the ID token carries the user's group claims, then attach oidc_group lists to state-managed roles (see Declarative State).
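As an example with kanidm, the NixOS services.kanidm.provision options can declare the OAuth2 client next to Gradient's OIDC settings. This sketch assumes a kanidm server with provisioning is already configured on the host, that the hypothetical group gradient-users holds the intended users, and that both services can read the shared secret file. kanidm serves a per-client discovery document under /oauth2/openid/<client name>, and PKCE stays enabled.

```nix
{
  services.kanidm.provision = {
    enable = true;
    groups."gradient-users" = { };
    systems.oauth2.gradient = {
      displayName     = "Gradient";
      originUrl       = "https://gradient.example.com/api/v1/auth/oidc/callback";
      originLanding   = "https://gradient.example.com/";
      basicSecretFile = "/run/secrets/gradient-oidc-secret";
      # Include "groups" so the ID token carries group claims for role mapping.
      scopeMaps."gradient-users" = [ "openid" "email" "profile" "groups" ];
    };
  };

  services.gradient.oidc = {
    enable           = true;
    clientId         = "gradient";
    clientSecretFile = "/run/secrets/gradient-oidc-secret";
    discoveryUrl     = "https://idm.example.com/oauth2/openid/gradient";
    scopes           = [ "openid" "email" "profile" "groups" ];
  };
}
```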

SCIM

services.gradient.scim = {
  enable     = true;
  tokenFile  = "/run/secrets/gradient-scim-token";
  hardDelete = false;   # default: DELETE soft-disables (active=false)
};

Enabling SCIM mounts an instance-level SCIM 2.0 provisioning surface at https://$domain/scim/v2, authenticated by the bearer token in tokenFile (not user credentials). SCIM provisions passwordless managed users that later authenticate via OIDC; SCIM groups map to roles through scim_group (see SCIM and Declarative State).

Option	Env	Default	Description
`scim.enable`	`GRADIENT_SCIM_ENABLED`	`false`	Mount the `/scim/v2` provisioning endpoints
`scim.tokenFile`	`GRADIENT_SCIM_TOKEN_FILE`	-	Path to the file holding the SCIM bearer token (required when enabled)
`scim.hardDelete`	`GRADIENT_SCIM_HARD_DELETE`	`false`	Hard-delete (cascade) on `DELETE /Users/{id}`; default soft-disables (`active=false`)

Email

services.gradient.email = {
  enable              = true;
  requireVerification = true;
  smtpHost            = "smtp.example.com";
  smtpPort            = 587;
  smtpUsername        = "gradient@example.com";
  smtpPasswordFile    = "/run/secrets/gradient-smtp";
  fromAddress         = "gradient@example.com";
  fromName            = "Gradient";
};

GitHub App

A GitHub App provides automatic webhook delivery and CI status reporting without per-project tokens. One App covers all organizations on the instance.

Setup

1. Create a GitHub App at github.com → Settings → Developer settings → GitHub Apps → New GitHub App, with:
   - Webhook URL: https://gradient.example.com/api/v1/hooks/github
   - Webhook secret: generate a random value and note it
   - Permissions: Repository → Commit statuses (Read & Write), Repository → Contents (Read)
   - Subscribe to events: Push, Installation
2. After creation, note the App ID and download the private key PEM.
3. Configure Gradient:

services.gradient.githubApp = {
  enable             = true;
  appId              = 123456;
  privateKeyFile     = "/run/secrets/gradient-github-app-key";
  webhookSecretFile  = "/run/secrets/gradient-github-app-webhook-secret";
};

Install the App on each GitHub organization. Gradient auto-creates the github-<account> integration pair from the install webhook. Alternatively, org admins can create integrations manually via the UI by entering the installation id (one per GitHub account; multiple per org are supported).
Once installed, push events automatically trigger evaluations (no polling) and CI statuses are reported using the installation token instead of a per-project PAT.

Forge Webhooks (Gitea / Forgejo / GitLab / GitHub without App)

For non-GitHub forges or GitHub without the App, configure a per-organization webhook secret via the UI:

1. Open Organization → Settings → Forge Webhooks and click Generate Webhook Secret.
2. Copy the displayed Webhook URL and Secret.
3. In your forge, create a push webhook pointing to the URL and set the secret. Gitea/Forgejo and GitHub use it for HMAC-SHA256 signing; GitLab sends it verbatim in the X-Gitlab-Token header.

Webhook path and signature header by forge:

Forge	URL path segment	Signature header
Gitea / Forgejo	`/hooks/gitea/{org}` or `/hooks/forgejo/{org}`	`X-Gitea-Signature`
GitLab	`/hooks/gitlab/{org}`	`X-Gitlab-Token`
GitHub (no App)	`/hooks/github/{org}`	`X-Hub-Signature-256`

Gradient matches the incoming push payload's clone URL against active projects and queues an evaluation immediately.

Workers

Build capacity is provided by gradient-worker instances that connect to the server over a WebSocket at /proto. Workers are separate processes and can run on the same host or on dedicated build machines.

The server does not start a worker automatically. Configure one explicitly using the gradient-worker NixOS module.

Co-located Worker

To run a worker on the same machine as the server, import the worker module and configure services.gradient.worker:

imports = [ inputs.gradient.nixosModules.gradient-worker ];

services.gradient.worker = {
  enable    = true;
  serverUrl = "ws://127.0.0.1:3000/proto";
  capabilities = {
    fetch = true;
    eval  = true;
    build = true;
    sign  = true;
  };
  settings.buildMetrics = true; # opt in to per-build resource metrics for smarter scheduling (enables Nix's cgroups experimental feature)
};

Remote Workers

Deploy gradient-worker on dedicated build machines. First register the worker under an organization, either declaratively via state.workers (see below) or via the API. The worker_id must be a UUID v4. The worker auto-generates one on first start and persists it to /var/lib/gradient-worker/worker-id:

cat /var/lib/gradient-worker/worker-id

curl -X POST https://gradient.example.com/api/v1/orgs/myorg/workers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"worker_id": "550e8400-e29b-41d4-a716-446655440001"}'
# → {"error":false,"message":{"peer_id":"<uuid>","token":"<token>"}}

You can optionally pre-generate the token and pass it in the request (openssl rand -base64 48); the response will then omit the token field.

Then on the build machine:

imports = [ inputs.gradient.nixosModules.gradient-worker ];

services.gradient.worker = {
  enable    = true;
  serverUrl = "wss://gradient.example.com/proto";
  peersFile = "/run/secrets/gradient-worker-peers";

  capabilities = {
    fetch = true;
    eval  = true;
    build = true;
    sign  = true;
  };

  settings = {
    maxConcurrentBuilds      = 8;
    evalWorkers              = 2;
    maxConcurrentEvaluations = 2;
    buildMetrics             = true; # opt in to per-build resource metrics for smarter scheduling (enables Nix's cgroups experimental feature)
  };
};

Write the registration result to the peers file (one peer_id:token pair per line):

echo "<peer_id>:<token>" > /run/secrets/gradient-worker-peers

The special peer ID * can be used instead of a specific UUID to respond with that token for any peer the server challenges:

# /run/secrets/gradient-worker-peers
*:<token>

The token must be the 48-byte random secret returned by the registration API (generated server-side as with openssl rand -base64 48), or the token you supplied in the registration request.
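The remote worker example above runs 8 parallel builds but keeps maxNixdaemonConnections at its default of 32, while the Worker Options table below recommends maxConcurrentBuilds * 8 plus headroom (8 × 8 = 64 here). A sketch of matching settings follows; the headroom of 16 and the per-build core cap are arbitrary choices, not values recommended by Gradient.

```nix
{
  services.gradient.worker.settings = {
    maxConcurrentBuilds     = 8;
    # Guidance: maxConcurrentBuilds * 8 = 64 connections, plus headroom.
    maxNixdaemonConnections = 80;
    # Optional: cap each build at 4 cores so 8 concurrent builds
    # use at most 32 cores on this machine.
    maxBuildCores           = 4;
  };
}
```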

Worker Options

Option	Default	Description
`serverUrl`	`null`	WebSocket URL of the server's `/proto` endpoint (required)
`workerId`	`null`	Override the worker UUID (`GRADIENT_WORKER_ID`). When null, the ID is read from `$StateDirectory/worker-id` or auto-generated on first start
`peersFile`	`null`	Path to peers file (`peer_id:token` per line, `*` = any peer)
`useTls`	`true`	Server option, not a worker setting. Serve Gradient over HTTPS: emit `https://` URLs (`GRADIENT_SERVE_URL`, OIDC redirect) and mark session cookies `Secure`. With `reverseProxy.nginx.manageTls` also enables ACME + forceSSL on the nginx vhost. Set `false` only for plain-HTTP deployments
`discoverable`	`false`	Accept incoming connections from the server (reverse-proxy mode)
`listenAddr`	`127.0.0.1`	Bind address for the worker listener
`port`	`3100`	Listener port when `discoverable` is enabled
`capabilities.fetch`	`false`	Prefetch flake inputs
`capabilities.eval`	`false`	Run Nix evaluations
`capabilities.build`	`false`	Execute Nix builds
`capabilities.sign`	`false`	Sign store paths
`capabilities.federate`	`false`	Act as a federation relay (requires `discoverable`)
`settings.maxConcurrentBuilds`	`64`	Parallel build slots
`settings.maxBuildCores`	`null`	Cap on CPU cores a single build may use (nix `--cores` / `NIX_BUILD_CORES`, `GRADIENT_WORKER_MAX_BUILD_CORES`). When null, a build may use all available cores
`settings.evalWorkers`	`1`	Number of evaluator subprocesses
`settings.maxConcurrentEvaluations`	`1`	Parallel evaluations
`settings.cpuCoreScore`	`null`	Override the advertised single-core speed score (higher is faster, `GRADIENT_WORKER_CPU_CORE_SCORE`). When null, the worker benchmarks the host at startup
`settings.evalForkWorkers`	`null`	Number of parallel eval subprocesses in the pool (the eval concurrency). When null, auto-sizes to the host core count capped at 16, then bounded so `evalForkWorkers * maxEvalRss` fits host RAM. Per-system shards share one eval-cache safely (per-shard commits append to the WAL; one checkpoint folds it in at end-of-eval). (`GRADIENT_EVAL_FORK_WORKERS`)
`settings.maxEvalRss`	`8589934592` (8 GiB)	Safety cap on an eval subprocess's resident memory: once its RSS exceeds this many bytes it is recycled (parent-side). Keep it above a typical eval's heap so warm workers are not recycled mid-evaluation (`GRADIENT_MAX_EVAL_RSS`)
`settings.minFreeRamMb`	`0` (adaptive)	Free-RAM safety margin in MiB for the eval-subprocess reaper. When host `MemAvailable` falls below this, the worker SIGKILLs the largest live eval subprocess to forestall a host OOM (the parent then reports the eval as failed instead of the machine freezing). `0` selects an adaptive margin of `max(1 GiB, 10% of total RAM)`. `maxEvalRss` only recycles between calls; this is the proactive peak guard. (`GRADIENT_MIN_FREE_RAM_MB`)
`settings.evalCacheDir`	`null`	Eval-cache directory exported to eval workers as `NIX_CACHE_HOME`. When null, resolves to `{baseDir}/eval-cache` (`GRADIENT_EVAL_CACHE_DIR`)
`settings.evalCacheShare`	`true`	Enable fleet eval-cache sharing (pull/push of `<fingerprint>.sqlite` blobs across workers, issue #386) (`GRADIENT_EVAL_CACHE_SHARE`)
`settings.evalMetricsEnabled`	`true`	Capture per-evaluation Nix metrics (thunks, heap, peak RSS, per-entry-point hotspots, flake graph). When false, eval-workers skip the stats read (zero overhead). (`GRADIENT_EVAL_METRICS_ENABLED`)
`settings.maxNixdaemonConnections`	`32`	Worker's local nix-daemon connection pool size. Each in-flight NAR import holds one connection; size for `maxConcurrentBuilds * 8` plus headroom
`settings.narPartialTtlSecs`	`86400` (24 h)	TTL in seconds for partially-received NAR downloads (`*.partial`) staged under `<baseDir>/nar-partial`. A periodic sweep deletes partials older than this so an abandoned resumable transfer can't pin disk forever (issue #225). `0` disables the sweep. (`GRADIENT_NAR_PARTIAL_TTL_SECS`)
`settings.maxProtoConnections`	`16`	Max simultaneous WebSocket connections (for discoverable mode)
`settings.gcrootsDir`	`/nix/var/nix/gcroots/gradient`	Directory for worker-held indirect GC roots. One symlink per active build (drv + outputs) pins inputs and just-built outputs through the daemon so a concurrent `nix-collect-garbage` cannot race the build. Empty string disables
`settings.buildMetrics`	`false`	Capture per-build resource metrics (`GRADIENT_WORKER_BUILD_METRICS`) that feed the resource-aware scheduler's RAM/CPU/disk predictions for smarter job placement. Off by default because it turns on Nix's experimental `cgroups` feature + `use-cgroups` on the daemon and delegates the cgroup-v2 controllers to `nix-daemon.service` (`Delegate=yes`) so `memory.peak`/`io.stat` are exposed. CPU time comes from the daemon build result; peak RAM and disk I/O are sampled live from the build's cgroup (located via `buildCgroupStateDir`) - reliable at build concurrency 1, best-effort under concurrency. Wall-clock build time is always reported
`settings.buildCgroupRoot`	`/sys/fs/cgroup`	Cgroup-v2 mount root; sampled cgroup paths must live under it (`GRADIENT_WORKER_BUILD_CGROUP_ROOT`)
`settings.buildCgroupStateDir`	`/nix/var/nix/cgroups`	Nix's `<state-dir>/cgroups` map of build-user UID → cgroup path; the worker reads the newest entry to locate a running build's cgroup (`GRADIENT_WORKER_BUILD_CGROUP_STATE_DIR`). Granted read access via `ReadOnlyPaths`
`settings.logBurstBytesPerMin`	`8388608` (8 MiB)	Burst token bucket: max build-log bytes forwarded to the server per build in any 1-minute window. On trip the worker appends a truncation marker and stops forwarding that build's log (the build still runs). (`GRADIENT_LOG_BURST_BYTES_PER_MIN`)
`settings.logSustainedBytesPerHour`	`67108864` (64 MiB)	Sustained token bucket: max build-log bytes forwarded per build in any 1-hour window. (`GRADIENT_LOG_SUSTAINED_BYTES_PER_HOUR`)
`settings.logFetchFromStore`	`true`	When a derivation is already built locally (no fresh log), read nix's stored `.bz2` build log and forward it so the UI still shows output. (`GRADIENT_LOG_FETCH_FROM_STORE`)
`settings.logLevel.default`	`info`	Worker log level
`settings.logLevel.eval`	`null`	Evaluator log level override
`settings.logLevel.build`	`null`	Builder log level override
`settings.logLevel.proto`	`null`	Protocol log level override
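To see how the automatic eval sizing and the adaptive RAM margin interact, take a hypothetical worker with 32 cores and 64 GiB of RAM, leaving evalForkWorkers = null, maxEvalRss at its 8 GiB default and minFreeRamMb = 0. The calculation assumes the RAM bound is a plain floor of total RAM divided by maxEvalRss; the worker may reserve additional headroom.

```latex
\begin{aligned}
N_{\text{cores}} &= \min(32,\ 16) = 16 \\
N_{\text{RAM}} &= \left\lfloor \frac{64\ \text{GiB}}{8\ \text{GiB}} \right\rfloor = 8 \\
\texttt{evalForkWorkers} &= \min(N_{\text{cores}},\ N_{\text{RAM}}) = 8 \\
\texttt{minFreeRamMb} &= \max(1\ \text{GiB},\ 0.10 \times 64\ \text{GiB}) = 6.4\ \text{GiB} \approx 6554\ \text{MiB}
\end{aligned}
```

On this host the reaper would start killing the largest eval subprocess once MemAvailable drops below roughly 6.4 GiB.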

Hashing

Gradient hashes NARs and compressed cache files with SHA-256 by default. No client-side experimental feature is required to substitute from a Gradient cache.

BLAKE3-prefixed (blake3:{nix32}) hashes are still accepted on the read path so narinfo rows uploaded while the BLAKE3 default was active (issue #132) keep resolving, and so upstream caches that advertise either algorithm interoperate cleanly.

Declarative State

Users, organizations, projects, integrations, caches, API keys, custom roles, and workers can be declared in services.gradient.state and reconciled on every startup. See Declarative State for the full options reference.

API keys

State-managed API keys are declared under state.api_keys.<name>:

key_file (required, path): file containing the lowercase 64-char SHA-256 hex digest of the token (without the GRAD prefix).
owned_by (required, string): username that owns the key.
permissions (required, list of strings): permission identifiers the key grants. See gradient_db::permissions::Permission (or GET /user/keys/permissions) for the full list.
organization (optional, string): organization name to pin the key to. Omit for an unscoped key.
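Put together, a state-managed key might be declared like this. The key name ci-deploy, the user alice, the organization myorg and the secret path are hypothetical, and the permission identifier is a placeholder to be replaced with values from GET /user/keys/permissions.

```nix
{
  services.gradient.state.api_keys.ci-deploy = {
    # File holds the lowercase 64-char SHA-256 hex digest, not the token itself.
    key_file     = "/run/secrets/gradient-ci-deploy-key-sha256";
    owned_by     = "alice";
    # Placeholder: replace with identifiers from GET /user/keys/permissions.
    permissions  = [ "<permission-id>" ];
    # Omit this attribute for an unscoped key.
    organization = "myorg";
  };
}
```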

Roles

State-managed custom roles are declared under state.roles.<name>:

name (defaults to attrset key): role name. Must be unique within the organization and must not collide with built-in role names (Admin, Write, View).
organization (required, string): the organization this role belongs to.
permissions (required, list of strings): the capabilities the role grants.

Managed roles cannot be modified or deleted via the API.

Flake input overrides

Each project may declare per-input flake overrides applied during evaluation fetch. Each entry must set exactly one of:

url - a flake-ref string to replace the input's URL.
keep_url = true - force an update of the input keeping the URL declared in the project's flake.nix.

Empty flake_input_overrides = {} (the default) means no overrides; flake.lock is used as-is. Setting the attrset to {} when overrides were previously declared removes all override rows for that project.

This is a persistent, project-wide mechanism. For a one-off override on a single build request, use gradient build's --override-input instead.

services.gradient.state.projects.my-project = {
  # ...
  flake_input_overrides = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    flake-utils.keep_url = true;
  };
};