Fallback Client

The FallbackClient class provides an interface to the fallback service, which lets users run jobs on a fallback cloud such as RunPod. It uses SkyPilot to manage the cluster. The class includes methods for creating, managing, and retrieving jobs.

class compute_horde_sdk.v1.fallback.FallbackClient

A fallback client that provides the same API as ComputeHordeClient.

__init__(cloud, idle_minutes=15, **kwargs)

Initializes a FallbackClient that can execute jobs using a SkyPilot-managed cloud backend.

Parameters:
  • cloud (str) – The name of the cloud backend to use (e.g. “runpod”). This value is passed directly to SkyPilot. We currently only test with “runpod”, but you are welcome to try the other providers too.

  • idle_minutes (int) – Number of minutes the fallback instance can remain idle before being shut down. Defaults to 15 minutes.

  • kwargs (Any) – Additional arguments forwarded to the SkyPilot cloud environment setup.

Raises:

ModuleNotFoundError – If the fallback extra is not installed.

Return type:

None

Note

You must install the fallback extra with:

pip install compute-horde-sdk[fallback]

For details on available cloud environments, see: https://docs.skypilot.co/en/v0.8.1/overview.html#cloud-vms
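
The constructor behaviour described above can be sketched as follows. This is a minimal illustration, not part of the SDK: the helper name build_fallback_client and its factory parameter are hypothetical, introduced so the missing-extra case can be shown without assuming the SDK is installed. In real code, factory would be FallbackClient itself.

```python
def build_fallback_client(factory, cloud="runpod", idle_minutes=15, **kwargs):
    """Construct a client via `factory`, turning a missing extra into a clear hint.

    `factory` is expected to be FallbackClient (or anything with the same
    constructor signature). This helper is illustrative, not part of the SDK.
    """
    try:
        # Extra keyword arguments are forwarded unchanged, mirroring **kwargs.
        return factory(cloud, idle_minutes=idle_minutes, **kwargs)
    except ModuleNotFoundError as exc:
        # Documented failure mode: the fallback extra is not installed.
        raise RuntimeError(
            "The fallback extra is missing; run: "
            "pip install compute-horde-sdk[fallback]"
        ) from exc
```

With the extra installed, usage would look like: from compute_horde_sdk.v1.fallback import FallbackClient, then client = build_fallback_client(FallbackClient, "runpod", idle_minutes=30).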

async create_job(job_spec)

Run a fallback job in the SkyPilot cluster. This method does not retry a failed job. Use run_until_complete() if you want failed jobs to be automatically retried.

Parameters:

job_spec (FallbackJobSpec) – Job specification to run.

Returns:

A FallbackJob class instance representing the created job.

Return type:

FallbackJob

async run_until_complete(job_spec, job_attempt_callback=None, timeout=None, max_attempts=3)

Run a fallback job in the SkyPilot cluster until it succeeds. This method calls create_job() repeatedly until the job is successful or the max_attempts or timeout limit is reached.

Parameters:
  • job_spec (FallbackJobSpec) – Job specification to run.

  • job_attempt_callback (Callable[[FallbackJob], None] | Callable[[FallbackJob], Awaitable[None]] | Callable[[FallbackJob], Coroutine[Any, Any, None]] | None) – A callback function that is called after every attempt to run the job. The callback is called immediately after an attempt is made to run the job, before waiting for the job to complete. The function must take one argument of type FallbackJob. It can be a regular or an async function.

  • timeout (float | None) – Maximum number of seconds to wait across all attempts.

  • max_attempts (int) – Maximum number of times the job will be attempted within timeout seconds. A value of 0 or less means unlimited attempts.

Returns:

A FallbackJob class instance representing the created job. If the job was rerun, it will represent the last attempt.

Return type:

FallbackJob
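
The retry semantics described above can be sketched as a plain asyncio loop. This is an illustration of the documented behaviour, not the SDK's actual implementation. The function name run_until_complete_sketch is hypothetical, and the job object is assumed to expose an async wait() method and a boolean succeeded attribute, which are stand-ins and may not match the real FallbackJob API. How the real method reports a timeout is not documented here; the sketch simply lets asyncio raise its timeout error.

```python
import asyncio
import inspect


async def run_until_complete_sketch(
    create_job, job_spec, job_attempt_callback=None, timeout=None, max_attempts=3
):
    """Illustrative retry loop; NOT the SDK's actual implementation.

    Assumptions: `create_job` is an async callable returning a job object with
    a hypothetical async `wait()` method and a hypothetical boolean
    `succeeded` attribute.
    """

    async def attempt_loop():
        attempt = 0
        while True:
            attempt += 1
            job = await create_job(job_spec)
            # The callback fires right after the attempt starts, before waiting.
            if job_attempt_callback is not None:
                result = job_attempt_callback(job)
                if inspect.isawaitable(result):
                    await result  # async callbacks are awaited
            await job.wait()
            # max_attempts <= 0 means unlimited attempts.
            out_of_attempts = max_attempts > 0 and attempt >= max_attempts
            if job.succeeded or out_of_attempts:
                return job  # the last attempt, successful or not

    # timeout=None waits indefinitely; otherwise the whole loop is bounded.
    return await asyncio.wait_for(attempt_loop(), timeout)
```

Passing a regular function or an async function as job_attempt_callback both work in this sketch, because the result is awaited only when it is awaitable.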

async get_job(job_uuid)

Retrieve information about a job from the SkyPilot cluster.

Parameters:

job_uuid (str) – The UUID of the job to retrieve.

Returns:

A FallbackJob instance representing this job.

Raises:

FallbackNotFoundError – If the job with this UUID does not exist.

Return type:

FallbackJob

async get_jobs()

Retrieve information about your jobs from the SkyPilot cluster.

Returns:

A list of FallbackJob instances representing your jobs.

Return type:

list[FallbackJob]

async iter_jobs()

Retrieve information about your jobs from the SkyPilot cluster.

Returns:

An async iterator of FallbackJob instances representing your jobs.

Return type:

AsyncIterator[FallbackJob]
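
A short sketch of consuming the iterator, assuming iter_jobs() can be used directly in an async for loop. The helper name collect_jobs and its limit parameter are hypothetical and exist only for illustration; the helper works with any object that exposes a compatible iter_jobs().

```python
async def collect_jobs(client, limit=None):
    """Gather jobs from `client.iter_jobs()`, stopping early after `limit` items.

    `collect_jobs` is a hypothetical helper. It only assumes that
    `iter_jobs()` returns an async iterator.
    """
    jobs = []
    async for job in client.iter_jobs():
        if limit is not None and len(jobs) >= limit:
            break  # stop iterating once enough jobs were collected
        jobs.append(job)
    return jobs
```

Unlike get_jobs(), which returns the full list at once, iterating lets the caller stop early without materializing every job.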

async get_job_streaming_port(job_uuid)

Retrieve the SSH port of the job for streaming.

Parameters:

job_uuid (str) – The UUID of the job.

Return type:

int | None

async create_ssh_tunnel(job_uuid, local_port)

Create an SSH tunnel to the job’s streaming port.

Parameters:
  • job_uuid (str) – The UUID of the job to tunnel to.

  • local_port (int) – The local port to forward to.

Return type:

None
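
The two streaming methods can be combined as in the sketch below. The helper name open_streaming_tunnel is hypothetical, and treating a None port as "no tunnel needed" is an assumption based on the int | None return type of get_job_streaming_port().

```python
async def open_streaming_tunnel(client, job_uuid, local_port):
    """Open an SSH tunnel only if the job exposes a streaming port.

    Hypothetical helper. Returns the remote streaming port, or None when
    get_job_streaming_port() reports no port and no tunnel was created.
    """
    remote_port = await client.get_job_streaming_port(job_uuid)
    if remote_port is None:
        return None  # assumed to mean there is nothing to tunnel to
    await client.create_ssh_tunnel(job_uuid, local_port)
    return remote_port
```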