Jobs

Jobs are computational tasks that run on machines in your Compute Share organization.

Overview

A job represents a containerized workload that gets distributed to available machines in your organization for execution.

Job Lifecycle

Jobs progress through these stages:

  1. Created - Job is submitted but not yet queued
  2. Queued - Waiting for an available machine
  3. Assigned - Assigned to a machine and starting
  4. Running - Currently executing
  5. Completed - Finished, either successfully or with an error
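
The stages above can be read as a small state machine. The following is a minimal sketch, not Compute Share's actual model: `JobStage`, `ALLOWED`, `advance`, and `is_terminal` are hypothetical names, and the transition table assumes each stage only advances to the next one in the list. Cancellation is not modeled.

```python
from enum import Enum


class JobStage(Enum):
    """Lifecycle stages, in the order listed above (hypothetical names)."""
    CREATED = "Created"
    QUEUED = "Queued"
    ASSIGNED = "Assigned"
    RUNNING = "Running"
    COMPLETED = "Completed"


# Assumption: each stage advances only to the next one in the list.
ALLOWED = {
    JobStage.CREATED: {JobStage.QUEUED},
    JobStage.QUEUED: {JobStage.ASSIGNED},
    JobStage.ASSIGNED: {JobStage.RUNNING},
    JobStage.RUNNING: {JobStage.COMPLETED},
    JobStage.COMPLETED: set(),
}


def advance(current: JobStage, target: JobStage) -> JobStage:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def is_terminal(stage: JobStage) -> bool:
    """A stage is terminal when no further transition is defined for it."""
    return not ALLOWED[stage]
```

Under these assumptions, `advance(JobStage.QUEUED, JobStage.RUNNING)` raises `ValueError`, because a job has to pass through Assigned first.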

Creating Jobs

Job Configuration

Jobs are defined in YAML files. Here's a basic example:

name: data-processing-job
image: python:3.11
command: python process.py --input data.csv
environment:
  DATA_SOURCE: s3://mybucket/data
resources:
  cpu: 2
  memory: 4096  # MB
timeout: 3600   # seconds

Required Fields

  • name - Unique identifier for the job
  • image - Docker image to use
  • command - Command to execute inside the container

Optional Fields

  • environment - Environment variables
  • resources - Resource requirements and limits
  • timeout - Maximum execution time, in seconds
  • datasets - Input datasets to mount
  • outputs - Output artifacts to collect
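
A quick local check before submitting can catch a missing required field early. This is a minimal sketch that works on an already-parsed config (a plain dict); `validate_job` is a hypothetical helper, not part of the Compute Share CLI, and it only knows the fields listed above, so keys used inside project files (such as depends_on) are out of scope here.

```python
REQUIRED_FIELDS = ("name", "image", "command")
OPTIONAL_FIELDS = ("environment", "resources", "timeout", "datasets", "outputs")


def validate_job(config: dict) -> list[str]:
    """Return a list of problems found in a job config; empty means none found."""
    problems = []
    # Required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not config.get(field):
            problems.append(f"missing required field: {field}")
    # Flag anything that is neither required nor optional (likely a typo).
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for field in config:
        if field not in known:
            problems.append(f"unknown field: {field}")
    return problems


job = {
    "name": "data-processing-job",
    "image": "python:3.11",
    "command": "python process.py --input data.csv",
}
print(validate_job(job))                     # []
print(validate_job({"name": "incomplete"}))  # two "missing required field" entries
```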

Submitting Jobs

Via CLI

Submit a job using the CLI:

# Submit from a config file
compute-share submit job.yaml

# Submit with inline config
compute-share submit --name "quick-test" --image "ubuntu:latest" --command "echo Hello"

Project-based Jobs

Organize related jobs into projects:

project: ml-training
jobs:
  - name: preprocess-data
    image: python:3.11
    command: python preprocess.py

  - name: train-model
    image: tensorflow/tensorflow:latest
    command: python train.py
    depends_on:
      - preprocess-data
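
A typo in depends_on is easy to miss. The sketch below checks that every dependency names a job in the same project. It assumes the project file has already been parsed into a dict with the shape shown above; `find_unknown_dependencies` is a hypothetical helper, and how Compute Share itself reports such errors is not covered here.

```python
def find_unknown_dependencies(jobs: list[dict]) -> dict[str, list[str]]:
    """Map each job name to the depends_on entries that name no job in the project."""
    names = {job["name"] for job in jobs}
    unknown = {}
    for job in jobs:
        missing = [dep for dep in job.get("depends_on", []) if dep not in names]
        if missing:
            unknown[job["name"]] = missing
    return unknown


project = {
    "project": "ml-training",
    "jobs": [
        {"name": "preprocess-data", "image": "python:3.11",
         "command": "python preprocess.py"},
        {"name": "train-model", "image": "tensorflow/tensorflow:latest",
         "command": "python train.py", "depends_on": ["preprocess-data"]},
    ],
}
print(find_unknown_dependencies(project["jobs"]))  # {}
```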

Managing Jobs

Viewing Jobs

Monitor jobs from the Jobs dashboard:

  • Active Jobs - Currently running or queued
  • Completed Jobs - Finished executions with status
  • Job History - Full history with logs and metrics

Job Details

Click on any job to view:

  • Current status and progress
  • Assigned machine
  • Resource usage
  • Execution logs
  • Output artifacts

Canceling Jobs

Cancel a running or queued job:

compute-share cancel <job-id>

Or from the dashboard:

  1. Navigate to the job detail page
  2. Click Cancel Job

Resource Requirements

Specifying Resources

Define CPU, memory, GPU, and disk requirements:

resources:
  cpu: 4          # Number of CPU cores
  memory: 8192    # Memory in MB
  gpu: 1          # Number of GPUs (optional)
  disk: 10240     # Disk space in MB

Resource Matching

Jobs are only assigned to machines that meet the requirements:

  • Available CPU cores ≥ requested CPU
  • Available memory ≥ requested memory
  • GPU count matches (if requested)
  • Available disk space ≥ requested disk space
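
The matching rules above amount to a per-resource comparison. The following sketch illustrates the idea and is not the scheduler's actual code: `machine_can_run` is a hypothetical function, the machine figures are made up for illustration, and the GPU rule is assumed to mean "at least as many GPUs as requested".

```python
def machine_can_run(requested: dict, available: dict) -> bool:
    """Check the matching rules for one machine.

    Both dicts use the keys from the resources block: cpu, memory, gpu, disk.
    A missing key counts as 0, so a job that requests no GPU passes the GPU check.
    """
    for key in ("cpu", "memory", "gpu", "disk"):
        # Assumption: "matches" for GPUs means available >= requested.
        if available.get(key, 0) < requested.get(key, 0):
            return False
    return True


job = {"cpu": 4, "memory": 8192, "gpu": 1, "disk": 10240}
machines = {  # illustrative values only
    "small": {"cpu": 2, "memory": 4096, "gpu": 0, "disk": 51200},
    "large": {"cpu": 16, "memory": 65536, "gpu": 2, "disk": 512000},
}
eligible = [name for name, free in machines.items() if machine_can_run(job, free)]
print(eligible)  # ['large']
```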

Working with Data

Input Datasets

Mount datasets from your organization:

datasets:
  - name: training-data
    mount: /data/input
    mode: ro  # read-only

Output Artifacts

Collect results after job completion:

outputs:
  - name: trained-model
    path: /output/model.pkl
  - name: metrics
    path: /output/metrics.json
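
For an artifact to be collected, the job has to write it to the configured path. The sketch below shows one way a Python job might write the metrics artifact from the example. `write_metrics` is a hypothetical helper, and the default directory simply mirrors the /output paths above.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, output_dir: str = "/output") -> Path:
    """Write metrics as JSON to <output_dir>/metrics.json and return the path."""
    directory = Path(output_dir)
    # Create the directory if the image does not already provide it.
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "metrics.json"
    target.write_text(json.dumps(metrics, indent=2))
    return target
```

Call it once at the end of the job, after the metrics are final, so a failed run does not leave a partial file behind.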

Environment Variables

Pass configuration to the job through environment variables:

environment:
  MODEL_TYPE: "transformer"
  BATCH_SIZE: "32"
  LEARNING_RATE: "0.001"
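
Environment variables always arrive as strings, so the job has to convert numeric values itself. A minimal sketch of reading the variables above from Python; `load_settings` is a hypothetical helper, and the fallback defaults are an assumption that copies the example values.

```python
import os


def load_settings(env=None) -> dict:
    """Read the job's settings from environment variables and convert types."""
    env = os.environ if env is None else env
    return {
        "model_type": env.get("MODEL_TYPE", "transformer"),
        "batch_size": int(env.get("BATCH_SIZE", "32")),             # "32" -> 32
        "learning_rate": float(env.get("LEARNING_RATE", "0.001")),  # "0.001" -> 0.001
    }


print(load_settings({"MODEL_TYPE": "transformer", "BATCH_SIZE": "32", "LEARNING_RATE": "0.001"}))
```

Inside a running job, call `load_settings()` with no argument to read from the real environment.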

Monitoring Jobs

Real-time Logs

View logs as the job executes:

compute-share logs <job-id> --follow

Metrics

Track job performance:

  • Runtime - Execution duration
  • CPU Usage - Actual CPU consumption
  • Memory Usage - Peak memory usage
  • Exit Code - Process exit status

Alerts

Set up notifications for:

  • Job completion
  • Job failures
  • Long-running jobs
  • Resource threshold violations

Job Patterns

Batch Processing

Run multiple similar jobs:

for file in data/*.csv; do
  # Quote "$file" inside basename so names containing spaces are not split
  compute-share submit --name "process-$(basename "$file")" \
    --image "python:3.11" \
    --command "python process.py $file"
done
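
The same loop can be driven from Python when the batch logic grows beyond a few lines. This sketch only builds and prints the commands (a dry run). It uses the `submit` flags shown above and nothing else; `build_submit_command` and `build_batch` are hypothetical helper names.

```python
import shlex
from pathlib import Path


def build_submit_command(csv_path: Path) -> list[str]:
    """Build the argument list for one `compute-share submit` call."""
    return [
        "compute-share", "submit",
        "--name", f"process-{csv_path.name}",
        "--image", "python:3.11",
        # Quote the path so it survives as one argument inside the command string.
        "--command", f"python process.py {shlex.quote(csv_path.as_posix())}",
    ]


def build_batch(data_dir: str = "data") -> list[list[str]]:
    """One submit command per CSV file in data_dir, in sorted order."""
    return [build_submit_command(p) for p in sorted(Path(data_dir).glob("*.csv"))]


for argv in build_batch():
    # Dry run: print the command instead of executing it.
    print(shlex.join(argv))
```

To actually submit, pass each `argv` to `subprocess.run(argv, check=True)` in place of the `print` call.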

Parallel Workflows

Execute independent jobs concurrently:

jobs:
  - name: task-1
    image: worker:latest
    command: process --chunk 1

  - name: task-2
    image: worker:latest
    command: process --chunk 2

  - name: task-3
    image: worker:latest
    command: process --chunk 3
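
When the number of chunks varies, the job list can be generated rather than written by hand. The sketch below builds the same structure as the example above as a plain dict. `chunk_jobs` is a hypothetical helper, and writing the result out as YAML is left to whichever YAML library you already use.

```python
import json


def chunk_jobs(count: int, image: str = "worker:latest") -> dict:
    """Build a jobs mapping with one independent job per chunk (1..count)."""
    return {
        "jobs": [
            {"name": f"task-{n}", "image": image, "command": f"process --chunk {n}"}
            for n in range(1, count + 1)
        ]
    }


# Printed as JSON here only to keep the example dependency-free.
print(json.dumps(chunk_jobs(3), indent=2))
```

None of the generated jobs has a depends_on entry, which is what allows them to run concurrently.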

Sequential Pipelines

Chain dependent jobs:

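# worker:latest is a placeholder; every job requires an image
jobs:
  - name: fetch-data
    image: worker:latest
    command: fetch.sh

  - name: transform-data
    image: worker:latest
    command: transform.sh
    depends_on: [fetch-data]

  - name: analyze-data
    image: worker:latest
    command: analyze.sh
    depends_on: [transform-data]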
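
To preview the order that depends_on implies, the jobs can be sorted topologically. This is a sketch using Python's standard `graphlib` module (Python 3.9+). `execution_waves` is a hypothetical helper, and the scheduler's real ordering may differ in detail.

```python
from graphlib import TopologicalSorter


def execution_waves(jobs: list[dict]) -> list[list[str]]:
    """Group job names into waves; each job's dependencies sit in earlier waves."""
    graph = {job["name"]: set(job.get("depends_on", [])) for job in jobs}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


pipeline = [
    {"name": "fetch-data", "command": "fetch.sh"},
    {"name": "transform-data", "command": "transform.sh", "depends_on": ["fetch-data"]},
    {"name": "analyze-data", "command": "analyze.sh", "depends_on": ["transform-data"]},
]
print(execution_waves(pipeline))
# [['fetch-data'], ['transform-data'], ['analyze-data']]
```

Jobs that land in the same wave do not depend on each other, so a fully independent set of jobs collapses into a single wave.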

Troubleshooting

Job Stays Queued

Possible reasons:

  • No machines meet resource requirements
  • All suitable machines are busy
  • Machine health checks are failing

Check machine availability in the dashboard to see which of these applies.

Job Fails Immediately

Common causes:

  • Invalid Docker image
  • Command not found in container
  • Missing environment variables
  • Insufficient resources on assigned machine

Job Times Out

Solutions:

  • Increase timeout value
  • Optimize job efficiency
  • Request more resources
  • Split into smaller jobs

Best Practices

Resource Requests

  • Request only what you need; over-provisioning limits the machines a job can be assigned to
  • Test resource usage with small runs
  • Monitor and adjust based on metrics

Error Handling

  • Include proper error handling in scripts
  • Set appropriate timeouts
  • Log errors clearly
  • Use exit codes meaningfully
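
The points above can be combined in a small wrapper around the job's entry point. This is a sketch of one possible convention, not a Compute Share requirement: `run` is a hypothetical helper, and the specific exit code values are assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("job")

EXIT_OK = 0
EXIT_FAILURE = 1    # unexpected error
EXIT_BAD_INPUT = 2  # assumed convention: input missing or invalid


def run(task, *args) -> int:
    """Run task(*args) and translate the outcome into an exit code."""
    try:
        task(*args)
    except (FileNotFoundError, ValueError) as exc:
        # Expected failure mode: log one clear line, no traceback.
        log.error("bad input: %s", exc)
        return EXIT_BAD_INPUT
    except Exception:
        # Unexpected failure: log the full traceback for debugging.
        log.exception("unexpected failure")
        return EXIT_FAILURE
    log.info("finished successfully")
    return EXIT_OK
```

End the script with `sys.exit(run(main))`, where `main` stands for your own entry function, so the exit code reported in the job's metrics reflects what actually happened.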

Efficiency

  • Minimize container image size
  • Cache dependencies when possible
  • Use appropriate base images
  • Clean up temporary files

Next Steps