Jobs

Jobs are computational tasks that run on machines in your Compute Share organization.

Overview

A job is a containerized workload that is dispatched to an available machine in your organization for execution.

Job Lifecycle

Jobs progress through these stages:

Created - Job is submitted but not yet queued
Queued - Waiting for an available machine
Assigned - Assigned to a machine and starting
Running - Currently executing
Completed - Finished, either successfully or with an error
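
The stage order above can be modelled as a small state machine. The following is a minimal Python sketch, not Compute Share's actual implementation: the `Stage` enum and the `next_stage` and `is_terminal` helpers are hypothetical names, and the sketch assumes stages advance strictly in the listed order (cancellation is ignored).

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Job stages, in the order listed above (hypothetical model)."""
    CREATED = "Created"
    QUEUED = "Queued"
    ASSIGNED = "Assigned"
    RUNNING = "Running"
    COMPLETED = "Completed"


# Enum members iterate in definition order, which gives the lifecycle order.
_ORDER = list(Stage)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None once the job is Completed."""
    index = _ORDER.index(stage)
    return _ORDER[index + 1] if index + 1 < len(_ORDER) else None


def is_terminal(stage: Stage) -> bool:
    """A job in a terminal stage will not change stage again."""
    return next_stage(stage) is None
```

For example, `next_stage(Stage.QUEUED)` returns `Stage.ASSIGNED`, and `is_terminal(Stage.COMPLETED)` is true.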

Creating Jobs

Job Configuration

Jobs are defined in YAML files. Here's a basic example:

name: data-processing-job
image: python:3.11
command: python process.py --input data.csv
environment:
  DATA_SOURCE: s3://mybucket/data
resources:
  cpu: 2
  memory: 4096  # MB
timeout: 3600   # seconds

Required Fields

name - Unique identifier for the job
image - Docker image to use
command - Command to execute inside the container

Optional Fields

environment - Environment variables
resources - Resource requirements and limits
timeout - Maximum execution time, in seconds
datasets - Input datasets to mount
outputs - Output artifacts to collect
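
Before submitting, it can help to check a config against the field lists above. This is a rough Python sketch that works on an already-parsed config (a plain dict); `validate_job` is a hypothetical helper, and treating unlisted fields as problems is an assumption, since the source does not say how unknown fields are handled.

```python
REQUIRED_FIELDS = ("name", "image", "command")
OPTIONAL_FIELDS = ("environment", "resources", "timeout", "datasets", "outputs")


def validate_job(config: dict) -> list[str]:
    """Return a list of problems found in a job config; empty means none were found."""
    problems = []
    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not config.get(field):
            problems.append(f"missing required field: {field}")
    # Assumption: fields outside the documented lists are reported, not ignored.
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for field in config:
        if field not in known:
            problems.append(f"unknown field: {field}")
    return problems
```

A config with only `name` set would yield two problems, one each for `image` and `command`.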

Submitting Jobs

Via CLI

Submit a job using the CLI:

# Submit from a config file
compute-share submit job.yaml

# Submit with inline config
compute-share submit --name "quick-test" --image "ubuntu:latest" --command "echo Hello"

Project-based Jobs

Organize related jobs into projects:

project: ml-training
jobs:
  - name: preprocess-data
    image: python:3.11
    command: python preprocess.py

  - name: train-model
    image: tensorflow/tensorflow:latest
    command: python train.py
    depends_on:
      - preprocess-data

Managing Jobs

Viewing Jobs

Monitor jobs from the Jobs dashboard:

Active Jobs - Currently running or queued
Completed Jobs - Finished executions with status
Job History - Full history with logs and metrics

Job Details

Click on any job to view:

Current status and progress
Assigned machine
Resource usage
Execution logs
Output artifacts

Canceling Jobs

Cancel a running or queued job:

compute-share cancel <job-id>

Or from the dashboard:

Navigate to the job detail page
Click Cancel Job

Resource Requirements

Specifying Resources

Define CPU, memory, GPU, and disk needs:

resources:
  cpu: 4          # Number of CPU cores
  memory: 8192    # Memory in MB
  gpu: 1          # Number of GPUs (optional)
  disk: 10240     # Disk space in MB

Resource Matching

Jobs are only assigned to machines that meet the requirements:

Available CPU cores ≥ requested CPU
Available memory ≥ requested memory
GPU count matches (if requested)
Sufficient disk space
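
The matching rules above amount to a per-resource comparison. A minimal Python sketch, assuming the same units as the `resources` block; the `Resources` class and `machine_can_run` function are hypothetical, and the sketch reads "GPU count matches" as "at least as many GPUs as requested", which the source does not confirm.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Resource amounts, using the same units as the `resources` block."""
    cpu: int = 0      # CPU cores
    memory: int = 0   # MB
    gpu: int = 0      # number of GPUs
    disk: int = 0     # MB


def machine_can_run(available: Resources, requested: Resources) -> bool:
    """True if the machine's free resources cover every requested amount."""
    return (
        available.cpu >= requested.cpu
        and available.memory >= requested.memory
        and available.gpu >= requested.gpu  # assumption: at least as many GPUs
        and available.disk >= requested.disk
    )
```

A job that requests no GPU leaves `gpu` at 0, so any machine passes the GPU check.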

Working with Data

Input Datasets

Mount datasets from your organization:

datasets:
  - name: training-data
    mount: /data/input
    mode: ro  # read-only

Output Artifacts

Collect results after job completion:

outputs:
  - name: trained-model
    path: /output/model.pkl
  - name: metrics
    path: /output/metrics.json

Environment Variables

Pass configuration to the job through environment variables:

environment:
  MODEL_TYPE: "transformer"
  BATCH_SIZE: "32"
  LEARNING_RATE: "0.001"
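
Environment variables always arrive as strings, so the job's own code has to convert them. A short Python sketch of how a script inside the container might read the variables above; `load_settings` is a hypothetical helper, and the fallback defaults are assumptions chosen to match the example values.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the job's settings from environment variables and convert their types."""
    env = os.environ if env is None else env
    return {
        "model_type": env.get("MODEL_TYPE", "transformer"),
        # int() and float() raise ValueError on malformed values, failing fast.
        "batch_size": int(env.get("BATCH_SIZE", "32")),
        "learning_rate": float(env.get("LEARNING_RATE", "0.001")),
    }
```

Passing a dict instead of relying on `os.environ` makes the function easy to exercise locally before submitting a job.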

Monitoring Jobs

Real-time Logs

View logs as the job executes:

compute-share logs <job-id> --follow

Metrics

Track job performance:

Runtime - Execution duration
CPU Usage - Actual CPU consumption
Memory Usage - Peak memory usage
Exit Code - Process exit status

Alerts

Set up notifications for:

Job completion
Job failures
Long-running jobs
Resource threshold violations

Job Patterns

Batch Processing

Run multiple similar jobs:

for file in data/*.csv; do
  # Quote "$file" so paths containing spaces are passed as one argument
  compute-share submit --name "process-$(basename "$file")" \
    --image "python:3.11" \
    --command "python process.py $file"
done
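
The same loop can be driven from Python, which avoids shell quoting issues. This is a sketch that mirrors the loop above: it uses only the `submit` flags shown in this document, while `build_submit_command` and `submit_all` are hypothetical helper names. By default it only builds the commands without running them.

```python
import subprocess
from pathlib import Path


def build_submit_command(path: Path) -> list[str]:
    """Build the argv for one `compute-share submit` call, mirroring the loop above."""
    return [
        "compute-share", "submit",
        "--name", f"process-{path.name}",
        "--image", "python:3.11",
        "--command", f"python process.py {path}",
    ]


def submit_all(directory: str, dry_run: bool = True) -> list[list[str]]:
    """Build (and, unless dry_run, run) a submit command for each CSV file."""
    commands = [build_submit_command(p) for p in sorted(Path(directory).glob("*.csv"))]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # raises if a submission fails
    return commands
```

Calling `submit_all("data")` returns the commands for inspection; `submit_all("data", dry_run=False)` would execute them and requires the CLI to be installed.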

Parallel Workflows

Execute independent jobs concurrently:

jobs:
  - name: task-1
    image: worker:latest
    command: process --chunk 1

  - name: task-2
    image: worker:latest
    command: process --chunk 2

  - name: task-3
    image: worker:latest
    command: process --chunk 3
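
When the number of chunks grows, writing each entry by hand becomes tedious. A small Python sketch that generates the job entries above as dicts; `chunk_jobs` is a hypothetical helper, and turning the result into a YAML file is left to whichever YAML library you use.

```python
def chunk_jobs(count: int, image: str = "worker:latest") -> list[dict]:
    """Build one independent job definition per chunk, numbered from 1."""
    return [
        {"name": f"task-{n}", "image": image, "command": f"process --chunk {n}"}
        for n in range(1, count + 1)
    ]
```

`chunk_jobs(3)` produces the three entries shown above, with no `depends_on` between them.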

Sequential Pipelines

Chain dependent jobs:

jobs:  # image omitted for brevity; each job still requires one
  - name: fetch-data
    command: fetch.sh

  - name: transform-data
    command: transform.sh
    depends_on: [fetch-data]

  - name: analyze-data
    command: analyze.sh
    depends_on: [transform-data]
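
A `depends_on` chain defines a dependency graph, and a valid run order is a topological sort of that graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to compute one such order for the pipeline above. It illustrates the idea only and makes no claim about how Compute Share schedules jobs internally; `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def execution_order(jobs: list[dict]) -> list[str]:
    """Return job names in an order that respects every `depends_on` entry."""
    # Map each job name to the set of jobs that must finish before it.
    graph = {job["name"]: set(job.get("depends_on", [])) for job in jobs}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


pipeline = [
    {"name": "fetch-data", "command": "fetch.sh"},
    {"name": "transform-data", "command": "transform.sh", "depends_on": ["fetch-data"]},
    {"name": "analyze-data", "command": "analyze.sh", "depends_on": ["transform-data"]},
]
```

For this pipeline, `execution_order(pipeline)` returns `fetch-data`, `transform-data`, `analyze-data`, in that order.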

Troubleshooting

Job Stays Queued

Possible reasons:

No machines meet resource requirements
All suitable machines are busy
Machine health checks are failing

To narrow down the cause, check machine availability in the dashboard.

Job Fails Immediately

Common causes:

Invalid Docker image
Command not found in container
Missing environment variables
Insufficient resources on assigned machine

Job Times Out

Solutions:

Increase timeout value
Optimize job efficiency
Request more resources
Split into smaller jobs

Best Practices

Resource Requests

Request only what you need; over-provisioning limits which machines can run the job
Test resource usage with small runs
Monitor and adjust based on metrics

Error Handling

Include proper error handling in scripts
Set appropriate timeouts
Log errors clearly
Use exit codes meaningfully
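
The points above can be combined in a job script. A minimal Python sketch, assuming a script that takes one input file: the exit-code values and the `run` and `main` names are hypothetical conventions, not something Compute Share prescribes.

```python
import logging

logger = logging.getLogger("job")

# Hypothetical exit-code convention for this sketch.
EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_BAD_INPUT = 2


def run(path: str) -> int:
    """Process one input file and return an exit code instead of raising."""
    try:
        with open(path, encoding="utf-8") as handle:
            rows = sum(1 for _ in handle)  # stand-in for the real processing
    except FileNotFoundError:
        logger.error("input file not found: %s", path)
        return EXIT_BAD_INPUT
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("unexpected failure while processing %s", path)
        return EXIT_FAILURE
    logger.info("processed %d rows from %s", rows, path)
    return EXIT_OK


def main(argv: list[str]) -> int:
    """Entry point: expects exactly one argument, the input file path."""
    if len(argv) != 1:
        logger.error("usage: process.py <input-file>")
        return EXIT_BAD_INPUT
    return run(argv[0])
```

In a real script, ending with `sys.exit(main(sys.argv[1:]))` would pass the result on as the process exit status, so distinct failure modes can be told apart from the exit code.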

Efficiency

Minimize container image size
Cache dependencies when possible
Use appropriate base images
Clean up temporary files

Documentation