
Concepts & Roles

Understanding the platform's tenancy model and permission system helps you work effectively with projects, groups, and storage.


Tenancy Hierarchy

Resources on the platform are organised in a hierarchy:

```mermaid
graph TD
    Admin[Platform Admin] -->|creates| Group[Group\ne.g. Research Lab]
    Group -->|contains| Project[Project\ne.g. LLM Fine-Tuning]
    Project -->|launches| Workspace[Workspace / IDE]
    Project -->|binds| ProjStorage[Project Storage PVC]
    Group -->|owns| GroupStorage[Group Storage PVC]
    GroupStorage -->|inherited by| ProjStorage
```

| Entity | Description |
|---|---|
| Group | A team or department. Groups own shared storage and can be project members. |
| Project | The primary resource allocation unit. Holds GPU/CPU quotas, config files, and deployments. |
| Workspace | An interactive IDE (JupyterLab / VSCode) running inside the project's quota. |
| Storage (PVC) | Persistent volume claim; survives workspace restarts. |
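As a concrete illustration, a project storage PVC is an ordinary Kubernetes PersistentVolumeClaim. The sketch below is hypothetical: the claim name is invented, and using the storage-lane name as the storageClassName is an assumption, not confirmed platform behaviour.

```yaml
# Hypothetical project-storage PVC. The name and the use of the
# "shared-rwx" lane as a storageClassName are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-fine-tuning-storage
spec:
  accessModes:
    - ReadWriteMany          # shared lane: mountable by several workspaces
  resources:
    requests:
      storage: 100Gi
  storageClassName: shared-rwx
```

Because the PVC is bound to the project rather than the pod, its data outlives any individual workspace restart.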

Quota Lifecycle

```mermaid
flowchart LR
    Admin["Admin defines\nResource Plan"] -->|assigned to| Project
    Project -->|quota enforced on| Launch[Workspace Launch]
    Launch -->|requests| DRA["DRA ResourceClaim"]
    Launch -->|uses| Queue["Platform Queue"]
    DRA -->|allocates| GPULimit["GPU / SM Share"]
    Queue -->|applies| Priority["Priority / Preemption Policy"]
```

Each project is assigned a Resource Plan by an admin. When you launch a workspace, the platform verifies project quota and any per-user quota before creating Kubernetes DRA resources for the pod.

A plan also carries:

  • An allowed GPU model list — only these models are selectable from the launch form.
  • An optional schedule window that restricts when plan-bound queues are usable (see Plan Window below).
  • One or more bound queues with their priority, preemption, and GPU-model policy.
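Putting those pieces together, a plan might be expressed roughly as follows. This is a sketch only: the field names, GPU model identifiers, and queue names are all invented for illustration and do not reflect the platform's actual schema.

```yaml
# Hypothetical shape of a Resource Plan; every field name here is
# illustrative, not the platform's real configuration format.
name: research-lab-standard
quotas:
  gpu: 8
  cpu: "64"
  memory: 256Gi
  perUserGpu: 2              # per-user quota checked at launch time
allowedGpuModels:            # only these appear on the launch form
  - NVIDIA-A100-80GB
  - NVIDIA-L40S
planWindow:                  # see Plan Window below
  days: [Mon, Tue, Wed, Thu, Fri]
  open: "09:00"
  close: "21:00"
boundQueues:
  - name: lab-high
    priority: 100
    preemptible: false
  - name: lab-batch
    priority: 10
    preemptible: true
    deservedGpu: 2
```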

Scheduling Queues, Priority, and Preemption

Workloads choose a platform queue. Queue records are synced to Volcano Queue CRDs for policy compatibility, while DRA GPU pods are scheduled by the Kubernetes default scheduler using ResourceClaims. A queue owns a priority value, an optional preemptible flag, and an optional deserved GPU limit:

| Concept | Meaning |
|---|---|
| Priority | Higher-priority queues are scheduled first when GPUs are scarce. |
| Preemptible | Pods in this queue can be evicted to free GPUs for higher-priority queues. |
| Deserved GPU | The GPU floor a queue is guaranteed even when other queues compete. A 0 value marks a queue as a "victim" eligible to be drained first. |

The default queue is always available; plan-bound queues are listed alongside it on the workspace and deploy forms.
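Since queue records are synced to Volcano Queue CRDs, the three concepts above map onto that CRD roughly as sketched below. The resource name and values are illustrative, and field availability (notably spec.priority and spec.deserved) depends on the Volcano version deployed, so treat this as an assumption about the sync target rather than the platform's exact output.

```yaml
# Sketch of the Volcano Queue a platform queue might sync to.
# Values and the nvidia.com/gpu resource name are illustrative.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: lab-batch
spec:
  priority: 10            # higher values win under GPU contention
  reclaimable: true       # maps to the platform's "preemptible" flag
  deserved:
    nvidia.com/gpu: 2     # GPU floor; 0 would mark the queue a victim
```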


Plan Window

A plan window is the time interval during which a project's plan-bound queues are usable. Outside the window, plan-bound queues stop scheduling new pods and the platform's plan-window reaper may evict pods already running in those queues. Pods on the default queue are not affected.

The dashboard shows live countdowns for the next boundary (open/close), so users can avoid eviction by switching to the default queue or waiting for the next window.


DRA Resource Claims

GPU access on the platform is provisioned through Kubernetes Dynamic Resource Allocation (DRA) ResourceClaims. Each claim describes:

  • A device class (e.g., a specific GPU model)
  • An SM share percentage that describes the requested compute-time share
  • A VRAM policy: elastic shares VRAM with peers, hard_cap enforces a strict ceiling
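In current Kubernetes DRA terms, those three properties could be carried by a ResourceClaim like the sketch below. The API group/version matches upstream DRA, but the driver name, device class, and the smShare/vramPolicy parameter names are assumptions about this platform's driver, not documented values.

```yaml
# Hypothetical ResourceClaim; the driver name, device class, and the
# smShare / vramPolicy parameters are illustrative assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: train-a
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.example.com   # a specific GPU model class
    config:
      - requests: ["gpu"]
        opaque:
          driver: gpu.example.com
          parameters:
            smShare: 50            # 50% compute-time share
            vramPolicy: elastic    # or hard_cap for a strict ceiling
```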

There are two ways a claim is created:

  • Inline claim — the workspace launch flow creates a ResourceClaim that lives with the pod and disappears when the pod is deleted.
  • Project-managed claim — created from a project's GPU Claims tab and reused by Pod or Deployment config files across launches. It is a pending contract until Kubernetes DRA allocates it and reservedFor points at a live Pod; only that bound state counts against quota and resource-hours.

Config files refer to project-managed claims through named deploy-time slots, for example platform-go/dra-claim-name: '{{ gpuClaimName "train-a" }}'. The slot name is not the claim name; the deploying user maps each slot to one of their own ResourceClaims. A Deployment with multiple replicas shares the same claim allocation and is counted once per claim. A single Kubernetes Deployment cannot assign different claims to different replicas because every replica shares one PodTemplate, so split the workload into multiple Deployments when only some pods should use a claim.
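In a config file, the slot wiring might look like the fragment below. The platform-go/dra-claim-name annotation and the "train-a" slot are taken from the text above; everything else (image, labels, how the platform rewrites the annotation into DRA pod fields at deploy time) is an assumption for illustration.

```yaml
# Hypothetical Deployment config using a named claim slot. Only the
# annotation line is from the docs; the rest is an illustrative sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: train
spec:
  replicas: 2                  # all replicas share the one claim allocation
  selector:
    matchLabels: { app: train }
  template:
    metadata:
      labels: { app: train }
      annotations:
        platform-go/dra-claim-name: '{{ gpuClaimName "train-a" }}'
    spec:
      containers:
        - name: main
          image: registry.example.com/train:latest   # illustrative image
```

At deploy time the user maps the "train-a" slot to one of their own project-managed ResourceClaims; if only some pods should hold a claim, they go in a separate Deployment, since one PodTemplate means one claim mapping.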


Role-Based Permissions

System Roles

Permission User Manager Admin
Use workspaces within quota
Create personal projects
Submit resource requests
Access admin panel
Manage all users / groups
Define resource plans
Manage platform queues
View audit logs
RBAC policy management

Project Roles

Permission Member Manager Admin
Launch workspaces
View project configs
Edit config files
Manage deployments
Add/remove members
Delete project

Group Roles

Permission Member Admin
Access group storage
View group members
Add/remove members
Set storage permissions

Storage Permission Model

Storage permissions follow a dual-path inheritance model:

```mermaid
flowchart TD
    GroupAdmin["Group Admin\nsets permissions"] -->|batch set| GroupStorage["Group PVC\n(source of truth)"]
    GroupStorage -->|inherited, read-only| ProjectView["Project Storage View\nfor Group Members"]
    ProjectAdmin["Project Admin"] -->|direct management| DirectPerms["Project Storage\nfor Non-Group Members"]
    DirectPerms --> ProjectView
```
  • Group members always inherit storage permissions from the group level. Their permissions cannot be overridden at the project level.
  • Non-group project members have permissions managed directly at the project level.

Key Terms

| Term | Definition |
|---|---|
| PVC | Persistent Volume Claim — a Kubernetes storage resource that survives pod restarts |
| Platform Queue | A named scheduling policy that controls priority, preemption, queue windows, and optional GPU model affinity |
| Plan Window | The time interval during which a project's plan-bound queues are usable |
| Preemption | Eviction of a pod in a preemptible queue so a higher-priority workload can run |
| DRA | Dynamic Resource Allocation — Kubernetes API that provisions GPUs via ResourceClaim objects |
| ResourceClaim | A DRA object that requests a fractional or whole GPU; platform-managed standalone claims consume quota only while bound to live Pods |
| Config File | A versioned Kubernetes YAML/config stored with content-addressable immutability |
| SM share | Streaming Multiprocessor share percentage used to describe fractional GPU compute allocation |
| Resource Plan | A named template defining GPU, CPU, memory limits, allowed GPU models, and schedule windows for a project |
| Storage Lane | A profile (shared-rwx, legacy-rwx, fast-rwo) that selects how a PVC is provisioned |
| Content-Addressable Storage (CAS) | Storage where files are identified by their SHA-256 hash, making them immutable |
| ltree | PostgreSQL data type for hierarchical tree paths (used for nested projects) |
| Workspace | An interactive cloud IDE (JupyterLab, VSCode) running inside a Kubernetes pod |