Understanding the conceptual framework of Baseten Training for effective model development.
## TrainingProject

A `TrainingProject` is a lightweight organization tool that helps you group related `TrainingJob`s together. While there are a few technical details to consider, your team can use `TrainingProject`s to facilitate collaboration and organization.
## TrainingJob

Within a `TrainingProject`, the actual work of training a model happens within a `TrainingJob`. Each `TrainingJob` represents a single, complete execution of your training script with a specific configuration.

A `TrainingJob` is the fundamental unit of execution. It bundles together:

- `image`: the image that provides the environment your training script runs in.
- `compute`: the resources needed to run the job.
- `runtime`: configurations like startup commands and environment variables.

This lets you iterate quickly, launching new `TrainingJob`s while knowing that previous ones have been persisted on Baseten.

Each job progresses through a lifecycle: from creation (`TRAINING_JOB_CREATED`), to resources being set up (`TRAINING_JOB_DEPLOYING`), to actively running your script (`TRAINING_JOB_RUNNING`), and finally to a terminal state like `TRAINING_JOB_COMPLETED`. More details on the job lifecycle can be found on the Lifecycle page.
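That progression can be sketched as a small Python enum. This is illustrative only, using just the status names mentioned above; the platform's actual state machine may include additional states (e.g. other terminal states) that are omitted here:

```python
from enum import Enum


class TrainingJobStatus(str, Enum):
    # Status names as given in the docs; this sketch intentionally omits
    # any other states the platform may report.
    CREATED = "TRAINING_JOB_CREATED"
    DEPLOYING = "TRAINING_JOB_DEPLOYING"
    RUNNING = "TRAINING_JOB_RUNNING"
    COMPLETED = "TRAINING_JOB_COMPLETED"


# The documented happy-path ordering of the lifecycle.
LIFECYCLE_ORDER = [
    TrainingJobStatus.CREATED,
    TrainingJobStatus.DEPLOYING,
    TrainingJobStatus.RUNNING,
    TrainingJobStatus.COMPLETED,
]


def is_terminal(status: TrainingJobStatus) -> bool:
    # COMPLETED is the only terminal state modeled in this sketch.
    return status is TrainingJobStatus.COMPLETED
```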
## Caching

To enable caching, add a `CacheConfig` to your `Runtime`:

```python
from truss_train import definitions

training_runtime = definitions.Runtime(
    # ... other configuration options
    cache_config=definitions.CacheConfig(enabled=True)
)
```
The cache is mounted at `/root/.cache/user_artifacts`, which can be accessed via the `$BT_RW_CACHE_DIR` environment variable. New projects should read and write cached data under `/root/.cache/user_artifacts` instead. However, if you need to access data mounted to `/root/.cache/huggingface` for compatibility reasons, you can set `enable_legacy_hf_cache=True` in your `CacheConfig`. Note that this legacy option is not recommended for new projects.
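In your training script, you can resolve cache paths through the environment variable rather than hardcoding the mount point. A minimal sketch; the `./local_cache` fallback is an assumption for running the script outside Baseten:

```python
import os
from pathlib import Path


def cached_path(name: str) -> Path:
    # BT_RW_CACHE_DIR points at the mounted cache when cache_config is
    # enabled; the "./local_cache" fallback is an assumption for local runs.
    cache_dir = Path(os.environ.get("BT_RW_CACHE_DIR", "./local_cache"))
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir / name
```

Writing expensive artifacts (downloaded weights, tokenized datasets) under this directory lets later jobs reuse them instead of recomputing.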
## Checkpointing

Checkpointing provides seamless storage for checkpoints and a jumping-off point for inference and evaluation. To enable it, add a `CheckpointingConfig` to the `Runtime` and set `enabled` to `True`:
```python
from truss_train import definitions

training_runtime = definitions.Runtime(
    # ... other configuration options
    checkpointing_config=definitions.CheckpointingConfig(enabled=True)
)
```
When checkpointing is enabled, Baseten exposes the `$BT_CHECKPOINT_DIR` environment variable in your job's environment. Ensure your code writes checkpoints to `$BT_CHECKPOINT_DIR`.
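For example, a training loop can resolve the checkpoint directory once and write each checkpoint under it. A sketch with a JSON state file standing in for a real framework checkpoint; the `./checkpoints` fallback is an assumption for local runs:

```python
import json
import os
from pathlib import Path


def save_checkpoint(step: int, state: dict) -> Path:
    # BT_CHECKPOINT_DIR is set by Baseten when checkpointing is enabled;
    # the "./checkpoints" fallback is an assumption for running locally.
    ckpt_dir = Path(os.environ.get("BT_CHECKPOINT_DIR", "./checkpoints"))
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"step_{step}.json"
    path.write_text(json.dumps(state))
    return path
```

In a real job you would call your framework's own save API (e.g. `torch.save` or a Hugging Face `Trainer` save method) with a path under the same directory.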
## Multinode Training

Enable multinode training by configuring the `Compute` resource in your `TrainingJob`: set `node_count` to the number of nodes you'd like to use (e.g. 2).

```python
from truss_train import definitions

compute = definitions.Compute(
    node_count=2,  # Use 2 nodes for multinode training
    # ... other compute configuration options
)
```
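How the nodes coordinate depends on your startup command. If you launch with a torchrun-style launcher, each worker process can read the standard rendezvous variables; note these names come from torchrun conventions, not from Baseten itself:

```python
import os


def distributed_context() -> dict:
    # RANK / WORLD_SIZE / LOCAL_RANK are the conventional variables set by
    # torchrun-style launchers; the defaults keep the script runnable
    # standalone on a single process.
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }
```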
## SecretReference

Use a `SecretReference` for secure handling of secrets. Store the secret in your Baseten workspace, then reference it by name with a `SecretReference`. The actual secret value is never exposed in your code.

```python
from truss_train import definitions

runtime = definitions.Runtime(
    # ... other runtime options
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)
```
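Inside the running job, the referenced secret is resolved into an ordinary environment variable, so your training script reads it the usual way. A small sketch; the error message is illustrative:

```python
import os


def hf_token() -> str:
    # At runtime, HF_TOKEN holds the resolved secret value; only the
    # secret's *name* ever appears in your configuration code.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; check environment_variables in your Runtime."
        )
    return token
```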
## Deploying Trained Models

A `TrainingJob` with checkpointing enabled produces one or more model artifacts. Use `truss train deploy_checkpoint` to deploy a model from your most recent training job. You can read more about this at Deploying Trained Models.