
Continued Pretraining

Continued Pretraining is not yet available.

Introduction to Continued Pretraining

Continued pretraining (CPT) is a technique for extending a base language model's knowledge by training it further on a new corpus of data. CosmicAC supports CPT as a managed job type, handling GPU provisioning and checkpoint storage so you can focus on the training data and configuration.

A base language model is trained on a large, general-purpose corpus using a next-token prediction objective. This process gives the model broad linguistic capability and general world knowledge, but it does not make the model an expert in any specific domain.

CPT picks up where pretraining left off. You train the model further, using the same next-token prediction objective, on a domain-specific corpus. The model updates its weights to better represent the statistical patterns in your data. The result is a model with the same general capability as the base model, but with a shifted knowledge distribution that reflects your domain.

Because CPT uses the same self-supervised objective as the original pretraining (predicting the next token), it does not require labeled examples. Your training data is raw text: technical documentation, scientific literature, codebases, internal knowledge bases, or any other domain corpus.
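The self-supervised setup described above can be sketched in a few lines. The whitespace tokenizer below is a toy stand-in (a real run would use the base model's own tokenizer), but it shows why no labels are needed: the target for each position is simply the next token in the raw text.

```python
# Minimal sketch of the next-token prediction setup used by CPT.
# Tokenization here is a toy whitespace split, not a real tokenizer.

def make_next_token_pairs(text: str):
    tokens = text.split()  # stand-in for the base model's tokenizer
    inputs = tokens[:-1]   # model sees tokens 0..n-2
    targets = tokens[1:]   # and must predict tokens 1..n-1
    return list(zip(inputs, targets))

pairs = make_next_token_pairs("the reactor coolant loop operates at 155 bar")
# Each pair is (context token, token to predict), e.g. ("the", "reactor")
```

Every span of raw domain text yields training signal this way, which is why any corpus of plain text is usable as-is.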

How Continued Pretraining Differs from Fine-tuning

The terms continued pretraining and fine-tuning (also called supervised fine-tuning, or SFT) are sometimes used interchangeably, but they describe different things.

CPT trains on unlabeled text to shift the model's knowledge. You are not teaching the model a new behavior or task; you are extending what it knows. The training objective is the same as in the original pretraining: predict the next token across a large corpus.

SFT trains on labeled examples, typically instruction-response pairs or conversation transcripts, to shape how the model behaves. The objective is to make the model produce specific types of outputs given specific types of inputs.
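One concrete way to see the difference is in how the loss is computed. This is common practice rather than a CosmicAC-specific detail: CPT scores every next-token prediction in the raw text, while SFT implementations often mask the prompt and compute loss only on the response tokens.

```python
# Illustrative loss masks (1 = token contributes to the loss).
# CPT: every token of the raw corpus is a training target.
# SFT: commonly, only the response tokens are scored.

def cpt_loss_mask(tokens):
    return [1] * len(tokens)

def sft_loss_mask(prompt_tokens, response_tokens):
    return [0] * len(prompt_tokens) + [1] * len(response_tokens)

print(cpt_loss_mask(["raw", "domain", "text"]))            # [1, 1, 1]
print(sft_loss_mask(["Q:", "pressure?"], ["155", "bar"]))  # [0, 0, 1, 1]
```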

In practice, these two techniques are often used in sequence:

  1. Start with a general-purpose base model.
  2. Apply continued pretraining (CPT) on your domain corpus to ground the model in domain knowledge.
  3. Apply supervised fine-tuning (SFT) on labeled examples to teach the model how to respond in the context of that domain.

CPT is the knowledge step; SFT is the behavior step.

When to Use Continued Pretraining

CPT is appropriate when:

  • The base model lacks meaningful knowledge of your domain because the domain is specialized, uses domain-specific vocabulary, or was underrepresented in the original pretraining corpus.
  • You have a substantial corpus of domain text (typically hundreds of millions to billions of tokens for meaningful knowledge transfer, though smaller datasets can still help on narrow domains).
  • You want to preserve the model's general capability while adding domain knowledge, rather than replacing its behavior entirely.
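Before launching a job, it can help to sanity-check whether your corpus clears the size bar above. The sketch below uses the rough rule of thumb of about 4 characters per token for English text; it is an estimate, not an exact tokenizer count.

```python
# Rough corpus-size check. ~4 characters per token is a common
# rule of thumb for English text, not an exact tokenizer count.

def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    return int(num_chars / chars_per_token)

# e.g. a 2 GB plain-text corpus:
approx = estimate_tokens(2 * 1024**3)
print(f"~{approx / 1e6:.0f}M tokens")  # ~537M tokens
```

If the estimate lands well below the hundreds-of-millions range, that is a signal to gather more data or to consider SFT instead.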

CPT is not appropriate when:

  • The base model already understands your domain reasonably well and you primarily want to change how it responds. In that case, SFT alone is more efficient.
  • Your dataset is small. CPT on a small corpus can degrade general capability without meaningfully improving domain performance.
  • Your goal is a narrow task (classification, extraction, summarization to a specific format). SFT on task-specific examples will outperform CPT for task performance.

Checkpoints

CosmicAC saves checkpoints at intervals during training. A checkpoint is a snapshot of the model's weights at a specific point in the training run. Checkpoints let you:

  • Resume a training run if it is interrupted
  • Evaluate model quality at different stages of training and select the best checkpoint rather than using the final weights
  • Roll back to an earlier state if later training steps degrade performance
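The resume pattern behind the first bullet can be sketched generically. This is illustrative only (CosmicAC manages checkpoint storage for you, and this is not its storage format): the idea is that each snapshot records the training step alongside the weights, so a restarted run can pick up from the latest one instead of step zero.

```python
import json
import os
import tempfile

# Generic checkpoint save/resume pattern (illustrative, not the
# CosmicAC storage format). Weights are a toy list here.

def save_checkpoint(path, step, weights):
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, None                      # no snapshot: fresh run
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["weights"]    # resume from snapshot

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(path, step=500, weights=[0.1, 0.2])
step, weights = load_checkpoint(path)       # step == 500
```

Evaluating several such snapshots against a held-out set is also how you would pick the best checkpoint rather than defaulting to the final weights.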
