
cicd

CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI, and CI failure investigation.

npx skills add https://github.com/NVIDIA/Megatron-LM --skill cicd

CI/CD Guide


CI Pipeline Structure

The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes to branches matching pull-request/[0-9]+ and deploy-release/*, on merge groups, on a daily schedule, and on manual dispatch.
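
As a rough illustration of the push trigger, the two branch filters can be approximated with regular expressions. This is a sketch rather than code from the workflow: the function name `push_triggers_ci` is hypothetical, and it assumes GitHub's filter semantics, in which `*` matches any run of characters except `/`.

```python
import re

# Approximations of the workflow's push branch filters (assumed semantics:
# `[0-9]+` is one or more digits, `*` is any run of characters except "/").
PULL_REQUEST_BRANCH = re.compile(r"pull-request/[0-9]+")
DEPLOY_RELEASE_BRANCH = re.compile(r"deploy-release/[^/]*")


def push_triggers_ci(branch: str) -> bool:
    """Return True if a push to `branch` would match one of the filters."""
    return bool(
        PULL_REQUEST_BRANCH.fullmatch(branch)
        or DEPLOY_RELEASE_BRANCH.fullmatch(branch)
    )


print(push_triggers_ci("pull-request/1234"))  # True
print(push_triggers_ci("feature/my-branch"))  # False
```

Merge groups, the daily schedule, and manual dispatch are separate events and are not modeled here.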

is-not-external-contributor
  └─ pre-flight
       └─ configure          # determines scope, container tag, n_repeat
            ├─ linting
            ├─ cicd-container-build
            │    ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
            │    ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
            │    └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
            └─ Nemo_CICD_Test  # final pass/fail gate

Images are pushed to:

  • AWS ECR: 766267172432.dkr.ecr.us-east-1.amazonaws.com/…
  • GCP Artifact Registry: us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…

CI Test Scope Labels

The CI pipeline reads PR labels to decide test scope, n_repeat, and container image.

Decision tree (first match wins):

| Condition | scope | n_repeat | lightweight | Notes |
|---|---|---|---|---|
| Merge group | mr-github | 1 | false | Automatic, no label needed |
| Label: Run tests | mr-github | 1 | true | Trains 4 steps, no golden-value compare |
| Label: Run functional tests | mr-github | 5 | false | Trains 100 steps, golden-value compare |
| (no label) | mr-github-slim | 5 | false | Slim subset only |
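
The decision tree above can be sketched as a small function that checks the conditions top to bottom and returns on the first match. The names `ScopeConfig` and `resolve_test_config` are hypothetical and only illustrate the ordering; the real logic lives in the workflow's configure job.

```python
from typing import Iterable, NamedTuple


class ScopeConfig(NamedTuple):
    scope: str
    n_repeat: int
    lightweight: bool


def resolve_test_config(is_merge_group: bool, labels: Iterable[str]) -> ScopeConfig:
    """Apply the decision tree top to bottom; the first match wins."""
    label_set = set(labels)
    if is_merge_group:
        return ScopeConfig("mr-github", 1, False)
    if "Run tests" in label_set:
        return ScopeConfig("mr-github", 1, True)
    if "Run functional tests" in label_set:
        return ScopeConfig("mr-github", 5, False)
    # No scope label: fall back to the slim subset.
    return ScopeConfig("mr-github-slim", 5, False)


print(resolve_test_config(False, ["Run functional tests"]))
```

Because the first match wins, a PR carrying both Run tests and Run functional tests resolves to the lightweight Run tests configuration in this sketch.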

Orthogonal labels (combinable with any scope label):

| Label | Effect |
|---|---|
| container::lts | Use the LTS base image instead of dev (combinable with any scope label) |
| Run MBridge tests | Also triggers the MBridge L1 test suite |
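
Since these labels are independent of the scope decision, they can be resolved separately. A minimal sketch, assuming a hypothetical `resolve_extras` helper and hypothetical result keys `image` and `run_mbridge`:

```python
from typing import Dict, Iterable, Union


def resolve_extras(labels: Iterable[str]) -> Dict[str, Union[str, bool]]:
    """Resolve the labels that do not affect the test scope."""
    label_set = set(labels)
    return {
        # LTS base image only when explicitly requested, otherwise dev.
        "image": "lts" if "container::lts" in label_set else "dev",
        # The MBridge L1 suite runs in addition to the scoped tests.
        "run_mbridge": "Run MBridge tests" in label_set,
    }


print(resolve_extras(["Run tests", "container::lts"]))
```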

Which label to attach when opening a PR

| Changed paths / nature of change | Label to attach |
|---|---|
| Docs only (docs/, *.md, docstrings) | (none) |
| CI/tooling only (.github/, tools/, Makefile) | (none) |
| Test files only (tests/): existing tests, no new golden values | Run tests |
| New test cases added (no golden values exist yet) | Run functional tests |
| Re-enabling a disabled test (scope -broken → active) | Run functional tests |
| Non-numerical library code (logging, error handling, CLI flags, refactors) | Run tests |
| Could affect training numerics (model arch, attention, optimizer, distributed, MoE routing) | Run functional tests |
| Container or dependency changes (docker/, pyproject.toml, uv.lock) | Run tests + container::lts |
| Touches MBridge integration | add Run MBridge tests |

Rule of thumb: default to Run tests. Always use Run functional tests when the PR adds new test cases (golden values must be generated) or when the change could plausibly shift loss curves.
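
The path-based rows of this guidance can be approximated from a list of changed files. The sketch below is a heuristic only: `suggest_labels` and its helpers are hypothetical, a mix of docs and tooling changes is assumed to need no label, and nothing here can detect docstring-only edits, new test cases, numerics-affecting changes, or MBridge changes. Those still require a human decision to upgrade to Run functional tests or add Run MBridge tests.

```python
from typing import Iterable, List


def _is_docs(path: str) -> bool:
    return path.startswith("docs/") or path.endswith(".md")


def _is_tooling(path: str) -> bool:
    return path.startswith((".github/", "tools/")) or path == "Makefile"


def _is_container(path: str) -> bool:
    return path.startswith("docker/") or path in ("pyproject.toml", "uv.lock")


def suggest_labels(changed_paths: Iterable[str]) -> List[str]:
    """Suggest a starting set of labels from changed paths alone."""
    paths = list(changed_paths)
    if any(_is_container(p) for p in paths):
        return ["Run tests", "container::lts"]
    if paths and all(_is_docs(p) or _is_tooling(p) for p in paths):
        return []  # docs or CI/tooling only: no label
    return ["Run tests"]  # rule-of-thumb default


print(suggest_labels(["tests/unit_tests/test_example.py"]))
```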


Triggering Internal CI

Use tools/trigger_internal_ci.py to push the current branch to the internal GitLab remote and trigger a pipeline without opening the GitLab UI. For full setup and usage details, see @tools/trigger_internal_ci.md.

Prerequisites (one-time):

# 1. Add the internal GitLab remote
git remote add gitlab git@<gitlab-hostname>:ADLR/Megatron-LM.git

# 2. Create a personal access token with 'api' scope on your GitLab profile,
#    then store it:
export GITLAB_TOKEN=glpat-<your-token>

Usage:

python tools/trigger_internal_ci.py \
  --gitlab-origin gitlab \
  [--functional-test-scope mr] \
  [--functional-test-repeat 5] \
  [--functional-test-cases all] \
  [--dry-run]

The script force-pushes the current branch as pull-request/<branch> and prints the resulting pipeline URL.
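
When the trigger is wrapped in other tooling, it can help to build the argument list programmatically. The sketch below uses only the flags shown in the usage above; the helper name `build_trigger_command` is hypothetical, and the values from the usage example are treated as illustrative defaults, which may not match the script's own defaults.

```python
import shlex
from typing import List


def build_trigger_command(
    gitlab_origin: str = "gitlab",
    scope: str = "mr",
    repeat: int = 5,
    cases: str = "all",
    dry_run: bool = True,
) -> List[str]:
    """Build the argv for tools/trigger_internal_ci.py."""
    cmd = [
        "python", "tools/trigger_internal_ci.py",
        "--gitlab-origin", gitlab_origin,
        "--functional-test-scope", scope,
        "--functional-test-repeat", str(repeat),
        "--functional-test-cases", cases,
    ]
    if dry_run:
        # Dry run is the default in this sketch to avoid accidental pushes.
        cmd.append("--dry-run")
    return cmd


print(shlex.join(build_trigger_command()))
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)` from the repository root with GITLAB_TOKEN exported.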


CI Failure Investigation

CI branches for GitHub PRs follow the pattern pull-request/<number>, where <number> is the PR number.

Locating the PR from a CI Branch

# Extract PR number from the current branch
# (sed is used instead of grep -P, which is not available in BSD/macOS grep)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | sed -n 's|^pull-request/\([0-9][0-9]*\)$|\1|p')

# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM
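
The same extraction can be done in Python when scripting an investigation. A minimal sketch; `pr_number_from_branch` is a hypothetical helper, not part of the repository:

```python
import re
from typing import Optional


def pr_number_from_branch(branch: str) -> Optional[int]:
    """Return the PR number for a pull-request/<number> branch, else None."""
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    return int(match.group(1)) if match else None


print(pr_number_from_branch("pull-request/1234"))  # 1234
print(pr_number_from_branch("main"))               # None
```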

Reading CI Job Logs

# List recent workflow runs for the PR
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"

# Stream failing job output
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed

Full per-rank logs are not in the runner stdout. They are uploaded as GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.

# 1. Find the artifact name (listed through the REST API)
gh api --paginate "repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts" \
  --jq '.artifacts[].name'

# 2. Download and extract the artifact (the name from step 1 already starts with logs-)
gh run download <run-id> --repo NVIDIA/Megatron-LM \
  --name "<artifact-name>" -D ./ci-logs

# 3. Locate which rank logs contain errors
grep -rlE "ERROR|Traceback|FAILED|fatal" ./ci-logs/

# 4. Log files can exceed 10 000 lines — never read a full log at once.
wc -l ./ci-logs/<test>/<attempt>/attempt_0/<rank>/stderr.log
sed -n '1,200p' ./ci-logs/.../stderr.log   # read in chunks
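
Steps 3 and 4 can also be scripted. The sketch below mirrors the grep patterns above and reads a log in bounded chunks instead of loading it whole; `files_with_errors` and `read_chunk` are hypothetical helper names.

```python
import re
from itertools import islice
from pathlib import Path
from typing import List

# Same alternatives as the grep command above.
ERROR_PATTERN = re.compile(r"ERROR|Traceback|FAILED|fatal")


def files_with_errors(root: str) -> List[Path]:
    """Return files under `root` with at least one matching line."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            # Stream line by line; stop at the first match in each file.
            if any(ERROR_PATTERN.search(line) for line in handle):
                hits.append(path)
    return hits


def read_chunk(path: Path, start: int = 1, count: int = 200) -> List[str]:
    """Return up to `count` lines beginning at 1-based line `start`."""
    with path.open(errors="replace") as handle:
        return list(islice(handle, start - 1, start - 1 + count))
```

For example, `read_chunk(path, start=201)` returns lines 201 through 400, the Python counterpart of `sed -n '201,400p'`.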

Identifying Failure Root Cause

  1. Linting failure: re-run tools/autoformat.sh locally; the diff shows exactly what needs to change.
  2. Container build failure: inspect the cicd-container-build job log.
  3. Unit test failure: the failing bucket is in the cicd-unit-tests-latest job matrix.
  4. Functional test failure: look at the cicd-integration-tests-* job. Start with stdout.log for rank 0.
  5. Flaky test: the runner retries automatically up to 3 times. If all retries are exhausted and the failure matches a known transient pattern (NCCL, ECC, segfault), treat it as infrastructure noise.
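
The triage order above can be sketched as a first-pass classifier keyed on the failing job name. Everything here is illustrative: `classify_failure` is a hypothetical helper, the keyword regex is a crude stand-in for recognizing known transients, and real logs will need more careful matching.

```python
import re

# Keywords taken from the list above; matching is deliberately simple.
TRANSIENT_PATTERN = re.compile(r"NCCL|\bECC\b|segfault")


def classify_failure(job_name: str, log_text: str = "",
                     retries_exhausted: bool = False) -> str:
    """Map a failing job to the next investigation step."""
    if retries_exhausted and TRANSIENT_PATTERN.search(log_text):
        return "likely infrastructure noise (known transient pattern)"
    if job_name == "linting":
        return "linting: re-run tools/autoformat.sh locally"
    if job_name == "cicd-container-build":
        return "container build: inspect the build job log"
    if job_name.startswith("cicd-unit-tests"):
        return "unit test: find the failing bucket in the job matrix"
    if job_name.startswith("cicd-integration-tests"):
        return "functional test: start with stdout.log for rank 0"
    return "unknown job: read the failed log directly"


print(classify_failure("cicd-integration-tests-latest-h100"))
```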

Correlating a Failure with the PR Changeset

# Find unit tests that cover a changed source file
grep -rl "from megatron.core.transformer.attention" tests/unit_tests/

# Check CODEOWNERS for reviewer assignment
grep "<changed-path>" .github/CODEOWNERS

Installs: 0
GitHub Stars: 16.4k
Language: Python
Added: May 25, 2026