momentic-result-classification

Classify or explain Momentic test run results using Momentic MCP tools.

Install

npx skills add https://github.com/momentic-ai/skills --skill momentic-result-classification

SKILL.md

Momentic result classification (MCP)

Momentic is an end-to-end testing framework in which each test is composed of browser interaction steps. Each step combines Momentic-specific behavior (AI checks, natural-language locators, AI actions, etc.) with Playwright capabilities wrapped in our YAML step schema. When these tests run, they produce results data that can be used to analyze the outcome of the test. The results data contains metadata about the run as well as any assets generated by the run (e.g. screenshots, logs, network requests, video recordings). Your job is to use these test results to classify failures that occurred in Momentic test runs.

Instructions

Given a failing test run, analyze why the test run failed. Often you'll need to look beyond the current run to understand this, looking at past runs of the same test or other context provided by the Momentic MCP tools.
After analyzing why the run failed, bucket the failure into one of the categories below, explaining your reasoning for choosing that category.

Helpful MCP tools

momentic_get_run — Returns some metadata about the run and a summary of the full run results. Use the metadata to help you parse through the run results (e.g. which attempt to look at, which step failed, etc.)

momentic_list_runs — Recent runs for a test so you can compare the result of past runs over time. Always pass gitBranchName when it exists on the run in question so that it's more likely you're looking at the same version of the test.

momentic_get_step_result — Returns the result of a specific step, with other information such as full step trace and before/after screenshots. Use parentStepIdChain for steps nested inside other steps.

momentic_get_test_steps_for_run — Returns the simplified test steps recorded on a run (stepsSnapshot, beforeStepsSnapshot, afterStepsSnapshot). You can use this to understand the intent of the test if you need more information than what you can glean from the test name and description.

Background

Test run result structure

When Momentic tests are run via the CLI, the results are stored in a "run group". The data for this run group is stored in a single directory within the Momentic project. By default, the directory is called test-results, but this can be changed in the Momentic project settings or for a single run of a run group. The run group results folder has the following structure:

test-results/
├── metadata.json         data about the run group, including git metadata and timing info.
└── runs/                 One zip for each test run in the run group.
    ├── <runId_1>.zip         a zipped run directory containing data about this specific test run.  Follows the structure described below.
    └── <runId_2>.zip
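
The layout above can be unpacked with a few lines of Python. This is a minimal sketch, assuming the default test-results directory name; the function names `unzip_run_group` and `load_group_metadata` are hypothetical helpers, not part of the Momentic CLI. It makes no assumption about the fields inside metadata.json or about whether each archive wraps its contents in a top-level folder.

```python
import json
import zipfile
from pathlib import Path


def unzip_run_group(results_dir: Path, out_dir: Path) -> list[Path]:
    """Extract every <runId>.zip under results_dir/runs into out_dir/<runId>/."""
    extracted = []
    for archive in sorted((results_dir / "runs").glob("*.zip")):
        target = out_dir / archive.stem  # the archive name without .zip is the runId
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


def load_group_metadata(results_dir: Path) -> dict:
    """Load the run group's metadata.json without assuming any field names."""
    return json.loads((results_dir / "metadata.json").read_text(encoding="utf-8"))
```

For example, `unzip_run_group(Path("test-results"), Path("unzipped"))` would return one extracted directory per run in the group.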

When unzipped, run directories have the following structure:

<runId>/
├── metadata.json           run-level metadata.
└── attempts/<n>/           one folder per attempt (1-based n).
    ├── metadata.json       attempt outcome and step results.
    ├── console.json        optional browser console output.
    └── assets/
        ├── <snapshotId>.jpeg     before/after screenshot for each step (see attempt metadata.json for snapshot ID).
        ├── <snapshotId>.html     before/after DOM snapshot for each step (see attempt metadata.json for snapshot ID).
        ├── har-pages.log         HAR pages (ndjson).
        ├── har-entries.log       HAR network entries (ndjson).
        ├── resource-usage.ndjson CPU/memory samples taken during the attempt.
        ├── <videoName>           video recording (when video recording is enabled).
        └── browser-crash.zip     browser crash dump (only present on crash).
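
As a rough illustration of how the run directory above might be traversed, the sketch below lists attempts in order and parses the ndjson asset files (for example har-entries.log). The helper names `list_attempts` and `read_ndjson` are hypothetical, and the contents of each metadata.json are returned as-is because their fields are not described here.

```python
import json
from pathlib import Path


def list_attempts(run_dir: Path) -> list[tuple[int, dict]]:
    """Return (attempt_number, attempt_metadata) pairs sorted by attempt number."""
    attempts = []
    for meta_path in run_dir.glob("attempts/*/metadata.json"):
        number = int(meta_path.parent.name)  # attempt folders are 1-based integers
        metadata = json.loads(meta_path.read_text(encoding="utf-8"))
        attempts.append((number, metadata))
    return sorted(attempts, key=lambda pair: pair[0])


def read_ndjson(path: Path) -> list[dict]:
    """Parse a newline-delimited JSON file, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Reading the last element of `list_attempts(run_dir)` gives the final attempt, while earlier elements are the attempts that preceded any retry.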

When getting run results via the Momentic MCP, tools such as momentic_get_run return links to the MCP working directory (default .momentic-mcp). This directory contains unzipped run result folders named run-result-<runId>, each following the structure above.

Element locators

Certain step types that interact with elements have a "target" property, or locator, that specifies which element the step should interact with.

Locator caches

Locators identify elements by sending the page state (HTML/XML) and a screenshot to an LLM. The LLM identifies which element on the page the user is referring to. Momentic will attempt to "cache" the answer from the LLM so that future runs don't require AI calls. On future runs, the page state is checked against the cached element to determine whether the element is still usable or whether the page has changed enough that another AI call is required.

A locator cache can bust for a variety of reasons:

The element description has changed, in which case we'll always bust the cache.
The cached element could not be located in the current page state.
The cached element was located in the page state but fails certain checks specified on the cache entry, such as requiring a certain position, shape, or content.

You can find the cacheBustReason on the trace property in the results for a given step. The cache property is also listed on the results, showing the full cache saved for that element.

Identifying bad caches

Sometimes the element that was cached is not the element that the user intended to target. This can cause failures or unexpected behaviors in tests. In these cases, it helps to verify exactly why the wrong cache was saved in the first place. Use the runId property of the targetUpdateLoggerTags on the incorrect cache to get the details of the original run, calling momentic_get_run with this runId. This will return the run where the cache target was updated.
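
A small sketch of pulling these cache fields out of a step result follows. It assumes the step result is a JSON object with `trace` and `cache` keys nested as the text describes (`trace.cacheBustReason` and `cache.targetUpdateLoggerTags.runId`); the exact nesting is an assumption, and both function names are hypothetical.

```python
from typing import Any, Optional


def cache_bust_reason(step_result: dict[str, Any]) -> Optional[str]:
    """Return trace.cacheBustReason if present (assumed nesting), else None."""
    trace = step_result.get("trace") or {}
    return trace.get("cacheBustReason")


def cache_origin_run_id(step_result: dict[str, Any]) -> Optional[str]:
    """Return the runId of the run that last updated the cached target, if recorded."""
    cache = step_result.get("cache") or {}
    tags = cache.get("targetUpdateLoggerTags") or {}
    return tags.get("runId")
```

The value returned by `cache_origin_run_id` is what you would pass to momentic_get_run to inspect the run in which the cache target was updated.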

Module caching

Cached modules skip executing their steps when the module cache key and resolved inputs are unchanged, and reuse the cached return value from the module's last step.
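
To make the skip-and-reuse behavior concrete, here is an illustrative model only. It is not Momentic's implementation: the class `ModuleCacheSketch`, the hashing of inputs, and the representation of steps as Python callables are all assumptions chosen for the example.

```python
import hashlib
import json
from typing import Any, Callable

Step = Callable[[dict[str, Any]], Any]


class ModuleCacheSketch:
    """Toy model: skip a module's steps when its key and resolved inputs repeat."""

    def __init__(self) -> None:
        self._values: dict[tuple[str, str], Any] = {}

    def run(self, cache_key: str, inputs: dict[str, Any], steps: list[Step]) -> Any:
        # Hash the resolved inputs so identical inputs map to the same entry.
        payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
        entry = (cache_key, hashlib.sha256(payload).hexdigest())
        if entry in self._values:
            return self._values[entry]  # cache hit: no step executes
        result = None
        for step in steps:
            result = step(inputs)  # the last step's return value is kept
        self._values[entry] = result
        return result
```

When classifying a failure, the practical consequence is that a step inside a cached module may not have executed at all in the run you are looking at.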

Authentication modules can also save and restore browser auth state from the module cache, including cookies, localStorage, and IndexedDB. They may use a page-content check after restoring auth state to decide whether the cache is still valid.

File uploads

A file upload step prepares one file for the next native file picker, so it must run before the action that opens the picker.

Sources can be remote URLs, file:// references to earlier downloads, CLI-local paths, or uploaded user files. The step can also override the presented filename, and Momentic wires the prepared file into the browser's file chooser handling.

Using past runs

You MUST look at past runs of the same test when understanding why a test failed. Looking at past runs helps you identify:

When did this test start failing?
What differed vs the last passing run?
Did the same action behave differently on an earlier run?

Use step results and screenshots on past runs to answer these questions. Do NOT rely only on summaries from momentic_get_run or momentic_list_runs to understand what happened in a test run. You MUST look at the specific run details, including step results and screenshots, to determine the behavior of past runs.

When looking at past runs, use the following workflow:

Call the momentic_list_runs tool to identify the runs you want more detail on.
Call momentic_get_run for that specific run to get the run details.
Call momentic_get_step_result for step results that you want to see in more detail, especially for step screenshots.
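
The three calls above could be chained roughly as follows. This is a sketch under several assumptions: `call_tool` is a hypothetical function supplied by whatever MCP client you use, the argument names `testId` and `stepId` are guesses (only `gitBranchName` and `runId` are mentioned in this document), and momentic_list_runs is assumed to return a list of objects that each carry a `runId`.

```python
from typing import Any, Callable, Optional

ToolCaller = Callable[[str, dict[str, Any]], Any]


def inspect_past_runs(
    call_tool: ToolCaller,
    test_id: str,
    step_ids: list[str],
    git_branch: Optional[str] = None,
    limit: int = 3,
) -> list[dict[str, Any]]:
    """List recent runs, then fetch run details and step results for each."""
    list_args: dict[str, Any] = {"testId": test_id}  # assumed argument name
    if git_branch:
        list_args["gitBranchName"] = git_branch  # pass the branch when known
    runs = call_tool("momentic_list_runs", list_args)

    findings = []
    for run in runs[:limit]:
        run_id = run["runId"]  # assumed response field
        details = call_tool("momentic_get_run", {"runId": run_id})
        steps = [
            call_tool("momentic_get_step_result", {"runId": run_id, "stepId": sid})
            for sid in step_ids
        ]
        findings.append({"runId": run_id, "run": details, "steps": steps})
    return findings
```

The step results, not the run summaries, are where the before/after screenshots live, so the inner loop is the part that matters for comparing behavior across runs.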

ALWAYS look at screenshots when determining the behavior of test runs.

Multi-attempt runs

When momentic_list_runs shows a passing run with attempts > 1, treat it as a partial failure worth investigating, not a clean passing run. Use the attemptNumber parameter to retrieve earlier failed attempt results for that run to understand what was going wrong before the retry succeeded.
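
A minimal sketch of that check, assuming each listed run is an object with `runId`, `status`, and `attempts` fields and that a passing run is marked `"PASSED"`. Only `attempts` and `attemptNumber` are named in this document; the other field names and the status value are assumptions.

```python
from typing import Any


def passing_runs_with_retries(runs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Select runs that passed only after at least one retry."""
    return [
        run for run in runs
        if run.get("status") == "PASSED" and run.get("attempts", 1) > 1
    ]


def earlier_attempt_requests(run: dict[str, Any]) -> list[dict[str, Any]]:
    """Build arguments for fetching every attempt before the final one."""
    # Attempts are 1-based, so the earlier attempts are 1 .. attempts - 1.
    return [
        {"runId": run["runId"], "attemptNumber": n}
        for n in range(1, run["attempts"])
    ]
```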

Flakiness and intermittent failures

To consider a test flaky or intermittently failing, it must fail intermittently under the same application and test behavior.
- A single failure does NOT mean that a test is flaky; it could have failed because of an application change. You need to determine whether there was an application or test change between runs by analyzing the screenshots and/or browser state in the results.
- IMPORTANT: You cannot make assumptions about flakiness or intermittent failures without verifying whether an application or test change caused the failure.

Test temporality

Past results may not match today’s test file. The test may have changed, meaning the result was produced by a different version of the test.
You can call momentic_get_test_steps_for_run to help determine whether the test itself changed between runs, but note that this tool returns only a summary of each test step. If you suspect that specific details of certain steps changed between test runs, full step details are included in the response from momentic_get_step_result.

Identifying related vs unrelated issues

Use the test name, description, and, if needed, the simplified test steps returned by momentic_get_test_steps_for_run to determine what the test intends to verify.
Failures outside that intent are unrelated; otherwise, consider them related.
Failures in setup (beforeSteps/beforeResults) or teardown (afterSteps/afterResults) steps are almost always considered unrelated.
The related vs. unrelated distinction applies only to bugs and changes (e.g. an INFRA failure would still be INFRA regardless of whether it occurs in the setup or main section).

Bug vs change

Bug: something very clearly went wrong when it shouldn't have, such as an error message appearing. It's obvious just by looking at a single step or two that this is a bug.
Change: any other change in application behavior.
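
The related/unrelated and bug/change distinctions combine into the four application category ids. The sketch below encodes that mapping; the function name and its parameters are hypothetical, and treating setup or teardown failures as always unrelated is a simplification of the "almost always" guidance.

```python
def application_category(
    related: bool,
    clearly_bug: bool,
    in_setup_or_teardown: bool = False,
) -> str:
    """Map the two judgments onto one of the four *_APPLICATION_* category ids."""
    # Simplification: setup/teardown failures are treated as unrelated here.
    if in_setup_or_teardown:
        related = False
    scope = "RELATED" if related else "UNRELATED"
    kind = "BUG" if clearly_bug else "CHANGE"
    return f"{scope}_APPLICATION_{kind}"
```

This mapping covers only application bugs and changes; categories such as INFRA or PERFORMANCE are chosen on other grounds and do not pass through it.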

Formal classification output

Exactly one category id — no new labels, no multi-label.
Ground your decision in data. Be sure that you've fully investigated the run before assigning the category.
When reasoning cites another run, use the full runId UUID exactly as returned by tools. Do not shorten it to a prefix.

Reasoning: <a few sentences tied to summary, past runs, and intent>
Category: <one id from the list>
Confidence: <high | medium | low>

Confidence levels:

high — direct evidence (e.g. clear screenshot of label change or crash)
medium — strong inference from multiple signals but no single conclusive screenshot or data point
low — ambiguous evidence; the classification required significant inference or the root cause is unclear

Category ids

Use these strings verbatim:

NO_FAILURE — The run had no failures; all attempts passed.
RELATED_APPLICATION_CHANGE — A failure related to the test's intended behavior.
RELATED_APPLICATION_BUG — A failure related to the test's intended behavior that is clearly a bug.
UNRELATED_APPLICATION_CHANGE — A failure unrelated to the test's intended behavior.
UNRELATED_APPLICATION_BUG — A failure unrelated to the test's intended behavior that is clearly a bug.
- Example: any app bug in setup, not in the test steps.
TEST_CAN_BE_IMPROVED — We know what to change about the test to make it better.
- Examples: an obvious race condition that can be fixed by adding or modifying steps; vague assertion or locator descriptions; test misconfiguration such as a missing file for a file upload step.
- If increasing a timeout or adding wait steps seems like the fix, you must be extremely confident that this would make the test consistently pass, backed by evidence from past runs.
INFRA — Something very rare happened, or something that doesn't happen all of the time and that you're confident is related to outside factors.
- Examples: browser crash, high resource usage, or rate limiting.
PERFORMANCE — Page loading or application performance was too slow, and just waiting longer would likely have allowed the step to pass.
- Use for sporadic slowdowns or load stalls that usually should not justify a permanent update to the test.
- Do not choose this category just because a step timed out.
- You must be confident that this was a temporary performance issue that occurs infrequently and would likely be resolved by waiting longer in the current test run.
- Choose INFRA instead when external systems, browser crashes, resource exhaustion, or rate limits caused the slowdown.
- Examples: page took too long to load; loading spinner did not disappear before the step timed out, but past runs show it normally does; an assertion timed out because the expected state appeared slowly, not because the assertion or test intent was wrong.
MOMENTIC_ISSUE — Some issue occurred with the execution of the test or Momentic data was incorrect.
- Examples: unexpected behavior when viewing the run trace; the AI clearly misread or hallucinated data that is unambiguous in the screenshot, and no reasonable test alternative exists to avoid the AI step.
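
One way to guard against invented labels is to validate the output before emitting it. This is a minimal sketch: the `Classification` class is hypothetical, while the category ids, confidence levels, and the three-line output layout are taken from this document.

```python
from dataclasses import dataclass

CATEGORY_IDS = (
    "NO_FAILURE",
    "RELATED_APPLICATION_CHANGE",
    "RELATED_APPLICATION_BUG",
    "UNRELATED_APPLICATION_CHANGE",
    "UNRELATED_APPLICATION_BUG",
    "TEST_CAN_BE_IMPROVED",
    "INFRA",
    "PERFORMANCE",
    "MOMENTIC_ISSUE",
)
CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass(frozen=True)
class Classification:
    reasoning: str
    category: str
    confidence: str

    def __post_init__(self) -> None:
        # Reject anything that is not one of the verbatim ids or levels.
        if self.category not in CATEGORY_IDS:
            raise ValueError(f"unknown category id: {self.category!r}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")

    def render(self) -> str:
        """Format the classification in the Reasoning/Category/Confidence layout."""
        return (
            f"Reasoning: {self.reasoning}\n"
            f"Category: {self.category}\n"
            f"Confidence: {self.confidence}"
        )
```

Because `category` is a single string checked against a fixed tuple, the sketch also enforces the one-category, no-new-labels rule.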

Installs: 829

GitHub Stars: 12

Added: Mar 5, 2026