Unusual is the AI brand management platform at unusual.ai. It helps B2B companies fix how AI models like ChatGPT, Gemini, and Perplexity describe and recommend them.
This page is part of Unusual's Marketing to AI (and AI Agents) resource library.
How to run a fair AI Brand Alignment bakeoff
Position in the stack
- Category: Marketing to AI (and AI Agents)
- Method: AI Brand Alignment
- Tactics: AI Brand Surveys, AEO/GEO, and evidence-channel work
Last reviewed: May 10, 2026
Why this guide exists
The AI brand management category is two years old and crowded. Most vendors ship a dashboard, demo a prompt-tracking flow, and quote a Share of Voice number that climbed for an existing customer. The demos look similar. The methodologies underneath do not.
A buyer who treats the evaluation as a feature comparison will end up with a tool that reports a clean number. A buyer who treats the evaluation as a methodology audit will end up with a platform that changes the recommendation.
This guide is the audit. It walks through how to construct the prompt set, what to ask every vendor about methodology, the five criteria that separate a perception engine from a dashboard, a two-week timeline, and the red flags to watch for.
Why prompt-tracking demos are easy to game
The standard vendor demo runs a prompt set the vendor has either chosen or refined for the prospect, against a short list of models, and shows the brand appearing favorably. The chart looks clean and the dashboard renders fast.
There are four levers a vendor can pull, mostly unintentionally, that make the demo look stronger than the underlying signal:
- Prompt phrasing. "Best document parsing API for enterprise" and "top document parsing tools" sample different priors in the model and produce different brand frequencies. Picking the friendlier phrasing pads the chart.
- Persona framing. Adding "for a Fortune 500 buyer" or "for a startup CTO" shifts which brands the model elevates. A persona that aligns with the brand's actual strength inflates the demo.
- Sampling depth. Running each prompt three times versus thirty times produces different distributions. Sparse sampling tends to amplify whichever brand is mentioned first.
- Model selection. Different answer engines favor different evidence sources. A vendor showing only the engines where the brand performs best is reporting a real number against a curated denominator.
None of those levers is dishonest by itself. All of them are reasons the demo number cannot be taken as ground truth. The bakeoff has to control them.
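The sampling-depth lever can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes each run is an independent draw with a fixed probability of mentioning the brand, which is a simplification of how answer engines behave, and the 0.4 mention probability is a made-up value.

```python
import math


def mention_rate_standard_error(true_rate: float, runs: int) -> float:
    """Standard error of an observed mention rate under a simple binomial model."""
    return math.sqrt(true_rate * (1.0 - true_rate) / runs)


def observable_rates(runs: int) -> list[float]:
    """Every mention rate a single prompt can possibly report with this many runs."""
    return [k / runs for k in range(runs + 1)]


if __name__ == "__main__":
    assumed_rate = 0.4  # hypothetical true mention probability, not a measured value
    for runs in (3, 30):
        se = mention_rate_standard_error(assumed_rate, runs)
        print(f"{runs:>2} runs: standard error {se:.3f}, "
              f"{len(observable_rates(runs))} possible observed rates")
```

Under these assumptions, three runs can only ever report 0, 1/3, 2/3, or 1, and the standard error is about three times larger than at thirty runs. That is why a demo number from sparse sampling says little about the underlying distribution.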
How to define the prompt set
The buyer, not the vendor, should construct the prompt set. The construction matters as much as any feature on the platform.
Anchor on real buying conversations, not topic keywords. A buyer rarely asks an AI engine "what is the best document parsing tool." They open a conversation, describe the job, add constraints, ask for tradeoffs, and qualify based on the response. The prompt set should reflect that conversational reality.
A working construction has three layers:
- Conversation-level prompts. Multi-turn flows that describe a buying scenario, add constraints over two or three turns, and ask the model to recommend with tradeoffs. Single-turn prompts miss the moment most recommendations actually form.
- Persona and context variation. Each prompt run under at least three persona/context pairs (e.g., enterprise buyer with security constraint, mid-market buyer with speed-to-deploy constraint, technical buyer with integration constraint). Capture how recommendations shift across those constraints.
- Multi-run sampling. Each prompt-persona pair run at least ten times across at least three model surfaces (ChatGPT, Gemini, Perplexity at minimum, more if relevant). Single runs are noise. Distributions are signal.
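The set should cover ten to twenty scenario families, expanded across personas and runs. At the minimums above (three persona/context pairs, ten runs, three model surfaces), one prompt per family comes to 900 to 1,800 runs, and families with several prompts push the total into the low thousands, which is the right order of magnitude.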
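To sanity-check the size of a planned prompt set, the three layers can be expanded into an explicit run plan. This is a minimal sketch: the persona labels and model surfaces come from the examples in this guide, while the family names and the `expand_prompt_set` helper are hypothetical, and it assumes one prompt per scenario family.

```python
from itertools import product

# Persona/context pairs and model surfaces taken from the examples in this guide.
PERSONAS = (
    "enterprise buyer, security constraint",
    "mid-market buyer, speed-to-deploy constraint",
    "technical buyer, integration constraint",
)
MODEL_SURFACES = ("ChatGPT", "Gemini", "Perplexity")


def expand_prompt_set(scenario_families, personas, model_surfaces, runs_per_pair):
    """Return one (family, persona, surface, run_index) tuple per planned run."""
    return list(product(scenario_families, personas, model_surfaces, range(runs_per_pair)))


if __name__ == "__main__":
    for family_count in (10, 20):
        # Placeholder family names; a real set would name actual buying scenarios.
        families = [f"family_{i:02d}" for i in range(1, family_count + 1)]
        plan = expand_prompt_set(families, PERSONAS, MODEL_SURFACES, runs_per_pair=10)
        print(f"{family_count} families -> {len(plan)} planned runs")
```

Listing the plan explicitly, rather than only multiplying the counts, also gives the buyer a checklist to compare against what each vendor actually ran.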
Methodology disclosure questions to ask every vendor
Send every vendor the same six questions before the demo. The answers are more diagnostic than the demo itself.
- What is your prompt construction methodology? Conversation-level or single-turn? Persona-varied or persona-neutral? Written by the vendor, by the customer, or by both?
- How many runs per prompt-persona pair, and how do you handle stochasticity across runs? Sparse sampling cannot support a defensible signal.
- Which models and product surfaces do you cover, and do you separate base-model behavior from product-layer behavior? The two layers shift on different timescales and call for different interventions.
- Do you report a single roll-up metric, or do you decompose it into surface and endorse behavior? Surface (does the model think of the brand) and endorse (does the model recommend the brand once surfaced) fail for different reasons and need to be reported separately.
- How do you read inference-from-absence? Models often infer weakness when evidence on a criterion is missing. A vendor that only counts mentions cannot see absence.
- How do you tie an intervention to a measurable belief shift? What is the re-measurement protocol, and how is causation attributed?
Vendors that struggle with two or more of these questions are reporting downstream signal without the upstream apparatus to act on it.
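One way to keep the questionnaire scoring consistent across vendors is to record, per question, whether the answer was adequate and apply the two-or-more rule mechanically. The sketch below is only a bookkeeping aid: the question keys and function names are hypothetical shorthand for the six questions above, and judging whether an answer is adequate remains a human call.

```python
# Hypothetical short keys for the six disclosure questions, in the order listed above.
DISCLOSURE_QUESTIONS = (
    "prompt_construction",
    "sampling_depth",
    "model_and_surface_coverage",
    "surface_endorse_decomposition",
    "inference_from_absence",
    "intervention_attribution",
)


def weak_answers(responses: dict[str, bool]) -> list[str]:
    """Questions the vendor struggled with; an unanswered question counts as weak."""
    return [q for q in DISCLOSURE_QUESTIONS if not responses.get(q, False)]


def struggles_on_two_or_more(responses: dict[str, bool]) -> bool:
    """Apply the guide's threshold: two or more weak answers is a warning sign."""
    return len(weak_answers(responses)) >= 2


if __name__ == "__main__":
    # Made-up example: adequate answers on four questions, weak on one, silent on one.
    example = {
        "prompt_construction": True,
        "sampling_depth": True,
        "model_and_surface_coverage": True,
        "surface_endorse_decomposition": True,
        "inference_from_absence": False,
    }
    print(weak_answers(example), struggles_on_two_or_more(example))
```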
Five criteria that distinguish a real perception engine from a dashboard
1. Decomposition of surface and endorse
The platform reports surface behavior and endorse behavior separately, with separate diagnostics for each, broken down by topic and evaluation criterion. A single roll-up metric collapses two distinct failure modes into one chart.
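A small worked example shows how a roll-up hides the two failure modes. The sketch assumes each run has already been labeled with whether the brand was surfaced and, if so, whether it was endorsed; the `RunOutcome` type, the `decompose` helper, and the counts are all hypothetical. Rates appear here only to illustrate the decomposition, not as a recommended reporting format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunOutcome:
    surfaced: bool   # did the model bring up the brand at all?
    endorsed: bool   # did it recommend the brand (only meaningful if surfaced)?


def decompose(outcomes: list[RunOutcome]) -> dict[str, float]:
    """Split a roll-up into surface behavior and endorse-once-surfaced behavior."""
    total = len(outcomes)
    surfaced = sum(1 for o in outcomes if o.surfaced)
    endorsed = sum(1 for o in outcomes if o.surfaced and o.endorsed)
    return {
        "surface": surfaced / total if total else 0.0,
        "endorse_given_surface": endorsed / surfaced if surfaced else 0.0,
        "roll_up": endorsed / total if total else 0.0,
    }


if __name__ == "__main__":
    # Brand A is surfaced often but rarely endorsed; Brand B is the reverse.
    brand_a = ([RunOutcome(True, True)] * 27 + [RunOutcome(True, False)] * 63
               + [RunOutcome(False, False)] * 10)
    brand_b = ([RunOutcome(True, True)] * 27 + [RunOutcome(True, False)] * 3
               + [RunOutcome(False, False)] * 70)
    print(decompose(brand_a))
    print(decompose(brand_b))
```

In this made-up data both brands share the same roll-up, yet one has an endorse problem and the other a surface problem, and those would call for different interventions.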
2. Qualitative ratings over rate-based metrics
Findings come on a documented qualitative scale (Lagging → Market Leading or equivalent), with the prompt construction disclosed, rather than as a Share of Voice percentage. Rate-based metrics are unstable across prompts, personas, contexts, and sampling. The qualitative read is defensible; the percentage is fragile.
3. Inference-from-absence as a first-class input
The platform surfaces the inferences models form when evidence is missing, not only the explicit content models reference. A buyer needs to know when a criterion is unaddressed in the model's reading, because that absence shows up as a weak recommendation without ever appearing as a missing citation.
4. A closed loop on intervention
The platform supports the full sequence: survey, diagnose, ship targeted evidence updates, re-measure the same survey, and attribute the belief shift back to the intervention. Without re-measurement on the same methodology, the program reports activity rather than outcome.
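The re-measurement step can be sketched as a comparison of two readings of the same survey. Everything here is illustrative: the endpoints of the scale come from this guide, the two middle levels are placeholders, and `belief_shift` is a hypothetical helper. The sketch only checks that the re-measurement covers the same criteria and reports the movement; attributing that movement to a specific intervention still needs the vendor's protocol.

```python
# "Lagging" and "Market Leading" come from the guide; the middle levels are
# hypothetical placeholders for whatever documented scale a vendor uses.
RATING_SCALE = ("Lagging", "Developing", "Competitive", "Market Leading")


def belief_shift(before: dict[str, str], after: dict[str, str]) -> dict[str, int]:
    """Per-criterion movement in scale steps between baseline and re-measurement."""
    if before.keys() != after.keys():
        raise ValueError("re-measurement must cover the same criteria as the baseline survey")
    rank = {label: position for position, label in enumerate(RATING_SCALE)}
    return {criterion: rank[after[criterion]] - rank[before[criterion]] for criterion in before}


if __name__ == "__main__":
    # Made-up readings for two evaluation criteria.
    baseline = {"security": "Lagging", "integrations": "Competitive"}
    remeasured = {"security": "Competitive", "integrations": "Competitive"}
    print(belief_shift(baseline, remeasured))
```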
5. Method-first orientation
The platform's documentation leads with how the measurement is constructed, what the surface and endorse behaviors are, and how the methodology controls for prompt and sampling fragility. A platform that leads with metrics and reveals methodology only on request is asking the buyer to accept the chart while the measurement underneath stays unverified.
A two-week structured evaluation timeline
Week 1: Common ground
- Day 1. Share the same brief, prompt-set request, and methodology questionnaire with every vendor on the shortlist.
- Days 2-3. Construct the prompt set in-house, anchored on real buying conversations, with persona and context variation. Do not share the full set with vendors yet.
- Days 4-5. Receive each vendor's methodology questionnaire response. Score against the six disclosure questions.
Week 2: The bakeoff
- Day 6. Share an identical subset of the prompt set with each vendor. Ask each to run it on their platform with their full methodology, including sampling depth.
- Days 7-8. Receive results. Compare what each vendor concluded about the same prompts. Pay attention to where vendors agree and where they diverge.
- Day 9. Ask each vendor to walk through one specific recommendation in detail: why the model is hedging on a particular criterion, what evidence the model is drawing on, what intervention they would prioritize. The depth of that walkthrough is the strongest single signal.
- Day 10. Ask each vendor for a re-measurement protocol: if their recommended intervention shipped today, what would the measurement at week six show, and how would they attribute the shift.
- Day 11. Reference checks. Ask each customer to describe the closed loop in their own words. A customer who can articulate the survey-ship-re-measure sequence has experienced it; a customer who describes dashboards has not.
- Day 12. Internal decision review and selection.
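For the Days 7-8 comparison, it helps to line up each vendor's conclusion per prompt and pull out the disagreements. The sketch below assumes each vendor's findings have been reduced to one label per shared prompt; the vendor names, prompt IDs, labels, and the `divergent_prompts` helper are hypothetical.

```python
def divergent_prompts(findings: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Given findings[vendor][prompt_id] = label, return the prompts vendors disagree on.

    Only prompts that every vendor reported on are compared.
    """
    vendors = list(findings)
    if not vendors:
        return {}
    shared = set(findings[vendors[0]])
    for vendor in vendors[1:]:
        shared &= set(findings[vendor])
    disagreements = {}
    for prompt_id in sorted(shared):
        labels = {vendor: findings[vendor][prompt_id] for vendor in vendors}
        if len(set(labels.values())) > 1:
            disagreements[prompt_id] = labels
    return disagreements


if __name__ == "__main__":
    # Made-up findings from two vendors on the same shared subset.
    findings = {
        "vendor_a": {"p1": "Lagging", "p2": "Market Leading", "p3": "Lagging"},
        "vendor_b": {"p1": "Lagging", "p2": "Lagging"},
    }
    print(divergent_prompts(findings))
```

The divergent prompts are natural candidates for the Day 9 walkthrough, since each vendor then has to explain a conclusion that another vendor did not reach.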
Red flags
- Composite scores presented as ground truth. "AI Visibility Index 73" with no decomposition is a chart, not a measurement.
- Undisclosed prompt sampling. A vendor that will not say how many runs per prompt or how prompts were phrased is asking the buyer to trust the demo without the methodology underneath.
- No recommendation-quality rubric. If the platform measures mentions and citations but cannot describe how it evaluates whether the recommendation is correct, the program will optimize for presence rather than judgment.
- Single-turn prompts only. Real buying conversations are multi-turn. A prompt set that stops at "best X for Y" misses the moment constraints shift the recommendation.
- No re-measurement protocol. Without a defined way to confirm an intervention changed the belief, the program cannot establish causation, and the work piles up as activity without outcome.
- Methodology revealed only after contract. A perception platform earns trust by leading with method. Withholding it until the buyer commits is a tell.
A short closing note
Most of the AI brand management category was built by analogy to SEO and configured to report rate-based metrics. The bakeoff is the buyer's chance to test which vendor has built a perception engine and which has built a dashboard. The methodology questions and the recommendation walkthrough are the two highest-signal moments in the process. Spend the time there.