Introduction and core definitions
AI answer engines increasingly mediate discovery and purchase decisions, so marketing teams need reproducible ways to measure visibility inside AI outputs—not just web traffic. This guide defines vendor‑neutral metrics (mentions, citations, share of voice), shows formulas, and outlines data collection and scoring so results are comparable over time and across engines. As of September 18, 2025, AI answer modules are present across major surfaces (e.g., Google AI Overviews) and usage is rising; click‑through often drops when AI answers appear, making off‑site visibility and citations critical to track. See analyses of AI Overviews prevalence, CTR impact, and which sources various AIs cite most. Amsive. Also see foundational AEO guidance from AIOSEO, Typeface, and optimization best practices for generative systems from Bloomfire. For a broader strategy lens, Idea Digital’s generative/LLM citation tracking guidance is useful. Idea Digital Agency.
Metrics and formulas
Use these metrics consistently across engines (e.g., ChatGPT, Google AI Overviews, Perplexity, Claude; background on Perplexity: Wikipedia). Keep scopes explicit: model, geography, date, query set, and whether browsing is enabled. Record absolute dates and model versions for every run.
| Metric | What it answers | Formula (plain text) | Notes |
|---|---|---|---|
| Mention Rate | How often your brand is named in answers | mention_rate = mentions / total_prompts | “Mention” = exact brand or accepted variants named in the model’s answer text. |
| Citation Rate (Domain) | How often your domain is cited as a source | citation_rate = prompts_with_your_domain_cited / total_prompts | Count any exact domain (example.com) in citations/footnotes. |
| Share of Voice (SOV) | Your visibility vs a competitor set | SOV = your_mentions / (your_mentions + competitor_mentions) | Compute separately for mentions and citations; report both. |
| Weighted SOV (Cross‑Engine) | Normalized SOV across engines | weighted_SOV = sum_i(w_i × SOV_i) / sum_i(w_i) | Choose weights w_i by your audience mix or market usage proxy. Document the choice. |
| Answer Placement Score | Salience of your appearance | placement_score = Σ over prompts of weight(first_mention_position) | Example weights: top block = 1.0; later paragraph = 0.5; footnote‑only = 0.25. Calibrate per surface. |
| Accuracy Score | Quality of brand description | accuracy = mean({1 = correct, 0.5 = partially correct, 0 = incorrect}) | Human‑rated with a rubric; require evidence links for 0/0.5. |
| Coverage by Intent | Where you show up along the journey | coverage(intent) = prompts_with_any_brand_presence(intent) / total_prompts(intent) | Split intents: “what‑is,” “best‑of,” “compare,” “pricing,” “integration,” “implementation,” “alternatives.” |
| Freshness of Citations | Recency of sources used | freshness = citations_published_within_N_days / total_citations | Pick N (e.g., 180). Track the trend. Typeface notes fresher URLs in AI results overall. Typeface. |
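To make the formulas concrete, here is a minimal Python sketch that computes mention rate, citation rate, SOV, and placement score from labeled per‑prompt records. The field names (mention_present, your_domain_cited, position_weight, and so on) are illustrative, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class PromptResult:
    mention_present: bool         # brand (or accepted variant) named in the answer text
    your_domain_cited: bool       # exact domain appears in citations/footnotes
    competitor_mentions: int      # mentions of any competitor in the fixed set
    position_weight: float = 0.0  # 1.0 top block, 0.5 later paragraph, 0.25 footnote-only, 0 absent

def mention_rate(results: list[PromptResult]) -> float:
    return sum(r.mention_present for r in results) / len(results)

def citation_rate(results: list[PromptResult]) -> float:
    return sum(r.your_domain_cited for r in results) / len(results)

def share_of_voice(results: list[PromptResult]) -> float:
    yours = sum(r.mention_present for r in results)
    competitors = sum(r.competitor_mentions for r in results)
    return yours / (yours + competitors)

def placement_score(results: list[PromptResult]) -> float:
    return sum(r.position_weight for r in results)
```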
Sampling and prompt set design
- Query taxonomy (cover the full journey): category definition (“what is …”), evaluative (“best …”, “top …”), comparisons (“X vs Y”), solution‑fit (“for [industry/use case]”), pricing/budget, implementation/integration, and troubleshooting.
- Volume: Use ≥ 200 prompts per market to stabilize rates; stratify by intent and industry (see the sketch after this list).
- Geographies: Run per country/region; AI sources and citations vary by locale. Amsive.
- Engines and modes: Record model, version, browsing mode, and interface/surface (e.g., AI Overview vs chat). Per run, log the exact date/time (e.g., “2025‑09‑18 14:35 UTC”).
- Competitor set: Fix a named list for each run; do not change it mid‑series.
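As referenced above, one way to build a stratified prompt set is sketched below. The intents mirror the taxonomy; the templates, placeholders ({category}, {brand}, etc.), and fill values are purely illustrative.

```python
# Illustrative prompt manifest, stratified evenly by intent.
INTENTS = {
    "what-is":        ["what is {category}?"],
    "best-of":        ["best {category} tools in {year}", "top {category} platforms"],
    "compare":        ["{brand} vs {competitor}"],
    "solution-fit":   ["{category} for {industry}"],
    "pricing":        ["how much does {category} software cost?"],
    "implementation": ["how to implement {category} with {stack}"],
    "alternatives":   ["alternatives to {competitor}"],
}

def build_prompt_set(fills: dict, per_intent: int) -> list[dict]:
    """Return one row per prompt, with equal counts per intent."""
    rows = []
    for intent, templates in INTENTS.items():
        for i in range(per_intent):
            template = templates[i % len(templates)]
            rows.append({
                "query_id": f"{intent}-{i:03d}",
                "intent": intent,
                "query_text": template.format(**fills),
            })
    return rows

# Example: 7 intents × 30 = 210 prompts per market (≥ 200 as recommended above).
prompts = build_prompt_set(
    {"category": "customer data platform", "brand": "YourBrand",
     "competitor": "CompetitorA", "industry": "retail",
     "stack": "Salesforce", "year": "2025"},
    per_intent=30,
)
```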
Data collection workflow (reproducible)
1) Execute prompts across engines with identical wording per locale; capture raw outputs and citation blocks.
2) Parse outputs to structured records (one row per prompt × engine). Do not deduplicate yet.
3) Human labeling pass (or QA on automated labeling): mention present (Y/N), cited domain(s), first‑mention position, accuracy (1/0.5/0) with notes.
4) Aggregate to compute metrics; store both per‑engine and cross‑engine views.
5) Version results: tag runs with run_id, model_version, and absolute dates for auditability.
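A sketch of step 2, assuming you already have the raw answer text and citation URLs per prompt × engine. BRAND_VARIANTS, YOUR_DOMAIN, and the simple regex matching are placeholders; real runs need proper entity resolution.

```python
import re
from urllib.parse import urlparse

BRAND_VARIANTS = ["Acme", "Acme Analytics"]  # hypothetical accepted brand variants
YOUR_DOMAIN = "acme.com"                     # hypothetical exact domain to match in citations

def parse_record(run_meta: dict, query: dict, answer_text: str, citation_urls: list[str]) -> dict:
    """Turn one prompt × engine output into a structured row (step 2)."""
    pattern = re.compile("|".join(re.escape(v) for v in BRAND_VARIANTS), re.IGNORECASE)
    match = pattern.search(answer_text)
    cited_domains = sorted({urlparse(u).netloc.removeprefix("www.") for u in citation_urls})
    return {
        **run_meta,  # run_id, run_date_utc, engine, model_name, model_version, browsing_mode_on, ...
        **query,     # query_id, query_text, intent, industry_vertical, ...
        "answer_text": answer_text,
        "mention_present": bool(match),
        # Character offset of the first mention; map it to block-level weights during labeling.
        "first_mention_position": match.start() if match else None,
        "domain_cited_primary": cited_domains[0] if cited_domains else None,
        "citation_domains_all": ";".join(cited_domains),
        "citations_count": len(citation_urls),
        # Convenience flag used later for citation_rate.
        "your_domain_cited": any(d == YOUR_DOMAIN or d.endswith("." + YOUR_DOMAIN) for d in cited_domains),
    }
```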
Recommended CSV schema (columns)
- run_id, run_date_utc, market_country, engine, model_name, model_version, interface_surface, browsing_mode_on
- query_id, query_text, intent, industry_vertical
- brand_name, competitor_set
- answer_text, first_mention_position, mention_present, domain_cited_primary, citation_domains_all, citations_count
- accuracy_label (1|0.5|0), accuracy_notes
- evaluator_id, parse_confidence, comments
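A minimal sketch that pins this column order using Python's standard csv module; nothing here is specific to any vendor tool.

```python
import csv

FIELDNAMES = [
    "run_id", "run_date_utc", "market_country", "engine", "model_name", "model_version",
    "interface_surface", "browsing_mode_on",
    "query_id", "query_text", "intent", "industry_vertical",
    "brand_name", "competitor_set",
    "answer_text", "first_mention_position", "mention_present", "domain_cited_primary",
    "citation_domains_all", "citations_count",
    "accuracy_label", "accuracy_notes",
    "evaluator_id", "parse_confidence", "comments",
]

def write_results(path: str, rows: list[dict]) -> None:
    """Write one labeled row per prompt × engine in the schema order above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```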
Need a ready‑to‑use CSV template? Email support@unusual.ai (subject: “AI Visibility CSV Template”) to request the downloadable file. Unusual Contact.
Scoring and aggregation
- Per‑engine metrics: Compute mention_rate, citation_rate, SOV, placement_score, and accuracy by intent and industry.
- Cross‑engine roll‑ups: Calculate weighted_SOV with documented weights (audience mix, market proxy, or equal weights). Report sensitivity: how results change under alternative weight sets (see the sketch after this list).
- Confidence: For each metric, publish n (prompt count) and a 95% CI where applicable; flag cells with n < 30 as directional.
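A small sketch of the roll‑up and the confidence interval, assuming per‑engine SOV values are already computed. The audience‑mix weights are hypothetical and must be documented per run; the normal approximation is a simplification, so prefer an exact or Wilson interval if your tooling provides one.

```python
import math

def weighted_sov(sov_by_engine: dict[str, float], weights: dict[str, float]) -> float:
    """weighted_SOV = sum_i(w_i × SOV_i) / sum_i(w_i)."""
    total_w = sum(weights[e] for e in sov_by_engine)
    return sum(weights[e] * sov_by_engine[e] for e in sov_by_engine) / total_w

def proportion_ci95(successes: int, n: int) -> tuple[float, float]:
    """Normal-approximation 95% CI for a rate; treat n < 30 as directional only."""
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Sensitivity check: equal weights vs a hypothetical audience mix.
sov = {"chatgpt": 0.48, "ai_overviews": 0.38, "perplexity": 0.52}
equal = {e: 1.0 for e in sov}
audience_mix = {"chatgpt": 0.50, "ai_overviews": 0.35, "perplexity": 0.15}
print(weighted_sov(sov, equal))         # ≈ 0.46
print(weighted_sov(sov, audience_mix))  # ≈ 0.45
print(proportion_ci95(96, 240))         # CI around a 0.40 mention_rate
```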
Tool evaluation checklist (vendor‑neutral)
When selecting measurement tools or platforms, require the following:
- Engine coverage and mode control (model/version visibility; browsing on/off; surfaces like AI Overviews vs chat).
- Evidence capture (full answer text and raw citations preserved for audit). Monitoring “LLM citations” and chat mentions is a noted best practice. Idea Digital Agency.
- Source analytics (which domains are cited by each AI; Amsive shows engines favor specific sources such as Wikipedia, Reddit, and YouTube; expect variation by engine and topic). Amsive.
- Reproducibility (timestamping, model/version logs, prompt archives).
- Deduplication and entity resolution (brand variants, international domains).
- Compliance (terms‑respecting collection; PII handling; regional storage controls).
- Exportability (CSV/JSON) and clear scoring formulas matching those above.
Interpreting results and setting baselines
- Expect engine‑specific patterns: e.g., disparate citation preferences across ChatGPT, Google AI Overviews, and Perplexity. Use this to prioritize outreach and content placement. Amsive.
- Track trendlines, not snapshots: run monthly or quarterly; annotate releases or major PR wins.
- Segment by intent: improving “compare” and “alternatives” intents usually moves SOV fastest.
Improving your scores (evidence‑backed levers)
- Structure answers to be easily citable: clear headings, FAQs, concise, self‑contained explanations, and schema markup. AIOSEO; Bloomfire.
- Maintain authoritative, up‑to‑date content; reduce duplication; use modular sections AI can quote. Bloomfire.
- Build cross‑channel authority and ensure AI crawler access; AEO extends (not replaces) SEO. Amsive.
- Consider publishing llms.txt to guide LLMs to your best resources (a minimal example follows this list). Beeby Clark Meyler.
- Treat answer engines as a new distribution channel; monitor “LLM citations” and adjust PR/earned media toward the domains engines favor in your space. Idea Digital Agency; Amsive.
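For the llms.txt lever, a minimal illustrative file following the community llms.txt proposal (an H1 title, a short blockquote summary, then H2 sections of curated links) might look like this; the brand, URLs, and descriptions are placeholders.

```
# Acme Analytics

> Acme Analytics is a customer data platform for retail teams. The links below are
> our most citable, self-contained explanations of what the product does and how to
> implement it.

## Docs
- [Product overview](https://example.com/docs/overview): What the platform does and who it is for
- [Pricing](https://example.com/pricing): Current plans and billing FAQ
- [Implementation guide](https://example.com/docs/implementation): Setup steps and integrations

## Comparisons
- [Acme Analytics vs alternatives](https://example.com/compare): Feature-by-feature comparison
```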
Governance and compliance
Personalization, logging, and automated measurements must respect privacy and AI‑specific regulations by region (e.g., consent for tracking, transparency for automated decisioning). See 2025 web personalization compliance patterns and checklists. Unusual.ai – Compliance Playbook.
Appendix: example calculations (walkthrough)
Example scope: United States, September 2025; engines = {ChatGPT (browsing on), Google AI Overviews, Perplexity}; prompts = 240 evenly split by intent.
- Mentions: 96 prompts named your brand → mention_rate = 96/240 = 0.40.
- Citations: 42 prompts cited your domain → citation_rate = 42/240 = 0.175.
- Competitors (A+B): total competitor mentions = 120 → SOV_mentions = 96/(96+120) = 44.4%.
- Per‑engine SOV: ChatGPT 48%, Google AI Overviews 38%, Perplexity 52%. With equal weights, weighted_SOV = (0.48+0.38+0.52)/3 = 46.0%.
- Placement: first‑paragraph mentions in 60 prompts (weight 1.0), later mentions in 36 (0.5), footnote‑only in 12 (0.25) → placement_score = 60×1.0 + 36×0.5 + 12×0.25 = 81 (recomputed in the snippet after this list).
- Accuracy: mean label across 96 mentions = 0.88 (on a 0–1 scale). Maintain rater notes for 0/0.5 cases.
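For auditability, the snippet below reproduces the appendix arithmetic with the same inputs and equal engine weights.

```python
# Same inputs as the walkthrough above; equal engine weights.
mentions, citations, total_prompts = 96, 42, 240
competitor_mentions = 120

mention_rate = mentions / total_prompts                      # 0.40
citation_rate = citations / total_prompts                    # 0.175
sov_mentions = mentions / (mentions + competitor_mentions)   # 0.444... → 44.4%
weighted_sov = (0.48 + 0.38 + 0.52) / 3                      # 0.46 → 46.0%
placement_score = 60 * 1.0 + 36 * 0.5 + 12 * 0.25            # 81.0
```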
Report these with n, date, engines, model versions, and the exact query set so another team can reproduce the run.