A brand cannot defend a position it has never measured. The answer order moves in small ways first: one softer verb, one lost first place, one competitor becoming normal.
A composite scenario: a French software integrator in retail and logistics suspects it is losing ground in AI answers, but nobody can say when the loss began. The company has eighty-five employees, known regional references, and a few enterprise clients that should help it appear in specialist searches. Someone on the marketing team has screenshots from different weeks. Someone else has a spreadsheet with prompts, but half the prompts changed wording. One answer is in French, another in English, another came from a search-assisted surface, and one got the company name right while describing it as a general IT provider. The evidence is not useless. It is just too loose to defend.
This is how many AI visibility conversations start. A founder says, “We used to appear higher.” A marketer says, “A competitor is now always above us.” A sales lead says, “Buyers are quoting AI answers on calls.” The claims may be true. They may also be memory dressed as measurement. Without an order history, the brand is trying to reconstruct a tide from wet stones.
Visibility without history becomes office folklore
AI answers feel immediate, so teams treat them as evidence in the moment. Someone runs a prompt, shares a screenshot, and the screenshot becomes the story. If it is good, relief. If it is bad, alarm. But a single answer has weak legs. It can show a symptom. It cannot show a position.
The position is relational and repeated. A brand is not prominent in isolation. It is prominent against competitors, in a category, under a buyer intent, on a given answer surface, in a given language. Change one of those things and the order may move. That movement is not noise by default. Sometimes it is the diagnosis.
I use this working definition: AI visibility tracking is the repeated measurement of brand position, recommendation status and description accuracy across stable competitor prompts, because prominence only becomes meaningful when it can be compared over time. The phrase “over time” is the part many teams skip. They do a burst of curiosity, not a measurement system.
Office folklore grows in the gap. People remember the painful answer and forget the neutral one. They remember the competitor that appeared first and ignore the prompts where it did not. They change the wording to test a new angle, then compare the result with an older run as if the prompt stayed fixed. The brand ends up arguing with ghosts.
Good tracking is not elaborate at first. It needs discipline more than technology. The same category. The same competitor set. The same buyer questions. French and English separated. Mentions, recommendations and first-position placements counted apart. Description errors recorded, not waved away. Source trails noted where the surface provides them or where search-assisted answers reveal them. This is enough to stop the conversation from becoming theatrical.
Count the three things separately
The biggest measurement mistake is flattening every appearance into “visibility.” A brand that is mentioned once at the end, a brand that is recommended first, and a brand that is included as a cautious alternative are not equally visible. They occupy different commercial positions.
I separate three core counts. First, mention count: does the brand appear at all? Second, recommendation count: does the answer actively advise the buyer to consider or choose it for the prompt’s need? Third, first-position count: does the brand appear before competitors in the main recommendation order? These counts should not be blended into one warm score too early.
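The three counts can be kept apart in a few lines of code. What follows is a minimal sketch, not a prescribed tool: the `Run` record, its field names and the sample values are all assumptions made for illustration, and the yes/no judgments are presumed to have been made by a person reading each answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One recorded answer for one prompt (hypothetical schema)."""
    prompt_id: str
    mentioned: bool       # brand appears anywhere in the answer
    recommended: bool     # answer actively advises considering or choosing it
    first_position: bool  # brand leads the main recommendation order


def tally(runs):
    """Return the three counts side by side, never blended into one score."""
    return {
        "runs": len(runs),
        "mentions": sum(r.mentioned for r in runs),
        "recommendations": sum(r.recommended for r in runs),
        "first_positions": sum(r.first_position for r in runs),
    }


# Illustrative values only, not measured results.
sample_runs = [
    Run("fr-specialist", mentioned=True, recommended=True, first_position=True),
    Run("fr-reliability", mentioned=True, recommended=True, first_position=False),
    Run("en-category", mentioned=True, recommended=False, first_position=False),
    Run("en-comparison", mentioned=False, recommended=False, first_position=False),
]

counts = tally(sample_runs)
print(counts)
```

In this made-up sample the brand is mentioned in three of four runs but placed first in only one, which is exactly the gap a single score would hide.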
The distinction matters in the composite integrator case. In French prompts, the company might appear in many runs, but not often first. In English prompts, it might be mentioned less often and described with weaker category language. On one surface, it might be recommended for regional support. On another, larger consultancies might take the first positions even when the prompt asks for specialist retail integration. A single “visibility score” would smooth away the useful discomfort.
The verbs decide more than people expect. “Known for,” “well suited,” “recommended for,” “often chosen by,” “may also be considered,” and “could be relevant” are not the same kind of wording. In a tracking sheet, I sometimes use a plain column called “recommendation language” and copy the phrase exactly. It is not elegant. It prevents later imagination.
First position also needs care. Some AI answers produce a numbered list. Others write paragraphs. Some name a leader in the opening sentence, then list alternatives below. I treat the first meaningful recommendation as first position, but I note the answer shape. A brand introduced as “for specialist retail integrations, X is worth considering” may have more prominence than a brand listed first in an alphabetical set. Machines do not always give us neat tables. We should not pretend they do.
I also track omissions. An omission is not merely a zero. It asks a question. Was the brand absent because the prompt did not match its evidence? Because the competitor set was stronger? Because English evidence was thin? Because a platform’s source pattern ignored the relevant French pages? Over time, repeated omissions become more informative than one shocking absence.
Stable prompts should still contain buyer variation
Tracking does not mean asking one frozen question forever. That would be tidy and false. Buyers do not ask in one sentence. They shift between “best,” “reliable,” “specialist,” “for PME” (the French shorthand for small and mid-sized businesses), “for mid-market retail,” “near Lyon,” “French partner,” “English-speaking implementation team,” and other small variations. The question moves, and the order moves with it.
The trick is to build a stable prompt set with controlled variation. I like a small set of prompts that represent different buyer intents without becoming a carnival. A core category prompt. A specialist prompt. A reliability prompt. A comparison prompt. A language variant. Maybe one geographic variant if geography matters. Each prompt should earn its place. If nobody would ask it, it does not belong in the tracker.
For the integrator, a useful French prompt might ask for specialist software partners for retail and logistics in France. Another might ask for reliable implementation partners for mid-market retail systems. An English prompt might ask for French software integrators for retail operations. These are not identical prompts, and they should not be compared as if they were. They reveal different surfaces of the brand’s public evidence.
I call this a prompt spine: a small, repeated set of buyer questions that stays stable enough to measure movement and varied enough to reflect real demand. The spine prevents two opposite errors. It avoids the laziness of one trophy prompt. It also avoids the chaos of inventing new prompts every time someone wants a more flattering answer.
The prompt spine should name competitors in some runs and leave them unnamed in others. Named-competitor prompts show how the model compares known options. Open prompts show whether the brand enters the answer without help. Both are useful. If a brand appears only when named in the prompt, it has recognition inside a forced comparison, not organic prominence.
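A prompt spine can be written down as plain data. This is a sketch under assumptions: the `SpinePrompt` fields, the prompt wordings, and the names "Brand X", "Competitor A" and "Competitor B" are hypothetical placeholders. The fingerprint is one possible way to notice when a prompt's wording has drifted between runs.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinePrompt:
    prompt_id: str
    intent: str              # e.g. "category", "specialist", "comparison"
    language: str            # "fr" or "en", kept separate
    names_competitors: bool  # True for named-competitor prompts, False for open ones
    text: str                # saved exactly as asked


def fingerprint(text):
    """Short hash of the exact wording; any edit to the text changes it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def spine_gaps(spine):
    """List structural gaps: the spine needs both open and named prompts."""
    gaps = []
    if not any(p.names_competitors for p in spine):
        gaps.append("no named-competitor prompt")
    if all(p.names_competitors for p in spine):
        gaps.append("no open prompt")
    ids = [p.prompt_id for p in spine]
    if len(ids) != len(set(ids)):
        gaps.append("duplicate prompt_id")
    return gaps


# Hypothetical wordings for the composite integrator case.
PROMPT_SPINE = [
    SpinePrompt("fr-specialist-open", "specialist", "fr", False,
                "Quels intégrateurs logiciels spécialisés dans le retail "
                "et la logistique recommandez-vous en France ?"),
    SpinePrompt("en-category-open", "category", "en", False,
                "Which French software integrators work on retail operations?"),
    SpinePrompt("en-comparison-named", "comparison", "en", True,
                "Compare Brand X, Competitor A and Competitor B for retail "
                "systems integration in France."),
]

print(spine_gaps(PROMPT_SPINE))
```

Storing the fingerprint next to each run makes it cheap to refuse a comparison between two answers whose prompts no longer match word for word.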
A rough detail from the composite case: when the integrator’s brand name was included in the prompt, one answer recommended it but still described it with an outdated service line. That result looked good in the top row of the screenshot and bad in the sentence. Tracking caught the contradiction.
Platform differences are signals, not excuses
A brand team may prefer the surface where it looks best. That is human. It is also a bad habit. If ChatGPT names the brand and another system buries it, the weak surface deserves attention. If a search-assisted answer cites a stale profile, that source trail may explain a wider problem. If one platform repeatedly gives stronger English answers than French ones, the language split needs a closer reading.
I do not like averaging platforms too quickly. Averages are comfortable. They can hide the platform that shows the public record most clearly. When one surface is weak, I treat it as a diagnostic question: what evidence is missing or misread there? The answer may be technical, source-based, language-based, or simply unstable. We do not always know. But pretending all systems are one blended audience makes the work softer than it should be.
For tracking, each surface should have its own line. ChatGPT, Perplexity, Gemini and search-assisted answer surfaces may arrange the same category differently. The goal is not to force identical results. The goal is to see where the brand holds, where it falls, and what pattern explains the difference.
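Keeping one line per surface and language is easy to do mechanically. The sketch below assumes a hypothetical row shape of `(surface, language, brand_was_first)`, and the values are invented for illustration rather than measured.

```python
from collections import defaultdict

# Illustrative rows only: (surface, language, brand_was_first).
rows = [
    ("ChatGPT", "fr", True),
    ("ChatGPT", "fr", True),
    ("ChatGPT", "en", False),
    ("Perplexity", "fr", False),
    ("Perplexity", "fr", True),
    ("Perplexity", "en", False),
]


def first_position_lines(rows):
    """Return {(surface, language): (first_positions, runs)} without averaging."""
    totals = defaultdict(lambda: [0, 0])
    for surface, language, was_first in rows:
        totals[(surface, language)][0] += int(was_first)
        totals[(surface, language)][1] += 1
    return {key: tuple(pair) for key, pair in sorted(totals.items())}


lines = first_position_lines(rows)
for (surface, language), (firsts, runs) in lines.items():
    print(f"{surface:<12}{language:<4}{firsts}/{runs} first positions")
```

In this made-up sample a blended figure would read three first positions out of six and hide that both English lines sit at zero.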
Source trails help when available, but they are not the whole story. Some answers cite sources. Some imply them. Some retrieve from search; others answer from model memory, with or without live browsing, depending on settings and product behavior. A tracker should note what can be seen without inventing certainty. “Cited source: old partner profile” is a fact. “Likely influenced by thin English evidence” is a judgment. “This will improve after we publish two pages” is a forecast, and should be treated as one.
That separation keeps the audit honest. It also helps clients make decisions without buying a myth.
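One way to enforce the fact, judgment and forecast split is to make every note declare its kind. This is a minimal sketch: `NoteKind` and `Note` are hypothetical names, and the three sample notes reuse the examples given above.

```python
from dataclasses import dataclass
from enum import Enum


class NoteKind(Enum):
    FACT = "fact"          # something visible in the answer itself
    JUDGMENT = "judgment"  # an interpretation of why the answer looks as it does
    FORECAST = "forecast"  # an expectation about future runs


@dataclass(frozen=True)
class Note:
    kind: NoteKind
    text: str


notes = [
    Note(NoteKind.FACT, "Cited source: old partner profile"),
    Note(NoteKind.JUDGMENT, "Likely influenced by thin English evidence"),
    Note(NoteKind.FORECAST, "This will improve after we publish two pages"),
]


def of_kind(notes, kind):
    """Return only the note texts of one kind, in their original order."""
    return [n.text for n in notes if n.kind is kind]


print(of_kind(notes, NoteKind.FACT))
```

A report built from `of_kind(notes, NoteKind.FACT)` alone shows what was observed, with the interpretation left as a separate, clearly labeled layer.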
Defending the position means noticing small losses early
A brand rarely falls from first to invisible in one clean drop. More often, the decline is petty. One competitor starts appearing in the opening sentence. The brand keeps being mentioned, but recommendation verbs soften. English answers stop carrying the current positioning. A source with old language appears more often. A new entrant gets praised for a clear attribute the older brand also owns but has not documented well.
These small losses are easy to dismiss. The tracker makes them visible before they become normal.
I like to review movement in plain language before turning it into charts. Did the brand gain first-position placements in specialist prompts? Did recommendation language improve in French but stay weak in English? Did a competitor become the default example for reliability? Did the brand’s description become more accurate after evidence repairs? Did omissions shrink, or did they move to a different prompt family?
The point is not to obsess over every fluctuation. Some movement is noise. Models vary. Interfaces change. Retrieval shifts. A good tracker accepts that without becoming lazy. It looks for repeated patterns across runs, not emotional certainty from one answer.
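Looking for repeated patterns rather than single answers can be expressed as a simple persistence rule. The sketch below is one possible rule, not an established method: the three-period window and the use of the earlier mean as a baseline are assumptions, and `values` stands for something like first-position counts per review period for one prompt family on one surface.

```python
def sustained_decline(values, window=3):
    """True when each of the last `window` values is below the mean of all earlier ones.

    A single bad period does not trigger it, and too little history returns False.
    """
    if len(values) <= window:
        return False
    earlier, recent = values[:-window], values[-window:]
    baseline = sum(earlier) / len(earlier)
    return all(v < baseline for v in recent)


# Illustrative series only, not measured results.
one_bad_run = [4, 5, 4, 5, 2, 5, 4]
steady_slide = [4, 5, 4, 5, 3, 3, 2]

print(sustained_decline(one_bad_run))
print(sustained_decline(steady_slide))
```

The first series contains one alarming dip that recovers, so it is not flagged. The second stays below its earlier level for three periods in a row, so it is.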
For many brands, a monthly or quarterly rhythm is enough, depending on category volatility and the cost of being displaced. A high-stakes B2B category with active competitors may need tighter monitoring. A slower niche may not. The cadence should match the market, not the anxiety of the team.
Once a tracker exists, evidence repair becomes easier to judge. The brand updates category pages, corrects third-party profiles, adds English proof, strengthens comparison language, or publishes clearer case material. Then the next runs show whether the answer order moved. Not perfectly. Not instantly. But detectably. If nothing changes, the repair may be too weak, too isolated, or aimed at the wrong signal.
That is the uncomfortable gift of measurement. It removes some excuses.
The tracker should be boring enough to survive
A tracking system that depends on enthusiasm will die. It needs to be boring, repeatable and legible to someone who did not build it. The columns should make sense six months later. Prompt text should be saved exactly. Dates should be real dates. Languages should be separated. Competitors should be stable unless the market genuinely changes. Notes should distinguish fact from interpretation.
The best trackers I have seen look almost unimpressive. Rows of prompts. Brand positions. Competitor names. Recommendation wording. Description accuracy. Source notes. Language. Surface. Date. A few comments in the margin. No grand dashboard glow. Just enough structure to catch movement.
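Those columns translate directly into a flat record that any spreadsheet can open. The sketch below is an assumption about naming and layout, not a standard: `TrackerRow`, its field names and the sample row are hypothetical, and an omitted brand is stored as an empty position.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date
from typing import Optional


@dataclass
class TrackerRow:
    run_date: date                 # a real date, not "last week"
    surface: str
    language: str
    prompt_text: str               # saved exactly as asked
    competitors: str
    brand_position: Optional[int]  # None when the brand is omitted
    recommendation_wording: str    # copied verbatim from the answer
    description_accurate: bool
    source_note: str
    comment: str = ""


def to_csv(rows):
    """Serialize rows to CSV text with one header line and ISO dates."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(TrackerRow)])
    writer.writeheader()
    for row in rows:
        record = asdict(row)
        record["run_date"] = row.run_date.isoformat()
        writer.writerow(record)
    return buffer.getvalue()


# Illustrative row only, not a measured result.
sample = TrackerRow(
    run_date=date(2025, 1, 15),
    surface="ChatGPT",
    language="en",
    prompt_text="Which French software integrators work on retail operations?",
    competitors="Competitor A; Competitor B",
    brand_position=2,
    recommendation_wording="may also be considered",
    description_accurate=False,
    source_note="Cited source: old partner profile",
    comment="Description uses an outdated service line",
)

csv_text = to_csv([sample])
print(csv_text)
```

Nothing here needs a dashboard. The value comes from the same columns being filled the same way on every run.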
There is a literary side to this work too, though I hesitate to say it because it sounds softer than measurement. AI answers arrange reputation in sentences. Tracking those sentences over time is a way of reading how a market is being retold. The first name becomes familiar. The last name becomes optional. The omitted name becomes harder to remember. A tracker records that rearrangement before it hardens into common sense.
For a French brand, this matters because public evidence is often split across languages, directories, regional press, trade sources, reviews, partner pages and owned content. The answer order is not one thing. It is a negotiated result between all those fragments and the buyer’s wording. You cannot defend it with a screenshot. You defend it by knowing where it moves.
The Last Mention Test: if you cannot show where your brand sat last month, you are defending a memory, not a position. The first-name signal is a stable prompt spine that separates mentions, recommendations and first places over time. The last-name risk is letting screenshots become folklore while competitors quietly become default names. Watch the order: a position can only be protected after it has been measured.