Key takeaways

  • ChatGPT doesn’t give the same answer twice. Testing a prompt once is measuring noise.
  • Five measurement levels coexist: presence, position, Share of Voice, intent-weighted frequency, context. Confusing them leads to the wrong conclusion.
  • A usable measurement rests on a basket of 20 to 50 prompts, repeated 3 to 5 times, tracked over time.
  • A KPI that triggers no action is a vanity metric. Stop tracking it.

ChatGPT Isn’t Deterministic: Is Your Measurement Reliable?

Your measurement isn’t reliable if it rests on a single test per prompt. ChatGPT generates its answers probabilistically, not deterministically. The same prompt, submitted twice a few minutes apart, can cite different brands, in a different order, with different sources.

The technical cause is documented. OpenAI exposes a temperature parameter that controls the randomness of generation. Above zero, two runs of the same prompt can produce diverging outputs. And the default temperature in the ChatGPT interface isn’t zero.
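To make the role of temperature concrete, here is a minimal sketch of the textbook softmax-with-temperature formulation. It is an illustration of the general mechanism, not a description of OpenAI’s internals: the brand names, the scores and the function names are all hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_brand(brands, logits, temperature, rng):
    """Draw one brand at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(brands, weights=probs, k=1)[0]

brands = ["BrandA", "BrandB", "BrandC"]  # hypothetical candidates
logits = [2.0, 1.5, 0.5]                 # hypothetical scores, not real model output

rng = random.Random()
for temperature in (0.1, 1.0):
    draws = [sample_brand(brands, logits, temperature, rng) for _ in range(10)]
    print(temperature, draws)
```

At a low temperature, almost every draw returns the top-scored brand. At 1.0, the other brands come up regularly, which is the kind of run-to-run variation you observe when you repeat a prompt.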

Several sources of variance add to that base. The model version changes the selection of cited brands. According to RESONEO’s study on the ChatGPT 5.3 Instant and 5.4 Thinking variants (2026), the same prompt can lose up to 20% of its unique domains from one variant to the other. The ChatGPT account memory, when active, shapes recommendations. The timing of the test affects web search results. So does location.

The direct consequence for your measurement: if you test a prompt at a single point in time, on a single account, in a single version, you capture a sample of one in a large variation space. You measure noise, and you make decisions on noise.

A reliable measurement requires three conditions, detailed further down: a representative prompt basket, several repetitions per prompt, and tracking over time. Without those three conditions, your numbers move without you being able to say whether anything has changed for your brand.

The 5 Measurement Levels of Your ChatGPT Visibility

Five metrics describe your ChatGPT visibility, and each answers a different question. Confusing them leads to the wrong conclusion.

Presence: Are You Cited, Yes or No?

Presence is a binary metric. On a given prompt, your brand is cited, or it isn’t. It’s the starting point, and it’s insufficient. A 60% presence rate on your basket says nothing about your weight in the answer, nor about the quality of the context. But if it’s at 0%, you know you don’t have a measurement problem: you have a visibility problem.
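As a minimal sketch, presence can be computed as the share of runs in which your brand appears. The data layout (one list of cited brands per run), the brand names and the function name are assumptions made for illustration.

```python
def presence_rate(runs, brand):
    """Share of runs in which `brand` is cited at least once."""
    if not runs:
        return 0.0
    cited = sum(1 for brands in runs if brand in brands)
    return cited / len(runs)

# Hypothetical results: each inner list holds the brands cited in one run.
runs = [
    ["Acme", "Globex", "Initech"],
    ["Globex", "Initech"],
    ["Initech", "Acme"],
    ["Globex"],
    ["Acme", "Globex"],
]

print(presence_rate(runs, "Acme"))  # 0.6, i.e. cited in 3 runs out of 5
```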

Position: Where Do You Appear in the Answer?

Position measures the rank of your brand in the list of cited brands. Being cited first or fifth changes everything. ChatGPT foregrounds the first citations, and readers stop reading quickly. Position is expressed as an average over the tests where you’re cited, and has no meaning if presence is at zero.
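Here is one possible way to compute it, assuming each run is stored as the list of cited brands in their order of appearance. The names and data are hypothetical. The function returns `None` rather than a number when the brand is never cited, since the average is undefined in that case.

```python
def average_position(runs, brand):
    """Mean 1-based rank of `brand`, computed only over runs where it is cited."""
    positions = [brands.index(brand) + 1 for brands in runs if brand in brands]
    if not positions:
        return None  # no presence, so position is undefined
    return sum(positions) / len(positions)

# Hypothetical results: brands listed in the order they appear in each answer.
runs = [
    ["Acme", "Globex", "Initech"],
    ["Globex", "Initech"],
    ["Initech", "Acme"],
    ["Globex"],
    ["Acme", "Globex"],
]

print(average_position(runs, "Acme"))     # ranks 1, 2 and 1 over three runs
print(average_position(runs, "Unknown"))  # None
```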

Share of Voice: What Is Your Weight Against Competitors?

Share of Voice measures your share of the total citation volume in your sector. Out of 100 brand citations for your target prompts, how many come back to you? It’s the competitive benchmark metric, and the most readable to leadership. It requires that you measure your competitors at the same time as yourself, on the same basket.
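A minimal sketch of the calculation, under two assumptions: every run is stored as a list of cited brands, and a brand counts once per run even if the answer names it several times. Brand names and the function name are hypothetical.

```python
from collections import Counter

def share_of_voice(runs):
    """Each brand's share of all citations across the runs (one citation per brand per run)."""
    counts = Counter(brand for brands in runs for brand in set(brands))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {brand: n / total for brand, n in counts.items()}

# Hypothetical results on a shared basket: you and your competitors in the same runs.
runs = [
    ["Acme", "Globex", "Initech"],
    ["Globex", "Initech"],
    ["Initech", "Acme"],
    ["Globex"],
    ["Acme", "Globex"],
]

for brand, share in sorted(share_of_voice(runs).items()):
    print(f"{brand}: {share:.0%}")
```

Because every brand is counted on the same runs, the shares add up to 100%, which is what makes the figure comparable across competitors.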

Intent-Weighted Frequency: Not All Prompts Carry the Same Volume

Being cited 100 times on generic informational prompts doesn’t weigh as much as being cited 10 times on prompts with strong commercial intent. Intent-weighted frequency applies a coefficient to citation volume based on the business potential of each prompt. Without weighting, your numbers mix qualified traffic and noise.
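The weighting can be sketched as a simple weighted sum. The coefficients below are purely illustrative assumptions: you would set your own based on the business value of each intent.

```python
# Hypothetical coefficients: set them according to your own business potential.
INTENT_WEIGHTS = {"informational": 1, "comparison": 5, "commercial": 20}

def intent_weighted_frequency(citations, weights):
    """Sum citation counts, each multiplied by the weight of its prompt intent.

    `citations` is a list of (intent, citation_count) pairs.
    """
    return sum(weights[intent] * count for intent, count in citations)

generic = [("informational", 100)]  # 100 citations on generic prompts
commercial = [("commercial", 10)]   # 10 citations on high-intent prompts

print(intent_weighted_frequency(generic, INTENT_WEIGHTS))     # 100
print(intent_weighted_frequency(commercial, INTENT_WEIGHTS))  # 200
```

With these example coefficients, the 10 commercial citations outweigh the 100 informational ones.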

Context: Under What Conditions Are You Cited?

Context looks at what surrounds your citation. Are you cited as a reference, as an example, as a counter-example? With positive or negative sentiment? On which angle of the query? This metric is qualitative, harder to automate, and the one that reveals AI hallucinations about your brand.

Level | Question it answers | Typical error
Presence | Am I cited, yes or no? | Settling for it and ignoring position
Position | Where in the answer? | Averaging across tests without presence
Share of Voice | What is my weight against competitors? | Comparing without a common prompt basket
Intent-weighted frequency | Which prompts actually drive business? | Counting every citation at equal weight
Context | How am I cited? | Ignoring it and missing hallucinations

How to Build a Prompt Basket That Doesn’t Lie

A reliable prompt basket rests on representativeness, not volume. A basket of 500 generic prompts measures worse than a basket of 30 prompts chosen for their business value and their diversity of intent.

Build your basket on four criteria.

  • Intent coverage. Mix awareness, comparison, commercial intent, pain-point and budget prompts. A basket that only covers one intent gives you a truncated view.
  • Cluster around your core topics. Concentrate the prompts on three to five thematic clusters, the ones that define your market. Skimming the surface of twenty topics achieves nothing.
  • Real phrasings. Users don’t write in keywords. Use the long, natural, conversational phrasing people actually type into ChatGPT.
  • Reasonable volume. 20 to 50 target prompts are enough for a sector. Beyond that, the measurement and maintenance load becomes a drag.

A poorly built basket produces numbers that move for no interpretable reason. A well-built basket produces a measurement that compares over time. The difference doesn’t show right away, but it pays off after three months of tracking.

How Many Tests Do You Need to Measure Right?

A usable measurement requires at minimum 3 to 5 repetitions per prompt, on a weekly or bi-weekly cadence. That’s the threshold below which noise overwhelms signal.

The logic comes in three steps.

  1. Several runs per prompt. Three to five runs of the same prompt, at short intervals, give you a distribution of citations. You can then reason on an average rather than on a point.
  2. Several prompts per cluster. The 20-to-50 prompt basket mentioned above produces an aggregated measurement per cluster, more stable than an isolated prompt.
  3. Several cycles over time. A snapshot has no meaning. What matters is the trend over a rolling window of four to six weeks, set against the actions you’ve taken.
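The three steps above can be sketched as a two-stage aggregation: average the runs of each prompt, then average the prompts of each cluster, week by week. The record layout `(week, cluster, prompt, cited)` and all names below are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def cluster_presence_by_week(records):
    """Aggregate runs into a presence rate per (week, cluster).

    `records` is a list of (week, cluster, prompt, cited) tuples, one per run.
    """
    # Step 1: group the runs of each prompt.
    per_prompt = defaultdict(list)
    for week, cluster, prompt, cited in records:
        per_prompt[(week, cluster, prompt)].append(1.0 if cited else 0.0)

    # Step 2: average each prompt's runs, then group those rates per cluster.
    per_cluster = defaultdict(list)
    for (week, cluster, _prompt), hits in per_prompt.items():
        per_cluster[(week, cluster)].append(mean(hits))

    # Step 3: one value per week and cluster, ready to be read as a trend.
    return {key: mean(rates) for key, rates in per_cluster.items()}

# Hypothetical data: 2 prompts, 4 runs each, over 2 weeks.
records = (
    [(1, "comparison", "p1", c) for c in (True, True, False, False)]
    + [(1, "comparison", "p2", c) for c in (True, True, True, True)]
    + [(2, "comparison", "p1", c) for c in (False, False, False, True)]
    + [(2, "comparison", "p2", c) for c in (True, True, False, False)]
)

for (week, cluster), rate in sorted(cluster_presence_by_week(records).items()):
    print(week, cluster, rate)
```

Averaging per prompt first keeps a prompt that happened to get more runs from weighing more than the others in its cluster.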

The exercise is heavy by hand. 30 prompts × 5 runs on a weekly cadence is 150 tests per week, not counting the extraction of cited brands and the consolidation. And that’s for a single cluster. You also have to add competitor tracking on the same basket to compute a Share of Voice.

Hence the use of dedicated tools or in-house scripts to automate. More on this below.

Action Metric or Vanity Metric: The “Monday Morning” Rule

A KPI has value when its variation triggers a decision. Otherwise, it’s a vanity metric. The rule fits in one question: if this number drops 20% next week, what do you do Monday morning?

If there’s no clear answer, stop tracking the metric. Three concrete examples:

  • Vanity metric: “Our brand is cited 847 times this month by ChatGPT.” Without breakdown by cluster, without intent weighting, this number triggers no action. It decorates a slide.
  • Action metric: “Our Share of Voice on the comparison cluster went from 18% to 12% over four weeks, and competitor X gained 7 points over the same period.” Monday action: audit what changed on the competitor side, identify lost prompts, prioritise pages to refresh.
  • Action metric: “Across 6 tests of a target commercial prompt, my brand was cited only once.” Monday action: isolate the prompt, then check the editorial coverage of the topic and your presence on the sources ChatGPT draws on for this cluster.

The filter is strict, and that’s intentional. Better to track three metrics that drive decisions than fifteen that produce reports. This framing is borrowed from Hi-commerce (2026).
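The “Monday morning” question can be turned into a simple check on the relative drop of a metric between two measurement cycles. This is a sketch under assumptions: the function names are hypothetical, and the 20% threshold is taken from the rule’s wording, not from any standard.

```python
def relative_drop(previous, current):
    """Relative decrease between two measurements (0.25 means a 25% drop)."""
    if previous <= 0:
        raise ValueError("previous must be positive to compute a relative drop")
    return (previous - current) / previous

def needs_monday_action(previous, current, threshold=0.20):
    """True when the metric fell by at least `threshold`, in relative terms."""
    return relative_drop(previous, current) >= threshold

# Share of Voice on a cluster going from 18% to 12%: a relative drop of about 0.33.
print(needs_monday_action(18, 12))  # True
print(needs_monday_action(18, 17))  # False
```

Note that the drop is relative: going from 18% to 12% is a loss of 6 points, but a third of the starting value.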

Measuring in Practice: Manual Test, Scripts, or Dedicated Platform?

You have three ways to put measurement into practice, each with different costs and limits.

The manual test. You open ChatGPT, you type your prompt, you note the cited brands in a spreadsheet. Cost: zero, beyond time. Limit: impractical beyond 10 prompts, impossible to maintain over time, and the manual test is often done on a single account, with a memory that biases the measurement. Useful for a one-off audit, not for tracking.

In-house scripts. You query the OpenAI API in a loop, you parse the cited brands, you store the data. Cost: development time and an API budget. Limit: you measure the API, which doesn’t behave exactly like the ChatGPT interface your real users use. You also lose the web search dimension activated by default on the consumer side.
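The loop of such a script can be sketched as follows. To stay neutral about the client you use, the call to the model is passed in as an `ask` function: this is a hypothetical placeholder, as are the brand names and the canned answer used for the demo. The brand extraction is a naive word match and would need hardening for real brand names.

```python
import re

def extract_brands(answer, known_brands):
    """Return the known brands found in `answer`, ordered by first appearance."""
    found = []
    for brand in known_brands:
        # Case-insensitive whole-word match; brands ending in symbols need a custom pattern.
        match = re.search(r"\b" + re.escape(brand) + r"\b", answer, flags=re.IGNORECASE)
        if match:
            found.append((match.start(), brand))
    return [brand for _, brand in sorted(found)]

def run_basket(prompts, ask, known_brands, repetitions=3):
    """Ask each prompt `repetitions` times and store the cited brands for every run."""
    results = {}
    for prompt in prompts:
        results[prompt] = [
            extract_brands(ask(prompt), known_brands) for _ in range(repetitions)
        ]
    return results

def fake_ask(prompt):
    """Stand-in for a real API call; always returns the same canned answer."""
    return "For this need, Globex is a solid pick; Acme is a cheaper alternative."

results = run_basket(
    ["best invoicing tool for freelancers"],  # hypothetical prompt
    fake_ask,
    known_brands=["Acme", "Globex", "Initech"],
    repetitions=2,
)
print(results)
```

Keeping `ask` as a parameter lets you swap the stand-in for a real client later without touching the parsing and storage logic.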

The dedicated platform. A GEO tool automates the repetitions, manages the basket, computes the five metrics, tracks your competitors and the trend over time. Cost: a monthly subscription. Benefit: the measurement holds without recurring load, and you reinvest the time saved into actions.

I built Cockpyt AI for the third option. The tool measures your ChatGPT citations, your position, your Share of Voice and your competitors, on a prompt basket you define, with no manual testing. And it does the same on Perplexity, Gemini and Claude, which is the other blind spot of manual measurement.

FAQ

How many prompts do you need in a ChatGPT measurement basket?

20 to 50 target prompts are enough for a sector. Beyond that, the maintenance load becomes a drag. Below 15, you lose intent coverage. The rule: cover three to five thematic clusters, with a balanced mix of search intents.

Why do my ChatGPT measurements vary from one week to the next?

Because ChatGPT isn’t deterministic by default. The same question can produce different answers depending on the moment, the model version, the account memory, and the web search results. Variance is a characteristic of the engine, not an error in your method. You reduce it with repetitions and tracking over time.

Should you measure on the ChatGPT interface or on the API?

The interface reflects what your real users experience, including web search and memory. The API gives a more stable measurement but one further removed from user behaviour. Good measurement uses the interface, or a tool that simulates it faithfully. The API alone often underestimates reality.

What’s the difference between Share of Voice and mention rate?

The mention rate measures your presence on the basket, independently of competitors. Share of Voice measures your share of the total citation volume in your sector. The first answers “am I visible”. The second answers “how much room do I take against the others”.

How often should you measure?

A weekly or bi-weekly cadence is enough for most sectors. Measuring more often adds nothing, because ChatGPT visibility doesn’t shift from one day to the next. Measuring less often misses the drops caused by model version changes.

Can you measure ChatGPT visibility without a dedicated tool?

Yes, on a reduced cluster and for a one-off audit. The manual test or the in-house script works. The limit arrives fast: tracking over time, several competitors, several clusters, several engines. At that stage, a dedicated tool costs less than the time it saves.

Sources
RESONEO, study on the ChatGPT 5.3 Instant and 5.4 Thinking variants, 2026. (exact reference to verify before publication)
OpenAI, documentation of the temperature parameter in the API, platform.openai.com, accessed May 2026.
Hi-commerce, “Mesurer sa visibilité dans ChatGPT, Perplexity et Gemini : le guide 2026”, hi-commerce.fr, 2026.

Florian Zorgnotti

I’m Florian Zorgnotti, an SEO consultant based in Nice since 2016. I’ve led 300+ projects, specializing in WordPress, Shopify, and Generative Engine Optimization (GEO) to help brands grow their visibility in search and AI platforms.