Multimodal Signals
Quick facts
- What it is: The signals on non-text assets (images, video, audio, charts) that decide whether AI engines can read, ground, and cite them
- The dominant channel (2026): Text (alt, caption, transcript, schema, surrounding prose), not pixels. Web-retrieval pipelines feeding AI answers are still mostly text-shaped
- The reading-mode split: Index-integrated AI (Google AIO) reuses Google's existing image/video index; live-fetch AI (ChatGPT, Perplexity, Claude with browsing) reads HTML at retrieval time and rarely OCRs or transcribes
- The single highest-leverage video lever: A same-page transcript. Self-hosted video without a transcript is effectively invisible to live-fetch AI
- Speakable schema, honestly: Beta, US-only, English-only, news-only, Google Assistant TTS only as of 2025-12. Not a general 'this content is for AI' signal; verify before relying on it
1. What multimodal signals are
Multimodal signals are the signals attached to non-text content — images, video, audio, charts and tables — that decide whether an AI engine can read, ground, and cite those assets in an answer. They are the multimodal row of Generative Engine Optimization’s signal-family table, and the E-E-A-T §8 promise of “trust-readability for non-text assets” paid out.
The crucial nuance, stated up front: for most current AI engines, “multimodal reading” is partly literal and partly text-channel. Frontier multimodal LLMs (GPT-4V, Gemini, Claude with vision) can see images when handed them (OpenAI GPT-4V system card · Gemini technical report). But the web-retrieval pipelines feeding AI answers still mostly pass text to the model, not pixels. So, as far as the engine is concerned, the alt attribute, the caption, the transcript, and the schema markup are the asset.
Definition (GEO Wiki working definition): multimodal signals are the readable-to-AI signals attached to non-text assets — primarily the text channel (alt, caption, transcript, surrounding prose), the structured-data channel (ImageObject, VideoObject, AudioObject, Dataset), and the provenance channel (C2PA credentials, EXIF, IPTC) — that decide whether an asset is groundable in an AI-generated answer.
This entry walks four asset types, in order: images, video, audio, charts / tables / diagrams. The mechanism that shapes all four is the same — the asymmetry between how index-integrated AI reads non-text assets and how live-fetch AI does. That asymmetry is §3, and it is the entry’s pivot.
2. The four asset types — at a glance
This table is the frame the entry’s mechanism stories hang off. Each row gets its own § below.
| Asset type | Primary text channel | Primary structured-data channel | Where it surfaces in AI answers |
|---|---|---|---|
| Images | alt attribute + surrounding caption / prose | ImageObject (caption, contentUrl, license, creator, embeddedTextCaption) | AIO inline image cards; image-result pivots; thumbnails alongside cited answers |
| Video | Same-page transcript + captions (SRT/VTT) + description | VideoObject (description, transcript, thumbnailUrl, uploadDate, contentUrl, duration) | AIO video carousels; “watch this section” timestamps; YouTube-sourced transcript quotes lifted into answers |
| Audio | Transcript + show notes + episode description | AudioObject (transcript, contentUrl, caption) and PodcastEpisode (audio, partOfSeries, episodeNumber) | AIO podcast cards; Google Assistant TTS (the narrow speakable surface) |
| Charts / tables / diagrams | HTML data table + caption + summary statistics in prose | Dataset (distribution, variableMeasured, measurementTechnique) | Quoted as data tables in answers; almost never as the pixel-rendered chart |
Two structural points worth reading slowly. First: every row’s “primary text channel” is what carries the answer-engine payload. Second: VideoObject has an actual transcript field defined by Schema.org as “the transcript of that object” — the markup explicitly anticipates the text-channel-dominates pattern. The rest of this entry walks each asset type once, after first making the engine-reading-mode asymmetry concrete.
3. The two engine reading modes — index-integrated vs live-fetch
The asymmetry that shapes everything else. The split is the multimodal rotation of Schema.org for AI §5’s index-integrated vs live-fetch reading: same mechanism, applied to non-text assets.
Index-integrated AI (Google AI Overviews, Gemini via Search): reuses Google’s pre-existing image and video index. Image alt has been parsed by Google for over a decade. YouTube auto-CC and uploaded transcripts ride the same infrastructure that fed Google Video Search long before AIO existed. AIO image carousels are sourced from Google Images; the underlying multimodal extraction happened during normal indexing, not at answer time. The AI Overviews entry’s “Core Web Vitals + multimodal signals met” row is a reuse of Google’s existing quality systems, not a separate AIO-specific layer (Google Search Central — AI features and your website).
Live-fetch AI (ChatGPT search, Perplexity, Claude with browsing): reads HTML at retrieval time. It sees the alt attribute and the surrounding prose. It typically does not OCR images, transcribe video, or process audio at retrieval time. The asset itself is invisible unless the text channel describes it.
The crucial caveat: when a user uploads an image, video, or PDF directly into ChatGPT or Perplexity, those products do run vision and OCR on the file. That is the user-uploaded path, not the web-retrieval path. Your image, sitting on your page, is read by the retrieval pipeline in text form — the alt, the caption, the schema. Model capability is not pipeline capability.
| Surface | Image reading | Video reading | Audio reading |
|---|---|---|---|
| Google AI Overviews | Index-time vision + alt + caption + ImageObject; AIO surfaces inline image cards | Index-time + YouTube CC infrastructure + VideoObject; AIO surfaces video carousels with timestamps | Index-time + AudioObject; podcast cards; speakable for narrow news-TTS |
| ChatGPT search | alt + surrounding text; no retrieval-time OCR of fetched pages | Transcript + description; no retrieval-time transcription | Transcript + show notes; no retrieval-time transcription |
| Perplexity | alt + surrounding text; no retrieval-time OCR | Transcript + description | Transcript + show notes |
| Google Gemini | Via Search (index-integrated path) plus native multimodal model | Via Search + YouTube infrastructure | Via Search + AudioObject |
| Genspark | Multimodal-first answer surface (“Sparkpages”) — emerging, less publicly documented | Same | Same |
The load-bearing claim that §4–§7 hang off: the text channel is the dominant signal for non-text assets across both modes — for index-integrated AI because the text was indexed long before AI answers existed, for live-fetch AI because pixel processing is mostly not in the retrieval pipeline. Build for the text channel first; provenance and pixel-level capability matter at the margin.
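To make the text-channel claim concrete, here is a minimal Python sketch of what a text-only retriever might extract from a page. It is an illustration under stated assumptions, not a description of any engine's actual pipeline: the class name `TextChannelView`, the function `text_channel`, the `[image: ...]` formatting, and the sample page are all hypothetical, and the choice to keep JSON-LD while dropping other scripts simply mirrors the "read as page text" framing used in this entry.

```python
from html.parser import HTMLParser


class TextChannelView(HTMLParser):
    """Collect roughly what a text-only retriever could read from a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "style":
            self._skipping = True
        elif tag == "script":
            # Assumption: JSON-LD is kept because it is read as page text;
            # every other script is dropped.
            self._skipping = attrs.get("type") != "application/ld+json"
        elif tag == "img" and attrs.get("alt"):
            self.chunks.append(f"[image: {attrs['alt']}]")
        # <video> and <audio> add nothing here: no pixels, no sound.

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skipping = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skipping:
            self.chunks.append(text)


def text_channel(html):
    """Return the readable text chunks of an HTML string, in document order."""
    view = TextChannelView()
    view.feed(html)
    view.close()
    return view.chunks


# Hypothetical page: one described image, one undescribed image, one video.
page = """
<h1>Trail shoe review</h1>
<img src="shoe.jpg" alt="Red trail running shoe, side view">
<img src="spec-sheet.png">
<video src="review.mp4" controls></video>
<p>Transcript: the outsole grips well on wet rock.</p>
"""
for chunk in text_channel(page):
    print(chunk)
```

In this sketch the image without alt text and the video element leave no trace in the output; only the heading, the described image, and the transcript paragraph survive.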
4. Images — alt, caption, ImageObject, provenance
Images are the asset type with the most signal history and the most well-trodden text channel. Three signal layers, treated in order.
Text channel — the dominant lever. The alt attribute, the surrounding caption, and a descriptive heading near the image are what every AI reading mode picks up. Google’s own image best practices put it directly: “Google uses alt text along with computer vision algorithms and the contents of the page to understand the subject matter of the image” (Google Search Central, updated 2026-03-02). WCAG accessibility guidance and AI extraction heuristics overlap heavily here — what a screen reader needs is what live-fetch AI sees. The W3C Success Criterion 1.1.1 (Non-text Content) requires that “All non-text content that is presented to the user has a text alternative that serves the equivalent purpose” (W3C WAI).
Structured-data channel — ImageObject carries caption, contentUrl, license, creator, embeddedTextCaption, exifData, and representativeOfPage (schema.org/ImageObject). For products, Product.image; for articles, Article.image. For the broader markup discipline, see Schema.org for AI; this entry only states which fields carry the multimodal payload.
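As an illustration of the structured-data channel, the following Python sketch assembles a small ImageObject block as JSON-LD. The property names come from the Schema.org list above; the helper name `image_object_jsonld`, its parameters, and the example URLs are hypothetical, and the field selection is one reasonable minimum rather than a required set.

```python
import json


def image_object_jsonld(content_url, caption, creator_name, license_url):
    """Build a minimal ImageObject wrapped in a JSON-LD <script> element."""
    data = {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "caption": caption,
        "creator": {"@type": "Person", "name": creator_name},
        "license": license_url,
    }
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a caption can never close the <script> element early.
    # "\/" is a valid JSON escape, so parsers still read the original text.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical values for illustration only.
print(image_object_jsonld(
    "https://example.com/img/red-trail-shoe.jpg",
    "Red trail running shoe, side view",
    "Jane Doe",
    "https://example.com/image-license",
))
```

Keeping the caption here identical to the visible caption or alt text means both reading modes encounter the same description, whether the block is parsed as structured data or read as page text.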
Provenance channel — C2PA content credentials are a cross-vendor cryptographic attestation of an image’s origin and edit history. EXIF camera metadata and IPTC photo credits carry photographer, copyright, and source. As of 2025-11, the IPTC Photo Metadata Standard v2025.1 added an explicit “AI System Used” property that identifies the model that generated an image (with ChatGPT, DALL-E, and Google Gemini as named examples) (IPTC). Provenance signals are an emerging trust-filter input that parallels what E-E-A-T describes for text authorship.
| Signal | Index-integrated read | Live-fetch read |
|---|---|---|
| alt attribute | Both; weighed for image-card relevance | Yes; the primary text channel |
| Surrounding caption / prose | Both | Yes |
| ImageObject JSON-LD | Parsed by AIO; treated as structured data | Read as page text (per Schema.org for AI §5) |
| C2PA / EXIF / IPTC | AIO can verify via index | Usually not fetched; the page’s HTML doesn’t surface it |
For live-fetch AI, the image itself is invisible — only its text channel exists. The two highest-volume vertical applications are e-commerce product images (alt = product name + key variant) and editorial photography (alt = scene + subject + context), but the discipline is general.
5. Video — transcript, captions, VideoObject, the hosting effect
Video is the asset type where hosting choice dominates the signal landscape. The same content uploaded to YouTube and the same content self-hosted with no transcript have radically different AI-readability profiles, regardless of how the markup is shaped.
Text channel — same-page transcript (the single highest-leverage multimodal lever for most sites), captions in SRT/VTT, video description, video title. A transcript is the only form in which the spoken content enters the live-fetch retrieval pipeline at all.
Structured-data channel — VideoObject has an explicit transcript field, defined verbatim by Schema.org as “the transcript of that object” (schema.org/VideoObject). Other key fields: description, thumbnailUrl, uploadDate, contentUrl, embedUrl, duration (ISO 8601). Per Schema.org for AI, this is parsed by AIO and read as page text by live-fetch.
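The fields above can be sketched in Python as follows. This is a hedged example, not a canonical template: the helper names `iso8601_duration` and `video_object` and all example values are hypothetical. The duration helper covers only hours, minutes, and seconds, which is a subset of what ISO 8601 allows.

```python
import json


def iso8601_duration(total_seconds):
    """Format whole seconds as an ISO 8601 duration such as PT12M34S."""
    hours, rest = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rest, 60)
    parts = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    return "PT" + (parts or "0S")


def video_object(name, description, content_url, thumbnail_url,
                 upload_date, length_seconds, transcript):
    """Return a VideoObject dict that carries the transcript as text."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "contentUrl": content_url,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "duration": iso8601_duration(length_seconds),
        "transcript": transcript,
    }


# Hypothetical values for illustration only.
markup = video_object(
    name="How to re-lace trail shoes",
    description="A short walkthrough of heel-lock lacing.",
    content_url="https://example.com/video/lacing.mp4",
    thumbnail_url="https://example.com/video/lacing.jpg",
    upload_date="2026-01-15",
    length_seconds=754,
    transcript="Start by threading the lace through the top eyelet...",
)
print(json.dumps(markup, indent=2))
```

The transcript in the markup complements the same-page transcript rather than replacing it; the visible text is what every reading mode can rely on.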
The hosting effect — YouTube and Vimeo auto-generate captions and transcripts that Google’s index already consumes. Google’s own video SEO best practices recommend “Create a dedicated watch page for each video” (Google Search Central, updated 2025-12-18). Self-hosted video without a transcript is largely invisible to both reading modes for content purposes — the page has a <video> tag, and nothing else.
| Hosting choice | Transcript availability | AI readability |
|---|---|---|
| YouTube / Vimeo embed | Auto-CC + creator-uploaded options | High — AIO sources transcripts directly; live-fetch reads the embed’s surrounding HTML + the platform-hosted transcript when reachable |
| Self-hosted with same-page transcript | Manually authored + same-page HTML | High — both modes read the transcript text |
| Self-hosted with VTT/SRT only | Sidecar file, no same-page text | Moderate — index-integrated reads the sidecar; many live-fetch retrievers don’t fetch it |
| Self-hosted without transcript | None | Effectively invisible for content purposes |
One sub-discipline worth naming: captions burned into video pixels are unreadable to any text-channel reader; only file-based captions work. Same logic as “pictures of text” for images — pixels are not text. For media-heavy sites — publishers, ed-tech, video-first content businesses — hosting choice is usually a larger lever than markup completeness.
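For sites that already have file-based captions, one way to move from the "sidecar only" row to the "same-page transcript" row is to flatten the caption file into plain text and publish it in the page HTML. The Python sketch below assumes a simple WebVTT file; `vtt_to_transcript` is a hypothetical helper, it ignores cue settings and speaker labels, and real caption files may need more careful handling.

```python
import re

# A cue timing line looks like "00:00:01.000 --> 00:00:04.000" (hours optional).
CUE_TIMING = re.compile(r"^\s*(?:\d+:)?\d{2}:\d{2}\.\d{3}\s+-->\s+")
# Inline cue tags such as <v Host> or <i> are removed from the text.
INLINE_TAG = re.compile(r"<[^>]+>")


def vtt_to_transcript(vtt_text):
    """Return the spoken text of a WebVTT file as one plain-text string."""
    spoken = []
    normalized = vtt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        rows = block.split("\n")
        # Blocks without a timing line (header, NOTE, STYLE) are skipped.
        for index, row in enumerate(rows):
            if CUE_TIMING.match(row):
                for text in rows[index + 1:]:
                    text = INLINE_TAG.sub("", text).strip()
                    if text:
                        spoken.append(text)
                break
    return " ".join(spoken)


# Hypothetical caption file for illustration only.
sample = """WEBVTT

NOTE demo file

1
00:00:01.000 --> 00:00:04.000
<v Host>Welcome to the show.

00:05.000 --> 00:09.000
Today: <i>transcripts</i>
and why they matter.
"""
print(vtt_to_transcript(sample))
```

The resulting string still needs human review and paragraphing before it goes on the page; auto-generated captions in particular tend to carry recognition errors.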
6. Audio and speakable schema — narrower than it sounds
This section corrects the loudest multimodal myth — that speakable schema makes content “voice-AI-ready” in the general sense most users mean. It does not.
The actual speakable scope, stated forthrightly per Google’s own current documentation, last updated 2025-12-10: speakable structured data is in beta and feeds “users in the U.S. that have Google Home devices set to English, and publishers that publish content in English” (Google Search Central — Speakable). It has been in beta with these same US-only, English-only, news-only restrictions for years; the 2025-12-10 update did not relax any of them. Its target surface is Google Assistant TTS playback — not a general “this content is for AI” signal.
The general audio AI strategy — the part most readers actually need:
- Transcript is the same lever as video, with the same dominance. A podcast without a transcript is invisible to live-fetch AI for content purposes; the spoken audio never enters the text channel.
- Show notes and episode descriptions act as the search-engine-facing summary; live-fetch AI reads these first.
- AudioObject carries transcript, contentUrl, caption, encodingFormat, and duration (schema.org/AudioObject). PodcastEpisode carries partOfSeries, episodeNumber, duration, datePublished, and an embedded audio (an AudioObject) (schema.org/PodcastEpisode). Note that the transcript attaches via the embedded AudioObject, not directly on PodcastEpisode itself.
Speakable is for TTS playback; transcripts are for AI readability. Do not conflate them, and do not let a vendor or platform sell speakable markup as a general AI-readability lever — the marketing routinely overstates what the spec actually does.
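The nesting described in the list above (transcript on the embedded AudioObject, not on the episode) can be sketched in Python like this. The function name `podcast_episode`, its parameters, the default `audio/mpeg` encoding format, and the example values are assumptions made for illustration.

```python
import json


def podcast_episode(series_name, episode_number, title, date_published,
                    audio_url, transcript, encoding_format="audio/mpeg"):
    """Return a PodcastEpisode dict with the transcript on its AudioObject."""
    return {
        "@context": "https://schema.org",
        "@type": "PodcastEpisode",
        "name": title,
        "episodeNumber": episode_number,
        "datePublished": date_published,
        "partOfSeries": {"@type": "PodcastSeries", "name": series_name},
        # The transcript lives on the embedded AudioObject,
        # not on the PodcastEpisode itself.
        "audio": {
            "@type": "AudioObject",
            "contentUrl": audio_url,
            "encodingFormat": encoding_format,
            "transcript": transcript,
        },
    }


# Hypothetical values for illustration only.
episode = podcast_episode(
    series_name="Trail Talk",
    episode_number=12,
    title="Choosing a trail shoe",
    date_published="2026-02-01",
    audio_url="https://example.com/audio/ep12.mp3",
    transcript="Welcome back to Trail Talk. Today we compare outsoles...",
)
print(json.dumps(episode, indent=2))
```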
7. Charts, tables, and diagrams — data as text
The shortest of the asset sections — the discipline is small and well-known — but high-leverage for analytical content where the data is the point.
The headline claim: extraction heuristics read HTML data tables, not pixel-rendered charts. A bar chart rendered as a PNG is unreadable to live-fetch AI; a bar chart accompanied by its underlying data table is fully readable. Two patterns carry almost all of the discipline:
- Chart + data-table fallback: render the chart visually for humans, and include the underlying numbers as an HTML <table> (or as plain text) so the text channel carries the information.
- Caption + summary statistics: a one-paragraph caption stating the headline number, the source, and the time period in plain text, citable independently of whether the chart is read at all.
```html
<figure>
  <!-- Render for humans -->
  <img src="/charts/q1-revenue.png" alt="Quarterly revenue trend, Q1 2024 through Q1 2026, in $M">
  <!-- Citable for AI: the headline number as plain text -->
  <figcaption>Q1 revenue grew from $12M (Q1 2024) to $19M (Q1 2026), a 58% increase.</figcaption>
</figure>
<!-- Citable for AI: the data table fallback -->
<table>
  <thead><tr><th>Quarter</th><th>Revenue ($M)</th></tr></thead>
  <tbody>
    <tr><td>Q1 2024</td><td>12</td></tr>
    <tr><td>Q1 2025</td><td>15</td></tr>
    <tr><td>Q1 2026</td><td>19</td></tr>
  </tbody>
</table>
```
Schema.org has Dataset (schema.org/Dataset) for full published datasets and Table semantics for tabular content, but the bar to clear here is the HTML data table itself; markup is icing. Data tables are also among the highest-citability content shapes per Citability — the same passages that work for chart readability work for direct quotation into AI answers.
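When the chart is generated from data in the first place, the table fallback and the headline statistic can be produced from the same rows so they never drift apart. The Python sketch below shows one way to do that; `percent_change` and `data_table_html` are hypothetical helpers, and the figures are the illustrative revenue numbers from the example above.

```python
from html import escape


def percent_change(first, last):
    """Percentage change from first to last, rounded to a whole percent."""
    return round((last - first) / first * 100)


def data_table_html(caption, headers, rows):
    """Render rows as a plain HTML table with a text caption."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        "<table>\n"
        f"<caption>{escape(caption)}</caption>\n"
        f"<thead><tr>{head}</tr></thead>\n"
        f"<tbody>\n{body}\n</tbody>\n"
        "</table>"
    )


rows = [("Q1 2024", 12), ("Q1 2025", 15), ("Q1 2026", 19)]
growth = percent_change(rows[0][1], rows[-1][1])
caption = (
    f"Q1 revenue grew from ${rows[0][1]}M ({rows[0][0]}) "
    f"to ${rows[-1][1]}M ({rows[-1][0]}), a {growth}% increase."
)
print(data_table_html(caption, ["Quarter", "Revenue ($M)"], rows))
```

Deriving the caption from the rows keeps the quoted percentage consistent with the table a reader or retriever can check it against.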
8. Trust and provenance — the E-E-A-T half for non-text assets
This § is the E-E-A-T §8 promise of “trust-readability of non-text assets” paid out, and the route from Schema.org for AI §8 on ImageObject/VideoObject provenance.
Trust filters apply to non-text assets too. An AI-generated stock-photo flood with no provenance, a video with a fabricated author byline, a chart whose underlying numbers cannot be sourced — all trip the same AI-at-scale trust filters that AI Content Detection covers for text. The mechanism is identical; only the asset type changes.
| Asset | Provenance signals | Maturity (as of 2026-05) |
|---|---|---|
| Image | C2PA content credentials; EXIF camera metadata; IPTC photo credit + new “AI System Used” field (v2025.1); creator in ImageObject | C2PA adoption spreading (steering-committee members include Adobe, Microsoft, BBC, OpenAI, Sony; general members include NYT, Nikon, Canon — C2PA Membership); EXIF/IPTC mature; AI-generation field new in IPTC v2025.1 (IPTC, 2025-11-27) |
| Video | creator/publisher in VideoObject; platform channel verification (YouTube); upload-date consistency; SynthID watermarking on AI-generated video (Google DeepMind) | Mature where YouTube-hosted; SynthID active for Google-generated AI video |
| Audio | creator in AudioObject; host-platform verification; SynthID watermarking on AI-generated audio | Moderate; SynthID active for Google-generated AI audio |
| Charts / data | Cited data source; methodology link; downloadable raw data; provenance of the underlying numbers | Fully mature — this is just citation discipline |
SynthID, Google DeepMind’s watermarking technology, covers all four modalities — image, video, audio, and text. Per the canonical page: “The watermarks are embedded across Google’s generative AI consumer products, and are imperceptible to humans – but can be detected by SynthID’s technology” (Google DeepMind). The canonical page does not publish specific detection-accuracy numbers; treat it as a direction-credible, coefficient-unmeasured signal — the same posture AI Content Detection §6 recommends for watermarking generally.
The honest bound: image-provenance ecosystems (C2PA, SynthID, IPTC’s AI-generation field) are real and spreading but not yet a confirmed citation gate at any major AI engine as of 2026-05. Direction is credible; the coefficient is unmeasured.
9. What the evidence says — and what it does not
Direction-not-coefficient framing, the same shape Multilingual GEO §7 and Entity Recognition §6 use for their respective claims. The mechanism direction is well-attested; specific citation-rate lifts for specific markup choices are not.
| What holds | The bounded reading |
|---|---|
| Google explicitly recommends multimodal hygiene for AI Search. “Support your textual content with high-quality images and videos on your pages” is one of 8 official recommendations (Google Search Central, 2025-05 · Search Engine Land coverage) | A direction signal from the vendor with the most-cited AI surface, not a measurement. The lift magnitude is not rigorously published. Practitioner coverage notes Google offered “limited actionable detail” |
| AIO inline image cards and video carousels are publicly observable — every AIO answer for a product, recipe, how-to, or visual-research query surfaces them | Structural fact, not a measurement. Which specific image is chosen for the card is not publicly documented; do not reverse-engineer “ranking factors” from the carousel placement |
| YouTube transcripts demonstrably reach Google’s index — direct quotes lifted from YouTube auto-CC have been observed in AI Overviews answers | Practitioner observation, not a rigorous benchmark. The direction (YouTube transcripts feed AIO) is high-confidence; the citation rate for a specific channel or video is not a known coefficient |
| Multimodal LLMs can describe images when handed them (GPT-4V system card · Gemini technical report) | Model capability, not pipeline capability. Web-retrieval pipelines feeding AI answers typically still pass text to the model, not pixels. The claim that “AI search engines see my images” is over-extension |
| C2PA / SynthID provenance ecosystems are real and growing — Adobe, Microsoft, BBC, OpenAI, Sony on C2PA’s steering committee; NYT, Nikon, Canon as general members (C2PA Membership); SynthID embedded in Google’s generative consumer products | Adoption is verified; effect on AI-citation behavior is not. Not yet a confirmed citation gate at any major AI engine |
| The text channel dominates for both reading modes. This is observable in every live-fetch AI engine: the asset’s alt, surrounding text, and transcript are what come back in citations | Snapshot of the 2026-05 state. The claim will weaken over time as multimodal-native retrievers ship more widely; the entry’s lastUpdated and nextReviewDue carry that signal |
The honest gap, stated plainly: as of 2026-05, no published rigorous benchmark measures the citation-rate lift from any specific multimodal markup choice, whether alt-text quality, ImageObject completeness, transcript presence, or C2PA attestation. Direction is credible across all four; magnitude is not. Anyone who quotes a precise number like “images lift AI citation rates by N%” is over-claiming. Read the direction, build for the text channel, and resist importing unverified coefficients into investment decisions.
10. Anti-patterns — multimodal misreads
The errors this entry exists to prevent, in the same shape as the Citability §6 and Multilingual GEO §8 anti-pattern tables.
| Misread | Why it looks right | Why it’s wrong |
|---|---|---|
| “Alt-text stuffing helps AI find my images” | Looks like the extension of keyword stuffing into a new surface | Google explicitly warns that keyword stuffing alt attributes “results in a negative user experience and may cause your site to be seen as spam” (Google Images best practices). AI quality systems detect and down-weight the same pattern; see AI Content Detection |
| “Pictures of text”: rendering body text inside an image | Looks designerly; gives total typographic control | Text rendered as image pixels is invisible to all text-channel readers, and OCR is not standard in live-fetch retrieval pipelines. The text exists only to a reader who happens to be a vision-enabled model handed the image directly, not to the retrieval pipeline |
| “Self-hosted video without a transcript is fine: the audio speaks for itself” | Looks self-evident, since humans can hear it | The spoken content never enters the text channel; live-fetch AI sees a <video> tag and nothing else. The single highest-impact fix on a media-heavy site is adding a same-page transcript |
| “Speakable schema makes my content voice-AI-ready” | Looks like the obvious markup choice for AI/voice surfaces | Speakable is beta, US-only, English-only, news-only, Google-Assistant-TTS-only as of 2025-12-10 (Google Search Central: Speakable). It has not expanded in years. It is not a general AI-readability signal |
| “Bar chart as image is enough: humans can read it” | Looks sufficient; the chart is right there | Extraction heuristics read HTML data tables, not pixel-rendered charts. Live-fetch AI sees a <figure> and an alt text, and that is it. The numbers in the chart never enter the answer pipeline unless you publish them as text |
| “AI-generated stock photos at scale, no provenance” | Looks like cheap visual coverage | The same AI-at-scale pattern AI Content Detection covers for text applies to images. C2PA, IPTC v2025.1’s “AI System Used” field, and SynthID watermarks are increasingly the trust signal AI engines look for; an unattested AI-image flood reads as the multimodal version of mass-generated content |
| “My image was passed to GPT-4o, so AI search engines must see it too” | Looks transitive (same vendor’s model) | Model capability is not retrieval-pipeline capability. The user-uploaded path runs vision; the web-retrieval path mostly passes text. Your image on your page is read as alt and caption, not as pixels |
The failure mode is rarely “we forgot the markup.” It is treating multimodal as a vision problem when in 2026 it is still mostly a text-channel problem.
11. Why this matters for GEO + how to act
Multimodal GEO is not a separate AI-specific discipline. It is citability and E-E-A-T applied to non-text assets, with the engine-reading-mode asymmetry from §3 as the filter for which lever pays off in which channel. The levers are not new disciplines; they are the same disciplines retooled per asset type.
| Your intent | First stop |
|---|---|
| Implement image, video, audio, or chart markup correctly | Schema Implementation playbook |
| Decide which markup format (JSON-LD, RDFa, Microdata) | JSON-LD |
| Audit a site’s multimodal layer end-to-end | Full GEO Audit playbook |
| Make a page’s text channel actually extract | Citability playbook · Citability concept |
| Calibrate trust signals for non-text assets | E-E-A-T · AI Content Detection for the AI-at-scale anti-pattern |
| Understand the schema vocabulary | Schema.org for AI |
| Bind asset authorship to a creator entity | Entity Recognition · Knowledge Graph Presence |
| See where this sits in the loop | Answer Loop |
| The method that ties it together | Generative Engine Optimization |
The practical reading: audit the text channel for every non-text asset before touching markup or provenance. Most teams discover the dominant lever is not a missing ImageObject block — it is missing transcripts on video, missing alt on content images, charts published as PNGs without underlying data, or AI-generated images at scale without provenance. Fix the text channel first; provenance and pixel-level capability matter at the margin.
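A first pass at that audit can be automated. The Python sketch below flags two of the gaps named above: images with no alt attribute and media elements with no caption track. It is a rough screen under stated assumptions: `TextChannelAudit` and `audit_text_channel` are hypothetical names, an empty alt is treated as an intentional decorative image, and a missing `<track>` only marks a candidate for manual review, since the check cannot tell whether a same-page transcript exists elsewhere in the prose.

```python
from html.parser import HTMLParser


class TextChannelAudit(HTMLParser):
    """Flag non-text assets whose text channel looks empty."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = []
        self.media_without_track = []
        self._open_media = None  # [src, has_track] for the current media tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            # alt="" is left alone: assumed to mark a decorative image.
            self.images_missing_alt.append(attrs.get("src", "(no src)"))
        elif tag in ("video", "audio"):
            self._open_media = [attrs.get("src", "(nested source)"), False]
        elif tag == "track" and self._open_media:
            self._open_media[1] = True

    def handle_endtag(self, tag):
        if tag in ("video", "audio") and self._open_media:
            src, has_track = self._open_media
            if not has_track:
                self.media_without_track.append(src)
            self._open_media = None


def audit_text_channel(html):
    """Return a dict listing assets that need a text-channel review."""
    audit = TextChannelAudit()
    audit.feed(html)
    audit.close()
    return {
        "images_missing_alt": audit.images_missing_alt,
        "media_without_track": audit.media_without_track,
    }


# Hypothetical page fragment for illustration only.
page = (
    '<img src="hero.jpg">'
    '<img src="divider.png" alt="">'
    '<img src="shoe.jpg" alt="Red trail running shoe">'
    '<video src="demo.mp4"></video>'
    '<video src="talk.mp4"><track src="talk.vtt" kind="captions"></video>'
)
print(audit_text_channel(page))
```

Charts published as images and AI-generated images without provenance cannot be detected this way; those two items in the list above still need a manual pass.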
For the term itself and its neighbors, see the GEO glossary.
References
Official / standards:
- Schema.org — ImageObject · VideoObject · AudioObject · PodcastEpisode · Dataset
- Google Search Central — Google Images best practices (updated 2026-03-02) · Video SEO best practices (updated 2025-12-18) · Speakable (SpeakableSpecification) structured data (updated 2025-12-10, still beta) · AI features and your website · Top ways to ensure your content performs well in Google’s AI experiences on Search
- W3C WAI — Understanding Success Criterion 1.1.1: Non-text Content
- IPTC — Photo Metadata Standard (v2025.1, 2025-11-27 — added “AI System Used” property)
- C2PA — Coalition for Content Provenance and Authenticity · Membership · Specifications
- Google DeepMind — SynthID (image / video / audio / text watermarking)
Vendor / technical:
- OpenAI — GPT-4V(ision) system card (2023-09-25)
- Google DeepMind — Gemini: A Family of Highly Capable Multimodal Models (arXiv:2312.11805, 2023-12-19) · Introducing Gemini (2023-12-06) — “designed Gemini to be natively multimodal, pre-trained from the start on different modalities”
Industry:
- Search Engine Land — Goodwin, D. (2025-05-21). Google shares 8 ways to be successful with AI Search experiences
Frequently asked questions
What are multimodal signals in GEO?
Doesn't GPT-4V / Gemini see images? Why does the text channel still matter?
What's the single highest-leverage multimodal lever?
Does speakable schema make my content voice-AI-ready?
Do AI search engines actually rank pages with images higher?
Sources
Primary
- ImageObject — Schema.org · Schema.org
- VideoObject — Schema.org · Schema.org
- AudioObject — Schema.org · Schema.org
- PodcastEpisode — Schema.org · Schema.org
- Dataset — Schema.org · Schema.org
- Google Images best practices · Google Search Central · 2026-03-02
- Video SEO best practices · Google Search Central · 2025-12-18
- Speakable (SpeakableSpecification) structured data · Google Search Central · 2025-12-10
- AI features and your website · Google Search Central · 2025-12-10
- Top ways to ensure your content performs well in Google's AI experiences on Search · Google Search Central · 2025-05-21
- Understanding Success Criterion 1.1.1: Non-text Content · W3C Web Accessibility Initiative
- Coalition for Content Provenance and Authenticity (C2PA) · C2PA
- C2PA Membership · C2PA
- C2PA Specifications · C2PA
- SynthID — identifying AI-generated content · Google DeepMind
- IPTC Photo Metadata Standard (v2025.1) · International Press Telecommunications Council · 2025-11-27
- GPT-4V(ision) system card · OpenAI · 2023-09-25
- Gemini: A Family of Highly Capable Multimodal Models · Google DeepMind / arXiv · 2023-12-19
- Introducing Gemini: our largest and most capable AI model · Google · 2023-12-06
Secondary
- Google shares 8 ways to be successful with AI Search experiences · Search Engine Land (Danny Goodwin)