Multimodal Signals

Quick facts

What it is: The signals on non-text assets (images, video, audio, charts) that decide whether AI engines can read, ground, and cite them
The dominant channel (2026): Text — alt, caption, transcript, schema, surrounding prose — not pixels. Web-retrieval pipelines feeding AI answers are still mostly text-shaped
The reading-mode split: Index-integrated AI (Google AIO) reuses Google's existing image/video index; live-fetch AI (ChatGPT, Perplexity, Claude with browsing) reads HTML at retrieval time and rarely OCRs or transcribes
The single highest-leverage video lever: A same-page transcript. Self-hosted video without a transcript is effectively invisible to live-fetch AI
Speakable schema, honestly: Beta, US-only, English-only, news-only, Google Assistant TTS only as of 2025-12. Not a general 'this content is for AI' signal — verify before relying on it

1. What multimodal signals are

Multimodal signals are the signals attached to non-text content (images, video, audio, charts and tables) that decide whether an AI engine can read, ground, and cite those assets in an answer. They fill the multimodal row of Generative Engine Optimization’s signal-family table and deliver on the E-E-A-T §8 promise of “trust-readability for non-text assets”.

The crucial nuance, stated up front: for most current AI engines, “multimodal reading” is partly literal and partly text-channel. Frontier multimodal LLMs (GPT-4V, Gemini, Claude with vision) can see images when handed them (OpenAI GPT-4V system card · Gemini technical report). But the web-retrieval pipelines feeding AI answers still mostly pass text to the model, not pixels. So to the engine, the alt attribute, the caption, the transcript, and the schema markup are the asset.

Definition (GEO Wiki working definition): multimodal signals are the readable-to-AI signals attached to non-text assets — primarily the text channel (alt, caption, transcript, surrounding prose), the structured-data channel (ImageObject, VideoObject, AudioObject, Dataset), and the provenance channel (C2PA credentials, EXIF, IPTC) — that decide whether an asset is groundable in an AI-generated answer.

This entry walks four asset types, in order: images, video, audio, charts / tables / diagrams. The mechanism that shapes all four is the same — the asymmetry between how index-integrated AI reads non-text assets and how live-fetch AI does. That asymmetry is §3, and it is the entry’s pivot.

2. The four asset types — at a glance

This table is the frame the rest of the entry hangs from. Each row gets its own section below (§4 through §7).

Asset type	Primary text channel	Primary structured-data channel	Where it surfaces in AI answers
Images	`alt` attribute + surrounding caption / prose	`ImageObject` (`caption`, `contentUrl`, `license`, `creator`, `embeddedTextCaption`)	AIO inline image cards; image-result pivots; thumbnails alongside cited answers
Video	Same-page transcript + captions (SRT/VTT) + description	`VideoObject` (`description`, `transcript`, `thumbnailUrl`, `uploadDate`, `contentUrl`, `duration`)	AIO video carousels; “watch this section” timestamps; YouTube-sourced transcript quotes lifted into answers
Audio	Transcript + show notes + episode description	`AudioObject` (`transcript`, `contentUrl`, `caption`) and `PodcastEpisode` (`audio`, `partOfSeries`, `episodeNumber`)	AIO podcast cards; Google Assistant TTS (the narrow speakable surface)
Charts / tables / diagrams	HTML data table + caption + summary statistics in prose	`Dataset` (`distribution`, `variableMeasured`, `measurementTechnique`)	Quoted as data tables in answers; almost never as the pixel-rendered chart

Two structural points worth reading slowly. First: every row’s “primary text channel” is what carries the answer-engine payload. Second: VideoObject has an actual transcript field defined by Schema.org as “the transcript of that object” — the markup explicitly anticipates the text-channel-dominates pattern. The rest of this entry walks each asset type once, after first making the engine-reading-mode asymmetry concrete.

3. The two engine reading modes — index-integrated vs live-fetch

This asymmetry shapes everything else. It is the multimodal counterpart of the index-integrated vs live-fetch split described in Schema.org for AI §5: same mechanism, applied to non-text assets.

Index-integrated AI (Google AI Overviews, Gemini via Search): reuses Google’s pre-existing image and video index. Image alt has been parsed by Google for over a decade. YouTube auto-CC and uploaded transcripts ride the same infrastructure that fed Google Video Search long before AIO existed. AIO image carousels are sourced from Google Images; the underlying multimodal extraction happened during normal indexing, not at answer time. The AI Overviews entry’s “Core Web Vitals + multimodal signals met” row is a reuse of Google’s existing quality systems, not a separate AIO-specific layer (Google Search Central — AI features and your website).

Live-fetch AI (ChatGPT search, Perplexity, Claude with browsing): reads HTML at retrieval time. It sees the alt attribute and the surrounding prose. It typically does not OCR images, transcribe video, or process audio at retrieval time. The asset itself is effectively invisible unless the text channel describes it.

The crucial caveat: when a user uploads an image, video, or PDF directly into ChatGPT or Perplexity, those products do run vision and OCR on the file. That is the user-uploaded path, not the web-retrieval path. Your image, sitting on your page, is read by the retrieval pipeline in text form — the alt, the caption, the schema. Model capability is not pipeline capability.

Surface	Image reading	Video reading	Audio reading
Google AI Overviews	Index-time vision + alt + caption + ImageObject; AIO surfaces inline image cards	Index-time + YouTube CC infrastructure + VideoObject; AIO surfaces video carousels with timestamps	Index-time + AudioObject; podcast cards; speakable for narrow news-TTS
ChatGPT search	`alt` + surrounding text; no retrieval-time OCR of fetched pages	Transcript + description; no retrieval-time transcription	Transcript + show notes; no retrieval-time transcription
Perplexity	`alt` + surrounding text; no retrieval-time OCR	Transcript + description	Transcript + show notes
Google Gemini	Via Search (index-integrated path) plus native multimodal model	Via Search + YouTube infrastructure	Via Search + AudioObject
Genspark	Multimodal-first answer surface (“Sparkpages”) — emerging, less publicly documented	Same	Same

The load-bearing claim that §4–§7 hang off: the text channel is the dominant signal for non-text assets across both modes — for index-integrated AI because the text was indexed long before AI answers existed, for live-fetch AI because pixel processing is mostly not in the retrieval pipeline. Build for the text channel first; provenance and pixel-level capability matter at the margin.

4. Images — alt, caption, ImageObject, provenance

Images are the asset type with the most signal history and the most well-trodden text channel. Three signal layers, treated in order.

Text channel — the dominant lever. The alt attribute, the surrounding caption, and a descriptive heading near the image are what every AI reading mode picks up. Google’s own image best practices put it directly: “Google uses alt text along with computer vision algorithms and the contents of the page to understand the subject matter of the image” (Google Search Central, updated 2026-03-02). WCAG accessibility guidance and AI extraction heuristics overlap heavily here — what a screen reader needs is what live-fetch AI sees. The W3C Success Criterion 1.1.1 (Non-text Content) requires that “All non-text content that is presented to the user has a text alternative that serves the equivalent purpose” (W3C WAI).

Structured-data channel — ImageObject carries caption, contentUrl, license, creator, embeddedTextCaption, exifData, and representativeOfPage (schema.org/ImageObject). For products, Product.image; for articles, Article.image. For the broader markup discipline, see Schema.org for AI; this entry only states which fields carry the multimodal payload.
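
A minimal sketch of the text and structured-data layers working together, assuming a hypothetical product photo hosted at example.com. The URLs, the photographer name, and the license page are placeholders rather than values from any real site, and the ImageObject fields are limited to ones this section names.

```html
<!-- Text channel: alt plus a visible caption, both plain HTML that a live-fetch reader sees -->
<figure>
  <img src="https://example.com/img/trail-shoe-blue.jpg"
       alt="Blue trail-running shoe, side view, resting on a granite slab">
  <figcaption>The blue colorway, photographed outdoors. Photo: Jane Doe.</figcaption>
</figure>

<!-- Structured-data channel: the same facts as an ImageObject (all values hypothetical) -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/img/trail-shoe-blue.jpg",
  "caption": "The blue colorway, photographed outdoors.",
  "license": "https://example.com/image-license",
  "creator": { "@type": "Person", "name": "Jane Doe" },
  "representativeOfPage": true
}
</script>
```

The visible alt and caption carry the payload for live-fetch readers; the JSON-LD restates the same facts in a form index-integrated systems can parse as structured data.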

Provenance channel — C2PA content credentials are a cross-vendor cryptographic attestation of an image’s origin and edit history. EXIF camera metadata and IPTC photo credits carry photographer, copyright, and source. As of 2025-11, the IPTC Photo Metadata Standard v2025.1 added an explicit “AI System Used” property that identifies the model that generated an image (with ChatGPT, DALL-E, and Google Gemini as named examples) (IPTC). Provenance signals are an emerging trust-filter input that parallels what E-E-A-T describes for text authorship.

Signal	Index-integrated read	Live-fetch read
`alt` attribute	Yes; weighed for image-card relevance	Yes; the primary text channel
Surrounding caption / prose	Yes	Yes
`ImageObject` JSON-LD	Parsed by AIO; treated as structured data	Read as page text (per Schema.org for AI §5)
C2PA / EXIF / IPTC	AIO can verify via index	Usually not fetched; the page’s HTML doesn’t surface it

For live-fetch AI, the image itself is invisible — only its text channel exists. The two highest-volume vertical applications are e-commerce product images (alt = product name + key variant) and editorial photography (alt = scene + subject + context), but the discipline is general.
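
The two alt patterns named above might look like this in practice. The product, file paths, and scene are invented for illustration.

```html
<!-- E-commerce: alt = product name + key variant -->
<img src="/img/kettle-matte-black.jpg"
     alt="Example Co. electric kettle, 1.7 L, matte black">

<!-- Editorial: alt = scene + subject + context -->
<img src="/img/harbor-march.jpg"
     alt="Dock workers marching along the harbor front at dusk during a strike">

<!-- Counter-example, shown commented out: the keyword-stuffed alt to avoid -->
<!-- <img src="/img/kettle.jpg" alt="kettle best kettle buy kettle cheap electric kettle"> -->
```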

5. Video — transcript, captions, VideoObject, the hosting effect

Video is the asset type where hosting choice dominates the signal landscape. The same content uploaded to YouTube and the same content self-hosted with no transcript have radically different AI-readability profiles, regardless of how the markup is shaped.

Text channel — same-page transcript (the single highest-leverage multimodal lever for most sites), captions in SRT/VTT, video description, video title. A transcript is the only form in which the spoken content enters the live-fetch retrieval pipeline at all.

Structured-data channel — VideoObject has an explicit transcript field, defined verbatim by Schema.org as “the transcript of that object” (schema.org/VideoObject). Other key fields: description, thumbnailUrl, uploadDate, contentUrl, embedUrl, duration (ISO 8601). Per Schema.org for AI, this is parsed by AIO and read as page text by live-fetch.
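
As a sketch of the fields listed above, assuming a hypothetical self-hosted how-to video. Every URL, the date, and the transcript text are placeholders, and the example cannot confirm how any particular engine weighs each field.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "How to descale an electric kettle",
  "description": "A short walkthrough of descaling a kettle with citric acid.",
  "thumbnailUrl": "https://example.com/video/descale-kettle.jpg",
  "uploadDate": "2026-01-15T08:00:00+00:00",
  "duration": "PT4M30S",
  "contentUrl": "https://example.com/video/descale-kettle.mp4",
  "embedUrl": "https://example.com/embed/descale-kettle",
  "transcript": "Fill the kettle halfway with water. Add two tablespoons of citric acid. Boil, wait fifteen minutes, then rinse twice."
}
</script>
```

The `duration` value uses the ISO 8601 duration form the section mentions (`PT4M30S` is four minutes thirty seconds).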

The hosting effect — YouTube and Vimeo auto-generate captions and transcripts that Google’s index already consumes. Google’s own video SEO best practices recommend “Create a dedicated watch page for each video” (Google Search Central, updated 2025-12-18). Self-hosted video without a transcript is largely invisible to both reading modes for content purposes — the page has a <video> tag, and nothing else.

Hosting choice	Transcript availability	AI readability
YouTube / Vimeo embed	Auto-CC + creator-uploaded options	High — AIO sources transcripts directly; live-fetch reads the embed’s surrounding HTML + the platform-hosted transcript when reachable
Self-hosted with same-page transcript	Manually authored + same-page HTML	High — both modes read the transcript text
Self-hosted with VTT/SRT only	Sidecar file, no same-page text	Moderate — index-integrated reads the sidecar; many live-fetch retrievers don’t fetch it
Self-hosted without transcript	None	Effectively invisible for content purposes

One sub-discipline worth naming: captions burned into video pixels are unreadable to any text-channel reader; only file-based captions work. Same logic as “pictures of text” for images — pixels are not text. For media-heavy sites — publishers, ed-tech, video-first content businesses — hosting choice is usually a larger lever than markup completeness.
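
A sketch of the self-hosted pattern from the table above: file-based captions attached with a `<track>` element, plus the transcript as ordinary same-page HTML. The file paths and transcript text are hypothetical.

```html
<video controls poster="/video/descale-kettle.jpg">
  <source src="/video/descale-kettle.mp4" type="video/mp4">
  <!-- File-based captions: a text sidecar, unlike captions burned into the pixels -->
  <track kind="captions" src="/video/descale-kettle.en.vtt"
         srclang="en" label="English" default>
</video>

<!-- Same-page transcript: the form in which the spoken content reaches live-fetch readers -->
<section id="transcript">
  <h2>Transcript</h2>
  <p>Fill the kettle halfway with water. Add two tablespoons of citric acid.</p>
  <p>Boil, wait fifteen minutes, then rinse twice.</p>
</section>
```

The sidecar alone corresponds to the "Moderate" row; it is the visible transcript section that moves the page into the "High" row.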

6. Audio and speakable schema — narrower than it sounds

This section corrects the loudest multimodal myth — that speakable schema makes content “voice-AI-ready” in the general sense most users mean. It does not.

The actual speakable scope, per Google’s own current documentation (last updated 2025-12-10): speakable structured data is in beta and is limited to “users in the U.S. that have Google Home devices set to English, and publishers that publish content in English” (Google Search Central: Speakable). It has been in beta with these same US-only, English-only, news-only restrictions for years; the 2025-12-10 update did not relax any of them. Its target surface is Google Assistant TTS playback, not a general “this content is for AI” signal.
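
For reference, speakable markup has roughly the following shape, sketched here for a hypothetical English-language news article. The headline, URL, and CSS selectors are placeholders. All it does is point at the page sections suited to text-to-speech playback, which is why it says nothing about general AI readability.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "City council approves harbor redevelopment plan",
  "url": "https://example.com/news/harbor-redevelopment",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".article-headline", ".article-summary"]
  }
}
</script>
```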

The general audio AI strategy — the part most readers actually need:

Transcript is the same lever as video, with the same dominance. A podcast without a transcript is invisible to live-fetch AI for content purposes; the spoken audio never enters the text channel.
Show notes and episode descriptions act as the search-engine-facing summary; live-fetch AI reads these first.
AudioObject carries transcript, contentUrl, caption, encodingFormat, and duration (schema.org/AudioObject). PodcastEpisode carries partOfSeries, episodeNumber, duration, datePublished, and an embedded audio (an AudioObject) (schema.org/PodcastEpisode) — note that the transcript attaches via the embedded AudioObject, not directly on PodcastEpisode itself.
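
The nesting described in the last item can be sketched as follows, assuming a hypothetical podcast. The series name, URLs, dates, and transcript text are placeholders. Note where `transcript` sits: on the embedded AudioObject, not on the PodcastEpisode.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "PodcastEpisode",
  "name": "Episode 12: Descaling myths",
  "episodeNumber": 12,
  "datePublished": "2026-02-03",
  "duration": "PT32M10S",
  "partOfSeries": {
    "@type": "PodcastSeries",
    "name": "The Example Kitchen Podcast",
    "url": "https://example.com/podcast"
  },
  "audio": {
    "@type": "AudioObject",
    "contentUrl": "https://example.com/podcast/ep12.mp3",
    "encodingFormat": "audio/mpeg",
    "duration": "PT32M10S",
    "transcript": "Welcome back. Today we test three descaling methods and compare the results."
  }
}
</script>
```

The markup complements a same-page transcript in visible HTML; it does not replace one, since live-fetch readers work from the page text.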

Speakable is for TTS playback; transcripts are for AI readability. Do not conflate them, and do not let a vendor or platform sell speakable markup as a general AI-readability lever — the marketing routinely overstates what the spec actually does.

7. Charts, tables, and diagrams — data as text

This is the shortest of the asset sections because the discipline is small and well known, but it is high-leverage for analytical content where the data is the point.

The headline claim: extraction heuristics read HTML data tables, not pixel-rendered charts. A bar chart rendered as a PNG is unreadable to live-fetch AI; a bar chart accompanied by its underlying data table is fully readable. Two patterns carry almost all of the discipline:

Chart + data-table fallback — render the chart visually for humans, include the underlying numbers as an HTML <table> (or as plain text) so the text channel carries the information.
Caption + summary statistics — a one-paragraph caption stating the headline number, the source, and the time period in plain text, citable independently of whether the chart is read at all.

<!-- Render for humans; figcaption must sit inside a figure element -->
<figure>
  <img src="/charts/q1-revenue.png" alt="Quarterly revenue trend, Q1 2024 through Q1 2026, in $M">
  <!-- Citable caption: headline number and time period in plain text -->
  <figcaption>Q1 revenue grew from $12M (Q1 2024) to $19M (Q1 2026), a 58% increase.</figcaption>
</figure>

<!-- Citable for AI: the data table fallback -->
<table>
  <thead><tr><th>Quarter</th><th>Revenue ($M)</th></tr></thead>
  <tbody>
    <tr><td>Q1 2024</td><td>12</td></tr>
    <tr><td>Q1 2025</td><td>15</td></tr>
    <tr><td>Q1 2026</td><td>19</td></tr>
  </tbody>
</table>

Schema.org has Dataset (schema.org/Dataset) for full published datasets and Table semantics for tabular content, but the bar to clear here is the HTML data table itself; markup is icing. Data tables are also among the highest-citability content shapes per Citability — the same passages that work for chart readability work for direct quotation into AI answers.
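
If the underlying numbers are also published as a downloadable file, the optional Dataset layer might be sketched like this. The CSV URL, the description, and the measurement note are assumptions for illustration; the three Dataset fields are the ones named in §2.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Q1 revenue, 2024 to 2026",
  "description": "First-quarter revenue in millions of US dollars for 2024, 2025, and 2026.",
  "variableMeasured": "Q1 revenue in millions of US dollars",
  "measurementTechnique": "Reported quarterly financial results",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.com/data/q1-revenue.csv"
  }
}
</script>
```

This sits on top of the HTML table rather than replacing it; the table remains the part a text-channel reader can quote.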

8. Trust and provenance — the E-E-A-T half for non-text assets

This section delivers on the E-E-A-T §8 promise of “trust-readability of non-text assets” and continues the thread from Schema.org for AI §8 on ImageObject/VideoObject provenance.

Trust filters apply to non-text assets too. An AI-generated stock-photo flood with no provenance, a video with a fabricated author byline, a chart whose underlying numbers cannot be sourced — all trip the same AI-at-scale trust filters that AI Content Detection covers for text. The mechanism is identical; only the asset type changes.

Asset	Provenance signals	Maturity (as of 2026-05)
Image	C2PA content credentials; EXIF camera metadata; IPTC photo credit + new “AI System Used” field (v2025.1); `creator` in `ImageObject`	C2PA adoption spreading (steering-committee members include Adobe, Microsoft, BBC, OpenAI, Sony; general members include NYT, Nikon, Canon — C2PA Membership); EXIF/IPTC mature; AI-generation field new in IPTC v2025.1 (IPTC, 2025-11-27)
Video	`creator`/`publisher` in `VideoObject`; platform channel verification (YouTube); upload-date consistency; SynthID watermarking on AI-generated video (Google DeepMind)	Mature where YouTube-hosted; SynthID active for Google-generated AI video
Audio	`creator` in `AudioObject`; host-platform verification; SynthID watermarking on AI-generated audio	Moderate; SynthID active for Google-generated AI audio
Charts / data	Cited data source; methodology link; downloadable raw data; provenance of the underlying numbers	Fully mature — this is just citation discipline
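
The `creator` and `publisher` signals from the table above can be expressed in markup along these lines. This is a sketch with invented names and URLs, and it shows only the authorship fields plus the minimum needed to identify the video.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "How to descale an electric kettle",
  "uploadDate": "2026-01-15T08:00:00+00:00",
  "thumbnailUrl": "https://example.com/video/descale-kettle.jpg",
  "creator": {
    "@type": "Person",
    "name": "Jane Doe",
    "url": "https://example.com/authors/jane-doe"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Media",
    "url": "https://example.com"
  }
}
</script>
```

Markup of this kind states authorship; it does not attest it cryptographically the way C2PA credentials do.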

SynthID, Google DeepMind’s watermarking technology, covers all four modalities — image, video, audio, and text. Per the canonical page: “The watermarks are embedded across Google’s generative AI consumer products, and are imperceptible to humans – but can be detected by SynthID’s technology” (Google DeepMind). The canonical page does not publish specific detection-accuracy numbers; treat it as a direction-credible, coefficient-unmeasured signal — the same posture AI Content Detection §6 recommends for watermarking generally.

The honest bound: image-provenance ecosystems (C2PA, SynthID, IPTC’s AI-generation field) are real and spreading but not yet a confirmed citation gate at any major AI engine as of 2026-05. Direction is credible; the coefficient is unmeasured.

9. What the evidence says — and what it does not

This section uses direction-not-coefficient framing, the same shape Multilingual GEO §7 and Entity Recognition §6 use for their respective claims. The mechanism direction is well-attested; specific citation-rate lifts for specific markup choices are not.

What holds	The bounded reading
Google explicitly recommends multimodal hygiene for AI Search. “Support your textual content with high-quality images and videos on your pages” is one of 8 official recommendations (Google Search Central, 2025-05 · Search Engine Land coverage)	A direction signal from the vendor with the most-cited AI surface, not a measurement. The lift magnitude is not rigorously published. Practitioner coverage notes Google offered “limited actionable detail”
AIO inline image cards and video carousels are publicly observable — every AIO answer for a product, recipe, how-to, or visual-research query surfaces them	Structural fact, not a measurement. Which specific image is chosen for the card is not publicly documented; do not reverse-engineer “ranking factors” from the carousel placement
YouTube transcripts demonstrably reach Google’s index — direct quotes lifted from YouTube auto-CC have been observed in AI Overviews answers	Practitioner observation, not a rigorous benchmark. The direction (YouTube transcripts feed AIO) is high-confidence; the citation rate for a specific channel or video is not a known coefficient
Multimodal LLMs can describe images when handed them (GPT-4V system card · Gemini technical report)	Model capability, not pipeline capability. Web-retrieval pipelines feeding AI answers typically still pass text to the model, not pixels. The claim that “AI search engines see my images” is an over-extension
C2PA / SynthID provenance ecosystems are real and growing — Adobe, Microsoft, BBC, OpenAI, Sony on C2PA’s steering committee; NYT, Nikon, Canon as general members (C2PA Membership); SynthID embedded in Google’s generative consumer products	Adoption is verified; effect on AI-citation behavior is not. Not yet a confirmed citation gate at any major AI engine
The text channel dominates for both reading modes — observable in every live-fetch AI engine; the asset’s `alt`, surrounding text, and transcript are what come back in citations	Snapshot of the 2026-05 state. The claim will weaken over time as multimodal-native retrievers ship more widely; the entry’s `lastUpdated` and `nextReviewDue` carry that signal

The honest gap, stated plainly: as of 2026-05, no published rigorous benchmark measures the citation-rate lift from any specific multimodal markup choice, whether alt-text quality, ImageObject completeness, transcript presence, or C2PA attestation. Direction is credible across all four; magnitude is not. Anyone who quotes a precise number like “images lift AI citation rates by N%” is over-claiming. Read the direction, build for the text channel, and resist importing unverified coefficients into investment decisions.

10. Anti-patterns — multimodal misreads

The errors this entry exists to prevent, in the same shape as the Citability §6 and Multilingual GEO §8 anti-pattern tables.

Misread	Why it looks right	Why it’s wrong
“Alt-text stuffing helps AI find my images”	Looks like the extension of keyword stuffing into a new surface	Google explicitly warns that keyword stuffing alt attributes “results in a negative user experience and may cause your site to be seen as spam” (Google Images best practices). AI quality systems detect and down-weight the same pattern; see AI Content Detection
“Pictures of text”: rendering body text inside an image	Looks designerly; gives total typographic control	Text rendered as image pixels is invisible to all text-channel readers, and OCR is not standard in live-fetch retrieval pipelines. The text exists only to a reader who happens to be a vision-enabled model handed the image directly, not to the retrieval pipeline
“Self-hosted video without a transcript is fine; the audio speaks for itself”	Looks self-evident: humans can hear it	The spoken content never enters the text channel; live-fetch AI sees a `<video>` tag and nothing else. The single highest-impact fix on a media-heavy site is adding a same-page transcript
“Speakable schema makes my content voice-AI-ready”	Looks like the obvious markup choice for AI/voice surfaces	Speakable is beta, US-only, English-only, news-only, Google-Assistant-TTS-only as of 2025-12-10 (Google Search Central: Speakable). It has not expanded in years. It is not a general AI-readability signal
“Bar chart as image is enough; humans can read it”	Looks sufficient; the chart is right there	Extraction heuristics read HTML data tables, not pixel-rendered charts. Live-fetch AI sees a `<figure>` and an alt text, and that is it. The numbers in the chart never enter the answer pipeline unless you publish them as text
“AI-generated stock photos at scale, no provenance”	Looks like cheap visual coverage	The same AI-at-scale pattern AI Content Detection covers for text applies to images. C2PA, IPTC v2025.1’s “AI System Used” field, and SynthID watermarks are emerging as the provenance signals a trust filter can draw on, though none is yet a confirmed citation gate; an unattested AI-image flood reads as the multimodal version of mass-generated content
“My image was passed to GPT-4o, so AI search engines must see it too”	Looks transitive: same vendor’s model	Model capability is not retrieval-pipeline capability. The user-uploaded path runs vision; the web-retrieval path mostly passes text. Your image on your page is read as `alt` and `caption`, not as pixels

The failure mode is rarely “we forgot the markup.” It is treating multimodal as a vision problem when in 2026 it is still mostly a text-channel problem.

11. Why this matters for GEO + how to act

Multimodal GEO is not a separate AI-specific discipline. It is citability and E-E-A-T applied to non-text assets, with the engine-reading-mode asymmetry from §3 as the filter for which lever pays off in which channel. The levers are not new disciplines; they are the same disciplines retooled per asset type.

Your intent	First stop
Implement image, video, audio, or chart markup correctly	Schema Implementation playbook
Decide which markup format (JSON-LD, RDFa, Microdata)	JSON-LD
Audit a site’s multimodal layer end-to-end	Full GEO Audit playbook
Make a page’s text channel actually extract	Citability playbook · Citability concept
Calibrate trust signals for non-text assets	E-E-A-T · AI Content Detection for the AI-at-scale anti-pattern
Understand the schema vocabulary	Schema.org for AI
Bind asset authorship to a creator entity	Entity Recognition · Knowledge Graph Presence
See where this sits in the loop	Answer Loop
The method that ties it together	Generative Engine Optimization

The practical reading: audit the text channel for every non-text asset before touching markup or provenance. Most teams discover the dominant lever is not a missing ImageObject block — it is missing transcripts on video, missing alt on content images, charts published as PNGs without underlying data, or AI-generated images at scale without provenance. Fix the text channel first; provenance and pixel-level capability matter at the margin.

For the term itself and its neighbors, see the GEO glossary.

References

Official / standards:

Schema.org — ImageObject · VideoObject · AudioObject · PodcastEpisode · Dataset
Google Search Central — Google Images best practices (updated 2026-03-02) · Video SEO best practices (updated 2025-12-18) · Speakable (SpeakableSpecification) structured data (updated 2025-12-10, still beta) · AI features and your website · Top ways to ensure your content performs well in Google’s AI experiences on Search
W3C WAI — Understanding Success Criterion 1.1.1: Non-text Content
IPTC — Photo Metadata Standard (v2025.1, 2025-11-27 — added “AI System Used” property)
C2PA — Coalition for Content Provenance and Authenticity · Membership · Specifications
Google DeepMind — SynthID (image / video / audio / text watermarking)

Vendor / technical:

OpenAI — GPT-4V(ision) system card (2023-09-25)
Google DeepMind — Gemini: A Family of Highly Capable Multimodal Models (arXiv:2312.11805, 2023-12-19) · Introducing Gemini (2023-12-06) — “designed Gemini to be natively multimodal, pre-trained from the start on different modalities”

Industry:

Search Engine Land — Goodwin, D. (2025-05-21). Google shares 8 ways to be successful with AI Search experiences

Frequently asked questions

What are multimodal signals in GEO?

The signals attached to non-text content (images, video, audio, charts and tables) that determine whether an AI engine can read, ground, and cite those assets in an answer. In 2026, that mostly means the text that travels with the asset (alt text, caption, transcript, schema markup) rather than pixel-level vision. The reason is structural: AI retrieval pipelines feeding answer engines still mostly pass text to the model, not images, even when the underlying model is natively multimodal.

Doesn't GPT-4V / Gemini see images? Why does the text channel still matter?

Model capability is not the same as retrieval-pipeline capability. GPT-4V can describe an image when you upload it (per OpenAI's system card), and Gemini is natively multimodal from pre-training (per the Gemini technical report). But when ChatGPT search or Perplexity retrieves your page from the web, it typically extracts text and passes that to the model — your image is not handed in directly. So in 2026, the asset's text channel (alt, surrounding prose, transcript) is what the answer model sees, regardless of how vision-capable the underlying model is.

What's the single highest-leverage multimodal lever?

For most sites, a same-page transcript on every video. A YouTube embed gives you Google's auto-CC infrastructure for free; a self-hosted `<video>` tag without a transcript is effectively invisible to live-fetch AI engines because the spoken content never enters the text channel. The transcript is the asset's only entry point into the answer pipeline.

Does speakable schema make my content voice-AI-ready?

No — and this is the most common multimodal myth. Speakable structured data is in beta, restricted to US-based news sites in English, and feeds Google Assistant TTS playback (per Google's own current docs, last updated 2025-12-10). It has been in beta with the same restrictions for years. It is not a general 'this content is AI-readable' signal. The general AI-readability lever for audio is a transcript on the page, not speakable markup.

Do AI search engines actually rank pages with images higher?

Google's official position is that there are no special requirements to appear in AI Overviews or AI Mode beyond standard SEO best practices (per the 'AI features and your website' doc). Google's May 2025 guidance, 'Top ways to ensure your content performs well in Google's AI experiences on Search', does include 'Support your textual content with high-quality images and videos on your pages' as one of eight recommendations, but the lift magnitude is not rigorously published. Read the direction (multimodal hygiene is on the recommendations list) without quoting unverified citation-rate numbers.

Sources

Primary

ImageObject — Schema.org · Schema.org
VideoObject — Schema.org · Schema.org
AudioObject — Schema.org · Schema.org
PodcastEpisode — Schema.org · Schema.org
Dataset — Schema.org · Schema.org
Google Images best practices · Google Search Central · 2026-03-02
Video SEO best practices · Google Search Central · 2025-12-18
Speakable (SpeakableSpecification) structured data · Google Search Central · 2025-12-10
AI features and your website · Google Search Central · 2025-12-10
Top ways to ensure your content performs well in Google's AI experiences on Search · Google Search Central · 2025-05-21
Understanding Success Criterion 1.1.1: Non-text Content · W3C Web Accessibility Initiative
Coalition for Content Provenance and Authenticity (C2PA) · C2PA
C2PA Membership · C2PA
C2PA Specifications · C2PA
SynthID — identifying AI-generated content · Google DeepMind
IPTC Photo Metadata Standard (v2025.1) · International Press Telecommunications Council · 2025-11-27
GPT-4V(ision) system card · OpenAI · 2023-09-25
Gemini: A Family of Highly Capable Multimodal Models · Google DeepMind / arXiv · 2023-12-19
Introducing Gemini: our largest and most capable AI model · Google · 2023-12-06

Secondary

Google shares 8 ways to be successful with AI Search experiences · Search Engine Land (Danny Goodwin)