December 5, 2025 (July 15, 2026)

Video Models
Image Models
1. Aspect ratios
2. Unlimited Image Generation (Relax Mode)
Music Models
1. Output modes
2. Status values
Text-to-Speech Models

Video Models

Endpoint Compatibility

Model	create	create-frames	create-fusion	motion-control
`v6` (default)	✓	✓	—	—
`v5.6`	✓	✓	✓	✓
`pixverse-c1`	✓	✓	✓	—
`seedance-2.0`	✓	✓	✓ (+3 ref videos, +3 ref audios)	—
`seedance-2.0-fast`	✓	✓	✓ (+3 ref videos, +3 ref audios)	—
`seedance-2.0-mini`	✓	✓	✓ (+3 ref videos, +3 ref audios)	—
`kling-o3`	✓	✓	✓	—
`kling-v3`	✓	✓	—	—
`grok-imagine`	✓	—	—	—
`grok-imagine-1.5`	✓ (i2v only)	—	—	—
`veo-3.1-lite`	✓	✓	—	—
`veo-3.1-standard`	✓	✓	—	—
`veo-3.1-fast`	✓	✓	—	—
`sora-2`	✓	—	—	—
`sora-2-pro`	✓	—	—	—
`happyhorse-1.0`	✓	—	—	—

Fusion notation: v5 uses the legacy @pic1/@pic2/@pic3 placeholders; all other fusion-capable models use @image1…@imageN, mapped positionally to frame_1_path…frame_N_path. The Seedance fusion family additionally supports @video1…@video3 for reference videos (video_1_path…video_3_path) and @audio1…@audio3 for reference audios (audio_1_path…audio_3_path). This is PixVerse’s omni mode.

grok-imagine-1.5 is image-to-video only — it requires first_frame_path and rejects text-to-video (no-image) requests. The original grok-imagine supports both text-to-video and image-to-video.
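
The positional mapping between prompt placeholders and request fields can be sketched as follows. This is a minimal illustration, not part of the API: the helper name placeholder_fields is hypothetical, and it covers only the unified @image/@video/@audio notation, not the legacy @pic form.

```python
import re

# Hypothetical helper: maps @imageN / @videoN / @audioN placeholders in a
# fusion prompt to the request fields named in the notes above.
_FIELD_PREFIX = {"image": "frame", "video": "video", "audio": "audio"}
_PLACEHOLDER = re.compile(r"@(image|video|audio)(\d+)")


def placeholder_fields(prompt: str) -> list[str]:
    """Return the request field each placeholder refers to, in order of first use."""
    fields: list[str] = []
    for kind, index in _PLACEHOLDER.findall(prompt):
        name = f"{_FIELD_PREFIX[kind]}_{index}_path"
        if name not in fields:  # a placeholder may be mentioned more than once
            fields.append(name)
    return fields


print(placeholder_fields("@image1 dances to @audio1 in the style of @video2"))
# ['frame_1_path', 'audio_1_path', 'video_2_path']
```

A client could use the returned list to check that every placeholder in the prompt has a matching file field before submitting the request.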

Extend, upscale, modify, lipsync

Upscale, modify, and lipsync are native-PixVerse only. Extend supports v6 and the third-party grok-imagine model.

Endpoint	`v6`	`v5`	`v5.5`	`v5.6`
extend	✓	—	—	—
upscale	✓	✓	✓	✓
modify	—	—	✓	—
lipsync	—	✓	—	—

extend also accepts grok-imagine (480p/720p, 2-10s, native audio).

v5 family (legacy)

v5 carries legacy-only modes — multi-frame create-transition, lipsync, and fusion with the original @pic1/@pic2/@pic3 notation.

Endpoint	`v5`	`v5.5`	`v5.6`	`v5-fast`
create	✓	✓	✓	✓
create-frames	✓	✓	✓	—
create-transition (2-frame)	✓	✓	✓	—
create-transition (3+ frame)	✓	—	—	—
create-fusion	✓	—	✓	—
extend	—	—	—	—
modify	—	✓	—	—
lipsync	✓	—	—	—
upscale	✓	✓	✓	✓

v5 accepts both @image1…@imageN (unified) and the legacy @pic1/@pic2/@pic3 synonyms for backward compatibility.

Quality, Duration, Aspect Ratio

Model	Qualities	Durations	Aspect Ratios	Max ref (fusion)
`v6`	360p, 540p, 720p (default), 1080p	1-15s	16:9, 9:16, 1:1, 4:3, 3:4	—
`v5.6`	360p, 540p (default), 720p, 1080p	1-10s (1080p max 8)	16:9, 9:16, 1:1, 4:3, 3:4	7 imgs
`v5.5`	360p, 540p (default), 720p, 1080p	1-10s (1080p max 8)	16:9, 9:16, 1:1, 4:3, 3:4	—
`v5`	360p, 540p (default), 720p, 1080p	1-10s (1080p max 8)	16:9, 9:16, 1:1, 4:3, 3:4	3 imgs
`v5-fast`	360p, 540p (default), 720p, 1080p	1-10s (1080p max 8)	16:9, 9:16, 1:1, 4:3, 3:4	—
`pixverse-c1`	360p, 540p, 720p, 1080p	1-15s	16:9, 4:3, 1:1, 3:4, 9:16, 3:2, 2:3	7 imgs
`seedance-2.0`	480p, 720p, 1080p, 2160p	4-15s	16:9, 4:3, 1:1, 3:4, 9:16, 21:9	9 imgs + 3 videos + 3 audios
`seedance-2.0-fast`	480p, 720p	4-15s	16:9, 4:3, 1:1, 3:4, 9:16, 21:9	9 imgs + 3 videos + 3 audios
`seedance-2.0-mini`	480p, 720p	4-15s	16:9, 4:3, 1:1, 3:4, 9:16, 21:9	9 imgs + 3 videos + 3 audios
`kling-o3`	720p (Std), 1080p (Pro)	3-15s	16:9, 1:1, 9:16	7 imgs
`kling-v3`	720p (Std), 1080p (Pro)	3-15s	16:9, 1:1, 9:16	—
`grok-imagine`	480p, 720p	1-15s	16:9, 4:3, 1:1, 3:4, 9:16, 3:2, 2:3	—
`grok-imagine-1.5`	480p, 720p	1-15s	i2v only (from image)	—
`veo-3.1-lite`	720p, 1080p	4, 6, 8s	16:9, 9:16	—
`veo-3.1-standard`	720p, 1080p, 2160p	4, 6, 8s	16:9, 9:16	—
`veo-3.1-fast`	720p, 1080p, 2160p	4, 6, 8s	16:9, 9:16	—
`sora-2`	720p	4, 8, 12s	16:9, 9:16	—
`sora-2-pro`	720p, 1080p	4, 8, 12s	16:9, 9:16	—
`happyhorse-1.0`	720p, 1080p	3-15s	16:9, 9:16, 1:1, 4:3, 3:4	—

aspect_ratio is required for text-to-video (t2v) and fusion requests. It is not accepted for image-to-video (i2v) or transition requests, where the aspect ratio is derived from the input image.
For kling-o3 / kling-v3: quality: 720p routes to Std, quality: 1080p routes to Pro.
For veo-3.1-standard / veo-3.1-fast: quality: 1080p requires duration: 8.
Seedance fusion (omni mode) is the only path that accepts reference videos and audios. Images go in frame_1_path … frame_9_path, videos in video_1_path … video_3_path (@video1…@video3), and audios in audio_1_path … audio_3_path (@audio1…@audio3). At least one image or reference video is required. Each reference video must be 2-15s long, ≤50 MB, 24-60 fps, and ≤6000×6000 px; each reference audio must be 2-15s long. The total duration of all reference videos must be ≤15s, and the total duration of all reference audios must also be ≤15s.
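
The Seedance reference limits lend themselves to a client-side pre-check. The sketch below assumes you already know each file's duration, size, frame rate, and dimensions; the RefVideo type and check_seedance_refs function are hypothetical names, and the server remains the authority on what is accepted.

```python
from dataclasses import dataclass


@dataclass
class RefVideo:
    """Metadata for one reference video (hypothetical container type)."""
    duration_s: float
    size_mb: float
    fps: float
    width: int
    height: int


def check_seedance_refs(n_images: int, videos: list[RefVideo],
                        audio_durations_s: list[float]) -> list[str]:
    """Return a list of violations of the documented limits (empty if none)."""
    errors: list[str] = []
    if n_images > 9:
        errors.append("at most 9 reference images")
    if len(videos) > 3:
        errors.append("at most 3 reference videos")
    if len(audio_durations_s) > 3:
        errors.append("at most 3 reference audios")
    if n_images == 0 and not videos:
        errors.append("at least one image or reference video is required")
    for i, v in enumerate(videos, start=1):
        if not 2 <= v.duration_s <= 15:
            errors.append(f"video {i}: duration must be 2-15s")
        if v.size_mb > 50:
            errors.append(f"video {i}: size must be <= 50 MB")
        if not 24 <= v.fps <= 60:
            errors.append(f"video {i}: frame rate must be 24-60 fps")
        if v.width > 6000 or v.height > 6000:
            errors.append(f"video {i}: dimensions must be <= 6000x6000 px")
    for i, d in enumerate(audio_durations_s, start=1):
        if not 2 <= d <= 15:
            errors.append(f"audio {i}: duration must be 2-15s")
    if sum(v.duration_s for v in videos) > 15:
        errors.append("total video duration must be <= 15s")
    if sum(audio_durations_s) > 15:
        errors.append("total audio duration must be <= 15s")
    return errors
```

Note that the per-file and total limits interact: two 10s reference videos are each valid on their own but fail the 15s total.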

Audio

Model	`audio`
`v6`	toggle
`v5.6`	toggle
`v5.5`	toggle
`v5`	— (use `lip_sync_tts_prompt` + `sound_effect_prompt`)
`v5-fast`	—
`pixverse-c1`	toggle
`seedance-2.0`	toggle
`seedance-2.0-fast`	toggle
`seedance-2.0-mini`	toggle
`kling-o3`	toggle
`kling-v3`	toggle
`grok-imagine`	rejected
`grok-imagine-1.5`	rejected
`veo-3.1-lite`	rejected
`veo-3.1-standard`	always on
`veo-3.1-fast`	always on
`sora-2`	rejected
`sora-2-pro`	rejected
`happyhorse-1.0`	always on

toggle: the model accepts audio: true or audio: false.
always on: audio is generated automatically; audio: false is rejected.
rejected: the audio parameter is not accepted (the content has no audio track, or audio is handled internally).
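
The three audio policies can be expressed as a small lookup. The sketch below copies only the rows of the table that use one of the three named policies (v5 and v5-fast are left out), and check_audio is a hypothetical helper. It assumes that audio: true is accepted on always-on models, which the table does not state explicitly.

```python
from typing import Optional

# Policy per model, copied from the Audio table above.
AUDIO_POLICY = {
    "v6": "toggle", "v5.6": "toggle", "v5.5": "toggle", "pixverse-c1": "toggle",
    "seedance-2.0": "toggle", "seedance-2.0-fast": "toggle",
    "seedance-2.0-mini": "toggle", "kling-o3": "toggle", "kling-v3": "toggle",
    "grok-imagine": "rejected", "grok-imagine-1.5": "rejected",
    "veo-3.1-lite": "rejected", "sora-2": "rejected", "sora-2-pro": "rejected",
    "veo-3.1-standard": "always on", "veo-3.1-fast": "always on",
    "happyhorse-1.0": "always on",
}


def check_audio(model: str, audio: Optional[bool]) -> Optional[str]:
    """Return an error message if the audio value would be rejected, else None.

    audio=None means the parameter is omitted from the request.
    """
    policy = AUDIO_POLICY[model]
    if audio is None or policy == "toggle":
        return None
    if policy == "rejected":
        return f"{model} does not accept the audio parameter"
    # "always on": only audio=False is rejected (audio=True assumed to be fine).
    if audio is False:
        return f"{model} always generates audio; audio: false is rejected"
    return None
```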

Native PixVerse — extra flags

multi_shot, preview_mode, off_peak_mode, and seed are supported only on native PixVerse models. Third-party models reject them.

Model	`multi_shot`	`preview_mode`	`off_peak_mode`	`seed`
`v6`	✓	✓	✓	✓
`v5.6`	—	✓	✓	✓
`v5.5`	—	✓	✓	✓
`v5`	—	✓	✓	✓
`v5-fast`	—	✓	✓	✓
`pixverse-c1`	—	✓	✓	✓

Image Models

All image models share the same endpoints: create, list, get, delete.

Model	Qualities	Max Refs	Est. Time
`qwen-image` (default)	720p, 1080p	3	~3s
`nano-banana`	1080p	3	~10s
`nano-banana-2`	512p, 1080p, 1440p, 2160p	9	~30s
`nano-banana-2-lite`	1080p	14	~7s
`nano-banana-pro`	1080p, 1440p, 2160p	9	~60s
`seedream-4.0`	1080p, 1440p, 2160p	6	~10s
`seedream-4.5`	1440p, 2160p	6	~15s
`seedream-5.0-pro`	1080p, 1440p	10	~30s
`seedream-5.0-lite`	1440p, 1800p, 2160p	6	~30s
`kling-3.0`	1080p, 1440p	1	~15s
`kling-o3`	1080p, 1440p, 2160p	1	~20s
`gpt-image-2.0`	1080p, 1440p, 2160p	9	~30s

create_count: 1-4 (default 1).
detail_level (gpt-image-2.0 only, required): low, medium, high. Rejected for all other models. Affects credit cost (low = 0.5×, medium = 1×, high = 2× of the per-quality base).
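
As a worked example of the detail_level multiplier, the sketch below applies the documented factors to a per-quality base cost. The base cost is passed in as an argument because this page does not list it, and image_credit_cost is a hypothetical helper name.

```python
from typing import Optional

# Multipliers from the detail_level note above.
DETAIL_MULTIPLIER = {"low": 0.5, "medium": 1.0, "high": 2.0}


def image_credit_cost(model: str, base_credits: float,
                      detail_level: Optional[str] = None) -> float:
    """Apply the detail_level multiplier to a per-quality base cost.

    base_credits is an input assumption: the per-quality base is not given here.
    """
    if model == "gpt-image-2.0":
        if detail_level not in DETAIL_MULTIPLIER:
            raise ValueError("detail_level is required: low, medium, or high")
        return base_credits * DETAIL_MULTIPLIER[detail_level]
    if detail_level is not None:
        raise ValueError(f"detail_level is rejected for {model}")
    return base_credits


print(image_credit_cost("gpt-image-2.0", 10, "high"))  # 20.0
print(image_credit_cost("gpt-image-2.0", 10, "low"))   # 5.0
```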

Aspect ratios

Each model accepts its own list. If aspect_ratio is omitted, the value in the Default column is used. Passing a value not in the model’s row is rejected with 400.

Models	Default	Accepted `aspect_ratio` values
`nano-banana`, `nano-banana-2`, `nano-banana-2-lite`, `nano-banana-pro`, `seedream-4.0`, `seedream-4.5`, `seedream-5.0-pro`, `seedream-5.0-lite`	`auto`	`auto`, `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `5:4`, `4:5`, `3:2`, `2:3`, `21:9`
`qwen-image`	`1:1`	`1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `5:4`, `4:5`, `3:2`, `2:3`, `21:9`
`kling-o3`, `kling-3.0`	`1:1`	`1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `3:2`, `2:3`, `21:9`
`gpt-image-2.0`	`1:1`	`1:1`, `1:2`, `4:3`, `3:4`, `3:2`, `2:3`, `16:9`, `9:16`, `2:1`, `21:9`
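
The default and validation behaviour described above can be mirrored on the client to catch a bad value before the request is sent. This is a sketch built from the table; resolve_aspect_ratio is a hypothetical helper, and it raises ValueError where the API would answer with 400.

```python
from typing import Optional

_COMMON = ["1:1", "16:9", "9:16", "4:3", "3:4", "5:4", "4:5", "3:2", "2:3", "21:9"]
_KLING = ["1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3", "21:9"]
_GPT = ["1:1", "1:2", "4:3", "3:4", "3:2", "2:3", "16:9", "9:16", "2:1", "21:9"]

# model -> (default, accepted values), copied from the table above
ASPECT_RATIOS = {
    **{m: ("auto", ["auto"] + _COMMON) for m in (
        "nano-banana", "nano-banana-2", "nano-banana-2-lite", "nano-banana-pro",
        "seedream-4.0", "seedream-4.5", "seedream-5.0-pro", "seedream-5.0-lite")},
    "qwen-image": ("1:1", _COMMON),
    "kling-o3": ("1:1", _KLING),
    "kling-3.0": ("1:1", _KLING),
    "gpt-image-2.0": ("1:1", _GPT),
}


def resolve_aspect_ratio(model: str, aspect_ratio: Optional[str] = None) -> str:
    """Return the effective aspect ratio, or raise if the model would reject it."""
    default, accepted = ASPECT_RATIOS[model]
    if aspect_ratio is None:
        return default
    if aspect_ratio not in accepted:
        raise ValueError(f"{aspect_ratio!r} is not accepted by {model}")
    return aspect_ratio
```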

Unlimited Image Generation (Relax Mode)

Subscription plans at the Pro tier and above include unlimited image generation in Relax Mode for select models:

Plan	Price	Unlimited Models
Pro	$30/mo	`qwen-image`, `nano-banana-2-lite`
Premium	$60/mo	`qwen-image`, `nano-banana-2-lite`, `nano-banana`, `nano-banana-2`, `seedream-4.0`
Ultra	$199/mo	all 12 image models

Music Models

All music models share the same endpoints: generate, list, get. Generation is asynchronous: poll get until audio_status_final is true, or pass a replyUrl. Duration is chosen automatically for all models (typically 2-5 minutes).

Model	Provider	Prompt max (chars)	Custom lyrics	Reference images	Credits
`music-2.6` (default)	MiniMax	2,000	≤ 3,500 chars	—	40
`music-v1`	ElevenLabs	4,000	≤ 3,500 chars	—	150
`lyria-3-pro-preview`	Google	5,000	—	up to 10	20

Output modes

The output is derived from instrumental and lyrics — there is no auto_lyrics parameter.

Request	Result
`instrumental: true`	instrumental, no vocals
`lyrics` provided	vocals sung from your lyrics (`music-2.6` / `music-v1` only)
neither	vocals with lyrics written by the model

lyrics is rejected on lyria-3-pro-preview and cannot be combined with instrumental.
image_path_1 … image_path_10 are accepted by lyria-3-pro-preview only, provided sequentially.
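
The rules for instrumental and lyrics can be summarized in a few lines. In the sketch below, music_output_mode and the three returned labels are hypothetical; only the parameter names, the model names, and the 3,500-character lyrics limit come from this page.

```python
from typing import Optional

MAX_LYRICS_CHARS = 3500  # custom lyrics limit for music-2.6 and music-v1


def music_output_mode(model: str, instrumental: bool = False,
                      lyrics: Optional[str] = None) -> str:
    """Return which kind of track a request would produce, or raise if invalid."""
    if lyrics is not None:
        if model == "lyria-3-pro-preview":
            raise ValueError("lyrics is rejected on lyria-3-pro-preview")
        if instrumental:
            raise ValueError("lyrics cannot be combined with instrumental")
        if len(lyrics) > MAX_LYRICS_CHARS:
            raise ValueError("lyrics must be at most 3,500 characters")
        return "custom lyrics"
    if instrumental:
        return "instrumental"
    return "model-written lyrics"


print(music_output_mode("music-2.6", instrumental=True))   # instrumental
print(music_output_mode("music-v1", lyrics="La la la"))    # custom lyrics
print(music_output_mode("lyria-3-pro-preview"))            # model-written lyrics
```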

Status values

audio_status	audio_status_name	audio_status_final
1	COMPLETED	true
5	QUEUED	false
8	FAILED	true
10	GENERATING	false

A failed track (audio_status 8) carries fail_code and a human-readable fail_reason. These are usually transient backend errors — re-submitting the same request often succeeds.
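
A polling loop over these status values might look like the sketch below. It takes any fetch callable that returns the job as a dict containing audio_status and audio_status_final; how that callable reaches the get endpoint (URL, authentication, response envelope) is not covered here, so wait_for_track and its parameters are assumptions.

```python
import time
from typing import Callable

FAILED = 8  # audio_status value for a failed job, from the table above


def wait_for_track(fetch: Callable[[], dict], interval_s: float = 5.0,
                   timeout_s: float = 600.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until audio_status_final is true, then return the job.

    fetch is supplied by the caller and is assumed to return the job as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch()
        if job.get("audio_status_final"):
            return job  # COMPLETED (1) or FAILED (8)
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a final status in time")
        sleep(interval_s)


def is_failed(job: dict) -> bool:
    """True if the finished job failed and may be worth re-submitting."""
    return job.get("audio_status") == FAILED
```

Because failures are described as usually transient, a caller could re-submit the original request once or twice when is_failed returns true, logging fail_code and fail_reason each time.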

Lyria 3 Pro is also available through Flow Music at a lower price with many more features — cover and restyle, lyrics-adjust remix, extend and replace, and audio effects.

Text-to-Speech Models

All speech models share the same endpoints: generate, list, get, plus voices and models for discovery. Generation is asynchronous: poll get until audio_status_final is true, or pass a replyUrl. Credits are billed per character: credits = ceil(chars / N), where N is the model’s chars-per-credit value in the table below.

Model	Provider	Voice settings	Max chars	N (chars/credit)
`eleven-multilingual-v2`	ElevenLabs	stability, similarity_boost, speed, style	10,000	50
`eleven-v3`	ElevenLabs	stability, similarity_boost, speed + audio tags	5,000	50
`eleven-turbo-v2.5`	ElevenLabs	stability, similarity_boost, speed	40,000	100
`speech-2.8-hd` (default)	MiniMax	speed, volume, pitch, emotion	10,000	50
`speech-2.8-turbo`	MiniMax	speed, volume, pitch, emotion	10,000	100
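
The billing formula credits = ceil(chars / N) can be worked through with the values from the table. In this sketch, speech_credits is a hypothetical helper, and counting characters with Python's len is an assumption about how the service counts them.

```python
import math

# model -> (max chars, N chars per credit), copied from the table above
SPEECH_MODELS = {
    "eleven-multilingual-v2": (10_000, 50),
    "eleven-v3": (5_000, 50),
    "eleven-turbo-v2.5": (40_000, 100),
    "speech-2.8-hd": (10_000, 50),
    "speech-2.8-turbo": (10_000, 100),
}


def speech_credits(model: str, text: str) -> int:
    """Estimate the credit cost of a request as ceil(chars / N)."""
    max_chars, chars_per_credit = SPEECH_MODELS[model]
    if len(text) > max_chars:
        raise ValueError(f"{model} accepts at most {max_chars} characters")
    return math.ceil(len(text) / chars_per_credit)


text = "a" * 1234
print(speech_credits("speech-2.8-hd", text))     # 25
print(speech_credits("speech-2.8-turbo", text))  # 13
```

The same 1,234-character text costs roughly half as many credits on a model with N = 100 as on one with N = 50.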

Every request needs a voice_id from GET speech/voices (filtered by model and language). The provider and the paired provider_voice_id are derived automatically.

Voice settings

The two providers take different settings, enforced per family (a setting from the wrong family is rejected with 400):

MiniMax: speed (0.5 to 2), volume (0 to 10), pitch (-12 to 12), emotion (auto, happy, sad, angry, fearful, disgusted, surprised, neutral, calm).
ElevenLabs: stability (0 to 1), similarity_boost (0 to 1), speed (0.7 to 1.2). style (0 to 1) and use_speaker_boost are accepted by eleven-multilingual-v2 only.
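
The per-family rules above can be checked before sending a request. The following sketch encodes the documented ranges; check_voice_settings is a hypothetical helper, and it returns messages where the API would respond with 400.

```python
MINIMAX_MODELS = {"speech-2.8-hd", "speech-2.8-turbo"}
ELEVEN_MODELS = {"eleven-multilingual-v2", "eleven-v3", "eleven-turbo-v2.5"}

MINIMAX_RANGES = {"speed": (0.5, 2), "volume": (0, 10), "pitch": (-12, 12)}
ELEVEN_RANGES = {"stability": (0, 1), "similarity_boost": (0, 1), "speed": (0.7, 1.2)}
EMOTIONS = {"auto", "happy", "sad", "angry", "fearful", "disgusted",
            "surprised", "neutral", "calm"}


def check_voice_settings(model: str, settings: dict) -> list[str]:
    """Return a list of problems with the settings for this model (empty if none)."""
    if model in MINIMAX_MODELS:
        ranges, extras = dict(MINIMAX_RANGES), {"emotion"}
    elif model in ELEVEN_MODELS:
        ranges, extras = dict(ELEVEN_RANGES), set()
        if model == "eleven-multilingual-v2":
            ranges["style"] = (0, 1)
            extras = {"use_speaker_boost"}
    else:
        return [f"unknown model {model!r}"]

    errors: list[str] = []
    for key, value in settings.items():
        if key in ranges:
            low, high = ranges[key]
            if not low <= value <= high:
                errors.append(f"{key} must be between {low} and {high}")
        elif key == "emotion" and key in extras:
            if value not in EMOTIONS:
                errors.append(f"unknown emotion {value!r}")
        elif key not in extras:
            errors.append(f"{key} is not accepted by {model}")
    return errors
```

Note that speed is valid in both families but with different ranges, so a value such as 1.5 passes for MiniMax models and fails for ElevenLabs models.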

Expressive control

Model	Control
`eleven-v3`	inline audio tags in the text — `[whispers]`, `[excited]`, `[shouts]`, `[sighs]`, `[laughs]`, `[curious]`, … Works best with longer, sentence-level text.
`speech-2.8-hd` / `speech-2.8-turbo`	the `emotion` voice setting (a natural, subtle coloring).
`eleven-multilingual-v2`	the `style` voice setting.

eleven-v3 is the most expressive model. The other ElevenLabs models read audio tags literally — use them only with eleven-v3.

Languages

language_code is validated per model against the live models catalog (MiniMax and the ElevenLabs v2/turbo models support 30-40 languages). eleven-v3 auto-detects the language and rejects language_code with 400.

Status values

audio_status	audio_status_name	audio_status_final
1	COMPLETED	true
5	QUEUED	false
8	FAILED	true
10	GENERATING	false

A failed job (audio_status 8) carries fail_code and a human-readable fail_reason. These are usually transient backend errors — re-submitting the same request often succeeds.
