December 5, 2025 (June 26, 2026)
Video Models
Endpoint Compatibility
| Model | create | create-frames | create-fusion | motion-control |
|---|---|---|---|---|
| v6 (default) | ✓ | ✓ | — | — |
| v5.6 | ✓ | ✓ | ✓ | ✓ |
| pixverse-c1 | ✓ | ✓ | ✓ | — |
| seedance-2.0 | ✓ | ✓ | ✓ (+3 ref videos, +3 ref audios) | — |
| seedance-2.0-fast | ✓ | ✓ | ✓ (+3 ref videos, +3 ref audios) | — |
| seedance-2.0-mini | ✓ | ✓ | ✓ (+3 ref videos, +3 ref audios) | — |
| kling-o3 | ✓ | ✓ | ✓ | — |
| kling-v3 | ✓ | ✓ | — | — |
| grok-imagine | ✓ | — | — | — |
| grok-imagine-1.5 | ✓ (i2v only) | — | — | — |
| veo-3.1-lite | ✓ | ✓ | — | — |
| veo-3.1-standard | ✓ | ✓ | — | — |
| veo-3.1-fast | ✓ | ✓ | — | — |
| sora-2 | ✓ | — | — | — |
| sora-2-pro | ✓ | — | — | — |
| happyhorse-1.0 | ✓ | — | — | — |
Fusion notation: v5 uses the legacy @pic1/@pic2/@pic3 notation; all other fusion-capable models use @image1…@imageN (mapped positionally to frame_1_path…frame_N_path). The Seedance fusion family additionally supports @video1…@video3 for reference videos (video_1_path…video_3_path) and @audio1…@audio3 for reference audios (audio_1_path…audio_3_path). This is PixVerse’s omni mode.
grok-imagine-1.5 is image-to-video only — it requires first_frame_path and rejects text-to-video (no-image) requests. The original grok-imagine supports both text-to-video and image-to-video.
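The positional mapping between prompt placeholders and request fields can be sketched as below. This is a minimal illustration, not part of the API: the helper name `placeholder_fields` is hypothetical, and it only covers the unified @image/@video/@audio notation, not the legacy @pic notation.

```python
import re

# Hypothetical helper: maps @imageN / @videoN / @audioN placeholders in a
# fusion prompt to the request field names described in the fusion notation.
_PLACEHOLDER = re.compile(r"@(image|video|audio)(\d+)")
_FIELD_PREFIX = {"image": "frame", "video": "video", "audio": "audio"}


def placeholder_fields(prompt: str) -> list:
    """Return the request fields a prompt refers to, in order of first use."""
    fields = []
    for kind, index in _PLACEHOLDER.findall(prompt):
        name = f"{_FIELD_PREFIX[kind]}_{index}_path"
        if name not in fields:
            fields.append(name)
    return fields


print(placeholder_fields("@image1 dances with @image2 to the beat of @audio1"))
# ['frame_1_path', 'frame_2_path', 'audio_1_path']
```

A check like this can catch a prompt that mentions @image3 when only two frames were attached, before the request is sent.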
Extend, upscale, modify, lipsync
Upscale, modify, and lipsync are native-PixVerse only. Extend supports v6 and the third-party grok-imagine model (480p/720p, 2-10s, native audio).
v5 family (legacy)
v5 carries legacy-only modes — multi-frame create-transition, lipsync, and fusion with the original @pic1/@pic2/@pic3 notation.
| Endpoint | v5 | v5.5 | v5.6 | v5-fast |
|---|---|---|---|---|
| create | ✓ | ✓ | ✓ | ✓ |
| create-frames | ✓ | ✓ | ✓ | — |
| create-transition (2-frame) | ✓ | ✓ | ✓ | — |
| create-transition (3+ frame) | ✓ | — | — | — |
| create-fusion | ✓ | — | ✓ | — |
| extend | — | — | — | — |
| modify | — | ✓ | — | — |
| lipsync | ✓ | — | — | — |
| upscale | ✓ | ✓ | ✓ | ✓ |
v5 accepts both @image1…@imageN (unified) and the legacy @pic1/@pic2/@pic3 synonyms for backward compatibility.
Quality, Duration, Aspect Ratio
| Model | Qualities | Durations | Aspect Ratios | Max ref (fusion) |
|---|---|---|---|---|
| v6 | 360p, 540p, 720p (default), 1080p | 1-15s | 16:9, 9:16, 1:1, 4:3, 3:4 | — |
| v5.6 | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 | 7 imgs |
| v5.5 | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 | — |
| v5 | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 | 3 imgs |
| v5-fast | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 | — |
| pixverse-c1 | 360p, 540p, 720p, 1080p | 1-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 3:2, 2:3 | 7 imgs |
| seedance-2.0 | 480p, 720p, 1080p, 2160p | 4-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 21:9 | 9 imgs + 3 videos + 3 audios |
| seedance-2.0-fast | 480p, 720p | 4-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 21:9 | 9 imgs + 3 videos + 3 audios |
| seedance-2.0-mini | 480p, 720p | 4-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 21:9 | 9 imgs + 3 videos + 3 audios |
| kling-o3 | 720p (Std), 1080p (Pro) | 3-15s | 16:9, 1:1, 9:16 | 7 imgs |
| kling-v3 | 720p (Std), 1080p (Pro) | 3-15s | 16:9, 1:1, 9:16 | — |
| grok-imagine | 480p, 720p | 1-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 3:2, 2:3 | — |
| grok-imagine-1.5 | 480p, 720p | 1-15s | i2v only (from image) | — |
| veo-3.1-lite | 720p, 1080p | 4, 6, 8s | 16:9, 9:16 | — |
| veo-3.1-standard | 720p, 1080p, 2160p | 4, 6, 8s | 16:9, 9:16 | — |
| veo-3.1-fast | 720p, 1080p, 2160p | 4, 6, 8s | 16:9, 9:16 | — |
| sora-2 | 720p | 4, 8, 12s | 16:9, 9:16 | — |
| sora-2-pro | 720p, 1080p | 4, 8, 12s | 16:9, 9:16 | — |
| happyhorse-1.0 | 720p, 1080p | 3-15s | 16:9, 9:16, 1:1, 4:3, 3:4 | — |
- aspect_ratio is required for t2v and fusion, and not accepted for i2v or transition (derived from image).
- For kling-o3 / kling-v3: quality: 720p routes to Std, quality: 1080p routes to Pro.
- For veo-3.1-standard / veo-3.1-fast: quality: 1080p requires duration: 8.
- Seedance fusion (omni mode) is the only path that accepts reference videos and audios. Images go in frame_1_path…frame_9_path, videos in video_1_path…video_3_path (@video1…@video3), and audios in audio_1_path…audio_3_path (@audio1…@audio3). At least one image or reference video is required. Each reference video must be 2-15s, ≤50 MB, 24-60 fps, ≤6000×6000 px, and each reference audio 2-15s. The sum of all reference video durations, and the sum of all reference audio durations, must each be ≤ 15s.
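The Seedance reference limits are easy to get wrong because per-file and per-request limits apply together. The sketch below checks them client-side. The `RefVideo` type and the function name are hypothetical, and the caller is assumed to already know each file's duration, size, frame rate, and resolution.

```python
from dataclasses import dataclass

MAX_TOTAL_SECONDS = 15  # applies separately to videos and to audios


@dataclass
class RefVideo:  # hypothetical container for pre-measured file metadata
    duration_s: float
    size_mb: float
    fps: float
    width: int
    height: int


def seedance_ref_errors(n_images, videos, audio_durations_s):
    """Return a list of limit violations for a Seedance fusion request."""
    errors = []
    if n_images > 9:
        errors.append("at most 9 reference images")
    if len(videos) > 3:
        errors.append("at most 3 reference videos")
    if len(audio_durations_s) > 3:
        errors.append("at most 3 reference audios")
    if n_images == 0 and not videos:
        errors.append("at least one image or reference video is required")
    for i, v in enumerate(videos, start=1):
        if not 2 <= v.duration_s <= 15:
            errors.append(f"video {i}: duration must be 2-15s")
        if v.size_mb > 50:
            errors.append(f"video {i}: size must be <= 50 MB")
        if not 24 <= v.fps <= 60:
            errors.append(f"video {i}: frame rate must be 24-60 fps")
        if v.width > 6000 or v.height > 6000:
            errors.append(f"video {i}: resolution must be <= 6000x6000 px")
    for i, d in enumerate(audio_durations_s, start=1):
        if not 2 <= d <= 15:
            errors.append(f"audio {i}: duration must be 2-15s")
    total_video = sum(v.duration_s for v in videos)
    if total_video > MAX_TOTAL_SECONDS:
        errors.append(f"total reference video duration {total_video:g}s exceeds 15s")
    total_audio = sum(audio_durations_s)
    if total_audio > MAX_TOTAL_SECONDS:
        errors.append(f"total reference audio duration {total_audio:g}s exceeds 15s")
    return errors
```

For example, two 8-second reference videos each pass the per-file check but fail the 15-second total.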
Audio
| Model | audio |
|---|---|
| v6 | toggle |
| v5.6 | toggle |
| v5.5 | toggle |
| v5 | — (use lip_sync_tts_prompt + sound_effect_prompt) |
| v5-fast | — |
| pixverse-c1 | toggle |
| seedance-2.0 | toggle |
| seedance-2.0-fast | toggle |
| seedance-2.0-mini | toggle |
| kling-o3 | toggle |
| kling-v3 | toggle |
| grok-imagine | rejected |
| grok-imagine-1.5 | rejected |
| veo-3.1-lite | rejected |
| veo-3.1-standard | always on |
| veo-3.1-fast | always on |
| sora-2 | rejected |
| sora-2-pro | rejected |
| happyhorse-1.0 | always on |
- toggle: the model accepts audio: true or audio: false.
- always on: audio is generated automatically; audio: false is rejected.
- rejected: the audio parameter is not accepted (content has no audio track or audio is handled internally).
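The three audio behaviours can be encoded as a small lookup. This is a sketch under assumptions: the function name is hypothetical, v5 and v5-fast are left out because the table gives them no policy label, and passing audio: true to an always-on model is assumed to be harmless since only audio: false is documented as rejected.

```python
from typing import Optional

# Policy per model, transcribed from the audio table (v5 / v5-fast omitted).
AUDIO_POLICY = {
    "v6": "toggle", "v5.6": "toggle", "v5.5": "toggle", "pixverse-c1": "toggle",
    "seedance-2.0": "toggle", "seedance-2.0-fast": "toggle",
    "seedance-2.0-mini": "toggle", "kling-o3": "toggle", "kling-v3": "toggle",
    "veo-3.1-standard": "always on", "veo-3.1-fast": "always on",
    "happyhorse-1.0": "always on",
    "grok-imagine": "rejected", "grok-imagine-1.5": "rejected",
    "veo-3.1-lite": "rejected", "sora-2": "rejected", "sora-2-pro": "rejected",
}


def audio_error(model: str, audio: Optional[bool]) -> Optional[str]:
    """Return why the audio value would be refused, or None if it looks fine.

    audio=None means the parameter is omitted from the request.
    """
    policy = AUDIO_POLICY.get(model)
    if policy is None or audio is None:
        return None
    if policy == "rejected":
        return f"{model} does not accept the audio parameter"
    if policy == "always on" and audio is False:
        return f"{model} always generates audio; audio: false is rejected"
    return None


print(audio_error("veo-3.1-fast", False))
# veo-3.1-fast always generates audio; audio: false is rejected
```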
Native PixVerse — extra flags
multi_shot, preview_mode, off_peak_mode, and seed are supported only on native PixVerse models. Third-party models reject them.
| Model | multi_shot | preview_mode | off_peak_mode | seed |
|---|---|---|---|---|
| v6 | ✓ | ✓ | ✓ | ✓ |
| v5.6 | — | ✓ | ✓ | ✓ |
| v5.5 | — | ✓ | ✓ | ✓ |
| v5 | — | ✓ | ✓ | ✓ |
| v5-fast | — | ✓ | ✓ | ✓ |
| pixverse-c1 | — | ✓ | ✓ | ✓ |
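A request builder can strip or flag unsupported extras before sending. The following sketch transcribes the table above; the helper name is hypothetical, and any model not listed is treated as third-party with no native flags.

```python
# Flags shared by every native PixVerse model in the table.
_COMMON = {"preview_mode", "off_peak_mode", "seed"}
ALL_NATIVE_FLAGS = _COMMON | {"multi_shot"}

NATIVE_FLAG_SUPPORT = {
    "v6": ALL_NATIVE_FLAGS,
    "v5.6": _COMMON,
    "v5.5": _COMMON,
    "v5": _COMMON,
    "v5-fast": _COMMON,
    "pixverse-c1": _COMMON,
}


def unsupported_flags(model: str, params: dict) -> list:
    """Return the native-only flags in params that the model does not support."""
    allowed = NATIVE_FLAG_SUPPORT.get(model, set())  # third-party: none
    return sorted(k for k in params if k in ALL_NATIVE_FLAGS and k not in allowed)


print(unsupported_flags("sora-2", {"prompt": "a fox", "seed": 7}))
# ['seed']
```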
Image Models
All image models share the same endpoints: create, list, get, delete.
| Model | Qualities | Max Refs | Est. Time |
|---|---|---|---|
| qwen-image (default) | 720p, 1080p | 3 | ~3s |
| nano-banana | 1080p | 3 | ~10s |
| nano-banana-2 | 512p, 1080p, 1440p, 2160p | 9 | ~30s |
| nano-banana-pro | 1080p, 1440p, 2160p | 9 | ~60s |
| seedream-4.0 | 1080p, 1440p, 2160p | 6 | ~10s |
| seedream-4.5 | 1440p, 2160p | 6 | ~15s |
| seedream-5.0-lite | 1440p, 1800p | 6 | ~30s |
| kling-3.0 | 1080p, 1440p | 1 | ~15s |
| kling-o3 | 1080p, 1440p, 2160p | 1 | ~20s |
| gpt-image-2.0 | 1080p, 1440p, 2160p | 9 | ~30s |
- create_count: 1-4 (default 1).
- detail_level (gpt-image-2.0 only, required): low, medium, high. Rejected for all other models. Affects credit cost (low = 0.5×, medium = 1×, high = 2× of the per-quality base).
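The detail_level rules combine a requirement (for gpt-image-2.0), a rejection (for everything else), and a cost multiplier. A minimal sketch follows; the function name is hypothetical, and the per-quality base credit values are not listed here, so the function returns only the multiplier.

```python
from typing import Optional

DETAIL_MULTIPLIER = {"low": 0.5, "medium": 1.0, "high": 2.0}


def image_credit_multiplier(model: str, create_count: int = 1,
                            detail_level: Optional[str] = None) -> float:
    """Validate create_count/detail_level and return the credit multiplier."""
    if not 1 <= create_count <= 4:
        raise ValueError("create_count must be 1-4")
    if model == "gpt-image-2.0":
        if detail_level not in DETAIL_MULTIPLIER:
            raise ValueError("detail_level (low, medium, high) is required for gpt-image-2.0")
        return DETAIL_MULTIPLIER[detail_level]
    if detail_level is not None:
        raise ValueError(f"detail_level is rejected for {model}")
    return 1.0


print(image_credit_multiplier("gpt-image-2.0", detail_level="high"))  # 2.0
```

With a known per-quality base, the cost of one image would be the base times this multiplier.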
Aspect ratios
Each model accepts its own list. If aspect_ratio is omitted, the value in the Default column is used. Passing a value not in the model’s row is rejected with 400.
| Models | Default | Accepted aspect_ratio values |
|---|---|---|
| nano-banana, nano-banana-2, nano-banana-pro, seedream-4.0, seedream-4.5, seedream-5.0-lite | auto | auto, 1:1, 16:9, 9:16, 4:3, 3:4, 5:4, 4:5, 3:2, 2:3, 21:9 |
| qwen-image | 1:1 | 1:1, 16:9, 9:16, 4:3, 3:4, 5:4, 4:5, 3:2, 2:3, 21:9 |
| kling-o3, kling-3.0 | 1:1 | 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3, 21:9 |
| gpt-image-2.0 | 1:1 | 1:1, 1:2, 4:3, 3:4, 3:2, 2:3, 16:9, 9:16, 2:1, 21:9 |
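The default-and-validate behaviour for image aspect ratios can be mirrored client-side. The sketch below transcribes the table; `resolve_aspect_ratio` is a hypothetical helper, and it raises ValueError where the API would answer with 400.

```python
_WIDE = ["1:1", "16:9", "9:16", "4:3", "3:4", "5:4", "4:5", "3:2", "2:3", "21:9"]
_AUTO_MODELS = ["nano-banana", "nano-banana-2", "nano-banana-pro",
                "seedream-4.0", "seedream-4.5", "seedream-5.0-lite"]
_KLING = ["1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3", "21:9"]
_GPT = ["1:1", "1:2", "4:3", "3:4", "3:2", "2:3", "16:9", "9:16", "2:1", "21:9"]

# model -> (default, accepted values); kling-o3 here is the image model.
ASPECT_RATIOS = {m: ("auto", ["auto"] + _WIDE) for m in _AUTO_MODELS}
ASPECT_RATIOS["qwen-image"] = ("1:1", _WIDE)
ASPECT_RATIOS["kling-o3"] = ("1:1", _KLING)
ASPECT_RATIOS["kling-3.0"] = ("1:1", _KLING)
ASPECT_RATIOS["gpt-image-2.0"] = ("1:1", _GPT)


def resolve_aspect_ratio(model, aspect_ratio=None):
    """Return the effective aspect ratio, or raise if the model rejects it."""
    default, accepted = ASPECT_RATIOS[model]
    if aspect_ratio is None:
        return default
    if aspect_ratio not in accepted:
        raise ValueError(f"{aspect_ratio} is not accepted by {model}")
    return aspect_ratio


print(resolve_aspect_ratio("seedream-4.5"))         # auto
print(resolve_aspect_ratio("gpt-image-2.0", "2:1"))  # 2:1
```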
Unlimited Image Generation (Relax Mode)
Pro+ subscription plans include unlimited image generation in Relax Mode for select models:
| Plan | Price | Unlimited Models |
|---|---|---|
| Pro | $30/m | qwen-image |
| Premium | $60/m | qwen-image, nano-banana, nano-banana-2, seedream-4.0 |
| Ultra | $199/m | all 10 image models |
Music Models
All music models share the same endpoints: generate, list, get. Generation is asynchronous — poll audio_status_final or pass a replyUrl. All models generate at an automatic duration (typically 2-5 minutes).
| Model | Provider | Prompt max | Custom lyrics | Reference images | Credits |
|---|---|---|---|---|---|
| music-2.6 (default) | MiniMax | 2,000 | ≤ 3,500 chars | — | 40 |
| music-v1 | ElevenLabs | 4,000 | ≤ 3,500 chars | — | 150 |
| lyria-3-pro-preview | Google | 5,000 | — | up to 10 | 20 |
Output modes
The output is derived from instrumental and lyrics — there is no auto_lyrics parameter.
| Request | Result |
|---|---|
| instrumental: true | instrumental, no vocals |
| lyrics provided | vocals sung from your lyrics (music-2.6 / music-v1 only) |
| neither | vocals with lyrics written by the model |
- lyrics is rejected on lyria-3-pro-preview and cannot be combined with instrumental.
- image_path_1…image_path_10 are accepted by lyria-3-pro-preview only, provided sequentially.
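Because the output mode is derived rather than set explicitly, it can help to compute it before submitting. The sketch below follows the table and rules above; the function name and the returned labels are hypothetical, and the 3,500-character lyrics limit is taken from the model table.

```python
from typing import Optional

MAX_LYRICS_CHARS = 3500


def music_output_mode(model: str, instrumental: bool = False,
                      lyrics: Optional[str] = None) -> str:
    """Return which kind of track a request would produce, or raise if invalid."""
    if lyrics:
        if instrumental:
            raise ValueError("lyrics cannot be combined with instrumental")
        if model == "lyria-3-pro-preview":
            raise ValueError("lyrics is rejected on lyria-3-pro-preview")
        if len(lyrics) > MAX_LYRICS_CHARS:
            raise ValueError("lyrics must be at most 3,500 characters")
        return "vocals from your lyrics"
    if instrumental:
        return "instrumental"
    return "vocals with model-written lyrics"


print(music_output_mode("music-2.6"))                      # vocals with model-written lyrics
print(music_output_mode("music-v1", instrumental=True))    # instrumental
```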
Status values
| audio_status | audio_status_name | audio_status_final |
|---|---|---|
| 1 | COMPLETED | true |
| 5 | QUEUED | false |
| 8 | FAILED | true |
| 10 | GENERATING | false |
A failed track (audio_status 8) carries fail_code and a human-readable fail_reason. These are usually transient backend errors — re-submitting the same request often succeeds.
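Polling on audio_status_final might look like the sketch below. The HTTP call itself is deliberately left out: `fetch_job` is a hypothetical callable you supply that performs the get request and returns the parsed job as a dict. The interval and timeout values are arbitrary choices, not documented limits.

```python
import time


def wait_for_final(fetch_job, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Poll until audio_status_final is true; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        if job["audio_status_final"]:
            if job["audio_status"] == 8:  # FAILED
                raise RuntimeError(f"{job.get('fail_code')}: {job.get('fail_reason')}")
            return job  # audio_status 1, COMPLETED
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a final status in time")
        sleep(interval_s)


# Simulated responses standing in for real API calls.
fake = iter([
    {"audio_status": 5, "audio_status_final": False},   # QUEUED
    {"audio_status": 10, "audio_status_final": False},  # GENERATING
    {"audio_status": 1, "audio_status_final": True},    # COMPLETED
])
print(wait_for_final(lambda: next(fake), sleep=lambda s: None)["audio_status"])  # 1
```

Since failures are usually transient, a caller could catch the RuntimeError and re-submit the same request once or twice before giving up.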
Lyria 3 Pro is also available through Flow Music at a lower price with many more features — cover and restyle, lyrics-adjust remix, extend and replace, and audio effects.
Text-to-Speech Models
All speech models share the same endpoints: generate, list, get, plus voices and models for discovery. Generation is asynchronous — poll audio_status_final or pass a replyUrl. Credits are billed per character — credits = ceil(chars / N).
| Model | Provider | Voice settings | Max chars | N (chars/credit) |
|---|---|---|---|---|
| eleven-multilingual-v2 | ElevenLabs | stability, similarity_boost, speed, style | 10,000 | 50 |
| eleven-v3 | ElevenLabs | stability, similarity_boost, speed + audio tags | 5,000 | 50 |
| eleven-turbo-v2.5 | ElevenLabs | stability, similarity_boost, speed | 40,000 | 100 |
| speech-2.8-hd (default) | MiniMax | speed, volume, pitch, emotion | 10,000 | 50 |
| speech-2.8-turbo | MiniMax | speed, volume, pitch, emotion | 10,000 | 100 |
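The per-character billing formula, credits = ceil(chars / N), can be estimated before submitting. In this sketch the function name is hypothetical, and counting characters with Python's len() is an assumption about how the service counts them.

```python
import math

# (max chars, chars per credit) transcribed from the table above.
SPEECH_MODELS = {
    "eleven-multilingual-v2": (10_000, 50),
    "eleven-v3": (5_000, 50),
    "eleven-turbo-v2.5": (40_000, 100),
    "speech-2.8-hd": (10_000, 50),
    "speech-2.8-turbo": (10_000, 100),
}


def speech_credits(model: str, text: str) -> int:
    """Estimate credits for a text: ceil(chars / N), after a length check."""
    max_chars, chars_per_credit = SPEECH_MODELS[model]
    if len(text) > max_chars:
        raise ValueError(f"{model} accepts at most {max_chars} characters")
    return math.ceil(len(text) / chars_per_credit)


text = "x" * 120
print(speech_credits("speech-2.8-hd", text))     # 3  (ceil(120 / 50))
print(speech_credits("speech-2.8-turbo", text))  # 2  (ceil(120 / 100))
```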
Every request needs a voice_id from GET speech/voices (filtered by model and language). The provider and the paired provider_voice_id are derived automatically.
Voice settings
The two providers take different settings, enforced per family (a setting from the wrong family is rejected with 400):
- MiniMax: speed (0.5-2), volume (0-10), pitch (-12 to 12), emotion (auto, happy, sad, angry, fearful, disgusted, surprised, neutral, calm).
- ElevenLabs: stability (0-1), similarity_boost (0-1), speed (0.7-1.2). style (0-1) and use_speaker_boost are accepted by eleven-multilingual-v2 only.
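Since a setting from the wrong family is rejected, it can be useful to check settings locally first. The sketch below is one way to do that; the function name is hypothetical, and inferring the provider family from the model name prefix (speech- or eleven-) is an assumption made for brevity.

```python
MINIMAX_RANGES = {"speed": (0.5, 2.0), "volume": (0.0, 10.0), "pitch": (-12.0, 12.0)}
MINIMAX_EMOTIONS = {"auto", "happy", "sad", "angry", "fearful",
                    "disgusted", "surprised", "neutral", "calm"}
ELEVEN_RANGES = {"stability": (0.0, 1.0), "similarity_boost": (0.0, 1.0),
                 "speed": (0.7, 1.2)}


def voice_setting_errors(model: str, settings: dict) -> list:
    """Return a list of problems with the voice settings for this model."""
    if model.startswith("speech-"):        # assumed MiniMax family
        ranges, flags, emotions = MINIMAX_RANGES, set(), MINIMAX_EMOTIONS
    elif model.startswith("eleven-"):      # assumed ElevenLabs family
        ranges, flags, emotions = dict(ELEVEN_RANGES), set(), None
        if model == "eleven-multilingual-v2":
            ranges["style"] = (0.0, 1.0)
            flags = {"use_speaker_boost"}
    else:
        return [f"unknown model {model}"]

    errors = []
    for name, value in settings.items():
        if name in ranges:
            lo, hi = ranges[name]
            if not lo <= value <= hi:
                errors.append(f"{name}={value} is outside {lo} to {hi}")
        elif name == "emotion" and emotions is not None:
            if value not in emotions:
                errors.append(f"emotion={value} is not a known emotion")
        elif name not in flags:
            errors.append(f"{name} is not accepted by {model}")
    return errors


print(voice_setting_errors("speech-2.8-hd", {"stability": 0.5}))
# ['stability is not accepted by speech-2.8-hd']
```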
Expressive control
| Model | Control |
|---|---|
| eleven-v3 | inline audio tags in the text: [whispers], [excited], [shouts], [sighs], [laughs], [curious], … Works best with longer, sentence-level text. |
| speech-2.8-hd / speech-2.8-turbo | the emotion voice setting (a natural, subtle coloring). |
| eleven-multilingual-v2 | the style voice setting. |
eleven-v3 is the most expressive model. The other ElevenLabs models read audio tags literally — use them only with eleven-v3.
Languages
language_code is validated per model against the live models catalog (MiniMax and the ElevenLabs v2/turbo models support 30-40 languages). eleven-v3 auto-detects the language and rejects language_code with 400.
Status values
| audio_status | audio_status_name | audio_status_final |
|---|---|---|
| 1 | COMPLETED | true |
| 5 | QUEUED | false |
| 8 | FAILED | true |
| 10 | GENERATING | false |
A failed job (audio_status 8) carries fail_code and a human-readable fail_reason. These are usually transient backend errors — re-submitting the same request often succeeds.