December 5, 2025 (June 26, 2026)

Table of contents

  1. Video Models
    1. Endpoint Compatibility
      1. Extend, upscale, modify, lipsync
      2. v5 family (legacy)
    2. Quality, Duration, Aspect Ratio
    3. Audio
    4. Native PixVerse — extra flags
  2. Image Models
    1. Aspect ratios
    2. Unlimited Image Generation (Relax Mode)
  3. Music Models
    1. Output modes
    2. Status values
  4. Text-to-Speech Models
    1. Voice settings
    2. Expressive control
    3. Languages
    4. Status values

Video Models

Endpoint Compatibility

Model create create-frames create-fusion motion-control
v6 (default)
v5.6
pixverse-c1
seedance-2.0 ✓ (+3 ref videos, +3 ref audios)
seedance-2.0-fast ✓ (+3 ref videos, +3 ref audios)
seedance-2.0-mini ✓ (+3 ref videos, +3 ref audios)
kling-o3
kling-v3
grok-imagine
grok-imagine-1.5 (i2v only)
veo-3.1-lite
veo-3.1-standard
veo-3.1-fast
sora-2
sora-2-pro
happyhorse-1.0

Fusion notation: v5 uses @pic1/@pic2/@pic3; all other fusion-capable models use @image1 through @imageN (mapped positionally to frame_1_path through frame_N_path). The Seedance fusion family additionally supports @video1 through @video3 for reference videos (video_1_path through video_3_path) and @audio1 through @audio3 for reference audios (audio_1_path through audio_3_path). This is PixVerse’s omni mode.
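The positional mapping can be sketched in a few lines. This is an illustration only: the helper name fusion_fields and the file names are hypothetical, and only the frame_N_path, video_N_path, and audio_N_path field names come from the text above.

```python
def fusion_fields(images, videos=(), audios=()):
    """Map ordered reference files to positional request fields.

    The i-th image is sent as frame_{i}_path and referenced in the prompt
    as @image{i}; videos and audios follow the same pattern.
    """
    fields = {}
    for i, path in enumerate(images, start=1):
        fields[f"frame_{i}_path"] = path
    for i, path in enumerate(videos, start=1):
        fields[f"video_{i}_path"] = path
    for i, path in enumerate(audios, start=1):
        fields[f"audio_{i}_path"] = path
    return fields


# Placeholder file names; real requests would use paths the API accepts.
fields = fusion_fields(["cat.png", "hat.png"], videos=["walk.mp4"])
# A prompt such as "@image1 wearing @image2, moving like @video1"
# would then refer to frame_1_path, frame_2_path, and video_1_path.
```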

grok-imagine-1.5 is image-to-video only — it requires first_frame_path and rejects text-to-video (no-image) requests. The original grok-imagine supports both text-to-video and image-to-video.

Extend, upscale, modify, lipsync

Upscale, modify, and lipsync are native-PixVerse only. Extend supports v6 and the third-party grok-imagine model.

Endpoint v6 v5 v5.5 v5.6
extend
upscale
modify
lipsync

extend also accepts grok-imagine (480p/720p, 2-10s, native audio).

v5 family (legacy)

v5 carries legacy-only modes — multi-frame create-transition, lipsync, and fusion with the original @pic1/@pic2/@pic3 notation.

Endpoint v5 v5.5 v5.6 v5-fast
create
create-frames
create-transition (2-frame)
create-transition (3+ frame)
create-fusion
extend
modify
lipsync
upscale

v5 accepts both @image1 through @imageN (unified) and the legacy @pic1/@pic2/@pic3 synonyms for backward compatibility.

Quality, Duration, Aspect Ratio

Model | Qualities | Durations | Aspect Ratios | Max ref (fusion)
v6 | 360p, 540p, 720p (default), 1080p | 1-15s | 16:9, 9:16, 1:1, 4:3, 3:4 |
v5.6 | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 | 7 imgs
v5.5 | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 |
v5 | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 | 3 imgs
v5-fast | 360p, 540p (default), 720p, 1080p | 1-10s (1080p max 8s) | 16:9, 9:16, 1:1, 4:3, 3:4 |
pixverse-c1 | 360p, 540p, 720p, 1080p | 1-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 3:2, 2:3 | 7 imgs
seedance-2.0 | 480p, 720p, 1080p, 2160p | 4-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 21:9 | 9 imgs + 3 videos + 3 audios
seedance-2.0-fast | 480p, 720p | 4-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 21:9 | 9 imgs + 3 videos + 3 audios
seedance-2.0-mini | 480p, 720p | 4-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 21:9 | 9 imgs + 3 videos + 3 audios
kling-o3 | 720p (Std), 1080p (Pro) | 3-15s | 16:9, 1:1, 9:16 | 7 imgs
kling-v3 | 720p (Std), 1080p (Pro) | 3-15s | 16:9, 1:1, 9:16 |
grok-imagine | 480p, 720p | 1-15s | 16:9, 4:3, 1:1, 3:4, 9:16, 3:2, 2:3 |
grok-imagine-1.5 | 480p, 720p | 1-15s | i2v only (from image) |
veo-3.1-lite | 720p, 1080p | 4, 6, 8s | 16:9, 9:16 |
veo-3.1-standard | 720p, 1080p, 2160p | 4, 6, 8s | 16:9, 9:16 |
veo-3.1-fast | 720p, 1080p, 2160p | 4, 6, 8s | 16:9, 9:16 |
sora-2 | 720p | 4, 8, 12s | 16:9, 9:16 |
sora-2-pro | 720p, 1080p | 4, 8, 12s | 16:9, 9:16 |
happyhorse-1.0 | 720p, 1080p | 3-15s | 16:9, 9:16, 1:1, 4:3, 3:4 |
  • aspect_ratio is required for t2v and fusion, not accepted for i2v or transition (derived from image).
  • For kling-o3 / kling-v3: quality: 720p routes to Std, quality: 1080p routes to Pro.
  • For veo-3.1-standard / veo-3.1-fast: quality: 1080p requires duration: 8.
  • Seedance fusion (omni mode) is the only path that accepts reference videos and audios. Images go in frame_1_path through frame_9_path, videos in video_1_path through video_3_path (@video1 through @video3), and audios in audio_1_path through audio_3_path (@audio1 through @audio3). At least one image or reference video is required. Each reference video must be 2-15s, ≤50 MB, 24-60 fps, and ≤6000×6000 px; each reference audio must be 2-15s. The combined duration of all reference videos must be ≤15s, and so must the combined duration of all reference audios.
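The Seedance reference limits can be checked on the client before submitting. The sketch below is one possible pre-flight check, not the service's own validation: the function name and the dictionary keys (duration_s, size_mb, fps, width, height) are hypothetical, and the limits are treated as inclusive.

```python
def check_seedance_refs(images, videos, audios):
    """Return a list of problems; an empty list means the references pass.

    images: list of image paths
    videos: list of dicts with duration_s, size_mb, fps, width, height
    audios: list of audio durations in seconds
    """
    problems = []
    if len(images) > 9:
        problems.append("more than 9 images")
    if len(videos) > 3:
        problems.append("more than 3 reference videos")
    if len(audios) > 3:
        problems.append("more than 3 reference audios")
    if not images and not videos:
        problems.append("at least one image or reference video is required")
    for n, v in enumerate(videos, start=1):
        if not 2 <= v["duration_s"] <= 15:
            problems.append(f"video {n}: duration must be 2-15s")
        if v["size_mb"] > 50:
            problems.append(f"video {n}: larger than 50 MB")
        if not 24 <= v["fps"] <= 60:
            problems.append(f"video {n}: frame rate must be 24-60 fps")
        if v["width"] > 6000 or v["height"] > 6000:
            problems.append(f"video {n}: larger than 6000x6000 px")
    for n, duration in enumerate(audios, start=1):
        if not 2 <= duration <= 15:
            problems.append(f"audio {n}: duration must be 2-15s")
    if sum(v["duration_s"] for v in videos) > 15:
        problems.append("total reference video duration exceeds 15s")
    if sum(audios) > 15:
        problems.append("total reference audio duration exceeds 15s")
    return problems
```

Two 10-second reference videos each pass the per-file limit, for example, but fail the combined 15-second limit.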

Audio

Model audio
v6 toggle
v5.6 toggle
v5.5 toggle
v5 (use lip_sync_tts_prompt + sound_effect_prompt)
v5-fast
pixverse-c1 toggle
seedance-2.0 toggle
seedance-2.0-fast toggle
seedance-2.0-mini toggle
kling-o3 toggle
kling-v3 toggle
grok-imagine rejected
grok-imagine-1.5 rejected
veo-3.1-lite rejected
veo-3.1-standard always on
veo-3.1-fast always on
sora-2 rejected
sora-2-pro rejected
happyhorse-1.0 always on
  • toggle means the model accepts audio: true / false.
  • always on means audio is generated automatically; audio: false is rejected.
  • rejected means the audio parameter is not accepted (content has no audio track or audio is handled internally).
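The three behaviors can be captured in a lookup table. This is a sketch only: the names AUDIO_MODE and audio_param_allowed are hypothetical, v5 and v5-fast are left out because the table gives them no simple mode, and two points are assumptions rather than documented behavior (omitting audio is always fine, and audio: true is accepted by always-on models).

```python
AUDIO_MODE = {
    "v6": "toggle", "v5.6": "toggle", "v5.5": "toggle", "pixverse-c1": "toggle",
    "seedance-2.0": "toggle", "seedance-2.0-fast": "toggle",
    "seedance-2.0-mini": "toggle", "kling-o3": "toggle", "kling-v3": "toggle",
    "veo-3.1-standard": "always on", "veo-3.1-fast": "always on",
    "happyhorse-1.0": "always on",
    "grok-imagine": "rejected", "grok-imagine-1.5": "rejected",
    "veo-3.1-lite": "rejected", "sora-2": "rejected", "sora-2-pro": "rejected",
}


def audio_param_allowed(model, audio=None):
    """Return True if sending this audio value should be accepted.

    audio=None means the parameter is omitted from the request.
    """
    mode = AUDIO_MODE[model]
    if audio is None:
        return True  # assumption: omitting the parameter is always fine
    if mode == "toggle":
        return True
    if mode == "always on":
        return audio is True  # only audio: false is documented as rejected
    return False  # "rejected": the parameter itself is not accepted
```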

Native PixVerse — extra flags

multi_shot, preview_mode, off_peak_mode, and seed are supported only on native PixVerse models. Third-party models reject them.

Model multi_shot preview_mode off_peak_mode seed
v6
v5.6
v5.5
v5
v5-fast
pixverse-c1

Image Models

All image models share the same endpoints: create, list, get, delete.

Model | Qualities | Max Refs | Est. Time
qwen-image (default) | 720p, 1080p | 3 | ~3s
nano-banana | 1080p | 3 | ~10s
nano-banana-2 | 512p, 1080p, 1440p, 2160p | 9 | ~30s
nano-banana-pro | 1080p, 1440p, 2160p | 9 | ~60s
seedream-4.0 | 1080p, 1440p, 2160p | 6 | ~10s
seedream-4.5 | 1440p, 2160p | 6 | ~15s
seedream-5.0-lite | 1440p, 1800p | 6 | ~30s
kling-3.0 | 1080p, 1440p | 1 | ~15s
kling-o3 | 1080p, 1440p, 2160p | 1 | ~20s
gpt-image-2.0 | 1080p, 1440p, 2160p | 9 | ~30s
  • create_count: 1-4 (default 1).
  • detail_level (gpt-image-2.0 only, required): low, medium, high. Rejected for all other models. Affects credit cost (low = 0.5×, medium = 1×, high = 2× of the per-quality base).
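The detail_level multiplier can be expressed as a small helper. This is a sketch under stated assumptions: the per-quality base cost is not listed here, so base_credits is a hypothetical input, and the function name is illustrative.

```python
DETAIL_MULTIPLIER = {"low": 0.5, "medium": 1.0, "high": 2.0}


def image_credit_cost(base_credits, model, detail_level=None):
    """Apply the detail_level multiplier to a per-quality base cost.

    base_credits is a placeholder for the per-quality base, which this
    page does not list.
    """
    if model == "gpt-image-2.0":
        if detail_level not in DETAIL_MULTIPLIER:
            raise ValueError("gpt-image-2.0 requires detail_level: low, medium, or high")
        return base_credits * DETAIL_MULTIPLIER[detail_level]
    if detail_level is not None:
        raise ValueError("detail_level is accepted by gpt-image-2.0 only")
    return base_credits
```

With a hypothetical base of 10 credits, high detail would cost 20 and low detail 5.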

Aspect ratios

Each model accepts its own list. If aspect_ratio is omitted, the value in the Default column is used. A value not listed in the model’s row is rejected with 400.

Models | Default | Accepted aspect_ratio values
nano-banana, nano-banana-2, nano-banana-pro, seedream-4.0, seedream-4.5, seedream-5.0-lite | auto | auto, 1:1, 16:9, 9:16, 4:3, 3:4, 5:4, 4:5, 3:2, 2:3, 21:9
qwen-image | 1:1 | 1:1, 16:9, 9:16, 4:3, 3:4, 5:4, 4:5, 3:2, 2:3, 21:9
kling-o3, kling-3.0 | 1:1 | 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3, 21:9
gpt-image-2.0 | 1:1 | 1:1, 1:2, 4:3, 3:4, 3:2, 2:3, 16:9, 9:16, 2:1, 21:9
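A client can mirror this table to resolve the default and catch invalid values before the request is sent. The sketch below is illustrative: ASPECT_RATIOS and resolve_aspect_ratio are hypothetical names, and it relies on the fact that each row lists its default value first.

```python
_COMMON = ["1:1", "16:9", "9:16", "4:3", "3:4", "5:4", "4:5", "3:2", "2:3", "21:9"]

ASPECT_RATIOS = {"qwen-image": list(_COMMON)}
for _model in ("nano-banana", "nano-banana-2", "nano-banana-pro",
               "seedream-4.0", "seedream-4.5", "seedream-5.0-lite"):
    ASPECT_RATIOS[_model] = ["auto"] + _COMMON
for _model in ("kling-o3", "kling-3.0"):
    ASPECT_RATIOS[_model] = ["1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3", "21:9"]
ASPECT_RATIOS["gpt-image-2.0"] = ["1:1", "1:2", "4:3", "3:4", "3:2", "2:3",
                                  "16:9", "9:16", "2:1", "21:9"]


def resolve_aspect_ratio(model, aspect_ratio=None):
    """Return the aspect ratio to use, or raise if the model would reject it."""
    accepted = ASPECT_RATIOS[model]
    if aspect_ratio is None:
        return accepted[0]  # the default is the first value in each row
    if aspect_ratio not in accepted:
        raise ValueError(f"{aspect_ratio} is not accepted by {model} (would be a 400)")
    return aspect_ratio
```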

Unlimited Image Generation (Relax Mode)

Pro+ subscription plans include unlimited image generation in Relax Mode for select models:

Plan | Price | Unlimited Models
Pro | $30/month | qwen-image
Premium | $60/month | qwen-image, nano-banana, nano-banana-2, seedream-4.0
Ultra | $199/month | all 10 image models

Music Models

All music models share the same endpoints: generate, list, get. Generation is asynchronous: poll until audio_status_final is true, or pass a replyUrl. Track duration is chosen automatically by the model (typically 2-5 minutes).

Model | Provider | Prompt max | Custom lyrics | Reference images | Credits
music-2.6 (default) | MiniMax | 2,000 | ≤ 3,500 chars | | 40
music-v1 | ElevenLabs | 4,000 | ≤ 3,500 chars | | 150
lyria-3-pro-preview | Google | 5,000 | | up to 10 | 20

Output modes

The output is derived from instrumental and lyrics — there is no auto_lyrics parameter.

Request | Result
instrumental: true | instrumental, no vocals
lyrics provided | vocals sung from your lyrics (music-2.6 / music-v1 only)
neither | vocals with lyrics written by the model
  • lyrics is rejected on lyria-3-pro-preview and cannot be combined with instrumental.
  • image_path_1 through image_path_10 are accepted by lyria-3-pro-preview only, provided sequentially.
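The rules above amount to a small decision function. This is a client-side sketch, not the service's implementation: the function name and the returned labels are hypothetical, and the 3,500-character lyrics limit is taken from the model table.

```python
LYRICS_MAX_CHARS = 3500


def music_output_mode(model, instrumental=False, lyrics=None):
    """Derive the output mode from instrumental and lyrics, or raise if invalid."""
    if lyrics is not None:
        if model == "lyria-3-pro-preview":
            raise ValueError("lyrics is rejected on lyria-3-pro-preview")
        if instrumental:
            raise ValueError("lyrics cannot be combined with instrumental")
        if len(lyrics) > LYRICS_MAX_CHARS:
            raise ValueError("lyrics longer than 3,500 chars")
        return "vocals from custom lyrics"
    if instrumental:
        return "instrumental"
    return "vocals with model-written lyrics"
```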

Status values

audio_status audio_status_name audio_status_final
1 COMPLETED true
5 QUEUED false
8 FAILED true
10 GENERATING false

A failed track (audio_status 8) carries fail_code and a human-readable fail_reason. These are usually transient backend errors — re-submitting the same request often succeeds.
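A polling loop only needs to watch audio_status_final, since both COMPLETED and FAILED are final. The sketch below is a minimal illustration: wait_for_final and its fetch_status argument are hypothetical (the caller supplies a function that returns the job record, for example by calling the get endpoint), and the interval and timeout values are arbitrary. The simulated responses stand in for real API calls.

```python
import time


def wait_for_final(fetch_status, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Poll fetch_status() until audio_status_final is true, then return the record."""
    waited = 0.0
    while True:
        record = fetch_status()
        if record["audio_status_final"]:
            return record
        if waited >= timeout_s:
            raise TimeoutError("job did not reach a final status in time")
        sleep(interval_s)
        waited += interval_s


# Simulated responses standing in for real API calls.
_fake = iter([
    {"audio_status": 5, "audio_status_name": "QUEUED", "audio_status_final": False},
    {"audio_status": 10, "audio_status_name": "GENERATING", "audio_status_final": False},
    {"audio_status": 1, "audio_status_name": "COMPLETED", "audio_status_final": True},
])
result = wait_for_final(lambda: next(_fake), sleep=lambda seconds: None)
print(result["audio_status_name"])  # COMPLETED
```

After the loop returns, check audio_status: a value of 8 means the job failed and the record carries fail_code and fail_reason.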

Lyria 3 Pro is also available through Flow Music at a lower price with many more features — cover and restyle, lyrics-adjust remix, extend and replace, and audio effects.

Text-to-Speech Models

All speech models share the same endpoints: generate, list, get, plus voices and models for discovery. Generation is asynchronous: poll until audio_status_final is true, or pass a replyUrl. Credits are billed by character count: credits = ceil(chars / N), where N is the model’s chars-per-credit value in the table below.

Model | Provider | Voice settings | Max chars | N (chars/credit)
eleven-multilingual-v2 | ElevenLabs | stability, similarity_boost, speed, style | 10,000 | 50
eleven-v3 | ElevenLabs | stability, similarity_boost, speed + audio tags | 5,000 | 50
eleven-turbo-v2.5 | ElevenLabs | stability, similarity_boost, speed | 40,000 | 100
speech-2.8-hd (default) | MiniMax | speed, volume, pitch, emotion | 10,000 | 50
speech-2.8-turbo | MiniMax | speed, volume, pitch, emotion | 10,000 | 100
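The credit formula and the per-model limits from the table can be combined into a quick estimate. This is a sketch: speech_credits is a hypothetical helper, and it assumes characters are counted as Python's len(text), which may differ from how the service counts them.

```python
import math

CHARS_PER_CREDIT = {
    "eleven-multilingual-v2": 50,
    "eleven-v3": 50,
    "eleven-turbo-v2.5": 100,
    "speech-2.8-hd": 50,
    "speech-2.8-turbo": 100,
}
MAX_CHARS = {
    "eleven-multilingual-v2": 10_000,
    "eleven-v3": 5_000,
    "eleven-turbo-v2.5": 40_000,
    "speech-2.8-hd": 10_000,
    "speech-2.8-turbo": 10_000,
}


def speech_credits(model, text):
    """Estimate credits as ceil(chars / N) after checking the model's max length."""
    chars = len(text)  # assumption: one character per Python code point
    if chars > MAX_CHARS[model]:
        raise ValueError(f"text exceeds {MAX_CHARS[model]} chars for {model}")
    return math.ceil(chars / CHARS_PER_CREDIT[model])
```

A 120-character text would cost 3 credits on speech-2.8-hd (120 / 50 rounded up) and 2 credits on speech-2.8-turbo (120 / 100 rounded up).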

Every request needs a voice_id from GET speech/voices (filtered by model and language). The provider and the paired provider_voice_id are derived automatically.

Voice settings

The two providers take different settings, enforced per family (a setting from the wrong family is rejected with 400):

  • MiniMax: speed (0.5 to 2), volume (0 to 10), pitch (-12 to 12), emotion (auto, happy, sad, angry, fearful, disgusted, surprised, neutral, calm).
  • ElevenLabs: stability (0 to 1), similarity_boost (0 to 1), speed (0.7 to 1.2). style (0 to 1) and use_speaker_boost are accepted by eleven-multilingual-v2 only.
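The per-family rules can be checked locally before a request is sent. The sketch below is one possible client-side check, not the service's validation logic: all names are hypothetical, the ranges are treated as inclusive, and use_speaker_boost is accepted without checking its value because its type is not stated here.

```python
MODEL_FAMILY = {
    "speech-2.8-hd": "minimax",
    "speech-2.8-turbo": "minimax",
    "eleven-multilingual-v2": "elevenlabs",
    "eleven-v3": "elevenlabs",
    "eleven-turbo-v2.5": "elevenlabs",
}
NUMERIC_RANGES = {
    "minimax": {"speed": (0.5, 2), "volume": (0, 10), "pitch": (-12, 12)},
    "elevenlabs": {"stability": (0, 1), "similarity_boost": (0, 1),
                   "speed": (0.7, 1.2), "style": (0, 1)},
}
EMOTIONS = {"auto", "happy", "sad", "angry", "fearful", "disgusted",
            "surprised", "neutral", "calm"}
V2_ONLY = {"style", "use_speaker_boost"}


def check_voice_settings(model, settings):
    """Return a list of problems; an empty list means the settings look valid."""
    family = MODEL_FAMILY[model]
    problems = []
    for name, value in settings.items():
        if (name in V2_ONLY and family == "elevenlabs"
                and model != "eleven-multilingual-v2"):
            problems.append(f"{name}: accepted by eleven-multilingual-v2 only")
        elif name == "emotion" and family == "minimax":
            if value not in EMOTIONS:
                problems.append(f"emotion: unknown value {value!r}")
        elif name == "use_speaker_boost" and family == "elevenlabs":
            pass  # value type is not specified here, so it is not checked
        elif name not in NUMERIC_RANGES[family]:
            problems.append(f"{name}: not a {family} setting")
        else:
            low, high = NUMERIC_RANGES[family][name]
            if not low <= value <= high:
                problems.append(f"{name}: {value} is outside {low} to {high}")
    return problems
```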

Expressive control

Model | Control
eleven-v3 | inline audio tags in the text: [whispers], [excited], [shouts], [sighs], [laughs], [curious], … Works best with longer, sentence-level text.
speech-2.8-hd / speech-2.8-turbo | the emotion voice setting (a natural, subtle coloring).
eleven-multilingual-v2 | the style voice setting.

eleven-v3 is the most expressive model. The other ElevenLabs models read audio tags literally — use them only with eleven-v3.

Languages

language_code is validated per model against the live models catalog (MiniMax and the ElevenLabs v2/turbo models support 30-40 languages). eleven-v3 auto-detects the language and rejects language_code with 400.

Status values

audio_status audio_status_name audio_status_final
1 COMPLETED true
5 QUEUED false
8 FAILED true
10 GENERATING false

A failed job (audio_status 8) carries fail_code and a human-readable fail_reason. These are usually transient backend errors — re-submitting the same request often succeeds.