Create text-to-speech audio stream token and payload

Table of contents

December 23, 2024 (January 15, 2025)

  1. Request Headers
  2. Request Body
  3. Responses
  4. Examples
  5. Model
  6. Try It

Please configure at least one www.hailuo.ai account for this endpoint; see Setup MiniMax for details.

This endpoint creates a near real-time audio stream from the provided text.

  • Average response time is 3 seconds.
  • Up to 20 parallel jobs per account are supported.
  • Currently, this service is offered free of charge.
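
A client that submits many jobs may want to stay under the 20-parallel-job limit on its own side. The following is a minimal sketch of a slot counter for one account; `JobSlots` and `MAX_PARALLEL_JOBS` are hypothetical names and not part of the API. How the service reacts when the limit is exceeded is not described on this page.

```typescript
// Hypothetical client-side helper: caps in-flight jobs for a single account.
// The limit of 20 comes from the list above; everything else is illustrative.
const MAX_PARALLEL_JOBS = 20;

class JobSlots {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number = MAX_PARALLEL_JOBS) {
    this.limit = limit;
  }

  /** Reserves a slot and returns true, or returns false when the limit is reached. */
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false;
    this.inFlight += 1;
    return true;
  }

  /** Frees a slot once a job has finished, whether it succeeded or failed. */
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  /** Number of jobs currently holding a slot. */
  get active(): number {
    return this.inFlight;
  }
}
```

Call `tryAcquire()` before each request and `release()` in a `finally` block so a failed job does not leak its slot.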

Over 300 pre-built voices are available through GET audio/voices, covering the following:

  • Languages: English (US, UK, Australia, India), Chinese (Mandarin and Cantonese), Japanese, Korean, French, German, Spanish, Portuguese (including Brazilian), Italian, Arabic, Russian, Turkish, Dutch, Ukrainian, Vietnamese, and Indonesian.
    The list is constantly updated to include more languages!
  • Emotions: happy, sad, angry, fearful, disgusted, surprised, neutral
  • Accents: US (General), English, Indian
  • Ages: Young Adult, Adult, Middle-Aged, Senior
  • Genders: Male, Female

The token and payload returned by this endpoint are used by the WebSocket endpoint WSS audio/wss.

https://api.useapi.net/v1/minimax/audio/create-stream

Request Headers
Authorization: Bearer {API token}
Content-Type: application/json
# Alternatively you can use multipart/form-data
# Content-Type: multipart/form-data
Request Body
{
    "account": "Optional MiniMax www.hailuo.ai API account",
    "text": "Required text",
    "voice_id": "Required voice id"
}
  • account is optional when only one www.hailuo.ai account is configured. If you have multiple accounts configured, this parameter is required.

  • text is required. Insert <#0.5#> to add a 0.5s pause between sentences. Adjust the duration as needed.
    Maximum length: 5000 characters.

  • voice_id is required. Use GET audio/voices to get the list of all available voices.

  • model is optional.
    Supported values: speech-01-hd (default), speech-01-turbo.

  • language_boost is optional. Use a tag_name from the voice_tag_language array of GET audio/config.
    Default value: Auto.

  • emotion is optional. Use a value from the t2a_emotion array of GET audio/config.
    Default value: Auto.

  • vol is optional.
    Default 1.

  • speed is optional.
    Valid range: 0.5…2, default 1.

  • pitch is optional.
    Valid range: -12…12, default 0.

  • deepen_lighten is optional.
    Valid range: -100…100, default 0.

  • stronger_softer is optional.
    Valid range: -100…100, default 0.

  • nasal_crisp is optional.
    Valid range: -100…100, default 0.

  • spacious_echo is optional.
    Supported values: true, false (default).

  • lofi_telephone is optional.
    Supported values: true, false (default).

  • robotic is optional.
    Supported values: true, false (default).

  • auditorium_echo is optional.
    Supported values: true, false (default).
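
The parameters above can be checked on the client before a request is sent. The sketch below builds the pause marker, validates the documented ranges, and assembles a JSON request. The helper names are hypothetical, and the POST method is an assumption: this page shows a request body but does not state the HTTP method. Whether pause markers count toward the 5000-character limit is also not stated; the check here counts them.

```typescript
// URL, field names, and ranges are taken from this page; helper names are illustrative.
const CREATE_STREAM_URL = "https://api.useapi.net/v1/minimax/audio/create-stream";
const MAX_TEXT_LENGTH = 5000;

interface CreateStreamParams {
  account?: string;
  text: string;
  voice_id: string;
  model?: "speech-01-hd" | "speech-01-turbo";
  language_boost?: string;
  emotion?: string;
  vol?: number;
  speed?: number;            // 0.5 to 2
  pitch?: number;            // -12 to 12
  deepen_lighten?: number;   // -100 to 100
  stronger_softer?: number;  // -100 to 100
  nasal_crisp?: number;      // -100 to 100
  spacious_echo?: boolean;
  lofi_telephone?: boolean;
  robotic?: boolean;
  auditorium_echo?: boolean;
}

/** Joins sentences with a pause marker such as <#0.5#>. */
function joinWithPauses(sentences: string[], seconds: number): string {
  return sentences.join(`<#${seconds}#>`);
}

/** Returns the problems found; an empty list means the documented limits are met. */
function validateParams(p: CreateStreamParams): string[] {
  const problems: string[] = [];
  const inRange = (name: string, value: number | undefined, min: number, max: number): void => {
    if (value !== undefined && (value < min || value > max)) {
      problems.push(`${name} must be between ${min} and ${max}`);
    }
  };
  if (p.text.length === 0) problems.push("text is required");
  // String length counts UTF-16 code units, which may differ from the server's count.
  if (p.text.length > MAX_TEXT_LENGTH) problems.push(`text exceeds ${MAX_TEXT_LENGTH} characters`);
  if (p.voice_id.length === 0) problems.push("voice_id is required");
  inRange("speed", p.speed, 0.5, 2);
  inRange("pitch", p.pitch, -12, 12);
  inRange("deepen_lighten", p.deepen_lighten, -100, 100);
  inRange("stronger_softer", p.stronger_softer, -100, 100);
  inRange("nasal_crisp", p.nasal_crisp, -100, 100);
  return problems;
}

/** Builds the URL and options for a JSON request (POST is assumed); throws on invalid input. */
function buildCreateStreamRequest(apiToken: string, params: CreateStreamParams) {
  const problems = validateParams(params);
  if (problems.length > 0) throw new Error(problems.join("; "));
  return {
    url: CREATE_STREAM_URL,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(params),
    },
  };
}
```

The returned `url` and `init` can be passed directly to `fetch(url, init)` in any runtime that provides it.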

Responses
  • 200 OK

    The token and payload fields are used by the WebSocket endpoint WSS audio/wss.

    • The token contains WebSocket authorization information and will expire in 24 hours.
    • The payload contains a properly formed and validated payload built from user-provided input. You should send this payload over the WebSocket to generate a real-time sound stream and optionally retrieve the generated MP3 file.
    {
        "token": "token for WSS WebSocket endpoint",
        "payload": {
            "msg_id": "1e53d-a593-e40b9-54e20-f29c84-40ae3",
            "model": "",
            "text": "Text for TTS generation",
            "voice_setting": {
                "speed": 1,
                "vol": 1,
                "pitch": 0,
                "voice_id": "123456789",
                "emotion": "happy"
            },
            "audio_setting": {},
            "effects": {
                "deepen_lighten": 0,
                "stronger_softer": 0,
                "nasal_crisp": 0,
                "spacious_echo": false,
                "lofi_telephone": false,
                "robotic": false,
                "auditorium_echo": false
            },
            "er_weights": [],
            "language_boost": "German",
            "stream": true
        }
    }
    
  • 400 Bad Request

    {
      "error": "<Error message>"
    }
    
  • 401 Unauthorized

    {
      "error": "Unauthorized"
    }
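
The three responses above can be folded into a single result type that a caller branches on. This is a minimal sketch based only on the sample bodies shown here; `interpretResponse`, `isTokenFresh`, and the type names are hypothetical. Status codes other than 200, 400, and 401 are not documented on this page, so the sketch treats them as errors and falls back to a generic message.

```typescript
// Shapes follow the sample responses on this page; names are illustrative.
interface StreamTicket {
  token: string;
  payload: Record<string, unknown>;
}

type CreateStreamResult =
  | { ok: true; ticket: StreamTicket }
  | { ok: false; status: number; error: string };

// The token expires in 24 hours, per the 200 OK description.
const TOKEN_LIFETIME_MS = 24 * 60 * 60 * 1000;

/** Maps an HTTP status and an already parsed JSON body to a result the caller can branch on. */
function interpretResponse(status: number, body: unknown): CreateStreamResult {
  const fields = (typeof body === "object" && body !== null ? body : {}) as Record<string, unknown>;
  if (status === 200) {
    const token = fields["token"];
    const payload = fields["payload"];
    if (typeof token === "string" && typeof payload === "object" && payload !== null) {
      return { ok: true, ticket: { token, payload: payload as Record<string, unknown> } };
    }
    return { ok: false, status, error: "200 response is missing token or payload" };
  }
  // 400 and 401 both carry { "error": "..." } in the samples above.
  const message = fields["error"];
  const error = typeof message === "string" ? message : `HTTP ${status}`;
  return { ok: false, status, error };
}

/** True while a token obtained at issuedAtMs is still inside the 24-hour window. */
function isTokenFresh(issuedAtMs: number, nowMs: number): boolean {
  return nowMs - issuedAtMs < TOKEN_LIFETIME_MS;
}
```

Recording `Date.now()` when the 200 response arrives gives a value for `issuedAtMs`; a stale token means calling this endpoint again before opening the WebSocket.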
    
Examples

The code provided at WSS audio/wss is used on this page when you use the Try It feature.

Model

The token and payload fields are used by the WebSocket endpoint WSS audio/wss.

  • The token contains WebSocket authorization information and will expire in 24 hours.
  • The payload contains a properly formed and validated payload built from user-provided input. You should send this payload over the WebSocket to generate a real-time sound stream and optionally retrieve the generated MP3 file.
{ // TypeScript, all fields are optional
  token: string
  payload: {}
}
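
The model above leaves payload untyped. For clients that want to inspect it, the sketch below gives a fuller shape inferred solely from the sample 200 response on this page, not from an official schema. Every field is kept optional to match the note in the model, and the choice of which fields `missingPayloadFields` treats as essential is an assumption.

```typescript
// Inferred from the sample 200 response on this page; not an official schema.
interface VoiceSetting {
  speed?: number;
  vol?: number;
  pitch?: number;
  voice_id?: string;
  emotion?: string;
}

interface Effects {
  deepen_lighten?: number;
  stronger_softer?: number;
  nasal_crisp?: number;
  spacious_echo?: boolean;
  lofi_telephone?: boolean;
  robotic?: boolean;
  auditorium_echo?: boolean;
}

interface StreamPayload {
  msg_id?: string;
  model?: string;
  text?: string;
  voice_setting?: VoiceSetting;
  audio_setting?: Record<string, unknown>;
  effects?: Effects;
  er_weights?: unknown[];
  language_boost?: string;
  stream?: boolean;
}

interface CreateStreamResponse {
  token?: string;
  payload?: StreamPayload;
}

/** Lists the fields this sketch treats as essential that are absent from a payload. */
function missingPayloadFields(payload: StreamPayload): string[] {
  const missing: string[] = [];
  if (typeof payload.msg_id !== "string") missing.push("msg_id");
  if (typeof payload.text !== "string") missing.push("text");
  if (!payload.voice_setting) missing.push("voice_setting");
  return missing;
}
```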
Try It