Configure Session
configure_session
An asynchronous function that runs at the beginning of every session. It configures the session by returning a SessionConfig object, which contains parameters like VAD (voice activity detection), STT (speech-to-text), TTS (text-to-speech), the initial messages of the session, etc.
Example usage
import os

# VAD, STT, and TTS are assumed to be importable from jay_ai alongside SessionConfig.
from jay_ai import STT, TTS, VAD, ConfigureSessionInput, SessionConfig

async def configure_session(input: ConfigureSessionInput):
    user_timezone = input["custom_data"]["my_user_timezone"]
    return SessionConfig(
        initial_messages=[
            {"role": "system", "content": "You are a helpful assistant."}
        ],
        vad=VAD.Silero(),
        stt=STT.Deepgram(api_key=os.environ["DEEPGRAM_API_KEY"]),
        tts=TTS.OpenAI(
            api_key=os.environ["OPENAI_API_KEY"]
        ),
        session_data={
            "my_user_id": "test-12345",
            "my_user_timezone": user_timezone
        }
    )
Parameters
Arbitrary fields that you can specify when you call the startSession API endpoint. This makes it possible to include fields that are specific to the session or to your users. Learn how to set these fields in the Starting Sessions guide.
Example input parameter:
{
  "custom_data": {
    "my_user_id": "abc123"
  }
}
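For illustration, the custom_data fields you send when starting the session arrive unchanged in configure_session, so the example input above could be read like this (a minimal sketch):
from jay_ai import ConfigureSessionInput

# Minimal sketch: custom_data sent via startSession arrives in input["custom_data"].
async def configure_session(input: ConfigureSessionInput):
    my_user_id = input["custom_data"]["my_user_id"]  # "abc123" in the example above
    # ... build and return a SessionConfig as shown in the example usage above
    ...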
Returns
A list of messages containing the conversation so far. Make this an empty array if you want the conversation to start from scratch.
The contents of the message.
The role of the speaker. Either ‘system’, ‘user’, ‘assistant’, or ‘tool’.
An optional name for the speaker. Some LLMs, such as OpenAI’s, can use this field to differentiate between participants of the same role.
Tool call that this message is responding to. Only present if the role is "tool".
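For illustration, an initial_messages value using these fields might look like the sketch below; the exact message shape mirrors the OpenAI-style chat format, which is an assumption here:
# Sketch only: the message shape mirrors the OpenAI-style chat format, which is
# an assumption about the exact fields jay_ai expects.
initial_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "assistant", "content": "Hi there! How can I help?", "name": "agent"},
]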
The voice activity detection (VAD) provider and its settings. Currently, only Silero is supported.
The minimum duration (in seconds) of speech needed before the VAD decides speech has started.
The duration (in seconds) of silence required before the VAD concludes that speech has ended.
The duration (in seconds) of audio to include before detected speech begins. This helps retain audio leading up to the first speech frames.
The maximum duration (in seconds) of speech that will be buffered. Once this limit is reached, additional incoming speech data for the current segment will be ignored.
The threshold for deciding if audio is speech. A value closer to 1.0 requires stronger confidence of speech, while a value closer to 0.0 is more permissive.
The audio sample rate (in Hz) used by Silero. Must be either 8,000 Hz or 16,000 Hz.
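As a sketch, a fully specified Silero VAD configuration might look like the following; the keyword names are illustrative assumptions mapped to the settings described above, not confirmed parameter names:
from jay_ai import VAD

# Illustrative sketch only: keyword names are assumptions that mirror the
# settings described above; the values are examples, not documented defaults.
vad = VAD.Silero(
    min_speech_duration=0.05,     # seconds of speech before VAD reports speech start
    min_silence_duration=0.55,    # seconds of silence before VAD reports speech end
    prefix_padding_duration=0.5,  # seconds of audio kept before detected speech
    max_buffered_speech=60.0,     # seconds of speech buffered per segment
    activation_threshold=0.5,     # closer to 1.0 is stricter, closer to 0.0 is more permissive
    sample_rate=16000,            # must be 8000 or 16000
)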
The speech-to-text (STT) provider and its settings.
OpenAI’s speech-to-text provider.
Microsoft Azure’s speech-to-text provider.
Your Azure Speech API key.
The Azure region that hosts your resource (e.g. "eastus").
The sample rate for the stream.
The number of audio channels.
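A sketch of an Azure STT configuration is below; the keyword names and the environment variable are assumptions based on the fields described above:
import os

from jay_ai import STT

# Sketch only: keyword names and the env var name are assumptions.
stt = STT.Azure(
    api_key=os.environ["AZURE_SPEECH_API_KEY"],  # your Azure Speech API key
    region="eastus",                             # region hosting your resource
    sample_rate=16000,                           # sample rate for the stream
    num_channels=1,                              # number of audio channels
)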
Deepgram’s speech-to-text provider.
Your Deepgram API key.
The AI model used to process submitted audio. Learn More
The BCP-47 language tag that hints at the primary spoken language. Learn More.
Specifies whether the streaming endpoint should provide ongoing transcription updates as more audio is received. When set to true, the endpoint sends continuous updates, meaning transcription results may evolve over time. Learn More.
Indicates whether to add punctuation and capitalization to the transcript. Learn More.
Indicates whether to apply formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability. Learn More.
Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding. Learn More.
Only relevant when smart_format is true. If a speaker begins saying a number while no_delay is false and Smart Format is enabled, Deepgram will wait to return a transcription until the speaker has finished and continues on to non-numerical speech. This behavior ensures numbers have the best possible formatting and are not broken up over multiple chunks.
Indicates how long Deepgram will wait to detect whether a speaker has finished speaking or pauses for a significant period of time. When set to true, the streaming endpoint immediately finalizes the transcription for the processed time range and returns the transcript with a speech_final parameter set to true. Learn More.
Indicates whether to include filler words like “uh” and “um” in transcript output. When set to true, these words will be included. Learn More.
Unique proper nouns or specialized terms you want the model to include in its predictions, which aren’t part of the model’s default vocabulary. Learn More.
Indicates whether to remove profanity from the transcript. Learn More.
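A sketch of a Deepgram STT configuration with these options is below; the keyword names mirror Deepgram's streaming parameters and are assumptions about this wrapper's exact signature:
import os

from jay_ai import STT

# Sketch only: keyword names mirror Deepgram's streaming options and are
# assumptions about this wrapper's exact parameters.
stt = STT.Deepgram(
    api_key=os.environ["DEEPGRAM_API_KEY"],
    model="nova-2-general",    # AI model used to process submitted audio
    language="en-US",          # BCP-47 hint at the primary spoken language
    interim_results=True,      # stream evolving transcription updates
    punctuate=True,            # add punctuation and capitalization
    smart_format=True,         # apply extra readability formatting
    sample_rate=16000,         # sample rate of the submitted audio
    no_delay=False,            # wait for complete numbers when Smart Format is on
    endpointing=25,            # silence (in ms) before finalizing a segment
    filler_words=False,        # include "uh" and "um" in transcripts
    keywords=["Deepgram"],     # boosted vocabulary (list format is an assumption)
    profanity_filter=False,    # remove profanity from the transcript
)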
The text-to-speech (TTS) provider and its settings.
ElevenLabs’ text-to-speech provider.
Your ElevenLabs API key.
The voice to use. Defaults to:
DEFAULT_VOICE = Voice(
    id="EXAVITQu4vr4xnSDxMaL",
    name="Bella",
    category="premade",
    settings=VoiceSettings(
        stability=0.71, similarity_boost=0.5, style=0.0, use_speaker_boost=True
    ),
)
Voice ID to use. You can use https://api.elevenlabs.io/v1/voices to list all the available voices.
The name of the ElevenLabs voice (e.g. “Sarah”).
The category of ElevenLabs voice (e.g. “professional”). See the full list of valid ElevenLabs category fields.
Defines the stability for voice settings.
Defines the similarity boost for voice settings.
Defines the style for voice settings. This parameter is available on V2+ models.
Defines the use speaker boost for voice settings. This parameter is available on V2+ models.
Identifier of the ElevenLabs model that will be used.
The output format of the generated audio.
Whether to enable or disable parsing of SSML tags within the provided text. For best results, we recommend enabling SSML tags so that fully contained messages are sent to the websockets endpoint; otherwise, this may result in additional latency.
Schedule for chunk lengths, ranging from 50 to 500.
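A sketch of an ElevenLabs TTS configuration is below; the keyword names, model identifier, and output format value are assumptions based on the fields described above, and the voice falls back to the default shown earlier when omitted:
import os

from jay_ai import TTS

# Sketch only: keyword names and the model/format values are assumptions; the
# voice defaults to the "Bella" premade voice shown above when omitted.
tts = TTS.ElevenLabs(
    api_key=os.environ["ELEVEN_API_KEY"],
    model="eleven_turbo_v2_5",                  # ElevenLabs model identifier
    encoding="mp3_22050_32",                    # output format of the generated audio
    enable_ssml_parsing=True,                   # parse SSML tags in the provided text
    chunk_length_schedule=[80, 120, 200, 260],  # chunk lengths between 50 and 500
)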
OpenAI’s text-to-speech provider.
Google’s text-to-speech provider.
Your Google credentials as a dictionary.
The language code of the voice to use when generating the audio.
The gender of the speaker. Valid values are: “male”, “female”, “neutral”, or the empty string "". Use an empty string if it doesn’t matter which gender the selected voice will have.
The name of the voice. If neither the name nor the gender is set, the service will choose a voice based on the other parameters, such as language_code.
Configuration for the audio encoder. The encoding determines the desired output audio format. Valid values: "linear16", "wav", or "mp3".
Audio sample rate (in hertz) for this audio.
Speaking pitch, in the range [-20.0, 20.0]. 20 means increase 20 semitones from the original pitch. -20 means decrease 20 semitones from the original pitch.
An identifier which selects ‘audio effects’ profiles that are applied on (post synthesized) text to speech. Effects are applied on top of each other in the order they are given. See audio profiles in Google’s documentation for current supported profile ids.
Speaking rate/speed, in the range [0.25, 4.0]. 1.0 is the normal native speed supported by the specific voice. 2.0 is twice as fast, and 0.5 is half as fast.
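A sketch of a Google TTS configuration is below; the keyword names are assumptions based on the fields described above, and the credentials dictionary is loaded from a service-account JSON file:
import json
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions based on the fields described above.
with open(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]) as f:
    google_credentials = json.load(f)  # your Google credentials as a dictionary

tts = TTS.Google(
    credentials=google_credentials,
    language_code="en-US",           # language of the generated audio
    gender="female",                 # "male", "female", "neutral", or ""
    voice_name="en-US-Neural2-F",    # optional explicit voice name
    encoding="linear16",             # "linear16", "wav", or "mp3"
    sample_rate=24000,               # sample rate in hertz
    pitch=0.0,                       # semitones, in [-20.0, 20.0]
    effects_profile_id=["telephony-class-application"],  # audio effects profiles
    speaking_rate=1.0,               # in [0.25, 4.0]; 1.0 is normal speed
)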
Microsoft Azure’s text-to-speech provider.
Your Azure Speech API key.
The Azure region that hosts your resource (e.g. "eastus").
Audio sample rate (in hertz) for this audio. Valid values are: 8000, 16000, 22050, 24000, 44100, 48000.
The name of the voice to use. Explore Azure voices.
Specify changes to pitch, contour, range, rate, and volume for the text to speech output. Learn more
Indicates the speaking rate of the text. The rate changes should be within 0.5 to 2 times the original audio. You can express rate as:
- A number that acts as a multiplier of the default. For example, a value of 1 results in no change in the original rate. A value of 0.5 results in a halving of the original rate. A value of 2 results in twice the original rate.
- A string:
  - "x-slow" (equivalently 0.5)
  - "slow" (equivalently 0.64)
  - "medium" (equivalently 1, default value)
  - "fast" (equivalently 1.55)
  - "x-fast" (equivalently 2)
Indicates the volume level of the speaking voice. You can express the volume as:
- A number in the range of 0.0 to 100.0, from quietest to loudest, such as 75. The default value is 100.
- A string:
  - "silent" (equivalently 0)
  - "x-soft" (equivalently 20)
  - "soft" (equivalently 40)
  - "medium" (equivalently 60)
  - "loud" (equivalently 80)
  - "x-loud" (equivalently 100, default value)
Indicates the baseline pitch for the text. The pitch changes should be within 0.5 to 1.5 times the original audio. You can express the pitch as one of the following strings:
- "x-low"
- "low"
- "medium"
- "high"
- "x-high"
The ID of a custom endpoint. Learn more
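A sketch of an Azure TTS configuration is below; the keyword names, the prosody shape, and the environment variable are assumptions based on the fields described above:
import os

from jay_ai import TTS

# Sketch only: keyword names and the prosody shape are assumptions.
tts = TTS.Azure(
    api_key=os.environ["AZURE_SPEECH_API_KEY"],  # your Azure Speech API key
    region="eastus",                             # region hosting your resource
    sample_rate=24000,             # one of 8000, 16000, 22050, 24000, 44100, 48000
    voice="en-US-JennyNeural",     # an Azure voice name
    prosody={                      # prosody adjustments (shape is an assumption)
        "rate": 1.0,               # 0.5-2.0 multiplier, or "slow", "fast", etc.
        "volume": "medium",        # 0.0-100.0, or "silent" through "x-loud"
        "pitch": "medium",         # "x-low" through "x-high"
    },
    # endpoint_id="...",           # optional custom endpoint ID
)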
Deepgram’s text-to-speech provider.
Your Deepgram API key.
The AI model used to process submitted audio. Learn more.
Expected encoding of the submitted streaming audio. Learn more.
Sample rate of submitted streaming audio. Learn more.
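A sketch of a Deepgram TTS configuration is below; the keyword names are assumptions based on the fields described above:
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions based on the fields described above.
tts = TTS.Deepgram(
    api_key=os.environ["DEEPGRAM_API_KEY"],
    model="aura-asteria-en",   # AI model used to synthesize audio
    encoding="linear16",       # expected encoding of the streamed audio
    sample_rate=24000,         # sample rate of the streamed audio
)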
Cartesia’s text-to-speech provider.
Your Cartesia API key.
The ID of the model to use for the generation. See Cartesia’s available models.
The language that the given voice should speak the transcript in. See Cartesia’s available languages.
The audio encoding format. Currently, only "pcm_s16le" is supported.
Either the string ID of the voice or a 192-dimensional vector (i.e. a list of 192 numbers) that represents the voice.
Either a number between -1.0 and 1.0 or a natural language description of speed (“fastest”, “fast”, “normal”, “slow”, “slowest”). If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
The emotion of the speaker (e.g. “positivity:high”). See Cartesia’s emotion guide.
Sample rate (in hertz) for this audio.
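A sketch of a Cartesia TTS configuration is below; the keyword names are assumptions based on the fields described above, and the voice ID is a placeholder:
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions; the voice ID is a placeholder.
tts = TTS.Cartesia(
    api_key=os.environ["CARTESIA_API_KEY"],
    model="sonic-english",            # Cartesia model ID
    language="en",                    # language the voice should speak
    encoding="pcm_s16le",             # only supported value per the docs above
    voice="a0e99841-438c-4a64-b679-ae501e7d6091",  # voice ID, or a 192-dim embedding
    speed="normal",                   # or a number in [-1.0, 1.0]
    emotion=["positivity:high"],      # emotion tags (list format is an assumption)
    sample_rate=24000,                # sample rate in hertz
)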
FishAudio text-to-speech provider.
Your FishAudio API key.
Identifier of the FishAudio voice model to use. Defaults to an energetic male voice: https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89/
Allows you to control the tradeoff between stability and latency. Setting this to "balanced" reduces latency, but may be less stable.
Enables normalization of the input text, which improves stability for numbers, dates, and URLs.
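A sketch of a FishAudio TTS configuration is below; the keyword names are assumptions based on the fields described above:
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions based on the fields described above.
tts = TTS.FishAudio(
    api_key=os.environ["FISHAUDIO_API_KEY"],
    model_id="802e3bc2b27e49c2995d23ef70e6ac89",  # default energetic male voice
    latency="balanced",    # lower latency, possibly less stable than the default
    normalize=True,        # normalize numbers, dates, and URLs in the input text
)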
Arbitrary fields that will be available throughout the session (e.g. in the llm_response_handler). Allows you to define custom data related to the user or session. Must be JSON serializable.
An optional string representing a system or agent message to pre-send to the session.
Whether user speech can interrupt the agent mid-speech. Defaults to true.
Minimum amount of time (in seconds) of user speech that must be detected before the agent’s speech is interrupted. Defaults to 0.5.
Minimum number of words spoken by the user that are required to interrupt agent speech. Defaults to 0.
Specifies the minimum endpointing delay for STT. Defaults to 0.5.
Maximum number of nested function calls allowed. Defaults to 1.
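Putting the session-level options together, a sketch of a SessionConfig that sets them explicitly is below; apart from initial_messages, vad, stt, tts, and session_data (which appear in the example usage above), the keyword names are assumptions based on the descriptions, with the numeric and boolean values taken from the documented defaults:
import os

from jay_ai import STT, TTS, VAD, SessionConfig

# Sketch only: keyword names other than initial_messages, vad, stt, tts, and
# session_data are assumptions; numeric/boolean values are the documented defaults.
config = SessionConfig(
    initial_messages=[],
    vad=VAD.Silero(),
    stt=STT.Deepgram(api_key=os.environ["DEEPGRAM_API_KEY"]),
    tts=TTS.OpenAI(api_key=os.environ["OPENAI_API_KEY"]),
    session_data={"my_user_id": "test-12345"},
    first_message="Hi! How can I help you today?",  # optional pre-sent agent message
    allow_interruptions=True,        # user speech can interrupt the agent
    interrupt_speech_duration=0.5,   # seconds of user speech needed to interrupt
    interrupt_min_words=0,           # words needed to interrupt agent speech
    min_endpointing_delay=0.5,       # minimum STT endpointing delay (seconds)
    max_nested_function_calls=1,     # maximum nested function calls
)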