Configure Session
configure_session
An asynchronous function that runs at the beginning of every session. It configures the session by returning a SessionConfig object, which contains parameters like VAD (voice activity detection), STT (speech-to-text), TTS (text-to-speech), the initial messages of the session, etc.
Example usage
import os

# VAD, STT, and TTS are assumed to be importable from jay_ai alongside SessionConfig.
from jay_ai import STT, TTS, VAD, ConfigureSessionInput, SessionConfig

async def configure_session(input: ConfigureSessionInput):
    user_timezone = input["custom_data"]["my_user_timezone"]
    return SessionConfig(
        initial_messages=[
            {"role": "system", "content": "You are a helpful assistant."}
        ],
        vad=VAD.Silero(),
        stt=STT.Deepgram(api_key=os.environ["DEEPGRAM_API_KEY"]),
        tts=TTS.OpenAI(
            api_key=os.environ["OPENAI_API_KEY"]
        ),
        session_data={
            "my_user_id": "test-12345",
            "my_user_timezone": user_timezone
        }
    )
Parameters
Arbitrary fields that you can specify when you call the startSession API endpoint. This makes it possible to include fields that are specific to the session or to your users. Learn how to set these fields in the Starting Sessions guide.
Example input parameter:
{
  "custom_data": {
    "my_user_id": "abc123"
  }
}
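For illustration, the custom_data fields you send when starting the session arrive unchanged in configure_session, so the example input above could be read like this (a minimal sketch):
from jay_ai import ConfigureSessionInput

# Minimal sketch: custom_data sent via startSession arrives in input["custom_data"].
async def configure_session(input: ConfigureSessionInput):
    my_user_id = input["custom_data"]["my_user_id"]  # "abc123" in the example above
    # ... build and return a SessionConfig as shown in the example usage above
    ...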
Returns
A list of messages containing the conversation so far. Make this an empty array if you want the conversation to start from scratch.
The contents of the message.
The role of the speaker. Either ‘system’, ‘user’, ‘assistant’, or ‘tool’.
An optional name for the speaker. Some LLMs, such as OpenAI’s, can use this field to differentiate between participants of the same role.
Tool call that this message is responding to. Only present if the role is "tool".
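For illustration, an initial_messages value using these fields might look like the sketch below; the exact message shape mirrors the OpenAI-style chat format, which is an assumption here:
# Sketch only: the message shape mirrors the OpenAI-style chat format, which is
# an assumption about the exact fields jay_ai expects.
initial_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "assistant", "content": "Hi there! How can I help?", "name": "agent"},
]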
The voice activity detection (VAD) provider and its settings. Currently, only Silero is supported.
The minimum duration (in seconds) of speech needed before the VAD decides speech has started.
The duration (in seconds) of silence required before the VAD concludes that speech has ended.
The duration (in seconds) of audio to include before detected speech begins. This helps retain audio leading up to the first speech frames.
The maximum duration (in seconds) of speech that will be buffered. Once this limit is reached, additional incoming speech data for the current segment will be ignored.
The threshold for deciding if audio is speech. A value closer to 1.0 requires stronger confidence of speech, while a value closer to 0.0 is more permissive.
The audio sample rate (in Hz) used by Silero. Must be either 8,000 Hz or 16,000 Hz.
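As a sketch, a fully specified Silero VAD configuration might look like the following; the keyword names are illustrative assumptions mapped to the settings described above, not confirmed parameter names:
from jay_ai import VAD

# Illustrative sketch only: keyword names are assumptions that mirror the
# settings described above; the values are examples, not documented defaults.
vad = VAD.Silero(
    min_speech_duration=0.05,     # seconds of speech before VAD reports speech start
    min_silence_duration=0.55,    # seconds of silence before VAD reports speech end
    prefix_padding_duration=0.5,  # seconds of audio kept before detected speech
    max_buffered_speech=60.0,     # seconds of speech buffered per segment
    activation_threshold=0.5,     # closer to 1.0 is stricter, closer to 0.0 is more permissive
    sample_rate=16000,            # must be 8000 or 16000
)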
The speech-to-text (STT) provider and its settings.
OpenAI’s speech-to-text provider.
Microsoft Azure’s speech-to-text provider.
Your Azure Speech API key.
The Azure region that hosts your resource (e.g. "eastus").
The sample rate for the stream.
The number of audio channels.
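A sketch of an Azure STT configuration is below; the keyword names and the environment variable are assumptions based on the fields described above:
import os

from jay_ai import STT

# Sketch only: keyword names and the env var name are assumptions.
stt = STT.Azure(
    api_key=os.environ["AZURE_SPEECH_API_KEY"],  # your Azure Speech API key
    region="eastus",                             # region hosting your resource
    sample_rate=16000,                           # sample rate for the stream
    num_channels=1,                              # number of audio channels
)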
Deepgram’s speech-to-text provider.
Your Deepgram API key.
The AI model used to process submitted audio. Learn More
The BCP-47 language tag that hints at the primary spoken language. Learn More.
Specifies whether the streaming endpoint should provide ongoing transcription updates as more audio is received. When set to true, the endpoint sends continuous updates, meaning transcription results may evolve over time. Learn More.
Indicates whether to add punctuation and capitalization to the transcript. Learn More.
Indicates whether to apply formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability. Learn More.
Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding. Learn More.
Only relevant when smart_format is true. If a speaker begins saying a number while no_delay is false and Smart Format is enabled, Deepgram will wait to return a transcription until the speaker has finished and continues on to non-numerical speech. This behavior ensures numbers have the best possible formatting and are not broken up over multiple chunks.
Indicates how long Deepgram will wait to detect whether a speaker has finished speaking or pauses for a significant period of time. When set to true, the streaming endpoint immediately finalizes the transcription for the processed time range and returns the transcript with a speech_final parameter set to true. Learn More.
Indicates whether to include filler words like “uh” and “um” in transcript output. When set to true, these words will be included. Learn More.
Unique proper nouns or specialized terms you want the model to include in its predictions, which aren’t part of the model’s default vocabulary. Learn More.
Indicates whether to remove profanity from the transcript. Learn More.
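A sketch of a Deepgram STT configuration with these options is below; the keyword names mirror Deepgram's streaming parameters and are assumptions about this wrapper's exact signature:
import os

from jay_ai import STT

# Sketch only: keyword names mirror Deepgram's streaming options and are
# assumptions about this wrapper's exact parameters.
stt = STT.Deepgram(
    api_key=os.environ["DEEPGRAM_API_KEY"],
    model="nova-2-general",    # AI model used to process submitted audio
    language="en-US",          # BCP-47 hint at the primary spoken language
    interim_results=True,      # stream evolving transcription updates
    punctuate=True,            # add punctuation and capitalization
    smart_format=True,         # apply extra readability formatting
    sample_rate=16000,         # sample rate of the submitted audio
    no_delay=False,            # wait for complete numbers when Smart Format is on
    endpointing=25,            # silence (in ms) before finalizing a segment
    filler_words=False,        # include "uh" and "um" in transcripts
    keywords=["Deepgram"],     # boosted vocabulary (list format is an assumption)
    profanity_filter=False,    # remove profanity from the transcript
)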
The text-to-speech (TTS) provider and its settings.
ElevenLabs’ text-to-speech provider.
Your ElevenLabs API key.
The voice to use. Defaults to:
DEFAULT_VOICE = Voice(
    id="EXAVITQu4vr4xnSDxMaL",
    name="Bella",
    category="premade",
    settings=VoiceSettings(
        stability=0.71, similarity_boost=0.5, style=0.0, use_speaker_boost=True
    ),
)
Voice ID to use. You can use https://api.elevenlabs.io/v1/voices to list all the available voices.
The name of the ElevenLabs voice (e.g. “Sarah”).
The category of ElevenLabs voice (e.g. “professional”). See the full list of valid ElevenLabs category fields.
Defines the stability for voice settings.
Defines the similarity boost for voice settings.
Defines the style for voice settings. This parameter is available on V2+ models.
Defines the use speaker boost for voice settings. This parameter is available on V2+ models.
Identifier of the ElevenLabs model that will be used.
The output format of the generated audio.
Whether to enable or disable parsing of SSML tags within the provided text. For best results, we recommend enabling SSML tags so that fully contained messages are sent to the websockets endpoint; otherwise, this may result in additional latency.
Schedule for chunk lengths, ranging from 50 to 500.
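A sketch of an ElevenLabs TTS configuration is below; the keyword names, model identifier, and output format value are assumptions based on the fields described above, and the voice falls back to the default shown earlier when omitted:
import os

from jay_ai import TTS

# Sketch only: keyword names and the model/format values are assumptions; the
# voice defaults to the "Bella" premade voice shown above when omitted.
tts = TTS.ElevenLabs(
    api_key=os.environ["ELEVEN_API_KEY"],
    model="eleven_turbo_v2_5",                  # ElevenLabs model identifier
    encoding="mp3_22050_32",                    # output format of the generated audio
    enable_ssml_parsing=True,                   # parse SSML tags in the provided text
    chunk_length_schedule=[80, 120, 200, 260],  # chunk lengths between 50 and 500
)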
OpenAI’s text-to-speech provider.
Google’s text-to-speech provider.
Your Google credentials as a dictionary.
The language code of the voice to use when generating the audio.
The gender of the speaker. Valid values are: “male”, “female”, “neutral”, or the empty string "". Use an empty string if it doesn’t matter which gender the selected voice will have.
The name of the voice. If neither the name nor the gender is set, the service will choose a voice based on the other parameters, such as language_code.
Configuration for the audio encoder. The encoding determines the desired output audio format. Valid values: "linear16", "wav", or "mp3".
Audio sample rate (in hertz) for this audio.
Speaking pitch, in the range [-20.0, 20.0]. 20 means increase 20 semitones from the original pitch. -20 means decrease 20 semitones from the original pitch.
An identifier which selects ‘audio effects’ profiles that are applied on (post synthesized) text to speech. Effects are applied on top of each other in the order they are given. See audio profiles in Google’s documentation for current supported profile ids.
Speaking rate/speed, in the range [0.25, 4.0]. 1.0 is the normal native speed supported by the specific voice. 2.0 is twice as fast, and 0.5 is half as fast.
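A sketch of a Google TTS configuration is below; the keyword names are assumptions based on the fields described above, and the credentials dictionary is loaded from a service-account JSON file:
import json
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions based on the fields described above.
with open(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]) as f:
    google_credentials = json.load(f)  # your Google credentials as a dictionary

tts = TTS.Google(
    credentials=google_credentials,
    language_code="en-US",           # language of the generated audio
    gender="female",                 # "male", "female", "neutral", or ""
    voice_name="en-US-Neural2-F",    # optional explicit voice name
    encoding="linear16",             # "linear16", "wav", or "mp3"
    sample_rate=24000,               # sample rate in hertz
    pitch=0.0,                       # semitones, in [-20.0, 20.0]
    effects_profile_id=["telephony-class-application"],  # audio effects profiles
    speaking_rate=1.0,               # in [0.25, 4.0]; 1.0 is normal speed
)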
Microsoft Azure’s text-to-speech provider.
Your Azure Speech API key.
The Azure region that hosts your resource (e.g. "eastus").
Audio sample rate (in hertz) for this audio. Valid values are: 8000, 16000, 22050, 24000, 44100, 48000.
The name of the voice to use. Explore Azure voices.
Specify changes to pitch, contour, range, rate, and volume for the text to speech output. Learn more
Indicates the speaking rate of the text. The rate changes should be within 0.5 to 2 times the original audio. You can express rate as:
- A number that acts as a multiplier of the default. For example, a value of 1 results in no change in the original rate. A value of 0.5 results in a halving of the original rate. A value of 2 results in twice the original rate.
- A string:
  - "x-slow" (equivalently 0.5)
  - "slow" (equivalently 0.64)
  - "medium" (equivalently 1, default value)
  - "fast" (equivalently 1.55)
  - "x-fast" (equivalently 2)
Indicates the volume level of the speaking voice. You can express the volume as:
- A number in the range of 0.0 to 100.0, from quietest to loudest, such as 75. The default value is 100.
- A string:
  - "silent" (equivalently 0)
  - "x-soft" (equivalently 20)
  - "soft" (equivalently 40)
  - "medium" (equivalently 60)
  - "loud" (equivalently 80)
  - "x-loud" (equivalently 100, default value)
Indicates the baseline pitch for the text. The pitch changes should be within 0.5 to 1.5 times the original audio. You can express the pitch as one of the following strings:
- "x-low"
- "low"
- "medium"
- "high"
- "x-high"
The ID of a custom endpoint. Learn more
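A sketch of an Azure TTS configuration is below; the keyword names, the prosody shape, and the environment variable are assumptions based on the fields described above:
import os

from jay_ai import TTS

# Sketch only: keyword names and the prosody shape are assumptions.
tts = TTS.Azure(
    api_key=os.environ["AZURE_SPEECH_API_KEY"],  # your Azure Speech API key
    region="eastus",                             # region hosting your resource
    sample_rate=24000,             # one of 8000, 16000, 22050, 24000, 44100, 48000
    voice="en-US-JennyNeural",     # an Azure voice name
    prosody={                      # prosody adjustments (shape is an assumption)
        "rate": 1.0,               # 0.5-2.0 multiplier, or "slow", "fast", etc.
        "volume": "medium",        # 0.0-100.0, or "silent" through "x-loud"
        "pitch": "medium",         # "x-low" through "x-high"
    },
    # endpoint_id="...",           # optional custom endpoint ID
)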
Deepgram’s text-to-speech provider.
Your Deepgram API key.
The AI model used to process submitted audio. Learn more.
Expected encoding of the submitted streaming audio. Learn more.
Sample rate of submitted streaming audio. Learn more.
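A sketch of a Deepgram TTS configuration is below; the keyword names are assumptions based on the fields described above:
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions based on the fields described above.
tts = TTS.Deepgram(
    api_key=os.environ["DEEPGRAM_API_KEY"],
    model="aura-asteria-en",   # AI model used to synthesize audio
    encoding="linear16",       # expected encoding of the streamed audio
    sample_rate=24000,         # sample rate of the streamed audio
)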
Cartesia’s text-to-speech provider.
Your Cartesia API key.
The ID of the model to use for the generation. See Cartesia’s available models.
The language that the given voice should speak the transcript in. See Cartesia’s available languages.
The audio encoding format. Currently, only "pcm_s16le" is supported.
Either the string ID of the voice or a 192-dimensional vector (i.e. a list of 192 numbers) that represents the voice.
Either a number between -1.0 and 1.0 or a natural language description of speed (“fastest”, “fast”, “normal”, “slow”, “slowest”). If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
The emotion of the speaker (e.g. “positivity:high”). See Cartesia’s emotion guide.
Sample rate (in hertz) for this audio.
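A sketch of a Cartesia TTS configuration is below; the keyword names are assumptions based on the fields described above, and the voice ID is a placeholder:
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions; the voice ID is a placeholder.
tts = TTS.Cartesia(
    api_key=os.environ["CARTESIA_API_KEY"],
    model="sonic-english",            # Cartesia model ID
    language="en",                    # language the voice should speak
    encoding="pcm_s16le",             # only supported value per the docs above
    voice="a0e99841-438c-4a64-b679-ae501e7d6091",  # voice ID, or a 192-dim embedding
    speed="normal",                   # or a number in [-1.0, 1.0]
    emotion=["positivity:high"],      # emotion tags (list format is an assumption)
    sample_rate=24000,                # sample rate in hertz
)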
FishAudio text-to-speech provider.
Your FishAudio API key.
Identifier of the FishAudio voice model to use. Defaults to an energetic male voice: https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89/
Allows you to control the tradeoff between stability and latency. Setting this to "balanced" reduces latency, but may be less stable.
Enables normalization of the input text, which improves stability for numbers, dates, and URLs.
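A sketch of a FishAudio TTS configuration is below; the keyword names are assumptions based on the fields described above:
import os

from jay_ai import TTS

# Sketch only: keyword names are assumptions based on the fields described above.
tts = TTS.FishAudio(
    api_key=os.environ["FISHAUDIO_API_KEY"],
    model_id="802e3bc2b27e49c2995d23ef70e6ac89",  # default energetic male voice
    latency="balanced",    # lower latency, possibly less stable than the default
    normalize=True,        # normalize numbers, dates, and URLs in the input text
)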
Arbitrary fields that will be available throughout the session (e.g. in the llm_response_handler). Allows you to define custom data related to the user or session. Must be JSON serializable.
An optional string representing a system or agent message to pre-send to the session.
Whether user speech can interrupt the agent mid-speech. Defaults to true.
Minimum amount of time (in seconds) of user speech that must be detected before the agent’s speech is interrupted. Defaults to 0.5.
Minimum number of words spoken by the user that are required to interrupt agent speech. Defaults to 0.
Specifies the minimum endpointing delay for STT. Defaults to 0.5.
Maximum number of nested function calls allowed. Defaults to 1.
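Putting the session-level options together, a sketch of a SessionConfig that sets them explicitly is below; apart from initial_messages, vad, stt, tts, and session_data (which appear in the example usage above), the keyword names are assumptions based on the descriptions, with the numeric and boolean values taken from the documented defaults:
import os

from jay_ai import STT, TTS, VAD, SessionConfig

# Sketch only: keyword names other than initial_messages, vad, stt, tts, and
# session_data are assumptions; numeric/boolean values are the documented defaults.
config = SessionConfig(
    initial_messages=[],
    vad=VAD.Silero(),
    stt=STT.Deepgram(api_key=os.environ["DEEPGRAM_API_KEY"]),
    tts=TTS.OpenAI(api_key=os.environ["OPENAI_API_KEY"]),
    session_data={"my_user_id": "test-12345"},
    first_message="Hi! How can I help you today?",  # optional pre-sent agent message
    allow_interruptions=True,        # user speech can interrupt the agent
    interrupt_speech_duration=0.5,   # seconds of user speech needed to interrupt
    interrupt_min_words=0,           # words needed to interrupt agent speech
    min_endpointing_delay=0.5,       # minimum STT endpointing delay (seconds)
    max_nested_function_calls=1,     # maximum nested function calls
)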