speechkit package¶
speechkit: a Python SDK for Yandex speech recognition and synthesis.
- class speechkit.DataStreamingRecognition(session, language_code=None, model=None, profanity_filter=None, partial_results=None, single_utterance=None, audio_encoding=None, sample_rate_hertz=None, raw_results=None)¶
Bases:
object
Data streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection.
Unlike other recognition methods, you can get intermediate results while speech is in progress. After a pause, the service returns final results and starts recognizing the next utterance.
After receiving the message with the recognition settings, the service starts a recognition session. The following limitations apply to each session:
1. You can’t send audio fragments too often or too rarely. The time between messages to the service should be approximately the same as the duration of the audio fragments you send, but no more than 5 seconds. For example, send 400 ms of audio for recognition every 400 ms.
2. Maximum duration of transmitted audio for the entire session: 5 minutes.
3. Maximum size of transmitted audio data: 10 MB.
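The pacing rule above ties chunk duration to chunk size. A minimal sketch of that arithmetic (a hypothetical helper, not part of the SDK), assuming 16-bit LPCM audio:

```python
# Sketch: bytes of 16-bit LPCM audio covering a given chunk duration,
# so that chunks sent every chunk_ms milliseconds match the pacing rule.
def chunk_size_bytes(sample_rate_hertz, chunk_ms=400, bytes_per_sample=2, channels=1):
    """Byte length of chunk_ms milliseconds of LPCM audio."""
    return sample_rate_hertz * bytes_per_sample * channels * chunk_ms // 1000
```

At 8000 Hz, 400 ms of 16-bit mono LPCM is 6400 bytes, so reading the file in 6400-byte chunks and sending one every 400 ms stays within the limits.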
To use this type of recognition, you need to create a function that yields audio data as bytes.
- Example
>>> CHUNK_SIZE = 4000
>>> session = Session.from_jwt("jwt")
>>> data_streaming_recognition = DataStreamingRecognition(
...     session,
...     language_code='ru-RU',
...     audio_encoding='LINEAR16_PCM',
...     sample_rate_hertz=8000,
...     partial_results=False,
...     single_utterance=True,
... )
>>> def gen_audio_from_file_function():
...     with open('/path/to/pcm_data/speech.pcm', 'rb') as f:
...         data = f.read(CHUNK_SIZE)
...         while data != b'':
...             yield data
...             data = f.read(CHUNK_SIZE)
...
>>> for i in data_streaming_recognition.recognize(gen_audio_from_file_function):
...     print(i)  # (['text'], final_flag, end_of_utterance_flag)
Read more about streaming recognition in the Yandex streaming recognition docs.
Initialize speechkit.DataStreamingRecognition
- Parameters
session (speechkit._auth.Session) – Session instance for auth
language_code (string | None) – The language to use for recognition. Acceptable values: ru-ru (case-insensitive, used by default): Russian, en-us (case-insensitive): English, tr-tr (case-insensitive): Turkish.
model (string | None) – The language model to be used for recognition. The closer the model is matched, the better the recognition result. You can only specify one model per request. Default value: general.
profanity_filter (boolean | None) – The profanity filter. Acceptable values: true: Exclude profanity from recognition results, false (default): Do not exclude profanity from recognition results.
partial_results (boolean | None) – The intermediate results filter. Acceptable values: true: Return intermediate results (part of the recognized utterance). For intermediate results, final is set to false, false (default): Return only the final results (the entire recognized utterance).
single_utterance (boolean | None) – Flag that disables recognition after the first utterance. Acceptable values: true: Recognize only the first utterance, stop recognition, and wait for the user to disconnect, false (default): Continue recognition until the end of the session.
audio_encoding (string | None) – The format of the submitted audio. Acceptable values: LINEAR16_PCM: LPCM with no WAV header, OGG_OPUS (default): OggOpus format.
sample_rate_hertz (integer | None) – (int64) The sampling frequency of the submitted audio. Required if format is set to LINEAR16_PCM. Acceptable values: 48000 (default): Sampling rate of 48 kHz, 16000: Sampling rate of 16 kHz, 8000: Sampling rate of 8 kHz.
raw_results (boolean | None) – Flag that indicates how to write numbers. true: In words. false (default): In figures.
- recognize(gen_audio_function)¶
Recognize streaming data; gen_audio_function must yield audio data matching the parameters given at initialization.
- Parameters
gen_audio_function (function) – Function generates audio data
- Returns
Yields tuples where the first element is a list of alternative texts, the second is the final flag (boolean), and the third is the endOfUtterance flag (boolean), e.g. ([‘text’], False, False)
- Return type
tuple
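Any generator yielding bytes can serve as gen_audio_function. As an illustration (a hypothetical helper, not part of the SDK), a generator that reads fixed-size chunks from any binary file-like object:

```python
import io

def gen_audio_chunks(source, chunk_size=4000):
    """Yield successive chunk_size-byte blocks from a binary stream."""
    data = source.read(chunk_size)
    while data:
        yield data
        data = source.read(chunk_size)
```

Passing `lambda: gen_audio_chunks(open('speech.pcm', 'rb'))`-style wrappers lets the same chunking logic serve files, sockets, or in-memory buffers.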
- recognize_raw(gen_audio_function)¶
Recognize streaming data; gen_audio_function must yield audio data matching the parameters given at initialization. See the Yandex docs for the response format.
- Parameters
gen_audio_function (function) – Function generates audio data
- Returns
Yields recognized data in raw format
- Return type
speechkit._recognition.yandex.cloud.ai.stt.v2.stt_service_pb2.StreamingRecognitionResponse
- class speechkit.RecognitionLongAudio(session, service_account_id, aws_bucket_name=None, aws_credentials_description='Default AWS credentials created by `speechkit` python SDK', aws_region_name='ru-central1')¶
Bases:
object
Long audio fragment recognition can be used for multi-channel audio files up to 1 GB. To recognize long audio fragments, you need to execute 2 requests:
1. Send a file for recognition.
2. Get recognition results.
- Example
>>> recognizeLongAudio = RecognitionLongAudio(session, '<service_account_id>')
>>> recognizeLongAudio.send_for_recognition('file/path')
>>> if recognizeLongAudio.get_recognition_results():
...     data = recognizeLongAudio.get_data()
...
>>> recognizeLongAudio.get_raw_text()
'raw recognized text'
Initialize speechkit.RecognitionLongAudio
- Parameters
session (speechkit.Session) – Session instance for auth
service_account_id (string) – ID of the service account used to create AWS credentials
aws_bucket_name (string | None) – Name of the Object Storage bucket used for uploading audio
aws_credentials_description (string) – Description for the AWS credentials created by the SDK
aws_region_name (string) – AWS region name. Default value: ru-central1
- get_data()¶
Get the response. Use speechkit.RecognitionLongAudio.get_recognition_results() first to store _answer_data, which contains a list of recognition results (chunks[]).
- Returns
None if no text was found; otherwise, each result in the chunks[] list contains the following fields:
alternatives[]: List of recognized text alternatives. Each alternative contains the following fields:
words[]: List of recognized words:
startTime: Time stamp of the beginning of the word in the recording. An error of 1-2 seconds is possible.
endTime: Time stamp of the end of the word. An error of 1-2 seconds is possible.
word: Recognized word. Recognized numbers are written in words (for example, twelve rather than 12).
confidence: This field currently isn’t supported. Don’t use it.
text: Full recognized text. By default, numbers are written in figures. To have the entire text recognized in words, set the raw_results field to true.
confidence: This field currently isn’t supported. Don’t use it.
channelTag: Audio channel that recognition was performed for.
- Return type
list | None
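Given the structure above, flattening chunks[] into timed words is straightforward. A sketch (a hypothetical helper, not an SDK method) that takes the first alternative of each chunk:

```python
def collect_words(chunks):
    """Return (word, startTime, endTime) tuples from the first
    alternative of each chunk in a get_data() result."""
    out = []
    for chunk in chunks:
        best = chunk["alternatives"][0]        # top-ranked alternative
        for w in best.get("words", []):
            out.append((w["word"], w["startTime"], w["endTime"]))
    return out
```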
- get_raw_text()¶
Get raw text from the stored _answer_data
- Returns
Text
- Return type
string
- get_recognition_results()¶
Monitor the recognition results using the received ID. The number of result monitoring requests is limited, so consider the recognition speed: it takes about 10 seconds to recognize 1 minute of single-channel audio.
- Returns
State of recognition is done or not
- Return type
boolean
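Because the number of monitoring requests is limited, a polling loop with a generous interval is a reasonable pattern. A sketch (a hypothetical helper, not part of the SDK):

```python
import time

def wait_for_recognition(recognizer, poll_interval=10.0, max_polls=60):
    """Poll get_recognition_results() until recognition finishes
    or max_polls attempts have been made. ~10 s per minute of
    single-channel audio is a sensible interval."""
    for _ in range(max_polls):
        if recognizer.get_recognition_results():
            return True
        time.sleep(poll_interval)
    return False
```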
- send_for_recognition(file_path, **kwargs)¶
Send a file for recognition
- Parameters
file_path (string) – Path to input file
folder_id (string) – ID of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
languageCode (string) – The language that recognition will be performed for. Only Russian is currently supported (ru-RU).
model (string) – The language model to be used for recognition. Default value: general.
profanityFilter (boolean) – The profanity filter.
audioEncoding (string) –
The format of the submitted audio. Acceptable values:
LINEAR16_PCM: LPCM with no WAV header.
OGG_OPUS (default): OggOpus format.
sampleRateHertz (integer) –
The sampling frequency of the submitted audio. Required if format is set to LINEAR16_PCM. Acceptable values:
48000 (default): Sampling rate of 48 kHz.
16000: Sampling rate of 16 kHz.
8000: Sampling rate of 8 kHz.
audioChannelCount (integer) – The number of channels in LPCM files. By default, 1. Don’t use this field for OggOpus files.
rawResults (boolean) – Flag that indicates how to write numbers. true: In words. false (default): In figures.
- Return type
None
- class speechkit.Session(auth_type, credential, folder_id)¶
Bases:
object
Provides Yandex API authentication and stores credentials for the given auth method.
- Parameters
auth_type (string) – Type of auth; either Session.IAM_TOKEN or Session.API_KEY
folder_id (string | None) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
credential (string) – Auth key iam or api key
- API_KEY = 'api_key'¶
API key for api-key auth; value: ‘api_key’
- IAM_TOKEN = 'iam_token'¶
IAM token for IAM auth; value: ‘iam_token’
- property auth_method¶
- classmethod from_api_key(api_key, folder_id=None)¶
Creates Session from an API key
- Parameters
api_key (string) – Yandex Cloud Api-Key
folder_id (string | None) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
Session instance
- Return type
speechkit.Session
- classmethod from_jwt(jwt_token, folder_id=None)¶
Creates Session from JWT token
- Parameters
jwt_token (string) – JWT
folder_id (string | None) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
Session instance
- Return type
speechkit.Session
- classmethod from_yandex_passport_oauth_token(yandex_passport_oauth_token, folder_id)¶
Creates Session from a Yandex account OAuth token
- Parameters
yandex_passport_oauth_token (string) – OAuth token from Yandex.OAuth
folder_id (string) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
Session instance
- Return type
speechkit.Session
- property header¶
Authentication header.
- Returns
Dict in format {‘Authorization’: ‘Bearer or Api-Key {iam or api_key}’}
- Return type
dict
- property streaming_recognition_header¶
Authentication header for streaming recognition
- Returns
Tuple in format (‘authorization’, ‘Bearer or Api-Key {iam or api_key}’)
- Return type
tuple
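The two header shapes documented above (a REST dict and a gRPC metadata tuple) can be illustrated with a small sketch; this mirrors the documented format and is not the SDK's internal implementation:

```python
def auth_headers(auth_method, credential):
    """Build both documented header shapes: IAM tokens use the
    Bearer scheme, API keys use the Api-Key scheme."""
    prefix = "Api-Key" if auth_method == "api_key" else "Bearer"
    value = f"{prefix} {credential}"
    return {"Authorization": value}, ("authorization", value)
```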
- class speechkit.ShortAudioRecognition(session)¶
Bases:
object
Short audio recognition ensures fast response time and is suitable for short single-channel audio.
- Audio requirements:
Maximum file size: 1 MB.
Maximum length: 30 seconds.
Maximum number of audio channels: 1.
If your file is larger, longer, or has more audio channels, use speechkit.RecognitionLongAudio.
Initialize speechkit.ShortAudioRecognition
- Parameters
session (speechkit.Session) – Session instance for auth
- recognize(data, **kwargs)¶
Recognize text from the given audio data (bytes or io.BytesIO)
- Parameters
data (io.BytesIO, bytes) – Data with audio samples to recognize
lang (string) –
The language to use for recognition. Acceptable values:
ru-RU (by default) — Russian.
en-US — English.
tr-TR — Turkish.
topic (string) – The language model to be used for recognition. Default value: general.
profanityFilter (boolean) – This parameter controls the profanity filter in recognized speech.
format (string) –
The format of the submitted audio. Acceptable values:
lpcm — LPCM with no WAV header.
oggopus (default) — OggOpus.
sampleRateHertz (string) –
The sampling frequency of the submitted audio. Used if format is set to lpcm. Acceptable values:
48000 (default) — Sampling rate of 48 kHz.
16000 — Sampling rate of 16 kHz.
8000 — Sampling rate of 8 kHz.
folderId (string) – ID of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
The recognized text
- Return type
string
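Because the service rejects files over 1 MB, a quick client-side check before calling recognize can save a round trip. A sketch (a hypothetical helper, not part of the SDK):

```python
MAX_SHORT_AUDIO_BYTES = 1024 * 1024  # documented 1 MB limit

def fits_short_audio(data):
    """True if the payload is within the 1 MB short-audio limit;
    larger files should go through RecognitionLongAudio instead."""
    return len(data) <= MAX_SHORT_AUDIO_BYTES
```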
- class speechkit.SpeechSynthesis(session)¶
Bases:
object
Generates speech from received text.
Initialize speechkit.SpeechSynthesis
- Parameters
session (speechkit.Session) – Session instance for auth
- synthesize(file_path, **kwargs)¶
Generates speech from received text and saves it to a file
- Parameters
file_path (string) – Path to the file where the synthesized audio will be stored
text (string) – UTF-8 encoded text to be converted to speech. You can use only one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.
ssml (string) – Text in SSML format to be converted into speech. You can use only one of the text and ssml fields.
lang (string) –
Language. Acceptable values:
ru-RU (default) — Russian.
en-US — English.
tr-TR — Turkish.
voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.
speed (string) –
Rate (speed) of synthesized speech. The rate of speech is set as a decimal number in the range from 0.1 to 3.0. Where:
3.0 — Fastest rate.
1.0 (default) — Average human speech rate.
0.1 — Slowest speech rate.
format (string) –
The format of the synthesized audio. Acceptable values:
lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties:
Sampling — 8, 16, or 48 kHz, depending on the value of the sample_rate_hertz parameter.
Bit depth — 16-bit.
Byte order — Reversed (little-endian).
Audio data is stored as signed integers.
oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).
sample_rate_hertz (integer) –
The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values:
48000 (default): Sampling rate of 48 kHz.
16000: Sampling rate of 16 kHz.
8000: Sampling rate of 8 kHz.
folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
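The lpcm output described above is headerless 16-bit little-endian signed PCM. A sketch decoding it into integer samples with the standard library (assumes mono 16-bit data):

```python
import array
import sys

def lpcm_to_samples(raw):
    """Decode headerless 16-bit little-endian LPCM bytes into
    signed integer samples."""
    samples = array.array("h")  # signed 16-bit
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the wire format is little-endian
    return samples.tolist()
```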
- synthesize_stream(**kwargs)¶
Generates speech from received text and returns an io.BytesIO object with the data.
- Parameters
text (string) – UTF-8 encoded text to be converted to speech. You can use only one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.
ssml (string) – Text in SSML format to be converted into speech. You can use only one of the text and ssml fields.
lang (string) –
Language. Acceptable values:
ru-RU (default) — Russian.
en-US — English.
tr-TR — Turkish.
voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.
speed (string) –
Rate (speed) of synthesized speech. The rate of speech is set as a decimal number in the range from 0.1 to 3.0. Where:
3.0 — Fastest rate.
1.0 (default) — Average human speech rate.
0.1 — Slowest speech rate.
format (string) –
The format of the synthesized audio. Acceptable values:
lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties:
Sampling — 8, 16, or 48 kHz, depending on the value of the sample_rate_hertz parameter.
Bit depth — 16-bit.
Byte order — Reversed (little-endian).
Audio data is stored as signed integers.
oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).
sampleRateHertz (string) –
The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values:
48000 (default): Sampling rate of 48 kHz.
16000: Sampling rate of 16 kHz.
8000: Sampling rate of 8 kHz.
folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
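Since lpcm data returned by synthesize_stream has no WAV header, wrapping it in a WAV container is a common follow-up for playback. A sketch using the standard wave module (assumes 16-bit mono output):

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate_hertz=48000):
    """Wrap headerless 16-bit mono LPCM bytes in a WAV container
    and return the resulting file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)                  # mono
        w.setsampwidth(2)                  # 16-bit samples
        w.setframerate(sample_rate_hertz)
        w.writeframes(pcm_bytes)
    return buf.getvalue()
```

The sample_rate_hertz argument should match the value requested from the synthesis call.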