speechkit package¶
speechkit: a Python SDK for Yandex speech recognition and synthesis.
- class speechkit.DataStreamingRecognition(session, language_code=None, model=None, profanity_filter=None, partial_results=None, single_utterance=None, audio_encoding=None, sample_rate_hertz=None, raw_results=None)¶
Bases:
object
Data streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection.
Unlike other recognition methods, you can get intermediate results while speech is in progress. After a pause, the service returns final results and starts recognizing the next utterance.
After receiving the message with the recognition settings, the service starts a recognition session. The following limitations apply to each session:
1. You can’t send audio fragments too often or too rarely. The time between messages to the service should be approximately the same as the duration of the audio fragments you send, but no more than 5 seconds. For example, send 400 ms of audio for recognition every 400 ms.
2. Maximum duration of transmitted audio for the entire session: 5 minutes.
3. Maximum size of transmitted audio data: 10 MB.
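The pacing rule above ties chunk duration to chunk size. A minimal sketch of that arithmetic (a hypothetical helper, not part of the SDK), assuming 16-bit LPCM audio:

```python
# Sketch: bytes of 16-bit LPCM audio covering a given chunk duration,
# so that chunks sent every chunk_ms milliseconds match the pacing rule.
def chunk_size_bytes(sample_rate_hertz, chunk_ms=400, bytes_per_sample=2, channels=1):
    """Byte length of chunk_ms milliseconds of LPCM audio."""
    return sample_rate_hertz * bytes_per_sample * channels * chunk_ms // 1000
```

At 8000 Hz, 400 ms of 16-bit mono LPCM is 6400 bytes, so reading the file in 6400-byte chunks and sending one every 400 ms stays within the limits.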
To use this type of recognition, you need to create a function that yields audio data as bytes.
- Example
>>> CHUNK_SIZE = 4000
>>> session = Session.from_jwt("jwt")
>>> data_streaming_recognition = DataStreamingRecognition(
...     session,
...     language_code='ru-RU',
...     audio_encoding='LINEAR16_PCM',
...     sample_rate_hertz=8000,
...     partial_results=False,
...     single_utterance=True,
... )
>>> def gen_audio_from_file_function():
...     with open('/path/to/pcm_data/speech.pcm', 'rb') as f:
...         data = f.read(CHUNK_SIZE)
...         while data != b'':
...             yield data
...             data = f.read(CHUNK_SIZE)
...
>>> for i in data_streaming_recognition.recognize(gen_audio_from_file_function):
...     print(i)  # (['text'], final_flag, end_of_utterance_flag)
Read more about streaming recognition in the Yandex streaming recognition docs.
Initialize speechkit.DataStreamingRecognition
- Parameters
session (speechkit._auth.Session) – Session instance for auth
language_code (string | None) – The language to use for recognition. Acceptable values: ru-ru (case-insensitive, used by default): Russian, en-us (case-insensitive): English, tr-tr (case-insensitive): Turkish.
model (string | None) – The language model to be used for recognition. The closer the model is matched, the better the recognition result. You can only specify one model per request. Default value: general.
profanity_filter (boolean | None) – The profanity filter. Acceptable values: true: Exclude profanity from recognition results, false (default): Do not exclude profanity from recognition results.
partial_results (boolean | None) – The intermediate results filter. Acceptable values: true: Return intermediate results (part of the recognized utterance). For intermediate results, final is set to false, false (default): Return only the final results (the entire recognized utterance).
single_utterance (boolean | None) – Flag that disables recognition after the first utterance. Acceptable values: true: Recognize only the first utterance, stop recognition, and wait for the user to disconnect, false (default): Continue recognition until the end of the session.
audio_encoding (string | None) – The format of the submitted audio. Acceptable values: LINEAR16_PCM: LPCM with no WAV header, OGG_OPUS (default): OggOpus format.
sample_rate_hertz (integer | None) – (int64) The sampling frequency of the submitted audio. Required if format is set to LINEAR16_PCM. Acceptable values: 48000 (default): Sampling rate of 48 kHz, 16000: Sampling rate of 16 kHz, 8000: Sampling rate of 8 kHz.
raw_results (boolean | None) – Flag that indicates how to write numbers. true: In words. false (default): In figures.
- recognize(gen_audio_function)¶
Recognize streaming data; gen_audio_function must yield audio data matching the parameters given at initialization.
- Parameters
gen_audio_function (function) – Function generates audio data
- Returns
Yields tuples where the first element is a list of alternative texts, the second is the final flag (boolean), and the third is the endOfUtterance flag (boolean), e.g. ([‘text’], False, False)
- Return type
tuple
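Any generator yielding bytes can serve as gen_audio_function. As an illustration (a hypothetical helper, not part of the SDK), a generator that reads fixed-size chunks from any binary file-like object:

```python
import io

def gen_audio_chunks(source, chunk_size=4000):
    """Yield successive chunk_size-byte blocks from a binary stream."""
    data = source.read(chunk_size)
    while data:
        yield data
        data = source.read(chunk_size)
```

Passing `lambda: gen_audio_chunks(open('speech.pcm', 'rb'))`-style wrappers lets the same chunking logic serve files, sockets, or in-memory buffers.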
- recognize_raw(gen_audio_function)¶
Recognize streaming data; gen_audio_function must yield audio data matching the parameters given at initialization. See the Yandex docs for the response format.
- Parameters
gen_audio_function (function) – Function generates audio data
- Returns
Yields recognized data in raw format
- Return type
speechkit._recognition.yandex.cloud.ai.stt.v2.stt_service_pb2.StreamingRecognitionResponse
- class speechkit.RecognitionLongAudio(session, service_account_id, aws_bucket_name=None, aws_credentials_description='Default AWS credentials created by `speechkit` python SDK', aws_region_name='ru-central1')¶
Bases:
object
Long audio fragment recognition can be used for multi-channel audio files up to 1 GB. To recognize long audio fragments, you need to execute 2 requests:
1. Send a file for recognition.
2. Get recognition results.
- Example
>>> recognizeLongAudio = RecognitionLongAudio(session, '<service_account_id>')
>>> recognizeLongAudio.send_for_recognition('file/path')
>>> if recognizeLongAudio.get_recognition_results():
...     data = recognizeLongAudio.get_data()
...
>>> recognizeLongAudio.get_raw_text()
'raw recognized text'
Initialize speechkit.RecognitionLongAudio
- Parameters
session (speechkit.Session) – Session instance for auth
service_account_id (string) – ID of the service account used to create AWS credentials
aws_bucket_name (string | None) – Name of the Object Storage bucket used for uploading audio
aws_credentials_description (string) – Description for the AWS credentials created by the SDK
aws_region_name (string) – AWS region name. Default value: ru-central1
- get_data()¶
Get the response. Use speechkit.RecognitionLongAudio.get_recognition_results() first to store _answer_data, which contains a list of recognition results (chunks[]).
- Returns
None if no text was found; otherwise, each result in the chunks[] list contains the following fields:
alternatives[]: List of recognized text alternatives. Each alternative contains the following fields:
words[]: List of recognized words:
startTime: Time stamp of the beginning of the word in the recording. An error of 1-2 seconds is possible.
endTime: Time stamp of the end of the word. An error of 1-2 seconds is possible.
word: Recognized word. Recognized numbers are written in words (for example, twelve rather than 12).
confidence: This field currently isn’t supported. Don’t use it.
text: Full recognized text. By default, numbers are written in figures. To have the entire text recognized in words, set the raw_results field to true.
confidence: This field currently isn’t supported. Don’t use it.
channelTag: Audio channel that recognition was performed for.
- Return type
list | None
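Given the structure above, flattening chunks[] into timed words is straightforward. A sketch (a hypothetical helper, not an SDK method) that takes the first alternative of each chunk:

```python
def collect_words(chunks):
    """Return (word, startTime, endTime) tuples from the first
    alternative of each chunk in a get_data() result."""
    out = []
    for chunk in chunks:
        best = chunk["alternatives"][0]        # top-ranked alternative
        for w in best.get("words", []):
            out.append((w["word"], w["startTime"], w["endTime"]))
    return out
```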
- get_raw_text()¶
Get raw text from the stored _answer_data
- Returns
Text
- Return type
string
- get_recognition_results()¶
Monitor the recognition results using the received ID. The number of result monitoring requests is limited, so consider the recognition speed: it takes about 10 seconds to recognize 1 minute of single-channel audio.
- Returns
State of recognition is done or not
- Return type
boolean
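Because the number of monitoring requests is limited, a polling loop with a generous interval is a reasonable pattern. A sketch (a hypothetical helper, not part of the SDK):

```python
import time

def wait_for_recognition(recognizer, poll_interval=10.0, max_polls=60):
    """Poll get_recognition_results() until recognition finishes
    or max_polls attempts have been made. ~10 s per minute of
    single-channel audio is a sensible interval."""
    for _ in range(max_polls):
        if recognizer.get_recognition_results():
            return True
        time.sleep(poll_interval)
    return False
```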
- send_for_recognition(file_path, **kwargs)¶
Send a file for recognition
- Parameters
file_path (string) – Path to input file
folder_id (string) – ID of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
languageCode (string) – The language that recognition will be performed for. Only Russian is currently supported (ru-RU).
model (string) – The language model to be used for recognition. Default value: general.
profanityFilter (boolean) – The profanity filter.
audioEncoding (string) –
The format of the submitted audio. Acceptable values:
LINEAR16_PCM: LPCM with no WAV header.
OGG_OPUS (default): OggOpus format.
sampleRateHertz (integer) –
The sampling frequency of the submitted audio. Required if format is set to LINEAR16_PCM. Acceptable values:
48000 (default): Sampling rate of 48 kHz.
16000: Sampling rate of 16 kHz.
8000: Sampling rate of 8 kHz.
audioChannelCount (integer) – The number of channels in LPCM files. By default, 1. Don’t use this field for OggOpus files.
rawResults (boolean) – Flag that indicates how to write numbers. true: In words. false (default): In figures.
- Return type
None
- class speechkit.Session(auth_type, credential, folder_id)¶
Bases:
object
Provides Yandex API authentication and stores credentials for the given auth method.
- Parameters
auth_type (string) – Type of auth; either Session.IAM_TOKEN or Session.API_KEY
folder_id (string | None) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
credential (string) – Auth key iam or api key
- API_KEY = 'api_key'¶
API key for api-key auth; value: ‘api_key’
- IAM_TOKEN = 'iam_token'¶
IAM token for IAM auth; value: ‘iam_token’
- property auth_method¶
- classmethod from_api_key(api_key, folder_id=None)¶
Creates Session from an API key
- Parameters
api_key (string) – Yandex Cloud Api-Key
folder_id (string | None) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
Session instance
- Return type
speechkit.Session
- classmethod from_jwt(jwt_token, folder_id=None)¶
Creates Session from JWT token
- Parameters
jwt_token (string) – JWT
folder_id (string | None) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
Session instance
- Return type
speechkit.Session
- classmethod from_yandex_passport_oauth_token(yandex_passport_oauth_token, folder_id)¶
Creates Session from a Yandex account OAuth token
- Parameters
yandex_passport_oauth_token (string) – OAuth token from Yandex.OAuth
folder_id (string) – Id of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
Session instance
- Return type
speechkit.Session
- property header¶
Authentication header.
- Returns
Dict in format {‘Authorization’: ‘Bearer or Api-Key {iam or api_key}’}
- Return type
dict
- property streaming_recognition_header¶
Authentication header for streaming recognition
- Returns
Tuple in format (‘authorization’, ‘Bearer or Api-Key {iam or api_key}’)
- Return type
tuple
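The two header shapes documented above (a REST dict and a gRPC metadata tuple) can be illustrated with a small sketch; this mirrors the documented format and is not the SDK's internal implementation:

```python
def auth_headers(auth_method, credential):
    """Build both documented header shapes: IAM tokens use the
    Bearer scheme, API keys use the Api-Key scheme."""
    prefix = "Api-Key" if auth_method == "api_key" else "Bearer"
    value = f"{prefix} {credential}"
    return {"Authorization": value}, ("authorization", value)
```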
- class speechkit.ShortAudioRecognition(session)¶
Bases:
object
Short audio recognition ensures fast response time and is suitable for short single-channel audio.
- Audio requirements:
Maximum file size: 1 MB.
Maximum length: 30 seconds.
Maximum number of audio channels: 1.
If your file is larger, longer, or has more audio channels, use speechkit.RecognitionLongAudio.
Initialize speechkit.ShortAudioRecognition
- Parameters
session (speechkit.Session) – Session instance for auth
- recognize(data, **kwargs)¶
Recognize text from the given audio data (bytes or io.BytesIO)
- Parameters
data (io.BytesIO, bytes) – Data with audio samples to recognize
lang (string) –
The language to use for recognition. Acceptable values:
ru-RU (by default) — Russian.
en-US — English.
tr-TR — Turkish.
topic (string) – The language model to be used for recognition. Default value: general.
profanityFilter (boolean) – This parameter controls the profanity filter in recognized speech.
format (string) –
The format of the submitted audio. Acceptable values:
lpcm — LPCM with no WAV header.
oggopus (default) — OggOpus.
sampleRateHertz (string) –
The sampling frequency of the submitted audio. Used if format is set to lpcm. Acceptable values:
48000 (default) — Sampling rate of 48 kHz.
16000 — Sampling rate of 16 kHz.
8000 — Sampling rate of 8 kHz.
folderId (string) – ID of the folder that you have access to. Don’t specify this field if you make a request on behalf of a service account.
- Returns
The recognized text
- Return type
string
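Because the service rejects files over 1 MB, a quick client-side check before calling recognize can save a round trip. A sketch (a hypothetical helper, not part of the SDK):

```python
MAX_SHORT_AUDIO_BYTES = 1024 * 1024  # documented 1 MB limit

def fits_short_audio(data):
    """True if the payload is within the 1 MB short-audio limit;
    larger files should go through RecognitionLongAudio instead."""
    return len(data) <= MAX_SHORT_AUDIO_BYTES
```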
- class speechkit.SpeechSynthesis(session)¶
Bases:
object
Generates speech from received text.
Initialize speechkit.SpeechSynthesis
- Parameters
session (speechkit.Session) – Session instance for auth
- synthesize(file_path, **kwargs)¶
Generates speech from received text and saves it to a file
- Parameters
file_path (string) – Path to the file where the synthesized audio will be stored
text (string) – UTF-8 encoded text to be converted to speech. You can use only one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.
ssml (string) – Text in SSML format to be converted into speech. You can use only one of the text and ssml fields.
lang (string) –
Language. Acceptable values:
ru-RU (default) — Russian.
en-US — English.
tr-TR — Turkish.
voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.
speed (string) –
Rate (speed) of synthesized speech. The rate of speech is set as a decimal number in the range from 0.1 to 3.0. Where:
3.0 — Fastest rate.
1.0 (default) — Average human speech rate.
0.1 — Slowest speech rate.
format (string) –
The format of the synthesized audio. Acceptable values:
lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties:
Sampling — 8, 16, or 48 kHz, depending on the value of the sample_rate_hertz parameter.
Bit depth — 16-bit.
Byte order — Reversed (little-endian).
Audio data is stored as signed integers.
oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).
sample_rate_hertz (integer) –
The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values:
48000 (default): Sampling rate of 48 kHz.
16000: Sampling rate of 16 kHz.
8000: Sampling rate of 8 kHz.
folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
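The lpcm output described above is headerless 16-bit little-endian signed PCM. A sketch decoding it into integer samples with the standard library (assumes mono 16-bit data):

```python
import array
import sys

def lpcm_to_samples(raw):
    """Decode headerless 16-bit little-endian LPCM bytes into
    signed integer samples."""
    samples = array.array("h")  # signed 16-bit
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the wire format is little-endian
    return samples.tolist()
```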
- synthesize_stream(**kwargs)¶
Generates speech from received text and returns an io.BytesIO object with the data.
- Parameters
text (string) – UTF-8 encoded text to be converted to speech. You can use only one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.
ssml (string) – Text in SSML format to be converted into speech. You can use only one of the text and ssml fields.
lang (string) –
Language. Acceptable values:
ru-RU (default) — Russian.
en-US — English.
tr-TR — Turkish.
voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.
speed (string) –
Rate (speed) of synthesized speech. The rate of speech is set as a decimal number in the range from 0.1 to 3.0. Where:
3.0 — Fastest rate.
1.0 (default) — Average human speech rate.
0.1 — Slowest speech rate.
format (string) –
The format of the synthesized audio. Acceptable values:
lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties:
Sampling — 8, 16, or 48 kHz, depending on the value of the sample_rate_hertz parameter.
Bit depth — 16-bit.
Byte order — Reversed (little-endian).
Audio data is stored as signed integers.
oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).
sampleRateHertz (string) –
The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values:
48000 (default): Sampling rate of 48 kHz.
16000: Sampling rate of 16 kHz.
8000: Sampling rate of 8 kHz.
folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
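Since lpcm data returned by synthesize_stream has no WAV header, wrapping it in a WAV container is a common follow-up for playback. A sketch using the standard wave module (assumes 16-bit mono output):

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate_hertz=48000):
    """Wrap headerless 16-bit mono LPCM bytes in a WAV container
    and return the resulting file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)                  # mono
        w.setsampwidth(2)                  # 16-bit samples
        w.setframerate(sample_rate_hertz)
        w.writeframes(pcm_bytes)
    return buf.getvalue()
```

The sample_rate_hertz argument should match the value requested from the synthesis call.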