Speech to Text Workflow Operation

ID: speechtotext


The speech to text operation can be used to generate subtitles from video or audio files. Currently, three STT engines are available: Whisper, WhisperC++, and Vosk. The subtitle file format is WebVTT.

Parameter Table

| Configuration key | Required | Example | Description |
|-------------------|----------|---------|-------------|
| source-flavor | yes | source/presenter | Flavor of the source media (video or audio) to use. |
| target-flavor | yes | delivery/captions | Flavor of the produced subtitle file. |
| target-element | no | track | Where to append the subtitle file: as a track or as an attachment. The default is track. |
| language-code | no | de | Language of the video or audio source (default is "eng"). Vosk only: it has to match the name of the language model directory. See 'vosk-cli'. |
| language-fallback | yes* | en | Fallback value used if the Dublin Core/media package language field is not present. |
| target-tags | no | archive | Tags for the subtitle file.** The generator and generator-type tags are set automatically. (Whisper/WhisperC++ only: if no language-code is set, the lang tag is generated automatically.) |
| translate | no | true | If true, the transcription is translated into English. Valid values are true or false (Whisper/WhisperC++ only). |

* Vosk only; the default value can be modified in the Vosk config file.
** For conventionally used tags, see the general page on Subtitles.


For this operation to work, one of the vosk-cli, whisper, or whispercpp packages has to be installed.

Operation Examples

    <operation id="speechtotext"
      description="Generates subtitles for video and audio files">
      <configurations>
        <configuration key="source-flavor">*/source</configuration>
        <configuration key="target-flavor">captions/source</configuration>
        <configuration key="target-element">track</configuration>
        <configuration key="target-tags">archive,subtitle,engage-download</configuration>
        <configuration key="language-code">eng</configuration>
      </configurations>
    </operation>

    <operation id="speechtotext"
      description="Generates subtitles for video and audio files, derive language-code from metadata">
      <configurations>
        <configuration key="source-flavor">*/source</configuration>
        <configuration key="target-flavor">captions/source</configuration>
        <configuration key="target-element">track</configuration>
        <configuration key="target-tags">archive,subtitle,engage-download</configuration>
      </configurations>
    </operation>
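
As a further sketch built only from the keys in the parameter table, the following operation asks for the transcription to be translated into English and stores the result as an attachment rather than a track. It assumes Whisper or WhisperC++ is the installed engine, since the translate key is documented for those engines only. The flavor and tag values are reused from the examples in this section for illustration and are not prescribed.

    <operation id="speechtotext"
      description="Generates English subtitles by translating the transcription">
      <configurations>
        <configuration key="source-flavor">*/source</configuration>
        <configuration key="target-flavor">captions/source</configuration>
        <!-- store the WebVTT file as an attachment instead of the default track -->
        <configuration key="target-element">attachment</configuration>
        <configuration key="target-tags">archive,subtitle,engage-download</configuration>
        <!-- Whisper/WhisperC++ only: translate the transcription into English -->
        <configuration key="translate">true</configuration>
      </configurations>
    </operation>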

Language code

The accepted language codes are the two-letter codes defined in ISO 639-1. A reference list can be found here:

List of ISO 639-1 codes
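
To illustrate where a two-letter code appears in practice, here is a minimal sketch that assumes the Vosk engine is in use. No language-code is set, so the language is expected to come from the media package metadata. The language-fallback key supplies "en" if that field is missing. The flavor and tag values are illustrative only.

    <operation id="speechtotext"
      description="Generates subtitles, falling back to English if no language is set in the metadata">
      <configurations>
        <configuration key="source-flavor">*/source</configuration>
        <configuration key="target-flavor">captions/source</configuration>
        <configuration key="target-tags">archive,subtitle</configuration>
        <!-- Vosk only: used when the media package has no language field -->
        <configuration key="language-fallback">en</configuration>
      </configurations>
    </operation>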