Transcripts (Automated by Google Speech)


The GoogleSpeechTranscriptionService invokes the Google Speech-to-Text service via its REST API to transcribe audio to text.

During the execution of an Opencast workflow, an audio file is extracted from one of the presenter videos and sent to the Google Speech-to-Text service. When the results are received, they are converted to the desired caption format and attached to the media package.

Because Google's Speech-to-Text service can take a while to process a recording, the transcription process is asynchronous: Opencast does not wait for it to finish before continuing with the rest of its normal processing.

When the transcription finishes, a second workflow is started to attach the results to the media package and republish it.

The Google Speech-to-Text service documentation, including the list of currently supported languages, is available on the Google Cloud website.


Note: The instructions and screenshots in this section are based on the Google Speech-to-Text documentation at the time of writing. For up-to-date instructions, search for 'google speech to text configuration' or visit the Google Cloud service page.

Step 1: Activate Google Speech and Google Cloud Storage APIs

Step 2: Get Google Cloud credentials




Getting your Refresh Token and Authorization endpoint



Step 3: Configure GoogleSpeechTranscriptionService

Edit etc/org.opencastproject.transcription.googlespeech.GoogleSpeechTranscriptionService.cfg:

Example configuration file:

    # Change enabled to true to enable this service.

    # Google Cloud Storage bucket: <BUCKET_NAME>

    # Language of the supplied audio. See the Google Speech-to-Text service documentation
    # for available languages. If empty, the default will be used ("en-US").

    # Filter out profanities from the result. Default is false.

    # Workflow to be executed when results are ready to be attached to the media package.

    # Interval at which the workflow dispatcher runs to start workflows that attach transcripts
    # to the media package after the transcription job is completed.
    # (in seconds) Default is 1 minute.

    # How long to wait before checking jobs after their start date + track duration has passed.
    # The default is 5 minutes.
    # (in seconds)

    # How long to wait after a transcription is supposed to finish before marking the job as
    # cancelled in the database. Default is 5 hours.
    # (in seconds)

    # How long to keep result files in the working file repository, in days.
    # The default is 7 days.

    # Email to send notifications of errors. If not entered, the value from
    # in will be used.

Step 4: Add encoding profile for extracting audio

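The Google Speech-to-Text service accepts only certain audio encodings; the supported ones are listed in the Google Speech-to-Text documentation. The compose operation in Step 5 uses an encoding profile named audio-flac. By default Opencast will use the encoding settings in etc/encoding/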

Step 5: Add workflow operations and create new workflow

Add the following operations to your workflow. We suggest adding them after the media package is published, so that users can watch videos without waiting for the transcription to finish, but this depends on your use case. The only requirement is that the workflow takes a snapshot of the media package, so that the second workflow can retrieve it from the archive and attach the captions/transcripts.

    <!-- Encode audio to FLAC -->
    <operation
      id="compose"
      description="Extract audio for transcript generation">
      <configurations>
        <configuration key="source-flavor">*/source</configuration>
        <configuration key="target-flavor">audio/flac</configuration>
        <configuration key="target-tags">transcript</configuration>
        <configuration key="encoding-profile">audio-flac</configuration>
        <configuration key="process-first-match-only">true</configuration>
      </configurations>
    </operation>

    <!-- Start Google Speech transcription job -->
    <operation
      id="google-speech-start-transcription"
      description="Start Google Speech transcription job">
      <configurations>
        <!-- Skip this operation if the flavor already exists, e.g. when the media package already has captions. -->
        <configuration key="skip-if-flavor-exists">captions/timedtext</configuration>
        <configuration key="language-code">en-US</configuration>
        <!-- Audio to be transcribed, produced by the previous compose operation -->
        <configuration key="source-tag">transcript</configuration>
      </configurations>
    </operation>
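As noted above, the media package must be archived with a snapshot so that the second workflow can retrieve it. If your workflow does not already archive the media package after these operations, an operation along the following lines could be added after them. This is a sketch, not a required setting: the description text is illustrative, and archiving every element (*/*) is an assumption. Narrow source-flavors, or use source-tags instead, to match what your workflow normally archives.

    <!-- Archive the media package so the attach workflow can retrieve it later.
         Assumption: archive all elements; narrow the flavors to suit your setup. -->
    <operation
      id="snapshot"
      description="Archive media package for the transcript attach workflow">
      <configurations>
        <configuration key="source-flavors">*/*</configuration>
      </configurations>
    </operation>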

Step 6: Create a workflow that will add the generated caption/transcript to the media package and republish it

A sample workflow can be found in etc/workflows/google-speech-attach-transcripts.xml.

    <!-- Attach caption/transcript -->
    <operation id="google-speech-attach-transcription"
      description="Attach captions/transcription">
      <configurations>
        <!-- This is filled out by the transcription service when starting this workflow -->
        <configuration key="transcription-job-id">${transcriptionJobId}</configuration>
        <configuration key="line-size">80</configuration>
        <configuration key="target-flavor">captions/timedtext</configuration>
        <configuration key="target-tag">archive</configuration>
        <configuration key="target-caption-format">vtt</configuration>
      </configurations>
    </operation>

    <!-- Publish to engage player -->
    <operation id="publish-engage"
      description="Distribute and publish to engage server">
      <configurations>
        <configuration key="download-source-flavors">dublincore/*,security/*,captions/*</configuration>
        <configuration key="strategy">merge</configuration>
        <configuration key="check-availability">false</configuration>
      </configurations>
    </operation>

    <!-- Publish to OAI-PMH -->
    <operation id="republish-oaipmh"
      description="Update recording metadata in default OAI-PMH repository">
      <configurations>
        <configuration key="source-flavors">dublincore/*,security/*,captions/*</configuration>
        <configuration key="repository">default</configuration>
      </configurations>
    </operation>
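The operations above live inside a workflow definition file. The following is a minimal sketch of the enclosing structure, assuming the standard Opencast workflow definition layout. It also assumes that the id must match the workflow named in the service configuration, taken here from the sample file name google-speech-attach-transcripts. The title and the closing snapshot operation are illustrative assumptions: the snapshot archives elements tagged archive, which includes the captions tagged by the attach operation. Compare this with the sample file shipped with Opencast before relying on it.

    <?xml version="1.0" encoding="UTF-8"?>
    <definition xmlns="http://workflow.opencastproject.org">
      <!-- Assumption: this id matches the workflow named in the service configuration -->
      <id>google-speech-attach-transcripts</id>
      <title>Attach Google Speech transcripts</title>
      <operations>
        <!-- The google-speech-attach-transcription, publish-engage and
             republish-oaipmh operations from above go here -->

        <!-- Assumption: archive the updated media package so the new captions are kept -->
        <operation
          id="snapshot"
          description="Archive media package with captions">
          <configurations>
            <configuration key="source-tags">archive</configuration>
          </configurations>
        </operation>
      </operations>
    </definition>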

Transcription delay before cancellation

If an event is deleted before the Google transcription process ends, the Google Speech-to-Text API has issues, or something else unexpected happens, the transcription process for that event is not cancelled immediately. Instead, the transcription is attempted several times within a period determined by the video duration and two configuration properties: completion.check.buffer and max.processing.time.

The sum of the video duration, completion.check.buffer and max.processing.time determines how long Opencast waits before cancelling a Google transcription job.

completion.check.buffer: 5 minutes by default.

max.processing.time: 5 hours by default.

Both values can be changed in the Google transcription configuration file: etc/org.opencastproject.transcription.googlespeech.GoogleSpeechTranscriptionService.cfg

For example, for a 30-minute video with the default values, it takes 5 hours and 35 minutes before the transcription is cancelled (when something goes wrong).
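The example works out as follows when the defaults are converted to seconds, the unit the configuration file uses for both properties. Here t_video is the 30-minute (1800 s) video duration:

$$
\begin{aligned}
t_{\text{cancel}} &= t_{\text{video}} + \texttt{completion.check.buffer} + \texttt{max.processing.time} \\
&= 1800\,\text{s} + 300\,\text{s} + 18000\,\text{s} \\
&= 20100\,\text{s} = 5\,\text{h}\;35\,\text{min}
\end{aligned}
$$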

Workflow Operations