This article addresses downloading and editing transcriptions outside of EchoVideo and then uploading them to apply to media. For information on editing transcriptions within EchoVideo via the Transcript Editor, see Using the EchoVideo Transcript Editor.

The accuracy of the transcriptions auto-generated by the ASR service depends greatly on several variables, including the quality of the microphone, the ambient noise in the room, the vocal quality of the speaker, and whether the lecturer is a native speaker of the language being transcribed. In addition, when a lecture uses subject-specific vocabulary, the transcription may render few of those terms correctly, even though students may need them to grasp the subject matter.

To address the accuracy of transcriptions, transcription files can be downloaded, edited, and then re-uploaded to replace the original transcription.

Once cleaned up for accuracy, transcription files can be manually uploaded as closed captioning files. Transcripts and closed captioning files both follow the WebVTT standard and can generally be interchanged. The primary difference between them is accuracy: closed captions require a higher level of accuracy than automatic transcriptions typically provide.

This article provides instructions for working with a transcription file and saving it in a format that can be used for transcriptions, closed captions, or both. Be sure to also review the Tips and Tricks section at the bottom of this article.

If you have updated a transcription file but cannot upload it because it fails file validation, there may be a simple fix. Search the web for "VTT validator" to find a tool that can check your file (such as https://quuz.org/webvtt/). The problem may be as simple as a missing character.
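If you are comfortable running a short script, you can also catch the most common structural mistakes before uploading. The Python sketch below is a rough check, not a replacement for a full validator: it only confirms that the file starts with WEBVTT, that each time cue uses the --> arrow and the hh:mm:ss.ttt format, and that each NOTE line follows a blank line. The function name quick_check is hypothetical.

```python
import re

# One WebVTT timestamp: optional hours, then mm:ss.ttt (for example 00:00:05.940).
TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
# A complete time cue: start --> end, optionally followed by cue settings.
TIMING_LINE = re.compile(rf"{TIMESTAMP}[ \t]+-->[ \t]+{TIMESTAMP}(?:[ \t].*)?")
# A time cue written with "-" or "--" instead of the required "-->" arrow.
BROKEN_ARROW = re.compile(rf"{TIMESTAMP}[ \t]+--?[ \t]+{TIMESTAMP}.*")
# The header: WEBVTT, optionally preceded by a byte order mark and followed by text.
HEADER = re.compile(r"\ufeff?WEBVTT(?:[ \t].*)?")


def quick_check(text: str) -> list[str]:
    """Return a list of problems; an empty list means none of these checks failed.

    A rough sanity check only; it does not implement the full W3C parsing rules.
    """
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    problems = []
    if not HEADER.fullmatch(lines[0]):
        problems.append("line 1: the file must begin with WEBVTT")
    for i, line in enumerate(lines):
        number = i + 1
        if "-->" in line and not TIMING_LINE.fullmatch(line):
            problems.append(f"line {number}: malformed time cue {line!r}")
        elif BROKEN_ARROW.fullmatch(line):
            problems.append(f"line {number}: the time cue arrow must be -->")
        # A NOTE directly under cue text would be read as part of that cue.
        if line.startswith("NOTE") and i > 0 and lines[i - 1].strip():
            problems.append(f"line {number}: NOTE must follow a blank line")
    return problems
```

Passing it the full text of your edited file returns an empty list when none of these checks fail, or one message per problem line otherwise.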

Step 1: Create a Copy of the Transcription

In most cases, the automatic transcription will be the only transcription file applied to the video; at a minimum, it is likely to be the original transcription file.

Download, save a copy, and rename the original transcription

Click on the tile of a video in your Content Home to open the Media Details Page.
Alternatively, you can click a video icon from the Class List and select Details from the menu that appears.
Choose Accessibility.
Click the vertical action menu next to the transcript ID.
Select Download Original from the menu that appears, as shown in the figure below.

Depending on your browser settings, the file will be saved to the Downloads folder, or you will be asked to select a location.
Copy the file to the location where you will work with it, such as from the Downloads folder to a Section Name Transcriptions folder.
Rename the file in Windows Explorer or Mac Finder, adding the date of the class to which it is published. See Tips and Tricks below for an explanation as to why.

Creating a copy and renaming it before opening the file to edit it will make saving your edits easier. While you can work directly in the downloaded transcription file and then use Save As to change its filename or location, doing so adds several steps to the process.
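If you prepare transcripts for many classes, a short script can handle the copy-and-rename step. This Python sketch assumes you want the class date, in YYYY-MM-DD form, appended to the downloaded filename; the function name and the file and folder names in the example below are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def copy_for_editing(download: Path, work_dir: Path, class_date: date) -> Path:
    """Copy a downloaded transcript into work_dir, adding the class date to its name.

    The original download is left untouched, so it remains available as a backup.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / f"{download.stem}_{class_date.isoformat()}{download.suffix}"
    shutil.copy2(download, target)  # copies the contents and file timestamps
    return target
```

For example, copying a hypothetical Downloads/transcript.vtt into a "Biology 101 Transcriptions" folder with date(2024, 9, 3) produces transcript_2024-09-03.vtt in that folder.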

Defining the Sections and Required Components of a WebVTT File

A WebVTT file is essentially a text file that meets the WebVTT specification outlined by the W3C: https://www.w3.org/TR/webvtt1/. While the specification is relatively detailed, it simply defines the components that must reside in the file, how they must be separated from other parts of the file, and guidelines for other non-required components of the file if they exist (spacing, prefixes, etc.).

The figure below shows a downloaded WebVTT transcription file returned from the ASR service. It contains all of the necessary components for a file to be used as a transcription file or a closed captioning file, along with other information. Each component is defined below the figure.
Example WebVTT transcript file returned from the ASR service with entry items as described

WEBVTT: The first entry in the file is the WebVTT header. It simply reads WEBVTT.

The specification requires this. Do not remove or change it.
Start / End Time Cues: Each transcribed text segment is preceded by a time cue such as 00:00:05.940 --> 00:00:16.150. This time cue identifies the start and end points in the video / audio file where the text is spoken, which allows the transcription panel or the closed captions to stay synced with playback.

The time-cue format is very specific. Unless you have a reason for doing so, do not change the time cues in the transcription file. If you need to change one, refer to the specification for proper formatting.
Cue Payload / Spoken Text: The transcribed text is called the cue payload and is the text shown in the transcription or closed caption at the time / duration specified in the time cue.

This text must reside on the line immediately following the time cue.
Confidence Scoring: The NOTE CONF entries are confidence scores given to each word in the cue payload, identifying how confident the automated transcription program was that the word it transcribed was the word that was spoken. The figures shown are percentages, and each number corresponds, in order, to one word in the payload text above it.

The NOTE CONF entries are not required for a valid transcription or closed caption file and can be removed. But they can help find segments where the text is more likely to need editing. A lower confidence percentage might indicate speech is garbled or not appropriately interpreted by the automated transcription program.
Line Breaks: The paragraph markers or line breaks in the transcription file are required and should be left alone if possible. The specification requires that the time cue be on the line immediately before the cue payload and that each time cue / payload set be separated from the following item by at least one blank line.
Notice in the above figure that each NOTE CONF entry is separated by a blank line from both its associated cue segment (the one above it) and the next cue segment. These blank lines are not included solely for the readability of the downloaded file; the WebVTT specification requires that spacing.
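To see how these components fit together, the following Python sketch splits a downloaded transcript into cues and attaches each NOTE CONF block to the cue above it, relying on the blank-line separation described above. It is a simplified reading of the format, not a full implementation of the W3C parsing algorithm, and the names Cue and parse_vtt are hypothetical.

```python
import json
import re
from dataclasses import dataclass, field

# Start and end of a time cue line, for example "00:00:05.940 --> 00:00:16.150".
TIMING = re.compile(r"(\S+)[ \t]+-->[ \t]+(\S+)")


@dataclass
class Cue:
    start: str                                     # time the text begins
    end: str                                       # time the text ends
    text: str                                      # the cue payload
    conf: list[int] = field(default_factory=list)  # NOTE CONF scores, if any


def parse_vtt(text: str) -> list[Cue]:
    """Split a transcript into cues, attaching each NOTE CONF to the cue above it."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Blocks are separated by one or more blank lines; block 0 is the WEBVTT header.
    blocks = [b.strip("\n") for b in re.split(r"\n[ \t]*\n", text) if b.strip()]
    cues: list[Cue] = []
    for block in blocks[1:]:
        lines = block.split("\n")
        if lines[0].startswith("NOTE CONF"):
            if cues:
                raw = " ".join(lines)[len("NOTE CONF"):]
                cues[-1].conf = json.loads(raw)["raw"]
            continue
        if lines[0].startswith("NOTE"):
            continue  # other comments are not part of any cue
        # A cue may begin with an optional identifier line before its time cue.
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                cues.append(Cue(match[1], match[2], "\n".join(lines[i + 1:])))
                break
    return cues
```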

Step 2: Edit and Save the Transcription File

The instructions below use Microsoft Word as the editing program, but you can use any text editor or word processing software. The only requirement is that the program can save your changes either with the original .vtt extension or as plain text with a .txt extension.

Launch the editing program you want to use and open the .vtt file you downloaded and saved for editing.

Once open, the VTT file looks like the figure in the previous section, a text file with a WEBVTT heading, time cues, cue payloads (text), and NOTE CONF entries, all of which are described above.

As shown in the figure below, you may need to change the file type selection box to All Files to find the VTT file to open.
Edit the text cues in the file as needed to match the speaker's words. The Tips and Tricks section below offers suggestions for streamlining the editing process.
When finished, save your edited file.

If you followed all instructions in the Download section above and are working on the renamed copy of the download, click Save to save your edited file. The program might warn you about the file type, but you can ignore it and click OK to save the file.

If you need to save the edited file as a new name or in a different location, you have some extra steps to perform:

Click Save As.
Choose Plain Text (*.txt) as the type of file you are saving, as shown in the figure below.

This will replace the .vtt extension on the file with a .txt extension. You will change this later after saving the file.
Click Save.
If a File Conversion dialog box like the one shown below appears, it is recommended (but not required) that you enable the Insert line breaks checkbox and then select LF only or CR only as the line break type. This ensures that all line breaks are of the same kind.
Open Windows Explorer or Mac Finder and find the file you just saved.
Change the .txt extension to .vtt. The file you upload must have the .vtt extension.
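If you prefer not to rename the file by hand, these final steps can be scripted. The Python sketch below reads the saved .txt file, converts every line break to LF, and writes the result alongside it with a .vtt extension. It assumes the file was saved with UTF-8 text encoding, which WebVTT requires; the function name save_as_vtt is hypothetical, and an existing .vtt file with the same name would be overwritten.

```python
from pathlib import Path


def save_as_vtt(txt_file: Path) -> Path:
    """Write a .txt export back out as .vtt with LF-only line breaks."""
    raw = txt_file.read_bytes().decode("utf-8-sig")  # also drops a leading BOM
    # Normalize Windows (CRLF) and old Mac (CR) line breaks to LF.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    vtt_file = txt_file.with_suffix(".vtt")
    vtt_file.write_bytes(text.encode("utf-8"))
    if vtt_file != txt_file:
        txt_file.unlink()  # same result as renaming the extension by hand
    return vtt_file
```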

When your edited file is complete, it can be uploaded to the capture. The edited version will appear in the classroom transcriptions panel and will be downloaded when Download edited is selected from the transcript menu.

Tips and Tricks for Editing a Transcription File

What follows are some ideas and tips for making working with transcription files easier and faster. Of course, you will have to view the capture and follow along to make accurate changes to the transcript, but there are some other things you can do to help streamline the editing process.

Save a Copy of the Original to a Dedicated Location and Rename It

As described in the Download section above, after downloading the original automated transcription, copy or move it to a dedicated location. If you are responsible for cleaning up all of the transcriptions for a section, create a folder for that section. Then, as you add each new capture's transcription to the folder, add the date of its class to the filename. This makes it easier to know which file goes with which class.

In addition, copying and renaming the file before you open it means you can select Save instead of Save As when you are done (or while editing). Most editing programs can save changes to the same file you opened without forcing a different file type, such as .txt, onto it. This reduces the steps needed to make sure your edited file is an uploadable .vtt file.

Play the Capture in the Classroom With the Transcription Panel Open

One approach to editing transcriptions is to play the capture in the classroom with the transcription panel open while the editing program is open in another window. Because the transcription is synced with the video and highlighted during playback, you can follow the transcribed text along with the audio and easily spot where errors occur. This works even better with a dual-monitor setup, or if you can play the video / transcriptions on a different computer or on a tablet using a mobile browser (transcriptions are not available in the Mobile Apps yet).

Another thing you can do is read through the transcription panel while the video is paused. When you see a segment that requires editing, click on it to sync the video to that location. Notice the timestamp below the playback bar. Find that location in the transcript file, and make your edits.
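If you would rather not search the file by eye, a few lines of Python can find the time cue that covers a given playback position. This sketch assumes the player shows the position as m:ss (or h:mm:ss); it converts both that value and each time cue to seconds and returns the first cue whose start / end range contains the position. The function names are hypothetical.

```python
import re
from typing import Optional

# Captures the start and end times of each time cue line in the file.
TIMING = re.compile(r"^(\S+)[ \t]+-->[ \t]+(\S+)", re.MULTILINE)


def to_seconds(stamp: str) -> float:
    """Convert 'hh:mm:ss.ttt', 'mm:ss.ttt', or a player-style 'm:ss' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def find_cue(vtt_text: str, playback: str) -> Optional[str]:
    """Return the time cue line whose range contains the playback position."""
    position = to_seconds(playback)
    for match in TIMING.finditer(vtt_text):
        if to_seconds(match[1]) <= position < to_seconds(match[2]):
            return match[0]
    return None
```

You can then search the transcript file in your editor for the returned time cue.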

The spacebar works as a pause / play button. When the video reaches a spot where the transcript requires editing, tap the spacebar to pause the video, switch to the editing program, make the edits, then return to the classroom and continue.

Use Search to Locate Lower Confidence Scores

The NOTE CONF entries in the transcript are confidence scores the automatic transcription program assigns to each word, indicating how confident it was that the word it entered was the one spoken. Use these scores to find areas of the text where the program was not certain the term it entered was accurate. This will not always find the most problematic areas in a transcript, but it can help.

Each number in the CONF set corresponds to a word in the text cue immediately above it. Below is one cue where the confidence score for the first word in the cue payload is 41.

00:00:58.870 --> 00:01:02.750
Millions was the first person to see a surface feature on the planet mars.

NOTE CONF {"raw":[41,100,100,100,100,100,99,99,100,98,100,100,100,100]}

In this cue, the word Millions received a confidence score of 41. Watching the video confirms that Millions should be Huygens.
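Because each score lines up with one word, you can also pair them programmatically. The Python sketch below assumes words are separated by whitespace, as in the example above, and uses a hypothetical threshold of 60 to decide what counts as low confidence; adjust it to suit your transcripts. The function name is hypothetical.

```python
import json


def low_confidence_words(payload: str, note_conf: str, threshold: int = 60) -> list[tuple[str, int]]:
    """Return (word, score) pairs whose confidence score is below threshold.

    payload is the cue text; note_conf is the NOTE CONF line beneath it.
    """
    scores = json.loads(note_conf.removeprefix("NOTE CONF").strip())["raw"]
    words = payload.split()
    # One score per word is expected; a mismatch usually means the text was edited.
    if len(words) != len(scores):
        raise ValueError(f"{len(words)} words but {len(scores)} scores")
    return [(word, score) for word, score in zip(words, scores) if score < threshold]


cue = "Millions was the first person to see a surface feature on the planet mars."
conf = 'NOTE CONF {"raw":[41,100,100,100,100,100,99,99,100,98,100,100,100,100]}'
print(low_confidence_words(cue, conf))  # [('Millions', 41)]
```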

One reason Microsoft Word is used as an example editor here is its Find Special capabilities. The steps below show how to use this feature to find the lower confidence score entries in the transcription to help target your editing.

With the transcription file open, select Find then Advanced Find.
Click More to expand the box for additional options.
In the Find what field of the Find and Replace dialog box, type the first digit of the percentage range you want to find in the confidence scores.
The example in the figure below uses 5, to find all scores between 50 and 59 percent confidence.
Click the Special button at the bottom of the dialog box, then select Any digit from the list, as shown in the figure below.
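Click Find Next. Each click of Find Next finds and highlights the next match in the Find what range. Because the search matches any digit pattern, it can also stop on digits that are not confidence scores, such as the 58 in the time cue 00:00:58.870; skip those matches.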
You can try to make the edits as you go along, or you may want to simply highlight the corresponding words / phrases for now, then return while viewing the capture to fix the errors later.
Repeat these steps for each range of percentages you want to highlight.
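Outside of Word, a similar search can be done with a short script that reads only the NOTE CONF lines, so digits in the time cues are never matched. This Python sketch assumes each cue begins with its time cue line, as in the downloaded file, and returns the time cue and text of every cue that contains a score in the range you give. The function name is hypothetical.

```python
import json
import re

# Matches a time cue line at the start of a block.
TIMING = re.compile(r"\S+[ \t]+-->[ \t]+\S+")


def cues_with_scores_in_range(vtt_text: str, low: int, high: int) -> list[tuple[str, str]]:
    """Return (time cue, text) for each cue with a NOTE CONF score from low to high."""
    text = vtt_text.replace("\r\n", "\n").replace("\r", "\n")
    hits = []
    timing, payload = None, ""
    for block in re.split(r"\n[ \t]*\n", text):
        lines = block.strip("\n").split("\n")
        if TIMING.match(lines[0]):
            timing, payload = lines[0], " ".join(lines[1:])
        elif lines[0].startswith("NOTE CONF") and timing:
            scores = json.loads(" ".join(lines)[len("NOTE CONF"):])["raw"]
            # Whole numbers are compared, so 100 never matches a 10-19 search.
            if any(low <= score <= high for score in scores):
                hits.append((timing, payload))
    return hits
```

Calling cues_with_scores_in_range(text, 50, 59) corresponds to the 50 to 59 percent search in the example above; the returned time cues tell you where to scrub in the capture.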

If you use the Find / Highlight method, save your working copy as an RTF file, because VTT / plain text files do not retain highlighting. When you have finished your edits, use Save As with the steps above to save the file as plain text and change its extension back to .vtt.

For example, the following cue text has some fairly low confidence scores.

00:00:28.110 --> 00:00:36.050
A hundred in abitibi words now the estimates get tougher

NOTE CONF {"raw":[55,89,91,50,67,99,100,99,96,99]}

Without the context of the video, it is impossible to know what this phrase is supposed to be. So while performing a Find action, highlight the phrases with low scores. Then, you can easily find and return to the highlighted locations while viewing the capture. Use the time cue given for each phrase and scrub to the approximate location in the capture. Review the speech and edit the transcription phrase accordingly.

The above cue, once edited, now reads:

00:00:28.110 --> 00:00:36.050
A hundred billion inhabitable worlds. Now the estimates get tougher

NOTE CONF {"raw":[55,89,91,50,67,99,100,99,96,99]}

Once you have fixed these low-confidence instances, you may want to change the confidence score for each word in the edited file to 100. This is entirely optional, but if the confidence scores of the transcription are used for other purposes, the edited file will then have scores that accurately reflect your changes. Alternatively, you can remove the confidence scores altogether if they no longer reflect the confidence level of the transcription text.
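If you decide to update the scores, the following Python sketch rewrites every NOTE CONF entry as a list of 100s with one value per word in the (possibly edited) cue above it, or removes the entries entirely when remove=True. It assumes one score per whitespace-separated word and the same compact {"raw":[...]} layout as the downloaded file; the function name reset_confidence is hypothetical.

```python
import json
import re

# Matches a time cue line at the start of a block.
TIMING = re.compile(r"\S+[ \t]+-->[ \t]+\S+")


def reset_confidence(vtt_text: str, remove: bool = False) -> str:
    """Set every NOTE CONF score to 100, one per word of the cue above, or drop them."""
    text = vtt_text.replace("\r\n", "\n").replace("\r", "\n").strip("\n")
    out = []
    words = 0
    for block in re.split(r"\n[ \t]*\n", text):
        lines = block.split("\n")
        if TIMING.match(lines[0]):
            # Count the words in the (possibly edited) cue payload.
            words = len(" ".join(lines[1:]).split())
        elif lines[0].startswith("NOTE CONF"):
            if remove:
                continue  # drop the confidence note entirely
            block = "NOTE CONF " + json.dumps({"raw": [100] * words}, separators=(",", ":"))
        out.append(block)
    # Rejoin blocks with the blank line the WebVTT format requires between them.
    return "\n\n".join(out) + "\n"
```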

Using Spell Check or Grammar Check

Using spell check on the transcription file may work wonderfully or not at all. This is because the automatic transcription program is designed to convert speech into text and will therefore insert an actual word in place of whatever word it thinks it hears. As a result, your transcript may contain many inaccurately transcribed passages but few, if any, spelling errors.

In addition, programs like Microsoft Word run a grammar check along with spell check by default. Because the transcription text segments may not line up with the complete sentences the lecturer speaks, the grammar checker may flag many phrases as incomplete or try to complete them by changing word forms.

This is not to say that these checking features will not work. Try them; you may have more or less success depending on the class, the lecturer, the subject matter, and so on. But do not blindly click Change for each identified problem, as doing so may introduce more errors than the original transcription contained.
