This article explains how to download and edit transcriptions outside of EchoVideo and then upload them to apply to media. While you can still use this method, we recommend using the EchoVideo Transcript Editor instead. The editor allows you to export, upload, edit, and apply transcriptions as closed captions for media. For more information, see Using the EchoVideo Transcript Editor.

The accuracy of transcriptions auto-generated by the ASR (automatic speech recognition) service depends heavily on several variables, including the quality of the microphone, the ambient noise in the room, the vocal quality of the speaker, and whether or not the lecturer is a native speaker of the language being transcribed. In addition, subject-specific vocabulary used throughout a lecture is often transcribed incorrectly, even though students may need those terms in order to grasp the subject matter.

To address the accuracy of transcriptions, transcription files can be downloaded, edited, and then re-uploaded to replace the original transcription.

Once they are cleaned up for accuracy, transcription files can also be manually uploaded as closed captioning files. Both transcription files and closed captioning files adhere to the WebVTT standard and can therefore be generally interchanged. The primary difference between transcriptions and closed captions is the level of accuracy provided / required.

This article provides step-by-step instructions for working with a transcription file and saving it in a format that can be used for either transcriptions, closed captions, or both. Be sure to also review the Tips and Tricks section at the bottom of this article.

If you have updated a transcription file but, for some reason, cannot upload it (it fails file validation), there may be a simple fix. Search the web for .vtt validator to find a helpful tool to validate your file (such as https://quuz.org/webvtt/). It may be as simple as a missing character.

Download, Save a Copy, and Rename the Original Transcription

In most cases, the automatically generated (original) transcription is the only transcription file applied to the video. If an edited version has already been uploaded, you can download either the original or the edited version.

From the Captures page, find and click on the capture whose transcriptions you want to work with to open the Media Details page.

The Details tab is displayed by default.
Select Accessibility.
Click the vertical action menu for the transcript, as shown in the figure below.
Select Download Original, as shown in the figure above.

The file will be saved to the Downloads folder or you will be asked to select a location, depending on your browser settings.
Alternatively, you can download the edited transcript version currently applied to the video if there is one, and edit that file instead.
Copy the file to the location from which you will work with it (from the Downloads folder to a Section Name Transcriptions folder, for example).
Rename the file in Windows Explorer or Mac Finder, adding the date of the class to which it is published. See Tips and Tricks below for an explanation as to why.

Creating a copy and renaming it before opening the file to edit it will make saving your edits easier. While you can work directly in the downloaded transcription file and then use Save As to change its filename or location, doing so adds several steps to the process.
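If you handle many transcriptions, the copy-and-rename step can be scripted. The following is a minimal Python sketch, not an EchoVideo tool: the function name, the folder name, and the underscore-plus-date naming pattern are all assumptions chosen for illustration.

```python
import shutil
from datetime import date
from pathlib import Path


def copy_with_class_date(downloaded, section_folder, class_date):
    """Copy a downloaded transcript into a section folder, adding the class date to its name."""
    section_folder.mkdir(parents=True, exist_ok=True)
    # Keep the original extension (.vtt) so the copy stays uploadable.
    # The "_YYYY-MM-DD" suffix is an assumed naming pattern, not a requirement.
    new_name = f"{downloaded.stem}_{class_date.isoformat()}{downloaded.suffix}"
    target = section_folder / new_name
    shutil.copy2(downloaded, target)  # the original download is left untouched
    return target
```

For example, calling `copy_with_class_date(Path("transcript.vtt"), Path("Section Name Transcriptions"), date(2024, 3, 5))` would create `Section Name Transcriptions/transcript_2024-03-05.vtt`.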

Defining the Sections and Required Components of a WebVTT File

A WebVTT file is essentially a text file that meets the WebVTT specification as outlined by the W3C: https://www.w3.org/TR/webvtt1/. While the specification is fairly detailed, it simply defines the components that must reside in the file, how they must be separated from other parts of the file, and guidelines for other non-required components of the file if they exist (spacing, prefixes, etc).

The figure below shows a downloaded WebVTT transcription file as returned from the ASR service. It contains all of the necessary components for a file to be used as a transcription or a closed captioning file, along with some other information. Each component is defined below the figure.

Example WebVTT transcript file returned from the ASR service with entry items as described

WEBVTT: The first entry in the file is the WebVTT header. It simply reads WEBVTT. This is required by the specification.

Do not remove or change the first entry.
Start / End Time Cues: Each transcribed segment of text is preceded by a time-cue such as: 00:00:05.940 --> 00:00:16.150. This time cue identifies the start and end location in the video / audio file where the text is spoken. It allows the transcription panel or the closed captions to be synced with the playback. The time-cue format is very specific. Unless you have a reason for doing so, do not change the time-cues in the transcription file. If you do need to change one, refer to the specification for proper formatting.
Cue Payload / Spoken Text: The transcribed text is called the cue-payload and is the text shown in the transcription or closed caption at the time / duration specified in the time cue.
This text must reside on the line immediately following the time-cue.
Confidence Scoring: The NOTE CONF entries are confidence scores given to each word in the cue-payload, to identify how confident the automated transcription program was that the word it transcribed was the word that was spoken. The figures shown are percentages, and each number corresponds to one word, in order, in the payload text above it.
The NOTE CONF entries are not required for a valid transcription or closed caption file and can be removed, but they can help find segments where the text is more likely in need of editing. A lower confidence percentage might indicate speech that was garbled or not otherwise interpreted properly by the automated transcription program.
Line Breaks: The paragraph markers or line breaks in the transcription file are required and, if possible, should be left alone. The specification requires that the time-cue be on the line immediately before the cue payload and that each time-cue / payload set be separated from the next item by at least one blank line.
Notice in the above figure that the NOTE CONF entries are separated from their associated cue segment (the one above it) by a blank line, and also from the next cue segment by a blank line. These are not included solely for the readability of the downloaded file; that spacing is required by the WebVTT specification.
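The rules above can be expressed as a quick structural check to run before uploading. The sketch below is only a rough illustration and is far less thorough than a real WebVTT validator. It assumes the layout described in this article: time-cues in the full hh:mm:ss.ttt form, no cue identifier lines, and payload text starting on the line after the time-cue. The function name is hypothetical.

```python
import re

# Time-cue line in the hh:mm:ss.ttt form shown in this article.
TIME_CUE = re.compile(r"^\d{2,}:\d{2}:\d{2}\.\d{3} --> \d{2,}:\d{2}:\d{2}\.\d{3}")


def check_vtt_structure(text):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # Treat CRLF, CR, and LF line breaks alike.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    # The first entry must be the WEBVTT header (a byte-order mark is tolerated).
    if not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        problems.append("line 1: file must start with WEBVTT")
    for i, line in enumerate(lines):
        if not TIME_CUE.match(line):
            continue
        # Assumption: no cue identifiers, so a blank line must precede each time-cue.
        if i > 0 and lines[i - 1].strip():
            problems.append(f"line {i + 1}: no blank line before time-cue")
        # The cue payload must sit on the line immediately after the time-cue.
        if i + 1 >= len(lines) or not lines[i + 1].strip():
            problems.append(f"line {i + 1}: no text on the line after time-cue")
    return problems
```

An empty result does not guarantee that the file will pass upload validation; it only means these three basic rules were not broken.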

Edit and Save the Transcription File

The instructions below use Microsoft Word as the editing program, but you can use any text editor or word processing software. The only requirement is that the program can save your changes either with the original .vtt extension or as a plain text file with a .txt extension.

Launch the editing program you want to use and open the .vtt file downloaded and saved for editing.

You may need to change the file type selection box to All Files to find the VTT file to open, as shown in the figure below.

Once open, the VTT file looks like the figure in the previous section, a text file with a WEBVTT heading, time cues, cue payloads (text), and NOTE CONF entries, all of which are described above.
Edit the text cues in the file as needed to match the speaker's words. See the Tips and Tricks section below for some suggestions on streamlining the editing process.
When finished, save your edited file.
Once you have saved your edits, upload the new transcription to apply it to the video for classroom viewers.

If you followed all instructions in the Download section above, and are working in the renamed copy of the download, click Save to save your edited file. The program might give you a warning message about the file type, but you can ignore it and click OK to save the file.

If you need to save the edited file under a new name or in a different location, you have some extra steps to perform:

Click Save As.
Select Plain Text (*.txt) as the type of file you are saving, as shown in the figure below.
This will replace the .vtt extension on the file with a .txt extension. You will change this later after saving the file.
Click Save.
If provided a File Conversion dialog box, like the one shown below, it is recommended (but not required) that you enable the Insert line breaks checkbox, and then select LF only or CR only for the line break type. This simply ensures that all line breaks in the file are of the same type.
Open Windows Explorer or Mac Finder and find the file you just saved.
Change the .txt extension to .vtt. The file you upload must have the .vtt extension.
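If you prefer to script the last two steps, the sketch below writes a .vtt copy of a file saved as plain text and makes every line break the same type. It is a minimal illustration: the function name is hypothetical, and the choice of LF as the single line-break type is an assumption.

```python
from pathlib import Path


def txt_to_vtt(txt_path):
    """Write a .vtt copy of a plain-text save, using LF line breaks throughout."""
    raw = txt_path.read_bytes()
    # Convert CRLF first, then any remaining lone CR, so no break is doubled.
    normalized = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    vtt_path = txt_path.with_suffix(".vtt")
    vtt_path.write_bytes(normalized)  # the .txt file itself is left in place
    return vtt_path
```

Working on raw bytes means the text itself is not altered; only the line breaks and the file extension change.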

When your edited file is complete, it can be uploaded to the capture. The edited version will appear in the transcriptions panel in the classroom and will be the version downloaded when Download edited is selected from the transcript menu.

Tips and Tricks for Editing a Transcription File

The following tips can make working with transcription files easier and faster. You will still have to view the capture and follow along to make accurate changes to the transcript, but there are other things you can do to streamline the editing process.

Save a Copy of the Original to a Dedicated Location and Rename It

As per the procedures in the Download section above, after downloading the original automated transcription, copy or move it to a dedicated location. If you are responsible for cleaning up all of the transcriptions for a section, create a folder for that section. Then when you add each new capture's transcription to the folder, append the filename with the date of the class to which it belongs. This will make it easier to know which files go to which class.

In addition, copying and renaming the file before you open it ensures you can select Save instead of Save As when you are done (or while editing). Most editing programs can save your changes to the exact same file you opened, meaning they will not try to force a file type, such as .txt, onto your file. This reduces the number of steps needed to be sure your edited file is an uploadable .vtt file.

Play the Capture in the Classroom With the Transcription Panel Open

One plan for editing transcriptions is to play the capture in the classroom with the transcription panel open while having the editing program open in another window. Since the transcription is synced with the video and highlighted, you can easily see the transcribed text with the audio and discern where the errors occur. This works even better if you have access to a dual monitor setup, or can play the video / transcriptions on a different computer or on a tablet using a mobile browser (transcriptions are not available in the Mobile Apps yet).

Another thing you can do is read through the transcription panel while the video is paused. When you see a segment that requires editing, click on it to sync the video to that location. Notice the timestamp below the playback bar. Find that location in the transcript file, and make your edits.

The space bar works as a pause / play button. When the video gets to a location where the transcript requires editing, tap the space bar to pause the video, switch to the editing program, make the edits, then return to the classroom and continue.

Use Search to Locate Lower Confidence Scores

The NOTE CONF entries in the transcript are confidence scores given for each word by the automatic transcribing program, to indicate how confident it was that the word it entered was the one spoken. Use these scores to help find the areas in the text where the program was not certain the term it entered was accurate. This will not always find the most problematic areas in a transcript, but it can help.

Each of the numbers in the CONF set corresponds to a word in the text cue immediately above it. Below is one cue where the confidence score for the first word in the cue payload is 41.

00:00:58.870 --> 00:01:02.750
Millions was the first person to see a surface feature on the planet mars.

NOTE CONF {"raw":[41,100,100,100,100,100,99,99,100,98,100,100,100,100]}

In this transcription, the word Millions received a confidence score of 41. Watching the video reveals that the word should actually be Huygens.
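The word-to-score pairing can also be done programmatically. The following is a small Python sketch, assuming the NOTE CONF line has the form shown above (a JSON object with a "raw" list) and that words in the payload are separated by spaces. The function name and the default threshold of 60 are arbitrary choices for illustration.

```python
import json


def low_confidence_words(payload, note_conf_line, threshold=60):
    """Pair each word in a cue payload with its score; return pairs below threshold."""
    # The JSON object follows the "NOTE CONF" prefix on the same line.
    scores = json.loads(note_conf_line.split("NOTE CONF", 1)[1])["raw"]
    words = payload.split()
    if len(words) != len(scores):
        raise ValueError("word count and score count differ; pairing would be unreliable")
    return [(word, score) for word, score in zip(words, scores) if score < threshold]
```

Applied to the cue above, this sketch returns a single pair: the word Millions with its score of 41.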

One of the reasons Microsoft Word is used as the example editor here is its Find > Special capability. The steps below show how to use this feature to find the lower confidence score entries in the transcription, to help target your editing.

With the transcription file open, select Find then Advanced Find.
Click More>> to expand the box for additional options.
In the Find what field of the Find and Replace dialog box, type the first digit of the percentage range you want to find in the confidence scores. The example in the figure below uses 5, to find all scores between 50 and 59 percent confidence.
Click the Special button at the bottom of the dialog box, then select Any digit from the list, as shown in the figure below.
Click Find Next. Each click of Find Next finds and highlights the next match. Because time-cues also contain digits, some matches will fall inside time-cue lines; skip those.
You can try to make the edits as you go along, or you may want to simply highlight the corresponding words / phrases for now, then return while viewing the capture to fix the errors later.
Repeat these steps for each range of percentages you want to highlight.

If you use the Find / Highlight method, you will have to save the file as an RTF file; VTT / plain text files will not retain highlighting. However, you can return later, make the appropriate edits, and then Save As using the steps above.
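If you would rather not rely on Word's Find feature, a short script can list the cues that contain scores in a given range, along with each cue's start time. This is a sketch under assumptions: it expects the file layout described in this article (a single-line payload directly after the time-cue, followed later by that cue's NOTE CONF line), and the function name is hypothetical.

```python
import json
import re

# Captures the start time of a time-cue line such as 00:00:28.110 --> 00:00:36.050
TIME_CUE = re.compile(r"^(\d{2,}:\d{2}:\d{2}\.\d{3}) --> ")


def cues_with_scores_in_range(text, low, high):
    """Yield (start_time, payload, words) for cues with a score between low and high."""
    lines = text.splitlines()
    start, payload = None, ""
    for i, line in enumerate(lines):
        match = TIME_CUE.match(line)
        if match:
            start = match.group(1)
            # Assumption: the payload is the single line right after the time-cue.
            payload = lines[i + 1] if i + 1 < len(lines) else ""
        elif line.startswith("NOTE CONF") and start is not None:
            scores = json.loads(line[len("NOTE CONF"):])["raw"]
            hits = [w for w, s in zip(payload.split(), scores) if low <= s <= high]
            if hits:
                yield start, payload, hits
```

Unlike the Find approach, this only looks at NOTE CONF lines, so digits in time-cues never produce matches.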

For example, the following cue text has some fairly low confidence scores.

00:00:28.110 --> 00:00:36.050
A hundred in abitibi words now the estimates get tougher

NOTE CONF {"raw":[55,89,91,50,67,99,100,99,96,99]}

Without the context of the video, it is impossible to know what this phrase is supposed to be. So while you are performing a Find action, highlight the phrases with low scores. Then, you can easily find and return to the highlighted locations while viewing the capture. Use the time cue given for each phrase and scrub to the approximate location in the capture. Review the speech and edit the transcription phrase accordingly.

The above cue, once edited, now reads:

00:00:28.110 --> 00:00:36.050
A hundred billion inhabitable worlds. Now the estimates get tougher

NOTE CONF {"raw":[55,89,91,50,67,99,100,99,96,99]}

Once you have fixed these low-confidence instances, you may want to change the confidence scoring for each word in the edited file to 100. You do not have to, but if in the future the confidence scoring of the transcription is used for other purposes, the edited file will have confidence scores that accurately reflect your changes. This is completely optional. You can also choose to remove the confidence scores altogether if they no longer reflect the confidence level of the transcription text.
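Both optional cleanups can be sketched in a few lines of Python. The first function builds a replacement NOTE CONF line with one score per word of an edited payload. The second removes every NOTE CONF block from a file. Both function names are hypothetical, and the sketch assumes the NOTE CONF format shown in the examples above, with blocks separated by blank lines.

```python
import re


def conf_line_for(payload, value=100):
    """Build a NOTE CONF line with one score per word of the edited payload."""
    scores = ",".join(str(value) for _ in payload.split())
    return 'NOTE CONF {"raw":[' + scores + "]}"


def strip_confidence_notes(text):
    """Return the file text with every NOTE CONF block removed."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Blocks are separated by one or more blank lines.
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    kept = [block for block in blocks if not block.startswith("NOTE CONF")]
    # Rejoin with exactly one blank line between blocks, as the format requires.
    return "\n\n".join(kept) + "\n"
```

Building the line from the edited payload keeps the number of scores equal to the number of words, even when an edit adds or removes words.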

Use Spell Check or Grammar Check

Using spell check on the transcription file may work wonderfully or may not work well at all. This is because the automatic transcription program is designed to convert speech into text and will, therefore, attempt to insert an actual word in place of whatever word it thinks it hears. For this reason, your transcript may have many inaccurately transcribed words but few, if any, spelling errors.

In addition, programs like Microsoft Word will run (by default) a grammar check along with spell check. Because the transcription text segments may not line up as complete sentences spoken by the lecturer, the grammar checker may balk at many of the phrases, calling them incomplete, or attempt to make them complete by changing word forms.

This is not to say that either of these checking features will not work. You should go ahead and try them, and you may find more or less success with them, depending on the class, the lecturer, the subject matter, etc. But do not blindly click Change on each identified problem. Doing so may create more errors in the transcription than existed in the original.
