Web Captions 1 – History and Formats

Category: Accessibility
Two ducks at full moon (1900 - 1930) by Ohara Koson (1877-1945). Original from The Rijksmuseum. Digitally enhanced by rawpixel.

While compiling a video recently for my company’s design site, I was flagged for forgetting a web captions file. As an accessibility advocate who had never created one of these assets before, I did some research and manually crafted one for the minute-long piece.

It was fascinating. Captioning is practically never talked about in the UX community, yet it has its own syntax, styling, and UX expectations that I had never explored before. I’ve decided to compile what I’ve learned in this mini-series on web captions.

History of Web Captions

According to Wikipedia, captioning was first introduced in 1972 by WGBH, a US public television station based right here in Boston, Massachusetts. Captioning began as ‘open captioning’, which means the text was ‘burned in’ to the broadcast itself: it was always on, with no way of disabling it.

In 1973, the first ‘closed captioning’ (which the viewer can choose to display) was broadcast, and in 1976, the US Federal Communications Commission (FCC) created a scalable way to deliver ‘closed captioning’ data as a standard. The UK also picked up closed captioning that same year. Lastly, by 1982, real-time captioning was being delivered in the US.

Finally, once HTML5 became a standard in 2014, there was a standardized semantic way to deliver captioning for online video via the track element.


Captions and subtitles have specific meanings here in the US and in Canada. Subtitles are for foreign-language translation. Captions describe the entire dialog plus key audio cues, such as sound effects, music, or whether a character is whispering. In the UK and most other locales, all captions are referred to as subtitles, regardless of the target need or audience.

Semantically, the HTML spec follows the US definitions. And while the track element also supports descriptions, chapters, and metadata, captions and subtitles seem to be the most frequently used kinds.
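To make the track element a little more concrete, here is a minimal TypeScript sketch that builds the markup for one track. The five kind values come from the HTML spec; the function name, the option names, and the file path in the usage note are my own placeholders, not part of any library.

```typescript
// The five values the HTML spec allows for <track kind="...">.
type TrackKind = "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";

interface TrackOptions {
  src: string;         // URL of the .vtt file (hypothetical path in the usage note)
  kind: TrackKind;
  srclang: string;     // language tag such as "en"
  label: string;       // text shown in the player's caption menu
  isDefault?: boolean; // adds the boolean "default" attribute when true
}

// Escape characters that would break out of a double-quoted attribute value.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Build the <track> markup that sits inside a <video> element.
function buildTrackTag(opts: TrackOptions): string {
  const attrs = [
    `kind="${opts.kind}"`,
    `src="${escapeAttr(opts.src)}"`,
    `srclang="${escapeAttr(opts.srclang)}"`,
    `label="${escapeAttr(opts.label)}"`,
  ];
  if (opts.isDefault) attrs.push("default");
  return `<track ${attrs.join(" ")}>`;
}
```

Calling buildTrackTag with kind "captions", src "captions/intro.en.vtt", srclang "en", label "English", and isDefault true returns a tag you can place inside the video element. Swapping the kind to "subtitles" is all it takes to mark the same file as a translation rather than a full caption track.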

Competing file type formats

There are two major file types used as web caption standards within the industry: TTML (Timed Text Markup Language) and WebVTT (Web Video Text Tracks, or VTT for short).

TTML is the broadcast standard used by the BBC (EBU-TT-D) and Netflix. YouTube also supports it. TTML is an XML file format, similar to HTML, but it has some unique elements that are expected depending on who the file is being delivered to. Essentially it looks like:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
 <head>
  <styling><style xml:id="narrator" tts:color="red" /></styling>
 </head>
 <body>
  <div>
   <p style="narrator" begin="00:00:01.000" end="00:00:05.000">NARRATOR: This standard caption text</p>
  </div>
 </body>
</tt>

Although TTML is the broadcast standard, VTT is the only caption format that works natively in browsers; TTML doesn’t even load. VTT’s syntax is much cleaner looking than TTML’s, something akin to Markdown or HAML. That same TTML code above looks something like this in VTT:


WEBVTT

00:01.000 --> 00:05.000
NARRATOR: This standard caption text

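The color is applied separately in the page’s CSS through the ::cue pseudo-element, with a rule along the lines of video::cue { color: red; }.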


If you are looking for a more in-depth dive, the W3C has a great comparison of the two formats. The major differences seem to be that TTML has better scope management (e.g. helpful for conversations swapping between languages in blocks as opposed to one-offs), a more robust hierarchy, and different positioning math. Also, VTT relies on external styling, whereas TTML embeds it within the document.

In terms of supported styles and properties shared between the two, stick with…

  • Timing in the form hours:minutes:seconds.milliseconds (milliseconds must use three digits to be valid; hours are optional in VTT)
  • Positioning using percentages
  • Styles limited to color, opacity, visibility, background-color, font-family, font-size, font-weight, font-style, line-height, and outline (outline-color, outline-style, outline-width)
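The timing rule in the list above is easy to get wrong by hand, so here is a small sketch of how I would check a single VTT timestamp. It assumes the VTT form only (optional hours of two or more digits, two-digit minutes and seconds, exactly three millisecond digits); the function name is my own.

```typescript
// Matches [hh:]mm:ss.ttt. Hours are optional; milliseconds need exactly three digits.
const VTT_TIMESTAMP = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/;

// Convert a WebVTT timestamp to milliseconds, or return null if it is malformed.
function parseVttTimestamp(text: string): number | null {
  const match = VTT_TIMESTAMP.exec(text.trim());
  if (match === null) return null;
  const hours = match[1] === undefined ? 0 : Number(match[1]);
  const minutes = Number(match[2]);
  const seconds = Number(match[3]);
  const millis = Number(match[4]);
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}
```

With this sketch, parseVttTimestamp("00:01.000") gives 1000, while "00:01.00" gives null because it only has two millisecond digits.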

There are additional caption formats (fifty-plus file types), with various standards within each as well, but none of them work natively on the web.

For example, there are a few different formats that use the TXT file type as their wrapper. Most of these seem to work specifically with one-off apps you’ve likely never heard of.

However, there is another file format called SRT that is almost identical to VTT. It’s a tad simpler to write, it seems, but it’s not as robust. It is an accepted format for some major services, but like TTML, it doesn’t work natively in the HTML video tag the way that VTT files do. If you’d like a more in-depth overview, check out this article by JBI Studios.
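Because the two formats are so close, a basic SRT file can be turned into VTT with very little code. This is a rough sketch under a few assumptions: the SRT uses the usual hh:mm:ss,mmm timing with a comma before the milliseconds, and it contains no SRT-specific formatting tags that would need translating. The function name is mine.

```typescript
// Convert SubRip (SRT) text to WebVTT text.
function srtToVtt(srt: string): string {
  const lines = srt
    .replace(/^\uFEFF/, "")   // drop a byte-order mark if present
    .replace(/\r\n?/g, "\n")  // normalize line endings
    .trim()
    .split("\n");

  const converted = lines.map((line) => {
    if (line.indexOf("-->") === -1) return line;
    // SRT writes 00:00:01,000 where VTT writes 00:00:01.000
    return line.replace(/(\d{2}:\d{2}:\d{2}),(\d{3})/g, "$1.$2");
  });

  // A VTT file must start with the WEBVTT header followed by a blank line.
  return ["WEBVTT", ""].concat(converted).join("\n") + "\n";
}
```

The numeric counters that SRT puts above each timing line are left in place on purpose, since VTT accepts a line there as an optional cue identifier. For anything beyond simple files, a dedicated converter is the safer route.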

I’m so thankful that the W3C did the right thing, and forced a single, robust, easy-to-write format and that all major browsers complied.

Managing the Mess

So what do you do if you receive a non-VTT file as your caption content? Thankfully, there is an answer. Subtitle Edit is a free, actively developed, Windows-only application with a portable version (which I now keep on my cloud sync service).

Subtitle Edit's interface showing the timestamped caption content and video preview. It also handles additional caption formats

On a quick exploration, this application is serious business. It converts between seventy-plus caption formats, including the critical VTT needed to caption media natively on the web. It can help you translate content, it has plugins for localizing US English to UK English, and it tracks how many words appear per second of on-screen display (extremely critical, which we will get to in the next part of the series).
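The words-per-second figure is simple arithmetic, and a sketch helps show what is being measured. This is my own illustration of the idea, not how Subtitle Edit calculates it; the function name and the whitespace-based word split are assumptions.

```typescript
// Words per second for one cue, given its start and end in milliseconds.
function wordsPerSecond(text: string, startMs: number, endMs: number): number {
  const durationSeconds = (endMs - startMs) / 1000;
  if (durationSeconds <= 0) {
    throw new RangeError("Cue must end after it starts");
  }
  const trimmed = text.trim();
  // Split on runs of whitespace; an empty cue counts as zero words.
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return wordCount / durationSeconds;
}
```

For the sample cue used earlier, five words shown from 1000 ms to 5000 ms work out to 1.25 words per second.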


If you want to caption anything on the web, you must deliver a VTT file. There is no other alternative at the moment, unless you craft your own solution, as the BBC has done to serve up TTML.

If you are working on your own on a small project, rolling your own VTT file and validating it is probably the best way to go. Otherwise, an actual authoring program or a third-party service seems far more scalable and responsible.
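If you do roll your own file, a quick sanity check catches the most common slips. The sketch below is deliberately shallow and is no substitute for a full validator: it only confirms the WEBVTT header and that every timing line is well formed and ends after it starts. All names are my own.

```typescript
// One cue timing line: start --> end, each as [hh:]mm:ss.ttt, with optional cue settings after.
const TIMING_LINE =
  /^((?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3})[ \t]+-->[ \t]+((?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3})(?:[ \t].*)?$/;

// Convert an already-validated [hh:]mm:ss.ttt stamp to milliseconds.
function toMillis(stamp: string): number {
  const parts = stamp.split(/[:.]/).map(Number);
  const millis = parts.pop() ?? 0;
  const seconds = parts.pop() ?? 0;
  const minutes = parts.pop() ?? 0;
  const hours = parts.pop() ?? 0; // absent in the short mm:ss.ttt form
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}

// Return a list of human-readable problems; an empty list means the basic checks passed.
function checkVtt(source: string): string[] {
  const problems: string[] = [];
  const lines = source.replace(/^\uFEFF/, "").split(/\r\n|\r|\n/);
  if (!/^WEBVTT(?:[ \t].*)?$/.test(lines[0] ?? "")) {
    problems.push("Line 1: file must start with WEBVTT");
  }
  lines.forEach((line, index) => {
    if (line.indexOf("-->") === -1) return; // not a timing line
    const match = TIMING_LINE.exec(line);
    if (match === null) {
      problems.push(`Line ${index + 1}: malformed cue timing`);
      return;
    }
    const [, start = "", end = ""] = match;
    if (toMillis(start) >= toMillis(end)) {
      problems.push(`Line ${index + 1}: cue must end after it starts`);
    }
  });
  return problems;
}
```

Running checkVtt on the sample file from earlier in this article returns an empty list, while a file missing its header or using two-digit milliseconds gets a line-numbered complaint.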

Next time, I’ll take a deeper dive into text patterns that are commonly recommended by industry leaders, what’s supported in each browser, and how much design flexibility a designer can actually take or needs to be responsible for.

Update Nov 10, 2021: Part 1B was merged into this article to simplify the read. Sorry if that’s thrown you off. Send GIFs of pitchforks if it really bothers you.

Want to read the rest of the series?

  1. History and Formats
  2. UX Principles for Captions
  3. Caption Styling Challenges
  4. Final Styling Recommendations
My opinions & views expressed may not reflect my employer's.