What’s new in HTML5: The Track Element

One of the more exciting developments in HTML5 video is the inclusion of the track element in the newest versions of the desktop browsers. In addition to bringing captioning and subtitle support to HTML5 video, the invisible track element allows publishers to attach a rich array of textual metadata to their videos. In this blog post, we’ll look at the different types of tracks that can be used in conjunction with the <video> tag.

Browser Support

First, the bad news. The track element is extremely new, and browser support is growing, but limited. The current version of Chrome supports it, but the functionality must be enabled via a configuration option (go to chrome://flags, Enable <track> element). Internet Explorer 10, which is available in the Windows 8 developer preview, also has a working implementation. Mozilla is working on it for Firefox, but no timetable has been given for when it will be complete. In short, the track element is future tech, but luckily we can begin working with it today.

WebVTT: A New Format for Text Tracks

The web standards community has developed a new standard, called WebVTT (Web Video Text Tracks), which will be supported by all browsers implementing the track element. WebVTT provides a simple, extensible, and human-readable format on which to build text tracks. Although it is based on SRT (a popular subtitling format), a few tweaks have been made to the format. For content creators who already have subtitles in SRT, a no-frills converter is available.
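To give a feel for how small those tweaks are, here is a minimal sketch of an SRT-to-WebVTT conversion. It is not the converter mentioned above; the function name srtToVtt and the sample input are ours, and the sketch assumes well-formed SRT input. It only adds the WEBVTT signature line and switches the millisecond separator from a comma to a period.

```javascript
// Minimal SRT -> WebVTT conversion sketch (hypothetical helper, not the converter mentioned above).
function srtToVtt(srt) {
  const body = srt
    .replace(/^\uFEFF/, '')   // drop a byte-order mark if present
    .replace(/\r\n?/g, '\n')  // normalize line endings
    .split('\n')
    .map((line) =>
      // Only touch timing lines: SRT writes "00:00:10,000", WebVTT writes "00:00:10.000".
      line.includes('-->') ? line.replace(/(\d{2}:\d{2}),(\d{3})/g, '$1.$2') : line
    )
    .join('\n')
    .trim();
  // A WebVTT file starts with the "WEBVTT" signature followed by a blank line.
  return 'WEBVTT\n\n' + body + '\n';
}

const sampleSrt = [
  '1',
  '00:00:00,000 --> 00:00:10,000',
  'This text is related to the first ten seconds of the video',
  '',
  '2',
  '00:00:10,000 --> 00:00:20,000',
  'This text is related to the next ten seconds of the video',
].join('\n');

console.log(srtToVtt(sampleSrt));
```

The numeric counters from the SRT file are left in place, since WebVTT allows an optional identifier line before each cue's timing line.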

Here’s a very simple example of a WebVTT file:


WEBVTT

00:00.000 --> 00:10.000
This text is related to the first ten seconds of the video

00:10.000 --> 00:20.000
This text is related to the next ten seconds of the video

In this example, the file contains two timed segments, called cues. Cue text can come in many flavors, from plain text to text with simple inline markup to, in metadata tracks, arbitrary data such as JSON or HTML.
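To make the idea of a cue more concrete, here is a rough sketch that reads a simple WebVTT file into a list of cue objects. The names parseTimestamp and parseSimpleVtt are ours, and the sketch only handles the basic layout used in this post, not the full format. The property names startTime, endTime and text mirror the ones browsers expose on cue objects.

```javascript
// Convert "MM:SS.mmm" or "HH:MM:SS.mmm" into a number of seconds.
function parseTimestamp(stamp) {
  const parts = stamp.trim().split(':').map(Number);
  // Fold the units together, most significant first.
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Hypothetical helper: split a simple WebVTT file into cue objects.
function parseSimpleVtt(text) {
  const blocks = text.replace(/\r\n?/g, '\n').trim().split(/\n{2,}/);
  const cues = [];
  for (const block of blocks) {
    const lines = block.split('\n');
    const timingIndex = lines.findIndex((line) => line.includes('-->'));
    if (timingIndex === -1) continue; // signature line or other non-cue block
    const [start, rest] = lines[timingIndex].split('-->');
    const end = rest.trim().split(/\s+/)[0]; // ignore anything after the end time
    cues.push({
      startTime: parseTimestamp(start),
      endTime: parseTimestamp(end),
      text: lines.slice(timingIndex + 1).join('\n'),
    });
  }
  return cues;
}

const sampleVtt = [
  'WEBVTT',
  '',
  '00:00.000 --> 00:10.000',
  'This text is related to the first ten seconds of the video',
  '',
  '00:10.000 --> 00:20.000',
  'This text is related to the next ten seconds of the video',
].join('\n');

console.log(parseSimpleVtt(sampleVtt));
```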

Here’s how to embed a video with a text track:

<video controls>
<source src="video.mp4" type="video/mp4" />
<source src="video.webm" type="video/webm" />
<track kind="subtitles" src="subtitles.vtt" />
</video>

One Element, Many Uses

One of the reasons the track element is so captivating is its versatility. It can be used to make video accessible, to organize content that occurs within a video, to enable more robust interactions, and much more. The intended use is declared by setting the kind attribute on the track element. The kind attribute currently accepts five values: subtitles, captions, descriptions, chapters and metadata.
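As a small illustration of the kind attribute, the sketch below builds track markup and rejects anything outside the five values listed above. The helper names (TRACK_KINDS, escapeAttribute, buildTrackTag) and the file name are ours. The srclang, label and default attributes are standard track attributes that this post does not otherwise cover.

```javascript
// The five kinds discussed in this post.
const TRACK_KINDS = ['subtitles', 'captions', 'descriptions', 'chapters', 'metadata'];

// Escape a value so it is safe inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;');
}

// Hypothetical helper: build a <track> tag for one text track.
function buildTrackTag({ kind, src, srclang, label, isDefault = false }) {
  if (!TRACK_KINDS.includes(kind)) {
    throw new Error(`Unknown track kind: ${kind}`);
  }
  const attributes = [`kind="${kind}"`, `src="${escapeAttribute(src)}"`];
  if (srclang) attributes.push(`srclang="${escapeAttribute(srclang)}"`);
  if (label) attributes.push(`label="${escapeAttribute(label)}"`);
  if (isDefault) attributes.push('default');
  return `<track ${attributes.join(' ')} />`;
}

console.log(buildTrackTag({ kind: 'subtitles', src: 'subtitles.vtt', srclang: 'en', label: 'English' }));
```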

Accessibility: Captions, Subtitles and Descriptions

Let’s take a quick look at the first three text track types, subtitles, captions and descriptions. On the surface, they may seem similar, but they actually serve different purposes.

  • Subtitles are what you might expect to see while watching a foreign-language film — they’re a transcription or translation of the video’s dialogue.
  • Captions, on the other hand, are designed for viewers who can’t hear the audio of the video, and include descriptions of non-dialogue sound. For example, if a character in a video slams a door off-camera, the captions would include something like [door slams]. Both subtitles and captions are displayed by the browsers as text overlays on top of the playing video.
  • Descriptions are not displayed visually, but are instead spoken out loud by a screen reader, benefiting viewers who can’t see the video. Not surprisingly, descriptions describe what’s happening visually in the scene.

All three of these kinds of tracks combine to make a video accessible to more viewers, and, as we’ll discuss later, to search engines as well.

Chapters: Navigating the Video

One of the more difficult problems to solve in web video has been how to index and recall discrete segments of content within a longer video. This is especially true when the different sub-segments pertain to dramatically different subjects. Publishers must either break up the video into more manageable chunks and tag the smaller chunks appropriately, or use complicated tools or scripts to synchronize the video player with an external index.

Using chapter text tracks, publishers can organize their long-form content in a WebVTT file which is embedded alongside the video. Although current browser implementations do not yet do anything with chapter tracks, it seems reasonable to expect that they will in the future. In the meantime, developers can access the information contained in the chapters track via JavaScript and use it to build their own chapter interfaces.
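Here is a minimal sketch of what such a script might look like. The function names findChapter and watchChapters and the sample chapter titles are ours. The browser half assumes a video element whose chapters track has already loaded, and since current implementations are still in flux, treat it as a starting point rather than a tested recipe.

```javascript
// Return the chapter cue that contains the given playback time, or null.
function findChapter(cues, time) {
  // Array.from also accepts the array-like cue lists that browsers expose.
  return Array.from(cues).find((cue) => time >= cue.startTime && time < cue.endTime) || null;
}

// Browser-only wiring (hypothetical helper): report chapter changes during playback.
function watchChapters(video, onChapterChange) {
  const track = Array.from(video.textTracks).find((t) => t.kind === 'chapters');
  if (!track) return;
  track.mode = 'hidden'; // load the cues without rendering anything
  let current = null;
  video.addEventListener('timeupdate', () => {
    const chapter = track.cues ? findChapter(track.cues, video.currentTime) : null;
    if (chapter !== current) {
      current = chapter;
      onChapterChange(chapter);
    }
  });
}

// Plain objects stand in for real cues so the lookup can be tried outside a browser.
const sampleChapters = [
  { startTime: 0, endTime: 60, text: 'Introduction' },
  { startTime: 60, endTime: 180, text: 'Browser Support' },
];
console.log(findChapter(sampleChapters, 75).text);
```

A chapter menu could use the same cue list in the other direction: list each cue's text, and set the video's currentTime to the cue's startTime when a viewer picks one.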

Metadata

The track element supports one additional type of text track, metadata, which is at the same time vague and extremely powerful. Metadata tracks allow developers to synchronize any information they wish with time points within a video. When the time point described in a cue is reached, a JavaScript event fires, and the script can read the text contained in the cue. A simple example could be latitude and longitude coordinates which correspond to certain time points within a video. A script could listen for these cues, and update a map with the current coordinates as they change in the video.
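The map example might be sketched as follows. The cue format used here (a latitude and a longitude separated by a comma) is our own assumption, as are the names parseCoordinates, watchCoordinates and the updateMap callback. The browser half relies on the cuechange event and the activeCues list of the text track API.

```javascript
// Parse a cue payload such as "40.7128, -74.0060" (an assumed format) into numbers.
function parseCoordinates(cueText) {
  const [lat, lng] = cueText.split(',').map((part) => Number.parseFloat(part));
  if (!Number.isFinite(lat) || !Number.isFinite(lng)) return null;
  return { lat, lng };
}

// Browser-only wiring (hypothetical helper): call updateMap whenever a coordinate cue becomes active.
function watchCoordinates(video, updateMap) {
  const track = Array.from(video.textTracks).find((t) => t.kind === 'metadata');
  if (!track) return;
  track.mode = 'hidden'; // cues fire events, but nothing is drawn on the video
  track.addEventListener('cuechange', () => {
    for (const cue of Array.from(track.activeCues || [])) {
      const position = parseCoordinates(cue.text);
      if (position) updateMap(position);
    }
  });
}

console.log(parseCoordinates('40.7128, -74.0060'));
```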

The possible use cases for metadata tracks are virtually limitless, and we’ll explore some of these in more detail in a future post.

Making Video Content More Searchable

We’ve discussed how the text tracks are interpreted by the browser and displayed to a viewer, but this only scratches the surface of what’s possible once videos are annotated by text tracks. Search engines can use the contextual information contained in the tracks to correlate search queries to specific points within a video. Because the tracks are separated logically, a search engine can prioritize results based on the length of a related segment, the frequency with which the search term appears in the video, and even whether the subject of the search term appears visually in the scene, regardless of whether or not the word itself is spoken.

Furthermore, a search engine can make use of translation engines to open up search results to users who speak different languages from the language used in the source video. The subtitle tracks themselves could theoretically also be translated automatically by the browser. Although a human translation is obviously preferable, this approach allows many more viewers to engage with the content at very little additional cost.
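To illustrate the idea of correlating a query with points in a video, here is a toy sketch that searches a list of cues and returns the matching time points, longest segment first. It is not how any real search engine works; the function name searchCues, the ranking rule and the sample transcript are all made up for illustration.

```javascript
// Hypothetical helper: find cues whose text mentions the term, longest segment first.
function searchCues(cues, term) {
  const needle = term.trim().toLowerCase();
  if (!needle) return [];
  return Array.from(cues)
    .filter((cue) => cue.text.toLowerCase().includes(needle))
    .map((cue) => ({
      startTime: cue.startTime,
      duration: cue.endTime - cue.startTime,
      text: cue.text,
    }))
    .sort((a, b) => b.duration - a.duration);
}

// Made-up transcript cues, in the same shape browsers use for cue objects.
const transcript = [
  { startTime: 0, endTime: 5, text: 'Welcome to the harbor tour' },
  { startTime: 5, endTime: 20, text: 'The harbor was built in 1850' },
  { startTime: 20, endTime: 30, text: 'Next we visit the lighthouse' },
];

console.log(searchCues(transcript, 'harbor'));
```

Each result carries a startTime, so a results page could link straight to the moment in the video where the term comes up.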

Captions in the JW Player

No discussion (at least on this blog) would be complete without a note on support in the JW Player for the topic at hand. Although the current player can display SRT captions through the Captions plugin, this support will become much more tightly integrated into the upcoming version 6.0 of the JW Player, which will support WebVTT as well. Here’s a sneak peek at what captions selection will look like in the new player:

[Image: JW Player 6 Captions Support]

Where Do We Go From Here?

As we’ve seen, text tracks aren’t just for subtitles – there’s a virtually limitless range of applications for them. Over the next few weeks, we’ll be posting demos and examples showing the track element in action. I’ll also be presenting on the track element at this year’s DevCon5 conference later this month. [Update: slides and demos here.] So stay tuned!