Report on our meet-up on captioning and subtitling in streaming at NAB 2024

2024-05-02

CCSUBS hosted a workshop at NAB 2024 where 35 stakeholders from the media supply chain, representing content platforms, viewers, service providers and suppliers, discussed interoperability challenges facing captioning and subtitling (timed text) in streaming. The workshop centered on presentations and interactive discussions, which are summarized below.

Summary

The following positions received strong support during the workshop:

  • The live streaming ecosystem should adopt global timed text standards and move away from legacy technologies;
  • Interoperability testing and best practices, more than new standards, are essential to improving timed text across the streaming ecosystem; and
  • Several challenges would be best solved by a dedicated group, which does not exist today.

Live applications

Whereas streaming of pre-recorded content has been around for more than a decade, global streaming of live events across multiple territories, with low latency between authoring and presentation, is a relatively recent development. Several challenges remain, including:

  • Preserving fidelity throughout the chain, from production to distribution;
  • Maintaining temporal synchronization between timed text and the audio and video;
  • Avoiding conflicts between the placement of the timed text and graphics elements of the video, e.g., scores;
  • Supporting multiple simultaneous languages, including non-European languages (MotoGP motorcycle racing events, for example, are routinely broadcast live in 39 languages).

A strong consensus emerged that the live streaming ecosystem must move away from legacy technology, which does not meet its current and future needs. The CTA 608/708 and teletext infrastructure, for example, was not designed for global distribution. Similarly, legacy image-based DVB subtitles are neither accessible, searchable nor scalable. Live streaming practices should instead build on global standards such as TTML, for which a growing body of expertise is available and which can be extended to meet new challenges.
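
For readers less familiar with TTML, the sketch below shows what a minimal TTML document could look like, embedded in a short Python snippet that checks it is well-formed. The element and attribute names come from the W3C TTML specification; the text, timings, colors and region geometry are invented purely for illustration.

    import xml.etree.ElementTree as ET

    # A minimal TTML document: one style, one region and one timed paragraph.
    # The element and attribute names come from the W3C TTML namespaces; the
    # timings, colors and text are purely illustrative.
    MINIMAL_TTML = """\
    <tt xmlns="http://www.w3.org/ns/ttml"
        xmlns:tts="http://www.w3.org/ns/ttml#styling"
        xml:lang="en">
      <head>
        <styling>
          <style xml:id="s1" tts:color="white" tts:backgroundColor="black"/>
        </styling>
        <layout>
          <region xml:id="bottom" tts:origin="10% 80%" tts:extent="80% 15%"/>
        </layout>
      </head>
      <body>
        <div>
          <p region="bottom" style="s1" begin="00:00:01.000" end="00:00:03.500">Welcome to the race.</p>
        </div>
      </body>
    </tt>
    """

    # Parse the document to confirm it is well-formed XML.
    root = ET.fromstring(MINIMAL_TTML)
    print(root.tag)  # "{http://www.w3.org/ns/ttml}tt"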

It was nevertheless noted that legacy technologies are perceived as lower-risk despite their limitations, and that interoperability challenges remain, as discussed below.

Quality control

The group briefly discussed the challenges of measuring timed text quality, especially in the context of the growing use of machine-generated timed text. It was noted, for example, that word error rate (WER) is not a sufficient metric; that the definition of quality can differ between viewers; and that automated speech-to-text remains relatively inaccurate in languages like Japanese, where homonyms are common and text can be written horizontally or vertically. Defining quality metrics was identified as a potential topic of further discussion.
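
As a rough illustration of why WER alone falls short, the minimal Python sketch below computes WER as the word-level edit distance divided by the number of reference words. The function name and example sentences are invented for illustration: two hypotheses score the same WER even though one merely swaps an article while the other garbles a proper noun, and the metric says nothing about timing, placement or reading speed.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Word-level edit distance (substitutions + deletions + insertions)
        divided by the number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Classic dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Both hypotheses score the same WER (1/7), yet only the second one loses
    # the proper noun that viewers actually need.
    ref = "the rider from Yamaha takes the lead"
    print(word_error_rate(ref, "a rider from Yamaha takes the lead"))   # 1/7
    print(word_error_rate(ref, "the rider from Tahoma takes the lead")) # 1/7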

Need for viewer feedback

Viewer feedback is critical to improving timed text in streaming, since the litmus test for timed text is a satisfied viewer. It was emphasized that timed text is not intended only for people with disabilities: it is intended to make content more accessible for all by overcoming permanent, temporary or situational disability, including noisy rooms, foreign accents, etc.
Viewer feedback is also critical because different users have different preferences, not only in terms of styling but also in terms of verbatim versus condensed text. Allowing persistent customization is therefore an essential aspect of the viewer experience.

Interoperability

End-to-end capabilities remain limited, with subtitles and captions often displayed to the viewer as bottom-centered white text, even though media players are capable of much richer rendering and timed text formats support rich styling. These limitations were traced to the following:

  • Poor interoperability across media players, forcing the use of the lowest common denominator;
  • Inability to carry richer timed text formats across the entire chain due to limitations in a single link, e.g., an emission encoder that only supports legacy formats.

The group concluded that the issue did not lie in the absence of standards, but rather in their proliferation and inconsistent application. Strong interest was therefore expressed in defining best practices and organizing plugfests to improve interoperability using modern timed text formats.

A dedicated user group and next steps

Consensus was expressed that the interoperability challenges identified above would be best resolved by a group with persistent communication channels and recurring meetings. The group would include participants from across the timed text ecosystem, including individuals representing end viewers and content authors. The group would not develop standards, focusing instead on developing best practices, conducting interoperability testing and gathering requirements. It was noted that no such group appears to exist today, but that umbrella organizations that could host such a group do exist, e.g., W3C and SMPTE. It was emphasized that, regardless of its exact structure, such a group would require persistent leadership, i.e., chairperson(s), and IT infrastructure.

No consensus was reached on the exact form of the group or its leadership. Instead, a teleconference will be scheduled in May with the objective of reaching consensus on these items.

A short face-to-face meeting is also planned for mid-June, co-located with the Fraunhofer Media Web Symposium in Berlin. This meeting may be the first meeting of the newly formed group and an opportunity for those who could not attend the workshop to join the discussion.

Subscribe to our email reflector or follow us on LinkedIn to stay informed of our work and join our discussions.

Acknowledgements

CCSUBS would like to thank event organizers and chairs Andreas Tai (a2a11y) and Pierre-Anthony Lemieux (Sandflow Consulting); program committee members Eric Carlson (Apple), Cyril Concolato (Netflix), Flavio Ribeiro (Netflix), Ken Harrenstien (YouTube) and Bill McLaughlin (AI-Media); event coordinator Lisa Vierra (Netflix); and event sponsors Netflix, VITAC, SMPTE and AWS.

Attendees

Raphael Barbieri (EiTV), Lionel Bringuier (Videon Labs), Pierre-Anthony Lemieux (Sandflow Consulting), Rafael Parlatore (MAV), Christian Vogler (Gallaudet University Technology Access Program), Casey Occhialini (Paramount), Cyril Concolato (Netflix), Adi Margolin (VITAC), Renato Cassaca (Voice Interaction), Fereidoon Khosravi (Venera Technologies), Jason Livingston (Telestream), Dan Baer (Apple), Andreas Tai (a2a11y), Eric Carlson (Apple), Giovanni Galvez (Syncwords), Devin Heitmueller (LTN Global Communications), Derek Throldahl (3Play Media), Bill McLaughlin (AI-Media), Ankur Sharma (Amazon), Chris Zhang (AWS), Flavio Ribeiro (Netflix), Patrick Pearson (Netflix), Steven Thorpe (NEP Group), Pasi Toiva (Labwise/Allegro DVT), Scott Labrozzi (Disney), Derik Yarnell (Comcast Cable), David Deelo (Sony Pictures Entertainment), Yasser Syed (Comcast), Krasimir Kolarov (Apple), Louise Tapia (VITAC), Griffith Clipson (VITAC), Tsubasa Hirano (IMAGICA), Bill Magill (National Captioning Institute), Tim De Marco (Prime Video), Robin Hérin (Ateme).

Contacts

Please contact the event chairs Pierre-Anthony Lemieux and Andreas Tai with any questions.