Part 1: Captioning

A screenshot of the application is shown below:

Screenshot of final player

The media we will be using for this tutorial is the Orange Open Movie Project
“Elephants Dream”, which can be obtained from: Orange.

The player application looks fairly standard: we can see the usual array of shuttle controls in the bottom center,
and bottom-right controls to show a playlist, show a chapter list, and toggle full-screen mode. In this tutorial we are
going to cover a basic aspect of media accessibility – closed captions.

Closed captions were developed for television primarily to provide a written alternative to the soundtrack for the
benefit of viewers who are Deaf or hard of hearing; however, they have since become invaluable in many other
contexts, such as working in a quiet environment.

In the example application, when we press the “CC” button, captions should start appearing in the lower third of
the video; this allows us to comprehend the material without the audio track. If, however, we learned manual
signing as our first language, we might comprehend the material more easily if it were presented in that
manner; in a later tutorial we will look at mechanisms for presenting a simultaneous ASL translation of the audio.

Note that the mechanism to turn on captions is as prominently displayed in the UI as the mechanism to control
the volume; in this application the button is labeled with the US symbol for captions. In a more sophisticated
application, the graphics could be tailored to the user’s locale. Note also how the captions are
presented over the video in the lower third, and how the transport controls do not obscure them.

We don’t auto-play the media, because if it has an audio component it can hamper users of screen reader
software who need to be able to hear to explore the interface. We will cover interaction with a screen reader
again in a later tutorial.

For the 1 in 4 users who experience some kind of visual loss, it can be difficult to follow what is going on without
the ability to see the video image. The application will eventually provide a second button, again labeled with
the international symbol, to allow us to turn on audio description. Audio description presents the important visual
information as audio, allowing users who cannot see the video to comprehend the material. There are several
mechanisms we can use to provide audio description, and we will cover these in their own tutorial.

A complete text transcript, which collates both the caption and description data over time, can also be important.
A transcript allows readers to catch up if they missed something, were distracted by action in the video,
or are slower readers, and it can be presented by assistive technology such as a Braille display. The application will
make the text of the captions (and later the descriptions too) available as text events, for example to be displayed
in an HTML context.
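As a rough sketch of what such text events could look like in plain C# (the class, event, and method names here are our own invention, not part of any template):

```csharp
using System;

// Hypothetical publisher: raises each caption's text as it becomes current,
// so an HTML bridge or assistive-technology hook could subscribe to it.
class CaptionEventSource
{
    public event Action<string> CaptionShown;

    public void Publish(string text)
    {
        Action<string> handler = CaptionShown; // copy before the null check
        if (handler != null) handler(text);
    }
}

class CaptionEventDemo
{
    static void Main()
    {
        var source = new CaptionEventSource();
        source.CaptionShown += text => Console.WriteLine(text);
        source.Publish("at the left we can see");
    }
}
```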

One other important feature of this application is that it can be driven entirely by the keyboard; this matters for
the 1 in 5 users who have some kind of mobility or dexterity issue, as well as for those using a screen reader. By
providing keyboard access, assistive technologies such as sip-and-puff controls can readily be interfaced to the
application. It is important to ensure keyboard focus is highly visible, since the default look of applications so
often gives keyboard users little or no visual cue as to where they are. We also need to ensure that controls are
taken out of the keyboard sequence (IsTabStop=false) when not visible or active, and that changing focus does not
cause a change in state or activate functionality.

The source code for this tutorial is available here for you to download, play with, and even use in
your own applications; we’ll be updating it over time with additional features.

To give us a starting point for our application, we won’t be starting from scratch; rather, we will begin with one
of the templates from the Expression Media Encoder product (you do not need that application to
complete the tutorial, however).

We will be adapting this template, which already has some basic caption support, to add some additional features
and show how you can integrate it into your own application. The standard Encoder templates assume that the
caption data is inserted into the media by the Expression Encoder, but we will be converting this so that the caption
data is combined with the media at the client, which allows a number of more flexible scenarios.

There are a great number of caption formats available, and the demonstration code could easily be extended to
read any of them, but for the purposes of this tutorial we will show how to use the SAMI format. This is really a
legacy format now, having been defined in the early days of Windows Media Player, and it has some significant
drawbacks; but a lot of content is still available in this form, and it makes for a more interesting example.

SAMI looks a little like an HTML file. Here is an example of a SAMI source file for the Elephants Dream movie:

   <SAMI>
   <Head>
      <Title>Elephants Dream Captions</Title>
      <SAMIParam>
         Media {elephantsdream-480.wmv}
         Metrics {time:ms; duration: 538500;}
         Spec {MSFT:1.0;}
      </SAMIParam>
      <Style TYPE="text/css">
      </Style>
   </Head>
   <Body>
      <Sync Start=15000>
         <P Class=.ENCC>at the. at the left we can see
      <Sync Start=17500>
         <P Class=.ENCC> 
      <Sync Start=18000>
         <P Class=.ENCC>at the right we can see the
      <Sync Start=19800>
         <P Class=.ENCC> 
      <Sync Start=20000>
         <P Class=.ENCC>the head-snarlers
      <Sync Start=537000>
         <P Class=.ENCC>it is ...
   </Body>
   </SAMI>

The full SAMI source is here: elephantsdream-480.smi

Each of the times in the SAMI file is given as an offset in milliseconds from the start of the media. In this demo we
won’t be using the style information, but will extract just the text and the timing from the SAMI file.
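For example, the `Start=15000` entry above denotes 15000 ms; converting such an offset into the TimeSpan that a Silverlight TimelineMarker expects can be sketched on its own:

```csharp
using System;

class OffsetDemo
{
    static void Main()
    {
        // "Start=15000" in the SAMI file means 15000 ms from the start of the media;
        // TimelineMarker.Time takes a TimeSpan, built here from that millisecond offset
        int startMs = 15000;
        TimeSpan t = new TimeSpan(0, 0, 0, 0, startMs);
        Console.WriteLine(t.TotalSeconds); // 15
    }
}
```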

Create the Movie

We won’t cover importing your media assets into Expression Encoder here; there are plenty of good resources on the product site.

Extending The Template

To follow this project in detail, you will need Visual Studio and the Silverlight 3 development environment,
and some basic understanding of using them to create a Silverlight application.

Copy the template

The first thing we will do is grab the template code from one of the Expression Encoder examples, e.g.:

 C:\Program Files\Microsoft Expression\Encoder 3\Templates\en\BlackGlass.

  1. Copy the files to the demo project location.
  2. Open the solution.
  3. Remove the existing Template project.

Create Silverlight project

In the solution, we will now create a new Silverlight project. If you were adapting a pre-existing application, you
could include that instead. Right-click on the project, select Add from the drop-down menu, choose Silverlight
Application, and rename it:

Screenshot of project dialog box

At the prompt:

Screenshot dialog asking whether to create Web host project

Select OK

Now, open MainPage.xaml and add the following code inside the root Grid (you could also do this using the visual
designer in Expression Blend):

	        <Grid.RowDefinitions>
	            <RowDefinition Height="29"/>
	            <RowDefinition Height="*"/>
	        </Grid.RowDefinitions>
	        <Grid.ColumnDefinitions>
	            <ColumnDefinition Width="*"/>
	        </Grid.ColumnDefinitions>
	        <TextBlock x:Name="textTitle" Text="Accessible Media Player" TextWrapping="Wrap" FontSize="16" HorizontalAlignment="Center" VerticalAlignment="Center" d:LayoutOverrides="GridBox"/>

Add the height and width properties to UserControl to cause the player to fill the window:

      Width="Auto" Height="Auto"

Save, build, and run in Visual Studio by hitting F5. If you see the following:

Screenshot of dialog asking to enable debug mode in web page

Just hit OK.

And Internet Explorer should appear, running the basic application:

Screenshot of empty application running in IE

Now that we have this basic project working, we can add in the player from the Expression Media Encoder.

Add project references to the Dream project.

Right click on the References tab in the solution explorer:

Screenshot of references tab

At the dialog, add all of the Encoder-supplied projects:

Screenshot of final player

In MainPage.xaml, add the xmlns namespace declarations for the Encoder player assemblies (these define the ep: prefix used in the XAML below):


Finally add the following XAML into the grid:

        <ep:ExpressionPlayer Margin="0,0,0,0" x:Name="AccessibleMediaPlayer" Grid.Row="1"/>

Build and run, and we have a functioning media player.

Screenshot of final player

Now we need to point it at the “Elephants Dream” video resource we created with the Expression Encoder (note
that the standard template already contains a closed-caption button). First, we need to modify the application
startup to call the initialization routine in the Encoder template; add the additional line as shown here:

      private void Application_Startup(object sender, StartupEventArgs e)
      {
            this.RootVisual = new MainPage();
            // add this to initiate the parameter read
            (this.RootVisual as MainPage).AccessibleMediaPlayer.OnStartup(sender, e);
      }

This causes the Encoder template to read the initialization parameters from the HTML context which embeds the
Silverlight control. Here we add a <param> element containing the initialization that points the player at the
media source we want to play. Add the following to the .aspx and .html files in the Dream.Web project, as
parameters of the Silverlight object:

   <param name="initparams" value='playerSettings = 
	    <Title>Elephants Dream</Title>

And now when we start the application we are able to see the video play:

Screenshot of final player

But no matter how much we press the CC button, captions will not show, because we did not encode any
captions into the video (if we had, they would of course show at this point). Creating the caption data and getting
it to the point of playback is actually the hardest part, and not something we will cover in this tutorial;
we’ll come back to this point in later tutorials.

Adding Closed Captions

In general, making media accessible comes down to providing information in forms which can be adapted to
people’s needs, either by ensuring it is encoded in a malleable data format like text, or by providing coding
redundancy. Captions are an example of this: in the abstract, they are an alternative means of conveying the
important information in the audio track. We ideally want them to be visible in the video area, to match
expectations from broadcast TV, and we want them timed to appear when the equivalent audio plays.

Architecturally, there are a couple of ways we can achieve this. The first idea, and perhaps the technically
simplest, is just to burn the captions into the video; that way everybody gets them and they look just the way that
you want, right?

Well, there’s a problem right there: they look the way you want, but you are not your user. Consider, for
example, a senior viewer whose eyesight may not be what it was; as well as losing some hearing, they may be using
captions as a backup for difficult patches in the audio, but simultaneously need the text quite a bit larger.
Some people prefer to have as much of the video visible as possible, so they may want a translucent, or even
completely transparent, background to the text; others want a solid background for maximum contrast and
clarity. Internationally the conventions vary: in the UK, text color is used to distinguish different
speakers, while in the US typographic conventions are used.

So then you might consider creating different encoded versions of the movie with all these options. That’s
possible, but if you have a lot of videos, say thousands or even tens of thousands on your server, it is going to
cost you money. Also consider the experience of someone receiving the whole thing through a refreshable Braille
display; they will need the captions as text, so they can be presented on that device.

So what we are going to do is something akin to the way it’s done on TV: use closed captions based
on a text representation that can be switched on and off at the client, and rendered the way the user needs them.


As we have seen, the template does provide some caption support already, but it is set up to play captions
embedded in the media by the Expression Encoder product at encode time (a script is provided to import some
forms of caption data and convert them to media markers). This mode is fine for some scenarios, but it relies on
you having the Encoder product, and we may want a little more flexibility, e.g. the option of switching between
different languages at playback, which is not supported in the basic templates. So we will adapt the code so that
it can load captions on the fly at the client.

The first thing we are going to add is another parameter that tells the player where the caption file we want to
use is; add this to the <param> initialization in the .aspx and .html files:

       <CaptionSource>elephantsdream-480.smi</CaptionSource>

If we run at this point, the template will throw an exception because it doesn’t understand this additional
parameter. So now we are actually going to need to modify the template code. In an ideal world we would use
C# subclassing to avoid changing the provided code and to localize our changes; however, that is slightly
beyond the scope of this article and left as an exercise for the reader, so here we will dive in and modify
the code itself. We won’t damage anything in the Encoder template itself, since we copied the code at the start.
So to recognize the new parameter, we need to add some bits into the Playlist functionality in the code file:

Firstly we add a private member to the PlaylistItem class to hold the value:

      private Uri m_captionUrl;

Then following the conventions of the template, we add a getter and setter to make this a public property of the
class, so the XAML loader will set the private member for us.

        public Uri CaptionSource
        {
            get { return m_captionUrl; }
            set { m_captionUrl = value; }
        }

Lastly, we want to add a case in the Deserialize(XmlReader reader) method to handle a caption
source being provided:

     else if (reader.IsStartElement("CaptionSource"))
     {
         string rawCaptionSourceUrl = reader.ReadElementContentAsString();
         string decodedCaptionSourceUrl = HttpUtility.UrlDecode(rawCaptionSourceUrl);
         this.CaptionSource = new Uri(decodedCaptionSourceUrl, UriKind.RelativeOrAbsolute);
     }
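The HttpUtility.UrlDecode call above simply undoes percent-encoding in the configured path. Outside Silverlight, Uri.UnescapeDataString performs an equivalent decode for a path like this (the encoded path below is a made-up example):

```csharp
using System;

class DecodeDemo
{
    static void Main()
    {
        // a hypothetical percent-encoded caption path, as it might appear in initparams
        string raw = "captions%2Felephantsdream-480.smi";
        // Uri.UnescapeDataString reverses the percent-encoding ("%2F" -> "/")
        Console.WriteLine(Uri.UnescapeDataString(raw)); // captions/elephantsdream-480.smi
    }
}
```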

So, now that we know where the captions are, we want to download and parse them. It is a quirk of the media
marker system that the marker collection is only valid once the media is opened, so in the MediaPlayer.cs code
we’ll add a LoadCaptions method and call it in InternalPlay(), right before the media source is set to play. This
method will download and parse the file and create media markers as appropriate:

         PlaylistItem playlistItem = Playlist.Items[m_currentPlaylistIndex];
         LoadCaptions(new IsoUri(playlistItem.CaptionSource));

Downloading the captions right before play might lead to some delay before the first captions show up, so it might
prove necessary to separate the download from applying the markers to the media element; for the purposes of
this tutorial, however, we do both at the same time. The Expression Encoder template is set up to resolve
the IsoUri class to a stream, so we need to pull the text out of that stream and hand it to a helper function to
parse the data.

        private void LoadCaptions(IsoUri captionsUri)
        {
            string data = "";
            if (captionsUri.DownloadSucceeded)
            {
                var text = new System.IO.StreamReader(captionsUri.Stream);
                data = text.ReadToEnd();
            }
            ParseCaptionData(data);
        }
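The stream-to-string step can be illustrated on its own, independent of the IsoUri helper, by reading from an in-memory stream:

```csharp
using System;
using System.IO;
using System.Text;

class StreamDemo
{
    static void Main()
    {
        // simulate the downloaded caption stream with an in-memory stream
        byte[] bytes = Encoding.UTF8.GetBytes("<Sync Start=15000>");
        using (var reader = new StreamReader(new MemoryStream(bytes)))
        {
            string data = reader.ReadToEnd(); // the raw SAMI text, ready for parsing
            Console.WriteLine(data);
        }
    }
}
```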

Now we add the ParseCaptionData method to extract the captions. Note that we are not really interpreting the
SAMI caption file here, but parsing it in a very simplistic manner to extract the timing and caption text. In a full
implementation we would want to respect the styling and so on for each caption.

        private void ParseCaptionData(string p)
        {
            if (p == "") throw new Exception("bad caption data");
            else LoadFromSAMI(p);
        }

For brevity we will not describe in detail the methods which parse the SAMI data, since it is likely that you would
want to replace them with your own code, and they are a bit complex to describe; the details are in the source
code for those who are curious.

The main method is this:

        public void LoadFromSAMI(string text)
        {
            XDocument samiXml = MapSamiToXml(text);
            string selectedClass = "";

            var body = samiXml.Element("Body");
            foreach (var sync in body.Elements())
            {
                int start = Int32.Parse(sync.Attribute("Start").Value);
                int end = 0;
                try
                {
                    var next = sync.ElementsAfterSelf();
                    foreach (XElement e in next)
                    {
                        end = Int32.Parse(e.Attribute("Start").Value);
                        break;  // only want the first.
                    }
                }
                catch (Exception)
                {
                }

                foreach (var para in sync.Elements())
                {
                    string klass = para.Attribute("Class").Value;
                    bool wantThisClass = selectedClass == "" || (string.Compare(klass, selectedClass, StringComparison.OrdinalIgnoreCase) == 0);
                    bool isSourceID = false;

                    if (para.Attribute("ID") != null)
                        isSourceID = (string.Compare(para.Attribute("ID").Value, "Source", StringComparison.OrdinalIgnoreCase) == 0);

                    if (!isSourceID && wantThisClass)
                    {
                        selectedClass = klass;  // match these ones only.
                        // insert a marker for this caption (see AddMediaMarker below)
                        AddMediaMarker(start, "caption", para.Value);
                    }
                }
            }
        }

The key method which actually inserts media markers into the media element gets called by this parsing code for
each identified caption:

       public void AddMediaMarker(int ms, string type, string data)
       {
            TimelineMarker marker = new TimelineMarker();
            marker.Time = new TimeSpan(0, 0, 0, 0, ms);
            marker.Type = type;
            marker.Text = data.Trim();
            // add the marker to the underlying media element's collection
            // (assumed here to be exposed as MediaElement by the template)
            MediaElement.Markers.Add(marker);
       }

SAMI is not an XML format, although as you can see from the example above, it is similar. The following code uses
the XML and regular expression processing built into Silverlight to convert it to a valid XML file. This is by no
means a complete SAMI implementation, but provides sufficient functionality for the purposes of demonstration.

        private XDocument MapSamiToXml(string p)
        {
            XDocument xml = new XDocument();
            var ws = new char[] { '\r', '\n' };
            var syncTag = new string[] { "<Sync " };
            var pTag = new string[] { "<P " };

            // Split up the input into sync groups
            var syncs = p.Split(syncTag, StringSplitOptions.None);
            List<XElement> xmlContent = new List<XElement>();

            foreach (string s in syncs)
            {   // process each sync group in turn
                if (s.Contains("Start="))
                {
                    string attribs1 = s.Substring(0, s.IndexOf('>'));
                    string rest = s.Remove(0, attribs1.Length + 1).TrimStart(ws);
                    List<XElement> kids = new List<XElement>();

                    // process all the content between two <P> tags.
                    foreach (string content in rest.Split(pTag, StringSplitOptions.RemoveEmptyEntries))
                    {
                        string attribs2 = content.Substring(0, content.IndexOf('>'));
                        // extract any remaining plain text as the caption text
                        string caption = RemoveHtml(content.Remove(0, attribs2.Length + 1).TrimStart(ws));
                        XElement pElement = new XElement("P", caption);
                        AddXmlAttributes(pElement, attribs2);
                        kids.Add(pElement);
                    }
                    XElement syncElement = new XElement("Sync", kids.ToArray());
                    AddXmlAttributes(syncElement, attribs1);
                    xmlContent.Add(syncElement);
                }
            }
            xml.Add(new XElement("Body", xmlContent.ToArray()));
            return xml;
        }

        private static string RemoveHtml(string p)
        {
            string plaintext = p;
            plaintext = Regex.Replace(plaintext, @"<[^>]*>", String.Empty);
            plaintext = Regex.Replace(plaintext, @"&[^;]*;", String.Empty);
            return plaintext.Trim();
        }

        private static void AddXmlAttributes(XElement pElement, string prefix)
        {
            var syntax = new char[] { ' ', '=' };
            var attribs = prefix.Split(syntax);
            int i = 0;
            while (i < attribs.Length)
            {
                pElement.SetAttributeValue(attribs[i], attribs[i + 1]);
                i += 2;
            }
        }
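To see what the two regular expressions in RemoveHtml do, here is the same stripping logic as a standalone sketch:

```csharp
using System;
using System.Text.RegularExpressions;

class StripDemo
{
    // same two-step stripping as RemoveHtml above:
    // drop anything in angle brackets, then drop any HTML entities, then trim
    public static string RemoveHtml(string p)
    {
        string plaintext = Regex.Replace(p, @"<[^>]*>", String.Empty);
        plaintext = Regex.Replace(plaintext, @"&[^;]*;", String.Empty);
        return plaintext.Trim();
    }

    static void Main()
    {
        Console.WriteLine(RemoveHtml(" at the <b>left</b> we can see ")); // at the left we can see
    }
}
```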

So that’s it: if we now build and run the application, it will show the captions by default, and if we want to hide
them we can.

Screenshot of final player with Elephants dream playing

Adding some UI or logic to switch between different caption files isn’t much harder, and in the first instance this
could be based on the user’s current locale.
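A minimal sketch of such locale-based selection, assuming a hypothetical one-file-per-language naming scheme (these file names are our own invention):

```csharp
using System;
using System.Globalization;

class CaptionChooser
{
    // hypothetical naming scheme: one SAMI file per language,
    // e.g. elephantsdream-480.en.smi, elephantsdream-480.fr.smi
    public static string CaptionFileFor(CultureInfo culture)
    {
        return "elephantsdream-480." + culture.TwoLetterISOLanguageName + ".smi";
    }

    static void Main()
    {
        Console.WriteLine(CaptionFileFor(new CultureInfo("fr-FR"))); // elephantsdream-480.fr.smi
    }
}
```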

This was relatively straightforward because most of the necessary infrastructure for captions was already present
in the template; all we needed to do was load the captions in the player, rather than rely on them being pre-encoded.

Wrap Up

In this first tutorial on media accessibility we have shown how to extend one of the Media Encoder templates to
allow client-side insertion of captions. We looked specifically at the code required to do this, but the techniques of
adding and responding to media markers are quite general and could easily be repurposed for other players. We
also looked at how one could parse SAMI data for captions; this code could easily be replaced to support other
caption file formats.

Last edited Oct 28, 2009 at 6:53 PM by seanhayes, version 8


Birbilis Nov 19, 2012 at 4:22 PM 
the Orange link above is broken: "http://http//" should be

seanhayes Mar 23, 2011 at 10:17 PM 
Note that the official Silverlight media framework is probably a much better starting point now than this project, with Smooth streaming and TTML support, but I'm leaving it here for historical reasons, so that you can study the basics of how it works. I'm working on a newer project on the authoring side now which I hope to release soon.

teabag Oct 2, 2010 at 6:21 PM 
more tutorials please...

ViolinistJohn Nov 17, 2009 at 1:38 PM 
I figured it out... the avi won't play as Media Player doesn't recognise it. I got a wmv by using the eval version of Expression Encoder to create it from the
Then place that in the same location as the DreamTestPage.html and it all started working. Put the Sami file in the ClientBin directory and closed captions work.
Now to try and understand what is actually going on in under the hood.

ViolinistJohn Nov 17, 2009 at 8:37 AM 
I'm having trouble getting the tutorial project to work. Specifically I can't get the video to load correctly.
When I start the project the silverlight page loads but I get an error message at the top of the player window in red:
Could not open media file http://localhost:64652/Elephants_Dream_1024.avi AG_E_NETWORK_ERROR.

I am using the 1024.avi file as there doesn't seem to be a wmv on the Orange web page - did you make your own using Encoder?
I suspect this may be part of the problem. How do I get a wmv? or can I get the avi to work - how?
Any help appreciated - Thanks.

seanhayes Oct 30, 2009 at 8:34 AM 
This project does not support smooth streaming, but the principles employed here would work. The Smooth streaming player SDK exposes the timeline marker collection just like the standard media element. So if you use that as your starting point you should be able to adapt this technique. Hope this helps and good luck.

kmazur Oct 27, 2009 at 5:23 PM 
Does this project support smooth streaming? I've got two projects. One that loads markers from external XML file. Another that supports smooth streaming. I need to merge the two and if your project supports smooth streaming, it looks like it would be a good stepping stone to the marker merging.
