DISPATCH | A. Amirante |
Internet-Draft | University of Napoli |
Expires: December 22, 2011 | T. Castaldi |
| L. Miniero |
| Meetecho |
| S P. Romano |
| University of Napoli |
| June 20, 2011 |
Session Recording for Conferences using SMIL
draft-romano-dcon-recording-04
This document deals with session recording, specifically the recording of multimedia conferences, both centralized and distributed. Each involved medium is recorded separately, and is then properly tagged. SMIL [W3C.CR-SMIL3-20080115] metadata is used to put all the separate recordings together and handle their synchronization, as well as the possibly asynchronous opening and closure of media within the context of a conference. This SMIL metadata can subsequently be used by an interested user, by means of a compliant player, to passively receive a playout of the whole multimedia conference session. The motivation for this document comes from our experience with our conferencing framework, Meetecho, for which we implemented a recording functionality.
This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 22, 2011.
Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
This document deals with session recording, specifically the recording of multimedia conferences, both centralized and distributed. Each involved medium is recorded separately, and is then properly tagged. Such functionality is often required in conferencing systems, and is of great interest to the XCON [RFC5239] Working Group. The motivation for this document comes from our experience with our conferencing framework, Meetecho, for which we implemented a recording functionality. Meetecho is a standards-based conferencing framework, and so we tried our best to implement recording in a standard fashion as well.
In the approach presented in this document, a SMIL [W3C.CR-SMIL3-20080115] metadata is used to put all the separate recordings together and handle their synchronization, as well as the possibly asynchronous opening and closure of media within the context of a conference. This SMIL metadata can subsequently be used by an interested user by means of a compliant player in order to passively receive a playout of the whole multimedia conference session.
The document presents the approach by describing the required steps in sequence. Section 4 presents the recording step, with an overview of how each involved medium might be recorded and stored for future use. As explained in the following sections, existing approaches might be exploited to achieve these steps (e.g. MEDIACTRL [RFC5567]). Section 5 then describes the tagging process, showing how each medium can be addressed in a SMIL metadata file, with specific focus on timing and inter-media synchronization. Finally, Section 6 describes how a potential player for the recorded session can be implemented and what it is supposed to achieve.
In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119 [RFC2119] and indicate requirement levels for compliant implementations.
TBD.
When a multimedia conference is realized over the Internet, several media might be involved at the same time. Besides, these media might come and go asynchronously during the lifetime of the same conference. If such a conference needs to be recorded in order to allow a subsequent, possibly offline, playout, these media therefore need to be recorded in a format that is aware of all the timing-related aspects. A typical example is a videoconference with slide sharing. While audio and video have a life of their own, slide changes might be triggered at a completely different pace. Besides, the start of a slideshow might occur much later than the start of the audio/video session. All these requirements must be taken into account when dealing with session recording in a conference. It is also important that all the individual recordings be taken in a standard fashion, in order to achieve the maximum compatibility among different solutions and avoid any proprietary mechanism or approach that could prevent a successful playout later on.
In this document, we present our approach towards media recording in a conference. Specifically, we will deal with the recording of the following media: audio and video streams (Section 4.1), text chats (Section 4.2), slide presentations (Section 4.3) and shared whiteboards (Section 4.4).
Additional media that might be involved in a conference (e.g. desktop or application sharing) are not presented in this document, and their description is left to future extensions.
In a conferencing system compliant with [RFC5239], audio and video streams contributed by participants are carried in RTP channels [RFC3550]. These RTP channels may or may not be secured (e.g. by means of SRTP/ZRTP). Whether or not these channels are secured is not an issue in this case. In fact, as is usually the case, all the participants terminate their media streams at a central point (a mixer entity), with which they would have a secured connection. This means that the mixer would get access to the unencrypted payloads, and would be able to mix and/or store them accordingly.
From a high-level topology point of view, this is how a recorder for audio and video streams could be envisaged:
            SIP  +------------+  SIP
       /---------|  XCON AS   |----------\
      /          +------------+           \
     /                  |MEDIACTRL         \
    /                   |                   \
+-----+              +-----+              +-----+
|     |     RTP      |     |     RTP      |     |
|UA-A +<------------>+Mixer+<------------>+UA-B |
|     |              |     |              |     |
+-----+              +-++--+              +-----+
                       ||
              RTP UA-A || RTP UA-B
               (Rx+Tx)    (Rx+Tx)
                       VV
                  +----------+
                  |          |
                  | Recorder |
                  |          |
                  +----------+
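That said, actually recording audio and video streams in a conference may be accomplished in several ways. Two different approaches might be highlighted: recording each stream contributed by the participants individually, or recording the mix of all the streams. The two figures that follow depict them, in this order.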
                              +-------+
                              | UAC-C |
                              +-------+
                                  "
                          C (RTP) "
                                  "
                                  "
                                  v
+-------+      A (RTP)       +----------+       B (RTP)      +-------+
| UAC-A |===================>| Recorder |<===================| UAC-B |
+-------+                    +----------+                    +-------+
                                  *
                                  *
                                  *
                                  ****> A.gsm, A.h263
                                  ****> B.g711, B.h264
                                  ****> C.amr

                              +-------+
                              | UAC-C |
                              +-------+
                                  "
                          C (RTP) "
                                  "
                                  "
                                  v
+-------+      A (RTP)       +----------+       B (RTP)      +-------+
| UAC-A |===================>| Recorder |<===================| UAC-B |
+-------+                    +----------+                    +-------+
                                  *
                                  *
                                  *
                                  ****> (A+B+C).wav, (A+B+C).h263
Of the two, the second is probably more feasible. In fact, the first would require a potentially vast amount of separate recordings, which would need to be subsequently muxed and correlated to each other. Besides, within the context of a multimedia conference, most of the time the streams are already mixed for all the participants, and so recording the mix directly would be a clear advantage. Such an approach, of course, assumes that all the streams pass through a central point where the mixing occurs: this is the case depicted in Figure 1. The recording would take place at that point. Such a central point, the mixer (which in this case would also act as the recorder, or a frontend to it), might be a MEDIACTRL-based [RFC5567] Media Server. Considering the similar nature of audio and video (both being RTP-based and probably mixed by the same entity), they are analysed in the same section of this document. The same applies to tagging and playout as well. It is important to note that in case any policy is involved (e.g. moderation by means of BFCP [RFC4582]) the mixer would take it into account when recording. In fact, the same policies applied to the actual conference with respect to the delivery of audio and video to the participants need to be enforced for the recording as well.
In a more general way, if the mixer does not support a direct recording of the mixes it prepares, recording a mix can be achieved by attaching the recorder entity (whatever it is) as a passive participant to the conference. This would allow the recorder to receive all the involved audio and video streams already properly mixed, with policies already taken into consideration. This approach is depicted in Figure 4.
                             +-------+
                             |  UAC  |
                             |   C   |
                             +-------+
                                "  ^
                        C (RTP) "  "
                                "  "
                                "  " A+B (RTP)
                                v  "
+-------+      A (RTP)       +--------+     A+C (RTP)      +-------+
|  UAC  |===================>| Media  |===================>|  UAC  |
|   A   |<===================| Server |<===================|   B   |
+-------+     B+C (RTP)      +--------+      B (RTP)       +-------+
                                 "
                                 "
                                 " A+B+C (RTP)
                                 "
                                 v
                            +----------+
                            | Recorder |
                            +----------+
                                 *
                                 ****> (A+B+C).wav, (A+B+C).h263
Whether or not the mixer is MEDIACTRL-based, it's quite likely that the AS handling the multimedia conference business logic has some control on the mixing involved. This means it can request the recording of each available audio and/or video mix in a conference, if only by adding the passive participant as mentioned above. Besides, events occurring at the media level or business logic in the AS itself allow the AS to take note of timing information for each of the recorded media. For instance, the AS may take note of when the video mixing started, in order to properly tag the video recording in the tagging phase. Both the recordings and the timing events list would subsequently be used in order to prepare the metadata information of the audio and video in the overall session recording description. Such a phase is described in Section 5.2.1.
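The timing notes taken by the AS could be kept in a structure as simple as the following. This is a minimal sketch under our own assumptions: the class name `TimingLog`, the `"start"`/`"stop"` action labels and the use of epoch seconds are hypothetical and are not mandated by any of the referenced specifications.

```python
from dataclasses import dataclass, field


@dataclass
class TimingLog:
    """Hypothetical AS-side log of media start/stop events."""
    conference_start: float              # epoch seconds of the conference start
    events: list = field(default_factory=list)

    def note(self, medium: str, action: str, at: float) -> None:
        # Store offsets from the conference start rather than wall-clock time,
        # since the tagging phase needs session-relative values.
        self.events.append((medium, action, at - self.conference_start))

    def lifetime(self, medium: str):
        """Return the (begin, end) offsets of a medium, or None if incomplete."""
        begin = next((t for m, a, t in self.events
                      if m == medium and a == "start"), None)
        end = next((t for m, a, t in self.events
                    if m == medium and a == "stop"), None)
        return None if begin is None or end is None else (begin, end)


log = TimingLog(conference_start=1000.0)
log.note("video-mix", "start", 1015.0)   # video mixing started 15 s in
log.note("video-mix", "stop", 1400.0)
print(log.lifetime("video-mix"))
```

The pair returned by `lifetime` is what the tagging phase would later need in order to fill in the timing attributes of the recorded medium.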
In a MEDIACTRL Media Server, such a functionality might be accomplished by means of the Mixer Control Package [I-D.ietf-mediactrl-mixer-control-package]. At the end of the conference, URLs to the actual recordings would be made available for the AS to use. The AS might then subsequently access those recordings according to its business logic, e.g. to store them somewhere else (the MS storage might be temporary) or to implement an offline transcoding and/or mixing of all the recordings in order to obtain a single file representative of the whole audio/video participation in the conference. Practical examples of these scenarios are presented in [I-D.ietf-mediactrl-call-flows].
Of course, if the recording of a mix is not possible or desired, one could still fall back to the first approach, that is, individually recording all the incoming contributions. This is the case, for instance, for conferencing systems which don't implement video mixing, but rely instead on switching/forwarding the potentially several video streams to each participant. This functionality can also be achieved by means of the same control package previously introduced, since it allows for the recording of both mixes and individual connections. Once the conference ends, the AS can then decide what to do with the recordings, e.g. mix them all together offline (thus obtaining an overall mix) or leave them as they are. The tagging process would then take the decision into account, and address the resulting media accordingly.
What has been said about audio and video partially applies to text chats as well. In fact, just as a central mixer is usually involved for audio and video, for instant messaging the contributions by all participants most of the time pass through a central node, from where they are forwarded to the other participants. This is the case, for instance, for XMPP [RFC3920] and MSRP [RFC4975] based text conferences. If so, recording the text part of a conference is not hard to achieve either. The AS just needs to implement some form of logging, in order to store all the messages flowing through the text conference central node, together with information on the senders of these messages and timing-related information. Of course, the AS may not directly be the text conference mixer: the same considerations apply, however, in the sense that the remote mixer must be able to implement the aforementioned logging, and must be able to receive related instructions from the controlling AS. Besides, considering the possible protocol-agnostic nature of the conferencing system (as envisaged in [RFC5239]), several different instant messaging protocols may be involved in the same conference. Just as the conferencing system would act as a protocol gateway during the lifetime of the conference (i.e. provide MSRP users with the text coming from XMPP participants and vice versa), all the contributions coming from the different instant messaging protocols would need to be recorded in the same log, and in the same format, to avoid ambiguity later on.
An example of a recorder for instant messaging is presented in Figure 5.
                              +-------+
                              | UAC-C |
                              +-------+
                                  ^
                         C (MSRP) " '10.11.24 Hi!'
                                  "
                                  "
                                  v
+-------+      A (XMPP)      +----------+      B (IRC)       +-------+
| UAC-A |<==================>| Recorder |<==================>| UAC-B |
+-------+  '10.11.26 Hey C'  +----------+ '10.11.30 Hey man' +-------+
                                  *
                                  *
                                  *
                                  [..]
                                  ****> 10.11.24 <User C> Hi!
                                  ****> 10.11.26 <User A> Hey C
                                  ****> 10.11.30 <User B> Hey man
                                  [..]
The same considerations already mentioned about optional policies involved apply to text conferences as well: i.e., if a UAC is not allowed to contribute text to the chat, this contribution is excluded both from the mix the other participants receive and from the ongoing recording.
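The protocol-independent logging and the policy enforcement just described might be sketched as follows. The names `ChatRecorder`, `ChatEntry` and `on_message`, as well as the set of muted senders, are hypothetical; the log line format merely mimics the one shown in the figure above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatEntry:
    timestamp: str   # wall-clock time as in the figure, e.g. "10.11.24"
    sender: str
    text: str


class ChatRecorder:
    """Hypothetical logger sitting at the text conference central node."""

    def __init__(self, muted=()):
        self.muted = set(muted)   # senders not allowed to contribute text
        self.log = []

    def on_message(self, protocol, timestamp, sender, text):
        # The originating protocol only matters to the gateway: every
        # accepted message ends up in the same log, in the same format.
        if sender in self.muted:
            return False          # neither forwarded nor recorded
        self.log.append(ChatEntry(timestamp, sender, text))
        return True

    def dump(self):
        return [f"{e.timestamp} <{e.sender}> {e.text}" for e in self.log]


recorder = ChatRecorder(muted={"User D"})
recorder.on_message("MSRP", "10.11.24", "User C", "Hi!")
recorder.on_message("XMPP", "10.11.26", "User A", "Hey C")
recorder.on_message("XMPP", "10.11.27", "User D", "not allowed")
recorder.on_message("IRC", "10.11.30", "User B", "Hey man")
print("\n".join(recorder.dump()))
```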
Considerations about the format of the recording are left to Section 5.2.2. Until then, we just assume the AS has a way to record text conferences somehow in a format it is familiar with. This format would subsequently be converted to another, standard, format that a player would be able to access.
Another medium typically available in a multimedia conference over the Internet is the slide presentation. In fact, slides, whatever format they're in, are still the most common way of presenting something within a collaboration framework. The problem is that, most of the time, these slides are deployed in a proprietary way (e.g. Microsoft PowerPoint and the like). This means that, besides the recording aspect of the issue, the delivery itself of such slides can be problematic when considered in a standards-based conferencing framework.
Considering that no standard way of implementing such functionality is commonly available yet, we assume that a conferencing framework makes such slides available to the participants in a conference as a slideshow, that is, a series of static images whose appearance might be dictated by a dedicated protocol. For instance, a presenter may trigger the change of a slide by means of an instant messaging protocol, providing each authorized participant with a URL from where to get the current slide, with optional metadata to describe its content.
An example is presented in Figure 6. The presenter has previously uploaded the presentation in a proprietary format. The presentation has been converted to images, and a description of the new format has been sent back to the presenter (e.g. as XML metadata). At this point, the presenter makes use of XMPP to inform the other participants about the current slide, by providing an HTTP URL to the related image.
                            +-----------+
                            | Presenter |
                            +-----------+
                                  " (XMPP)
                                  " Current presentation: f44gf
                                  " Current slide number: 4
                                  " URL: http://example.com/f44gf/4.jpg
                                  "
                                  v
+-------+       (XMPP)       +----------+       (XMPP)       +-------+
| UAC-A |<===================| ConfServ |===================>| UAC-B |
+-------+                    +----------+                    +-------+
    |                                                            |
    | HTTP GET (http://example.com/f44gf/4.jpg)                  |
    v                  HTTP GET (http://example.com/f44gf/4.jpg) |
                                                                 v
Under this assumption, the recording of each slide presentation would be relatively trivial to achieve. In fact, the AS would just need to have access to the set of images (with the optional metadata involved) of each presentation, and to the additional information related to presenters and to when each slide was triggered. For instance, the AS may take note of the fact that slide 4 from presentation "f44gf" of the example above has been presented by UAC "spromano" from second 56 of the conference to second 302. Properly recording all those events would allow for a subsequent tagging, thus allowing for the integration of this medium in the whole session recording description together with the other media involved. This phase will be described in Section 5.2.3.
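As an illustration, the intervals during which each slide was on screen can be derived from the recorded change triggers alone. The sketch below assumes that a slide stays visible until the next trigger (or the end of the conference); the function name, the tuple layout and the second trigger in the usage example are hypothetical.

```python
def slide_intervals(triggers, conference_end):
    """Turn slide change triggers into display intervals.

    triggers: list of (second, presenter, presentation, slide) tuples,
              sorted by time, as noted down by the AS.
    Returns a list of (presentation, slide, presenter, begin, end).
    """
    intervals = []
    for i, (begin, presenter, presentation, slide) in enumerate(triggers):
        # A slide is assumed to stay on screen until the next one is triggered.
        end = triggers[i + 1][0] if i + 1 < len(triggers) else conference_end
        intervals.append((presentation, slide, presenter, begin, end))
    return intervals


triggers = [
    (56, "spromano", "f44gf", 4),
    (302, "spromano", "f44gf", 5),   # hypothetical follow-up slide
]
for entry in slide_intervals(triggers, conference_end=400):
    print(entry)
```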
To conclude the overview of the analysed media, we consider a further medium which is quite commonly deployed in multimedia conferences: the shared whiteboard. There are several ways of implementing such functionality. While some standard solutions exist, they are rarely used within the context of commercial conferencing applications, which usually prefer to implement it in a proprietary fashion.
Without delving into a discussion of this aspect, suffice it to say that, for a successful recording of a whiteboard session, most of the time it is enough to record the individual contributions of each involved participant (together with the usual timing-related information). In fact, this would allow for a subsequent replay of the whiteboard session in an easy way. Unlike audio and video, whiteboarding is usually a very lightweight medium, and so recording the individual contributions rather than the resulting mix (as we suggested in Section 4.1) is advisable. These contributions may subsequently be mixed together in order to obtain a standard recording (e.g. a series of images, animations, or even a low-framerate video). An example of recording for this medium is presented in Figure 7.
                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                          C (XMPP) " 10.11.20: line
                                   "
                                   "
                                   v
+-------+      A (XMPP)      +-----------+      B (XMPP)      +-------+
| UAC-A |===================>| WB server |<===================| UAC-B |
+-------+  10.10.56: circle  +-----------+   10.12.30: text   +-------+
                                   *
                                   *
                                   *
                                   ****> 10.10.56: circle (A)
                                   ****> 10.11.20: line (C)
                                   ****> 10.12.30: text (B)
The recording process may be enriched by the population of a parallel event list. For instance, optimizations might include events such as the creation of a new whiteboard, the clearing of an existing whiteboard or the addition of a background image that replaces the previously existing content. Such events would be valuable in a subsequent playout of the recorded steps, since they would allow for a more lightweight replication in case seeking is involved. For instance, if 70 drawings have been made, but at second 560 of the conference the whiteboard has been cleared and since then only 5 drawings have been added, a viewer seeking past that point would just need the clear and the 5 subsequent drawings to be replicated. Further discussion of the tagging process for this medium is presented in Section 5.2.4.
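The seeking optimization can be sketched as follows. We assume, for illustration only, that whiteboard events are stored as `(second, kind, payload)` tuples and that the kinds `"new"`, `"clear"` and `"background"` all discard what was drawn before; these labels are hypothetical.

```python
RESET_KINDS = ("new", "clear", "background")   # hypothetical event labels


def events_to_replay(events, seek_time):
    """Return the minimal list of events needed to rebuild the whiteboard.

    events: (second, kind, payload) tuples, sorted by time.
    """
    past = [e for e in events if e[0] <= seek_time]
    # Anything that happened before the most recent reset can be skipped.
    resets = [i for i, e in enumerate(past) if e[1] in RESET_KINDS]
    return past[resets[-1]:] if resets else past


events = [(s, "draw", f"shape{s}") for s in range(100, 170)]    # 70 drawings
events.append((560, "clear", None))
events += [(s, "draw", f"shape{s}") for s in range(570, 575)]   # 5 drawings

print(len(events_to_replay(events, 600)))   # the clear plus 5 drawings
```

Without the event list, a player seeking to second 600 would have to replay all 76 recorded interactions.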
Once the different media have been recorded and stored, and their timing has been related, this information needs to be properly tagged in order to allow intra-media and inter-media synchronization when a playout is invoked. Besides, it would be desirable to make use of standard means for achieving such functionality. For these reasons, we chose to make use of the Synchronized Multimedia Integration Language [W3C.CR-SMIL3-20080115], which fulfills all the aforementioned requirements, besides being a well-established W3C standard. In fact, timing information is very easy to address using this specification, and VCR-like controls (start, pause, stop, rewind, fast forward, seek and the like) are all easily deployable in a player using the format.
The SMIL specification provides means to address different media by using dedicated elements (e.g. audio, img, textstream and so on), and for each of these media the related timing can be easily described. The following subsections will describe how SMIL metadata could be prepared in order to map to the media recorded as described in Section 4.
Specifically, considering how a SMIL file is assumed to be constructed, the head will be described in Section 5.1, while the body (with different focus for each media) will be presented in Section 5.2.
As specified in [W3C.CR-SMIL3-20080115], a SMIL file is composed of two separate sections: a head and a body. The head, among all the needed information, also includes details about the allowed layouts for a multimedia presentation. Considering the amount of media that might have been involved in a single conference, properly constructing such a section definitely makes much sense. In fact, all the involved media need to be placed in order not to prevent access to other concurrent media within the context of the same recording.
For instance, this is how a series of different media might be placed in a layout according to different screen resolutions:
<?xml version="1.0" encoding="UTF-8"?>
<smil xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <head>
    <switch systemScreenSize="800X600">
      <layout>
        <root-layout width="800" height="600" background-color="black"/>
        <region id="image0" regionName="image" fit="fill" top="310" \
                left="370" width="400" height="350" />
        <region id="video0" regionName="video" top="0" left="370" \
                width="430" height="310" fit="fill" />
        <region id="chat0" regionName="chat" fit="fill" alt="chat" \
                top="410" left="370" width="400" height="-60"/>
        <region id="wb0" regionName="wb" top="0" left="0" width="370" \
                height="520"/>
      </layout>
    </switch>
    <switch systemScreenSize="1024X768">
      <layout>
        <root-layout width="1024" height="768" \
                     background-color="black"/>
        <region id="image1" regionName="image" fit="fill" top="310" \
                left="594" width="400" height="350"/>
        <region id="video1" regionName="video" top="0" left="594" \
                width="430" height="310" fit="fill"/>
        <region id="chat1" regionName="chat" fit="fill" alt="chat" \
                top="578" left="594" width="400" height="108"/>
        <region id="wb1" regionName="wb" top="0" left="0" width="594" \
                height="688"/>
      </layout>
    </switch>
    [..]
That said, it's important that this section of the SMIL file be constructed properly. In fact, the layout description also contains explicit region identifiers, which are referred to when describing media in the body section.
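Since the body refers to regions declared in the head, a tagging tool might want to verify that every reference can be resolved before publishing the metadata. The following is a minimal sketch of such a check; the function name and the sample document are hypothetical, and we assume a reference may match either the 'id' or the 'regionName' of a region.

```python
import xml.etree.ElementTree as ET


def undefined_regions(smil_text):
    """Return region names used in the body but never declared in the head."""
    root = ET.fromstring(smil_text)
    declared = set()
    for region in root.iter("region"):
        # A medium may refer to a region by its id or by its regionName.
        for key in ("id", "regionName"):
            if region.get(key):
                declared.add(region.get(key))
    body = root.find("body")
    if body is None:
        return []
    used = {el.get("region") for el in body.iter() if el.get("region")}
    return sorted(used - declared)


SAMPLE = """<smil>
  <head><layout>
    <region id="video0" regionName="video" top="0" left="370"/>
  </layout></head>
  <body><par>
    <video src="http://www.example.com/conference45.avi" region="video"/>
    <img src="http://www.example.com/f44gf/1.jpg" region="image"/>
  </par></body>
</smil>"""

print(undefined_regions(SAMPLE))   # the "image" region is not declared
```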
TBD. (?)
The SMIL head section described previously is very important for presentation-related settings, but does not contain any timing-related information. Such information, in fact, belongs to a separate section of the SMIL file, the so-called body. The body contains information on all the media involved in the recorded session, and timing information is provided for each medium. This timing information includes not only when each medium appears and when it goes away, but also details on its lifetime. By correlating the timing information for each medium, a SMIL reader can infer inter-media synchronization and present the recorded session as it was conceived to appear.
Besides, the involved media can be grouped in the body in order to implement sequential and/or parallel playback involving a subset of the available media. This is made possible by making use of the <seq> and <par> elements. The <par> element in particular is of great interest to this document, since in a multimedia conference many media are presented to participants at the same time.
That said, it is important to be able to separately address each involved medium. To do so, SMIL makes use of well-specified elements. For instance, a <video> element is used to state the presence of a video stream in the session. Each of these elements can be further customized and configured by means of ad-hoc attributes. For instance, the 'src' attribute in a <video> element means that the actual video stream source can be found at the provided address.
The element for each medium is also the place where SMIL specifies when the addressed medium comes into play. This is done by means of two attributes called 'begin' and 'end' respectively. As the names themselves suggest, the 'begin' attribute gives a temporal reference for the start of the medium, while the 'end' attribute specifies when the medium ends. For instance, an element formatted in the following way:
<video src="http://www.example.com/conference45.avi" region="box12" \
       begin="15s" end="400s"/>
means that a video stream (whose URL is provided in 'src') must start playing 15 seconds after the beginning of the session, and that it must end at 400 seconds, that is, 385 seconds later. This information is also used when seeking through a session. For instance, if a user accessing the recording seeks to 200 seconds after the beginning, the video will be shown at its relative time of 200-15=185 seconds.
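The arithmetic a player performs when seeking can be sketched as below. The function name is hypothetical, and we assume that a medium is simply not rendered when the seek position falls outside its 'begin'/'end' range.

```python
def media_position(begin, end, seek):
    """Map a session-relative seek time to a position within a medium.

    begin, end: values of the 'begin' and 'end' attributes, in seconds.
    seek:       seconds since the beginning of the session.
    Returns the offset inside the medium, or None if it is not active.
    """
    if seek < begin or seek >= end:
        return None          # the medium is not on screen at this time
    return seek - begin


print(media_position(15, 400, 200))   # 185
print(media_position(15, 400, 10))    # None: the video has not started yet
```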
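Considering the recorded media presented in Section 4, the construction of the following sections of the body will be described: audio and video (Section 5.2.1), text chats (Section 5.2.2), slide presentations (Section 5.2.3) and whiteboards (Section 5.2.4).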
In SMIL, the element to describe an audio stream is <audio>, while for video the element is <video>. Considering that these two stream types are handled in a very similar way, only video will be addressed. This is an explicit choice for two reasons: (i) video is slightly more complex to address than audio, and so treating video makes more sense; (ii) often off-line encoders/muxers will place the recorded elementary audio and video streams in a single video container, which means both streams can actually be addressed in a single media file.
That said, <video> is the element used in a SMIL body to state the presence of an audio/video stream. Its timing, relative to other media, might be implemented by making use of a <par>/<seq> aggregator. In such an element, some attributes are of great relevance and should be included: 'src' (the address of the recorded stream), 'region' (the layout region in which the stream is rendered), 'begin' (when the stream starts) and 'end' (when the stream ends).
All this information can easily be derived from the stream as recorded previously (optionally re-encoded and/or re-muxed), together with the timing information that is part of the event log. The 'src', in particular, can be any video file, which means that an encoding of the stream for a player is quite trivial to achieve.
Besides, as mentioned in Section 4.1, recordings may be available as already mixed streams, or individual streams. In case the recording is already mixed, then the tagging can be done as seen in the previous paragraph:
<video src="http://www.example.com/conference45.avi" region="box12" \
       begin="15s" end="400s"/>
where this element would state the presence of an audio/video stream, to appear in the specified region in the specified range of time. In case several recordings are available, instead, the tagging would be a little more complex: in fact, the metadata would need to address the parallel playback of the different recordings, which would also need to reflect the actual lifetime of the original streams in the conference. For instance, if UAC A joined the conference much before UAC B, its contributions would appear in the playout accordingly. An example of how this could be achieved in a SMIL metadata is presented here:
<par>
  [..]
  <video src="http://www.example.com/userA.avi" region="box12" \
         begin="15s" end="400s"/>
  <video src="http://www.example.com/userB.avi" region="box16" \
         begin="230s" end="521s"/>
  [..]
</par>
These lines tell an interested player that the two specified video streams (whose URLs are provided in the respective 'src' attributes) must be played in parallel, and in different regions. However, video stream 'userA.avi' starts 15 seconds after the beginning of the conference, while 'userB.avi' starts after 230 seconds, reflecting the appearance of these media in the conference itself.
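A tagging tool could generate such a <par> block directly from the list of recordings and their lifetimes. The sketch below relies on Python's standard `xml.etree.ElementTree` module; the function name `build_par` and the tuple layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_par(recordings):
    """Build a <par> element with one <video> child per recording.

    recordings: iterable of (src, region, begin, end), times in seconds.
    """
    par = ET.Element("par")
    for src, region, begin, end in recordings:
        ET.SubElement(par, "video", {
            "src": src,
            "region": region,
            "begin": f"{begin}s",   # session-relative start of the stream
            "end": f"{end}s",       # session-relative end of the stream
        })
    return par


par = build_par([
    ("http://www.example.com/userA.avi", "box12", 15, 400),
    ("http://www.example.com/userB.avi", "box16", 230, 521),
])
print(ET.tostring(par, encoding="unicode"))
```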
Text in SMIL can be addressed in several different ways, the most common ones being <text> and <textstream> elements. <text>, however, usually deals only with static text content, that is text without timing information (e.g. HTML). For this reason, <textstream> should be used instead, since it allows text to appear and disappear in real-time.
The attributes to configure the element are basically the same as those presented for <video> (src, region, begin, end). The difference lies in the file referred to in the 'src' attribute. In fact, if timing information is needed, a proper format for timed text is needed. The <textstream> element supports RealText Markup, which is a separate markup language for dealing with real-time text. It is the format used, for instance, for subtitle captioning. An example of RealText is presented in the following lines:
<window width="340" height="160" wordwrap="true" loop="false" \
        bgcolor="white">
  <font color="black" face="Arial" size="+0">
    <Time begin="0:00:02.2"/><br/><User C>Hi
    <Time begin="0:00:04.5"/><br/><User A>Hey C
    <Time begin="0:00:08.1"/><br/><User B>Hey man
    [..]
This example recalls Figure 5, where the first message (by User C) was sent at 10.11.24. Assuming the text conference started at 10.11.22, the log is converted to RealText and tagged accordingly (e.g. User C sending his first message two seconds after the conference started). The RealText file can then be addressed in SMIL using the aforementioned <textstream> element:
<par>
  [..]
  <textstream src="http://example.com/chats/conf45.rt" region="chat" \
              begin="0s" end="500s"/>
  [..]
</par>
Once the requirement on the file format is established, the next step is obvious. Whatever format the chat in the conference has been recorded in, it needs to be converted to a RealText file in order to have it addressed in the resulting SMIL file. The conversion is usually trivial to achieve, considering that chat logs often carry the same information needed in a RealText file, and differ only in the presentation format.
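The conversion might look like the following sketch, which turns log entries in the style of Figure 5 into timed RealText lines. The function names are hypothetical. We assume whole-second timestamps in the "hh.mm.ss" form used in the figure, and, unlike the listing above, we escape the angle brackets around the sender name so that it is not mistaken for markup.

```python
from xml.sax.saxutils import escape


def to_seconds(stamp):
    """Convert an 'hh.mm.ss' timestamp, as used in the chat log, to seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split("."))
    return hours * 3600 + minutes * 60 + seconds


def realtext_lines(log, conference_start):
    """log: list of (timestamp, sender, text) entries, sorted by time."""
    base = to_seconds(conference_start)
    lines = []
    for stamp, sender, text in log:
        offset = to_seconds(stamp) - base        # seconds into the conference
        hours, rest = divmod(offset, 3600)
        minutes, seconds = divmod(rest, 60)
        lines.append(
            f'<Time begin="{hours}:{minutes:02d}:{seconds:02d}.0"/><br/>'
            + escape(f"<{sender}> {text}")
        )
    return lines


chat_log = [
    ("10.11.24", "User C", "Hi!"),
    ("10.11.26", "User A", "Hey C"),
    ("10.11.30", "User B", "Hey man"),
]
for line in realtext_lines(chat_log, conference_start="10.11.22"):
    print(line)
```

The enclosing <window> and <font> elements are omitted here; they only depend on the desired presentation.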
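The easiest way to deal with a slideshow and/or a shared slide presentation is to make use of the <img> element. In fact, as anticipated in Section 4.3, slides in a presentation are most often composed of static content, and can be treated as images. This means that addressing a complete presentation in a SMIL file can be achieved by following these steps: each slide shown in the conference is addressed by its own <img> element, whose 'src' attribute points to the slide image; the 'begin' and 'end' attributes of each element are taken from the recorded slide change events; and all the elements refer to the region reserved for the presentation.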
An example of this, recalling the scenario depicted in Figure 6, is presented here:
<par>
  [..]
  <img src="http://www.example.com/f44gf/1.jpg" region="image" \
       begin="0s" end="10s"/>
  <img src="http://www.example.com/f44gf/2.jpg" region="image" \
       begin="10s" end="18s"/>
  <img src="http://www.example.com/f44gf/3.jpg" region="image" \
       begin="18s" end="30s"/>
  [..]
</par>
The slideshow would usually be a sequence, and so a <seq> would seem the more apt way to address the presentation sharing. Nevertheless, timing information is very important, and it's quite likely that several additional media will flow in parallel with the slides (e.g. the video stream which includes the presenter talking). That's why a <par> element is used instead; for brevity, the example omits the other media involved.
As anticipated in Section 4.4, no standard solution is usually deployed for whiteboarding in a conferencing system. For this reason, the recording process suggested in Section 4.4 is just a timing-aware dump of all the interactions that occurred at the whiteboard level. These interactions might subsequently be converted to a more common format such as, for instance, a video or an image slideshow. In case of a video, the same considerations of Section 5.2.1 would apply, since the whiteboard recording would actually be a video itself. In case it is converted to a slideshow, the tagging process would occur as explained in Section 5.2.3.
However, SMIL also allows for custom, non-standard media to be involved in its metadata. This can be achieved by means of the standard element <ref>, which is a generic media reference. This element allows for the description and addressing of non-standard media (or at least media the chosen SMIL specification is not aware of), which could be implemented in a custom player. This means that, if a whiteboard has been recorded in a proprietary way, and this format needs to be preserved for one reason or another, the <ref> element may be used to address it: in fact, the same attributes previously introduced (including 'src' and the others) are available to this element as well. Of course, if this approach is used, only a player able to understand the proprietary media extension would be able to replay the recorded whiteboard session. To make the player aware of the format employed, a 'type' attribute could be added as well.
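The listing above could be produced mechanically from the recorded slide intervals, as in the following sketch. The function name and the tuple layout are hypothetical, and the URL pattern (one JPEG per slide number under a per-presentation path) is only assumed from the examples in this document.

```python
import xml.etree.ElementTree as ET


def slides_to_par(base_url, intervals, region="image"):
    """Build a <par> element holding one <img> child per displayed slide.

    intervals: list of (slide_number, begin, end), times in seconds.
    """
    par = ET.Element("par")
    for number, begin, end in intervals:
        ET.SubElement(par, "img", {
            "src": f"{base_url}/{number}.jpg",
            "region": region,
            # Explicit times keep each slide aligned with the parallel media.
            "begin": f"{begin}s",
            "end": f"{end}s",
        })
    return par


slides = slides_to_par(
    "http://www.example.com/f44gf",
    [(1, 0, 10), (2, 10, 18), (3, 18, 30)],
)
print(ET.tostring(slides, encoding="unicode"))
```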
An example of how the recorded whiteboard might be addressed is provided here:
<par>
  [..]
  <ref src="http://example.com/wb/wb12.txt" region="wb" \
       type="myFormat"/>
  [..]
</par>
Once the SMIL metadata has been properly prepared, a playout of the recorded conference is not difficult to achieve. In fact, an interested user just needs a SMIL-aware player supporting the several file formats involved, namely: (i) audio/video; (ii) images; (iii) RealText; (iv) the whiteboarding session, in whatever format it has been recorded. Considering the standard nature of SMIL and of almost all the media involved, the session is likely to be easily accessible to many existing players. The 'type' attribute of the involved media can be used by a player to check whether it supports the related media.
Additional information provided in the SMIL head (e.g. the <switch> elements and the <layout> they suggest) provides guidance to players for presenting the addressed media in the expected way.
The sequence an interested user needs to follow in order to access a recorded conference session can be summarized in the following simplified steps: the user passes the address of the SMIL description to a player; the player retrieves the SMIL description; the player retrieves the individual media resources addressed there; the player presents the recorded session to the user.
A general overview of the scenario can be seen in Figure 16.
+------+  1. START   +----------+                          +----------+
| User |------------>|   User   |------------------------->| Sessions |
|      |<------------| (player) |    2. get conf45.smil    | database |
+------+   6. SHOW   +----------+                          +----------+
                       |  |  |
                       |  |  |
                       |  |  | 3. get audios and videos    +-----------+
                       |  |  +---------------------------->| WebServer |
                       |  |                                |  (video)  |
                       |  |    4. get RealText files       +-----------+
                       |  +------------------------------->|  (text)   |
                       |       5. get slide images         +-----------+
                       +---------------------------------->| (images)  |
                                                           +-----------+
In this quite oversimplified scenario, an interested viewer chooses to start viewing a previously recorded conference. She/he knows the address to the recorded session (http://example.com/conf45.smil) and passes it to her/his player (1.). Starting the playout triggers the retrieval of the SMIL description (2.), which may be achieved by means of HTTP or any other protocol. Once the player has access to the description, it starts retrieving the individual media resources addressed there (video in 3., chat in 4., slides in 5.), and, according to the implementation of the player, it either waits for all the downloads to complete or just buffers a little while before starting the presentation to the user (6.).
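The part of the player that decides what to retrieve (steps 3 to 5) can be sketched as follows. This is not a SMIL player: the sketch only extracts, in document order, the media resources addressed in the body. The function name, the set of recognised elements and the sample description are hypothetical.

```python
import xml.etree.ElementTree as ET

# Media elements discussed in this document.
MEDIA_TAGS = {"audio", "video", "textstream", "img", "ref"}


def retrieval_plan(smil_text):
    """List the (element, src) pairs a player would need to fetch."""
    root = ET.fromstring(smil_text)
    body = root.find("body")
    if body is None:
        return []
    plan = []
    for el in body.iter():
        src = el.get("src")
        # Each resource is fetched once, even if addressed several times.
        if el.tag in MEDIA_TAGS and src and (el.tag, src) not in plan:
            plan.append((el.tag, src))
    return plan


CONF45 = """<smil><head/><body><par>
  <video src="http://www.example.com/conference45.avi" region="video"
         begin="15s" end="400s"/>
  <textstream src="http://example.com/chats/conf45.rt" region="chat"
              begin="0s" end="500s"/>
  <img src="http://www.example.com/f44gf/1.jpg" region="image"
       begin="0s" end="10s"/>
</par></body></smil>"""

for tag, src in retrieval_plan(CONF45):
    print(tag, src)
```

Whether the resources are then downloaded completely or just buffered before the presentation starts is left to the player implementation, as noted above.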
TBD.
The authors would like to thank...