CLUE                                                           J. Lennox
Internet-Draft                                                     Vidyo
Intended status: Standards Track                                P. Witty
Expires: December 3, 2012
                                                              A. Romanow
                                                           Cisco Systems
                                                            June 1, 2012


   Real-Time Transport Protocol (RTP) Usage for Telepresence Sessions
                     draft-lennox-clue-rtp-usage-04

Abstract

   This document describes mechanisms and recommended practice for
   transmitting the media streams of telepresence sessions using the
   Real-Time Transport Protocol (RTP).

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 3, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as


Lennox, et al.          Expires December 3, 2012                [Page 1]

Internet-Draft         RTP Usage for Telepresence              June 2012


   described in the Simplified BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  RTP requirements for CLUE  . . . . . . . . . . . . . . . . . .  3
   4.  RTCP requirements for CLUE . . . . . . . . . . . . . . . . . .  5
   5.  Multiplexing multiple streams or multiple sessions?  . . . . .  6
   6.  Use of multiple transport flows  . . . . . . . . . . . . . . .  6
   7.  Use Cases  . . . . . . . . . . . . . . . . . . . . . . . . . .  7
   8.  Other implementation constraints . . . . . . . . . . . . . . .  9
   9.  Requirements of a solution . . . . . . . . . . . . . . . . . .  9
   10. Mapping streams to requested captures  . . . . . . . . . . . . 11
     10.1.  Sending SSRC to capture ID mapping outside the media
            stream  . . . . . . . . . . . . . . . . . . . . . . . . . 11
     10.2.  Sending capture IDs in the media stream . . . . . . . . . 12
       10.2.1.  Multiplex ID shim . . . . . . . . . . . . . . . . . . 13
       10.2.2.  RTP header extension  . . . . . . . . . . . . . . . . 13
       10.2.3.  Combined approach . . . . . . . . . . . . . . . . . . 14
     10.3.  Recommendations . . . . . . . . . . . . . . . . . . . . . 16
   11. Security Considerations  . . . . . . . . . . . . . . . . . . . 16
   12. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 16
   13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
     13.1.  Normative References  . . . . . . . . . . . . . . . . . . 16
     13.2.  Informative References  . . . . . . . . . . . . . . . . . 17
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18


Lennox, et al.          Expires December 3, 2012                [Page 2]

Internet-Draft         RTP Usage for Telepresence              June 2012


1.  Introduction

   Telepresence systems, of the architecture described by
   [I-D.ietf-clue-telepresence-use-cases] and
   [I-D.ietf-clue-telepresence-requirements], will send and receive
   multiple media streams, where the number of streams in use is
   potentially large and asymmetric between endpoints, and streams can
   come and go dynamically.  These characteristics lead to a number of
   architectural design choices which, while still in the scope of
   potential architectures envisioned by the Real-Time Transport
   Protocol [RFC3550], must be fairly different than those typically
   implemented by the current generation of voice or video conferencing
   systems.

   Furthermore, captures, as defined by the CLUE Framework
   [I-D.ietf-clue-framework], are a somewhat different concept than
   RTP's concept of media streams, so there is a need to communicate the
   associations between them.

   This document makes recommendations, for this telepresence
   architecture, about how streams should be encoded and transmitted in
   RTP, and how their relation to captures should be communicated.


2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119] and
   indicate requirement levels for compliant implementations.


3.  RTP requirements for CLUE

   CLUE will permit a SIP call to include multiple media streams: easily
   dozens at a time (given, e.g., a continuous presence screen in a
   multi-point conference), potentially out of a possible pool of
   hundreds.  Furthermore, endpoints will have an asymmetric number of
   media streams.

   Two main backwards compatibility issues exist: firstly, on an initial
   SIP offer we can not be sure that the far end will support CLUE, and
   therefore a CLUE endpoint must not offer a selection of RTP sessions
   which would confuse a CLUEless endpoint.  Secondly, there exist many
   SIP devices in the network through which calls may be routed; even if
   we know that the far end supports CLUE, re-offering with a larger
   selection of RTP sessions may fall foul of one of these middle boxes.


Lennox, et al.          Expires December 3, 2012                [Page 3]

Internet-Draft         RTP Usage for Telepresence              June 2012


   We also desire to simplify NAT and firewall traversal by allowing
   endpoints to deal with only a single static address/port mapping per
   media type rather than multiple mappings which change dynamically
   over the duration of the call.

   A SIP call in common usage today will typically offer one or two
   video RTP sessions (one for presentation, one for main video), and
   one audio session.  Each of these RTP sessions will be used to send
   either zero or one media streams in either direction, with the
   presence of these streams negotiated in the SDP (offering a
   particular session as send only, receive only, or send and receive),
   and through BFCP (for presentation video).

   In a CLUE environment this model -- sending zero or one source (in
   each direction) per RTP session -- doesn't scale as discussed above,
   and mapping asymmetric numbers of sources to sessions is needlessly
   complex.

   Therefore, telepresence systems SHOULD use a single RTP session per
   media type, as shown in Figure 1, except where there's a need to give
   sessions different transport treatment.  All sources of the same
   media type, although from distinct captures, are sent over this
   single RTP session.


   Camera 1   -.__                                        _,'Screen 1
                  `--._   , =-----------...........     ,'
                       `'+.._`\ _________________ _\,'
                        /    '|        RTP          |
   Camera 2 ------------+----,''''''''''''''''''''':-------- Screen 2
                        \  _ ----------------------.'.
                     _,.-''-----------------------,/  `-._
                _,.-'                                     `.. Screen 3
   Camera 3  ,-'                                             `


    Figure 1: Multiplexing multiple media streams into one RTP session

   During call setup, a single RTP session is negotiated for each media
   type.  In SDP, only one media line is negotiated per media and
   multiple media streams are sent over the same UDP channel negotiated
   using the SDP media line.

   A number of protocol issues involved in multiplexing RTP streams into
   a single session are discussed in
   [I-D.westerlund-avtcore-multiplex-architecture] and
   [I-D.lennox-rtcweb-rtp-media-type-mux].  In the rest of this document
   we concentrate on examining the mapping of RTP streams to requested


Lennox, et al.          Expires December 3, 2012                [Page 4]

Internet-Draft         RTP Usage for Telepresence              June 2012


   CLUE captures in the specific context of telepresence systems.

   The CLUE architecture requires more than simply source multiplexing,
   as defined by [RFC3550].  The key issue is how a receiver interprets
   the multiplexed streams it receives, and correlates them with the
   captures it has requested.  In some cases, the CLUE Framework
   [I-D.ietf-clue-framework]'s concept of the "capture" maps cleanly to
   the RTP concept of an SSRC, but in many cases it does not.

   First we will consider the cases that need to be considered.  We will
   then examine the two most obvious approaches to mapping streams for
   captures, showing their pros and cons.  We then describe a third
   possible alternative.


4.  RTCP requirements for CLUE

   When sending media streams, we are also required to send
   corresponding RTCP information.  However, while a unidirectional RTP
   stream (as identified by a single SSRC) will contain a single stream
   of media, the associated RTCP stream will include sender information
   about the stream, but will also include feedback for streams sent in
   the opposite direction.  On a simple point-to-point case, it may be
   possible to naively forward on RTCP in a similar manner to RTP, but
   in more complicated use cases where multipoint devices are switching
   streams to multiple receivers, this simple approach is insufficient.

   As an example, receiver report messages are sent with the source SSRC
   of a single media stream sent in the same direction as the RTCP, but
   contain within the message zero or more receiver report blocks for
   streams sent in the other direction.  Forwarding on the receiver
   report packets to the same endpoints which are receiving the media
   stream tagged with that SSRC will provide no useful information to
   endpoints receiving the messages, and does not guarantee that the
   reports will ever reach the origin of the media streams on which they
   are reporting.

   CLUE therefore requires devices to more intelligently deal with
   received RTCP messages, which will require full packet inspection,
   including SRTCP decryption.  The low rate of RTCP transmission/
   reception makes this feasible to do.

   RTCP also carries information to establish clock synchronization
   between multiple RTP streams.  For CLUE, this information will be
   crucial, not only for traditional lip-sync between video and audio,
   but also for synchronized playout of multiple video streams from the
   same room.  This information needs to be provided even in the case of
   switched captures, to provide clock synchronization for sources that


Lennox, et al.          Expires December 3, 2012                [Page 5]

Internet-Draft         RTP Usage for Telepresence              June 2012


   are temporarily being shown for a switched capture.


5.  Multiplexing multiple streams or multiple sessions?

   It may not be immediately obvious whether this problem is best
   described as multiplexing multiple RTP sessions onto a single
   transport layer, or as multiplexing multiple media streams onto a
   single RTP session.  Certainly, the different captures represent
   independent purposes for the media that is sent; however, as any
   stream may be switched into any of the multiplexed captures, we
   maintain the requirement that all media streams within a CLUE call
   must have a unique SSRC -- this is also a requirement for the above
   use of RTCP.

   Because of this, CLUE's use of RTP can best be described as
   multiplexing multiple streams onto one RTP session, but with
   additional data about the streams to identify their intended
   destinations.  A solution to perform this multiplexing may also be
   sufficient to multiplex multiple RTP sessions onto one transport
   session, but this is not a requirement.


6.  Use of multiple transport flows

   Most existing videoconferencing systems use separate RTP sessions for
   main and presentation video sources, distinguished by the SDP content
   attribute [RFC4796].  The use of the CLUE telepresence framework
   [I-D.ietf-clue-framework] to describe multiplexed streams can remove
   the need to establish separate RTP sessions (and transport flows) for
   these sessions, as the relevant information can be provided by CLUE
   messaging instead.

   However, it can still be useful in many cases to establish multiple
   RTP sessions (and transport flows) for a single CLUE session.  Two
   clear cases would be for disaggregated media (where media is being
   sent to devices with different transport addresses), or scenarios
   where different sources should get different quality-of-service
   treatment.  To support such scenarios, the use of multiple RTP
   sessions, with SDP m lines with different transport addresses, would
   be necessary.

   To support this case, CLUE messaging needs to be able to indicate the
   RTP session in which a requested capture is intended to be received.


Lennox, et al.          Expires December 3, 2012                [Page 6]

Internet-Draft         RTP Usage for Telepresence              June 2012


7.  Use Cases

   There are three distinct use cases relevant for telepresence systems:
   static stream choice, dynamically changing streams chosen from a
   finite set, and dynamic changing streams chosen from an unbounded
   set.

   Static stream choice:

   In this case, the streams sent over the multiplex are constant over
   the complete session.  An example is a triple-camera system to MCU in
   which left, center and right streams are sent for the duration of the
   session.

   This describes an endpoint to endpoint, endpoint to multipoint
   device, and equivalently a transcoding multipoint device to endpoint.

   This is illustrated in Figure 2.


               ,'''''''''''|                           +-----------Y
               |           |                           |           |
               | +--------+|"""""""""""""""""""""""""""|+--------+ |
               | |EndPoint||---------------------------||EndPoint| |
               | +--------+|"""""""""""""""""""""""""""|+--------+ |
               |           |                           |           |
               "-----------'                           "------------


                  Figure 2: Point to Point Static Streams

   Dynamic streams from a finite set:

   In this case, the receiver has requested a smaller number of streams
   than the number of media sources that are available, and expects the
   sender to switch the sources being sent based on criteria chosen by
   the sender.  (This is called auto-switched in the CLUE Framework
   [I-D.ietf-clue-framework].)

   An example is a triple-camera system to two-screen system, in which
   the sender needs to switch either LC -> LR, or CR -> LR.  (Note in
   particular, in this example, that the center camera stream could be
   sent as either the left or the right auto-switched capture.)

   This describes an endpoint to endpoint, endpoint to multipoint
   device, and a transcoding device to endpoint.

   This is illustrated in Figure 3.


Lennox, et al.          Expires December 3, 2012                [Page 7]

Internet-Draft         RTP Usage for Telepresence              June 2012


               ,'''''''''''|                           +-----------Y
               |           |                           |+--------+ |
               | +--------+|"""""""""""""""""""""""""""||EndPoint| |
               | |EndPoint||                           |+--------+_|
               | +--------+''''''''''                   '''''''''''
               |           |........
               "-----------'


              Figure 3: Point to Point Finite Source Streams

   Dynamic streams from an unbounded set:

   This case describes a switched multipoint device to endpoint, in
   which the multipoint device can choose to send any streams received
   from any other endpoints within the conference to the endpoint.

   For example, in an MCU to triple-screen system, the MCU could send
   e.g.  LCR of a triple-camera system -> LCR, or CCC of three single-
   camera endpoints -> LCR.

   This is illustrated in Figure 4.


              +-+--+--+
              | |EP|  `-.
              | +--+  |`.`-.
              +-------`. `. `.
                        `-.`. `-.
                           `.`-. `-.
                             `-.`.  `-.-------+              +------+
              +--+--+---+       `.`.|  +---+  ---------------| +--+ |
              |  |EP|   +----.....:=.  |MCU|  ...............| |EP| |
              |  +--+   |"""""""""--|  +---+  |______________| +--+ |
              +---------+"""""""""";'.'.'.'---+              +------+
                                 .'.'.'.'
                               .'.'.'.'
                              / /.'.'
                            .'.::-'
               +--+--+--+ .'.::'
               |  |EP|  .'.::'
               |  +--+  .::'
               +--------.'


                  Figure 4: Multipoint Unbounded Streams

   Within any of these cases, every stream within the multiplexed


Lennox, et al.          Expires December 3, 2012                [Page 8]

Internet-Draft         RTP Usage for Telepresence              June 2012


   session MUST have a unique SSRC.  The SSRC is chosen at random
   [RFC3550] to ensure uniqueness (within the conference), and contains
   no meaningful information.

   Any source may choose to restart a stream at any time, resulting in a
   new SSRC.  For example, a transcoding MCU might, for reasons of load
   balancing, transfer an encoder onto a different DSP, and throw away
   all context of the encoding at this state, sending an RTCP BYE
   message for the old SSRC, and picking a new SSRC for the stream when
   started on the new DSP.

   Because of this possibility of changing the SSRC at any time, all our
   use cases can be considered as simplifications of the third and most
   difficult case, that of dynamic streams from an unbounded set.  Thus,
   this is the primary case we will consider.


8.  Other implementation constraints

   To cope with receivers with limited decoding resources, for example a
   hardware based telepresence endpoint with a fixed number of decoding
   modules, each capable of handling only a single stream, it is
   particularly important to ensure that the number of streams which the
   transmitter is expecting the receiver to decode never exceeds the
   maximum number the receiver has requested.  In this case the receiver
   will be forced to drop some of the received streams, causing a poor
   user experience, and potentially higher bandwidth usage, should it be
   required to retransmit I-frames.

   On a change of stream, such a receiver can be expected to have a one-
   out, one-in policy, so that the decoder of the stream currently being
   received on a given capture is stopped before starting the decoder
   for the stream replacing it.  The sender MUST therefore indicate to
   the receiver which stream will be replaced upon a stream change.


9.  Requirements of a solution

   This section lists, more briefly, the requirements a media
   architecture for Clue telepresence needs to achieve, summarizing the
   discussion of previous sections.  In this section, RFC 2119 language
   refers to requirements on a solution, not an implementation; thus,
   requirements keywords are not written in capital letters.


Lennox, et al.          Expires December 3, 2012                [Page 9]

Internet-Draft         RTP Usage for Telepresence              June 2012


   Media-1:  It must not be necessary for a Clue session to use more
      than a single transport flow for transport of a given media type
      (video or audio).
   Media-2:  It must, however, be possible for a Clue session to use
      multiple transport flows for a given media type where it is
      considered valuable (for example, for distributed media, or
      differential quality-of-service).
   Media-3:  It must be possible for a Clue endpoint or MCU to
      simultaneously send sources corresponding to static, to
      composited, and to switched captures, in the same transport flow.
      (Any given device might not necessarily be able send all of these
      source types; but for those that can, it must be possible for them
      to be sent simultaneously.)
   Media-4:  It must be possible for an original source to move among
      switched captures (i.e. at one time be sent for one switched
      capture, and at a later time be sent for another one).
   Media-5:  It must be possible for a source to be placed into a
      switched capture even if the source is a "late joiner", i.e. was
      added to the conference after the receiver requested the switched
      source.
   Media-6:  Whenever a given source is assigned to a switched capture,
      it must be immediately possible for a receiver to determine the
      switched capture it corresponds to, and thus that any previous
      source is no longer being mapped to that switched capture.
   Media-7:  It must be possible for a receiver to identify the actual
      source that is currently being mapped to a switched capture, and
      correlate it with out-of-band (non-Clue) information such as
      rosters.
   Media-8:  It must be possible for a source to move among switched
      captures without requiring a refresh of decoder state (e.g., for
      video, a fresh I-frame), when this is unnecessary.  However, it
      must also be possible for a receiver to indicate when a refresh of
      decoder state is in fact necessary.
   Media-9:  If a given source is being sent on the same transport flow
      for more than one reason (e.g. if it corresponds to more than one
      switched capture at once, or to a static capture), it should be
      possible for a sender to send only one copy of the source.
   Media-10:  On the network, media flows should, as much as possible,
      look and behave like currently-defined usages of existing
      protocols; established semantics of existing protocols must not be
      redefined.
   Media-11:  The solution should seek to minimize the processing burden
      for boxes that distribute media to decoding hardware.
   Media-12:  If multiple sources from a single synchronization context
      are being sent simultaneously, it must be possible for a receiver
      to associate and synchronize them properly, even for sources that
      are are mapped to switched captures.


Lennox, et al.          Expires December 3, 2012               [Page 10]

Internet-Draft         RTP Usage for Telepresence              June 2012


10.  Mapping streams to requested captures

   The goal of any scheme is to allow the receiver to match the received
   streams to the requested captures.  As discussed in Section 7, during
   the lifetime of the transmission of one capture, we may see one or
   multiple media streams which belong to this capture, and during the
   lifetime of one media stream, it may be assigned to one or more
   captures.

   Topologically, the requirements in Section 9 are best addressed by
   implementing static and a switched captures with an RTP Media
   Translator, i.e. the topology that RTP Topologies [RFC5117] defines
   as Topo-Media-Translator.  (A composited capture would be the
   topology described by Topo-Mixer; an MCU can easily produce either or
   both as appropriate, simultaneously.).  The MCU selectively forwards
   certain sources, corresponding to those sources which it currently
   assigns to the requested switched captures.

   Demultiplexing of streams is done by SSRC; each stream is known to
   have a unique SSRC.  However, this SSRC contains no information about
   capture IDs.  There are two obvious choices for providing the mapping
   from SSRC to captures: sending the mapping outside of the media
   stream, or tagging media packets with the capture ID.  (There may be
   other choices, e.g., payload type number, which might be appropriate
   for multiplexing one audio with one video stream on the same RTP
   session, but this not relevant for the cases discussed here.)

   (An alternative architecture would be to map all captures directly to
   SSRCs, and then to use a Topo-Mixer topology to represent switched
   captures as a "mixed" source with a single contributing CSRC.
   However, such an architecture would not be able to satisfy the
   requirements Media-8, Media-9, or Media-12 described in Section 9,
   without substantial changes to the semantics of RTP.)

10.1.  Sending SSRC to capture ID mapping outside the media stream

   Every RTP packet includes an SSRC, which can be used to demultiplex
   the streams.  However, although the SSRC uniquely identifies a
   stream, it does not indicate which of the requested captures that
   stream is tied to.  If more than one capture is requested, a mapping
   from SSRC to capture ID is therefore required so that the media
   receiver can treat each received stream correctly.

   As described above, the receiver may need to know in advance of
   receiving the media stream how to allocate its decoding resources.
   Although implementations MAY cache incoming media received before
   knowing which multiplexed stream it applies to, this is optional, and
   other implementations may choose to discard media, potentially


Lennox, et al.          Expires December 3, 2012               [Page 11]

Internet-Draft         RTP Usage for Telepresence              June 2012


   requiring an expensive state refresh, such as an Full Intra Request
   (FIR) [RFC5104].

   In addition, a receiver will have to store lookup tables of SSRCs to
   stream IDs/decoders etc.  Because of the large SSRC space (32 bits),
   this will have to be in the form of something like a hash map, and a
   lookup will have to be performed for every incoming packet, which may
   prove costly for e.g.  MCUs processing large numbers of incoming
   streams.

   Consider the choices for where to put the mapping from SSRC to
   capture ID.  This mapping could be sent in the CLUE messaging.  The
   use of a reliable transport means that it can be sure that the
   mapping will not be lost, but if this reliability is achieved through
   retransmission, the time taken for the mapping to reach all receivers
   (particularly in a very large scale conference, e.g., with thousands
   of users) could result in very poor switching times, providing a bad
   user experience.

   A second option for sending the mapping is in RTCP, for instance as a
   new SDES item.  This is likely to follow the same path as media, and
   therefore if the mapping data is sent slightly in advance of the
   media, it can be expected to be received in advance of the media.
   However, because RTCP is lossy and, due to its timing rules, cannot
   always be sent immediately, the mapping may not be received for some
   time, resulting in the receiver of the media not knowing how to route
   the received media.  A system of acks and retransmissions could
   mitigate this, but this results in the same high switching latency
   behaviour as discussed for using CLUE as a transport for the mapping.


       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  CaptureID=9  |   length=4    |           Capture ID          :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      :                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


            Figure 5: SDES item for encoding of the Capture ID

10.2.  Sending capture IDs in the media stream

   The second option is to tag each media packet with the capture ID.
   This means that a receiver immediately knows how to interpret
   received media, even when an unknown SSRC is seen.  As long as the


Lennox, et al.          Expires December 3, 2012               [Page 12]

Internet-Draft         RTP Usage for Telepresence              June 2012


   media carries a known capture ID, it can be assumed that this media
   stream will replace the stream currently being received with that
   capture ID.

   This gives significant advantages to switching latency, as a switch
   between sources can be achieved without any form of negotiation with
   the receiver.  There is no chance of receiving media without knowing
   to which switched capture it belongs.

   However, the disadvantage in using a capture ID in the stream that it
   introduces additional processing costs for every media packet, as
   capture IDs are scoped only within one hop (i.e., within a cascaded
   conference a capture ID that is used from the source to the first MCU
   is not meaningful between two MCUs, or between an MCU and a
   receiver), and so they may need to be added or modified at every
   stage.

   As capture IDs are chosen by the media sender, by offering a
   particular capture to multiple recipients with the same ID, this
   requires the sender to only produce one version of the stream
   (assuming outgoing payload type numbers match).  This reduces the
   cost in the multicast case, although does not necessarily help in the
   switching case.

   An additional issue with putting capture IDs in the RTP packets comes
   from cases where a non-CLUE aware endpoint is being switched by an
   MCU to a CLUE endpoint.  In this case, we may require up to an
   additional 12 bytes in the RTP header, which may push a media packet
   over the MTU.  However, as the MTU on either side of the switch may
   not match, it is possible that this could happen even without adding
   extra data into the RTP packet.  The 12 additional bytes per packet
   could also be a significant bandwidth increase in the case of very
   low bandwidth audio codecs.

10.2.1.  Multiplex ID shim

   As in draft-westerlund-avtcore-transport-multiplexing

10.2.2.  RTP header extension

   The capture ID could be carried within the RTP header extension
   field, using [RFC5285].  This is negotiated within the SDP i.e.

   a=extmap:1 urn:ietf:params:rtp-hdrex:clue-capture-id

   Packets tagged by the sender with the capture ID will then contain a
   header extension as shown below


Lennox, et al.          Expires December 3, 2012               [Page 13]

Internet-Draft         RTP Usage for Telepresence              June 2012


        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |  ID=1 |  L=3  |                 capture id                    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |  capture id   |
       +-+-+-+-+-+-+-+-+


       Figure 6: RTP header extension for encoding of the capture ID

   To add or modify the capture ID can be an expensive operation,
   particularly if SRTP is used to authenticate the packet.
   Modification to the contents of the RTP header requires a
   reauthentication of the complete packet, and this could prove to be a
   limiting factor in the throughput of a multipoint device.  However,
   it may be that reauthentication is required in any case due to the
   nature of SDP.  SDP permits the receiver to choose payload types,
   meaning that a similar option to modify the payload type in the
   packet header will cause the need to reauthenticate.

10.2.3.  Combined approach

   The two major flaws of the above methods (high latency switching of
   SSRC multiplexing, high computational cost on switching nodes) can be
   mitigated with a combined method.  In this, the multiplex ID can be
   included in packets belonging to the first frame of media (typically
   an IDR/GDR), but following this only the SSRC is used to demultiplex.

10.2.3.1.  Behaviour of receivers

   A receiver of a stream should demultiplex on SSRC if it knows the
   capture ID for the given SSRC, otherwise it should look within the
   packet for the presence of the stream ID.  This has an issue where a
   stream switches from one capture to a second - for example, in the
   second use case described in Section 7, where the transmitter chooses
   to switch the center stream from the receiver's right capture to the
   left capture, and so the receiver will already know an incorrect
   mapping from that stream's SSRC to a capture ID.

   In this case the receiver should, at the RTP level, detect the
   presence of the capture ID and update its SSRC to capture ID map.
   This could potentially have issues where the demultiplexer has now
   sent the packet to the wrong physical device - this could be solved
   by checking for the presence of a capture ID in every packet, but
   this will have speed implications.  If a packet is received where the
   receiver does not already know the mapping between SSRC and capture
   ID, and the packet does not contain a capture ID, the receiver may


Lennox, et al.          Expires December 3, 2012               [Page 14]

Internet-Draft         RTP Usage for Telepresence              June 2012


   discard it, and MUST request a transmission of the capture ID (see
   below).

10.2.3.2.  Choosing when to send capture IDs

   The updated capture ID needs to be known as soon as possible on a
   switch of SSRCs, as the receiver may be unable to allocate resources
   to decode the incoming stream, and may throw away the received
   packets.  It can be assumed that the incoming stream is undecodable
   until the capture ID is received.

   In common video codecs (e.g.  H.264), decoder refresh frames (either
   IDR or GDR) also have this property, in that it is impossible to
   decode any video without first receiving the refresh point.  It
   therefore seems natural to include the capture ID within every packet
   of an IDR or GDR.

   For most audio codecs, where every packet can be decoded
   independently, there is not such an obvious place to put this
   information.  Placing the capture ID within the first n packets of a
   stream on a switch is the most simple solution, where n needs to be
   sufficiently large that it can be expected that at least one packet
   will have reached the receiver.  For example, n=50 on 20ms audio
   packets will give 1 second of capture IDs, which should give
   reasonable confidence of arrival.

   In the case where a stream is switched between captures, for reasons
   of coding efficiency, it may be desirable to avoid sending a new IDR
   frame for this stream, if the receiver's architecture allows the same
   decoding state to be used for its various captures.  In this case,
   the capture ID could be sent for a small number of frames after the
   source switches capture, similarly to audio.

10.2.3.3.  Requesting Capture ID retransmits

   There will, unfortunately, always be cases where a receiver misses
   the beginning of a stream, and therefore does not have the mapping.
   One proposal could be to send the capture ID in SDES with every SDES
   packet; this should ensure that within ~5 seconds of receiving a
   stream, the capture ID will be received.  However, a faster method
   for requesting the transmission of a capture ID would be preferred.

   Again, we look towards the present solution to this problem with
   video.  RFC5104 provides an Full Intra Refresh feedback message,
   which requests that the encoder provide the stream such that
   receivers need only the stream after that point.  A video receiver
   without the start of the stream will naturally need to make this
   request, so by always including the capture ID in refresh frames, we


Lennox, et al.          Expires December 3, 2012               [Page 15]

Internet-Draft         RTP Usage for Telepresence              June 2012


   can be sure that the receiver will have all the information it needs
   to decode the stream (both a refresh point, and a capture ID).

   For audio, we can reuse this message.  If a receiver receives an
   audio stream for which it has no SSRC to capture mapping, it should
   send a FIR message for the received SSRC.  Upon receiving this, an
   audio encoder must then tag outgoing media packets with the capture
   ID for a short period of time.

   Alternately, a new RTCP feedback message could be defined which would
   explicitly request a refresh of the capture ID mapping.

10.3.  Recommendations

   We recommend that endpoints MUST support the RTP header extension
   method of sharing capture IDs, with the extension in every media
   packet.  For low bandwidth situations, this may be considered
   excessive overhead; in which case endpoints MAY support the combined
   approach.

   This will be advertised in the SDP (in a way yet to be determined);
   if a receiver advertises support for the combined approach,
   transmitters which support sending the combined approach SHOULD use
   it in preference.


11.  Security Considerations

   The security considerations for multiplexed RTP do not seem to be
   different than for non-multiplexed RTP.

   Capture IDs need to be integrity-protected in secure environments;
   however, they do not appear to need confidentiality.


12.  IANA Considerations

   Depending on the decisions, the new RTP header extension element, the
   new RTCP SDES item, and/or the new AVPF feedback message will need to
   be registered.


13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.


Lennox, et al.          Expires December 3, 2012               [Page 16]

Internet-Draft         RTP Usage for Telepresence              June 2012


   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

13.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino,
              "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-05 (work in progress), May 2012.

   [I-D.ietf-clue-telepresence-requirements]
              Romanow, A. and S. Botzko, "Requirements for Telepresence
              Multi-Streams",
              draft-ietf-clue-telepresence-requirements-01 (work in
              progress), October 2011.

   [I-D.ietf-clue-telepresence-use-cases]
              Romanow, A., Botzko, S., Duckworth, M., Even, R., and I.
              Communications, "Use Cases for Telepresence Multi-
              streams", draft-ietf-clue-telepresence-use-cases-02 (work
              in progress), January 2012.

   [I-D.lennox-rtcweb-rtp-media-type-mux]
              Lennox, J. and J. Rosenberg, "Multiplexing Multiple Media
              Types In a Single Real-Time Transport Protocol (RTP)
              Session", draft-lennox-rtcweb-rtp-media-type-mux-00 (work
              in progress), October 2011.

   [I-D.westerlund-avtcore-multiplex-architecture]
              Westerlund, M., Burman, B., and C. Perkins, "RTP
              Multiplexing Architecture",
              draft-westerlund-avtcore-multiplex-architecture-01 (work
              in progress), March 2012.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
              "Codec Control Messages in the RTP Audio-Visual Profile
              with Feedback (AVPF)", RFC 5104, February 2008.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.


Lennox, et al.          Expires December 3, 2012               [Page 17]

Internet-Draft         RTP Usage for Telepresence              June 2012


Authors' Addresses

   Jonathan Lennox
   Vidyo, Inc.
   433 Hackensack Avenue
   Seventh Floor
   Hackensack, NJ  07601
   US

   Email: jonathan@vidyo.com


   Paul Witty
   England
   UK

   Email: paul.witty@balliol.oxon.org


   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com


Lennox, et al.          Expires December 3, 2012               [Page 18]