<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" submissionType="IETF"
     category="std" consensus="yes" number="8845" obsoletes="" updates=""
     xml:lang="en" sortRefs="true" symRefs="true" tocInclude="true"
     version="3" ipr="trust200902" docName="draft-ietf-clue-framework-25">
  <!-- xml2rfc v2v3 conversion 2.45.2 -->
  <front>
    <title abbrev="CLUE Framework">Framework for Telepresence Multi-Streams</title>
    <seriesInfo name="RFC" value="8845"/>
    <author fullname="Mark Duckworth" initials="M." role="editor" surname="Duckworth">
      <organization/>
      <address>
        <postal>
          <city></city><region></region><code></code>
          <country></country>
        </postal>
        <email>mrducky73@outlook.com</email>
      </address>
    </author>
    <author fullname="Andrew Pepperell" initials="A." surname="Pepperell">
      <organization>Acano</organization>
      <address>
        <postal>
          <city>Uxbridge</city>
          <country>United Kingdom</country>
        </postal>
        <email>apeppere@gmail.com</email>
      </address>
    </author>
    <author fullname="Stephan Wenger" initials="S." surname="Wenger">
      <organization abbrev="Tencent">Tencent</organization>
      <address>
        <postal>
          <street>2747 Park Blvd.</street>
          <city>Palo Alto</city><region>CA</region><code>94306</code>
          <country>United States of America</country>
        </postal>
        <email>stewe@stewe.org</email>
      </address>
    </author>
    <date month="January" year="2021"/>
    <area>ART</area>
    <workgroup>CLUE</workgroup>
    <keyword>Telepresence</keyword>
    <keyword>Conferencing</keyword>
    <keyword>Video-Conferencing</keyword>
    <keyword>MCU</keyword>
    <abstract>
      <t>
      This document defines a framework for a protocol to enable devices
      in a telepresence conference to interoperate. The protocol enables
      communication of information about multiple media streams so a
      sending system and receiving system can make reasonable decisions
      about transmitting, selecting, and rendering the media streams.
      This protocol is used in addition to SIP signaling and Session
      Description Protocol (SDP) negotiation for setting up a
      telepresence session.</t>
    </abstract>
  </front>
  <middle>
    <section anchor="s-1" numbered="true" toc="default">
      <name>Introduction</name>
      <t>
      Current telepresence systems, though based on open standards such
      as RTP <xref target="RFC3550" format="default"/> and SIP
      <xref target="RFC3261" format="default"/>, cannot easily interoperate with
      each other. A major factor limiting the interoperability of
      telepresence systems is the lack of a standardized way to describe
      and negotiate the use of multiple audio and video streams
      comprising the media flows. This document provides a framework for
      protocols to enable interoperability by handling multiple streams
      in a standardized way. The framework is intended to support the
      use cases described in "Use Cases for Telepresence Multistreams"
      <xref target="RFC7205" format="default"/> and to meet the requirements in
      "Requirements for Telepresence Multistreams"
      <xref target="RFC7262" format="default"/>. This includes cases using
      multiple media streams that are not necessarily telepresence.</t>
      <t>
      The basic session setup for the use cases is based on SIP
      <xref target="RFC3261" format="default"/> and SDP offer/answer
      <xref target="RFC3264" format="default"/>. In addition to basic SIP and
      SDP offer/answer, signaling specific to ControLling mUltiple streams
      for tElepresence (CLUE) is required to exchange the information
      describing the multiple Media Streams. The motivation for this
      framework, an overview of the signaling, and the information required
      to be exchanged are described in subsequent sections of this document.
      Companion documents describe the signaling details
      <xref target="RFC8848" format="default"/>, the data model
      <xref target="RFC8846" format="default"/>, and the protocol
      <xref target="RFC8847" format="default"/>.</t>
    </section>
    <section anchor="s-2" numbered="true" toc="default">
      <name>Requirements Language</name>
      <t>
      The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
      "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>",
      "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>",
      "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>",
      "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>", and
      "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
      described in BCP 14 <xref target="RFC2119" format="default"/>
      <xref target="RFC8174" format="default"/> when, and only when, they
      appear in all capitals, as shown here.
      </t>
    </section>
    <section anchor="s-3" numbered="true" toc="default">
      <name>Definitions</name>
      <t>
      The terms defined below are used throughout this document and
      in companion documents. Capitalization is used in order to easily
      identify a defined term.</t>
      <dl newline="false" spacing="normal">
        <dt>Advertisement:</dt>
        <dd>A CLUE message a Media Provider sends to a Media
        Consumer describing specific aspects of the content of the Media
        and any restrictions it has in terms of being able to provide
        certain Streams simultaneously.</dd>
        <dt>Audio Capture (AC):</dt>
        <dd>Media Capture for audio. Denoted as "ACn" in the
        examples in this document.</dd>
        <dt>Capture:</dt>
        <dd>Same as Media Capture.</dd>
        <dt>Capture Device:</dt>
        <dd>A device that converts physical input, such as
        audio, video, or text, into an electrical signal, in most cases to
        be fed into a Media encoder.</dd>
        <dt>Capture Encoding:</dt>
        <dd>A specific Encoding of a Media Capture, to be
        sent by a Media Provider to a Media Consumer via RTP.</dd>
        <dt>Capture Scene:</dt>
        <dd>A structure representing a spatial region captured
        by one or more Capture Devices, each capturing Media representing a
        portion of the region. The spatial region represented by a Capture
        Scene may correspond to a real region in physical space, such as a
        room. A Capture Scene includes attributes and one or more Capture
        Scene Views, with each view including one or more Media Captures.</dd>
        <dt>Capture Scene View (CSV):</dt>
        <dd>A list of Media Captures of the same
        Media type that together form one way to represent the entire
        Capture Scene.</dd>
        <dt>CLUE:</dt>
        <dd>CLUE is an
        acronym for "ControLling mUltiple streams for tElepresence", which is
        the name of the IETF working group in which this document and certain
        companion documents have been developed. Often, CLUE-* refers to
        something that has been designed by the CLUE working group; for
        example, this document may be called the CLUE-framework document
        herein and elsewhere.</dd>
        <dt>CLUE-capable device:</dt>
        <dd>A device that supports the CLUE data channel
        <xref target="RFC8850" format="default"/>, the CLUE protocol
        <xref target="RFC8847" format="default"/> and the principles of CLUE
        negotiation; it also seeks CLUE-enabled calls.</dd>
        <dt>CLUE-enabled call:</dt>
        <dd>A call in which two CLUE-capable devices have
        successfully negotiated support for a CLUE data channel in SDP
        <xref target="RFC4566" format="default"/>. A CLUE-enabled call is not
        necessarily immediately able
        to send CLUE-controlled Media; negotiation of the data channel and
        of the CLUE protocol must complete first. Calls between two
        CLUE-capable devices that have not yet successfully completed
        negotiation of support for the CLUE data channel in SDP are not
        considered CLUE-enabled.</dd>
        <dt>Conference:</dt>
        <dd>Used as defined in "A Framework for
        Conferencing within the Session Initiation Protocol (SIP)"
        <xref target="RFC4353" format="default"/>.</dd>
        <dt>Configure Message:</dt>
        <dd>A CLUE message a Media Consumer sends to a Media
        Provider specifying which content and Media Streams it wants to
        receive, based on the information in a corresponding Advertisement
        message.</dd>
        <dt>Consumer:</dt>
        <dd>Short for Media Consumer.</dd>
        <dt>Encoding:</dt>
        <dd>Short for Individual Encoding.</dd>
        <dt>Encoding Group:</dt>
        <dd>A set of Encoding parameters representing a total
        Media Encoding capability to be subdivided across potentially
        multiple Individual Encodings.</dd>
        <dt>Endpoint:</dt>
        <dd>A CLUE-capable device that is the logical point of final
        termination through receiving, decoding and Rendering, and/or
        initiation through capturing, encoding, and sending of Media
        Streams. An Endpoint consists of one or more physical devices
        that source and sink Media Streams, and exactly one
        <xref target="RFC4353" format="default"/> Participant (which, in
        turn, includes exactly one SIP User Agent).
        Endpoints can be anything from multiscreen/multicamera rooms to
        handheld devices.</dd>
        <dt>Global View:</dt>
        <dd>A set of references to one or more CSVs
        of the same Media type that are defined within Scenes of the same
        Advertisement. A Global View is a suggestion from the Provider to
        the Consumer for one set of CSVs that provide a useful
        representation of all the Scenes in the Advertisement.</dd>
        <dt>Global View List:</dt>
        <dd>A list of Global Views included in an
        Advertisement. A Global View List may include Global Views of
        different Media types.</dd>
        <dt>Individual Encoding:</dt>
        <dd>A set of parameters representing a way to
        encode a Media Capture to become a Capture Encoding.</dd>
        <dt>Multipoint Control Unit (MCU):</dt>
        <dd>A CLUE-capable device that connects
        two or more Endpoints into one single multimedia
        Conference <xref target="RFC7667" format="default"/>. An MCU includes
        a Mixer like that described in <xref target="RFC4353" format="default"/>,
        without the requirement of <xref target="RFC4353" format="default"/>
        to send Media to each participant.</dd>
        <dt>Media:</dt>
        <dd>Any data that, after suitable encoding, can be conveyed over
        RTP, including audio, video, or timed text.</dd>
        <dt>Media Capture (MC):</dt>
        <dd>A source of Media, such as from one or more Capture
        Devices or constructed from other Media Streams.</dd>
        <dt>Media Consumer:</dt>
        <dd>A CLUE-capable device that intends to receive
        Capture Encodings.</dd>
        <dt>Media Provider:</dt>
        <dd>A CLUE-capable device that intends to send Capture
        Encodings.</dd>
        <dt>Multiple Content Capture (MCC):</dt>
        <dd>A Capture that mixes and/or
        switches other Captures of a single type (for example, all audio or
        all video). Particular Media Captures may or may not be present in
        the resultant Capture Encoding, depending on time or space. Denoted
        as "MCCn" in the example cases in this document.</dd>
        <dt>Plane of Interest:</dt>
        <dd>The spatial plane within a Scene containing the
        most-relevant subject matter.</dd>
        <dt>Provider:</dt>
        <dd>Same as a Media Provider.</dd>
        <dt>Render:</dt>
        <dd>The process of generating a representation from Media, such
        as displayed motion video or sound emitted from loudspeakers.</dd>
        <dt>Scene:</dt>
        <dd>Same as a Capture Scene.</dd>
        <dt>Simultaneous Transmission Set:</dt>
        <dd>A set of Media Captures that can be
        transmitted simultaneously from a Media Provider.</dd>
        <dt>Single Media Capture:</dt>
        <dd>A Capture that contains Media from a single
        source Capture Device, e.g., an Audio Capture from a single
        microphone or a Video Capture from a single camera.</dd>
        <dt>Spatial Relation:</dt>
        <dd>The arrangement of two objects in space, in
        contrast to relation in time or other relationships.</dd>
        <dt>Stream:</dt>
        <dd>A Capture Encoding sent from a Media Provider to a Media
        Consumer via RTP <xref target="RFC3550" format="default"/>.</dd>
        <dt>Stream Characteristics:</dt>
        <dd>The Media Stream attributes commonly used
        in non-CLUE SIP/SDP environments (such as Media codec, bitrate,
        resolution, profile/level, etc.) as well as CLUE-specific
        attributes, such as the Capture ID or a spatial location.</dd>
        <dt>Video Capture (VC):</dt>
        <dd>Media Capture for video. Denoted as VCn in the
        example cases in this document.</dd>
        <dt>Video Composite:</dt>
        <dd>A single image that is formed, normally by an RTP
        mixer inside an MCU, by combining visual elements from separate
        sources.</dd>
      </dl>
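As an informal aid, the relationships among several of the defined terms above can be sketched as plain data structures. This is an illustrative sketch only: the class and field names are hypothetical and are not the normative CLUE data model of RFC 8846.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: these names mirror the defined terms above, not the
# normative RFC 8846 data model.

@dataclass
class MediaCapture:
    capture_id: str          # e.g., "VC1" (video) or "AC1" (audio)
    media_type: str          # "audio" or "video"
    encoding_group: str      # Encoding Group usable for this Capture

@dataclass
class CaptureSceneView:
    # One way to represent the entire Capture Scene, as a list of
    # Media Captures of a single Media type.
    media_type: str
    captures: List[MediaCapture]

@dataclass
class CaptureScene:
    # A spatial region (e.g., a room) with attributes and views.
    scene_id: str
    views: List[CaptureSceneView] = field(default_factory=list)

@dataclass
class Advertisement:
    # What a Media Provider offers to a Media Consumer.
    scenes: List[CaptureScene]
    simultaneous_sets: List[List[str]]   # capture IDs sendable together

vc1 = MediaCapture("VC1", "video", "EG0")
vc2 = MediaCapture("VC2", "video", "EG0")
scene = CaptureScene("CS1", [CaptureSceneView("video", [vc1, vc2])])
adv = Advertisement([scene], [["VC1", "VC2"]])
print(len(adv.scenes[0].views[0].captures))   # 2
```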
    </section>
    <section anchor="s-4" numbered="true" toc="default">
      <name>Overview and Motivation</name>
      <t>
      This section provides an overview of the functional elements
      defined in this document to represent a telepresence or
      multistream system. The motivations for the framework described
      in this document are also provided.</t>
      <t>
      Two key concepts introduced in this document are the terms "Media
      Provider" and "Media Consumer". A Media Provider represents the
      entity that sends the Media and a Media Consumer represents the
      entity that receives the Media. A Media Provider provides Media in
      the form of RTP packets; a Media Consumer consumes those RTP
      packets. Media Providers and Media Consumers can reside in
      Endpoints or in Multipoint Control Units (MCUs). A Media Provider
      in an Endpoint is usually associated with the generation of Media
      for Media Captures; these Media Captures are typically sourced
      from cameras, microphones, and the like. Similarly, the Media
      Consumer in an Endpoint is usually associated with renderers, such
      as screens and loudspeakers. In MCUs, Media Providers and
      Consumers can have the form of outputs and inputs, respectively,
      of RTP mixers, RTP translators, and similar devices. Typically,
      telepresence devices, such as Endpoints and MCUs, would perform as
      both Media Providers and Media Consumers, the former being
      concerned with those devices' transmitted Media and the latter
      with those devices' received Media. In a few circumstances, a
      CLUE-capable device includes only Consumer or Provider
      functionality, such as recorder-type Consumers or webcam-type
      Providers.</t>
      <t>
      The motivations for the framework outlined in this document
      include the following:</t>
      <ol spacing="normal" type="(%d)">
        <li>Endpoints in telepresence systems typically have multiple Media
        Capture and Media Render devices, e.g., multiple cameras and
        screens. While previous system designs were able to set up calls
        that would capture Media using all cameras and display Media on all
        screens, for example, there was no mechanism that could associate
        these Media Captures with each other in space and time, in a
        cross-vendor interoperable way.</li>
        <li>The mere fact that there are multiple Media Capture and Media
        Render devices, each of which may be configurable in aspects such
        as zoom, leads to the difficulty that a variable number of such
        devices can be used to capture different aspects of a region. The
        Capture Scene concept allows for the description of multiple setups
        for those multiple Media Capture devices that could represent
        sensible operation points of the physical Capture Devices in a
        room, chosen by the operator. A Consumer can pick and choose from
        those configurations based on its rendering abilities and then
        inform the Provider about its choices. Details are provided in
        <xref target="s-7" format="default"/>.</li>
        <li>In some cases, physical limitations or other reasons disallow
        the concurrent use of a device in more than one setup. For
        example, the center camera in a typical three-camera conference
        room can set its zoom objective to capture either the middle
        few seats only or all seats of a room, but not both concurrently.
        The Simultaneous Transmission Set concept allows a Provider to
        signal such limitations. Simultaneous Transmission Sets are part
        of the Capture Scene description and are discussed in
        <xref target="s-8" format="default"/>.</li>
        <li>Often, the devices in a room do not have the computational
        complexity or connectivity to deal with multiple Encoding options
        simultaneously, even if each of these options is sensible in
        certain scenarios, and even if the simultaneous transmission is
        also sensible (i.e., in case of multicast Media distribution to
        multiple Endpoints). Such constraints can be expressed by the
        Provider using the Encoding Group concept, which is described in
        <xref target="s-9" format="default"/>.</li>
        <li>Due to the potentially large number of RTP Streams required for
        a Multimedia Conference involving potentially many Endpoints, each
        of which can have many Media Captures and Media renderers, it has
        become common to multiplex multiple RTP Streams onto the same
        transport address, so as to avoid using the port number as a
        multiplexing point and the associated shortcomings such as
        NAT/firewall traversal. The large number of possible permutations
        of sensible options a Media Provider can make available to a Media
        Consumer makes desirable a mechanism that allows it to narrow down
        the number of possible options that a SIP offer/answer exchange has
        to consider. Such information is made available using protocol
        mechanisms specified in this document and companion documents.
        The Media Provider and Media Consumer may use information in CLUE
        messages to reduce the complexity of SIP offer/answer messages.
        Also, there are aspects of the control of both Endpoints and MCUs
        that dynamically change during the progress of a call, such as
        audio-level-based screen switching, layout changes, and so on,
        which need to be conveyed. Note that these control aspects are
        complementary to those specified in traditional SIP-based
        conference management, such as the Binary Floor Control Protocol
        (BFCP). An exemplary call flow can be found in
        <xref target="s-5" format="default"/>.</li>
      </ol>
      <t>
      Finally, all this information needs to be conveyed, and the notion
      of support for it needs to be established. This is done by the
      negotiation of a "CLUE channel", a data channel negotiated early
      during the initiation of a call. An Endpoint or MCU that rejects
      the establishment of this data channel, by definition, does not
      support CLUE-based mechanisms, whereas an Endpoint or MCU that
      accepts it is indicating support for CLUE as specified in this
      document and its companion documents.</t>
    </section>
    <section anchor="s-5" numbered="true" toc="default">
      <name>Description of the Framework/Model</name>
      <t>
      The CLUE framework specifies how multiple Media Streams are to be
      handled in a telepresence Conference.</t>
      <t>
      A Media Provider (transmitting Endpoint or MCU) describes specific
      aspects of the content of the Media and the Media Stream Encodings
      it can send in an Advertisement; and the Media Consumer responds to
      the Media Provider by specifying which content and Media Streams it
      wants to receive in a Configure message. The Provider then
      transmits the asked-for content in the specified Streams.</t>
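The Advertisement/Configure round trip described above can be sketched as follows. This is an illustration of the message flow only, with placeholder message contents; it is not the CLUE protocol encoding defined in RFC 8847.

```python
# Sketch of the Advertisement/Configure round trip described above.
# Message contents are placeholders, not the RFC 8847 wire format.

def make_advertisement():
    # Provider side: describe what can be sent.
    return {"captures": ["VC1", "VC2", "AC1"],
            "simultaneous_sets": [["VC1", "AC1"], ["VC2", "AC1"]]}

def make_configure(advertisement, wanted):
    # Consumer side: pick only captures the Provider actually advertised.
    chosen = [c for c in advertisement["captures"] if c in wanted]
    return {"requested_captures": chosen}

def provider_send(configure):
    # Provider transmits the asked-for content in the specified Streams.
    return ["stream:" + c for c in configure["requested_captures"]]

adv = make_advertisement()
cfg = make_configure(adv, wanted={"VC2", "AC1", "VC9"})  # VC9 not offered
print(provider_send(cfg))   # ['stream:VC2', 'stream:AC1']
```

The Consumer can only ask for what was advertised; anything else (here the hypothetical "VC9") is simply dropped from the request.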
      <t>
      This Advertisement and Configure typically occur during call
      initiation, after CLUE has been enabled in a call, but they
      <bcp14>MAY</bcp14> also happen at any time throughout the call,
      whenever there is a change in what the Consumer wants to receive
      or (perhaps less common) what the Provider can send.</t>
      <t>
      An Endpoint or MCU typically acts as both Provider and Consumer at
      the same time, sending Advertisements and sending Configurations in
      response to receiving Advertisements. (It is possible to be just
      one or the other.)</t>
      <t>
      The data model <xref target="RFC8846" format="default"/> is based
      around two main concepts: a Capture and an Encoding. A Media Capture,
      such as of type audio or video, has attributes to describe the
      content a Provider can send. Media Captures are described in terms
      of CLUE-defined attributes, such as Spatial Relationships and
      purpose of the Capture. Providers tell Consumers which Media
      Captures they can provide, described in terms of the Media Capture
      attributes.</t>
      <t>
      A Provider organizes its Media Captures into one or more Capture
      Scenes, each representing a spatial region, such as a room. A
      Consumer chooses which Media Captures it wants to receive from the
      Capture Scenes.</t>
      <t>
      In addition, the Provider can send the Consumer a description of
      the Individual Encodings it can send in terms of identifiers that
      relate to items in SDP <xref target="RFC4566" format="default"/>.</t>
      <t>
      The Provider can also specify constraints on its ability to provide
      Media, and a sensible design choice for a Consumer is to take these
      into account when choosing the content and Capture Encodings it
      requests in the later offer/answer exchange. Some constraints are
      due to the physical limitations of a device; for example, a camera
      may not be able to provide zoom and non-zoom views simultaneously.
      Other constraints are system based, such as maximum bandwidth.</t>
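A Consumer-side feasibility check against such constraints can be sketched as below. The constraint representation (a list of Simultaneous Transmission Sets plus a bandwidth budget) is illustrative, not the normative encoding of these constraints.

```python
# Sketch of a Consumer-side feasibility check against Provider constraints.
# The constraint representation here is illustrative, not normative.

def is_feasible(requested, simultaneous_sets, max_bandwidth, bandwidth_of):
    # The request must fit entirely within at least one Simultaneous
    # Transmission Set (captures the Provider can send concurrently)...
    fits_a_set = any(set(requested) <= set(s) for s in simultaneous_sets)
    # ...and the total requested bandwidth must stay within the budget.
    within_budget = sum(bandwidth_of[c] for c in requested) <= max_bandwidth
    return fits_a_set and within_budget

sets = [["VC1", "AC1"], ["VC2", "AC1"]]        # e.g., zoom vs. non-zoom view
bw = {"VC1": 4000, "VC2": 4000, "AC1": 64}     # kbps, illustrative values
print(is_feasible(["VC1", "AC1"], sets, 6000, bw))   # True
print(is_feasible(["VC1", "VC2"], sets, 9000, bw))   # False: not simultaneous
```

The second request fails even though bandwidth suffices, because the two camera views never appear together in any Simultaneous Transmission Set.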
      <t>
      The following diagram illustrates the information contained in an
      Advertisement.</t>
      <figure anchor="ref-advertisement-structure">
        <name>Advertisement Structure</name>
        <artwork name="" type="" align="left" alt=""><![CDATA[
...................................................................
. Provider Advertisement         +--------------------+           .
.                                | Simultaneous Sets  |           .
. +------------------------+     +--------------------+           .
. | Capture Scene N        |     +--------------------+           .
. +-+----------------------+ |   | Global View List   |           .
. | Capture Scene 2        | |   +--------------------+           .
. +-+----------------------+ | |    +----------------------+      .
. | Capture Scene 1        | | |    | Encoding Group N     |      .
. | +---------------+      | | |    +-+--------------------+ |    .
. | | Attributes    |      | | |    | Encoding Group 2     | |    .
. | +---------------+      | | |    +-+--------------------+ | |  .
. |                        | | |    | Encoding Group 1     | | |  .
. | +----------------+     | | |    |  parameters          | | |  .
. | | V i e w s      |     | | |    |  bandwidth           | | |  .
. | | +---------+    |     | | |    | +-------------------+| | |  .
. | | |Attribute|    |     | | |    | | V i d e o         || | |  .
. | | +---------+    |     | | |    | | E n c o d i n g s || | |  .
. | |                |     | | |    | |  Encoding 1       || | |  .
. | | View 1         |     | | |    | |                   || | |  .
. | | (list of MCs)  |     | |-+    | +-------------------+| | |  .
. | +----|-|--|------+     |-+      |                      | | |  .
. +------|-|--|------------+        | +-------------------+| | |  .
.        | |  |                     | | A u d i o         || | |  .
.        | |  |                     | | E n c o d i n g s || | |  .
.        v |  |                     | |  Encoding 1       || | |  .
. +--------|--|---------+           | |                   || | |  .
. | Media Capture N     |---------->| +-------------------+| | |  .
. +-+------v--|---------+ |         |                      | | |  .
. | Media Capture 2     | |         |                      | |-+  .
. +-+---------v---------+ |-------->|                      | |    .
. | Media Capture 1     | |         |                      |-+    .
. | +----------------+  | |-------->|                      |      .
. | | Attributes     |  | |_+       +----------------------+      .
. | +----------------+  |_+                                       .
. +---------------------+                                         .
.                                                                 .
...................................................................
]]></artwork>
      </figure>
      <t><xref target="ref-basic-information-flow" format="default"/>
      illustrates the call flow used by a simple system (two Endpoints) in
      compliance with this document. A very brief outline of the call flow
      is described in the text that follows.</t>
      <figure anchor="ref-basic-information-flow">
        <name>Basic Information Flow</name>
        <artwork name="" type="" align="left" alt=""><![CDATA[
+-----------+                     +-----------+
| Endpoint1 |                     | Endpoint2 |
+----+------+                     +-----+-----+
     | INVITE (BASIC SDP+CLUECHANNEL)   |
     |--------------------------------->|
     |    200 OK (BASIC SDP+CLUECHANNEL)|
     |<---------------------------------|
     | ACK                              |
     |--------------------------------->|
     |                                  |
     |<################################>|
     |       BASIC MEDIA SESSION        |
     |<################################>|
     |                                  |
     |    CONNECT (CLUE CTRL CHANNEL)   |
     |=================================>|
     |               ...                |
     |<================================>|
     |  CLUE CTRL CHANNEL ESTABLISHED   |
     |<================================>|
     |                                  |
     | ADVERTISEMENT 1                  |
     |*********************************>|
     |                  ADVERTISEMENT 2 |
     |<*********************************|
     |                                  |
     |                      CONFIGURE 1 |
     |<*********************************|
     | CONFIGURE 2                      |
     |*********************************>|
     |                                  |
     | REINVITE (UPDATED SDP)           |
     |--------------------------------->|
     |              200 OK (UPDATED SDP)|
     |<---------------------------------|
     | ACK                              |
     |--------------------------------->|
     |                                  |
     |<################################>|
     |      UPDATED MEDIA SESSION       |
     |<################################>|
     |                                  |
     v                                  v
]]></artwork>
      </figure>
      <t>
      An initial offer/answer exchange establishes a basic Media session,
      for example, audio-only, and a CLUE channel between two Endpoints.
      With the establishment of that channel, the Endpoints have
      consented to use the CLUE protocol mechanisms and, therefore,
      <bcp14>MUST</bcp14> adhere to the CLUE protocol suite as outlined
      herein.</t>
      <t>
      Over this CLUE channel, the Provider in each Endpoint conveys its
      characteristics and capabilities by sending an Advertisement as
      specified herein. The Advertisement is typically not sufficient to
      set up all Media. The Consumer in the Endpoint receives the
      information provided by the Provider and can use it for several
      purposes. It uses it, along with information from an offer/answer
      exchange, to construct a CLUE Configure message to tell the
      Provider what the Consumer wishes to receive. Also, the Consumer
      may use the information provided to tailor the SDP it is going to
      send during any following SIP offer/answer exchange, and its
      reaction to SDP it receives in that step. It is often a sensible
      implementation choice to do so. Spatial relationships associated
      with the Media can be included in the Advertisement, and it is
      often sensible for the Media Consumer to take those spatial
      relationships into account when tailoring the SDP. The Consumer
      can also limit the number of Encodings it must set up resources to
      receive, and not waste resources on unwanted Encodings, because it
      has the Provider's Advertisement information ahead of time to
      determine what it really wants to receive. The Consumer can also
      use the Advertisement information for local rendering decisions.</t>
      <t>
      This initial CLUE exchange is followed by an SDP offer/answer
      exchange that not only establishes those aspects of the Media that
      have not been "negotiated" over CLUE, but also has the effect of
      setting up the Media transmission itself, potentially involving
      security exchanges, Interactive Connectivity Establishment (ICE),
      and so on. This step is considered "plain vanilla SIP".</t>
      <t>
      During the lifetime of a call, further exchanges <bcp14>MAY</bcp14>
      occur over the CLUE channel. In some cases, those further exchanges
      lead to a modified system behavior of Provider or Consumer (or both)
      without any other protocol activity such as further offer/answer
      exchanges. For example, a Configure Message requesting that the
      Provider place a different Capture source into a Capture Encoding,
      signaled over the CLUE channel, ought not to lead to heavy-handed
      mechanisms like SIP re-invites. In other cases, however, after the
      CLUE negotiation, an additional offer/answer exchange becomes
      necessary. For example, if both sides decide to upgrade the call
      from one screen to a multi-screen call, and more bandwidth is
      required for the additional video channels compared to what was
      previously negotiated using offer/answer, a new offer/answer
      exchange is required.</t>
      <t>
      One aspect of the protocol outlined herein, and specified in more
      detail in companion documents, is that it makes available to the
      Consumer information regarding the Provider's capabilities to
      deliver Media and attributes related to that Media, such as their
      Spatial Relationship. The operation of the renderer inside the
      Consumer is unspecified in that it can choose to ignore some
      information provided by the Provider and/or not Render Media
      Streams available from the Provider (although the Consumer follows
      the CLUE protocol and, therefore, gracefully receives and responds
      to the Provider's information using a Configure operation).</t>
| <t> | ||||
| A CLUE-capable device interoperates with a device that does not | ||||
| support CLUE. The CLUE-capable device can determine, by the result | ||||
| of the initial offer/answer exchange, if the other device supports | ||||
| and wishes to use CLUE. The specific mechanism for this is | ||||
| described in <xref target="RFC8848" format="default"/>. If the other device | ||||
| does | ||||
| not use CLUE, then the CLUE-capable device falls back to behavior | ||||
| that does not require CLUE.</t> | ||||
| <t> | ||||
| As for the Media, Provider and Consumer have an end-to-end | ||||
| communication relationship with respect to (RTP-transported) Media; | ||||
| and the mechanisms described herein and in companion documents do | ||||
| not change the aspects of setting up those RTP flows and sessions. | ||||
| In other words, the RTP Media sessions conform to the negotiated | ||||
| SDP whether or not CLUE is used.</t> | ||||
| </section> | ||||
| <section anchor="s-6" numbered="true" toc="default"> | ||||
| <name>Spatial Relationships</name> | ||||
| <t> | ||||
| In order for a Consumer to perform a proper rendering, it is often | ||||
| necessary (or at least helpful) for the Consumer to have received | ||||
| spatial information about the Streams it is receiving. CLUE | ||||
| defines a coordinate system that allows Media Providers to describe | ||||
| the Spatial Relationships of their Media Captures to enable proper | ||||
| scaling and spatially sensible rendering of their Streams. The | ||||
| coordinate system is based on a few principles:</t> | ||||
| <ul spacing="normal"> | ||||
| <li>Each Capture Scene has a distinct coordinate system, unrelated | ||||
| to the coordinate systems of other Scenes.</li> | ||||
| <li>Simple systems that do not have multiple Media Captures to | ||||
| associate spatially need not use the coordinate model, although | ||||
| it can still be useful to provide an Area of Capture.</li> | ||||
| <li> | ||||
| <t>Coordinates can either be in real, physical units (millimeters), | ||||
| have an unknown scale, or have no physical scale. Systems that | ||||
| know their physical dimensions (for example, professionally | ||||
installed Telepresence room systems) <bcp14>MUST</bcp14> provide those
real-world measurements to enable the best user experience for
| advanced receiving systems that can utilize this information. | ||||
| Systems that don't know specific physical dimensions but still | ||||
know relative distances <bcp14>MUST</bcp14> use "Unknown Scale". "No Scale" is
| intended to be used only where Media Captures from different | ||||
| devices (with potentially different scales) will be forwarded | ||||
| alongside one another (e.g., in the case of an MCU). | ||||
| </t> | ||||
| <ul spacing="normal"> | ||||
| <li>"Millimeters" means the scale is in millimeters.</li> | ||||
<li>"Unknown Scale" means the scale is not necessarily in millimeters, but
| the scale is the same for every Capture in the Capture Scene.</li> | ||||
| <li>"No Scale" means the scale could be different for each | ||||
| Capture -- an MCU Provider that advertises two adjacent | ||||
| Captures and picks sources (which can change quickly) from | ||||
| different Endpoints might use this value; the scale could be | ||||
| different and changing for each Capture. But the areas of | ||||
| capture still represent a Spatial Relation between Captures.</li> | ||||
| </ul> | ||||
| </li> | ||||
| <li>The coordinate system is right-handed Cartesian X, Y, Z with the | ||||
| origin at a spatial location of the Provider's choosing. The | ||||
| Provider <bcp14>MUST</bcp14> use the same coordinate system with the same | ||||
| scale | ||||
| and origin for all coordinates within the same Capture Scene.</li> | ||||
| </ul> | ||||
| <t>The direction of increasing coordinate values is as follows: | ||||
| X increases from left to right, from the point of view of an | ||||
| observer at the front of the room looking toward the back; | ||||
| Y increases from the front of the room to the back of the room; | ||||
| Z increases from low to high (i.e., floor to ceiling).</t> | ||||
| <t> | ||||
| Cameras in a Scene typically point in the direction of increasing | ||||
| Y, from front to back. But there could be multiple cameras | ||||
| pointing in different directions. If the physical space does not | ||||
| have a well-defined front and back, the Provider chooses any | ||||
| direction for X, Y, and Z consistent with right-handed | ||||
| coordinates.</t> | ||||
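<t>
The axis conventions above can be sanity-checked with a short
sketch (illustrative only; the helper name is not part of CLUE): in
a right-handed system, the cross product of the X and Y unit
vectors yields the Z unit vector.
</t>
<sourcecode type="python"><![CDATA[
```python
# Illustrative sketch: verify that the documented axis directions
# (X: left to right, Y: front to back, Z: floor to ceiling) form a
# right-handed Cartesian system, i.e., X cross Y = Z.

def cross(a, b):
    """Cross product of two 3-vectors given as (X, Y, Z) tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

X = (1, 0, 0)  # left to right, seen from the front of the room
Y = (0, 1, 0)  # front of the room toward the back
Z = (0, 0, 1)  # floor to ceiling

assert cross(X, Y) == Z  # right-handed: X x Y points up
```
]]></sourcecode>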
| </section> | ||||
| <section anchor="s-7" numbered="true" toc="default"> | ||||
| <name>Media Captures and Capture Scenes</name> | ||||
| <t> | ||||
| This section describes how Providers can describe the content of | ||||
| Media to Consumers.</t> | ||||
| <section anchor="s-7.1" numbered="true" toc="default"> | ||||
| <name>Media Captures</name> | ||||
| <t> | ||||
| Media Captures are the fundamental representations of Streams that | ||||
| a device can transmit. What a Media Capture actually represents is | ||||
| flexible:</t> | ||||
| <ul spacing="normal"> | ||||
<li>It can represent the immediate output of a physical source (e.g.,
camera, microphone) or 'synthetic' source (e.g., laptop computer, DVD player).</li>
<li>It can represent the output of an audio mixer or video composer.</li>
| <li>It can represent a concept such as 'the loudest speaker'.</li> | ||||
| <li>It can represent a conceptual position such as 'the leftmost | ||||
| Stream'.</li> | ||||
| </ul> | ||||
| <t> | ||||
To identify and distinguish between multiple Capture instances,
Captures have a unique identity: for instance, VC1, VC2, AC1, and
AC2 (where VC1 and VC2 refer to two different Video Captures, and
AC1 and AC2 refer to two different Audio Captures).</t>
| <t>Some key points about Media Captures: | ||||
| </t> | ||||
| <ul spacing="normal"> | ||||
| <li>A Media Capture is of a single Media type (e.g., audio or | ||||
| video).</li> | ||||
<li>A Media Capture is defined in a Capture Scene and is given an
Advertisement-unique identity. The identity may be referenced
outside the Capture Scene that defines it through an MCC.</li>
| <li>A Media Capture may be associated with one or more CSVs.</li> | ||||
| <li>A Media Capture has exactly one set of spatial information.</li> | ||||
| <li>A Media Capture can be the source of at most one Capture | ||||
| Encoding.</li> | ||||
| </ul> | ||||
| <t> | ||||
| Each Media Capture can be associated with attributes to describe | ||||
| what it represents.</t> | ||||
| <section anchor="s-7.1.1" numbered="true" toc="default"> | ||||
| <name>Media Capture Attributes</name> | ||||
| <t> | ||||
| Media Capture attributes describe information about the Captures. | ||||
| A Provider can use the Media Capture attributes to describe the | ||||
| Captures for the benefit of the Consumer of the Advertisement | ||||
| message. All these attributes are optional. Media Capture | ||||
| attributes include: | ||||
| </t> | ||||
| <ul spacing="normal"> | ||||
<li>Spatial information, such as Point of Capture, Point on Line
of Capture, and Area of Capture (all of which, in combination,
define the capture field of, for example, a camera).</li>
| <li>Other descriptive information to help the Consumer choose | ||||
| between Captures (e.g., description, presentation, view, | ||||
| priority, language, person information, and type).</li> | ||||
| </ul> | ||||
| <t> | ||||
| The subsections below define the Capture attributes.</t> | ||||
| <section anchor="s-7.1.1.1" numbered="true" toc="default"> | ||||
| <name>Point of Capture</name> | ||||
| <t> | ||||
The Point of Capture attribute is a field with a single Cartesian
(X, Y, Z) point value that describes the spatial location of the
capturing device (such as a camera). For an Audio Capture with
multiple microphones, the Point of Capture defines the nominal
midpoint of the microphones.</t>
| </section> | ||||
| <section anchor="s-7.1.1.2" numbered="true" toc="default"> | ||||
| <name>Point on Line of Capture</name> | ||||
| <t> | ||||
| The Point on Line of Capture attribute is a field with a single | ||||
| Cartesian (X, Y, Z) point value that describes a position in space | ||||
| of a second point on the axis of the capturing device, toward the | ||||
| direction it is pointing; the first point being the Point of | ||||
| Capture (see above).</t> | ||||
| <t> | ||||
| Together, the Point of Capture and Point on Line of Capture define | ||||
| the direction and axis of the capturing device, for example, the | ||||
| optical axis of a camera or the axis of a microphone. The Media | ||||
| Consumer can use this information to adjust how it Renders the | ||||
| received Media if it so chooses.</t> | ||||
| <t> | ||||
| For an Audio Capture, the Media Consumer can use this information | ||||
along with the Audio Capture Sensitivity Pattern to define a
three-dimensional volume of capture where sounds can be expected to be
| picked up by the microphone providing this specific Audio Capture. | ||||
| If the Consumer wants to associate an Audio Capture with a Video | ||||
| Capture, it can compare this volume with the Area of Capture for | ||||
| video Media to provide a check on whether the Audio Capture is | ||||
| indeed spatially associated with the Video Capture. For example, a | ||||
| video Area of Capture that fails to intersect at all with the audio | ||||
| volume of capture, or is at such a long radial distance from the | ||||
| microphone Point of Capture that the audio level would be very low, | ||||
| would be inappropriate.</t> | ||||
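<t>
The plausibility check described above can be sketched as follows.
This is a simplification, not CLUE-mandated processing: the audio
volume of capture is approximated by a sphere whose radius
(max_range_mm) is an assumed parameter, and the helper names are
illustrative.
</t>
<sourcecode type="python"><![CDATA[
```python
# Rough sketch: treat the audio volume of capture as a sphere around
# the microphone's Point of Capture and test whether the centroid of
# a video Area of Capture falls inside it. max_range_mm is an assumed
# tuning parameter, not a CLUE-defined value.

def centroid(points):
    """Centroid of a list of (X, Y, Z) points, in the Scene's units."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def plausibly_associated(audio_point, video_area, max_range_mm=5000.0):
    """True if the video Area of Capture's centroid lies within
    max_range_mm of the audio Point of Capture."""
    c = centroid(video_area)
    dist = sum((c[i] - audio_point[i]) ** 2 for i in range(3)) ** 0.5
    return dist <= max_range_mm

mic = (0, 0, 1000)  # microphone Point of Capture (millimeters)
area = [(-500, 2000, 0), (500, 2000, 0),
        (-500, 2000, 1000), (500, 2000, 1000)]  # video Area of Capture
assert plausibly_associated(mic, area)
```
]]></sourcecode>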
| </section> | ||||
| <section anchor="s-7.1.1.3" numbered="true" toc="default"> | ||||
| <name>Area of Capture</name> | ||||
| <t> | ||||
| The Area of Capture is a field with a set of four (X, Y, Z) points | ||||
| as a value that describes the spatial location of what is being | ||||
| "captured". This attribute applies only to Video Captures, not | ||||
| other types of Media. By comparing the Area of Capture for | ||||
| different Video Captures within the same Capture Scene, a Consumer | ||||
| can determine the Spatial Relationships between them and Render | ||||
| them correctly.</t> | ||||
| <t> | ||||
The four points <bcp14>MUST</bcp14> be co-planar, forming a quadrilateral, which
defines the Plane of Interest for the particular Media Capture.</t>
| <t> | ||||
| If the Area of Capture is not specified, it means the Video Capture | ||||
| might be spatially related to other Captures in the same Scene, but | ||||
| there is no detailed information on the relationship. For a switched | ||||
| Capture that switches between different sections within a larger | ||||
| area, the Area of Capture <bcp14>MUST</bcp14> use coordinates for the larger | ||||
| potential area.</t> | ||||
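<t>
The co-planarity requirement on the four points can be tested with
a scalar triple product. The following is a minimal sketch; the
helper names and tolerance are illustrative, not part of CLUE.
</t>
<sourcecode type="python"><![CDATA[
```python
# Illustrative check: the four Area of Capture points MUST be
# co-planar. Test this via the scalar triple product of the three
# edge vectors from the first point; zero means co-planar.

def sub(a, b):
    return tuple(a[i] - b[i] for i in range(3))

def coplanar(p0, p1, p2, p3, tol=1e-6):
    """True if the four (X, Y, Z) points lie in one plane."""
    u, v, w = sub(p1, p0), sub(p2, p0), sub(p3, p0)
    triple = (u[0] * (v[1] * w[2] - v[2] * w[1])
              - u[1] * (v[0] * w[2] - v[2] * w[0])
              + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(triple) <= tol

# A flat quadrilateral in the Y = 0 plane passes the check:
assert coplanar((0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9))
# Pulling one corner off the plane fails it:
assert not coplanar((0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 5, 9))
```
]]></sourcecode>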
| </section> | ||||
| <section anchor="s-7.1.1.4" numbered="true" toc="default"> | ||||
| <name>Mobility of Capture</name> | ||||
| <t> | ||||
| The Mobility of Capture attribute indicates whether or not the | ||||
| Point of Capture, Point on Line of Capture, and Area of Capture | ||||
| values stay the same over time, or are expected to change | ||||
| (potentially frequently). Possible values are static, dynamic, and | ||||
| highly dynamic.</t> | ||||
| <t> | ||||
An example of "dynamic" is a camera mounted on a stand that is
occasionally hand-carried and placed at different positions in
order to provide the best angle to capture a work task. A camera
worn by a person who moves around the room is an example of
"highly dynamic". In either case, the effect is that the Point of Capture,
Capture Axis, and Area of Capture change with time.</t>
| <t> | ||||
| The Point of Capture of a static Capture <bcp14>MUST NOT</bcp14> move for the | ||||
| life of | ||||
the CLUE session. The Point of Capture of dynamic Captures is
characterized by a change in position followed by a reasonable period
of stability, on the order of minutes. Highly
dynamic Captures are characterized by a Point of Capture that is
constantly moving. If the Area of Capture, Point of Capture, and
| Point on Line of Capture attributes are included with dynamic or highly | ||||
| dynamic Captures, they indicate spatial information at the time of | ||||
| the Advertisement.</t> | ||||
| </section> | ||||
| <section anchor="s-7.1.1.5" numbered="true" toc="default"> | ||||
| <name>Audio Capture Sensitivity Pattern</name> | ||||
| <t> | ||||
| The Audio Capture Sensitivity Pattern attribute applies only to | ||||
| Audio Captures. This attribute gives information about the nominal | ||||
| sensitivity pattern of the microphone that is the source of the | ||||
| Capture. Possible values include patterns such as omni, shotgun, | ||||
| cardioid, and hyper-cardioid.</t> | ||||
| </section> | ||||
| <section anchor="s-7.1.1.6" numbered="true" toc="default"> | ||||
| <name>Description</name> | ||||
| <t> | ||||
| The Description attribute is a human-readable description (which | ||||
| could be in multiple languages) of the Capture.</t> | ||||
| </section> | ||||
| <section anchor="s-7.1.1.7" numbered="true" toc="default"> | ||||
| <name>Presentation</name> | ||||
| <t> | ||||
| The Presentation attribute indicates that the Capture originates | ||||
| from a presentation device, that is, one that provides supplementary | ||||
| information to a Conference through slides, video, still images, | ||||
data, etc. Where more information is known about the Capture, it <bcp14>MAY</bcp14>
be expanded hierarchically to indicate the different types of
| presentation Media, e.g., presentation.slides, presentation.image, | ||||
| etc.</t> | ||||
| <t> | ||||
| Note: It is expected that a number of keywords will be defined that | ||||
provide more detail on the type of presentation. Refer to <xref target="RFC8846" format="default"/> for how to extend the model.</t>
| </section> | ||||
| <section anchor="s-7.1.1.8" numbered="true" toc="default"> | ||||
| <name>View</name> | ||||
| <t> | ||||
| The View attribute is a field with enumerated values, indicating | ||||
| what type of view the Capture relates to. The Consumer can use | ||||
| this information to help choose which Media Captures it wishes to | ||||
| receive. Possible values are as follows:</t> | ||||
| <dl newline="false" spacing="normal" indent="12"> | ||||
| <dt>Room:</dt> | ||||
| <dd>Captures the entire Scene | ||||
| </dd> | ||||
| <dt>Table:</dt> | ||||
| <dd>Captures the conference table with seated people | ||||
| </dd> | ||||
| <dt>Individual:</dt> | ||||
| <dd>Captures an individual person</dd> | ||||
| <dt>Lectern:</dt> | ||||
| <dd>Captures the region of the lectern including the | ||||
| presenter, for example, in a classroom-style conference room | ||||
| </dd> | ||||
| <dt>Audience:</dt> | ||||
<dd>Captures a region showing the audience in a classroom-style
conference room
</dd>
| </dl> | ||||
| </section> | ||||
| <section anchor="s-7.1.1.9" numbered="true" toc="default"> | ||||
| <name>Language</name> | ||||
| <t> | ||||
| The Language attribute indicates one or more languages used in the | ||||
| content of the Media Capture. Captures <bcp14>MAY</bcp14> be offered in diff | ||||
| erent | ||||
| languages in case of multilingual and/or accessible Conferences. A | ||||
| Consumer can use this attribute to differentiate between them and | ||||
| pick the appropriate one.</t> | ||||
| <t> | ||||
| Note that the Language attribute is defined and meaningful both for | ||||
| Audio and Video Captures. In case of Audio Captures, the meaning | ||||
| is obvious. For a Video Capture, "Language" could, for example, be | ||||
| sign interpretation or text.</t> | ||||
| <t> | ||||
The Language attribute is coded per <xref target="RFC5646" format="default"/>.</t>
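<t>
A Consumer's language-based selection can be sketched as below.
CLUE does not mandate a matching algorithm; full BCP 47 matching
(RFC 4647) is richer than this primary-subtag comparison, and the
function names and capture list shape are assumptions for
illustration.
</t>
<sourcecode type="python"><![CDATA[
```python
# Minimal sketch of Consumer-side language selection over advertised
# Captures. Only the primary subtag of each RFC 5646 tag is compared;
# real implementations would use full BCP 47 matching (RFC 4647).

def primary_subtag(tag):
    """First subtag of an RFC 5646 language tag, case-insensitive."""
    return tag.split("-")[0].lower()

def pick_capture(captures, preferred):
    """captures: list of (capture_id, language_tag) pairs (assumed shape).
    Returns the first Capture whose primary subtag matches, else None."""
    want = primary_subtag(preferred)
    for capture_id, tag in captures:
        if primary_subtag(tag) == want:
            return capture_id
    return None

advertised = [("AC1", "en-US"), ("AC2", "fr"), ("VC3", "sgn-BE-FR")]
assert pick_capture(advertised, "en-GB") == "AC1"
assert pick_capture(advertised, "de") is None
```
]]></sourcecode>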
| </section> | ||||
| <section anchor="s-7.1.1.10" numbered="true" toc="default"> | ||||
| <name>Person Information</name> | ||||
| <t> | ||||
| The Person Information attribute allows a Provider to provide | ||||
| specific information regarding the people in a Capture (regardless | ||||
| of whether or not the Capture has a Presentation attribute). The | ||||
| Provider may gather the information automatically or manually from | ||||
a variety of sources; however, the xCard <xref target="RFC6351" format="default"/> format is used to
convey the information. This allows various information, such as
Identification information (<xref section="6.2" sectionFormat="of" target="RFC6350" format="default"/>), Communication
Information (<xref section="6.4" sectionFormat="of" target="RFC6350" format="default"/>), and Organizational information
(<xref section="6.6" sectionFormat="of" target="RFC6350" format="default"/>),
| to be communicated. A Consumer may then | ||||
| automatically (i.e., via a policy) or manually select Captures | ||||
| based on information about who is in a Capture. It also allows a | ||||
| Consumer to Render information regarding the people participating | ||||
| in the Conference or to use it for further processing.</t> | ||||
| <t> | ||||
The Provider may supply a minimal set of information or a larger
set of information. However, it <bcp14>MUST</bcp14> be compliant with <xref target="RFC6350" format="default"/> and
supply a "VERSION" and "FN" property. A Provider may supply
multiple xCards per Capture of any KIND (<xref section="6.1.4" sectionFormat="of" target="RFC6350" format="default"/>).</t>
| <t> | ||||
In order to keep CLUE messages compact, the Provider <bcp14>SHOULD</bcp14> use a
| URI to point to any LOGO, PHOTO, or SOUND contained in the xCard | ||||
| rather than transmitting the LOGO, PHOTO, or SOUND data in a CLUE | ||||
| message.</t> | ||||
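<t>
The Provider-side rules above can be sketched as a simple check.
This is a simplification for illustration: a dict stands in for the
real RFC 6351 XML encoding, and the function name is not part of
CLUE.
</t>
<sourcecode type="python"><![CDATA[
```python
# Simplified sketch: every xCard sent in a CLUE message must carry
# "VERSION" and "FN" properties, and bulky LOGO/PHOTO/SOUND content
# should be a URI reference rather than inline data.

REQUIRED = {"VERSION", "FN"}
MEDIA_PROPS = {"LOGO", "PHOTO", "SOUND"}

def check_xcard(props):
    """Return a list of problems for an xCard given as {property: value}."""
    problems = [f"missing required property {p}"
                for p in sorted(REQUIRED - props.keys())]
    for p in MEDIA_PROPS & props.keys():
        if not str(props[p]).startswith(("http://", "https://")):
            problems.append(f"{p} should be a URI, not inline data")
    return problems

card = {"VERSION": "4.0", "FN": "A. Person",
        "PHOTO": "https://example.com/a-person.jpg"}
assert check_xcard(card) == []
assert check_xcard({"FN": "B. Person"}) == ["missing required property VERSION"]
```
]]></sourcecode>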
| </section> | ||||
| <section anchor="s-7.1.1.11" numbered="true" toc="default"> | ||||
| <name>Person Type</name> | ||||
| <t> | ||||
| The Person Type attribute indicates the type of people contained in | ||||
| the Capture with respect to the meeting agenda (regardless of | ||||
| whether or not the Capture has a Presentation attribute). As a | ||||
| Capture may include multiple people, the attribute may contain | ||||
multiple values. However, values <bcp14>MUST NOT</bcp14> be repeated within the
attribute.</t>
| <t> | ||||
| An Advertiser associates the person type with an individual Capture | ||||
| when it knows that a particular type is in the Capture. If an | ||||
| Advertiser cannot link a particular type with some certainty to a | ||||
| Capture, then it is not included. On reception of a | ||||
Capture with a Person Type attribute, a Consumer knows with some certainty that
| the Capture contains that person type. The Capture may contain | ||||
| other person types, but the Advertiser has not been able to | ||||
| determine that this is the case.</t> | ||||
| <t>The types of Captured people include: | ||||
| </t> | ||||
| <dl newline="false" spacing="normal" indent="15"> | ||||
| <dt>Chair:</dt> | ||||
| <dd>the person responsible for running the meeting | ||||
| according to the agenda.</dd> | ||||
| <dt>Vice-Chair:</dt> | ||||
| <dd>the person responsible for assisting the chair in | ||||
| running the meeting.</dd> | ||||
| <dt>Minute Taker:</dt> | ||||
| <dd>the person responsible for recording the | ||||
| minutes of the meeting.</dd> | ||||
<dt>Attendee:</dt>
<dd>a person with no particular responsibilities with
respect to running the meeting.</dd>
| <dt>Observer:</dt> | ||||
| <dd>an Attendee without the right to influence the | ||||
| discussion.</dd> | ||||
| <dt>Presenter:</dt> | ||||
| <dd>the person scheduled on the agenda to make a | ||||
| presentation in the meeting. Note: This is not related to any | ||||
| "active speaker" functionality.</dd> | ||||
| <dt>Translator:</dt> | ||||
| <dd>the person providing some form of translation | ||||
| or commentary in the meeting.</dd> | ||||
| <dt>Timekeeper:</dt> | ||||
| <dd>the person responsible for maintaining the | ||||
| meeting schedule.</dd> | ||||
| </dl> | ||||
| <t> | ||||
| Furthermore, the Person Type attribute may contain one or more | ||||
| strings allowing the Provider to indicate custom meeting-specific | ||||
| types.</t> | ||||
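<t>
The Person Type rules above (defined tokens plus custom strings, no
repeated values) can be sketched as follows; the function names are
illustrative, not part of the CLUE data model.
</t>
<sourcecode type="python"><![CDATA[
```python
# Sketch of a Person Type validity check: values come from the set
# defined in this section or are custom meeting-specific strings, and
# values MUST NOT be repeated within the attribute.

DEFINED_TYPES = {"chair", "vice-chair", "minute taker", "attendee",
                 "observer", "presenter", "translator", "timekeeper"}

def valid_person_types(values):
    """True if no value (defined token or custom string) repeats."""
    lowered = [v.lower() for v in values]
    return len(lowered) == len(set(lowered))

def custom_types(values):
    """Values that are Provider-defined rather than from the defined set."""
    return [v for v in values if v.lower() not in DEFINED_TYPES]

assert valid_person_types(["chair", "presenter"])
assert not valid_person_types(["attendee", "Attendee"])  # repeat rejected
assert custom_types(["Chair", "keynote-guest"]) == ["keynote-guest"]
```
]]></sourcecode>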
| </section> | ||||
| <section anchor="s-7.1.1.12" numbered="true" toc="default"> | ||||
| <name>Priority</name> | ||||
| <t> | ||||
| The Priority attribute indicates a relative priority between | ||||
| different Media Captures. The Provider sets this priority, and the | ||||
| Consumer <bcp14>MAY</bcp14> use the priority to help decide which Captures it | ||||
| wishes to receive.</t> | ||||
| <t> | ||||
| The Priority attribute is an integer that indicates a relative | ||||
| priority between Captures. For example, it is possible to assign a | ||||
| priority between two presentation Captures that would allow a | ||||
| remote Endpoint to determine which presentation is more important. | ||||
| Priority is assigned at the individual Capture level. It represents | ||||
| the Provider's view of the relative priority between Captures with | ||||
a priority. The same priority number <bcp14>MAY</bcp14> be used across multiple
Captures, indicating that they are equally important. If no priority
is assigned, no assumptions regarding the relative importance of the
Capture can be made.</t>
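<t>
One way a Consumer might use the Priority attribute is sketched
below. Whether a lower or a higher number means "more important" is
not stated in this excerpt; the sketch assumes lower value means
higher priority, and the function name is illustrative.
</t>
<sourcecode type="python"><![CDATA[
```python
# Sketch of Consumer-side use of Priority. Captures are (id, priority)
# pairs where priority is an int or None. Ties share importance;
# None means no relative importance can be assumed, so such Captures
# are kept aside rather than ranked. Lower value = higher priority is
# an assumption of this sketch, not a CLUE rule.

def rank_captures(captures):
    """Return (ranked, unranked): ranked sorted by ascending priority,
    unranked for Captures carrying no priority."""
    ranked = sorted((c for c in captures if c[1] is not None),
                    key=lambda c: c[1])
    unranked = [c for c in captures if c[1] is None]
    return ranked, unranked

adv = [("VC1", 2), ("VC2", 1), ("VC3", None), ("VC4", 1)]
ranked, unranked = rank_captures(adv)
assert [c[0] for c in ranked] == ["VC2", "VC4", "VC1"]
assert [c[0] for c in unranked] == ["VC3"]
```
]]></sourcecode>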
| </section> | ||||
| <section anchor="s-7.1.1.13" numbered="true" toc="default"> | ||||
| <name>Embedded Text</name> | ||||
| <t> | ||||
| The Embedded Text attribute indicates that a Capture provides | ||||
| embedded textual information. For example, the Video Capture may | ||||
| contain speech-to-text information composed with the video image.</t> | ||||
| </section> | ||||
| <section anchor="s-7.1.1.14" numbered="true" toc="default"> | ||||
| <name>Related To</name> | ||||
| <t> | ||||
| The Related To attribute indicates the Capture contains additional | ||||
| complementary information related to another Capture. The value | ||||
| indicates the identity of the other Capture to which this Capture | ||||
| is providing additional information.</t> | ||||
| <t> | ||||
| For example, a Conference can utilize translators or facilitators | ||||
| that provide an additional audio Stream (i.e., a translation or | ||||
| description or commentary of the Conference). Where multiple | ||||
| Captures are available, it may be advantageous for a Consumer to | ||||
| select a complementary Capture instead of or in addition to a | ||||
| Capture it relates to.</t> | ||||
| </section> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-7.2" numbered="true" toc="default"> | ||||
| <name>Multiple Content Capture</name> | ||||
| <t> | ||||
| The MCC indicates that one or more Single Media Captures are | ||||
| multiplexed (temporally and/or spatially) or mixed in one Media | ||||
Capture. Only one Capture type (e.g., audio or video) is
| allowed in each MCC instance. The MCC may contain a reference to | ||||
| the Single Media Captures (which may have their own attributes) as | ||||
| well as attributes associated with the MCC itself. An MCC may also | ||||
contain other MCCs. The MCC <bcp14>MAY</bcp14> reference Captures from within the
| Capture Scene that defines it or from other Capture Scenes. No | ||||
| ordering is implied by the order that Captures appear within an MCC. | ||||
| An MCC <bcp14>MAY</bcp14> contain no references to other Captures to indicate | ||||
| that | ||||
| the MCC contains content from multiple sources, but no information | ||||
| regarding those sources is given. MCCs either contain the | ||||
| referenced Captures and no others or have no referenced Captures | ||||
| and, therefore, may contain any Capture.</t> | ||||
| <t> | ||||
| One or more MCCs may also be specified in a CSV. This allows an | ||||
| Advertiser to indicate that several MCC Captures are used to | ||||
represent a Capture Scene. <xref target="ref-advertisement-sent-to-endpoint-f-two-encodings" format="default"/> provides an example of this
case.</t>
| <t> | ||||
As outlined in <xref target="s-7.1" format="default"/>, each instance of the MCC has its own
Capture identity, e.g., MCC1. This allows all the individual Captures
contained in the MCC to be referenced by a single MCC identity.</t>
| <t>The example below shows the use of a Multiple Content Capture:</t> | ||||
| <table anchor="ref-multiple-content-capture-concept" align="center"> | ||||
| <name>Multiple Content Capture Concept</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> </th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">{MC attributes}</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left">{MC attributes}</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC3</td> | ||||
| <td align="left">{MC attributes}</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC1(VC1,VC2,VC3)</td> | ||||
| <td align="left">{MC and MCC attributes}</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| This indicates that MCC1 is a single Capture that contains the | ||||
| Captures VC1, VC2, and VC3, according to any MCC1 attributes.</t> | ||||
| <section anchor="s-7.2.1" numbered="true" toc="default"> | ||||
| <name>MCC Attributes</name> | ||||
| <t> | ||||
| Media Capture attributes may be associated with the MCC instance | ||||
| and the Single Media Captures that the MCC references. A Provider | ||||
should avoid providing conflicting attribute values between the MCC
and Single Media Captures. Where there is a conflict, the attributes
of the MCC override any that may be present in the individual
Captures.</t>
| <t> | ||||
A Provider <bcp14>MAY</bcp14> include as much or as little of the original source
| Capture information as it requires.</t> | ||||
| <t> | ||||
| There are MCC-specific attributes that <bcp14>MUST</bcp14> only be used with | ||||
| Multiple Content Captures. These are described in the sections | ||||
| below. The attributes described in <xref target="s-7.1.1" format="default"/> | ||||
| <bcp14>MAY</bcp14> also be used | ||||
| with MCCs.</t> | ||||
| <t> | ||||
| The spatial-related attributes of an MCC indicate its Area of | ||||
| Capture and Point of Capture within the Scene, just like any other | ||||
| Media Capture. The spatial information does not imply anything | ||||
| about how other Captures are composed within an MCC.</t> | ||||
<t>For example, a virtual Scene could be constructed for the MCC
Capture with two Video Captures, with a MaxCaptures attribute set
| to 2 and an Area of Capture attribute provided with an overall | ||||
| area. Each of the individual Captures could then also include an | ||||
| Area of Capture attribute with a subset of the overall area. | ||||
| The Consumer would then know how each Capture is related to others | ||||
| within the Scene, but not the relative position of the individual | ||||
| Captures within the composed Capture. | ||||
| </t> | ||||
| <table anchor="table_2"> | ||||
| <name>Example of MCC and Single Media Capture Attributes</name> | ||||
| <thead> | ||||
| <tr><th align="left">Capture Scene #1</th><th/></tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td>VC1</td> | ||||
| <td align="right"> | ||||
| <ul empty="true" spacing="compact"> | ||||
| <li>AreaofCapture=(0,0,0)(9,0,0)</li> | ||||
| <li>(0,0,9)(9,0,9)</li> | ||||
| </ul> | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC2</td> | ||||
| <td align="right"> | ||||
| <ul empty="true" spacing="compact"> | ||||
| <li>AreaofCapture=(10,0,0)(19,0,0)</li> | ||||
| <li>(10,0,9)(19,0,9)</li> | ||||
| </ul> | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC1(VC1,VC2)</td> | ||||
| <td align="right"> | ||||
| <ul empty="true" spacing="compact"> | ||||
| <li>MaxCaptures=2</li> | ||||
| <li>AreaofCapture=(0,0,0)(19,0,0)</li> | ||||
| <li>(0,0,9)(19,0,9)</li> | ||||
| </ul> | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(MCC1)</td> | ||||
| <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
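<t>
The spatial relationship in the table above (each referenced
Capture's Area of Capture lying within the MCC's overall area) can
be sketched with a simple bounding-box containment test. This is an
illustration, not CLUE-mandated processing, and the helper names
are assumptions.
</t>
<sourcecode type="python"><![CDATA[
```python
# Sketch of the containment idea in the example above: each referenced
# Capture's Area of Capture should fall within the MCC's overall area.
# Axis-aligned bounding boxes over the four corner points keep the
# check simple.

def bbox(points):
    """Axis-aligned bounds of a set of (X, Y, Z) points."""
    return tuple((min(p[i] for p in points), max(p[i] for p in points))
                 for i in range(3))

def contains(outer, inner):
    """True if inner's bounding box lies within outer's."""
    ob, ib = bbox(outer), bbox(inner)
    return all(ob[i][0] <= ib[i][0] and ib[i][1] <= ob[i][1]
               for i in range(3))

mcc_area = [(0, 0, 0), (19, 0, 0), (0, 0, 9), (19, 0, 9)]
vc1_area = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9)]
vc2_area = [(10, 0, 0), (19, 0, 0), (10, 0, 9), (19, 0, 9)]
assert contains(mcc_area, vc1_area) and contains(mcc_area, vc2_area)
```
]]></sourcecode>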
| <t> | ||||
| The subsections below describe the MCC-only attributes.</t> | ||||
| <section anchor="s-7.2.1.1" numbered="true" toc="default"> | ||||
<name>MaxCaptures: Maximum Number of Captures within an MCC</name>
| <t> | ||||
| The MaxCaptures attribute indicates the maximum | ||||
| number of individual Captures that may appear in a Capture Encoding | ||||
| at a time. The actual number at any given time can be less than or | ||||
| equal to this maximum. It may be used to derive how the Single | ||||
| Media Captures within the MCC are composed/switched with regard | ||||
| to space and time.</t> | ||||
| <t> | ||||
| A Provider can indicate that the number of Captures in an MCC | ||||
| Capture Encoding is equal ("=") to the MaxCaptures value or that | ||||
there may be any number of Captures up to and including ("&lt;=") the
| MaxCaptures value. This allows a Provider to distinguish between an | ||||
| MCC that purely represents a composition of sources and an MCC | ||||
| that represents switched sources or switched and composed sources.</t> | ||||
| <t> | ||||
| MaxCaptures may be set to one so that only content related to one | ||||
| of the sources is shown in the MCC Capture Encoding at a time, or | ||||
| it may be set to any value up to the total number of Source Media | ||||
| Captures in the MCC.</t> | ||||
| <t> | ||||
| The bullets below describe how the setting of MaxCaptures versus the | ||||
| number of Captures in the MCC affects how sources appear in a | ||||
| Capture Encoding:</t> | ||||
| <ul spacing="normal"> | ||||
| <li>A switched case occurs when | ||||
| MaxCaptures is set to <= 1 and the number of Captures in | ||||
| the MCC is greater than 1 (or not specified) in the MCC. Zero | ||||
| or one Captures may be switched into the Capture Encoding. Note: | ||||
| zero is allowed because of the "<=".</li> | ||||
| <li>A switched case occurs when MaxCaptures is set to = 1 and | ||||
| the number of Captures in the MCC is greater than 1 (or not | ||||
| specified) in the MCC. Only one Capture source is contained in | ||||
| a Capture Encoding at a time.</li> | ||||
| <li>A switched and composed case occurs when MaxCaptures is set | ||||
| to &lt;= N (with N > 1) and the number of Captures in the | ||||
| MCC is greater than N (or not specified). The Capture Encoding | ||||
| may contain purely switched sources (e.g., &lt;= 2 allows for one | ||||
| source on its own), or it may contain composed and switched | ||||
| sources (e.g., a composition of two sources switched between the | ||||
| sources).</li> | ||||
| <li>A switched and composed case occurs when MaxCaptures is set | ||||
| to = N (with N > 1) and the number of Captures in the MCC | ||||
| is greater than N (or not specified). The Capture Encoding | ||||
| contains composed and switched sources (i.e., a composition of | ||||
| N sources switched between the sources). It is not possible to | ||||
| have a single source.</li> | ||||
| <li>A switched and composed case occurs when MaxCaptures is set | ||||
| to &lt;= the number of Captures in the MCC. The Capture | ||||
| Encoding may contain Media switched between any number (up to | ||||
| the MaxCaptures) of composed sources.</li> | ||||
| <li>A composed case occurs when MaxCaptures is set to = the number | ||||
| of Captures in the MCC. All the sources are composed into | ||||
| a single Capture Encoding.</li> | ||||
| </ul> | ||||
| <t> | ||||
| If this attribute is not set, then as a default, it is assumed that all | ||||
| source Media Capture content can appear concurrently in the Capture | ||||
| Encoding associated with the MCC.</t> | ||||
| <t> | ||||
| For example, the use of MaxCaptures equal to 1 on an MCC with three | ||||
| Video Captures, VC1, VC2, and VC3, would indicate that the Advertiser | ||||
| in the Capture Encoding would switch between VC1, VC2, and VC3 as | ||||
| there may be only a maximum of one Capture at a time.</t> | ||||
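The bullet rules above can be summarized in a short sketch. This is an illustrative model only (the function name and return strings are not from RFC 8845); it classifies how sources can appear in an MCC's Capture Encoding given the MaxCaptures operator ("=" or "<=") and value versus the number of Captures referenced by the MCC.

```python
# Hypothetical helper, not part of CLUE: classify MCC behavior per the
# MaxCaptures rules described in the bullets above.

def classify_mcc(num_sources: int, op: str, max_captures: int) -> str:
    """num_sources: Captures referenced by the MCC.
    op: the MaxCaptures operator, "=" or "<=".
    max_captures: the MaxCaptures value."""
    if op not in ("=", "<="):
        raise ValueError("op must be '=' or '<='")
    if max_captures == 1 and num_sources > 1:
        # One source at a time ("="), or zero-or-one ("<=").
        return "switched"
    if max_captures == num_sources:
        # All sources fit at once; "<=" still permits switching subsets.
        return "composed" if op == "=" else "switched and composed"
    if 1 < max_captures < num_sources:
        return "switched and composed"
    return "composed"

print(classify_mcc(3, "=", 1))    # switched
print(classify_mcc(3, "<=", 2))   # switched and composed
print(classify_mcc(3, "=", 3))    # composed
```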
| </section> | ||||
| <section anchor="s-7.2.1.2" numbered="true" toc="default"> | ||||
| <name>Policy</name> | ||||
| <t> | ||||
| The Policy MCC attribute indicates the criteria that the Provider | ||||
| uses to determine when and/or where Media content appears in the | ||||
| Capture Encoding related to the MCC.</t> | ||||
| <t> | ||||
| The attribute is in the form of a token that indicates the policy | ||||
| and an index representing an instance of the policy. The same | ||||
| index value can be used for multiple MCCs.</t> | ||||
| <t> | ||||
| The tokens are as follows: | ||||
| </t> | ||||
| <dl newline="false" spacing="normal"> | ||||
| <dt>SoundLevel:</dt> | ||||
| <dd>This indicates that the content of the MCC is | ||||
| determined by a sound-level-detection algorithm. The loudest | ||||
| (active) speaker (or a previous speaker, depending on the index | ||||
| value) is contained in the MCC.</dd> | ||||
| <dt>RoundRobin:</dt> | ||||
| <dd>This indicates that the content of the MCC is | ||||
| determined by a time-based algorithm. For example, the Provider | ||||
| provides content from a particular source for a period of time and | ||||
| then provides content from another source, and so on.</dd> | ||||
| </dl> | ||||
| <t> | ||||
| An index is used to represent an instance in the policy setting. An | ||||
| index of 0 represents the most current instance of the policy, i.e., | ||||
| the active speaker, 1 represents the previous instance, i.e., the | ||||
| previous active speaker, and so on.</t> | ||||
| <t> | ||||
| The following example shows a case where the Provider provides two | ||||
| Media Streams, one showing the active speaker and a second Stream | ||||
| showing the previous speaker.</t> | ||||
| <table anchor="ref-example-policy-mcc-attribute-usage" align="center"> | ||||
| <name>Example Policy MCC Attribute Usage</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> </th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC1(VC1,VC2)</td> | ||||
| <td align="left">Policy=SoundLevel:0<br/> | ||||
| MaxCaptures=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC2(VC1,VC2)</td> | ||||
| <td align="left">Policy=SoundLevel:1<br/> | ||||
| MaxCaptures=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC1,MCC2)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
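The SoundLevel index scheme in the example above can be sketched as follows. This is a non-normative illustration: the history representation and function names are assumptions, chosen so that index 0 yields the active speaker and index 1 the previous speaker, matching MCC1 and MCC2.

```python
# Illustrative sketch (not defined by RFC 8845): resolving
# Policy=SoundLevel:<index> from a most-recent-first speaker history.

def on_active_speaker(history, speaker):
    """Promote 'speaker' to index 0, demoting earlier speakers."""
    return [speaker] + [s for s in history if s != speaker]

def select_source(history, index):
    """Return the source for SoundLevel:<index>, or None if the
    conference has not yet had that many distinct speakers."""
    return history[index] if index < len(history) else None

history = []
for talker in ["VC1", "VC2", "VC1"]:
    history = on_active_speaker(history, talker)

print(select_source(history, 0))  # VC1 -- MCC1, the active speaker
print(select_source(history, 1))  # VC2 -- MCC2, the previous speaker
```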
| </section> | ||||
| <section anchor="s-7.2.1.3" numbered="true" toc="default"> | ||||
| <name>SynchronizationID: Synchronization Identity</name> | ||||
| <t> | ||||
| The SynchronizationID MCC attribute indicates how the | ||||
| individual Captures in multiple MCC Captures are synchronized. To | ||||
| indicate that the Capture Encodings associated with MCCs contain | ||||
| Captures from the same source at the same time, a Provider should | ||||
| set the same SynchronizationID on each of the concerned | ||||
| MCCs. It is the Provider that determines what the source for the | ||||
| Captures is, so a Provider can choose how to group together Single | ||||
| Media Captures into a combined "source" for the purpose of | ||||
| switching them together to keep them synchronized according to the | ||||
| SynchronizationID attribute. For example, when the Provider is in | ||||
| an MCU, it may determine that each separate CLUE Endpoint is a | ||||
| remote source of Media. The SynchronizationID may be used | ||||
| across Media types, i.e., to synchronize audio- and video-related | ||||
| MCCs.</t> | ||||
| <t> | ||||
| Without this attribute, it is assumed that multiple MCCs may provide | ||||
| content from different sources at any particular point in time.</t> | ||||
| <t>For example: | ||||
| </t> | ||||
| <table anchor="table_4"> | ||||
| <name>Example SynchronizationID MCC Attribute Usage</name> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #1</th> <th/></tr> | ||||
| <tr><td>VC1</td> <td>Description=Left</td></tr> | ||||
| <tr><td>VC2</td> <td>Description=Center</td></tr> | ||||
| <tr><td>VC3</td> <td>Description=Right</td></tr> | ||||
| <tr><td>AC1</td> <td>Description=Room</td></tr> | ||||
| <tr><td>CSV(VC1,VC2,VC3)</td> <td/></tr> | ||||
| <tr><td>CSV(AC1)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #2</th> <th/></tr> | ||||
| <tr><td>VC4</td> <td>Description=Left</td></tr> | ||||
| <tr><td>VC5</td> <td>Description=Center</td></tr> | ||||
| <tr><td>VC6</td> <td>Description=Right</td></tr> | ||||
| <tr><td>AC2</td> <td>Description=Room</td></tr> | ||||
| <tr><td>CSV(VC4,VC5,VC6)</td> <td/></tr> | ||||
| <tr><td>CSV(AC2)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #3</th> <th/></tr> | ||||
| <tr><td>VC7</td> <td/></tr> | ||||
| <tr><td>AC3</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #4</th> <th/></tr> | ||||
| <tr><td>VC8</td> <td/></tr> | ||||
| <tr><td>AC4</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #5</th> <th/></tr> | ||||
| <tr><td>MCC1(VC1,VC4,VC7)</td> <td>SynchronizationID=1<br/>MaxCaptures=1</td></tr> | ||||
| <tr><td>MCC2(VC2,VC5,VC8)</td> <td>SynchronizationID=1<br/>MaxCaptures=1</td></tr> | ||||
| <tr><td>MCC3(VC3,VC6)</td> <td>MaxCaptures=1</td></tr> | ||||
| <tr><td>MCC4(AC1,AC2,AC3,AC4)</td> <td>SynchronizationID=1<br/>MaxCaptures=1</td></tr> | ||||
| <tr><td>CSV(MCC1,MCC2,MCC3)</td> <td/></tr> | ||||
| <tr><td>CSV(MCC4)</td> <td/></tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| The above Advertisement would indicate that MCC1, MCC2, MCC3, and | ||||
| MCC4 make up a Capture Scene. There would be four Capture | ||||
| Encodings (one for each MCC). Because MCC1 and MCC2 have the same | ||||
| SynchronizationID, each Encoding from MCC1 and MCC2, respectively, | ||||
| would together have content from only Capture Scene 1 or only | ||||
| Capture Scene 2 or the combination of VC7 and VC8 at a particular | ||||
| point in time. In this case, the Provider has decided the sources | ||||
| to be synchronized are Scene #1, Scene #2, and Scene #3 and #4 | ||||
| together. The Encoding from MCC3 would not be synchronized with | ||||
| MCC1 or MCC2. As MCC4 also has the same SynchronizationID | ||||
| as MCC1 and MCC2, the content of the audio Encoding will be | ||||
| synchronized with the video content.</t> | ||||
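The switching behavior implied by a shared SynchronizationID can be sketched as below. The data layout and function names are illustrative assumptions, not CLUE structures: MCCs sharing an ID all switch to Captures from the same source group at the same moment, while an MCC without the ID is unconstrained.

```python
# Hypothetical MCU-side sketch: MCCs with the same SynchronizationID
# switch together to content from one source group.

mccs = {
    "MCC1": {"sync_id": 1, "sources": {"scene1": "VC1", "scene2": "VC4"}},
    "MCC2": {"sync_id": 1, "sources": {"scene1": "VC2", "scene2": "VC5"}},
    "MCC3": {"sync_id": None, "sources": {"scene1": "VC3", "scene2": "VC6"}},
    "MCC4": {"sync_id": 1, "sources": {"scene1": "AC1", "scene2": "AC2"}},
}

def switch_to(active_group):
    """Pick the Capture each MCC sends when 'active_group' becomes active.
    Synchronized MCCs follow the active group; unsynchronized MCCs keep
    an independent choice (here simply their first listed source)."""
    out = {}
    for name, mcc in mccs.items():
        if mcc["sync_id"] is not None:
            out[name] = mcc["sources"].get(active_group)
        else:
            out[name] = next(iter(mcc["sources"].values()))
    return out

result = switch_to("scene2")
print(result["MCC1"], result["MCC2"], result["MCC4"])  # VC4 VC5 AC2
```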
| </section> | ||||
| <section anchor="s-7.2.1.4" numbered="true" toc="default"> | ||||
| <name>Allow Subset Choice</name> | ||||
| <t> | ||||
| The Allow Subset Choice MCC attribute is a boolean value, | ||||
| indicating whether or not the Provider allows the Consumer to | ||||
| choose a specific subset of the Captures referenced by the MCC. | ||||
| If this attribute is true, and the MCC references other Captures, | ||||
| then the Consumer <bcp14>MAY</bcp14> select (in a Configure message) a | ||||
| specific subset of those Captures to be included in the MCC, and the | ||||
| Provider <bcp14>MUST</bcp14> then include only that subset. If this | ||||
| attribute is false, or the MCC does not reference other Captures, then | ||||
| the Consumer <bcp14>MUST NOT</bcp14> select a subset.</t> | ||||
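A minimal validity check for this rule might look like the following sketch. The function name and argument shapes are assumptions for illustration, not CLUE message syntax.

```python
# Illustrative check of the Allow Subset Choice rule: a subset request in
# a Configure is only permissible when the attribute is true and the MCC
# actually references other Captures.

def validate_subset_choice(mcc_sources, allow_subset, requested):
    """mcc_sources: Captures the MCC references (may be empty).
    allow_subset: the boolean Allow Subset Choice attribute.
    requested: the Consumer's chosen subset, or None for no choice."""
    if requested is None:
        return True                       # no subset selected: always fine
    if not allow_subset or not mcc_sources:
        return False                      # Consumer MUST NOT select a subset
    return len(requested) > 0 and set(requested) <= set(mcc_sources)

assert validate_subset_choice(["VC1", "VC2", "VC3"], True, ["VC1", "VC3"])
assert not validate_subset_choice(["VC1", "VC2"], False, ["VC1"])
```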
| </section> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-7.3" numbered="true" toc="default"> | ||||
| <name>Capture Scene</name> | ||||
| <t> | ||||
| In order for a Provider's individual Captures to be used | ||||
| effectively by a Consumer, the Provider organizes the Captures into | ||||
| one or more Capture Scenes, with the structure and contents of | ||||
| these Capture Scenes being sent from the Provider to the Consumer | ||||
| in the Advertisement.</t> | ||||
| <t> | ||||
| A Capture Scene is a structure representing a spatial region | ||||
| containing one or more Capture Devices, each capturing Media | ||||
| representing a portion of the region. A Capture Scene includes one | ||||
| or more Capture Scene Views (CSVs), with each CSV including one or | ||||
| more Media Captures of the same Media type. There can also be | ||||
| Media Captures that are not included in a CSV. A | ||||
| Capture Scene represents, for example, the video image of a group | ||||
| of people seated next to each other, along with the sound of their | ||||
| voices, which could be represented by some number of VCs and ACs in | ||||
| the CSVs. An MCU can also describe in Capture | ||||
| Scenes what it constructs from Media Streams it receives.</t> | ||||
| <t> | ||||
| A Provider <bcp14>MAY</bcp14> advertise one or more Capture Scenes. What | ||||
| constitutes an entire Capture Scene is up to the Provider. A | ||||
| simple Provider might typically use one Capture Scene for | ||||
| participant Media (live video from the room cameras) and another | ||||
| Capture Scene for a computer-generated presentation. In more-complex | ||||
| systems, the use of additional Capture Scenes is also | ||||
| sensible. For example, a classroom may advertise two Capture | ||||
| Scenes involving live video: one including only the camera | ||||
| capturing the instructor (and associated audio), the other | ||||
| including camera(s) capturing students (and associated audio).</t> | ||||
| <t> | ||||
| A Capture Scene <bcp14>MAY</bcp14> (and typically will) include more | ||||
| than one type of Media. For example, a Capture Scene can include | ||||
| several CSVs for Video Captures and several CSVs for Audio Captures. | ||||
| A particular Capture <bcp14>MAY</bcp14> be included in more than | ||||
| one CSV.</t> | ||||
| <t> | ||||
| A Provider <bcp14>MAY</bcp14> express Spatial Relationships between | ||||
| Captures that | ||||
| are included in the same Capture Scene. However, there is no | ||||
| Spatial Relationship between Media Captures from different Capture | ||||
| Scenes. In other words, Capture Scenes each use their own spatial | ||||
| measurement system as outlined in <xref target="s-6" format="default"/>.</t> | ||||
| <t> | ||||
| A Provider arranges Captures in a Capture Scene to help the | ||||
| Consumer choose which Captures it wants to Render. The CSVs | ||||
| in a Capture Scene are different alternatives the | ||||
| Provider is suggesting for representing the Capture Scene. Each | ||||
| CSV is given an advertisement-unique identity. The | ||||
| order of CSVs within a Capture Scene has no | ||||
| significance. The Media Consumer can choose to receive all Media | ||||
| Captures from one CSV for each Media type (e.g., | ||||
| audio and video), or it can pick and choose Media Captures | ||||
| regardless of how the Provider arranges them in CSVs. | ||||
| Different CSVs of the same Media type are | ||||
| not necessarily mutually exclusive alternatives. Also note that | ||||
| the presence of multiple CSVs (with potentially | ||||
| multiple Encoding options in each view) in a given Capture Scene | ||||
| does not necessarily imply that a Provider is able to serve all the | ||||
| associated Media simultaneously (although the construction of such | ||||
| an over-rich Capture Scene is probably not sensible in many cases). | ||||
| What a Provider can send simultaneously is determined through the | ||||
| Simultaneous Transmission Set mechanism, described in | ||||
| <xref target="s-8" format="default"/>.</t> | ||||
| <t> | ||||
| Captures within the same CSV <bcp14>MUST</bcp14> be of the same | ||||
| Media type -- it is not possible to mix audio and Video Captures in | ||||
| the same CSV, for instance. The Provider <bcp14>MUST</bcp14> be | ||||
| capable of encoding and sending all Captures (that have an Encoding | ||||
| Group) in a single CSV simultaneously. The order of | ||||
| Captures within a CSV has no significance. A | ||||
| Consumer can decide to receive all the Captures in a single CSV, | ||||
| but a Consumer could also decide to receive just a | ||||
| subset of those Captures. A Consumer can also decide to receive | ||||
| Captures from different CSVs, all subject to the | ||||
| constraints set by Simultaneous Transmission Sets, as discussed in | ||||
| <xref target="s-8" format="default"/>.</t> | ||||
| <t> | ||||
| When a Provider advertises a Capture Scene with multiple CSVs, it | ||||
| is essentially signaling that there are multiple representations of | ||||
| the same Capture Scene available. In some cases, these multiple | ||||
| views would be used simultaneously (for instance, a "video view" and | ||||
| an "audio view"). In some cases, the views would conceptually be | ||||
| alternatives (for instance, a view consisting of three Video | ||||
| Captures covering the whole room versus a view consisting of just a | ||||
| single Video Capture covering only the center of a room). In this | ||||
| latter example, one sensible choice for a Consumer would be to | ||||
| indicate (through its Configure and possibly through an additional | ||||
| offer/answer exchange) the Captures of that CSV that | ||||
| most closely matched the Consumer's number of display devices or | ||||
| screen layout.</t> | ||||
| <t> | ||||
| The following is an example of four potential CSVs for | ||||
| an Endpoint-style Provider:</t> | ||||
| <ol spacing="normal" type="1"> | ||||
| <li>(VC0, VC1, VC2) - left, center, and right camera Video Captures</li> | ||||
| <li>(MCC3) - Video Capture associated with loudest room segment</li> | ||||
| <li>(VC4) - Video Capture zoomed-out view of all people in the room</li> | ||||
| <li>(AC0) - main audio</li> | ||||
| </ol> | ||||
| <t> | ||||
| The first view in this Capture Scene example is a list of Video | ||||
| Captures that have a Spatial Relationship to each other. | ||||
| Determination of the order of these Captures (VC0, VC1, and VC2) for | ||||
| rendering purposes is accomplished through use of their Area of | ||||
| Capture attributes. The second view (MCC3) and the third view | ||||
| (VC4) are alternative representations of the same room's video, | ||||
| which might be better suited to some Consumers' rendering | ||||
| capabilities. The inclusion of the Audio Capture in the same | ||||
| Capture Scene indicates that AC0 is associated with all of those | ||||
| Video Captures, meaning it comes from the same spatial region. | ||||
| Therefore, if audio were to be Rendered at all, this audio would be | ||||
| the correct choice, irrespective of which Video Captures were | ||||
| chosen.</t> | ||||
| <section anchor="s-7.3.1" numbered="true" toc="default"> | ||||
| <name>Capture Scene Attributes</name> | ||||
| <t> | ||||
| Capture Scene attributes can be applied to Capture Scenes as well | ||||
| as to individual Media Captures. Attributes specified at this | ||||
| level apply to all constituent Captures. Capture Scene attributes | ||||
| include the following:</t> | ||||
| <ul spacing="normal"> | ||||
| <li>Human-readable description of the Capture Scene, which could | ||||
| be in multiple languages;</li> | ||||
| <li>xCard Scene information;</li> | ||||
| <li>Scale information ("Millimeters", "Unknown Scale", "No Scale"), | ||||
| as described in <xref target="s-6" format="default"/>.</li> | ||||
| </ul> | ||||
| <section anchor="s-7.3.1.1" numbered="true" toc="default"> | ||||
| <name>Scene Information</name> | ||||
| <t> | ||||
| The Scene Information attribute provides information regarding the | ||||
| Capture Scene rather than individual participants. The Provider | ||||
| may gather the information automatically or manually from a | ||||
| variety of sources. The Scene Information attribute allows a | ||||
| Provider to indicate information such as organizational or | ||||
| geographic information allowing a Consumer to determine which | ||||
| Capture Scenes are of interest in order to then perform Capture | ||||
| selection. It also allows a Consumer to Render information | ||||
| regarding the Scene or to use it for further processing.</t> | ||||
| <t> | ||||
| As per <xref target="s-7.1.1.10" format="default"/>, the xCard format | ||||
| is used to convey this information, and the Provider may supply a | ||||
| minimal or a larger set of information.</t> | ||||
| <t> | ||||
| In order to keep CLUE messages compact, the Provider | ||||
| <bcp14>SHOULD</bcp14> use a URI to point to any LOGO, PHOTO, or SOUND | ||||
| contained in the xCard rather than transmitting the LOGO, PHOTO, or | ||||
| SOUND data in a CLUE message.</t> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-7.3.2" numbered="true" toc="default"> | ||||
| <name>Capture Scene View Attributes</name> | ||||
| <t> | ||||
| A Capture Scene can include one or more CSVs in | ||||
| addition to the Capture-Scene-wide attributes described above. | ||||
| CSV attributes apply to the CSV as a | ||||
| whole, i.e., to all Captures that are part of the CSV. | ||||
| </t> | ||||
| <t>CSV attributes include the following: | ||||
| </t> | ||||
| <ul spacing="normal"> | ||||
| <li>A human-readable description (which could be in multiple | ||||
| languages) of the CSV.</li> | ||||
| </ul> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-7.4" numbered="true" toc="default"> | ||||
| <name>Global View List</name> | ||||
| <t> | ||||
| An Advertisement can include an optional Global View list. Each | ||||
| item in this list is a Global View. The Provider can include | ||||
| multiple Global Views, to allow a Consumer to choose sets of | ||||
| Captures appropriate to its capabilities or application. The | ||||
| choice of how to make these suggestions in the Global View list | ||||
| for what represents all the Scenes for which the Provider can send | ||||
| Media is up to the Provider. This is very similar to how each CSV | ||||
| represents a particular Scene.</t> | ||||
| <t> | ||||
| As an example, suppose an Advertisement has three Scenes, and each | ||||
| Scene has three CSVs, ranging from one to three Video Captures in | ||||
| each CSV. The Provider is advertising a total of nine Video | ||||
| Captures across three Scenes. The Provider can use the Global | ||||
| View list to suggest alternatives for Consumers that can't receive | ||||
| all nine Video Captures as separate Media Streams. For | ||||
| accommodating a Consumer that wants to receive three Video | ||||
| Captures, a Provider might suggest a Global View containing just a | ||||
| single CSV with three Captures and nothing from the other two | ||||
| Scenes. Or a Provider might suggest a Global View containing | ||||
| three different CSVs, one from each Scene, with a single Video | ||||
| Capture in each.</t> | ||||
| <t>Some additional rules: | ||||
| </t> | ||||
| <ul spacing="normal"> | ||||
| <li>The ordering of Global Views in the Global View list is | ||||
| insignificant.</li> | ||||
| <li>The ordering of CSVs within each Global View is | ||||
| insignificant.</li> | ||||
| <li>A particular CSV may be used in multiple Global Views.</li> | ||||
| <li>The Provider must be capable of encoding and sending all | ||||
| Captures within the CSVs of a given Global View | ||||
| simultaneously.</li> | ||||
| </ul> | ||||
| <t> | ||||
| The following figure shows an example of the structure of Global | ||||
| Views in a Global View List.</t> | ||||
| <figure anchor="ref-global-view-list-structure"> | ||||
| <name>Global View List Structure</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| ........................................................ | ||||
| . Advertisement . | ||||
| . . | ||||
| . +--------------+ +-------------------------+ . | ||||
| . |Scene 1 | |Global View List | . | ||||
| . | | | | . | ||||
| . | CSV1 (v)<----------------- Global View (CSV 1) | . | ||||
| . | <-------. | | . | ||||
| . | | *--------- Global View (CSV 1,5) | . | ||||
| . | CSV2 (v) | | | | . | ||||
| . | | | | | . | ||||
| . | CSV3 (v)<---------*------- Global View (CSV 3,5) | . | ||||
| . | | | | | | . | ||||
| . | CSV4 (a)<----------------- Global View (CSV 4) | . | ||||
| . | <-----------. | | . | ||||
| . +--------------+ | | *----- Global View (CSV 4,6) | . | ||||
| . | | | | | . | ||||
| . +--------------+ | | | +-------------------------+ . | ||||
| . |Scene 2 | | | | . | ||||
| . | | | | | . | ||||
| . | CSV5 (v)<-------' | | . | ||||
| . | <---------' | . | ||||
| . | | | (v) = video . | ||||
| . | CSV6 (a)<-----------' (a) = audio . | ||||
| . | | . | ||||
| . +--------------+ . | ||||
| `......................................................' | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-8" numbered="true" toc="default"> | ||||
| <name>Simultaneous Transmission Set Constraints</name> | ||||
| <t> | ||||
| In many practical cases, a Provider has constraints or limitations | ||||
| on its ability to send Captures simultaneously. One type of | ||||
| limitation is caused by the physical limitations of capture | ||||
| mechanisms; these constraints are represented by a Simultaneous | ||||
| Transmission Set. The second type of limitation reflects the | ||||
| encoding resources available, such as bandwidth or video encoding | ||||
| throughput (macroblocks/second). This type of constraint is | ||||
| captured by Individual Encodings and Encoding Groups, discussed | ||||
| below.</t> | ||||
| <t> | ||||
| Some Endpoints or MCUs can send multiple Captures simultaneously; | ||||
| however, sometimes there are constraints that limit which Captures | ||||
| can be sent simultaneously with other Captures. A device may not | ||||
| be able to be used in different ways at the same time. Provider | ||||
| Advertisements are made so that the Consumer can choose one of | ||||
| several possible mutually exclusive usages of the device. This | ||||
| type of constraint is expressed in a Simultaneous Transmission Set, | ||||
| which lists all the Captures of a particular Media type (e.g., | ||||
| audio, video, or text) that can be sent at the same time. There are | ||||
| different Simultaneous Transmission Sets for each Media type in the | ||||
| Advertisement. This is easier to show in an example.</t> | ||||
| <t> | ||||
| Consider the example of a room system where there are three cameras, | ||||
| each of which can send a separate Capture covering two people | ||||
| each: VC0, VC1, and VC2. The middle camera can also zoom out (using an | ||||
| optical zoom lens) and show all six people, VC3. But the middle | ||||
| camera cannot be used in both modes at the same time; it has to | ||||
| either show the space where two participants sit or the whole six | ||||
| seats, but not both at the same time. As a result, VC1 and VC3 | ||||
| cannot be sent simultaneously.</t> | ||||
| <t> | ||||
| Simultaneous Transmission Sets are expressed as sets of the Media | ||||
| Captures that the Provider could transmit at the same time (though, | ||||
| in some cases, it is not intuitive to do so). If a Multiple | ||||
| Content Capture is included in a Simultaneous Transmission Set, it | ||||
| indicates that the Capture Encoding associated with it could be | ||||
| transmitted at the same time as the other Captures within the | ||||
| Simultaneous Transmission Set. It does not imply that the Single | ||||
| Media Captures contained in the Multiple Content Capture could all | ||||
| be transmitted at the same time.</t> | ||||
| <t> | ||||
| In this example, the two Simultaneous Transmission Sets are shown in | ||||
| <xref target="ref-two-simultaneous-transmission-sets" format="default"/>. If | ||||
| a Provider advertises one or more mutually exclusive | ||||
| Simultaneous Transmission Sets, then, for each Media type, the | ||||
| Consumer <bcp14>MUST</bcp14> ensure that it chooses Media Captures | ||||
| that lie wholly within one of those Simultaneous Transmission | ||||
| Sets.</t> | ||||
| <table anchor="ref-two-simultaneous-transmission-sets" align="center"> | ||||
| <name>Two Simultaneous Transmission Sets</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left">Simultaneous Sets</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">{VC0, VC1, VC2}</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">{VC0, VC3, VC2}</td> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
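The constraint above — that a Consumer's choice must fit wholly inside one advertised set — can be sketched as a simple subset test. The set contents come from the camera example in this section; the function name is illustrative.

```python
# Non-normative sketch: check that a Consumer's chosen Video Captures lie
# wholly within at least one Simultaneous Transmission Set.

SIMULTANEOUS_SETS = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]

def choice_is_valid(chosen):
    """The choice MUST fit entirely inside at least one set."""
    return any(set(chosen) <= s for s in SIMULTANEOUS_SETS)

print(choice_is_valid({"VC0", "VC2"}))  # True: fits either set
print(choice_is_valid({"VC1", "VC3"}))  # False: the middle camera cannot
                                        # send VC1 and VC3 at once
```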
| <t> | ||||
| A Provider <bcp14>OPTIONALLY</bcp14> can include the Simultaneous | ||||
| Transmission Sets in its Advertisement. These constraints apply | ||||
| across all the Capture Scenes in the Advertisement. It is a | ||||
| syntax-conformance requirement that the Simultaneous Transmission | ||||
| Sets <bcp14>MUST</bcp14> allow all the Media Captures in any | ||||
| particular CSV to be used simultaneously. Similarly, the Simultaneous | ||||
| Transmission Sets <bcp14>MUST</bcp14> reflect the simultaneity | ||||
| expressed by any Global View.</t> | ||||
| <t> | ||||
| For shorthand convenience, a Provider <bcp14>MAY</bcp14> describe a Simultaneous | ||||
| Transmission Set in terms of CSVs and Capture | ||||
| Scenes. If a CSV is included in a Simultaneous | ||||
| Transmission Set, then all Media Captures in the CSV | ||||
| are included in the Simultaneous Transmission Set. If a Capture | ||||
| Scene is included in a Simultaneous Transmission Set, then all its | ||||
| CSVs (of the corresponding Media type) are included | ||||
| in the Simultaneous Transmission Set. The end result reduces to a | ||||
| set of Media Captures, of a particular Media type, in either case.</t> | ||||
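The shorthand expansion above can be sketched as follows. The dictionary layout is an assumption for illustration (CLUE does not define such a structure), and for brevity the sketch ignores the per-Media-type filtering the text describes: a CSV entry contributes its Captures, and a Capture Scene entry contributes the Captures of all its CSVs.

```python
# Illustrative expansion of a Simultaneous Transmission Set expressed in
# terms of CSVs and Capture Scenes into a flat set of Media Captures.

scenes = {
    "Scene1": {"CSV1": {"VC1", "VC2"}, "CSV2": {"VC3"}},
    "Scene2": {"CSV3": {"VC4"}},
}

def expand(entries):
    captures = set()
    for entry in entries:
        if entry in scenes:                    # whole Capture Scene
            for csv in scenes[entry].values():
                captures |= csv
        else:                                  # individual CSV
            for scene in scenes.values():
                captures |= scene.get(entry, set())
    return captures

print(sorted(expand(["CSV1", "Scene2"])))  # ['VC1', 'VC2', 'VC4']
```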
| <t> | ||||
| If an Advertisement does not include Simultaneous Transmission | ||||
| Sets, then the Provider <bcp14>MUST</bcp14> be able to simultaneously | ||||
| provide all the Captures from any one CSV of each Media type from | ||||
| each Capture Scene. Likewise, if there are no Simultaneous | ||||
| Transmission Sets and there is a Global View list, then the Provider | ||||
| <bcp14>MUST</bcp14> be able to | ||||
| simultaneously provide all the Captures from any particular Global | ||||
| View (of each Media type) from the Global View list.</t> | ||||
| <t> | ||||
| If an Advertisement includes multiple CSVs in a | ||||
| Capture Scene, then the Consumer <bcp14>MAY</bcp14> choose one CSV | ||||
| for each Media type, or it <bcp14>MAY</bcp14> choose individual | ||||
| Captures based on the | ||||
| Simultaneous Transmission Sets.</t> | ||||
| </section> | ||||
| <section anchor="s-9" numbered="true" toc="default"> | ||||
| <name>Encodings</name> | ||||
| <t> | ||||
| Individual Encodings and Encoding Groups are CLUE's mechanisms | ||||
| allowing a Provider to signal its limitations for sending Captures, | ||||
| or combinations of Captures, to a Consumer. Consumers can map the | ||||
| Captures they want to receive onto the Encodings, with the Encoding | ||||
| parameters they want. As for the relationship between the | ||||
| CLUE-specified mechanisms based on Encodings and the SIP offer/answer | ||||
| exchange, please refer to <xref target="s-5" format="default"/>.</t> | ||||
| <section anchor="s-9.1" numbered="true" toc="default"> | ||||
| <name>Individual Encodings</name> | ||||
| <t> | ||||
| An Individual Encoding represents a way to encode a Media Capture | ||||
| as a Capture Encoding, to be sent as an encoded Media Stream from | ||||
| the Provider to the Consumer. An Individual Encoding has a set of | ||||
| parameters characterizing how the Media is encoded.</t> | ||||
| <t> | ||||
| Different Media types have different parameters, and different | ||||
| encoding algorithms may have different parameters. An Individual | ||||
| Encoding can be assigned to at most one Capture Encoding at any | ||||
| given time.</t> | ||||
| <t> | ||||
| Individual Encoding parameters are represented in SDP | ||||
| <xref target="RFC4566" format="default"/>, | ||||
| not in CLUE messages. For example, for a video Encoding using | ||||
| H.26x compression technologies, this can include parameters such | ||||
| as follows: | ||||
| </t> | ||||
| <ul spacing="compact"> | ||||
| <li>Maximum bandwidth;</li> | ||||
| <li>Maximum picture size in pixels;</li> | ||||
| <li>Maximum number of pixels to be processed per second.</li> | ||||
| </ul> | ||||
| <t> | ||||
| The bandwidth parameter is the only one that specifically relates | ||||
| to a CLUE Advertisement, as it can be further constrained by the | ||||
| maximum group bandwidth in an Encoding Group.</t> | ||||
| </section> | ||||
| <section anchor="s-9.2" numbered="true" toc="default"> | ||||
| <name>Encoding Group</name> | ||||
| <t> | ||||
| An Encoding Group includes a set of one or more Individual | ||||
| Encodings, and parameters that apply to the group as a whole. By | ||||
| grouping multiple Individual Encodings together, an Encoding Group | ||||
| describes additional constraints on bandwidth for the group. A | ||||
| single Encoding Group <bcp14>MAY</bcp14> refer to Encodings for | ||||
| different Media | ||||
| types.</t> | ||||
| <t>The Encoding Group data structure contains: | ||||
| </t> | ||||
| <ul spacing="normal"> | ||||
| <li>Maximum bitrate for all Encodings in the group combined;</li> | ||||
| <li>A list of identifiers for the Individual Encodings belonging to | ||||
| the group.</li> | ||||
| </ul> | ||||
| <t> | ||||
| When the Individual Encodings in a group are instantiated into | ||||
| Capture Encodings, each Capture Encoding has a bitrate that | ||||
| <bcp14>MUST</bcp14> be less than or equal to the max bitrate for the | ||||
| particular Individual Encoding. The "maximum bitrate for all | ||||
| Encodings in the group" parameter gives the additional restriction | ||||
| that the sum of all the individual Capture Encoding bitrates | ||||
| <bcp14>MUST</bcp14> be less than or equal to | ||||
| this group value.</t> | ||||
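The two bitrate rules — per-Encoding maximum and group-wide sum — can be sketched as a validity check. The data shapes and numbers are illustrative assumptions, not CLUE syntax.

```python
# Hypothetical sketch: enforce both Encoding Group bitrate rules --
# each Capture Encoding within its Individual Encoding's maximum, and
# the sum of all Capture Encodings within the group maximum.

def group_config_is_valid(group_max_bps, enc_max_bps, capture_encodings):
    """enc_max_bps: encoding-id -> per-Encoding maximum bitrate (bps).
    capture_encodings: encoding-id -> configured bitrate (bps)."""
    for enc_id, bps in capture_encodings.items():
        if bps > enc_max_bps[enc_id]:
            return False                  # per-Encoding limit exceeded
    return sum(capture_encodings.values()) <= group_max_bps

enc_max = {"ENC1": 4_000_000, "ENC2": 4_000_000, "ENC3": 2_000_000}

# Within both limits:
assert group_config_is_valid(6_000_000, enc_max,
                             {"ENC1": 3_000_000, "ENC2": 2_500_000})
# Each Encoding is within its own limit, but the sum exceeds the group:
assert not group_config_is_valid(6_000_000, enc_max,
                                 {"ENC1": 4_000_000, "ENC2": 4_000_000})
```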
| <t> | ||||
| The following diagram illustrates one example of the structure of a | ||||
| Media Provider's Encoding Groups and their contents.</t> | ||||
| <figure anchor="ref-encoding-group-structure"> | ||||
| <name>Encoding Group Structure</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| ,-------------------------------------------------. | ||||
| | Media Provider | | ||||
| | | | ||||
| | ,--------------------------------------. | | ||||
| | | ,--------------------------------------. | | ||||
| | | | ,--------------------------------------. | | ||||
| | | | | Encoding Group | | | ||||
| | | | | ,-----------. | | | ||||
| | | | | | | ,---------. | | | ||||
| | | | | | | | | ,---------.| | | ||||
| | | | | | Encoding1 | |Encoding2| |Encoding3|| | | ||||
| | `.| | | | | | `---------'| | | ||||
| | `.| `-----------' `---------' | | | ||||
| | `--------------------------------------' | | ||||
| `-------------------------------------------------' | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| <t>A Provider advertises one or more Encoding Groups. Each Encoding | ||||
| Group includes one or more Individual Encodings. Each Individual | ||||
| Encoding can represent a different way of encoding Media. For | ||||
| example, one Individual Encoding may be 1080p60 video, another could | ||||
| be 720p30, with a third being 352x288p30, all in, for example, H.264 | ||||
| format.</t> | ||||
| <t>While a typical three-codec/display system might have one Encoding | ||||
| Group per "codec box" (physical codec, connected to one camera and | ||||
| one screen), there are many possibilities for the number of | ||||
| Encoding Groups a Provider may be able to offer and for the | ||||
| Encoding values in each Encoding Group.</t> | ||||
| <t> | ||||
| There is no requirement for all Encodings within an Encoding Group | ||||
| to be instantiated at the same time.</t> | ||||
| </section> | ||||
| <section anchor="s-9.3" numbered="true" toc="default"> | ||||
| <name>Associating Captures with Encoding Groups</name> | ||||
| <t> | ||||
| Each Media Capture, including MCCs, <bcp14>MAY</bcp14> be associated with one | ||||
| Encoding Group. To be eligible for configuration, a Media Capture | ||||
| <bcp14>MUST</bcp14> be associated with one Encoding Group, which is used to | ||||
| instantiate that Capture into a Capture Encoding. When an MCC is | ||||
| configured, all the Media Captures referenced by the MCC will appear | ||||
| in the Capture Encoding according to the attributes of the chosen | ||||
| Encoding of the MCC. This allows an Advertiser to specify Encoding | ||||
| attributes associated with the Media Captures without the need to | ||||
| provide an individual Capture Encoding for each of the inputs.</t> | ||||
| <t> | ||||
| If an Encoding Group is assigned to a Media Capture referenced by | ||||
| the MCC, it indicates that this Capture may also have an individual | ||||
| Capture Encoding.</t> | ||||
| <t>For example: | ||||
| </t> | ||||
| <table anchor="ref-example-usage-of-encoding-with-mcc-and-source-captures" align="center"> | ||||
| <name>Example Usage of Encoding with MCC and Source Captures</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left">Capture Scene #1</th> | ||||
| <th align="left"> </th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">EncodeGroupID=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC1(VC1,VC2)</td> | ||||
| <td align="left">EncodeGroupID=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| This would indicate that VC1 may be sent as its own Capture | ||||
| Encoding from EncodeGroupID=1 or that it may be sent as part of a | ||||
| Capture Encoding from EncodeGroupID=2 along with VC2.</t> | ||||
| <t> | ||||
| More than one Capture <bcp14>MAY</bcp14> use the same Encoding Group.</t> | ||||
| <t> | ||||
| The maximum number of Capture Encodings that can result from a | ||||
| particular Encoding Group constraint is equal to the number of | ||||
| Individual Encodings in the group. The actual number of Capture | ||||
| Encodings used at any time <bcp14>MAY</bcp14> be less than this maximum. Any of | ||||
| the Captures that use a particular Encoding Group can be encoded | ||||
| according to any of the Individual Encodings in the group.</t> | ||||
| <t> | ||||
| It is a protocol conformance requirement that the Encoding Groups | ||||
| <bcp14>MUST</bcp14> allow all the Captures in a particular CSV to | ||||
| be used simultaneously.</t> | ||||
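A necessary condition implied by this requirement can be sketched as follows (non-normative; the names and data shapes are invented for this example): for each Encoding Group, the number of CSV member Captures assigned to it must not exceed its number of Individual Encodings.

```python
from collections import Counter

# Illustrative sketch; data shapes are invented for this example.
def csv_simultaneously_encodable(csv_captures, capture_to_group,
                                 encodings_per_group):
    """Necessary condition: each Encoding Group must have at least as
    many Individual Encodings as CSV member Captures assigned to it."""
    demand = Counter(capture_to_group[c] for c in csv_captures)
    return all(n <= encodings_per_group[g] for g, n in demand.items())
```

This is only a necessary condition; group and per-Encoding bandwidth limits must also hold for the chosen operating points.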
| </section> | ||||
| </section> | ||||
| <section anchor="s-10" numbered="true" toc="default"> | ||||
| <name>Consumer's Choice of Streams to Receive from the Provider</name> | ||||
| <t> | ||||
| After receiving the Provider's Advertisement message (which includes | ||||
| Media Captures and associated constraints), the Consumer composes | ||||
| its reply to the Provider in the form of a Configure message. The | ||||
| Consumer is free to use the information in the Advertisement as it | ||||
| chooses, but there are a few obviously sensible design choices, | ||||
| which are outlined below.</t> | ||||
| <t> | ||||
| If multiple Providers connect to the same Consumer (i.e., in an | ||||
| MCU-less multiparty call), it is the responsibility of the Consumer | ||||
| to compose, for each Provider, a Configure that fulfills both that | ||||
| Provider's constraints, as expressed in the Advertisement, and the | ||||
| Consumer's own capabilities.</t> | ||||
| <t> | ||||
| In an MCU-based multiparty call, the MCU can logically terminate | ||||
| the Advertisement/Configure negotiation in that it can hide the | ||||
| characteristics of the receiving Endpoint and rely on its own | ||||
| capabilities (transcoding/transrating/etc.) to create Media Streams | ||||
| that can be decoded at the Endpoint Consumers. The timing of an | ||||
| MCU's sending of Advertisements (for its outgoing ports) and | ||||
| Configures (for its incoming ports, in response to Advertisements | ||||
| received there) is up to the MCU and is implementation dependent.</t> | ||||
| <t> | ||||
| As a general outline, a Consumer can choose, based on the | ||||
| Advertisement it has received, which Captures it wishes to receive, | ||||
| and which Individual Encodings it wants the Provider to use to | ||||
| encode the Captures.</t> | ||||
| <t> | ||||
| On receipt of an Advertisement with an MCC, the Consumer treats the | ||||
| MCC as per other non-MCC Captures with the following differences:</t> | ||||
| <ul spacing="normal"> | ||||
| <li>The Consumer would understand that the MCC is a Capture that | ||||
| includes the referenced individual Captures (or any Captures, if | ||||
| none are referenced) and that these individual Captures are | ||||
| delivered as part of the MCC's Capture Encoding.</li> | ||||
| <li>The Consumer may utilize any of the attributes associated with | ||||
| the referenced individual Captures and any Capture Scene attributes | ||||
| from where the individual Captures were defined to choose Captures | ||||
| and for Rendering decisions.</li> | ||||
| <li>If the MCC attribute Allow Subset Choice is true, then the | ||||
| Consumer may or may not choose to receive all the indicated | ||||
| Captures. It can choose to receive a subset of Captures indicated | ||||
| by the MCC.</li> | ||||
| </ul> | ||||
| <t>For example, if the Consumer receives: | ||||
| </t> | ||||
| <ul empty="true" spacing="normal"> | ||||
| <li>MCC1(VC1,VC2,VC3){attributes}</li> | ||||
| </ul> | ||||
| <t> | ||||
| A Consumer could choose all the Captures within an MCC; however, if | ||||
| the Consumer determines that it doesn't want VC3, it can return | ||||
| MCC1(VC1,VC2). If it wants all the individual Captures, then it | ||||
| returns only the MCC identity (i.e., MCC1). If the MCC in the | ||||
| Advertisement does not reference any individual Captures, or the | ||||
| Allow Subset Choice attribute is false, then the Consumer cannot | ||||
| choose what is included in the MCC: it is up to the Provider to | ||||
| decide.</t> | ||||
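The choice logic described above might be sketched as follows (non-normative; the function name and the string return convention are invented for this example):

```python
# Illustrative sketch; names are invented for this example.
def configure_mcc(mcc_id, referenced, allow_subset_choice, wanted):
    """Return the Capture reference a Consumer puts in its Configure
    for one MCC from the Advertisement."""
    if not referenced or not allow_subset_choice:
        # Contents are up to the Provider: reference the MCC only.
        return mcc_id
    chosen = [c for c in referenced if c in wanted]
    if chosen == referenced:
        return mcc_id                       # wants everything: "MCC1"
    return f"{mcc_id}({','.join(chosen)})"  # subset: "MCC1(VC1,VC2)"
```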
| <t> | ||||
| A Configure Message includes a list of Capture Encodings. These | ||||
| are the Capture Encodings the Consumer wishes to receive from the | ||||
| Provider. Each Capture Encoding refers to one Media Capture and | ||||
| one Individual Encoding.</t> | ||||
| <t> | ||||
| For each Capture the Consumer wants to receive, it configures one | ||||
| of the Encodings in that Capture's Encoding Group. The Consumer | ||||
| does this by telling the Provider, in its Configure Message, which | ||||
| Encoding to use for each chosen Capture. Upon receipt of this | ||||
| Configure from the Consumer, common knowledge is established | ||||
| between Provider and Consumer regarding sensible choices for the | ||||
| Media Streams. The setup of the actual Media channels, at least in | ||||
| the simplest case, is left to a following offer/answer exchange. | ||||
| Optimized implementations may speed up the reaction to the | ||||
| offer/answer exchange by reserving the resources at the time of | ||||
| finalization of the CLUE handshake.</t> | ||||
| <t> | ||||
| CLUE Advertisements and Configure Messages don't necessarily | ||||
| require a new SDP offer/answer for every CLUE message | ||||
| exchange. But the resulting Encodings sent via RTP must conform to | ||||
| the most-recent SDP offer/answer result.</t> | ||||
| <t> | ||||
| In order to meaningfully create and send an initial Configure, the | ||||
| Consumer needs to have received at least one Advertisement, and an | ||||
| SDP offer defining the Individual Encodings, from the Provider.</t> | ||||
| <t> | ||||
| In addition, the Consumer can send a Configure at any time during | ||||
| the call. The Configure <bcp14>MUST</bcp14> be valid according to the most | ||||
| recently received Advertisement. The Consumer can send a Configure | ||||
| either in response to a new Advertisement from the Provider or on | ||||
| its own, for example, because of a local change in conditions | ||||
| (people leaving the room, connectivity changes, multipoint related | ||||
| considerations).</t> | ||||
| <t> | ||||
| When choosing which Media Streams to receive from the Provider, and | ||||
| the encoding characteristics of those Media Streams, the Consumer | ||||
| advantageously takes several things into account: its local | ||||
| preference, simultaneity restrictions, and encoding limits.</t> | ||||
| <section anchor="s-10.1" numbered="true" toc="default"> | ||||
| <name>Local Preference</name> | ||||
| <t> | ||||
| A variety of local factors influence the Consumer's choice of | ||||
| Media Streams to be received from the Provider:</t> | ||||
| <ul spacing="normal"> | ||||
| <li>If the Consumer is an Endpoint, it is likely that it would | ||||
| choose, where possible, to receive Video and Audio Captures that | ||||
| match the number of display devices and audio system it has.</li> | ||||
| <li>If the Consumer is an MCU, it may choose to receive loudest | ||||
| speaker Streams (in order to perform its own Media composition) | ||||
| and avoid pre-composed Video Captures.</li> | ||||
| <li>User choice (for instance, selection of a new layout) may result | ||||
| in a different set of Captures, or different Encoding | ||||
| characteristics, being required by the Consumer.</li> | ||||
| </ul> | ||||
| </section> | ||||
| <section anchor="s-10.2" numbered="true" toc="default"> | ||||
| <name>Physical Simultaneity Restrictions</name> | ||||
| <t> | ||||
| Often there are physical simultaneity constraints of the Provider | ||||
| that affect the Provider's ability to simultaneously send all of | ||||
| the Captures the Consumer would wish to receive. For instance, an | ||||
| MCU, when connected to a multi-camera room system, might prefer to | ||||
| receive both individual video Streams of the people present in the | ||||
| room and an overall view of the room from a single camera. Some | ||||
| Endpoint systems might be able to provide both of these sets of | ||||
| Streams simultaneously, whereas others might not (if the overall | ||||
| room view were produced by changing the optical zoom level on the | ||||
| center camera, for instance).</t> | ||||
| </section> | ||||
| <section anchor="s-10.3" numbered="true" toc="default"> | ||||
| <name>Encoding and Encoding Group Limits</name> | ||||
| <t> | ||||
| Each of the Provider's Encoding Groups has limits on bandwidth, | ||||
| and the constituent potential Encodings have limits on the | ||||
| bandwidth, computational complexity, video frame rate, and | ||||
| resolution that can be provided. When choosing the Captures to be | ||||
| received from a Provider, a Consumer device <bcp14>MUST</bcp14> ensure that the | ||||
| Encoding characteristics requested for each individual Capture | ||||
| fits within the capability of the Encoding it is being configured | ||||
| to use, as well as ensuring that the combined Encoding | ||||
| characteristics for Captures fit within the capabilities of their | ||||
| associated Encoding Groups. In some cases, this could cause an | ||||
| otherwise "preferred" choice of Capture Encodings to be passed | ||||
| over in favor of different Capture Encodings -- for instance, if a | ||||
| set of three Captures could only be provided at a low resolution | ||||
| then a three screen device could switch to favoring a single, | ||||
| higher quality, Capture Encoding.</t> | ||||
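These checks can be sketched as follows (non-normative; the limit names mirror the maxWidth/maxHeight/maxFrameRate/maxPps/maxBandwidth fields used in the examples of Section 12, while the data shapes and function name are invented):

```python
# Illustrative sketch; data shapes are invented for this example.
def configure_fits(requests, encodings, groups):
    """requests: list of (encode_id, group_id, params) chosen by the
    Consumer, where params gives width/height/frame_rate/bandwidth."""
    used = {}
    for enc_id, group_id, p in requests:
        e = encodings[enc_id]
        pps = p["width"] * p["height"] * p["frame_rate"]
        if (p["width"] > e["maxWidth"] or p["height"] > e["maxHeight"]
                or p["frame_rate"] > e["maxFrameRate"]
                or pps > e["maxPps"]
                or p["bandwidth"] > e["maxBandwidth"]):
            return False  # exceeds an Individual Encoding limit
        used[group_id] = used.get(group_id, 0) + p["bandwidth"]
    # Combined characteristics must also fit each Encoding Group.
    return all(used[g] <= groups[g]["maxGroupBandwidth"] for g in used)
```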
| </section> | ||||
| </section> | ||||
| <section anchor="s-11" numbered="true" toc="default"> | ||||
| <name>Extensibility</name> | ||||
| <t> | ||||
| One important characteristic of the Framework is its | ||||
| extensibility. The standard for interoperability and handling | ||||
| multiple Streams must be future-proof. The framework itself is | ||||
| inherently extensible through expanding the data model types. For | ||||
| example:</t> | ||||
| <ul spacing="normal"> | ||||
| <li>Adding more types of Media, such as telemetry, can be done by | ||||
| defining additional types of Captures in addition to audio and | ||||
| video.</li> | ||||
| <li>Adding new functionalities, such as 3-D Video Captures, may | ||||
| require additional attributes describing the Captures.</li> | ||||
| </ul> | ||||
| <t> | ||||
| The infrastructure is designed to be extended rather than | ||||
| requiring new infrastructure elements. Extension comes through | ||||
| adding to defined types.</t> | ||||
| </section> | ||||
| <section anchor="s-12" numbered="true" toc="default"> | ||||
| <name>Examples - Using the Framework (Informative)</name> | ||||
| <t> | ||||
| This section gives some examples, first from the point of view of | ||||
| the Provider, then the Consumer, then some multipoint scenarios.</t> | ||||
| <section anchor="s-12.1" numbered="true" toc="default"> | ||||
| <name>Provider Behavior</name> | ||||
| <t> | ||||
| This section shows some examples in more detail of how a Provider | ||||
| can use the framework to represent a typical case for telepresence | ||||
| rooms. First, an Endpoint is illustrated, then an MCU case is | ||||
| shown.</t> | ||||
| <section anchor="s-12.1.1" numbered="true" toc="default"> | ||||
| <name>Three-Screen Endpoint Provider</name> | ||||
| <t> | ||||
| Consider an Endpoint with the following description:</t> | ||||
| <t> | ||||
| Three cameras, three displays, and a six-person table</t> | ||||
| <ul spacing="normal"> | ||||
| <li>Each camera can provide one Capture for each 1/3-section of the | ||||
| table.</li> | ||||
| <li>A single Capture representing the active speaker can be provided | ||||
| (voice-activity-based camera selection to a given encoder input | ||||
| port implemented locally in the Endpoint).</li> | ||||
| <li>A single Capture representing the active speaker with the other | ||||
| two Captures shown picture in picture (PiP) within the Stream can | ||||
| be provided (again, implemented inside the Endpoint).</li> | ||||
| <li>A Capture showing a zoomed out view of all six seats in the room | ||||
| can be provided.</li> | ||||
| </ul> | ||||
| <t> | ||||
| The Video and Audio Captures for this Endpoint can be described as | ||||
| follows.</t> | ||||
| <t> | ||||
| Video Captures: | ||||
| </t> | ||||
| <dl newline="false" spacing="normal" indent="6"> | ||||
| <dt>VC0</dt> | ||||
| <dd>(the left camera Stream), Encoding Group=EG0, view=table</dd> | ||||
| <dt>VC1</dt> | ||||
| <dd>(the center camera Stream), Encoding Group=EG1, view=table</dd> | ||||
| <dt>VC2</dt> | ||||
| <dd>(the right camera Stream), Encoding Group=EG2, view=table</dd> | ||||
| <dt>MCC3</dt> | ||||
| <dd>(the loudest panel Stream), Encoding Group=EG1, view=table, MaxCaptures=1, policy=SoundLevel</dd> | ||||
| <dt>MCC4</dt> | ||||
| <dd>(the loudest panel Stream with PiPs), Encoding Group=EG1, view=room, MaxCaptures=3, policy=SoundLevel</dd> | ||||
| <dt>VC5</dt> | ||||
| <dd>(the zoomed out view of all people in the room), Encoding Group=EG1, view=room</dd> | ||||
| <dt>VC6</dt> | ||||
| <dd>(presentation Stream), Encoding Group=EG1, presentation</dd> | ||||
| </dl> | ||||
| <t> | ||||
| The following diagram is a top view of the room with three cameras, three | ||||
| displays, and six seats. Each camera captures two people. The six | ||||
| seats are not all in a straight line.</t> | ||||
| <figure anchor="ref-room-layout-top-view"> | ||||
| <name>Room Layout Top View</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| ,-. d | ||||
| ( )`--.__ +---+ | ||||
| `-' / `--.__ | | | ||||
| ,-. | `-.._ |_-+Camera 2 (VC2) | ||||
| ( ).' <--(AC1)-+-''`+-+ | ||||
| `-' |_...---'' | | | ||||
| ,-.c+-..__ +---+ | ||||
| ( )| ``--..__ | | | ||||
| `-' | ``+-..|_-+Camera 1 (VC1) | ||||
| ,-. | <--(AC2)..--'|+-+ ^ | ||||
| ( )| __..--' | | | | ||||
| `-'b|..--' +---+ |X | ||||
| ,-. |``---..___ | | | | ||||
| ( )\ ```--..._|_-+Camera 0 (VC0) | | ||||
| `-' \ <--(AC0) ..-''`-+ | | ||||
| ,-. \ __.--'' | | <----------+ | ||||
| ( ) |..-'' +---+ Y | ||||
| `-' a (0,0,0) origin is under Camera 1 | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| <t> | ||||
| The two points labeled 'b' and 'c' are intended to be at the midpoint | ||||
| between the seating positions, and where the fields of view of the | ||||
| cameras intersect.</t> | ||||
| <t> | ||||
| The Plane of Interest for VC0 is a vertical plane that intersects | ||||
| points 'a' and 'b'.</t> | ||||
| <t> | ||||
| The Plane of Interest for VC1 intersects points 'b' and 'c'. The | ||||
| plane of interest for VC2 intersects points 'c' and 'd'.</t> | ||||
| <t> | ||||
| This example uses an area scale of millimeters.</t> | ||||
| <t>Areas of capture:</t> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| bottom left bottom right top left top right | ||||
| VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) | ||||
| VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) | ||||
| VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,2850,757) | ||||
| MCC3(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) | ||||
| MCC4(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) | ||||
| VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) | ||||
| VC6 none | ||||
| ]]></artwork> | ||||
| <t>Points of capture:</t> | ||||
| <artwork name="" type="" align="left" alt=""> | ||||
| VC0 (-1678,0,800) | ||||
| VC1 (0,0,800) | ||||
| VC2 (1678,0,800) | ||||
| MCC3 none | ||||
| MCC4 none | ||||
| VC5 (0,0,800) | ||||
| VC6 none | ||||
| </artwork> | ||||
| <t> | ||||
| In this example, the right edge of the VC0 area lines up with the | ||||
| left edge of the VC1 area. It doesn't have to be this way. There | ||||
| could be a gap or an overlap. One additional thing to note for | ||||
| this example is the distance from 'a' to 'b' is equal to the distance | ||||
| from 'b' to 'c' and the distance from 'c' to 'd'. All these distances are | ||||
| 1346 mm. This is the planar width of each Area of Capture for VC0, | ||||
| VC1, and VC2.</t> | ||||
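The 1346 mm planar width can be verified from the Area of Capture coordinates; for example, for VC0 it is the horizontal distance between the bottom-left and bottom-right corners (a small non-normative computation):

```python
import math

# Bottom-left and bottom-right corners of VC0's Area of Capture (mm).
bl = (-2011, 2850, 0)
br = (-673, 3000, 0)

# Planar width: Euclidean distance in the horizontal (X, Y) plane.
width = math.hypot(br[0] - bl[0], br[1] - bl[1])
print(round(width))  # close to the 1346 mm stated in the text
```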
| <t> | ||||
| Note that the text in parentheses (e.g., "the left camera Stream") is | ||||
| not explicitly part of the model; it is just explanatory text for | ||||
| this example, and it is not included in the model with the Media | ||||
| Captures and attributes. Also, MCC4 doesn't say anything about | ||||
| how a Capture is composed, so the Media Consumer can't tell based | ||||
| on this Capture that MCC4 is composed of a "loudest panel with PiPs".</t> | ||||
| <t> | ||||
| Audio Captures:</t> | ||||
| <t> | ||||
| Three ceiling microphones are located between the cameras and the | ||||
| table, at the same height as the cameras. The microphones point | ||||
| down at an angle toward the seating positions.</t> | ||||
| <ul spacing="normal"> | ||||
| <li>AC0 (left), Encoding Group=EG3</li> | ||||
| <li>AC1 (right), Encoding Group=EG3</li> | ||||
| <li>AC2 (center), Encoding Group=EG3</li> | ||||
| <li>AC3 being a simple pre-mixed audio Stream from the room (mono), | ||||
| Encoding Group=EG3</li> | ||||
| <li>AC4 audio Stream associated with the presentation video (mono) | ||||
| Encoding Group=EG3, presentation</li> | ||||
| </ul> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| Point of Capture: Point on Line of Capture: | ||||
| AC0 (-1342,2000,800) (-1342,2925,379) | ||||
| AC1 ( 1342,2000,800) ( 1342,2925,379) | ||||
| AC2 ( 0,2000,800) ( 0,3000,379) | ||||
| AC3 ( 0,2000,800) ( 0,3000,379) | ||||
| AC4 none | ||||
| ]]></artwork> | ||||
| <t>The physical simultaneity information is: | ||||
| </t> | ||||
| <ul empty="true" spacing="normal"> | ||||
| <li>Simultaneous Transmission Set #1 {VC0, VC1, VC2, MCC3, MCC4, | ||||
| VC6}</li> | ||||
| <li>Simultaneous Transmission Set #2 {VC0, VC2, VC5, VC6}</li> | ||||
| </ul> | ||||
| <t> | ||||
| This constraint indicates that it is not possible to use all the VCs at | ||||
| the same time. VC5 cannot be used at the same time as VC1 or MCC3 | ||||
| or MCC4. Also, using every member in the set simultaneously may | ||||
| not make sense -- for example, MCC3 (loudest) and MCC4 (loudest with | ||||
| PiP). In addition, there are Encoding constraints that make | ||||
| choosing all of the VCs in a set impossible. VC1, MCC3, MCC4, | ||||
| VC5, and VC6 all use EG1 and EG1 has only three ENCs. This constraint | ||||
| shows up in the Encoding Groups, not in the Simultaneous | ||||
| Transmission Sets.</t> | ||||
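Both kinds of constraint can be checked mechanically. The following non-normative sketch (data copied from this example; the function and variable names are invented) tests whether a selection of Video Captures is feasible: it must fit within one Simultaneous Transmission Set, and no Encoding Group may be asked for more Encodings than it contains.

```python
from collections import Counter

# Data from this example; names are invented for this sketch.
SIM_SETS = [{"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},
            {"VC0", "VC2", "VC5", "VC6"}]
GROUP_OF = {"VC0": "EG0", "VC1": "EG1", "VC2": "EG2", "MCC3": "EG1",
            "MCC4": "EG1", "VC5": "EG1", "VC6": "EG1"}
ENCS_PER_GROUP = {"EG0": 3, "EG1": 3, "EG2": 3}

def selection_feasible(captures):
    # Physical simultaneity: must fit in one Simultaneous Transmission Set.
    if not any(set(captures) <= s for s in SIM_SETS):
        return False
    # Encoding constraint: a group cannot supply more Encodings than it has.
    demand = Counter(GROUP_OF[c] for c in captures)
    return all(n <= ENCS_PER_GROUP[g] for g, n in demand.items())
```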
| <t> | ||||
| In this example, there are no restrictions on which Audio Captures | ||||
| can be sent simultaneously.</t> | ||||
| <t> | ||||
| Encoding Groups:</t> | ||||
| <t> | ||||
| This example has three Encoding Groups associated with the Video | ||||
| Captures. Each group can have three Encodings, but with each | ||||
| potential Encoding having a progressively lower specification. In | ||||
| this example, 1080p60 transmission is possible (as ENC0 has a | ||||
| maxPps value compatible with that). Significantly, as up to three | ||||
| Encodings are available per group, it is possible to transmit some | ||||
| Video Captures simultaneously that are not in the same view in the | ||||
| Capture Scene, for example, VC1 and MCC3 at the same time. The | ||||
| information below about Encodings is a summary of what would be | ||||
| conveyed in SDP, not directly in the CLUE Advertisement.</t> | ||||
| <figure anchor="ref-example-encoding-groups-for-video"> | ||||
| <name>Example Encoding Groups for Video</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| encodeGroupID=EG0, maxGroupBandwidth=6000000 | ||||
| encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, | ||||
| maxPps=124416000, maxBandwidth=4000000 | ||||
| encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, | ||||
| maxPps=27648000, maxBandwidth=4000000 | ||||
| encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, | ||||
| maxPps=15552000, maxBandwidth=4000000 | ||||
| encodeGroupID=EG1 maxGroupBandwidth=6000000 | ||||
| encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, | ||||
| maxPps=124416000, maxBandwidth=4000000 | ||||
| encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, | ||||
| maxPps=27648000, maxBandwidth=4000000 | ||||
| encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, | ||||
| maxPps=15552000, maxBandwidth=4000000 | ||||
| encodeGroupID=EG2 maxGroupBandwidth=6000000 | ||||
| encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, | ||||
| maxPps=124416000, maxBandwidth=4000000 | ||||
| encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, | ||||
| maxPps=27648000, maxBandwidth=4000000 | ||||
| encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, | ||||
| maxPps=15552000, maxBandwidth=4000000 | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| <t> | ||||
| For audio, there are five potential Encodings available, so all | ||||
| five Audio Captures can be encoded at the same time.</t> | ||||
| <figure anchor="ref-example-encoding-group-for-audio"> | ||||
| <name>Example Encoding Group for Audio</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| encodeGroupID=EG3, maxGroupBandwidth=320000 | ||||
| encodeID=ENC9, maxBandwidth=64000 | ||||
| encodeID=ENC10, maxBandwidth=64000 | ||||
| encodeID=ENC11, maxBandwidth=64000 | ||||
| encodeID=ENC12, maxBandwidth=64000 | ||||
| encodeID=ENC13, maxBandwidth=64000 | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| <t> | ||||
| Capture Scenes:</t> | ||||
| <t> | ||||
| The following table represents the Capture Scenes for this | ||||
| Provider. Recall that a Capture Scene is composed of alternative | ||||
| CSVs covering the same spatial region. Capture | ||||
| Scene #1 is for the main people Captures, and Capture Scene #2 is | ||||
| for presentation.</t> | ||||
| <t>Each row in the table is a separate CSV.</t> | ||||
| <table align="center"> | ||||
| <name>Example CSVs</name> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC0, VC1, VC2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC3</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC4</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC5</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC0, AC1, AC2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC3</td> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #2</th> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC6</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC4</td> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| Different Capture Scenes are distinct from each other and do not | ||||
| overlap. A Consumer can choose a view from each Capture Scene. In | ||||
| this case, the three Captures, VC0, VC1, and VC2, are one way of | ||||
| representing the video from the Endpoint. These three Captures | ||||
| should appear adjacent to each other. Alternatively, another | ||||
| way of representing the Capture Scene is with the Capture MCC3, | ||||
| which automatically shows the person who is talking; this is the same for | ||||
| the MCC4 and VC5 alternatives.</t> | ||||
| <t> | ||||
| As in the video case, the different views of audio in Capture | ||||
| Scene #1 represent the "same thing", in that one way to receive | ||||
| the audio is with the three Audio Captures (AC0, AC1, and AC2), and | ||||
| another way is with the mixed AC3. The Media Consumer can choose | ||||
| an audio CSV it is capable of receiving.</t> | ||||
| <t> | ||||
| The spatial ordering is understood by the Media Capture attribute's | ||||
| Area of Capture, Point of Capture, and Point on Line of Capture.</t> | ||||
| <t> | ||||
| A Media Consumer would likely want to choose a CSV | ||||
| to receive, partially based on how many Streams it can simultaneously | ||||
| receive. A Consumer that can receive three video Streams would | ||||
| probably prefer to receive the first view of Capture Scene #1 | ||||
| (VC0, VC1, and VC2) and not receive the other views. A Consumer that | ||||
| can receive only one video Stream would probably choose one of the | ||||
| other views.</t> | ||||
| <t> | ||||
| If the Consumer can receive a presentation Stream too, it would | ||||
| also choose to receive the only view from Capture Scene #2 (VC6).</t> | ||||
| </section> | ||||
| <section anchor="s-12.1.2" numbered="true" toc="default"> | ||||
| <name>Encoding Group Example</name> | ||||
| <t> | ||||
| This is an example of an Encoding Group to illustrate how it can | ||||
| express dependencies between Encodings. The information below | ||||
| about Encodings is a summary of what would be conveyed in SDP, not | ||||
| directly in the CLUE Advertisement.</t> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| encodeGroupID=EG0 maxGroupBandwidth=6000000 | ||||
| encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, | ||||
| maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
| encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, | ||||
| maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
| encodeID=AUDENC0, maxBandwidth=96000 | ||||
| encodeID=AUDENC1, maxBandwidth=96000 | ||||
| encodeID=AUDENC2, maxBandwidth=96000 | ||||
| ]]></artwork> | ||||
| <t> | ||||
| Here, the Encoding Group is EG0. Although the Encoding Group is | ||||
| capable of transmitting up to 6 Mbit/s, no individual video | ||||
| Encoding can exceed 4 Mbit/s.</t> | ||||
| <t> | ||||
| This Encoding Group also allows up to three audio Encodings, AUDENC&lt;0-2&gt;. | ||||
| It is not required that audio and video Encodings reside | ||||
| within the same Encoding Group, but if so, then the group's overall | ||||
| maxBandwidth value is a limit on the sum of all audio and video | ||||
| Encodings configured by the Consumer. A system that does not wish | ||||
| or need to combine bandwidth limitations in this way should | ||||
| instead use separate Encoding Groups for audio and video in order | ||||
| for the bandwidth limitations on audio and video to not interact.</t> | ||||
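As non-normative arithmetic for the mixed group above: configuring both video Encodings at 3 Mbit/s and all three audio Encodings at 96 kbit/s sums to 6,288,000 bit/s, which exceeds EG0's 6 Mbit/s maxGroupBandwidth, so the Consumer would have to configure less.

```python
# Non-normative arithmetic for the mixed audio/video group above.
MAX_GROUP_BANDWIDTH = 6_000_000  # EG0's maxGroupBandwidth

configured = {
    "VIDENC0": 3_000_000,
    "VIDENC1": 3_000_000,
    "AUDENC0": 96_000,
    "AUDENC1": 96_000,
    "AUDENC2": 96_000,
}

total = sum(configured.values())     # 6288000 bit/s in all
print(total <= MAX_GROUP_BANDWIDTH)  # prints False: over the group limit
```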
| <t> | ||||
| Audio and video can be expressed in separate Encoding Groups, as | ||||
| in this illustration.</t> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| encodeGroupID=EG0 maxGroupBandwidth=6000000 | ||||
| encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, | ||||
| maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
| encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, | ||||
| maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 | ||||
| encodeGroupID=EG1 maxGroupBandwidth=500000 | ||||
| encodeID=AUDENC0, maxBandwidth=96000 | ||||
| encodeID=AUDENC1, maxBandwidth=96000 | ||||
| encodeID=AUDENC2, maxBandwidth=96000 | ||||
| ]]></artwork> | ||||
| </section> | ||||
| <section anchor="s-12.1.3" numbered="true" toc="default"> | ||||
| <name>The MCU Case</name> | ||||
| <t> | ||||
| This section shows how an MCU might express its Capture Scenes, | ||||
| intending to offer different choices for Consumers that can handle | ||||
| different numbers of Streams. Each MCC is for video. A single | ||||
| Audio Capture is provided for all single and multi-screen | ||||
| configurations that can be associated (e.g., lip-synced) with any | ||||
| combination of Video Captures (the MCCs) at the Consumer.</t> | ||||
| <table anchor="ref-mcu-main-capture-scenes" align="center"> | ||||
| <name>MCU Main Capture Scenes</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left">Capture Scene #1</th> | ||||
| <th align="left"/> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">MCC0</td> | ||||
| <td align="left">for a one-screen Consumer</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC1, MCC2</td> | ||||
| <td align="left">for a two-screen Consumer</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC3, MCC4, MCC5</td> | ||||
| <td align="left">for a three-screen Consumer</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">MCC6, MCC7, MCC8, MCC9</td> | ||||
| <td align="left">for a four-screen Consumer</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC0</td> | ||||
| <td align="left">AC representing all participants</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC0)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC1,MCC2)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC3,MCC4,MCC5)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(MCC6,MCC7,MCC8,MCC9)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(AC0)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| If/when a presentation Stream becomes active within the Conference, | ||||
| the MCU might re-advertise the available Media as:</t> | ||||
| <table anchor="ref-mcu-presentation-capture-scene" align="center"> | ||||
| <name>MCU Presentation Capture Scene</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left">Capture Scene #2</th> | ||||
| <th align="left">Note</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC10</td> | ||||
| <td align="left">Video Capture for presentation</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC1</td> | ||||
| <td align="left">Presentation audio to accompany VC10</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC10)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(AC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-12.2" numbered="true" toc="default"> | ||||
| <name>Media Consumer Behavior</name> | ||||
| <t> | ||||
| This section gives an example of how a Media Consumer might behave | ||||
| when deciding how to request Streams from the three-screen | ||||
| Endpoint described in the previous section.</t> | ||||
| <t> | ||||
| The receive side of a call needs to balance its requirements | ||||
| (based on number of screens and speakers), its decoding capabilities, | ||||
| available bandwidth, and the Provider's capabilities in order | ||||
| to optimally configure the Provider's Streams. Typically, it would | ||||
| want to receive and decode Media from each Capture Scene | ||||
| advertised by the Provider.</t> | ||||
| <t> | ||||
| A sane, basic algorithm might be for the Consumer to go through | ||||
| each CSV in turn and find the collection of Video | ||||
| Captures that best matches the number of screens it has (this | ||||
| might include consideration of screens dedicated to presentation | ||||
| video display rather than "people" video) and then decide between | ||||
| alternative views in the video Capture Scenes based either on | ||||
| hard-coded preferences or on user choice. Once this choice has been | ||||
| made, the Consumer would then decide how to configure the | ||||
| Provider's Encoding Groups in order to make best use of the | ||||
| available network bandwidth and its own decoding capabilities.</t> | ||||
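The screen-matching step of this algorithm can be sketched as follows. This is a hypothetical illustration (the function name and the CSV-to-Capture-list mapping are assumptions, not CLUE protocol elements): pick the CSV whose number of Video Captures is closest to the Consumer's screen count.

```python
def best_csv(csvs, num_screens):
    """csvs: dict mapping a CSV label -> list of its Video Capture IDs.
    Returns the label whose Capture count best matches num_screens."""
    return min(csvs.items(),
               key=lambda kv: (abs(len(kv[1]) - num_screens), kv[0]))[0]

scene = {
    "CSV(VC5)": ["VC5"],                        # single composed view
    "CSV(VC0,VC1,VC2)": ["VC0", "VC1", "VC2"],  # one Capture per camera
}
print(best_csv(scene, 3))  # three-screen Consumer -> "CSV(VC0,VC1,VC2)"
print(best_csv(scene, 1))  # one-screen Consumer   -> "CSV(VC5)"
```

Ties are broken here alphabetically for determinism; a real Consumer would instead fall back to hard-coded preferences or user choice, as the text says.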
| <section anchor="s-12.2.1" numbered="true" toc="default"> | ||||
| <name>One-Screen Media Consumer</name> | ||||
| <t> | ||||
| MCC3, MCC4, and VC5 are all different views by themselves, not | ||||
| grouped together in a single view; so, the receiving device should | ||||
| choose between one of those. The choice would come down to | ||||
| whether to see the greatest number of participants simultaneously | ||||
| at roughly equal precedence (VC5), a switched view of just the | ||||
| loudest region (MCC3), or a switched view with PiPs (MCC4). An | ||||
| Endpoint device with a small amount of knowledge of these | ||||
| differences could offer a dynamic choice of these options, in-call, to the user.</t> | ||||
| </section> | ||||
| <section anchor="s-12.2.2" numbered="true" toc="default"> | ||||
| <name>Two-Screen Media Consumer Configuring the Example</name> | ||||
| <t> | ||||
| Mixing systems with an even number of screens, "2n", and those | ||||
| with "2n+1" cameras (and vice versa) is always likely to be the | ||||
| problematic case. In this instance, the behavior is likely to be | ||||
| determined by whether a "two-screen" system is really a "two-decoder" | ||||
| system, i.e., whether only one received Stream can be displayed | ||||
| per screen or whether more than two Streams can be received and | ||||
| spread across the available screen area. To enumerate three possible | ||||
| behaviors here for the two-screen system when it learns that the far | ||||
| end is "ideally" expressed via three Capture Streams:</t> | ||||
| <ol spacing="normal" type="1"> | ||||
| <li>Fall back to receiving just a single Stream (MCC3, MCC4, or VC5 | ||||
| as per the one-screen Consumer case above) and either leave one | ||||
| screen blank or use it for presentation if/when a | ||||
| presentation becomes active.</li> | ||||
| <li>Receive three Streams (VC0, VC1, and VC2) and display across two | ||||
| screens (either with each Capture being scaled to 2/3 of a | ||||
| screen and the center Capture being split across two screens), or, | ||||
| as would be necessary if there were large bezels on the | ||||
| screens, with each Stream being scaled to 1/2 the screen width | ||||
| and height and there being a fourth "blank" panel. This fourth panel | ||||
| could potentially be used for any presentation that became | ||||
| active during the call.</li> | ||||
| <li>Receive three Streams, decode all three, and use control information | ||||
| indicating which was the most active to switch between showing | ||||
| the left and center Streams (one per screen) and the center and | ||||
| right Streams.</li> | ||||
| </ol> | ||||
| <t> | ||||
| For an Endpoint capable of all three methods of working described | ||||
| above, again it might be appropriate to offer the user the choice | ||||
| of display mode.</t> | ||||
| </section> | ||||
| <section anchor="s-12.2.3" numbered="true" toc="default"> | ||||
| <name>Three-Screen Media Consumer Configuring the Example</name> | ||||
| <t> | ||||
| This is the most straightforward case: the Media Consumer would | ||||
| look to identify a set of Streams to receive that best matched its | ||||
| available screens; so, the VC0 plus VC1 plus VC2 should match | ||||
| optimally. The spatial ordering would give sufficient information | ||||
| for the correct Video Capture to be shown on the correct screen. | ||||
| The Consumer would need to divide a single Encoding | ||||
| Group's capability by 3 either to determine what resolution and frame | ||||
| rate to configure the Provider with or to configure the individual | ||||
| Video Captures' Encoding Groups with what makes most sense (taking | ||||
| into account the receive side decode capabilities, overall call | ||||
| bandwidth, the resolution of the screens plus any user preferences | ||||
| such as motion vs. sharpness).</t> | ||||
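The divide-by-three step is simple arithmetic; a hypothetical sketch (function name assumed for illustration) of splitting one Encoding Group's maxGroupBandwidth evenly across VC0, VC1, and VC2:

```python
def per_capture_bandwidth(max_group_bandwidth, n_captures):
    """Even split of a group's bandwidth cap across its configured Captures."""
    return max_group_bandwidth // n_captures

# A 6 Mbit/s Encoding Group split three ways leaves 2 Mbit/s per Video
# Capture, before decode limits and user preferences are weighed in.
print(per_capture_bandwidth(6000000, 3))  # -> 2000000
```

An uneven split is equally valid; the even split is just the default before preferences such as motion vs. sharpness are applied.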
| </section> | ||||
| </section> | ||||
| <section anchor="s-12.3" numbered="true" toc="default"> | ||||
| <name>Multipoint Conference Utilizing Multiple Content Captures</name> | ||||
| <t> | ||||
| The use of MCCs allows the MCU to construct outgoing Advertisements | ||||
| describing complex Media switching and composition scenarios. The | ||||
| following sections provide several examples.</t> | ||||
| <t> | ||||
| Note: in the examples, the identities of the CLUE elements (e.g., | ||||
| Captures, Capture Scenes) in the incoming Advertisements overlap. | ||||
| This is because there is no coordination between the Endpoints. | ||||
| The MCU is responsible for making these unique in the outgoing | ||||
| Advertisement.</t> | ||||
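The renumbering the MCU performs can be sketched as a simple counter over the incoming Advertisements. This is a hypothetical illustration (function name and data shapes assumed), matching how the examples below map each Endpoint's local "VC1" onto distinct outgoing IDs:

```python
def renumber(advertisements):
    """advertisements: one list of local Capture IDs per source Endpoint.
    Returns one {local_id: outgoing_id} map per Endpoint, with outgoing
    IDs unique across the whole outgoing Advertisement."""
    maps, counter = [], 0
    for adv in advertisements:
        mapping = {}
        for local_id in adv:
            counter += 1
            mapping[local_id] = f"VC{counter}"
        maps.append(mapping)
    return maps

# Endpoints A, B, and C all use 'VC1' locally, as in Section 12.3.1:
# A's VC1 -> VC1; B's VC1, VC2 -> VC2, VC3; C's VC1 -> VC4.
print(renumber([["VC1"], ["VC1", "VC2"], ["VC1"]]))
```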
| <section anchor="s-12.3.1" numbered="true" toc="default"> | ||||
| <name>Single Media Captures and MCC in the Same Advertisement</name> | ||||
| <t> | ||||
| Four Endpoints are involved in a Conference where CLUE is used. An | ||||
| MCU acts as a middlebox between the Endpoints with a CLUE channel | ||||
| between each Endpoint and the MCU. The MCU receives the following | ||||
| Advertisements.</t> | ||||
| <table anchor="ref-advertisement-received-from-endpoint-a" align="cent | ||||
| er"> | ||||
| <name>Advertisement Received from Endpoint A</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=AustralianConfRoom</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">Description=Audience<br/>EncodeGroupID=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <table anchor="ref-advertisement-received-from-endpoint-b" align="cent | ||||
| er"> | ||||
| <name>Advertisement Received from Endpoint B</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=ChinaConfRoom</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">Description=Speaker<br/>EncodeGroupID=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left">Description=Audience<br/>EncodeGroupID=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC1, VC2)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t keepWithPrevious="true">Note: Endpoint B indicates that it sends two Streams.</t> | ||||
| <table anchor="ref-advertisement-received-from-endpoint-c" align="cent | ||||
| er"> | ||||
| <name>Advertisement Received from Endpoint C</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=USAConfRoom</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">Description=Audience<br/>EncodeGroupID=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| If the MCU wanted to provide a Multiple Content Capture containing | ||||
| a round-robin switched view of the audience from the three Endpoints | ||||
| and the speaker, it could construct the following Advertisement:</t> | ||||
| <table anchor="ref-advertisement-sent-to-endpoint-f-one-encoding"> | ||||
| <name>Advertisement Sent to Endpoint F - One Encoding</name> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #1</th> <th>Description=AustralianConfRoom</th> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC1</td> <td>Description=Audience</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(VC1)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #2</th> <th>Description=ChinaConfRoom</th> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC2</td> <td>Description=Speaker</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC3</td> <td>Description=Audience</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(VC2, VC3)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #3</th> <th>Description=USAConfRoom</th> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC4</td> <td>Description=Audience</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(VC4)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #4</th> <th/></tr> | ||||
| <tr> | ||||
| <td>MCC1(VC1,VC2,VC3,VC4)</td> | ||||
| <td>Policy=RoundRobin:1<br/> | ||||
| MaxCaptures=1<br/> | ||||
| EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(MCC1)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| Alternatively, if the MCU wanted to provide the speaker as one Media | ||||
| Stream and the audiences as another, it could assign an Encoding | ||||
| Group to VC2 in Capture Scene 2 and provide a CSV in Capture Scene | ||||
| #4 as per the example below.</t> | ||||
| <table anchor="ref-advertisement-sent-to-endpoint-f-two-encodings"> | ||||
| <name>Advertisement Sent to Endpoint F - Two Encodings</name> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=AustralianConfRoom</th> | ||||
| </tr> | ||||
| <tr><td>VC1</td> <td>Description=Audience</td> | ||||
| </tr> | ||||
| <tr><td>CSV(VC1)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #2</th> <th>Description=ChinaConfRoom</th> | ||||
| </tr> | ||||
| <tr><td>VC2</td> <td>Description=Speaker | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr><td>VC3</td> <td>Description=Audience</td> | ||||
| </tr> | ||||
| <tr><td>CSV(VC2, VC3)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #3</th> <th>Description=USAConfRoom</th> | ||||
| </tr> | ||||
| <tr><td>VC4</td> <td>Description=Audience</td> | ||||
| </tr> | ||||
| <tr><td>CSV(VC4)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #4</th> <th/> | ||||
| </tr> | ||||
| <tr><td>MCC1(VC1,VC3,VC4)</td> <td>Policy=RoundRobin:1 | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>EncodingGroup=1 | ||||
| <br/>AllowSubset=True</td> | ||||
| </tr> | ||||
| <tr><td>MCC2(VC2)</td> <td>MaxCaptures=1 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr><td>CSV2(MCC1,MCC2)</td> <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| Therefore, a Consumer could choose whether or not to have a separate | ||||
| speaker-related Stream and could choose which Endpoints to see. If | ||||
| it wanted the second Stream but not the Australian conference room, | ||||
| it could indicate the following Captures in the Configure message:</t> | ||||
| <table anchor="table_15"> | ||||
| <name>MCU Case: Consumer Response</name> | ||||
| <tbody> | ||||
| <tr><td>MCC1(VC3,VC4)</td> <td>Encoding</td></tr> | ||||
| <tr><td>VC2</td> <td>Encoding</td></tr> | ||||
| </tbody> | ||||
| </table> | ||||
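The subset selection shown in the Configure message above follows from MCC1's AllowSubset=True. A hypothetical sketch (names and data shapes assumed, not CLUE syntax) of how a Consumer might build that Capture line:

```python
# MCC1's advertised constituent Captures from the two-Encoding example.
ADVERTISED = {"MCC1": ["VC1", "VC3", "VC4"]}

def configure(mcc, wanted):
    """Keep only the requested Captures that the MCC actually advertises;
    AllowSubset=True permits configuring fewer than all constituents."""
    allowed = [vc for vc in wanted if vc in ADVERTISED[mcc]]
    return f"{mcc}({','.join(allowed)})"

# Drop the Australian conference room (VC1), keep China's and the USA's
# audience views, yielding the Capture line from the Configure message.
print(configure("MCC1", ["VC3", "VC4"]))  # -> "MCC1(VC3,VC4)"
```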
| </section> | ||||
| <section anchor="s-12.3.2" numbered="true" toc="default"> | ||||
| <name>Several MCCs in the Same Advertisement</name> | ||||
| <t> | ||||
| Multiple MCCs can be used where multiple Streams are used to carry | ||||
| Media from multiple Endpoints. For example:</t> | ||||
| <t> | ||||
| A Conference has three Endpoints D, E, and F. Each Endpoint has | ||||
| three Video Captures covering the left, middle, and right regions of | ||||
| each conference room. The MCU receives the following | ||||
| Advertisements from D and E.</t> | ||||
| <table anchor="ref-advertisement-received-from-endpoint-d" align="cent | ||||
| er"> | ||||
| <name>Advertisement Received from Endpoint D</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=AustralianConfRoom</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">CaptureArea=Left</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left"/> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left">CaptureArea=Center</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left"/> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC3</td> | ||||
| <td align="left">CaptureArea=Right</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left"/> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC1,VC2,VC3)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <table anchor="ref-advertisement-received-from-endpoint-e" align="cent | ||||
| er"> | ||||
| <name>Advertisement Received from Endpoint E</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=ChinaConfRoom</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">CaptureArea=Left</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left"/> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left">CaptureArea=Center</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left"/> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC3</td> | ||||
| <td align="left">CaptureArea=Right</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left"/> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV(VC1,VC2,VC3)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| The MCU wants to offer Endpoint F three Capture Encodings. Each | ||||
| Capture Encoding would contain all the Captures from either | ||||
| Endpoint D or Endpoint E, depending on the active speaker. | ||||
| The MCU sends the following Advertisement:</t> | ||||
| <table anchor="ref-advertisement-sent-to-endpoint-f"> | ||||
| <name>Advertisement Sent to Endpoint F</name> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #1</th><th>Description=AustralianConfRoom</th> | ||||
| </tr> | ||||
| <tr><td>VC1</td> <td/></tr> | ||||
| <tr><td>VC2</td> <td/></tr> | ||||
| <tr><td>VC3</td> <td/></tr> | ||||
| <tr><td>CSV(VC1,VC2,VC3)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #2</th> <th>Description=ChinaConfRoom</th></tr> | ||||
| <tr><td>VC4</td> <td/></tr> | ||||
| <tr><td>VC5</td> <td/></tr> | ||||
| <tr><td>VC6</td> <td/></tr> | ||||
| <tr><td>CSV(VC4,VC5,VC6)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr><th>Capture Scene #3</th> <th/></tr> | ||||
| <tr><td>MCC1(VC1,VC4)</td> <td>CaptureArea=Left | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr><td>MCC2(VC2,VC5)</td> <td>CaptureArea=Center | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr><td>MCC3(VC3,VC6)</td> <td>CaptureArea=Right | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr><td>CSV(MCC1,MCC2,MCC3)</td> <td/></tr> | ||||
| </tbody> | ||||
| </table> | ||||
| </section> | ||||
| <section anchor="s-12.3.3" numbered="true" toc="default"> | ||||
| <name>Heterogeneous Conference with Switching and Composition</name> | ||||
| <t> | ||||
| Consider a Conference between Endpoints with the following | ||||
| characteristics:</t> | ||||
| <dl newline="false" spacing="normal"> | ||||
| <dt>Endpoint A -</dt> | ||||
| <dd>4 screens, 3 cameras</dd> | ||||
| <dt>Endpoint B -</dt> | ||||
| <dd>3 screens, 3 cameras</dd> | ||||
| <dt>Endpoint C -</dt> | ||||
| <dd>3 screens, 3 cameras</dd> | ||||
| <dt>Endpoint D -</dt> | ||||
| <dd>3 screens, 3 cameras</dd> | ||||
| <dt>Endpoint E -</dt> | ||||
| <dd>1 screen, 1 camera</dd> | ||||
| <dt>Endpoint F -</dt> | ||||
| <dd>2 screens, 1 camera</dd> | ||||
| <dt>Endpoint G -</dt> | ||||
| <dd>1 screen, 1 camera</dd> | ||||
| </dl> | ||||
| <t> | ||||
| This example focuses on what the user in one of the three-camera | ||||
| multi-screen Endpoints sees. Call this person User A, at Endpoint | ||||
| A. There are four large display screens at Endpoint A. Whenever | ||||
| somebody at another site is speaking, all the Video Captures from | ||||
| that Endpoint are shown on the large screens. If the talker is at | ||||
| a three-camera site, then the video from those three cameras fills three of | ||||
| the screens. If the person speaking is at a single-camera site, then video | ||||
| from that camera fills one of the screens, while the other screens | ||||
| show video from other single-camera Endpoints.</t> | ||||
| <t> | ||||
| User A hears audio from the four loudest talkers.</t> | ||||
| <t> | ||||
| User A can also see video from other Endpoints, in addition to the | ||||
| current person speaking, although much smaller in size. Endpoint A has four | ||||
| screens, so one of those screens shows up to nine other Media Captures | ||||
| in a tiled fashion. When video from a three-camera Endpoint appears in | ||||
| the tiled area, video from all three cameras appears together across | ||||
| the screen with correct Spatial Relationship among those three images.</t> | ||||
| <figure anchor="ref-endpoint-a-4-screen-display"> | ||||
| <name>Endpoint A - Four-Screen Display</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| +---+---+---+ +-------------+ +-------------+ +-------------+ | ||||
| | | | | | | | | | | | ||||
| +---+---+---+ | | | | | | | ||||
| | | | | | | | | | | | ||||
| +---+---+---+ | | | | | | | ||||
| | | | | | | | | | | | ||||
| +---+---+---+ +-------------+ +-------------+ +-------------+ | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| <t> | ||||
| User B at Endpoint B sees a similar arrangement, except there are | ||||
| only three screens, so the nine other Media Captures are spread out across | ||||
| the bottom of the three displays, in a PiP format. | ||||
| When video from a three-camera Endpoint appears in the PiP area, video | ||||
| from all three cameras appears together across one screen with | ||||
| correct Spatial Relationship.</t> | ||||
| <figure anchor="ref-endpoint-b-3-screen-display-with-pips"> | ||||
| <name>Endpoint B - Three-Screen Display with PiPs</name> | ||||
| <artwork name="" type="" align="left" alt=""><![CDATA[ | ||||
| +-------------+ +-------------+ +-------------+ | ||||
| | | | | | | | ||||
| | | | | | | | ||||
| | | | | | | | ||||
| | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ | | ||||
| | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ | | ||||
| +-------------+ +-------------+ +-------------+ | ||||
| ]]></artwork> | ||||
| </figure> | ||||
| <t> | ||||
| When somebody at a different Endpoint becomes the current speaker, | ||||
| then User A and User B both see the video from the new person speaking | ||||
| appear on their large screen area, while the previous speaker takes | ||||
| one of the smaller tiled or PiP areas. The person who is the | ||||
| current speaker doesn't see themselves; they see the previous speaker | ||||
| in their large screen area.</t> | ||||
| <t> | ||||
| One of the points of this example is that Endpoints A and B each | ||||
| want to receive three Capture Encodings for their large display areas, | ||||
| and nine Encodings for their smaller areas. A and B can each send | ||||
| the same Configure message to the MCU, and each receives the same | ||||
| conceptual Media Captures from the MCU. The differences | ||||
| are in how they are Rendered and are purely a local matter at A and | ||||
| B.</t> | ||||
| <t>The Advertisements for such a scenario are described below. | ||||
| </t> | ||||
| <table anchor="ref-advertisement-received-at-the-mcu-from-endpoints-a- | ||||
| to-d" align="center"> | ||||
| <name>Advertisement Received at the MCU from Endpoints A to D</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=Endpoint x</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC2</td> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">VC3</td> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC1</td> | ||||
| <td align="left">EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV1(VC1, VC2, VC3)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV2(AC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <table anchor="ref-advertisement-received-at-the-mcu-from-endpoints-e- | ||||
| to-g" align="center"> | ||||
| <name>Advertisement Received at the MCU from Endpoints E to G</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th align="left"> Capture Scene #1</th> | ||||
| <th align="left"> Description=Endpoint y</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="left">VC1</td> | ||||
| <td align="left">EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">AC1</td> | ||||
| <td align="left">EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV1(VC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td align="left">CSV2(AC1)</td> | ||||
| <td align="left"/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| Rather than considering what is displayed, CLUE concentrates more | ||||
| on what the MCU sends. The MCU doesn't know anything about the | ||||
| number of screens an Endpoint has.</t> | ||||
| <t> | ||||
| As Endpoints A to D each advertise that three Captures make up a | ||||
| Capture Scene, the MCU offers these in a "site switching" mode. | ||||
| That is, there are three Multiple Content Captures (and | ||||
| Capture Encodings) each switching between Endpoints. The MCU | ||||
| switches the applicable Media into the Stream based on voice | ||||
| activity. Endpoint A will not see a Capture from itself.</t> | ||||
| <t> | ||||
| Using the MCC concept, the MCU would send the following | ||||
| Advertisement to Endpoint A:</t> | ||||
| <table anchor="ref-advertisement-sent-to-endpoint-a-source-part"> | ||||
| <name>Advertisement Sent to Endpoint A - Source Part</name> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #1</th><th>Description=Endpoint B</th> | ||||
| </tr> | ||||
| <tr><td>VC4</td> <td>CaptureArea=Left</td></tr> | ||||
| <tr><td>VC5</td> <td>CaptureArea=Center</td></tr> | ||||
| <tr><td>VC6</td> <td>CaptureArea=Right</td></tr> | ||||
| <tr><td>AC1</td> <td/></tr> | ||||
| <tr><td>CSV(VC4,VC5,VC6)</td> <td/></tr> | ||||
| <tr><td>CSV(AC1)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #2</th><th>Description=Endpoint C</th> | ||||
| </tr> | ||||
| <tr><td>VC7</td> <td>CaptureArea=Left</td></tr> | ||||
| <tr><td>VC8</td> <td>CaptureArea=Center</td></tr> | ||||
| <tr><td>VC9</td> <td>CaptureArea=Right</td></tr> | ||||
| <tr><td>AC2</td> <td/></tr> | ||||
| <tr><td>CSV(VC7,VC8,VC9)</td> <td/></tr> | ||||
| <tr><td>CSV(AC2)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #3</th><th>Description=Endpoint D</th> | ||||
| </tr> | ||||
| <tr><td>VC10</td> <td>CaptureArea=Left</td></tr> | ||||
| <tr><td>VC11</td> <td>CaptureArea=Center</td></tr> | ||||
| <tr><td>VC12</td> <td>CaptureArea=Right</td></tr> | ||||
| <tr><td>AC3</td> <td/></tr> | ||||
| <tr><td>CSV(VC10,VC11,VC12)</td> <td/></tr> | ||||
| <tr><td>CSV(AC3)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #4</th><th>Description=Endpoint E</th> | ||||
| </tr> | ||||
| <tr><td>VC13</td> <td/></tr> | ||||
| <tr><td>AC4</td> <td/></tr> | ||||
| <tr><td>CSV(VC13)</td> <td/></tr> | ||||
| <tr><td>CSV(AC4)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #5</th><th>Description=Endpoint F</th> | ||||
| </tr> | ||||
| <tr><td>VC14</td> <td/></tr> | ||||
| <tr><td>AC5</td> <td/></tr> | ||||
| <tr><td>CSV(VC14)</td> <td/></tr> | ||||
| <tr><td>CSV(AC5)</td> <td/></tr> | ||||
| </tbody> | ||||
| <tbody> | ||||
| <tr> | ||||
| <th>Capture Scene #6</th><th>Description=Endpoint G</th> | ||||
| </tr> | ||||
| <tr><td>VC15</td> <td/></tr> | ||||
| <tr><td>AC6</td> <td/></tr> | ||||
| <tr><td>CSV(VC15)</td> <td/></tr> | ||||
| <tr><td>CSV(AC6)</td> <td/></tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| The above part of the Advertisement presents information about the | ||||
| sources to the MCC. The information is effectively the same as the | ||||
| received Advertisements, except that there are no Capture Encodings | ||||
| associated with them and the identities have been renumbered.</t> | ||||
| <t> | ||||
| In addition to the source Capture information, the MCU advertises | ||||
| site switching of Endpoints B to G in three Streams.</t> | ||||
| <table anchor="table_22"> | ||||
| <name>Advertisement Sent to Endpoint A - Switching Parts</name> | ||||
| <thead> | ||||
| <tr> | ||||
| <th>Capture Scene #7</th><th>Description=Output3streammix</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td>MCC1(VC4,VC7,VC10,&zwsp;VC13)</td> <td>CaptureArea=Left | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC2(VC5,VC8,VC11,&zwsp;VC14)</td> <td>CaptureArea=Center | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC3(VC6,VC9,VC12,&zwsp;VC15)</td> <td>CaptureArea=Right | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC4() (for audio)</td> <td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC5() (for audio)</td> <td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:1 | ||||
| <br/>EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC6() (for audio)</td> <td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:2 | ||||
| <br/>EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC7() (for audio)</td> <td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:3 | ||||
| <br/>EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(MCC1,MCC2,MCC3)</td> <td/></tr> | ||||
| <tr> | ||||
| <td>CSV(MCC4,MCC5,MCC6,&zwsp;MCC7)</td> <td/></tr> | ||||
| </tbody></table> | ||||
| <t> | ||||
| The above part describes the three main switched Streams that relate to | ||||
| site switching. MaxCaptures=1 indicates that only one Capture from | ||||
| the MCC is sent at a particular time. SynchronizationID=1 indicates | ||||
| that the source sending is synchronized. The Provider can choose to | ||||
| group together VC13, VC14, and VC15 for the purpose of switching | ||||
| according to the SynchronizationID. Therefore, when the Provider | ||||
| switches one of them into an MCC, it can also switch the others | ||||
| even though they are not part of the same Capture Scene.</t> | ||||
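Synchronized site switching can be sketched as indexing every MCC that shares a SynchronizationID by the same source position. A hypothetical illustration (names and the index-based grouping are assumptions) using the constituent lists from the table above:

```python
# MCC1/MCC2/MCC3 share SynchronizationID=1: position i in each list is
# the left/center/right Capture of the same source Endpoint (B, C, D, E).
SYNC_GROUPS = {
    "MCC1": ["VC4", "VC7", "VC10", "VC13"],
    "MCC2": ["VC5", "VC8", "VC11", "VC14"],
    "MCC3": ["VC6", "VC9", "VC12", "VC15"],
}

def switch_to(endpoint_index):
    """Switch all synchronized MCCs to the same source Endpoint at once."""
    return {mcc: sources[endpoint_index]
            for mcc, sources in SYNC_GROUPS.items()}

print(switch_to(0))  # Endpoint B speaking: VC4/VC5/VC6 switch in together
```

With single-camera Endpoints grouped at the same index, VC13, VC14, and VC15 switch in together even though they come from different Capture Scenes, as the text describes.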
| <t> | ||||
| All the audio for the Conference is included in Scene #7. | ||||
| There isn't necessarily a one-to-one relation between any Audio | ||||
| Capture and Video Capture in this Scene. Typically, a change in | ||||
| the loudest talker will cause the MCU to switch the audio Streams more | ||||
| quickly than switching video Streams.</t> | ||||
| <t> | ||||
| The MCU can also supply nine Media Streams showing the active and | ||||
| previous eight speakers. It includes the following in the | ||||
| Advertisement:</t> | ||||
| <table anchor="table_23"> | ||||
<name>Advertisement Sent to Endpoint A - 9 Switched Parts</name>
| <thead> | ||||
| <tr> | ||||
| <th>Capture Scene #8</th><th>Description=Output9stream</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="right">MCC8(VC4,VC5,VC6,VC7, | ||||
| <br/>VC8,VC9,VC10,VC11, | ||||
| <br/>VC12,VC13,VC14,VC15)</td> | ||||
| <td>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr><tr> | ||||
| <td align="right">MCC9(VC4,VC5,VC6,VC7, | ||||
| <br/>VC8,VC9,VC10,VC11, | ||||
| <br/>VC12,VC13,VC14,VC15) | ||||
| </td> | ||||
| <td>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:1 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr><tr> | ||||
| <th align="center">to</th><th align="center">to</th> | ||||
| </tr><tr> | ||||
| <td align="right">MCC16(VC4,VC5,VC6,VC7, | ||||
| <br/>VC8,VC9,VC10,VC11, | ||||
| <br/>VC12,VC13,VC14,VC15)</td> | ||||
| <td>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:8 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr><tr> | ||||
| <td align="right">CSV(MCC8,MCC9,MCC10, | ||||
| <br/>MCC11,MCC12,MCC13, | ||||
| <br/>MCC14,MCC15,MCC16)</td> | ||||
| <td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| The above part indicates that there are nine Capture Encodings. Each | ||||
| of the Capture Encodings may contain any Captures from any source | ||||
| site with a maximum of one Capture at a time. Which Capture is | ||||
| present is determined by the policy. The MCCs in this Scene do not | ||||
| have any spatial attributes.</t> | ||||
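The SoundLevel policy numbering in Scene #8 can be sketched as follows: MCC8 (Policy=SoundLevel:0) carries the loudest source, MCC9 the second loudest, down to MCC16 (SoundLevel:8). The ranking input and function name in this non-normative Python illustration are assumptions.

```python
# Illustrative (non-normative) mapping of a loudness-ranked capture list
# onto the nine switched MCCs of Scene #8.
def assign_sound_level(ranked_captures, first_mcc=8, count=9):
    """Map a loudness-ranked capture list onto MCC8..MCC16: entry n in
    the ranking goes to the MCC whose Policy is SoundLevel:n."""
    return {f"MCC{first_mcc + n}": ranked_captures[n]
            for n in range(min(count, len(ranked_captures)))}

ranking = ["VC7", "VC13", "VC4", "VC10", "VC8", "VC5", "VC14", "VC11", "VC6"]
print(assign_sound_level(ranking)["MCC8"])  # loudest source -> MCC8
```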
| <t> | ||||
| Note: The Provider alternatively could provide each of the MCCs | ||||
| above in its own Capture Scene.</t> | ||||
| <t> | ||||
| If the MCU wanted to provide a composed Capture Encoding containing | ||||
| all of the nine Captures, it could advertise in addition:</t> | ||||
| <table anchor="ref-advertisement-sent-to-endpoint-a-9-composed-part"> | ||||
<name>Advertisement Sent to Endpoint A - 9 Composed Parts</name>
| <thead> | ||||
| <tr> | ||||
| <th>Capture Scene #9</th><th>Description=NineTiles</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td align="right">MCC13(MCC8,MCC9,MCC10,<br/> | ||||
| MCC11,MCC12,MCC13,<br/> | ||||
| MCC14,MCC15,MCC16)</td> | ||||
| <td>MaxCaptures=9<br/> | ||||
| EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV(MCC13)</td><td/> | ||||
| </tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| As MaxCaptures is 9, it indicates that the Capture Encoding contains | ||||
| information from nine sources at a time.</t> | ||||
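One plausible Consumer-side rendering of this nine-source composed encoding is a 3x3 tile grid. The RFC does not mandate any particular layout, so the grid arithmetic below is purely illustrative.

```python
# Non-normative sketch: assign grid slots to the nine composed sources.
def tile_positions(n_tiles=9, cols=3):
    """Return (row, col) grid slots for each composed source, filling
    the grid left to right, top to bottom."""
    return [(i // cols, i % cols) for i in range(n_tiles)]

print(tile_positions()[:4])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
```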
| <t> | ||||
| The Advertisement to Endpoint B is identical to the above, other | ||||
| than the fact that Captures from Endpoint A would be added and the Captures | ||||
| from Endpoint B would be removed. Whether the Captures are Rendered | ||||
| on a four-screen display or a three-screen display is up to the | ||||
| Consumer to determine. The Consumer wants to place Video Captures | ||||
| from the same original source Endpoint together, in the correct | ||||
| spatial order, but the MCCs do not have spatial attributes. So, the | ||||
| Consumer needs to associate incoming Media packets with the | ||||
| original individual Captures in the Advertisement (such as VC4, | ||||
| VC5, and VC6) in order to know the spatial information it needs for | ||||
| correct placement on the screens. The Provider can use the RTCP | ||||
| CaptureId source description (SDES) item and associated RTP header extension, | ||||
| as | ||||
| described in <xref target="RFC8849" format="default"/>, to convey this | ||||
| information to the Consumer.</t> | ||||
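The Consumer-side bookkeeping described above can be sketched as follows: the CaptureId carried in the RTCP SDES item or RTP header extension (RFC 8849) lets the Consumer look up the spatial attributes advertised for the original Capture. The class and the minimal advertisement table are illustrative assumptions, not an API defined by CLUE.

```python
# Non-normative sketch of resolving CaptureIds to spatial placement.
class SpatialPlacer:
    """Track, per RTP SSRC, which original Capture an MCC is currently
    switching in (learned from the CaptureId SDES item / header
    extension) and resolve it to the advertised spatial position."""

    def __init__(self, advertised_areas):
        self.areas = dict(advertised_areas)   # CaptureId -> CaptureArea
        self.current = {}                     # SSRC -> CaptureId

    def on_capture_id(self, ssrc, capture_id):
        """Record the CaptureId most recently signaled for this stream."""
        self.current[ssrc] = capture_id

    def placement(self, ssrc):
        """Return the advertised CaptureArea for the stream, if known."""
        return self.areas.get(self.current.get(ssrc))

placer = SpatialPlacer({"VC4": "Left", "VC5": "Center", "VC6": "Right"})
placer.on_capture_id(0x1234, "VC6")
print(placer.placement(0x1234))  # Right
```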
| </section> | ||||
| <section anchor="s-12.3.4" numbered="true" toc="default"> | ||||
| <name>Heterogeneous Conference with Voice-Activated Switching</name> | ||||
| <t> | ||||
| This example illustrates how multipoint "voice-activated switching" | ||||
| behavior can be realized, with an Endpoint making its own decision | ||||
about which of its outgoing video Streams is considered the "active
talker" from that Endpoint. Then, an MCU can decide which is the
| active talker among the whole Conference.</t> | ||||
| <t> | ||||
| Consider a Conference between Endpoints with the following | ||||
| characteristics:</t> | ||||
| <dl newline="false" spacing="normal"> | ||||
| <dt>Endpoint A -</dt> | ||||
| <dd>3 screens, 3 cameras</dd> | ||||
| <dt>Endpoint B -</dt> | ||||
| <dd>3 screens, 3 cameras</dd> | ||||
| <dt>Endpoint C -</dt> | ||||
| <dd>1 screen, 1 camera</dd> | ||||
| </dl> | ||||
| <t> | ||||
| This example focuses on what the user at Endpoint C sees. The | ||||
| user would like to see the Video Capture of the current talker, | ||||
| without composing it with any other Video Capture. In this | ||||
| example, Endpoint C is capable of receiving only a single video | ||||
| Stream. The following tables describe Advertisements from Endpoints A and B | ||||
| to the MCU, and from the MCU to Endpoint C, that can be used to accomplish | ||||
| this.</t> | ||||
          <table anchor="ref-advertisement-received-at-the-mcu-from-endpoints-a-and-b">
            <name>Advertisement Received at the MCU from Endpoints A and B</name>
| <thead> | ||||
| <tr> | ||||
| <th>Capture Scene #1</th><th>Description=Endpoint x</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td>VC1</td> <td>CaptureArea=Left | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC2</td> <td>CaptureArea=Center | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>VC3</td> <td>CaptureArea=Right | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC1(VC1,VC2,VC3)</td> <td>MaxCaptures=1 | ||||
| <br/>CaptureArea=whole Scene | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>AC1</td> <td>CaptureArea=whole Scene | ||||
| <br/>EncodingGroup=2</td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV1(VC1, VC2, VC3)</td><td/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV2(MCC1)</td><td/> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>CSV3(AC1)</td><td/> | ||||
| </tr></tbody> | ||||
| </table> | ||||
| <t> | ||||
| Endpoints A and B are advertising each individual Video Capture, | ||||
| and also a switched Capture MCC1 that switches between the other | ||||
| three based on who is the active talker. These Endpoints do not | ||||
| advertise distinct Audio Captures associated with each individual | ||||
| Video Capture, so it would be impossible for the MCU (as a Media | ||||
| Consumer) to make its own determination of which Video Capture is | ||||
| the active talker based just on information in the audio Streams.</t> | ||||
| <table anchor="ref-advertisement-sent-from-the-mcu-to-c"> | ||||
| <name>Advertisement Sent from the MCU to Endpoint C</name> | ||||
| <thead> | ||||
| <tr><th>Capture Scene #1</th><th>Description=conference</th> | ||||
| </tr> | ||||
| </thead> | ||||
| <tbody> | ||||
| <tr> | ||||
| <td>MCC1()</td> | ||||
| <td>CaptureArea=Left | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC2()</td><td>CaptureArea=Center | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC3()</td><td>CaptureArea=Right | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>SynchronizationID=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC4()</td><td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=1 | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC5() (for audio)</td><td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:0 | ||||
| <br/>EncodingGroup=2 | ||||
| </td> | ||||
| </tr> | ||||
| <tr> | ||||
| <td>MCC6() (for audio)</td><td>CaptureArea=whole Scene | ||||
| <br/>MaxCaptures=1 | ||||
| <br/>Policy=SoundLevel:1 | ||||
| <br/>EncodingGroup=2 | ||||
| </td> | ||||
| </tr> | ||||
| <tr><td>CSV1(MCC1,MCC2,MCC3)</td><td/></tr> | ||||
| <tr><td>CSV2(MCC4)</td><td/></tr> | ||||
| <tr><td>CSV3(MCC5,MCC6)</td><td/></tr> | ||||
| </tbody> | ||||
| </table> | ||||
| <t> | ||||
| The MCU advertises one Scene, with four video MCCs. Three of them | ||||
| in CSV1 give a left, center, and right view of the Conference, with | ||||
| site switching. MCC4 provides a single Video Capture | ||||
| representing a view of the whole Conference. The MCU intends for | ||||
| MCC4 to be switched between all the other original source | ||||
Captures. In this example, the Advertisement from the MCU does not give all
the information about all the other Endpoints' Scenes and which of
those Captures are included in the MCCs. The MCU could include all
of that if it wanted to give the Consumers more
information, but it is not necessary for this example scenario.</t>
| <t> | ||||
| The Provider advertises MCC5 and MCC6 for audio. Both are | ||||
| switched Captures, with different SoundLevel policies indicating | ||||
| they are the top two dominant talkers. The Provider advertises | ||||
| CSV3 with both MCCs, suggesting the Consumer should use both if it | ||||
| can.</t> | ||||
| <t> | ||||
| Endpoint C, in its Configure Message to the MCU, requests to | ||||
| receive MCC4 for video and MCC5 and MCC6 for audio. In order for | ||||
| the MCU to get the information it needs to construct MCC4, it has | ||||
| to send Configure Messages to Endpoints A and B asking to receive MCC1 from | ||||
| each of them, along with their AC1 audio. Now the MCU can use | ||||
| audio energy information from the two incoming audio Streams from | ||||
| Endpoints A and B to determine which of those alternatives is the current | ||||
| talker. Based on that, the MCU uses either MCC1 from A or MCC1 | ||||
| from B as the source of MCC4 to send to Endpoint C.</t> | ||||
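The MCU decision just described can be sketched in a few lines. The function name and the audio-level representation are assumptions for illustration (levels here are given in dBov, with values closer to 0 being louder); this is not a normative algorithm.

```python
# Non-normative sketch of the MCU's voice-activated switching decision:
# compare the audio energy of the streams received from Endpoints A and
# B, and forward the winner's MCC1 as the source of MCC4.
def pick_mcc4_source(audio_energy_dbov):
    """audio_energy_dbov: endpoint name -> most recent audio level in
    dBov (0 loudest, more negative is quieter).  Returns which
    endpoint's MCC1 the MCU forwards as MCC4."""
    talker = max(audio_energy_dbov, key=audio_energy_dbov.get)
    return f"MCC1 from Endpoint {talker}"

print(pick_mcc4_source({"A": -18, "B": -45}))  # MCC1 from Endpoint A
```

In practice the MCU would also apply hysteresis so brief noise spikes do not cause rapid switching, but that refinement is outside the scope of this example.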
| </section> | ||||
| </section> | ||||
| </section> | ||||
| <section anchor="s-14" numbered="true" toc="default"> | ||||
| <name>IANA Considerations</name> | ||||
| <t> | ||||
| This document has no IANA actions. | ||||
| </t> | ||||
| </section> | ||||
| <section anchor="s-15" numbered="true" toc="default"> | ||||
| <name>Security Considerations</name> | ||||
| <t> | ||||
There are several potential attacks related to telepresence, and
specifically to the protocols used by CLUE. This is the case due to
conferencing sessions, the natural involvement of multiple
Endpoints, and the many, often user-invoked, capabilities provided
by the systems.</t>
| <t> | ||||
| An MCU involved in a CLUE session can experience many of the same | ||||
| attacks as a conferencing system such as the one enabled by | ||||
| the Conference | ||||
Information Data Model for Centralized Conferencing (XCON) framework
<xref target="RFC5239" format="default"/>. Examples of attacks include the
| following: an Endpoint attempting to listen to sessions in which | ||||
| it is not authorized to participate, an Endpoint attempting to | ||||
| disconnect or mute other users, and theft of service by an | ||||
| Endpoint in attempting to create telepresence sessions it is not | ||||
| allowed to create. Thus, it is <bcp14>RECOMMENDED</bcp14> that an MCU | ||||
| implementing the protocols necessary to support CLUE follow the | ||||
| security recommendations specified in the conference control | ||||
| protocol documents. | ||||
In the case of CLUE, SIP is the conferencing
protocol; thus, the security considerations in
<xref target="RFC4579" format="default"/> <bcp14>MUST</bcp14> be
| followed. Other security issues related to MCUs are discussed in | ||||
the XCON framework <xref target="RFC5239" format="default"/>. The use of
xCard with potentially
sensitive information provides another reason to implement
recommendations in <xref section="11" sectionFormat="of" target="RFC5239"
format="default"/>.</t>
| <t> | ||||
| One primary security concern, surrounding the CLUE framework | ||||
| introduced in this document, involves securing the actual | ||||
| protocols and the associated authorization mechanisms. These | ||||
| concerns apply to Endpoint-to-Endpoint sessions as well as | ||||
sessions involving multiple Endpoints and MCUs.
<xref target="ref-basic-information-flow" format="default"/> in
<xref target="s-5" format="default"/> provides a basic flow of information
exchange for CLUE and the protocols involved.</t>
| <t> | ||||
| As described in <xref target="s-5" format="default"/>, CLUE uses SIP/SDP to | ||||
| establish the session prior to exchanging any CLUE-specific | ||||
| information. Thus, the security mechanisms recommended for SIP | ||||
| <xref target="RFC3261" format="default"/>, including user authentication and | ||||
authorization, <bcp14>MUST</bcp14> be supported. In addition, the Media
<bcp14>MUST</bcp14> be secured. Datagram Transport Layer Security (DTLS) /
Secure Real-time Transport Protocol (SRTP) <bcp14>MUST</bcp14> be supported
and <bcp14>SHOULD</bcp14> be used unless the Media, which is based on RTP,
is secured by other means (see <xref target="RFC7201" format="default"/>
<xref target="RFC7202" format="default"/>). Media security is
also discussed in <xref target="RFC8848" format="default"/> and
<xref target="RFC8849" format="default"/>. Note that SIP call setup is done
before any
| CLUE-specific information is available, so the authentication and | ||||
| authorization are based on the SIP mechanisms. The entity that will | ||||
| be authenticated may use the Endpoint identity or the Endpoint user | ||||
| identity; this is an application issue and not a CLUE-specific | ||||
| issue.</t> | ||||
| <t> | ||||
| A separate data channel is established to transport the CLUE | ||||
| protocol messages. The contents of the CLUE protocol messages are | ||||
| based on information introduced in this document. The CLUE data | ||||
| model <xref target="RFC8846" format="default"/> defines, through an XML | ||||
| schema, the syntax to be used. One type of information that could | ||||
| possibly introduce privacy concerns is the xCard information, as | ||||
| described in <xref target="s-7.1.1.10" format="default"/>. The decision about | ||||
| which xCard | ||||
| information to send in the CLUE channel is an application policy | ||||
| for point-to-point and multipoint calls based on the authenticated | ||||
| identity that can be the Endpoint identity or the user of the | ||||
| Endpoint. For example, the telepresence multipoint application can | ||||
| authenticate a user before starting a CLUE exchange with the | ||||
| telepresence system and have a policy per user.</t> | ||||
| <t> | ||||
| In addition, the (text) description field in the Media Capture | ||||
| attribute (<xref target="s-7.1.1.6" format="default"/>) could possibly reveal | ||||
| sensitive | ||||
| information or specific identities. The same would be true for the | ||||
| descriptions in the Capture Scene (<xref target="s-7.3.1" format="default"/>) | ||||
| and CSV | ||||
(<xref target="s-7.3.2" format="default"/>) attributes. An implementation
<bcp14>SHOULD</bcp14> give users
| control over what sensitive information is sent in an | ||||
| Advertisement. One other important consideration for the | ||||
| information in the xCard as well as the description field in the | ||||
| Media Capture and CSV attributes is that while the | ||||
Endpoints involved in the session have been authenticated, there
is no assurance that the information in the xCard or description
fields is authentic. Thus, this information <bcp14>MUST NOT</bcp14> be
used to make any authorization decisions.</t>
| <t> | ||||
| While other information in the CLUE protocol messages does not | ||||
| reveal specific identities, it can reveal characteristics and | ||||
| capabilities of the Endpoints. That information could possibly | ||||
| uniquely identify specific Endpoints. It might also be possible | ||||
| for an attacker to manipulate the information and disrupt the CLUE | ||||
| sessions. It would also be possible to mount a DoS attack on the | ||||
| CLUE Endpoints if a malicious agent has access to the data | ||||
channel. Thus, it <bcp14>MUST</bcp14> be possible for the Endpoints to
establish a channel that is secure against both message recovery and
message modification. Further details on this are provided in the
CLUE data channel solution document
<xref target="RFC8850" format="default"/>.</t>
| <t> | ||||
| There are also security issues associated with the authorization | ||||
| to perform actions at the CLUE Endpoints to invoke specific | ||||
| capabilities (e.g., rearranging screens, sharing content, etc.). | ||||
| However, the policies and security associated with these actions | ||||
| are outside the scope of this document and the overall CLUE | ||||
| solution.</t> | ||||
| </section> | ||||
| </middle> | ||||
| <back> | ||||
| <references> | ||||
| <name>References</name> | ||||
| <references> | ||||
| <name>Normative References</name> | ||||
| <!--&I-D.ietf-clue-datachannel; is 8850 --> | ||||
| <reference anchor="RFC8850" target="https://www.rfc-editor.org/info/rfc8850"> | ||||
| <front> | ||||
| <title>Controlling Multiple Streams for Telepresence (CLUE) Protocol Data | ||||
| Channel</title> | ||||
| <author initials="C." surname="Holmberg" fullname="Christer Holmberg"> | ||||
| <organization/> | ||||
| </author> | ||||
| <date month="January" year="2021"/> | ||||
| </front> | ||||
| <seriesInfo name="RFC" value="8850"/> | ||||
| <seriesInfo name="DOI" value="10.17487/RFC8850"/> | ||||
| </reference> | ||||
| <!--&I-D.ietf-clue-data-model-schema; is 8846--> | ||||
<reference anchor="RFC8846" target="https://www.rfc-editor.org/info/rfc8846">
| <front> | ||||
<title>An XML Schema for the Controlling Multiple Streams for
Telepresence (CLUE) Data Model</title>
| <author initials="R" surname="Presta" fullname="Roberta Presta"> | ||||
| <organization/> | ||||
| </author> | ||||
<author initials="S P." surname="Romano" fullname="Simon Pietro Romano">
| <organization/> | ||||
| </author> | ||||
| <date month="January" year="2021"/> | ||||
| </front> | ||||
| <seriesInfo name="RFC" value="8846"/> | ||||
| <seriesInfo name="DOI" value="10.17487/RFC8846"/> | ||||
| </reference> | ||||
| <!--&I-D.ietf-clue-protocol; is 8847 --> | ||||
<reference anchor="RFC8847" target="https://www.rfc-editor.org/info/rfc8847">
<front>
<title>Protocol for Controlling Multiple Streams for Telepresence (CLUE)</title>
<author initials="R" surname="Presta" fullname="Roberta Presta">
<organization/>
</author>
<author initials="S P." surname="Romano" fullname="Simon Pietro Romano">
<organization/>
</author>
<date month="January" year="2021"/>
</front>
<seriesInfo name="RFC" value="8847"/>
<seriesInfo name="DOI" value="10.17487/RFC8847"/>
</reference>
| <!--&I-D.ietf-clue-signaling; is 8848 --> | ||||
| <reference anchor="RFC8848" | ||||
| target="https://www.rfc-editor.org/info/rfc8848"> | ||||
| <front> | ||||
| <title>Session Signaling for Controlling Multiple Streams for | ||||
| Telepresence (CLUE)</title> | ||||
| <author initials="R" surname="Hanton" fullname="Robert Hanton"> | ||||
| <organization/> | ||||
| </author> | ||||
| <author initials="P" surname="Kyzivat" fullname="Paul Kyzivat"> | ||||
| <organization/> | ||||
| </author> | ||||
| <author initials="L" surname="Xiao" fullname="Lennard Xiao"> | ||||
| <organization/> | ||||
| </author> | ||||
| <author initials="C" surname="Groves" fullname="Christian Groves"> | ||||
| <organization/> | ||||
| </author> | ||||
| <date month="January" year="2021"/> | ||||
| </front> | ||||
| <seriesInfo name="RFC" value="8848"/> | ||||
| <seriesInfo name="DOI" value="10.17487/RFC8848"/> | ||||
| </reference> | ||||
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3261.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3264.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3550.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4566.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4579.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5239.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5646.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.6350.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.6351.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
| </references> | ||||
| <references> | ||||
| <name>Informative References</name> | ||||
| <!-- &I-D.ietf-clue-rtp-mapping; is 8849 --> | ||||
<reference anchor="RFC8849" target="https://www.rfc-editor.org/info/rfc8849">
<front>
<title>Mapping RTP Streams to Controlling Multiple Streams for Telepresence
(CLUE) Media Captures</title>
<author initials="R" surname="Even" fullname="Roni Even">
<organization/>
</author>
<author initials="J" surname="Lennox" fullname="Jonathan Lennox">
<organization/>
</author>
<date month="January" year="2021"/>
</front>
<seriesInfo name="RFC" value="8849"/>
<seriesInfo name="DOI" value="10.17487/RFC8849"/>
</reference>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4353.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7667.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7201.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7202.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7205.xml"/>
<xi:include href="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7262.xml"/>
| </references> | ||||
| </references> | ||||
| <section anchor="acks" numbered="false" toc="default"> | ||||
| <name>Acknowledgements</name> | ||||
| <t> | ||||
<contact fullname="Allyn Romanow"/> and <contact fullname="Brian Baldino"/>
were authors of early draft versions.
<contact fullname="Mark Gorzynski"/> also contributed much to the initial
approach.
| Many others also contributed, | ||||
| including <contact fullname="Christian Groves"/>, | ||||
| <contact fullname="Jonathan Lennox"/>, | ||||
| <contact fullname="Paul Kyzivat"/>, | ||||
| <contact fullname="Rob Hanton"/>, | ||||
| <contact fullname="Roni Even"/>, | ||||
| <contact fullname="Christer Holmberg"/>, | ||||
| <contact fullname="Stephen Botzko"/>, | ||||
| <contact fullname="Mary Barnes"/>, | ||||
| <contact fullname="John Leslie"/>, and | ||||
| <contact fullname="Paul Coverdale"/>.</t> | ||||
| </section> | ||||
| </back> | ||||
| </rfc> | ||||