CLUE                                                           R. Hansen
Internet-Draft                                             Cisco Systems
Intended status: Standards Track                            A. Pepperell
Expires: December 2, 2012                                    Silverflare
                                                              A. Romanow
                                                              B. Baldino
                                                           Cisco Systems
                                                            M. Duckworth
                                                                 Polycom
                                                            May 31, 2012


           The need for consumer spatial information in CLUE
                  draft-hansen-clue-consumer-layout-00

Abstract

This draft is for discussion in the CLUE working group. It proposes adding the ability for the consumer to provide specific information to the provider. This document proposes allowing consumers to include spatial parameters in their consumer requests to providers in order to improve the provider's ability to assign media to streams in a way that is helpful for rendering. The solution proposed here is in partial response to CLUE Task #10, Does Framework provide sufficient info for receiver?

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 2, 2012.

Hansen, et al.          Expires December 2, 2012                [Page 1]

Internet-Draft    Consumer spatial information in CLUE          May 2012

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Motivation - Conferencing in CLUE
   4.  Issues associated with subscribing to multiple switched captures
     4.1.  Provider advertising spatially-related switched captures
   5.  Consumer includes optional spatial information
     5.1.  Applicability of consumer spatial information to audio
   6.  Implications and conclusions
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

This draft notes some limitations of CLUE when it comes to correctly rendering video under certain conditions, and proposes the optional addition of spatial information by the consumer to resolve these issues. This does not imply that the authors believe that the proposed solution is the only option available; rather, this draft is meant as a starting point for discussion.

2.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119] and indicate requirement levels for compliant implementations.

3.  Motivation - Conferencing in CLUE

The current methodology of the CLUE framework [I-D.ietf-clue-framework] is well suited to the case of systems with a relatively static set of capture devices. However, scenarios in which a much more dynamic set of capture devices is presented to consumers, such as voice-switched conferencing where multiple endpoints connect to a middle box such as an MCU, present additional challenges. An example of such a scenario is shown below, with four endpoints A, B, C and D in a conference:

                 +-----+
      +---+     /       \     +---+
      | A |----/         \----| B |
      +---+   /           \   +---+
             +     MCU     +
      +---+   \           /   +---+
      | C |----\         /----| D |
      +---+     \       /     +---+
                 +-----+

In this scenario endpoint A is not directly connected to any of the other endpoints and so will not have the capture information associated with their media streams directly available.

One approach is for the MCU to advertise B, C and D's captures as separate capture scenes to A - A can then subscribe to any capture from any of the other endpoints. However, as the size of the conference increases, the number of captures that must be advertised will quickly become impractical. Further, in many conferencing scenarios, endpoints do not wish to specify the endpoints they want to see - instead they wish to see the video and audio from the 'most relevant' endpoints as determined by the MCU (where relevance is usually determined by audio activity level).
Finally, advertising all available captures in this fashion can be problematic in the case of captures that cannot be sent simultaneously, as one consumer may ask for one capture and a second consumer for its mutually exclusive partner.

For these reasons, CLUE allows the MCU to advertise switched captures, which do not directly represent specific real video or audio captures. Instead, subscribing to one of these captures means that the provider will switch the stream it sends to the consumer based on its internal logic. In the example above, the MCU might advertise a single switched video capture to A; if A subscribed to this then the MCU would forward the video stream from B, C or D based on which it felt was most relevant (often calculated based on the loudness of an associated audio stream).

4.  Issues associated with subscribing to multiple switched captures

Consumer A from the previous example can subscribe to one or more of these switched captures and will receive that many streams from the MCU, switched from their originating sources. However, A does not receive the spatial capture information from the originating source associated with these streams alongside the RTP packets. As a result, things become more complicated when A subscribes to multiple video captures, and when the other endpoints provide multiple video streams with correlated spatial information.

For example, suppose A is a three-screen system and hence requests three streams. If all the streams it receives are independent, it can render them as it wishes, as shown below where it receives one stream from each of B, C and D:

      +------+   +------+   +------+
      |      |   |      |   |      |
      |  B   |   |  C   |   |  D   |
      |      |   |      |   |      |
      +------+   +------+   +------+

However, if A receives more than one stream from a particular endpoint and these streams are spatially related, then it is possible for A to lay them out erroneously.
This is illustrated below, where A is receiving three streams of video that originated at B, which should correctly be ordered (L)eft, (C)enter, (R)ight:

      +------+   +------+   +------+
      |      |   |      |   |      |
      | B(L) |   | B(C) |   | B(R) |    Correct
      |      |   |      |   |      |
      +------+   +------+   +------+

      +------+   +------+   +------+
      |      |   |      |   |      |
      | B(C) |   | B(R) |   | B(L) |    Incorrect
      |      |   |      |   |      |
      +------+   +------+   +------+

When laid out incorrectly this leads to objects (such as a person being viewed) being split into sections displayed in disparate, non-contiguous locations.

This problem could be solved if A had the spatial capture information from B. In a small conference it may be possible for the middle box to pre-send all the capture information from all other endpoints to A (and to every other endpoint), but as the number of captures per endpoint and the number of endpoints in a conference rise, caching all the data becomes impractical.

An alternative would be for A to request the originating capture information for streams it is receiving, or for the MCU to send it whenever it switches streams. However, because the RTP packets and the CLUE capture information will be sent in separate channels, this will lead to cases where A is receiving RTP packets but has not yet received the corresponding capture data, and the same problem occurs. The endpoint must then choose between displaying nothing and risking incorrect layout choices.

4.1.  Provider advertising spatially-related switched captures

One tool that already exists within the CLUE framework and can be used to partially solve this problem is for the MCU to include spatial information for the switched captures it advertises. For example, it could advertise three captures, each with area of capture information that portrays it as the left, center or right capture of a single hypothetical room.
In this case, when the MCU has unrelated one-screen streams to send to A, it can associate them with whichever switched capture it chooses. But when sending a two- or three-screen set of streams, it can ensure that they are correctly laid out adjacent to each other and in the correct order. A could then request these three captures and render the streams appropriately on its left, center and right screens, needing to take no action to ensure that the streams are correctly laid out.

However, this solution is not sufficient for all use-cases. The issue is that the MCU will need to advertise a suitable separate group of switched captures for each endpoint configuration that could connect to it. If the possible endpoint configurations are limited, this may still represent a plausible number; for instance, an MCU that wanted to support endpoints with one, two, three or four screens laid out contiguously left-to-right could advertise a capture set with the following entries:

      { [VC0]
        [VC1, VC2]
        [VC3, VC4, VC5]
        [VC6, VC7, VC8, VC9] }

where VC0 is a single switched capture, VC1 and VC2 are two switched captures each representing half the scene, and so on.

But this means that the MCU is only able to support certain pre-defined layouts - supporting additional configurations of screens (such as a 2x2 array) requires a new entry for each, and designing a new endpoint configuration means updating all the MCUs it interoperates with.

This problem becomes particularly acute if the endpoint has many screens, or wants to perform local composition (subscribing to multiple streams per screen and rendering them locally for display) - this both substantially increases the number of streams that the endpoint would wish to subscribe to, and increases the complexity of layouts possible.
For instance, an endpoint with two screens that wanted to show a 2x2 grid of participants on each would need to subscribe to eight captures with appropriate spatial information.

5.  Consumer includes optional spatial information

We can address these issues and allow an endpoint more complex stream rendering configurations, while substantially reducing the number and complexity of switched captures the MCU must advertise. The approach is for the consumer to optionally include, as part of its request, some information on the spatial relationships within its rendering. This allows the MCU to advertise a single collection of switched captures with no spatial information for the consumer to subscribe to, rather than attempting to anticipate every layout an endpoint might desire and having to advertise an entry for each with suitable spatial information.

There are a number of forms this consumer information could take. However, the form most consistent with the existing CLUE data model, and offering the most flexibility for the future, is for the consumer to be able to describe the spatial relationship of its screens in the same fashion and using the same system as in the provider's capture attributes. 'Area of Display' would be an optional attribute of a consumer request, and would have the same properties as the provider's 'Area of Capture' (i.e. four co-planar {X,Y,Z} coordinates).

If the consumer includes information on the area of display, the provider may then choose to use that information to inform its choice when switching video. Alternatively, in cases where there are no spatial constraints on the video the provider is switching, or where fixed streams are being sent, the area of display information could be ignored.
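
The 'Area of Display' idea can be sketched in code. The following is a minimal, non-normative Python illustration and is not part of the CLUE data model. The names Area, Choice, screen and assign_left_to_right are hypothetical, as are the corner naming, the assumption that X increases from the viewer's left to right, and the fallback to request order when a choice carries no area of display. It shows one way a provider might use the optional attribute to keep streams from a multi-screen source in left-to-right order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]  # an {X, Y, Z} coordinate


@dataclass(frozen=True)
class Area:
    """Four co-planar points; the corner names are hypothetical."""
    bottom_left: Point
    bottom_right: Point
    top_left: Point
    top_right: Point

    def centre_x(self) -> float:
        corners = (self.bottom_left, self.bottom_right,
                   self.top_left, self.top_right)
        return sum(p[0] for p in corners) / 4.0


@dataclass(frozen=True)
class Choice:
    """One consumer choice of a switched capture (hypothetical shape)."""
    capture_id: str
    area_of_display: Optional[Area] = None  # optional, as proposed


def screen(x_left: float, x_right: float) -> Area:
    """Helper: a 1-unit-high screen in the Y=0 plane (assumed geometry)."""
    return Area((x_left, 0.0, 0.0), (x_right, 0.0, 0.0),
                (x_left, 0.0, 1.0), (x_right, 0.0, 1.0))


def assign_left_to_right(choices: List[Choice],
                         streams: List[str]) -> Dict[str, str]:
    """Map streams (ordered left to right at the source) onto choices.

    If every choice carries an area of display, choices are ordered by
    the X coordinate of the area's centre (assumed to increase from the
    viewer's left to right). Otherwise the spatial information is
    ignored and the request order is used.
    """
    if all(c.area_of_display is not None for c in choices):
        ordered = sorted(choices, key=lambda c: c.area_of_display.centre_x())
    else:
        ordered = list(choices)
    return {c.capture_id: s for c, s in zip(ordered, streams)}


# A lists its choices in an arbitrary order: center, left, right screen.
choices = [Choice("VC0", screen(0.0, 1.0)),
           Choice("VC1", screen(-1.0, 0.0)),
           Choice("VC2", screen(1.0, 2.0))]
mapping = assign_left_to_right(choices, ["B(L)", "B(C)", "B(R)"])
print(mapping)
```

In this sketch the choice whose screen spans X from -1 to 0 receives B(L), regardless of the order in which A listed its choices. The fallback branch reflects that a provider remains free to ignore the attribute.
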
A straightforward example of this would be where consumer A is a three-screen system wishing to join a large conference including one-, two- and three-screen systems. The MCU offers a capture scene including three switched captures, to which A wishes to subscribe. A then sends a choice for each of those captures, and for each choice includes an area of display attribute giving the position of the corresponding screen. The MCU can then use that information to ensure that, when switching in the video streams from multi-screen systems, it does so in a way that they will be rendered correctly on A.

A more complicated example is where A is still a three-screen system wishing to join a large conference including one-, two- and three-screen systems, but now wishes to receive more than one video stream per screen, composing them locally. The layout A wishes to achieve is (three large screens, each with one main video displayed full-screen and three picture-in-picture views):

   +-------------+  +-------------+  +-------------+
   |             |  |             |  |             |
   |             |  |             |  |             |
   |             |  |             |  |             |
   |             |  |             |  |             |
   | +-+ +-+ +-+ |  | +-+ +-+ +-+ |  | +-+ +-+ +-+ |
   | +-+ +-+ +-+ |  | +-+ +-+ +-+ |  | +-+ +-+ +-+ |
   +-------------+  +-------------+  +-------------+

The MCU advertises that it can send at least 12 switched video streams to A simultaneously. A makes 12 choices, including a suitable area of display for each one. This information allows the MCU not just to ensure that multi-screen systems are not laid out incorrectly, but potentially also to optimize other choices, such as not splitting multi-screen systems being rendered in the smaller PiP panes across bezels, showing presentation and full-motion video received from the same participant on the same screen, and so on.

5.1.  Applicability of consumer spatial information to audio

The text above is primarily concerned with resolving issues for video, but it may still be relevant for audio; the consumer may wish to provide spatial information about the locations at which they will be playing out their audio. However, the authors believe this is for the most part less relevant: audio does not have the same rigid requirements for playout that were described above for video, and for the most part the problem can be solved with the provider-specified spatial coordinates already defined in the specification.

6.  Implications and conclusions

CLUE has been designed as a provider-oriented protocol, with the provider giving a list of the resources it can supply and the consumer selecting from these. This proposal fits into that pattern; spatial information included in a consumer request forms part of that request and does not limit the provider, but instead gives additional information for the provider to use as it sees fit. Consumers that have no need for the spatial information need not include it, and providers can choose to ignore the spatial information if it is not relevant to their selection process.

Allowing the optional reuse, in the consumer request, of spatial information that is currently sent only by the provider increases the range of problems for which CLUE can provide a solution, while placing no additional burden on systems which do not have these concerns, as they can safely ignore the information.

7.  Security Considerations

The proposal herein has no security implications; the new information from the consumer is optional and sent at their discretion, and reveals nothing that can compromise their system.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2.  Informative References

   [I-D.ietf-clue-framework]  Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino, "Framework for Telepresence Multi-Streams", draft-ietf-clue-framework-05 (work in progress), May 2012.

Authors' Addresses

   Robert Hansen
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: rohanse2@cisco.com


   Andy Pepperell
   Silverflare

   Email: andy.pepperell@silverflare.com


   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com


   Brian Baldino
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: bbaldino@cisco.com


   Mark Duckworth
   Polycom
   Andover, MA  01810
   USA

   Email: mark.duckworth@polycom.com