CLUE                                                          A. Romanow
Internet-Draft                                                 R. Hansen
Intended status: Standards Track                           Cisco Systems
Expires: December 2, 2012                                   A. Pepperell
                                                             Silverflare
                                                              B. Baldino
                                                           Cisco Systems
                                                            May 31, 2012


   The need for an audio rendering tag mechanism in the CLUE Framework
                 draft-romanow-clue-audio-rendering-tag-00

Abstract

   The purpose of this draft is to serve as input for discussion in the
   CLUE working group.  It proposes adding an audio rendering tag to the
   CLUE framework [I-D.ietf-clue-framework], which makes it possible for
   the consumer to correctly render audio with respect to video in a
   multistream video conference.  The proposed solution is a partial
   response to CLUE Task #10, "Does the Framework provide sufficient
   information for the receiver?"

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 2, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

Romanow, et al.         Expires December 2, 2012                [Page 1]

Internet-Draft         Audio rendering tag for CLUE             May 2012

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Motivation - the issue
   2.  Terminology
   3.  Audio Rendering Tag Mechanism
   4.  Use of the RTP header extension
   5.  Use case note
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Motivation - the issue

   A goal for CLUE audio is that listeners perceive the direction of a
   sound source to be the same as that of the visual image of the
   source; this is referred to as directional audio.

   In some situations the existing CLUE mechanisms are adequate.  When
   the provider advertisement includes spatial information (point of
   origin and capture area) that gives a static relationship between
   video captures and their associated audio captures, the consumer can
   use that information to place the audio correctly.

   However, in some circumstances, for various reasons, the audio
   and/or video spatial information is not sent in the provider
   advertisement.
   Consider, for instance, a three-screen system advertising three
   video captures and one switched audio capture, where the audio is
   switched from the loudest of three microphones.  In this case, how
   will the consumer know how to associate the audio with the correct
   video so that it can be played out in the correct location?

   Here we suggest a simple mechanism -- audio rendering tagging.

   When audio and video cannot be matched through spatial information
   in the provider advertisement, we would like the ability to play out
   audio on multiple loudspeakers, matching the position of the talker
   in the original scene.  Also, the audio may be assigned to a
   loudspeaker in real time, and it may need to be mixed locally and
   played out on any loudspeaker.  For example, suppose the consumer
   wants to hear the top 3 talkers, regardless of where they are
   located remotely.  If all 3 top talkers are coming from the left,
   then the 3 audio streams need to be mixed, perhaps locally, and
   played out on the left.

   Note: Several typical scenarios are described at the end of this
   document, in the section titled "Use case note".

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119]
   and indicate requirement levels for compliant implementations.

3.  Audio Rendering Tag Mechanism

   We propose an audio tagging mechanism in order to cope with a
   changing mapping of the most significant audio and video
   participants (i.e., normal MCU operation in the presence of more
   participants' media streams than can be rendered simultaneously) and
   to get audio played out correctly on multiple loudspeakers.
   A consumer optionally tells the provider an audio tag value
   corresponding to each of its chosen video captures, which enables
   received audio to be associated with the correct video stream even
   when the set of audible participants changes.  This information is
   included with the consumer request, so there is no need for
   additional CLUE message exchanges (specifically, no additional
   provider capture advertisements or consumer requests).

   The audio tags are defined in the consumer request, as opposed to in
   a capture advertised by the provider.  The reason for this is that
   it is valid for a consumer to request a capture multiple times (with
   different encodings, for example), and hence a method is required
   for differentiating between these streams.

   When the consumer configures the provider, saying which captures it
   wants, it also optionally includes an audio tag with each capture
   request.  For example: VC1, ATag1; VC2, ATag2.

   When the provider sends audio packets to the consumer, it includes
   the appropriate audio tag in an RTP header extension.  For example,
   if the provider is sending audio packets that are associated with
   VC1, it tags the packets with ATag1.  The consumer can then play out
   the audio in a position appropriate for video from VC1.

   Suppose that several audio streams need to be played out through the
   same loudspeaker -- for example, the 3 audio streams (AC1, AC2, AC3)
   need to be played out at the loudspeaker associated with VC1.  The
   provider would send:

      AC1  ATag1
      AC2  ATag1
      AC3  ATag1

   AC1, AC2 and AC3 are all played out on the same loudspeaker, the
   audio output associated with VC1.  This takes care of the issue of
   dynamic audio output -- assigning the right loudspeaker to audio
   streams.

   Figure 1 illustrates an example showing 3 screens, each with a main
   video and 3 PIPs.  Below each screen is a list of the video captures
   (VCs) with the associated audio tag.
                       ------- 3 Screens -------

   +------------------+------------------+--------------------+
   |                  |                  |                    |
   |       VC1        |       VC2        |        VC3         |
   |                  |                  |                    |
   |  +---+-----+---+ |  +---+-----+---+ |  +----+----+----+  |
   |  |VC4| VC5 |VC6| |  |VC7| VC8 |VC9| |  |VC10|VC11|VC12|  |
   +------------------+------------------+--------------------+

     VC1                VC2                VC3
     VC4  Audio Tag 1   VC7  Audio Tag 2   VC10  Audio Tag 3
     VC5                VC8                VC11
     VC6                VC9                VC12

        Figure 1: Audio rendering tags for 3 screen example

   The provider may choose not to include the extension header in an
   audio packet, signaling that there is no association between the
   current audio and current video (e.g., an audio-only participant).
   It may also include more than one audio tag in the extension header,
   signaling that this audio is associated with multiple current video
   participants, due perhaps to a capture being received multiple times
   at different resolutions, or to two video captures that both include
   the current talker.

   This mechanism also allows multiple audio streams to be associated
   with a single video stream (e.g., for a composed video stream); this
   simply requires the appropriate audio packets to be tagged with the
   same tag value.

4.  Use of the RTP header extension

   We propose that audio tags are integer numbers between 0 and 255,
   optionally set by the consumer per requested capture.  This allows
   up to 16 tags to be included in a single element of a one-byte RTP
   header extension [RFC5285].

   An example header extension for an audio packet with one tag
   follows.  The audio tag extension is ID1.  The example includes
   another header extension element (ID0) to show how the proposal
   would interact with [I-D.lennox-clue-rtp-usage]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      0xBE     |      0xDE     |           length=1            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ID0  |  L=0  |     data      |  ID1  |  L=0  |      Tag      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     RTP header extension carrying an audio rendering tag and a
                            capture ID

   The lack of the RTP header extension in a packet means that the
   audio packet is not associated with any of the requested video
   streams that included audio tags.

5.  Use case note

   o  An endpoint can receive multiple video and audio streams and
      render complex layouts locally.

   o  It may have a wide display area, so directional audio is
      important.

   o  It may have one loudspeaker per display, or perhaps some entirely
      different multi-loudspeaker setup known only to the endpoint
      itself.

   o  The endpoint may therefore have the capability of playing back
      audio from a wide range of positions, either from a few fixed
      zones or with fine granularity.

   o  It may do so either by routing a sound source to a single
      loudspeaker, by panning between pairs of loudspeakers, or by some
      other advanced distribution scheme involving several or even all
      loudspeakers.

6.  Security Considerations

   TBD

7.  Acknowledgements

   Thanks to Johan Nielsen for discussions and for adding the Use case
   note.

8.  IANA Considerations

   TBD

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

9.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino,
              "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-05 (work in progress), May 2012.

   [I-D.lennox-clue-rtp-usage]
              Lennox, J., Witty, P., and A. Romanow, "Real-Time
              Transport Protocol (RTP) Usage for Telepresence Sessions",
              draft-lennox-clue-rtp-usage-03 (work in progress),
              March 2012.
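Appendix A.  Example header extension encoding (non-normative)

   As a non-normative illustration of the one-byte header extension
   format described in Section 4, the following Python sketch builds
   and parses an extension block carrying an audio tag element.  The
   element ID value (1), all function names, and the tag-to-position
   mapping at the end are invented for illustration only; real element
   IDs would be negotiated out of band.

```python
# Non-normative sketch: encoding and decoding the proposed audio tag
# in an RFC 5285 one-byte RTP header extension.  The element ID (1)
# and all names here are assumptions, not defined by this draft.
import struct

AUDIO_TAG_ID = 1  # assumed local identifier for the audio tag element

def build_extension(tag):
    """Build a one-byte header extension block carrying a single
    audio tag element (L=0, i.e., one byte of element data)."""
    element = bytes([(AUDIO_TAG_ID << 4) | 0, tag])
    element += b"\x00" * ((-len(element)) % 4)  # pad to 32-bit words
    words = len(element) // 4                   # length field is in words
    return struct.pack("!HH", 0xBEDE, words) + element

def parse_audio_tags(ext):
    """Return all audio tag values found in a one-byte header
    extension block (zero, one, or several, per Section 3)."""
    magic, words = struct.unpack("!HH", ext[:4])
    if magic != 0xBEDE:
        return []
    data, tags, i = ext[4:4 + 4 * words], [], 0
    while i < len(data):
        if data[i] == 0:                        # padding byte
            i += 1
            continue
        elem_id, length = data[i] >> 4, (data[i] & 0x0F) + 1
        if elem_id == AUDIO_TAG_ID:
            tags.extend(data[i + 1:i + 1 + length])
        i += 1 + length
    return tags

# Hypothetical consumer-side use: tags 1..3 were assigned to VC1..VC3,
# so tagged audio is played at the loudspeaker matching the capture.
tag_to_position = {1: "left", 2: "center", 3: "right"}
positions = [tag_to_position[t]
             for t in parse_audio_tags(build_extension(2))]
# positions is ["center"] for audio tagged with ATag2
```

   An element with L=15 would carry 16 tag bytes, which is the source
   of the 16-tag limit mentioned in Section 4.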
Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com


   Robert Hansen
   Cisco Systems
   Langley
   UK

   Email: rohanse2@cisco.com


   Andy Pepperell
   Silverflare

   Email: andy.pepperell@silverflare.com


   Brian Baldino
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: bbaldino@cisco.com