Ambisonics in an Ogg Opus Container

Google Inc.
345 Spear Street
San Francisco, CA 94105
USA
jks@google.com
michael@graczyk.com
RAI
codec

This document defines an extension to the Opus audio codec to encapsulate coded ambisonics using the Ogg format.

Ambisonics is a representation format for three-dimensional sound fields that can be used for surround sound and immersive virtual reality playback. See "Ambisonics. Part one: General system description" and "Further Study of Sound Field Coding with Higher Order Ambisonics" for technical details on the ambisonics format. For the purposes of this document, ambisonics can be considered a multichannel audio stream. A separate stereo stream can be used alongside the ambisonics in a head-tracked virtual reality experience to provide so-called non-diegetic audio, that is, audio that should remain unchanged by listener head rotation (e.g., narration or stereo music).

Ogg is a general-purpose container supporting audio, video, and other media. It can be used to encapsulate audio streams coded using the Opus codec. See [RFC6716] and [RFC7845] for technical details on the Opus codec and its encapsulation in the Ogg container, respectively.

This document extends the Ogg Opus format by defining two new channel mapping families for encoding ambisonics. The Ogg Opus format is extended indirectly by adding items with values 2 and 3 to the IANA "Opus Channel Mapping Families" registry. When 2 or 3 is used as the Channel Mapping Family Number in an Ogg stream, the semantic meaning of the channels in the multichannel Opus stream is one of the ambisonics layouts defined in this document. This mapping can also be used in other contexts that make use of the channel mappings defined by the Opus Channel Mapping Families registry.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Ambisonics can be encapsulated in the Ogg format by encoding with the Opus codec and setting the channel mapping family value to 2 or 3 in the Ogg identification header (ID).
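As a minimal sketch of how a reader might detect these families, the following Python snippet parses the fixed part of an Ogg Opus identification header and checks the channel mapping family. The field layout assumed here is the one given for the identification header in RFC 7845; the function name and the returned dictionary keys are hypothetical and are not defined by this document.

```python
import struct


def read_opus_head(packet: bytes) -> dict:
    """Parse the fixed 19-octet prefix of an Opus identification header.

    Hypothetical helper: the layout (magic, version, channel count,
    pre-skip, input sample rate, output gain, mapping family) is assumed
    from RFC 7845; all multi-octet fields are little endian.
    """
    if len(packet) < 19 or packet[:8] != b"OpusHead":
        raise ValueError("not an Opus identification header")
    version, channels, pre_skip, rate, gain, family = struct.unpack_from(
        "<BBHIhB", packet, 8
    )
    return {
        "version": version,
        "channel_count": channels,   # C, the 3rd field of the header
        "pre_skip": pre_skip,
        "input_sample_rate": rate,
        "output_gain": gain,
        "mapping_family": family,
        # Families 2 and 3 are the ambisonics families of this document.
        "is_ambisonic": family in (2, 3),
    }
```

A demuxer could branch on `mapping_family` after this call and hand the remaining octets of the packet to a family-specific parser.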
A demuxer implementation encountering Channel Mapping Family 2 or Family 3 MUST interpret the Opus stream as containing ambisonics with the format described in this document for the respective family.

For Channel Mapping Family 2, the allowed numbers of channels are (1 + n)^2 + 2j for n = 0, 1, ..., 14 and j = 0 or 1, where n denotes the (highest) ambisonic order and j denotes whether or not there is a separate non-diegetic stereo stream. This corresponds to periphonic ambisonics from zeroth to fourteenth order plus, optionally, two channels of non-diegetic stereo. Explicitly, the allowed numbers of channels are 1, 3, 4, 6, 9, 11, 16, 18, 25, 27, 36, 38, 49, 51, 64, 66, 81, 83, 100, 102, 121, 123, 144, 146, 169, 171, 196, 198, 225, and 227.

This channel mapping uses the same channel mapping table format used by channel mapping family 1. The output channels are ambisonic components ordered in Ambisonic Channel Number (ACN) order, defined in "AMBIX - A SUGGESTED AMBISONICS FORMAT", followed by two optional channels of non-diegetic stereo indexed (left, right). For the ambisonic channels, the ACN component corresponds to the channel index k, i.e., k = ACN. The reverse correspondence can also be computed for an ambisonic channel with index k: the order is n = floor(sqrt(k)) and the degree is m = k - n * (n + 1).

Note that channel mapping family 2 allows for so-called mixed-order ambisonic representations, where only a subset of the full ambisonic order number of channels is encoded. By specifying the full number in the channel count field, the inactive ACNs can then be indicated in the channel mapping field using the index 255.

Ambisonic channels are normalized with Schmidt Semi-Normalization (SN3D). The interpretation of the ambisonics signal as well as detailed definitions of ACN channel ordering and SN3D normalization are described in Section 2.1 of "AMBIX - A SUGGESTED AMBISONICS FORMAT".

For Channel Mapping Family 3, the allowed numbers of channels are (1 + n)^2 + 2j for n = 0, 1, ..., 14 and j = 0 or 1, where n denotes the (highest) ambisonic order and j denotes whether or not there is a separate non-diegetic stereo stream. This corresponds to periphonic ambisonics from zeroth to fourteenth order plus, optionally, two channels of non-diegetic stereo. Explicitly, the allowed numbers of channels are 1, 3, 4, 6, 9, 11, 16, 18, 25, 27, 36, 38, 49, 51, 64, 66, 81, 83, 100, 102, 121, 123, 144, 146, 169, 171, 196, 198, 225, and 227.
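The channel-count rule and the ACN index correspondence can be checked with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the order and degree formulas follow the usual ACN definition, k = n * (n + 1) + m.

```python
import math

MAX_ORDER = 14  # highest ambisonic order permitted by the channel-count rule


def allowed_channel_counts() -> list:
    """All values of (1 + n)^2 + 2j for n = 0..14 and j = 0 or 1."""
    return sorted(
        (1 + n) ** 2 + 2 * j for n in range(MAX_ORDER + 1) for j in (0, 1)
    )


def split_channel_count(count: int) -> tuple:
    """Return (ambisonic order, has_non_diegetic_stereo) for a channel count.

    Raises ValueError if the count is not of the form (1 + n)^2 + 2j.
    """
    for j in (0, 1):
        ambisonic = count - 2 * j
        if ambisonic < 1:
            continue
        root = math.isqrt(ambisonic)
        if root * root == ambisonic and root - 1 <= MAX_ORDER:
            return root - 1, bool(j)
    raise ValueError("channel count %d is not allowed" % count)


def acn_to_order_degree(k: int) -> tuple:
    """Ambisonic order n and degree m for ACN index k (k = n(n + 1) + m)."""
    n = math.isqrt(k)
    return n, k - n * (n + 1)
```

For instance, `split_channel_count(6)` identifies first-order ambisonics plus non-diegetic stereo, and `acn_to_order_degree(1)` gives order 1 and degree -1, the component commonly called "Y".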
In this mapping, C output channels (the channel count) are generated at the decoder by multiplying K = N + M decoded channels with a designated demixing matrix, D, having C rows and K columns. Here, N denotes the number of streams encoded and M the number of these that are coupled to produce two channels. As with channel mapping family 2, this mapping family allows for encoding and decoding of full-order ambisonics, mixed-order ambisonics, and non-diegetic stereo channels, but it has the added flexibility of mixing channels. Let X denote a column vector containing the K decoded channels X1, X2, ..., XK (from N streams), and let S denote a column vector containing the C output streams S1, S2, ..., SC. Then S = D X, i.e., Si = Di1 X1 + Di2 X2 + ... + DiK XK for i = 1, ..., C.

The matrix MUST be provided as side information and MUST be stored in the channel mapping table part of the identification header, cf. Section 5.1.1 of [RFC7845]. The matrix replaces the need for a channel mapping field, and for channel mapping family 3 the mapping table consists of the Stream Count, the Coupled Stream Count, and the Demixing Matrix. The fields in the channel mapping table have the following meaning:

Stream Count 'N' (8 bits, unsigned): This is the total number of streams encoded in each Ogg packet.

Coupled Stream Count 'M' (8 bits, unsigned): This is the number of the N streams whose decoders are to be configured to produce two channels (stereo).

Demixing Matrix (16*K*C bits, signed): The coefficients of the demixing matrix stored column-wise as 16-bit, signed, two's complement fixed-point values with 15 fractional bits (Q15), little endian. If needed, the output gain field can be used for a normalization scale. For mixed-order ambisonic representations, the silent ACN channels are indicated by all zeros in the corresponding rows of the demixing matrix. This also allows for mixed order with non-diegetic stereo, as the number of columns implies the presence of non-diegetic channels.

Note that [RFC7845] specifies that the identification header cannot exceed one "page", which is 65,025 octets.
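The family 3 mapping table and the demixing step S = D X can be sketched in Python as follows. This is a sketch under stated assumptions rather than a reference implementation: the function names are hypothetical, the input is assumed to be the octets of the channel mapping table (starting at the Stream Count), the output gain is not applied, and one sample frame is demixed at a time.

```python
import struct


def parse_family3_mapping(table: bytes, channel_count: int):
    """Return (N, M, D) from a family 3 channel mapping table.

    D is a list of C rows, each holding K floats, where K = N + M and
    C is the channel count taken from the identification header.
    """
    if len(table) < 2:
        raise ValueError("mapping table too short")
    n_streams, n_coupled = table[0], table[1]
    k = n_streams + n_coupled
    c = channel_count
    if len(table) < 2 + 2 * k * c:
        raise ValueError("demixing matrix truncated")
    # 16-bit signed little-endian values, stored column by column.
    raw = struct.unpack_from("<%dh" % (k * c), table, 2)
    # Column-wise storage: the entry in row i, column j sits at j * C + i.
    # Q15 means 15 fractional bits, so divide by 2^15 to get the value.
    d = [[raw[j * c + i] / 32768.0 for j in range(k)] for i in range(c)]
    return n_streams, n_coupled, d


def demix(d, x):
    """Compute S = D X for one frame of K decoded samples."""
    return [sum(dij * xj for dij, xj in zip(row, x)) for row in d]
```

An all-zero row of D yields a silent output channel, which matches how inactive ACNs are signaled for mixed-order content. The matrix alone occupies 2*K*C octets of the identification header, which is bounded by the one-page size noted above.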
The one-page limit restricts the ambisonic order to be lower than 12 if full order is utilized and the number of coded streams is the same as the ambisonic order plus the two non-diegetic channels. Also note that the total output channel number, C, MUST be set in the 3rd field of the identification header.

An Ogg Opus player MAY use the matrix in Figure to implement downmixing from multichannel files using Channel Mapping Family 2 and 3 when there is no non-diegetic stereo. This downmixing is known to give acceptable results for stereo downmixing from ambisonics. The first and second ambisonic channels are known as "W" and "Y", respectively.

The first ambisonic channel (W) is a mono audio stream that represents the average audio signal over all directions. Since W is not directional, Ogg Opus players MAY use W directly for mono playback.

If a non-diegetic stereo track is present, the player MAY use the matrix in Figure for downmixing. Ls and Rs denote the two non-diegetic stereo channels.

Implementations of the Ogg container need to take appropriate security considerations into account, as outlined in Section 10 of . The extension defined in this document requires that semantic meaning be assigned to more channels than the existing Ogg format requires. Since more allocations will be required to encode and decode these semantically meaningful channels, care should be taken in any new allocation paths. Implementations MUST NOT overrun their allocated memory nor read from uninitialized memory when managing the ambisonic channel mapping.

This document updates the IANA Media Types registry "Opus Channel Mapping Families" to add two new assignments.

   Value   Reference
   2       This Document
   3       This Document

Thanks to Timothy Terriberry, Marcin Gorzel and Andrew Allen for their guidance and valuable contributions to this document.

Key words for use in RFCs to Indicate Requirement Levels (RFC 2119)
Definition of the Opus Audio Codec (RFC 6716)
Ogg Encapsulation for the Opus Audio Codec (RFC 7845)
AMBIX - A SUGGESTED AMBISONICS FORMAT
Ambisonics. Part one: General system description
Further Study of Sound Field Coding with Higher Order Ambisonics