<pre class='metadata'>
Group: AOM
Status: FD
Text Macro: SPECVERSION v1.0.0-errata
Title: Immersive Audio Model and Formats
Editor: SungHee Hwang, Samsung, [email protected]
Editor: Felicia Lim, Google, [email protected]
Repository: AOMediaCodec/iamf
Shortname: iamf
URL: https://aomediacodec.github.io/iamf/v1.0.0-errata.html
!Previously approved version: <a href="https://aomediacodec.github.io/iamf/v1.0.0.html">https://aomediacodec.github.io/iamf/v1.0.0.html</a>
!Latest approved version: <a href="https://aomediacodec.github.io/iamf/latest-approved.html">https://aomediacodec.github.io/iamf/latest-approved.html</a>
!Latest draft version: <a href="https://aomediacodec.github.io/iamf/latest-draft.html">https://aomediacodec.github.io/iamf/latest-draft.html</a>
Date: 2024-04-03
!Reference Implementation: <a href="https://github.com/AOMediaCodec/libiamf/releases/tag/v1.0.0-errata/">libiamf v1.0.0-errata</a>
Abstract: This document specifies the Immersive Audio (IA) model, the standalone IA Sequence format, and the [[!ISO-BMFF]]-based IA container format.
Local Boilerplate: footer yes
Metadata Order: This version, !*, *
</pre>
<pre class="anchors">
url: https://www.iso.org/standard/68960.html#; spec: ISO-BMFF; type: dfn;
  text: AudioSampleEntry
  text: channelcount
  text: samplerate
  text: roll_distance
  text: SamplingRateBox
url: https://www.iso.org/standard/68960.html#; spec: ISO-BMFF; type: property;
  text: iso6
  text: stsd
  text: edts
  text: stts
  text: roll
  text: elst
  text: trun
  text: ctts
  text: stss
  text: btrt
url: https://aomediacodec.github.io/av1-spec/av1-spec.pdf#; spec: AV1-Spec; type: dfn;
  text: Clip3
url: https://www.iso.org/standard/43345.html#; spec: AAC; type: dfn;
  text: raw_data_block()
url: https://www.iso.org/standard/55688.html#; spec: MP4-Systems; type: dfn;
  text: objectTypeIndication
  text: streamType
  text: upstream
  text: decSpecificInfo()
  text: DecoderConfigDescriptor()
  text: Syntactic Description Language
url: https://www.iso.org/standard/76383.html#; spec: MP4-Audio; type: dfn;
  text: AudioSpecificConfig()
  text: audioObjectType
  text: channelConfiguration
  text: GASpecificConfig()
  text: frameLengthFlag
  text: dependsOnCoreCoder
  text: extensionFlag
  text: samplingFrequencyIndex
url: https://www.iso.org/standard/79110.html#; spec: ISO-MP4; type: dfn;
  text: ESDBox
url: https://tools.ietf.org/html/rfc6381#; spec: RFC-6381; type: property;
  text: codecs
url: https://tools.ietf.org/html/rfc8486#; spec: RFC-8486; type: dfn;
  text: channel count
  text: ChannelMappingFamily
url: https://tools.ietf.org/html/rfc7845#; spec: RFC-7845; type: dfn;
  text: ID Header
  text: Magic Signature
  text: Output Channel Count
  text: Output Gain
  text: Pre-skip
url: https://tools.ietf.org/html/rfc6716#; spec: RFC-6716; type: dfn;
  text: Opus packet
url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf#; spec: ITU-1770-4; type: dfn;
  text: LKFS
url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf#; spec: ITU-2051-3; type: dfn;
  text: Loudspeaker configuration for Sound System A (0+2+0)
  text: Loudspeaker configuration for Sound System B (0+5+0)
  text: Loudspeaker configuration for Sound System C (2+5+0)
  text: Loudspeaker configuration for Sound System D (4+5+0)
  text: Loudspeaker configuration for Sound System E (4+5+1)
  text: Loudspeaker configuration for Sound System F (3+7+0)
  text: Loudspeaker configuration for Sound System G (4+9+0)
  text: Loudspeaker configuration for Sound System H (9+10+3)
  text: Loudspeaker configuration for Sound System I (0+7+0)
  text: Loudspeaker configuration for Sound System J (4+7+0)
  text: SP Label
url: https://xiph.org/flac/format.html; spec: FLAC; type: dfn;
  text: METADATA_BLOCK
  text: METADATA_BLOCK_STREAMINFO
  text: FRAME
  text: FRAME_HEADER
  text: minimum block size
  text: maximum block size
  text: minimum frame size
  text: maximum frame size
  text: number of channels
  text: MD5 signature
  text: Block size in inter-channel samples
  text: Sample rate
  text: Channel assignment
  text: Sample size in bits
url: https://www.iso.org/standard/77752.html#; spec: MP4-PCM; type: dfn;
  text: format_flags
  text: PCM_sample_size
</pre>
<pre class='biblio'>
{
"AI-CAD-Mixing": {
"title": "AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework",
"status": "Paper",
"publisher": "AES",
"href": "https://www.aes.org/e-lib/browse.cfm?elib=21489"
},
"AAC": {
"title": "Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC)",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/43345.html"
},
"MP4-Audio": {
"title": "Information technology — Coding of audio-visual objects — Part 3: Audio",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/76383.html"
},
"MP4-Systems": {
"title": "Information technology — Coding of audio-visual objects — Part 1: Systems",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/55688.html"
},
"ISO-BMFF": {
"title": "Information Technology - Coding of audio-visual objects - Part 12: ISO base media file format",
"status" : "Standard",
"publisher" : "ISO/IEC",
"href" : "https://www.iso.org/standard/68960.html"
},
"ISO-CICP": {
"title": "Information Technology - Coding-Independent Code Points - Part 3: Audio",
"status" : "Standard",
"publisher" : "ISO/IEC",
"href" : "https://www.iso.org/standard/73413.html"
},
"ITU-1770-4": {
"title": "Algorithms to measure audio programme loudness and true-peak audio level",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf"
},
"ITU-2051-3": {
"title": "Advanced sound system for programme production",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf"
},
"Q-Format": {
"title": "Q (number format)",
"status": "Best Practice",
"publisher": "Wikipedia",
"href": "https://en.wikipedia.org/wiki/Q_(number_format)"
},
"BCP-47": {
"title": "BCP 47",
"status": "Best Practice",
"publisher": "IETF",
"href": "https://www.rfc-editor.org/info/bcp47"
},
"FLAC": {
"title": "Free Lossless Audio Codec",
"status": "Best Practice",
"publisher": "xiph.org",
"href": "https://xiph.org/flac/format.html"
},
"AV1-Spec": {
"title": "AV1 Bitstream & Decoding Process Specification",
"status": "Spec",
"publisher": "aomedia.org",
"href": "https://aomediacodec.github.io/av1-spec/av1-spec.pdf"
},
"ITU-2076-2": {
"title": "Audio Definition Model",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2076-2-201910-I!!PDF-E.pdf"
},
"ITU-2127-0": {
"title": "Audio Definition Model renderer for advanced sound systems",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2127-0-201906-I!!PDF-E.pdf"
},
"EBU-Tech-3396": {
"title": "BINAURAL EBU ADM RENDERER (BEAR) FOR OBJECT-BASED SOUND OVER HEADPHONES",
"status": "Spec",
"publisher": "EBU",
"href": "https://tech.ebu.ch/publications/tech3396"
},
"Resonance-Audio": {
"title": "Efficient Encoding and Decoding of Binaural Sound with Resonance Audio",
"status": "Paper",
"publisher": "AES",
"href": "https://www.aes.org/e-lib/browse.cfm?elib=20446"
},
"MP4-PCM": {
"title": "Information technology — MPEG audio technologies — Part 5: Uncompressed audio in MPEG-4 file format",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/77752.html"
},
"RFC-3629": {
"title": "UTF-8, a transformation format of ISO 10646",
"status": "Standard",
"publisher": "IETF",
"href": "https://tools.ietf.org/html/rfc3629"
},
"RFC-6381": {
"title": "The 'Codecs' and 'Profiles' Parameters for Bucket Media Types",
"status": "Standard",
"publisher": "IETF",
"href": "https://tools.ietf.org/html/rfc6381"
},
"RFC-6716": {
"title": "Definition of the Opus Audio Codec",
"status": "Standard",
"publisher": "IETF",
"href": "https://tools.ietf.org/html/rfc6716"
},
"RFC-7845": {
"title": "Ogg Encapsulation for the Opus Audio Codec",
"status": "Standard",
"publisher": "IETF",
"href": "https://tools.ietf.org/html/rfc7845"
},
"RFC-8486": {
"title": "Ambisonics in an Ogg Opus Container",
"status": "Standard",
"publisher": "IETF",
"href": "https://tools.ietf.org/html/rfc8486"
}
}
</pre>
# Introduction # {#introduction}
This specification defines the Immersive Audio Model and Formats (IAMF) to provide an [=Immersive Audio=] experience to end-users.
IAMF is used to provide [=Immersive Audio=] content for presentation on a wide range of devices in both streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g., headphones, mobile phones, tablets, TVs, sound bars, home theater systems, and big screens.
Here are some typical IAMF use cases and examples of how to instantiate the model for the use cases.
- UC1: One [=Audio Element=] (e.g., 3.1.2ch or First Order Ambisonics (FOA)) is delivered to a big-screen TV (in a home) or a mobile device through a unicast network. It is rendered to a loudspeaker layout (e.g., 3.1.2ch) or headphones with loudness normalization, and is played back on loudspeakers built into the big-screen TV or headphones connected to the mobile device, respectively.
- UC2: Two [=Audio Element=]s (e.g., 5.1.2ch and Stereo) are delivered to a big-screen TV through a unicast network. Both are rendered to the same loudspeaker layout built into the big-screen TV and are mixed. After applying loudness normalization appropriate to the home environment, the [=Rendered Mix Presentation=] is played back on the loudspeakers.
- UC3: Two [=Audio Element=]s (e.g., FOA and Non-diegetic Stereo) are delivered to a mobile device through a unicast network. FOA is rendered to Binaural (or Stereo) and Non-diegetic Stereo is rendered to Stereo. After mixing, the result is processed with loudness normalization and played back on headphones through the mobile device.
Example 1: UC1 with [=3D audio signal=] = 3.1.2ch.
- Audio Substream: The Left (L) and Right (R) channels are coded as one audio stream, the Left top front (Ltf) and Right top front (Rtf) channels as one audio stream, the Center channel as one audio stream, and the Low-Frequency Effects (LFE) channel as one audio stream.
- Audio Element (3.1.2ch): Consists of 4 Audio Substreams which are grouped into one [=Channel Group=].
- Mix Presentation: Provides rendering algorithms for rendering the Audio Element to popular loudspeaker layouts and headphones, and the loudness information of the [=3D audio signal=].
Example 2: UC2 with two [=3D audio signal=]s = 5.1.2ch and Stereo.
- Audio Substream: The L and R channels are coded as one audio stream, the Left surround (Ls) and Right surround (Rs) channels as one audio stream, the Ltf and Rtf channels as one audio stream, the Center channel as one audio stream, and the LFE channel as one audio stream.
- Audio Element 1 (5.1.2ch): Consists of 5 Audio Substreams which are grouped into one [=Channel Group=].
- Audio Element 2 (Stereo): Consists of 1 Audio Substream which is grouped into one [=Channel Group=].
- Parameter Substream 1-1: Contains mixing parameter values that are applied to Audio Element 1 by considering the home environment.
- Parameter Substream 1-2: Contains mixing parameter values that are applied to Audio Element 2 by considering the home environment.
- Mix Presentation: Provides rendering algorithms for rendering Audio Elements 1 & 2 to popular loudspeaker layouts, mixing information based on Parameter Substreams 1-1 & 1-2, and loudness information of the [=Rendered Mix Presentation=].
Example 3: UC3 with two [=3D audio signal=]s = First Order Ambisonics (FOA) and Non-diegetic Stereo.
- Audio Substream: The L and R channels are coded as one audio stream and each channel of the FOA signal as one audio stream.
- Audio Element 1 (FOA): Consists of 4 Audio Substreams which are grouped into one [=Channel Group=].
- Audio Element 2 (Non-diegetic Stereo): Consists of 1 Audio Substream which is grouped into one [=Channel Group=].
- Parameter Substream 1-1: Contains mixing parameter values that are applied to Audio Element 1 by considering the mobile environment.
- Parameter Substream 1-2: Contains mixing parameter values that are applied to Audio Element 2 by considering the mobile environment.
- Mix Presentation: Provides rendering algorithms for rendering Audio Elements 1 & 2 to popular loudspeaker layouts and headphones, mixing information based on Parameter Substreams 1-1 & 1-2, and loudness information of the [=Rendered Mix Presentation=].
# Immersive Audio Model # {#iamodel}
## Model Overview ## {#model-overview}
This specification defines a model for representing [=Immersive Audio=] contents based on [=Audio Substream=]s contributing to [=Audio Element=]s meant to be rendered and mixed to form one or more presentations as depicted in the figure below.
<center><img src="images/decoding_flow_cropped.svg" width="800"></center>
<center><figcaption>Processing flow to decode, reconstruct, render, and mix the 3D audio signals for immersive audio playback.</figcaption></center>
The model comprises a number of coded [=Audio Substream=]s and the metadata that describes how to decode, render and mix the [=Audio Substream=]s for playback. The model itself is codec-agnostic; any supported audio codec may be used to code the [=Audio Substream=]s.
The model includes one or more [=Audio Element=]s, each of which consists of one or more [=Audio Substream=]s. The [=Audio Substream=]s that make up an [=Audio Element=] are grouped into one or more [=Channel Group=]s. The model further includes [=Mix Presentation=]s and [=Parameter Substream=]s.
The term <dfn noexport>3D audio signal</dfn> means a representation of sound that incorporates additional information beyond traditional stereo or surround sound formats. Examples include Ambisonics (scene-based), object-based audio, and channel-based audio (e.g., 3.1.2ch or 7.1.4ch).
The term <dfn noexport>Immersive Audio</dfn> (IA) means the combination of [=3D audio signal=]s recreating a sound experience close to that of a natural environment.
The term <dfn noexport>Audio Substream</dfn> means a sequence of audio samples, which may be encoded with any compatible audio codec.
The term <dfn noexport>Channel Group</dfn> means a set of [=Audio Substream=](s) which is(are) able to provide a spatial resolution of audio contents by itself or which is(are) able to provide an enhanced spatial resolution of audio contents by combining with the preceding [=Channel Group=]s.
The term <dfn noexport>Audio Element</dfn> means a [=3D audio signal=], and is constructed from one or more [=Audio Substream=]s (grouped into one or more [=Channel Group=]s) and the metadata describing them. The [=Audio Substream=]s associated with one [=Audio Element=] use the same audio codec.
The term <dfn noexport>Mix Presentation</dfn> means a series of processes to present [=Immersive Audio=] contents to end-users by using [=Audio Element=](s). It contains metadata that describes how the [=Audio Element=](s) is(are) rendered and mixed together for playback through physical loudspeakers or headphones, as well as loudness information.
The term <dfn noexport>Parameter Substream</dfn> means a sequence of parameter values that are associated with the algorithms used for reconstructing, rendering, and mixing. It is applied to its associated [=Audio Element=] or [=Mix Presentation=]. [=Parameter Substream=]s may change their values over time and may further be animated; for example, any changes in values may be smoothed over some time duration. As such, they may be viewed as a 1D signal with different metadata specified for different time durations.
The term <dfn noexport>Rendered Mix Presentation</dfn> means a [=3D audio signal=] after the [=Audio Element=](s) defined in a [=Mix Presentation=] is(are) rendered and mixed together for playback through physical loudspeakers or headphones.
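As a reading aid, the terms defined above can be mirrored in a small data model. The Python sketch below is illustrative only: the class and field names are hypothetical and are not part of the bitstream syntax, and the codec name is an arbitrary placeholder. It instantiates Example 1 from the Introduction (one 3.1.2ch Audio Element carried by four Audio Substreams in a single Channel Group).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioSubstream:
    # Channels carried by this coded substream, e.g. ["L", "R"] or ["LFE"].
    channels: List[str]


@dataclass
class ChannelGroup:
    substreams: List[AudioSubstream]


@dataclass
class AudioElement:
    label: str
    codec: str  # all substreams of one Audio Element use the same codec
    channel_groups: List[ChannelGroup]

    def substreams(self) -> List[AudioSubstream]:
        """Flatten the substreams of all Channel Groups, in order."""
        return [s for g in self.channel_groups for s in g.substreams]


@dataclass
class MixPresentation:
    # Audio Elements that are rendered and mixed together for playback.
    audio_elements: List[AudioElement] = field(default_factory=list)


# Example 1: 3.1.2ch coded as L/R, Ltf/Rtf, Center, and LFE substreams.
element = AudioElement(
    label="3.1.2ch",
    codec="opus",  # assumption: placeholder for any supported codec
    channel_groups=[ChannelGroup([
        AudioSubstream(["L", "R"]),
        AudioSubstream(["Ltf", "Rtf"]),
        AudioSubstream(["C"]),
        AudioSubstream(["LFE"]),
    ])],
)
mix = MixPresentation(audio_elements=[element])
channel_count = sum(len(s.channels) for s in element.substreams())
```

Parameter Substreams and loudness information are omitted here to keep the sketch short.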
## Architecture ## {#architecture}
Based on the model, this specification defines the Immersive Audio Model and Formats (<dfn noexport>IAMF</dfn>) architecture as depicted in the figure below.
<center><img src="images/Hypothetical IAMF Architecture.png" style="width:100%; height:auto;"></center>
<center><figcaption>IAMF Architecture</figcaption></center>
For a given input [=3D audio signal=],
- A Pre-Processor generates the [=Channel Group=](s), [=Descriptors=] and [=Parameter Substream=](s).
- A Codec Encoder generates the coded [=Audio Substream=](s).
- An OBU Packetizer generates an [=IA Sequence=] from the coded [=Audio Substream=](s), [=Descriptors=] and [=Parameter Substream=](s).
- An OBU Parser outputs the coded [=Audio Substream=](s) and the [=Parameter Substream=](s) from the [=IA Sequence=].
- A Codec Decoder outputs decoded [=Channel Group=](s) after decoding the coded [=Audio Substream=](s).
- An Element Reconstructor re-assembles the [=Audio Element=]s by combining the [=Channel Group=](s) guided by [=Descriptors=] and [=Parameter Substream=](s).
- A Renderer can be used to render the [=Audio Element=]s to a multi-channel or binaural format based on [=Descriptors=].
- A Mixer sums the rendered [=Audio Element=]s and applies further mixing parameters guided by the [=Descriptors=] and the [=Parameter Substream=](s).
- A Post-Processor outputs an [=Immersive Audio=] by using the [=Channel Group=](s), the [=Descriptors=], and the [=Parameter Substream=](s).
The IAMF generation process, including the Pre-Processor, the [=Channel Group=](s), the Codec Encoder, and the OBU Packetizer, is defined in [[#iamfgeneration]]. The [=IA Sequence=] is defined in [[#standalone-ia-sequence]]. The IAMF processing, including the OBU Parser, the Codec Decoder, the Element Reconstructor, the Renderer, the Mixer, and the Post-Processor, is defined in [[#processing]].
Although not shown in the figure above, the [=IA Sequence=] may be encapsulated by a file packager, such as the ISO-BMFF Encapsulation, to output an IAMF file (ISO-BMFF file). Then, a file parser, such as the ISO-BMFF Parser, decapsulates it to output the [=IA Sequence=]. The ISO-BMFF Encapsulation, IAMF file (ISO-BMFF file), and ISO-BMFF Parser are defined in [[#isobmff]].
## Bitstream Structure ## {#bitstream}
### Overview ### {#overview}
An [=IA Sequence=] is a bitstream to represent [=Immersive Audio=] contents and consists of [=Descriptors=] and [=IA Data=].
The metadata in the [=Descriptors=] and [=IA Data=] are packetized into individual Open Bitstream Units (OBUs). An Open Bitstream Unit (OBU) is the concrete, physical unit used to represent the components in the model. In this specification, the term IA OBU is used interchangeably with OBU.
The [=IA Sequence=] is normatively defined in [[#standalone-ia-sequence]].
### Categorization and Use of Immersive Audio OBUs ### {#use-of-obu}
#### Descriptors #### {#bitstream-descriptors}
<dfn noexport>Descriptors</dfn> contain all the information that is required to set up and configure the decoders, reconstruction algorithms, renderers, and mixers. [=Descriptors=] do not contain audio signals.
- The [=IA Sequence Header OBU=] indicates the start of a full [=IA Sequence=] description and contains information related to profiles.
- The [=Codec Config OBU=] provides information which is required for setting up a decoder for a coded [=Audio Substream=].
- The [=Audio Element OBU=] provides information which is required for combining one or more [=Audio Substream=]s to reconstruct an [=Audio Element=].
- The [=Mix Presentation OBU=] provides information which is required for rendering and mixing one or more [=Audio Element=]s to generate the final [=Immersive Audio=] output.
- Multiple [=Mix Presentation=]s can be defined as alternatives to each other within the same [=IA Sequence=]. Furthermore, the choice of which [=Mix Presentation=] to use at playback is left to the user. For example, multi-language support is implemented by defining different [=Mix Presentation=]s, where the first mix describes the use of the [=Audio Element=] with English dialogue, and the second mix describes the use of the [=Audio Element=] with French dialogue.
#### IA Data #### {#iadata}
<dfn noexport>IA Data</dfn> contains the time-varying data that is required in the generation of the final [=Immersive Audio=] output.
- The [=Audio Frame OBU=] provides the coded audio frame for an [=Audio Substream=]. Each frame has an implied start timestamp and an explicitly defined duration. A coded [=Audio Substream=] is represented as a sequence of [=Audio Frame OBU=]s with the same identifier, in time order.
- The [=Parameter Block OBU=] provides the parameter values in a block for a [=Parameter Substream=]. Each block has an implied start timestamp and an explicitly defined duration. A time-varying [=Parameter Substream=] is represented as a sequence of parameter values in [=Parameter Block OBU=]s with the same identifier, in time order.
- The [=Temporal Delimiter OBU=] identifies the [=Temporal Unit=]s. It may or may not be present in an [=IA Sequence=]. If present, the first OBU of every [=Temporal Unit=] is the [=Temporal Delimiter OBU=].
## Timing Model ## {#timingmodel}
A coded [=Audio Substream=] is made of consecutive [=Audio Frame OBU=]s. Each [=Audio Frame OBU=] is made of audio samples at a given sample rate. The decode duration of an [=Audio Frame OBU=] is the number of audio samples divided by the sample rate. The presentation duration of an [=Audio Frame OBU=] is the number of audio samples remaining after trimming divided by the sample rate. The decode start time (respectively presentation start time) of an [=Audio Frame OBU=] is the sum of the decode durations (respectively presentation durations) of previous [=Audio Frame OBU=]s in the IA Sequence, or 0 otherwise. The decode duration (respectively presentation duration) of a coded [=Audio Substream=] is the sum of the decode durations (respectively presentation durations) of all its [=Audio Frame OBU=]s. The decode start time of an [=Audio Substream=] is the decode start time of its first [=Audio Frame OBU=]. The presentation start time of an [=Audio Substream=] is the presentation start time of its first [=Audio Frame OBU=] which is not entirely trimmed.
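The arithmetic in the preceding paragraph can be illustrated with a short Python sketch. The `Frame` type and the `timeline` helper are hypothetical names introduced for illustration, and the 48 kHz sample rate, frame size, and trim count are example values only. Exact fractions are used so that no rounding occurs.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List, Tuple


@dataclass
class Frame:
    num_samples: int      # samples in the decoded frame
    trim_start: int = 0   # samples discarded at the start
    trim_end: int = 0     # samples discarded at the end


def timeline(frames: List[Frame], sample_rate: int) -> List[Tuple[Fraction, Fraction]]:
    """Return (decode_start, presentation_start) in seconds for each frame."""
    decode_t = Fraction(0)
    present_t = Fraction(0)
    out = []
    for f in frames:
        # Start times are the sums of the durations of all previous frames.
        out.append((decode_t, present_t))
        decode_t += Fraction(f.num_samples, sample_rate)
        kept = f.num_samples - f.trim_start - f.trim_end
        present_t += Fraction(kept, sample_rate)
    return out


# Three 960-sample frames at 48 kHz; 312 samples are trimmed at the start.
frames = [Frame(960, trim_start=312), Frame(960), Frame(960)]
starts = timeline(frames, 48000)
decode_duration = sum(Fraction(f.num_samples, 48000) for f in frames)
presentation_duration = sum(
    Fraction(f.num_samples - f.trim_start - f.trim_end, 48000) for f in frames
)
```

With these example values, the substream's decode duration is 60 ms while its presentation duration is 53.5 ms, because the trimmed samples count toward decoding but not toward presentation.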
A [=Parameter Substream=] is made of consecutive [=Parameter Block OBU=]s. Each [=Parameter Block OBU=] is made of parameter values at a given sample rate. The decode duration of a [=Parameter Block OBU=] is the number of parameter values divided by the sample rate. The decode start time of a [=Parameter Block OBU=] is the sum of the decode durations of the previous [=Parameter Block OBU=]s, if any, or 0 otherwise. The decode duration of a [=Parameter Substream=] is the sum of the decode durations of all its [=Parameter Block OBU=]s. The start time of a [=Parameter Substream=] is the decode start time of its first [=Parameter Block OBU=]. When all parameter values in a [=Parameter Substream=] are constant, the [=IA Sequence=] may contain no [=Parameter Block OBU=]s for it.
Within an [=Audio Element=], the presentation start times of all [=Audio Substream=]s coincide and define the presentation start time of the [=Audio Element=]. All [=Audio Substream=]s have the same presentation duration, which is the presentation duration of the [=Audio Element=].
- The decode start times of all coded [=Audio Substream=]s and all [=Parameter Substream=]s coincide and define the decode start time of the [=Audio Element=].
- All coded [=Audio Substream=]s and all [=Parameter Substream=]s have the same decode duration, which is the decode duration of the [=Audio Element=].
Within a [=Mix Presentation=], the presentation start times of all [=Audio Element=]s coincide, and all [=Audio Element=]s have the same duration, which defines the duration of the [=Mix Presentation=].
Within an [=IA Sequence=], all [=Mix Presentation=]s have the same duration, defining the duration of the [=IA Sequence=], and have the same presentation start time defining the presentation start time of the [=IA Sequence=].
The term <dfn noexport>Temporal Unit</dfn> conceptually means a set of all [=Audio Frame OBU=]s with the same decode start time and the same duration from all coded [=Audio Substream=]s and all non-redundant [=Parameter Block OBU=]s with the decode start time within the duration.
The figure below shows an example of the Timing Model in terms of the decode start times and durations of the coded [=Audio Substream=] and [=Parameter Substream=].
<center><img src="images/IAMF Timing Model.png" style="width:100%; height:auto;"></center>
<center><figcaption>An example of the IAMF Timing Model. AFO: [=Audio Frame OBU=], PBO: [=Parameter Block OBU=], \(\text{PT}x\): time \(x\) (ms) on the presentation layer's timeline, \(\text{DT}y\): time \(y\) (ms) on the decoding layer's timeline.</figcaption></center>
NOTE: For a given decoded [=Audio Substream=] (before trimming) and its associated [=Parameter Substream=](s), a decoder MAY apply trimming in 1 of 2 ways:
<br/>
1) The decoder processes the [=Audio Substream=] using the [=Parameter Substream=](s), and then trims the processed audio samples.
<br/>
2) The decoder trims both the [=Audio Substream=] and the [=Parameter Substream=](s). Then, the decoder processes the trimmed [=Audio Substream=] using the trimmed [=Parameter Substream=](s).
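
The two options in the note above can be compared with a toy example. The sketch below assumes, purely for illustration, that the processing is a per-sample gain, so that one parameter value applies to each audio sample; the function names are hypothetical. Under that assumption both orders yield the same output.

```python
from typing import List


def apply_gain(samples: List[float], gains: List[float]) -> List[float]:
    """Toy processing step: multiply each sample by its parameter value."""
    return [s * g for s, g in zip(samples, gains)]


def trim(values: List[float], at_start: int, at_end: int) -> List[float]:
    """Drop `at_start` leading and `at_end` trailing values."""
    return values[at_start:len(values) - at_end]


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gains = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0]

# Option 1: process first, then trim the processed samples.
option_1 = trim(apply_gain(samples, gains), at_start=2, at_end=1)

# Option 2: trim both substreams, then process the trimmed data.
option_2 = apply_gain(trim(samples, 2, 1), trim(gains, 2, 1))
```

The equality shown here relies on the per-sample assumption stated above; it is not a general proof about every processing algorithm.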
# Open Bitstream Unit (OBU) Syntax and Semantics # {#obu-syntax}
The [=IA Sequence=] uses the OBU syntax.
This section specifies the OBU syntax elements and their semantics.
## Immersive Audio OBU Syntax and Semantics ## {#immersiveaudio-obu}
OBUs are structured with an [=OBU Header=] and an OBU payload.
The [=OBU Header=] and all OBU payloads, including the [=Reserved OBU=], are byte aligned.
<b>Syntax</b>
```
class IAOpenBitstreamUnit() {
  OBUHeader obu_header;
  if (obu_type == OBU_IA_Sequence_Header)
    IASequenceHeaderOBU ia_sequence_header_obu;
  else if (obu_type == OBU_IA_Codec_Config)
    CodecConfigOBU codec_config_obu;
  else if (obu_type == OBU_IA_Audio_Element)
    AudioElementOBU audio_element_obu;
  else if (obu_type == OBU_IA_Mix_Presentation)
    MixPresentationOBU mix_presentation_obu;
  else if (obu_type == OBU_IA_Parameter_Block)
    ParameterBlockOBU parameter_block_obu;
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    TemporalDelimiterOBU temporal_delimiter_obu;
  else if (obu_type == OBU_IA_Audio_Frame)
    AudioFrameOBU audio_frame_obu(true);
  else if (obu_type >= 6 && obu_type <= 23)
    AudioFrameOBU audio_frame_obu(false);
  else if (obu_type >= 24 && obu_type <= 30)
    ReservedOBU reserved_obu;
}
```
<b>Semantics</b>
If the syntax element [=obu_type=] is equal to OBU_IA_Sequence_Header, an ordered series of OBUs is presented to the decoding process as a string of bytes.
## OBU Header Syntax and Semantics ## {#obu-header-syntax}
This section specifies the format of the <dfn noexport>OBU Header</dfn>.
<b>Syntax</b>
```
class OBUHeader() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;
  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag) {
    leb128() extension_header_size;
    unsigned int (8 x extension_header_size) extension_header_bytes;
  }
}
```
<b>Semantics</b>
<dfn noexport>obu_type</dfn> specifies the type of data structure contained in the OBU payload.
<pre class = "def">
obu_type: Name of obu_type
0 : OBU_IA_Codec_Config
1 : OBU_IA_Audio_Element
2 : OBU_IA_Mix_Presentation
3 : OBU_IA_Parameter_Block
4 : OBU_IA_Temporal_Delimiter
5 : OBU_IA_Audio_Frame
6~23 : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17
24~30 : Reserved
31 : OBU_IA_Sequence_Header
</pre>
<dfn noexport>obu_redundant_copy</dfn> indicates whether this OBU is a redundant copy of the previous OBU with the same [=obu_type=] in the [=IA Sequence=]. A value of 1 indicates that it is a redundant copy, while a value of 0 indicates that it is not.
It SHALL always be set to 0 for the following [=obu_type=] values:
- OBU_IA_Temporal_Delimiter
- OBU_IA_Audio_Frame
- OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17
If a decoder encounters an OBU with [=obu_redundant_copy=] = 1, and it has also received the previous non-redundant OBU, it MAY ignore the redundant OBU. If the decoder has not received the previous non-redundant OBU, it SHALL treat the redundant copy as a non-redundant OBU and process the OBU accordingly.
<dfn noexport>obu_trimming_status_flag</dfn> indicates whether this OBU has audio samples to be trimmed. It SHALL be set to 0 or 1 if the [=obu_type=] is set to OBU_IA_Audio_Frame or OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17. Otherwise, it SHALL be set to 0.
For a given coded [=Audio Substream=],
- If an [=Audio Frame OBU=] has its [=num_samples_to_trim_at_start=] field set to a non-zero value N, the decoder SHALL discard the first N audio samples.
- If an [=Audio Frame OBU=] has its [=num_samples_to_trim_at_end=] field set to a non-zero value N, the decoder SHALL discard the last N audio samples.
NOTE: Because of possible coding dependencies, discarding a sample can sometimes mean decoding the entire audio frame.
- For a given [=Audio Frame OBU=], the sum of [=num_samples_to_trim_at_start=] and [=num_samples_to_trim_at_end=] SHALL be less than or equal to the number of samples in the [=Audio Frame OBU=] (i.e., [=num_samples_per_frame=]).
NOTE: This means that if one of the values is set to the number of samples in the [=Audio Frame OBU=] (i.e., [=num_samples_per_frame=]), the other value is set to 0.
- When [=num_samples_to_trim_at_start=] is non-zero, all [=Audio Frame OBU=]s with the same [=audio_substream/audio_substream_id=], and preceding this OBU back until the [=Codec Config OBU=] defining this [=Audio Substream=], SHALL have their [=num_samples_to_trim_at_start=] field equal to the number of samples in the corresponding [=Audio Frame OBU=] (i.e., [=num_samples_per_frame=]).
- When [=num_samples_to_trim_at_end=] is non-zero in an [=Audio Frame OBU=], there SHALL be no subsequent [=Audio Frame OBU=] with the same [=audio_substream/audio_substream_id=] until a non-redundant [=Codec Config OBU=] defining an [=Audio Substream=] with the same [=audio_substream/audio_substream_id=].
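The trimming rules above can be illustrated with a short, informative Python sketch. It checks the sum constraint and returns the range of decoded samples that remain after trimming. The function name `retained_sample_range` and its arguments are hypothetical and are not defined by this specification.

```python
def retained_sample_range(num_samples_per_frame, trim_at_start, trim_at_end):
    """Return (first, last_exclusive) indices of the decoded samples kept.

    Hypothetical helper: name and signature are illustrative only.
    """
    # The sum of both trim values must not exceed the frame length.
    if trim_at_start + trim_at_end > num_samples_per_frame:
        raise ValueError("trim values exceed num_samples_per_frame")
    return trim_at_start, num_samples_per_frame - trim_at_end
```

An empty range (first equal to last_exclusive) corresponds to a frame whose samples are all discarded.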
<dfn noexport>obu_extension_flag</dfn> indicates whether the [=extension_header_size=] field is present. If it is set to 0, the [=extension_header_size=] field SHALL NOT be present. Otherwise, the [=extension_header_size=] field SHALL be present.
NOTE: A future version of the specification may use this flag to specify an extension header field by setting [=obu_extension_flag=] = 1 and setting the size of the extended header to [=extension_header_size=].
<dfn noexport>obu_size</dfn> indicates the size in bytes of the OBU immediately following the [=obu_size=] field. If the [=obu_trimming_status_flag=] and/or [=obu_extension_flag=] fields are set to 1, [=obu_size=] SHALL include the sizes of the additional fields. The [=obu_size=] MAY be greater than the size needed to represent the OBU syntax. Parsers SHOULD ignore bytes past the OBU syntax that they recognize.
<dfn noexport>num_samples_to_trim_at_end</dfn> indicates the number of samples that need to be trimmed from the end of the samples in this [=Audio Frame OBU=].
<dfn noexport>num_samples_to_trim_at_start</dfn> indicates the number of samples that need to be trimmed from the start of the samples in this [=Audio Frame OBU=].
<dfn noexport>extension_header_size</dfn> indicates the size in bytes of the extension header immediately following this field.
<dfn noexport>extension_header_bytes</dfn> indicates the byte representations of the syntaxes of the extension header. Parsers that don't understand these bytes SHOULD ignore them.
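As an informative illustration of the header layout, the following Python sketch reads the fields of an [=OBU Header=] from a byte string. It assumes that leb128() is the usual unsigned little-endian base-128 coding and does not enforce any size limits placed on leb128() elsewhere in this specification. The function names `read_leb128` and `parse_obu_header` are hypothetical.

```python
def read_leb128(data, pos):
    """Decode an unsigned LEB128 value at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:  # most significant bit clear: last byte
            return value, pos
        shift += 7


def parse_obu_header(data, pos=0):
    """Parse OBUHeader fields; return (header, payload_start, obu_end).

    Hypothetical helper, a sketch rather than a conformant parser.
    """
    first = data[pos]
    header = {
        "obu_type": first >> 3,
        "obu_redundant_copy": (first >> 2) & 1,
        "obu_trimming_status_flag": (first >> 1) & 1,
        "obu_extension_flag": first & 1,
    }
    header["obu_size"], pos = read_leb128(data, pos + 1)
    # obu_size counts every byte of the OBU that follows the obu_size field.
    obu_end = pos + header["obu_size"]
    if header["obu_trimming_status_flag"]:
        header["num_samples_to_trim_at_end"], pos = read_leb128(data, pos)
        header["num_samples_to_trim_at_start"], pos = read_leb128(data, pos)
    if header["obu_extension_flag"]:
        extension_header_size, pos = read_leb128(data, pos)
        header["extension_header_bytes"] = bytes(data[pos:pos + extension_header_size])
        pos += extension_header_size
    return header, pos, obu_end
```

Because `obu_end` is derived from [=obu_size=] alone, a parser built this way can skip an OBU whose payload it does not recognize.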
## Reserved OBU Syntax and Semantics ## {#obu-reserved}
Parsers SHOULD ignore <dfn noexport>Reserved OBU</dfn>s.
NOTE: Future versions of the specification MAY define syntax and semantics for an [=obu_type=] value, making it no longer a [=Reserved OBU=] for those parsers compliant with these future versions.
## IA Sequence Header OBU Syntax and Semantics ## {#obu-iasequenceheader}
The <dfn noexport>IA Sequence Header OBU</dfn> is used to indicate the start of an [=IA Sequence=], i.e., the first OBU in an [=IA Sequence=] SHALL have [=obu_type=] = OBU_IA_Sequence_Header. This section specifies the payload format of the [=IA Sequence Header OBU=].
NOTE: When an [=IA Sequence=] is stored in a file, the [=IA Sequence Header OBU=] can be used to identify that the file contains an [=IA Sequence=].
This OBU MAY be placed frequently within one single [=IA Sequence=] for an application such as broadcasting or multicasting. In that case, all [=IA Sequence Header OBU=]s except the first one SHALL be marked as redundant (i.e., [=obu_redundant_copy=] = 1). So, if a decoder encounters a non-redundant [=IA Sequence Header OBU=] (i.e., [=obu_redundant_copy=] = 0), and it has also received the previous [=IA Sequence Header OBU=], the non-redundant [=IA Sequence Header OBU=] indicates the start of a new [=IA Sequence=].
<b>Syntax</b>
```
class IASequenceHeaderOBU() {
unsigned int (32) ia_code;
unsigned int (8) primary_profile;
unsigned int (8) additional_profile;
}
```
<b>Semantics</b>
<dfn noexport>ia_code</dfn> is a ‘four-character code’ (4CC), <code>iamf</code>.
NOTE: When IA OBUs are delivered over a protocol that does not provide explicit [=IA Sequence=] boundaries, a parser may locate the [=IA Sequence=] start by searching for the code <code>iamf</code> preceded by specific [=OBU Header=] values. For example, by assuming that [=obu_extension_flag=] is set to 0 and because [=obu_trimming_status_flag=] is set to 0 for an [=IA Sequence Header OBU=], the [=OBU Header=] can be either 0xF806 or 0xFC06.
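The search described in the note above could be sketched in Python as follows. This is informative only. It relies on the same assumption as the note ([=obu_extension_flag=] set to 0), and a match in arbitrary data can be a false positive, so a real parser would validate the OBUs that follow. The names `SYNC_PATTERNS` and `find_ia_sequence_start` are hypothetical.

```python
# Candidate byte patterns: OBU Header 0xF806 or 0xFC06 followed by the 4CC "iamf".
SYNC_PATTERNS = (b"\xF8\x06iamf", b"\xFC\x06iamf")


def find_ia_sequence_start(data):
    """Return the offset of the first candidate IA Sequence Header OBU, or -1."""
    hits = [data.find(pattern) for pattern in SYNC_PATTERNS]
    hits = [offset for offset in hits if offset >= 0]
    return min(hits) if hits else -1
```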
<dfn noexport>primary_profile</dfn> indicates the primary profile that this [=IA Sequence=] complies with. Parsers SHOULD discard the [=IA Sequence=] if they do not support the value indicated here.
The mappings below are applied for both [=primary_profile=] and [=additional_profile=].
- 0: Simple Profile
- 1: Base Profile
- 2~255: Reserved
<dfn noexport>additional_profile</dfn> indicates an additional profile that this [=IA Sequence=] complies with. If an [=IA Sequence=] only complies with the [=primary_profile=], this field SHALL be set to the same value as [=primary_profile=].
NOTE: If a future version defines a new profile, e.g., HypotheticalProfile, that is backward compatible with the [[#profiles-base|Base Profile]], for example by defining new OBUs that would be ignored by the Base-compatible parser, an IA writer can decide to set the [=primary_profile=] to "Base Profile" while setting the [=additional_profile=] to "HypotheticalProfile". This way an old processor will know it can parse and produce an acceptable rendering, while a new processor still knows it can produce a better result because it will not ignore the additional features.
## Codec Config OBU Syntax and Semantics ## {#obu-codecconfig}
The <dfn noexport>Codec Config OBU</dfn> provides information on how to set up a decoder for a coded [=Audio Substream=].
The <dfn noexport>CodecConfig()</dfn> class provides codec-specific configurations for the decoder.
This section specifies the payload format of the [=Codec Config OBU=] and the [=CodecConfig()=] class.
<b>Syntax</b>
```
class CodecConfigOBU() {
leb128() codec_config_id;
CodecConfig codec_config;
}
class CodecConfig() {
unsigned int (32) codec_id;
leb128() num_samples_per_frame;
signed int (16) audio_roll_distance;
DecoderConfig decoder_config(codec_id);
}
```
<b>Semantics</b>
<dfn noexport for="codec_config_obu">codec_config_id</dfn> defines an identifier for a codec configuration. Within an [=IA Sequence=], there SHALL be one unique [=codec_config_obu/codec_config_id=] per codec. There SHALL be exactly one [=Codec Config OBU=] with a given identifier in a set of [=Descriptors=]. [=Audio Element=]s use this identifier to indicate that its corresponding [=Audio Substream=]s are coded with this codec configuration.
<dfn noexport>codec_config</dfn> is an instance of the [=CodecConfig()=] class, which provides codec-specific information for setting up the decoder.
<dfn noexport>codec_id</dfn> indicates a ‘four-character code’ (4CC) to identify the codec used to generate the coded [=Audio Substream=]s. This specification supports the following four [=codec_id=] values:
- <code>Opus</code>: All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL comply with the [[!RFC-6716]] specification and the [=decoder_config=] structure SHALL comply with the constraints given in [[#opus-specific]].
- <code>mp4a</code>: All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL comply with the [[!AAC]] specification and the [=decoder_config=] structure SHALL comply with the constraints given in [[#aac-lc-specific]].
- <code>fLaC</code>: All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL comply with the [[!FLAC]] specification and the [=decoder_config=] structure SHALL comply with the constraints given in [[#flac-specific]].
- <code>ipcm</code>: All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL contain linear PCM (LPCM) audio samples and the [=decoder_config=] structure SHALL comply with the constraints given in [[#lpcm-specific]].
Parsers SHOULD ignore [=Codec Config OBU=]s with a [=codec_id=] that they don't support.
NOTE: Derived specifications or future versions of this specification may support additional codecs.
NOTE: <code>ipcm</code> should not be confused with <code>lpcm</code>, which is another 4CC to identify codecs in other container formats (e.g., QuickTime).
<dfn noexport>num_samples_per_frame</dfn> indicates the frame length, in samples, of the [=audio_frame=] provided in the audio_frame_obu. It SHALL NOT be set to zero. If the [=decoder_config=] structure for a given codec specifies a value for the frame length, the two values SHALL be equal.
<dfn noexport>audio_roll_distance</dfn> indicates how many audio frames prior to the current audio frame need to be decoded (and the decoded samples discarded) to set the decoder in a state that will produce the correct decoded audio signal. It SHALL always be a negative value or zero. For some audio codecs, even if an audio frame can be decoded independently, the decoded signal after decoding only that frame may not represent a correct, decoded audio signal, even ignoring compression artifacts. This can be due to overlap transforms. While potentially acceptable when starting to decode an [=Audio Substream=], it may be problematic when automatically switching between similar [=Audio Substream=]s of different quality and/or bitrate.
- It SHALL be set to \(-R\) when [=codec_id=] is set to <code>Opus</code>, where
\[R = \left\lceil{\frac{3840}{\text{num_samples_per_frame}}}\right\rceil.\]
- It SHALL be set to -1 when [=codec_id=] is set to <code>mp4a</code>.
- It SHALL be set to 0 when [=codec_id=] is set to <code>fLaC</code> or <code>ipcm</code>.
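The per-codec values above can be computed as in the following informative Python sketch, which uses integer arithmetic for the ceiling in the Opus case. The function name `expected_audio_roll_distance` is hypothetical.

```python
def expected_audio_roll_distance(codec_id, num_samples_per_frame):
    """Return the audio_roll_distance required for the given codec_id.

    Hypothetical helper: name and signature are illustrative only.
    """
    if codec_id == "Opus":
        # R = ceil(3840 / num_samples_per_frame), computed without floats.
        r = (3840 + num_samples_per_frame - 1) // num_samples_per_frame
        return -r
    if codec_id == "mp4a":
        return -1
    if codec_id in ("fLaC", "ipcm"):
        return 0
    raise ValueError("codec_id not defined in this version: " + codec_id)
```

For example, an Opus configuration with 960 samples per frame yields a roll distance of -4.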
<dfn noexport>decoder_config</dfn> is an instance of the [=DecoderConfig()=] class, which specifies the set of codec parameters required to decode the [=Audio Substream=]. It is byte aligned.
## Audio Element OBU Syntax and Semantics ## {#obu-audioelement}
The <dfn noexport>Audio Element OBU</dfn> provides information on how to combine one or more [=Audio Substream=]s to reconstruct an [=Audio Element=]. This section specifies the payload format of the [=Audio Element OBU=].
Additionally, the following parameter definitions are used in the [=Audio Element OBU=], and their syntax structures are specified in this section:
- <dfn noexport>DemixingParamDefinition()</dfn> and <dfn noexport>DefaultDemixingInfoParameterData()</dfn> provide the parameter definitions for demixing info, which is required for reconstructing a scalable channel audio representation.
- <dfn noexport>ReconGainParamDefinition()</dfn> provides the parameter definition for recon gain, which is required for reconstructing a scalable channel audio representation.
<b>Syntax</b>
```
class AudioElementOBU() {
leb128() audio_element_id;
unsigned int (3) audio_element_type;
unsigned int (5) reserved;
leb128() codec_config_id;
leb128() num_substreams;
for (i = 0; i < num_substreams; i++) {
leb128() audio_substream_id;
}
leb128() num_parameters;
for (i = 0; i < num_parameters; i++) {
leb128() param_definition_type;
if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
DemixingParamDefinition demixing_info;
}
else if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
ReconGainParamDefinition recon_gain_info;
}
else if (param_definition_type > 2) {
leb128() param_definition_size;
unsigned int (8 x param_definition_size) param_definition_bytes;
}
}
if (audio_element_type == CHANNEL_BASED) {
ScalableChannelLayoutConfig scalable_channel_layout_config;
} else if (audio_element_type == SCENE_BASED) {
AmbisonicsConfig ambisonics_config;
} else {
leb128() audio_element_config_size;
unsigned int (8 x audio_element_config_size) audio_element_config_bytes;
}
}
```
```
class DemixingParamDefinition() extends ParamDefinition() {
DefaultDemixingInfoParameterData default_demixing_info_parameter_data;
}
```
```
class DefaultDemixingInfoParameterData() extends DemixingInfoParameterData() {
unsigned int (4) default_w;
unsigned int (4) reserved;
}
```
```
class ReconGainParamDefinition() extends ParamDefinition() {
}
```
<b>Semantics</b>
<dfn noexport for="audio_element_obu">audio_element_id</dfn> defines an identifier for an [=Audio Element=]. Within an [=IA Sequence=], there SHALL be one unique [=audio_element_obu/audio_element_id=] per [=Audio Element=]. There SHALL be exactly one [=Audio Element OBU=] with a given identifier in a set of [=Descriptors=]. [=Mix Presentation=]s refer to a particular [=Audio Element=] using this identifier.
<dfn noexport>audio_element_type</dfn> specifies the audio representation of this [=Audio Element=], which is constructed from one or more [=Audio Substream=]s. Parsers SHOULD ignore [=Audio Element OBU=]s with an [=audio_element_type=] that they do not recognize.
<pre class = "def">
audio_element_type: The type of audio representation.
0 : CHANNEL_BASED
1 : SCENE_BASED
2~7 : Reserved
</pre>
<dfn noexport for="audio_element_obu">codec_config_id</dfn> indicates the identifier for the codec configuration which this [=Audio Element=] refers to. Parsers SHOULD ignore [=Audio Element OBU=]s with a [=audio_element_obu/codec_config_id=] identifying a [=codec_id=] that they don't support.
<dfn noexport>num_substreams</dfn> specifies the number of [=Audio Substream=]s that are used to reconstruct this [=Audio Element=]. It SHALL NOT be set to 0.
<dfn noexport for="audio_element_obu">audio_substream_id</dfn> indicates the identifier for an [=Audio Substream=] which this [=Audio Element=] refers to.
Let a particular [=Channel Group=]'s [=Audio Substream=]s be indexed as \(\left[c, n_c\right]\), where a [=Channel Group=] format is described in [[#scalablechannelaudio-channelgroupformat]] and
- \(c = \left[1, \ldots, C\right]\) is the [=Channel Group=] index and \(C\) is the number of [=Channel Group=]s.
- \(n_c = \left[1, \ldots, N_c\right]\) is the [=Audio Substream=] index in the \(c\)-th [=Channel Group=] and \(N_c\) is the number of [=Audio Substream=]s in the \(c\)-th [=Channel Group=].
Then, the i-th [=audio_element_obu/audio_substream_id=] maps to a [=Channel Group=]'s [=Audio Substream=]s as follows, where i is the index of the array:
\[
\left[
\left[ 1, 1 \right],
\left[ 1, 2 \right],
\cdots,
\left[ 1, N_1 \right],
\left[ 2, 1 \right],
\left[ 2, 2 \right],
\cdots,
\left[ 2, N_2 \right],
\cdots,
\left[ C, 1 \right],
\left[ C, 2 \right],
\cdots,
\left[ C, N_C \right]
\right]
\]
The order of the [=Audio Substream=]s in each [=Channel Group=] (i.e., the semantics of \(n_c\)) is specified in [[#syntax-scalable-channel-layout-config]].
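The mapping above can be illustrated with a short, informative Python sketch. It assumes that i is the 0-based loop index used in the [=Audio Element OBU=] syntax and that the number of [=Audio Substream=]s in each [=Channel Group=] is already known. The function name and the `substreams_per_group` argument are hypothetical.

```python
def substream_index_to_group(i, substreams_per_group):
    """Map the 0-based array position i to the 1-based pair (c, n_c).

    substreams_per_group[c - 1] holds N_c. Hypothetical helper.
    """
    for c, count in enumerate(substreams_per_group, start=1):
        if i < count:
            return c, i + 1
        i -= count  # skip past every substream of Channel Group c
    raise IndexError("index is beyond the last Channel Group")
```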
<dfn noexport>num_parameters</dfn> specifies the number of [=Parameter Substream=]s that are used by the algorithms specified in this [=Audio Element=].
- When [=audio_element_type=] = 0, this field SHALL be set to 0, 1, or 2.
- When [=audio_element_type=] = 1, this field SHALL be set to 0.
- Parsers SHALL support any value of [=num_parameters=].
NOTE: For a given [=audio_element_type=], a future version of the specification may define a new [=Parameter Substream=] which may be ignored by IA decoders compliant with this version of the specification. In that case, a new [=param_definition_type=] will be defined in a future version of [=Audio Element OBU=].
<dfn noexport>param_definition_type</dfn> specifies the type of the parameter definition. The parameter definition types are listed in the table below, along with their associated parameter definitions.
<table class = "def">
<tr>
<th>param_definition_type</th><th>Parameter definition type</th><th>Parameter definition</th>
</tr>
<tr>
<td>0</td><td>PARAMETER_DEFINITION_MIX_GAIN</td><td>[=MixGainParamDefinition()=]</td>
</tr>
<tr>
<td>1</td><td>PARAMETER_DEFINITION_DEMIXING</td><td>[=DemixingParamDefinition()=]</td>
</tr>
<tr>
<td>2</td><td>PARAMETER_DEFINITION_RECON_GAIN</td><td>[=ReconGainParamDefinition()=]</td>
</tr>
</table>
- The type PARAMETER_DEFINITION_MIX_GAIN SHALL NOT be present in [=Audio Element OBU=].
- The type SHALL NOT be duplicated in one [=Audio Element OBU=].
- When [=codec_id=] = <code>fLaC</code> or <code>ipcm</code>, the type PARAMETER_DEFINITION_RECON_GAIN SHALL NOT be present.
- When [=num_layers=] > 1, the type PARAMETER_DEFINITION_RECON_GAIN SHALL be present.
- When the highest [=loudspeaker_layout=] of the (non-)scalable channel audio (i.e., [=num_layers=] = 1) is less than or equal to 3.1.2ch, the type PARAMETER_DEFINITION_DEMIXING SHALL NOT be present.
- When the highest [=loudspeaker_layout=] of the scalable channel audio (i.e., [=num_layers=] > 1) is greater than 3.1.2ch, both PARAMETER_DEFINITION_DEMIXING and PARAMETER_DEFINITION_RECON_GAIN types SHALL be present.
- When [=num_layers=] = 1 and [=loudspeaker_layout=] is greater than 3.1.2ch, the type PARAMETER_DEFINITION_DEMIXING MAY be present.
- An OBU parser SHALL be able to parse [=param_definition_type=] = P (where P > 2) and [=param_definition_size=]. The OBU parser SHOULD ignore the bytes indicated by [=param_definition_size=] that it does not recognize.
<dfn noexport>demixing_info</dfn> is an instance of the [=DemixingParamDefinition()=] class, which provides the parameter definition for the demixing information, which is used to reconstruct a scalable channel audio representation. The corresponding parameter data to be provided in [=Parameter Block OBU=]s with the same [=parameter_block_obu/parameter_id=] is specified in the [=DemixingInfoParameterData()=] class.
In this parameter definition,
- [=parameter_rate=] SHALL be set to the sample rate of this [=Audio Element=].
- [=param_definition_mode=] SHALL be set to 0.
- [=ParamDefinition/duration=] SHALL be the same as [=num_samples_per_frame=] of this [=Audio Element=].
- [=ParamDefinition/num_subblocks=] SHALL be set to 1.
- [=ParamDefinition/constant_subblock_duration=] SHALL be the same as [=ParamDefinition/duration=].
<dfn noexport>recon_gain_info</dfn> is an instance of the [=ReconGainParamDefinition()=] class, which provides the parameter definition for the gain value, which is used to reconstruct a scalable channel audio representation. The corresponding parameter data to be provided in [=Parameter Block OBU=]s with the same [=parameter_block_obu/parameter_id=] is specified in the [=ReconGainInfoParameterData()=] class.
In this parameter definition,
- [=parameter_rate=] SHALL be set to the sample rate of this [=Audio Element=].
- [=param_definition_mode=] SHALL be set to 0.
- [=ParamDefinition/duration=] SHALL be the same as [=num_samples_per_frame=] of this [=Audio Element=].
- [=ParamDefinition/num_subblocks=] SHALL be set to 1.
- [=ParamDefinition/constant_subblock_duration=] SHALL be the same as [=ParamDefinition/duration=].
<dfn noexport>param_definition_size</dfn> indicates the size in bytes of [=param_definition_bytes=].
<dfn noexport>param_definition_bytes</dfn> represents reserved bytes for future use when new [=param_definition_type=] values are defined. Parsers SHOULD ignore these bytes when they don't understand the parameter definition.
<dfn noexport>scalable_channel_layout_config</dfn> is an instance of the [=ScalableChannelLayoutConfig()=] class, which provides the metadata required for combining the [=Audio Substream=]s referred to here in order to reconstruct a scalable channel layout.
<dfn noexport>ambisonics_config</dfn> is an instance of the [=AmbisonicsConfig()=] class, which provides the metadata required for combining the [=Audio Substream=]s referred to here in order to reconstruct an Ambisonics layout.
<dfn noexport>audio_element_config_size</dfn> indicates the size in bytes of [=audio_element_config_bytes=].
<dfn noexport>audio_element_config_bytes</dfn> represents reserved bytes for future use when new [=audio_element_type=] values are defined. Parsers SHOULD ignore these bytes when they don't recognize a particular configuration.
<dfn noexport>default_demixing_info_parameter_data</dfn> is an instance of the [=DefaultDemixingInfoParameterData()=] class, which provides the default demixing parameter data to apply to all audio samples when there are no [=Parameter Block OBU=]s (with the same [=ParamDefinition/parameter_id=] defined in this [=DemixingParamDefinition()=]) provided.
- In this class, [=w_idx_offset=] in [=demixing_info_parameter_data=] SHALL be ignored.
- Instead, [=default_w=] directly indicates the weight value [=w(k)|\(w(k)\)=].
<dfn noexport>default_w</dfn> indicates the weight value [=w(k)|\(w(k)\)=] for the [=TF2toT2 de-mixer=] specified in [[#processing-scalablechannelaudio-demixer]].
The mapping of [=default_w=] to [=w(k)|\(w(k)\)=] SHOULD be as follows:
<pre class = "def">
default_w : w(k)
0 : 0
1 : 0.0179
2 : 0.0391
3 : 0.0658
4 : 0.1038
5 : 0.25
6 : 0.3962
7 : 0.4342
8 : 0.4609
9 : 0.4821
10 : 0.5
11 ~ 15 : reserved
</pre>
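The table above can be expressed as a simple lookup, shown here as an informative Python sketch. Returning `None` for reserved values is a choice made for this example only, and the names `DEFAULT_W_TO_WK` and `default_w_to_wk` are hypothetical.

```python
# Index is default_w (0 to 10); values 11 to 15 are reserved.
DEFAULT_W_TO_WK = (0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
                   0.3962, 0.4342, 0.4609, 0.4821, 0.5)


def default_w_to_wk(default_w):
    """Return w(k) for a 4-bit default_w, or None for a reserved value."""
    if not 0 <= default_w <= 15:
        raise ValueError("default_w is a 4-bit field")
    if default_w >= len(DEFAULT_W_TO_WK):
        return None  # reserved in this version of the specification
    return DEFAULT_W_TO_WK[default_w]
```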
A default recon gain value of 0 dB is implied when there are no [=Parameter Block OBU=]s (with the same [=ParamDefinition/parameter_id=] defined in this [=ReconGainParamDefinition()=]) provided.
### Parameter Definition Syntax and Semantics ### {#parameter-definition}
Parameter definition classes inherit from the abstract <dfn noexport>ParamDefinition()</dfn> class.
<b>Syntax</b>
```
abstract class ParamDefinition() {
leb128() parameter_id;
leb128() parameter_rate;
unsigned int (1) param_definition_mode;
unsigned int (7) reserved;
if (param_definition_mode == 0) {
leb128() duration;
leb128() constant_subblock_duration;
if (constant_subblock_duration == 0) {
leb128() num_subblocks;
for (i = 0; i< num_subblocks; i++) {
leb128() subblock_duration;
}
}
}
}
```
<b>Semantics</b>
<dfn noexport for="ParamDefinition">parameter_id</dfn> indicates the identifier for the [=Parameter Substream=] which this parameter definition refers to. There SHALL be one unique [=ParamDefinition/parameter_id=] per [=Parameter Substream=].
<dfn noexport>parameter_rate</dfn> specifies the rate used by this [=Parameter Substream=], expressed as ticks per second. Time-related fields associated with this [=Parameter Substream=], such as durations, SHALL be expressed in the number of ticks.
- The parameter rate SHALL be a value such that the number of ticks per frame, computed as
\[\frac{\text{parameter_rate} \times \text{num_samples_per_frame}}{\text{Audio Element sample rate}},\]
is a non-zero integer.
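The constraint above can be checked with integer arithmetic, as in this informative Python sketch. The function name `ticks_per_frame` is hypothetical, and `sample_rate` stands for the sample rate of the [=Audio Element=].

```python
def ticks_per_frame(parameter_rate, num_samples_per_frame, sample_rate):
    """Return the number of ticks per frame, or raise if the rate is invalid.

    Hypothetical helper: name and signature are illustrative only.
    """
    numerator = parameter_rate * num_samples_per_frame
    # The quotient must be a non-zero integer.
    if numerator == 0 or numerator % sample_rate != 0:
        raise ValueError("ticks per frame is not a non-zero integer")
    return numerator // sample_rate
```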
<dfn noexport>param_definition_mode</dfn> indicates whether this parameter definition specifies the [=ParamDefinition/duration=], [=ParamDefinition/num_subblocks=], [=ParamDefinition/constant_subblock_duration=] and [=ParamDefinition/subblock_duration=] fields for the parameter blocks with the same [=parameter_block_obu/parameter_id=].
- When this field is set to 0, all of the [=ParamDefinition/duration=], [=ParamDefinition/num_subblocks=], [=ParamDefinition/constant_subblock_duration=], and [=ParamDefinition/subblock_duration=] fields SHALL be specified in this parameter definition. None of the parameter blocks with the same [=parameter_block_obu/parameter_id=] SHALL specify these same fields.
- When this field is set to 1, none of the [=ParamDefinition/duration=], [=ParamDefinition/num_subblocks=], [=ParamDefinition/constant_subblock_duration=], and [=ParamDefinition/subblock_duration=] fields SHALL be specified in this parameter definition. Instead, each parameter block with the same [=parameter_block_obu/parameter_id=] SHALL specify these same fields.
<dfn noexport for="ParamDefinition">duration</dfn> specifies the duration for which each parameter block with the same [=parameter_block_obu/parameter_id=] is valid and applicable. It SHALL NOT be set to 0.
<dfn noexport for="ParamDefinition">constant_subblock_duration</dfn> specifies the duration of each subblock, in the case where all subblocks except the last subblock have equal durations. If all subblocks except the last subblock do not have equal durations, the value of [=ParamDefinition/constant_subblock_duration=] SHALL be set to 0.
When [=ParamDefinition/constant_subblock_duration=] is not equal to 0,
- [=ParamDefinition/num_subblocks=] is implicitly calculated as
\[ \text{num_subblocks} = \left\lceil{ \frac{\text{duration}}{\text{constant_subblock_duration}}}\right\rceil. \]
- If \(\text{num_subblocks} \times \text{constant_subblock_duration} > \text{duration}\), the actual duration of the last subblock SHALL be
\[ \text{duration} - \left( \text{num_subblocks} - 1 \right) \times \text{constant_subblock_duration}. \]
When [=ParamDefinition/constant_subblock_duration=] is equal to 0, the summation of all [=ParamDefinition/subblock_duration=] in this parameter block SHALL be equal to [=ParamDefinition/duration=].
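The two cases above can be sketched in Python as follows. This is informative only. The function name `subblock_durations` and the `explicit_durations` argument, which stands for the list of signalled [=ParamDefinition/subblock_duration=] values, are hypothetical.

```python
def subblock_durations(duration, constant_subblock_duration, explicit_durations=None):
    """Return the list of subblock durations, in ticks.

    Hypothetical helper: name and signature are illustrative only.
    """
    if constant_subblock_duration != 0:
        # num_subblocks = ceil(duration / constant_subblock_duration)
        num_subblocks = -(-duration // constant_subblock_duration)
        # The last subblock takes whatever remains of the duration.
        last = duration - (num_subblocks - 1) * constant_subblock_duration
        return [constant_subblock_duration] * (num_subblocks - 1) + [last]
    if not explicit_durations or 0 in explicit_durations:
        raise ValueError("explicit non-zero subblock durations are required")
    if sum(explicit_durations) != duration:
        raise ValueError("subblock durations do not sum to duration")
    return list(explicit_durations)
```

For instance, a duration of 1000 ticks with a constant subblock duration of 240 ticks gives five subblocks, the last one lasting 40 ticks.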
<dfn noexport for="ParamDefinition">num_subblocks</dfn> specifies the number of different sets of parameter values specified in each parameter block with the same [=parameter_block_obu/parameter_id=], where each set describes a different subblock of the timeline, contiguously.
<dfn noexport for="ParamDefinition">subblock_duration</dfn> specifies the duration for the given subblock. It SHALL NOT be set to 0.
The values for [=ParamDefinition/duration=], [=ParamDefinition/constant_subblock_duration=], and [=ParamDefinition/subblock_duration=] SHALL be expressed as the number of ticks at the [=parameter_rate=] specified in the corresponding parameter definition.
### Scalable Channel Layout Config Syntax and Semantics ### {#syntax-scalable-channel-layout-config}
The <dfn noexport>ScalableChannelLayoutConfig()</dfn> class provides the configuration for a given scalable channel audio representation.
The <dfn noexport>ChannelAudioLayerConfig()</dfn> class provides the configuration for a specific [=Channel Group=].
This section specifies the syntax structures of the [=ScalableChannelLayoutConfig()=] and [=ChannelAudioLayerConfig()=] classes.
<b>Syntax</b>
```
class ScalableChannelLayoutConfig() {
unsigned int (3) num_layers;
unsigned int (5) reserved;
for (i = 1; i <= num_layers; i++) {
ChannelAudioLayerConfig channel_audio_layer_config(i);
}
}
class ChannelAudioLayerConfig(i) {
unsigned int (4) loudspeaker_layout(i);
unsigned int (1) output_gain_is_present_flag(i);
unsigned int (1) recon_gain_is_present_flag(i);
unsigned int (2) reserved;
unsigned int (8) substream_count(i);
unsigned int (8) coupled_substream_count(i);
if (output_gain_is_present_flag(i) == 1) {
unsigned int (6) output_gain_flags(i);
unsigned int (2) reserved;
signed int (16) output_gain(i);
}
}
```
When an [=Audio Element=] is composed of \(G(r)\) [=Audio Substream=]s, its scalable channel audio representation is layered into \(r\) [=Channel Group=]s, where \(r\) is the value of [=num_layers=].
- The order of the [=Channel Group=]s in each [=Temporal Unit=] SHALL be the same as the order of the [=channel_audio_layer_config=]s in [=ScalableChannelLayoutConfig()=].
- The \(q\)-th [=Channel Group=] consists of \(G(q) - G(q - 1)\) number of [=Audio Substream=]s, where \(q = 1, 2, \ldots, r\) and \(G(0) = 0\).
- Let the term "Audio Frames" mean the set of all [=Audio Frame OBU=]s (for this [=Audio Element=]) that have the same start timestamp. All Audio Frames in an [=IA Sequence=] SHALL have the same number of [=Audio Frame OBU=]s.
- [=Parameter Block OBU=]s MAY be associated with Audio Frames.
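As an informative illustration of the \(G(q) - G(q - 1)\) relation above, the following Python sketch derives the number of [=Audio Substream=]s in each [=Channel Group=] from the cumulative counts. The function name and the `cumulative_counts` argument are hypothetical.

```python
def substreams_per_channel_group(cumulative_counts):
    """Given [G(1), ..., G(r)], return the per-group counts G(q) - G(q - 1).

    Hypothetical helper: name and signature are illustrative only.
    """
    counts = []
    previous = 0  # G(0) = 0
    for g in cumulative_counts:
        counts.append(g - previous)
        previous = g
    return counts
```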
<center><img src="images/Immersive Audio Sequence with scalable channel audio (before OBU packing).png" style="width:100%; height:auto;"></center>
<center><figcaption>Immersive Audio Sequence with scalable channel audio (before OBU packing). See [[#standalone]] for related details on OBU ordering within an IA Sequence.</figcaption></center>
Each [=Channel Group=] (or scalable channel audio layer) is associated with a different [=loudspeaker_layout=]. The IA decoder SHALL select one of the layers according to the following rules, in order:
- The IA decoder SHOULD first attempt to select the layer with a [=loudspeaker_layout=] that matches the physical playback layout.
- If there is no match, the IA decoder SHOULD select the layer with the closest [=loudspeaker_layout=] to the physical layout and then apply up- or down-mixing appropriately, after decoding and reconstruction of the channel audio. Sections [[#iamfgeneration-scalablechannelaudio-downmixmechanism]] and [[#processing-downmixmatrix]] provide examples of dynamic and static down-mixing matrices for some common layouts that MAY be used.
The relationship among all [=Channel Group=]s for the given scalable channel audio representation SHALL comply with [[#scalablechannelaudio-channelgroupformat]] and the relationship among all channel layouts indicated by [=loudspeaker_layout=]s specified in an [=Audio Element OBU=] SHALL comply with [[#scalablechannelaudio-channellayoutgenerationrule]].
<b>Semantics</b>
<dfn noexport>num_layers</dfn> indicates the number of [=Channel Group=]s for scalable channel audio. It SHALL NOT be set to zero and its maximum value SHALL be 6.
- If [=loudspeaker_layout=] is set to Binaural, this field SHALL be set to 1.
<dfn noexport>channel_audio_layer_config</dfn> is an instance of the [=ChannelAudioLayerConfig()=] class, which provides the i-th [=Channel Group=]'s configuration, where i is the layer index provided as input argument to this instance of the [=ChannelAudioLayerConfig()=] class.
<dfn noexport>loudspeaker_layout</dfn> indicates the channel layout to be reconstructed from the preceding [=Channel Group=]s and the current [=Channel Group=]. If parsers do not recognize a [=loudspeaker_layout=] for a particular layer, they SHOULD skip the [=channel_audio_layer_config=] for that layer and all subsequent layers.
In this version of the specification, [=loudspeaker_layout=] indicates one of the 10 channel layouts listed below.
<table class="def">
<tr>
<th><code>loudspeaker_layout</code></th><th>Channel Layout</th><th>Loudspeaker Location Ordering</th><th>Reference</th>
</tr>
<tr>
<td>0000</td><td>Mono</td><td>C</td><td></td>
</tr>
<tr>
<td>0001</td><td>Stereo</td><td>L/R</td><td>[=Loudspeaker configuration for Sound System A (0+2+0)=] of [[!ITU-2051-3]]</td>
</tr>
<tr>
<td>0010</td><td>5.1ch</td><td>L/C/R/Ls/Rs/LFE</td><td>[=Loudspeaker configuration for Sound System B (0+5+0)=] of [[!ITU-2051-3]]</td>
</tr>
<tr>
<td>0011</td><td>5.1.2ch</td><td>L/C/R/Ls/Rs/Ltf/Rtf/LFE</td><td>[=Loudspeaker configuration for Sound System C (2+5+0)=] of [[!ITU-2051-3]]</td>
</tr>
<tr>
<td>0100</td><td>5.1.4ch</td><td>L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE</td><td>[=Loudspeaker configuration for Sound System D (4+5+0)=] of [[!ITU-2051-3]]</td>
</tr>
<tr>
<td>0101</td><td><dfn noexport>7.1ch</dfn></td><td>L/C/R/Lss/Rss/Lrs/Rrs/LFE</td><td>[=Loudspeaker configuration for Sound System I (0+7+0)=] of [[!ITU-2051-3]]</td>
</tr>
<tr>
<td>0110</td><td>7.1.2ch</td><td>L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE</td><td>The combination of [=7.1ch=] and the Left and Right top front pair of [=7.1.4ch=]</td>
</tr>
<tr>
<td>0111</td><td><dfn noexport>7.1.4ch</dfn></td><td>L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE</td><td>[=Loudspeaker configuration for Sound System J (4+7+0)=] of [[!ITU-2051-3]]</td>
</tr>
<tr>
<td>1000</td><td>3.1.2ch</td><td>L/C/R/Ltf/Rtf/LFE</td><td>The front subset (L/C/R/Ltf/Rtf/LFE) of [=7.1.4ch=]</td>
</tr>
<tr>
<td>1001</td><td>Binaural</td><td>L/R</td><td></td>
</tr>
<tr>
<td>others</td><td>Reserved</td><td></td><td></td>
</tr>
</table>
Where C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround, Rs: Right Surround, Rss: Right Side Surround, Lrs: Left Rear Surround, Rrs: Right Rear Surround, Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear, Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects
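For implementers, the table above can be transcribed into a lookup structure. The following non-normative sketch copies the codes, layout names and loudspeaker orderings from the table; the names `LAYOUTS` and `channel_order` are assumptions made for the example.

```python
# Hypothetical lookup table, transcribed from the loudspeaker_layout table.
LAYOUTS = {
    0b0000: ("Mono", "C"),
    0b0001: ("Stereo", "L/R"),
    0b0010: ("5.1ch", "L/C/R/Ls/Rs/LFE"),
    0b0011: ("5.1.2ch", "L/C/R/Ls/Rs/Ltf/Rtf/LFE"),
    0b0100: ("5.1.4ch", "L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE"),
    0b0101: ("7.1ch", "L/C/R/Lss/Rss/Lrs/Rrs/LFE"),
    0b0110: ("7.1.2ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE"),
    0b0111: ("7.1.4ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE"),
    0b1000: ("3.1.2ch", "L/C/R/Ltf/Rtf/LFE"),
    0b1001: ("Binaural", "L/R"),
}


def channel_order(loudspeaker_layout):
    """Return (layout name, list of loudspeaker labels), or None for reserved values."""
    entry = LAYOUTS.get(loudspeaker_layout)
    if entry is None:
        return None  # reserved in this version of the specification
    name, ordering = entry
    return name, ordering.split("/")
```

For example, `channel_order(0b0010)` yields the name `5.1ch` with the six labels L, C, R, Ls, Rs and LFE in that order, and any reserved value yields `None`.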
NOTE: The Ltr and Rtr of 5.1.4ch down-mixed from 7.1.4ch are within the range of Ltb and Rtb of 7.1.4ch, in terms of their positions according to [[!ITU-2051-3]].
For a given input [=3D audio signal=] with [=audio_element_type=] = CHANNEL_BASED, if the input [=3D audio signal=] has height channels (e.g., 7.1.4ch or 5.1.2ch), it is RECOMMENDED to use channel layouts with height channels (i.e., higher than or equal to 3.1.2ch) for all [=loudspeaker_layouts=].
- Examples of RECOMMENDED lists of channel layouts: 3.1.2ch/5.1.2ch, 3.1.2ch/5.1.2ch/7.1.4ch, 5.1.2ch/7.1.4ch, etc.
- Examples of NOT RECOMMENDED lists of channel layouts: 2ch/3.1.2ch/5.1.2ch, 2ch/3.1.2ch/5.1.2ch/7.1.4ch, 2ch/5.1.2ch/7.1.4ch, 2ch/7.1.4ch, etc.
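The recommendation and its examples can be checked mechanically. The following non-normative sketch assumes that layouts are given by name as in the examples (e.g., `2ch`, `5.1.2ch`) and treats a layout as having height channels when its name has a non-zero third number; `has_height` and `follows_height_recommendation` are hypothetical names.

```python
# Hypothetical helpers, not part of the specification.
def has_height(layout_name):
    """True for names of the form "a.b.c[ch]" whose third number is non-zero."""
    if layout_name.endswith("ch"):
        layout_name = layout_name[:-2]
    parts = layout_name.split(".")
    return len(parts) == 3 and int(parts[2]) > 0


def follows_height_recommendation(input_layout, layer_layouts):
    """Check whether all layers keep height channels when the input has them."""
    if not has_height(input_layout):
        return True  # the recommendation only concerns inputs with height channels
    return all(has_height(name) for name in layer_layouts)
```

With a 7.1.4ch input, the list 3.1.2ch/5.1.2ch/7.1.4ch passes this check, while 2ch/3.1.2ch/5.1.2ch does not because its first layer has no height channels.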
NOTE: This specification allows down-mixing mechanisms (e.g., as specified in [[#iamfgeneration-scalablechannelaudio-downmixmechanism]]) to drop the height channels if the output layout has no height channels. An example is down-mixing from 7.1.4ch to Mono, Stereo, 5.1ch or 7.1ch. Therefore, given an input [=3D audio signal=] with height channels, an encoder may generate a set of scalable audio channel groups with layouts that do not have height channels.
<dfn noexport>output_gain_is_present_flag</dfn> indicates if the output_gain information fields for the [=Channel Group=] are present.
- 0: No output_gain information fields for the [=Channel Group=] are present.