21-09-2008, 11:29 AM
Basics of Audio Compression
Advances in digital audio technology are fueled by two sources: hardware developments and new signal processing techniques. When processors dissipated tens of watts of power and memory densities were on the order of kilobits per square inch, portable playback devices such as MP3 players were not possible. Now, however, power dissipation, memory densities, and processor speeds have improved by several orders of magnitude. Advancements in signal processing are exemplified by Internet broadcast applications: if an Internet broadcast used 16-bit PCM encoding at 44.1 kHz to reach the desired sound quality, it would require a channel of about 1.4 Mbps (2 x 16 x 44.1k) for a stereo signal! Fortunately, new bit rate reduction techniques for audio of this quality are constantly being released.
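To make the bit rate arithmetic above concrete, here is a minimal sketch in Python. The function name pcm_bit_rate is only illustrative and does not come from any standard or library.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Return the bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Stereo, 16-bit PCM sampled at 44.1 kHz, as in the example above.
rate_bps = pcm_bit_rate(44_100, 16, 2)
print(rate_bps)            # 1411200 bits per second, i.e. about 1.4 Mbps

# The same figure expressed as storage: bytes needed for one minute of audio.
bytes_per_minute = rate_bps * 60 // 8
print(bytes_per_minute)    # 10584000 bytes, roughly 10 MB per minute
```

The second figure shows why the same uncompressed signal is a burden for storage as well as for transmission.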
Increasing hardware efficiency and an expanding array of digital audio representation formats are giving rise to a wide variety of new digital audio applications. These applications include portable music playback devices, digital surround sound for cinema, high-quality digital radio and television broadcast, Digital Versatile Disc (DVD), and many others. This paper introduces digital audio signal compression, a technique essential to the implementation of many digital audio applications. Digital audio signal compression is the removal of redundant or otherwise irrelevant information from a digital audio signal, a process that is useful for conserving both transmission bandwidth and storage space. We begin by defining some useful terminology. We then present a typical "encoder" (as compression algorithms are often called) and explain how it functions. Finally, we consider some standards that employ digital audio signal compression and discuss the future of the field.
Psychoacoustics is the study of subjective human perception of sounds. Effectively, it is the study of acoustical perception. Psychoacoustic modeling has long been an integral part of audio compression. It exploits properties of the human auditory system to remove the parts of an audio signal that the human ear cannot perceive. More powerful signals at certain frequencies "mask" less powerful signals at nearby frequencies by desensitizing the human ear's basilar membrane (which is responsible for resolving the frequency components of a signal). The entire MP3 phenomenon is made possible by the confluence of several distinct but interrelated elements: a few simple insights into the nature of human psychoacoustics, a whole lot of number crunching, and conformance to a tightly specified format for encoding and decoding audio into compact bitstreams.
Audio Compression vs. Speech Compression
This paper focuses on audio compression techniques, which differ from those used in speech compression. Speech compression uses a model of the human vocal tract to express a particular signal in a compressed format. This technique is not usually applied in the field of audio compression because of the vast array of sounds that can be generated: models that represent audio generation would be too complex to implement. So instead of modeling the source of sounds, modern audio compression models the receiver, i.e., the human ear.
Lossless vs. Lossy
When we speak of compression, we must distinguish between two different types: lossless and lossy. Lossless compression retains all the information in a given signal, i.e., a decoder can perfectly reconstruct a compressed signal. In contrast, lossy compression eliminates information from the original signal. As a result, a reconstructed signal may differ from the original. With audio signals, the differences between the original and reconstructed signals only matter if they are detectable by the human ear. As we will explore shortly, audio compression employs both lossy and lossless techniques.
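The distinction can be illustrated with a small Python sketch. It is not how any audio codec works: the lossless path uses the general-purpose zlib compressor from the standard library, and the lossy path simply discards the low-order bits of each 16-bit sample. The function names and the choice of 8 kept bits are assumptions made for the example.

```python
import array
import zlib


def lossless_round_trip(samples):
    """Compress 16-bit samples with zlib and decode them again."""
    raw = array.array("h", samples).tobytes()   # "h" = signed 16-bit integers
    packed = zlib.compress(raw)
    restored = array.array("h")
    restored.frombytes(zlib.decompress(packed))
    return list(restored)


def lossy_round_trip(samples, kept_bits=8):
    """Keep only the top kept_bits of each 16-bit sample (illustrative only)."""
    shift = 16 - kept_bits
    return [(s >> shift) << shift for s in samples]


samples = [0, 1000, -1000, 12345, -32768, 32767]

# Lossless: the decoder reconstructs the input exactly.
print(lossless_round_trip(samples) == samples)   # True

# Lossy: information is gone, so the reconstruction differs from the input.
print(lossy_round_trip(samples))
# [0, 768, -1024, 12288, -32768, 32512]
```

In the lossy case every reconstructed sample is within 256 of the original. Whether an error of that size matters depends, as the text says, on whether the ear can detect it.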
Basic Building Blocks
Figure 1 shows a generic encoder or "compressor" that takes blocks of sampled audio signal as its input. These blocks typically consist of between 500 and 1500 samples per channel, depending on the encoder specification. For example, the MPEG-1 layer III (MP3) specification takes 576 samples per channel per input block. The output is a compressed representation of the input block (a "frame") that can be transmitted or stored for subsequent decoding.
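The blocking step can be sketched in Python as follows. This is only an illustration of cutting one channel into fixed-size input blocks. The function name split_into_blocks and the zero-padding of the final block are assumptions, and a real encoder may treat block boundaries differently.

```python
def split_into_blocks(samples, block_size=576):
    """Cut one channel of samples into equal blocks of block_size samples.

    The last block is padded with zeros if the input does not divide
    evenly (an assumption made for this sketch).
    """
    blocks = []
    for start in range(0, len(samples), block_size):
        block = list(samples[start:start + block_size])
        block.extend([0] * (block_size - len(block)))  # pad a short final block
        blocks.append(block)
    return blocks


# 1500 samples of one channel give three blocks of 576 samples each.
channel = list(range(1500))
blocks = split_into_blocks(channel)
print(len(blocks))                     # 3
print([len(b) for b in blocks])        # [576, 576, 576]
```

Each block would then be handed to the encoder, which emits one compressed frame per block.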