The Illusion of Digital Audio

Though based on principles established in the late 1930s, digital audio encoding started to appear in the telecommunications industry in the 1960s. Research into its commercial use was pioneered by the Japanese national broadcasting organization NHK and by Nippon Columbia.

This sampling process is known as quantization, and it is accomplished by a device known as an analog-to-digital converter, or ADC.
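As a rough illustration of the quantization step, the sketch below maps an amplitude to the nearest integer code for a given bit depth and back again. The function names `quantize` and `dequantize`, the normalized input range of -1.0 to 1.0, and the 16-bit default are assumptions chosen for the example; a real ADC performs this step in hardware.

```python
def quantize(sample: float, bits: int = 16) -> int:
    """Map an amplitude in [-1.0, 1.0] to a signed integer code.

    Hypothetical helper for illustration only.
    """
    max_code = 2 ** (bits - 1) - 1          # 32767 for 16 bits
    clipped = max(-1.0, min(1.0, sample))   # out-of-range input is clipped
    return round(clipped * max_code)


def dequantize(code: int, bits: int = 16) -> float:
    """Convert an integer code back to an approximate amplitude."""
    max_code = 2 ** (bits - 1) - 1
    return code / max_code


code = quantize(0.3)
approx = dequantize(code)
# The difference between 0.3 and approx is the quantization error.
error = abs(approx - 0.3)
```

The error is never larger than half of one quantization step, which is why adding bits of depth lowers the noise introduced by this stage.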


The second element of encoding audio signals digitally is how frequently samples of the signal are taken, or the sampling rate. The rate at which an ADC samples a signal determines the frequency response of the digitizing process.


In modern digital audio, sampling rates of 48 kHz are common, offering roughly 2 to 3 samples per cycle for frequencies at the uppermost limits of human hearing.
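The arithmetic behind that figure can be sketched in a few lines. The example assumes an upper hearing limit somewhere between 16 kHz and 20 kHz, which is a commonly cited range and not a figure taken from the text; the function names are hypothetical.

```python
def samples_per_cycle(sample_rate_hz: float, frequency_hz: float) -> float:
    """Number of samples captured during one cycle of a tone."""
    return sample_rate_hz / frequency_hz


def highest_capturable_frequency(sample_rate_hz: float) -> float:
    """Half the sampling rate: a tone needs at least two samples per cycle."""
    return sample_rate_hz / 2


SAMPLE_RATE = 48_000  # Hz

at_20k = samples_per_cycle(SAMPLE_RATE, 20_000)  # assumed upper limit of hearing
at_16k = samples_per_cycle(SAMPLE_RATE, 16_000)  # assumed lower bound of that range
limit = highest_capturable_frequency(SAMPLE_RATE)
```

Under these assumptions a 20 kHz tone receives 2.4 samples per cycle and a 16 kHz tone receives 3, while the highest frequency a 48 kHz stream can represent is 24 kHz.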


While PCM is the more popular method of encoding audio signals digitally, other methods, such as pulse density modulation (PDM), are sometimes used.


Where electrical signaling decoupled sound reproduction from its physical connection to vibrating waves, digital sound completely detached the information of sound from any underlying medium.

Once an analog signal is digitized, it exists as a stream of bits. No matter how many times it’s copied or transferred between storage media, the information always remains exactly as it was initially captured. Audio signal data could now be instantly copied, stored on multiple forms of storage media, and transmitted digitally without ever degrading or changing.


Consuming the audio stored in a stream of bits is done by first converting the data back to an analog signal via a digital-to-analog converter, or DAC.


With audio data now stored effectively as a table of amplitude values, the simultaneous advancement of computing technology could be harnessed to process audio in more complex and powerful ways. While some of the properties of analog signal processing could be replicated mathematically in software, new forms of analyzing and modifying digital signals were also developed. The power of software brings incredible flexibility as well, allowing filters to be modified, structured, and layered in complex configurations without ever changing physical components.
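To make the idea of software filters concrete, here is a minimal sketch that treats audio as a plain list of amplitude values and layers two simple operations on it. The moving-average filter and the `gain` helper are illustrative choices, not the methods any specific audio software uses, and all names are hypothetical.

```python
def moving_average(samples, window=3):
    """Crude low-pass filter: each output is the mean of the most recent inputs."""
    out = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)   # shorter window at the very beginning
        chunk = samples[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def gain(samples, factor):
    """Scale every amplitude by a constant factor."""
    return [s * factor for s in samples]


signal = [0.0, 1.0, 0.0, 1.0]             # a rapidly alternating signal
smoothed = moving_average(signal, window=2)
quieter = gain(smoothed, 0.5)             # filters layered one after another
```

Changing the window size or reordering the two stages is a one-line edit, which is the kind of flexibility that would require swapping components in an analog circuit.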


One of the largest drawbacks of digital audio data is its storage requirements. In an era where storage capacity was expensive and limited, the notion of hundreds of megabytes being used up for a single piece of media became a hindrance to its practical migration beyond optical discs. The rise of digital video would also inherently require a more efficient method for storing audio. Add to this the emergence of the internet and its eventual transformation into a global media distribution platform, and the need for new methods of transmitting digital audio within limited data bandwidth becomes apparent.




In digital audio, the metric of bitrate is used to specify the minimum transfer throughput required to maintain real-time playback of a stream.
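For uncompressed PCM audio, the bitrate follows directly from the sampling rate, the bit depth, and the number of channels. The sketch below works through the numbers using Compact Disc parameters (44.1 kHz, 16 bits, stereo) as an example; the function names are hypothetical and the one-hour duration is an arbitrary choice.

```python
def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bits per second needed to play uncompressed PCM in real time."""
    return sample_rate_hz * bit_depth * channels


def storage_bytes(bitrate_bps: int, seconds: float) -> float:
    """Bytes required to store a stream of the given bitrate and length."""
    return bitrate_bps / 8 * seconds


cd_bitrate = pcm_bitrate(44_100, 16, 2)      # 1,411,200 bits per second
one_hour = storage_bytes(cd_bitrate, 3600)   # 635,040,000 bytes
```

An hour of audio at this bitrate occupies roughly 635 megabytes, which shows how a single recording could consume hundreds of megabytes.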


In general-purpose data compression, repeated instances of data within a dataset are identified and replaced with a smaller reference to a single copy of that repeated section. This effectively removes repeated content, reducing storage requirements overall. When the data is uncompressed, the references are replaced with the original repeating content, restoring the dataset perfectly to its uncompressed state. This is known as lossless compression, since the act of compressing the data doesn't destroy any information. Lossless compression is used where it is critical that no information is lost, as in the case of most data used by computers.
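The round trip can be demonstrated with Python's standard `zlib` module, whose DEFLATE format combines back-references to earlier repeated data with Huffman coding. It is used here as one familiar example of lossless compression; the sample data is made up for the demonstration.

```python
import zlib

# Highly repetitive input, so there is plenty of redundancy to remove.
original = b"la la la la " * 1000

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

original_size = len(original)       # 12,000 bytes
compressed_size = len(compressed)   # far smaller for data this repetitive
is_identical = restored == original # every byte comes back unchanged
```

The restored bytes match the input exactly, which is the defining property of lossless compression.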


In contrast, information that interacts with our brains via our senses, such as visual and auditory experiences, behaves very differently. Specifically with audio, not every audible frequency of sound that enters our ears is perceived by our brains. This phenomenon is known as auditory masking, and it can be exploited to compress digital audio data in a lossy manner.


This lossy compression removes significant amounts of information from the signal while still maintaining most of the audio’s fidelity.


Frequency masking occurs when a sound is made inaudible by a noise or sound of the same duration as the original sound. This tends to occur when two similar frequencies are played at the same time with one being significantly louder than the other.


Temporal masking, in contrast, occurs when a sudden sound obscures other sounds which are present immediately preceding or following it.


From extensive research conducted on auditory masking, response models have been developed that map the manner in which our hearing responds to this phenomenon. Masking is used to remove information in the frequency domain in order to compress digital audio.