About Audio Signal Encoding and Processing \ Kyle Huang

Throughout history, dense populations of humans have used various methods to preserve information. The first of these methods was likely oral tradition, and it had a specific utility. In anthropologist Joseph Henrich’s 2015 book The Secret of Our Success, he mentions the phenomenon of a “collective brain” in social groups of individuals.

To put it simply, a technological advancement can be modeled as a small probability that any one individual discovers it in their lifetime. For example, suppose a single human has a 0.1% chance of discovering arrow fletching in their lifetime. With 10 people, the probability that at least one of them makes the discovery rises to about 1%. With 100, it rises to 9.5%. With 10,000, the discovery of fletching becomes almost certain in a single generation, with a probability of 99.995%.

Note

$(1 - 0.001)^n$ is the probability that out of $n$ people, none make the discovery. Therefore, the expression for the likelihood that at least one does is $1 - (1 - 0.001)^n$.
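If you want to check the numbers yourself, here is a minimal sketch of that expression. The function name is made up for illustration, and it assumes every person's chance is independent of everyone else's.

```python
def p_at_least_one(p_individual: float, n_people: int) -> float:
    """Probability that at least one of n independent people makes the discovery."""
    return 1 - (1 - p_individual) ** n_people

# Group sizes used in the example above, with a 0.1% chance per person.
for n in (10, 100, 10_000):
    print(f"{n:>6} people: {p_at_least_one(0.001, n):.3%}")
```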

However, this requires the group to be socially connected. If innovations don’t die with the inventor, the next generation of innovations can build off of previous ones. This is how the collective brain works, and it lives because of knowledge transfer.

In the late 19th century, Thomas Edison was the first to create and patent a device which could both record and replay sound. It was called the phonograph, and its concept of capturing audio led to a chain of discoveries all centered around symbolizing information. Today, it manifests itself in the field of audio engineering, but functionally, it has been a major change in how the collective brain communicates.

This essay covers modern knowledge in these fields without expecting prerequisite knowledge. With this information, you can become more competent in analyzing the quality of stored music and learn the fundamentals of signal processing.

Analog and Digital

Analog describes the way signals have been recorded for much of audio history. The phonograph, as built by Thomas Edison, had a rotating cylinder, wrapped in either foil or wax. When speaking into the device, pressure waves vibrated a needle, and the needle carved a groove as the cylinder rotated and moved forward. This made the signal continuous.

Note

The rotation and forward movement of the cylinder were coordinated by a screw mechanism. The rod was threaded, so when you cranked it, the cylinder would rotate while also moving horizontally, which created the spiral groove.

At any point in the recording, you can move a minuscule step of time in either direction, and the nature of the groove results in a slightly different value in the signal. This is in contrast to a digitized signal, which stores discrete values at equally spaced time intervals. To explain the difference, take a diagram of a sine wave.

Fig. 1: Continuous wave

The sine wave is continuous. As shown in Fig. 1, when time t is 13 seconds, there is a y-value representing displacement of the groove, approximately 10.9214. Since the groove is carved into a physical medium, this y-value is not restricted to a fixed set of steps. In principle it can take any value, although in practice surface noise and wear limit how precisely it can be read back.

Fig. 2: Digitized wave

This is a digitization of Fig. 1’s continuous signal. A sample is taken repeatedly at equal time intervals. As shown in Fig. 2, the y-value at 13 seconds is slightly less than that of the reference sine wave. This is because between two sampling points, the y-value at any t is held at the value of the most recent sample.

Additionally, storing decimal numbers digitally has limited precision, meaning that the y-value represented by the groove in an analog signal is approximated in the digital version.
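The "hold the latest sample" behavior can be sketched in a few lines. Everything here is assumed for illustration: the 60-second sine wave, the 2-second sampling interval, and the function names are not taken from the figures.

```python
import math

def sample_and_hold(signal, sample_interval, t):
    """Return the value of the most recent sample taken at or before time t."""
    latest_sample_time = math.floor(t / sample_interval) * sample_interval
    return signal(latest_sample_time)

def slow_sine(t):
    # Hypothetical continuous signal: a sine wave with a 60-second period.
    return math.sin(2 * math.pi * t / 60)

# One sample every 2 seconds, so t = 13 still reports the sample taken at t = 12.
held = sample_and_hold(slow_sine, sample_interval=2.0, t=13.0)
print(f"continuous: {slow_sine(13.0):.4f}  held: {held:.4f}")
```

Because this wave is still rising at 13 seconds, the held value comes out slightly lower than the continuous one, which is the same effect described for Fig. 2.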

In the real world, both analog and digital audio signals are still in use. Every microphone records in analog, but when the recording needs to be manipulated in a digital audio workstation (DAW), it must have a digital representation on the computer. This conversion is done with an analog-to-digital converter (ADC). To explain the process, let’s take the example of recording human speech with a dynamic microphone.

Fig. 3: Dynamic microphone

Dynamic microphones work because of electromagnetic induction. Inside is a lightweight component called a diaphragm, attached to a coil of wire that sits around a permanent magnet. As the diaphragm moves, the coil moves back and forth around the magnet, inducing a small voltage. This voltage then travels through the coil and into the microphone output wire. The faster the diaphragm moves, the higher the voltage, and the slower it moves, the lower the voltage. Because sound is a sequence of pressure waves pushing on the diaphragm, the voltage is an analog representation of the audio source.

The constantly changing voltage travels down the wire and is fed into the ADC. 44,100 times per second, this circuit charges a capacitor to the voltage at that instant and estimates that voltage with a comparator. Each estimate is then rounded to the nearest 24-bit integer (the integer limits are mapped to the ends of the voltage range) and sent to your computer.
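The rounding step can be sketched as a simple mapping from a voltage range onto the range of a signed 24-bit integer. The ±1 V input range and the function name are assumptions for illustration; real converters differ in range and circuitry.

```python
def quantize_voltage(voltage, v_min=-1.0, v_max=1.0, bits=24):
    """Map a voltage in [v_min, v_max] onto the nearest signed integer code."""
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    clipped = min(max(voltage, v_min), v_max)         # out-of-range input is clipped
    fraction = (clipped - v_min) / (v_max - v_min)    # 0.0 at v_min, 1.0 at v_max
    return lowest + round(fraction * (highest - lowest))

for v in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(f"{v:+.2f} V -> {quantize_voltage(v)}")
```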

Pitch and Frequency

When you think about music specifically, one of the largest components is the melody. Melody is often described as a sequence of pitches, sounds that seem “higher” and “lower”. This apparent attribute of sounds can be explained by the concept of frequency.

Sound is a form of energy, and pitch is a pattern observed in sound. Most commonly, we observe this effect with particles of air. Air can be visualized as a large number of microscopic balls bouncing around in a volume. When an object pushes and interacts with these balls, it forms areas with higher densities and lower densities.

For example, imagine a guitar string which has both of its ends fixed to the instrument’s body. When the player pulls on the elastic string, it acts as an oscillator and moves back and forth.

Fig. 4: Compression and rarefaction

As shown in Fig. 4, the motion causes some regions of the volume to have higher air pressure (compressions) and others to have lower air pressure (rarefactions). Since the guitar string is elastic, this back-and-forth motion repeats again and again, creating a repetitive pattern. The number of times this pattern repeats in a second is called frequency, and it’s measured in hertz.

When a large number of compressions and rarefactions reach your ear, your eardrum vibrates and transmits the energy to three small bones in the middle ear (the malleus, incus, and stapes). These bones amplify the vibrations and pass them on to an organ called the cochlea, which is filled with fluid and lined with hair cells. These hair cells each have a bundle of many strands called stereocilia, and as the fluid in the cochlea ripples, the stereocilia move accordingly. The bending of stereocilia causes small channels to open, letting positively charged particles flood in. This triggers the cell to release a chemical signal at its base, firing a nerve beneath it and sending the signal through the cochlear nerve.

When it reaches your brain, the pitch of the noise is known from the location of the nerve. Because the cochlea is coiled like a snail shell, it has an apex and a base. The stereocilia at the base trigger signals when a high frequency ripple runs through the cochlea, and the apex responds to lower frequencies.

Note

The selective vibration happens unconsciously, and is a result of structure. As you go up the cochlea, stiffness drops dramatically. Since stiff things vibrate quickly, the base responds to high frequencies. The higher you go, the floppier it gets, so it responds to lower frequencies.

Obviously, there’s a limit to the frequencies we can hear. At some point, the waves repeat too slowly to meaningfully resonate the stereocilia at the apex of the cochlea or too quickly for the base. Through anatomical observations and repeated testing, we’ve found that the frequency range audible to humans is generally 20 Hz to 20,000 Hz.

Sampling Limits

Recall that the standard sampling rate for digital audio files is 44.1 kHz (a kilohertz is a thousand hertz). This number ensures that if you could feasibly hear the sound in analog, the detail would be present in the digital version. To reconstruct a periodic function, at minimum, you need to sample at twice the highest frequency it contains. This is because in the absolute ideal scenario, sampling at twice the highest frequency lands a sample at each valley and each peak of the wave.

This is called the Nyquist-Shannon sampling theorem, and it appears frequently when working with digital audio. Sampling an audio signal at a rate below the Nyquist boundary (two times the highest frequency) can cause aliasing, a phenomenon in which a high frequency is misidentified as a lower frequency due to the finer details in the sound occurring between two samples. This creates artifacts in the sound, which is why we use low-pass filters.
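To see aliasing in numbers, here is a small sketch. The 30 kHz tone is an assumed example chosen to sit above half the 44.1 kHz sampling rate, and the function name is made up. The check at the end shows that the sampled 30 kHz cosine is indistinguishable from a sampled 14.1 kHz cosine.

```python
import math

def alias_frequency(f_signal, f_sample):
    """Frequency that a pure tone at f_signal appears to have after sampling at f_sample."""
    folded = f_signal % f_sample
    return min(folded, f_sample - folded)

fs = 44_100
apparent = alias_frequency(30_000, fs)
print(f"A 30,000 Hz tone sampled at {fs} Hz looks like {apparent} Hz")

# The samples of both tones are numerically the same at every sampling instant.
identical = all(
    math.isclose(
        math.cos(2 * math.pi * 30_000 * n / fs),
        math.cos(2 * math.pi * apparent * n / fs),
        abs_tol=1e-9,
    )
    for n in range(1000)
)
print("Samples identical:", identical)
```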

Note

If you were wondering how audio engineers can separate sounds into frequencies, look into the Fourier Transform. It’s quite complex, but for the purposes of this article, any well-behaved periodic signal can be decomposed into an infinite sum of simple sinusoidal functions.
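As a rough illustration of that decomposition, the sketch below builds a signal from two sine waves and recovers their frequencies with a naive discrete Fourier transform. The 400 Hz sample rate and the 50 Hz and 120 Hz tones are assumed values. Real tools use the much faster FFT algorithm, but the result is the same.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns the magnitude of each frequency bin."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

fs = 400  # hypothetical sample rate in Hz; one second of audio gives 1 Hz per bin
samples = [
    math.sin(2 * math.pi * 50 * i / fs) + 0.5 * math.sin(2 * math.pi * 120 * i / fs)
    for i in range(fs)
]
mags = dft_magnitudes(samples)

# The two largest bins should sit at the two frequencies we mixed together.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print("Strongest frequencies (Hz):", strongest)
```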

A low-pass filter passes frequencies below a certain point (leaves them unchanged), and attenuates frequencies above that threshold (quiets them). A high-pass filter does the opposite, passing above the line and attenuating below it. Using a low-pass filter before downsampling audio ensures that all frequencies present in the result were present in the original, with no additional ones from aliasing.
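Here is a minimal sketch of filtering before downsampling, assuming we halve the sample rate. It uses a windowed-sinc filter with 101 taps and a Hamming window, which is one common textbook design and not necessarily what any particular DAW uses. All function names and parameters are illustrative.

```python
import math

def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc low-pass FIR kernel (Hamming window), normalized to unity gain at 0 Hz."""
    fc = cutoff_hz / sample_rate  # cutoff as a fraction of the sample rate
    mid = (num_taps - 1) / 2
    kernel = []
    for i in range(num_taps):
        x = i - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        kernel.append(ideal * window)
    total = sum(kernel)
    return [k / total for k in kernel]

def convolve_same(signal, kernel):
    """Filter a signal, treating samples beyond its edges as zero."""
    half = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = n + half - j
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out

def downsample_by_2(signal, sample_rate):
    """Low-pass just below the new Nyquist frequency, then keep every second sample."""
    new_nyquist = sample_rate / 4
    filtered = convolve_same(signal, lowpass_kernel(0.9 * new_nyquist, sample_rate))
    return filtered[::2]
```

With a 44.1 kHz input, the new Nyquist frequency is 11,025 Hz, so a 1 kHz tone should pass through almost untouched while an 18 kHz tone is attenuated before it has a chance to alias.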

The number 44.1 kHz originated from PCM adaptors, devices that recorded digital audio onto video cassette tapes (flashing checkerboard patterns representing ones and zeroes). One of the dominant television standards at the time, PAL, could carry exactly 44,100 samples per second with three samples stored per usable video line (294 lines × 50 fields per second × 3 samples), and the same rate also fit NTSC equipment. Since the human hearing range tops out at 20 kHz, the digital sampling rate had to be at minimum 40 kHz. After several different standards around the 40 kHz mark were proposed by the prominent CD manufacturers of the time, it was ultimately decided that the CD standard would have a sample rate of 44.1 kHz.

Audio Formats

After the ADC pipes in a continuous stream of 24-bit integers representing voltages from the voice coil moving around the magnet, the computer converts each one to a 32-bit value and stores it in RAM. DAWs internally process everything in 32-bit or 64-bit floats, so converting to a consistent bit depth allows our audio processing to work predictably. This list of sample values in RAM is called pulse-code modulation (PCM). Once we save this recording to our main storage unit (such as an SSD), we attach a header to the file which declares information including bit depth, sampling rate, and number of channels. Most popularly, this becomes a Waveform Audio File (WAV) on Windows or an Audio Interchange File Format (AIFF) file on macOS.
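Python's standard `wave` module makes the header idea concrete. This sketch writes a few made-up samples to an in-memory WAV file and reads the header back. It uses 16-bit integers for brevity because the `wave` module only handles integer PCM, and the `write_wav` helper is a name invented for this example.

```python
import io
import struct
import wave

def write_wav(samples, sample_rate=44_100, channels=1):
    """Pack 16-bit PCM samples into an in-memory WAV file and return its bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)            # bytes per sample: 2 bytes = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()

data = write_wav([0, 1000, -1000, 32767])

# Read the header back: channel count, bit depth, sample rate, and frame count.
with wave.open(io.BytesIO(data), "rb") as wav:
    print(wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate(), wav.getnframes())
```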

In storage, 32-bit audio is almost never distributed to any music consumers. The standard in streaming services like Spotify and Apple Music is 16-bit integers, and 24-bit integers are often used for studio recordings or archives. During the export process, this loss of data (removing bits means fewer distinct representable integers) can create distortion in the signal. Dithering is used as a solution, where a very small amount of noise randomizes the signal slightly and makes the quantization error far less perceptible.
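A rough sketch of dithering when going from 24-bit to 16-bit is shown below. It assumes triangular (TPDF) noise spanning one 16-bit step in each direction, which is a common choice but only one of several. The function name and the test value of 100 are made up for illustration.

```python
import random

def reduce_24_to_16(sample_24, rng, dither=True):
    """Requantize one 24-bit sample to 16 bits, optionally adding triangular dither."""
    step = 256  # one 16-bit step spans 2**8 steps of the 24-bit scale
    noise = 0.0
    if dither:
        # The difference of two uniform values has a triangular distribution in (-1, 1).
        noise = (rng.random() - rng.random()) * step
    value = round((sample_24 + noise) / step)
    return max(-32_768, min(32_767, value))

rng = random.Random(0)
quiet = 100  # a 24-bit value smaller than half of one 16-bit step

print("without dither:", reduce_24_to_16(quiet, rng, dither=False))
average = sum(reduce_24_to_16(quiet, rng) for _ in range(20_000)) / 20_000
print("with dither, average on the 24-bit scale:", average * 256)
```

Without dither, the quiet value always rounds to zero and is lost. With dither, individual samples are noisy, but their average stays close to the original value, so the detail survives as signal buried in faint noise.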

In the same way that text files can be compressed due to the redundancy in normal writing, lossless audio files also have redundancy and therefore can be compressed. The most famous compressed lossless audio format is the Free Lossless Audio Codec (FLAC), which uses a clever method of space optimization. The encoder splits the audio signal into blocks and approximates each block’s signal with a simple mathematical function. The function’s parameters and the residuals (the differences between the real samples and the approximation) are written to the file, so every single bit of data is preserved while the occupied space is minimized.
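The predict-and-store-residuals idea can be sketched with a very simple predictor: guess each sample by extending the straight line through the previous two. This is similar in spirit to the simplest predictors FLAC can use, but it is only a toy. The real codec works block by block, chooses among several predictors, and packs the residuals with an entropy coder, none of which is shown here.

```python
import math

def encode_residuals(samples):
    """Predict each sample from the previous two and keep only the prediction errors."""
    residuals = []
    for i, s in enumerate(samples):
        # The first two samples have no history, so they are stored as-is.
        prediction = 0 if i < 2 else 2 * samples[i - 1] - samples[i - 2]
        residuals.append(s - prediction)
    return residuals

def decode_residuals(residuals):
    """Rebuild the original samples by repeating the same predictions."""
    samples = []
    for i, r in enumerate(residuals):
        prediction = 0 if i < 2 else 2 * samples[i - 1] - samples[i - 2]
        samples.append(prediction + r)
    return samples

# Hypothetical input: a 440 Hz tone at 44.1 kHz, stored as integers.
tone = [round(10_000 * math.sin(2 * math.pi * 440 * i / 44_100)) for i in range(200)]
residuals = encode_residuals(tone)

print("largest sample:  ", max(abs(s) for s in tone))
print("largest residual:", max(abs(r) for r in residuals[2:]))
print("lossless:", decode_residuals(residuals) == tone)
```

The residuals are much smaller numbers than the samples, so they need fewer bits to store, and decoding still reproduces the input exactly.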

Note

FLAC uses really simple modeling for its approximations, so residuals take up a meaningful amount of space. Recent research has looked into using machine learning to create better approximations. Li, Huang, Wang, et al. released a 2024 paper titled Lossless data compression by large models, if that interests you.

In many cases, it is unnecessary for end consumers to receive the full lossless 16-bit data. In casual music streaming or with constrained storage hardware, the largest concern may be minimizing the stored data while mostly preserving the signal in human perception. This is called lossy compression, and it changes the question from “Where is the redundancy in the data?” to “Where is the redundancy for the listener?”

Psychoacoustics is the study of how humans perceive sound, and MP3, the most well-known lossy audio format, leverages certain parts of the psychoacoustic model to reduce stored data. One example is the equal-loudness contour. Even at the same level of sound pressure (decibels), different frequencies will sound louder and quieter than others.

Fig. 5: Equal-loudness contour

Fig. 5 displays a simplified equal-loudness contour, which shows the actual sound pressure on the y-axis and the frequency on the x-axis. Every point on this curve is perceived as equally loud by average young listeners with healthy hearing. This phenomenon of different frequencies of the same sound pressure being perceived with distinct “subjective loudness” is utilized by MP3, since frequency bands the ear is insensitive to can be quantized more coarsely.

Another psychoacoustic phenomenon MP3 leverages is the masking effect. After a loud noise, the volume of the next short window of sound must be increased in order to be perceived by the listener. For example, a loud cymbal crash can make the next 100 ms inaudible to a listener, even though the raw audio file encodes it. MP3 can leverage this by quantizing these masked segments coarsely, since the actual perceived effect is minimal.

Since MP3 is running these psychoacoustic models and making decisions on what chunks don’t need as much detail, every encoding pass loses information, no matter what. Suppose your friend gives you a 128 kbps MP3 and you decide to transcode it to an MP3 at 320 kbps. The encoder is now making its decisions on an already compressed signal, so the detail discarded at 128 kbps is not restored, and you lose further information despite the technically higher bitrate.

Spectral Analysis

Since files can lie about their true contents with a label in metadata, the only reliable way to determine the data present in an audio file is with spectral analysis. Spectral analysis is a visualization of digital audio, and it plots audio data in a two-dimensional graph.

Of course, the horizontal axis represents time, but the vertical axis is where it gets trickier. The vertical axis represents frequency, but due to the nature of sound, a frequency measurement requires a delta time in order to accurately assess the period of the function. In the real world, spectrograms select an interval of the audio signal, run a Fourier Transform on that interval in order to acquire frequency data, and slide the selection to the next chunk. The color/brightness at a certain point encodes the amplitude, or loudness of that certain frequency.
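The select, transform, and slide loop can be sketched as follows. This is a bare-bones version that assumes a Hann window and a naive Fourier transform. The 8 kHz sample rate and the two test tones are made up, and real spectrogram tools use the FFT and usually overlap their windows.

```python
import cmath
import math

def spectrogram(samples, window_size, hop):
    """Slide a window over the signal and return one list of bin magnitudes per frame."""
    frames = []
    for start in range(0, len(samples) - window_size + 1, hop):
        chunk = samples[start:start + window_size]
        # A Hann window tapers the edges of each chunk to reduce spectral leakage.
        chunk = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * i / window_size))
            for i, x in enumerate(chunk)
        ]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / window_size)
                    for i, x in enumerate(chunk)))
            for k in range(window_size // 2 + 1)
        ])
    return frames

fs = 8_000  # hypothetical sample rate in Hz

def tone(freq, length):
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(length)]

# 0.1 s at 500 Hz followed by 0.1 s at 2,000 Hz.
signal = tone(500, 800) + tone(2_000, 800)
frames = spectrogram(signal, window_size=128, hop=128)

# Brightest frequency in each frame: bin index times the width of one bin.
peak_hz = [max(range(len(f)), key=lambda k: f[k]) * fs / 128 for f in frames]
print(peak_hz)
```

Each inner list is one vertical strip of the spectrogram, and the magnitude values are what a plotting tool would turn into color or brightness.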

The problem with using an interval to assess frequency data is that it becomes impossible to have arbitrary resolution in both time and frequency simultaneously. This is formalized as the Gabor limit.

If you set delta time extremely small (tiny selections at a time), you are able to separate exact moments in the audio, so if there’s an abrupt slide upwards in frequency, you can clearly see the movement. However, a small delta time limits your ability to discern frequencies properly. If your selected interval is shorter than the period of the lowest frequencies in the signal, it becomes mathematically impossible to resolve them with a Fourier Transform, and the visualization shows artifacts because of it.

If you set delta frequency extremely small (large selections), you run into the exact opposite problem. The nature of the Fourier Transform means that a longer signal gives you more samples and therefore a more accurate discernment of frequencies. This also requires a larger delta time, meaning that the abrupt slide upwards in frequency becomes blended into a single frequency interval in the visual.
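The trade-off is easy to put into numbers. For a window of N samples, the window covers N divided by the sample rate in seconds, and each frequency bin of a Fourier Transform is the sample rate divided by N in hertz wide. The window sizes below are just common example values.

```python
def window_resolution(window_size, sample_rate=44_100):
    """Return (time span in seconds, frequency bin width in Hz) for one analysis window."""
    return window_size / sample_rate, sample_rate / window_size

for n in (256, 1024, 4096, 16384):
    dt, df = window_resolution(n)
    print(f"{n:>6} samples: {dt * 1000:7.1f} ms per window, {df:7.1f} Hz per bin")
```

Multiplying the two numbers always gives 1, so shrinking one necessarily grows the other.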

In practice, you can run several spectral analyses with different window lengths, and these will give you enough information to determine patterns for practical use. To determine the processing performed on an audio signal, perform a spectral analysis and pattern match with the following information. (Delta time is irrelevant for this, since lossy encoding simply allocates zero bits to certain frequency bands, resulting in a visible result regardless of window length.)

A lossless audio file or raw PCM data will result in a graph with frequencies that extend all the way up to about 22 kHz (half of the 44.1 kHz sampling rate). Even if the signal itself has very little high-frequency content, lossless will show very faint coloration extending up to 22 kHz, and not an abrupt cutoff anywhere below.

The appearance of an MP3 will vary between bitrates. A 320 kbps MP3 will show a hard cutoff line at roughly 20.5 kHz, and a small “shelf” effect at 16 kHz. This is a result of the psychoacoustic model judging certain frequency bands as less worthy of spending data on, since they are less noticeable to the average listener. All of the following bitrates will have the shelf effect at the same spot. 256 kbps will have a cutoff at 20 kHz, V0 (variable bitrate) at 19.5 kHz, 192 kbps at 19 kHz, V2 at 18.5 kHz, and 128 kbps will cut off at 16 kHz.
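If you have already read a cutoff frequency off a spectrogram, the pattern matching can be written as a small lookup. This is only a heuristic sketch built from the rough figures in the paragraph above. The 21.5 kHz threshold for "lossless", the 200 Hz tolerance, and the function name are assumptions, and encoders and settings vary.

```python
# Approximate cutoff frequencies (Hz) taken from the paragraph above.
MP3_CUTOFFS = [
    ("320 kbps", 20_500),
    ("256 kbps", 20_000),
    ("V0", 19_500),
    ("192 kbps", 19_000),
    ("V2", 18_500),
    ("128 kbps", 16_000),
]

def guess_source(cutoff_hz, tolerance_hz=200):
    """Match a measured spectrogram cutoff against the rough values listed in the text."""
    if cutoff_hz >= 21_500:  # assumed threshold: content reaches nearly 22 kHz
        return "likely lossless"
    label, expected = min(MP3_CUTOFFS, key=lambda item: abs(item[1] - cutoff_hz))
    if abs(expected - cutoff_hz) <= tolerance_hz:
        return f"consistent with MP3 {label}"
    return "inconclusive"

for measured in (22_000, 20_450, 19_000, 16_100, 17_200):
    print(measured, "->", guess_source(measured))
```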

With this knowledge, you should be able to tell if an audio signal has been encoded with a psychoacoustic model or if it is the original lossless data.

The Collective Brain

The field of encoding and processing information in different ways is a fascinating one for us to explore, and as shown in this article which barely scratches the surface of digital signal processing (DSP), we’ve achieved high levels of complexity in it. I only recently dove into this information out of utility for my local music collection, but the underlying technology was very interesting to research.

The transfer of information throughout our “collective brain” is mainly done through verbal and typed messages today, but the concepts of compressing information and decoding it may very well be useful in the future for increasing how much signal we can transfer at a time. If you work with audio files frequently and organize them but never took the time to understand what exactly they were, I hope this was helpful.

That’s all for today, and until next time, I am out.