Sound waves are continuous waves of motion, carried through the air as molecules collide with one another. A speaker uses an electromagnet to rapidly move a cone back and forth, pushing the surrounding air to produce sound waves. This rapid motion is controlled by an analog signal, a continuous signal that is analogous to another signal – in this case, a current whose fluctuating voltage is analogous to a sound wave. A stream of voltages is pushed out from a source into an amplifier or speaker, varying the charge in the speaker’s magnet and ultimately pulsating the cone.
A computer processor is not designed to handle analog signals; instead, it reads digitized information in the form of bits. Computers handle analog signals (like sounds and videos) using discrete samples of the analog waveform. Basically, this means they use a sequence of numbers, each representing the value of the wave at a precise position (point in time). The sequence of values repeatedly increases and decreases incrementally between two peaks, effectively tracing the wave. Each of these discrete values is called a sample, and this process of representing a signal as a sequence of samples is aptly called sampling. Computers use analog-to-digital converters (ADCs) to encode analog signals as digital samples, and digital-to-analog converters (DACs) to produce analog signals from digital samples. Digital sampling, along with the encoding and decoding between analog and digital signals, is a huge field and an area of tremendous study, with applications in audio, video, data transmission, broadcasting, telephony, streaming media, sensors, signal analysis, and so forth. Let’s dive into the basics.
Suppose we want to represent one period of a continuous, smooth wave with an amplitude of 1 (forget about frequency for now). Most of us probably remember from science class that amplitude (the intensity of the wave – volume for sound waves) is the distance from baseline to peak, so this wave would rise and fall between 1 and -1. (Note that some sources use base-to-peak to describe amplitude and others use peak-to-peak; to be consistent, we will use base-to-peak throughout.) Since the baseline is 0, in each period the wave will start at 0, rise to 1, fall to -1, then rise back to 0.
We can represent that with discrete samples:
0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, -0.0…
If we plotted each of these sample values at a fixed interval, it’s clear that we would get a wave shape.
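Sequences of samples like the ones above can be generated programmatically. Here is a minimal sketch in Python (the function name `sample_sine_period` is just for illustration) that produces one period of a sine wave with amplitude 1 as a list of discrete samples:

```python
import math

def sample_sine_period(num_samples):
    """Return num_samples evenly spaced samples of one period of a
    sine wave with amplitude 1 (values between -1.0 and 1.0)."""
    return [math.sin(2 * math.pi * n / num_samples) for n in range(num_samples)]

# 16 samples per period, rising from 0 to 1, falling to -1, returning toward 0
samples = sample_sine_period(16)
```

Plotting `samples` against the sample index would trace out the wave shape described above; passing a larger `num_samples` corresponds to a higher sampling rate.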
We could represent the same wave with a different number of samples (i.e. change the sampling rate):
0.0, 0.3, 0.5, 0.8, 1.0, 0.8, 0.5, 0.3, 0.0, -0.3, -0.5, -0.8, -1.0, -0.8, -0.5, -0.3, 0.0…
or with a different level of precision for each sample (i.e. change the resolution or bit-depth):
0.00, 0.25, 0.50, 0.75, 1.00, 0.75, 0.50, 0.25, 0.00, -0.25, -0.50, -0.75, -1.00, -0.75, -0.50, -0.25, -0.00…
As you might expect, varying either the sampling rate or the bit-depth directly affects how accurately the analog signal can be reproduced, which in the case of audio affects the quality of the sound produced. A higher sampling rate provides more data points for a DAC to use when recreating a given segment of the wave, and a higher bit-depth allows each of those data points to be more precise. Conversely, the lower the sampling rate or bit-depth, the less precisely the DAC can produce the desired signal. You have probably noticed this effect when listening to MP3s or streaming videos of various qualities. The term lossy describes encoding processes that do not preserve all of the source data; analog-to-digital conversion can be lossy.
If a wave/signal is plotted on a two-axis graph, as in the figures, the x-axis represents time and the y-axis represents the position of the wave. The sampling rate corresponds to the number of points we have in a segment of the x-axis (time); the bit-depth corresponds to the number of distinct positions on the y-axis that any one of those points can take. Suppose that the maximum amplitude of any signal is 1 (forget about units), meaning that at any point in time the position of a wave is between 1 and -1. With a 1-bit depth, we have two values to describe the position (1 bit can be either 0 or 1), which means everything in the signal would have to be rounded off to one value or the other. For some applications with simple signals, that is fine (Morse code, for instance); for complex signals (which real-world sounds are), it will not do. The more values available to describe the position, the smaller the impact of rounding off. For an N-bit depth, the number of available values is 2^N (2 to the power of N).
It is worth noting that there are many advanced signal-processing techniques that use filtering, statistical inference, complex mathematics and other tools to help compensate for signal loss, but they are outside of the scope of this article.
To play back audio as sounds for us to hear, the computer sends the samples to a sound card, which has a DAC to decode the sample values and reproduce the analog signal in the form of an electrical current sent to the speakers. Pulse-code modulation (PCM) is the standard for doing this. PCM is more or less the same as the digital sampling examples above, except that the samples are in the form of bytes. WAVE audio files (.wav) are a PCM-based format (specifically, they use Linear Pulse-Code Modulation, or LPCM, a form of PCM with linear quantization). The sampling rate of PCM ultimately depends on the hardware, but modern implementations typically allow us to easily choose between a number of supported rates at the software level. The same goes for bit-depth, which in PCM is typically a multiple of 8, since a byte, the smallest addressable unit of data, is 8 bits.
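To make this concrete, here is a minimal sketch of writing one second of 16-bit LPCM samples to a .wav file using Python's standard `wave` module (the filename, tone frequency, and duration are arbitrary choices for the example):

```python
import math
import struct
import wave

SAMPLE_RATE = 44100   # samples per second
BIT_DEPTH = 16        # bits per sample (2 bytes, signed)
FREQUENCY = 440.0     # tone to generate, in Hz
DURATION = 1.0        # seconds

with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)               # mono
    wav.setsampwidth(BIT_DEPTH // 8)  # bytes per sample
    wav.setframerate(SAMPLE_RATE)
    frames = bytearray()
    for n in range(int(SAMPLE_RATE * DURATION)):
        value = math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE)
        # Scale [-1.0, 1.0] to the signed 16-bit range, packed little-endian.
        frames += struct.pack("<h", int(value * 32767))
    wav.writeframes(bytes(frames))
```

The resulting file is exactly the kind of byte-per-sample PCM stream described above: 44,100 two-byte samples for each second of mono audio.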
So what sampling rate is needed to accurately reproduce the represented signal? According to the Nyquist-Shannon sampling theorem, if the highest frequency of a sampled analog signal is B, the signal can be perfectly reconstructed from a sequence of samples if the sampling rate exceeds 2B. Human hearing ranges from roughly 20 hertz to 20,000 hertz, so in the case of sound, any wave frequency we could possibly hear is reproducible with a sampling rate above 40,000 Hz. Luckily, the fancy electronics we use today are easily capable of producing, reading, encoding, and decoding samples at 44.1 kHz, a rate most sound cards support. That’s right, an astonishing 44,100 times per second, which just so happens to be more than 40,000 Hz… go figure. Some high-end audio devices on the market today support rates as high as 192 kHz, which allows very complex, high-frequency waveforms to be accurately reconstructed. For most purposes, though, 44.1 kHz is quite suitable.
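A quick numerical sketch shows what goes wrong when the Nyquist condition is violated. Using a deliberately low sampling rate of 8 kHz (so the limit is 4 kHz), a 9 kHz tone produces exactly the same samples as a 1 kHz tone; the higher frequency is "aliased" down and cannot be distinguished or reconstructed:

```python
import math

SAMPLE_RATE = 8000  # deliberately low, so the Nyquist limit is 4000 Hz

def sample(freq, n):
    """Value of a sine wave of frequency `freq` (Hz) at sample index n."""
    return math.sin(2 * math.pi * freq * n / SAMPLE_RATE)

# A 1 kHz tone is below the Nyquist limit; a 9 kHz tone is well above it.
# Sampled at 8 kHz, the two produce identical sample sequences (aliasing):
for n in range(16):
    assert abs(sample(1000, n) - sample(9000, n)) < 1e-9
```

This is why the sampling rate must exceed twice the highest frequency present: below that rate, distinct frequencies collapse onto the same sample values.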
As we discussed above, the higher the bit-depth, the more precise each sample can be. However, there is a trade-off. We know we want a sample rate of at least 44.1 kHz, which means our processor and sound card have to be able to continuously load/store, encode/decode, and generate samples 44,100 times per second, depending on whether we are recording, playing back, or synthesizing audio (sometimes all at the same time, all while still running applications, the operating system, and everything else on the computer). We don’t want to compromise the sampling rate, because the Nyquist-Shannon theorem tells us we need it to cover the full range of frequencies humans can perceive. This means that the higher the bit-depth, the more work the computer must do for each individual load/store, encode/decode, and generate operation. Computers are fast, but they are only so fast. If we want them to do this 44,100 times per second without interruption, we have to make each operation as light as possible. 16-bit depth seems to be the most common bit-depth for high-quality consumer audio, although some high-end audio equipment supports 24-, 32-, and even 64-bit encoding. At some point humans cannot perceive the difference, but I’m not sure where that point is.
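The cost of that trade-off is easy to put in numbers. A back-of-the-envelope sketch (assuming stereo, i.e. two channels, which is typical for consumer audio):

```python
SAMPLE_RATE = 44100   # samples per second, per channel
BIT_DEPTH = 16        # bits per sample
CHANNELS = 2          # stereo

# Every second, this many bytes must be moved, encoded, or decoded:
bytes_per_second = SAMPLE_RATE * (BIT_DEPTH // 8) * CHANNELS
bytes_per_minute = bytes_per_second * 60
```

At 16-bit stereo that works out to 176,400 bytes every second, roughly 10 MB per minute of uncompressed audio; doubling the bit-depth doubles that load on every operation.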
One thing to keep in mind is that when the wave is reproduced, the computer – and more specifically the DAC generating the analog signal from the data – has to “fill in the holes” between samples. There are various techniques for doing this, and obviously they work quite well, but these gaps between samples are effectively noise. When we start to do complex things with our audio signal, like amplify it, time-warp it, add reverberation and effects, and so forth, this noise becomes more and more apparent and audible. This is why the high-end equipment mentioned above, used in professional audio studios, is capable of such high sample rates and bit-depths. Even though humans cannot perceive the difference in the source signal, it reduces the level of noise that becomes audible later in production.