Sound waves are continuous waves of motion, carried through molecules as they collide with each other – through the air, through solid materials, and even through liquids. A speaker uses electromagnets to rapidly move a cone back and forth, pushing the air around it to produce vibrations, and thus sound waves. This rapid motion is controlled by an analog signal – a continuous signal that is analogous to some other continuous quantity – in this case a current with a fluctuating voltage that mirrors the shape of a sound wave. A stream of voltages is pushed out from a source into an amplifier or speaker, creating varying magnetic fields in the speaker's electromagnet and ultimately pulsating the cone.
A computer processor is not designed to handle analog signals; instead it reads digitized information in the form of bits – 0s and 1s. Computers imitate analog signals (like sounds and videos) by using discrete samples of the analog waveform. Basically this means they use a sequence of numbers, each representing the value of the wave at a precise position (point in time). The values in the sequence repeatedly rise and fall between two peaks, effectively tracing the shape of the signal wave. Each of these discrete values is called a sample, and the process of representing signals as a sequence of numerical samples is aptly called digital sampling. Computers use analog-to-digital converters (ADCs) to encode analog signals as digital samples, and digital-to-analog converters (DACs) to produce analog signals (typically electrical) from digital samples. Digital sampling, and the encoding and decoding between analog and digital signals, is a huge field and an area of tremendous study, with applications in audio, video, data transmission, broadcasting, telephony, streaming media, sensors, signal analysis, and so forth. Let’s dive into the basics.
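To make that concrete, here is a minimal sketch of digital sampling – the function and variable names are my own, and a 1 Hz sine wave stands in for a real analog source. We simply evaluate the continuous signal at fixed time intervals:

```python
import math

def sample_signal(signal, duration, sampling_rate):
    """Take discrete samples of a continuous signal at a fixed interval.

    signal        -- a function of time (seconds) returning the wave's value
    duration      -- how long to sample for, in seconds
    sampling_rate -- how many samples to take per second
    """
    n = int(duration * sampling_rate)
    return [signal(i / sampling_rate) for i in range(n)]

# A 1 Hz sine wave standing in for the "analog" signal
wave = lambda t: math.sin(2 * math.pi * t)

# One full period, captured as 8 discrete samples
samples = sample_signal(wave, duration=1.0, sampling_rate=8)
print([round(s, 2) for s in samples])
# → [0.0, 0.71, 1.0, 0.71, 0.0, -0.71, -1.0, -0.71]
```

An ADC does essentially this in hardware: it measures the incoming voltage at regular intervals and emits one number per measurement.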
Suppose we want to represent one period of a continuous, smooth wave with an amplitude of 1 (disregard units, and forget about frequency for now). Most of us probably remember from science class that amplitude (the intensity of the wave – volume for sound waves) is the distance from baseline to peak, so this wave would rise and fall between 1 and -1 (note that some sources use base-to-peak to describe amplitude, others use peak-to-peak; to be consistent we will use base-to-peak throughout). Since the baseline is 0, in each period, the wave will start at 0, rise to 1, fall to -1, then rise back to 0.
We can represent that with discrete samples:
0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, -0.0…
If we plotted each of these sample values at a fixed interval, it’s clear that we would get a wave shape.
We could represent the same wave with a different number of samples (i.e. change the sampling rate):
0.0, 0.3, 0.5, 0.8, 1.0, 0.8, 0.5, 0.3, 0.0, -0.3, -0.5, -0.8, -1.0, -0.8, -0.5, -0.3, 0.0…
or with a different level of precision for each sample (i.e. change the resolution or bit-depth):
0.00, 0.25, 0.50, 0.75, 1.00, 0.75, 0.50, 0.25, 0.00, -0.25, -0.50, -0.75, -1.00, -0.75, -0.50, -0.25, -0.00…
As you might expect, varying either the sampling rate or the bit-depth directly affects the ability to accurately reproduce the analog signal, which in the case of audio affects the characteristics of the sound produced. A higher sampling rate provides more data points for a DAC to use to recreate a given segment of the wave, and a higher bit-depth allows each one of those data points to be more precise. Conversely, the lower the sampling rate or bit-depth, the less precisely the DAC is able to reproduce the original signal. You have probably noticed this effect when listening to MP3s or streaming videos of varying quality. The term lossy describes encoding processes that do not preserve all of the source data; analog-to-digital conversion is always lossy, though the loss is not necessarily noticeable.
If a wave/signal is plotted on a two-axis graph, as in the figures, the x-axis represents time and the y-axis represents the position of the wave. Sampling rate is equivalent to the number of points we have in a segment of the x-axis (time); bit-depth corresponds to the number of distinct positions on the y-axis any one of those points may fall on. Suppose that the maximum amplitude of any signal is 1 (still ignoring units), meaning at any point in time the position of a wave is between 1 and -1. With a 1-bit depth, we have only two values to describe the position (1 bit can be either 0 or 1), which means every point in the signal would have to be rounded off to one value or the other. For some applications with simple signals, that is fine (Morse code, for instance); for complex signals (which real-world sounds are), it will not do. The higher the number of available values to describe the position, the lower the impact of rounding off. For an N-bit depth, the number of available values is 2^N (2 to the power of N).
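The rounding-off is easy to demonstrate. The sketch below (function name my own) spreads 2^N evenly spaced values across the range -1 to 1 and snaps each sample to the nearest one – a simplified model of what quantization at a given bit depth does:

```python
def quantize(sample, bit_depth):
    """Round a sample in [-1.0, 1.0] to the nearest of the 2**bit_depth
    evenly spaced values representable at the given bit depth."""
    levels = 2 ** bit_depth            # number of distinct y-axis positions
    step = 2.0 / (levels - 1)          # spacing between adjacent positions
    return round((sample + 1.0) / step) * step - 1.0

# With 1 bit there are only two values, -1 and 1: everything rounds to one of them.
print(quantize(0.3, 1))               # → 1.0
# With 8 bits there are 256 values, so the rounding error is far smaller.
print(round(quantize(0.3, 8), 4))     # very close to 0.3
```

Each extra bit doubles the number of available positions, halving the worst-case rounding error.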
It is worth noting that there are many advanced signal-processing techniques that use filtering, statistical inference, complex mathematics and other tools to help compensate for signal loss, but they are outside of the scope of this article.
To play back audio as sounds for us to hear, the computer sends the samples to a sound card, which has a DAC to decode the sample values and reproduce the analog signal in the form of an electrical current that is sent to the speakers. Pulse-code modulation (PCM) is the standard for doing this. PCM is more or less the same as the digital sampling examples above, except samples are in the form of bytes. WAVE audio files (.wav) are a PCM-based format (specifically, they use Linear Pulse-Code Modulation, or LPCM, a form of PCM with linear quantization). The sampling rate of PCM ultimately depends on the hardware, but modern implementations typically let us choose between a number of supported rates at the software level. The same goes for bit-depth, which in PCM is typically a multiple of 8, since a byte – the smallest addressable unit of data – is 8 bits.
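As a concrete illustration, Python's standard-library `wave` module writes LPCM samples into a .wav container. This sketch (the filename and 440 Hz test tone are arbitrary choices of mine) packs a sine wave as 16-bit signed little-endian samples at 44.1 kHz:

```python
import math
import struct
import wave

SAMPLE_RATE = 44100   # samples per second
BIT_DEPTH = 16        # bits per sample (2 bytes per LPCM sample)
FREQUENCY = 440.0     # test tone (concert A)
DURATION = 1.0        # seconds

# Generate one second of a sine wave as 16-bit signed integers.
amplitude = 2 ** (BIT_DEPTH - 1) - 1   # 32767, the largest 16-bit sample value
frames = b"".join(
    struct.pack("<h", int(amplitude * math.sin(2 * math.pi * FREQUENCY * i / SAMPLE_RATE)))
    for i in range(int(SAMPLE_RATE * DURATION))
)

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)               # mono
    f.setsampwidth(BIT_DEPTH // 8)  # bytes per sample
    f.setframerate(SAMPLE_RATE)
    f.writeframes(frames)
```

Playing the resulting file hands those samples to the sound card's DAC, which turns them back into a continuous electrical signal.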
So what sampling rate is needed to accurately reproduce the represented signal? According to the Nyquist-Shannon sampling theorem, if the highest frequency of a sampled analog signal is B, the signal can be perfectly reconstructed from a sequence of samples if the sampling rate exceeds 2B. Human hearing ranges from roughly 20 hertz to 20,000 hertz, so in the case of sound, any wave frequency we could possibly hear is reproducible with a sampling rate above 40,000 Hz. Luckily, the fancy electronics we use today are capable of easily producing, reading, encoding and decoding samples at 44.1 kHz, which most sound cards support. That’s right, an astonishing 44,100 times per second, which just so happens to be more than 40,000 Hz… go figure, it almost seems like they did it on purpose. Some high-end audio devices on the market today support sampling rates of 192,000 Hz and higher, which allows very complex, high-frequency waveforms to be accurately reconstructed. For many purposes, though, 44.1 kHz is quite suitable.
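You can see why an insufficient rate fails with a small numerical experiment (the numbers here are mine, scaled down for illustration). Sampling a 900 Hz tone at only 1,000 Hz – well under the 1,800+ Hz the theorem demands – produces samples identical to those of a 100 Hz tone, a phenomenon called aliasing:

```python
import math

def sample(freq_hz, rate_hz, n):
    """Take n samples of a sine wave of the given frequency at the given rate."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

RATE = 1000  # Hz – deliberately far too low for a 900 Hz tone

high = sample(900, RATE, 8)    # a 900 Hz tone, undersampled
alias = sample(-100, RATE, 8)  # a 100 Hz tone with inverted phase

# The two sample sequences are numerically indistinguishable: from the
# samples alone, a DAC cannot tell the 900 Hz signal from its 100 Hz alias.
print(max(abs(h - a) for h, a in zip(high, alias)))  # effectively 0
```

With a sampling rate above twice the highest frequency present, this ambiguity disappears, which is exactly what the theorem guarantees.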
As we discussed above, the higher the bit-depth, the more precise each sample can be. However, there is a trade-off. We know we want a sample rate of at least 44.1 kHz, which means our processor and sound card have to continuously load/store, encode/decode, and generate samples 44,100 times per second, depending on whether we are recording, playing back, or synthesizing audio (sometimes all at the same time, all while still running applications, the operating system and everything else on the computer). We don’t want to compromise the sampling rate, because the Nyquist-Shannon theorem tells us we need at least about 44,100 Hz in order to represent any wave in the full range of frequencies humans can perceive. This means that the higher the bit-depth, the more work it takes for the computer to do each individual load/store, encode/decode, and generate operation. Computers are fast, but they are only so fast. If we want them to do something 44,100 times per second without interruption, we have to make each operation as light as possible. 16-bit depth seems to be the most common bit-depth for high-quality consumer audio, although some high-end audio equipment supports 24, 32, and even 64-bit encoding. At some point humans simply cannot perceive the difference, though exactly where that threshold lies is hard to pin down.
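The trade-off is easy to quantify with back-of-the-envelope arithmetic (the function name is mine): the raw, uncompressed PCM data rate is simply sampling rate × bit depth × channels.

```python
def data_rate_bits_per_second(sampling_rate_hz, bit_depth, channels):
    """Raw (uncompressed) PCM data rate: every second the system must handle
    sampling_rate_hz samples, each bit_depth bits wide, for each channel."""
    return sampling_rate_hz * bit_depth * channels

# CD-quality stereo: 44.1 kHz, 16-bit, 2 channels
cd = data_rate_bits_per_second(44_100, 16, 2)
print(cd)                    # → 1411200 bits every second
print(round(cd / 8 / 1024))  # → 172 (KiB per second to move and store)

# Doubling the bit depth to 32 doubles the work at the same sampling rate
print(data_rate_bits_per_second(44_100, 32, 2))  # → 2822400
```

Every extra bit of depth is paid for on every one of those 44,100 samples, every second, which is why each operation needs to stay as light as possible.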
One thing to keep in mind is that when the wave is reproduced, the computer – and more specifically the DAC generating the analog signal from the data – has to “fill in the holes” between samples. There are various techniques used to do this, and obviously they work quite well, but these gaps between samples are effectively noise. When we start to do complex things with our audio signal, like amplify it, time-warp it, add reverberation and effects, and so forth, this noise becomes more and more apparent and audible. This is why the high-end equipment mentioned above, used in professional audio studios, is capable of using ridiculously high sample rates and bit-depths. Even though humans cannot perceive the difference in the source signal, it reduces the level of noise that becomes audible later on in production.