Essay - "Digital Audio: What Does This
Term Really Mean?" - February, 1998
By John Busenitz
If you're like me, you are probably numb to the word
"digital". It rates right up there with terms like "Generation X",
"paparazzi", and "au pair". "Digital" is used to denote the
messiah as well as satan, and is inadequate for both. There is no such thing as
"digital-ready". In fact, "digital", which loosely means "based
on numbers" doesn't really describe what it is meant to. Rather, it actually
describes three things: discrete-time, quantization, and binary-based manipulation and
Yeah, I apologize for sounding like the dorky engineering person I am, but this technobabble most accurately describes what is happening. And I'm not so dorky. It's not like I play Internet MUDs on Friday nights and have every Weird Al recording. But enough about me. Basically, what is happening with digital is chopping stuff up into pieces. The music, that is. "Digital" is all about taking something that is so detailed and big, like actual music, or the microphone feed, and approximating it with numbers. Which means it is not "perfect sound forever," despite what the marketing people who coined this cute but wrong phrase might have you believe. The engineers (to toot my own horn) always knew better.
How is this significant? Well, as you are all probably subconsciously aware, it means smaller, cheaper, and some other things we will get into later. Is this vastly different from analog, like a vinyl record? No, not really, because they are all a limited picture of the thing they represent, which is the musical signal.
Think of an image on a TV screen or a monitor. If you look at it from far away, it appears continuous, pretty much like one of your family album snapshots. As you get closer, you can see the picture is made of dots, i.e., it is quantized, just like space-time, and is composed of little parts, or quantities. Furthermore, it's quantized horizontally as dots along each line (the "horizontal resolution") and vertically as lines (the 525 horizontal scanning lines, the "vertical resolution"). Of course, you could call the horizontal scanning lines big rectangular dots if you want, but the point is, the image on our regular NTSC TVs is quantized in the X and Y direction, and is therefore, already digital! If we really want to get philosophical, electrical current is the passing of discrete electrons, nerves in our ears and brain are passing discrete impulses, and sound waves are individual oxygen and nitrogen molecules striking our eardrums, so they are all discrete-time as well. But this essay is not for the purpose of starting flame wars.
For CDs, instead of being quantized in 2 dimensions of space (horizontal and vertical), the signal is quantized in time and also level (voltage, i.e., volume). The actual sound of the music, after passing through the microphones, is recorded, or sampled, at precise instants in time. As is more or less intuitive, the more often things are sampled, the better they can be represented (and decoded into a reasonable facsimile of the original signal). According to theory, we have to sample at more than twice the highest frequency we wish to record. This was shown by Nyquist several decades ago, and the rule is termed the Nyquist Criterion. The idea is that a sine wave can be totally reproduced using just over two samples for every period (one complete sine wave). So now we have to make sure that we sample at a rate high enough to record everything we can hear. For current CDs, the samples are made 44,100 times each second (44.1 kHz), although the original digital tape recording master may be at a higher sampling rate.
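The arithmetic above can be sketched in a few lines. This is a minimal illustration using the CD numbers from the essay; the function name is mine, invented purely for the demonstration.

```python
# Sketch of the Nyquist criterion using CD parameters.
fs = 44_100  # CD sampling rate, in samples per second

def samples_per_period(freq_hz, rate_hz=fs):
    """How many samples land within one period of a sine at freq_hz."""
    return rate_hz / freq_hz

# A 20 kHz tone (the top of the audible range) still gets more than
# two samples per period, so in principle it can be reconstructed:
assert samples_per_period(20_000) > 2   # 44,100 / 20,000 = 2.205
# A 30 kHz tone gets fewer than two, so it must be filtered out
# before sampling, or it will cause trouble:
assert samples_per_period(30_000) < 2   # 44,100 / 30,000 = 1.47
```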
If the actual music signal changes faster than it is being sampled, there are problems. What results is called "aliasing", the original information going under a new name. Like wagon spokes in a western which appear to turn backwards, this aliasing is the high frequency (quickly-changing) information that is now folded backwards. Not in time, but in frequency. Obviously, this messes things all up. So what we have to do is filter the musical signal so that there won't be any information passed through that is of a higher frequency than half of the sampling frequency. For CDs, this means removing (filtering) everything above 22.05 kHz (half of 44.1 kHz).
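The folding can be shown directly. In this sketch (my own numbers, chosen for illustration), a 30 kHz tone sampled at the CD rate produces exactly the same sample values as a 14.1 kHz tone, which is why the high tone must be filtered out before the converter ever sees it:

```python
import math

fs = 44_100            # CD sampling rate
f_high = 30_000        # a tone above the 22.05 kHz limit
f_alias = fs - f_high  # 14,100 Hz: where that energy folds back to

# Sampled at the CD rate, the two tones give identical sample values,
# so once sampled they cannot be told apart - that is aliasing.
for n in range(8):
    hi = math.cos(2 * math.pi * f_high * n / fs)
    lo = math.cos(2 * math.pi * f_alias * n / fs)
    assert abs(hi - lo) < 1e-9
```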
Now that I have discussed the time quantization in CDs, let's talk about the other dimension of quantization, namely, level, or amplitude. The sampling rate describes the "chopping up" of a signal in terms of time, but there's also the amplitude at each of those instants to record. As you might expect, more quantization intervals are better, but a huge number of them are not necessary, and take up a lot of space on the disc. And, since the actual value (voltage) has to be rounded up or down to the nearest value (represented by one of the 65,536 values that 16-bit "words" provide in CDs), there is some amount of error, called quantization error. The way to take care of that is to add a little bit of noise (not the music that teenagers listen to today, but sort of a hissy, rushing-water sound a bit like what one can hear between FM stations). This randomizes the errors so they aren't as noticeable, in part because noise is tuned out by the ear. This sort of noise is called "dither", a term which I believe originates with that crazy Dagwood's mean ol' boss. Anyway, dither decorrelates the quantization noise from the musical signal, which makes it less audible.
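To make the rounding concrete, here is a toy sketch of 16-bit quantization with and without dither. The function names and the triangular-ish noise recipe are mine, for illustration only; real mastering dither is more carefully shaped.

```python
import random

random.seed(0)           # repeatable "noise" for the demo
STEP = 2.0 / 2 ** 16     # quantization step for a -1..+1 signal, 65,536 levels

def quantize(x):
    """Round a sample to the nearest 16-bit level (no dither)."""
    return round(x / STEP) * STEP

def quantize_dithered(x):
    """Add roughly one step of random noise before rounding (dither)."""
    noise = (random.random() - random.random()) * STEP
    return round((x + noise) / STEP) * STEP

# Plain quantization error is always within half a step of the input...
x = 0.123456789
assert abs(quantize(x) - x) <= STEP / 2
# ...but it is a fixed function of the signal, hence correlated with it.
# Dither randomizes the error from sample to sample, trading a touch
# more noise for distortion the ear no longer tracks with the music.
```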
All of what I just described - high-frequency limiting and finite, discrete levels - is basically what happens with all of our ways of recording and playing music, such as vinyl records, tapes, DAT, and so on. The methodology is different, but the fundamentals are similar. With vinyl, the cutter head and the stylus can't move faster than a certain speed, so there is a high-frequency limit. And they can't move with perfect precision and accuracy (and vinyl can't record such, anyway), so there is noise and inaccuracy in the amplitude. This means that, just like with "digital", there is a limit to how small a change in amplitude is realizable, because noise obscures anything smaller than itself.
So the limits we end up with for CD digital audio are a bandwidth of 22.05 kHz and a signal-to-noise ratio (SNR) of 96 dB (this means that the music can be, at most, 96 dB louder than the noise). These are the absolute best, theoretical limitations. In the beginning of digital audio, they were unobtainable, but with modern technology (such as oversampling, noise shaping, etc.) we can better push the envelope. The question at hand is, are these limitations good enough? Or do we need to increase bandwidth and SNR?
In this author's opinion, the answer to both questions is yes AND no. No, 16 bits are not quite adequate under the quietest listening conditions. Our sensitivity in the midrange is just a little beyond that. However, nobody listens under the quietest listening conditions. And noise shaping, which increases resolution in the midrange where we are more sensitive to sounds, at the cost of less resolution at higher frequencies where we don't hear as well, pretty much solves the problem. Precious little music could even take advantage of it. And, more importantly, the industry is not using the current 16-bit standard to its incredible potential. So why in the world could we expect them to do better with an even more sophisticated standard?
These reasons are enough for me to think that we should stick with the same word length (number of bits) and use any extra information capacity to work on the Achilles' heel of current stereo, which is spatial reproduction. This is not to say, however, that a longer word length shouldn't be used in the recording and processing stages. During those stages, one cannot be sure what the loudest level will be, so the engineers need a lot of headroom (unused bits) to make sure there won't be any clipping. And digital signal processing can make use of a longer word length during the mathematical computations, to make sure that resolution doesn't drop below 16 bits when something is digitally attenuated, among other operations. But, once the final result is in hand, it will surely have less than 96 dB of dynamic range, which fits nicely into a 16-bit system. Another issue is that very few, if any, electronics can support 24 bits of resolution at current consumer signal voltages. The inherent noise floor of the electrical components themselves (resistors and semiconductors) is too high to allow such high signal/noise ratios. And most 24-bit D/A converters are 24-bit in name only, not in performance. What is the point in such high medium resolution if the electronics can't support it?
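Why processing wants extra bits can be shown with a toy example of integer samples. This is a deliberately crude sketch (my own, using bit shifts to stand in for attenuation and gain): attenuate a sample, store it at the original word length, then boost it back, and the low bits are gone unless the intermediate word was longer.

```python
def attenuate_then_restore(sample, shift):
    """Attenuate by 2**shift, store at the original word length, boost back."""
    stored = sample >> shift   # truncation: the low bits are simply gone
    return stored << shift

s = 12_345                     # some 16-bit sample value
assert attenuate_then_restore(s, 4) == 12_336   # not 12,345: resolution lost
# With a longer intermediate word, there is room to keep those bits:
wide = s << 8                  # pretend 24-bit processing word
assert ((wide >> 4) << 4) >> 8 == s             # the original survives intact
```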
As for increasing the sampling rate, there are similar arguments. It is the bandwidth of the whole signal chain that matters, not just the recording medium. What combination of microphones, preamplifiers, power amplifiers, and speakers has a flat bandwidth out to 20 kHz? Precious few. This really puts a damper on increasing bandwidth even if we assumed that people can hear higher than 20 kHz. And there are no reliable studies that support such assumptions.
There is, though, a bit of reasoning for higher sampling rates. In order not to cause problems (aliasing and imaging), the signal going into the A/D converter and coming out of the D/A converter must be low-pass filtered (the higher-frequency portion of the signal attenuated). To make sure nothing above half the sampling frequency is present, a filter with a very sharp slope must be used. This causes problems like ripple in the frequency response and wacky phase behavior at the higher frequencies. If we moved the sampling frequency up, those problems would land well above the audible range. However, a technique called "oversampling" accomplishes much the same thing. On playback, oversampling inserts computed ("fake") samples in between the real samples, raising the effective sampling rate; a digital filter removes the unwanted byproducts, and what remains sits too high in frequency to hear. And with the higher effective rate, Ta-daaa! The analog filter can be a gentle one that doesn't cause audible coloration. However, there is still something to be said for a medium that is flat to just beyond the upper frequency limit of audibility, so that it doesn't add to the cumulative high-frequency rolloff of the system as a whole. If every component is a few decibels down at 20 kHz, the losses all add up and can become audible.
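The "fake samples" idea above can be sketched in a few lines. Real players use a proper digital interpolation filter rather than the straight-line interpolation I use here; this is only my illustration of the principle that the output rate ends up four times higher, letting the analog filter after the DAC be far gentler.

```python
def oversample_4x(samples):
    """Insert three computed ("fake") samples between each real pair.

    Plain linear interpolation stands in for the digital filter a real
    player would use; the point is simply that the output sample rate
    is 4x that of the input.
    """
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(4):
            out.append(a + (b - a) * k / 4)
    out.append(samples[-1])
    return out

assert oversample_4x([0.0, 1.0]) == [0.0, 0.25, 0.5, 0.75, 1.0]
```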
There are a number of well-recorded CDs that illustrate the potential of CD sound. The PGM discs, made by the late Gabe Wiener, are especially representative of what can happen when the recording engineer actually knows what he is doing. I own "The Buxtehude Project" and the Ricercar harpsichord album, which exhibit incredibly detailed and pure sonics. There are sure to be other recordings of similar quality, which further proves that the problem is not with CD technology but with its implementation - not to mention the abundance of poor recordings that give the current technology a bad name.
You can tell that I am somewhat skeptical of blindly increasing the sampling rate and word length. Not because doing such is bad in and of itself, but because it will require a compromise in other, more important, areas, like the number of channels and the amount of music. I would rather have better spatial reproduction and/or more music than ostensibly "improved" sonics that are really more hype than reality. With noise shaping and oversampling, we can solve any problems that are even hinted at with straight 16-bit flat dither 44.1 kHz sampling CD digital audio. In any case, DVD can theoretically support a multitude of sampling rates and word lengths, so everyone can be satisfied . . . theoretically. At any rate, I've no doubt that many of you may disagree on these issues, and I welcome the fuel for future discussion.
© Copyright 1998, Secrets of Home Theater & High Fidelity