
Obtaining data -- Randomness tests -- Extraction and correction

Random numbers are useful for many things, including numerical integration and cryptography. The simplest way of obtaining suitable numbers is a pseudo-random-number generator – most operating systems provide one, and the numerics literature documents a range of algorithms.

But genuine non-deterministic random numbers are surprisingly easy to obtain too – look no further than your sound card. The analog input of today's typical sound card is digitised with enough bits that the least significant bits are just noise. If you have an audio application with a level meter (VU meter), you can observe this – the fluctuating input level displayed when you supply no signal input is the noise level. A VU meter usually displays the signal level (i.e., the squared amplitude) in dB. This means that 6 dB correspond to one (amplitude) bit.

[Aside: This page describes a way to obtain random numbers from audio inputs without administrator privileges. To increase the random number output of your own box, there are a number of ready-made Linux packages that provide true randomness via the kernel's random device: haveged uses the internal state of modern processors, the audio entropy daemon uses audio noise, and the video entropy daemon noise from a camera.]

To get at the noise bits, you simply record from the sound card in question.
It is not necessary to connect a signal source, though it should not do any
harm either. How you record sound will depend on your operating system and
personal preference. On my Linux box, I use the command-line program
`arecord`.

To obtain the noisy least significant bits, you have to record with the maximum bit width which the sound card provides. Otherwise the least significant bits will be cut off. If in doubt, record with 32 bits and look at the data afterwards to see which bits vary. For some sound cards, the digitised bit width is less than the nominal maximum bit width – for example, the Intel HDA implementation in the ICH7 chipset provides 20 significant bits but up to 32 can be recorded (with the least significant always being zero).

To be able to process the recorded data further, you should also have some basic knowledge of the audio format you record in. The format should store the samples from the sound card without modification, which rules out non-linear sample encodings such as μ-law, which some formats support, let alone lossy compression such as MP3. In the following I will assume that your format stores each sample value in a whole number of bytes, as for example the common WAVE file format does. You also need to know the header size of your format in order to skip it. A hexadecimal editor can be very useful for locating the bytes containing the noise bits.

Before using your audio noise bits for anything serious, you will want to be reasonably sure that they are indeed random. This is easier said than done, as there is no such thing as a "proof of randomness": a number series containing a long stretch of constant numbers may result from a random process, while one that never contains such stretches is definitely not random. Testing for randomness amounts to testing for those properties of random numbers that the methods requiring them rely on. If you have a specific application in mind, you may have a criterion that is especially important to you, and should test for it. I have programmed a few general-purpose tests that you can download in this archive.

All the test programs below share the same syntax:

`<program> <file> <skip>[.<bit>] <stride> [<channels>]`

The first argument is the input file, the second gives the number of header
bytes to skip and the bit to use. `<stride>` is the offset in bytes between
successive values to extract bits from. The last, optional argument allows
processing interleaved series of data such as stereo recordings.

This test, implemented in `bit0stat.c`, counts the number of samples in which
a given bit (bit 0 by default) is 0 or 1. It outputs the balance of the two
counts: their difference divided by their sum.

The result is a number between -1 and 1; a positive result means an excess of 1 bits, and vice versa. A zero result would indicate a perfectly uniform distribution.

Interestingly, the bit balance seems to serve as an indicator of sound card quality too. My good sound card from Echoaudio has half the imbalance (on average) of the Intel HDA sound on my mainboard.

The test program `5bitstat.c` outputs a histogram of values made up of the
lower 5 bits of each sample by default. If the `<bit>` argument is given,
bits 0 to `<bit>`-1 are used. The program also outputs the quotient of the
RMS deviation and mean *of the histogram* (**not** the values) as a measure
of how flat the histogram is.

The results of this test help decide whether you can use values composed of the noise bits as they are. (That the default number of bits is 5 is just because that is the number of noise bits of one of my sound cards, and this can be changed in the source code.)

The next test program, `bitcorr.c`, computes the autocorrelation of the bit
stream consisting of a given bit of each sample word. It outputs the
autocorrelation values up to an offset of 1024 (can be changed in the source
code). A value of 1 indicates perfect correlation (repetitive pattern), a
value of -1 perfect anticorrelation. The RMS and the absolute minimum and
maximum of the correlation values are also output for a quick overview.

This and the following test make use of the ability of the human eye to spot
patterns in images. The program `bits2points.c` composes bits from the audio
file into coordinates in an image and marks the corresponding points. Points
which are hit once are coloured light grey, twice dark grey, and three or more
times black.

This test uses only a predefined share of the sample values, which results from the image size and a density defined in the source code. It outputs the offset of the first unused value in the audio file to allow continuing there.

This test does not provide a numerical result but requires you to decide
whether the resulting image looks random. If you want a supplementary
numerical measure, you can use an image processing program to compute the image
histogram. (For example, use `identify -verbose` from the ImageMagick
package.) Contrary to naive intuition, the shares of points hit twice or
more times are not powers of the share of points hit once. Rather, these
fractions (probabilities) are:

p_{1} = n (1/N) ((N-1)/N)^{(n-1)}
p_{2} = (n(n-1)/2) (1/N)^{2} ((N-1)/N)^{(n-2)}

where N is the number of pixels in the image (by default 2^{20}) and n the
number of coordinate pairs extracted (by default N/64). The key to this
counting problem is to start from the coordinate pairs, which are statistically
independent and therefore have to be treated as drawings with repetition. The
expected proportion of points being hit once is the same as the probability of
hitting a given, fixed point once. For p_{1} we need exactly one coordinate
pair to hit that point, which happens with probability 1/N, and all the other
(n-1) to avoid it, which gives ((N-1)/N)^{(n-1)}. For p_{2} two coordinate
pairs have to match, and the other (n-2) have to miss. It remains to explain
the binomial coefficients in front (n for p_{1}). They come from a kind of
drawing without repetition: not of coordinate pairs, but of the positions in
the sequence at which the matching pair(s) occur. The matching coordinates
may occur at any point in the sequence of n random coordinate pairs, so the
probabilities have to be multiplied by the corresponding multiplicities.

Like the previous test, the program `bits2bitmap.c` converts the noise bits
to an image in which one can try to see regularities. Each bit represents one
black or white pixel. Only a fixed number of sample values are used, as needed
for the image size. The offset of the first unused value is output to the
terminal.

My package of programs also contains a program `getbit.c`
which extracts a given bit from every audio sample and writes the resulting
bits to a file. Its command-line syntax is the same as for the test programs.

If your sound card noise has a statistical bias, that can be corrected. One option is to use the same methods as for turning uniformly distributed numbers into numbers with a different distribution, such as selection or function application. Both require modelling the bias.

An easier way is to combine the random numbers with the output of a
pseudo-random number generator, for example with a bitwise exclusive-or
operation. The result retains both the statistical properties of the
pseudo-random numbers and the non-determinism of the genuine random numbers,
the best of both worlds. Combining genuine with pseudo-random numbers is free
of the dangers of combining the output of two pseudo-random-number generators
(the result may end up less random than either), so you might apply it as a
precautionary measure in any case. The program `xor.c` in my source package
can be used to do that when you have written the pseudo-random numbers to a
file.