Fairbanks, Grant. Experimental Phonetics – T04

Method for Time or Frequency
Compression-Expansion of Speech [1] [2]

Grant Fairbanks, W. L. Everitt, and R. P. Jaeger [3]
University of Illinois, Urbana, Illinois

The purposes of this paper are to outline a method
for compression and expansion of speech, to describe the
device employed in the method, and to demonstrate by
means of recordings the results of the method at this
experimental stage.

Until comparatively recently we were not aware
that several similar approaches to the problem
had previously been made by other experimenters.
We have now learned that our method, although
developed independently, resembles, in certain
features of theory and detail, the earlier work of French
and Zinn, [4] Gabrilovitch, [5] Haase, [6] Gabor, [7] Vilbig, [8] and,
perhaps, others.

Fundamentally, the process depends upon the fact
that the duration of the average speech element or
phoneme of live connected speech, such as ah or s or r,
exceeds the minimum duration necessary for perception
by a listener, or exceeds the minimum time necessary for
sampling the essential phonemic qualities of the speech
element in question. This minimum duration has been
the object of a psychophysical study by Peterson [9] and of
theoretical calculation by Gemelli and Pastori. [10] The
excess duration may be referred to as temporal redundancy,
which term we suggest as a useful specification
at the experimental level when spoken language is in
question.

The dimensions of the problem are clearly not only
those of engineering, but also those of psychophysics. In
this paper we confine ourselves to the method. A psychophysical
program is in progress and its results will
be reported separately.

[Figure 1 panels: orig. | comp. | exp.]

Fig. 1 — Theory of time compression and expansion by sampling.

For purposes of explanation assume two different
phonemes, A and B, which are of equal duration and
joined without interruption as shown (Fig. 1). Assume
that Aʼ and Bʼ are valid samples of A and B, and that
each is of adequate duration for perception. Assume
that samples Aʼ and Bʼ are extracted from A and B and
abutted in time as shown without discontinuity, and that
A − Aʼ and B − Bʼ are discarded. If, now, Aʼ, Bʼ is
reproduced, the time will be shorter than the original A,
B, but the phonemes should be perceptible.

Fig. 2 — Apparatus. [Labeled parts: revolving playback head assembly, record, erase, capstan, idler.]

When this proposition was advanced several years
ago by the first author, it was validated for connected
speech by cutting and splicing magnetic tape at arbitrary
points, without regard to the phonemes. It was discovered
that substantially more than 50 per cent of the
total time of connected speech could be discarded by this
means without destroying intelligibility. That is, A − Aʼ
could exceed Aʼ. At about the same time, Garvey and
Henneman [11] independently used the same cutting-and-splicing
method to compress isolated words and found
similar results.

In the case of expansion, assume that phonemes A
and B are caused to be repeated, as in the middle portion.
If A, A, B, B is reproduced, the time will be longer
and the auditory effect, given the above assumptions,
should be that of prolongation of A and of B.

Finally, assume that A and B are first compressed
to Aʼ and Bʼ, and then expanded to Aʼ, Aʼ, Bʼ, Bʼ as
shown at the bottom. Here the original time for A and
B has been restored. A and B have been reconstructed
from Aʼ and Bʼ.
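The three cases of Fig. 1 can be mimicked with a short Python sketch. It is an illustration under simplifying assumptions, not the authors' procedure: each phoneme is a list of samples, the retained sample Aʼ is taken to be the leading portion of A, and `compress`, `expand`, and `flatten` are hypothetical helper names.

```python
def compress(segments, keep_fraction):
    """Keep the leading part of each segment (A') and discard the rest (A - A')."""
    kept = []
    for seg in segments:
        n_keep = max(1, round(len(seg) * keep_fraction))
        kept.append(seg[:n_keep])
    return kept


def expand(segments, repeats=2):
    """Repeat every segment in place: A, B -> A, A, B, B."""
    out = []
    for seg in segments:
        out.extend([seg] * repeats)
    return out


def flatten(segments):
    """Abut the segments in time without gaps."""
    return [sample for seg in segments for sample in seg]


# Two phonemes of equal duration (8 samples each), joined without interruption.
A = ["a"] * 8
B = ["b"] * 8
original = [A, B]

compressed = compress(original, keep_fraction=0.5)   # A', B'
expanded = expand(original)                          # A, A, B, B
restored = expand(compressed)                        # A', A', B', B'

durations = [len(flatten(s)) for s in (original, compressed, expanded, restored)]
print(durations)  # [16, 8, 32, 16]
```

The last duration equals the first: compressing and then repeating the retained samples restores the original time, as in the bottom portion of the figure.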

Fig. 2 shows a photograph of the essential part of
an experimental model of a device for compression or
expansion along the lines of such a theory. Basically,
the device is a continuous loop magnetic tape recorder,
mounted at the bottom of the rack containing the other
components. The tape loop, approximately 12 feet long,
rises along the right edge of the rack to a pulley under
slight spring tension at the top. Its pathway is shown
by arrows. Entering the device, the tape is directed by
means of rollers over a Magnecord erase head, and then
over a fixed Magnecord record head where the input is
temporarily recorded. Passing over another roller, the
tape then descends to a revolving playback head assembly
enclosed in a mu-metal box, where the signal recorded
on the loop is scanned. Next the tape passes to
the drive capstan, around a roller, and, finally, over a
Brush permanent magnet erase head.

The revolving head assembly consists of a brass drum
with four Brush playback heads equally spaced around
its periphery. The output of the heads is taken off by
means of a slipring-brush unit. The circumferences of
both drum and capstan are 7.64″. Drum and capstan
are mounted on shafts supported in sleeve bearings
at the back of the panel. Massive flywheels are also
mounted on the shafts. The two units are driven by
twin 1/15 hp DC Bodine motors with independent speed
controls by means of GR Variacs. Speeds are measured
with a GR Strobotac.

The remaining components are conventional. An
independent Magnecorder PT6-A is used for storage
and playback. This has been modified for continuously
variable speed reduction and furnishes about a 15 to 1
range of tape velocities.

In Fig. 3 operation of the revolving head assembly
is shown at the left. The four playback heads are identified
by letters. The tape passes over the drum and is in
contact with ¼ of its circumference, or a distance equal
to the peripheral distance between any two adjacent
playback heads. The tape is retained by flanges around
the drum periphery. Tape direction is constantly
counter-clockwise. In the compression application the
direction of drum rotation is also counter-clockwise.
Under load the top tape velocity is approximately 190
in/sec. The top peripheral drum velocity is about 225
in/sec.

For purposes of explanation the tape is divided into
hypothetical numbered segments, each equal to the distance
between heads. The relative positions of tape and
heads are shown at representative times. The diagram
shows 50 per cent time compression as an example.

Fig. 3 — Compression process. [Panels I to V show the relative positions of tape and heads at successive times.]

In Part I segment 1 is shown at t₀, when it first comes
into contact with the drum. At this time it is intercepted
by head A, which is moving in the same direction. If
the drum were stationary, reproduction would be one-for-one.
If its velocity were equal to that of the tape, no signal
would be reproduced. Between times I and II, however,
head A moves through ¼ of a revolution. During the
same interval tape segments 1 and 2 pass the 9 o'clock
point where head A was at t₀. As a result, head A reproduces
segment 1 during that interval. The effective
tape velocity is V_T − V_H, where V_T is the tape velocity
and V_H the peripheral velocity of the heads. In the example diagrammed
V_H equals V_T/2, which equals the effective velocity.
Therefore, the frequencies of segment 1 as reproduced
by head A are divided by 2.
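The velocity relation can be written out as a worked equation. Here V_T is the tape velocity, V_H the peripheral head velocity, and f₀ the original frequency, as in the text. The labels V_eff and f_out are introduced only for this illustration.

```latex
\begin{aligned}
V_{\mathrm{eff}} &= V_T - V_H, \\
\frac{f_{\mathrm{out}}}{f_0} &= \frac{V_{\mathrm{eff}}}{V_T} = 1 - \frac{V_H}{V_T}, \\
V_H = \tfrac{1}{2} V_T \;&\Rightarrow\; V_{\mathrm{eff}} = \tfrac{1}{2} V_T, \qquad f_{\mathrm{out}} = \tfrac{1}{2} f_0 .
\end{aligned}
```

The two limiting cases mentioned above follow directly: V_H = 0 gives one-for-one reproduction, and V_H = V_T gives zero effective velocity and therefore no signal.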

At time II head A is at 6 o'clock and head B is at
9 o'clock, while segment 2 lies between them in contact
with the drum. Head A is about to leave the drum,
while head B is about to begin reproducing segment 3.
Accordingly, although there is no discontinuity, segment
2 is not reproduced by any head. The remaining diagrams
show how the process continues, the odd-numbered
segments being reproduced at reduced frequency
and the even-numbered segments being discarded. It is
evident that various durations of either reproduced or
discarded segments can be realized by varying the absolute
and relative velocities of tape and head, and that
a range of sampling frequencies and compression ratios
can thus be produced.
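The segment bookkeeping can be sketched in Python under idealized assumptions. Tape position is measured in units of one segment (the head spacing). Each head scans for exactly the time it takes to cross the contact arc, and the next head takes over at the instant the previous one leaves. The function name `compression_sweeps` and its parameters are hypothetical.

```python
from fractions import Fraction


def compression_sweeps(v_tape, v_head, n_sweeps):
    """Return the tape span (start, end), in segment units, scanned by each head.

    Tape and heads move in the same direction, with 0 < v_head < v_tape.
    """
    v_tape, v_head = Fraction(v_tape), Fraction(v_head)
    if not 0 < v_head < v_tape:
        raise ValueError("compression requires 0 < v_head < v_tape")
    sweep_time = 1 / v_head                      # time for a head to cross one arc
    advance = v_tape * sweep_time                # tape passing the entry point per sweep
    scanned = (v_tape - v_head) * sweep_time     # tape actually read per sweep
    return [(k * advance, k * advance + scanned) for k in range(n_sweeps)]


# The 50 per cent example: head velocity is half the tape velocity.
sweeps = compression_sweeps(v_tape=2, v_head=1, n_sweeps=4)
reproduced = [int(start) + 1 for start, _ in sweeps]   # 1-based segment numbers
print(reproduced)  # [1, 3, 5, 7]

# Fraction of the tape discarded in each sampling period.
start0, end0 = sweeps[0]
period = sweeps[1][0] - start0
print(float(1 - (end0 - start0) / period))  # 0.5
```

Changing the ratio of `v_head` to `v_tape` changes the share of tape discarded. Changing both velocities together leaves that share fixed and alters only how often a sample is taken.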

The output of the device with respect to time is diagrammed
at the right. Between times I and II, for
example, segment 1 is reproduced by head A in the time
necessary for both segments 1 and 2 to pass a point.
Head B then reproduces segment 3, etc. The final yield
is segments 1, 3, 5, 7. When these segments are stored at
a given speed and then reproduced at an appropriately
higher speed, their original frequencies are restored and
the elapsed time is shortened.

Fig. 4 — Expansion process. [Panels I to V show the relative positions of tape and heads at successive times.]

With respect to duration the odd-numbered segments
are termed sampling intervals; the even-numbered segments
discard intervals. The reciprocal of their summed
durations is the sampling frequency. One hundred times
the discard interval divided by the sum of the two intervals
will be termed the compression percentage. Since
sampling is periodic the ratio applies also to the total
message time, and describes the percentage by which
that total time has been reduced.
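These two definitions translate directly into a small helper. This is a sketch only: the function name is hypothetical, and durations are assumed to be in seconds so that the sampling frequency comes out per second (the source quotes sampling frequencies without units).

```python
def sampling_parameters(sampling_interval, discard_interval):
    """Return (sampling frequency, compression percentage) for one sampling period.

    Both intervals are durations in seconds (an assumption of this sketch).
    """
    if sampling_interval <= 0 or discard_interval < 0:
        raise ValueError("sampling_interval must be positive, discard_interval non-negative")
    period = sampling_interval + discard_interval
    frequency = 1.0 / period                              # reciprocal of the summed durations
    compression_pct = 100.0 * discard_interval / period   # share of time discarded
    return frequency, compression_pct


# Example: 50 ms kept and 50 ms discarded in every period.
freq, pct = sampling_parameters(0.05, 0.05)
print(round(freq, 6), round(pct, 6))  # 10.0 50.0
```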

Assuming that the process results in intelligible speech,
it becomes evident that the processed message may be
transmitted over a system with smaller bandwidth than
originally necessary. The capacity of a conventional
transmission link for handling simultaneous messages
will be a function of the amount of compression, or frequency
division.

Fig. 4 is a similar diagram for expansion. Here the
drum bearing the playback head revolves in a direction
opposite to that of the tape. The illustrative example
shows the condition when these velocities are equal.
The effective velocity is equal to their sum.

At t₀, shown at I, segment 1 is in contact with the
drum between heads A and D. During the next interval
head D, as it moves from 6 o'clock to 9 o'clock, will
reproduce both segments 1 and 2 and then leave the
tape. At that time it will be replaced by head C, which
has moved to the 6 o'clock position to intercept the tape
at the beginning of segment 2, and which will reproduce
segments 2 and 3 during its sweep. The result, shown
at the right, is that between times I and II, while segments
1 and 2 are passing the 6 o'clock point, segments
1, 2, 2, 3 are reproduced. The rest of the figure shows
how this process continues.

Since the effective tape velocity has been increased
by the opposite movement of head and tape, frequency
multiplication has been incurred. The original frequencies
are restored by reproducing the processed message
in an appropriately longer time. One hundred times the
amount of time thus added divided by the original time
is the expansion percentage. In the diagram this equals
100 per cent.
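The expansion case can be sketched in Python with the same idealized bookkeeping. Tape position is in units of one segment, each head scans for the time it takes to cross the contact arc, and the next head takes over when the previous one leaves. The function name `expansion_sweeps` is hypothetical.

```python
from fractions import Fraction


def expansion_sweeps(v_tape, v_head, n_sweeps):
    """Return the tape span (start, end), in segment units, scanned by each head
    when the drum turns against the tape (v_head is the head speed, > 0)."""
    v_tape, v_head = Fraction(v_tape), Fraction(v_head)
    if v_tape <= 0 or v_head <= 0:
        raise ValueError("both velocities must be positive")
    sweep_time = 1 / v_head                    # time for a head to cross one arc
    advance = v_tape * sweep_time              # new tape arriving per sweep
    scanned = (v_tape + v_head) * sweep_time   # tape read per sweep
    return [(k * advance, k * advance + scanned) for k in range(n_sweeps)]


# The diagrammed example: head and tape velocities are equal.
sweeps = expansion_sweeps(v_tape=1, v_head=1, n_sweeps=3)

# List the 1-based segments each head reads (valid here because the spans
# fall on whole-segment boundaries).
output = [seg for start, end in sweeps for seg in range(int(start) + 1, int(end) + 1)]
print(output)  # [1, 2, 2, 3, 3, 4]

# Expansion percentage: 100 * (time added) / (original time).
start0, end0 = sweeps[0]
advance = sweeps[1][0] - start0
print(float(100 * ((end0 - start0) - advance) / advance))  # 100.0
```

The first four entries of the output, 1, 2, 2, 3, are the segments reproduced between times I and II in the figure.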

Fig. 5 summarizes the various stages in compression.
The comparative times and frequencies are indicated
at the bottom. In an original time T₀, and with original
frequencies f₀, the input is recorded on the loop at the
velocity V_T and scanned by the revolving head unit
moving in a positive direction at V_T R_C. This yields the
compressed frequency f_C, shown at the bottom. Simultaneously
the compressed signal is stored at a recording
tape velocity which will be taken as V_R. This recording
is reproduced at a later time at the higher tape velocity
shown, in the relative time indicated, and with f₀ restored.
The following recordings will illustrate this.

Fig. 5 — Method of time compression.
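The comparative times and frequencies described for Fig. 5 can be restated as a worked equation. The figure's own labels are not reproduced in the text, so the playback velocity and final time below are inferred from the stated definitions rather than quoted from the figure. V_T is the loop tape velocity, V_R the storage recording velocity, and R_C is taken to be the ratio of head velocity to tape velocity. V_play is a label introduced here.

```latex
\begin{aligned}
V_H &= V_T R_C, \qquad 0 \le R_C < 1, \\
f_C &= f_0\,\frac{V_T - V_H}{V_T} = f_0\,(1 - R_C), \\
V_{\mathrm{play}} &= \frac{V_R}{1 - R_C}
  \;\Rightarrow\; f = \frac{f_C}{1 - R_C} = f_0, \qquad
  T = T_0\,(1 - R_C).
\end{aligned}
```

As a check against Recording 1, a time compression of 20 per cent (R_C = 0.2) gives a frequency division of 1/(1 − 0.2) = 1.25, the value announced there.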

In this and the other recordings you will “hear”
repetitions of a semi-nonsense test sentence which provides
a rigorous test of the system. The sentence contains
one and only one example of every American phoneme,
with the exception of the unstressed neutral vowel, as
in the first syllable of the word away, which occurs three
times.

“Recording 1. Compression. Original message: We hasten
the boy off my garage path to show which edge young owls
could view. Frequency division 1.25. No time compression.
Sampling frequency 10: (sentence). Time compression
20%: (sentence). Test out.”

Next you will “hear” the perceptual effects of various
degrees of compression.

“Recording 2. Time compression series. Sampling frequency
10. Compression 10%: (sentence). Compression 30%:
(sentence). Compression 50%: (sentence). Test out.”

“Recording 3. Time compression series. Sampling frequency
20. Compression 50%: (sentence). Compression 70%:
(sentence). Compression 90%: (sentence). Test out.”

You will have noted that the smaller values of compression
affect intelligibility and perceived speed of talking
very little. Although both factors are perceptibly
affected as compression is increased, you can observe
that intelligibility persists with surprisingly large compression
percentages.

Fig. 6 — Method of time expansion.

Fig. 6 is a similar diagram for speech expansion.
Head movement is negative with respect to the tape, and
its velocity equals V_T R_E. In the original time the original frequencies
are multiplied by 1 plus R_E, yielding f_E, as stored.
The message is then reproduced at the lower velocity
shown, f₀ being restored with the time expansion. The
next recording illustrates the three stages.

“Recording 4. Expansion. Original message: (sentence).
Frequency multiplication 1.2. No time expansion. Sampling
frequency 10: (sentence). Time expansion 20%: (sentence).
Test out.”
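The expansion stages of Fig. 6 can be written out in the same way. As before, the playback velocity and final time are inferred from the definitions in the text, not read from the figure. The storage velocity V_R is assumed by analogy with the compression case, and V_play is a label introduced here.

```latex
\begin{aligned}
V_H &= V_T R_E \quad \text{(opposite to the tape)}, \\
f_E &= f_0\,\frac{V_T + V_H}{V_T} = f_0\,(1 + R_E), \\
V_{\mathrm{play}} &= \frac{V_R}{1 + R_E}
  \;\Rightarrow\; f = \frac{f_E}{1 + R_E} = f_0, \qquad
  T = T_0\,(1 + R_E).
\end{aligned}
```

With R_E = 0.2 this gives a frequency multiplication of 1.2 and a time expansion of 20 per cent, matching the values announced in Recording 4.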

We will now illustrate the perceptual effects of expansion.
The expansion percentage will be progressively
increased.

“Recording 5. Expansion series. Sampling frequency 10.
Expansion 10%: (sentence). Expansion 30%: (sentence).
Expansion 50%: (sentence). Test out.”

“Recording 6. Expansion series. Sampling frequency 33.3.
Expansion 50%: (sentence). Expansion 70%: (sentence).
Expansion 90%: (sentence). Test out.”

Note that small percentages did not affect the perceived
speed of talking very much, and that the details
of speech became more readily heard as expansion increased.
Toward the end you may have heard an echolike
sound. This occurs when the interval repeated
exceeds the duration of one phoneme. This is a size
limitation in our experimental model and not a limitation
of the method.

Fig. 7 shows a system which involves the following:
(1) compression, (2) transmission of the compressed
message, (3) expansion of the compressed message. The
steps are carried on simultaneously with two units. A
transmission link, undiagrammed, is inserted between
the two at the arrow. Velocities, times and frequencies
are labeled.

Fig. 7 — Method of frequency compression — transmission —
expansion. [Labeled parts: record amp., playback amp., in, out.]

The process is illustrated in the next recordings. First
you will hear the original message. Then you will hear
the transmitted message with frequency division. Finally
you will hear the message as received after reconstruction
by means of expansion and corresponding frequency
multiplication. Eighty per cent of the message was discarded
before transmission and the final message as
you hear it was reconstructed from the 20 per cent fragment
that remained. To help you appreciate the last
point we will also “play” at the end a recording in
which the original frequencies are restored by accelerated
playback without time expansion.

We present this next recording with some hesitation
and we hope you will not be disappointed. It was made
on an experimental model of the device. Its main purpose
is to validate the theory and demonstrate potential
feasibility. (You will “hear” considerable noise and distortion.
Some of this can be eliminated fairly readily,
but part of it is inherent in the method and will need
to be counteracted.)

The important thing, however, is that the final output
is intelligible at all when bandwidth reduction is by
a factor of 5 and compression is 80 per cent.

“Recording 7. Compression—transmission—expansion. Original
message: (sentence). Transmitted message. Original
time. Frequency division 5. Sampling frequency 60: (sentence).
Restored message. Original time. Frequency multiplication
5. Sampling frequency 16: (sentence). Time
compression 80%: (sentence). Test out.”
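The figures quoted for this demonstration can be checked with a short Python sketch. It assumes the idealized relations used earlier in the paper: compression scales frequencies by 1 − R_C and expansion by 1 + R_E. The function names are hypothetical, and the expansion percentage it returns is derived here, not stated in the source, which gives only the frequency multiplication of 5.

```python
from fractions import Fraction


def bandwidth_factor(compression_pct):
    """Frequency division produced by a given compression percentage (0 to below 100)."""
    r_c = Fraction(compression_pct) / 100
    if not 0 <= r_c < 1:
        raise ValueError("compression_pct must be in [0, 100)")
    return 1 / (1 - r_c)


def restoring_expansion_pct(compression_pct):
    """Expansion percentage whose frequency multiplication undoes the division."""
    # Requires (1 - R_C) * (1 + R_E) = 1, hence R_E = R_C / (1 - R_C).
    r_c = Fraction(compression_pct) / 100
    if not 0 <= r_c < 1:
        raise ValueError("compression_pct must be in [0, 100)")
    return 100 * r_c / (1 - r_c)


print(bandwidth_factor(80))          # 5
print(restoring_expansion_pct(80))   # 400
```

Under these assumptions, discarding 80 per cent of the message divides the frequencies by 5. The receiving unit must then multiply them by 5, which corresponds to its heads moving against the tape at four times the tape velocity.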

Apart from its theoretical interest, the method appears
to have several practical applications. For one
thing, the smaller compression and expansion ratios
should be useful in the programming of rebroadcast
speeches in radio, since they furnish “tailor-made” time
without the audience's knowledge. A saving of 10 minutes
per hour is completely realistic. Conversely, and we
advance this suggestion with diffidence, thinking of
commercials, more intelligence can be communicated to
an audience in a given amount of time.

Straightforward compression by larger amounts should
be useful wherever high-speed communication is crucial,
as in certain military situations. Expansion should facilitate
branches of study such as experimental phonetics
and linguistics where auditory analysis is important.

Finally, of course, the method gives promise as an approach
to the long-standing problem of bandwidth reduction.

In conclusion we should like to “play” two more
recordings. The first of these is self-explanatory.

“Recording 8. In order to demonstrate that the method is
inherently practical, the recorded explanatory materials in
connection with the recordings that you have heard today,
as well as these words that you are hearing now, were all
compressed by 10%. Test out.”

We are frequently asked if the method applies to
women's voices, fast articulation, or music. The next
recording illustrates its use with all three.

“Recording 9. Vocal music. Rosemary Clooney. Come On-a
My House. Columbia Record Number 39467. 78 rpm, shellac.
Compression series. No time compression: (music).
Compression 30%. Sampling frequency 20: (music). Compression
60%. Sampling frequency 40: (music).”

The final recording shows the effect upon music
which has already been processed to make it ultra-fast.
In the first section you will hear a portion of the original.
The second section demonstrates 30 per cent compression.

“Recording 10. Compression series. Les Paul. Lover, by
Rodgers and Hart. From Capitol LP Record Number H226,
The New Sound. No time compression: (music). Compression
30%. Sampling frequency 20: (music). Test out.”

[1] Revised manuscript received October 30, 1953. This is the
substantial equivalent of a paper presented at the 1953 National
Convention of the I.R.E. in New York City, reprinted with
minor changes from the Convention Record, Part 8, Information
Theory, pp. 120-124. Because this paper included a demonstration,
references to the latter have been retained in the published
form.

[2] Reprinted from Transactions of the I.R.E., Professional Group on Audio, 1954, AU-2, pp. 7-12.

[3] Now at Bell Telephone Laboratories.

[4] Cited in Gabor (note 7).

[5] See note 4.

[6] See note 4.

[7] Gabor, D., Jour. Inst. Elec. Eng., vol. 93, Part III, pp. 429-457;
1946; vol. 94, Part III, pp. 369-386; 1947; vol. 95, Part
III, p. 39; 1948; vol. 95, Part III, pp. 411-412; 1948.

[8] Vilbig, F., Jour. Acous. Soc. Amer., vol. 22, pp. 754-761;
1950; vol. 24, pp. 33-39; 1952.

[9] Peterson, G. E., Ph.D. Dissertation, Louisiana State University;
1939.

[10] Gemelli, A. and Pastori, G., “L'Analisi Elettroacustica del
Linguaggio,” Milan, Italy, pp. 149-162; 1934.

[11] Garvey, W. D. and Henneman, R. H., Air Force Tech. Rep.
#5917; 1950.