Fairbanks, Grant. Experimental Phonetics – T08

Word Intelligibility as a Function of Time Compression **

Grant Fairbanks and Frank Kodman, Jr. *
Speech Research Laboratory, University of Illinois, Urbana, Illinois
(Received January 28, 1957)

An experiment is described in which words were automatically compressed in duration and presented to
observers for identification. The effects of time compression and of time sampling are assessed, and compared
with those of periodic interruption. The results of an analogous nonauditory study of the effects of phonemic
sampling are presented.

The purpose of the experiment reported here was
to explore the effect of time compression of
speech upon intelligibility. The method of compression
has been described by Fairbanks, Everitt, and Jaeger, [1]
and permits display of speech at controllable speeds
without essential alteration of the internal signal
frequencies. The experiment consisted of compressing
the durations of words and presenting them to observers
in a conventional identification situation.

Related work has been reported by Garvey [2] and
Garvey and Henneman, [3] who sampled tape recordings
manually. A number of years ago Fletcher [4] presented
a curve showing the effect of speeded playback of
recordings upon intelligibility, although the temporal
factor was undoubtedly obscured by the accompanying
frequency multiplication. Miller and Licklider's investigation
of interrupted speech [5] is pertinent to the
problem in that frequency of sampling and ratio of
on-time to total time were controlled. The time compression
technique has been used in a group of studies
of factual comprehension by Fairbanks, Guttman, and
Miron. [6]

The process of compression [1] involves two stages,
both of which are used in the time compression application.
In the first stage time samples of the input
signal are extracted periodically, compressed (divided)
in frequency, abutted in time, and stored. Time
compression is accomplished in the second stage by
reproducing the stored samples at a speed appropriate
for restoration of the input frequencies. The unit, or
cycle, of time sampling in the first stage is termed the
sampling period, T_s, and equals the sum of the durations
of the sampling interval, I_s, and the discard interval, I_d.
The sampling frequency, f_s, the frequency at which the
sampling process intercepts the input signal, is 1/T_s.
In these terms the time compression ratio is defined as

R_c = I_d / T_s = I_d f_s.

R_c and I_d may be specified independently with the
apparatus, and the experiment involved representative
combinations of the two. In the second stage, as the
signal is displayed in shorter time, f_s is transposed
upward. This higher frequency is termed the interruption
frequency, f_i, and the relations are

f_i = 1/I_s = f_s/(1-R_c).

The interruption frequency has auditory significance,
and constituted a third variable.
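The timing relations above can be illustrated with a minimal digital sketch. It assumes a sampled waveform in place of the tape apparatus, and it simply keeps I_s out of every T_s and abuts the kept pieces; it does not reproduce the fading between adjacent samples or the two-stage frequency division and speeded reproduction of the original device. The function name `compress_time` and the parameter `rate` are hypothetical.

```python
import numpy as np

def compress_time(signal, rate, discard_interval, compression_ratio):
    """Keep I_s of every sampling period T_s and abut the kept pieces.

    Assumption: hard cuts between pieces (no cross-fade), unlike the apparatus.
    """
    if not 0.0 < compression_ratio < 1.0:
        raise ValueError("compression_ratio must lie strictly between 0 and 1")
    period = discard_interval / compression_ratio   # T_s = I_d / R_c
    keep = period - discard_interval                # I_s = T_s - I_d
    n_period = int(round(period * rate))            # samples per sampling period
    n_keep = int(round(keep * rate))                # samples kept per period
    pieces = [signal[start:start + n_keep]
              for start in range(0, len(signal), n_period)]
    return np.concatenate(pieces)

# A 0.6-sec stand-in "word" at 8000 samples/sec, I_d = 0.04 sec, R_c = 0.8.
rate = 8000
word = np.sin(2 * np.pi * 200 * np.arange(4800) / rate)
short = compress_time(word, rate, 0.04, 0.8)
print(len(short) / len(word))  # 0.2 for these values
```

Because the kept pieces are played at the original sample rate, the internal frequencies are unchanged and only the duration shrinks to the fraction 1 - R_c.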

Procedure

The basic stimulus material was a tape recording of
one of the well-known lists of 50 phonetically balanced
monosyllables, [7] as read by a native speaker of the
General American dialect, a man with considerable
speaking experience. The list was read at conversational
level, deliberately and accurately, but without over-articulation.
Duration, articulation and general effort
were considered typical for such a speaking task. The
carrier phrase “You will write…” was used, with a
brief complete pause between carrier phrase and word.
The range of vowel pressure was 8 db, as measured
from the recording, with the median word at 3 db re
the lowest word. With respect to duration, which was
measured oscillographically, the words were distributed
fairly evenly over a range from 0.36 to 0.75 sec with the
median at 0.60 sec. The latter value is identical to the
median of a larger sample of PB words measured by
Miller and Licklider [5] and the shortest words in the two
samples also were approximately equal. About 10% of
the words in the earlier sample, however, exceeded
0.75 sec. Phonetic transcriptions of the recorded words
were made, and the accuracy and representativeness of
each element verified. In structure the words were
distributed as follows: 4 CV, 4 VC, 25 CVC, 7 CVCC,
9 CCVC, 1 CCVCC.

Compressed versions of the original recording were
made for 36 combinations of R_c and I_d (see entries in
Table I). The five values of I_d, ranging from 0.01 to
0.24 sec, were chosen to span the region of significant
fragmentation of phonemes. At each of these a set of
values of R_c in the range of 0.4 to 0.9 was selected; six
such values were common to all five discard intervals.
Each version was cut up, its 50 items reordered at
random, and lengths of blank tape inserted between
items to equalize all judgment periods to 3 sec.

Table I. Mean percentage of word intelligibility at various
combinations of time compression ratio and discard interval;
8 observers, 50 words.

[Table body not recovered. Dimensions: discard interval (sec) by compression ratio.]

The observers were eight young adults, advanced
graduate students with previous experience, who had
normal hearing bilaterally from 250 to 4000 cps and
no discrimination loss by test. Prior to experimentation
they underwent a training period designed to familiarize
them with the vocabulary, that being a basic condition
of the experiment. Administration of the uncompressed
master recording as a discrimination test was the first
step; all observers scored 100%. Then the observers
studied typed lists of the words, being asked to observe
spelling and pronunciation, and next were instructed
to read silently word by word as they listened to three
unrandomized versions with small compression ratios,
0.1, 0.2, and 0.3, which had no effect upon intelligibility.
At this point in training the observers were allowed to
believe that the experiment proper started, and the set
of eight compressed versions for the 0.01-sec discard
interval was administered under formal conditions.
Presentation was first in descending and then in ascending
order toward threshold, at two separate sessions.
Thus, by the end of the training period, the
observers had heard the list 20 times, three times while
watching a copy and 17 times while attempting identification.
The training scores on the compressed versions
were useful in guiding selection of values for the experiment
proper. Four sets of scores were used to estimate
reliability by means of analysis of variance. With
mean intelligibility ranging from 35 to 98%, reliability
coefficients varied from 0.85 to 0.97. The scores were
also used to confirm the propriety of pooling the
descending and ascending data; curves for the two series
differed in the expected direction by small amounts.

The experimental sessions were distributed over 12
weeks and included a second presentation of the versions
with the 0.01-sec discard interval. One discard interval
was completed at a time in two sessions, the first devoted
to a descending and the second to an ascending
series. Administration was bilateral, over PDR-10
earphones with Type 1505 ear cushions, at a sensation
level of approximately 80 db. The observers were tested
in subgroups of four and rested after each presentation
of 50 words. Responses were written and earphones were
rotated systematically.

[Figure: word intelligibility (%) plotted against compression ratio.]

Fig. 1. Percentage of word intelligibility as a function of time
compression ratio. Parameter is discard interval (sec).

Effects of Time Compression

Mean percentages of intelligibility for the various
experimental combinations of R_c and I_d are listed in
Table I and plotted in Fig. 1, with I_d as the parameter.
It will be seen that the curves for the three smallest
discard intervals all approach maximum intelligibility
close to 100% for values of R_c up to about 0.8, or when
the stimulus words were shortened to 20% of their
original durations. In the lower part of Fig. 1 the curve
for the 0.16-sec discard interval shows a long plateau
at about 85% intelligibility, while the 0.24-sec curve
appears to have entered a similar plateau at the 70%
level. Plainly any generalization about the effects of
time compression by the time sampling method requires
specification of discard interval (or of I_s, or f_s). The
discard interval of 0.16 sec is close to the mode duration
of the phonetic elements in these particular words as
spoken. Direct measurements were not made, but the
mean duration per phoneme, calculated from the word
durations, was 0.18 sec. Miller and Licklider found
median durations of 0.12, 0.30, and 0.18 sec for the
C–, V and –C subunits, respectively, of their comparable
words. [8] The lower part of Fig. 1 relates to discard
intervals which spanned a substantial part of the
distribution of phonemic duration. With I_d as long as
0.24 sec, for example, the number of unsampled
phonemes became large enough to degrade intelligibility
at all values of R_c. In addition to R_c and I_d,
the influence of a third factor, f_i, the interruption
frequency, may be discerned in the upper portion of
Fig. 1. Where the two curves are inflected downward
at large compressions, the effect is less marked when the
discard interval was 0.04 sec. In that case, for compression
ratios of 0.8, 0.85, and 0.9, the corresponding
interruption frequencies were 100, 142, and 225 cy,
respectively. When I_d was 0.01 sec, however, they were
400, 567, and 900 cy, or high enough in the speech
spectrum to overlap the frequency range which is
significant for intelligibility. [9] Thus, the curves of Fig.
1 demonstrate two features of time sampling which are
required if intelligibility is to vary primarily as a function
of compressed signal duration: a discard interval
which is short relative to the signal units; an interruption
frequency which is low relative to the internal
signal frequencies. Stated in another way, intelligibility
is a comparatively “pure” function of R_c when f_s is
relatively high and f_i relatively low. Although this
range is limited by the input signal it is sufficiently
large as to be useful. For example, in the case of 80%
compression with 0.04-sec discard interval, f_s is 20
cy, in comparison to a mean phoneme frequency of
about 5; f_i is 100 cy, well below the formant frequencies.
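The frequencies quoted in this discussion follow directly from the definitions R_c = I_d / T_s and f_i = 1/I_s. The short sketch below reworks that arithmetic; the function name `sampling_parameters` and the dictionary keys are illustrative choices, not notation from the paper.

```python
def sampling_parameters(compression_ratio, discard_interval):
    """Derive T_s, I_s, f_s and f_i from R_c and I_d (seconds and cycles/sec)."""
    period = discard_interval / compression_ratio      # T_s = I_d / R_c
    sampling_interval = period - discard_interval      # I_s = T_s - I_d
    f_s = compression_ratio / discard_interval         # f_s = 1 / T_s
    f_i = f_s / (1.0 - compression_ratio)              # f_i = 1 / I_s
    return {"T_s": period, "I_s": sampling_interval, "f_s": f_s, "f_i": f_i}

# The text quotes f_i of 100, 142, 225 cy for I_d = 0.04 sec
# and 400, 567, 900 cy for I_d = 0.01 sec.
for discard in (0.04, 0.01):
    for ratio in (0.8, 0.85, 0.9):
        p = sampling_parameters(ratio, discard)
        print(discard, ratio, round(p["f_s"], 1), round(p["f_i"]))
```

For 80% compression with the 0.04-sec discard interval this gives f_s = 20 cy and f_i = 100 cy, the example cited above.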

[Figure: word intelligibility (%) plotted against sampling interval (sec).]

Fig. 2. Percentage of word intelligibility as a function of
sampling interval. Curves ascending to the right are for discard
intervals (sec). Intersecting curves are for different compression
ratios.

Within the experiment the foregoing conditions were
met most adequately by the series prepared with I_d
constant at 0.04 sec. A nominal intelligibility of 50%
was most closely approached when R_c was 0.9, or when
the durations of the words ranged from 0.036 to 0.075
sec, with a median of 0.06 sec. The observers were
familiar with the vocabulary and with the effect of
durational distortion. The contribution of these conditions
to the score was not assessed, but they are
estimated to have been worth about 25% in the middle
of the range, from informal trial with naive observers
and comparison of the pre-experimental and experimental
scores. It seems reasonable to suggest that 50%
intelligibility with naive listeners might be approximated
with no great error by using the point of 80%
intelligibility for the present observers. With the 0.04
series this point is at about 85% compression (Table I),
that is, when the words were 15% of their original
durations and the median was 0.09 sec. As has been
noted, the original words were spoken deliberately and,
undoubtedly, with a median duration of 0.6 sec, were
longer than in average casual speech. It is possible that
they were as much as twice as long, since the mean
duration of phonetic elements in connected speech
approximates 0.1 sec, as has been shown repeatedly.
At any rate, it is interesting to use that as an assumption,
and to consider that the condition of 50% compression
presented the words about at their live-speech
durations. Figure 1 may be studied on this assumption
by converting R_c to new values equal to 2(R_c - 0.5),
or to -0.2 (20% expansion), 0, 0.2, 0.4, 0.6, and 0.8,
respectively, substituted for the values shown along the
abscissa, and dividing the I_d labels by two. [10] If the
duration of the median word when spoken by the
casual speaker is 0.3 sec (instead of 0.6 sec), and if a
median duration of 0.09 sec is required by the naive
listener for 50% intelligibility (instead of 0.06 sec),
then a time compression ratio of 0.7 is descriptive of
the approximate disparity of speaker and listener.
Thus interpreted, the data indicate that the “temporal
redundancy” of the average word is about two-thirds.
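The reinterpretation above is simple arithmetic, sketched here under the paper's stated assumption that the recorded words were twice as long as in casual speech. The function names and the `speedup` parameter are hypothetical; with `speedup=2` the first function reduces to 2(R_c - 0.5).

```python
def rescaled_ratio(r_c, speedup=2.0):
    """Compression ratio relative to a word 1/speedup as long as recorded.

    Compressed duration is (1 - r_c) * D; the assumed casual duration is D / speedup.
    """
    return 1.0 - speedup * (1.0 - r_c)

def disparity(casual_duration, required_duration):
    """Compression ratio that takes the casual word down to the required duration."""
    return 1.0 - required_duration / casual_duration

print([round(rescaled_ratio(r), 2) for r in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9)])
print(round(disparity(0.3, 0.09), 2))  # 0.7, the figure given in the text
```

The two routes agree: the 85% compression point on the original scale rescales to 0.7, which is also 1 - 0.09/0.3.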

Comparison of Time Compression
and Interruption

In Fig. 2 the data are shown in a plot of intelligibility
against sampling interval. The curves for I_d, those which
ascend toward the right, are essentially those of Fig. 1,
but reversed and displaced horizontally according to
the five different sets of sampling intervals required for
the experimental compressions at the various discard
intervals. The intersecting curves connect points of
equal time compression in the five experimental series;
a given curve shows variation of intelligibility with
I_s, with R_c constant. These curves depict in another way the
findings already discussed, namely, the high intelligibility
of heavily compressed words when I_s and I_d are
short relative to signal units, the decline as the intervals
become long, and, especially in the 0.9 curve, the other
decline as f_i, i.e., 1/I_s, extends into the speech range.
The latter set of curves is comparable to a set plotted
by Miller and Licklider [11] for different ratios of on-time
to total time (“speech-time fraction”) when signal is
interrupted periodically. In that experiment the on-time
during one cycle of sampling is analogous to I_s in
the present notation, off-time to I_d, and the speech-time
fraction to 1 - R_c. In interruption the on-time
samples are spread discretely across the original time;
in compression the sampling intervals are “squeezed
together” without discontinuity into a shorter net time.
Figure 3, which has the same coordinates as Fig. 2, was
prepared to compare the effects upon intelligibility.
The curves from Miller and Licklider, originally plotted
in terms of frequency of interruption, were reconstructed
and are shown in light line for speech-time
fractions ranging from 0.125 to 0.75. The curves in
heavy line are for three comparable values of 1 - R_c,
from the present data. The two sets have certain
resemblances with respect to the influence of sampling
interval, such as the locations of the maxima near 0.01
sec for the 0.125 pair, the decreases of intelligibility
toward the right, and the progressive rightward
shifting of the points from which these inflections
begin. These points of agreement are notable, and they
appear to be based upon the similarities of word
structure and duration already mentioned.

It will also be observed that each compression curve
is substantially higher than its interruption counterpart.
This effect is undoubtedly magnified by the condition
that observers in the present study were thoroughly
familiar with the 50-word vocabulary, in contrast to
the moderate familiarity with a larger vocabulary
practiced in the earlier experiment. The possibility
that it reflects a difference between compression and
interruption, however, cannot be overlooked. Both
techniques involved “interruptions” of the signal, but
these are inherently gradual in the former case, abrupt
in the latter. [12] In addition, the interruptions that
bound a given discard interval are simultaneous in
compression, not successive; a given sample is faded
up as the preceding sample is faded down. Compressed
signal does not give a subjective impression of discontinuity.
When the interruptions are noticed they
are heard as if there were an independent noise coexistent
with the abbreviated signal. With respect to
the masking effect of the interruption frequency, the
compressed signal should be less intelligible than
interrupted signal within the range of comparatively
long sampling intervals investigated thus far. In Fig.
3, for a given sampling interval and pair of curves, the
frequency of interruptions as heard by the observers is
higher in compression because of the high-speed reproduction
in the second stage of the process. For example,
in the case of the two 0.125 ratio curves, intelligibility
was maximal when the sampling interval was about
0.01 sec. For this combination, f_i was 100 cy in compression,
but only 0.125 × 100 cy, or 12.5 cy, in interruption.
Similarly, f_i would be 733 cy for the point where the
0.125 compression curve ends at the left, or high enough
to enter the speech spectrum disadvantageously, as
both experiments show. The comparable frequency of
interruption in the Miller-Licklider study was about
92 cy, close to the lower border of the spectrum. The
interrupted word has another feature which might be
thought to be advantageous, namely, that the total
stimulus duration approximates the original word
duration.
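The frequency comparison in this paragraph can be reworked as follows. The sketch assumes, as the text does, that the on-time of interruption corresponds to I_s and the speech-time fraction to 1 - R_c; the function name is hypothetical.

```python
def heard_interruption_rates(sampling_interval, speech_time_fraction):
    """Return (rate in compression, rate in interruption) in cycles per second.

    Compression: samples are abutted, so the rate is 1 / I_s.
    Interruption: one on-time per full cycle, and the cycle lasts
    I_s / speech_time_fraction, so the rate is speech_time_fraction / I_s.
    """
    compression = 1.0 / sampling_interval
    interruption = speech_time_fraction / sampling_interval
    return compression, interruption

print(heard_interruption_rates(0.01, 0.125))       # the 0.125 pair at I_s = 0.01 sec
print(heard_interruption_rates(1.0 / 733, 0.125))  # left end of the 0.125 compression curve
```

For a given sampling interval the two rates differ by exactly the speech-time fraction, which is why 733 cy in compression corresponds to about 92 cy in interruption.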

Effects of Sampling

In the discussion of Fig. 1 it was pointed out that
sampling becomes an important consideration in the
intelligibility of compressed words as the sampling
and discard intervals become long relative to the
durations of word elements. When they become sufficiently
long, differences in the intelligibility of compressed
and interrupted words should tend to disappear,
since sampling would be the dominant factor in both.
In fact, even within the range of the present experiment,
it may be seen in Fig. 3 that the compression curves
appear to be inflecting toward the corresponding
interruption curves at the longer sampling intervals.

[Figure: word intelligibility (%) plotted against sampling interval (sec).]

Fig. 3. Variation of word intelligibility with sampling interval
for proportions of signal as labeled. Light lines adapted from
Miller and Licklider (reference 5, Fig. 4).

[Figure: percentage of identification plotted against percentage of word.]

Fig. 4. Percentage of nonauditory word identification as a
function of percentage of word. Phonemic sampling by
transcription of word fragments.

When the intervals become so long that they involve
complete words or groups of words, the percentage of
words transmitted becomes roughly equal to the percentage
of signal time, and the percentage of intelligibility
should approximate the latter because it tends
to equal the former by definition. Thus, Miller and
Licklider, speaking of the 50% case, observe: “If the
frequency of interruption is low enough, the articulation
score must be equal to the product of the speech-time
fraction (here 0.5) and the articulation score for interrupted
speech (here almost 100%). With the speech
on 5 seconds, then off 5 seconds, the listeners heard
half the words correctly.” [13] Before this low-frequency
range is reached, the limiting case of subword sampling
is found where the sampling period and the duration
of the word are equal. This is the lowest sampling
frequency which always produces a word fragment, and
is 1.67 cy for the 0.6-sec median word of the experiment.
The limiting case was reached in the experiment for
the two shortest words only (0.36 and 0.4 sec), since the
lowest sampling frequency was 2.5 cy.
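A small check of the limiting case, assuming only the durations and the lowest sampling frequency quoted in this paragraph. The function names and the small tolerance are illustrative.

```python
def limiting_sampling_frequency(word_duration):
    """Sampling frequency at which the sampling period equals the word duration."""
    return 1.0 / word_duration

def reaches_limiting_case(word_duration, lowest_f_s, tol=1e-12):
    """True when the longest sampling period is at least as long as the word."""
    return word_duration <= 1.0 / lowest_f_s + tol

print(round(limiting_sampling_frequency(0.6), 2))  # 1.67 for the median word
for duration in (0.36, 0.4, 0.6, 0.75):
    print(duration, reaches_limiting_case(duration, 2.5))
```

With a lowest sampling frequency of 2.5 cy the longest sampling period is 0.4 sec, so only words of 0.4 sec or less qualify.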

Interest in this case motivated an additional, non-auditory
experiment in word identification which was
designed to be roughly analogous to the auditory
experiment. At the end of the first experiment the
observers were thoroughly familiar with the vocabulary,
having had 92 exposures to the list. [14] It was feasible,
therefore, to require the observers to attempt identification
of the words from fragments. The fragments
were formed by sampling or discarding phonemes, and
were displayed by means of the International Phonetic
Alphabet in elementary, unmodified symbols. In
anticipation of this procedure, observers had been
chosen who were trained in phonetic notation at an
advanced level.

Four hundred fragments were produced as follows.
Each word was considered as divisible into three components:
(1) consonant, consonant cluster or no
consonant (VC words); (2) vowel or diphthong; (3)
consonant, consonant cluster or no consonant (CV
words). With each component either present or absent,
eight versions of each word were available. These
ranged from the intact word to no word through the
following: C––, –V–, ––C, CV–, C–C, –VC. Each
version was typed on a card without any indication of
the position of the fragment within the basic word,
that is, the phonetic symbols were typed in their
original temporal order without spaces or dashes. Each
of the “no word” versions was represented by a completely
blank card. [15] The 400 versions were formed into
eight stimulus sets, each set consisting of one version
of each of the 50 words assigned at random. Order of
words within set was then randomized.
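The construction of the cards and stimulus sets can be sketched as below. The words used are hypothetical stand-ins written in ordinary letters, since the actual list and its transcriptions are not reproduced here; the random assignment is one plausible reading of "assigned at random" (each of a word's eight versions goes to a different set), not a documented procedure.

```python
import itertools
import random

def fragment_versions(onset, nucleus, coda):
    """All eight present/absent combinations of the three components, in order."""
    parts = (onset, nucleus, coda)
    versions = []
    for mask in itertools.product((True, False), repeat=3):
        versions.append("".join(p for p, keep in zip(parts, mask) if keep))
    return versions

def assign_to_sets(words, n_sets=8, seed=0):
    """Give each set one version of every word, then shuffle word order within sets."""
    rng = random.Random(seed)
    sets = [[] for _ in range(n_sets)]
    for word in words:
        versions = fragment_versions(*word)
        rng.shuffle(versions)              # which version lands in which set
        for stimulus_set, version in zip(sets, versions):
            stimulus_set.append(version)
    for stimulus_set in sets:
        rng.shuffle(stimulus_set)          # order of words within the set
    return sets

print(fragment_versions("s", "i", "t"))
print(fragment_versions("g", "o", ""))     # a CV word produces duplicate cards
```

An empty string stands for a blank card. For a CV or VC word the empty component makes pairs of versions identical, which is the duplication described in the note on this design.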

It will be noted that the mean length of the eight
fragments of each word was 50% of the word's total
phonemes, by virtue of the symmetry practiced in
forming the fragments. Each stimulus set, therefore,
was a version of the list in which the words had been
subjected heterogeneously to a mean sampling distortion
of about 50%. Thus, a given set was analogous
to an auditory version of the list in which some type
of distortion had operated heterogeneously across the
words, rendering approximately 50% of the total
phonemes unidentifiable.
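The 50% figure can be verified for each word structure listed in the Procedure section. The sketch assumes that each consonant in a cluster counts as one phoneme and that a vowel or diphthong counts as one; the names `STRUCTURES` and `fragment_ratios` are illustrative.

```python
from fractions import Fraction
from itertools import product

# (initial consonants, vowel, final consonants); counting rule is an assumption.
STRUCTURES = {
    "CV": (1, 1, 0), "VC": (0, 1, 1), "CVC": (1, 1, 1),
    "CVCC": (1, 1, 2), "CCVC": (2, 1, 1), "CCVCC": (2, 1, 2),
}

def fragment_ratios(sizes):
    """Share of the word's phonemes retained in each of the eight versions."""
    total = sum(sizes)
    return [Fraction(sum(n for n, keep in zip(sizes, mask) if keep), total)
            for mask in product((True, False), repeat=3)]

for name, sizes in STRUCTURES.items():
    ratios = fragment_ratios(sizes)
    print(name, sum(ratios) / len(ratios))   # 1/2 for every structure

distinct = set().union(*(fragment_ratios(s) for s in STRUCTURES.values()))
print(len(distinct))
```

Each component is present in four of the eight versions, so the mean is one half regardless of structure. Under the counting assumption the number of distinct ratios comes to 11, which agrees with the note on how the responses were sorted.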

Within a few days after the auditory experiment the
materials were presented individually to the observers.
The sets were administered independently, the familiar
answer sheets were used, and all eight sets were completed
in a single session. Instructions were as follows:
“In the auditory experiment sometimes you heard the
whole word and sometimes only a segment of the word.
In this part of the experiment you will have the identical
situation shown by phonetic symbols. You will be given
a set of 50 cards, each representing a stimulus word.
Your task is to identify the word…. Each set of
cards has the same 50 words you have been listening to
in the auditory experiment. Select the first card from
the set. Imagine you hear what is on the card. Try to
identify the word correctly from the information on the
card just as you have been doing in the auditory
experiment.” After each response the card was placed
face down and no longer consulted. No response was
changed after it was made. Guessing was advised, but
not required, and each observer worked at his own
speed.

Of the 3200 total attempts at identification, 47% were
correct, so that the over-all reduction of response
approximately equaled the over-all sampling distortion
of 50%. When the responses were sorted according to
relative fragment size, [16] pooling all of the data, the
results were as shown in Fig. 4, where fragment size is
expressed as percentage of phonemes in the complete
word. In view of the phonetic training of the observers,
they undoubtedly identified the phonemic symbols with
good accuracy (the perfect identification of the 100%
fragments adds confidence to this belief), and it is not
likely that any errors altered the shape of the curve
significantly. Further, since the procedure confronted
the observers with a “pure” problem in word detection,
the shape may not be attributed to any specific sensory
process. The curve indicates that the probability of
word identification, under the conditions of the experiment
(i.e., limited vocabulary of monosyllables, known
to observers, etc.), is a sigmoid function of phoneme
identification. Recalling that acoustic distortion and
interference are, in effect, methods of phonemic
sampling, it seems appropriate to suggest that the word-phoneme
transfer characteristic in auditory intelligibility
may also be sigmoid under like conditions. If this
is warranted, the data have application to the problem
of predicting the proportion of identifiable phonemes
from an intelligibility score.
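The suggested application, reading a phoneme proportion back from an intelligibility score, amounts to inverting a sigmoid. The sketch below uses a logistic curve purely as a stand-in: the paper gives no functional form, and the `midpoint` and `steepness` values are hypothetical placeholders, not values fitted to Fig. 4.

```python
import math

def word_identification(phoneme_proportion, midpoint=0.5, steepness=10.0):
    """Hypothetical logistic transfer from phoneme proportion to word score (0 to 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * (phoneme_proportion - midpoint)))

def phoneme_proportion_from_score(score, midpoint=0.5, steepness=10.0):
    """Inverse of the logistic above; valid only for 0 < score < 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must lie strictly between 0 and 1")
    return midpoint + math.log(score / (1.0 - score)) / steepness

score = word_identification(0.3)
print(score, phoneme_proportion_from_score(score))
```

A logistic curve never reaches exactly 0 or 1, so any practical use would need the parameters, and perhaps the functional form, chosen from measured points.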

It is interesting to note that Miller and Licklider
found a relationship between word intelligibility and
auditory time sampling by interruption that is not
inconsistent. As has already been mentioned, the
interruption curves in Fig. 3 were adapted from a plot
of intelligibility against frequency of interruption. [11]
Some of the frequencies used were lower than that
required for the limiting case of subword sampling.
The pertinent frequency for the median word, 1.67 cy
as in the present study, was within the range, although
it was not an experimental value. Smoothed curves for
the following speech-time fractions are shown: 0.125,
0.25, 0.5, and 0.75. At 1.67 interruptions per second the
respective percentages of intelligibility were approximately
as follows: 5, 15, 55, and 85.

Acknowledgments

The authors are indebted to the University Research
Board of the University of Illinois for support of the
investigation. They are also grateful to the observers
for cooperation that was more than usual, and to H. V.
Krone for his technical assistance.

** Reprinted from The Journal of the Acoustical Society of America, Vol. 29, 1957, pp. 636-41.

* Now at Department of Psychology, University of Kentucky,
Lexington, Kentucky.

[1] Fairbanks, Everitt, and Jaeger, Trans. Inst. Radio Engrs.
AU-2, 7-12 (1954).

[2] W. D. Garvey, J. Exptl. Psychol. 45, 102-108 (1953).

[3] W. D. Garvey and R. H. Henneman, A. F. Tech. Rept. 5719
(1950); A. F. Tech. Rept. 5925 (1952).

[4] H. Fletcher, Speech and Hearing (D. Van Nostrand Company,
Inc., New York, 1929), p. 293.

[5] G. A. Miller and J. C. R. Licklider, J. Acoust. Soc. Am. 22,
167-173 (1950).

[6] Fairbanks, Guttman, and Miron, J. Speech and Hearing
Disorders 22, 10-19 (1957); J. Speech and Hearing Disorders
22, 20-22 (1957); J. Speech and Hearing Disorders 22, 23-32
(1957).

[7] J. P. Egan, Laryngoscope 58, 955-991 (1948).

[8] See reference 5, Fig. 3.

[9] To the listener f_i is often unnoticeable when low; if heard, it
gives a subjective impression of “roughness.” As it becomes
higher and acquires pitch it begins to be obtrusive at about 300 cy.

[10] Regarding the assumption about word duration and the
suggested reinterpretation of Fig. 1, it is noteworthy that a discard
interval of 0.02 sec was chosen for the studies of comprehension
cited in reference 6. In that work the original messages were
extended expositions spoken at 141 words per minute, selected as
a representative rate. The choice of 0.02 sec as discard interval
was informal, but considered by the experimenters as “sufficiently
short to avoid impairment of intelligibility by fragmentation of
words.” Compressions as large as 70% were used.

[11] See reference 5, Fig. 4.

[12] Miller and Licklider observed that “gradual modulation improved
both the quality and the intelligibility of the interrupted
speech.” See reference 5, p. 170.

[13] See reference 5, p. 168.

[14] About a year after the experiment, one of the observers
attempted to reconstruct the list from memory as a matter of
curiosity and wrote 44 of the 50 words in a few minutes. Still later
another observer, serving in an experiment with a different list,
regularly offered words from this list among his errors.

[15] It will be seen that this design produced some duplicate cards
for two-element words. Each of the four CV words yielded two
CV cards, one for the CVC and another for the CV– combination,
and two blank cards, one a no word version and the other representing
the ––C combination. The same was true in reverse for
the four VC words.

[16] The basis of sorting was the ratio of number of phonemes in
a fragment to number in the complete word, a total of 11 different
ratios. Seven of these were used in Fig. 4; the remaining four
involved inadequate numbers of stimuli, since they came from the
single five-phoneme word, and each was combined with the other
ratio nearest. The experimental points in Fig. 4 are based upon
totals of 264 to 616 responses.