
Fairbanks, Grant. Experimental Phonetics – T33

Appendix E
Communication Sciences Seminar Lectures

The following material is derived from a taped transcription
of a series of five lectures delivered by Grant
Fairbanks at the University of Florida. The Communication
Sciences Seminar, sponsored by the Vocational
Rehabilitation Administration providing the occasion for
these lectures, was held from June 9 to June 22, 1963.
These lectures represent one of the rare instances of an
extended public presentation by Fairbanks. Although
they are often fragmentary, it was felt that their proximity
to his death, their attempt at comprehensive summary,
and their portrayal of a lively and enthusiastic
scientist at work justified their inclusion in this volume.
The material is only minimally edited in an attempt to
retain the spontaneous character of the oral presentation.
The fourth lecture of the series, entitled “Some Practical
Outcomes of Abstruseness,” has been omitted and instead
forms the basis of the article entitled “Additional Analyses
of the Rhyme Test” in Part Four.

Communication Sciences Seminar Lectures, June 17-18, 1963

Effects of Time Compression and
Expansion in Speech: Part 1

I plan to spend the morning on the problem of time
compression and expansion, and the work we have done
on that, not because the work is exhaustive, but because
it is representative. This afternoon I'm going to do
something that comes a little strange to me, except with
friends in the evening. That is, to turn a little philosophical
in the general area of science as science — not
particularly speech science — and give a few reflections
on the matter. Later on in the afternoon I'll touch
upon some of the administrative aspects of programs in
communications sciences. Then tomorrow I will deal
with some practical results that I've been interested in,
on the theory that even an experimental phonetician occasionally
produces something that has its use.

Now this morning I wish we could afford the luxury
of a lot of discussion, but I am going to lecture for the
great part. I will try to disclose some of the unusual aspects
of this kind of compression-expansion program.
Not the technological, instrumental aspects; they have
been published. The basic article on method was published
in 1954 in the Transactions of the IRE Professional
Group on Audio under the names of Fairbanks,
Everitt, and Jaeger. There is some material that is unpublished,
and I want to touch on that, attempting to
indicate some of the unique or special features of the
method. I'd like to ask you to reserve your questions
until we get through, so that I can make an extended,
uninterrupted presentation. If you have questions or
comments, please note them. We'll try to save some
time for discussion at the end. But we do have a lot of
ground to cover.

Very early, as a student, when I began to be aware
that there were such things as speech sounds, that they
had duration and so on, and that people had ears and
vocal systems and what not, it became clear, as it is to
most people, that the ear plus brain of a listener is a
faster system than the brain plus vocal system of a talker.
There is a disparity between the time constants of these
two systems.

During World War II, when for a time I was a technical
assistant to Charles Bray at the National Research
Council, we were asked to make a survey of voice communication
problems in the Air Force, then the Army
Air Force. At that time my later colleague, Bill Everitt,
was Chief of the Operations Group in the Office of the
Chief Signal Officer. His main psychologist was Don
Lewis, the expert on quantitative methods, who had
been my colleague at Iowa. Lewis and I then surveyed
Flight Training Commands for the Army Air Force. On
the basis of this survey the Air Force decided to establish
a research laboratory at Waco, Texas. John Black,
James Curtis, and Paul Moore were identified with this
later in various ways, to name the persons whom I recall
at the moment who are involved in this seminar.

Lewis and I wrote our report; we concluded, of course,
that communications were very bad in the Air Force,
and they were indeed bad. The gear was bad, and the
procedures were bad. One of the main things we reported
was that although time is often of the essence in
aircraft operations, nonetheless the main problem in
communications was with the talker, not the listener —
the talker's articulation in particular. We concluded
that one of the main sins in articulation was too fast a
rate, and so we recommended that, in spite of the fact
that rate was of the essence, we slow the talkers down.

Then we said that until some sort of device could be
interposed between the talker and the listener to take
advantage of the speed at which the listener could take
in the material, operations would have to put up with a
comparatively slow rate to provide for good articulation.
You soon reach your ceiling in articulation as you accelerate,
as you well know.

This stuck with me until I got to fooling around at
Illinois, about 1950, with a library of tapes of speech
sounds. I had a bunch of continuant sounds, vowels and
consonants, each on a little bit of tape. I began to splice
samples of these together and make synthetic words. I
found that I was able to do fairly well and it sounded
kind of like speech. And then I began to wonder how
short these segments could be, and I began to abbreviate
them until I got them down much shorter than average.
Then I got into the notion of time sampling, and
the idea that one could enter a speech signal at random,
and with periodic time sampling could effect a compression
that was not selective, but instead would bear some
fractional relation to the total time.

Now, we'll try to zero in on this problem. You all
recognize what we have in Figure 1. We have a spectrogram,
and, as you know, the coordinates are frequency
and time, and the energy is roughly proportional to the
density of the bars. This probably displays something

image visible speech

Figure 1

very like what a listener hears as time goes on. He has
a kind of delay circuit in his system that permits him to
perform these kinds of analyses and to get out the spectral
content. But actually the top row shows the way
that the signal is delivered to the tympanic membrane,
as one thing follows another. This is part of an oscillogram
of a vowel. It is periodic, as you can readily see,
and its fundamental frequency is given by the reciprocal
of the repeated periods that are readily apparent. You
can also see that the wave has a fairly rich harmonic
content because of its complexity. The ear operates on
this kind of signal. The top figure comes first and the
rest of it follows in order. Now, in the middle section of
the graph the oscillograph is moving at a slower speed,
and here we see word and syllable envelopes. The reason
that I want to show this is that these various syllables,
although not as regular as the waves in the oscillogram,
have their own rates too, and these will occur so many
per second although they are only quasi-periodic. Essentially
our problem is to operate on the time/frequency
display and to extract time samples in some automatic
fashion. The problem, then, is really one of time compression,
that is, to accelerate rate without altering frequency.

I show in Figure 2 a diagram of the main problem.
The ordinate displays frequency in cycles per second,
and you'll note that it is logarithmic, starting with .1 and
proceeding upward by a factor of 10. At the lower part
of the ordinate are the infrasonic ranges, that is, the rate
and quasi-periodic events, the event frequencies, and at
the upper portion are the sonic frequencies roughly restricted
to the range shown by the boxes of the diagram.
These sonic frequencies define the quasi-periodic events
of the lower portion of the diagram. Our problem is to

image cycles/sec. | infrasonic | sonic | orig. | slow | fast | compressed | expanded

Figure 2

speed up the rate of the infrasonic frequencies without
changing the rate of the sonic frequencies. Now if we
take a recording and simply play it slowed by a factor
of 2, we're going to halve both the sonic and infrasonic
frequencies. We're going to divide these frequencies by
half, and we're going to divide the rate by half. Both
of them will be slowed because they are both displayed
as a function of time. On the other hand, if we play this
material at a faster rate, we're going to multiply both
the event rate and the internal frequencies as well.

Now in compressed speech, our method of time sampling
provides for doing either one of two things. First,
leaving the event frequencies alone, which you'll see are
like the originals, we can divide the internal frequencies
down and decrease the distance between the upper end
of the event frequency and the lower end of the internal
frequencies. Or, in the time compression example, we
can speed up the event frequency but keep the range of
the internal frequencies the same as in the original. In
the case of frequency expansion, we're increasing the
distance between the upper end of the event frequencies
and the lower end of the internal frequencies. We're
multiplying the internal frequencies up, leaving the
event frequencies alone, or in time expansion, the time
application, we will slow the rate of the event without
altering the internal frequencies.

We do this in a manner roughly shown in Figure 3
for the 50% case. From original segments A and B, we
extract as a sample Aʹ and Bʹ. We will discard the sample
A — Aʹ and B — Bʹ, and we will abut interval Aʹ
and interval Bʹ so that the net result is that we have
sampled A and we have sampled B, reproduced in half
the time. In expansion we take A and repeat it, and we
take B and repeat it, thus doubling the duration of A
and B.

In the last example of the figure we first compress Aʹ
and Bʹ and then we repeat Aʹ and Bʹ, restoring the original
frequency and the original duration. Now for this
we would probably find it useful to supply a little notation.
First of all, we will define one of these A or B segments
as an interval and we'll call this interval in our
notation Ps, the sampling period; the sampling frequency,
which in turn is the reciprocal of the sampling period, we'll call Fs. Now,
instead of using the 2 to 1 example, I'm going to use a
4 to 1 example. We will consider this Ps to be composed
of 4 units; the first portion of this 4-unit Ps interval we
will extract and retain. This in our notation will show
as Is, the sampling interval. The remainder of Ps we're
going to discard. In our notation we'll show this as Id,
the discard interval; then Id + Is = Ps as is obvious.
Now the sampling, or the compression ratio, Rc, is defined
as Id/Ps. It will also be obvious then that the
sampling frequency previously defined as Fs = 1/Ps is
the rate at which we enter the signal. However, in the
time compression case, where we enter the signal at 1/Ps
and subsequently play it fast to restore the original frequencies,
we are going to elevate Fs to a higher frequency
that we call the interruption frequency Fi, which
is defined as 1/Is in our notation.
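
To make the notation concrete, here is a minimal sketch of the sampling scheme in Python, operating on a digitized waveform. It is my illustration of the method as described in Figure 3, not the Fairbanks-Everitt-Jaeger apparatus, which was electromechanical; the function names and the sample-rate parameter fs are assumptions of the example.

    import numpy as np

    def compress(signal, Ps, Rc, fs):
        # Time compression by periodic sampling: from each sampling
        # period Ps (seconds), retain the first Is = (1 - Rc) * Ps
        # seconds, discard the remaining Id = Rc * Ps seconds, and
        # abut the retained pieces. Internal frequencies are
        # unaltered; duration is multiplied by (1 - Rc).
        n_Ps = int(round(Ps * fs))                 # samples per period
        n_Is = int(round((1 - Rc) * Ps * fs))      # samples retained
        pieces = [signal[i:i + n_Is] for i in range(0, len(signal), n_Ps)]
        return np.concatenate(pieces)

    def expand(signal, Is, repeats, fs):
        # Time expansion: play each sampling interval Is (seconds)
        # `repeats` times in succession, multiplying the duration by
        # `repeats` without altering the internal frequencies.
        n_Is = int(round(Is * fs))
        pieces = []
        for i in range(0, len(signal), n_Is):
            pieces.extend([signal[i:i + n_Is]] * repeats)
        return np.concatenate(pieces)

With Ps = .04 and Rc = .5, each 40-millisecond interval contributes its first 20 milliseconds, halving the duration; the compress-then-expand case at the bottom of Figure 3 is compress followed by expand with repeats = 2.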

image orig. | comp. | exp.

Figure 3

Now there are various aspects to this particular kind
of a program. I want to list them first, and then I want
to comment on each of them.

First, there is the matter of the effect on intelligibility.
Second, there is the problem of learning or comprehension.
I will want to talk a bit about learning and comprehension
as affected by these matters. Third, obviously
if you play something fast you increase its rate. If
you play it slowly you decrease its rate. I want to talk
about the perceived rate in relation to these things. Then
I want to describe an experiment on the estimation of
the temporal redundancy of language using this particular
technique. I want to go into expanded speech and
do some demonstration, and, finally, I want to talk
about a technique where we first compress, then transmit,
and then expand back up into the original. This,
in general, is the ground that I'm going to try to cover
in this morning's work.

First, I'd like to review briefly the study of Fairbanks
and Kodman, published in the Journal of the Acoustical
Society of America in 1957, having to do with intelligibility. Some
of this material was used as part of Kodman's Ph.D.
dissertation at Illinois.

We have a list of 50 PB words. One of the standard
PB word lists was picked at random; one speaker spoke
them. The median duration of the words was .60 second
with a range of .36 to .75 second. There were eight
sophisticated observers. That is, they had no hearing
249loss, they had superior speech discrimination by test, and
they were trained in phonetic notation. Before the first
administration of this 50-word vocabulary, they were
thoroughly familiarized with the vocabulary in a training
task where the stimulus was degraded. Administration
was by PDR-10 headsets. Various compression ratios,
Rc, and Id values were chosen. As Id becomes large,
progressive fragmentation of the word occurs, of course.
We wanted to cover the range from small Id to large Id
intervals.

Thus, in Figure 4 we see the first order results of word
intelligibility in percentage plotted along the ordinate
vs. the compression ratio along the abscissa, with Id as
the parameter. Id, you will note, is .01 second, .04, .08,
.16, and .24. Obviously as compression ratio is increased
and as the duration of the words is shortened, intelligibility
is going to fall off. But it is equally clear from the
figure that intelligibility remains quite high out to a
surprisingly large amount of time compression. I'll go
into that in a later experiment, and I think in a better
way. We estimated that the temporal redundancy was
of the order of 70% in these PB words. You understand
now that this was in a closed set of 50 words with the
observers familiar with the set. You'll also observe that
intelligibility varies as a function of the size of Id, with
the larger values of Id, shown in the lower part of the

image word intelligibility (%) | compression ratio

Figure 4

figure, producing the least intelligibility. You'll recognize
that the maximum intelligibility was less than perfect
when Id became quite long.

Now in Figure 5 we are plotting word intelligibility
vs. the size of the sampling interval. The curves are the
five curves that you saw in the preceding figure, but
reversed and displayed along the abscissa. Now you'll
notice that as the sampling interval increases, that is to
say as the compression percentage decreases, each of
these curves begins to rise. The lines in the horizontal
plane connect equal time compression. For instance, the
lowest curve is that for 90% compression and the next
highest that for 80%, etc. Now an interesting thing
about this is that when Id is .16 or .24, the curves become
asymptotic, indicating that intelligibility will never
reach 100%. That is to say that when the discard intervals
are sufficiently long, regardless of how small we
make the compression, we are always going to be leaving
out something that will prevent the intelligibility from
being optimal. For the case for which Rc equals .5, that
is, where Id equals Is, we should expect the curve of
intelligibility to become asymptotic at the point of 50%
word intelligibility. For the average word of this experiment,
this asymptote should occur with an Is of .3
since this would yield an Id of .3 and together they would
produce a sampling period of .6, which is the mean
duration of the words in the list. You may wonder why,
in the case of 90% compression, the curve is so sharply
double-inflected. What's going on here is that in the
fast play we said that the interruption frequency, as the
listener hears it, is the reciprocal of the sampling interval.
This means that the interruption frequency is equal
to approximately 900 cycles at the shorter sampling intervals.

image word intelligibility (%) | sampling interval (sec.)

Figure 5

image percentage of identification | percentage of word

Figure 6

Accordingly, these interruptions are getting up
into the audible spectrum, and are artifactually interfering
with the intelligibility.
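
To spell out the arithmetic (the 1.1-millisecond value is my back-computation from the stated 900 cycles):

    Fi = 1/Is;  Is ≈ .0011 sec  →  Fi ≈ 900 cycles,

which is well within the audible spectrum.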

Figure 6 is what I call, rather facetiously, a non-sensory
discrimination function. At the conclusion of
this experiment these eight observers had each been presented
with this word list about 90 times. They knew it
backwards and forwards, or in any other random order.
As a matter of fact, a number of years later one of them,
George Kurtzrock, sat down just for kicks and wrote 44
of the words on a piece of paper — so they really knew
the words. When we got through with this experiment,
these young phoneticians were presented with a pencil
and paper fragmentation task. The original PB items
were randomly fragmented into parts. For example, if
the original word was of the CVC form it might have
been presented as CV, VC, CC, or any of the eight possible
deletions of the original letters including complete
deletion. Each of the original items was thus represented
by an orthographic fragment. We then shuffled up the
fragments, gave them to the subjects, and had them turn
the cards over one at a time and guess the word from
the fragment. In other words, now they were getting
portions of the words, and our interest was to see the
percentage of the total vocabulary that they would get
on this basis. The figure displays the percentage of identification
of the list of stimulus fragments vs. the percentage
of the word that was presented on the card.
You'll note that this is a sigmoid curve and that about
50% identification corresponds roughly with 50% of
the individual word. This is, I think, rather interesting,
particularly if you consider a transformation of it in
Figure 7.

Here we have a plot along a normal probability ordinate,
shown on the left. These are the same empirical

image phoneme perception (%) | word intelligibility (%) | percentage of word

Figure 7

points from the previous figure. We'll make the arbitrary
assumption here that the range is ±2.5 standard
deviations. Now we plot the word intelligibility values appropriately
here against the percentage of the word, as
in the previous figure. But now we employ a Z notation
metric, where zero equals the median, and the distance
between any two intelligibility values is expressed as the
ratio of the difference between the median value and
the obtained value relative to the obtained standard
deviation. Now I suggest to you that this is getting
pretty lawful. When you are dealing with fragments of
words in this particular manner, you're beginning to be
able to get at something that might be thought of as a
kind of transfer function from word intelligibility to a
proportion of phonemes perceived; an attribute which is
basic both to these data and our general approach to
the problems of intelligibility.
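
As a minimal sketch of that transformation in Python, assuming the probability ordinate is realized with the inverse normal distribution function and clipped to the ±2.5 standard-deviation range assumed above:

    from statistics import NormalDist

    def to_z(p, lo=0.0062, hi=0.9938):
        # Map a proportion correct (0 to 1) onto a z metric in which
        # the median is zero; clipping at .0062/.9938 bounds the
        # scale at roughly +/-2.5 standard deviations.
        p = min(max(p, lo), hi)
        return NormalDist().inv_cdf(p)

    print(to_z(0.50))   # 0.0: 50% intelligibility sits at the median
    print(to_z(0.16))   # about -1.0 standard deviation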

Also in the realm of rate, I shall now refer to the
Ph.D. thesis of Kurtzrock, who was interested in time,
frequency, and time-frequency distortion and their effects
on intelligibility. The plan of the experiment is shown
in Figure 8.

The left-hand ordinate of this figure is relative frequency
logarithmically scaled. If the event frequency
in question were 100, the lowest value of the ordinate
would correspond to a frequency of 25, the next 50 and
so on to the uppermost value of 800. The original
value is shown as the centermost dot of the figure, and
the problem was to multiply the frequencies and to divide
the frequencies. In this demonstration we wished
to abbreviate and expand the duration, frequency constant,
as shown along the abscissa; to multiply and divide the
frequency, time constant, along the left-hand ordinate;
and to abbreviate the duration and multiply the frequency
reciprocally, as in fast play, or to divide the frequency
and accompany it with time

image relative frequency | time-frequency | time | original | frequency | freq. shift (oct.) | relative duration

Figure 8

expansion as in slow play. Each point of the figure represents
a particular combination of these parameters and
provides a graphic representation of the conditions of the
experiment. Now for this experiment and for others I
have devised a rather interesting word list, which is
shown in Figure 9.

I wanted to construct a list that had 50 words in it,
and it had to be a CVC kind of list — a single initial
consonant, a single final consonant, a single vowel or
nucleus in the middle of a CVC model. Now across the
top the post-vocalic consonants are shown that were
used; along the side, the pre-vocalic consonants, and the
cell entries are the vowels, there being 10 of the common
American vowels — those with a low formant 2 and
those with a high formant 2. First of all, all entries are
real words, they're not nonsense. As a matter of fact,
42 of the 50 occur in the highest-frequency category of
the Thorndike-Lorge list. It is also a condition that in
no row and in no column will there be more than one
vowel entry. Now, there are five occurrences of each of
the 10 pre-vocalic consonants, five occurrences of each

tableau -p | -t | -d | -m | -r | -s | -b | -n | -l | -z | p- | t- | d- | m- | r- | s- | w- | v- | ʃ- | dʒ-

Figure 9

of the 10 vowels, and five occurrences of each of the 10
post-vocalic consonants, so you see that the end product
is a rather precise kind of vocabulary. This particular
design means, then, that each consonant-vowel combination,
each vowel-consonant combination, and each
consonant-consonant combination is unique. Given the
identification of a pre-vocalic consonant, a vowel, or a post-vocalic
consonant in a given word, the chance probability of
correctly identifying the entire word is .2, there being
five occurrences of each. On the other hand, being
given two elements of the word, you should get 100%
because each one of these is unique, and if you know
the set, you know what the third element is. This provided
a very interesting training situation because, in
fact, the subjects were trained in this vocabulary and
the method of testing. We trained to the level of 100%
identification on randomized two-element fragments of
the original list.
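
The design properties claimed here — five occurrences of each element, with every two-element combination unique, so that any two elements determine the third — can be verified mechanically. A sketch in Python; the word triples to feed it would come from Figure 9, which is not reproduced in machine-readable form here:

    from collections import Counter

    def check_design(words):
        # words: 50 (C1, V, C2) triples from the Figure 9 vocabulary.
        # Each of the 10 elements in each position occurs 5 times...
        for pos in range(3):
            counts = Counter(w[pos] for w in words)
            assert all(n == 5 for n in counts.values())
        # ...and every CV, VC, and CC pair is unique, so any two
        # elements of a word identify the third.
        for i, j in [(0, 1), (1, 2), (0, 2)]:
            pairs = [(w[i], w[j]) for w in words]
            assert len(set(pairs)) == len(pairs)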

image word intelligibility (%) | time-frequency | original | time | relative duration | relative frequency

Figure 10

In Figure 10 we can see the main effects of the experiment.
Word intelligibility is plotted along the ordinate
in percentage, and relative duration along the abscissa
increasing toward the right. The original version
of the vocabulary is indicated in each of the curves,
100% correct identification for a relative duration of 1.
Now if we decrease the relative duration, frequency constant,
our intelligibility falls off smoothly to the left. If we increase
the relative duration, frequency constant, intelligibility
continues to be high and perfect. The final point
of expansion, by a factor of 12, begins to dip down a
little bit because when you get expansions of this magnitude,
the speech takes on a sort of unreality and it no
longer sounds like speech. It should be observed that
the abscissa of this plot is scaled logarithmically. The
range summarized by the plot is very wide. This demonstrates
the great elasticity of the time domain in speech.
The dotted curve of the upper figure summarizes the
results obtained from fast and slow play. That is, values
to the left of the original value are for fast play with
duration decreasing and frequency going up, and values
to the right are for slow play with duration increasing
and frequency going down. Note that for these latter
values the intelligibility falls off much more rapidly.
Obviously, then, the difference between these two curves
demonstrates very graphically the importance of frequency
distortion in the signal. In the lower figure, the
abscissa now is relative frequency. Frequency is then
increasing toward the right; our original is up here at
the top repeated from the upper figure. As we multiply
the frequencies the intelligibility falls off as the right-hand
curve displays. As we divide them, time constant,
it falls to the left. The dashed curve displays the data
for time-frequency distortion superimposed on the curve
of frequency distortion. You can note the difference between
frequency division, time constant, and frequency
division accompanied by slow play. The difference is
small, but it is discernible, and these differences are in
the direction that you would expect.

I wish I had time to go into the phonetic analysis and
attempts to predict intelligibility given phoneme identification,
but let me just note the fact that significant
differences were found between the specific consonants,
vowels, and pre-vocalic and post-vocalic consonants in
the list. The data showed that when frequency distortion
was in question, the vowels were much more vulnerable
to degradation than were the consonants. The consonants
would stand up a lot better under frequency
distortion. When time distortion was in question the reverse
was true. The consonants suffered more in their
intelligibility than did the vowels. Vowels with low formants
were more intelligible than those with high formants
in frequency distortion. Finally, voiceless consonants
were considerably more intelligible than voiced
consonants when frequency distortion was present. So
much for the rate problem.

Now let's get on with the comprehension problem.
I'm referring now to work published by Fairbanks, Guttman,
and Miron, in the Journal of Speech and Hearing
Disorders, 1957, a series of three experiments. I'm going
to refer to the first and the third in that series. For this
particular problem, we were interested first of all in
establishing some materials that could be used for the
measurement of comprehension, so we devised two technical
messages. They had to do with meteorology; they
were highly factual. There was reference to numbers,
to historical dates, the design of instruments for meteorological
observations, and so on; in short, they were
highly technical messages. One of them was 1,554 words
long, and the other was 1,573 words long. Both messages
were read by a skilled talker at the rate of 141
words per minute, the target being 140. The talker was
paced with one-minute light flashes, and that means
that each of the messages was a little bit longer than 11
minutes. That means we're dealing with fairly long
periods of content.

The technique was to present one message, then test
it, then present the second message and test it. So, message,
test, message, test was the general administrative
situation. We devised a 30-item test for each of these
messages which originally started out as a 50-item test,
then was reduced to a 40-item test and then to a 30-item
test. It was a five-alternative multiple-choice form of
test. By item analysis procedures and the Kuder-Richardson
Reliability formula, R was .87 for 98 subjects
hearing the message without any distortion, which we
regarded as a highly acceptable standard of reliability.
This demonstrates, incidentally, that it is possible to
devise a test of auditory comprehension with high reliability
and strong face validity.

These messages were subjected to a series of compressions;
first there was the uncompressed version, which
we will refer to as 0% compression; then 30% compression,
50%, 60%, 70%; and then there was the test without
having heard the message, comparable to 100% compression.
We wanted to test a priori information, and
we picked meteorology, incidentally, because there isn't
very much popular information about the kind of technical
detail that we are getting at here. That is, it isn't
commonly available to the kind of subjects we employed.
Thirty-six subjects were used in each of the compression
conditions, except for the test situation, 100% compression,
in which 44 were used. The subjects were airmen
and were controlled with respect to stanine. The subjects
were run in groups and were presented the message
through PDR-10 headsets. The discard interval
was .02 second for all compressions, an interval you will
recall which produces high over-all intelligibility. We
did not want intelligibility to be a question. Now, for
instance, in the case of 50% compression Id would be
.02; Is would also be .02; Ps would be .04. The sampling
rate would be 25 cycles, and the interruption frequency
would be 50 cycles.
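
Spelled out in the earlier notation, that 50% case is:

    Id = .02 sec;  Is = .02 sec;  Ps = Id + Is = .04 sec;
    Fs = 1/Ps = 25 cycles;  Fi = 1/Is = 50 cycles.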

Now, in Figure 11 we see the general effect upon

image relative message effectiveness | words/min. | relative message time

Figure 11

comprehension when all the data are pooled. Our ordinate
now is relative message effectiveness. For the people
who had no message but took the test, the obtained
score was 20.7% on a five-alternative multiple-choice test.
You can see that this is very close to the assumed a priori
probability. The maximum of those who had the complete
message uncompressed, which is shown at the top
of the ordinate as 1, corresponds to a score mean of
63.8%. The intermediate ordinate values are expressed
relative to the difference between the maximum and
minimum scores, 63.8 and 20.7. Thus this is what the
message added to the pre-knowledge of the subjects.
The abscissa indicates message time. You'll note that it
is linear. The upper abscissa indicates the corresponding
rates in words per minute, and you'll observe that at
times these rates get very high indeed, 500 words per
minute for the maximum compression condition. You'll
notice that as the rate decreases, message effectiveness
increases; but you'll also notice that out to 300 words
per minute we don't lose much. As a matter of fact, the
analysis showed that the 300 WPM rate
and the intermediate point did
not differ significantly from the original, uncompressed,
message effectiveness.

Figure 12 shows the same data plotted in the sense of
message efficiency. Here the ordinate is the rate at
which the tested items were learned in terms of items
presented per minute, there being 60 total items for the
two messages. The rate at which the items were presented
is shown along the abscissa, and this of course
increases to the right in terms of items per minute, the
rightmost point being the fastest compression case, and
the leftmost being the uncompressed case. If the rate
of learning depended strictly on the duration, that is, if

image rate learned (items/min.) | rate presented (items/min.)

Figure 12

the rate of learning were proportional to the duration of
the message, we would have line C because the rate
would stay the same irrespective of how we presented it.
However, if the rate of learning stayed constant in spite
of acceleration, we would have line B — that is to say,
line B would describe the ideal case where compression
or acceleration of rate had no effect whatsoever. The
A curve is, of course, the empirical function, and it
shows that for compressions of 0%, 30%, and 50%, the
A curve is fairly close to the B curve. In short, the evidence
suggests that for factual comprehension of technical
material, rates up to about twice normal will pay
their way in terms of increasing efficiency. Beyond the
50% compression point we fall off in efficiency. In short,
this suggests that if efficiency is what you're after, you
can present twice as much material in a given time, or
we could present the two messages in the time of one
and we would not lose too much in factual comprehension.

The next experiment that I want to refer to was
rather interesting in its results. We took as the model
the 30% case of compression. Now, when you compress
in time 30%, of course you have a net time of 70% of
the original. You're saving 30% of the time. The obvious
question is what do you do with that 30%. Do
you go out and have coffee, or what do you do with it?
We decided that we were going to use this to do a study
on selective verbal redundancy. Now the 30% compression
ratio permits us to put, at the same rate of speech,
43% more words into the time saved; that is, the ratio
of discarded to retained material, 30/70. Thus, with
30% compression we could produce a message that had
43% more words in it, but was equal to the original
obtained time. You'll remember that in these two messages
we had a 60-item test, a 30-item test on each of
the two. We took 30 of the items from the representative
message effectiveness levels and considered them to
be control items. We took the remaining 30 items and
considered them as experimental items. For the experimental
items we went into the messages and identified
the portions of the content that supported the factual
information called for in each of the test items. We
then divided up the 43% of the words that we added,
and we augmented the statement about each one of
these content sections by reaffirming, by repeating in
some cases, by paraphrasing, or saying them in different
ways. In other words, we added to the explanation
about these experimental items, so that we then had 30
control items that were left unchanged, as in the material
that I reported before, and we had 30 experimental
items that had a total of 43% added information. Thus
we had what we will call a short version and a long
version. We have a short version, the original message,
and the long version with 43% more words in it. Remember
that we used 0% compression with the original
message, and a 30% compression condition for the augmented
version, and vice versa.
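
The 43% figure follows directly from the compression ratio, being the discarded fraction relative to the retained fraction:

    Rc/(1 - Rc) = .30/.70 ≈ .43,

so a message lengthened by 43% in words and then compressed 30% plays in the original time (1.43 × .70 ≈ 1.00).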

Now I think I'll skip a little bit. The bottom portion
of Figure 13 shows the plan of the experiment. The
zero point divides the message up into the control items
at the bottom, and the experimental items at the top.

image (control) items correct (experimental) | short version | long version | 0% compression | 30% compression | words | time | estimated pre-knowledge

Figure 13

This is just sort of a schematic of the experiment. Half
the words are controlled; half the words are experimental;
half the time is controlled; half the time is experimental
in the uncompressed case. The dashed boxes
indicate the long-version messages. We leave the controlled
items alone, but we augment the experimental
items by 43%, both in words and in time in the uncompressed
case shown on the left. On the right is shown
the case of the time compression where we have the
original message in 70% of the original time and the
augmented message in 100% of the original time.
The results are shown schematically at the top, and
the things that we can see here are fairly obvious. The
controlled and the experimental items match fairly
well, that is, they are about equally learnable in the
short version of the message. In the long version we
have produced an increase in the comprehension of
the experimental items, but we have a corresponding
decrease in the comprehension of the controlled items.
For 30% compression the short version still produces
about the same total comprehension as the uncompressed
case, but the augmented message under compression
again shows an increase of comprehension on the
experimental items at the expense of the control items.
We augment the comprehension of the experimental
items, but we reduce the comprehension of the controlled
items.

Now this is certainly an interesting result. First of all
let me tell you the result of the statistical treatment.
The difference between versions, the augmented and
the unaugmented, was not significant. The difference
between compressions was not significant. The items-by-version
interaction was significant beyond the .001 level.
That is, if you are talking about the controlled or the
experimental items, you have to specify what version you
are referring to — the short or the long. The difference
between item classes, the controlled items vs. experimental
items, was also significant. In other words, this
audience of subjects was behaving much as a sponge
would. A sponge doesn't care what kind of fluid it mops
up, but it can only take so much, and thus these subjects
apparently are operating rather like that. They are
fairly well saturated. You can improve these items by
increasing their redundancy, but you do it at the expense
of taking away from the comprehension of the
other items. This is a salutary piece of information, and
it behooves us all to note it well. It means that the
more I say about a particular thing now, the more you'll
understand that thing, but at the expense of something
else that I devote less time to.

Now I want to go into the business of rate. I'd like
to refer here to the Ph.D. thesis of Hutton, done under
my direction at Illinois, which had to do with the psychophysical
aspects of rate, partly in the compressed and
expanded realm. Estimation of rate, duration, and preference
were the objects of concern in this problem. A
superior speaker produced numerous versions of the
well-known Rainbow Passage. An attempt was made
to produce the passage at widely varying rates, and from
this range of rates to select a number of versions limited
by one that was as short as 9.6 seconds, and one that
was as long as 35.3 seconds, where our long time average
for a large sample of speakers is 17.2 seconds. Out
of this range, eight versions of the passage were picked
to produce a geometric series in duration such that each
one was 20% longer in duration than the preceding one.
Then the plan was to compress and expand these eight
versions by 10% and 20%. Thus, the passage of a given
duration could be expanded by 20% in order to make
its duration like the one that had been originally produced
next higher. By expanding it 10% in duration
we would put it midway between two of the original
versions. The ordinate of Figure 14 shows the duration
of the eight original readings. Phonation is shown in the
hashed bars at the bottom, and you'll see the familiar
phenomenon that has been insufficiently studied, in my
estimation; that is, as the total duration increases and
the rate slows down, the proportion of phonation time
to total time becomes smaller, or, conversely, it becomes
higher as rate increases. A speaker with good articulation
was a condition of this experiment. A speaker who
is interested in good articulation will speed up by taking
it primarily out of the pauses. Sure he'll abbreviate his
speech sounds, but there comes a point where he cannot
abbreviate them any longer and maintain good articulation
so most of the time that he takes will come
out of the pauses. This figure also graphically displays
the regularity of the progression in duration of the original
readings.

image duration (sec.) | pause | phonation | reading

Figure 14

The plan of the experiment is shown in Figure 15,
where the ordinate shows the duration in seconds, geometrically,
and the eight readings that you saw in the
preceding figure are shown along the abscissa. The original,
10%, and 20% compressed and expanded versions
of each of these eight are represented by the plotted
points. There are five versions of each reading in all.
You can readily see that we have a successive series of
overlapping durations. There are, as a matter of fact,
19 different rate levels as represented in the experimental
design.

The observers for this experiment were 10 sophisticated
listeners, graduate students and instructors, who
were well practiced in the art of observing. They were
given samples of these versions taken at random out of
the middle of the set of the 40 different versions to illustrate
the extremes of the ends of the scale of duration
and rate, and to show what the scale of rates they would
be hearing was to be.

Tomorrow I'm going to talk about the results of rate
preference. In addition to this group that judged rate
and duration, and estimated them, we had a group of
10 instructors who judged the preferred rate of these
40 versions, and a group of 40 undergraduate students
who also judged the preferred rate; that is, made estimates
of the degree to which they preferred the rate
and also estimated the over-all effectiveness of the sample.
These latter two groups, the 10 instructors and the
40 undergraduates, did not differ significantly, as if we
didn't know, so consequently the two groups were pooled
for an N of 50 for the rate preference judgments.

This morning I'm not going to talk about the rate
preferences, I'm going to talk only about the estimates.
The procedure was that these were administered to
these people in random order and duration estimates
taken. They were given a graphic scale in time units
ranging from 0 seconds up to 60 seconds, with various

image duration (sec.) | reading

Figure 15

image duration est. | duration meas. (sec.)

Figure 16

markers along it for each second. Their mode of judgment,
for each passage reading, was to make a
mark at what they judged the duration to have been
on this linear rating scale. The rate scale ranged from 1
to 9, 1 being slow, 9 being fast. They were told that
the units of the scale were equally separated, and, as
I said, they were given examples of readings of the
passages in both cases so they would know about what
range of durations and rates they could expect. Analysis
was made of these data for both the original and
processed versions; that is, the compressed and expanded
versions vs. the original ones that came out of the mouths
of the talkers. The rate and duration estimates of these
did not differ significantly. Consequently, they were
pooled with respect to their duration, and so the figures
that I will show you now pertain to these 19 rate levels
that were in common. In other words, the compressed
and uncompressed messages of equal duration were
pooled.

Figure 16 shows estimated durations ranging from 10
seconds at the bottom of the ordinate to 50 at the top
vs. measured duration along the abscissa. The 19 points
are for the 19 rate levels, or duration levels if you wish.
Of course they show a high degree of rectilinearity. In
other words, the observers are pretty good judges of
periods of time in this particular kind of task.

Now Figure 17 shows estimated rate vs. measured
rate. Note that the measured rate is in words per second
with rate increasing toward the right. This is again
logarithmic, the curve having been fitted by the least
square criterion. The equation says that the estimated

image rate est. | rate meas. (wds./sec.)

Figure 17

rate, the ordinate, is well approximated by 11.97 times
the logarithm of the measured rate minus .46. In other
words, given the rate in words per second, we may take
roughly 12 times the logarithm of that rate in words per
second and arrive at a nine-point scale value less .5 of
what the estimated rate will be. A handy transformation
to have.
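
As a sketch (assuming, as the rounded restatement suggests, a base-10 logarithm):

    import math

    def estimated_rate(words_per_sec):
        # Hutton's fitted function: roughly 12 * log10(rate) - .5
        # on the nine-point scale (1 = slow, 9 = fast).
        return 11.97 * math.log10(words_per_sec) - 0.46

    print(round(estimated_rate(4.0), 1))   # 4 wds./sec. is judged about 6.7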

Effects of Time Compression and
Expansion in Speech: Part 2

I should like to continue this morning's lecture on the
problem of the compressibility of speech and the estimation
of temporal redundancy. I shall be reporting
some unpublished results of an experiment that I did
shortly before I left Illinois, and I want to tell you
about some of the preliminary considerations that guided
this experiment. The actual experiment is very compact
in its reporting, but I think it is a rather standardized
sort of procedure that might be commended to
your attention as an approach to other such estimates
in realms other than that of temporal redundancy.

In brief, the problem was to trade flat thermal noise
for time. We have a lot of data on the estimation of
temporal redundancies; some of it is pretty old, and
some of it is pretty good. For instance, we know that
the temporal redundancy or duration of a vowel in connected
speech is far beyond what is needed.
This is intrinsic to the nature of vowels so we might
well ignore the steady states of vowels that are very
long. In one of the technically very fine studies of Parmenter
and Trevino, in 1935, oscillographic measurements
of consonant and vowel durations were made.
These investigators were sophisticated in picking the
limits of the sound, and their mean vowel is about 120
milliseconds long in contrast to their mean consonant,
which is about 80 milliseconds long, a difference factor
of about 1.5. The data were collected from an extended
sample of connected readings performed by one subject,
but very well done. House and I, in a study of the effect
of consonant environments in disyllabic nonsense syllables,
the second syllable being stressed, found the mean
duration for six stressed vowels to be 200 milliseconds, or
roughly 1/5 of a second. Gordon Peterson, in 1939, did
a study on the minimum duration required for vowel
recognition and found the value to be as low as 5 milliseconds.
From these studies we come out with an average
representative value of about 5 milliseconds for
duration up to a respectable maximum as high as 200
milliseconds. We realize now that the a priori guessing
probability of the Peterson study is very high so that
the minimum value derived from that study has to be
viewed in that sense.

Vowels are not the place to look for temporal redundancy;
I was interested in looking at consonants. For
this reason, I developed the Rhyme Test with which
some of you are familiar. The Rhyme Test was developed
at this time because we felt that considerable
information could be gleaned from a study of rather
briskly spoken consonants in which the normal consonant-vowel
transient effects were retained. The test thus
constituted a representative situation for an estimation
of the temporal redundancy of consonants. We also
wanted, as much as possible, to eliminate some of the
non-acoustic, non-phonemic, non-auditory sorts of things
that we were referring to in the last hour, such features
as word length, word familiarity, and so forth. Among
these latter effects must be included the effect of the
open or closed character of the vocabulary. This is a
factor that probably contributes 20% to 25% to the midrange
intelligibility score. That is, a closed set with
known vocabulary in the midrange contributes extensively
to any obtained intelligibility score.

Now if we reduce these kinds of effects we should be
able to make a more refined estimate of the pure consonant
effect. We're going to expect, from what we
know of previous data, to have a rather large temporal
redundancy for words; we don't know yet about consonants.
We did know that when the signal-to-noise
ratio exceeds approximately 15 decibels, the maximum
identification score of the Rhyme Test is not increased.
So in trading noise for time we were working at values
lower than 15 db, we thought. Also, from some of the
pilot work at a signal-to-noise ratio of 15 db, we discovered
very swiftly that compressing these words to
50% of their original time did not degrade the intelligibility.
Of course we knew that from previous studies,
but such might not have been obtained for the Rhyme
Test, in which consonant and consonant-vowel transitions
were in question.

Clearly, it would be unrewarding to attempt to determine
temporal redundancy at any point along the maximum-identification
asymptote of 100% intelligibility.
The problem becomes much more difficult because of
the very slow slope of the intelligibility function near
maximum. You don't know, really, how to evaluate
redundancy. On the other hand, where the function
is fairly steep in the 50% correct-identification range,
we have a manageable value that is representative, reproducible,
within the range, and much more useful
for our purposes.

Essentially our problem then was to discover that
combination of the duration of words and noise that
would yield 50% correct identification. Our goal then
was to trap this result within the experimental conditions.
It took quite a bit of pilot work and quite a
lot of preliminary experimentation before we decided
on the exact values that it would be useful to use.

In Figure 18 you see the plan of the experiment. The
left ordinate of the figure represents the percentage of
the word that remains after compression. For instance,
the top row would be the uncompressed version. At the
lowest value, 17.5% of the average word remained in
the highest compression condition of 82.5%, a very
heavy compression indeed. The four compression values
of the experiment were, then, 0%, 50%, 78.5%, and
82.5%.

image signal/noise (db) | % of word | word list | sub-group (8 Ss ea)

Figure 18

Now to each of these was assigned a random form of
the Rhyme Test. The Rhyme Test is a 50-word test
using a limited number of identical stems with a fixed
stimulus vocabulary. A form was assigned to each of
these compressions. In all of these compressions Id was
15 milliseconds. This means that the sampling frequency
increased as a function of the amount of compression.
Six signal-to-noise ratios were chosen, ranging from -9
up to 15, and are represented by the columns of the figure.
To each of these a subgroup of eight observers was
assigned. The order of events at the time of the experiment,
which was administered over headsets in half
subgroup sections, four subjects at a time, was a constant
signal-to-noise ratio for each subgroup across the compressed
versions in descending order. This was preceded
by training in a fifth random form of the Rhyme Test
administered with a signal-to-noise ratio of 3 db, under
which condition the obtained identification is approximately
75%. The signal level was about 65 db sensation
level; thermal noise was adjusted relative to the
median vowel of the list. There were no carrier phrases
used in this study; all words were spoken in isolation.
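
The remark that the sampling frequency increased with the amount of compression follows from the definitions of the first lecture: since Rc = Id/Ps,

    Fs = 1/Ps = Rc/Id,

so with Id fixed at 15 milliseconds, Fs rises from about 33 cycles at 50% compression to 55 cycles at 82.5%.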

In summary, then, we have 24 noise-by-stimulus-duration
combinations. In each of these 24, the mean will
be based on 400 responses — 8 subjects times 50 words.
Our expectation will be that when the compression is
large and the signal-to-noise ratio is low, we'll have low
intelligibility. When the compression is small and the
signal-to-noise ratio is high, high intelligibility should
obtain. We seek the 50% intelligibility point, that is
our criterion, so we will expect to find it by interpolation.
It will take fairly good noise or compression to degrade
intelligibility to 50%, but hopefully we will be able to
trap the value by this set of combinations in the matrix.

Now Figure 19 shows identification of the word, the
word intelligibility, on the ordinate vs. the signal-to-noise
ratio along the abscissa with percentage word retained
as the parameter. You'll remember that the subgroups

image word intelligibility (%) | % of word | S/N (db)

Figure 19

of subjects were assigned by signal-to-noise ratio.
Thus, each column of plotted points corresponds to the
values of one subgroup; the lines are drawn across subgroups
by compression. You'll notice that when we
compress to 17.5% of the original word we get the lowest
intelligibility values, from 22.5% at -9 S/N to 60%
at 15 S/N. Some of you will have noted that the 50%
compression version up at the top end is higher than
the 100% version. This is not because compression improves
intelligibility. This is because different word lists
were assigned to these different compression values, and
it merely means that toward the high end this particular
form of the Rhyme Test yielded higher intelligibility
than the others.

Our problem is the specification of these combinations
of the experimental conditions which produce an intelligibility
value represented by the 50% horizontal of
the graph. So our problem is one of interpolation.

Figure 20 will show the results. In this figure you see
the 50% equal intelligibility contour. The ordinate is
the percentage of the word that remains after experimental
manipulation. The abscissa is signal-to-noise
ratio, and you'll observe that at approximately -4.5
db the uncompressed version yields 50% intelligibility.
You'll also notice that with increasing compression the
curve descends almost vertically, until you get to approximately
75% compression, and then you turn the
corner and begin to require increasing signal strengths.
Now we're about where we want to get. We drop fairly
straight in duration to about 25% of the duration of
the original word. This value occurs roughly at the
signal-to-noise ratio of -2.5 db. We therefore proposed
that where this corner is turned, we could specify the
value exactly by some treatment combination, but it is

image per cent of word | equal intelligibility contour — 50 per cent | S/N (db)

Figure 20

not necessary that we do so. We then proposed that
the treatment condition producing this sharp corner of
the function is the condition that will provide us with
an estimate of the temporal redundancy. On that basis
then, I estimate that the temporal redundancy of the
consonants and the consonant-vowel transition elements
of the language are of the order of 75%. Cautiously
stated, this means that when the noise level is such as
to yield 50% intelligibility with a normal signal, the
duration can be reduced about 75% before a substantial
decrease in the noise or a substantial increase in signal
strength is necessary to maintain 50% intelligibility.

In concluding a review of this experiment, I would
remind you that the temporal redundancy of vowels
is very, very much larger than this, probably on the
order of 25 to 1.

I want to turn next to the effects of expanded speech,
and here I want to do a thing that I hope you'll find
interesting. This is the kind of thing that you can't
publish because it resides in what you hear. What I'm
going to do is play for you an expanded version of
speech and, at the same time, we'll watch an acoustic
spectrogram on the screen. With this expansion we will
follow the movements of the formants and the other
phenomena of the speech as we go along.

If you don't already know the sentence in Figure 21 I
recommend that you write it down: “Measure why that
possum views a boy will Ruth each awful gay cushion
young Joe now heard.” This is a fine sentence. It contains
one, and only one, example of each of the general
American phonemes, plus the three syllabic consonants,
/m/, /n/, and /l/, and it contains them in a good and
workable natural word order. That is, it comes trippingly
to the tongue. “Measure why that possum views
a boy will Ruth each awful gay cushion young Joe now
heard.” It has a weird kind of sense, like Shannon's
third-order approximations to English; you know, where you

Measure why that possum
views a boy will Ruth
each awful gay cushion
young Joe now heard.

mɛʒɚ hwaɪ ðæt pɑsṃ
vjuz ə bɔɪ wɪl ruɵ
itʃ ɔfḷ geɪ kuʃṇ
jʌŋ dʒou nau hɝd

Figure 21

it tracks for a few words and then you get off. The
original of this was spoken in seven seconds. This means
that in seven seconds you can go through the complete
armamentarium of the phonetic display of general American.
In a demonstration, it's a quick, practical articulation
test — and I mean articulation test, not intelligibility
test.

Now Figure 22 will show the spectrogram of this
sentence, seven seconds in total duration. The pauses
between phrases have simply been cut out; they are not
shown in terms of their true duration. While you watch
this I'd like to play the original version. This is the
seven-second, untreated version of the sentence. [Fairbanks
here played a recording of his own rendition of
the sentence from which the spectrograms were derived.]

Apart from the fact that my fundamental frequency is
rather low, it measures out at about 85 cycles on the
median. I think that you'll grant that the articulation
was continuous, reasonably natural, not exaggerated,
not fast, but within the phrase reasonably brisk and, if
you'll forgive my immodesty, a reasonably accurate version
of general American.

Next, I want to play two compressed versions of this
sentence. First, a 50% compression, the original sentence
compressed to 3.5 seconds. You'll agree that it
is perfectly intelligible. I'll now play a 75% compression.
You will hear this in 1.75 seconds instead of the original
7 seconds — listen fast. It is clearly not very intelligible
now. Hold in mind, if you can, how this 75% compression
sounded because in the next experiment I'm going
to review, this will be the sort of thing that will be
transmitted through a channel and reconstructed at the
other end of the channel. I want you to remember the
75% compression in particular because nobody can

image mɛʒɚ hwaɪ ðæt pɑsṃ | vjuz ə bɔɪ wɪl ruɵ | itʃ ɔfḷ geɪ kuʃṇ | jʌŋ dʒou nau hɝd

Figure 22

understand that as such. You can understand words
compressed 75% according to this noise-for-time-trade
business that we were talking about, but that was a
closed vocabulary and only 50% intelligibility.

Now I am going to play a 50 to 1 expansion of the
original 7-second rendition of the sentence. The sentence
is now expanded to a total duration of 5 minutes
and 50 seconds. This is analogous to high-speed motion
picture photography. The sampling interval for these
expansions was 25 milliseconds. This means that 25
milliseconds was the amount of time that was repeated.
This also means that one cycle of the expansion would
correspond to 50 cycles in the original. After every
sampling interval, we advance on into the material by
½ of a millisecond. Now you have here a sort of auditory
microscope in which you can hear some old things
better, and in which you can hear some new things.
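The sampling arithmetic lends itself to a minimal sketch in Python (a modern schematic, obviously not the 1963 apparatus; it assumes a monophonic sample array, uses the 25-millisecond segment and ½-millisecond advance just quoted, and simply abuts the repeated segments without the smoothing of the original device):

```python
import numpy as np

def expand(signal, fs, seg_ms=25.0, advance_ms=0.5):
    """Time-expand a signal by repeating fixed-length segments.

    Each seg_ms-millisecond segment is copied to the output, after
    which the read point advances by only advance_ms milliseconds.
    The expansion factor is seg_ms / advance_ms: here 25 / 0.5, or
    50 to 1, so a 7-second utterance becomes 5 minutes 50 seconds.
    """
    seg = int(fs * seg_ms / 1000)   # samples per repeated segment
    hop = fs * advance_ms / 1000    # read-point advance, in samples
    out, pos = [], 0.0
    while int(pos) + seg <= len(signal):
        out.append(signal[int(pos):int(pos) + seg])
        pos += hop
    return np.concatenate(out)
```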

In Figure 23 you see the spectrogram of the first
phrase. This is 5,000 cycles full scale; the bandwidth is
300 cycles. The time scale is given along the abscissa
with the phonemic transcription. There are several
interesting things about this that I want you to listen
for and watch. In the first word we have the transition
from [ɛ] to [ʒ], from vowel to palatal fricative. We
are all familiar with the dialect pronunciations of words
like these: [mɛʒɚ], [plɛʒɚ], [trɛʒɚ], and so on. If
you'll remember back to the original recording as you
heard it, I rendered the word as [mɛʒɚ]. In going
from [ɛ] to [ʒ], you'll see formant 1 and formant 2
going to the [ʒ]. Listen for that transition even though
you did not hear it in the original, and even though it
had no significance. When you get to the last element
of the first word, you will notice clearly that this is an
[r] all the way through. We no longer have to argue
about that, it's an [r] vowel in general American, and

image mɛʒɚ hwaɪ ðæt pɑsṃ

Figure 23

it's an [r] all the way through, not a schwa at all. In
this glide, symbolized as [hw], you notice that there is
definite noise added to the second component; it is not,
in short, the sound [w]. Quickly then you get the
diphthong. Notice that there is no discontinuity in the
transitions around the [ð], but that there is a durational
acknowledgment of its position. You come over here
and you have two voiceless stops juxtaposed in the words
that possum. Notice how the speaker handles these and
the lack of aspiration. You will, however, get aspiration
of the [p˃] of pike and you can see it on the spectrogram.
The voiceless fricative, [s], of possum shows
high-frequency harmonic energy which spans the voiced
elements surrounding it. Then notice the [m]. This is
not truly a phonetically syllabic [ṃ], although it is transcribed
as such. You will observe the spectral suspicion
of a very brief schwa preceding it.

All right, in Figure 24 we'll look at the next phrase,
“views a boy will Ruth.” At the outset we can see the
[ju] glide. Notice the schwa. Nobody up until this
time has heard a schwa prolonged, it being by definition
a free unstressed and unprolonged vowel. But here you
see an expanded schwa. In the transition from [b] to
[ɔɪ], you will notice the labial locus to vowel transition,
to use Delattre's terminology. Notice the rise in
formant 1 and the fall of formant 2 as you go from
the voiced stop to the vowel. Again, notice the off-glide
on boy. When we get to will Ruth, notice that [r] to
[u] is a transition between continuants. This is perhaps
moot, but I classify [r] as a continuant rather than as
a glide. At the end, notice how the voice perseverates
into the theta.

In Figure 25 is the phrase “each awful gay cushion.”
Here is a very interesting thing. I wish Paul Moore
were here because this is an example of an expanded

image vjuz ə bɔɪ wɪl ruθ

Figure 24

image itʃ ɔfḷ geɪ kuʃṇ

Figure 25

glottal attack. You can see the aperiodicity at the start
of the vowel [i] in the first few waves, can you not,
and you'll listen for it because it's not very long, even
expanded by a factor of 50. Now there's an interesting
phenomenon here at the transition from [tʃ] to [ɔ]
that is really a reverse of what we heard in measure.
You'll hear [tʃ + ɔfḷ]. See, we've got the palatal position
for the [tʃ] going down for the vowel post-consonantly,
producing this glide-like effect in reverse. The
[ḷ] is a true syllabic consonant — listen for it. In gay
you'll notice the rise of formant 1 and the fall of formant
2. The [k] of cushion is the only stop in the sentence
that displays good aspiration, as can be seen in
the spectrogram. In cushion we have a very interesting
example of going from [u] to [ʃ] where we get regressive
assimilation of the [ʃ] affecting the articulation
of the [u]. Progressive assimilation is also present in
that the voicing of the vowel perseverates into the
consonant. I think that there is an intrusive schwa.
I'm not sure, but it looks like it from the spectrogram,
[kuʾʃṇ].

Then in Figure 26, the fourth phrase, “young Joe
now heard,” the initial [j], of course, represents a glide
down to the vowel from the palatal position. Note the
off-glide of [ou] and its relative lack of importance.
Note the continuous phonation right on through the [h]
accompanied by its characteristic aspiration. The [ɝ]
of heard is a stressed [r] vowel; notice that this is [r]
all the way through. Notice the final [d] and the fact
that it is not released at the end, articulated but not
released. You'll notice a very interesting thing here.
You see how one formant goes up and the other one
goes down as we go from [ɝ] to [d]. As we get
toward the end of phonation, the fundamental, as is
indicated by the wide spacing of the vertical bars, is
dropping rather rapidly. As a matter of fact, the fundamental

image jʌŋ dʒou nau hɝd

Figure 26

goes down to about 40 cycles, and on down progressively.
While the fundamental can be heard to go
down, you can also hear a rise in the harmonic energy
accompanying it in a rather interesting way that I have
never heard before.

I can't refrain from airing an observation that I think
is valid and which relates to one of my prejudices. One
of the interesting things about the utterance of a sentence
is, of course, the continuous change. The spectrogram
shows a picture of continuous change, and the recording
shows continuous change; but if you will reflect
upon it for a little bit you will realize that perhaps more
interesting are the long periods when there is no change.
There are a good number of instances when there is
really what amounts to rather long, steady state situations
in speech. Now the reason that I say that this
relates to one of my prejudices is that I think that you
should reserve judgment on some of the implications of
the work that is coming out now. I believe that the
history of experimental phonetics (and I hope that this
is being recorded) will show that the sounds of speech
as phonemic entities are targets, and that they exist by
virtue of their realization of the target. The movements
from one sound to the next, from that sound on to the
next sound, are information-bearing. However, these
transitions are secondary manifestations of the movement
from target to target and are not in themselves
the target. In other words, if we talk about a syllable
composed of consonant-vowel-consonant nuclei, the consonant
is the target, the vowel is the target, and the
other consonant is the target. All three of these things
are targets. Now the movements impart information,
but the movements are not of the essence of speech.
The targets are the essence of speech. It's the targets that
permit us to have a code that we can reorder in various
kinds of positions. This I say happens to be one of my
prejudices, and I think that when all the data are in
they will show that both of these kinds of information-bearing
elements in speech will play a role. I say this
because I think we need a bit of antidote to some of the
rather wild enthusiasm of movement-articulation studies
that have come out in the past few years.

I would like to report one final experiment in this
program that relates to the problem of compressing a
message, transmitting it in its compressed form, and
then reconstructing the message at the other end of the
channel. This experiment is a little difficult to describe.
I hope I can make it clear. The general purpose of the
experiment will be to pass a signal through a channel
of limited bandwidth. After limiting the channel in
this way, we will restore it to its original bandwidth at
the other end.

Figure 27 shows the general scheme of the process.
Time is displayed along the abscissa, frequency along
the ordinate. The lines are quite arbitrary. They're
designed to show some of the changes in formants to
be expected in connected speech. For instance, the first
line of the lower figure could represent a change in the
fundamental. The next line might be formant 1, the
next formant 2, and so on. In compression you will remember
that we periodically extract a piece and throw
the next piece away. The diagram assumes that compression
is 50%. That is, the discard and the sampling
intervals are equal. If in the original message, schematized
by the lower figure, we take the first half of the
first formant, for instance, and divide it by two, and

image

Figure 27

stretch it in time by two so that the frequencies are
halved, we would have something like that displayed
in the middle figure. In other words, we now would
have a sampled and divided version in the original time.
We now have the signal displayed in half the bandwidth,
but it's in the same time. At the receiver we
have only half of the original material. But we can
use the material that we have saved, and from it reconstruct
the original message. Obviously the discard
interval has to be appropriately short so that we don't
irretrievably throw away material. The upper figure
schematizes the reconstruction which might be made
from the sampled transmission. The reconstruction is
accomplished by the simple expedient of repeating the
small segments of the material which were preserved in
the transmission. This expanded reconstruction is then
reproduced at higher speed in order to restore the original
bandwidth and the original total time. I have purposely
shown the reconstruction of the upper figure as
being rather hashed up, a situation to be expected when
restoration is based upon so little of the original signal.
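The bookkeeping of that scheme can be sketched in a few lines of Python (a schematic of the 50% case only; the tape-speed changes that divide the frequencies before transmission and restore them afterward amount to reading the same samples at a different rate, and are noted in comments rather than performed):

```python
import numpy as np

def compress_half(signal, fs, interval_ms=20.0):
    """Keep one interval, discard the next: sampling interval equals
    discard interval, i.e. 50% compression."""
    n = int(fs * interval_ms / 1000)
    kept = [signal[i:i + n] for i in range(0, len(signal), 2 * n)]
    # Played at half tape speed, the kept material fills the original
    # time with every frequency divided by two (B = 2).
    return np.concatenate(kept)

def reconstruct_half(compressed, fs, interval_ms=20.0):
    """Rebuild the original duration by playing each saved segment
    twice; doubling the playback speed afterward restores the
    original bandwidth and total time."""
    n = int(fs * interval_ms / 1000)
    segments = [compressed[i:i + n] for i in range(0, len(compressed), n)]
    return np.concatenate([np.tile(seg, 2) for seg in segments])
```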

Now the experiment.

In this experiment I used five PB lists. One talker
rendered these lists with average effort, unmetered at
a sound-pressure level of about 70 db. A carrier phrase
was used for each word. The carrier phrase turned out
to be rather successful, and I commend it to you. The
carrier phrase was: “Copy ————— in a line.” Both
carrier and word were spoken quite briskly. You notice
that the test word is bounded in the carrier phrase by
essentially the same phoneme on both ends, so that the
influences attributable to context should not be very
strong.

The sound-pressure range of the 250 words was 8 db
and the list with highest median sound pressure was only
9/10 of a db higher than the median of the list with
lowest sound pressure.

The plan of the experiment is shown in Figure 28.
The flow arrows represent the treatment conditions for
two independent groups of observers. The speech was
first passed through the compression device for Group
1, then through a 5 KC low-pass filter, an expander, and
finally a .3 to 5 KC band-pass filter. For Group 2 the
signal was not processed in this manner; it was simply
transmitted directly into the channel, cropped at the upper

image compression | low-pass | channel | expander | band-pass

Figure 28

end by the low-pass filter, passed through the channel,
and presented to the band-pass filter at the other
end. Both filters had 36 db per octave slopes.

Essentially, our problem is to compare the performances
of the two groups. But first it is essential that
we define a couple of units. I think that everything is
clear with the exception of the symbol B. B is the bandwidth-reduction
factor, the amount by which the signal
is divided in compression or multiplied in expansion. It
is also equal to the reciprocal of 1 − Rc, that is, B = 1/(1 − Rc), Rc
being the compression ratio. In the uncompressed case,
B has the value of 1 and the channel bandwidth is 5K/1,
and that was the value of the low-pass filter across the
channel. In the second instance B was 2, the value of
the low-pass filter was 2,500, in the third case it was 4,
with a low-pass filter of 1,250, for B = 6 the filter was
833, and the final case, where B = 8, the filter was 625,
for a total of five different channel bandwidths.
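Those cutoffs follow directly from the definition of B, as a two-line check will confirm (a modern aside, not part of the original procedure):

```python
# cutoff = 5000 / B, with B = 1 / (1 - Rc)
for B in (1, 2, 4, 6, 8):
    Rc = 1 - 1 / B
    print(f"B = {B}: Rc = {Rc:.3f}, cutoff = {5000 / B:.0f} cycles")
# -> 5000, 2500, 1250, 833, 625 cycles
```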
In Group 1 we're going to divide the frequency so that it
will duck under the upper limit of the channel bandwidth.
In this case the sampling frequency that was
used was 20 cycles for compression, and 90 cycles for
expansion. One PB list was assigned to each of these
five conditions. The procedure was as follows: In each
of groups 1 and 2 there were 20 subjects. They were
divided into five subgroups of 4 subjects each, according
to a Latin Square design, in which five different orders
of presentation of the B value conditions were employed,
one order for each subgroup. The constraints of such a
design are that in any given row or column of the Latin
Square, one, and only one, of the five experimental conditions
will appear. Each experimental condition is thus
presented to each subject only once. Each experimental
condition occurs in each serial position only once. Each

image relative word intell. | compression factor

Figure 29

subject heard each word only once, and he heard all
the words from the five lists of vocabulary. The words
were presented at about 65 db sensation level, calculated
on the basis of the median word; the responses were written
down; and the administration was by means of headsets
in subgroups of four. The design was used for both
groups.
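For those who want the design spelled out, a cyclic 5 × 5 square satisfies these constraints; the sketch below shows one such arrangement (illustrative only, not necessarily the square drawn for the study):

```python
# Rows: the five subgroups of 4 subjects; columns: serial positions.
# Entries: the five B-value conditions. Each condition appears once
# in every row and once in every column.
conditions = [1, 2, 4, 6, 8]
n = len(conditions)
square = [[conditions[(row + col) % n] for col in range(n)]
          for row in range(n)]
for row in square:
    print(row)
```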

Figure 29 is a plot of the probability of word intelligibility
vs. the compression factor B, the amount by which
the signal was divided or by which the channel was
cropped. The plot summarizes the data obtained from
Group 1. As you see, the empirical points are well fitted
by a straight line with a slope of .12 per unit of B. These data suggest that
factor B, in this particular use, is essentially an infrasonic
high-pass filter system. That is, the cutoff frequency
of this infrasonic filter is proportional to the
value of B, and we see that intelligibility falls as the cutoff
frequency rises in this infrasonic range. The reason
I say this is that we are getting longer and longer values
of Id, so that we are starting to throw away significant
chunks of the speech. The other reason I say this is that
we're rolling the lower end of these frequencies off the
bottom of the system as we divide them on down, and
once we roll them off the bottom of the system we don't
get them back. So essentially, the data show the effect
of a high-pass filter operating at very low frequencies,
even though the channel is limited by a low-pass filter.

The unbroken line of Figure 30 will show how intelligibility
varies as a function of the low-pass filter cutoff.
Across the top of the figure are the frequency values
resulting from the division process that correspond to
the bandwidths employed. The solid curve is similar to
the old work that the Bell Laboratory has reported and
that Hirsh and his followers did on the effect of filtering

image relative word intell. | relative bandwidth | low-pass filter | bandwidth compression | low-pass cut-off

Figure 30

on word intelligibility. You'll notice that the curve rises
in a fairly familiar way at about 1,000 cycles, yielding intelligibility
of 50%. Those of you who are familiar with
this kind of a function will recognize that some of these
scores are somewhat higher than are often found with
low-pass filter systems. The reason for these higher
values is that it was a condition of the experiment that
the level be raised on up to the original level for each
filtering condition. If you'll study the old orthotelephonic
response curves you will realize that as they lower
the frequency of the low-pass filter, the total strength of
the signal also decreases as the energy is cut out. In this
experiment we equalized the signal level because we
wanted to simulate a system in which such a condition
would be assumed. It might be argued, in fact, that this
is the way to study the effects of filtering on speech. At
any rate, the shape of the function is typical.

The broken line of Figure 30 is derived from the previous
figure but converted now to frequency. From the
proximity of the two curves, it seems that we're doing
about the same things in these two systems. The one is
a low-pass system and the other is essentially a high-pass
system. In the compression reconstruction the highs are
not lost; we divide them down and they pass under the
upper end of the low-pass filter and then come on back
up at the other end. Now there are a lot of interesting
things that one can say about this. First of all, we know
that the intelligibility lost from these two different kinds
of effects is not additive. In other words, if we put a
signal through a low-pass filter and we lose 25% of the
word, and we put it through a bandwidth-compression
system and we lose 25% of the word, the sum of these
is not the proper statistic to express how they would
operate if jointly employed. Not only do we think that
they are theoretically independent but the empirical data
obtained from many compression studies in which material
was cropped from 2,000 cycles low-pass up to
10,000 cycles low-pass also show the effect. If the two
effects were independent, it follows that cascading the
two should produce a probability of item identification
for the over-all system which would be approximately
equal to the product of the probability of the low-pass
system times the probability of the bandwidth compression.
This gives us something to test, and we would propose
to construct an experiment based on the proposition
that the effects of these two methods of compressing
the signal are not additive, but multiplicative.

We'll start with a low-pass filter, and in developing
this illustration B will be equal to 4; I'll just assume the
frequencies will be divided by 4. The highest value of
the low-pass filter will again be 5,000 cycles because that
seems to be a useful bandwidth for intelligibility. We're
going to divide this by B, the bandwidth factor, which is
to say we're going to multiply it by the reciprocal of B,
and if we do this, this filter will yield an output that has
an upper limit of 1,250 cycles when B = 4. In the previous
work in which we transmitted this signal through
the low-pass filter for Group 2 at 1,250 cycles, the obtained
intelligibility was 66%. The obtained intelligibility
for Group 1 exposed to the bandwidth-compression
condition when B = 4 was .66 as well, as you will recall.
Now, if our theorem holds that the net probability of
the combined system is the product of the probabilities
of the two separate systems, then we should expect a net
probability of approximately .44 when B = 4. For every
combination of the first study in which the systems were
independent, this kind of prediction was made for the
combined systems.
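The prediction itself is a one-line product; for the B = 4 case just described:

```python
# Independence assumption: p(combined) = p(low-pass) x p(compression)
p_lowpass = 0.66    # Group 2, 1,250-cycle low-pass filter (B = 4)
p_compress = 0.66   # Group 1, bandwidth compression at B = 4
print(round(p_lowpass * p_compress, 2))   # -> 0.44
```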

image relative word intell. | combination | band. comp. | L-P filter | channel bandwidth

Figure 31

Figure 31 displays the predicted intelligibility function
for the combined system. Now you'll notice several interesting
things about this combined system. First of all, we have
doubled its slope. The original slopes of the separate
systems are about 25% per octave. We now have a slope
that is of the order of 50% per octave. The second interesting
thing is, if we were to use 1,000 cycles as our
arbitrary channel bandwidth, we would find that instead
of getting 50% intelligibility out of a 1,000-cycle bandwidth
channel, we would get something closer to 80%
intelligibility out of that same channel in the combined
system. If we want to use our familiar criterion of 50%
intelligibility, the curve for the combined system predicts
that we will find that instead of needing about
1,000 cycles of bandwidth, we will only need about 350
cycles of bandwidth. If we come up higher to a useful
value of intelligibility such as 75%, we will find 75%
intelligibility at a bandwidth of approximately 750 cycles
instead of at a bandwidth of 1,400 cycles in the separate
systems. I wish I had completed the experiment but
I haven't, so we don't have the empirical validation.
However, the theory is the important thing. That is,
the idea that if we operate on the signal in a combined
way with modes that are truly independent, then the
loss will not be additive but, instead, will be derived
from some function of the product of the separate probabilities.
This is the basic proposition, and I will be very
interested when I get the time to perform that experiment
and see if it works.

Well, now I've tried to show you in these sessions this
morning some samples from a general area of research
which has interested me. This is not all of the work
that has been done by us in this area, but it exemplifies
some of the different kinds of factors that have been
approached. As I started out by saying, I repeat, at this
moment I am less interested in the factual results than
I am in some of the conceptions of the experimental
attacks, valid or invalid, right or wrong, fruitful or unfruitful.
This is dirty, hard-nosed research in the laboratory,
and it's a damn good example of what you're
up against when you try to run a program of research on
one general topic.

Reflections on the Scientific Study
of Speech

This afternoon I should like to spend some time on the
more general implications of our science, its justification,
direction, and goals. As I understand the concept of this
seminar, these matters are as important as is the discussion
of empirical results such as we have been doing up
to this point. All of you at some time or other are going
to be forced to defend your worth both to the society at
large and to the scientific community in particular.

So I'd like to talk with you for a little bit about some
of the kinds of things that are going to come up and give
you a little ammunition to use, a few things to cite, and
a few principles. Some of these I learned the rather hard
way, and some seem to be rather obvious and true, on a
priori grounds.

If I sense modern education in the communication sciences
correctly, it's getting so full of techniques and
methodology that we are losing sight of the fact that
we're not training research technicians, we're trying to
train scientists. I think, perhaps, that some of the methodological
niceties run away with us. I well recall, when
I was editing the Journal of Speech and Hearing Disorders,
that I received a manuscript in which an elaborate
factorial analysis of variance was performed and yet
contained only a single mean of the main dependent
variable in the whole manuscript. The statistical treatment
apparently so preoccupied the author that he
turned in the statistical treatment and not the data. The
history of science indicates that we proceed in general by
our problems; our techniques are not always in phase
with these problems and sometimes there will be a breakthrough
in technique, and then the problems will themselves
involve exploiting the technique. On the other
hand, a problem will sometimes arise for which techniques
are not yet available, and then the techniques
become the object so that they may be put to work on
the problem.

I fear, however, that we are often guilty of clothing
our problems in the trappings of science, and deceiving
ourselves with the belief that this is science since it
sounds like science. There are many brilliant, irrational,
quantitative attacks on very important problems that
have obvious solutions which are only clouded by trivially
complicated approaches.

You are going to find that there are some pretty serious
problems of understanding in communication with
the people you are going to be dealing with. Some of
you are going to be working quite intimately with people
connected with speech correction or clinical audiology
or public speaking, clinical psychology, or matters of this
kind. These people have their own raison d'être; they
are important. They believe in their own material, they
are justified in that belief, but they get impatient with
you because communication and understanding do not
always come about so readily between groups of this
kind. So a scientist, I think, ought to understand some
of the characteristics of science, and some of the characteristics
of scientists.

First of all, we have to understand that science is
essentially a slow process. For instance, we have been
able to generate radio waves for approximately 60 years,
yet in 1952 with the aid of a radio telescope, a radio
source 100 million years old was discovered. Science is
slow — it takes a long while.

The second thing I'd like to draw your attention to is
that the need for knowledge, the needs of the consumer,
and the knowledge are often out of phase, and the disparity
is sometimes productive of misunderstanding. It
is a common situation to have the need for knowledge
felt before the knowledge exists, and if you've worked
with speech disorders, you will find that speech correctionists
are forced to provide solutions where solutions
do not yet exist. On the other hand, there are many
times when knowledge is gained before the need, and
then there is a misunderstanding because you provide
the answer to a question the user has not yet asked.
There's nothing more lonely than an answer to a question
that no one thought of asking. We have to realize that
the need for knowledge cannot always be foreseen. Yet
we need the stockpile of knowledge on which to draw.
In 1625, when the velocity of sound was first measured,
nobody had the idea that this would become the unit of
speed for a jet or a missile. When Newton was working
on his theories of gravitation, he did not contemplate the
problems of orbiting satellites. In 1850, when Mendel
was working on heredity, he was not contemplating the
effects of irradiation on astronauts in the Van Allen Belt.
Nor even as late as 1870, when Clerk Maxwell was
working on matter and motion, was he thinking about
atomic fuel as an objective. In other words, much of
our knowledge is information without immediate use.
Yet often as a scientist you will be asked to justify your
activity in terms of immediate application. The answer
is that immediate or even delayed application cannot be
the sole criterion of the worth of the scientist's activity.
Answers before questions and questions before answers
are inherent to the process of science.

Another thing that characterizes scientists, and particularly
behavioral scientists, is that the scientist is fundamentally
concerned with law, that is, statements of functional
relationship. On the other hand, the kind of
people that you are going to be dealing with, speech
correctionists in particular, are concerned not with laws
but with cases. The scientist is usually concerned with
the central tendency of the phenomenon and the law
that bears on it, while the clinician, in overly simplified
terms, is interested in individual behavior. A scientist is
concerned with the ways in which individuals are similar.
A clinician is concerned with the ways in which men are
different. There can be quite opposite directions of
interest in the same set of facts. The speech scientist may
welcome a bit of data because it illustrates a law, while
to the speech practitioner that same data might be useful
because of its diagnostic value.

Another set of communication difficulties arises because
of lack of meaningful references. If you went into
the upper reaches of the Amazon and talked about the way
the teams stand in the National League, you would
hardly expect the communication to be successful. Nor
is it communicative if you say the damping constant of
formant 2 of the vowel /i/ is approximately 957 decibels
per second to someone without proper background; your
listener either thinks that you're stupid or that you're
abstruse for no reason, or that you're offensive.

The user will expect you to interpret your data, translate
the words into his concepts, point out to him what
is significant, and peddle it to him. On the other hand,
the scientist, if he wants to be stuffy about the matter,
will take the point of view that the facts are here, we've
grubbed them out, you come and get them. Obviously
the solution is somewhere in between. That is, we have
to seek some kind of middle ground, we have to search
for a way of communicating what is acceptable to the
user, but, at the same time, operate without doing violence
to integrity. The point that you want to satisfy for
yourself is whether or not you are telling a lie when you
translate a difficult and technical concept into lay language,
or even if such translation is useful. The problems
of the nature and responsibilities for communication
are always with us. I think that there is general misunderstanding
and, in fact, in many cases I would say,
mistrust of the motives and activities of the scientist. I
think that some feel that scientists are forbidding, inhuman
characters; that they are cold, logical, lofty birds;
unworldly, impractical, hard, tough, and ungentle people
who don't give a damn for anything but the numbers
that they can grub out of their data. I think this is the
kind of image that people will often impute to you, and
I think that you have to be prepared to refute such an
image. I am afraid for those of you who teach. I am
afraid that our young graduate students graduating with
fresh Ph.D.'s often purposely clothe themselves in such
an image for its own sake. They mistakenly impute causation
to some of the correlated attributes of good scientists.
Many years of leavening, mellowness short of
decay, is a good cure for such posturing.

What are the attributes of scientists which characterize
them and their activities? What motivates them? What
interests them? What kind of birds are they? To start
with, I think that it would be fair to say, generally speaking,
that a scientist is characterized by a desire for order
and logical relationship between phenomena. He is looking
for relationships and explanations; he is looking for
predictions, in the mathematical sense. The invention of
the calendar is an example of the orderly, systematic
characteristic of science. It is characteristic in that it
both frees and binds man to time and permits him to see
the future. The calendar represents the development of
certain laws of natural phenomena and the fitting of an
explanation to them. Their orderly arrangement enables
man to predict what is going to happen next year, next
month, next week, and tomorrow. This is one of the
things that differentiates man from other animals.

When we speak of science, we talk a lot about gathering
data, about experiments as formal activities of science,
and about laws and their relationships, about hypotheses,
about theories, about postulates, about all these kinds of
things that are the stock-in-trade of the working scientist.
I think people ought to understand that when you talk
about gathering data, you might just as well say that
what you're doing is looking around, and when you say
that you're doing an experiment, you're trying something
that usually doesn't work; that's why you do it.
When you're talking about laws, you're trying to develop
grounds for making guesses. That is, statements of relationship
are the basis for making guesses. A hypothesis
then is simply a guess; and the familiar null hypothesis
is a very special guess; it's a guess that something is nothing.
This is a very interesting kind of thing to test, you
see. A theory then becomes a statement of policy, an
integrated set of good guesses, if you like.

A postulate is a wished-for fact. As Bertrand Russell
says, “A postulate has all the advantages of theft over
honest toil.” But such is the stock-in-trade of the scientist
as he gropes for order and logical relationship.

The second thing that characterizes a scientist is that
he is challenged by a puzzle. I think that this attribute
is often undervalued. I think that this particular kind
of intellectual challenge is one of the basic things that
makes a scientist. It differentiates between a scientist
and a scientific technician. The scientist is also motivated
by the very human traits of curiosity and adventure. A
recent President of the American Association for the
Advancement of Science has said: “… to wonder and
to wander lead upward in the trend of life. When man
ceases to wonder and to wander, from necessity or from
choice, he ceases to ascend the scale of living beings. As
the geographical frontiers are passed, the value of man's
spiritual adventures increases. The desire for security
and the suppression of curiosity inhibit the intellectual
and spiritual development of man.” As scientists we want
to recognize that we are motivated by such concerns and
not merely the tables of numbers and graphs which are
the means to that end.

There is an extraordinary amount of sheer aesthetics
or beauty in science. We often speak about a beautiful
theory, where we mean beauty in the purely aesthetic
sense. We refer to a thought as being elegant, and when
we do, we mean that it has a kind of parsimonious inevitability
about it, not to be tinkered with any more
than one would tinker with a painting. That is, it has
an essential rightness, a unity and beauty in and of itself
that is extremely valuable and is much to be desired.
Science is extremely creative and extremely personal. It
has a lot to do with the ability to recognize the unusual.
For instance, Burke, one of the fine quantitative theorists
in psychology, has said in a paper on the one-tailed test
that experimental psychology can hardly afford such
lofty indifference toward the unexpected result. This is
an elegant statement. If we're statistic-bound, we're
incapable of seeing what is unusual when it may appear
right in our data. We must be very aware of the necessity
for the creative and personal in science; imaginative
deftness is required if we are to seek out the unusual, to
ask the right question in the right form, and to recognize
an answer. This is where discovery comes from, not
from manipulation of data and not from computer programing.
To this we must also add the notion of luck in
science. Discoveries are often accidents of circumstance,
and yet luck alone will not assure recognition of a discovery
when it appears. What often passes for luck in
science is the application of educated intuition and insight.
The history of science abounds with instances of
flashes of insight as the basis for discovery.

The scientist is also characterized by pride: pride in
his work, pride in his product, pride in his integrity —
but pride without vanity. In sum, a scientist is a very
human blend of stuff — logic, adventure, art, curiosity,
creation, intuition, luck, pride; all of these characterize
man in general and the scientist in particular.

In a realistic world, you are often going to be required
to write grant proposals and make recommendations to
administrations; you will be faced with the problem of
the practical outcomes of your pure research. We have
recently had a large number of very dramatic practical
outcomes of pure research. The development of artificial
satellites, for example, can be traced back into experimentation
with exotic fuels and high-velocity engines.
It was very interesting to me to notice in 1957, when
the Russians put up the first Sputnik, that the research
station at the University of Manchester tracked this
Sputnik with a radio telescope which was designed for
pure astronomical research. Warren Weaver, writing in
Scientific American on science and the motivation of
scientists, has said that pure science is intensely practical.
The whole of man's experience has demonstrated that
the practical results required for tomorrow depend essentially
on the impractical key of today's curiosity.

The practical outcomes of the work of pure science
impose a moral responsibility on the scientist both in
their use and in their need. The population explosion,
for example, as a problem demands the attention of the
scientist. The by-products of this explosion, crowding
and the development of immune strains of bacteria and
the consequent epidemics, make medicine and medical
activity of cardinal importance. The improved techniques
for prolonging life and the inherent unproductiveness
of the aged require the scientist's attention. Our
food supplies are not unlimited. We're at the mercy of
the weather and we're also facing the gradual erosion of
both the population and the lands where our foods are
produced. As scientists we have some strong responsibilities
in the search for answers to such problems.

Our coal and oil deposits are disappearing. It has
been estimated that we have about 200 years more of
fossil fuel. At the present time our methods are so inefficient
that it costs us one ton of coal for every eight
that we mine. It is estimated that within 100 years, as
the inaccessibility increases, it will cost us one ton of coal
for every two tons that we mine. Obviously then we must
seek, and seek rapidly, other sources of fuel. Our problems
of international relations demand the attention of scientists,
not only as moral citizens, but also as scientists
who can contribute the basic understanding and knowledge
necessary for their solutions. Then there are the
problems of social injustice: the problems of Little Rock
and Birmingham. We must find some solution based upon
the rationality of science rather than upon emotion
or power politics.

In all of these ways you see that science is an exceedingly
practical activity. But still people are going to want
to accuse you of wanting to sit in your little laboratory
and play your little games and ask the public to pay you
to indulge in this process. You're going to have to explain
to people ways in which something that is remote
from the use creates a backlog on which the users can
draw. Yet I think it is a dangerous bit of activity to
overstress these practical outcomes. I don't intend to
defend what I do on the grounds that it is practical,
even though I believe that it is practical. If we're always
impatient for a quick and dirty result in research because
somebody is sitting over us asking for something
that they can use, well, we're not truly engaged in science.
I think instead that we should realize that it is
our business to try to operate in a way that will meet
needs that we cannot foresee at the present time, and
that the accretion of scientific knowledge through the use
of the experimental method is a justifiable activity for
its own sake. Chester Barnard, once the Chairman of the
National Science Foundation, has said this about as
beautifully as a scientist could say it: “Science is a value
to be cultivated for its own sake, not necessarily or
chiefly for utilitarian purposes. The curiosity, the initiative,
the imagination, the persistence, the patience, the
frustration that must be experienced and endured in science
cannot be adequately motivated by the current exaggeration
of the usefulness of science, but they must be
founded on the belief that all this toil is justified as an
expression of the superior faculties of mankind, as a contribution
to man as a whole.” Science is the creature of
man. It exists to serve man. But there are other ways of
serving man than building a better telephone. Sir Edward
Appleton, the famous astronomer, has said it in
somewhat different words: “We should be misleading the
public as well as ourselves if we based our case for the
general support of the pursuit of science on its utilitarian
aspects alone. I know that we can claim that many discoveries
in pure science, which in their time have no
practical importance, have later proved to be the foundations
of major improvement in our material civilization.
But even that is an argument of profit and loss and to
my mind does not bring us to the heart of the matter.
I should like to go beyond the achievements to the example
of the scientist, be he amateur or professional,
who is impelled by a passionate desire to explore, and to
understand.” That is what I mean by science for its
own sake, when knowledge and insight are sufficient
rewards in themselves. Neither man nor the scientist
need apologize for honoring such a goal.

Finally, I want to quote from a great American who
was a great practical person, namely, Benjamin Franklin:
a politician, an inventor, a printer, a philosopher, a great
statesman. In 1779, during the war, Benjamin Franklin
wrote a letter to the captains of all the U.S. naval vessels
then seeking combat with the English. This is part of
what he wrote: “Gentlemen, a ship was fitted out from
England before the commencement of this war to make
discoveries in unknown fields, under the conduct of that
most celebrated navigator and discoverer, Captain Cook.
That is an undertaking truly laudable in itself, because
the increase of geographical knowledge facilitates the
communication between distant nations and the exchange
of useful products and manufactures, extends the
arts, and benefits of other kinds are increased to mankind
in general. This, then, is to recommend to you that,
should the said ship fall into your hands, you would not
consider her as an enemy, nor suffer any plunder to be
made of the effects contained in her, nor obstruct her
immediate return to England.” In short, here was a
practical scientist, acting as a statesman, in respect of a
truly important scientific objective which can override
and transcend a specific quarrel. The best attributes in
man are the best attributes of science; man and scientist
are inseparable.

More Practical Outcomes
of Abstruseness

I next want to address myself to another practical outcome
of abstruseness 1 and refer to an index which for
the moment let us call an Index of Speech Efficiency.
The index was contrived by Newman Guttman and
myself and has not been published, I'm sorry to say.

I want to refer first, however, to an article that I published
in the Journal of Speech and Hearing Disorders
in 1955, called “Selective Vocal Effects of Delayed Auditory
Feedback.” In this article an experiment is reported,
the design of which involved 16 subjects, each of
whom read the Rainbow Passage. The subjects read the
passage under seven different conditions: one without
a headset, two with a headset and amplification, and
then under four time intervals of delayed auditory feedback,
.1, .2, .4, and .8 second. The undelayed conditions, the
one without headset, and the two pre and post with
headset, were always in that serial position. Thus, for
the four delayed conditions, plus the two readings with
the headset, 96 total samples from 16 subjects were derived.
Now in reporting this paper I used certain curves,
but among them are not the curves that will appear in
Figure 32.

image severity rating | correct word rate (n/sec.) | duration (sec.) | correct words (n) | time delay (sec.)

Figure 32

The abscissa of this figure displays time delay in
seconds ranging from .1 to .8 geometrically scaled. The
number of correct words from the 55 words in the Rainbow
Passage is displayed along the ordinate of the uppermost
plot. You'll notice that the curve descends to its
lowest point at .2 second time delay and that some reduction
in number of correct words occurs at all delay
intervals. In the second plot displaying obtained duration,
which for the Rainbow Passage under normal
conditions is 17.2 seconds, the curve rises to a maximum
at .2 second and then again falls. The correct word rate, which is
the number of correct words divided by the duration, is
shown in the next curve, the expression being in correct
words per second. The lower plot of the figure shows
the mean severity ratings made by skilled judges which
I will describe in more detail. Judges rated the severity
of disturbance on a five-point scale whose positions were
numerically identified as 0, 1, 2, 3, and 4, and described
to them as equal intervals. We see that the severity of
disturbance bears a close and inverse relationship to the
correct-word rate. We find then that neither correct
words nor duration, although highly intercorrelated, is
as good an expression of the judged severity as was the
correct-word rate.

On that basis then, in my article I suggested an Index
of Articulatory Efficiency, which would have the following
form. The index of efficiency would be taken as
equal to Wc, the number of correct words, times Dn, the
normal duration for the passage in question, which I
said was 17.2, over Wt, the number of total words in the
passage, times the obtained duration, Do.

Ie = (Wc · Dn) / (Wt · Do)

The index will range from 0 for minimum efficiency to 1
for maximum efficiency.
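In computational form the index is a one-liner; the sketch below (mine, not the original computation) uses the Rainbow Passage constants given above, and the disturbed-reading values in the example are hypothetical:

```python
def index_of_efficiency(Wc, Do, Wt=55, Dn=17.2):
    """Ie = (Wc * Dn) / (Wt * Do): 1 for an error-free reading at
    normal duration, falling toward 0 as errors accumulate or the
    reading slows."""
    return (Wc * Dn) / (Wt * Do)

print(index_of_efficiency(Wc=55, Do=17.2))   # -> 1.0
print(index_of_efficiency(Wc=40, Do=25.0))   # ~0.50, hypothetical reading
```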

Let us now turn to the judgments made on these 96
samples, which were, of course, arranged in random
order and in which the judges were rating the degree
of disturbance. The scale was linear from 0 to 4, 0 indicating
undisturbed speech and 4 signifying maximal
disturbance. The judges were seven instructors, and
before the ratings were made, approximately 15 samples
of the readings were taken at random as practice
samples.

In addition we took many physical measures of the
performance. I'll not review all of these in detail, but
they involve such things as the variance in relative sound
pressure, the mean fundamental frequency, the standard
deviation of the distribution of the fundamental used,
the mean extent of inflection, the ratio of phonation time
to pause time, rate, duration, and measures of articulation.
Now the coefficients that were particularly important
in the development of this index were as follows.
Judged severity correlated .67 with obtained duration
and -.80 with number of correct words. In other
words, the higher the judged severity the longer the
obtained duration and the greater the number of errors.
I may say that correct words are judged; that is, they are
judged by listeners using a liberal standard. The distinction
is important. If you listen for error words, you'll
hear them behind every bush, but if you listen for words
that are correct by any legitimate standard, then you
have a set that is appropriate to this measure, whose obtained
reliability is high and within the range of .86 to
.95.

Using the multiple regression model, we then proceeded
to operate on these zero order coefficients between
judged severity and all of the physical measures.
I'll not give you all the details of this analysis because we
found that as we added factors, as is often the experience,
the point of limited return was quickly reached.
But we did find one multiple R, based on the measures
Do and Wc, which predicted the criterion of judged
severity quite well, the value of R being .86. Now you'll
recognize that this R is a rather decent improvement
over the duration or correct-word indices taken separately.
We then used the ratio Wc/Do or correct-word
rate as the predictor. The obtained correlation between
the correct-word rate and the judged severity of performance
was -.89. In other words, when you put these
two variables together in this particular way, the prediction
is improved over that obtained from the multiple
correlation. I'm not quarreling with anybody about the
significance of the elevation, but the fact is that this is
easier to do than the multiple. In other words, the
greater the degree of severity of disturbance, the lower
is the correct-word rate.

Figure 33 displays the obtained scatter of points when
correct-word rate is plotted as a function of judged severity.
The line of best fit to these points is probably not
linear; it would appear to be negatively accelerated.
Neither does homoscedasticity appear to obtain; the
points about the low end in particular tend to be spread.
It would seem reasonable from these observations then to
expect that some transformation would improve our prediction.
The transformation chosen was based upon both
rational and empirical grounds and involved the following.
The numerator of the correct-word-rate ratio, Wc,
clearly called for an exponential transformation, so we
squared Wc to produce a rectilinear progression. The
transformation has a very reasonable rational justification.
As you begin to approach perfection in the production
of correct words, the goal becomes increasingly
harder to obtain, and so you want to reward the obtainment
of the last few correct words by giving it a little bit
more weight, which is exactly what Wc2 accomplishes.
The denominator of the revised index we took as the expression
Dn plus the absolute difference between Dn and
Do. Now you notice the effect of this. Any change in
duration, either longer or shorter than normal, will add
to the total duration used for calculating correct-word
rate. In other words, we folded the fast rate over onto
the slow rate in the scatter. The rationale of this is that

image mean severity rating | speech disturbance index

Figure 33

some people are judged to be inferior in articulatory
ability because they're too fast, so we wanted to have an
expression that would be sensitive to this effect. We
have called the resultant measure the Index of Articulatory
Ability, taken as:

IA = Wc² / (Dn + |Dn − Do|)

Its correlation with judged severity is -.92. Now this
is extremely interesting because this means that by this
particular transformation of these two particular items
of speech we are able to account for approximately 80%
of all the variance in performance.

We may now give expression to this in terms of an
index that is applicable to other passages and to other
kinds of situations. We will divide IA by the ratio of
the total length of the passage squared, Wt², over the
normal duration, Dn. The Rainbow Passage has 55
words so that it will be 55² in the numerator, which is
3,025, and in the denominator we'll have 17.2, so that
this will be the ideal correct-word rate for a subject reading
at average rate and committing no errors. We will
then express IA as a fraction of this so that the generalized
index, IRA, will be relative to normal performance
on the passage in question, namely:

IRA = IA · (Dn / Wt²)
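Both indices reduce to a few lines (again a sketch with the Rainbow Passage constants, not the original computation):

```python
def IA(Wc, Do, Dn=17.2):
    """Index of Articulatory Ability: IA = Wc**2 / (Dn + |Dn - Do|).
    Squaring Wc rewards the last few correct words; the folded
    duration term penalizes departures from normal duration in
    either direction."""
    return Wc ** 2 / (Dn + abs(Dn - Do))

def IRA(Wc, Do, Wt=55, Dn=17.2):
    """Generalized index: IA as a fraction of the ideal value
    Wt**2 / Dn (3,025 / 17.2 for the Rainbow Passage)."""
    return IA(Wc, Do, Dn) * Dn / Wt ** 2

print(IRA(Wc=55, Do=17.2))   # error-free reading at normal rate -> 1.0
```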

The readings that the 16 subjects gave us before the
experiment proper started were presented to the judges
for evaluation in random order. The judges were required
to judge these normal readings on a scale ranging
from 1 to 9 of general effectiveness of the performance.
The correlation between IA and these judgments of
effectiveness was found to be .65. In other words, when
the undisturbed performance of people who are very

image general effectiveness | rate preference | reading

Figure 34

compact in ability is in question, we now account for
between 35% and 40% of the variance in speaking
ability. Now, of course, a study of this kind requires independent
validation since the index is derived from the
same data to which it is being applied. Although the
index has rational motivation, it is also empirical. This
validation has not yet been performed.

Yesterday I referred to the Ph.D. thesis of Hutton.
You will recall that we used different rate levels of the
Rainbow Passage and that we had a group of judges
estimate the rate, and the duration of the processed
versions of the passage in an attempt to predict these
judgments on the basis of physical measurements. I also
at that time referred to another group consisting of 10
instructors and 40 undergraduate students, a total of 50
who, because they did not differ significantly, were
pooled into one group, and who judged on the basis of
rate preference how they liked the 40 different versions
that they heard, in addition to judging the general effectiveness
of the performance. We can see the results of
these judgments in Figure 34. The eight original readings
are displayed along the abscissa with rate decreasing
toward the right. Rate preference, as you can see, falls
off when the rate gets too fast or too slow, with the maximum
preference on reading 5. General effectiveness, of
course, is highly correlated with rate preference, as can
be seen in the figure.

Figure 35 indicates what happened when we expanded

image general effectiveness | rate preference | expanded | original | compressed | reading

Figure 35

and compressed these original readings. The original
readings are displayed as dashed lines, reproduced from
the previous figure. The solid curves represent the rate
and effectiveness judgments obtained for the treated
versions of durations comparable to the original readings.
Given equal duration or rates, the figure indicates that
the judgments of the expanded or compressed original
readings behave substantially as did the judgments of
the original versions, the differences between versions of
equal rate being non-significant.

In Figure 36 the upper figure displays judged rate
preference vs. estimated rate ranging from slow, 1, to
fast, 9. This estimated rate was obtained after the manner
I described yesterday. The rate level that was judged
to be highest in rate preference had an estimated rate
on the 9-point scale of 5.1, slower and faster rates falling
off rather linearly on both sides of this maximum preferred
rate level. We tried the Fairbanks-Guttman transformation
of these data for each of the 40 rate levels
used. The results of that transformation plotted as a
function of rate preference are displayed in the lower
figure. The Fluency Index of that figure is the same
index I described a little while ago as the Index of
Articulatory Ability, which I think is the name to be
preferred. Although the index is operating on a slow
and fast reading, it doesn't quite do the job here as far
as rate preference is concerned. Rate preference apparently
requires another kind of transformation, and so

image rate preference | rate est. | rate preference index | fluency index

Figure 36

we devised another index. We take our index, which we
will call an Index of Rate Preference, as equal to:

IRP = 1 − |5.1 − Re| / 5.1

where 5.1 is the given optimum rate for maximum preference
on the 9-point scale and Re is the obtained
estimated rate on the same scale. In other words, the
index is an expression of the extent to which the obtained
estimated rate differs from the optimum of 5.1. The
lowest value of the index for a 9-point scale is .2, indicating
low predicted preference, and the highest is 1, indicating
highest preference. The index is plotted against
the obtained rate preference judgments in the middle
figure. We now see that this transformation bends the
fast rate back on the slow rate, and we also see that the
function is fairly rectilinear. In other words, this seems
to be a promising sort of a way of getting at the preferred
rate level based on a determination of the estimated
rate.
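In code, with the endpoint values just quoted:

```python
def IRP(Re, optimum=5.1):
    """Index of Rate Preference: IRP = 1 - |optimum - Re| / optimum,
    Re being the estimated rate on the 9-point scale."""
    return 1 - abs(optimum - Re) / optimum

print(round(IRP(5.1), 2))   # -> 1.0, highest predicted preference
print(round(IRP(1.0), 2))   # -> 0.2, the lowest value on the scale
```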

Yesterday I was describing the determination of estimated
rate from measured rate, and you'll remember
that we had a logarithmic function and we derived an
equation that fitted that function. I will now show what
happens when you substitute values determined from

image rate preference index | measured rate (wds/min)

Figure 37

image reading time | rate (words per min.) | too slow | doubtful | satisfactory | excellent | too fast | percentile

Figure 38

that equation for estimated rate, and transform the
resultant values into words per minute.

At the top of Figure 37 you see the same index already
given except that now we substitute 11.97 times the
logarithm of measured rate minus the intercept for Re.
In other words, we're using the equation for predicting
estimated rates instead of some empirically determined
estimation of the rate. Now if we plot our rate preference
index along the ordinate and measured rate in
words per minute along the abscissa, we have the function
shown in the figure. The function rises to its maximum
at around 175 words per minute and then declines
with increasing rate. On the basis of interpretations
provided the judges with respect to the original 9-point
scale, we may then define qualitative terms, as shown on
the right, to correspond to the values of IRP. This may
or may not be handsome, but it is an illustration of an
attempt to devise an index that will give us a way of
evaluating the degree to which a given rate is preferred.
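The substitution is easy to express, though the lecture gives only the slope (11.97) of the estimation equation and leaves the intercept unstated; the quoted maximum near 175 words per minute would imply an intercept of roughly 21.7, but the sketch below leaves it as a required parameter rather than guess:

```python
import math

def IRP_from_wpm(wpm, intercept, slope=11.97, optimum=5.1):
    """Rate preference predicted from measured rate in words/minute.
    Re = slope * log10(wpm) - intercept is the estimated-rate
    equation; the intercept must come from the original regression."""
    Re = slope * math.log10(wpm) - intercept
    return 1 - abs(optimum - Re) / optimum
```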

Figure 38 is taken from the Voice and Articulation
Drill Book
and shows the practical outcome of the experiments
on rate. This nomograph allows one to calculate
judged quality of reading rates for longer passages.

I want to show you one more figure. You'll all recognize
it, I think, and I think that in spite of its theoretical
basis this constitutes the most practical product that
Fairbanks has ever put out.

It's been a real pleasure to talk with you.

image input | controller unit | storage | mixer | effective saving signal | effector unit | input signal | error signal | comparator | motor | generator | regulator | modulator | output | feedback signals | sensor unit

Figure 39

1 The previous outcome was the Rhyme Test (Ed.).