
Fairbanks, Grant. Experimental Phonetics – T01

Systematic Research in Experimental Phonetics: *1
1. A Theory of the Speech Mechanism
As a Servosystem **2

Grant Fairbanks ***3

Experimental phonetics is the study
of the biological action known as
speaking which produces the acoustical
time-series known as speech.
Numerous biological systems are involved
in this action, but it is possible
to consider them collectively as a
single, larger, bio-acoustical system
which is a proper object of study as
such. It is this system, the speaking
system, as a system, that I propose to
discuss. While it is impractical to cite
all my sources here, I want to mention
my reliance upon the writings
of MacColl (3), Wiener (9, 10) and
Trimmer (8) in the fields of control
theory and cybernetics, and to make
special acknowledgement of the personal
influences of Seashore, Tiffin
and Travis, who originally aroused
my interest in the speaking system
almost 20 years ago.

By way of review I will first show
without discussion five diagrams of
communication systems. Figure 1 is
from Scripture (5), Figure 2 from
Shannon (6), Figure 3 from Davis
(2), and Figure 4 from Peterson (4).
Figure 5 shows Bott's (1) unpublished
speaker-listener causal series, which
has been passed on by word of mouth.
As nearly as I can determine, it must
have been formulated about 1930,
antedating the four others. The diagram,
which shows only structural
elements, does not attempt to do justice
to the complete statement.

Figure 6 shows an extension of the
Bott scheme to a two-way speaker-listener
system. Note that the brain
of Speaker 1, B1 at the left, is the
source of Message 1, M1, and also the
destination of M2, with B2 serving
analogous functions. Note also that
each speaker is equipped with a transmitter
and a receiver. Reflect that a
given receiver, such as E1, is operative
at all times, even when its related
transmitter, S1, is producing signal
intended for the independent receiver,
E2. M1, in the form that it issues from
S1 under orders from B1, is simultaneously
relayed back to B1 through
E1. In short, Speaker 1 hears himself
as he talks. In Figure 7 we divide the
diagram down the middle, make certain
adaptations, and arrive at a more

image U | B | ZN | PN | M | O | C | HN

Figure 1. Diagram of the speech system. U, the unconscious (das Unbewusste); B, consciousness (Bewusstsein);
ZN, central nervous system (Zentralnervensystem); PN, peripheral nervous system (peripheres Nervensystem);
O, ear (Ohr); C, end organ in the ear (Endorgan im Ohr); HN, auditory nerve (Hörnerv).
From Scripture (5).

image

Figure 2. Schematic diagram of a general
communication system. From Shannon (6).

image Auditory communication | talker | listener

Figure 3. Diagram of the process of auditory
communication. From Davis (2).

complete diagram of the situation at
the time S1 is transmitting.

The return of M1 to B1 has often
been referred to in such words as
auditory monitoring, and interpreted
as a sort of ‘checking up’ on what the
speaking apparatus has produced.
There is nothing wrong with this
view of matters as far as it goes, but
it seems to me that it misemphasizes
the significance of self-hearing during
speaking. It stresses the past. The
essence of a speaking system, however,
is control of the output, or prediction
of the output's future. In this
kind of system the significance of

image

Figure 4. Fundamental systems in communication
technology. From Peterson (4).

image speaker | listener

Figure 5. Structures of the speaker-listener
causal series. After Bott (1).

data about the past is that they are
used for prediction of the future.

The ‘monitoring’ interpretation also
suggests that the ear is a receiver in a
listening system rather than a component
of a speaking system. Theorists
emphasize two different kinds
of purposes for which measurements
are made by the same instruments.
Trimmer (7) illustrates this by comparing
the use of the same scales, first
to determine the unknown weight of
a watermelon and then to weigh out
exactly five pounds of sugar. In the

image speaker 1 | speaker 2 | B | S | M | E

Figure 6. Two-way speaker-listener system.
B, brain; S, speaking mechanism; M, message;
E, ear.

case of the watermelon, the purpose
was estimation of weight; in the case
of the sugar it was control of weight.
In Figure 7, E1 and B1 are measuring
M1 for purposes of control. In Figure
6 they are measuring M2 for purposes
of estimation. When I say a word and
you repeat it, your hearing apparatus
measures my word for purposes of
estimation and then your word (the
same word) for purposes of control.
When we are referring to the control
functions of the auditory signal, I
suggest auditory feedback as the
term of choice.

The speaking system does not seem
to be what is called an open cycle
control system. In open cycle control
the device that produces the output
is controlled by some quantity that is
independent of the output. Devices
such as alarm clocks, in which an
event is controlled by time, are familiar
examples. The speech synthesizers
that I have seen employ this form of
control. A deaf child, while being
taught to speak by a deaf therapist
who pursues the method of phonetic
placement with a tongue blade, is almost
entirely under open cycle control.

A closed cycle system, or servosystem,
on the other hand, employs
feedback of the output to the place
of control, comparison of the output
to the input, and such manipulation
of the output-producing device as
will cause the output to have the same
functional form as the input. The
system performs its task when, by
these means, it produces an output
that is equal to the input times a
constant. Examples of such systems
are the heating plants of our homes
and the homeostatic mechanisms of
our bodies. It seems evident that the
speaking system has at least the rudiments
of a servosystem. In Figure 8
we explore this further with the
model shown in block diagram.
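The contrast can be put in computational terms. The following sketch (in Python; the gain, the schedule, and all numerical values are illustrative assumptions, not quantities taken from the model) shows an open cycle running off a fixed programme regardless of its result, and a closed cycle measuring its own output and correcting until output and input agree:

def open_cycle(schedule):
    # Output follows a pre-set programme; nothing measures the result
    # (the alarm-clock case: an event controlled by time alone).
    return [level for _, level in sorted(schedule)]

def closed_cycle(set_point, gain=0.3, steps=20):
    # Output is fed back, compared with the input, and corrected until
    # it equals the input (times a constant of 1 in this toy case).
    output = 0.0
    for _ in range(steps):
        error = set_point - output   # comparison of output with input
        output += gain * error       # manipulation of the output-producing device
    return output

print(open_cycle([(0, 0.2), (1, 0.8)]))   # [0.2, 0.8], however things turn out
print(round(closed_cycle(1.0), 3))        # 0.999, driven onto the set point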

If we start with the effector unit,
shown at the top, we observe a motor,
a generator and a modulator connected
as shown. These are the
respiratory, vibratory and resonation-articulatory
structures, respectively.
(The model deliberately simplifies; if
it were more elaborate, the generator,
for instance, would be shown as a
multi-unit device capable of producing
various types of inputs for the
modulator, and in part located physically
within the latter.) The output
is shown by the heavy arrow at the
right. The heavy lines and arrows at
the top symbolize the effector's motor
innervation.

The sensor unit at the bottom is
so-labelled to emphasize its control
function. (If its function were estimation,
it would be called a receiver.)
Sensor 1 is the primary component
for output take-off, the ear. The output
is conducted to sensor 1 over two

image B | S | M | E

Figure 7. Elements of the control system
for speaking. B, brain; S, speaking mechanism;
M, message; E, ear.

image controller unit | effector unit | sensor unit | input | output

Figure 8. Model of a closed cycle control system for speaking.

separate channels, representing the
acoustic pathways to the ear through
the air and through the body tissues.
Sensor 2 and sensor 3 symbolize the
tactile and proprioceptive end-organs.
These supply data about the mechanical
operation of the effector, but not
directly about its output. Although
correlated with the output data taken
off through sensor 1, these data are
comparatively fragmentary. The sensor
unit relays its data to the controller
unit in the form of feedback signals.

The controller is an automatic device
that issues specific orders to the
effector. It does not originate the
message, but receives its instructions
from a separate unit not shown. We
are concerned here with a speaking
system and assume an input, although
plausible extensions along these same
lines may be made to a model of a
language system which also originates
messages.

The anatomical analogy is less definite
here than for the effector and
sensor, and my tendency is to keep it
so for the time being. This indefiniteness
does not, however, restrain us
from fruitful discussion of an automatic
controller in terms of functional
units, and, of course, we should
remind ourselves all along that this is
a model, not a replica.

While a closed cycle heating system,
for example, may be required
only to maintain a constant pre-set
temperature, the speaking system
must vary its output as a function of
time, according to instructions laid
down at the input. The output consists
of qualitatively different units
that must be displayed in a time sequence
that is unique. The selection
and ordering of units are carried on
in advance, usually for a number of
units, and represent a set of input instructions.
As speaking continues,
each set is replaced by another. As
the first component of the controller,
therefore, we provide a storage device,
which receives and stores the
input and gives off an input signal.
The number of units that it can store
is comparatively small and the time
that it will retain them is short. This
is the short persistence memory of
what we intend to say next. We may
think of this device as a tape recorder
in which instructions are stored. Its
tape drive is alternately started and
stopped, and when the tape is stationary
a given unit of instruction is repeatedly
reproduced by a moving
scanning head.
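The tape-recorder analogy can be made concrete in a few lines. The sketch below is an assumed rendering, not a specification: a small buffer holds the next few units of instruction, the currently displayed unit is read out repeatedly while the "tape" is stopped, and a trigger advances it to the next unit.

from collections import deque

class StorageComponent:
    # Assumed sketch of the short-persistence store: small capacity,
    # brief retention, one unit of instruction displayed at a time.
    def __init__(self, units, capacity=5):
        self.buffer = deque(units, maxlen=capacity)
        self.current = self.buffer.popleft() if self.buffer else None

    def input_signal(self):
        # Tape stopped: the scanning head repeatedly reproduces this unit.
        return self.current

    def advance(self):
        # Tape drive started: the next unit of instruction is displayed.
        self.current = self.buffer.popleft() if self.buffer else None
        return self.current

store = StorageComponent(["unit-1", "unit-2", "unit-3"])
print(store.input_signal(), store.input_signal())   # unit-1 unit-1
print(store.advance())                              # unit-2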

A stored unit of instruction, or input,
corresponds to a unit of output.
Each such unit furnishes what is
termed a control point, sometimes
called set point. The control points
are the unit goals of the output. The
input signal corresponding to a control
point goes simultaneously from
the storage component to the controller's
other two components, a
comparator and a mixer. The comparator
also receives the feedback
signals, as stated earlier. With the
input and feedback signals it performs
a calculation, essentially subtraction,
in which it determines the difference
between the two. At any given time it
thus yields a measure of the amount
by which the control point has not
yet been reached by the output, or a
measure of the non-accomplishment
of the control point. This measure is
termed the error signal. In the act of
speaking, the error signal, at the time
in question, is the amount by which
the intended speech unit, then displayed
in the storage device, has not
yet been produced by the effector.

The error signal will equal zero
when the control point has been
achieved by the effector. At such a
time as it does not equal zero, the
error signal provides data which cause
the effector to modify its operation in
such a manner as to bring the error
signal closer to zero. It continues with
time to modify the operation of the
effector progressively so that the
error signal approaches and finally
reaches zero.

To bring this about the error signal
is continuously fed into the mixer,
the function of which is to combine
error signal and input signal into the
effective driving signal. The latter
furnishes specific instructions to the
effector. It alters the effector's operation,
causing its output, relayed back
to the comparator in the form of
feedback signal, more nearly to equal
the input signal and thus reduce the
error signal. The reduced error signal
is then fed into the mixer, modifies
the effective driving signal accordingly,
and so on around the loop until
the error signal equals zero.
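Treating the signals as scalars, the loop just described, with its comparator, mixer, effector and feedback path, may be sketched as follows; the proportional arithmetic and the gain and lag values are assumptions made for illustration, not properties claimed for the live system:

def comparator(input_signal, feedback_signal):
    # Error signal: the amount by which the control point is not yet reached.
    return input_signal - feedback_signal

def mixer(input_signal, error_signal, error_gain=0.5):
    # Effective driving signal: input signal combined with the error signal.
    return input_signal + error_gain * error_signal

def effector(state, driving_signal, lag=0.25):
    # First-order effector: output moves toward the driving signal with a lag.
    return state + lag * (driving_signal - state)

output, control_point = 0.0, 1.0
for step in range(40):
    error = comparator(control_point, output)   # feedback signal = output via the sensor
    output = effector(output, mixer(control_point, error))
    if abs(error) < 1e-3:                       # error signal at (practically) zero
        break
print(round(output, 3), step)                   # output settles on the control point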

At such a time the first unit has
been completed and the system is
ready for new instruction. The information
that that is the state of the
system is given, we repeat, by the
fact of zero error signal. In the model
you will note that the error signal is
fed into the storage component as
well as into the mixer. In the storage
component, however, it acts in simple
all-or-none fashion to trigger display
of the next control point when it
equals zero, or to retain a given control
point when it does not equal
zero. In the tape recorder that we
imagined earlier as the storage device,
it would start and stop the tape drive.
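In code the all-or-none action reduces to a comparison against (near) zero; the tolerance below is an assumed stand-in for "equals zero" in a physical system:

def trigger(error_signal, stored_units, tolerance=1e-3):
    # All-or-none action on the storage component: advance to the next
    # control point only when the error signal is (near) zero; otherwise
    # hold the current control point.
    if abs(error_signal) < tolerance and stored_units:
        return stored_units.pop(0)
    return None

print(trigger(0.0, ["unit-2", "unit-3"]))   # unit-2 -- tape drive started
print(trigger(0.4, ["unit-2", "unit-3"]))   # None -- current unit retained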

This triggering device has an important
refinement of that basic operation.
Since the time constants of the
live speaking system are relatively
long in comparison to the durations
of steady states in the output, analogous
time constants are assumed for
the model. This being the case, the
system would have a low ceiling on
its rate of output, if advancement of
instructions were permitted only at
times of zero error signal. The comparator
includes, therefore, a predicting
device. By plotting the error
signal as a function of time during
production of a given unit, this device
continuously predicts by extrapolation
the future time at which the
error signal will equal zero. Thus advancement
of the storage component
to the next control point is not necessarily
delayed until the actual moment
when a condition of zero error signal
obtains. It may be triggered in advance
of that time by an amount, let
us say, equal to the relevant time constants.
By this means, over suitable
channels, a new input can be started
on its way toward the effector before
the previous control point has been
reached so that it will arrive there at
an appropriate anticipated time.
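One simple way to realize such a predicting device is linear extrapolation from recent samples of the error signal; the two-point rule and the numbers below are assumptions chosen only to make the idea concrete:

def predicted_zero_time(t1, e1, t2, e2):
    # Extrapolate the error signal through its two most recent samples
    # to the future time at which it will reach zero.
    slope = (e2 - e1) / (t2 - t1)
    if slope >= 0:          # error not decreasing: no prediction possible
        return None
    return t2 - e2 / slope

def should_advance(now, zero_time, time_constant):
    # Trigger the storage component one time constant ahead of the
    # predicted moment of zero error signal.
    return zero_time is not None and now >= zero_time - time_constant

t_zero = predicted_zero_time(0.10, 0.6, 0.12, 0.5)   # error falling toward zero
print(round(t_zero, 2))                              # 0.22 (seconds, say)
print(should_advance(now=0.18, zero_time=t_zero, time_constant=0.05))   # True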

It may have been observed that,
when the model starts operation
from the inactive state, the effective
driving signal is not at the outset
modified by the error signal, there
being as yet no error signal. Under
such conditions the output is uncontrolled
for an amount of time equal
to the time constant of the entire
system. This is inherent in a feedback-controlled
system unless the time constant
is negligible. In the live system
it is suggested either that the excitation
of the effector is highly generalized,
resulting in an initial output that
is undifferentiated until it comes
under control, or that the effector's
operation during this initial period is
mediated by subtle programing of
sequences not dependent upon sensory
feedbacks.

The system has another important
undiagramed characteristic. In the
mixer the rate of change of the effective
driving signal is caused to vary
with the magnitude of the error signal.
When the error signal is large, as
at the start of a unit, the corrective
change is rapid. It becomes progressively
slower as the error signal is
reduced. An advantage of this feature
is reduction of overshoot.
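A short comparison shows the advantage. In the assumed sketch below, a fixed-rate correction keeps overshooting the control point, while a correction proportional to the error signal slows as the error shrinks and settles without overshoot:

def fixed_rate(output, set_point, step=0.4):
    # Corrective change of constant size, regardless of the error signal.
    return output + step if output < set_point else output - step

def error_proportional(output, set_point, gain=0.4):
    # Corrective change that becomes smaller as the error signal is reduced.
    return output + gain * (set_point - output)

a = b = 0.0
for _ in range(10):
    a = fixed_rate(a, 1.0)
    b = error_proportional(b, 1.0)
print(round(a, 2), round(b, 2))   # 0.8 0.99 -- the first oscillates about 1.0, the second settles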

Numerous times we have used the
term unit in the sense of unit of
control. Such a control unit should
not be identified with any of the conventional
units such as the phoneme,
the syllable, the word, or the word
group. There is no time to develop
this idea for the live speaking system
beyond saying, first, that it is not
theoretically necessary that the unit
of control be any presently identified
phonetic unit, and, second, that we
have evidence from several experiments
suggesting that it is something
else. It might be ventured tentatively
that the unit of control is a semi-periodic,
relatively long, articulatory
cycle, with a correlated cycle of output.
It is more satisfactory at present,
however, merely to propose the existence
of a hypothetical unit of speech
control, as yet unspecified and unnamed,
whose characteristics are
dimly coming to be seen.

The idea of building a mechanical
model of the speaking system that we
have discussed is appealing. Comparatively
simple effector and sensor
components which can process recognizable
speech signal are within the
art. We hope shortly to begin construction
of a simple controller, based
on a relay network, that works on
paper. Although to validate the
theory it is not necessary that the
machine talk, it seems possible that a
first approximation to connected
speech can be realized.

One evident feature of the model,
as well as of the live system, is that it
contains many components in a complicated
arrangement and readily becomes
disordered. One type of disorder
is part failure. In that case,
unless the part can be replaced or
repaired, the change in output must
either be compensated or tolerated. A
part disorder is also a system disorder.
The model can be caused to repeat,
prolong and hesitate by several different
manipulations, one of which is
feedback delay. By manipulations
that are revealingly similar it can be
caused to make other kinds of mistakes,
such as substitutions, distortions
and omissions. All such disorders are
demonstrably caused by component
deficiencies. In the model organic and
functional are one.
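The feedback-delay case is easy to exhibit in the same scalar terms used earlier; the gain, delay and loop arithmetic below are assumptions, and the point is only qualitative: with stale feedback the comparator goes on reporting an error after the control point has been passed, and the output overshoots and oscillates instead of settling.

from collections import deque

def run(feedback_delay, gain=0.4, steps=25):
    # Scalar loop with a delay of feedback_delay samples inserted in the
    # feedback path between sensor and comparator.
    output, set_point = 0.0, 1.0
    stored = deque([0.0] * (feedback_delay + 1), maxlen=feedback_delay + 1)
    trace = []
    for _ in range(steps):
        feedback = stored[0]            # the oldest (delayed) output sample
        error = set_point - feedback
        output += gain * error
        stored.append(output)
        trace.append(round(output, 2))
    return trace

print(run(feedback_delay=0))   # climbs smoothly onto the control point
print(run(feedback_delay=3))   # overshoots and oscillates about it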

Since the dynamic events of connected
speech have become conveniently
accessible through the X-ray
motion picture and the acoustic
spectrogram, students of speech perception
have been giving considerable
attention to the psychophysical significance
of spectral changes in the
speech signal. Although this subject
is outside the scope of the present
paper, a brief comment seems worthwhile.

Phoneticians have long recognized
that the elements of speech are not
produced in step-wise fashion like the
notes of a piano, but by continuous
modulation as a function of time.
Certain of the elements, such as the
diphthongs, involve characteristic
changes during their durations, losing
their identities if they do not so change.
Other elements, such as the vowels,
may be prolonged indefinitely in the
steady state and change is not considered
to be a defining feature. During
production of elements of the
latter type in connected speech, however,
changes occur. Movements to
and from articulatory positions result
in acoustic transitions to and from
steady states in the output.

In the model we have seen how a
transition is used for purposes of control
and prediction. From it is derived
a changing error signal. The model's
objective is to reduce this error signal
to zero, and at such a time as that has
been accomplished the control point
will have been reached. In the case
of the production of elements of
speech that involve steady states, the
control points and error signals correspond,
respectively, to steady states
and acoustic transitions in the output.
It is to be emphasized that the steady
states are the primary objectives, the
targets. The transitions are useful incidents
on the way to the targets. The
roles of both are probably very analogous
when the dynamic speech output
is perceived by an independent
listener.

References

1. Bott, E. A. (Indirect personal communication)

2. Davis, H. Auditory communication.
JSHD, 16, 1951, 3-8.

3. MacColl, L. A. Fundamental Theory
of Servomechanisms. New York: D.
Van Nostrand, 1945.

4. Peterson, G. E. Basic physical systems
for communication between two individuals.
JSHD, 18, 1953, 116-120.

5. Scripture, E. W. Der Mechanismus der
Sprachsysteme. Z. Experimentalphonetik,
1, 1931, 85-90.

6. Shannon, C. E. and W. Weaver. The
Mathematical Theory of Communication.
Urbana: Univ. of Ill. Press, 1949.

7. Trimmer, J. D. The basis for a science
of instrumentology. Science, 118, 1953,
461-465.

8. Trimmer, J. D. Response of Physical Systems.
New York: Wiley, 1950.

9. Wiener, N. Cybernetics. New York:
Wiley, 1948.

10. Wiener, N. The Human Use of Human
Beings. Boston: Houghton Mifflin, 1950.

1* Under this title a section meeting of
invited papers was arranged for the 1953
Annual Convention of ASHA by M. D.
Steer. The four papers are here published
as a group. At the suggestion of Chairman
Steer, and with the agreement of the
Editorial Staff, they are reproduced, with
minor changes, substantially as read in
order to preserve the flavor of the original
occasion.

2** Reprinted from the Journal of Speech and Hearing Disorders, Vol. 19, 1954, pp. 133-39.

3*** Grant Fairbanks (Ph.D., Iowa, 1936) is
Professor of Speech and Director of the
Speech Research Laboratory, University of
Illinois.