AN INTRODUCTION TO
NUMERICAL ANALYSIS FOR
ELECTRICAL AND COMPUTER ENGINEERS
CHRISTOPHER J. ZAROWSKI
AN INTRODUCTION TO
NUMERICAL ANALYSIS
FOR ELECTRICAL AND
COMPUTER ENGINEERS
Christopher J. Zarowski
University of Alberta, Canada
WILEY-INTERSCIENCE
A JOHN WILEY & SONS, INC. PUBLICATION
Copyright © 2004 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee
to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400,
fax 978-646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission
should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street,
Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts
in preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services, please contact our Customer Care
Department within the United States at 877-762-2974, outside the United States at 317-572-3993 or
fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data:
Zarowski, Christopher J.
An introduction to numerical analysis for electrical and computer engineers / Christopher
J. Zarowski.
p. cm.
Includes bibliographical references and index.
ISBN 0-471-46737-5 (cloth)
1. Electric engineering — Mathematics. 2. Computer science — Mathematics. 3. Numerical
analysis. I. Title.
TK153.Z37 2004
621.3'01'518—dc22
2003063761
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
In memory of my mother
Lilian
and of my father
Walter
CONTENTS
Preface xiii
1 Functional Analysis Ideas 1
1.1 Introduction 1
1.2 Some Sets 2
1.3 Some Special Mappings: Metrics, Norms, and Inner Products 4
1.3.1 Metrics and Metric Spaces 6
1.3.2 Norms and Normed Spaces 8
1.3.3 Inner Products and Inner Product Spaces 14
1.4 The Discrete Fourier Series (DFS) 25
Appendix 1.A Complex Arithmetic 28
Appendix 1.B Elementary Logic 31
References 32
Problems 33
2 Number Representations 38
2.1 Introduction 38
2.2 Fixed-Point Representations 38
2.3 Floating-Point Representations 42
2.4 Rounding Effects in Dot Product Computation 48
2.5 Machine Epsilon 53
Appendix 2.A Review of Binary Number Codes 54
References 59
Problems 59
3 Sequences and Series 63
3.1 Introduction 63
3.2 Cauchy Sequences and Complete Spaces 63
3.3 Pointwise Convergence and Uniform Convergence 70
3.4 Fourier Series 73
3.5 Taylor Series 78
3.6 Asymptotic Series 97
3.7 More on the Dirichlet Kernel 103
3.8 Final Remarks 107
Appendix 3.A Coordinate Rotation Digital Computing (CORDIC) 107
3.A.1 Introduction 107
3.A.2 The Concept of a Discrete Basis 108
3.A.3 Rotating Vectors in the Plane 112
3.A.4 Computing Arctangents 114
3.A.5 Final Remarks 115
Appendix 3.B Mathematical Induction 116
Appendix 3.C Catastrophic Cancellation 117
References 119
Problems 120
4 Linear Systems of Equations 127
4.1 Introduction 127
4.2 Least-Squares Approximation and Linear Systems 127
4.3 Least-Squares Approximation and Ill-Conditioned Linear Systems 132
4.4 Condition Numbers 135
4.5 LU Decomposition 148
4.6 Least-Squares Problems and QR Decomposition 161
4.7 Iterative Methods for Linear Systems 176
4.8 Final Remarks 186
Appendix 4.A Hilbert Matrix Inverses 186
Appendix 4.B SVD and Least Squares 191
References 193
Problems 194
5 Orthogonal Polynomials 207
5.1 Introduction 207
5.2 General Properties of Orthogonal Polynomials 207
5.3 Chebyshev Polynomials 218
5.4 Hermite Polynomials 225
5.5 Legendre Polynomials 229
5.6 An Example of Orthogonal Polynomial Least-Squares Approximation 235
5.7 Uniform Approximation 238
References 241
Problems 241
6 Interpolation 251
6.1 Introduction 251
6.2 Lagrange Interpolation 252
6.3 Newton Interpolation 257
6.4 Hermite Interpolation 266
6.5 Spline Interpolation 269
References 284
Problems 285
7 Nonlinear Systems of Equations 290
7.1 Introduction 290
7.2 Bisection Method 292
7.3 Fixed-Point Method 296
7.4 Newton-Raphson Method 305
7.4.1 The Method 305
7.4.2 Rate of Convergence Analysis 309
7.4.3 Breakdown Phenomena 311
7.5 Systems of Nonlinear Equations 312
7.5.1 Fixed-Point Method 312
7.5.2 Newton-Raphson Method 318
7.6 Chaotic Phenomena and a Cryptography Application 323
References 332
Problems 333
8 Unconstrained Optimization 341
8.1 Introduction 341
8.2 Problem Statement and Preliminaries 341
8.3 Line Searches 345
8.4 Newton's Method 353
8.5 Equality Constraints and Lagrange Multipliers 357
Appendix 8.A MATLAB Code for Golden Section Search 362
References 364
Problems 364
9 Numerical Integration and Differentiation 369
9.1 Introduction 369
9.2 Trapezoidal Rule 371
9.3 Simpson's Rule 378
9.4 Gaussian Quadrature 385
9.5 Romberg Integration 393
9.6 Numerical Differentiation 401
References 406
Problems 406
10 Numerical Solution of Ordinary Differential Equations 415
10.1 Introduction 415
10.2 First-Order ODEs 421
10.3 Systems of First-Order ODEs 442
10.4 Multistep Methods for ODEs 455
10.4.1 Adams-Bashforth Methods 459
10.4.2 Adams-Moulton Methods 461
10.4.3 Comments on the Adams Families 462
10.5 Variable-Step-Size (Adaptive) Methods for ODEs 464
10.6 Stiff Systems 467
10.7 Final Remarks 469
Appendix 10.A MATLAB Code for Example 10.8 469
Appendix 10.B MATLAB Code for Example 10.13 470
References 472
Problems 473
11 Numerical Methods for Eigenproblems 480
11.1 Introduction 480
11.2 Review of Eigenvalues and Eigenvectors 480
11.3 The Matrix Exponential 488
11.4 The Power Methods 498
11.5 QR Iterations 508
References 518
Problems 519
12 Numerical Solution of Partial Differential Equations 525
12.1 Introduction 525
12.2 A Brief Overview of Partial Differential Equations 525
12.3 Applications of Hyperbolic PDEs 528
12.3.1 The Vibrating String 528
12.3.2 Plane Electromagnetic Waves 534
12.4 The Finite-Difference (FD) Method 545
12.5 The Finite-Difference Time-Domain (FDTD) Method 550
Appendix 12.A MATLAB Code for Example 12.5 557
References 560
Problems 561
13 An Introduction to MATLAB 565
13.1 Introduction 565
13.2 Startup 565
13.3 Some Basic Operators, Operations, and Functions 566
13.4 Working with Polynomials 571
13.5 Loops 572
13.6 Plotting and M-Files 573
References 577
Index 579
PREFACE
The subject of numerical analysis has a long history. In fact, it predates by cen-
turies the existence of the modern computer. Of course, the advent of the modern
computer in the middle of the twentieth century gave greatly added impetus to the
subject, and so it now plays a central role in a large part of engineering analysis,
simulation, and design. This is so true that no engineer can be deemed competent
without some knowledge and understanding of the subject. Because of the back-
ground of the author, this book tends to emphasize issues of particular interest to
electrical and computer engineers, but the subject (and the present book) is certainly
relevant to engineers from all other branches of engineering.
Given the importance level of the subject, a great number of books have already
been written about it, and are now being written. These books span a colossal
range of approaches, levels of technical difficulty, degree of specialization, breadth
versus depth, and so on. So, why should this book be added to the already huge,
and growing list of available books?
To begin, the present book is intended to be a part of the students' first exposure
to numerical analysis. As such, it is intended for use mainly in the second year
of a typical 4-year undergraduate engineering program. However, the book may
find use in later years of such a program. Generally, the present book arises out of
the author's objections to educational practice regarding numerical analysis. To be
more specific:
1. Some books adopt a "grocery list" or "recipes" approach (i.e., "methods" at
the expense of "analysis") wherein several methods are presented, but with
little serious discussion of issues such as how they are obtained and their
relative advantages and disadvantages. In this genre often little consideration
is given to error analysis, convergence properties, or stability issues. When
these issues are considered, it is sometimes in a manner that is too superficial
for contemporary and future needs.
2. Some books fail to build on what the student is supposed to have learned
prior to taking a numerical analysis course. For example, it is common for
engineering students to take a first-year course in matrix/linear algebra. Yet,
a number of books miss the opportunity to build on this material in a manner
that would provide a good bridge from first year to more sophisticated uses
of matrix/linear algebra in later years (e.g., such as would be found in digital
signal processing or state variable control systems courses).
3. Some books miss the opportunity to introduce students to the now quite vital
area of functional analysis ideas as applied to engineering problem solving.
Modern numerical analysis relies heavily on concepts such as function spaces,
orthogonality, norms, metrics, and inner products. Yet these concepts are
often considered in a very ad hoc way, if indeed they are considered at all.
4. Some books tie the subject matter of numerical analysis far too closely to
particular software tools and/or programming languages. But the highly tran-
sient nature of software tools and programming languages often blinds the
user to the timeless nature of the underlying principles of analysis. Further-
more, it is an erroneous belief that one can successfully employ numerical
methods solely through the use of "canned" software without any knowledge
or understanding of the technical details of the contents of the can. While
this does not imply the need to understand a software tool or program down
to the last line of code, it does rule out the "black box" methodology.
5. Some books avoid detailed analysis and derivations in the misguided belief
that this will make the subject more accessible to the student. But this denies
the student the opportunity to learn an important mode of thinking that is a
huge aid to practical problem solving. Furthermore, by cutting the student
off from the language associated with analysis the student is prevented from
learning those skills needed to read modern engineering literature, and to
extract from this literature those things that are useful for solving the problem
at hand.
The prospective user of the present book will likely notice that it contains material
that, in the past, was associated mainly with more advanced courses. However, the
history of numerical computing since the early 1980s or so has made its inclusion
in an introductory course unavoidable. There is nothing remarkable about this. For
example, the material of typical undergraduate signals and systems courses was,
not so long ago, considered to be suitable only for graduate-level courses. Indeed,
most (if not all) of the contents of any undergraduate program consists of material
that was once considered far too advanced for undergraduates, provided one goes
back far enough in time.
Therefore, with respect to the observations mentioned above, the following is a
summary of some of the features of the present book:
1. An axiomatic approach to function spaces is adopted within the first chapter.
So the book immediately exposes the student to function space ideas, espe-
cially with respect to metrics, norms, inner products, and the concept of
orthogonality in a general setting. All of this is illustrated by several examples,
and the basic ideas from the first chapter are reinforced by routine use
throughout the remaining chapters.
2. The present book is not closely tied to any particular software tool or pro-
gramming language, although a few MATLAB-oriented examples are pre-
sented. These may be understood without any understanding of MATLAB
(derived from the term matrix laboratory) on the part of the student, how-
ever. Additionally, a quick introduction to MATLAB is provided in Chapter
13. These examples are simply intended to illustrate that modern software
tools implement many of the theories presented in the book, and that the
numerical characteristics of algorithms implemented with such tools are not
materially different from algorithm implementations using older software
technologies (e.g., catastrophic cancellation and ill conditioning continue
to be major implementation issues). Algorithms are often presented in a
Pascal-like pseudocode that is sufficiently transparent and general to allow
the user to implement the algorithm in the language of their choice.
3. Detailed proofs and/or derivations are often provided for many key results.
However, not all theorems or algorithms are proved or derived in detail
on those occasions where to do so would consume too much space, or not
provide much insight. Of course, the reader may dispute the present author's
choices in this matter. But when a proof or derivation is omitted, a reference
is often cited where the details may be found.
4. Some modern applications examples are provided to illustrate the conse-
quences of various mathematical ideas. For example, chaotic cryptography,
the CORDIC (coordinate rotational digital computing) method, and least
squares for system identification (in a biomedical application) are considered.
5. The sense in which series and iterative processes converge is given fairly
detailed treatment in this book as an understanding of these matters is now
so crucial in making good choices about which algorithm to use in an appli-
cation. Thus, for example, the difference between pointwise and uniform
convergence is considered. Kernel functions are introduced because of their
importance in error analysis for approximations based on orthogonal series.
Convergence rate analysis is also presented in the context of root-finding
algorithms.
6. Matrix analysis is considered in sufficient depth and breadth to provide an
adequate introduction to those aspects of the subject particularly relevant to
modern areas in which it is applied. This would include (but not be limited
to) numerical methods for electromagnetics, stability of dynamic systems,
state variable control systems, digital signal processing, and digital commu-
nications.
7. The most important general properties of orthogonal polynomials are pre-
sented. The special cases of Chebyshev, Legendre, and Hermite polynomials
are considered in detail (i.e., detailed derivations of many basic properties
are given).
8. In treating the subject of the numerical solution of ordinary differential
equations, a few books fail to give adequate examples based on nonlin-
ear dynamic systems. But many examples in the present book are based on
nonlinear problems (e.g., the Duffing equation). Furthermore, matrix methods
are introduced in the stability analysis of both explicit and implicit methods
for nth-order systems. This is illustrated with second-order examples.
Analysis is often embedded in the main body of the text rather than being rele-
gated to appendixes, or to formalized statements of proof immediately following a
theorem statement. This is done to discourage attempts by the reader to "skip over
the math." After all, skipping over the math defeats the purpose of the book.
Notwithstanding the remarks above, the present book lacks the rigor of a math-
ematically formal treatment of numerical analysis. For example, Lebesgue measure
theory is entirely avoided (although it is mentioned in passing). With respect to
functional analysis, previous authors (e.g., E. Kreyszig, Introductory Functional
Analysis with Applications) have demonstrated that it is very possible to do this
while maintaining adequate rigor for engineering purposes, and this approach is
followed here.
It is largely left to the judgment of the course instructor about what particular
portions of the book to cover in a course. Certainly there is more material here
than can be covered in a single term (or semester). However, it is recommended
that the first four chapters be covered largely in their entirety (perhaps excepting
Sections 1.4, 3.6, 3.7, and the part of Section 4.6 regarding SVD). The material of
these chapters is simply too fundamental to be omitted, and is often drawn on in
later chapters.
Finally, some will say that topics such as function spaces, norms and inner
products, and uniform versus pointwise convergence, are too abstract for engineers.
Such individuals would do well to ask themselves in what way these ideas are
more abstract than Boolean algebra, convolution integrals, and Fourier or Laplace
transforms, all of which are standard fare in present-day electrical and computer
engineering curricula.
Engineering past
Engineering present
Engineering future
Christopher Zarowski
1
Functional Analysis Ideas
1.1 INTRODUCTION
Many engineering analysis and design problems are far too complex to be solved
without the aid of computers. However, the use of computers in problem solving
has made it increasingly necessary for users to be highly skilled in (practical)
mathematical analysis. There are a number of reasons for this. A few are as follows.
For one thing, computers represent data to finite precision. Irrational numbers
such as π or √2 do not have an exact representation on a digital computer (with the
possible exception of methods based on symbolic computing). Additionally, when
arithmetic is performed, errors occur as a result of rounding (e.g., the truncation of
the product of two n-bit numbers, which might be 2n bits long, back down to n
bits). Numbers have a limited dynamic range; we might get overflow or underflow
in a computation. These are examples of finite-precision arithmetic effects. Beyond
this, computational methods frequently have sources of error independent of these.
For example, an infinite series must be truncated if it is to be evaluated on a com-
puter. The truncation error is something "additional" to errors from finite-precision
arithmetic effects. In all cases, the sources (and sizes) of error in a computation
must be known and understood in order to make sensible claims about the accuracy
of a computer-generated solution to a problem.
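These finite-precision and truncation effects are easy to demonstrate on any modern machine. The following short sketch (in Python, purely for illustration; the ideas themselves are language-independent) exhibits a rounding error, a series truncation error, and an overflow:

```python
import math

# Rounding: 0.1 has no exact binary representation, so repeated
# addition accumulates a small error.
s = sum(0.1 for _ in range(10))
print(s == 1.0)          # False under IEEE 754 double precision
print(abs(s - 1.0))      # a tiny but nonzero residual

# Truncation: approximating e = exp(1) by a partial sum of 1/k!
# leaves a truncation error that is separate from rounding error.
partial = sum(1.0 / math.factorial(k) for k in range(8))
print(abs(math.e - partial))  # error of the 8-term partial sum

# Overflow: floating-point numbers have a limited dynamic range.
x = 1e308
print(x * 10)            # inf
```

Note that the truncation error above would remain even on a machine with exact arithmetic; it is a property of the method, not of the hardware.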
Many methods are "iterative." Accuracy of the result depends on how many
iterations are performed. It is possible that a given method might be very slow,
requiring many iterations before achieving acceptable accuracy. This could involve
much computer runtime. The obvious solution of using a faster computer is usually
unacceptable. A better approach is to use mathematical analysis to understand why
a method is slow, and so to devise methods of speeding it up. Thus, an important
feature of analysis applied to computational methods is that of assessing how
much in the way of computing resources is needed by a given method. A given
computational method will make demands on computer memory, operations count
(the number of arithmetic operations, function evaluations, data transfers, etc.),
number of bits in a computer word, and so on.
A given problem almost always has many possible alternative solutions. Other
than accuracy and computer resource issues, ease of implementation is also rel-
evant. This is a human labor issue. Some methods may be easier to implement
on a given set of computing resources than others. This would have an impact
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
on software/hardware development time, and hence on system cost. Again, math-
ematical analysis is useful in deciding on the relative ease of implementation of
competing solution methods.
The subject of numerical computing is truly vast. Methods are required to handle
an immense range of problems, such as solution of differential equations (ordi-
nary or partial), integration, solution of equations and systems of equations (linear
or nonlinear), approximation of functions, and optimization. These problem types
appear to be radically different from each other. In some sense the differences
between them are true, but there are means to achieve some unity of approach in
understanding them.
The branch of mathematics that (perhaps) gives the greatest amount of unity
is sometimes called functional analysis. We shall employ ideas from this subject
throughout. However, our usage of these ideas is not truly rigorous; for example,
we completely avoid topology, and measure theory. Therefore, we tend to follow
simplified treatments of the subject such as Kreyszig [1], and then only those ideas
that are immediately relevant to us. The reader is assumed to be very comfortable
with elementary linear algebra, and calculus. The reader must also be comfortable
with complex number arithmetic (see Appendix 1.A now for a review if necessary).
Some knowledge of electric circuit analysis is presumed since this will provide
a source of applications examples later. (But application examples will also be
drawn from other sources.) Some knowledge of ordinary differential equations is
also assumed.
It is worth noting that an understanding of functional analysis is a tremendous
aid to understanding other subjects such as quantum physics, probability theory
and random processes, digital communications system analysis and design, digital
control systems analysis and design, digital signal processing, fuzzy systems, neural
networks, computer hardware design, and optimal design of systems. Many of the
ideas presented in this book are also intended to support these subjects.
1.2 SOME SETS
Variables in an engineering problem often take on values from sets of numbers.
In the present setting, the sets of greatest interest to us are (1) the set of integers
Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}, (2) the set of real numbers R, and (3) the set of
complex numbers C = {x + jy | j = √−1, x, y ∈ R}. The set of nonnegative integers
is Z⁺ = {0, 1, 2, 3, ...} (so Z⁺ ⊂ Z). Similarly, the set of nonnegative real numbers
is R⁺ = {x ∈ R | x ≥ 0}. Other kinds of sets of numbers will be introduced if and
when they are needed.
If A and B are two sets, their Cartesian product is denoted by A × B =
{(a, b) | a ∈ A, b ∈ B}. The Cartesian product of n sets A_0, A_1, ..., A_{n−1}
is A_0 × A_1 × ··· × A_{n−1} = {(a_0, a_1, ..., a_{n−1}) | a_k ∈ A_k}.
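For finite sets the Cartesian product can be enumerated directly. A small illustrative sketch in Python (an aside, not part of the text's development) is:

```python
from itertools import product

A = {0, 1}
B = {'a', 'b', 'c'}

# A x B = {(a, b) | a in A, b in B}
AxB = set(product(A, B))
print(len(AxB))  # 6 = |A| * |B|

# Product of n sets A_0 x A_1 x ... x A_{n-1}: here n = 3
A0, A1, A2 = {0, 1}, {0, 1}, {0, 1}
triples = list(product(A0, A1, A2))
print(len(triples))  # 8 = 2^3, i.e., all 3-tuples of bits
```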
Ideas from matrix/linear algebra are of great importance. We are therefore also
interested in sets of vectors. Thus, R^n shall denote the set of n-element vectors
with real-valued components, and similarly, C^n shall denote the set of n-element
vectors with complex-valued components. By default, we assume any vector x to
be a column vector:

    x = [x_0 x_1 ··· x_{n−2} x_{n−1}]^T.    (1.1)
Naturally, row vectors are obtained by transposition. We will generally avoid using
bars over or under symbols to denote vectors. Whether a quantity is a vector will
be clear from the context of the discussion. However, bars will be used to denote
vectors when this cannot be easily avoided. The indexing of vector elements x_k will
often begin with 0 as indicated in (1.1). Naturally, matrices are also important. Set
R^{n×m} denotes the set of matrices with n rows and m columns, and the elements are
real-valued. The notation C^{n×m} should now possess an obvious meaning. Matrices
will be denoted by uppercase symbols, again without bars. If A is an n × m
matrix, then

    A = [a_{p,q}]_{p=0,...,n−1; q=0,...,m−1}.    (1.2)

Thus, the element in row p and column q of A is denoted a_{p,q}. Indexing of rows
and columns again will typically begin at 0. The subscripts on the right bracket "]"
in (1.2) will often be omitted in the future. We may also write a_{pq} instead of a_{p,q}
where no danger of confusion arises.
The elements of any vector may be regarded as the elements of a sequence of
finite length. However, we are also very interested in sequences of infinite length.
An infinite sequence may be denoted by x = (x_k) = (x_0, x_1, x_2, ...), for which x_k
could be either real-valued or complex-valued. It is possible for sequences to be
doubly infinite, for instance, x = (x_k) = (..., x_{−2}, x_{−1}, x_0, x_1, x_2, ...).
Relationships between variables are expressed as mathematical functions, that is,
mappings between sets. The notation f|A → B signifies that function f associates
an element of set A with an element from set B. For example, f|R → R represents
a function defined on the real-number line, and this function is also real-valued;
that is, it maps "points" in R to "points" in R. We are familiar with the idea
of "plotting" such a function on the xy plane if y = f(x) (i.e., x, y ∈ R). It is
important to note that we may regard sequences as functions that are defined on
either the set Z (the case of doubly infinite sequences), or the set Z + (the case
of singly infinite sequences). To be more specific, if, for example, k ∈ Z⁺, then
this number maps to some number x_k that is either real-valued or complex-valued.
Since vectors are associated with sequences of finite length, they, too, may be
regarded as functions, but defined on a finite subset of the integers. From (1.1) this
subset might be denoted by Z_n = {0, 1, 2, ..., n−2, n−1}.
Sets of functions are important. This is because in engineering we are often
interested in mappings between sets of functions. For example, in electric circuits
voltage and current waveforms (i.e., functions of time) are input to a circuit via volt-
age and current sources. Voltage drops across circuit elements, or currents through
circuit elements are output functions of time. Thus, any circuit maps functions from
an input set to functions from some output set. Digital signal processing systems
do the same thing, except that here the functions are sequences. For example, a
simple digital signal processing system might accept as input the sequence (x_n),
and produce as output the sequence (y_n) according to

    y_n = (x_n + x_{n+1}) / 2    (1.3)

for which n ∈ Z⁺.
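Reading (1.3) as the two-point averaging system y_n = (x_n + x_{n+1})/2, a minimal sketch of this mapping between sequences (in Python, for illustration only) is:

```python
def average_filter(x):
    # y_n = (x_n + x_{n+1}) / 2, reading (1.3) as a two-point average;
    # for a finite input, the output is one sample shorter.
    return [(x[n] + x[n + 1]) / 2 for n in range(len(x) - 1)]

print(average_filter([1, 3, 5, 7]))  # [2.0, 4.0, 6.0]
```

The function maps one sequence to another, which is exactly the sense in which a digital signal processing system is a mapping between sets of functions.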
Some specific examples of sets of functions are as follows, and more will be
seen later. The set of real-valued functions defined on the interval [a, b] ⊂ R that
are n times continuously differentiable may be denoted by C^n[a, b]. This means
that all derivatives up to and including order n exist and are continuous. If n = 0
we often just write C[a, b], which is the set of continuous functions on the interval
[a, b]. We remark that the notation [a, b] implies inclusion of the endpoints of the
interval. Thus, (a, b) implies that the endpoints a and b are not to be included [i.e.,
if x ∈ (a, b), then a < x < b].
A polynomial in the indeterminate x of degree n is

    p_n(x) = Σ_{k=0}^{n} p_{n,k} x^k.    (1.4)

Unless otherwise stated, we will always assume p_{n,k} ∈ R. The indeterminate x
is often considered to be either a real number or a complex number. But in
some circumstances the indeterminate x is merely regarded as a "placeholder,"
which means that x is not supposed to take on a value. In a situation like this
the polynomial coefficients may also be regarded as elements of a vector (e.g.,
p_n = [p_{n,0} p_{n,1} ··· p_{n,n}]^T). This happens in digital signal processing when we
wish to convolve¹ sequences of finite length, because the multiplication of polynomials
is mathematically equivalent to the operation of sequence convolution. We
will denote the set of all polynomials of degree n as P^n. If x is to be from the
interval [a, b] ⊂ R, then the set of polynomials of degree n on [a, b] is denoted
by P^n[a, b]. If m < n we shall usually assume P^m[a, b] ⊂ P^n[a, b].
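The equivalence between polynomial multiplication and sequence convolution can be checked directly. Below is a small illustrative sketch in Python (the function name `convolve` is our own, not from the text):

```python
def convolve(a, b):
    # Linear convolution of two finite sequences. The result equals
    # the coefficient sequence of the product of the polynomials
    # whose coefficients are a and b (lowest degree first).
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj  # x^i * x^j contributes to x^(i+j)
    return out

# (1 + 2x)(3 + x) = 3 + 7x + 2x^2
print(convolve([1, 2], [3, 1]))  # [3, 7, 2]
```

Each coefficient of the product collects all pairs of input coefficients whose degrees sum to the same value, which is precisely the convolution sum.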
1.3 SOME SPECIAL MAPPINGS: METRICS, NORMS,
AND INNER PRODUCTS
Sets of objects (vectors, sequences, polynomials, functions, etc.) often have cer-
tain special mappings defined on them that turn these sets into what are commonly
called function spaces. Loosely speaking, functional analysis is about the properties
¹These days it seems that the operation of convolution is first given serious study in introductory signals
and systems courses. The operation of convolution is fundamental to all forms of signal processing,
either analog or digital.
of function spaces. Generally speaking, numerical computation problems are best
handled by treating them in association with suitable mappings on well-chosen
function spaces. For our purposes, the three most important special types of map-
pings are (1) metrics, (2) norms, and (3) inner products. You are likely to be already
familiar with special cases of these really very general ideas.
The vector dot product is an example of an inner product on a vector space, while
the Euclidean norm (i.e., the square root of the sum of the squares of the elements in
a real-valued vector) is a norm on a vector space. The Euclidean distance between
two vectors (given by the Euclidean norm of the difference between the two vectors)
is a metric on a vector space. Again, loosely speaking, metrics give meaning to the
concept of "distance" between points in a function space, norms give a meaning
to the concept of the "size" of a vector, and inner products give meaning to the
concept of "direction" in a vector space.²
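These three notions are easy to compute for vectors in R^n. A brief Python sketch (illustrative only; `dot`, `norm`, and `dist` are our own helper names) is:

```python
import math

def dot(x, y):
    # inner product (vector dot product) on R^n
    return sum(xk * yk for xk, yk in zip(x, y))

def norm(x):
    # Euclidean norm: the "size" of x
    return math.sqrt(dot(x, x))

def dist(x, y):
    # Euclidean metric: the "distance" between x and y,
    # i.e., the norm of their difference
    return norm([xk - yk for xk, yk in zip(x, y)])

x, y = [3.0, 4.0], [0.0, 0.0]
print(dot(x, x))   # 25.0
print(norm(x))     # 5.0
print(dist(x, y))  # 5.0
```

Note how the metric is built from the norm, and the norm from the inner product; this nesting of structure recurs throughout the chapter.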
In Section 1.1 we expressed interest in the sizes of errors, and so naturally the
concept of a norm will be of interest. Later we shall see that inner products will
prove to be useful in devising means of overcoming problems due to certain sources
of error in a computation. In this section we shall consider various examples of
function spaces, some of which we will work with later on in the analysis of
certain computational problems. We shall see that there are many different kinds
of metric, norm, and inner product. Each kind has its own particular advantages
and disadvantages as will be discovered as we progress through the book.
Sometimes a quantity cannot be computed exactly. In this case we may try to
estimate bounds on the size of the quantity. For example, finding the exact error
in the truncation of a series may be impossible, but putting a bound on the error
might be relatively easy. In this respect the concepts of supremum and infimum
can be important. These are defined as follows.
Suppose we have E ⊂ R. We say that E is bounded above if E has an upper
bound, that is, if there exists a B ∈ R such that x ≤ B for all x ∈ E. If E ≠ ∅
(∅ is the empty set, the set containing no elements), there is a supremum of E [also called a
least upper bound (lub)], denoted

    sup E.

For example, suppose E = [0, 1); then any B ≥ 1 is an upper bound for E, but
sup E = 1. More generally, sup E ≤ B for every upper bound B of E. Thus, the
supremum is a "tight" upper bound. Similarly, E may be bounded below. If E has
a lower bound, there is a b ∈ R such that x ≥ b for all x ∈ E. If E ≠ ∅, then there
exists an infimum [also called a greatest lower bound (glb)], denoted by

    inf E.

For example, suppose now E = (0, 1]; then any b ≤ 0 is a lower bound for E,
but inf E = 0. More generally, inf E ≥ b for every lower bound b of E. Thus, the
infimum is a "tight" lower bound.
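The distinction between a supremum and a maximum can be illustrated numerically. The following sketch (not from the text; the helper name sample_max is ours) samples E = [0, 1) ever more finely: the sample maxima increase toward sup E = 1 but never attain it, since 1 ∉ E.

```python
# Numerical illustration of supremum vs. maximum for E = [0, 1).
# Finer and finer finite samples of E have maxima approaching sup E = 1,
# but every sample maximum is an element of E and hence strictly below 1.

def sample_max(n):
    """Largest element of the finite sample {k/n : k = 0, ..., n-1} of [0, 1)."""
    return max(k / n for k in range(n))

sups = [sample_max(n) for n in (10, 100, 1000)]
# sups is increasing, each entry < 1, approaching the supremum 1.
```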
² The idea of "direction" is (often) considered with respect to the concept of an orthogonal basis in a
vector space. To define "orthogonality" requires the concept of an inner product. We shall consider this
in various ways later on.
6 FUNCTIONAL ANALYSIS IDEAS
1.3.1 Metrics and Metric Spaces
In mathematics an axiomatic approach is often taken in the development of analysis
methods. This means that we define a set of objects, a set of operations to be
performed on the set of objects, and rules obeyed by the operations. This is typically
how mathematical systems are constructed. The reader (hopefully) has already seen
this approach in the application of Boolean algebra to the analysis and design of
digital electronic systems (i.e., digital logic). We adopt the same approach here.
We will begin with the following definition.
Definition 1.1: Metric Space, Metric A metric space is a set X and a
function d: X × X → R⁺, which is called a metric or distance function on X.
If x, y, z ∈ X, then d satisfies the following axioms:

(M1) d(x, y) = 0 if and only if (iff) x = y.
(M2) d(x, y) = d(y, x) (symmetry property).
(M3) d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality).
We emphasize that X by itself cannot be a metric space until we define d. Thus,
the metric space is often denoted by the pair (X, d). The phrase "if and only
if" probably needs some explanation. In (M1), if you were told that d(x, y) = 0,
then you must immediately conclude that x = y. Conversely, if you were told that
x = y, then you must immediately conclude that d(x, y) = 0. Instead of the words
"if and only if," it is also common to write

    d(x, y) = 0 ⇔ x = y.

The phrase "if and only if" is associated with elementary logic. This subject is
reviewed in Appendix 1.B. It is recommended that the reader study that appendix
before continuing with later chapters.
Some examples of metric spaces now follow.
Example 1.1 Set X = R, with

    d(x, y) = |x − y|,    (1.5)
forms a metric space. The metric (1.5) is what is commonly meant by the "distance
between two points on the real number line." The metric (1.5) is quite useful in
discussing the sizes of errors due to rounding in digital computation. This is because
there is a norm on R that gives rise to the metric in (1.5) (see Section 1.3.2).
Example 1.2 The set of vectors Rⁿ with

    d(x, y) = [ Σ_{k=0}^{n−1} |x_k − y_k|² ]^{1/2} = [ Σ_{k=0}^{n−1} (x_k − y_k)² ]^{1/2}    (1.6a)
SOME SPECIAL MAPPINGS: METRICS, NORMS, AND INNER PRODUCTS 7
forms a (Euclidean) metric space. However, another valid metric on Rⁿ is given by

    d₁(x, y) = Σ_{k=0}^{n−1} |x_k − y_k|.    (1.6b)

In other words, we can have the metric space (X, d), or (X, d₁). These spaces are
different because their metrics differ.
Euclidean metrics, and their related norms and inner products, are useful in posing
and solving least-squares approximation problems. Least-squares approximation
is a topic we shall consider in detail later.
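The two metrics of Example 1.2 are easy to compute directly. The following sketch (our own illustration; the names d and d1 follow (1.6a) and (1.6b)) implements both and spot-checks the symmetry and triangle-inequality axioms on sample vectors.

```python
import math

# Sketch of the two metrics on R^n from Example 1.2: the Euclidean metric
# d of (1.6a) and the absolute-value ("taxicab") metric d1 of (1.6b).

def d(x, y):
    """Euclidean metric (1.6a)."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def d1(x, y):
    """Absolute-value metric (1.6b)."""
    return sum(abs(xk - yk) for xk, yk in zip(x, y))

x, y, z = [1.0, 2.0], [4.0, 6.0], [0.0, -1.0]
# Axiom (M2): symmetry; axiom (M3): triangle inequality.
for metric in (d, d1):
    assert metric(x, y) == metric(y, x)
    assert metric(x, y) <= metric(x, z) + metric(z, y)
```

Note that the same pair of points is at distance 5 under d but distance 7 under d₁: the two metric spaces (X, d) and (X, d₁) really do measure distance differently.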
Example 1.3 Consider the set of (singly) infinite, complex-valued, and bounded
sequences

    X = { x = (x₀, x₁, x₂, …) | x_k ∈ C, |x_k| ≤ c(x) (all k) }.    (1.7a)

Here c(x) ≥ 0 is a bound that may depend on x, but not on k. This set forms a
metric space that may be denoted by l^∞[0, ∞] if we employ the metric

    d(x, y) = sup_{k∈Z⁺} |x_k − y_k|.    (1.7b)

The notation [0, ∞] emphasizes that the sequences we are talking about are only
singly infinite. We would use [−∞, ∞] to specify that we are talking about doubly
infinite sequences.
Example 1.4 Define I = [a, b] ⊂ R. The set C[a, b] of continuous functions on I
will be a metric space if

    d(x, y) = sup_{t∈[a,b]} |x(t) − y(t)|.    (1.8)
In Example 1.1 the metric (1.5) gives the "distance" between points on the real-
number line. In Example 1.4 the "points" are real-valued, continuous functions of
t € [a, b]. In functional analysis it is essential to get used to the idea that functions
can be considered as points in a space.
Example 1.5 The set X in (1.7a), where we now allow c(x) → ∞ (in other
words, the sequence need not be bounded here), but with the metric

    d(x, y) = Σ_{k=0}^{∞} (1/2^{k+1}) |x_k − y_k| / (1 + |x_k − y_k|),    (1.9)

is a metric space. (Sometimes this space is denoted s.)
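What makes (1.9) work for unbounded sequences is that each term is at most 2^{−(k+1)}, so the series always converges and the distance never exceeds 1. A small numerical sketch (ours, with a truncation parameter N that we introduce; a sequence is represented here by a function k ↦ x_k):

```python
# Sketch of the metric (1.9) on the space s of (possibly unbounded) sequences.
# Each term of the series is bounded by 2^-(k+1), so the sum converges for
# ANY pair of sequences; truncating at N terms is accurate to about 2^-N.

def d_s(x, y, N=60):
    """Truncated version of metric (1.9); x and y map index k to x_k, y_k."""
    return sum(
        (1 / 2 ** (k + 1)) * abs(x(k) - y(k)) / (1 + abs(x(k) - y(k)))
        for k in range(N)
    )

# The unbounded sequences x_k = k^2 and y_k = 0 are still a finite distance
# apart, and that distance is below 1.
dist = d_s(lambda k: k * k, lambda k: 0)
```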
Example 1.6 Let p be a real-valued constant such that p ≥ 1. Consider the
set of complex-valued sequences

    X = { x = (x₀, x₁, x₂, …) | x_k ∈ C, Σ_{k=0}^{∞} |x_k|^p < ∞ }.    (1.10a)

This set together with the metric

    d(x, y) = [ Σ_{k=0}^{∞} |x_k − y_k|^p ]^{1/p}    (1.10b)

forms a metric space that we denote by l^p[0, ∞].
Example 1.7 Consider the set of complex-valued functions on [a, b] ⊂ R

    X = { x(t) | ∫_a^b |x(t)|² dt < ∞ },    (1.11a)

for which

    d(x, y) = [ ∫_a^b |x(t) − y(t)|² dt ]^{1/2}    (1.11b)

is a metric. The pair (X, d) forms a metric space that is usually denoted by L²[a, b].
The metric space of Example 1.7 (along with certain variations) is very important
in the theory of orthogonal polynomials, and in least-squares approximation
problems. This is because it turns out to be an inner product space too (see
Section 1.3.3). Orthogonal polynomials have a major role to play in the solution
of least-squares and other types of approximation problems.
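In practice the integral in (1.11b) is often evaluated numerically. As a rough sketch (our own; the quadrature and the helper name d_L2 are illustrative choices, not from the text), a midpoint Riemann sum approximates the L²[a, b] distance between x(t) = sin t and y(t) = 0 on [0, 2π], which is √π analytically:

```python
import math

# Approximating the L^2[a, b] metric (1.11b) with a midpoint Riemann sum.
# For x(t) = sin(t), y(t) = 0 on [0, 2*pi], the exact distance is sqrt(pi).

def d_L2(x, y, a, b, n=100000):
    h = (b - a) / n
    total = sum(abs(x(a + (i + 0.5) * h) - y(a + (i + 0.5) * h)) ** 2
                for i in range(n))
    return math.sqrt(total * h)

dist = d_L2(math.sin, lambda t: 0.0, 0.0, 2.0 * math.pi)
# dist is close to sqrt(pi) ~ 1.7725
```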
All of the metrics defined in the examples above may be shown to satisfy the
axioms of Definition 1.1. Of course, at least in some cases, much effort might be
required to do this. In this book we largely avoid making this kind of effort.
1.3.2 Norms and Normed Spaces
So far our examples of function spaces have been metric spaces (Section 1.3.1).
Such spaces are not necessarily associated with the concept of a vector space.
However, normed spaces (i.e., spaces with norms defined on them) are always
associated with vector spaces. So, before we can define a norm, we need to recall
the general definition of a vector space.
The following definition invokes the concept of a field of numbers. This concept
arises in abstract algebra and number theory [e.g., 2, 3], a subject we wish to avoid
considering here.³ It is enough for the reader to know that R and C are fields under
³ This avoidance is not to disparage abstract algebra. This subject is a necessary prerequisite to understanding
concepts such as fast algorithms for digital signal processing (i.e., fast Fourier transforms and
fast convolution algorithms; e.g., see Ref. 4), cryptography and data security, and error control codes
for digital communications.
the usual real and complex arithmetic operations. These are really the only fields
that we shall work with. We remark, largely in passing, that rational numbers (set
denoted Q) are also a field under the usual arithmetic operations.
Definition 1.2: Vector Space A vector space (linear space) over a field K is
a nonempty set X of elements x, y, z, … called vectors together with two algebraic
operations. These operations are vector addition, and the multiplication of vectors
by scalars that are elements of K. The following axioms must be satisfied:

(V1) If x, y ∈ X, then x + y ∈ X (additive closure).
(V2) If x, y, z ∈ X, then (x + y) + z = x + (y + z) (associativity).
(V3) There exists a vector in X denoted 0 (zero vector) such that for all x ∈ X
we have x + 0 = 0 + x = x.
(V4) For all x ∈ X, there is a vector −x ∈ X such that (−x) + x = x +
(−x) = 0. We call −x the negative of a vector.
(V5) For all x, y ∈ X we have x + y = y + x (commutativity).
(V6) If x ∈ X and a ∈ K, then the product of a and x is ax, and ax ∈ X.
(V7) If x, y ∈ X, and a ∈ K, then a(x + y) = ax + ay.
(V8) If a, b ∈ K, and x ∈ X, then (a + b)x = ax + bx.
(V9) If a, b ∈ K, and x ∈ X, then (ab)x = a(bx).
(V10) If x ∈ X, and 1 ∈ K, then 1x = x (multiplication of a vector by a unit
scalar; all fields contain a unit scalar, i.e., a number called "one").
In this definition, as already noted, we generally work only with K = R or K = C.
We represent the zero vector by 0 just as we also represent the scalar zero by 0.
Rarely is there danger of confusion.
The reader is already familiar with the special instances of this that relate to the
sets Rⁿ and Cⁿ. These sets are vector spaces under Definition 1.2, where vector
addition is defined to be

    x + y = [x₀ x₁ ⋯ x_{n−1}]ᵀ + [y₀ y₁ ⋯ y_{n−1}]ᵀ
          = [x₀ + y₀  x₁ + y₁  ⋯  x_{n−1} + y_{n−1}]ᵀ,    (1.12a)

and multiplication by a field element is defined to be

    ax = [ax₀ ax₁ ⋯ ax_{n−1}]ᵀ.    (1.12b)

The zero vector is 0 = [0 0 ⋯ 0 0]ᵀ, and −x = [−x₀ −x₁ ⋯ −x_{n−1}]ᵀ. If X = Rⁿ,
then the elements of x and y are real-valued, and a ∈ R, but if X = Cⁿ then the
elements of x and y are complex-valued, and a ∈ C. The metric spaces in Example
1.2 are therefore also vector spaces under the operations defined in (1.12a,b).
Some further examples of vector spaces now follow.
Example 1.8 Metric space C[a, b] (Example 1.4) is a vector space under the
operations
(x + y)(t)=x(t) + y(t), (ax)(t)=ax(t), (1.13)
where a e R. The zero vector is the function that is identically zero on the interval
[a,b].
Example 1.9 Metric space l²[0, ∞] (Example 1.6) is a vector space under the
operations

    x + y = (x₀, x₁, …) + (y₀, y₁, …) = (x₀ + y₀, x₁ + y₁, …),
    ax = (ax₀, ax₁, …).    (1.14)

Here a ∈ C.
If x, y ∈ l²[0, ∞], then some effort is required to verify axiom (V1). This
requires the Minkowski inequality, which is

    [ Σ_{k=0}^{∞} |x_k + y_k|^p ]^{1/p} ≤ [ Σ_{k=0}^{∞} |x_k|^p ]^{1/p} + [ Σ_{k=0}^{∞} |y_k|^p ]^{1/p}.    (1.15)

Refer back to Example 1.6; here we employ p = 2, but (1.15) is valid for p ≥ 1.
Proof of (1.15) is somewhat involved, and so is omitted here. The interested reader
can see Kreyszig [1, pp. 11–15].
We remark that the Minkowski inequality can be proved with the aid of the
Hölder inequality

    Σ_{k=0}^{∞} |x_k y_k| ≤ [ Σ_{k=0}^{∞} |x_k|^p ]^{1/p} [ Σ_{k=0}^{∞} |y_k|^q ]^{1/q},    (1.16)

for which here p > 1 and 1/p + 1/q = 1.
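Both inequalities are easy to spot-check numerically on finite sequences (a finite sequence embeds in l^p[0, ∞] by padding with zeros). The following sketch is our own illustration; the sample vectors and the helper name pnorm are arbitrary choices:

```python
# Numerical spot check of the Minkowski inequality (1.15) and the Holder
# inequality (1.16) for several values of p, using finite sequences.

x = [3.0, -1.0, 2.0, 0.5]
y = [1.0, 4.0, -2.0, 1.5]

def pnorm(v, p):
    """The l^p "size" of a finite sequence, as in (1.10b) with y = 0."""
    return sum(abs(vk) ** p for vk in v) ** (1.0 / p)

for p in (1.0, 1.5, 2.0, 3.0):
    # Minkowski: ||x + y||_p <= ||x||_p + ||y||_p
    assert pnorm([a + b for a, b in zip(x, y)], p) <= pnorm(x, p) + pnorm(y, p)

for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1
    # Holder: sum |x_k y_k| <= ||x||_p * ||y||_q
    assert sum(abs(a * b) for a, b in zip(x, y)) <= pnorm(x, p) * pnorm(y, q)
```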
We are now ready to define a normed space.
Definition 1.3: Normed Space, Norm A normed space X is a vector space
with a norm defined on it. If x ∈ X, then the norm of x is denoted by

    ||x|| (read this as "norm of x").

The norm must satisfy the following axioms:

(N1) ||x|| ≥ 0 (i.e., the norm is nonnegative).
(N2) ||x|| = 0 ⇔ x = 0.
(N3) ||ax|| = |a| ||x||. Here a is a scalar in the field of X (i.e., a ∈ K; see
Definition 1.2).
(N4) ||x + y|| ≤ ||x|| + ||y|| (triangle inequality).

The normed space is vector space X together with a norm, and so may be properly
denoted by the pair (X, || · ||). However, we may simply write X, and say "normed
space X," so that the norm that goes along with X is understood from the context of
the discussion.
It is important to note that all normed spaces are also metric spaces, where the
metric is given by
d(x,y) = \\x-y\\ (x,yeX). (1.17)
The metric in (1.17) is called the metric induced by the norm.
Various other properties of norms may be deduced. One of these is as follows.

Example 1.10 Prove | ||y|| − ||x|| | ≤ ||y − x||.

Proof From (N3) and (N4),

    ||y|| = ||y − x + x|| ≤ ||y − x|| + ||x||,    ||x|| = ||x − y + y|| ≤ ||y − x|| + ||y||.

Combining these, we obtain

    ||y|| − ||x|| ≤ ||y − x||,    ||y|| − ||x|| ≥ −||y − x||.

The claim follows immediately.
We may regard the norm as a mapping from X to the set R: || · || : X → R. This
mapping can be shown to be continuous. However, this requires generalizing the
concept of continuity that you may know from elementary calculus. Here we define
continuity as follows.
Definition 1.4: Continuous Mapping Suppose X = (X, d) and Y = (Y, d)
are two metric spaces. The mapping T: X → Y is said to be continuous at a point
x₀ ∈ X if for all ε > 0 there is a δ > 0 such that

    d(Tx, Tx₀) < ε for all x satisfying d(x, x₀) < δ.    (1.18)

T is said to be continuous if it is continuous at every point of X.
Note that Tx is just another way of writing T(x). (R, | • |) is a normed space; that
is, the set of real numbers with the usual arithmetic operations defined on it is a
TLFeBOOK
12 FUNCTIONAL ANALYSIS IDEAS
vector space, and the absolute value of an element of R is the norm of that element.
If we identify Y in Definition 1.4 with the metric space (R, | · |), then (1.18) becomes

    d(Tx, Tx₀) = d(||x||, ||x₀||) = | ||x|| − ||x₀|| | < ε,    d(x, x₀) = ||x − x₀|| < δ.

To make these claims, we are using (1.17). In other words, X and Y are normed
spaces, and we employ the metrics induced by their respective norms. In addition,
we identify T with || · ||. Using Example 1.10, we obtain

    | ||x|| − ||x₀|| | ≤ ||x − x₀|| < δ,

so the choice δ = ε guarantees that ||x − x₀|| < δ implies | ||x|| − ||x₀|| | < ε.
Thus, the requirements of Definition 1.4 are met, and so we conclude that norms
are continuous mappings.
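The inequality of Example 1.10 that drives this argument can be tested numerically. The sketch below (our own; random three-element vectors are an arbitrary choice) checks | ||y|| − ||x|| | ≤ ||y − x|| on many random pairs:

```python
import math
import random

# Spot check of the reverse triangle inequality from Example 1.10,
# | ||y|| - ||x|| | <= ||y - x||, which is exactly what makes the norm a
# continuous mapping: ||x - x0|| < delta = epsilon forces | ||x|| - ||x0|| | < epsilon.

def norm(v):
    """Euclidean norm on R^n."""
    return math.sqrt(sum(vk ** 2 for vk in v))

random.seed(1)
for _ in range(1000):
    x = [random.uniform(-10, 10) for _ in range(3)]
    y = [random.uniform(-10, 10) for _ in range(3)]
    diff = [a - b for a, b in zip(y, x)]
    # Small slack guards against floating-point rounding.
    assert abs(norm(y) - norm(x)) <= norm(diff) + 1e-12
```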
We now list some other normed spaces.
Example 1.11 The Euclidean space Rⁿ and the unitary space Cⁿ are both
normed spaces, where the norm is defined to be

    ||x|| = [ Σ_{k=0}^{n−1} |x_k|² ]^{1/2}.    (1.19)

For Rⁿ the absolute value bars may be dropped.⁴ It is easy to see that d(x, y) =
||x − y|| gives the same metric as in (1.6a) for space Rⁿ. We further remark that
for n = 1 we have ||x|| = |x|.
Example 1.12 The space l^p[0, ∞] is a normed space if we define the norm
to be

    ||x|| = [ Σ_{k=0}^{∞} |x_k|^p ]^{1/p},    (1.20)

for which d(x, y) = ||x − y|| coincides with the metric in (1.10b).
Example 1.13 The sequence space l^∞[0, ∞] from Example 1.3 of Section 1.3.1
is a normed space, where the norm is defined to be

    ||x|| = sup_{k∈Z⁺} |x_k|,    (1.21)

and this norm induces the metric of (1.7b).
⁴ Suppose z = x + jy (j = √−1; x, y ∈ R) is some arbitrary complex number. Recall that z ≠ |z|
in general.
Example 1.14 The space C[a, b] first seen in Example 1.4 is a normed space,
where the norm is defined by

    ||x|| = sup_{t∈[a,b]} |x(t)|.    (1.22)

Naturally, this norm induces the metric of (1.8).
Example 1.15 The space L²[a, b] of Example 1.7 is a normed space for the
norm

    ||x|| = [ ∫_a^b |x(t)|² dt ]^{1/2}.    (1.23)

This norm induces the metric in (1.11b).
The normed space of Example 1.15 is important in the following respect.
Observe that

    ||x||² = ∫_a^b |x(t)|² dt.    (1.24)
Suppose we now consider a resistor with resistance R. If the voltage drop across its
terminals is v(t) and the current through it is i(t), we know that the instantaneous
power dissipated in the device is p(t) = v(t)i(t). If we assume that the resistor is
a linear device, then v(t) = Ri(t) via Ohm's law. Thus

    p(t) = v(t)i(t) = Ri²(t).    (1.25)

Consequently, the amount of energy delivered to the resistor over the time interval
t ∈ [a, b] is given by

    E = R ∫_a^b i²(t) dt.    (1.26)
If the voltage/current waveforms in our circuit containing R belong to the space
L 2 [a,b], then clearly E = R\\i\\ 2 . We may therefore regard the square of the L 2
norm [given by (1.24)] of a signal to be the energy of the signal, provided the
norm exists. This notion can be helpful in the optimal design of electric circuits
(e.g., electric filters), and also of optimal electronic circuits. In analogous fashion,
an element x of the space l²[0, ∞] satisfies

    ||x||² = Σ_{k=0}^{∞} |x_k|² < ∞    (1.27)

[see (1.10a) and Example 1.12]. We may consider ||x||² to be the energy of the
single-sided sequence x. This notion is useful in the optimal design of digital filters.
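The energy interpretation E = R||i||² is straightforward to evaluate numerically. As a sketch (our own; the resistance value, the waveform i(t) = cos t, and the midpoint quadrature are illustrative choices), for R = 50 Ω and i(t) = cos t on [0, 2π] the energy is 50π joules:

```python
import math

# Sketch of the "energy" interpretation above: the energy delivered to a
# resistor R by current i(t) over [a, b] is E = R * ||i||^2, with ||i|| the
# L^2[a, b] norm. Here i(t) = cos(t) on [0, 2*pi] and R = 50 ohms, so the
# exact answer is E = 50 * pi.

R = 50.0
a, b, n = 0.0, 2.0 * math.pi, 200000
h = (b - a) / n
# Midpoint Riemann sum for the squared L^2 norm of i(t) = cos(t).
energy = R * sum(math.cos(a + (k + 0.5) * h) ** 2 for k in range(n)) * h
```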
1.3.3 Inner Products and Inner Product Spaces
The concept of an inner product is necessary before one can talk about orthogonal
bases for vector spaces. Recall from elementary linear algebra that orthogonal
bases were important in representing vectors. From a computational standpoint,
as mentioned earlier, orthogonal bases can have a simplifying effect on certain
types of approximation problem (e.g., least-squares approximations), and represent
a means of controlling numerical errors due to so-called ill-conditioned problems.
Following our axiomatic approach, consider the following definition.
Definition 1.5: Inner Product Space, Inner Product An inner product space
is a vector space X with an inner product defined on it. The inner product is a
mapping ⟨·, ·⟩ : X × X → K that satisfies the following axioms:

(I1) ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.
(I2) ⟨ax, y⟩ = a⟨x, y⟩.
(I3) ⟨x, y⟩ = ⟨y, x⟩*.
(I4) ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 ⇔ x = 0.

Naturally, x, y, z ∈ X, and a is a scalar from the field K of vector space X. The
asterisk superscript on ⟨y, x⟩ in (I3) denotes complex conjugation.⁵
If the field of X is not C, then the operation of complex conjugation in (I3) is
redundant.
All inner product spaces are also normed spaces, and hence are also metric
spaces. This is because the inner product induces a norm on X:

    ||x|| = [⟨x, x⟩]^{1/2}    (1.28)

for all x ∈ X. Following (1.17), the induced metric is

    d(x, y) = ||x − y|| = [⟨x − y, x − y⟩]^{1/2}.    (1.29)

Directly from the axioms of Definition 1.5, it is possible to deduce that (for
x, y, z ∈ X and a, b ∈ K)

    ⟨ax + by, z⟩ = a⟨x, z⟩ + b⟨y, z⟩,    (1.30a)
    ⟨x, ay⟩ = a*⟨x, y⟩,    (1.30b)

and

    ⟨x, ay + bz⟩ = a*⟨x, y⟩ + b*⟨x, z⟩.    (1.30c)

The reader should prove these as an exercise.
⁵ If z = x + yj is a complex number, then its conjugate is z* = x − yj.
We caution the reader that not all normed spaces are inner product spaces. We
may construct a counterexample with the aid of the following result.

Example 1.16 Let x, y be from an inner product space. If || · || is the norm
induced by the inner product, then ||x + y||² + ||x − y||² = 2(||x||² + ||y||²). This
is the parallelogram equality.
Proof Via (1.30a,c) we have

    ||x + y||² = ⟨x + y, x + y⟩ = ⟨x, x + y⟩ + ⟨y, x + y⟩
              = ⟨x, x⟩ + ⟨x, y⟩ + ⟨y, x⟩ + ⟨y, y⟩

and

    ||x − y||² = ⟨x − y, x − y⟩ = ⟨x, x − y⟩ − ⟨y, x − y⟩
              = ⟨x, x⟩ − ⟨x, y⟩ − ⟨y, x⟩ + ⟨y, y⟩.

Adding these gives the stated result.
It turns out that the space l^p[0, ∞] with p ≠ 2 is not an inner product space. The
parallelogram equality can be used to show this. Consider x = (1, 1, 0, 0, …), y =
(1, −1, 0, 0, …), which are certainly elements of l^p[0, ∞] [see (1.10a)]. We see that

    ||x|| = ||y|| = 2^{1/p},    ||x + y|| = ||x − y|| = 2,

so that ||x + y||² + ||x − y||² = 8, whereas 2(||x||² + ||y||²) = 4 · 2^{2/p}; these agree
only when p = 2. The parallelogram equality is not satisfied, which implies that our norm does not
come from an inner product. Thus, l^p[0, ∞] with p ≠ 2 cannot be an inner product
space.
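The same counterexample can be run numerically. The sketch below (our own; the helper names pnorm and parallelogram_gap are illustrative) evaluates the parallelogram "gap" for the vectors x = (1, 1, 0, …), y = (1, −1, 0, …) used in the text, and confirms that it vanishes only at p = 2:

```python
# Numerical version of the parallelogram-equality test above, confirming that
# the l^p norm comes from an inner product only when p = 2.

def pnorm(v, p):
    """l^p norm of a finite sequence, as in (1.20)."""
    return sum(abs(vk) ** p for vk in v) ** (1.0 / p)

def parallelogram_gap(p):
    """||x+y||^2 + ||x-y||^2 - 2(||x||^2 + ||y||^2) for the text's x and y."""
    x, y = [1.0, 1.0], [1.0, -1.0]
    lhs = (pnorm([a + b for a, b in zip(x, y)], p) ** 2
           + pnorm([a - b for a, b in zip(x, y)], p) ** 2)
    rhs = 2.0 * (pnorm(x, p) ** 2 + pnorm(y, p) ** 2)
    return lhs - rhs

# The gap vanishes at p = 2 but not at p = 1 or p = 3.
gaps = {p: parallelogram_gap(p) for p in (1.0, 2.0, 3.0)}
```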
On the other hand, l²[0, ∞] is an inner product space, where the inner product
is defined to be

    ⟨x, y⟩ = Σ_{k=0}^{∞} x_k y_k*.    (1.31)
Does this infinite series converge? Yes, it does. To see this, we need the Cauchy–
Schwarz inequality.⁶ Recall the Hölder inequality of (1.16). Let p = 2, so that
q = 2. Then the Cauchy–Schwarz inequality is

    Σ_{k=0}^{∞} |x_k y_k| ≤ [ Σ_{k=0}^{∞} |x_k|² ]^{1/2} [ Σ_{k=0}^{∞} |y_k|² ]^{1/2}.    (1.32)
⁶ The inequality we consider here is related to the Schwarz inequality. We will consider the Schwarz
inequality later on. This inequality is of immense practical value to electrical and computer engineers.
It is used to derive the matched-filter receiver, which is employed in digital communications systems,
to derive the uncertainty principle in quantum mechanics and in signal processing, and to derive the
Cramér–Rao lower bound on the variance of parameter estimators, to name only three applications.
Now

    |⟨x, y⟩| = | Σ_{k=0}^{∞} x_k y_k* | ≤ Σ_{k=0}^{∞} |x_k y_k|.    (1.33)

The inequality in (1.33) follows from the triangle inequality for | · |. (Recall that the
absolute value operation is a norm on R. It is also a norm on C; if z = x + jy ∈ C,
then |z| = √(x² + y²), so that |y_k*| = |y_k|.) The right-hand side of (1.32) is finite because x and y are
in l²[0, ∞]. Thus, from (1.33), ⟨x, y⟩ is finite. Thus, the series (1.31) converges.
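The chain of inequalities (1.33) followed by (1.32) can be spot-checked on finite complex sequences (our own illustration; the sample vectors are arbitrary):

```python
import math

# Spot check of the l^2 inner product (1.31) and of the inequality chain
# |<x, y>| <= sum |x_k y_k| <= ||x|| ||y||, i.e. (1.33) followed by (1.32).

x = [1 + 2j, 0.5 - 1j, 3 + 0j]
y = [2 - 1j, 1 + 1j, -1 + 4j]

inner = sum(xk * yk.conjugate() for xk, yk in zip(x, y))  # (1.31)
lhs = sum(abs(xk * yk) for xk, yk in zip(x, y))
rhs = (math.sqrt(sum(abs(xk) ** 2 for xk in x))
       * math.sqrt(sum(abs(yk) ** 2 for yk in y)))

assert abs(inner) <= lhs <= rhs  # (1.33) and (1.32) chained together
```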
It turns out that C[a, b] is not an inner product space, either. But we will not
demonstrate the truth of this claim here.
Some further examples of inner product spaces are as follows.
Example 1.17 The Euclidean space Rⁿ is an inner product space, where the
inner product is defined to be

    ⟨x, y⟩ = Σ_{k=0}^{n−1} x_k y_k.    (1.34)

The reader will recognize this as the vector dot product from elementary linear
algebra; that is, x · y = ⟨x, y⟩. It is well worth noting that

    ⟨x, y⟩ = xᵀ y.    (1.35)

Here the superscript T denotes transposition. So, xᵀ is a row vector. The inner
product in (1.34) certainly induces the norm in (1.19).
Example 1.18 The unitary space Cⁿ is an inner product space for the inner
product

    ⟨x, y⟩ = Σ_{k=0}^{n−1} x_k y_k*.    (1.36)

Again, the norm of (1.19) is induced by inner product (1.36). If H denotes the
operation of complex conjugation and transposition (this is called Hermitian
transposition), then

    y^H = [y₀* y₁* ⋯ y_{n−1}*]

(row vector), and

    ⟨x, y⟩ = y^H x.    (1.37)
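The two inner products of Examples 1.17 and 1.18 differ only in the conjugation, and that conjugation is exactly what axiom (I3) requires on Cⁿ. A minimal sketch (ours; the function names are illustrative):

```python
# Sketch of the inner products of Examples 1.17 and 1.18: <x, y> = x^T y
# on R^n, eq. (1.34)/(1.35), and <x, y> = y^H x on C^n, eq. (1.36)/(1.37).

def inner_rn(x, y):
    """Real inner product (vector dot product), eq. (1.34)."""
    return sum(xk * yk for xk, yk in zip(x, y))

def inner_cn(x, y):
    """Complex inner product sum x_k y_k^*, eq. (1.36)."""
    return sum(xk * yk.conjugate() for xk, yk in zip(x, y))

# On C^n, axiom (I3), <x, y> = <y, x>^*, holds precisely because of the
# conjugate on the second argument.
xc, yc = [1 + 1j, 2 - 1j], [3 + 0j, 1j]
assert inner_cn(xc, yc) == inner_cn(yc, xc).conjugate()  # (I3)
```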
Example 1.19 The space L²[a, b] from Example 1.7 is an inner product space
if the inner product is defined to be

    ⟨x, y⟩ = ∫_a^b x(t) y*(t) dt.    (1.38)
The norm induced by (1.38) is

    ||x|| = [ ∫_a^b |x(t)|² dt ]^{1/2}.    (1.39)

This in turn induces the metric in (1.11b).
Now we consider the concept of orthogonality in a completely general manner.
Definition 1.6: Orthogonality Let x, y be vectors from some inner product
space X. These vectors are orthogonal iff

    ⟨x, y⟩ = 0.

The orthogonality of x and y is symbolized by writing x ⊥ y. Similarly, for subsets
A, B ⊂ X we write x ⊥ A if x ⊥ a for all a ∈ A, and A ⊥ B if a ⊥ b for all a ∈ A
and b ∈ B.
If we consider the inner product space R², then it is easy to see that
⟨[1 0]ᵀ, [0 1]ᵀ⟩ = 0, so [0 1]ᵀ and [1 0]ᵀ are orthogonal vectors. In fact, these
vectors form an orthogonal basis for R², a concept we will consider more generally
below. If we define the unit vectors e₀ = [1 0]ᵀ and e₁ = [0 1]ᵀ, then we
recall that any x ∈ R² can be expressed as x = x₀e₀ + x₁e₁. (The extension of
this reasoning to Rⁿ for n > 2 should be clear.) Another example of a pair of
orthogonal vectors would be x = (1/√2)[1 1]ᵀ and y = (1/√2)[1 −1]ᵀ. These too
form an orthogonal basis for the space R².
Define the functions

    φ(x) = { 1,  0 ≤ x < 1
           { 0,  otherwise    (1.40)

and

    ψ(x) = {  1,  0 ≤ x < 1/2
           { −1,  1/2 ≤ x < 1
           {  0,  otherwise.    (1.41)
Function φ(x) is called the Haar scaling function, and function ψ(x) is called the
Haar wavelet [5]. The function φ(x) is also called a non-return-to-zero (NRZ)
pulse, and function ψ(x) is also called a Manchester pulse [6]. It is easy to confirm
that these pulses are elements of L²(R) = L²(−∞, ∞), and that they are
orthogonal, that is, ⟨φ, ψ⟩ = 0 under the inner product defined in (1.38). This is
so because

    ⟨φ, ψ⟩ = ∫_{−∞}^{∞} φ(x) ψ*(x) dx = ∫_0^1 ψ(x) dx = 0.
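The orthogonality of φ and ψ is easy to confirm numerically. A sketch (ours; the Riemann-sum quadrature is an illustrative choice, and since both functions vanish outside [0, 1) it suffices to integrate there):

```python
# Numerical check that the Haar scaling function (1.40) and the Haar
# wavelet (1.41) are orthogonal, via a midpoint Riemann sum for <phi, psi>.

def phi(x):
    """Haar scaling function (NRZ pulse), eq. (1.40)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def psi(x):
    """Haar wavelet (Manchester pulse), eq. (1.41)."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

n = 1000
h = 1.0 / n
# Both functions are real, so psi^* = psi; the +1 and -1 halves cancel exactly.
ip = sum(phi((k + 0.5) * h) * psi((k + 0.5) * h) for k in range(n)) * h
```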
Thus, we consider φ and ψ to be elements in the inner product space L²(R), for
which the inner product is

    ⟨x, y⟩ = ∫_{−∞}^{∞} x(t) y*(t) dt.

It turns out that the Haar wavelet is the simplest example of the more general class
of Daubechies wavelets. The general theory of these wavelets first appeared in
Daubechies [7]. Their development has revolutionized signal processing and many
other areas.⁷ The main reason for this is the fact that for any f(t) ∈ L²(R)

    f(t) = Σ_{n=−∞}^{∞} Σ_{k=−∞}^{∞} ⟨f, ψ_{n,k}⟩ ψ_{n,k}(t),    (1.42)

where ψ_{n,k}(t) = 2^{n/2} ψ(2ⁿt − k). This doubly infinite series is called a wavelet
series expansion for f. The coefficients f_{n,k} = ⟨f, ψ_{n,k}⟩ have finite energy. In
effect, if we treat either k or n as a constant, then the resulting doubly infinite
sequence is in the space l²[−∞, ∞]. In fact, it is also the case that

    Σ_{n=−∞}^{∞} Σ_{k=−∞}^{∞} |f_{n,k}|² < ∞.    (1.43)

It is to be emphasized that the ψ used in (1.42) could be (1.41), or it could be
chosen from the more general class in Ref. 7. We shall not prove these things in
this book, as the technical arguments are quite hard.
The wavelet series is presently not as familiar to the broader electrical and
computer engineering community as is the Fourier series. A brief summary of the
Fourier series is as follows. Again, rigorous proofs of many of the following claims
will be avoided, though good introductory references to Fourier series are Tolstov
[8] or Kreyszig [9]. If f ∈ L²(0, 2π), then

    f(t) = Σ_{n=−∞}^{∞} f_n e^{jnt},  j = √−1,    (1.44)

where the Fourier (series) coefficients are given by

    f_n = (1/2π) ∫_0^{2π} f(t) e^{−jnt} dt.    (1.45)

We may define

    e_n(t) = exp(jnt)  (t ∈ (0, 2π), n ∈ Z)    (1.46)
⁷ For example, in digital communications the problem of designing good signaling pulses for data
transmission is best treated with respect to wavelet theory.
so that we see

    ⟨f, e_n⟩ = (1/2π) ∫_0^{2π} f(t) [e^{jnt}]* dt = f_n.    (1.47)
The series (1.44) is the complex Fourier series expansion for f. Note that for
n, k ∈ Z

    exp[jn(t + 2πk)] = exp[jnt] exp[2πjnk] = exp[jnt].    (1.48)

Here we have used Euler's identity

    e^{jx} = cos x + j sin x    (1.49)

and cos(2πk) = 1, sin(2πk) = 0. The function e^{jnt} is therefore 2π-periodic; that
is, its period is 2π. It therefore follows that the series on the right-hand side of
(1.44) is a 2π-periodic function, too. The result (1.48) implies that, although f in
(1.44) is initially defined only on (0, 2π), we are at liberty to "periodically extend"
f over the entire real-number line; that is, we can treat f as one period of the
periodic function

    f̃(t) = Σ_{k∈Z} f(t + 2πk),    (1.50)

for which f̃(t) = f(t) for t ∈ (0, 2π). Thus, series (1.44) is a way to represent
periodic functions. Because f ∈ L²(0, 2π), it turns out that

    Σ_{n=−∞}^{∞} |f_n|² < ∞,    (1.51)

so that (f_n) ∈ l²[−∞, ∞].
Observe that in (1.47) we have "redefined" the inner product on L²(0, 2π) to be

    ⟨x, y⟩ = (1/2π) ∫_0^{2π} x(t) y*(t) dt,    (1.52)

which differs from (1.38) in that it has the factor 1/(2π) in front. This variation also
happens to be a valid inner product on the vector space defined by the set in (1.11a).
Actually, it is a simple example of a weighted inner product.
Now consider, for n ≠ m,

    ⟨e_n, e_m⟩ = (1/2π) ∫_0^{2π} e^{jnt} e^{−jmt} dt = [1/(2πj(n − m))] [e^{j(n−m)t}]_0^{2π}
              = [e^{2πj(n−m)} − 1]/(2πj(n − m)) = [1 − 1]/(2πj(n − m)) = 0.    (1.53)

Similarly,

    ⟨e_n, e_n⟩ = (1/2π) ∫_0^{2π} e^{jnt} e^{−jnt} dt = (1/2π) ∫_0^{2π} dt = 1.    (1.54)

So, e_n and e_m (if n ≠ m) are orthogonal with respect to the inner product in (1.52).
From basic electric circuit analysis, periodic signals have finite power. Therefore,
series (1.44) is a way to represent finite-power signals.⁸ We might therefore consider
the space L²(0, 2π) to be the "space of finite-power signals." From considerations
involving the wavelet series representation of (1.42), we may consider L²(R) to
be the "space of finite-energy signals." Recall also the discussion at the end of
Section 1.3.2 (last paragraph).
Example 1.20 Suppose that

    f(t) = {  1,  0 ≤ t < π
           { −1,  π ≤ t < 2π.    (1.55)

A sketch of this function is one period of a 2π-periodic square wave. The Fourier
coefficients are given by (for n ≠ 0)

    f_n = (1/2π) ∫_0^{2π} f(t) e^{−jnt} dt = (1/2π) [ ∫_0^π e^{−jnt} dt − ∫_π^{2π} e^{−jnt} dt ]
        = (1/2π) (1/(jn)) [ (1 − e^{−jnπ}) − (e^{−jnπ} − 1) ]
        = (1/(πjn)) (1 − e^{−jnπ})
        = (1/(πjn)) e^{−jnπ/2} (e^{jnπ/2} − e^{−jnπ/2})
        = (2/(πn)) e^{−jnπ/2} sin(nπ/2),    (1.56)

where we have made use of

    sin x = (1/2j) (e^{jx} − e^{−jx}).    (1.57)

This is easily derived using the Euler identity in (1.49). For n = 0, it should be
clear that f₀ = 0.
The coefficients f_n in (1.56) involve expressions containing j. Since f(t) is
real-valued, it therefore follows that we can rewrite the series expansion in such a
manner as to avoid complex arithmetic. It is almost a standard practice to do this.
We now demonstrate this process:
    Σ_{n≠0} f_n e^{jnt}
        = Σ_{n=1}^{∞} (2/(πn)) e^{−jnπ/2} sin(nπ/2) e^{jnt} + Σ_{n=−∞}^{−1} (2/(πn)) e^{−jnπ/2} sin(nπ/2) e^{jnt}
        = Σ_{n=1}^{∞} (2/(πn)) e^{−jnπ/2} sin(nπ/2) e^{jnt} + Σ_{n=1}^{∞} (2/(πn)) e^{jnπ/2} sin(nπ/2) e^{−jnt}
        = (2/π) Σ_{n=1}^{∞} (1/n) sin(nπ/2) [ e^{jnt} e^{−jnπ/2} + e^{−jnt} e^{jnπ/2} ]
        = (4/π) Σ_{n=1}^{∞} (1/n) cos[n(t − π/2)] sin(nπ/2).

Here we have used the fact that (see Appendix 1.A)

    e^{jnt} e^{−jnπ/2} + e^{−jnt} e^{jnπ/2} = 2 Re[e^{jnt} e^{−jnπ/2}] = 2 cos[n(t − π/2)].

This is so because if z = x + jy, then z + z* = 2x = 2 Re[z]. Since

    cos(α + β) = cos α cos β − sin α sin β,

we have

    cos[n(t − π/2)] = cos(nt) cos(nπ/2) + sin(nt) sin(nπ/2).

However, if n is an even number, then sin(πn/2) = 0, and if n is an odd number,
then cos(πn/2) = 0. Therefore

    (4/π) Σ_{n=1}^{∞} (1/n) cos[n(t − π/2)] sin(πn/2)
        = (4/π) Σ_{n=0}^{∞} (1/(2n+1)) sin[(2n + 1)t] sin²[(2n + 1)π/2],

but sin²[(2n + 1)π/2] = 1, so finally we have

    f(t) = Σ_{n=−∞}^{∞} f_n e^{jnt} = (4/π) Σ_{n=0}^{∞} (1/(2n+1)) sin[(2n + 1)t].

⁸ In fact, using phasor analysis and superposition, you can apply (1.44) to determine the steady-state
output of a circuit for any periodic input (including, and especially, nonsinusoidal periodic functions).
This makes the Fourier series very important in electrical/electronic circuit analysis.
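Example 1.20 can be spot-checked numerically. The sketch below (ours; the helper names partial_sum and f1, and the quadrature parameters, are illustrative choices) evaluates the real-form partial sums of the square-wave series at t = π/2, where f equals 1, and also compares the coefficient formula (1.56) at n = 1, which gives f₁ = −2j/π, against a direct numerical integration of (1.45):

```python
import math

# Checking Example 1.20 numerically: partial sums of the real-form series
# f(t) = (4/pi) * sum_{n>=0} sin((2n+1)t)/(2n+1), and the coefficient
# formula (1.56) versus numerical integration of (1.45).

def partial_sum(t, N):
    """First N terms of the real-form square-wave Fourier series."""
    return (4.0 / math.pi) * sum(
        math.sin((2 * n + 1) * t) / (2 * n + 1) for n in range(N)
    )

# At t = pi/2 the square wave (1.55) equals 1.
approx = partial_sum(math.pi / 2, 2000)

# Numerical f_1 via (1.45); formula (1.56) predicts f_1 = -2j/pi.
M = 200000
h = 2.0 * math.pi / M
f = lambda t: 1.0 if t < math.pi else -1.0
f1 = sum(
    f((k + 0.5) * h) * complex(math.cos((k + 0.5) * h), -math.sin((k + 0.5) * h))
    for k in range(M)
) * h / (2.0 * math.pi)
```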
It is important to note that the wavelet series and Fourier series expansions have
something in common, in spite of the fact that they look quite different and indeed
are associated with quite different function spaces. The common feature is that both
representations involve the use of orthogonal basis functions. We are now ready to
consider this in a general manner.
Begin by recalling from elementary linear algebra that a basis for a vector space
such as X = Rⁿ or X = Cⁿ is a set of n vectors, say,

    B = {e₀, e₁, …, e_{n−1}},    (1.58)

such that the elements e_k (basis vectors) are linearly independent. This means that
no vector in the set can be expressed as a linear combination of any of the others.
In general, it is not necessary that ⟨e_k, e_n⟩ = 0 for n ≠ k. In other words, independence
does not require orthogonality. However, if set B is a basis (orthogonal or
otherwise), then for any x ∈ X (vector space) there exists a set of coefficients from
the field of the vector space, say, b = {b₀, b₁, …, b_{n−1}}, such that

    x = Σ_{k=0}^{n−1} b_k e_k.    (1.59)

We say that spaces Rⁿ and Cⁿ are of dimension n. This is a direct reference to the
number of basis vectors in B. This notion generalizes.
Now let us consider a sequence space (e.g., l²[0, ∞]). Suppose x =
(x₀, x₁, x₂, …) ∈ l²[0, ∞]. Define the following unit vector sequences:

    e₀ = (1, 0, 0, 0, …),  e₁ = (0, 1, 0, 0, …),  e₂ = (0, 0, 1, 0, …), etc.    (1.60)

Clearly,

    x = Σ_{k=0}^{∞} x_k e_k.    (1.61)

It is equally clear that no vector e_k can be expressed as a linear combination of
any of the others. Thus, the countably infinite set⁹ B = {e₀, e₁, e₂, …} forms a
basis for l²[0, ∞]. The sequence space is therefore of infinite dimension because
B has a countable infinity of members. It is apparent as well that, under the inner
product defined in (1.31), we have ⟨e_n, e_m⟩ = δ_{n−m}. Sequence δ = (δ_n) is called
the Kronecker delta sequence. It is defined by

    δ_n = { 1,  n = 0
          { 0,  n ≠ 0.    (1.62)

Therefore, the vectors in (1.60) are mutually orthogonal as well. So they happen to
form an orthogonal basis for l²[0, ∞]. Of course, this is not the only possible basis.
In general, given a countably infinite set of vectors {e_k | k ∈ Z⁺} [no longer necessarily
those in (1.60)] that are linearly independent, and such that e_k ∈ l²[0, ∞],
for any x ∈ l²[0, ∞] there will exist coefficients a_k ∈ C such that

    x = Σ_{k=0}^{∞} a_k e_k.    (1.63)
In view of the above, consider the following linearly independent set of vectors
from some inner product space X:

    B = {e_k | e_k ∈ X, k ∈ Z}.    (1.64)

⁹ A set A is countably infinite if its members can be put into one-to-one (1-1) correspondence with
the members of the set Z⁺. This is also equivalent to being able to place the elements of A into 1-1
correspondence with the elements of Z.
Assume that this is a basis for X. In this case, for any x ∈ X there are coefficients
a_k such that

    x = Σ_{k∈Z} a_k e_k.    (1.65)

We define the set B to be orthogonal iff for all n, k ∈ Z

    ⟨e_n, e_k⟩ = δ_{n−k}.    (1.66)

Assume that the elements of B in (1.64) satisfy (1.66). It is then easy to
see that

    ⟨x, e_n⟩ = ⟨ Σ_k a_k e_k, e_n ⟩ = Σ_k ⟨a_k e_k, e_n⟩    [using (I1)]
            = Σ_k a_k ⟨e_k, e_n⟩    [using (I2)]
            = Σ_k δ_{k−n} a_k,    [using (1.66)]

so finally we may say that

    ⟨x, e_n⟩ = a_n.    (1.67)

In other words, if the basis B is orthogonal, then

    x = Σ_{k∈Z} ⟨x, e_k⟩ e_k.    (1.68)
Previous examples (e.g., Fourier series expansion) are merely special cases of this
general idea. We see that one of the main features of an orthogonal basis is the
ease with which we can obtain the coefficients a k . Nonorthogonal bases are harder
to work with in this respect. This is one of the reasons why orthogonal bases are
so universally popular.
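The coefficient-extraction property (1.67)–(1.68) is easy to check numerically. The following Python sketch uses a hypothetical orthonormal basis of $R^3$ (vectors chosen here purely for illustration): the coefficients $a_k = \langle x, e_k \rangle$ come from plain inner products, and summing $a_k e_k$ reassembles $x$.

```python
import math

# A hypothetical orthonormal basis for R^3: the vectors are mutually
# orthogonal and each has unit norm.
e = [
    [1/math.sqrt(2),  1/math.sqrt(2), 0.0],
    [1/math.sqrt(2), -1/math.sqrt(2), 0.0],
    [0.0, 0.0, 1.0],
]

def dot(u, v):
    return sum(ui*vi for ui, vi in zip(u, v))

x = [3.0, -1.0, 2.0]

# Eq. (1.67): a_n = <x, e_n> for an orthonormal basis.
a = [dot(x, ek) for ek in e]

# Eq. (1.68): x = sum_k a_k e_k reassembles the original vector.
x_rec = [sum(a[k]*e[k][i] for k in range(3)) for i in range(3)]
print(x_rec)  # matches x up to rounding
```

With a nonorthogonal basis the same coefficients would instead require solving a linear system, which is the "harder to work with" point made above.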
A few comments on terminology are in order here. Some would say that the condition (1.66) on $B$ in (1.64) means that $B$ is an orthonormal set, and that the condition

$$\langle e_n, e_k \rangle = \alpha_n \delta_{n-k}$$

is the condition for $B$ to be an orthogonal set, where $\alpha_n$ is not necessarily unity (i.e., equal to one) for all $n$. However, in this book we often insist that orthogonal basis vectors be "normalized," so condition (1.66) holds.
We conclude the present section by considering the following theorem. It was
mentioned in a footnote that the following Schwarz inequality (or variations of it)
is of very great value in electrical and computer engineering.
24 FUNCTIONAL ANALYSIS IDEAS
Theorem 1.1: Schwarz Inequality Let $X$ be an inner product space, where $x, y \in X$. Then

$$|\langle x, y \rangle| \le \|x\| \, \|y\|. \qquad (1.69)$$

Equality holds iff $\{x, y\}$ is a linearly dependent set.

Proof If $y = 0$ then $\langle x, 0 \rangle = 0$, and (1.69) clearly holds in this special case. Let $y \neq 0$. For all scalars $a$ in the field of $X$ we must have [via inner product axioms and (1.30)]

$$0 \le \|x - ay\|^2 = \langle x - ay, x - ay \rangle = \langle x, x \rangle - a^*\langle x, y \rangle - a[\langle y, x \rangle - a^*\langle y, y \rangle].$$

If we select $a^* = \langle y, x \rangle / \langle y, y \rangle$, then the quantity in the brackets $[\cdot]$ vanishes. Thus

$$0 \le \langle x, x \rangle - \frac{\langle y, x \rangle}{\langle y, y \rangle}\langle x, y \rangle = \|x\|^2 - \frac{|\langle x, y \rangle|^2}{\|y\|^2}$$

[using $\langle x, y \rangle = \langle y, x \rangle^*$, i.e., axiom (I3)]. Rearranging, this yields

$$|\langle x, y \rangle|^2 \le \|x\|^2 \|y\|^2,$$

and the result (1.69) follows (we must take positive square roots as $\|x\| \ge 0$, and $\|y\| \ge 0$).

Equality holds iff $y = 0$, or else $\|x - ay\|^2 = 0$, hence $x - ay = 0$ [recall (N2)], so $x = ay$, demonstrating linear dependence of $x$ and $y$.
We may now see what Theorem 1.1 has to say when applied to the special case of a vector dot product.

Example 1.21 Suppose that $X$ is the inner product space of Example 1.17. Since $\langle x, y \rangle = \sum_{k=0}^{n-1} x_k y_k$, we have from Theorem 1.1 that

$$\left| \sum_{k=0}^{n-1} x_k y_k \right| \le \left[ \sum_{k=0}^{n-1} x_k^2 \right]^{1/2} \left[ \sum_{k=0}^{n-1} y_k^2 \right]^{1/2}. \qquad (1.70)$$

If $y_k = a x_k$ ($a \in R$) for all $k \in Z_n$, then the left-hand side is

$$\left| \sum_{k=0}^{n-1} a x_k^2 \right| = |a| \sum_{k=0}^{n-1} x_k^2,$$

and since $\left[ \sum_{k=0}^{n-1} a^2 x_k^2 \right]^{1/2} = |a| \left[ \sum_{k=0}^{n-1} x_k^2 \right]^{1/2}$, the right-hand side is

$$\left[ \sum_{k=0}^{n-1} x_k^2 \right]^{1/2} \left[ \sum_{k=0}^{n-1} a^2 x_k^2 \right]^{1/2} = |a| \sum_{k=0}^{n-1} x_k^2.$$

Thus, (1.70) does indeed hold with equality when $y = ax$.
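A quick numerical check of (1.70) can be sketched in Python (the vectors below are arbitrary choices, not from the text): the inequality is strict for a generic pair, and becomes an equality, up to rounding, when $y = ax$.

```python
import math

def inner(x, y):
    # Dot product on R^n, as in Example 1.17.
    return sum(xk*yk for xk, yk in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

x = [1.0, -2.0, 3.0, 0.5]
y = [2.0, 0.0, -1.0, 4.0]

# Generic pair: the Schwarz inequality (1.70) holds strictly here.
print(abs(inner(x, y)) <= norm(x)*norm(y))        # True

# Linearly dependent pair y = a x: equality, up to rounding error.
a = -2.5
ya = [a*xk for xk in x]
print(abs(inner(x, ya)) - norm(x)*norm(ya))       # essentially zero
```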
1.4 THE DISCRETE FOURIER SERIES (DFS)
The subject of discrete Fourier series (DFS) and its relationship to the complex
Fourier series expansion of Section 1.3.3 is often deferred to later courses (e.g.,
signals and systems), but will be briefly considered here as an additional example
of an orthogonal series expansion.
The complex Fourier series expansion of Section 1.3.3 was for $2\pi$-periodic functions defined on the real-number line. A similar series expansion exists for $N$-periodic sequences such as $\tilde{x} = (\tilde{x}_n)$; that is, for $N \in \{2, 3, 4, \ldots\} \subset Z$, consider

$$\tilde{x}_n = \sum_{k \in Z} x_{n+kN}, \qquad (1.71)$$

where $x = (x_n)$ is such that $x_n = 0$ for $n < 0$, and for $n \ge N$ as well. Thus, $x$ is just one period of $\tilde{x}$. We observe that

$$\tilde{x}_{n+mN} = \sum_{k \in Z} x_{n+mN+kN} = \sum_{k \in Z} x_{n+(m+k)N} = \sum_{r \in Z} x_{n+rN} = \tilde{x}_n$$

($r = m + k$). This confirms that $\tilde{x}$ is indeed $N$-periodic (i.e., periodic with period $N$). We normally assume in a context such as this that $x_n \in C$. We also regard $x$ as a vector: $x = [x_0 \ x_1 \ \cdots \ x_{N-1}]^T \in C^N$. An inner product may be defined on the space of $N$-periodic sequences according to

$$\langle \tilde{x}, \tilde{y} \rangle = \langle x, y \rangle = \sum_{n=0}^{N-1} x_n y_n^* \qquad (1.72)$$

(recall Example 1.18), where $y \in C^N$ is one period of $\tilde{y}$. We assume, of course, that $x$ and $y$ are bounded sequences so that (1.72) is well defined.
Now define $e_k = [e_{k,0} \ e_{k,1} \ \cdots \ e_{k,N-1}]^T \in C^N$ according to

$$e_{k,n} = \exp\left( j\frac{2\pi}{N}kn \right), \qquad (1.73)$$

where $n \in Z_N$. The periodization of $e_k = (e_{k,n})$ is

$$\tilde{e}_{k,n} = \sum_{m \in Z} e_{k,n+mN}, \qquad (1.74)$$

yielding $\tilde{e}_k = (\tilde{e}_{k,n})$. That (1.73) is periodic with period $N$ with respect to index $n$ is easily seen:

$$e_{k,n+mN} = \exp\left( j\frac{2\pi}{N}k(n+mN) \right) = \exp\left( j\frac{2\pi}{N}kn \right)\exp(j2\pi km) = e_{k,n}.$$
It can be shown (by exercise) that [using definition (1.72)]

$$\langle \tilde{e}_k, \tilde{e}_r \rangle = \langle e_k, e_r \rangle = \sum_{n=0}^{N-1} \exp\left( j\frac{2\pi}{N}kn \right)\exp\left( -j\frac{2\pi}{N}rn \right) = \sum_{n=0}^{N-1} \exp\left( j\frac{2\pi}{N}(k-r)n \right) = \begin{cases} N, & k - r = 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (1.75)$$

Thus, if we consider $(e_{k,n})$ and $(e_{r,n})$ with $k \neq r$, we find that these sequences are orthogonal, and so the vectors $e_k$ form an orthogonal basis for the vector space $C^N$. From (1.75) we may write

$$\langle e_k, e_r \rangle = N\delta_{k-r}. \qquad (1.76)$$
Thus, there must exist another vector $X = [X_0 \ X_1 \ \cdots \ X_{N-1}]^T \in C^N$ such that

$$x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k \exp\left( j\frac{2\pi}{N}kn \right) \qquad (1.77)$$

for $n \in Z_N$. In fact

$$\langle x, e_r \rangle = \sum_{n=0}^{N-1} x_n e_{r,n}^* = \frac{1}{N}\sum_{n=0}^{N-1} \left\{ \sum_{k=0}^{N-1} X_k \exp\left( j\frac{2\pi}{N}kn \right) \right\} \exp\left( -j\frac{2\pi}{N}rn \right)$$
$$= \frac{1}{N}\sum_{k=0}^{N-1} X_k \left\{ \sum_{n=0}^{N-1} \exp\left( j\frac{2\pi}{N}(k-r)n \right) \right\} = \frac{1}{N}\sum_{k=0}^{N-1} X_k (N\delta_{k-r}) = X_r. \qquad (1.78)$$

That is,

$$X_k = \sum_{n=0}^{N-1} x_n \exp\left( -j\frac{2\pi}{N}kn \right) \qquad (1.79)$$

for $k \in Z_N$.
In (1.77) we see $x_{n+mN} = x_n$ for all $m \in Z$. Thus, $(x_n)$ in (1.77) is $N$-periodic, and so we have $\tilde{x}_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k \exp\left(j\frac{2\pi}{N}kn\right)$ with $X_k$ given by (1.79). Equation (1.77) is the discrete Fourier series (DFS) expansion for an $N$-periodic complex-valued sequence $\tilde{x}$ such as in (1.71). The DFS coefficients are given by (1.79). However, it is common practice to consider only $x_n$ for $n \in Z_N$, which is equivalent to considering only the vector $x \in C^N$. In this case the vector $X \in C^N$ given by (1.79) is now called the discrete Fourier transform (DFT) of the vector $x$, and the expression in (1.77) is the inverse DFT (IDFT) of the vector $X$. We observe that the DFT and the IDFT can be concisely expressed in matrix form, where we define the DFT matrix

$$F = \left[ \exp\left( -j\frac{2\pi}{N}kn \right) \right]_{k,n \in Z_N} \in C^{N \times N}, \qquad (1.80)$$

and we see from (1.77) that $F^{-1} = \frac{1}{N}F^*$ (IDFT matrix). Thus, $X = Fx$. We remark that the symmetry of $F$ (i.e., $F = F^T$) means that either $k$ or $n$ in (1.80) may be interpreted as row or column indices.
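The matrix form of the DFT/IDFT pair can be sketched directly from (1.79), (1.77), and (1.80). The Python fragment below is a direct $O(N^2)$ evaluation (deliberately not an FFT), building $F$, computing $X = Fx$, and confirming that $\frac{1}{N}F^*$ inverts it; $N = 8$ and the test vector are arbitrary choices.

```python
import cmath

N = 8
# DFT matrix (1.80): F[k][n] = exp(-j*2*pi*k*n/N).
F = [[cmath.exp(-2j*cmath.pi*k*n/N) for n in range(N)] for k in range(N)]

def matvec(A, v):
    return [sum(A[k][n]*v[n] for n in range(len(v))) for k in range(len(A))]

x = [complex(n) for n in range(N)]   # arbitrary test vector
X = matvec(F, x)                     # DFT, Eq. (1.79): X = F x

# IDFT matrix: F^{-1} = (1/N) F^*, from Eq. (1.77).
Finv = [[F[k][n].conjugate()/N for n in range(N)] for k in range(N)]
x_rec = matvec(Finv, X)

err = max(abs(u - v) for u, v in zip(x, x_rec))
print(err)   # tiny: rounding error only
```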
The DFT has a long history, and its invention is now attributed to Gauss [10]. The DFT is of central importance to numerical computing generally, but has particularly great significance in digital signal processing as it represents a numerical approximation to the Fourier transform, and it can also be used to efficiently implement digital filtering operations via so-called fast Fourier transform (FFT) algorithms. The construction of FFT algorithms to efficiently compute $X = Fx$ (and $x = F^{-1}X$) is rather involved, and not within the scope of the present book. Simply note that the direct computation of the matrix-vector product $X = Fx$ needs $N^2$ complex multiplications and $N(N-1)$ complex additions. For $N = 2^p$ ($p \in \{1, 2, 3, \ldots\}$), which is called the radix-2 case, the algorithm of Cooley and Tukey [11] reduces the number of operations to something proportional to $N\log_2 N$, which is a substantial savings compared to $N^2$ operations with the direct approach when $N$ is large enough. Essentially, the method in Ref. 11 implicitly factors $F$ according to $F = F_p F_{p-1} \cdots F_1$, where the matrix factors $F_k \in C^{N \times N}$ are sparse (i.e., contain many zero-valued entries). Note that multiplication by zero is not implemented in either hardware or software and so does not represent a computational cost in the practical implementation of the FFT algorithm. It is noteworthy that the algorithm of Ref. 11 also has a long history dating back to the work of Gauss, as noted by Heideman et al. [10]. It is also important to mention that fast algorithms exist for all possible $N \neq 2^p$ [4]. The following example suggests one of the important applications of the DFT/DFS.
Example 1.22 Suppose that $x_n = Ae^{j\theta n}$ with $\theta = \frac{2\pi}{N}m$ for $m = 1, 2, \ldots, \frac{N}{2}-1$ ($N$ is assumed to be even here). From (1.79) using (1.75)

$$X_k = AN\delta_{m-k}. \qquad (1.81)$$

Now suppose instead that $x_n = Ae^{-j\theta n}$, so similarly

$$X_k = A\sum_{n=0}^{N-1} \exp\left( -j\frac{2\pi}{N}n(m+k) \right) = A\sum_{n=0}^{N-1} \exp\left( j\frac{2\pi}{N}n(N-m-k) \right) = AN\delta_{N-m-k}. \qquad (1.82)$$

Thus, if now $x_n = A\cos(\theta n) = \frac{1}{2}A[e^{j\theta n} + e^{-j\theta n}]$, then from (1.81) and (1.82) we must have

$$X_k = \tfrac{1}{2}AN[\delta_{m-k} + \delta_{N-m-k}]. \qquad (1.83)$$

We observe that $X_k = 0$ for all $k \neq m, N-m$, but that

$$X_m = \tfrac{1}{2}AN, \quad \text{and} \quad X_{N-m} = \tfrac{1}{2}AN.$$

Thus, $X_k$ is nonzero only for indices $k = m$ and $k = N-m$ corresponding to the frequency of $(x_n)$, which is $\theta = \frac{2\pi}{N}m$. The DFT/DFS is therefore quite useful in detecting "sinusoids" (also sometimes called "tone detection"). This makes the DFT/DFS useful in such applications as narrowband radar and sonar signal detection.

Can you explain the necessity (or, at least, the desirability) of the second equality in Eq. (1.82)?
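The tone-detection behavior of Example 1.22 can be checked numerically. In the Python sketch below (the values $A = 2$, $N = 16$, $m = 3$ are arbitrary choices), the direct DFT of $A\cos(2\pi mn/N)$ is essentially zero except at bins $m$ and $N - m$, where its magnitude is $AN/2$ as in (1.83).

```python
import cmath, math

A, N, m = 2.0, 16, 3
x = [A*math.cos(2*math.pi*m*n/N) for n in range(N)]

# Direct DFT, Eq. (1.79).
X = [sum(x[n]*cmath.exp(-2j*cmath.pi*k*n/N) for n in range(N))
     for k in range(N)]

for k in range(N):
    if k in (m, N - m):
        print(k, abs(X[k]))        # approximately A*N/2
    else:
        assert abs(X[k]) < 1e-9    # all other bins vanish
```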
APPENDIX 1.A COMPLEX ARITHMETIC

Here we summarize the most important facts about arithmetic with complex numbers $z \in C$ (the set of complex numbers). You will find this material very useful in electric circuits, as well as in the present book.

Complex numbers may be represented in two ways: (1) Cartesian (rectangular) form or (2) polar form. First we consider the Cartesian form. In this case $z \in C$ has the form $z = x + jy$, where $x, y \in R$ (the set of real numbers), and $j = \sqrt{-1}$. The complex conjugate of $z$ is defined to be $z^* = x - jy$ (so $j^* = -j$).
Suppose that $z_1 = x_1 + jy_1$ and $z_2 = x_2 + jy_2$ are two complex numbers. Addition and subtraction are defined as

$$z_1 \pm z_2 = (x_1 \pm x_2) + j(y_1 \pm y_2)$$

[e.g., $(1+2j) + (3-5j) = 4-3j$, and $(1+2j) - (3-5j) = -2+7j$]. Using $j^2 = -1$, the product of $z_1$ and $z_2$ is

$$z_1 z_2 = (x_1 + jy_1)(x_2 + jy_2) = x_1x_2 + j^2y_1y_2 + jy_1x_2 + jx_1y_2 = (x_1x_2 - y_1y_2) + j(x_1y_2 + x_2y_1).$$

We note that

$$zz^* = (x + jy)(x - jy) = x^2 + y^2 = |z|^2,$$

so $|z| = \sqrt{x^2 + y^2}$ defines the magnitude of $z$. For example, $(1+2j)(3-5j) = 13+j$. The quotient of $z_1$ and $z_2$ is defined to be

$$\frac{z_1}{z_2} = \frac{z_1 z_2^*}{z_2 z_2^*} = \frac{(x_1 + jy_1)(x_2 - jy_2)}{x_2^2 + y_2^2} = \frac{(x_1x_2 + y_1y_2) + j(x_2y_1 - x_1y_2)}{x_2^2 + y_2^2} = \frac{x_1x_2 + y_1y_2}{x_2^2 + y_2^2} + j\frac{x_2y_1 - x_1y_2}{x_2^2 + y_2^2},$$

where the last equality is $z_1/z_2$ in Cartesian form.
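Python's built-in complex type can serve as a sanity check on these Cartesian formulas; the sketch below recomputes the worked product example and the quotient formula by hand and compares against the built-in operations.

```python
z1 = complex(1, 2)   # 1 + 2j
z2 = complex(3, -5)  # 3 - 5j

x1, y1, x2, y2 = z1.real, z1.imag, z2.real, z2.imag

# Product by the Cartesian formula (x1 x2 - y1 y2) + j(x1 y2 + x2 y1).
prod = complex(x1*x2 - y1*y2, x1*y2 + x2*y1)
print(prod)          # (13+1j), matching the worked example

# Quotient via z1 z2* / |z2|^2.
d = x2*x2 + y2*y2
quot = complex((x1*x2 + y1*y2)/d, (x2*y1 - x1*y2)/d)

assert prod == z1*z2                 # agrees with built-in multiply
assert abs(quot - z1/z2) < 1e-12     # agrees with built-in divide
```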
Now we may consider polar form representations. For $z = x + jy$, we may regard $x$ and $y$ as the $x$ and $y$ coordinates (respectively) of a point in the Cartesian plane (sometimes denoted $R^2$).$^{10}$ We may therefore express these coordinates in polar form; thus, for any $x$ and $y$ we can write

$$x = r\cos\theta, \quad y = r\sin\theta,$$

where $r \ge 0$, and $\theta \in [0, 2\pi)$, or $\theta \in (-\pi, \pi]$. We observe that

$$x^2 + y^2 = r^2(\cos^2\theta + \sin^2\theta) = r^2,$$

so $|z| = r$.

Now recall the following Maclaurin series expansions (considered in greater depth in Chapter 3):

$$\sin x = \sum_{n=1}^{\infty} (-1)^{n-1}\frac{x^{2n-1}}{(2n-1)!}, \quad \cos x = \sum_{n=1}^{\infty} (-1)^{n-1}\frac{x^{2n-2}}{(2n-2)!}, \quad e^x = \sum_{n=1}^{\infty} \frac{x^{n-1}}{(n-1)!}.$$

$^{10}$This suggests that $z$ may be equivalently represented by the column vector $[x \ y]^T$. The vector interpretation of complex numbers can be quite useful.
These series converge for $-\infty < x < \infty$. Observe the following:

$$e^{jx} = \sum_{n=1}^{\infty} \frac{(jx)^{n-1}}{(n-1)!} = \sum_{n=1}^{\infty} \frac{(jx)^{(2n-1)-1}}{[(2n-1)-1]!} + \sum_{n=1}^{\infty} \frac{(jx)^{2n-1}}{[2n-1]!},$$

where we have split the summation into terms involving even $n$ and odd $n$. Thus, continuing,

$$e^{jx} = \sum_{n=1}^{\infty} \left[ \frac{j^{2n-2}x^{2n-2}}{(2n-2)!} + \frac{j^{2n-1}x^{2n-1}}{(2n-1)!} \right] = \sum_{n=1}^{\infty} (-1)^{n-1}\frac{x^{2n-2}}{(2n-2)!} + j\sum_{n=1}^{\infty} (-1)^{n-1}\frac{x^{2n-1}}{(2n-1)!}$$

(using $j^{2n-2} = (j^2)^{n-1} = (-1)^{n-1}$)

$$= \cos x + j\sin x.$$

Thus, $e^{jx} = \cos x + j\sin x$. This is justification for Euler's identity in (1.49). Additionally, since $e^{-jx} = \cos x - j\sin x$, we have

$$e^{jx} + e^{-jx} = 2\cos x, \quad e^{jx} - e^{-jx} = 2j\sin x.$$

These immediately imply that

$$\cos x = \frac{e^{jx} + e^{-jx}}{2}, \quad \sin x = \frac{e^{jx} - e^{-jx}}{2j}.$$

These identities allow for the conversion of expressions involving trigonometric functions into expressions involving exponentials, and vice versa. The need to do this arises frequently. For this reason, they should be memorized, or else you should remember how to derive them "on the spot" when necessary.
Now observe that

$$re^{j\theta} = r\cos\theta + jr\sin\theta,$$

so that if $z = x + jy$, then, because there exist $r$ and $\theta$ such that $x = r\cos\theta$ and $y = r\sin\theta$, we may immediately write

$$z = re^{j\theta}.$$

This is $z$ in polar form. For example (assuming that $\theta$ is in radians),

$$1 + j = \sqrt{2}e^{j\pi/4}, \quad -1 + j = \sqrt{2}e^{3\pi j/4}, \quad 1 - j = \sqrt{2}e^{-j\pi/4}, \quad -1 - j = \sqrt{2}e^{-3\pi j/4}.$$

It can sometimes be useful to observe that

$$j = e^{j\pi/2}, \quad -j = e^{-j\pi/2}, \quad \text{and} \quad -1 = e^{\pm j\pi}.$$

If $z_1 = r_1 e^{j\theta_1}$ and $z_2 = r_2 e^{j\theta_2}$, then

$$z_1 z_2 = r_1 r_2 e^{j(\theta_1 + \theta_2)}, \quad \frac{z_1}{z_2} = \frac{r_1}{r_2} e^{j(\theta_1 - \theta_2)}.$$

In other words, multiplication and division of complex numbers are very easy when they are expressed in polar form.
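The polar-form rules can be sketched with the standard-library `cmath` module: `polar()` returns $(r, \theta)$, `rect()` goes back to Cartesian form, and a product simply multiplies magnitudes and adds angles (the specific values below are from the examples above).

```python
import cmath, math

z1 = 1 + 1j          # sqrt(2) e^{j pi/4}
z2 = -1 + 1j         # sqrt(2) e^{j 3pi/4}

r1, th1 = cmath.polar(z1)
r2, th2 = cmath.polar(z2)
assert abs(r1 - math.sqrt(2)) < 1e-12 and abs(th1 - math.pi/4) < 1e-12

# Product in polar form: magnitudes multiply, angles add.
prod = cmath.rect(r1*r2, th1 + th2)   # r1 r2 e^{j(th1+th2)}
assert abs(prod - z1*z2) < 1e-12      # matches the Cartesian product

# Quotient in polar form: magnitudes divide, angles subtract.
quot = cmath.rect(r1/r2, th1 - th2)
assert abs(quot - z1/z2) < 1e-12
```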
Finally, some terminology. For $z = x + jy$, we call $x$ the real part of $z$, and we call $y$ the imaginary part of $z$. The notation is

$$x = \mathrm{Re}[z], \quad y = \mathrm{Im}[z].$$

That is, $z = \mathrm{Re}[z] + j\,\mathrm{Im}[z]$.
APPENDIX 1.B ELEMENTARY LOGIC
Here we summarize the basic language and ideas associated with elementary logic
as some of what is found here appears in later sections and chapters of this book.
The concepts found here appear often in mathematics and engineering literature.
Consider two mathematical statements represented as P and Q. Each statement
may be either true or false. Suppose that we know that if P is true, then Q is
certainly true (allowing the possibility that Q is true even if P is false). Then we
say that P implies Q, or Q is implied by P, or P is a sufficient condition for Q,
or symbolically

$$P \Rightarrow Q \quad \text{or} \quad Q \Leftarrow P.$$
Suppose that if P is false, then Q is certainly false (allowing the possibility
that Q may be false even if P is true). Then we say that P is implied by Q, or Q
implies P, or P is a necessary condition for Q, or
$$P \Leftarrow Q \quad \text{or} \quad Q \Rightarrow P.$$
Now suppose that if P is true, then Q is certainly true, and if P is false, then
Q is certainly false. In other words, P and Q are either both true or both false.
Then we say that P implies and is implied by Q, or P is a necessary and sufficient
TLFeBOOK
32 FUNCTIONAL ANALYSIS IDEAS
condition for Q, or P and Q are logically equivalent, or P if and only if Q, or
symbolically

$$P \Leftrightarrow Q.$$

A common abbreviation for "if and only if" is "iff."
The logical contrary of the statement $P$ is called "not $P$." It is often denoted by either $\bar{P}$ or $\sim P$. This is the statement that is true if $P$ is false, or false if $P$ is true. For example, if $P$ is the statement "$x > 1$," then $\sim P$ is the statement "$x \le 1$." If $P$ is the statement "$f(x) \neq 0$ for all $x \in R$," then $\sim P$ is the statement "there is at least one $x \in R$ for which $f(x) = 0$." We may write

$$x^4 - 5x^2 + 4 = 0 \ \Leftarrow \ x = 1 \ \text{or} \ x = 2,$$

but the converse is not true because $x^4 - 5x^2 + 4 = 0$ is a quartic equation possessing four possible solutions. We may write

$$x = 3 \ \Rightarrow \ x^2 = 3x,$$

but we cannot say $x^2 = 3x \Rightarrow x = 3$ because $x = 0$ is also possible.

Finally, we observe that

$$P \Rightarrow Q \ \text{is equivalent to} \ \sim P \Leftarrow\ \sim Q,$$
$$P \Leftarrow Q \ \text{is equivalent to} \ \sim P \Rightarrow\ \sim Q,$$
$$P \Leftrightarrow Q \ \text{is equivalent to} \ \sim P \Leftrightarrow\ \sim Q;$$

that is, taking logical contraries reverses the directions of implication arrows.
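These equivalences are finite statements about two truth values, so they can be verified exhaustively by truth table. A small Python sketch (encoding "$P \Rightarrow Q$" in the standard way as "not $P$, or $Q$"):

```python
def implies(p, q):
    # "P implies Q" is false only when P is true and Q is false.
    return (not p) or q

for P in (False, True):
    for Q in (False, True):
        # P => Q  is equivalent to  ~Q => ~P  (the contrapositive).
        assert implies(P, Q) == implies(not Q, not P)
        # P <=> Q  is equivalent to  ~P <=> ~Q.
        assert (P == Q) == ((not P) == (not Q))

print("all four truth-table rows check out")
```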
REFERENCES
1. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 1978.
2. A. P. Hillman and G. L. Alexanderson, A First Undergraduate Course in Abstract Algebra, 3rd ed., Wadsworth, Belmont, CA, 1983.
3. R. B. J. T. Allenby, Rings, Fields and Groups: An Introduction to Abstract Algebra, Edward Arnold, London, UK, 1983.
4. R. E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley, Reading, MA, 1985.
5. C. K. Chui, Wavelets: A Mathematical Tool for Signal Analysis, SIAM, Philadelphia, PA, 1997.
6. R. E. Ziemer and W. H. Tranter, Principles of Communications: Systems, Modulation, and Noise, 3rd ed., Houghton Mifflin, Boston, MA, 1990.
7. I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets," Commun. Pure Appl. Math. 41, 909-996 (1988).
8. G. P. Tolstov, Fourier Series (transl. from Russian by R. A. Silverman), Dover Publications, New York, 1962.
9. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979.
10. M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the History of the Fast Fourier Transform," IEEE ASSP Mag. 1, 14-21 (Oct. 1984).
11. J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput. 19, 297-301 (April 1965).
PROBLEMS
1.1. (a) Find $a, b \in R$ in

$$\frac{1}{1+2j} = a + bj.$$

(b) Find $r, \theta \in R$ in

$$j = re^{j\theta}.$$

(Of course, choose $r > 0$, and $\theta \in (-\pi, \pi]$.)
1.2. Solve for $x \in C$ in the quadratic equation

$$x^2 - 2r\cos\theta\,x + r^2 = 0.$$

Here $r > 0$, and $\theta \in (-\pi, \pi]$. Express your solution in polar form.
1.3. Let $\theta$ and $\phi$ be arbitrary angles (so $\theta, \phi \in R$). Show that

$$(\cos\theta + j\sin\theta)(\cos\phi + j\sin\phi) = \cos(\theta + \phi) + j\sin(\theta + \phi).$$

1.4. Prove the following theorem. Suppose $z \in C$ such that

$$z = r\cos\theta + jr\sin\theta,$$

for which $r = |z| > 0$, and $\theta \in (-\pi, \pi]$. Let $n \in \{1, 2, 3, \ldots\}$ (i.e., $n$ is a positive integer). The $n$ different $n$th roots of $z$ are given by

$$r^{1/n}\left[ \cos\left( \frac{\theta + 2\pi k}{n} \right) + j\sin\left( \frac{\theta + 2\pi k}{n} \right) \right]$$

for $k = 0, 1, 2, \ldots, n-1$.
1.5. State whether the following are true or false:

(a) $|x| < 2 \ \Rightarrow \ x < 2$
(b) $|x| < 3 \ \Leftarrow \ 0 < x < 3$
(c) $x - y > 0 \ \Rightarrow \ x > y > 0$
(d) $xy = 0 \ \Rightarrow \ x = 0$ and $y = 0$
(e) $x = 10 \ \Leftarrow \ x^2 = 10x$

Explain your answer in all cases.
1.6. Consider the function

$$f(x) = \begin{cases} -x^2 + 2x + 1, & 0 \le x < 1 \\ x^2 - 2x + \tfrac{3}{2}, & 1 \le x \le 2 \end{cases}$$

Find

$$\sup_{x \in [0,2]} f(x), \quad \inf_{x \in [0,2]} f(x).$$
1.7. Suppose that we have the following polynomials in the indeterminate $x$:

$$a(x) = \sum_{k=0}^{n} a_k x^k, \quad b(x) = \sum_{j=0}^{m} b_j x^j.$$

Prove that

$$c(x) = a(x)b(x) = \sum_{l=0}^{n+m} c_l x^l,$$

where

$$c_l = \sum_{k=0}^{l} a_k b_{l-k}.$$

[Comment: This is really asking us to prove that discrete convolution is mathematically equivalent to polynomial multiplication. It explains why the MATLAB routine for multiplying polynomials is called conv. Discrete convolution is a fundamental operation in digital signal processing, and is an instance of something called finite impulse response (FIR) filtering. You will find it useful to note that $a_k = 0$ for $k < 0$ and $k > n$, and that $b_j = 0$ for $j < 0$ and $j > m$. Knowing this allows you to manipulate the summation limits to achieve the desired result.]
1.8. Recall Example 1.5. Suppose that $x_k = 2^{k+1}$, and that $y_k = 1$ for $k \in Z^+$. Find the sum of the series $d(x, y)$. (Hint: Recall the theory of geometric series. For example, $\sum_{k=0}^{n-1} a^k = \frac{1-a^n}{1-a}$ if $a \neq 1$.)

1.9. Prove that if $x \neq 1$, then

$$S_n = \sum_{k=1}^{n} kx^{k-1}$$

is given by

$$S_n = \frac{1 - (n+1)x^n + nx^{n+1}}{(1-x)^2}.$$

What is the formula for $S_n$ when $x = 1$? (Hint: Begin by showing that $S_n - xS_n = 1 + x + x^2 + \cdots + x^{n-1} - nx^n$.)
1.10. Recall Example 1.1. Prove that $d(x, y)$ in (1.5) satisfies all the axioms for a metric.

1.11. Recall Example 1.18. Prove that $\langle x, y \rangle$ in (1.36) satisfies all the axioms for an inner product.

1.12. By direct calculation, show that if $x, y, z$ are elements from an inner product space, then

$$\|z - x\|^2 + \|z - y\|^2 = \tfrac{1}{2}\|x - y\|^2 + 2\left\| z - \tfrac{1}{2}(x + y) \right\|^2$$

(Apollonius' identity).

1.13. Suppose $x, y \in R^3$ (three-dimensional Euclidean space) such that

$$x = [1 \ 1 \ 1]^T, \quad y = [1 \ -1 \ 1]^T.$$

Find all vectors $z \in R^3$ such that $\langle x, z \rangle = \langle y, z \rangle = 0$.
1.14. The complex Fourier series expansion method as described is for $f \in L^2(0, 2\pi)$. Find the complex Fourier series expansion for $f \in L^2(0, T)$, where $0 < T < \infty$ (i.e., the interval on which $f$ is defined is now of arbitrary length).

1.15. Consider again the complex Fourier series expansion for $f \in L^2(0, 2\pi)$. Specifically, consider Eq. (1.44). If $f(t) \in R$ for all $t \in (0, 2\pi)$, then show that $f_n = f_{-n}^*$. [The sequence $(f_n)$ is conjugate symmetric.] Use this to show that for suitable $a_n, b_n \in R$ (all $n$) we have

$$\sum_{n=-\infty}^{\infty} f_n e^{jnt} = a_0 + \sum_{n=1}^{\infty} [a_n\cos(nt) + b_n\sin(nt)].$$

How are the coefficients $a_n$ and $b_n$ related to $f_n$? (Be very specific. There is a simple formula.)

1.16. (a) Suppose that $f \in L^2(0, 2\pi)$, and that specifically

$$f(t) = \begin{cases} 1, & 0 < t < \pi \\ \tfrac{1}{2}, & \pi \le t < 2\pi \end{cases}$$

Find $f_n$ in Eq. (1.44) using (1.45); that is, find the complex Fourier series expansion for $f(t)$. Make sure that you appropriately simplify your series expansion.

(b) Show how to use the result in Example 1.20 to find the complex Fourier series expansion for $f(t)$ in (a).
1.17. This problem is about finding the Fourier series expansion for the waveform at the output of a full-wave rectifier circuit. This circuit is used in AC/DC (alternating/direct-current) converters. Knowledge of the Fourier series expansion gives information to aid in the design of such converters.

(a) Find the complex Fourier series expansion of

$$f(t) = \left| \sin\left( \frac{2\pi}{T_1}t \right) \right|.$$

(b) Find the sequences $(a_n)$ and $(b_n)$ in

$$f(t) = a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left( \frac{2\pi n}{T}t \right) + b_n\sin\left( \frac{2\pi n}{T}t \right) \right]$$

for $f(t)$ in (a). You need to consider how $T$ is related to $T_1$.
1.18. Recall the definitions of the Haar scaling function and Haar wavelet in Eqs. (1.40) and (1.41), respectively. Define $\phi_{k,n}(t) = 2^{k/2}\phi(2^k t - n)$, and $\psi_{k,n}(t) = 2^{k/2}\psi(2^k t - n)$. Recall that $\langle f(t), g(t) \rangle = \int_{-\infty}^{\infty} f(t)g^*(t)\,dt$ is the inner product for $L^2(R)$.

(a) Sketch $\phi_{k,n}(t)$ and $\psi_{k,n}(t)$.

(b) Evaluate the integrals

$$\int_{-\infty}^{\infty} \phi_{k,n}^2(t)\,dt, \quad \text{and} \quad \int_{-\infty}^{\infty} \psi_{k,n}^2(t)\,dt.$$

(c) Prove that

$$\langle \phi_{k,n}(t), \phi_{k,m}(t) \rangle = \delta_{n-m}.$$
1.19. Prove the following version of the Schwarz inequality. For all $x, y \in X$ (inner product space)

$$|\mathrm{Re}[\langle x, y \rangle]| \le \|x\| \, \|y\|$$

with equality iff $y = \beta x$, where $\beta \in R$ is a constant.

[Hint: The proof of this one is not quite like that of Theorem 1.1. Consider $\langle \alpha x + y, \alpha x + y \rangle \ge 0$ with $\alpha \in R$. The inner product is to be viewed as a quadratic in $\alpha$.]
1.20. The following result is associated with the proof of the uncertainty principle for analog signals.

Prove that for $f(t) \in L^2(R)$ such that $|t|f(t) \in L^2(R)$ and $f^{(1)}(t) = df(t)/dt \in L^2(R)$, we have the inequality

$$\left[ \mathrm{Re}\int_{-\infty}^{\infty} tf(t)[f^{(1)}(t)]^*\,dt \right]^2 \le \int_{-\infty}^{\infty} |tf(t)|^2\,dt \int_{-\infty}^{\infty} |f^{(1)}(t)|^2\,dt.$$
1.21. Suppose $e_k = [e_{k,0} \ e_{k,1} \ \cdots \ e_{k,N-2} \ e_{k,N-1}]^T \in C^N$, where

$$e_{k,n} = \exp\left( j\frac{2\pi}{N}kn \right)$$

and $k \in Z_N$. If $x, y \in C^N$, recall that $\langle x, y \rangle = \sum_{k=0}^{N-1} x_k y_k^*$. Prove that $\langle e_k, e_r \rangle = N\delta_{k-r}$. Thus, $B = \{e_k \mid k \in Z_N\}$ is an orthogonal basis for $C^N$. Set $B$ is important in digital signal processing because it is used to define the discrete Fourier transform.
2
Number Representations
2.1 INTRODUCTION
In this chapter we consider how numbers are represented on a computer largely with
respect to the errors that occur when basic arithmetical operations are performed
on them. We are most interested here in so-called rounding errors (also called
roundoff errors). Floating-point computation is emphasized. This is due to the
fact that most numerical computation is performed with floating-point numbers,
especially when numerical methods are implemented in high-level programming
languages such as C, Pascal, FORTRAN, and C++. However, an understanding
of floating-point requires some understanding of fixed-point schemes first, and so
this case will be considered initially. In addition, fixed-point schemes are used to
represent integer data (i.e., subsets of Z), and so the fixed-point representation is
important in its own right. For example, the exponent in a floating-point number
is an integer.
The reader is assumed to be familiar with how integers are represented, and how they are manipulated with digital hardware, from a typical introductory digital electronics book or course. However, if this is not so, then some review of this topic appears in Appendix 2.A. The reader should study this material now if necessary.

Our main (historical) reference text for the material of this chapter is Wilkinson [1]. However, Golub and Van Loan [4, Section 2.4] is also a good reference. Golub and Van Loan [4] base their conventions and results in turn on Forsythe et al. [5].
2.2 FIXED-POINT REPRESENTATIONS
We now consider fixed-point fractions. We must do so because the mantissa in a
floating-point number is a fixed-point fraction.
We assume that fractions are $t+1$ digits long. If the number is in binary, then we usually say "$t+1$ bits" long instead. Suppose, then, that $x$ is a $(t+1)$-bit fraction. We shall write it in the form

$$(x)_2 = x_0.x_1x_2\cdots x_{t-1}x_t \quad (x_k \in \{0, 1\}). \qquad (2.1)$$
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
The notation $(x)_2$ means that $x$ is in base-2 (binary) form. More generally, $(x)_r$ means that $x$ is expressed as a base-$r$ number (e.g., if $r = 10$ this would be the decimal representation). We use this notation to emphasize which base we are working with when necessary (e.g., to avoid ambiguity). We shall assume that (2.1) is a two's complement fraction. Thus, bit $x_0$ is the sign bit. If this bit is 1, we interpret the fraction to be negative; otherwise, it is nonnegative. For example, $(1.1011)_2 = (-0.3125)_{10}$. [To take the two's complement of $(1.1011)_2$, first complement every bit, and then add $(0.0001)_2$. This gives $(0.0101)_2 = (0.3125)_{10}$.] In general, for the case of a $(t+1)$-bit two's complement fraction, we obtain

$$-1 \le x \le 1 - 2^{-t}. \qquad (2.2)$$

In fact

$$(-1)_{10} = (1.\underbrace{00\cdots00}_{t \ \text{bits}})_2, \quad (1 - 2^{-t})_{10} = (0.\underbrace{11\cdots11}_{t \ \text{bits}})_2. \qquad (2.3)$$

We may regard (2.2) as specifying the dynamic range of the $(t+1)$-bit two's complement fraction representation scheme. Numbers beyond this range are not represented. Justification of (2.2) [and (2.3)] would follow the argument for the conversion of two's complement integers into decimal integers that is considered in Appendix 2.A.
Consider the set $\{x \in R \mid -1 \le x \le 1 - 2^{-t}\}$. In other words, $x$ is a real number within the limits imposed by (2.2), but it is not necessarily equal to a $(t+1)$-bit fraction. For example, $x = \sqrt{2} - 1$ is in the range (2.2), but it is an irrational number, and so does not possess an exact $(t+1)$-bit representation. We may choose to approximate such a number with $t+1$ bits. Denote the $(t+1)$-bit approximation of $x$ as $Q[x]$. For example, $Q[x]$ might be the approximation to $x$ obtained by selecting an element from the set

$$B = \{b_n = -1 + 2^{-t}n \mid n = 0, 1, \ldots, 2^{t+1} - 1\} \subset R \qquad (2.4)$$

that is the closest to $x$, where distance is measured by the metric in Example 1.1. Note that each number in $B$ is representable as a $(t+1)$-bit fraction. In fact, $B$ is the entire set of $(t+1)$-bit two's complement fractions. Formally, our approximation is given by

$$Q[x] = \arg\min_{n \in \{0, 1, \ldots, 2^{t+1}-1\}} |x - b_n|. \qquad (2.5)$$

The notation "argmin" means "let $Q[x]$ be the $b_n$ for the $n$ in the set $\{0, 1, \ldots, 2^{t+1}-1\}$ that minimizes $|x - b_n|$." In other words, we choose the argument $b_n$ that minimizes the distance to $x$. Some reflection (and perhaps considering some simple examples for small $t$) will lead the reader to conclude that the error in this approximation satisfies

$$|x - Q[x]| \le 2^{-(t+1)}. \qquad (2.6)$$
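The quantizer (2.5) and its error bound (2.6) can be sketched directly. The Python below builds the full $(t+1)$-bit code set $B$ of (2.4) for $t = 4$, implements $Q[x]$ as a brute-force argmin, and confirms $|x - Q[x]| \le 2^{-(t+1)}$ for a few sample reals (the samples are arbitrary choices within the dynamic range).

```python
import math

t = 4                               # fraction is t+1 = 5 bits long

# Eq. (2.4): all (t+1)-bit two's complement fraction values.
B = [-1 + (2.0**-t)*n for n in range(2**(t + 1))]

def Q(x):
    # Eq. (2.5): nearest element of B (brute-force argmin).
    return min(B, key=lambda b: abs(x - b))

for x in (math.sqrt(2) - 1, -0.3, 0.7071, 1 - 2.0**-t):
    err = abs(x - Q(x))
    assert err <= 2.0**-(t + 1)     # bound (2.6)
    print(x, Q(x), err)
```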
The error $e = x - Q[x]$ is called quantization error. Equation (2.6) is an upper bound on the size (norm) of this error. In fact, in the notation of Chapter 1, if $\|x\| = |x|$, then $\|e\| = \|x - Q[x]\| \le 2^{-(t+1)}$. We remark that our quantization method is not unique. There are many other methods, and these will generally lead to different bounds.

When we represent the numbers in a computational problem on a computer, we see that errors due to quantization can arise even before we perform any operations on the numbers at all. However, errors will also arise in the course of performing basic arithmetic operations on the numbers. We consider the sources of these now.

If $x$, $y$ are coded as in (2.1), then their sum might not be in the range specified by (2.2). This can happen only if $x$ and $y$ are either both positive or both negative. Such a condition is fixed-point overflow. (A test for overflow in two's complement integer addition appears in Appendix 2.A, and it is easy to modify it for the problem of overflow testing in the addition of fractions.) Similarly, overflow can occur when a negative number is subtracted from a positive number, or if a positive number is subtracted from a negative number. A test for this case is possible, too, but we omit the details. Other than the problem of overflow, no errors can occur in the addition or subtraction of fractions.
With respect to fractions, rounding error arises only when we perform multiplication and division. We now consider errors in these operations.
We will deal with multiplication first. Suppose that $x$ and $y$ are represented according to (2.1). Suppose also that $x_0 = y_0 = 0$. It is easy to see that the product of $x$ and $y$ is given by

$$p = xy = \left( \sum_{k=0}^{t} x_k 2^{-k} \right)\left( \sum_{n=0}^{t} y_n 2^{-n} \right) = (x_0 + x_1 2^{-1} + \cdots + x_t 2^{-t})(y_0 + y_1 2^{-1} + \cdots + y_t 2^{-t})$$
$$= x_0 y_0 + (x_0 y_1 + x_1 y_0)2^{-1} + \cdots + x_t y_t 2^{-2t}. \qquad (2.7)$$

This implies that the product is a $(2t+1)$-bit number. If we allow $x$ and $y$ to be either positive or negative, then the product will also be $2t+1$ bits long. Of course, one of these bits is the sign bit. If we had to multiply several numbers together, we see that the product wordsize would grow in some proportion to the number of factors in the product. The growth is clearly very rapid, and no practical computer could sustain this for very long. We are therefore forced in general to round off the product $p$ back down to a number that is only $t+1$ bits long. Obviously, this will introduce an error.

How should the rounding be done? There is more than one possibility (just as there is more than one way to quantize). Wilkinson [1, p. 4] suggests the following. Since the product $p$ has the form

$$(p)_2 = p_0.p_1p_2\cdots p_{t-1}p_tp_{t+1}\cdots p_{2t} \quad (p_k \in \{0, 1\}) \qquad (2.8)$$
we may add $2^{-(t+1)}$ to this product, and then simply discard the last $t$ bits of the resulting sum (i.e., the bits indexed $t+1$ to $2t$). For example, suppose $t = 4$, and consider

  0.00111111 = p
+ 0.00001000 = 2^{-5}
  ----------
  0.01000111

Thus, the rounded product is $(0.0100)_2$. The error involved in rounding in this manner is not higher in magnitude than $\frac{1}{2}2^{-t} = 2^{-(t+1)}$. Define the result of the rounding operation to be $fx[p] = fx[xy]$, so then

$$|p - fx[p]| \le \tfrac{1}{2}2^{-t}. \qquad (2.9)$$

[For the previous example, $p = (0.00111111)_2$, and so $fx[p] = (0.0100)_2$.] It is natural to measure the sizes of errors in the same way as we measured the size of quantization errors earlier. Thus, (2.9) is an upper bound on the size of the error due to rounding a product. As with quantization, other rounding methods would generally give other bounds. We remark that Wilkinson's suggestion amounts to "ordinary rounding."
Finally, we consider fixed-point division. Again, suppose that $x$ and $y$ are represented as in (2.1), and consider the quotient $q = x/y$. Obviously, we must avoid $y = 0$. Also, the quotient will not be in the permitted range given by (2.2) unless $|y| > |x|$. This implies that when fixed-point division is implemented, either the dividend $x$ or the divisor $y$ needs to be scaled to meet this restriction. Scaling is multiplication by a power of 2, and so should be implemented to reduce rounding error. We do not consider the specifics of how to achieve this. Another problem is that $x/y$ may require an infinite number of bits to represent it. For example, suppose

$$q = \frac{(0.0010)_2}{(0.0110)_2} = \frac{(0.125)_{10}}{(0.375)_{10}} = \left( \frac{1}{3} \right)_{10} = (0.\overline{01})_2.$$

The bar over 01 denotes the fact that this pattern repeats indefinitely. Fortunately, the same recipe for the rounding of products considered above may also be used to round quotients. If $fx[q]$ again denotes the result of applying this procedure to $q$, then

$$|q - fx[q]| \le \tfrac{1}{2}2^{-t}. \qquad (2.10)$$

We see that the difficulties associated with division in fixed-point representations mean that fixed-point arithmetic should, if possible, not be used to implement algorithms that require division. This forces us to either (1) employ floating-point representations or (2) develop algorithms that solve the problem without the need for division operations. Both strategies are employed in practice. Usually choice 1 is easier.
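Wilkinson's add-then-truncate product rounding described earlier in this section is easy to sketch with integer arithmetic: hold the $(2t+1)$-bit product as an integer numerator over $2^{2t}$, add $2^{-(t+1)}$ (which is $2^{t-1}$ in those units), and drop the last $t$ bits with a right shift. The values below redo the $t = 4$ worked example; the function name `fx_round` is our own label for the operation $fx[\cdot]$.

```python
t = 4

def fx_round(p_num, t):
    """Round p = p_num / 2**(2*t) to t fractional bits by
    adding 2**-(t+1) and discarding the last t bits."""
    rounded = (p_num + 2**(t - 1)) >> t    # add 2^-(t+1), then truncate
    return rounded / 2**t

# p = (0.00111111)_2 = 63/256, as in the worked example.
p_num = 0b00111111
p = p_num / 2**(2*t)
fp = fx_round(p_num, t)
print(fp)                          # 0.25 = (0.0100)_2, as in the text
assert abs(p - fp) <= 2**-(t + 1)  # bound (2.9)
```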
42 NUMBER REPRESENTATIONS
2.3 FLOATING-POINT REPRESENTATIONS
In the previous section we have seen that fixed-point numbers are of very limited
dynamic range. This poses a major problem in employing them in engineering
computations since obviously we desire to work with numbers far beyond the
range in (2.2). Floating-point representations provide the definitive solution to this
problem. We remark (in passing) that the basic organization of a floating-point
arithmetic unit (i.e., digital hardware for floating-point addition and subtraction)
appears in Ref. 2 (see pp. 295-306). There is a standard IEEE format for
floating-point numbers. We do not consider this standard here, but it is summarized in
Ref. 2 (see pp. 304-306). Some of the technical subtleties associated with the
IEEE standard are considered by Higham [6].
Following Golub and Van Loan [4, p. 61], the set F (a subset of R) of floating-point
numbers consists of numbers of the form

x = x_0 . x_1 x_2 ⋯ x_{t-1} x_t × r^e,   (2.11)

where x_0 is a sign bit (which means that we can replace x_0 by ±; this is done in
Ref. 4), and r is the base of the representation [typically r = 2 (binary), or r = 10
(decimal); we will emphasize r = 2]. Therefore, x_k ∈ {0, 1, ..., r - 2, r - 1} for
1 ≤ k ≤ t. These are the digits (bits if r = 2) of the mantissa. We therefore see
that the mantissa is a fraction.¹ It is important to note that x_1 ≠ 0, and this has
implications with regard to how operations are performed and the resulting rounding
errors. We call e the exponent. This is an integer quantity such that L ≤ e ≤ U.
For example, we might represent e as an n-bit two's complement integer. We will
assume this unless otherwise specified in what follows. This would imply that
(e)_2 = e_{n-1} e_{n-2} ⋯ e_1 e_0, and so

-2^{n-1} ≤ e ≤ 2^{n-1} - 1   (2.12)
(see Appendix A for justification). For nonzero x ∈ F,

m ≤ |x| ≤ M,   (2.13a)

where

m = r^{L-1},   M = r^U (1 - r^{-t}).   (2.13b)

Equation (2.13) gives the dynamic range for the floating-point representation. With
r = 2 we see that the total wordsize for the floating-point number is t + n + 1 bits.
In the absence of rounding errors in a computation, our numbers may initially
be from the set

G = {x ∈ R | m ≤ |x| ≤ M} ∪ {0}.   (2.14)
¹Including the sign bit, the mantissa is (for r = 2) t + 1 bits long. Frequently in what follows we shall
refer to it as being only t bits long. This is because we are ignoring the sign bit, which is always
understood to be present.
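As a quick numerical check of (2.12) and (2.13), the following sketch computes m and M exactly for r = 2 with an n-bit two's complement exponent; the function name fp_range is ours, not the book's:

```python
from fractions import Fraction

def fp_range(t, n, r=2):
    """Dynamic range (m, M) of (2.13b), with L and U taken from the
    n-bit two's complement exponent range (2.12)."""
    L = -(2 ** (n - 1))
    U = 2 ** (n - 1) - 1
    m = Fraction(1, r ** (1 - L))             # r^(L-1), written with a positive power
    M = r ** U * (1 - Fraction(1, r ** t))    # r^U (1 - r^(-t))
    return m, M

# e.g., t = 4 mantissa bits and an n = 4 bit exponent
m, M = fp_range(4, 4)
print(m, M)   # 1/512 120
```

Even with these tiny word sizes the representable magnitudes span five orders of magnitude, which is the point of floating point.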
This set is analogous to the set {x ∈ R | -1 ≤ x ≤ 1 - 2^{-t}} that we saw in the
previous section in our study of fixed-point quantization effects. Again following
Golub and Van Loan [4], we may define a mapping (operator) fl: G → F. Here
c = fl[x] (x ∈ G) is obtained by choosing the closest c ∈ F to x. As you might
expect, distance is measured using ||·|| = |·|, as we did in the previous section.
Golub and Van Loan call this rounded arithmetic [4], and it coincides with the
rounding procedure described by Wilkinson [1, pp. 7-11].
Suppose that x and y are two floating-point numbers (i.e., elements of F) and
that "op" denotes any of the four basic arithmetic operations (addition, subtraction,
multiplication, or division). Suppose x op y ∉ G. This implies that either
|x op y| > M (floating-point overflow) or 0 < |x op y| < m (floating-point underflow)
has occurred. Under normal circumstances an arithmetic fault such as overflow
will not happen unless an unstable procedure is being performed. The issue
of "numerical stability" will be considered later. Overflows typically cause runtime
error messages to appear. The underflow arithmetic fault occurs when a number
arises that is not zero, but is too small to represent in the set F. This usually poses
less of a problem than overflow.² However, as noted before, we are concerned
mainly with rounding errors here. If x op y ∈ G, then we assume that the computer
implementation of x op y will be given by fl[x op y]. In other words, the
operator fl models rounding effects in floating-point arithmetic operations. We
remark that where floating-point arithmetic is concerned, rounding error arises in
all four arithmetic operations. This contrasts with fixed-point arithmetic, wherein
rounding errors arise only in multiplication and division.
It turns out that for the floating-point rounding procedure suggested above

fl[x op y] = (x op y)(1 + ε),   (2.15)

where

|ε| ≤ (1/2) r^{1-t}  (= 2^{-t} if r = 2).   (2.16)

We shall justify this only for the case r = 2. Our arguments will follow those of
Wilkinson [1, pp. 7-11].
Let us now consider the addition of the base-2 floating-point numbers

x = x_0 . x_1 ⋯ x_t × 2^{e_x}   (2.17a)

and

y = y_0 . y_1 ⋯ y_t × 2^{e_y},   (2.17b)

and we assume that |x| ≥ |y|. (If instead |y| > |x|, then reverse the roles of x and
y.) If e_x - e_y > t, then

fl[x + y] = x.   (2.18)

²Underflows are simply set to zero on some machines.
For example, if t = 4, and x = 0.1001 × 2^4, and y = 0.1110 × 2^{-1}, then to add
these numbers, we must shift the bits in the mantissa of one of them so that both
have the same exponent. If we choose y (usually shifting is performed on the
smaller number), then y = 0.00000111 × 2^4. Therefore, x + y = 0.10010111 ×
2^4, but then fl[x + y] = 0.1001 × 2^4 = x.

Now if instead we have e_x - e_y ≤ t, we divide y by 2^{e_x - e_y} by shifting its
mantissa e_x - e_y positions to the right. The sum m_x + 2^{e_y - e_x} m_y is then calculated exactly,
and requires ≤ 2t bits for its representation. The sum is multiplied by a power of 2,
using left or right shifts to ensure that the mantissa is properly normalized [recall
that for x in (2.11) we must have x_1 ≠ 0]. Of course, the exponent must be modified
to account for the shift of the bits in the mantissa. The 2t-bit mantissa is then
rounded off to t bits using fl. Because we have |m_x| + 2^{e_y - e_x} |m_y| < 1 + 1 = 2,
the largest possible right shift is by one bit position. However, a left shift of up to t
bit positions might be needed because of the cancellation of bits in the summation
process. Let us consider a few examples. We will assume that t = 4.
Example 2.1 Let x = 0.1001 × 2^4, and y = 0.1010 × 2^1. Thus

    0.10010000 × 2^4
  + 0.00010100 × 2^4
  = 0.10100100 × 2^4

and the sum is rounded to 0.1010 × 2^4 (computed sum).
Example 2.2 Let x = 0.1111 × 2^4, and y = 0.1010 × 2^2. Thus

    0.11110000 × 2^4
  + 0.00101000 × 2^4
  = 1.00011000 × 2^4

but 1.00011000 × 2^4 = 0.100011000 × 2^5, and this exact sum is rounded to 0.1001
× 2^5 (computed sum).
Example 2.3 Let x = 0.1111 × 2^{-4}, and y = -0.1110 × 2^{-4}. Thus

    0.11110000 × 2^{-4}
  - 0.11100000 × 2^{-4}
  = 0.00010000 × 2^{-4}

but 0.00010000 × 2^{-4} = 0.1000 × 2^{-7}, and this exact sum is rounded to 0.1000 ×
2^{-7} (computed sum). Here there is much cancellation of the bits, leading in turn
to a large shift of the mantissa of the exact sum to the left. Yet the computed sum
is exact.
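Examples 2.1-2.3 can be verified mechanically. The sketch below (our own helper, not the book's hardware description) normalizes an exact sum so that 1/2 ≤ |m| < 1 and then rounds the mantissa to t = 4 bits by ordinary rounding:

```python
from fractions import Fraction
from math import floor

T = 4

def fl(v):
    """Round a nonzero exact value v to a t-bit mantissa after
    normalizing so that 1/2 <= |m| < 1 (sketch; names ours)."""
    if v == 0:
        return v
    sign, a, e = (1 if v > 0 else -1), abs(v), 0
    while a >= 1:                 # normalize downward
        a, e = a / 2, e + 1
    while a < Fraction(1, 2):     # normalize upward
        a, e = a * 2, e - 1
    m = Fraction(floor(a * 2**T + Fraction(1, 2)), 2**T)  # ordinary rounding
    return sign * m * Fraction(2) ** e

# Example 2.1: 0.1001 x 2^4 + 0.1010 x 2^1 -> 0.1010 x 2^4 = 10
assert fl(Fraction(9, 16) * 16 + Fraction(10, 16) * 2) == 10
# Example 2.2: 0.1111 x 2^4 + 0.1010 x 2^2 -> 0.1001 x 2^5 = 18
assert fl(Fraction(15, 16) * 16 + Fraction(10, 16) * 4) == 18
# Example 2.3: heavy cancellation, yet the computed sum 0.1000 x 2^(-7) is exact
assert fl(Fraction(15, 256) - Fraction(14, 256)) == Fraction(1, 256)
```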
We observe that the computed sum is obtained by computing the exact sum,
normalizing it so that the mantissa s_0 . s_1 ⋯ s_{t-1} s_t s_{t+1} ⋯ s_{2t} satisfies s_1 = 1 (i.e.,
s_1 ≠ 0), and then we round it to t places (i.e., we apply fl). If the normalized
exact sum is s = m_s × 2^{e_s} (= x + y), then the rounding error ε′ is such that |ε′| ≤
(1/2) 2^{-t} 2^{e_s}. Essentially, the error ε′ is due to rounding the mantissa (a fixed-point
number) according to the method used in Section 2.2. Because of the form of m_s,
(1/2) 2^{e_s} ≤ |s| < 2^{e_s}, and so

fl[x + y] = (x + y)(1 + ε),   (2.19)

which is just a special case of (2.15). This expression requires further explanation,
however. Observe that

|s - fl[s]| / |s| = |s - (s + ε′)| / |s| = |ε′| / |s| ≤ (1/2) 2^{-t} 2^{e_s} / |s|,

which is the relative error³ due to rounding. Because we have (1/2) 2^{e_s} ≤ |s| < 2^{e_s},
this error is biggest when |s| = (1/2) 2^{e_s}, so therefore we conclude that

|s - fl[s]| / |s| ≤ 2^{-t}.   (2.20)

From (2.19), fl[s] = s + sε, so that |s - fl[s]| = |s||ε|, or |ε| = |s - fl[s]| / |s|.
Thus, |ε| ≤ 2^{-t}, which is (2.16). In other words, |ε′| is the absolute error, and |ε|
is the relative error.

Finally, if x = 0 or y = 0, then no rounding error occurs: ε = 0. Subtraction
results do not differ from addition.
Now consider computing the product of x and y in (2.17). Since x = m_x × 2^{e_x},
and y = m_y × 2^{e_y} with x_1 ≠ 0, and y_1 ≠ 0, we must have

1/4 ≤ |m_x m_y| < 1.   (2.21)

This implies that it may be necessary to normalize the mantissa of the product with
a shift to the left, and an appropriate adjustment of the exponent as well. The 2t-bit
mantissa of the product is rounded to give a t-bit mantissa. If x = 0, or y = 0 (or
both x and y are zero), then the product is zero.
³In general, if a is the exact value of some quantity and â is some approximation to a, the absolute
error is ||a - â||, while the relative error is

||a - â|| / ||a||   (a ≠ 0).

The relative error is usually more meaningful in practice. This is because an error is really "big" or
"small" only in relation to the size of the quantity being approximated.
We may consider a few examples. We will suppose t = 4. Begin with x =
0.1010 × 2^2, and y = 0.1111 × 2^1, so then

xy = 0.10010110 × 2^3,

and so fl[xy] = 0.1001 × 2^3 (computed product). If now x = 0.1000 × 2^4, y =
0.1000 × 2^{-1}, then, before normalizing the mantissa, we have

xy = 0.01000000 × 2^3,

and after normalization we have

xy = 0.10000000 × 2^2,

so that fl[xy] = 0.1000 × 2^2 (computed product). Finally, suppose that x =
0.1010 × 2^0, and y = 0.1010 × 2^0, so then the unnormalized product is

xy = 0.01100100 × 2^0,

for which the normalized product is

xy = 0.11001000 × 2^{-1},

so finally fl[xy] = 0.1101 × 2^{-1} (computed product).
The application of fl to the normalized product will have exactly the same
effect as it did in the case of addition (or of subtraction). This may be understood
by recognizing that a 2t-bit mantissa will "look the same" to the operator fl
regardless of how that mantissa was obtained. It therefore immediately follows
that

fl[xy] = (xy)(1 + ε),   (2.22)

which is another special case of (2.15), and |ε| ≤ 2^{-t}, which is (2.16) again.
Now consider the quotient x/y, for x and y ≠ 0 in (2.17):

q = x/y = (m_x × 2^{e_x}) / (m_y × 2^{e_y}) = (m_x / m_y) × 2^{e_x - e_y} = m_q × 2^{e_q}   (2.23)

(so m_q = m_x / m_y, and e_q = e_x - e_y). The arithmetic unit in the machine has an
accumulator that we assume contains m_x and which is "double length" in that it
is 2t bits long. Specifically, this accumulator initially stores x_0 . x_1 ⋯ x_t 0 ⋯ 0 (t
trailing zero bits). If |m_x| ≥ |m_y|, the number in the accumulator is shifted one place
to the right, and so e_q is increased by one (i.e., incremented). The number in the
accumulator is then divided by m_y in such a manner as to give a correctly rounded
t-bit result. This implies that the computed mantissa of the quotient, say, m̂_q = q_0 . q_1 ⋯ q_t,
satisfies the normalization condition q_1 = 1, so that 1/2 ≤ |m̂_q| < 1. Once again we
must have

fl[x/y] = (x/y)(1 + ε)   (2.24)

such that |ε| ≤ 2^{-t}. Therefore, (2.15) and (2.16) are now justified for all instances
of op.
We complete this section with a few examples. Suppose x = 0.1010 × 2^2, and
y = 0.1100 × 2^{-2}; then

q = x/y = (0.1010 × 2^2) / (0.1100 × 2^{-2}) = (0.10100000 / 0.1100) × 2^4 = 0.11010101 × 2^4,

so that fl[q] = 0.1101 × 2^4 (computed quotient). Now suppose that x = 0.1110 ×
2^3, and y = 0.1001 × 2^{-2}, and so

q = x/y = (0.1110 × 2^3) / (0.1001 × 2^{-2}) = (0.01110000 / 0.1001) × 2^6 = 0.11000111 × 2^6,

so that fl[q] = 0.1100 × 2^6 (computed quotient).
Thus far we have emphasized ordinary rounding, but an alternative implementation
of fl is to use chopping. If x = ±(Σ_{k=1}^{∞} x_k 2^{-k}) × 2^e, then, for the chopping
operator fl, we have fl[x] = ±(Σ_{k=1}^{t} x_k 2^{-k}) × 2^e (chopping x to t + 1 bits
including the sign bit). Thus, the absolute error is

|ε′| = |x - fl[x]| = (Σ_{k=t+1}^{∞} x_k 2^{-k}) 2^e ≤ 2^e Σ_{k=t+1}^{∞} 2^{-k}

(as x_k ≤ 1 for all k > t), but since Σ_{k=t+1}^{∞} 2^{-k} = 2^{-t}, we must have

|ε′| = |x - fl[x]| ≤ 2^{-t} 2^e,

and so the relative error for chopping is

|x - fl[x]| / |x| ≤ 2^{-t} 2^e / ((1/2) 2^e) = 2^{-t+1}

(because we recall that |x| ≥ (1/2) 2^e). We see that the error in chopping is somewhat
bigger than the error in rounding, but chopping is somewhat easier to implement.
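The gap between the two bounds can be seen directly on a 2t-bit mantissa; chop and rnd below are our own names for the two implementations of fl acting on a normalized mantissa:

```python
from fractions import Fraction
from math import floor

T = 4

def chop(m):
    """Chop a normalized mantissa 1/2 <= m < 1 to t bits."""
    return Fraction(floor(m * 2**T), 2**T)

def rnd(m):
    """Ordinary rounding of the same mantissa to t bits."""
    return Fraction(floor(m * 2**T + Fraction(1, 2)), 2**T)

m = Fraction(0b10101011, 256)   # the 2t-bit mantissa 0.10101011
# chopping error stays below 2^(-t); rounding error is at most (1/2) 2^(-t)
assert abs(m - chop(m)) < Fraction(1, 2**T)
assert abs(m - rnd(m)) <= Fraction(1, 2**(T + 1))
print(chop(m), rnd(m))   # 5/8 11/16
```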
2.4 ROUNDING EFFECTS IN DOT PRODUCT COMPUTATION
Suppose x, y ∈ R^n. We recall from Chapter 1 (and from elementary linear algebra)
that the vector dot product is given by

(x, y) = x^T y = y^T x = Σ_{k=0}^{n-1} x_k y_k.   (2.25)
This operation occurs in matrix-vector product computation (e.g., y = Ax, where
A ∈ R^{n×n}), digital filter implementation (i.e., computing discrete-time
convolution), numerical integration, and other applications. In other words, it is so common
that it is important to understand how rounding errors can affect the accuracy of a
computed dot product.
We may regard dot product computation as a recursive process. Thus

s_{n-1} = Σ_{k=0}^{n-1} x_k y_k = Σ_{k=0}^{n-2} x_k y_k + x_{n-1} y_{n-1} = s_{n-2} + x_{n-1} y_{n-1}.

So

s_k = s_{k-1} + x_k y_k   (2.26)

for k = 0, 1, ..., n - 1, and s_{-1} = 0. Each arithmetic operation in (2.26) is a
separate floating-point operation and so introduces its own error into the overall
calculation. We would like to obtain a general expression for this error. To
begin, we may model the computation process according to
ŝ_0 = fl[x_0 y_0]
ŝ_1 = fl[ŝ_0 + fl[x_1 y_1]]
ŝ_2 = fl[ŝ_1 + fl[x_2 y_2]]
⋮
ŝ_{n-2} = fl[ŝ_{n-3} + fl[x_{n-2} y_{n-2}]]
ŝ_{n-1} = fl[ŝ_{n-2} + fl[x_{n-1} y_{n-1}]].   (2.27)
From (2.15) we may write

ŝ_0 = (x_0 y_0)(1 + δ_0)
ŝ_1 = [ŝ_0 + (x_1 y_1)(1 + δ_1)](1 + ε_1)
ŝ_2 = [ŝ_1 + (x_2 y_2)(1 + δ_2)](1 + ε_2)
⋮
ŝ_{n-2} = [ŝ_{n-3} + (x_{n-2} y_{n-2})(1 + δ_{n-2})](1 + ε_{n-2})
ŝ_{n-1} = [ŝ_{n-2} + (x_{n-1} y_{n-1})(1 + δ_{n-1})](1 + ε_{n-1}),   (2.28)

where |δ_k| ≤ 2^{-t} (for k = 0, 1, ..., n - 1), and |ε_k| ≤ 2^{-t} (for k = 1, 2, ...,
n - 1), via (2.16). It is possible to write⁴

ŝ_{n-1} = Σ_{k=0}^{n-1} x_k y_k (1 + γ_k) = s_{n-1} + Σ_{k=0}^{n-1} x_k y_k γ_k,   (2.29)

where

1 + γ_k = (1 + δ_k) Π_{j=k}^{n-1} (1 + ε_j)   (ε_0 = 0).   (2.30)

Note that the Π notation means, for example,

Π_{k=0}^{n} x_k = x_0 x_1 x_2 ⋯ x_{n-1} x_n,   (2.31)

where Π is the symbol to compute the product of all x_k for k = 0, 1, ..., n. The
similarity to how we interpret Σ notation should therefore be clear.
The absolute value operator is a norm on R, so from the axioms for a norm
(recall Definition 1.3), we must have

|s_{n-1} - ŝ_{n-1}| = |x^T y - fl[x^T y]| ≤ Σ_{k=0}^{n-1} |x_k y_k| |γ_k|.   (2.32)

In particular, obtaining this involves the repeated use of the triangle inequality.
Equation (2.32) thus represents an upper bound on the absolute error involved in
computing a vector dot product. Of course, the notation fl[x^T y] symbolizes the
floating-point approximation to the exact quantity x^T y. However, the bound in
(2.32) is incomplete because we need to appropriately bound the numbers γ_k.
To obtain the bound we wish involves using the following lemma.

Lemma 2.1: We have

1 + x ≤ e^x,   x ≥ 0   (2.33a)
e^x ≤ 1 + 1.01x,   0 ≤ x ≤ 0.01.   (2.33b)

⁴Equation (2.29) is most easily arrived at by considering examples for small n; for instance,

ŝ_3 = x_0 y_0 (1 + δ_0)(1 + ε_0)(1 + ε_1)(1 + ε_2)(1 + ε_3) + x_1 y_1 (1 + δ_1)(1 + ε_1)(1 + ε_2)(1 + ε_3)
      + x_2 y_2 (1 + δ_2)(1 + ε_2)(1 + ε_3) + x_3 y_3 (1 + δ_3)(1 + ε_3),

and using such examples to "spot the pattern."
Proof Begin with consideration of (2.33a). Recall that for -∞ < x < ∞

e^x = Σ_{n=0}^{∞} x^n / n!.   (2.34)

Therefore

e^x = 1 + x + Σ_{n=2}^{∞} x^n / n!,

but the terms in the summation are all nonnegative, so (2.33a) follows immediately.
Now consider (2.33b), which is certainly valid for x = 0. The result will follow
if we prove

(e^x - 1)/x ≤ 1.01   (x ≠ 0).

From (2.34)

(e^x - 1)/x = Σ_{m=0}^{∞} x^m / (m + 1)! = 1 + Σ_{m=1}^{∞} x^m / (m + 1)!,

so we may also equivalently prove instead that

Σ_{m=1}^{∞} x^m / (m + 1)! ≤ 0.01

for 0 < x ≤ 0.01. Observe that

Σ_{m=1}^{∞} x^m / (m + 1)! = (1/2)x + (1/6)x^2 + (1/24)x^3 + ⋯ ≤ (1/2)x + x^2 + x^3 + x^4 + ⋯
= (1/2)x + Σ_{k=2}^{∞} x^k = (1/2)x + [1/(1 - x) - 1 - x] = (1/2) x (1 + x)/(1 - x).

It is not hard to verify that

(1/2) x (1 + x)/(1 - x) ≤ 0.01

for 0 ≤ x ≤ 0.01. Thus, (2.33b) follows.
If n = 1, 2, 3, ..., and if 0 ≤ nu ≤ 0.01, then

(1 + u)^n ≤ (e^u)^n   [via (2.33a)]
         = e^{nu} ≤ 1 + 1.01nu   [via (2.33b)].   (2.35)

Now if |δ_i| ≤ u for i = 0, 1, ..., n - 1, then

Π_{i=0}^{n-1} (1 + δ_i) ≤ Π_{i=0}^{n-1} (1 + |δ_i|) ≤ (1 + u)^n,

so via (2.35)

Π_{i=0}^{n-1} (1 + δ_i) ≤ 1 + 1.01nu,   (2.36)

where we must emphasize that 0 ≤ nu ≤ 0.01. Certainly there is a δ such that

1 + δ = Π_{i=0}^{n-1} (1 + δ_i),   (2.37)

and so from (2.36), |δ| ≤ 1.01nu. If we identify γ_k with δ in (2.37) for all k, then

|γ_k| ≤ 1.01nu,   (2.38)

for which we consider u = 2^{-t} [because in (2.30) both |ε_i| and |δ_i| ≤ 2^{-t}]. Using
(2.38) in (2.32), we obtain

|x^T y - fl[x^T y]| ≤ 1.01nu Σ_{k=0}^{n-1} |x_k y_k|,   (2.39)

but Σ_{k=0}^{n-1} |x_k y_k| = Σ_{k=0}^{n-1} |x_k||y_k|, and this may be symbolized as |x|^T |y| (so that
|x| = [|x_0| |x_1| ⋯ |x_{n-1}|]^T). Thus, we may rewrite (2.39) as

|x^T y - fl[x^T y]| ≤ 1.01nu |x|^T |y|.   (2.40)
Observe that the relative error satisfies

|x^T y - fl[x^T y]| / |x^T y| ≤ 1.01nu |x|^T |y| / |x^T y|.   (2.41)

The bound in (2.41) may be quite large if |x|^T |y| ≫ |x^T y|. This suggests the
possibility of a large relative error. We remark that since u = 2^{-t}, nu ≤ 0.01 will
hold in all practical cases unless n is very large (a typical value for t is t = 56).

The potentially large relative errors indicated by the analysis we have just made
are a consequence of the details of how the dot product was calculated. As noted
on p. 65 of Ref. 4, the use of a double-precision accumulator to compute the dot
product can reduce the error dramatically. Essentially, if x and y are floating-point
vectors with t-bit mantissas, the "running sum" s_k [of (2.26)] is built up in
an accumulator with a 2t-bit mantissa. The product of two t-bit numbers can
be stored exactly in a double-precision variable. The large dynamic floating-point
range limits the likelihood of overflow/underflow. Only when the final sum s_{n-1} is
written to a single-precision memory location will there be a rounding error. It
therefore follows that when this alternative procedure is employed, we get

fl[x^T y] = x^T y (1 + δ)   (2.42)

for which |δ| ≈ 2^{-t} (= u). Clearly, this is a big improvement.
The material of this section shows
1. The analysis required to obtain insightful bounds on errors can be quite
arduous.
2. Proper numerical technique can have a dramatic effect in reducing errors.
3. Proper technique can be revealed by analysis.
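Bound (2.40) is easy to spot-check in IEEE double arithmetic, where t = 53 so u = 2^{-53}; Python floats are doubles, and the fractions module gives an exact reference. The test vectors below are arbitrary choices of ours:

```python
from fractions import Fraction

u = Fraction(1, 2**53)           # unit roundoff of an IEEE double (t = 53)

n = 100
x = [0.1 * (k + 1) for k in range(n)]
y = [0.3 / (k + 1) for k in range(n)]

s = 0.0                          # the recursion (2.26) in double precision
for xk, yk in zip(x, y):
    s += xk * yk

# every float converts exactly to a Fraction, so this sum is the exact x^T y
exact = sum(Fraction(xk) * Fraction(yk) for xk, yk in zip(x, y))
mag = sum(abs(Fraction(xk) * Fraction(yk)) for xk, yk in zip(x, y))  # |x|^T |y|
# the bound (2.40): |x^T y - fl[x^T y]| <= 1.01 n u |x|^T |y|
assert abs(Fraction(s) - exact) <= Fraction(101, 100) * n * u * mag
```

Here nu = 100 · 2^{-53} is far below 0.01, so the hypothesis of the analysis holds comfortably.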
The following example illustrates how the bound on rounding error in dot product
computation may be employed.
Example 2.4 Assume the existence of a square root function such that
fl[√x] = √x (1 + ε) and |ε| ≤ u. We use the algorithm that corresponds to the
bound of Eq. (2.40) to compute x^T x (x ∈ R^n), and then use this to give an algorithm
for ||x|| = √(x^T x). This can be expressed in the form of pseudocode:

s_{-1} := 0;
for k := 0 to n - 1 do begin
    s_k := s_{k-1} + x_k^2;
end;
||x|| := sqrt(s_{n-1});

We will now obtain a bound on the relative error due to rounding in the computation
of ||x||. We will use the fact that √(1 + x) ≤ 1 + x (for x ≥ 0).
Now

ε_1 = (fl[x^T x] - x^T x) / (x^T x)  ⟹  fl[x^T x] = x^T x (1 + ε_1),

and via (2.41)

|ε_1| ≤ 1.01nu |x|^T |x| / |x^T x| = 1.01nu ||x||^2 / ||x||^2 = 1.01nu

(|x|^T |x| = Σ_{k=0}^{n-1} |x_k|^2 = Σ_{k=0}^{n-1} x_k^2 = ||x||^2, and |x^T x| = ||x||^2). So in "shorthand"
notation, fl[√(fl[x^T x])] = fl[||x||], and

fl[||x||] = √(x^T x (1 + ε_1)) (1 + ε) = ||x|| √(1 + ε_1) (1 + ε),
and √(1 + ε_1) ≤ 1 + ε_1, so

fl[||x||] ≤ ||x|| (1 + ε_1)(1 + ε).

Now (1 + ε_1)(1 + ε) = 1 + ε_1 + ε + ε_1 ε, implying that

||x|| (1 + ε_1)(1 + ε) = ||x|| + ||x|| (ε_1 + ε + ε_1 ε),

so therefore

fl[||x||] ≤ ||x|| + ||x|| (ε_1 + ε + ε_1 ε),

and thus

| fl[||x||] - ||x|| | / ||x|| ≤ |ε_1 + ε + ε_1 ε| ≤ u + 1.01nu + 1.01nu^2
                              = u [1 + 1.01n + 1.01nu].

Of course, we have used the fact that |ε| ≤ u.
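The pseudocode of Example 2.4 translates directly; this is just the naive recursion (2.26) applied to x^T x followed by a square root (the function name is ours):

```python
import math

def norm(x):
    """Euclidean norm ||x|| = sqrt(x^T x) via the running sum s_k of (2.26)."""
    s = 0.0
    for xk in x:
        s += xk * xk     # s_k := s_{k-1} + x_k^2
    return math.sqrt(s)

print(norm([3.0, 4.0]))   # 5.0
```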
2.5 MACHINE EPSILON
In Section 2.3 upper bounds on the error involved in applying the operator fl were
derived. Specifically, we found that the relative error satisfies

|x - fl[x]| / |x| ≤ { 2^{-t}     (rounding)
                    { 2^{-t+1}   (chopping).   (2.43)

As suggested in Section 2.4, these bounds are often denoted by u; that is, u = 2^{-t}
for rounding, and u = 2^{-t+1} for chopping. The bound u is often called the unit
roundoff [4, Section 2.4.2].

The details of how floating-point arithmetic is implemented on any given
computing machine may not be known or readily determined by the user. Thus, u
may not be known. However, an "experimental" approach is possible. One may
run a simple program to "estimate" u, and the estimate is the machine epsilon,
denoted ε_M. The machine epsilon is defined to be the difference between 1.0 and
the next biggest floating-point number [6, Section 2.1]. Consequently, ε_M = 2^{-t+1}.
A pseudocode to compute ε_M is as follows:

stop := 1;
eps := 1.0;
while stop == 1 do begin
    eps := eps/2.0;
    x := 1.0 + eps;
    if x <= 1.0
    begin
        stop := 0;
    end;
end;
eps := 2.0 * eps;
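In Python (whose floats are IEEE doubles, so t = 53) the same loop recovers ε_M = 2^{-52} = 2^{-t+1}, and the standard library exposes the value for comparison:

```python
import sys

eps = 1.0
while 1.0 + eps / 2.0 > 1.0:   # halve until 1.0 + eps/2 rounds back to 1.0
    eps /= 2.0

print(eps)                      # 2.220446049250313e-16, i.e., 2^(-52)
assert eps == 2.0 ** -52 == sys.float_info.epsilon
```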
This code may be readily implemented as a MATLAB routine. MATLAB stores
eps (= ε_M) as a built-in constant, and the reader may wish to test the code above
to see if the result agrees with MATLAB's eps (as a programming exercise).
In this book we shall (unless otherwise stated) regard machine epsilon and unit
roundoff as practically interchangeable.
APPENDIX 2A REVIEW OF BINARY NUMBER CODES
This appendix summarizes typical methods used to represent integers in binary.
Extension of the results in this appendix to fractions is certainly possible. This
material is normally to be found in introductory digital electronics books. The
reader is here assumed to know Boolean algebra. This implies that the reader
knows that + can represent either algebraic addition, or the logical or operation.
Similarly, xy might mean the logical and of the Boolean variables x and y, or it
might mean the arithmetic product of the real variables x and y. The context must
be considered to ascertain which meaning applies.
Below we speak of "complements." These are used to represent negative integers,
and also to facilitate arithmetic with integers. We remark that the results of
this appendix are presented in a fairly general manner. Thus, the reader may wish,
for instance, to see numerical examples of arithmetic using two's complement (2's
comp.) codings. The reader can consult pp. 276-280 of Ref. 2 for such examples.
Almost any other book on digital logic will also provide a source of numerical
examples [3].
We may typically interpret a bit pattern in one of four ways, assuming that the
bit pattern is to represent a number (negative or nonnegative integer). An example
of this is as follows, and it provides a summary of common representations (e.g.,
for n = 3 bits):

Bit Pattern   Unsigned Integer   2's Comp.   1's Comp.   Sign Magnitude
000           0                  0           0           0
001           1                  1           1           1
010           2                  2           2           2
011           3                  3           3           3
100           4                  -4          -3          -0
101           5                  -3          -2          -1
110           6                  -2          -1          -2
111           7                  -1          -0          -3
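The table can be reproduced programmatically; decode below is our own helper that interprets a bit string under each of the four schemes:

```python
def decode(bits, scheme):
    """Interpret the bit string under one of the four codings in the
    table above ('unsigned', '2s', '1s', or 'sm')."""
    n = len(bits)
    v = int(bits, 2)                 # value read as an unsigned integer
    if scheme == "unsigned" or bits[0] == "0":
        return v                     # all four schemes agree when the MSB is 0
    if scheme == "2s":
        return v - 2**n
    if scheme == "1s":
        return v - (2**n - 1)
    if scheme == "sm":
        return -(v - 2**(n - 1))     # strip the sign bit, negate the magnitude

# the row for bit pattern 110 in the n = 3 table
assert [decode("110", s) for s in ("unsigned", "2s", "1s", "sm")] == [6, -2, -1, -2]
```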
In the four coding schemes summarized in this table, the interpretation of the bit
pattern is always the same when the most significant bit (MSB) is zero. A similar
table for n = 4 appears in Hamacher et al. [2, see p. 271].
Note that, philosophically speaking, the table above implies that a bit pattern
can have more than one meaning. It is up to the engineer to decide what meaning
it should have. Of course, this will be a function of purpose. Presently, our purpose
is that bit patterns should have meaning with respect to the problems of numerical
computing; that is, bit patterns must represent numerical information.
The relative merits of the three signed number coding schemes illustrated in the
table above may be summarized as follows:

Coding Scheme    Advantages                        Disadvantages
2's complement   Simple adder/subtracter           Circuit for finding the 2's comp.
                 circuit; only one code for 0      more complex than circuit for
                                                   finding the 1's comp.
1's complement   Easy to obtain the 1's comp.      Circuit for addition and
                 of a number                       subtraction more complex than
                                                   for the 2's comp.
                                                   adder/subtracter; two codes for 0
Sign magnitude   Intuitively obvious code          Has the most complex
                                                   adder/subtracter circuit;
                                                   two codes for 0
The following is a summary of some formulas associated with arithmetic (i.e.,
addition and subtraction) with r's and (r - 1)'s complements. In binary arithmetic
r = 2, while in decimal arithmetic r = 10. We emphasize the case r = 2.

Let A be an n-digit base-r number (integer)

A = A_{n-1} A_{n-2} ⋯ A_1 A_0,

where A_k ∈ {0, 1, ..., r - 2, r - 1}. Digit A_{n-1} is the most significant digit
(MSD), while digit A_0 is the least significant digit (LSD). Provided that A is
not negative (i.e., is unsigned), we recognize that to convert A to a base-10
representation (i.e., ordinary decimal number) requires us to compute

Σ_{k=0}^{n-1} A_k r^k.

If A is allowed to be a negative integer, the usage of this summation needs
modification. This is considered below.
The r's complement of A is defined to be

r's complement of A = A* = { r^n - A,   A ≠ 0
                           { 0,         A = 0.   (2.A.1)
The (r — l)'s complement of A is defined to be
(r - l)'s complement of A = A = (r n - 1) - A (2.A.2)
It is important not to confuse the bar over the A in (2. A. 2) with the Boolean not
operation, although for the special case of r — 2 the bar will denote complemen-
tation of each bit of A; that is, for r — 2
^b-i-Ab-2 • • • AjAq
where the bar now denotes the logical not operation. More generally, if A is a
base-r number
A = (r-l)-A n -i (r-l)-A„_ 2 • • • (r - 1) - Aj (r - 1) - A
Thus, to obtain A, each digit of A is subtracted from r — 1. As a consequence,
comparing (2A.1) and (2 A. 2), we see that
A* = A+1 (2A.3)
where the plus denotes algebraic addition (which takes place in base r).
Now we consider the three (previously noted) different methods for coding
integers when r = 2:

1. Sign-magnitude coding
2. One's complement coding
3. Two's complement coding

In all three of these coding schemes the most significant bit (MSB) is the sign bit.
Specifically, if A_{n-1} = 0, the number is nonnegative, and if A_{n-1} = 1, the number
is negative. It can be shown that when the complement (either one's or two's) of
a binary number is taken, this is equivalent to placing a minus sign in front of the
number. As a consequence, when given a binary number A = A_{n-1} A_{n-2} ⋯ A_1 A_0
coded according to one of these three schemes, we may convert that number to a
base-10 integer according to the following formulas:
1. Sign-Magnitude Coding. The sign-magnitude binary number A = A_{n-1} A_{n-2}
⋯ A_1 A_0 (A_k ∈ {0, 1}) has the base-10 equivalent

A_{10} = {  Σ_{i=0}^{n-2} A_i 2^i,   A_{n-1} = 0
         { -Σ_{i=0}^{n-2} A_i 2^i,   A_{n-1} = 1.   (2.A.4)

With this coding scheme there are two codings for zero:

(0)_{10} = (000 ⋯ 00)_2 = (100 ⋯ 00)_2
2. One's Complement Coding. In this coding we represent -A as Ā. The one's
complement binary number A = A_{n-1} A_{n-2} ⋯ A_1 A_0 (A_k ∈ {0, 1}) has the
base-10 equivalent

A_{10} = {  Σ_{i=0}^{n-2} A_i 2^i,   A_{n-1} = 0
         { -Σ_{i=0}^{n-2} Ā_i 2^i,   A_{n-1} = 1.   (2.A.5)

With this coding scheme there are also two codes for zero:

(0)_{10} = (000 ⋯ 00)_2 = (111 ⋯ 11)_2

3. Two's Complement Coding. In this coding we represent -A as A* (= Ā + 1).
The two's complement binary number A = A_{n-1} A_{n-2} ⋯ A_1 A_0 (A_k ∈ {0, 1})
has the base-10 equivalent

A_{10} = -2^{n-1} A_{n-1} + Σ_{i=0}^{n-2} A_i 2^i.   (2.A.6)
The proof is as follows. If A_{n-1} = 0, then A ≥ 0 and immediately the base-10
equivalent is A = Σ_{i=0}^{n-2} A_i 2^i (via the procedure for converting a number in
base-2 to one in base-10), which is (2.A.6) for A_{n-1} = 0. Now, if A_{n-1} = 1,
then A < 0, and so if we take the two's complement of A we must get |A|:

|A| = Ā + 1
    = (1 - A_{n-1})(1 - A_{n-2}) ⋯ (1 - A_1)(1 - A_0) + 00 ⋯ 01
    = Σ_{i=0}^{n-1} (1 - A_i) 2^i + 1
    = 2^{n-1} (1 - A_{n-1}) + Σ_{i=0}^{n-2} (1 - A_i) 2^i + 1
    = Σ_{i=0}^{n-2} 2^i + 1 - Σ_{i=0}^{n-2} A_i 2^i   (A_{n-1} = 1)
    = 2^{n-1} - Σ_{i=0}^{n-2} A_i 2^i

[using Σ_{i=0}^{n-2} 2^i = 2^{n-1} - 1, a special case of the geometric series
Σ_{i=0}^{n} a^i = (1 - a^{n+1})/(1 - a)], and so A = -2^{n-1} + Σ_{i=0}^{n-2} A_i 2^i,
which is (2.A.6) for A_{n-1} = 1.

In this coding scheme there is only one code for zero:

(0)_{10} = (000 ⋯ 00)_2
When n-bit integers are added together, there is the possibility that the sum may
not fit in n bits. This is overflow. The condition is easy to detect by monitoring
the signs of the operands and the sum. Suppose that x and y are n-bit two's
complement coded integers, so that the sign bits of these operands are x_{n-1} and
y_{n-1}. Suppose that the sum is denoted by s, implying that the sign bit is s_{n-1}. The
Boolean function that tests for overflow of s = x + y (algebraic sum of x and y) is

T = x_{n-1} y_{n-1} s̄_{n-1} + x̄_{n-1} ȳ_{n-1} s_{n-1}.

The first term will be logical 1 if the operands are negative while the sum is
positive. The second term will be logical 1 if the operands are positive but the sum
is negative. Either condition yields T = 1, thus indicating an overflow. A similar
test may be obtained for subtraction, but we omit this here.
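The overflow test T can be exercised directly; add_overflows below is our own helper for n-bit two's complement operands held in ordinary Python integers:

```python
def add_overflows(x, y, n):
    """Evaluate the overflow test T above for n-bit two's complement
    operands (helper name ours)."""
    mask = (1 << n) - 1
    s = (x + y) & mask                 # n-bit sum, carryout discarded
    xs = (x >> (n - 1)) & 1            # sign bits x_{n-1}, y_{n-1}, s_{n-1}
    ys = (y >> (n - 1)) & 1
    ss = (s >> (n - 1)) & 1
    return (xs == 1 and ys == 1 and ss == 0) or (xs == 0 and ys == 0 and ss == 1)

# 4-bit examples: 7 + 1 overflows (the sum 8 does not fit), 7 + (-1) does not
assert add_overflows(7, 1, 4) is True
assert add_overflows(7, -1, 4) is False
assert add_overflows(-8, -1, 4) is True
```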
The following is both the procedure and the justification of the procedure for
adding two's complement coded integers.
Theorem 2.A.1: Two's Complement Addition If A and B are n-bit two's
complement coded numbers, then compute A + B (the sum of A and B) as though
they were unsigned numbers, discarding any carryout.
Proof Suppose that A > 0, B > 0; then A + B will generate no carryout from
the bit position n - 1 since A_{n-1} = B_{n-1} = 0 (i.e., the sign bits are zero-valued),
and the result will be correct if A + B < 2^{n-1}. (If this inequality is not satisfied,
then the sign bit will be one, indicating a negative answer, which is wrong. This
amounts to an overflow.)

Suppose that A > B > 0; then

A + (-B) = A + B* = A + 2^n - B = 2^n + A - B,

and if we discard the carryout, this is equivalent to subtracting 2^n (because the
carryout has a weight of 2^n). Doing this yields A + (-B) = A - B.
Similarly

(-A) + B = A* + B = 2^n - A + B = 2^n + B - A,

and discarding the carryout yields (-A) + B = B - A.

Again, suppose that A > B > 0; then

(-A) + (-B) = A* + B* = 2^n - A + 2^n - B = 2^n + [2^n - (A + B)]
            = 2^n + (A + B)*,

so discarding the carryout gives (-A) + (-B) = (A + B)*, which is the desired
two's complement representation of -(A + B), provided A + B ≤ 2^{n-1}. (If this
latter inequality is not satisfied, then we have an overflow.)
The procedure for subtraction (and its justification) follows similarly. We omit
these details.
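Theorem 2.A.1 can be demonstrated by working modulo 2^n; tc, tc_add, and signed are our own names, not the book's:

```python
N = 8                      # word length n
MASK = (1 << N) - 1

def tc(a):
    """n-bit two's complement code of the integer a, as an unsigned int."""
    return a & MASK

def tc_add(a, b):
    """Add the codes as unsigned numbers and discard any carryout,
    exactly as Theorem 2.A.1 prescribes."""
    return (tc(a) + tc(b)) & MASK

def signed(code):
    """Decode an n-bit pattern back to a signed integer via (2.A.6)."""
    return code - (1 << N) if code >> (N - 1) else code

assert signed(tc_add(57, -23)) == 34     # A + (-B) = A - B
assert signed(tc_add(-57, 23)) == -34    # (-A) + B = B - A
assert signed(tc_add(-57, -23)) == -80   # (-A) + (-B) = -(A + B)
```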
REFERENCES
1. J. H. Wilkinson, Rounding Errors in Algebraic Processes, Prentice-Hall, Englewood
Cliffs, NJ, 1963.
2. V. C. Hamacher, Z. G. Vranesic, and S. G. Zaky, Computer Organization, 3rd ed.,
McGraw-Hill, New York, 1990.
3. J. F. Wakerly, Digital Design Principles and Practices, Prentice-Hall, Englewood Cliffs,
NJ, 1990.
4. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ.
Press, Baltimore, MD, 1989.
5. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977.
6. N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA,
1996.
PROBLEMS
2.1. Let fx[x] denote the operation of reducing x (a fixed-point binary fraction)
to t + 1 bits (including the sign bit) according to the Wilkinson rounding
(ordinary rounding) procedure in Section 2.2. Suppose that a = (0.1000)_2,
b = (0.1001)_2, and c = (0.0101)_2, so t = 4 here. In arithmetic of unlimited
precision, we always have a(b + c) = ab + ac. Suppose that a practical
computing machine applies the operator fx[·] after every arithmetic operation.

(a) Find x = fx[fx[ab] + fx[ac]].
(b) Find y = fx[a fx[b + c]].

Do you obtain x = y?
This problem shows that the order of operations in an algorithm implemented
on a practical computer can affect the answer obtained.
2.2. Recall from Section 2.2 that

q = x/y = (0.0010)_2 / (0.0110)_2 = (1/3)_{10} = (0.\overline{01})_2.

Find the absolute error in representing q as a (t + 1)-bit binary number. Find
the relative error. Assume both ordinary rounding and chopping (defined at
the end of Section 2.3 with respect to floating-point arithmetic).
2.3. Recall that we define a floating-point number in base r to have the form

x = x_0 . x_1 x_2 ⋯ x_{t−1} x_t × r^e,

with mantissa f = . x_1 x_2 ⋯ x_t, where x_0 ∈ {+, −} (sign digit), x_k ∈ {0, 1, ..., r − 1}
for k = 1, 2, ..., t, e is the exponent (a signed integer), and x_1 ≠ 0 (so r^{−1} ≤ |f| < 1)
if x ≠ 0. Show that for x ≠ 0

m ≤ |x| ≤ M,

where for L ≤ e ≤ U, we have

m = r^{L−1},   M = r^U (1 − r^{−t}).
2.4. Suppose r = 10. We may consider the result of a decimal arithmetic operation
in the floating-point representation to be

x = ±(Σ_{k=1}^∞ x_k 10^{−k}) × 10^e.

(a) If fl[x] is the operator for chopping, then

fl[x] = (±.x_1 x_2 ⋯ x_{t−1} x_t) × 10^e;

thus, all digits x_k for k > t are forced to zero.
(b) If fl[x] is the operator for rounding, then it is defined as follows. Add
0.00⋯01 (i.e., 10^{−t}) to the mantissa if x_{t+1} ≥ 5; if x_{t+1} < 5, the mantissa is
unchanged. Then all digits x_k for k > t are forced to zero.
Show that the absolute error for chopping satisfies the upper bound

|x − fl[x]| ≤ 10^{−t} 10^e,

and that the absolute error for rounding satisfies the upper bound

|x − fl[x]| ≤ ½ 10^{−t} 10^e.

Show that the relative errors satisfy

|x − fl[x]|/|x| ≤ 10^{1−t}   (chopping),
|x − fl[x]|/|x| ≤ ½ · 10^{1−t}   (rounding).
2.5. Suppose that t = 4 and r = 2 (i.e., we are working with floating-point binary
numbers). Suppose that we have the operands

x = 0.1011 × 2^{−3},   y = −0.1101 × 2^{2}.

Find x + y, x − y, and xy. Clearly show the steps involved.
2.6. Suppose that A ∈ R^{n×n}, x ∈ R^n, and that fl[Ax] represents the result
of computing the product Ax on a floating-point computer. Define |A| =
[|a_{ij}|]_{i,j=0,1,...,n−1} and |x| = [|x_0| |x_1| ⋯ |x_{n−1}|]^T. We have

fl[Ax] = Ax + e,

where e ∈ R^n is the error vector. Of course, e models the rounding errors
involved in the actual computation of the product Ax on the computer. Justify
the bound

|e| ≤ 1.01 n u |A| |x|,

where u is the unit roundoff.
2.7. Explain why a conditional test such as

if x ≠ y then begin
  f := f/(x - y);
end;

is unreliable.
(Hint: Think about dynamic range limitations in floating-point arithmetic.)
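As a concrete illustration of the hint, in IEEE double precision the guard can pass while the quotient still overflows (a Python sketch; the particular operands are our own choice):

```python
x = 5e-324     # smallest positive (subnormal) double
y = 0.0
f = 1.0
if x != y:             # the guard passes...
    q = f / (x - y)    # ...but 1/5e-324 exceeds the largest double
print(q)               # inf
```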
2.8. Suppose that x = [x_0 x_1 ⋯ x_{n−1}]^T is a real-valued vector, ||x||_∞ =
max_{0≤k≤n−1} |x_k|, and that we wish to compute ||x||_2 = [Σ_{k=0}^{n−1} x_k^2]^{1/2}.
Explain the advantages and disadvantages of the following algorithm with
respect to computational efficiency (number of arithmetic operations, and
comparisons), and dynamic range limitations in floating-point arithmetic:

m := ||x||_∞;
s := 0;
for k := 0 to n − 1 do begin
  s := s + (x_k/m)^2;
end;
||x||_2 := m √s;

Comments regarding computational efficiency may be made with respect to
the pseudocode algorithm in Example 2.4.
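A runnable version of the scaling idea, in Python (the function name norm2_scaled is ours), shows how dividing through by ||x||_∞ avoids the overflow that the naive sum of squares suffers:

```python
import math

def norm2_scaled(x):
    """||x||_2 computed as m * sqrt(sum((x_k/m)^2)) with m = ||x||_inf."""
    m = max(abs(v) for v in x)       # n - 1 comparisons
    if m == 0.0:
        return 0.0
    return m * math.sqrt(sum((v / m) ** 2 for v in x))

x = [3e200, 4e200]                   # naive sum of squares overflows
print(norm2_scaled(x))               # ~5e+200
print(math.isinf(sum(v * v for v in x)))   # True: the naive way fails
```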
2.9. Recall that for x^2 + bx + c = 0, the roots are

x_1 = [−b + √(b^2 − 4c)]/2,   x_2 = [−b − √(b^2 − 4c)]/2.

If b = −0.3001, c = 0.00006, then the "exact" roots for this set of parameters
are

x_1 = 0.29989993,   x_2 = 2.0006673 × 10^{−4}.

Let us compute the roots using four-digit (i.e., t = 4) decimal (i.e., r = 10)
floating-point arithmetic, where, as a result of rounding quantization, b and
c are replaced with their approximations

b̂ = −0.3001 = b,   ĉ = 0.0001 ≠ c.

Compute x̂_2, which is the approximation to x_2 obtained using b̂ and ĉ in
place of b and c. Show that the relative error is

|x̂_2 − x_2|/|x_2| ≈ 0.75

(i.e., the relative error is about 75%). (Comment: This is an example of
catastrophic cancellation.)
2.10. Suppose a, b ∈ R, and x = a − b. Floating-point approximations to a and b
are â = fl[a] = a(1 + ε_a) and b̂ = fl[b] = b(1 + ε_b), respectively. Hence
the floating-point approximation to x is x̂ = â − b̂. Show that the relative
error e = (x̂ − x)/x satisfies a bound of the form

|e| ≤ α (|a| + |b|)/|a − b|.

What is α? When is |e| large?
2.11. For a ≠ 0, the quadratic equation ax^2 + bx + c = 0 has roots given by

x_1 = [−b + √(b^2 − 4ac)]/(2a),   x_2 = [−b − √(b^2 − 4ac)]/(2a).

For c ≠ 0, the quadratic equation cx^2 + bx + a = 0 has roots given by

x'_1 = [−b + √(b^2 − 4ac)]/(2c),   x'_2 = [−b − √(b^2 − 4ac)]/(2c).

(a) Show that x_1 x'_2 = 1 and x_2 x'_1 = 1.
(b) Using the result from Problem 2.10, explain accuracy problems that can
arise in computing either x_1 or x_2 when b^2 ≫ |4ac|. Can you use the
result in part (a) to alleviate the problem? Explain.
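Anticipating part (b): a standard remedy, sketched below in Python (the function name quadroots is ours), computes the root that involves no cancellation and recovers the other from the product of the roots, x_1 x_2 = c/a:

```python
import math

def quadroots(a, b, c):
    """Roots of a x^2 + b x + c = 0, avoiding cancellation when b^2 >> |4ac|."""
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))   # add quantities of like sign
    return q / a, c / q                    # second root via x1 * x2 = c/a

r1, r2 = quadroots(1.0, -0.3001, 0.00006)
print(r1, r2)   # ~0.29989993 and ~2.0006673e-4
```

With the data of Problem 2.9, both roots come out accurate, whereas the textbook formula loses most of the digits of the small root.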
3 Sequences and Series
3.1 INTRODUCTION
Sequences and series have a major role to play in computational methods. In this
chapter we consider various types of sequences and series, especially with respect
to their convergence behavior. A series might converge "mathematically," and yet
it might not converge "numerically" (i.e., when implemented on a computer). Some
of the causes of difficulties such as this will be considered here, along with possible
remedies.
3.2 CAUCHY SEQUENCES AND COMPLETE SPACES
It was noted in the introduction to Chapter 1 that many computational processes
are "iterative" (the Newton-Raphson method for finding the roots of an equation,
iterative methods for linear system solution, etc.). The practical effect of this is to
produce sequences of elements from function spaces. The sequence produced by the
iterative computation is only useful if it converges. We must therefore investigate
what this means.
In Chapter 1 it was possible for sequences to be either singly or doubly infinite.
Here we shall assume sequences are singly infinite unless specifically stated to the
contrary.
We begin with the following (standard) definition taken from Kreyszig
[1, pp. 25-26]. Examples of applications of the definitions to follow will be
considered later.
Definition 3.1: Convergence of a Sequence, Limit A sequence (x_n) in a
metric space X = (X, d) is said to converge, or to be convergent, iff there is an
x ∈ X such that

lim_{n→∞} d(x_n, x) = 0.   (3.1)

The element x is called the limit of (x_n) (i.e., limit of the sequence), and we may
state that

lim_{n→∞} x_n = x.   (3.2)
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
We say that (x_n) converges to x or has a limit x. If (x_n) is not convergent, then
we say that it is a divergent sequence, or is simply divergent.
A shorthand expression for (3.2) is to write x_n → x. We observe that the sequence
(x_n) is defined to converge (or not) with respect to a particular metric, here denoted d
(recall the axioms for a metric space from Chapter 1). We remark that it is possible
that, for some (x_n) in some set X, the sequence might converge with respect to
one metric on the set, but might not converge with respect to another choice of
metric. It must be emphasized that the limit x must be an element of X in order
for the sequence to be convergent.
Suppose, for example, that X = (0, 1] ⊂ R, and consider the sequence x_n =
1/(n + 1) (n ∈ Z^+). Suppose also that d(x, y) = |x − y|. The sequence (x_n) does not
converge in X because the sequence "wants to go to 0," but 0 is not in X. So the
sequence does not converge. (Of course, the sequence converges in X = R with
respect to our present choice of metric.)
It can be difficult in practice to ascertain whether a particular sequence converges
according to Definition 3.1. This is because the limit x may not be known
in advance. In fact, this is almost always the case in computing applications of
sequences. Sometimes it is therefore easier to work with the following:
Definition 3.2: Cauchy Sequence, Complete Space A sequence (x_n) in a
metric space X = (X, d) is called a Cauchy sequence iff for all ε > 0 there is an
N(ε) ∈ Z^+ such that

d(x_m, x_n) < ε   (3.3)

for all m, n > N(ε). The space X is a complete space iff every Cauchy sequence
in X converges.
We often write N instead of N(ε), though N may depend on our choice of ε. It
is possible to prove that any convergent sequence is also Cauchy.
We remark that, if in fact the limit is known (or at least strongly suspected),
then applying Definition 3.1 may actually be easier than applying Definition 3.2.
We see that under Definition 3.2 the elements of a Cauchy sequence get closer to
each other as n and m increase. Establishing the "Cauchiness" of a sequence does
not require knowing the limit of the sequence. This, at least in principle, simplifies
matters. However, a big problem with this definition is that there are metric spaces
X in which not all Cauchy sequences converge. In other words, there are incomplete
metric spaces. For example, the space X = (0, 1] with d(x, y) = |x − y| is not
complete. Recall that we considered x_n = 1/(n + 1). This sequence is Cauchy,¹
but the limit is 0, which is not in X. Thus, this sequence is a nonconvergent Cauchy
sequence, and so the space (X, |·|) is not complete.
¹We see that

d(x_m, x_n) = |1/(m + 1) − 1/(n + 1)|.

For any ε > 0 we may find N(ε) > 0 such that for n > m > N(ε)

1/(m + 1) − 1/(n + 1) < ε.

If n < m, the roles of n and m may be reversed. The conditions of Definition 3.2 are met, and so the
sequence is Cauchy.

A more subtle example of an incomplete metric space is the following. Recall
the space C[a, b] from Example 1.4. Assume that a = 0 and b = 1, and now choose
the metric to be

d(x, y) = ∫_0^1 |x(t) − y(t)| dt   (3.4)

instead of Eq. (1.8). The space C[0, 1] with the metric (3.4) is not complete. This
may be shown by considering the sequence of continuous functions illustrated
in Fig. 3.1. The functions x_m(t) in Fig. 3.1a form a Cauchy sequence. (Here we
assume m > 1, and m is an integer.) This is because d(x_m, x_n) is the area of the
triangle in Fig. 3.1b, and for any ε > 0, we have

d(x_m, x_n) < ε

whenever m, n > 1/(2ε). (Suppose n > m, and consider that d(x_m, x_n) =
½(1/m − 1/n) < 1/(2m) < ε.) We may see that this Cauchy sequence does not converge. Observe
that we have

x_m(t) = 0 for t ∈ [0, ½],   x_m(t) = 1 for t ∈ [a_m, 1],

Figure 3.1 A Cauchy sequence of functions in C[0, 1].
where a_m = ½ + 1/m. Therefore, for all x ∈ C[0, 1],

d(x_m, x) = ∫_0^1 |x_m(t) − x(t)| dt
          = ∫_0^{1/2} |x(t)| dt + ∫_{1/2}^{a_m} |x_m(t) − x(t)| dt + ∫_{a_m}^1 |1 − x(t)| dt.

The integrands are all nonnegative, and so each of the integrals on the right-hand
side is nonnegative, too. Thus, to say that d(x_m, x) → 0 implies that each integral
approaches zero. Since x(t) is continuous, it must be the case that

x(t) = 0 for t ∈ [0, ½),   x(t) = 1 for t ∈ (½, 1].

However, this is not possible for a continuous function. In other words, we have
a contradiction. Hence, (x_m) does not converge (i.e., has no limit in X = C[0, 1]).
Again, we have a Cauchy sequence that does not converge, and so C[0, 1] with
the metric (3.4) is not complete.
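The triangle-area formula d(x_m, x_n) = ½(1/m − 1/n) can be confirmed numerically. The Python sketch below is our own construction of the Fig. 3.1 functions, assuming a linear ramp of slope m between ½ and a_m = ½ + 1/m, with the integral approximated by a Riemann sum:

```python
def xm(m, t):
    """Continuous function of Fig. 3.1: 0 on [0, 1/2], 1 on [1/2 + 1/m, 1],
    assumed linear (slope m) in between."""
    if t <= 0.5:
        return 0.0
    if t >= 0.5 + 1.0 / m:
        return 1.0
    return m * (t - 0.5)

def dist(m, n, N=200000):
    """Riemann-sum approximation of integral_0^1 |x_m(t) - x_n(t)| dt."""
    h = 1.0 / N
    return sum(abs(xm(m, i * h) - xm(n, i * h)) for i in range(N)) * h

m, n = 4, 10
approx = dist(m, n)
exact = 0.5 * (1.0 / m - 1.0 / n)   # triangle area from the text
print(approx, exact)                # both close to 0.075
```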
This example also shows that a sequence of continuous functions may very well
possess a discontinuous limit. Actually, we have seen this phenomenon before.
Recall the example of the Fourier series in Chapter 1 (see Example 1.20). In this
case the series representation of the square wave was made up of terms that are all
continuous functions. Yet the series converges to a discontinuous limit. We shall
return to this issue again later.
We have seen that some metric spaces are not complete. This means that even though a
sequence is Cauchy, there is no guarantee of convergence. We are therefore faced
with the following questions:
1. What metric spaces are complete?
2. Can they be "completed" if they are not?
The answer to the second question is "Yes." Given an incomplete metric space,
it is always possible to complete it. We have seen that a Cauchy sequence does
not converge when the sequence tends toward a limit that does not belong to the
space; thus, in a sense, the space "has holes in it." Completion is the process of
filling in the holes. This amounts to adding the appropriate elements to the set that
made up the incomplete space. However, in general this is a technically difficult
process, and so we will not carry it out here; it is a job normally left to
mathematicians.
We will therefore content ourselves with answering the first question. This will
be done simply by listing complete metric spaces that are useful to engineers:
1. The sets R and C with the metric d(x, y) = |x − y| are complete metric spaces.
2. Recall Example 1.3. The space l^∞[0, ∞] with the metric

d(x, y) = sup_{k∈Z^+} |x_k − y_k|   (3.5)

is a complete metric space. (A proof of this claim appears in Ref. 1, p. 34.)
3. The Euclidean space R^n and the unitary space C^n, both with the metric

d(x, y) = [Σ_{k=0}^{n−1} |x_k − y_k|^2]^{1/2},   (3.6)

are complete metric spaces. (Proof is on p. 33 of Ref. 1.)
4. Recall Example 1.6. Fixing p, the space l^p[0, ∞] such that 1 ≤ p < ∞ is a
complete metric space. Here we recall that the metric is

d(x, y) = [Σ_{k=0}^∞ |x_k − y_k|^p]^{1/p}.   (3.7)

5. Recall Example 1.4. The set C[a, b] with the metric

d(x, y) = sup_{t∈[a,b]} |x(t) − y(t)|   (3.8)

is a complete metric space. (Proof is on pp. 36-37 of Ref. 1.)
The last example is interesting because the special case C[0, 1] with metric (3.4)
was previously shown to be incomplete. Keeping the same set but changing the
metric from that in (3.4) to that in (3.8) changes the situation dramatically.
In Chapter 1 we remarked on the importance of the metric space L^2[a, b] (recall
Example 1.7). The space is important as the "space of finite energy signals on the
interval [a, b]." (A "finite power" interpretation was also possible.) An important
special case of this was L^2(R) = L^2(−∞, ∞). Are these metric spaces complete?
Our notation implicitly assumes that the set (1.11a) (Chapter 1) contains the so-called
Lebesgue integrable functions on [a, b]. In this case the space L^2[a, b] is
indeed complete with respect to the metric

d(x, y) = [∫_a^b |x(t) − y(t)|^2 dt]^{1/2}.   (3.9)

Lebesgue integrable functions² have a complicated mathematical structure, and we
have promised to avoid any measure theory in this book. It is enough for the reader

²One of the "simplest" introductions to these is Rudin [2]. However, these functions appear in the
last chapter [2, Chapter 11]. Knowledge of much of the previous chapters is prerequisite to studying
Chapter 11. Thus, the effort required to learn measure theory is substantial.
to assume that the functions in L^2[a, b] are the familiar ones from elementary
calculus.³
The complete metric spaces considered in the two previous paragraphs also
happen to be normed spaces; recall Section 1.3.2. This is because the metrics are
all induced by suitable norms on the spaces. It therefore follows that these spaces
are complete normed spaces. Complete normed spaces are called Banach spaces.
Some of the complete normed spaces are also inner product spaces. Again, this
follows because in those cases an inner product is defined that induces the norm.
Complete inner product spaces are called Hilbert spaces. To be more specific, the
following spaces are Hilbert spaces:
1. The Euclidean space R^n and the unitary space C^n along with the inner product

⟨x, y⟩ = Σ_{k=0}^{n−1} x_k y_k^*   (3.10)

are both Hilbert spaces.
2. The space L^2[a, b] with the inner product

⟨x, y⟩ = ∫_a^b x(t) y^*(t) dt   (3.11)

is a Hilbert space. [This includes the special case L^2(R).]
3. The space l^2[0, ∞] with the inner product

⟨x, y⟩ = Σ_{k=0}^∞ x_k y_k^*   (3.12)

is a Hilbert space.
We emphasize that (3.10) induces the metric (3.6), (3.11) induces the metric (3.9),
and (3.12) induces the metric (3.7) (but only for the case p = 2; recall from Chapter 1
that l^p[0, ∞] is not an inner product space when p ≠ 2). The three Hilbert spaces
listed above are particularly important because, in part, elements in
these spaces have (as we have already noted) either finite energy or finite power
interpretations. Additionally, least-squares problems are best posed and solved
within these spaces. This will be considered later.
Define the set of natural numbers N = {1, 2, 3, ...}. We have seen that sequences
of continuous functions may have a discontinuous limit. An extreme example of this
phenomenon is from p. 145 of Rudin [2].

³These "familiar" functions are called Riemann integrable functions. These functions form a proper
subset of the Lebesgue integrable functions.
Example 3.1 For n ∈ N define

x_n(t) = lim_{m→∞} [cos(n!πt)]^{2m}.

When n!t is an integer, then x_n(t) = 1 (simply because cos(πk) = ±1 for k ∈ Z).
For all other values of t, we must have x_n(t) = 0 (simply because |cos t| < 1 when
t is not an integral multiple of π). Define

x(t) = lim_{n→∞} x_n(t).

If t is irrational, then x_n(t) = 0 for all n. Suppose that t is rational; that is, suppose
t = p/q for which p, q ∈ Z. In this case n!t is an integer when n ≥ q, in which
case x(t) = 1. Consequently, we may conclude that

x(t) = lim_{n→∞} lim_{m→∞} [cos(n!πt)]^{2m} = { 0, t is irrational; 1, t is rational. }   (3.13)

We have mentioned (in footnote 3, above) that Riemann integrable functions are a
proper subset of the Lebesgue integrable functions. It turns out that x(t) in (3.13)
is Lebesgue integrable, but not Riemann integrable. In other words, you cannot use
elementary calculus to find the integral of x(t) in (3.13). Of course, x(t) is a very
strange function. This is typical; that is, functions that are not Riemann integrable
are usually rather strange, and so are not commonly encountered (by the engineer).
It therefore follows that we do not need to worry much about the more general
class of Lebesgue integrable functions.
Limiting processes are potentially dangerous. This is illustrated by a very simple
example.
Example 3.2 Suppose n, m ∈ N. Define

x_{m,n} = m/(m + n).

(This is a double sequence. In Chapter 1 we saw that these arise routinely in wavelet
theory.) Treating n as a fixed constant, we obtain

lim_{m→∞} x_{m,n} = 1,

so

lim_{n→∞} lim_{m→∞} x_{m,n} = 1.

Now instead treat m as a fixed constant, so that

lim_{n→∞} x_{m,n} = 0,

which in turn implies that

lim_{m→∞} lim_{n→∞} x_{m,n} = 0.
Interchanging the order of the limits has given two completely different answers.
Interchanging the order of limits clearly must be done with great care.
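The two iterated limits can be glimpsed numerically (a Python illustration of our own):

```python
def x(m, n):
    return m / (m + n)

# n fixed, let m grow: the inner limit is 1, and it stays 1 as n grows.
print(x(10 ** 9, 5))    # ~1.0
# m fixed, let n grow: the inner limit is 0, and it stays 0 as m grows.
print(x(5, 10 ** 9))    # ~0.0
```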
The following example is simply another illustration of how to apply Definition 3.2.
Example 3.3 Define

x_n = 1 + (−1)^n/(n + 1)   (n ∈ Z^+).

This is a sequence in the metric space (R, |·|). This space is complete, so we need
not know the limit of the sequence to determine whether it converges (although
we might guess that the limit is x = 1). If we assume [without loss of generality
(commonly abbreviated w.l.o.g.)] that n > m > N(ε), then we see that

d(x_m, x_n) = |(−1)^m/(m + 1) − (−1)^n/(n + 1)| ≤ 1/(m + 1) + 1/(n + 1) < 1/m + 1/n ≤ 2/m < ε,

where the triangle inequality has been used. So, for a given ε > 0, we select
n > m > 2/ε. The sequence is Cauchy, and so it must converge.
We close this section with mention of Appendix 3.A. Think of the material in it
as being a very big applications example. This appendix presents an introduction
to coordinate rotation digital computing (CORDIC). This is an application of a
particular class of Cauchy sequence (called a discrete basis) to the problem of
performing certain elementary operations (e.g., vector rotation, computing sines
and cosines). The method is used in application-specific integrated circuits (ASICs)
and gate arrays, and has been used in pocket calculators. Note that Appendix 3.A also
illustrates a useful series expansion that is expressed in terms of the discrete basis.
3.3 POINTWISE CONVERGENCE AND UNIFORM CONVERGENCE
The previous section informed us that sequences can converge in different ways,
assuming that they converge in any sense at all. We explore this issue further here.
Definition 3.3: Pointwise Convergence Suppose that (x_n(t)) (n ∈ Z^+) is a
sequence of functions for which t ∈ S ⊂ R. We say that the sequence converges
pointwise iff there is an x(t) (t ∈ S) so that for all ε > 0 there is an N = N(ε, t)
such that

|x_n(t) − x(t)| < ε   (3.14)

for n > N. We call x the limit of (x_n) and write

x(t) = lim_{n→∞} x_n(t)   (t ∈ S).   (3.15)
We emphasize that under this definition N may depend on both ε and t. We may
contrast Definition 3.3 with the following definition.
Definition 3.4: Uniform Convergence Suppose that (x_n(t)) (n ∈ Z^+) is a
sequence of functions for which t ∈ S ⊂ R. We say that the sequence converges
uniformly iff there is an x(t) (t ∈ S) so that for all ε > 0 there is an N = N(ε)
such that

|x_n(t) − x(t)| < ε   (3.16)

for n > N. We call x the limit of (x_n) and write

x(t) = lim_{n→∞} x_n(t)   (t ∈ S).   (3.17)
We emphasize that under this definition N never depends on t, although it may
depend on ε. It is apparent that a uniformly convergent sequence is also pointwise
convergent. However, the converse is not true; that is, a pointwise convergent
sequence is not necessarily uniformly convergent. This distinction is important in
understanding the convergence behavior of series as well as of sequences. In
particular, it helps in understanding convergence phenomena in Fourier (and wavelet)
series expansions.
In contrast with the definitions of Section 3.2, under Definitions 3.3 and 3.4 the
elements of (x n ) and the limit x need not reside in the same function space. In
fact, we do not ask what function spaces they belong to at all. In other words, the
definitions of this section represent a different approach to convergence analysis.
As with Definition 3.1, direct application of Definitions 3.3 and 3.4 can be
quite difficult since the limit, assuming it exists, is not often known in advance
(i.e., a priori) in practice. Therefore, we would hope for a convergence criterion
similar to the idea of Cauchy convergence in Section 3.2 (Definition 3.2). In fact,
we have the following theorem (from Rudin [2, pp. 147-148]).
Theorem 3.1: The sequence of functions (x_n) defined on S ⊂ R converges
uniformly on S iff for all ε > 0 there is an N such that

|x_m(t) − x_n(t)| < ε   (3.18)

for all n, m > N and all t ∈ S.
This is certainly analogous to the Cauchy criterion seen earlier. (We omit the
proof.)
Example 3.4 Suppose that (x_n) is defined according to

x_n(t) = 1/(nt + 1),   t ∈ (0, 1) and n ∈ N.

A sketch of x_n(t) for various n appears in Fig. 3.2. We see that ("by inspection")
x_n → 0. But consider, for all ε > 0,

|x_n(t) − 0| = 1/(nt + 1) < ε,

which implies that we must have

n > (1/t)(1/ε − 1) = N,

so that N is a function of both t and ε. Convergence is therefore pointwise, and is
not uniform.
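The dependence of N on t can be tabulated directly. In the Python sketch below (our own; we pick ε = 2^{−7} so that all the arithmetic is exact in binary floating point), N blows up as t shrinks:

```python
import math

def N(eps, t):
    """Smallest n with 1/(n t + 1) < eps, i.e. n > (1/t)(1/eps - 1)."""
    return math.floor((1.0 / t) * (1.0 / eps - 1.0)) + 1

eps = 2.0 ** -7          # exact in binary floating point
for t in (0.5, 2.0 ** -3, 2.0 ** -6):
    print(t, N(eps, t))  # N grows without bound as t -> 0+
```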
Other criteria for uniform convergence may be established. For example, there is
the following theorem (again from Rudin [2, p. 148]).
Theorem 3.2: Suppose that

lim_{n→∞} x_n(t) = x(t)   (t ∈ S).

Define

M_n = sup_{t∈S} |x_n(t) − x(t)|.

Then x_n → x uniformly on S iff M_n → 0 as n → ∞.
Figure 3.2 A plot of typical sequence elements for Example 3.4; here, t ∈ [0.01, 0.99].
Figure 3.3 A plot of typical sequence elements for Example 3.5.
The proof is really an immediate consequence of Definition 3.4, and so is omitted
here.
Example 3.5 Suppose that

x_n(t) = t/(1 + nt^2),   t ∈ R and n ∈ N.

A sketch of x_n(t) for various n appears in Fig. 3.3. We note that

dx_n(t)/dt = [(1 + nt^2) · 1 − t · (2nt)]/[1 + nt^2]^2 = (1 − nt^2)/[1 + nt^2]^2 = 0

for t = ±1/√n. We see that

x_n(±1/√n) = ±1/(2√n).

We also see that x_n → 0. So then

M_n = sup_{t∈R} |x_n(t)| = 1/(2√n).

Clearly, M_n → 0 as n → ∞. Therefore, via Theorem 3.2, we immediately conclude
that x_n → x = 0 uniformly on the real number line.
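Theorem 3.2's criterion can be checked numerically for this example. The grid search below (Python, our own) approximates M_n = sup_t |x_n(t)| and compares it with 1/(2√n):

```python
import math

def xn(n, t):
    return t / (1.0 + n * t * t)

n = 50
# x_n is odd and positive for t > 0, so it suffices to search t > 0;
# theory predicts the max at t = 1/sqrt(n) with value 1/(2 sqrt(n)).
Mn = max(xn(n, k * 1e-4) for k in range(1, 200001))
print(Mn, 1.0 / (2.0 * math.sqrt(n)))
```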
3.4 FOURIER SERIES
The Fourier series expansion was introduced briefly in Chapter 1, where the
behavior of this series with respect to its convergence properties was not mentioned. In
this section we shall demonstrate the pointwise convergence of the Fourier series
by the analysis of a particular example. Much of what follows is from Walter [17].
However, of necessity, the present treatment is not so rigorous.
Suppose that

g(t) = (1/π)(π − t),   0 < t < 2π.   (3.19)

The reader is strongly invited to show that this has the Fourier series expansion

g(t) = (2/π) Σ_{k=1}^∞ (1/k) sin(kt).   (3.20)

The procedure for doing this closely follows Example 1.20. In our analysis to
follow, it will be easier to work with

f(t) = (π/2) g(t) = Σ_{k=1}^∞ (1/k) sin(kt).   (3.21)

Define the sequence of partial sums

S_n(t) = Σ_{k=1}^n (1/k) sin(kt)   (n ∈ N).   (3.22)

So we infer that

lim_{n→∞} S_n(t) = f(t),   (3.23)

but we do not know in what sense the partial sums tend to f(t). Is convergence
pointwise, or uniform?
We shall need the special function

D_n(t) = (1/π)[½ + Σ_{k=1}^n cos(kt)] = (1/2π) sin((n + ½)t)/sin(½t).   (3.24)

This function is called the Dirichlet kernel. The second equality in (3.24) is not
obvious. We will prove it. Consider that

sin(½t)(πD_n(t)) = ½ sin(½t) + Σ_{k=1}^n sin(½t) cos(kt)
                 = ½ sin(½t) + ½ Σ_{k=1}^n [sin((k + ½)t) − sin((k − ½)t)],   (3.25)
where we have used the identity sin a cos b = ½ sin(a + b) + ½ sin(a − b). By
expanding the sums and looking for cancellations, we find that

Σ_{k=1}^n [sin((k + ½)t) − sin((k − ½)t)] = sin((n + ½)t) − sin(½t).   (3.26)

Applying (3.26) in (3.25), we obtain

sin(½t)(πD_n(t)) = ½ sin((n + ½)t),

so immediately

D_n(t) = (1/2π) sin((n + ½)t)/sin(½t),

and this establishes (3.24). Using the identity sin(a + b) = sin a cos b + cos a sin b,
we may also write

D_n(t) = (1/2π)[sin(nt) cos(½t)/sin(½t) + cos(nt)].   (3.27)
For t > 0, using the form of the Dirichlet kernel in (3.27), we have

π ∫_0^t D_n(x) dx = ∫_0^t [sin(nx) cos(½x)/(2 sin(½x)) + ½ cos(nx)] dx
                 = ∫_0^t sin(nx)/x dx + ∫_0^t sin(nx)[cos(½x)/(2 sin(½x)) − 1/x] dx
                   + ½ ∫_0^t cos(nx) dx.   (3.28)
We are interested in what happens when t is a small positive value, but n is large.
To begin with, it is not difficult to see that

lim_{n→∞} ½ ∫_0^t cos(nx) dx = lim_{n→∞} (1/(2n)) sin(nt) = 0.   (3.29)

Less clearly,

lim_{n→∞} ∫_0^t sin(nx)[cos(½x)/(2 sin(½x)) − 1/x] dx = 0   (3.30)
(take this for granted). Through a simple change of variable,

I(nt) = ∫_0^t sin(nx)/x dx = ∫_0^{nt} sin(x)/x dx.   (3.31)

In fact,

∫_0^∞ sin(x)/x dx = π/2.   (3.32)

This is not obvious, either. The result may be found in integral tables [18, p. 483].
In other words, even for very small t, I(nt) does not go to zero as n increases.
Consequently, using (3.29), (3.30), and (3.31) in (3.28), we have (for big n)

π ∫_0^t D_n(x) dx ≈ I(nt).   (3.33)
The results in the previous paragraph help in the following manner. Begin by
noting that

S_n(t) = Σ_{k=1}^n (1/k) sin(kt) = Σ_{k=1}^n ∫_0^t cos(kx) dx = ∫_0^t [Σ_{k=1}^n cos(kx)] dx
       = ∫_0^t [½ + Σ_{k=1}^n cos(kx)] dx − ½t = π ∫_0^t D_n(x) dx − ½t   [via (3.24)].   (3.34)

So from (3.33)

S_n(t) ≈ I(nt) − ½t.   (3.35)

Define the sequence t_n = π/n. Consequently,

S_n(t_n) ≈ I(π) − π/(2n).   (3.36)

As n → ∞, t_n → 0, and S_n(t_n) → I(π). We can say that for big n

S_n(0+) ≈ I(π).   (3.37)

Now, f(0+) = π/2, so for big n

S_n(0+)/f(0+) ≈ (2/π) ∫_0^π sin(x)/x dx ≈ 1.18.   (3.38)
Numerical integration is needed to establish this. This topic is the subject of a later
chapter, however.
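As a preview of that topic, the value in (3.38) can be approximated with the composite Simpson rule (a Python sketch of our own; Simpson's rule itself is developed in the later chapter):

```python
import math

def sinc(x):
    return 1.0 if x == 0.0 else math.sin(x) / x

# Composite Simpson's rule on [0, pi] with N (even) subintervals
N = 1000
h = math.pi / N
s = sinc(0.0) + sinc(math.pi)
for k in range(1, N):
    s += (4 if k % 2 else 2) * sinc(k * h)
val = (2.0 / math.pi) * (h / 3.0) * s
print(val)   # ~1.1790, the overshoot ratio in (3.38)
```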
We see that the sequence of partial sums S_n(0+) converges to a value bigger
than f(0+) as n → ∞. The approximation S_n(t) therefore tends to "overshoot"
the true value of f(t) for small t. This is called the Gibbs phenomenon, or Gibbs
overshoot. We observe that t = 0 is the place where f(t) has a discontinuity. This
tendency of the Fourier series to overshoot near discontinuities is entirely typical.
We note that f(π) − S_n(π) = 0 for all n ≥ 1. Thus, for any ε > 0

|f(π) − S_n(π)| < ε

for all n ≥ 1. The previous analysis for t = 0+, and this one for t = π, show that
N (in the definitions of convergence) depends on t. Convergence of the Fourier
series is therefore pointwise and not uniform. Generally, the Gibbs phenomenon is
a symptom of pointwise convergence.
We remark that the Gibbs phenomenon has an impact in the signal processing
applications of series expansions. Techniques for signal compression and signal
enhancement are often based on series expansions. The Gibbs phenomenon can
degrade the quality of decompressed or reconstructed signals. The phenomenon
is responsible for "ringing artifacts." This is one reason why the convergence
properties of series expansions are important to engineers.
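The overshoot is visible directly in the partial sums (3.22). The Python sketch below (our own illustration) evaluates S_n at t_n = π/n and compares with f(0+) = π/2; the ratio creeps up toward the Gibbs value of about 1.18:

```python
import math

def S(n, t):
    """Partial sum S_n(t) = sum_{k=1}^n sin(k t)/k of (3.22)."""
    return sum(math.sin(k * t) / k for k in range(1, n + 1))

f0 = math.pi / 2.0                     # f(0+)
for n in (100, 1000, 10000):
    print(n, S(n, math.pi / n) / f0)   # increases toward ~1.179
```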
Figure 3.4 shows a plot of f(t), S_n(t), and the error

E_n(t) = S_n(t) − f(t).   (3.39)

The reader may confirm (3.38) directly from the plot in Fig. 3.4a.
Figure 3.4 Plots of the Fourier series expansion for f(t) in (3.21), S_n(t) [of (3.22)] for
n = 30, and the error E_n(t) = S_n(t) − f(t).
Figure 3.5 E_n(t) of (3.39) for different values of n.
We conclude this section by remarking that

lim_{n→∞} ∫_0^{2π} |E_n(t)|^2 dt = 0;   (3.40)

that is, the energy of the error goes to zero in the limit as n goes to infinity.
However, the amplitude of the error in the vicinity of a discontinuity remains
unchanged in the limit. This is more clearly seen in Fig. 3.5, where the error is
displayed for different values of n. Of course, this fact agrees with our analysis.
Equation (3.40) is really a consequence of the fact that (recalling Chapter 1) S_n(t)
and f(t) are both in the space L^2(0, 2π). Rigorous proof of this is quite tough,
and so we omit the proof entirely.
3.5 TAYLOR SERIES

Assume that f(x) is real-valued and that x ∈ R. One way to define the derivative
of f(x) at x = x_0 is according to

f^{(1)}(x_0) = df(x)/dx |_{x=x_0} = lim_{x→x_0} [f(x) − f(x_0)]/(x − x_0).   (3.41)

The notation is

f^{(n)}(x) = d^n f(x)/dx^n   (3.42)

(so f^{(0)}(x) = f(x)). From (3.41), we obtain

f(x) ≈ f(x_0) + f^{(1)}(x_0)(x − x_0).   (3.43)
But how good is this approximation? Can we obtain a more accurate approximation
to f(x) if we know f^{(n)}(x_0) for n > 1? Again, what is the accuracy of the resulting
approximation? We consider these issues in this section.
Begin by recalling the following theorem.

Theorem 3.3: Mean-Value Theorem If f(x) is continuous for x ∈ [a, b]
with a continuous derivative for x ∈ (a, b), then there is a number ξ ∈ (a, b) such
that

[f(b) − f(a)]/(b − a) = f^{(1)}(ξ).   (3.44)

Therefore, if a = x_0 and b = x, we must have

f(x) = f(x_0) + f^{(1)}(ξ)(x − x_0).   (3.45)

This expression is "exact," and so is in contrast with (3.43). Proof of Theorem
3.3 may be found in, for example, Bers [19, p. 636]. Theorem 3.3 generalizes
to the following theorem.

Theorem 3.4: Generalized Mean-Value Theorem Suppose that f(x) and
g(x) are continuous functions on x ∈ [a, b]. Assume f^{(1)}(x) and g^{(1)}(x) exist and
are continuous, and g^{(1)}(x) ≠ 0 for x ∈ (a, b). Then there is a number ξ ∈ (a, b) such
that

[f(b) − f(a)]/[g(b) − g(a)] = f^{(1)}(ξ)/g^{(1)}(ξ).   (3.46)
Once again the proof is omitted, but may be found in Bers [19, p. 637].
The tangent to f(x) at x = x_0 is given by t(x) = f(x_0) + f^{(1)}(x_0)(x − x_0)
(t(x_0) = f(x_0) and t^{(1)}(x_0) = f^{(1)}(x_0)). We wish to consider t(x) to be an
approximation to f(x), so the error is

f(x) − t(x) = e(x)

or

f(x) = f(x_0) + f^{(1)}(x_0)(x − x_0) + e(x).   (3.47)

Thus,

e(x)/(x − x_0) = [f(x) − f(x_0)]/(x − x_0) − f^{(1)}(x_0).

But

lim_{x→x_0} [f(x) − f(x_0)]/(x − x_0) = f^{(1)}(x_0),

so immediately

lim_{x→x_0} e(x)/(x − x_0) = 0.   (3.48)

From (3.47), e(x_0) = 0, and also from (3.47), we obtain

f^{(1)}(x) = f^{(1)}(x_0) + e^{(1)}(x),   (3.49)
so e^{(1)}(x_0) = 0. From (3.49), f^{(2)}(x) = e^{(2)}(x) (so we now assume that f(x) has
a second derivative, and we will also assume that it is continuous). Theorem 3.4
has the following corollary.

Corollary 3.1 Suppose that f(a) = g(a) = 0. Then for all b ≠ a there is a ξ ∈
(a, b) such that

f(b)/g(b) = f^{(1)}(ξ)/g^{(1)}(ξ).   (3.50)

Now apply this corollary to f(x) = e(x) and g(x) = (x − x_0)^2, with a = x_0,
b = x. Thus, from (3.50),

e(x)/(x − x_0)^2 = e^{(1)}(τ)/[2(τ − x_0)]   (3.51)

(τ ∈ (x_0, x)). Apply the corollary once more to f(τ) = e^{(1)}(τ) and g(τ) =
2(τ − x_0):

e^{(1)}(τ)/[2(τ − x_0)] = e^{(2)}(ξ)/2 = ½ f^{(2)}(ξ)   (3.52)

(ξ ∈ (x_0, τ)). Applying (3.52) in (3.51),

e(x)/(x − x_0)^2 = ½ f^{(2)}(ξ),

or (for some ξ ∈ (x_0, x))

e(x) = ½ f^{(2)}(ξ)(x − x_0)^2.   (3.53)

If |f^{(2)}(t)| ≤ M_2 for t ∈ (x_0, x), then

|e(x)| ≤ ½ M_2 |x − x_0|^2   (3.54)

and

f(x) = f(x_0) + f^{(1)}(x_0)(x − x_0) + e(x),   (3.55)

for which (3.54) is an upper bound on the size of the error involved in
approximating f(x) using (3.43).
Example 3.6  Suppose that f(x) = \sqrt{x}; then

    f^{(1)}(x) = \frac{1}{2\sqrt{x}},  f^{(2)}(x) = -\frac{1}{4} x^{-3/2}.

Suppose that x_0 = 1, and x = x_0 + \delta x = 1 + \delta x. Thus, via (3.55)

    \sqrt{x} = \sqrt{1 + \delta x} = 1 + f^{(1)}(1)\,\delta x + e(x) = 1 + \tfrac{1}{2}\delta x + e(x),

so if, for example, |\delta x| < \tfrac{3}{4}, then |f^{(2)}(x)| < 2 (for x \in (\tfrac{1}{4}, \tfrac{7}{4})), so M_2 = 2,
and so

    |e(x)| \le (\delta x)^2

via (3.54). This bound may be compared to the following table of values:

    \delta x    \sqrt{1 + \delta x}    1 + \tfrac{1}{2}\delta x    e(x)       (\delta x)^2
    -3/4        0.5000                 0.6250                     -0.1250     0.5625
    -1/2        0.7071                 0.7500                     -0.0429     0.2500
     0          1.0000                 1.0000                      0.0000     0.0000
     1/2        1.2247                 1.2500                     -0.0253     0.2500
     3/4        1.3229                 1.3750                     -0.0521     0.5625

It is easy to see that indeed |e(x)| \le (\delta x)^2.
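The table in Example 3.6 is easy to reproduce numerically. The following is a minimal Python sketch (Python rather than the MATLAB used elsewhere in the book; the function name `sqrt_error` is ours):

```python
import math

def sqrt_error(dx):
    """Error e(x) = sqrt(1 + dx) - (1 + dx/2) of the tangent-line
    approximation about x0 = 1, as in Example 3.6."""
    return math.sqrt(1.0 + dx) - (1.0 + 0.5 * dx)

# Check the bound |e(x)| <= (dx)^2 from (3.54) at the tabulated points.
for dx in (-0.75, -0.5, 0.0, 0.5, 0.75):
    assert abs(sqrt_error(dx)) <= dx * dx
```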
We mention that Corollary 3.1 leads to l'Hopital's rule. It therefore allows us
to determine

    \lim_{x \to a} \frac{f(x)}{g(x)}

when f(a) = g(a) = 0. We now digress briefly to consider this subject. The rule
applies if f(x) and g(x) are continuous at x = a, if f(x) and g(x) have continuous
derivatives at x = a, and if g^{(1)}(x) \neq 0 near x = a, except perhaps at x = a.
l'Hopital's rule is as follows. If \lim_{x \to a} f(x) = \lim_{x \to a} g(x) = 0, and
\lim_{x \to a} \frac{f^{(1)}(x)}{g^{(1)}(x)} exists, then

    \lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f^{(1)}(x)}{g^{(1)}(x)}.    (3.56)

The rationale is that from Corollary 3.1, for all x \neq a there is a \xi \in (a, x) such that
\frac{f(x)}{g(x)} = \frac{f^{(1)}(\xi)}{g^{(1)}(\xi)}. So, if x is close to a, then \xi must also be close to a, and \frac{f^{(1)}(\xi)}{g^{(1)}(\xi)}
is close to its limit. l'Hopital's rule is also referred to as "the rule for evaluating
the indeterminate form 0/0." If it happens that f^{(1)}(a) = g^{(1)}(a) = 0, then one may
attempt l'Hopital's rule yet again; that is, if \lim_{x \to a} \frac{f^{(2)}(x)}{g^{(2)}(x)} exists, then

    \lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f^{(2)}(x)}{g^{(2)}(x)}.
Example 3.7  Consider

    \lim_{x \to 0} \frac{\sin x - e^x + 1}{x^2}
      = \lim_{x \to 0} \frac{\frac{d}{dx}[\sin x - e^x + 1]}{\frac{d}{dx} x^2}
      = \lim_{x \to 0} \frac{\cos x - e^x}{2x}
      = \lim_{x \to 0} \frac{-\sin x - e^x}{2} = -\frac{1}{2},
for which the rule has been applied twice. Now consider instead

    \lim_{x \to 0} \frac{1 - 2x}{2 + 4x} = \lim_{x \to 0} \frac{\frac{d}{dx}[1 - 2x]}{\frac{d}{dx}[2 + 4x]} = \lim_{x \to 0} \frac{-2}{4} = -\frac{1}{2}.

This is wrong! l'Hopital's rule does not apply here because f(0) = 1 and g(0) = 2
(i.e., we do not have f(0) = g(0) = 0 as needed by the theory). The correct limit is
simply f(0)/g(0) = 1/2.

The rule can be extended to cover other indeterminate forms (e.g., \infty/\infty). For
example, consider

    \lim_{x \to 0^+} x \log_e x = \lim_{x \to 0^+} \frac{\log_e x}{1/x}
      = \lim_{x \to 0^+} \frac{1/x}{-1/x^2} = \lim_{x \to 0^+} (-x) = 0.

An interesting case is that of finding

    \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x.

This is an indeterminate of the form 1^\infty. Consider

    \lim_{x \to \infty} \log_e\left(1 + \frac{1}{x}\right)^x
      = \lim_{x \to \infty} \frac{\log_e\left(1 + \frac{1}{x}\right)}{1/x}
      = \lim_{x \to \infty} \frac{\left(-\frac{1}{x^2}\right) / \left(1 + \frac{1}{x}\right)}{-\frac{1}{x^2}}
      = \lim_{x \to \infty} \frac{1}{1 + \frac{1}{x}} = 1.

The logarithm and exponential functions are continuous functions, so it happens to
be the case that

    \lim_{x \to \infty} \log_e\left(1 + \frac{1}{x}\right)^x = \log_e\left[\lim_{x \to \infty}\left(1 + \frac{1}{x}\right)^x\right];
that is, the limit and the logarithm can be interchanged. Thus

    \log_e\left[\lim_{x \to \infty}\left(1 + \frac{1}{x}\right)^x\right] = 1,

so finally we have

    \lim_{x \to \infty}\left(1 + \frac{1}{x}\right)^x = e^1 = e.    (3.57)

More generally, it can be shown that

    \lim_{n \to \infty}\left(1 + \frac{x}{n}\right)^n = e^x.    (3.58)

This result has various applications, including some in probability theory relating to
Poisson and exponential random variables [20]. An alternative derivation of (3.57)
appears on pp. 64-65 of Rudin [2], but involves the use of the Maclaurin series
expansion for e^x. We revisit the Maclaurin series for e^x later.
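The limit (3.58) is easy to probe numerically; a Python sketch (Python rather than the book's MATLAB; `compound_limit` is our own name):

```python
import math

def compound_limit(x, n):
    """Evaluate (1 + x/n)^n, which approaches e^x as n grows; cf. (3.58)."""
    return (1.0 + x / n) ** n

# Convergence is slow: the error behaves roughly like x^2 e^x / (2n).
assert abs(compound_limit(1.0, 10**6) - math.e) < 2e-6
assert abs(compound_limit(2.0, 10**6) - math.e**2) < 2e-5
```

The slow 1/n convergence is one reason (3.58) is a poor computational recipe for e^x, in contrast to the Maclaurin series considered later.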
We have demonstrated that for suitable \xi \in (x_0, x)

    f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{1}{2!} f^{(2)}(\xi)(x - x_0)^2

(recall (3.55)). Define

    p(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{1}{2!} f^{(2)}(x_0)(x - x_0)^2,    (3.59)

so this is some approximation to f(x) near x = x_0. Equation (3.43) is a linear
approximation to f(x), and (3.59) is a quadratic approximation to f(x). Once
again, we wish to consider the error

    f(x) - p(x) = e(x).    (3.60)

We note that

    p(x_0) = f(x_0),  p^{(1)}(x_0) = f^{(1)}(x_0),  p^{(2)}(x_0) = f^{(2)}(x_0).    (3.61)

In other words, the approximation to f(x) in (3.59) matches the function and its
first two derivatives at x = x_0. Because of (3.61), via (3.60)

    e(x_0) = e^{(1)}(x_0) = e^{(2)}(x_0) = 0,    (3.62)

and so via (3.59) and (3.60)

    e^{(3)}(x) = f^{(3)}(x)    (3.63)
(because p^{(3)}(x) = 0, since p(x) is a quadratic in x). As in the derivation of (3.53),
we may repeatedly apply Corollary 3.1:

    \frac{e(x)}{(x - x_0)^3} = \frac{e^{(1)}(t_1)}{3(t_1 - x_0)^2}                 for t_1 \in (x_0, x),

    \frac{e^{(1)}(t_1)}{3(t_1 - x_0)^2} = \frac{e^{(2)}(t_2)}{3 \cdot 2 (t_2 - x_0)}   for t_2 \in (x_0, t_1),

    \frac{e^{(2)}(t_2)}{3 \cdot 2 (t_2 - x_0)} = \frac{e^{(3)}(\xi)}{3 \cdot 2}        for \xi \in (x_0, t_2),

which together yield

    \frac{e(x)}{(x - x_0)^3} = \frac{f^{(3)}(\xi)}{3 \cdot 2}                          for \xi \in (x_0, x),

or

    e(x) = \frac{1}{3!} f^{(3)}(\xi)(x - x_0)^3    (3.64)

for some \xi \in (x_0, x). Thus

    f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{1}{2!} f^{(2)}(x_0)(x - x_0)^2 + e(x).    (3.65)

Analogously to (3.54), if |f^{(3)}(t)| \le M_3 for t \in (x_0, x), then we have the error
bound

    |e(x)| \le \frac{1}{3!} M_3 |x - x_0|^3.    (3.66)
We have gone from a linear approximation to f(x) to a quadratic approximation
to f(x). All of this suggests that we may generalize to a degree n polynomial
approximation to f(x). Therefore, we define

    p_n(x) = \sum_{k=0}^{n} p_{n,k}(x - x_0)^k,    (3.67)

where

    p_{n,k} = \frac{1}{k!} f^{(k)}(x_0).    (3.68)

Then

    f(x) = p_n(x) + e_{n+1}(x),    (3.69)

where the error term is

    e_{n+1}(x) = \frac{1}{(n+1)!} f^{(n+1)}(\xi)(x - x_0)^{n+1}    (3.70)

for suitable \xi \in (x_0, x). We call p_n(x) the Taylor polynomial of degree n. This
polynomial is the approximation to f(x), and the error e_{n+1}(x) in (3.70) can be
formally obtained by the repeated application of Corollary 3.1. These details are
omitted. Expanding (3.69), we obtain

    f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{1}{2!} f^{(2)}(x_0)(x - x_0)^2
           + \cdots + \frac{1}{n!} f^{(n)}(x_0)(x - x_0)^n + \frac{1}{(n+1)!} f^{(n+1)}(\xi)(x - x_0)^{n+1},    (3.71)

which is the familiar Taylor formula for f(x). We remark that

    f^{(k)}(x_0) = p_n^{(k)}(x_0)    (3.72)

for k = 0, 1, \ldots, n - 1, n. So we emphasize that the approximation p_n(x) to f(x)
is based on forcing p_n(x) to match the first n derivatives of f(x), as well as
enforcing p_n(x_0) = f(x_0). If |f^{(n+1)}(t)| \le M_{n+1} for all t \in I [interval I contains
(x_0, x)], then

    |e_{n+1}(x)| \le \frac{1}{(n+1)!} M_{n+1} |x - x_0|^{n+1}.    (3.73)

If all derivatives of f(x) exist and are continuous, then we have the Taylor series
expansion of f(x), namely, the infinite series

    f(x) = \sum_{k=0}^{\infty} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k.    (3.74)

The Maclaurin series expansion is a special case of (3.74) for x_0 = 0:

    f(x) = \sum_{k=0}^{\infty} \frac{1}{k!} f^{(k)}(0) x^k.    (3.75)

If we retain only terms k = 0 to k = n in the infinite series (3.74) and (3.75), we
know that e_{n+1}(x) gives the error in the resulting approximation. This error may
be called the truncation error (since it arises from truncation of the infinite series
to a finite number of terms). Now we consider some examples.
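The Taylor polynomial (3.67)-(3.68) and the bound (3.73) can be sketched in a few lines of Python (the book's examples use MATLAB; `taylor_poly` is our own name, and the test case f(x) = e^x with M_{n+1} = e^{1/2} on [0, 1/2] is our choice):

```python
import math

def taylor_poly(derivs_at_x0, x0, x):
    """Evaluate p_n(x) = sum_k f^(k)(x0)/k! (x - x0)^k, given the list
    [f(x0), f'(x0), ..., f^(n)(x0)] of derivatives at x0; cf. (3.67)-(3.68)."""
    return sum(d / math.factorial(k) * (x - x0) ** k
               for k, d in enumerate(derivs_at_x0))

# f(x) = e^x about x0 = 0: every derivative equals 1 at x0.
n = 5
p = taylor_poly([1.0] * (n + 1), 0.0, 0.5)
# Truncation error bound (3.73) with M_{n+1} = e^{0.5} on [0, 0.5]:
bound = math.exp(0.5) * 0.5 ** (n + 1) / math.factorial(n + 1)
assert abs(math.exp(0.5) - p) <= bound
```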
First recall the binomial theorem

    (a + x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k a^{n-k},    (3.76)

where

    \binom{n}{k} = \frac{n!}{k!(n-k)!}.    (3.77)

In (3.76) we emphasize that n \in Z^+. But we can use Taylor's formula to obtain an
expression for (a + x)^\alpha when \alpha \neq 0, and \alpha is not necessarily an element of Z^+.
Let us consider the special case

    f(x) = (1 + x)^\alpha,

for which (if k \ge 1)

    f^{(k)}(x) = \alpha(\alpha - 1)(\alpha - 2) \cdots (\alpha - k + 1)(1 + x)^{\alpha - k}.    (3.78)

These derivatives are guaranteed to exist, provided x > -1. We will assume this
restriction always applies. So, in particular,

    f^{(k)}(0) = \alpha(\alpha - 1)(\alpha - 2) \cdots (\alpha - k + 1),    (3.79)

giving the Maclaurin expansion

    (1 + x)^\alpha = 1 + \sum_{k=1}^{n} \frac{1}{k!}[\alpha(\alpha - 1) \cdots (\alpha - k + 1)] x^k
                   + \frac{1}{(n+1)!}[\alpha(\alpha - 1) \cdots (\alpha - n)](1 + \xi)^{\alpha - n - 1} x^{n+1}    (3.80)

for some \xi \in (0, x). We may extend the definition (3.77); that is, define

    \binom{\alpha}{0} = 1,   \binom{\alpha}{k} = \frac{\alpha(\alpha - 1) \cdots (\alpha - k + 1)}{k!}  (k \ge 1),    (3.81)

so that (3.80) becomes

    (1 + x)^\alpha = \underbrace{\sum_{k=0}^{n} \binom{\alpha}{k} x^k}_{= p_n(x)}
                   + \underbrace{\binom{\alpha}{n+1}(1 + \xi)^{\alpha - n - 1} x^{n+1}}_{= e_{n+1}(x)}    (3.82)

for x > -1.
Example 3.8  We wish to compute [1.03]^{1/3} with n = 2 in (3.82), and to
estimate the error involved in doing so. We have x = 0.03, \alpha = \tfrac{1}{3}, and \xi \in (0, 0.03).
Therefore, from (3.82), [1 + x]^{1/3} is approximated by the Taylor polynomial

    p_2(x) = 1 + \binom{1/3}{1} x + \binom{1/3}{2} x^2 = 1 + \frac{1}{3}x - \frac{1}{9}x^2,

so

    [1.03]^{1/3} \approx p_2(0.03) = 1.009900000,

but [1.03]^{1/3} = 1.009901634, so e_3(x) = 1.634 \times 10^{-6}. From (3.82)

    e_3(x) = \binom{1/3}{3}(1 + \xi)^{-8/3} x^3 = \frac{5}{81}(1 + \xi)^{-8/3} x^3,

and so e_3(0.03) = \frac{5}{81}(1 + \xi)^{-8/3}(0.03)^3 = \frac{5}{3} \times 10^{-6}(1 + \xi)^{-8/3}. Since 0 < \xi < 0.03,
we have

    1.5403 \times 10^{-6} < e_3(0.03) < 1.6667 \times 10^{-6}.

The actual error is certainly within this range.
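Example 3.8 can be replayed numerically. A Python sketch (the book's own examples use MATLAB; `binom_general` and `binom_series` are our names for the generalized binomial coefficient (3.81) and the partial sum in (3.82)):

```python
import math

def binom_general(alpha, k):
    """Generalized binomial coefficient (3.81)."""
    out = 1.0
    for j in range(k):
        out *= (alpha - j) / (j + 1)
    return out

def binom_series(alpha, x, n):
    """Partial sum p_n(x) of (1 + x)^alpha from (3.82)."""
    return sum(binom_general(alpha, k) * x ** k for k in range(n + 1))

# Example 3.8: [1.03]^(1/3) with n = 2.
p2 = binom_series(1.0 / 3.0, 0.03, 2)
err = 1.03 ** (1.0 / 3.0) - p2
# err should fall inside the interval (1.5403e-6, 1.6667e-6).
assert 1.54e-6 < err < 1.67e-6
```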
If f(x) = \frac{1}{1+x}, and if x_0 = 0, then

    \frac{1}{1+x} = \underbrace{\sum_{k=0}^{n} (-1)^k x^k}_{= p_n(x)} + \underbrace{\frac{(-1)^{n+1} x^{n+1}}{1+x}}_{= r(x)}.    (3.83)

This may be seen by recalling that

    \sum_{k=0}^{n} a^k = \frac{1 - a^{n+1}}{1 - a}  (a \neq 1),    (3.84)

so \sum_{k=0}^{n} (-1)^k x^k = \frac{1 - (-1)^{n+1} x^{n+1}}{1 + x}, and thus

    \sum_{k=0}^{n} (-1)^k x^k + \frac{(-1)^{n+1} x^{n+1}}{1+x}
      = \frac{1 - (-1)^{n+1} x^{n+1}}{1+x} + \frac{(-1)^{n+1} x^{n+1}}{1+x} = \frac{1}{1+x}.

This confirms (3.83). We observe that the remainder term r(x) in (3.83) is not given
by e_{n+1}(x) in (3.82). We have obtained an exact expression for the remainder using
elementary methods.
Now, from (3.83), we have

    \frac{1}{1+t} = 1 - t + t^2 - t^3 + \cdots + (-1)^{n-1} t^{n-1} + \frac{(-1)^n t^n}{1+t},

and we see immediately that

    \log_e(1 + x) = \int_0^x \frac{dt}{1+t}
      = x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \cdots + \frac{(-1)^{n-1}}{n} x^n
        + \underbrace{(-1)^n \int_0^x \frac{t^n}{1+t}\, dt}_{= r(x)}.    (3.85)

For x > 0 and 0 < t < x we have \frac{1}{1+t} < 1, implying that

    0 < \int_0^x \frac{t^n}{1+t}\, dt < \int_0^x t^n\, dt = \frac{x^{n+1}}{n+1}  (x > 0).

For -1 < x < 0 with x < t < 0, we have

    \frac{1}{1+t} \le \frac{1}{1+x} = \frac{1}{1-|x|},

so

    \left| \int_0^x \frac{t^n}{1+t}\, dt \right| \le \frac{1}{1-|x|} \left| \int_0^x t^n\, dt \right| = \frac{|x|^{n+1}}{(1-|x|)(n+1)}.

Consequently, we may conclude that

    |r(x)| \le \begin{cases} \dfrac{x^{n+1}}{n+1},            & x > 0, \\ \dfrac{|x|^{n+1}}{(1-|x|)(n+1)}, & -1 < x < 0. \end{cases}    (3.86)

Equation (3.85) gives us a means to compute logarithms, and (3.86) gives us a
bound on the error.
Now consider (3.83) with x replaced by x^2:

    \frac{1}{1+x^2} = \sum_{k=0}^{n} (-1)^k x^{2k} + \frac{(-1)^{n+1} x^{2n+2}}{1+x^2}.    (3.87)

Replacing n with n - 1, replacing x with t, and expanding, this becomes

    \frac{1}{1+t^2} = 1 - t^2 + t^4 - \cdots + (-1)^{n-1} t^{2n-2} + \frac{(-1)^n t^{2n}}{1+t^2},

where, on integrating, we obtain

    \tan^{-1} x = \int_0^x \frac{dt}{1+t^2}
      = x - \frac{1}{3}x^3 + \frac{1}{5}x^5 - \frac{1}{7}x^7 + \cdots + \frac{(-1)^{n-1}}{2n-1} x^{2n-1}
        + \underbrace{(-1)^n \int_0^x \frac{t^{2n}}{1+t^2}\, dt}_{= r(x)}.    (3.88)

Because \frac{1}{1+t^2} \le 1 for all t \in R, it follows that

    |r(x)| \le \frac{|x|^{2n+1}}{2n+1}.    (3.89)
We now have a method of computing \pi. Since \frac{\pi}{4} = \tan^{-1}(1), we have

    \frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots + \frac{(-1)^{n-1}}{2n-1} + r(1),    (3.90)

and

    |r(1)| \le \frac{1}{2n+1}.    (3.91)

Using (3.90) to compute \pi is not efficient with respect to the number of arithmetic
operations needed (i.e., it is not computationally efficient). This is because to
achieve an accuracy of about 1/n requires about n/2 terms in the series [which
follows from (3.91)]. However, if x is small (i.e., close to zero), then the series (3.88)
converges relatively quickly. Observe that

    \tan^{-1}\left(\frac{x + y}{1 - xy}\right) = \tan^{-1} x + \tan^{-1} y.    (3.92)

Suppose that x = \frac{1}{2} and y = \frac{1}{3}; then \frac{x+y}{1-xy} = 1, so

    \frac{\pi}{4} = \tan^{-1}\left(\frac{1}{2}\right) + \tan^{-1}\left(\frac{1}{3}\right).    (3.93)

It is actually faster to compute \tan^{-1}(\frac{1}{2}) and \tan^{-1}(\frac{1}{3}) using (3.88), and from these
obtain \pi using (3.93), than to compute \tan^{-1}(1) directly. In fact, this approach (a
type of "divide and conquer" method) can be taken further by noting that

    \tan^{-1}\left(\frac{1}{2}\right) = \tan^{-1}\left(\frac{1}{3}\right) + \tan^{-1}\left(\frac{1}{7}\right),
    \tan^{-1}\left(\frac{1}{3}\right) = \tan^{-1}\left(\frac{1}{5}\right) + \tan^{-1}\left(\frac{1}{8}\right),

implying that

    \frac{\pi}{4} = 2\tan^{-1}\left(\frac{1}{3}\right) + \tan^{-1}\left(\frac{1}{7}\right)
                = 2\tan^{-1}\left(\frac{1}{5}\right) + 2\tan^{-1}\left(\frac{1}{8}\right) + \tan^{-1}\left(\frac{1}{7}\right).    (3.94)
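The payoff of the divide-and-conquer identity (3.94) is easy to see in code. A Python sketch (MATLAB is the book's own language; `arctan_series` and `pi_estimate` are our names):

```python
import math

def arctan_series(x, n):
    """Partial sum of (3.88): x - x^3/3 + ... + (-1)^(n-1) x^(2n-1)/(2n-1)."""
    return sum((-1) ** (k - 1) * x ** (2 * k - 1) / (2 * k - 1)
               for k in range(1, n + 1))

def pi_estimate(n):
    """pi via pi/4 = 2 atan(1/5) + 2 atan(1/8) + atan(1/7), from (3.94)."""
    return 4.0 * (2.0 * arctan_series(1.0 / 5.0, n)
                  + 2.0 * arctan_series(1.0 / 8.0, n)
                  + arctan_series(1.0 / 7.0, n))

# Ten terms per arctangent already give nearly full double precision,
# whereas (3.90) would need millions of terms for comparable accuracy.
assert abs(pi_estimate(10) - math.pi) < 1e-13
```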
Now consider f(x) = e^x. Since f^{(k)}(x) = e^x for all k \in Z^+, we have for x_0 = 0
the Maclaurin series expansion

    e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}.    (3.95)

This is theoretically valid for -\infty < x < \infty. We have employed this series before
in various ways. We now consider it as a computational tool for calculating e^x.

Appendix 3.C is based on a famous example in Forsythe et al. [21, pp. 14-16].
This example shows that series expansions must be implemented on computers
with rather great care. Specifically, Appendix 3.C shows what can happen when we
compute e^{-20} by the direct implementation of the series (3.95). Using MATLAB
as stated, e^{-20} \approx 4.1736 \times 10^{-9}, which is based on keeping terms k = 0 to k = 88
(inclusive) of (3.95). Using additional terms will have no effect on the final answer,
as they are too small. However, the correct value is actually e^{-20} = 2.0612 \times 10^{-9},
as may be verified using the MATLAB exponential function, or using a typical
pocket calculator. Our series approximation has resulted in an answer possessing
no significant digits at all. What went wrong? Many of the terms in the series
are orders of magnitude bigger than the final result and typically possess rounding
errors about as big as the final answer. The phenomenon is called catastrophic
cancellation (or catastrophic convergence). As Forsythe et al. [21] stated, "It is
important to realize that this great cancellation is not the cause of error in the
answer; it merely magnifies the error already present in the terms." Catastrophic
cancellation can in principle be eliminated by carrying more significant digits in the
computation. However, this is costly with respect to computing resources. In the
present problem a cheap and very simple solution is to compute e^{20} using (3.95),
and then take the reciprocal, i.e., use e^{-20} = 1/e^{20}.
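The same experiment is easily reproduced in any double-precision environment; here is a Python sketch (the original example runs in MATLAB; `exp_series` is our own name, and the exact accuracy of the bad answer depends on summation order):

```python
import math

def exp_series(x, kmax=120):
    """Direct summation of the Maclaurin series (3.95) in double precision."""
    total, term = 0.0, 1.0
    for k in range(kmax):
        total += term
        term *= x / (k + 1)
    return total

direct = exp_series(-20.0)          # catastrophic cancellation
via_recip = 1.0 / exp_series(20.0)  # all terms positive: no cancellation
true = math.exp(-20.0)              # about 2.0612e-9

# The direct sum has essentially no correct digits; the reciprocal is fine.
assert abs(direct - true) / true > 1e-5
assert abs(via_recip - true) / true < 1e-12
```

The largest term in the alternating sum is about 4.3 \times 10^7, so a relative rounding error of 10^{-16} in that term alone is already of the same order as the true answer.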
An important special function is the gamma function:

    \Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\, dx.    (3.96)

Here, we assume z \in R. This is an improper integral, so we are left to wonder if

    \lim_{M \to \infty} \int_0^M x^{z-1} e^{-x}\, dx

exists. It turns out that the integral (3.96) converges for z > 0, but diverges for
z \le 0. The proof is slightly tedious, and so we will omit it [22, pp. 273-274]. If
z = n \in N, then consider

    \Gamma(n) = \int_0^\infty x^{n-1} e^{-x}\, dx.    (3.97)

Now

    \Gamma(n+1) = \int_0^\infty x^n e^{-x}\, dx = \lim_{M \to \infty} \int_0^M x^n e^{-x}\, dx
                = \lim_{M \to \infty}\left[ -x^n e^{-x} \Big|_0^M + n \int_0^M x^{n-1} e^{-x}\, dx \right]

(via \int u\, dv = uv - \int v\, du, i.e., integration by parts). Therefore

    \Gamma(n+1) = n \int_0^\infty x^{n-1} e^{-x}\, dx = n\Gamma(n).    (3.98)
TAYLOR SERIES 91
We see that T(l) = f™ e~ x dx = [-e~ x ]™ = 1. Thus, T(n + 1) = re!. Using the
gamma function in combination with (3.95), we may obtain Stirling' s formula
re! « ^/2nn n+1/2 e- n (3.99)
which is a good approximation to re ! if re is big. The details of a rigorous derivation
of this are tedious, so we give only an outline presentation. Begin by noting that
/•OO rOO
re!= / x n e~ x dx = \ e nXax - x dx.
Jo Jo
Let x — n + y, so
/OO
e nXain+y) - y dy.
-n
Now, since re ln(re + j) = n In [re (l + ^)] = re lnre + re In (l + ^), we have
/oo
e " ln "+" ]n (i+5)-yrfy
-n
/oo
e „ ln( i + Z)_ y ^
-«
Using (3.85), that is
(i + Z) = Z_2l + Z_...,
V re/ re 2re 2 3re 3
In
we have
rein (l + -)->/ = - —
2 3
r , r
2re 3n 2
so
re!=re"e-" /
If now y = *Jnv, then dy = *Jndv, and so
/oo
So if re is big, then
OO y 2 ^3
OO 1,2 w 3
re!=re" + Je"" / e~~ + ^~'" dv.
■I.
OO 2
re! % re" + 2e " / e ? dv
TLFeBOOK
92 SEQUENCES AND SERIES
If we accept that
/
j — (
then immediately we have
~ x2/2 dx = Jhi,
/^ n-\- ~ —n
'2jtn 2g "
(3.100)
and the formula is now established. Stirling's formula is very useful in statistical
mechanics (e.g., deriving the Fermi-Dirac distribution of fermion particle energies,
and this in turn is important in understanding the operation of solid-state electronic
devices at a physical level).
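How good is (3.99) in practice? A quick Python check (the book's own examples use MATLAB; the 1/(12n) behavior of the relative error quoted in the comment is a standard refinement of Stirling's formula, not derived above):

```python
import math

def stirling(n):
    """Stirling's approximation (3.99): n! ~ sqrt(2 pi) n^(n + 1/2) e^(-n)."""
    return math.sqrt(2.0 * math.pi) * n ** (n + 0.5) * math.exp(-n)

# The relative error decays roughly like 1/(12 n):
for n in (5, 10, 20):
    rel = abs(math.factorial(n) - stirling(n)) / math.factorial(n)
    assert rel < 1.0 / (12.0 * n)
```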
Another important special function is

    g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - m)^2}{2\sigma^2}\right],  -\infty < x < \infty,    (3.101)

which is the Gaussian function (or Gaussian pulse). This function is of immense
importance in probability theory [20], and is also involved in the uncertainty
principle in signal processing and quantum mechanics [23]. A sketch of g(x) for m = 0,
with \sigma^2 = 1 and \sigma^2 = 0.1, appears in Fig. 3.6. For m = 0 and \sigma^2 = 1, the standard
form pulse

    f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}    (3.102)

is sometimes defined [20]. In this case we observe that g(x) = \frac{1}{\sigma} f\left(\frac{x - m}{\sigma}\right). We will
show that

    \int_0^\infty e^{-x^2}\, dx = \frac{1}{2}\sqrt{\pi},    (3.103)

which can be used to obtain (3.100) by a simple change of variable. From [22]
(p. 262) we have
Figure 3.6  Plots of two Gaussian pulses: g(x) in (3.101) for m = 0, with \sigma^2 = 1, 0.1.
Theorem 3.5:  Let \lim_{x \to \infty} x^p f(x) = A. Then

1. \int_a^\infty f(x)\, dx converges if p > 1 and -\infty < A < \infty.
2. \int_a^\infty f(x)\, dx diverges if p \le 1 and A \neq 0 (A may be infinite).

We see that \lim_{x \to \infty} x^2 e^{-x^2} = 0 (perhaps via l'Hopital's rule). So in
Theorem 3.5, f(x) = e^{-x^2} and p = 2, with A = 0, and so \int_0^\infty e^{-x^2}\, dx converges.
Define

    I_M = \int_0^M e^{-x^2}\, dx = \int_0^M e^{-y^2}\, dy,

and let \lim_{M \to \infty} I_M = I. Then

    I_M^2 = \iint_{R_M} e^{-(x^2 + y^2)}\, dx\, dy,

for which R_M is the square region in Fig. 3.7. This square has sides of length M.
Since e^{-(x^2 + y^2)} > 0, we obtain

    \iint_{R_I} e^{-(x^2 + y^2)}\, dx\, dy \le I_M^2 \le \iint_{R_{II}} e^{-(x^2 + y^2)}\, dx\, dy,    (3.104)
Figure 3.7  Regions used to establish (3.103).
where R_I is the region in the first quadrant bounded by a circle of radius M.
Similarly, R_{II} is the region in the first quadrant bounded by a circle of radius \sqrt{2}M.
Using polar coordinates, r^2 = x^2 + y^2 and dx\, dy = r\, dr\, d\phi, so (3.104) becomes

    \int_{\phi=0}^{\pi/2} \int_{r=0}^{M} e^{-r^2} r\, dr\, d\phi \le I_M^2 \le \int_{\phi=0}^{\pi/2} \int_{r=0}^{\sqrt{2}M} e^{-r^2} r\, dr\, d\phi.    (3.105)

Since \frac{d}{dx}e^{-x^2} = -2x e^{-x^2}, we have \int_0^M r e^{-r^2}\, dr = -\frac{1}{2}\left[e^{-r^2}\right]_0^M = \frac{1}{2}[1 - e^{-M^2}].
Thus, (3.105) reduces to

    \frac{\pi}{4}[1 - e^{-M^2}] \le I_M^2 \le \frac{\pi}{4}[1 - e^{-2M^2}].    (3.106)

If we now allow M \to \infty in (3.106), then I_M^2 \to \frac{\pi}{4}, implying that I^2 = \frac{\pi}{4}, or
I = \frac{1}{2}\sqrt{\pi}. This confirms (3.103).
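The limit (3.103) can also be checked by simple quadrature. A Python sketch (the book's examples use MATLAB; `gauss_integral` is our own name, and the trapezoid rule with a fixed step count is our choice of method):

```python
import math

def gauss_integral(M, steps=100000):
    """Trapezoid approximation of I_M = int_0^M e^(-x^2) dx."""
    h = M / steps
    total = 0.5 * (1.0 + math.exp(-M * M))  # endpoint contributions
    for i in range(1, steps):
        x = i * h
        total += math.exp(-x * x)
    return total * h

# I_M -> sqrt(pi)/2 as M grows, per (3.103) and (3.106):
assert abs(gauss_integral(6.0) - 0.5 * math.sqrt(math.pi)) < 1e-8
```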
In probability theory it is quite important to be able to compute functions such
as the error function

    erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt.    (3.107)

This has wide application in digital communications system analysis, for example.
No closed-form^4 expression for (3.107) exists. We may therefore try to compute
(3.107) using series expansions. In particular, we may try working with the
Maclaurin series expansion for e^x:

    erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x \left[\sum_{k=0}^{\infty} \frac{(-1)^k t^{2k}}{k!}\right] dt
           = \frac{2}{\sqrt{\pi}} \sum_{k=0}^{\infty} \frac{(-1)^k}{k!} \int_0^x t^{2k}\, dt
           = \frac{2}{\sqrt{\pi}} \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{k!(2k+1)}.    (3.108)

However, to arrive at this expression, we had to integrate an infinite series term
by term. It is not obvious that we can do this. When is this justified?

A power series is any series of the form

    f(x) = \sum_{k=0}^{\infty} a_k (x - x_0)^k.    (3.109)

Clearly, Taylor and Maclaurin series are all examples of power series. We have the
following theorem.

^4 A closed-form expression is simply a "nice" formula typically involving more familiar functions, such
as sines, cosines, tangents, polynomials, and exponential functions.
Theorem 3.6:  Given the power series (3.109), there is an R \ge 0 (which may
be R = +\infty) such that the series is absolutely convergent for |x - x_0| < R, and is
divergent for |x - x_0| > R. At x - x_0 = R and at x - x_0 = -R, the series might
converge or diverge.

Series (3.109) is absolutely convergent if the series

    h(x) = \sum_{k=0}^{\infty} |a_k (x - x_0)^k|    (3.110)

converges. We remark that absolutely convergent series are convergent. This means
that if (3.110) converges, then (3.109) also converges. (However, the converse is
not necessarily true.) We also have the following theorem.

Theorem 3.7:  If

    f(x) = \sum_{k=0}^{\infty} a_k (x - x_0)^k  for |x - x_0| < R,

where R > 0 is the radius of convergence of the power series, then f(x) is
continuous and differentiable in the interval of convergence x \in (x_0 - R, x_0 + R),
and

    f^{(1)}(x) = \sum_{k=1}^{\infty} k a_k (x - x_0)^{k-1},    (3.111a)

    \int_{x_0}^{x} f(t)\, dt = \sum_{k=0}^{\infty} \frac{a_k}{k+1} (x - x_0)^{k+1}.    (3.111b)

The series in (3.111a,b) also have radius of convergence R.

As a consequence of Theorem 3.7, Eq. (3.108) is valid for -\infty < x < \infty (i.e., the
radius of convergence is R = +\infty). This is because the Maclaurin expansion for
e^x has R = +\infty.
Example 3.9  Here we will find an expression for the error involved in
truncating the series for erf(x) in (3.108).

From (3.71), for some \xi \in [0, x] (interval endpoints may be included because of
the continuity of the function being approximated),

    e^x = \underbrace{\sum_{k=0}^{n} \frac{x^k}{k!}}_{= p_n(x)} + e_n(x),

where

    e_n(x) = \frac{e^{\xi}}{(n+1)!} x^{n+1}.

Now replace x with -t^2, so that for some \xi with -t^2 \le \xi \le 0

    e^{-t^2} = p_n(-t^2) + e_n(-t^2),

and hence

    erf(x) = \underbrace{\frac{2}{\sqrt{\pi}} \int_0^x p_n(-t^2)\, dt}_{= q_n(x)}
           + \underbrace{\frac{2}{\sqrt{\pi}} \int_0^x e_n(-t^2)\, dt}_{= \epsilon_n(x)},

where the polynomial (of degree 2n + 1 in x)

    q_n(x) = \frac{2}{\sqrt{\pi}} \int_0^x \left[\sum_{k=0}^{n} \frac{(-1)^k t^{2k}}{k!}\right] dt
           = \frac{2}{\sqrt{\pi}} \sum_{k=0}^{n} \frac{(-1)^k x^{2k+1}}{k!(2k+1)}

is the approximation, and we are interested in the error

    \epsilon_n(x) = erf(x) - q_n(x) = \frac{2}{\sqrt{\pi}} \int_0^x e_n(-t^2)\, dt.

Clearly

    \epsilon_n(x) = \frac{2}{\sqrt{\pi}} \frac{(-1)^{n+1}}{(n+1)!} \int_0^x t^{2n+2} e^{\xi}\, dt,

where we recall that \xi depends on t in that -t^2 \le \xi \le 0. There is an integral
mean-value theorem, which states that for f(t), g(t) \in C[a, b] (where g(t) does not
change sign on the interval [a, b]) there is a \zeta \in [a, b] such that

    \int_a^b g(t) f(t)\, dt = f(\zeta) \int_a^b g(t)\, dt.

Thus, there is a \zeta \in [-x^2, 0], giving

    \epsilon_n(x) = \frac{2}{\sqrt{\pi}} e^{\zeta} \frac{(-1)^{n+1} x^{2n+3}}{(n+1)!(2n+3)}.

Naturally, the error expression in Example 3.9 can be used to estimate how many
terms one must keep in the series expansion (3.108) in order to compute erf(x) to
a desired accuracy.
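A Python sketch of (3.108) together with the error expression of Example 3.9 (the book's own examples use MATLAB; `erf_series` and `erf_error_bound` are our names, and the bound uses e^{\zeta} \le 1 since \zeta \le 0):

```python
import math

def erf_series(x, n):
    """Partial sum q_n(x) of (3.108)."""
    s = sum((-1) ** k * x ** (2 * k + 1) / (math.factorial(k) * (2 * k + 1))
            for k in range(n + 1))
    return 2.0 / math.sqrt(math.pi) * s

def erf_error_bound(x, n):
    """Bound on |eps_n(x)| from Example 3.9, using e^zeta <= 1 (zeta <= 0)."""
    return (2.0 / math.sqrt(math.pi)) * x ** (2 * n + 3) / (
        math.factorial(n + 1) * (2 * n + 3))

# The actual truncation error respects the bound at x = 1:
for n in (2, 5, 8):
    err = abs(math.erf(1.0) - erf_series(1.0, n))
    assert err <= erf_error_bound(1.0, n)
```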
3.6 ASYMPTOTIC SERIES
The Taylor series expansions of Section 3.5 might have a large radius of conver-
gence, but practically speaking, if x is sufficiently far from xq, then many many
terms may be needed in a computer implementation to converge to the correct
solution with adequate accuracy. This is highly inefficient. Also, if many terms
are to be retained, then rounding errors might accumulate and destroy the result.
In other words, Taylor series approximations are really effective only for x suf-
ficiently close to xq (i.e., "small x"). We therefore seek expansion methods that
give good approximations for large values of the argument x. These are called the
asymptotic expansions, or asymptotic series. This section is just a quick introduction
based mainly on Section 19.15 in Kreyszig [24]. Another source of information on
asymptotic expansions, although applied mainly to problems involving differential
equations, appears in Lakin and Sanchez [25].
Asymptotic expansions may take on different forms. That is, there are different
"varieties" of such expansions. (This is apparent in Ref. 25.) However, we will
focus on the following definition.
Definition 3.5:  A series of the form

    \sum_{k=0}^{\infty} \frac{c_k}{x^k}    (3.112)

for which c_k \in R (real-valued constants), and x \in R, is called an asymptotic
expansion, or asymptotic series, of a function f(x), which is defined for all sufficiently
large x, if for every n \in Z^+

    x^n \left[ f(x) - \sum_{k=0}^{n} \frac{c_k}{x^k} \right] \to 0  as x \to \infty,    (3.113)

and we shall then write

    f(x) \sim \sum_{k=0}^{\infty} \frac{c_k}{x^k}.

It is to be emphasized that the series (3.112) need not converge for any x. The
condition (3.113) suggests a possible method of finding the sequence (c_k). Specifically,

    f(x) - c_0 \to 0  or  c_0 = \lim_{x \to \infty} f(x),

    \left[ f(x) - c_0 - \frac{c_1}{x} \right] x \to 0  or  c_1 = \lim_{x \to \infty} [f(x) - c_0]\, x,

    \left[ f(x) - c_0 - \frac{c_1}{x} - \frac{c_2}{x^2} \right] x^2 \to 0  or  c_2 = \lim_{x \to \infty} \left[ f(x) - c_0 - \frac{c_1}{x} \right] x^2,
or in general

    c_n = \lim_{x \to \infty} \left[ f(x) - \sum_{k=0}^{n-1} \frac{c_k}{x^k} \right] x^n    (3.114)

for n \ge 1. However, this recursive procedure is seldom practical for generating
more than the first few series coefficients. Of course, in some cases this might
be all that is needed. We remark that Definition 3.5 can be usefully extended
according to

    f(x) \sim g(x) + h(x) \sum_{k=0}^{\infty} \frac{c_k}{x^k},    (3.115)

for which

    \frac{f(x) - g(x)}{h(x)} \sim \sum_{k=0}^{\infty} \frac{c_k}{x^k}.    (3.116)

The single most generally useful method for getting (c_k) is probably to use
integration by parts. This is illustrated with examples.
Example 3.10  Recall erf(x) from (3.107). We would like to evaluate this
function for large x [whereas the series in (3.108) is better suited for small x; see the
error expression in Example 3.9]. In this regard it is preferable to work with the
complementary error function

    erfc(x) = 1 - erf(x) = \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-t^2}\, dt.    (3.117)

We observe that erf(\infty) = 1 [via (3.103)]. Now let \tau = t^2, so that dt = \frac{1}{2}\tau^{-1/2}\, d\tau.
With this change of variable,

    erfc(x) = \frac{1}{\sqrt{\pi}} \int_{x^2}^{\infty} \tau^{-1/2} e^{-\tau}\, d\tau.    (3.118)

Now observe that, via integration by parts,

    \int_{x^2}^{\infty} \tau^{-1/2} e^{-\tau}\, d\tau
      = -\tau^{-1/2} e^{-\tau} \Big|_{x^2}^{\infty} - \frac{1}{2} \int_{x^2}^{\infty} \tau^{-3/2} e^{-\tau}\, d\tau
      = \frac{1}{x} e^{-x^2} - \frac{1}{2} \int_{x^2}^{\infty} \tau^{-3/2} e^{-\tau}\, d\tau,

    \int_{x^2}^{\infty} \tau^{-3/2} e^{-\tau}\, d\tau
      = \frac{1}{x^3} e^{-x^2} - \frac{3}{2} \int_{x^2}^{\infty} \tau^{-5/2} e^{-\tau}\, d\tau,

and so on. We observe that this process of successive integration by parts
generates integrals of the form

    F_n(x) = \int_{x^2}^{\infty} \tau^{-(2n+1)/2} e^{-\tau}\, d\tau    (3.119)

for n \in Z^+, and we see that erfc(x) = \frac{1}{\sqrt{\pi}} F_0(x). So, if we apply integration by
parts to F_n(x), then

    F_n(x) = \int_{x^2}^{\infty} \tau^{-(2n+1)/2} e^{-\tau}\, d\tau
           = -\tau^{-(2n+1)/2} e^{-\tau} \Big|_{x^2}^{\infty} - \frac{2n+1}{2} \int_{x^2}^{\infty} \tau^{-(2n+3)/2} e^{-\tau}\, d\tau,

so that we have the recursive expression

    F_n(x) = \frac{e^{-x^2}}{x^{2n+1}} - \frac{2n+1}{2} F_{n+1}(x),    (3.120)

which holds for n \in Z^+. This may be rewritten as

    e^{x^2} F_n(x) = \frac{1}{x^{2n+1}} - \frac{2n+1}{2} e^{x^2} F_{n+1}(x).    (3.121)

Repeated application of (3.121) yields

    e^{x^2} F_0(x) = \frac{1}{x} - \frac{1}{2} e^{x^2} F_1(x),

    e^{x^2} F_0(x) = \frac{1}{x} - \frac{1}{2x^3} + \frac{1 \cdot 3}{2^2} e^{x^2} F_2(x),

and so finally

    e^{x^2} F_0(x) = \underbrace{\frac{1}{x} - \frac{1}{2x^3} + \frac{1 \cdot 3}{2^2 x^5} - \cdots
                     + (-1)^{n-1} \frac{1 \cdot 3 \cdots (2n-3)}{2^{n-1} x^{2n-1}}}_{= S_{2n-1}(x)}
                     + (-1)^n \frac{1 \cdot 3 \cdots (2n-1)}{2^n} e^{x^2} F_n(x).    (3.122)
From this it appears that our asymptotic expansion is

    e^{x^2} F_0(x) \sim \frac{1}{x} - \frac{1}{2x^3} + \frac{1 \cdot 3}{2^2 x^5} - \cdots
                    + (-1)^{n-1} \frac{1 \cdot 3 \cdots (2n-3)}{2^{n-1} x^{2n-1}} + \cdots.    (3.123)

However, this requires confirmation. Define K_n = (-2)^{-n}[1 \cdot 3 \cdots (2n-1)]. From
(3.122), we have

    [e^{x^2} F_0(x) - S_{2n-1}(x)] x^{2n-1} = K_n e^{x^2} x^{2n-1} F_n(x).    (3.124)

We wish to show that for any fixed n = 1, 2, 3, \ldots the expression in (3.124) on
the right of the equality goes to zero as x \to \infty. In (3.119) we have

    \frac{1}{\tau^{(2n+1)/2}} \le \frac{1}{x^{2n+1}}  for all \tau \ge x^2,

which gives the bound

    F_n(x) \le \frac{1}{x^{2n+1}} \int_{x^2}^{\infty} e^{-\tau}\, d\tau = \frac{e^{-x^2}}{x^{2n+1}}.    (3.125)

But this implies that

    |K_n| e^{x^2} x^{2n-1} F_n(x) \le |K_n| e^{x^2} x^{2n-1} \frac{e^{-x^2}}{x^{2n+1}} = \frac{|K_n|}{x^2},

and \frac{|K_n|}{x^2} \to 0 for x \to \infty. Thus, immediately, (3.123) is indeed the asymptotic
expansion for e^{x^2} F_0(x). Hence

    erfc(x) \sim \frac{e^{-x^2}}{\sqrt{\pi}} \left[ \frac{1}{x} - \frac{1}{2x^3} + \frac{1 \cdot 3}{2^2 x^5} - \cdots
             + (-1)^{n-1} \frac{1 \cdot 3 \cdots (2n-3)}{2^{n-1} x^{2n-1}} + \cdots \right].    (3.126)
We recall from Section 3.4 that the integral \int_0^x \frac{\sin t}{t}\, dt was important in
analyzing the Gibbs phenomenon in Fourier series expansions. We now consider
asymptotic approximations to this integral.

Example 3.11  The sine integral is

    Si(x) = \int_0^x \frac{\sin t}{t}\, dt,    (3.127)

and the complementary sine integral is

    si(x) = \int_x^{\infty} \frac{\sin t}{t}\, dt.    (3.128)
It turns out that si(0) = \frac{\pi}{2}, which is shown on p. 277 of Spiegel [22]. We wish to
find an asymptotic series for Si(x). Since

    \frac{\pi}{2} = \int_0^{\infty} \frac{\sin t}{t}\, dt = \int_0^{x} \frac{\sin t}{t}\, dt + \int_x^{\infty} \frac{\sin t}{t}\, dt = Si(x) + si(x),    (3.129)

we will consider the expansion of si(x). If we integrate by parts in succession, then

    \int_x^{\infty} \frac{\sin t}{t}\, dt = \frac{1}{x}\cos x - 1 \cdot \int_x^{\infty} \frac{\cos t}{t^2}\, dt,

    \int_x^{\infty} \frac{\cos t}{t^2}\, dt = -\frac{1}{x^2}\sin x + 2 \int_x^{\infty} \frac{\sin t}{t^3}\, dt,

    \int_x^{\infty} \frac{\sin t}{t^3}\, dt = \frac{1}{x^3}\cos x - 3 \int_x^{\infty} \frac{\cos t}{t^4}\, dt,

    \int_x^{\infty} \frac{\cos t}{t^4}\, dt = -\frac{1}{x^4}\sin x + 4 \int_x^{\infty} \frac{\sin t}{t^5}\, dt,

and so on. If n \in N, then we may define

    s_n(x) = \int_x^{\infty} t^{-n} \sin t\, dt,   c_n(x) = \int_x^{\infty} t^{-n} \cos t\, dt.    (3.130)

Therefore, for odd n

    s_n(x) = \frac{1}{x^n}\cos x - n\, c_{n+1}(x),    (3.131a)

and for even n

    c_n(x) = -\frac{1}{x^n}\sin x + n\, s_{n+1}(x).    (3.131b)

We observe that s_1(x) = si(x). Repeated application of the recursions (3.131a,b)
results in

    s_1(x) = \frac{1}{x}\cos x + \frac{1}{x^2}\sin x - 1 \cdot 2\, s_3(x)
           = \frac{1}{x}\cos x + \frac{1}{x^2}\sin x - \frac{1 \cdot 2}{x^3}\cos x
             - \frac{1 \cdot 2 \cdot 3}{x^4}\sin x + 1 \cdot 2 \cdot 3 \cdot 4\, s_5(x),

or in general

    s_1(x) = \cos x \left[ \frac{1}{x} - \frac{1 \cdot 2}{x^3} + \cdots + (-1)^{n+1} \frac{(2n-2)!}{x^{2n-1}} \right]
           + \sin x \left[ \frac{1}{x^2} - \frac{1 \cdot 2 \cdot 3}{x^4} + \cdots + (-1)^{n+1} \frac{(2n-1)!}{x^{2n}} \right]
           + (-1)^n (2n)!\, s_{2n+1}(x)    (3.132)
for n \in N (with 0! = 1). From this, we obtain

    [si(x) - S_{2n}(x)] x^{2n} = (-1)^n (2n)!\, x^{2n} s_{2n+1}(x),    (3.133)

for which S_{2n}(x) is appropriately defined as those terms in (3.132) involving \sin x
and \cos x. It is unclear whether

    \lim_{x \to \infty} x^{2n} s_{2n+1}(x) = 0    (3.134)

for any n \in N. But if we accept (3.134), then

    si(x) \sim S_{2n}(x) = \cos x \left[ \frac{1}{x} - \frac{1 \cdot 2}{x^3} + \cdots + (-1)^{n+1} \frac{(2n-2)!}{x^{2n-1}} \right]
           + \sin x \left[ \frac{1}{x^2} - \frac{1 \cdot 2 \cdot 3}{x^4} + \cdots + (-1)^{n+1} \frac{(2n-1)!}{x^{2n}} \right].    (3.135)
Figure 3.8 shows plots of Si(x) and the asymptotic approximation \frac{\pi}{2} - S_{2n}(x) for
n = 1, 2, 3. We observe that the approximations are good for "large x," but poor
for "small x," as we would expect. Moreover, for small x, the approximations are
better for smaller n. This is also reasonable. Our plots therefore constitute informal
verification of the correctness of (3.135), in spite of potential doubts about (3.134).

Figure 3.8  Plot of Si(x) using the MATLAB routine sinint, and plots of \frac{\pi}{2} - S_{2n}(x)
for n = 1, 2, 3. The horizontal solid line is at height \pi/2.
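The same informal verification can be done without plots. A Python sketch (the book uses MATLAB and its `sinint` routine; here `si_numeric` is our own Simpson-rule substitute, and the x = 20, n = 3 test point is our choice):

```python
import math

def si_numeric(x, n=2000):
    """Si(x) = int_0^x sin(t)/t dt by Simpson's rule (n must be even)."""
    h = x / n
    def f(t):
        return 1.0 if t == 0.0 else math.sin(t) / t
    s = f(0.0) + f(x)
    for i in range(1, n):
        s += (4.0 if i % 2 else 2.0) * f(i * h)
    return s * h / 3.0

def S2n(x, n):
    """S_{2n}(x) from (3.135)."""
    cos_part = sum((-1) ** (k + 1) * math.factorial(2 * k - 2) / x ** (2 * k - 1)
                   for k in range(1, n + 1))
    sin_part = sum((-1) ** (k + 1) * math.factorial(2 * k - 1) / x ** (2 * k)
                   for k in range(1, n + 1))
    return math.cos(x) * cos_part + math.sin(x) * sin_part

# Si(x) ~ pi/2 - S_{2n}(x) for large x, via (3.129):
x = 20.0
assert abs(si_numeric(x) - (math.pi / 2.0 - S2n(x, 3))) < 1e-4
```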
Another approach to finding (c_k) is based on the fact that many special functions
are solutions to particular differential equations. If the originating differential
equation is known, it may be used to generate the sequence (c_k). However, this section
was intended to be relatively brief, and so this method is omitted. The interested
reader may see Kreyszig [24].
3.7 MORE ON THE DIRICHLET KERNEL

In Section 3.4 the Dirichlet kernel was introduced in order to analyze the manner
in which Fourier series converge in the vicinity of a discontinuity of a 2\pi-periodic
function. However, this was only done with respect to the special case of g(t) in
(3.19). In this section we consider the Dirichlet kernel D_n(t) in a more general
manner.

We begin by recalling the complex Fourier series expansion from Chapter 1
(Section 1.3.3). If f(t) \in L^2(0, 2\pi), then

    f(t) = \sum_{n=-\infty}^{\infty} f_n e^{jnt},    (3.136)

where f_n = \langle f, e_n \rangle, with e_n(t) = e^{jnt}, and

    \langle x, y \rangle = \frac{1}{2\pi} \int_0^{2\pi} x(t) y^*(t)\, dt.    (3.137)
An approximation to f(t) is the truncated Fourier series expansion

    f_L(t) = \sum_{n=-L}^{L} f_n e^{jnt} \in L^2(0, 2\pi).    (3.138)

The approximation error is

    e_L(t) = f(t) - f_L(t) \in L^2(0, 2\pi).    (3.139)

We seek a general expression for e_L(t) that is hopefully more informative than
(3.139). The overall goal is to generalize the error analysis approach seen in
Section 3.4. Therefore, consider

    e_L(t) = f(t) - \sum_{n=-L}^{L} \langle f, e_n \rangle e_n(t)

[via f_n = \langle f, e_n \rangle, and (3.138) into (3.139)]. Thus

    e_L(t) = f(t) - \sum_{n=-L}^{L} \left[ \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-jnx}\, dx \right] e_n(t)
           = f(t) - \frac{1}{2\pi} \int_0^{2\pi} f(x) \left[ \sum_{n=-L}^{L} e^{jn(t-x)} \right] dx.    (3.140)

Since

    \sum_{n=-L}^{L} e^{jn(t-x)} = 1 + 2\sum_{n=1}^{L} \cos[n(t-x)]    (3.141)

(show this as an exercise), via (3.24) we obtain

    1 + 2\sum_{n=1}^{L} \cos[n(t-x)] = \frac{\sin[(L + \frac{1}{2})(t-x)]}{\sin[\frac{1}{2}(t-x)]} = 2\pi D_L(t-x).    (3.142)

Immediately, we see that

    e_L(t) = f(t) - \int_0^{2\pi} f(x) D_L(t-x)\, dx,    (3.143)

where also [recall (3.139)]

    f_L(t) = \int_0^{2\pi} f(x) D_L(t-x)\, dx.    (3.144)

Equation (3.144) is an alternative integral form of the approximation to f(t)
originally specified in (3.138). The integral in (3.144) is really an example of something
called a convolution integral. The following example will demonstrate how we
might apply (3.143).
Example 3.12  Suppose that

    f(t) = \begin{cases} \sin t, & 0 \le t < \pi \\ 0, & \pi \le t < 2\pi \end{cases}.

Note that f(t) is continuous for all t, but that f^{(1)}(t) = df(t)/dt is not continuous
everywhere. For example, f^{(1)}(t) is not continuous at t = \pi. Plots of f_L(t) for
various L (see Fig. 3.10) suggest that f_L(t) converges most slowly to f(t) near
t = \pi. Can we say something about the rate of convergence?
Therefore, consider

    f_L(\pi) = \frac{1}{2\pi} \int_0^{\pi} \sin x \, \frac{\sin[(L + \frac{1}{2})(\pi - x)]}{\sin[\frac{1}{2}(\pi - x)]}\, dx.    (3.145)

Now

    \sin\left[\frac{1}{2}(\pi - x)\right] = \sin\left(\frac{\pi}{2}\right)\cos\left(\frac{x}{2}\right)
      - \cos\left(\frac{\pi}{2}\right)\sin\left(\frac{x}{2}\right) = \cos\left(\frac{x}{2}\right),

and since \sin x = 2\sin(\frac{x}{2})\cos(\frac{x}{2}), (3.145) reduces to

    f_L(\pi) = \frac{1}{\pi} \int_0^{\pi} \sin\left(\frac{x}{2}\right)
               \sin\left[\left(L + \frac{1}{2}\right)(\pi - x)\right] dx

             = \frac{1}{2\pi} \int_0^{\pi} \left\{ \cos\left[(L+1)x - \left(L + \frac{1}{2}\right)\pi\right]
               - \cos\left[Lx - \left(L + \frac{1}{2}\right)\pi\right] \right\} dx

             = \frac{1}{2\pi(L+1)} \sin\left[(L+1)x - \left(L + \frac{1}{2}\right)\pi\right]\Big|_0^{\pi}
               - \frac{1}{2\pi L} \sin\left[Lx - \left(L + \frac{1}{2}\right)\pi\right]\Big|_0^{\pi}

             = \frac{1}{2\pi(L+1)}\left[1 + \sin\left(\left(L + \frac{1}{2}\right)\pi\right)\right]
               - \frac{1}{2\pi L}\left[-1 + \sin\left(\left(L + \frac{1}{2}\right)\pi\right)\right]

             = \frac{1}{2\pi} \cdot \frac{2L + 1 - (-1)^L}{L(L+1)},    (3.146)

where we have used \sin\left(\left(L + \frac{1}{2}\right)\pi\right) = (-1)^L. We see that, as expected,

    \lim_{L \to \infty} f_L(\pi) = 0.

Also, since f(\pi) = 0, (3.146) gives e_L(\pi) = -f_L(\pi), and this is the exact value for
the approximation error at t = \pi for all L. Furthermore,

    |e_L(\pi)| \propto \frac{1}{L}

for large L, and so we have a measure of the rate of convergence of the error, at
least at the point t = \pi. (Symbol "\propto" means "proportional to.")
We remark that f(t) in Example 3.12 may be regarded as the voltage drop across the resistor R in Fig. 3.9. The circuit in Fig. 3.9 is a simple half-wave rectifier circuit.

Figure 3.9 An electronic circuit interpretation for Example 3.12 (an ideal diode in series with the resistor R); here, v(t) = sin(t) for all t ∈ R.

Figure 3.10 A plot of f(t) and Fourier series approximations to f(t) [i.e., f_L(t) for L = 3, 10].

The reader ought to verify as an exercise that

    f_L(t) = 1/π + ½ sin t + Σ_{n=2}^{L} { [1 + (−1)^n] / [π(1 − n²)] } cos nt.    (3.147)

Plots of f_L(t) for L = 3, 10 versus the plot of f(t) appear in Fig. 3.10.
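The closed form (3.146) and the partial sum (3.147) are easy to cross-check numerically. The following Python sketch (ours; the book's own computations use MATLAB, and the function names here are hypothetical) evaluates (3.147) at t = π and compares it with (3.146):

```python
import math

def fL_at_pi_closed(L):
    # Closed-form value of f_L(pi) from (3.146): (1/(2*pi)) * (2L + 1 - (-1)^L) / (L(L + 1))
    return (2 * L + 1 - (-1) ** L) / (2 * math.pi * L * (L + 1))

def fL(t, L):
    # Partial Fourier sum (3.147) for the half-wave-rectified sine
    s = 1 / math.pi + 0.5 * math.sin(t)
    for n in range(2, L + 1):
        s += (1 + (-1) ** n) / (math.pi * (1 - n * n)) * math.cos(n * t)
    return s

for L in (3, 10, 100):
    print(L, fL(math.pi, L), fL_at_pi_closed(L))  # the two columns agree
```

Since f(π) = 0, both columns also equal |e_L(π)|, and the printed values shrink roughly like 1/L, consistent with the rate-of-convergence claim above.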
3.8 FINAL REMARKS
We have seen that sequences and series might converge "mathematically" yet not
"numerically." Essentially, we have seen three categories of difficulty:
1. Pointwise convergence of series leading to irreducible errors in certain regions
of the approximation, such as the Gibbs phenomenon in Fourier expansions,
which arises in the vicinity of discontinuities in the function being approxi-
mated
2. The destructive effect of rounding errors as illustrated by the catastrophic
convergence of series
3. Slow convergence, as illustrated by the problem of computing π with the Maclaurin expansion for tan⁻¹ x
We have seen that some ingenuity may be needed to overcome obstacles such
as these. For example, a divide-and-conquer approach helped in the problem of
computing π. In the case of catastrophic convergence, the problem was solved by
changing the computational algorithm. The problem of overcoming Gibbs overshoot
is not considered here. However, it involves seeking uniformly convergent series
approximations.
APPENDIX 3.A COORDINATE ROTATION DIGITAL
COMPUTING (CORDIC)
3.A.1 Introduction
This appendix presents the basics of a method for computing "elementary functions," which includes the problem of rotating vectors in the plane, computing tan⁻¹ x, sin θ, and cos θ. The method to be used is called "coordinate rotation digital computing" (CORDIC), and was invented by Jack Volder [3] in the late 1950s. However, in spite of the age of the method, it is still important. The method
is one of those great ideas that is able to survive despite technological changes. It
is a good example of how a clever mathematical idea, if anything, becomes less
obsolete with the passage of time.
The CORDIC method was, and is, desirable because it reduces the problem of
computing apparently complicated functions, such as trig functions, to a succession
of simple operations. Specifically, these simple operations are shifting and adding.
In the 1950s it was a major achievement just to build systems that could add two
numbers together because all that was available for use was vacuum tubes, and to a
lesser degree, discrete transistors. However, even with the enormous improvements
in computing technology that have occurred since then, it is still important to reduce
complicated operations to simple ones. Thus, the CORDIC method has survived
very well. For example, in 1980 [5] a special-purpose CORDIC VLSI (very large
scale integration) chip was presented. More recent references to the method will
be given later. Nowadays, the CORDIC method is more likely to be implemented
with gate-array technology.
Since the CORDIC method involves the operations of shifting and adding only,
once the mathematics of the method is understood, it is easy to build CORDIC
computing hardware using the logic design methods considered in typical elemen-
tary digital logic courses or books. 5 Consideration of CORDIC computing also
makes a connection between elementary mathematics courses (calculus and linear
algebra) and computer hardware systems design, as well as the subject of numerical
analysis.
3.A.2 The Concept of a Discrete Basis
The original paper of Volder [3] does not give a rigorous treatment of the mathematics of the CORDIC method. However, in this section (based closely on Schelin [6]), we will begin the process of deriving the CORDIC method in a mathematically rigorous manner. The central idea is to represent operands (e.g., the θ in sin θ) in terms of a discrete basis. When the discrete basis representation is combined with appropriate mathematical identities, the CORDIC algorithm results.
Let R denote the set of real numbers. Everything we do here revolves around the following theorem (from Schelin [6]).
Theorem 3.A.1: Suppose that θ_k ∈ R for k ∈ {0, 1, 2, 3, ...} satisfy θ₀ ≥ θ₁ ≥ θ₂ ≥ ··· ≥ θ_n > 0, and that

    θ_k ≤ Σ_{j=k+1}^{n} θ_j + θ_n  for 0 ≤ k ≤ n,    (3.A.1a)

and suppose θ ∈ R satisfies

    |θ| ≤ Σ_{j=0}^{n} θ_j.    (3.A.1b)

If θ^(0) = 0, and θ^(k+1) = θ^(k) + δ_k θ_k for 0 ≤ k ≤ n, where

    δ_k = { +1, if θ ≥ θ^(k)
          { −1, if θ < θ^(k),    (3.A.1c)

then

    |θ − θ^(k)| ≤ Σ_{j=k}^{n} θ_j + θ_n  for 0 ≤ k ≤ n + 1,    (3.A.1d)

and so in particular |θ − θ^(n+1)| ≤ θ_n.
⁵The method is also easy to implement in an assembly language (or other low-level) programming environment.
Proof  We may use proof by mathematical induction⁶ on the index k. For k = 0,

    |θ − θ^(0)| = |θ| ≤ Σ_{j=0}^{n} θ_j ≤ Σ_{j=0}^{n} θ_j + θ_n

via (3.A.1b).
Assume that |θ − θ^(k)| ≤ Σ_{j=k}^{n} θ_j + θ_n is true, and consider |θ − θ^(k+1)|. Via (3.A.1c), δ_k and θ − θ^(k) have the same sign, and so

    |θ − θ^(k+1)| = |θ − θ^(k) − δ_k θ_k| = | |θ − θ^(k)| − θ_k |.

Now, via the inductive hypothesis (i.e., |θ − θ^(k)| ≤ Σ_{j=k}^{n} θ_j + θ_n), we have

    |θ − θ^(k)| − θ_k ≤ Σ_{j=k+1}^{n} θ_j + θ_n.    (3.A.2a)

Via (3.A.1a), we obtain

    θ_k ≤ Σ_{j=k+1}^{n} θ_j + θ_n,

so that

    −[ Σ_{j=k+1}^{n} θ_j + θ_n ] ≤ −θ_k ≤ |θ − θ^(k)| − θ_k.    (3.A.2b)

Combining (3.A.2a) with (3.A.2b) gives

    |θ − θ^(k+1)| = | |θ − θ^(k)| − θ_k | ≤ Σ_{j=k+1}^{n} θ_j + θ_n,

so that (3.A.1d) holds for k replaced by k + 1, and so (3.A.1d) holds via induction.
We will call this result Schelin's theorem. The set {θ_k} is called a discrete basis if it satisfies the restrictions given in the theorem.
In what follows we will interpret θ as an angle. However, note that θ in this theorem could be more general than this. Now suppose that we define

    θ_k = tan⁻¹ 2^{−k}    (3.A.3)

for k = 0, 1, ..., n. Table 3.A.1 shows typical values for θ_k as given by (3.A.3). We see that good approximations to θ can be obtained with relatively small n (because, from Schelin's theorem, |θ − θ^(n+1)| ≤ θ_n).
⁶A brief introduction to proof by mathematical induction appears in Appendix 3.B.
TABLE 3.A.1 Values for Some Elements of the Discrete Basis Given by Eq. (3.A.3)

    k    θ_k = tan⁻¹ 2^{−k}
    0    45°
    1    26.565°
    2    14.036°
    3    7.1250°
    4    3.5763°
Clearly, these satisfy θ₀ > θ₁ > ··· > θ_n > 0, which is one of the restrictions in Schelin's theorem.
The mean-value theorem of elementary calculus says that there exists a ξ ∈ [x₀, x₀ + Δ] such that

    df(x)/dx |_{x=ξ} = [f(x₀ + Δ) − f(x₀)] / Δ,

so if f(x) = tan⁻¹ x, then df(x)/dx = 1/(1 + x²), and thus

    (θ_k − θ_{k+1}) / [2^{−k} − 2^{−(k+1)}] = 1/(1 + ξ²) ≤ 1/(1 + 2^{−2(k+1)}),   ξ ∈ [2^{−(k+1)}, 2^{−k}],

because the slope of tan⁻¹ x is largest at x = 2^{−(k+1)}, and this in turn implies

    θ_k − θ_{k+1} ≤ [2^{−k} − 2^{−(k+1)}] / [1 + 2^{−2(k+1)}] = [2^{k+2} − 2^{k+1}] / [1 + 2^{2(k+1)}] = 2^{k+1} / [1 + 2^{2(k+1)}],

and

    θ_k ≥ 2^k / (1 + 2^{2k})

because tan⁻¹ x ≥ x (d/dx) tan⁻¹ x = x/(1 + x²) for x ≥ 0, for k = 0, 1, ..., n (let x = 2^{−k}).⁷
⁷For x ≥ 0,

    d/dx [ x/(1 + x²) ] = (1 − x²)/(1 + x²)² ≤ 1/(1 + x²) = d/dx tan⁻¹ x,

so

    ∫₀^x dt/(1 + t²) ≥ x/(1 + x²)  ⇒  tan⁻¹ x ≥ x/(1 + x²).
Now, as a consequence of these results,

    θ_k − θ_n = (θ_k − θ_{k+1}) + (θ_{k+1} − θ_{k+2}) + ··· + (θ_{n−2} − θ_{n−1}) + (θ_{n−1} − θ_n)
              ≤ Σ_{j=k}^{n−1} 2^{j+1} / [1 + 2^{2(j+1)}]
              = Σ_{j=k+1}^{n} 2^{j} / (1 + 2^{2j})
              ≤ Σ_{j=k+1}^{n} θ_j,

implying that

    θ_k ≤ Σ_{j=k+1}^{n} θ_j + θ_n

for k = 0, 1, ..., n. Thus (3.A.1a) holds for {θ_k} in (3.A.3), and so (3.A.3) is a concrete example of a discrete basis. If you have a pocket calculator handy, then it is easy to verify that

    Σ_{j=0}^{3} θ_j = 92.73° > 90°,

so we will work only with angles θ that satisfy |θ| ≤ 90°. Thus, for such θ, there exists a sequence {δ_k} with δ_k ∈ {−1, +1} such that

    θ = Σ_{k=0}^{n} δ_k θ_k + ε_{n+1} = θ^(n+1) + ε_{n+1},    (3.A.4)

where |ε_{n+1}| ≤ θ_n = tan⁻¹ 2^{−n}. Equation (3.A.1c) gives us a way to find {δ_k}. We can also write that any angle θ satisfying |θ| ≤ π/2 (radians) can be represented exactly as

    θ = Σ_{k=0}^{∞} δ_k θ_k

for appropriately chosen coordinates {δ_k}.
In the next section we will begin to see how useful it is to be able to represent angles in terms of the discrete basis given by (3.A.3).
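Schelin's theorem translates directly into a short program. The Python sketch below (our own illustration, with hypothetical names; the book itself uses no Python) chooses each δ_k greedily per (3.A.1c) for the discrete basis (3.A.3) and confirms the bound |θ − θ^(n+1)| ≤ θ_n for several angles with |θ| ≤ 90°:

```python
import math

def decompose(theta, n):
    # Greedy delta_k per (3.A.1c) for the discrete basis theta_k = atan(2^-k) of (3.A.3)
    approx = 0.0                               # theta^(0) = 0
    deltas = []
    for k in range(n + 1):
        d = 1 if theta >= approx else -1
        deltas.append(d)
        approx += d * math.atan(2.0 ** -k)     # theta^(k+1) = theta^(k) + delta_k * theta_k
    return deltas, approx

n = 12
for deg in (20.0, -63.0, 45.0, 89.0):
    theta = math.radians(deg)
    deltas, approx = decompose(theta, n)
    assert abs(theta - approx) <= math.atan(2.0 ** -n)   # Schelin's bound
print("bound |theta - theta^(n+1)| <= theta_n holds")
```

Note that the sign sequence oscillates around θ rather than homing in monotonically; the theorem guarantees only the final bound, not monotone convergence.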
3.A.3 Rotating Vectors in the Plane
No one would disagree that a basic computational problem in electrical and computer engineering is to find

    z′ = e^{jθ} z,   θ ∈ R,    (3.A.5)

where j = √(−1), z = x + jy, and z′ = x′ + jy′ (x, y, x′, y′ ∈ R). Certainly, this problem arises in the phasor analysis of circuits, in computer graphics (to rotate objects, for example), and it also arises in digital signal processing (DSP) a lot.⁸ Thus, z and z′ are complex variables. Expressing them in terms of their real and imaginary parts, we may rewrite (3.A.5) as

    x′ + jy′ = (cos θ + j sin θ)(x + jy) = [x cos θ − y sin θ] + j[x sin θ + y cos θ],

and this can be further rewritten as the matrix-vector product

    [ x′ ]   [ cos θ   −sin θ ] [ x ]
    [ y′ ] = [ sin θ    cos θ ] [ y ],    (3.A.6)

which we recognize as the formula for rotating the vector [x y]^T in the plane to [x′ y′]^T.
Recall the trigonometric identities
    sin θ = tan θ / √(1 + tan²θ),   cos θ = 1 / √(1 + tan²θ),    (3.A.7)

and so for θ = θ_k in (3.A.3)

    sin θ_k = 2^{−k} / √(1 + 2^{−2k}),   cos θ_k = 1 / √(1 + 2^{−2k}).    (3.A.8)
Thus, rotating x_k + jy_k by the angle δ_k θ_k to x_{k+1} + jy_{k+1} is accomplished via

    [ x_{k+1} ]                      [ 1            −δ_k 2^{−k} ] [ x_k ]
    [ y_{k+1} ] = (1/√(1 + 2^{−2k})) [ δ_k 2^{−k}    1          ] [ y_k ],    (3.A.9)
⁸In DSP it is often necessary to compute the discrete Fourier transform (DFT), which is an approximation to the Fourier transform (FT) and is defined by

    X_n = Σ_{k=0}^{N−1} x_k exp(−j 2πkn/N),

where {x_k} is the set of samples of some analog signal, i.e., x_k = x(kT), where k is an integer and T is a positive constant; x(t) for t ∈ R is the analog signal. Note that additional information about the DFT appeared in Section 1.4.
where we've used (3.A.6). From Schelin's theorem, θ^(n+1) ≈ θ, so if we wish to rotate x₀ + jy₀ = x + jy by θ^(n+1) to x_{n+1} + jy_{n+1} (≈ x′ + jy′), then via (3.A.9)

    [ x_{n+1} ]       [ 1            −δ_n 2^{−n} ]       [ 1           −δ_1 2^{−1} ] [ 1     −δ_0 ] [ x_0 ]
    [ y_{n+1} ] = K_n [ δ_n 2^{−n}    1          ] · · · [ δ_1 2^{−1}   1          ] [ δ_0    1   ] [ y_0 ].    (3.A.10)

Define

    K_n = Π_{k=0}^{n} 1/√(1 + 2^{−2k}),    (3.A.11)

where Π_{k=0}^{n} a_k = a_n a_{n−1} a_{n−2} ··· a_1 a_0.
Consider

    [ x̂_{n+1} ]   [ 1            −δ_n 2^{−n} ]       [ 1           −δ_1 2^{−1} ] [ 1     −δ_0 ] [ x_0 ]
    [ ŷ_{n+1} ] = [ δ_n 2^{−n}    1          ] · · · [ δ_1 2^{−1}   1          ] [ δ_0    1   ] [ y_0 ],    (3.A.12)

which is the same expression as (3.A.10) except that we have dropped the multiplication by K_n. We observe that implementing (3.A.12) requires only the simple (to implement in digital hardware, or assembly language) operations of shifting and adding. Implementing (3.A.12) rotates x + jy by approximately the desired amount, but gives a solution vector that is a factor of 1/K_n longer than it should be. Of course, this can be corrected by multiplying [x̂_{n+1} ŷ_{n+1}]^T by K_n if desired. Note that in some applications this would not be necessary.
Note that

    [ cos θ   −sin θ ]         [ 1            −δ_n 2^{−n} ]       [ 1           −δ_1 2^{−1} ] [ 1     −δ_0 ]
    [ sin θ    cos θ ] ≈ K_n   [ δ_n 2^{−n}    1          ] · · · [ δ_1 2^{−1}   1          ] [ δ_0    1   ],    (3.A.13)

and so the matrix product in (3.A.12) [or (3.A.10)] represents an efficient approximate factorization of the rotation operator in (3.A.6). The approximation gets better and better as n increases, and in the limit as n → ∞ becomes exact.
The computational complexity of the CORDIC rotation algorithm may be described as follows. In Eq. (3.A.10) there are exactly 2n shifts (i.e., multiplications by 2^{−k}), and 2n + 2 additions, plus two scalings by the factor K_n. As well, only n + 1 bits of storage are needed to save the sequence {δ_k}.
We conclude this section with a numerical example to show how to obtain the sequence {δ_k} via (3.A.1c).
Example 3.A.1 Suppose that we want to rotate a vector by an angle of θ = 20°, and we decide that n = 4 gives sufficient accuracy for the application at hand. Via (3.A.1c) and Table 3.A.1, we have

    θ^(1) = δ₀θ₀ = 45°  as δ₀ = +1 since θ > θ^(0) = 0°
    θ^(2) = θ^(1) + δ₁θ₁
          = 45° − 26.565° = 18.435°  as δ₁ = −1 since θ < θ^(1)
    θ^(3) = θ^(2) + δ₂θ₂
          = 18.435° + 14.036° = 32.471°  as δ₂ = +1 since θ > θ^(2)
    θ^(4) = θ^(3) + δ₃θ₃
          = 32.471° − 7.1250° = 25.346°  as δ₃ = −1 since θ < θ^(3)
    θ^(5) = θ^(4) + δ₄θ₄
          = 25.346° − 3.5763° = 21.770° ≈ θ  as δ₄ = −1 since θ < θ^(4)

Via (3.A.4) and |θ − θ^(n+1)| ≤ θ_n,

    |ε₅| = 1.770° < θ₄ = 3.5763°,

so the error bound in Schelin's theorem is actually somewhat conservative, at least in this special case.
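The full rotation recipe, i.e. the greedy angle selection of (3.A.1c), the shift-and-add product (3.A.12), and the final scaling by K_n from (3.A.11), can be sketched in Python as follows (our own illustration; the function name and the iteration count n = 24 are arbitrary choices, and a hardware version would use only shifts and adds):

```python
import math

def cordic_rotate(x, y, theta, n=24):
    # Rotate (x, y) by theta: greedy deltas per (3.A.1c), unscaled micro-rotations
    # per (3.A.12), then one final scaling by K_n from (3.A.11).
    approx = 0.0
    for k in range(n + 1):
        d = 1.0 if theta >= approx else -1.0
        approx += d * math.atan(2.0 ** -k)
        # one micro-rotation: in hardware this is just shifts and adds
        x, y = x - d * (2.0 ** -k) * y, d * (2.0 ** -k) * x + y
    K = 1.0
    for k in range(n + 1):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * k))
    return K * x, K * y

theta = math.radians(20.0)
print(cordic_rotate(1.0, 0.0, theta))  # close to (cos 20 deg, sin 20 deg)
```

Rotating [1 0]^T by θ yields [cos θ sin θ]^T, which is the basis of the cos/sin exercise suggested in Section 3.A.5.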
3.A.4 Computing Arctangents

The results in Section 3.A.3 can be modified to obtain a CORDIC algorithm for computing an approximation to θ = tan⁻¹(y/x). The idea is to find the sequence {δ_k | k = 0, 1, ..., n − 1, n} that rotates the vector [x y]^T = [x₀ y₀]^T to the vector [x_{n+1} y_{n+1}]^T, where y_{n+1} ≈ 0. More specifically, we would select δ_k so that |y_{k+1}| < |y_k|.
Let θ̂ denote the approximation to θ. The desired algorithm to compute θ̂ may be expressed as Pascal-like pseudocode:

    θ̂ := 0;
    x_0 := x; y_0 := y;
    for k := 0 to n do begin
        if y_k ≥ 0 then begin
            δ_k := −1;
        end
        else begin
            δ_k := +1;
        end;
        θ̂ := −δ_k θ_k + θ̂;
        x_{k+1} := x_k − δ_k 2^{−k} y_k;
        y_{k+1} := δ_k 2^{−k} x_k + y_k;
    end;
In this pseudocode

    [ x_{k+1} ]   [ 1            −δ_k 2^{−k} ] [ x_k ]
    [ y_{k+1} ] = [ δ_k 2^{−k}    1          ] [ y_k ],

and we see that, for the manner in which the sequence {δ_k} is constructed by the pseudocode, the inequality |y_{k+1}| < |y_k| is satisfied. We choose n to achieve the desired accuracy of our estimate θ̂ of θ; specifically, |θ − θ̂| ≤ θ_n.
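Under the assumptions above (x > 0, with δ_k chosen to shrink |y_k|), the pseudocode translates into the following Python sketch (ours; names are hypothetical). Only θ̂ is needed here, so the growth of the unscaled vector by 1/K_n is harmless and no scaling is applied:

```python
import math

def cordic_atan(x, y, n=24):
    # Estimate atan(y/x) for x > 0 by rotating [x, y] so that y is driven toward 0.
    est = 0.0
    for k in range(n + 1):
        d = -1.0 if y >= 0 else 1.0            # delta_k opposes the sign of y_k
        est = -d * math.atan(2.0 ** -k) + est  # accumulate -delta_k * theta_k
        x, y = x - d * (2.0 ** -k) * y, d * (2.0 ** -k) * x + y
    return est

print(cordic_atan(2.0, 1.0), math.atan2(1.0, 2.0))  # agree to within theta_n
```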
3.A.5 Final Remarks

As an exercise, the reader should modify the previous results to determine a CORDIC method for computing cos θ and sin θ. [Hint: Take a good look at (3.A.13).]
The CORDIC philosophy can be extended to the computation of hyperbolic trigonometric functions, logarithms,⁹ and other functions [4, 7]. It can also perform multiplication and division (see the table on p. 324 of Schelin [6]). As shown by Hu and Naganathan [9], the rate of convergence of the CORDIC method can be accelerated by a method similar to the Booth algorithm (see pp. 287-289 of Hamacher et al. [10]) for multiplication. However, this is at the expense of somewhat more complicated hardware structures.
complicated hardware structures. A roundoff error analysis of the CORDIC method
has been performed by Hu [8]. We do not present these results in this book as they
are quite involved. Hu claims to have fairly tight bounds on the errors, however.
Fixed-point and floating-point schemes are both analyzed. A tutorial presentation
of CORDIC-based VLSI architectures for digital signal processing applications
appears in Hu [11]. Other papers on the CORDIC method are those by Timmer-
mann et al. [12] and Lee and Lang [13] (which appeared in the IEEE Transactions
on Computers, "Special Issue on Computer Arithmetic" of August 1992). An alter-
native summary of the CORDIC method may be found in Hwang [14]. Many of
the ideas in Hu's paper [11] are applicable in a gate-array technology environ-
ment. Applications include the computation of discrete transforms (e.g., the DFT),
digital filtering, adaptive filtering, Kalman filtering, the solution of special linear
⁹A clever alternative to the CORDIC approach for log calculations appears in Lo and Chen [15], and a method of computing square roots without division appears in Mikami et al. [16].
systems of equations (e.g., Toeplitz), deconvolution, and eigenvalue and singular
value decompositions.
APPENDIX 3.B MATHEMATICAL INDUCTION
The basic idea of mathematical induction is as follows. Assume that we are given a sequence of statements

    S₀, S₁, ..., S_n, ...

and each S_i is true or it is false. To prove that all of the statements are true (i.e., to prove that S_n is true for all n) by induction: (1) prove that S_n is true for n = 0, and then (2) assume that S_n is true for any n = k, and then show that S_n is true for n = k + 1.
Example 3.B.1 Prove that

    S_n = Σ_{i=0}^{n} i² = n(n + 1)(2n + 1)/6,   n ≥ 0.

Proof  We will use induction, but note that there are other methods (e.g., via z-transforms). For n = 0, we obtain

    S₀ = Σ_{i=0}^{0} i² = 0   and   n(n + 1)(2n + 1)/6 |_{n=0} = 0.

Thus, S_n is certainly true for n = 0.
Assume now that S_n is true for n = k, so that

    S_k = Σ_{i=0}^{k} i² = k(k + 1)(2k + 1)/6.    (3.B.1)

We have

    S_{k+1} = Σ_{i=0}^{k+1} i² = Σ_{i=0}^{k} i² + (k + 1)² = S_k + (k + 1)²,

and so

    S_k + (k + 1)² = k(k + 1)(2k + 1)/6 + (k + 1)²
                   = (k + 1)[k(2k + 1) + 6(k + 1)]/6
                   = (k + 1)(k + 2)(2k + 3)/6
                   = n(n + 1)(2n + 1)/6 |_{n=k+1},

where we have used (3.B.1).
Therefore, S_n is true for n = k + 1 if S_n is true for n = k. Therefore, S_n is true for all n ≥ 0 by induction.
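A quick numerical spot-check of the closed form is easy to run (this loop illustrates the claim; the induction argument above, not the loop, is what proves it for all n):

```python
# Spot-check of S_n = n(n + 1)(2n + 1)/6 from Example 3.B.1 for small n.
for n in range(50):
    assert sum(i * i for i in range(n + 1)) == n * (n + 1) * (2 * n + 1) // 6
print("closed form verified for n = 0..49")
```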
APPENDIX 3.C CATASTROPHIC CANCELLATION
The phenomenon of catastrophic cancellation is illustrated in the following out-
put from a MATLAB implementation that ran on a Sun Microsystems Ultra 10
workstation using MATLAB version 6.0.0.88, release 12, in an attempt to compute
exp(— 20) using the Maclaurin series for exp(x) directly:
term k    x^k/k!
0    1.000000000000000
1    -20.000000000000000
2 200.000000000000000
3 -1333.333333333333258
4 6666.666666666666970
5 -26666.666666666667879
6 88888.888888888890506
7 -253968.253968253964558
8 634920.634920634911396
9 -1410934.744268077658489
10 2821869.488536155316979
11 -5130671.797338464297354
12 8551119.662230772897601
13 -13155568.711124267429113
14 18793669.587320379912853
15 -25058226.116427175700665
16 31322782.645533967763186
17 -36850332.524157606065273
18 40944813.915730677545071
19 -43099804.121821768581867
20 43099804.121821768581867
21 -41047432.496973112225533
22 37315847.724521011114120
23 -32448563.238713916391134
24 27040469.365594934672117
25 -21632375.492475949227810
26 16640288.840366113930941
27 -12326139.881752677261829
28 8804385.629823340103030
29 -6071990.089533339254558
30 4047993.393022226169705
31 -2611608.640659500379115
32 1632255.400412187911570
33 -989245.697219507652335
34 581909.233658534009010
35 -332519.562090590829030
36 184733.090050328260986
37 -99855.724351528784609
38 52555.644395541465201
39 -26951.612510534087050
40 13475.806255267041706
41 -6573.564026959533294
42 3130.268584266443668
43 -1455.938876402997266
44 661.790398364998850
45 -294.129065939999407
46 127.882202582608457
47 -54.417958545790832
48 22.674149394079514
49 -9.254754854726333
50 3.701901941890533
51 -1.451726251721777
52 0.558356250662222
53 -0.210700471948008
54 0.078037211832596
55 -0.028377167939126
56 0.010134702835402
57 -0.003556036082597
58 0.001226219338827
59 -0.000415667572484
60 0.000138555857495
61 -0.000045428149998
62 0.000014654241935
63 -0.000004652140297
64 0.000001453793843
65 -0.000000447321182
66 0.000000135551873
67 -0.000000040463246
68 0.000000011900955
69 -0.000000003449552
70 0.000000000985586
71 -0.000000000277630
72 0.000000000077119
73 -0.000000000021129
74 0.000000000005710
75 -0.000000000001523
76 0.000000000000401
77 -0.000000000000104
78 0.000000000000027
79 -0.000000000000007
80 0.000000000000002
81 -0.000000000000000
82 0.000000000000000
83 -0.000000000000000
84 0.000000000000000
85 -0.000000000000000
86 0.000000000000000
87 -0.000000000000000
88 0.000000000000000
exp(-20) from sum of the above terms = 0.000000004173637
True value of exp(-20) = 0.000000002061154
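The same experiment is easy to repeat outside MATLAB. The Python sketch below (ours, not the book's code) sums the Maclaurin series for exp(−20) directly, then applies one standard fix in the spirit of Section 3.8's "change the computational algorithm": compute 1/exp(20) from a series whose terms are all positive, so no cancellation occurs:

```python
import math

x = -20.0
term, direct = 1.0, 1.0
for k in range(1, 120):
    term *= x / k           # term = (-20)^k / k!
    direct += term          # terms as large as ~4.3e7 must cancel down to ~2e-9

term, recip = 1.0, 1.0
for k in range(1, 120):
    term *= -x / k          # 20^k / k!, all positive: no cancellation
    recip += term

true = math.exp(-20.0)      # ~2.061154e-9
print(direct)               # badly wrong: cancellation leaves mostly rounding noise
print(1.0 / recip)          # agrees closely with the true value
print(true)
```

The direct sum is wrong in roughly its first significant digit, while 1/exp(20) is accurate to near machine precision.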
REFERENCES
1. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York,
1978.
2. W. Rudin, Principles of Mathematical Analysis, 3rd ed., McGraw-Hill, New York, 1976.
3. J. E. Volder, "The CORDIC Trigonometric Computing Technique," IRE Trans. Electron.
Comput. EC-8, 330-334 (Sept. 1959).
4. J. S. Walther, "A Unified Algorithm for Elementary Functions," AFIPS Conf. Proc,
Vol. 38, 1971 Spring Joint Computer Conf., 379-385 May 18-20, 1971.
5. G. L. Haviland and A. A. Tuszynski, "A CORDIC Arithmetic Processor Chip," IEEE
Trans. Comput. C-29, 68-79 (Feb. 1980).
6. C. W. Schelin, "Calculator Function Approximation," Am. Math. Monthly 90, 317-325
(May 1983).
7. J.-M. Muller, "Discrete Basis and Computation of Elementary Functions," IEEE Trans.
Comput. C-34, 857-862 (Sept. 1985).
8. Y. H. Hu, "The Quantization Effects of the CORDIC Algorithm," IEEE Trans. Signal
Process. 40, 834-844 (April 1992).
9. Y. H. Hu and S. Naganathan, "An Angle Recoding Method for CORDIC Algorithm
Implementation," IEEE Trans. Comput. 42, 99-102 (Jan. 1993).
10. V. C. Hamacher, Z. G. Vranesic, and S. G. Zaky, Computer Organization, 3rd ed.,
McGraw-Hill, New York, 1990.
11. Y. H. Hu, "CORDIC-Based VLSI Architectures for Digital Signal Processing," IEEE
Signal Process. Mag. 9, 16-35 (July 1992).
12. D. Timmermann, H. Hahn, and B. J. Hosticka, "Low Latency Time CORDIC Algo-
rithms," IEEE Trans. Comput. 41, 1010-1015 (Aug. 1992).
13. J. Lee and T. Lang, "Constant-Factor Redundant CORDIC for Angle Calculation and
Rotation," IEEE Trans. Comput. 41, 1016-1025 (Aug. 1992).
14. K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, Wiley, New
York, 1979.
15. H.-Y. Lo and J.-L. Chen, "A Hardwired Generalized Algorithm for Generating the Log-
arithm Base-k by Iteration," IEEE Trans. Comput. C-36, 1363-1367 (Nov. 1987).
16. N. Mikami, M. Kobayashi, and Y. Yokoyama, "A New DSP-Oriented Algorithm for
Calculation of the Square Root Using a Nonlinear Digital Filter," IEEE Trans. Signal
Process. 40, 1663-1669 (July 1992).
17. G. G. Walter, Wavelets and Other Orthogonal Systems with Applications, CRC Press,
Boca Raton, FL, 1994.
18. I. S. Gradshteyn and I. M. Ryzhik, in Table of Integrals, Series and Products, A. Jeffrey,
ed., 5th ed., Academic Press, San Diego, CA, 1994.
19. L. Bers, Calculus: Preliminary Edition, Vol. 2, Holt, Rinehart, Winston, New York, 1967.
20. A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, 2nd ed.,
Addison-Wesley, Reading, MA, 1994.
21. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977.
22. M. R. Spiegel, Theory and Problems of Advanced Calculus (Schaum's Outline Series).
Schaum (McGraw-Hill), New York, 1963.
23. A. Papoulis, Signal Analysis, McGraw-Hill, New York, 1977.
24. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979.
25. W. D. Lakin and D. A. Sanchez, Topics in Ordinary Differential Equations, Dover Pub-
lications, New York, 1970.
PROBLEMS
3.1. Prove the following theorem: Every convergent sequence in a metric space
is a Cauchy sequence.
3.2. Let f_n(x) = x^n for n ∈ {1, 2, 3, ...} = N, and f_n(x) ∈ C[0, 1] for all n ∈ N.
(a) What is f(x) = lim_{n→∞} f_n(x) (for x ∈ [0, 1])?
(b) Is f(x) ∈ C[0, 1]?
3.3. The sequence (x_n) is defined by x_n = (n + 1)/(n + 2) for n ∈ Z⁺. Clearly, if X = [0, 1) ⊂ R, then x_n ∈ X for all n ∈ Z⁺. Assume that the metric for the metric space X is d(x, y) = |x − y| (x, y ∈ X).
(a) What is x = lim_{n→∞} x_n?
(b) Is X a complete space?
(c) Prove that (x_n) is Cauchy.
3.4. Recall Section 3.A.3, wherein the rotation operator was defined [Eq. (3.A.6)].
(a) Find an expression for the angle θ such that, for y ≠ 0,

    G(θ) [ x ]   [ cos θ   −sin θ ] [ x ]   [ x′ ]
         [ y ] = [ sin θ    cos θ ] [ y ] = [ 0  ],

where x′ is some arbitrary nonzero constant.
(b) Prove that G^{−1}(θ) = G^T(θ) [i.e., the inverse of G(θ) is given by its transpose].
(c) Consider the matrix

    A = [ 4  1  2 ]
        [ 1  4  1 ]
        [ 0  2  4 ].

Let 0_{n×m} denote an array (matrix) of zeros with n rows and m columns. Find G(θ₁) and G(θ₂) so that

    [ 1        0_{1×2} ] [ G(θ₁)    0_{2×1} ]
    [ 0_{2×1}  G(θ₂)   ] [ 0_{1×2}  1       ] A = Q^T A = R,

where R is an upper triangular matrix (defined in Section 4.5, if you do not recall what this is).
(d) Find Q^{−T} (the inverse of Q^T).
(Comment: The procedure illustrated by this problem is important in various applications, such as solving least-squares approximations, and in finding the eigenvalues and eigenvectors of matrices.)
3.5. Review Appendix 3.A. Suppose that we wish to rotate a vector [x y]^T ∈ R² through an angle θ = 25° ± 1°. Find n, and the required delta sequence {δ_k}, to achieve this accuracy.
3.6. Review Appendix 3.A. A certain CORDIC routine has the following pseudocode description:

    Input x and z (|z| ≤ 1);
    y_0 := 0; z_0 := z;
    for k := 0 to n − 1 do begin
        if z_k < 0 then begin
            δ_k := −1;
        end
        else begin
            δ_k := +1;
        end;
        y_{k+1} := y_k + δ_k x 2^{−k};
        z_{k+1} := z_k − δ_k 2^{−k};
    end;

The algorithm's output is y_n. What is y_n?
3.7. Suppose that

    x_n(t) = t/(n + t)

for n ∈ Z⁺ and t ∈ (0, 1) ⊂ R. Show that (x_n(t)) is uniformly convergent on S = (0, 1).
3.8. Suppose that u_n ≥ 0, and also that Σ_{n=0}^{∞} u_n converges. Prove that Π_{n=0}^{∞} (1 + u_n) converges.
[Hint: Recall Lemma 2.1 (of Chapter 2).]
3.9. Prove that lim_{n→∞} x^n = 0 if |x| < 1.
3.10. Prove that for a, b, x ∈ R

    | 1/(1 + |a − x|) − 1/(1 + |b − x|) | ≤ |a − b|.

(Comment: This inequality often appears in the context of convergence proofs for certain series expansions.)
3.11. Consider the function

    K_N(x) = (1/(N + 1)) Σ_{n=0}^{N} D_n(x)

[recall (3.24)]. Since D_n(x) is 2π-periodic, K_N(x) is also 2π-periodic. We may assume that x ∈ [−π, π].
(a) Prove that

    K_N(x) = (1/(N + 1)) · [1 − cos(N + 1)x] / (1 − cos x).

    [Hint: Consider Σ_{n=0}^{N} sin((n + ½)x) = Im[ Σ_{n=0}^{N} e^{j(n+½)x} ].]
(b) Prove that K_N(x) ≥ 0.
(c) Prove that

    (1/(2π)) ∫_{−π}^{π} K_N(x) dx = 1.

[Comment: The partial sums of the complex Fourier series expansion of the 2π-periodic function f(x) (again x ∈ [−π, π]) are given by

    f_N(x) = Σ_{n=−N}^{N} f_n e^{jnx},

where f_n = (1/(2π)) ∫_{−π}^{π} f(x) e^{−jnx} dx. Define

    σ_N(x) = (1/(N + 1)) Σ_{n=0}^{N} f_n(x).

It can be shown that

    σ_N(x) = (1/(2π)) ∫_{−π}^{π} f(x − t) K_N(t) dt.

It is also possible to prove that σ_N(x) → f(x) uniformly on [−π, π] if f(x) is continuous. This is often called Fejér's theorem.]
3.12. Repeat the analysis of Example 3.12 for f_L(0).
3.13. If f(t) ∈ L²(0, 2π) and f(t) = Σ_{n∈Z} f_n e^{jnt}, show that

    (1/(2π)) ∫₀^{2π} |f(t)|² dt = Σ_{n∈Z} |f_n|².
This relates the energy/power of f(t) to the energy/power of its Fourier
series coefficients.
[Comment: For example, if f(t) is one period of a 2π-periodic signal, then the power interpretation applies. In particular, suppose that f(t) = i(t) is a current waveform, and that this is the current into a resistor of resistance R; then the average power delivered to R is

    P_av = (1/(2π)) ∫₀^{2π} R i²(t) dt.]
3.14. Use the result of Example 1.20 to prove that

    Σ_{n=0}^{∞} (−1)^n / (2n + 1) = π/4.

3.15. Prove that

    (1/(2π)) ∫₀^{π} { sin[(L + ½)t] / sin[½ t] } dt = ½.
3.16. Use mathematical induction to prove that

    (1.1)^n ≥ 1 + n/10

for all n ∈ N.
3.17. Use mathematical induction to prove that

    (1 + h)^n ≥ 1 + nh

for all n ∈ N, with h > −1. (This is Bernoulli's inequality.)
3.18. Use mathematical induction to prove that 4^n + 2 is a multiple of 6 for all n ∈ N.
3.19. Conjecture a formula for (n ∈ N)

    s_n = Σ_{k=1}^{n} 1/[k(k + 1)],

and prove it using mathematical induction.
3.20. Suppose that f(x) = tan x. Use (3.55) to approximate f(x) for all x ∈ (−1, 1). Find an upper bound on |e(x)|, where e(x) is the error of the approximation.
3.21. Given f(x) = x² + 1, find all ξ ∈ (1, 2) such that

    f^(1)(ξ) = [f(2) − f(1)] / (2 − 1).
3.22. Use (3.65) to find an approximation to √(1 + x) for x ∈ (|, |). Find an upper bound on the magnitude of e(x).
3.23. Using a pocket calculator, compute (0.97)^{1/3} for n = 3 using (3.82). Find upper and lower bounds on the error e_{n+1}(x).
3.24. Using a pocket calculator, compute [1.05]^{1/4} using (3.82). Choose n = 2 (i.e., a quadratic approximation). Estimate the error involved in doing this using the error bound expression. Compare the bounds to the actual error.
3.25. Show that

    Σ_{k=0}^{n} (n choose k) = 2^n.
3.26. Show that

    j (n choose j) = n (n − 1 choose j − 1)

for j = 1, 2, ..., n − 1, n.
3.27. Show that for n ≥ 2

    Σ_{k=0}^{n} k (n choose k) = n 2^{n−1}.

3.28. For

    p_n(k) = (n choose k) p^k (1 − p)^{n−k},   0 < p < 1,

where k = 0, 1, ..., n, show that

    p_n(k + 1) = [(n − k)p] / [(k + 1)(1 − p)] · p_n(k).

[Comment: This recursive approach for finding p_n(k) extends the range of n for which p_n(k) may be computed before experiencing problems with numerical errors.]
3.29. Identity (3.103) was confirmed using an argument associated with Fig. 3.7. A somewhat different approach is the following. Confirm (3.103) by working with

    [ (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx ]² = (1/(2π)) ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x²+y²)/2} dx dy.

(Hint: Use the Cartesian-to-polar coordinate conversion x = r cos θ, y = r sin θ.)
3.30. Find the Maclaurin (infinite) series expansion for f(x) = sin⁻¹ x. What is the radius of convergence? [Hint: Theorem 3.7 and Eq. (3.82) are useful.]
3.31. From Eq. (3.85), for x > −1,

    log_e(1 + x) = Σ_{k=1}^{n} (−1)^{k−1} x^k / k + r(x),    (3.P.1)

where

    |r(x)| ≤ { x^{n+1}/(n + 1),                   x ≥ 0
             { |x|^{n+1}/[(1 − |x|)(n + 1)],      −1 < x < 0    (3.P.2)

[Eq. (3.86)]. For what range of values of x does the series

    log_e(1 + x) = Σ_{k=1}^{∞} (−1)^{k−1} x^k / k

converge? Explain using (3.P.2).
3.32. The following problems are easily worked with a pocket calculator.
(a) It can be shown that

    sin(x) = Σ_{k=0}^{n} [(−1)^k / (2k + 1)!] x^{2k+1} + e_{2n+3}(x),    (3.P.3)

where

    |e_{2n+3}(x)| ≤ |x|^{2n+3} / (2n + 3)!.    (3.P.4)

Use (3.P.4) to compute sin(x) to 3 decimal places of accuracy for x = 1.5 radians. How large does n need to be?
(b) Use the approximation

    log_e(1 + x) ≈ Σ_{n=1}^{N} (−1)^{n−1} x^n / n

to compute log_e(1.5) to three decimal places of accuracy. How large should N be to achieve this level of accuracy?
3.33. Assuming that

    (2/π) ∫₀^{π} (sin x / x) dx = 2 Σ_{r=0}^{∞} (−1)^r π^{2r} / [(2r + 1)(2r + 1)!],

show that for suitable n

    (2/π) ∫₀^{π} (sin x / x) dx ≥ 2 Σ_{r=0}^{2n+1} (−1)^r π^{2r} / [(2r + 1)(2r + 1)!],    (3.P.5)

and hence that

    (2/π) ∫₀^{π} (sin x / x) dx > 1.17.

What is the smallest n needed? Justify Eq. (3.P.5).
3.34. Using integration by parts, find the asymptotic expansion of

    c(x) = ∫_x^{∞} cos(t²) dt.

3.35. Using integration by parts, find the asymptotic expansion of

    s(x) = ∫_x^{∞} sin(t²) dt.
3.36. Use MATLAB to plot (on the same graph) the function K_N(x) in Problem 3.11 for N = 2, 4, 15.
4 Linear Systems of Equations
4.1 INTRODUCTION
The necessity to solve linear systems of equations is commonplace in numerical
computing. This chapter considers a few examples of how such problems arise
(more examples will be seen in subsequent chapters) and the numerical problems
that are frequently associated with attempts to solve them. We are particularly inter-
ested in the phenomenon of ill conditioning. We will largely concentrate on how
the problem arises, what its effects are, and how to test for this problem. 1 In addi-
tion to this, we will also consider methods of solving linear systems other than the
Gaussian elimination method that you most likely learned in an elementary linear
algebra course. 2 More specifically, we consider LU and QR matrix factorization
methods, and iterative methods of linear system solution. The concept of a singular
value decomposition (SVD) is also introduced.
We will often employ the term "linear systems" instead of the longer phrase
"linear systems of equations." However, the reader must be warned that the phrase
"linear systems" can have a different meaning from our present usage. In signals
and systems courses you will most likely see that a "linear system" is either a
continuous-time (i.e., analog) or discrete-time (i.e., digital) dynamic system whose
input/output (I/O) behavior satisfies superposition. However, such dynamic systems
can be described in terms of linear systems of equations.
4.2 LEAST-SQUARES APPROXIMATION AND LINEAR SYSTEMS
Suppose that f(x), g(x) ∈ L²[0, 1], and that these functions are real-valued. Recalling Chapter 1, their inner product is therefore given by

    ⟨f, g⟩ = ∫₀¹ f(x) g(x) dx.    (4.1)
¹Methods employed to avoid ill-conditioned linear systems of equations will be mainly considered in a later chapter. These chiefly involve working with orthogonal basis sets.
²Review the Gaussian elimination procedure now if necessary. It is mandatory that you recall the basic matrix and vector operations and properties [(AB)^T = B^T A^T, etc.] from elementary linear algebra.
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
Note that here we will assume that all members of L²[0, 1] are real-valued (for simplicity). Now assume that {φ_n(x) | n ∈ Z_N} form a linearly independent set such that φ_n(x) ∈ L²[0, 1] for all n. We wish to find the coefficients a_n such that

    f(x) ≈ Σ_{n=0}^{N−1} a_n φ_n(x)  for x ∈ [0, 1].    (4.2)

A popular approach to finding a_n [1] is to choose them to minimize the functional

    V(a) = || f(x) − Σ_{n=0}^{N−1} a_n φ_n(x) ||²,    (4.3)

where, for future convenience, we will treat a_n as belonging to the vector a = [a₀ a₁ ··· a_{N−2} a_{N−1}]^T ∈ R^N. Of course, in (4.3) ||f||² = ⟨f, f⟩ via (4.1). We are at liberty to think of

    e(x) = f(x) − Σ_{n=0}^{N−1} a_n φ_n(x)    (4.4)

as the error between f(x) and its approximation Σ_n a_n φ_n(x). So, our goal is to pick a to minimize ||e(x)||², which, we have previously seen, may be interpreted as the energy of the error e(x). This methodology of approximation is called least-squares approximation. The version we are now considering is only one of a great many variations. We will see others later.
We may rewrite (4.3) as follows:

    V(a) = ||f(x) − Σ_{n=0}^{N−1} aₙφₙ(x)||²
         = ⟨f − Σ_{n=0}^{N−1} aₙφₙ, f − Σ_{k=0}^{N−1} aₖφₖ⟩
         = ⟨f, f⟩ − ⟨f, Σ_{k=0}^{N−1} aₖφₖ⟩ − ⟨Σ_{n=0}^{N−1} aₙφₙ, f⟩ + ⟨Σ_{n=0}^{N−1} aₙφₙ, Σ_{k=0}^{N−1} aₖφₖ⟩
         = ||f||² − Σ_{k=0}^{N−1} aₖ⟨f, φₖ⟩ − Σ_{n=0}^{N−1} aₙ⟨φₙ, f⟩ + Σ_{n=0}^{N−1} Σ_{k=0}^{N−1} aₖaₙ⟨φₖ, φₙ⟩
         = ||f||² − 2Σ_{k=0}^{N−1} aₖ⟨f, φₖ⟩ + Σ_{n=0}^{N−1} Σ_{k=0}^{N−1} aₖaₙ⟨φₖ, φₙ⟩.    (4.5)
Naturally, we have made much use of the inner product properties of Chapter 1.
It is very useful to define

    ρ = ||f||²,  gₖ = ⟨f, φₖ⟩,  r_{n,k} = ⟨φₙ, φₖ⟩,    (4.6)

along with the vector

    g = [g₀ g₁ ··· g_{N−1}]ᵀ ∈ R^N,    (4.7a)

and the matrix

    R = [r_{n,k}] ∈ R^{N×N}.    (4.7b)

We immediately observe that R = Rᵀ (i.e., R is symmetric). This is by virtue of the fact that r_{n,k} = ⟨φₙ, φₖ⟩ = ⟨φₖ, φₙ⟩ = r_{k,n}. Immediately we may rewrite (4.5) in order to obtain the quadratic form

    V(a) = aᵀRa − 2aᵀg + ρ.    (4.8)
The quadratic form occurs very widely in optimization and approximation prob-
lems, and so warrants considerable study. An expanded view of R is
    R = [ ∫₀¹ φ₀²(x) dx            ∫₀¹ φ₀(x)φ₁(x) dx       ···  ∫₀¹ φ₀(x)φ_{N−1}(x) dx ]
        [ ∫₀¹ φ₀(x)φ₁(x) dx        ∫₀¹ φ₁²(x) dx           ···  ∫₀¹ φ₁(x)φ_{N−1}(x) dx ]
        [ ⋮                        ⋮                             ⋮                      ]    (4.9)
        [ ∫₀¹ φ₀(x)φ_{N−1}(x) dx   ∫₀¹ φ₁(x)φ_{N−1}(x) dx  ···  ∫₀¹ φ²_{N−1}(x) dx    ]
Note that the reader must get used to visualizing matrices and vectors in the general
manner now being employed. The practical use of linear/matrix algebra demands
this. Writing programs for anything involving matrix methods (which encompasses
a great deal) is almost impossible without this ability. Moreover, modern software
tools (e.g., MATLAB) assume that the user is skilled in this manner.
The quadratic form is a generalization of the familiar quadratic ax² + bx + c, for which x ∈ C (if we are interested in the roots of a quadratic equation; otherwise we usually consider x ∈ R), and also a, b, c ∈ R.
It is essential to us that R⁻¹ exist (i.e., we need R to be nonsingular). Fortunately, this is always the case because {φₙ | n ∈ Z_N} is an independent set. We shall prove this. If R is singular, then there is a set of coefficients αᵢ, not all zero, such that

    Σ_{i=0}^{N−1} αᵢ⟨φᵢ, φⱼ⟩ = 0    (4.10)

for j ∈ Z_N. This is equivalent to saying that the columns of R are linearly dependent. Now consider the function

    f(x) = Σ_{i=0}^{N−1} αᵢφᵢ(x),

so if (4.10) holds for all j ∈ Z_N, then

    ⟨f, φⱼ⟩ = ⟨Σ_{i=0}^{N−1} αᵢφᵢ, φⱼ⟩ = Σ_{i=0}^{N−1} αᵢ⟨φᵢ, φⱼ⟩ = 0

for all j ∈ Z_N. Thus

    Σ_{j=0}^{N−1} αⱼ [ Σ_{i=0}^{N−1} αᵢ⟨φᵢ, φⱼ⟩ ] = 0,

implying that

    ⟨Σ_{i=0}^{N−1} αᵢφᵢ, Σ_{j=0}^{N−1} αⱼφⱼ⟩ = 0,

or in other words, ⟨f, f⟩ = ||f||² = 0, and so f(x) = 0 for all x ∈ [0, 1]. This contradicts the assumption that {φₙ | n ∈ Z_N} is a linearly independent set. So R⁻¹ must exist.
From (4.3) it is clear that (via basic norm properties) V(a) ≥ 0 for all a ∈ R^N. If we now assume that f(x) = 0 (all x ∈ [0, 1]), we have ρ = 0, and g = 0 too. Thus, aᵀRa ≥ 0 for all a.
Definition 4.1: Positive Semidefinite Matrix, Positive Definite Matrix Suppose that A = Aᵀ and that A ∈ R^{n×n}. Suppose that x ∈ R^n. We say that A is positive semidefinite (psd) iff

    xᵀAx ≥ 0

for all x. We say that A is positive definite (pd) iff

    xᵀAx > 0

for all x ≠ 0.
If A is psd, we often symbolize this by writing A ≥ 0, and if A is pd, we often symbolize this by writing A > 0. If a matrix is pd then it is clearly psd, but the converse is not necessarily true.
So far it is clear that R ≥ 0. But in fact R > 0. This follows from the linear independence of the columns of R. If the columns of R were linearly dependent, then there would be an a ≠ 0 such that Ra = 0, but we have already shown that R⁻¹ exists, so it must be so that Ra = 0 iff a = 0. Immediately we conclude that R is positive definite.
Why is R > 0 so important? Recall that we may solve ax² + bx + c = 0 (x ∈ C) by completing the square:

    ax² + bx + c = a(x + b/(2a))² + c − b²/(4a).    (4.11)

Now if a > 0, then y(x) = ax² + bx + c has a unique minimum. Since y⁽¹⁾(x) = 2ax + b = 0 for x = −b/(2a), this choice of x forces the first term of (4.11) (right-hand side of the equality) to zero, and we see that the minimum value of y(x) is

    y(−b/(2a)) = c − b²/(4a) = (4ac − b²)/(4a).    (4.12)

Thus, completing the square makes the location of the minimum (if it exists), and the value of the minimum of a quadratic very obvious. For the same purpose we may complete the square of (4.8):

    V(a) = [a − R⁻¹g]ᵀR[a − R⁻¹g] + ρ − gᵀR⁻¹g.    (4.13)
It is quite easy to confirm that (4.13) gives (4.8) (so these equations must be equivalent):

    [a − R⁻¹g]ᵀR[a − R⁻¹g] + ρ − gᵀR⁻¹g
        = [aᵀ − gᵀ(R⁻¹)ᵀ][Ra − g] + ρ − gᵀR⁻¹g
        = aᵀRa − gᵀR⁻¹Ra − aᵀg + gᵀR⁻¹g + ρ − gᵀR⁻¹g
        = aᵀRa − gᵀa − aᵀg + ρ
        = aᵀRa − 2aᵀg + ρ.
We have used (R⁻¹)ᵀ = (Rᵀ)⁻¹ = R⁻¹, and the fact that aᵀg = gᵀa. If vector x = a − R⁻¹g, then

    [a − R⁻¹g]ᵀR[a − R⁻¹g] = xᵀRx.

So, because R > 0, it follows that xᵀRx > 0 for all x ≠ 0. The last two terms of (4.13) do not depend on a. So we can minimize V(a) only by minimizing the first term. R > 0 implies that this minimum must be for x = 0, implying a = â, where

    â − R⁻¹g = 0,

or

    Râ = g.    (4.14)
Thus

    â = arg min_{a∈R^N} V(a).    (4.15)

We see that to minimize ||e(x)||² we must solve a linear system of equations, namely, Eq. (4.14). We remark that for R > 0, the minimum of V(a) is at a unique location â ∈ R^N; that is, the minimum is unique.
In principle, solving least-squares approximation problems seems quite simple
because we have systematic (and numerically reliable) methods to solve (4.14)
(e.g., Gaussian elimination with partial pivoting). However, one apparent difficulty
is the need to determine various integrals:

    gₖ = ∫₀¹ f(x)φₖ(x) dx,  r_{n,k} = ∫₀¹ φₙ(x)φₖ(x) dx.

Usually, the independent set {φₖ | k ∈ Z_N} is chosen to make finding r_{n,k} relatively straightforward. In fact, sometimes nice closed-form expressions exist. But numerical integration is generally needed to find gₖ. Practically, this could involve applying series expansions such as considered in Chapter 3, or perhaps using quadratures such as will be considered in a later chapter. Other than this, there is a more serious problem. This is the problem that R might be ill-conditioned.
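The machinery above is easy to exercise numerically. The following sketch (not from the book; it assumes NumPy, and the choices of f(x) = eˣ, the monomial basis, and the midpoint quadrature are purely illustrative) builds R and g for a small basis on [0, 1] and solves Râ = g per Eq. (4.14).

```python
import numpy as np

# Sketch of Eqs. (4.6) and (4.14): least-squares fit of f(x) = exp(x)
# on [0, 1] in the basis phi_k(x) = x^k, k = 0..3. The Gram entries
# r_{n,k} = <phi_n, phi_k> = 1/(n + k + 1) are exact; the moments
# g_k = <f, phi_k> are computed by a simple midpoint quadrature.
N = 4
idx = np.arange(N)
R = 1.0 / (idx[:, None] + idx[None, :] + 1)   # symmetric Gram matrix

edges = np.linspace(0.0, 1.0, 2001)
mid = 0.5 * (edges[:-1] + edges[1:])          # midpoint-rule nodes
w = np.diff(edges)                            # subinterval widths
g = np.array([np.sum(w * mid**k * np.exp(mid)) for k in range(N)])

a_hat = np.linalg.solve(R, g)                 # solve R a = g, Eq. (4.14)

# Compare the approximation with f on the quadrature grid
f_hat = sum(a_hat[k] * mid**k for k in range(N))
max_err = np.max(np.abs(f_hat - np.exp(mid)))
print(max_err)  # a cubic least-squares fit of exp(x) is good to a few 1e-3
```

Note that the error measured here is the pointwise maximum, whereas (4.3) minimizes the error energy; the two are related but not identical.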
4.3 LEAST-SQUARES APPROXIMATION AND ILL-CONDITIONED
LINEAR SYSTEMS
A popular choice for an independent set {φₖ(x)} would be

    φₖ(x) = xᵏ  for x ∈ [0, 1], k ∈ Z_N.    (4.16)

Certainly, these functions belong to the inner product space L²[0, 1]. Thus, for f(x) ∈ L²[0, 1] an approximation to it is

    f̂(x) = Σ_{k=0}^{N−1} aₖxᵏ ∈ P^{N−1}[0, 1],    (4.17)

and so we wish to fit a degree N − 1 polynomial to f(x). Consequently

    gₖ = ∫₀¹ xᵏf(x) dx,    (4.18a)

which is sometimes called the kth moment of f(x) on [0, 1], and also

    r_{n,k} = ∫₀¹ φₙ(x)φₖ(x) dx = ∫₀¹ x^{n+k} dx = 1/(n + k + 1)    (4.18b)

for n, k ∈ Z_N.
The concept of a moment is also central to probability theory.
For example, suppose that N = 3 (i.e., a quadratic fit); then

    g = [ ∫₀¹ f(x) dx   ∫₀¹ xf(x) dx   ∫₀¹ x²f(x) dx ]ᵀ = [g₀ g₁ g₂]ᵀ    (4.19a)

and

    R = [ 1    1/2  1/3 ]   [ r₀₀ r₀₁ r₀₂ ]
        [ 1/2  1/3  1/4 ] = [ r₁₀ r₁₁ r₁₂ ] ,    (4.19b)
        [ 1/3  1/4  1/5 ]   [ r₂₀ r₂₁ r₂₂ ]

and a = [a₀ a₁ a₂]ᵀ, so we wish to solve

    [ 1    1/2  1/3 ] [ a₀ ]   [ ∫₀¹ f(x) dx   ]
    [ 1/2  1/3  1/4 ] [ a₁ ] = [ ∫₀¹ xf(x) dx  ] .    (4.20)
    [ 1/3  1/4  1/5 ] [ a₂ ]   [ ∫₀¹ x²f(x) dx ]
We remark that R does not depend on the "data" f(x), only the elements of g
do. This is true in general, and it can be used to advantage. Specifically, if f(x)
changes frequently (i.e., we must work with different data), but the independent
set does not change, then we need to invert R only once.
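This remark can be sketched in code (a NumPy illustration, not from the book): since R depends only on the basis, one factorization, here a single solver call with stacked right-hand sides, serves any number of data sets f(x).

```python
import numpy as np

# R for the monomial basis with N = 3 is the 3x3 matrix of (4.19b); it is
# fixed once the basis is fixed. Two different "data" functions give two
# different g vectors, solved against the same R in one call.
# Moments used: f(x) = 1 gives g = [1, 1/2, 1/3]^T (first column of R),
# and f(x) = x^2 gives g = [1/3, 1/4, 1/5]^T (last column of R).
N = 3
idx = np.arange(N)
R = 1.0 / (idx[:, None] + idx[None, :] + 1)

G = np.column_stack([R[:, 0], R[:, 2]])   # two right-hand sides at once
A_coef = np.linalg.solve(R, G)            # both solves share R's factorization

print(A_coef[:, 0])   # ~[1, 0, 0]: f(x) = 1 is exactly a_0 = 1
print(A_coef[:, 1])   # ~[0, 0, 1]: f(x) = x^2 is exactly a_2 = 1
```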
Matrix R in (4.19b) is a special case of the famous Hilbert matrix [2-4]. The
general form of this matrix is (for any N ∈ N)

    R = [ 1     1/2      1/3      ···  1/N      ]
        [ 1/2   1/3      1/4      ···  1/(N+1)  ]
        [ 1/3   1/4      1/5      ···  1/(N+2)  ] ∈ R^{N×N}.    (4.21)
        [ ⋮     ⋮        ⋮             ⋮        ]
        [ 1/N   1/(N+1)  1/(N+2)  ···  1/(2N−1) ]
Thus, (4.20) is a special case of a Hilbert linear system of equations. The matrix R
in (4.21) seems "harmless," but it is actually a menace from a numerical computing
standpoint. We now demonstrate this concept.
Suppose that our data are something very simple. Say that

    f(x) = 1  for all x ∈ [0, 1].

In this case gₖ = ∫₀¹ xᵏ dx = 1/(k+1). Therefore, for any N ∈ N, we are compelled to solve

    [ 1     1/2      1/3      ···  1/N      ] [ a₀      ]   [ 1   ]
    [ 1/2   1/3      1/4      ···  1/(N+1)  ] [ a₁      ]   [ 1/2 ]
    [ 1/3   1/4      1/5      ···  1/(N+2)  ] [ a₂      ] = [ 1/3 ] .    (4.22)
    [ ⋮     ⋮        ⋮             ⋮        ] [ ⋮       ]   [ ⋮   ]
    [ 1/N   1/(N+1)  1/(N+2)  ···  1/(2N−1) ] [ a_{N−1} ]   [ 1/N ]

A moment of thought reveals that solving (4.22) is trivial because g is the first column of R. Immediately, we see that

    a = [1 0 0 ··· 0 0]ᵀ.    (4.23)

(No other solution is possible since R⁻¹ exists, implying that Ra = g always possesses a unique solution.)
MATLAB implements Gaussian elimination (with partial pivoting) using the operator "\" to solve linear systems. For example, if we want x in Ax = y for which A⁻¹ exists, then x = A\y. MATLAB also computes R using function "hilb"; that is, R = hilb(N) will result in R being set to an N × N Hilbert matrix. Using the MATLAB "\" operator to solve for a in (4.22) gives the expected answer (4.23) for N ≤ 50 (at least). The computer-generated answers are correct to several decimal places. (Note that it is somewhat unusual to want to fit polynomials to data that are of such large degree.) So far, so good.
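The same experiment is easy to reproduce outside MATLAB. In the NumPy sketch below (an illustration; N = 12 is chosen to match the appendix cases discussed next), `np.linalg.solve` plays the role of MATLAB's "\", i.e., LAPACK Gaussian elimination with partial pivoting.

```python
import numpy as np

# The f(x) = 1 experiment: g is the FIRST column of the Hilbert matrix,
# so the exact solution of (4.22) is a = [1, 0, ..., 0]^T.
def hilbert(N):
    idx = np.arange(N)
    return 1.0 / (idx[:, None] + idx[None, :] + 1)

N = 12
R = hilbert(N)
g = R[:, 0].copy()           # moments of f(x) = 1: g_k = 1/(k + 1)
a = np.linalg.solve(R, g)

a_exact = np.zeros(N)
a_exact[0] = 1.0
err = np.max(np.abs(a - a_exact))
print(err)   # small, despite R being severely ill-conditioned
```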
Now consider the results in Appendix 4.A. The MATLAB function "inv" may be used to compute the inverse of matrices. The appendix shows R⁻¹ (computed via inv) for N = 10, 11, 12, and the MATLAB computed product RR⁻¹ for these cases. Of course, RR⁻¹ = I (identity matrix) is expected in all cases. For the number of decimal places shown, we observe that RR⁻¹ ≠ I. Not only that, but the error E = RR⁻¹ − I rapidly becomes large with an increase in N. For N = 12, the error is substantial. In fact, the MATLAB function inv has built-in features to warn of trouble, and it does so for case N = 12. Since RR⁻¹ is not being computed correctly, something has clearly gone wrong, and this has happened for rather small values of N. This is in striking contrast with the previous problem, where we wanted to compute a in (4.22). In this case, apparently, nothing went wrong.
We may consider changing our data to f(x) = x^{N−1}. In this case gₖ = 1/(N + k) for k ∈ Z_N. The vector g in this case will be the last column of R. Thus, mathematically, a = [0 0 ··· 0 1]ᵀ. If we use MATLAB "\" to compute a for this problem we obtain the computed solutions:

    a = [0.0000 0.0000 0.0000 0.0000 ...
         0.0002 −0.0006 0.0013 −0.0017 0.0014 −0.0007 1.0001]ᵀ    (N = 11),

    a = [0.0000 0.0000 0.0000 −0.0002 ... 0.0015 −0.0067
         0.0187 −0.0342 0.0403 −0.0297 0.0124 0.9978]ᵀ    (N = 12).
The errors in the computed solutions a here are much greater than those experienced
in computing a in (4.23).
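A NumPy sketch of the same comparison (illustrative; the exact digits will vary with the linear-algebra library in use) puts the two right-hand sides side by side: the first-column case is benign, while the last-column case visibly loses digits.

```python
import numpy as np

# f(x) = x^(N-1): g is the LAST column of the Hilbert matrix, and the
# exact solution is a = [0, ..., 0, 1]^T. Compare with f(x) = 1, whose
# exact solution is a = [1, 0, ..., 0]^T.
def hilbert(N):
    idx = np.arange(N)
    return 1.0 / (idx[:, None] + idx[None, :] + 1)

N = 12
R = hilbert(N)

a_first = np.linalg.solve(R, R[:, 0].copy())   # f(x) = 1 case
a_last = np.linalg.solve(R, R[:, -1].copy())   # f(x) = x^(N-1) case

e_first = np.zeros(N); e_first[0] = 1.0
e_last = np.zeros(N); e_last[-1] = 1.0

print(np.max(np.abs(a_first - e_first)))
print(np.max(np.abs(a_last - e_last)))   # typically around 1e-2, as in the text
```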
It turns out that the Hilbert matrix R is a classical example of an ill-conditioned
matrix (with respect to the problem of solving linear systems of equations). The
linear system in which it resides [i.e., Eq. (4.14)] is therefore an ill-conditioned
linear system. In such systems the final answer (which is a here) can be exquisitely
sensitive to very small perturbations (i.e., disturbances) in the inputs. The inputs
in this case are the elements of R and g. From Chapter 2 we remember that R
and g will not have an exact representation on the computer because of quantiza-
tion errors. Additionally, as the computation proceeds rounding errors will cause
further disturbances. The result is that in the end the final computed solution can
deviate enormously from the correct mathematical solution. On the other hand, we
have also shown that it is possible for the computed solution to be very close to
the mathematically correct solution even in the presence of ill conditioning. Our
problem then is to be able to detect when ill conditioning arises, and hence might
pose a problem.
4.4 CONDITION NUMBERS
In the previous section there appears to be a problem involved in accurately com-
puting the inverse of R (Hilbert matrix). This was attributed to the so-called ill
conditioning of R. We begin here with some simpler lower-order examples that
illustrate how the solution to a linear system Ax = y can depend sensitively on A
and y. This will lead us to develop a theory of condition numbers that warn us that
the solution x might be inaccurately computed due to this sensitivity.
We will consider Ax = y on the assumption that A ∈ R^{n×n}, and x, y ∈ R^n. Initially we will assume n = 2, so

    A = [ a₀₀  a₀₁ ] .    (4.24)
        [ a₁₀  a₁₁ ]
In practice, we may be uncertain about the accuracy of the entries of A and y.
Perhaps these entities originate from experimental data. So the entries may be
subject to experimental errors. Additionally, as previously mentioned, the elements
of A and y cannot normally be exactly represented on a computer because of the need to quantize their entries. Thus, we must consider the perturbed system

    ( [ a₀₀  a₀₁ ]   [ δa₀₀  δa₀₁ ] )       [ y₀ ]   [ δy₀ ]
    ( [ a₁₀  a₁₁ ] + [ δa₁₀  δa₁₁ ] ) x̂  =  [ y₁ ] + [ δy₁ ] ,    (4.25)

that is, [A + δA]x̂ = y + δy. The perturbations are δA and δy. We will assume that these are "small." As you might expect, the practical definition of "small" will force us to define and work with suitable norms. This is dealt with below. We further assume that the computing machine we use to solve (4.25) is a "magical machine" that computes without rounding errors. Thus, any errors in the computed solution, denoted x̂ here, can be due only to the perturbations δA and δy. It is our hope that x̂ ≈ x. Unfortunately, this will not always be so, even for n = 2 with small perturbations.
Because n is small, we may obtain closed-form expressions for A⁻¹, x, [A + δA]⁻¹, and x̂. More specifically

    A⁻¹ = (1/(a₀₀a₁₁ − a₀₁a₁₀)) [ a₁₁   −a₀₁ ] ,    (4.26)
                                [ −a₁₀  a₀₀  ]

and

    [A + δA]⁻¹ = 1/[(a₀₀ + δa₀₀)(a₁₁ + δa₁₁) − (a₀₁ + δa₀₁)(a₁₀ + δa₁₀)]
                 × [ a₁₁ + δa₁₁      −(a₀₁ + δa₀₁) ] .    (4.27)
                   [ −(a₁₀ + δa₁₀)   a₀₀ + δa₀₀   ]

The reader can confirm these by multiplying A⁻¹ as given in (4.26) by A. The 2 × 2 identity matrix should be obtained. Using these formulas, we may consider the following example.
Example 4.1 Suppose that A ∈ R^{2×2} has second column [−.01  .01]ᵀ. Nominally, the correct solution is

    x = [1  100]ᵀ.

Let us consider different perturbation cases:

1. Suppose that the only nonzero entry of δA is .005, and δy = [0  0]ᵀ. In this case

    x̂ = [1.1429  −85.7143]ᵀ.

2. Suppose that δA has nonzero entries .03, .01, and .02, and δy = [0  0]ᵀ. The perturbed matrix A + δA is mathematically singular, so it does not possess an inverse. If MATLAB tries to compute x̂ using (4.27), then we obtain

    x̂ = 1.0 × 10¹⁷ × [−0.0865  −8.6469]ᵀ,

and MATLAB issues a warning that the answer may not be correct. Obviously, this is truly a nonsense answer.

3. Suppose that the only nonzero entry of δA is −.02, and δy = [0.10  −0.05]ᵀ. In this case

    x̂ = [−1.1500  −325.0000]ᵀ.
It is clear that small perturbations of A and y can lead to large errors in the computed value for x. These errors are not a result of accumulated rounding errors in the computational algorithm for solving the problem. For computations on a "nonmagical" (i.e., "real") computer, this should be at least intuitively plausible since our formulas for x̂ are very simple in the sense of creating little opportunity for rounding error to grow (there are very few arithmetical operations involved). Thus, the errors x̂ − x must be due entirely (or nearly so) to uncertainties in the original inputs. We conclude that the real problem is that the linear system we are solving is too sensitive to perturbations in the inputs. This naturally raises the question of how we may detect such sensitivity.
In view of this, we shall say that a matrix A is ill-conditioned if the solution x
(in Ax = y) is very sensitive to perturbations on A and y. Otherwise, the matrix
is said to be well-conditioned.
We will need to introduce appropriate norms in order to objectively measure
the sizes of objects in our problem. However, before doing this we make a few
observations that give additional insight into the nature of the problem. In Example
4.1 we note that the first column of A is big (in some sense), while the second
column is small. The smallness of the second column makes A close to being
singular. A similar observation may be made about Hilbert matrices. For a general N × N Hilbert matrix, the last two columns are given by

    [ 1/(N−1)   1/N      ]
    [ 1/N       1/(N+1)  ]
    [ ⋮         ⋮        ]
    [ 1/(2N−2)  1/(2N−1) ]

For very large N, it is apparent that these two columns are almost linearly dependent; that is, one may be taken as close to being equal to the other. A simple numerical example is that 1/(2N−2) ≈ 1/(2N−1) when N is big. Thus, at least at the outset, it seems that ill-conditioned matrices are close to being singular, and that this is the root cause of the sensitivity problem.
We now need to extend our treatment of the concept of norms from what we
have seen in earlier chapters. Our main source is Golub and Van Loan [5], but
similar information is to be found in Ref. 3 or 4 (or the references cited therein).
A fairly rigorous treatment of matrix and vector norms can be found in Horn and
Johnson [6].
Suppose again that x ∈ R^n. The p-norm of x is defined to be

    ||x||_p = [ Σ_{k=0}^{n−1} |xₖ|^p ]^{1/p},    (4.28)

where p ≥ 1. The most important special cases are, respectively, the 1-norm, 2-norm, and ∞-norm:

    ||x||₁ = Σ_{k=0}^{n−1} |xₖ|,    (4.29a)

    ||x||₂ = [xᵀx]^{1/2},    (4.29b)

    ||x||_∞ = max_{0≤k≤n−1} |xₖ|.    (4.29c)
The operation "max" means to select the biggest |xₖ|. A unit vector with respect to norm || · || is a vector x such that ||x|| = 1. Note that if x is a unit vector with respect to one norm, then it is not necessarily a unit vector with respect to another choice of norm. For example, suppose that x = [√3/2  1/2]ᵀ; then

    ||x||₂ = 1,  ||x||₁ = (√3 + 1)/2,  ||x||_∞ = √3/2.

The vector x is a unit vector under the 2-norm, but is not a unit vector under the 1-norm or the ∞-norm.
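A quick numerical check of this example (a sketch assuming NumPy, whose `np.linalg.norm` implements the p-norms of (4.29)):

```python
import numpy as np

# x = [sqrt(3)/2, 1/2]^T from the example above: a unit vector in the
# 2-norm, but not in the 1-norm or the infinity-norm.
x = np.array([np.sqrt(3) / 2.0, 0.5])

n1 = np.linalg.norm(x, 1)         # |x_0| + |x_1| = (sqrt(3) + 1)/2
n2 = np.linalg.norm(x, 2)         # Euclidean length = 1
ninf = np.linalg.norm(x, np.inf)  # max |x_k| = sqrt(3)/2

print(n2, n1, ninf)
```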
Norms have various properties that we will list without proof. Assume that x, y ∈ R^n. For example, the Hölder inequality [recall (1.16) (in Chapter 1) for comparison] is

    |xᵀy| ≤ ||x||_p ||y||_q    (4.30)

for which 1/p + 1/q = 1. A special case is the Cauchy-Schwarz inequality

    |xᵀy| ≤ ||x||₂ ||y||₂,    (4.31)

which is a special instance of Theorem 1.1 (in Chapter 1). An important feature of norms is that they are equivalent. This means that if || · ||_α and || · ||_β are norms on R^n, then there are c₁, c₂ > 0 such that

    c₁||x||_α ≤ ||x||_β ≤ c₂||x||_α    (4.32)

for all x ∈ R^n. Some special instances of this are

    ||x||₂ ≤ ||x||₁ ≤ √n ||x||₂,    (4.33a)
    ||x||_∞ ≤ ||x||₂ ≤ √n ||x||_∞,    (4.33b)
    ||x||_∞ ≤ ||x||₁ ≤ n ||x||_∞.    (4.33c)
Equivalence is significant with respect to our problem in the following manner.
When we define condition numbers below, we shall see that the specific value of
the condition number depends in part on the choice of norm. However, equivalence
says that if a matrix is ill-conditioned with respect to one type of norm, then it
must be ill-conditioned with respect to any other type of norm. This can simplify
analysis in practice because it allows us to compute the condition number using
whatever norms are the easiest to work with. Equivalence can be useful in another
respect. If we have a sequence of vectors in the space R" , then, if the sequence is
Cauchy with respect to some chosen norm, it must be Cauchy with respect to any
other choice of norm. This can simplify convergence analysis, again because we
may pick the norm that is easiest to work with.
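The equivalence bounds (4.33) are easy to spot-check numerically. The sketch below (assuming NumPy; the random test vectors are arbitrary) verifies all three bounds over many vectors at once.

```python
import numpy as np

# Spot-check of (4.33a)-(4.33c) for random x in R^n. A tiny tolerance
# absorbs floating-point rounding in the norm computations.
rng = np.random.default_rng(0)
n, tol = 5, 1e-12
ok = True
for _ in range(200):
    x = rng.standard_normal(n)
    n1 = np.linalg.norm(x, 1)
    n2 = np.linalg.norm(x, 2)
    ninf = np.linalg.norm(x, np.inf)
    ok &= (n2 <= n1 + tol) and (n1 <= np.sqrt(n) * n2 + tol)    # (4.33a)
    ok &= (ninf <= n2 + tol) and (n2 <= np.sqrt(n) * ninf + tol)  # (4.33b)
    ok &= (ninf <= n1 + tol) and (n1 <= n * ninf + tol)         # (4.33c)
print(ok)  # True: every bound held for every trial vector
```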
In Chapter 2 we considered absolute and relative error in the execution of
floating-point operations. In this setting, operations were on scalars, and scalar
solutions were generated. Now we must redefine absolute and relative error for
vector quantities using the norms defined in the previous paragraph. Since x̂ ∈ R^n is the computed (i.e., approximate) solution to x ∈ R^n it is reasonable to define the absolute error to be

    ε_a = ||x̂ − x||,    (4.34a)

and the relative error is

    ε_r = ||x̂ − x|| / ||x||.    (4.34b)

Of course, x ≠ 0 is assumed here. The choice of norm is in principle arbitrary. However, if we use the ∞-norm, then the concept of relative error with respect to it can be made equivalent to a statement about the correct number of significant digits in x̂:

    ||x̂ − x||_∞ / ||x||_∞ ≈ 10^{−d}.    (4.35)

In other words, the largest element of the computed solution x̂ is correct to approximately d decimal digits. For example, suppose that x = [1.256 −2.554]ᵀ, and x̂ = [1.251 −2.887]ᵀ; then x̂ − x = [−0.005 −0.333]ᵀ, and so

    ||x̂ − x||_∞ = 0.333,  ||x||_∞ = 2.554,

so therefore ε_r = 0.1304 ≈ 10⁻¹. Thus, x̂ has a largest element that is accurate to about one decimal digit, but the smallest element is observed to be correct to about three significant digits.
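In code, the example just given reads (a NumPy sketch):

```python
import numpy as np

# The example above: the infinity-norm relative error of x_hat is ~1e-1,
# so the largest element of x_hat is good to about one decimal digit.
x = np.array([1.256, -2.554])
x_hat = np.array([1.251, -2.887])

abs_err = np.linalg.norm(x_hat - x, np.inf)    # = 0.333
rel_err = abs_err / np.linalg.norm(x, np.inf)  # 0.333/2.554 ~ 0.1304

print(abs_err, rel_err)
```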
Matrices can have norms defined on them. We have remarked that ill condi-
tioning seems to arise when a matrix is close to singular. Suitable matrix norms
can allow us to measure how close a matrix is to being singular, and thus gives
insight into its condition. Suppose that A, B ∈ R^{m×n} (so A and B are not necessarily square matrices). || · || : R^{m×n} → R is a matrix norm, provided the following axioms hold:

(MN1) ||A|| ≥ 0 for all A, and ||A|| = 0 iff A = 0.
(MN2) ||A + B|| ≤ ||A|| + ||B||.
(MN3) ||αA|| = |α| ||A||. Constant α is from the same field as the elements of the matrix A.

In the present context we usually consider α ∈ R. Extensions to complex-valued matrices and vectors are possible. The axioms above are essentially the same as for the norm in all other cases (see Definition 1.3 for comparison). The most common matrix norms are the Frobenius norm

    ||A||_F = [ Σ_{k=0}^{m−1} Σ_{l=0}^{n−1} |a_{kl}|² ]^{1/2}    (4.36a)
and the p-norms

    ||A||_p = sup_{x≠0} ||Ax||_p / ||x||_p.    (4.36b)

We see that in (4.36b) the matrix p-norm is dependent on the vector p-norm. Via (4.36b), we have

    ||Ax||_p ≤ ||A||_p ||x||_p.    (4.36c)

We may regard A as an operator applied to x that yields output Ax. Equation (4.36c) gives an upper bound on the size of the output, as we know the size of A and the size of x as given by their respective p-norms. Also, since A ∈ R^{m×n} it must be the case that x ∈ R^n, but y ∈ R^m. We observe that

    ||A||_p = sup_{x≠0} ||Ax||_p / ||x||_p = max_{||x||_p=1} ||Ax||_p.    (4.37)

This is an alternative means to compute the matrix p-norm: Evaluate ||Ax||_p at all points on the unit sphere, which is the set of vectors {x : ||x||_p = 1}, and then pick the largest value of ||Ax||_p. Note that the term "sphere" is an extension of what we normally mean by a sphere. For the 2-norm in n dimensions, the unit sphere is clearly

    ||x||₂ = [x₀² + x₁² + ··· + x²_{n−1}]^{1/2} = 1.    (4.38)

This represents our intuitive (i.e., Euclidean) notion of a sphere. But, say, for the 1-norm the unit sphere is

    ||x||₁ = |x₀| + |x₁| + ··· + |x_{n−1}| = 1.    (4.39)

Equations (4.38) and (4.39) specify very different looking surfaces in n-dimensional space. A suggested exercise is to sketch these spheres for n = 2.
As with vector norms, matrix norms have various properties. One property possessed by the matrix p-norms is called the submultiplicative property:

    ||AB||_p ≤ ||A||_p ||B||_p,  A ∈ R^{m×n}, B ∈ R^{n×q}.    (4.40)

(The reader is warned that not all matrix norms possess this property; a counterexample appears on p. 57 of Golub and Van Loan [5]). A miscellany of other properties (including equivalences) is

    ||A||₂ ≤ ||A||_F ≤ √n ||A||₂,    (4.41a)

    max_{i,j} |a_{ij}| ≤ ||A||₂ ≤ √(mn) max_{i,j} |a_{ij}|,    (4.41b)
TLFeBOOK
142 LINEAR SYSTEMS OF EQUATIONS
    ||A||₁ = max_{j∈Z_n} Σ_{i=0}^{m−1} |a_{ij}|,    (4.41c)

    ||A||_∞ = max_{i∈Z_m} Σ_{j=0}^{n−1} |a_{ij}|,    (4.41d)

    (1/√n) ||A||_∞ ≤ ||A||₂ ≤ √m ||A||_∞,    (4.41e)

    (1/√m) ||A||₁ ≤ ||A||₂ ≤ √n ||A||₁.    (4.41f)

The equivalences [e.g., (4.41a) and (4.41b)] have the same significance for matrices as the analogous equivalences for vectors seen in (4.32) and (4.33).
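Formulas (4.41c,d) translate directly into code. This NumPy sketch (the test matrix is arbitrary) checks the column-sum and row-sum rules against the library's built-in operator norms.

```python
import numpy as np

# ||A||_1 = largest absolute COLUMN sum, Eq. (4.41c);
# ||A||_inf = largest absolute ROW sum, Eq. (4.41d).
A = np.array([[1.0, -2.0, 3.0],
              [4.0, 5.0, -6.0]])

n1 = np.abs(A).sum(axis=0).max()     # column sums: [5, 7, 9] -> 9
ninf = np.abs(A).sum(axis=1).max()   # row sums: [6, 15] -> 15

print(n1, np.linalg.norm(A, 1))         # 9.0 9.0
print(ninf, np.linalg.norm(A, np.inf))  # 15.0 15.0
```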
From (4.41c,d) we see that computing matrix 1-norms and ∞-norms is easy. However, computing matrix 2-norms is not easy. Consider (4.37) with p = 2:

    ||A||₂ = max_{||x||₂=1} ||Ax||₂.    (4.42)

Let R = AᵀA ∈ R^{n×n} (no, R is not a Hilbert matrix here; we have "recycled" the symbol for another use), so then

    ||Ax||₂² = xᵀAᵀAx = xᵀRx.    (4.43)

Now consider n = 2. Thus

    xᵀRx = [x₀ x₁] [ r₀₀  r₀₁ ] [ x₀ ] ,    (4.44)
                   [ r₁₀  r₁₁ ] [ x₁ ]

where r₀₁ = r₁₀ because R = Rᵀ. The vectors and matrix in (4.44) multiply out to become

    xᵀRx = r₀₀x₀² + 2r₀₁x₀x₁ + r₁₁x₁².    (4.45)

Since ||A||₂² = max_{||x||₂=1} ||Ax||₂², we may find ||A||₂² by maximizing (4.45) subject to the equality constraint ||x||₂² = 1, i.e., xᵀx = x₀² + x₁² = 1. This problem may be solved by using Lagrange multipliers (considered somewhat more formally in Section 8.5). Thus, we must maximize

    V(x) = xᵀRx − λ[xᵀx − 1],    (4.46)

where λ is the Lagrange multiplier. Since

    V(x) = r₀₀x₀² + 2r₀₁x₀x₁ + r₁₁x₁² − λ[x₀² + x₁² − 1],
we have

    ∂V(x)/∂x₀ = 2r₀₀x₀ + 2r₀₁x₁ − 2λx₀ = 0,
    ∂V(x)/∂x₁ = 2r₀₁x₀ + 2r₁₁x₁ − 2λx₁ = 0,

and these equations may be rewritten in matrix form as

    [ r₀₀  r₀₁ ] [ x₀ ]     [ x₀ ]
    [ r₁₀  r₁₁ ] [ x₁ ] = λ [ x₁ ] .    (4.47)

In other words, Rx = λx. Thus, the optimum choice of x is an eigenvector of R = AᵀA. But which eigenvector is it?
First note that A⁻¹ exists (by assumption), so xᵀRx = xᵀAᵀAx = (Ax)ᵀ(Ax) > 0 for all x ≠ 0. Therefore, R > 0. Additionally, R = Rᵀ, so all of the eigenvalues of R are real numbers.⁵ Furthermore, because R > 0, all of its eigenvalues are positive. This follows if we consider Rx = λx, and assume that λ < 0. In this case xᵀRx = λxᵀx = λ||x||₂² < 0 for any x ≠ 0. [If λ = 0, then Rx = 0 · x = 0 implies that x = 0 (as R⁻¹ exists), so xᵀRx = 0.] But this contradicts the assumption that R > 0, and so all of the eigenvalues of R must be positive. Now, since ||Ax||₂² = xᵀAᵀAx = xᵀRx = xᵀ(λx) = λ||x||₂², and since ||x||₂ = 1, it must be the case that ||Ax||₂² is biggest for the eigenvector of R corresponding to the biggest eigenvalue of R. If the eigenvalues of R are denoted λ₁ and λ₀ with λ₁ ≥ λ₀ > 0, then finally we must have

    ||A||₂² = λ₁.    (4.48)

This argument can be generalized for all n > 2. If R > 0, we assume that all of its eigenvalues are distinct (this is not always true). If we denote them by λ₀, λ₁, ..., λ_{n−1}, then we may arrange them in decreasing order:

    λ_{n−1} > λ_{n−2} > ··· > λ₁ > λ₀ > 0.    (4.49)

Therefore, for A ∈ R^{n×n}

    ||A||₂² = λ_{n−1}.    (4.50)
The problem of computing the eigenvalues and eigenvectors of a matrix has its own
special numerical difficulties. At this point we warn the reader that these problems
must never be treated lightly.
⁵If A is a real-valued symmetric square matrix, then we may prove this claim as follows. Suppose that for eigenvector x of A, the eigenvalue is λ, that is, Ax = λx. Now ((Ax)*)ᵀ = ((λx)*)ᵀ, and so (x*)ᵀAᵀ = λ*(x*)ᵀ. Therefore, (x*)ᵀAᵀx = λ*(x*)ᵀx. But (x*)ᵀAᵀx = (x*)ᵀAx = λ(x*)ᵀx, so finally λ*(x*)ᵀx = λ(x*)ᵀx, so we must have λ = λ*. This can be true only if λ ∈ R.
Example 4.2 Let det(A) denote the determinant of A. Suppose that R = AᵀA, where

    R = [ 1    0.5 ] .
        [ 0.5  1   ]

We will find ||A||₂. Consider Rx = λx. We must solve det(λI − R) = 0 for λ. [Recall that det(λI − R) is the characteristic polynomial of R.] Thus

    det(λI − R) = det [ λ−1   −0.5 ] = (λ − 1)² − 1/4 = λ² − 2λ + 3/4 = 0,
                      [ −0.5  λ−1  ]

for which

    λ = [−(−2) ± √((−2)² − 4 · (3/4))] / 2 = (2 ± 1)/2.

So, λ₁ = 3/2, λ₀ = 1/2. Thus, ||A||₂² = λ₁ = 3/2, and so finally

    ||A||₂ = √(3/2).

(We do not need the eigenvectors of R to compute the 2-norm of A.)
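Example 4.2 can be confirmed numerically (a NumPy sketch; `eigvalsh` returns the eigenvalues of a symmetric matrix in ascending order):

```python
import numpy as np

# R = A^T A from Example 4.2. Its eigenvalues are 1/2 and 3/2, so
# ||A||_2 = sqrt(3/2) by Eqs. (4.48)/(4.50).
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])

lam = np.linalg.eigvalsh(R)   # ascending: [0.5, 1.5]
norm2 = np.sqrt(lam[-1])      # ||A||_2 = sqrt(largest eigenvalue)

print(lam, norm2)
```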
We see that the essence of computing the 2-norm of matrix A is to find the zeros
of the characteristic polynomial of A T A. The problem of finding polynomial zeros
is the subject of a later chapter. Again, this problem has its own special numerical
difficulties that must never be treated lightly.
We now derive the condition number. Begin by assuming that A ∈ R^{n×n}, and that A⁻¹ exists. The error between computed solution x̂ to Ax = y and x is

    e = x − x̂.    (4.51)

Ax = y, but Ax̂ ≠ y in general. So we may define the residual

    r = y − Ax̂.    (4.52)

We see that

    Ae = Ax − Ax̂ = y − Ax̂ = r.    (4.53)

Thus, e = A⁻¹r. We observe that if e = 0, then r = 0, but if r is small, then e is not necessarily small because A⁻¹ might be big, making A⁻¹r big. In other words, a small residual r does not guarantee that x̂ is close to x. Sometimes r is computed as a cursory check to see if x̂ is "reasonable." The main advantage of r is that it may always be computed, whereas x is not known in advance and so e may never be computed exactly. Below it will be shown that considering r in combination with a condition number is a more reliable method of assessing how close x̂ is to x.
Now, since e = A⁻¹r, we can say that ||e||_p = ||A⁻¹r||_p ≤ ||A⁻¹||_p ||r||_p [via (4.36c)]. Similarly, since r = Ae, we have ||r||_p = ||Ae||_p ≤ ||A||_p ||e||_p. Thus

    ||r||_p / ||A||_p ≤ ||e||_p ≤ ||A⁻¹||_p ||r||_p.    (4.54)

Similarly, x = A⁻¹y, so immediately

    ||y||_p / ||A||_p ≤ ||x||_p ≤ ||A⁻¹||_p ||y||_p.    (4.55)

If ||x||_p ≠ 0, and ||y||_p ≠ 0, then taking reciprocals in (4.55) yields

    1 / (||A⁻¹||_p ||y||_p) ≤ 1 / ||x||_p ≤ ||A||_p / ||y||_p.    (4.56)

We may multiply corresponding terms in (4.56) and (4.54) to obtain

    [1 / (||A⁻¹||_p ||A||_p)] (||r||_p / ||y||_p) ≤ ||e||_p / ||x||_p ≤ ||A⁻¹||_p ||A||_p (||r||_p / ||y||_p).    (4.57)

We recall from (4.34b) that ε_r = ||x − x̂||_p / ||x||_p = ||e||_p / ||x||_p, so

    [1 / (||A⁻¹||_p ||A||_p)] (||r||_p / ||y||_p) ≤ ε_r ≤ ||A⁻¹||_p ||A||_p (||r||_p / ||y||_p).    (4.58)

We call

    ||r||_p / ||y||_p = ||y − Ax̂||_p / ||y||_p    (4.59)

the relative residual. We define

    κ_p(A) = ||A||_p ||A⁻¹||_p    (4.60)

to be the condition number of A. It is immediately apparent that κ_p(A) ≥ 1 for any A and valid p. We see that ε_r is between 1/κ_p(A) and κ_p(A) times the relative residual. In particular, if κ_p(A) >> 1 (i.e., if the condition number is very large), even if the relative residual is tiny, then ε_r might be large. On the other hand, if κ_p(A) is close to unity, then ε_r will be small if the relative residual is small. In conclusion, if κ_p(A) is large, it is a warning (not a certainty) that small
perturbations in A and y may cause x̂ to differ greatly from x. Equivalently, if κ_p(A) is large, then a small r does not imply that x̂ is close to x.
A rule of thumb in interpreting condition numbers is as follows [3, p. 229], and is more or less true regardless of p in (4.60). If κ_p(A) ≈ d × 10^k, where d is a decimal digit from one to nine, we can expect to lose (at worst) about k digits of accuracy. The reason that p does not matter too much is because we recall that matrix norms are equivalent. Therefore, for this rule of thumb to be useful, the working precision of the computing machine/software package must be known. For example, MATLAB computes to about 16 decimal digits of precision. Thus, k ≥ 16 would give us concern that x̂ is not close to x.
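The rule of thumb is easy to watch in action on the Hilbert matrices of Section 4.3 (a NumPy sketch; `np.linalg.cond` computes κ₂ via the singular value decomposition):

```python
import numpy as np

# kappa_2 of the N x N Hilbert matrix grows extremely fast with N. By the
# rule of thumb, N = 12 (kappa ~ 1e16) exhausts essentially all 16 decimal
# digits of double precision, matching the behavior seen in Section 4.3.
def hilbert(N):
    idx = np.arange(N)
    return 1.0 / (idx[:, None] + idx[None, :] + 1)

for N in (4, 8, 12):
    kappa = np.linalg.cond(hilbert(N), 2)
    digits_lost = int(np.log10(kappa))   # the "k" in kappa ~ d x 10^k
    print(N, kappa, digits_lost)
```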
Example 4.3 Suppose that

    A = [ 1  1−ε ] ∈ R^{2×2},  |ε| << 1.
        [ 1  1   ]

We will determine an estimate of κ₁(A). Clearly

    A⁻¹ = (1/ε) [ 1   −1+ε ] = [ b₀₀  b₀₁ ] .
                [ −1  1    ]   [ b₁₀  b₁₁ ]

We have

    Σ_{i=0}^{1} |a_{i,0}| = |a₀₀| + |a₁₀| = 2,  Σ_{i=0}^{1} |a_{i,1}| = |a₀₁| + |a₁₁| = |1 − ε| + 1,

so via (4.41c), ||A||₁ = max{2, |1 − ε| + 1} ≈ 2. Similarly

    Σ_{i=0}^{1} |b_{i,0}| = |b₀₀| + |b₁₀| = 2/|ε|,  Σ_{i=0}^{1} |b_{i,1}| = |b₀₁| + |b₁₁| = (|−1 + ε| + 1)/|ε|,

so again via (4.41c), ||A⁻¹||₁ = max{2/|ε|, (|−1 + ε| + 1)/|ε|} ≈ 2/|ε|. Thus

    κ₁(A) ≈ 4/|ε|.

We observe that if ε = 0, then A⁻¹ does not exist, so our approximation to κ₁(A) is a reasonable result because

    lim_{ε→0} κ₁(A) = ∞.
We may wish to compute κ_2(A) = ||A||_2 ||A^{−1}||_2. We will suppose that A ∈
R^{n×n} and that A^{−1} exists. But we recall that computing matrix 2-norms involves
finding eigenvalues. More specifically, ||A||_2^2 is the largest eigenvalue of R = A^T A
[recall (4.50)]. Suppose, as in (4.49), that λ_0 is the smallest eigenvalue of R, for
which the corresponding eigenvector is denoted by v, that is, Rv = λ_0 v. Then we
observe that R^{−1}v = (1/λ_0)v. In other words, 1/λ_0 is an eigenvalue of R^{−1}. By similar
reasoning, 1/λ_k for k ∈ Z_n must all be eigenvalues of R^{−1}. Thus, 1/λ_0 will be the
biggest eigenvalue of R^{−1}. For present simplicity assume that A is a normal matrix.
This means that AA^T = A^T A = R. The reader is cautioned that not all matrices A
are normal. However, in this case we have R^{−1} = A^{−1}A^{−T} = A^{−T}A^{−1}. [Recall
that (A^{−1})^T = (A^T)^{−1} = A^{−T}.] We have that ||A^{−1}||_2^2 is the largest eigenvalue of
A^{−T}A^{−1}, but R^{−1} = A^{−T}A^{−1} since A is assumed normal. The largest eigenvalue
of R^{−1} has been established to be 1/λ_0, so it must be the case that for a normal
matrix A (real-valued and invertible)

    κ_2(A) = √(λ_{n−1}/λ_0),   (4.61)

where λ_{n−1} is the largest eigenvalue of R; that is, A is ill-conditioned if the ratio of the biggest to smallest eigenvalue of
A^T A is large. In other words, a large eigenvalue spread is associated with matrix
ill conditioning. It turns out that this conclusion holds even if A is not normal; that
is, (4.61) is valid even if A is not normal. But we will not prove this. (The interested
reader can see pp. 312 and 340 of Horn and Johnson [6] for more information.)
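As a numerical sanity check on (4.61) in its general form, one can compare √(λ_max/λ_min) for the eigenvalues of A^T A against a library-computed 2-norm condition number. A sketch (assuming NumPy; the test matrix is an arbitrary random one, not from the text) follows:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))   # generic (not necessarily normal) matrix

R = A.T @ A                        # R = A^T A, symmetric positive definite here
lam = np.linalg.eigvalsh(R)        # eigenvalues of R in ascending order
kappa2 = np.sqrt(lam[-1] / lam[0]) # sqrt(biggest/smallest eigenvalue), (4.61)

print(kappa2, np.linalg.cond(A, 2))  # the two values agree
```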
An obvious difficulty with condition numbers is that their exact calculation often
seems to require knowledge of A^{−1}. Clearly this is problematic since computing
A^{−1} accurately may not be easy or possible (because of ill conditioning). We seem
to have a "chicken and egg" problem. This problem is often dealt with by using
condition number estimators. This in turn generally involves placing bounds on
condition numbers. But the subject of condition number estimation is not within
the scope of this book. The interested reader might consult Higham [7] for further
information on this subject if desired. There is some information on this matter in
the treatise by Golub and Van Loan [5, pp. 128–130], which includes a pseudocode
algorithm for ∞-norm condition number estimation of an upper triangular nonsingular
matrix. We remark that ||A||_2 is sometimes called the spectral norm of A,
and is actually best computed using entities called singular values [5, p. 72]. This
is because computing singular values avoids the necessity of computing A^{−1}, and
can be done in a numerically reliable manner. Singular values will be discussed in
more detail later.
We conclude this section with a remark about the Hilbert matrix R of Section 4.3.
As discussed by Hill [3, p. 232], we have
    κ_2(R) ∝ e^{aN}

for some a > 0. (Recall that the symbol ∝ means "proportional to.") Proving this is
tough, and we will not attempt it. Thus, the condition number of R grows very
rapidly with N, which explains why the attempt to invert R in Appendix 4.A failed
for so small a value of N.
4.5 LU DECOMPOSITION
In this section we will assume A ∈ R^{n×n}, and that A^{−1} exists. Many algorithms
to solve Ax = y work by factoring the matrix A in various ways. In this section
we consider a Gaussian elimination approach to writing A as

    A = LU,   (4.62)

where L is a nonsingular lower triangular matrix, and U is a nonsingular upper
triangular matrix. This is the LU decomposition (factorization) of A. Naturally,
L, U ∈ R^{n×n}, and L = [l_{i,j}], U = [u_{i,j}]. Since these matrices are lower and upper
triangular, respectively, it must be the case that

    l_{i,j} = 0 for j > i  and  u_{i,j} = 0 for j < i.   (4.63)

For example, the following are (respectively) lower and upper triangular matrices:

    L = [ 1  0  0 ]      U = [ 1  2  3 ]
        [ 1  1  0 ],         [ 0  4  5 ].
        [ 1  1  1 ]          [ 0  0  6 ]

These matrices are clearly nonsingular since their determinants are 1 and 24, respectively.
In fact, L is nonsingular iff l_{i,i} ≠ 0 for all i, and U is nonsingular iff u_{j,j} ≠ 0
for all j. We note that with A factored as in (4.62), the solution of Ax = y becomes
quite easy, but the details of this will be considered later. We now concentrate on
finding the factors L, U.
We begin by defining a Gauss transformation matrix G_k such that

    G_k x = G_k [ x_0 ··· x_{k−1}  x_k ··· x_{n−1} ]^T = [ x_0 ··· x_{k−1}  0 ··· 0 ]^T,   (4.64)

where G_k is obtained from the n × n identity matrix by placing −τ_i^k in row i of
column k − 1 for i = k, ..., n − 1, with

    τ_i^k = x_i / x_{k−1},   i = k, k + 1, ..., n − 1.   (4.65)

The superscript k on τ_i^k does not denote raising τ_i to a power; it is simply part of
the name of the symbol. This naming convention is needed to account for the fact
that there is a different set of τ values for every G_k. For this to work requires that
x_{k−1} ≠ 0. Equation (4.65) followed from considering the matrix–vector product
in (4.64):

    −τ_k^k x_{k−1} + x_k = 0,
    −τ_{k+1}^k x_{k−1} + x_{k+1} = 0,
       ⋮
    −τ_{n−1}^k x_{k−1} + x_{n−1} = 0.
We observe that G_k is "designed" to annihilate the last n − k elements of vector
x. We also see that G_k is lower triangular, and, if it exists, always possesses an
inverse because the main diagonal elements are all equal to unity. A lower triangular
matrix where all of the main diagonal elements are equal to unity is called unit
lower triangular. Similar terminology applies to upper triangular matrices. Define
the kth Gauss vector

    τ^k = [ 0 ··· 0  τ_k^k ··· τ_{n−1}^k ]^T   (k zeros).   (4.66)

The kth unit vector is

    e_k = [ 0 ··· 0  1  0 ··· 0 ]^T   (k zeros, then n − k − 1 zeros).   (4.67)

If I is an n × n identity matrix, then

    G_k = I − τ^k e_{k−1}^T   (4.68)

for k = 1, 2, ..., n − 1. For example, if n = 4, we have

    G_1 = [  1      0  0  0 ]      G_2 = [ 1   0      0  0 ]      G_3 = [ 1  0   0      0 ]
          [ −τ_1^1  1  0  0 ],           [ 0   1      0  0 ],           [ 0  1   0      0 ]
          [ −τ_2^1  0  1  0 ]            [ 0  −τ_2^2  1  0 ]            [ 0  0   1      0 ].   (4.69)
          [ −τ_3^1  0  0  1 ]            [ 0  −τ_3^2  0  1 ]            [ 0  0  −τ_3^3  1 ]
The Gauss transformation matrices may be applied to A, yielding an upper triangular
matrix. This is illustrated by the following example.
Example 4.4 Suppose that

    A = [  1  2  3  4 ]
        [ −1  1  2  1 ]
        [  0  2  1  3 ]   (= A^0).
        [  0  0  1  1 ]

We introduce matrices A^k, where A^k = G_k A^{k−1} for k = 1, 2, ..., n − 1, and finally
U = A^{n−1}. Once again, A^k is not the kth power of A, but rather denotes the kth
matrix in a sequence of matrices. Now consider

    G_1 A^0 = [ 1  0  0  0 ] [  1  2  3  4 ]   [ 1  2  3  4 ]
              [ 1  1  0  0 ] [ −1  1  2  1 ] = [ 0  3  5  5 ] = A^1,
              [ 0  0  1  0 ] [  0  2  1  3 ]   [ 0  2  1  3 ]
              [ 0  0  0  1 ] [  0  0  1  1 ]   [ 0  0  1  1 ]

for which the τ_i^1 entries in the first column of G_1 depend on the first column of
A^0 (i.e., of A) according to (4.65). Similarly

    G_2 A^1 = [ 1   0    0  0 ] [ 1  2  3  4 ]   [ 1  2   3     4   ]
              [ 0   1    0  0 ] [ 0  3  5  5 ] = [ 0  3   5     5   ] = A^2,
              [ 0  −2/3  1  0 ] [ 0  2  1  3 ]   [ 0  0  −7/3  −1/3 ]
              [ 0   0    0  1 ] [ 0  0  1  1 ]   [ 0  0   1     1   ]

for which the τ_i^2 entries in the second column of G_2 depend on the second column
of A^1, and also

    G_3 A^2 = [ 1  0   0   0 ] [ 1  2   3     4   ]   [ 1  2   3     4   ]
              [ 0  1   0   0 ] [ 0  3   5     5   ] = [ 0  3   5     5   ] = U,
              [ 0  0   1   0 ] [ 0  0  −7/3  −1/3 ]   [ 0  0  −7/3  −1/3 ]
              [ 0  0  3/7  1 ] [ 0  0   1     1   ]   [ 0  0   0     6/7 ]

for which the τ_i^3 entries in the third column of G_3 depend on the third column of
A^2. We see that U is indeed upper triangular, and it is also nonsingular. We also
see that

    U = G_3 G_2 G_1 A.

Since the product of lower triangular matrices is a lower triangular matrix, it is the
case that L_1 = G_3 G_2 G_1 is lower triangular. Thus

    A = L_1^{−1} U.

Since the inverse (if it exists) of a lower triangular matrix is also a lower triangular
matrix, we can define L = L_1^{−1}, and so A = LU. Thus

    L = G_1^{−1} G_2^{−1} G_3^{−1}.
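The elimination steps of Example 4.4 can be reproduced in a few lines. This sketch (assuming NumPy; the variable names are ours, not the book's) builds each G_k from (4.65) and accumulates L from the τ vectors as in (4.76):

```python
import numpy as np

A = np.array([[1., 2., 3., 4.],
              [-1., 1., 2., 1.],
              [0., 2., 1., 3.],
              [0., 0., 1., 1.]])

n = A.shape[0]
Ak = A.copy()
L = np.eye(n)
for k in range(1, n):                  # k = 1, ..., n-1
    G = np.eye(n)
    tau = Ak[k:, k-1] / Ak[k-1, k-1]   # tau_i^k = a_{i,k-1}/a_{k-1,k-1}
    G[k:, k-1] = -tau                  # G_k = I - tau^k e_{k-1}^T, (4.68)
    L[k:, k-1] = tau                   # L collects the tau vectors, (4.76)
    Ak = G @ Ak                        # A^k = G_k A^{k-1}
U = Ak

print(U)                     # upper triangular; last pivot is 6/7
print(np.allclose(L @ U, A)) # True
```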
From this example it appears that we need to do much work in order to find G_k^{−1}.
However, this is not the case. It turns out that

    G_k^{−1} = I + τ^k e_{k−1}^T.   (4.70)
This is easy to confirm. From (4.68) and (4.70),

    G_k G_k^{−1} = [I − τ^k e_{k−1}^T][I + τ^k e_{k−1}^T]
                 = I − τ^k e_{k−1}^T + τ^k e_{k−1}^T − τ^k e_{k−1}^T τ^k e_{k−1}^T
                 = I − τ^k (e_{k−1}^T τ^k) e_{k−1}^T.

But from (4.66) and (4.67), e_{k−1}^T τ^k = 0, so finally G_k G_k^{−1} = I.
To obtain τ^k from (4.65), we see that we must divide by x_{k−1}. In our matrix
factorization application of the Gauss transformation, we have seen (in Example
4.4) that x_{k−1} will be an element of A^{k−1}. These elements are called pivots. It
is apparent that the factorization procedure cannot work if a pivot is zero. The
occurrence of zero-valued pivots is a common situation. A simple example of a
matrix that cannot be factored with our algorithm is

    A = [ 0  1 ]
        [ 1  0 ].   (4.71)

In this case

    G_1 A = [  1      0 ] [ 0  1 ]
            [ −τ_1^1  1 ] [ 1  0 ],   (4.72)

and from (4.65)

    τ_1^1 = x_1/x_0 = a_{10}/a_{00} = 1/0 = ∞.   (4.73)
This result implies that not all matrices possess an LU factorization. Let det(A)
denote the determinant of A. We may state a general condition for the existence of
the LU factorization:

Theorem 4.1: For A = [a_{i,j}]_{i,j=0,...,n−1} ∈ R^{n×n} we define the kth leading
principal submatrix of A to be A_k = [a_{i,j}]_{i,j=0,...,k−1} ∈ R^{k×k} for k = 1, 2, ..., n
(so that A = A_n, and A_1 = [a_{00}] = a_{00}). There exists a unit lower triangular matrix
L and an upper triangular matrix U such that A = LU, provided that det(A_k) ≠ 0
for all k = 1, 2, ..., n. Furthermore, with U = [u_{i,j}] ∈ R^{n×n} we have det(A_k) =
u_{0,0} u_{1,1} ··· u_{k−1,k−1}.

The proof is given in Golub and Van Loan [5]. It will not be considered here. For
A in (4.71), we see that A_1 = [0] = 0, so det(A_1) = 0. Thus, even though A^{−1}
exists, it does not possess an LU decomposition. It is also easy to verify that for

    A = [ 4  1  1 ]
        [ 8  2  1 ]
        [ 1  1  1 ],

although A^{−1} exists, again A does not possess an LU decomposition. In this case
we have det(A_2) = det [ 4 1 ; 8 2 ] = 0. Theorem 4.1 leads to a test of positive
definiteness according to the following theorem.
Theorem 4.2: Suppose R ∈ R^{n×n} with R = R^T. Suppose that R = LDL^T,
where L is unit lower triangular, and D is a diagonal matrix (L =
[l_{i,j}]_{i,j=0,...,n−1}, D = [d_{i,j}]_{i,j=0,...,n−1}). If d_{i,i} > 0 for all i ∈ Z_n, then R > 0.

Proof L is unit lower triangular, so for any y ∈ R^n there will be a unique
x ∈ R^n such that

    y = L^T x   (y^T = x^T L)

because L^{−1} exists. Thus, assuming D > 0,

    x^T R x = x^T L D L^T x = y^T D y = Σ_{i=0}^{n−1} y_i^2 d_{i,i} > 0

for all y ≠ 0, since d_{i,i} > 0 for all i ∈ Z_n. In fact, Σ_{i=0}^{n−1} y_i^2 d_{i,i} = 0 iff y_i = 0 for
all i ∈ Z_n. Consequently, x^T R x > 0 for all x ≠ 0, and so immediately R > 0.

We relate D in Theorem 4.2 to U in Theorem 4.1 according to U = DL^T. If the
LDL^T decomposition of a matrix R exists, then matrix D immediately tells us
whether R is pd just by viewing the signs of the diagonal elements.
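Theorem 4.2 suggests a practical positive-definiteness test: compute the LDL^T factorization and inspect the signs of the d_{i,i}. A sketch (assuming NumPy; the recurrences used are the standard LDL^T ones, the function name and tolerance are our own choices) is:

```python
import numpy as np

def is_positive_definite(R, tol=1e-12):
    """Test R = R^T > 0 via the factorization R = L D L^T (Theorem 4.2)."""
    n = R.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        # d_j and column j of L from the standard LDL^T recurrences
        d[j] = R[j, j] - np.dot(L[j, :j] ** 2, d[:j])
        if d[j] <= tol:        # a nonpositive d_{j,j} means R is not pd
            return False
        for i in range(j + 1, n):
            L[i, j] = (R[i, j] - np.dot(L[i, :j] * L[j, :j], d[:j])) / d[j]
    return True

R = np.array([[4., 2.], [2., 3.]])   # symmetric, positive definite
print(is_positive_definite(R))        # True
print(is_positive_definite(-R))       # False
```

Note that when the factorization breaks down (a zero pivot), the routine simply reports "not positive definite," which is consistent with Theorem 4.2 but is a design choice on our part.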
We may define (as in Example 4.4) A^k = [a_{i,j}^k], where k = 0, 1, ..., n − 1 and
A^0 = A. Consequently

    τ_i^k = x_i/x_{k−1} = a_{i,k−1}^{k−1} / a_{k−1,k−1}^{k−1}   (4.74)

for i = k, k + 1, ..., n − 1. This follows because G_k contains τ^k, and as observed
in the example above, τ^k depends on the column indexed k − 1 in A^{k−1}. A
pseudocode program for finding U can therefore be stated as follows:

    A^0 := A;
    for k := 1 to n − 1 do begin
        for i := k to n − 1 do begin
            τ_i^k := a_{i,k−1}^{k−1} / a_{k−1,k−1}^{k−1};   { This loop computes τ^k }
        end;
        A^k := G_k A^{k−1};   { G_k contains τ^k via (4.64) }
    end;
    U := A^{n−1};

We see that the pivots are a_{k−1,k−1}^{k−1} for k = 1, 2, ..., n − 1. Now

    U = G_{n−1} G_{n−2} ··· G_2 G_1 A,

so

    A = G_1^{−1} G_2^{−1} ··· G_{n−2}^{−1} G_{n−1}^{−1} U.   (4.75)
Consequently, from (4.70), we obtain

    L = (I + τ^1 e_0^T)(I + τ^2 e_1^T) ··· (I + τ^{n−1} e_{n−2}^T) = I + Σ_{k=1}^{n−1} τ^k e_{k−1}^T.   (4.76)

To confirm the last equality of (4.76), consider defining L_m = G_1^{−1} ··· G_m^{−1} for
m = 1, ..., n − 1. Assume that L_m = I + Σ_{k=1}^{m} τ^k e_{k−1}^T, which is true for m = 1
because L_1 = G_1^{−1} = I + τ^1 e_0^T. Consider L_{m+1} = L_m G_{m+1}^{−1}, so

    L_{m+1} = (I + Σ_{k=1}^{m} τ^k e_{k−1}^T)(I + τ^{m+1} e_m^T)
            = I + Σ_{k=1}^{m} τ^k e_{k−1}^T + τ^{m+1} e_m^T + Σ_{k=1}^{m} τ^k e_{k−1}^T τ^{m+1} e_m^T.

But e_{k−1}^T τ^{m+1} = 0 for k = 1, ..., m from (4.66) and (4.67). Thus

    L_{m+1} = I + Σ_{k=1}^{m} τ^k e_{k−1}^T + τ^{m+1} e_m^T = I + Σ_{k=1}^{m+1} τ^k e_{k−1}^T.

Therefore, (4.76) is valid by mathematical induction. (A simpler example of a
proof by induction appears in Appendix 3.B.) Because of (4.76), the previous
pseudocode implicitly computes L as well as U. Thus, if no zero-valued pivots
are encountered, the algorithm will terminate, having provided us with both L
and U. [As an exercise, the reader should use (4.76) to find L in Example 4.4
simply by looking at the appropriate entries of the matrices G_k; that is, do not
use L = G_1^{−1} G_2^{−1} G_3^{−1}. Having found L by this means, confirm that LU = A.] We
remark that (4.76) shows that L is unit lower triangular.
It is worth mentioning that certain classes of matrix are guaranteed to possess
an LU decomposition. Suppose that A ∈ R^{n×n} with A = A^T and A > 0. Let v =
[v_0 ··· v_{k−1} 0 ··· 0]^T (n − k zeros); then, if v ≠ 0, we have v^T A v > 0, but if A_k is the kth
leading principal submatrix of A and u = [v_0 ··· v_{k−1}]^T, then

    v^T A v = u^T A_k u > 0,

which holds for all k = 1, 2, ..., n. Consequently, A_k > 0 for all k, and so A_k^{−1}
exists for all k. Since A_k^{−1} exists for all k, it follows that det(A_k) ≠ 0 for all k.
The conditions of Theorem 4.1 are met, and so A possesses an LU decomposition.
That is, all real-valued, symmetric positive definite matrices possess an LU
decomposition.
We recall that the class of positive definite matrices is an important one since
they have a direct association with least-squares approximation problems. This was
demonstrated in Section 4.2.
How many floating-point operations (flops) are needed by the algorithm for
finding the LU decomposition of a matrix? Answering this question gives us an
indication of the computational complexity of the algorithm. Neglecting multiplication
by zero or by one, to compute A^k = G_k A^{k−1} requires (n − k)(n − k + 1)
multiplications, and the same number of additions. This follows from considering
the product G_k A^{k−1} with the factors partitioned into submatrices according to

    G_k = [ I_k  0       ]      A^{k−1} = [ A_{00}^{k−1}  A_{01}^{k−1} ]
          [ T_k  I_{n−k} ],               [ 0             A_{11}^{k−1} ],   (4.77)

where I_k is a k × k identity matrix, T_k is (n − k) × k and is zero-valued except for
its last column, which contains the entries −τ_i^k [see (4.64)]. Similarly, A_{00}^{k−1} is (k − 1) ×
(k − 1), A_{01}^{k−1} is (k − 1) × (n − k + 1), and A_{11}^{k−1} is (n − k + 1) × (n − k + 1).
From the pseudocode, we see that we need Σ_{k=1}^{n−1} (n − k) division operations.
Operation A^k = G_k A^{k−1} is executed for k = 1 to n − 1, so the total number of
operations is

    Σ_{k=1}^{n−1} (n − k)(n − k + 1)   multiplications,
    Σ_{k=1}^{n−1} (n − k)(n − k + 1)   additions,
    Σ_{k=1}^{n−1} (n − k)              divisions.

We now recognize that

    Σ_{k=1}^{N} k = N(N + 1)/2,   Σ_{k=1}^{N} k^2 = N(N + 1)(2N + 1)/6,   (4.78)

where the second summation identity was proven in Appendix 3.B. The first summation
identity may be proved in a similar manner. Therefore

    Σ_{k=1}^{n−1} (n − k)(n − k + 1) = Σ_{k=1}^{n−1} [n^2 + n − (2n + 1)k + k^2]
        = (n − 1)(n^2 + n) − (2n + 1) Σ_{k=1}^{n−1} k + Σ_{k=1}^{n−1} k^2
        = (1/3)n^3 − (1/3)n,   (4.79a)

    Σ_{k=1}^{n−1} (n − k) = n(n − 1) − Σ_{k=1}^{n−1} k = (1/2)n^2 − (1/2)n.   (4.79b)
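The closed forms (4.79a) and (4.79b) can be spot-checked by direct summation; a small script (plain Python, our own construction) is:

```python
# Verify (4.79a) and (4.79b) numerically for a range of matrix orders n.
for n in range(2, 30):
    s1 = sum((n - k) * (n - k + 1) for k in range(1, n))  # mult./add. count
    s2 = sum(n - k for k in range(1, n))                  # division count
    assert 3 * s1 == n**3 - n   # (4.79a): s1 = (n^3 - n)/3
    assert 2 * s2 == n**2 - n   # (4.79b): s2 = (n^2 - n)/2
print("identities hold")
```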
So-called asymptotic complexity measures are defined using

Definition 4.2: Big O We say that f(n) = O(g(n)) if there is a 0 < c < ∞,
and an N ∈ N (N < ∞) such that

    f(n) ≤ c g(n)

for all n ≥ N.

Our algorithm needs a total of f(n) = 2[(1/3)n^3 − (1/3)n] + (1/2)n^2 − (1/2)n = (2/3)n^3 + (1/2)n^2 −
(7/6)n flops. We may say that O(n^3) operations (flops) are needed (so here g(n) =
n^3). We may read O(n^3) as "order n-cubed," so order n-cubed operations are
needed. If one operation takes one unit of time on a computing machine, we say
the asymptotic time complexity of the algorithm is O(n^3). Parameter n (matrix
order) is the size of the problem. We might also say that the time complexity
of the algorithm is cubic in the size of the problem since the number of operations
f(n) is a cubic polynomial in n. But we caution the reader about flop
counting:
"Flop counting is a necessarily crude approach to the measuring of program efficiency
since it ignores subscripting, memory traffic, and the countless other overheads asso-
ciated with program execution. We must not infer too much from a comparison of
flops counts. . . . Flop counting is just a 'quick and dirty' accounting method that
captures only one of several dimensions of the efficiency issue."
— Golub and Van Loan [5, p. 20]
Asymptotic complexity measures allow us to talk about algorithmic resource
demands without getting bogged down in detailed expressions for computing time,
memory requirements, and other variables. However, the comment by Golub and
Van Loan above may clearly be extended to asymptotic measures.
Suppose that A is LU-factorable, and that we know L and U. Suppose that we
wish to solve Ax = y. Thus

    LUx = y,   (4.80)

and define Ux = z, so we begin by considering

    Lz = y.   (4.81)
In expanded form this becomes

    [ l_{0,0}                                          ] [ z_0     ]   [ y_0     ]
    [ l_{1,0}    l_{1,1}                               ] [ z_1     ]   [ y_1     ]
    [ l_{2,0}    l_{2,1}    l_{2,2}                    ] [ z_2     ] = [ y_2     ]   (4.82)
    [   ⋮          ⋮          ⋮        ⋱               ] [  ⋮      ]   [  ⋮      ]
    [ l_{n−1,0}  l_{n−1,1}  l_{n−1,2}  ···  l_{n−1,n−1} ] [ z_{n−1} ]   [ y_{n−1} ]

Since L^{−1} exists, solving (4.81) is easy using forward elimination (forward substitution).
Specifically, from (4.82)

    z_0 = y_0 / l_{0,0},
    z_1 = (1/l_{1,1}) [y_1 − z_0 l_{1,0}],
    z_2 = (1/l_{2,2}) [y_2 − z_0 l_{2,0} − z_1 l_{2,1}],
      ⋮
    z_{n−1} = (1/l_{n−1,n−1}) [y_{n−1} − Σ_{k=0}^{n−2} z_k l_{n−1,k}].

Thus, in general,

    z_k = (1/l_{k,k}) [y_k − Σ_{i=0}^{k−1} z_i l_{k,i}]   (4.83)

for k = 1, 2, ..., n − 1, with z_0 = y_0/l_{0,0}. Since we now know z, we may solve
Ux = z by backward substitution. To see this, express the problem in expanded
form:
    [ u_{0,0}  u_{0,1}  ···  u_{0,n−2}    u_{0,n−1}   ] [ x_0     ]   [ z_0     ]
    [          u_{1,1}  ···  u_{1,n−2}    u_{1,n−1}   ] [ x_1     ]   [ z_1     ]
    [                   ⋱       ⋮            ⋮        ] [  ⋮      ] = [  ⋮      ]   (4.84)
    [                        u_{n−2,n−2}  u_{n−2,n−1} ] [ x_{n−2} ]   [ z_{n−2} ]
    [                                     u_{n−1,n−1} ] [ x_{n−1} ]   [ z_{n−1} ]

From (4.84), we obtain

    x_{n−1} = z_{n−1} / u_{n−1,n−1},
    x_{n−2} = (1/u_{n−2,n−2}) [z_{n−2} − x_{n−1} u_{n−2,n−1}],
    x_{n−3} = (1/u_{n−3,n−3}) [z_{n−3} − x_{n−1} u_{n−3,n−1} − x_{n−2} u_{n−3,n−2}],
      ⋮
    x_0 = (1/u_{0,0}) [z_0 − Σ_{k=1}^{n−1} x_k u_{0,k}].

In general,

    x_k = (1/u_{k,k}) [z_k − Σ_{i=k+1}^{n−1} x_i u_{k,i}]   (4.85)

for k = n − 2, ..., 0, with x_{n−1} = z_{n−1}/u_{n−1,n−1}. The forward-substitution and
backward-substitution algorithms that we have just derived have an asymptotic
time complexity of O(n^2). The reader should confirm this as an exercise. This
result suggests that most of the computational effort needed to solve for x in
Ax = y lies in the LU decomposition stage.
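The recursions (4.83) and (4.85) translate directly into code. This sketch (assuming NumPy; the function names are ours) implements both and checks them on a small example:

```python
import numpy as np

def forward_substitution(L, y):
    """Solve Lz = y for lower triangular L via (4.83)."""
    n = L.shape[0]
    z = np.zeros(n)
    for k in range(n):
        z[k] = (y[k] - L[k, :k] @ z[:k]) / L[k, k]
    return z

def backward_substitution(U, z):
    """Solve Ux = z for upper triangular U via (4.85)."""
    n = U.shape[0]
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = (z[k] - U[k, k+1:] @ x[k+1:]) / U[k, k]
    return x

L = np.array([[2., 0.], [1., 3.]])
U = np.array([[4., 1.], [0., 5.]])
y = np.array([2., 7.])
z = forward_substitution(L, y)    # z = [1, 2]
x = backward_substitution(U, z)   # now LUx = y
print(np.allclose(L @ U @ x, y))  # True
```

Each routine does O(n^2) work, in agreement with the complexity claim above.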
So far we have said nothing about the performance of our linear system solution
method with respect to finite-precision arithmetic effects (i.e., rounding error).
Before considering this matter, we make a few remarks regarding the stability of
our method. We have noted that the LU decomposition algorithm will fail if a zero-valued
pivot is encountered. This can happen even if Ax = y has a solution and A
is well-conditioned. In other words, our algorithm is actually unstable since we can
input numerically well-posed problems that cause it to fail. This does not necessarily
mean that our algorithm should be totally rejected. For example, we have shown
that positive definite matrices will never result in a zero-valued pivot. Furthermore,
if A > 0, and it is well-conditioned, then it can be shown that an accurate answer
will be provided by the algorithm despite its faults. Nonetheless, the problem of
failure due to encountering a zero-valued pivot needs to be addressed. Also, what
happens if a pivot is not exactly zero, but is close to zero? We might expect that
this can result in a computed solution x̂ that differs greatly from the mathematically
exact solution x, especially where rounding error is involved, even if A is well-conditioned.
Recall (2.15) from Chapter 2,

    fl[x op y] = (x op y)(1 + ε),   (4.86)

for which |ε| ≤ 2^{−t}. If we store A in a floating-point machine, then, because of
the necessity to quantize, we are really storing the elements

    [fl[A]]_{i,j} = fl[a_{i,j}] = a_{i,j}(1 + ε_{i,j}),   (4.87)

with |ε_{i,j}| ≤ 2^{−t}. Suppose now that A, B ∈ R^{m×n}; we then define^6

    |A| = [|a_{i,j}|] ∈ R^{m×n},   (4.88)

and by B ≤ A we mean b_{i,j} ≤ a_{i,j} for all i and j. So we may express (4.87) more
compactly as

    |fl[A] − A| ≤ u|A|,   (4.89)

where u = 2^{−t}, since |ε_{i,j}| ≤ 2^{−t}.
Forsythe and Moler [4, pp. 104–105] show that the computed solution ẑ to
Lz = y [recall (4.81)] as obtained by forward substitution is actually the exact
solution to a perturbed lower triangular system

    (L + δL)ẑ = y,   (4.90)

where δL is a lower triangular perturbation matrix, and where

    |δL| ≤ 1.01 n u |L|.   (4.91)

A very similar bound exists for the problem of solving Ux = z by backward
substitution. We will not derive these bounds, but will simply mention that the
derivation involves working with a bound similar to (2.39) in Chapter 2. From
(4.91) we have relative perturbations |δl_{i,j}|/|l_{i,j}| ≤ 1.01 n u. It is apparent that since u
is typically quite tiny, unless n (matrix order) is quite huge, these relative perturbations
will not be significant. In other words, forward substitution and backward
substitution are very stable procedures that are quite resistant to the effects of
rounding errors. Thus, any difficulties with our linear system solution procedure in
terms of rounding error likely involve only the LU factorization stage.
The rounding error analysis for our Gaussian elimination algorithm is even more
involved than the effort required to obtain (4.91), so again we will content ourselves
with citing the main result without proof. We cite Theorem 3.3.1 in Golub and Van
Loan [5] as follows.
Theorem 4.3: Assume that A is an n x n matrix of floating-point numbers.
If no zero-valued pivots are encountered during the execution of the Gaussian
elimination algorithm for which A is the input, then the computed triangular factors
(here denoted L̂ and Û) satisfy

    L̂Û = A + δA   (4.92a)

such that

    |δA| ≤ 3(n − 1)u(|A| + |L̂||Û|) + O(u^2).   (4.92b)
^6 There is some danger in confusing this with the determinant. That is, some people use |A| to denote
the determinant of A. We will avoid this here by sticking with det(A) as the notation for the determinant
of A.
In this theorem the term O(u^2) denotes a part of the error term dependent on u^2.
This is quite small since u^2 = 2^{−2t} (rounding assumed), and so may be practically
disregarded. The term arises in the work of Golub and Van Loan [5] because those
authors prefer to work with slightly looser bounding results than are to be found
in the volume by Forsythe and Moler [4]. The bound in (4.92b) gives us cause for
concern. The perturbation matrix δA may not be small. This is because |L̂||Û| can
be quite large. An example of this would be

    A = [ 1   4      1 ]
        [ 2   8.001  1 ]
        [ 0  −1      1 ],

for which

    L = [ 1      0  0 ]      U = [ 1  4      1   ]
        [ 2      1  0 ],         [ 0  0.001  −1  ].
        [ 0  −1000  1 ]          [ 0  0     −999 ]

This has happened because

    A^1 = [ 1   4      1 ]
          [ 0   0.001 −1 ]
          [ 0  −1      1 ],

which has a_{1,1}^1 = 0.001. This is a small pivot and is ultimately responsible for
giving us "big" triangular factors. Clearly, the smaller the pivot, the bigger the
potential problem. Golub and Van Loan's [5] Theorem 3.3.2 (which we will not
repeat here) goes on to demonstrate that the errors in the computed triangular
factors can adversely affect the solution to Ax = LUx = y as obtained by forward
substitution and backward substitution. Thus, if we use the computed factors L̂
and Û in L̂Ûx = y, then the computed solution x̂ may not be close to x.
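The growth of the factors for this matrix is easy to reproduce. The following sketch (assuming NumPy; `lu_nopivot` is our own routine implementing the text's no-pivoting elimination) exhibits the large entries:

```python
import numpy as np

def lu_nopivot(A):
    """Gaussian elimination without pivoting (as in the text's pseudocode)."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = U[k+1:, k] / U[k, k]            # tau values
        U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:]) # eliminate below pivot
    return L, U

A = np.array([[1., 4., 1.],
              [2., 8.001, 1.],
              [0., -1., 1.]])
L, U = lu_nopivot(A)
# L[2,1] is near -1000 and U[2,2] near -999: far larger than any entry of A
print(L[2, 1], U[2, 2])
```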
How may our Gaussian elimination LU factorization algorithm be modified to
make it more stable? The standard solution is to employ partial pivoting. We do not
consider the method in detail here, but illustrate it with a simple example (Example
4.5). Essentially, before applying a Gauss transformation G_k, the rows of matrix
A^{k−1} are permuted (i.e., exchanged) in such a manner as to make the pivot as
large as possible while simultaneously ensuring that A^k is as close to being upper
triangular as possible. Permutation operations have a matrix description, and such
matrices may be denoted by P_k. We remark that P_k^{−1} = P_k.
Example 4.5 Suppose that

    A = [ 1   4  1 ]
        [ 2   8  1 ]   (= A^0).
        [ 0  −1  1 ]

Thus

    G_1 P_1 A^0 = [  1    0  0 ] [ 0  1  0 ] [ 1   4  1 ]
                  [ −1/2  1  0 ] [ 1  0  0 ] [ 2   8  1 ]
                  [  0    0  1 ] [ 0  0  1 ] [ 0  −1  1 ]

                = [  1    0  0 ] [ 2   8  1 ]   [ 2   8   1  ]
                  [ −1/2  1  0 ] [ 1   4  1 ] = [ 0   0  1/2 ] = A^1,
                  [  0    0  1 ] [ 0  −1  1 ]   [ 0  −1   1  ]

and

    G_2 P_2 A^1 = [ 1  0  0 ] [ 1  0  0 ] [ 2   8   1  ]   [ 2   8   1  ]
                  [ 0  1  0 ] [ 0  0  1 ] [ 0   0  1/2 ] = [ 0  −1   1  ] = A^2,
                  [ 0  0  1 ] [ 0  1  0 ] [ 0  −1   1  ]   [ 0   0  1/2 ]

for which U = A^2 (here G_2 = I, since the entry to be annihilated is already zero).
We see that P_2 interchanges rows 2 and 3 rather than rows 1 and
2 because to do otherwise would ruin the upper triangular structure we seek. It is
apparent that

    G_2 P_2 G_1 P_1 A = U,

so that

    A = P_1^{−1} G_1^{−1} P_2^{−1} G_2^{−1} U,

for which

    P_1^{−1} G_1^{−1} P_2^{−1} G_2^{−1} = [ 1/2  0  1 ]
                                         [ 1    0  0 ].
                                         [ 0    1  0 ]

This matrix is manifestly not lower triangular. Thus, our use of partial pivoting to
achieve algorithmic stability has been purchased at the expense of some loss of
structure (although Theorem 3.4.1 in Ref. 5 shows how to recover much of what
is lost.^7) Also, permutations involve moving data around in the computer, and this
is a potentially significant cost. But these prices are usually worth paying.
In general, the Gaussian elimination with partial pivoting algorithm generates

    G_{n−1} P_{n−1} ··· G_2 P_2 G_1 P_1 A = U,

and it turns out that

    P_{n−1} ··· P_2 P_1 A = LU,

for which L is unit lower triangular, and U is upper triangular.

^7 The expression for L in terms of the
factors G_k is messy, and so we omit it. The interested reader can see pp. 112–113 of Ref. 5 for details.
It is worth mentioning that the need to trade off algorithm speed in favor of stability
is common in numerical computing; that is, fast algorithms often have stability
problems. Much of numerical computing is about creating the fastest possible stable
algorithms. This is a notoriously challenging engineering problem.
A much more detailed account of Gaussian elimination with partial pivoting
appears in Golub and Van Loan [5, pp. 108–116]. This matter will not be discussed
further in this book.
4.6 LEAST-SQUARES PROBLEMS AND QR DECOMPOSITION
In this section we consider the QR decomposition of A ∈ R^{m×n} for which m ≥ n,
and A is of full rank [i.e., rank(A) = n]. Full rank in this sense means that the
columns of A are linearly independent. The QR decomposition of A is

    A = QR,   (4.93)

where Q ∈ R^{m×m} is an orthogonal matrix [i.e., Q^T Q = QQ^T = I (identity
matrix)], and R ∈ R^{m×n} is upper triangular in the following sense:

    R = [ r_{0,0}  r_{0,1}  ···  r_{0,n−1}   ]
        [          r_{1,1}  ···  r_{1,n−1}   ]
        [                   ⋱       ⋮        ]   [ 𝓡 ]
        [                        r_{n−1,n−1} ] = [ 0 ].   (4.94)
        [ 0        0        ···  0           ]

Here 𝓡 ∈ R^{n×n} is a square upper triangular matrix and is nonsingular because A
is of full rank. The bottom block of zeros in R of (4.94) is (m − n) × n.

It should be immediately apparent that the existence of a QR decomposition
for A makes it quite easy to solve for x in Ax = y, if A^{−1} exists (which implies
that in this special case A is square). Thus, Ax = QRx = y, and so Rx = Q^T y.
The upper triangular linear system Rx = Q^T y may be readily solved by backward
substitution (recall the previous section).
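For a square nonsingular A, the QR-based solve just described can be sketched as follows (assuming NumPy; note that `np.linalg.solve` is used for brevity where a dedicated back-substitution routine would exploit the triangularity of R):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # square, nonsingular with probability 1
y = rng.standard_normal(4)

Q, R = np.linalg.qr(A)            # A = QR, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ y)   # solve Rx = Q^T y (back substitution)
print(np.allclose(A @ x, y))      # True
```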
The case where m > n is important because it arises in overdetermined least-
squares approximation problems. We illustrate with the following example based on
a real-world problem.^8 Figure 4.1 is a plot of some simulated body core temperature
^8 This example is from the problem of estimating the circadian rhythm parameters of human patients
who have sustained head injuries. The estimates are obtained by the suitable processing of various
physiological data sets (e.g., body core temperature, heart rate, blood pressure). The nature of the injury
has made the patients' rhythms deviate from the nominal 24-h cycle. Correct estimation of rhythm
parameters can lead to improved clinical treatment because of improved timing in the administering of
162
LINEAR SYSTEMS OF EQUATIONS
O
Q.
E
D.
37.2
Illustration of least-squares fitting
37 1
Noisy data with trend
■ Linear trend component
- Model
£ - --\
j
37
36.9
-5R «
/ _\ A -
/
10
20
30
40 50
Time (hours)
60
70
80
90
Figure 4.1 Simulated human patient temperature data to illustrate overdetermined least-
squares model parameter estimation. Here we have N = 1000 samples f n (the dots), for
T s = 300 (seconds), T = 24 (hours), a = 2 x 10" 7o C/s, b = 37°C, and c = 0.1°C. The
solution to (4.103) is a = 2.0582 x 10" 7o C/s, b = 36.9999°C, c = 0.1012°C.
measurements
has three components
from a human patient (this is the noisy data with trend). The data
1. A sinusoidal component.
2. Random noise.
3. A linear trend.
Our problem is to estimate the parameters of the sinusoid (i.e., the amplitude,
period, and phase), which represents the patient's circadian rhythm. In other words,
the noise and trend are undesirable and so are to be, in effect, removed from the
desired sinusoidal signal component. Here we will content ourselves with estimating
only the amplitude of the sinusoid. The problem of estimating the remaining param-
eters is tougher. Methods to estimate the remaining parameters will be considered
later. (This is a nonlinear optimization problem.)
We assume the model for the data in Fig. 4.1 is the analog signal

    f(t) = at + b + c sin((2π/T) t) + η(t).   (4.95)

Here the first two terms model the trend (assumed to be a straight line), the third
term is the desired sinusoidal signal component, and η(t) is a random noise component.
We only possess samples of the signal f_n = f(nT_s) (i.e., t = nT_s), for
n = 0, 1, ..., N − 1, where T_s is the sampling period of the data collection system.
medication. We emphasize that the model in (4.95) is grossly oversimplified. Indeed, a better model is
to replace the term at + b with subharmonic and harmonic terms of sin((2π/T)t). A harmonic term is one
of frequency (2π/T)k, while a subharmonic has frequency (2π/T)/k. Cosine terms should also be included in
the improved model.
We assume that we know T, which is the period of the patient's circadian rhythm.
Our model also implicitly assumes knowledge of the phase of the sinusoid, too.
These are very artificial assumptions since in practice these are the most important
parameters we are trying to estimate, and they are never known in advance. However,
our present circumstances demand simplification. Our estimate of f_n may be
defined by

    f̂_n = a T_s n + b + c sin((2π/T) n T_s).   (4.96)

This is a sampled version of the analog model, except the noise term has been
deleted.
We may estimate the unknown model parameters a, b, c by employing the same
basic strategy we used in Section 4.2, specifically, a least-squares approach. Thus,
defining x = [a b c]^T (vector of unknown parameters), we strive to minimize

    V(x) = Σ_{n=0}^{N−1} e_n^2 = Σ_{n=0}^{N−1} [f_n − f̂_n]^2   (4.97)

with respect to x. Using matrix/vector notation was very helpful in Section 4.2,
and it remains so here. Define

    v_n = [ T_s n   1   sin((2π/T) T_s n) ]^T.   (4.98)

Thus

    f̂_n = v_n^T x.   (4.99)

We may define the error vector e = [e_0 e_1 ··· e_{N−1}]^T, data vector f =
[f_0 f_1 ··· f_{N−1}]^T, and the matrix of basis vectors

    A = [ v_0^T     ]
        [ v_1^T     ]
        [   ⋮       ] ∈ R^{N×3}.   (4.100)
        [ v_{N−1}^T ]

Consequently, via (4.99)

    e = f − Ax.   (4.101)

Obviously, we would like to have e = 0, which implies the desire to solve Ax = f.
If we have N = 3 and A^{−1} exists, then we may uniquely solve for x given any
f. However, in practice, N ≫ 3, so our linear system is overdetermined. Thus,
no unique solution is possible. We have no option but to select x to minimize e in
some sense. Once again, previous experience from Section 4.2 says least-squares
is a viable choice. Thus, since ||e||_2^2 = e^T e = Σ_{n=0}^{N−1} e_n^2, we consider

    V(x) = e^T e = f^T f − 2x^T A^T f + x^T A^T A x   (4.102)

[which is a more compact version of (4.97)]. This is yet another quadratic form
[recall (4.8)]. We see that P = A^T A ∈ R^{3×3}, and g = A^T f ∈ R^3. In our problem A is of full rank, so
from the results in Section 4.2 we see that P > 0. Naturally, from the discussions
of Sections 4.3 and 4.4, the conditioning of P is a concern. Here it turns out that
because P is of low order (largely because we are interested only in estimating
three parameters) it typically has a low condition number. However, as the order
of P rises, the conditioning of P usually rapidly worsens; that is, ill conditioning
tends to be a severe problem when the number of parameters to be estimated rises.

From Section 4.2 we know that the optimum choice for x, denoted x̂, is obtained
by solving the linear system

    P x̂ = g.   (4.103)

The model curve of Fig. 4.1 (solid line) is the curve obtained using x̂ in (4.96).
Thus, since x̂ = [â b̂ ĉ]^T, we plot f̂_n for â, b̂, ĉ in place of a, b, c in (4.96).
Equation (4.103) can be written as

    A^T A x̂ = A^T f.   (4.104)

This is just the overdetermined linear system Ax = f multiplied on the left (i.e.,
premultiplied) by A^T. The system (4.104) is often referred to in the literature as
the normal equations.
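The whole experiment behind Fig. 4.1 can be imitated in a few lines. The sketch below (assuming NumPy; the noise level and random seed are our arbitrary choices, not the book's) forms A from (4.100) and solves the normal equations (4.104):

```python
import numpy as np

rng = np.random.default_rng(7)
N, Ts = 1000, 300.0                 # samples and sampling period (seconds)
T = 24 * 3600.0                     # 24-hour period in seconds
a, b, c = 2e-7, 37.0, 0.1           # true trend slope, offset, amplitude

t = Ts * np.arange(N)
f = a * t + b + c * np.sin(2 * np.pi * t / T) + 0.05 * rng.standard_normal(N)

# rows of A are v_n^T = [T_s n, 1, sin((2 pi / T) T_s n)], (4.98)-(4.100)
A = np.column_stack([t, np.ones(N), np.sin(2 * np.pi * t / T)])
x_hat = np.linalg.solve(A.T @ A, A.T @ f)   # normal equations (4.104)
print(x_hat)   # close to [2e-7, 37, 0.1]
```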
How is the previous applications example relevant to the problem of QR-factorizing
A as in Eq. (4.93)? To answer this, we need to consider the condition
numbers of A and of P = A^T A, and to see how orthogonal matrices Q facilitate
the solution of overdetermined least-squares problems. We will then move on
to the problem of how to practically compute the QR factorization of a full-rank
matrix. We will consider the issue of conditioning first since this is a justification
for considering QR factorization methods as opposed to the linear system solution
methods of the previous section.

Singular values were mentioned in Section 4.4 as being relevant to the problem
of computing spectral norms, and so of computing κ_2(A). Now we need to consider
the consequences of

Theorem 4.4: Singular Value Decomposition (SVD) Suppose A ∈ R^{m×n};
then there exist orthogonal matrices

    U = [u_0 u_1 ··· u_{m−1}] ∈ R^{m×m},   V = [v_0 v_1 ··· v_{n−1}] ∈ R^{n×n}

such that

    Σ = U^T A V = diag(σ_0, σ_1, ..., σ_{p−1}) ∈ R^{m×n},   p = min{m, n},   (4.105)

where σ_0 ≥ σ_1 ≥ ··· ≥ σ_{p−1} ≥ 0.
LEAST-SQUARES PROBLEMS AND QR DECOMPOSITION
An outline proof appears in Ref. 5 (p. 71) and is omitted here. The notation
diag(σ_0, ..., σ_{p−1}) means a diagonal matrix with main diagonal elements
σ_0, ..., σ_{p−1}. For example, if m = 3, n = 2, then p = 2, and

U^T A V = [ σ_0   0  ]
          [  0   σ_1 ]
          [  0    0  ],

but if m = 2, n = 3, then again p = 2, but now

U^T A V = [ σ_0   0    0 ]
          [  0   σ_1   0 ].

The numbers σ_i are called singular values. Vector u_i is the ith left singular vector,
and v_i is the ith right singular vector. The following notation is helpful:

σ_i(A) = the ith singular value of A (i ∈ Z_p).
σ_max(A) = the biggest singular value of A.
σ_min(A) = the smallest singular value of A.
We observe that because AV = UΣ and A^T U = VΣ^T, we have, respectively,

A v_i = σ_i u_i,  A^T u_i = σ_i v_i (4.106)

for i ∈ Z_p. Singular values give matrix 2-norms, as noted in the following theorem.
Theorem 4.5:

||A||_2 = σ_0 = σ_max(A).

Proof Recall the result (4.37). From (4.105) A = UΣV^T, so

||Ax||_2^2 = x^T A^T Ax, (4.107)

and

A^T A = VΣ^T ΣV^T = Σ_{i=0}^{p−1} σ_i^2 v_i v_i^T ∈ R^{n×n}. (4.108)

For any x ∈ R^n there exist d_i such that

x = Σ_{i=0}^{n−1} d_i v_i

(because V is orthogonal, so its column vectors form an orthonormal basis for R^n).
Thus

||x||_2^2 = x^T x = Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} d_i d_j v_i^T v_j = Σ_{i=0}^{n−1} d_i^2 (4.109)
LINEAR SYSTEMS OF EQUATIONS
(via v_i^T v_j = δ_{i−j}). Now

x^T A^T Ax = Σ_{i=0}^{p−1} σ_i^2 (x^T v_i)(v_i^T x) = Σ_{i=0}^{p−1} σ_i^2 (x, v_i)^2, (4.110)

but

(x, v_i) = (Σ_j d_j v_j, v_i) = Σ_j d_j (v_j, v_i) = d_i. (4.111)

Using (4.111) in (4.110), we obtain

||Ax||_2^2 = Σ_{i=0}^{n−1} σ_i^2 d_i^2, (4.112)

for which it is understood that σ_i = 0 for i > p − 1. We maximize ||Ax||_2^2 subject
to the constraint ||x||_2 = 1, which means employing Lagrange multipliers; that is, we
maximize

L(d) = Σ_{i=0}^{n−1} σ_i^2 d_i^2 − λ (Σ_{i=0}^{n−1} d_i^2 − 1), (4.113)

where d = [d_0 d_1 ··· d_{n−1}]^T. Thus

∂L(d)/∂d_j = 2σ_j^2 d_j − 2λ d_j = 0,

or

σ_j^2 d_j = λ d_j. (4.114)

Substituting (4.114) into (4.112) gives

||Ax||_2^2 = λ Σ_{i=0}^{n−1} d_i^2 = λ, (4.115)

for which we have used the fact that ||x||_2 = 1 in (4.109). From (4.114), λ is an
eigenvalue of a diagonal matrix containing the σ_i^2. Consequently, ||Ax||_2^2 is
maximized for λ = σ_0^2. Therefore, ||A||_2 = σ_0.
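Though not done in the text, the key fact used in this proof, that the largest eigenvalue of A^T A is σ_0^2 by (4.108), suggests a simple numerical way to estimate ||A||_2: power iteration on A^T A. A minimal sketch (the matrix and iteration count are made up for illustration):

```python
# Sketch (not from the text): since the largest eigenvalue of A^T A is
# sigma_0^2, power iteration on A^T A estimates ||A||_2 = sigma_0.

def norm2(v):
    return sum(t * t for t in v) ** 0.5

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

A = [[3.0, 1.0],
     [1.0, 3.0],
     [0.0, 2.0]]                  # small example matrix (made up)
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]         # A^T A = [[10, 6], [6, 14]]

v = [1.0, 1.0]
for _ in range(100):              # power iteration on A^T A
    w = matvec(AtA, v)
    v = [t / norm2(w) for t in w]
spectral_norm = norm2(matvec(AtA, v)) ** 0.5   # sqrt of largest eigenvalue
print(round(spectral_norm, 4))
```

For this matrix the eigenvalues of A^T A are 12 ± 2√10, so the estimate converges to √(12 + 2√10) ≈ 4.2807.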
Suppose that

σ_0 ≥ ··· ≥ σ_{r−1} > σ_r = ··· = σ_{p−1} = 0; (4.116)

then

rank(A) = r. (4.117)
Thus, the SVD of A can tell us the rank of A. In our overdetermined least-squares
problem we have m > n and A is assumed to be of full rank. This implies that
r = n. Also, p = n. Thus, all singular values of a full-rank matrix are bigger than
zero. Now suppose that A^{−1} exists. From (4.105), A^{−1} = VΣ^{−1}U^T. Immediately,
||A^{−1}||_2 = 1/σ_min(A). Hence

κ_2(A) = ||A||_2 ||A^{−1}||_2 = σ_max(A)/σ_min(A). (4.118)

Thus, a large singular value spread is associated with matrix ill conditioning. [Recall
(4.61) and the related discussion.] As remarked on p. 223 of Ref. 5, Eq. (4.118)
can be extended to cover full-rank rectangular matrices with m > n:

A ∈ R^{m×n}, rank(A) = n ⟹ κ_2(A) = σ_max(A)/σ_min(A). (4.119)

This also holds for the transpose of A because A^T = VΣ^T U^T, so A^T has the
same singular values as A. Thus, κ_2(A^T) = κ_2(A). Golub and Van Loan [5, p. 225]
claim (without formal proof) that κ_2(A^T A) = [κ_2(A)]^2. In other words, if the linear
system Ax = f is ill-conditioned, then A^T Ax = A^T f is even more ill-conditioned:
the condition number of the latter system is the square of that of the former.
More information on the conditioning of rectangular matrices is to be found in
Appendix 4.B, including justification that κ_2(A^T A) = [κ_2(A)]^2.
A popular approach toward solving the normal equations A^T Ax = A^T f is based
on Cholesky decomposition.

Theorem 4.6: Cholesky Decomposition If R ∈ R^{n×n} is symmetric and positive
definite, then there exists a unique lower triangular matrix L ∈ R^{n×n} with
positive diagonal entries such that R = LL^T. This is the Cholesky decomposition
(factorization) of R.
Algorithms to find this decomposition appear in Chapter 4 of Ref. 5. We do not
consider them except to note that if they are used, then the computed solution to
A^T Ax = A^T f, which we denote by x̃, may satisfy

||x̃ − x̂||_2 / ||x̂||_2 ≈ u[κ_2(A)]^2, (4.120)

where u is as in (4.89). Thus, this method of linear system solution is potentially
highly susceptible to errors due to ill-conditioned problems. On the other hand,
Cholesky approaches are computationally efficient in that they require about n^3/3
flops (floating-point operations). Clearly, Gaussian elimination may be employed
to solve the normal equations as well, but we recall that Gaussian elimination
needs about 2n^3/3 flops. Gaussian elimination is less efficient because it does
not account for the symmetry of matrix R. Note that these counts do not take into
consideration the number of flops needed to determine A^T A and A^T f, and do
not account for the number of flops needed by the forward/backward substitution
steps. However, the comparison between Cholesky decomposition and Gaussian
elimination is reasonably fair because these other steps are essentially the same for
both approaches.
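Theorem 4.6 can be turned into an algorithm almost directly by equating entries of R = LL^T row by row. The following is a minimal sketch, not the algorithm of Ref. 5, and the test matrix is made up; no pivoting is needed because R is assumed symmetric positive definite:

```python
# Sketch of Cholesky factorization R = L L^T (Theorem 4.6), written
# directly from the defining equations.

def cholesky(R):
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (R[i][i] - s) ** 0.5     # positive diagonal entry
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L

R = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]       # symmetric positive definite (made up)
L = cholesky(R)
# Check: L L^T reproduces R.
LLt = [[sum(L[i][k] * L[j][k] for k in range(3)) for j in range(3)]
       for i in range(3)]
print(all(abs(LLt[i][j] - R[i][j]) < 1e-12
          for i in range(3) for j in range(3)))
```

Once L is available, solving Rx = g reduces to the two triangular systems Lw = g and L^T x = w, i.e., a forward substitution followed by a backward substitution.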
Recall that ||e||_2^2 = ||Ax − f||_2^2. Thus, for orthogonal matrix Q

||Q^T e||_2^2 = [Q^T e]^T Q^T e = e^T QQ^T e = e^T e = ||e||_2^2. (4.121)

Thus, the 2-norm is invariant to orthogonal transformations. This is one of the
more important properties of 2-norms. Now consider

||e||_2^2 = ||Q^T Ax − Q^T f||_2^2. (4.122)

Suppose that

Q^T f = [ f^u ]
        [ f^l ] (4.123)

for which f^u ∈ R^n, and f^l ∈ R^{m−n}. Thus, from (4.94) and Q^T A = R [writing
R̃ ∈ R^{n×n} for the upper triangular block of R, so that R = [R̃^T 0^T]^T], we obtain

Q^T Ax − Q^T f = [ R̃x − f^u ]
                 [   −f^l    ], (4.124)

implying that

||e||_2^2 = ||R̃x − f^u||_2^2 + ||f^l||_2^2. (4.125)

Immediately, we see that the least-squares optimal solution x̂ satisfies

R̃ x̂ = f^u. (4.126)
The least-squares optimal solution x̂ is therefore found by backward substitution.
Equally clearly, we see that

min ||e||_2^2 = ||f^l||_2^2 = ρ_LS^2. (4.127)

This is the minimum error energy. Quantity ρ_LS^2 is also called the minimum sum
of squares, and e is called the residual [5]. It is easy to verify that κ_2(Q) = 1
(Q is orthogonal). In other words, orthogonal matrices are perfectly conditioned.
This means that the operation Q^T A will not result in a matrix that is less well
conditioned than A. This in turn suggests that solving our least-squares problem
using QR decomposition might be numerically more reliable than working with
the normal equations. As explained on p. 230 of Ref. 5, this is not necessarily
always true, but it is nevertheless a good reason to contemplate QR approaches to
solving least-squares problems.⁹

⁹If the residual is big and the problem is ill-conditioned, then neither QR nor normal equation methods
may give an accurate answer. However, QR approaches may be more accurate for small residuals in
ill-conditioned problems than normal equation approaches.
How may we compute Q? There are three major approaches:
1. Gram-Schmidt algorithms
2. Givens rotation algorithms
3. Householder transformation algorithms
We will consider only Householder transformations.
We begin by a review of how vectors are projected onto vectors. Recall the
law of cosines from trigonometry in reference to Fig. 4.2a. Assume that x, y ∈ R^n.
Suppose that ||x − y||_2 = a, ||x||_2 = b, and that ||y||_2 = c. Therefore, where θ is
the angle between x and y (0 ≤ θ ≤ π radians),

a^2 = b^2 + c^2 − 2bc cos θ, (4.128)

or in terms of the vectors x and y, Eq. (4.128) becomes

||x − y||_2^2 = ||x||_2^2 + ||y||_2^2 − 2||x||_2 ||y||_2 cos θ.

In terms of inner products, this becomes

(x − y, x − y) = (x, x) + (y, y) − 2[(x, x)]^{1/2}[(y, y)]^{1/2} cos θ,

which reduces to

(x, y) = [(x, x)]^{1/2}[(y, y)]^{1/2} cos θ (4.129)
       = ||x||_2 ||y||_2 cos θ. (4.130)

Figure 4.2 Illustration of the law of cosines (a) and the projection of vector x onto vector
y (b).
Now consider Fig. 4.2b. Vector P_y x is the projection of x onto y, where P_y denotes
the projection operator that projects x onto y. It is immediately apparent that

||P_y x||_2 = ||x||_2 cos θ. (4.131)

This is the Euclidean length of P_y x. The unit vector in the direction of y is y/||y||_2.
Therefore

P_y x = (||x||_2 cos θ / ||y||_2) y. (4.132)

But from (4.130) this becomes

P_y x = [(x, y)/||y||_2^2] y. (4.133)

Since (x, y) = x^T y = y^T x, we see that

P_y x = (y^T x / ||y||_2^2) y = (1/||y||_2^2) y y^T x. (4.134)

In (4.134) y y^T ∈ R^{n×n}, so the operator P_y has the matrix representation

P_y = (1/||y||_2^2) y y^T. (4.135)

In Fig. 4.2b we see that z = x − P_y x, and that

z = (I − P_y)x = [I − (1/||y||_2^2) y y^T] x, (4.136)

which is the component of x that is orthogonal to y. We observe that

P_y^2 = (1/||y||_2^4) y y^T y y^T = (1/||y||_2^4) y ||y||_2^2 y^T = (1/||y||_2^2) y y^T = P_y. (4.137)
If A^2 = A, we say that matrix A is idempotent. Thus, projection operators are
idempotent. Also, P_y^T = P_y, so projection operators are also symmetric.

In Fig. 4.3, x, y, z ∈ R^n, and y^T z = 0. Define the Householder transformation
matrix

H = I − 2 (y y^T / ||y||_2^2). (4.138)

We see that H = I − 2P_y [via (4.135)]. Hence Hx is as shown in Fig. 4.3; that
is, the Householder transformation finds the reflection of vector x with respect to
vector z, and z ⊥ y (Definition 1.6 of Chapter 1). Recall the unit vector e_i ∈ R^n

e_i = [0 ··· 0 1 0 ··· 0]^T  (1 in position i),
Figure 4.3 Geometric interpretation of the Householder transformation operator H. Note
that z^T y = 0.

so e_0 = [1 0 ··· 0]^T. Suppose that we want Hx = αe_0 for some α ∈ R with α ≠ 0;
that is, we wish to design H to annihilate all elements of x except for the top
element. Let y = x + αe_0; then

y^T x = (x + αe_0)^T x = x^T x + αx_0 (4.139a)

(as x = [x_0 x_1 ··· x_{n−1}]^T), and

||y||_2^2 = (x + αe_0)^T (x + αe_0) = x^T x + 2αx_0 + α^2. (4.139b)

Therefore

Hx = x − 2 (y y^T / ||y||_2^2) x = x − 2 (y^T x / ||y||_2^2) y, (4.140)

so from (4.139), this becomes

Hx = x − 2 (x^T x + αx_0)(x + αe_0) / ||y||_2^2
   = [1 − 2 (x^T x + αx_0)/(x^T x + 2αx_0 + α^2)] x − 2α (y^T x / ||y||_2^2) e_0.

To force the first term to zero, we require

x^T x + 2αx_0 + α^2 − 2(x^T x + αx_0) = 0,

which implies that α^2 = x^T x, or in other words, we need

α = ±||x||_2. (4.141)
Consequently, we select y = x ± ||x||_2 e_0. In this case

Hx = −2α (y^T x / ||y||_2^2) e_0 = −2α [(x^T x + αx_0)/(x^T x + 2αx_0 + α^2)] e_0
   = −2α [(α^2 + αx_0)/(2α^2 + 2αx_0)] e_0 = −αe_0 = ∓||x||_2 e_0, (4.142)

so Hx = ae_0 with a = −α if y = x + αe_0 and α = ±||x||_2.
Example 4.6 Suppose x = [4 3 0]^T, so ||x||_2 = 5. Choose α = 5. Thus
y = [9 3 0]^T, and

H = I − 2 (y y^T / y^T y) = (1/45) [ −36  −27    0 ]
                                   [ −27   36    0 ]
                                   [   0    0   45 ].

We see that

Hx = (1/45) [ −36  −27    0 ] [ 4 ]          [ −225 ]   [ −5 ]
            [ −27   36    0 ] [ 3 ] = (1/45) [    0 ] = [  0 ],
            [   0    0   45 ] [ 0 ]          [    0 ]   [  0 ]

so Hx = −αe_0.
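Example 4.6 is easy to check numerically. The sketch below (an illustration only) forms H from y via (4.138) and applies it to x, using the numbers of the example:

```python
# Numerical check of Example 4.6: x = [4, 3, 0]^T, alpha = 5,
# y = x + alpha*e_0 = [9, 3, 0]^T, and H x should equal [-5, 0, 0]^T.

x = [4.0, 3.0, 0.0]
y = [9.0, 3.0, 0.0]
yty = sum(v * v for v in y)              # y^T y = 90
# H = I - 2 y y^T / (y^T y)
H = [[(1.0 if i == j else 0.0) - 2.0 * y[i] * y[j] / yty
      for j in range(3)] for i in range(3)]
Hx = [sum(H[i][j] * x[j] for j in range(3)) for i in range(3)]
print([round(v, 6) + 0.0 for v in Hx])   # [-5.0, 0.0, 0.0]
```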
The Householder transformation is designed to annihilate elements of vectors. But
in contrast with the Gauss transformations of Section 4.5, Householder matrices
are orthogonal. To see this, observe that

H^T H = [I − 2 (y y^T / y^T y)]^T [I − 2 (y y^T / y^T y)]
      = I − 4 (y y^T / y^T y) + 4 [y y^T y y^T / (y^T y)^2]
      = I − 4 (y y^T / y^T y) + 4 (y y^T / y^T y) = I.

Thus, no matter how we select y, κ_2(H) = 1. Householder matrices are therefore
perfectly conditioned.

To obtain R in (4.94), we define

H_k = [ I_{k−1}    0  ]
      [   0      H̃_k ] ∈ R^{m×m}, (4.143)
where k = 1, 2, ..., n, I_{k−1} is an order k − 1 identity matrix, and H̃_k is an order
m − k + 1 Householder transformation matrix. We design H̃_k to annihilate elements k
to m − 1 of column k − 1 in A^{k−1}, where A^0 = A, and

A^k = H_k A^{k−1}, (4.144)

so A^n = R [in (4.94)]. Much as in Section 4.5, we have A^k = [a^k_{ij}] ∈ R^{m×n}, and
we assume m > n.
Example 4.7 Suppose

A^0 = [ a^0_{00}  a^0_{01}  a^0_{02} ]
      [ a^0_{10}  a^0_{11}  a^0_{12} ]
      [ a^0_{20}  a^0_{21}  a^0_{22} ]  ∈ R^{4×3},
      [ a^0_{30}  a^0_{31}  a^0_{32} ]

and so therefore

A^1 = H_1 A^0 = [ x x x x ] [ a^0_{00}  a^0_{01}  a^0_{02} ]   [ a^1_{00}  a^1_{01}  a^1_{02} ]
                [ x x x x ] [ a^0_{10}  a^0_{11}  a^0_{12} ] = [    0      a^1_{11}  a^1_{12} ]
                [ x x x x ] [ a^0_{20}  a^0_{21}  a^0_{22} ]   [    0      a^1_{21}  a^1_{22} ]
                [ x x x x ] [ a^0_{30}  a^0_{31}  a^0_{32} ]   [    0      a^1_{31}  a^1_{32} ],

A^2 = H_2 A^1 = [ 1 0 0 0 ] [ a^1_{00}  a^1_{01}  a^1_{02} ]   [ a^1_{00}  a^1_{01}  a^1_{02} ]
                [ 0 x x x ] [    0      a^1_{11}  a^1_{12} ] = [    0      a^2_{11}  a^2_{12} ]
                [ 0 x x x ] [    0      a^1_{21}  a^1_{22} ]   [    0         0      a^2_{22} ]
                [ 0 x x x ] [    0      a^1_{31}  a^1_{32} ]   [    0         0      a^2_{32} ],

and

A^3 = H_3 A^2 = [ 1 0 0 0 ] [ a^1_{00}  a^1_{01}  a^1_{02} ]   [ a^1_{00}  a^1_{01}  a^1_{02} ]
                [ 0 1 0 0 ] [    0      a^2_{11}  a^2_{12} ] = [    0      a^2_{11}  a^2_{12} ]
                [ 0 0 x x ] [    0         0      a^2_{22} ]   [    0         0      a^3_{22} ]
                [ 0 0 x x ] [    0         0      a^2_{32} ]   [    0         0         0     ] = R.
The x signs denote the Householder matrix elements that are not specified. This
example is intended only to show the general pattern of elements in the matrices.
Define

x^k = [a^{k−1}_{k−1,k−1}  a^{k−1}_{k,k−1}  ···  a^{k−1}_{m−1,k−1}]^T ∈ R^{m−k+1}, (4.145)

so if x^k = [x^k_0 x^k_1 ··· x^k_{m−k}]^T then x^k_i = a^{k−1}_{i+k−1,k−1}, and so

H̃_k = I_{m−k+1} − 2 [y^k (y^k)^T / (y^k)^T y^k], (4.146)

where y^k = x^k ± ||x^k||_2 e^k_0, and e^k_0 = [1 0 ··· 0]^T ∈ R^{m−k+1}. A pseudocode
analogous to that for Gaussian elimination (recall Section 4.5) is as follows:

A^0 := A;
for k := 1 to n do begin
  for i := 0 to m − k do begin
    x^k_i := a^{k−1}_{i+k−1,k−1}; { This loop makes x^k }
  end;
  y^k := x^k + sign(x^k_0)||x^k||_2 e^k_0;
  A^k := H_k A^{k−1}; { H_k contains H̃_k via (4.146) }
end;
R := A^n;
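The pseudocode above can be transcribed almost literally. The sketch below (an illustration only; the test matrix is made up) builds each H_k explicitly and forms the full product A^k = H_k A^{k−1}, which is wasteful for large m but makes the correspondence with (4.143)-(4.146) direct:

```python
# Direct transcription of the Householder QR pseudocode: for each k,
# build H_k explicitly (identity with the order m-k+1 reflector block
# embedded as in (4.143)) and form A^k = H_k A^{k-1}.

def sign(v):
    return 1.0 if v >= 0.0 else -1.0

def house_qr(A):
    m, n = len(A), len(A[0])
    Ak = [row[:] for row in A]
    for k in range(1, n + 1):
        x = [Ak[i + k - 1][k - 1] for i in range(m - k + 1)]  # (4.145)
        nx = sum(v * v for v in x) ** 0.5
        y = x[:]
        y[0] += sign(x[0]) * nx                               # (4.149)
        yty = sum(v * v for v in y)
        H = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
        for i in range(m - k + 1):
            for j in range(m - k + 1):
                H[i + k - 1][j + k - 1] -= 2.0 * y[i] * y[j] / yty
        Ak = [[sum(H[i][p] * Ak[p][j] for p in range(m)) for j in range(n)]
              for i in range(m)]
    return Ak                # = A^n = R

A = [[1.0, 2.0, 0.0],
     [3.0, 1.0, 1.0],
     [0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0]]        # made-up 4 x 3 full-rank test matrix
R = house_qr(A)
below = max(abs(R[i][j]) for i in range(4) for j in range(3) if i > j)
print(below < 1e-12)         # True: subdiagonal entries are annihilated
```

With the sign convention of (4.149), the (0, 0) entry of R comes out as −||A e_0||_2 here, consistent with Example 4.6.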
From (4.143), H_k^T H_k = I_m because H̃_k^T H̃_k = I_{m−k+1}, and of course
I_{k−1}^T I_{k−1} = I_{k−1}; that is, H_k is orthogonal for all k. Since

R = A^n = H_n H_{n−1} ··· H_2 H_1 A, (4.147)

we have

A = H_1^T H_2^T ··· H_n^T R, (4.148)

where Q = H_1^T H_2^T ··· H_n^T. Thus, the pseudocode above implicitly computes Q
because it creates the orthogonal factors H_k.

In the pseudocode we see that

y^k = x^k + sign(x^k_0)||x^k||_2 e^k_0. (4.149)

Recall from (4.141) that α = ±||x||_2, so we must choose the sign of α. It is best that
α = sign(x_0)||x||_2, where sign(x_0) = +1 for x_0 ≥ 0, and sign(x_0) = −1 if x_0 < 0.
This turns out to ensure that H remains as close as possible to perfect orthogonality
in the face of rounding errors. Because ||x||_2 might be very large or very small,
there is a risk of overflow or underflow in the computation of ||x||_2. Thus, it
is often better to compute y from x/||x||_∞. This works because scaling x does
not mathematically alter H (which may be confirmed as an exercise). Typically,
m ≫ n (e.g., in the example of Fig. 4.1 we had m = N = 1000, while n = 3),
so, since H_k ∈ R^{m×m}, we rarely can accumulate and store the elements of H_k for
all k, as too much memory is needed for such a task. Instead, it is much better to
observe that (for example) if H ∈ R^{m×m}, A ∈ R^{m×n} then, as y ∈ R^m, we have

HA = [I − 2 (y y^T / y^T y)] A = A − (2/y^T y) y (A^T y)^T. (4.150)
(4.150)
From (4.150) Ay e R", which has y'th element
m — 1
[A T y]j = Y] a k,jyk
(4.151)
k=Q
for j =0, 1,
1. If p = 2/y 1 y, then, from (4.150) and (4.151), we have
~m — 1
for i =0, 1,...,
implements this is as follows:
fl := 2/y T y;
for; = to n - 1 do begin
s := E^L"o a km
S := ps;
for / := to m - 1 do begin
[HA]ij = a u - ^y;
1, and j =0, 1,
J2 a k,j
yk
.k=Q
(4.152)
1. A pseudocode program that
end;
end;
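The update (4.152) translates directly into code. In this sketch (test data made up for illustration), the routine overwrites A with HA using only the vector y, and the result is checked against the explicit product with H:

```python
# The update (4.152) written out: overwrite A with H A using only y,
# never forming H.  Cost is O(mn) per transformation.

def apply_householder(A, y):
    m, n = len(A), len(A[0])
    beta = 2.0 / sum(v * v for v in y)
    for j in range(n):
        s = beta * sum(A[k][j] * y[k] for k in range(m))   # (4.151), scaled
        for i in range(m):
            A[i][j] -= s * y[i]                            # (4.152)

y = [9.0, 3.0, 0.0]
A = [[4.0, 1.0],
     [3.0, 2.0],
     [0.0, 5.0]]
B = [row[:] for row in A]
apply_householder(B, y)

# Check against the explicit product H A.
yty = sum(v * v for v in y)
H = [[(1.0 if i == j else 0.0) - 2.0 * y[i] * y[j] / yty
      for j in range(3)] for i in range(3)]
HA = [[sum(H[i][k] * A[k][j] for k in range(3)) for j in range(2)]
      for i in range(3)]
print(all(abs(B[i][j] - HA[i][j]) < 1e-12
          for i in range(3) for j in range(2)))
```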
This program is written to overwrite matrix A with matrix HA. This reduces
computer system memory requirements. Recall (4.123), where we see that Q^T f
must be computed so that f^u can be found. Knowledge of f^u is essential to
compute x̂ via (4.126). As in the problem of computing H_k A^{k−1}, we do not wish
to accumulate and save the factors H_k in

Q^T f = H_n H_{n−1} ··· H_1 f. (4.153)

Instead, Q^T f would be computed using an algorithm similar to that suggested by
(4.152).
All the suggestions in the previous paragraph are needed in a practical implementation
of the Householder transformation matrix method for QR factorization.
As noted in Ref. 5, the rounding error performance of the practical Householder
QR factorization algorithm is quite good. It is stated [5] as well that the number of
flops needed by the Householder method for finding x̂ is greater than that needed by
Cholesky factorization. Somewhat simplistically, the Cholesky method is computationally
more efficient than the Householder method, but the Householder method
is less susceptible to ill conditioning and to rounding errors than is the Cholesky
method. There is therefore, more or less, a tradeoff between speed and accuracy
involved in selecting between these competing methods for solving the overdetermined
least-squares problem. The Householder approach is also claimed [5] to
require more memory than the Cholesky approach.
4.7 ITERATIVE METHODS FOR LINEAR SYSTEMS

Matrix A ∈ R^{n×n} is said to be sparse if most of its n^2 elements are zero-valued.
Such matrices can arise in various applications, such as in the numerical solution
of partial differential equations (PDEs). Sections 4.5 and 4.6 have presented such
direct methods as the LU and QR decompositions (factorizations) of A in order to
solve Ax = b (assuming that A is nonsingular). However, these procedures do not
in themselves take advantage of any structure that may be possessed by A, such
as sparsity. Thus, they are not necessarily computationally efficient procedures.
Therefore, in the present section, we consider iterative methods to determine x ∈ R^n
in Ax = b. In this section, whenever we consider Ax = b, we will always assume
that A^{−1} exists. Iterative methods work by creating a Cauchy sequence of vectors
(x^{(k)}) that converges to x.^{10} Iterative methods may be particularly advantageous
when A is not only sparse, but is also large (i.e., large n). This is because direct
methods often require the considerable movement of data around the computing
machine memory system, and this can slow the computation down substantially. But
a properly conceived and implemented iterative method can alleviate this problem.

Our presentation of iterative methods here is based largely on the work of
Quarteroni et al. [8, Chapter 4]. We use much of the same notation as that in
Ref. 8. But it is a condensed presentation, as this section is intended only to convey
the main ideas about iterative linear system solvers.
In Section 4.4, matrix and vector norms were considered in order to characterize
the sizes of errors in the numerical estimate of x in Ax = b due to perturbations
of A and b. We will need to consider such norms here. As noted above, our goal
here is to derive a methodology to generate a vector sequence (x^{(k)})^{11} such that

lim_{k→∞} x^{(k)} = x, (4.154)

where x = [x_0 x_1 ··· x_{n−1}]^T ∈ R^n satisfies Ax = b and x^{(k)} = [x^{(k)}_0 x^{(k)}_1 ···
x^{(k)}_{n−1}]^T ∈ R^n. The basic idea is to find an operator T such that x^{(k+1)} = Tx^{(k)}
[= T(x^{(k)})] for k = 0, 1, 2, .... Because (x^{(k)}) is designed to be Cauchy (recall

^{10}As such, we will be revisiting ideas first seen in Section 3.2.
^{11}Note that the "(k)" in x^{(k)} does not denote the raising of x to a power or the taking of the kth
derivative, but rather is part of the name of the vector. Similar notation applies to matrices. So, A^k is
the kth power of A, but A^{(k)} is not.
Section 3.2), for any ε > 0 there will be an m ∈ Z^+ such that ||x^{(m)} − x|| < ε
[recall that d(x^{(m)}, x) = ||x^{(m)} − x||]. The operator T is defined according to

x^{(k+1)} = Bx^{(k)} + f, (4.155)

where x^{(0)} ∈ R^n is the starting value (initial guess about the solution x), B ∈ R^{n×n}
is called the iteration matrix, and f ∈ R^n is derived from A and b in Ax = b. Since
we want (4.154) to hold, from (4.155) we seek B and f such that x = Bx + f, or
A^{−1}b = BA^{−1}b + f (using Ax = b, implying x = A^{−1}b), so

f = (I − B)A^{−1}b. (4.156)

The error vector at step k is defined to be

e^{(k)} = x^{(k)} − x, (4.157)

and naturally we want lim_{k→∞} e^{(k)} = 0. Convergence would be in some suitably
selected norm.

As matters now stand, there is no guarantee that (4.154) will hold. We achieve
convergence only by the proper selection of B, and for matrices A possessing
suitable properties (considered below). Before we can consider these matters, we
require certain basic results involving matrix norms.
Definition 4.3: Spectral Radius Let s(A) denote the set of eigenvalues of
matrix A ∈ R^{n×n}. The spectral radius of A is

ρ(A) = max_{λ∈s(A)} |λ|.

An important property possessed by ρ(A) is as follows.

Property 4.1 If A ∈ R^{n×n} and ε > 0, then there is a norm denoted || · ||_ε
(i.e., a norm perhaps dependent on ε) satisfying the consistency condition (4.36c),
and such that

||A||_ε ≤ ρ(A) + ε.

Proof See Isaacson and Keller [9].

This is just a formal way of saying that there is always a matrix norm that is
arbitrarily close to the spectral radius of A:

ρ(A) = inf ||A|| (4.158)

with the infimum (defined in Section 1.3) taken over all possible norms that satisfy
(4.36c). We say that the sequence of matrices (A^{(k)}) [with A^{(k)} ∈ R^{n×n}] converges
to A ∈ R^{n×n} iff

lim_{k→∞} ||A^{(k)} − A|| = 0. (4.159)
The norm in (4.159) is arbitrary because of norm equivalence (recall the discussion
of this idea in Section 4.4).

Theorem 4.7: Let A ∈ R^{n×n}; then

lim_{k→∞} A^k = 0 ⟺ ρ(A) < 1. (4.160)

As well, the matrix geometric series Σ_{k=0}^∞ A^k converges iff ρ(A) < 1. In this
instance

Σ_{k=0}^∞ A^k = (I − A)^{−1}. (4.161)

So, if ρ(A) < 1, then matrix I − A is invertible, and also

1/(1 + ||A||) ≤ ||(I − A)^{−1}|| ≤ 1/(1 − ||A||), (4.162)

where || · || here is an induced matrix norm [i.e., (4.36b) holds] such that ||A|| < 1.

Proof We begin by showing that (4.160) holds. Let ρ(A) < 1, so there must be an
ε > 0 such that ρ(A) < 1 − ε, and from Property 4.1 there is a consistent matrix
norm || · || such that

||A|| ≤ ρ(A) + ε < 1.

Because [recall (4.40)] ||A^k|| ≤ ||A||^k < 1, and by the definition of convergence,
as k → ∞ we have A^k → 0 ∈ R^{n×n}. Conversely, assume that lim_{k→∞} A^k = 0,
and let λ be any eigenvalue of A. For eigenvector x (≠ 0) of A associated with
eigenvalue λ, we have A^k x = λ^k x, and so lim_{k→∞} λ^k = 0. Thus, |λ| < 1, and
hence ρ(A) < 1. Now consider (4.161). If λ is an eigenvalue of A, then 1 − λ is
an eigenvalue of I − A. We observe that

(I − A)(I + A + A^2 + ··· + A^{n−1} + A^n) = I − A^{n+1}. (4.163)

Since ρ(A) < 1, I − A has an inverse, and letting n → ∞ in (4.163) yields

(I − A) Σ_{k=0}^∞ A^k = I,

so that (4.161) holds.

Now, because matrix norm || · || satisfies (4.36b), we must have ||I|| = 1. Thus

1 = ||I|| ≤ ||I − A|| ||(I − A)^{−1}|| ≤ (1 + ||A||) ||(I − A)^{−1}||,
which gives the first inequality in (4.162). Since I = (I − A) + A, we have

(I − A)^{−1} = I + A(I − A)^{−1},

so that

||(I − A)^{−1}|| ≤ 1 + ||A|| ||(I − A)^{−1}||.

Condition ||A|| < 1 implies that this yields the second inequality in (4.162).
Theorem 4.7 now leads us to the following theorem.
Theorem 4.8: Suppose that / e R" satisfies (4.156); then (x w ) converges to
x satisfying Ax — b for any x^ iff p(B) < 1.
Proof From (4.155)-(4.157), we have
e (k+D = x (k+i) _ x = Bx (k) + f _ x = Bx (k) + (I _ B)A -i b _ x
= Be {k) + Bx + (I - B)A~ l b - x
= Be (k) + Bx + x- Bx -x
= Be {k \
Immediately, we see that
for k e Z + . From Theorem 4.7
e » = BV 0) (4.164)
lim B k e (0) =
for all e (0) € R" iff p{B) < 1.
On the other hand, suppose p(B) > 1; then there is at least one eigenvalue X of
B such that |A.| > 1. Let e^ be the eigenvector associated with X, so Be^ — \e®\
implying that e^ — X k e < -°\ But this implies that e^' -/> as k — >■ oo since |A.| > 1.
This theorem gives a general condition on B so that iterative procedure (4.155)
converges. Theorem 4.9 (below) will say more. However, our problem now is to
find B. From (4.158), and Theorem 4.7 a sufficient condition for convergence is
that HZ? 1 1 < 1, for any matrix norm.
A general approach to constructing iterative methods is to use the additive
splitting of the matrix A according to
A = P-N, (4.165)
where P, N ∈ R^{n×n} are suitable matrices, and P^{−1} exists. Matrix P is sometimes
called a preconditioning matrix, or preconditioner (for reasons we will not consider
here, but that are explained in Ref. 8). To be specific, we rewrite (4.155) as

x^{(k+1)} = P^{−1}Nx^{(k)} + P^{−1}b,

that is, for k ∈ Z^+

Px^{(k+1)} = Nx^{(k)} + b, (4.166)

so that f = P^{−1}b, and B = P^{−1}N. Alternatively,

x^{(k+1)} = x^{(k)} + P^{−1}r^{(k)}, (4.167)

where r^{(k)} = b − Ax^{(k)} is the residual vector at step k. From (4.167) we see that to obtain
x^{(k+1)} requires us to solve a linear system of equations involving P. Clearly, for
this approach to be worth the trouble, P must be nonsingular, and be easy to invert
as well in order to save on computations.
We will now make the additional assumption that the main diagonal elements of
A are nonzero (i.e., a_{i,i} ≠ 0 for all i ∈ Z_n). All the iterative methods we consider in
this section will assume this. In this case we may express Ax = b in the equivalent
form

x_i = (1/a_{i,i}) [ b_i − Σ_{j=0, j≠i}^{n−1} a_{i,j} x_j ] (4.168)

for i = 0, 1, ..., n − 1.
The expression (4.168) immediately leads to, for any initial guess x^{(0)}, the
Jacobi method, which is defined by the iterations

x^{(k+1)}_i = (1/a_{i,i}) [ b_i − Σ_{j=0, j≠i}^{n−1} a_{i,j} x^{(k)}_j ] (4.169)

for i = 0, 1, ..., n − 1. It is easy to show that this algorithm implements the
splitting

P = D,  N = D − A = L + U, (4.170)

where D = diag(a_{0,0}, a_{1,1}, ..., a_{n−1,n−1}) (i.e., the diagonal matrix formed from
the main diagonal elements of A), L is the lower triangular matrix such that l_{i,j} = −a_{i,j}
if i > j, and l_{i,j} = 0 if i ≤ j, and U is the upper triangular matrix such that
u_{i,j} = −a_{i,j} if j > i, and u_{i,j} = 0 if j ≤ i. Here the iteration matrix B is given by

B = B_J = P^{−1}N = D^{−1}(L + U) = I − D^{−1}A. (4.171)
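The Jacobi iteration (4.169) takes only a few lines of code. A minimal sketch (the test system is made up, and is chosen diagonally dominant so that convergence is assured by the results quoted later in this section):

```python
# The Jacobi iteration (4.169): every component of x^{(k+1)} is computed
# from the previous iterate x^{(k)} only.

def jacobi(A, b, x0, iters):
    n = len(b)
    x = x0[:]
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]        # diagonally dominant (made up)
b = [1.0, 2.0, 3.0]
x = jacobi(A, b, [0.0, 0.0, 0.0], 50)
residual = max(abs(sum(A[i][j] * x[j] for j in range(3)) - b[i])
               for i in range(3))
print(residual < 1e-10)      # True
```

Here ||B_J||_∞ = 1/2, so the error shrinks by at least a factor of 2 per sweep.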
The Jacobi method generalizes according to

x^{(k+1)}_i = (ω/a_{i,i}) [ b_i − Σ_{j=0, j≠i}^{n−1} a_{i,j} x^{(k)}_j ] + (1 − ω)x^{(k)}_i, (4.172)

where i = 0, 1, ..., n − 1, and ω is the relaxation parameter. Relaxation parameters
are introduced into iterative procedures in order to control convergence rates.
The algorithm (4.172) is called the Jacobi overrelaxation (JOR) method. In this
algorithm the iteration matrix B takes on the form

B = B_J(ω) = ωB_J + (1 − ω)I, (4.173)

and (4.172) can be expressed in the form (4.167) according to

x^{(k+1)} = x^{(k)} + ωD^{−1}r^{(k)}. (4.174)

The JOR method satisfies (4.156) provided that ω ≠ 0. The method is easily seen
to reduce to the Jacobi method when ω = 1.
An alternative to the Jacobi method is the Gauss-Seidel method. This is defined
as

x^{(k+1)}_i = (1/a_{i,i}) [ b_i − Σ_{j=0}^{i−1} a_{i,j} x^{(k+1)}_j − Σ_{j=i+1}^{n−1} a_{i,j} x^{(k)}_j ], (4.175)

where i = 0, 1, ..., n − 1. In matrix form (4.175) can be expressed as

Dx^{(k+1)} = b + Lx^{(k+1)} + Ux^{(k)}, (4.176)

where D, L, and U are the same matrices as those associated with the Jacobi
method. In the Gauss-Seidel method we implement the splitting

P = D − L,  N = U, (4.177)

with the iteration matrix

B = B_GS = (D − L)^{−1}U. (4.178)

As there is an overrelaxation method for the Jacobi approach, the same idea applies
for the Gauss-Seidel case. The Gauss-Seidel successive overrelaxation (SOR)
method is defined to be

x^{(k+1)}_i = (ω/a_{i,i}) [ b_i − Σ_{j=0}^{i−1} a_{i,j} x^{(k+1)}_j − Σ_{j=i+1}^{n−1} a_{i,j} x^{(k)}_j ] + (1 − ω)x^{(k)}_i, (4.179)
again for i = 0, 1, ..., n − 1. In matrix form this procedure can be expressed as

Dx^{(k+1)} = ω[b + Lx^{(k+1)} + Ux^{(k)}] + (1 − ω)Dx^{(k)},

or

[I − ωD^{−1}L]x^{(k+1)} = ωD^{−1}b + [(1 − ω)I + ωD^{−1}U]x^{(k)}, (4.180)

for which the iteration matrix is now

B = B_GS(ω) = [I − ωD^{−1}L]^{−1}[(1 − ω)I + ωD^{−1}U]. (4.181)

We see from (4.180) (on multiplying both sides by D) that

[D − ωL]x^{(k+1)} = ωb + [(1 − ω)D + ωU]x^{(k)},

so from the fact that A = D − (L + U) [recall (4.170)], this may be rearranged as

x^{(k+1)} = x^{(k)} + [(1/ω)D − L]^{−1} r^{(k)} (4.182)

(r^{(k)} = b − Ax^{(k)}), which is the form (4.167). Condition (4.156) holds if ω ≠ 0.
The case ω = 1 corresponds to the Gauss-Seidel method in (4.175). If ω ∈ (0, 1),
the technique is often called an underrelaxation method, while for ω ∈ (1, ∞) it
is an overrelaxation method.
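A single routine covers both (4.175) and (4.179), since ω = 1 reduces SOR to Gauss-Seidel. A minimal sketch (the test system is made up):

```python
# One SOR sweep per (4.179); omega = 1 gives Gauss-Seidel (4.175).
# Updated components x_j^{(k+1)} (j < i) are used as soon as available.

def sor(A, b, x0, omega, iters):
    n = len(b)
    x = x0[:]
    for _ in range(iters):
        for i in range(n):
            s1 = sum(A[i][j] * x[j] for j in range(i))          # new values
            s2 = sum(A[i][j] * x[j] for j in range(i + 1, n))   # old values
            x[i] = (omega / A[i][i]) * (b[i] - s1 - s2) + (1.0 - omega) * x[i]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
gs = sor(A, b, [0.0] * 3, 1.0, 30)       # omega = 1: Gauss-Seidel
residual = max(abs(sum(A[i][j] * gs[j] for j in range(3)) - b[i])
               for i in range(3))
print(residual < 1e-10)
```

Note the only structural difference from the Jacobi sweep: the partial sum s1 reads from the current (partially updated) iterate rather than the previous one.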
We will now summarize, largely without proof, results concerning the convergence
of (x^{(k)}) to x for sequences generated by the previous iterative algorithms.
We observe that every iteration in any of the proposed methods needs (in the worst
case, assuming that A is not sparse) O(n^2) arithmetic operations. If the total number
of iterations needed to achieve the desired accuracy ||x^{(m)} − x|| < ε is m, then
the total number of arithmetic operations needed is O(mn^2). Gaussian
elimination needs O(n^3) operations to solve Ax = b, so the iterative methods are
worthwhile computationally only if m is sufficiently small. If m is about the same
size as n, then little advantage can be expected from iterative methods. On the
other hand, if A is sparse, perhaps possessing only O(n) nonzero elements, then
the iterative methods require only O(mn) operations to achieve ||x^{(m)} − x|| < ε.
We need to give conditions on A so that x^{(k)} → x, and also to say something about
the number of iterations needed to achieve convergence to desired accuracy.

Let us begin with the following definition.

Definition 4.4: Diagonal Dominance Matrix A ∈ R^{n×n} is diagonally dominant
if

|a_{i,i}| > Σ_{j=0, j≠i}^{n−1} |a_{i,j}| (4.183)

for i = 0, 1, ..., n − 1.
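As a small aside (not from the text), Definition 4.4 is straightforward to encode as a predicate:

```python
# Definition 4.4 as a predicate: strict row diagonal dominance.

def diagonally_dominant(A):
    n = len(A)
    return all(abs(A[i][i]) > sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

print(diagonally_dominant([[4.0, 1.0], [1.0, 4.0]]))   # True
print(diagonally_dominant([[1.0, 2.0], [2.0, 1.0]]))   # False
```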
We mention here that Definition 4.4 is a bit different from Definition 6.2 (Chapter 6),
where diagonal dominance concepts appear in the context of spline interpolation
problems. It can be shown that if A in Ax = b is diagonally dominant according to
Definition 4.4, then the Jacobi and Gauss-Seidel methods both converge. Proof for
the Jacobi method appears in Theorem 4.2 of Ref. 8, while the Gauss-Seidel case is
proved by Axelsson [10].

If A = A^T and A > 0, both the Jacobi and Gauss-Seidel methods will converge.
A proof for the Gauss-Seidel case appears in Golub and Van Loan [5, Theorem
10.1.2]. The Jacobi case is considered in Ref. 8. Convergence results exist for
the overrelaxation methods JOR and SOR. For example, if A = A^T with A > 0,
the SOR method is convergent iff 0 < ω < 2 [8]. Naturally, we wish to select ω so
that convergence occurs as rapidly as possible (i.e., so that m in ||x^{(m)} − x|| < ε is
minimal). However, the problem of selecting the optimal value for ω is well beyond
the scope of this book.
We recall that our iterative procedures have the general form in (4.155), where
it is intended that x = Bx + f. We may regard y = Tx = Bx + f as a mapping
T: R^n → R^n. On linear vector space R^n we may define the metric

d(x, y) = max_{i∈Z_n} |x_i − y_i| (4.184)

(recall the properties of metrics from Chapter 1). Space (R^n, d) is a complete metric
space [11, p. 308]. From Kreyszig [11] we have the following theorem.

Theorem 4.9: If the linear system x = Bx + f is such that

Σ_{j=0}^{n−1} |b_{i,j}| < 1

for i = 0, 1, ..., n − 1, then solution x is unique. The solution can be obtained as
the limit of the vector sequence (x^{(k)}) for k = 0, 1, 2, ... (x^{(0)} is arbitrary), where

x^{(k+1)} = Bx^{(k)} + f,

and where, for α = max_{i∈Z_n} Σ_{j=0}^{n−1} |b_{i,j}|, we have the error bounds

d(x^{(m)}, x) ≤ [α/(1 − α)] d(x^{(m−1)}, x^{(m)}) ≤ [α^m/(1 − α)] d(x^{(0)}, x^{(1)}). (4.185)

Proof We will give only an outline proof. This theorem is really just a special
instance of the contraction theorem, which appears and is proved in Chapter 7 (see
Theorem 7.3 and Corollary 7.1).

The essence of the proof is to consider the fact that

d(Tx, Ty) = max_{i∈Z_n} | Σ_{j=0}^{n−1} b_{i,j}(x_j − y_j) |
n-1
< max \xj — y ,■ I max V^ \ba I
;eZ„ !6Z„ ^— '
.,=0
n-1
= d(x,y)max^|%|,
,£Z " .7=0
so d(Tx, Ty) < ad(x, y), if we define
n-1
[recall (4.41d)].
In this theorem we see that if α < 1, then x^{(k)} → x. In this case d(Tx, Ty) <
d(x, y) for all x, y ∈ R^n. Such a mapping T is called a contraction mapping (or
contractive mapping). We see that contraction mappings have the effect of moving
points in a space closer together. The error bounds stated in (4.185) give us an
idea about the number of iterations m needed to achieve ||x^{(m)} − x|| < ε (ε >
0). We emphasize that condition α < 1 is sufficient for convergence, so (x^{(k)})
may converge to x even if this condition is violated. It is also noteworthy that
convergence will be fast if α is small, that is, if ||B||_∞ is small. The result in
Theorem 4.8 certainly suggests convergence ought to be fast if ρ(B) is small.
Example 4.8 We shall consider the application of SOR to the problem of
solving Ax = b, where

A = [ 4 1 0 0 ]        [ 1 ]
    [ 1 4 1 0 ],  b =  [ 2 ]
    [ 0 1 4 1 ]        [ 3 ]
    [ 0 0 1 4 ]        [ 4 ].

We shall assume that x^{(0)} = [0 0 0 0]^T. Note that SOR is not the best way to solve
this problem. A better approach is to be found in Section 6.5 (Chapter 6). This
example is for illustration only. However, it is easy to confirm that

x = [0.1627 0.3493 0.4402 0.8900]^T.

Recall that the SOR iterations are specified by (4.179). However, we have not
discussed how to terminate the iterative process. A popular choice is to recall that
r^{(k)} = b − Ax^{(k)} [see (4.167)], and to stop the iterations when for k = m

||r^{(m)}|| / ||r^{(0)}|| ≤ τ (4.186)
for some tau > 0 (a small value). For our present purposes || . || shall be the norm
in (4.29c), which is compatible with the needs of Theorem 4.9. We shall choose
tau = 0.001.

Figure 4.4  Plot of the number of iterations m needed by SOR as a function of omega for the
parameters of Example 4.8 in order to satisfy the stopping condition ||r^(m)||_inf / ||r^(0)||_inf < tau.
We observe that A is diagonally dominant, so convergence is certainly expected
for omega = 1. In fact, A > 0, so convergence of the SOR method can be expected for
all omega in (0, 2). Figure 4.4 plots the m that achieves (4.186) versus omega, and we see
that there is an optimal choice for omega that is somewhat larger than omega = 1. In this
case, though, the optimal choice does not lead to much of an improvement over the
choice omega = 1.
For our problem [recalling (4.170)], we have

   D = [ 4  0  0  0       L = [  0   0   0   0       U = [ 0  -1   0   0
         0  4  0  0             -1   0   0   0             0   0  -1   0
         0  0  4  0              0  -1   0   0             0   0   0  -1
         0  0  0  4 ],           0   0  -1   0 ],          0   0   0   0 ],

so that, from (4.178),

   B_GS = (D - L)^{-1} U = [ 0.0000  -0.2500   0.0000   0.0000
                             0.0000   0.0625  -0.2500   0.0000
                             0.0000  -0.0156   0.0625  -0.2500
                             0.0000   0.0039  -0.0156   0.0625 ].
We therefore find that ||B_GS||_inf = 0.3281. It is possible to show (preferably using
MATLAB or some other software tool that is good with eigenproblems) that
rho(B_GS) = 0.1636. Given (4.185) in Theorem 4.9, we therefore expect fast con-
vergence for our problem since alpha is fairly small. In fact,

   ||x^(m) - x||_inf <= [||B_GS||_inf^m / (1 - ||B_GS||_inf)] ||(D - L)^{-1} b||_inf   (4.187)

[using x^(0) = 0, x^(1) = (D - L)^{-1} b]. For the stopping criterion of (4.186) we
obtained (recalling that omega = 1, and tau = 0.001) m = 5 with

   x^(5) = [0.1630  0.3490  0.4403  0.8899]^T

so that

   ||x^(5) - x||_inf = 3.6455 x 10^{-4}.

The right-hand side of (4.187) evaluates to

   [||B_GS||_inf^5 / (1 - ||B_GS||_inf)] ||(D - L)^{-1} b||_inf = 4.7523 x 10^{-3}.

Thus, (4.187) certainly holds true.
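A minimal SOR prototype with the stopping rule (4.186) can be written in a few lines. The sketch below uses Python/NumPy rather than MATLAB; the component-wise update is the standard SOR sweep, which is assumed here to be equivalent to the book's (4.179):

```python
import numpy as np

def sor(A, b, omega, tau=1e-3, max_iter=500):
    """SOR with the relative-residual stopping rule ||r^(m)||/||r^(0)|| < tau."""
    n = len(b)
    x = np.zeros(n)                              # x^(0) = 0
    r0 = np.linalg.norm(b - A @ x, np.inf)       # ||r^(0)||_inf
    for m in range(1, max_iter + 1):
        for i in range(n):
            # Gauss-Seidel update of component i, relaxed by omega
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(b - A @ x, np.inf) / r0 < tau:
            return x, m
    return x, max_iter

A = np.array([[4., 1., 0., 0.],
              [1., 4., 1., 0.],
              [0., 1., 4., 1.],
              [0., 0., 1., 4.]])
b = np.array([1., 2., 3., 4.])

x, m = sor(A, b, omega=1.0)
print(m, x)   # converges in a handful of sweeps toward [0.1627 0.3493 0.4402 0.8900]
```

Sweeping omega over (0, 2) and recording m reproduces the shape of Figure 4.4.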
4.8 FINAL REMARKS
We have seen that inaccurate solutions to linear systems of equations can arise
when the linear system is ill-conditioned. Condition numbers warn us if this is
a potential problem. However, even if a problem is well-conditioned, an inac-
curate solution may arise if the algorithm applied to solve it is unstable. In the
case of problems arising out of algorithm instability, we naturally replace the
unstable algorithm with a stable one (e.g., Gaussian elimination may need to
be replaced by Gaussian elimination with partial pivoting). In the case of an
ill-conditioned problem, we may try to improve the accuracy of the solution by
either

1. Using an algorithm that does not worsen the conditioning of the underlying
   problem (e.g., choosing QR factorization in preference to Cholesky factorization)
2. Reformulating the problem so that it is better conditioned
We have not considered the second alternative in this chapter. This will be done
in Chapter 5.
APPENDIX 4.A HILBERT MATRIX INVERSES
Consider the following MATLAB code:
R = hilb(10);
inv(R)

ans =

   1.0e+12 *

   [10 x 10 matrix of entries, the largest of magnitude about 3.5]

R*inv(R)

ans =

   [close to the 10 x 10 identity matrix; the off-diagonal entries have
    magnitudes of roughly 1e-4 or smaller]

R = hilb(11);
inv(R)

ans =

   1.0e+14 *

   [11 x 11 matrix of entries, the largest of magnitude about 1.2]

R*inv(R)

ans =

   [noticeably off the identity; diagonal entries such as 0.9997 and 0.9468
    appear, and off-diagonal entries reach magnitudes of about 0.17]

R = hilb(12);
inv(R)

Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate. RCOND = 2.632091e-17.

ans =

   1.0e+15 *

   [12 x 12 matrix of entries, the largest of magnitude about 3.4]

R*inv(R)

Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate. RCOND = 2.632091e-17.

ans =

   [nothing like the identity; entries reach magnitudes of nearly 5]

diary off
The MATLAB rcond function (which gave the number RCOND above) needs
some explanation. A useful reference on this is Hill [3, pp. 229-230]. It is based on
a condition number estimator in the old FORTRAN codes known as "LINPACK,"
and it works with 1-norms. rcond(A) gives the reciprocal of the 1-norm condition
number of A. If A is well-conditioned, then rcond(A) will be close to unity (i.e.,
close to one), and it will be very tiny if A is ill-conditioned. The rule of thumb
involved in interpreting an rcond output is "if rcond(A) ~ d x 10^{-k}, where d is
a digit from 1 to 9, then the elements of xcomp can usually be expected to have
k fewer significant digits of accuracy than the elements of A" [3]. Here xcomp
is simply the computed solution to Ax = y; that is, in the notation of the present
set of notes, x = xcomp. MATLAB does arithmetic with about 16 decimal digits
[3, p. 228], so in the preceding example of a Hilbert matrix inversion problem for
N = 12, since RCOND is about 10^{-17}, we have lost about 17 digits in computing
R^{-1}. Of course, this loss is catastrophic for our problem.
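The experiment is easy to reproduce outside MATLAB. The sketch below (Python/NumPy; the hilb helper is written out because it is a MATLAB builtin, not a NumPy one) tracks the 2-norm condition number of the Hilbert matrix and the departure of H*inv(H) from the identity; the exact magnitudes printed will vary slightly with platform arithmetic:

```python
import numpy as np

def hilb(n):
    """Hilbert matrix H[i, j] = 1/(i + j + 1), with 0-based indices."""
    i = np.arange(n)
    return 1.0 / (i[:, None] + i[None, :] + 1.0)

for n in (4, 8, 12):
    H = hilb(n)
    kappa = np.linalg.cond(H, 2)       # 2-norm condition number
    resid = np.max(np.abs(H @ np.linalg.inv(H) - np.eye(n)))
    print(n, kappa, resid)
# kappa grows from roughly 1e4 at n = 4 to roughly 1e16 at n = 12, and the
# computed H * inv(H) drifts correspondingly far from the identity.
```

At n = 12 the condition number is near the reciprocal of machine epsilon, which is exactly the regime the rcond warning above is flagging.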
APPENDIX 4.B  SVD AND LEAST SQUARES

From Theorem 4.4, A = U Sigma V^T, so this expands into the summation

   A = sum_{i=0}^{p-1} sigma_i u_i v_i^T.                    (4.A.1)

But if r = rank(A), then (4.A.1) reduces to

   A = sum_{i=0}^{r-1} sigma_i u_i v_i^T.                    (4.A.2)

In the following theorem rho_LS^2 = ||Ax - f||_2^2 [see (4.127)], and x is the
least-squares optimal solution to Ax = f.

Theorem 4.B.1:  Let A be represented as in (4.A.2) with A in R^{m x n} and m >= n.
If f in R^m then

   x = sum_{i=0}^{r-1} (u_i^T f / sigma_i) v_i,              (4.A.3)

and

   rho_LS^2 = sum_{i=r}^{m-1} [u_i^T f]^2.                   (4.A.4)

Proof  For all x in R^n, using the invariance of the 2-norm to orthogonal trans-
formations, and the fact that V V^T = I_n (n x n identity matrix),

   ||Ax - f||_2^2 = ||U^T A V (V^T x) - U^T f||_2^2 = ||Sigma alpha - U^T f||_2^2,   (4.A.5)

where alpha = V^T x, so as alpha = [alpha_0 ... alpha_{n-1}]^T we have alpha_j = v_j^T x. Equation (4.A.5)
expands as

   ||Ax - f||_2^2 = alpha^T Sigma^T Sigma alpha - 2 alpha^T Sigma^T U^T f + f^T U U^T f,   (4.A.6)
which further expands as

   ||Ax - f||_2^2 = sum_{i=0}^{r-1} sigma_i^2 alpha_i^2 - 2 sum_{i=0}^{r-1} sigma_i alpha_i u_i^T f + sum_{i=0}^{m-1} [u_i^T f]^2
                  = sum_{i=0}^{r-1} { sigma_i^2 alpha_i^2 - 2 sigma_i alpha_i u_i^T f + [u_i^T f]^2 } + sum_{i=r}^{m-1} [u_i^T f]^2
                  = sum_{i=0}^{r-1} [sigma_i alpha_i - u_i^T f]^2 + sum_{i=r}^{m-1} [u_i^T f]^2.   (4.A.7)

To minimize this we must have sigma_i alpha_i - u_i^T f = 0, and so

   alpha_i = (1/sigma_i) u_i^T f                             (4.A.8)

for i in Z_r. As alpha = V^T x, we have x = V alpha, so if we set alpha_r = ... = alpha_{n-1} = 0, then
from Eq. (4.A.8) we obtain

   x = sum_{i=0}^{r-1} alpha_i v_i = sum_{i=0}^{r-1} (u_i^T f / sigma_i) v_i,

which is (4.A.3). For this choice of x, from (4.A.7),

   ||Ax - f||_2^2 = sum_{i=r}^{m-1} [u_i^T f]^2 = rho_LS^2,

which is (4.A.4).
Define A^+ = V Sigma^+ U^T (again A in R^{m x n} with m >= n), where

   Sigma^+ = diag(sigma_0^{-1}, ..., sigma_{r-1}^{-1}, 0, ..., 0) in R^{n x m}.   (4.A.9)

We observe that

   A^+ f = V Sigma^+ U^T f = sum_{i=0}^{r-1} (u_i^T f / sigma_i) v_i = x.         (4.A.10)

We call A^+ the pseudoinverse of A. We have established that if rank(A) = n,
then A^T A x = A^T f, so x = (A^T A)^{-1} A^T f, which implies that in this case A^+ =
(A^T A)^{-1} A^T. If A in R^{n x n} and A^{-1} exists, then A^+ = A^{-1}.
If A^{-1} exists (i.e., m = n and rank(A) = n), then we recall that kappa_2(A) =
||A||_2 ||A^{-1}||_2. If A in R^{m x n} and m >= n, then we extend this definition to

   kappa_2(A) = ||A||_2 ||A^+||_2.                           (4.A.11)

We have established that ||A||_2 = sigma_0, so since A^+ = V Sigma^+ U^T we have ||A^+||_2 =
1/sigma_{r-1}. Consequently,

   kappa_2(A) = sigma_0 / sigma_{r-1},                       (4.A.12)

which provides a somewhat better justification of (4.119), because if rank(A) =
n then (4.A.12) is kappa_2(A) = sigma_0/sigma_{n-1} = sigma_max(A)/sigma_min(A) [which is (4.119)].
From (4.A.11), kappa_2(A^T A) = ||A^T A||_2 ||(A^T A)^+||_2. With A = U Sigma V^T, and A^T =
V Sigma^T U^T, we have

   A^T A = V Sigma^T Sigma V^T,

and

   (A^T A)^+ = V (Sigma^T Sigma)^+ V^T.

Thus, ||A^T A||_2 = sigma_0^2, and ||(A^T A)^+||_2 = 1/sigma_{n-1}^2 (rank(A) = n). Thus

   kappa_2(A^T A) = sigma_0^2 / sigma_{n-1}^2 = [kappa_2(A)]^2.
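These identities are straightforward to verify numerically. The following sketch (Python/NumPy, on an arbitrary random full-column-rank test matrix, not an example from the text) checks the pseudoinverse identities and the relation kappa_2(A^T A) = [kappa_2(A)]^2:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))        # generic A, almost surely of rank n

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) Vt
A_plus = Vt.T @ np.diag(1.0 / s) @ U.T             # A+ = V Sigma+ U^T

# For full column rank, A+ = (A^T A)^{-1} A^T.
A_plus2 = np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(A_plus, A_plus2)

# kappa_2(A) = sigma_max/sigma_min, and kappa_2(A^T A) = kappa_2(A)^2.
kappa = s[0] / s[-1]
assert np.isclose(np.linalg.cond(A, 2), kappa)
assert np.isclose(np.linalg.cond(A.T @ A, 2), kappa**2)

# A+ f is the least-squares solution of Ax = f.
f = rng.standard_normal(m)
x_ls, *_ = np.linalg.lstsq(A, f, rcond=None)
assert np.allclose(A_plus @ f, x_ls)
print("all identities verified")
```

The squaring of the condition number is precisely why forming the normal equations A^T A x = A^T f is avoided for ill-conditioned least-squares problems.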
The condition number definition kappa_p(A) in (4.60) was fully justified because of
(4.58). An analogous justification exists for (4.A.11), but it is much more difficult to
derive, and this is why we do not consider it in this book.
REFERENCES
1. J. R. Rice, The Approximation of Functions, Vol. I: Linear Theory, Addison-Wesley,
Reading, MA, 1964.
2. M.-D. Choi, "Tricks or Treats with the Hilbert Matrix," Am. Math. Monthly 90(5),
301-312 (May 1983).
3. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.),
Random House, New York, 1988.
4. G. E. Forsythe and C. B. Moler, Computer Solution of Linear Algebraic Systems,
Prentice-Hall, Englewood Cliffs, NJ, 1967.
5. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ.
Press, Baltimore, MD, 1989.
6. R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge Univ. Press, Cambridge,
MA, 1985.
7. N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA,
1996.
8. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Mathematics
series, Vol. 37), Springer-Verlag, New York, 2000.
9. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966.
10. O. Axelsson, Iterative Solution Methods, Cambridge Univ. Press, New York, 1994.
11. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 1978.
194 LINEAR SYSTEMS OF EQUATIONS
PROBLEMS
4.1. Function f(x) in L^2[0, 1] is to be approximated according to

        f(x) ~ a_0 + a_1/(x + c)

     using least squares, where c in R is some fixed parameter. This involves
     solving the linear system

        [ 1                  log_e(1 + 1/c)     [ a_0     [ integral_0^1 f(x) dx
          log_e(1 + 1/c)     1/(c(c+1))     ]     a_1 ] =   integral_0^1 f(x)/(x+c) dx ]

     (call the 2 x 2 matrix R), where a is the vector from R^2 that minimizes the
     energy V(a) [Eq. (4.8)].
     (a) Suppose that f(x) = 1 + 1/(x + 1). Find a for c = 1. For this special case
     it is possible to know the answer in advance without solving the linear
     system above. However, this problem requires you to solve the system.
     [Hint: It helps to recall that

        [ a  b ]^{-1}      1      [  d  -b ]
        [ c  d ]      = -------   [ -c   a ].]
                        ad - bc
(b) Derive R [which is a special case of (4.9)].
4.2. Suppose that

        A = [  1   2
              -1  -5 ].

     Find ||A||_inf, ||A||_1, and ||A||_2.
4.3. Suppose that

        A = [ 1        0
              2  epsilon ] in R^{2x2}, epsilon > 0,

     and that A^{-1} exists. Find kappa_inf(A) if epsilon is small.
4.4. Suppose that

        A = [ 1        1
              0  epsilon ] in R^{2x2},

     and assume epsilon > 0 (so that A^{-1} always exists). Find kappa_2(A) = ||A||_2 ||A^{-1}||_2.
     What happens to condition number kappa_2(A) if epsilon -> 0?
4.5. Let A(epsilon), B(epsilon) in R^{n x n}. For example, A(epsilon) = [a_ij(epsilon)], so element a_ij(epsilon) of
     A(epsilon) depends on the parameter epsilon in R.
     (a) Prove that

        d/d epsilon [A(epsilon)B(epsilon)] = A(epsilon) dB(epsilon)/d epsilon + dA(epsilon)/d epsilon B(epsilon),

     where dA(epsilon)/d epsilon = [da_ij(epsilon)/d epsilon], and dB(epsilon)/d epsilon = [db_ij(epsilon)/d epsilon].
     (b) Prove that

        d/d epsilon A^{-1}(epsilon) = -A^{-1}(epsilon) [dA(epsilon)/d epsilon] A^{-1}(epsilon).

     [Hint: Consider d/d epsilon [A(epsilon) A^{-1}(epsilon)] = d/d epsilon I = 0, and use (a).]
4.6. This problem is an alternative derivation of kappa(A). Suppose that epsilon in R, A, F in
     R^{n x n}, and x(epsilon), y, f in R^n. Consider the perturbed linear system of equations

        (A + epsilon F)x(epsilon) = y + epsilon f,           (4.P.1)

     where Ax = y, so x(0) = x is the correct solution. Clearly, epsilon F models the
     errors in A, while epsilon f models the errors in y. From (4.P.1), we obtain

        x(epsilon) = [A + epsilon F]^{-1}(y + epsilon f).    (4.P.2)

     The Taylor series expansion for x(epsilon) about epsilon = 0 is

        x(epsilon) = x(0) + [dx(0)/d epsilon] epsilon + O(epsilon^2),   (4.P.3)

     where O(epsilon^2) denotes terms in the expansion containing epsilon^k for k >= 2.
     Use (4.P.3), results from the previous problem, and basic matrix-vector norm
     properties (Section 4.4) to derive the bound (to first order in epsilon)

        ||x(epsilon) - x|| / ||x|| <= epsilon ||A|| ||A^{-1}|| { ||f||/||y|| + ||F||/||A|| },

     where ||A|| ||A^{-1}|| = kappa(A).
     [Comment: The relative error in A is rho_A = epsilon ||F||/||A||, and the relative error
     in y is rho_y = epsilon ||f||/||y||, so more concisely ||x(epsilon) - x||/||x|| <= kappa(A)(rho_A +
     rho_y).]
4.7. This problem follows the example relating to Eq. (4.95). An analog signal
     f(t) is modeled according to

        f(t) = sum_{j=0}^{p-1} a_j t^j + eta(t),

     where a_j in R for all j in Z_p, and eta(t) is some random noise term. We only
     possess samples of the signal; that is, we only have the finite-length sequence
     (f_n) defined by f_n = f(nT_s) for n in Z_N, where T_s > 0 is the sampling
     period of the data acquisition system that gave us the samples. Our estimate
     of f_n is therefore

        fhat_n = sum_{j=0}^{p-1} a_j T_s^j n^j.

     With a = [a_0 a_1 ... a_{p-2} a_{p-1}]^T in R^p, find

        V(a) = sum_{n=0}^{N-1} [f_n - fhat_n]^2

     in the form of Eq. (4.102). This implies that you must specify p, g, A, and P.
4.8. Sampled data f_n (n = 0, 1, ..., N - 1) is modeled according to

        fhat_n = a + bn + C sin(theta n + phi).

     Recall that sin(A + B) = sin A cos B + cos A sin B. Also recall that

        V(theta) = sum_{n=0}^{N-1} [f_n - fhat_n]^2 = f^T f - f^T A(theta)[A^T(theta)A(theta)]^{-1} A^T(theta) f,

     where f = [f_0 f_1 ... f_{N-1}]^T in R^N. Give a detailed expression for A(theta).
4.9. Prove that the product of two lower triangular matrices is a lower triangular
matrix.
4.10. Prove that the product of two upper triangular matrices is an upper triangular
matrix.
4.11. Find the LU decomposition of the matrix

        A = [  2  -1   0
              -1   2  -1
               0  -1   2 ]

      using Gauss transformations as recommended in Section 4.5. Use A = LU to
      rewrite the factorization of A as A = LDL^T, where L is unit lower triangular,
      and D is a diagonal matrix. Is A > 0? Why?
4.12. Use Gaussian elimination to LU factorize the matrix

        A = [  4   -1   1/2
              -1    4   -1
               1/2 -1    4  ].

      Is A > 0? Why?
4.13. (a) Consider A in R^{n x n}, and A = A^T. Suppose that the leading principal
      submatrices of A all have positive determinants. Prove that A > 0.
      (b) Is

        [  5  -3   0
          -3   5   1
           0   1   5 ] > 0?

      Why?
4.14. The vector space R^n is an inner product space with inner product (x, y) =
      x^T y for all x, y in R^n. Suppose that A in R^{n x n}, A = A^T, and also that A > 0.
      Prove that (x, y) = x^T Ay is also an inner product on the vector space R^n.
4.15. In the quadratic form

        V(x) = f^T f - 2x^T A^T f + x^T A^T A x,

      we assume x in R^n, and A^T A > 0. Prove that

        V(xhat) = f^T f - f^T A[A^T A]^{-1} A^T f,

      where xhat is the vector that minimizes V(x).
4.16. Suppose that A in R^{N x M} with M < N, and rank(A) = M. Suppose also that
      (x, y) = x^T y. Consider P = A[A^T A]^{-1} A^T, P_perp = I - P (I is the N x N
      identity matrix). Prove that for all x in R^N we have (Px, P_perp x) = 0. (Com-
      ment: Matrices P and P_perp are examples of orthogonal projection operators.)
4.17. (a) Write a MATLAB function for forward substitution (solving Lz — y).
Write a MATLAB function for backward substitution (solving Ux — z).
Test your functions out on the following matrices and vectors:
v
[1
1
o -
- 1
2
3 4
1 1
| 1
, u =
3
5 5
7 1
3 3
-f
1 _
.
f
1 -1 -
if
.* = [!
2
7
3
~2] T -
      (b) Write a MATLAB function to implement the LU decomposition algorithm
      based on Gauss transformations considered in Section 4.5. Test
      your function out on the following A matrices:

        A = [  1  2  3  4        A = [  2  -1   0
              -1  1  2  1              -1   2  -1
               0  2  1  3               0  -1   2 ].
               0  0  1  1 ],
Figure 4.P.1  The DC electric circuit of Problem 4.18.
4.18. Consider the DC electric circuit in Fig. 4.P.1.
      Write the node equations for the node voltages V_1, V_2, ..., V_6 as shown.
      These may be loaded into a vector

        v = [V_1 V_2 V_3 V_4 V_5 V_6]^T

      such that the node equations have the form

        Gv = y.                                              (4.P.4)

      Use the Gaussian elimination, forward substitution, and backward substitution
      MATLAB functions from the previous problem to solve the linear system
      (4.P.4).
4.19. Let I_n in R^{n x n} be the order-n identity matrix, and define

        T_n = [ a  b  0  ...  0
                b  a  b  ...  0
                0  b  a  ...  0
                .     .   .   b
                0  ...    b   a ] in R^{n x n}.

      Matrix T_n is tridiagonal. It is also an example of a symmetric matrix that is
      Toeplitz (defined in a later problem). The characteristic polynomial of T_n is
      p_n(lambda) = det(lambda I_n - T_n). Show that

        p_n(lambda) = (lambda - a) p_{n-1}(lambda) - b^2 p_{n-2}(lambda)   (4.P.5)

      for n = 3, 4, 5, .... Find p_1(lambda) and p_2(lambda) [initial conditions for the polynomial
      recursion in (4.P.5)].
4.20. Consider the following definition: T in C^{n x n} is Toeplitz if it has the form T =
      [t_{i-j}]_{i,j = 0, 1, ..., n-1}. Thus, for example, if n = 3 we have

        T = [ t_0   t_{-1}  t_{-2}
              t_1   t_0     t_{-1}
              t_2   t_1     t_0    ].

      Observe that in a Toeplitz matrix all of the elements on any given
      diagonal are equal to each other. A symmetric Toeplitz matrix has the
      form T = [t_{|i-j|}], since T = T^T implies that t_{-i} = t_i for all i. Let x_n =
      [x_{n,0} x_{n,1} ... x_{n,n-2} x_{n,n-1}]^T in C^n. Let J_n in C^{n x n} be the n x n exchange
      matrix (also called the contra-identity matrix), which is defined as the
      matrix yielding xtilde_n = J_n x_n = [x_{n,n-1} x_{n,n-2} ... x_{n,1} x_{n,0}]^T. We see that J_n
      simply reverses the order of the elements of x_n. An immediate con-
      sequence of this is that J_n xtilde_n = x_n (i.e., J_n^2 = I_n). What is J_3? (Write
      this matrix out completely.) Suppose that T_n is a symmetric Toeplitz
      matrix.
      (a) Show that (noting that rtilde_n = J_n r_n)

        T_{n+1} = [ T_n        rtilde_n       [ t_0   r_n^T
                    rtilde_n^T  t_0     ]  =    r_n   T_n   ]

      (nesting property). What is r_n?
      (b) Show that

        J_n T_n J_n = T_n

      (persymmetry property).
      (Comment: Toeplitz matrices have an important role to play in digital signal
      processing. For example, they appear in problems in spectral analysis, and
      in voice compression algorithms.)
4.21. This problem is about a computationally efficient method to solve the linear
      system

        R_n a_n = sigma_n^2 e_0,                             (4.P.6)

      where R_n in R^{n x n} is symmetric and Toeplitz (recall Problem 4.20). All
      of the leading principal submatrices (recall the definition in Theo-
      rem 4.1) of R_n are nonsingular. Also, e_0 = [1 0 ... 0]^T in R^n, a_n =
      [1 a_{n,1} ... a_{n,n-2} a_{n,n-1}]^T, and sigma_n^2 in R is unknown as well. Define
      etilde_0 = J_n e_0. Clearly, a_{n,0} = 1 (all n).
      (a) Prove that

        R_n atilde_n = sigma_n^2 etilde_0,                   (4.P.7)

      where atilde_n = J_n a_n.
      (b) Observe that R_n [a_n  atilde_n] = sigma_n^2 [e_0  etilde_0]. Augmented matrix [a_n  atilde_n] is
      n x 2, and so [e_0  etilde_0] is n x 2 as well. Prove that

        R_{n+1} [ a_n   0            [ sigma_n^2 e_0   eta_n
                  0     atilde_n ] =   eta_n           sigma_n^2 etilde_0 ],   (4.P.8)

      where each block column on the left is a vector in R^{n+1} (a_n padded with a
      trailing zero, atilde_n padded with a leading zero). What is eta_n? (That is, find
      a simple expression for it.)
      (c) We wish to obtain

        R_{n+1} [a_{n+1}  atilde_{n+1}] = sigma_{n+1}^2 [e_0  etilde_0]   (4.P.9)

      from a manipulation of (4.P.8). To this end, find a formula for parameter
      K_n in R in

        R_{n+1} [ a_n   0          [ 1    K_n     [ sigma_n^2 e_0   eta_n              [ 1    K_n
                  0     atilde_n ]   K_n  1   ] =   eta_n           sigma_n^2 etilde_0 ]  K_n  1   ]

      such that (4.P.9) is obtained. This implies that we obtain the vector
      recursions

        a_{n+1} = [ a_n    + K_n [ 0                atilde_{n+1} = K_n [ a_n    + [ 0
                    0   ]          atilde_n ]  and                       0   ]      atilde_n ].

      Find the initial condition a_1 in R^{1x1}. What is sigma_1^2?
      (d) Prove that

        sigma_{n+1}^2 = sigma_n^2 (1 - K_n^2).

      (e) Summarize the algorithm obtained in the previous steps in the form of
      pseudocode. The resulting algorithm is often called the Levinson-Durbin
      algorithm.
      (f) Count the number of arithmetic operations needed to implement the
      Levinson-Durbin algorithm in (e). Compare this number to the num-
      ber of arithmetic operations needed by the general LU decomposition
      algorithm presented in Section 4.5.
      (g) Write a MATLAB function to implement the Levinson-Durbin algorithm.
      Test your algorithm out on the matrix

        [ 2  1  0
          1  2  1
          0  1  2 ].

      (Hint: You will need the properties in Problem 4.20. The parameters K_n
      are called reflection coefficients, and are connected to certain problems in
      electrical transmission line theory.)

4.22. This problem is about proving that solving (4.P.6) yields the LDL^T decomposition
      of R_n^{-1}. (Of course, R_n is real-valued, symmetric, and Toeplitz.)
      Observe that (via the nesting property for Toeplitz matrices)
        R_n [ 1          0            ...  0      [ sigma_n^2   x            ...  x
              a_{n,1}    1                 0        0           sigma_{n-1}^2     x
              a_{n,2}    a_{n-1,1}         .    =   .                .            .
              ...        ...          .    0        .                  .          x
              a_{n,n-1}  a_{n-1,n-2}  ...  1 ]      0           ...     0    sigma_1^2 ],

              \__________ = L_n __________/         \__________ = U_n __________/

      (4.P.10)

      where x denotes "don't care" entries; thus, the particular value of such an
      entry is of no interest to us. Use (4.P.10) to prove that

        L_n^T R_n L_n = D_n,                                 (4.P.11)

      where D_n = diag(sigma_n^2, sigma_{n-1}^2, ..., sigma_2^2, sigma_1^2) (diagonal matrix; see the comments
      following Theorem 4.4). Use (4.P.11) to prove that

        R_n^{-1} = L_n D_n^{-1} L_n^T.                       (4.P.12)

      What is D_n^{-1}? [Hint: Using (4.P.10), and R_n = R_n^T, we note that L_n^T R_n L_n can
      be expressed in two distinct but equivalent ways. This observation is used
      to establish (4.P.11).]
4.23. A matrix T_n is said to be strongly nonsingular (strongly regular) if all of its
      leading principal submatrices are nonsingular (recall the definition in Theo-
      rem 4.1). Suppose that x_n = [x_{n,0} x_{n,1} ... x_{n,n-2} x_{n,n-1}]^T in C^n, and define
      Z_n in C^{n x n} according to

        Z_n x_n = [0 x_{n,0} x_{n,1} ... x_{n,n-2}]^T.

      Thus, Z_n shifts the elements of any column vector down by one position.
      The top position is filled in with a zero, while the last element x_{n,n-1} is lost.
      Assume that T_n is real-valued, symmetric, and Toeplitz. Consider

        X_n = T_n - Z_n T_n Z_n^T.

      (a) Find X_n.
      (b) If T_n is strongly nonsingular, then what is rank(X_n)? (Be careful. There
      may be separate cases to consider.)
      (c) Use delta_k (Kronecker delta) to specify Z_n; that is, Z_n = [z_ij], so what is
      z_ij in terms of the Kronecker delta? (Hint: For example, the identity
      matrix can be described as I = [delta_{i-j}].)
4.24. Suppose that R in R^{N x N} is strongly nonsingular (see Problem 4.23 for the
      definition of this term), and that R = R^T. Thus, there exists the factorization

        R = L_N D_N L_N^T,                                   (4.P.13)

      where L_N is a unit lower triangular matrix, and D_N is a diagonal matrix.
      Let

        L_N = [l_0 l_1 ... l_{N-2} l_{N-1}],

      so l_i is column i of L_N. Let

        D_N = diag(d_0, d_1, ..., d_{N-1}).

      Thus, via (4.P.13), we have

        R = sum_{k=0}^{N-1} d_k l_k l_k^T.                   (4.P.14)

      Consider the following algorithm (with R_0 = R):

        for n := 0 to N - 1 do begin
            d_n := e_n^T R_n e_n;
            l_n := d_n^{-1} R_n e_n;
            R_{n+1} := R_n - d_n l_n l_n^T;
        end;

      As usual, we have the unit vector

        e_n = [0 ... 0 1 0 ... 0]^T in R^N

      (the 1 is in position n). The algorithm above is the Jacobi procedure (algorithm)
      for computing the Cholesky factorization (recall Theorem 4.6) of R.
      Is R > 0 a necessary condition for the algorithm to work? Explain. Test the
      Jacobi procedure out on

        R = [ 2  1  0
              1  2  1
              0  1  2 ].

      Is R > 0? Justify your answer. How many arithmetic operations are needed to
      implement the Jacobi procedure? How does this compare with the Gaussian
      elimination method for general LU factorization considered in Section 4.5?
4.25. Suppose that A^{-1} exists, and that I + V^T A^{-1} U is also nonsingular. Of
      course, I is the identity matrix.
      (a) Prove the Sherman-Morrison-Woodbury formula

        [A + UV^T]^{-1} = A^{-1} - A^{-1} U [I + V^T A^{-1} U]^{-1} V^T A^{-1}.

      (b) Prove that if U = u in C^n and V = v in C^n, then

        [A + uv^T]^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u).

      (Comment: These identities can be used to develop adaptive filtering algorithms,
      for example.)
4.26. Suppose that

        A = [ a  b
              b  c ].

      Find conditions on a, b, and c to ensure that A > 0.
4.27. Suppose that A in R^{n x n}, and that A is not necessarily symmetric. We still
      say that A > 0 iff x^T Ax > 0 for all x != 0 (x in R^n). Show that A > 0 iff
      B = (1/2)(A + A^T) > 0. Matrix B is often called the symmetric part of A. (Note:
      In this book, unless stated to the contrary, a pd matrix is always assumed to
      be symmetric.)
4.28. Prove that for A in R^{n x n}

        ||A||_inf = max_{0 <= i <= n-1} sum_{j=0}^{n-1} |a_ij|.
4.29. Derive Eq. (4.128) (the law of cosines).
4.30. Consider A_n in R^{n x n} such that

        A_n = [  1   0   0  ...  0
                -1   1   0  ...  0
                -1  -1   1  ...  0
                 .   .   .   .   .
                -1  -1  -1  ...  1 ],

      with inverse A_n^{-1} = [b_ij], where b_ii = 1, b_ij = 2^{i-j-1} for i > j, and
      b_ij = 0 for i < j; for example, the last row of A_n^{-1} is
      [2^{n-2}  2^{n-3}  ...  2  1  1].
      Show that kappa_inf(A_n) = n 2^{n-1}. Clearly, det(A_n) = 1. Consider

        D_n = diag(10^{-1}, 10^{-1}, ..., 10^{-1}) in R^{n x n}.

      What is kappa_p(D_n)? Clearly, det(D_n) = 10^{-n}. What do these two cases say
      about the relationship between det(A) and kappa(A) (A in R^{n x n}, and is nonsingular)
      in general?
4.31. Recall Problem 1.7 (in Chapter 1). Suppose a = [a_0 a_1 ... a_n]^T, b = [b_0 b_1
      ... b_m]^T, and c = [c_0 c_1 ... c_{n+m}]^T, where

        c_i = sum_{k=0}^{n} a_k b_{i-k}

      (with b_j = 0 for j < 0 and j > m). Find matrix A such that c = Ab, and find
      matrix B such that c = Ba. What are the sizes of matrices A and B? (Hint: A
      and B will be rectangular Toeplitz matrices. This problem demonstrates the
      close association between Toeplitz matrices and the convolution operation,
      and so partially explains the central importance of Toeplitz matrices in
      digital signal processing.)
4.32. Matrix P in R^{n x n} is a permutation matrix if it possesses exactly one 1 per
      row and column, and zeros everywhere else. Such a matrix simply reorders
      the elements in a vector. For example,

        [ 0  1  0  0     [ x_0     [ x_1
          0  0  1  0       x_1       x_2
          0  0  0  1       x_2   =   x_3
          1  0  0  0 ]     x_3 ]     x_0 ].

      Show that P^{-1} = P^T (i.e., P is an orthogonal matrix).
4.33. Find c and s in

        [  c  s     [ x_0     [ r
          -s  c ]     x_1 ] =   0 ],

      where s^2 + c^2 = 1.
4.34. Consider

        A = [ 2  1  0  0
              1  2  1  0
              0  1  2  1
              0  0  1  2 ].

      Use Householder transformation matrices to find the QR factorization of
      matrix A.
4.35. Consider

        A = [ 5  2  1  0
              2  5  2  1
              1  2  5  2
              0  1  2  5 ].

      Using Householder transformation matrices, find orthogonal matrices Q_0
      and Q_1 such that

        Q_1 Q_0 A = [ h_00  h_01  h_02  h_03
                      h_10  h_11  h_12  h_13
                      0     h_21  h_22  h_23
                      0     0     h_32  h_33 ] = H.

      [Comment: Matrix H is upper Hessenberg. This problem is an illustration of
      a process that is important in finding matrix eigenvalues (Chapter 11).]
4.36. Write a MATLAB function to implement the Householder QR factorization
      algorithm as given by the pseudocode in Section 4.6 [between Eqs. (4.146)
      and (4.147)]. The function must output the separate factors H_i that make up
      Q, in addition to the factors Q and R. Test your function out on the matrix
      A in Problem 4.34.
4.37. Prove Eq. (4.117), and establish that
rank (A) = rank (A T ).
4.38. If A in C^{n x n}, then the trace of A is given by

        Tr(A) = sum_{k=0}^{n-1} a_{k,k}

      (i.e., it is the sum of the main diagonal elements of matrix A). Prove that

        ||A||_F^2 = Tr(A A^H)

      [recall (4.36a)].
4.39. Suppose that A, B in R^{n x n}, and Q in R^{n x n} is orthogonal; then, if A =
      QBQ^T, prove that

        Tr(A) = Tr(B).
4.40. Recall Theorem 4.4. Prove that for A in R^{n x n}

        ||A||_F^2 = sum_{k=0}^{n-1} sigma_k^2.

      (Hint: Use the results from Problems 4.38 and 4.39.)
4.41. Consider Theorem 4.9. Is alpha < 1 a necessary condition for convergence?
      Explain.
4.42. Suppose that A in R^{n x n} is pd, and prove that the JOR method (Section 4.7)
      converges if

        0 < omega < 2 / rho(D^{-1} A).
4.43. Repeat the analysis made in Example 4.8, but instead use the matrix

        A = [ 4  2  1  0
              1  4  1  0
              0  1  4  1
              0  0  2  4 ].

      This will involve writing and running a suitable MATLAB function. Find the
      optimum choice for omega to an accuracy of +/- 0.02.
5
Orthogonal Polynomials
5.1 INTRODUCTION
Orthogonal polynomials arise in highly diverse settings. They can be solutions to
special classes of differential equations that arise in mathematical physics problems.
They are vital in the design of analog and digital filters. They arise in numerical
integration methods, and they have a considerable role to play in solving least-
squares and uniform approximation problems.
Therefore, in this chapter we begin by considering some of the properties shared
by all orthogonal polynomials. We then consider the special cases of Chebyshev,
Hermite, and Legendre polynomials. Additionally, we consider the application
of orthogonal polynomials to least-squares and uniform approximation problems.
However, we emphasize the case of least-squares approximation, which was first
considered in some depth in Chapter 4. The approach to least-squares problems
taken here alleviates some of the concerns about ill-conditioning that were noted
in Chapter 4.
5.2 GENERAL PROPERTIES OF ORTHOGONAL POLYNOMIALS
We are interested here in the inner product space L^2(D), where D is the domain
of definition of the functions in the space. Typically, D = [a, b], D = R, or D =
[0, ∞). We shall, as in Chapter 4, assume that all members of L^2(D) are real-valued
to simplify matters. So far, our inner product has been

  ⟨f, g⟩ = ∫_D f(x) g(x) dx,    (5.1)

but now we consider the weighted inner product

  ⟨f, g⟩ = ∫_D w(x) f(x) g(x) dx,    (5.2)

where w(x) > 0 (x ∈ D) is the weighting function. This includes (5.1) as a special
case for which w(x) = 1 for all x ∈ D.
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
Our goal is to consider polynomials

  φ_n(x) = Σ_{k=0}^{n} φ_{n,k} x^k,  x ∈ D    (5.3)

of degree n such that

  ⟨φ_n, φ_m⟩ = δ_{n−m}    (5.4)

for all n, m ∈ Z⁺. The inner product is that of (5.2). The polynomials {φ_n(x) | n ∈
Z⁺} are orthogonal polynomials on D with respect to the weighting function w(x).
Changing D and/or w(x) will generate very different polynomials, and we will
consider important special cases later. However, all orthogonal polynomials possess
certain features in common regardless of the choice of D or w(x).
We shall consider a few of these in this section.

If p(x) is a polynomial of degree n, then we may write deg(p(x)) = n.
Theorem 5.1: Any three consecutive orthogonal polynomials are related by the
three-term recurrence formula (relation)

  φ_{n+1}(x) = (A_n x + B_n) φ_n(x) + C_n φ_{n−1}(x),    (5.5)

where

  A_n = φ_{n+1,n+1}/φ_{n,n},
  B_n = (φ_{n+1,n+1}/φ_{n,n}) [φ_{n+1,n}/φ_{n+1,n+1} − φ_{n,n−1}/φ_{n,n}],
  C_n = −φ_{n+1,n+1} φ_{n−1,n−1}/φ_{n,n}^2.    (5.6)
Proof  Our proof is a somewhat expanded version of that in Isaacson and Keller
[1, pp. 204-205].

Observe that for A_n = φ_{n+1,n+1}/φ_{n,n}, we have

  q_n(x) = φ_{n+1}(x) − A_n x φ_n(x) = Σ_{k=0}^{n+1} φ_{n+1,k} x^k − A_n Σ_{k=0}^{n} φ_{n,k} x^{k+1}
    = Σ_{k=0}^{n+1} φ_{n+1,k} x^k − A_n Σ_{j=1}^{n+1} φ_{n,j−1} x^j
    = Σ_{k=0}^{n} φ_{n+1,k} x^k − A_n Σ_{j=1}^{n} φ_{n,j−1} x^j + [φ_{n+1,n+1} − A_n φ_{n,n}] x^{n+1}
    = Σ_{k=0}^{n} φ_{n+1,k} x^k − A_n Σ_{j=1}^{n} φ_{n,j−1} x^j,
which is a polynomial of degree at most n, i.e., deg(q_n(x)) ≤ n. Thus, for suitable a_k

  q_n(x) = Σ_{k=0}^{n} a_k φ_k(x),

so because ⟨φ_k, φ_j⟩ = δ_{k−j}, we have a_j = ⟨q_n, φ_j⟩. In addition

  φ_{n+1}(x) − A_n x φ_n(x) = a_n φ_n(x) + a_{n−1} φ_{n−1}(x) + Σ_{k=0}^{n−2} a_k φ_k(x).    (5.7)

Now deg(x φ_j(x)) ≤ n − 1 for j ≤ n − 2. Thus, there are β_k such that

  x φ_j(x) = Σ_{k=0}^{n−1} β_k φ_k(x),

so ⟨φ_i, x φ_j⟩ = 0 if i > n − 1, or

  ⟨φ_n, x φ_j⟩ = 0    (5.8)

for j = 0, 1, …, n − 2. From (5.7) via (5.8), we obtain

  ⟨φ_{n+1} − A_n x φ_n, φ_k⟩ = ⟨φ_{n+1}, φ_k⟩ − A_n ⟨φ_n, x φ_k⟩ = 0

for k = 0, 1, …, n − 2. This is the inner product of φ_k(x) with the left-hand side
of (5.7). For the right-hand side of (5.7)

  ⟨a_n φ_n + a_{n−1} φ_{n−1} + Σ_{j=0}^{n−2} a_j φ_j, φ_k⟩
    = a_n ⟨φ_n, φ_k⟩ + a_{n−1} ⟨φ_{n−1}, φ_k⟩ + Σ_{j=0}^{n−2} a_j ⟨φ_j, φ_k⟩
    = Σ_{j=0}^{n−2} a_j ⟨φ_j, φ_k⟩,

again for k = 0, …, n − 2. We can only have Σ_{j=0}^{n−2} a_j ⟨φ_j, φ_k⟩ = 0 if a_j = 0 for
j = 0, 1, …, n − 2. Thus, (5.7) reduces to

  φ_{n+1}(x) − A_n x φ_n(x) = a_n φ_n(x) + a_{n−1} φ_{n−1}(x)

or

  φ_{n+1}(x) = (A_n x + a_n) φ_n(x) + a_{n−1} φ_{n−1}(x),    (5.9)
which has the form of (5.5). We now need to verify that a_n = B_n and a_{n−1} = C_n as
in (5.6). From (5.9)

  φ_n(x) = A_{n−1} x φ_{n−1}(x) + a_{n−1} φ_{n−1}(x) + a_{n−2} φ_{n−2}(x)

or

  x φ_{n−1}(x) = (1/A_{n−1}) φ_n(x) − (1/A_{n−1}) [a_{n−1} φ_{n−1}(x) + a_{n−2} φ_{n−2}(x)]    (5.10)
               = (1/A_{n−1}) φ_n(x) + p_{n−1}(x),

for which deg(p_{n−1}(x)) ≤ n − 1. Thus, from (5.9)

  ⟨φ_{n+1}, φ_{n−1}⟩ = A_n ⟨x φ_n, φ_{n−1}⟩ + a_n ⟨φ_n, φ_{n−1}⟩ + a_{n−1} ⟨φ_{n−1}, φ_{n−1}⟩ = 0,

so that

  a_{n−1} = −A_n ⟨x φ_n, φ_{n−1}⟩ = −A_n ⟨φ_n, x φ_{n−1}⟩,

and via (5.10)

  a_{n−1} = −A_n ⟨φ_n, (1/A_{n−1}) φ_n + p_{n−1}⟩,

which becomes

  a_{n−1} = −(A_n/A_{n−1}) ⟨φ_n, φ_n⟩ = −A_n/A_{n−1},

or

  a_{n−1} = −(φ_{n+1,n+1}/φ_{n,n}) (φ_{n−1,n−1}/φ_{n,n}) = −φ_{n+1,n+1} φ_{n−1,n−1}/φ_{n,n}^2,

which is the expression for C_n in (5.6). Expanding (5.5)

  Σ_{k=0}^{n+1} φ_{n+1,k} x^k = A_n Σ_{k=1}^{n+1} φ_{n,k−1} x^k + B_n Σ_{k=0}^{n} φ_{n,k} x^k + C_n Σ_{k=0}^{n−1} φ_{n−1,k} x^k,

where, on comparing the terms for k = n, we see that

  φ_{n+1,n} = A_n φ_{n,n−1} + B_n φ_{n,n}

(since φ_{n−1,n} = 0). Therefore

  B_n = (1/φ_{n,n}) [φ_{n+1,n} − A_n φ_{n,n−1}] = (φ_{n+1,n+1}/φ_{n,n}) [φ_{n+1,n}/φ_{n+1,n+1} − φ_{n,n−1}/φ_{n,n}],

which is the expression for B_n in (5.6).
Theorem 5.2: Orthogonal polynomials satisfy the Christoffel-Darboux formula
(relation)

  (x − y) Σ_{j=0}^{n} φ_j(x) φ_j(y) = (φ_{n,n}/φ_{n+1,n+1}) [φ_{n+1}(x) φ_n(y) − φ_{n+1}(y) φ_n(x)].    (5.11)

Proof  The proof is an expanded version of that from Isaacson and Keller [1,
p. 205].

From (5.5), we obtain

  φ_n(y) φ_{n+1}(x) = (A_n x + B_n) φ_n(x) φ_n(y) + C_n φ_{n−1}(x) φ_n(y),    (5.12)

so reversing the roles of x and y gives

  φ_n(x) φ_{n+1}(y) = (A_n y + B_n) φ_n(y) φ_n(x) + C_n φ_{n−1}(y) φ_n(x).    (5.13)

Subtracting (5.13) from (5.12) yields

  φ_n(y) φ_{n+1}(x) − φ_n(x) φ_{n+1}(y) = A_n (x − y) φ_n(x) φ_n(y)
    + C_n [φ_{n−1}(x) φ_n(y) − φ_n(x) φ_{n−1}(y)].    (5.14)

We note that

  −C_n/A_n = φ_{n−1,n−1}/φ_{n,n} = 1/A_{n−1},    (5.15)

so (5.14) may be rewritten as

  (x − y) φ_n(x) φ_n(y) = A_n^{−1} [φ_{n+1}(x) φ_n(y) − φ_{n+1}(y) φ_n(x)]
    − A_{n−1}^{−1} [φ_n(x) φ_{n−1}(y) − φ_{n−1}(x) φ_n(y)].

Now consider

  (x − y) Σ_{j=0}^{n} φ_j(x) φ_j(y) = Σ_{j=0}^{n} { A_j^{−1} [φ_{j+1}(x) φ_j(y) − φ_{j+1}(y) φ_j(x)]
    − A_{j−1}^{−1} [φ_j(x) φ_{j−1}(y) − φ_{j−1}(x) φ_j(y)] }.    (5.16)

Since A_j = φ_{j+1,j+1}/φ_{j,j}, we have A_{−1}^{−1} = φ_{−1,−1}/φ_{0,0} = 0 because φ_{−1,−1} = 0.
Taking this into account, the summation on the right-hand side of (5.16) reduces
due to the cancellation of terms,¹ and so (5.16) finally becomes

  (x − y) Σ_{j=0}^{n} φ_j(x) φ_j(y) = A_n^{−1} [φ_{n+1}(x) φ_n(y) − φ_{n+1}(y) φ_n(x)],

which is (5.11) since A_n^{−1} = φ_{n,n}/φ_{n+1,n+1}.

¹The cancellation process seen here is similar to the one that occurred in the derivation of the Dirichlet
kernel identity (3.24). This mathematical technique seems to be a recurring theme in analysis.
The following corollary to Theorem 5.2 is from Hildebrand [2, p. 342]. However,
there appears to be no proof in Ref. 2, so one is provided here.

Corollary 5.1  With φ_k^{(1)}(x) = dφ_k(x)/dx, we have

  Σ_{j=0}^{n} φ_j^2(x) = (φ_{n,n}/φ_{n+1,n+1}) [φ_{n+1}^{(1)}(x) φ_n(x) − φ_n^{(1)}(x) φ_{n+1}(x)].    (5.17)

Proof  Since

  φ_{n+1}(x) φ_n(y) − φ_{n+1}(y) φ_n(x) = [φ_{n+1}(x) − φ_{n+1}(y)] φ_n(y) − [φ_n(x) − φ_n(y)] φ_{n+1}(y),

via (5.11)

  Σ_{j=0}^{n} φ_j(x) φ_j(y) = (φ_{n,n}/φ_{n+1,n+1}) { [φ_{n+1}(x) − φ_{n+1}(y)]/(x − y) · φ_n(y)
    − [φ_n(x) − φ_n(y)]/(x − y) · φ_{n+1}(y) }.    (5.18)

Letting y → x in (5.18) immediately yields (5.17). This is so from the definition of
the derivative, and the fact that all polynomials may be differentiated any number
of times.
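Equation (5.11) is easy to sanity-check numerically. The sketch below (illustrative, not from the text) uses normalized Chebyshev polynomials of the first kind, treated in Section 5.3: φ_0 = T_0/√π and φ_n = √(2/π) T_n for n ≥ 1 are orthonormal for the weight 1/√(1 − x²), and T_n(x) = cos(n cos^{−1} x) has leading coefficient 2^{n−1} for n ≥ 1, which supplies the ratio φ_{n,n}/φ_{n+1,n+1}.

```python
import math

# Orthonormal Chebyshev polynomials on [-1, 1] for the weight 1/sqrt(1 - x^2):
# phi_0 = T_0/sqrt(pi) and phi_n = sqrt(2/pi)*T_n for n >= 1, where
# T_n(x) = cos(n*arccos(x)) has leading coefficient 2^(n-1) for n >= 1.
def phi(j, x):
    t = math.cos(j * math.acos(x))
    return t / math.sqrt(math.pi) if j == 0 else t * math.sqrt(2.0 / math.pi)

def lead(j):
    # phi_{j,j}: the coefficient of x^j in phi_j(x)
    return 1.0 / math.sqrt(math.pi) if j == 0 else 2.0 ** (j - 1) * math.sqrt(2.0 / math.pi)

def christoffel_darboux_gap(n, x, y):
    # |LHS - RHS| of the Christoffel-Darboux formula (5.11)
    lhs = (x - y) * sum(phi(j, x) * phi(j, y) for j in range(n + 1))
    rhs = (lead(n) / lead(n + 1)) * (phi(n + 1, x) * phi(n, y) - phi(n + 1, y) * phi(n, x))
    return abs(lhs - rhs)

for n in (1, 2, 5, 8):
    assert christoffel_darboux_gap(n, 0.3, -0.7) < 1e-12
```

Any other orthonormal family, and any sample points in the domain, could be substituted.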
The three-term recurrence relation is a vital practical method for orthogonal
polynomial generation. However, an alternative approach is the following.
If deg(q_{r−1}(x)) = r − 1 (q_{r−1}(x) is an arbitrary degree-(r − 1) polynomial), then

  ⟨φ_r, q_{r−1}⟩ = ∫_D w(x) φ_r(x) q_{r−1}(x) dx = 0.    (5.19)

Define

  w(x) φ_r(x) = d^r G_r(x)/dx^r = G_r^{(r)}(x)    (5.20)

so that (5.19) becomes

  ∫_D G_r^{(r)}(x) q_{r−1}(x) dx = 0.    (5.21)

If we repeatedly integrate by parts (using q_{r−1}^{(r)}(x) = 0 for all x), we obtain

  ∫_D G_r^{(r)}(x) q_{r−1}(x) dx = G_r^{(r−1)}(x) q_{r−1}(x)|_D − ∫_D G_r^{(r−1)}(x) q_{r−1}^{(1)}(x) dx
    = G_r^{(r−1)}(x) q_{r−1}(x)|_D − G_r^{(r−2)}(x) q_{r−1}^{(1)}(x)|_D + ∫_D G_r^{(r−2)}(x) q_{r−1}^{(2)}(x) dx
    = G_r^{(r−1)}(x) q_{r−1}(x)|_D − G_r^{(r−2)}(x) q_{r−1}^{(1)}(x)|_D + G_r^{(r−3)}(x) q_{r−1}^{(2)}(x)|_D
      − ∫_D G_r^{(r−3)}(x) q_{r−1}^{(3)}(x) dx,

and so on. Finally,

  [G_r^{(r−1)} q_{r−1} − G_r^{(r−2)} q_{r−1}^{(1)} + G_r^{(r−3)} q_{r−1}^{(2)} − ⋯ + (−1)^{r−1} G_r q_{r−1}^{(r−1)}]_D = 0,    (5.22a)

or alternatively

  [Σ_{k=1}^{r} (−1)^{k−1} G_r^{(r−k)} q_{r−1}^{(k−1)}]_D = 0.    (5.22b)

Since from (5.20)

  φ_r(x) = (1/w(x)) d^r G_r(x)/dx^r,    (5.23)

which is a polynomial of degree r, it must be the case that G_r(x) satisfies the
differential equation

  d^{r+1}/dx^{r+1} [(1/w(x)) d^r G_r(x)/dx^r] = 0    (5.24)

for x ∈ D. Recalling D = [a, b] and allowing a → −∞, b → ∞ (so we may have
D = [a, ∞), or D = R), Eq. (5.22a) must be satisfied for any values of q_{r−1}^{(k)}(a) and
q_{r−1}^{(k)}(b) (k = 0, 1, …, r − 1). This implies that we have the boundary conditions

  G_r^{(k)}(a) = 0,  G_r^{(k)}(b) = 0    (5.25)
for k = 0, 1, …, r − 1. These restrict the solution to (5.24). It can be shown [6]
that (5.24) has a nontrivial² solution for all r ∈ Z⁺, assuming that w(x) > 0 for
all x ∈ D and that the moments ∫_D x^k w(x) dx exist for all k ∈ Z⁺. Proving this
is difficult, and so it will not be considered here. The expression in (5.23) may be called
a Rodrigues formula for φ_r(x) (although it should be said that this terminology
usually arises only in the context of Legendre polynomials).
The Christoffel-Darboux formulas arise in various settings. For example, they
are relevant to the problem of designing Savitzky-Golay smoothing digital filters
[3], and they arise in numerical integration methods [2, 4]. Another way in which
the Christoffel-Darboux formula of (5.11) can make an appearance is as follows.
²The trivial solution of a differential equation is the identically zero function.
We may wish to approximate f(x) ∈ L^2(D) as

  f(x) ≈ Σ_{j=0}^{n} a_j φ_j(x),    (5.26)

so the residual is

  r_n(x) = f(x) − Σ_{j=0}^{n} a_j φ_j(x).    (5.27)

Adopting the now familiar least-squares approach, we select a_j to minimize the
functional V(a) = ||r_n||^2 = ⟨r_n, r_n⟩:

  V(a) = ∫_D w(x) r_n^2(x) dx = ∫_D w(x) [f(x) − Σ_{j=0}^{n} a_j φ_j(x)]^2 dx,    (5.28)

where a = [a_0 a_1 ⋯ a_n]^T ∈ R^{n+1}. Of course, this expands to become

  V(a) = ∫_D w(x) f^2(x) dx − 2 Σ_{j=0}^{n} a_j ∫_D w(x) f(x) φ_j(x) dx
    + Σ_{j=0}^{n} Σ_{k=0}^{n} a_j a_k ∫_D w(x) φ_j(x) φ_k(x) dx,

so if g = [g_0 g_1 ⋯ g_n]^T with g_j = ∫_D w(x) f(x) φ_j(x) dx, R = [r_{ij}] ∈
R^{(n+1)×(n+1)} with r_{ij} = ∫_D w(x) φ_i(x) φ_j(x) dx, and ρ = ∫_D w(x) f^2(x) dx, then

  V(a) = ρ − 2a^T g + a^T R a.    (5.29)

But r_{ij} = ⟨φ_i, φ_j⟩ = δ_{i−j} [via (5.4)], so we have R = I (the identity matrix). Immediately,
one of the advantages of working with orthogonal polynomials is that R
is perfectly conditioned (contrast this with the Hilbert matrix of Chapter 4). This
in itself is a powerful incentive to consider working with orthogonal polynomial
expansions. Another nice consequence is that the optimal solution â satisfies

  â = g,    (5.30)

that is,

  â_j = ∫_D w(x) f(x) φ_j(x) dx = ⟨f, φ_j⟩,    (5.31)

where j ∈ Z_{n+1}. If a = â in (5.27), then we have the optimal residual

  r̂_n(x) = f(x) − Σ_{j=0}^{n} â_j φ_j(x).    (5.32)
We may substitute (5.31) into (5.32) to obtain

  r̂_n(x) = f(x) − Σ_{j=0}^{n} [∫_D w(y) f(y) φ_j(y) dy] φ_j(x)
         = f(x) − ∫_D f(y) w(y) [Σ_{j=0}^{n} φ_j(x) φ_j(y)] dy.    (5.33)

For convenience we define the kernel function³

  K_n(x, y) = Σ_{j=0}^{n} φ_j(x) φ_j(y),    (5.34)

so that (5.33) becomes

  r̂_n(x) = f(x) − ∫_D w(y) f(y) K_n(x, y) dy.    (5.35)

Clearly, K_n(x, y) in (5.34) has the alternative formula given by (5.11). Now consider

  ∫_D f(x) w(y) K_n(x, y) dy = f(x) ∫_D w(y) Σ_{j=0}^{n} φ_j(x) φ_j(y) dy
    = f(x) Σ_{j=0}^{n} φ_j(x) ∫_D w(y) φ_j(y) dy
    = f(x) Σ_{j=0}^{n} φ_j(x) ⟨1, φ_j⟩
    = f(x) Σ_{j=0}^{n} φ_j(x) (1/φ_{0,0}) ⟨φ_0, φ_j⟩
    = f(x) φ_0(x)/φ_{0,0} = f(x)

because φ_0(x) = φ_{0,0} for x ∈ D, and ⟨φ_0, φ_j⟩ = δ_j. Thus, (5.35) becomes

  r̂_n(x) = ∫_D f(x) w(y) K_n(x, y) dy − ∫_D f(y) w(y) K_n(x, y) dy
         = ∫_D w(y) K_n(x, y) [f(x) − f(y)] dy.    (5.36)

³Recall the Dirichlet kernel of Chapter 3, which we now see is really just a special instance of the
general idea considered here.
The optimal residual (optimal error) r̂_n(x) presumably gets smaller in some sense
as n → ∞. Clearly, insight into how it behaves in the limit as n → ∞ can be
provided by (5.36). The behavior certainly depends on f(x), w(x), and the kernel
K_n(x, y). Intuitively, the summation expression for K_n(x, y) in (5.34) is likely
to be less convenient to work with than the alternative expression we obtain
from (5.11).
Some basic results on polynomial approximations are as follows.

Theorem 5.3: Weierstrass' Approximation Theorem  Let f(x) be continuous
on the closed interval [a, b]. For any ε > 0 there exists an integer N = N(ε)
and a polynomial p_N(x) ∈ P^N[a, b] (deg(p_N(x)) ≤ N) such that

  |f(x) − p_N(x)| < ε

for all x ∈ [a, b].
Various proofs of this result exist in the literature (e.g., see Rice [5, p. 121] and
Isaacson and Keller [1, pp. 183-186]), but we omit them here. We see that the
convergence of p_N(x) to f(x) is uniform as N → ∞ (recall Definition 3.4 in
Chapter 3). Theorem 5.3 states that any function continuous on an interval may be
approximated on that interval by a polynomial to arbitrary accuracy. Of course, a polynomial
of large degree may be needed to achieve a particular level of accuracy, depending on
f(x). We also remark that Weierstrass' theorem is an existence theorem. It claims
the existence of a polynomial that uniformly approximates a continuous function
on [a, b], but it does not tell us how to find the polynomial. Some information
about the convergence behavior of a least-squares approximation is provided by
the following theorem.
Theorem 5.4: Let D = [a, b], with w(x) = 1 for all x ∈ D. Let f(x) be continuous
on D, and let

  q_n(x) = Σ_{j=0}^{n} â_j φ_j(x)

(the least-squares polynomial approximation to f(x) on D), so â_j = ⟨f, φ_j⟩ =
∫_a^b f(x) φ_j(x) dx. Then

  lim_{n→∞} V(â) = lim_{n→∞} ∫_a^b [f(x) − q_n(x)]^2 dx = 0,

and we have Parseval's equality

  ∫_a^b f^2(x) dx = Σ_{j=0}^{∞} â_j^2.    (5.37)
Proof  We use proof by contradiction. Assume that lim_{n→∞} V(â) = δ > 0.
Pick any ε > 0 such that ε^2 = δ/[2(b − a)]. Via Theorem 5.3 (Weierstrass' theorem)
there is a polynomial p_m(x) such that |f(x) − p_m(x)| < ε for x ∈ D. Thus

  ∫_a^b [f(x) − p_m(x)]^2 dx < ε^2 [b − a] = δ/2.

Now via (5.29) and (5.30)

  V(â) = ρ − â^T â,

that is,

  V(â) = ∫_a^b f^2(x) dx − Σ_{j=0}^{n} â_j^2,

and V(â) ≥ 0 for all â ∈ R^{n+1}. So we have Bessel's inequality

  V(â) = ∫_a^b f^2(x) dx − Σ_{j=0}^{n} â_j^2 ≥ 0,    (5.38)

and V(â) must be a nonincreasing function of n. Hence the least-squares approximation
of degree m, say, q_m(x), satisfies

  δ/2 > ∫_a^b [f(x) − q_m(x)]^2 dx ≥ δ,

which is a contradiction unless δ = 0. Since δ = 0, we must have (5.37).
We observe that Parseval's equality of (5.37) relates the energy of f(x) to
the energy of the coefficients â_j. The convergence behavior described by Theorem
5.4 is sometimes referred to as convergence in the mean [1, pp. 197-198].
This theorem does not say what happens if f(x) is not continuous on [a, b]. The
result (5.36), by contrast, holds regardless of whether f(x) is continuous. It is a potentially
more powerful result for this reason. It turns out that the convergence of
the least-squares polynomial sequence (q_n(x)) to f(x) is pointwise in general but,
depending on φ_j(x) and f(x), the convergence can sometimes be uniform. For uniform
convergence, f(x) must be sufficiently smooth. The pointwise convergence
of the orthogonal polynomial series when f(x) has a discontinuity implies that the
Gibbs phenomenon can be expected. (Recall this phenomenon in the context of the
Fourier series expansion as seen in Section 3.4.)
We have seen polynomial approximation to functions in Chapter 3. There we saw
that the Taylor formula is a polynomial approximation to a function with a number
of derivatives equal to the degree of the Taylor polynomial. This approximation
technique is obviously limited to functions that are sufficiently differentiable. But
our present polynomial approximation methodology has no such limitation.
We remark that Theorem 5.4 suggests (but does not prove) that if f(x) ∈
L^2[a, b], then

  f(x) = Σ_{j=0}^{∞} ⟨f, φ_j⟩ φ_j(x),    (5.39)

where

  ⟨f, φ_j⟩ = ∫_a^b f(x) φ_j(x) dx.    (5.40)

This has the basic form of the Fourier series expansion that was first seen in
Chapter 1. For this reason (5.39) is sometimes called a generalized Fourier series
expansion, although in the most general form of this idea the orthogonal functions
are not necessarily always polynomials. The idea of a generalized Fourier series
expansion can be extended to domains such as D = [0, ∞) or D = R, and to any
weighting functions that lead to solutions to (5.24).
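To see Theorem 5.4 and Bessel's inequality (5.38) in action, here is an illustrative sketch (not from the text). It computes the least-squares coefficients (5.31) for the hypothetical choice f(x) = e^x on [−1, 1] with w(x) = 1, using orthonormal versions of the Legendre polynomials (Section 5.5) generated by their standard three-term recurrence (a special case of (5.5)), with composite Simpson quadrature standing in for the integrals. The error functional V(â) is nonincreasing in n and decays rapidly for this smooth f:

```python
import math

def legendre(n, x):
    # Standard Legendre three-term recurrence (a special case of (5.5)):
    # (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)
    p0, p1 = 1.0, x
    if n == 0:
        return p0
    for k in range(1, n):
        p0, p1 = p1, ((2 * k + 1) * x * p1 - k * p0) / (k + 1)
    return p1

def phi(j, x):
    # Orthonormal version: ||P_j||^2 = 2/(2j+1) on [-1, 1] with w(x) = 1
    return math.sqrt((2 * j + 1) / 2.0) * legendre(j, x)

def simpson(g, a, b, m=2000):
    # Composite Simpson rule (m even)
    h = (b - a) / m
    s = g(a) + g(b) + sum((4 if i % 2 else 2) * g(a + i * h) for i in range(1, m))
    return s * h / 3.0

f = math.exp  # the function being approximated (an arbitrary smooth choice)

def V(n):
    # Squared L2 error of the degree-n least-squares approximation, per (5.38)
    a = [simpson(lambda x, j=j: f(x) * phi(j, x), -1.0, 1.0) for j in range(n + 1)]
    err = lambda x: (f(x) - sum(a[j] * phi(j, x) for j in range(n + 1))) ** 2
    return simpson(err, -1.0, 1.0)

errs = [V(n) for n in range(6)]
assert all(errs[i + 1] <= errs[i] for i in range(5))  # V is nonincreasing in n
assert errs[5] < 1e-7                                 # rapid mean convergence
```

For a discontinuous f the errors would still decrease, but much more slowly, and the Gibbs phenomenon would appear in the partial sums.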
The next three sections consider the Chebyshev, Hermite, and Legendre poly-
nomials as examples of how to apply the core theory of this section.
5.3 CHEBYSHEV POLYNOMIALS
Suppose that D = [a, b], and recall (5.28),

  V(a) = ∫_a^b w(x) r_n^2(x) dx,    (5.41)

which is the (weighted) energy of the approximation error (residual) in (5.27):

  r_n(x) = f(x) − Σ_{j=0}^{n} a_j φ_j(x).    (5.42)

The weighting function w(x) is often selected to give more or less weight to errors
in different places on the interval [a, b]. This is intended to achieve a degree of
control over error behavior. If w(x) = c > 0 for x ∈ [a, b], then equal importance
is given to errors across the interval. This choice (with c = 1) gives rise to the
Legendre polynomials, and will be considered later. If we wish to give more weight
to errors at the ends of the interval, then a popular instance of this is D = [−1, 1]
with the weighting function

  w(x) = 1/√(1 − x²).    (5.43)
This choice leads to the famous Chebyshev polynomials of the first kind. The reader
will most likely see these applied to problems in analog and digital filter design in
subsequent courses. For now, we concentrate on their basic theory.
The following lemma and ideas expressed in it are pivotal in understanding the
Chebyshev polynomials and how they are constructed.
Lemma 5.1: If n ∈ Z⁺, then

  cos nθ = Σ_{k=0}^{n} β_{n,k} cos^k θ    (5.44)

for suitable β_{n,k} ∈ R.
Proof  First of all, recall the trigonometric identities

  cos(m + 1)θ = cos mθ cos θ − sin mθ sin θ,
  cos(m − 1)θ = cos mθ cos θ + sin mθ sin θ

(which follow from the more basic identity cos(a + b) = cos a cos b − sin a sin b).
From the sum of these identities

  cos(m + 1)θ = 2 cos mθ cos θ − cos(m − 1)θ.    (5.45)

For n = 1 in (5.44), we have β_{1,0} = 0, β_{1,1} = 1, and for n = 0, we have β_{0,0} =
1. These will form initial conditions in a recursion that we now derive using
mathematical induction.

Assume that (5.44) is valid both for n = m and for n = m − 1, so that

  cos mθ = Σ_{k=0}^{m} β_{m,k} cos^k θ,
  cos(m − 1)θ = Σ_{k=0}^{m−1} β_{m−1,k} cos^k θ,

and so via (5.45)

  cos(m + 1)θ = 2 cos θ Σ_{k=0}^{m} β_{m,k} cos^k θ − Σ_{k=0}^{m−1} β_{m−1,k} cos^k θ
    = 2 Σ_{k=0}^{m} β_{m,k} cos^{k+1} θ − Σ_{k=0}^{m−1} β_{m−1,k} cos^k θ
    = 2 Σ_{r=1}^{m+1} β_{m,r−1} cos^r θ − Σ_{r=0}^{m−1} β_{m−1,r} cos^r θ
    = Σ_{k=0}^{m+1} [2β_{m,k−1} − β_{m−1,k}] cos^k θ    (β_{m,k} = 0 for k < 0 and k > m),

which is to say that

  β_{m+1,k} = 2β_{m,k−1} − β_{m−1,k}.    (5.46)

This is the desired three-term recursion for the coefficients in (5.44). As a consequence
of this result, Eq. (5.44) is valid for n = m + 1 and thus is valid for all
n ≥ 0 by mathematical induction.
This lemma states that cos nθ may be expressed as a polynomial of degree n in
cos θ. Equation (5.46), along with the initial conditions

  β_{0,0} = 1,  β_{1,0} = 0,  β_{1,1} = 1,    (5.47)

tells us how to find the polynomial coefficients. For example, from (5.46)

  β_{2,k} = 2β_{1,k−1} − β_{0,k}

for k = 0, 1, 2. [In general, we evaluate (5.46) for k = 0, 1, …, m + 1.] Therefore

  β_{2,0} = 2β_{1,−1} − β_{0,0} = −1,
  β_{2,1} = 2β_{1,0} − β_{0,1} = 0,
  β_{2,2} = 2β_{1,1} − β_{0,2} = 2,

which implies that cos 2θ = −1 + 2 cos² θ. This is certainly true, as it is a well-known
trigonometric identity.

Lemma 5.1 possesses a converse: any polynomial in cos θ of degree n can be
expressed as a linear combination of members from the set {cos kθ | k = 0, 1, …, n}.
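The recursion (5.46) with initial conditions (5.47) is straightforward to carry out by machine. The sketch below (illustrative, not from the text) generates the first few rows β_{n,k} and then verifies Lemma 5.1 pointwise at a few arbitrary angles:

```python
import math

# One step of (5.46): beta_{m+1,k} = 2*beta_{m,k-1} - beta_{m-1,k}
def next_row(prev, cur):
    m = len(cur) - 1
    return [(2 * cur[k - 1] if k >= 1 else 0) - (prev[k] if k < len(prev) else 0)
            for k in range(m + 2)]

rows = [[1], [0, 1]]          # beta_0 and beta_1: the initial conditions (5.47)
for _ in range(4):
    rows.append(next_row(rows[-2], rows[-1]))

assert rows[2] == [-1, 0, 2]  # cos 2t = -1 + 2 cos^2 t, as computed above

# Check Lemma 5.1 pointwise: cos(n*t) = sum_k beta_{n,k} * cos(t)^k
for n in range(6):
    for t in (0.1, 0.9, 2.2):
        rhs = sum(b * math.cos(t) ** k for k, b in enumerate(rows[n]))
        assert abs(math.cos(n * t) - rhs) < 1e-12
```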
We now have enough information to derive the Chebyshev polynomials. Recalling
(5.19), we need a polynomial φ_r(x) of degree r such that

  ∫_{−1}^{1} (1/√(1 − x²)) φ_r(x) q_{r−1}(x) dx = 0,    (5.48)

where q_{r−1}(x) is an arbitrary polynomial such that deg(q_{r−1}(x)) ≤ r − 1. Let us
change variables according to

  x = cos θ.    (5.49)

Then dx = −sin θ dθ, and x ∈ [−1, 1] maps to θ ∈ [0, π]. Thus

  ∫_{−1}^{1} (1/√(1 − x²)) φ_r(x) q_{r−1}(x) dx = −∫_π^0 (1/√(1 − cos² θ)) φ_r(cos θ) q_{r−1}(cos θ) sin θ dθ
    = ∫_0^π φ_r(cos θ) q_{r−1}(cos θ) dθ = 0.    (5.50)

Because of Lemma 5.1 (and the above-mentioned converse to it), q_{r−1}(cos θ) is a
linear combination of cos kθ for k = 0, 1, …, r − 1, so (5.50) will hold if

  ∫_0^π φ_r(cos θ) cos kθ dθ = 0    (5.51)

for k = 0, 1, …, r − 1. Consider φ_r(cos θ) = C_r cos rθ; then

  C_r ∫_0^π cos rθ cos kθ dθ = (C_r/2) ∫_0^π [cos(r + k)θ + cos(r − k)θ] dθ
    = (C_r/2) [sin(r + k)θ/(r + k) + sin(r − k)θ/(r − k)]_0^π = 0

for k = 0, 1, …, r − 1. Thus, we may indeed choose

  φ_r(x) = C_r cos[r cos^{−1} x].    (5.52)
Constant C_r is selected to normalize the polynomial according to user requirements.
Perhaps the most common choice is simply to set C_r = 1 for all r ∈ Z⁺. In this
case we set T_r(x) = φ_r(x):

  T_r(x) = cos[r cos^{−1} x].    (5.53)

These are the Chebyshev polynomials of the first kind.

By construction, if r ≠ k, then

  ⟨T_r, T_k⟩ = ∫_{−1}^{1} (1/√(1 − x²)) T_r(x) T_k(x) dx = 0.    (5.54)

Consider r = k:

  ||T_r||^2 = ∫_{−1}^{1} (1/√(1 − x²)) T_r^2(x) dx = ∫_{−1}^{1} (1/√(1 − x²)) cos²[r cos^{−1} x] dx,

and apply (5.49). Thus

  ||T_r||^2 = ∫_0^π cos² rθ dθ = π for r = 0, and π/2 for r > 0.    (5.55)
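The relations (5.54) and (5.55) are easy to confirm numerically: after the substitution (5.49), each inner product becomes ∫_0^π cos rθ cos kθ dθ, which the composite trapezoidal rule integrates essentially exactly for these trigonometric integrands. A sketch (illustrative, not from the text):

```python
import math

def inner(r, k, m=2000):
    # <T_r, T_k> after the substitution x = cos(theta) of (5.49):
    # the weighted inner product becomes the integral of
    # cos(r*t)*cos(k*t) over [0, pi].  Composite trapezoidal rule.
    h = math.pi / m
    g = lambda t: math.cos(r * t) * math.cos(k * t)
    return h * (0.5 * g(0.0) + sum(g(i * h) for i in range(1, m)) + 0.5 * g(math.pi))

assert abs(inner(2, 3)) < 1e-9                # r != k: orthogonality (5.54)
assert abs(inner(0, 0) - math.pi) < 1e-9      # ||T_0||^2 = pi
assert abs(inner(3, 3) - math.pi / 2) < 1e-9  # ||T_r||^2 = pi/2 for r > 0, (5.55)
```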
We have claimed that φ_r(x) in (5.52), and hence T_r(x) in (5.53), are polynomials. This is
not immediately obvious, given that the expressions are in terms of trigonometric
functions. We will now confirm that T_r(x) is indeed a polynomial in x for all
r ∈ Z⁺. Our approach follows the proof of Lemma 5.1.

First, we observe that

  T_0(x) = cos(0 · cos^{−1} x) = 1,  T_1(x) = cos(1 · cos^{−1} x) = x.    (5.56)

Clearly, these are polynomials in x. Once again (see the proof of Lemma 5.1), with
θ = cos^{−1} x,

  T_{n+1}(x) = cos(n + 1)θ
             = 2 cos nθ cos θ − cos(n − 1)θ
             = 2xT_n(x) − T_{n−1}(x),

that is,

  T_{n+1}(x) = 2xT_n(x) − T_{n−1}(x)    (5.57)

for n ∈ N. The initial conditions are expressed in (5.56). We see that this is a
special case of (5.5) (Theorem 5.1). It is immediately clear from (5.57) and (5.56)
that T_n(x) is a polynomial in x for all n ∈ Z⁺.
Some plots of Chebyshev polynomials of the first kind appear in Fig. 5.1. We
remark that on the interval x ∈ [−1, 1] the "ripples" of the polynomials are all of
the same height. This fact makes this class of polynomials quite useful in certain
uniform approximation problems. It is one of the main properties that makes this
class of polynomials useful in filter design. Some indication of how the Chebyshev
polynomials relate to uniform approximation problems appears in Section 5.7.
Example 5.1  This example is about how to program the recursion specified
in (5.57). Define

  T_n(x) = Σ_{j=0}^{n} T_{n,j} x^j

so that T_{n,j} is the coefficient of x^j. This is the same kind of notation as in (5.3)
for φ_n(x). In this setting it is quite important to note that T_{n,j} = 0 for j < 0 and
for j > n. From (5.57), we obtain

  Σ_{j=0}^{n+1} T_{n+1,j} x^j = 2x Σ_{j=0}^{n} T_{n,j} x^j − Σ_{j=0}^{n−1} T_{n−1,j} x^j,

and so

  Σ_{j=0}^{n+1} T_{n+1,j} x^j = 2 Σ_{j=0}^{n} T_{n,j} x^{j+1} − Σ_{j=0}^{n−1} T_{n−1,j} x^j.
Figure 5.1  Chebyshev polynomials of the first kind of degrees 2, 3, 4, 5.
Changing the variable in the second summation according to k = j + 1 (so j =
k − 1) yields

  Σ_{k=0}^{n+1} T_{n+1,k} x^k = 2 Σ_{k=1}^{n+1} T_{n,k−1} x^k − Σ_{k=0}^{n−1} T_{n−1,k} x^k,

and modifying the limits on the summations on the right-hand side yields

  Σ_{k=0}^{n+1} T_{n+1,k} x^k = 2 Σ_{k=0}^{n+1} T_{n,k−1} x^k − Σ_{k=0}^{n+1} T_{n−1,k} x^k.

We emphasize that this is permitted because we recall that T_{n,−1} = 0 and T_{n−1,n} =
T_{n−1,n+1} = 0 for all n ≥ 1. Comparing like powers of x gives us the recursion

  T_{n+1,k} = 2T_{n,k−1} − T_{n−1,k}

for k = 0, 1, …, n + 1 with n = 1, 2, 3, …. Since T_0(x) = 1, we have T_{0,0} = 1,
and since T_1(x) = x, we have T_{1,0} = 0, T_{1,1} = 1, which are the initial conditions
for the recursion. Therefore, a pseudocode program to compute T_n(x) is

  T_{0,0} := 1;
  T_{1,0} := 0; T_{1,1} := 1;
  for n := 1 to N − 1 do begin
    for k := 0 to n + 1 do begin
      T_{n+1,k} := 2T_{n,k−1} − T_{n−1,k};
    end;
  end;

This computes the coefficients of T_2(x), …, T_N(x) (so we need N ≥ 2).
We remark that the recursion in Example 5.1 may be implemented using integer
arithmetic, so there will be no rounding or quantization errors involved in
computing T_n(x). However, there is the risk of machine overflow.
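A direct Python transcription of the pseudocode above (an illustrative sketch; nested lists play the role of the array T_{n,k}) confirms, for instance, that T_4(x) = 8x⁴ − 8x² + 1 and T_5(x) = 16x⁵ − 20x³ + 5x:

```python
def chebyshev_coeffs(N):
    # T[n][k] is the coefficient of x**k in T_n(x); integer arithmetic
    # throughout, mirroring the pseudocode of Example 5.1.
    T = [[1], [0, 1]]
    for n in range(1, N):
        row = []
        for k in range(n + 2):
            c = 2 * T[n][k - 1] if k >= 1 else 0   # 2*T_{n,k-1}
            if k < len(T[n - 1]):
                c -= T[n - 1][k]                   # -T_{n-1,k}
            row.append(c)
        T.append(row)
    return T

T = chebyshev_coeffs(5)
assert T[2] == [-1, 0, 2]             # T_2(x) = 2x^2 - 1
assert T[4] == [1, 0, -8, 0, 8]       # T_4(x) = 8x^4 - 8x^2 + 1
assert T[5] == [0, 5, 0, -20, 0, 16]  # T_5(x) = 16x^5 - 20x^3 + 5x
```

Python's arbitrary-precision integers sidestep the overflow risk noted above, though not the growth in coefficient size.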
Example 5.2  This example is about changing representations. Specifically,
how might we express

  p_n(x) = Σ_{j=0}^{n} p_{n,j} x^j ∈ P^n[−1, 1]

in terms of the Chebyshev polynomials of the first kind? In other words, we wish
to determine the series coefficients in

  p_n(x) = Σ_{j=0}^{n} α_j T_j(x).

We will consider this problem only for n = 4.

Therefore, we begin by noting that [via (5.56) and (5.57)] T_0(x) = 1, T_1(x) = x,
T_2(x) = 2x² − 1, T_3(x) = 4x³ − 3x, and T_4(x) = 8x⁴ − 8x² + 1. Consequently

  Σ_{j=0}^{4} α_j T_j(x) = α_0 + α_1 x + 2α_2 x² − α_2 + 4α_3 x³ − 3α_3 x + 8α_4 x⁴ − 8α_4 x² + α_4
    = (α_0 − α_2 + α_4) + (α_1 − 3α_3)x + (2α_2 − 8α_4)x² + 4α_3 x³ + 8α_4 x⁴,

implying that (on comparing like powers of x)

  p_{4,0} = α_0 − α_2 + α_4,  p_{4,1} = α_1 − 3α_3,
  p_{4,2} = 2α_2 − 8α_4,  p_{4,3} = 4α_3,  p_{4,4} = 8α_4.

This may be more conveniently expressed in matrix form:

  [ 1  0  −1   0   1 ] [ α_0 ]   [ p_{4,0} ]
  [ 0  1   0  −3   0 ] [ α_1 ]   [ p_{4,1} ]
  [ 0  0   2   0  −8 ] [ α_2 ] = [ p_{4,2} ]
  [ 0  0   0   4   0 ] [ α_3 ]   [ p_{4,3} ]
  [ 0  0   0   0   8 ] [ α_4 ]   [ p_{4,4} ]
The upper triangular system Uα = p is certainly easy to solve using the backward
substitution algorithm presented in Chapter 4 (see Section 4.5). We note that the
elements of matrix U are the coefficients T_{n,j} as might be obtained from the
algorithm in Example 5.1.

Can you guess, on the basis of Example 5.2, what matrix U will be for any n?
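As a sketch of how Example 5.2 mechanizes (illustrative, not from the text): the jth column of U is the coefficient row of T_j(x) from Example 5.1, and backward substitution solves Uα = p. The test polynomial p(x) = x⁴ is a hypothetical choice, for which x⁴ = (3T_0 + 4T_2 + T_4)/8:

```python
from fractions import Fraction

# Chebyshev coefficient rows T[j] (coefficient of x^i in T_j), as in Example 5.1
T = [[1], [0, 1]]
for n in range(1, 4):
    T.append([(2 * T[n][k - 1] if k >= 1 else 0)
              - (T[n - 1][k] if k < len(T[n - 1]) else 0)
              for k in range(n + 2)])

n = 4
# U[i][j] = coefficient of x^i in T_j(x), so U is upper triangular
U = [[Fraction(T[j][i]) if i < len(T[j]) else Fraction(0) for j in range(n + 1)]
     for i in range(n + 1)]

p = [Fraction(0)] * 4 + [Fraction(1)]   # p(x) = x^4 (a hypothetical test input)

# Backward substitution for U*alpha = p (Chapter 4, Section 4.5)
alpha = [Fraction(0)] * (n + 1)
for i in range(n, -1, -1):
    alpha[i] = (p[i] - sum(U[i][j] * alpha[j] for j in range(i + 1, n + 1))) / U[i][i]

# x^4 = (3*T_0 + 4*T_2 + T_4)/8
assert alpha == [Fraction(3, 8), 0, Fraction(1, 2), 0, Fraction(1, 8)]
```

Exact rational arithmetic is used so the answer is free of rounding effects.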
5.4 HERMITE POLYNOMIALS
Now let us consider D = R with the weighting function

  w(x) = e^{−a²x²}.    (5.58)

This is essentially the Gaussian pulse from Chapter 3. Recalling (5.23), we have

  φ_r(x) = e^{a²x²} d^r G_r(x)/dx^r,    (5.59)

where G_r(x) satisfies the differential equation

  d^{r+1}/dx^{r+1} [e^{a²x²} d^r G_r(x)/dx^r] = 0    (5.60)

[recall (5.24)]. From (5.25), G_r(x) and G_r^{(k)}(x) (for k = 1, 2, …, r − 1) must tend
to zero as x → ±∞. We may consider the trial solution

  G_r(x) = C_r e^{−a²x²}.    (5.61)

The kth derivative of this is of the form

  C_r e^{−a²x²} × (polynomial of degree k),

so (5.61) satisfies both (5.60) and the required boundary conditions. Therefore

  φ_r(x) = C_r e^{a²x²} d^r/dx^r (e^{−a²x²}).    (5.62)

It is common practice to define the Hermite polynomials to be φ_r(x) for C_r =
(−1)^r with either a² = 1 or a² = 1/2. We shall select a² = 1, and so our Hermite
polynomials are

  H_r(x) = (−1)^r e^{x²} d^r/dx^r (e^{−x²}).    (5.63)

By construction, for k ≠ r,

  ∫_{−∞}^{∞} e^{−a²x²} φ_r(x) φ_k(x) dx = 0.    (5.64)

For the case where k = r, the following result is helpful.
It must be the case that

  φ_r(x) = Σ_{k=0}^{r} φ_{r,k} x^k,

so that

  ||φ_r||^2 = ∫_D w(x) φ_r^2(x) dx = ∫_D w(x) φ_r(x) [Σ_{k=0}^{r} φ_{r,k} x^k] dx,

but ∫_D w(x) φ_r(x) x^i dx = 0 for i = 0, 1, …, r − 1 [a special case of (5.19)], and so

  ||φ_r||^2 = φ_{r,r} ∫_D w(x) φ_r(x) x^r dx = φ_{r,r} ∫_D x^r G_r^{(r)}(x) dx    (5.65)

[via (5.20)]. We may integrate (5.65) by parts r times, and apply (5.25) to obtain

  ||φ_r||^2 = (−1)^r r! φ_{r,r} ∫_D G_r(x) dx.    (5.66)

So, for our present problem with k = r, we obtain

  ||φ_r||^2 = ∫_{−∞}^{∞} e^{−a²x²} φ_r^2(x) dx = (−1)^r r! φ_{r,r} ∫_{−∞}^{∞} C_r e^{−a²x²} dx.    (5.67)

Now (if y = ax with a > 0)

  ∫_{−∞}^{∞} e^{−a²x²} dx = 2 ∫_0^∞ e^{−a²x²} dx = (2/a) ∫_0^∞ e^{−y²} dy = √π/a    (5.68)

via (3.103) in Chapter 3. Consequently, (5.67) becomes

  ||φ_r||^2 = (−1)^r r! φ_{r,r} C_r √π/a.    (5.69)

With C_r = (−1)^r and a = 1, we recall H_r(x) = φ_r(x), so (5.69) becomes

  ||H_r||^2 = r! H_{r,r} √π.    (5.70)

We need an expression for H_{r,r}. We know that for suitable p_{n,k}

  d^n/dx^n (e^{−x²}) = [Σ_{k=0}^{n} p_{n,k} x^k] e^{−x²} = p_n(x) e^{−x²}.    (5.71)
Now consider

  d^{n+1}/dx^{n+1} (e^{−x²}) = d/dx {[Σ_{k=0}^{n} p_{n,k} x^k] e^{−x²}}
    = −2x Σ_{k=0}^{n} p_{n,k} x^k e^{−x²} + Σ_{k=1}^{n} k p_{n,k} x^{k−1} e^{−x²}
    = Σ_{k=0}^{n+1} [−2p_{n,k−1} + (k + 1)p_{n,k+1}] x^k e^{−x²},

so we have the recurrence relation

  p_{n+1,k} = −2p_{n,k−1} + (k + 1)p_{n,k+1}.    (5.72)

From (5.71) and (5.63), H_n(x) = (−1)^n p_n(x). From (5.72)

  p_{n+1,n+1} = −2p_{n,n} + (n + 2)p_{n,n+2} = −2p_{n,n},

and so

  H_{n+1,n+1} = (−1)^{n+1} p_{n+1,n+1} = −(−1)^n (−2p_{n,n}) = 2H_{n,n}.    (5.73)

Because H_0(x) = 1 [via (5.63)], H_{0,0} = 1, and immediately we have H_{n,n} = 2^n
[the solution to the difference equation (5.73)]. Therefore

  ||H_r||^2 = 2^r r! √π;    (5.74)

thus, ∫_{−∞}^{∞} e^{−x²} H_r^2(x) dx = 2^r r! √π.
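The recursion (5.72), together with H_n(x) = (−1)^n p_n(x), gives a direct machine method for the Hermite coefficients. A sketch (illustrative, not from the text):

```python
# p_{n+1,k} = -2*p_{n,k-1} + (k+1)*p_{n,k+1} from (5.72), with p_0(x) = 1;
# then H_n(x) = (-1)**n * p_n(x) by (5.71) and (5.63).
def hermite_via_p(N):
    p = [1]          # p_0
    H = [[1]]        # H_0
    for n in range(N):
        nxt = []
        for k in range(n + 2):
            c = -2 * p[k - 1] if k >= 1 else 0
            if k + 1 < len(p):
                c += (k + 1) * p[k + 1]
            nxt.append(c)
        p = nxt
        H.append([(-1) ** (n + 1) * c for c in p])
    return H

H = hermite_via_p(4)
assert H[1] == [0, 2]               # H_1 = 2x
assert H[2] == [-2, 0, 4]           # H_2 = 4x^2 - 2
assert H[3] == [0, -12, 0, 8]       # H_3 = 8x^3 - 12x
assert H[4] == [12, 0, -48, 0, 16]  # H_4 = 16x^4 - 48x^2 + 12; note H_{4,4} = 2^4
```

The leading coefficients visible here confirm H_{n,n} = 2^n, as derived from (5.73).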
A three-term recurrence relation for H_n(x) is needed. Define the generating
function

  S(x, t) = exp[−t² + 2xt] = exp[x² − (t − x)²].    (5.75)

Observe that

  ∂^n S(x, t)/∂t^n = exp[x²] ∂^n/∂t^n exp[−(t − x)²] = (−1)^n exp[x²] ∂^n/∂x^n exp[−(t − x)²].    (5.76)

Because of the second equality in (5.76), we have

  S^{(n)}(x, 0) = ∂^n S(x, t)/∂t^n |_{t=0} = (−1)^n exp[x²] d^n/dx^n exp[−x²] = H_n(x).    (5.77)
The Maclaurin expansion of S(x, t) about t = 0 is [recall (3.75)]

  S(x, t) = Σ_{n=0}^{∞} [S^{(n)}(x, 0)/n!] t^n,    (5.78)

so via (5.77), this becomes

  S(x, t) = Σ_{n=0}^{∞} (t^n/n!) H_n(x).

Now, from (5.75) and (5.78), we have

  ∂S/∂x = 2t e^{−t²+2xt} = Σ_{n=0}^{∞} (2t^{n+1}/n!) H_n(x),    (5.79a)

and also

  ∂S/∂x = Σ_{n=0}^{∞} (t^n/n!) dH_n(x)/dx = Σ_{n=−1}^{∞} [t^{n+1}/(n + 1)!] dH_{n+1}(x)/dx.    (5.79b)

Comparing the like powers of t in (5.79) yields

  [1/(n + 1)!] dH_{n+1}(x)/dx = (2/n!) H_n(x),

which implies that

  dH_{n+1}(x)/dx = 2(n + 1) H_n(x)    (5.80)

for n ∈ Z⁺. We may also consider

  ∂S/∂t = (−2t + 2x) e^{−t²+2xt} = Σ_{n=0}^{∞} [(−2t + 2x)/n!] t^n H_n(x)    (5.81a)

and

  ∂S/∂t = Σ_{n=1}^{∞} [t^{n−1}/(n − 1)!] H_n(x) = Σ_{n=0}^{∞} (t^n/n!) H_{n+1}(x).    (5.81b)

From (5.81a)

  ∂S/∂t = −2 Σ_{n=1}^{∞} [t^n/(n − 1)!] H_{n−1}(x) + 2x Σ_{n=0}^{∞} (t^n/n!) H_n(x).    (5.82)
Figure 5.2  Hermite polynomials of degrees 2, 3, 4, 5.
Comparing like powers of t in (5.81b) and (5.82) yields

  (1/n!) H_{n+1}(x) = −[2/(n − 1)!] H_{n−1}(x) + 2x (1/n!) H_n(x),

or

  H_{n+1}(x) = 2x H_n(x) − 2n H_{n−1}(x).    (5.83)

This holds for all n ∈ N. Equation (5.83) is another special case of (5.5).

The Hermite polynomials are relevant to quantum mechanics (the quantum harmonic
oscillator problem), and they arise in signal processing as well (often in connection
with the uncertainty principle for signals). A few Hermite polynomials are plotted
in Fig. 5.2.
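The recurrence (5.83) is as easy to program as (5.57) was in Example 5.1. The sketch below (illustrative, not from the text) generates coefficient lists and also spot-checks the derivative identity (5.80) and the leading-coefficient formula H_{n,n} = 2^n:

```python
# H_{n+1}(x) = 2x*H_n(x) - 2n*H_{n-1}(x), i.e. (5.83), on coefficient lists
def hermite(N):
    H = [[1], [0, 2]]
    for n in range(1, N):
        two_x_Hn = [0] + [2 * c for c in H[n]]   # coefficients of 2x*H_n(x)
        prev = H[n - 1] + [0] * (len(two_x_Hn) - len(H[n - 1]))
        H.append([a - 2 * n * b for a, b in zip(two_x_Hn, prev)])
    return H

def derivative(c):
    # coefficient list of the derivative of the polynomial with coefficients c
    return [k * c[k] for k in range(1, len(c))]

H = hermite(5)
assert H[4] == [12, 0, -48, 0, 16]                # agrees with the (5.72) route
assert H[5] == [0, 120, 0, -160, 0, 32]           # H_5 = 32x^5 - 160x^3 + 120x
assert H[5][5] == 2 ** 5                          # leading coefficient H_{n,n} = 2^n
assert derivative(H[4]) == [8 * c for c in H[3]]  # (5.80): H'_4 = 2*4*H_3
```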
5.5 LEGENDRE POLYNOMIALS
We consider D = [−1, 1] with the uniform weighting function

  w(x) = 1 for all x ∈ D.    (5.84)
In this case (5.24) becomes

  d^{r+1}/dx^{r+1} [d^r G_r(x)/dx^r] = d^{2r+1} G_r(x)/dx^{2r+1} = 0.    (5.85)

The boundary conditions of (5.25) become

  G_r^{(k)}(±1) = 0    (5.86)

for k ∈ Z_r. Consequently, the solution to (5.85) is given by

  G_r(x) = C_r (x² − 1)^r.    (5.87)

Thus, from (5.23)

  φ_r(x) = C_r d^r/dx^r (x² − 1)^r.    (5.88)

The Legendre polynomials use C_r = 1/(2^r r!), and are denoted by

  P_r(x) = [1/(2^r r!)] d^r/dx^r (x² − 1)^r,    (5.89)

which is the Rodrigues formula for P_r(x). By construction, for k ≠ r, we must
have

  ∫_{−1}^{1} P_r(x) P_k(x) dx = 0.    (5.90)

From (5.66) and (5.87), we obtain

  ||P_r||^2 = [(−1)^r/2^r] P_{r,r} ∫_{−1}^{1} (x² − 1)^r dx.    (5.91)

Recalling the binomial theorem

  (a + b)^r = Σ_{k=0}^{r} (r choose k) a^k b^{r−k},    (5.92)
we see that

  d^r/dx^r (x² − 1)^r = d^r/dx^r Σ_{k=0}^{r} (r choose k) (−1)^{r−k} x^{2k}
    = d^r/dx^r [x^{2r} + Σ_{k=0}^{r−1} (r choose k) (−1)^{r−k} x^{2k}]
    = 2r(2r − 1) ⋯ (r + 1) x^r + d^r/dx^r Σ_{k=0}^{r−1} (r choose k) (−1)^{r−k} x^{2k},

implying that [recall (5.89)]

  P_{r,r} = 2r(2r − 1) ⋯ (r + 1)/(2^r r!) = (2r)!/(2^r [r!]^2).    (5.93)
But we need to evaluate the integral in (5.91) as well. With the change of variable
x = sin θ, we have

  I_r = ∫_{−1}^{1} (x² − 1)^r dx = (−1)^r ∫_{−π/2}^{π/2} cos^{2r+1} θ dθ.    (5.94)

We may integrate by parts:

  ∫_{−π/2}^{π/2} cos^{2r+1} θ dθ = ∫_{−π/2}^{π/2} cos θ cos^{2r} θ dθ
    = cos^{2r} θ sin θ |_{−π/2}^{π/2} + 2r ∫_{−π/2}^{π/2} cos^{2r−1} θ sin² θ dθ

(in ∫ u dv = uv − ∫ v du, we let u = cos^{2r} θ and dv = cos θ dθ), which becomes
(on using the identity sin² θ = 1 − cos² θ)

  ∫_{−π/2}^{π/2} cos^{2r+1} θ dθ = 2r ∫_{−π/2}^{π/2} cos^{2r−1} θ dθ − 2r ∫_{−π/2}^{π/2} cos^{2r+1} θ dθ

for r ≥ 1. Therefore

  (2r + 1) ∫_{−π/2}^{π/2} cos^{2r+1} θ dθ = 2r ∫_{−π/2}^{π/2} cos^{2r−1} θ dθ.    (5.95)

Now, since [via (5.94)]

  I_{r−1} = −(−1)^r ∫_{−π/2}^{π/2} cos^{2r−1} θ dθ,
Eq. (5.95) becomes
(2r + l)(-l) r I r = -2r(-iy7 r _i,
or more simply
2r
2r + 1
Ir-l-
(5.96)
This holds for r > 1 with initial condition Iq — 2. The solution to the difference
equation is
2 2r+1 [r!] 2
/,. = (-l) r — !_, (5.97)
^ ' (2r + l)!
This can be confirmed by direct substitution of (5.97) into (5.96). Consequently, if we combine (5.91), (5.93), and (5.97), then

    \|P_r\|^2 = \frac{(-1)^r (2r)!}{2^r 2^r [r!]^2} \cdot \frac{(-1)^r 2^{2r+1} [r!]^2}{(2r+1)!},

which simplifies to

    \|P_r\|^2 = \frac{2}{2r+1}.    (5.98)
A closed-form expression for P_n(x) is possible using (5.92) in (5.89). Specifically, consider

    P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} (x^2 - 1)^n = \frac{1}{2^n n!} \frac{d^n}{dx^n} \sum_{k=0}^{n} \binom{n}{k} (-1)^k x^{2n-2k}.    (5.99)

Only the terms with n - 2k \geq 0 survive n-fold differentiation, that is, k = 0, 1, \ldots, M, where M = n/2 (n even), or M = (n-1)/2 (n odd). We observe that

    \frac{d^n}{dx^n} x^{2n-2k} = (2n-2k)(2n-2k-1)\cdots(n-2k+1) x^{n-2k}.    (5.100)

Now

    (2n-2k)! = (2n-2k)(2n-2k-1)\cdots(n-2k+1)(n-2k)\cdots 2 \cdot 1,

so (5.100) becomes

    \frac{d^n}{dx^n} x^{2n-2k} = \frac{(2n-2k)!}{(n-2k)!} x^{n-2k}.    (5.101)

Thus, (5.99) reduces to

    P_n(x) = \frac{1}{2^n n!} \sum_{k=0}^{M} \binom{n}{k} (-1)^k \frac{(2n-2k)!}{(n-2k)!} x^{n-2k},
or alternatively

    P_n(x) = \sum_{k=0}^{M} (-1)^k \frac{(2n-2k)!}{2^n k! (n-k)! (n-2k)!} x^{n-2k}.    (5.102)
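The closed form (5.102) can be checked directly against the familiar low-degree Legendre polynomials. A small Python sketch (illustrative names; the book's own examples use MATLAB):

```python
# Evaluate P_n(x) from the closed form (5.102):
#   P_n(x) = sum_{k=0}^{M} (-1)^k (2n-2k)! / (2^n k! (n-k)! (n-2k)!) x^(n-2k),
# with M = floor(n/2), and compare with hand-computed low-degree cases.
from math import factorial

def legendre_closed(n, x):
    M = n // 2
    total = 0.0
    for k in range(M + 1):
        c = ((-1)**k * factorial(2*n - 2*k) /
             (2**n * factorial(k) * factorial(n - k) * factorial(n - 2*k)))
        total += c * x**(n - 2*k)
    return total

x = 0.7
assert abs(legendre_closed(2, x) - 0.5*(3*x**2 - 1)) < 1e-12
assert abs(legendre_closed(5, x) - 0.125*(63*x**5 - 70*x**3 + 15*x)) < 1e-12
print("closed form (5.102) agrees with P_2 and P_5")
```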
Consider f(x) = \frac{1}{\sqrt{1-x}}, so that f^{(k)}(x) = \frac{(2k-1)(2k-3)\cdots 3 \cdot 1}{2^k} (1-x)^{-(2k+1)/2} (for k \geq 1). Define (2k-1)!! = (2k-1)(2k-3)\cdots 3 \cdot 1, and define (-1)!! = 1. As usual, define 0! = 1. We note that (2n)! = 2^n n! (2n-1)!!, which may be seen by considering

    (2n)! = (2n)(2n-1)(2n-2)(2n-3)\cdots 3 \cdot 2 \cdot 1
          = [(2n)(2n-2)\cdots 4 \cdot 2][(2n-1)(2n-3)\cdots 3 \cdot 1] = 2^n n! (2n-1)!!.    (5.103)

Consequently, the Maclaurin expansion for f(x) is given by

    \frac{1}{\sqrt{1-x}} = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!} x^k = \sum_{k=0}^{\infty} \frac{(2k)!}{2^{2k} [k!]^2} x^k.    (5.104)
The ratio test [7, p. 709] confirms that this series converges if |x| < 1. Using (5.102) and (5.104), it is possible to show that

    S(x, t) = \frac{1}{\sqrt{1 - 2xt + t^2}} = \sum_{n=0}^{\infty} P_n(x) t^n,    (5.105)

so this is the generating function for the Legendre polynomials P_n(x). We observe that

    \frac{\partial S}{\partial t} = \frac{x - t}{[1 - 2xt + t^2]^{3/2}} = \frac{x - t}{1 - 2xt + t^2} S.    (5.106)
Also

    \frac{\partial S}{\partial t} = \sum_{n=0}^{\infty} n P_n(x) t^{n-1}.    (5.107)
Equating (5.107) and (5.106), we have

    (x - t) \sum_{n=0}^{\infty} P_n(x) t^n = (1 - 2xt + t^2) \sum_{n=0}^{\infty} n P_n(x) t^{n-1},

which becomes

    x \sum_{n=0}^{\infty} P_n(x) t^n - \sum_{n=0}^{\infty} P_n(x) t^{n+1}
        = \sum_{n=0}^{\infty} n P_n(x) t^{n-1} - 2x \sum_{n=0}^{\infty} n P_n(x) t^n + \sum_{n=0}^{\infty} n P_n(x) t^{n+1},

and if P_{-1}(x) = 0, then this becomes

    x \sum_{n=0}^{\infty} P_n(x) t^n - \sum_{n=0}^{\infty} P_{n-1}(x) t^n
        = \sum_{n=0}^{\infty} (n+1) P_{n+1}(x) t^n - 2x \sum_{n=0}^{\infty} n P_n(x) t^n + \sum_{n=0}^{\infty} (n-1) P_{n-1}(x) t^n,
so on comparing like powers of t in this expression, we obtain

    x P_n(x) - P_{n-1}(x) = (n+1) P_{n+1}(x) - 2xn P_n(x) + (n-1) P_{n-1}(x),

which finally yields

    (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x),

or (for n \geq 1)

    P_{n+1}(x) = \frac{2n+1}{n+1} x P_n(x) - \frac{n}{n+1} P_{n-1}(x),    (5.108)

which is the three-term recurrence relation for the Legendre polynomials, and hence is yet another special case of (5.5).
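The recurrence (5.108) is also the usual way to generate the P_n in software. The Python sketch below (illustrative names; the book's own examples use MATLAB) carries exact coefficient lists through (5.108), reproducing the explicit polynomials P_0, ..., P_5 used in the next section.

```python
# Generate Legendre polynomial coefficients (ascending powers of x) with the
# three-term recurrence (5.108):
#   P_{n+1}(x) = ((2n+1)/(n+1)) x P_n(x) - (n/(n+1)) P_{n-1}(x).
# Exact rational arithmetic keeps the coefficients clean.
from fractions import Fraction

def legendre_coeffs(N):
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]  # P_0 = 1, P_1 = x
    for n in range(1, N):
        p_n, p_nm1 = polys[n], polys[n - 1]
        new = [Fraction(0)]*(n + 2)
        for j, c in enumerate(p_n):          # ((2n+1)/(n+1)) * x * P_n term
            new[j + 1] += Fraction(2*n + 1, n + 1)*c
        for j, c in enumerate(p_nm1):        # - (n/(n+1)) * P_{n-1} term
            new[j] -= Fraction(n, n + 1)*c
        polys.append(new)
    return polys

for n, p in enumerate(legendre_coeffs(5)):
    print(n, p)
```

The output lists, for example, P_2 as [-1/2, 0, 3/2] and P_5 as [0, 15/8, 0, -35/4, 0, 63/8], in agreement with the closed-form polynomials given below in Section 5.6.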
Figure 5.3 Legendre polynomials of degrees 2, 3, 4, 5.
The Legendre polynomials arise in potential problems in electromagnetics. Modeling the scattering of electromagnetic radiation by particles involves working with these polynomials. Legendre polynomials also appear in quantum mechanics as part of the solution to Schrödinger's equation for the hydrogen atom. Some Legendre polynomials appear in Fig. 5.3.
5.6 AN EXAMPLE OF ORTHOGONAL POLYNOMIAL
LEAST-SQUARES APPROXIMATION
Chebyshev polynomials of the first kind and Legendre polynomials are both orthogonal on [-1, 1], but their weighting functions are different. We shall illustrate the approximation behavior of these polynomials through an example wherein we wish to approximate
    f(x) = \begin{cases} 0, & -1 \le x < 0 \\ 1, & 0 \le x \le 1 \end{cases}    (5.109)
by both of these polynomial types. We will work with fifth-degree least-squares
polynomial approximations in both cases.
We consider Legendre polynomial approximation first. We must therefore normalize the polynomials so that our basis functions have unit norm. Thus, our approximation will be

    \hat{f}(x) = \sum_{k=0}^{5} a_k \phi_k(x),    (5.110)
where

    \phi_k(x) = \frac{1}{\|P_k\|} P_k(x),

and

    a_k = \langle f, \phi_k \rangle = \int_{-1}^{1} f(x) \frac{1}{\|P_k\|} P_k(x) \, dx,

for which we see that

    \hat{f}(x) = \sum_{k=0}^{5} \left[ \frac{\int_{-1}^{1} f(x) P_k(x) \, dx}{\|P_k\|^2} \right] P_k(x).    (5.111)
We have [e.g., via (5.102)]

    P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = \frac{1}{2}[3x^2 - 1],
    P_3(x) = \frac{1}{2}[5x^3 - 3x], \quad P_4(x) = \frac{1}{8}[35x^4 - 30x^2 + 3],
    P_5(x) = \frac{1}{8}[63x^5 - 70x^3 + 15x].
The squared norms [via (5.98)] are

    \|P_0\|^2 = 2, \quad \|P_1\|^2 = \frac{2}{3}, \quad \|P_2\|^2 = \frac{2}{5},
    \|P_3\|^2 = \frac{2}{7}, \quad \|P_4\|^2 = \frac{2}{9}, \quad \|P_5\|^2 = \frac{2}{11}.
By direct calculation, a_k = \int_0^1 P_k(x) \, dx becomes

    a_0 = \int_0^1 1 \cdot dx = 1, \qquad a_1 = \int_0^1 x \, dx = \frac{1}{2},

    a_2 = \frac{1}{2} \int_0^1 [3x^2 - 1] \, dx = \frac{1}{2} \left[ x^3 - x \right]_0^1 = 0,

    a_3 = \frac{1}{2} \int_0^1 [5x^3 - 3x] \, dx = \frac{1}{2} \left[ \frac{5}{4} x^4 - \frac{3}{2} x^2 \right]_0^1 = -\frac{1}{8},

    a_4 = \frac{1}{8} \int_0^1 [35x^4 - 30x^2 + 3] \, dx = \frac{1}{8} \left[ 7x^5 - 10x^3 + 3x \right]_0^1 = 0,

    a_5 = \frac{1}{8} \int_0^1 [63x^5 - 70x^3 + 15x] \, dx = \frac{1}{8} \left[ \frac{63}{6} x^6 - \frac{70}{4} x^4 + \frac{15}{2} x^2 \right]_0^1 = \frac{1}{16}.

The substitution of these (and the squared norms \|P_k\|^2) into (5.111) yields the least-squares Legendre polynomial approximation

    \hat{f}(x) = \frac{1}{2} P_0(x) + \frac{3}{4} P_1(x) - \frac{7}{16} P_3(x) + \frac{11}{32} P_5(x).    (5.112)
We observe that \hat{f}(0) = \frac{1}{2}, \hat{f}(1) = \frac{37}{32} = 1.15625, and \hat{f}(-1) = -\frac{5}{32} = -0.15625.
Now we consider the Chebyshev polynomial approximation. In this case

    \hat{f}(x) = \sum_{k=0}^{5} b_k \phi_k(x),    (5.113)

where

    \phi_k(x) = \frac{1}{\|T_k\|} T_k(x),

and

    b_k = \langle f, \phi_k \rangle = \int_{-1}^{1} \frac{f(x) T_k(x)}{\sqrt{1 - x^2}} \cdot \frac{1}{\|T_k\|} \, dx,
from which we see that

    \hat{f}(x) = \sum_{k=0}^{5} \left[ \frac{\int_{-1}^{1} \frac{f(x) T_k(x)}{\sqrt{1-x^2}} \, dx}{\|T_k\|^2} \right] T_k(x).    (5.114)

We have [e.g., via (5.57)] the polynomials

    T_0(x) = 1, \quad T_1(x) = x, \quad T_2(x) = 2x^2 - 1,
    T_3(x) = 4x^3 - 3x, \quad T_4(x) = 8x^4 - 8x^2 + 1,
    T_5(x) = 16x^5 - 20x^3 + 5x.

The squared norms [via (5.55)] are given by

    \|T_0\|^2 = \pi, \quad \|T_k\|^2 = \frac{\pi}{2} \quad (k \geq 1).
By direct calculation, \beta_k = \int_0^1 \frac{T_k(x)}{\sqrt{1-x^2}} \, dx becomes (using x = \cos\theta, and T_k(\cos\theta) = \cos(k\theta) [recall (5.53)]) \beta_k = \int_0^{\pi/2} \cos(k\theta) \, d\theta, and hence

    \beta_0 = \int_0^{\pi/2} 1 \cdot d\theta = \frac{\pi}{2},
    \beta_1 = \int_0^{\pi/2} \cos\theta \, d\theta = \left[ \sin\theta \right]_0^{\pi/2} = 1,
    \beta_2 = \int_0^{\pi/2} \cos(2\theta) \, d\theta = \left[ \frac{1}{2} \sin(2\theta) \right]_0^{\pi/2} = 0,
    \beta_3 = \int_0^{\pi/2} \cos(3\theta) \, d\theta = \left[ \frac{1}{3} \sin(3\theta) \right]_0^{\pi/2} = -\frac{1}{3},
    \beta_4 = \int_0^{\pi/2} \cos(4\theta) \, d\theta = \left[ \frac{1}{4} \sin(4\theta) \right]_0^{\pi/2} = 0,
    \beta_5 = \int_0^{\pi/2} \cos(5\theta) \, d\theta = \left[ \frac{1}{5} \sin(5\theta) \right]_0^{\pi/2} = \frac{1}{5}.
Substituting these (and the squared norms \|T_k\|^2) into (5.114) yields the least-squares Chebyshev polynomial approximation

    \hat{f}(x) = \frac{1}{\pi} \left[ \frac{\pi}{2} T_0(x) + 2 T_1(x) - \frac{2}{3} T_3(x) + \frac{2}{5} T_5(x) \right].    (5.115)
We observe that \hat{f}(0) = \frac{1}{2}, \hat{f}(1) = 1.051737, and \hat{f}(-1) = -0.051737.
Figure 5.4 Plots of f(x) from (5.109), the Legendre approximation (5.112), and the Chebyshev approximation (5.115). The least-squares approximations are fifth-degree polynomials in both cases.
Plots of f(x) from (5.109) and the approximations in (5.112) and (5.115) appear in Fig. 5.4. Both polynomial approximations are fifth-degree polynomials, and yet they look fairly different. This is so because the weighting functions are different. The Chebyshev approximation is better than the Legendre approximation near x = ±1 because greater weight is given to errors near the ends of the interval [-1, 1].
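The endpoint behavior described above can be reproduced numerically. The following Python sketch (function names are illustrative; the book's own plots use MATLAB) evaluates (5.112) and (5.115) and confirms the endpoint values quoted in the text.

```python
# Evaluate the fifth-degree least-squares approximations (5.112) (Legendre)
# and (5.115) (Chebyshev) of the unit step on [-1, 1] at a few points.
from math import pi

def P(n, x):  # Legendre polynomial via the recurrence (5.108)
    if n == 0:
        return 1.0
    a, b = 1.0, x
    for k in range(1, n):
        a, b = b, ((2*k + 1)*x*b - k*a)/(k + 1)
    return b

def T(n, x):  # Chebyshev polynomial via T_{r+1} = 2x T_r - T_{r-1}
    if n == 0:
        return 1.0
    a, b = 1.0, x
    for _ in range(1, n):
        a, b = b, 2*x*b - a
    return b

def f_legendre(x):   # Eq. (5.112)
    return 0.5*P(0, x) + 0.75*P(1, x) - (7/16)*P(3, x) + (11/32)*P(5, x)

def f_chebyshev(x):  # Eq. (5.115)
    return (1/pi)*((pi/2)*T(0, x) + 2*T(1, x) - (2/3)*T(3, x) + (2/5)*T(5, x))

for x in (-1.0, 0.0, 1.0):
    print(x, f_legendre(x), f_chebyshev(x))
```

At x = ±1 the Chebyshev approximation overshoots the step by only about 0.052, versus about 0.156 for the Legendre approximation, matching the endpoint values given above.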
5.7 UNIFORM APPROXIMATION
The subject of uniform approximation is distinctly more complicated than that of
least-squares approximation. So, we will not devote too much space to it in this
book. We will concentrate on a relatively simple illustration of the main idea.
Recall the space C[a, b] from Chapter 1. We recall that it is a normed space with the norm

    \|x\| = \sup_{t \in [a,b]} |x(t)|    (5.116)

for any x(t) \in C[a, b]. In fact, further recalling Chapter 3, this space happens to be a complete metric space for which the metric induced by (5.116) is

    d(x, y) = \sup_{t \in [a,b]} |x(t) - y(t)|,    (5.117)
where x(t), y(t) \in C[a, b]. In uniform approximation we (for example) may wish to approximate f(t) \in C[a, b] with x(t) \in P^n[a, b] \subset C[a, b] such that \|f - x\| is minimized. The norm (5.116) is sometimes called the Chebyshev norm, and so our problem is sometimes called the Chebyshev approximation problem. The error e(t) = f(t) - x(t) has the norm

    \|e\| = \sup_{t \in [a,b]} |f(t) - x(t)|    (5.118)

and is the maximum deviation between f(t) and x(t) on [a, b]. We wish to find x(t) to minimize this. Consequently, uniform approximation is sometimes also called minimax approximation, as we wish to minimize the maximum deviation between f(t) and its approximation x(t). We remark that because f(t) is continuous (by definition) it will have a well-defined maximum on [a, b] (although this maximum need not be attained at a unique location). We are therefore at liberty to replace "sup" by "max" in (5.116), (5.117), and (5.118) if we wish.
Suppose that y_j(t) \in C[a, b] for j = 0, 1, \ldots, n-1, and that the set of functions \{y_j(t) | j \in Z_n\} is an independent set. This set generates an n-dimensional subspace^4 of C[a, b] that may be denoted by Y = \{\sum_{j=0}^{n-1} \alpha_j y_j(t) | \alpha_j \in \mathbf{R}\}. From Kreyszig [8, p. 337] consider

Definition 5.1: Haar Condition Subspace Y \subset C[a, b] satisfies the Haar condition if every y \in Y (y \neq 0) has at most n - 1 zeros in [a, b], where n = \dim(Y) (dimension of the subspace Y).

We may select y_j(t) = t^j for j \in Z_n, so any y(t) \in Y has the form y(t) = \sum_{j=0}^{n-1} a_j t^j and thus is a polynomial of degree at most n - 1. A polynomial of degree n - 1 has at most n - 1 zeros, and so such a subspace Y satisfies the Haar condition.
Definition 5.2: Alternating Set Let x \in C[a, b], and y \in Y, where Y is any subspace of C[a, b]. A set of points t_0, \ldots, t_k in [a, b] such that t_0 < t_1 < \cdots < t_k is called an alternating set for x - y if x(t_j) - y(t_j) has alternately the values +\|x - y\| and -\|x - y\| at consecutive points t_j.

Thus, if x(t_j) - y(t_j) = +\|x - y\|, then x(t_{j+1}) - y(t_{j+1}) = -\|x - y\|, but if instead x(t_j) - y(t_j) = -\|x - y\|, then x(t_{j+1}) - y(t_{j+1}) = +\|x - y\|. The norm is, of course, that in (5.116).
Lemma 5.2: Best Approximation Let Y be any subspace of C[a, b] that satisfies the Haar condition. Given f \in C[a, b], let y \in Y be such that for f - y there exists an alternating set of n + 1 points, where n = \dim(Y); then y is the best uniform approximation of f out of Y.
The proof is omitted, but may be found in Kreyszig [8, pp. 345-346].
^4 A subspace of a vector space X is a nonempty subset Y of X such that for any y_1, y_2 \in Y, and all scalars a, b from the field of the vector space, we have a y_1 + b y_2 \in Y.
Consider the particular case of C[-1, 1] with f(t) \in C[-1, 1] such that for a given n \in \mathbf{N}

    f(t) = t^n.    (5.119)

Now Y = \{\sum_{j=0}^{n-1} a_j t^j | a_j \in \mathbf{R}\}, so y_j(t) = t^j for j \in Z_n. We wish to select a_j such that the error e = f - y,

    e(t) = f(t) - \sum_{j=0}^{n-1} a_j t^j,    (5.120)

is minimized with respect to the Chebyshev norm (5.116); that is, select a_j to minimize \|e\|. Clearly, \dim(Y) = n. According to Lemma 5.2, \|e\| is minimized if e(t) in (5.120) has an alternating set of n + 1 points.
Recall Lemma 5.1 (Section 5.3), which stated [see (5.44)] that

    \cos n\theta = \sum_{k=0}^{n} \beta_{n,k} \cos^k \theta.    (5.121)

From (5.46), \beta_{n+1,n+1} = 2\beta_{n,n}, and \beta_{0,0} = 1. Consequently, \beta_{n,n} = 2^{n-1} (for n \geq 1). Thus, (5.121) can be rewritten as

    \cos n\theta = 2^{n-1} \cos^n \theta + \sum_{j=0}^{n-1} \beta_{n,j} \cos^j \theta    (5.122)

(n \geq 1). Suppose that t = \cos\theta, so \theta \in [0, \pi] maps to t \in [-1, 1], and from (5.122)

    \cos[n \cos^{-1} t] = 2^{n-1} t^n + \sum_{j=0}^{n-1} \beta_{n,j} t^j.    (5.123)
We observe that \cos n\theta has an alternating set of n + 1 points \theta_0, \theta_1, \ldots, \theta_n on [0, \pi] for which \cos n\theta_k = \pm 1 (clearly \|\cos n\theta\| = 1). For example, if n = 1, then

    \theta_0 = 0, \quad \theta_1 = \pi.

If n = 2, then

    \theta_0 = 0, \quad \theta_1 = \pi/2, \quad \theta_2 = \pi,

and if n = 3, then

    \theta_0 = 0, \quad \theta_1 = \frac{\pi}{3}, \quad \theta_2 = \frac{2\pi}{3}, \quad \theta_3 = \pi.

In general, \theta_k = \frac{k}{n}\pi for k = 0, 1, \ldots, n. Thus, if t_k = \cos\theta_k, then t_k = \cos(\frac{k}{n}\pi). We may rewrite (5.123) as

    \frac{1}{2^{n-1}} \cos[n \cos^{-1} t] = t^n + \frac{1}{2^{n-1}} \sum_{j=0}^{n-1} \beta_{n,j} t^j.    (5.124)
This is identical to e(t) in (5.120) if we set \beta_{n,j} = -2^{n-1} a_j. In other words, if we choose

    e(t) = \frac{1}{2^{n-1}} \cos[n \cos^{-1} t],    (5.125)

then \|e\| is minimized, since we know that e has an alternating set t = t_k = \cos(\frac{k}{n}\pi), k = 0, 1, \ldots, n, with t_k \in [-1, 1]. We recall from Section 5.3 that T_k(t) = \cos[k \cos^{-1} t] is the kth-degree Chebyshev polynomial of the first kind [see (5.53)]. So, e(t) = T_n(t)/2^{n-1}. Knowing this, we may readily determine the optimal coefficients a_j in (5.120) if we so desire.

Thus, the Chebyshev polynomials of the first kind determine the best degree n - 1 polynomial uniform approximation to f(t) = t^n on the interval t \in [-1, 1].
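This conclusion can be checked numerically: the minimax error e(t) = T_n(t)/2^{n-1} from (5.125) attains ±1/2^{n-1} with alternating signs at t_k = cos(kπ/n), and its modulus never exceeds 1/2^{n-1} on [-1, 1]. A Python sketch (illustrative names; the book's own examples use MATLAB):

```python
# Verify the equioscillation of e(t) = T_n(t)/2^(n-1) from Eq. (5.125).
from math import cos, pi

def T(n, x):  # Chebyshev recurrence T_{r+1} = 2x T_r - T_{r-1}
    a, b = 1.0, x
    if n == 0:
        return a
    for _ in range(1, n):
        a, b = b, 2*x*b - a
    return b

def check(n):
    bound = 1.0/2**(n - 1)
    # e(t_k) alternates between +bound and -bound at t_k = cos(k*pi/n)
    for k in range(n + 1):
        tk = cos(k*pi/n)
        ek = T(n, tk)/2**(n - 1)
        assert abs(abs(ek) - bound) < 1e-12
        assert ek*(-1)**k > 0      # signs alternate, starting at +bound
    # |e(t)| never exceeds the bound on a fine grid over [-1, 1]
    for i in range(2001):
        t = -1.0 + i*0.001
        assert abs(T(n, t))/2**(n - 1) <= bound + 1e-12

for n in (2, 3, 4, 5):
    check(n)
print("equioscillation verified for n = 2..5")
```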
REFERENCES
1. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966.
2. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York,
1974.
3. J. S. Lim and A. V. Oppenheim, eds., Advanced Topics in Signal Processing. Prentice-
Hall, Englewood Cliffs, NJ, 1988.
4. P. J. Davis and P. Rabinowitz, Numerical Integration, Blaisdell, Waltham, MA, 1967.
5. J. R. Rice, The Approximation of Functions, Vol. I: Linear Theory, Addison-Wesley, Reading, MA, 1964.
6. G. Szego, Orthogonal Polynomials, 3rd ed., American Mathematical Society, 1967.
7. L. Bers, Calculus: Preliminary Edition, Vol. 2, Holt, Rinehart, Winston, New York, 1967.
8. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York,
1978.
PROBLEMS
5.1. Suppose that \{\phi_k(x) | k \in Z_N\} are orthogonal polynomials on the interval D = [a, b] \subset \mathbf{R} with respect to some weighting function w(x) > 0. Show that

    \sum_{k=0}^{N-1} a_k \phi_k(x) = 0

holds for all x \in [a, b] iff a_k = 0 for all k \in Z_N [i.e., prove that \{\phi_k(x) | k \in Z_N\} is an independent set].
5.2. Verify (5.11) by direct calculation for \phi_k(x) = T_k(x)/\|T_k\| with n = 2; that is, find \phi_0(x), \phi_1(x), \phi_2(x), and \phi_3(x), and verify that the left- and right-hand sides of (5.11) are equal to each other.
5.3. Suppose that \{\phi_n(x) | n \in Z^+\} are orthogonal polynomials on the interval [a, b] with respect to some weighting function w(x) > 0. Prove the following theorem: The roots x_j (j = 1, 2, \ldots, n) of \phi_n(x) = 0 (n \in \mathbf{N}) are all real-valued, simple, and a < x_j < b for all j.
5.4. Recall that T_j(x) is the degree-j Chebyshev polynomial of the first kind. Recall from Lemma 5.1 that \cos(n\theta) = \sum_{k=0}^{n} \beta_{n,k} \cos^k \theta. For x \in [-1, 1] with k \in Z^+, we can find coefficients a_j such that

    x^k = \sum_{j=0}^{k} a_j T_j(x).

Prove that

    a_j = \frac{2 - \delta_j}{\pi} \int_0^{\pi} \cos^k \theta \cos(j\theta) \, d\theta.

Is this the best way to find the coefficients a_j? If not, specify an alternative approach.
5.5. This problem is about the converse to Lemma 5.1. Suppose that we have p(\cos\theta) = \sum_{k=0}^{3} a_{3,k} \cos^k \theta. We wish to find coefficients b_{3,j} such that

    p(\cos\theta) = \sum_{j=0}^{3} b_{3,j} \cos(j\theta).

Use Lemma 5.1 to show that a = Ub, where

    a = [a_{3,0} \ a_{3,1} \ a_{3,2} \ a_{3,3}]^T, \quad b = [b_{3,0} \ b_{3,1} \ b_{3,2} \ b_{3,3}]^T,

and U is an upper triangular matrix containing the coefficients \beta_{m,k} for m = 0, 1, 2, 3, and k = 0, 1, \ldots, m. Of course, we may use back-substitution to solve Ub = a for b if this were desired.
5.6. Suppose that for n \in Z^+ we are given p(\cos\theta) = \sum_{k=0}^{n} a_{n,k} \cos^k \theta. Show how to find the coefficients b_{n,j} such that

    p(\cos\theta) = \sum_{j=0}^{n} b_{n,j} \cos(j\theta);

that is, generalize the previous problem from n = 3 to any n.
5.7. Recall Section 5.6, where

    f(x) = \begin{cases} 0, & -1 \le x < 0 \\ 1, & 0 \le x \le 1 \end{cases}

was approximated by

    \hat{f}(x) = \sum_{k=0}^{n} b_k \frac{1}{\|T_k\|} T_k(x)

for n = 5. Find a general expression for b_k for all k \in Z^+. Use the resulting expansion to prove that

    \frac{\pi}{4} = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{2n - 1}.

5.8. Suppose that

    f(x) = \begin{cases} 0, & -1 \le x < 0 \\ x, & 0 \le x \le 1 \end{cases};

then find b_k in

    f(x) = \sum_{k=0}^{\infty} b_k \frac{1}{\|T_k\|} T_k(x).
5.9. Do the following:
(a) Solve the polynomial equation T_k(x) = 0 for all k > 0, and so find all the zeros of the polynomials T_k(x).
(b) Show that T_n(x) satisfies the differential equation

    (1 - x^2) T_n^{(2)}(x) - x T_n^{(1)}(x) + n^2 T_n(x) = 0.

Recall that T_n^{(r)}(x) = d^r T_n(x)/dx^r.
5.10. The three-term recurrence relation for the Chebyshev polynomials of the second kind is

    U_{r+1}(x) = 2x U_r(x) - U_{r-1}(x),    (5.P.1)

where r \geq 1, and where the initial conditions are

    U_0(x) = 1, \quad U_1(x) = 2x.    (5.P.2)

We remark that (5.P.1) is identical to (5.57). Specifically, the recursion for Chebyshev polynomials of both kinds is the same, except that the initial conditions in (5.P.2) are not the same for both. [From (5.56), T_0(x) = 1, but T_1(x) = x.] Since \deg(U_r) = r, we have (for r \geq 0)

    U_r(x) = \sum_{j=0}^{r} u_{r,j} x^j.    (5.P.3)

[Recall the notation for \phi_r(x) in (5.3).] From the polynomial recursion in (5.P.1) we may obtain

    u_{r+1,j} = 2u_{r,j-1} - u_{r-1,j}.    (5.P.4a)

This expression holds for r \geq 1, and for j = 0, 1, \ldots, r, r+1. From (5.P.2) the initial conditions for (5.P.4a) are

    u_{0,0} = 1, \quad u_{1,0} = 0, \quad u_{1,1} = 2.    (5.P.4b)

Write a MATLAB function that uses (5.P.4) to generate the Chebyshev polynomials of the second kind for r = 0, 1, \ldots, N. Test your program out for N = 8. Program output should be in the form of a table that is written to a file. The tabular format should be something like
file. The tabular format should be something like
degree
1
2
3
4
5
6
7
8
1
1
2
2
-1
4
3
-4
8
etc.
5.11. The Chebyshev polynomials of the second kind use

    D = [-1, 1], \quad w(x) = \sqrt{1 - x^2}.    (5.P.5)

Denote these polynomials as the set \{\phi_n(x) | n \in Z^+\}. (This notation applies if the polynomials are normalized to possess unity-valued norms.) Of course, in our function space, we use the inner product

    \langle f, g \rangle = \int_{-1}^{1} \sqrt{1 - x^2} f(x) g(x) \, dx.    (5.P.6)

Derive the polynomials \{\phi_n(x) | n \in Z^+\}. (Hint: The process is much the same as the derivation of the Chebyshev polynomials of the first kind presented in Section 5.3.) Therefore, begin by considering q_{r-1}(x), which is any polynomial of degree not more than r - 1. Thus, for suitable c_k \in \mathbf{R}, we must have q_{r-1}(x) = \sum_{j=0}^{r-1} c_j \phi_j(x), and \langle \phi_r, q_{r-1} \rangle = \sum_{j=0}^{r-1} c_j \langle \phi_r, \phi_j \rangle = 0 because \langle \phi_r, \phi_j \rangle = 0 for j = 0, 1, \ldots, r-1. In expanded form

    \langle \phi_r, q_{r-1} \rangle = \int_{-1}^{1} \sqrt{1 - x^2} \phi_r(x) q_{r-1}(x) \, dx = 0.    (5.P.7)

Use the change of variable x = \cos\theta (so dx = -\sin\theta \, d\theta) to reduce (5.P.7) to

    \int_0^{\pi} \sin^2\theta \cos(k\theta) \phi_r(\cos\theta) \, d\theta = 0    (5.P.8)

for k = 0, 1, 2, \ldots, r - 1, where Lemma 5.1 and its converse have also been employed. Next consider the candidate

    \phi_r(\cos\theta) = C_r \frac{\sin[(r+1)\theta]}{\sin\theta},    (5.P.9)

and verify that this satisfies (5.P.8). [Hence (5.P.9) satisfies (5.P.7).] Show that (5.P.9) becomes

    \phi_r(x) = C_r \frac{\sin[(r+1) \cos^{-1} x]}{\sqrt{1 - x^2}}.

Prove that for C_r = 1 (all r \in Z^+)

    \phi_{r+1}(x) = 2x \phi_r(x) - \phi_{r-1}(x).

In this case we normally use the notation U_r(x) = \phi_r(x). Verify that U_0(x) = 1, and that U_1(x) = 2x. Prove that \|U_n\|^2 = \frac{\pi}{2} for n \in Z^+. Finally, of course, \phi_n(x) = U_n(x)/\|U_n\|.
5.12. Write a MATLAB function to produce plots of U_k(x) for k = 2, 3, 4, 5 similar to Fig. 5.1.
5.13. For \phi_k(x) = U_k(x)/\|U_k\|, we have

    \phi_k(x) = \sqrt{\frac{2}{\pi}} U_k(x).

Since \langle \phi_k, \phi_j \rangle = \delta_{k-j}, for any f(x) \in L^2[-1, 1] we have the series expansion

    f(x) = \sum_{k=0}^{\infty} a_k \phi_k(x),

where

    a_k = \langle f, \phi_k \rangle = \frac{1}{\|U_k\|} \int_{-1}^{1} \sqrt{1 - x^2} f(x) U_k(x) \, dx.

Suppose that we work with the following function:

    f(x) = \begin{cases} 0, & -1 \le x < 0 \\ 1, & 0 \le x \le 1 \end{cases}
(a) Find a nice general formula for the elements of the sequence (a_k) (k \in Z^+).
(b) Use MATLAB to plot the approximation

    f_2(x) = \sum_{k=0}^{n} a_k \phi_k(x)

on the same graph as that of f(x) (i.e., create a plot similar to that of Fig. 5.4). Suppose that \psi_k(x) = T_k(x)/\|T_k\|; then another approximation to f(x) is given by

    f_1(x) = \sum_{k=0}^{n} b_k \psi_k(x).

Plot f_1(x) on the same graph as f_2(x) and f(x).
(c) Compare the accuracy of the approximations f_1(x) and f_2(x) to f(x) near the endpoints x = \pm 1. Which approximation is better near these endpoints? Explain why if you can.
5.14. Prove the following:
(a) T_n(x) and T_{n-1}(x) have no zeros in common.
(b) Between any two neighboring zeros of T_n(x), there is precisely one zero of T_{n-1}(x). This is called the interleaving of zeros property.
(Comment: The interleaving of zeros property is possessed by all orthogonal polynomials. When this property is combined with ideas from later chapters, it can be used to provide algorithms to find the zeros of orthogonal polynomials in general.)
5.15. (a) Show that we can write

    T_n(x) = \cos[n \cos^{-1} x] = \cosh[n \cosh^{-1} x].

[Hint: Note that \cos x = \frac{1}{2}(e^{jx} + e^{-jx}) and \cosh x = \frac{1}{2}(e^x + e^{-x}), so that \cos x = \cosh(jx).]
(b) Prove that T_{2n}(x) = T_n(2x^2 - 1). [Hint: \cos(2x) = 2\cos^2 x - 1.]
5.16. Use Eq. (5.63) in the following problems:
(a) Show that

    \frac{d}{dx} H_r(x) = 2r H_{r-1}(x).

(b) Show that

    \frac{d}{dx} \left[ e^{-x^2} \frac{d}{dx} H_r(x) \right] = -2r e^{-x^2} H_r(x).

(c) From the preceding, confirm that H_r(x) satisfies the Hermite differential equation

    H_r^{(2)}(x) - 2x H_r^{(1)}(x) + 2r H_r(x) = 0.
5.17. Find H_k(x) for k = 0, 1, 2, 3, 4, 5 (i.e., find the first six Hermite polynomials) using (5.83).
5.18. Using (5.63), prove that

    H_n(x) = n! \sum_{j=0}^{N} \frac{(-1)^j 2^{n-2j}}{j!(n-2j)!} x^{n-2j},

where N = n/2 (n is even), N = (n-1)/2 (n is odd).
5.19. Suppose that P_k(x) is the Legendre polynomial of degree k. Recall Eq. (5.90). Find constants \alpha and \beta such that for Q_k(x) = P_k(\alpha x + \beta) we have (for k \neq r)

    \int_a^b Q_r(x) Q_k(x) \, dx = 0.

[Comment: This linear transformation of variable allows us to least-squares approximate f(x) using Legendre polynomial series on any interval [a, b].]
5.20. Recall Eq. (5.105):

    \frac{1}{\sqrt{1 - 2xt + t^2}} = \sum_{n=0}^{\infty} P_n(x) t^n.

Verify the terms for n = 0, 1, 2, 3. [Hint: Recall Eq. (3.82) from Chapter 3.]
5.21. The distance between two points A and B in \mathbf{R}^3 is r, while the distance from A to the origin O is r_1, and the distance from B to the origin O is r_2. The angle between the vector OA and the vector OB is \theta.
(a) Show that

    \frac{1}{r} = \frac{1}{\sqrt{r_1^2 + r_2^2 - 2 r_1 r_2 \cos\theta}}.

[Hint: Recall the law of cosines (Section 4.6).]
(b) Show that

    \frac{1}{r} = \frac{1}{r_2} \sum_{n=0}^{\infty} P_n(\cos\theta) \left( \frac{r_1}{r_2} \right)^n,

where, of course, P_n(x) is the Legendre polynomial of degree n.
(Comment: This result is important in electromagnetic potential theory.)
5.22. Recall Section 5.7. Write a MATLAB function to plot on the same graph both f(t) = t^n and e(t) [in (5.125)] for each of n = 2, 3, 4, 5. You must generate four separate plots, one for each instance of n.
5.23. The set of points C = \{e^{j\theta} | \theta \in \mathbf{R}, j = \sqrt{-1}\} is the unit circle of the complex plane. If R(e^{j\theta}) > 0 for all \theta \in \mathbf{R}, then we may define an inner product on a suitable space of functions that are defined on C:

    \langle F, G \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\theta}) F^*(e^{j\theta}) G(e^{j\theta}) \, d\theta,    (5.P.10)

where, in general, F(e^{j\theta}), G(e^{j\theta}) \in \mathbf{C}. Since e^{j\theta} is 2\pi-periodic in \theta, the integration limits in (5.P.10) are -\pi to \pi, but another standard choice is from 0 to 2\pi. The function R(e^{j\theta}) is a weighting function for the inner product in (5.P.10). For R(e^{j\theta}), there will be a real-valued sequence (r_k) such that

    R(e^{j\theta}) = \sum_{k=-\infty}^{\infty} r_k e^{-jk\theta},

and we also have r_{-k} = r_k [i.e., (r_k) is a symmetric sequence]. For F(e^{j\theta}) and G(e^{j\theta}), we have real-valued sequences (f_k) and (g_k) such that

    F(e^{j\theta}) = \sum_{k=-\infty}^{\infty} f_k e^{-jk\theta}, \quad G(e^{j\theta}) = \sum_{k=-\infty}^{\infty} g_k e^{-jk\theta}.

(a) Show that \langle e^{-jm\theta}, e^{-jn\theta} \rangle = r_{n-m}.
(b) Show that \langle e^{-jn\theta} F(e^{j\theta}), e^{-jn\theta} G(e^{j\theta}) \rangle = \langle F(e^{j\theta}), G(e^{j\theta}) \rangle.
(c) Show that \langle F(e^{j\theta}), G(e^{j\theta}) \rangle = \langle 1, F(e^{-j\theta}) G(e^{j\theta}) \rangle.
[Comment: The unit circle C is of central importance in the theory of stability of linear time-invariant (LTI) discrete-time systems, and so appears in the subjects of digital control and digital signal processing.]
5.24. Recall Problem 4.21 and the previous problem (5.23). Given a_n(z) = \sum_{k=0}^{n} a_{n,k} z^{-k} (and z = e^{j\theta}), show that

    \langle a_n(z), z^{-k} \rangle \propto \delta_k

for k = 0, 1, \ldots, n - 1.
[Comment: This result ultimately leads to an alternative derivation of the
Levinson-Durbin algorithm, and suggests that this algorithm actually gen-
erates a sequence of orthogonal polynomials on the unit circle C. These
polynomials are in the indeterminate z~ l (instead of x).]
5.25. It is possible to construct orthogonal polynomials on discrete sets. This problem is about a particular example of this. Suppose that

    \phi_r[n] = \sum_{j=0}^{r} \phi_{r,j} n^j,

where n \in [-L, U] \subset \mathbf{Z}, U \geq L \geq 0, and \phi_{r,r} \neq 0, so that \deg(\phi_r) = r, and let us also assume that \phi_{r,j} \in \mathbf{R} for all r and j. We say that \{\phi_r[n] | r \in Z^+\} is an orthogonal set if \langle \phi_k[n], \phi_m[n] \rangle = \|\phi_k\|^2 \delta_{k-m}, where the inner product is defined by

    \langle f[n], g[n] \rangle = \sum_{n=-L}^{U} w[n] f[n] g[n],    (5.P.11)

and w[n] > 0 for all n \in [-L, U] is a weighting sequence for our inner product space. (Of course, \|f\|^2 = \langle f[n], f[n] \rangle.) Suppose that L = U = M; then, for w[n] = 1 (all n \in [-M, M]), it can be shown (with much effort) that the Gram polynomials are given by the three-term recurrence relation

    p_{k+1}[n] = \frac{2(2k+1)}{(k+1)(2M-k)} \, n \, p_k[n] - \frac{k}{k+1} \cdot \frac{2M+k+1}{2M-k} \, p_{k-1}[n],    (5.P.12)

where p_0[n] = 1, and p_1[n] = n/M.
(a) Use (5.P.12) to find p_k[n] for k = 2, 3, 4, where M = 2.
(b) Use (5.P.12) to find p_k[n] for k = 2, 3, 4, where M = 3.
(c) Use (5.P.11) to find \|p_k\|^2 for k = 2, 3, 4, where M = 2.
(d) Use (5.P.11) to find \|p_k\|^2 for k = 2, 3, 4, where M = 3.
(Comment: The uniform weighting function w[n] = 1 makes the Gram polynomials the discrete version of the Legendre polynomials. The Gram polynomials were actually invented by Chebyshev.)
5.26. Integration by parts is clearly quite important in analysis (e.g., recall Section 3.6). The derivation of the Gram polynomials (previous problem) makes use of summation by parts.
Suppose that v[n], u[n], and f[n] are defined on \mathbf{Z}. Define the forward difference operator \Delta according to

    \Delta f[n] = f[n+1] - f[n]

[for any sequence (f[n])]. Prove the expression for summation by parts, which is

    \sum_{n=-L}^{U} u[n] \Delta v[n] = u[n] v[n] \Big|_{-L}^{U+1} - \sum_{n=-L}^{U} v[n+1] \Delta u[n].

[Hint: Show that

    u[n] \Delta v[n] = \Delta(u[n] v[n]) - v[n+1] \Delta u[n],

and then consider using the identity

    \sum_{n=-L}^{U} \Delta f[n] = \sum_{n=-L}^{U} (f[n+1] - f[n]) = f[n] \Big|_{-L}^{U+1}.

Of course, f[n]|_A^B = f[B] - f[A].]
6
Interpolation
6.1 INTRODUCTION
Suppose that we have the data \{(t_k, x(t_k)) | k \in Z_{n+1}\}, perhaps obtained experimentally. An example of this appeared in Section 4.6. In that case we assumed that t_k = kT_s, for which x(t_k) = x(t)|_{t=kT_s} are the samples of some analog signal. In this example these time samples were of simulated (and highly oversimplified) physiological data for human patients (e.g., blood pressure, heart rate, body core temperature). Our problem involved assuming a model for the data; thus, assuming x(t) is explained by a particular mathematical function with certain unknown parameters to be estimated on the basis of the model and the data. In other words, we estimate x(t) with \hat{x}(t, a), where a is the vector of unknown parameters (the model parameters to be estimated), and we chose a to minimize the error

    e(t_k) = x(t_k) - \hat{x}(t_k, a), \quad k \in Z_{n+1},    (6.1)

according to some criterion. We have emphasized choosing a to minimize

    V(a) = \sum_{k=0}^{n} e^2(t_k).    (6.2)
This was the least-squares approach. However, the idea of choosing a to minimize \max_{k \in Z_{n+1}} |e(t_k)| is an alternative suggested by Section 5.7. Other choices are possible. However, no matter what choice we make, in all cases \hat{x}(t_k, a) is not necessarily exactly equal to x(t_k), except perhaps by chance. The problem of finding \hat{x}(t, a) to minimize e(t) in this manner is often called curve fitting. It is to be distinguished from interpolation, which may be defined as follows.
Usually we assume t_0 < t_1 < \cdots < t_{n-1} < t_n with t_0 = a and t_n = b, so that t_k \in [a, b] \subset \mathbf{R}. To interpolate the data \{(t_k, x(t_k)) | k \in Z_{n+1}\}, we seek a function p(t) such that t \in [a, b], and

    p(t_k) = x(t_k)    (6.3)

for all k \in Z_{n+1}. We might know something about the properties of x(t) for t \neq t_k on the interval [a, b], and so we might select p(t) to possess similar properties.
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
251
However, we emphasize that the interpolating function p(t) exactly matches x(t) at the given sample points t_k, k \in Z_{n+1}.
Curve fitting is used when the data are uncertain because of the corrupting effects
of measurement errors, random noise, or interference. Interpolation is appropriate
when the data are accurately or exactly known.
Interpolation is quite important in digital signal processing. For example, bandlimited signals may need to be interpolated in order to change sampling rates.
Interpolation is vital in numerical integration methods, as will be seen later. In
this application the integrand is typically known or can be readily found at some
finite set of points with significant accuracy. Interpolation at these points leads to
a function (usually a polynomial) that can be easily integrated, and so provides a
useful approximation to the given integral.
This chapter discusses interpolation with polynomials only. In principle it is possible to interpolate using other functions (rational functions, trigonometric functions, etc.). But these other approaches are usually more involved, and so will not be considered in this book.
6.2 LAGRANGE INTERPOLATION
This chapter presents polynomial interpolation in three different forms. The first form, which might be called the direct form, involves obtaining the interpolating polynomial by the direct solution of a particular linear system of equations (a Vandermonde system). The second form is an alternative called Lagrange interpolation and is considered in this section along with the direct form. The third form is called Newton interpolation, and is considered in the next section. All three approaches give the same polynomial but expressed in different mathematical forms, each possessing particular advantages and disadvantages. No one form is useful in all applications, and this is why we must consider them all.
As in Section 6.1, we consider the data set \{(t_k, x(t_k)) | k \in Z_{n+1}\}, and we will let x_k = x(t_k). We wish, as already noted, that our interpolating function be a polynomial of degree n:

    p_n(t) = \sum_{j=0}^{n} p_{n,j} t^j.    (6.4)

For example, if n = 1 (linear interpolation), then

    x_k = \sum_{j=0}^{1} p_{1,j} t_k^j, \quad x_{k+1} = \sum_{j=0}^{1} p_{1,j} t_{k+1}^j,

that is,

    p_{1,0} + p_{1,1} t_k = x_k,
    p_{1,0} + p_{1,1} t_{k+1} = x_{k+1},

or in matrix form this becomes

    \begin{bmatrix} 1 & t_k \\ 1 & t_{k+1} \end{bmatrix} \begin{bmatrix} p_{1,0} \\ p_{1,1} \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k+1} \end{bmatrix}.

This has a unique solution provided t_k \neq t_{k+1}, because in this instance the matrix has determinant \det \begin{bmatrix} 1 & t_k \\ 1 & t_{k+1} \end{bmatrix} = t_{k+1} - t_k. The polynomial p_1(t) = p_{1,0} + p_{1,1} t linearly interpolates the points (t_k, x_k) and (t_{k+1}, x_{k+1}). As another example, if
n = 2 (quadratic interpolation), then we have

    x_k = \sum_{j=0}^{2} p_{2,j} t_k^j, \quad x_{k+1} = \sum_{j=0}^{2} p_{2,j} t_{k+1}^j, \quad x_{k+2} = \sum_{j=0}^{2} p_{2,j} t_{k+2}^j,

or in matrix form

    \begin{bmatrix} 1 & t_k & t_k^2 \\ 1 & t_{k+1} & t_{k+1}^2 \\ 1 & t_{k+2} & t_{k+2}^2 \end{bmatrix} \begin{bmatrix} p_{2,0} \\ p_{2,1} \\ p_{2,2} \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k+1} \\ x_{k+2} \end{bmatrix}.    (6.5)

The matrix in (6.5) has the determinant (t_k - t_{k+1})(t_{k+1} - t_{k+2})(t_{k+2} - t_k), which will not be zero if t_k, t_{k+1}, and t_{k+2} are all distinct. The polynomial p_2(t) = p_{2,0} + p_{2,1} t + p_{2,2} t^2 interpolates the points (t_k, x_k), (t_{k+1}, x_{k+1}), and (t_{k+2}, x_{k+2}).
In general, for arbitrary n we have the linear system

    \underbrace{\begin{bmatrix} 1 & t_k & \cdots & t_k^{n-1} & t_k^n \\ 1 & t_{k+1} & \cdots & t_{k+1}^{n-1} & t_{k+1}^n \\ \vdots & & & & \vdots \\ 1 & t_{k+n-1} & \cdots & t_{k+n-1}^{n-1} & t_{k+n-1}^n \\ 1 & t_{k+n} & \cdots & t_{k+n}^{n-1} & t_{k+n}^n \end{bmatrix}}_{A} \begin{bmatrix} p_{n,0} \\ p_{n,1} \\ \vdots \\ p_{n,n-1} \\ p_{n,n} \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k+1} \\ \vdots \\ x_{k+n-1} \\ x_{k+n} \end{bmatrix}.    (6.6)
Matrix A is called a Vandermonde matrix, and the linear system (6.6) is a Vandermonde linear system of equations. The solution to (6.6) (if it exists) gives the direct form of the interpolating polynomial stated in (6.4). For convenience we will let k = 0. If we replace the node t_0 with the indeterminate t, then

    A = A(t) = \begin{bmatrix} 1 & t & \cdots & t^{n-1} & t^n \\ 1 & t_1 & \cdots & t_1^{n-1} & t_1^n \\ \vdots & & & & \vdots \\ 1 & t_{n-1} & \cdots & t_{n-1}^{n-1} & t_{n-1}^n \\ 1 & t_n & \cdots & t_n^{n-1} & t_n^n \end{bmatrix}.    (6.7)

Let D(t) = \det(A(t)), and we see that D(t) is a polynomial in the indeterminate t of degree n. Therefore, D(t) = 0 is an equation with exactly n roots (via the fundamental theorem of algebra). But D(t_k) = 0 for k = 1, \ldots, n, since the rows of A(t) are dependent for any t = t_k. So t_1, t_2, \ldots, t_n are the only possible roots of D(t) = 0. Therefore, if t_0, t_1, \ldots, t_n are all distinct, then we must have \det(A(t_0)) \neq 0. Hence A in (6.6) will always possess an inverse if t_{k+i} \neq t_{k+j} for i \neq j.
For small values of $n$ (e.g., $n = 1$ or $n = 2$), the direct solution of (6.6) can be a useful method of polynomial interpolation. However, it is known that Vandermonde matrices can be very ill-conditioned even for relatively small $n$ (e.g., $n > 10$ or so). This is particularly likely to happen for the common case of equispaced data, i.e., the case where $t_k = t_0 + hk$ for $k = 0, 1, \ldots, n$ with $h > 0$, as mentioned in Hill [1, p. 233]. Thus, we much prefer to avoid interpolating by the direct solution of (6.6) when $n \gg 2$. Also, direct solution of (6.6) is computationally inefficient, unless one contemplates the use of fast algorithms for Vandermonde system solution in Golub and Van Loan [2]. These are significantly faster than Gaussian elimination, as they possess asymptotic time complexities of only $O(n^2)$ versus the $O(n^3)$ complexity of Gaussian elimination approaches.
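The ill-conditioning of equispaced Vandermonde systems is easy to observe numerically. The following sketch (an illustration using NumPy, not taken from the text) computes the 2-norm condition number of the matrix in (6.6) for equispaced nodes on $[0, 1]$:

```python
import numpy as np

# Condition number of the Vandermonde matrix A in (6.6) for equispaced
# nodes t_k = k/n on [0, 1]; its rapid growth with n illustrates the
# ill-conditioning discussed above.
def vandermonde_cond(n):
    t = np.linspace(0.0, 1.0, n + 1)      # t_0, ..., t_n (equispaced)
    A = np.vander(t, increasing=True)     # columns 1, t, t^2, ..., t^n
    return np.linalg.cond(A)

for n in (2, 5, 10, 15):
    print(n, vandermonde_cond(n))
```

Even for $n = 10$ the condition number is already far too large for comfortable direct solution in double precision.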
We remark that so far we have proved the existence of a polynomial of degree $n$ that interpolates the data $\{(t_k, x_k) \mid k \in \mathbf{Z}_{n+1}\}$, provided $t_k \neq t_j$ ($j \neq k$). The polynomial also happens to be unique, a fact readily apparent from the uniqueness of the solution to (6.6) assuming the existence condition is met. So, if we can find (by any method) any polynomial of degree $\leq n$ that interpolates the given data, then this is the only possible interpolating polynomial for the data.

Since we disdain the idea of solving (6.6) directly, we seek alternative methods to obtain $p_n(t)$. In this regard, a better approach to polynomial interpolation is often Lagrange interpolation, which works as follows.

Again assume that we wish to interpolate the data set $\{(t_k, x_k) \mid k \in \mathbf{Z}_{n+1}\}$. Suppose that we possess polynomials (called Lagrange polynomials) $L_j(t)$ with the property

$$L_j(t_k) = \begin{cases} 0, & j \neq k \\ 1, & j = k \end{cases} = \delta_{j-k}. \tag{6.8}$$
Then the interpolating polynomial for the data set is

$$p_n(t) = x_0 L_0(t) + x_1 L_1(t) + \cdots + x_n L_n(t) = \sum_{j=0}^{n} x_j L_j(t). \tag{6.9}$$

We observe that

$$p_n(t_k) = \sum_{j=0}^{n} x_j L_j(t_k) = \sum_{j=0}^{n} x_j \delta_{j-k} = x_k \tag{6.10}$$

for $k \in \mathbf{Z}_{n+1}$, so $p_n(t)$ in (6.9) does indeed interpolate the data set, and via uniqueness, if we were to write $p_n(t)$ in the direct form $p_n(t) = \sum_{j=0}^{n} p_{n,j} t^j$, then the polynomial coefficients $p_{n,j}$ would satisfy (6.6). We may see that for $j \in \mathbf{Z}_{n+1}$ the Lagrange polynomials are given by

$$L_j(t) = \prod_{\substack{i=0 \\ i \neq j}}^{n} \frac{t - t_i}{t_j - t_i}. \tag{6.11}$$

Equation (6.9) is called the Lagrange form of the interpolating polynomial.
Example 6.1 Consider the data set $\{(t_0, x_0), (t_1, x_1), (t_2, x_2)\}$, so $n = 2$. Therefore

$$L_0(t) = \frac{(t - t_1)(t - t_2)}{(t_0 - t_1)(t_0 - t_2)}, \quad L_1(t) = \frac{(t - t_0)(t - t_2)}{(t_1 - t_0)(t_1 - t_2)}, \quad L_2(t) = \frac{(t - t_0)(t - t_1)}{(t_2 - t_0)(t_2 - t_1)}.$$

It is not difficult to see that $p_2(t) = x_0 L_0(t) + x_1 L_1(t) + x_2 L_2(t)$. For example, $p_2(t_0) = x_0 L_0(t_0) + x_1 L_1(t_0) + x_2 L_2(t_0) = x_0$.

Suppose that the data set has the specific values $\{(0, 1), (1, 2), (2, 3)\}$, and so (6.6) becomes

$$\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{bmatrix} \begin{bmatrix} p_{2,0} \\ p_{2,1} \\ p_{2,2} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}.$$

This has solution $p_2(t) = t + 1$ (i.e., $p_{2,2} = 0$). We see that the interpolating polynomial $p_n(t)$ need not have degree exactly equal to $n$, but can be of lower degree. We also see that

$$L_0(t) = \frac{(t - 1)(t - 2)}{(0 - 1)(0 - 2)} = \frac{1}{2}(t - 1)(t - 2) = \frac{1}{2}(t^2 - 3t + 2),$$

$$L_1(t) = \frac{(t - 0)(t - 2)}{(1 - 0)(1 - 2)} = -t(t - 2) = -(t^2 - 2t),$$

$$L_2(t) = \frac{(t - 0)(t - 1)}{(2 - 0)(2 - 1)} = \frac{1}{2}t(t - 1) = \frac{1}{2}(t^2 - t).$$

Observe that

$$x_0 L_0(t) + x_1 L_1(t) + x_2 L_2(t) = 1 \cdot \frac{1}{2}(t^2 - 3t + 2) + 2 \cdot \left[-(t^2 - 2t)\right] + 3 \cdot \frac{1}{2}(t^2 - t) = t + 1,$$

which is $p_2(t)$. As expected, the Lagrange form of the interpolating polynomial and the solution to the Vandermonde system are the same polynomial.
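A direct Python sketch of the Lagrange form (6.9)/(6.11), applied to the data of Example 6.1 (the function and variable names here are illustrative, not from the text):

```python
# Lagrange form of the interpolating polynomial: p_n(t) = sum_j x_j L_j(t),
# with L_j(t) the product in (6.11).
def lagrange_eval(t_nodes, x_vals, t):
    p = 0.0
    for j, tj in enumerate(t_nodes):
        Lj = 1.0
        for i, ti in enumerate(t_nodes):
            if i != j:
                Lj *= (t - ti) / (tj - ti)   # factor (t - t_i)/(t_j - t_i)
        p += x_vals[j] * Lj
    return p

# Data of Example 6.1: {(0,1), (1,2), (2,3)}, for which p_2(t) = t + 1.
t_nodes = [0.0, 1.0, 2.0]
x_vals = [1.0, 2.0, 3.0]
print(lagrange_eval(t_nodes, x_vals, 0.5))   # p_2(0.5) = 1.5
```

Evaluating at the nodes themselves reproduces the data exactly, in agreement with (6.10).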
We remark that once $p_n(t)$ is found, we often wish to evaluate $p_n(t)$ for $t \neq t_k$ ($k \in \mathbf{Z}_{n+1}$). Suppose that $p_n(t)$ is known in direct form; then this should be done using Horner's rule:

$$p_n(t) = p_{n,0} + t[p_{n,1} + t[p_{n,2} + t[p_{n,3} + \cdots + t[p_{n,n-1} + t p_{n,n}] \cdots ]]]. \tag{6.12}$$

For example, if $n = 3$, then

$$p_3(t) = p_{3,0} + t[p_{3,1} + t[p_{3,2} + t p_{3,3}]]. \tag{6.13}$$

To evaluate this requires only 6 flops (3 floating-point multiplications and 3 floating-point additions). Evaluating $p_3(t) = p_{3,0} + p_{3,1} t + p_{3,2} t^2 + p_{3,3} t^3$ directly needs 9 flops (6 floating-point multiplications and 3 floating-point additions). Thus, Horner's rule is more efficient from a computational standpoint. Using Horner's rule may be described as evaluating the polynomial from the inside out.
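The nested evaluation (6.12) can be sketched in a few lines (an illustrative helper, not from the text):

```python
# Horner's rule (6.12): evaluate p_n(t) from its direct-form coefficients
# [p_{n,0}, p_{n,1}, ..., p_{n,n}] using n multiplications and n additions.
def horner(coeffs, t):
    p = coeffs[-1]                  # start with p_{n,n} (innermost bracket)
    for c in reversed(coeffs[:-1]):
        p = c + t * p               # work outward: p <- p_{n,k} + t*p
    return p

# p_3(t) = 1 + 2t + 3t^2 + 4t^3 at t = 2: 1 + 4 + 12 + 32 = 49
print(horner([1.0, 2.0, 3.0, 4.0], 2.0))
```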
While Lagrange interpolation represents a simple way to solve (6.6), the form of the solution is (6.9), and so requires more effort to evaluate at $t \neq t_k$ than the direct form, the latter of which is easily evaluated using Horner's rule as noted in the previous paragraph. Another objection to Lagrange interpolation is that if we were to add elements to our data set, then the calculations for the current data set would have to be discarded, and this would force us to begin again. It is possible to overcome this inefficiency using Newton's divided-difference method, which leads to the Newton form of the interpolating polynomial. This may be seen in Hildebrand [3, Chapter 2]. We will consider this methodology in the next section.
When we evaluate $p_n(t)$ for $t \neq t_k$, but with $t \in [a, b] = [t_0, t_n]$, then this is interpolation. If we wish to evaluate $p_n(t)$ for $t < a$ or $t > b$, then this is referred to as extrapolation. To do this is highly risky. Indeed, even if we constrain $t$ to satisfy $t \in [a, b]$, the results can be poor. We illustrate with a famous example of Runge's, which is described in Forsythe et al. [4, pp. 69–70]. It is very possible that for $f(t) \in C[a, b]$, we have

$$\lim_{n \to \infty} \sup |f(t) - p_n(t)| = \infty.$$

Runge's specific example is for $t \in [-5, 5]$ with

$$f(t) = \frac{1}{1 + t^2},$$

where he showed that for any $t$ satisfying $3.64 < |t| < 5$,

$$\lim_{n \to \infty} \sup |f(t) - p_n(t)| = \infty.$$

This divergence with respect to the Chebyshev norm [recall Section 5.7 (of Chapter 5) for the definition of this term] is called Runge's phenomenon. A fairly detailed account of Runge's example appears in Isaacson and Keller [5, pp. 275–279].
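Runge's phenomenon is easy to reproduce numerically. The following sketch (illustrative code, not from the text) interpolates $f(t) = 1/(1+t^2)$ at equispaced nodes using the Lagrange form and watches the error near $t = 4.8$ grow as $n$ increases:

```python
# Runge's example: interpolate f(t) = 1/(1 + t^2) on [-5, 5] at n+1
# equispaced nodes; the interpolation error near the interval ends grows
# (without bound) as n increases.
def runge_error(n, t=4.8):
    nodes = [-5.0 + 10.0 * k / n for k in range(n + 1)]
    f = lambda u: 1.0 / (1.0 + u * u)
    p = 0.0
    for j, tj in enumerate(nodes):       # Lagrange form (6.9)/(6.11)
        Lj = 1.0
        for i, ti in enumerate(nodes):
            if i != j:
                Lj *= (t - ti) / (tj - ti)
        p += f(tj) * Lj
    return abs(f(t) - p)

print(runge_error(10), runge_error(20))   # the error increases with n
```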
We close this section with mention of the approximation error $f(t) - p_n(t)$. Suppose that $f(t) \in C^{n+1}[a, b]$, and it is understood that $f^{(n+1)}(t)$ is continuous as well; that is, $f(t)$ has continuous derivatives $f^{(k)}(t)$ for $k = 1, 2, \ldots, n + 1$. It can be shown that for suitable $\xi = \xi(t)$

$$e_n(t) = f(t) - p_n(t) = \frac{1}{(n+1)!} f^{(n+1)}(\xi) \prod_{i=0}^{n} (t - t_i), \tag{6.14}$$

where $\xi \in [a, b]$. If we know $f^{(n+1)}(t)$, then clearly (6.14) may yield useful bounds on $|e_n(t)|$. Of course, if we know nothing about $f(t)$, then (6.14) is useless. Equation (6.14) is also useless if $f(t)$ is not sufficiently differentiable. A derivation of (6.14) is given by Hildebrand [3, pp. 81–83], but we omit it here. However, it will follow from results to be considered in the next section.
6.3 NEWTON INTERPOLATION
Define

$$x[t_0, t_1] = \frac{x(t_1) - x(t_0)}{t_1 - t_0} = \frac{x_1 - x_0}{t_1 - t_0}. \tag{6.15}$$

This is called the first divided difference of $x(t)$ relative to $t_1$ and $t_0$. We see that $x[t_0, t_1] = x[t_1, t_0]$. We may linearly interpolate $x(t)$ for $t \in [t_0, t_1]$ according to

$$x(t) \approx x(t_0) + \frac{t - t_0}{t_1 - t_0}\left[x(t_1) - x(t_0)\right] = x(t_0) + (t - t_0)x[t_0, t_1]. \tag{6.16}$$
It is convenient to define $p_0(t) = x(t_0)$ and $p_1(t) = x(t_0) + (t - t_0)x[t_0, t_1]$. This is notation consistent with Section 6.2. In fact, $p_1(t)$ agrees with the solution to

$$\begin{bmatrix} 1 & t_0 \\ 1 & t_1 \end{bmatrix} \begin{bmatrix} p_{1,0} \\ p_{1,1} \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}, \tag{6.17}$$

as we expect.

Unless $x(t)$ is truly linear, the secant slope $x[t_0, t_1]$ will depend on the abscissas $t_0$ and $t_1$. If $x(t)$ is a second-degree polynomial, then $x[t_1, t]$ will itself be a linear function of $t$ for a given $t_1$. Consequently, the ratio

$$x[t_0, t_1, t_2] = \frac{x[t_1, t_2] - x[t_0, t_1]}{t_2 - t_0} \tag{6.18}$$

will be independent of $t_0$, $t_1$, and $t_2$. [This ratio is the second divided difference of $x(t)$ with respect to $t_0$, $t_1$, and $t_2$.] To see that this claim is true, consider $x(t) = a_0 + a_1 t + a_2 t^2$, so then

$$x[t_1, t] = a_1 + a_2(t + t_1). \tag{6.19}$$

Therefore

$$x[t_1, t_2] - x[t_0, t_1] = x[t_1, t_2] - x[t_1, t_0] = a_2(t_2 - t_0),$$

so that $x[t_0, t_1, t_2] = a_2$. We also note from (6.18) that
$$x[t_0, t_1, t_2] = \frac{1}{t_2 - t_0}\left[\frac{x_2 - x_1}{t_2 - t_1} - \frac{x_1 - x_0}{t_1 - t_0}\right], \tag{6.20a}$$

$$x[t_1, t_0, t_2] = \frac{1}{t_2 - t_1}\left[\frac{x_2 - x_0}{t_2 - t_0} - \frac{x_0 - x_1}{t_0 - t_1}\right], \tag{6.20b}$$

for which we may rewrite these in symmetric form

$$x[t_0, t_1, t_2] = \frac{x_0}{(t_0 - t_1)(t_0 - t_2)} + \frac{x_1}{(t_1 - t_0)(t_1 - t_2)} + \frac{x_2}{(t_2 - t_0)(t_2 - t_1)} = x[t_1, t_0, t_2], \tag{6.21}$$
that is, $x[t_0, t_1, t_2] = x[t_1, t_0, t_2]$. We see from (6.16) that

$$\frac{x(t) - x(t_0)}{t - t_0} \approx x[t_0, t_1],$$

that is,

$$x[t_0, t] \approx x[t_0, t_1], \tag{6.22}$$

and from this we consider the difference

$$x[t_0, t] - x[t_0, t_1] = x[t_0, t] - x[t_1, t_0] = (t - t_1)x[t_0, t_1, t] \tag{6.23}$$

via (6.18) and the symmetry property $x[t_0, t_1, t] = x[t_1, t_0, t]$. Since $x(t)$ is assumed to be quadratic, we may replace the approximation of (6.16) with the identity

$$x(t) = x(t_0) + (t - t_0)x[t_0, t] = \underbrace{x(t_0) + (t - t_0)x[t_0, t_1]}_{= p_1(t)} + (t - t_0)(t - t_1)x[t_0, t_1, t] \tag{6.24}$$

via (6.23). The first equality of (6.24) may be verified by direct calculation using $x(t) = a_0 + a_1 t + a_2 t^2$, and (6.19). We see that if $p_1(t)$ approximates $x(t)$, then the error involved is [from (6.24)]

$$e(t) = x(t) - p_1(t) = (t - t_0)(t - t_1)x[t_0, t_1, t]. \tag{6.25}$$
These results generalize to $x(t)$ a polynomial of higher degree. We may recursively define the divided differences of orders $0, 1, \ldots, k - 1, k$ according to

$$x[t_0] = x(t_0) = x_0,$$
$$x[t_0, t_1] = \frac{x(t_1) - x(t_0)}{t_1 - t_0} = \frac{x_1 - x_0}{t_1 - t_0},$$
$$x[t_0, t_1, t_2] = \frac{x[t_1, t_2] - x[t_0, t_1]}{t_2 - t_0},$$
$$\vdots$$
$$x[t_0, \ldots, t_k] = \frac{x[t_1, \ldots, t_k] - x[t_0, \ldots, t_{k-1}]}{t_k - t_0}. \tag{6.26}$$

We have established the symmetry $x[t_0, t_1] = x[t_1, t_0]$ (case $k = 1$), and also $x[t_0, t_1, t_2] = x[t_1, t_0, t_2]$ (case $k = 2$). For $k = 2$, symmetry of this kind can also be deduced from the symmetric form (6.21). It seems reasonable [from (6.21)] that in general

$$x[t_0, \ldots, t_k] = \frac{x_0}{(t_0 - t_1)\cdots(t_0 - t_k)} + \frac{x_1}{(t_1 - t_0)\cdots(t_1 - t_k)} + \cdots + \frac{x_k}{(t_k - t_0)\cdots(t_k - t_{k-1})} = \sum_{j=0}^{k} \frac{x_j}{\prod_{\substack{i=0 \\ i \neq j}}^{k}(t_j - t_i)}. \tag{6.27}$$
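The agreement between the recursion (6.26) and the symmetric sum (6.27) — and hence the permutation invariance of the arguments — can be checked numerically. A short sketch (illustrative helper names, arbitrary sample data):

```python
from itertools import permutations

# Divided differences two ways: the recursion (6.26) and the symmetric-sum
# form (6.27). The latter makes it plain that the value does not depend on
# the ordering of the arguments.
def dd_recursive(ts, xs):
    if len(ts) == 1:
        return xs[0]
    return (dd_recursive(ts[1:], xs[1:]) -
            dd_recursive(ts[:-1], xs[:-1])) / (ts[-1] - ts[0])

def dd_symmetric(ts, xs):
    total = 0.0
    for j, tj in enumerate(ts):
        denom = 1.0
        for i, ti in enumerate(ts):
            if i != j:
                denom *= (tj - ti)      # product in the denominator of (6.27)
        total += xs[j] / denom
    return total

ts, xs = (0.0, 1.0, 3.0, 4.0), (1.0, 2.0, 0.0, 5.0)
print(dd_recursive(ts, xs), dd_symmetric(ts, xs))   # equal values
```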
It is convenient to define the coefficient of $x_j$ as

$$\alpha_j = \frac{1}{\prod_{\substack{i=0 \\ i \neq j}}^{k}(t_j - t_i)} \tag{6.28}$$

for $j = 0, 1, \ldots, k$. Thus, $x[t_0, \ldots, t_k] = \sum_{j=0}^{k} \alpha_j x_j$. We may prove (6.27) formally by mathematical induction. We outline the detailed approach as follows. Suppose that it is true for $k = r$; that is, assume that

$$x[t_0, \ldots, t_r] = \sum_{j=0}^{r} \frac{x_j}{\prod_{\substack{i=0 \\ i \neq j}}^{r}(t_j - t_i)}, \tag{6.29}$$

and consider [from definition (6.26)]

$$x[t_0, \ldots, t_{r+1}] = \frac{1}{t_{r+1} - t_0}\{x[t_1, \ldots, t_{r+1}] - x[t_0, \ldots, t_r]\}. \tag{6.30}$$

For (6.30),

$$x[t_1, \ldots, t_{r+1}] = \frac{x_1}{(t_1 - t_2)\cdots(t_1 - t_{r+1})} + \frac{x_2}{(t_2 - t_1)\cdots(t_2 - t_{r+1})} + \cdots + \frac{x_{r+1}}{(t_{r+1} - t_1)\cdots(t_{r+1} - t_r)} \tag{6.31a}$$

and

$$x[t_0, \ldots, t_r] = \frac{x_0}{(t_0 - t_1)\cdots(t_0 - t_r)} + \frac{x_1}{(t_1 - t_0)\cdots(t_1 - t_r)} + \cdots + \frac{x_r}{(t_r - t_0)\cdots(t_r - t_{r-1})}. \tag{6.31b}$$

If we substitute (6.31) into (6.30), we see that, for example, the terms involving only $x_1$ combine as

$$\frac{1}{t_{r+1} - t_0}\left[\frac{x_1}{(t_1 - t_2)(t_1 - t_3)\cdots(t_1 - t_r)(t_1 - t_{r+1})} - \frac{x_1}{(t_1 - t_0)(t_1 - t_2)\cdots(t_1 - t_{r-1})(t_1 - t_r)}\right]$$
$$= \frac{x_1}{t_{r+1} - t_0}\cdot\frac{1}{(t_1 - t_2)\cdots(t_1 - t_r)}\left[\frac{1}{t_1 - t_{r+1}} - \frac{1}{t_1 - t_0}\right]$$
$$= \frac{x_1}{(t_1 - t_0)(t_1 - t_2)\cdots(t_1 - t_r)(t_1 - t_{r+1})} = \alpha_1 x_1 \quad (\text{with } k = r + 1).$$

The same holds for all remaining terms in $x_j$ for $j = 0, 1, \ldots, r + 1$. Hence (6.27) is valid by induction for all $k \geq 1$. Because of (6.27), the ordering of the arguments
in $x[t_0, \ldots, t_k]$ is irrelevant. Consequently, $x[t_0, \ldots, t_k]$ can be expressed as the difference between two divided differences of order $k - 1$, having any $k - 1$ of their $k$ arguments in common, divided by the difference between those arguments that are not in common. For example,

$$x[t_0, t_1, t_2, t_3] = \frac{x[t_1, t_2, t_3] - x[t_0, t_1, t_2]}{t_3 - t_0} = \frac{x[t_0, t_2, t_3] - x[t_1, t_2, t_3]}{t_0 - t_1}.$$
What happens if two arguments of a divided difference become equal? The situation is reminiscent of Corollary 5.1 (in Chapter 5). For example, suppose $t_1 = t + \epsilon$; then

$$x[t, t_1] = \frac{x(t_1) - x(t)}{t_1 - t} = \frac{x(t + \epsilon) - x(t)}{\epsilon},$$

implying that

$$x[t, t] = \lim_{\epsilon \to 0} \frac{x(t + \epsilon) - x(t)}{\epsilon} = \frac{dx(t)}{dt},$$

so that

$$x[t, t] = \frac{dx(t)}{dt}. \tag{6.32}$$

This assumes that $x(t)$ is differentiable. By similar reasoning,

$$\frac{d}{dt}x[t_0, \ldots, t_k, t] = x[t_0, \ldots, t_k, t, t] \tag{6.33}$$

(assuming that $t_0, \ldots, t_k$ are constants). Suppose $u_1, \ldots, u_n$ are differentiable functions of $t$; then it turns out that

$$\frac{d}{dt}x[t_0, \ldots, t_k, u_1, \ldots, u_n] = \sum_{j=1}^{n} x[t_0, \ldots, t_k, u_1, \ldots, u_n, u_j]\frac{du_j}{dt}. \tag{6.34}$$

Therefore, if $u_1 = u_2 = \cdots = u_n = t$, then, from (6.34),

$$\frac{d}{dt}x[t_0, \ldots, t_k, \underbrace{t, \ldots, t}_{n}] = n\, x[t_0, \ldots, t_k, \underbrace{t, \ldots, t}_{n+1}]. \tag{6.35}$$

Using (6.33) and (6.35), it may be shown that

$$\frac{d^r}{dt^r}x[t_0, \ldots, t_k, t] = r!\; x[t_0, \ldots, t_k, \underbrace{t, \ldots, t}_{r+1}]. \tag{6.36}$$

Of course, this assumes that $x(t)$ is sufficiently differentiable.
Equation (6.24) is just a special case of something more general. We note that if $x(t)$ is not a quadratic, then (6.24) is only an approximation for $t \notin \{t_0, t_1, t_2\}$, and so

$$x(t) \approx x(t_0) + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t_2] = p_2(t). \tag{6.37}$$

It is easy to verify by direct evaluation that $p_2(t_i) = x(t_i)$ for $i \in \{0, 1, 2\}$. Equation (6.37) is the second-degree interpolation formula, while (6.16) is the first-degree interpolation formula. We may generalize (6.24) [and hence (6.37)] to higher degrees by using (6.26); that is,

$$x(t) = x[t_0] + (t - t_0)x[t_0, t],$$
$$x[t_0, t] = x[t_0, t_1] + (t - t_1)x[t_0, t_1, t],$$
$$x[t_0, t_1, t] = x[t_0, t_1, t_2] + (t - t_2)x[t_0, t_1, t_2, t],$$
$$\vdots$$
$$x[t_0, \ldots, t_{n-1}, t] = x[t_0, \ldots, t_n] + (t - t_n)x[t_0, \ldots, t_n, t], \tag{6.38}$$

where the last equation follows from

$$x[t_0, \ldots, t_n, t] = \frac{x[t_1, \ldots, t_n, t] - x[t_0, \ldots, t_n]}{t - t_0} = \frac{x[t_0, \ldots, t_{n-1}, t] - x[t_0, \ldots, t_n]}{t - t_n} \tag{6.39}$$

(the second equality follows by exchanging $t_0$ and $t_n$). If the second relation of (6.38) is substituted into the first, we obtain

$$x(t) = x[t_0] + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t], \tag{6.40}$$

which is just (6.24) again. If we substitute the third relation of (6.38) into (6.40), we obtain

$$x(t) = x[t_0] + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t_2] + (t - t_0)(t - t_1)(t - t_2)x[t_0, t_1, t_2, t]. \tag{6.41}$$

This leads to the third-degree interpolation formula

$$p_3(t) = x[t_0] + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t_2] + (t - t_0)(t - t_1)(t - t_2)x[t_0, t_1, t_2, t_3].$$
Continuing in this fashion, we obtain

$$x(t) = x[t_0] + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t_2] + \cdots + (t - t_0)(t - t_1)\cdots(t - t_{n-1})x[t_0, t_1, \ldots, t_n] + e(t), \tag{6.42a}$$

where

$$e(t) = (t - t_0)(t - t_1)\cdots(t - t_n)x[t_0, t_1, \ldots, t_n, t], \tag{6.42b}$$

and we define

$$p_n(t) = x[t_0] + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t_2] + \cdots + (t - t_0)\cdots(t - t_{n-1})x[t_0, \ldots, t_n], \tag{6.42c}$$

which is the $n$th-degree interpolation formula, and is clearly a polynomial of degree $n$. So $e(t)$ is the error involved in interpolating $x(t)$ using polynomial $p_n(t)$. It is the case that $p_n(t_k) = x(t_k)$ for $k = 0, 1, \ldots, n$. Equation (6.42a) is the Newton interpolating formula with divided differences. If $x(t)$ is a polynomial of degree $n$ (or less), then $e(t) = 0$ (all $t$). This is more formally justified later.
Example 6.2 Consider $e(t)$ for $n = 2$. This requires [via (6.27)]

$$x[t_0, t_1, t_2, t] = \frac{x_0}{(t_0 - t_1)(t_0 - t_2)(t_0 - t)} + \frac{x_1}{(t_1 - t_0)(t_1 - t_2)(t_1 - t)} + \frac{x_2}{(t_2 - t_0)(t_2 - t_1)(t_2 - t)} + \frac{x(t)}{(t - t_0)(t - t_1)(t - t_2)}.$$

Thus

$$e(t) = (t - t_0)(t - t_1)(t - t_2)x[t_0, t_1, t_2, t] = \underbrace{-\left[\frac{(t - t_1)(t - t_2)x_0}{(t_0 - t_1)(t_0 - t_2)} + \frac{(t - t_0)(t - t_2)x_1}{(t_1 - t_0)(t_1 - t_2)} + \frac{(t - t_0)(t - t_1)x_2}{(t_2 - t_0)(t_2 - t_1)}\right]}_{= -p_2(t)} + x(t). \tag{6.43}$$

The form of $p_2(t)$ in (6.43) is that of $p_2(t)$ in Example 6.1:

$$p_2(t) = x_0 L_0(t) + x_1 L_1(t) + x_2 L_2(t).$$

If $x(t) = a_0 + a_1 t + a_2 t^2$, then clearly $e(t) = 0$ for all $t$. In fact, $p_{2,k} = a_k$.
Example 6.3 Suppose that we wish to interpolate $x(t) = e^t$ given that $t_k = kh$ with $h = 0.1$, and $k = 0, 1, 2, 3$. We are told that

$$x_0 = 1.000000, \quad x_1 = 1.105171, \quad x_2 = 1.221403, \quad x_3 = 1.349859.$$

We will consider $n = 2$ (i.e., quadratic interpolation). The task is aided if we construct the divided-difference table:

    t_0 = 0.0   x(t_0) = 1.000000
                                    x[t_0, t_1] = 1.05171
    t_1 = 0.1   x(t_1) = 1.105171                           x[t_0, t_1, t_2] = 0.55305
                                    x[t_1, t_2] = 1.16232
    t_2 = 0.2   x(t_2) = 1.221403                           x[t_1, t_2, t_3] = 0.61120
                                    x[t_2, t_3] = 1.28456
    t_3 = 0.3   x(t_3) = 1.349859

For $t \in [0, 0.2]$ we consider [from (6.37)]

$$x(t) \approx x(t_0) + (t - t_0)x[t_0, t_1] + (t - t_0)(t - t_1)x[t_0, t_1, t_2],$$

which for the data we are given becomes

$$x(t) \approx 1.000000 + 1.051710\, t + 0.55305\, t(t - 0.1) = p_2^{(0)}(t). \tag{6.44a}$$

If $t = 0.11$, then $p_2^{(0)}(0.11) = 1.116296$, while it turns out that $x(0.11) = 1.116278$, so $e(0.11) = x(0.11) - p_2^{(0)}(0.11) = -0.000018$. For $t \in [0.1, 0.3]$ we might consider

$$x(t) \approx x(t_1) + (t - t_1)x[t_1, t_2] + (t - t_1)(t - t_2)x[t_1, t_2, t_3],$$

which for the given data becomes

$$x(t) \approx 1.105171 + 1.16232(t - 0.1) + 0.61120(t - 0.1)(t - 0.2) = p_2^{(1)}(t). \tag{6.44b}$$

We observe that to calculate $p_2^{(1)}(t)$ does not require discarding all the results needed to determine $p_2^{(0)}(t)$. We do not need to begin again as with Lagrange interpolation since, for example,

$$x[t_0, t_1, t_2] = \frac{x[t_1, t_2] - x[t_0, t_1]}{t_2 - t_0}, \qquad x[t_1, t_2, t_3] = \frac{x[t_2, t_3] - x[t_1, t_2]}{t_3 - t_1},$$

and both of these divided differences require $x[t_1, t_2]$. If we wanted to use cubic interpolation, then the table is very easily augmented to include $x[t_0, t_1, t_2, t_3]$. Thus, updating the interpolating polynomial due to the addition of more data to the table, or of increasing $n$, may proceed more efficiently than if we were to employ Lagrange interpolation.

We also observe that both $p_2^{(0)}(t)$ and $p_2^{(1)}(t)$ may be used to interpolate $x(t)$ for $t \in [0.1, 0.2]$. Which polynomial should be chosen? Ideally, we would select the one for which $e(t)$ is the smallest. Practically, this means seeking bounds for $|e(t)|$ and choosing the interpolating polynomial with the best error bound.
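The divided-difference table and the nested evaluation of (6.42c) can be sketched as follows, using the data of Example 6.3 (helper names here are illustrative, not from the text):

```python
# Build the top edge of the divided-difference table, i.e. the Newton
# coefficients [x[t0], x[t0,t1], x[t0,t1,t2], ...], via the recursion (6.26).
def divided_difference_table(ts, xs):
    n = len(ts)
    col = list(xs)
    coeffs = [col[0]]
    for k in range(1, n):
        col = [(col[i + 1] - col[i]) / (ts[i + k] - ts[i])
               for i in range(n - k)]
        coeffs.append(col[0])
    return coeffs

# Evaluate p_n(t) of (6.42c) by nested multiplication (Horner-like).
def newton_eval(ts, coeffs, t):
    p = coeffs[-1]
    for c, tk in zip(reversed(coeffs[:-1]), reversed(ts[:len(coeffs) - 1])):
        p = c + (t - tk) * p
    return p

ts = [0.0, 0.1, 0.2, 0.3]
xs = [1.000000, 1.105171, 1.221403, 1.349859]   # samples of e^t
c = divided_difference_table(ts, xs)            # c[1] = 1.05171, c[2] = 0.55305
print(newton_eval(ts[:3], c[:3], 0.11))         # quadratic p(0.11) = 1.116296...
```

Note that augmenting the data with $(t_3, x_3)$ only appends a new coefficient; the earlier entries of the table are reused, exactly as the text describes.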
We have shown that if $x(t)$ is approximated by $p_n(t)$, then the error has the form

$$e(t) = \pi(t)\, x[t_0, \ldots, t_n, t] \tag{6.45a}$$

[recall (6.42b)], where

$$\pi(t) = (t - t_0)(t - t_1)\cdots(t - t_n), \tag{6.45b}$$

which is a degree $n + 1$ polynomial. The form of the error in (6.45a) can be useful in analyzing the accuracy of numerical integration and numerical differentiation procedures, but another form of the error can be found.

We note that $e(t) = x(t) - p_n(t)$ [recall (6.42)], so both $x(t) - p_n(t)$ and $\pi(t)$ vanish at $t_0, t_1, \ldots, t_n$. Consider the linear combination

$$X(t) = x(t) - p_n(t) - \kappa \pi(t). \tag{6.46}$$

We wish to select $\kappa$ so that $X(\bar{t}) = 0$, where $\bar{t} \neq t_k$ for $k \in \mathbf{Z}_{n+1}$. Such a $\kappa$ exists because $\pi(t)$ vanishes only at $t_0, t_1, \ldots, t_n$. Let $a = \min\{t_0, \ldots, t_n, \bar{t}\}$, $b = \max\{t_0, \ldots, t_n, \bar{t}\}$, and define the interval $I = [a, b]$. By construction, $X(t)$ vanishes at least $n + 2$ times on $I$. Repeated application of Rolle's theorem from calculus shows that $X^{(k)}(t) = d^k X(t)/dt^k$ vanishes at least $n + 2 - k$ times inside $I$. Specifically, $X^{(n+1)}(t)$ vanishes at least once inside $I$. Let this point be called $\xi$. Therefore, from (6.46), we obtain

$$x^{(n+1)}(\xi) - p_n^{(n+1)}(\xi) - \kappa \pi^{(n+1)}(\xi) = 0. \tag{6.47}$$

But $p_n(t)$ is a polynomial of degree $n$, so $p_n^{(n+1)}(t) = 0$ for all $t \in I$. From (6.45b), $\pi^{(n+1)}(t) = (n + 1)!$, so finally (6.47) reduces to

$$\kappa = \frac{1}{(n + 1)!}x^{(n+1)}(\xi). \tag{6.48}$$

From $X(\bar{t}) = 0$ in (6.46), and using (6.48), we find that

$$e(\bar{t}) = x(\bar{t}) - p_n(\bar{t}) = \frac{1}{(n + 1)!}x^{(n+1)}(\xi)\pi(\bar{t}) \tag{6.49}$$

for some $\xi \in I$. If we were to let $\bar{t} = t_k$ for any $k \in \mathbf{Z}_{n+1}$, then both sides of (6.49) vanish even in this previously excluded case. This allows us to write

$$e(t) = \frac{1}{(n + 1)!}x^{(n+1)}(\xi(t))\pi(t) \tag{6.50}$$

for some $\xi = \xi(t) \in I$ ($I = [a, b]$ with $a = \min\{t_0, \ldots, t_n, t\}$ and $b = \max\{t_0, \ldots, t_n, t\}$). If $x^{(n+1)}(t)$ is continuous for $t \in I$, then $x^{(n+1)}(t)$ is bounded on $I$, so there is an $M_{n+1} > 0$ such that

$$|x^{(n+1)}(\xi)| \leq M_{n+1}, \tag{6.51}$$
and hence

$$|e(t)| \leq \frac{M_{n+1}}{(n + 1)!}|\pi(t)| \tag{6.52}$$

for all $t \in I$. It is to be emphasized that for this to hold $x^{(n+1)}(t)$ must exist, and we normally require it to be continuous, too. Equations (6.50) and (6.45a) are equivalent, and thus

$$\pi(t)\, x[t_0, \ldots, t_n, t] = \frac{1}{(n + 1)!}x^{(n+1)}(\xi)\pi(t),$$

or in other words,

$$x[t_0, \ldots, t_n, t] = \frac{1}{(n + 1)!}x^{(n+1)}(\xi) \tag{6.53}$$

for some $\xi \in I$, whenever $x^{(n+1)}(t)$ exists in $I$. In particular, if $x(t)$ is a polynomial of degree $n$ or less, then (6.53) yields $x[t_0, \ldots, t_n, t] = 0$; hence $e(t) = 0$ for all $t$ (a fact mentioned earlier).
Example 6.4 Recall Example 6.3. We considered $x(t) = e^t$ with $t_0 = 0$, $t_1 = 0.1$, $t_2 = 0.2$, and we found that

$$p_2^{(0)}(t) = 1.000000 + 1.051710\, t + 0.553050\, t(t - 0.1),$$

for which $p_2^{(0)}(0.11) = 1.116296$, and $x(0.11) = 1.116278$. The exact error is

$$e(0.11) = x(0.11) - p_2^{(0)}(0.11) = -0.000018.$$

We will compare this with the bound we obtain from (6.52). Since $n = 2$, $x^{(3)}(t) = e^t$, and $I = [0, 0.2]$, so

$$M_3 = e^{0.2} = 1.221403$$

(which was given data), and $\pi(t) = t(t - 0.1)(t - 0.2)$, so $|\pi(0.11)| = 9.9 \times 10^{-5}$. Consequently, from (6.52), we have

$$|e(0.11)| \leq \frac{1.221403}{3!} \cdot 9.9 \times 10^{-5} = 0.000020.$$

The actual error certainly agrees with this bound.
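The arithmetic of Example 6.4 is easily checked in a few lines of Python (an illustrative verification, not part of the text):

```python
import math

# Example 6.4: the actual interpolation error at t = 0.11 respects the
# bound (6.52) with M_3 = e^0.2 and pi(t) = t(t - 0.1)(t - 0.2).
p = 1.000000 + 1.051710 * 0.11 + 0.553050 * 0.11 * (0.11 - 0.1)
e_actual = math.exp(0.11) - p                      # about -1.8e-05
pi_t = 0.11 * (0.11 - 0.1) * (0.11 - 0.2)
bound = math.exp(0.2) / math.factorial(3) * abs(pi_t)
print(e_actual, bound)
```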
We end this section by observing that (6.14) immediately follows from (6.50).
6.4 HERMITE INTERPOLATION
In the previous sections polynomial interpolation methods matched the polynomial only to the value of the function $f(x)$ at various points $x = x_k \in [a, b] \subset \mathbf{R}$. In this section we consider Hermite interpolation, where the interpolating polynomial also matches the first derivatives $f^{(1)}(x)$ at $x = x_k$. This interpolation technique is important in the development of higher-order numerical integration methods, as will be seen in Chapter 9.
The following theorem is the main result, and is essentially Theorem 3.9 from
Burden and Faires [6].
Theorem 6.1: Hermite Interpolation Suppose that $f(x) \in C^1[a, b]$, and that $x_0, x_1, \ldots, x_n \in [a, b]$ are distinct; then the unique polynomial of degree (at most) $2n + 1$, denoted by $p_{2n+1}(x)$, and such that

$$p_{2n+1}(x_j) = f(x_j), \qquad p_{2n+1}^{(1)}(x_j) = f^{(1)}(x_j) \tag{6.54}$$

($j \in \mathbf{Z}_{n+1}$) is given by

$$p_{2n+1}(x) = \sum_{k=0}^{n} h_k(x)f(x_k) + \sum_{k=0}^{n} \tilde{h}_k(x)f^{(1)}(x_k), \tag{6.55}$$

where

$$h_k(x) = [1 - 2L_k^{(1)}(x_k)(x - x_k)][L_k(x)]^2 \tag{6.56}$$

and

$$\tilde{h}_k(x) = (x - x_k)[L_k(x)]^2 \tag{6.57}$$

such that [recall (6.11)]

$$L_k(x) = \prod_{\substack{i=0 \\ i \neq k}}^{n} \frac{x - x_i}{x_k - x_i}. \tag{6.58}$$
Proof To show (6.54) for $x_0, x_1, \ldots, x_n$, we require that $h_k(x)$ and $\tilde{h}_k(x)$ in (6.56) and (6.57) satisfy the conditions

$$h_k(x_j) = \delta_{j-k}, \qquad h_k^{(1)}(x_j) = 0 \tag{6.59a}$$

and

$$\tilde{h}_k(x_j) = 0, \qquad \tilde{h}_k^{(1)}(x_j) = \delta_{j-k}. \tag{6.59b}$$

Assuming that these conditions hold, we may confirm (6.54) as follows. Via (6.55),

$$p_{2n+1}(x_j) = \sum_{k=0}^{n} h_k(x_j)f(x_k) + \sum_{k=0}^{n} \tilde{h}_k(x_j)f^{(1)}(x_k),$$

and via (6.59), this becomes

$$p_{2n+1}(x_j) = \sum_{k=0}^{n} \delta_{j-k}f(x_k) + \sum_{k=0}^{n} 0 \cdot f^{(1)}(x_k) = f(x_j).$$

This confirms the first case in (6.54). Similarly, via (6.59),

$$p_{2n+1}^{(1)}(x_j) = \sum_{k=0}^{n} h_k^{(1)}(x_j)f(x_k) + \sum_{k=0}^{n} \tilde{h}_k^{(1)}(x_j)f^{(1)}(x_k)$$

becomes

$$p_{2n+1}^{(1)}(x_j) = \sum_{k=0}^{n} 0 \cdot f(x_k) + \sum_{k=0}^{n} \delta_{j-k}f^{(1)}(x_k) = f^{(1)}(x_j),$$

which confirms the second case in (6.54).

Now we will confirm that $h_k(x)$ and $\tilde{h}_k(x)$ as defined in (6.56) and (6.57) satisfy the requirements given in (6.59). The conditions in (6.59b) imply that $\tilde{h}_k(x)$ must have a double root at $x = x_j$ for $j \neq k$, and a single root at $x = x_k$. A polynomial of degree at most $2n + 1$ that satisfies these requirements, and such that $\tilde{h}_k^{(1)}(x_k) = 1$, is

$$\tilde{h}_k(x) = (x - x_k)\frac{(x - x_0)^2 \cdots (x - x_{k-1})^2 \cdot (x - x_{k+1})^2 \cdots (x - x_n)^2}{(x_k - x_0)^2 \cdots (x_k - x_{k-1})^2 \cdot (x_k - x_{k+1})^2 \cdots (x_k - x_n)^2} = (x - x_k)L_k^2(x).$$

Certainly $\tilde{h}_k(x_k) = 0$. Moreover, $\tilde{h}_k^{(1)}(x) = L_k^2(x) + 2(x - x_k)L_k(x)L_k^{(1)}(x)$, so $\tilde{h}_k^{(1)}(x_k) = L_k^2(x_k) = 1$ [via (6.8)]. These verify (6.59b).

Now we consider (6.59a). These imply that $x_j$ for $j \neq k$ is a double root of $h_k(x)$, and we may consider (for suitable $a$ and $b$ to be found below)

$$h_k(x) = \frac{1}{\prod_{\substack{i=0 \\ i \neq k}}^{n}(x_k - x_i)^2}(x - x_0)^2 \cdots (x - x_{k-1})^2(x - x_{k+1})^2 \cdots (x - x_n)^2(ax + b),$$

which has degree at most $2n + 1$. More concisely, this polynomial is

$$h_k(x) = L_k^2(x)(ax + b).$$

From (6.59a) we require

$$1 = h_k(x_k) = L_k^2(x_k)(a x_k + b) = a x_k + b. \tag{6.60}$$

Also, $h_k^{(1)}(x) = a L_k^2(x) + 2L_k(x)L_k^{(1)}(x)(ax + b)$, and we also need [via (6.59a)]

$$h_k^{(1)}(x_k) = a L_k^2(x_k) + 2L_k(x_k)L_k^{(1)}(x_k)(a x_k + b) = 0,$$

but again since $L_k(x_k) = 1$, this expression reduces to

$$a + 2L_k^{(1)}(x_k) = 0,$$

where we have used (6.60). Hence

$$a = -2L_k^{(1)}(x_k), \qquad b = 1 + 2L_k^{(1)}(x_k)x_k.$$

Therefore, we finally have

$$h_k(x) = [1 - 2L_k^{(1)}(x_k)(x - x_k)]L_k^2(x),$$

which is (6.56). Since $L_k(x_j) = 0$ for $j \neq k$, it is clear that $h_k(x_j) = 0$ for $j \neq k$. It is also easy to see that $h_k^{(1)}(x_j) = 0$ for all $j \neq k$ too. Thus, (6.59a) is confirmed for $h_k(x)$ as defined in (6.56).
An error bound for Hermite interpolation is provided by the expression

$$f(x) = p_{2n+1}(x) + \frac{1}{(2n + 2)!}\prod_{k=0}^{n}(x - x_k)^2 f^{(2n+2)}(\xi) \tag{6.61}$$

for some $\xi \in (a, b)$, where $f(x) \in C^{2n+2}[a, b]$. We shall not derive (6.61) except to note that the approach is similar to the derivation of (6.14). Equation (6.14) was really derived in Section 6.3.
In its present form Hermite interpolation requires working with Lagrange polynomials and their derivatives. As noted by Burden and Faires [6], this is rather tedious (i.e., not computationally efficient). A procedure involving Newton interpolation (recall Section 6.3) may be employed to reduce the labor that would otherwise be involved in Hermite interpolation. We do not consider this approach, but instead refer the reader to Burden and Faires [6] for the details. We use Hermite interpolation in Chapter 9 to develop numerical integration methods, and efficient Hermite interpolation is not needed for this purpose.
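The construction (6.55)–(6.58) can be implemented directly; the only extra ingredient is $L_k^{(1)}(x_k) = \sum_{i \neq k} 1/(x_k - x_i)$, which follows from differentiating the product (6.58) and evaluating at $x_k$. The sketch below (illustrative names, sample data $f(x) = \sin x$ chosen for this example only) checks that the result matches both $f$ and $f^{(1)}$ at the nodes:

```python
import math

# Hermite interpolation via (6.55)-(6.58): p matches f and f' at all nodes.
def hermite_eval(nodes, f_vals, df_vals, x):
    n = len(nodes)
    def L(k, x):
        v = 1.0
        for i in range(n):
            if i != k:
                v *= (x - nodes[i]) / (nodes[k] - nodes[i])   # (6.58)
        return v
    def dL_at_node(k):
        # L_k'(x_k) = sum_{i != k} 1/(x_k - x_i)
        return sum(1.0 / (nodes[k] - nodes[i]) for i in range(n) if i != k)
    p = 0.0
    for k in range(n):
        Lk = L(k, x)
        h = (1.0 - 2.0 * dL_at_node(k) * (x - nodes[k])) * Lk * Lk   # (6.56)
        ht = (x - nodes[k]) * Lk * Lk                                # (6.57)
        p += h * f_vals[k] + ht * df_vals[k]                         # (6.55)
    return p

nodes = [0.0, 0.5, 1.0]
fv = [math.sin(x) for x in nodes]
dfv = [math.cos(x) for x in nodes]
print(hermite_eval(nodes, fv, dfv, 0.25))   # close to sin(0.25)
```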
6.5 SPLINE INTERPOLATION
Spline (spliced line) interpolation is a particular kind of piecewise polynomial interpolation. We may wish, for example, to approximate $f(x)$ for $x \in [a, b] \subset \mathbf{R}$ when given the sample points $\{(x_k, f(x_k)) \mid k \in \mathbf{Z}_{n+1}\}$ by fitting straight-line segments in between $(x_k, f(x_k))$ and $(x_{k+1}, f(x_{k+1}))$ for $k = 0, 1, \ldots, n - 1$. An example of this appears in Fig. 6.1.

Figure 6.1 The cubic polynomial $f(x) = (x - 1)(x - 2)(x - 3)$, and its piecewise linear interpolant (dashed line) at the nodes $x_k = x_0 + hk$ (with $x_0 = 0$), where $k = 0, 1, \ldots, n$ with $n = 8$.

This approach has a number of disadvantages. Although $f(x)$ may be differentiable at $x = x_k$, the piecewise linear approximation will not be (in general). Also, the graph of the interpolant has visually displeasing "kinks" in it. If interpolation is for a computer graphics application, or to define the physical surface of an automobile body or airplane, then such kinks are seldom acceptable. Splines are a means to deal with this problem. It is also worth noting that, more recently, splines have found a role in the design of wavelet functions [7, 8], which were briefly mentioned in Chapter 1.

The following definition is taken from Epperson [9], and our exposition of spline functions in this section follows that in [9] fairly closely. As always, $f^{(i)}(x) = d^i f(x)/dx^i$ (i.e., this is the notation for the $i$th derivative of $f(x)$).
Definition 6.1: Spline Suppose that we are given $\{(x_k, f(x_k)) \mid k \in \mathbf{Z}_{n+1}\}$. The piecewise polynomial function $p_m(x)$ is called a spline if:

(S1) $p_m(x_k) = f(x_k)$ for all $k \in \mathbf{Z}_{n+1}$ (interpolation).

(S2) $\lim_{x \to x_k^-} p_m^{(i)}(x) = \lim_{x \to x_k^+} p_m^{(i)}(x)$ for all $i \in \mathbf{Z}_{N+1}$ at every junction point $x_k$ (smoothness).

(S3) $p_m(x)$ is a polynomial of degree no larger than $m$ on every subinterval $[x_k, x_{k+1}]$ for $k \in \mathbf{Z}_n$ (interval of definition).

We say that $m$ is the degree of approximation and $N$ is the degree of smoothness of the spline $p_m(x)$.
There is a relationship between $m$ and $N$. As there are $n$ subintervals $[x_k, x_{k+1}]$, and each of these is the domain of definition of a degree-$m$ polynomial, we see that there are $D_f = n(m + 1)$ degrees of freedom. Each polynomial is specified by $m + 1$ coefficients, and there are $n$ of these polynomials; hence $D_f$ is the number of parameters to solve for in total. From Definition 6.1 there are $n + 1$ interpolation conditions [axiom (S1)]. And there are $n - 1$ junction points $x_1, \ldots, x_{n-1}$ (sometimes also called knots), with $N + 1$ continuity conditions being imposed on each of them [axiom (S2)]. As a result, there are $D_c = (n + 1) + (n - 1)(N + 1)$ constraints. Consider

$$D_f - D_c = n(m + 1) - [(n + 1) + (n - 1)(N + 1)] = n(m - N - 1) + N. \tag{6.62}$$

It is a common practice to enforce the condition $m - N - 1 = 0$; that is, we let

$$m = N + 1. \tag{6.63}$$

This relates the degree of approximation to the degree of smoothness in a simple manner. Below we will focus our attention exclusively on the special case of the cubic splines, for which $m = 3$. From (6.63) we must therefore have $N = 2$. With condition (6.63), then, from (6.62) we have

$$D_f - D_c = N. \tag{6.64}$$

As a result, it is necessary to impose $N$ further constraints on the design problem. How this is done is considered in detail below. Since we will look only at $m = 3$ with $N = 2$, we must impose two additional constraints. This will be done by imposing one constraint at each endpoint of the interval $[a, b]$. There is more than one way to do this, as will be seen later.
From Definition 6.1 it superficially appears that we need to compute $n$ different polynomials. However, it is possible to recast our problem in terms of B-splines. A B-spline acts as a prototype in the formation of a basis set of splines.

Aside from the assumption that $m = 3$, $N = 2$, let us further assume that

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b \tag{6.65}$$

with $x_{k+1} - x_k = h$ for $k = 0, 1, \ldots, n - 1$. This is the uniform grid assumption. We will also need to account for boundary conditions, and this requires us to introduce the additional grid points

$$x_{-3} = a - 3h, \quad x_{-2} = a - 2h, \quad x_{-1} = a - h \tag{6.66}$$

and

$$x_{n+3} = b + 3h, \quad x_{n+2} = b + 2h, \quad x_{n+1} = b + h. \tag{6.67}$$

Our prototype cubic B-spline will be the function
$$S(x) = \begin{cases} 0, & x \leq -2 \\ (x + 2)^3, & -2 \leq x \leq -1 \\ 1 + 3(x + 1) + 3(x + 1)^2 - 3(x + 1)^3, & -1 \leq x \leq 0 \\ 1 + 3(1 - x) + 3(1 - x)^2 - 3(1 - x)^3, & 0 \leq x \leq 1 \\ (2 - x)^3, & 1 \leq x \leq 2 \\ 0, & x \geq 2 \end{cases} \tag{6.68}$$
This function has nodes at $x \in \{-2, -1, 0, 1, 2\}$. A plot of it appears in Fig. 6.2, and we see that it has a bell shape similar to the Gaussian pulse we saw in Chapter 3. We may verify that $S(x)$ satisfies Definition 6.1 as follows. Plainly, it is piecewise cubic (i.e., $m = 3$), so axiom (S3) holds.

Figure 6.2 A plot of the cubic B-spline defined in Eq. (6.68).

The first and second derivatives are, respectively,

$$S^{(1)}(x) = \begin{cases} 0, & x \leq -2 \\ 3(x + 2)^2, & -2 \leq x \leq -1 \\ 3 + 6(x + 1) - 9(x + 1)^2, & -1 \leq x \leq 0 \\ -3 - 6(1 - x) + 9(1 - x)^2, & 0 \leq x \leq 1 \\ -3(2 - x)^2, & 1 \leq x \leq 2 \\ 0, & x \geq 2 \end{cases} \tag{6.69}$$

and

$$S^{(2)}(x) = \begin{cases} 0, & x \leq -2 \\ 6(x + 2), & -2 \leq x \leq -1 \\ 6 - 18(x + 1), & -1 \leq x \leq 0 \\ 6 - 18(1 - x), & 0 \leq x \leq 1 \\ 6(2 - x), & 1 \leq x \leq 2 \\ 0, & x \geq 2 \end{cases} \tag{6.70}$$

We note from (6.68)–(6.70) that

$$S(0) = 4, \quad S(\pm 1) = 1, \quad S(\pm 2) = 0, \tag{6.71a}$$
$$S^{(1)}(0) = 0, \quad S^{(1)}(\pm 1) = \mp 3, \quad S^{(1)}(\pm 2) = 0, \tag{6.71b}$$
$$S^{(2)}(0) = -12, \quad S^{(2)}(\pm 1) = 6, \quad S^{(2)}(\pm 2) = 0. \tag{6.71c}$$
So it is apparent that for $i = 0, 1, 2$ we have

$$\lim_{x \to x_k^-} S^{(i)}(x) = \lim_{x \to x_k^+} S^{(i)}(x)$$

for all $x_k \in \{-2, -1, 0, 1, 2\}$. Thus, the smoothness axiom (S2) is met for $N = 2$.

Now we need to consider how we may employ $S(x)$ to approximate any $f(x)$ for $x \in [a, b] \subset \mathbf{R}$ when working with the grid specified in (6.65)–(6.67). To this end we define

$$S_i(x) = S\!\left(\frac{x - x_i}{h}\right) \tag{6.72}$$

for $i = -1, 0, 1, \ldots, n, n + 1$. Since $S_i^{(1)}(x) = \frac{1}{h}S^{(1)}\!\left(\frac{x - x_i}{h}\right)$ and $S_i^{(2)}(x) = \frac{1}{h^2}S^{(2)}\!\left(\frac{x - x_i}{h}\right)$, from (6.71) we have (e.g., with $x_{i \pm 1} = x_i \pm h$)

$$S_i(x_i) = S(0) = 4, \quad S_i(x_{i \pm 1}) = S(\pm 1) = 1, \quad S_i(x_{i \pm 2}) = S(\pm 2) = 0, \tag{6.73a}$$
$$S_i^{(1)}(x_i) = 0, \quad S_i^{(1)}(x_{i \pm 1}) = \mp\frac{3}{h}, \quad S_i^{(1)}(x_{i \pm 2}) = 0, \tag{6.73b}$$

$$S_i^{(2)}(x_i) = -\frac{12}{h^2}, \quad S_i^{(2)}(x_{i \pm 1}) = \frac{6}{h^2}, \quad S_i^{(2)}(x_{i \pm 2}) = 0. \tag{6.73c}$$
We construct a cubic B-spline interpolant for any $f(x)$ by defining spline $p_3(x)$ to be a linear combination of $S_i(x)$ for $i = -1, 0, 1, \ldots, n, n + 1$; i.e., for suitable coefficients $a_i$ we have

$$p_3(x) = \sum_{i=-1}^{n+1} a_i S_i(x). \tag{6.74}$$

The series coefficients $a_i$ are determined in order to satisfy axiom (S1) of Definition 6.1; thus, we have

$$f(x_k) = \sum_{i=-1}^{n+1} a_i S_i(x_k) \tag{6.75}$$

for $k \in \mathbf{Z}_{n+1}$.
If we apply (6.73a) to (6.75), we observe that
n+\
f(xo) = ^ ajSjixo) = a-iS-i(x Q ) + a Q SQ(xo) + aiSi(xo)
«+i
/(*i) = £^ a,-S,-(*i) = a 5 , o(xi)+ai5 , i(x 1 ) + fl 2 52(xi)
i=-l
n+1
f(x n ) — 2_^ a iSi( x n) — Cn-\S n -i(x n ) + a n S n (x n ) + a n+ iS n +\(x n )
(6.76)
For example, for /(xq) in (6.76), we note that Sk(xo) — for k > 2 since (x k —
xq + A;/;)
S k (x ) = 5 (^) = S ( *> -»*+*"> ) = S(-*) = 0.
More generally,

    f(x_k) = a_{k-1} S_{k-1}(x_k) + a_k S_k(x_k) + a_{k+1} S_{k+1}(x_k)   (6.77)

for k ∈ Z_{n+1}. Again via (6.73a) we see that

    S_{k-1}(x_k) = S(((x_0 + kh) - (x_0 + (k - 1)h))/h) = S(1) = 1,
    S_k(x_k)     = S(((x_0 + kh) - (x_0 + kh))/h)       = S(0) = 4,
    S_{k+1}(x_k) = S(((x_0 + kh) - (x_0 + (k + 1)h))/h) = S(-1) = 1.

Thus, (6.77) becomes

    a_{k-1} + 4 a_k + a_{k+1} = f(x_k),             (6.78)
again for k = 0, 1, ..., n. In matrix form we have

    [ 1  4  1                ] [ a_{-1}  ]   [ f(x_0)     ]
    [    1  4  1             ] [ a_0     ]   [ f(x_1)     ]
    [       .  .  .          ] [  ...    ] = [   ...      ]      (6.79)
    [          1  4  1       ] [ a_n     ]   [ f(x_{n-1}) ]
    [             1  4  1    ] [ a_{n+1} ]   [ f(x_n)     ]
    \____________ A _________/
We note that A ∈ R^{(n+1)x(n+3)}, so there are n + 1 equations in n + 3 unknowns (a ∈ R^{n+3}). The tridiagonal linear system in (6.79) cannot be solved in its present form, as we need two additional constraints. This is to be expected from our earlier discussion surrounding (6.62)-(6.64). Recall that we have chosen m = 3, so N = 2, and so, via (6.64), we have D_f - D_c = N = 2, implying the need for two more constraints. There are two common approaches to obtaining these constraints:
1. We enforce p_3^(2)(x_0) = p_3^(2)(x_n) = 0 (natural spline).
2. We enforce p_3^(1)(x_0) = f^(1)(x_0) and p_3^(1)(x_n) = f^(1)(x_n) (complete spline or clamped spline).
Of these two choices, the natural spline is a bit easier to work with, but it can lead to larger approximation errors near the interval endpoints x_0 and x_n than the complete spline. The complete spline can avoid the apparent need to know the derivatives f^(1)(x_0) and f^(1)(x_n) by using numerical approximations to the derivative (see Section 9.6).
We will first consider the case of the natural spline. From (6.74), we obtain

    p_3(x_0) = a_{-1} S_{-1}(x_0) + a_0 S_0(x_0) + a_1 S_1(x_0),

so that

    p_3^(2)(x_0) = a_{-1} S_{-1}^(2)(x_0) + a_0 S_0^(2)(x_0) + a_1 S_1^(2)(x_0).   (6.80)

Since S_i^(2)(x_0) = (1/h^2) S^(2)((x_0 - x_i)/h), we have

    S_{-1}^(2)(x_0) = (1/h^2) S^(2)((x_0 - (x_0 - h))/h) = (1/h^2) S^(2)(1) = 6/h^2,    (6.81a)
    S_0^(2)(x_0)    = (1/h^2) S^(2)((x_0 - x_0)/h)       = (1/h^2) S^(2)(0) = -12/h^2,  (6.81b)
    S_1^(2)(x_0)    = (1/h^2) S^(2)((x_0 - (x_0 + h))/h) = (1/h^2) S^(2)(-1) = 6/h^2,   (6.81c)

where (6.71c) was used. Thus, setting (6.80) to zero reduces to

    h^{-2} [6 a_{-1} - 12 a_0 + 6 a_1] = 0,

i.e.,

    a_{-1} = 2 a_0 - a_1.                           (6.82)
Similarly,

    p_3(x_n) = a_{n-1} S_{n-1}(x_n) + a_n S_n(x_n) + a_{n+1} S_{n+1}(x_n),

so that

    p_3^(2)(x_n) = a_{n-1} S_{n-1}^(2)(x_n) + a_n S_n^(2)(x_n) + a_{n+1} S_{n+1}^(2)(x_n).   (6.83)

Since S_i^(2)(x_n) = (1/h^2) S^(2)((x_n - x_i)/h), we have

    S_{n-1}^(2)(x_n) = (1/h^2) S^(2)(((x_0 + nh) - (x_0 + (n - 1)h))/h) = (1/h^2) S^(2)(1) = 6/h^2,    (6.84a)
    S_n^(2)(x_n)     = (1/h^2) S^(2)(((x_0 + nh) - (x_0 + nh))/h)       = (1/h^2) S^(2)(0) = -12/h^2,  (6.84b)
    S_{n+1}^(2)(x_n) = (1/h^2) S^(2)(((x_0 + nh) - (x_0 + (n + 1)h))/h) = (1/h^2) S^(2)(-1) = 6/h^2,   (6.84c)

where (6.71c) was again employed. Thus, setting (6.83) to zero reduces to

    h^{-2} [6 a_{n-1} - 12 a_n + 6 a_{n+1}] = 0,

i.e.,

    a_{n+1} = 2 a_n - a_{n-1}.                      (6.85)
Now since [from (6.79)]

    a_{-1} + 4 a_0 + a_1 = f(x_0),

using (6.82) we have

    6 a_0 = f(x_0),                                 (6.86a)

and similarly

    a_{n-1} + 4 a_n + a_{n+1} = f(x_n),

so via (6.85) we have

    6 a_n = f(x_n).                                 (6.86b)

Using (6.86) we may rewrite (6.79) as the linear system
    [ 4  1             ] [ a_1     ]   [ f(x_1) - (1/6) f(x_0)     ]
    [ 1  4  1          ] [ a_2     ]   [ f(x_2)                    ]
    [    .  .  .       ] [  ...    ] = [   ...                     ]      (6.87)
    [       1  4  1    ] [ a_{n-2} ]   [ f(x_{n-2})                ]
    [          1  4    ] [ a_{n-1} ]   [ f(x_{n-1}) - (1/6) f(x_n) ]
    \_______ A _______/                \____________ f ____________/

Here we have A ∈ R^{(n-1)x(n-1)}, so now the tridiagonal system Aa = f has a unique solution, assuming that A^{-1} exists. The existence of A^{-1} will be justified below.
Now we will consider the case of the complete spline. In the case of

    p_3^(1)(x_0) = a_{-1} S_{-1}^(1)(x_0) + a_0 S_0^(1)(x_0) + a_1 S_1^(1)(x_0) = f^(1)(x_0),   (6.88)

since S_i^(1)(x_0) = (1/h) S^(1)((x_0 - x_i)/h), we have, using (6.71b),

    S_{-1}^(1)(x_0) = (1/h) S^(1)((x_0 - (x_0 - h))/h) = (1/h) S^(1)(1) = -3/h,
    S_0^(1)(x_0)    = (1/h) S^(1)((x_0 - x_0)/h)       = (1/h) S^(1)(0) = 0,
    S_1^(1)(x_0)    = (1/h) S^(1)((x_0 - (x_0 + h))/h) = (1/h) S^(1)(-1) = 3/h,

so (6.88) becomes

    p_3^(1)(x_0) = 3 h^{-1} [-a_{-1} + a_1] = f^(1)(x_0).   (6.89)

Similarly,

    p_3^(1)(x_n) = a_{n-1} S_{n-1}^(1)(x_n) + a_n S_n^(1)(x_n) + a_{n+1} S_{n+1}^(1)(x_n) = f^(1)(x_n)   (6.90)

reduces to

    p_3^(1)(x_n) = 3 h^{-1} [-a_{n-1} + a_{n+1}] = f^(1)(x_n).   (6.91)

From (6.89) and (6.91), we obtain

    a_{-1} = a_1 - (1/3) h f^(1)(x_0),   a_{n+1} = a_{n-1} + (1/3) h f^(1)(x_n).   (6.92)

If we substitute (6.92) into (6.79), we obtain
    [ 4  2             ] [ a_0     ]   [ f(x_0) + (h/3) f^(1)(x_0) ]
    [ 1  4  1          ] [ a_1     ]   [ f(x_1)                    ]
    [    .  .  .       ] [  ...    ] = [   ...                     ]      (6.93)
    [       1  4  1    ] [ a_{n-1} ]   [ f(x_{n-1})                ]
    [          2  4    ] [ a_n     ]   [ f(x_n) - (h/3) f^(1)(x_n) ]
    \_______ A _______/                \____________ f ____________/

Now we have A ∈ R^{(n+1)x(n+1)}, and we see that Aa = f of (6.93) will have a unique solution provided that A^{-1} exists.
Matrix A is tridiagonal, and so is quite sparse (i.e., it has many zero-valued entries) because of the "locality" of the function S(x). This locality makes it possible to evaluate p_3(x) in (6.74) efficiently. If we know that x ∈ [x_k, x_{k+1}], then

    p_3(x) = a_{k-1} S_{k-1}(x) + a_k S_k(x) + a_{k+1} S_{k+1}(x) + a_{k+2} S_{k+2}(x).   (6.94)

We write supp g(x) = [a, b] to represent the fact that g(x) = 0 for all x < a and x > b, while g(x) might not be zero-valued for x ∈ [a, b]. From (6.68) we may therefore say that supp S(x) = [-2, 2], and so from (6.72), supp S_i(x) = [x_i - 2h, x_i + 2h]; that is, S_i(x) is not necessarily zero-valued for x ∈ [x_i - 2h, x_i + 2h]. From this, (6.94), and the fact that x_i = x_0 + ih, we obtain
    supp p_3(x) = ∪_{i=k-1}^{k+2} supp S_i(x)
                = [x_0 + (k - 3)h, x_0 + (k + 1)h] ∪ ... ∪ [x_0 + kh, x_0 + (k + 4)h]   (6.95)

(if A_i are sets, then ∪_{i=n}^{m} A_i = A_n ∪ ... ∪ A_m), which "covers" the interval [x_k, x_{k+1}] = [x_0 + kh, x_0 + (k + 1)h]. Because the sampling grid (6.65)-(6.67) is uniform, it is easy to establish (via x >= x_0 + kh) that

    k = ⌊(x - x_0)/h⌋,                              (6.96)

where ⌊x⌋ is the largest integer that is <= x ∈ R. Since supp S(x) is an interval of finite length, we say that S(x) is compactly supported. This locality of support is also part of what makes splines useful in wavelet constructions.
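As a sketch of this locality argument, evaluating p_3(x) needs only the four coefficients selected by (6.96) and (6.94). Everything below is our own illustration (the function names and the convention that list entry a[i+1] stores a_i are assumptions, not the text's code); S(x) is the standard cubic B-spline, assumed to match (6.68):

```python
import math

def S(x):
    # standard cubic B-spline (assumed form of (6.68))
    ax = abs(x)
    if ax <= 1.0:
        return (2 - ax)**3 - 4*(1 - ax)**3
    if ax <= 2.0:
        return (2 - ax)**3
    return 0.0

def eval_p3(x, a, x0, h):
    # a = [a_{-1}, a_0, ..., a_{n+1}] for the uniform grid x_i = x0 + i*h
    n = len(a) - 3
    k = min(math.floor((x - x0) / h), n - 1)   # interval index, per (6.96)
    total = 0.0
    for i in range(k - 1, k + 3):              # only S_{k-1},...,S_{k+2} matter, per (6.94)
        total += a[i + 1] * S((x - (x0 + i * h)) / h)
    return total
```

A convenient sanity check: since S(0) + 2S(1) = 6, the shifted splines S_i(x) sum to 6 everywhere on [x_0, x_n], so taking every a_i = 1 should give p_3(x) = 6 for any x in the interval.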
We now consider the solution of Aa = f in either of (6.87) or (6.93). Obviously, Gaussian elimination is a possible method, but this is not efficient, since Gaussian elimination is a general procedure that does not take advantage of any matrix structure. The sparse tridiagonal structure of A can be exploited for a more efficient solution. We will consider a general method of tridiagonal linear system solution here, but it is one that is based on modifying the general Gaussian elimination method. We begin by defining
    A = [ a_{00}  a_{01}                            ]        [ x_0     ]       [ f_0     ]
        [ a_{10}  a_{11}  a_{12}                    ]        [ x_1     ]       [ f_1     ]
        [         a_{21}  a_{22}   .                ],  x =  [  ...    ],  f = [  ...    ].   (6.97)
        [                 .   a_{n-1,n-1}  a_{n-1,n}]        [ x_{n-1} ]       [ f_{n-1} ]
        [                     a_{n,n-1}    a_{n,n}  ]        [ x_n     ]       [ f_n     ]
Clearly, A ∈ R^{(n+1)x(n+1)}. There is some terminology associated with tridiagonal matrices. The main diagonal consists of the elements a_{i,i}, while the diagonal above this consists of the elements a_{i,i+1}, and is often called the superdiagonal. Similarly, the diagonal below the main diagonal consists of the elements a_{i+1,i}, and is often called the subdiagonal.
Our approach to solving Ax = f in (6.97) will be to apply Gaussian elimination to the augmented linear system [A | f] in order to reduce A to upper triangular form, and then backward substitution will be used to solve for x. This is much the same procedure as considered in Section 4.5, except that the tridiagonal structure of A makes matters easier. To see this, consider the special case of n = 3 as an example; that is (A^0 = A, f^0 = f),
    [A^0 | f^0] = [ a_{00}^0  a_{01}^0  0         0         | f_0^0 ]
                  [ a_{10}^0  a_{11}^0  a_{12}^0  0         | f_1^0 ]
                  [ 0         a_{21}^0  a_{22}^0  a_{23}^0  | f_2^0 ]      (6.98)
                  [ 0         0         a_{32}^0  a_{33}^0  | f_3^0 ]

We may apply elementary row operations to eliminate a_{10}^0 in (6.98). Thus, (6.98) becomes

    [A^1 | f^1] = [ a_{00}^0  a_{01}^0                                  0         0         | f_0^0                              ]
                  [ 0         a_{11}^0 - (a_{10}^0/a_{00}^0) a_{01}^0   a_{12}^0  0         | f_1^0 - (a_{10}^0/a_{00}^0) f_0^0  ]
                  [ 0         a_{21}^0                                  a_{22}^0  a_{23}^0  | f_2^0                              ]
                  [ 0         0                                         a_{32}^0  a_{33}^0  | f_3^0                              ]

                = [ a_{00}^1  a_{01}^1  0         0         | f_0^1 ]
                  [ 0         a_{11}^1  a_{12}^1  0         | f_1^1 ]
                  [ 0         a_{21}^1  a_{22}^1  a_{23}^1  | f_2^1 ]      (6.99)
                  [ 0         0         a_{32}^1  a_{33}^1  | f_3^1 ]
Now we apply elementary row operations to eliminate a_{21}^1. Thus, (6.99) becomes

    [A^2 | f^2] = [ a_{00}^1  a_{01}^1  0                                  0         | f_0^1                              ]
                  [ 0         a_{11}^1  a_{12}^1                           0         | f_1^1                              ]
                  [ 0         0         a_{22}^1 - (a_{21}^1/a_{11}^1) a_{12}^1  a_{23}^1  | f_2^1 - (a_{21}^1/a_{11}^1) f_1^1  ]
                  [ 0         0         a_{32}^1                           a_{33}^1  | f_3^1                              ]

                = [ a_{00}^2  a_{01}^2  0         0         | f_0^2 ]
                  [ 0         a_{11}^2  a_{12}^2  0         | f_1^2 ]
                  [ 0         0         a_{22}^2  a_{23}^2  | f_2^2 ]      (6.100)
                  [ 0         0         a_{32}^2  a_{33}^2  | f_3^2 ]
Finally, we eliminate a_{32}^2, in which case (6.100) becomes

    [A^3 | f^3] = [ a_{00}^2  a_{01}^2  0         0                                  | f_0^2                              ]
                  [ 0         a_{11}^2  a_{12}^2  0                                  | f_1^2                              ]
                  [ 0         0         a_{22}^2  a_{23}^2                           | f_2^2                              ]
                  [ 0         0         0         a_{33}^2 - (a_{32}^2/a_{22}^2) a_{23}^2  | f_3^2 - (a_{32}^2/a_{22}^2) f_2^2  ]

                = [ a_{00}^3  a_{01}^3  0         0         | f_0^3 ]
                  [ 0         a_{11}^3  a_{12}^3  0         | f_1^3 ]
                  [ 0         0         a_{22}^3  a_{23}^3  | f_2^3 ]      (6.101)
                  [ 0         0         0         a_{33}^3  | f_3^3 ]

We have A^3 = U, an upper triangular matrix. Thus, Ux = f^3 can be solved by backward substitution. The reader should write a pseudocode program to implement this approach for any n. Ideally, the code ought to be written such that only the vector f and the main, super-, and subdiagonals of A are stored.
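A minimal sketch of such a program follows. The function name and the three-list storage convention are our own; the elimination step is (6.102)-(6.103), followed by backward substitution:

```python
def solve_tridiag(sub, main, sup, f):
    # sub[k] = a_{k+1,k}, main[k] = a_{k,k}, sup[k] = a_{k,k+1}.
    # No pivoting, as in the text's algorithm; intended for diagonally
    # dominant A, where a zero pivot cannot occur (Theorem 6.2).
    n = len(main)
    main, f = main[:], f[:]                 # do not clobber the caller's data
    for k in range(n - 1):                  # forward elimination, (6.102)
        m = sub[k] / main[k]
        main[k + 1] -= m * sup[k]
        f[k + 1] -= m * f[k]                # superdiagonal untouched, (6.103)
    x = [0.0] * n
    x[n - 1] = f[n - 1] / main[n - 1]
    for i in range(n - 2, -1, -1):          # backward substitution
        x[i] = (f[i] - sup[i] * x[i + 1]) / main[i]
    return x

# 4 x 4 example with the spline matrix of (6.87); exact solution is [1, 2, 3, 4]
print(solve_tridiag([1.0]*3, [4.0]*4, [1.0]*3, [6.0, 12.0, 18.0, 19.0]))
```

Only the three diagonals and the right-hand side are stored, so the cost is O(n) flops rather than the O(n^3) of general Gaussian elimination.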
We observe that the algorithm we have just constructed will work only if a_{i,i}^i ≠ 0 for all i = 0, 1, ..., n. Thus, our approach may not be stable. We expect this potential problem because our algorithm does not employ any pivoting. However, our application here is to use the algorithm to solve either (6.87) or (6.93). For A in either of these cases, we never get a_{i,i}^i = 0. We may justify this claim as follows. In Definition 6.2 it is understood that a_{i,j} = 0 for |i - j| > 1 (i, j ∈ Z_{n+1}).

Definition 6.2: Diagonal Dominance  The tridiagonal matrix A = [a_{i,j}]_{i,j=0,...,n} ∈ R^{(n+1)x(n+1)} is diagonally dominant if

    a_{i,i} > |a_{i,i-1}| + |a_{i,i+1}| > 0

for i = 0, 1, ..., n.

It is clear that A in (6.87), or in (6.93), is diagonally dominant. For the algorithm we have developed to solve Ax = f, in general we can say

    a_{k+1,k+1}^{k+1} = a_{k+1,k+1}^k - (a_{k+1,k}^k / a_{k,k}^k) a_{k,k+1}^k,   (6.102)

with a_{k+1,k}^{k+1} = 0, and

    a_{k,k+1}^{k+1} = a_{k,k+1}^k                   (6.103)

for k = 0, 1, ..., n - 1. Condition (6.103) states that the algorithm does not modify the superdiagonal elements of A.
Theorem 6.2:  If tridiagonal matrix A is diagonally dominant, then the algorithm for solving Ax = f will not yield a_{i,i}^i = 0 for any i = 0, 1, ..., n.

Proof  We give only an outline of the main idea. The complete proof needs mathematical induction.

From Definition 6.2,

    a_{0,0} > |a_{0,-1}| + |a_{0,1}| = |a_{0,1}| > 0,

so from (6.102)

    a_{1,1}^1 = a_{1,1}^0 - (a_{1,0}^0 / a_{0,0}^0) a_{0,1}^0,

with a_{1,0}^1 = 0. Thus, A^1 can be obtained from A^0 = A. Now again from Definition 6.2,

    a_{1,1}^0 > |a_{1,0}^0| + |a_{1,2}^0| > 0,

so that, because 0 < |a_{0,1}^0 / a_{0,0}^0| < 1, we can say that

    a_{1,1}^1 >= a_{1,1}^0 - |a_{0,1}^0 / a_{0,0}^0| |a_{1,0}^0|
              > |a_{1,0}^0| + |a_{1,2}^0| - |a_{0,1}^0 / a_{0,0}^0| |a_{1,0}^0|
              = (1 - |a_{0,1}^0 / a_{0,0}^0|) |a_{1,0}^0| + |a_{1,2}^0|
              > |a_{1,2}^0| = |a_{1,2}^1| = |a_{1,0}^1| + |a_{1,2}^1| > 0,

and A^1 is diagonally dominant. The diagonal dominance of A implies the diagonal dominance of A^1. In general, we see that if A^k is diagonally dominant, then A^{k+1} will be diagonally dominant as well, and so A^{k+1} can be formed from A^k for all k = 0, 1, ..., n - 1. Thus, a_{i,i}^i ≠ 0 for all i = 0, 1, ..., n.

If A is diagonally dominant, it will be well-conditioned, too (not obvious), and so our algorithm to solve Ax = f is actually quite stable in this case.
Before considering an example of spline interpolation, we specify a result concerning the accuracy of approximation with cubic splines. This is a modified version of Theorem 3.13 in Burden and Faires [6], or of Theorem 4.7 in Epperson [9].

Theorem 6.3:  If f(x) ∈ C^4[a, b] with max_{x ∈ [a,b]} |f^(4)(x)| <= M, and p_3(x) is the unique complete spline interpolant for f(x), then

    max_{x ∈ [a,b]} |f(x) - p_3(x)| <= (5/384) M h^4.

Proof  The proof is omitted, but Epperson [9] suggests referring to the article by Hall [10].
We conclude this section with an illustration of the quality of approximation of cubic splines.

Example 6.5  Figure 6.3 shows the natural and complete cubic spline interpolants to the function f(x) = exp(-x^2) for x ∈ [-1, 1]. We have chosen the nodes x_k = -1 + k/2 for k = 0, 1, 2, 3, 4 (i.e., n = 4, and h = 1/2). Clearly, f^(1)(x) = -2x exp(-x^2). Thus, at the nodes

    f(±1) = 0.36787944,   f(±1/2) = 0.77880078,   f(0) = 1.00000000,

and

    f^(1)(±1) = ∓0.73575888.
Figure 6.3  Natural (a) and complete (b) spline interpolants (dashed lines) for the function f(x) = exp(-x^2) (solid lines) on the interval [-1, 1]. The circles correspond to node locations. Here n = 4, with h = 1/2, and x_0 = -1. The nodes are at x_k = x_0 + hk for k = 0, ..., n.
Of course, the spline series for our example has the form

    p_3(x) = Σ_{k=-1}^{5} a_k S_k(x),

so we need to determine the coefficients a_k from both (6.87) and (6.93).

Considering the natural spline interpolant first, from (6.87) we have

    [ 4  1  0 ] [ a_1 ]   [ f(x_1) - (1/6) f(x_0) ]
    [ 1  4  1 ] [ a_2 ] = [ f(x_2)                ]
    [ 0  1  4 ] [ a_3 ]   [ f(x_3) - (1/6) f(x_4) ]

Additionally,

    a_0 = (1/6) f(x_0) = (1/6) f(-1),   a_4 = (1/6) f(x_4) = (1/6) f(1),

and

    a_{-1} = f(x_0) - 4 a_0 - a_1,   a_5 = f(x_4) - 4 a_4 - a_3.
The natural spline series coefficients are therefore

      k     a_k
     -1    -0.01094139
      0     0.06131324
      1     0.13356787
      2     0.18321607
      3     0.13356787
      4     0.06131324
      5    -0.01094139
Now, considering the case of the complete spline, from (6.93) we have

    [ 4  2  0  0  0 ] [ a_0 ]   [ f(x_0) + (h/3) f^(1)(x_0) ]
    [ 1  4  1  0  0 ] [ a_1 ]   [ f(x_1)                    ]
    [ 0  1  4  1  0 ] [ a_2 ] = [ f(x_2)                    ]
    [ 0  0  1  4  1 ] [ a_3 ]   [ f(x_3)                    ]
    [ 0  0  0  2  4 ] [ a_4 ]   [ f(x_4) - (h/3) f^(1)(x_4) ]

Additionally,

    a_{-1} = a_1 - (1/3) h f^(1)(x_0),   a_5 = a_3 + (1/3) h f^(1)(x_4).
The complete spline series coefficients are therefore

      k     a_k
     -1     0.01276495
      0     0.05493076
      1     0.13539143
      2     0.18230428
      3     0.13539143
      4     0.05493076
      5     0.01276495
This example demonstrates what was suggested earlier: the complete spline interpolant tends to be more accurate than the natural spline interpolant. However, the accuracy of the complete spline interpolant is contingent on accurate estimation, or knowledge, of the derivatives f^(1)(x_0) and f^(1)(x_n).
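The natural spline column of this example is easy to reproduce. The helper below is our own sketch (the function name and the no-pivot elimination are ours); it assembles the interior system (6.87) for a uniform grid, solves it, and recovers a_{-1}, a_0, a_n, a_{n+1} from (6.86) and the outer rows of (6.79):

```python
import math

def natural_spline_coeffs(fvals):
    # fvals = [f(x_0), ..., f(x_n)], n >= 2; returns [a_{-1}, ..., a_{n+1}]
    n = len(fvals) - 1
    main = [4.0] * (n - 1)                  # system (6.87): diagonals 1, 4, 1
    rhs = list(fvals[1:n])
    rhs[0] -= fvals[0] / 6.0
    rhs[-1] -= fvals[n] / 6.0
    for k in range(n - 2):                  # no-pivot elimination (A diag. dominant)
        m = 1.0 / main[k]
        main[k + 1] -= m
        rhs[k + 1] -= m * rhs[k]
    a = [0.0] * (n - 1)
    a[n - 2] = rhs[n - 2] / main[n - 2]
    for i in range(n - 3, -1, -1):          # backward substitution
        a[i] = (rhs[i] - a[i + 1]) / main[i]
    a0, an = fvals[0] / 6.0, fvals[n] / 6.0           # (6.86)
    am1 = fvals[0] - 4.0 * a0 - a[0]                  # row k = 0 of (6.79)
    ap1 = fvals[n] - 4.0 * an - a[n - 2]              # row k = n of (6.79)
    return [am1, a0] + a + [an, ap1]

f = lambda x: math.exp(-x * x)
print([round(c, 8) for c in natural_spline_coeffs([f(-1 + 0.5 * k) for k in range(5)])])
```

Rounded to eight decimals, the result reproduces the natural spline table above.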
REFERENCES

1. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.), Random House, New York, 1988.
2. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ. Press, Baltimore, MD, 1989.
3. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York, 1974.
4. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977.
5. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966.
6. R. L. Burden and J. D. Faires, Numerical Analysis, 4th ed., PWS-KENT, Boston, MA, 1989.
7. C. K. Chui, An Introduction to Wavelets, Academic Press, Boston, MA, 1992.
8. M. Unser and T. Blu, "Wavelet Theory Demystified," IEEE Trans. Signal Process. 51, 470-483 (Feb. 2003).
9. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York, 2002.
10. C. A. Hall, "On Error Bounds for Spline Interpolation," J. Approx. Theory 1, 209-218 (1968).
PROBLEMS

6.1. Find the Lagrange interpolant p_2(t) = Σ_{j=0}^{2} x_j L_j(t) for the data set {(-1, 1/2), (0, 1), (1, -1)}. Find p_{2,j} in p_2(t) = Σ_{j=0}^{2} p_{2,j} t^j.

6.2. Find the Lagrange interpolant p_3(t) = Σ_{j=0}^{3} x_j L_j(t) for the data set {(0, 1), (1/2, 2), (1, 1/2), (2, -1)}. Find p_{3,j} in p_3(t) = Σ_{j=0}^{3} p_{3,j} t^j.
6.3. We want to interpolate x(t) at t = t_1, t_2, and x^(1)(t) = dx(t)/dt at t = t_0, t_3, using p_3(t) = Σ_{j=0}^{3} p_{3,j} t^j. Let x_j = x(t_j) and x_j^(1) = x^(1)(t_j). Find the linear system of equations satisfied by (p_{3,j}).
6.4. Find a general expression for L_j^(1)(t) = dL_j(t)/dt.
6.5. In Section 6.2 it was mentioned that fast algorithms exist to solve Vandermonde linear systems of equations. The Vandermonde system Ap = x is given in expanded form as

    [ 1  t_0  t_0^2  ...  t_0^n ] [ p_0 ]   [ x_0 ]
    [ 1  t_1  t_1^2  ...  t_1^n ] [ p_1 ]   [ x_1 ]
    [          ...              ] [ ... ] = [ ... ]      (6.P.1)
    [ 1  t_n  t_n^2  ...  t_n^n ] [ p_n ]   [ x_n ]

and p_n(t) = Σ_{j=0}^{n} p_j t^j interpolates the points {(t_k, x_k) | k = 0, 1, 2, ..., n}. A fast algorithm to solve (6.P.1) is
    for k := 0 to n - 1 do begin
        for i := n downto k + 1 do begin
            x_i := (x_i - x_{i-1})/(t_i - t_{i-k-1});
        end;
    end;
    for k := n - 1 downto 0 do begin
        for i := k to n - 1 do begin
            x_i := x_i - t_k x_{i+1};
        end;
    end;

The algorithm overwrites the vector x = [x_0 x_1 ... x_n]^T with the vector p = [p_0 p_1 ... p_n]^T.
(a) Count the number of arithmetic operations needed by the fast algorithm. What is its asymptotic time complexity, and how does this compare with Gaussian elimination as a method to solve (6.P.1)?

(b) Test the fast algorithm on the system

    [ 1  1   1   1 ] [ p_0 ]   [  10 ]
    [ 1  2   4   8 ] [ p_1 ]   [  26 ]
    [ 1  3   9  27 ] [ p_2 ] = [  58 ]
    [ 1  4  16  64 ] [ p_3 ]   [ 112 ]

(c) The "top" k loop in the fast algorithm produces the Newton form (Section 6.3) of the representation for p_n(t). For the system in (b), confirm that

    p_n(t) = Σ_{k=0}^{n} x_k ∏_{i=0}^{k-1} (t - t_i),

where {x_k | k ∈ Z_{n+1}} are the outputs from the top k loop. Since in (b) n = 3, we must have for this particular special case

    p_3(t) = x_0 + x_1 (t - t_0) + x_2 (t - t_0)(t - t_1) + x_3 (t - t_0)(t - t_1)(t - t_2).

(Comment: It has been noted by Bjorck and Pereyra that the fast algorithm often yields accurate results even when A is ill-conditioned.)
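The pseudocode above transcribes directly into Python; this sketch is our own (the function name is an assumption), and it passes the test in part (b):

```python
def fast_vandermonde_solve(t, x):
    # Overwrites a copy of x = [x_0, ..., x_n] with p = [p_0, ..., p_n].
    x = list(x)
    n = len(t) - 1
    for k in range(n):                      # "top" loop: Newton coefficients
        for i in range(n, k, -1):
            x[i] = (x[i] - x[i - 1]) / (t[i] - t[i - k - 1])
    for k in range(n - 1, -1, -1):          # "bottom" loop: Newton -> monomial form
        for i in range(k, n):
            x[i] -= t[k] * x[i + 1]
    return x

print(fast_vandermonde_solve([1, 2, 3, 4], [10, 26, 58, 112]))  # [4.0, 3.0, 2.0, 1.0]
```

That is, p_3(t) = t^3 + 2t^2 + 3t + 4, which indeed maps t = 1, 2, 3, 4 to 10, 26, 58, 112.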
6.6. Prove that for

    A_n = [ 1  t_0  t_0^2  ...  t_0^n ]
          [ 1  t_1  t_1^2  ...  t_1^n ]
          [          ...             ]
          [ 1  t_n  t_n^2  ...  t_n^n ]

we have

    det(A_n) = ∏_{0 <= i < j <= n} (t_j - t_i).
(Hint: Use mathematical induction.)
6.7. Write a MATLAB function to interpolate the data {(tj,Xj)\j e Z„ + i} with
polynomial p n (t) via Lagrange interpolation. The function must accept f as
input, and return p n (t).
6.8. Runge's phenomenon was mentioned in Section 6.2 with respect to interpolating f(t) = 1/(1 + t^2) on t ∈ [-5, 5]. Use polynomial p_n(t) to interpolate f(t) at the points t_k = t_0 + kh, k = 0, 1, ..., n, where t_0 = -5 and t_n = 5 (so h = (t_n - t_0)/n). Do this for n = 5, 8, 10. Use the MATLAB function from the previous problem. Plot f(t) and p_n(t) on the same graph for all of n = 5, 8, 10. Comment on the accuracy of interpolation as n increases.
6.9. Suppose that we wish to interpolate f(t) = sin t for t ∈ [0, π/2] using polynomial p_n(t) = Σ_{j=0}^{n} p_{n,j} t^j. The approximation error is e_n(t) = f(t) - p_n(t). Since |f^(n)(t)| <= 1 for all t ∈ R, via (6.14) it follows that

    |e_n(t)| <= (1/(n + 1)!) ∏_{i=0}^{n} |t - t_i| = b(t).      (6.P.2)

Let the grid (sample) points be t_0 = 0, t_n = π/2, and t_k = t_0 + kh for k = 0, 1, ..., n, so that h = (t_n - t_0)/n = π/(2n). Using MATLAB:

(a) For n = 2, on the same graph plot f(t) - p_n(t) and ±b(t).
(b) For n = 4, on the same graph plot f(t) - p_n(t) and ±b(t).

Cases (a) and (b) can be separate plots (e.g., using MATLAB subplot). Does the bound in (6.P.2) hold in both cases?
6.10. Consider f(t) = cosh t = (1/2)[e^t + e^{-t}], which is to be interpolated by p_2(t) = Σ_{j=0}^{2} p_{2,j} t^j on t ∈ [-1, 1] at the points t_0 = -1, t_1 = 0, t_2 = 1. Use (6.14) to find an upper bound on the size of the approximation error e_n(t) for t ∈ [-1, 1], where the size of the approximation error is given by the norm

    ||e_n||_∞ = max_{a <= t <= b} |e_n(t)|,

where here a = -1, b = 1.
6.11. For

    V = [ 1      1      ...  1          1     ]
        [ t_0    t_1    ...  t_{n-1}    t_n   ]
        [          ...                        ]  ∈ R^{(n+1)x(n+1)},
        [ t_0^n  t_1^n  ...  t_{n-1}^n  t_n^n ]

it is claimed by Gautschi that

    ||V^{-1}||_∞ <= max_{0 <= k <= n} ∏_{i ≠ k} (1 + |t_i|)/|t_k - t_i|.      (6.P.3)

Find an upper bound on κ_∞(V) that uses (6.P.3). How useful is an upper bound on the condition number? Would a lower bound on the condition number be more useful? Explain.
6.12. For each function listed below, use divided difference tables (recall Example 6.3) to construct the degree n Newton interpolating polynomial for the specified points.

(a) f(t) = 4^t, t_0 = 0, t_1 = 1, t_2 = 3. Use n = 2.
(b) f(t) = cosh t, t_0 = -1, t_1 = -1/2, t_2 = 1/2, t_3 = 1. Use n = 3.
(c) f(t) = ln t, t_0 = 1, t_1 = 2, t_2 = 3. Use n = 2.
(d) f(t) = 1/(1 + t), t_0 = 0, t_1 = 1/2, t_2 = 1, t_3 = 2. Use n = 3.

6.13. Prove Eq. (6.33).
6.14. Consider Theorem 6.1. For n = 2, find h_k(x) and ~h_k(x) as direct-form (i.e., a form such as (6.4)) polynomials in x. Of course, do this for all k = 0, 1, 2. Find the Hermite interpolation polynomial p_{2n+1}(x) for n = 2 that interpolates f(x) = √x at the points x_0 = 1, x_1 = 3/2, x_2 = 2.
6.15. Find the Hermite interpolating polynomial p_5(x) to interpolate f(x) = e^x at the points x_0 = 0, x_1 = 0.1, x_2 = 0.2. Use MATLAB to compare the accuracy of approximation of p_5(x) to that of p_2(x) given in Example 6.3 (or Example 6.4).
6.16. The following matrix is important in solving spline interpolation problems [recall (6.87)]:

    A_n = [ 4  1             ]
          [ 1  4  1          ]
          [    .  .  .       ]  ∈ R^{n x n}.
          [       1  4  1    ]
          [          1  4    ]

Suppose that D_n = det(A_n).
(a) Find D_1, D_2, D_3, and D_4 by direct (hand) calculation using the basic rules for computing determinants.

(b) Show that

    D_{n+2} - 4 D_{n+1} + D_n = 0

(i.e., the determinants we seek may be generated by a second-order difference equation).

(c) For suitable constants α, β ∈ R, it can be shown that

    D_n = α (2 + √3)^n + β (2 - √3)^n      (6.P.4)

(n ∈ N). Find α and β.
[Hint: Set up two linear equations in the two unknowns α and β using (6.P.4) for n = 1, 2 and the results from (a).]

(d) Prove that D_n > 0 for all n ∈ N.

(e) Is A_n > 0 for all n ∈ N? Justify your answer.
[Hint: Recall part (a) in Problem 4.13 (of Chapter 4).]
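Parts (a) and (b) can be sanity-checked numerically. The sketch below is our own (the function det_An and the pivot-product method are not part of the problem statement): it computes D_n as the product of the pivots produced by the no-pivot elimination of Section 6.5, then tests the recurrence of part (b):

```python
def det_An(n):
    # det of the n x n tridiagonal matrix with 4 on the diagonal and 1 beside it,
    # computed as the product of the pivots from elimination without pivoting
    main = [4.0] * n
    for k in range(n - 1):
        main[k + 1] -= 1.0 / main[k]        # sub = sup = 1
    d = 1.0
    for m in main:
        d *= m
    return d

D = [det_An(n) for n in range(1, 7)]
print([round(x) for x in D])                # [4, 15, 56, 209, 780, 2911]
for n in range(len(D) - 2):
    assert abs(D[n + 2] - 4 * D[n + 1] + D[n]) < 1e-6
```

The pivots are all positive, which already hints at part (d): D_n > 0 for every n.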
6.17. Repeat Example 6.5, except use

    f(x) = 1/(1 + x^2),

and x ∈ [-5, 5] with nodes x_k = -5 + (5/2)k for k = 0, 1, 2, 3, 4 (i.e., n = 4, and h = 5/2). How do the results compare to the results in Problem 6.8 (assuming that you have done Problem 6.8)? Of course, you should use MATLAB to "do the dirty work." You may use built-in MATLAB linear system solvers.
6.18. Repeat Example 6.5, except use

    f(x) = √(1 + x).

Use MATLAB to aid in the task.
6.19. Write a MATLAB function to solve tridiagonal linear systems of equations
based on the theory for doing so given in Section 6.5. Test your algorithm
out on the linear systems given in Example 6.5.
7
Nonlinear Systems of Equations
7.1 INTRODUCTION
In this chapter we consider the problem of finding x to satisfy the equation

    f(x) = 0                                        (7.1)

for arbitrary f, but such that f(x) ∈ R. Usually we will restrict our discussion to the case where x ∈ R, but x ∈ C can also be of significant interest in practice (especially if f is a polynomial). Any x satisfying (7.1) is called a root of the equation. Such an x is also called a zero of the function f. More generally, we are also interested in solutions to systems of equations:

    f_0(x_0, x_1, ..., x_{n-1}) = 0,
    f_1(x_0, x_1, ..., x_{n-1}) = 0,
        ...
    f_{n-1}(x_0, x_1, ..., x_{n-1}) = 0.            (7.2)

Again, f_i(x_0, x_1, ..., x_{n-1}) ∈ R, and we will assume x_k ∈ R for all i, k ∈ Z_n. Various solution methods will be considered, such as the bisection method, the fixed-point method, and the Newton-Raphson method. All of these methods are iterative in that they generate a sequence of points (or more generally vectors) that converge (we hope) to the desired solution. Consequently, ideas from Chapter 3 are relevant here. The number of iterations needed to achieve a solution with a given accuracy is considered. Iterative procedures can break down (i.e., fail) in various ways. The sequence generated by a procedure may diverge, oscillate, or display chaotic behavior (which in the past was sometimes described as "wandering behavior"). Examples of breakdown phenomena will therefore also be considered. Some attention will be given to chaotic phenomena, as these are of growing engineering interest. Applications of chaos are now being considered for such areas as cryptography and spread-spectrum digital communications. These proposed applications are still controversial, but at the very least some knowledge of chaos gives a deeper insight into the behavior of nonlinear dynamic systems.
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
The equations considered in this chapter are nonlinear, in contrast with those of Chapter 4, which considered linear systems of equations only. The reader is probably aware that a linear system has either a unique solution, no solution, or an infinite number of solutions. Which case applies depends on the size and rank of the matrix in the linear system. Chapter 4 emphasized the handling of square and invertible matrices, for which the solution exists and is unique. In Chapter 4 we saw that well-defined procedures exist (e.g., Gaussian elimination) that give the solution in a finite number of steps.

The solution of nonlinear equations is significantly more complicated than the solution of linear systems. Existence and uniqueness problems typically have no easy answers. For example, e^{2x} + 1 = 0 has no real-valued solutions, but if we allow x ∈ C, then x = jkπ/2 for which k is an odd integer (since e^{2x} + 1 = e^{jπk} + 1 = cos(kπ) + j sin(kπ) + 1 = -1 + 1 = 0 as k is odd). On the other hand,

    e^{-t} - sin(t) = 0

also has an infinite number of solutions, but for t ∈ R (see Fig. 7.1). However, the solutions are not specifiable with a nice formula. In Fig. 7.1 the solutions correspond to the points where the two curves intersect each other.
Polynomial equations are of special interest.^1 For example, x^2 + 1 = 0 has only complex solutions x = ±j. Multiple roots are also possible. For example,

    x^3 - 3x^2 + 3x - 1 = (x - 1)^3 = 0

has a real root of multiplicity 3 at x = 1. We remark that the methods of this chapter are general and so (in principle) can be applied to find the solution of any nonlinear equation, polynomial or otherwise. But the special importance of polynomial equations has caused the development (over the centuries) of algorithms dedicated to polynomial equation solution. Thus, special algorithms exist to solve

    p_n(x) = Σ_{k=0}^{n} p_{n,k} x^k = 0,           (7.3)
^1 Why are polynomial equations of special interest? There are numerous answers to this. But the reader already knows one reason why from basic electric circuits. For example, an unforced RLC (resistance-inductance-capacitance) circuit has a response due to energy initially stored in the energy storage elements (the inductors and capacitors). If x(t) is the voltage drop across an element or the current through an element, then

    a_n d^n x(t)/dt^n + a_{n-1} d^{n-1} x(t)/dt^{n-1} + ... + a_1 dx(t)/dt + a_0 x(t) = 0.

The coefficients a_k depend on the circuit elements. The solution to the differential equation depends on the roots of the characteristic equation:

    a_n λ^n + a_{n-1} λ^{n-1} + ... + a_1 λ + a_0 = 0.
Figure 7.1  Plot of the individual terms in the equation f(t) = e^{-t} - sin(t) = 0. The infinitely many solutions correspond to the points where the plotted curves intersect.
and these algorithms do not apply to general nonlinear equations. Sometimes these are based on the general methods we consider in this chapter. Other times completely different methods are employed (e.g., replacing the problem of finding polynomial zeros by the equivalent problem of finding matrix eigenvalues, as suggested in Jenkins and Traub [22]). However, we do not consider these special polynomial equation solvers here. We only mention that some interesting references on this matter are Wilkinson [1] and Cohen [2]. These describe the concept of ill-conditioned polynomials, and how to apply deflation procedures to produce more accurate estimates of roots. The difficulties posed by multiple roots are considered in Hull and Mathon [3]. Modern math-oriented software tools (e.g., MATLAB) often take advantage of theories such as those described in Refs. 1-3 (as well as in other sources). In MATLAB, polynomial zeros may be found using the roots and mroots functions. Function mroots is a modern root finder that reliably determines multiple roots.
7.2 BISECTION METHOD
The bisection method is a simple, intuitive approach to solving

    f(x) = 0.                                       (7.4)

It is assumed that f(x) ∈ R, and that x ∈ R. This method is based on the following theorem.

Theorem 7.1: Intermediate Value Theorem  If f: [a, b] → R is continuous on the closed, bounded interval [a, b], and y_0 ∈ R is such that f(a) <= y_0 <= f(b), then there is an x_0 ∈ [a, b] such that f(x_0) = y_0. In other words, a continuous function on a closed and bounded interval takes on all values between f(a) and f(b) at least once.
The bisection method works as follows. Suppose that we have an initial interval [a_0, b_0] such that

    f(a_0) f(b_0) < 0,                              (7.5)

which means that f(a_0) and f(b_0) have opposite signs (i.e., one is positive while the other is negative). By Theorem 7.1 there must be a p ∈ (a_0, b_0) so that f(p) = 0. We say that [a_0, b_0] brackets the root p. Suppose that

    p_0 = (1/2)(a_0 + b_0).                         (7.6)

This is the midpoint of the interval [a_0, b_0]. Consider the following cases:

1. If f(p_0) = 0, then p = p_0 and we have found a root. We may stop at this point.
2. If f(a_0) f(p_0) < 0, then it must be the case that p ∈ [a_0, p_0], so we define the new interval [a_1, b_1] = [a_0, p_0]. This new interval brackets the root.
3. If f(p_0) f(b_0) < 0, then it must be the case that p ∈ [p_0, b_0], so we define the new interval [a_1, b_1] = [p_0, b_0]. This new interval brackets the root.

The process is repeated by considering the midpoint of the new interval, which is

    p_1 = (1/2)(a_1 + b_1),                         (7.7)

and considering the three cases again. In principle, the process terminates when case 1 is encountered. In practice, case 1 is unlikely, in part because of the effects of rounding errors, and so we need a more practical criterion to stop the process. This will be considered a little later on. For now, pseudocode describing the basic algorithm may be stated as follows:
    input [a_0, b_0] which brackets the root p;
    p_0 := (a_0 + b_0)/2;
    k := 0;
    while stopping criterion is not met do begin
        if f(a_k) f(p_k) < 0 then begin
            a_{k+1} := a_k;
            b_{k+1} := p_k;
        end;
        else begin
            a_{k+1} := p_k;
            b_{k+1} := b_k;
        end;
        p_{k+1} := (a_{k+1} + b_{k+1})/2;
        k := k + 1;
    end;
When the algorithm terminates, the last value of p_{k+1} computed is an estimate of p.
We see that the bisection algorithm constructs a sequence (p_n) = (p_0, p_1, p_2, ...) such that

    lim_{n→∞} p_n = p,                              (7.8)

where p_n is the midpoint of [a_n, b_n], and f(p) = 0. Formal proof that this process works (i.e., yields a unique p such that f(p) = 0) is due to the following theorem.

Theorem 7.2: Cantor's Intersection Theorem  Suppose that ([a_k, b_k]) is a sequence of closed and bounded intervals such that

    [a_0, b_0] ⊃ [a_1, b_1] ⊃ ... ⊃ [a_n, b_n] ⊃ ...,

with lim_{n→∞} (b_n - a_n) = 0. There is a unique point p ∈ [a_n, b_n] for all n ∈ Z^+:

    ∩_{n=0}^{∞} [a_n, b_n] = {p}.

The bisection method produces (p_n) such that a_n <= p_n <= b_n, and p ∈ [a_n, b_n] for all n ∈ Z^+. Consequently, since p_n = (1/2)(a_n + b_n),

    |p_n - p| <= |b_n - a_n| = (b - a)/2^n          (7.9)
for n ∈ Z^+, so that lim_{n→∞} p_n = p. Recall that f is assumed continuous on [a, b], so lim_{n→∞} f(p_n) = f(p). So now observe that

    |p_n - a_n| <= (1/2^{n+1}) |b - a|,   |b_n - p_n| <= (1/2^{n+1}) |b - a|,   (7.10)

and, since p_n is the midpoint of [a_n, b_n] and p ∈ [a_n, b_n], also |p - p_n| <= (1/2^{n+1}) |b - a|. So, via |x - y| = |(x - z) - (y - z)| <= |x - z| + |y - z| (triangle inequality), we have

    |p - a_n| <= |p - p_n| + |p_n - a_n| <= (1/2^{n+1})(b - a) + (1/2^{n+1})(b - a) = (1/2^n)(b - a),   (7.11a)

and similarly

    |p - b_n| <= |p - p_n| + |p_n - b_n| <= (1/2^n)(b - a).   (7.11b)

Thus

    lim_{n→∞} a_n = lim_{n→∞} b_n = p.              (7.12)

At each step the root p is bracketed. This implies that there is a subsequence [of (p_n)] denoted (x_n) converging to p so that f(x_n) >= 0 for all n ∈ Z^+. Similarly, there is a subsequence (y_n) converging to p such that f(y_n) <= 0 for all n ∈ Z^+. Thus

    f(p) = lim_{n→∞} f(x_n) >= 0,   f(p) = lim_{n→∞} f(y_n) <= 0,
which implies that f(p) = 0. We must conclude that the bisection method produces a sequence (p_n) converging to p such that f(p) = 0. Hence, the bisection method always works.
Example 7.1  We want to find 5^{1/3} (the cube root of 5). This is equivalent to solving the equation f(x) = x^3 - 5 = 0. We note that f(1) = -4 and f(2) = 3, so we may use [a_0, b_0] = [a, b] = [1, 2] to initially bracket the root. We remark that the "exact" value is 5^{1/3} = 1.709976 (seven significant figures). Consider the following iterations of the bisection method:

    [a_0, b_0] = [1, 2],        p_0 = 1.500000,   f(p_0) = -1.625000
    [a_1, b_1] = [p_0, b_0],    p_1 = 1.750000,   f(p_1) =  0.359375
    [a_2, b_2] = [a_1, p_1],    p_2 = 1.625000,   f(p_2) = -0.708984
    [a_3, b_3] = [p_2, b_2],    p_3 = 1.687500,   f(p_3) = -0.194580
    [a_4, b_4] = [p_3, b_3],    p_4 = 1.718750,   f(p_4) =  0.077362
    [a_5, b_5] = [a_4, p_4],    p_5 = 1.703125,   f(p_5) = -0.059856

We see that |p_5 - p| = |1.703125 - 1.709976| = 0.006851. From (7.9),

    |p_5 - p| <= (b - a)/2^5 = (2 - 1)/2^5 = 1/32 = 0.0312500.

The exact error certainly agrees with this bound.
When should we stop iterating? In other words, what stopping criterion should be chosen? Some possibilities are (for ε > 0)

    (1/2)(b_n - a_n) < ε,                           (7.13a)
    |p_n - p| < ε,                                  (7.13b)
    |f(p_n)| < ε,                                   (7.13c)
    |p_n - p_{n-1}|/|p_n| < ε   (p_n ≠ 0).          (7.13d)

We would stop iterating when the inequalities are satisfied. Usually (7.13d) is recommended. Condition (7.13a) is not so good, as termination depends on the size of the nth interval, while it is the accuracy of the estimate of p that is of most interest. Condition (7.13b) requires knowing p in advance, which is not reasonable since it is p that we are trying to determine. Condition (7.13c) is based on f(p_n), and again we are more interested in how well p_n approximates p. Thus, we are left with (7.13d). This condition leads to termination when p_n is relatively not much different from p_{n-1}.
How may we characterize the computational efficiency of an iterative algorithm?
In Chapter 4 the algorithms terminated in a finite number of steps (with the excep-
tion of the iterative procedures suggested in Section 4.7 which do not terminate
unless a stopping condition is imposed), so flop counting was a reasonable mea-
sure. Where iterative procedures such as the bisection method are concerned, we
prefer
Definition 7.1: Rate of Convergence Suppose that $(x_n)$ converges to $0$: $\lim_{n \to \infty} x_n = 0$. Suppose that $(p_n)$ converges to $p$, i.e., $\lim_{n \to \infty} p_n = p$. If there is a $K \in \mathbf{R}$ with $K > 0$, and an $N \in \mathbf{Z}^+$, such that
$$|p_n - p| \le K |x_n|$$
for all $n \ge N$, then we say that $(p_n)$ converges to $p$ with rate of convergence $O(x_n)$.
This is an alternative use of the "big O" notation that was first seen in Chapter 4. Recalling (7.9), we obtain
$$|p_n - p| \le (b - a)\frac{1}{2^n}, \qquad (7.14)$$
so that $K = b - a$, $x_n = 1/2^n$, and $N = 0$. Thus, the bisection method generates a sequence $(p_n)$ that converges to $p$ (with $f(p) = 0$) at the rate $O(1/2^n)$.

From (7.14), if we want $|p_n - p| < \epsilon$, then we may choose $n$ so that
$$|p_n - p| \le \frac{b - a}{2^n} < \epsilon,$$
implying $2^n > (b - a)/\epsilon$, or we may choose
$$n = \left\lceil \log_2\left( \frac{b - a}{\epsilon} \right) \right\rceil, \qquad (7.15)$$
where $\lceil x \rceil$ is the smallest integer greater than or equal to $x$. This can be used as an alternative means to terminate the bisection algorithm. But the conservative nature of (7.15) suggests that the algorithm that employs it may compute more iterations than are really necessary for the desired accuracy.
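The procedure above translates directly into code. The following is a minimal Python sketch (our own illustration, not code from the book): it combines the conservative iteration bound (7.15) with the relative stopping criterion (7.13d); the function name `bisect` and the default tolerance are our choices.

```python
import math

def bisect(f, a, b, eps=1e-6):
    """Bisection method; requires f(a)*f(b) < 0. See (7.13d) and (7.15)."""
    if f(a) * f(b) > 0:
        raise ValueError("root is not bracketed by [a, b]")
    # Conservative bound (7.15) on the number of iterations needed.
    n_max = math.ceil(math.log2((b - a) / eps))
    p_old = a
    for _ in range(n_max):
        p = 0.5 * (a + b)            # midpoint of the current bracket
        if f(a) * f(p) <= 0:
            b = p                    # root lies in [a, p]
        else:
            a = p                    # root lies in [p, b]
        if p != 0 and abs((p - p_old) / p) < eps:
            break                    # relative criterion (7.13d)
        p_old = p
    return p

# Example 7.1: the cube root of 5 via f(x) = x^3 - 5 on [1, 2].
root = bisect(lambda x: x**3 - 5.0, 1.0, 2.0)
```

On Example 7.1's data this agrees with the "exact" value $5^{1/3} = 1.709976$ to roughly six figures.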
7.3 FIXED-POINT METHOD
Here we consider the Banach fixed-point theorem [4] as the theoretical basis for a nonlinear equation solver. Suppose that $X$ is a set, and $T: X \to X$ is a mapping of $X$ into itself. A fixed point of $T$ is an $x \in X$ such that
$$Tx = x. \qquad (7.16)$$
For example, suppose $X = [0, 1] \subset \mathbf{R}$, and
$$Tx = \tfrac{1}{2}x(1 - x). \qquad (7.17)$$
We certainly have $Tx \in [0, 1]$ for any $x \in X$. The solution to
$$x = \tfrac{1}{2}x(1 - x)$$
(i.e., to $Tx = x$) is $x = 0$. So $T$ has fixed point $x = 0$. (We reject the solution $x = -1$ since $-1 \notin [0, 1]$.)
Definition 7.2: Contraction Let $X = (X, d)$ be a metric space. Mapping $T: X \to X$ is called a contraction (or a contraction mapping, or a contractive mapping) on $X$ if there is an $\alpha \in \mathbf{R}$ such that $0 \le \alpha < 1$, and for all $x, y \in X$
$$d(Tx, Ty) \le \alpha\, d(x, y). \qquad (7.18)$$
Applying $T$ to "points" $x$ and $y$ brings them closer together if $T$ is a contractive mapping. If $\alpha = 1$, the mapping is sometimes called nonexpansive.
Theorem 7.3: Banach Fixed-Point Theorem Consider the metric space $X = (X, d)$, where $X \ne \emptyset$. Suppose that $X$ is complete, and $T: X \to X$ is a contraction on $X$. Then $T$ has a unique fixed point.

Proof We must construct $(x_n)$, and show that it is Cauchy so that $(x_n)$ converges in $X$ (recall Section 3.2). Then we prove that the limit $x$ is the only fixed point of $T$. Suppose that $x_0 \in X$ is any point from $X$. Consider the sequence produced by the repeated application of $T$ to $x_0$:
$$x_0, \; x_1 = Tx_0, \; x_2 = Tx_1 = T^2 x_0, \; \ldots, \; x_n = Tx_{n-1} = T^n x_0, \; \ldots. \qquad (7.19)$$
From (7.19) and (7.18), we obtain
$$d(x_{k+1}, x_k) = d(Tx_k, Tx_{k-1}) \le \alpha\, d(x_k, x_{k-1}) = \alpha\, d(Tx_{k-1}, Tx_{k-2}) \le \alpha^2 d(x_{k-1}, x_{k-2}) \le \cdots \le \alpha^k d(x_1, x_0). \qquad (7.20)$$
Using the triangle inequality with $n > k$, from (7.20)
$$d(x_k, x_n) \le d(x_k, x_{k+1}) + d(x_{k+1}, x_{k+2}) + \cdots + d(x_{n-1}, x_n) \le (\alpha^k + \alpha^{k+1} + \cdots + \alpha^{n-1})\, d(x_0, x_1) = \alpha^k \frac{1 - \alpha^{n-k}}{1 - \alpha}\, d(x_0, x_1)$$
(where the last equality follows by application of the formula for geometric series, seen very frequently in previous chapters). Since $0 \le \alpha < 1$, we have $1 - \alpha^{n-k} < 1$. Thus
$$d(x_k, x_n) \le \frac{\alpha^k}{1 - \alpha}\, d(x_0, x_1) \qquad (7.21)$$
($n > k$). Since $0 \le \alpha < 1$, $0 < 1 - \alpha \le 1$, too. In addition, $d(x_0, x_1)$ is fixed (since we have chosen $x_0$). We may make the right-hand side of (7.21) arbitrarily small by making $k$ sufficiently big (keeping $n > k$). Consequently, $(x_n)$ is Cauchy. $X$ is assumed to be complete, so $x_n \to x \in X$.
From the triangle inequality and (7.18),
$$d(Tx, x) \le d(x, x_k) + d(x_k, Tx) \le d(x, x_k) + \alpha\, d(x_{k-1}, x) \quad (\text{recall that } x_k = Tx_{k-1})$$
$$< \epsilon \quad (\text{any } \epsilon > 0)$$
as $k \to \infty$, since $x_n \to x$. Consequently, $d(x, Tx) = 0$, which implies that $Tx = x$ [recall (M1) in Definition 1.1]. Immediately, $x$ is a fixed point of $T$.

Point $x$ is unique. Let us assume that there is another fixed point $\tilde{x}$, i.e., $T\tilde{x} = \tilde{x}$. But from (7.18)
$$d(x, \tilde{x}) = d(Tx, T\tilde{x}) \le \alpha\, d(x, \tilde{x}),$$
implying $d(x, \tilde{x}) = 0$ because $0 \le \alpha < 1$. Thus $x = \tilde{x}$ [(M1) from Definition 1.1 again].
Theorem 7.3 is also called the contraction theorem, a special instance of which
was seen in Section 4.7. The theorem applies to complete metric spaces, and so is
applicable to Banach and Hilbert spaces. (Recall that inner products induce norms,
and norms induce metrics. Hilbert and Banach spaces are complete. So they must
be complete metric spaces as well.)
Note that we define $T^0 x = x$ for any $x \in X$. Thus $T^0$ is the identity mapping (identity operator).
Corollary 7.1 Under the conditions of Theorem 7.3, sequence $(x_n)$ from $x_n = T^n x_0$ [i.e., the sequence in (7.19)] for any $x_0 \in X$ converges to the unique $x \in X$ such that $Tx = x$. We have the following error estimates (i.e., bounds):

1. Prior estimate:
$$d(x_k, x) \le \frac{\alpha^k}{1 - \alpha}\, d(x_0, x_1). \qquad (7.22\mathrm{a})$$

2. Posterior estimate:
$$d(x_k, x) \le \frac{\alpha}{1 - \alpha}\, d(x_k, x_{k-1}). \qquad (7.22\mathrm{b})$$

Proof The prior estimate is an immediate consequence of Theorem 7.3 since it lies within the proof of this theorem [let $n \to \infty$ in (7.21)].

Now consider (7.22b). In (7.22a) let $k = 1$, $y_0 = x_0$, and $y_1 = x_1$. Thus, $d(x_1, x) = d(y_1, x)$, $d(x_0, x_1) = d(y_0, y_1)$, and so
$$d(y_1, x) \le \frac{\alpha}{1 - \alpha}\, d(y_0, y_1). \qquad (7.23)$$
Let $y_0 = x_{k-1}$, so $y_1 = Ty_0 = Tx_{k-1} = x_k$, and (7.23) becomes
$$d(x_k, x) \le \frac{\alpha}{1 - \alpha}\, d(x_k, x_{k-1}),$$
which is (7.22b).
The error bounds in Corollary 7.1 are useful in application of the contraction the-
orem to computational problems. For example, (7.22a) can estimate the number of
iteration steps needed to achieve a given accuracy in the estimate of the solution to
a nonlinear equation. The result would be analogous to Eq. (7.15) in the previous
section.
A difficulty with the theory so far is that $T: X \to X$ is not always contractive over the entire metric space $X$, but only on a subset, say, $Y \subset X$. However, a basic result from functional analysis is that any closed subset of a complete metric space is complete. Thus, for any closed $Y \subset X$ on which $T$ is contractive with $T: Y \to Y$, there is a fixed point $x \in Y \subset X$, and $x_n \to x$ ($x_n = T^n x_0$ for suitable $x_0 \in Y$). For this idea to work, we must choose $x_0 \in Y$ so that $x_n \in Y$ for all $n \in \mathbf{Z}^+$.

What does "closed subset" (formally) mean? A neighborhood of $x \in X$ (metric space $X$) is the set $N_\epsilon(x) = \{y \mid d(x, y) < \epsilon\} \subset X$, where $\epsilon > 0$. Parameter $\epsilon$ is the radius of the neighborhood. We say that $x$ is a limit point of $Y \subset X$ if every neighborhood of $x$ contains $y \ne x$ such that $y \in Y$. $Y$ is closed if every limit point of $Y$ belongs to $Y$. These definitions are taken from Rudin [5, p. 32].
Example 7.2 Suppose that $X = \mathbf{R}$, $Y = (0, 1) \subset X$. We recall that $X$ is a complete metric space if $d(x, y) = |x - y|$ ($x, y \in X$). Is $Y$ closed? Limit points of $Y$ include $x = 0$ and $x = 1$. [Any $x \in (0, 1)$ is also a limit point of $Y$.] But $0 \notin Y$, and $1 \notin Y$. Therefore, $Y$ is not closed. On the other hand, $Y = [0, 1]$ is a closed subset of $X$ since the limit points are now all in $Y$.
Fixed-point theorems (and related ideas) such as we have considered so far have a wide range of applications. They can be used to prove the existence and uniqueness
of solutions to both integral and differential equations [4, pp. 314-326]. They can
also provide (sometimes) computational algorithms for their solution. Fixed-point
results can have applications in signal reconstruction and image processing [6],
digital filter design [7], the interpolation of bandlimited sequences [8], and the
solution to so-called convex feasibility problems in general [9]. However, we will
consider the application of fixed-point theorems only to the problem of solving
nonlinear equations.
If we wish to find $p \in \mathbf{R}$ so that
$$f(p) = 0, \qquad (7.24)$$
then we may define $g(x)$ ($g: \mathbf{R} \to \mathbf{R}$) with fixed point $p$ such that
$$g(x) = x - f(x), \qquad (7.25)$$
and we then see $g(p) = p - f(p) = p \Rightarrow f(p) = 0$. Conversely, if there is a function $g(x)$ such that
$$g(p) = p, \qquad (7.26)$$
then
$$f(x) = x - g(x) \qquad (7.27)$$
will have a zero at $x = p$. So, if we wish to solve $f(x) = 0$, one approach would be to find a suitable $g(x)$ as in (7.27) [or (7.25)], and find fixed points for it. Theorem 7.3 informs us about the existence and uniqueness of fixed points for mappings on complete metric spaces (and ultimately on closed subsets of such spaces). Furthermore, the theorem leads us to a well-defined computational algorithm to find fixed points. At the outset, space $X = \mathbf{R}$ with metric $d(x, y) = |x - y|$ is complete, and for us $g$ and $f$ are mappings on $X$. So if $g$ and $f$ are related according to (7.27), then any fixed point of $g$ will be a zero of $f$, and can be found by iterating as spelled out in Theorem 7.3. However, the discussion following Corollary 7.1 warned us that $g$ may not be contractive on $X = \mathbf{R}$, but only on some subset $Y \subset X$. In fact, $g$ is only rarely contractive on all of $\mathbf{R}$. We therefore usually need to find $Y = [a, b] \subset X = \mathbf{R}$ such that $g$ is contractive on $Y$, and then we compute $x_n = g^n x_0 \in Y$ for $n \in \mathbf{Z}^+$. Then $x_n \to x$, and $f(x) = 0$. We again emphasize that $x_n \in Y$ is necessary for all $n$. If $g$ is contractive on $[a, b] \subset X$, then, from Definition 7.2, this means that for any $x, y \in [a, b]$
$$|g(x) - g(y)| \le \alpha |x - y| \qquad (7.28)$$
for some real-valued $\alpha$ such that $0 \le \alpha < 1$.
Example 7.3 Suppose that we want roots of $f(x) = \lambda x^2 + (1 - \lambda)x = 0$ (assume $\lambda > 0$). Of course, this is a quadratic equation and so may be easily solved by the usual formula for the roots of such an equation (recall Section 4.2). However, this example is an excellent illustration of the behavior of fixed-point schemes. It is quite easy to verify that
$$g(x) = x - f(x) = \lambda x(1 - x). \qquad (7.29)$$
We observe that $g(x)$ is a quadratic in $x$, and also
$$g^{(1)}(x) = \lambda(1 - 2x),$$
so that $g(x)$ has a maximum at $x = \frac{1}{2}$, for which $g(\frac{1}{2}) = \lambda/4$. Therefore, if we allow only $0 < \lambda \le 4$, then $g(x) \in [0, 1]$ for all $x \in [0, 1]$. Certainly $[0, 1] \subset \mathbf{R}$. A sketch of $g(x)$ and of $y = x$ appears in Fig. 7.2 for various $\lambda$. The intersection of these two curves locates the fixed points of $g$ on $[0, 1]$.
We will suppose $Y = [0, 1]$. Although $g(x) \in [0, 1]$ for all $x \in [0, 1]$ under the stated conditions, $g$ is not necessarily always contractive on the closed interval $[0, 1]$. Suppose $\lambda = \frac{1}{2}$; then, for all $x \in [0, 1]$, $g(x) \in [0, \frac{1}{8}]$ (see the dotted line in Fig. 7.2, and we can calculate this from knowledge of $g$). The mapping is contractive for this case on all of $[0, 1]$. [This is justified below with respect to Eq. (7.30).] Also, $g(x) = x$ only for $x = 0$. If we select any $x_0 \in [0, 1]$, then, for $x_n = g^n x_0$, we can expect $x_n \to 0$. For example, suppose $x_0 = 0.7500$; then the first few iterates are
$$x_0 = 0.7500,$$
$$x_1 = 0.5 x_0 (1 - x_0) = 0.0938,$$
$$x_2 = 0.5 x_1 (1 - x_1) = 0.0425,$$
$$x_3 = 0.5 x_2 (1 - x_2) = 0.0203,$$
$$x_4 = 0.5 x_3 (1 - x_3) = 0.0100,$$
$$x_5 = 0.5 x_4 (1 - x_4) = 0.0049.$$
The process is converging to the unique fixed point at $x = 0$.

[Figure 7.2 Plot of $g(x) = \lambda x(1 - x)$ for $\lambda = 0.5, 2.0, 2.8, 4.0$, and a plot of $y = x$. The places where $y = x$ and $g(x)$ intersect define the fixed points of $g(x)$.]
Suppose now that $\lambda = 2$; then $g(x) = 2x(1 - x)$, and now $g(x)$ has two fixed points on $[0, 1]$, which are $x = 0$ and $x = \frac{1}{2}$. For all $x \in [0, 1]$ we have $g(x) \in [0, \frac{1}{2}]$, but $g$ is not contractive on $[0, 1]$. For example, suppose $x = 0.8$, $y = 0.9$; then $g(0.8) = 0.3200$ and $g(0.9) = 0.1800$. Thus, $|x - y| = 0.1$, but $|g(x) - g(y)| = 0.14 > 0.1$. On the other hand, suppose that $x_0 = 0.7500$; then the first few iterates are
$$x_0 = 0.7500,$$
$$x_1 = 2 x_0 (1 - x_0) = 0.3750,$$
$$x_2 = 2 x_1 (1 - x_1) = 0.4688,$$
$$x_3 = 2 x_2 (1 - x_2) = 0.4980,$$
$$x_4 = 2 x_3 (1 - x_3) = 0.5000,$$
$$x_5 = 2 x_4 (1 - x_4) = 0.5000.$$
This process converges to one of the fixed points of $g$ even though $g$ is not contractive on $[0, 1]$.
From (7.28)
$$\alpha \ge \left| \frac{g(x) - g(y)}{x - y} \right| \quad (x \ne y),$$
from which, if we substitute $g(x) = \lambda x(1 - x)$ and $g(y) = \lambda y(1 - y)$, then
$$\alpha = \sup \lambda |1 - x - y|. \qquad (7.30)$$
If $\lambda = \frac{1}{2}$, then $\alpha = \frac{1}{2}$. If $\lambda = 2$, we cannot have an $\alpha$ such that $0 \le \alpha < 1$. If $g(x) = 2x(1 - x)$ (i.e., $\lambda = 2$ again), but now instead $Y = [0.4, 0.6]$, then, for all $x \in Y = [0.4, 0.6]$, we have $g(x) \in [0.48, 0.50] \subset Y$, implying $g: Y \to Y$. From (7.30) we have for this situation $\alpha = 0.4$. The mapping $g$ is contractive on $Y$.
Suppose $x_0 = 0.450000$; then
$$x_0 = 0.450000,$$
$$x_1 = 2 x_0 (1 - x_0) = 0.495000,$$
$$x_2 = 2 x_1 (1 - x_1) = 0.499950,$$
$$x_3 = 2 x_2 (1 - x_2) = 0.500000,$$
$$x_4 = 2 x_3 (1 - x_3) = 0.500000.$$
The reader ought to compare the results of this example to the error bounds from
Corollary 7.1 as an exercise.
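The hand computation above is easily automated. A minimal sketch of scalar fixed-point iteration (our code, not the book's; the absolute stopping test and iteration cap are our choices):

```python
def fixed_point(g, x0, eps=1e-6, n_max=100):
    """Iterate x_{n+1} = g(x_n); stop when successive iterates are close."""
    x = x0
    for _ in range(n_max):
        x_new = g(x)
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    return x  # caller should check for non-convergence

# Example 7.3 with lambda = 2, starting in Y = [0.4, 0.6], where g was
# shown to be contractive: the iterates approach the fixed point 1/2.
x = fixed_point(lambda t: 2.0 * t * (1.0 - t), 0.45)
```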
More generally, and again from (7.28), we have (for $x, y \in Y = [a, b] \subset \mathbf{R}$)
$$\alpha = \sup_{x \ne y} \left| \frac{g(x) - g(y)}{x - y} \right|. \qquad (7.31)$$
Now recall the mean-value theorem (i.e., Theorem 3.3). If $g(x)$ is continuous on $[a, b]$, and $g^{(1)}(x)$ is continuous on $(a, b)$, then there is a $\xi \in (a, b)$ such that
$$g^{(1)}(\xi) = \frac{g(b) - g(a)}{b - a}.$$
Consequently, instead of (7.31) we may use
$$\alpha = \sup_{x \in (a, b)} |g^{(1)}(x)|. \qquad (7.32)$$
Example 7.4 We may use (7.32) to rework some of the results in Example 7.3. Since $g^{(1)}(x) = \lambda(1 - 2x)$, if $Y = [0, 1]$ and $\lambda = \frac{1}{2}$, then
$$\alpha = \sup_{x \in (0, 1)} \tfrac{1}{2}|1 - 2x| = \tfrac{1}{2}.$$
If $\lambda = 2$, then $\alpha = 2$. If now $Y = [0.4, 0.6]$ with $\lambda = 2$, then
$$\alpha = \sup_{x \in (0.4, 0.6)} |4x - 2| = 0.4.$$
Now suppose that $\lambda = 2.8$, and consider $Y = [0.61, 0.67]$, which contains a fixed point of $g$ (see Fig. 7.2, which contains a curve for this case). We have
$$\alpha = \sup_{x \in (0.61, 0.67)} |5.6x - 2.8| = 0.9520,$$
and if $x \in Y$, then $g(x) \in [0.619080, 0.666120] \subset Y$, so that $g: Y \to Y$, and so $g$ is contractive on $Y$. Thus, we consider the iterates
$$x_0 = 0.650000,$$
$$x_1 = 2.8 x_0 (1 - x_0) = 0.637000,$$
$$x_2 = 2.8 x_1 (1 - x_1) = 0.647447,$$
$$x_3 = 2.8 x_2 (1 - x_2) = 0.639126,$$
$$x_4 = 2.8 x_3 (1 - x_3) = 0.645803,$$
$$x_5 = 2.8 x_4 (1 - x_4) = 0.640476.$$
The true fixed point (to six significant figures) is $x = 0.642857$ (i.e., $x_n \to x$). We may check these numbers against the bounds of Corollary 7.1. Therefore, we consider the distances
$$d(x_5, x) = 0.002381, \quad d(x_0, x_1) = 0.013000, \quad d(x_5, x_4) = 0.005327,$$
and from (7.22a)
$$d(x_5, x) \le \frac{\alpha^5}{1 - \alpha}\, d(x_0, x_1) = 0.211780,$$
and from (7.22b)
$$d(x_5, x) \le \frac{\alpha}{1 - \alpha}\, d(x_5, x_4) = 0.105652.$$
These error bounds are very loose, but they are nevertheless consistent with the true error $d(x_5, x) = 0.002381$.
We have worked with $g(x) = \lambda x(1 - x)$ in the previous two examples (Examples 7.3 and 7.4). But this is not the only possible choice. It may be better to make other choices.
Example 7.5 Again, assume $f(x) = \lambda x^2 + (1 - \lambda)x = 0$ as in the previous two examples. Observe that
$$\lambda x^2 = (\lambda - 1)x$$
implies that
$$x = \left( \frac{\lambda - 1}{\lambda} \right)^{1/2} x^{1/2} = g(x). \qquad (7.33)$$
If $\lambda = 4$, then $f(x) = 0$ for $x = 0$ and for $x = \frac{3}{4}$. For $g(x)$ in (7.29), $g^{(1)}(x) = 4(1 - 2x)$, and $|g^{(1)}(\frac{3}{4})| = 2$. We cannot find a closed interval $Y$ containing $x = \frac{3}{4}$ on which $g$ is contractive with $g: Y \to Y$ (the slope of the curve $g(x)$ is too steep in the vicinity of the fixed point). But if we choose (7.33) instead, then
$$g^{(1)}(x) = \frac{\sqrt{3}}{4} \frac{1}{x^{1/2}},$$
and if $Y = [0.7, 0.8]$, then for $x \in Y$ we have $g(x) \in [0.7246, 0.7746] \subset Y$, and
$$\alpha = \sup_{x \in (0.7, 0.8)} \frac{\sqrt{3}}{4} \frac{1}{\sqrt{x}} = 0.5175.$$
So $g$ in (7.33) is contractive on $Y$. We observe the iterates
$$x_0 = 0.7800,$$
$$x_1 = \frac{\sqrt{3}}{2} x_0^{1/2} = 0.7649,$$
$$x_2 = \frac{\sqrt{3}}{2} x_1^{1/2} = 0.7574,$$
$$x_3 = \frac{\sqrt{3}}{2} x_2^{1/2} = 0.7537,$$
$$x_4 = \frac{\sqrt{3}}{2} x_3^{1/2} = 0.7518,$$
$$x_5 = \frac{\sqrt{3}}{2} x_4^{1/2} = 0.7509,$$
which converge to $x = \frac{3}{4}$ (i.e., $x_n \to x$).
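The advantage of the rearranged iteration function can be checked numerically. In this sketch (our code, not the book's), `g1` is the logistic form (7.29) with $\lambda = 4$, which is not contractive near its fixed point $x = \frac{3}{4}$, while `g2` is the square-root form (7.33), which is contractive on $[0.7, 0.8]$:

```python
import math

def iterate(g, x0, n):
    """Apply x <- g(x) exactly n times."""
    x = x0
    for _ in range(n):
        x = g(x)
    return x

g1 = lambda x: 4.0 * x * (1.0 - x)                    # form (7.29), lambda = 4
g2 = lambda x: (math.sqrt(3.0) / 2.0) * math.sqrt(x)  # form (7.33), lambda = 4

x_good = iterate(g2, 0.78, 50)  # settles at the fixed point 3/4
x_bad = iterate(g1, 0.78, 50)   # stays in [0, 1] but never settles
```

Only the (7.33) form converges; the (7.29) iterates for $\lambda = 4$ wander over $[0, 1]$ indefinitely, a phenomenon taken up again in Section 7.6.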
7.4 NEWTON-RAPHSON METHOD
This method is yet another iterative approach to finding roots of nonlinear equations.
In fact, it is a version of the fixed-point method that was considered in the previous
section. However, it is of sufficient importance to warrant separate consideration
within its own section.
7.4.1 The Method
One way to derive the Newton-Raphson method is by a geometric approach. The method attempts to solve
$$f(x) = 0 \qquad (7.34)$$
[Figure 7.3 Geometric interpretation of the Newton-Raphson method: the line $y - f(p_n) = f^{(1)}(p_n)(x - p_n)$ is tangent to the curve at the point $(p_n, f(p_n))$.]
($x \in \mathbf{R}$, and $f(x) \in \mathbf{R}$) by approximating the root $p$ [i.e., $f(p) = 0$] by a succession of $x$ intercepts of tangent lines to the curve $y = f(x)$ at the current approximation to $p$. Specifically, if $p_n$ is the current estimate of $p$, then the tangent line to the point $(p_n, f(p_n))$ on the curve $y = f(x)$ is
$$y = f(p_n) + f^{(1)}(p_n)(x - p_n) \qquad (7.35)$$
(see Fig. 7.3). The next approximation to $p$ is $x = p_{n+1}$ such that $y = 0$; that is, from (7.35)
$$0 = f(p_n) + f^{(1)}(p_n)(p_{n+1} - p_n),$$
so for $n \in \mathbf{Z}^+$
$$p_{n+1} = p_n - \frac{f(p_n)}{f^{(1)}(p_n)}. \qquad (7.36)$$
To start this process off, we need an initial guess at $p$; that is, we need to select $p_0$. Clearly, this approach requires $f^{(1)}(p_n) \ne 0$ for all $n$; otherwise the process will terminate. Continuation will not be possible, except perhaps by choosing a new starting point $p_0$. This is one of the ways in which the method can break down. It might be called premature termination. (Breakdown phenomena are discussed further later in the text.)
Another derivation of (7.36) is as follows. Recall the theory of Taylor series from Section 3.5. Suppose that $f(x)$, $f^{(1)}(x)$, and $f^{(2)}(x)$ are all continuous on $[a, b]$. Suppose that $p$ is a root of $f(x) = 0$, and that $p_n$ approximates $p$. From Taylor's theorem [i.e., (3.71)]
$$f(x) = f(p_n) + f^{(1)}(p_n)(x - p_n) + \tfrac{1}{2} f^{(2)}(\xi)(x - p_n)^2, \qquad (7.37)$$
where $\xi = \xi(x) \in (p_n, x)$. Since $f(p) = 0$, from (7.37) we have
$$0 = f(p) = f(p_n) + f^{(1)}(p_n)(p - p_n) + \tfrac{1}{2} f^{(2)}(\xi)(p - p_n)^2. \qquad (7.38)$$
If $|p - p_n|$ is small, we may neglect the last term of (7.38), and hence
$$0 \approx f(p_n) + f^{(1)}(p_n)(p - p_n),$$
implying
$$p \approx p_n - \frac{f(p_n)}{f^{(1)}(p_n)}. \qquad (7.39)$$
We treat the right-hand side of (7.39) as $p_{n+1}$, the next approximation to $p$. Thus, we again arrive at (7.36). The assumption that $(p - p_n)^2$ is negligible is important. If $p_n$ is not close enough to $p$, then the method may not converge. In particular, the choice of starting point $p_0$ is important.
As already mentioned, the Newton-Raphson method is a special instance of the fixed-point iteration method, where
$$g(x) = x - \frac{f(x)}{f^{(1)}(x)}. \qquad (7.40)$$
So
$$p_{n+1} = g(p_n) \qquad (7.41)$$
for $n \in \mathbf{Z}^+$. Stopping criteria are (for $\epsilon > 0$)
$$|p_n - p_{n-1}| < \epsilon, \qquad (7.42\mathrm{a})$$
$$|f(p_n)| < \epsilon, \qquad (7.42\mathrm{b})$$
$$\left| \frac{p_n - p_{n-1}}{p_n} \right| < \epsilon, \quad p_n \ne 0. \qquad (7.42\mathrm{c})$$
As with the bisection method, we prefer (7.42c).
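A minimal implementation of (7.36) with the preferred stopping rule (7.42c) might look as follows (our sketch, not the book's code; note the explicit guard against $f^{(1)}(p_n) = 0$, the premature-termination breakdown mentioned earlier):

```python
def newton(f, df, p0, eps=1e-10, n_max=50):
    """Newton-Raphson iteration (7.36) with relative stopping rule (7.42c)."""
    p = p0
    for _ in range(n_max):
        d = df(p)
        if d == 0.0:
            raise ZeroDivisionError("f'(p_n) = 0: premature termination")
        p_new = p - f(p) / d
        if p_new != 0 and abs((p_new - p) / p_new) < eps:
            return p_new
        p = p_new
    return p

# Cube root of 5 again (compare Example 7.1): far fewer iterations
# than bisection are needed.
root = newton(lambda x: x**3 - 5.0, lambda x: 3.0 * x**2, 2.0)
```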
Theorem 7.4: Convergence Theorem for the Newton-Raphson Method Let $f$ be continuous on $[a, b]$. Let $f^{(1)}(x)$ and $f^{(2)}(x)$ exist and be continuous for all $x \in (a, b)$. If $p \in [a, b]$ with $f(p) = 0$, and $f^{(1)}(p) \ne 0$, then there is a $\delta > 0$ such that (7.36) generates a sequence $(p_n)$ with $p_n \to p$ for any $p_0 \in [p - \delta, p + \delta]$.

Proof We have
$$p_{n+1} = g(p_n), \quad n \in \mathbf{Z}^+,$$
with $g(x) = x - f(x)/f^{(1)}(x)$. We need $Y = [p - \delta, p + \delta] \subset \mathbf{R}$ with $g: Y \to Y$, and $g$ contractive on $Y$. (Then we may immediately apply the convergence results from the previous section.)
Since $f^{(1)}(p) \ne 0$, and since $f^{(1)}(x)$ is continuous at $p$, there will be a $\delta_1 > 0$ such that $f^{(1)}(x) \ne 0$ for all $x \in [p - \delta_1, p + \delta_1] \subset [a, b]$, so $g$ is defined and continuous for $x \in [p - \delta_1, p + \delta_1]$. Also,
$$g^{(1)}(x) = 1 - \frac{[f^{(1)}(x)]^2 - f(x) f^{(2)}(x)}{[f^{(1)}(x)]^2} = \frac{f(x) f^{(2)}(x)}{[f^{(1)}(x)]^2}$$
for $x \in [p - \delta_1, p + \delta_1]$. In addition, $f^{(2)}(x)$ is continuous on $(a, b)$, so $g^{(1)}(x)$ is continuous on $[p - \delta_1, p + \delta_1]$. We assume $f(p) = 0$, so
$$g^{(1)}(p) = \frac{f(p) f^{(2)}(p)}{[f^{(1)}(p)]^2} = 0.$$
We have $g^{(1)}(x)$ continuous at $x = p$, implying that $\lim_{x \to p} g^{(1)}(x) = g^{(1)}(p) = 0$, so there is a $\delta > 0$ with $0 < \delta < \delta_1$ such that
$$|g^{(1)}(x)| \le \alpha < 1 \qquad (7.43)$$
for $x \in [p - \delta, p + \delta] = Y$, for some $\alpha$ such that $0 \le \alpha < 1$. If $x \in Y$, then by the mean-value theorem there is a $\xi \in (x, p)$ such that
$$|g(x) - p| = |g(x) - g(p)| = |g^{(1)}(\xi)|\, |x - p| \le \alpha |x - p| < |x - p| \le \delta,$$
so that we have $g(x) \in Y$ for all $x \in Y$. That is, $g: Y \to Y$. Because of (7.43),
$$\alpha = \sup_{x \in Y} |g^{(1)}(x)|$$
satisfies $0 \le \alpha < 1$, so that $g$ is contractive on $Y$. Immediately, sequence $(p_n)$ from $p_{n+1} = g(p_n)$ for any $p_0 \in Y$ converges to $p$ (i.e., $p_n \to p$).
Essentially, the Newton-Raphson method is guaranteed to converge to a root if
po is close enough to it and / is sufficiently smooth. Theorem 7.4 is weak because
it does not specify how to select po.
The Newton-Raphson method, if it converges, tends to do so quite quickly. However, the method needs $f^{(1)}(p_n)$ as well as $f(p_n)$. It might be the case that $f^{(1)}(p_n)$ requires much effort to evaluate. The secant method is a variation on the Newton-Raphson method that replaces $f^{(1)}(p_n)$ with an approximation. Specifically, since
$$f^{(1)}(p_n) = \lim_{x \to p_n} \frac{f(x) - f(p_n)}{x - p_n},$$
if $p_{n-1} \approx p_n$, then
$$f^{(1)}(p_n) \approx \frac{f(p_{n-1}) - f(p_n)}{p_{n-1} - p_n}. \qquad (7.44)$$
This is the slope of the chord that connects the points $(p_{n-1}, f(p_{n-1}))$ and $(p_n, f(p_n))$ on the graph of $f(x)$. We may substitute (7.44) into (7.36), obtaining (for $n \in \mathbf{N}$)
$$p_{n+1} = p_n - \frac{(p_n - p_{n-1}) f(p_n)}{f(p_n) - f(p_{n-1})}. \qquad (7.45)$$
We need $p_0$ and $p_1$ to initialize the iteration process in (7.45). Usually these are chosen to bracket the root, but this does not guarantee convergence. If the method does converge, it tends to do so more slowly than the Newton-Raphson method, but this is the penalty to be paid for avoiding the computation of the derivatives $f^{(1)}(x)$. We remark that the method of false position is a modification of the secant method that is guaranteed to converge, because successive approximations to $p$ (i.e., $p_{n-1}$ and $p_n$) are chosen to always bracket the root. But it is possible that convergence may be slow. We do not cover this method in this book; we merely mention that it appears elsewhere in the literature [10-12].
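A corresponding sketch of the secant update (7.45), again our own illustrative code rather than the book's:

```python
def secant(f, p0, p1, eps=1e-12, n_max=100):
    """Secant method (7.45): Newton-like, but derivative-free."""
    f0, f1 = f(p0), f(p1)
    for _ in range(n_max):
        denom = f1 - f0
        if denom == 0.0:
            break  # chord is horizontal; cannot continue
        p2 = p1 - (p1 - p0) * f1 / denom
        if p2 != 0 and abs((p2 - p1) / p2) < eps:
            return p2
        p0, f0 = p1, f1
        p1, f1 = p2, f(p2)
    return p1

# p0 and p1 bracket the root of f(x) = x^3 - 5, as suggested above.
root = secant(lambda x: x**3 - 5.0, 1.0, 2.0)
```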
7.4.2 Rate of Convergence Analysis
What can be said about the speed of convergence of fixed-point schemes in gen-
eral, and of the Newton-Raphson method in particular? Consider the following
definition.
Definition 7.3: Suppose that $(p_n)$ is such that $p_n \to p$ with $p_n \ne p$ ($n \in \mathbf{Z}^+$). If there are $\lambda > 0$ and $\delta > 0$ such that
$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\delta} = \lambda,$$
then we say that $(p_n)$ converges to $p$ of order $\delta$ with asymptotic error constant $\lambda$. Additionally:

• If $\delta = 1$, we say that $(p_n)$ is linearly convergent.
• If $\delta > 1$, we have superlinear convergence.
• If $\delta = 2$, we have quadratic convergence.

From this definition, if $n$ is big enough, then
$$|p_{n+1} - p| \approx \lambda |p_n - p|^\delta. \qquad (7.46)$$
Thus, we would like $\delta$ to be large and $\lambda$ to be small for fast convergence. Since $\delta$ is in the exponent, this parameter is more important than $\lambda$ for determining the rate of convergence.
TLFeBOOK
310 NONLINEAR SYSTEMS OF EQUATIONS
Consider the fixed-point iteration
$$p_{n+1} = g(p_n),$$
where $g$ satisfies the requirements of a contraction mapping (so the Banach fixed-point theorem applies). We can therefore say that
$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} = \lim_{n \to \infty} \frac{|g(p_n) - g(p)|}{|p_n - p|}, \qquad (7.47)$$
so from the mean-value theorem we have $\xi_n$ between $p_n$ and $p$ for which
$$g(p_n) - g(p) = g^{(1)}(\xi_n)(p_n - p),$$
which, if used in (7.47), implies that
$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} = \lim_{n \to \infty} |g^{(1)}(\xi_n)|. \qquad (7.48)$$
Because $\xi_n$ is between $p_n$ and $p$, and $p_n \to p$, we must have $\xi_n \to p$ as well. Also, we will assume $g^{(1)}(x)$ is continuous at $p$, so (7.48) now becomes
$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} = |g^{(1)}(p)|, \qquad (7.49)$$
which is a constant. Applying Definition 7.3, we conclude that the fixed-point method is typically only linearly convergent because $\delta = 1$, and the asymptotic error constant is $\lambda = |g^{(1)}(p)|$, provided $g^{(1)}(p) \ne 0$. However, if $g^{(1)}(p) = 0$, we expect faster convergence. It turns out that this is often the case for the Newton-Raphson method, as will now be demonstrated.

The iterative scheme for the Newton-Raphson method is (7.36), which is a particular case of fixed-point iteration where now
$$g(x) = x - \frac{f(x)}{f^{(1)}(x)}, \qquad (7.50)$$
and for which
$$g^{(1)}(x) = \frac{f(x) f^{(2)}(x)}{[f^{(1)}(x)]^2}, \qquad (7.51)$$
so that $g^{(1)}(p) = 0$ because $f(p) = 0$. Thus, superlinear convergence is anticipated for this particular fixed-point scheme. Suppose that we have the Taylor expansion
$$g(x) = g(p) + g^{(1)}(p)(x - p) + \tfrac{1}{2} g^{(2)}(\xi)(x - p)^2,$$
for which $\xi$ is between $p$ and $x$. Since $g(p) = p$ and $g^{(1)}(p) = 0$, this becomes
$$g(x) = p + \tfrac{1}{2} g^{(2)}(\xi)(x - p)^2.$$
For $x = p_n$, this in turn becomes
$$p_{n+1} = g(p_n) = p + \tfrac{1}{2} g^{(2)}(\xi_n)(p - p_n)^2 \qquad (7.52)$$
for which $\xi_n$ lies between $p$ and $p_n$. Equation (7.52) can be rearranged as
$$p_{n+1} - p = \tfrac{1}{2} g^{(2)}(\xi_n)(p - p_n)^2,$$
and so
$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^2} = \tfrac{1}{2} |g^{(2)}(p)|, \qquad (7.53)$$
since $\xi_n \to p$, and we are also assuming that $g^{(2)}(x)$ is continuous at $p$. Immediately, $\delta = 2$, and the Newton-Raphson method is quadratically convergent. The asymptotic error constant is plainly equal to $\frac{1}{2}|g^{(2)}(p)|$, provided $g^{(2)}(p) \ne 0$. If $g^{(2)}(p) = 0$, then an even higher order of convergence may be expected. It is emphasized that these convergence results depend on $g(x)$ being smooth enough. Given this, since the convergence is at least quadratic, the number of accurate decimal digits in the approximation to a root approximately doubles at every iteration.
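The digit-doubling claim can be checked numerically. In the sketch below (our code; the test problem $f(x) = x^2 - 2$, with root $\sqrt{2}$, is our choice, not the book's), each Newton error is roughly proportional to the square of the previous one:

```python
import math

f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x

p = 1.5
errors = []
for _ in range(4):
    p = p - f(p) / df(p)                  # Newton step (7.36)
    errors.append(abs(p - math.sqrt(2.0)))
# errors shrink roughly as e, e^2, e^4, e^8, ...: the number of
# correct digits approximately doubles per iteration.
```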
7.4.3 Breakdown Phenomena
The Newton-Raphson method can fail to converge (i.e., break down) in various ways. Premature termination was mentioned earlier. An example appears in Fig. 7.4. This figure also illustrates divergence, which means that [if $f(p) = 0$]
$$\lim_{n \to \infty} |p_n - p| = \infty.$$
In other words, the sequence of iterates $(p_n)$ generated by the Newton-Raphson method moves progressively farther and farther away from the desired root $p$. Another failure mechanism, called oscillation, may be demonstrated as follows. Suppose that
$$f(x) = x^3 - 2x + 2, \qquad f^{(1)}(x) = 3x^2 - 2. \qquad (7.54)$$
Newton's method for this specific case is
$$p_{n+1} = p_n - \frac{p_n^3 - 2p_n + 2}{3p_n^2 - 2} = \frac{2p_n^3 - 2}{3p_n^2 - 2}. \qquad (7.55)$$
If we were to select $p_0 = 0$ as a starting point, then
$$p_0 = 0, \; p_1 = 1, \; p_2 = 0, \; p_3 = 1, \; p_4 = 0, \; p_5 = 1, \ldots.$$
[Figure 7.4 Illustration of breakdown phenomena in the Newton-Raphson method; here, $f(x) = x e^{-x}$ is plotted along with tangent lines at $x = 1$, $x = 2$, and $x = 4$. The figure shows premature termination if $p_n = x = 1$, since $f^{(1)}(1) = 0$. Also shown is divergence for the case where $p_0 > 1$. (See how the $x$ intercepts of the tangent lines for $x > 1$ become bigger as $x$ increases.)]
More succinctly, $p_n = \frac{1}{2}(1 - (-1)^n)$. The sequence of iterates is oscillating; it does not diverge, and it does not converge, either. The sequence is periodic (it may be said to have period 2), and in this case quite simple, but far more complicated oscillations are possible, with longer periods.

Premature termination, divergence, and oscillation are not the only possible breakdown mechanisms. Another possibility is that the sequence $(p_n)$ can be chaotic. Loosely speaking, chaos is a nonperiodic oscillation with a complicated structure. Chaotic oscillations look a lot like random noise. This will be considered in more detail in Section 7.6.
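The period-2 oscillation above can be reproduced with a few lines (our sketch):

```python
def newton_step(p):
    """One Newton-Raphson step (7.55) for f(x) = x^3 - 2x + 2."""
    return (2.0 * p**3 - 2.0) / (3.0 * p**2 - 2.0)

p = 0.0
orbit = []
for _ in range(6):
    p = newton_step(p)
    orbit.append(p)
# orbit alternates 1, 0, 1, 0, ...: the iterates bounce between 0 and 1
# forever, neither converging nor diverging.
```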
7.5 SYSTEMS OF NONLINEAR EQUATIONS
We now extend the fixed-point and Newton-Raphson methods to solving nonlinear
systems of equations. We emphasize two equations in two unknowns (i.e., two-
dimensional problems). But much of what is said here applies to higher dimensions.
7.5.1 Fixed-Point Method
As remarked, it is easier to begin by first considering the two-dimensional problem. More specifically, we wish to solve
$$f_0(x_0, x_1) = 0, \quad f_1(x_0, x_1) = 0, \qquad (7.56)$$
which we will assume may be rewritten in the form
$$x_0 - f_0(x_0, x_1) = 0, \quad x_1 - f_1(x_0, x_1) = 0, \qquad (7.57)$$
and we see that solving these is equivalent to finding where the curves in (7.57) intersect in $\mathbf{R}^2$ (i.e., $[x_0 \; x_1]^T \in \mathbf{R}^2$). A general picture is in Fig. 7.5. As in the case of one-dimensional problems, there will usually be more than one way to rewrite (7.56) in the form of (7.57).

[Figure 7.5 Typical curves in $\mathbf{R}^2$ corresponding to the system of equations in (7.57).]
Example 7.6 Suppose
$$f_0(x_0, x_1) = x_0^2 + \tfrac{1}{4} x_1^2, \qquad f_1(x_0, x_1) = x_0^2 - x_1^2.$$
We see that
$$x_0 - f_0(x_0, x_1) = 0 \Rightarrow x_0 - x_0^2 - \tfrac{1}{4} x_1^2 = 0 \Rightarrow (x_0 - \tfrac{1}{2})^2 + \tfrac{1}{4} x_1^2 = \tfrac{1}{4}$$
(an ellipse in $\mathbf{R}^2$), and
$$x_1 - f_1(x_0, x_1) = 0 \Rightarrow x_1 - x_0^2 + x_1^2 = 0 \Rightarrow x_0^2 - (x_1 + \tfrac{1}{2})^2 = -\tfrac{1}{4}$$
(a hyperbola in $\mathbf{R}^2$). These are plotted in Fig. 7.6. The solutions to the system are the points where the ellipse and hyperbola intersect each other. Clearly, the solution is not unique.
It is frequently the case that nonlinear systems of equations will have more than one solution, just as this was very possible and very common in the case of a single equation.

Vector notation leads to compact descriptions. Specifically, we define $\overline{x} = [x_0 \; x_1]^T \in \mathbf{R}^2$, $f_0(\overline{x}) = f_0(x_0, x_1)$, $f_1(\overline{x}) = f_1(x_0, x_1)$, and
$$F(\overline{x}) = \begin{bmatrix} f_0(x_0, x_1) \\ f_1(x_0, x_1) \end{bmatrix} = \begin{bmatrix} f_0(\overline{x}) \\ f_1(\overline{x}) \end{bmatrix}. \qquad (7.58)$$
[Figure 7.6 The curves of Example 7.6 (an ellipse and a hyperbola). The solutions to the equations in Example 7.6 are the points where these curves intersect.]

Then the nonlinear system in (7.57) becomes
$$\overline{x} = F(\overline{x}), \qquad (7.59)$$
and the fixed point $\overline{p} \in \mathbf{R}^2$ of $F$ satisfies
$$\overline{p} = F(\overline{p}). \qquad (7.60)$$
We recall that $\mathbf{R}^2$ is a normed space if
$$||\overline{x}||^2 = \sum_{k=0}^{1} x_k^2 = \overline{x}^T \overline{x} \qquad (7.61)$$
($\overline{x} \in \mathbf{R}^2$). We may consider a sequence of vectors $(\overline{x}_n)$ (i.e., $\overline{x}_n \in \mathbf{R}^2$ for $n \in \mathbf{Z}^+$), and it converges to $\overline{x}$ iff
$$\lim_{n \to \infty} ||\overline{x}_n - \overline{x}|| = 0.$$
We recall from Chapter 3 that $\mathbf{R}^2$ with the norm in (7.61) is a complete space, so every Cauchy sequence $(\overline{x}_n)$ in it will converge.

As with the scalar case considered in Section 7.3, we consider the sequence of iterates $(\overline{p}_n)$ such that
$$\overline{p}_{n+1} = F(\overline{p}_n), \quad n \in \mathbf{Z}^+.$$
The previous statements (and the following theorem) apply if $\mathbf{R}^2$ is replaced by $\mathbf{R}^m$ ($m > 2$); that is, the space can be of higher dimension to accommodate $m$ equations in $m$ unknowns. Naturally, for $\mathbf{R}^m$ the norm in (7.61) must change according to $||\overline{x}||^2 = \sum_{k=0}^{m-1} x_k^2$. The following theorem is really a special instance of the Banach fixed-point theorem seen earlier.
Theorem 7.5: Suppose that $\mathcal{R}$ is a closed subset of $\mathbf{R}^2$, $F: \mathcal{R} \to \mathcal{R}$, and $F$ is contractive on $\mathcal{R}$; then
$$\overline{x} = F(\overline{x})$$
has a unique solution $\overline{p} \in \mathcal{R}$. The sequence $(\overline{p}_n)$, where
$$\overline{p}_{n+1} = F(\overline{p}_n), \quad \overline{p}_0 \in \mathcal{R}, \quad n \in \mathbf{Z}^+, \qquad (7.62)$$
is such that
$$\lim_{n \to \infty} ||\overline{p}_n - \overline{p}|| = 0,$$
and
$$||\overline{p}_n - \overline{p}|| \le \frac{\alpha^n}{1 - \alpha} ||\overline{p}_1 - \overline{p}_0||, \qquad (7.63)$$
where $||F(\overline{p}_1) - F(\overline{p}_2)|| \le \alpha ||\overline{p}_1 - \overline{p}_2||$ for any $\overline{p}_1, \overline{p}_2 \in \mathcal{R}$, and $0 \le \alpha < 1$.

Proof As noted, this theorem is really a special instance of the Banach fixed-point theorem (Theorem 7.3), so we only outline the proof.

It was mentioned in Section 7.3 that any closed subset of a complete metric space is also complete. $\mathbf{R}^2$ with norm (7.61) is a complete metric space, and since $\mathcal{R} \subset \mathbf{R}^2$ is closed, $\mathcal{R}$ must be complete. According to Theorem 7.3, $F$ has a unique fixed point $\overline{p} \in \mathcal{R}$ [i.e., $F(\overline{p}) = \overline{p}$], and sequence $(\overline{p}_n)$ from $\overline{p}_{n+1} = F(\overline{p}_n)$ (with $\overline{p}_0 \in \mathcal{R}$, and $n \in \mathbf{Z}^+$) converges to $\overline{p}$. The error bound
$$||\overline{p}_n - \overline{p}|| \le \frac{\alpha^n}{1 - \alpha} ||\overline{p}_1 - \overline{p}_0||$$
is an immediate consequence of Corollary 7.1.
A typical choice for $\mathcal{R}$ would be the bounded and closed rectangular region
$$\mathcal{R} = \{ [x_0 \; x_1]^T \mid a_0 \le x_0 \le b_0, \; a_1 \le x_1 \le b_1 \}. \qquad (7.64)$$
The next theorem applies the Schwarz inequality of Chapter 1 to the estimation of $\alpha$. It must be admitted that applying the following theorem is often quite difficult in practice, and in the end it is often better to simply implement the iterative method in (7.62) and experiment with it rather than go through the laborious task of computing $\alpha$ according to the theorem's dictates. However, exceptions cannot be ruled out, and so knowledge of the theorem might be helpful.
Theorem 7.6: Suppose that $\mathcal{R} \subset \mathbf{R}^2$ is as defined in (7.64); then, if
$$\alpha = \max_{\overline{x} \in \mathcal{R}} \left[ \left( \frac{\partial f_0(\overline{x})}{\partial x_0} \right)^2 + \left( \frac{\partial f_0(\overline{x})}{\partial x_1} \right)^2 + \left( \frac{\partial f_1(\overline{x})}{\partial x_0} \right)^2 + \left( \frac{\partial f_1(\overline{x})}{\partial x_1} \right)^2 \right]^{1/2}, \qquad (7.65)$$
we have
$$||F(\overline{x}_1) - F(\overline{x}_2)|| \le \alpha ||\overline{x}_1 - \overline{x}_2||$$
for all $\overline{x}_1, \overline{x}_2 \in \mathcal{R}$.
Proof We use a two-dimensional version of the Taylor expansion theorem, which we will not attempt to justify in this book.

Given $\overline{x}_1 = [x_{1,0} \; x_{1,1}]^T$ and $\overline{x}_2 = [x_{2,0} \; x_{2,1}]^T$, there is a point $\overline{\xi} \in \mathcal{R}$ on the line segment that joins $\overline{x}_1$ to $\overline{x}_2$ such that
$$F(\overline{x}_1) = F(\overline{x}_2) + F^{(1)}(\overline{\xi})(\overline{x}_1 - \overline{x}_2) = F(\overline{x}_2) + \begin{bmatrix} \dfrac{\partial f_0(\overline{\xi})}{\partial x_0} & \dfrac{\partial f_0(\overline{\xi})}{\partial x_1} \\[2mm] \dfrac{\partial f_1(\overline{\xi})}{\partial x_0} & \dfrac{\partial f_1(\overline{\xi})}{\partial x_1} \end{bmatrix} \begin{bmatrix} x_{1,0} - x_{2,0} \\ x_{1,1} - x_{2,1} \end{bmatrix}.$$
Consequently,
$$||F(\overline{x}_1) - F(\overline{x}_2)||^2 = \left[ \frac{\partial f_0(\overline{\xi})}{\partial x_0}(x_{1,0} - x_{2,0}) + \frac{\partial f_0(\overline{\xi})}{\partial x_1}(x_{1,1} - x_{2,1}) \right]^2 + \left[ \frac{\partial f_1(\overline{\xi})}{\partial x_0}(x_{1,0} - x_{2,0}) + \frac{\partial f_1(\overline{\xi})}{\partial x_1}(x_{1,1} - x_{2,1}) \right]^2$$
$$\le \left[ \left( \frac{\partial f_0(\overline{\xi})}{\partial x_0} \right)^2 + \left( \frac{\partial f_0(\overline{\xi})}{\partial x_1} \right)^2 \right] ||\overline{x}_1 - \overline{x}_2||^2 + \left[ \left( \frac{\partial f_1(\overline{\xi})}{\partial x_0} \right)^2 + \left( \frac{\partial f_1(\overline{\xi})}{\partial x_1} \right)^2 \right] ||\overline{x}_1 - \overline{x}_2||^2$$
via the Schwarz inequality (Theorem 1.1). Consequently,
$$||F(\overline{x}_1) - F(\overline{x}_2)||^2 \le \max_{\overline{x} \in \mathcal{R}} \left[ \left( \frac{\partial f_0(\overline{x})}{\partial x_0} \right)^2 + \left( \frac{\partial f_0(\overline{x})}{\partial x_1} \right)^2 + \left( \frac{\partial f_1(\overline{x})}{\partial x_0} \right)^2 + \left( \frac{\partial f_1(\overline{x})}{\partial x_1} \right)^2 \right] ||\overline{x}_1 - \overline{x}_2||^2 = \alpha^2 ||\overline{x}_1 - \overline{x}_2||^2.$$
Taking square roots completes the proof.
SYSTEMS OF NONLINEAR EQUATIONS 317
As with the one-dimensional Taylor theorem from Chapter 3, our application of it
here assumes that F is sufficiently smooth.
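How conservative the bound (7.65) can be is easy to see numerically. The Python sketch below grid-samples the four partial derivatives of the convergent fixed-point functions from Example 7.7; the rectangle and grid density are this sketch's assumptions, not from the text.

```python
import math

# Convergent fixed-point functions from Example 7.7.
def g0(x0, x1):
    return x0**2 / (x0**2 + 0.25 * x1**2)

def g1(x0, x1):
    return x0**2 / (1.0 + x1)

# Analytic partial derivatives of g0 and g1.
def partials(x0, x1):
    d = x0**2 + 0.25 * x1**2
    dg0_dx0 = 2.0 * x0 * (0.25 * x1**2) / d**2
    dg0_dx1 = -x0**2 * (0.5 * x1) / d**2
    dg1_dx0 = 2.0 * x0 / (1.0 + x1)
    dg1_dx1 = -x0**2 / (1.0 + x1)**2
    return dg0_dx0, dg0_dx1, dg1_dx0, dg1_dx1

def alpha_estimate(a0, b0, a1, b1, n=50):
    """Grid-sample the bound (7.65) over the rectangle [a0,b0] x [a1,b1]."""
    best = 0.0
    for i in range(n + 1):
        for j in range(n + 1):
            x0 = a0 + (b0 - a0) * i / n
            x1 = a1 + (b1 - a1) * j / n
            best = max(best, math.sqrt(sum(p * p for p in partials(x0, x1))))
    return best

# Rectangle around the root near [0.9 0.5]^T (an assumed choice).
alpha = alpha_estimate(0.8, 1.0, 0.4, 0.6)
print(alpha)
```

On this rectangle the estimate exceeds 1 even though the iteration of Example 7.7 converges, which is consistent with the remark above: the bound is often pessimistic, and experimenting with (7.62) directly can be more informative.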
To apply Theorem 7.5, we try to select $F$ and $\mathcal{R}$ so that $0 \le \alpha < 1$. As already noted, this is often done experimentally. The iterative process (7.62) can give us an algorithm only if we have a stopping criterion. A good choice is (for suitable $\epsilon > 0$) to stop iterating when
$$\frac{\|\bar{p}_n - \bar{p}_{n-1}\|}{\|\bar{p}_n\|} < \epsilon, \qquad \bar{p}_n \ne \bar{0}. \qquad (7.66)$$
It is possible to use norms in (7.66) other than the one defined by (7.61). (A
Chebyshev norm might be a good alternative.) Since our methodology is often to
make guesses about F and 1Z, it is very possible that a given guess will be wrong.
In other words, convergence may never occur. Thus, a loop that implements the
algorithm should also be programmed to terminate when the number of iterations
exceeds some reasonable threshold. In this event, the program must also be written
to print a message saying that convergence has not occurred because the upper limit
on the allowed number of iterations was exceeded. This is an important example
of exception handling in numerical computing.
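A minimal Python sketch of such a loop follows. It stops on the relative criterion (7.66) and raises an exception when the iteration cap is exceeded, as recommended above; the toy contractive map $G$, the tolerance, and the cap are all assumptions of this sketch.

```python
import math

def fixed_point_2d(F, p0, eps=1e-8, max_iter=100):
    """Iterate p_{n+1} = F(p_n), stopping via the relative criterion (7.66).

    Raises RuntimeError if convergence has not occurred within max_iter
    iterations (the exception-handling case discussed in the text)."""
    p = p0
    for _ in range(max_iter):
        q = F(p[0], p[1])
        num = math.hypot(q[0] - p[0], q[1] - p[1])
        den = math.hypot(q[0], q[1])
        p = q
        if den > 0.0 and num / den < eps:
            return p
    raise RuntimeError("no convergence within %d iterations" % max_iter)

# Toy contractive map (an assumption of this sketch, not from the text);
# its partial derivatives are bounded by 1/2, so alpha <= sqrt(1/2) < 1.
G = lambda x0, x1: (0.5 * math.cos(x1), 0.5 * math.sin(x0))
p = fixed_point_2d(G, (1.0, 1.0))
```

If the map is not contractive on the chosen region, the loop terminates with the exception instead of silently spinning, which is the behavior the paragraph above asks for.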
Example 7.7 If we consider $f_0(x_0, x_1)$ and $f_1(x_0, x_1)$ as in Example 7.6, then the sequence of iterates for the starting vector $\bar{p}_0 = [1\ 1]^T$ is

p_{1,0} = 1.2500,  p_{1,1} = 0.0000
p_{2,0} = 1.5625,  p_{2,1} = 1.5625
p_{3,0} = 3.0518,  p_{3,1} = 0.0000
p_{4,0} = 9.3132,  p_{4,1} = 9.3132

This vector sequence is not converging to the root near $[0.9\ 0.5]^T$ (see Fig. 7.6). Many experiments with different starting values do not lead to a solution. So we need to change $F$.

We may rewrite $x_0 - x_0^2 - \frac{1}{4}x_1^2 = 0$ as
$$x_0 = \frac{x_0^2}{x_0^2 + \frac{1}{4}x_1^2} = f_0(x_0, x_1),$$
and we may rewrite $x_1 - x_0^2 + x_1^2 = 0$ as
$$x_1 = \frac{x_0^2}{1 + x_1} = f_1(x_0, x_1).$$
These redefine $F$ [recall (7.58) for the general form of $F$]. For this choice of $F$, and again with $\bar{p}_0 = [1\ 1]^T$, the sequence of vectors is

p_{1,0} = 0.8000,  p_{1,1} = 0.5000
p_{2,0} = 0.9110,  p_{2,1} = 0.4267
p_{3,0} = 0.9480,  p_{3,1} = 0.5818
p_{4,0} = 0.9140,  p_{4,1} = 0.5682
  ...
p_{14,0} = 0.9189,  p_{14,1} = 0.5461

In this case the vector sequence now converges to the root with four-decimal-place accuracy by the 14th iteration.
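Both choices of $F$ in this example are easy to check numerically; the Python sketch below runs the update $\bar{p}_{n+1} = F(\bar{p}_n)$ for each, with iteration counts chosen here for illustration.

```python
# First choice of F from the example: diverges from [1, 1]^T.
def F_bad(x0, x1):
    return x0**2 + 0.25 * x1**2, x0**2 - x1**2

# Second choice of F: converges toward the root near [0.9189, 0.5461]^T.
def F_good(x0, x1):
    return x0**2 / (x0**2 + 0.25 * x1**2), x0**2 / (1.0 + x1)

p = (1.0, 1.0)
for _ in range(6):
    p = F_bad(*p)          # iterates grow without bound

q = (1.0, 1.0)
for _ in range(14):
    q = F_good(*q)
print(q)  # approximately (0.9189, 0.5461), as tabulated in the example
```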
7.5.2 Newton-Raphson Method

Consider yet again two equations in two unknowns
$$f_0(x_0, x_1) = 0,$$
$$f_1(x_0, x_1) = 0, \qquad (7.67)$$
each of which defines a curve in the plane (again $\bar{x} = [x_0\ x_1]^T \in \mathbb{R}^2$). Solutions to (7.67) are points of intersection of the two curves in $\mathbb{R}^2$. We will denote a point of intersection by $\bar{p} = [p_0\ p_1]^T \in \mathbb{R}^2$, which is a root of the system (7.67).
Suppose that $\bar{x}_0 = [x_{0,0}\ x_{0,1}]^T$ is an initial approximation to the root $\bar{p}$. Assume that $f_0$ and $f_1$ are smooth enough to possess a two-dimensional Taylor series expansion, which in vector notation takes the form
$$f_0(\bar{x}) = f_0(\bar{x}_0) + \frac{\partial f_0(\bar{x}_0)}{\partial x_0}(x_0 - x_{0,0}) + \frac{\partial f_0(\bar{x}_0)}{\partial x_1}(x_1 - x_{0,1}) + \frac{1}{2!}\left\{ \frac{\partial^2 f_0(\bar{x}_0)}{\partial x_0^2}(x_0 - x_{0,0})^2 + 2\frac{\partial^2 f_0(\bar{x}_0)}{\partial x_0 \partial x_1}(x_0 - x_{0,0})(x_1 - x_{0,1}) + \frac{\partial^2 f_0(\bar{x}_0)}{\partial x_1^2}(x_1 - x_{0,1})^2 \right\} + \cdots \qquad (7.68a)$$
$$f_1(\bar{x}) = f_1(\bar{x}_0) + \frac{\partial f_1(\bar{x}_0)}{\partial x_0}(x_0 - x_{0,0}) + \frac{\partial f_1(\bar{x}_0)}{\partial x_1}(x_1 - x_{0,1}) + \frac{1}{2!}\left\{ \frac{\partial^2 f_1(\bar{x}_0)}{\partial x_0^2}(x_0 - x_{0,0})^2 + 2\frac{\partial^2 f_1(\bar{x}_0)}{\partial x_0 \partial x_1}(x_0 - x_{0,0})(x_1 - x_{0,1}) + \frac{\partial^2 f_1(\bar{x}_0)}{\partial x_1^2}(x_1 - x_{0,1})^2 \right\} + \cdots \qquad (7.68b)$$
If $\bar{x}_0$ is close to $\bar{p}$, then from (7.68) we have
$$0 = f_0(\bar{p}) \approx f_0(\bar{x}_0) + \frac{\partial f_0(\bar{x}_0)}{\partial x_0}(p_0 - x_{0,0}) + \frac{\partial f_0(\bar{x}_0)}{\partial x_1}(p_1 - x_{0,1}) + \frac{1}{2!}\left\{ \frac{\partial^2 f_0(\bar{x}_0)}{\partial x_0^2}(p_0 - x_{0,0})^2 + 2\frac{\partial^2 f_0(\bar{x}_0)}{\partial x_0 \partial x_1}(p_0 - x_{0,0})(p_1 - x_{0,1}) + \frac{\partial^2 f_0(\bar{x}_0)}{\partial x_1^2}(p_1 - x_{0,1})^2 \right\}, \qquad (7.69a)$$
$$0 = f_1(\bar{p}) \approx f_1(\bar{x}_0) + \frac{\partial f_1(\bar{x}_0)}{\partial x_0}(p_0 - x_{0,0}) + \frac{\partial f_1(\bar{x}_0)}{\partial x_1}(p_1 - x_{0,1}) + \frac{1}{2!}\left\{ \frac{\partial^2 f_1(\bar{x}_0)}{\partial x_0^2}(p_0 - x_{0,0})^2 + 2\frac{\partial^2 f_1(\bar{x}_0)}{\partial x_0 \partial x_1}(p_0 - x_{0,0})(p_1 - x_{0,1}) + \frac{\partial^2 f_1(\bar{x}_0)}{\partial x_1^2}(p_1 - x_{0,1})^2 \right\}. \qquad (7.69b)$$
If we neglect the higher-order terms (second derivatives and higher-order derivatives), then we obtain
$$\frac{\partial f_0(\bar{x}_0)}{\partial x_0}(p_0 - x_{0,0}) + \frac{\partial f_0(\bar{x}_0)}{\partial x_1}(p_1 - x_{0,1}) \approx -f_0(\bar{x}_0), \qquad (7.70a)$$
$$\frac{\partial f_1(\bar{x}_0)}{\partial x_0}(p_0 - x_{0,0}) + \frac{\partial f_1(\bar{x}_0)}{\partial x_1}(p_1 - x_{0,1}) \approx -f_1(\bar{x}_0). \qquad (7.70b)$$
As a shorthand notation, define
$$f_{i,j}(\bar{x}_n) = \frac{\partial f_i(\bar{x}_n)}{\partial x_j}, \qquad (7.71)$$
so (7.70) becomes
$$(p_0 - x_{0,0})f_{0,0}(\bar{x}_0) + (p_1 - x_{0,1})f_{0,1}(\bar{x}_0) \approx -f_0(\bar{x}_0), \qquad (7.72a)$$
$$(p_0 - x_{0,0})f_{1,0}(\bar{x}_0) + (p_1 - x_{0,1})f_{1,1}(\bar{x}_0) \approx -f_1(\bar{x}_0). \qquad (7.72b)$$
Multiply (7.72a) by $f_{1,1}(\bar{x}_0)$, and multiply (7.72b) by $f_{0,1}(\bar{x}_0)$. Subtracting the second equation from the first results in
$$(p_0 - x_{0,0})[f_{0,0}(\bar{x}_0)f_{1,1}(\bar{x}_0) - f_{1,0}(\bar{x}_0)f_{0,1}(\bar{x}_0)] \approx -f_0(\bar{x}_0)f_{1,1}(\bar{x}_0) + f_1(\bar{x}_0)f_{0,1}(\bar{x}_0). \qquad (7.73a)$$
Now multiply (7.72a) by $f_{1,0}(\bar{x}_0)$, and multiply (7.72b) by $f_{0,0}(\bar{x}_0)$. Subtracting the second equation from the first results in
$$(p_1 - x_{0,1})[f_{0,1}(\bar{x}_0)f_{1,0}(\bar{x}_0) - f_{0,0}(\bar{x}_0)f_{1,1}(\bar{x}_0)] \approx -f_0(\bar{x}_0)f_{1,0}(\bar{x}_0) + f_1(\bar{x}_0)f_{0,0}(\bar{x}_0). \qquad (7.73b)$$
From (7.73), we obtain
$$p_0 \approx x_{0,0} + \frac{-f_0(\bar{x}_0)f_{1,1}(\bar{x}_0) + f_1(\bar{x}_0)f_{0,1}(\bar{x}_0)}{f_{0,0}(\bar{x}_0)f_{1,1}(\bar{x}_0) - f_{0,1}(\bar{x}_0)f_{1,0}(\bar{x}_0)}, \qquad (7.74a)$$
$$p_1 \approx x_{0,1} + \frac{-f_1(\bar{x}_0)f_{0,0}(\bar{x}_0) + f_0(\bar{x}_0)f_{1,0}(\bar{x}_0)}{f_{0,0}(\bar{x}_0)f_{1,1}(\bar{x}_0) - f_{0,1}(\bar{x}_0)f_{1,0}(\bar{x}_0)}. \qquad (7.74b)$$
We may take the right-hand side of (7.74) as the next approximation to $\bar{p}$:
$$x_{1,0} \approx x_{0,0} + \frac{-f_0 f_{1,1} + f_1 f_{0,1}}{f_{0,0}f_{1,1} - f_{0,1}f_{1,0}}, \qquad (7.75a)$$
$$x_{1,1} \approx x_{0,1} + \frac{-f_1 f_{0,0} + f_0 f_{1,0}}{f_{0,0}f_{1,1} - f_{0,1}f_{1,0}} \qquad (7.75b)$$
($\bar{x}_1 = [x_{1,0}\ x_{1,1}]^T$), where the functions and derivatives are to be evaluated at $\bar{x}_0$. We may continue this process to generate $(\bar{x}_n)$ for $n \in \mathbb{Z}^+$ (so in general $\bar{x}_n = [x_{n,0}\ x_{n,1}]^T$) according to
$$x_{n+1,0} = x_{n,0} + \frac{-f_0(\bar{x}_n)f_{1,1}(\bar{x}_n) + f_1(\bar{x}_n)f_{0,1}(\bar{x}_n)}{f_{0,0}(\bar{x}_n)f_{1,1}(\bar{x}_n) - f_{0,1}(\bar{x}_n)f_{1,0}(\bar{x}_n)}, \qquad (7.76a)$$
$$x_{n+1,1} = x_{n,1} + \frac{-f_1(\bar{x}_n)f_{0,0}(\bar{x}_n) + f_0(\bar{x}_n)f_{1,0}(\bar{x}_n)}{f_{0,0}(\bar{x}_n)f_{1,1}(\bar{x}_n) - f_{0,1}(\bar{x}_n)f_{1,0}(\bar{x}_n)}. \qquad (7.76b)$$
As in the previous subsection, we define
$$F(\bar{x}_n) = \begin{bmatrix} f_0(x_{n,0}, x_{n,1}) \\ f_1(x_{n,0}, x_{n,1}) \end{bmatrix}. \qquad (7.77)$$
Also
$$F^{(1)}(\bar{x}_n) = \begin{bmatrix} f_{0,0}(\bar{x}_n) & f_{0,1}(\bar{x}_n) \\ f_{1,0}(\bar{x}_n) & f_{1,1}(\bar{x}_n) \end{bmatrix} = J_F(\bar{x}_n), \qquad (7.78)$$
which is the Jacobian matrix $J_F$ evaluated at $\bar{x} = \bar{x}_n$. We see that
$$[J_F(\bar{x}_n)]^{-1} = \frac{1}{f_{0,0}(\bar{x}_n)f_{1,1}(\bar{x}_n) - f_{0,1}(\bar{x}_n)f_{1,0}(\bar{x}_n)} \begin{bmatrix} f_{1,1}(\bar{x}_n) & -f_{0,1}(\bar{x}_n) \\ -f_{1,0}(\bar{x}_n) & f_{0,0}(\bar{x}_n) \end{bmatrix}, \qquad (7.79)$$
so in vector notation (7.76) becomes
$$\bar{x}_{n+1} = \bar{x}_n - [J_F(\bar{x}_n)]^{-1} F(\bar{x}_n) \qquad (7.80)$$
for $n \in \mathbb{Z}^+$. If $\bar{x}_n \in \mathbb{R}^m$ (i.e., if we consider $m$ equations in $m$ unknowns), then
$$J_F(\bar{x}_n) = \begin{bmatrix} f_{0,0}(\bar{x}_n) & f_{0,1}(\bar{x}_n) & \cdots & f_{0,m-1}(\bar{x}_n) \\ f_{1,0}(\bar{x}_n) & f_{1,1}(\bar{x}_n) & \cdots & f_{1,m-1}(\bar{x}_n) \\ \vdots & \vdots & & \vdots \\ f_{m-1,0}(\bar{x}_n) & f_{m-1,1}(\bar{x}_n) & \cdots & f_{m-1,m-1}(\bar{x}_n) \end{bmatrix} \qquad (7.81a)$$
and
$$F(\bar{x}_n) = \begin{bmatrix} f_0(\bar{x}_n) \\ f_1(\bar{x}_n) \\ \vdots \\ f_{m-1}(\bar{x}_n) \end{bmatrix}. \qquad (7.81b)$$
Of course, $\bar{x}_n = [x_{n,0}\ x_{n,1}\ \cdots\ x_{n,m-1}]^T \in \mathbb{R}^m$.
Equation (7.80) reduces to (7.46) when we have only one equation in one unknown. We see that the method will fail if $J_F(\bar{x}_n)$ is singular at $\bar{x}_n$. As in the one-dimensional problem of Section 7.4, the success of the method depends on picking a good starting point $\bar{x}_0$. If convergence occurs, then it is quadratic as in the one-dimensional (i.e., scalar) case. It is sometimes possible to force the method to converge even if the starting point is poorly selected, but this will not be considered here. The computational complexity of the method is quite high. If $\bar{x}_n \in \mathbb{R}^m$, then from (7.80) and (7.81), we require $m^2 + m$ function evaluations, and we need to invert an $m \times m$ Jacobian matrix at every iteration. We know from Chapter 4 that matrix inversion needs $O(m^3)$ operations. Ill conditioning of the Jacobian is very much a potential problem as well.
Example 7.8 Refer to the examples in Section 7.5.1. In Example 7.6 there we wanted to solve
$$f_0(x_0, x_1) = x_0 - x_0^2 - \tfrac{1}{4}x_1^2 = 0, \qquad f_1(x_0, x_1) = x_1 - x_0^2 + x_1^2 = 0.$$
Consequently
$$f_0(\bar{x}_n) = x_{n,0} - x_{n,0}^2 - \tfrac{1}{4}x_{n,1}^2, \qquad f_1(\bar{x}_n) = x_{n,1} - x_{n,0}^2 + x_{n,1}^2,$$
and the derivatives are
$$f_{0,0}(\bar{x}_n) = 1 - 2x_{n,0}, \quad f_{0,1}(\bar{x}_n) = -\tfrac{1}{2}x_{n,1}, \quad f_{1,0}(\bar{x}_n) = -2x_{n,0}, \quad f_{1,1}(\bar{x}_n) = 1 + 2x_{n,1}.$$
Via (7.76), the desired equations are
$$x_{n+1,0} = x_{n,0} + \frac{-(x_{n,0} - x_{n,0}^2 - \tfrac{1}{4}x_{n,1}^2)(1 + 2x_{n,1}) + (x_{n,1} - x_{n,0}^2 + x_{n,1}^2)(-\tfrac{1}{2}x_{n,1})}{(1 - 2x_{n,0})(1 + 2x_{n,1}) - x_{n,0}x_{n,1}}, \qquad (7.82a)$$
$$x_{n+1,1} = x_{n,1} + \frac{-(x_{n,1} - x_{n,0}^2 + x_{n,1}^2)(1 - 2x_{n,0}) + (x_{n,0} - x_{n,0}^2 - \tfrac{1}{4}x_{n,1}^2)(-2x_{n,0})}{(1 - 2x_{n,0})(1 + 2x_{n,1}) - x_{n,0}x_{n,1}}. \qquad (7.82b)$$
If we execute the iterative procedure in (7.82) with $x_{0,0} = 0.8000$, $x_{0,1} = 0.5000$, we obtain

x_{1,0} = 0.9391,  x_{1,1} = 0.5562
x_{2,0} = 0.9193,  x_{2,1} = 0.5463
x_{3,0} = 0.9189,  x_{3,1} = 0.5461
We see that the answer is correct to four decimal places in only three iterations.
This is much faster than the fixed-point method seen in Example 7.7.
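The iteration (7.82) is easy to check numerically; the Python sketch below implements one Newton-Raphson step for this system (the iteration count is chosen here for illustration).

```python
f0 = lambda x0, x1: x0 - x0**2 - 0.25 * x1**2
f1 = lambda x0, x1: x1 - x0**2 + x1**2

def newton_step(x0, x1):
    # Partial derivatives f_{i,j} from the example.
    f00, f01 = 1.0 - 2.0 * x0, -0.5 * x1
    f10, f11 = -2.0 * x0, 1.0 + 2.0 * x1
    det = f00 * f11 - f01 * f10
    a, b = f0(x0, x1), f1(x0, x1)
    # Update (7.76) for the 2 x 2 case.
    return (x0 + (-a * f11 + b * f01) / det,
            x1 + (-b * f00 + a * f10) / det)

x = (0.8, 0.5)
for _ in range(5):
    x = newton_step(*x)
print(x)  # approximately (0.9189, 0.5461), matching the tabulated values
```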
7.6 CHAOTIC PHENOMENA AND A CRYPTOGRAPHY APPLICATION
Iterative processes such as $x_{n+1} = g(x_n)$ (which includes the Newton-Raphson method) can converge to a fixed point [i.e., to $x$ such that $g(x) = x$], or they can fail to do so in various ways. This was considered in Section 7.4.3. We are interested here in the case where $(x_n)$ is a chaotic sequence, in which case $g$ is often said to be a chaotic map. Formal definitions exist for chaotic maps [13, p. 50]. However, these are rather technical. They are also difficult to apply except in relatively simple cases. We shall therefore treat chaos in an intuitive/empirical (i.e., experimental) manner for simplicity.
In Section 7.3 we considered examples based on the logistic map
$$g(x) = \lambda x(1 - x) \qquad (7.83)$$
(recall Examples 7.3-7.5). Suppose that $\lambda = 4$. Figure 7.7 shows two output sequences from this map for two slightly different initial conditions. Plot (a) shows the sequence for $x_0 = 0.745$, while plot (b) shows it for $x_0' = 0.755$. We see that $|x_0 - x_0'| = 0.01$, yet after only a few iterations the two sequences are very different from each other. This is one of the distinguishing features of chaos: sensitive dependence of the resulting sequence on minor changes in the initial conditions. For $\lambda = 4$ we have $g: [0,1] \to [0,1]$, so divergence is impossible. Chaotic sequences do not diverge. They remain bounded; that is, there is an $M \in \mathbb{R}$ such that $0 < M < \infty$ with $|x_n| \le M$ for all $n \in \mathbb{Z}^+$. But the sequence does not converge, and it is not periodic, either. In fact, the plots in Fig. 7.7 show that the elements of the sequence $(x_n)$ seem to wander around rather aimlessly (i.e., apparently randomly). This wandering behavior has been observed in the past [14, p. 167], but was not generally recognized as being a chaotic phenomenon until more recently.
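The sensitivity shown in Fig. 7.7 is easy to reproduce. In the Python sketch below the two seeds are those of the figure, while the iteration count is this sketch's choice.

```python
def logistic(x, lam=4.0):
    return lam * x * (1.0 - x)

xa, xb = 0.745, 0.755   # the two nearby initial conditions of Fig. 7.7
gap = []
for _ in range(10):
    xa, xb = logistic(xa), logistic(xb)
    gap.append(abs(xa - xb))
print(max(gap))  # the 0.01 initial separation grows to order 1 quickly
```

The iterates remain in $[0, 1]$ throughout (boundedness), yet the separation between the two trajectories grows rapidly (sensitive dependence).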
It has been known for a very long time that effective cryptographic systems
should exploit randomness [15]. Since chaotic sequences have apparently random
qualities, it is not surprising that they have been proposed as random-number gen-
erators (or as pseudo-random-number generators) for applications in cryptography
[16, 17]. However, just how secure a chaos-based cryptosystem can be is presently a matter of legitimate controversy. One difficulty is as follows. Nominally, a chaotic map $g$ takes on values from the set of real numbers. But if such a map is implemented on a digital computer, then, since all computers are finite-state machines, any chaotic sequence will not be truly chaotic, as it will eventually repeat. Short-period sequences are cryptographically weak (i.e., not secure).
There is presently no effective procedure (beyond exhaustive searching) for deter-
mining when this difficulty will arise in a chaos-based system. This is not the only
problem (see p. 1507 of Ref. 17 for others).
Two specific chaos-based cryptosystems are presented by De Angeli et al. [18] and Papadimitriou et al. [19]. (There have been many others proposed in recent
Figure 7.7 Output sequence from the logistic map $g(x) = 4x(1 - x)$ with initial conditions $x_0 = 0.745$ (a) and $x_0 = 0.755$ (b). Observe that although these initial conditions are close together, the resulting sequences become very different from each other after only a few iterations.
years.) Key size is the number of different possible encryption keys available in the
system. Naturally, this number should be large enough to prevent a codebreaker
(eavesdropper, cryptanalyst) from guessing the correct key. However, it is also well
known that a large key size is most definitely not a guarantee that the cryptosystem
will be secure. Papadimitriou et al. [19] demonstrate that their system has a large
key size, but their security analysis [19] did not go beyond this. Their system [19]
seems difficult to analyze, however. In what follows we shall present some analysis
of the system in De Angeli et al. [18], and see that if implemented on a digital
computer, it is not really very secure. We begin with a description of their system
[18]. Their method in [18] is based on the Henon map (see Fig. 7.8), which is a mapping defined on $\mathbb{R}^2$ according to
$$x_{0,n+1} = 1 - \alpha x_{0,n}^2 + x_{1,n},$$
$$x_{1,n+1} = \beta x_{0,n} \qquad (7.84)$$
for $n \in \mathbb{Z}^+$ (so $[x_{0,n}\ x_{1,n}]^T \in \mathbb{R}^2$), and which is known to be chaotic in some neighborhood of
$$\alpha = 1.4, \qquad \beta = 0.3, \qquad (7.85)$$
Figure 7.8 Typical state sequences from the Henon map for $\alpha = 1.45$ and $\beta = 0.25$, with initial conditions $x_{0,0} = -0.5$ and $x_{1,0} = 0.2$.
so that the constants $\alpha$ and $\beta$ form the encryption key for the system. Not every choice in the neighborhood is a valid key. An immediate problem is that there seems to be no detailed description of which points are allowed. A particular choice of key should therefore be tested to see if the resulting sequence is chaotic. In what follows the output sequence from the map is defined to be
$$y_n = x_{0,n}. \qquad (7.86)$$
The vector $\bar{x}_n = [x_{0,n}\ x_{1,n}]^T$ is often called a state vector. The elements are state variables.

The encryption algorithm works by mixing the chaotic sequence of (7.86) with the message sequence, which we shall denote by $(s_n)$. The mixing (described below) yields the cyphertext sequence, denoted $(c_n)$. A problem is that the receiver must somehow recover the original message $(s_n)$ from the cyphertext $(c_n)$ using knowledge of the encryption algorithm and the key $\{\alpha, \beta\}$.

Consider the following mapping at the receiver (receiver states are written $\tilde{x}_{i,n}$):
$$\tilde{x}_{0,n+1} = 1 - \alpha y_n^2 + \tilde{x}_{1,n},$$
$$\tilde{x}_{1,n+1} = \beta y_n. \qquad (7.87)$$
Note that this mapping is of the same form as (7.84) except that $x_{0,n}$ is replaced by $y_n$, but from (7.86) these are the same (nominally). The mapping in (7.84)
represents a physical system (or piece of software) at the transmitter, while (7.87) is part of the receiver (hardware or software). Now define the error sequences
$$\delta x_{i,n} = \tilde{x}_{i,n} - x_{i,n} \qquad (i \in \{0, 1\}), \qquad (7.88)$$
for which it is possible to show [using (7.84), (7.86), and (7.87)] that
$$\begin{bmatrix} \delta x_{0,n+1} \\ \delta x_{1,n+1} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \delta x_{0,n} \\ \delta x_{1,n} \end{bmatrix} = A \begin{bmatrix} \delta x_{0,n} \\ \delta x_{1,n} \end{bmatrix}. \qquad (7.89)$$
We observe that $A^2 = 0$. Matrix $A$ is an example of a nilpotent matrix for this reason. This immediately implies that the error sequences go to zero in at most two iterations (i.e., two steps):
$$\begin{bmatrix} \delta x_{0,n+2} \\ \delta x_{1,n+2} \end{bmatrix} = A^2 \begin{bmatrix} \delta x_{0,n} \\ \delta x_{1,n} \end{bmatrix} = \bar{0}.$$
This fact tells us that if $(y_n)$ is generated at the transmitter and sent over the communications channel to the receiver, then the receiver may perfectly recover the state sequence of the transmitter in not more than two steps. This is called deadbeat synchronization. The system in (7.87) is a specific example of a nonlinear observer for a nonlinear dynamic system. There is a well-developed theory of observers for linear dynamic systems [20]. The notion of an observer is a control systems concept, so we infer that control theory is central to the problem of applying chaotic systems to cryptographic applications.
All of this suggests the following algorithm for encrypting a message. Assume that we wish to encrypt a length-$N$ message sequence $(s_n)$; that is, we only have the elements $s_0, s_1, \ldots, s_{N-2}, s_{N-1}$.

1. The transmitter generates and transmits $y_0, y_1$ according to (7.84) and (7.86). The transmitter also generates $y_2, y_3, \ldots, y_N, y_{N+1}$, but these are not transmitted.
2. The transmitter sends the cyphertext sequence
$$c_n = \frac{y_{n+2}}{s_n} \qquad (7.90)$$
for $n = 0, 1, \ldots, N-1$. Of course, this assumes we never have $s_n = 0$.

Equation (7.90) is not the only possible means of mixing the message and the chaotic sequence together. The decryption algorithm at the receiver is as follows:
(Footnote: This assumes a perfect communications channel (which is not realistic), and that rounding errors in the computation are irrelevant (which will be true if the receiver implements arithmetic in the same manner as the transmitter).)
1. The receiver regenerates the transmitter's state sequence using (7.87), its knowledge of the key, and the sequence elements $y_0, y_1$. Specifically, for $n = 0, 1$ compute
$$\tilde{x}_{0,n+1} = 1 - \alpha y_n^2 + \tilde{x}_{1,n}, \qquad \tilde{x}_{1,n+1} = \beta y_n, \qquad (7.91a)$$
while for $n = 2, 3, \ldots, N-1, N$ compute
$$\tilde{x}_{0,n+1} = 1 - \alpha \tilde{x}_{0,n}^2 + \tilde{x}_{1,n}, \qquad \tilde{x}_{1,n+1} = \beta \tilde{x}_{0,n}. \qquad (7.91b)$$
Recover $y_n$ for $n = 2, 3, \ldots, N, N+1$ according to
$$\tilde{y}_n = \tilde{x}_{0,n}. \qquad (7.92)$$
2. Recover the original message via
$$s_n = \frac{\tilde{y}_{n+2}}{c_n}, \qquad (7.93)$$
where $n = 0, 1, \ldots, N-2, N-1$.
The initial states at the transmitter and receiver are arbitrary; that is, any $x_{0,0}, x_{1,0}$ and $\tilde{x}_{0,0}, \tilde{x}_{1,0}$ may be selected in (7.84) and (7.91a). The elements $y_0, y_1$ and the cyphertext are sent over the channel. If these are lost because of a corrupt channel, then the receiver should request retransmission of this information. It is extremely important that the transmitter resend the same synchronizing elements $y_0, y_1$ and not a different pair. The reason will become clear below.
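The transmitter/receiver pair can be sketched end to end in Python. The key is (7.85); the message, the initial states, and the message length are assumptions of this sketch.

```python
ALPHA, BETA = 1.4, 0.3  # the key (7.85)

def henon_step(x0, x1, drive=None):
    # One step of (7.84); if drive is given, it replaces x0 as in (7.87).
    y = x0 if drive is None else drive
    return 1.0 - ALPHA * y * y + x1, BETA * y

# Transmitter: generate y_0, ..., y_{N+1} from an arbitrary initial state.
N = 50
s = [1.0 + 0.5 * (n % 7) for n in range(N)]   # nonzero message (assumed)
x = (0.0, 0.0)
y = []
for _ in range(N + 2):
    y.append(x[0])                             # y_n = x_{0,n}, Eq. (7.86)
    x = henon_step(*x)
c = [y[n + 2] / s[n] for n in range(N)]        # cyphertext, Eq. (7.90)

# Receiver: arbitrary initial state; deadbeat synchronization via y0, y1.
xr = (0.5, -0.3)
yr = [y[0]]
for n in range(N + 1):
    xr = henon_step(*xr, drive=y[n]) if n < 2 else henon_step(*xr)
    yr.append(xr[0])                           # y~_{n+1}, Eq. (7.92)
s_hat = [yr[n + 2] / c[n] for n in range(N)]   # recovered message, (7.93)
```

Because the receiver performs exactly the same arithmetic as the transmitter, the states match bit for bit from $n = 2$ onward, and the message is recovered up to rounding in the two divisions.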
We remark that since we are assuming that the algorithms are implemented on a digital computer, each sequence element is a binary word of some form. The synchronizing elements and cyphertext are thus a bitstream. Methods exist to encode such data for transmission over imperfect channels such that the probability of successful transmission can be made quite high. These are called error control coding schemes.

In general, the receiver will fail to recover the message if (1) the way in which arithmetic is performed by the receiver is not the same as at the transmitter, (2) the channel corrupts the transmission, or (3) the receiver does not know the key $\{\alpha, \beta\}$.
Item 1 is important since the failure to properly duplicate arithmetic operations at
the receiver will cause machine rounding errors to accumulate and prevent data
recovery. This is really a case of improper synchronization. The plots of Figs. 7.9-
7.11 illustrate some of these effects.
It is noteworthy that even though the receiver may not be a perfect match to the
transmitter, some of the samples (at the beginning of the message) are recovered.
Figure 7.9 Transmitted (a) and reconstructed (b) message sequences. The sinusoidal message sequence $(s_n)$ is perfectly reconstructed at the receiver because the channel is perfect, the receiver knows the key (here $\alpha = 1.4$, $\beta = 0.3$), and arithmetic is performed in identical fashion at both the transmitter and receiver.
This is actually an indication that the system is not really very secure. We now
consider security of the method in greater detail.
What is the key size? This question seems hard to answer accurately, but a simple analysis is as follows. Suppose that the Henon map is chaotic in the rectangular region of the $\alpha\beta$ plane defined by $\alpha \in [1.4 - \Delta\alpha, 1.4 + \Delta\alpha]$ and $\beta \in [0.3 - \Delta\beta, 0.3 + \Delta\beta]$, with $\Delta\alpha, \Delta\beta > 0$. We do not specifically know the interval limits, and it is an ad hoc assumption that the chaotic neighborhood is rectangular (this is a false assumption). Suppose that $\underline{\beta}$ is the smallest $M$-bit binary fraction such that $\underline{\beta} \ge 0.3 - \Delta\beta$, and that $\overline{\beta}$ is the largest $M$-bit binary fraction such that $\overline{\beta} \le 0.3 + \Delta\beta$. In this case the number of $M$-bit fractions from $0.3 - \Delta\beta$ to $0.3 + \Delta\beta$ is about $2^M(\overline{\beta} - \underline{\beta}) + 1$. Thus
$$K_\beta = 2^M(\overline{\beta} - \underline{\beta}) + 1 \approx 2^M[(0.3 + \Delta\beta) - (0.3 - \Delta\beta)] + 1 = 2^{M+1}\Delta\beta + 1, \qquad (7.94a)$$
and by similar reasoning for $\alpha$
$$K_\alpha = 2^{M+1}\Delta\alpha + 1, \qquad (7.94b)$$
Figure 7.10 Transmitted (a) and reconstructed (b) message sequences. Here, the conditions of Fig. 7.9 hold except that the receiver uses a mismatched key $\alpha = 1.400001$, $\beta = 0.300001$. Thus, the message is eventually lost. (Note that the first few message samples seem to be recovered accurately.)
which implies that the key size is (very approximately)
$$K = K_\alpha K_\beta. \qquad (7.95)$$
Even if $\Delta\alpha$ and $\Delta\beta$ are small, and the structure of the chaotic neighborhood is not rectangular, we can make $M$ big enough (in principle) to generate a big key space. Apparently, Ruelle [21, p. 19] has shown that the Henon map is periodic (i.e., not chaotic) for $\alpha = 1.3$, $\beta = 0.3$, so the size of the chaotic region is not very big. It is irregular and "full of holes" (the "holes" are key parameters that don't give chaotic outputs). In any case it seems that a large key size is possible. But as cautioned earlier, this is no guarantee of security.
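The count in (7.94)-(7.95) is simple to evaluate; in the sketch below the word length $M$ and the neighborhood half-widths are purely illustrative assumptions.

```python
M = 32                      # assumed word length of the key fractions
d_alpha = d_beta = 0.01     # assumed half-widths of the chaotic neighborhood
K_alpha = 2**(M + 1) * d_alpha + 1   # (7.94b)
K_beta = 2**(M + 1) * d_beta + 1     # (7.94a)
K = K_alpha * K_beta                 # (7.95): roughly 7.4e15 candidate keys
```

Even with these modest half-widths the nominal key space exceeds $2^{50}$, which illustrates the point above that a large key count is easy to obtain yet proves nothing about security.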
What does the transmitter send over the channel? This is the same as asking
what the eavesdropper knows. From the encryption algorithm the eavesdropper
knows the synchronizing elements yo,y\, and the cyphertext. The eavesdropper
also knows the algorithm, but not the key. Is this enough to find the key? It is now
obvious to ask if knowledge of yo, y\ gives the key away. This question may be
easily (?) resolved as follows.
From (7.84) and (7.86), for $n \in \mathbb{N}$ we have
$$y_{n+1} = 1 - \alpha y_n^2 + \beta y_{n-1}. \qquad (7.96)$$
Figure 7.11 Transmitted (a) and reconstructed (b) message sequences. Here, the conditions of Fig. 7.9 hold except the receiver and the transmitter do not perform arithmetic in identical fashion. In this case the order of two operations was reversed at the receiver. Thus, the message is eventually lost. (Note again that the first few message samples seem to be recovered accurately.)
The encryption algorithm generates $y_0, y_1, \ldots, y_{N+1}$, which, from (7.96), must lead to the key satisfying the linear system of equations
$$\begin{bmatrix} y_1^2 & -y_0 \\ y_2^2 & -y_1 \\ \vdots & \vdots \\ y_{N-1}^2 & -y_{N-2} \\ y_N^2 & -y_{N-1} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 1 - y_2 \\ 1 - y_3 \\ \vdots \\ 1 - y_N \\ 1 - y_{N+1} \end{bmatrix}. \qquad (7.97)$$
Compactly, we have $Y\bar{a} = \bar{y}$. This is an overdetermined system. The eavesdropper has only $y_0, y_1$, so from (7.97) the eavesdropper needs to solve
$$\begin{bmatrix} y_1^2 & -y_0 \\ y_2^2 & -y_1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 1 - y_2 \\ 1 - y_3 \end{bmatrix}. \qquad (7.98)$$
But the eavesdropper does not know $y_2, y_3$, as these were not transmitted. These elements become mixed with the message, and so are not available to the eavesdropper. We immediately conclude that deadbeat synchronization is secure. There is an alternative synchronization scheme [18] that may be proved to be insecure by this method of analysis.
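To see why withholding $y_2, y_3$ matters, note that if the eavesdropper did have them, (7.98) would yield the key directly. A Python sketch follows; the transmitter's initial state is an assumption of this sketch.

```python
ALPHA, BETA = 1.4, 0.3

# Generate y_0, ..., y_3 via (7.84) and (7.86).
x0, x1 = 0.5, 0.2
y = []
for _ in range(4):
    y.append(x0)
    x0, x1 = 1.0 - ALPHA * x0 * x0 + x1, BETA * x0

# Solve the 2 x 2 system (7.98) by Cramer's rule.
det = y[1]**2 * (-y[1]) - (-y[0]) * y[2]**2
a_hat = ((1 - y[2]) * (-y[1]) - (-y[0]) * (1 - y[3])) / det
b_hat = (y[1]**2 * (1 - y[3]) - y[2]**2 * (1 - y[2])) / det
print(a_hat, b_hat)  # recovers 1.4 and 0.3
```

Four consecutive output samples thus suffice to expose the key, which is exactly why only $y_0, y_1$ are ever transmitted in the clear.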
Does the cyphertext give key information away? This seems not to have a complete answer either. However, we may demonstrate that the system of De Angeli et al. [18] (using deadbeat synchronization) is vulnerable to a known-plaintext attack. Such a vulnerability is ordinarily sufficient to preclude using a method in a high-security application. The analysis assumes partial prior knowledge of the message $(s_n)$. Let us specifically assume that
$$s_n = \sum_{k=0}^{p} a_k n^k, \qquad (7.99)$$
but we do not know $a_k$ or $p$. That is, the structure of our message is a polynomial sequence, but we do not know more than this. Combining (7.90) with (7.96) gives us
$$c_n s_n = 1 - \alpha c_{n-1}^2 s_{n-1}^2 + \beta c_{n-2} s_{n-2}. \qquad (7.100)$$
It is not difficult to confirm that
$$s_{n-1}^2 = \sum_{k=0}^{2p} \left( \sum_{i+j=k} a_i a_j \right) (n-1)^k, \qquad (7.101)$$
so that (7.100) becomes
$$c_n \sum_{k=0}^{p} a_k n^k + \alpha c_{n-1}^2 \sum_{k=0}^{2p} \left( \sum_{i+j=k} a_i a_j \right)(n-1)^k - \beta c_{n-2} \sum_{k=0}^{p} a_k (n-2)^k = 1. \qquad (7.102)$$
Remembering that the eavesdropper has the cyphertext $(c_n)$ (obtained by eavesdropping), the eavesdropper can use (7.102) to set up a system of equations in the key and message parameters $a_k$. The code is therefore easily broken in this case.
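For the simplest case $p = 0$ (a constant message $s_n = a_0$), (7.102) is linear in the three products $u_1 = a_0$, $u_2 = \alpha a_0^2$, and $u_3 = \beta a_0$, so three cyphertext samples suffice. A Python sketch follows; the key, the message value, and the initial state are assumptions of this sketch.

```python
ALPHA, BETA, A0 = 1.4, 0.3, 2.0

# Transmitter side: y-sequence via (7.84)/(7.86), cyphertext via (7.90).
x0, x1 = 0.5, 0.2
y = []
for _ in range(7):
    y.append(x0)
    x0, x1 = 1.0 - ALPHA * x0 * x0 + x1, BETA * x0
c = [y[n + 2] / A0 for n in range(5)]

# Eavesdropper: for n = 2, 3, 4, Eq. (7.102) with p = 0 reads
#   c_n*u1 + c_{n-1}^2*u2 - c_{n-2}*u3 = 1.
rows = [[c[n], c[n - 1]**2, -c[n - 2]] for n in (2, 3, 4)]
rhs = [1.0, 1.0, 1.0]

def solve3(A, b):
    # Cramer's rule for a 3 x 3 system.
    def det(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    d = det(A)
    out = []
    for j in range(3):
        M = [row[:] for row in A]
        for i in range(3):
            M[i][j] = b[i]
        out.append(det(M) / d)
    return out

u1, u2, u3 = solve3(rows, rhs)
a0_hat, alpha_hat, beta_hat = u1, u2 / u1**2, u3 / u1
```

Both the message value and the full key fall out of three linear equations, which is the sense in which the code is "easily broken" above.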
If the message has a more complex structure, (7.102) will generally be replaced
by some hard-to-solve nonlinear problem wherein the methods of previous sections
(e.g., Newton-Raphson) can be used to break the code. We conclude that, in spite
of technical problems from the eavesdropper's point of view (ill conditioning,
incomplete cyphertext sequences, etc.), the scheme in Ref. 18 is not secure.
(Footnote: This conclusion assumes there are no other distinct ways to work with the encryption algorithm equations in such a manner as to give an equation that an eavesdropper can solve for the key knowing only $y_0$, $y_1$, and the cyphertext.)
(Footnote: The message $(s_n)$ is also called plaintext.)
REFERENCES
1. J. H. Wilkinson, "The Perfidious Polynomial," in Studies in Mathematics, G. H. Golub,
ed., Vol. 24, Mathematical Association of America, 1984.
2. A. M. Cohen, "Is the Polynomial so Perfidious?" Numerische Mathematik 68, 225-238
(1994).
3. T. E. Hull and R. Mathon, "The Mathematical Basis and Prototype Implementation of a
New Polynomial Root Finder with Quadratic Convergence," ACM Trans. Math. Software
22, 261-280 (Sept. 1996).
4. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York,
1978.
5. W. Rudin, Principles of Mathematical Analysis, 3rd ed., McGraw-Hill, New York, 1976.
6. D. C. Youla and H. Webb, "Image Restoration by the Method of Convex Projections: Part I - Theory," IEEE Trans. Med. Imag. MI-1, 81-94 (Oct. 1982).
7. A. E. Cetin, O. N. Gerek and Y. Yardimci, "Equiripple FIR Filter Design by the FFT
Algorithm," IEEE Signal Process. Mag. 14, 60-64 (March 1997).
8. K. Grochenig, "A Discrete Theory of Irregular Sampling," Linear Algebra Appl. 193,
129-150 (1993).
9. H. H. Bauschke and J. M. Borwein, "On Projection Algorithms for Solving Convex Feasibility Problems," SIAM Rev. 38, 367-426 (Sept. 1996).
10. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979.
11. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966.
12. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York,
1974.
13. R. L. Devaney, An Introduction to Chaotic Dynamical Systems, 2nd ed., Addison-
Wesley, Redwood City, CA, 1989.
14. G. E. Forsythe, M. A. Malcolm and C. B. Moler, Computer Methods for Mathematical
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977.
15. G. Brassard, Modern Cryptology: A Tutorial, Lecture Notes in Computer Science (series), Vol. 325, G. Goos and J. Hartmanis, eds., Springer-Verlag, New York, 1988.
16. L. Kocarev, "Chaos-Based Cryptography: A Brief Overview," IEEE Circuits Syst. Mag.
1(3), 6-21 (2001).
17. F. Dachselt and W. Schwarz, "Chaos and Cryptography," IEEE Trans. Circuits Syst.
(Part I: Fundamental Theory and Applications) 48, 1498-1509 (Dec. 2001).
18. A. De Angeli, R. Genesio and A. Tesi, "Dead-Beat Chaos Synchronization in Discrete-
Time Systems," IEEE Trans. Circuits Syst. (Part I: Fundamental Theory and Applica-
tions) 42, 54-56 (Jan. 1995).
19. S. Papadimitriou, A. Bezerianos and T. Bountis, "Secure Communications with Chaotic
Systems of Difference Equations," IEEE Trans. Comput. 46, 27-38 (Jan. 1997).
20. M. S. Santina, A. R. Stubberud and G. H. Hostetter, Digital Control System Design,
2nd ed., Saunders College Publ., Fort Worth, TX, 1994.
21. D. Ruelle, Chaotic Evolution and Strange Attractors: The Statistical Analysis of Time
Series for Deterministic Nonlinear Systems, Cambridge Univ. Press, New York, 1989.
22. M. Jenkins and J. Traub, "A Three-Stage Variable Shift Algorithm for Polynomial Zeros
and Its Relation to Generalized Rayleigh Iteration," Numer. Math. 14, 252-263 (1970).
PROBLEMS
7.1. For the functions and starting intervals below solve $f(x) = 0$ using the bisection method. Use stopping criterion (7.13d) with $\epsilon = 0.005$. Do the calculations with a pocket calculator.
(a) $f(x) = \log_e x + 2x + 1$, $[a_0, b_0] = [0.2, 0.3]$.
(b) $f(x) = x^3 - \cos x$, $[a_0, b_0] = [0.8, 1.0]$.
(c) $f(x) = x - e^{-x/5}$, $[a_0, b_0] = [\frac{1}{2}, 1]$.
(d) $f(x) = x^6 - x - 1$, $[a_0, b_0] = [1, \frac{3}{2}]$.
(e) $f(x) = \frac{\sin x}{x} + e^{-x}$, $[a_0, b_0] = [3, 4]$.
(f) $f(x) = \frac{\sin x}{x} - x + 1$, $[a_0, b_0] = [1, 2]$.
7.2. Consider $f(x) = \sin(x)/x$. This function has a minimum value for some $x \in [\pi, 2\pi]$. Use the bisection method to find this $x$. Use the stopping criterion in (7.13d) with $\epsilon = 0.005$. Use a pocket calculator to do the computations.
7.3. This problem introduces the variation on the bisection method called regula falsi (or the method of false position). Suppose that $[a_0, b_0]$ brackets the root $p$ [i.e., $f(p) = 0$]. Thus, $f(a_0)f(b_0) < 0$. The first estimate of the root $p$, denoted by $p_0$, is where the line joining the points $(a_0, f(a_0))$ and $(b_0, f(b_0))$ crosses the $x$ axis.
(a) Show that
$$p_0 = a_0 - \frac{b_0 - a_0}{f(b_0) - f(a_0)} f(a_0).$$
(b) Using stopping criterion (7.13d), write pseudocode for the method of false position.
7.4. In certain signal detection problems (e.g., radar or sonar) the probability of false alarm (FA) (i.e., of saying that a certain signal is present in the data when it actually is not) is given by
$$P_{FA} = \int_{\eta}^{\infty} \frac{1}{\Gamma(p/2)\, 2^{p/2}}\, t^{p/2-1} e^{-t/2}\, dt, \qquad (7.P.1)$$
where $\eta$ is called the detection threshold. If $p$ is an even number, it can be shown that (7.P.1) reduces to the finite series
$$P_{FA} = e^{-\eta/2} \sum_{k=0}^{(p/2)-1} \frac{1}{k!} \left( \frac{\eta}{2} \right)^k. \qquad (7.P.2)$$
The detection threshold $\eta$ is a very important design parameter in signal detectors. Often it is desired to specify an acceptable value for $P_{FA}$ (where $0 < P_{FA} < 1$), and then it is necessary to solve nonlinear equation (7.P.2) for $\eta$. Let $p = 6$. Use the bisection method to find $\eta$ for
(a) $P_{FA} = 0.001$
(b) $P_{FA} = 0.01$
(c) $P_{FA} = 0.1$
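A bisection sketch for (7.P.2) with $p = 6$ follows; the bracket and tolerance are this sketch's choices, relying on the fact that $P_{FA}$ is monotone decreasing in $\eta$.

```python
import math

def p_fa(eta, p=6):
    # Finite series (7.P.2), valid for even p.
    return math.exp(-eta / 2.0) * sum((eta / 2.0)**k / math.factorial(k)
                                      for k in range(p // 2))

def threshold(target, lo=0.0, hi=60.0, tol=1e-10):
    # P_FA decreases from 1 toward 0, so bisect on p_fa(eta) - target.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_fa(mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

eta = threshold(0.01)
```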
7.5. We wish to solve
$$f(x) = x^4 - \frac{5}{2}x^3 + \frac{5}{2}x - 1 = 0$$
using a fixed-point method. This requires finding $g(x)$ such that $x = g(x)$ is equivalent to $f(x) = 0$. Find four different functions $g(x)$.
7.6. Can the fixed-point method be used to find the solution to
$$f(x) = x^6 - x - 1 = 0$$
for the root located in the interval $[1, \frac{3}{2}]$? Explain.
7.7. Consider the nonlinear equation
$$f(x) = x - e^{-x/5} = 0$$
(which has a solution on the interval $[\frac{1}{2}, 1]$). Use (7.32) to estimate $\alpha$. Recalling that $x_n = g^n x_0$ (in the fixed-point method), if $x_0 = 1$, then use (7.22a) to estimate $n$ so that $d(x_n, x) \le 0.005$ [$x$ is the root of $f(x) = 0$]. Use a pocket calculator to compute $x_1, \ldots, x_n$.
7.8. Consider the nonlinear equation
$$f(x) = x - 1 - \frac{1}{2}e^{-x} = 0$$
(which has a solution on the interval $[1, 1.2]$). Use (7.32) to estimate $\alpha$. Recalling that $x_n = g^n x_0$ (in the fixed-point method), if $x_0 = 1$, then use (7.22a) to estimate $n$ so that $d(x_n, x) \le 0.001$ [$x$ is the root of $f(x) = 0$]. Use a pocket calculator to compute $x_1, \ldots, x_n$.
7.9. Problem 5.14 (in Chapter 5) mentioned the fact that orthogonal polynomials possess the "interleaving of zeros" property. Use this property and the bisection method to derive an algorithm to find the zeros of all Legendre polynomials $P_n(x)$ for $n = 1, 2, \ldots, N$. Express the algorithm in pseudocode. Be fairly detailed about this.
7.10. We wish to find all of the roots of

f(x) = x^3 - 3x^2 + 4x - 2 = 0.

There is one real-valued root, and two complex-valued roots. It is easy to
confirm that f(1) = 0, but use

g(x) = (x^3 + 3x^2 - 3x + 2)/(2x^2 + 1)

to estimate the real root p using fixed-point iteration [i.e., p_{n+1} = g(p_n)].
Using a pocket calculator, compute only p_1, p_2, p_3, and p_4, and use the
starting point p_0 = 2. Also, use the Newton-Raphson method to estimate
the real root. Again choose p_0 = 2, and compute only p_1, p_2, p_3, and p_4.
Once the real root is found, finding the complex-valued roots is easy. Find
the complex-valued roots by making use of the formula for the roots of a
quadratic equation.
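A sketch of the Newton-Raphson computation and the final step (Python for illustration); the deflated quadratic x^2 - 2x + 2 used here follows from dividing f(x) by (x - 1):

```python
import cmath

def f(x):  return x**3 - 3*x**2 + 4*x - 2
def fp(x): return 3*x**2 - 6*x + 4   # f'(x); its discriminant is negative, so it never vanishes

# Newton-Raphson from p_0 = 2 for the real root.
p = 2.0
for _ in range(20):
    p = p - f(p) / fp(p)
# p is now essentially the real root x = 1.

# Deflation: f(x) = (x - 1)(x^2 - 2x + 2); quadratic formula on the second factor.
a, b, c = 1.0, -2.0, 2.0
disc = cmath.sqrt(b * b - 4 * a * c)
r1 = (-b + disc) / (2 * a)   # 1 + j
r2 = (-b - disc) / (2 * a)   # 1 - j
print(p, r1, r2)
```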
7.11. Consider Eq. (7.36). Via Theorem 3.3, there is an α_n between the root p (f(p) =
0) and the iterate p_n such that

f(p) - f(p_n) = f^{(1)}(α_n)(p - p_n).

(a) Show that

p - p_n = (p_n - p_{n-1}) · [f(p_n) f^{(1)}(p_{n-1})] / [f(p_{n-1}) f^{(1)}(α_n)].

[Hint: Use the identity 1 = [f(p_{n-1}) f^{(1)}(p_{n-1})] / [f(p_{n-1}) f^{(1)}(p_{n-1})].]

(b) With A_n = f^{(1)}(p_{n-1})/f^{(1)}(α_n), argue that if convergence is occurring, we have

lim_{n→∞} |A_n| = 1.

Hence lim_{n→∞} (p - p_n) / [(p_n - p_{n-1}) f(p_n)/f(p_{n-1})] = 1.

(c) An alternative stopping criterion for the Newton-Raphson method is to
stop iterating when

|f(p_n)| + |p_n - p_{n-1}| < ε

for some suitably small ε > 0. Is this criterion preferable to (7.42)?
Explain.
7.12. For the functions listed below, and for the stated starting value p_0, use
the Newton-Raphson method to solve f(p) = 0. Use the stopping crite-
rion (7.42a) with ε = 0.001. Perform all calculations using only a pocket
calculator.
(a) f(x) = x + tan x, p_0 = 2.
(b) f(x) = x^6 - x - 1, p_0 = 1.5.
(c) f(x) = x^3 - cos x, p_0 = 1.
(d) f(x) = x - e^{-x/5}, p_0 = 1.
7.13. Use the Newton-Raphson method to find the real-valued root of the polyno-
mial equation

f(x) = 1 + (1/2)x + (1/6)x^2 + (1/24)x^3 = 0.

Choose starting point p_0 = -2. Iterate 4 times. [Comment: Polynomial f(x)
arises in the stability analysis of a numerical method for solving ordinary
differential equations. This will be seen in Chapter 10, Eq. (10.83).]
7.14. Write a MATLAB function to solve Problem 7.4 using the Newton-Raphson
method. Use the stopping criterion (7.42a).
7.15. Prove the following theorem (Newton-Raphson error formula). Let f(x) ∈
C^2[a, b], and f(p) = 0 for some p ∈ [a, b]. For p_n ∈ [a, b] with

p_{n+1} = p_n - f(p_n)/f^{(1)}(p_n),

there is a ξ_n between p and p_n such that

p - p_{n+1} = -(1/2)(p - p_n)^2 f^{(2)}(ξ_n)/f^{(1)}(p_n).

[Hint: Consider the Taylor series expansion of f(x) about the point x = p_n

f(x) = f(p_n) + (x - p_n)f^{(1)}(p_n) + (1/2)(x - p_n)^2 f^{(2)}(ξ_n),

and then set x = p.]
7.16. This problem is about two different methods to compute √x. To begin, recall
that if x is a binary floating-point number (Chapter 2), then it has the form

x = x_0.x_1 ⋯ x_t × 2^e,

where, since x > 0, we have x_0 = 0, and because of normalization x_1 = 1.
Generally, x_k ∈ {0, 1}, and e is the exponent. If e = 2k (i.e., the exponent
is even), we do not adjust x. But if e = 2k + 1 (i.e., the exponent is odd),
we shift the mantissa to the right by one bit position so that e becomes
e = 2k + 2. Thus, x now has the form

x = a × 2^e,

where a ∈ [1/4, 1) in general, and e is an even number. Immediately, √x =
√a × 2^{e/2}. From this description we see that any square root algorithm need
work with arguments only on the interval [1/4, 1] without loss of generality
(w.l.o.g.).
(a) Finding the square root of x = a is the same problem as solving f(x) =
x^2 - a = 0. Show that the Newton-Raphson method for doing so yields
the iterative algorithm

p_{n+1} = (1/2)(p_n + a/p_n),    (7.P.3)

where p_n → √a. [Comment: It can be shown via an argument based
on the theorem in Problem 7.15 that to ensure convergence, we should
set p_0 = (1/3)(2a + 1). A simpler choice for the starting value is p_0 =
(1/2)(1/4 + 1) = 5/8 (since we know a ∈ [1/4, 1]).]
(b) Mikami et al. (1992) suggest an alternative algorithm to find √x. They
recommend that for a ∈ [1/4, 1] the square root of a be obtained by the
algorithm

p_{n+1} = β(a - p_n^2) + p_n.    (7.P.4)

Define the error sequence (e_n) as e_n = √a - p_n. Show that

e_{n+1} = βe_n^2 + (1 - 2β√a)e_n.

[Comment: Mikami et al. recommend that p_0 = 0.666667a + 0.354167.]
(c) What condition on β gives quadratic convergence for the algorithm in
(7.P.4)?
(d) Some microprocessors are intended for applications in high-speed digital
signal processing. As such, they tend to be fixed-point machines with a
high-speed hardware multiplier, but no divider unit. Floating-point arith-
metic tends to be avoided in this application context. In view of this,
what advantage might (7.P.4) have over (7.P.3) as a means to compute
square roots?
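The two iterations can be compared in a few lines (Python, purely for illustration). The starting values are those quoted above; β = 0.7 here is just an illustrative fixed value satisfying the linear-convergence condition |1 - 2β√a| < 1 over all of [1/4, 1], not the quadratic-convergence choice asked for in part (c):

```python
import math

a = 0.5  # argument in [1/4, 1]

# (7.P.3): Newton-Raphson with the simple start p_0 = 5/8; needs one division per step.
p = 5.0 / 8.0
for _ in range(8):
    p = 0.5 * (p + a / p)

# (7.P.4): Mikami et al. iteration; multiplications and additions only, no division.
beta = 0.7
q = 0.666667 * a + 0.354167
for _ in range(40):
    q = beta * (a - q * q) + q

print(p, q, math.sqrt(a))
```

Newton converges quadratically in a handful of steps; the division-free iteration needs more steps for a fixed β, which is the tradeoff part (d) is probing on hardware without a divider unit.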
7.17. Review Problem 7.10. Use x = x_0 + jx_1 (x_0, x_1 ∈ R) to rewrite the equation
f(x) = x^3 - 3x^2 + 4x - 2 = 0 in the form

f_0(x_0, x_1) = 0,  f_1(x_0, x_1) = 0.
Use the Newton-Raphson method (as implemented in MATLAB) to solve
this nonlinear system of equations for the complex roots of f(x) = 0. Use the
starting points x^{(0)} = [x_{0,0} x_{0,1}]^T = [2 2]^T and x^{(0)} = [2 -2]^T.
Output six iterations in both cases.
7.18. Consider the nonlinear system of equations

f(x, y) = x^2 + y^2 - 1 = 0,
g(x, y) = (1/4)x^2 + 4y^2 - 1 = 0.

(a) Sketch f and g on the (x, y) plane.
(b) Solve for the points of intersection of the two curves in (a) by hand
calculation.
(c) Write a MATLAB function that uses the Newton-Raphson method to
solve for the points of intersection. Use the starting vectors [x_0 y_0]^T =
[±1 ±1]^T. Output six iterations in all four cases.
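A Python sketch of part (c) (the problem asks for MATLAB; a hand-coded 2 × 2 solve stands in for the linear algebra). The Jacobian entries follow directly from differentiating f and g:

```python
def F(x, y):
    # The two residuals f(x, y) and g(x, y).
    return (x * x + y * y - 1.0,
            0.25 * x * x + 4.0 * y * y - 1.0)

def newton2(x, y, iters=6):
    for _ in range(iters):
        f1, f2 = F(x, y)
        # Jacobian of (f, g)
        j11, j12 = 2.0 * x, 2.0 * y
        j21, j22 = 0.5 * x, 8.0 * y
        det = j11 * j22 - j12 * j21
        # Newton step: J * delta = -F, solved by Cramer's rule
        dx = (-f1 * j22 + f2 * j12) / det
        dy = (-f2 * j11 + f1 * j21) / det
        x, y = x + dx, y + dy
    return x, y

# Four starting vectors [+-1, +-1]^T, as in the problem statement
roots = [newton2(sx, sy) for sx in (1.0, -1.0) for sy in (1.0, -1.0)]
```

The four runs converge to the four intersection points (±2/√5, ±1/√5), which is what the hand calculation of part (b) should produce.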
7.19. If y_{n+1} = 1 - by_n^2 and x_n = ((1/4)λ - 1/2)y_n + 1/2 for n ∈ Z^+, then, if b =
(1/4)λ^2 - (1/2)λ, show that

x_{n+1} = λx_n(1 - x_n).
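The claimed change of variables is easy to check numerically before proving it; λ = 3 and the starting value are arbitrary test choices:

```python
lam = 3.0
b = 0.25 * lam**2 - 0.5 * lam          # b = lambda^2/4 - lambda/2
c = 0.25 * lam - 0.5                   # x_n = c*y_n + 1/2

y = 0.3                                # arbitrary starting value
x = c * y + 0.5
for _ in range(10):
    y_next = 1.0 - b * y * y
    x_next = c * y_next + 0.5
    # the transformed iterate must obey the logistic map x_{n+1} = lam*x_n*(1-x_n)
    assert abs(x_next - lam * x * (1.0 - x)) < 1e-12
    y, x = y_next, x_next
```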
7.20. De Angeli et al. [18] suggest an alternative synchronization scheme (i.e.,
alternative to deadbeat synchronization). This works as follows. Suppose
that at the transmitter

x_{0,n+1} = 1 - αx_{0,n}^2 + x_{1,n},
x_{1,n+1} = βx_{0,n},
y_n = 1 - αx_{0,n}^2.

The expression for y_n here replaces that in (7.86). At the receiver

x̂_{0,n+1} = x̂_{1,n} + y_n,
x̂_{1,n+1} = βx̂_{0,n}.

(a) Error sequences are defined as

δx_{i,n} = x̂_{i,n} - x_{i,n}

for i ∈ {0, 1}. Find conditions on α and β giving lim_{n→∞} ‖δx_n‖ = 0
(δx_n = [δx_{0,n} δx_{1,n}]^T).
(b) Prove that using y_n = 1 - αx_{0,n}^2 to synchronize the receiver and trans-
mitter is not a secure synchronization method [i.e., an eavesdropper may
collect enough elements from (y_n) to solve for the key {α, β}].
7.21. The chaotic encryption scheme of De Angeli et al. [18], which employs
deadbeat synchronization, was shown to be vulnerable to a known-plaintext
attack, assuming a polynomial message sequence (recall Section 7.6). Show
that it is vulnerable to a known-plaintext attack when the message sequence
is given by

s_n = a sin(ωn + φ).

Assuming that the eavesdropper already knows ω and φ, show that a, α, and
β can be obtained by solving a third-order linear system of equations. How
many cyphertext elements c_n are needed to solve the system?
7.22. Write, compile, and run a C program to implement the De Angeli et al.
chaotic encryption/decryption scheme in Section 7.6. (C is suggested here,
as I am not certain that MATLAB can do the job so easily.) Implement both
the encryption and decryption algorithms in the same program. Keep the
program structure simple and direct (i.e., avoid complicated data structures,
and difficult pointer operations). The program input is plaintext from a file,
and the output is decrypted cyphertext (also known as "recovered plaintext")
that is written to another file. The user is to input the encryption key {α, β},
and the decryption key {α_1, β_1}, at the terminal. Test your program out on
some plaintext file of your own making. It should include keyboard charac-
ters other than letters of the alphabet and numbers. Of course, your program
must convert character data into a floating-point format. The floating-point
numbers are input to the encryption algorithm. Algorithm output is also a
sequence of floating-point numbers. These are decrypted using the decryp-
tion algorithm, and the resulting floating-point numbers are converted back
into characters. There is a complication involved in decryption. Recall that
nominally the plaintext is recovered according to

s_n = y_{n+2}/c_n

for n = 0, 1, ..., N - 1. However, this is a floating-point operation that
incurs rounding error. It is therefore necessary to implement

s_n = y_{n+2}/c_n + offset,
for which offset is a small positive value (e.g., 0.0001). The rationale is
as follows. Suppose that nominally (i.e., in the absence of roundoff) s n —
yn+i/cn = 113.000000. The number 113 is the ASCII (American Standard
Code for Information Interchange) code for some text character. Rounding in
division may instead yield the value 112.999999. The operation of converting
this number to an integer type will give 112 instead of 1 13. Clearly, the offset
cures this problem. A rounding error that gives, for instance, 113.001000 is
harmless. We mention that if plaintext (s„) is from sampled voice or video,
then these rounding issues are normally irrelevant. Try using the following
key sets:
α = 1.399, β = 0.305,
α_1 = 1.399, β_1 = 0.305,

and

α = 1.399, β = 0.305,
α_1 = 1.389, β_1 = 0.304.
Of course, perfect recovery is expected for the first set, but not the second set.
[Comment: It is to be emphasized that in practice keys must be chosen that
cause the Henon map to be chaotic to good approximation on the computer.
Keys must not lead to an unstable system, a system that converges to a
fixed point, or to one with a short-period oscillation. Key choices leading to
instability will cause floating-point overflow (leading to a crash), while the
other undesirable choices likely generate security hazards. Ideally (and more
practically) the cyphertext array (floating-point numbers) should be written to
a file by an encryption program. A separate decryption program would then
read in the cyphertext from the file and decrypt it, producing the recovered
plaintext, which would be written to another file (i.e., three separate files:
original file of input plaintext, cyphertext file, and recovered plaintext file).]
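The offset fix is easy to demonstrate in a few lines (Python here purely for illustration; the problem itself asks for C):

```python
# Simulated decryption output: the quotient y_{n+2}/c_n should be the ASCII
# code 113 (the character 'q'), but rounding in the division leaves a value
# just below it.
raw = 112.999999

offset = 0.0001
assert int(raw) == 112              # plain truncation loses the character
assert int(raw + offset) == 113     # the small positive offset cures this
assert chr(int(raw + offset)) == 'q'
```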
8
Unconstrained Optimization
8.1 INTRODUCTION
In engineering design it is frequently the case that an optimal design is sought
with respect to some performance criterion or criteria. Problems of this class are
generally referred to as optimal design problems, or mathematical programming
problems. Frequently nonlinearity is involved, and so the term nonlinear program-
ming also appears. There are many subcategories of problem types within this broad
category, and so space limitations will restrict us to an introductory presentation
mainly of so-called unconstrained problems. Even so, only a very few ideas from
within this category will be considered. However, some of the methods treated in
earlier chapters were actually examples of optimal design methods within this cate-
gory. This would include least-squares ideas, for example. In fact, an understanding
of least-squares ideas from the previous chapters helps a lot in understanding the
present one. Although the emphasis here is on unconstrained optimization problems,
some consideration of constrained problems appears in Section 8.5.
8.2 PROBLEM STATEMENT AND PRELIMINARIES
In this chapter we consider the problem

min_{x ∈ R^n} f(x)    (8.1)

for which f(x) ∈ R. This notation means that we wish to find a vector x that min-
imizes the function f(x). We shall follow previous notational practices, so here
x = [x_0 x_1 ... x_{n-2} x_{n-1}]^T. In what follows we shall usually assume that f(x)
possesses all first- and second-order partial derivatives (with respect to the elements
of x), and that these are continuous functions, too. In the present context the func-
tion f(x) is often called the objective function. For example, in least-squares prob-
lems (recall Chapter 4) we sought to minimize V(a) = ‖f(x) - Σ_{k=0}^{N-1} a_k φ_k(x)‖^2
with respect to a ∈ R^N. Thus, V(a) is an objective function. The least-squares
problem had sufficient structure so that a simple solution was arrived at; specifi-
cally, to find the optimal vector a (denoted â), all we had to do was solve a linear
system of equations.
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
341
Now we are interested in solving more general problems. For example, we might
wish to find x = [x_0 x_1]^T to minimize Rosenbrock's function

f(x) = 100(x_1 - x_0^2)^2 + (1 - x_0)^2.    (8.2)
This is a famous standard test function (taken here from Fletcher [1]) often used
by those who design optimization algorithms to test out their theories. A contour
plot of this function appears in Fig. 8.1. It turns out that this function has a unique
minimum at x = x̂ = [1 1]^T. As before, we have used a "hat" symbol to indicate
the optimal solution.
Some ideas from vector calculus are essential to understanding nonlinear opti-
mization problems. Of particular importance is the gradient operator

∇ = [∂/∂x_0  ∂/∂x_1  ⋯  ∂/∂x_{n-1}]^T.    (8.3)

For example, if we apply this operator to Rosenbrock's function, then

∇f(x) = [∂f(x)/∂x_0  ∂f(x)/∂x_1]^T
      = [-400x_0(x_1 - x_0^2) - 2(1 - x_0)   200(x_1 - x_0^2)]^T.    (8.4)
Figure 8.1 Contour plot of Rosenbrock's function.
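The gradient formula (8.4) can be spot-checked against finite differences (a sketch; the test point (-1, 2) is arbitrary):

```python
def f(x0, x1):
    # Rosenbrock's function (8.2)
    return 100.0 * (x1 - x0 * x0) ** 2 + (1.0 - x0) ** 2

def grad(x0, x1):
    # Analytic gradient (8.4)
    return (-400.0 * x0 * (x1 - x0 * x0) - 2.0 * (1.0 - x0),
            200.0 * (x1 - x0 * x0))

# The gradient vanishes at the minimizer [1 1]^T ...
assert grad(1.0, 1.0) == (0.0, 0.0)

# ... and matches a central finite difference elsewhere.
h = 1e-6
x0, x1 = -1.0, 2.0
g0, g1 = grad(x0, x1)
assert abs(g0 - (f(x0 + h, x1) - f(x0 - h, x1)) / (2 * h)) < 1e-3
assert abs(g1 - (f(x0, x1 + h) - f(x0, x1 - h)) / (2 * h)) < 1e-3
```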
We observe that ∇f(x̂) = [0 0]^T (recall that x̂ = [1 1]^T). In other words, the
gradient of the objective function at x̂ is zero. Intuitively we expect that if x̂ is
to minimize the objective function, then we should always have ∇f(x̂) = 0. This
turns out to be a necessary but not sufficient condition for a minimum. To see this,
consider g(x) = -x^2 (x ∈ R) for which

∇g(x) = -2x.

Clearly, ∇g(0) = 0, yet x = 0 maximizes g(x) instead of minimizing it. Consider
h(x) = x^3, so that

∇h(x) = 3x^2,

for which ∇h(0) = 0, but x = 0 neither minimizes nor maximizes h(x). In this
case x = 0 corresponds to a one-dimensional version of a saddle point.
Thus, finding x such that ∇f(x) = 0 generally gives us minima, maxima, or
saddle points. These are collectively referred to as stationary points. We need (if
possible) a condition that tells us whether a given stationary point corresponds to
a minimum. Before considering this matter, we note that another problem is that
a given objective function will often have several local minima. This is illustrated
in Fig. 8.2. The function is a quartic polynomial for which there are two values of

Figure 8.2 A function f(x) = x^4 - 3.1x^3 + 2x + 1 with two local minima. The local
minimizers are near x = -0.5 and x = 2.25. The minimum near x = 2.25 is the unique
global minimum of f(x).
x such that ∇f(x) = 0. We might denote these by x_a and x_b. Suppose that x_a is the
local minimizer near x = -0.5 and x_b is the local minimizer near x = 2.25. We see
that f(x_a + δ) > f(x_a) for all sufficiently small (but nonzero) δ ∈ R. Similarly,
f(x_b + δ) > f(x_b) for all sufficiently small (but nonzero) δ ∈ R. But we see that
f(x_b) < f(x_a), so x_b is the unique global minimizer for f(x). An algorithm for
minimizing f(x) should seek x_b; that is, in general, we seek the global minimizer
(assuming that it is unique). Except in special situations, an optimization algorithm
is usually not guaranteed to do better than find a local minimizer, and is not
guaranteed to find a global minimizer. Note that it is entirely possible for the
global minimizer not to be unique. A simple example is f(x) = sin(x) for which
the stationary points are x = (2k + 1)π/2 (k ∈ Z). The minimizers that are a subset
of these points all give -1 for the value of sin(x).
To determine whether a stationary point is a local minimizer, it is useful to have
the Hessian matrix

∇^2 f(x) = [∂^2 f(x)/∂x_i ∂x_j]_{i,j=0,1,...,n-1} ∈ R^{n×n}.    (8.5)

This matrix should not be confused with the Jacobian matrix (seen in Section 7.5.2).
The Jacobian and Hessian matrices are not the same. The veracity of this may be
seen by comparing their definitions. For the special case of n = 2, we have the
2 × 2 Hessian matrix

∇^2 f(x) = [ ∂^2 f(x)/∂x_0^2      ∂^2 f(x)/∂x_0 ∂x_1
             ∂^2 f(x)/∂x_1 ∂x_0   ∂^2 f(x)/∂x_1^2   ].    (8.6)

If f(x) in (8.6) is Rosenbrock's function, then (8.6) becomes

∇^2 f(x) = [ 1200x_0^2 - 400x_1 + 2   -400x_0
             -400x_0                   200     ].    (8.7)
The Hessian matrix for f(x) helps in the following manner. Suppose that x̂ is
a local minimizer for objective function f(x). We may Taylor-series-expand f(x)
around x̂ according to

f(x̂ + h) = f(x̂) + h^T ∇f(x̂) + (1/2)h^T [∇^2 f(x̂)]h + ⋯,    (8.8)

where h ∈ R^n. (This will not be formally justified.) If f(x) is sufficiently smooth
and ‖h‖ is sufficiently small, then the terms in (8.8) that are not shown may
be entirely ignored (i.e., we neglect higher-order terms). In other words, in the
neighborhood of x̂

f(x̂ + h) ≈ f(x̂) + h^T ∇f(x̂) + (1/2)h^T [∇^2 f(x̂)]h    (8.9)
is assumed. But this is the now familiar quadratic form. For convenience we will
(as did Fletcher [1]) define

G(x) = ∇^2 f(x),    (8.10)

so (8.9) may be rewritten as

f(x̂ + h) ≈ f(x̂) + h^T ∇f(x̂) + (1/2)h^T G(x̂)h.    (8.11)

Sometimes we will write G instead of G(x) if there is no danger of confusion.
In Chapter 4 we proved that (8.11) has a unique minimum iff G > 0. In words,
f(x) looks like a positive definite quadratic form in the neighborhood of a local
minimizer. Therefore, the Hessian of f(x) at a local minimizer will be positive
definite, and thus represents a way of testing a stationary point to see if it is a
minimizer. (Recall the second-derivative test for the minimum of a single-variable
function from elementary calculus, which is really just a special case of this more
general test.) More formally, we therefore have the following theorem.

Theorem 8.1: A sufficient condition for a local minimizer x̂ is that ∇f(x̂) = 0,
and G(x̂) > 0.
This is a simplified statement of Fletcher's Theorem 2.1.1 [1, p. 14]. The proof
is really just a more rigorous version of the Taylor series argument just given, and
will therefore be omitted. For convenience we will also define the gradient vector

g(x) = ∇f(x).    (8.12)

Sometimes we will write g for g(x) if there is no danger of confusion.
If we recall Rosenbrock's function again, we may now test the claim made earlier
that x̂ = [1 1]^T is a minimizer for f(x) in (8.2). For Rosenbrock's function the
Hessian is in (8.7), and thus we have

G(x̂) = [ 802    -400
         -400    200 ].    (8.13)

The eigenvalues of this matrix are λ = 0.3994, 1002. These are both bigger than
zero, so G(x̂) > 0. We have already remarked that g(x̂) = 0, so immediately from
Theorem 8.1 we conclude that x̂ = [1 1]^T is a local minimizer for Rosenbrock's
function.
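The eigenvalue claim is easy to verify from the 2 × 2 characteristic polynomial (a sketch; the more precise value of the larger eigenvalue is about 1001.6, which the text rounds to 1002):

```python
import math

# Hessian (8.13) of Rosenbrock's function at [1 1]^T
g11, g12, g22 = 802.0, -400.0, 200.0

tr = g11 + g22                 # trace = 1002
det = g11 * g22 - g12 * g12    # determinant = 400
disc = math.sqrt(tr * tr - 4.0 * det)
lam1 = 0.5 * (tr - disc)       # ~0.3994
lam2 = 0.5 * (tr + disc)       # ~1001.6

# Both eigenvalues are positive, so G(x_hat) > 0 and x_hat is a local minimizer.
assert lam1 > 0 and lam2 > 0
print(lam1, lam2)
```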
8.3 LINE SEARCHES
In general, for objective function f(x) we wish to allow x e R"; that is, we seek the
minimum of f(x) by performing a search in an n-dimensional vector space. How-
ever, the one-dimensional problem is an important special case. In this section we
begin by considering n — 1 , so x is a scalar. The problem of finding the minimum
of f(x) [where f(x) ∈ R for all x ∈ R] is sometimes called the univariate search.
Various approaches exist for the solution of this problem, but we will consider only
the golden ratio search method (sometimes also called the golden section search).
We will then consider the backtracking line search [3] for the case where x ∈ R^n
for any n ≥ 1.
Suppose that we know f(x) has a minimum over the interval [x_l^j, x_u^j]. Define
I^j = x_u^j - x_l^j, which is the length of this interval. The index j represents the jth
iteration of the search algorithm, so it represents the current estimate of the interval
that contains the minimum. Select two points x_a^j and x_b^j (x_a^j < x_b^j) such that they
are symmetrically placed in the interval [x_l^j, x_u^j]. Specifically,

x_b^j - x_l^j = x_u^j - x_a^j.    (8.14)

A new interval [x_l^{j+1}, x_u^{j+1}] is created according to the following procedure, such
that for all j

I^j / I^{j+1} = τ > 1.    (8.15)

If f(x_a^j) > f(x_b^j), then the minimum lies in [x_a^j, x_u^j], and so I^{j+1} = x_u^j - x_a^j. Our
new points are given according to

x_l^{j+1} = x_a^j,  x_u^{j+1} = x_u^j,  x_a^{j+1} = x_b^j    (8.16a)

and

x_b^{j+1} = x_l^{j+1} + (1/τ)I^{j+1}.    (8.16b)

If f(x_a^j) < f(x_b^j), then the minimum lies in [x_l^j, x_b^j], and so I^{j+1} = x_b^j - x_l^j. Our
new points are given according to

x_l^{j+1} = x_l^j,  x_u^{j+1} = x_b^j,  x_b^{j+1} = x_a^j    (8.17a)

and

x_a^{j+1} = x_u^{j+1} - (1/τ)I^{j+1}.    (8.17b)

Since I^0 = x_u^0 - x_l^0 (j = 0 indicates the initial case), we must have x_a^0 = x_u^0 - (1/τ)I^0
and x_b^0 = x_l^0 + (1/τ)I^0. Figure 8.3 illustrates the search procedure. This is for the
particular case (8.17).
Because of (8.15), the rate of convergence to the minimum can be estimated.
But we need to know τ; see the following theorem.

Theorem 8.2: The search interval lengths of the golden ratio search algorithm
are related according to

I^j = I^{j+1} + I^{j+2},    (8.18a)
Figure 8.3 Illustrating the golden ratio search procedure.
for which the golden ratio τ is given by

τ = (1/2)(1 + √5) ≈ 1.62.    (8.18b)
Proof  There are four cases to consider in establishing (8.18a). First, suppose
that f(x_a^j) > f(x_b^j), so I^{j+1} = x_u^j - x_a^j with

x_l^{j+1} = x_a^j,  x_u^{j+1} = x_u^j,  x_a^{j+1} = x_b^j,  x_b^{j+1} = x_l^{j+1} + (1/τ)I^{j+1}.

If it happens that f(x_a^{j+1}) > f(x_b^{j+1}), then I^{j+2} = x_u^{j+1} - x_a^{j+1} = x_u^j - x_b^j =
x_a^j - x_l^j [via (8.14)]. This implies that I^{j+1} + I^{j+2} = (x_u^j - x_a^j) + (x_a^j - x_l^j) =
x_u^j - x_l^j = I^j. On the other hand, if it happens that f(x_a^{j+1}) < f(x_b^{j+1}), then
I^{j+2} = x_b^{j+1} - x_l^{j+1} = x_u^{j+1} - x_a^{j+1} = x_a^j - x_l^j [via the symmetry (8.14)
applied at iteration j + 1]. Again we have I^{j+1} + I^{j+2} = x_u^j - x_l^j = I^j.
Now suppose that f(x_a^j) < f(x_b^j). Therefore, I^{j+1} = x_b^j - x_l^j with

x_l^{j+1} = x_l^j,  x_u^{j+1} = x_b^j,  x_b^{j+1} = x_a^j,  x_a^{j+1} = x_u^{j+1} - (1/τ)I^{j+1}.

If it happens that f(x_a^{j+1}) > f(x_b^{j+1}), then I^{j+2} = x_u^{j+1} - x_a^{j+1} = x_b^{j+1} - x_l^{j+1} =
x_a^j - x_l^j, so that I^{j+1} + I^{j+2} = (x_u^j - x_a^j) + (x_a^j - x_l^j) = x_u^j - x_l^j = I^j [via (8.14)].
Finally, suppose that f(x_a^{j+1}) < f(x_b^{j+1}), so therefore I^{j+2} = x_b^{j+1} - x_l^{j+1} = x_a^j - x_l^j,
so yet again we conclude that I^{j+1} + I^{j+2} = x_u^j - x_l^j = I^j. Thus, (8.18a) is
now verified.
Since

I^j / I^{j+1} = I^{j+1} / I^{j+2} = τ  and  I^j = I^{j+1} + I^{j+2},

we have I^{j+1} = (1/τ)I^j and I^{j+2} = (1/τ)I^{j+1} = (1/τ^2)I^j, so

I^j = I^{j+1} + I^{j+2} = (1/τ)I^j + (1/τ^2)I^j,

immediately implying that

τ^2 = τ + 1,

which yields (8.18b).
The golden ratio has a long and interesting history in art as well as science
and engineering, and this is considered in Schroeder [2]. For example, a famous
painting by Seurat contains figures that are proportioned according to this ratio
(Fig. 5.2 in Schroeder [2] includes a sketch).
The golden ratio search algorithm as presented so far assumes a user-provided
starting interval. The golden ratio search algorithm has the same drawback as the
bisection method for root finding (Chapter 7) in that the optimum solution must be
bracketed before the algorithm can be successfully run (in general). On the other
hand, an advantage of the golden ratio search method is that it does not need f(x)
to be differentiable. But the method can be slow to converge, in which case an
improved minimizer that also does not need derivatives can be found in Brent [4].
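A minimal golden-section search, following (8.16)/(8.17); it is applied here to the two-minima quartic of Fig. 8.2 with a bracket around the global minimizer (this pairing of function and bracket is illustrative, not something the text prescribes):

```python
import math

def golden_search(f, xl, xu, tol=1e-10):
    tau = 0.5 * (1.0 + math.sqrt(5.0))   # golden ratio (8.18b)
    xa = xu - (xu - xl) / tau
    xb = xl + (xu - xl) / tau
    while xu - xl > tol:
        if f(xa) > f(xb):
            # minimum lies in [xa, xu]: case (8.16)
            xl, xa = xa, xb
            xb = xl + (xu - xl) / tau
        else:
            # minimum lies in [xl, xb]: case (8.17)
            xu, xb = xb, xa
            xa = xu - (xu - xl) / tau
    return 0.5 * (xl + xu)

f = lambda x: x**4 - 3.1 * x**3 + 2.0 * x + 1.0
xmin = golden_search(f, 1.0, 3.0)   # global minimizer near x = 2.25
```

Note that only function values are compared, never derivatives, which is the advantage mentioned above; the interval shrinks by the fixed factor 1/τ per iteration.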
When setting up an optimization problem it is often advisable to look for ways
to reduce the dimensionality of the problem, if at all possible. We illustrate this
idea with an example that is similar in some ways to the least-squares problem
considered in the beginning of Section 4.6. Suppose that signal f(t) ∈ R (t ∈ R)
is modeled as

f(t) = a sin((2π/T)t + φ) + η(t),    (8.19)

where η(t) is the term that accounts for noise, interference, or measurement errors
in the data f(t). In other words, our data are modeled as a sinusoid plus noise.
The amplitude a, phase φ, and period T are unknown parameters. We are assumed
only to possess samples of the signal; thus, we have only the sequence elements

f_n = f(nT_s) = a sin((2π/T)nT_s + φ) + η(nT_s)    (8.20)

for n = 0, 1, ..., N - 1. The sampling period parameter T_s is assumed to be
known. As before, we may define an error sequence

e_n = f_n - a sin((2π/T)nT_s + φ),    (8.21)
where again n = 0, 1, ..., N - 1. The objective function for our problem (using a
least-squares criterion) would therefore be

V(a, T, φ) = Σ_{n=0}^{N-1} [f_n - a sin((2π/T)nT_s + φ)]^2.    (8.22)

This function depends on the unknown model parameters a, T, and φ in a very non-
linear manner. We might formally define our parameter vector to be x = [a T φ]^T ∈
R^3. This would lead us to conclude that we must search (by some means) a three-
dimensional space to find x̂ to minimize (8.22). But in this special problem it is
possible to reduce the dimensionality of the search space from three dimensions
down to only one dimension. Reducing the problem in this way makes it solvable
using the golden section search method that we have just considered.
Recall the trigonometric identity

sin(A + B) = sin A cos B + cos A sin B.    (8.23)

Using this, we may write

a sin((2π/T)nT_s + φ) = a cos φ sin((2π/T)nT_s) + a sin φ cos((2π/T)nT_s).    (8.24)

Define x = [x_0 x_1]^T and

v_n = [sin((2π/T)nT_s)  cos((2π/T)nT_s)]^T.    (8.25)

We may rewrite e_n in (8.21) using these vectors:

e_n = f_n - v_n^T x.    (8.26)

Note that the approach used here is the same as that used to obtain e_n in Eq. (4.99).
Therefore, the reader will probably find it useful to review this material now.
Continuing in this fashion, the error vector e = [e_0 e_1 ⋯ e_{N-1}]^T, data vector f =
[f_0 f_1 ⋯ f_{N-1}]^T, and matrix of basis vectors

A = [v_0 v_1 ⋯ v_{N-1}]^T ∈ R^{N×2}

all may be used to write

e = f - Ax,    (8.27)
for which our objective function may be rewritten as

V(x) = e^T e = f^T f - 2x^T A^T f + x^T A^T Ax,    (8.28)

which implicitly assumes that we already know the period T. If T were known
in advance, we could use the method of Chapter 4 to minimize (8.28); that is,
the optimum choice (least-squares sense) for x (denoted by x̂) satisfies the linear
system

A^T A x̂ = A^T f.    (8.29)

Because of (8.24) (with x = [x_0 x_1]^T), we obtain

x_0 = a cos φ,  x_1 = a sin φ.    (8.30)

Since we have x̂ from the solution to (8.29), we may use (8.30) to solve for a
and φ.
However, our original problem specified that we do not know a, φ, or T in
advance. So, how do we exploit these results to determine T as well as a and φ?
The approach is to change how we think about V(x) in (8.28). Instead of thinking
of V(x) as a function of x, consider it to be a function of T, but with x = x̂ as
given by (8.29). From (8.25) we see that v_n depends on T, so A is also a function
of T. Thus, x̂ is a function of T, too, because of (8.29). In fact, we may emphasize
this by rewriting (8.29) as

[A(T)]^T A(T) x̂(T) = [A(T)]^T f.    (8.31)

In other words, A = A(T), and x̂ = x̂(T). The objective function V(x) then becomes

V_1(T) = V(x̂(T)) = f^T f - 2[x̂(T)]^T A^T(T) f + [x̂(T)]^T A^T(T) A(T) x̂(T).    (8.32)

However, we may substitute (8.31) into (8.32) and simplify, with the result that

V_1(T) = f^T f - f^T A(T)[A^T(T) A(T)]^{-1} A^T(T) f.    (8.33)

We have reduced the search space from three dimensions down to one, so we
may apply the golden section search algorithm (or some other univariate search
procedure) to objective function V_1(T). This would result in determining the opti-
mum period T̂ [which minimizes V_1(T)]. We then compute A(T̂), and solve
for x̂ using (8.29) as before. Knowledge of x̂ allows us to determine â and φ̂
via (8.30).
Figure 8.4 illustrates a typical noisy sinusoid and the corresponding objective
function V_1(T). In this case the parameters chosen were T = 24 h, T_s = 5 min,
a = 1, φ = -π/10 radians, and N = 500. The noise component in the data was
created using MATLAB's Gaussian random-number generator. The noise variance
is σ^2 = 0.5, and the mean value is zero. We observe that the minimum value of
V_1(T) certainly corresponds to a value of T that is at or close to 24 h.
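The whole procedure fits in a short script (Python sketch; the book's own implementation is the MATLAB program of Appendix 8.A). This version uses a noiseless test signal, its own illustrative values of T_s and N, and a coarse grid over T in place of the golden section search (either univariate search works); the 2 × 2 normal equations (8.29) are solved by Cramer's rule:

```python
import math

T_true, Ts, N = 24.0, 0.5, 200          # hours; illustrative values
a_true, phi_true = 1.0, -math.pi / 10.0
f = [a_true * math.sin(2.0 * math.pi * n * Ts / T_true + phi_true) for n in range(N)]

def V1(T):
    # Build A(T)^T A(T) and A(T)^T f, solve (8.29), and return V1(T) of (8.33),
    # which simplifies to f^T f - xhat^T A^T f.
    s11 = s12 = s22 = b1 = b2 = 0.0
    for n in range(N):
        sn = math.sin(2.0 * math.pi * n * Ts / T)
        cn = math.cos(2.0 * math.pi * n * Ts / T)
        s11 += sn * sn; s12 += sn * cn; s22 += cn * cn
        b1 += sn * f[n]; b2 += cn * f[n]
    det = s11 * s22 - s12 * s12
    x0 = (b1 * s22 - b2 * s12) / det
    x1 = (b2 * s11 - b1 * s12) / det
    ftf = sum(v * v for v in f)
    return ftf - (x0 * b1 + x1 * b2), x0, x1

# One-dimensional search over T (a grid stands in for golden section here)
T_hat = min((20.0 + 0.01 * k for k in range(801)), key=lambda T: V1(T)[0])
_, x0, x1 = V1(T_hat)
a_hat = math.hypot(x0, x1)              # via (8.30): x0 = a cos(phi), x1 = a sin(phi)
phi_hat = math.atan2(x1, x0)
```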
Figure 8.4 An example of a noisy sinusoid (a) and the corresponding objective function
V_1(T) (b).
Appendix 8.A contains a sample MATLAB program implementation of the
golden section search algorithm applied to the noisy sinusoid problem depicted
in Fig. 8.4. In golden.m, Topt is T̂, and eta is the random noise sequence η(nT_s).
We will now consider the backtracking line search algorithm. The exposition
will be similar to that in Boyd and Vandenberghe [5].
Now we assume that f(x) is our objective function with x ∈ R^n. In a general
line search we seek the minimizing vector sequence (x^{(k)}), k ∈ Z^+, with x^{(k)} ∈ R^n
(i.e., x^{(k)} = [x_0^{(k)} x_1^{(k)} ⋯ x_{n-1}^{(k)}]^T) constructed according to the
iterative process

x^{(k+1)} = x^{(k)} + t^{(k)} s^{(k)},    (8.34)

where t^{(k)} ∈ R^+, and t^{(k)} > 0 except when x^{(k)} is optimal [i.e., minimizes f(x)].
The vector s^{(k)} is called the search direction, and scalar t^{(k)} is called the step size.
Because line searches (8.34) are descent methods, we have

f(x^{(k+1)}) < f(x^{(k)})    (8.35)

except when x^{(k)} is optimal. The "points" x^{(k)}, x^{(k+1)} lie along a line in the
direction s^{(k)} in n-dimensional space R^n, and since the minimum of f(x) must lie in the
direction that satisfies (8.35), we must ensure that s^{(k)} satisfies [recall (8.12)]

g(x^{(k)})^T s^{(k)} = [∇f(x^{(k)})]^T s^{(k)} < 0.    (8.36)
Geometrically, the negative-gradient vector -g(x^{(k)}) (which "points down") makes
an acute angle (i.e., one of magnitude < 90°) with the vector s^{(k)}. [Recall (4.130).]
If s^{(k)} satisfies (8.36) for f(x^{(k)}) (i.e., f(x) at x = x^{(k)}), it is called a descent direc-
tion for f(x) at x^{(k)}. A general descent algorithm has the following pseudocode
description:

Specify starting point x^{(0)} ∈ R^n;
k := 0;
while stopping criterion is not met do begin
    Find s^{(k)}; { determine descent direction }
    Find t^{(k)}; { line search step }
    Compute x^{(k+1)} := x^{(k)} + t^{(k)} s^{(k)}; { update step }
    k := k + 1;
end;

Newton's method with the backtracking line search (Section 8.4) is a specific
example of a descent method. There are others, but these will not be considered in
this book.
Now we need to say more about how to choose the step size t^{(k)} on the assump-
tion that s^{(k)} is known. How to determine the direction s^{(k)} is the subject of
Section 8.4.
So far f|R^n → R. Subsequent considerations are simplified if we assume that
f(x) satisfies the following definition.

Definition 8.1: Convex Function  Function f|R^n → R is called a convex
function if for all θ ∈ [0, 1], and for any x, y ∈ R^n,

f(θx + (1 - θ)y) ≤ θf(x) + (1 - θ)f(y).    (8.37)

We emphasize that the domain of definition of f(x) is assumed to be R^n. It is
possible to modify the definition to accommodate f(x) where the domain of f(x)
is a proper subset of R^n. The geometric meaning of (8.37) is shown in Fig. 8.5
for the case where x ∈ R (i.e., one-dimensional case). We see that when f(x)
is convex, the chord, which is the line segment joining (x, f(x)) to (y, f(y)),
always lies above the graph of f(x). Now further assume that f(x) is at least
twice continuously differentiable in all elements of the vector x. We observe that
for any x, y ∈ R^n, if f(x) is convex, then

f(y) ≥ f(x) + [∇f(x)]^T (y - x).    (8.38)

From the considerations of Section 8.2 it is easy to believe that f(x) is convex iff

∇^2 f(x) ≥ 0    (8.39)
Figure 8.5 Graph of a convex function f(x) (x ∈ R) and the chord that connects the points
(x, f(x)) and (y, f(y)).
for all x ∈ R^n; that is, f(x) is convex iff its Hessian matrix is at least positive
semidefinite (recall Definition 4.1). The function f(x) is said to be strongly convex
iff ∇^2 f(x) > 0 for all x ∈ R^n.
The backtracking line search attempts to approximately minimize f(x) along
the line {x + ts | t > 0} for some given s (search direction at x). The pseudocode
for this algorithm is as follows:

Specify the search direction s;
t := 1;
while (f(x + ts) > f(x) + δt[∇f(x)]^T s)
    t := αt;
end;

In this algorithm 0 < δ < 1/2, and 0 < α < 1. Commonly, δ ∈ [0.01, 0.30], and
α ∈ [0.1, 0.5]. These parameter ranges will not be justified here. As suggested
earlier, how to choose s will be the subject of the next section. The method is
called "backtracking" as it begins with t = 1, and then reduces t by the factor α until
f(x + ts) ≤ f(x) + δt∇^T f(x)s. [We have ∇^T f(x) = [∇f(x)]^T.] Recall that s
is a descent direction so that (8.36) holds, specifically, ∇^T f(x)s < 0, and so if t
is small enough [recall (8.9)], then

f(x + ts) ≈ f(x) + t∇^T f(x)s < f(x) + δt∇^T f(x)s,

which shows that the search must terminate eventually. We mention that the back-
tracking line search will terminate even if f(x) is only "locally convex" (convex
on some proper subset S of R^n). This will happen provided x ∈ S in the algorithm.
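A direct transcription of the loop, applied at one point of Rosenbrock's function with the steepest-descent direction s = -g(x) (any descent direction works); the values δ = 0.1 and α = 0.5 are simply choices inside the recommended ranges, and the test point is arbitrary:

```python
def f(x0, x1):
    return 100.0 * (x1 - x0 * x0) ** 2 + (1.0 - x0) ** 2

def grad(x0, x1):
    return (-400.0 * x0 * (x1 - x0 * x0) - 2.0 * (1.0 - x0),
            200.0 * (x1 - x0 * x0))

def backtrack(x0, x1, s0, s1, delta=0.1, alpha=0.5):
    g0, g1 = grad(x0, x1)
    slope = g0 * s0 + g1 * s1          # [grad f(x)]^T s; negative for a descent direction
    t = 1.0
    while f(x0 + t * s0, x1 + t * s1) > f(x0, x1) + delta * t * slope:
        t *= alpha                     # backtrack: shrink the step by the factor alpha
    return t

x0, x1 = -1.2, 1.0
g0, g1 = grad(x0, x1)
t = backtrack(x0, x1, -g0, -g1)        # steepest-descent direction
print(t, f(x0 - t * g0, x1 - t * g1), f(x0, x1))
```

The returned t satisfies the sufficient-decrease condition of the while test, which is exactly the termination argument given above.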
8.4 NEWTON'S METHOD
Section 8.3 suggests attempting to reduce an n-dimensional search space to a
one-dimensional search space. Of course, this approach seldom works, which is
354 UNCONSTRAINED OPTIMIZATION
why there is an elaborate body of methods on searching for minima in higher-
dimensional spaces. However, these methods are too involved to consider in detail
in this book, and so we will only partially elaborate on an idea from Section 8.2.
The quadratic model from Section 8.2 suggests an approach often called Newton's
method. Suppose that x^(k) is the current estimate of the sought-after optimum
x. Following (8.11), we have the Taylor approximation

f(x^(k) + δ) ≈ V(δ) = f(x^(k)) + δ^T g(x^(k)) + (1/2) δ^T G(x^(k)) δ   (8.40)

for which δ ∈ R^n since x^(k) ∈ R^n. Since x^(k) is not necessarily the minimum x,
usually g(x^(k)) ≠ 0. Vector δ is selected to minimize V(δ), and since this is a
quadratic form, if G(x^(k)) > 0, then

G(x^(k)) δ = −g(x^(k)).   (8.41)

The next estimate of x is given by

x^(k+1) = x^(k) + δ.   (8.42)
Pseudocode for Newton's method (in its most basic form) is

Input starting point x^(0);
k := 0;
while stopping criterion is not met do begin
    Solve G(x^(k)) δ = −g(x^(k)) for δ;
    x^(k+1) := x^(k) + δ;
    k := k + 1;
end;
The algorithm will terminate (if all goes well) with the last vector x^(k) as a good
approximation to x. However, the Hessian G^(k) = G(x^(k)) may not always be
positive definite, in which case this method can be expected to fail. Modification
of the method is required to guarantee that at least it will converge to a local
minimum. Said modifications often involve changing the method to work with line
searches (i.e., Section 8.3 ideas). We will now say more about this.
As suggested in Ref. 5, we may combine the backtracking line search with the
basic form of Newton's algorithm described above. A pseudocode description of
the result is
Input starting point x^(0), and a tolerance ε > 0;
k := 0;
s^(0) := −[G(x^(0))]^{-1} g(x^(0));  { search direction at x^(0) }
λ² := −g^T(x^(0)) s^(0);
while λ² > ε do begin
    Use backtracking line search to find t^(k) for x^(k) and s^(k);
    x^(k+1) := x^(k) + t^(k) s^(k);
    s^(k+1) := −[G(x^(k+1))]^{-1} g(x^(k+1));
    λ² := −g^T(x^(k+1)) s^(k+1);
    k := k + 1;
end;
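To make the combined algorithm concrete, the following Python sketch (ours; the book's own implementation is in MATLAB) applies it to Rosenbrock's function f(x) = 100(x_1 − x_0²)² + (1 − x_0)². The gradient and Hessian are worked out by hand, and the 2×2 Newton system G s = −g is solved by Cramer's rule:

```python
def rosenbrock(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):   # gradient g(x) of Rosenbrock's function
    return [-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2)]

def hess(x):   # Hessian G(x) of Rosenbrock's function
    return [[1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
            [-400.0 * x[0], 200.0]]

def newton_backtracking(x, eps=1e-5, delta=0.3, alpha=0.5, max_iter=100):
    for _ in range(max_iter):
        g, G = grad(x), hess(x)
        det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
        # Newton direction s solves G s = -g (Cramer's rule on the 2x2 system)
        s = [(-g[0] * G[1][1] + g[1] * G[0][1]) / det,
             (-g[1] * G[0][0] + g[0] * G[1][0]) / det]
        lam2 = -(g[0] * s[0] + g[1] * s[1])   # the lambda^2 stopping criterion
        if lam2 <= eps:
            break
        t = 1.0   # backtracking line search along s (note g^T s = -lam2)
        while rosenbrock([x[0] + t * s[0], x[1] + t * s[1]]) > rosenbrock(x) - delta * t * lam2:
            t *= alpha
        x = [x[0] + t * s[0], x[1] + t * s[1]]
    return x
```

With x^(0) = [−1 1]^T and the same parameters as Example 8.1 below (δ = 0.3, α = 0.5, ε = 10^{-5}), the iterates follow the valley of Rosenbrock's function toward the minimizer [1 1]^T.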
The algorithm assumes that G(x^(k)) > 0 for all k. In this case [G(x^(k))]^{-1} > 0 as
well. If we define (for all x ∈ R^n)

‖x‖²_G(y) = x^T [G(y)]^{-1} x,   (8.43)

then ‖x‖_G(y) satisfies the norm axioms (recall Definition 1.3), and is in fact an
example of a weighted norm. But why do we consider λ² as a stopping criterion in
Newton's algorithm? If we recall the term (1/2)δ^T G(x^(k))δ in (8.40), since δ satisfies
(8.41), at step k we must have

(1/2) δ^T G(x^(k)) δ = (1/2) g^T(x^(k)) [G(x^(k))]^{-1} g(x^(k)) = (1/2) ‖g(x^(k))‖²_G(x^(k)) = (1/2) λ².   (8.44)

Estimate x^(k) of x is likely to be good if (8.44) in particular is small [as opposed
to merely considering the squared unweighted norm g^T(x^(k)) g(x^(k))]. It is known that
Newton's algorithm can converge rapidly (quadratic convergence). An analysis
showing this appears in Boyd and Vandenberghe [5] but is omitted here.
As it stands, Newton's method is computationally expensive since we must solve
the linear system in (8.41) at every step k. This would normally involve applying
the Cholesky factorization algorithm that was first mentioned (but not considered
in detail) in Section 4.6. We remark in passing that the Cholesky algorithm will
factorize G^(k) according to

G^(k) = L D L^T,   (8.45)

where L is unit lower triangular and D is a diagonal matrix. We also mention that
G^(k) > 0 iff the elements of D are all positive, so the Cholesky algorithm provides
a built-in positive definiteness test. The decomposition in (8.45) is a variation on
the LU decomposition of Chapter 4. Recall that we proved in Section 4.5 that
positive definite matrices always possess such a factorization (see Theorems 4.1
and 4.2), so Eq. (8.45) is consistent with this result. The necessity to solve a linear
system of equations at every step makes us wonder whether sensitivity to ill-conditioned
matrices is a problem in Newton's method. It turns out that the method is often
surprisingly resistant to ill conditioning (at least as reported in Ref. 5).
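The LDL^T factorization and its built-in positive definiteness test are easy to sketch in code. The following pure-Python routine (ours, not from the text) implements the standard recurrences for symmetric input:

```python
def ldlt(A):
    # Factor a symmetric matrix A as A = L D L^T, with L unit lower
    # triangular and D diagonal.  A is positive definite iff every D[j] > 0.
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D
```

For A = [[4, 2], [2, 3]] this gives D = [4, 2] and L[1][0] = 0.5, so A > 0; for the indefinite matrix [[1, 2], [2, 1]] it gives D = [1, −3], and the negative entry of D flags the failure of positive definiteness.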
Example 8.1 A typical run of Newton's algorithm with the backtracking line
search as applied to the problem of minimizing Rosenbrock's function [Eq. (8.2)]
yields the following output:
 k     t^(k)     λ²           x_0^(k)   x_1^(k)
 0     1.0000    800.00499    2.0000    2.0000
 1     0.1250    1.98757      1.9975    3.9900
 2     1.0000    0.41963      1.8730    3.4925
 3     1.0000    0.49663      1.6602    2.7110
 4     0.5000    0.38333      1.5945    2.5382
 5     1.0000    0.21071      1.4349    2.0313
 6     0.5000    0.14763      1.3683    1.8678
 7     1.0000    0.07134      1.2707    1.6031
 8     1.0000    0.03978      1.1898    1.4092
 9     1.0000    0.01899      1.1076    1.2201
10     1.0000    0.00627      1.0619    1.1255
11     1.0000    0.00121      1.0183    1.0350
12     1.0000    0.00006      1.0050    1.0099
The search parameters selected in this example are α = 0.5, δ = 0.3, and ε =
0.00001, and the final estimate is x^(13) = [1.0002 1.0003]^T. For these same parameters,
if instead x^(0) = [−1 1]^T, then 18 iterations are needed, yielding x^(18) =
[0.9999 0.9998]^T. Figure 8.6 shows the sequence of points x^(k) for the case
x^(0) = [−1 1]^T. The dashed line shows the path from starting point [−1 1]^T to
the minimum at [1 1]^T, and we see that the algorithm follows the "valley" to the
optimum solution quite well.
Figure 8.6 The sequence of points x^(k) generated by Newton's method with the backtracking
line search as applied to Rosenbrock's function using the parameters α = 0.5,
δ = 0.3, and ε = 0.00001 with x^(0) = [−1 1]^T (see Example 8.1). The path followed is
shown by the dashed line.
8.5 EQUALITY CONSTRAINTS AND LAGRANGE MULTIPLIERS
In this section we modify the original optimization problem in (8.1) according to

min_{x∈R^n} f(x)  subject to f_i(x) = 0 for all i = 0, 1, ..., m − 1,   (8.46)

where f(x) is the objective function as before, and f_i(x) = 0 for i = 0, 1, ...,
m − 1 are the equality constraints. The functions f_i : R^n → R are equality constraint
functions. The set F = {x | f_i(x) = 0, i = 0, ..., m − 1} is called the feasible set.
We are interested in

f* = f(x*) = min_{x∈F} f(x).   (8.47)

There may be more than one x = x* ∈ R^n satisfying (8.47); that is, the set

X* = {x | x ∈ F, f(x) = f*}   (8.48)

may have more than one element in it. We assume that our problem yields X* with
at least one element in it (i.e., X* ≠ ∅).
Equation (8.47) is really a more compact statement of (8.46), and in words states
that any minimizer x* of f(x) must also satisfy the equality constraints. We recall
examples of this type of problem from Chapter 4 [e.g., the problem of deriving a
computable expression for κ₂(A), and in the proof of Theorem 4.5]. More examples
will be seen later. Generally, in engineering, constrained optimization problems are
more common than unconstrained problems. However, it is important to understand
that algorithms for unconstrained problems form the core of algorithms for solving
constrained problems.
We now wish to make some general statements about how to solve (8.46), and
in so doing we introduce the concept of Lagrange multipliers. The arguments to
follow are somewhat heuristic, and they follow those of Section 9.1 in Fletcher [1].
Suppose that x* ∈ R^n is at least a local minimizer for objective function f(x) ∈
R. Analogously to (8.9), we have

f_i(x* + δ) ≈ f_i(x*) + g_i^T(x*) δ + (1/2) δ^T [∇²f_i(x*)] δ,   (8.49)

where δ ∈ R^n is some incremental step away from x*, and g_i(x) = ∇f_i(x) ∈ R^n
(i = 0, 1, ..., m − 1) is the gradient of the ith constraint function at x. A feasible
incremental step δ must yield x* + δ ∈ F, and so must satisfy

f_i(x* + δ) = f_i(x*) = 0   (8.50)

for all i. From (8.49) this implies the condition that δ must lie along a feasible
direction s ∈ R^n (at x = x*) such that

g_i^T(x*) s = 0   (8.51)
again for all i. [We shall suppose that the vectors g_i(x*) = ∇f_i(x*) are linearly
independent for all i.] Recalling (8.36), if s were also a descent direction at x*, then

g^T(x*) s < 0   (8.52)

would hold (g(x) = ∇f(x) ∈ R^n). In this situation δ would reduce f(x), as δ is
along direction s. But this is impossible since we have assumed that x* is a local
minimizer for f(x). [For any feasible s at x*, we expect to have g^T(x*)s = 0.] Consequently,
no direction s can satisfy (8.51) and (8.52) simultaneously. This statement remains
true if g(x*) is a linear combination of the g_i(x*), that is, if, for suitable λ_i ∈ R, we have

g(x*) = Σ_{i=0}^{m−1} λ_i g_i(x*).¹   (8.53)
Thus, a necessary condition for x* to be a local minimizer (or, more generally, a
stationary point) of f(x) is that [rewriting (8.53)]

g(x*) − Σ_{i=0}^{m−1} λ_i g_i(x*) = 0   (8.54)

for suitable λ_i ∈ R, which are called the Lagrange multipliers. We see that (8.54)
can be expressed as [with ∇ as in (8.3)]

∇[ f(x) − Σ_{i=0}^{m−1} λ_i f_i(x) ] = 0.   (8.55)

In other words, we replace the original problem (8.46) with the mathematically
equivalent problem of minimizing

L(x, λ) = f(x) − Σ_{i=0}^{m−1} λ_i f_i(x),   (8.56)
¹Equation (8.53) may be more formally justified as follows. Note that the same argument will also
extend to make (8.53) a necessary condition for x* to be a local maximizer, or saddle point, for f(x).
Thus, (8.53) is really a necessary condition for x* to be a stationary point of f(x). We employ proof by
contradiction. Suppose that

g(x*) = Gλ + h,

where G = [g_0(x*) ··· g_{m−1}(x*)] ∈ R^{n×m}, λ = [λ_0 ··· λ_{m−1}]^T ∈ R^m, and h ≠ 0. Further assume that
h ∈ R^n is the component of g(x*) that is orthogonal to all g_i(x*). Thus, G^T h = 0. In this instance
s = −h will satisfy both (8.51) and (8.52) [i.e., g^T(x*)s = −[λ^T G^T + h^T]h = −h^T h < 0]. Satisfaction
of (8.52) implies that a step δ in the direction s will reduce f(x) [i.e., f(x* + δ) < f(x*)]. But this cannot
be the case since x* is a local minimizer of f(x). Consequently, h ≠ 0 is impossible, which establishes
(8.53).
called the Lagrangian function (or Lagrangian), where, of course, x ∈ R^n and
λ ∈ R^m. Since L : R^n × R^m → R, in order to satisfy (8.55), we must determine
x, λ such that

∇L(x, λ) = 0,   (8.57)

where now instead [of (8.3)] we have ∇ = [∇_x^T ∇_λ^T]^T such that

∇_x = [∂/∂x_0 ··· ∂/∂x_{n−1}]^T,  ∇_λ = [∂/∂λ_0 ··· ∂/∂λ_{m−1}]^T.   (8.58)

Now we see that a necessary condition for a stationary point of f(x) subject to
our constraints is that x and λ form a stationary point of the Lagrangian function.
Of course, to resolve whether stationary point x is a minimizer requires additional
information (e.g., the Hessian). Observe that ∇_λ L(x, λ) = [−f_0(x) −f_1(x) ···
−f_{m−1}(x)]^T, so ∇_λ L(x, λ) = 0 implies that f_i(x) = 0 for all i. This is why we take
derivatives of the Lagrangian with respect to all elements of λ; it is equivalent to
imposing the equality constraints as in the original problem (8.46).
We now consider a few examples of the application of the method of Lagrange
multipliers.
Example 8.2 This example is from Fletcher [1, pp. 196-198]. Suppose that

f(x) = x_0 + x_1,   f_0(x) = x_0² − x_1 = 0,

so x = [x_0 x_1]^T ∈ R², and the Lagrangian is

L(x, λ) = x_0 + x_1 − λ(x_0² − x_1).

Clearly, to obtain stationary points, we must solve

∂L/∂x_0 = 1 − 2λx_0 = 0,
∂L/∂x_1 = 1 + λ = 0,
∂L/∂λ = −(x_0² − x_1) = 0.

Immediately, λ = −1, so that 1 − 2λx_0 = 0 yields x_0 = −1/2, and x_0² − x_1 = 0
yields x_1 = 1/4. Thus

x* = [−1/2  1/4]^T.

Is x* really a minimizer, or is it a maximizer, or saddle point?
An alternative means to solve our problem is to recognize that since x_0² − x_1 = 0,
we can actually minimize the new objective function

f'(x_0) = (x_0 + x_1)|_{x_1 = x_0²} = x_0 + x_0²

with respect to x_0 instead. Clearly, this is a positive definite quadratic with a well-defined
and unique minimum at x_0 = −1/2. Again we conclude x* = [−1/2 1/4]^T, and
it must specifically be a minimizer.
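A quick numerical check of Example 8.2 (our own sketch, not from the book) is to evaluate the gradient of the Lagrangian at the claimed stationary point (x_0, x_1, λ) = (−1/2, 1/4, −1) and confirm that all three components vanish:

```python
def grad_L(x0, x1, lam):
    # Gradient of L(x, lam) = x0 + x1 - lam*(x0**2 - x1) from Example 8.2
    return [1.0 - 2.0 * lam * x0,   # dL/dx0
            1.0 + lam,              # dL/dx1
            -(x0 ** 2 - x1)]        # dL/dlam (recovers the constraint)

residual = grad_L(-0.5, 0.25, -1.0)   # [0.0, 0.0, 0.0] at the stationary point
```

Any other point (even one satisfying the constraint) leaves at least one component of the gradient nonzero, which is exactly what (8.57) demands.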
In Example 8.2, f* = f(x*) = x_0 + x_1 = −1/4 > −∞ only because of the constraint
f_0(x) = x_0² − x_1 = 0. Without such a constraint, we would have f* = −∞.
We have described the method of Lagrange multipliers as being applied largely to
minimization problems. But we have noted that this method applies to maximization
problems as well because (8.57) is the necessary condition for a stationary point,
and not just a minimizer. The next example is a simple maximization problem from
geometry.
Example 8.3 We wish to maximize

F(x, y) = 4xy

subject to the constraint

C(x, y) = (1/4)x² + y² − 1 = 0.

This problem may be interpreted as the problem of maximizing the area of a
rectangle of area 4xy such that the corners of the rectangle are on an ellipse centered
at the origin of the two-dimensional plane. The ellipse is the curve C(x, y) = 0.
(Drawing a sketch is a useful exercise.)

Following the Lagrange multiplier procedure, we construct the Lagrangian
function

G(x, y, λ) = 4xy + λ((1/4)x² + y² − 1).
Taking the derivatives of G and setting them to zero yields

∂G/∂x = 4y + (1/2)λx = 0,   (8.59a)
∂G/∂y = 4x + 2λy = 0,   (8.59b)
∂G/∂λ = (1/4)x² + y² − 1 = 0.   (8.59c)
From (8.59a,b) we have

λ = −8y/x  and  λ = −2x/y,

which means that −2x/y = −8y/x, or x² = 4y². Thus, we may replace x² by 4y²
in (8.59c), giving

2y² − 1 = 0,

for which y² = 1/2, and so x² = 2. From these equations we easily obtain the locations
of the corners of the rectangle on the ellipse. The area of the rectangle is also
seen to be four units.
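A brute-force check of Example 8.3 (ours, not from the text): parameterize the ellipse as (x, y) = (2 cos t, sin t), which satisfies (1/4)x² + y² = 1 automatically, so the area becomes 4xy = 8 cos t sin t = 4 sin 2t; sweeping t over a fine grid, the largest area found should be four units:

```python
import math

# Sweep corner positions (x, y) = (2*cos(t), sin(t)) on the ellipse
# (1/4)x^2 + y^2 = 1 and record the largest rectangle area 4xy.
best = max(4.0 * (2.0 * math.cos(t)) * math.sin(t)
           for t in [k * math.pi / 2000.0 for k in range(1001)])
```

The maximum occurs at t = π/4, i.e., at the corner (x, y) = (√2, 1/√2) found by the Lagrange multiplier calculation.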
Example 8.4 Recall from Chapter 4 that we sought a method to determine
(compute) the matrix 2-norm

‖A‖₂ = max_{‖x‖₂=1} ‖Ax‖₂.

Chapter 4 considered only the special case x ∈ R² (i.e., n = 2). Now we consider
the general case for which n ≥ 2.

Since ‖Ax‖₂² = x^T A^T A x = x^T Rx with R = R^T and R > 0 (if A is full rank),
our problem is to maximize x^T Rx subject to the equality constraint ‖x‖₂ = 1 (or
equivalently x^T x = 1). The Lagrangian is

L(x, λ) = x^T Rx − λ(x^T x − 1)

since f(x) = x^T Rx and f_0(x) = x^T x − 1. Now
f(x) = x^T Rx = Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} x_i x_j r_ij
     = Σ_{i=0}^{n−1} x_i² r_ii + Σ_{i=0}^{n−1} Σ_{j=0, j≠i}^{n−1} x_i x_j r_ij,

so that (using r_ij = r_ji)

∂f/∂x_k = 2 r_kk x_k + Σ_{j=0, j≠k}^{n−1} r_kj x_j + Σ_{i=0, i≠k}^{n−1} x_i r_ik
        = 2 r_kk x_k + Σ_{j=0, j≠k}^{n−1} r_kj x_j + Σ_{j=0, j≠k}^{n−1} r_kj x_j.

This reduces to

∂f/∂x_k = 2 r_kk x_k + 2 Σ_{j=0, j≠k}^{n−1} r_kj x_j = 2 Σ_{j=0}^{n−1} r_kj x_j
for all k = 0, 1, ..., n − 1. Consequently, ∇_x f(x) = 2Rx. Similarly, ∇_x f_0(x) =
2x. Also, ∂L(x, λ)/∂λ = −x^T x + 1 = −f_0(x), and so ∇L(x, λ) = 0 yields the
equations

2Rx − 2λx = 0,
x^T x − 1 = 0.
The first equation states that the maximizing solution (if it exists) must satisfy the
eigenproblem

Rx = λx,

for which λ is an eigenvalue and x is the corresponding eigenvector. Consequently,
x^T Rx = λx^T x, so

λ = x^T Rx / x^T x = ‖Ax‖₂² / ‖x‖₂² = ‖Ax‖₂²

must be chosen to be the biggest eigenvalue of R. Since R > 0, such an eigenvalue
will exist. As before (Chapter 4), we conclude that

‖A‖₂ = √λ_{n−1},

for which λ_{n−1} is the largest of the eigenvalues λ_0, ..., λ_{n−1} of R = A^T A (λ_{n−1} ≥
λ_{n−2} ≥ ··· ≥ λ_0 ≥ 0).
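Example 8.4 is easy to verify numerically: since ‖A‖₂² is the largest eigenvalue of R = A^T A, power iteration on R (a standard technique sketched here in plain Python; not from the text) recovers it:

```python
def two_norm(A, iters=200):
    # ||A||_2 = sqrt(largest eigenvalue of R = A^T A), via power iteration on R
    m, n = len(A), len(A[0])
    R = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(R[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T R x (with ||x||_2 = 1) estimates the top eigenvalue
    lam = sum(x[i] * sum(R[i][j] * x[j] for j in range(n)) for i in range(n))
    return lam ** 0.5
```

For the diagonal matrix A = diag(3, 2) this returns 3, the largest singular value, as expected.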
APPENDIX 8.A MATLAB CODE FOR GOLDEN SECTION SEARCH
%
% SineFit1.m
%
% This routine computes the objective function V_1(T) for user input
% T and data vector f as required by the golden section search test
% procedure golden.m. Note that Ts (sampling period) must be consistent
% with Ts in golden.m.
%
function V1 = SineFit1(f,T);

N = length(f);  % Number of samples collected
Ts = 5*60;      % 5 minute sampling period

% Compute the objective function V_1(T)
n = [0:N-1];
T = T*60*60;
A = [ sin(2*pi*Ts*n/T).' cos(2*pi*Ts*n/T).' ];
B = inv(A.'*A);
V1 = f.'*f - f.'*A*B*A.'*f;
% golden.m
%
% This routine tests the golden section search procedure of Chapter 8
% on the noisy sinusoid problem depicted in Fig. 8.4.
% This routine creates a test signal and uses SineFit1.m to compute
% the corresponding objective function V_1(T) (given by Equation (8.33)
% in Chapter 8).
function Topt = golden

% Compute the test signal f
N = 500;        % Number of samples collected
Ts = 5*60;      % 5 minute sampling period
T = 24*60*60;   % 24 hr period for the sinusoid
phi = -pi/10;   % phase angle of sinusoid
a = 1.0;        % sinusoid amplitude
var = .5000;    % desired noise variance
std = sqrt(var);
eta = std*randn(1,N);
for n = 1:N
    f(n) = a*sin(((2*pi*(n-1)*Ts)/T) + phi);
end;
f = f + eta;
f = f.';

% Specify a starting interval and initial parameters
% (units of hours)
xl = 16;
xu = 32;
tau = (sqrt(5)+1)/2;  % The golden ratio
tol = .05;            % Accuracy of the location of the minimum

% Apply the golden section search procedure
I = xu - xl;          % length of the starting interval
xa = xu - I/tau;
xb = xl + I/tau;
while I > tol
    if SineFit1(f,xa) >= SineFit1(f,xb)
        I = xu - xa;
        xl = xa;
        temp = xa;
        xa = xb;
        xb = temp + I/tau;
    else
        I = xb - xl;
        xu = xb;
        temp = xb;
        xb = xa;
        xa = temp - I/tau;
    end
end
Topt = (xl + xu)/2;  % Estimate of optimum choice for T
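For readers who prefer Python, the golden section loop of golden.m ports directly; the following sketch (ours) mirrors its interval-update logic and is tested on a simple unimodal function rather than the noisy-sinusoid objective:

```python
import math

def golden_section(f, xl, xu, tol=1e-6):
    # Golden section search for the minimizer of a unimodal f on [xl, xu];
    # mirrors the while-loop of golden.m above.
    tau = (math.sqrt(5.0) + 1.0) / 2.0   # the golden ratio
    I = xu - xl
    xa, xb = xu - I / tau, xl + I / tau
    while I > tol:
        if f(xa) >= f(xb):
            I = xu - xa                   # minimum lies in [xa, xu]
            xl, xa, xb = xa, xb, xa + I / tau
        else:
            I = xb - xl                   # minimum lies in [xl, xb]
            xu, xb, xa = xb, xa, xb - I / tau
    return (xl + xu) / 2.0
```

For instance, golden_section(lambda x: (x - 2.0)**2, 0.0, 5.0) returns a value very close to the minimizer 2.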
REFERENCES

1. R. Fletcher, Practical Methods of Optimization, 2nd ed., Wiley, New York, 1987
(reprinted July 1993).
2. M. R. Schroeder, Number Theory in Science and Communication (with Applications
in Cryptography, Physics, Digital Information, Computing, and Self-Similarity), 2nd
(expanded) ed., Springer-Verlag, New York, 1986.
3. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Mathematics
series, Vol. 37), Springer-Verlag, New York, 2000.
4. R. P. Brent, Algorithms for Minimization without Derivatives, Dover Publications, Mineola,
NY, 2002.
5. S. Boyd and L. Vandenberghe, Convex Optimization, preprint, Dec. 2001.
PROBLEMS

8.1. Suppose A ∈ R^{n×n}, and that A is not symmetric in general. Prove that

∇(x^T Ax) = (A + A^T)x,

where ∇ is the gradient operator [Eq. (8.3)].
8.2. This problem is about ideas from vector calculus useful in nonlinear optimization
methods.
(a) If s, x, x' ∈ R^n, then a line in R^n is defined by

x = x(α) = x' + αs,  α ∈ R.

Vector s may be interpreted as determining the direction of the line in the
n-dimensional space R^n. The notation x(α) implies x(α) = [x_0(α) x_1(α)
··· x_{n−1}(α)]^T. Prove that the slope df/dα of f(x(α)) ∈ R along the line
at any x(α) is given by

df/dα = s^T ∇f,

where ∇ is the gradient operator [see Eq. (8.3)]. (Hint: Use the chain
rule for derivatives.)
(b) Suppose that u(x), v(x) ∈ R^n (again x ∈ R^n). Prove that

∇(u^T v) = (∇u^T)v + (∇v^T)u.
8.3. Use the golden ratio search method to find the global minimizer of the
polynomial objective function in Fig. 8.2. Do the computations using a pocket
calculator, with starting interval [x_l^(0), x_u^(0)] = [2, 2.5]. Iterate 5 times.
8.4. Review Problem 7.4 (in Chapter 7). Use a MATLAB implementation of the
golden ratio search method to find the detection threshold η for P_fa = 0.1, 0.01,
and 0.001. The objective function is f(η) as set up in Problem 7.4, with its
sum running over k = 0, 1, ..., (p/2) − 1. Make reasonable choices about
starting intervals.
8.5. Suppose x = [x_0 x_1]^T ∈ R², and consider

f(x) = x_0⁴ + x_0 x_1 + (1 + x_1)².

Find general expressions for the gradient and the Hessian of f(x). Is G(0) >
0? What does this signify? Use the Newton-Raphson method (Chapter 7)
to confirm that x = [0.6959 −1.3479]^T is a stationary point for f(x).
Select x^(0) = [0.7000 −1.3000]^T as the starting point. Is G(x) > 0? What does
this signify? In this problem do all necessary computations using a pocket
calculator.
8.6. If x ∈ R^n is a local minimizer for f(x), then we know that G(x) > 0, that
is, G(x) is positive definite (pd). On the other hand, if x is a local maximizer
of f(x), then −G(x) > 0. In this case we say that G(x) is negative definite
(nd). Show that

f(x) = (x_1 − x_0²)² + x_0⁵

has only one stationary point, and that it is neither a minimizer nor a maximizer
of f(x).
8.7. Take note of the criterion for a maximizer in the previous problem. For both
a = 6 and a = 8, find all stationary points of the function

f(x) = 2x_0³ − 3x_0² − a x_0 x_1 (x_0 − x_1 − 1).

Determine which are local minima, local maxima, or neither.
8.8. Write a MATLAB function to find the minimum of Rosenbrock's function
using the Newton algorithm (basic form that does not employ a line search).
Separate functions must be written to implement computation of both the
gradient and the inverse of the Hessian. The function for computing the
inverse of the Hessian must return an integer that indicates whether the Hes-
sian is positive definite. The Newton algorithm must terminate if a Hessian
is encountered that is not positive definite. The I/O is to be at the terminal
only. The user must input the starting vector at the terminal, and the program
must report the estimated minimum, or it must print an error message if the
Hessian is not positive definite. Test your program out on the starting vectors
x^(0) = [−3 −3]^T and x^(0) = [0 10]^T.
8.9. Write and test your own MATLAB routine (or routines) to verify Example 8.1.
8.10. Find the points on the ellipse

x²/a² + y²/b² = 1

that are closest to, and farthest from, the origin (x, y) = (0, 0). Use the
method of Lagrange multipliers.
8.11. The theory of Lagrange multipliers in Section 8.5 is a bit oversimplified.
Consider the following theorem. Suppose that x ∈ R^n gives an extremum
(i.e., minimum or maximum) of f(x) ∈ R among all x satisfying g(x) = 0.
If f, g ∈ C¹[D] for a domain D ⊂ R^n containing x, then either

g(x) = 0 and ∇g(x) = 0,   (8.P.1)

or there is a λ ∈ R such that

g(x) = 0 and ∇f(x) − λ∇g(x) = 0.   (8.P.2)

From this theorem, candidate points x for extrema of f(x) satisfying g(x) =
0 therefore are
(a) Points where f and g fail to have continuous partial derivatives.
(b) Points satisfying (8.P.1).
(c) Points satisfying (8.P.2).
In view of the theorem above and its consequences, find the minimum distance
from x = [x_0 x_1 x_2]^T = [0 0 −1]^T to the surface

g(x) = x_0² + x_1² − x_2³ = 0.
8.12. A second-order finite-impulse response (FIR) digital filter has the frequency
response H(e^{jω}) = Σ_{k=0}^{2} h_k e^{−jωk}, where h_k ∈ R are the filter parameters.
Since H(e^{jω}) is 2π-periodic, we usually consider only ω ∈ [−π, π]. The
DC response of the filter is H(1) = H(e^{j0}) = Σ_{k=0}^{2} h_k. Define the energy
of the filter in the band [−ω_p, ω_p] to be

E = (1/2π) ∫_{−ω_p}^{ω_p} |H(e^{jω})|² dω.

Find the filter parameters h = [h_0 h_1 h_2]^T ∈ R³ such that for ω_p = π/2
energy E is minimized subject to the constraint that H(1) = 1 (i.e., the gain
of the filter is unity at DC). Plot |H(e^{jω})| for ω ∈ [−π, π]. [Hint: E will
have the form E = h^T Rh ∈ R, where R ∈ R^{3×3} is a symmetric Toeplitz
matrix (recall Problem 4.20). Note also that |H(e^{jω})|² = H(e^{jω})H*(e^{jω}) =
H(e^{jω})H(e^{−jω}) for real h_k.]
8.13. This problem introduces incremental condition estimation (ICE) and is based
on the paper C. H. Bischof, "Incremental Condition Estimation," SIAM J.
Matrix Anal. Appl. 11, 312-322 (April 1990). ICE can be used to estimate
the condition number of a lower triangular matrix as it is generated one
row at a time. Many algorithms for linear system solution produce triangular
matrix factorizations one row or one column at a time. Thus, ICE may be
built into such algorithms to warn the user of possible inaccuracies in the
solution due to ill conditioning. Let A_n be an n × n matrix with singular
values σ_1(A_n) ≥ ··· ≥ σ_n(A_n) > 0. A condition number for A_n is

κ(A_n) = σ_1(A_n)/σ_n(A_n).

Consider the order n lower triangular linear system

L_n x_n = d_n.   (8.P.3)

The minimum singular value, σ_n(L_n), of L_n satisfies

σ_n(L_n) ≤ ‖d_n‖₂ / ‖x_n‖₂

(‖x_n‖₂² = Σ_i x²_{n,i}). Thus, an estimate (upper bound) of this singular value is

σ̂_n(L_n) = ‖d_n‖₂ / ‖x_n‖₂.

We would like to make this upper bound as small as possible. So, Bischof
suggests finding x_n to satisfy (8.P.3) such that ‖x_n‖₂ is maximized subject
to the constraint that ‖d_n‖₂ = 1. Given x_{n−1} such that L_{n−1} x_{n−1} = d_{n−1}
with ‖d_{n−1}‖₂ = 1 [which gives us σ̂_{n−1}(L_{n−1}) = 1/‖x_{n−1}‖₂], find s_n and
c_n such that ‖x_n‖₂ is maximized, where

L_n = [ L_{n−1}  0 ; v^T  γ_n ],  d_n = [ s_n d_{n−1}^T  c_n ]^T,  x_n = [ s_n x_{n−1}^T  (c_n − s_n α_n)/γ_n ]^T,

with α_n = v^T x_{n−1} and s_n² + c_n² = 1 (so that ‖d_n‖₂ = 1).
Find α_n, c_n, s_n. Assume α_n ≠ 0. (Comment: The indexing of the singular
values used here is different from that in Chapter 4. The present notation is
more convenient in the present context.)
8.14. Assume that A ∈ R^{m×n} with m ≥ n, but rank(A) < n is possible. As usual,
‖x‖₂² = x^T x. Solve the following problem:

min_{x∈R^n} ‖Ax − b‖₂² + δ‖x‖₂²  (b ∈ R^m),

where δ > 0. This is often called the Tychonov regularization problem. It
is a simple ploy to alleviate problems with ill-conditioning in least-squares
applications. Since A is not necessarily of full rank, we have A^T A ≥ 0, but
not necessarily A^T A > 0. What is rank(A^T A + δI)? Of course, I is the order
n identity matrix.
9
Numerical Integration
and Differentiation
9.1 INTRODUCTION
We are interested in how to compute the integral

I = ∫_a^b f(x) dx   (9.1)

for which f(x) ∈ R (and, of course, x ∈ R). Depending on f(x), and perhaps also
on [a, b], the reader knows that "nice" closed-form expressions for I rarely exist.
This forces us to consider numerical methods to approximate I. We have seen from
Chapter 3 that one approach is to find a suitable series expansion for the integral
in (9.1). For example, recall that we wished to compute the error function

erf(x) = (2/√π) ∫_0^x e^{−t²} dt,   (9.2)
which has no antiderivative (i.e., "nice" formula). Recall that the error function is
crucial in solving various problems in applied probability that involve the Gaussian
probability density function [i.e., the function in (3.101) of Chapter 3]. The Taylor
series expansion of Eq. (3.108) was suggested as a means to approximately evaluate
erf(x), and is known to be practically effective if x is not too big. If x is large, then
the asymptotic expansion of Example 3.10 was suggested. The series expansion
methodology may seem to solve our problem, but there are integrals for which it
is not easy to find series expansions of any kind.
A recursive approach may be attempted as an alternative. An example of this was
seen in Chapter 5, where finding the norm of a Legendre polynomial required solv-
ing a recursion involving variables that were certain integrals [recall Eq. (5.96)].
As another example of this approach, consider the following case from Forsythe
et al. [1]. Suppose that we wish to compute
E_n = ∫_0^1 x^n e^{x−1} dx   (9.3)

An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski.
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
for any n ∈ N (natural numbers). Recalling integration by parts, we see that

∫_0^1 x^n e^{x−1} dx = x^n e^{x−1} |_0^1 − ∫_0^1 n x^{n−1} e^{x−1} dx = 1 − nE_{n−1},

so

E_n = 1 − nE_{n−1}   (9.4)

for n = 2, 3, 4, .... It is easy to confirm that

E_1 = ∫_0^1 x e^{x−1} dx = 1/e.
This is the initial condition for recursion (9.4). We observe that E_n > 0 for all
n. But if, for example, MATLAB is used to compute E_19, we obtain the computed
solution Ê_19 = −5.1930, which is clearly wrong. Why has this happened?
Because of the need to quantize, E_1 is actually stored in the computer as

Ê_1 = E_1 + ε,

for which ε is some quantization error. Assuming that the operations in (9.4) do
not lead to further errors (i.e., assuming no rounding errors), we may arrive at a
formula for Ê_n:

Ê_2 = 1 − 2Ê_1 = E_2 + (−2)ε,
Ê_3 = 1 − 3Ê_2 = E_3 + (−3)(−2)ε,
Ê_4 = 1 − 4Ê_3 = E_4 + (−4)(−3)(−2)ε,

and so on. In general,

Ê_n = E_n + (−1)^{n−1} · 2 · 3 ··· (n − 1) · n · ε,   (9.5)

so

Ê_n − E_n = (−1)^{n−1} n! ε.   (9.6)
We see that even a very tiny quantization error ε will grow very rapidly during
the course of the computation, even without any additional rounding errors at all!
Thus, (9.4) is a highly unstable numerical procedure, and so must be rejected as
a method to compute (9.3). However, it is possible to arrive at a stable procedure
by modifying (9.4). Now observe that

E_n = ∫_0^1 x^n e^{x−1} dx < ∫_0^1 x^n dx = 1/(n + 1),   (9.7)
implying that E_n → 0 as n increases. Instead of (9.4), consider the recursion
[obtained by rearranging (9.4)]

E_{n−1} = (1/n)(1 − E_n).   (9.8)

If we wish to compute E_m, we may assume E_n = 0 for some n significantly bigger
than m and apply (9.8). From (9.7) the error involved in approximating E_n by zero
is not bigger than 1/(n + 1). Thus, an algorithm for E_m is

E_{k−1} = (1/k)(1 − E_k)

for k = n, n − 1, ..., m + 2, m + 1, where E_n = 0. At each stage of this algorithm
the initial error is reduced by factor 1/k rather than being magnified as it was in
the procedure of (9.4).
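The contrast between the unstable forward recursion (9.4) and the stable backward recursion (9.8) is easy to reproduce; the following Python sketch (ours, not from the text) implements both:

```python
import math

def e_forward(n):
    # Unstable: E_n = 1 - n*E_{n-1} starting from E_1 = 1/e;
    # the initial rounding error is amplified by n! [Eq. (9.6)].
    E = 1.0 / math.e
    for k in range(2, n + 1):
        E = 1.0 - k * E
    return E

def e_backward(m, n=40):
    # Stable: E_{k-1} = (1 - E_k)/k starting from the guess E_n = 0;
    # the starting error (at most 1/(n+1)) is divided by k at each stage.
    E = 0.0
    for k in range(n, m, -1):
        E = (1.0 - E) / k
    return E
```

Here e_forward(19) is wildly wrong (the text reports −5.1930 from MATLAB), while e_backward(19) lands safely inside the interval (0, 1/20) guaranteed by (9.7).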
Other than potential numerical stability problems, which may or may not be
easy to solve, it is apparent that not all integration problems may be cast into a
recursive form. It is also possible that f(x) is not known for all x ∈ R. We might
know f(x) only on a finite subset of R, or perhaps a countably infinite subset of
R. This situation might arise in the context of obtaining f(x) experimentally. Such
a scenario would rule out the previous suggestions.
Thus, there is much room to consider alternative methodologies. In this chapter
we consider what may be collectively called quadrature methods. For the most part,
these are based on applying some of the interpolation ideas considered in Chapter 6.
But Gaussian quadrature (see Section 9.4) also employs orthogonal polynomials
(material from Chapter 5).
This chapter is dedicated mainly to the subject of numerical integration by
quadratures. But the final section considers numerical approximations to derivatives
(i.e., numerical differentiation). Numerical differentiation is relevant to the
numerical solution of differential equations (to be considered in later chapters),
and we have mentioned that it is relevant to spline interpolation (Section 6.5). In
fact, it can also find a role in refined methods of numerical integration (Section 9.5).
9.2 TRAPEZOIDAL RULE
A simple approach to numerical integration is the following.
In this book we implicitly assume all functions are Riemann integrable. From
elementary calculus such integrals are obtained by the limiting process

I = ∫_a^b f(x) dx = lim_{n→∞} (b − a)/n Σ_{k=1}^{n} f(x_k*)   (9.9)

for which x_0 = a, x_n = b, and for which the value I is independent of the point x_k* ∈
[x_{k−1}, x_k]. We remark that not all functions f(x) satisfy this requirement and so,
as was mentioned in Chapter 3, not all functions are Riemann integrable. However,
we will ignore this potential problem. We may approximate I according to

I ≈ (b − a)/n Σ_{k=1}^{n} f(x_k*).   (9.10)

Such an approximation is called the rectangular rule (or rectangle rule) for numerical
integration, and there are different variants depending on the choice for x_k*.
Three possible choices are shown in Fig. 9.1. We mention that all variants involve
assuming f(x) is piecewise constant on [x_{k−1}, x_k], and so amount to the constant
interpolation of f(x) (i.e., fitting a polynomial which is a constant to f(x) on
some interval). Define

h = (b − a)/n.

From Fig. 9.1, the right-point rule uses

x_k* = a + kh,   (9.11a)

while the left-point rule uses

x_k* = a + (k − 1)h,   (9.11b)

Figure 9.1 Illustration of the different forms of the rectangular rule: (a) right-point rule;
(b) left-point rule; (c) midpoint rule. In all cases x_0 = a and x_n = b.
and the midpoint rule uses

x_k* = a + (k − 1/2)h,   (9.11c)

where in all cases k = 1, 2, ..., n − 1, n. The midpoint rule is often preferred
among the three as it is usually more accurate. However, the rectangular rule is often
(sometimes unfairly) regarded as too crude, and so the following (or something still
"better") is chosen.
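The three sample-point choices (9.11a)-(9.11c) differ only in where f is evaluated, and a short Python sketch (ours) makes their accuracy easy to compare:

```python
def rect_rule(f, a, b, n, variant="mid"):
    # Rectangular rule with right-, left-, or midpoint sample points
    # [Eqs. (9.11a)-(9.11c)].
    h = (b - a) / n
    if variant == "right":
        pts = (a + k * h for k in range(1, n + 1))
    elif variant == "left":
        pts = (a + (k - 1) * h for k in range(1, n + 1))
    else:
        pts = (a + (k - 0.5) * h for k in range(1, n + 1))
    return h * sum(f(x) for x in pts)
```

For ∫_0^1 x² dx = 1/3 with n = 100, the midpoint variant is far more accurate than either one-sided variant, which is the behavior claimed above.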
It is often better to approximate f(x) with trapezoids as shown in Fig. 9.2. This
results in the trapezoidal rule for numerical integration. From Fig. 9.2 we see that
this rule is based on the linear interpolation of the function f(x) on [x_{k−1}, x_k].
The approximation to ∫_{x_{k−1}}^{x_k} f(x) dx is given by the area of the trapezoid; this is

∫_{x_{k−1}}^{x_k} f(x) dx ≈ (1/2)[f(x_{k−1}) + f(x_k)](x_k − x_{k−1}) = (1/2)[f(x_{k−1}) + f(x_k)] h.   (9.12)

[We have h = x_k − x_{k−1} = (b − a)/n.] It is intuitively plausible that this method
should be more accurate than the rectangular rule, and yet not require much, if
any, additional computational effort to implement it. Applying (9.12) for k = 1 to
k = n, we have

∫_a^b f(x) dx ≈ T(n) = (h/2) Σ_{k=1}^{n} [f(x_{k−1}) + f(x_k)],   (9.13)

and the summation expands out as

T(n) = (h/2)[f(x_0) + 2f(x_1) + 2f(x_2) + ··· + 2f(x_{n−1}) + f(x_n)],   (9.14)

where n ∈ N (set of natural numbers).
Figure 9.2 Illustration of the trapezoidal rule.
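A direct transcription of (9.14) into Python (a sketch, not the book's own code) might look like:

```python
import math

def trap(f, a, b, n):
    # composite trapezoidal rule T(n) of (9.14):
    # (h/2) [f(x_0) + 2 f(x_1) + ... + 2 f(x_{n-1}) + f(x_n)]
    h = (b - a) / n
    return 0.5 * h * (f(a) + f(b) + 2.0 * sum(f(a + k * h) for k in range(1, n)))

print(trap(math.exp, 0.0, 1.0, 100))  # close to e - 1 = 1.71828...
```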
We may investigate the error behavior of the trapezoidal rule as follows. The
process of analysis begins by assuming that n = 1; that is, we linearly interpolate
$f(x)$ on $[a, b]$ using
$$p(x) = f(a)\frac{x-b}{a-b} + f(b)\frac{x-a}{b-a}, \tag{9.15}$$
which is from the Lagrange interpolation formula [recall (6.9) for $n = 1$]. It must
be the case that for suitable error function e(x), we have
$$f(x) = p(x) + e(x) \tag{9.16}$$
with $x \in [a, b]$. Consequently
$$\int_a^b f(x)\,dx = \int_a^b p(x)\,dx + \int_a^b e(x)\,dx = \frac{b-a}{2}[f(a) + f(b)] + \int_a^b e(x)\,dx = T(1) + E_{T(1)}, \tag{9.17}$$
where
$$E_{T(1)} = \int_a^b e(x)\,dx, \tag{9.18}$$
which is the error involved in using the trapezoidal rule. Of course, we would like
a suitable bound on this error. To obtain such a bound, we will assume that $f^{(1)}(x)$ and $f^{(2)}(x)$ both exist and are continuous on $[a, b]$. Let $x$ be fixed at some value such that $a < x < b$, and define
$$g(t) = f(t) - p(t) - [f(x) - p(x)]\frac{(t-a)(t-b)}{(x-a)(x-b)} \tag{9.19}$$
for $t \in [a, b]$. It is not difficult to confirm that
$$g(a) = g(b) = g(x) = 0,$$
so $g(t)$ vanishes at three different places on the interval $[a, b]$. Rolle's theorem¹ says that there are points $\xi_1 \in (a, x)$, $\xi_2 \in (x, b)$ such that
$$g^{(1)}(\xi_1) = g^{(1)}(\xi_2) = 0.$$
Thus, $g^{(1)}(t)$ vanishes at two different places on $(a, b)$, so yet again by Rolle's theorem there is a $\xi \in (\xi_1, \xi_2)$ such that
$$g^{(2)}(\xi) = 0.$$
¹This was proved by Bers [2], but the proof is actually rather lengthy, and so we omit it.
We note that $\xi = \xi(x)$; that is, point $\xi$ depends on $x$. Therefore
$$g^{(2)}(\xi) = f^{(2)}(\xi) - p^{(2)}(\xi) - [f(x) - p(x)]\frac{2}{(x-a)(x-b)} = 0. \tag{9.20}$$
The polynomial $p(x)$ is of the first degree so $p^{(2)}(\xi) = 0$. We may use this in (9.20) and rearrange the result so that
$$f(x) = p(x) + \frac{1}{2}f^{(2)}(\xi(x))(x-a)(x-b) \tag{9.21}$$
for any $x \in (a, b)$. This expression also happens to be valid at $x = a$ and at $x = b$, so
$$e(x) = f(x) - p(x) = \frac{1}{2}f^{(2)}(\xi(x))(x-a)(x-b) \tag{9.22}$$
for $x \in [a, b]$.
But we need to evaluate $E_{T(1)}$ in (9.18). The second mean-value theorem for integrals states that if $f(x)$ is continuous and $g(x)$ is integrable (Riemann) on $[a, b]$, and further that $g(x)$ does not change sign on $[a, b]$, then there is a point $\rho \in (a, b)$ such that
$$\int_a^b f(x)g(x)\,dx = f(\rho)\int_a^b g(x)\,dx. \tag{9.23}$$
The proof is omitted. We observe that $(x-a)(x-b)$ does not change sign on $x \in [a, b]$, so via this theorem we have
$$E_{T(1)} = \frac{1}{2}\int_a^b f^{(2)}(\xi(x))(x-a)(x-b)\,dx = -\frac{1}{12}f^{(2)}(\rho)(b-a)^3, \tag{9.24}$$
where $\rho \in (a, b)$. We emphasize that this is the error for $n = 1$ in $T(n)$. Naturally,
we want an error expression for n > 1, too. When n > 1, we may refer to the
integration rule as a compound or composite rule.
The error committed in numerically integrating over the $k$th subinterval $[x_{k-1}, x_k]$ must be [via (9.24)]
$$E_k = -\frac{1}{12}f^{(2)}(\xi_k)(x_k - x_{k-1})^3 = -\frac{h^3}{12}f^{(2)}(\xi_k) = -\frac{1}{12}\left(\frac{b-a}{n}\right)^3 f^{(2)}(\xi_k), \tag{9.25}$$
where $\xi_k \in [x_{k-1}, x_k]$ and $k = 1, 2, \ldots, n-1, n$. Therefore, the total error committed is
$$E_{T(n)} = \int_a^b f(x)\,dx - T(n) = \sum_{k=1}^{n} E_k, \tag{9.26}$$
which becomes [via (9.25)]
$$E_{T(n)} = -\frac{h^2}{12}\frac{b-a}{n}\sum_{k=1}^{n} f^{(2)}(\xi_k). \tag{9.27}$$
The average $\frac{1}{n}\sum_{k=1}^{n} f^{(2)}(\xi_k)$ must lie between the largest and smallest values of $f^{(2)}(x)$ on $[a, b]$, so recalling that $f^{(2)}(x)$ is continuous on $[a, b]$, the intermediate value theorem (Theorem 7.1) yields that there is a $\xi \in (a, b)$ such that
$$f^{(2)}(\xi) = \frac{1}{n}\sum_{k=1}^{n} f^{(2)}(\xi_k).$$
Therefore
$$E_{T(n)} = -\frac{h^2}{12}(b-a)f^{(2)}(\xi), \tag{9.28}$$
where $\xi \in (a, b)$. If the maximum value of $f^{(2)}(x)$ on $[a, b]$ is known, then this may be used in (9.28) to provide an upper bound on the error. We remark that $E_{T(n)}$ is often called truncation error.
We see that
$$T(n) = x^T y \tag{9.29}$$
for which
$$x = \frac{h}{2}[1\ 2\ 2\ \cdots\ 2\ 1]^T \in \mathbf{R}^{n+1}, \qquad y = [f(x_0)\ f(x_1)\ \cdots\ f(x_{n-1})\ f(x_n)]^T \in \mathbf{R}^{n+1}. \tag{9.30}$$
We know that rounding errors will be committed in the computation of (9.29). The total rounding error might be denoted by $E_R$. We recall from Chapter 2 [Eq. (2.40)] that a bound on these errors is
$$|E_R| = |x^T y - fl[x^T y]| \le 1.01(n+1)u|x|^T|y|, \tag{9.31}$$
where $u$ is the unit roundoff, or else the machine epsilon. The cumulative effect of rounding errors can be expected to grow as $n$ increases. If we suppose that
$$M = \max_{x \in [a,b]}|f^{(2)}(x)|, \tag{9.32}$$
then, from (9.28), we obtain
$$|E_{T(n)}| \le \frac{1}{12}\frac{1}{n^2}(b-a)^3 M. \tag{9.33}$$
Thus, (9.33) is an upper bound on the truncation error for the composite trapezoidal
rule. We see that the bound gets smaller as n increases. Thus, as we expect,
truncation error is reduced as the number of trapezoids used increases. Combining
(9.33) with (9.31) results in a bound on total error E:
$$|E| \le \frac{1}{12}\frac{1}{n^2}(b-a)^3 M + 1.01(n+1)u|x|^T|y|. \tag{9.34}$$
Usually $n \gg 1$, so $n + 1 \approx n$. Thus, substituting this and (9.30) into (9.31) results in
$$|x^T y - fl[x^T y]| \le 1.01(b-a)u\left[\frac{|f(x_0)| + |f(x_n)|}{2} + \sum_{k=1}^{n-1}|f(x_k)|\right], \tag{9.35}$$
and so (9.34) becomes
$$|E| \le \frac{1}{12}\frac{1}{n^2}(b-a)^3 M + 1.01(b-a)u\left[\frac{|f(x_0)| + |f(x_n)|}{2} + \sum_{k=1}^{n-1}|f(x_k)|\right]. \tag{9.36}$$
In general, the first term in the bound of (9.36) becomes smaller as n increases,
while the second term becomes larger. Thus, there is a tradeoff involved in choosing
the number of trapezoids to approximate a given integral, and the best choice ought
to minimize the total error.
Example 9.1 We may apply the bound of (9.36) to the following problem. We wish to compute
$$I = \int_0^1 e^{-x}\,dx.$$
Thus, $[a, b] = [0, 1]$, and so $b - a = 1$. Of course, it is very easy to confirm that $I = 1 - e^{-1}$. But this is what makes it a good example to test our theory out. We also see that $f^{(2)}(x) = e^{-x}$, and so in (9.32) $M = 1$. Also
$$f(x_k) = e^{-k/n}$$
for $k = 0, 1, \ldots, n$. It is therefore easy to see that
$$\sum_{k=1}^{n-1} f(x_k) = \frac{e^{-1/n} - e^{-1}}{1 - e^{-1/n}}.$$
We might assume that the trapezoidal rule for this problem is implemented in the C programming language using single-precision floating-point arithmetic, in which case a typical value for $u$ would be
$$u = 1.1921 \times 10^{-7}.$$
Therefore, from (9.36) the total error is bounded according to
$$|E| \le \frac{1}{12n^2} + 1.2040 \times 10^{-7}\left[\frac{1 + e^{-1}}{2} + \frac{e^{-1/n} - e^{-1}}{1 - e^{-1/n}}\right].$$
Figure 9.3 plots this bound versus n, and also shows the magnitude of the computed
(i.e., the true or actual) total error in the trapezoidal rule approximation, which
is \T(n) — I\. We see that the true error is always less than the bound, as we
would expect. However, the bound is rather pessimistic. Also, the bound predicts
Figure 9.3 Comparison of total error (computed) to bound on total error, illustrating the
tradeoff between rounding error and truncation error in numerical integration by the trape-
zoidal rule. The bound employed here is that of Eq. (9.36).
that the proper choice for $n$ is much less than what the computed result predicts. Specifically, the bound suggests that we choose $n \approx 100$, while the computed result suggests that we choose $n \approx 100{,}000$.
What is important is that the computed result and the bound both confirm that
there is a tradeoff between minimizing the truncation error and minimizing the
rounding error. To minimize rounding error, we prefer a small n, but to minimize
the truncation error, we prefer a large n. The best solution minimizes the total error
from both sources.
In practice, attempting a detailed analysis to determine the true optimum choice
for n is usually not worth the effort. What is important is to understand the funda-
mental tradeoffs involved in the choice of n, and from this understanding select a
reasonable value for n.
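Example 9.1 can be re-created in a few lines. The sketch below is not the book's code, and it runs in double precision rather than the single-precision C implementation the example assumes, so the rounding term of the bound is very pessimistic here:

```python
import math

def trap(f, a, b, n):
    # composite trapezoidal rule (9.14)
    h = (b - a) / n
    return 0.5 * h * (f(a) + f(b) + 2.0 * sum(f(a + k * h) for k in range(1, n)))

I = 1.0 - math.exp(-1.0)   # exact value of the integral in Example 9.1
u = 1.1921e-7              # single-precision unit roundoff quoted in the text

def bound(n):
    # total-error bound (9.36) specialized to this problem
    s = (math.exp(-1.0 / n) - math.exp(-1.0)) / (1.0 - math.exp(-1.0 / n))
    return 1.0 / (12.0 * n * n) + 1.01 * u * ((1.0 + math.exp(-1.0)) / 2.0 + s)

for n in (10, 100, 1000):
    err = abs(trap(lambda x: math.exp(-x), 0.0, 1.0, n) - I)
    print(n, err, bound(n))   # the true error stays below the bound
```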
9.3 SIMPSON'S RULE
The trapezoidal rule employed linear interpolation to approximate $f(x)$ between sample points $x_k$ on the $x$ axis. We might consider quadratic interpolation in
the hope of improving accuracy still further. Here "accuracy" is a reference to
truncation error.
Therefore, we wish to fit a quadratic curve to the points $(x_{k-1}, f(x_{k-1}))$, $(x_k, f(x_k))$, and $(x_{k+1}, f(x_{k+1}))$. We may define the quadratic to be
$$p_k(x) = a(x - x_k)^2 + b(x - x_k) + c. \tag{9.37}$$
Contrary to past practice, the subscript $k$ now does not denote degree, but rather denotes the "centerpoint" of the interval $[x_{k-1}, x_{k+1}]$ on which we are fitting the quadratic. The situation is illustrated in Fig. 9.4. For convenience, define $y_k = f(x_k)$. Therefore, from (9.37) we may set up three equations in the unknowns $a$, $b$, $c$:
$$a(x_{k-1} - x_k)^2 + b(x_{k-1} - x_k) + c = y_{k-1},$$
$$a(x_k - x_k)^2 + b(x_k - x_k) + c = y_k,$$
$$a(x_{k+1} - x_k)^2 + b(x_{k+1} - x_k) + c = y_{k+1}.$$
This is a linear system of equations, and we will assume that $h = x_k - x_{k-1} = x_{k+1} - x_k$, so therefore
$$a = \frac{y_{k+1} - 2y_k + y_{k-1}}{2h^2}, \tag{9.38a}$$
$$b = \frac{y_{k+1} - y_{k-1}}{2h}, \tag{9.38b}$$
$$c = y_k. \tag{9.38c}$$
This leads to the approximation
$$\int_{x_{k-1}}^{x_{k+1}} f(x)\,dx \approx \int_{x_{k-1}}^{x_{k+1}} p_k(x)\,dx = \frac{h}{3}[y_{k-1} + 4y_k + y_{k+1}]. \tag{9.39}$$
Figure 9.4 Simpson's rule for numerical integration.
Of course, some algebra has been omitted to arrive at the equality in (9.39). As in Section 9.2, we wish to integrate $f(x)$ on $[a, b]$. So, as before, $a = x_0$ and $b = x_n$. If $n$ is an even number, then the number of subdivisions of $[a, b]$ is an even number, and hence we have the approximation
$$\int_a^b f(x)\,dx \approx \int_{x_0}^{x_2} p_1(x)\,dx + \int_{x_2}^{x_4} p_3(x)\,dx + \cdots + \int_{x_{n-2}}^{x_n} p_{n-1}(x)\,dx$$
$$= \frac{h}{3}[y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + \cdots + 2y_{n-2} + 4y_{n-1} + y_n]. \tag{9.40}$$
The last equality follows from applying (9.39). We define the Simpson rule approximation to $I$ as
$$S(n) = \frac{h}{3}[y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + \cdots + 2y_{n-2} + 4y_{n-1} + y_n] \tag{9.41}$$
for which $n$ is even and $n \ge 2$.
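A minimal Python sketch of (9.41) (not the book's code) is:

```python
import math

def simpson(f, a, b, n):
    # composite Simpson rule S(n) of (9.41); n must be even, weights
    # follow the 1, 4, 2, 4, ..., 2, 4, 1 pattern
    if n < 2 or n % 2:
        raise ValueError("n must be even and >= 2")
    h = (b - a) / n
    s = f(a) + f(b) + sum((4.0 if k % 2 else 2.0) * f(a + k * h) for k in range(1, n))
    return h * s / 3.0

print(simpson(lambda x: math.exp(-x), -1.0, 1.0, 2))  # S(2) of Example 9.4
```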
A truncation error analysis of Simpson's rule is more involved than that of
the analysis of the trapezoidal rule seen in the previous section. Therefore, we
only outline the major steps and results. We begin by using only two subintervals to approximate $I = \int_a^b f(x)\,dx$; specifically, $n = 2$. Define $c = (a+b)/2$. Denote the interpolating quadratic by $p(x)$. For a suitable error function $e(x)$, we must have
$$f(x) = p(x) + e(x). \tag{9.42}$$
Immediately we see that
$$I = \int_a^b f(x)\,dx = \int_a^b p(x)\,dx + \int_a^b e(x)\,dx = \frac{b-a}{6}[f(a) + 4f(c) + f(b)] + \int_a^b e(x)\,dx = S(2) + E_{S(2)}. \tag{9.43}$$
So, the truncation error in Simpson's rule is thus
$$E_{S(2)} = \int_a^b e(x)\,dx. \tag{9.44}$$
It is clear that Simpson's rule is exact for $f(x)$ a quadratic function. Less clear is the fact that Simpson's rule is exact if $f(x)$ is a cubic polynomial. To demonstrate the truth of this claim, we need an error result from Chapter 6. We assume that $f^{(k)}(x)$ exists and is continuous for all $k = 0, 1, 2, 3, 4$ for all $x \in [a, b]$. From
Eq. (6.14) the error involved in interpolating $f(x)$ with a quadratic polynomial is given by
$$e(x) = \frac{1}{3!}f^{(3)}(\xi(x))(x-a)(x-b)(x-c) \tag{9.45}$$
for some $\xi = \xi(x) \in [a, b]$. Hence
$$E_{S(2)} = \int_a^b e(x)\,dx = \frac{1}{3!}\int_a^b f^{(3)}(\xi(x))(x-a)(x-b)(x-c)\,dx. \tag{9.46}$$
Unfortunately, the polynomial $(x-a)(x-b)(x-c)$ changes sign on the interval $[a, b]$, and so we are not able to apply the second mean-value theorem for integrals as we did in Section 9.2. This is a major reason why the analysis of Simpson's rule is harder than the analysis of the trapezoidal rule. However, at this point we may still consider (9.46) for the case where $f(x)$ is a cubic polynomial. In this case we must have $f^{(3)}(x) = K$ (some constant). Consequently, from (9.46)
$$E_{S(2)} = \frac{K}{3!}\int_a^b (x-a)(x-b)(x-c)\,dx,$$
but if $z = x - c$, then, since $c = \frac{1}{2}(a+b)$, we must have
$$E_{S(2)} = \frac{K}{3!}\int_{-(b-a)/2}^{(b-a)/2}\left(z + \frac{b-a}{2}\right)z\left(z - \frac{b-a}{2}\right)dz = \frac{K}{3!}\int_{-(b-a)/2}^{(b-a)/2} z\left(z^2 - \frac{(b-a)^2}{4}\right)dz. \tag{9.47}$$
The integrand is an odd function of $z$, and the integration limits are symmetric about the point $z = 0$. Immediately we conclude that $E_{S(2)} = 0$ in this particular case. Thus, we conclude that Simpson's rule gives the exact result when $f(x)$ is a cubic polynomial.
Hermite interpolation (considered in a general way in Section 6.4) is polynomial interpolation where not only does the interpolating polynomial match $f(x)$ at the sample points $x_k$ but the first derivative of $f(x)$ is matched as well. It is useful to interpolate $f(x)$ with a cubic polynomial that we will denote by $r(x)$ at the points $(a, f(a))$, $(b, f(b))$, and $(c, f(c))$, and also such that $r^{(1)}(c) = f^{(1)}(c)$. A cubic polynomial is specified by four coefficients, so these constraints uniquely determine $r(x)$. In fact
$$r(x) = p(x) + \alpha(x-a)(x-b)(x-c), \tag{9.48a}$$
for which
$$\alpha = \frac{4[p^{(1)}(c) - f^{(1)}(c)]}{(b-a)^2}. \tag{9.48b}$$
Analogously to (9.19), we may define
$$g(t) = f(t) - r(t) - [f(x) - r(x)]\frac{(t-a)(t-c)^2(t-b)}{(x-a)(x-c)^2(x-b)}, \qquad a \le t \le b. \tag{9.49}$$
It happens that $g^{(k)}(t)$ for $k = 0, 1, 2, 3, 4$ all exist and are continuous at all $t \in [a, b]$. Additionally, $g(a) = g(b) = g(c) = g^{(1)}(c) = g(x) = 0$. The vanishing of $g(t)$ at four distinct points on $[a, b]$, and $g^{(1)}(c) = 0$, guarantees that $g^{(4)}(\xi) = 0$ for some $\xi \in [a, b]$ by the repeated application of Rolle's theorem. Consequently, using (9.49), we obtain
$$g^{(4)}(\xi) = f^{(4)}(\xi) - r^{(4)}(\xi) - [f(x) - r(x)]\frac{4!}{(x-a)(x-c)^2(x-b)} = 0. \tag{9.50}$$
Since $r(x)$ is cubic, $r^{(4)}(\xi) = 0$, and so (9.50) can be used to say that
$$f(x) = r(x) + \frac{1}{4!}f^{(4)}(\xi(x))(x-a)(x-c)^2(x-b) \tag{9.51}$$
for $x \in (a, b)$. This is valid at the endpoints of $[a, b]$, so finally
$$e(x) = f(x) - r(x) = \frac{1}{4!}f^{(4)}(\xi(x))(x-a)(x-c)^2(x-b) \tag{9.52}$$
for $x \in [a, b]$, and $\xi(x) \in [a, b]$. Immediately, we see that
$$E_{S(2)} = \frac{1}{4!}\int_a^b f^{(4)}(\xi(x))(x-a)(x-c)^2(x-b)\,dx. \tag{9.53}$$
The polynomial in the integrand of (9.53) does not change sign on $[a, b]$. Thus, the second mean-value theorem for integrals is applicable. Hence, for some $\xi \in (a, b)$, we have
$$E_{S(2)} = \frac{f^{(4)}(\xi)}{4!}\int_a^b (x-a)(x-c)^2(x-b)\,dx, \tag{9.54}$$
which reduces to
$$E_{S(2)} = -\frac{h^5}{90}f^{(4)}(\xi) \tag{9.55}$$
again for some $\xi \in (a, b)$, where $h = (b-a)/2$.
We need an expression for $E_{S(n)}$, that is, an error expression for the composite Simpson rule. We will assume again that $h = (b-a)/n$, where $n$ is an even number. Consequently
$$E_{S(n)} = \int_a^b f(x)\,dx - S(n) = \sum_{k=1}^{n/2} E_k, \tag{9.56}$$
where $E_k$ is the error committed in the approximation for the $k$th subinterval $[x_{2(k-1)}, x_{2k}]$. Thus, for $\xi_k \in [x_{2(k-1)}, x_{2k}]$, with $k = 1, 2, \ldots, n/2$, we have
$$E_k = -\frac{h^5}{90}f^{(4)}(\xi_k) = -\frac{h^4}{90}\frac{b-a}{n}f^{(4)}(\xi_k). \tag{9.57}$$
Therefore
$$E_{S(n)} = -\frac{h^4}{180}\frac{b-a}{n/2}\sum_{k=1}^{n/2} f^{(4)}(\xi_k). \tag{9.58}$$
Applying the intermediate-value theorem to the average $\frac{1}{n/2}\sum_{k=1}^{n/2} f^{(4)}(\xi_k)$ confirms that there is a $\xi \in (a, b)$ such that
$$f^{(4)}(\xi) = \frac{1}{n/2}\sum_{k=1}^{n/2} f^{(4)}(\xi_k),$$
so therefore the truncation error expression for the composite Simpson rule becomes
$$E_{S(n)} = -\frac{h^4}{180}(b-a)f^{(4)}(\xi) \tag{9.59}$$
for some $\xi \in (a, b)$. For convenience, we repeat the truncation error expression for the composite trapezoidal rule:
$$E_{T(n)} = -\frac{h^2}{12}(b-a)f^{(2)}(\xi). \tag{9.60}$$
It is not really obvious which rule, trapezoidal or Simpson's, is better in general.
For a particular interval [a,b] and n, the two expressions depend on different
derivatives of f(x). It is possible that Simpson's rule may not be an improvement
on the trapezoidal rule in particular cases for this reason. More specifically, a
function that is not too smooth can be expected to have "big" higher derivatives.
Simpson's rule has a truncation error dependent on the fourth derivative, while the
trapezoidal rule has an error that depends only on the second derivative. Thus, a
nonsmooth function might be better approximated by the trapezoidal rule than by
Simpson's. In fact, Davis and Rabinowitz [3, p. 26] state that
The more "refined" a rule of approximate integration is, the more certain we must
be that it has been applied to a function which is sufficiently smooth. There may be
little or no advantage in using a 'better' rule for a function that is not smooth.
We repeat a famous example from Ref. 3 (originally due to Salzer and Levine).
Example 9.2 The following series defines a function due to Weierstrass that happens to be continuous but is, surprisingly, not differentiable anywhere:²
$$W(x) = \sum_{n=1}^{\infty}\frac{1}{2^n}\cos(7^n\pi x). \tag{9.61a}$$
If we assume that we may integrate this expression term by term, then
$$I(y) = \int_0^y W(x)\,dx = \sum_{n=1}^{\infty}\frac{1}{2^n 7^n\pi}\sin(7^n\pi y). \tag{9.61b}$$
Of course, $I = \int_a^b W(x)\,dx = I(b) - I(a)$. The series (9.61b) gives the "exact" value for $I(y)$ and so may be compared to estimates produced by the trapezoidal and Simpson rules. Assuming that $n = 100$, the following table of values is obtained [MATLAB implementation of (9.61) and the numerical integration rules]:
  Interval [a, b]   Exact Value I   Trapezoidal T(n)   Error I - T(n)   Simpson's S(n)   Error I - S(n)
  [0, .1]            0.01899291      0.01898760         0.00000531       0.01901426      -0.00002135
  [.1, .2]          -0.04145650     -0.04143815        -0.00001834      -0.04146554       0.00000904
  [.2, .3]           0.03084617      0.03084429         0.00000188       0.03086261      -0.00001645
  [.3, .4]           0.00337701      0.00342534        -0.00004833       0.00341899      -0.00004198
  [.4, .5]          -0.03298025     -0.03300674         0.00002649      -0.03303611       0.00005586
We see that the errors involved in both the trapezoidal and Simpson rules do
not differ greatly from each other. So Simpson's rule has no advantage here.
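The comparison of Example 9.2 can be approximated with the following Python sketch (not the book's MATLAB code). The infinite series (9.61) must be truncated; the 12-term cutoff used here is an assumption, and the cosine terms dropped from W(x) limit how closely the table values can be reproduced:

```python
import math

def W(x, terms=12):
    # truncated version of (9.61a); the 12-term cutoff is an assumption
    return sum(math.cos(7.0**m * math.pi * x) / 2.0**m for m in range(1, terms + 1))

def I_exact(y, terms=12):
    # term-by-term integral (9.61b), truncated consistently with W
    return sum(math.sin(7.0**m * math.pi * y) / (2.0**m * 7.0**m * math.pi)
               for m in range(1, terms + 1))

def trap(f, a, b, n):
    # composite trapezoidal rule (9.14)
    h = (b - a) / n
    return 0.5 * h * (f(a) + f(b) + 2.0 * sum(f(a + k * h) for k in range(1, n)))

a, b = 0.0, 0.1
print(I_exact(b) - I_exact(a), trap(W, a, b, 100))
```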
It is commonplace for integration problems to involve integrals that possess oscillatory integrands. For example, the Fourier transform of $x(t)$ is defined to be
$$X(\omega) = \int_{-\infty}^{\infty} x(t)e^{-j\omega t}\,dt = \int_{-\infty}^{\infty} x(t)\cos(\omega t)\,dt - j\int_{-\infty}^{\infty} x(t)\sin(\omega t)\,dt. \tag{9.62}$$
You will likely see much of this integral in other books and associated courses (e.g., signals and systems). Also, determination of the Fourier series coefficients required computing [recall Eq. (1.45)]
$$\frac{1}{2\pi}\int_0^{2\pi} f(x)e^{-jnx}\,dx. \tag{9.63}$$
²Proof that $W(x)$ is continuous but not differentiable is quite difficult. There are many different Weierstrass functions possessing this property of continuity without differentiability. Another example complete with a proof appears on pp. 38-41 of Korner [4].
An integrand is said to be rapidly oscillatory if there are numerous (i.e., of the
order of > 10) local maxima and minima over the range of integration (i.e., here
assumed to be the finite interval [a, b]). Some care is often required to compute
these properly with the aid of the integration rules we have considered so far.
However, we will consider only a simple idea called integration between the zeros.
Davis and Rabinowitz have given a more detailed consideration of how to handle
oscillatory integrands [3, pp. 53-68].
Relevant to the computation of (9.63) is, for example, the integral
$$I = \int_0^{2\pi} f(x)\sin(nx)\,dx. \tag{9.64}$$
It may be that $f(x)$ oscillates very little or not at all on $[0, 2\pi]$. We may therefore replace (9.64) with
$$I = \sum_{k=0}^{2n-1}\int_{k\pi/n}^{(k+1)\pi/n} f(x)\sin(nx)\,dx. \tag{9.65}$$
The endpoints of the "subintegrals"
$$I_k = \int_{k\pi/n}^{(k+1)\pi/n} f(x)\sin(nx)\,dx \tag{9.66}$$
in (9.65) are the zeros of $\sin(nx)$ on $[0, 2\pi]$. Thus, we are truly proposing integration between the zeros. At this point it is easiest to approximate (9.66) with either the trapezoidal or Simpson's rules for all $k$. Since the endpoints of the integrands in (9.66) are zero-valued, we can expect some savings in computation as a result because some of the terms in the rules (9.14) and (9.41) will be zero-valued.
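A sketch of integration between the zeros, using Simpson's rule on each subintegral (9.66); the test function f(x) = x and the per-subinterval sample count m are arbitrary choices, not from the text:

```python
import math

def simpson(f, a, b, n):
    # composite Simpson rule (9.41); n must be even
    h = (b - a) / n
    s = f(a) + f(b) + sum((4.0 if k % 2 else 2.0) * f(a + k * h) for k in range(1, n))
    return h * s / 3.0

def between_zeros(f, n, m=10):
    # sum Simpson approximations of the subintegrals (9.66), whose
    # endpoints are consecutive zeros of sin(n x) on [0, 2*pi]
    total = 0.0
    for k in range(2 * n):
        lo, hi = k * math.pi / n, (k + 1) * math.pi / n
        total += simpson(lambda x: f(x) * math.sin(n * x), lo, hi, m)
    return total

# With f(x) = x, integration by parts gives the exact value -2*pi/n.
print(between_zeros(lambda x: x, 5))
```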
9.4 GAUSSIAN QUADRATURE
Gaussian quadrature is a numerical integration method that uses a higher order of
interpolation than do either the trapezoidal or Simpson rules. A detailed derivation
of the method is rather involved as it relies on Hermite interpolation, orthogonal
polynomial theory, and some aspects of the theory rely on issues relating to linear
system solution. Thus, only an outline presentation is given here. However, a
complete description may be found in Hildebrand [5, pp. 382-400]. Additional
information is presented in Davis and Rabinowitz [3].
It helps to recall ideas from Chapter 6 here. If we know $f(x)$ for $x = x_j$, where $j = 0, 1, \ldots, n-1, n$, then the Lagrange interpolating polynomial is (with $p(x_j) = f(x_j)$)
$$p(x) = \sum_{j=0}^{n} f(x_j)L_j(x), \tag{9.67}$$
where
$$L_j(x) = \prod_{\substack{k=0 \\ k \ne j}}^{n}\frac{x - x_k}{x_j - x_k}. \tag{9.68}$$
Recalling (6.45b), we may similarly define
$$\pi(x) = \prod_{k=0}^{n}(x - x_k). \tag{9.69}$$
Consequently
$$\pi^{(1)}(x) = \sum_{j=0}^{n}\prod_{\substack{k=0 \\ k \ne j}}^{n}(x - x_k), \tag{9.70}$$
so that
$$\pi^{(1)}(x_i) = \prod_{\substack{k=0 \\ k \ne i}}^{n}(x_i - x_k), \tag{9.71}$$
which allows us to rewrite (9.68) as
$$L_j(x) = \frac{\pi(x)}{\pi^{(1)}(x_j)(x - x_j)} \tag{9.72}$$
for $j = 0, 1, \ldots, n$. From (6.14)
$$e(x) = f(x) - p(x) = \frac{1}{(n+1)!}f^{(n+1)}(\xi)\pi(x) \tag{9.73}$$
for some $\xi \in [a, b]$, and $\xi = \xi(x)$.
We now summarize Hermite interpolation (recall Section 6.4 for more detail). Suppose that we have knowledge of both $f(x)$ and $f^{(1)}(x)$ at $x = x_j$ (again $j = 0, 1, \ldots, n$). We may interpolate $f(x)$ using a polynomial of degree $2n+1$ since we must match the polynomial to both $f(x)$ and $f^{(1)}(x)$ at $x = x_j$. Thus, we need the polynomial
$$p(x) = \sum_{k=0}^{n} h_k(x)f(x_k) + \sum_{k=0}^{n}\bar{h}_k(x)f^{(1)}(x_k), \tag{9.74}$$
where $h_k(x)$ and $\bar{h}_k(x)$ are both polynomials of degree $2n+1$ that we must determine according to the constraints of our interpolation problem.
If
$$h_i(x_j) = \delta_{i-j}, \qquad \bar{h}_i(x_j) = 0, \tag{9.75a}$$
then $p(x_j) = f(x_j)$, and if we have
$$h_i^{(1)}(x_j) = 0, \qquad \bar{h}_i^{(1)}(x_j) = \delta_{i-j}, \tag{9.75b}$$
then $p^{(1)}(x_j) = f^{(1)}(x_j)$ for all $j = 0, 1, \ldots, n$. Using (9.75), it is possible to arrive at the conclusion that
$$h_i(x) = [1 - 2L_i^{(1)}(x_i)(x - x_i)][L_i(x)]^2 \tag{9.76a}$$
and
$$\bar{h}_i(x) = (x - x_i)[L_i(x)]^2. \tag{9.76b}$$
Equation (9.74) along with (9.76) is Hermite's interpolating formula. [Both parts of Eq. (9.76) are derived on pp. 383-384 of Ref. 5, as well as in Theorem 6.1 of Chapter 6.] It is further possible to prove that for $p(x)$ in (9.74) we have the error function
$$e(x) = f(x) - p(x) = \frac{1}{(2n+2)!}f^{(2n+2)}(\xi)[\pi(x)]^2, \tag{9.77}$$
where $\xi \in [a, b]$ and $\xi = \xi(x)$.
From (9.77) and (9.74), we obtain
$$f(x) = \sum_{k=0}^{n} h_k(x)f(x_k) + \sum_{k=0}^{n}\bar{h}_k(x)f^{(1)}(x_k) + \frac{1}{(2n+2)!}f^{(2n+2)}(\xi(x))[\pi(x)]^2. \tag{9.78}$$
Suppose that $w(x) > 0$ for $x \in [a, b]$. Function $w(x)$ is intended to be a weighting function such as seen in Chapter 5. Consequently, from (9.78)
$$\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n}\left[\int_a^b w(x)h_k(x)\,dx\right]f(x_k) + \sum_{k=0}^{n}\left[\int_a^b w(x)\bar{h}_k(x)\,dx\right]f^{(1)}(x_k) + \frac{1}{(2n+2)!}\int_a^b f^{(2n+2)}(\xi(x))w(x)[\pi(x)]^2\,dx, \tag{9.79}$$
where $a \le \xi(x) \le b$ if $a \le x_k \le b$. This can be rewritten as
$$\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n} H_k f(x_k) + \sum_{k=0}^{n}\bar{H}_k f^{(1)}(x_k) + E, \tag{9.80}$$
where
$$H_k = \int_a^b w(x)h_k(x)\,dx = \int_a^b w(x)[1 - 2L_k^{(1)}(x_k)(x - x_k)][L_k(x)]^2\,dx \tag{9.81a}$$
and
$$\bar{H}_k = \int_a^b w(x)\bar{h}_k(x)\,dx = \int_a^b w(x)(x - x_k)[L_k(x)]^2\,dx. \tag{9.81b}$$
If we neglect the term $E$ in (9.80), then the resulting approximation to $\int_a^b w(x)f(x)\,dx$ is called the Hermite quadrature formula. Since we are assuming that $w(x) > 0$, the second mean-value theorem for integrals allows us to claim that
$$E = \frac{1}{(2n+2)!}f^{(2n+2)}(\xi)\int_a^b w(x)[\pi(x)]^2\,dx \tag{9.82}$$
for some $\xi \in [a, b]$.
Now, recalling (9.72), we see that (9.81b) can be rewritten as
$$\bar{H}_k = \int_a^b w(x)(x - x_k)\left[\frac{\pi(x)}{\pi^{(1)}(x_k)(x - x_k)}\right]^2 dx = \frac{1}{\pi^{(1)}(x_k)}\int_a^b w(x)\pi(x)\frac{\pi(x)}{\pi^{(1)}(x_k)(x - x_k)}\,dx = \frac{1}{\pi^{(1)}(x_k)}\int_a^b w(x)\pi(x)L_k(x)\,dx. \tag{9.83}$$
We recall from Chapter 5 that an inner product on $L^2[a, b]$ is ($f$, $g$ are real-valued)
$$\langle f, g\rangle = \int_a^b w(x)f(x)g(x)\,dx. \tag{9.84}$$
Thus, $\bar{H}_k = 0$ for $k = 0, 1, \ldots, n$ if $\pi(x)$ is orthogonal to $L_k(x)$ over $[a, b]$ with respect to the weighting function $w(x)$. Since $\deg(L_k(x)) = n$ (all $k$), this will be the case if $\pi(x)$ is orthogonal to all polynomials of degree $\le n$ over $[a, b]$ with respect to the weighting function $w(x)$. Note that $\deg(\pi(x)) = n + 1$ [recall (9.69)]. In fact, if the polynomial $\pi(x)$ of degree $n+1$ is orthogonal to all polynomials of degree $\le n$ over $[a, b]$ with respect to $w(x)$, the Hermite quadrature formula reduces to the simpler form
$$\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n} H_k f(x_k) + E, \tag{9.85}$$
where
$$E = \frac{1}{(2n+2)!}f^{(2n+2)}(\xi)\int_a^b w(x)[\pi(x)]^2\,dx, \tag{9.86}$$
and where $x_0, x_1, \ldots, x_n$ are the zeros of $\pi(x)$ (such that $a \le x_k \le b$). A formula of this type is called a Gaussian quadrature formula. The weights $H_k$ are sometimes called Christoffel numbers. We see that this numerical integration methodology requires us to possess samples of $f(x)$ at the zeros of $\pi(x)$. Variations on this theory can be used to remove this restriction [6], but we do not consider this matter in this book. However, if $f(x)$ is known at all $x \in [a, b]$, then this is not a serious restriction.
In any case, to apply the approximation
$$\int_a^b w(x)f(x)\,dx \approx \sum_{k=0}^{n} H_k f(x_k), \tag{9.87}$$
it is clear that we need a method to determine the Christoffel numbers $H_k$. It is possible to do this using the Christoffel-Darboux formula [Eq. (5.11); see also Theorem 5.2]. From this it can be shown that
$$H_k = -\frac{\phi_{n+2,n+2}}{\phi_{n+1,n+1}\,\phi_{n+1}^{(1)}(x_k)\,\phi_{n+2}(x_k)}, \tag{9.88}$$
where polynomial $\phi_r(x)$ is obtained, for instance, from (5.5) and where
$$\phi_{n+1}(x) = \phi_{n+1,n+1}\,\pi(x). \tag{9.89}$$
Thus, we identify the zeros of $\pi(x)$ with the zeros of orthogonal polynomial $\phi_{n+1}(x)$ [recalling that $\langle\phi_i, \phi_j\rangle = \delta_{i-j}$ with respect to inner product (9.84)].
Since there are an infinite number of choices for orthogonal polynomials $\phi_k(x)$ and there is a theory for creating them (Chapter 5), it is possible to choose $\pi(x)$ in an infinite number of ways. We have implicitly assumed that $[a, b]$ is a finite-length interval, but this assumption is actually entirely unnecessary. Infinite or semiinfinite intervals of integration are permitted. Thus, for example, $\pi(x)$ may be associated with the Hermite polynomials of Section 5.4, as well as with the Chebyshev or Legendre polynomials.
Let us consider as an example the case of Chebyshev polynomials of the first kind (first seen in Section 5.3). In this case
$$w(x) = \frac{1}{\sqrt{1 - x^2}}$$
with $[a, b] = [-1, 1]$, and we will obtain the Chebyshev-Gauss quadrature rule. The Chebyshev polynomials of the first kind are $T_k(x) = \cos[k\cos^{-1}x]$, so, via (5.55) for $k > 0$, we have
$$\phi_k(x) = \sqrt{\frac{2}{\pi}}\,T_k(x). \tag{9.90}$$
From the recursion for Chebyshev polynomials of the first kind [Eq. (5.57)], we have ($k > 0$)
$$T_{k,k} = 2^{k-1} \tag{9.91}$$
($T_k(x) = \sum_{j=0}^{k} T_{k,j}x^j$). Thus
$$\phi_{k,k} = \sqrt{\frac{2}{\pi}}\,2^{k-1}. \tag{9.92}$$
We have that $T_k(x) = 0$ for
$$x = x_i = \cos\left(\frac{2i+1}{2k}\pi\right) \tag{9.93}$$
($i = 0, 1, \ldots, k-1$). Additionally
$$T_k^{(1)}(x) = \frac{k\sin[k\cos^{-1}x]}{\sqrt{1 - x^2}}, \tag{9.94}$$
so, if, for convenience, we define $\alpha_i = \frac{2i+1}{2k}\pi$, then $x_i = \cos\alpha_i$, and therefore
$$T_k^{(1)}(x_i) = k\,\frac{\sin(k\alpha_i)}{\sin\alpha_i} = k\,\frac{\sin\left(\frac{2i+1}{2}\pi\right)}{\sin\alpha_i} = \frac{k(-1)^i}{\sin\alpha_i}. \tag{9.95}$$
Also
$$T_{k+1}(x_i) = \cos\left[(k+1)\frac{2i+1}{2k}\pi\right] = \cos\left[\frac{2i+1}{2}\pi + \frac{2i+1}{2k}\pi\right] = -\sin\left(\frac{2i+1}{2}\pi\right)\sin\alpha_i = (-1)^{i+1}\sin\alpha_i. \tag{9.96}$$
Therefore, (9.88) becomes
$$H_k = -\frac{\sqrt{\frac{2}{\pi}}\,2^{n+1}}{\sqrt{\frac{2}{\pi}}\,2^{n}\left[\sqrt{\frac{2}{\pi}}\,\frac{(n+1)(-1)^k}{\sin\alpha_k}\right]\left[\sqrt{\frac{2}{\pi}}\,(-1)^{k+1}\sin\alpha_k\right]} = \frac{\pi}{n+1}. \tag{9.97}$$
Thus, the weights (Christoffel numbers) are all the same in this particular case. So, (9.87) is now
$$\int_{-1}^{1}\frac{f(x)}{\sqrt{1 - x^2}}\,dx \approx \frac{\pi}{n+1}\sum_{k=0}^{n} f\left(\cos\left[\frac{2k+1}{2n+2}\pi\right]\right) = C(n). \tag{9.98}$$
The error expression E in (9.86) can also be reduced accordingly, but we will omit
this here (again, see Hildebrand [5]).
A simple example of the application of (9.98) is as follows.
Example 9.3 Suppose that $f(x) = \sqrt{1 - x^2}$, in which case
$$I = \int_{-1}^{1}\frac{\sqrt{1 - x^2}}{\sqrt{1 - x^2}}\,dx = \int_{-1}^{1}dx = 2.$$
For this case (9.98) becomes
$$I = \int_{-1}^{1}dx \approx C(n) = \frac{\pi}{n+1}\sum_{k=0}^{n}\sin\left[\frac{2k+1}{2n+2}\pi\right].$$
For various $n$, we have the following table of values:

  n      C(n)
  1      2.2214
  2      2.0944
  5      2.0230
  10     2.0068
  20     2.0019
  100    2.0001
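The Chebyshev-Gauss rule (9.98) and the table of Example 9.3 can be checked with a short Python sketch (not from the text):

```python
import math

def cheb_gauss(f, n):
    # Chebyshev-Gauss rule (9.98): nodes are the zeros of T_{n+1}(x)
    # and every Christoffel number equals pi/(n + 1)
    w = math.pi / (n + 1)
    return w * sum(f(math.cos((2 * k + 1) * math.pi / (2 * n + 2)))
                   for k in range(n + 1))

# Example 9.3: f(x) = sqrt(1 - x^2), exact integral 2
for n in (1, 2, 5, 10, 100):
    print(n, cheb_gauss(lambda x: math.sqrt(1.0 - x * x), n))
```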
Finally, we remark that the error expression in (9.82) suggests that the method
of this section is worth applying only if f(x) is sufficiently smooth. This is con-
sistent with comments made in Section 9.3 regarding how to choose between the
trapezoidal and Simpson's rules. In the next example $f(x) = e^{-x}$, which is a very smooth function.
Example 9.4 Here we will consider
$$I = \int_{-1}^{1} e^{-x}\,dx = e - \frac{1}{e} = 2.350402387$$
and compare the approximation to $I$ obtained by applying Simpson's rule and Legendre-Gauss quadrature. We will assume $n = 2$ in both cases.
Let us first consider application of Simpson's rule. Since $x_0 = a = -1$, $x_1 = 0$, $x_2 = b = 1$ [$h = (b-a)/n = (1 - (-1))/2 = 1$], we have via (9.41)
$$S(2) = \frac{1}{3}[e^{+1} + 4e^{0} + e^{-1}] = 2.362053757$$
for which the error is
$$E_{S(2)} = I - S(2) = -0.011651.$$
Now let us consider the Legendre-Gauss quadrature for our problem. We recall that for Legendre polynomials the weight function is $w(x) = 1$ for all $x \in [-1, 1]$ (Section 5.5). From Section 5.6 we have
$$\phi_3(x) = \frac{1}{\|P_3\|}P_3(x) = \sqrt{\frac{7}{2}}\cdot\frac{1}{2}[5x^3 - 3x],$$
$$\phi_4(x) = \frac{1}{\|P_4\|}P_4(x) = \sqrt{\frac{9}{2}}\cdot\frac{1}{8}[35x^4 - 30x^2 + 3].$$
Consequently, $\phi_{3,3} = \frac{5}{2}\sqrt{\frac{7}{2}}$, and the zeros of $\phi_3(x)$ are at $x = 0, \pm\sqrt{\frac{3}{5}}$, so now our sample points (grid points, mesh points) are
$$x_0 = -\sqrt{\frac{3}{5}}, \qquad x_1 = 0, \qquad x_2 = +\sqrt{\frac{3}{5}}.$$
Hence, since $\phi_3^{(1)}(x) = \sqrt{\frac{7}{2}}\cdot\frac{1}{2}[15x^2 - 3]$, we have
$$\phi_3^{(1)}(x_0) = 3\sqrt{\frac{7}{2}}, \qquad \phi_3^{(1)}(x_1) = -\frac{3}{2}\sqrt{\frac{7}{2}}, \qquad \phi_3^{(1)}(x_2) = 3\sqrt{\frac{7}{2}}.$$
Also, $\phi_{4,4} = \frac{35}{8}\sqrt{\frac{9}{2}}$, and
$$\phi_4(x_0) = -\frac{3}{10}\sqrt{\frac{9}{2}}, \qquad \phi_4(x_1) = \frac{3}{8}\sqrt{\frac{9}{2}}, \qquad \phi_4(x_2) = -\frac{3}{10}\sqrt{\frac{9}{2}}.$$
Therefore, from (9.88) the Christoffel numbers are
$$H_0 = \frac{5}{9}, \qquad H_1 = \frac{8}{9}, \qquad H_2 = \frac{5}{9}.$$
From (9.87) the resulting quadrature is
$$\int_{-1}^{1} e^{-x}\,dx \approx H_0 e^{-x_0} + H_1 e^{-x_1} + H_2 e^{-x_2} = L(2)$$
with
$$L(2) = 2.350336929,$$
and the corresponding error is
$$E_{L(2)} = I - L(2) = 6.5458 \times 10^{-5}.$$
Clearly, $|E_{L(2)}| \ll |E_{S(2)}|$. Thus, the Legendre-Gauss quadrature is much more accurate than Simpson's rule. Considering how small $n$ is here, the accuracy of the Legendre-Gauss quadrature is remarkably high.
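Example 9.4 can be checked against a library implementation; here is a sketch (not from the text) using NumPy's `leggauss`, which returns the Legendre-Gauss nodes and weights directly:

```python
import math
import numpy as np

# 3-point Legendre-Gauss rule, as in Example 9.4: leggauss(3) returns
# the nodes (0 and +/-sqrt(3/5)) and weights (8/9 and 5/9).
x, w = np.polynomial.legendre.leggauss(3)

L2 = float(np.sum(w * np.exp(-x)))
I = math.e - 1.0 / math.e
print(L2, I - L2)  # L(2) = 2.350336..., error about 6.5e-5
```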
9.5 ROMBERG INTEGRATION
Romberg integration is a recursive procedure that seeks to improve on the trape-
zoidal and Simpson rules. But before we consider this numerical integration
methodology, we will look at some more basic ideas.
Suppose that $I = \int_a^b f(x)\,dx$ and $I(n)$ is a quadrature that approximates $I$. For us, $I(n)$ will be either the trapezoidal rule $T(n)$ from (9.14), or else it will be Simpson's rule $S(n)$ from (9.41). It could also be the corrected trapezoidal rule $T_C(n)$, which is considered below [see either (9.104) or (9.107)].
It is possible to improve on the "basic" trapezoidal rule from Section 9.2. Begin by recalling (9.27)
$$E_{T(n)} = -\frac{1}{12}h^3\sum_{k=1}^{n} f^{(2)}(\xi_k), \tag{9.99}$$
where $h = (b-a)/n$, $x_0 = a$, $x_n = b$, and $\xi_k \in [x_{k-1}, x_k]$. Of course, $x_k - x_{k-1} = h$ (uniform sampling grid). Assuming (as usual) that $f^{(2)}(x)$ is Riemann integrable, then
$$\lim_{n\to\infty}\sum_{k=1}^{n} h f^{(2)}(\xi_k) = f^{(1)}(b) - f^{(1)}(a) = \int_a^b f^{(2)}(x)\,dx. \tag{9.100}$$
Thus, we have the approximation
$$\sum_{k=1}^{n} h f^{(2)}(\xi_k) \approx f^{(1)}(b) - f^{(1)}(a). \tag{9.101}$$
Consequently,
$$E_{T(n)} = -\frac{1}{12}h^2\sum_{k=1}^{n} h f^{(2)}(\xi_k) \approx -\frac{1}{12}h^2[f^{(1)}(b) - f^{(1)}(a)], \tag{9.102}$$
so
$$I - T(n) \approx -\frac{1}{12}h^2[f^{(1)}(b) - f^{(1)}(a)]. \tag{9.103}$$
This immediately suggests that we can improve on the trapezoidal rule by replacing $T(n)$ with the new approximation
$$T_C(n) = T(n) - \frac{1}{12}h^2[f^{(1)}(b) - f^{(1)}(a)], \tag{9.104}$$
where $T_C(n)$ denotes the corrected trapezoidal rule approximation to $I$. Clearly, once we have $T(n)$, rather little extra effort is needed to obtain $T_C(n)$.
In fact, we do not necessarily need to know $f^{(1)}(x)$ exactly anywhere, much less at the points $x = a$ or $x = b$. In the next section we will argue that either
$$f^{(1)}(x) = \frac{1}{2h}[-3f(x) + 4f(x+h) - f(x+2h)] + \frac{1}{3}h^2 f^{(3)}(\xi) \tag{9.105a}$$
or that
$$f^{(1)}(x) = \frac{1}{2h}[3f(x) - 4f(x-h) + f(x-2h)] + \frac{1}{3}h^2 f^{(3)}(\xi). \tag{9.105b}$$
In (9.105a) $\xi \in [x, x+2h]$, while in (9.105b) $\xi \in [x-2h, x]$. Consequently, with $x_0 = a$ for $\xi_0 \in [a, a+2h]$, we have [via (9.105a)]
$$f^{(1)}(x_0) = f^{(1)}(a) = \frac{1}{2h}[-3f(x_0) + 4f(x_1) - f(x_2)] + \frac{1}{3}h^2 f^{(3)}(\xi_0), \tag{9.106a}$$
and with $x_n = b$ for $\xi_n \in [b-2h, b]$, we have [via (9.105b)]
$$f^{(1)}(x_n) = f^{(1)}(b) = \frac{1}{2h}[3f(x_n) - 4f(x_{n-1}) + f(x_{n-2})] + \frac{1}{3}h^2 f^{(3)}(\xi_n). \tag{9.106b}$$
Thus, (9.104) becomes (approximate corrected trapezoidal rule)
$$T_C(n) = T(n) - \frac{h}{24}[3f(x_n) - 4f(x_{n-1}) + f(x_{n-2}) + 3f(x_0) - 4f(x_1) + f(x_2)] - \frac{h^4}{36}[f^{(3)}(\xi_n) - f^{(3)}(\xi_0)]. \tag{9.107}$$
Of course, in evaluating (9.107) we would exclude the terms involving $f^{(3)}(\xi_0)$ and $f^{(3)}(\xi_n)$.
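A Python sketch of the approximate corrected trapezoidal rule (9.107), with the $f^{(3)}$ terms dropped as the text prescribes (an illustration, not the book's code):

```python
import math

def trap(f, a, b, n):
    # composite trapezoidal rule (9.14)
    h = (b - a) / n
    return 0.5 * h * (f(a) + f(b) + 2.0 * sum(f(a + k * h) for k in range(1, n)))

def trap_corrected(f, a, b, n):
    # approximate corrected trapezoidal rule (9.107): the endpoint
    # derivatives in (9.104) are replaced by the one-sided differences
    # (9.106), so no derivative values of f are required
    h = (b - a) / n
    y = [f(a + k * h) for k in range(n + 1)]
    ends = (3.0 * y[n] - 4.0 * y[n - 1] + y[n - 2]
            + 3.0 * y[0] - 4.0 * y[1] + y[2])
    return trap(f, a, b, n) - (h / 24.0) * ends

# For the integral of Example 9.5 the correction sharply reduces the error.
print(trap(math.sin, 0.0, math.pi / 2, 16),
      trap_corrected(math.sin, 0.0, math.pi / 2, 16))
```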
As noted in Epperson [7], for the trapezoidal and Simpson rules
$$I - I(n) \propto \frac{1}{n^p}, \tag{9.108}$$
where $p = 2$ for $I(n) = T(n)$ and $p = 4$ for $I(n) = S(n)$. In other words, for a given rule and a suitable constant $C$ we must have $I - I(n) \approx Cn^{-p}$. Now observe
ROMBERG INTEGRATION 395
that we may define the ratio

    r_{4n} = [I(n) - I(2n)] / [I(2n) - I(4n)]
           = [(I - C n^{-p}) - (I - C (2n)^{-p})]
             / [(I - C (2n)^{-p}) - (I - C (4n)^{-p})]
           = [2^{-p} - 1] / [4^{-p} - 2^{-p}]
           = 2^p.                                                          (9.109)

Immediately we conclude that

    p = log_2 r_{4n} = log_10 r_{4n} / log_10 2.                           (9.110)
This is useful as a check on program implementation of our quadratures. If (9.110)
is not approximately satisfied when we apply the trapezoidal or Simpson rules,
then (1) the integrand f(x) is not smooth enough for our theories to apply, (2)
there is a "bug" in the program, or (3) the error may not be decreasing quickly
with n because it is already tiny to begin with, as might happen when integrating
an oscillatory function using Simpson's rule.
The following examples illustrate the previous principles.
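Before the worked examples, the check (9.110) is easy to automate. A minimal Python sketch (the book's code is MATLAB; names here are illustrative) that estimates p for the trapezoidal rule applied to a smooth integrand:

```python
import math

def trap(f, a, b, n):
    """Composite trapezoidal rule T(n) on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

def order_estimate(I_n, I_2n, I_4n):
    """p = log2 r_4n with r_4n = (I(n) - I(2n)) / (I(2n) - I(4n)); Eq. (9.110)."""
    r = (I_n - I_2n) / (I_2n - I_4n)
    return math.log(r, 2)

a, b = 0.0, math.pi / 2
p = order_estimate(trap(math.sin, a, b, 4),
                   trap(math.sin, a, b, 8),
                   trap(math.sin, a, b, 16))
```

For f(x) = sin x the estimate lands very close to the theoretical p = 2, matching the "p for T(n)" column in Example 9.5.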
Example 9.5 In this example we consider approximating

    I = ∫_0^{π/2} sin x dx = 1

using T(n) [via (9.14)], T_c(n) [via (9.104)], and S(n) [via (9.41)]. Parameter p in
(9.110) is computed for each of these cases, and is displayed in the following table
(where "NaN" means "not a number"):
n     T(n)         p for T(n)   T_c(n)       p for T_c(n)   S(n)         p for S(n)
2     0.94805945   NaN          0.99946364   NaN            1.00227988   NaN
4     0.98711580   NaN          0.99996685   NaN            1.00013458   NaN
8     0.99678517   2.0141       0.99999793   4.0169         1.00000830   4.0864
16    0.99919668   2.0035       0.99999987   4.0042         1.00000052   4.0210
32    0.99979919   2.0009       0.99999999   4.0010         1.00000003   4.0052
64    0.99994980   2.0002       1.00000000   4.0003         1.00000000   4.0013
128   0.99998745   2.0001       1.00000000   4.0001         1.00000000   4.0003
256   0.99999686   2.0000       1.00000000   4.0000         1.00000000   4.0001
We see that since f(x) = sin x is a rather smooth function we obtain the values
for p that we expect to see.
Example 9.6 This example is in contrast with the previous one. Here we
approximate

    I = ∫_0^1 x^{1/3} dx = 3/4
using T(n) [via (9.14)], T_c(n) [via (9.107)], and S(n) [via (9.41)]. Again, p
[via (9.110)] is computed for each of these cases, and the results are tabulated
as follows:

n     T(n)         p for T(n)   T_c(n)       p for T_c(n)   S(n)         p for S(n)
2     0.64685026   NaN          0.69580035   NaN            0.69580035   NaN
4     0.70805534   NaN          0.72437494   NaN            0.72845703   NaN
8     0.73309996   1.2892       0.73980487   0.8890         0.74144817   1.3298
16    0.74322952   1.3059       0.74595297   1.3275         0.74660604   1.3327
32    0.74729720   1.3163       0.74839388   1.3327         0.74865310   1.3332
64    0.74892341   1.3227       0.74936261   1.3333         0.74946548   1.3333
128   0.74957176   1.3267       0.74974705   1.3333         0.74978788   1.3333
256   0.74982980   1.3292       0.74989962   1.3333         0.74991582   1.3333
We observe that f(x) = x^{1/3}, but that f^{(1)}(x) = (1/3) x^{-2/3},
f^{(2)}(x) = -(2/9) x^{-5/3}, etc. Thus, the derivatives of f(x) are unbounded
at x = 0, and so f(x) is not smooth on the interval of integration. This explains
why we obtain p ≈ 1.3333 in all cases.
As a further step toward Romberg integration, consider the following. Since
I - I(2n) ≈ C(2n)^{-p} = C 2^{-p} n^{-p} ≈ 2^{-p} [I - I(n)], we obtain the
approximate equality

    I ≈ [I(2n) - 2^{-p} I(n)] / [1 - 2^{-p}],

or

    I ≈ [2^p I(2n) - I(n)] / [2^p - 1] = R(2n).                            (9.111)
We call R(2n) Richardson's extrapolated value (or Richardson's extrapolation),
which is an improvement on I(2n). The estimated error in the extrapolation is
given by

    E_{R(2n)} = I(2n) - R(2n) = [I(n) - I(2n)] / [2^p - 1].                (9.112)

Of course, p in (9.111) and (9.112) must be the proper choice for the quadrature
I(n). We may "confirm" that (9.111) works for T(n) (as an example) by considering
(9.28)

    E_T(n) = -(h^2/12)(b - a) f^{(2)}(ξ)                                   (9.113)
(for some ξ ∈ [a, b]). Clearly

    E_T(2n)/E_T(n) ≈ (h/2)^2 / h^2 = 1/4,

or in other words (E_T(n) = I - T(n))

    I - T(2n) ≈ (1/4)[I - T(n)],

and hence for n ≥ 1

    I ≈ [4T(2n) - T(n)] / 3 = R_T(2n),                                     (9.114)

which is (9.111) for case p = 2. Equation (9.114) is called the Romberg integration
formula for the trapezoidal rule. Of course, a similar expression may be obtained
for Simpson's rule [with p = 4 in (9.111)]; that is, for n even

    I ≈ [16 S(2n) - S(n)] / 15 = R_S(2n).                                  (9.115)

In fact, it can also be shown that R_T(2n) = S(2n):

    S(2n) = [4T(2n) - T(n)] / 3.                                           (9.116)

(Perhaps this is most easily seen in the special case where n = 1.) In other words,
the Romberg procedure applied to the trapezoidal rule yields the Simpson rule.
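The identity (9.116) is easy to confirm numerically. A minimal Python sketch (the book's code is MATLAB; these helper names are illustrative) comparing the Richardson-extrapolated trapezoidal value with composite Simpson:

```python
import math

def trap(f, a, b, n):
    """Composite trapezoidal rule T(n)."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

def simpson(f, a, b, n):
    """Composite Simpson rule S(n); n must be even."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + k * h) for k in range(1, n, 2))
    s += 2 * sum(f(a + k * h) for k in range(2, n, 2))
    return h * s / 3.0

f, a, b, n = math.exp, 0.0, 1.0, 4
# Eq. (9.116): [4 T(2n) - T(n)] / 3 coincides with S(2n).
richardson = (4 * trap(f, a, b, 2 * n) - trap(f, a, b, n)) / 3.0
```

The two values agree to within floating-point rounding, as (9.116) predicts.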
Now, Romberg integration is really the repeated application of the Richardson
extrapolation idea to the composite trapezoidal rule. A simple way to visualize the
process is with the Romberg table (Romberg array):

    T(1)
    T(2)    S(2)
    T(4)    S(4)    R_S(4)
    T(8)    S(8)    R_S(8)    ?
    T(16)   S(16)   R_S(16)   ?    ?

In its present form the table consists of only three named columns, but the recursive
process may be continued to produce a complete "triangular array."
The complete Romberg integration procedure is often fully justified and developed
with respect to the following theorem.

Theorem 9.1: Euler-Maclaurin Formula Let f ∈ C^{2k+2}[a, b] for some
k ≥ 0, and let us approximate I = ∫_a^b f(x) dx by the composite trapezoidal rule
of (9.14). Letting h_n = (b - a)/n for n ≥ 1, we have

    T(n) = I + Σ_{i=1}^{k} [B_{2i}/(2i)!] h_n^{2i} [f^{(2i-1)}(b) - f^{(2i-1)}(a)]
             + [B_{2k+2}/(2k+2)!] h_n^{2k+2} (b - a) f^{(2k+2)}(η),        (9.117)

where η ∈ [a, b], and for j ≥ 1

    B_{2j} = (-1)^{j-1} [ Σ_{m=1}^{∞} 2/(2πm)^{2j} ] (2j)!                 (9.118)

are the Bernoulli numbers.

Proof This is Property 9.3 in Quarteroni et al. [8]. A proof appears in Ralston
[9]. Alternative descriptions of the Bernoulli numbers appear in Gradshteyn and
Ryzhik [10]. Although not apparent from (9.118), the Bernoulli numbers are all
rational numbers.
We will present the complete Romberg integration process in a more straightforward
manner. Begin by considering the following theorem.

Theorem 9.2: Recursive Trapezoidal Rule Suppose that h = (b - a)/(2n);
then, for n ≥ 1 (x_0 = a, x_n = b)

    T(2n) = (1/2) T(n) + h Σ_{k=1}^{n} f(x_0 + (2k - 1)h).                 (9.119)

The first column in the Romberg table is given by

    T(2^n) = (1/2) T(2^{n-1})
             + [(b - a)/2^n] Σ_{k=1}^{2^{n-1}} f(x_0 + (2k - 1)(b - a)/2^n)   (9.120)

for all n ≥ 1.

Proof Omitted, but clearly (9.120) immediately follows from (9.119).
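The point of (9.119) is that doubling n reuses all previous samples: only the n new midpoints are evaluated. A minimal Python sketch (hedged: the book's routines are MATLAB, and this helper name is illustrative):

```python
import math

def trap_refine(f, a, b, T_n, n):
    """Given T(n) on [a, b], return T(2n) via Eq. (9.119): only the n new
    midpoints f(x_0 + (2k - 1)h), h = (b - a)/(2n), are evaluated."""
    h = (b - a) / (2 * n)
    return 0.5 * T_n + h * sum(f(a + (2 * k - 1) * h) for k in range(1, n + 1))

f, a, b = math.exp, 0.0, 1.0
T = 0.5 * (b - a) * (f(a) + f(b))   # T(1)
n = 1
while n < 8:                        # T(1) -> T(2) -> T(4) -> T(8)
    T = trap_refine(f, a, b, T, n)
    n *= 2
```

After the loop T holds T(8) for ∫_0^1 e^x dx, which matches the first-column entry 1.72051859 in Example 9.7.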
We let R_n^{(k)} denote the row n, column k entry of the Romberg table, where k =
0, 1, ..., N and n = k, ..., N [i.e., we construct an (N + 1) × (N + 1) lower trian-
gular array, as suggested earlier]. Table entries are "blank" for n = 0, 1, ..., k - 1
in column k. The first column of the table is certainly

    R_n^{(0)} = T(2^n)                                                     (9.121)

for n = 0, 1, ..., N. From Theorem 9.2 we must have

    R_0^{(0)} = [(b - a)/2][f(a) + f(b)],                                  (9.122a)

    R_n^{(0)} = (1/2) R_{n-1}^{(0)}
                + [(b - a)/2^n] Σ_{k=1}^{2^{n-1}} f(a + (2k - 1)(b - a)/2^n)   (9.122b)

for n = 1, 2, ..., N. Equations (9.122) are the algorithm for constructing the first
column of the Romberg table. The second column is R_n^{(1)}, and these numbers are
given by [via (9.114)]

    R_n^{(1)} = [4 R_n^{(0)} - R_{n-1}^{(0)}] / 3                          (9.123)

for n = 1, 2, ..., N. Similarly, the third column is R_n^{(2)}, and these numbers are
given by [via (9.115)]

    R_n^{(2)} = [4^2 R_n^{(1)} - R_{n-1}^{(1)}] / [4^2 - 1]                (9.124)

for n = 2, 3, ..., N. The pattern suggested by (9.123) and (9.124) generalizes
according to

    R_n^{(k)} = [4^k R_n^{(k-1)} - R_{n-1}^{(k-1)}] / [4^k - 1]            (9.125)
for n = k, ..., N, with k = 1, 2, ..., N. Assuming that f(x) is sufficiently smooth,
we can estimate the error using the Richardson extrapolation method in this manner:

    E_n^{(k)} = [R_{n-1}^{(k)} - R_n^{(k)}] / [4^{k+1} - 1].               (9.126)

This can be used to stop the recursive process of table construction when E_n^{(k)}
is small enough. Recall that every entry R_n^{(k)} in the table is an estimate of I =
∫_a^b f(x) dx. In some sense R_N^{(N)} is the "final estimate," and will be the best one if
f(x) is smooth enough. Finally, the general appearance of the Romberg table is
    R_0^{(0)}
    R_1^{(0)}      R_1^{(1)}
    R_2^{(0)}      R_2^{(1)}      R_2^{(2)}
    ...
    R_{N-1}^{(0)}  R_{N-1}^{(1)}  R_{N-1}^{(2)}  ...  R_{N-1}^{(N-1)}
    R_N^{(0)}      R_N^{(1)}      R_N^{(2)}      ...  R_N^{(N-1)}      R_N^{(N)}
The Romberg integration procedure is efficient. Function evaluation is confined
to the construction of the first column. The remaining columns are filled in with
just a fixed (and small) number of arithmetic operations per entry as determined
by (9.125).
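The whole procedure (9.121)-(9.125) fits in a few lines. A minimal Python sketch (the book's code is MATLAB; the function name is illustrative) that builds the lower-triangular table:

```python
import math

def romberg(f, a, b, N):
    """Romberg table R[n][k], Eqs. (9.121)-(9.125). R[n][0] = T(2^n) is
    built recursively [Eq. (9.122b)], then each new column extrapolates the
    previous one: R[n][k] = (4^k R[n][k-1] - R[n-1][k-1]) / (4^k - 1)."""
    R = [[0.0] * (N + 1) for _ in range(N + 1)]
    R[0][0] = 0.5 * (b - a) * (f(a) + f(b))
    for n in range(1, N + 1):
        h = (b - a) / 2 ** n
        # Reuse R[n-1][0]; only the new midpoints are sampled.
        R[n][0] = 0.5 * R[n - 1][0] + h * sum(
            f(a + (2 * k - 1) * h) for k in range(1, 2 ** (n - 1) + 1))
        for k in range(1, n + 1):
            R[n][k] = (4 ** k * R[n][k - 1] - R[n - 1][k - 1]) / (4 ** k - 1)
    return R

R = romberg(math.exp, 0.0, 1.0, 3)
# For I = integral of e^x on [0, 1] = e - 1, R[3][3] reproduces the
# bottom-right entry of the table in Example 9.7.
```

Function evaluation is confined to the first column, exactly as the text notes; the remaining entries each cost a fixed handful of arithmetic operations.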
Example 9.7 Begin by considering the Romberg approximation to

    I = ∫_0^1 e^x dx = 1.718281828.

The Romberg table for this is (N = 3)

    1.85914091
    1.75393109   1.71886115
    1.72722190   1.71831884   1.71828269
    1.72051859   1.71828415   1.71828184   1.71828183

Table entry R_3^{(3)} = 1.71828183 is certainly the most accurate estimate of I. Now
contrast this example with the next one.
The zeroth-order modified Bessel function of the first kind I_0(y) ∈ R (y ∈ R)
is important in applied probability. For example, it appears in the problem of
computing bit error probabilities in amplitude shift keying (ASK) digital data com-
munications [11]. There is a series expansion expression for I_0(y), but there is also
an integral form, which is

    I_0(y) = (1/(2π)) ∫_0^{2π} e^{y cos x} dx.                             (9.127)

For y = 1, we have (according to MATLAB's besseli function; I_0(y) = besseli(0,y))

    I_0(1) = 1.2660658778.

The Romberg table of estimates for this integral is (N = 4)

    2.71828183
    1.54308063   1.15134690
    1.27154032   1.18102688   1.18300554
    1.26606608   1.26424133   1.26978896   1.27116647
    1.26606588   1.26606581   1.26618744   1.26613028   1.26611053

Plainly, table entry R_4^{(4)} = 1.26611053 is not as accurate as R_4^{(0)} =
1.26606588. The integrand of (9.127) is smooth but periodic on [0, 2π], so the
odd-order endpoint derivatives in (9.117) cancel and the plain trapezoidal rule is
already exceptionally accurate; the extrapolations of the Romberg process therefore
have nothing left to remove, and cannot be expected to help here.
NUMERICAL DIFFERENTIATION 401
9.6 NUMERICAL DIFFERENTIATION
A simple theory of numerical approximation to the derivative can be obtained via
Taylor series expansions (Chapter 3). Recall that [via (3.71)]

    f(x + h) = Σ_{k=0}^{n} (h^k/k!) f^{(k)}(x) + [h^{n+1}/(n + 1)!] f^{(n+1)}(ξ)   (9.128)

for suitable ξ ∈ [x, x + h]. As usual, 0! = 1, and f(x) = f^{(0)}(x). Since from ele-
mentary calculus

    f^{(1)}(x) = lim_{h→0} [f(x + h) - f(x)]/h,                            (9.129)

from (9.128), we have

    f^{(1)}(x) = [f(x + h) - f(x)]/h - (1/2!) h f^{(2)}(ξ).                (9.130)

This was obtained simply by rearranging

    f(x + h) = f(x) + h f^{(1)}(x) + (1/2!) h^2 f^{(2)}(ξ).                (9.131)

We may write

    f^{(1)}(x) = [f(x + h) - f(x)]/h - (1/2!) h f^{(2)}(ξ),                (9.132)

where the first term on the right is the approximation f̂_f^{(1)}(x), and the
remaining term is the error e_f(x). Approximation f̂_f^{(1)}(x) is called the forward
difference approximation to f^{(1)}(x), and the error e_f(x) is seen to be
approximately proportional to h. Now consider (ξ_1 ∈ [x, x + h])

    f(x + h) = f(x) + h f^{(1)}(x) + (1/2) h^2 f^{(2)}(x) + (1/3!) h^3 f^{(3)}(ξ_1),   (9.133)

and clearly (ξ_2 ∈ [x - h, x])

    f(x - h) = f(x) - h f^{(1)}(x) + (1/2) h^2 f^{(2)}(x) - (1/3!) h^3 f^{(3)}(ξ_2).   (9.134)

From (9.134)

    f^{(1)}(x) = [f(x) - f(x - h)]/h + (1/2!) h f^{(2)}(ξ),                (9.135)

where the first term on the right is f̂_b^{(1)}(x), and the remaining term is the
error e_b(x).
Here the approximation f̂_b^{(1)}(x) is called the backward difference approximation to
f^{(1)}(x), and has error e_b(x) that is also approximately proportional to h. However,
an improvement is possible, and this is obtained by subtracting (9.134) from (9.133):

    f(x + h) - f(x - h) = 2h f^{(1)}(x) + (h^3/3!)[f^{(3)}(ξ_1) + f^{(3)}(ξ_2)],

or on rearranging this, we have

    f^{(1)}(x) = [f(x + h) - f(x - h)]/(2h)
                 - (h^2/6) · (1/2)[f^{(3)}(ξ_1) + f^{(3)}(ξ_2)].           (9.136)

Recalling the derivation of (9.28), there is a ξ ∈ [x - h, x + h] such that

    f^{(3)}(ξ) = (1/2)[f^{(3)}(ξ_1) + f^{(3)}(ξ_2)]                        (9.137)

(ξ_1 ∈ [x, x + h], ξ_2 ∈ [x - h, x]). Hence, (9.136) can be rewritten as (for some
ξ ∈ [x - h, x + h])

    f^{(1)}(x) = [f(x + h) - f(x - h)]/(2h) - (h^2/6) f^{(3)}(ξ),          (9.138)

where the first term on the right is the central difference approximation
f̂_c^{(1)}(x), and the remaining term is its error e_c(x). Clearly, the error
e_c(x) is proportional to h^2. Thus, if f(x) is smooth enough, the
central difference approximation is more accurate than the forward or backward
difference approximations.
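The different error orders are easy to see numerically. A minimal Python sketch (hedged: the book works in MATLAB; these helper names are illustrative) comparing forward and central differences for f = sin at x = 1:

```python
import math

def forward(f, x, h):   # Eq. (9.132): error ~ h
    return (f(x + h) - f(x)) / h

def central(f, x, h):   # Eq. (9.138): error ~ h^2
    return (f(x + h) - f(x - h)) / (2 * h)

x, exact = 1.0, math.cos(1.0)   # f = sin, so f' = cos
errs = {}
for h in (1e-2, 1e-3):
    errs[h] = (abs(forward(math.sin, x, h) - exact),
               abs(central(math.sin, x, h) - exact))
# Shrinking h by 10 cuts the forward-difference error by roughly 10,
# but cuts the central-difference error by roughly 100.
```

The error ratios confirm the O(h) behavior of e_f(x) versus the O(h^2) behavior of e_c(x).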
The errors e_f(x), e_b(x), and e_c(x) are truncation errors in the approximations.
Of course, when implementing any of these approximations on a computer there
will be rounding errors, too. Each approximation can be expressed in the form

    f̂^{(1)}(x) = (1/h) f^T y,                                             (9.139)

where for the forward difference approximation

    f = [f(x)  f(x + h)]^T,   y = [-1  1]^T,                               (9.140a)

for the backward difference approximation

    f = [f(x - h)  f(x)]^T,   y = [-1  1]^T,                               (9.140b)

and for the central difference approximation

    f = [f(x - h)  f(x + h)]^T,   y = (1/2)[-1  1]^T.                      (9.140c)

Thus, the approximations are all Euclidean inner products of samples of f(x) with
a vector of constants y, followed by division by h. This is structurally much the
same kind of computation as numerical integration. In fact, an upper bound on the
size of the error due to rounding is given by the following theorem.

Theorem 9.3: Since fl[f̂^{(1)}(x)] = fl[ fl(f^T y)/h ], we have (f, y ∈ R^m)

    |fl[f̂^{(1)}(x)] - f̂^{(1)}(x)| ≤ (u/h)[1.01m + 1] ||f||_2 ||y||_2.    (9.141)
Proof Our analysis here is rather similar to Example 2.4 in Chapter 2. Thus,
we exploit yet again the results from Chapter 2 on rounding errors in dot product
computation.

Via (2.41)

    fl[f^T y] = f^T y (1 + ε_1),

for which

    |ε_1| ≤ 1.01 m u |f|^T |y| / |f^T y|.

In addition

    fl[f̂^{(1)}(x)] = (f^T y / h)(1 + ε_1)(1 + ε),

where |ε| ≤ u. Thus, since

    fl[f̂^{(1)}(x)] = f^T y / h + (f^T y / h)(ε_1 + ε + ε_1 ε),

we have

    |fl[f̂^{(1)}(x)] - f̂^{(1)}(x)| ≤ (|f^T y|/h)(|ε_1| + |ε| + |ε_1||ε|)
        ≤ (|f^T y|/h)[1.01 m u |f|^T|y|/|f^T y| + u + 1.01 m u^2 |f|^T|y|/|f^T y|],

and since u^2 << u, we may neglect the last term, yielding

    |fl[f̂^{(1)}(x)] - f̂^{(1)}(x)| ≤ (1/h)(1.01 m u |f|^T|y| + u |f^T y|).
But |f|^T|y| = <|f|, |y|> ≤ ||f||_2 ||y||_2, and |f^T y| = |<f, y>| ≤ ||f||_2 ||y||_2 via
Theorem 1.1, so finally we have

    |fl[f̂^{(1)}(x)] - f̂^{(1)}(x)| ≤ (u/h)[1.01m + 1] ||f||_2 ||y||_2,

which is the theorem statement.

The bound in (9.141) suggests that, since we treat u, m, ||f||_2, and ||y||_2 as
fixed, as h becomes smaller, the rounding errors will grow in size. On the other
hand, as h becomes smaller, the truncation errors diminish in size. Thus, much as
with numerical integration (recall Fig. 9.3), there is a tradeoff between rounding
and truncation errors, leading in the present case to the existence of some optimal
value for the choice of h. In most practical circumstances the truncation errors will
dominate, however.
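This tradeoff is visible in a few lines of Python (a hedged sketch; the book's own experiments, e.g. Problem 9.20, use MATLAB). Here the central difference is applied to log_e at x = 2, where the exact derivative is 1/2:

```python
import math

# Central-difference error for f = log_e at x = 2 as h shrinks: the
# truncation error falls like h^2 until rounding (which grows like u/h,
# per the bound (9.141)) takes over.
x, exact = 2.0, 0.5
errors = {}
for k in range(1, 13):
    h = 10.0 ** (-k)
    d = (math.log(x + h) - math.log(x - h)) / (2 * h)
    errors[k] = abs(d - exact)
best = min(errors, key=errors.get)
# The smallest error occurs at an intermediate h, not at the smallest h tried.
```

Printing the errors shows them decreasing, bottoming out, and then growing again as h shrinks further, which is exactly the rounding/truncation tradeoff described above.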
We recall that interpolation theory from Chapter 6 was useful in developing
theories on numerical integration in earlier sections of the present chapter. We
therefore reasonably expect that interpolation ideas from Chapter 6 ought to be
helpful in developing approximations to the derivative.
Recall that p_n(x) = Σ_{k=0}^{n} p_{n,k} x^k interpolates f(x) for x ∈ [a, b], i.e., p_n(x_k) =
f(x_k) with x_0 = a, x_n = b, and x_k ∈ [a, b] for all k, but such that x_k ≠ x_j for
k ≠ j. If f(x) ∈ C^{n+1}[a, b], then, from (6.14),

    f(x) = p_n(x) + [1/(n + 1)!] f^{(n+1)}(ξ) Π_{i=0}^{n} (x - x_i)        (9.142)

(ξ = ξ(x) ∈ [a, b]). For convenience we also define π(x) = Π_{i=0}^{n} (x - x_i). Thus,
from (9.142)

    f^{(1)}(x) = p_n^{(1)}(x)
                 + [1/(n + 1)!] [ π^{(1)}(x) f^{(n+1)}(ξ) + π(x) (d/dx) f^{(n+1)}(ξ) ].   (9.143)

Since ξ = ξ(x) is a function of x that is seldom known, it is not at all clear
what d f^{(n+1)}(ξ)/dx is in general, and to evaluate this also assumes that dξ(x)/dx
exists.^3 We may sidestep this problem by evaluating (9.143) only for x = x_k, and
since π(x_k) = 0 for all k, Eq. (9.143) reduces to

    f^{(1)}(x_k) = p_n^{(1)}(x_k) + [1/(n + 1)!] π^{(1)}(x_k) f^{(n+1)}(ξ(x_k)).   (9.144)

For simplicity we will now assume that x_k = x_0 + hk, h = (b - a)/n (i.e.,
uniform sampling grid). We will also suppose that n = 2, in which case (9.144)
becomes

    f^{(1)}(x_k) = p_2^{(1)}(x_k) + (1/3!) π^{(1)}(x_k) f^{(3)}(ξ(x_k)).   (9.145)

^3 This turns out to be true, although it is not easy to prove. Fortunately, we do not need this result.
Since π(x) = (x - x_0)(x - x_1)(x - x_2), we have π^{(1)}(x) = (x - x_1)(x - x_2) +
(x - x_0)(x - x_2) + (x - x_0)(x - x_1), and therefore

    π^{(1)}(x_0) = (x_0 - x_1)(x_0 - x_2) = (-h)(-2h) = 2h^2,
    π^{(1)}(x_1) = (x_1 - x_0)(x_1 - x_2) = h(-h)     = -h^2,              (9.146)
    π^{(1)}(x_2) = (x_2 - x_0)(x_2 - x_1) = (2h)h     = 2h^2.

From (6.9), we obtain

    p_2(x) = f(x_0) L_0(x) + f(x_1) L_1(x) + f(x_2) L_2(x),                (9.147)

where, from (6.11),

    L_j(x) = Π_{i=0, i≠j}^{2} (x - x_i)/(x_j - x_i),                       (9.148)

and hence the approximation to f^{(1)}(x_k) is (k ∈ {0, 1, 2})

    p_2^{(1)}(x_k) = f(x_0) L_0^{(1)}(x_k) + f(x_1) L_1^{(1)}(x_k)
                     + f(x_2) L_2^{(1)}(x_k).                              (9.149)

From (9.148)

    L_0(x) = (x - x_1)(x - x_2)/[(x_0 - x_1)(x_0 - x_2)],
        L_0^{(1)}(x) = [(x - x_1) + (x - x_2)]/[(x_0 - x_1)(x_0 - x_2)],
    L_1(x) = (x - x_0)(x - x_2)/[(x_1 - x_0)(x_1 - x_2)],
        L_1^{(1)}(x) = [(x - x_0) + (x - x_2)]/[(x_1 - x_0)(x_1 - x_2)],   (9.150)
    L_2(x) = (x - x_0)(x - x_1)/[(x_2 - x_0)(x_2 - x_1)],
        L_2^{(1)}(x) = [(x - x_0) + (x - x_1)]/[(x_2 - x_0)(x_2 - x_1)],

so that

    L_0^{(1)}(x_0) = -3/(2h),  L_0^{(1)}(x_1) = -1/(2h),  L_0^{(1)}(x_2) = +1/(2h),
    L_1^{(1)}(x_0) = +2/h,     L_1^{(1)}(x_1) = 0,        L_1^{(1)}(x_2) = -2/h,      (9.151)
    L_2^{(1)}(x_0) = -1/(2h),  L_2^{(1)}(x_1) = +1/(2h),  L_2^{(1)}(x_2) = +3/(2h).

We therefore have the following expressions for f^{(1)}(x_k):

    f^{(1)}(x_0) = (1/(2h))[-3f(x_0) + 4f(x_1) - f(x_2)] + (1/3) h^2 f^{(3)}(ξ(x_0)),   (9.152a)
    f^{(1)}(x_1) = (1/(2h))[-f(x_0) + f(x_2)] - (1/6) h^2 f^{(3)}(ξ(x_1)),              (9.152b)
    f^{(1)}(x_2) = (1/(2h))[f(x_0) - 4f(x_1) + 3f(x_2)] + (1/3) h^2 f^{(3)}(ξ(x_2)).    (9.152c)
We recognize that the case (9.152b) contains the central difference approximation
[recall (9.138)], since we may let x_0 = x - h, x_2 = x + h (and x_1 = x). If we let
x_0 = x, x_1 = x + h, and x_2 = x + 2h, then (9.152a) yields

    f^{(1)}(x) ≈ (1/(2h))[-3f(x) + 4f(x + h) - f(x + 2h)],                 (9.153)

and if we let x_2 = x, x_1 = x - h, and x_0 = x - 2h, then (9.152c) yields

    f^{(1)}(x) ≈ (1/(2h))[f(x - 2h) - 4f(x - h) + 3f(x)].                  (9.154)

Note that (9.153) and (9.154) were employed in obtaining T_c(n) in (9.107).
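The one-sided formulas (9.153) and (9.154) retain the O(h^2) accuracy of the central difference while needing samples on one side of x only, which is what makes them usable at the endpoints a and b. A minimal Python sketch (hedged: function names are illustrative, not from the book):

```python
import math

def d_forward3(f, x, h):
    """Eq. (9.153): one-sided, needs no samples to the left of x."""
    return (-3 * f(x) + 4 * f(x + h) - f(x + 2 * h)) / (2 * h)

def d_backward3(f, x, h):
    """Eq. (9.154): one-sided, needs no samples to the right of x."""
    return (f(x - 2 * h) - 4 * f(x - h) + 3 * f(x)) / (2 * h)

x, exact = 0.5, math.cos(0.5)           # f = sin, f' = cos
e1 = abs(d_forward3(math.sin, x, 1e-2) - exact)
e2 = abs(d_forward3(math.sin, x, 5e-3) - exact)
# Halving h cuts the error by roughly 4, consistent with the
# O(h^2) remainder term (1/3) h^2 f'''.
```

The roughly fourfold error reduction on halving h is the signature of the h^2 error term in (9.152a).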
REFERENCES
1. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977.
2. L. Bers, Calculus: Preliminary Edition, Vol. 2, Holt, Rinehart, Winston, New York,
1967.
3. P. J. Davis and P. Rabinowitz, Numerical Integration, Blaisdell, Waltham, MA, 1967.
4. T. W. Korner, Fourier Analysis, Cambridge Univ. Press, New York, 1988.
5. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York,
1974.
6. W. Sweldens and R. Piessens, "Quadrature Formulae and Asymptotic Error Expansions
for Wavelet Approximations of Smooth Functions," SIAM J. Numer. Anal. 31, 1240-
1264 (Aug. 1994).
7. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York,
2002.
8. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Mathematics series, Vol. 37), Springer-Verlag, New York, 2000.
9. A. Ralston, A First Course in Numerical Analysis, McGraw-Hill, New York, 1965.
10. I. S. Gradshteyn and I. M. Ryzhik, in Table of Integrals, Series and Products, 5th ed.,
A. Jeffrey, ed., Academic Press, San Diego, CA, 1994.
11. R. E. Ziemer and W. H. Tranter, Principles of Communications: Systems, Modulation,
and Noise, Houghton Mifflin, Boston, MA, 1976.
PROBLEMS
9.1. This problem is based on an assignment problem due to I. Leonard. Consider

    I_n = ∫_0^1 x^n/(x + a) dx,

where a > 1 and n ∈ Z^+. It is easy to see that for 0 ≤ x ≤ 1, we have
x^{n+1} ≤ x^n, and 0 < I_{n+1} < I_n for all n ∈ Z^+. For 0 ≤ x ≤ 1, we have

    x^n/(1 + a) ≤ x^n/(x + a) ≤ x^n/a,

implying that

    ∫_0^1 x^n/(1 + a) dx < I_n < ∫_0^1 x^n/a dx,

or

    1/[(n + 1)(1 + a)] < I_n < 1/[(n + 1)a],

so immediately lim_{n→∞} I_n = 0. Also, we have the difference equation

    I_n = ∫_0^1 x^{n-1}[x + a - a]/(x + a) dx = 1/n - a I_{n-1}            (9.P.1)

for n ∈ N, where I_0 = ∫_0^1 dx/(x + a) = [log_e(x + a)]_0^1 = log_e((1 + a)/a).

(a) Assume that Î_0 = I_0 + ε is the computed value of I_0. Assume that no
    other errors arise in computing Î_n for n ≥ 1 using (9.P.1). Then

        Î_n = 1/n - a Î_{n-1}.

    Define the error e_n = Î_n - I_n, and find a difference equation for e_n.
(b) Solve for e_n, and show that for large enough a we have lim_{n→∞}
    |e_n| = ∞.
(c) Find a stable algorithm to compute I_n for n ∈ Z^+.
9.2. (a) Find an upper bound on the magnitude of the rounding error involved in
         applying the trapezoidal rule to

             I = ∫_0^1 x dx.

     (b) Find an upper bound on the magnitude of the truncation error in applying
         the trapezoidal rule to the integral in (a) above.
9.3. Consider the integral

         I(ε) = ∫_ε^a √x dx,

     where a > ε > 0. Write a MATLAB routine to fill in the following table for
     a = 1, and n = 100:

     ε        |I(ε) - T(n)|   B_T(n)   |I(ε) - S(n)|   B_S(n)
     0.1000
     0.0100
     0.0010
     0.0001

     In this table T(n) is from (9.14), S(n) is from (9.41), and

         |E_T(n)| ≤ B_T(n),   |E_S(n)| ≤ B_S(n),

     where the upper bounds B_T(n) and B_S(n) are obtained using (9.33) and (9.59),
     respectively.
9.4. Consider the integral

         I = ∫_0^1 dx/(1 + x^2) = π/4.

     (a) Use the trapezoidal rule to estimate I, assuming that h = 1/4.
     (b) Use Simpson's rule to estimate I, assuming that h = 1/4.
     (c) Use the corrected trapezoidal rule to estimate I, assuming that h = 1/4.

     Perform all computations using only a pocket calculator.
9.5. Consider the integral

         I = ∫_{-1/2}^{1/2} dx/(x + 1) = ln 3.

     (a) Use the trapezoidal rule to estimate I, assuming that h = 1/4.
     (b) Use Simpson's rule to estimate I, assuming that h = 1/4.
     (c) Use the corrected trapezoidal rule to estimate I, assuming that h = 1/4.

     Perform all computations using only a pocket calculator.
9.6. Consider the integral

         I = ∫_0^{π/2} sin(3x)/sin(x) dx = π/2.

     (a) Use the trapezoidal rule to estimate I, assuming that h = π/12.
     (b) Use Simpson's rule to estimate I, assuming that h = π/12.
     (c) Use the corrected trapezoidal rule to estimate I, assuming that h = π/12.

     Perform all computations using only a pocket calculator.
9.7. The length of the curve y = f(x) for a ≤ x ≤ b is given by

         L = ∫_a^b √(1 + [f^{(1)}(x)]^2) dx.

     Suppose f(x) = cos x. Compute L for a = -π/2 and b = π/2. Use the
     trapezoidal rule, selecting n so that |E_T(n)| < 0.001. Hence, using (9.33),
     select n such that

         (1/12) [(b - a)^3 / n^2] M < 0.001.

     Do the computations using a suitable MATLAB routine.
9.8. Recall Example 6.5. Estimate the integral

         I = ∫ e^{-x} dx

     by
     (a) Integrating the natural spline interpolant
     (b) Integrating the complete spline interpolant
9.9. Find the constants α and β in x = αt + β, and find f in terms of g such
     that

         ∫_a^b g(t) dt = ∫_{-1}^{1} f(x)/√(1 - x^2) dx.

     [Comment: This transformation will permit you to apply the Chebyshev-
     Gauss quadrature rule from Eq. (9.98) to general integrands.]
9.10. Consider the midpoint rule (Section 9.2). If we approximate I = ∫_a^b f(x) dx
      by one rectangle, then the rule is

          R(1) = (b - a) f((a + b)/2),

      so the truncation error involved in this approximation is

          E_R(1) = ∫_a^b f(x) dx - R(1).

      Use Taylor expansion error analysis to find an approximation to E_R(1). [Hint:
      With x̄ = (a + b)/2 and h = b - a, we obtain

          f(x) = f(x̄) + (x - x̄) f^{(1)}(x̄) + (1/2!)(x - x̄)^2 f^{(2)}(x̄) + ··· .

      Consider f(a), f(b), and ∫_a^b f(x) dx using this series expansion.]
9.11. Use both the trapezoidal rule [Eq. (9.14)] and the Chebyshev-Gauss quadra-
      ture rule [Eq. (9.98)] to approximate I for n = 6. Assuming
      that I = 1.1790, which rule gives an answer closer to this value for I? Use
      a pocket calculator to do the computations.
9.12. Consider the integral

          I = ∫_0^{π/2} cos x dx = 1.

      Write a MATLAB routine to approximate I using the trapezoidal rule, and
      Richardson's extrapolation. The program must allow you to fill in the fol-
      lowing table:

      n      T(n)   |I - T(n)|   R(n)   |I - R(n)|
      2
      4
      8
      16
      32
      64
      128
      256
      512
      1024

      Of course, the extrapolated values are obtained from (9.111), where I(n) =
      T(n). Does extrapolation improve on the accuracy? Comment on this.
9.13. Write a MATLAB routine that allows you to make a table similar to that in
      Example 9.6, but for the integral

          I = ∫_0^1 x^{1/4} dx.
9.14. The complete elliptic integral of the first kind is

          K(k) = ∫_0^{π/2} dθ/√(1 - k^2 sin^2 θ),   0 < k < 1,

      and the complete elliptic integral of the second kind is

          E(k) = ∫_0^{π/2} √(1 - k^2 sin^2 θ) dθ,   0 < k < 1.

      (a) Find a series expansion for K(k). [Hint: Recall (3.80).]
      (b) Construct a Romberg table for N = 4 for the integral K(1/2). Use the
          series expansion from (a) to find the "exact" value of K(1/2) and compare.
      (c) Find a series expansion for E(k). [Hint: Recall (3.80).]
      (d) Construct a Romberg table for N = 4 for the integral E(1/2). Use the series
          expansion from (c) to find the "exact" value of E(1/2) and compare.

      Use MATLAB to do all of the calculations. [Comment: Elliptic integrals are
      important in electromagnetic potential problems (e.g., finding the magnetic
      vector potential of a circular current loop), and are important in analog and
      digital filter design (e.g., elliptic filters).]
9.15. This problem is about an alternative approach to the derivation of Gauss-type
      quadrature rules. The problem statement is long, but the solution is short
      because so much information has been provided. Suppose that w(x) ≥ 0 for
      x ∈ [a, b], so w(x) is some weighting function. We wish to find weights w_k
      and sample points x_k for k = 0, 1, ..., n - 1 such that

          ∫_a^b w(x) f(x) dx = Σ_{k=0}^{n-1} w_k f(x_k)                    (9.P.2)

      for all f(x) = x^j, where j = 0, 1, ..., 2n - 2, 2n - 1. This task is greatly
      aided by defining the moments

          m_j = ∫_a^b w(x) x^j dx,

      where m_j is called the jth moment of w(x). In everything that follows it is
      important to realize that because of (9.P.2)

          m_j = ∫_a^b x^j w(x) dx = Σ_{k=0}^{n-1} w_k x_k^j.

      The method proposed here implicitly assumes that it is easy to find the
      moments m_j. Once the weights and sample points are found, expression
      (9.P.2) forms a quadrature rule according to

          ∫_a^b w(x) f(x) dx ≈ Σ_{k=0}^{n-1} w_k f(x_k) = G(n - 1),

      where now f(x) is essentially arbitrary. Define the vectors

          w = [w_0 w_1 ··· w_{n-2} w_{n-1}]^T ∈ R^n,
          m = [m_0 m_1 ··· m_{2n-2} m_{2n-1}]^T ∈ R^{2n}.

      Find matrix A ∈ R^{n×2n} such that A^T w = m. Matrix A turns out to be a
      rectangular Vandermonde matrix. If we knew the sample points x_k, then it
      would be possible to use A^T w = m to solve for the weights in w. (This is
      so even though the linear system is overdetermined.) Define the polynomial

          p_n(x) = Π_{j=0}^{n-1} (x - x_j) = Σ_{j=0}^{n} p_{n,j} x^j,

      for which we see that p_{n,n} = 1. We observe that the zeros of p_n(x) happen
      to be the sample points x_k that we are looking for. The following suggestion
      makes it possible to find the sample points x_k by first finding the polynomial
      p_n(x). Using (in principle) Chapter 7 ideas, we can then find the roots of
      p_n(x) = 0, and so find the sample points. Consider the expression

          m_{j+r} p_{n,j} = Σ_{k=0}^{n-1} w_k x_k^{j+r} p_{n,j},           (9.P.3)

      where r = 0, 1, ..., n - 2, n - 1. Consider the sum of (9.P.3) over j =
      0, 1, ..., n. Use this sum to show that

          Σ_{j=0}^{n-1} m_{j+r} p_{n,j} = -m_{n+r}                         (9.P.4)

      for r = 0, 1, ..., n - 1. This can be expressed in matrix form as Mp = q,
      where

          M = [m_{i+j}]_{i,j=0,1,...,n-1} ∈ R^{n×n},

      which is called a Hankel matrix, and where

          p = [p_{n,0} p_{n,1} ··· p_{n,n-1}]^T ∈ R^n

      and

          q = -[m_n m_{n+1} ··· m_{2n-2} m_{2n-1}]^T.

      Formal proof that M^{-1} always exists is possible, but is omitted from con-
      sideration. [Comment: The reader is warned that the approach to Gaussian
      quadrature suggested here is not really practical due to the ill-conditioning
      of the matrices involved (unless n is small).]
9.16. In the previous problem let a = -1, b = 1, and w(x) = √(1 - x^2). Develop a
      numerically reliable algorithm to compute the moments m_j = ∫_{-1}^{1} w(x) x^j dx.
9.17. Derive Eq. (9.88).
9.18. Repeat Example 9.4, except that the integral is now

          I = ∫_0^1 e^x dx.
9.19. This problem is a preview of certain aspects of probability theory. It is an
      example of an application for numerical integration. An experiment produces
      a measurable quantity denoted x ∈ R. The experiment is random in that the
      value of x is different from one experiment to the next, but the probability
      of x lying within a particular range of values is known to be

          P = P[a ≤ x ≤ b] = ∫_a^b f_X(x) dx,

      where

          f_X(x) = [1/√(2πσ^2)] exp(-x^2/(2σ^2)),

      which is an instance of the Gaussian function mentioned in Chapter 3. This
      fact may be interpreted as follows. Suppose that we perform the experiment R
      times, where R is a "large" number. Then on average we expect a ≤ x ≤ b
      a total of PR times. The function f_X(x) is called a Gaussian probability
      density function (pdf) with a mean of zero, and variance σ^2. Recall Fig.
      3.6, which shows the effect of changing σ^2. In Monte Carlo simulations of
      digital communications systems, or for that matter any other system where
      randomness is an important factor, it is necessary to write programs that gen-
      erate simulated random variables such as x. The MATLAB randn function
      will generate zero mean Gaussian random variables with variance σ^2 = 1.
      For example, x = randn(1,N) will load N Gaussian random variables into
      the row vector x. Write a MATLAB routine to generate N = 1000 simu-
      lated Gaussian random variables (also called Gaussian variates) using randn.
      Count the number of times x satisfies -1 ≤ x ≤ 1. Let this count be denoted
      C. Your routine must also use the trapezoidal rule to estimate the probability
      P = P[-1 ≤ x ≤ 1] using erf(x) [defined in Eq. (3.107)]. The magnitude of
      the error in computing P must be < 0.0001. This will involve using the trun-
      cation error bound (9.33) in Chapter 9 to estimate the number of trapezoids
      n that you need to do this job. You are to neglect rounding error effects here.
      Compute Ĉ = PN. Your program must print C and Ĉ to a file. Of course,
      we expect C ≈ Ĉ.
9.20. Develop a MATLAB routine to fill in the following table, which uses the
      central difference approximation to the first derivative of a function [i.e.,
      f̂_c^{(1)}(x)] to estimate f^{(1)}(x), where here

          f(x) = log_e x.

      x      h = 10^{-4}   h = 10^{-5}   h = 10^{-6}   h = 10^{-7}   h = 10^{-8}
      1.0
      2.0
      3.0
      4.0
      5.0
      6.0
      7.0
      8.0
      9.0
      10.0

      Explain the results you get.
9.21. Suppose that f(x) = e^{-x^2}, and recall (9.132).

      (a) Sketch f^{(2)}(x).
      (b) Let h = 1/10, and compute f̂_f^{(1)}(1). Find an upper bound on |e_f(1)|.
          Compute |f^{(1)}(1) - f̂_f^{(1)}(1)| = |e_f(1)|, and compare to the bound.
      (c) Let h = 1/10, and compute f̂_f^{(1)}(1/√2). Find an upper bound on |e_f(1/√2)|.
          Compute |f^{(1)}(1/√2) - f̂_f^{(1)}(1/√2)| = |e_f(1/√2)|, and compare to the
          bound.
9.22. Show that another approximation to f^{(1)}(x) is given by

          f^{(1)}(x) ≈ (1/(12h))[8f(x + h) - 8f(x - h) - f(x + 2h) + f(x - 2h)].

      Give an expression for the error involved in using this approximation.
10 Numerical Solution of Ordinary
Differential Equations
10.1 INTRODUCTION
In this chapter we consider numerical methods for the solution of ordinary differ-
ential equations (ODEs). We recall that in such differential equations the function
that we wish to solve for is in one independent variable. By contrast partial differ-
ential equations (PDEs) involve solving for functions in two or more independent
variables. The numerical solution of PDEs is a subject for a later chapter.
With respect to the level of importance of the subject the reader knows that all
dynamic systems with physical variables that change continuously over time (or
space, or both) are described in terms of differential equations, and so form the
basis for a substantial portion of engineering systems analysis, and design across
all branches of engineering. The reader is also well aware of the fact that it is quite
easy to arrive at differential equations that completely defy attempts at an analytical
solution. This remains so in spite of the existence of quite advanced methods for
analytical solution (e.g., symmetry methods that use esoteric ideas from Lie group
theory [1]), and so the need for this chapter is not hard to justify.
Where ODEs are concerned, differential equations arise within two broad cate-
gories of problems:
1. Initial-value problems (IVPs)
2. Boundary-value problems (BVPs)
In this chapter we shall restrict consideration to initial value problems. However,
this is quite sufficient to accommodate much of electric/electronic circuit modeling,
modeling the orbital dynamics of satellites around planetary bodies, and many other
problems besides.
A simple example of an ODE for which no general analytical theory of solution
is known is the Duffing equation

    m d^2x(t)/dt^2 + k dx(t)/dt + α x(t) + δ x^3(t) = F cos(ωt),           (10.1)
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
where t ≥ 0. Since we are concerned with initial-value problems we would need to
know, at least implicitly, x(0) and dx(t)/dt|_{t=0}, which are the initial conditions. If it were
the case that δ = 0 then the solution of (10.1) is straightforward because it is a
particular case of a second-order linear ODE with constant coefficients. Perhaps the
best method for solving (10.1) in this case would be the Laplace transform method.
However, the case where δ ≠ 0 immediately precludes a straightforward analytical
solution of this kind. It is worth noting that the Duffing equation models a forced
nonlinear mechanical spring, where the restoring force of the spring is accounted
for by the terms αx(t) + δx³(t). Function x(t), which we wish to solve for, is the
displacement at time t of some point on the spring (e.g., the point mass m at the
free end) with respect to a suitable reference frame. Term k dx(t)/dt is the opposing
friction, while F cos(ωt) is the periodic forcing function that drives the system. An
example of a recent application for (10.1) is in the modeling of micromechanical
filters/resonators [2].¹
At the outset we consider only first-order problems, specifically, how to solve
(numerically)
\frac{dx}{dt} = f(x, t), \qquad x_0 = x(0)    (10.2)
for t > 0. [From now on we shall often write x instead of x(t), and dx/dt instead of
dx(t)/dt for brevity.] However, the example of (10.1) is a second-order problem.
But it is possible to replace it with a system of equivalent first-order problems.
There are many ways to do this in principle. One way is to define
y = \frac{dx}{dt}.    (10.3)
The functions x(t) and y(t) are examples of state variables. Since we are interpret-
ing (10.1) as the model for a mechanical system wherein x(t) is displacement, it
therefore follows that we may interpret y(t) as velocity. From the definition (10.3)
we may use (10.1) to write
\frac{dy}{dt} = -\frac{\alpha}{m} x - \frac{k}{m} y - \frac{\delta}{m} x^3 + \frac{F}{m} \cos(\omega t)    (10.4a)

and

\frac{dx}{dt} = y.    (10.4b)
¹The clock circuit in many present-day digital systems is built around a quartz crystal. Such crystals
do not integrate onto chips. Micromechanical resonators are intended to replace the crystal since such
resonators can be integrated onto chips. This is in furtherance of the goal of more compact electronic
systems. This applications example is a good illustration of the rapidly growing trend to integrate
nonelectrical/nonelectronic systems onto chips. The implication of this is that it is now very necessary
for the average electrical and/or computer engineer to become very knowledgeable about most other
branches of engineering, and to possess a much broader and deeper knowledge of science (physics,
chemistry, biology, etc.) and mathematics.
Equations (10.4) have the forms

\frac{dx}{dt} = f(x, y, t),    (10.5a)

\frac{dy}{dt} = g(x, y, t)    (10.5b)
for the appropriate choices of f and g. The initial conditions for our example
are x(0) and y(0) (initial position and initial velocity, respectively). These rep-
resent a coupled system of first-order ODEs. Methods applicable to the solution
of (10.2) are extendable to the larger problem of solving systems of first-order
ODEs, and so in this way higher-order ODEs may be solved. Thus, we shall also
consider the numerical solution of initial-value problems in systems of first-order
ODEs.
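The reduction of (10.1) to the coupled first-order system (10.4) is easy to sketch in code. The short Python fragment below is illustrative only; the parameter values chosen for m, k, α, δ, F, and ω are assumptions, not values from the text.

```python
import math

# State-variable form (10.4) of the Duffing equation (10.1).
# All parameter values below are illustrative assumptions.
m, k, alpha, delta, F, omega = 1.0, 0.1, 1.0, 0.25, 0.5, 1.2

def f(x, y, t):
    # dx/dt = y, as in (10.4b): x is displacement, y is velocity
    return y

def g(x, y, t):
    # dy/dt from (10.4a): the second-order ODE rewritten via y = dx/dt
    return (-alpha / m) * x - (k / m) * y - (delta / m) * x ** 3 \
        + (F / m) * math.cos(omega * t)
```

Any numerical method for the coupled system (10.5) can then be applied to f and g directly.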
The next two examples illustrate how to arrive at coupled systems of first-order
ODEs for electrical and electronic circuits.
Example 10.1 Consider the linear electric circuit shown in Fig. 10.1. The input
to the circuit is the voltage source v_s(t), while we may regard the output as the
voltage drop across capacitor C, denoted v_C(t). The differential equation relating
the input voltage v_s(t) and the output voltage v_C(t) is thus

L_1 C \frac{d^3 v_C(t)}{dt^3} + R_1 C \frac{d^2 v_C(t)}{dt^2} + \left( \frac{L_1}{L_2} + 1 \right) \frac{dv_C(t)}{dt} + \frac{R_1}{L_2} v_C(t) = \frac{dv_s(t)}{dt}.    (10.6)
This third-order ODE may be obtained by mesh analysis of the circuit. The reader
ought to attempt this derivation as an exercise. One way to replace (10.6) with
a coupled system of first-order ODEs is to define the state variables x_k(t) (k ∈ {0, 1, 2}) according to

x_0(t) = v_C(t), \quad x_1(t) = \frac{dv_C(t)}{dt}, \quad x_2(t) = \frac{d^2 v_C(t)}{dt^2}.    (10.7)

Substituting (10.7) into (10.6) yields

L_1 C \frac{dx_2(t)}{dt} + R_1 C x_2(t) + \left( \frac{L_1}{L_2} + 1 \right) x_1(t) + \frac{R_1}{L_2} x_0(t) = \frac{dv_s(t)}{dt}.
Figure 10.1 The linear electric circuit for Example 10.1.
If we recognize that

x_1(t) = \frac{dx_0(t)}{dt}, \quad x_2(t) = \frac{dx_1(t)}{dt},

then the complete system of first-order ODEs is

\frac{dx_0(t)}{dt} = x_1(t), \quad \frac{dx_1(t)}{dt} = x_2(t),    (10.8)

\frac{dx_2(t)}{dt} = \frac{1}{L_1 C} \frac{dv_s(t)}{dt} - \frac{R_1}{L_1} x_2(t) - \frac{1}{L_1 C} \left( \frac{L_1}{L_2} + 1 \right) x_1(t) - \frac{R_1}{L_1 L_2 C} x_0(t).
In many ways this is not the best description for the circuit dynamics.
Instead, we may find the matrix A ∈ R^{3×3} and column vector b ∈ R³ such that

\frac{d}{dt} \begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} = A \begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} + b v_s(t).    (10.9)

This defines a new set of state equations in terms of the new state variables v_C(t),
i_{L_1}(t), and i_{L_2}(t). The matrix A and vector b contain constants that depend only
on the circuit parameters R_1, L_1, L_2, and C.
Equation (10.9) is often a better representation than (10.8) because
1. There is no derivative of the forcing function v_s(t) in (10.9) as there is in
(10.8).
2. There is a general (linear) theory of solution to (10.9) that is in practice easy
to apply, and it is based on state-space methods.
3. Inductor currents [i.e., i_{L_1}(t), i_{L_2}(t)] and capacitor voltages [i.e., v_C(t)] can
be readily measured in a laboratory setting, while derivatives of these are
not as easily measured. Thus, it is relatively easy to compare theoretical and
numerical solutions to (10.9) with laboratory experimental results.
Since

i_C(t) = C \frac{dv_C(t)}{dt}, \quad v_{L_1}(t) = L_1 \frac{di_{L_1}(t)}{dt}, \quad v_{L_2}(t) = L_2 \frac{di_{L_2}(t)}{dt},

on applying Kirchhoff's voltage law (KVL) and Kirchhoff's current law (KCL), we
arrive at the relevant state equations as follows.
First,

v_s(t) = R_1 i_{L_1}(t) + L_1 \frac{di_{L_1}(t)}{dt} + L_2 \frac{di_{L_2}(t)}{dt}.    (10.10)

We see that v_C(t) = v_{L_2}(t), and so

v_C(t) = L_2 \frac{di_{L_2}(t)}{dt},

giving

\frac{di_{L_2}(t)}{dt} = \frac{1}{L_2} v_C(t),    (10.11)
which is one of the required state equations. Since

i_{L_1}(t) = i_C(t) + i_{L_2}(t),

and so

i_{L_1}(t) = C \frac{dv_C(t)}{dt} + i_{L_2}(t),

we also have

\frac{dv_C(t)}{dt} = \frac{1}{C} i_{L_1}(t) - \frac{1}{C} i_{L_2}(t).    (10.12)
This is another of the required state equations. Substituting (10.11) into (10.10)
gives the final state equation

\frac{di_{L_1}(t)}{dt} = -\frac{R_1}{L_1} i_{L_1}(t) - \frac{1}{L_1} v_C(t) + \frac{1}{L_1} v_s(t).
The state equations may be collected together in matrix form as required:

\frac{d}{dt} \begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} = \begin{bmatrix} 0 & \frac{1}{C} & -\frac{1}{C} \\ -\frac{1}{L_1} & -\frac{R_1}{L_1} & 0 \\ \frac{1}{L_2} & 0 & 0 \end{bmatrix} \begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{1}{L_1} \\ 0 \end{bmatrix} v_s(t).    (10.13)
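As a concrete instance of (10.13), the Python sketch below assembles A and b and evaluates the state derivative. The component values are illustrative assumptions, not values from the text.

```python
# State matrices of (10.13) for the circuit of Example 10.1.
# The component values below are illustrative assumptions.
R1, L1, L2, C = 1.0, 1e-3, 1e-3, 1e-6

A = [[0.0,       1.0 / C,  -1.0 / C],
     [-1.0 / L1, -R1 / L1,  0.0    ],
     [1.0 / L2,   0.0,      0.0    ]]
b = [0.0, 1.0 / L1, 0.0]

def state_derivative(x, vs):
    # d/dt [v_C, i_L1, i_L2]^T = A x + b v_s(t), per (10.9)/(10.13)
    return [sum(A[i][j] * x[j] for j in range(3)) + b[i] * vs
            for i in range(3)]
```

With the circuit at rest (all states zero) and the source applied, only the i_{L_1} state initially changes, since b couples v_s only into the second equation.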
Example 10.2 Now let us consider a more complicated third-order nonlinear
electronic circuit called the Colpitts oscillator [9]. The electronic circuit, and its
electric circuit equivalent (model) appears in Fig. 10.2. This circuit is a popular
analog signal generator with a long history (it used to be built using vacuum
tubes). The device Q is a three-terminal device called an NPN-type bipolar junction
transistor (BJT). The detailed theory of operation of BJTs is beyond the scope of
Figure 10.2 The BJT Colpitts oscillator (a) and its electric circuit equivalent (b).
this book, but may be found in basic electronics texts [10]. For present purposes
it is enough to know that Q may be represented with a nonlinear resistor R (the
resistor enclosed in the box in Fig. 10.2b), and a current-controlled current source
(CCCS), where
i_C(t) = \beta_F i_B(t)    (10.14)

and

i_B(t) = \begin{cases} 0, & v_{BE}(t) \le V_{TH} \\ \dfrac{v_{BE}(t) - V_{TH}}{R_{ON}}, & v_{BE}(t) > V_{TH}. \end{cases}    (10.15)
The current i_B(t) is called the base current of Q, and flows into the base terminal
of the transistor as shown. From (10.15) we observe that if the base-emitter voltage
v_{BE}(t) is below a threshold voltage V_{TH}, then there is no base current into Q (i.e.,
the device is cut off). The relationship between v_{BE}(t) - V_{TH} and i_B(t) obeys
Ohm's law only when v_{BE}(t) is above threshold, in which case Q is active. In
either case (10.14) says the collector current i_C(t) is directly proportional to i_B(t),
and the constant of proportionality β_F is called the forward current gain of Q.
Voltage v_{CE}(t) is the collector-emitter voltage of Q. Typically, V_{TH} ≈ 0.75 V, β_F
is about 100 (order of magnitude), and R_{ON} (the on resistance of Q) is seldom more
than hundreds of ohms in size.
From (10.15), i_B(t) is a nonlinear function of v_{BE}(t) that we may compactly write
as i_B(t) = f_R(v_{BE}(t)). There are power supply voltages v_{CC}(t) and V_{EE}. Voltage
V_{EE} < 0 is a constant, with a typical value V_{EE} = -5 V. Here we treat v_{CC}(t)
as time-varying, but it is usually the case that (approximately) v_{CC}(t) = V_{CC} u(t),
where

u(t) = \begin{cases} 1, & t \ge 0 \\ 0, & t < 0. \end{cases}    (10.16)

Function u(t) is the unit step function. To say that v_{CC}(t) = V_{CC} u(t) is to say that
the circuit is turned on at time t = 0. Typically, V_{CC} = +5 V.
The reader may verify (again as a circuit analysis review exercise) that state
equations for the Colpitts oscillator are

C_1 \frac{dv_{CE}(t)}{dt} = i_L(t) - \beta_F f_R(v_{BE}(t)),    (10.17a)

C_2 \frac{dv_{BE}(t)}{dt} = -\frac{v_{BE}(t) + V_{EE}}{R_{EE}} - f_R(v_{BE}(t)) - i_L(t),    (10.17b)

L \frac{di_L(t)}{dt} = v_{CC}(t) - v_{CE}(t) + v_{BE}(t) - R_L i_L(t).    (10.17c)
Thus, the state variables are v_{BE}(t), v_{CE}(t), and i_L(t). As previously, this circuit
description is not unique, but it is convenient.
Since numerical methods only provide approximate solutions to ODEs, we are
naturally concerned about the accuracy of these approximations. There are also
issues about the stability of proposed methods, and so this matter as well will be
considered in this chapter.
10.2 FIRST-ORDER ODEs
Strictly speaking, before applying a numerical method to the solution of an ODE,
we must be certain that a solution exists. We are also interested in whether the
solution is unique. It is worth stating that in many cases, since ODEs are often
derived from problems in the physical world, existence and uniqueness are often
"obvious" for physical reasons. Notwithstanding this, a mathematical statement
about existence and uniqueness is worthwhile.
The following definition is needed by the succeeding theorem regarding the
existence and uniqueness of solutions to first-order ODE initial-value problems.
Definition 10.1: The Lipschitz Condition The function f(x, t) ∈ R satisfies
a Lipschitz condition in x on S ⊂ R² iff there is an α > 0 such that

|f(x, t) - f(y, t)| \le \alpha |x - y|

when (x, t), (y, t) ∈ S. The constant α is called a Lipschitz constant for f(x, t).
It is apparent that if f(x, t) satisfies a Lipschitz condition, then it is smooth in
some sense. The following theorem is about the existence and uniqueness of the
solution to
\frac{dx}{dt} = f(x, t)    (10.18)

for 0 ≤ t ≤ t_f, with the initial condition x_0 = x(0). Time t = 0 is the initial time, or
starting time. We call the constant t_f the final time. Essentially, we are only interested
in the solution over a finite time interval. This constraint on the theory is not
unreasonable since a computer can run for only a finite amount of time anyway.
We also remark that interpreting the independent variable t as time is common
practice, but not mandatory in general.
Theorem 10.1: Picard's Theorem Suppose that S = {(x, t) ∈ R² | 0 ≤ t ≤
t_f, -∞ < x < ∞}, and that f(x, t) is continuous on S. If f satisfies a Lipschitz
condition on set S in the variable x, then the initial-value problem (10.18) has a
unique solution x = x(t) for all 0 ≤ t ≤ t_f.
Proof Omitted. We simply mention that it is based on the Banach fixed-point
theorem (recall Theorem 7.3).
We also mention that a proof of a somewhat different version of this theorem
appears in Kreyszig [3, pp. 315-317]. It involves working with a contractive mapping
on a certain closed subspace of C(J), where J = [t_0 - β, t_0 + β] ⊂ R and
C(J) is the metric space of continuous functions on J, where the metric is that
of (1.8) in Chapter 1. It was remarked in Chapter 3 [see Eq. (3.8)] that this space
is complete. Thus, any closed subspace of it is complete as well (a fact that was
mentioned in Chapter 7 following Corollary 7.1).
We may now consider specific numerical techniques. Define x_n = x(t_n) for
n ∈ Z⁺. Usually we assume that t_0 = 0, and that

t_{n+1} = t_n + h,    (10.19)

where h > 0, and we call h the step size. From (10.18)

x_n^{(1)} = \frac{dx_n}{dt} = \left. \frac{dx(t)}{dt} \right|_{t = t_n} = f(x_n, t_n).    (10.20)

We may expand solution x(t) in a Taylor series about t = t_n. Therefore, since
x(t_{n+1}) = x(t_n + h) = x_{n+1}, and with x_n^{(k)} = x^{(k)}(t_n),

x_{n+1} = x_n + h x_n^{(1)} + \frac{1}{2!} h^2 x_n^{(2)} + \frac{1}{3!} h^3 x_n^{(3)} + \cdots.    (10.21)

If we drop the terms in x_n^{(k)} for k > 1, then (10.21) and (10.20) imply

x_{n+1} \approx x_n + h x_n^{(1)} = x_n + h f(x_n, t_n).
Since x_0 = x(t_0) = x(0) we may find (x_n) via

x_{n+1} = x_n + h f(x_n, t_n).    (10.22)
This is often called the Euler method (or Euler's method).² A more accurate description
would be to call it the explicit form of Euler's method in order to distinguish
it from the implicit form to be considered a little later on. The distinction matters
in practice because implicit methods tend to be stable, whereas explicit methods
are often prone to instability.
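The explicit recursion (10.22) is simple enough to state directly as code. The following is a minimal Python sketch; the function name and interface are our own choices, not from the text.

```python
def euler_explicit(f, x0, t0, h, n_steps):
    # Explicit Euler method (10.22): x_{n+1} = x_n + h f(x_n, t_n).
    # Returns the list [x_0, x_1, ..., x_{n_steps}].
    x, t = x0, t0
    xs = [x0]
    for _ in range(n_steps):
        x = x + h * f(x, t)   # advance the state by one step
        t = t + h             # t_{n+1} = t_n + h, per (10.19)
        xs.append(x)
    return xs
```

For example, applying it to dx/dt = -x with x(0) = 1 and h = 0.1 produces the sequence x_n = (1 - h)^n, which is exactly the recursion (10.38a) derived in Example 10.3 below.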
A few general words about stability and accuracy are now appropriate. In what
follows we will assume (unless otherwise noted) that the solution to a differential
equation remains bounded; that is, |x(t)| ≤ M < ∞ for all t ≥ 0. However,
approximations to this solution [e.g., (x_n) from (10.22)] will not necessarily remain
bounded in the limit as n → ∞; that is, our numerical methods might not always be
stable. Of course, in a situation like this the numerical solution will deviate greatly
from the correct solution, and this is simply unacceptable. It therefore follows that
we must find methods to test the stability of a proposed numerical solution. Some
informal definitions relating to stability are
Stable method: The numerical solution does not grow without bound (i.e., "blow
up") for any choice of parameters such as the step size.
Unstable method: The numerical solution blows up for every choice of parameters
(such as the step size).
Conditionally stable method: For certain choices of parameters the numerical
solution remains bounded.
We mention that even if the Euler method is stable, its accuracy is low because
only the first two terms in the Taylor series are retained. More specifically, we
say that it is a first-order method because only the first power of h is retained in
the Taylor approximation that gave rise to it. The omission of higher-order terms
causes truncation errors. Since h² (and higher power) terms are omitted, we also
say that the truncation error per step (sometimes called the order of accuracy) is
of order h². This is often written as O(h²). (Here we follow the terminology in
Kreyszig [4, pp. 793-794].) In summary, we prefer methods that are both stable,
and accurate. It is important to emphasize that accuracy and stability are distinct
concepts, and so must never be confused.
²Strictly speaking, in truncating the series in (10.21) we should write x̂_{n+1} = x̂_n + hf(x̂_n, t_n), so that
Euler's method is

\hat{x}_{n+1} = \hat{x}_n + h f(\hat{x}_n, t_n)

with x̂_0 = x_0. This is to emphasize that the method only generates approximations to x_n = x(t_n).
However, this kind of notation is seldom applied. It is assumed that the reader knows that the numerical
method only approximates x(t_n) even though the notation does not necessarily explicitly distinguish
the exact value from the approximate.
We can say more about the accuracy of the Euler method:
Theorem 10.2: For dx(t)/dt = f(x(t), t) let f(x(t), t) be Lipschitz continuous
with constant α (Definition 10.1), and assume that x(t) ∈ C²[t_0, t_f] (t_f > t_0).
If x_n ≈ x(t_n), where (t_n = t_0 + nh, and t_n ≤ t_f)

x_{n+1} = x_n + h f(x_n, t_n), \qquad (x_0 \approx x(t_0)),

then

|x(t_n) - x_n| \le e^{\alpha(t_n - t_0)} |x(t_0) - x_0| + \frac{1}{2} h M \, \frac{e^{\alpha(t_n - t_0)} - 1}{\alpha},    (10.23)

where M = \max_{t \in [t_0, t_f]} |x^{(2)}(t)|.
Proof Euler's method is

x_{n+1} = x_n + h f(x_n, t_n),

and from Taylor's theorem

x(t_{n+1}) = x(t_n) + h x^{(1)}(t_n) + \frac{1}{2} h^2 x^{(2)}(\xi_n)

for some ξ_n ∈ [t_n, t_{n+1}]. Thus

x(t_{n+1}) - x_{n+1} = x(t_n) - x_n + h[x^{(1)}(t_n) - f(x_n, t_n)] + \frac{1}{2} h^2 x^{(2)}(\xi_n)
                     = x(t_n) - x_n + h[f(x(t_n), t_n) - f(x_n, t_n)] + \frac{1}{2} h^2 x^{(2)}(\xi_n),

so that

|x(t_{n+1}) - x_{n+1}| \le |x(t_n) - x_n| + \alpha h |x(t_n) - x_n| + \frac{1}{2} h^2 |x^{(2)}(\xi_n)|.

For convenience we will let e_n = |x(t_n) - x_n|, λ = 1 + αh, and r_n = \frac{1}{2} h^2 |x^{(2)}(\xi_n)|,
so that

e_{n+1} \le \lambda e_n + r_n.

It is easy to see that³

e_1 \le \lambda e_0 + r_0,
e_2 \le \lambda e_1 + r_1 \le \lambda^2 e_0 + \lambda r_0 + r_1,
e_3 \le \lambda e_2 + r_2 \le \lambda^3 e_0 + \lambda^2 r_0 + \lambda r_1 + r_2,
\vdots
e_n \le \lambda^n e_0 + \sum_{j=0}^{n-1} \lambda^j r_{n-1-j}.

³More formally, we may use mathematical induction.
If M = \max_{t \in [t_0, t_f]} |x^{(2)}(t)| then r_{n-1-j} \le \frac{1}{2} h^2 M, and hence

e_n \le \lambda^n e_0 + \frac{1}{2} h^2 M \sum_{j=0}^{n-1} \lambda^j,

and since \sum_{j=0}^{n-1} \lambda^j = \frac{\lambda^n - 1}{\lambda - 1}, and for x \ge -1 we have (1 + x)^n \le e^{nx}, thus

\sum_{j=0}^{n-1} \lambda^j = \frac{\lambda^n - 1}{\alpha h} \le \frac{e^{n \alpha h} - 1}{\alpha h} = \frac{e^{\alpha(t_n - t_0)} - 1}{\alpha h}.

Consequently,

e_n \le e^{\alpha(t_n - t_0)} e_0 + \frac{1}{2} h M \, \frac{e^{\alpha(t_n - t_0)} - 1}{\alpha},

which immediately yields the theorem statement.
We remark that e_0 = |x(t_0) - x_0| = 0 only if x_0 = x(t_0) exactly. Where quantization
errors (recall Chapter 2) are concerned, this will seldom be the case. The
second term in the bound of (10.23) may be large even if h is tiny. In other words,
Euler's method is not necessarily very accurate. Certainly, from (10.23) we can
say that e_n ∝ h.
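The proportionality e_n ∝ h is easy to check numerically. The Python sketch below (our own illustration, not from the text) applies the explicit Euler method to dx/dt = -x, x(0) = 1, whose exact solution is e^{-t}, and compares the error at t = 1 for two step sizes; halving h roughly halves the error, as a first-order method should.

```python
import math

def euler_error_at_t1(h):
    # Integrate dx/dt = -x, x(0) = 1, with explicit Euler (10.22) up to
    # t = 1, and return the error against the exact solution e^{-1}.
    x = 1.0
    for _ in range(round(1.0 / h)):
        x = x * (1.0 - h)   # x_{n+1} = x_n + h(-x_n) = (1 - h) x_n
    return abs(x - math.exp(-1.0))

e_h = euler_error_at_t1(0.01)
e_half = euler_error_at_t1(0.005)
ratio = e_h / e_half   # roughly 2: halving h halves the error
```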
As a brief digression, we also note that Theorem 10.2 needed the bound

(1 + x)^n \le e^{nx} \quad (x \ge -1).    (10.24)

We may easily establish (10.24) as follows. From the Maclaurin expansion
(Chapter 3)

e^x = 1 + x + \frac{1}{2} x^2 e^{\xi}

(for some ξ ∈ [0, x]) so that

0 \le 1 + x \le 1 + x + \frac{1}{2} x^2 e^{\xi} = e^x,

and because 1 + x ≥ 0 (i.e., x ≥ -1),

0 \le (1 + x)^n \le e^{nx},

thus establishing (10.24).
The stability of any method may be analyzed in the following manner. First
recall the Taylor series expansion of f(x, t) about the point (x_0, t_0):

f(x, t) = f(x_0, t_0) + (t - t_0) \frac{\partial f(x_0, t_0)}{\partial t} + (x - x_0) \frac{\partial f(x_0, t_0)}{\partial x}
        + \frac{1}{2!} \left[ (t - t_0)^2 \frac{\partial^2 f(x_0, t_0)}{\partial t^2} + 2(t - t_0)(x - x_0) \frac{\partial^2 f(x_0, t_0)}{\partial t \, \partial x} + (x - x_0)^2 \frac{\partial^2 f(x_0, t_0)}{\partial x^2} \right] + \cdots.    (10.25)

If we retain only the linear terms of (10.25) and substitute these into (10.18), then
we obtain

x^{(1)}(t) = \frac{dx}{dt} = f(x_0, t_0) + (t - t_0) \frac{\partial f(x_0, t_0)}{\partial t} + (x - x_0) \frac{\partial f(x_0, t_0)}{\partial x}
           = \underbrace{\frac{\partial f(x_0, t_0)}{\partial x}}_{= \lambda} x + \underbrace{\frac{\partial f(x_0, t_0)}{\partial t}}_{= \lambda_1} t + \underbrace{f(x_0, t_0) - t_0 \frac{\partial f(x_0, t_0)}{\partial t} - x_0 \frac{\partial f(x_0, t_0)}{\partial x}}_{= \lambda_2},    (10.26)

so this has the general form (with λ, λ_1, and λ_2 as constants)

\frac{dx}{dt} = \lambda x + \lambda_1 t + \lambda_2.    (10.27)
This linearized approximation to the original problem in (10.18) allows us to investigate
the behavior of the solution in close proximity to (x_0, t_0). Equation (10.27)
is often simplified still further by considering what is called the model problem

\frac{dx}{dt} = \lambda x.    (10.28)

Thus, here we assume that λ_1 t + λ_2 in (10.27) can also be neglected. However,
we do remark that (10.27) has the form

\frac{dx}{dt} + P(t) x = Q(t) x^n,    (10.29)

where n = 0, P(t) = -λ, and Q(t) = λ_1 t + λ_2. Thus, (10.27) is an instance of
Bernoulli's differential equation [5, p. 62] for which a general method of solution
exists. But for the purpose of stability analysis it turns out to be enough (usually,
but not always) to consider only (10.28). Equation (10.28) is certainly simple in
that its solution is

x(t) = x(0) e^{\lambda t}.    (10.30)
If Euler's method is applied to (10.28), then

x_{n+1} = x_n + h \lambda x_n = (1 + h\lambda) x_n.    (10.31)

Clearly, for n ∈ Z⁺

x_n = (1 + h\lambda)^n x_0,    (10.32)

and we may avoid \lim_{n \to \infty} |x_n| = \infty if

|1 + h\lambda| < 1.

The model problem (10.28) with the solution (10.30) is stable⁴ only if λ < 0.
Hence Euler's method is conditionally stable for

\lambda < 0 \quad \text{and} \quad h < \frac{2}{|\lambda|},    (10.33)

and is unstable if

|1 + h\lambda| > 1.    (10.34)
We see that depending on λ and h, the explicit Euler method might be unstable.
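Conditions (10.33) and (10.34) amount to checking the growth factor |1 + hλ| of (10.32), which the following small Python check (our own illustration) makes concrete for λ = -1:

```python
def euler_growth_factor(h, lam):
    # Per-step growth factor of explicit Euler on the model problem (10.28):
    # x_n = (1 + h*lam)^n x_0, so the method is stable iff |1 + h*lam| < 1.
    return abs(1.0 + h * lam)

lam = -1.0
stable = euler_growth_factor(1.5, lam) < 1.0     # h < 2/|lam| = 2: stable
unstable = euler_growth_factor(2.5, lam) > 1.0   # h > 2/|lam|: unstable
```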
Now we consider the alternative implicit form of Euler's method. This method
is also called the backward Euler method. Instead of (10.22) we use

x_{n+1} = x_n + h f(x_{n+1}, t_{n+1}).    (10.35)
It can be seen that a drawback of this method is the necessity to solve (10.35) for
x_{n+1}. This is generally a nonlinear problem requiring the techniques of Chapter 7.
However, a strength of the implicit method is enhanced stability. This may be
easily seen as follows. Apply (10.35) to the model problem (10.28), yielding

x_{n+1} = x_n + \lambda h x_{n+1},

or

x_{n+1} = \frac{1}{1 - \lambda h} x_n.    (10.36)

Clearly,

x_n = \left( \frac{1}{1 - \lambda h} \right)^n x_0.    (10.37)
Since we must assume as before that λ < 0, the backward Euler method is stable
for all h > 0. In this sense we may say that the backward Euler method is unconditionally
stable. Thus, the implicit Euler method (10.35) is certainly more stable
than the previous explicit form (10.22). However, the implicit and explicit forms
have the same accuracy, as both are first-order methods.

⁴For stability we usually insist that λ < 0 as opposed to allowing λ = 0. This is to accommodate a
concept called bounded-input, bounded-output (BIBO) stability. However, we do not consider the details
of this matter here.
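Since (10.35) generally requires solving a nonlinear equation for x_{n+1} at every step, an implementation typically nests a root-finding iteration (the techniques of Chapter 7) inside the time loop. The Python sketch below uses Newton's method; the interface (passing ∂f/∂x explicitly) is our own illustrative choice.

```python
def euler_implicit_step(f, dfdx, x_n, t_next, h, tol=1e-12, max_iter=50):
    # One backward Euler step (10.35): solve x = x_n + h f(x, t_{n+1})
    # for x = x_{n+1} by Newton's method.
    x = x_n                                # initial guess
    for _ in range(max_iter):
        g = x - x_n - h * f(x, t_next)     # residual of (10.35)
        gp = 1.0 - h * dfdx(x, t_next)     # derivative of the residual
        x_new = x - g / gp                 # Newton update
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

For the model problem f(x, t) = λx the residual is linear, so a single Newton iteration reproduces (10.36) exactly.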
Example 10.3 We wish to apply the implicit and explicit forms of the Euler
method to

\frac{dx}{dt} + x = 0

for x(0) = 1. Of course, since this is a simple linear problem, we immediately
know that

x(t) = x(0) e^{-t} = e^{-t}

for all t ≥ 0. Since we have f(x, t) = -x, we obtain

\frac{\partial f(x, t)}{\partial x} = -1,

implying that λ = -1. Thus, the explicit Euler method (10.22) gives

x_{n+1} = (1 - h) x_n,    (10.38a)

for which 0 < h < 2 via (10.33). Similarly, (10.35) gives for the implicit method

x_{n+1} = \frac{1}{1 + h} x_n,    (10.38b)

for which h > 0. In both (10.38a) and (10.38b) we have x_0 = 1.
Some typical simulation results for (10.38) appear in Fig. 10.3. Note the
instability of the explicit method for the case where h > 2.
It is to be noted that a small step size h is desirable to achieve good accuracy.
Yet a larger h is desirable to minimize the amount of computation involved in
simulating the differential equation over the desired time interval.
[Figure 10.3 plots x(t) together with the explicit and implicit Euler solutions: (a) h = 0.1; (b) h = 2.1.]
Figure 10.3 Illustration of the implicit and explicit forms of the Euler method for the
differential equation in Example 10.3. In plot (a), h is small enough that the explicit method
is stable. Here the implicit and explicit methods display similar accuracies. In plot (b), h
is too big for the explicit method to be stable. Instability is indicated by the oscillatory
behavior of the method and the growing amplitude of the oscillations with time. However,
the implicit method remains stable, but because h is quite large, the accuracy is not very
good.

Example 10.4 Now consider the ODE

\frac{dx}{dt} + 2 t x = t e^{-t^2} x^3

[5, pp. 62-63]. The exact solution to this differential equation is

x(t) = \left[ \frac{3}{e^{-t^2} + c e^{2t^2}} \right]^{1/2}    (10.39)

for t ≥ 0, where c = 3/x_0² - 1, and we assume that c > 0. Thus

f(x, t) = t e^{-t^2} x^3 - 2 t x,

so

\frac{\partial f(x, t)}{\partial x} = 3 t e^{-t^2} x^2 - 2t.

Consequently,

\lambda = \frac{\partial f(x_0, t_0)}{\partial x} = \frac{\partial f(x_0, 0)}{\partial x} = 0.

Via (10.33) we conclude that any h > 0 is possible for both forms of the Euler method.
Since stability is therefore not a problem here, we choose to simulate the differential
equation using the explicit Euler method, as this is much simpler to implement.
Thus, from (10.22), we obtain

x_{n+1} = x_n + h \left[ t_n e^{-t_n^2} x_n^3 - 2 t_n x_n \right].    (10.40)

[Figure 10.4 plots x(t) together with the explicit Euler solution: (a) h = 0.02; (b) h = 0.20.]
Figure 10.4 Illustration of the explicit Euler method for the differential equation in
Example 10.4. Clearly, although stability is not a problem here, the accuracy of the method
is better for smaller h.
We shall assume x_0 = 1 [initial condition x(0)]. Of course, t_n = hn for n =
0, 1, 2, ....
Figure 10.4 illustrates the exact solution from (10.39), and the simulated solution
via (10.40) for h = 0.02 and h = 0.20. As expected, the result for h = 0.02 is more
accurate.
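A Python sketch of the recursion (10.40), compared against the exact solution (10.39), makes the h-dependence of the accuracy concrete (this code is our own illustration, not from the text):

```python
import math

def euler_example_10_4(h, t_final=2.0, x0=1.0):
    # Explicit Euler recursion (10.40) for dx/dt = t e^{-t^2} x^3 - 2 t x.
    x = x0
    for n in range(round(t_final / h)):
        t = n * h   # t_n = h n
        x = x + h * (t * math.exp(-t * t) * x ** 3 - 2.0 * t * x)
    return x

def exact_example_10_4(t, x0=1.0):
    # Exact solution (10.39) with c = 3/x0^2 - 1.
    c = 3.0 / (x0 * x0) - 1.0
    return math.sqrt(3.0 / (math.exp(-t * t) + c * math.exp(2.0 * t * t)))

err_coarse = abs(euler_example_10_4(0.20) - exact_example_10_4(2.0))
err_fine = abs(euler_example_10_4(0.02) - exact_example_10_4(2.0))
```

As in Fig. 10.4, the error for h = 0.02 is noticeably smaller than for h = 0.20.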
The next example involves an ODE whose solution does not remain bounded
over time. Nevertheless, our methods are applicable since we terminate the simu-
lation after a finite time.
Example 10.5 Consider the ODE

\frac{dx}{dt} = t^2 - \frac{2x}{t}

for t ≥ t_0 > 0 [5, pp. 60-61]. The exact solution is given by

x(t) = \frac{t^3}{5} + \frac{c}{t^2}.    (10.41)

The initial condition is x(t_0) = x_0 with t_0 > 0, and so

x_0 = \frac{t_0^3}{5} + \frac{c}{t_0^2},

implying that

c = t_0^2 \left( x_0 - \frac{t_0^3}{5} \right).
Since f(x, t) = t² - 2x/t, we have

\lambda = \frac{\partial f(x_0, t_0)}{\partial x} = -\frac{2}{t_0},

so via (10.33) for the explicit Euler method

0 < h < t_0.

However, this result is misleading here because x(t) is not bounded with time. In
other words, it does not really apply here. From (10.22)

x_{n+1} = x_n + h \left[ t_n^2 - \frac{2 x_n}{t_n} \right],    (10.42a)
where

t_n = t_0 + nh

for n ∈ Z⁺. If we consider the implicit method, then via (10.35)

x_{n+1} = x_n + h \left[ t_{n+1}^2 - \frac{2 x_{n+1}}{t_{n+1}} \right],

so that

x_{n+1} = \frac{x_n + h t_{n+1}^2}{1 + \dfrac{2h}{t_{n+1}}},    (10.42b)

where t_{n+1} = t_0 + (n + 1)h for n ∈ Z⁺.
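The two recursions (10.42a) and (10.42b) can be coded side by side; the sketch below (our own illustration) also evaluates the exact solution (10.41) so the two forms can be compared near t_0:

```python
def euler_steps_example_10_5(h, t0=0.05, x0=1.0, n_steps=8):
    # Explicit (10.42a) and implicit (10.42b) Euler for dx/dt = t^2 - 2x/t.
    xe = xi = x0
    for n in range(n_steps):
        tn = t0 + n * h
        tn1 = t0 + (n + 1) * h
        xe = xe + h * (tn * tn - 2.0 * xe / tn)             # (10.42a)
        xi = (xi + h * tn1 * tn1) / (1.0 + 2.0 * h / tn1)   # (10.42b)
    return xe, xi

def exact_example_10_5(t, t0=0.05, x0=1.0):
    # Exact solution (10.41): x(t) = t^3/5 + c/t^2, c = t0^2 (x0 - t0^3/5).
    c = t0 * t0 * (x0 - t0 ** 3 / 5.0)
    return t ** 3 / 5.0 + c / (t * t)
```

With x_0 = 1, t_0 = 0.05, and h = 0.025, the implicit iterate tracks (10.41) much better than the explicit one over the first few steps, consistent with the figure discussion below.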
Figure 10.5 illustrates the exact solution x(t) from (10.41) along with the simulated
solutions from (10.42). This is for x_0 = 1 and t_0 = 0.05, with h = 0.025 (a)
and h = 5 (b). It can be seen that the implicit method is more accurate for t close
to t_0. Of course, this could be very significant, since startup transients are often of
interest in simulations of dynamic systems. It is noteworthy that both the implicit and
explicit forms simulate the true solution with similar accuracy for t much bigger
than t_0, even when h is large. This example is of a "stiff system" (see Section 10.6).

[Figure 10.5 plots x(t) together with the explicit and implicit Euler solutions: (a) h = 0.025; (b) h = 5.]
Figure 10.5 Illustration of the explicit and implicit forms of the Euler method for the ODE
in Example 10.5. In plot (a), note that the implicit form tracks the true solution x(t) better
near t_0 = 0.05. In plot (b), note that both forms display a similar accuracy even though h
is huge, provided t ≫ t_0.
Example 10.5 illustrates that stability and accuracy issues with respect to the
numerical solution of ODE initial-value problems can be more subtle than our
previous analysis would suggest. The reader is therefore duly cautioned about
these matters.
Recalling (3.71) from Chapter 3 (or recalling Theorem 10.2), the Taylor formula
for x(t) about t = t_n is [recall x_n = x(t_n) for all n]

x_{n+1} = \underbrace{x_n + h f(x_n, t_n)}_{= \tilde{x}_{n+1}} + \frac{1}{2} h^2 x^{(2)}(\xi)    (10.43)

for some ξ ∈ [t_n, t_{n+1}]. Thus, the truncation error per step in the Euler method is
defined to be

\epsilon_{n+1} = x_{n+1} - \tilde{x}_{n+1} = \frac{1}{2} h^2 x^{(2)}(\xi).    (10.44)
We may therefore state, as suggested earlier, that the truncation error per step is of
order O(h²) because of this. The usefulness of (10.44) is somewhat limited in that
it depends on the solution x(t) [or rather on the second derivative x^{(2)}(t)], which
is, of course, something we seldom know in practice.
How may we obtain more accurate methods? More specifically, this means
finding methods for which the truncation error per step is of order O(h^m) with
m > 2.
One way to obtain improved accuracy is to try to improve the Euler method.
More than one possibility for improvement exists. However, a popular approach is
Heun's method. It is based on the following observation. A drawback of the Euler
method in (10.22) is that f(x_n, t_n) is the derivative x^{(1)}(t) at the beginning of the
interval [t_n, t_{n+1}], and yet x^{(1)}(t) varies over [t_n, t_{n+1}]. The implicit form of the
Euler method works with f(x_{n+1}, t_{n+1}), namely, the derivative at t = t_{n+1}, and so
has a similar defect. Therefore, intuitively, we may believe that we can improve
the algorithm by replacing f(x_n, t_n) with the average derivative

\frac{1}{2} \left[ f(x_n, t_n) + f(x_n + h f(x_n, t_n), t_n + h) \right].    (10.45)

This is approximately the average of x^{(1)}(t) at the endpoints of the interval [t_n, t_{n+1}].
The approximation is due to the fact that

f(x_{n+1}, t_{n+1}) \approx f(x_n + h f(x_n, t_n), t_n + h).    (10.46)

We see in (10.46) that we have employed (10.22) to approximate x_{n+1} according
to x_{n+1} = x_n + h f(x_n, t_n) (explicit Euler method). Of course, t_{n+1} = t_n + h does
not involve any approximation. Thus, Heun's method is defined by

x_{n+1} = x_n + \frac{h}{2} \left[ f(x_n, t_n) + f(x_n + h f(x_n, t_n), t_n + h) \right].    (10.47)

This is intended to replace (10.22) and (10.35).
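One step of (10.47) is easily coded; the two-slope structure (evaluate f at the start of the interval, then at the Euler-predicted endpoint) is the same structure found in the Runge-Kutta family of methods. The sketch below is our own illustration:

```python
def heun_step(f, x_n, t_n, h):
    # One step of Heun's method (10.47).
    k1 = f(x_n, t_n)               # slope at the start of [t_n, t_{n+1}]
    k2 = f(x_n + h * k1, t_n + h)  # estimated slope at the end, via (10.46)
    return x_n + 0.5 * h * (k1 + k2)

# Example: one step for dx/dt = -x with x(0) = 1 and h = 0.1
x1 = heun_step(lambda x, t: -x, 1.0, 0.0, 0.1)
```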
However, (10.47) is an explicit method, and so we may wonder about its stability.
If we apply (10.47) to the model problem, we obtain

x_{n+1} = \left[ 1 + \lambda h + \frac{1}{2} h^2 \lambda^2 \right] x_n,    (10.48)

for which

x_n = \left[ 1 + \lambda h + \frac{1}{2} h^2 \lambda^2 \right]^n x_0.    (10.49)

For stability we must select h such that we avoid \lim_{n \to \infty} |x_n| = \infty. For
convenience, define

\alpha = 1 + \lambda h + \frac{1}{2} h^2 \lambda^2,    (10.50)

so this requirement implies that we must have |α| < 1. A plot of (10.50) in terms
of hλ appears in Fig. 10.6. This makes it easy to see that we must have

-2 < h\lambda < 0.    (10.51)
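The bound (10.51) is easy to verify numerically from (10.50) (a small check of our own):

```python
def heun_growth_factor(h_lambda):
    # Growth factor (10.50) of Heun's method on the model problem (10.28):
    # 1 + h*lam + (h*lam)^2 / 2; the method is stable when its magnitude is < 1.
    return 1.0 + h_lambda + 0.5 * h_lambda ** 2

inside = heun_growth_factor(-1.0)    # 0.5: stable, h*lambda in (-2, 0)
edge = heun_growth_factor(-2.0)      # exactly 1 at h*lambda = -2
outside = heun_growth_factor(-2.5)   # 1.625: unstable
```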
Figure 10.6 A plot of α in terms of hλ [see Eq. (10.50)].
Since λ < 0 is assumed [again because we assume that x(t) is bounded], (10.51)
implies that Heun's method is conditionally stable for the same conditions as in
(10.33). Thus, the stability characteristics of the method are identical to those of
the explicit Euler method, which is perhaps not such a surprise.
What about the accuracy of Heun's method? Here we may see that there is an
improvement. Again via (3.71) from Chapter 3,

x_{n+1} = x_n + h x^{(1)}(t_n) + \frac{1}{2} h^2 x^{(2)}(t_n) + \frac{1}{6} h^3 x^{(3)}(\xi)    (10.52)

for some ξ ∈ [t_n, t_{n+1}]. We may approximate x^{(2)}(t_n) using a forward difference
operation

x^{(2)}(t_n) \approx \frac{x^{(1)}(t_{n+1}) - x^{(1)}(t_n)}{h},    (10.53)

so that (10.52) becomes [using x^{(1)}(t_{n+1}) = f(x_{n+1}, t_{n+1}), and x^{(1)}(t_n) =
f(x_n, t_n)]

x_{n+1} = x_n + h f(x_n, t_n) + \frac{1}{2} h^2 \, \frac{x^{(1)}(t_{n+1}) - x^{(1)}(t_n)}{h} + \frac{1}{6} h^3 x^{(3)}(\xi),    (10.54)

or, upon simplifying this, we have

x_{n+1} = x_n + \frac{h}{2} \left[ f(x_n, t_n) + f(x_{n+1}, t_{n+1}) \right] + \frac{1}{6} h^3 x^{(3)}(\xi).    (10.55)
Replacing f(x_{n+1}, t_{n+1}) in (10.55) with the approximation (10.46), and dropping
the error term, we see that what remains is identical to (10.47), namely, Heun's
method. Various approximations were made to arrive at this conclusion, but they
are certainly reasonable, and so we claim that the truncation error per step for
Heun's method is

\epsilon_{n+1} = \frac{1}{6} h^3 x^{(3)}(\xi),    (10.56)
where again ξ ∈ [t_n, t_{n+1}], and so this error is of the order O(h³). In other words,
although Heun's method is based on modifying the explicit Euler method, the
modification has led to a method with improved accuracy.
Example 10.6 Here we repeat Example 10.5 by applying Heun's method under
the same conditions as for Fig. 10.5a. Thus, the differential equation is again

\frac{dx}{dt} = t^2 - \frac{2x}{t},

and again we choose x_0 = 1.0, t_0 = 0.05, with h = 0.025. The simulation result
appears in Fig. 10.7.
It is very clear that Heun's method is distinctly more accurate than the Euler
method, especially near t = t_0.
Heun's method may be viewed in a different light by considering the following.
We may formally integrate (10.18) to arrive at x(t) according to
x{t) - x(t n )
f
f(x, x)dx.
(10.57)
Figure 10.7 Comparison of the implicit Euler method with Heun's method for
Example 10.6. We see that Heun's method is more accurate, especially for t near t_0 = 0.05.
436 NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS
So if t = t_{n+1}, then, from (10.57), we obtain

x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x, τ) dτ,

or

x_{n+1} = x_n + ∫_{t_n}^{t_{n+1}} f(x, τ) dτ.   (10.58)

According to the trapezoidal rule for numerical integration (see Chapter 9,
Section 9.2), we have

∫_{t_n}^{t_{n+1}} f(x, τ) dτ ≈ (h/2)[f(x_n, t_n) + f(x_{n+1}, t_{n+1})]

(since h = t_{n+1} - t_n), and hence (10.58) becomes

x_{n+1} = x_n + (h/2)[f(x_n, t_n) + f(x_{n+1}, t_{n+1})],   (10.59)
which is just the first step in the derivation of Heun's method. We may certainly
call (10.59) the trapezoidal method. Clearly, it is an implicit method since we must
solve (10.59) for x_{n+1}. Equally clearly, Eq. (10.59) appears in (10.55), and so we
immediately conclude that the trapezoidal method is a second-order method with
a truncation error per step of order O(h^3). Thus, we may regard Heun's method
as the explicit form of the trapezoidal method. Or, equivalently, the trapezoidal
method can be regarded as the implicit form of Heun's method. We mention that
the trapezoidal method is unconditionally stable, but will not prove this here.
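The unconditional stability claim is easy to check on the scalar model problem dx/dt = λx: solving (10.59) for x_{n+1} gives the amplification factor (1 + hλ/2)/(1 - hλ/2), whose magnitude stays below unity for every h > 0 whenever λ < 0. A small Python sketch (not from the book, whose own code is MATLAB):

```python
def trap_factor(h, lam):
    # amplification factor of the trapezoidal method (10.59) on dx/dt = lam*x:
    # x_{n+1} = x_n + (h/2)(lam*x_n + lam*x_{n+1})  =>  solve for x_{n+1}
    return (1.0 + 0.5 * h * lam) / (1.0 - 0.5 * h * lam)

def euler_factor(h, lam):
    # explicit Euler amplification factor; stable only for |1 + h*lam| < 1
    return 1.0 + h * lam

lam = -10.0
trap_ok = all(abs(trap_factor(h, lam)) < 1.0
              for h in (0.01, 0.1, 1.0, 10.0, 100.0))
euler_ok_large_h = abs(euler_factor(1.0, lam)) < 1.0   # h*lam = -10: unstable
```

By contrast, the explicit Euler factor 1 + hλ leaves the unit interval as soon as h > 2/|λ|.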
The following example illustrates some more subtle issues relating to the stability
of numerical solutions to ODE initial-value problems. It is an applications example
from population dynamics, but the issues it raises are more broadly applicable. The
example is taken from Beltrami [6].
Example 10.7 Suppose that x(t) is the total size of a population (people,
insects, bacteria, etc.). The members of the population exist in a habitat that can
realistically support not more than N individuals. This is the carrying capacity
for the system. The population may grow at some rate that diminishes to zero
as x(f) approaches N. But if the population size x(f) is much smaller than the
carrying capacity, the rate of growth might be considered proportional to the present
population size. Consequently, a model for population growth might be
dx(t)/dt = r x(t)[1 - x(t)/N].   (10.60)

This is called the logistic equation. By separation of variables this equation has
solution

x(t) = N / (1 + c e^{-rt})   (10.61)
for t > 0. As usual, c depends on the initial condition (initial population size) x(0).
The exact solution in (10.61) is clearly "well behaved." Therefore, any numerical
solution to (10.60) must also be well behaved.
Suppose that we attempt to simulate (10.60) numerically using the explicit Euler
method. In this case we obtain [via (10.22)]

x_{n+1} = (1 + hr)x_n - (hr/N)x_n^2.   (10.62)

Suppose that we transform variables according to x_n = a y_n, in which case (10.62)
can be rewritten as

y_{n+1} = (1 + hr) y_n [1 - (hra/(N(1 + hr))) y_n].   (10.63)

If we select

a = N(1 + hr)/(hr),

then (10.63) becomes

y_{n+1} = λ y_n (1 - y_n),   (10.64)
where λ = 1 + hr. We recognize this as the logistic map from Chapter 7 [see
Examples 7.3-7.5 and Eq. (7.83)]. From Section 7.6 in particular we recall that
this map can become chaotic for certain choices of λ. In other words, chaotic
instability is another possible failure mode for a numerical method that purports to
solve ODEs.
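A short Python sketch (an illustration, not from the book) of this failure mode: iterating the logistic map y_{n+1} = λ y_n (1 - y_n) with λ = 1 + hr. A modest step (λ = 1.5) converges to the fixed point 1 - 1/λ, while a large step (λ = 3.9) produces bounded but chaotic iterates, even though the exact solution (10.61) is monotone and smooth:

```python
def logistic_map(lam, y0, n):
    # iterate y_{k+1} = lam * y_k * (1 - y_k), returning the whole orbit
    y, out = y0, []
    for _ in range(n):
        y = lam * y * (1.0 - y)
        out.append(y)
    return out

r = 1.0
small_h = logistic_map(1.0 + 0.5 * r, 0.2, 200)   # lam = 1.5: settles down
large_h = logistic_map(1.0 + 2.9 * r, 0.2, 200)   # lam = 3.9: chaotic
```

The small-step orbit converges to 1 - 1/1.5 = 1/3; the large-step orbit stays in (0, 1) but never settles.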
The explicit form of the Euler method is actually an example of a first-order
Runge-Kutta method. Similarly, Heun's method is an example of a second-order
Runge-Kutta method. It is second-order essentially because the approximation
involved retains the term in h^2 in the Taylor series expansion. We mention that
methods of still higher order can be obtained simply by retaining more terms in the
Taylor series expansion of (10.21). This is seldom done because to do so requires
working with derivatives of increasing order, and this requires much computational
effort. But this effort can be completely avoided by developing Runge-Kutta meth-
ods of higher order. We now outline a general approach for doing this. It is based
on material from Rao [7].
All Runge-Kutta methods have a particular form that may be stated as

x_{n+1} = x_n + h a(x_n, t_n, h),   (10.65)

where a(x_n, t_n, h) is called the increment function. The increment function is
selected to represent the average slope on the interval t ∈ [t_n, t_{n+1}]. In particular,
the increment function has the form

a(x_n, t_n, h) = Σ_{j=1}^{m} c_j k_j,   (10.66)
where m is called the order of the Runge-Kutta method, c_j are constants, and
coefficients k_j are obtained recursively according to

k_1 = f(x_n, t_n)
k_2 = f(x_n + a_{2,1} h k_1, t_n + p_2 h)
k_3 = f(x_n + a_{3,1} h k_1 + a_{3,2} h k_2, t_n + p_3 h)
  ⋮
k_m = f(x_n + Σ_{j=1}^{m-1} a_{m,j} h k_j, t_n + p_m h).   (10.67)

A more compact description of the Runge-Kutta methods is

x_{n+1} = x_n + h Σ_{j=1}^{m} c_j k_j,   (10.68a)

where

k_j = f(x_n + h Σ_{i=1}^{j-1} a_{j,i} k_i, t_n + p_j h).   (10.68b)
To specify a particular method requires selecting a variety of coefficients (c_j, a_{j,i},
etc.). How is this to be done?
We illustrate with examples. Suppose that m = 1. In this case

x_{n+1} = x_n + h c_1 k_1 = x_n + h c_1 f(x_n, t_n),   (10.69)

which gives (10.22) when c_1 = 1. Thus, we are justified in calling the explicit
Euler method a first-order Runge-Kutta method.
Suppose that m = 2. In this case

x_{n+1} = x_n + h c_1 f(x_n, t_n) + h c_2 f(x_n + a_{2,1} h f(x_n, t_n), t_n + p_2 h).   (10.70)

We observe that if we choose

c_2 = c_1 = 1/2,   a_{2,1} = 1,   p_2 = 1,   (10.71)

then (10.70) reduces to

x_{n+1} = x_n + (1/2)h[f(x_n, t_n) + f(x_n + h f(x_n, t_n), t_n + h)],

which is Heun's method [compare this with (10.47)]. Thus, we are justified in calling Heun's method a second-order Runge-Kutta method. However, the coefficient
choices in (10.71) are not unique. Other choices will lead to other second-order
Runge-Kutta methods. We may arrive at a systematic approach for creating alter-
natives as follows.
For convenience, as in (10.21), define x_n^{(k)} = x^{(k)}(t_n). Since m = 2, we will
consider the Taylor expansion

x_{n+1} = x_n + h x_n^{(1)} + (1/2)h^2 x_n^{(2)} + O(h^3)   (10.72)

[recall (10.52)], for which the term O(h^3) simply denotes the higher-order terms.
We recall that x^{(1)}(t) = f(x, t), so x_n^{(1)} = f(x_n, t_n), and via the chain rule

x^{(2)}(t) = ∂f/∂t + (∂f/∂x)(dx/dt) = ∂f/∂t + (∂f/∂x) f(x, t),   (10.73)

so (10.72) may be rewritten as

x_{n+1} = x_n + h f(x_n, t_n) + (1/2)h^2 [∂f(x_n, t_n)/∂t + (∂f(x_n, t_n)/∂x) f(x_n, t_n)] + O(h^3).   (10.74)
Once again, the Runge-Kutta method for m = 2 is

x_{n+1} = x_n + h c_1 f(x_n, t_n) + h c_2 f(x_n + a_{2,1} h k_1, t_n + p_2 h).   (10.75)

Recalling (10.26), the Taylor expansion of f(x_n + a_{2,1} h k_1, t_n + p_2 h) is given by

f(x_n + a_{2,1} h k_1, t_n + p_2 h) = f(x_n, t_n) + a_{2,1} h f(x_n, t_n) ∂f(x_n, t_n)/∂x
    + p_2 h ∂f(x_n, t_n)/∂t + O(h^2).   (10.76)

Now we substitute (10.76) into (10.75) to obtain

x_{n+1} = x_n + (c_1 + c_2) h f(x_n, t_n) + p_2 c_2 h^2 ∂f(x_n, t_n)/∂t
    + a_{2,1} c_2 h^2 (∂f(x_n, t_n)/∂x) f(x_n, t_n) + O(h^3).   (10.77)

We may now compare like terms of (10.77) with those in (10.74) to conclude that
the coefficients we seek satisfy the nonlinear system of equations

c_1 + c_2 = 1,   p_2 c_2 = 1/2,   a_{2,1} c_2 = 1/2.   (10.78)
To generate second-order Runge-Kutta methods, we are at liberty to choose the
coefficients c_1, c_2, p_2, and a_{2,1} in any way we wish so long as the choice satisfies (10.78). Clearly, Heun's method is only one choice among many possible
choices. We observe from (10.78) that we have four unknowns, but possess three
equations. Thus, we may select one parameter "arbitrarily" that will then determine
the remaining ones. For example, we may select c_2, so then, from (10.78),

c_1 = 1 - c_2,   p_2 = 1/(2c_2),   and   a_{2,1} = 1/(2c_2).   (10.79)
Since one parameter is freely chosen, thus constraining all the rest, we say that
second-order Runge-Kutta methods possess one degree of freedom.
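A sketch (not from the book) of this one-parameter family: a generic second-order step built from (10.79), checked to exhibit O(h^2) global error (halving h should cut the error by roughly a factor of 4) for both c_2 = 1/2 (Heun) and c_2 = 1 (the midpoint, or modified Euler, choice). The test problem dx/dt = -x is an assumption made here:

```python
import math

def rk2_step(f, x, t, h, c2):
    # generic second-order Runge-Kutta step with coefficients from (10.79)
    c1 = 1.0 - c2
    p2 = a21 = 1.0 / (2.0 * c2)
    k1 = f(x, t)
    k2 = f(x + a21 * h * k1, t + p2 * h)
    return x + h * (c1 * k1 + c2 * k2)

def solve(f, x0, t0, h, n, c2):
    x, t = x0, t0
    for _ in range(n):
        x = rk2_step(f, x, t, h, c2)
        t += h
    return x

f = lambda x, t: -x                 # exact solution x(t) = e^{-t}, x(0) = 1
ratio = {}
for c2 in (0.5, 1.0):               # Heun, then midpoint
    e_h = abs(solve(f, 1.0, 0.0, 0.1, 10, c2) - math.exp(-1.0))
    e_h2 = abs(solve(f, 1.0, 0.0, 0.05, 20, c2) - math.exp(-1.0))
    ratio[c2] = e_h / e_h2          # should be about 4 for a second-order method
```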
It should be clear that the previous procedure may be extended to systematically
generate Runge-Kutta methods of higher order (i.e., m > 2). However, this requires
a more complete description of the Taylor expansion for a function in two variables.
This is stated as follows.
Theorem 10.3: Taylor's Theorem Suppose that f(x, t) and all of its partial
derivatives of order n + 1 or less are defined and continuous on D = {(x, t) | a < t < b, c < x < d}. Let (x_0, t_0) ∈ D; then, for all (x, t) ∈ D, there is a point (η, ζ) ∈ D such that

f(x, t) = Σ_{k=0}^{n} (1/k!) Σ_{j=0}^{k} C(k, j) (t - t_0)^{k-j} (x - x_0)^j ∂^k f(x_0, t_0)/(∂t^{k-j} ∂x^j)
    + (1/(n+1)!) Σ_{k=0}^{n+1} C(n+1, k) (t - t_0)^{n+1-k} (x - x_0)^k ∂^{n+1} f(η, ζ)/(∂t^{n+1-k} ∂x^k),

where C(k, j) denotes the binomial coefficient, and (η, ζ) is on the line segment that joins the points (x_0, t_0) and (x, t).
Proof Omitted.
The reader can now easily imagine that any attempt to apply this approach for
m > 2 will be quite tedious. Thus, we shall not do this here. We will restrict ourselves to stating a few facts. Applying the method to m = 3 (i.e., the generation
of third-order Runge-Kutta methods) leads to algorithm coefficients satisfying six
equations with eight unknowns. There will be 2 degrees of freedom as a consequence.
Fourth-order Runge-Kutta methods (i.e., m = 4) also possess two degrees of
freedom, and also have a truncation error per step of O(h^5). One such method
(attributed to Runge) in common use is

x_{n+1} = x_n + (h/6)[k_1 + 2k_2 + 2k_3 + k_4],   (10.80)

where

k_1 = f(x_n, t_n)
k_2 = f(x_n + (1/2)h k_1, t_n + (1/2)h)
k_3 = f(x_n + (1/2)h k_2, t_n + (1/2)h)
k_4 = f(x_n + h k_3, t_n + h).   (10.81)
Of course, an infinite number of other fourth-order methods are possible.
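A Python sketch of a single step of (10.80)-(10.81) (the book's own code is MATLAB); on the assumed test problem dx/dt = -x, the error after integrating to t = 1 with h = 0.1 is far below that of the lower-order methods:

```python
import math

def rk4_step(f, x, t, h):
    # the classical fourth-order Runge-Kutta step (10.80)-(10.81)
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

f = lambda x, t: -x                  # exact solution x(t) = e^{-t}, x(0) = 1
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):                  # integrate to t = 1
    x = rk4_step(f, x, t, h)
    t += h
err_rk4 = abs(x - math.exp(-1.0))    # well under 1e-6 at this step size
```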
We mention that Runge-Kutta methods are explicit methods, and so in principle
carry some risk of instability. However, it turns out that the higher the order of
the method, the lower the risk of stability problems. In particular, users of fourth-
order methods typically experience few stability problems in practice. In fact, it
can be shown that on applying (10.80) and (10.81) to the model problem (10.28),
we obtain

x_n = [1 + hλ + (1/2)h^2 λ^2 + (1/6)h^3 λ^3 + (1/24)h^4 λ^4]^n x_0.   (10.82)

A plot of

σ = 1 + hλ + (1/2)h^2 λ^2 + (1/6)h^3 λ^3 + (1/24)h^4 λ^4   (10.83)

in terms of hλ appears in Fig. 10.8. To avoid lim_{n→∞} |x_n| = ∞, we must have
|σ| < 1, and so it turns out that

-2.785 < hλ < 0

(see Table 9.11 on p. 685 of Rao [7]). This is in agreement with Fig. 10.8. Thus,
for λ < 0,

h < 2.785/|λ|.   (10.84)

This represents an improvement over (10.33).
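The stability interval is easy to confirm numerically; a sketch (not from the book) evaluating σ of (10.83) on a grid:

```python
def sigma(hl):
    # RK4 amplification factor (10.83) as a function of h*lambda
    return 1.0 + hl + hl**2 / 2.0 + hl**3 / 6.0 + hl**4 / 24.0

# |sigma| < 1 throughout (-2.78, 0), but not just past -2.785
inside = all(abs(sigma(-2.78 * k / 100.0)) < 1.0 for k in range(1, 101))
outside = abs(sigma(-2.80)) > 1.0
```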
Example 10.8 Once again we repeat Example 10.5, for which

dx/dt = t - 2x/t

with x_0 = 1.0 for t_0 = 0.05, but here our step size is now h = 0.05. Additionally, our comparison is between Heun's method and the fourth-order Runge-Kutta
method defined by Eqs. (10.80) and (10.81).
Figure 10.8 A plot of σ in terms of hλ [see Eq. (10.83)].
Figure 10.9 Comparison of Heun's method (a second-order Runge-Kutta method) with
a fourth-order Runge-Kutta method [Eqs. (10.80) and (10.81)]. This is for the differential
equation in Example 10.8.
The simulated solutions based on these methods appear in Fig. 10.9. As expected,
the fourth-order method is much more accurate than Heun's method. The plot
in Fig. 10.9 was generated using MATLAB, and the code for this appears in
Appendix 10.A as an example. (Of course, previous plots were also produced by
MATLAB codes similar to that in Appendix 10.A.)
10.3 SYSTEMS OF FIRST-ORDER ODEs
The methods of Section 10.2 may be extended to handle systems of first-order
ODEs where the number of ODEs in the system is arbitrary (but finite). However,
we will consider only systems of two first-order ODEs here. Specifically, we wish
to solve (numerically)

dx/dt = f(x, y, t),   (10.85a)
dy/dt = g(x, y, t),   (10.85b)

where the initial condition is x_0 = x(t_0) and y_0 = y(t_0). This is sufficient, for
example, to simulate the Duffing equation mentioned in Section 10.1.
SYSTEMS OF FIRST-ORDER ODEs 443
As in Section 10.2, we will begin with Euler methods. Therefore, following
(10.21) we may Taylor-expand x(t) and y(t) about the sampling time t = t_n. As
before, for convenience we may define x_n^{(k)} = x^{(k)}(t_n), y_n^{(k)} = y^{(k)}(t_n). The relevant
expansions are given by

x_{n+1} = x_n + h x_n^{(1)} + (1/2)h^2 x_n^{(2)} + ⋯,   (10.86a)
y_{n+1} = y_n + h y_n^{(1)} + (1/2)h^2 y_n^{(2)} + ⋯.   (10.86b)
The explicit Euler method follows by retaining the first two terms in each expansion
in (10.86). Thus, the Euler method in this case is

x_{n+1} = x_n + h f(x_n, y_n, t_n),   (10.87a)
y_{n+1} = y_n + h g(x_n, y_n, t_n),   (10.87b)

where we have used the fact that x_n^{(1)} = f(x_n, y_n, t_n) and y_n^{(1)} = g(x_n, y_n, t_n). As
we might expect, the implicit form of the Euler method is

x_{n+1} = x_n + h f(x_{n+1}, y_{n+1}, t_{n+1}),   (10.88a)
y_{n+1} = y_n + h g(x_{n+1}, y_{n+1}, t_{n+1}).   (10.88b)

Of course, to employ (10.88) will generally involve solving a nonlinear system of
equations for x_{n+1} and y_{n+1}, necessitating the use of Chapter 7 techniques. As
before, we refer to parameter h as the step size.
The accuracy of the explicit and implicit Euler methods for systems is the same
as for individual equations; specifically, it is O(h^2). However, stability analysis is
more involved. Matrix methods simply cannot be avoided. This is demonstrated as
follows.
The model problem for a single first-order ODE was Eq. (10.28). For a coupled
system of two first-order ODEs as in (10.85), the model problem is now

dx/dt = a_{00} x + a_{01} y,   (10.89a)
dy/dt = a_{10} x + a_{11} y.   (10.89b)

Here a_{ij} are real-valued constants. We remark that this may be written in the more
compact matrix form

dx̄/dt = A x̄,   (10.90)

where x̄ = x̄(t) = [x(t) y(t)]^T and dx̄/dt = [dx(t)/dt dy(t)/dt]^T and, of course,

A = [ a_{00}  a_{01}
      a_{10}  a_{11} ].   (10.91)
Powerful claims are possible using matrix methods. For example, it can be argued
that if x̄(t) ∈ R^N (so that A ∈ R^{N×N}), then dx̄(t)/dt = A x̄(t) has solution

x̄(t) = e^{At} x̄(0)   (10.92)

for t ≥ 0; that is, (10.92) for N = 2 is the general solution to (10.90) [and hence
to (10.89)].^5
The constants in A are related to f(x, y, t) and g(x, y, t) in the following
manner. Recall (10.25). If we retain only the linear terms in the Taylor expansions
of f and g around the point (x_0, y_0, t_0), then

f(x, y, t) ≈ f(x_0, y_0, t_0) + (x - x_0) ∂f(x_0, y_0, t_0)/∂x + (y - y_0) ∂f(x_0, y_0, t_0)/∂y
    + (t - t_0) ∂f(x_0, y_0, t_0)/∂t,   (10.93a)

g(x, y, t) ≈ g(x_0, y_0, t_0) + (x - x_0) ∂g(x_0, y_0, t_0)/∂x + (y - y_0) ∂g(x_0, y_0, t_0)/∂y
    + (t - t_0) ∂g(x_0, y_0, t_0)/∂t.   (10.93b)

As a consequence,

A = [ ∂f(x_0, y_0, t_0)/∂x   ∂f(x_0, y_0, t_0)/∂y
      ∂g(x_0, y_0, t_0)/∂x   ∂g(x_0, y_0, t_0)/∂y ].   (10.94)
At this point we may apply the explicit Euler method (10.87) to the model problem
(10.89), which results in

[ x_{n+1} ]   [ 1 + h a_{00}     h a_{01}   ] [ x_n ]
[ y_{n+1} ] = [   h a_{10}     1 + h a_{11} ] [ y_n ].   (10.95)
^5 Yes, as surprising as it seems, although At is a matrix, exp(At) makes sense as an operation. In fact,
for example, with x̄(t) = [x_0(t) ⋯ x_{n-1}(t)]^T, the system of first-order ODEs

dx̄(t)/dt = A x̄(t) + b y(t)

(A ∈ R^{n×n}, b ∈ R^n, and y(t) ∈ R) has the general solution

x̄(t) = e^{At} x̄(0^-) + ∫_{0^-}^{t} e^{A(t-τ)} b y(τ) dτ.

The integral in this solution is an example of a convolution integral.
An alternative form for this is

x̄_{n+1} = (I + hA) x̄_n.   (10.96)

Here I is a 2 × 2 identity matrix and x̄_n = [x_n y_n]^T. With x̄_0 = [x_0 y_0]^T, we
may immediately claim that

x̄_n = (I + hA)^n x̄_0,   (10.97)

where n ∈ Z^+. We observe that this includes (10.32) as a special case. Naturally,
we must select step size h to avoid instability; that is, we are forced to select h
to prevent lim_{n→∞} ||x̄_n|| = ∞. In principle, the choice of norm is arbitrary, but
2-norms are often chosen. We recall that there is a nonsingular matrix T (matrix
of eigenvectors) such that

T^{-1}[I + hA]T = Λ,   (10.98)

where Λ is the matrix of eigenvalues. We will assume that

Λ = [ λ_0   0
       0   λ_1 ].   (10.99)

In other words, we assume I + hA is diagonalizable. This is not necessarily always
the case, but is an acceptable assumption for present purposes. Since from (10.98)
we have I + hA = TΛT^{-1}, (10.96) becomes

x̄_{n+1} = TΛT^{-1} x̄_n,

or

T^{-1} x̄_{n+1} = Λ T^{-1} x̄_n.   (10.100)

Let ȳ_n = T^{-1} x̄_n, so therefore (10.100) becomes

ȳ_{n+1} = Λ ȳ_n.   (10.101)

In any norm lim_{n→∞} ||ȳ_n|| ≠ ∞, provided |λ_k| < 1 for all k = 0, 1, ..., N - 1
(Λ ∈ R^{N×N}). Consequently, lim_{n→∞} ||x̄_n|| ≠ ∞ too (because ||x̄_n|| = ||Tȳ_n|| ≤ ||T|| ||ȳ_n|| and ||T|| is finite). We conclude that h is an acceptable step size,
provided the eigenvalues of I + hA do not possess a magnitude greater than unity.
Note that in practice we normally insist that h result in |λ_k| < 1 for all k.
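In practice this criterion can be checked directly before committing to a step size. A Python sketch (not from the book); the 2 × 2 matrix below is a hypothetical example with eigenvalues -1 and -2, so the explicit Euler method requires h < 1:

```python
import cmath

def eig2(m):
    # eigenvalues of a 2x2 matrix [[a, b], [c, d]] via its characteristic polynomial
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def euler_step_ok(A, h):
    # stable iff both eigenvalues of I + h*A have magnitude below unity
    B = [[1.0 + h * A[0][0], h * A[0][1]],
         [h * A[1][0], 1.0 + h * A[1][1]]]
    return all(abs(v) < 1.0 for v in eig2(B))

A = [[0.0, 1.0], [-2.0, -3.0]]    # hypothetical test matrix, eigenvalues -1 and -2
ok_small = euler_step_ok(A, 0.5)  # eigenvalues of I + hA: 0.5 and 0
ok_large = euler_step_ok(A, 1.2)  # eigenvalues of I + hA: -0.2 and -1.4
```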
We may apply the previous stability analysis to the implicit Euler method.
Specifically, apply (10.88) to model problem (10.89), giving

x_{n+1} = x_n + h[a_{00} x_{n+1} + a_{01} y_{n+1}],
y_{n+1} = y_n + h[a_{10} x_{n+1} + a_{11} y_{n+1}],
which in matrix form becomes

[ x_{n+1} ]   [ x_n ]       [ a_{00}  a_{01} ] [ x_{n+1} ]
[ y_{n+1} ] = [ y_n ]  +  h [ a_{10}  a_{11} ] [ y_{n+1} ],

or more compactly as

x̄_{n+1} = x̄_n + h A x̄_{n+1}.   (10.102)

Consequently, for n ∈ Z^+,

x̄_n = ([I - hA]^{-1})^n x̄_0.   (10.103)

For convenience we can define B = [I - hA]^{-1}. Superficially, (10.103) seems to
have the same form as (10.97). We might be led therefore to believe (falsely)
that the implicit method can be unstable, too. However, we may assume that there
exists a nonsingular matrix V such that (if A ∈ R^{N×N})

V^{-1} A V = Γ,   (10.104)

where Γ = diag(γ_0, γ_1, ..., γ_{N-1}), which is the diagonal matrix of the eigenvalues
of A. (Once again, it is not necessarily the case that A is always diagonalizable,
but the assumption is reasonable for our present purposes.) Immediately

[I - hA]^{-1} = [I - hVΓV^{-1}]^{-1} = (V[V^{-1}V - hΓ]V^{-1})^{-1},

so that

[I - hA]^{-1} = V[I - hΓ]^{-1}V^{-1}.   (10.105)

Consequently, x̄_{n+1} = [I - hA]^{-1} x̄_n becomes

x̄_{n+1} = V[I - hΓ]^{-1}V^{-1} x̄_n.   (10.106)

Define ȳ_n = V^{-1} x̄_n, and so (10.106) becomes

ȳ_{n+1} = [I - hΓ]^{-1} ȳ_n.   (10.107)

Because I - hΓ is a diagonal matrix, a typical main diagonal element of [I - hΓ]^{-1} is σ_k = 1/(1 - hγ_k). It is a fact (which we will not prove here) that the
model problem in the general case is stable provided the eigenvalues of A all possess negative-valued real parts.^6 Thus, provided Re(γ_k) < 0 for all k, we are assured
that |σ_k| < 1 for all k, and hence lim_{n→∞} ||ȳ_n|| = 0. Thus, lim_{n→∞} ||x̄_n|| = 0, too,
and so we conclude that the implicit form of the Euler method is unconditionally
stable. Thus, if the model problem is stable, we may select any h > 0.
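For the linear model problem the implicit update is just one small linear solve per step, x̄_{n+1} = (I - hA)^{-1} x̄_n. A Python sketch (not from the book; the test matrix is a hypothetical one with eigenvalues -1 and -2) showing that even an enormous step size still produces a decaying solution:

```python
def implicit_euler(A, x0, y0, h, n):
    # x_{n+1} = (I - h*A)^{-1} x_n, with the 2x2 inverse written out explicitly
    (a, b), (c, d) = A
    m00, m01 = 1.0 - h * a, -h * b
    m10, m11 = -h * c, 1.0 - h * d
    det = m00 * m11 - m01 * m10
    x, y = x0, y0
    for _ in range(n):
        x, y = (m11 * x - m01 * y) / det, (-m10 * x + m00 * y) / det
    return x, y

A = [[0.0, 1.0], [-2.0, -3.0]]      # hypothetical stable test matrix
x_end, y_end = implicit_euler(A, 1.0, 1.0, 5.0, 100)   # very large step size
```

Per step the modal factors are 1/(1 + 5) and 1/(1 + 10), so the iterates decay to zero regardless of how large h is.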
^6 The eigenvalues of A may be complex-valued, and so it is the real parts of these that truly determine
system stability.
The following example is of a linear system for which a mathematically exact
solution can be found. [In fact, the solution is given by (10.92).]
Example 10.9 Consider the ODE system

dx/dt = -2x + (1/4)y,   (10.108a)
dy/dt = -3x.   (10.108b)

The initial condition is x_0 = x(0) = 1, y_0 = y(0) = -1. From (10.94) we see that

A = [ -2  1/4
      -3   0  ].

The eigenvalues of A are γ_0 = -1/2 and γ_1 = -3/2. These eigenvalues are both
negative, and so the solution to (10.108) happens to be stable. In fact, the exact
solution can be shown to be

x(t) = -(3/4)e^{-t/2} + (7/4)e^{-3t/2},   (10.109a)
y(t) = -(9/2)e^{-t/2} + (7/2)e^{-3t/2}   (10.109b)

for t ≥ 0. Note that the eigenvalues of A appear in the exponents of the exponentials
in (10.109). This is not a coincidence. The explicit Euler method has the iterations

x_{n+1} = x_n + h[-2x_n + (1/4)y_n],   (10.110a)
y_{n+1} = y_n - 3h x_n.   (10.110b)
Simulation results are shown in Figs. 10.10 and 10.11 for h = 0.1 and h = 1.4, respectively. This involves comparing (10.110a,b) with the exact solution
(10.109a,b). Figure 10.10b shows a plot of the eigenvalues of I + hA for various
step sizes. We see from this that choosing h = 1.4 must result in an unstable simulation. This is confirmed by the result in Fig. 10.11. For comparison purposes, the
eigenvalues of I + hA and of [I - hA]^{-1} are plotted in Fig. 10.12. This shows
that, at least in this particular case, the implicit method is more stable than the
explicit method.
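The two simulations of this example can be reproduced in a few lines of Python (a sketch; the book's own plots are MATLAB):

```python
import math

def step(x, y, h):
    # explicit Euler iteration (10.110a,b) for this example's system
    return x + h * (-2.0 * x + 0.25 * y), y + h * (-3.0 * x)

def exact(t):
    # the closed-form solution (10.109a,b)
    return (-0.75 * math.exp(-0.5 * t) + 1.75 * math.exp(-1.5 * t),
            -4.5 * math.exp(-0.5 * t) + 3.5 * math.exp(-1.5 * t))

x, y = 1.0, -1.0
h, n = 0.1, 100                      # simulate out to t = 10
for _ in range(n):
    x, y = step(x, y, h)
err = abs(x - exact(h * n)[0]) + abs(y - exact(h * n)[1])

xu, yu = 1.0, -1.0
for _ in range(100):                 # h = 1.4: predicted unstable
    xu, yu = step(xu, yu, 1.4)
```

With h = 0.1 the Euler trajectory tracks the exact solution closely; with h = 1.4 the iterates grow without bound, exactly as the eigenvalue plot predicts.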
Example 10.10 Recall the Duffing equation of Section 10.1. Also, recall the
fact that this ODE can be rewritten in the form of (10.85a,b), and this was done in
Eq. (10.4a,b).
Figures 10.13 and 10.14 show the result of simulating the Duffing equation
using the explicit Euler method for the model parameters

F = 0.5,   ω = 1,   m = 1,   a = 1,   δ = 0.1,   k = 0.05.
Figure 10.10 Simulation results for h = 0.1. Plot (b) shows the eigenvalues of / + hA.
Both plots were obtained by applying the explicit form of the Euler method to the ODE
system of Example 10.9. Clearly, the simulation is stable for h = 0.1.
Thus (10.1) is now

d^2x/dt^2 = 0.5 cos(t) - 0.05 dx/dt - [x + 0.1x^3].

We use initial condition x(0) = y(0) = 0. The driving function (applied force)
0.5 cos(t) is being opposed by the restoring force of the spring (terms in square
brackets) and friction (first-derivative term). Therefore, on physical grounds, we do
not expect the solution x(t) to grow without bound as t → ∞. Thus, the simulated
solution to this problem must be stable, too.
We mention that an analytical solution to the differential equation that we are
simulating is not presently known.
From (10.94), for our Duffing system example we have

A = [ 0                          1
      -a/m - (3δ/m)x_0^2      -k/m ].
Example 10.11 According to Hydon [1, p. 61], the second-order ODE

d^2x/dt^2 = (1/x)(dx/dt)^2 + (x - 1/x)(dx/dt)   (10.111)
Figure 10.11 This is the result of applying the explicit form of the Euler method to the
ODE system of Example 10.9. Clearly, the simulation is not stable for h = 1.4. This is
predicted by the eigenvalue plot in Fig. 10.10b, which shows that one of the eigenvalues of
I + hA has a magnitude exceeding unity for this choice of h.
Figure 10.12 Plots of the eigenvalues of I + hA (a), which determine the stability of the
explicit Euler method, and the eigenvalues of [I - hA]^{-1} (b), which determine the stability
of the implicit Euler method. This applies for the ODE system of Example 10.9.
Figure 10.13 (a) Explicit Euler method simulation of the Duffing equation; (b) magnitude
of the eigenvalues of I + hA. Both plots were obtained by applying the explicit form of the
Euler method to the ODE system of Example 10.10, which is the Duffing equation expressed
as a coupled system of first-order ODEs. The simulation is apparently stable for h = 0.02.
This is in agreement with the prediction based on the eigenvalues of I + hA [plot (b)],
which have a magnitude of less than unity for this choice of h.
has the exact solution

x(t) = c_1 - √(c_1^2 - 1) tanh(√(c_1^2 - 1)(t + c_2)),   c_1^2 > 1
x(t) = c_1 - (t + c_2)^{-1},                             c_1^2 = 1
x(t) = c_1 + √(1 - c_1^2) tan(√(1 - c_1^2)(t + c_2)),    c_1^2 < 1.   (10.112)
The ODE in (10.111) can be rewritten as the system of first-order ODEs

dx/dt = y,   (10.113a)
dy/dt = (1/x)y^2 + (x - 1/x)y.   (10.113b)
Figure 10.14 (a) Explicit Euler method simulation of the Duffing equation; (b) magnitude
of the eigenvalues of I + hA. Both plots were obtained by applying the explicit form of the
Euler method to the ODE system of Example 10.10, which is the Duffing equation expressed
as a coupled system of first-order ODEs. The simulation is not stable for h = 0.055. This is
in agreement with the prediction based on the eigenvalues of I + hA [plot (b)], which have
a magnitude exceeding unity for this choice of h.
The initial condition is x_0 = x(0), and y(0) = dx(t)/dt |_{t=0}. From (10.94) we have

A = [ 0                                        1
      -y_0^2/x_0^2 + (1 + 1/x_0^2)y_0      2y_0/x_0 + x_0 - 1/x_0 ].   (10.114)

Using (10.113a), we may obtain y(t) from (10.112). For example, let us consider
simulating the case c_1^2 = 1. Thus, in this case

y(t) = 1/(t + c_2)^2.   (10.115)

For the choice c_1^2 = 1, we have

x_0 = c_1 - 1/c_2,   (10.116a)
y_0 = 1/c_2^2.   (10.116b)
Figure 10.15 Magnitude of the eigenvalues of I + hA is shown in plot (b). Both plots
show the simulation results of applying the explicit Euler method to the ODE system in
Example 10.11. The simulation is (as expected) stable for h = 0.02. Clearly, the simulated
result agrees well with the exact solution.
If we select c_1 = 1, then

x_0 = 1 - 1/c_2,   y_0 = (1 - x_0)^2.   (10.117)

Let us assume x_0 = -1, and so y_0 = 4. The result of applying the explicit Euler
method to system (10.113) with these conditions is shown in Figs. 10.15 and 10.16.
The magnitudes of the eigenvalues of I + hA are displayed in plots (b) of both
figures. We see that for Fig. 10.15, h = 0.02 and a stable simulation is the result,
while for Fig. 10.16, we have h = 0.35, for which the simulation is unstable. This
certainly agrees with the stability predictions based on finding the eigenvalues of
matrix I + hA.
Examples 10.10 and 10.11 illustrate just how easy it is to arrive at differential
equations that are not so simple to simulate in a stable manner with low-order
explicit methods. It is possible to select an h that is "small" in some sense, yet
not small enough for stability. The cubic nonlinearity in the Duffing model makes
the implementation of the implicit form of Euler's method in this problem quite
Figure 10.16 Plot (b) shows the magnitude of the eigenvalues of I + hA. Both plots show the simulation results of applying the explicit Euler method to the ODE system in Example 10.11.
The simulation is (as expected) not stable for h = 0.35. Instability is confirmed by the fact
that the simulated result deviates greatly from the exact solution when t is sufficiently large.
unattractive. So, a better approach to simulating the Duffing equation is with a
higher-order explicit method.
For example, Heun's method for (10.85) may be stated as

x_{n+1} = x_n + (1/2)h[f(x_n, y_n, t_n) + f(x_n + k_1, y_n + k_2, t_n + h)],   (10.118a)
y_{n+1} = y_n + (1/2)h[g(x_n, y_n, t_n) + g(x_n + k_1, y_n + k_2, t_n + h)],   (10.118b)

where

k_1 = h f(x_n, y_n, t_n),   k_2 = h g(x_n, y_n, t_n).   (10.119)
Also, for example, Chapter 36 of Branson [8] contains a summary of higher-order
methods that may be applied to (10.85).
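A Python sketch (not from the book) of (10.118)-(10.119) applied to the Duffing system of Example 10.10, using the parameter values F = 0.5, ω = 1, m = 1, a = 1, δ = 0.1, k = 0.05 and x(0) = y(0) = 0; on physical grounds the computed solution should remain bounded:

```python
import math

# parameter values of Example 10.10
F, w, m, a, delta, k = 0.5, 1.0, 1.0, 1.0, 0.1, 0.05

def f(x, y, t):
    return y

def g(x, y, t):
    # dy/dt from the Duffing equation written as a first-order system
    return (F / m) * math.cos(w * t) - (k / m) * y - (a / m) * x - (delta / m) * x**3

def heun_step(x, y, t, h):
    # (10.118)-(10.119): Euler predictor k1, k2, then trapezoidal corrector
    k1, k2 = h * f(x, y, t), h * g(x, y, t)
    xn = x + 0.5 * h * (f(x, y, t) + f(x + k1, y + k2, t + h))
    yn = y + 0.5 * h * (g(x, y, t) + g(x + k1, y + k2, t + h))
    return xn, yn

x, y, t, h = 0.0, 0.0, 0.0, 0.005
peak = 0.0
for _ in range(20000):              # integrate out to t = 100
    x, y = heun_step(x, y, t, h)
    t += h
    peak = max(peak, abs(x))        # should stay bounded for a stable simulation
```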
Example 10.12 Recall the Duffing equation simulation in Example 10.10.
Figure 10.17 illustrates the simulation of the Duffing equation using both the
explicit Euler method and Heun's method for a small h (i.e., h = 0.005 in both
cases).
Figure 10.17 Comparison of the explicit Euler (a) and Heun (b) method simulations of
the ODE in Example 10.12, which is the Duffing equation. Here the step size h is small
enough that the two methods give similar results.
At this point we note that there are other ways to display the results of numerical
solutions to ODEs that can lead to further insights into the behavior of the dynamic
system that is modeled by those ODEs. Figure 10.18 illustrates the phase portrait
of the Duffing system. This is obtained by plotting the points (x n ,y n ) on the
Cartesian plane, yielding an approximate plot of (x(t),y(t)). The resulting curve
is the trajectory, or orbit for the system. Periodicity of the system's response is
indicated by curves that encircle a point of equilibrium, which in this case would
be the center of the Cartesian plane [i.e., point (0, 0)]. The trajectory is tending to
an approximately ellipse-shaped closed curve indicative of approximately simple
harmonic motion.
The results in Fig. 10.18 are based on the parameters given in Example 10.10.
However, Fig. 10.19 shows what happens when the system parameters become
F = 0.3,   ω = 1,   m = 1,   a = -1,   δ = 1,   k = 0.22.   (10.120)
The phase portrait displays a more complicated periodicity than what appears in
Fig. 10.18. The figure is similar to Fig. 2.2.5 in Guckenheimer and Holmes [11].
MULTISTEP METHODS FOR ODEs
455
Figure 10.18 (a) The result of applying Heun's method to obtain the numerical solution
of the Duffing system specified in Example 10.10; (b) the phase portrait for the system
obtained by plotting the points (x n , y n ) [from plot (a)] on the Cartesian plane, thus yielding
an approximate plot of (x(t), y(t)).
As in the cases of the explicit and implicit Euler methods, we may obtain a
theory of stability for Heun's method. As before, the approach is to apply the
model problem (10.90) to (10.118) and (10.119). As an exercise, the reader should
show that this yields

x̄_{n+1} = [I + hA + (1/2)h^2 A^2] x̄_n,   (10.121)

where I is the 2 × 2 identity matrix, A is obtained by using (10.94), and, of course,
x̄_n = [x_n y_n]^T. [The similarity between (10.121) and (10.48) is no coincidence.]
Criteria for the selection of step size h leading to a stable simulation can be
obtained by analysis of (10.121). But the details are not considered here.
10.4 MULTISTEP METHODS FOR ODEs
The numerical ODE solvers we have considered so far were either implicit methods
or explicit methods. But in all cases they were examples of so-called single-step
Figure 10.19 (a) The result of applying Heun's method to obtain the numerical solution of the Duffing system specified in Example 10.12 [i.e., using the parameter values in
Eq. (10.120)]; (b) the phase portrait for the system obtained by plotting the points (x_n, y_n)
[from plot (a)] on the Cartesian plane, thus yielding an approximate plot of (x(t), y(t)).
methods; that is, x_{n+1} was ultimately only a function of x_n. A disadvantage of
single-step methods is that to achieve good accuracy often requires the use of
higher-order methods (e.g., fourth- or fifth-order Runge-Kutta). But higher-order
methods need many function evaluations per step, and so are computationally
expensive.
Implicit single-step methods, although inherently stable, are not more accurate
than explicit methods, although they can track fast changes in the solution x(t)
better than can explicit methods (recall Example 10.5). However, implicit methods
may require nonlinear system solvers (i.e., Chapter 7 methods) as part of their
implementation. This is a complication that is also not necessarily very efficient
computationally. Furthermore, the methods in Chapter 7 possess their own stability
problems. Therefore, in this section we introduce multistep predictor-corrector
methods that overcome some of the deficiencies of the methods we have considered
so far.
In this section we return to consideration of a single first-order ODE IVP

x^{(1)}(t) = dx(t)/dt = f(x(t), t),   x_0 = x(0).   (10.122)
In reality, we have already seen a single-step predictor-corrector method in
Section 10.2. Suppose that we have the following method:
x n +\ — x n + hf(x n , t n ) (predictor step) (10.123a)
x„+\ — x n + \h[f{x n +\Jn+i) + f(x„, t„)] (corrector step). (10.123b)
If we substitute (10.123a) into (10.123b), we again arrive at Heun's method
[Eq. (10.47)], which overcame the necessity to solve for x_{n+1} in the implicit method
of Eq. (10.59) (trapezoidal method). Generally, predictor-corrector methods replace
implicit methods in this manner, and we will see more examples further on in
this section. When a higher-order implicit method is "converted" to a predictor-
corrector method, the need to solve nonlinear equations is eliminated and accuracy
is preserved, but the stability characteristics of the implicit method will be lost, at
least to some extent. Of course, a suitable stability theory will still allow the user
to select reasonable values for the step size parameter h.
We may now consider a few simple examples of multistep methods. Perhaps
the simplest multistep methods derive from the numerical differentiation ideas from
Section 9.6 (of Chapter 9). Recall (9.138), for which

    x^(1)(t) = (1/(2h))[x(t + h) - x(t - h)] - (1/6)h^2 x^(3)(ξ)    (10.124)

for some ξ ∈ [t - h, t + h]. The explicit Euler method (10.22) can be replaced with the midpoint method derived from [using t = t_n in (10.124)]

    f(x(t_n), t_n) ≈ (1/(2h))[x(t_{n+1}) - x(t_{n-1})],
so this method is
    x_{n+1} = x_{n-1} + 2h f(x_n, t_n).    (10.125)
This is an explicit method, but x_{n+1} depends on x_{n-1} as well as x_n. We may call it a two-step method. Similarly, via (9.153),

    f(x(t_n), t_n) ≈ (1/(2h))[-3x(t_n) + 4x(t_{n+1}) - x(t_{n+2})],
so we have the method
    x_{n+2} = 4x_{n+1} - 3x_n - 2h f(x_n, t_n),
which can be rewritten as
    x_{n+1} = 4x_n - 3x_{n-1} - 2h f(x_{n-1}, t_{n-1}).    (10.126)
Finally, via (9.154), we obtain

    f(x(t_n), t_n) ≈ (1/(2h))[x(t_{n-2}) - 4x(t_{n-1}) + 3x(t_n)],

yielding the method

    x_{n+1} = (4/3)x_n - (1/3)x_{n-1} + (2/3)h f(x_{n+1}, t_{n+1}).    (10.127)
Method (10.126) is a two-step method that is explicit, but (10.127) is a two-step implicit method since we need to solve for x_{n+1}.
A problem with multistep methods is that the IVP (10.122) provides only one initial condition x_0. But for n = 1 in any of (10.125), (10.126), or (10.127), we need to know x_1; that is, for two-step methods we need two initial conditions, or
starting values. A simple way out of this dilemma is to use single-step methods to
provide any missing starting values. In fact, predictor-corrector methods derived
from single-step concepts (e.g., Runge-Kutta methods) are often used to provide
the starting values for multistep methods.
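To make the idea concrete, here is a minimal sketch (in Python rather than the book's MATLAB; the function names and the test problem dx/dt = -x are our own) of the midpoint method (10.125) with one step of Heun's method supplying the missing starting value x_1:

```python
import math

def heun_step(f, x, t, h):
    # One step of Heun's method: Euler predictor, trapezoidal corrector.
    xp = x + h * f(x, t)
    return x + 0.5 * h * (f(x, t) + f(xp, t + h))

def midpoint_method(f, x0, t0, h, n_steps):
    # Two-step midpoint method x_{n+1} = x_{n-1} + 2 h f(x_n, t_n);
    # Heun's method provides the second starting value x_1.
    xs = [x0, heun_step(f, x0, t0, h)]
    for n in range(1, n_steps):
        xs.append(xs[n - 1] + 2.0 * h * f(xs[n], t0 + n * h))
    return xs

# Test problem dx/dt = -x, x(0) = 1; exact solution x(t) = e^{-t}.
xs = midpoint_method(lambda x, t: -x, 1.0, 0.0, 0.01, 100)
err = abs(xs[-1] - math.exp(-1.0))
```

Each pass through the loop needs only the single new evaluation f(x_n, t_n), which is the efficiency advantage discussed in the text.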
What about multistep method accuracy? Let us consider the midpoint method again. From (10.124), for some ξ_n ∈ [t_{n-1}, t_{n+1}],

    x(t_{n+1}) = x(t_{n-1}) + 2h f(x(t_n), t_n) + (1/3)h^3 x^(3)(ξ_n),    (10.128)

so the method (10.125) has a truncation error per step that is O(h^3). The midpoint
method is therefore more accurate than the explicit Euler method [recall (10.43)
and (10.44)]. Yet we see that both methods need only one function evaluation
per step. Thus, the midpoint method is more efficient than the Euler method. We
recall that Heun's method (a Runge-Kutta method) has a truncation error per
step that is O(h^3), too [recall (10.56)], and so Heun's method may be used to
initialize (i.e., provide starting values for) the midpoint method (10.125). This
specific situation holds up in general. Thus, a multistep method can often achieve
comparable accuracy to single-step methods, and yet use fewer function calls,
leading to reduced computational effort.
What about stability considerations? Let us continue with our midpoint method
example. If we apply the model problem (10.28) to (10.125), we have the difference
equation
    x_{n+1} = x_{n-1} + 2hλx_n,    (10.129)

which is a second-order difference equation. This has the characteristic equation^7

    z^2 - 2hλz - 1 = 0.    (10.130)

We may rewrite (10.129) as

    x_{n+2} - 2hλx_{n+1} - x_n = 0,

which has the z-transform

    (z^2 - 2hλz - 1)X(z) = 0.
For convenience, let ρ = hλ, in which case the roots of (10.130) are easily seen to be

    z_1 = ρ + sqrt(ρ^2 + 1),    z_2 = ρ - sqrt(ρ^2 + 1).    (10.131)

A general solution to (10.129) will have the form

    x_n = c_1 z_1^n + c_2 z_2^n    (10.132)

for n ∈ Z+. Knowledge of x_0 and x_1 allows us to solve for the constants c_1 and c_2 in (10.132), if this is desired. More importantly, however, we recall that we assume λ < 0, and we seek a step size h > 0 such that lim_{n→∞} |x_n| ≠ ∞. But in this case ρ = hλ < 0, and hence from (10.131), |z_2| > 1 for all h > 0. If c_2 ≠ 0 in (10.132) (which is practically always the case), then we will have lim_{n→∞} |x_n| = ∞! Thus, the midpoint method is inherently unstable under all realistic conditions! The term c_1 z_1^n in (10.132) is "harmless" since |z_1| < 1 for suitable h. But the term c_2 z_2^n, often called a parasitic term, will eventually "blow up" with increasing n, thus fatally corrupting the approximation to x(t).
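The parasitic blowup is easy to exhibit numerically. The following sketch (Python; the values λ = -10 and h = 0.1 are our own illustrative choices) runs the midpoint iteration (10.129) on the model problem with exact starting values, so c_2 is small but nonzero; |x_n| nevertheless grows without bound even though the true solution decays to zero:

```python
import math

# Midpoint iteration x_{n+1} = x_{n-1} + 2*h*lam*x_n for dx/dt = lam*x, lam < 0.
lam, h = -10.0, 0.1
x_prev, x = 1.0, math.exp(lam * h)   # exact starting values x_0 and x_1
mags = []
for n in range(1, 200):
    x_prev, x = x, x_prev + 2.0 * h * lam * x
    mags.append(abs(x))
# The exact solution at t = 200*h = 20 is e^{-200} (essentially zero), yet
# |x_n| grows roughly like |z_2|^n, where |z_2| = |rho - sqrt(rho^2 + 1)| > 1.
```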
Unfortunately, parasitic terms are inherent in multistep methods. However, there
are more advanced methods with stability theories designed to minimize the effects
of the parasitics. We now consider a few of these improved multistep methods.
10.4.1 Adams-Bashforth Methods
Here we look at the Adams-Bashforth (AB) family of multistep ODE IVP solvers.
Section 10.4.2 will look at the Adams-Moulton (AM) family. Our approach follows
Epperson [12, Section 6.6]. Both families are derived using Lagrange interpolation
[recall Section 6.2 from Chapter 6 (above)].
Recall (10.122), which we may integrate to obtain (t_n = t_0 + nh)

    x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x(t), t) dt.    (10.133)

Now suppose that we had the samples x(t_{n-k}) for k = 0, 1, ..., m (i.e., m + 1 samples of the exact solution x(t)). Via Lagrange interpolation theory, we may interpolate F(t) = f(x(t), t) [the integrand of (10.133)] for t ∈ [t_{n-m}, t_{n+1}]^8 according to

    p_m(t) = Σ_{k=0}^{m} L_k(t) f(x(t_{n-k}), t_{n-k}),    (10.134)
^7 A solution to (10.129) of the form x_n = z^n exists only if z^2 - 2hλz - 1 = 0. If the reader has not had a signals and systems course (or equivalent), then this reasoning must be accepted "on faith." But it may help to observe that the reasoning is similar to the theory of solution for linear ODEs in constant coefficients.
^8 The upper limit on the interval [t_{n-m}, t_n] has been extended from t_n to t_{n+1} here. This is allowed under interpolation theory, and actually poses no great problem in either method development or error analysis. We are using the Lagrange interpolant to extrapolate from t = t_n to t_{n+1}.
TLFeBOOK
460 NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS
where
m
£*(*) = [I t ~ t l~ l ■ (10.135)
i-0
ijtk
From (6.14) for some % t e [t n - m , t n+ \]
1 m
F(t) = Pm (t) + ———F (m+l) ^ t )V[(t - tn-i).
(m + 1)! jj^
However, F(t) — f(x(t), t) — x m (t) so (10.136) becomes
.. m
F{t) = p m (t) + . X {m+2) %) Y\(f - tn-i).
(m + 1)! ll
1=0
Thus, if we now substitute (10.137) into (10.133), we obtain
dn+l
f(x(t„-k), t n -k) I
A — -- i J
where
— X v "^>(^t)
nt
(=0
(10.136)
(10.137)
c(t n+l )=x(t n ) + 2_^f(x(t n - k ),t n - k ) L k (t)dt + R m (t„+i), (10.138)
*mfe+l) = / — X {m+2 \Ht) Y\(f - tn-i) dt. (10.139)
Jt„ (m + 1)! H;
Polynomial jt(t) does not change sign for t e [t n ,t n +i] (which is the interval of
integration). Thus, we can say that there is a % n e [t n , t n +i] such that
f'n+l
Jt„
R m (t n+ i) = - ] — x {m+2 \Hn) I n(i)dt. (10.140)
(m + 1)!
For convenience, define

    ρ_m = (1/(m+1)!) ∫_{t_n}^{t_{n+1}} (t - t_n)(t - t_{n-1}) ··· (t - t_{n-m}) dt    (10.141)

and

    λ_k = ∫_{t_n}^{t_{n+1}} L_k(t) dt.    (10.142)

Thus, (10.138) reduces to [with R_m(t_{n+1}) = ρ_m x^(m+2)(ξ_n)]

    x(t_{n+1}) = x(t_n) + Σ_{k=0}^{m} λ_k f(x(t_{n-k}), t_{n-k}) + ρ_m x^(m+2)(ξ_n)    (10.143)
TABLE 10.1 Adams-Bashforth Method Parameters

    m    λ_0          λ_1           λ_2          λ_3          R_m(t_{n+1})
    0    h                                                    (1/2)h^2 x^(2)(ξ_n)
    1    (3/2)h      -(1/2)h                                  (5/12)h^3 x^(3)(ξ_n)
    2    (23/12)h    -(16/12)h      (5/12)h                   (3/8)h^4 x^(4)(ξ_n)
    3    (55/24)h    -(59/24)h      (37/24)h    -(9/24)h      (251/720)h^5 x^(5)(ξ_n)
for some ξ_n ∈ [t_n, t_{n+1}]. The order m + 1 Adams-Bashforth method is therefore defined to be

    x_{n+1} = x_n + Σ_{k=0}^{m} λ_k f(x_{n-k}, t_{n-k}).    (10.144)

It is an explicit method involving m + 1 steps. Table 10.1 summarizes the method
parameters for various m, and is essentially Table 6.6 from Ref. 12.
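A sketch of (10.144) in Python (the book's own examples use MATLAB; the helper names and the test problem dx/dt = -x are ours), with the λ_k/h values taken from Table 10.1 and the starting values supplied from the exact solution for simplicity:

```python
import math

# lambda_k / h for the order-(m+1) Adams-Bashforth methods (Table 10.1).
AB_COEFFS = {
    1: [1.0],
    2: [3 / 2, -1 / 2],
    3: [23 / 12, -16 / 12, 5 / 12],
    4: [55 / 24, -59 / 24, 37 / 24, -9 / 24],
}

def adams_bashforth(f, x0, t0, h, n_steps, order, starter):
    # Explicit multistep method (10.144); the early values x_1, ..., x_{order-1}
    # come from `starter` (the exact solution here; a Runge-Kutta start is typical).
    c = AB_COEFFS[order]
    xs = [x0] + [starter(t0 + k * h) for k in range(1, order)]
    for n in range(order - 1, n_steps):
        t_n = t0 + n * h
        xs.append(xs[n] + h * sum(ck * f(xs[n - k], t_n - k * h)
                                  for k, ck in enumerate(c)))
    return xs

# dx/dt = -x, x(0) = 1; exact solution e^{-t}.
xs = adams_bashforth(lambda x, t: -x, 1.0, 0.0, 0.01, 100,
                     order=2, starter=lambda t: math.exp(-t))
err = abs(xs[-1] - math.exp(-1.0))
```

Note that each step of the loop requires only the one new function evaluation f(x_n, t_n); the older evaluations could be cached rather than recomputed.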
10.4.2 Adams-Moulton Methods
The Adams-Moulton methods are a modification of the Adams-Bashforth methods. The Adams-Bashforth methods interpolate using the nodes t_n, t_{n-1}, ..., t_{n-m}. On the other hand, the Adams-Moulton methods interpolate using the nodes t_{n+1}, t_n, ..., t_{n-m+1}. Note that the number of nodes is the same in both methods.
Consequently, (10.138) becomes

    x(t_{n+1}) = x(t_n) + Σ_{k=-1}^{m-1} f(x(t_{n-k}), t_{n-k}) ∫_{t_n}^{t_{n+1}} L_k(t) dt + R_m(t_{n+1}),    (10.145)

where now

    L_k(t) = Π_{i=-1, i≠k}^{m-1} (t - t_{n-i}) / (t_{n-k} - t_{n-i})    (10.146)

and R_m(t_{n+1}) = ρ_m x^(m+2)(ξ_n) with

    ρ_m = (1/(m+1)!) ∫_{t_n}^{t_{n+1}} (t - t_{n+1})(t - t_n) ··· (t - t_{n-m+1}) dt.    (10.147)
Thus, the order m + 1 Adams-Moulton method is defined to be

    x_{n+1} = x_n + Σ_{k=-1}^{m-1} λ_k f(x_{n-k}, t_{n-k}),    (10.148)

where

    λ_k = ∫_{t_n}^{t_{n+1}} L_k(t) dt    (10.149)

[the same as (10.142), except that k = -1, 0, 1, ..., m - 1, and L_k(t) is now (10.146)]. Method (10.148) is an implicit method since it is necessary to solve for x_{n+1}. It also requires m + 1 steps. Table 10.2 summarizes the method parameters for various m, and is essentially Table 6.7 from Ref. 12.
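Because (10.148) is implicit, each step requires a nonlinear solve. For the m = 1 (trapezoidal) member, simple fixed-point iteration converges whenever (h/2)|∂f/∂x| < 1; the sketch below (Python; the names and the test problem are our own) illustrates one such step, seeded by an explicit Euler predictor:

```python
import math

def am2_step(f, x_n, t_n, h, tol=1e-12, max_iter=50):
    # Second-order Adams-Moulton (trapezoidal) step: solve
    #   x = x_n + (h/2) [ f(x, t_n + h) + f(x_n, t_n) ]
    # for x by fixed-point iteration (a Chapter 7-style nonlinear solve).
    t_next = t_n + h
    g = lambda z: x_n + 0.5 * h * (f(z, t_next) + f(x_n, t_n))
    x = x_n + h * f(x_n, t_n)   # explicit Euler predictor as the initial guess
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x_new

# dx/dt = -x, x(0) = 1: march to t = 1 and compare with e^{-1}.
x, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    x = am2_step(lambda z, s: -z, x, t, h)
    t += h
err = abs(x - math.exp(-1.0))
```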
10.4.3 Comments on the Adams Families
For small values of m in Tables 10.1 and 10.2, we see that the Adams families (the AB family and the AM family) correspond to methods seen earlier. To be specific:

1. For m = 0 in Table 10.1, λ_0 = h, so (10.144) yields the explicit Euler method (10.22).
2. For m = 0 in Table 10.2, λ_{-1} = h, so (10.148) yields the implicit Euler method (10.35).
3. For m = 1 in Table 10.2, λ_{-1} = λ_0 = (1/2)h, so (10.148) yields the trapezoidal method (10.59).
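These coefficients can be confirmed numerically, since each λ_k is just the integral of a Lagrange basis polynomial over [t_n, t_{n+1}] [recall (10.142) and (10.149)]. The sketch below (Python; the helper names are our own) takes h = 1 and recovers λ_0 = (3/2)h, λ_1 = -(1/2)h for the m = 1 AB row, and the trapezoidal weights λ_{-1} = λ_0 = (1/2)h for the m = 1 AM row:

```python
def lagrange_basis(nodes, k):
    # L_k(t) = prod_{i != k} (t - t_i) / (t_k - t_i)
    def L(t):
        v = 1.0
        for i, ti in enumerate(nodes):
            if i != k:
                v *= (t - ti) / (nodes[k] - ti)
        return v
    return L

def integrate(g, a, b, n=2000):
    # Composite trapezoidal quadrature (exact here: the m = 1 bases are linear).
    w = (b - a) / n
    s = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        s += g(a + i * w)
    return s * w

# Adams-Bashforth, m = 1: nodes t_n = 0, t_{n-1} = -1 (h = 1); integrate on [0, 1].
ab = [0.0, -1.0]
lam0 = integrate(lagrange_basis(ab, 0), 0.0, 1.0)    # expect  3/2
lam1 = integrate(lagrange_basis(ab, 1), 0.0, 1.0)    # expect -1/2

# Adams-Moulton, m = 1: nodes t_{n+1} = 1, t_n = 0.
am = [1.0, 0.0]
lam_m1 = integrate(lagrange_basis(am, 0), 0.0, 1.0)  # expect 1/2
lam_0 = integrate(lagrange_basis(am, 1), 0.0, 1.0)   # expect 1/2
```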
Stability analysis for members of the Adams families is performed in the usual manner. For example, when m = 1 in Table 10.1 (i.e., consider the second-order AB method), Eq. (10.144) becomes

    x_{n+1} = x_n + (1/2)h[3f(x_n, t_n) - f(x_{n-1}, t_{n-1})].    (10.150)
TABLE 10.2 Adams-Moulton Method Parameters

    m    λ_{-1}       λ_0           λ_1          λ_2         R_m(t_{n+1})
    0    h                                                   -(1/2)h^2 x^(2)(ξ_n)
    1    (1/2)h       (1/2)h                                 -(1/12)h^3 x^(3)(ξ_n)
    2    (5/12)h      (8/12)h      -(1/12)h                  -(1/24)h^4 x^(4)(ξ_n)
    3    (9/24)h      (19/24)h     -(5/24)h     (1/24)h      -(19/720)h^5 x^(5)(ξ_n)
Application of the model problem (10.28) to (10.150) yields

    x_{n+1} = (1 + (3/2)hλ)x_n - (1/2)hλ x_{n-1},

or

    x_{n+2} - (1 + (3/2)hλ)x_{n+1} + (1/2)hλ x_n = 0.

This has the characteristic equation (with ρ = hλ)

    z^2 - (1 + (3/2)ρ)z + (1/2)ρ = 0.    (10.151)

This equation has the roots

    z_1 = (1/2)[(1 + (3/2)ρ) + sqrt(1 + ρ + (9/4)ρ^2)],
    z_2 = (1/2)[(1 + (3/2)ρ) - sqrt(1 + ρ + (9/4)ρ^2)].    (10.152)
We need to know what range of h > 0 yields |z_1|, |z_2| < 1. We consider only ρ < 0, since λ < 0. Figure 10.20 plots |z_1| and |z_2| versus ρ, and suggests that we may select h such that

    -1 < hλ < 0.    (10.153)
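The suggestion made by Fig. 10.20 can be checked directly from (10.151). The sketch below (Python; the sampling grid is our own choice) evaluates both root magnitudes over ρ ∈ (-1, 0), and confirms that |z_2| reaches 1 at the boundary ρ = -1:

```python
import cmath

def ab2_roots(rho):
    # Roots of the characteristic equation z^2 - (1 + (3/2)rho) z + (1/2)rho = 0.
    b = -(1.0 + 1.5 * rho)
    c = 0.5 * rho
    d = cmath.sqrt(b * b - 4.0 * c)
    return (-b + d) / 2.0, (-b - d) / 2.0

# Sample rho = h*lambda across (-1, 0): both root magnitudes stay below 1.
stable = all(max(abs(z) for z in ab2_roots(-0.999 + 0.01 * i)) < 1.0
             for i in range(100))
```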
Plots of stability regions for the other Adams family members may be seen in Figs. 6.7 and 6.8 of Epperson [12]. Note that stability regions occupy the complex plane, as it is assumed in such a context that λ ∈ C. However, we have restricted
Figure 10.20 Magnitudes of the roots in (10.152) as a function of ρ = hλ.
our attention to a first-order ODE IVP here, and so it is actually enough to assume that λ ∈ R.
Finally, observe that AB and AM methods can be combined to yield predictor-
correctors. An mth-order AB method can act as a predictor for an mth-order
AM method that is the corrector. A Runge-Kutta method can initialize the
procedure.
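A sketch of such a combination in Python (second-order AB predictor with the trapezoidal AM corrector; the helper names and the test problem are our own, and the exact solution stands in for the Runge-Kutta starting value):

```python
import math

def abm2_step(f, xs, ts, h):
    # Predictor: second-order Adams-Bashforth (explicit).
    xp = xs[-1] + 0.5 * h * (3.0 * f(xs[-1], ts[-1]) - f(xs[-2], ts[-2]))
    # Corrector: second-order Adams-Moulton (trapezoidal rule), with the
    # predicted value standing in for x_{n+1} on the right-hand side.
    t_next = ts[-1] + h
    return xs[-1] + 0.5 * h * (f(xp, t_next) + f(xs[-1], ts[-1]))

# dx/dt = -x, x(0) = 1; exact value e^{-h} used as the second starting value.
f = lambda x, t: -x
h = 0.01
xs, ts = [1.0, math.exp(-h)], [0.0, h]
for _ in range(99):
    xs.append(abm2_step(f, xs, ts, h))
    ts.append(ts[-1] + h)
err = abs(xs[-1] - math.exp(-1.0))
```

The corrector needs no nonlinear solve, which is the point of replacing the implicit method with a predictor-corrector.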
10.5 VARIABLE-STEP-SIZE (ADAPTIVE) METHODS FOR ODEs
Accuracy in the numerical solution of ODEs requires either increasing the order
of the method applied to the problem or decreasing the step-size parameter h.
However, high-order methods (e.g., Runge-Kutta methods of order exceeding 5)
are not very attractive at least in part because of the computational effort involved.
To preserve accuracy while reducing computational requirements suggests that we
should adaptively vary step size h.
Recall Example 10.5, where we saw that low-order methods were not accurate near t = t_0. For a method of a given order, we would like in Example 10.5 to have a small h for t near t_0, but a larger h for t away from t_0. This would reduce the overall number of function evaluations needed to estimate x(t) for t ∈ [t_0, t_f].
The idea of adaptively varying h from step to step requires monitoring the error in the solution somehow; that is, ideally, we need to infer e_n = |x(t_n) - x_n| [x_n is the estimate of x(t) at t = t_n from some method] at step n. If e_n is small enough, h may be increased in size at the next step, but if e_n is too big, we decrease h.
Of course, we do not know x(t_n), so we do not have direct access to the error e_n. However, one idea that is implemented in modern software tools (e.g., MATLAB routines ode23 and ode45) is to compute x_n for a given h using two methods, each of a different order. The method of higher order is of greater accuracy, so if x_n does not differ much between the methods, we are led to believe that h is small enough, and so it may be increased in the next step. On the other hand, if the x_n values given by the different methods differ significantly, we are then led to believe that h is too big, and so it should be reduced.
In this section we give only a basic outline of the main ideas of this process.
Our emphasis is on the Runge-Kutta-Fehlberg (RKF) methods, of which MAT-
LAB routines ode23 and ode45 are particular implementations. Routine ode23
implements second- and third-order Runge-Kutta methods, while ode45 imple-
ments fourth- and fifth-order Runge-Kutta methods. Computational efficiency is
maintained by sharing intermediate results that are common to both second- and
third-order methods and common to both fourth- and fifth-order methods. More
specifically, Runge-Kutta methods of consecutive orders have constants such as k_j [recall (10.67)] in common with each other, and so these need not be computed twice.
We mention that ode45 implements a method based on Dormand and Prince [14],
and a more detailed account of this appears in Epperson [12]. An analysis of the
RKF methods also appears in Burden and Faires [17]. The details of all of this are
quite tedious, and so are not presented here. It is also worth noting that an account
of MATLAB ODE solvers is given by Shampine and Reichelt [13], who present
some improvements to the older MATLAB codes that make them better at solving
stiff systems (next section).
A pseudocode for something like ode45 is as follows, and is based on Algo-
rithm 6.5 in Epperson [12]:
Input t_0, x_0; { initial condition and starting time }
Input tolerance ε > 0;
Input the initial step size h > 0, and final time t_f > t_0;
n := 0;
while t_n < t_f do begin
   X_1 := RKF4(x_n, t_n, h); { 4th-order RKF estimate of x_{n+1} }
   X_2 := RKF5(x_n, t_n, h); { 5th-order RKF estimate of x_{n+1} }
   E := |X_1 - X_2|;
   if (1/2)hε <= E <= hε then begin { h is OK }
      x_{n+1} := X_2;
      t_{n+1} := t_n + h;
      n := n + 1;
   end
   else if E > hε then { h is too big }
      h := h/2; { reduce h and repeat }
   else begin { h is too small }
      x_{n+1} := X_2;
      t_{n+1} := t_n + h;
      n := n + 1;
      h := 2h; { increase h for the next step }
   end;
end;
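The same control logic can be sketched with a cheaper embedded pair — explicit Euler (first order) against Heun (second order), which share the evaluation f(x_n, t_n). This Python sketch is our own simplification (it is not the RKF4/RKF5 pair of the pseudocode), but it accepts the higher-order estimate and adjusts h in exactly the same way:

```python
import math

def adaptive_euler_heun(f, x0, t0, tf, h, eps):
    # Compare a 1st-order (explicit Euler) and a 2nd-order (Heun) estimate of
    # x_{n+1}; accept the higher-order one, halving h when the discrepancy E
    # is too large and doubling it when there is accuracy to spare.
    t, x = t0, x0
    ts, xs = [t0], [x0]
    while tf - t > 1e-12:
        h = min(h, tf - t)
        f_n = f(x, t)
        x1 = x + h * f_n                          # explicit Euler (order 1)
        x2 = x + 0.5 * h * (f_n + f(x1, t + h))   # Heun (order 2)
        E = abs(x1 - x2)
        if E > h * eps:
            h *= 0.5          # h is too big: reduce it and repeat the step
            continue
        t, x = t + h, x2      # accept the higher-order estimate
        ts.append(t)
        xs.append(x)
        if E < 0.5 * h * eps:
            h *= 2.0          # h is too small: allow a larger next step
    return ts, xs

ts, xs = adaptive_euler_heun(lambda x, t: -x, 1.0, 0.0, 1.0, 0.1, 1e-4)
err = abs(xs[-1] - math.exp(-1.0))
```

As the text notes, a scheme like this tends to oscillate between halving and doubling h, which is one reason production codes use smoother step-size update rules.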
Of course, variations on the "theme" expressed in this pseudocode are possible. As
noted in Epperson [12], a drawback of this algorithm is that it will tend to oscillate
between small and large step size values. We emphasize that the method is based on
considering the local error in going from time step t_n to t_{n+1}. However, this does not in itself guarantee that the global error |x(t_n) - x_n| is small. It turns out that if adequate smoothness prevails [e.g., if f(x, t) is Lipschitz as per Definition 10.1],
then small local errors do imply small global errors (see Theorem 6.6 or 6.7 in
Ref. 12).
Example 10.13 This example illustrates a typical application of MATLAB
routine ode23 to the problem of simulating the Colpitts oscillator circuit of Example
10.2.
Figure 10.21 shows a typical plot of v_CE(t), and the phase portrait, for the parameter values

    V_TH = 0.75 V (volts), V_CC = 5 V, V_EE = -5 V, R_EE = 400 Ω (ohms),
    R_L = 35 Ω, L = 98.5 × 10^{-6} H (henries), β_F = 200, R_ON = 100 Ω,
    C_1 = C_2 = 54 × 10^{-9} F (farads).    (10.154)
Figure 10.21 Chaotic regime: (a) phase portrait of the Colpitts oscillator and (b) collector-emitter voltage, showing typical results of applying MATLAB routine ode23 to the simulation of the Colpitts oscillator circuit. Equation (10.154) specifies the circuit parameters for the results shown here.
These circuit parameters were used in Kennedy [9], and the phase portrait in
Fig. 10.21 is essentially that in Fig. 5 of that article [9]. The MATLAB code that
generates Fig. 10.21 appears in Appendix 10. B.
For the parameters in (10.154) the circuit simulation phase portrait in Fig. 10.21
is that of a strange attractor [11, 16], and so is strongly indicative (although not
conclusive) of chaotic dynamics in the circuit.
We note that under "normal" circumstances the Colpitts oscillator is intended to
generate sinusoidal waveforms, and so the chaotic regime traditionally represents
a failure mode, or abnormal operating condition for the circuit. However, Kennedy
[9] suggests that the chaotic mode of operation may be useful in chaos-based data
communications (e.g., chaotic-carrier communications).
The following circuit parameters lead to approximately sinusoidal circuit outputs:

    V_TH = 0.75 V, V_CC = 5 V, V_EE = -5 V, R_EE = 100 Ω,
    R_L = 200 Ω, L = 100 × 10^{-6} H, β_F = 80, R_ON = 115 Ω,
    C_1 = 45 × 10^{-9} F, C_2 = 58 × 10^{-9} F.    (10.155)
Figure 10.22 Sinusoidal operation: (a) phase portrait of the Colpitts oscillator and (b) collector-emitter voltage, showing typical results of applying MATLAB routine ode23 to the simulation of the Colpitts oscillator circuit. Equation (10.155) specifies the circuit parameters for the results shown here.
Figure 10.22 shows the phase portrait for the oscillator using these parameter values. We see that v_CE(t) is much more sinusoidal than in Fig. 10.21. The trajectory
in the phase portrait of Fig. 10.22 is tending to an elliptical closed curve indicative
of simple harmonic (i.e., sinusoidal) oscillation.
10.6 STIFF SYSTEMS
Consider the general system of coupled first-order ODEs

    dx_0(t)/dt = f_0(x_0, x_1, ..., x_{m-1}, t),
    dx_1(t)/dt = f_1(x_0, x_1, ..., x_{m-1}, t),
        ...
    dx_{m-1}(t)/dt = f_{m-1}(x_0, x_1, ..., x_{m-1}, t),    (10.156)
which we wish to solve for t ≥ 0 given x(0), where x(t) = [x_0(t) x_1(t) ··· x_{m-1}(t)]^T. If we also define f(x(t), t) = [f_0(x(t), t) f_1(x(t), t) ··· f_{m-1}(x(t), t)]^T, then we may express (10.156) in compact vector form as

    dx(t)/dt = f(x(t), t).    (10.157)
We have so far described a general order m ODE IVP.
If we now wish to consider the stability of a numerical method applied to the solution of (10.156) [or (10.157)], then we need to consider the model problem

    dx(t)/dt = Ax(t),    (10.158)
where

    A = [ ∂f_0/∂x_0        ∂f_0/∂x_1        ···   ∂f_0/∂x_{m-1}
          ∂f_1/∂x_0        ∂f_1/∂x_1        ···   ∂f_1/∂x_{m-1}
             ⋮                ⋮                       ⋮
          ∂f_{m-1}/∂x_0    ∂f_{m-1}/∂x_1    ···   ∂f_{m-1}/∂x_{m-1} ],    (10.159)

with every partial derivative evaluated at (x(0), 0)
[which generalizes A ∈ R^{2×2} in (10.94)]. The solution to (10.158) is given by

    x(t) = e^{At} x(0)    (10.160)

[recall (10.92)].^9 Ensuring the stability of the order m linear ODE system in (10.158) requires all the eigenvalues λ_k of A to have negative real parts (i.e., Re[λ_k] < 0 for all k = 0, 1, ..., m - 1, where λ_k is the kth eigenvalue of A).
Of course, if m = 1, then with x(t) = x_0(t), (10.158) reduces to

    dx(t)/dt = λx(t),    (10.161)
which is the model problem (10.28) again. Recall once again Example 10.5, for which we found that

    λ = ∂f(x_0, t_0)/∂x = -2/t_0,
^9 Note that if we know x(t_0) (any t_0 ∈ R), then we may slightly generalize our linear problem (10.158) to determining x(t) for all t ≥ t_0, in which case

    x(t) = e^{A(t - t_0)} x(t_0)

replaces (10.160). However, little is lost by assuming t_0 = 0.
as (10.41) is the solution to dx/dt = t^2 - 2x/t for t ≥ t_0 > 0. If t_0 is small, then |λ| is large, and we saw that numerical methods, especially low-order explicit ones, had difficulty in estimating x(t) accurately when t was near t_0. If we recall (10.33) (which stated that h < 2/|λ|) as an example, we see that large negative values for λ force us to select small step sizes h to ensure stability of the explicit Euler method. Since λ is an eigenvalue of A = [λ] in (10.161), we might expect that this generalizes. In other words, a numerical method can be expected to have
In a situation like this x(t) in (10.156) has (it seems) a solution that changes so
rapidly for some time intervals (e.g., fast startup transients) that accurate numerical
solutions are hard to achieve. Such systems are called stiff systems.
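The scalar essence of stiffness can be seen in a few lines (Python; λ = -100 and h = 0.05 are our own illustrative values): explicit Euler violates h < 2/|λ| and diverges, while implicit Euler decays for every h > 0:

```python
# Stiff scalar model: dx/dt = -100 x, x(0) = 1. With h = 0.05 the explicit
# Euler iterate x_{n+1} = (1 + h*lam) x_n has |1 + h*lam| = 4, so it diverges,
# while implicit Euler, x_{n+1} = x_n / (1 - h*lam), decays for every h > 0.
lam, h, n_steps = -100.0, 0.05, 40
x_exp, x_imp = 1.0, 1.0
for _ in range(n_steps):
    x_exp = (1.0 + h * lam) * x_exp
    x_imp = x_imp / (1.0 - h * lam)
# The true solution at t = 2 is e^{-200}, i.e., essentially zero.
```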
So far our definition of a stiff system has not been at all rigorous. Indeed,
a rigorous definition is hard to come by. Higham and Trefethen [15] argue that
looking at the eigenvalues of A alone is not enough to decide on the stiffness of
(10.156) in a completely reliable manner. It is possible, for example, that A may
have favorable eigenvalues and yet (10.156) may still be stiff.
Stiff systems will not be discussed further here except to note that implicit
methods, or higher-order predictor-corrector methods, should be used for their
solution. The paper by Higham and Trefethen [15] is highly recommended reading
for those readers seriously interested in the problems posed by stiff systems.
10.7 FINAL REMARKS
In the numerical solution (i.e., simulation) of ordinary differential equations (ODEs),
two issues are of primary importance: accuracy and stability. The successful sim-
ulation of any system requires proper attention to both of these issues.
Computational efficiency is also an issue. Generally, we prefer to use the largest
possible step size consistent with required accuracy, and as such to avoid any
instability in the simulation.
APPENDIX 10.A MATLAB CODE FOR EXAMPLE 10.8
% f23.m
% This defines the function f(x,t) in the differential equation for
% Example 10.8 (in Section 10.2).
function y = f23(x,t)
y = t*t - (2*x/t);
% Runge.m
% This routine simulates Heun's and the 4th-order Runge-Kutta methods as
% applied to the differential equation in Example 10.8 (Sect. 10.2), so this
% routine requires function f23.m. It therefore generates Fig. 10.9.
function Runge
t0 = .05; % initial time (starting time)
x0 = 1.0; % initial condition (x(t0))
% Exact solution x(t)
c = (5*t0*t0*x0) - (t0^5);
te = [t0:.02:1.5];
for k = 1:length(te)
   xe(k) = (te(k)*te(k)*te(k))/5 + c/(5*te(k)*te(k));
end;
h = .05;
% Heun's method simulation
xh(1) = x0;
th(1) = t0;
for n = 1:25
   fn = th(n)*th(n) - (2*xh(n)/th(n)); % f(x_n,t_n)
   th(n+1) = th(n) + h;
   xn1 = xh(n) + h*fn;
   tn1 = th(n+1);
   fn1 = tn1*tn1 - (2*xn1/tn1); % f(x_{n+1},t_{n+1}) (approx.)
   xh(n+1) = xh(n) + (h/2)*(fn + fn1);
end;
% 4th order Runge-Kutta simulation
xr(1) = x0;
tr(1) = t0;
for n = 1 :25
t = tr(n);
x = xr(n) ;
k1 = f23(x,t);
k2 = f23(x + .5*h*k1,t + .5*h);
k3 = f23(x + .5*h*k2,t + .5*h);
k4 = f23(x + h*k3,t + h);
xr(n+1) = xr(n) + (h/6)*(k1 + 2*k2 + 2*k3 + k4) ;
tr(n+1) = tr(n) + h;
end;
plot(te,xe,'-',tr,xr,'--o',th,xh,'--+'), grid
legend('x(t)','4th Order Runge-Kutta (h = .05)','Heun (h = .05)',1);
xlabel(' Time (t) ')
ylabel(' Amplitude ')
APPENDIX 10.B MATLAB CODE FOR EXAMPLE 10.13
%
% fR.m
%
% This is Equation (10.15) of Chapter 10 pertaining to Example 10.2.
function i = fR(v)
VTH = 0.75; % Threshold voltage in volts
RON = 100; % On resistance of NPN BJT Q in Ohms
if v <= VTH
i = 0;
else
i = (v-VTH)/RON;
end;
% vCC.m
% Supply voltage function v_CC(t) for Example 10.2 of Chapter 10.
% Here v_CC(t) = V_CC u(t) (i.e., oscillator switches on at t = 0) .
function v = vCC(t)
VCC = 5;
if t < 0
v = 0;
else
v = VCC;
end;
% Colpitts.m
% Computes the right-hand side of the state equations in Equation (10.17a,b,c)
% pertaining to Example 10.2 of Chapter 10.
function y = Colpitts(t,x)
C1 = 54e-9;
C2 = 54e-9;
REE = 400;
VEE = -5;
betaF = 200;
RL = 35;
L = 98.5e-6;
y(1) = ( x(3) - betaF*fR(x(2)) )/C1;
y(2) = ( -(x(2)+VEE)/REE - fR(x(2)) - x(3) )/C2;
y(3) = ( vCC(t) - x(1) + x(2) - RL*x(3))/L;
y = y.'; % return a column vector, as required by ode23
% SimulateColpitts.m
% This routine uses vCC.m, fR.m and Colpitts.m to simulate the Colpitts
% oscillator circuit of Example 10.2 in Chapter 10. It produces
% Figure 10.21 in Chapter 10.
% The state vector x(:,:) is as follows:
% x(:,1) = v_CE(t)
% x(:,2) = v_BE(t)
% x(:,3) = i_L(t)
function SimulateColpitts
[t,x] = ode23(@Colpitts, [0 0.003], [0 0 0]);
% [0 0.003] ---> Simulate from t = 0 to 3 milliseconds
% [0 0 0]   ---> Initial state vector
clf
L = length(t) ;
subplot(211), plot(x(:,1),x(:,2)), grid
xlabel(' v_{CE}(t) (volts) ')
ylabel(' v_{BE}(t) (volts) ')
title(' Phase Portrait of the Colpitts Oscillator (Chaotic Regime) ')
subplot(212), plot(t(L-1999:L),x(L-1999:L,1),'-'), grid
xlabel(' t (seconds) ')
ylabel(' v_{CE}(t) (volts) ')
title(' Collector-Emitter Voltage (Chaotic Regime) ')
REFERENCES
1. P. E. Hydon, Symmetry Methods for Differential Equations: A Beginner's Guide, Cam-
bridge Univ. Press, Cambridge, UK, 2000.
2. C. T.-C. Nguyen and R. T. Howe, "An Integrated CMOS Micromechanical Resonator
High-Q Oscillator," IEEE J. Solid-State Circuits 34, 450-455 (April 1999).
3. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York,
1978.
4. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979.
5. L. M. Kells, Differential Equations: A Brief Course with Applications, McGraw-Hill,
New York, 1968.
6. E. Beltrami, Mathematics for Dynamic Modeling, Academic Press, Boston, MA, 1987.
7. S. S. Rao, Applied Numerical Methods for Engineers and Scientists, Prentice-Hall,
Upper Saddle River, NJ, 2002.
8. R. Bronson, Modern Introductory Differential Equations (Schaum's Outline Series),
McGraw-Hill, New York, 1973.
9. M. P. Kennedy, "Chaos in the Colpitts Oscillator," IEEE Trans. Circuits Syst. (Part I:
Fundamental Theory and Applications) 41, 771-774 (Nov. 1994).
10. A. S. Sedra and K. C. Smith, Microelectronic Circuits, 3rd ed., Saunders College Publ.,
Philadelphia, PA, 1989.
11. J. Guckenheimer and P. Holmes, Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields, Springer-Verlag, New York, 1983.
12. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York,
2002.
13. L. F. Shampine and M. W. Reichelt, "The MATLAB ODE Suite," SIAM J. Sci. Comput. 18, 1-22 (Jan. 1997).
14. J. R. Dormand and P. J. Prince, "A Family of Embedded Runge-Kutta Formulae," J.
Comput. Appl. Math. 6, 19-26 (1980).
15. D. J. Higham and L. N. Trefethen, "Stiffness of ODEs," BIT 33, 285-303 (1993).
16. P. G. Drazin, Nonlinear Systems, Cambridge Univ. Press, Cambridge, UK, 1992.
17. R. L. Burden and J. D. Faires, Numerical Analysis, 4th ed., PWS-KENT Publ., Boston,
MA, 1989.
PROBLEMS
10.1. Consider the electric circuit depicted in Fig. 10.P.1. Find matrix A ∈ R^{2×2} and vector b ∈ R^2 such that

    d/dt [i_{L1}(t)  i_{L2}(t)]^T = A [i_{L1}(t)  i_{L2}(t)]^T + b v_s(t),

where i_{Lk}(t) is the current through inductor L_k (k ∈ {1, 2, 3}).
(Comment: Although the number of energy storage elements in the circuit is 3, there are only two state variables needed to describe the circuit dynamics.)
10.2. The circuit in Fig. 10.P.2 is a simplified model for a parametric amplifier. The amplifier contains a reverse-biased varactor diode that is modeled by the parallel interconnection of linear time-invariant capacitor C_0 and linear time-varying capacitor C(t). You may assume that C(t) = 2C_1 cos(ω_p t), where C_1 is constant, and ω_p is the pumping frequency. Note that

    i_C(t) = d/dt [C(t)v(t)].

The input to the amplifier is the ideal cosinusoidal current source i_s(t) = 2I_s cos(ω_0 t), and the load is the resistor R, so the output is the current i(t)

Figure 10.P.1 The linear electric circuit for Problem 10.1.
Figure 10.P.2 A model for a parametric amplifier (Problem 10.2).
into R. Write the state equations for the circuit, assuming that the state variables are v(t) and i_1(t). Write these equations in matrix form. (Comment: This problem is based on the example of a parametric amplifier as considered in C. A. Desoer and E. S. Kuh, Basic Circuit Theory, McGraw-Hill, New York, 1969.)
10.3. Give a detailed derivation of Eqs. (10.17).
10.4. The general linear first-order ODE is

    dx(t)/dt = a(t)x(t) + b(t),    x(t_0) = x_0.

Use the trapezoidal method to find an expression for x_{n+1} in terms of x_n, t_n, and t_{n+1}.
10.5. Prove that the trapezoidal method for ODEs is unconditionally stable.
10.6. Consider Theorem 10.2. Assume that x_0 = x(t_0). Use (10.23) to find an upper bound on |x(t_n) - x_n|/M for the following ODE IVPs:

    (a) dx/dt = 1 - 2x, x(0) = 1.
    (b) dx/dt = 2 cos x, x(0) = 0.
10.7. Consider the ODE IVP

    dx(t)/dt = 1 - 2x,    x(0) = 1.

    (a) Approximate the solution to this problem using the explicit Euler method with h = 0.1 for n = 0, 1, ..., 10. Do the computations with a pocket calculator.
    (b) Find the exact solution x(t).
10.8. Consider the ODE IVP

    dx(t)/dt = 2 cos x,    x(0) = 0.

    (a) Approximate the solution to this problem using the explicit Euler method with h = 0.1 for n = 0, 1, ..., 10. Do the computations with a pocket calculator.
    (b) Find the exact solution x(t).
10.9. Write a MATLAB routine to simulate the ODE

    dx/dt = x/(x + t)

for the initial condition x(0) = 1. Use both the implicit and explicit Euler methods. The program must accept as input the step size h and the number of iterations N that are desired. Parameters h and N are the same for both methods. The program output is to be written to a file in the form of a table something like the following (e.g., for h = 0.05 and N = 5):

    time step    explicit x_n    implicit x_n
    0.0000       value           value
    0.0500       value           value
    0.1000       value           value
    0.1500       value           value
    0.2000       value           value
    0.2500       value           value

Test your program out on h = 0.01, N = 100 and h = 0.10, N = 10.
10.10. Consider

    dx(t)/dt = α(1 - 2βt^2)e^{-βt^2}.    (10.P.1)

    (a) Verify that for t ≥ 0, with x(0) = 0, we have the solution

        x(t) = αt e^{-βt^2}.    (10.P.2)

    (b) For what range of step sizes h is the explicit Euler method a stable means of solving (10.P.1)?
    (c) Write a MATLAB routine to simulate (10.P.1) for x(0) = 0 using both the explicit and implicit Euler methods. Assume that α = 10 and β = 1. Test your program out on

        h = 0.01, N = 400    (10.P.3a)

    and

        h = 0.10, N = 40.    (10.P.3b)

    The program must produce plots of {x_n | n = 0, 1, ..., N} (for both explicit and implicit methods) and x(t) [from (10.P.2)] on the same graph. This will lead to two separate plots, one for each of (10.P.3a) and (10.P.3b).
10.11. Consider

    dx(t)/dt = 2tx + 1,    x(0) = 0.

    (a) Find {x_n | n = 0, 1, ..., 10} for h = 0.1 using the explicit Euler method. Do the calculations with a pocket calculator.
    (b) Verify that

        x(t) = e^{t^2} ∫_0^t e^{-s^2} ds.

    (c) Does the stability condition (10.33) apply here? Explain.
10.12. Prove that Runge-Kutta methods of order one have no degrees of freedom (i.e., we are forced to select c_1 in only one possible way).
10.13. A fourth-order Runge-Kutta method is

x_{n+1} = x_n + (1/6) h [k_1 + 2k_2 + 2k_3 + k_4]

for which

k_1 = f(x_n, t_n),  k_2 = f(x_n + (1/2)h k_1, t_n + (1/2)h),
k_3 = f(x_n + (1/2)h k_2, t_n + (1/2)h),  k_4 = f(x_n + h k_3, t_n + h).

When applied to the model problem, we get x_n = σ^n x_0 for n ∈ Z^+. Derive
the expression for σ.
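The derivation asked for can be checked numerically: one step of this method on the model problem dx/dt = λx multiplies x by σ = 1 + z + z^2/2 + z^3/6 + z^4/24, where z = hλ (the first five terms of the exponential series).

```python
def rk4_step(f, x, t, h):
    # classical fourth-order Runge-Kutta step
    k1 = f(x, t)
    k2 = f(x + 0.5*h*k1, t + 0.5*h)
    k3 = f(x + 0.5*h*k2, t + 0.5*h)
    k4 = f(x + h*k3, t + h)
    return x + (h/6.0)*(k1 + 2*k2 + 2*k3 + k4)

lam, h = -2.0, 0.3
z = h*lam
sigma = 1 + z + z**2/2 + z**3/6 + z**4/24
x1 = rk4_step(lambda x, t: lam*x, 1.0, 0.0, h)
print(x1, sigma)
```

The two printed numbers agree to machine precision, which is exactly the claim x_n = σ^n x_0.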
10.14. A third-order Runge-Kutta method is

x_{n+1} = x_n + (1/6)[k_1 + 4k_2 + k_3]
PROBLEMS 477
for which

k_1 = h f(x_n, t_n),  k_2 = h f(x_n + (1/2)k_1, t_n + (1/2)h),
k_3 = h f(x_n - k_1 + 2k_2, t_n + h).

(a) When applied to the model problem, we get x_n = σ^n x_0 for n ∈ Z^+.
Derive the expression for σ.
(b) Find the allowable range of step sizes h that ensures stability of the
method.
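For part (b), one step of this method on dx/dt = λx gives σ(z) = 1 + z + z^2/2 + z^3/6 with z = hλ; for real λ < 0 the method is stable while |σ(z)| ≤ 1, and a simple bisection locates the real stability boundary near z ≈ -2.5127.

```python
def sigma3(z):
    # growth factor of the third-order method on the model problem
    return 1 + z + z*z/2 + z**3/6

# sigma3 is increasing in z (its derivative 1 + z + z^2/2 is always positive),
# so |sigma3| crosses 1 exactly once on (-3, -2): bisect on |sigma3(z)| = 1.
lo, hi = -3.0, -2.0            # |sigma3(-3)| = 2 > 1, |sigma3(-2)| = 1/3 < 1
for _ in range(60):
    mid = 0.5*(lo + hi)
    if abs(sigma3(mid)) > 1.0:
        lo = mid
    else:
        hi = mid
print(hi)   # boundary z*; stable step sizes satisfy h <= |z*|/|lam|
```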
10.15. Consider the fourth-order Runge-Kutta method in Eqs. (10.80) and (10.81).
Show that if f(x, t) = f(t), then the method reduces to Simpson's rule for
numerical integration over the interval [t_n, t_{n+1}].
10.16. Recall Eq. (10.98). Suppose that A ∈ R^{2×2} has distinct eigenvalues γ_k such
that Re[γ_k] > 0 for at least one of the eigenvalues. Show that I + hA will
have at least one eigenvalue λ_k such that |λ_k| > 1.
10.17. Consider the coupled first-order ODEs

dx/dt = -y sqrt(x^2 + y^2),   (10.P.4a)
dy/dt = x sqrt(x^2 + y^2),   (10.P.4b)

where (x_0, y_0) = (x(0), y(0)) are the initial conditions.

(a) Prove that for suitable constants r_0 and θ_0, we have

x(t) = r_0 cos(r_0 t + θ_0),  y(t) = r_0 sin(r_0 t + θ_0).   (10.P.5)
(b) Write a MATLAB routine to simulate the system represented by (10.P.4)
using the explicit Euler method [which will produce x_n = [x_n y_n]^T
such that x_n ≈ x(t_n) and y_n ≈ y(t_n)]. Assume that h = 0.05 and the
initial condition is x_0 = [1 0]^T. Plot x_n and (x(t), y(t)) [via (10.P.5)]
on the (x, y) plane.

(c) Write a MATLAB routine to simulate the system represented by (10.P.4)
using Heun's method [which will produce x_n = [x_n y_n]^T such that
x_n ≈ x(t_n) and y_n ≈ y(t_n)]. Assume that h = 0.05 and the initial
condition is x_0 = [1 0]^T. Plot x_n and (x(t), y(t)) [via (10.P.5)] on the
(x, y) plane.

Make reasonable choices about the number of time steps in the simulation.
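A Python sketch of parts (b) and (c), minus the plotting (the text asks for MATLAB): with the given initial condition the exact trajectory is the unit circle, so the distance of each iterate from the origin is a convenient accuracy measure. Explicit Euler spirals outward noticeably, while Heun's method stays very close to radius 1.

```python
import math

def f(x, y):
    # right-hand side of (10.P.4)
    r = math.hypot(x, y)
    return -y*r, x*r

def euler(x, y, h, N):
    for _ in range(N):
        fx, fy = f(x, y)
        x, y = x + h*fx, y + h*fy
    return x, y

def heun(x, y, h, N):
    for _ in range(N):
        fx, fy = f(x, y)
        xp, yp = x + h*fx, y + h*fy          # predictor (explicit Euler)
        gx, gy = f(xp, yp)
        x, y = x + 0.5*h*(fx + gx), y + 0.5*h*(fy + gy)
    return x, y

h, N = 0.05, 200                  # simulate out to t = 10
xe, ye = euler(1.0, 0.0, h, N)
xh, yh = heun(1.0, 0.0, h, N)
re = math.hypot(xe, ye)           # exact radius is 1 for all t
rh = math.hypot(xh, yh)
print(re, rh)
```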
10.18. In the previous problem the step size is h = 0.05.

(a) Determine whether the simulation using the explicit Euler method is
stable for this choice of step size. [Hint: Recall that one must consider
the eigenvalues of I + hA.]
(b) Determine whether the simulation using Heun's method is stable for
this choice of step size. [Hint: Consider the implications of (10.121).]
Use the MATLAB eig function to assist you in your calculations.
10.19. A curve in R^2 is specified parametrically according to

x(t) = A cos(ωt),  y(t) = aA cos(ωt - φ),   (10.P.6)

where a, A > 0, and t ∈ R is the "parameter." When the points (x(t), y(t))
are plotted, the result is what electrical engineers often call a Lissajous figure
(or curve), which is really just an alternative name for a phase portrait.

(a) Find an implicit function expression for the curve, that is, find a description
of the form

f(x, y) = 0   (10.P.7)

[i.e., via algebra and trigonometry eliminate t from (10.P.6) to obtain
(10.P.7)].

(b) On the (x, y) plane sketch the Lissajous figures for the cases φ = 0,
φ = ±π/4, and φ = ±π/2.
(c) An interpretation of (10.P.6) is that x(t) may be the input voltage from
a source in an electric circuit, while y(t) may be the output voltage
drop across a load element in the circuit (in the steady-state condition,
of course). Find a simple expression for sin φ in terms of B, where
f(0, B) = 0 [i.e., B is the point(s) on the y axis where the curve cuts the
y axis], and in terms of a and A. [Comment: On analog oscilloscopes
of olden days, it was possible to display a Lissajous figure, and so use
this figure to estimate the phase angle φ on the lab bench.]
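The oscilloscope trick of part (c) can be checked numerically: x(t) = 0 first occurs at ωt = π/2, where y = aA cos(π/2 - φ) = aA sin φ, so sin φ = B/(aA). The parameter values below are arbitrary illustration choices, not taken from the text.

```python
import math

# Arbitrary illustration parameters (assumptions, not from the text).
A, a, w, phi = 2.0, 1.5, 3.0, 0.7
t_cross = (math.pi/2)/w                  # first time with x(t) = 0
B = a*A*math.cos(w*t_cross - phi)        # y-axis crossing of the Lissajous figure
print(math.asin(B/(a*A)), phi)           # recovering phi from B, a, A
```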
10.20. Verify the values for λ_k in Table 10.1 for m = 2.

10.21. Verify the values for λ_k in Table 10.2 for m = 2.
10.22. For the ODE IVP

dx(t)/dt = f(x, t),  x(t_0) = x_0,

write pseudocode for a numerical method that approximates the solution to
it using an AB method for m = 2 as a predictor, an AM method for m = 2
as a corrector, and the third-order Runge-Kutta (RK) method from Problem
10.14 to perform the initialization.
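A Python sketch of the requested method follows, assuming the usual two-step Adams coefficients (AB2 predictor x* = x_n + (h/2)(3f_n - f_{n-1}); AM2 corrector x_{n+1} = x_n + (h/12)(5 f(x*, t_{n+1}) + 8 f_n - f_{n-1})); the actual coefficients intended are those of Tables 10.1 and 10.2.

```python
import math

def rk3_step(f, x, t, h):
    # third-order Runge-Kutta method of Problem 10.14 (initialization)
    k1 = h*f(x, t)
    k2 = h*f(x + 0.5*k1, t + 0.5*h)
    k3 = h*f(x - k1 + 2*k2, t + h)
    return x + (k1 + 4*k2 + k3)/6.0

def ab2_am2(f, x0, t0, h, N):
    xs = [x0, rk3_step(f, x0, t0, h)]           # RK3 supplies x_1
    for n in range(1, N):
        tn = t0 + n*h
        fn, fnm1 = f(xs[n], tn), f(xs[n-1], tn - h)
        xp = xs[n] + 0.5*h*(3*fn - fnm1)        # AB2 predictor
        xs.append(xs[n] + (h/12.0)*(5*f(xp, tn + h) + 8*fn - fnm1))  # AM2 corrector
    return xs

xs = ab2_am2(lambda x, t: -x, 1.0, 0.0, 0.1, 10)   # test on dx/dt = -x, x(0) = 1
print(xs[-1], math.exp(-1.0))
```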
10.23. From Forsythe, Malcolm, and Moler (see Ref. 5 in Chapter 2)

dx/dt = 998x + 1998y,   (10.P.8a)
dy/dt = -999x - 1999y   (10.P.8b)
has the solution

x(t) = 4e^{-t} - 3e^{-1000t},   (10.P.9a)
y(t) = -2e^{-t} + 3e^{-1000t},   (10.P.9b)

where x(0) = y(0) = 1. Recall A as given by (10.94).

(a) By direct substitution verify that (10.P.9) is the solution to (10.P.8).
(b) If the eigenvalues of I + hA are λ_0 and λ_1, plot |λ_0| and |λ_1| versus h
(using MATLAB).
(c) If the eigenvalues of I + hA + (1/2)h^2 A^2 [recall (10.121)] are λ_0 and λ_1,
plot |λ_0| and |λ_1| versus h (using MATLAB).

In parts (b) and (c), what can be said about the range of values for h leading
to a stable simulation of the system (10.P.8)?
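The eigenvalue computation at the heart of parts (b) and (c) can be done by hand for a 2 × 2 matrix: A = [[998, 1998], [-999, -1999]] has eigenvalues -1 and -1000, so I + hA has eigenvalues 1 - h and 1 - 1000h, and the fast mode forces h ≤ 0.002 even though the slow mode alone would allow h ≤ 2.

```python
import math

def eig2(a, b, c, d):
    # eigenvalues of [[a, b], [c, d]] from the characteristic polynomial
    tr, det = a + d, a*d - b*c
    disc = math.sqrt(tr*tr - 4*det)
    return (tr - disc)/2, (tr + disc)/2

l0, l1 = eig2(998.0, 1998.0, -999.0, -1999.0)
print(l0, l1)                         # -1000, -1

h = 0.002                             # the explicit-Euler stability limit
mags = [abs(1 + h*l) for l in (l0, l1)]
print(max(mags))                      # exactly 1 at the limit
```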
10.24. Consider the coupled system of first-order ODEs

dx(t)/dt = Ax(t) + y(t),   (10.P.10)

where A ∈ R^{n×n} and x(t), y(t) ∈ R^n for all t ∈ R^+. Suppose that the eigenvalues
of A are λ_0, λ_1, ..., λ_{n-1} such that Re[λ_k] < 0 for all k ∈ Z_n.
Suppose that

a ≤ Re[λ_k] ≤ b < 0

for all k. A stiffness quotient is defined to be

r = a/b.

The system (10.P.10) is said to be stiff if r ≫ 1 (again, we assume Re[λ_k] < 0
for all k).

For the previous problem, does A correspond to a stiff system? [Comment:
As Higham and Trefethen [15] warned, the present stiffness definition is
not entirely reliable.]
11
Numerical Methods for Eigenproblems
11.1 INTRODUCTION
In previous chapters we have seen that eigenvalues and eigenvectors are important
(e.g., recall condition numbers from Chapter 4, and the stability analysis of numer-
ical methods for ODEs in Chapter 10). In this chapter we treat the eigenproblem
somewhat more formally than previously. We shall define and review the basic
problem in Section 11.2, and in Section 11.3 we shall apply this understanding to
the problem of computing the matrix exponential [i.e., exp(At), where A ∈ R^{n×n}
and t ∈ R] since this is of central importance in many areas of electrical and computer
engineering (signal processing, stability of dynamic systems, control systems,
circuit simulation, etc.). In subsequent sections we will consider numerical methods
to determine the eigenvalues and eigenvectors of matrices.
11.2 REVIEW OF EIGENVALUES AND EIGENVECTORS
In this section we review some basic facts relating to the determination of eigen-
values and eigenvectors of matrices. Our emphasis, with a few exceptions, is on
matrices that are diagonalizable.
Definition 11.1: Eigenproblem Let A ∈ C^{n×n}. The eigenproblem for A is
to find solutions to the matrix equation

Ax = λx,   (11.1)

where λ ∈ C and x ∈ C^n such that x ≠ 0. A solution (λ, x) to (11.1) is called an
eigenpair, λ is an eigenvalue, and x is its corresponding eigenvector.

Even if A ∈ R^{n×n} (the situation we will emphasize most), it is very possible to
have λ ∈ C and x ∈ C^n. We must also emphasize that x = 0 is never permitted to
be an eigenvector for A.
We may rewrite (11.1) as (I is the n × n identity matrix)

(A - λI)x = 0,   (11.2a)
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
or equivalently as

(λI - A)x = 0.   (11.2b)

Equations (11.2) are homogeneous linear systems of n equations in n unknowns.
Since x = 0 is never an eigenvector, an eigenvector x must be a nontrivial solution
to (11.2). An n × n homogeneous linear system has a nonzero (i.e., nontrivial)
solution iff the coefficient matrix is singular. Immediately, eigenvalue λ satisfies

det(A - λI) = 0,   (11.3a)

or equivalently

det(λI - A) = 0.   (11.3b)
Of course, p(λ) = det(λI - A) is a polynomial of degree n. In principle, we
may find eigenpairs by finding p(λ) (the characteristic polynomial of A), then
finding the zeros of p(λ), and then substituting these into (11.2) to find x. In
practice this approach really works only for small analytical examples. Properly
conceived numerical methods are needed to determine eigenpairs reliably for larger
matrices A.
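For a small analytical example the hand procedure is easy to carry out end to end. The symmetric matrix below is an illustration of our own choosing: form p(λ) = λ^2 - 4λ + 3, find its zeros, then solve (A - λI)x = 0 for each eigenvector.

```python
import math

# A = [[2, 1], [1, 2]] (illustration matrix, not from the text)
a, b, c, d = 2.0, 1.0, 1.0, 2.0
tr, det = a + d, a*d - b*c                 # p(l) = l^2 - tr*l + det
disc = math.sqrt(tr*tr - 4*det)
lams = [(tr - disc)/2, (tr + disc)/2]
print(lams)                                # [1.0, 3.0]

# From the first row of (A - lam*I)x = 0: (a - lam)x0 + b*x1 = 0, so one
# valid choice (b != 0) is x = [b, lam - a].
vecs = [(b, lam - a) for lam in lams]
for lam, (v0, v1) in zip(lams, vecs):
    # verify A v = lam v componentwise
    assert abs(a*v0 + b*v1 - lam*v0) < 1e-12
    assert abs(c*v0 + d*v1 - lam*v1) < 1e-12
print(vecs)                                # [(1.0, -1.0), (1.0, 1.0)]
```

For matrices beyond hand size this char-poly route is numerically fragile, which is precisely why the methods later in this chapter exist.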
To set the stage for what follows, consider the following examples.
Example 11.1 From Hill [1], consider the example

A = [ 1  0  0 ]
    [-2  1  0 ]
    [-1  0  3 ].

Here p(λ) = det(λI - A) = (λ - 1)^2 (λ - 3), so A has a double eigenvalue (eigenvalue
of multiplicity 2) at λ = 1, and a simple eigenvalue at λ = 3. The eigenvalues
may be individually denoted as λ_0 = 1, λ_1 = 1, and λ_2 = 3.

For λ = 3, (3I - A)x = 0 becomes

[ 2  0  0 ] [ x_0 ]   [ 0 ]
[ 2  2  0 ] [ x_1 ] = [ 0 ],
[ 1  0  0 ] [ x_2 ]   [ 0 ]

and from the application of elementary row operations this reduces to

[ 1  0  0 ] [ x_0 ]   [ 0 ]
[ 0  1  0 ] [ x_1 ] = [ 0 ].
[ 0  0  0 ] [ x_2 ]   [ 0 ]

Immediately, x_0 = x_1 = 0, and x_2 is arbitrary (except, of course, it is not allowed
to be zero), so that the general form of the eigenvector corresponding to λ = λ_2 is

x^{(2)} = [0 0 x_2]^T ∈ C^3.
On the other hand, now let us consider λ = λ_0 = λ_1. In this case (I - A)x = 0
becomes

[ 0  0  0 ] [ x_0 ]   [ 0 ]
[ 2  0  0 ] [ x_1 ] = [ 0 ],
[ 1  0 -2 ] [ x_2 ]   [ 0 ]

which reduces to

[ 1  0  0 ] [ x_0 ]   [ 0 ]
[ 0  0  1 ] [ x_1 ] = [ 0 ].
[ 0  0  0 ] [ x_2 ]   [ 0 ]

Immediately, x_0 = x_2 = 0, and x_1 is arbitrary, so an eigenvector corresponding to
λ = λ_0 = λ_1 is of the general form

x^{(0)} = [0 x_1 0]^T ∈ C^3.

Even though λ = 1 is a double eigenvalue, we are able to find only one eigenvector
for this case. In effect, one eigenvector seems to be "missing."
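The "missing eigenvector" phenomenon can be seen in miniature with the 2 × 2 Jordan block J = [[1, 1], [0, 1]] (a stand-in of our own choosing, not the 3 × 3 example above): J has the double eigenvalue 1, but (J - I)x = 0 forces x_1 = 0 with only x_0 free, a one-dimensional eigenspace.

```python
J = [[1.0, 1.0], [0.0, 1.0]]
lam = 1.0
# form J - lam*I
B = [[J[0][0] - lam, J[0][1]], [J[1][0], J[1][1] - lam]]
# B = [[0, 1], [0, 0]] is already in echelon form, so its rank is just the
# number of nonzero rows: one constraint (x1 = 0), eigenspace span{[1, 0]}.
rank = sum(1 for row in B if any(abs(e) > 1e-12 for e in row))
print(rank, 2 - rank)     # rank 1, nullity 1: dimension 1 < multiplicity 2
```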
Example 11.2 Now consider (again from Hill [1])

A = [ 0 -2  1 ]
    [ 1  3 -1 ]
    [ 0  0  1 ].

Here p(λ) = det(λI - A) = (λ - 1)^2 (λ - 2). The eigenvalues of A are thus λ_0 = 1,
λ_1 = 1, and λ_2 = 2.

For λ = 2, (2I - A)x = 0 becomes

[ 2  2 -1 ] [ x_0 ]   [ 0 ]
[-1 -1  1 ] [ x_1 ] = [ 0 ],
[ 0  0  1 ] [ x_2 ]   [ 0 ]

which reduces to

[ 1  1  0 ] [ x_0 ]   [ 0 ]
[ 0  0  1 ] [ x_1 ] = [ 0 ],
[ 0  0  0 ] [ x_2 ]   [ 0 ]

so the general form of the eigenvector for λ = λ_2 is

x^{(2)} = [-x_1 x_1 0]^T ∈ C^3.
Now, if we consider λ = λ_0 = λ_1, (I - A)x = 0 becomes

[ 1  2 -1 ] [ x_0 ]   [ 0 ]
[-1 -2  1 ] [ x_1 ] = [ 0 ],
[ 0  0  0 ] [ x_2 ]   [ 0 ]

which reduces to

[ 1  2 -1 ] [ x_0 ]   [ 0 ]
[ 0  0  0 ] [ x_1 ] = [ 0 ].
[ 0  0  0 ] [ x_2 ]   [ 0 ]

Since x_0 + 2x_1 - x_2 = 0, we may choose any two of x_0, x_1, or x_2 as free
parameters, giving a general form of an eigenvector for λ = λ_0 = λ_1 as

x^{(0)} = x_0 [1 0 1]^T + x_1 [0 1 2]^T ∈ C^3.

Thus, we have eigenpairs (λ_0, x^{(0)}), (λ_1, x^{(1)}), and (λ_2, x^{(2)}).

We continue to emphasize that in all cases any free parameters are arbitrary,
except that they must never be selected to give a zero-valued eigenvector. In
Example 11.2, λ = 1 is an eigenvalue of multiplicity 2, and x^{(0)} is a vector in
a two-dimensional vector subspace of C^3. On the other hand, in Example 11.1,
λ = 1 is also of multiplicity 2, and yet x^{(0)} is only a vector in a one-dimensional
vector subspace of C^3.
Definition 11.2: Defective Matrix For any A ∈ C^{n×n}, if the multiplicity
of any eigenvalue λ ∈ C is not equal to the dimension of the solution space
(eigenspace) of (λI - A)x = 0, then A is defective.

From this definition, A in Example 11.1 is a defective matrix, while A in
Example 11.2 is not defective (i.e., is nondefective). In a sense soon to be made
precise, defective matrices cannot be diagonalized. However, all matrices, diagonalizable
or not, can be placed into Jordan canonical form, as follows.
Theorem 11.1: Jordan Decomposition If A ∈ C^{n×n}, then there exists a nonsingular
matrix T ∈ C^{n×n} such that

T^{-1} A T = diag(J_0, J_1, ..., J_{k-1}),

where each Jordan block J_i ∈ C^{m_i × m_i} has λ_i in every main-diagonal entry,
ones on the first superdiagonal, and zeros elsewhere:

J_i = [ λ_i  1           ]
      [     λ_i  1       ]
      [         ...  1   ]
      [             λ_i  ],

and Σ_{i=0}^{k-1} m_i = n.
Proof See Halmos [2] or Horn and Johnson [3].
Of course, λ_i in the theorem is an eigenvalue of A. The submatrices J_i are called
Jordan blocks. The number of blocks k and their dimensions m_i are unique, but
their order is not unique. Note that if an eigenvalue has a multiplicity of unity (i.e.,
if it is simple), then the Jordan block in this case is the 1 × 1 matrix consisting
of that eigenvalue. From the theorem statement the characteristic polynomial of
A ∈ C^{n×n} is given by

p(λ) = det(λI - A) = Π_{i=0}^{k-1} (λ - λ_i)^{m_i}.   (11.4)

But if A ∈ R^{n×n}, and if λ_i ∈ C for some i, then λ_i^* must also be an eigenvalue of
A, that is, a zero of p(λ). This follows because complex-valued roots of polynomials
with real-valued coefficients must always occur in complex-conjugate pairs.
Example 11.3 Consider (with θ ≠ kπ, k ∈ Z)

A = [ cos θ  -sin θ ]
    [ sin θ   cos θ ] ∈ R^{2×2}

(the 2 × 2 rotation operator from Appendix 3.A). We have the characteristic equation

p(λ) = det(λI - A) = det [ λ - cos θ      sin θ    ]
                         [ -sin θ      λ - cos θ   ]
     = λ^2 - 2 cos θ λ + 1 = 0,

for which the roots (eigenvalues of A) are therefore λ = e^{±jθ}. Define λ_0 = e^{jθ},
λ_1 = e^{-jθ}. Clearly λ_1 = λ_0^* (i.e., the two simple eigenvalues of A are a
conjugate pair).

For λ = λ_0, (λ_0 I - A)x = 0 is (after dividing out the common factor sin θ ≠ 0)

[ j  1 ] [ x_0 ]   [ 0 ]
[-1  j ] [ x_1 ] = [ 0 ],

which reduces to

[ 1 -j ] [ x_0 ]   [ 0 ]
[ 0  0 ] [ x_1 ] = [ 0 ].

The eigenvector for λ = λ_0 is therefore of the form

x^{(0)} = a [ j  1 ]^T,  a ∈ C.
Similarly, for λ = λ_1, the homogeneous linear system (λ_1 I - A)x = 0 is (again
dividing out sin θ)

[-j  1 ] [ x_0 ]   [ 0 ]
[-1 -j ] [ x_1 ] = [ 0 ],

which reduces to

[ 1  j ] [ x_0 ]   [ 0 ]
[ 0  0 ] [ x_1 ] = [ 0 ].

The eigenvector for λ = λ_1 is therefore of the form

x^{(1)} = b [ -j  1 ]^T,  b ∈ C.

Of course, free parameters a and b are never allowed to be zero.
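Example 11.3 is easy to confirm numerically with complex arithmetic: A x^{(k)} should equal λ_k x^{(k)} for both eigenpairs. The value of θ below is an arbitrary test choice.

```python
import cmath
import math

th = 0.8                          # arbitrary theta, not a multiple of pi
c, s = math.cos(th), math.sin(th)
A = [[c, -s], [s, c]]             # the rotation operator of Example 11.3

errs = []
for lam, v in [(cmath.exp(1j*th), (1j, 1.0)),     # (e^{j th}, [j 1]^T)
               (cmath.exp(-1j*th), (-1j, 1.0))]:  # (e^{-j th}, [-j 1]^T)
    Av = (A[0][0]*v[0] + A[0][1]*v[1],
          A[1][0]*v[0] + A[1][1]*v[1])
    errs.append(max(abs(Av[0] - lam*v[0]), abs(Av[1] - lam*v[1])))
print(errs)   # residuals of A v - lam v; both are at roundoff level
```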
Computing the Jordan canonical form when m_i > 1 is numerically rather difficult
(as noted in Golub and Van Loan [4] and Horn and Johnson [3]), and so
is often avoided. However, there are important exceptions, often involving state-variable
(state-space) systems analysis and design (e.g., see Fairman [5]). Within
the theory of Jordan forms it is possible to find the supposedly "missing" eigenvectors,
resulting in a theory of generalized eigenvectors. We will not consider this here as
it is rather involved. Some of the references at the end of this chapter cover the
relevant theory [3].
We will now consider a series of theorems leading to a sufficient condition for
A to be nondefective. Our presentation largely follows Hill [1].
Theorem 11.2: If A ∈ C^{n×n}, then the eigenvectors corresponding to two distinct
eigenvalues of A are linearly independent.

Proof We employ proof by contradiction.

Suppose that (α, x) and (β, y) are two eigenpairs for A, and α ≠ β. Assume
y = ax for some a ≠ 0 (a ∈ C). Thus

βy = Ay = aAx = aαx,

and also

βy = aβx.

Hence

aβx = aαx,

implying that

a(β - α)x = 0.

But x ≠ 0 as it is an eigenvector of A, and also a ≠ 0. Immediately, α = β,
contradicting our assumption that these eigenvalues are distinct. Thus, y = ax is
impossible; that is, we have proved that y is independent of x.
Theorem 11.2 leads us to the next theorem.

Theorem 11.3: If A ∈ C^{n×n} has n distinct eigenvalues, then A has n linearly
independent eigenvectors.

Proof Uses mathematical induction (e.g., Stewart [6]).

We have already seen the following theorem.

Theorem 11.4: If A ∈ R^{n×n} and A = A^T, then all eigenvalues of A are real-valued.

Proof See Hill [1], or see the appropriate footnote in Chapter 4.
In addition to this theorem, we also have the following one.
Theorem 11.5: If A ∈ R^{n×n}, and if A = A^T, then eigenvectors corresponding
to distinct eigenvalues of A are orthogonal.

Proof Suppose that (α, x) and (β, y) are eigenpairs of A with α ≠ β. We wish
to show that x^T y = y^T x = 0 (recall Definition 1.6). Now

αx = Ax = A^T x,

so that

αy^T x = y^T A^T x = (Ay)^T x = βy^T x,

implying that

(α - β) y^T x = 0.

But α ≠ β, so that y^T x = 0; that is, x ⊥ y.

Theorem 11.5 states that eigenspaces corresponding to distinct eigenvalues of a
symmetric, real-valued matrix form mutually orthogonal vector subspaces of R^n.
Any vector from one eigenspace must therefore be orthogonal to any eigenvector
from another eigenspace. If we recall Definition 11.2, it is apparent that all symmetric,
real-valued matrices are nondefective if their eigenvalues are all distinct.¹
In fact, even if A ∈ C^{n×n} and is not symmetric, then, as long as the eigenvalues
are distinct, A will be nondefective (Theorem 11.3).

¹It is possible to go even further and prove that any real-valued, symmetric matrix is nondefective,
even if it possesses multiple eigenvalues. Thus, any real-valued, symmetric matrix is diagonalizable.
Definition 11.3: Similarity Transformation If A, B ∈ C^{n×n}, and there is a
nonsingular matrix P ∈ C^{n×n} such that

B = P^{-1} A P,

we say that B is similar to A, and that P is a similarity transformation.

If A ∈ C^{n×n}, and A has n distinct eigenvalues forming n distinct eigenpairs
{(λ_k, x^{(k)}) | k = 0, 1, ..., n - 1}, then

A x^{(k)} = λ_k x^{(k)}

allows us to write

A [x^{(0)} x^{(1)} ··· x^{(n-1)}] = [x^{(0)} x^{(1)} ··· x^{(n-1)}] diag(λ_0, λ_1, ..., λ_{n-1}),   (11.5)

that is, AP = PΛ, where Λ = diag(λ_0, λ_1, ..., λ_{n-1}) ∈ C^{n×n} is the diagonal matrix
of eigenvalues of A, and P is the matrix whose columns are the eigenvectors. Thus

P^{-1} A P = Λ,   (11.6)

and the matrix of eigenvectors P ∈ C^{n×n} of A defines the similarity transformation
that diagonalizes matrix A. More generally, we have the following theorem.
Theorem 11.6: If A, B ∈ C^{n×n}, and A and B are similar matrices, then A and
B have the same eigenvalues.

Proof Since A and B are similar, there exists a nonsingular matrix P ∈ C^{n×n}
such that B = P^{-1} A P. Therefore

det(λI - B) = det(λI - P^{-1} A P)
            = det(P^{-1} (λI - A) P)
            = det(P^{-1}) det(λI - A) det(P)
            = det(λI - A).

Thus, A and B possess the same characteristic polynomial, and so possess identical
eigenvalues.

In other words, similarity transformations preserve eigenvalues.² Note that Theorem
11.6 holds regardless of whether A and B are defective. In developing (11.6)
we have seen that if A ∈ C^{n×n} has n distinct eigenvalues, it is diagonalizable. We
emphasize that this is only a sufficient condition. Example 11.2 confirms that a
matrix can have eigenvalues with multiplicity greater than one, and yet still be
diagonalizable.

²This makes such transformations highly valuable in state-space control systems design, in addition to
a number of other application areas.
11.3 THE MATRIX EXPONENTIAL

In Chapter 10 the problem of computing e^{At} (A ∈ R^{n×n}, and t ∈ R) was associated
with the stability analysis of numerical methods for systems of ODEs. It is also
noteworthy that to solve

dx(t)/dt = Ax(t) + by(t)   (11.7)

[x(t) = [x_0(t) x_1(t) ··· x_{n-1}(t)]^T ∈ R^n, A ∈ R^{n×n}, and b ∈ R^n with y(t) ∈ R for
all t] required us to compute e^{At} [recall Example 10.1, which involved an example
of (11.7) from electric circuit analysis; see Eq. (10.9)]. Thus, we see that computing
the matrix exponential is an important problem in analysis. In this section we shall
gain more familiarity with the matrix exponential because of its significance.
Moler and Van Loan [7] caution that computing the matrix exponential is a
numerically difficult problem. Stable, reliable, accurate, and computationally efficient
algorithms are not so easy to come by. Their paper [7], as its title states,
considers 19 methods, and none of them is entirely satisfactory. Indeed, this
paper [7] appeared in 1978, and to this day the problem of successfully computing
e^{At} for any A ∈ C^{n×n} has not been fully resolved. We shall say something about
why this is a difficult problem later.
Before considering this matter, we shall consider a general analytic (i.e., hand
calculation) method for obtaining e^{At} for any A ∈ R^{n×n}, including when A is
defective. In principle, this would involve working with Jordan decompositions
and generalized eigenvectors, but we will avoid this by adopting the approach
suggested in Leonard [8].
The matrix exponential e^{At} may be defined in the expected manner as

Φ(t) = e^{At} = Σ_{k=0}^∞ (1/k!) A^k t^k,   (11.8)

so, for example, the kth derivative of the matrix exponential is

Φ^{(k)}(t) = A^k e^{At} = e^{At} A^k   (11.9)

for k ∈ Z^+ (Φ^{(0)}(t) = Φ(t)). To see how this works, consider the following special
case k = 1:

Φ^{(1)}(t) = d/dt [ Σ_{k=0}^∞ (1/k!) A^k t^k ]
          = d/dt [ I + (1/1!) A t + (1/2!) A^2 t^2 + ··· + (1/k!) A^k t^k + ··· ]
          = A + A^2 t + ··· + (1/(k-1)!) A^k t^{k-1} + ···
          = A [ I + (1/1!) A t + ··· + (1/(k-1)!) A^{k-1} t^{k-1} + ··· ]
          = A e^{At} = e^{At} A.
It is possible to formally verify that the series in (11.8) converges to a matrix
function of t e R by working with the Jordan decomposition of A. However,
we will avoid this level of detail. But we will consider the situation where A
is diagonalizable later on.
There is some additional background material needed to more fully appreciate
[8], and we will now consider this. The main result is the Cayley-Hamilton theorem
(Theorem 11.8, below).

Definition 11.4: Minors and Cofactors Let A ∈ C^{n×n}. The minor m_{ij} is the
determinant of the (n - 1) × (n - 1) submatrix of A derived from it by deleting
row i and column j. The cofactor c_{ij} associated with m_{ij} is c_{ij} = (-1)^{i+j} m_{ij} for
all i, j ∈ Z_n.
A formula for the inverse of A (assuming this exists) is given by Theorem 11.7.

Theorem 11.7: If A ∈ C^{n×n} is nonsingular, then

A^{-1} = (1/det(A)) adj(A),   (11.10)

where adj(A) (the adjoint matrix of A) is the transpose of the matrix of cofactors of
A. Thus, if C = [c_{ij}] ∈ C^{n×n} is the matrix of cofactors of A, then adj(A) = C^T.
Proof See Noble and Daniel [9].
Of course, the method suggested by Theorem 11.7 is useful only for the hand
calculation of low-order (small n) problems. Practical matrix inversion must use
ideas from Chapter 4. But Theorem 11.7 is a very useful result for theoretical
purposes, such as obtaining the following theorem.
Theorem 11.8: Cayley-Hamilton Theorem Any matrix A ∈ C^{n×n} satisfies
its own characteristic equation.

Proof The characteristic polynomial for A is p(λ) = det(λI - A), and can be
written as

p(λ) = λ^n + a_1 λ^{n-1} + ··· + a_{n-1} λ + a_n

for suitable constants a_k ∈ C. The theorem claims that

A^n + a_1 A^{n-1} + ··· + a_{n-1} A + a_n I = 0,   (11.11)
where I is the order-n identity matrix. To show (11.11), we consider adj(μI - A),
whose elements are polynomials in μ of a degree that is not greater than n - 1,
where μ is not an eigenvalue of A. Hence

adj(μI - A) = M_0 μ^{n-1} + M_1 μ^{n-2} + ··· + M_{n-2} μ + M_{n-1}

for suitable constant matrices M_k ∈ C^{n×n}. Via Theorem 11.7

(μI - A) adj(μI - A) = det(μI - A) I,

or in expanded form, this becomes

(μI - A)(M_0 μ^{n-1} + M_1 μ^{n-2} + ··· + M_{n-2} μ + M_{n-1})
    = (μ^n + a_1 μ^{n-1} + ··· + a_n) I.

If we now equate like powers of μ on both sides of this equation, we obtain

M_0 = I,
M_1 - A M_0 = a_1 I,
M_2 - A M_1 = a_2 I,
    ⋮                          (11.12)
M_{n-1} - A M_{n-2} = a_{n-1} I,
-A M_{n-1} = a_n I.

Premultiplying³ the jth equation in (11.12) by A^{n-j} (j = 0, 1, ..., n), and then
adding all the equations that result from this yields

A^n M_0 + A^{n-1}(M_1 - A M_0) + A^{n-2}(M_2 - A M_1) + ··· + A(M_{n-1} - A M_{n-2}) - A M_{n-1}
    = A^n + a_1 A^{n-1} + a_2 A^{n-2} + ··· + a_{n-1} A + a_n I.

But the left-hand side of this is seen to be zero because of cancellation of all the
terms, and (11.11) immediately results.
As an exercise, the reader should verify that the matrices in Examples 11.1-11.3
all satisfy their own characteristic equations.
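For the rotation matrix of Example 11.3 this verification is a short computation: p(λ) = λ^2 - 2 cos θ λ + 1, so we must have A^2 - 2 cos θ A + I = 0.

```python
import math

th = math.pi/3
c, s = math.cos(th), math.sin(th)
A = [[c, -s], [s, c]]                    # rotation matrix of Example 11.3

def matmul(X, Y):
    # 2x2 matrix product
    return [[sum(X[i][k]*Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A2 = matmul(A, A)
# form p(A) = A^2 - 2*cos(theta)*A + I entrywise
R = [[A2[i][j] - 2*c*A[i][j] + (1.0 if i == j else 0.0) for j in range(2)]
     for i in range(2)]
print(R)    # numerically the zero matrix, as Cayley-Hamilton demands
```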
We will now consider the approach in Leonard [8], who, however, assumes that
the reader is familiar with the theory of solution of nth-order homogeneous linear
ODEs in constant coefficients

x^{(n)}(t) + c_{n-1} x^{(n-1)}(t) + ··· + c_1 x^{(1)}(t) + c_0 x(t) = 0,   (11.13)

³This means that we must multiply on the left.
where the initial conditions are known. In particular, the reader must know that if
λ is a root of the characteristic equation

λ^n + c_{n-1} λ^{n-1} + ··· + c_1 λ + c_0 = 0,   (11.14)

then if λ has multiplicity m, its contribution to the solution of the initial-value
problem (IVP) (11.13) is of the general form

(a_0 + a_1 t + ··· + a_{m-1} t^{m-1}) e^{λt}.   (11.15)

These matters are considered by Derrick and Grossman [10] and Reid [11]. We
shall be combining these facts with the results of Theorem 11.10 (below).
Leonard [8] presents two theorems that relate the solution of (11.13) to the
computation of Φ(t) = e^{At}.

Theorem 11.9: Leonard I Let A ∈ R^{n×n} be a constant matrix with characteristic
polynomial

p(λ) = det(λI - A) = λ^n + c_{n-1} λ^{n-1} + ··· + c_1 λ + c_0.

Φ(t) = e^{At} is the unique solution to the nth-order matrix differential equation

Φ^{(n)}(t) + c_{n-1} Φ^{(n-1)}(t) + ··· + c_1 Φ^{(1)}(t) + c_0 Φ(t) = 0   (11.16)

with initial conditions

Φ(0) = I, Φ^{(1)}(0) = A, ..., Φ^{(n-2)}(0) = A^{n-2}, Φ^{(n-1)}(0) = A^{n-1}.   (11.17)
Proof We will demonstrate uniqueness of the solution first of all.

Suppose that Φ_1(t) and Φ_2(t) are two solutions to (11.16) for the initial conditions
stated in (11.17). Let Φ(t) = Φ_1(t) - Φ_2(t) for present purposes, in which
case Φ(t) satisfies (11.16) with the initial conditions

Φ(0) = Φ^{(1)}(0) = ··· = Φ^{(n-2)}(0) = Φ^{(n-1)}(0) = 0.

Consequently, each entry of the matrix Φ(t) satisfies a scalar IVP of the form

x^{(n)}(t) + c_{n-1} x^{(n-1)}(t) + ··· + c_1 x^{(1)}(t) + c_0 x(t) = 0,
x(0) = x^{(1)}(0) = ··· = x^{(n-2)}(0) = x^{(n-1)}(0) = 0,

where the solution is x(t) = 0 for all t, so that Φ(t) = 0 for all t ∈ R^+. Thus,
Φ_1(t) = Φ_2(t), and so the solution must be unique (if it exists).
Now we confirm that the solution is Φ(t) = e^{At} (i.e., we confirm existence in
a constructive manner). Let A be a constant matrix of order n with characteristic
polynomial p(λ) as in the theorem statement. If now Φ(t) = e^{At}, then we recall that

Φ^{(k)}(t) = A^k e^{At},  k = 1, 2, ..., n   (11.18)

[see (11.9)] so that

Φ^{(n)}(t) + c_{n-1} Φ^{(n-1)}(t) + ··· + c_1 Φ^{(1)}(t) + c_0 Φ(t)
    = [A^n + c_{n-1} A^{n-1} + ··· + c_1 A + c_0 I] e^{At}
    = p(A) e^{At} = 0

via Theorem 11.8 (Cayley-Hamilton). From (11.18), we obtain

Φ^{(0)}(0) = I, Φ^{(1)}(0) = A, ..., Φ^{(n-2)}(0) = A^{n-2}, Φ^{(n-1)}(0) = A^{n-1},

and so Φ(t) = e^{At} is the unique solution to the IVP in the theorem statement.
Theorem 11.10: Leonard II Let A ∈ R^{n×n} be a constant matrix with characteristic
polynomial

p(λ) = λ^n + c_{n-1} λ^{n-1} + ··· + c_1 λ + c_0;

then

e^{At} = x_0(t) I + x_1(t) A + x_2(t) A^2 + ··· + x_{n-1}(t) A^{n-1},

where x_k(t), k ∈ Z_n are the solutions to the nth-order scalar ODEs

x^{(n)}(t) + c_{n-1} x^{(n-1)}(t) + ··· + c_1 x^{(1)}(t) + c_0 x(t) = 0,

satisfying the initial conditions

x_k^{(j)}(0) = δ_{j-k}

for j, k ∈ Z_n (x_k^{(0)}(t) = x_k(t)).

Proof Let constant matrix A have characteristic polynomial p(λ) as in the
theorem statement. Define

Φ(t) = x_0(t) I + x_1(t) A + x_2(t) A^2 + ··· + x_{n-1}(t) A^{n-1},

where x_k(t), k ∈ Z_n are the unique solutions to the nth-order scalar ODEs

x^{(n)}(t) + c_{n-1} x^{(n-1)}(t) + ··· + c_1 x^{(1)}(t) + c_0 x(t) = 0,
satisfying the initial conditions stated in the theorem. Thus, for all t ∈ R^+

Φ^{(n)}(t) + c_{n-1} Φ^{(n-1)}(t) + ··· + c_1 Φ^{(1)}(t) + c_0 Φ(t)
    = Σ_{k=0}^{n-1} [x_k^{(n)}(t) + c_{n-1} x_k^{(n-1)}(t) + ··· + c_1 x_k^{(1)}(t) + c_0 x_k(t)] A^k
    = 0 · I + 0 · A + ··· + 0 · A^{n-1} = 0.

As well, we see that

Φ(0) = x_0(0) I + x_1(0) A + ··· + x_{n-1}(0) A^{n-1} = I,
Φ^{(1)}(0) = x_0^{(1)}(0) I + x_1^{(1)}(0) A + ··· + x_{n-1}^{(1)}(0) A^{n-1} = A,
    ⋮
Φ^{(n-1)}(0) = x_0^{(n-1)}(0) I + x_1^{(n-1)}(0) A + ··· + x_{n-1}^{(n-1)}(0) A^{n-1} = A^{n-1}.

Therefore

Φ(t) = x_0(t) I + x_1(t) A + ··· + x_{n-1}(t) A^{n-1}

satisfies the IVP

Φ^{(n)}(t) + c_{n-1} Φ^{(n-1)}(t) + ··· + c_1 Φ^{(1)}(t) + c_0 Φ(t) = 0

possessing the initial conditions

Φ^{(k)}(0) = A^k

(k ∈ Z_n). The solution is unique, and so we must conclude that e^{At} = Σ_{k=0}^{n-1} x_k(t) A^k
for all t ∈ R^+, which is the central claim of the theorem.
An example of how to apply the result of Theorem 11.10 is as follows.

Example 11.4 Suppose that

A = [ α  γ ]
    [ 0  β ] ∈ R^{2×2},

which clearly has the eigenvalues λ = α, β. Begin by assuming distinct eigenvalues
for A, specifically, that α ≠ β.

The general solution to the second-order homogeneous ODE

x^{(2)}(t) + c_1 x^{(1)}(t) + c_0 x(t) = 0

with characteristic roots α, β (eigenvalues of A) is [recall (11.15)]

x(t) = a_0 e^{αt} + a_1 e^{βt}.

We have x^{(1)}(t) = a_0 α e^{αt} + a_1 β e^{βt}.
For the initial conditions x(0) = 1, x^{(1)}(0) = 0, we have the linear system
of equations

a_0 + a_1 = 1,
α a_0 + β a_1 = 0,

which solves to yield

a_0 = β/(β - α),  a_1 = -α/(β - α).

Thus, the solution in this case is

x_0(t) = (1/(β - α)) [β e^{αt} - α e^{βt}].
Now, if instead the initial conditions are x(0) = 0, x^{(1)}(0) = 1, we have the linear
system of equations

a_0 + a_1 = 0,
α a_0 + β a_1 = 1,

which solves to yield

a_0 = -1/(β - α),  a_1 = 1/(β - α).

Thus, the solution in this case is

x_1(t) = (1/(β - α)) [-e^{αt} + e^{βt}].
Via Leonard II we must have

e^{At} = x_0(t) I + x_1(t) A
      = (1/(β - α)) [ β e^{αt} - α e^{βt}              0          ]
                    [        0              β e^{αt} - α e^{βt}   ]
      + (1/(β - α)) [ α(-e^{αt} + e^{βt})   γ(-e^{αt} + e^{βt})  ]
                    [        0              β(-e^{αt} + e^{βt})  ]
      = (1/(β - α)) [ (β - α) e^{αt}    γ(e^{βt} - e^{αt})  ]
                    [        0          (β - α) e^{βt}      ]
      = [ e^{αt}    γ(e^{βt} - e^{αt})/(β - α) ]
        [   0       e^{βt}                     ].   (11.19)
Now assume that α = β.

The general solution to

x^{(2)}(t) + c_1 x^{(1)}(t) + c_0 x(t) = 0
with characteristic roots α, α (eigenvalues of A) is [again, recall (11.15)]

x(t) = (a_0 + a_1 t) e^{αt}.

We have x^{(1)}(t) = (a_1 + a_0 α + a_1 α t) e^{αt}.

For the initial conditions x(0) = 1, x^{(1)}(0) = 0, we have the linear system
of equations

a_0 = 1,
a_1 + a_0 α = 0,

which solves to yield a_1 = -α, so that

x_0(t) = (1 - αt) e^{αt}.

Now, if instead the initial conditions are x(0) = 0, x^{(1)}(0) = 1, we have the linear
system of equations

a_0 = 0,
a_1 + a_0 α = 1,

which solves to yield a_1 = 1, so that

x_1(t) = t e^{αt}.

If we again apply Leonard II, then we have

e^{At} = x_0(t) I + x_1(t) A = [ e^{αt}    γ t e^{αt} ]
                               [   0       e^{αt}     ].   (11.20)
A good exercise for the reader is to verify that x(t) = e^{At} x(0) solves dx(t)/dt =
Ax(t) [of course, here x(t) = [x_0(t) x_1(t)]^T is a state vector] in both of the cases
considered in Example 11.4. Do this by direct substitution.
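The closed form (11.19) can also be checked against the defining series (11.8) directly: for modest t the series converges quickly, so a partial sum should agree with the Leonard II result to many digits. The parameter values below are arbitrary test choices.

```python
import math

# Arbitrary test parameters for the triangular A = [[alpha, gamma], [0, beta]].
alpha, beta, gamma, t = -1.0, -2.0, 3.0, 0.5

# e^{At} from the closed form (11.19)
expAt = [[math.exp(alpha*t),
          gamma*(math.exp(beta*t) - math.exp(alpha*t))/(beta - alpha)],
         [0.0, math.exp(beta*t)]]

# partial sum of the series (11.8): S = sum_k (1/k!) A^k t^k
A = [[alpha, gamma], [0.0, beta]]
S = [[1.0, 0.0], [0.0, 1.0]]          # k = 0 term
P = [[1.0, 0.0], [0.0, 1.0]]          # running term A^k t^k / k!
for k in range(1, 30):
    P = [[sum(P[i][m]*A[m][j] for m in range(2))*t/k for j in range(2)]
         for i in range(2)]
    S = [[S[i][j] + P[i][j] for j in range(2)] for i in range(2)]

print(expAt)
print(S)
```

When β - α shrinks toward zero with γ fixed, the divided difference in the (1, 2) entry is exactly the numerically dangerous quantity discussed next.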
Example 11.4 is considered in Moler and Van Loan [7], as it illustrates problems
in computing e^{At} when the eigenvalues of A are nearly multiple. If we consider
(11.19) when β - α is small, and yet not negligible, the "divided difference"

(e^{βt} - e^{αt})/(β - α),   (11.21)

when computed directly, may result in a large relative error. In (11.19) the ratio
(11.21) is multiplied by γ, so the final answer may be very inaccurate indeed.
Matrix A in Example 11.4 is of low order (i.e., n = 2) and is triangular. This type
of problem is very difficult to detect and correct when A is larger and not triangular.
Figure 11.1 An illustration of the hump phenomenon in computing e^{At}.
Another difficulty noted in Moler and Van Loan [7] is sometimes called the hump
phenomenon. It is illustrated in Fig. 11.1 for Eq. (11.19) using the parameters

α = -1.01,  β = -1.00,  γ = 50.   (11.22)

Figure 11.1 is a plot of the matrix 2-norm [spectral norm; recall Equation (4.37) with
p = 2 in Chapter 4] ||e^{At}||_2 versus t. (It is a version of Fig. 1 in Ref. 7.) The
problem with this arises from the fact that, one way or another, some algorithms for
the computation of e^{At} make use of the identity

e^{At} = (e^{At/m})^m.   (11.23)

When s/m is under the hump while s lies beyond it (e.g., in Fig. 11.1, s = 4
with m = 8), we can have

||e^{As}||_2 ≪ ||e^{As/m}||_2^m.   (11.24)

Unfortunately, rounding errors in the mth power of e^{As/m} are usually small only
relative to ||e^{As/m}||_2^m, rather than ||e^{As}||_2. Thus, rounding errors may be a problem
in using (11.23) to compute e^{At}.
The Taylor series expansion in (11.8) is not a good method for computing e^{At}.
The reader should recall the example of catastrophic convergence in the computation
of e^x (x ∈ R) from Chapter 3 (Appendix 3.C). It is not difficult to imagine
that the problem of catastrophic convergence in (11.8) is likely to be much worse,
and much harder to contain. Indeed, this is the case, as shown by an example in
Moler and Van Loan [7].
It was suggested earlier in this section that the series in (11.8) can be shown to
converge by considering the diagonalization of A (assuming that A is nondefective).
Suppose that A e C" x ", and that A possesses eigenvalues that are all distinct, and
so we may apply (11.6). Since
A k = [PAP -1 ]*
= [PAP" 1 ][PAP" 1 ]---[PAP" 1 ][PAP _1 ] = PA k P~\
k factors
(11.25)
TLFeBOOK
THE MATRIX EXPONENTIAL
497
we have
-At
y-A k t k = y-PA k p- 1 t k = p
k=0
k=0
E-^
.k=Q
k t k
(oo 1 oo 1 OO . \
£=0 ' £=0 ' k=0 ' /
= Pdmg(e Ao ',e
Xi)t X\t
')P~
If we define
.At
diag(e A
J-\t
"),
(11.26)
then clearly we can say that
„Al
= Y-A k t k = Pe M p-
/t=0
A:!
(11.27)
We know from the theory of Maclaurin series that $e^x = \sum_{k=0}^{\infty} x^k/k!$ converges for
all $x \in \mathbf{R}$. Thus, each diagonal entry of $e^{\Lambda t}$ converges for all $t \in \mathbf{R}$, and hence the
series in (11.8) converges, and so $e^{At}$ is well defined. Of course, all of this suggests
that $e^{At}$ may be numerically computed using (11.27). From Chapter 3 we infer that
accurate, reliable means to compute $e^x$ (x a scalar) do exist. Also, reliable
methods exist to find the elements of $\Lambda$, and this will be considered later. But P
may be close to singular; that is, the condition number $\kappa(P)$ (recall Chapter 4) may
be large, and so accurate determination of $P^{-1}$, which is required by (11.27), may
be difficult. Additionally, the approach (11.27) lacks generality since it won't work
unless A is nondefective (i.e., can be diagonalized). Matrix factorization methods
to compute $e^{At}$ (including those for the defective case) are considered in greater
detail in Ref. 7, and this matter will not be mentioned further here.
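As a concrete illustration of (11.27), the following Python sketch computes $e^{At}$ for a small matrix via diagonalization and checks it against a truncated form of the series (11.8). The 2 x 2 matrix, its eigenvector matrix P, and all helper names are our own choices for illustration (not from the text); this A has the distinct eigenvalues -1 and -2, so it is nondefective and (11.27) applies directly:

```python
import math

def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# A small nondefective example (our choice): eigenvalues are -1 and -2.
A = [[0.0, 1.0], [-2.0, -3.0]]
P = [[1.0, 1.0], [-1.0, -2.0]]      # columns are eigenvectors of A
P_inv = [[2.0, 1.0], [-1.0, -1.0]]  # inverse of P (det P = -1)

def expm_diag(t):
    """e^{At} = P diag(e^{lambda_k t}) P^{-1}, as in (11.27)."""
    D = [[math.exp(-1.0 * t), 0.0], [0.0, math.exp(-2.0 * t)]]
    return mat_mul(mat_mul(P, D), P_inv)

def expm_series(t, n_terms=30):
    """Truncated series (11.8); adequate here only because ||At|| is small."""
    E = [[1.0, 0.0], [0.0, 1.0]]  # accumulates I + At + (At)^2/2! + ...
    T = [[1.0, 0.0], [0.0, 1.0]]  # current term (At)^k / k!
    At = [[A[i][j] * t for j in range(2)] for i in range(2)]
    for k in range(1, n_terms):
        T = mat_mul(T, At)
        T = [[T[i][j] / k for j in range(2)] for i in range(2)]
        E = [[E[i][j] + T[i][j] for j in range(2)] for i in range(2)]
    return E
```

For this well-conditioned P the two results agree closely; the text's caution applies when $\kappa(P)$ is large, in which case forming $P^{-1}$ loses accuracy.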
In Chapter 4 a condition number $\kappa(A)$ was defined that informed us about the
sensitivity of the solution to $Ax = b$ due to perturbations in A and b. It is possible
to develop a similar notion for the problem of computing $e^{At}$. From Golub and
Van Loan [4], the matrix exponential condition number is

$$\nu(A, t) = \max_{\|E\|_2 \le 1} \left\| \int_0^t e^{A(t-s)} E e^{As}\, ds \right\|_2 \frac{\|A\|_2}{\|e^{At}\|_2}. \qquad (11.28)$$
(The theory behind this originally appeared in Van Loan [12].) In this expression
$E \in \mathbf{R}^{n \times n}$ is a perturbation matrix. The condition number (11.28) measures the
sensitivity of the mapping $A \to e^{At}$ for a given $t \in \mathbf{R}$. For a given t, there is a matrix
E such that

$$\frac{\|e^{(A+E)t} - e^{At}\|_2}{\|e^{At}\|_2} \approx \nu(A, t)\,\frac{\|E\|_2}{\|A\|_2}. \qquad (11.29)$$
We see from this that if $\nu(A, t)$ is large, then a small change in A (modeled by the
perturbation matrix E) can cause a large change in $e^{At}$. In general, it is not easy
to specify A leading to large values for $\nu(A, t)$. However, it is known that

$$\nu(A, t) \ge t\|A\|_2 \qquad (11.30)$$

for $t \in \mathbf{R}^+$, with equality iff A is normal. (Any $A \in \mathbf{C}^{n \times n}$ is normal iff $A^H A = A A^H$.) Thus, it appears that normal matrices are generally the least troublesome
with respect to computing $e^{At}$. From the definition of a normal matrix we see that
real-valued, symmetric matrices are an important special case.
Of the less dubious means to compute $e^{At}$, Golub and Van Loan's Algorithm
11.3.1 [4] is suggested. It is based on Padé approximation, which is the use of
rational functions to approximate other functions. However, we will only refer the
reader to Ref. 4 (or Ref. 7) for the relevant details. A version of Algorithm 11.3.1
[4] is implemented in the MATLAB expm function, and MATLAB provides other
algorithm implementations for computing $e^{At}$.
11.4 THE POWER METHODS
In this section we consider a simple approach to determining the eigenvalues and
eigenvectors of $A \in \mathbf{R}^{n \times n}$. The approach is iterative. The main result is as follows.
Theorem 11.11: Let $A \in \mathbf{R}^{n \times n}$ be such that

(a) A has n linearly independent eigenvectors $x^{(k)}$, corresponding to the eigenpairs $\{(\lambda_k, x^{(k)}) \mid k \in \mathbf{Z}_n\}$.

(b) The eigenvalues satisfy $\lambda_{n-1} \in \mathbf{R}$ with

$$|\lambda_{n-1}| > |\lambda_{n-2}| > |\lambda_{n-3}| > \cdots > |\lambda_1| > |\lambda_0|$$

($\lambda_{n-1}$ is the dominant eigenvalue).

If $y_0 \in \mathbf{R}^n$ is a starting vector such that

$$y_0 = \sum_{j=0}^{n-1} a_j x^{(j)}$$

with $a_{n-1} \ne 0$, then for $y_{k+1} = A y_k$ with $k \in \mathbf{Z}^+$

$$\lim_{k\to\infty} \frac{y_k}{\lambda_{n-1}^k} = c\, x^{(n-1)} \qquad (11.31)$$

for some $c \ne 0$, and (recalling that $\langle x, y \rangle = x^T y = y^T x$ for any $x, y \in \mathbf{R}^n$)

$$\lim_{k\to\infty} \frac{\langle y_0, A^k y_0 \rangle}{\langle y_0, A^{k-1} y_0 \rangle} = \lambda_{n-1}. \qquad (11.32)$$
Proof We observe that $y_k = A^k y_0$, and since $A^k x^{(j)} = \lambda_j^k x^{(j)}$, we have

$$A^k y_0 = \sum_{j=0}^{n-2} a_j \lambda_j^k x^{(j)} + a_{n-1} \lambda_{n-1}^k x^{(n-1)},$$

implying that

$$\frac{A^k y_0}{\lambda_{n-1}^k} = a_{n-1} x^{(n-1)} + \sum_{j=0}^{n-2} a_j \left(\frac{\lambda_j}{\lambda_{n-1}}\right)^k x^{(j)}. \qquad (11.33)$$

Since $|\lambda_j| < |\lambda_{n-1}|$ for $j = 0, 1, \ldots, n-2$, we have $\lim_{k\to\infty} (\lambda_j/\lambda_{n-1})^k = 0$,
and hence

$$\lim_{k\to\infty} \frac{A^k y_0}{\lambda_{n-1}^k} = a_{n-1} x^{(n-1)}.$$
Now $A^k y_0 = \sum_{j=0}^{n-1} a_j \lambda_j^k x^{(j)}$, so that

$$\langle y_0, A^k y_0 \rangle = \sum_{j=0}^{n-1} a_j \lambda_j^k \langle y_0, x^{(j)} \rangle,$$

and if $\eta_j = \langle y_0, x^{(j)} \rangle$, then

$$\frac{\langle y_0, A^k y_0 \rangle}{\langle y_0, A^{k-1} y_0 \rangle} = \frac{a_{n-1} \lambda_{n-1}^k \eta_{n-1} + \sum_{j=0}^{n-2} a_j \lambda_j^k \eta_j}{a_{n-1} \lambda_{n-1}^{k-1} \eta_{n-1} + \sum_{j=0}^{n-2} a_j \lambda_j^{k-1} \eta_j} = \lambda_{n-1} \cdot \frac{a_{n-1} \eta_{n-1} + \sum_{j=0}^{n-2} a_j \eta_j (\lambda_j/\lambda_{n-1})^k}{a_{n-1} \eta_{n-1} + \sum_{j=0}^{n-2} a_j \eta_j (\lambda_j/\lambda_{n-1})^{k-1}}.$$

Again, since $\lim_{k\to\infty} (\lambda_j/\lambda_{n-1})^k = 0$ (for $j = 0, 1, \ldots, n-2$), we have

$$\lim_{k\to\infty} \frac{\langle y_0, A^k y_0 \rangle}{\langle y_0, A^{k-1} y_0 \rangle} = \lambda_{n-1}.$$

We note that

$$1 > \left|\frac{\lambda_{n-2}}{\lambda_{n-1}}\right| > \left|\frac{\lambda_{n-3}}{\lambda_{n-1}}\right| > \cdots > \left|\frac{\lambda_0}{\lambda_{n-1}}\right|,$$

so the rate of convergence of $A^k y_0 / \lambda_{n-1}^k$ to $a_{n-1} x^{(n-1)}$ is, according to (11.33),
dominated by the term containing $\lambda_{n-2}/\lambda_{n-1}$. This is sometimes expressed
by writing

$$\left\| \frac{A^k y_0}{\lambda_{n-1}^k} - a_{n-1} x^{(n-1)} \right\| = O\!\left( \left|\frac{\lambda_{n-2}}{\lambda_{n-1}}\right|^k \right). \qquad (11.34)$$

The choice of norm in (11.34) is arbitrary. Of course, a small value for $|\lambda_{n-2}/\lambda_{n-1}|$
implies faster convergence.
The theory of Theorem 11.11 assumes that A is nondefective. If the algorithm
suggested by this theorem is applied to a defective A, it will attempt to converge.
In effect, for any defective matrix A there is a nondefective matrix close to it, and
so the limiting values in (11.31) and (11.32) will be for a "nearby" nondefective
matrix. However, convergence can be very slow, particularly if the dependent
eigenvectors of A correspond to $\lambda_{n-1}$ and $\lambda_{n-2}$. In this situation the results may
not be meaningful.

If $y_0$ is chosen such that $a_{n-1} = 0$, rounding errors in computing $y_{k+1} = A y_k$
will usually give a $y_k$ with a component in the direction of the eigenvector $x^{(n-1)}$.
Thus, convergence ultimately is the result. To ensure that this happens, it is best
to select $y_0$ with noninteger components.
From (11.31), if k is big enough, then

$$y_k = A^k y_0 \approx c \lambda_{n-1}^k x^{(n-1)},$$

and so $y_k$ is an approximation to $x^{(n-1)}$ (to within some scale factor). However,
if $|\lambda_{n-1}| > 1$, we see that $\|y_k\| \to \infty$ with increasing k, while if $|\lambda_{n-1}| < 1$, we
see that $\|y_k\| \to 0$ with increasing k. Either way, serious numerical problems will
certainly result (overflow in the former case and rounding errors or underflow in the
latter case). This difficulty may be eliminated by the proper scaling of the iterates,
and leads to what is sometimes called the scaled power algorithm:
Input N; { upper limit on the number of iterations }
Input y_0; { starting vector }
y_0 := y_0 / ||y_0||_2; { normalize y_0 to unit norm }
k := 0;
while k < N do begin
  z_{k+1} := A y_k;
  y_{k+1} := z_{k+1} / ||z_{k+1}||_2;
  lambda^{(k+1)} := y_{k+1}^T A y_{k+1};
  k := k + 1;
end;
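A runnable Python version of this pseudocode (our own sketch; the function name scaled_power is not from the text) is shown below, applied to the tridiagonal matrix used later in Example 11.5, whose dominant eigenvalue is $4 + \sqrt{2} \approx 5.41421356$:

```python
import math

def scaled_power(A, y0, n_iter):
    """Scaled power algorithm: z_{k+1} = A y_k, y_{k+1} = z_{k+1}/||z_{k+1}||_2,
    with eigenvalue estimate lambda^{(k+1)} = y_{k+1}^T A y_{k+1}."""
    n = len(A)
    nrm = math.sqrt(sum(v * v for v in y0))
    y = [v / nrm for v in y0]  # normalize starting vector to unit 2-norm
    lam = None
    for _ in range(n_iter):
        z = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
        nz = math.sqrt(sum(v * v for v in z))
        y = [v / nz for v in z]
        Ay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
        lam = sum(y[i] * Ay[i] for i in range(n))
    return lam, y

# Tridiagonal matrix of Example 11.5; dominant eigenvalue is 4 + sqrt(2).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
lam, y = scaled_power(A, [1.0, 1.0, 1.0], 50)
print(lam)  # approaches 5.41421356...
```

The scaling step keeps $\|y_k\|_2 = 1$ at every iteration, so neither overflow nor underflow can occur no matter how large or small $|\lambda_{n-1}|$ is.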
In this algorithm $y_{k+1}$ is the $(k+1)$th estimate of $x^{(n-1)}$, while $\lambda^{(k+1)}$ is the
corresponding estimate of the eigenvalue $\lambda_{n-1}$. From the pseudocode above we may
easily see that

$$y_k = \frac{A^k y_0}{\|A^k y_0\|_2} \qquad (11.35)$$
for all $k \ge 1$. With $y_0 = \sum_{j=0}^{n-1} a_j x^{(j)}$ ($a_j \in \mathbf{R}$ such that $\|y_0\|_2 = 1$)

$$A^k y_0 = \sum_{j=0}^{n-2} a_j A^k x^{(j)} + a_{n-1} A^k x^{(n-1)} = a_{n-1} \lambda_{n-1}^k \Biggl[ x^{(n-1)} + \underbrace{\sum_{j=0}^{n-2} \frac{a_j}{a_{n-1}} \left(\frac{\lambda_j}{\lambda_{n-1}}\right)^k x^{(j)}}_{= u^{(k)}} \Biggr],$$

and as in Theorem 11.11 we see that $\lim_{k\to\infty} \|u^{(k)}\|_2 = 0$, so (11.35) becomes

$$y_k = \frac{a_{n-1} \lambda_{n-1}^k [x^{(n-1)} + u^{(k)}]}{\| a_{n-1} \lambda_{n-1}^k [x^{(n-1)} + u^{(k)}] \|_2} = \mu_k \frac{x^{(n-1)} + u^{(k)}}{\| x^{(n-1)} + u^{(k)} \|_2}, \qquad (11.36)$$
where $\mu_k$ is the sign of $a_{n-1} \lambda_{n-1}^k$ (i.e., $\mu_k \in \{+1, -1\}$). Clearly, as $k \to \infty$, vector
$y_k$ in (11.36) becomes a better and better approximation to eigenvector $x^{(n-1)}$. To
confirm that $\lambda^{(k+1)}$ estimates $\lambda_{n-1}$, consider that from (11.36) we have (for k
sufficiently large)

$$\lambda^{(k+1)} = y_{k+1}^T A y_{k+1} \approx \frac{(x^{(n-1)})^T A x^{(n-1)}}{\|x^{(n-1)}\|_2^2} = \lambda_{n-1}$$

[recall that $A x^{(n-1)} = \lambda_{n-1} x^{(n-1)}$].
If $|\lambda_{n-1}| = |\lambda_{n-2}| > |\lambda_j|$ for $j = 0, 1, \ldots, n-3$, we have two dominant eigenvalues. In this situation, as noted in Quarteroni et al. [13], convergence may or may
not occur. If, for example, $\lambda_{n-1} = \lambda_{n-2}$, then the vector sequence $(y_k)$ converges to
a vector in the subspace of $\mathbf{R}^n$ spanned by $x^{(n-1)}$ and $x^{(n-2)}$. In this case, since
$A \in \mathbf{R}^{n \times n}$, we must have $\lambda_{n-1}, \lambda_{n-2} \in \mathbf{R}$, and hence for $k \ge 1$ we must have

$$A^k y_0 = \sum_{j=0}^{n-3} a_j \lambda_j^k x^{(j)} + a_{n-2} \lambda_{n-2}^k x^{(n-2)} + a_{n-1} \lambda_{n-1}^k x^{(n-1)},$$

implying that

$$\frac{A^k y_0}{\lambda_{n-1}^k} = a_{n-1} x^{(n-1)} + a_{n-2} x^{(n-2)} + \sum_{j=0}^{n-3} a_j \left(\frac{\lambda_j}{\lambda_{n-1}}\right)^k x^{(j)},$$

so that

$$\lim_{k\to\infty} \frac{A^k y_0}{\lambda_{n-1}^k} = a_{n-1} x^{(n-1)} + a_{n-2} x^{(n-2)},$$
which is a vector in a two-dimensional subspace of $\mathbf{R}^n$. On the other hand,
recall Example 11.3, where $n = 2$ so that $\lambda_0 = e^{j\theta}$, $\lambda_1 = e^{-j\theta}$. From (11.35),
$\|y_k\|_2 = 1$, so, because A is a rotation operator, $y_k = A^k y_0 / \|A^k y_0\|_2$ will always
be a point on the unit circle $\{(x, y) \mid x^2 + y^2 = 1\}$. Convergence does not occur
since it is generally not the same point from one iteration to the next (e.g., consider
$\theta = \pi/2$ radians).
Example 11.5 Consider an example based on application of the scaled power
algorithm to the matrix

$$A = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 4 \end{bmatrix}.$$

This matrix turns out to have the eigenvalues

$$\lambda_0 = 2.58578644, \qquad \lambda_1 = 4.00000000, \qquad \lambda_2 = 5.41421356$$

(as may be determined in MATLAB using the eig function). If we define $y_k =
[y_{k,0}\ y_{k,1}\ y_{k,2}]^T \in \mathbf{R}^3$, then, from the algorithm, we obtain the iterates:
 k     y_{k,0}        y_{k,1}        y_{k,2}        $\lambda^{(k)}$
 0     0.57735027     0.57735027     0.57735027     —
 1     0.53916387     0.64699664     0.53916387     5.39534884
 2     0.51916999     0.67891460     0.51916999     5.40988836
 3     0.50925630     0.69376945     0.50925630     5.41322584
 4     0.50444312     0.70076692     0.50444312     5.41398821
 5     0.50212703     0.70408585     0.50212703     5.41416216
 6     0.50101700     0.70566560     0.50101700     5.41420184
 7     0.50048597     0.70641885     0.50048597     5.41421089
 8     0.50023215     0.70677831     0.50023215     5.41421295
 9     0.50011089     0.70694993     0.50011089     5.41421342
10     0.50005296     0.70703187     0.50005296     5.41421353
11     0.50002530     0.70707101     0.50002530     5.41421356
In only 11 iterations the power algorithm has obtained the dominant eigenvalue
to an accuracy of eight decimal places.
Continue to assume that $A \in \mathbf{R}^{n \times n}$ is nondefective. Now define $A_\mu = A - \mu I$
(where I is the $n \times n$ identity matrix, as usual), where $\mu \in \mathbf{R}$ is called the shift
(or shift parameter).⁴ We will assume that $\mu$ always results in the existence of
$A_\mu^{-1}$. Because A is not defective, there will be a nonsingular matrix P such that
$P^{-1} A P = \Lambda = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_{n-1})$ (recall the basic facts from Section 11.2
that justify this). Consequently,

$$A_\mu = P \Lambda P^{-1} - \mu I \;\Rightarrow\; P^{-1} A_\mu P = \Lambda - \mu I, \qquad (11.37)$$

⁴The reason for introducing the shift parameter $\mu$ will be made clear a bit later.
and so $A_\mu^{-1}$ has the eigenvalues

$$\gamma_k = \frac{1}{\lambda_k - \mu}, \qquad k \in \mathbf{Z}_n. \qquad (11.38)$$

P is a similarity transformation that diagonalizes $A_\mu$, giving $\Lambda - \mu I$, so the eigenvalues of $(A - \mu I)^{-1}$ must be the eigenvalues of $(\Lambda - \mu I)^{-1}$, as these are similar matrices.
A modification of the previous scaled power algorithm is the following shifted
inverse power algorithm:

Input N; { upper limit on the number of iterations }
Input y_0; { starting vector }
y_0 := y_0 / ||y_0||_2; { normalize y_0 to unit norm }
k := 0;
while k < N do begin
  A_mu z_{k+1} := y_k; { i.e., solve A_mu z_{k+1} = y_k for z_{k+1} }
  y_{k+1} := z_{k+1} / ||z_{k+1}||_2;
  gamma^{(k+1)} := y_{k+1}^T A y_{k+1};
  k := k + 1;
end;
Assume that the eigenvalues of A satisfy

$$|\lambda_{n-1}| > |\lambda_{n-2}| > \cdots > |\lambda_1| > |\lambda_0|, \qquad (11.39)$$

and also that $\mu = 0$, so that $|\gamma_k| = 1/|\lambda_k|$, and (11.39) yields

$$|\gamma_0| > |\gamma_1| > |\gamma_2| > \cdots > |\gamma_{n-2}| > |\gamma_{n-1}|. \qquad (11.40)$$
We observe that in the shifted inverse power algorithm $A_\mu z_{k+1} = y_k$ is equivalent
to $z_{k+1} = A_\mu^{-1} y_k$, and so $A_\mu^{-1}$ effectively replaces A in the statement $z_{k+1} := A y_k$ in
the scaled power algorithm. This implies that the shifted inverse power algorithm
produces a vector sequence $(y_k)$ that converges to the eigenvector of $A_\mu^{-1} = A^{-1}$
(for $\mu = 0$) corresponding to the eigenvalue $\gamma_0$ ($= 1/\lambda_0$). Since $A x^{(0)} = \lambda_0 x^{(0)}$ implies that
$A^{-1} x^{(0)} = \frac{1}{\lambda_0} x^{(0)} = \gamma_0 x^{(0)}$, then for sufficiently large k, the vector $y_k$ will approximate $x^{(0)}$. The argument to verify this follows the proof of Theorem 11.11.
Therefore, consider the starting vector

$$y_0 = a_0 x^{(0)} + \sum_{j=1}^{n-1} a_j x^{(j)}, \qquad a_0 \ne 0 \qquad (11.41)$$

(such that $\|y_0\|_2 = 1$). We see that we must have

$$A^{-k} y_0 = a_0 \frac{1}{\lambda_0^k} x^{(0)} + \sum_{j=1}^{n-1} a_j \frac{1}{\lambda_j^k} x^{(j)} = a_0 \gamma_0^k x^{(0)} + \sum_{j=1}^{n-1} a_j \gamma_j^k x^{(j)}, \qquad (11.42)$$
and thus

$$\frac{A^{-k} y_0}{\gamma_0^k} = a_0 x^{(0)} + \sum_{j=1}^{n-1} a_j \left(\frac{\gamma_j}{\gamma_0}\right)^k x^{(j)}.$$

Immediately, because of (11.40), we must have

$$\lim_{k\to\infty} \frac{A^{-k} y_0}{\gamma_0^k} = a_0 x^{(0)}.$$

From (11.42), we obtain

$$A^{-k} y_0 = a_0 \gamma_0^k \Biggl[ x^{(0)} + \underbrace{\sum_{j=1}^{n-1} \frac{a_j}{a_0} \left(\frac{\gamma_j}{\gamma_0}\right)^k x^{(j)}}_{= v^{(k)}} \Biggr]. \qquad (11.43)$$
In much the same way as we arrived at (11.35), for the shifted inverse power
algorithm we must have, for all $k \ge 1$,

$$y_k = \frac{A_\mu^{-k} y_0}{\|A_\mu^{-k} y_0\|_2}. \qquad (11.44)$$

From (11.43), for $\mu = 0$ this equation becomes

$$y_k = \frac{a_0 \gamma_0^k [x^{(0)} + v^{(k)}]}{\|a_0 \gamma_0^k [x^{(0)} + v^{(k)}]\|_2} = \nu_k \frac{x^{(0)} + v^{(k)}}{\|x^{(0)} + v^{(k)}\|_2}, \qquad (11.45)$$
where $\nu_k \in \{+1, -1\}$. From the pseudocode for the shifted inverse power algorithm,
if k is large enough, via (11.45) we obtain

$$\gamma^{(k+1)} = y_{k+1}^T A y_{k+1} \approx \frac{(x^{(0)})^T A x^{(0)}}{\|x^{(0)}\|_2^2} = \lambda_0. \qquad (11.46)$$

Thus, $\gamma^{(k+1)}$ is an approximation to $\lambda_0$.
In summary, for a nondefective $A \in \mathbf{R}^{n \times n}$ such that (11.39) holds, the shifted
inverse power algorithm will generate a sequence of increasingly better approximations to the eigenpair $(\lambda_0, x^{(0)})$ when we set $\mu = 0$.
We note that the scaled power algorithm needs $O(n^2)$ flops (recall the definition
of flops from Chapter 4) at every iteration. This is due mainly to the matrix-vector
product step $z_{k+1} = A y_k$. To solve the linear system $A_\mu z_{k+1} = y_k$ requires $O(n^3)$
flops in general. To save on computations in the implementation of the shifted
inverse power algorithm, it is often best to LU-decompose $A_\mu$ (recall Section 4.5)
only once: $A_\mu = LU$. At each iteration $LU z_{k+1} = y_k$ may be solved using forward
and backward substitution, which is efficient since this needs only $O(n^2)$ flops at
every iteration. Even so, because of the need to compute the LU decomposition of
$A_\mu$, the shifted inverse power algorithm still needs $O(n^3)$ flops overall. It is thus
intrinsically more computation-intensive than the scaled power algorithm.
However, it is the concept of a shift that makes the shifted inverse power algorithm attractive, at least in some circumstances. But before we consider the reason
for introducing the shift parameter $\mu$, the reader should view the following example.
Example 11.6 Let us reconsider matrix A from Example 11.5. If we apply
the shifted inverse power algorithm using $\mu = 0$ to this matrix, we obtain the
following iterates:
 k     y_{k,0}        y_{k,1}         y_{k,2}        $\gamma^{(k)}$
 0     0.57735027     0.57735027      0.57735027     —
 1     0.63960215     0.42640143      0.63960215     5.09090909
 2     0.70014004     0.14002801      0.70014004     4.39215686
 3     0.69011108    -0.21792981      0.69011108     3.39841689
 4     0.62357865    -0.47148630      0.62357865     2.82396484
 5     0.56650154    -0.59845803      0.56650154     2.64389042
 6     0.53330904    -0.65662999      0.53330904     2.59925317
 7     0.51623508    -0.68337595      0.51623508     2.58886945
 8     0.50782504    -0.69586455      0.50782504     2.58649025
 9     0.50375306    -0.70175901      0.50375306     2.58594700
10     0.50179601    -0.70455768      0.50179601     2.58582306
11     0.50085857    -0.70589049      0.50085857     2.58579479
12     0.50041023    -0.70652615      0.50041023     2.58578834
13     0.50019597    -0.70682953      0.50019597     2.58578687
14     0.50009360    -0.70697438      0.50009360     2.58578654
15     0.50004471    -0.70704355      0.50004471     2.58578646
16     0.50002135    -0.70707658      0.50002135     2.58578644
We see that the method converges to an estimate of $\lambda_0$ (the smallest eigenvalue of
A) that is accurate to eight decimal places in only 16 iterations.

As an exercise, the reader should confirm that the vector $y_k$ in the bottom row
of the table above is an estimate of the eigenvector for $\lambda_0$. The reader should do
the same for Example 11.5.
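A Python sketch of the shifted inverse power algorithm follows (the helpers solve and shifted_inverse_power are our own names, not from the text). Rather than forming $A_\mu^{-1}$ explicitly, each iteration solves $A_\mu z_{k+1} = y_k$ by Gaussian elimination; with the matrix of Example 11.5 and a zero shift it reproduces the behavior of Example 11.6, converging to $\lambda_0 = 4 - \sqrt{2}$, and a nonzero shift (anticipating the discussion that follows) converges to the same eigenvalue in fewer iterations:

```python
import math

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [M[i][:] + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x

def shifted_inverse_power(A, mu, y0, n_iter):
    """Each pass solves A_mu z_{k+1} = y_k rather than inverting A_mu;
    gamma^{(k+1)} = y_{k+1}^T A y_{k+1} estimates the eigenvalue of A nearest mu."""
    n = len(A)
    Amu = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
    nrm = math.sqrt(sum(v * v for v in y0))
    y = [v / nrm for v in y0]
    gamma = None
    for _ in range(n_iter):
        z = solve(Amu, y)
        nz = math.sqrt(sum(v * v for v in z))
        y = [v / nz for v in z]
        Ay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
        gamma = sum(y[i] * Ay[i] for i in range(n))
    return gamma, y

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
g0, _ = shifted_inverse_power(A, 0.0, [1.0, 1.0, 1.0], 40)  # as in Example 11.6
g2, _ = shifted_inverse_power(A, 2.0, [1.0, 1.0, 1.0], 15)  # nonzero shift
print(g0, g2)
```

In a serious implementation one would LU-decompose $A_\mu$ once and reuse the factors each iteration, as the text notes; the repeated full elimination here keeps the sketch short.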
Recall again that $A_\mu^{-1}$ is assumed to exist (so that $\mu$ is not an eigenvalue of A).
Observe that $A x^{(j)} = \lambda_j x^{(j)}$, so that $(A - \mu I) x^{(j)} = (\lambda_j - \mu) x^{(j)}$, and therefore

$$A_\mu^{-1} x^{(j)} = \frac{1}{\lambda_j - \mu} x^{(j)} = \gamma_j x^{(j)}.$$

Suppose that there is an $m \in \mathbf{Z}_n$ such that

$$|\lambda_m - \mu| < |\lambda_j - \mu| \qquad (11.47)$$

for all $j \in \mathbf{Z}_n$ with $j \ne m$; that is, $\lambda_m$ is closest to $\mu$ of all the eigenvalues of
A. This really says that $\lambda_m$ has a multiplicity of one (i.e., is simple). Now consider
the starting vector

$$y_0 = \sum_{\substack{j=0 \\ j \ne m}}^{n-1} a_j x^{(j)} + a_m x^{(m)} \qquad (11.48)$$

with $a_m \ne 0$ and $\|y_0\|_2 = 1$. Clearly,

$$A_\mu^{-k} y_0 = \sum_{\substack{j=0 \\ j \ne m}}^{n-1} a_j \gamma_j^k x^{(j)} + a_m \gamma_m^k x^{(m)},$$

implying that

$$\frac{A_\mu^{-k} y_0}{\gamma_m^k} = a_m x^{(m)} + \sum_{\substack{j=0 \\ j \ne m}}^{n-1} a_j \left(\frac{\gamma_j}{\gamma_m}\right)^k x^{(j)}. \qquad (11.49)$$

Now via (11.47)

$$\left|\frac{\gamma_j}{\gamma_m}\right| = \frac{|\lambda_m - \mu|}{|\lambda_j - \mu|} < 1.$$

Therefore, via (11.49)

$$\lim_{k\to\infty} \frac{A_\mu^{-k} y_0}{\gamma_m^k} = a_m x^{(m)}.$$
This implies that the vector sequence $(y_k)$ in the shifted inverse power algorithm
converges to $x^{(m)}$. Put simply, by the proper selection of the shift parameter $\mu$, we
can extract just about any eigenpair of A that we wish to (as long as $\lambda_m$ is simple).
Thus, in this sense, the shifted inverse power algorithm is more general than the
scaled power algorithm. The following example illustrates another important point.
Example 11.7 Once again we apply the shifted inverse power algorithm to
matrix A from Example 11.5. However, now we select $\mu = 2$. The resulting
sequence of iterates for this case is as follows:
 k     y_{k,0}        y_{k,1}         y_{k,2}        $\gamma^{(k)}$
 0     0.57735027     0.57735027      0.57735027     —
 1     0.70710678     0.00000000      0.70710678     4.00000000
 2     0.57735027    -0.57735027      0.57735027     2.66666667
 3     0.51449576    -0.68599434      0.51449576     2.58823529
 4     0.50251891    -0.70352647      0.50251891     2.58585859
 5     0.50043309    -0.70649377      0.50043309     2.58578856
 6     0.50007433    -0.70700164      0.50007433     2.58578650
 7     0.50001275    -0.70708874      0.50001275     2.58578644
We see that convergence to the smallest eigenvalue of A has now occurred in
only seven iterations, which is faster than the case considered in Example 11.6 (for
which we used $\mu = 0$).
We see that this example illustrates the fact that a properly chosen shift param-
eter can greatly accelerate the convergence of iterative eigenproblem solvers. This
notion of shifting to improve convergence rates is also important in practical imple-
mentations of QR iteration methods for solving eigenproblems (next section).
So far our methods extract only one eigenvalue from A at a time. One may
apply a method called deflation to extract all the eigenvalues of A under certain
conditions. Begin by noting the following elementary result.
Lemma 11.1: Suppose that $B \in \mathbf{R}^{(n-1) \times (n-1)}$, that $B^{-1}$ exists, and that
$r \in \mathbf{R}^{n-1}$; then

$$\begin{bmatrix} 1 & r^T \\ 0 & B \end{bmatrix}^{-1} = \begin{bmatrix} 1 & -r^T B^{-1} \\ 0 & B^{-1} \end{bmatrix}. \qquad (11.50)$$

Proof Exercise.
The deflation procedure is based on the following theorem.
Theorem 11.12: Deflation Suppose that $A_n \in \mathbf{R}^{n \times n}$, that eigenvalue $\lambda_i \in \mathbf{R}$
for all $i \in \mathbf{Z}_n$, and that all the eigenvalues of $A_n$ are distinct. The dominant eigenpair
of $A_n$ is $(\lambda_{n-1}, x^{(n-1)})$, and we assume that $\|x^{(n-1)}\|_2 = 1$. Suppose that $Q_n \in
\mathbf{R}^{n \times n}$ is an orthogonal matrix such that $Q_n x^{(n-1)} = [1\ 0\ \cdots\ 0]^T = e_0$; then

$$Q_n A_n Q_n^T = \begin{bmatrix} \lambda_{n-1} & a_{n-1}^T \\ 0 & A_{n-1} \end{bmatrix}. \qquad (11.51)$$
Proof $Q_n$ exists because it can be a Householder transformation matrix (recall
Section 4.6). Any eigenvector $x^{(k)}$ of $A_n$ can always be normalized so that
$\|x^{(k)}\|_2 = 1$.

Following (11.5), we have

$$A_n [x^{(n-1)}\ x^{(n-2)}\ \cdots\ x^{(1)}\ x^{(0)}] = [x^{(n-1)}\ x^{(n-2)}\ \cdots\ x^{(1)}\ x^{(0)}]\,\mathrm{diag}(\lambda_{n-1}, \lambda_{n-2}, \ldots, \lambda_1, \lambda_0),$$

that is, $A_n T_n = T_n D_n$ with $T_n = [x^{(n-1)}\ \cdots\ x^{(0)}]$ and $D_n = \mathrm{diag}(\lambda_{n-1}, \ldots, \lambda_0)$.
Thus ($Q_n^T = Q_n^{-1}$ via orthogonality)

$$Q_n A_n Q_n^T = Q_n T_n D_n T_n^{-1} Q_n^T = (Q_n T_n) D_n (Q_n T_n)^{-1}. \qquad (11.52)$$

Now

$$Q_n T_n = [e_0\ \; Q_n x^{(n-2)}\ \cdots\ Q_n x^{(0)}] = \begin{bmatrix} 1 & b_{n-1}^T \\ 0 & B_{n-1} \end{bmatrix},$$

and via Lemma 11.1, we have

$$(Q_n T_n)^{-1} = \begin{bmatrix} 1 & -b_{n-1}^T B_{n-1}^{-1} \\ 0 & B_{n-1}^{-1} \end{bmatrix}.$$

Thus, (11.52) becomes

$$Q_n A_n Q_n^T = \begin{bmatrix} 1 & b_{n-1}^T \\ 0 & B_{n-1} \end{bmatrix} \begin{bmatrix} \lambda_{n-1} & 0 \\ 0 & D_{n-1} \end{bmatrix} \begin{bmatrix} 1 & -b_{n-1}^T B_{n-1}^{-1} \\ 0 & B_{n-1}^{-1} \end{bmatrix} = \begin{bmatrix} \lambda_{n-1} & b_{n-1}^T (D_{n-1} - \lambda_{n-1} I_{n-1}) B_{n-1}^{-1} \\ 0 & B_{n-1} D_{n-1} B_{n-1}^{-1} \end{bmatrix},$$

which has the form given in (11.51), with $A_{n-1} = B_{n-1} D_{n-1} B_{n-1}^{-1}$.
From Theorem 11.6, $Q_n A_n Q_n^T$ and $A_n$ are similar matrices, and so have the
same eigenvalues. Via (11.51), $A_{n-1}$ has the same eigenvalues as $A_n$, except
for $\lambda_{n-1}$. Clearly, the scaled power method could be used to find the eigenpair
$(\lambda_{n-1}, x^{(n-1)})$. The Householder procedure from Section 4.6 gives $Q_n$. From Theorem 11.12 we obtain $A_{n-1}$, and the deflation procedure may be repeated to find
all the remaining eigenvalues of $A = A_n$.
It is important to note that the deflation procedure may be improved with respect
to computational efficiency by employing instead the Rayleigh quotient iteration
method. This replaces the power methods we have considered so far. This approach
is suggested and considered in detail in Golub and Van Loan [4] and Epperson [14];
we omit the details here.
11.5 QR ITERATIONS
The power methods of Section 11.4 and variations thereof such as Rayleigh quotient
iterations are deficient in that they are not computationally efficient methods for
computing all possible eigenpairs. The power methods are really at their best when
we seek only a few eigenpairs (usually corresponding to either the smallest or
the largest eigenvalues). In Section 11.4 the power methods were applied only to
computing real-valued eigenvalues, but it is noteworthy that power methods can
be adapted to finding complex-conjugate eigenvalue pairs [19].
The QR iterations algorithms are, according to Watkins [15], due originally to
Francis [16] and Kublanovskaya [17]. The methodology involved in QR iterations
is based, in turn, on earlier work of H. Rutishauser performed in the 1950s. The
detailed theory and rationale for the QR iterations are not by any means straightforward, and even the geometric arguments in Ref. 15 (based, in turn, on the work
of Parlett and Poole [18]) are not easy to follow. However, for matrices $A \in \mathbf{C}^{n \times n}$
that are dense (i.e., nonsparse; recall Section 4.7), that are not too large, and that
are nondefective, the QR iterations are the best approach presently known for finding all possible eigenpairs of A. Indeed, the MATLAB eig function implements a
modern version of the QR iteration methodology.⁵
Because of the highly involved nature of the QR iteration theory, we will only
present a few of the main ideas here. Other than the references cited so far, the
reader is referred to the literature [4,6,13,19] for more thorough discussions. Of
course, these are not the only references available on this subject.
Eigenvalue computations such as the QR iterations reduce large problems into
smaller problems. Golub and Van Loan [4] present two lemmas that are involved
in this reduction approach. Recall from Section 4.7 that s(A) denotes the set of all
the eigenvalues of matrix A (and is also called the spectrum of A).
Lemma 11.2: If $A \in \mathbf{C}^{n \times n}$ is of the form

$$A = \begin{bmatrix} A_{00} & A_{01} \\ 0 & A_{11} \end{bmatrix},$$

where $A_{00} \in \mathbf{C}^{p \times p}$, $A_{01} \in \mathbf{C}^{p \times q}$, and $A_{11} \in \mathbf{C}^{q \times q}$ ($q + p = n$), then $s(A) = s(A_{00}) \cup s(A_{11})$.
Proof Consider

$$Ax = \begin{bmatrix} A_{00} & A_{01} \\ 0 & A_{11} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

($x_1 \in \mathbf{C}^p$ and $x_2 \in \mathbf{C}^q$). If $x_2 \ne 0$, then $A_{11} x_2 = \lambda x_2$, and so we conclude that
$\lambda \in s(A_{11})$. On the other hand, if $x_2 = 0$, then $A_{00} x_1 = \lambda x_1$, so we must have
$\lambda \in s(A_{00})$. Thus, $s(A) \subseteq s(A_{00}) \cup s(A_{11})$. Sets $s(A)$ and $s(A_{00}) \cup s(A_{11})$ have
the same cardinality (i.e., the same number of elements), and so $s(A) = s(A_{00}) \cup s(A_{11})$.
⁵If $A \in \mathbf{C}^{n \times n}$, then [V, D] = eig(A) is such that

$$A = V D V^{-1},$$

where $D \in \mathbf{C}^{n \times n}$ is the diagonal matrix of eigenvalues of A and $V \in \mathbf{C}^{n \times n}$ is the matrix whose columns
are the corresponding eigenvectors. The eigenvectors in V are "normalized" so that each eigenvector
has a 2-norm of unity.
Essentially, Lemma 11.2 states that if A is block upper triangular, then its eigenvalues lie within the diagonal blocks.
Lemma 11.3: If $A \in \mathbf{C}^{n \times n}$, $B \in \mathbf{C}^{p \times p}$, and $X \in \mathbf{C}^{n \times p}$ (with $p \le n$) satisfy

$$AX = XB, \qquad \mathrm{rank}(X) = p, \qquad (11.53)$$

then there is a unitary $Q \in \mathbf{C}^{n \times n}$ (so $Q^{-1} = Q^H$) such that

$$Q^H A Q = T = \begin{bmatrix} T_{00} & T_{01} \\ 0 & T_{11} \end{bmatrix}, \qquad (11.54)$$

where $T_{00} \in \mathbf{C}^{p \times p}$, $T_{01} \in \mathbf{C}^{p \times (n-p)}$, $T_{11} \in \mathbf{C}^{(n-p) \times (n-p)}$, and $s(T_{00}) = s(A) \cap s(B)$.
Proof The QR decomposition idea from Section 4.6 generalizes to any $X \in
\mathbf{C}^{n \times p}$ with $p \le n$ and $\mathrm{rank}(X) = p$; that is, complex-valued Householder matrices
are available. Thus, there is a unitary matrix $Q \in \mathbf{C}^{n \times n}$ such that

$$X = Q \begin{bmatrix} R \\ 0 \end{bmatrix},$$

where $R \in \mathbf{C}^{p \times p}$. Substituting this into (11.53) yields

$$\begin{bmatrix} T_{00} & T_{01} \\ T_{10} & T_{11} \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix} = \begin{bmatrix} R \\ 0 \end{bmatrix} B, \quad \text{where} \quad Q^H A Q = \begin{bmatrix} T_{00} & T_{01} \\ T_{10} & T_{11} \end{bmatrix}. \qquad (11.55)$$

From (11.55), $T_{10} R = 0$, implying that $T_{10} = 0$ [yielding (11.54)], and also $T_{00} R =
R B$, implying that $B = R^{-1} T_{00} R$ ($R^{-1}$ exists because X is full rank). $T_{00}$ and
B are similar matrices, so $s(B) = s(T_{00})$. From Lemma 11.2 we have $s(A) =
s(T_{00}) \cup s(T_{11})$. Thus, $s(A) = s(B) \cup s(T_{11})$. From basic properties regarding sets
(distributive laws)

$$s(T_{00}) \cap s(A) = s(T_{00}) \cap [s(B) \cup s(T_{11})] = [s(T_{00}) \cap s(B)] \cup [s(T_{00}) \cap s(T_{11})] = s(T_{00}) \cup \varnothing,$$

implying that $s(T_{00}) = s(T_{00}) \cap s(A) = s(A) \cap s(B)$. This statement [$s(T_{00}) =
s(A) \cap s(B)$] really says that the eigenvalues of B are a subset of those of A.
Recall that a subspace of the vector space $\mathbf{C}^n$ is a subset of $\mathbf{C}^n$ that is also a vector
space. Suppose that we have the vectors $x_0, \ldots, x_{m-1} \in \mathbf{C}^n$; then we may define
the spanning set as

$$\mathrm{span}(x_0, \ldots, x_{m-1}) = \left\{ \sum_{j=0}^{m-1} a_j x_j \;\Big|\; a_j \in \mathbf{C} \right\}. \qquad (11.56)$$

In particular, if $S = \mathrm{span}(x)$, where x is an eigenvector of A, then

$$y \in S \;\Rightarrow\; Ay \in S,$$

and so S is invariant for A, or invariant to the action of A. It is a subspace
(an eigenspace) of $\mathbf{C}^n$ that is invariant to A. Lemmas 11.2 and 11.3 can be used to
establish the following important decomposition theorem (Theorem 7.4.1 in Ref.
4). We emphasize that it is for real-valued A only.
Theorem 11.13: Real Schur Decomposition If $A \in \mathbf{R}^{n \times n}$, then there is an
orthogonal matrix $Q \in \mathbf{R}^{n \times n}$ such that

$$Q^T A Q = \begin{bmatrix} R_{00} & R_{01} & \cdots & R_{0,m-1} \\ 0 & R_{11} & \cdots & R_{1,m-1} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m-1,m-1} \end{bmatrix} = \mathcal{R}, \qquad (11.57)$$

where each $R_{ii}$ is either $1 \times 1$ or $2 \times 2$. In the latter case $R_{ii}$ will have a complex-conjugate pair of eigenvalues.
Proof The matrix $A \in \mathbf{R}^{n \times n}$, so $\det(\lambda I - A)$ has real-valued coefficients, and
so complex eigenvalues of A always occur in conjugate pairs (recall Section 11.2).
Let k be the number of complex-conjugate eigenvalue pairs in $s(A)$. We will
employ mathematical induction on k.

The theorem certainly holds for $k = 0$ via Lemmas 11.2 and 11.3, since real-valued matrices are only a special case. Now we assume that $k \ge 1$ (i.e., A possesses
at least one conjugate pair of eigenvalues). Suppose that an eigenvalue is $\lambda =
\alpha + j\beta \in s(A)$ with $\beta \ne 0$. There must be vectors $x, y \in \mathbf{R}^n$ (with $y \ne 0$) such that

$$A(x + jy) = (\alpha + j\beta)(x + jy),$$

or equivalently

$$A[x\ y] = [x\ y] \begin{bmatrix} \alpha & \beta \\ -\beta & \alpha \end{bmatrix}. \qquad (11.58)$$

Since $\beta \ne 0$, vectors x and y span a two-dimensional subspace of $\mathbf{R}^n$ that is
invariant to the action of A because of (11.58). From Lemma 11.3 there is an
orthogonal matrix $U_1 \in \mathbf{R}^{n \times n}$ such that

$$U_1^T A U_1 = \begin{bmatrix} T_{00} & T_{01} \\ 0 & T_{11} \end{bmatrix},$$

where $T_{00} \in \mathbf{R}^{2 \times 2}$, $T_{01} \in \mathbf{R}^{2 \times (n-2)}$, $T_{11} \in \mathbf{R}^{(n-2) \times (n-2)}$, and $s(T_{00}) = \{\lambda, \lambda^*\}$. By
induction there is another orthogonal matrix $U_2$ such that $U_2^T T_{11} U_2$ has the necessary structure. Equation (11.57) then follows by letting

$$Q = U_1 \begin{bmatrix} I_2 & 0 \\ 0 & U_2 \end{bmatrix},$$

where $I_2$ is the $2 \times 2$ identity matrix. Of course, this process may be repeated as
often as needed.
A method that reliably gives us the blocks $R_{ii}$ for all i in (11.57) therefore gives
us all the eigenvalues of A, since each $R_{ii}$ is only $1 \times 1$ or $2 \times 2$, making its eigenvalues
easy to find in any case. The elements in the first subdiagonal of $Q^T A Q$ in (11.57)
are not necessarily zero-valued (again because $R_{ii}$ might be $2 \times 2$), so we say that
$Q^T A Q$ is upper quasi-triangular.
Definition 11.5: Hessenberg Form Matrix $A \in \mathbf{C}^{n \times n}$ is in Hessenberg form
if $a_{ij} = 0$ for all i, j such that $i - j > 1$.

Technically, A in this definition is upper Hessenberg. Matrix A in Example 4.4
is Hessenberg. All upper triangular matrices are Hessenberg. The quasi-triangular
matrix $Q^T A Q$ in Theorem 11.13 is Hessenberg.
A pseudocode for the basic QR iterations algorithm is

Input N; { Upper limit on the number of iterations }
Input A in R^{n x n}; { Matrix we want to eigendecompose }
H_0 := Q_0^T A Q_0; { Reduce A to Hessenberg form }
k := 1;
while k <= N do begin
  H_{k-1} =: Q_k R_k; { QR-decomposition step }
  H_k := R_k Q_k;
  k := k + 1;
end;
In this algorithm we emphasize that A is assumed to be real-valued. Generalization
to the complex case is possible but omitted. The statement $H_0 := Q_0^T A Q_0$ generally
involves applying an orthogonal transformation $Q_0 \in \mathbf{R}^{n \times n}$ to reduce A to the Hessenberg
matrix $H_0$, although in principle this is not necessary. However, there are major
advantages (discussed below) to reducing A to Hessenberg form as a first step.
The basis for this initial reduction step is the following theorem, which proves that
such a step is always possible.
Theorem 11.14: Hessenberg Reduction If $A \in \mathbf{R}^{n \times n}$, there is an orthogonal
matrix $Q \in \mathbf{R}^{n \times n}$ such that $Q^T A Q = H$ is Hessenberg.
Proof From Section 4.6, in general there is an orthogonal matrix P such that
$Px = \|x\|_2 e_0$ (e.g., P is a Householder matrix), where $e_0 = [1\ 0\ \cdots\ 0]^T \in \mathbf{R}^n$
if $x \in \mathbf{R}^n$.

Partition A according to

$$A^{(0)} = A = \begin{bmatrix} a_{00}^{(0)} & b_0^T \\ a_0 & A_{11}^{(0)} \end{bmatrix},$$

where $a_{00}^{(0)} \in \mathbf{R}$, $a_0, b_0 \in \mathbf{R}^{n-1}$, and $A_{11}^{(0)} \in \mathbf{R}^{(n-1) \times (n-1)}$. Let $P_1$ be orthogonal such
that $P_1 a_0 = \|a_0\|_2 e_0$ ($e_0 \in \mathbf{R}^{n-1}$). Define

$$Q_1 = \begin{bmatrix} 1 & 0^T \\ 0 & P_1 \end{bmatrix},$$

and clearly $Q_1^T = Q_1^{-1}$ (i.e., $Q_1$ is also orthogonal). Thus

$$A^{(1)} = Q_1 A^{(0)} Q_1^T = \begin{bmatrix} a_{00}^{(0)} & b_0^T P_1^T \\ \|a_0\|_2 e_0 & P_1 A_{11}^{(0)} P_1^T \end{bmatrix}.$$

The first column of $A^{(1)}$ satisfies the Hessenberg condition since it is
$[a_{00}^{(0)}\ \;\|a_0\|_2\ \;\underbrace{0 \cdots 0}_{n-2\ \text{zeros}}]^T$. The process may be repeated again by partitioning $A^{(1)}$
according to

$$A^{(1)} = \begin{bmatrix} A_{00}^{(1)} & b_1^T \\ [0\ a_1] & A_{11}^{(1)} \end{bmatrix},$$

where $A_{00}^{(1)} \in \mathbf{R}^{2 \times 2}$, $[0\ a_1], b_1 \in \mathbf{R}^{(n-2) \times 2}$, and $A_{11}^{(1)} \in \mathbf{R}^{(n-2) \times (n-2)}$. Let $P_2$ be
orthogonal such that $P_2 a_1 = \|a_1\|_2 e_0$ ($e_0 \in \mathbf{R}^{n-2}$). Define

$$Q_2 = \begin{bmatrix} I_2 & 0 \\ 0 & P_2 \end{bmatrix},$$

where $I_2$ is the $2 \times 2$ identity matrix. Thus

$$A^{(2)} = Q_2 A^{(1)} Q_2^T = Q_2 Q_1 A^{(0)} Q_1^T Q_2^T = \begin{bmatrix} A_{00}^{(1)} & b_1^T P_2^T \\ [0\ \;\|a_1\|_2 e_0] & P_2 A_{11}^{(1)} P_2^T \end{bmatrix},$$

and the first two columns of $A^{(2)}$ satisfy the Hessenberg condition. Of course, we
may continue in this fashion, finally yielding

$$A^{(n-2)} = Q_{n-2} \cdots Q_2 Q_1 A Q_1^T Q_2^T \cdots Q_{n-2}^T,$$

which is Hessenberg. We may define $Q^T = Q_{n-2} \cdots Q_2 Q_1$ and $H = A^{(n-2)}$,
which is the claim made in the theorem statement.
Thus, Theorem 11.14 contains a prescription for finding $H_0 = Q_0^T A Q_0$ as well
as a simple proof of the existence of the decomposition. Hessenberg reduction is done
to reduce the amount of computation per iteration. Clearly, A and $H_0$
are similar matrices and so possess the same eigenvalues.
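The constructive proof above translates directly into code. The following Python sketch (our own; the function name hessenberg is not from the text) applies Householder similarity transforms to zero out each column below the first subdiagonal, exactly the sequence $A^{(0)}, A^{(1)}, \ldots, A^{(n-2)}$ of the proof:

```python
import math

def hessenberg(A):
    """Reduce A (n x n, list of lists) to upper Hessenberg form by Householder
    similarity transforms, following the proof of Theorem 11.14."""
    n = len(A)
    H = [row[:] for row in A]
    for k in range(n - 2):
        # Householder vector that zeroes H[k+2..n-1][k].
        x = [H[i][k] for i in range(k + 1, n)]
        alpha = math.sqrt(sum(w * w for w in x))
        if alpha == 0.0:
            continue  # column already in Hessenberg form
        if x[0] > 0.0:
            alpha = -alpha  # sign choice avoids cancellation in v[0]
        v = x[:]
        v[0] -= alpha
        vv = sum(w * w for w in v)
        # Apply P = I - 2 v v^T / (v^T v) from the left (rows k+1..n-1) ...
        for j in range(n):
            s = sum(v[i] * H[k + 1 + i][j] for i in range(len(v)))
            for i in range(len(v)):
                H[k + 1 + i][j] -= 2.0 * s * v[i] / vv
        # ... and from the right (columns k+1..n-1), keeping similarity.
        for i in range(n):
            s = sum(H[i][k + 1 + j] * v[j] for j in range(len(v)))
            for j in range(len(v)):
                H[i][k + 1 + j] -= 2.0 * s * v[j] / vv
    return H

# Demonstration on an arbitrary 4 x 4 matrix (our choice, trace = 14).
A4 = [[4.0, 1.0, 2.0, 3.0],
      [1.0, 3.0, 0.0, 1.0],
      [2.0, 0.0, 5.0, 1.0],
      [3.0, 1.0, 1.0, 2.0]]
H4 = hessenberg(A4)
```

Because only similarity transforms are applied, H4 has the same eigenvalues (and trace) as A4, while every entry below the first subdiagonal is driven to zero.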
From the pseudocode, for $k = 1, 2, \ldots, N$ we obtain

$$H_{k-1} = Q_k R_k, \qquad H_k = R_k Q_k,$$

which yields

$$H_N = Q_N^T \cdots Q_2^T Q_1^T H_0 Q_1 Q_2 \cdots Q_N, \qquad (11.59)$$

and therefore

$$H_N = Q_N^T \cdots Q_2^T Q_1^T Q_0^T A Q_0 Q_1 Q_2 \cdots Q_N. \qquad (11.60)$$
Matrices $H_N$ and A are similar, and so have the same eigenvalues for any N. It is
important to note that if $Q_k$ is constructed properly, then $H_k$ is Hessenberg for all
k. As explained in Golub and Van Loan [4, Section 7.4.2], the use of orthogonal
matrices $Q_k$ based on the $2 \times 2$ rotation operator (matrix A from Example 11.3)
is recommended. These orthogonal matrices are called Givens matrices, or Givens
rotations. The result is an algorithm that needs only $O(n^2)$ flops per iteration
instead of $O(n^3)$ flops. Overall computational complexity is still $O(n^3)$ flops, due
to the initial Hessenberg reduction step. It is to be noted that the rounding error
performance of the suggested algorithm is quite good [19].
We have already noted the desirability of the real Schur decomposition of A
into $\mathcal{R}$ according to $Q^T A Q = \mathcal{R}$ in (11.57). In fact, with proper attention to details
(many of which cannot be considered here), the QR iterations method is an excellent
means to find Q and $\mathcal{R}$, in that in any valid matrix norm

$$\lim_{N\to\infty} H_N = \mathcal{R} \qquad (11.61)$$

and

$$\lim_{N\to\infty} \prod_{i=0}^{N} Q_i = Q \qquad (11.62)$$

(of course, $\prod_{i=0}^{N} Q_i = Q_0 Q_1 \cdots Q_N$; the ordering of the factors in the product is
important since each $Q_i$ is a matrix). The formal proof of this is rather difficult,
and so it is omitted.
Suppose that we have

$$A = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0 & -a_n \\ 1 & 0 & \cdots & 0 & 0 & -a_{n-1} \\ 0 & 1 & \cdots & 0 & 0 & -a_{n-2} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & -a_2 \\ 0 & 0 & \cdots & 0 & 1 & -a_1 \end{bmatrix} \in \mathbf{R}^{n \times n}. \qquad (11.63)$$

It can be shown that

$$p(\lambda) = \det(\lambda I - A) = \lambda^n + a_1 \lambda^{n-1} + a_2 \lambda^{n-2} + \cdots + a_{n-1}\lambda + a_n. \qquad (11.64)$$

Matrix A is called a companion matrix. We see that it is easy to obtain (11.64) from
(11.63), or vice versa. We also see that A is Hessenberg. Because of (11.61), we
may conceivably input A of (11.63) into the basic QR iterations algorithm (omitting
the initial Hessenberg reduction step), and so determine the roots of $p(\lambda) = 0$, as
these are the eigenvalues of A. Since $p(\lambda)$ is essentially arbitrary, except that it
should not yield a defective A, we have an algorithm to solve the polynomial zero-finding problem that was mentioned in Chapter 7. Unfortunately, it has been noted
[4,19] that this is not necessarily a stable method for finding polynomial zeros.
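The correspondence between (11.63) and (11.64) is easy to check numerically. In the Python sketch below (the helper names companion and det are our own), we build the companion matrix of $p(\lambda) = \lambda^3 - 6\lambda^2 + 11\lambda - 6 = (\lambda-1)(\lambda-2)(\lambda-3)$ and verify that $\det(\lambda I - A)$ matches $p(\lambda)$ at a few sample points:

```python
def companion(a):
    """Companion matrix (11.63) for p(l) = l^n + a[0] l^{n-1} + ... + a[n-1]:
    ones on the subdiagonal, negated coefficients in the last column."""
    n = len(a)
    A = [[0.0] * n for _ in range(n)]
    for i in range(1, n):
        A[i][i - 1] = 1.0
    for i in range(n):
        A[i][n - 1] = -a[n - 1 - i]  # -a_n at the top, -a_1 at the bottom
    return A

def det(M):
    """Determinant by cofactor expansion along the first row (small n only)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += ((-1.0) ** j) * M[0][j] * det(minor)
    return total

# p(l) = l^3 - 6 l^2 + 11 l - 6 = (l - 1)(l - 2)(l - 3)
a = [-6.0, 11.0, -6.0]
A = companion(a)
# Spot-check (11.64): det(lambda*I - A) should equal p(lambda).
for lam in [0.0, 0.5, 2.5, 4.0]:
    M = [[(lam if i == j else 0.0) - A[i][j] for j in range(3)] for i in range(3)]
    p = lam ** 3 + a[0] * lam ** 2 + a[1] * lam + a[2]
    print(lam, det(M), p)
```

The eigenvalues of this A are exactly the zeros 1, 2, 3 of $p(\lambda)$, which is what makes feeding a companion matrix to an eigenvalue routine a (numerically questionable, as noted above) polynomial root finder.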
Example 11.8 Suppose that

$$p(\lambda) = (\lambda^2 - 2\lambda + 2)(\lambda^2 - \sqrt{2}\lambda + 1) = \lambda^4 - (2 + \sqrt{2})\lambda^3 + (2\sqrt{2} + 3)\lambda^2 - 2(1 + \sqrt{2})\lambda + 2,$$

which has the zeros

$$\lambda = 1 \pm j, \qquad \lambda = \frac{1}{\sqrt{2}}(1 \pm j).$$

After 50 iterations of the basic QR iterations algorithm, we obtain

$$H_{50} = \begin{bmatrix} 0.5000 & -6.0355 & 0.9239 & 5.3848 \\ 0.2071 & 1.5000 & -0.3827 & -2.2304 \\ 0.0000 & 0.0000 & 0.7071 & -0.7071 \\ 0.0000 & 0.0000 & 0.7071 & 0.7071 \end{bmatrix} = \begin{bmatrix} R_{0,0} & R_{0,1} \\ 0 & R_{1,1} \end{bmatrix},$$

where $R_{i,j} \in \mathbf{R}^{2 \times 2}$. The reader may wish to confirm that $R_{0,0}$ has the eigenvalues
$1 \pm j$, and that $R_{1,1}$ has the eigenvalues $\frac{1}{\sqrt{2}}(1 \pm j)$.
On the other hand, for

$$p(\lambda) = (\lambda + 1)(\lambda^2 - 2\lambda + 2)(\lambda^2 - \sqrt{2}\lambda + 1),$$

which is a slight modification of the previous example, the basic QR iterations
algorithm will fail to converge.
The following point is also important. In (11.60), define $Q = Q_0 Q_1 \cdots Q_N$,
so that $H_N = Q^T A Q$. Now suppose that $A = A^T$. Clearly, $H_N^T = Q^T A^T Q =
Q^T A Q = H_N$. This implies that $H_N$ will be tridiagonal (defined in Section 6.5)
for a real and symmetric A. Because of (11.61), we must now have

$$\lim_{N\to\infty} H_N = \mathrm{diag}(R_{0,0}, R_{1,1}, \ldots, R_{m-1,m-1}) = D,$$

where each $R_{i,i}$ is $1 \times 1$. Thus, D is the diagonal matrix of eigenvalues of A. Also,
we must have $\lim_{N\to\infty} \prod_{i=0}^{N} Q_i$ as the corresponding matrix of eigenvectors of A.
Example 11.9 Suppose that A is the matrix from Example 11.5:

    A = [ 4  1  0
          1  4  1
          0  1  4 ].
We see that A is in Hessenberg form already. After 36 iterations of the basic QR
iteration algorithm, we obtain

    H_36 = [ 5.4142  0.0000  0.0000
             0.0000  4.0000  0.0000
             0.0000  0.0000  2.5858 ],

so the matrix is diagonal and reveals all eigenvalues of A. Additionally, we have

    ∏_{i=0}^{36} Q_i = [ 0.5000  −0.7071   0.5000
                         0.7071   0.0000  −0.7071
                         0.5000   0.7071   0.5000 ],

which is a good approximation to the eigenvectors of A.
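Example 11.9 is easy to reproduce. The following sketch is in Python rather than MATLAB, and `qr_decompose` and `matmul` are our own minimal helpers, with classical Gram–Schmidt standing in for a production QR factorization (real codes use Householder reflections); it runs the basic QR iteration on the matrix A above:

```python
def qr_decompose(A):
    """QR factorization by classical Gram-Schmidt (adequate for this demo)."""
    n = len(A)
    Q = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[i][j] for i in range(n)]          # j-th column of A
        for k in range(j):
            R[k][j] = sum(Q[i][k] * A[i][j] for i in range(n))
            for i in range(n):
                v[i] -= R[k][j] * Q[i][k]        # remove earlier directions
        R[j][j] = sum(x * x for x in v) ** 0.5
        for i in range(n):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Example 11.9: A is already tridiagonal (hence Hessenberg)
H = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
for _ in range(36):
    Q, R = qr_decompose(H)
    H = matmul(R, Q)                             # H_k = R_k Q_k is similar to A

print([round(H[i][i], 4) for i in range(3)])     # [5.4142, 4.0, 2.5858]
```

The diagonal of H after 36 iterations matches H_36 in the example to four decimal places.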
We noted in Section 11.4 that shifting can be used to accelerate convergence in
power methods. Similarly, shifting can be employed in QR iterations to achieve
the same result. Indeed, all modern implementations of QR iterations incorporate
some form of shifting for this reason. The previous basic QR iteration algorithm
may be modified to incorporate the shift parameter μ ∈ R. The overall structure of
the result is described by the following pseudocode:
Input N; { Upper limit on the number of iterations }
Input A ∈ R^{n×n}; { Matrix we want to eigendecompose }
H_0 := Q_0^T A Q_0; { Reduce A to Hessenberg form }
k := 1;
while k ≤ N do begin
    Determine the shift parameter μ ∈ R;
    H_{k−1} − μI := Q_k R_k; { QR-decomposition step }
    H_k := R_k Q_k + μI;
    k := k + 1;
end;
The reader may readily confirm that we still have

    H_N = Q_N^T ⋯ Q_1^T Q_0^T A Q_0 Q_1 ⋯ Q_N,

just as we had in (11.60) for the basic QR iteration algorithm. Thus, we again find
that H_k is similar to A for all k. Perhaps the simplest means to generate μ is the
single-shift QR iterations algorithm:
Input N; { Upper limit on the number of iterations }
Input A ∈ R^{n×n}; { Matrix we want to eigendecompose }
H_0 := Q_0^T A Q_0; { Reduce A to Hessenberg form }
k := 1;
while k ≤ N do begin
    μ_k := H_{k−1}(n − 1, n − 1); { μ_k is the lower right corner element of H_{k−1} }
    H_{k−1} − μ_k I := Q_k R_k; { QR-decomposition step }
    H_k := R_k Q_k + μ_k I;
    k := k + 1;
end;
We note that μ_k is not fixed in general from one iteration to the next. Basically, μ_k
varies from iteration to iteration in order to account for new information about s(A)
as the subdiagonal entries of H_k converge to zero. We will avoid the technicalities
involved in a full justification of this approach except to mention that it is flawed,
and that more sophisticated shifting methods are needed for an acceptable algorithm
(e.g., the double shift [4,19]). However, the following example shows that shifting
in this way does speed convergence.
Example 11.10 If we apply the single-shift QR iterations algorithm to A in
Example 11.9, we obtain the following matrix in only one iteration:

    H_1 = [ 4.0000  −1.4142  0.0000
           −1.4142   4.0000  0.0000
            0.0000   0.0000  4.0000 ].

This matrix certainly does not have the structure of H_36 in Example 11.9, but the
eigenvalues of the submatrix

    [ 4.0000  −1.4142
     −1.4142   4.0000 ]

are 5.4142 and 2.5858.
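A single shifted QR step is also easy to try directly. In the sketch below (Python; the function name is ours) the QR factorization of H − μI is carried out with Givens rotations, the natural choice for a Hessenberg matrix, and one that behaves gracefully even when the shift happens to be an exact eigenvalue, as it is here:

```python
import math

def shifted_qr_step(H, mu):
    """One single-shift QR step: factor H - mu*I = QR with Givens
    rotations, then return RQ + mu*I."""
    n = len(H)
    R = [[H[i][j] - (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
    rots = []
    for k in range(n - 1):              # zero the subdiagonal entry by entry
        a, b = R[k][k], R[k + 1][k]
        r = math.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0.0 else (a / r, b / r)
        rots.append((c, s))
        for j in range(n):              # rotate rows k and k+1
            x, y = R[k][j], R[k + 1][j]
            R[k][j], R[k + 1][j] = c * x + s * y, -s * x + c * y
    for k, (c, s) in enumerate(rots):   # form R*Q by rotating columns
        for i in range(n):
            x, y = R[i][k], R[i][k + 1]
            R[i][k], R[i][k + 1] = c * x + s * y, -s * x + c * y
    for i in range(n):
        R[i][i] += mu
    return R

H = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
H1 = shifted_qr_step(H, H[2][2])        # mu = lower right corner element
print([[round(v, 4) for v in row] for row in H1])
```

One step isolates the eigenvalue 4.0 in the lower right corner, and the leading 2×2 block has eigenvalues 4 ± √2 = 5.4142, 2.5858, matching Example 11.10; the signs of the ±1.4142 off-diagonal entries depend on the QR sign convention in use.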
Finally, we mention that our pseudocodes assume a user-specified number of
iterations N. This is not convenient, and is inefficient in practice. Criteria to
automatically terminate the QR iterations without user intervention are available, but a
discussion of this matter is beyond our scope.
REFERENCES
1. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.),
Random House, New York, 1988.
2. P. Halmos, Finite Dimensional Vector Spaces, Van Nostrand, New York, 1958.
3. R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge Univ. Press, Cambridge,
UK, 1985.
4. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ.
Press, Baltimore, MD, 1989.
5. F. W. Fairman, "On Using Singular Value Decomposition to Obtain Irreducible Jordan
Realizations," in Linear Circuits, Systems and Signal Processing: Theory and Appli-
cation, C. I. Byrnes, C. F. Martin, and R. E. Saeks, eds., North-Holland, Amsterdam,
1988, pp. 35-40.
6. G. Stewart, Introduction to Matrix Computations, Academic Press, New York, 1973.
7. C. Moler and C. Van Loan, "Nineteen Dubious Ways to Compute the Exponential of a
Matrix," SIAM Rev. 20, 801-836 (Oct. 1978).
8. I. E. Leonard, "The Matrix Exponential," SIAM Rev. 38, 507-512 (Sept. 1996).
9. B. Noble and J. W. Daniel, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs,
NJ, 1977.
10. W. R. Derrick and S. I. Grossman, Elementary Differential Equations with Applications,
2nd ed., Addison-Wesley, Reading, MA, 1981.
11. W. T. Reid, Ordinary Differential Equations, Wiley, New York, 1971.
12. C. Van Loan, "The Sensitivity of the Matrix Exponential," SIAM J. Numer. Anal. 14,
971-981 (Dec. 1977).
13. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Math-
ematics series, Vol. 37), Springer-Verlag, New York, 2000.
14. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New
York, 2002.
15. D. S. Watkins, "Understanding the QR Algorithm," SIAM Rev. 24, 427-440 (Oct.
1982).
16. J. G. F. Francis, "The QR Transformation: A Unitary Analogue to the LR Transforma-
tions, Parts I and II," Comput. J. 4, 265-272, 332-345 (1961).
17. V. N. Kublanovskaya, "On Some Algorithms for the Solution of the Complete Eigen-
value Problem," USSR Comput. Math. Phys. 3, 637-657, (1961).
18. B. N. Parlett and W. G. Poole, Jr., "A Geometric Theory for the QR, LU, and Power
Iterations," SIAM J. Numer. Anal. 10, 389-412 (1973).
19. J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK,
1965.
PROBLEMS
11.1. Aided with at most a pocket calculator, find all the eigenvalues and eigenvectors
of the following matrices:

    (a) A = [ 4  1
              1  4 ]

    (b) B = [ 0  −2
              1   2 ]

    (c) C = [ 4  1  1
              1  4  1
              1  1  4 ]

    (d) D = [ 1  −j
              j   2 ]
11.2. A conic section in R² is described in general by

    αx_0² + 2βx_0x_1 + γx_1² + δx_0 + εx_1 + ρ = 0.          (11.P.1)

    (a) Show that (11.P.1) can be rewritten as a quadratic form:

        x^T Ax + g^T x + ρ = 0,                              (11.P.2)

    where x = [x_0 x_1]^T, and A = A^T.
    (b) For a conic section in standard form A is diagonal. Suppose that A is
    diagonal. State the conditions on the diagonal elements that result in
    (11.P.2) describing an ellipse, a parabola, and a hyperbola.
    (c) Suppose that (11.P.2) is not in standard form (i.e., A is not a diagonal
    matrix). Explain how similarity transformations might be used to place
    (11.P.2) in standard form.
11.3. Consider the companion matrix

    C = [ 0  0  ⋯  0  −c_n
          1  0  ⋯  0  −c_{n−1}
          0  1  ⋯  0  −c_{n−2}
          ⋮  ⋮      ⋮   ⋮
          0  0  ⋯  1  −c_1 ] ∈ C^{n×n}.

Prove that

    p_n(λ) = det(λI − C) = λ^n + c_1 λ^{n−1} + c_2 λ^{n−2} + ⋯ + c_{n−1} λ + c_n.
11.4. Suppose that the eigenvalues of A ∈ C^{n×n} are λ_0, λ_1, . . . , λ_{n−1}. Find the
eigenvalues of A + aI, where I is the order n identity matrix and a ∈ C
is a constant.

11.5. Suppose that A ∈ R^{n×n} is orthogonal (i.e., A^{−1} = A^T); then, if λ is an
eigenvalue of A, show that we must have |λ| = 1.
11.6. Find all the eigenvalues of

    A = [ cos θ   0      −sin θ   0
          0       cos φ   0      −sin φ
          sin θ   0       cos θ   0
          0       sin φ   0       cos φ ].

(Hint: The problem is simplified by using permutation matrices.)
11.7. Consider the following definition: A, B ∈ C^{n×n} are simultaneously diagonalizable
if there is a similarity matrix S ∈ C^{n×n} such that S^{−1}AS and
S^{−1}BS are both diagonal matrices. Show that if A, B ∈ C^{n×n} are simultaneously
diagonalizable, then they commute (i.e., AB = BA).

11.8. Prove the following theorem. Let A, B ∈ C^{n×n} be diagonalizable. Then
A and B commute iff they are simultaneously diagonalizable.

11.9. Matrix A ∈ C^{n×n} is a square root of B ∈ C^{n×n} if A² = B. Show that every
diagonalizable matrix in C^{n×n} has a square root.
11.10. Prove the following (Bauer-Fike) theorem (which says something about
how perturbations of matrices affect their eigenvalues). If γ is an eigenvalue
of A + E ∈ R^{n×n}, and T^{−1}AT = D = diag(λ_0, λ_1, . . . , λ_{n−1}), then

    min_{λ∈s(A)} |λ − γ| ≤ κ_2(T) ||E||_2.

Recall that s(A) denotes the set of eigenvalues of A (i.e., the spectrum of
A). [Hint: If γ ∈ s(A), the result is certainly true, so we need consider
only the situation where γ ∉ s(A). Confirm that if T^{−1}(A + E − γI)T
is singular, then so is I + (D − γI)^{−1}(T^{−1}ET). Note that if for some
B ∈ R^{n×n} the matrix I + B is singular, then (I + B)x = 0 for some x ∈ R^n
that is nonzero, so ||x||_2 = ||Bx||_2, and so ||B||_2 ≥ 1. Consider upper and
lower bounds on the norm ||(D − γI)^{−1}(T^{−1}ET)||_2.]
11.11. The Daubechies 4-tap scaling function φ(t) satisfies the two-scale difference
equation

    φ(t) = p_0 φ(2t) + p_1 φ(2t − 1) + p_2 φ(2t − 2) + p_3 φ(2t − 3),   (11.P.3)

where supp φ(t) = [0, 3] ⊂ R [i.e., φ(t) is nonzero only on the interval
[0, 3]], and where

    p_0 = ¼(1 + √3),  p_1 = ¼(3 + √3),
    p_2 = ¼(3 − √3),  p_3 = ¼(1 − √3).

Note that the solution to (11.P.3) is continuous (an important fact).

    (a) Find the matrix M such that

        MΦ = Φ,                                              (11.P.4)

    where Φ = [φ(1) φ(2)]^T ∈ R².
    (b) Find the eigenvalues of M. Find the solution Φ to (11.P.4).
    (c) Take Φ from (b) and multiply it by a constant α such that α Σ_{k=1}^{2} φ(k) = 1
    (i.e., replace Φ by the normalized form αΦ).
    (d) Using the normalized vector from (c) (i.e., αΦ), find φ(k/2) for all
    k ∈ Z.
[Comment: Computation of the Daubechies 4-tap scaling function is the
first major step in computing the Daubechies 4-tap wavelet. The process
suggested in (d) may be continued to compute φ(k/2^J) for any k ∈ Z, and
for any positive integer J. The algorithm suggested by this is often called
the interpolatory graphical display algorithm (IGDA).]
11.12. Prove the following theorem. Suppose A ∈ R^{n×n} and A = A^T. Then A > 0
iff A = P^T P for some nonsingular matrix P ∈ R^{n×n}.

11.13. Let A ∈ R^{n×n} be symmetric with eigenvalues

    λ_0 ≤ λ_1 ≤ ⋯ ≤ λ_{n−2} ≤ λ_{n−1}.

Show that for all x ∈ R^n

    λ_0 x^T x ≤ x^T Ax ≤ λ_{n−1} x^T x.

[Hint: Use the fact that there is an orthogonal matrix P such that P^T AP =
Λ (diagonal matrix of eigenvalues of A). Partition P in terms of its row
vectors.]
11.14. Section 11.3 presented a method of computing e^{At} "by hand." Use this
method to
    (a) Derive (10.109) (in Example 10.9).
    (b) Derive (10.P.9) in Problem 10.23.
    (c) Find a closed-form expression for e^{At}, where

        A = [ λ  1  0
              0  λ  1
              0  0  λ ].
11.15. This exercise confirms that eigenvalues and singular values are definitely
not the same thing. Consider the matrix

    A = [ 0  2
          1  1 ].

Use the MATLAB eig and svd functions to find the eigenvalues and singular
values of A.
11.16. Show that e^{(A+B)t} = e^{At} e^{Bt} for all t ∈ R if AB = BA. Does e^{(A+B)t} =
e^{At} e^{Bt} always hold for all t ∈ R when AB ≠ BA? Justify your answer.
11.17. This problem is an introduction to Floquet theory. Consider a linear system
with state vector x(t) ∈ R^n for all t ∈ R such that

    dx(t)/dt = A(t)x(t)                                      (11.P.5)

for some A(t) ∈ R^{n×n} (all t ∈ R), and such that A(t + T) = A(t) for some
T > 0 [so that A(t) is periodic with period T]. Let Φ(t) be the fundamental
matrix of the system such that

    dΦ(t)/dt = A(t)Φ(t),  Φ(0) = I.                          (11.P.6)

    (a) Let Ψ(t) = Φ(t + T), and show that

        dΨ(t)/dt = A(t)Ψ(t).                                 (11.P.7)

    (b) Show that Φ(t + T) = Φ(t)C, where C = Φ(T). [Hint: Equations
    (11.P.6) and (11.P.7) differ only in their initial conditions [i.e., Ψ(0) =
    what?].]
    (c) Assume that C^{−1} exists for some C ∈ R^{n×n}, and that there exists some
    R ∈ R^{n×n} such that C = e^{TR}. Define P(t) = Φ(t)e^{−tR}, and show that
    P(t + T) = P(t). [Thus, Φ(t) = P(t)e^{tR}, which is the general form of
    the solution to (11.P.6).]

(Comment: Further details of the theory of solution of (11.P.5) based on
working with (11.P.6) may be found in E. A. Coddington and N. Levinson,
Theory of Ordinary Differential Equations, McGraw-Hill, New York, 1955.
The main thing for the student to notice is that the theory involves matrix
exponentials.)
11.18. Consider the matrix

    A = [ 4  0  0  2
          1  4  1  0
          0  1  4  1
          2  0  0  4 ].

    (a) Create a MATLAB routine that implements the scaled power algorithm,
    and use your routine to find the largest eigenvalue of A.
    (b) Create a MATLAB routine that implements the shifted inverse power
    algorithm, and use your routine to find the smallest eigenvalue of A.
11.19. For A in the previous problem, find κ_2(A) using the MATLAB routines that
you created to solve the problem.
11.20. If A ∈ R^{n×n}, and A = A^T, then, for some x ∈ R^n such that x ≠ 0, we
define the Rayleigh quotient of A and x to be the ratio x^T Ax/x^T x (=
⟨x, Ax⟩/⟨x, x⟩). The Rayleigh quotient iterative algorithm is described as

    k := 0;
    while k < N do begin
        μ_k := z_k^T A z_k / z_k^T z_k;
        (A − μ_k I) y_k := z_k;
        z_{k+1} := y_k / ||y_k||_2;
        k := k + 1;
    end;

The user inputs z_0, which is the initial guess about the eigenvector. Note
that the shift μ_k is changed (i.e., updated) at every iteration. This has the
effect of accelerating convergence (i.e., of reducing the number of iterations
needed to achieve an accurate solution). However, A_μ = A − μ_k I needs
to be factored anew with every iteration as a result. Prove the following
theorem. Let A ∈ R^{n×n} be symmetric and (λ, x) be an eigenpair for A. If
y ≈ x, μ = y^T Ay/y^T y with ||x||_2 = ||y||_2 = 1, then

    |λ − μ| ≤ ||A − λI||_2 ||x − y||_2².

[Comment: The norm ||A − λI||_2 is an A-dependent constant, while ||x −
y||_2² = ||e||_2² is the square of the size of the error between eigenvector x
and the estimate y of it. So the size of the error between λ and the estimate
μ (i.e., |λ − μ|) is proportional to ||e||_2² at the worst. This explains the
fast convergence of the method (i.e., only a relatively small N is usually
needed). Note that the proof uses (4.31).]
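The pseudocode above is straightforward to transliterate. The sketch below is a Python version (the exercises themselves ask for MATLAB; `solve` is a minimal Gaussian-elimination helper of our own) applied to the symmetric matrix of Example 11.9; note how few iterations are needed:

```python
def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def rayleigh_iteration(A, z, steps=3):
    """Rayleigh quotient iteration, following the pseudocode above."""
    n = len(A)
    mu = 0.0
    for _ in range(steps):
        Az = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]
        mu = sum(z[i] * Az[i] for i in range(n)) / sum(v * v for v in z)
        shifted = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        y = solve(shifted, z)                      # (A - mu*I) y = z
        norm = sum(v * v for v in y) ** 0.5
        z = [v / norm for v in y]
    return mu, z

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
mu, z = rayleigh_iteration(A, [1.0, 0.8, 0.6])
print(round(mu, 4))   # 5.4142 = 4 + sqrt(2), after only three iterations
```

The starting guess here is arbitrary; three iterations already give the eigenvalue 4 + √2 to well beyond four decimal places, which is the fast (better than quadratic) convergence the theorem explains.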
11.21. Write a MATLAB routine that implements the basic QR iteration algorithm.
You may use the MATLAB function qr to perform QR factorizations. Test
your program out on the following matrices:
    (a) A = [ 4  0  0  2
              1  4  1  0
              0  1  4  1
              2  0  0  4 ]

    (b) B = [ 0  0  0  −2
              1  0  0  2(1 + √2)
              0  1  0  −(2√2 + 3)
              0  0  1  2 + √2 ]

Use other built-in MATLAB functions (e.g., roots or eig) to verify your
answers. Iterate enough to obtain entries for H_N that are accurate to four
decimal places.
11.22. Repeat the previous problem using your own MATLAB implementation of
the single-shift QR iteration algorithm. Compare the number of iterations
needed to obtain four decimal places of accuracy with the result from the
previous problem.
11.23. Suppose that X, Y ∈ R^{n×n}, and we define the matrices

    A = X + jY,  B = [ X  −Y
                       Y   X ].

Show that if λ ∈ s(A) is real-valued, then λ ∈ s(B). Find a relationship
between the corresponding eigenvectors.
12
Numerical Solution of Partial
Differential Equations
12.1 INTRODUCTION
The subject of partial differential equations (PDEs) with respect to the matter of
their numerical solution is impossibly large to properly cover within a single chapter
(or, for that matter, even within a single textbook). Furthermore, the development of
numerical methods for PDEs is a highly active area of research, and so it continues
to be a challenge to decide what is truly "fundamental" material to cover at an
introductory level. In this chapter we shall place emphasis on wave propagation
problems modeled by hyperbolic PDEs (defined in Section 12.2). We will consider
especially the finite-difference time-domain (FDTD) method [8], as this appears to
be gaining importance in such application areas as modeling of the scattering of
electromagnetic waves from particles and objects and modeling of optoelectronic
systems. We will only illustrate the method with respect to planar electromagnetic
wave propagation problems at normal incidence. However, prior to this we shall
give an overview of PDEs, including how they are classified into elliptic, parabolic,
and hyperbolic types.
12.2 A BRIEF OVERVIEW OF PARTIAL DIFFERENTIAL EQUATIONS
In this section we define some notation and terminology that is used throughout
the chapter. We explain how second-order PDEs are classified. We also summarize
some problems that will not be covered within this book, simply citing references
where the interested reader can find out more.
We will consider only two-dimensional functions u(x, t) ∈ R (or u(x, y) ∈ R),
where the independent variable x is interpreted as a space variable, and independent
variable t is interpreted as time. The order of a PDE is the order of the highest
derivative. For our purposes, we will never consider PDEs of an order greater than
2. Common shorthand notation for partial derivatives includes

    u_x = ∂u/∂x,  u_t = ∂u/∂t,  u_xt = ∂²u/∂t∂x,  u_xx = ∂²u/∂x².

An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
If the PDE has solution u(x, y), where x and y are spatial variables, then often
we are interested only in approximating the solution on a bounded region (i.e., a
bounded subset of R²). However, we will consider mainly a PDE with solution
u(x, t), where, as already noted, t is time, and so we consider u(x, t) only for
x ∈ [a, b] = [x_0, x_f] and t ∈ [t_0, t_f]. Commonly, t_0 = 0, x_0 = 0, and x_f = L, with
t_f = T. We wish to approximate the PDE solution u(x, t) at grid points (mesh
points) much as we did in the problem of numerically solving ODEs as considered
in Chapter 10. Thus, we wish to approximate u(x_k, t_n), where

    x_k = x_0 + hk,  t_n = t_0 + τn,                         (12.1)

such that

    h = (x_f − x_0)/M,  τ = (t_f − t_0)/N,                   (12.2)

and so k = 0, 1, . . . , M, and n = 0, 1, . . . , N. This implies that we assume sampling
on a uniform two-dimensional grid defined on the xt plane [(x, t) plane].
Commonly, the numerical approximation to u(x_k, t_n) is denoted by u_{k,n}. The index
k is then a space index, and n is a time index.
There is a classification scheme for second-order linear PDEs. According to
Kreyszig [1], Myint-U and Debnath [2], and Courant and Hilbert [9], a PDE of
the form

    A u_xx + 2B u_xy + C u_yy = F(x, y, u, u_x, u_y)          (12.3)

is elliptic if AC − B² > 0, parabolic if AC − B² = 0, and is hyperbolic if AC −
B² < 0.¹ It is possible for A, B, and C to be functions of x and y, in which case
(12.3) may be of a type (elliptic, parabolic, hyperbolic) that varies with x and y. For
example, (12.3) might be hyperbolic in one region of R² but parabolic in another.
Of course, the terminology as to type remains the same when space variable y is
replaced by time variable t.
An example of an elliptic PDE is the Poisson equation from electrostatics [10]

    V_xx + V_yy = −ρ(x, y)/ε,                                (12.4)

where the solution V(x, y) is the electrical potential (e.g., in units of volts) at the
spatial location (x, y) in R², constant ε is the permittivity of the medium (e.g.,
in units of farads per meter), and ρ(x, y) is the charge density (e.g., in units of
coulombs per square meter) at the spatial point (x, y). We have assumed that
the permittivity is a constant, but it can vary spatially as well. Certainly, (12.4)
¹This classification scheme is related to the classification of conic sections on the Cartesian plane. The
general equation for such a conic on R² is

    Ax² + Bxy + Cy² + Dx + Ey + F = 0.

The conic is hyperbolic if B² − 4AC > 0, parabolic if B² − 4AC = 0, and is elliptic if B² − 4AC < 0.
has the form of (12.3), where for u(x, y) = V(x, y) we have B = 0, A = C =
1, and F(x, y, u, u_x, u_y) = −ρ(x, y)/ε. Therefore, AC − B² = 1 > 0, confirming
that (12.4) is elliptic.
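The discriminant test is trivial to mechanize. A quick sketch (Python; the function name is ours, and the test cases simply restate the classifications worked out in this chapter):

```python
def classify(A, B, C):
    """Classify A*u_xx + 2*B*u_xy + C*u_yy = F(...) by the sign of AC - B^2."""
    d = A * C - B * B
    return "elliptic" if d > 0 else ("parabolic" if d == 0 else "hyperbolic")

print(classify(1.0, 0.0, 1.0))   # Poisson equation (12.4): elliptic
print(classify(1.0, 0.0, 0.0))   # one-dimensional diffusion: parabolic
print(classify(1.0, 0.0, -1.0))  # wave-type equation: hyperbolic
```

When A, B, and C depend on x and y, the same test is applied pointwise, which is how a single PDE can change type from one region to another.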
To develop an approximate method of solving (12.4), one may employ finite
differences. For example, from Taylor series theory

    ∂²V(x_k, y_n)/∂x² = [V(x_{k+1}, y_n) − 2V(x_k, y_n) + V(x_{k−1}, y_n)]/h²
                        − (h²/12) ∂⁴V(ξ_k, y_n)/∂x⁴          (12.5a)

for some ξ_k ∈ [x_{k−1}, x_{k+1}], and

    ∂²V(x_k, y_n)/∂y² = [V(x_k, y_{n+1}) − 2V(x_k, y_n) + V(x_k, y_{n−1})]/τ²
                        − (τ²/12) ∂⁴V(x_k, η_n)/∂y⁴          (12.5b)

for some η_n ∈ [y_{n−1}, y_{n+1}], where x_k = x_0 + hk, y_n = y_0 + τn [recall (12.1) and
(12.2)]. The finite-difference approximation to (12.4) is thus

    [V_{k+1,n} − 2V_{k,n} + V_{k−1,n}]/h² + [V_{k,n+1} − 2V_{k,n} + V_{k,n−1}]/τ²
        = −ρ(x_k, y_n)/ε.                                    (12.6)
Here k = 0, 1, . . . , M, and n = 0, 1, . . . , N. Depending on ρ(x, y) and boundary
conditions, it is possible to rewrite (12.6) as a linear system of equations in the
unknown (approximate) potentials V_{k,n}. In practice, N and M may be large, and so
the linear system of equations will be of high order consisting of O(NM) unknowns
to solve for. Because of the structure of (12.6), the linear system is a sparse one,
too. It has therefore been pointed out [3,7,11] that iterative solution methods are
preferred, such as the Jacobi or Gauss-Seidel methods (recall Section 4.7). This
avoids the problems inherent in storing and manipulating large dense matrices.
Epperson [11] notes that in recent years conjugate gradient methods have begun to
displace Gauss-Seidel/Jacobi approaches to solving large and sparse linear systems
such as are generated from (12.6). In part this is due to the difficulties inherent
in obtaining the optimal value for the relaxation parameter ω (recall the definition
from Section 4.7).
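To make the sparse-iteration remark concrete, here is a minimal Jacobi sweep for (12.6) in the source-free case ρ = 0 with h = τ, so that each interior update is just the average of the four neighbors. This is a Python sketch with an arbitrarily chosen grid size and boundary data, not a production solver:

```python
def jacobi_laplace(V, sweeps=500):
    """Jacobi iteration for the 5-point discretization (12.6) of the Laplace
    equation (rho = 0, h = tau): each interior node is replaced by the
    average of its four neighbours; boundary values stay fixed."""
    rows, cols = len(V), len(V[0])
    for _ in range(sweeps):
        W = [row[:] for row in V]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                W[i][j] = 0.25 * (V[i + 1][j] + V[i - 1][j]
                                  + V[i][j + 1] + V[i][j - 1])
        V = W
    return V

# potential fixed at 0 on three sides and 1 on one edge of a small grid
n = 9
V = [[0.0] * n for _ in range(n)]
for j in range(n):
    V[0][j] = 1.0
V = jacobi_laplace(V)
print(round(V[n // 2][n // 2], 4))   # 0.25 at the centre, by symmetry
```

Each sweep touches only the five nonzero entries per row of the sparse system, which is exactly why such methods avoid storing a large dense matrix; Gauss-Seidel, SOR, and conjugate gradient methods improve on the (slow) convergence rate seen here.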
An example of a parabolic PDE is sometimes called the heat equation, or diffusion
equation [2-4, 11], since it models one-dimensional diffusion processes such
as the flow of heat through a metal bar. The general form of the basic parabolic
PDE is

    u_t = a² u_xx                                            (12.7)

for x ∈ [0, L], t ∈ R⁺. Here u(x, t) could be the temperature of some material at
(x, t). It could also be the concentration of some chemical substance that is diffusing
out from some source. (Of course, other physical interpretations are possible.)
Typical boundary conditions are

    u(0, t) = 0,  u(L, t) = 0  for t > 0,                    (12.8a)
and the initial condition is

    u(x, 0) = f(x).                                          (12.8b)

The initial condition might be an initial temperature distribution, or chemical concentration.
If u(x, t) is interpreted as temperature, then the boundary conditions
(12.8a) state that the ends of the one-dimensional medium are held at a constant
temperature of 0 (e.g., 0 degrees Celsius).

Equation (12.7) has the form of (12.3) with B = C = 0, and hence AC − B² =
0, which is the criterion for a parabolic PDE. Finite-difference schemes analogous
to the case for elliptic PDEs may be developed. Classically, perhaps the most
popular choice is the Crank-Nicolson method summarized by Burden and Faires
[3], but given a more detailed treatment by Epperson [11].
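The Crank-Nicolson method is implicit and beyond a quick sketch, but the simplest explicit finite-difference scheme for (12.7) (a forward difference in t combined with the centred second difference of (12.5a) in x) fits in a few lines. This is not Crank-Nicolson, and the grid parameters below are illustrative only; the scheme is stable only when a²τ/h² ≤ 1/2:

```python
import math

def ftcs_heat(u, a, h, tau, steps):
    """Explicit forward-time centred-space scheme for u_t = a^2 u_xx with
    u = 0 held at both endpoints, per (12.8a)."""
    r = a * a * tau / (h * h)        # must satisfy r <= 1/2 for stability
    for _ in range(steps):
        u = [0.0] + [u[k] + r * (u[k + 1] - 2.0 * u[k] + u[k - 1])
                     for k in range(1, len(u) - 1)] + [0.0]
    return u

# initial condition f(x) = sin(pi x) on [0, 1]; the exact solution is
# u(x, t) = exp(-a^2 pi^2 t) sin(pi x)
L, M, a = 1.0, 20, 1.0
h = L / M
tau = 0.4 * h * h / (a * a)          # satisfies the stability bound
u = [math.sin(math.pi * k * h) for k in range(M + 1)]
steps = 200
u = ftcs_heat(u, a, h, tau, steps)
exact_mid = math.exp(-math.pi ** 2 * steps * tau) * math.sin(math.pi * 0.5)
print(round(u[M // 2], 4), round(exact_mid, 4))
```

The agreement with the separable exact solution is good at this resolution; the price of the explicit scheme is the severe restriction on τ, which is the main motivation for implicit methods such as Crank-Nicolson.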
Another popular numerical solution technique for PDEs is the finite-element
(FEL) method. It applies to a broad class of PDEs, and there are many commercially
available software packages that implement this approach for various applications
such as structural vibration analysis, or electromagnetics. However, we will not
consider the FEL method as it deserves its own textbook. The interested reader can
see Strang and Fix [5] or Brenner and Scott [6] for details. A brief introduction
appears in Burden and Faires [3].
As stated earlier, the emphasis in this book will be on wave propagation problems
as modeled by hyperbolic PDEs. We now turn our attention to this class of PDEs.
12.3 APPLICATIONS OF HYPERBOLIC PDEs
In this section we summarize two problems that illustrate how hyperbolic PDEs
arise in practice. In later sections we will see that although both involve the
modeling of waves propagating in physical systems, the numerical methods for their
solution are different in the two cases, and yet they have in common the application
of finite-difference schemes.
12.3.1 The Vibrating String
Consider an elastic string with its ends fixed at the points x = 0 and x = L (so
that the string is of length L unstretched). If the string is plucked at position
x = x_P (x_P ∈ (0, L)) at time t = 0, such as shown in Fig. 12.1, then it will vibrate
for t > 0. The PDE describing u(x, t), which is the displacement of the string at
position x and time t, is given by

    u_tt = c² u_xx.                                          (12.9)

The system of Fig. 12.1 is also characterized by the boundary conditions

    u(0, t) = 0,  u(L, t) = 0  for all t ∈ R⁺,               (12.10)
Figure 12.1 An elastic string plucked at time t = 0 at point P, which is located at x = x_P.
which specify that the string's ends are fixed, and we have the initial conditions

    u(x, 0) = f(x),  ∂u(x, t)/∂t |_{t=0} = g(x),             (12.11)

which describe the initial displacement and velocity of the string, respectively.
As explained, for example, in Kreyszig [1] or in Elmore and Heald [12], the PDE
(12.9) is derived from elementary Newtonian mechanics based on the following
assumptions:
1. The mass of the string per unit of length is a constant.
2. The string is perfectly elastic, offers no resistance to bending, and there is
no friction.
3. The tension in stretching the string before fixing its ends is large enough to
neglect the action of gravity.
4. The motion of the string is purely a vibration in the vertical plane (i.e., the
y direction), and the deflection and slope are small in absolute value.
We will omit the details of the derivation of (12.9), as this would carry us too far
off course. However, note that the constant c² in (12.9) is

    c² = T/ρ,                                                (12.12)

where T is the tension in the string (e.g., units of newtons) and ρ is the density of
the string (e.g., units of kilograms per meter). A dimensional analysis of (12.12)
quickly reveals that c has the units of speed (e.g., meters per second). It specifies
the speed at which waves propagate on the string.
It is easy to confirm that (12.9) is a hyperbolic PDE since, on comparing (12.9)
with (12.3), we have A = c², B = 0, and C = −1. Thus, AC − B² = −c² < 0,
which meets the definition of a hyperbolic PDE.
At this point we summarize a standard method for obtaining series-based analytical
solutions to PDEs. This is the method of separation of variables (also called the
product method). We shall also find that Fourier series expansions (recall Chapters 1
and 3) have an important role to play in the solution method. The solutions we
obtain yield test cases that we can use to gauge the accuracy of numerical methods
that we consider later on.
Assume that the solution² to (12.9) can be rewritten in the form

    u(x, t) = X(x)T(t).                                      (12.13)

Clearly

    u_xx = X_xx T,  u_tt = X T_tt,                           (12.14)

which may be substituted into (12.9), yielding

    X T_tt = c² T X_xx,                                      (12.15)

or equivalently

    T_tt/(c²T) = X_xx/X.                                     (12.16)

The expression on the left-hand side is a function of t only, while that on the
right-hand side is a function only of x. Thus, both sides must equal some constant,
say, k:

    T_tt/(c²T) = X_xx/X = k.                                 (12.17)

From (12.17) we obtain two second-order linear ODEs in constant coefficients

    X_xx − kX = 0                                            (12.18)

and

    T_tt − c²kT = 0.                                         (12.19)

We will now ascertain the general form of the solutions to (12.18) and (12.19),
based on the conditions (12.10) and (12.11). From (12.10) substituted into (12.13),
we obtain

    u(0, t) = X(0)T(t) = 0,  u(L, t) = X(L)T(t) = 0.

²Theories about the existence and uniqueness of solutions to PDEs are often highly involved, and so
we completely ignore this matter here. The reader is advised to consult books dedicated to PDEs and
their solution for such information.
If T(t) = 0 (all t), then u(x, t) = 0 for all x and t. This is the trivial solution, and
we reject it. Thus, we must have

    X(0) = X(L) = 0.                                         (12.20)

For k = 0, Eq. (12.18) is X_xx = 0, which has the general solution X(x) = ax + b,
but from (12.20) we conclude that a = b = 0, and so X(x) = 0 for all x. This is the
trivial solution and so is rejected. If k = μ² > 0, we have the ODE X_xx − μ²X = 0,
which has a characteristic equation possessing roots at ±μ. Consequently, X(x) =
ae^{μx} + be^{−μx}. If we apply (12.20) to this, we conclude that a = b = 0, once
again giving the trivial solution X(x) = 0 for all x. Now finally suppose that
k = −β² < 0, in which case (12.18) becomes

    X_xx + β²X = 0,                                          (12.21)

which has the characteristic equation s² + β² = 0. Thus, the general solution to
(12.21) is of the form

    X(x) = a cos(βx) + b sin(βx).                            (12.22)

Applying (12.20) yields

    X(0) = a = 0,  X(L) = b sin(βL) = 0.                     (12.23)

Clearly, to avoid encountering the trivial solution, we must assume that b ≠ 0.
Thus, we must have

    sin(βL) = 0,

implying that we have

    β = nπ/L,  n ∈ Z.                                        (12.24)

However, we avoid β = 0 (for n = 0) to prevent X(x) = 0 for all x; and we
consider only n ∈ {1, 2, 3, . . .} = N because sin(−x) = −sin x, and the minus sign
can be absorbed into the constant b. Thus, in general

    X(x) = X_n(x) = b_n sin(nπx/L)                           (12.25)

for n ∈ N, and where x ∈ [0, L]. So now we have found that k = −β² = −(nπ/L)²,
in which case (12.19) takes on the form

    T_{n,tt} + λ_n² T_n = 0,  where λ_n = cnπ/L.             (12.26)
This has a general solution of the form

    T_n(t) = A_n cos(λ_n t) + B_n sin(λ_n t),                (12.27)

again for n ∈ N. Consequently, u_n(x, t) = X_n(x)T_n(t) is a solution to (12.9) for
all n ∈ N, and

    u_n(x, t) = [A_n cos(λ_n t) + B_n sin(λ_n t)] sin(nπx/L).   (12.28)

The functions (12.28) are eigenfunctions with eigenvalues λ_n = cnπ/L for the PDE
in (12.9). The set {λ_n | n ∈ N} is the spectrum. Each u_n(x, t) represents harmonic
motion of the string with frequency λ_n/(2π) cycles per unit of time, and is also
called the nth normal mode for the string. The first mode for n = 1 is called the
fundamental mode, and the others (for n > 1) are called overtones, or harmonics.
It is clear that u_n(x, t) in (12.28) satisfies PDE (12.9), and the boundary conditions
(12.10). However, u_n(x, t) by itself will not satisfy (12.9), (12.10), and (12.11)
all simultaneously. In general, the complete solution is [using superposition, as the
PDE (12.9) is linear]

    u(x, t) = Σ_{n=1}^∞ u_n(x, t)
            = Σ_{n=1}^∞ [A_n cos(λ_n t) + B_n sin(λ_n t)] sin(nπx/L),   (12.29)

where the initial conditions (12.11) are employed to find the series coefficients A_n
and B_n for all n. We will now consider how this is done in general.

From (12.11) we obtain

    u(x, 0) = Σ_{n=1}^∞ A_n sin(nπx/L) = f(x)                (12.30)

and

    ∂u(x, t)/∂t |_{t=0} = Σ_{n=1}^∞ [−A_n λ_n sin(λ_n t) + B_n λ_n cos(λ_n t)] sin(nπx/L) |_{t=0}
                        = Σ_{n=1}^∞ B_n λ_n sin(nπx/L) = g(x).   (12.31)

The orthogonality properties of sinusoids can be used to determine A_n and B_n
for all n. Note that (12.30) and (12.31) are particular instances of Fourier series
expansions. In particular, observe that for k, n ∈ N

    ∫_0^L sin(nπx/L) sin(kπx/L) dx = { 0,    n ≠ k
                                       L/2,  n = k }.        (12.32)
Plainly, the set {sin(nπx/L) | n ∈ N} is an orthogonal set. Thus, from (12.30)

    ∫_0^L sin(kπx/L) [Σ_{n=1}^∞ A_n sin(nπx/L)] dx = ∫_0^L f(x) sin(kπx/L) dx,

and via (12.32) this reduces to

    A_k = (2/L) ∫_0^L f(x) sin(kπx/L) dx,                    (12.33)

and similarly from (12.31) and (12.32)

    B_k = [2/(λ_k L)] ∫_0^L g(x) sin(kπx/L) dx.              (12.34)
Example 12.1 Suppose that g(x) = 0 for all x. Thus, the initial velocity of the
string is zero. Let the initial position (deflection) of the plucked string be triangular
such that

    f(x) = { (2H/L) x,        0 ≤ x ≤ L/2
             (2H/L)(L − x),   L/2 ≤ x ≤ L }.                 (12.35)

In Fig. 12.1 this corresponds to x_P = L/2 and u(x_P, 0) = H. Since g(x) = 0 for
all x, via (12.34) we must have B_k = 0 for all k. From (12.33) we have

    A_k = (4H/L²) [∫_0^{L/2} x sin(kπx/L) dx + ∫_{L/2}^L (L − x) sin(kπx/L) dx].

Since

    ∫ sin(ax) dx = −(1/a) cos(ax) + C

and

    ∫ x sin(ax) dx = −(x/a) cos(ax) + (1/a²) sin(ax) + C

(C is a constant of integration), on simplification we have

    A_k = [8H/(k²π²)] sin(kπ/2).                             (12.36)
Figure 12.2 Fourier series solution to the vibrating string problem. A mesh plot of u(x, t)
as given by Eq. (12.37) for the parameters L = 1, c/L = 4, and H = 0.1. The plot employed
the first 100 terms of the series expansion.
Thus, substituting (12.36) into (12.29) yields the general solution for our example,
namely

    u(x, t) = (8H/π²) Σ_{k=1}^∞ (1/k²) sin(kπ/2) cos(kπct/L) sin(kπx/L),

but sin(kπ/2) = 0 for even k, and so this expression reduces to

    u(x, t) = (8H/π²) Σ_{n=1}^∞ [(−1)^{n−1}/(2n − 1)²] cos((2n − 1)πct/L) sin((2n − 1)πx/L).   (12.37)
Figure 12.2 shows a typical plot of the function u(x, t) as given by (12.37) (for
the parameters stated in the figure caption).
The reader is encouraged to think about Fig. 12.2, and to ask if the picture is
a reasonable one on the basis of his/her intuitive understanding of how a plucked
string (say, that on a stringed musical instrument) behaves. Of course, this question
should be considered with respect to the modeling assumptions that lead to (12.9),
and that were listed earlier.
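The truncated series (12.37) is easy to evaluate numerically as a sanity check. The following Python sketch is purely illustrative (the book's own routines are in MATLAB, and the parameter values $H = L = c = 1$ are chosen arbitrarily); at $t = 0$ it should reproduce the triangular deflection (12.35):

```python
import math

def u(x, t, H=1.0, L=1.0, c=1.0, terms=2000):
    """Truncated Fourier series solution (12.37) for the plucked string."""
    total = 0.0
    for n in range(1, terms + 1):
        m = 2 * n - 1  # only odd harmonics survive
        total += ((-1) ** (n - 1) / m ** 2
                  * math.cos(m * math.pi * c * t / L)
                  * math.sin(m * math.pi * x / L))
    return 8 * H / math.pi ** 2 * total

# At t = 0 the series must reproduce the triangular deflection (12.35):
print(u(0.5, 0.0))   # peak deflection, approximately H = 1
print(u(0.25, 0.0))  # approximately (2H/L)(L/4) = 0.5
```

The boundary conditions $u(0, t) = u(L, t) = 0$ hold term by term, since every term contains $\sin((2n-1)\pi x/L)$.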
12.3.2 Plane Electromagnetic Waves

An electromagnetic wave (e.g., radio wave or light) in three-dimensional space $\mathbb{R}^3$ within some material is described by the vector magnetic field intensity $\bar{H}(x, y, z, t)$ [e.g., in units of amperes per meter (A/m)] and the vector electric field intensity $\bar{E}(x, y, z, t)$ [e.g., in units of volts per meter (V/m)] such that

$$\bar{H}(x,y,z,t) = H_x(x,y,z,t)\bar{x} + H_y(x,y,z,t)\bar{y} + H_z(x,y,z,t)\bar{z},$$
$$\bar{E}(x,y,z,t) = E_x(x,y,z,t)\bar{x} + E_y(x,y,z,t)\bar{y} + E_z(x,y,z,t)\bar{z},$$

where $\bar{x}$, $\bar{y}$, and $\bar{z}$ are the unit vectors in the $x$, $y$, and $z$ directions of $\mathbb{R}^3$, respectively. The dynamic equations that $\bar{H}$ and $\bar{E}$ both satisfy are Maxwell's equations:

$$\nabla \times \bar{E} = -\frac{\partial\bar{B}}{\partial t} \quad \text{(Faraday's law)}, \tag{12.38}$$

$$\nabla \times \bar{H} = \frac{\partial\bar{D}}{\partial t} \quad \text{(Ampere's law)}. \tag{12.39}$$

Here the material in which the wave propagates contains no charges or current sources. The magnetic flux density $\bar{B}(x,y,z,t)$ and the electric flux density $\bar{D}(x,y,z,t)$ are assumed to satisfy

$$\bar{D} = \epsilon\bar{E}, \quad \bar{B} = \mu\bar{H}. \tag{12.40}$$

These relations assume that the material is linear, isotropic (i.e., the same in all directions), and homogeneous [i.e., the parameters $\epsilon$ and $\mu$ do not vary with $(x, y, z)$]. Constant $\epsilon$ is the material's permittivity [units of farads per meter (F/m)], and constant $\mu$ is the material's permeability [units of henries per meter (H/m)]. The permittivity and permeability of free space (i.e., a vacuum) are often denoted by $\epsilon_0$ and $\mu_0$, respectively, and

$$\epsilon = \epsilon_r\epsilon_0, \quad \mu = \mu_r\mu_0, \tag{12.41}$$

where $\epsilon_r$ is the relative permittivity and $\mu_r$ is the relative permeability of the material. Note that

$$\epsilon_0 \approx 8.854185 \times 10^{-12}\ \text{F/m}, \quad \mu_0 = 400\pi \times 10^{-9}\ \text{H/m}. \tag{12.42}$$

If the material is air, then $\epsilon_r \approx 1$ and $\mu_r \approx 1$ to very good approximation, and so air is not practically distinguished (usually) from free space. For commonly occurring dielectric materials (i.e., insulators), we have $\mu_r \approx 1$, also to excellent approximation. On the other hand, for magnetic materials (e.g., iron, cobalt, nickel, various alloys and mixtures), $\mu_r$ will be very different from unity, and in fact the relationship $\bar{B} = \mu\bar{H}$ must often be replaced by sometimes quite complicated nonlinear relationships, often involving the phenomenon of hysteresis. But we will completely avoid this situation here.
In general, for the vector field $\bar{A} = A_x\bar{x} + A_y\bar{y} + A_z\bar{z}$, the curl is the determinant

$$\nabla \times \bar{A} = \begin{vmatrix} \bar{x} & \bar{y} & \bar{z} \\ \dfrac{\partial}{\partial x} & \dfrac{\partial}{\partial y} & \dfrac{\partial}{\partial z} \\ A_x & A_y & A_z \end{vmatrix} \tag{12.43}$$
so, expanding this expression with $\bar{A} = \bar{E}$ and $\bar{A} = \bar{H}$ gives (respectively)

$$\nabla \times \bar{E} = \bar{x}\left(\frac{\partial E_z}{\partial y} - \frac{\partial E_y}{\partial z}\right) + \bar{y}\left(\frac{\partial E_x}{\partial z} - \frac{\partial E_z}{\partial x}\right) + \bar{z}\left(\frac{\partial E_y}{\partial x} - \frac{\partial E_x}{\partial y}\right) \tag{12.44}$$

and

$$\nabla \times \bar{H} = \bar{x}\left(\frac{\partial H_z}{\partial y} - \frac{\partial H_y}{\partial z}\right) + \bar{y}\left(\frac{\partial H_x}{\partial z} - \frac{\partial H_z}{\partial x}\right) + \bar{z}\left(\frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}\right). \tag{12.45}$$
We will consider only transverse electromagnetic (TEM) waves (i.e., plane waves). If such a wave propagates in the $x$ direction, then we may assume that $E_x = E_z = 0$, and so $H_x = H_y = 0$. Note that the electric and magnetic field components $E_y$ and $H_z$ are orthogonal to each other. They lie within the $(y, z)$ plane, which itself is orthogonal to the direction of travel of the plane wave. From (12.44) and (12.45), Maxwell's equations (12.38) and (12.39) reduce to

$$\frac{\partial E_y}{\partial x} = -\mu\frac{\partial H_z}{\partial t}, \quad \frac{\partial H_z}{\partial x} = -\epsilon\frac{\partial E_y}{\partial t}, \tag{12.46}$$

where we have used (12.40). Combining the two equations in (12.46), we obtain either

$$\frac{\partial^2 H_z}{\partial x^2} = \mu\epsilon\frac{\partial^2 H_z}{\partial t^2}, \tag{12.47}$$

which is the wave equation for the magnetic field, or

$$\frac{\partial^2 E_y}{\partial x^2} = \mu\epsilon\frac{\partial^2 E_y}{\partial t^2}, \tag{12.48}$$

which is the wave equation for the electric field. If we define

$$v = \frac{1}{\sqrt{\mu\epsilon}}, \tag{12.49}$$

then the general solution to (12.48) (for example) can be expressed in the form

$$E_y(x, t) = E_{y_0}(x - vt) + E_{y_1}(x + vt), \tag{12.50}$$
where the first term is a wave propagating in the $+x$ direction (i.e., to the right) with speed $v$, and the second term is a wave propagating in the $-x$ direction (i.e., to the left) with speed $v$.³ Equation (12.50) is the classical D'Alembert solution to the scalar wave equation (12.48). Clearly, similar reasoning applies to (12.47). Of course, using (12.49) in (12.48), we can write

$$\frac{\partial^2 E_y}{\partial t^2} = v^2\frac{\partial^2 E_y}{\partial x^2}, \tag{12.51}$$

which has the same form as (12.9). In short, the mathematics describing the vibrations of mechanical systems is much the same as that describing electromagnetic systems; only the physical interpretations differ. Of course, (12.51) clearly implies that (12.47) and (12.48) are hyperbolic PDEs.
Example 12.2 It is easy to confirm that (12.37) can be rewritten in the form of (12.50). Via the identity

$$\cos A\sin B = \tfrac{1}{2}[\sin(A + B) - \sin(A - B)] \tag{12.52}$$

we see that

$$\cos\left(\frac{(2n-1)\pi c}{L}t\right)\sin\left(\frac{(2n-1)\pi}{L}x\right) = \frac{1}{2}\sin\left(\frac{(2n-1)\pi}{L}(x + ct)\right) + \frac{1}{2}\sin\left(\frac{(2n-1)\pi}{L}(x - ct)\right).$$

Thus, (12.37) may immediately be rewritten as

$$u(x,t) = \underbrace{\frac{4H}{\pi^2}\sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{(2n-1)^2}\sin\left(\frac{(2n-1)\pi}{L}(x - ct)\right)}_{=u_1(x-ct)} + \underbrace{\frac{4H}{\pi^2}\sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{(2n-1)^2}\sin\left(\frac{(2n-1)\pi}{L}(x + ct)\right)}_{=u_2(x+ct)}.$$

We note that when $\epsilon_r = \mu_r = 1$, we have $v = c$, where

$$c = \frac{1}{\sqrt{\mu_0\epsilon_0}}. \tag{12.53}$$

³Readers are invited to draw a simple sketch and convince themselves that this interpretation is correct. This interpretation is vital in understanding the propagation of electromagnetic waves in layered materials, such as thin optical films.
This is the speed of light in a vacuum. Since for real materials $\mu_r \geq 1$ and $\epsilon_r \geq 1$, we have $v \leq c$, so an electromagnetic wave cannot travel at a speed exceeding that of light in a vacuum.
Now we will assume sinusoidal solutions to the wave equations such as would originate from sinusoidal sources.⁴ Specifically, let us assume that

$$E_y(x,t) = E_0\sin(\omega t - \beta x), \tag{12.54}$$

where $\beta = \omega\sqrt{\mu\epsilon} = 2\pi/\lambda$, and $\lambda$ is the wavelength (e.g., in units of meters). The frequency of the source is $\omega$, a fixed constant, and so the wavelength will vary depending on the medium. If the free-space wavelength is denoted $\lambda_0$, then

$$\frac{2\pi}{\lambda_0} = \frac{\omega}{c}, \tag{12.55}$$

where $c$ is from (12.53). If the free-space wave then propagates into a denser material, then

$$\frac{2\pi}{\lambda} = \frac{\omega}{v} \tag{12.56}$$

for $v$ given by (12.49). From (12.55) and (12.56), we obtain

$$\lambda = \frac{v}{c}\lambda_0. \tag{12.57}$$

Since $v < c$, we always have $\lambda < \lambda_0$; that is, the wavelength will shorten. This observation is useful in checking numerical methods that model the propagation of sinusoidal waves across interfaces between different materials (e.g., layered structures such as thin optical films).

From (12.54) we have

$$\frac{\partial E_y}{\partial x} = -\beta E_0\cos(\omega t - \beta x)$$

so that from the first equation in (12.46) we have

$$H_z = -\frac{1}{\mu}\int\frac{\partial E_y}{\partial x}\,dt = \int\frac{\beta E_0}{\mu}\cos(\omega t - \beta x)\,dt = \frac{\beta E_0}{\mu\omega}\sin(\omega t - \beta x). \tag{12.58}$$

The characteristic impedance of the medium in which the sinusoidal electromagnetic wave travels is defined to be

$$Z = \frac{E_y}{H_z} = \frac{\omega\mu}{\beta} = \sqrt{\frac{\mu}{\epsilon}} = \sqrt{\frac{\mu_r}{\epsilon_r}}\,Z_0, \tag{12.59}$$

⁴For example, the Colpitts oscillator from Chapter 10 could operate as a sinusoidal signal generator to drive an antenna, thus producing a sinusoidal electromagnetic wave (radio wave) in space.
where $Z_0 = \sqrt{\mu_0/\epsilon_0}$ is the characteristic impedance of free space. The units of $Z$ are ohms ($\Omega$). We see that $Z$ is analogous to the concept of impedance that arises in electric circuit analysis.

The analogy between our present problem and phasor analysis in basic electric circuit theory can be exploited. For suitable $E(x) \in \mathbb{R}$

$$E_y = E_y(x,t) = E(x)e^{j\omega t} \tag{12.60}$$

so that

$$\frac{\partial^2 E_y}{\partial x^2} = \frac{d^2E(x)}{dx^2}e^{j\omega t}, \quad \frac{\partial^2 E_y}{\partial t^2} = -\omega^2E(x)e^{j\omega t}. \tag{12.61}$$

Substituting (12.61) into (12.48) yields

$$\frac{d^2E(x)}{dx^2}e^{j\omega t} = -\mu\epsilon\omega^2E(x)e^{j\omega t},$$

which reduces to the second-order linear ODE

$$\frac{d^2E(x)}{dx^2} + \mu\epsilon\omega^2E(x) = 0. \tag{12.62}$$

For convenience, we define the propagation constant

$$\gamma = j\omega\sqrt{\mu\epsilon} = j\beta, \tag{12.63}$$

so $-\gamma^2 = \mu\epsilon\omega^2$, and (12.62) is now

$$\frac{d^2E}{dx^2} - \gamma^2E = 0 \tag{12.64}$$

($E = E(x)$). This ODE has a general solution of the form

$$E(x) = E_0e^{-\gamma x} + E_1e^{\gamma x}, \tag{12.65}$$

where $E_0$ and $E_1$ are constants. Recalling (12.60), it follows that

$$E_y(x,t) = E_0e^{j\omega t - \gamma x} + E_1e^{j\omega t + \gamma x} = E_0e^{j(\omega t - \beta x)} + E_1e^{j(\omega t + \beta x)}. \tag{12.66}$$

Of course, the first term is a wave propagating to the right, and the second term is a wave propagating to the left.
So far we have assumed wave propagation in lossless materials, since this is the easiest case to consider at the outset. We shall now consider the effects of lossy materials on propagation. This will be important in that it is a more realistic assumption in practice, and it is important in designing perfectly matched layers (PMLs) in the finite-difference time-domain (FDTD) method, as will be seen later.

We may define electrical conductivity $\sigma$ [units of amperes per volt-meter [A/(V-m)], or mhos/meter] and magnetic conductivity $\sigma^*$ [units of volts per ampere-meter [V/(A-m)], or ohms/meter]. In this case (12.38) and (12.39) take on the more general forms

$$\nabla \times \bar{E} = -\mu\frac{\partial\bar{H}}{\partial t} - \sigma^*\bar{H}, \tag{12.67}$$

$$\nabla \times \bar{H} = \epsilon\frac{\partial\bar{E}}{\partial t} + \sigma\bar{E}. \tag{12.68}$$

Note that $\sigma^*$ is not the complex conjugate of $\sigma$. In fact, $\sigma, \sigma^* \in \mathbb{R}$ with $\sigma \geq 0$ for a lossy material (i.e., an imperfect insulator, or a conductor), and $\sigma^* \geq 0$. As before, we will assume $\bar{E}$ possesses only a $y$ component, and $\bar{H}$ possesses only a $z$ component. Since we again have propagation only in the $x$ direction, via (12.44) and (12.45) in (12.67) and (12.68), we have

$$\frac{\partial E_y}{\partial x} = -\mu\frac{\partial H_z}{\partial t} - \sigma^*H_z, \tag{12.69}$$

$$-\frac{\partial H_z}{\partial x} = \epsilon\frac{\partial E_y}{\partial t} + \sigma E_y. \tag{12.70}$$

If $E_y = E_y(x,t)$ possesses a term that propagates only to the right, then (using phasors again)

$$\bar{E} = E_y(x,t)\bar{y} = E_0e^{j\omega t}e^{-\gamma x}\bar{y} \tag{12.71}$$

for some suitable propagation constant $\gamma$. For suitable characteristic impedance $Z$, we must have

$$\bar{H} = H_z(x,t)\bar{z} = \frac{E_0}{Z}e^{j\omega t}e^{-\gamma x}\bar{z}. \tag{12.72}$$

We may use (12.69) and (12.70) to determine $\gamma$ and $Z$. Substituting (12.71) and (12.72) into (12.69) and (12.70) and solving for $\gamma$ and $Z$ yields

$$Z^2 = \frac{j\omega\mu + \sigma^*}{j\omega\epsilon + \sigma}, \quad \gamma^2 = (j\omega\epsilon + \sigma)(j\omega\mu + \sigma^*). \tag{12.73}$$

How to handle the complex square roots needed to obtain $Z$ and $\gamma$ will be dealt with below. Observe that the equations in (12.73) reduce to the previous cases (12.59) and (12.63) when $\sigma = \sigma^* = 0$. It is noteworthy that when we have the condition

$$\frac{\sigma}{\epsilon} = \frac{\sigma^*}{\mu}, \tag{12.74}$$

then we have

$$Z^2 = \frac{j\omega\mu + \sigma^*}{j\omega\epsilon + \sigma} = \frac{\mu}{\epsilon}\cdot\frac{j\omega + \sigma^*/\mu}{j\omega + \sigma/\epsilon} = \frac{\mu}{\epsilon}. \tag{12.75}$$
Condition (12.74) is what makes the creation of a PML possible, as will be considered later in this section, and will be demonstrated in Section 12.5.

Now we must investigate what happens to waves when they encounter a sudden change in the material properties, specifically, an interface between layers. This situation is depicted in Fig. 12.3. Assume that medium 1 has physical parameters $\epsilon_1, \mu_1, \sigma_1$, and $\sigma_1^*$, while medium 2 has physical parameters $\epsilon_2, \mu_2, \sigma_2$, and $\sigma_2^*$. The corresponding characteristic impedance and propagation constant for medium 1 are thus [via (12.73)]

$$Z_1^2 = \frac{j\omega\mu_1 + \sigma_1^*}{j\omega\epsilon_1 + \sigma_1}, \quad \gamma_1^2 = (j\omega\epsilon_1 + \sigma_1)(j\omega\mu_1 + \sigma_1^*), \tag{12.76}$$

while for medium 2 we have

$$Z_2^2 = \frac{j\omega\mu_2 + \sigma_2^*}{j\omega\epsilon_2 + \sigma_2}, \quad \gamma_2^2 = (j\omega\epsilon_2 + \sigma_2)(j\omega\mu_2 + \sigma_2^*). \tag{12.77}$$

In Fig. 12.3, for some constants $E$ and $H$ we have for the incident field

$$\bar{E}_i = E_i\bar{y} = Ee^{j\omega t - \gamma_1 x}\bar{y}, \quad \bar{H}_i = H_i\bar{z} = He^{j\omega t - \gamma_1 x}\bar{z}$$

Figure 12.3 A plane wave normally incident on an interface (boundary) between two different media. The magnetic field components are directed orthogonal to the page (out of the page). The interface is at $x = 0$ and is the $yz$-plane.
so that

$$\frac{\partial H_i}{\partial t} = j\omega H_i, \quad \frac{\partial E_i}{\partial x} = -\gamma_1E_i,$$

and via (12.69)

$$-\gamma_1E_i = -j\omega\mu_1H_i - \sigma_1^*H_i,$$

implying that

$$\frac{E_i}{H_i} = \frac{j\omega\mu_1 + \sigma_1^*}{\gamma_1} = Z_1. \tag{12.78}$$

Similarly, for the transmitted field, we must have

$$\frac{E_t}{H_t} = \frac{j\omega\mu_2 + \sigma_2^*}{\gamma_2} = Z_2. \tag{12.79}$$

But for the reflected field, again for suitable constants $E'$ and $H'$, we have

$$\bar{E}_r = E_r\bar{y} = E'e^{j\omega t}e^{\gamma_1 x}\bar{y}, \quad \bar{H}_r = H_r\bar{z} = H'e^{j\omega t}e^{\gamma_1 x}\bar{z}$$

so that

$$\frac{\partial H_r}{\partial t} = j\omega H_r, \quad \frac{\partial E_r}{\partial x} = \gamma_1E_r,$$

and via (12.69)

$$\gamma_1E_r = -j\omega\mu_1H_r - \sigma_1^*H_r,$$

implying

$$\frac{E_r}{H_r} = -\frac{j\omega\mu_1 + \sigma_1^*}{\gamma_1} = -Z_1. \tag{12.80}$$

The electric and magnetic field components are tangential to the interface, and so must be continuous across it. This implies that at $x = 0$, and for all $t$ (in Fig. 12.3),

$$H_i + H_r = H_t, \quad E_i + E_r = E_t. \tag{12.81}$$

If we substitute (12.78)–(12.80) into (12.81), then after a bit of algebra we have

$$\tau = \frac{E_t}{E_i} = \frac{2Z_2}{Z_2 + Z_1} \tag{12.82}$$

and

$$\rho = \frac{E_r}{E_i} = \frac{Z_2 - Z_1}{Z_2 + Z_1}. \tag{12.83}$$

It is easy to confirm that

$$\tau = 1 + \rho. \tag{12.84}$$
We call $\tau$ the transmission coefficient from medium 1 into medium 2, and $\rho$ the reflection coefficient from medium 1 into medium 2. The coefficients $\tau$ and $\rho$ are often called Fresnel coefficients, especially in the field of optics.
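As a concrete check of (12.82)–(12.84), the following Python sketch (purely illustrative; the book's own routines are in MATLAB, and the function name is ours) computes $\tau$ and $\rho$ for lossless media, here for free space into a nonmagnetic dielectric with $\epsilon_r = 4$, as in Example 12.5 later in this section:

```python
import math

MU0 = 4e-7 * math.pi          # permeability of free space (H/m)
EPS0 = 8.854185e-12           # permittivity of free space (F/m)

def fresnel(eps_r1, mu_r1, eps_r2, mu_r2):
    """Transmission and reflection coefficients (12.82) and (12.83)
    for a plane wave normally incident on a lossless interface."""
    z1 = math.sqrt(mu_r1 * MU0 / (eps_r1 * EPS0))   # eq. (12.59)
    z2 = math.sqrt(mu_r2 * MU0 / (eps_r2 * EPS0))
    tau = 2 * z2 / (z2 + z1)
    rho = (z2 - z1) / (z2 + z1)
    return tau, rho

tau, rho = fresnel(1.0, 1.0, 4.0, 1.0)
print(tau, rho)   # tau = 2/3, rho = -1/3, and tau = 1 + rho as in (12.84)
```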
If $\sigma^* = 0$, then from (12.73) the propagation constant for a sinusoidal wave in a conductor is obtained from

$$\gamma^2 = -\mu\epsilon\omega^2 + j\omega\mu\sigma, \tag{12.85}$$

where $\mu\epsilon\omega^2 > 0$ and $\omega\mu\sigma > 0$. We may express $\gamma^2$ in polar form: $\gamma^2 = r_1e^{j\theta_1}$. But $\gamma = re^{j\theta} = [r_1e^{j\theta_1}]^{1/2}$. In general, if $z = re^{j\theta}$, then

$$z^{1/2} = r^{1/2}e^{j\theta/2}, \quad z^{1/2} = r^{1/2}e^{j(\theta/2 + \pi)}. \tag{12.86}$$

From (12.85)

$$r_1 = |\gamma^2| = \omega\mu\sqrt{\sigma^2 + \epsilon^2\omega^2}, \quad \theta_1 = \frac{\pi}{2} + \tan^{-1}\left(\frac{\epsilon\omega}{\sigma}\right). \tag{12.87}$$

Consequently

$$\gamma = \pm\left[\omega\mu\sqrt{\sigma^2 + \epsilon^2\omega^2}\right]^{1/2}e^{j\left[\frac{\pi}{4} + \frac{1}{2}\tan^{-1}(\epsilon\omega/\sigma)\right]}. \tag{12.88}$$

A special case is the perfect insulator (perfect dielectric), for which $\sigma = 0$. Since $\tan^{-1}(\infty) = \pi/2$, Eq. (12.88) reduces to $\gamma = \pm j\sqrt{\mu\epsilon}\,\omega = \pm j\beta$. More generally ($\sigma$ not necessarily zero), we have $\gamma = \pm(\alpha + j\beta)$, where

$$\alpha = \left[\omega\mu\sqrt{\sigma^2 + \epsilon^2\omega^2}\right]^{1/2}\cos\left[\frac{\pi}{4} + \frac{1}{2}\tan^{-1}\left(\frac{\epsilon\omega}{\sigma}\right)\right], \tag{12.89a}$$

$$\beta = \left[\omega\mu\sqrt{\sigma^2 + \epsilon^2\omega^2}\right]^{1/2}\sin\left[\frac{\pi}{4} + \frac{1}{2}\tan^{-1}\left(\frac{\epsilon\omega}{\sigma}\right)\right]. \tag{12.89b}$$
Example 12.3 Assume that $\mu = \mu_0$, $\epsilon = \epsilon_0$, and that $\sigma = 0.0001$ mhos/meter. For $\omega = 2\pi f$ [$f$ is the sinusoid's frequency in hertz (Hz)], from (12.89) we obtain the following table:

    f (Hz)        alpha        beta
    1 x 10^6      0.015236     0.025911
    1 x 10^7      0.018762     0.210423
    1 x 10^8      0.018836     2.095929
    1 x 10^9      0.018837    20.958455

Keeping the parameters the same, except that now $\sigma = 0.01$ mhos/meter, we have the following table:

    f (Hz)        alpha        beta
    1 x 10^6      0.198140     0.199245
    1 x 10^7      0.611091     0.646032
    1 x 10^8      1.523602     2.591125
    1 x 10^9      1.876150    21.042254
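The table entries follow directly from (12.85) and (12.89). A short Python sketch (illustrative only; the book's numerical work is in MATLAB) that reproduces them by taking the principal complex square root of $\gamma^2$:

```python
import cmath
import math

MU0 = 4e-7 * math.pi          # mu_0 (H/m)
EPS0 = 8.854185e-12           # eps_0 (F/m)

def alpha_beta(f, sigma, mu=MU0, eps=EPS0):
    """Attenuation and phase constants, gamma = alpha + j*beta, via (12.85)."""
    w = 2 * math.pi * f
    gamma2 = complex(-mu * eps * w**2, w * mu * sigma)  # eq. (12.85)
    # gamma2 lies in the second quadrant, so the principal square root
    # lies in the first quadrant (alpha > 0, beta > 0):
    g = cmath.sqrt(gamma2)
    return g.real, g.imag

a, b = alpha_beta(1e6, 1e-4)
print(a, b)   # approximately 0.015236 and 0.025911, as in the first table
```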
In general, from (12.89a) we have $\alpha \geq 0$. If in Fig. 12.3 medium 1 is free space while medium 2 is a conductor with $0 < \sigma < \infty$, then $E_t$ has the form [using (12.82)] for $x \geq 0$

$$\bar{E}_t = \tau Ee^{j\omega t}e^{-\gamma_2x}\bar{y} = \tau Ee^{j\omega t}e^{-\alpha_2x}e^{-j\beta_2x}\bar{y}. \tag{12.90}$$

Of course, we also have $\bar{E}_i = Ee^{j\omega t}e^{-j\beta_1x}\bar{y}$ for $x \leq 0$, since $\gamma_1 = j\beta_1$ (i.e., $\alpha_1 = 0$ in free space), and $\bar{E}_r = \rho Ee^{j\omega t}e^{j\beta_1x}\bar{y}$ for $x \leq 0$ [using (12.83)]. Since $\alpha_2 > 0$, the factor $e^{-\alpha_2x}$ will go to zero as $x \to \infty$. The amplitude of the wave must decay as it progresses from free space (medium 1) into the conductive medium (medium 2). The rate of decay certainly depends on the size of $\alpha_2$.

Now suppose that medium 1 is again free space, but that medium 2 has both $\sigma_2 > 0$ and $\sigma_2^* > 0$ such that condition (12.74) holds with $\mu_2 = \mu_0$ and $\epsilon_2 = \epsilon_0$; specifically,

$$\frac{\sigma_2}{\epsilon_0} = \frac{\sigma_2^*}{\mu_0}, \tag{12.91}$$

which implies [via (12.75)] that $Z_2 = \sqrt{\mu_0/\epsilon_0}$. Since medium 1 is free space, $Z_1 = \sqrt{\mu_0/\epsilon_0}$, too. The reflection coefficient from medium 1 into medium 2 is [via (12.83)]

$$\rho = \frac{Z_2 - Z_1}{Z_2 + Z_1} = \frac{Z_0 - Z_0}{2Z_0} = 0.$$

When wave $\bar{E}_i$ in medium 1 encounters the interface (at $x = 0$ in Fig. 12.3), there will be no reflected component; that is, we will have $\bar{E}_r = 0$. From (12.73) we obtain

$$\gamma_2^2 = (\sigma_2\sigma_2^* - \omega^2\mu_0\epsilon_0) + j\omega(\sigma_2\mu_0 + \sigma_2^*\epsilon_0), \tag{12.92}$$

and we select the medium 2 parameters so that for $\gamma_2 = \alpha_2 + j\beta_2$ we obtain $\alpha_2 > 0$, with $\alpha_2$ large enough that the wave is rapidly attenuated, in that $e^{-\alpha_2x}$ is small for relatively small $x$. In this case we may define medium 2 to be a perfectly matched layer (PML). It is perfectly matched in the sense that its characteristic impedance is the same as that of medium 1, thus eliminating reflections at the interface. Because it is lossy, it absorbs radiation incident on it. The layer dissipates energy without reflection. It thus simulates the walls of an anechoic chamber. In other words, an anechoic chamber has walls that approximately realize condition (12.74). The necessity to simulate the walls of an anechoic chamber will become clearer in Section 12.5 when we look at the FDTD method.
Finally, we remark on the similarities between the vibrating string problem and the problem considered here. The analytical solution method employed in Section 12.3.1 was separation of variables, and we have employed the same approach here, since all of our electromagnetic field solutions are of the form $u(x,t) = X(x)T(t)$. The main difference is that in the vibrating string problem we have boundary conditions defined by the ends of the string being tied down somewhere, while in the electromagnetic wave propagation problem as we have considered it here there are no boundaries, or rather, the boundaries are at $x = \pm\infty$.
12.4 THE FINITE-DIFFERENCE (FD) METHOD

We now consider a classical approach to the numerical solution of (12.9) that we call the finite-difference (FD) method. Note that the method to follow is by no means the only approach. Indeed, the FDTD method to be considered in Section 12.5 is an alternative, and there are still others.

Following (12.5),

$$\frac{\partial^2u(x_k, t_n)}{\partial t^2} = \frac{u(x_k, t_{n+1}) - 2u(x_k, t_n) + u(x_k, t_{n-1})}{\tau^2} - \frac{\tau^2}{12}\frac{\partial^4u(x_k, \eta_n)}{\partial t^4} \tag{12.93}$$

for some $\eta_n \in [t_{n-1}, t_{n+1}]$, and

$$\frac{\partial^2u(x_k, t_n)}{\partial x^2} = \frac{u(x_{k+1}, t_n) - 2u(x_k, t_n) + u(x_{k-1}, t_n)}{h^2} - \frac{h^2}{12}\frac{\partial^4u(\xi_k, t_n)}{\partial x^4} \tag{12.94}$$

for some $\xi_k \in [x_{k-1}, x_{k+1}]$, where

$$x_k = kh, \quad t_n = n\tau \tag{12.95}$$

for $k = 0, 1, \ldots, M$, and $n \in \mathbb{Z}^+$. On substitution of (12.93) and (12.94) into (12.9), we have

$$\frac{u(x_k, t_{n+1}) - 2u(x_k, t_n) + u(x_k, t_{n-1})}{\tau^2} - c^2\frac{u(x_{k+1}, t_n) - 2u(x_k, t_n) + u(x_{k-1}, t_n)}{h^2} = \underbrace{\frac{1}{12}\left[\tau^2\frac{\partial^4u(x_k, \eta_n)}{\partial t^4} - c^2h^2\frac{\partial^4u(\xi_k, t_n)}{\partial x^4}\right]}_{=e_{k,n}}, \tag{12.96}$$

where $e_{k,n}$ is the local truncation error. Since $u_{k,n} \approx u(x_k, t_n)$, from (12.96) we obtain the difference equation

$$u_{k,n+1} - 2u_{k,n} + u_{k,n-1} - e^2u_{k+1,n} + 2e^2u_{k,n} - e^2u_{k-1,n} = 0, \tag{12.97}$$

where

$$e = \frac{\tau}{h}c \tag{12.98}$$
is sometimes called the Courant parameter. It has a crucial role to play in determining the stability of the FD method (and of the FDTD method, too). If we solve (12.97) for $u_{k,n+1}$, we obtain

$$u_{k,n+1} = 2(1 - e^2)u_{k,n} + e^2(u_{k+1,n} + u_{k-1,n}) - u_{k,n-1}, \tag{12.99}$$

where $k = 1, 2, \ldots, M - 1$, and $n = 1, 2, 3, \ldots$. Equation (12.99) is the main recursion in the FD algorithm. However, we need to account for the initial and boundary conditions in order to initialize this recursion. Before we consider this matter, note that, in the language of Section 10.2, the FD algorithm has a truncation error of $O(\tau^2 + h^2)$ per step [via $e_{k,n}$ in (12.96)].
Immediately on applying (12.10), since $L = Mh$, we have

$$u_{0,n} = u_{M,n} = 0 \quad \text{for } n \in \mathbb{Z}^+. \tag{12.100}$$

Since from (12.11) $u(x, 0) = f(x)$, we also have

$$u_{k,0} = f(x_k) \tag{12.101}$$

for $k = 0, 1, \ldots, M$. From (12.101) we have $u_{k,0}$, but we also need $u_{k,1}$ [consider (12.99) for $n = 1$]. To obtain a suitable expression, first observe that from (12.9)

$$\frac{\partial^2u(x,t)}{\partial t^2} = c^2\frac{\partial^2u(x,t)}{\partial x^2} \;\Longrightarrow\; \frac{\partial^2u(x,0)}{\partial t^2} = c^2\frac{\partial^2u(x,0)}{\partial x^2} = c^2f^{(2)}(x). \tag{12.102}$$

Now, on applying the Taylor series expansion, we see that for some $\mu \in [0, t_1] = [0, \tau]$ (and $\mu$ may depend on $x$)

$$u(x, t_1) = u(x, 0) + \tau\frac{\partial u(x,0)}{\partial t} + \frac{1}{2}\tau^2\frac{\partial^2u(x,0)}{\partial t^2} + \frac{1}{6}\tau^3\frac{\partial^3u(x,\mu)}{\partial t^3},$$

and on applying (12.11) and (12.102), this becomes

$$u(x, t_1) = u(x, 0) + \tau g(x) + \frac{1}{2}\tau^2c^2f^{(2)}(x) + \frac{1}{6}\tau^3\frac{\partial^3u(x,\mu)}{\partial t^3}. \tag{12.103}$$

In particular, for $x = x_k$, this yields

$$u(x_k, t_1) = u(x_k, 0) + \tau g(x_k) + \frac{1}{2}\tau^2c^2f^{(2)}(x_k) + \frac{1}{6}\tau^3\frac{\partial^3u(x_k,\mu_k)}{\partial t^3}. \tag{12.104}$$

If $f(x) \in C^4[0, L]$, then, for some $\xi_k \in [x_{k-1}, x_{k+1}]$, we have

$$f^{(2)}(x_k) = \frac{f(x_{k+1}) - 2f(x_k) + f(x_{k-1})}{h^2} - \frac{h^2}{12}f^{(4)}(\xi_k). \tag{12.105}$$

Since $u(x_k, 0) = f(x_k)$ [recall (12.11) again], if we substitute (12.105) into (12.104), we obtain

$$u(x_k, t_1) = f(x_k) + \tau g(x_k) + \frac{1}{2}\frac{c^2\tau^2}{h^2}[f(x_{k+1}) - 2f(x_k) + f(x_{k-1})] + O(\tau^3 + \tau^2h^2). \tag{12.106}$$

Via (12.98) this yields the required approximation

$$u_{k,1} = f(x_k) + \tau g(x_k) + \frac{1}{2}e^2[f(x_{k+1}) - 2f(x_k) + f(x_{k-1})]$$

or

$$u_{k,1} = (1 - e^2)f(x_k) + \frac{1}{2}e^2[f(x_{k+1}) + f(x_{k-1})] + \tau g(x_k). \tag{12.107}$$
Taking account of the boundary conditions (12.100), Eq. (12.99) can be expressed in matrix form as

$$\begin{bmatrix} u_{1,n+1} \\ u_{2,n+1} \\ \vdots \\ u_{M-2,n+1} \\ u_{M-1,n+1} \end{bmatrix} = \begin{bmatrix} 2(1-e^2) & e^2 & & & \\ e^2 & 2(1-e^2) & e^2 & & \\ & \ddots & \ddots & \ddots & \\ & & e^2 & 2(1-e^2) & e^2 \\ & & & e^2 & 2(1-e^2) \end{bmatrix}\begin{bmatrix} u_{1,n} \\ u_{2,n} \\ \vdots \\ u_{M-2,n} \\ u_{M-1,n} \end{bmatrix} - \begin{bmatrix} u_{1,n-1} \\ u_{2,n-1} \\ \vdots \\ u_{M-2,n-1} \\ u_{M-1,n-1} \end{bmatrix}. \tag{12.108}$$

This matrix recursion is run for $n = 1, 2, 3, \ldots$, and the initial conditions are provided by (12.101) and (12.107).
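The recursion (12.99), with the initialization (12.101) and (12.107) and boundary conditions (12.100), is straightforward to implement. The following Python sketch is purely illustrative (the book's own routines are in MATLAB, and the function name is ours); as a check, it is exercised on the standing-mode solution $u(x,t) = \sin(\pi x)\cos(\pi ct)$, which the scheme reproduces essentially exactly when $e = 1$:

```python
import math

def fd_wave(f, g, c, L, M, tau, N):
    """Explicit FD solution of u_tt = c^2 u_xx on [0, L], u(0,t) = u(L,t) = 0.
    f and g give the initial position and velocity; returns x grid and u at t = N*tau."""
    h = L / M
    e2 = (tau * c / h) ** 2          # squared Courant parameter (12.98)
    x = [k * h for k in range(M + 1)]
    u_prev = [f(xk) for xk in x]     # u_{k,0}, eq. (12.101)
    u_curr = [0.0] * (M + 1)         # boundaries stay zero, eq. (12.100)
    for k in range(1, M):            # u_{k,1} from (12.107)
        u_curr[k] = ((1 - e2) * f(x[k])
                     + 0.5 * e2 * (f(x[k + 1]) + f(x[k - 1]))
                     + tau * g(x[k]))
    for _ in range(N - 1):           # main recursion (12.99)
        u_next = [0.0] * (M + 1)
        for k in range(1, M):
            u_next[k] = (2 * (1 - e2) * u_curr[k]
                         + e2 * (u_curr[k + 1] + u_curr[k - 1])
                         - u_prev[k])
        u_prev, u_curr = u_curr, u_next
    return x, u_curr

# Standing-mode test: u(x,t) = sin(pi x) cos(pi t) solves (12.9) with c = 1.
# Here e = 1 (h = c*tau), so the computed solution should match to roundoff.
x, u = fd_wave(lambda x: math.sin(math.pi * x), lambda x: 0.0,
               c=1.0, L=1.0, M=20, tau=0.05, N=40)
err = max(abs(uk - math.sin(math.pi * xk) * math.cos(math.pi * 40 * 0.05))
          for xk, uk in zip(x, u))
print(err)   # roundoff level
```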
We recall from Chapter 10 that numerical methods for the solution of ODE IVPs can be unstable. The same problem can arise in the numerical solution of PDEs. In particular, as the FD method is effectively an explicit method, it can certainly become unstable if $h$ and $\tau$ are inappropriately selected.

As noted by others [2, 13], we will have

$$\lim_{h,\tau\to 0} u_{k,n} = u(hk, \tau n)$$

provided that $0 < e \leq 1$. This is the famous Courant–Friedrichs–Lewy (CFL) condition for the stability of the FD method, and is originally due to Courant et al. [23]. The special case where $e = 1$ is interesting and easy to analyze. In this case (12.99) reduces to

$$u_{k,n+1} = u_{k+1,n} - u_{k,n-1} + u_{k-1,n}. \tag{12.109}$$

We recall that $u(x,t)$ has the form

$$u(x,t) = v(x - ct) + w(x + ct)$$

[see (12.50)]. Hence $u(x_k, t_n) = v(x_k - ct_n) + w(x_k + ct_n)$. Observe that, since $h = c\tau$ (as $e = 1$), we have

$$u(x_{k+1}, t_n) - u(x_k, t_{n-1}) + u(x_{k-1}, t_n)$$
$$= v(x_k + h - ct_n) + w(x_k + h + ct_n) - v(x_k - ct_n + c\tau) - w(x_k + ct_n - c\tau) + v(x_k - h - ct_n) + w(x_k - h + ct_n)$$
$$= v(x_k - h - ct_n) + w(x_k + h + ct_n)$$
$$= v(x_k - ct_n - c\tau) + w(x_k + ct_n + c\tau)$$
$$= v(x_k - ct_{n+1}) + w(x_k + ct_{n+1})$$
$$= u(x_k, t_{n+1}),$$

or

$$u(x_k, t_{n+1}) = u(x_{k+1}, t_n) - u(x_k, t_{n-1}) + u(x_{k-1}, t_n). \tag{12.110}$$

Equation (12.110) has a form identical to that of (12.109). In other words, the algorithm (12.109) gives the exact solution to (12.9), but only at $x = hk$ and $t = \tau n$ with $h = c\tau$ (which is a rather restrictive situation).
A more general approach to error analysis that confirms the CFL condition is sometimes called von Neumann stability analysis [2]. We outline the approach as follows. We begin by defining the global truncation error

$$\epsilon_{k,n} = u(x_k, t_n) - u_{k,n}. \tag{12.111}$$

Via (12.96)

$$u(x_k, t_{n+1}) - 2u(x_k, t_n) + u(x_k, t_{n-1}) - e^2[u(x_{k+1}, t_n) - 2u(x_k, t_n) + u(x_{k-1}, t_n)] = \tau^2e_{k,n}. \tag{12.112}$$

If we subtract (12.97) from (12.112) and simplify the result using (12.111), we obtain

$$\epsilon_{k,n+1} = 2(1 - e^2)\epsilon_{k,n} + e^2[\epsilon_{k+1,n} + \epsilon_{k-1,n}] - \epsilon_{k,n-1} + \tau^2e_{k,n} \tag{12.113}$$

for $k = 0, 1, \ldots, M$ ($L = Mh$), and $n = 1, 2, 3, \ldots$. Equation (12.113) is a two-dimensional difference equation for the global error sequence $(\epsilon_{k,n})$. The term $\tau^2e_{k,n}$ is a forcing term, and if $u(x,t)$ is smooth enough, the forcing term will be bounded for all $k$ and $n$. Basically, we can show that the CFL condition $0 < e \leq 1$ prevents $\lim_{n\to\infty}|\epsilon_{k,n}| = \infty$ for all $k = 0, 1, \ldots, M$. Analogously to our stability analysis approach for ODE IVPs from Chapter 10, we may consider the homogeneous problem

$$\epsilon_{k,n+1} = 2(1 - e^2)\epsilon_{k,n} + e^2[\epsilon_{k+1,n} + \epsilon_{k-1,n}] - \epsilon_{k,n-1}, \tag{12.114}$$

which is just (12.113) with the forcing term made identically zero for all $k$ and $n$. In Section 12.3.1 we learned that separation of variables was a useful means to solve (12.9). We therefore believe that a discrete version of this approach will be helpful in solving (12.114). To this end we postulate a typical solution of (12.114) of the form

$$\epsilon_{k,n} = \exp[j\alpha kh + \beta n\tau] \tag{12.115}$$

for suitable constants $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{C}$. We note that (12.115) has similarities to (12.28), and is really a term in a discrete form of Fourier series expansion. With $s = e^{\beta\tau}$, we also see that

$$|\epsilon_{k,n}| = |\exp(\beta n\tau)| = |s^n|.$$

Thus, if $|s| \leq 1$, we will not have unbounded growth of the error sequence $(\epsilon_{k,n})$ as $n$ increases. If we now substitute (12.115) into (12.114), we obtain (after simplification) the characteristic equation

$$s^2 - \underbrace{[2(1 - e^2) + 2\cos(\alpha h)e^2]}_{=2b}s + 1 = 0. \tag{12.116}$$

Using the identity $2\sin^2x = 1 - \cos(2x)$, we obtain

$$b = 1 - 2e^2\sin^2\left(\frac{\alpha h}{2}\right). \tag{12.117}$$

It is easy to confirm that $|b| \leq 1$ for all $e$ such that $0 < e \leq 1$, because $0 \leq \sin^2(\alpha h/2) \leq 1$ for all $\alpha h \in \mathbb{R}$. We note that $s^2 - 2bs + 1 = 0$ for $s = s_1, s_2$, where

$$s_1 = b + \sqrt{b^2 - 1}, \quad s_2 = b - \sqrt{b^2 - 1}. \tag{12.118}$$

If $|b| > 1$, then $|s_k| > 1$ for some $k \in \{1, 2\}$, which can happen if we permit $e > 1$. Naturally, we reject this choice, as it yields unbounded growth in the size of $\epsilon_{k,n}$ as $n \to \infty$. If $|b| \leq 1$, then clearly $|s_k| = 1$ for all $k$. (To see this, consider the product $s_1s_2 = s_1\bar{s}_1 = |s_1|^2 = 1$.) This prevents unbounded growth of $\epsilon_{k,n}$. Thus, we have validated the CFL condition for the selection of the Courant parameter $e$ (i.e., we must always choose $e$ to satisfy $0 < e \leq 1$).
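This root condition is easy to check numerically. A small Python sketch (illustrative only) that evaluates the roots of (12.116) over a sweep of $\alpha h$, for a stable and an unstable choice of $e$:

```python
import cmath
import math

def root_magnitudes(e, alpha_h):
    """Magnitudes of the roots of s^2 - 2bs + 1 = 0, with b from (12.117)."""
    b = 1 - 2 * e**2 * math.sin(alpha_h / 2) ** 2
    d = cmath.sqrt(complex(b * b - 1, 0.0))
    return abs(b + d), abs(b - d)

# For 0 < e <= 1 both roots stay on the unit circle for every alpha_h:
worst = max(max(root_magnitudes(1.0, ah / 100)) for ah in range(0, 629))
print(worst)   # approximately 1

# For e > 1 some alpha_h yields a root outside the unit circle (instability):
print(max(root_magnitudes(1.2, math.pi)))   # noticeably greater than 1
```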
Example 12.4 Figure 12.4 illustrates the application of the recursion (12.108) to the vibrating string problem of Example 12.1. The simulation parameters are stated in the figure caption. The reader should compare the approximate solution of Fig. 12.4 to the exact solution of Fig. 12.2. The apparent loss of accuracy as the number of time steps increases (i.e., with increasing $n\tau$) is due to the phenomenon of numerical dispersion [22], a topic considered in the next section in the context of the FDTD method. Of course, simulation accuracy improves as $h, \tau \to 0$ for fixed $T = N\tau$ and $L = Mh$.

Figure 12.4 FD method approximate solution to the vibrating string problem. A mesh plot of $u_{k,n}$ as given by Eq. (12.108) for the parameters $L = 1$, with $h = 0.05$ and $\tau = 0.1$, which meets the CFL criterion for stability of the simulation.
12.5 THE FINITE-DIFFERENCE TIME-DOMAIN (FDTD) METHOD

The FDTD method is often attributed to Yee [14]. It is a finite-difference scheme, just as the FD method of Section 12.4 is a finite-difference scheme. However, it is of such a nature as to be particularly useful in solving hyperbolic PDEs where the boundary conditions are at infinity (i.e., wave propagation problems of the kind considered in Section 12.3.2).

The FDTD method considers approximations to $H_z(x,t)$ and $E_y(x,t)$ given by applying the central difference and forward difference approximations to the first derivatives in the PDEs (12.69) and (12.70). We will use the following notation for the sampling of continuous functions such as $f(x,t)$:

$$f_{k,n} \approx f(k\Delta x, n\Delta t), \quad f_{k+\frac{1}{2},n+\frac{1}{2}} \approx f\left(\left(k + \tfrac{1}{2}\right)\Delta x, \left(n + \tfrac{1}{2}\right)\Delta t\right) \tag{12.119}$$

(so $\Delta x$ replaces $h$, and $\Delta t$ replaces $\tau$ here, where $h$ and $\tau$ were the grid spacings used in previous sections). For convenience, let $E = E_y$ and $H = H_z$ (i.e., we drop the subscripts on the field components). We approximate the derivatives in (12.69) and (12.70) specifically according to

$$\frac{\partial H}{\partial t} \approx \frac{1}{\Delta t}\left[H_{k+\frac{1}{2},n+\frac{1}{2}} - H_{k+\frac{1}{2},n-\frac{1}{2}}\right], \tag{12.120a}$$

$$\frac{\partial E}{\partial t} \approx \frac{1}{\Delta t}[E_{k,n+1} - E_{k,n}], \tag{12.120b}$$

$$\frac{\partial H}{\partial x} \approx \frac{1}{\Delta x}\left[H_{k+\frac{1}{2},n+\frac{1}{2}} - H_{k-\frac{1}{2},n+\frac{1}{2}}\right], \tag{12.120c}$$

$$\frac{\partial E}{\partial x} \approx \frac{1}{\Delta x}[E_{k+1,n} - E_{k,n}]. \tag{12.120d}$$

Define $\epsilon_k = \epsilon(k\Delta x)$, $\mu_k = \mu((k + \frac{1}{2})\Delta x)$, $\sigma_k = \sigma(k\Delta x)$, and $\sigma_k^* = \sigma^*((k + \frac{1}{2})\Delta x)$, which assumes the general situation where the material parameters vary with $x \in [0, L]$ (the computational region). Substituting these discretized material parameters and (12.120) into (12.69) and (12.70), we obtain the following algorithm:

$$H_{k+\frac{1}{2},n+\frac{1}{2}} = \left[1 - \frac{\sigma_k^*}{\mu_k}\Delta t\right]H_{k+\frac{1}{2},n-\frac{1}{2}} - \frac{1}{\mu_k}\frac{\Delta t}{\Delta x}[E_{k+1,n} - E_{k,n}], \tag{12.121a}$$

$$E_{k,n+1} = \left[1 - \frac{\sigma_k}{\epsilon_k}\Delta t\right]E_{k,n} - \frac{1}{\epsilon_k}\frac{\Delta t}{\Delta x}\left[H_{k+\frac{1}{2},n+\frac{1}{2}} - H_{k-\frac{1}{2},n+\frac{1}{2}}\right]. \tag{12.121b}$$

This is sometimes called the leapfrog algorithm. The dependencies between the estimated field components in (12.121) are illustrated in Fig. 12.5. If we assume that

$$H(-\tfrac{1}{2}\Delta x, t) = H((M + \tfrac{1}{2})\Delta x, t) = 0 \tag{12.122}$$

Figure 12.5 An illustration of the dependencies between the approximate field components given by (12.121a,b); the lines with arrows denote the "flow" of these dependencies [circles denote electric field components ($E_y$); squares denote magnetic field components ($H_z$)].
for all $t \in \mathbb{R}^+$, then a more detailed pseudocode description of the FDTD algorithm is

    H_{k+1/2,-1/2} := 0 for k = 0, 1, ..., M - 1;
    E_{k,0} := 0 for k = 0, 1, ..., M;
    for n := 0 to N - 1 do begin
        E_{m,n} := E_0 sin(omega n Delta t);  { 0 < m < M }
        for k := 0 to M - 1 do begin
            H_{k+1/2,n+1/2} := [1 - (sigma_k*/mu_k) Delta t] H_{k+1/2,n-1/2}
                               - (1/mu_k)(Delta t/Delta x)[E_{k+1,n} - E_{k,n}];
        end;
        for k := 0 to M do begin
            E_{k,n+1} := [1 - (sigma_k/eps_k) Delta t] E_{k,n}
                          - (1/eps_k)(Delta t/Delta x)[H_{k+1/2,n+1/2} - H_{k-1/2,n+1/2}];
        end;
    end;

The statement $E_{m,n} := E_0\sin(\omega n\Delta t)$ simulates an antenna that broadcasts a sinusoidal electromagnetic wave from the location $x = m\Delta x \in (0, L)$. Of course, the antenna must be located in free space.
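The pseudocode above translates almost line for line into an executable sketch. The following Python version is purely illustrative (the book's implementation, FDTD.m, is in MATLAB; the function name and grid parameters here are our own choices, and the material is restricted to free space, so $\sigma = \sigma^* = 0$):

```python
import math

MU0 = 4e-7 * math.pi
EPS0 = 8.854185e-12
C = 1.0 / math.sqrt(MU0 * EPS0)

def fdtd_free_space(lam0=1.0, s_x=0.05, s_t=0.5, M=200, N=200, m=100, E0=1.0):
    """1D FDTD leapfrog (12.121) in free space with a hard sinusoidal source
    at node m; H is pinned to zero just outside the grid, per (12.122)."""
    dx = s_x * lam0
    dt = s_t * dx / C                 # satisfies the CFL condition
    w = 2 * math.pi * C / lam0        # source frequency
    E = [0.0] * (M + 1)
    H = [0.0] * M                     # H[k] holds H_{k+1/2}
    for n in range(N):
        E[m] = E0 * math.sin(w * n * dt)          # antenna (hard source)
        for k in range(M):                        # (12.121a) with sigma* = 0
            H[k] -= (dt / (MU0 * dx)) * (E[k + 1] - E[k])
        for k in range(M + 1):                    # (12.121b) with sigma = 0
            h_right = H[k] if k < M else 0.0      # boundary condition (12.122)
            h_left = H[k - 1] if k > 0 else 0.0
            E[k] -= (dt / (EPS0 * dx)) * (h_right - h_left)
    return E

E = fdtd_free_space()
print(max(abs(e) for e in E))   # bounded: the CFL choice keeps the scheme stable
```

With these parameters the wave travels $Ns_ts_x = 5$ free-space wavelengths in each direction, so by the end of the run the field is nonzero well away from the source while remaining bounded.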
The FDTD algorithm is an explicit difference scheme, and so it may have stability problems. However, it can be shown that the algorithm is stable provided we have

$$e = c\frac{\Delta t}{\Delta x} \leq 1 \quad \text{or} \quad \Delta t \leq \frac{\Delta x}{c}. \tag{12.123}$$

Thus, the CFL condition of Section 12.4 applies to the FDTD algorithm as well. A justification of this claim appears in Taflove [8]. A MATLAB implementation of the FDTD algorithm may be found in Appendix 12.A (see routine FDTD.m). In this implementation we have introduced the parameters $s_x$ and $s_t$ ($0 < s_x, s_t < 1$) such that

$$\Delta x = s_x\lambda_0, \quad \Delta t = s_t\frac{\Delta x}{c}. \tag{12.124}$$

Clearly, $c\Delta t/\Delta x = s_t$, and so the CFL condition is met. Also, spatial sampling is determined by $\Delta x = s_x\lambda_0$, which is some fraction of a free-space wavelength $\lambda_0$ [recall (12.55)]. Note that the algorithm simulates the field for all $t \in [0, T]$, where $T = N\Delta t$. If the wave is propagating only through free space, then the wave will travel a distance

$$D = cT = Ns_ts_x\lambda_0; \tag{12.125}$$

that is, the distance traveled is $Ns_ts_x$ free-space wavelengths. Since $L = Ms_x\lambda_0$ (i.e., the computational region spans $Ms_x$ free-space wavelengths), this allows us to make a reasonable choice for $N$.

A problem with the FDTD algorithm is that even if the computational region is only free space, a wave launched from location $x = m\Delta x \in (0, L)$ will eventually strike the boundaries at $x = 0$ and/or $x = L$, and so will be reflected back toward the source. These reflections will cause very large errors in the estimates of $H$ and
TLFeBOOK
THE FINITE-DIFFERENCE TIME-DOMAIN (FDTD) METHOD 553
E. But we know from Section 12.3.2 that we may design absorbing layers called
perfectly matched layers (PMLs) that suppress these reflections.
Suppose that the PML has physical parameters /i,e,a, and a*, then, from
(12.73), the PML will have a propagation constant given by
y — (era* — a> /xe) + ja>(cr /x + cr*e).
(12.126)
If we enforce the condition (12.74), namely
a '
a
(12.127)
then (12.126) becomes
y 2 = -((T 2 - OJ 2 € 2 ) + 2JCOGIM.
(12.128)
If we enforce a — co e < 0, then y — a + j/3, where
4
-Tan
2^2
co i e
2&)CT6
(12.129)
Equation (12.129) is obtained by the same arguments that yielded (12.89). As
a > 0, then a wave on entering the PML will be attenuated by a factor e~ ax ,
where x is the depth of penetration of the wave into the PML. A particularly
simple choice for a is to let a 2 = co 2 e 2 , in which case
Since co — l^-c and c — '
(12.130) as
/M0<?0
a^co^JU. (12.130)
with /i = /Xrfio and e — e r €o, we can rewrite
2n
a = siller — -
>-0
(12.131)
If we are matching the PML to free space, then μ_r = ε_r = 1, and so α = 2π/λ0, in
which case

e^(−αx) = e^(−2πx/λ0).   (12.132)

If the PML is of thickness x = 2λ0, then, from (12.132), we have e^(−αx) = e^(−4π) ≈
3.5 × 10⁻⁶. A PML that is two free-space wavelengths thick will therefore absorb
very nearly all of the radiation incident on it at the wavelength λ0. Since we have
chosen σ² = ω²ε², it is easy to confirm that

σ = (2π/λ0)(ε_r/Z0),   σ* = (2π/λ0) μ_r Z0   (12.133)
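A small numeric check (sketched in Python rather than the book's MATLAB) of the free-space PML parameters above: the conductivities of (12.133) satisfy the matching condition (12.127), and a layer two free-space wavelengths thick attenuates an entering wave by e^(−4π):

```python
import math

lam0 = 500e-9                     # free-space wavelength (value from Example 12.5)
eps0 = 8.854185e-12               # free-space permittivity
mu0 = 400*math.pi*1e-9            # free-space permeability
Z0 = math.sqrt(mu0/eps0)          # free-space impedance

# PML matched to free space (er = mur = 1), per (12.131) and (12.133)
sigma = (2*math.pi/lam0)*(1/Z0)   # electrical conductivity
sigmastar = (2*math.pi/lam0)*Z0   # magnetic conductivity
alpha = 2*math.pi/lam0            # attenuation constant, (12.131)

match_lhs = sigmastar/mu0         # matching condition (12.127): sigma*/mu = sigma/eps
match_rhs = sigma/eps0
attenuation = math.exp(-alpha*2*lam0)  # PML thickness x = 2*lam0, per (12.132)
```

The computed attenuation agrees with the e^(−4π) ≈ 3.5 × 10⁻⁶ figure quoted in the text.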
554 NUMERICAL SOLUTION OF PARTIAL DIFFERENTIAL EQUATIONS
Figure 12.6 Typical output from FDTD.m (see Appendix 12.A) for the system described in
Example 12.5. The antenna broadcasts a sinusoid of wavelength λ0 = 500 nm (nanometers)
from the location x = 3λ0. The transmitted field strength is E0 = 1 V/m.
[via (12.127)], where we recall that Z0 = √(μ0/ε0). Routine FDTD.m in Appendix 12.A
implements PMLs according to this approach.
Example 12.5 Figure 12.6 illustrates a typical output from FDTD.m (Appendix 12.A).
The system shown in the figure occupies a computational region of
length 10λ0 (i.e., x ∈ [0, 10λ0]). Somewhat arbitrarily we have λ0 = 500 nm (nanometers).
The antenna (which, given the wavelength, could be a laser) is located
at index m = 150 (i.e., is at x = mΔx = m s_x λ0 = 3λ0, since s_x = .02). The free-space
region is for x ∈ (2λ0, 6λ0). The lossless dielectric occupies x ∈ [6λ0, 8λ0],
and has a relative permittivity of ε_r = 4. The entire computational region is non-magnetic,
and so we have μ = μ0 everywhere. Clearly, PML 1 is matched to free
space, while PML 2 is matched to the dielectric.
Since ε_r = 4, according to (12.82), the transmission coefficient from free space
into the dielectric is

τ = 2√(μ0/(ε_r ε0)) / [ √(μ0/(ε_r ε0)) + √(μ0/ε0) ] = 2/(1 + √ε_r) = 2/3.
Since E0 = 1 V/m, the amplitude of the electric field within the dielectric must be
τE0 = 2/3 V/m. From Fig. 12.6 the reader can see that the electric field within the
dielectric does indeed have an amplitude of about 2/3 V/m to a good approximation.
From (12.57) the wavelength within the dielectric material is

λ = (1/√ε_r) λ0 = (1/2) λ0.

Again from Fig. 12.6 we see that the wavelength of the transmitted field is indeed
close to (1/2)λ0 within the dielectric.
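These two checks are easy to reproduce numerically. The following Python sketch (an illustration, not from the book; the book's code is MATLAB) evaluates the transmission coefficient, the companion reflection coefficient, and the wavelength ratio for ε_r = 4:

```python
import math

er = 4.0                                           # relative permittivity of the dielectric
tau = 2.0/(1.0 + math.sqrt(er))                    # transmission coefficient, reduced form of (12.82)
rho = (1.0 - math.sqrt(er))/(1.0 + math.sqrt(er))  # reflection coefficient at the interface
lam_ratio = 1.0/math.sqrt(er)                      # wavelength in dielectric / free-space wavelength
```

Note that 1 + ρ = τ, a standard consistency check for the field coefficients at a dielectric interface.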
We observe that the PMLs in Example 12.5 do not perfectly suppress reflections
at their boundaries. For example, the wave crest closest to the interface between
PML 2 and the dielectric, and that lies within the dielectric, is somewhat higher
than it should be. It is the discretization of a continuous space that has led to these
residual reflections.
The theory of PMLs presented here does not easily extend from electromagnetic
wave propagation problems in one spatial dimension into propagation problems in
two or three spatial dimensions. It appears that the first truly successful extension of
PML theory to higher spatial dimensions is due to Berenger [15,16]. Wu and Fang
[17] claim to have improved the theory still further by improving the suppression
of the residual reflections noted above.
The problem of numerical dispersion was mentioned in Example 12.4 in the
application of the FD method to the simulation of a vibrating string. We conclude
this chapter with an account of the problem based mainly on the work of Trefethen
[22]. We will assume lossless propagation, so σ = σ* = 0 in (12.121). We will also
assume that the computational region is free space, so μ = μ0 and ε = ε0 everywhere.
If we now substitute E(x, t) = E0 sin(ωt − βx) and H(x, t) = H0 sin(ωt −
βx) into either of (12.121a) or (12.121b), apply the appropriate trigonometric
identities, and then cancel out common factors, we obtain the identity

sin(ωΔt/2) = s_t sin(βΔx/2),   (12.134)

where s_t = cΔt/Δx.
We may use (12.134) and (12.123) to obtain

v_p = ω/β = (2c/(s_t βΔx)) sin⁻¹[ s_t sin(βΔx/2) ],   (12.135)

which is the phase speed of the wave of wavelength λ0 (recall β = 2π/λ0) in the
FDTD method. For the continuous wave E(x, t) [or, for that matter, H(x, t)], recall
from Section 12.3.2 that ω/β = c, so without spatial or temporal discretization
effects, a sinusoid will propagate through free space at the speed c regardless of its
wavelength (or, equivalently, its frequency). However, (12.135) suggests that the
speed of an FDTD-simulated sinusoidal wave will vary with the wavelength. As
explained in Ref. 22 (or see Ref. 12), the group speed

v_g = dω/dβ = d(βv_p)/dβ = c cos(βΔx/2) / √( 1 − s_t² sin²(βΔx/2) )   (12.136)

is more relevant to assessing how propagation speed varies with the wavelength.
Again, in free space the continuous wave propagates at the group speed v_g =
d(βc)/dβ = c. A plot of v_g/c versus λ0/Δx [with v_g/c given by (12.136)] appears
in Fig. 2 of Represa et al. [19]. It shows that short-wavelength sinusoids travel at
slower speeds than do long-wavelength sinusoids when simulated using the FDTD
method.
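The dispersion curve is easy to evaluate. The Python sketch below (illustrative, not the book's code) implements v_g/c from (12.136) as a function of λ0/Δx, using βΔx/2 = πΔx/λ0 and the Courant parameter s_t = cΔt/Δx; it confirms that long wavelengths travel at very nearly c while short wavelengths lag:

```python
import math

def vg_over_c(lam_over_dx, s_t):
    """Normalized group speed from (12.136).

    Many grid points per wavelength (large lam_over_dx) gives v_g/c
    close to 1; coarsely sampled short wavelengths travel slower.
    """
    th = math.pi/lam_over_dx          # beta*Dx/2 = pi*Dx/lambda0
    return math.cos(th)/math.sqrt(1.0 - (s_t*math.sin(th))**2)
```

For example, with s_t = 0.1 a wave sampled at 100 points per wavelength propagates within about 0.05% of c, whereas one sampled at only 4 points per wavelength is roughly 30% slow.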
We have seen that nonsinusoidal waveforms (e.g., the triangle wave of Exam-
ple 12.1) are a superposition of sinusoids of varying frequency. Thus, if we use
the FDTD method, the FD method, or indeed any numerical method to simulate
wave propagation, we will see the effects of numerical dispersion. In other words,
the various frequency components in the wave will travel at different speeds, and
so the original shape of the wave will become lost as the simulation progresses
in time (i.e., as nΔt increases).

Figure 12.7 Numerical dispersion in the FDTD method as illustrated by the propagation
of two Gaussian pulses [(a) after n = 500 time steps; (b) after n = 1500 time steps, where
errors due to numerical dispersion are evident]. The medium is free space.

Figure 12.7 illustrates this for the case of two
Gaussian pulses traveling in opposite directions. The two pulses originally appeared
at x = 4λ0, and the medium is free space. For N = 1500 time steps, there is a very
noticeable error due to the "breakup" of the pulses as their constituent frequency
components separate out as a result of the numerical dispersion.
In closing, note that more examples of numerical dispersion may be found in
Luebbers et al. [18]. Shin and Nevels [21] explain how to work with Gaussian
test pulses to reduce numerical dispersion. We mention that Represa et al. [19]
use absorbing boundary conditions based on the theory in Mur [20], which is a
different method from the PML approach we have used in this book.
APPENDIX 12.A MATLAB CODE FOR EXAMPLE 12.5
% permittivity.m
%
% This routine specifies the permittivity profile of the
% computational region [0,L], and is needed by FDTD.m
%
function epsilon = permittivity(k,sx,lambda0,M)
epsilon0 = 8.854185e-12; % free-space permittivity
er1 = 4;                 % relative permittivity of the dielectric
Dx = sx*lambda0;         % this is Delta x
x = k*Dx;                % position at which we determine epsilon
L = M*Dx;                % location of right end of computational
                         % region
if ((x >= 0) & (x < (L-4*lambda0)))
   epsilon = epsilon0;
else
   epsilon = er1*epsilon0;
end;
%
% permeability.m
%
% This routine specifies the permeability profile of the
% computational region [0,L], and is needed by FDTD.m
function mu = permeability(k,sx,lambda0,M)
mu0 = 400*pi*1e-9;  % free-space permeability
Dx = sx*lambda0;    % this is Delta x
x = k*Dx;           % position at which we determine mu
L = M*Dx;           % location of right end of computational
                    % region
mu = mu0;
%
% econductivity.m
%
% This routine specifies the electrical conductivity profile
% of the computational region [0,L], and is needed by FDTD.m
function sigma = econductivity(k,sx,lambda0,M)
epsilon0 = 8.854185e-12;       % free-space permittivity
mu0 = 400*pi*1e-9;             % free-space permeability
er1 = 4;                       % dielectric relative permittivity
epsilon1 = er1*epsilon0;       % dielectric permittivity
Z0 = sqrt(mu0/epsilon0);       % free-space impedance
Dx = sx*lambda0;               % this is Delta x
x = k*Dx;                      % position at which we determine sigma
L = M*Dx;                      % location of right end of computational
                               % region
star1 = (2*pi/lambda0)*(1/Z0); % conductivity of PML 1 (at x = 0 end)
star2 = er1*star1;             % conductivity of PML 2 (at x = L end)
if ((x > 2*lambda0) & (x < (L - 2*lambda0)))
   sigma = 0;
elseif (x <= 2*lambda0)
   sigma = star1;
elseif (x >= (L - 2*lambda0))
   sigma = star2;
end;
% mconductivity.m
%
% This routine specifies the magnetic conductivity profile of the
% computational region [0,L], and is needed by FDTD.m
function sigmastar = mconductivity(k,sx,lambda0,M)
epsilon0 = 8.854185e-12;  % free-space permittivity
mu0 = 400*pi*1e-9;        % free-space permeability
Z0 = sqrt(mu0/epsilon0);  % free-space impedance
Dx = sx*lambda0;          % this is Delta x
x = (k+.5)*Dx;            % position at which we determine sigmastar
L = M*Dx;                 % location of right end of computational
                          % region
star = (2*pi/lambda0)*Z0;
if ((x > 2*lambda0) & (x < (L - 2*lambda0)))
   sigmastar = 0;
else
   sigmastar = star;
end;
%
% FDTD.m
%
% This routine produces the plot in Fig. 12.6 which is associated with
% Example 12.5. Thus, it illustrates the FDTD method.
% The routine returns the total electric field component Ey to the
% caller.
function Ey = FDTD
mu0 = 400*pi*1e-9;        % free-space permeability
epsilon0 = 8.854185e-12;  % free-space permittivity
c = 1/sqrt(mu0*epsilon0); % speed of light in free-space
lambda0 = 500;            % free-space wavelength of the source
                          % in nanometers
lambda0 = lambda0*1e-9;   % free-space wavelength of the source
                          % in meters
beta0 = (2*pi)/lambda0;   % free-space beta (wavenumber)
sx = .02;                 % fraction of a free-space wavelength
                          % used to determine Delta x
st = .10;                 % scale factor used to determine time-step
                          % size Delta t
Dx = sx*lambda0;          % Delta x
Dt = (st/c)*Dx;           % application of the CFL condition
                          % to determine time-step Delta t
E0 = 1;                   % amplitude of the electric field (V/m)
                          % generated by the source
m = 150;                  % source (antenna) location index
omega = beta0*c;          % source frequency (radians/second)
M = 500;                  % number of spatial grid points is (M+1)
N = 4000;                 % the number of time steps in the simulation
E = zeros(1,M+1);         % initial electric field
H = zeros(1,M+2);         % initial magnetic field
% Specify the material properties in the computational region
% (which is x in [0,M*Dx])
for k = 0:M
   epsilon(k+1) = permittivity(k,sx,lambda0,M);
   ce(k+1) = 1 - Dt*econductivity(k,sx,lambda0,M)/epsilon(k+1);
   h(k+1) = 1/epsilon(k+1);
end;
h = h*Dt/Dx;
for k = 0:(M-1)
   mu(k+1) = permeability(k,sx,lambda0,M);
   ch(k+1) = 1 - Dt*mconductivity(k,sx,lambda0,M)/mu(k+1);
   e(k+1) = 1/mu(k+1);
end;
e = e*Dt/Dx;
% Run the simulation for N time steps
for n = 1:N
   E(m+1) = E0*sin(omega*(n-1)*Dt); % Antenna is at index m
   H(2:M+1) = ch.*H(2:M+1) - e.*(E(2:M+1) - E(1:M));
   E(1:M+1) = ce.*E(1:M+1) - h.*(H(2:M+2) - H(1:M+1));
   E(m+1) = E0*sin(omega*n*Dt);
   Ey(n,:) = E; % Save the total electric field at time step n
end;
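For readers who prefer to experiment outside MATLAB, the core Yee update loop of FDTD.m translates almost line for line into NumPy. The sketch below is a reduced free-space-only version (no PMLs; the grid size, step count, and mid-grid antenna index are hypothetical choices made for speed), intended only to show the leapfrog structure of the H and E updates:

```python
import numpy as np

def fdtd_free_space(M=200, N=400, sx=0.02, st=0.10):
    """1D FDTD leapfrog in free space, mirroring the update loop of FDTD.m.

    No PML here, so N is kept small enough that the wave never reaches
    the grid edges.
    """
    mu0 = 400*np.pi*1e-9
    eps0 = 8.854185e-12
    c = 1.0/np.sqrt(mu0*eps0)
    lam0 = 500e-9
    Dx = sx*lam0
    Dt = (st/c)*Dx                # CFL-limited time step
    omega = 2*np.pi*c/lam0
    m = M//2                      # antenna index (mid-grid in this sketch)
    E = np.zeros(M+1)
    H = np.zeros(M+2)
    h = (Dt/Dx)/eps0              # free space: ce = ch = 1, sigma = sigma* = 0
    e = (Dt/Dx)/mu0
    for n in range(1, N+1):
        E[m] = np.sin(omega*(n-1)*Dt)        # hard source, as in FDTD.m
        H[1:M+1] -= e*(E[1:M+1] - E[0:M])    # H update (half-step offset grid)
        E[0:M+1] -= h*(H[1:M+2] - H[0:M+1])  # E update
        E[m] = np.sin(omega*n*Dt)
    return E

E = fdtd_free_space()
```

With these parameters the wave front advances cΔt = 0.1 grid cells per step, so after 400 steps it has spread only about 40 cells from the source and the hard grid boundaries play no role.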
REFERENCES
1. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979.
2. T. Myint-U and L. Debnath, Partial Differential Equations for Scientists and Engineers,
3rd ed., North-Holland, New York, 1987.
3. R. L. Burden and J. D. Faires, Numerical Analysis, 4th ed., PWS-KENT Publ., Boston,
MA, 1989.
4. A. Quarteroni, R. Sacco and F. Saleri, Numerical Mathematics (Texts in Applied Mathematics
series, Vol. 37), Springer-Verlag, New York, 2000.
5. G. Strang and G. Fix, An Analysis of the Finite Element Method, Prentice-Hall, Engle-
wood Cliffs, NJ, 1973.
6. S. Brenner and R. Scott, The Mathematical Theory of Finite Element Methods,
Springer-Verlag, New York, 1994.
7. C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, SIAM, Philadel-
phia, PA, 1995.
8. A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain
Method, Artech House, Norwood, MA, 1995.
9. R. Courant and D. Hilbert, Methods of Mathematical Physics, Vol. II. Partial Differen-
tial Equations, Wiley, New York, 1962.
10. J. D. Kraus and K. R. Carver, Electromagnetics, 2nd ed., McGraw-Hill, New York,
1973.
11. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York,
2002.
12. W. C. Elmore and M. A. Heald, Physics of Waves, Dover Publ., New York, 1969.
13. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York,
1966.
14. K. S. Yee, "Numerical Solution of Initial Boundary Value Problems Involving
Maxwell's Equations in Isotropic Media," IEEE Trans. Antennas Propag. AP-14, 302-
307 (May 1966).
15. J.-P. Berenger, "A Perfectly Matched Layer for the Absorption of Electromagnetic
Waves," J. Comput. Phys. 114, 185-200 (1994).
16. J.-P. Berenger, "Perfectly Matched Layer for the FDTD Solution of Wave-Structure
Interaction Problems," IEEE Trans. Antennas Propag. 44, 110-117 (Jan. 1996).
17. Z. Wu and J. Fang, "High-Performance PML Algorithms," IEEE Microwave Guided
Wave Lett. 6, 335-337 (Sept. 1996).
18. R. J. Luebbers, K. S. Kunz and K. A. Chamberlin, "An Interactive Demonstration of
Electromagnetic Wave Propagation Using Time-Domain Finite Differences," IEEE
Trans. Educ. 33, 60-68 (Feb. 1990).
19. J. Represa, C. Pereira, M. Panizo and F. Tadeo, "A Simple Demonstration of Numerical
Dispersion under FDTD," IEEE Trans. Educ. 40, 98-102 (Feb. 1997).
20. G. Mur, "Absorbing Boundary Conditions for the Finite-Difference Approximation of
the Time-Domain Electromagnetic-Field Equations," IEEE Trans. Electromagn. Compat.
EMC-23, 377-382 (Nov. 1981).
21. C.-S. Shin and R. Nevels, "Optimizing the Gaussian Excitation Function in the Finite
Difference Time Domain Method," IEEE Trans. Educ. 45, 15-18 (Feb. 2002).
22. L. N. Trefethen, "Group Velocity in Finite Difference Schemes," SIAM Rev. 24, 113-136
(April 1982).
23. R. Courant, K. Friedrichs and H. Lewy, "Über die Partiellen Differenzengleichungen
der Mathematischen Physik," Math. Ann. 100, 32-74 (1928).
PROBLEMS
12.1. Classify the following PDEs into elliptic, parabolic, and hyperbolic types
(or a combination of types).
(a) 3u_xx + 5u_xy + u_yy = x + y

(b) u_xx − u_xy + 2u_yy = u_x + u

(c) y u_xx + u_yy = 0

(d) y²u_xx − 2xy u_xy + x²u_yy = 0

(e) 4x²u_xx + u_yy = u
12.2. Derive (12.5) (both equations).
12.3. In (12.6) let h = τ, and let N = M = 4, and for convenience let f_{k,n} =
ρ(x_k, y_n)/ε. Define the vectors

v = [v_{1,1} v_{1,2} v_{1,3} v_{2,1} v_{2,2} v_{2,3} v_{3,1} v_{3,2} v_{3,3}]^T,
f = [f_{1,1} f_{1,2} f_{1,3} f_{2,1} f_{2,2} f_{2,3} f_{3,1} f_{3,2} f_{3,3}]^T.

Find the matrix A such that Av = h²f. Assume that

V_{0,n} = V_{k,0} = V_{k,4} = V_{4,n} = 0

for all k and n.
12.4. In (12.6) let h = τ = 1/4, and N = M = 4, and assume that ρ(x_k, y_n) = 0
for all k and n. Let x_0 = y_0 = 0. Suppose that

V(0, y) = V(x, 0) = 0,   V(x, 1) = x,   V(1, y) = y.

Find the linear system of equations for V_{k,n} with 1 ≤ k, n ≤ 3, and put it
in matrix form.
12.5. Recall the previous problem.
(a) Write a MATLAB routine to implement the Gauss-Seidel method (recall
Section 4.7). Use your routine to solve the linear system of equations
in the previous problem.
(b) Find the exact solution to the PDE

V_xx + V_yy = 0
for the boundary conditions stated in the previous problem.
TLFeBOOK
562 NUMERICAL SOLUTION OF PARTIAL DIFFERENTIAL EQUATIONS
(c) Use the solution in (b) to find V(k/4, n/4) at 1 < k, n < 3, and compare
to the results obtained from part (a). They should be the same. Explain
why. [Hint: Consider the error terms in (12.5).]
12.6. The previous two problems suggest that the linear systems that arise in the
numerical solution of elliptic PDEs are sparse, and so it is worth considering
their solution using the iterative methods from Section 4.7. Recall that the
iterative methods of Section 4.7 have the general form

x^(k+1) = Bx^(k) + f

[from (4.155)]. Show that

||x^(k+1) − x^(k)|| ≤ ||B|| ||x^(k) − x^(k−1)||.

[Comment: Recalling (4.36c), it can be shown that ρ(A) ≤ ||A||_p [which is
really another way of expressing (4.158)]. For example, this result can be
used to estimate the spectral radius of B_J [Eq. (4.171)].]
12.7. Consider Eq. (12.9), with the initial and boundary conditions in (12.10) and
(12.11), respectively.
(a) Consider the change of variables

ξ = x + ct,   η = x − ct,

and θ(ξ, η) replaces u(x, t) according to

θ(ξ, η) = u( (1/2)(ξ + η), (1/(2c))(ξ − η) ).   (12.P.1)

Verify the derivative operator equivalences

∂/∂x = ∂/∂ξ + ∂/∂η,   (1/c) ∂/∂t = ∂/∂ξ − ∂/∂η.   (12.P.2)

[Hint: ∂θ/∂ξ = (∂u/∂x)(∂x/∂ξ) + (∂u/∂t)(∂t/∂ξ), and
∂θ/∂η = (∂u/∂x)(∂x/∂η) + (∂u/∂t)(∂t/∂η).]

(b) Show that (12.9) is replaceable with

∂²θ/(∂ξ ∂η) = 0.   (12.P.3)

(c) Show that the solution to (12.P.3) is of the form θ(ξ, η) = P(ξ) + Q(η),
and hence

u(x, t) = P(x + ct) + Q(x − ct),

where P and Q are arbitrary twice continuously differentiable functions.

(d) Show that

P(x) + Q(x) = f(x),

P^(1)(x) − Q^(1)(x) = (1/c) g(x).

(e) Use the facts from (d) to show that

u(x, t) = (1/2)[f(x + ct) + f(x − ct)] + (1/(2c)) ∫ from x−ct to x+ct of g(s) ds.
12.8. In (12.11) suppose that
f(x) = { (H/d)(L/2 + d − x),   L/2 ≤ x ≤ L/2 + d
       { (H/d)(x − L/2 + d),   L/2 − d ≤ x ≤ L/2
       { 0,                    elsewhere,

where 0 < d < L/2. Assume that g(x) = 0 for all x.

(a) Sketch f(x).

(b) Write a MATLAB routine to implement the FD algorithm (12.108) for
computing u_{k,n}. Write the routine in such a way that it is easy to change
the parameters c, H, h, r, N, M, and d. The routine must produce a mesh
plot similar to Fig. 12.4 and a plot similar to Fig. 12.7 on the same page
(i.e., make use of subplot). The latter plot is to be of u_{k,N} versus k. Try
out your routine using the parameters c = 1, h = 0.05, r = 0.025, H =
0.1, d = L/10, M = 200, and N = 1100 (recalling that L = Mh). Do
you find numerical dispersion effects?
12.9. Repeat Example 12.1 for f(x) and g(x) in the previous problem.
12.10. Example 12.1 is about the "plucked string." Repeat Example 12.1 assuming
that f{x) — and
,'(*)
2V
2V
x),
L
0<x < -
~ ~ 2
L
— < x < L
2 ~ ~
This describes a "struck string."
12.11. The MATLAB routines in Appendix 12.A implement the FDTD method,
and generate information for the plot in Fig. 12.6. However, the reflected
field component for 2λ0 < x < 6λ0 is not computed or displayed.
Modify the code(s) in Appendix 12.A to compute the reflected field component
in the free-space region and to plot it. Verify that at the interface
between the free-space region and the dielectric |ρ| = 1/3 (magnitude of
the reflection coefficient). Of course, you will need to read the amplitude
of the reflected component from your plot to do this.
12.12. Derive Eq. (12.134).
12.13. Modify the MATLAB code(s) in Appendix 12.A to generate a plot similar
to Fig. 12.7. [Hint: E_{k,0} = E0 exp[ −((k − m)/10)² ] for k = 0, 1, ..., M is the
initial electric field. Set the initial magnetic field to zero for all k.]
12.14. Plot v_g/c versus λ0/Δx for v_g given by (12.136). Choose s_t = 0.1, 0.5, and
0.8, and plot all curves on the same graph. Use these curves to explain why
the errors due to numerical dispersion (see Fig. 12.7) are worse on the side
of the pulse opposite to its direction of travel.
13
An Introduction to MATLAB
13.1 INTRODUCTION
MATLAB is short for "matrix laboratory," and is an extremely powerful software
tool 1 for the development and testing of algorithms over a wide range of fields
including, but not limited to, control systems, signal processing, optimization, image
processing, wavelet methods, probability and statistics, and symbolic computing.
These various applications are generally divided up into toolboxes that typically
must be licensed separately from the core package.
Many books have already been written that cover MATLAB in varying degrees
of detail. Some, such as Nakamura [3], Quarteroni et al. [4], and Recktenwald
[5], emphasize MATLAB with respect to numerical analysis and methods, but
are otherwise fairly general. Other books emphasize MATLAB with respect to
particular areas such as matrix analysis and methods (e.g., see Golub and Van
Loan [1] or Hill [2]). Some books implicitly assume that the reader already knows
MATLAB [1,4]. Others assume little or no previous knowledge on the part of the
reader [2,3,5].
This chapter is certainly not a comprehensive treatment of the MATLAB tool,
and is nothing more than a quick introduction to it. Thus, the reader will have to
obtain other books on the subject, or consult the appropriate manuals for further
information. MATLAB's online help facility is quite useful, too.
13.2 STARTUP
Once properly installed, MATLAB is often invoked (e.g., on a UNIX workstation
with a cmdtool window open) by typing matlab, and hitting return. A window
under which MATLAB runs will appear. The MATLAB prompt also appears:
MATLAB commands may then be entered and executed interactively.
If you wish to work with commands in M-files (discussed further below), then
having two cmdtool windows open to the same working directory is usually desir-
able. One window would be used to run MATLAB, and the other would be used
MATLAB is written in C, but is effectively a language on its own.
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
to edit the M-files as needed (to either develop the algorithm in the file or to
debug it).
13.3 SOME BASIC OPERATORS, OPERATIONS, AND FUNCTIONS
The MATLAB command
>> diary filename
will create the file filename, and every command you type in and run, and every
result of this, will be stored in filename. This is useful for making a permanent
record of a MATLAB session which can help in documentation and sometimes in
debugging. When writing MATLAB M-files, always make sure to document your
programs. The examples in Section 13.6 illustrate this.
MATLAB tends to work in the more or less intuitive way where matrix and/or
vector operations are concerned. Of course, it is in the nature of this software tool
to assume the user is already familiar with matrix analysis and methods before
attempting to use it.
When a MATLAB command creates a vector as the output from some operation,
it may be in the form of a column or a row vector, depending on the command. A
typical MATLAB row vector is
>> x = [ 1 1 1 ];
>>
The semicolon at the end of a line prevents the printout of the result of the command
at that line (this is useful in preventing display clutter). If you wish to turn it into
a column vector, then type:
>> x = x.'
x =
1
1
1
Making this conversion is sometimes necessary as the inputs to some routines need
the vectors in either row or column format, and some routines do not care. Routines
that do not care whether the input vector is row or column make the conversion to
a consistent form internally. For example, the following command sequence (which
can be stored in a file called an M-file) converts vector x into a row vector if it is
not one already:
>> [N,M] = size(x);
>> if N ~= 1
>>    x = x.';
>> end;
>>
In this routine N is the number of rows and M is the number of columns in x. (The
size command also accepts matrices.) The related command length(x) will return
the length of vector x. This can be very useful in for loops (below).
The addition of vectors works in the obvious way:
>> x = [ 1 1 1 ];
>> y = [ 2 -1 2 ];
>> x + y

ans =

     3     0     3

In this routine, the answer might be saved in vector z by typing >> z = x + y;.
Clearly, to add vectors without error means that they must be of the same size.
MATLAB will generate an error message if matrix and vector objects are not
dimensioned properly when operations are performed on them. The mismatching
of array sizes is a very common error in MATLAB programming.
Matrices can be entered as follows:
» A = [ 1 1 ; 1 2 ]
A =
1 1
1 2
Again, addition or subtraction would occur in the expected manner. We can invert
a matrix as follows:
» inv(A)
ans =
2 -1
-1 1
Operation det(A) will give the determinant of A, and [L, U] = lu(A) will give the
LU factorization of A (if it exists). Of course, there are many other routines for
common matrix operations, and decompositions (QR decomposition, singular value
decomposition, eigendecompositions, etc.). Compute y = Ax + b:

>> x = [ 1 -1 ].';
>> b = [ 2 3 ].';
>> y = A*x + b

y =

     2
     2
The colon operator can extract parts of matrices and vectors. For example, to
place the elements in rows j to k of column n of matrix B into vector x, use
>> x = B(j:k,n);. To extract the element from row k and column n of matrix A,
use >> x = A(k,n);. To raise something (including matrices) to a specific power,
use >> C = A^p, for which p is the desired power. (This computes C = A^p.)
(Note: MATLAB indexes vectors and matrices beginning with 1.)
Unless the user overrides the defaults, variables i and j denote the square root
of −1:

>> sqrt(-1)

ans =

        0 + 1.0000i

Here, i and j are built-in constants. So, to enter a complex number, say, z = 3 − 2j,
type

>> z = 3 - 2*i;

Observe

>> x = [ 1 1+i ];
>> x'

ans =

   1.0000
   1.0000 - 1.0000i

So the transposition operator without the period gives the complex-conjugate
transpose (Hermitian transpose). Note that besides i and j, another useful built-in
constant is pi (= π).
Floating-point numbers are entered as, for instance, 1.5e-3 (which is 1.5 ×
10⁻³). MATLAB agrees with IEEE floating-point conventions, and so 0/0 will
result in NaN ("not a number") to more clearly indicate an undefined operation.
An operation like 1/0 will result in Inf as an output.
We may summarize a few important operators, functions, and other terms:
Relational Operators

<      less than
<=     less than or equal to
>      greater than
>=     greater than or equal to
==     equal to
~=     not equal to
Trigonometric Functions
sin sine
cos cosine
tan tangent
asin arcsine
acos arccosine
atan arctangent
atan2 four quadrant arctangent
sinh hyperbolic sine
cosh hyperbolic cosine
tanh hyperbolic tangent
asinh hyperbolic arcsine
acosh hyperbolic arccosine
atanh hyperbolic arctangent
Elementary Mathematical Functions

abs     absolute value
angle   phase angle (argument of a complex number)
sqrt    square root
real    real part of a complex number
imag    imaginary part of a complex number
conj    complex conjugate
rem     remainder or modulus
exp     exponential to base e
log     natural logarithm
log10   base-10 logarithm
round   round to nearest integer
fix     round toward zero
floor   round toward −∞
ceil    round toward +∞
In setting up the time axis for plotting things (discussed below), a useful com-
mand is illustrated by
>> y = [0:.2:1]

y =

     0    0.2000    0.4000    0.6000    0.8000    1.0000
Thus, [x:y:z] creates a row vector whose first element is x and whose last element
is z (depending on step size y), where the elements in between are of the form x
+ ky (where k is a positive integer).
It can also be useful to create vectors of zeros:
>> zeros(size([1:4]))

ans =

     0     0     0     0

Or, alternatively, a simpler way is

>> zeros(1,4)

ans =

     0     0     0     0
Using "zeros(n,m)" will result in an n x m matrix of zeros. Similarly, a vector
(or matrix) containing only ones would be obtained using the MATLAB function
called "ones."
13.4 WORKING WITH POLYNOMIALS
We have seen on many occasions that polynomials are vital to numerical analysis
and methods. MATLAB has nice tools for dealing with these objects.
Suppose that we have polynomials P1(s) = s + 2 and P2(s) = 3s + 4 and wish
to multiply them. In this case type the following command sequence:
» P1 = [ 1 2 ] ;
» P2 = [ 3 4 ] ;
» conv(P1 ,P2)
ans =
3 10 8
This is the correct answer since P1(s)P2(s) = 3s² + 10s + 8. From this we see
polynomials are represented as vectors of the polynomial coefficients, where the
highest-degree coefficient is the first element in the vector. This rule is followed
pretty consistently. (Note: "conv" is the MATLAB convolution function, so if you
don't already know this, convolution is mathematically essentially the same as
polynomial multiplication.)
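As an aside (in Python, not from the book), NumPy exposes exactly the same operations, and the coefficient convention, highest-degree coefficient first, matches MATLAB's:

```python
import numpy as np

P1 = [1, 2]                      # s + 2
P2 = [3, 4]                      # 3s + 4
prod = np.convolve(P1, P2)       # coefficients of 3s^2 + 10s + 8
r = np.roots([1, 1, -2])         # roots of s^2 + s - 2
val = np.polyval([1, 3, 5], -3)  # P(-3) for P(s) = s^2 + 3s + 5
```

Here np.convolve plays the role of conv, np.roots the role of roots, and np.polyval the role of polyval.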
Suppose P(s) = s² + s − 2, and we want the roots. In this case you may type

>> P = [ 1 1 -2 ];
>> roots(P)

ans =

    -2
     1
MATLAB (version 5 and later) has "mroots," which is a root finder that does a
better job of computing multiple roots.
The MATLAB function "polyval" is used to evaluate polynomials. For example,
suppose P(s) = s² + 3s + 5, and we wanted to compute P(−3). The command
sequence is
» P = [ 1 3 5 ] ;
» polyval(P, -3)
ans =
5
>>
13.5 LOOPS
We may illustrate the simplest loop construct in MATLAB as follows:
>> t = [0:.1:1];
>> for k = 1:length(t)
>>    x(k) = 5*sin( (pi/3) * t(k) ) + 2;
>> end;
>>

This command sequence computes x(t) = 5 sin(πt/3) + 2 for t = 0.1k, where k =
0, 1, ..., 10. The result is saved in the (row) vector x. However, an alternative
approach is to vectorize the calculation according to
» t = [0: .1:1];
» x = 5*sin( pi*t/3 ) + 2*ones(1 , length(t) ) ;
This yields the same result. Vectorizing calculations leads to faster code (in terms
of runtime).
A potentially useful method to add (append) elements to a vector is
» x = [];
» for k = 1:2:6
>> x = [ x k ] ;
» end;
» x
where x = [ ] defines x to be initially empty, while the for loop appends 1, 3, and
5 to the vector one element at a time.
The format of numerical outputs can be controlled using MATLAB fprintf
(which has many similarities to the ANSI C fprintf function). For example
>> for k = 0:9
fprintf('%12.8f \n',sqrt(k));
end;
0.00000000
1 .00000000
1 .41421356
1.73205081
2.00000000
2.23606798
2.44948974
2.64575131
2.82842712
3.00000000
The use of a file identifier can force the result to be printed to a specific file instead
of to the terminal (which is the result in this example). MATLAB also has save
and load commands that can save variables and arrays to memory, and read them
back, respectively.
Certainly, for loops may be nested in the expected manner. Of course, MATLAB
also supports a "while" statement. For information on conditional statements (i.e.,
"if" statements), use ">> help if."
13.6 PLOTTING AND M-FILES
Let's illustrate plotting and the use of M-files with an example. Note that M-files
are also called script files (use script as the keyword when using help for more
information on this feature).
As an exercise the reader may wish to create a file called "stepH.m" (open and
edit it in the manner you are accustomed to). In this file place the following lines:
% stepH.m
%
% This routine computes the unit-step response of the
% LTI system with system function H(s) given by
%
%                K
%  H(s) = ------------
%          s^2 + 3s + K
%
% for user input parameter K. The result is plotted.
function stepH(K)
b = [ K ];
a = [ 1 3 K ];
clf          % Clear any existing plots from the screen
step(b,a);   % Compute the step response and plot it
grid         % plot the grid
This M-file becomes a MATLAB command, and for K = 0.1 may be executed
using

>> stepH(.1);
>>
This will result in another window opening where the plot of the step response will
appear. To save this file for printing, use
>> print -dps filename.ps
which will save the plot as a postscript file called "filename.ps." Other printing for-
mats are available. As usual, the details are available from online help. Figure 13.1
is the plot produced by stepH.m for the specified value of K.
Figure 13.1 Step response: typical output from stepH.m.

Another example of an M-file computes the frequency response (both magnitude
and phase) of the linear time-invariant (LTI) system with Laplace transfer
function

H(s) = 1/(s² + 0.5s + 1).
If you are not familiar with Laplace transforms, recall phasor analysis from basic
electric circuit theory. You may, for instance, interpret H(jω) as the ratio of two
phasor voltages (frequency ω), such as

H(jω) = V2(jω)/V1(jω).

The numerator phasor V2(jω) is the output of the system, and the denominator
phasor V1(jω) is the input to the system (a sinusoidal voltage source).
% freqresp.m
%
% This routine plots the frequency response of the Laplace
% transfer function
%
%                  1
% H(s) = -----------------
%          s^2 + .5s + 1
%
% The magnitude response (in dB) and the phase response (in
% degrees) are returned in the vectors mag and pha,
% respectively. The places on the frequency axis where the
% response is computed are returned in vector f.
function [mag, pha, f] = freqresp
b = [ 1 ];
a = [ 1 .5 1 ];
w = logspace(-2,1,50); % Compute the frequency response for 10^(-2) to 10^1
                       % radians per second at 50 points in this range
h = freqs(b,a,w);      % h is the frequency response
mag = abs(h);          % magnitude response
pha = angle(h);        % phase response
f = w/(2*pi);          % set up frequency axis in Hz
pha = pha*180/pi;      % phase now in degrees
mag = 20*log10(mag);   % magnitude response now in dB
clf
subplot(211), semilogx(f,mag,'-'), grid
xlabel(' Frequency (Hz) ')
ylabel(' Amplitude (dB) ')
title(' Magnitude Response ')
subplot(212), semilogx(f,pha,'-'), grid
xlabel(' Frequency (Hz) ')
ylabel(' Phase Angle (Degrees) ')
title(' Phase Response ')
Executing the command

>> [mag,phase,f] = freqresp;
>>
Figure 13.2 The output from freqresp.m: (a) magnitude response; (b) phase response.
will result in the plot of Fig. 13.2, and will also give the vectors mag, phase,
and f. Vector mag contains the magnitude response (in decibels), and vector phase
contains the phase response (in degrees) at the sample values in the vector f, which
defines the frequency axis for the plots (in hertz). In other words, "freqresp.m" is a
Bode plotting routine. Note that the MATLAB command "bode" does Bode plots
as well.
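For comparison, the same plots can be produced more directly with the bode command. The following usage is a sketch, not taken from the text (bode belongs to the Control System Toolbox, and its exact syntax varies between MATLAB versions):

```matlab
>> num = [ 1 ];
>> den = [ 1 .5 1 ];
>> bode(num,den)   % Bode magnitude and phase plots of H(s) = 1/(s^2 + .5s + 1)
```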
Additional labeling may be applied to plots using the MATLAB command text
(or via a mouse using "gtext"; see online help). As well, the legend statement is
useful in producing labels for a plot with different curves on the same graph. For
example
function ShowLegend
ul = 2.5*pi;
11 = -pi/4;
N = 200;
dt = (ul - 11)/N;
for k = 0:N-1
t(k+1) = 11 + dt*k;
x(k+1) = exp(-t(k+1));
y(k+1) = sin(t(k+1));
end;
subplot(211), plot(t,x,'-',t,y,'--'), grid
xlabel(' t ')
legend(' e^{-t} ', ' sin(t) ')

Figure 13.3 Illustration of the MATLAB legend statement.
When this code is run, it gives Fig. 13.3. Note that the label syntax is similar to
that in LaTeX [6].
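As a small sketch of the text command mentioned earlier (the coordinates and label string here are arbitrary, invented for illustration):

```matlab
>> t = 0:0.01:2*pi;
>> plot(t,sin(t)), grid
>> text(pi/2,1,' maximum ')   % place the string at the point (pi/2, 1)
```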
Other sample MATLAB codes have appeared as appendixes in earlier chapters,
and the reader may wish to view these as additional examples.
REFERENCES
1. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. Johns Hopkins Univ.
Press, Baltimore, MD, 1989.
2. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.),
Random House, New York, 1988.
3. S. Nakamura, Numerical Analysis and Graphic Visualization with MATLAB, 2nd ed.
Prentice-Hall, Upper Saddle River, NJ, 2002.
4. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied
Mathematics series, Vol. 37). Springer-Verlag, New York, 2000.
5. G. Recktenwald, Numerical Methods with MATLAB: Implementation and Application,
Prentice-Hall, Upper Saddle River, NJ, 2000.
6. M. Goossens, F. Mittelbach, and A. Samarin, The LaTeX Companion, Addison-Wesley,
Reading, MA, 1994.
INDEX
Absolute convergence, 95
Additive splitting, 179
Adjoint matrix, 489
Alternating set, 239
Ampere's law, 535
Anechoic chamber, 544
Apollonius' identity, 35
Argmin, 39
ASCII, 339
Asymptotic expansion, 97
Asymptotic series, 97-103
Asymptotic time complexity, 155
Backtracking line search, 346,353
Backward difference approximation, 402
Backward Euler method
see Euler's method of solution
(implicit form)
Backward substitution, 156
Banach fixed-point theorem, 297,422
see also Contraction theorem
Banach space, 68
Basic QR iterations algorithm, 512
Bauer-Fike theorem, 520
Bernoulli numbers, 398
Bernoulli's differential equation, 426
Bernoulli's inequality, 123
Bessel's inequality, 217
"Big O" notation, 155,296
Binary number codes
One's complement, 57
Sign-magnitude, 56
Two's complement, 57
Binomial theorem, 85
Bipolar junction transistor (BJT)
Base current, 420
Base-emitter voltage, 420
Collector current, 420
Collector-emitter voltage, 420
Forward current gain, 420
On resistance, 420
Bisection method, 292-296
Block upper triangular matrix, 510
Boundary conditions, 527
Boundary value problem (BVP), 445
C, 38,339,565
C++, 38
Cancellation, 44
Cantor's intersection theorem, 294
Cartesian product, 2
Catastrophic cancellation, 62,117
Catastrophic convergence, 90
Cauchy sequence, 64
Cauchy-Schwarz inequality, 139
Cayley-Hamilton theorem, 489
Central difference approximation, 402
Chaotic sequence, 323
Characteristic equation, 458,549
Characteristic impedance, 538
Characteristic impedance of free
space, 539
Characteristic polynomial, 144,481
Chebyshev approximation, 239
Chebyshev norm, 239
An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc.
Chebyshev polynomials of the first kind,
218-225
Chebyshev polynomials
of the second kind, 243-245
Cholesky decomposition, 167,355
Christoffel-Darboux formula, 211,389
Christoffel numbers, 389
Closed subset, 299
Cofactor, 489
Colpitts oscillator, 419,465,538
Compactly supported, 278
Companion matrix, 515,519
Complementary error function, 98
Complementary sine integral, 100
Complete solution, 532
Complete space, 64
Complete spline, 275
Completing the square, 131
Complex numbers
Cartesian form, 28
Imaginary part, 31
Magnitude, 29
Polar form, 28
Real part, 31
Computational complexity, 154
Computational efficiency, 89
Computational region, 551
Condition number of a matrix,
135-147
Conic section, 519,526
Conjugate gradient methods, 527
Conjugate symmetry, 35
Continuous mapping, 11
Contraction (contractive)
mapping, 184,297
Contraction theorem, 183,298
Contra-identity matrix, 199
Convergence in the mean, 217
Convergence of a sequence, 63
Convex function, 352
Convolution, 204
Convolution integral, 104,444
Coordinate rotation digital computing
(CORDIC) method, 70,107-116
Countably infinite set, 22
Courant-Friedrichs-Lewy (CFL)
condition, 547
Courant parameter, 546
Crank-Nicolson method, 528
Cubic B-spline, 272
Current-controlled current source
(CCCS), 420
Curve fitting, 252
Cyphertext sequence, 325
D'Alembert solution, 537
Daubechies 4-tap scaling function, 520
Daubechies wavelets, 18
Deadbeat synchronization, 326
Defective matrix, 483
Deflation theorem, 507
Degree of freedom, 440
Descent algorithm, 352
Detection threshold, 333
Diagonal dominance, 182,281
Diagonalizable matrix, 445
Diffusion equation
see Heat equation
Dimension, 22
Dirichlet kernel, 74,103
Discrete basis, 70,108
Discrete convolution, 34
Discrete Fourier series, 25
Discrete Fourier transform
(DFT), 27,37,112
Divergent sequence, 64
Divided differences, 257-259
Divided-difference table, 264
Dominant eigenvalue, 498
Double sequence, 69
Duffing equation,
415,447-448,453-454
Eavesdropper, 331
Eigenfunction, 532
Eigenpair, 480
Eigenproblem, 480
Eigenspace, 483
Eigenvalue
Matrix, 143,480
Partial differential equation
(PDE), 532
Eigenvector, 480
Electric flux density, 535
Electrical conductivity, 540
Elementary logic
Logical contrary, 32
Logically equivalent, 32
Necessary condition, 31
Sufficient condition, 31
Elliptic integrals, 410-411
Elliptic partial differential
equation, 526
Encryption key, 325
Energy
Of a signal (analog), 13
Of a sequence, 13
Equality constraint, 357
Error
Absolute, 45,60,140
Relative, 45,60,140,195
Error control coding, 327
Error function, 94
Euclidean metric, 6
Euclidean space, 12,16
Euler-Maclaurin formula, 397
Euler's identity, 19,30
Exchange matrix
see Contra-identity matrix
Faraday's law, 535
Fast Fourier transform (FFT), 27
Feasible direction, 357
Feasible incremental step, 357
Feasible set, 357
Fejer's theorem, 122
Finite difference (FD) method, 545-550
Finite-difference time-domain (FDTD)
method, 525,540,550-557
Finite-element method (FEM), 528
Finite impulse response (FIR) filtering,
34,367
Finite-precision arithmetic effects, 1
Fixed-point dynamic range, 39
Fixed-point number
representations, 38-41
Fixed-point overflow, 40
Fixed-point method, 296-305,312-318
Fixed-point rounding, 41
Floating-point chopping, 47
Floating-point dynamic range, 42
Floating-point exponent, 42
Floating-point mantissa, 42
Floating-point normalization, 45
Floating-point number
representations, 42-47
Floating-point overflow, 43
Floating-point rounding, 43
Floating-point underflow, 43
Flop, 154
Floquet theory, 522
Forcing term, 548
FORTRAN, 38
Forward difference approximation, 401
Forward elimination, 156
Forward substitution
see Forward elimination
Fourier series, 18,73
Free space, 535
Fresnel coefficients, 543
Frobenius norm, 140
Full rank matrix, 161
Function space, 4
Fundamental mode, 532
Fundamental theorem
of algebra, 254,484
Gamma function, 90
Gaussian function
see Gaussian pulse
Gaussian probability
density function, 413
Gaussian pulse, 92,556
Gauss-Seidel method, 181,527
Gauss-Seidel successive overrelaxation
(SOR) method, 181
Gauss transformation matrix, 148
Gauss vector, 149
Generalized eigenvectors, 485
Generalized Fourier series
expansion, 218
Generalized mean-value theorem, 79
Generating functions
Hermite polynomials, 227
Legendre polynomials, 233
Geometric series, 34
Gibbs phenomenon, 77
Givens matrix, 514
Global truncation error, 548
Golden ratio, 347
Golden ratio search method, 346
Gradient operator, 342
Greatest lower bound
see Infimum
Group speed, 556
Haar condition, 239
Haar scaling function, 17,36
Haar wavelet, 17,36
Hankel matrix, 412
Harmonics, 162,532
Heat equation, 527
Henon map, 324
Hermite's interpolation formula,
267-269,387
Hermite polynomials, 225-229
Hessenberg matrix, 205,512
Hessenberg reduction, 513
Hessian matrix, 344
Hilbert matrix, 133
Hilbert space, 68
Holder inequality, 139
Homogeneous material, 535
Householder transformation, 169-170
Hump phenomenon, 496
Hyperbolic partial differential
equation, 526
Hysteresis, 535
Ideal diode, 106
Idempotent matrix, 170
IEEE floating-point standard, 42
If and only if (iff), 32
Ill-conditioned matrix, 132
Increment function, 437
Incremental condition estimation
(ICE), 367
Inequality
Cauchy-Schwarz, 15
Holder, 10
Minkowski, 10
Schwarz, 24,36
Infimum, 5
Initial conditions, 528-529
Initial value problem (IVP)
Adams-Bashforth (AB) methods
of solution, 459-461
Adams-Moulton (AM) methods
of solution, 461-462
Chaotic instability, 437
Definition, 415-416
Euler's method of solution (explicit
form), 423,443
Euler method of solution (implicit
form), 427,443
Global error, 465
Heun's method of solution, 433,453
Midpoint method of solution,
457-458
Model problem, 426,443
Multistep predictor-corrector
methods of solution, 456-457
Parasitic term, 459
Runge-Kutta methods of solution,
437-441
Runge-Kutta-Fehlberg (RKF)
methods of solution, 464-466
Single-step methods of solution, 455
Stability, 421,423,427,433,436,
445-447,458-459,462-463
Stability regions, 463
Stiff systems, 432,467-469
Trapezoidal method of solution, 436
Truncation error per step, 432-434
Inner product, 14
Inner product space, 14
Integral mean-value theorem, 96
Integration between zeros, 385
Integration by parts, 90
Interface between media, 541
Intermediate value theorem, 292,383
Interpolation
Hermite, 266-269,381
Lagrange, 252-257
Newton, 257-266
Polynomial, 251
Spline, 269-284
Interpolatory graphical display
algorithm (IGDA), 521
Invariance (2-norm), 168
Invariant subspace, 511
Inverse discrete Fourier transform
(IDFT), 27
Inverse power algorithm, 503
Isotropic material, 535
Jacobi algorithm, 203
Jacobi method, 181,527
Jacobi overrelaxation
(JOR) method, 181
Jacobian matrix, 321
Jordan blocks, 484
Jordan canonical form, 483
Kernel function, 215
Key size, 324,329
Kirchhoff's laws, 418
Known plaintext attack, 331,339
Kronecker delta sequence, 22
L'Hopital's rule, 81
Lagrange multipliers, 142,166,357-358
Lagrange polynomials, 254
Lagrangian function, 359
Laplace transform, 416
Law of cosines, 169
Leading principal submatrix, 151
Leapfrog algorithm, 551
Least upper bound
see Supremum
Least-squares, 127-128,132,161
Least-squares approximation using
orthogonal polynomials, 235
Lebesgue integrable functions, 67
Legendre polynomials, 229-235
Levinson-Durbin algorithm, 200
Limit of a sequence, 63
Line searches, 345-353
Linearly independent, 21
Lipschitz condition, 421
Lipschitz constant, 421
Lissajous figure, 478
Local minima, 343
Local minimizer, 365
Local truncation error, 545
Logistic equation, 436
Logistic map, 323,437
Lossless material, 539
Lower triangular matrix, 148
LU decomposition, 148
Machine epsilon, 53
Maclaurin series, 85
Magnetic conductivity, 540
Magnetic flux density, 535
Manchester pulse, 17
Mathematical induction, 116
MATLAB, 34,54,90,102,117-118,126,
129,134-135,137,146,185,
186-191, 197-198,201,205-206,
244-246,248,287-289,292,336,
338-339,351,362-366,409-411,
413,442,465-467,469-472,
475-479,498,509,522-524,552,
557-559,561,563-564,565-577
Matrix exponential, 444,488-498
Matrix exponential condition
number, 497
Matrix norm, 140
Matrix norm equivalence, 141
Matrix operator, 176
Matrix spectrum, 509
Maxwell's equations, 535,540
Mean value theorem, 79,303
Message sequence, 325
Method of false position, 333
Method of separation of variables, 530
Metric, 6
Metric induced by the norm, 11
Metric space, 6
Micromechanical resonator, 416
Minor, 489
Moments, 140,411
Natural spline, 275
Negative definite matrix, 365
Nesting property, 199
Newton's method, 353-356
Newton's method with backtracking
line search, 354-356
Newton-Raphson method, 305-312,
318-323
Newton-Raphson method breakdown
phenomena, 311-312
Newton-Raphson method rate of
convergence, 309-311
Nondefective matrix, 483
Nonexpansive mapping, 297
Nonlinear observer, 326
Non-return-to-zero (NRZ) pulse, 17
Norm, 10
Normal equations, 164
Normal matrix, 147,498
Normal mode, 532
Normed space, 10
Numerical dispersion, 550,555-556
Numerical integration
Chebyshev-Gauss quadrature, 389
Composite Simpson's rule, 380
Composite trapezoidal rule, 376
Corrected trapezoidal rule, 394
Gaussian quadrature, 389
Hermite quadrature, 388
Left-point rule, 372
Legendre-Gauss quadrature, 391
Midpoint rule, 373,409
Rectangular rule, 372
Recursive trapezoidal rule, 398
Richardson's extrapolation, 396
Right-point rule, 372
Romberg integration formula, 397
Romberg table, 397,399
Simpson's rule, 378-385
Simpson's rule truncation error,
380,383
Trapezoidal rule, 371-378
Trapezoidal rule truncation error
formula, 375-376
Objective function, 341
Orbit
see Trajectory
Order of a partial differential equation,
525
Ordinary differential equation
(ODE), 415
Orthogonality, 17
Orthogonal matrix, 161
Orthogonal polynomial, 208
Orthonormal set, 23
Overdetermined least-squares
approximation, 161
Overrelaxation method, 182
Overshoot, 76-77
Overtones
see Harmonics
Parabolic partial differential
equation, 526
Parallelogram equality, 15
Parametric amplifier, 473
Parseval's equality, 216
Partial differential equation (PDE), 525
PASCAL, 38
Perfectly matched layer (PML),
540,544,553
Permeability, 535
Permittivity, 535
Permutation matrix, 205
Persymmetry property, 199
Phase portrait, 454
Phase speed, 555
Phasor analysis, 539
Picard's theorem, 422
Pivot, 151
Plane electromagnetic waves, 534-535
Plane waves, 536
Pointwise convergence, 71
Poisson equation, 526
Polynomial, 4
Positive definite matrix, 130
Positive semidefinite matrix, 130
Power method, 498-500
Power series, 94
Probability of false alarm, 333
Product method
see Method of separation of variables
Projection operator, 170,197
Propagation constant, 539
Pseudoinverse, 192
Pumping frequency, 473
QR decomposition, 161
QR factorization
see QR decomposition
QR iterations, 508-518
Quadratic form, 129,164
Quantization, 39
Quantization error, 40
Quartz crystal, 416
Radius of convergence, 95
Rate of convergence, 105,296
Ratio test, 233
Rayleigh quotient iteration, 508,523
Real Schur decomposition, 511
Reflection coefficient, 543
Reflection coefficients of a Toeplitz
matrix, 201
Regula falsi
see Method of false position
Relative permeability, 535
Relative permittivity, 535
Relative perturbations, 158
Residual reflections, 555
Residual vector, 144-145,168,180
Riemann integrable functions, 68-69
Ringing artifacts, 77
Rodrigues formula, 213,230
Rolle's theorem, 265
Rosenbrock's function, 342
Rotation operator, 112-113,120,484
Rounding errors, 38
Roundoff errors
see Rounding errors
Runge's phenomenon, 257
Saddle point, 343
Scaled power algorithm, 580
Schelin's theorem, 109
Second-order linear partial differential
equation, 526
Sequence, 3
Sequence of partial sums, 74,122
Sherman-Morrison-Woodbury
formula, 203
Shift parameter, 502,516
Similarity transformation, 487
Simultaneous diagonalizability, 520
Sine integral, 100
Single-shift QR iterations
algorithm, 517
Singular values, 147,165
Singular value decomposition
(SVD), 164
Singular vectors, 165
Spectral norm, 147
Spectral radius, 177
Spectrum, 532
Speed of light, 538
Stability, 157
State variables, 416
State vector, 325
Stationary point, 343
Stiffness quotient, 479
Stirling's formula, 91
Strange attractor, 466
Strongly nonsingular matrix
see Strongly regular matrix
Strongly regular matrix, 201
Subharmonics, 162
Submultiplicative property of matrix
norms, 141
Summation by parts, 249
Supremum, 5
Symmetric part of a matrix, 203
Taylor polynomial, 85
Taylor series, 78-96
Taylor's theorem, 440
Three-term recurrence formula (relation)
Definition, 208
Chebyshev polynomials of the first
kind, 222
Chebyshev polynomials of the second
kind, 243
Gram polynomials, 249
Hermite polynomials, 229
Legendre polynomials, 234
Toeplitz matrix, 199
Trace of a matrix, 206
Trajectory, 454
Transmission coefficient, 543
Transverse electromagnetic (TEM)
waves, 586
Tridiagonal matrix, 199,275,516
Truncation error, 85
Truncation error per step, 546
Two-dimensional Taylor series,
318-319
Tychonov regularization, 368
Uncertainty principle, 36
Underrelaxation method, 182
Uniform approximation, 222,238
Uniform convergence, 71
Unique global minimizer, 344
Unitary space, 12,16
Unit circle, 248,502
Unit roundoff, 53
Unit sphere, 141
Unit step function, 421
Unit vector, 17,170
UNIX, 565
Upper quasi-triangular matrix, 512
Upper triangular matrix, 148
Vandermonde matrix, 254,285
Varactor diode, 473
Vector electric field intensity, 534
Vector magnetic field intensity, 535
Vector, 9
Vector dot product, 16,48
Vector Hermitian transpose, 16
Vector norm, 138
Vector norm equivalence, 139
Vector space, 9
Vector subspace, 239
Vector transpose, 16
Vibrating string, 528-534
Von Neumann stability analysis, 548
Wavelength, 538
Wavelet series, 18
Wave equation
Electric field, 536
Magnetic field, 536
Weierstrass' approximation
theorem, 216
Weierstrass function, 384
Weighted inner product, 207
Weighting function, 207
Well-conditioned matrix, 137
Zeroth order modified Bessel function
of the first kind, 400