AN INTRODUCTION TO 
NUMERICAL ANALYSIS 
FOR ELECTRICAL AND 
COMPUTER ENGINEERS 



Christopher J. Zarowski 

University of Alberta, Canada 



® 



WILEY- 
INTERSCIENCE 



A JOHN WILEY & SONS, INC. PUBLICATION 






Copyright © 2004 by John Wiley & Sons, Inc. All rights reserved. 

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. 
Published simultaneously in Canada. 

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form 
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as 
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior 
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee 
to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, 
fax 978-646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission 
should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, 
Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008. 

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts 
in preparing this book, they make no representations or warranties with respect to the accuracy or 
completeness of the contents of this book and specifically disclaim any implied warranties of 
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales 
representatives or written sales materials. The advice and strategies contained herein may not be 
suitable for your situation. You should consult with a professional where appropriate. Neither the 
publisher nor author shall be liable for any loss of profit or any other commercial damages, including 
but not limited to special, incidental, consequential, or other damages. 

For general information on our other products and services, please contact our Customer Care 
Department within the United States at 877-762-2974, outside the United States at 317-572-3993 or 
fax 317-572-4002. 

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, 
however, may not be available in electronic format. 

Library of Congress Cataloging-in-Publication Data: 

Zarowski, Christopher J. 

An introduction to numerical analysis for electrical and computer engineers / Christopher 
J. Zarowski. 

p. cm. 
Includes bibliographical references and index. 
ISBN 0-471-46737-5 (cloth) 

1. Electric engineering — Mathematics. 2. Computer science — Mathematics. 3. Numerical 
analysis. I. Title. 

TK153.Z37 2004 
621.3'01'518—dc22 

2003063761 

Printed in the United States of America. 
10 9 8 7 6 5 4 3 2 1 






In memory of my mother 

Lilian 

and of my father 

Walter 






CONTENTS 



Preface xiii 

1 Functional Analysis Ideas 1 

1.1 Introduction 1 

1.2 Some Sets 2 

1.3 Some Special Mappings: Metrics, Norms, and Inner Products 4 

1.3.1 Metrics and Metric Spaces 6 

1.3.2 Norms and Normed Spaces 8 

1.3.3 Inner Products and Inner Product Spaces 14 

1.4 The Discrete Fourier Series (DFS) 25 
Appendix 1.A Complex Arithmetic 28 
Appendix 1.B Elementary Logic 31 
References 32 
Problems 33 

2 Number Representations 38 

2.1 Introduction 38 

2.2 Fixed-Point Representations 38 

2.3 Floating-Point Representations 42 

2.4 Rounding Effects in Dot Product Computation 48 

2.5 Machine Epsilon 53 
Appendix 2.A Review of Binary Number Codes 54 
References 59 
Problems 59 

3 Sequences and Series 63 

3.1 Introduction 63 

3.2 Cauchy Sequences and Complete Spaces 63 

3.3 Pointwise Convergence and Uniform Convergence 70 

3.4 Fourier Series 73 

3.5 Taylor Series 78 



3.6 Asymptotic Series 97 

3.7 More on the Dirichlet Kernel 103 

3.8 Final Remarks 107 
Appendix 3.A Coordinate Rotation Digital Computing (CORDIC) 107 
3.A.1 Introduction 107 
3.A.2 The Concept of a Discrete Basis 108 
3.A.3 Rotating Vectors in the Plane 112 
3.A.4 Computing Arctangents 114 
3.A.5 Final Remarks 115 
Appendix 3.B Mathematical Induction 116 
Appendix 3.C Catastrophic Cancellation 117 
References 119 
Problems 120 



4 Linear Systems of Equations 127 

4.1 Introduction 127 

4.2 Least-Squares Approximation and Linear Systems 127 

4.3 Least-Squares Approximation and Ill-Conditioned Linear Systems 132 

4.4 Condition Numbers 135 

4.5 LU Decomposition 148 

4.6 Least-Squares Problems and QR Decomposition 161 

4.7 Iterative Methods for Linear Systems 176 

4.8 Final Remarks 186 
Appendix 4.A Hilbert Matrix Inverses 186 
Appendix 4.B SVD and Least Squares 191 
References 193 
Problems 194 



5 Orthogonal Polynomials 207 

5.1 Introduction 207 

5.2 General Properties of Orthogonal Polynomials 207 

5.3 Chebyshev Polynomials 218 

5.4 Hermite Polynomials 225 

5.5 Legendre Polynomials 229 

5.6 An Example of Orthogonal Polynomial Least-Squares Approximation 235 

5.7 Uniform Approximation 238 



References 241 
Problems 241 

6 Interpolation 251 

6.1 Introduction 251 

6.2 Lagrange Interpolation 252 

6.3 Newton Interpolation 257 

6.4 Hermite Interpolation 266 

6.5 Spline Interpolation 269 
References 284 
Problems 285 

7 Nonlinear Systems of Equations 290 

7.1 Introduction 290 

7.2 Bisection Method 292 

7.3 Fixed-Point Method 296 

7.4 Newton-Raphson Method 305 

7.4.1 The Method 305 

7.4.2 Rate of Convergence Analysis 309 

7.4.3 Breakdown Phenomena 311 

7.5 Systems of Nonlinear Equations 312 

7.5.1 Fixed-Point Method 312 

7.5.2 Newton-Raphson Method 318 

7.6 Chaotic Phenomena and a Cryptography Application 323 
References 332 
Problems 333 

8 Unconstrained Optimization 341 

8.1 Introduction 341 

8.2 Problem Statement and Preliminaries 341 

8.3 Line Searches 345 

8.4 Newton's Method 353 

8.5 Equality Constraints and Lagrange Multipliers 357 
Appendix 8.A MATLAB Code for Golden Section Search 362 
References 364 
Problems 364 

9 Numerical Integration and Differentiation 369 

9.1 Introduction 369 




9.2 Trapezoidal Rule 371 

9.3 Simpson's Rule 378 

9.4 Gaussian Quadrature 385 

9.5 Romberg Integration 393 

9.6 Numerical Differentiation 401 
References 406 
Problems 406 

10 Numerical Solution of Ordinary Differential Equations 415 

10.1 Introduction 415 

10.2 First-Order ODEs 421 

10.3 Systems of First-Order ODEs 442 

10.4 Multistep Methods for ODEs 455 

10.4.1 Adams-Bashforth Methods 459 

10.4.2 Adams-Moulton Methods 461 

10.4.3 Comments on the Adams Families 462 

10.5 Variable-Step-Size (Adaptive) Methods for ODEs 464 

10.6 Stiff Systems 467 

10.7 Final Remarks 469 
Appendix 10.A MATLAB Code for Example 10.8 469 
Appendix 10.B MATLAB Code for Example 10.13 470 
References 472 
Problems 473 

11 Numerical Methods for Eigenproblems 480 

11.1 Introduction 480 

11.2 Review of Eigenvalues and Eigenvectors 480 

11.3 The Matrix Exponential 488 

11.4 The Power Methods 498 

11.5 QR Iterations 508 
References 518 
Problems 519 

12 Numerical Solution of Partial Differential Equations 525 

12.1 Introduction 525 

12.2 A Brief Overview of Partial Differential Equations 525 

12.3 Applications of Hyperbolic PDEs 528 

12.3.1 The Vibrating String 528 

12.3.2 Plane Electromagnetic Waves 534 




12.4 The Finite-Difference (FD) Method 545 

12.5 The Finite-Difference Time-Domain (FDTD) Method 550 
Appendix 12.A MATLAB Code for Example 12.5 557 
References 560 
Problems 561 

13 An Introduction to MATLAB 565 

13.1 Introduction 565 

13.2 Startup 565 

13.3 Some Basic Operators, Operations, and Functions 566 

13.4 Working with Polynomials 571 

13.5 Loops 572 

13.6 Plotting and M-Files 573 
References 577 

Index 579 






PREFACE 



The subject of numerical analysis has a long history. In fact, it predates by cen- 
turies the existence of the modern computer. Of course, the advent of the modern 
computer in the middle of the twentieth century gave greatly added impetus to the 
subject, and so it now plays a central role in a large part of engineering analysis, 
simulation, and design. This is so true that no engineer can be deemed competent 
without some knowledge and understanding of the subject. Because of the back- 
ground of the author, this book tends to emphasize issues of particular interest to 
electrical and computer engineers, but the subject (and the present book) is certainly 
relevant to engineers from all other branches of engineering. 

Given the importance level of the subject, a great number of books have already 
been written about it, and are now being written. These books span a colossal 
range of approaches, levels of technical difficulty, degree of specialization, breadth 
versus depth, and so on. So, why should this book be added to the already huge, 
and growing list of available books? 

To begin, the present book is intended to be a part of the students' first exposure 
to numerical analysis. As such, it is intended for use mainly in the second year 
of a typical 4-year undergraduate engineering program. However, the book may 
find use in later years of such a program. Generally, the present book arises out of 
the author's objections to educational practice regarding numerical analysis. To be 
more specific 

1. Some books adopt a "grocery list" or "recipes" approach (i.e., "methods" at 
the expense of "analysis") wherein several methods are presented, but with 
little serious discussion of issues such as how they are obtained and their 
relative advantages and disadvantages. In this genre often little consideration 
is given to error analysis, convergence properties, or stability issues. When 
these issues are considered, it is sometimes in a manner that is too superficial 
for contemporary and future needs. 

2. Some books fail to build on what the student is supposed to have learned 
prior to taking a numerical analysis course. For example, it is common for 
engineering students to take a first-year course in matrix/linear algebra. Yet, 
a number of books miss the opportunity to build on this material in a manner 
that would provide a good bridge from first year to more sophisticated uses 
of matrix/linear algebra in later years (e.g., such as would be found in digital 
signal processing or state variable control systems courses). 




3. Some books miss the opportunity to introduce students to the now quite vital 
area of functional analysis ideas as applied to engineering problem solving. 
Modern numerical analysis relies heavily on concepts such as function spaces, 
orthogonality, norms, metrics, and inner products. Yet these concepts are 
often considered in a very ad hoc way, if indeed they are considered at all. 

4. Some books tie the subject matter of numerical analysis far too closely to 
particular software tools and/or programming languages. But the highly tran- 
sient nature of software tools and programming languages often blinds the 
user to the timeless nature of the underlying principles of analysis. Further- 
more, it is an erroneous belief that one can successfully employ numerical 
methods solely through the use of "canned" software without any knowledge 
or understanding of the technical details of the contents of the can. While 
this does not imply the need to understand a software tool or program down 
to the last line of code, it does rule out the "black box" methodology. 

5. Some books avoid detailed analysis and derivations in the misguided belief 
that this will make the subject more accessible to the student. But this denies 
the student the opportunity to learn an important mode of thinking that is a 
huge aid to practical problem solving. Furthermore, by cutting the student 
off from the language associated with analysis the student is prevented from 
learning those skills needed to read modern engineering literature, and to 
extract from this literature those things that are useful for solving the problem 
at hand. 

The prospective user of the present book will likely notice that it contains material 
that, in the past, was associated mainly with more advanced courses. However, the 
history of numerical computing since the early 1980s or so has made its inclusion 
in an introductory course unavoidable. There is nothing remarkable about this. For 
example, the material of typical undergraduate signals and systems courses was, 
not so long ago, considered to be suitable only for graduate-level courses. Indeed, 
most (if not all) of the contents of any undergraduate program consists of material 
that was once considered far too advanced for undergraduates, provided one goes 
back far enough in time. 

Therefore, with respect to the observations mentioned above, the following is a 
summary of some of the features of the present book: 

1. An axiomatic approach to function spaces is adopted within the first chapter. 
So the book immediately exposes the student to function space ideas, espe- 
cially with respect to metrics, norms, inner products, and the concept of 
orthogonality in a general setting. All of this is illustrated by several examples, 
and the basic ideas from the first chapter are reinforced by routine use 
throughout the remaining chapters. 

2. The present book is not closely tied to any particular software tool or pro- 
gramming language, although a few MATLAB-oriented examples are pre- 
sented. These may be understood without any understanding of MATLAB 




(derived from the term matrix laboratory) on the part of the student, how- 
ever. Additionally, a quick introduction to MATLAB is provided in Chapter 
13. These examples are simply intended to illustrate that modern software 
tools implement many of the theories presented in the book, and that the 
numerical characteristics of algorithms implemented with such tools are not 
materially different from algorithm implementations using older software 
technologies (e.g., catastrophic cancellation and ill conditioning continue 
to be major implementation issues). Algorithms are often presented in a 
Pascal-like pseudocode that is sufficiently transparent and general to allow 
the user to implement the algorithm in the language of their choice. 

3. Detailed proofs and/or derivations are often provided for many key results. 
However, not all theorems or algorithms are proved or derived in detail 
on those occasions where to do so would consume too much space, or not 
provide much insight. Of course, the reader may dispute the present author's 
choices in this matter. But when a proof or derivation is omitted, a reference 
is often cited where the details may be found. 

4. Some modern applications examples are provided to illustrate the conse- 
quences of various mathematical ideas. For example, chaotic cryptography, 
the CORDIC (coordinate rotational digital computing) method, and least 
squares for system identification (in a biomedical application) are considered. 

5. The sense in which series and iterative processes converge is given fairly 
detailed treatment in this book as an understanding of these matters is now 
so crucial in making good choices about which algorithm to use in an appli- 
cation. Thus, for example, the difference between pointwise and uniform 
convergence is considered. Kernel functions are introduced because of their 
importance in error analysis for approximations based on orthogonal series. 
Convergence rate analysis is also presented in the context of root-finding 
algorithms. 

6. Matrix analysis is considered in sufficient depth and breadth to provide an 
adequate introduction to those aspects of the subject particularly relevant to 
modern areas in which it is applied. This would include (but not be limited 
to) numerical methods for electromagnetics, stability of dynamic systems, 
state variable control systems, digital signal processing, and digital commu- 
nications. 

7. The most important general properties of orthogonal polynomials are pre- 
sented. The special cases of Chebyshev, Legendre, and Hermite polynomials 
are considered in detail (i.e., detailed derivations of many basic properties 
are given). 

8. In treating the subject of the numerical solution of ordinary differential 
equations, a few books fail to give adequate examples based on nonlin- 
ear dynamic systems. But many examples in the present book are based on 
nonlinear problems (e.g., the Duffing equation). Furthermore, matrix methods 
are introduced in the stability analysis of both explicit and implicit methods 
for nth-order systems. This is illustrated with second-order examples. 




Analysis is often embedded in the main body of the text rather than being rele- 
gated to appendixes, or to formalized statements of proof immediately following a 
theorem statement. This is done to discourage attempts by the reader to "skip over 
the math." After all, skipping over the math defeats the purpose of the book. 

Notwithstanding the remarks above, the present book lacks the rigor of a math- 
ematically formal treatment of numerical analysis. For example, Lebesgue measure 
theory is entirely avoided (although it is mentioned in passing). With respect to 
functional analysis, previous authors (e.g., E. Kreyszig, Introductory Functional 
Analysis with Applications) have demonstrated that it is very possible to do this 
while maintaining adequate rigor for engineering purposes, and this approach is 
followed here. 

It is largely left to the judgment of the course instructor about what particular 
portions of the book to cover in a course. Certainly there is more material here 
than can be covered in a single term (or semester). However, it is recommended 
that the first four chapters be covered largely in their entirety (perhaps excepting 
Sections 1.4, 3.6, 3.7, and the part of Section 4.6 regarding SVD). The material of 
these chapters is simply too fundamental to be omitted, and is often drawn on in 
later chapters. 

Finally, some will say that topics such as function spaces, norms and inner 
products, and uniform versus pointwise convergence, are too abstract for engineers. 
Such individuals would do well to ask themselves in what way these ideas are 
more abstract than Boolean algebra, convolution integrals, and Fourier or Laplace 
transforms, all of which are standard fare in present-day electrical and computer 
engineering curricula. 



Engineering past 
Engineering present 
Engineering future 

Christopher Zarowski 






1 



Functional Analysis Ideas 



1.1 INTRODUCTION 

Many engineering analysis and design problems are far too complex to be solved 
without the aid of computers. However, the use of computers in problem solving 
has made it increasingly necessary for users to be highly skilled in (practical) 
mathematical analysis. There are a number of reasons for this. A few are as follows. 

For one thing, computers represent data to finite precision. Irrational numbers 
such as π or √2 do not have an exact representation on a digital computer (with the 
possible exception of methods based on symbolic computing). Additionally, when 
arithmetic is performed, errors occur as a result of rounding (e.g., the truncation of 
the product of two n-bit numbers, which might be 2n bits long, back down to n 
bits). Numbers have a limited dynamic range; we might get overflow or underflow 
in a computation. These are examples of finite-precision arithmetic effects. Beyond 
this, computational methods frequently have sources of error independent of these. 
For example, an infinite series must be truncated if it is to be evaluated on a com- 
puter. The truncation error is something "additional" to errors from finite-precision 
arithmetic effects. In all cases, the sources (and sizes) of error in a computation 
must be known and understood in order to make sensible claims about the accuracy 
of a computer-generated solution to a problem. 
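A quick way to see finite-precision effects for yourself is to evaluate an expression that is exactly zero in real arithmetic but not in floating-point arithmetic. The short MATLAB fragment below is a minimal sketch of this idea; the particular numbers chosen are arbitrary and are not taken from the text:

    % Finite-precision (rounding) effects: (0.1 + 0.2) - 0.3 is not exactly zero
    % because 0.1, 0.2, and 0.3 have no exact binary floating-point representation.
    r = (0.1 + 0.2) - 0.3;
    disp(r);          % displays a small nonzero number (about 5.6e-17)
    disp(eps);        % machine epsilon, a measure of relative rounding error (see Chapter 2)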

Many methods are "iterative." Accuracy of the result depends on how many 
iterations are performed. It is possible that a given method might be very slow, 
requiring many iterations before achieving acceptable accuracy. This could involve 
much computer runtime. The obvious solution of using a faster computer is usually 
unacceptable. A better approach is to use mathematical analysis to understand why 
a method is slow, and so to devise methods of speeding it up. Thus, an important 
feature of analysis applied to computational methods is that of assessing how 
much in the way of computing resources is needed by a given method. A given 
computational method will make demands on computer memory, operations count 
(the number of arithmetic operations, function evaluations, data transfers, etc.), 
number of bits in a computer word, and so on. 

A given problem almost always has many possible alternative solutions. Other 
than accuracy and computer resource issues, ease of implementation is also rel- 
evant. This is a human labor issue. Some methods may be easier to implement 
on a given set of computing resources than others. This would have an impact 




on software/hardware development time, and hence on system cost. Again, math- 
ematical analysis is useful in deciding on the relative ease of implementation of 
competing solution methods. 

The subject of numerical computing is truly vast. Methods are required to handle 
an immense range of problems, such as solution of differential equations (ordi- 
nary or partial), integration, solution of equations and systems of equations (linear 
or nonlinear), approximation of functions, and optimization. These problem types 
appear to be radically different from each other. In some sense the differences 
between them are true, but there are means to achieve some unity of approach in 
understanding them. 

The branch of mathematics that (perhaps) gives the greatest amount of unity 
is sometimes called functional analysis. We shall employ ideas from this subject 
throughout. However, our usage of these ideas is not truly rigorous; for example, 
we completely avoid topology, and measure theory. Therefore, we tend to follow 
simplified treatments of the subject such as Kreyszig [1], and then only those ideas 
that are immediately relevant to us. The reader is assumed to be very comfortable 
with elementary linear algebra, and calculus. The reader must also be comfortable 
with complex number arithmetic (see Appendix 1.A now for a review if necessary). 
Some knowledge of electric circuit analysis is presumed since this will provide 
a source of applications examples later. (But application examples will also be 
drawn from other sources.) Some knowledge of ordinary differential equations is 
also assumed. 

It is worth noting that an understanding of functional analysis is a tremendous 
aid to understanding other subjects such as quantum physics, probability theory 
and random processes, digital communications system analysis and design, digital 
control systems analysis and design, digital signal processing, fuzzy systems, neural 
networks, computer hardware design, and optimal design of systems. Many of the 
ideas presented in this book are also intended to support these subjects. 



1.2 SOME SETS 

Variables in an engineering problem often take on values from sets of numbers. 
In the present setting, the sets of greatest interest to us are (1) the set of integers 
Z = {..., -3, -2, -1, 0, 1, 2, 3, ...}, (2) the set of real numbers R, and (3) the set of 
complex numbers C = {x + jy | j = √(-1), x, y ∈ R}. The set of nonnegative integers 
is Z^+ = {0, 1, 2, 3, ...} (so Z^+ ⊂ Z). Similarly, the set of nonnegative real 
numbers is R^+ = {x ∈ R | x ≥ 0}. Other kinds of sets of numbers will be introduced 
if and when they are needed. 

If A and B are two sets, their Cartesian product is denoted by A × B = 
{(a, b) | a ∈ A, b ∈ B}. The Cartesian product of n sets denoted A_0, A_1, ..., A_{n-1} 
is A_0 × A_1 × ... × A_{n-1} = {(a_0, a_1, ..., a_{n-1}) | a_k ∈ A_k}. 

Ideas from matrix/linear algebra are of great importance. We are therefore also 
interested in sets of vectors. Thus, R^n shall denote the set of n-element vectors 
with real-valued components, and similarly, C^n shall denote the set of n-element 
vectors with complex-valued components. By default, we assume any vector x to 
be a column vector: 

x = [x_0 x_1 ... x_{n-2} x_{n-1}]^T.    (1.1) 



Naturally, row vectors are obtained by transposition. We will generally avoid using 
bars over or under symbols to denote vectors. Whether a quantity is a vector will 
be clear from the context of the discussion. However, bars will be used to denote 
vectors when this cannot be easily avoided. The indexing of vector elements x_k will 
often begin with 0, as indicated in (1.1). Naturally, matrices are also important. Set 
R^{n×m} denotes the set of matrices with n rows and m columns, and the elements are 
real-valued. The notation C^{n×m} should now possess an obvious meaning. Matrices 
will be denoted by uppercase symbols, again without bars. If A is an n × m 
matrix, then 

A = [a_{p,q}]_{p=0,...,n-1, q=0,...,m-1}.    (1.2) 

Thus, the element in row p and column q of A is denoted a_{p,q}. Indexing of rows 
and columns again will typically begin at 0. The subscripts on the right bracket "]" 
in (1.2) will often be omitted in the future. We may also write a_{pq} instead of a_{p,q} 
where no danger of confusion arises. 

The elements of any vector may be regarded as the elements of a sequence of 
finite length. However, we are also very interested in sequences of infinite length. 
An infinite sequence may be denoted by x = (x_k) = (x_0, x_1, x_2, ...), for which x_k 
could be either real-valued or complex-valued. It is possible for sequences to be 
doubly infinite, for instance, x = (x_k) = (..., x_{-2}, x_{-1}, x_0, x_1, x_2, ...). 

Relationships between variables are expressed as mathematical functions, that is, 
mappings between sets. The notation f|A → B signifies that function f associates 
an element of set A with an element from set B. For example, f|R → R represents 
a function defined on the real-number line, and this function is also real-valued; 
that is, it maps "points" in R to "points" in R. We are familiar with the idea 
of "plotting" such a function on the xy plane if y = f(x) (i.e., x, y ∈ R). It is 
important to note that we may regard sequences as functions that are defined on 
either the set Z (the case of doubly infinite sequences), or the set Z^+ (the case 
of singly infinite sequences). To be more specific, if, for example, k ∈ Z^+, then 
this number maps to some number x_k that is either real-valued or complex-valued. 
Since vectors are associated with sequences of finite length, they, too, may be 
regarded as functions, but defined on a finite subset of the integers. From (1.1) this 
subset might be denoted by Z_n = {0, 1, 2, ..., n-2, n-1}. 

Sets of functions are important. This is because in engineering we are often 
interested in mappings between sets of functions. For example, in electric circuits 
voltage and current waveforms (i.e., functions of time) are input to a circuit via volt- 
age and current sources. Voltage drops across circuit elements, or currents through 



circuit elements are output functions of time. Thus, any circuit maps functions from 
an input set to functions from some output set. Digital signal processing systems 
do the same thing, except that here the functions are sequences. For example, a 
simple digital signal processing system might accept as input the sequence (x_n), 
and produce as output the sequence (y_n) according to 

y_n = (x_n + x_{n+1})/2    (1.3) 

for which n ∈ Z^+. 
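As a concrete illustration, the following MATLAB fragment applies the averaging operation in (1.3) to a short, made-up input sequence (the input values are assumptions for this sketch only):

    % Apply y_n = (x_n + x_{n+1})/2 to a finite input sequence.
    % MATLAB indexing starts at 1, so x(k) holds x_{k-1} of the text.
    x = [1 4 2 8 5 7];              % hypothetical input sequence
    y = (x(1:end-1) + x(2:end))/2;  % y_n = (x_n + x_{n+1})/2
    disp(y);                        % displays 2.5  3.0  5.0  6.5  6.0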

Some specific examples of sets of functions are as follows, and more will be 
seen later. The set of real-valued functions defined on the interval [a, b] ⊂ R that 
are n times continuously differentiable may be denoted by C^n[a, b]. This means 
that all derivatives up to and including order n exist and are continuous. If n = 0 
we often just write C[a, b], which is the set of continuous functions on the interval 
[a, b]. We remark that the notation [a, b] implies inclusion of the endpoints of the 
interval. Thus, (a, b) implies that the endpoints a and b are not to be included [i.e., 
if x ∈ (a, b), then a < x < b]. 

A polynomial in the indeterminate x of degree n is 

p_n(x) = Σ_{k=0}^{n} p_{n,k} x^k.    (1.4) 

Unless otherwise stated, we will always assume p_{n,k} ∈ R. The indeterminate x 
is often considered to be either a real number or a complex number. But in 
some circumstances the indeterminate x is merely regarded as a "placeholder," 
which means that x is not supposed to take on a value. In a situation like this 
the polynomial coefficients may also be regarded as elements of a vector (e.g., 
p_n = [p_{n,0} p_{n,1} ... p_{n,n}]^T). This happens in digital signal processing when we 
wish to convolve^1 sequences of finite length, because the multiplication of polynomials 
is mathematically equivalent to the operation of sequence convolution. We 
will denote the set of all polynomials of degree n as P^n. If x is to be from the 
interval [a, b] ⊂ R, then the set of polynomials of degree n on [a, b] is denoted 
by P^n[a, b]. If m < n we shall usually assume P^m[a, b] ⊂ P^n[a, b]. 
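For instance, MATLAB's conv function multiplies polynomials by convolving their coefficient sequences; the short check below uses made-up coefficient vectors chosen only for illustration:

    % Polynomial multiplication as sequence convolution:
    % (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2.
    p = [2 1];          % coefficients of 1 + 2x (MATLAB orders highest degree first)
    q = [4 3];          % coefficients of 3 + 4x
    disp(conv(p, q));   % displays 8  10  3, i.e., 8x^2 + 10x + 3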



^1 These days it seems that the operation of convolution is first given serious study in introductory signals 
and systems courses. The operation of convolution is fundamental to all forms of signal processing, 
either analog or digital. 


1.3 SOME SPECIAL MAPPINGS: METRICS, NORMS, 
AND INNER PRODUCTS 

Sets of objects (vectors, sequences, polynomials, functions, etc.) often have certain 
special mappings defined on them that turn these sets into what are commonly 
called function spaces. Loosely speaking, functional analysis is about the properties 

of function spaces. Generally speaking, numerical computation problems are best 
handled by treating them in association with suitable mappings on well-chosen 
function spaces. For our purposes, the three most important special types of map- 
pings are (1) metrics, (2) norms, and (3) inner products. You are likely to be already 
familiar with special cases of these really very general ideas. 

The vector dot product is an example of an inner product on a vector space, while 
the Euclidean norm (i.e., the square root of the sum of the squares of the elements in 
a real-valued vector) is a norm on a vector space. The Euclidean distance between 
two vectors (given by the Euclidean norm of the difference between the two vectors) 
is a metric on a vector space. Again, loosely speaking, metrics give meaning to the 
concept of "distance" between points in a function space, norms give a meaning 
to the concept of the "size" of a vector, and inner products give meaning to the 
concept of "direction" in a vector space.^2 

In Section 1.1 we expressed interest in the sizes of errors, and so naturally the 
concept of a norm will be of interest. Later we shall see that inner products will 
prove to be useful in devising means of overcoming problems due to certain sources 
of error in a computation. In this section we shall consider various examples of 
function spaces, some of which we will work with later on in the analysis of 
certain computational problems. We shall see that there are many different kinds 
of metric, norm, and inner product. Each kind has its own particular advantages 
and disadvantages as will be discovered as we progress through the book. 

Sometimes a quantity cannot be computed exactly. In this case we may try to 
estimate bounds on the size of the quantity. For example, finding the exact error 
in the truncation of a series may be impossible, but putting a bound on the error 
might be relatively easy. In this respect the concepts of supremum and infimum 
can be important. These are defined as follows. 

Suppose we have E ⊂ R. We say that E is bounded above if E has an upper 
bound, that is, if there exists a B ∈ R such that x ≤ B for all x ∈ E. If E ≠ ∅ 
(empty set; set containing no elements) there is a supremum of E [also called a 
least upper bound (lub)], denoted 

sup E. 

For example, suppose E = [0, 1); then any B ≥ 1 is an upper bound for E, but 
sup E = 1. More generally, sup E ≤ B for every upper bound B of E. Thus, the 
supremum is a "tight" upper bound. Similarly, E may be bounded below. If E has 
a lower bound there is a b ∈ R such that x ≥ b for all x ∈ E. If E ≠ ∅, then there 
exists an infimum [also called a greatest lower bound (glb)], denoted by 

inf E. 

For example, suppose now E = (0, 1]; then any b ≤ 0 is a lower bound for E, 
but inf E = 0. More generally, inf E ≥ b for every lower bound b of E. Thus, the 
infimum is a "tight" lower bound. 

^2 The idea of "direction" is (often) considered with respect to the concept of an orthogonal basis in a 
vector space. To define "orthogonality" requires the concept of an inner product. We shall consider this 
in various ways later on. 




1.3.1 Metrics and Metric Spaces 

In mathematics an axiomatic approach is often taken in the development of analysis 
methods. This means that we define a set of objects, a set of operations to be 
performed on the set of objects, and rules obeyed by the operations. This is typically 
how mathematical systems are constructed. The reader (hopefully) has already seen 
this approach in the application of Boolean algebra to the analysis and design of 
digital electronic systems (i.e., digital logic). We adopt the same approach here. 
We will begin with the following definition. 

Definition 1.1: Metric Space, Metric A metric space is a set X and a 
function d|X × X → R^+, which is called a metric or distance function on X. 
If x, y, z ∈ X then d satisfies the following axioms: 

(M1) d(x, y) = 0 if and only if (iff) x = y. 

(M2) d(x, y) = d(y, x) (symmetry property). 

(M3) d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality). 

We emphasize that X by itself cannot be a metric space until we define d. Thus, 
the metric space is often denoted by the pair (X, d). The phrase "if and only 
if" probably needs some explanation. In (M1), if you were told that d(x, y) = 0, 
then you must immediately conclude that x = y. Conversely, if you were told that 
x = y, then you must immediately conclude that d(x, y) = 0. Instead of the words 
"if and only if" it is also common to write 

d(x, y) = 0 ⟺ x = y. 

The phrase "if and only if" is associated with elementary logic. This subject is 
reviewed in Appendix 1.B. It is recommended that the reader study that appendix 
before continuing with later chapters. 

Some examples of metric spaces now follow. 

Example 1.1 Set X = R, with 

d(x, y) = |x - y|    (1.5) 

forms a metric space. The metric (1.5) is what is commonly meant by the "distance 
between two points on the real number line." The metric (1.5) is quite useful in 
discussing the sizes of errors due to rounding in digital computation. This is because 
there is a norm on R that gives rise to the metric in (1.5) (see Section 1.3.2). 



Example 1.2 The set of vectors R^n with 

d(x, y) = [Σ_{k=0}^{n-1} |x_k - y_k|^2]^{1/2} = [Σ_{k=0}^{n-1} (x_k - y_k)^2]^{1/2}    (1.6a) 

forms a (Euclidean) metric space. However, another valid metric on R^n is given by 

d_1(x, y) = Σ_{k=0}^{n-1} |x_k - y_k|.    (1.6b) 

In other words, we can have the metric space (X, d), or (X, d_1). These spaces are 
different because their metrics differ. 

Euclidean metrics, and their related norms and inner products, are useful in posing 
and solving least-squares approximation problems. Least-squares approximation 
is a topic we shall consider in detail later. 
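The following MATLAB fragment evaluates both metrics of Example 1.2 for a pair of made-up vectors in R^3, simply to emphasize that the two distance functions generally give different numbers:

    % Euclidean metric (1.6a) and the metric (1.6b) on R^n.
    x = [1; 2; 3];                  % hypothetical vectors in R^3
    y = [2; 0; 3];
    d  = sqrt(sum((x - y).^2));     % d(x,y)  = [sum (x_k - y_k)^2]^(1/2), gives sqrt(5)
    d1 = sum(abs(x - y));           % d1(x,y) = sum |x_k - y_k|, gives 3
    disp([d d1]);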

Example 1.3 Consider the set of (singly) infinite, complex-valued, and bounded 
sequences 

X = {x = (x_0, x_1, x_2, ...) | x_k ∈ C, |x_k| ≤ c(x) (all k)}.    (1.7a) 

Here c(x) > 0 is a bound that may depend on x, but not on k. This set forms a 
metric space that may be denoted by l^∞[0, ∞] if we employ the metric 

d(x, y) = sup_{k∈Z^+} |x_k - y_k|.    (1.7b) 

The notation [0, ∞] emphasizes that the sequences we are talking about are only 
singly infinite. We would use [-∞, ∞] to specify that we are talking about doubly 
infinite sequences. 

Example 1.4 Define t ∈ [a, b] ⊂ R. The set C[a, b] will be a metric space if 

d(x, y) = sup_{t∈[a,b]} |x(t) - y(t)|.    (1.8) 

In Example 1.1 the metric (1.5) gives the "distance" between points on the real-number 
line. In Example 1.4 the "points" are real-valued, continuous functions of 
t ∈ [a, b]. In functional analysis it is essential to get used to the idea that functions 
can be considered as points in a space. 

Example 1.5 The set X in (1.7a), where we now allow c(x) → ∞ (in other 
words, the sequence need not be bounded here), but with the metric 

d(x, y) = Σ_{k=0}^{∞} (1/2^{k+1}) |x_k - y_k| / (1 + |x_k - y_k|)    (1.9) 

is a metric space. (Sometimes this space is denoted s.) 

Example 1.6 Let p be a real-valued constant such that p ≥ 1. Consider the 
set of complex-valued sequences 

X = {x = (x_0, x_1, x_2, ...) | x_k ∈ C, Σ_{k=0}^{∞} |x_k|^p < ∞}.    (1.10a) 

This set together with the metric 

d(x, y) = [Σ_{k=0}^{∞} |x_k - y_k|^p]^{1/p}    (1.10b) 

forms a metric space that we denote by l^p[0, ∞]. 

Example 1.7 Consider the set of complex-valued functions on [a, b] ⊂ R 

X = {x | ∫_a^b |x(t)|^2 dt < ∞}    (1.11a) 

for which 

d(x, y) = [∫_a^b |x(t) - y(t)|^2 dt]^{1/2}    (1.11b) 

is a metric. Pair (X, d) forms a metric space that is usually denoted by L^2[a, b]. 

The metric space of Example 1.7 (along with certain variations) is very important 
in the theory of orthogonal polynomials, and in least-squares approximation 
problems. This is because it turns out to be an inner product space too (see 
Section 1.3.3). Orthogonal polynomials have a major role to play in the solution 
of least squares, and other types of approximation problem. 

All of the metrics defined in the examples above may be shown to satisfy the 
axioms of Definition 1.1. Of course, at least in some cases, much effort might be 
required to do this. In this book we largely avoid making this kind of effort. 



1.3.2 Norms and Normed Spaces 

So far our examples of function spaces have been metric spaces (Section 1.3.1). 
Such spaces are not necessarily associated with the concept of a vector space. 
However, normed spaces (i.e., spaces with norms defined on them) are always 
associated with vector spaces. So, before we can define a norm, we need to recall 
the general definition of a vector space. 

The following definition invokes the concept of a field of numbers. This concept 
arises in abstract algebra and number theory [e.g., 2, 3], a subject we wish to avoid 
considering here. 3 It is enough for the reader to know that R and C are fields under 

This avoidance is not to disparage abstract algebra. This subject is a necessary prerequisite to under- 
standing concepts such as fast algorithms for digital signal processing (i.e., fast Fourier transforms, and 
fast convolution algorithms; e.g., see Ref. 4), cryptography and data security, and error control codes 
for digital communications. 



TLFeBOOK 



SOME SPECIAL MAPPINGS: METRICS, NORMS, AND INNER PRODUCTS 9 

the usual real and complex arithmetic operations. These are really the only fields 
that we shall work with. We remark, largely in passing, that rational numbers (set 
denoted Q) are also a field under the usual arithmetic operations. 

Definition 1.2: Vector Space A vector space (linear space) over a field K is 
a nonempty set X of elements x, y, z, ... called vectors together with two algebraic 
operations. These operations are vector addition, and the multiplication of vectors 
by scalars that are elements of K. The following axioms must be satisfied: 

(V1) If x, y ∈ X, then x + y ∈ X (additive closure). 

(V2) If x, y, z ∈ X, then (x + y) + z = x + (y + z) (associativity). 

(V3) There exists a vector in X denoted 0 (zero vector) such that for all x ∈ X, 
we have x + 0 = 0 + x = x. 

(V4) For all x ∈ X, there is a vector -x ∈ X such that -x + x = x + 
(-x) = 0. We call -x the negative of a vector. 

(V5) For all x, y ∈ X we have x + y = y + x (commutativity). 

(V6) If x ∈ X and a ∈ K, then the product of a and x is ax, and ax ∈ X. 

(V7) If x, y ∈ X, and a ∈ K, then a(x + y) = ax + ay. 

(V8) If a, b ∈ K, and x ∈ X, then (a + b)x = ax + bx. 

(V9) If a, b ∈ K, and x ∈ X, then (ab)x = a(bx). 

(V10) If x ∈ X, and 1 ∈ K, then 1x = x (multiplication of a vector by a unit 
scalar; all fields contain a unit scalar, i.e., a number called "one"). 

In this definition, as already noted, we generally work only with K = R, or K = C. 
We represent the zero vector by 0 just as we also represent the scalar zero by 0. 
Rarely is there danger of confusion. 

The reader is already familiar with the special instances of this that relate to the 
sets R^n and C^n. These sets are vector spaces under Definition 1.2, where vector 
addition is defined to be 

x + y = [x_0 x_1 ... x_{n-1}]^T + [y_0 y_1 ... y_{n-1}]^T = [x_0 + y_0 x_1 + y_1 ... x_{n-1} + y_{n-1}]^T    (1.12a) 

and multiplication by a field element is defined to be 

ax = [ax_0 ax_1 ... ax_{n-1}]^T.    (1.12b) 

The zero vector is 0 = [0 0 ... 0 0]^T, and -x = [-x_0 -x_1 ... -x_{n-1}]^T. If X = R^n 
then the elements of x and y are real-valued, and a ∈ R, but if X = C^n then the 
elements of x and y are complex-valued, and a ∈ C. The metric spaces in Example 1.2 
are therefore also vector spaces under the operations defined in (1.12a,b). 
Some further examples of vector spaces now follow. 

Example 1.8 Metric space C[a, b] (Example 1.4) is a vector space under the 
operations 

(x + y)(t) = x(t) + y(t),    (ax)(t) = ax(t),    (1.13) 

where a ∈ R. The zero vector is the function that is identically zero on the interval 
[a, b]. 

Example 1.9 Metric space l^2[0, ∞] (Example 1.6) is a vector space under the 
operations 

x + y = (x_0, x_1, ...) + (y_0, y_1, ...) = (x_0 + y_0, x_1 + y_1, ...), 
ax = (ax_0, ax_1, ...).    (1.14) 

Here a ∈ C. 

If x, y ∈ l^2[0, ∞], then some effort is required to verify axiom (V1). This 
requires the Minkowski inequality, which is 

[Σ_{k=0}^{∞} |x_k + y_k|^p]^{1/p} ≤ [Σ_{k=0}^{∞} |x_k|^p]^{1/p} + [Σ_{k=0}^{∞} |y_k|^p]^{1/p}.    (1.15) 

Refer back to Example 1.6; here we employ p = 2, but (1.15) is valid for p ≥ 1. 
Proof of (1.15) is somewhat involved, and so is omitted here. The interested reader 
can see Kreyszig [1, pp. 11-15]. 

We remark that the Minkowski inequality can be proved with the aid of the 
Hölder inequality 

Σ_{k=0}^{∞} |x_k y_k| ≤ [Σ_{k=0}^{∞} |x_k|^p]^{1/p} [Σ_{k=0}^{∞} |y_k|^q]^{1/q}    (1.16) 

for which here p > 1 and 1/p + 1/q = 1. 
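As a quick sanity check of (1.16) (not a proof), the following MATLAB fragment compares the two sides for a made-up pair of finite sequences and a few values of p; all numerical choices here are assumptions for the sketch only:

    % Numerical check of the Holder inequality (1.16) for finite sequences.
    x = [1 -2 0.5 3];               % hypothetical sequences (finite length, so the
    y = [0.3 1 -4 2];               % infinite sums reduce to finite sums)
    for p = [1.5 2 3]
        q   = p/(p - 1);            % conjugate exponent, 1/p + 1/q = 1
        lhs = sum(abs(x.*y));
        rhs = sum(abs(x).^p)^(1/p) * sum(abs(y).^q)^(1/q);
        fprintf('p = %g: lhs = %.4f <= rhs = %.4f\n', p, lhs, rhs);
    end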

We are now ready to define a normed space. 



Definition 1.3: Normed Space, Norm A normed space X is a vector space 
with a norm defined on it. If x ∈ X then the norm of x is denoted by 

||x|| (read this as "norm of x"). 

The norm must satisfy the following axioms: 

(N1) ||x|| ≥ 0 (i.e., the norm is nonnegative). 

(N2) ||x|| = 0 ⟺ x = 0. 

(N3) ||ax|| = |a| ||x||. Here a is a scalar in the field of X (i.e., a ∈ K; see 
Definition 1.2). 

(N4) ||x + y|| ≤ ||x|| + ||y|| (triangle inequality). 

The normed space is vector space X together with a norm, and so may be properly 
denoted by the pair (X, || · ||). However, we may simply write X, and say "normed 
space X," so the norm that goes along with X is understood from the context of 
the discussion. 

It is important to note that all normed spaces are also metric spaces, where the 
metric is given by 

d(x, y) = ||x - y||    (x, y ∈ X).    (1.17) 

The metric in (1.17) is called the metric induced by the norm. 

Various other properties of norms may be deduced. One of these is: 

Example 1.10 Prove | ||y|| - ||x|| | ≤ ||y - x||. 

Proof From (N3) and (N4) 

||y|| = ||y - x + x|| ≤ ||y - x|| + ||x||,    ||x|| = ||x - y + y|| ≤ ||y - x|| + ||y||. 

Combining these, we obtain 

||y|| - ||x|| ≤ ||y - x||,    ||y|| - ||x|| ≥ -||y - x||. 

The claim follows immediately. 

We may regard the norm as a mapping from X to set R: || · || |X → R. This 
mapping can be shown to be continuous. However, this requires generalizing the 
concept of continuity that you may know from elementary calculus. Here we define 
continuity as follows. 

Definition 1.4: Continuous Mapping Suppose X = (X, d) and Y = (Y, d) 
are two metric spaces. The mapping T|X → Y is said to be continuous at a point 
x_0 ∈ X if for all ε > 0 there is a δ > 0 such that 

d(Tx, Tx_0) < ε for all x satisfying d(x, x_0) < δ.    (1.18) 

T is said to be continuous if it is continuous at every point of X. 

Note that Tx is just another way of writing T(x). (R, | · |) is a normed space; that 
is, the set of real numbers with the usual arithmetic operations defined on it is a 
vector space, and the absolute value of an element of R is the norm of that element. 
If we identify Y in Definition 1.4 with metric space (R, | · |), then (1.18) becomes 

d(Tx, Tx_0) = d(||x||, ||x_0||) = | ||x|| - ||x_0|| | < ε,    d(x, x_0) = ||x - x_0|| < δ. 

To make these claims, we are using (1.17). In other words, X and Y are normed 
spaces, and we employ the metrics induced by their respective norms. In addition, 
we identify T with || · ||. Using Example 1.10, we obtain 

| ||x|| - ||x_0|| | ≤ ||x - x_0|| < δ. 

Thus, the requirements of Definition 1.4 are met, and so we conclude that norms 
are continuous mappings. 

We now list some other normed spaces. 

Example 1.11 The Euclidean space R^n and the unitary space C^n are both 
normed spaces, where the norm is defined to be 

||x|| = [Σ_{k=0}^{n-1} |x_k|^2]^{1/2}.    (1.19) 

For R^n the absolute value bars may be dropped.^4 It is easy to see that d(x, y) = 
||x - y|| gives the same metric as in (1.6a) for space R^n. We further remark that 
for n = 1 we have ||x|| = |x|. 



Example 1.12 The space l^p[0, ∞] is a normed space if we define the norm 
to be 

||x|| = [Σ_{k=0}^{∞} |x_k|^p]^{1/p}    (1.20) 

for which d(x, y) = ||x - y|| coincides with the metric in (1.10b). 

Example 1.13 The sequence space l^∞[0, ∞] from Example 1.3 of Section 1.3.1 
is a normed space, where the norm is defined to be 

||x|| = sup_{k∈Z^+} |x_k|,    (1.21) 

and this norm induces the metric of (1.7b). 



^4 Suppose z = x + jy (j = √(-1), x, y ∈ R) is some arbitrary complex number. Recall that z ≠ |z| 
in general. 

Example 1.14 The space C[a, b] first seen in Example 1.4 is a normed space, 
where the norm is defined by 

||x|| = sup_{t∈[a,b]} |x(t)|.    (1.22) 

Naturally, this norm induces the metric of (1.8). 

Example 1.15 The space L^2[a, b] of Example 1.7 is a normed space for the 
norm 

||x|| = [∫_a^b |x(t)|^2 dt]^{1/2}.    (1.23) 

This norm induces the metric in (1.11b). 



The normed space of Example 1.15 is important in the following respect. 
Observe that 

||x||^2 = ∫_a^b |x(t)|^2 dt.    (1.24) 

Suppose we now consider a resistor with resistance R. If the voltage drop across its 
terminals is v(t) and the current through it is i(t), we know that the instantaneous 
power dissipated in the device is p(t) = v(t)i(t). If we assume that the resistor is 
a linear device, then v(t) = Ri(t) via Ohm's law. Thus 

p(t) = v(t)i(t) = Ri^2(t).    (1.25) 

Consequently, the amount of energy delivered to the resistor over time interval 
t ∈ [a, b] is given by 

E = R ∫_a^b i^2(t) dt.    (1.26) 

If the voltage/current waveforms in our circuit containing R belong to the space 
L^2[a, b], then clearly E = R||i||^2. We may therefore regard the square of the L^2 
norm [given by (1.24)] of a signal to be the energy of the signal, provided the 
norm exists. This notion can be helpful in the optimal design of electric circuits 
(e.g., electric filters), and also of optimal electronic circuits. In analogous fashion, 
an element x of space l^2[0, ∞] satisfies 

||x||^2 = Σ_{k=0}^{∞} |x_k|^2 < ∞    (1.27) 

[see (1.10a) and Example 1.12]. We may consider ||x||^2 to be the energy of the 
single-sided sequence x. This notion is useful in the optimal design of digital filters. 
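As a small numerical illustration of this energy interpretation, the MATLAB sketch below approximates the squared L^2 norm of a made-up current waveform by numerical integration and forms the corresponding resistor energy; the waveform, the interval, and R = 2 ohms are assumptions made only for this example:

    % Energy of a signal as the square of its L^2 norm: E = R * ||i||^2.
    R  = 2;                            % hypothetical resistance (ohms)
    a  = 0; b = 1;                     % time interval [a, b] (seconds)
    t  = linspace(a, b, 1001);
    curr = exp(-3*t);                  % hypothetical current waveform i(t)
    normsq = trapz(t, abs(curr).^2);   % approximates the integral of |i(t)|^2 dt
    E = R*normsq;                      % energy delivered to the resistor (joules)
    fprintf('||i||^2 = %.4f, E = %.4f J\n', normsq, E);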




1.3.3 Inner Products and Inner Product Spaces 

The concept of an inner product is necessary before one can talk about orthogonal 
bases for vector spaces. Recall from elementary linear algebra that orthogonal 
bases were important in representing vectors. From a computational standpoint, 
as mentioned earlier, orthogonal bases can have a simplifying effect on certain 
types of approximation problem (e.g., least-squares approximations), and represent 
a means of controlling numerical errors due to so-called ill-conditioned problems. 
Following our axiomatic approach, consider the following definition. 

Definition 1.5: Inner Product Space, Inner Product An inner product space 
is a vector space X with an inner product defined on it. The inner product is a 
mapping ⟨·, ·⟩|X × X → K that satisfies the following axioms: 

(I1) ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩. 

(I2) ⟨ax, y⟩ = a⟨x, y⟩. 

(I3) ⟨x, y⟩ = ⟨y, x⟩*. 

(I4) ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 ⟺ x = 0. 

Naturally, x, y, z ∈ X, and a is a scalar from the field K of vector space X. The 
asterisk superscript on ⟨y, x⟩ in (I3) denotes complex conjugation.^5 

If the field of X is not C, then the operation of complex conjugation in (I3) is 
redundant. 

All inner product spaces are also normed spaces, and hence are also metric 
spaces. This is because the inner product induces a norm on X 

||x|| = [⟨x, x⟩]^{1/2}    (1.28) 

for all x ∈ X. Following (1.17), the induced metric is 

d(x, y) = ||x - y|| = [⟨x - y, x - y⟩]^{1/2}.    (1.29) 

Directly from the axioms of Definition 1.5, it is possible to deduce that (for 
x, y, z ∈ X and a, b ∈ K) 

⟨ax + by, z⟩ = a⟨x, z⟩ + b⟨y, z⟩,    (1.30a) 

⟨x, ay⟩ = a*⟨x, y⟩,    (1.30b) 

and 

⟨x, ay + bz⟩ = a*⟨x, y⟩ + b*⟨x, z⟩.    (1.30c) 

The reader should prove these as an exercise. 

^5 If z = x + jy is a complex number, then its conjugate is z* = x - jy. 




We caution the reader that not all normed spaces are inner product spaces. We 
may construct an example with the aid of the following example. 

Example 1.16 Let x, y be from an inner product space. If || · || is the norm 
induced by the inner product, then ||x + y||^2 + ||x - y||^2 = 2(||x||^2 + ||y||^2). This 
is the parallelogram equality. 

Proof Via (1.30a,c) we have 

||x + y||^2 = ⟨x + y, x + y⟩ = ⟨x, x + y⟩ + ⟨y, x + y⟩ = ⟨x, x⟩ + ⟨x, y⟩ + ⟨y, x⟩ + ⟨y, y⟩, 

and 

||x - y||^2 = ⟨x - y, x - y⟩ = ⟨x, x - y⟩ - ⟨y, x - y⟩ = ⟨x, x⟩ - ⟨x, y⟩ - ⟨y, x⟩ + ⟨y, y⟩. 

Adding these gives the stated result. 

It turns out that the space l^p[0, ∞] with p ≠ 2 is not an inner product space. The 
parallelogram equality can be used to show this. Consider x = (1, 1, 0, 0, ...), y = 
(1, -1, 0, 0, ...), which are certainly elements of l^p[0, ∞] [see (1.10a)]. We see that 

||x|| = ||y|| = 2^{1/p},    ||x + y|| = ||x - y|| = 2. 

The parallelogram equality is not satisfied, which implies that our norm does not 
come from an inner product. Thus, l^p[0, ∞] with p ≠ 2 cannot be an inner product 
space. 
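The failure of the parallelogram equality for p ≠ 2 is also easy to see numerically; the following MATLAB fragment checks it for the two sequences above (truncated to finite vectors) with p = 1 and p = 2:

    % Parallelogram equality test: ||x+y||^2 + ||x-y||^2 versus 2(||x||^2 + ||y||^2).
    % For the l^1 norm (p = 1) the two sides differ, so that norm is not induced
    % by any inner product; for p = 2 they agree.
    x = [1 1 0 0];  y = [1 -1 0 0];
    for p = [1 2]
        np  = @(v) sum(abs(v).^p)^(1/p);        % l^p norm of a finite vector
        lhs = np(x + y)^2 + np(x - y)^2;
        rhs = 2*(np(x)^2 + np(y)^2);
        fprintf('p = %d: lhs = %g, rhs = %g\n', p, lhs, rhs);
    end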

On the other hand, l^2[0, ∞] is an inner product space, where the inner product 
is defined to be 

⟨x, y⟩ = Σ_{k=0}^{∞} x_k y_k^*.    (1.31) 



Does this infinite series converge? Yes, it does. To see this, we need the Cauchy-Schwarz 
inequality.^6 Recall the Hölder inequality of (1.16). Let p = 2, so that 
q = 2. Then the Cauchy-Schwarz inequality is 

Σ_{k=0}^{∞} |x_k y_k| ≤ [Σ_{k=0}^{∞} |x_k|^2]^{1/2} [Σ_{k=0}^{∞} |y_k|^2]^{1/2}.    (1.32) 

^6 The inequality we consider here is related to the Schwarz inequality. We will consider the Schwarz 
inequality later on. This inequality is of immense practical value to electrical and computer engineers. 
It is used to derive the matched-filter receiver, which is employed in digital communications systems, 
to derive the uncertainty principle in quantum mechanics and in signal processing, and to derive the 
Cramer-Rao lower bound on the variance of parameter estimators, to name only three applications. 




Now 

|⟨x, y⟩| = |Σ_{k=0}^{∞} x_k y_k^*| ≤ Σ_{k=0}^{∞} |x_k y_k^*|.    (1.33) 

The inequality in (1.33) follows from the triangle inequality for | · |. (Recall that the 
absolute value operation is a norm on R. It is also a norm on C; if z = x + jy ∈ C, 
then |z| = √(x^2 + y^2).) The right-hand side of (1.32) is finite because x and y are 
in l^2[0, ∞]. Thus, from (1.33), ⟨x, y⟩ is finite. Thus, the series (1.31) converges. 

It turns out that C[a, b] is not an inner product space, either. But we will not 
demonstrate the truth of this claim here. 

Some further examples of inner product spaces are as follows. 

Example 1.17 The Euclidean space R^n is an inner product space, where the 
inner product is defined to be 

⟨x, y⟩ = Σ_{k=0}^{n-1} x_k y_k.    (1.34) 

The reader will recognize this as the vector dot product from elementary linear 
algebra; that is, x · y = ⟨x, y⟩. It is well worth noting that 

⟨x, y⟩ = x^T y.    (1.35) 

Here the superscript T denotes transposition. So, x^T is a row vector. The inner 
product in (1.34) certainly induces the norm in (1.19). 

Example 1.18 The unitary space C^n is an inner product space for the inner 
product 

⟨x, y⟩ = Σ_{k=0}^{n-1} x_k y_k^*.    (1.36) 

Again, the norm of (1.19) is induced by inner product (1.36). If H denotes the 
operation of complex conjugation and transposition (this is called Hermitian transposition), 
then 

y^H = [y_0^* y_1^* ... y_{n-1}^*] 

(row vector), and 

⟨x, y⟩ = y^H x.    (1.37) 
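In MATLAB the ' operator performs exactly this conjugate (Hermitian) transposition, so (1.37) becomes one line of code; the vectors below are made up purely for illustration:

    % Inner product on C^n: <x,y> = y^H * x (conjugate transpose of y times x).
    x = [1+2i; 3; -1i];       % hypothetical vectors in C^3 (column vectors)
    y = [2; 1-1i; 4i];
    ip = y'*x;                % y' is the conjugate (Hermitian) transpose in MATLAB
    nx = sqrt(x'*x);          % induced norm (1.28): ||x|| = sqrt(<x,x>), a real number
    disp(ip); disp(nx);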

Example 1.19 The space L^2[a, b] from Example 1.7 is an inner product space 
if the inner product is defined to be 

⟨x, y⟩ = ∫_a^b x(t)y*(t) dt.    (1.38) 



The norm induced by (1.38) is 

||x|| = [∫_a^b |x(t)|^2 dt]^{1/2}.    (1.39) 

This in turn induces the metric in (1.11b). 



Now we consider the concept of orthogonality in a completely general manner. 

Definition 1.6: Orthogonality Let x, y be vectors from some inner product 
space X. These vectors are orthogonal iff 

⟨x, y⟩ = 0. 

The orthogonality of x and y is symbolized by writing x ⊥ y. Similarly, for subsets 
A, B ⊂ X we write x ⊥ A if x ⊥ a for all a ∈ A, and A ⊥ B if a ⊥ b for all a ∈ A, 
and b ∈ B. 

If we consider the inner product space R^2, then it is easy to see that 
⟨[1 0]^T, [0 1]^T⟩ = 0, so [0 1]^T and [1 0]^T are orthogonal vectors. In fact, these 
vectors form an orthogonal basis for R^2, a concept we will consider more generally 
below. If we define the unit vectors e_0 = [1 0]^T, and e_1 = [0 1]^T, then we 
recall that any x ∈ R^2 can be expressed as x = x_0 e_0 + x_1 e_1. (The extension of 
this reasoning to R^n for n > 2 should be clear.) Another example of a pair of 
orthogonal vectors would be x = (1/√2)[1 1]^T, and y = (1/√2)[1 -1]^T. These too form 
an orthogonal basis for the space R^2. 
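To see numerically that this second basis works just as well as the standard one, the following MATLAB fragment expands a made-up vector in the orthonormal basis {(1/√2)[1 1]^T, (1/√2)[1 -1]^T} using inner products as the expansion coefficients:

    % Expanding a vector in an orthonormal basis of R^2 via inner products.
    v1 = [1; 1]/sqrt(2);            % orthonormal basis vectors
    v2 = [1; -1]/sqrt(2);
    x  = [3; -5];                   % hypothetical vector to expand
    c1 = v1'*x;  c2 = v2'*x;        % coefficients <x, v1> and <x, v2>
    xr = c1*v1 + c2*v2;             % reconstruction; equals x (up to rounding)
    disp([x xr]);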
Define the functions 



and 



<Kx) 



f{x) 



0, x < and x > 1 

1, 0<x < 1 



0, x < and x > 1 

1. 



0<X < i 



(1.40) 



(1.41) 



1, i <x < 1 



Function φ(x) is called the Haar scaling function, and function ψ(x) is called the Haar wavelet [5]. The function φ(x) is also called a non-return-to-zero (NRZ) pulse, and function ψ(x) is also called a Manchester pulse [6]. It is easy to confirm that these pulses are elements of L²(R) = L²(−∞, ∞), and that they are orthogonal, that is, ⟨φ, ψ⟩ = 0 under the inner product defined in (1.38). This is so because

$$\langle\phi, \psi\rangle = \int_{-\infty}^{\infty} \phi(x)\psi^*(x)\,dx = \int_0^1 \psi(x)\,dx = 0.$$
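This orthogonality is easy to confirm numerically as well. The following minimal Python sketch (assuming NumPy; the grid spacing is arbitrary) approximates the inner product (1.38) by a Riemann sum:

import numpy as np

def phi(x):   # Haar scaling function (NRZ pulse), Eq. (1.40)
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def psi(x):   # Haar wavelet (Manchester pulse), Eq. (1.41)
    return np.where((x >= 0) & (x < 0.5), 1.0,
           np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

x = np.linspace(-1, 2, 300001)
dx = x[1] - x[0]
inner = np.sum(phi(x) * psi(x)) * dx   # approximates <phi, psi>
print(inner)                           # approximately 0, confirming orthogonality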




Thus, we consider φ and ψ to be elements in the inner product space L²(R), for which the inner product is

$$\langle x, y\rangle = \int_{-\infty}^{\infty} x(t)\,y^*(t)\,dt.$$

It turns out that the Haar wavelet is the simplest example of the more general class of Daubechies wavelets. The general theory of these wavelets first appeared in Daubechies [7]. Their development has revolutionized signal processing and many other areas.^7 The main reason for this is the fact that for any f(t) ∈ L²(R)

$$f(t) = \sum_{n=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} \langle f, \psi_{n,k}\rangle\,\psi_{n,k}(t), \qquad (1.42)$$

where ψ_{n,k}(t) = 2^{n/2} ψ(2^n t − k). This doubly infinite series is called a wavelet series expansion for f. The coefficients f_{n,k} = ⟨f, ψ_{n,k}⟩ have finite energy. In effect, if we treat either k or n as a constant, then the resulting doubly infinite sequence is in the space l²[−∞, ∞]. In fact, it is also the case that

$$\sum_{n=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} |f_{n,k}|^2 < \infty. \qquad (1.43)$$

It is to be emphasized that the ψ used in (1.42) could be (1.41), or it could be chosen from the more general class in Ref. 7. We shall not prove these things in this book, as the technical arguments are quite hard.

The wavelet series is presently not as familiar to the broader electrical and computer engineering community as is the Fourier series. A brief summary of the Fourier series is as follows. Again, rigorous proofs of many of the following claims will be avoided, though good introductory references to Fourier series are Tolstov [8] or Kreyszig [9]. If f ∈ L²(0, 2π), then

$$f(t) = \sum_{n=-\infty}^{\infty} f_n e^{jnt}, \quad j = \sqrt{-1}, \qquad (1.44)$$

where the Fourier (series) coefficients are given by

$$f_n = \frac{1}{2\pi}\int_0^{2\pi} f(t)e^{-jnt}\,dt. \qquad (1.45)$$

We may define

$$e_n(t) = \exp(jnt) \quad (t \in (0, 2\pi),\ n \in \mathbf{Z}) \qquad (1.46)$$

^7 For example, in digital communications the problem of designing good signaling pulses for data transmission is best treated with respect to wavelet theory.




so that we see

$$\langle f, e_n\rangle = \frac{1}{2\pi}\int_0^{2\pi} f(t)\left[e^{jnt}\right]^*\,dt = f_n. \qquad (1.47)$$

The series (1.44) is the complex Fourier series expansion for f. Note that for n, k ∈ Z

$$\exp[jn(t + 2\pi k)] = \exp[jnt]\exp[2\pi jnk] = \exp[jnt]. \qquad (1.48)$$

Here we have used Euler's identity

$$e^{jx} = \cos x + j\sin x \qquad (1.49)$$

and cos(2πk) = 1, sin(2πk) = 0. The function e^{jnt} is therefore 2π-periodic; that is, its period is 2π. It therefore follows that the series on the right-hand side of (1.44) is a 2π-periodic function, too. The result (1.48) implies that, although f in (1.44) is initially defined only on (0, 2π), we are at liberty to "periodically extend" f over the entire real-number line; that is, we can treat f as one period of the periodic function

$$\tilde{f}(t) = \sum_{k\in \mathbf{Z}} f(t + 2\pi k) \qquad (1.50)$$

for which \tilde{f}(t) = f(t) for t ∈ (0, 2π). Thus, series (1.44) is a way to represent periodic functions. Because f ∈ L²(0, 2π), it turns out that

$$\sum_{n=-\infty}^{\infty} |f_n|^2 < \infty \qquad (1.51)$$

so that (f_n) ∈ l²[−∞, ∞].

Observe that in (1.47) we have "redefined" the inner product on L²(0, 2π) to be

$$\langle x, y\rangle = \frac{1}{2\pi}\int_0^{2\pi} x(t)\,y^*(t)\,dt, \qquad (1.52)$$

which differs from (1.38) in that it has the factor 1/(2π) in front. This variation also happens to be a valid inner product on the vector space defined by the set in (1.11a). Actually, it is a simple example of a weighted inner product.
Now consider, for n ≠ m,

$$\langle e_n, e_m\rangle = \frac{1}{2\pi}\int_0^{2\pi} e^{jnt}e^{-jmt}\,dt = \frac{1}{2\pi j(n-m)}\left[e^{j(n-m)t}\right]_0^{2\pi} = \frac{e^{2\pi j(n-m)} - 1}{2\pi j(n-m)} = \frac{1 - 1}{2\pi j(n-m)} = 0. \qquad (1.53)$$

Similarly,

$$\langle e_n, e_n\rangle = \frac{1}{2\pi}\int_0^{2\pi} e^{jnt}e^{-jnt}\,dt = \frac{1}{2\pi}\int_0^{2\pi} dt = 1. \qquad (1.54)$$

So, e_n and e_m (if n ≠ m) are orthogonal with respect to the inner product in (1.52).






From basic electric circuit analysis, periodic signals have finite power. Therefore, series (1.44) is a way to represent finite power signals.^8 We might therefore consider the space L²(0, 2π) to be the "space of finite power signals." From considerations involving the wavelet series representation of (1.42), we may consider L²(R) to be the "space of finite energy signals." Recall also the discussion at the end of Section 1.3.2 (last paragraph).

An example of a Fourier series expansion is the following. 



Example 1.20  Suppose that

$$f(t) = \begin{cases} 1, & 0 < t < \pi \\ -1, & \pi < t < 2\pi. \end{cases} \qquad (1.55)$$

A sketch of this function is one period of a 2π-periodic square wave. The Fourier coefficients are given by (for n ≠ 0)

$$f_n = \frac{1}{2\pi}\int_0^{2\pi} f(t)e^{-jnt}\,dt = \frac{1}{2\pi}\left[\int_0^{\pi} e^{-jnt}\,dt - \int_{\pi}^{2\pi} e^{-jnt}\,dt\right] = \frac{1}{\pi}\,\frac{1 - e^{-jn\pi}}{jn} = \frac{2}{\pi n}\,e^{-jn\pi/2}\,\frac{e^{jn\pi/2} - e^{-jn\pi/2}}{2j} = \frac{2}{\pi n}\,e^{-jn\pi/2}\sin\left(\frac{\pi}{2}n\right), \qquad (1.56)$$

where we have made use of

$$\sin x = \frac{1}{2j}\left[e^{jx} - e^{-jx}\right]. \qquad (1.57)$$

This is easily derived using the Euler identity in (1.49). For n = 0, it should be clear that f_0 = 0.

The coefficients f_n in (1.56) involve expressions containing j. Since f(t) is real-valued, it therefore follows that we can rewrite the series expansion in such a manner as to avoid complex arithmetic. It is almost a standard practice to do this. We now demonstrate this process:

$$\sum_{n=-\infty}^{\infty} f_n e^{jnt} = \sum_{n=1}^{\infty} \frac{2}{\pi n}e^{-jn\pi/2}\sin\left(\frac{\pi}{2}n\right)e^{jnt} + \sum_{n=-\infty}^{-1} \frac{2}{\pi n}e^{-jn\pi/2}\sin\left(\frac{\pi}{2}n\right)e^{jnt}$$
$$= \sum_{n=1}^{\infty} \frac{2}{\pi n}e^{-jn\pi/2}\sin\left(\frac{\pi}{2}n\right)e^{jnt} + \sum_{n=1}^{\infty} \frac{2}{\pi n}e^{jn\pi/2}\sin\left(\frac{\pi}{2}n\right)e^{-jnt}$$
$$= \sum_{n=1}^{\infty} \frac{2}{\pi n}\sin\left(\frac{\pi}{2}n\right)\left[e^{jnt}e^{-jn\pi/2} + e^{-jnt}e^{jn\pi/2}\right] = \frac{4}{\pi}\sum_{n=1}^{\infty} \frac{1}{n}\cos\left[n\left(t - \frac{\pi}{2}\right)\right]\sin\left(\frac{\pi}{2}n\right).$$

Here we have used the fact that (see Appendix 1.A)

$$e^{jnt}e^{-jn\pi/2} + e^{-jnt}e^{jn\pi/2} = 2\,\mathrm{Re}\left[e^{jnt}e^{-jn\pi/2}\right] = 2\cos\left[n\left(t - \frac{\pi}{2}\right)\right].$$

This is so because if z = x + jy, then z + z* = 2x = 2 Re [z]. Since

$$\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta,$$

we have

$$\cos\left[n\left(t - \frac{\pi}{2}\right)\right] = \cos(nt)\cos\frac{\pi n}{2} + \sin(nt)\sin\frac{\pi n}{2}.$$

However, if n is an even number, then sin(πn/2) = 0, and if n is an odd number, then cos(πn/2) = 0. Therefore

$$\frac{4}{\pi}\sum_{n=1}^{\infty} \frac{1}{n}\cos\left[n\left(t - \frac{\pi}{2}\right)\right]\sin\left(\frac{\pi}{2}n\right) = \frac{4}{\pi}\sum_{n=0}^{\infty} \frac{1}{2n+1}\sin[(2n+1)t]\,\sin^2\left[(2n+1)\frac{\pi}{2}\right],$$

but sin²[(2n + 1)π/2] = 1, so finally we have

$$f(t) = \sum_{n=-\infty}^{\infty} f_n e^{jnt} = \frac{4}{\pi}\sum_{n=0}^{\infty} \frac{1}{2n+1}\sin[(2n+1)t].$$

^8 In fact, using phasor analysis and superposition, you can apply (1.44) to determine the steady-state output of a circuit for any periodic input (including, and especially, nonsinusoidal periodic functions). This makes the Fourier series very important in electrical/electronic circuit analysis.
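The convergence of this sine series to the square wave in (1.55) can be observed numerically. The Python sketch below (NumPy assumed; the number of terms is arbitrary) evaluates a truncated version of the series:

import numpy as np

def square_wave_partial_sum(t, terms):
    # Partial sum of (4/pi) * sum_{n=0}^{terms-1} sin((2n+1)t)/(2n+1)
    s = np.zeros_like(t)
    for n in range(terms):
        s += np.sin((2 * n + 1) * t) / (2 * n + 1)
    return (4.0 / np.pi) * s

t = np.linspace(0.01, 2 * np.pi - 0.01, 5)
print(square_wave_partial_sum(t, 200))   # close to +1 on (0, pi) and -1 on (pi, 2*pi)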

It is important to note that the wavelet series and Fourier series expansions have 
something in common, in spite of the fact that they look quite different and indeed 
are associated with quite different function spaces. The common feature is that both 
representations involve the use of orthogonal basis functions. We are now ready to 
consider this in a general manner. 

Begin by recalling from elementary linear algebra that a basis for a vector space such as X = R^n or X = C^n is a set of n vectors, say,

$$B = \{e_0, e_1, \ldots, e_{n-1}\}, \qquad (1.58)$$

such that the elements e_k (basis vectors) are linearly independent. This means that no vector in the set can be expressed as a linear combination of any of the others.




In general, it is not necessary that ⟨e_k, e_n⟩ = 0 for n ≠ k. In other words, independence does not require orthogonality. However, if set B is a basis (orthogonal or otherwise), then for any x ∈ X (vector space) there exists a set of coefficients from the field of the vector space, say, b = {b_0, b_1, ..., b_{n−1}}, such that

$$x = \sum_{k=0}^{n-1} b_k e_k. \qquad (1.59)$$

We say that spaces R^n and C^n are of dimension n. This is a direct reference to the number of basis vectors in B. This notion generalizes.

Now let us consider a sequence space (e.g., l²[0, ∞]). Suppose x = (x_0, x_1, x_2, ...) ∈ l²[0, ∞]. Define the following unit vector sequences:

$$e_0 = (1, 0, 0, 0, \ldots), \quad e_1 = (0, 1, 0, 0, \ldots), \quad e_2 = (0, 0, 1, 0, \ldots), \quad \text{etc.} \qquad (1.60)$$

Clearly

$$x = \sum_{k=0}^{\infty} x_k e_k. \qquad (1.61)$$

It is equally clear that no vector e_k can be expressed as a linear combination of any of the others. Thus, the countably infinite set^9 B = {e_0, e_1, e_2, ...} forms a basis for l²[0, ∞]. The sequence space is therefore of infinite dimension because B has a countable infinity of members. It is apparent as well that, under the inner product defined in (1.31), we have ⟨e_n, e_m⟩ = δ_{n−m}. Sequence δ = (δ_n) is called the Kronecker delta sequence. It is defined by

$$\delta_n = \begin{cases} 1, & n = 0 \\ 0, & n \ne 0. \end{cases} \qquad (1.62)$$

Therefore, the vectors in (1.60) are mutually orthogonal as well. So they happen to form an orthogonal basis for l²[0, ∞]. Of course, this is not the only possible basis. In general, given a countably infinite set of vectors {e_k | k ∈ Z⁺} [no longer necessarily those in (1.60)] that are linearly independent, and such that e_k ∈ l²[0, ∞], for any x ∈ l²[0, ∞] there will exist coefficients a_k ∈ C such that

$$x = \sum_{k=0}^{\infty} a_k e_k. \qquad (1.63)$$

In view of the above, consider the following linearly independent set of vectors from some inner product space X:

$$B = \{e_k \mid e_k \in X,\ k \in \mathbf{Z}\}. \qquad (1.64)$$

^9 A set A is countably infinite if its members can be put into one-to-one (1-1) correspondence with the members of the set Z⁺. This is also equivalent to being able to place the elements of A into 1-1 correspondence with the elements of Z.




Assume that this is a basis for X. In this case for any x ∈ X, there are coefficients a_k such that

$$x = \sum_{k} a_k e_k. \qquad (1.65)$$

We define the set B to be orthogonal iff for all n, k ∈ Z

$$\langle e_n, e_k\rangle = \delta_{n-k}. \qquad (1.66)$$

Assume that the elements of B in (1.64) satisfy (1.66). It is then easy to see that

$$\langle x, e_n\rangle = \left\langle \sum_k a_k e_k, e_n\right\rangle = \sum_k \langle a_k e_k, e_n\rangle \quad \text{(using (I1))}$$
$$= \sum_k a_k \langle e_k, e_n\rangle \quad \text{(using (I2))}$$
$$= \sum_k \delta_{k-n} a_k \quad \text{(using (1.66))},$$

so finally we may say that

$$\langle x, e_n\rangle = a_n. \qquad (1.67)$$

In other words, if the basis B is orthogonal, then

$$x = \sum_{k\in \mathbf{Z}} \langle x, e_k\rangle e_k. \qquad (1.68)$$

Previous examples (e.g., the Fourier series expansion) are merely special cases of this general idea. We see that one of the main features of an orthogonal basis is the ease with which we can obtain the coefficients a_k. Nonorthogonal bases are harder to work with in this respect. This is one of the reasons why orthogonal bases are so universally popular.
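As a small illustration of (1.67) and (1.68), the following Python sketch (assuming NumPy; the basis is the orthonormal pair for R² mentioned earlier, and the test vector is arbitrary) computes the coefficients by inner products and reconstructs the vector:

import numpy as np

# Orthonormal basis of R^2: e0 = (1/sqrt(2))[1 1]^T, e1 = (1/sqrt(2))[1 -1]^T
e0 = np.array([1.0, 1.0]) / np.sqrt(2)
e1 = np.array([1.0, -1.0]) / np.sqrt(2)

x = np.array([3.0, -2.0])

# Coefficients via (1.67): a_k = <x, e_k>
a0 = np.dot(x, e0)
a1 = np.dot(x, e1)

# Reconstruction via (1.68): x = sum_k <x, e_k> e_k
x_rec = a0 * e0 + a1 * e1
print(a0, a1, x_rec)   # x_rec equals x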

A few comments on terminology are in order here. Some would say that the condition (1.66) on B in (1.64) means that B is an orthonormal set, and we would say that the condition

$$\langle e_n, e_k\rangle = \alpha_n \delta_{n-k}$$

is the condition for B to be an orthogonal set, where α_n is not necessarily unity (i.e., equal to one) for all n. However, in this book we often insist that orthogonal basis vectors be "normalized" so that condition (1.66) holds.

We conclude the present section by considering the following theorem. It was 
mentioned in a footnote that the following Schwarz inequality (or variations of it) 
is of very great value in electrical and computer engineering. 






Theorem 1.1: Schwarz Inequality  Let X be an inner product space, where x, y ∈ X. Then

$$|\langle x, y\rangle| \le \|x\|\,\|y\|. \qquad (1.69)$$

Equality holds iff {x, y} is a linearly dependent set.

Proof  If y = 0, then ⟨x, 0⟩ = 0, and (1.69) clearly holds in this special case. Let y ≠ 0. For all scalars α in the field of X we must have [via inner product axioms and (1.30)]

$$0 \le \|x - \alpha y\|^2 = \langle x - \alpha y, x - \alpha y\rangle = \langle x, x\rangle - \alpha^*\langle x, y\rangle - \alpha[\langle y, x\rangle - \alpha^*\langle y, y\rangle].$$

If we select α* = ⟨y, x⟩/⟨y, y⟩, then the quantity in the brackets [·] vanishes. Thus

$$0 \le \langle x, x\rangle - \frac{\langle y, x\rangle}{\langle y, y\rangle}\langle x, y\rangle = \|x\|^2 - \frac{|\langle x, y\rangle|^2}{\|y\|^2}$$

[using ⟨x, y⟩ = ⟨y, x⟩*, i.e., axiom (I3)]. Rearranging, this yields

$$|\langle x, y\rangle|^2 \le \|x\|^2\|y\|^2,$$

and the result (1.69) follows (we must take positive square roots as ||x|| ≥ 0 and ||y|| ≥ 0).

Equality holds iff y = 0, or else ||x − αy||² = 0, hence x − αy = 0 [recall (N2)], so x = αy, demonstrating linear dependence of x and y.

We may now see what Theorem 1.1 has to say when applied to the special case of a vector dot product.

Example 1.21  Suppose that X is the inner product space of Example 1.17. Since

$$|\langle x, y\rangle| = \left|\sum_{k=0}^{n-1} x_k y_k\right| \quad \text{and} \quad \|x\| = \left[\sum_{k=0}^{n-1} x_k^2\right]^{1/2},$$

we have from Theorem 1.1 that

$$\left|\sum_{k=0}^{n-1} x_k y_k\right| \le \left[\sum_{k=0}^{n-1} x_k^2\right]^{1/2}\left[\sum_{k=0}^{n-1} y_k^2\right]^{1/2}. \qquad (1.70)$$

If y_k = αx_k (α ∈ R) for all k ∈ Z_n, then

$$\left|\sum_{k=0}^{n-1} x_k y_k\right| = |\alpha|\sum_{k=0}^{n-1} x_k^2,$$

and [Σ_{k=0}^{n−1} y_k²]^{1/2} = |α|[Σ_{k=0}^{n−1} x_k²]^{1/2}, hence

$$\left[\sum_{k=0}^{n-1} x_k^2\right]^{1/2}\left[\sum_{k=0}^{n-1} y_k^2\right]^{1/2} = |\alpha|\sum_{k=0}^{n-1} x_k^2.$$

Thus, (1.70) does indeed hold with equality when y = αx.



1.4 THE DISCRETE FOURIER SERIES (DFS) 

The subject of discrete Fourier series (DFS) and its relationship to the complex 
Fourier series expansion of Section 1.3.3 is often deferred to later courses (e.g., 
signals and systems), but will be briefly considered here as an additional example 
of an orthogonal series expansion. 

The complex Fourier series expansion of Section 1.3.3 was for 2π-periodic functions defined on the real-number line. A similar series expansion exists for N-periodic sequences such as x̃ = (x̃_n); that is, for N ∈ {2, 3, 4, ...} ⊂ Z, consider

$$\tilde{x}_n = \sum_{k\in \mathbf{Z}} x_{n+kN}, \qquad (1.71)$$

where x = (x_n) is such that x_n = 0 for n < 0, and for n ≥ N as well. Thus, x is just one period of x̃. We observe that



$$\tilde{x}_{n+mN} = \sum_{k\in \mathbf{Z}} x_{n+mN+kN} = \sum_{k\in \mathbf{Z}} x_{n+(m+k)N} = \sum_{r\in \mathbf{Z}} x_{n+rN}$$

(r = m + k). This confirms that x̃ is indeed N-periodic (i.e., periodic with period N). We normally assume in a context such as this that x_n ∈ C. We also regard x as a vector: x = [x_0 x_1 ··· x_{N−1}]^T ∈ C^N. An inner product may be defined on the space of N-periodic sequences according to

$$\langle \tilde{x}, \tilde{y}\rangle = \langle x, y\rangle = y^H x \qquad (1.72)$$



(recall Example 1.18), where y ∈ C^N is one period of ỹ. We assume, of course, that x̃ and ỹ are bounded sequences so that (1.72) is well defined.

Now define e_k = [e_{k,0} e_{k,1} ··· e_{k,N−1}]^T ∈ C^N according to

$$e_{k,n} = \exp\left(j\frac{2\pi}{N}kn\right), \qquad (1.73)$$

where n ∈ Z_N. The periodization of e_k = (e_{k,n}) is

$$\tilde{e}_{k,n} = \sum_{m\in \mathbf{Z}} e_{k,n+mN}, \qquad (1.74)$$



yielding ẽ_k = (ẽ_{k,n}). That (1.73) is periodic with period N with respect to index n is easily seen:

$$e_{k,n+mN} = \exp\left[j\frac{2\pi}{N}k(n + mN)\right] = \exp\left[j\frac{2\pi}{N}kn\right]\exp[j2\pi km] = e_{k,n}.$$

It can be shown (by exercise) that [using definition (1.72)]

$$\langle \tilde{e}_k, \tilde{e}_r\rangle = \langle e_k, e_r\rangle = \sum_{n=0}^{N-1}\exp\left[j\frac{2\pi}{N}kn\right]\exp\left[-j\frac{2\pi}{N}rn\right] = \sum_{n=0}^{N-1}\exp\left[j\frac{2\pi}{N}(k-r)n\right] = \begin{cases} N, & k - r = 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (1.75)$$

Thus, if we consider (ẽ_{k,n}) and (ẽ_{r,n}) with k ≠ r, we find that these sequences are orthogonal, and so form an orthogonal basis for the vector space C^N. From (1.75) we may write

$$\langle e_k, e_r\rangle = N\delta_{k-r}. \qquad (1.76)$$

Thus, there must exist another vector X = [X_0 X_1 ··· X_{N−1}]^T ∈ C^N such that

$$x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k \exp\left(j\frac{2\pi}{N}kn\right) \qquad (1.77)$$

for n ∈ Z_N. In fact,



$$\langle x, e_r\rangle = \sum_{n=0}^{N-1} x_n e_{r,n}^* = \frac{1}{N}\sum_{n=0}^{N-1}\left\{\sum_{k=0}^{N-1} X_k \exp\left[j\frac{2\pi}{N}kn\right]\right\}\exp\left[-j\frac{2\pi}{N}rn\right]$$
$$= \frac{1}{N}\sum_{k=0}^{N-1} X_k\left\{\sum_{n=0}^{N-1}\exp\left[j\frac{2\pi}{N}(k-r)n\right]\right\} = \frac{1}{N}\sum_{k=0}^{N-1} X_k (N\delta_{k-r}) = X_r. \qquad (1.78)$$

That is,

$$X_k = \sum_{n=0}^{N-1} x_n \exp\left[-j\frac{2\pi}{N}kn\right] \qquad (1.79)$$

for k ∈ Z_N.




In (1.77) we see x_{n+mN} = x_n for all m ∈ Z. Thus, (x_n) in (1.77) is N-periodic, and so we have x̃_n = (1/N) Σ_{k=0}^{N−1} X_k exp[j(2π/N)kn] with X_k given by (1.79). Equation (1.77) is the discrete Fourier series (DFS) expansion for an N-periodic complex-valued sequence x̃ such as in (1.71). The DFS coefficients are given by (1.79). However, it is common practice to consider only x_n for n ∈ Z_N, which is equivalent to only considering the vector x ∈ C^N. In this case the vector X ∈ C^N given by (1.79) is now called the discrete Fourier transform (DFT) of the vector x, and the expression in (1.77) is the inverse DFT (IDFT) of the vector X. We observe that the DFT and the IDFT can be concisely expressed in matrix form, where we define the DFT matrix

$$F = \left[\exp\left(-j\frac{2\pi}{N}kn\right)\right]_{k,n\in \mathbf{Z}_N} \in \mathbf{C}^{N\times N}, \qquad (1.80)$$

and we see from (1.77) that F^{−1} = (1/N)F^* (IDFT matrix). Thus, X = Fx. We remark that the symmetry of F (i.e., F = F^T) means that either k or n in (1.80) may be interpreted as row or column indices.

The DFT has a long history, and its invention is now attributed to Gauss [10]. The DFT is of central importance to numerical computing generally, but has particularly great significance in digital signal processing as it represents a numerical approximation to the Fourier transform, and it can also be used to efficiently implement digital filtering operations via so-called fast Fourier transform (FFT) algorithms. The construction of FFT algorithms to efficiently compute X = Fx (and x = F^{−1}X) is rather involved, and not within the scope of the present book. Simply note that the direct computation of the matrix-vector product X = Fx needs N² complex multiplications and N(N − 1) complex additions. For N = 2^p (p ∈ {1, 2, 3, ...}), which is called the radix-2 case, the algorithm of Cooley and Tukey [11] reduces the number of operations to something proportional to N log₂ N, which is a substantial savings compared to N² operations with the direct approach when N is large enough. Essentially, the method in Ref. 11 implicitly factors F according to F = F_p F_{p−1} ··· F_1, where the matrix factors F_k ∈ C^{N×N} are sparse (i.e., contain many zero-valued entries). Note that multiplication by zero is not implemented in either hardware or software and so does not represent a computational cost in the practical implementation of the FFT algorithm. It is noteworthy that the algorithm of Ref. 11 also has a long history dating back to the work of Gauss, as noted by Heideman et al. [10]. It is also important to mention that fast algorithms exist for all possible N ≠ 2^p [4]. The following example suggests one of the important applications of the DFT/DFS.
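To make the matrix form of the DFT concrete, the following Python sketch (assuming NumPy; N and the test vector are arbitrary choices) builds F from (1.80), verifies that F^{−1} = (1/N)F^*, and checks the result against a library FFT routine:

import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)   # row index
n = np.arange(N).reshape(1, -1)   # column index

# DFT matrix (1.80): F[k, n] = exp(-j*2*pi*k*n/N)
F = np.exp(-2j * np.pi * k * n / N)

x = np.random.randn(N) + 1j * np.random.randn(N)

X = F @ x                             # DFT: X = F x
x_back = (1.0 / N) * np.conj(F) @ X   # IDFT: x = (1/N) F* X

print(np.allclose(x_back, x))         # True: F^{-1} = (1/N) F*
print(np.allclose(X, np.fft.fft(x)))  # True: agrees with the FFT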



Example 1.22  Suppose that x_n = Ae^{jθn} with θ = (2π/N)m for m = 1, 2, ..., N/2 − 1 (N is assumed to be even here). From (1.79) using (1.75),

$$X_k = AN\delta_{m-k}. \qquad (1.81)$$

Now suppose instead that x_n = Ae^{−jθn}, so similarly

$$X_k = A\sum_{n=0}^{N-1}\exp\left[-j\frac{2\pi}{N}n(m + k)\right] = A\sum_{n=0}^{N-1}\exp\left[j\frac{2\pi}{N}n(N - m - k)\right] = AN\delta_{N-m-k}. \qquad (1.82)$$

Thus, if now x_n = A cos(θn) = ½A[e^{jθn} + e^{−jθn}], then from (1.81) and (1.82), we must have

$$X_k = \tfrac{1}{2}AN\left[\delta_{m-k} + \delta_{N-m-k}\right]. \qquad (1.83)$$

We observe that X_k = 0 for all k ≠ m, N − m, but that

$$X_m = \tfrac{1}{2}AN, \quad \text{and} \quad X_{N-m} = \tfrac{1}{2}AN.$$

Thus, X_k is nonzero only for indices k = m and k = N − m corresponding to the frequency of (x_n), which is θ = (2π/N)m. The DFT/DFS is therefore quite useful in detecting "sinusoids" (also sometimes called "tone detection"). This makes the DFT/DFS useful in such applications as narrowband radar and sonar signal detection.

Can you explain the necessity (or, at least, the desirability) of the second equality in Eq. (1.82)?
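A brief numerical check of Example 1.22 (Python with NumPy assumed; the values of N, m, and A are arbitrary) shows the two spectral lines at k = m and k = N − m predicted by (1.83):

import numpy as np

N, m, A = 32, 5, 2.0
n = np.arange(N)
x = A * np.cos(2 * np.pi * m * n / N)

X = np.fft.fft(x)                 # same as X = F x with F from (1.80)
print(np.round(np.abs(X), 6))     # zeros except at k = m and k = N - m
print(X[m], X[N - m])             # both approximately A*N/2 = 32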



APPENDIX 1.A COMPLEX ARITHMETIC 

Here we summarize the most important facts about arithmetic with complex numbers z ∈ C (set of complex numbers). You shall find this material very useful in electric circuits, as well as in the present book.

Complex numbers may be represented in two ways: (1) Cartesian (rectangular) form or (2) polar form. First we consider the Cartesian form.

In this case z ∈ C has the form z = x + jy, where x, y ∈ R (set of real numbers), and j = √−1. The complex conjugate of z is defined to be z* = x − jy (so j* = −j).

Suppose that z₁ = x₁ + jy₁ and z₂ = x₂ + jy₂ are two complex numbers. Addition and subtraction are defined as

$$z_1 \pm z_2 = (x_1 \pm x_2) + j(y_1 \pm y_2)$$

[e.g., (1 + 2j) + (3 − 5j) = 4 − 3j, and (1 + 2j) − (3 − 5j) = −2 + 7j]. Using j² = −1, the product of z₁ and z₂ is

$$z_1 z_2 = (x_1 + jy_1)(x_2 + jy_2) = x_1 x_2 + j^2 y_1 y_2 + jy_1 x_2 + jx_1 y_2 = (x_1 x_2 - y_1 y_2) + j(x_1 y_2 + x_2 y_1).$$

We note that

$$zz^* = (x + jy)(x - jy) = x^2 + y^2 = |z|^2,$$

so |z| = √(x² + y²) defines the magnitude of z. For example, (1 + 2j)(3 − 5j) = 13 + j. The quotient of z₁ and z₂ is defined to be

$$\frac{z_1}{z_2} = \frac{z_1 z_2^*}{z_2 z_2^*} = \frac{(x_1 + jy_1)(x_2 - jy_2)}{x_2^2 + y_2^2} = \frac{(x_1 x_2 + y_1 y_2) + j(x_2 y_1 - x_1 y_2)}{x_2^2 + y_2^2} = \frac{x_1 x_2 + y_1 y_2}{x_2^2 + y_2^2} + j\,\frac{x_2 y_1 - x_1 y_2}{x_2^2 + y_2^2},$$

where the last equality is z₁/z₂ in Cartesian form.

Now we may consider polar form representations. For z = x + jy, we may regard x and y as the x and y coordinates (respectively) of a point in the Cartesian plane (sometimes denoted R²).^10 We may therefore express these coordinates in polar form; thus, for any x and y we can write

$$x = r\cos\theta, \qquad y = r\sin\theta,$$

where r ≥ 0, and θ ∈ [0, 2π), or θ ∈ (−π, π]. We observe that

$$x^2 + y^2 = r^2(\cos^2\theta + \sin^2\theta) = r^2,$$

so |z| = r.

Now recall the following Maclaurin series expansions (considered in greater depth in Chapter 3):

$$\sin x = \sum_{n=1}^{\infty}(-1)^{n-1}\frac{x^{2n-1}}{(2n-1)!}, \qquad \cos x = \sum_{n=1}^{\infty}(-1)^{n-1}\frac{x^{2n-2}}{(2n-2)!}, \qquad e^x = \sum_{n=1}^{\infty}\frac{x^{n-1}}{(n-1)!}.$$

^10 This suggests that z may be equivalently represented by the column vector [x y]^T. The vector interpretation of complex numbers can be quite useful.




These series converge for −∞ < x < ∞. Observe the following:

$$e^{jx} = \sum_{n=1}^{\infty}\frac{(jx)^{n-1}}{(n-1)!} = \sum_{n=1}^{\infty}\left[\frac{(jx)^{(2n-1)-1}}{[(2n-1)-1]!} + \frac{(jx)^{2n-1}}{[2n-1]!}\right],$$

where we have split the summation into terms involving even n and odd n. Thus, continuing,

$$e^{jx} = \sum_{n=1}^{\infty}\left[\frac{j^{2n-2}x^{2n-2}}{(2n-2)!} + \frac{j^{2n-1}x^{2n-1}}{(2n-1)!}\right] = \sum_{n=1}^{\infty}(-1)^{n-1}\frac{x^{2n-2}}{(2n-2)!} + j\sum_{n=1}^{\infty}(-1)^{n-1}\frac{x^{2n-1}}{(2n-1)!}$$

(since j^{2n−2} = (j²)^{n−1} = (−1)^{n−1})

$$= \cos x + j\sin x.$$

Thus, e^{jx} = cos x + j sin x. This is justification for Euler's identity in (1.49). Additionally, since e^{−jx} = cos x − j sin x, we have

$$e^{jx} + e^{-jx} = 2\cos x, \qquad e^{jx} - e^{-jx} = 2j\sin x.$$

These immediately imply that

$$\sin x = \frac{e^{jx} - e^{-jx}}{2j}, \qquad \cos x = \frac{e^{jx} + e^{-jx}}{2}.$$

These identities allow for the conversion of expressions involving trig(onometric) 
functions into expressions involving exponentials, and vice versa. The necessity to 
do this is frequent. For this reason, they should be memorized, or else you should 
remember how to derive them "on the spot" when necessary. 
Now observe that

$$re^{j\theta} = r\cos\theta + jr\sin\theta,$$

so that if z = x + jy, then, because there exist r and θ such that x = r cos θ and y = r sin θ, we may immediately write

$$z = re^{j\theta}.$$

This is z in polar form. For example (assuming that θ is in radians)

$$1 + j = \sqrt{2}e^{j\pi/4}, \quad -1 + j = \sqrt{2}e^{3\pi j/4}, \quad 1 - j = \sqrt{2}e^{-j\pi/4}, \quad -1 - j = \sqrt{2}e^{-3\pi j/4}.$$

It can sometimes be useful to observe that

$$j = e^{j\pi/2}, \quad -j = e^{-j\pi/2}, \quad \text{and} \quad -1 = e^{\pm j\pi}.$$

If z₁ = r₁e^{jθ₁}, and z₂ = r₂e^{jθ₂}, then

$$z_1 z_2 = r_1 r_2 e^{j(\theta_1 + \theta_2)}, \qquad \frac{z_1}{z_2} = \frac{r_1}{r_2}e^{j(\theta_1 - \theta_2)}.$$

In other words, multiplication and division of complex numbers is very easy when they are expressed in polar form.

Finally, some terminology. For z = x + jy, we call x the real part of z, and we call y the imaginary part of z. The notation is

$$x = \mathrm{Re}\,[z], \qquad y = \mathrm{Im}\,[z].$$

That is, z = Re [z] + j Im [z].
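These conversions are easily checked by computation. The short Python sketch below (using only the standard cmath module; the sample values are arbitrary) converts to polar form and multiplies in polar form:

import cmath

z1 = 1 + 1j
z2 = -1 + 1j

# Cartesian -> polar: r = |z|, theta = arg(z)
r1, th1 = cmath.polar(z1)   # (sqrt(2), pi/4)
r2, th2 = cmath.polar(z2)   # (sqrt(2), 3*pi/4)

# Multiplication in polar form: magnitudes multiply, angles add
prod_polar = cmath.rect(r1 * r2, th1 + th2)
print(prod_polar, z1 * z2)  # both equal -2+0j (up to rounding)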



APPENDIX 1.B ELEMENTARY LOGIC 

Here we summarize the basic language and ideas associated with elementary logic 
as some of what is found here appears in later sections and chapters of this book. 
The concepts found here appear often in mathematics and engineering literature. 

Consider two mathematical statements represented as P and Q. Each statement may be either true or false. Suppose that we know that if P is true, then Q is certainly true (allowing the possibility that Q is true even if P is false). Then we say that P implies Q, or Q is implied by P, or P is a sufficient condition for Q, or symbolically

P ⇒ Q  or  Q ⇐ P.

Suppose that if P is false, then Q is certainly false (allowing the possibility that Q may be false even if P is true). Then we say that P is implied by Q, or Q implies P, or P is a necessary condition for Q, or

P ⇐ Q  or  Q ⇒ P.

Now suppose that if P is true, then Q is certainly true, and if P is false, then Q is certainly false. In other words, P and Q are either both true or both false. Then we say that P implies and is implied by Q, or P is a necessary and sufficient condition for Q, or P and Q are logically equivalent, or P if and only if Q, or symbolically

P ⇔ Q.

A common abbreviation for "if and only if" is iff.

The logical contrary of the statement P is called "not P." It is often denoted by either P̄ or ~P. This is the statement that is true if P is false, or false if P is true. For example, if P is the statement "x > 1," then ~P is the statement "x ≤ 1." If P is the statement "f(x) ≠ 0 for all x ∈ R," then ~P is the statement "there is at least one x ∈ R for which f(x) = 0." We may write

x⁴ − 5x² + 4 = 0 ⇐ x = 1 or x = 2,

but the converse is not true because x⁴ − 5x² + 4 = 0 is a quartic equation possessing four possible solutions. We may write

x = 3 ⇒ x² = 3x,

but we cannot say x² = 3x ⇒ x = 3 because x = 0 is also possible.

Finally, we observe that

P ⇒ Q is equivalent to ~P ⇐ ~Q,
P ⇐ Q is equivalent to ~P ⇒ ~Q,
P ⇔ Q is equivalent to ~P ⇔ ~Q;

that is, taking logical contraries reverses the directions of implication arrows. 



REFERENCES 

1. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New 
York, 1978. 

2. A. P. Hillman and G. L. Alexanderson, A First Undergraduate Course in Abstract Alge- 
bra, 3rd ed., Wadsworth, Belmont, CA, 1983. 

3. R. B. J. T. Allenby, Rings, Fields and Groups: An Introduction to Abstract Algebra, 
Edward Arnold, London, UK, 1983. 

4. R. E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley, Reading, MA, 1985.

5. C. K. Chui, Wavelets: A Mathematical Tool for Signal Analysis. SIAM, Philadelphia, 
PA, 1997. 

6. R. E. Ziemer and W. H. Tranter, Principles of Communications: Systems, Modulation, 
and Noise, 3rd ed., Houghton Mifflin, Boston, MA, 1990. 

7. I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets," Commun. Pure 
Appl. Math. 41, 909-996 (1988). 






8. G. P. Tolstov, Fourier Series (transl. from Russian by R. A. Silverman), Dover Publi- 
cations, New York, 1962. 

9. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979. 

10. M. T. Heideman, D. H. Johnson and C. S. Burrus, "Gauss and the History of the Fast 
Fourier Transform," IEEE ASSP Mag. 1, 14-21 (Oct. 1984). 

11. J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex 
Fourier Series," Math. Comput., 19, 297-301 (April 1965). 



PROBLEMS 
1.1. (a) Find a, b ∈ R in

1/(1 + 2j) = a + bj.

(b) Find r, θ ∈ R in

j = re^{jθ}.

(Of course, choose r ≥ 0, and θ ∈ (−π, π].)
1.2. Solve for x ∈ C in the quadratic equation

x² − 2r cos θ · x + r² = 0.

Here r > 0, and θ ∈ (−π, π]. Express your solution in polar form.

1.3. Let θ and φ be arbitrary angles (so θ, φ ∈ R). Show that

(cos θ + j sin θ)(cos φ + j sin φ) = cos(θ + φ) + j sin(θ + φ).

1.4. Prove the following theorem. Suppose z ∈ C such that

z = r cos θ + jr sin θ

for which r = |z| > 0, and θ ∈ (−π, π]. Let n ∈ {1, 2, 3, ...} (i.e., n is a positive integer). The n different nth roots of z are given by

$$r^{1/n}\left[\cos\left(\frac{\theta + 2\pi k}{n}\right) + j\sin\left(\frac{\theta + 2\pi k}{n}\right)\right]$$

for k = 0, 1, 2, ..., n − 1.
1.5. State whether the following are true or false:

(a) |x| < 2 ⇒ x < 2
(b) |x| < 3 ⇐ 0 < x < 3
(c) x − y > 0 ⇒ x > y > 0
(d) xy = 0 ⇒ x = 0 and y = 0
(e) x = 10 ⇐ x² = 10x

Explain your answer in all cases.
1.6. Consider the function

$$f(x) = \begin{cases} -x^2 + 2x + 1, & 0 \le x < 1 \\ x^2 - 2x + \tfrac{3}{2}, & 1 \le x \le 2. \end{cases}$$

Find

$$\sup_{x\in[0,2]} f(x), \qquad \inf_{x\in[0,2]} f(x).$$



1.7. Suppose that we have the following polynomials in the indeterminate x:

$$a(x) = \sum_{k=0}^{n} a_k x^k, \qquad b(x) = \sum_{j=0}^{m} b_j x^j.$$

Prove that

$$c(x) = a(x)b(x) = \sum_{l=0}^{n+m} c_l x^l,$$

where

$$c_l = \sum_{k=0}^{l} a_k b_{l-k}.$$

[Comment: This is really asking us to prove that discrete convolution is mathematically equivalent to polynomial multiplication. It explains why the MATLAB routine for multiplying polynomials is called conv. Discrete convolution is a fundamental operation in digital signal processing, and is an instance of something called finite impulse response (FIR) filtering. You will find it useful to note that a_k = 0 for k < 0, and k > n, and that b_j = 0 for j < 0, and j > m. Knowing this allows you to manipulate the summation limits to achieve the desired result.]

1.8. Recall Example 1.5. Suppose that x_k = 2^{k+1}, and that y_k = 1 for k ∈ Z⁺. Find the sum of the series d(x, y). (Hint: Recall the theory of geometric series. For example, Σ_{k=0}^{n} a^k = (1 − a^{n+1})/(1 − a) if a ≠ 1.)

1.9. Prove that if x ≠ 1, then

$$S_n = \sum_{k=1}^{n} k x^{k-1}$$

is given by

$$S_n = \frac{1 - (n+1)x^n + nx^{n+1}}{(1 - x)^2}.$$

What is the formula for S_n when x = 1? (Hint: Begin by showing that

S_n − xS_n = 1 + x + x² + ··· + x^{n−1} − nx^n.)

1.10. Recall Example 1.1. Prove that d(x, y) in (1.5) satisfies all the axioms for a 
metric. 

1.11. Recall Example 1.18. Prove that (x, y) in (1.36) satisfies all the axioms for 
an inner product. 

1.12. By direct calculation, show that if x, y, z are elements from an inner product space, then

$$\|z - x\|^2 + \|z - y\|^2 = \tfrac{1}{2}\|x - y\|^2 + 2\left\|z - \tfrac{1}{2}(x + y)\right\|^2$$

(Apollonius' identity).

1.13. Suppose x, y ∈ R³ (three-dimensional Euclidean space) such that

x = [1 1 1]^T,  y = [1 −1 1]^T.

Find all vectors z ∈ R³ such that ⟨x, z⟩ = ⟨y, z⟩ = 0.

1.14. The complex Fourier series expansion method as described is for f ∈ L²(0, 2π). Find the complex Fourier series expansion for f ∈ L²(0, T), where 0 < T < ∞ (i.e., the interval on which f is defined is now of arbitrary length).

1.15. Consider again the complex Fourier series expansion for f ∈ L²(0, 2π). Specifically, consider Eq. (1.44). If f(t) ∈ R for all t ∈ (0, 2π), then show that f_n = f_{−n}^*. [The sequence (f_n) is conjugate symmetric.] Use this to show that for suitable a_n, b_n ∈ R (all n) we have

$$\sum_{n=-\infty}^{\infty} f_n e^{jnt} = a_0 + \sum_{n=1}^{\infty}\left[a_n \cos(nt) + b_n \sin(nt)\right].$$

How are the coefficients a_n and b_n related to f_n? (Be very specific. There is a simple formula.)

1.16. (a) Suppose that f ∈ L²(0, 2π), and that specifically

$$f(t) = \begin{cases} 1, & 0 < t < \pi \\ \tfrac{1}{2}, & \pi < t < 2\pi. \end{cases}$$

Find f_n in Eq. (1.44) using (1.45); that is, find the complex Fourier series expansion for f(t). Make sure that you appropriately simplify your series expansion.

(b) Show how to use the result in Example 1.20 to find the complex Fourier series expansion for f(t) in (a).

1.17. This problem is about finding the Fourier series expansion for the waveform at the output of a full-wave rectifier circuit. This circuit is used in AC/DC (alternating/direct-current) converters. Knowledge of the Fourier series expansion gives information to aid in the design of such converters.

(a) Find the complex Fourier series expansion of

$$f(t) = \left|\sin\left(\frac{2\pi}{T_1}t\right)\right|.$$

(b) Find the sequences (a_n) and (b_n) in

$$f(t) = a_0 + \sum_{n=1}^{\infty}\left[a_n \cos\left(\frac{2\pi n}{T}t\right) + b_n \sin\left(\frac{2\pi n}{T}t\right)\right]$$

for f(t) in (a). You need to consider how T is related to T₁.

1.18. Recall the definitions of the Haar scaling function and Haar wavelet in Eqs. (1.40) and (1.41), respectively. Define φ_{k,n}(t) = 2^{k/2} φ(2^k t − n), and ψ_{k,n}(t) = 2^{k/2} ψ(2^k t − n). Recall that ⟨f(t), g(t)⟩ = ∫_{−∞}^{∞} f(t)g*(t) dt is the inner product for L²(R).

(a) Sketch φ_{k,n}(t), and ψ_{k,n}(t).

(b) Evaluate the integrals

$$\int_{-\infty}^{\infty} \phi_{k,n}^2(t)\,dt, \quad \text{and} \quad \int_{-\infty}^{\infty} \psi_{k,n}^2(t)\,dt.$$

(c) Prove that

$$\langle \phi_{k,n}(t), \phi_{k,m}(t)\rangle = \delta_{n-m}.$$
1.19. Prove the following version of the Schwarz inequality. For all x, y ∈ X (inner product space)

$$|\mathrm{Re}[\langle x, y\rangle]| \le \|x\|\,\|y\|$$

with equality iff y = βx, where β ∈ R is a constant.

[Hint: The proof of this one is not quite like that of Theorem 1.1. Consider ⟨αx + y, αx + y⟩ ≥ 0 with α ∈ R. The inner product is to be viewed as a quadratic in α.]

1.20. The following result is associated with the proof of the uncertainty principle for analog signals.

Prove that for f(t) ∈ L²(R) such that |t|f(t) ∈ L²(R) and f^{(1)}(t) = df(t)/dt ∈ L²(R), we have the inequality

$$\left[\mathrm{Re}\int_{-\infty}^{\infty} t f(t)\left[f^{(1)}(t)\right]^*\,dt\right]^2 \le \left[\int_{-\infty}^{\infty} |t f(t)|^2\,dt\right]\left[\int_{-\infty}^{\infty} |f^{(1)}(t)|^2\,dt\right].$$





1.21. Suppose e_k = [e_{k,0} e_{k,1} ··· e_{k,N−2} e_{k,N−1}]^T ∈ C^N, where

$$e_{k,n} = \exp\left(j\frac{2\pi}{N}kn\right)$$

and k ∈ Z_N. If x, y ∈ C^N, recall that ⟨x, y⟩ = Σ_{k=0}^{N−1} x_k y_k^*. Prove that ⟨e_k, e_r⟩ = Nδ_{k−r}. Thus, B = {e_k | k ∈ Z_N} is an orthogonal basis for C^N. Set B is important in digital signal processing because it is used to define the discrete Fourier transform.






2 



Number Representations 



2.1 INTRODUCTION 

In this chapter we consider how numbers are represented on a computer largely with 
respect to the errors that occur when basic arithmetical operations are performed 
on them. We are most interested here in so-called rounding errors (also called 
roundoff errors). Floating-point computation is emphasized. This is due to the 
fact that most numerical computation is performed with floating-point numbers, 
especially when numerical methods are implemented in high-level programming 
languages such as C, Pascal, FORTRAN, and C++. However, an understanding 
of floating-point requires some understanding of fixed-point schemes first, and so 
this case will be considered initially. In addition, fixed-point schemes are used to 
represent integer data (i.e., subsets of Z), and so the fixed-point representation is 
important in its own right. For example, the exponent in a floating-point number 
is an integer. 

The reader is assumed to be familiar with how integers are represented, and how they are manipulated with digital hardware, from a typical introductory digital electronics book or course. However, if this is not so, then some review of this topic appears in Appendix 2.A. The reader should study this material now if necessary.

Our main (historical) reference text for the material of this chapter is Wilkin- 
son [1]. However, Golub and Van Loan [4, Section 2.4] is also a good refer- 
ence. Golub and Van Loan [4] base their conventions and results in turn on 
Forsythe et al. [5]. 



2.2 FIXED-POINT REPRESENTATIONS 

We now consider fixed-point fractions. We must do so because the mantissa in a 
floating-point number is a fixed-point fraction. 

We assume that fractions are t + 1 digits long. If the number is in binary, then we usually say "t + 1 bits" long instead. Suppose, then, that x is a (t + 1)-bit fraction. We shall write it in the form

$$(x)_2 = x_0.x_1 x_2 \cdots x_{t-1} x_t \quad (x_k \in \{0, 1\}). \qquad (2.1)$$







The notation (x)₂ means that x is in base-2 (binary) form. More generally, (x)_r means that x is expressed as a base-r number (e.g., if r = 10 this would be the decimal representation). We use this notation to emphasize which base we are working with when necessary (e.g., to avoid ambiguity). We shall assume that (2.1) is a two's complement fraction. Thus, bit x₀ is the sign bit. If this bit is 1, we interpret the fraction to be negative; otherwise, it is nonnegative. For example, (1.1011)₂ = (−0.3125)₁₀. [To take the two's complement of (1.1011)₂, first complement every bit, and then add (0.0001)₂. This gives (0.0101)₂ = (0.3125)₁₀.] In general, for the case of a (t + 1)-bit two's complement fraction, we obtain

$$-1 \le x \le 1 - 2^{-t}. \qquad (2.2)$$

In fact,

$$(-1)_{10} = (1.\underbrace{00\cdots 0}_{t\ \text{bits}})_2, \qquad (1 - 2^{-t})_{10} = (0.\underbrace{11\cdots 1}_{t\ \text{bits}})_2. \qquad (2.3)$$

We may regard (2.2) as specifying the dynamic range of the (t + 1)-bit two's complement fraction representation scheme. Numbers beyond this range are not represented. Justification of (2.2) [and (2.3)] would follow the argument for the conversion of two's complement integers into decimal integers that is considered in Appendix 2.A.

Consider the set {x ∈ R | −1 ≤ x ≤ 1 − 2^{−t}}. In other words, x is a real number within the limits imposed by (2.2), but it is not necessarily equal to a (t + 1)-bit fraction. For example, x = √2 − 1 is in the range (2.2), but it is an irrational number, and so does not possess an exact (t + 1)-bit representation. We may choose to approximate such a number with t + 1 bits. Denote the (t + 1)-bit approximation of x as Q[x]. For example, Q[x] might be the approximation to x obtained by selecting an element from set

$$B = \{b_n = -1 + 2^{-t}n \mid n = 0, 1, \ldots, 2^{t+1} - 1\} \subset \mathbf{R} \qquad (2.4)$$

that is the closest to x, where distance is measured by the metric in Example 1.1. Note that each number in B is representable as a (t + 1)-bit fraction. In fact, B is the entire set of (t + 1)-bit two's complement fractions. Formally, our approximation is given by

$$Q[x] = \arg\min_{n\in\{0,1,\ldots,2^{t+1}-1\}} |x - b_n|. \qquad (2.5)$$

The notation "argmin" means "let Q[x] be the b_n for the n in the set {0, 1, ..., 2^{t+1} − 1} that minimizes |x − b_n|." In other words, we choose the argument b_n that minimizes the distance to x. Some reflection (and perhaps considering some simple examples for small t) will lead the reader to conclude that the error in this approximation satisfies

$$|x - Q[x]| \le 2^{-(t+1)}. \qquad (2.6)$$
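The quantization rule (2.5) is straightforward to simulate. The following Python sketch (NumPy assumed; the choice t = 4 is arbitrary) picks the nearest member of B and confirms the bound (2.6):

import numpy as np

t = 4
B = -1.0 + 2.0**(-t) * np.arange(2**(t + 1))   # the set B of (2.4)

def Q(x):
    # Nearest (t+1)-bit two's complement fraction to x, as in (2.5)
    return B[np.argmin(np.abs(x - B))]

x = np.sqrt(2) - 1        # an irrational number inside the range (2.2)
err = abs(x - Q(x))
print(Q(x), err, err <= 2.0**(-(t + 1)))   # the error satisfies (2.6)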




The error e = x − Q[x] is called quantization error. Equation (2.6) is an upper bound on the size (norm) of this error. In fact, in the notation of Chapter 1, if ||x|| = |x|, then ||e|| = ||x − Q[x]|| ≤ 2^{−(t+1)}. We remark that our quantization method is not unique. There are many other methods, and these will generally lead to different bounds.

When we represent the numbers in a computational problem on a computer, 
we see that errors due to quantization can arise even before we perform any oper- 
ations on the numbers at all. However, errors will also arise in the course of 
performing basic arithmetic operations on the numbers. We consider the sources 
of these now. 

If x, y are coded as in (2.1), then their sum might not be in the range specified 
by (2.2). This can happen only if x and y are either both positive or both negative. 
Such a condition is fixed-point overflow. (A test for overflow in two's complement 
integer addition appears in Appendix 2.A, and it is easy to modify it for the problem 
of overflow testing in the addition of fractions.) Similarly, overflow can occur when
a negative number is subtracted from a positive number, or if a positive number 
is subtracted from a negative number. A test for this case is possible, too, but we 
omit the details. Other than the problem of overflow, no errors can occur in the 
addition or subtraction of fractions. 

With respect to fractions, rounding error arises only when we perform multipli- 
cation and division. We now consider errors in these operations. 

We will deal with multiplication first. Suppose that x and y are represented according to (2.1). Suppose also that x₀ = y₀ = 0. It is easy to see that the product of x and y is given by

$$p = xy = \left(\sum_{k=0}^{t} x_k 2^{-k}\right)\left(\sum_{n=0}^{t} y_n 2^{-n}\right) = (x_0 + x_1 2^{-1} + \cdots + x_t 2^{-t})(y_0 + y_1 2^{-1} + \cdots + y_t 2^{-t})$$
$$= x_0 y_0 + (x_0 y_1 + x_1 y_0)2^{-1} + \cdots + x_t y_t 2^{-2t}. \qquad (2.7)$$

This implies that the product is a (2t + 1)-bit number. If we allow x and y to be either positive or negative, then the product will also be 2t + 1 bits long. Of course, one of these bits is the sign bit. If we had to multiply several numbers together, we see that the product wordsize would grow in some proportion to the number of factors in the product. The growth is clearly very rapid, and no practical computer could sustain this for very long. We are therefore forced in general to round off the product p back down to a number that is only t + 1 bits long. Obviously, this will introduce an error.

How should the rounding be done? There is more than one possibility (just as there is more than one way to quantize). Wilkinson [1, p. 4] suggests the following. Since the product p has the form

$$(p)_2 = p_0.p_1 p_2 \cdots p_{t-1}p_t p_{t+1}\cdots p_{2t} \quad (p_k \in \{0, 1\}), \qquad (2.8)$$

we may add 2^{−(t+1)} to this product, and then simply discard the last t bits of the resulting sum (i.e., the bits indexed t + 1 to 2t). For example, suppose t = 4, and consider

  0.00111111 = p
+ 0.00001000 = 2^{−5}
  0.01000111

Thus, the rounded product is (0.0100)₂. The error involved in rounding in this manner is not higher in magnitude than ½2^{−t} = 2^{−(t+1)}. Define the result of the rounding operation to be fx[p] = fx[xy], so then

$$|p - fx[p]| \le \tfrac{1}{2}2^{-t}. \qquad (2.9)$$

[For the previous example, p = (0.00111111)₂, and so fx[p] = (0.0100)₂.] It is natural to measure the sizes of errors in the same way as we measured the size of quantization errors earlier. Thus, (2.9) is an upper bound on the size of the error due to rounding a product. As with quantization, other rounding methods would generally give other bounds. We remark that Wilkinson's suggestion amounts to "ordinary rounding."

Finally, we consider fixed-point division. Again, suppose that x and y are represented as in (2.1), and consider the quotient q = x/y. Obviously, we must avoid y = 0. Also, the quotient will not be in the permitted range given by (2.2) unless |y| ≥ |x|. This implies that when fixed-point division is implemented, either the dividend x or the divisor y needs to be scaled to meet this restriction. Scaling is multiplication by a power of 2, and so should be implemented to reduce rounding error. We do not consider the specifics of how to achieve this. Another problem is that x/y may require an infinite number of bits to represent it. For example, suppose

$$q = \frac{(0.0010)_2}{(0.0110)_2} = \frac{(0.125)_{10}}{(0.375)_{10}} = \left(\frac{1}{3}\right)_{10} = (0.\overline{01})_2.$$

The bar over 01 denotes the fact that this pattern repeats indefinitely. Fortunately, the same recipe for the rounding of products considered above may also be used to round quotients. If fx[q] again denotes the result of applying this procedure to q, then

$$|q - fx[q]| \le \tfrac{1}{2}2^{-t}. \qquad (2.10)$$

We see that the difficulties associated with division in fixed-point representations mean that fixed-point arithmetic should, if possible, not be used to implement algorithms that require division. This forces us either to (1) employ floating-point representations or (2) develop algorithms that solve the problem without the need for division operations.

Both strategies are employed in practice. Usually choice 1 is easier.




2.3 FLOATING-POINT REPRESENTATIONS 

In the previous section we have seen that fixed-point numbers are of very limited 
dynamic range. This poses a major problem in employing them in engineering 
computations since obviously we desire to work with numbers far beyond the 
range in (2.2). Floating-point representations provide the definitive solution to this 
problem. We remark (in passing) that the basic organization of a floating-point 
arithmetic unit [i.e., digital hardware for floating-point addition and subtraction 
appears in Ref. 2 (see pp. 295-306)]. There is a standard IEEE format for floating- 
point numbers. We do not consider this standard here, but it is summarized in 
Ref. 2 (see pp. 304-306). Some of the technical subtleties associated with the 
IEEE standard are considered by Higham [6]. 

Following Golub and Van Loan [4, p. 61], the set F (subset of R) of floating-point numbers consists of numbers of the form

$$x = x_0.x_1 x_2 \cdots x_{t-1}x_t \times r^e, \qquad (2.11)$$

where x₀ is a sign bit (which means that we can replace x₀ by ±; this is done in Ref. 4), and r is the base of the representation [typically r = 2 (binary), or r = 10 (decimal); we will emphasize r = 2]. Therefore, x_k ∈ {0, 1, ..., r − 2, r − 1} for 1 ≤ k ≤ t. These are the digits (bits if r = 2) of the mantissa. We therefore see that the mantissa is a fraction.^1 It is important to note that x₁ ≠ 0, and this has implications with regard to how operations are performed and the resulting rounding errors. We call e the exponent. This is an integer quantity such that L ≤ e ≤ U. For example, we might represent e as an n-bit two's complement integer. We will assume this unless otherwise specified in what follows. This would imply that (e)₂ = e_{n−1}e_{n−2}···e₁e₀, and so

$$-2^{n-1} \le e \le 2^{n-1} - 1 \qquad (2.12)$$

(see Appendix 2.A for justification). For nonzero x ∈ F, then

$$m \le |x| \le M, \qquad (2.13a)$$

where

$$m = r^{L-1}, \qquad M = r^U(1 - r^{-t}). \qquad (2.13b)$$

Equation (2.13) gives the dynamic range for the floating-point representation. With r = 2 we see that the total wordsize for the floating-point number is t + n + 1 bits. In the absence of rounding errors in a computation, our numbers may initially be from the set

$$G = \{x \in \mathbf{R} \mid m \le |x| \le M\} \cup \{0\}. \qquad (2.14)$$

^1 Including the sign bit, the mantissa is (for r = 2) t + 1 bits long. Frequently in what follows we shall refer to it as being only t bits long. This is because we are ignoring the sign bit, which is always understood to be present.




This set is analogous to the set {x ∈ R | −1 ≤ x ≤ 1 − 2^{−t}} that we saw in the previous section in our study of fixed-point quantization effects. Again following Golub and Van Loan [4], we may define a mapping (operator) fl: G → F. Here c = fl[x] (x ∈ G) is obtained by choosing the closest c ∈ F to x. As you might expect, distance is measured using || · || = | · |, as we did in the previous section. Golub and Van Loan call this rounded arithmetic [4], and it coincides with the rounding procedure described by Wilkinson [1, pp. 7-11].

Suppose that x and y are two floating-point numbers (i.e., elements of F) and that "op" denotes any of the four basic arithmetic operations (addition, subtraction, multiplication, or division). Suppose x op y ∉ G. This implies that either |x op y| > M (floating-point overflow), or 0 < |x op y| < m (floating-point underflow) has occurred. Under normal circumstances an arithmetic fault such as overflow will not happen unless an unstable procedure is being performed. The issue of "numerical stability" will be considered later. Overflows typically cause runtime error messages to appear. The underflow arithmetic fault occurs when a number arises that is not zero, but is too small to represent in the set F. This usually poses less of a problem than overflow.^2 However, as noted before, we are concerned mainly with rounding errors here. If x op y ∈ G, then we assume that the computer implementation of x op y will be given by fl[x op y]. In other words, the operator fl models rounding effects in floating-point arithmetic operations. We remark that where floating-point arithmetic is concerned, rounding error arises in all four arithmetic operations. This contrasts with fixed-point arithmetic, wherein rounding errors arise only in multiplication and division.

It turns out that for the floating-point rounding procedure suggested above

$$fl[x \text{ op } y] = (x \text{ op } y)(1 + \epsilon), \qquad (2.15)$$

where

$$|\epsilon| \le \tfrac{1}{2}r^{1-t} \quad (= 2^{-t} \text{ if } r = 2). \qquad (2.16)$$

We shall justify this only for the case r = 2. Our arguments will follow those of Wilkinson [1, pp. 7-11].

Let us now consider the addition of the base-2 floating-point numbers

$$x = x_0.x_1 \cdots x_t \times 2^{e_x} \qquad (2.17a)$$

and

$$y = y_0.y_1 \cdots y_t \times 2^{e_y}, \qquad (2.17b)$$

and we assume that |x| ≥ |y|. (If instead |y| > |x|, then reverse the roles of x and y.) If e_x − e_y > t, then

$$fl[x + y] = x. \qquad (2.18)$$

^2 Underflows are simply set to zero on some machines.




For example, if t = 4, and x = 0.1001 × 2^4, and y = 0.1110 × 2^{−1}, then to add these numbers, we must shift the bits in the mantissa of one of them so that both have the same exponent. If we choose y (usually shifting is performed on the smallest number), then y = 0.00000111 × 2^4. Therefore, x + y = 0.10010111 × 2^4, but then fl[x + y] = 0.1001 × 2^4 = x.

Now if instead we have e_x − e_y ≤ t, we divide y by 2^{e_x − e_y} by shifting its mantissa e_x − e_y positions to the right. The sum x + 2^{e_y − e_x}y is then calculated exactly, and requires ≤ 2t bits for its representation. The sum is multiplied by a power of 2, using left or right shifts to ensure that the mantissa is properly normalized [recall that for x in (2.11) we must have x₁ ≠ 0]. Of course, the exponent must be modified to account for the shift of the bits in the mantissa. The 2t-bit mantissa is then rounded off to t bits using fl. Because we have |m_x| + 2^{e_y − e_x}|m_y| < 1 + 1 = 2, the largest possible right shift is by one bit position. However, a left shift of up to t bit positions might be needed because of the cancellation of bits in the summation process. Let us consider a few examples. We will assume that t = 4.

Example 2.1  Let x = 0.1001 × 2^4, and y = 0.1010 × 2^1. Thus

  0.10010000 × 2^4
+ 0.00010100 × 2^4
  0.10100100 × 2^4

and the sum is rounded to 0.1010 × 2^4 (computed sum).

Example 2.2  Let x = 0.1111 × 2^4, and y = 0.1010 × 2^2. Thus

  0.11110000 × 2^4
+ 0.00101000 × 2^4
  1.00011000 × 2^4

but 1.00011000 × 2^4 = 0.100011000 × 2^5, and this exact sum is rounded to 0.1001 × 2^5 (computed sum).

Example 2.3  Let x = 0.1111 × 2^{−4}, and y = −0.1110 × 2^{−4}. Thus

  0.11110000 × 2^{−4}
− 0.11100000 × 2^{−4}
  0.00010000 × 2^{−4}

but 0.00010000 × 2^{−4} = 0.1000 × 2^{−7}, and this exact sum is rounded to 0.1000 × 2^{−7} (computed sum). Here there is much cancellation of the bits, leading in turn to a large shift of the mantissa of the exact sum to the left. Yet, the computed sum is exact.




We observe that the computed sum is obtained by computing the exact sum, normalizing it so that the mantissa s_0.s_1 ··· s_{t−1}s_t s_{t+1} ··· s_{2t} satisfies s_1 = 1 (i.e., s_1 ≠ 0), and then rounding it to t places (i.e., we apply fl). If the normalized exact sum is s = m_s × 2^{e_s} (= x + y), then the rounding error e′ is such that |e′| ≤ ½2^{−t}2^{e_s}. Essentially, the error e′ is due to rounding the mantissa (a fixed-point number) according to the method used in Section 2.2. Because of the form of m_s, ½2^{e_s} ≤ |s| < 2^{e_s}, and so

$$fl[x + y] = (x + y)(1 + \epsilon), \qquad (2.19)$$

which is just a special case of (2.15). This expression requires further explanation, however. Observe that

$$\frac{|s - fl[s]|}{|s|} = \frac{|s - (s + e')|}{|s|} = \frac{|e'|}{|s|} \le \frac{\tfrac{1}{2}2^{-t}2^{e_s}}{|s|},$$

which is the relative error^3 due to rounding. Because we have ½2^{e_s} ≤ |s| < 2^{e_s}, this error is biggest when |s| = ½2^{e_s}, so therefore we conclude that

$$\frac{|s - fl[s]|}{|s|} \le 2^{-t}. \qquad (2.20)$$

From (2.19), fl[s] = s + sε, so that |s − fl[s]| = |s||ε|, or |ε| = |s − fl[s]|/|s|. Thus, |ε| ≤ 2^{−t}, which is (2.16). In other words, |e′| is the absolute error, and |ε| is the relative error.

Finally, if x = 0 or y = 0, then no rounding error occurs: ε = 0. Subtraction results do not differ from addition.

Now consider computing the product of x and y in (2.17). Since x = m_x × 2^{e_x}, and y = m_y × 2^{e_y} with x₁ ≠ 0 and y₁ ≠ 0, we must have

$$\tfrac{1}{4} \le |m_x m_y| < 1. \qquad (2.21)$$

This implies that it may be necessary to normalize the mantissa of the product with a shift to the left, and an appropriate adjustment of the exponent as well. The 2t-bit mantissa of the product is rounded to give a t-bit mantissa. If x = 0, or y = 0 (or both x and y are zero), then the product is zero.

^3 In general, if a is the exact value of some quantity and â is some approximation to a, the absolute error is ||a − â||, while the relative error is

$$\frac{\|a - \hat{a}\|}{\|a\|} \quad (a \ne 0).$$

The relative error is usually more meaningful in practice. This is because an error is really "big" or "small" only in relation to the size of the quantity being approximated.




We may consider a few examples. We will suppose t = 4. Begin with x = 0.1010 × 2^2, and y = 0.1111 × 2^1, so then

xy = 0.10010110 × 2^3,

and so fl[xy] = 0.1001 × 2^3 (computed product). If now x = 0.1000 × 2^4, y = 0.1000 × 2^{−1}, then, before normalizing the mantissa, we have

xy = 0.01000000 × 2^3,

and after normalization we have

xy = 0.10000000 × 2^2,

so that fl[xy] = 0.1000 × 2^2 (computed product). Finally, suppose that x = 0.1010 × 2^0, and y = 0.1010 × 2^0, so then the unnormalized product is

xy = 0.01100100 × 2^0,

for which the normalized product is

xy = 0.11001000 × 2^{−1},

so finally fl[xy] = 0.1101 × 2^{−1} (computed product).

The application of fl to the normalized product will have exactly the same effect as it did in the case of addition (or of subtraction). This may be understood by recognizing that a 2t-bit mantissa will "look the same" to operator fl regardless of how that mantissa was obtained. It therefore immediately follows that

$$fl[xy] = (xy)(1 + \epsilon), \qquad (2.22)$$

which is another special case of (2.15), and |ε| ≤ 2^{−t}, which is (2.16) again.
Now consider the quotient x/y, for x and y ≠ 0 in (2.17):

$$q = \frac{x}{y} = \frac{m_x \times 2^{e_x}}{m_y \times 2^{e_y}} = \frac{m_x}{m_y} \times 2^{e_x - e_y} = m_q \times 2^{e_q} \qquad (2.23)$$

(so m_q = m_x/m_y, and e_q = e_x − e_y). The arithmetic unit in the machine has an accumulator that we assume contains m_x and which is "double length" in that it is 2t bits long. Specifically, this accumulator initially stores x_0.x_1 ··· x_t 0 ··· 0 (with t zero bits appended). If |m_x| ≥ |m_y|, the number in the accumulator is shifted one place to the right, and so e_q is increased by one (i.e., incremented). The number in the accumulator is then divided by m_y in such a manner as to give a correctly rounded t-bit result. This implies that the computed mantissa of the quotient, say, m_q = q_0.q_1 ··· q_t, satisfies the normalization condition q_1 = 1, so that ½ ≤ |m_q| < 1. Once again we must have

$$fl\left[\frac{x}{y}\right] = \frac{x}{y}(1 + \epsilon) \qquad (2.24)$$

such that |ε| ≤ 2^{−t}. Therefore, (2.15) and (2.16) are now justified for all instances of op.

We complete this section with a few examples. Suppose x = 0.1010 × 2^2, and y = 0.1100 × 2^{−2}. Then

$$q = \frac{x}{y} = \frac{0.1010 \times 2^2}{0.1100 \times 2^{-2}} = \frac{0.10100000}{0.1100} \times 2^4 = 0.11010101 \times 2^4,$$

so that fl[q] = 0.1101 × 2^4 (computed quotient). Now suppose that x = 0.1110 × 2^3, and y = 0.1001 × 2^{−2}, and so

$$q = \frac{x}{y} = \frac{0.1110 \times 2^3}{0.1001 \times 2^{-2}} = \frac{0.01110000}{0.1001} \times 2^6 = 0.11000111 \times 2^6,$$

so that fl[q] = 0.1100 × 2^6 (computed quotient).

Thus far we have emphasized ordinary rounding, but an alternative implementation of fl is to use chopping. If x = ±(Σ_{k=1}^∞ x_k 2^{−k}) × 2^e, then, for the chopping operator fl, we have fl[x] = ±(Σ_{k=1}^t x_k 2^{−k}) × 2^e (chopping x to t + 1 bits including the sign bit). Thus, the absolute error is

$$|e'| = |x - fl[x]| = \left(\sum_{k=t+1}^{\infty} x_k 2^{-k}\right)2^e \le 2^e\sum_{k=t+1}^{\infty} 2^{-k}$$

(as x_k ≤ 1 for all k > t), but since Σ_{k=t+1}^∞ 2^{−k} = 2^{−t}, we must have

$$|e'| = |x - fl[x]| \le 2^{-t}2^e,$$

and so the relative error for chopping is

$$\frac{|x - fl[x]|}{|x|} \le \frac{2^{-t}2^e}{\tfrac{1}{2}2^e} = 2^{-t+1}$$

(because we recall that |x| ≥ ½2^e). We see that the error in chopping is somewhat bigger than the error in rounding, but chopping is somewhat easier to implement.
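The difference between rounding and chopping can also be seen experimentally. The following Python sketch (NumPy assumed; the value of t and the test samples are arbitrary) reduces positive numbers to a t-bit mantissa both ways and compares the worst observed relative errors with the bounds 2^{−t} and 2^{−t+1}:

import numpy as np

t = 8

def to_t_bits(x, mode):
    # Keep a t-bit mantissa of x > 0, by rounding or chopping
    e = np.floor(np.log2(x)) + 1           # exponent so that 0.5 <= mantissa < 1
    m = x / 2.0**e                          # normalized mantissa
    scaled = m * 2**t
    m_t = (np.round(scaled) if mode == 'round' else np.floor(scaled)) / 2**t
    return m_t * 2.0**e

rng = np.random.default_rng(0)
xs = rng.uniform(0.1, 10.0, 100000)
for mode, bound in [('round', 2.0**-t), ('chop', 2.0**(-t + 1))]:
    rel = np.abs(xs - to_t_bits(xs, mode)) / xs
    print(mode, rel.max(), bound)   # worst-case relative error stays below the bound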




2.4 ROUNDING EFFECTS IN DOT PRODUCT COMPUTATION 

Suppose x, y ∈ R^n. We recall from Chapter 1 (and from elementary linear algebra)
that the vector dot product is given by

(x, y) = x^T y = y^T x = Σ_{k=0}^{n-1} x_k y_k.     (2.25)

This operation occurs in matrix-vector product computation (e.g., y = Ax, where
A ∈ R^(n×n)), digital filter implementation (i.e., computing discrete-time convolu-
tion), numerical integration, and other applications. In other words, it is so common
that it is important to understand how rounding errors can affect the accuracy of a
computed dot product.

We may regard dot product computation as a recursive process. Thus

s_{n-1} = Σ_{k=0}^{n-1} x_k y_k = Σ_{k=0}^{n-2} x_k y_k + x_{n-1} y_{n-1} = s_{n-2} + x_{n-1} y_{n-1}.

So

s_k = s_{k-1} + x_k y_k     (2.26)

for k = 0, 1, ..., n - 1, and s_{-1} = 0. Each arithmetic operation in (2.26) is a
separate floating-point operation and so introduces its own error into the over-
all calculation. We would like to obtain a general expression for this error. To
begin, we may model the computation process according to

ŝ_0 = fl[x_0 y_0]
ŝ_1 = fl[ŝ_0 + fl[x_1 y_1]]
ŝ_2 = fl[ŝ_1 + fl[x_2 y_2]]
...
ŝ_{n-2} = fl[ŝ_{n-3} + fl[x_{n-2} y_{n-2}]]
ŝ_{n-1} = fl[ŝ_{n-2} + fl[x_{n-1} y_{n-1}]].     (2.27)

From (2.15) we may write

ŝ_0 = (x_0 y_0)(1 + δ_0)
ŝ_1 = [ŝ_0 + (x_1 y_1)(1 + δ_1)](1 + ε_1)
ŝ_2 = [ŝ_1 + (x_2 y_2)(1 + δ_2)](1 + ε_2)




...
ŝ_{n-2} = [ŝ_{n-3} + (x_{n-2} y_{n-2})(1 + δ_{n-2})](1 + ε_{n-2})
ŝ_{n-1} = [ŝ_{n-2} + (x_{n-1} y_{n-1})(1 + δ_{n-1})](1 + ε_{n-1}),     (2.28)

where |δ_k| ≤ 2^(-t) (for k = 0, 1, ..., n - 1), and |ε_k| ≤ 2^(-t) (for k = 1, 2, ...,
n - 1), via (2.16). It is possible to write⁴

ŝ_{n-1} = Σ_{k=0}^{n-1} x_k y_k (1 + γ_k) = s_{n-1} + Σ_{k=0}^{n-1} x_k y_k γ_k,     (2.29)

where

1 + γ_k = (1 + δ_k) Π_{j=k}^{n-1} (1 + ε_j)     (ε_0 = 0).     (2.30)

Note that the Π notation means, for example,

Π_{k=0}^{n} x_k = x_0 x_1 x_2 ··· x_{n-1} x_n,     (2.31)

where Π is the symbol to compute the product of all x_k for k = 0, 1, ..., n. The
similarity to how we interpret Σ notation should therefore be clear.

The absolute value operator is a norm on R, so from the axioms for a norm
(recall Definition 1.3), we must have

|ŝ_{n-1} - s_{n-1}| = |x^T y - fl[x^T y]| ≤ Σ_{k=0}^{n-1} |x_k y_k| |γ_k|.     (2.32)

In particular, obtaining this involves the repeated use of the triangle inequality.
Equation (2.32) thus represents an upper bound on the absolute error involved in
computing a vector dot product. Of course, the notation fl[x^T y] symbolizes the
floating-point approximation to the exact quantity x^T y. However, the bound in
(2.32) is incomplete because we need to appropriately bound the numbers γ_k.
Obtaining the bound we want involves using the following lemma.

Lemma 2.1: We have

1 + x ≤ e^x,   x ≥ 0     (2.33a)
e^x ≤ 1 + 1.01x,   0 ≤ x ≤ 0.01.     (2.33b)

⁴Equation (2.29) is most easily arrived at by considering examples for small n, for instance

ŝ_3 = x_0 y_0 (1 + δ_0)(1 + ε_0)(1 + ε_1)(1 + ε_2)(1 + ε_3) + x_1 y_1 (1 + δ_1)(1 + ε_1)(1 + ε_2)(1 + ε_3)
      + x_2 y_2 (1 + δ_2)(1 + ε_2)(1 + ε_3) + x_3 y_3 (1 + δ_3)(1 + ε_3),

and using such examples to "spot the pattern."




Proof  Begin with consideration of (2.33a). Recall that for -∞ < x < ∞

e^x = Σ_{n=0}^{∞} x^n/n!.     (2.34)

Therefore

e^x = 1 + x + Σ_{n=2}^{∞} x^n/n!,

so that

e^x - (1 + x) = Σ_{n=2}^{∞} x^n/n!,

but the terms in the summation are all nonnegative, so (2.33a) follows immediately.
Now consider (2.33b), which is certainly valid for x = 0. The result will follow
if we prove

(e^x - 1)/x ≤ 1.01     (x ≠ 0).

From (2.34)

(e^x - 1)/x = 1 + Σ_{m=1}^{∞} x^m/(m + 1)!,

so we may also equivalently prove instead that

Σ_{m=1}^{∞} x^m/(m + 1)! ≤ 0.01

for 0 < x ≤ 0.01. Observe that

Σ_{m=1}^{∞} x^m/(m + 1)! = (1/2)x + (1/6)x^2 + (1/24)x^3 + ··· ≤ (1/2)x + x^2 + x^3 + x^4 + ···
   = (1/2)x + Σ_{k=2}^{∞} x^k = (1/2)x + 1/(1 - x) - 1 - x = (1/2)x (1 + x)/(1 - x).

It is not hard to verify that

(1/2)x (1 + x)/(1 - x) ≤ 0.01

for 0 ≤ x ≤ 0.01. Thus, (2.33b) follows.




If n = 1, 2, 3, ..., and if 0 ≤ nu ≤ 0.01, then

(1 + u)^n ≤ (e^u)^n = e^{nu}   [via (2.33a)]
          ≤ 1 + 1.01nu         [via (2.33b)].     (2.35)

Now if |δ_i| ≤ u for i = 0, 1, ..., n - 1, then

Π_{i=0}^{n-1} (1 + δ_i) ≤ Π_{i=0}^{n-1} (1 + |δ_i|) ≤ (1 + u)^n,

so via (2.35)

Π_{i=0}^{n-1} (1 + δ_i) ≤ 1 + 1.01nu,     (2.36)

where we must emphasize that 0 ≤ nu ≤ 0.01. Certainly there is a δ such that

1 + δ = Π_{i=0}^{n-1} (1 + δ_i),     (2.37)

and so from (2.36), |δ| ≤ 1.01nu. If we identify γ_k with δ in (2.37) for all k, then

|γ_k| ≤ 1.01nu     (2.38)

for which we consider u = 2^(-t) [because in (2.30) both |ε_i| and |δ_i| ≤ 2^(-t)]. Using
(2.38) in (2.32), we obtain

|x^T y - fl[x^T y]| ≤ 1.01nu Σ_{k=0}^{n-1} |x_k y_k|,     (2.39)

but Σ_{k=0}^{n-1} |x_k y_k| = Σ_{k=0}^{n-1} |x_k| |y_k|, and this may be symbolized as |x|^T |y| (so that
|x| = [|x_0| |x_1| ··· |x_{n-1}|]^T). Thus, we may rewrite (2.39) as

|x^T y - fl[x^T y]| ≤ 1.01nu |x|^T |y|.     (2.40)



Observe that the relative error satisfies

|x^T y - fl[x^T y]| / |x^T y| ≤ 1.01nu |x|^T |y| / |x^T y|.     (2.41)

The bound in (2.41) may be quite large if |x|^T |y| ≫ |x^T y|. This suggests the
possibility of a large relative error. We remark that since u = 2^(-t), nu ≤ 0.01 will
hold in all practical cases unless n is very large (a typical value for t is t = 56).
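The bound (2.40) is easy to test empirically. The following Python/NumPy sketch is
our illustration (not the book's); round_to_t_bits is only a crude stand-in for the
operator fl[·]. It performs the recursion (2.26) with the working precision artificially
reduced to t bits and compares the resulting error with the right-hand side of (2.40).

import numpy as np

def round_to_t_bits(x, t):
    # Crude simulation of fl[.]: keep about t significant bits of x.
    if x == 0.0:
        return 0.0
    m, e = np.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    return float(np.ldexp(np.round(m * 2**t) / 2**t, e))

def dot_with_rounding(x, y, t):
    # The recursion (2.26), with rounding applied after every multiply and add.
    s = 0.0
    for xk, yk in zip(x, y):
        s = round_to_t_bits(s + round_to_t_bits(xk * yk, t), t)
    return s

rng = np.random.default_rng(1)
n, t = 30, 12                             # nu = 30 * 2**(-12) < 0.01, as the analysis requires
u = 2.0**(-t)
x = rng.standard_normal(n)
y = rng.standard_normal(n)

err = abs(float(x @ y) - dot_with_rounding(x, y, t))
bound = 1.01 * n * u * float(np.abs(x) @ np.abs(y))    # right-hand side of (2.40)
print(err, "<=", bound)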

The potentially large relative errors indicated by the analysis we have just made 
are a consequence of the details of how the dot product was calculated. As noted 
on p. 65 of Ref. 4, the use of a double-precision accumulator to compute the dot 




product can reduce the error dramatically. Essentially, if x and y are floating-
point vectors with t-bit mantissas, the "running sum" s_k [of (2.26)] is built up in
an accumulator with a 2t-bit mantissa. Multiplication of two t-bit numbers can
be stored exactly in a double-precision variable. The large dynamic floating-point
range limits the likelihood of overflow/underflow. Only when the final sum s_{n-1} is
written to a single-precision memory location will there be a rounding error. It
therefore follows that when this alternative procedure is employed, we get

fl[x^T y] = x^T y (1 + δ)     (2.42)

for which |δ| ≈ 2^(-t) (= u). Clearly, this is a big improvement.
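The improvement is easy to see with IEEE arithmetic, where float32 and float64 play
the roles of the single- and double-precision formats (a rough analogy to the t-bit and
2t-bit mantissas above; this is our illustration, not the book's).

import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

# Reference value computed entirely in double precision.
exact = float(np.dot(x.astype(np.float64), y.astype(np.float64)))

s32 = np.float32(0.0)     # running sum rounded to single precision at every step
s64 = np.float64(0.0)     # double-length accumulator, rounded once at the end
for xk, yk in zip(x, y):
    s32 = np.float32(s32 + np.float32(xk * yk))
    s64 = s64 + np.float64(xk) * np.float64(yk)

print(abs(s32 - exact))                 # error of single-precision accumulation
print(abs(np.float32(s64) - exact))     # error after one final rounding to single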
The material of this section shows 

1. The analysis required to obtain insightful bounds on errors can be quite 
arduous. 

2. Proper numerical technique can have a dramatic effect in reducing errors. 

3. Proper technique can be revealed by analysis. 

The following example illustrates how the bound on rounding error in dot prod- 
uct computation may be employed. 

Example 2.4  Assume the existence of a square root function such that
fl[√x] = √x (1 + ε) and |ε| ≤ u. We use the algorithm that corresponds to the
bound of Eq. (2.40) to compute x^T x (x ∈ R^n), and then use this to give an algo-
rithm for ||x|| = √(x^T x). This can be expressed in the form of pseudocode:

s_{-1} := 0;
for k := 0 to n - 1 do begin
    s_k := s_{k-1} + x_k^2;
end;
||x|| := sqrt(s_{n-1});

We will now obtain a bound on the relative error due to rounding in the computation
of ||x||. We will use the fact that √(1 + x) ≤ 1 + x (for x ≥ 0).



Now

ε_1 = (fl[x^T x] - x^T x)/(x^T x)  ⟹  fl[x^T x] = x^T x (1 + ε_1),

and via (2.41)

|ε_1| ≤ 1.01nu |x|^T |x| / |x^T x| = 1.01nu ||x||^2 / ||x||^2 = 1.01nu

(|x|^T |x| = Σ_{k=0}^{n-1} |x_k|^2 = Σ_{k=0}^{n-1} x_k^2 = ||x||^2, and | ||x||^2 | = ||x||^2). So in "short-
hand" notation, fl[√(fl[x^T x])] = fl[||x||], and

fl[||x||] = √(x^T x (1 + ε_1)) (1 + ε) = ||x|| √(1 + ε_1) (1 + ε),

and √(1 + ε_1) ≤ 1 + ε_1, so

fl[||x||] ≤ ||x||(1 + ε_1)(1 + ε).

Now (1 + ε_1)(1 + ε) = 1 + ε_1 + ε + ε_1 ε, implying that

||x||(1 + ε_1)(1 + ε) = ||x|| + ||x||(ε_1 + ε + ε_1 ε),

so therefore

fl[||x||] ≤ ||x|| + ||x||(ε_1 + ε + ε_1 ε),

and thus

(fl[||x||] - ||x||)/||x|| ≤ |ε_1 + ε + ε_1 ε| ≤ u + 1.01nu + 1.01nu^2 = u[1 + 1.01n + 1.01nu].

Of course, we have used the fact that |ε| ≤ u.
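For reference, a direct Python transcription of the pseudocode in Example 2.4 might
look as follows (our sketch; numpy's linalg.norm is used only as a check, and the
rounding of IEEE double precision is far finer than the t-bit model analyzed above).

import numpy as np

def norm_via_running_sum(x):
    # Example 2.4: accumulate s_k = s_{k-1} + x_k**2, then take the square root.
    s = 0.0
    for xk in x:
        s = s + xk * xk
    return np.sqrt(s)

x = np.random.default_rng(2).standard_normal(1000)
print(norm_via_running_sum(x), np.linalg.norm(x))   # the two values should agree closely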

2.5 MACHINE EPSILON 

In Section 2.3 upper bounds on the error involved in applying the operator fl were
derived. Specifically, we found that the relative error satisfies

|x - fl[x]| / |x| ≤ { 2^(-t)    (rounding)
                      2^(-t+1)  (chopping).     (2.43)

As suggested in Section 2.4, these bounds are often denoted by u; that is, u = 2^(-t)
for rounding, and u = 2^(-t+1) for chopping. The bound u is often called the unit
roundoff [4, Section 2.4.2].

The details of how floating-point arithmetic is implemented on any given com-
puting machine may not be known or readily determined by the user. Thus, u
may not be known. However, an "experimental" approach is possible. One may
run a simple program to "estimate" u, and the estimate is the machine epsilon,
denoted ε_M. The machine epsilon is defined to be the difference between 1.0 and
the next biggest floating-point number [6, Section 2.1]. Consequently, ε_M = 2^(-t+1).
A pseudocode to compute ε_M is as follows:

stop := 1;
eps := 1.0;
while stop == 1 do begin
    eps := eps/2.0;
    x := 1.0 + eps;
    if x <= 1.0 begin
        stop := 0;
    end;
end;
eps := 2.0 * eps;

This code may be readily implemented as a MATLAB routine. MATLAB stores 
eps (= €m) as a built-in constant, and the reader may wish to test the code above 
to see if the result agrees with MATLAB eps (as a programming exercise). 
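A Python version of the same idea is sketched below (ours, not MATLAB code). For
IEEE double precision it returns 2^(-52) ≈ 2.22 × 10^(-16), which can be compared
against the machine epsilon that NumPy stores.

import numpy as np

eps = 1.0
while 1.0 + eps / 2.0 > 1.0:     # keep halving while 1 + eps/2 is still distinguishable from 1
    eps = eps / 2.0

print(eps)                       # 2.220446049250313e-16 for IEEE double precision
print(np.finfo(float).eps)       # NumPy's stored machine epsilon, for comparison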

In this book we shall (unless otherwise stated) regard machine epsilon and unit 
roundoff as practically interchangeable. 



APPENDIX 2A REVIEW OF BINARY NUMBER CODES 

This appendix summarizes typical methods used to represent integers in binary. 
Extension of the results in this appendix to fractions is certainly possible. This 
material is normally to be found in introductory digital electronics books. The 
reader is here assumed to know Boolean algebra. This implies that the reader 
knows that + can represent either algebraic addition, or the logical or operation. 
Similarly, xy might mean the logical and of the Boolean variables x and y, or it 
might mean the arithmetic product of the real variables x and y. The context must 
be considered to ascertain which meaning applies. 

Below we speak of "complements." These are used to represent negative inte- 
gers, and also to facilitate arithmetic with integers. We remark that the results of 
this appendix are presented in a fairly general manner. Thus, the reader may wish, 
for instance, to see numerical examples of arithmetic using two's complement (2's 
comp.) codings. The reader can consult pp. 276-280 of Ref. 2 for such examples. 
Almost any other book on digital logic will also provide a source of numerical
examples [3].

We may typically interpret a bit pattern in one of four ways, assuming that the 
bit pattern is to represent a number (negative or nonnegative integer). An example 
of this is as follows, and it provides a summary of common representations (e.g., 
for n = 3 bits): 



Bit Pattern    Unsigned Integer    2's Comp.    1's Comp.    Sign Magnitude

000            0                   0            0            0
001            1                   1            1            1
010            2                   2            2            2
011            3                   3            3            3
100            4                   -4           -3           -0
101            5                   -3           -2           -1
110            6                   -2           -1           -2
111            7                   -1           -0           -3



In the four coding schemes summarized in this table, the interpretation of the bit 
pattern is always the same when the most significant bit (MSB) is zero. A similar 
table for n — 4 appears in Hamacher et al. [2, see p. 271]. 

Note that, philosophically speaking, the table above implies that a bit pattern 
can have more than one meaning. It is up to the engineer to decide what meaning 
it should have. Of course, this will be a function of purpose. Presently, our purpose 
is that bit patterns should have meaning with respect to the problems of numerical 
computing; that is, bit patterns must represent numerical information. 

The relative merits of the three signed number coding schemes illustrated in the 
table above may be summarized as follows: 



Coding Scheme      Advantages                         Disadvantages

2's complement     Simple adder/subtracter circuit.   Circuit for finding the 2's comp. is
                   Only one code for 0.               more complex than the circuit for
                                                      finding the 1's comp.

1's complement     Easy to obtain the 1's comp.       Circuit for addition and subtraction is
                   of a number.                       more complex than the 2's comp.
                                                      adder/subtracter. Two codes for 0.

Sign magnitude     Intuitively obvious code.          Has the most complex adder/subtracter
                                                      circuit. Two codes for 0.



The following is a summary of some formulas associated with arithmetic (i.e.,
addition and subtraction) with r's and (r - 1)'s complements. In binary arithmetic
r = 2, while in decimal arithmetic r = 10. We emphasize the case r = 2.

Let A be an n-digit base-r number (integer)

A = A_{n-1} A_{n-2} ··· A_1 A_0,

where A_k ∈ {0, 1, ..., r - 2, r - 1}. Digit A_{n-1} is the most significant digit
(MSD), while digit A_0 is the least significant digit (LSD). Provided that A is
not negative (i.e., is unsigned), we recognize that to convert A to a base-10 repre-
sentation (i.e., ordinary decimal number) requires us to compute

A = Σ_{k=0}^{n-1} A_k r^k.

If A is allowed to be a negative integer, the usage of this summation needs modi-
fication. This is considered below.

The r's complement of A is defined to be

r's complement of A = A* = { r^n - A,  A ≠ 0
                             0,        A = 0.     (2.A.1)




The (r - 1)'s complement of A is defined to be

(r - 1)'s complement of A = Ā = (r^n - 1) - A.     (2.A.2)

It is important not to confuse the bar over the A in (2.A.2) with the Boolean not
operation, although for the special case of r = 2 the bar will denote complemen-
tation of each bit of A; that is, for r = 2

Ā = Ā_{n-1} Ā_{n-2} ··· Ā_1 Ā_0,

where the bar now denotes the logical not operation. More generally, if A is a
base-r number,

Ā = [(r-1) - A_{n-1}] [(r-1) - A_{n-2}] ··· [(r-1) - A_1] [(r-1) - A_0].

Thus, to obtain Ā, each digit of A is subtracted from r - 1. As a consequence,
comparing (2.A.1) and (2.A.2), we see that

A* = Ā + 1,     (2.A.3)

where the plus denotes algebraic addition (which takes place in base r).

Now we consider the three (previously noted) different methods for coding
integers when r = 2:

1. Sign-magnitude coding
2. One's complement coding
3. Two's complement coding

In all three of these coding schemes the most significant bit (MSB) is the sign bit.
Specifically, if A_{n-1} = 0, the number is nonnegative, and if A_{n-1} = 1, the number
is negative. It can be shown that when the complement (either one's or two's) of
a binary number is taken, this is equivalent to placing a minus sign in front of the
number. As a consequence, when given a binary number A = A_{n-1} A_{n-2} ··· A_1 A_0
coded according to one of these three schemes, we may convert that number to a
base-10 integer according to the following formulas:

1. Sign-Magnitude Coding. The sign-magnitude binary number A = A_{n-1} A_{n-2}
   ··· A_1 A_0 (A_k ∈ {0, 1}) has the base-10 equivalent

   A = {  Σ_{i=0}^{n-2} A_i 2^i,   A_{n-1} = 0
         -Σ_{i=0}^{n-2} A_i 2^i,   A_{n-1} = 1.     (2.A.4)

   With this coding scheme there are two codings for zero:

   (0)_10 = (000···00)_2 = (100···00)_2.



2. One's Complement Coding. In this coding we represent -A as Ā. The one's
   complement binary number A = A_{n-1} A_{n-2} ··· A_1 A_0 (A_k ∈ {0, 1}) has the
   base-10 equivalent

   A = {  Σ_{i=0}^{n-2} A_i 2^i,   A_{n-1} = 0
         -Σ_{i=0}^{n-2} Ā_i 2^i,   A_{n-1} = 1.     (2.A.5)

   With this coding scheme there are also two codes for zero:

   (0)_10 = (000···00)_2 = (111···11)_2.

3. Two's Complement Coding. In this coding we represent -A as A* (= Ā + 1).
   The two's complement binary number A = A_{n-1} A_{n-2} ··· A_1 A_0 (A_k ∈ {0, 1})
   has the base-10 equivalent

   A = -2^{n-1} A_{n-1} + Σ_{i=0}^{n-2} A_i 2^i.     (2.A.6)



The proof is as follows. If A_{n-1} = 0, then A ≥ 0 and immediately the base-10
equivalent is A = Σ_{i=0}^{n-2} A_i 2^i (via the procedure for converting a number in
base-2 to one in base-10), which is (2.A.6) for A_{n-1} = 0. Now, if A_{n-1} = 1,
then A < 0, and so if we take the two's complement of A we must get |A|:

|A| = Ā + 1
    = (1 - A_{n-1})(1 - A_{n-2}) ··· (1 - A_1)(1 - A_0) + 00···01
    = Σ_{i=0}^{n-1} (1 - A_i) 2^i + 1
    = 2^{n-1}(1 - A_{n-1}) + Σ_{i=0}^{n-2} (1 - A_i) 2^i + 1
    = Σ_{i=0}^{n-2} 2^i + 1 - Σ_{i=0}^{n-2} A_i 2^i     (A_{n-1} = 1)
    = 2^{n-1} - Σ_{i=0}^{n-2} A_i 2^i

(using Σ_{i=0}^{n-2} 2^i = 2^{n-1} - 1, a special case of Σ_{i=0}^{n} a^i = (1 - a^{n+1})/(1 - a)),

and so A = -2^{n-1} + Σ_{i=0}^{n-2} A_i 2^i, which is (2.A.6) for A_{n-1} = 1.
In this coding scheme there is only one code for zero:

(0)_10 = (000···00)_2.
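Equation (2.A.6) translates directly into code. The following Python sketch (ours; the
function name twos_comp_to_int is arbitrary) converts an n-bit two's complement
pattern to its base-10 value and reproduces the 2's comp. column of the n = 3 table
given earlier.

def twos_comp_to_int(bits):
    # Interpret a bit string (MSB first) as an n-bit two's complement integer,
    # using Eq. (2.A.6): value = -2^(n-1)*A_{n-1} + sum of A_i * 2^i.
    n = len(bits)
    a = [int(b) for b in bits]          # a[0] is A_{n-1}, the sign bit
    value = -a[0] * 2**(n - 1)
    for i, bit in enumerate(a[1:]):     # remaining bits have weights 2^(n-2), ..., 2^0
        value += bit * 2**(n - 2 - i)
    return value

# The n = 3 rows of the table above:
for pattern in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    print(pattern, twos_comp_to_int(pattern))   # 0 1 2 3 -4 -3 -2 -1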



When n-bit integers are added together, there is the possibility that the sum may
not fit in n bits. This is overflow. The condition is easy to detect by monitoring
the signs of the operands and the sum. Suppose that x and y are n-bit two's
complement coded integers, so that the sign bits of these operands are x_{n-1} and
y_{n-1}. Suppose that the sum is denoted by s, implying that the sign bit is s_{n-1}. The
Boolean function that tests for overflow of s = x + y (algebraic sum of x and y) is

T = x_{n-1} y_{n-1} s̄_{n-1} + x̄_{n-1} ȳ_{n-1} s_{n-1}.

The first term will be logical 1 if the operands are negative while the sum is
positive. The second term will be logical 1 if the operands are positive but the sum
is negative. Either condition yields T = 1, thus indicating an overflow. A similar
test may be obtained for subtraction, but we omit this here.
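As an illustration (ours, not the book's), the sketch below adds two n-bit two's
complement patterns as unsigned numbers, discards the carryout, and evaluates the
overflow test T just described.

def add_twos_comp(x, y, n):
    # Add two n-bit two's complement patterns as if unsigned, discard the
    # carryout, and evaluate T = x_{n-1} y_{n-1} not(s_{n-1}) + not(x_{n-1}) not(y_{n-1}) s_{n-1}.
    mask = (1 << n) - 1
    xs, ys = x & mask, y & mask
    s = (xs + ys) & mask                       # carry out of bit n-1 is discarded
    xn, yn, sn = xs >> (n - 1), ys >> (n - 1), s >> (n - 1)
    T = (xn & yn & (1 - sn)) | ((1 - xn) & (1 - yn) & sn)
    return s, bool(T)

print(add_twos_comp(0b011, 0b001, 3))   # 3 + 1: sum pattern 0b100 (printed as 4), overflow True
print(add_twos_comp(0b110, 0b111, 3))   # (-2) + (-1): sum pattern 0b101 (-3), no overflow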

The following is both the procedure and the justification of the procedure for 
adding two's complement coded integers. 

Theorem 2.A.1: Two's Complement Addition  If A and B are n-bit two's
complement coded numbers, then compute A + B (the sum of A and B) as though
they were unsigned numbers, discarding any carryout.

Proof  Suppose that A ≥ 0, B ≥ 0; then A + B will generate no carryout from
the bit position n - 1 since A_{n-1} = B_{n-1} = 0 (i.e., the sign bits are zero-valued),
and the result will be correct if A + B < 2^{n-1}. (If this inequality is not satisfied,
then the sign bit will be one, indicating a negative answer, which is wrong. This
amounts to an overflow.)

Suppose that A > B > 0; then

A + (-B) = A + B* = A + 2^n - B = 2^n + A - B,

and if we discard the carryout, this is equivalent to subtracting 2^n (because the
carryout has a weight of 2^n). Doing this yields A + (-B) = A - B.

Similarly,

(-A) + B = A* + B = 2^n - A + B = 2^n + B - A,

and discarding the carryout yields (-A) + B = B - A.
Again, suppose that A > B > 0; then

(-A) + (-B) = A* + B* = 2^n - A + 2^n - B = 2^n + [2^n - (A + B)]
            = 2^n + (A + B)*,

so discarding the carryout gives (-A) + (-B) = (A + B)*, which is the desired
two's complement representation of -(A + B), provided A + B < 2^{n-1}. (If this
latter inequality is not satisfied, then we have an overflow.)

The procedure for subtraction (and its justification) follows similarly. We omit
these details.

REFERENCES 

1. J. H. Wilkinson, Rounding Errors in Algebraic Processes, Prentice-Hall, Englewood 
Cliffs, NJ, 1963. 

2. V. C. Hamacher, Z. Vranesic, and S. G. Zaky, Computer Organization, 3rd ed.,
McGraw-Hill, New York, 1990.

3. J. F. Wakerly, Digital Design Principles and Practices, Prentice-Hall, Englewood Cliffs, 
NJ, 1990. 

4. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ. 
Press, Baltimore, MD, 1989. 

5. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical 
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977. 

6. N. J. Higham, Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA, 
1996. 



PROBLEMS 

2.1. Let fx[x] denote the operation of reducing x (a fixed-point binary fraction)
to t + 1 bits (including the sign bit) according to the Wilkinson rounding
(ordinary rounding) procedure in Section 2.2. Suppose that a = (0.1000)_2,
b = (0.1001)_2, and c = (0.0101)_2, so t = 4 here. In arithmetic of unlimited
precision, we always have a(b + c) = ab + ac. Suppose that a practical com-
puting machine applies the operator fx[·] after every arithmetic operation.

(a) Find x = fx[fx[ab] + fx[ac]].

(b) Find y = fx[a fx[b + c]].

Do you obtain x = y?




This problem shows that the order of operations in an algorithm implemented 
on a practical computer can affect the answer obtained. 

2.2. Recall from Section 2.2 that 

Find the absolute error in representing q as a (t + l)-bit binary number. Find 
the relative error. Assume both ordinary rounding and chopping (defined at 
the end of Section 2.3 with respect to floating-point arithmetic). 

2.3. Recall that we define a floating-point number in base r to have the form

x = x_0 . x_1 x_2 ··· x_{t-1} x_t × r^e   (the fraction .x_1 ··· x_t being denoted f),

where x_0 ∈ {+, -} (sign digit), x_k ∈ {0, 1, ..., r - 1} for k = 1, 2, ..., t, e
is the exponent (a signed integer), and x_1 ≠ 0 (so r^(-1) ≤ |f| < 1) if x ≠ 0.
Show that for x ≠ 0

m ≤ |x| ≤ M,

where for L ≤ e ≤ U, we have

m = r^(L-1),   M = r^U (1 - r^(-t)).



2.4. Suppose r = 10. We may consider the result of a decimal arithmetic operation
in the floating-point representation to be

x = ±(Σ_{k=1}^{∞} x_k 10^(-k)) × 10^e.

(a) If fl[x] is the operator for chopping, then

    fl[x] = (±.x_1 x_2 ··· x_{t-1} x_t) × 10^e;

    thus, all digits x_k for k > t are forced to zero.

(b) If fl[x] is the operator for rounding, then it is defined as follows. Add
    0.00···01 (t + 1 digits) to the mantissa if x_{t+1} ≥ 5, but if x_{t+1} < 5, the
    mantissa is unchanged. Then all digits x_k for k > t are forced to zero.

Show that the absolute error for chopping satisfies the upper bound

|x - fl[x]| ≤ 10^(-t) 10^e,

and that the absolute error for rounding satisfies the upper bound

|x - fl[x]| ≤ (1/2) 10^(-t) 10^e.

Show that the relative errors satisfy

|x - fl[x]| / |x| ≤ { 10^(1-t)        (chopping)
                      (1/2) 10^(1-t)  (rounding).

2.5. Suppose that t = 4 and r = 2 (i.e., we are working with floating-point binary
numbers). Suppose that we have the operands

x = 0.1011 × 10^(-3),   y = -0.1101 × 10^2.

Find x + y, x - y, and xy. Clearly show the steps involved.

2.6. Suppose that A ∈ R^(n×n), x ∈ R^n, and that fl[Ax] represents the result
of computing the product Ax on a floating-point computer. Define |A| =
[|a_ij|]_{i,j=0,1,...,n-1}, and |x| = [|x_0| |x_1| ··· |x_{n-1}|]^T. We have

fl[Ax] = Ax + e,

where e ∈ R^n is the error vector. Of course, e models the rounding errors
involved in the actual computation of product Ax on the computer. Justify
the bound

|e| ≤ 1.01nu |A| |x|.

2.7. Explain why a conditional test such as 

if x <> y then begin
    f := f/(x - y);
end;

is unreliable. 

(Hint: Think about dynamic range limitations in floating-point arithmetic.) 

2.8. Suppose that x = [x_0 x_1 ··· x_{n-1}]^T is a real-valued vector, ||x||_∞ =
max_{0≤k≤n-1} |x_k|, and that we wish to compute ||x||_2 = [Σ_{k=0}^{n-1} x_k^2]^(1/2).
Explain the advantages and disadvantages of the following algorithm with
respect to computational efficiency (number of arithmetic operations, and
comparisons), and dynamic range limitations in floating-point arithmetic:

m := ||x||_∞;
s := 0;
for k := 0 to n - 1 do begin
    s := s + (x_k/m)^2;
end;
||x||_2 := m*sqrt(s);

Comments regarding computational efficiency may be made with respect to
the pseudocode algorithm in Example 2.4.






2.9. Recall that for x^2 + bx + c = 0, the roots are

x_1 = (-b + √(b^2 - 4c))/2,   x_2 = (-b - √(b^2 - 4c))/2.

If b = -0.3001, c = 0.00006, then the "exact" roots for this set of parame-
ters are

x_1 = 0.29989993,   x_2 = 2.0006673 × 10^(-4).

Let us compute the roots using four-digit (i.e., t = 4) decimal (i.e., r = 10)
floating-point arithmetic, where, as a result of rounding quantization, b and
c are replaced with their approximations

b̂ = -0.3001 = b,   ĉ = 0.0001 ≠ c.

Compute x̂_2, which is the approximation to x_2 obtained using b̂ and ĉ in
place of b and c. Show that the relative error is

|(x̂_2 - x_2)/x_2| ≈ 0.75

(i.e., the relative error is about 75%). (Comment: This is an example of
catastrophic cancellation.)

2.10. Suppose a, b ∈ R, and x = a - b. Floating-point approximations to a and b
are â = fl[a] = a(1 + ε_a) and b̂ = fl[b] = b(1 + ε_b), respectively. Hence
the floating-point approximation to x is x̂ = â - b̂. Show that the relative
error is of the form

|e| = |(x̂ - x)/x| ≤ α (|a| + |b|)/|a - b|.

What is α? When is |e| large?
2.11. For a ≠ 0, the quadratic equation ax^2 + bx + c = 0 has roots given by

x_1 = (-b + √(b^2 - 4ac))/(2a),   x_2 = (-b - √(b^2 - 4ac))/(2a).

For c ≠ 0, the quadratic equation cx^2 + bx + a = 0 has roots given by

x_1' = (-b + √(b^2 - 4ac))/(2c),   x_2' = (-b - √(b^2 - 4ac))/(2c).

(a) Show that x_1 x_2' = 1 and x_2 x_1' = 1.

(b) Using the result from Problem 2.10, explain accuracy problems that can
    arise in computing either x_1 or x_2 when b^2 ≫ |4ac|. Can you use the
    result in part (a) to alleviate the problem? Explain.






3  Sequences and Series



3.1 INTRODUCTION 

Sequences and series have a major role to play in computational methods. In this 
chapter we consider various types of sequences and series, especially with respect 
to their convergence behavior. A series might converge "mathematically," and yet 
it might not converge "numerically" (i.e., when implemented on a computer). Some 
of the causes of difficulties such as this will be considered here, along with possible 
remedies. 



3.2 CAUCHY SEQUENCES AND COMPLETE SPACES 

It was noted in the introduction to Chapter 1 that many computational processes 
are "iterative" (the Newton-Raphson method for finding the roots of an equation, 
iterative methods for linear system solution, etc.). The practical effect of this is to 
produce sequences of elements from function spaces. The sequence produced by the 
iterative computation is only useful if it converges. We must therefore investigate 
what this means. 

In Chapter 1 it was possible for sequences to be either singly or doubly infinite. 
Here we shall assume sequences are singly infinite unless specifically stated to the 
contrary. 

We begin with the following (standard) definition taken from Kreyszig 
[1, pp. 25-26]. Examples of applications of the definitions to follow will be con- 
sidered later. 

Definition 3.1: Convergence of a Sequence, Limit  A sequence (x_n) in a
metric space X = (X, d) is said to converge, or to be convergent, iff there is an
x ∈ X such that

lim_{n→∞} d(x_n, x) = 0.     (3.1)

The element x is called the limit of (x_n) (i.e., limit of the sequence), and we may
state that

lim_{n→∞} x_n = x.     (3.2)




We say that (x n ) converges to x or has a limit x. If (x n ) is not convergent, then 
we say that it is a divergent sequence, or is simply divergent. 

A shorthand expression for (3.2) is to write x_n → x. We observe that sequence (x_n)
is defined to converge (or not) with respect to a particular metric here denoted d
(recall the axioms for a metric space from Chapter 1). We remark that it is possible
that, for some (x_n) in some set X, the sequence might converge with respect to
one metric on the set, but might not converge with respect to another choice of
metric. It must be emphasized that the limit x must be an element of X in order
for the sequence to be convergent.

Suppose, for example, that X = (0, 1] ⊂ R, and consider the sequence x_n =
1/(n + 1) (n ∈ Z^+). Suppose also that d(x, y) = |x - y|. The sequence (x_n) does not
converge in X because the sequence "wants to go to 0." But 0 is not in X. So the
sequence does not converge. (Of course, the sequence converges in X = R with
respect to our present choice of metric.)

It can be difficult in practice to ascertain whether a particular sequence con- 
verges according to Definition 3.1. This is because the limit x may not be known 
in advance. In fact, this is almost always the case in computing applications of 
sequences. Sometimes it is therefore easier to work with the following: 

Definition 3.2: Cauchy Sequence, Complete Space  A sequence (x_n) in a
metric space X = (X, d) is called a Cauchy sequence iff for all ε > 0 there is an
N(ε) ∈ Z^+ such that

d(x_m, x_n) < ε     (3.3)

for all m, n > N(ε). The space X is a complete space iff every Cauchy sequence
in X converges.

We often write N instead of N(e), because N may depend on our choice of e. It 
is possible to prove that any convergent sequence is also Cauchy. 

We remark that, if in fact the limit is known (or at least strongly suspected), 
then applying Definition 3.1 may actually be easier than applying Definition 3.2. 

We see that under Definition 3.2 the elements of a Cauchy sequence get closer to 
each other as n and m increase. Establishing the "Cauchiness" of a sequence does 
not require knowing the limit of the sequence. This, at least in principle, simplifies 
matters. However, a big problem with this definition is that there are metric spaces 
X in which not all Cauchy sequences converge. In other words, there are incomplete 
metric spaces. For example, the space X — (0, 1] with d(x, y) = \x — y\ is not 
complete. Recall that we considered x n = l/(« + 1). This sequence is Cauchy, 1 
but the limit is 0, which is not in X. Thus, this sequence is a nonconvergent Cauchy 
sequence. Thus, the space (X, | • |) is not complete. 



¹We see that

d(x_m, x_n) = | 1/(m + 1) - 1/(n + 1) |.






A more subtle example of an incomplete metric space is the following. Recall
space C[a, b] from Example 1.4. Assume that a = 0 and b = 1, and now choose
the metric to be

d(x, y) = ∫_0^1 |x(t) - y(t)| dt     (3.4)

instead of Eq. (1.8). Space C[0, 1] with the metric (3.4) is not complete. This
may be shown by considering the sequence of continuous functions illustrated
in Fig. 3.1. The functions x_m(t) in Fig. 3.1a form a Cauchy sequence. (Here we
assume m > 1, and that m is an integer.) This is because d(x_m, x_n) is the area of the
triangle in Fig. 3.1b, and for any ε > 0, we have

d(x_m, x_n) < ε

whenever m, n > 1/(2ε). (Suppose n > m and consider that d(x_m, x_n) =
(1/2)(1/m - 1/n) ≤ 1/(2m) < ε.) We may see that this Cauchy sequence does not
converge. Observe that we have

x_m(t) = 0 for t ∈ [0, 1/2],   x_m(t) = 1 for t ∈ [a_m, 1],

[Figure 3.1  A Cauchy sequence of functions in C[0, 1].]

¹(cont.) For any ε > 0 we may find N(ε) > 0 such that for n > m > N(ε)

1/(m + 1) - 1/(n + 1) < ε.

If n < m, the roles of n and m may be reversed. The conditions of Definition 3.2 are met and so the
sequence is Cauchy.






where a_m = 1/2 + 1/m. Therefore, for all x ∈ C[0, 1],

d(x_m, x) = ∫_0^1 |x_m(t) - x(t)| dt
          = ∫_0^{1/2} |x(t)| dt + ∫_{1/2}^{a_m} |x_m(t) - x(t)| dt + ∫_{a_m}^1 |1 - x(t)| dt.

The integrands are all nonnegative, and so each of the integrals on the right-hand
side are nonnegative, too. Thus, to say that d(x_m, x) → 0 implies that each integral
approaches zero. Since x(t) is continuous, it must be the case that

x(t) = 0 for t ∈ [0, 1/2),   x(t) = 1 for t ∈ (1/2, 1].

However, this is not possible for a continuous function. In other words, we have
a contradiction. Hence, (x_n) does not converge (i.e., has no limit in X = C[0, 1]).
Again, we have a Cauchy sequence that does not converge, and so C[0, 1] with
the metric (3.4) is not complete.

This example also shows that a sequence of continuous functions may very well 
possess a discontinuous limit. Actually, we have seen this phenomenon before. 
Recall the example of the Fourier series in Chapter 1 (see Example 1.20). In this 
case the series representation of the square wave was made up of terms that are all 
continuous functions. Yet the series converges to a discontinuous limit. We shall 
return to this issue again later. 

So now, some metric spaces are not complete. This means that even though a 
sequence is Cauchy, there is no guarantee of convergence. We are therefore faced 
with the following questions: 

1. What metric spaces are complete? 

2. Can they be "completed" if they are not? 

The answer to the second question is "Yes." Given an incomplete metric space, 
it is always possible to complete it. We have seen that a Cauchy sequence does 
not converge when the sequence tends toward a limit that does not belong to the 
space; thus, in a sense, the space "has holes in it." Completion is the process of 
filling in the holes. This amounts to adding the appropriate elements to the set that 
made up the incomplete space. However, in general, this is a technically difficult 
process to implement in many cases, and so we will never do this. This is a job 
normally left to mathematicians. 

We will therefore content ourselves with answering the first question. This will 
be done simply by listing complete metric spaces that are useful to engineers: 

1. Sets R and C with the metric d(x, y) = |x - y| are complete metric spaces.






2. Recall Example 1.3. The space l^∞[0, ∞] with the metric

   d(x, y) = sup_{k∈Z^+} |x_k - y_k|     (3.5)

   is a complete metric space. (A proof of this claim appears in Ref. 1, p. 34.)

3. The Euclidean space R^n and the unitary space C^n, both with metric

   d(x, y) = [ Σ_{k=0}^{n-1} |x_k - y_k|^2 ]^(1/2),     (3.6)

   are complete metric spaces. (Proof is on p. 33 of Ref. 1.)

4. Recall Example 1.6. Fixing p, the space l^p[0, ∞] such that 1 ≤ p < ∞ is a
   complete metric space. Here we recall that the metric is

   d(x, y) = [ Σ_{k=0}^{∞} |x_k - y_k|^p ]^(1/p).     (3.7)

5. Recall Example 1.4. The set C[a, b] with the metric

   d(x, y) = sup_{t∈[a,b]} |x(t) - y(t)|     (3.8)

   is a complete metric space. (Proof is on pp. 36-37 of Ref. 1.)



The last example is interesting because the special case C[0, 1] with metric (3.4)
was previously shown to be incomplete. Keeping the same set but changing the
metric from that in (3.4) to that in (3.8) changes the situation dramatically.

In Chapter 1 we remarked on the importance of the metric space L^2[a, b] (recall
Example 1.7). The space is important as the "space of finite energy signals on the
interval [a, b]." (A "finite power" interpretation was also possible.) An important
special case of this was L^2(R) = L^2(-∞, ∞). Are these metric spaces complete?
Our notation implicitly assumes that the set (1.11a) (Chapter 1) contains the so-
called Lebesgue integrable functions on [a, b]. In this case the space L^2[a, b] is
indeed complete with respect to the metric

d(x, y) = [ ∫_a^b |x(t) - y(t)|^2 dt ]^(1/2).     (3.9)

Lebesgue integrable functions² have a complicated mathematical structure, and we
have promised to avoid any measure theory in this book. It is enough for the reader

²One of the "simplest" introductions to these is Rudin [2]. However, these functions appear in the
last chapter [2, Chapter 11]. Knowledge of much of the previous chapters is prerequisite to studying
Chapter 11. Thus, the effort required to learn measure theory is substantial.




to assume that the functions in L 2 [a, b] are the familiar ones from elementary 
calculus. 3 

The complete metric spaces considered in the two previous paragraphs also 
happen to be normed spaces; recall Section 1.3.2. This is because the metrics are 
all induced by suitable norms on the spaces. It therefore follows that these spaces 
are complete normed spaces. Complete normed spaces are called Banach spaces. 

Some of the complete normed spaces are also inner product spaces. Again, this 
follows because in those cases an inner product is defined that induced the norm. 
Complete inner product spaces are called Hilbert spaces. To be more specific, the 
following spaces are Hilbert spaces: 

1. The Euclidean space R^n and the unitary space C^n, along with the inner product

   (x, y) = Σ_{k=0}^{n-1} x_k y_k^*,     (3.10)

   are both Hilbert spaces.

2. The space L^2[a, b] with the inner product

   (x, y) = ∫_a^b x(t) y^*(t) dt     (3.11)

   is a Hilbert space. [This includes the special case L^2(R).]

3. The space l^2[0, ∞] with the inner product

   (x, y) = Σ_{k=0}^{∞} x_k y_k^*     (3.12)

   is a Hilbert space.

We emphasize that (3.10) induces the metric (3.6), (3.11) induces the metric (3.9),
and (3.12) induces the metric (3.7) (but only for the case p = 2; recall from Chapter 1
that l^p[0, ∞] is not an inner product space when p ≠ 2). The three Hilbert spaces
listed above are particularly important because of the fact, in part, that elements in
these spaces have (as we have already noted) either finite energy or finite power
interpretations. Additionally, least-squares problems are best posed and solved
within these spaces. This will be considered later.

Define the set (of natural numbers) N = {1, 2, 3, . . .}. We have seen that sequences 
of continuous functions may have a discontinuous limit. An extreme example of this 
phenomenon is from p. 145 of Rudin [2]. 

These "familiar" functions are called Riemann integrable functions. These functions form a proper 
subset of the Lebesgue integrable functions. 




Example 3.1  For n ∈ N define

x_n(t) = lim_{m→∞} [cos(n!πt)]^(2m).

When n!t is an integer, then x_n(t) = 1 (simply because cos(kπ) = ±1 for k ∈ Z).
For all other values of t, we must have x_n(t) = 0 (simply because |cos t| < 1 when
t is not an integral multiple of π). Define

x(t) = lim_{n→∞} x_n(t).

If t is irrational, then x_n(t) = 0 for all n. Suppose that t is rational; that is, suppose
t = p/q for which p, q ∈ Z. In this case n!t is an integer when n ≥ q, in which
case x(t) = 1. Consequently, we may conclude that

x(t) = lim_{n→∞} lim_{m→∞} [cos(n!πt)]^(2m) = { 0,  t is irrational
                                               1,  t is rational.     (3.13)

We have mentioned (in footnote 3, above) that Riemann integrable functions are a
proper subset of the Lebesgue integrable functions. It turns out that x(t) in (3.13)
is Lebesgue integrable, but not Riemann integrable. In other words, you cannot use
elementary calculus to find the integral of x(t) in (3.13). Of course, x(t) is a very
strange function. This is typical; that is, functions that are not Riemann integrable
are usually rather strange, and so are not commonly encountered (by the engineer).
It therefore follows that we do not need to worry much about the more general
class of Lebesgue integrable functions.

Limiting processes are potentially dangerous. This is illustrated by a very simple 
example. 

Example 3.2  Suppose n, m ∈ N. Define

x_{m,n} = m/(m + n).

(This is a double sequence. In Chapter 1 we saw that these arise routinely in wavelet
theory.) Treating n as a fixed constant, we obtain

lim_{m→∞} x_{m,n} = 1,

so

lim_{n→∞} lim_{m→∞} x_{m,n} = 1.

Now instead treat m as a fixed constant, so that

lim_{n→∞} x_{m,n} = 0,

which in turn implies that

lim_{m→∞} lim_{n→∞} x_{m,n} = 0.

Interchanging the order of the limits has given two completely different answers.

Interchanging the order of limits clearly must be done with great care. 

The following example is simply another illustration of how to apply Defini- 
tion 3.2. 

Example 3.3  Define

x_n = 1 + (-1)^n/(n + 1)     (n ∈ Z^+).

This is a sequence in the metric space (R, | · |). This space is complete, so we need
not know the limit of the sequence to determine whether it converges (although
we might guess that the limit is x = 1). We see that

d(x_m, x_n) = |x_m - x_n| = |(-1)^m/(m + 1) - (-1)^n/(n + 1)|
            ≤ |(-1)^m/(m + 1)| + |(-1)^n/(n + 1)|,

where the triangle inequality has been used. If we assume [without loss of generality
(commonly abbreviated w.l.o.g.)] that n > m > N(ε), then

|(-1)^m/(m + 1)| + |(-1)^n/(n + 1)| = 1/(m + 1) + 1/(n + 1) ≤ 1/m + 1/n ≤ 2/m < ε.

So, for a given ε > 0, we select n > m > 2/ε. The sequence is Cauchy, and so it
must converge.

We close this section with mention of Appendix 3. A. Think of the material in it 
as being a very big applications example. This appendix presents an introduction 
to coordinate rotation digital computing (CORDIC). This is an application of a 
particular class of Cauchy sequence (called a discrete basis) to the problem of 
performing certain elementary operations (e.g., vector rotation, computing sines 
and cosines). The method is used in application-specific integrated circuits (ASICs), 
gate arrays, and has been used in pocket calculators. Note that Appendix 3. A also 
illustrates a useful series expansion which is expressed in terms of the discrete basis. 

3.3 POINTWISE CONVERGENCE AND UNIFORM CONVERGENCE 

The previous section informed us that sequences can converge in different ways, 
assuming that they converge in any sense at all. We explore this issue further here. 




Definition 3.3: Pointwise Convergence  Suppose that (x_n(t)) (n ∈ Z^+) is a
sequence of functions for which t ∈ S ⊂ R. We say that the sequence converges
pointwise iff there is an x(t) (t ∈ S) so that for all ε > 0 there is an N = N(ε, t)
such that

|x_n(t) - x(t)| < ε     (3.14)

for n > N. We call x the limit of (x_n) and write

x(t) = lim_{n→∞} x_n(t)     (t ∈ S).     (3.15)

We emphasize that under this definition N may depend on both ε and t. We may
contrast Definition 3.3 with the following definition.

Definition 3.4: Uniform Convergence  Suppose that (x_n(t)) (n ∈ Z^+) is a
sequence of functions for which t ∈ S ⊂ R. We say that the sequence converges
uniformly iff there is an x(t) (t ∈ S) so that for all ε > 0 there is an N = N(ε)
such that

|x_n(t) - x(t)| < ε     (3.16)

for n > N. We call x the limit of (x_n) and write

x(t) = lim_{n→∞} x_n(t)     (t ∈ S).     (3.17)

We emphasize that under this definition N never depends on t, although it may 
depend on e. It is apparent that a uniformly convergent sequence is also point- 
wise convergent. However, the converse is not true; that is, a pointwise convergent 
sequence is not necessarily uniformly convergent. This distinction is important in 
understanding the convergence behavior of series as well as of sequences. In par- 
ticular, it helps in understanding convergence phenomena in Fourier (and wavelet) 
series expansions. 

In contrast with the definitions of Section 3.2, under Definitions 3.3 and 3.4 the 
elements of (x n ) and the limit x need not reside in the same function space. In 
fact, we do not ask what function spaces they belong to at all. In other words, the 
definitions of this section represent a different approach to convergence analysis. 

As with the Definition 3.1, direct application of Definitions 3.3 and 3.4 can be 
quite difficult since the limit, assuming it exists, is not often known in advance 
(i.e., a priori) in practice. Therefore, we would hope for a convergence criterion 
similar to the idea of Cauchy convergence in Section 3.2 (Definition 3.2). In fact, 
we have the following theorem (from Rudin [2, pp. 147-148]). 

Theorem 3.1: The sequence of functions (x_n) defined on S ⊂ R converges
uniformly on S iff for all ε > 0 there is an N such that

|x_m(t) - x_n(t)| < ε     (3.18)

for all n, m > N.




This is certainly analogous to the Cauchy criterion seen earlier. (We omit the 
proof.) 



Example 3.4  Suppose that (x_n) is defined according to

x_n(t) = 1/(nt + 1),   t ∈ (0, 1) and n ∈ N.

A sketch of x_n(t) for various n appears in Fig. 3.2. We see that ("by inspection")
x_n → 0. But consider, for all ε > 0,

|x_n(t) - 0| = 1/(nt + 1) < ε,

which implies that we must have

n > (1/t)(1/ε - 1) = N,

so that N is a function of both t and ε. Convergence is therefore pointwise, and is
not uniform.

Other criteria for uniform convergence may be established. For example, there is 
the following theorem (again from Rudin [2, p. 148]). 



Theorem 3.2: Suppose that

lim_{n→∞} x_n(t) = x(t)     (t ∈ S).

Define

M_n = sup_{t∈S} |x_n(t) - x(t)|.

Then x_n → x uniformly on S iff M_n → 0 as n → ∞.



[Figure 3.2 (plot of x_n(t) = 1/(nt + 1) for n = 1, 2, 20): A plot of typical sequence
elements for Example 3.4; here, t ∈ [0.01, 0.99].]







[Figure 3.3: A plot of typical sequence elements for Example 3.5.]



The proof is really an immediate consequence of Definition 3.4, and so is omitted 
here. 



Example 3.5  Suppose that

x_n(t) = t/(1 + nt^2),   t ∈ R and n ∈ N.

A sketch of x_n(t) for various n appears in Fig. 3.3. We note that

dx_n(t)/dt = [(1 + nt^2) · 1 - t · (2nt)]/[1 + nt^2]^2 = (1 - nt^2)/[1 + nt^2]^2 = 0

for t = ±1/√n. We see that

x_n(±1/√n) = ±1/(2√n).

We also see that x_n → 0. So then

M_n = sup_{t∈R} |x_n(t)| = 1/(2√n).

Clearly, M_n → 0 as n → ∞. Therefore, via Theorem 3.2, we immediately conclude
that x_n → x uniformly on the real number line.
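Theorem 3.2 is easy to check numerically for this example. The sketch below (ours;
the grid is only an approximation to the supremum over all of R) estimates M_n =
sup_t |x_n(t)| on a fine grid and compares it with the exact value 1/(2√n).

import numpy as np

t = np.linspace(-10.0, 10.0, 200001)       # a dense grid standing in for t in R

for n in [1, 10, 100, 1000]:
    xn = t / (1.0 + n * t**2)
    M_n = np.max(np.abs(xn))               # numerical estimate of sup_t |x_n(t)|
    print(n, M_n, 1.0 / (2.0 * np.sqrt(n)))   # estimate vs. the exact value 1/(2*sqrt(n))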



3.4 FOURIER SERIES 

The Fourier series expansion was introduced briefly in Chapter 1, where the behav- 
ior of this series with respect to its convergence properties was not mentioned. In 
this section we shall demonstrate the pointwise convergence of the Fourier series 




by the analysis of a particular example. Much of what follows is from Walter [17]. 
However, of necessity, the present treatment is not so rigorous. 
Suppose that

g(t) = (1/π)(π - t),   0 < t < 2π.     (3.19)

The reader is strongly invited to show that this has Fourier series expansion

g(t) = (2/π) Σ_{k=1}^{∞} (1/k) sin(kt).     (3.20)

The procedure for doing this closely follows Example 1.20. In our analysis to
follow, it will be easier to work with

f(t) = (π/2) g(t) = Σ_{k=1}^{∞} (1/k) sin(kt).     (3.21)

Define the sequence of partial sums

S_n(t) = Σ_{k=1}^{n} (1/k) sin(kt)     (n ∈ N).     (3.22)

So we infer that

lim_{n→∞} S_n(t) = f(t),     (3.23)

but we do not know in what sense the partial sums tend to f(t). Is convergence
pointwise, or uniform?

We shall need the special function

D_n(t) = (1/π)[1/2 + Σ_{k=1}^{n} cos(kt)] = (1/(2π)) sin((n + 1/2)t)/sin(t/2).     (3.24)

This function is called the Dirichlet kernel. The second equality in (3.24) is not
obvious. We will prove it. Consider that

sin(t/2)(πD_n(t)) = (1/2) sin(t/2) + Σ_{k=1}^{n} sin(t/2) cos(kt)
                  = (1/2) sin(t/2) + (1/2) Σ_{k=1}^{n} [sin((k + 1/2)t) + sin((1/2 - k)t)],     (3.25)






where we have used the identity sin a cos b = (1/2) sin(a + b) + (1/2) sin(a - b). By
expanding the sums and looking for cancellations,

Σ_{k=1}^{n} [sin((k + 1/2)t) + sin((1/2 - k)t)] = sin((n + 1/2)t) - sin(t/2).     (3.26)

Applying (3.26) in (3.25), we obtain

sin(t/2)(πD_n(t)) = (1/2) sin((n + 1/2)t),

so immediately

D_n(t) = (1/(2π)) sin((n + 1/2)t)/sin(t/2),

and this establishes (3.24). Using the identity sin(a + b) = sin a cos b + cos a sin b,
we may also write

D_n(t) = (1/(2π)) [sin(nt) cos(t/2)/sin(t/2) + cos(nt)].     (3.27)

For t > 0, using the form of the Dirichlet kernel in (3.27), we have

π ∫_0^t D_n(x) dx = ∫_0^t [sin(nx) cos(x/2)/(2 sin(x/2)) + (1/2) cos(nx)] dx
                 = ∫_0^t [sin(nx)/x] dx + ∫_0^t sin(nx) [cos(x/2)/(2 sin(x/2)) - 1/x] dx
                   + (1/2) ∫_0^t cos(nx) dx.     (3.28)



We are interested in what happens when t is a small positive value, but n is large.
To begin with, it is not difficult to see that

lim_{n→∞} (1/2) ∫_0^t cos(nx) dx = lim_{n→∞} (1/(2n)) sin(nt) = 0.     (3.29)

Less clearly,

lim_{n→∞} ∫_0^t sin(nx) [cos(x/2)/(2 sin(x/2)) - 1/x] dx = 0     (3.30)




(take this for granted). Through a simple change of variable,

I(nt) = ∫_0^t [sin(nx)/x] dx = ∫_0^{nt} [sin x / x] dx.     (3.31)

In fact,

∫_0^∞ (sin x / x) dx = π/2.     (3.32)

This is not obvious, either. The result may be found in integral tables [18, p. 483].
In other words, even for very small t, I(nt) does not go to zero as n increases.
Consequently, using (3.29), (3.30), and (3.31) in (3.28), we have (for big n)

π ∫_0^t D_n(x) dx ≈ I(nt).     (3.33)



The results in the previous paragraph help in the following manner. Begin by
noting that

S_n(t) = Σ_{k=1}^{n} (1/k) sin(kt) = Σ_{k=1}^{n} ∫_0^t cos(kx) dx = ∫_0^t [ Σ_{k=1}^{n} cos(kx) ] dx
       = ∫_0^t [ 1/2 + Σ_{k=1}^{n} cos(kx) ] dx - (1/2)t
       = π ∫_0^t D_n(x) dx - (1/2)t     [via (3.24)].     (3.34)

So from (3.33),

S_n(t) ≈ I(nt) - (1/2)t.     (3.35)

Define the sequence t_n = π/n. Consequently,

S_n(t_n) ≈ I(π) - π/(2n).     (3.36)

As n → ∞, t_n → 0, and S_n(t_n) → I(π). We can say that for big n

S_n(0+) ≈ I(π).     (3.37)

Now, f(0+) = π/2, so for big n

S_n(0+)/f(0+) ≈ (2/π) ∫_0^π (sin x / x) dx ≈ 1.18.     (3.38)

Numerical integration is needed to establish this. This topic is the subject of a later
chapter, however.

We see that the sequence of partial sums S_n(0+) converges to a value bigger
than f(0+) as n → ∞. The approximation S_n(t) therefore tends to "overshoot"






the true value of f(t) for small t. This is called the Gibbs phenomenon, or Gibbs
overshoot. We observe that t = 0 is the place where f(t) has a discontinuity. This
tendency of the Fourier series to overshoot near discontinuities is entirely typical.
We note that f(π) = S_n(π) = 0 for all n ≥ 1. Thus, for any ε > 0

|f(π) - S_n(π)| < ε

for all n ≥ 1. The previous analysis for t = 0+, and this one for t = π, show that
N (in the definitions of convergence) depends on t. Convergence of the Fourier
series is therefore pointwise and not uniform. Generally, the Gibbs phenomenon is
a symptom of pointwise convergence.

We remark that the Gibbs phenomenon has an impact in the signal processing
applications of series expansions. Techniques for signal compression and signal
enhancement are often based on series expansions. The Gibbs phenomenon can
degrade the quality of decompressed or reconstructed signals. The phenomenon
is responsible for "ringing artifacts." This is one reason why the convergence
properties of series expansions are important to engineers.
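The overshoot is simple to observe numerically. The Python sketch below (ours; the
grid and the values of n are arbitrary choices) evaluates the partial sums S_n(t) of
(3.22) just to the right of the discontinuity at t = 0 and prints the ratio
max S_n(t)/f(0+), which stays near 1.18 no matter how large n becomes.

import numpy as np

def S(n, t):
    # Partial sum S_n(t) = sum_{k=1}^{n} sin(k t)/k of (3.22).
    k = np.arange(1, n + 1)
    return np.sum(np.sin(np.outer(t, k)) / k, axis=1)

t = np.linspace(1e-4, 0.5, 4000)        # a fine grid just to the right of the jump at t = 0
for n in [30, 100, 300, 1000]:
    overshoot = np.max(S(n, t)) / (np.pi / 2.0)   # f(0+) = pi/2
    print(n, overshoot)                 # approximately 1.18 for every n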

Figure 3.4 shows a plot of f(t), S_n(t), and the error

E_n(t) = S_n(t) - f(t).     (3.39)

The reader may confirm (3.38) directly from the plot in Fig. 3.4a.



[Figure 3.4 (two panels; legend: f(t), S_n(t) for n = 30): Plots of the Fourier series
expansion for f(t) in (3.21), S_n(t) [of (3.22)] for n = 30, and the error
E_n(t) = S_n(t) - f(t).]






[Figure 3.5: E_n(t) of (3.39) for different values of n.]



We conclude this section by remarking that

lim_{n→∞} ∫_0^{2π} |E_n(t)|^2 dt = 0;     (3.40)

that is, the energy of the error goes to zero in the limit as n goes to infinity.
However, the amplitude of the error in the vicinity of a discontinuity remains
unchanged in the limit. This is more clearly seen in Fig. 3.5, where the error is
displayed for different values of n. Of course, this fact agrees with our analysis.
Equation (3.40) is really a consequence of the fact that (recalling Chapter 1) S_n(t)
and f(t) are both in the space L^2(0, 2π). Rigorous proof of this is quite tough,
and so we omit the proof entirely.



3.5 TAYLOR SERIES

Assume that f(x) is real-valued and that x ∈ R. One way to define the derivative
of f(x) at x = x_0 is according to

f^(1)(x_0) = df(x)/dx |_{x=x_0} = lim_{x→x_0} [f(x) - f(x_0)]/(x - x_0).     (3.41)

The notation is

f^(n)(x) = d^n f(x)/dx^n     (3.42)

(so f^(0)(x) = f(x)). From (3.41), we obtain

f(x) ≈ f(x_0) + f^(1)(x_0)(x - x_0).     (3.43)

But how good is this approximation? Can we obtain a more accurate approximation
to f(x) if we know f^(n)(x_0) for n > 1? Again, what is the accuracy of the resulting
approximation? We consider these issues in this section.




Begin by recalling the following theorem.

Theorem 3.3: Mean-Value Theorem  If f(x) is continuous for x ∈ [a, b],
with a continuous derivative for x ∈ (a, b), then there is a number ξ ∈ (a, b) such
that

[f(b) - f(a)]/(b - a) = f^(1)(ξ).     (3.44)

Therefore, if a = x_0 and b = x, we must have

f(x) = f(x_0) + f^(1)(ξ)(x - x_0).     (3.45)

This expression is "exact," and so is in contrast with (3.43). Proof of Theo-
rem 3.3 may be found in, for example, Bers [19, p. 636]. Theorem 3.3 generalizes
to the following theorem.

Theorem 3.4: Generalized Mean-Value Theorem  Suppose that f(x) and
g(x) are continuous functions on x ∈ [a, b]. Assume f^(1)(x) and g^(1)(x) exist and
are continuous, and g^(1)(x) ≠ 0 for x ∈ (a, b). There is a number ξ ∈ (a, b) such
that

[f(b) - f(a)]/[g(b) - g(a)] = f^(1)(ξ)/g^(1)(ξ).     (3.46)

Once again the proof is omitted, but may be found in Bers [19, p. 637].

The tangent to f(x) at x = x_0 is given by t(x) = f(x_0) + f^(1)(x_0)(x - x_0)
(so t(x_0) = f(x_0) and t^(1)(x_0) = f^(1)(x_0)). We wish to consider t(x) to be an approx-
imation to f(x), so the error is

f(x) - t(x) = e(x),

or

f(x) = f(x_0) + f^(1)(x_0)(x - x_0) + e(x).     (3.47)

Thus

e(x)/(x - x_0) = [f(x) - f(x_0)]/(x - x_0) - f^(1)(x_0).

But

lim_{x→x_0} [f(x) - f(x_0)]/(x - x_0) = f^(1)(x_0),

so immediately

lim_{x→x_0} e(x)/(x - x_0) = 0.     (3.48)

From (3.47), e(x_0) = 0, and also from (3.47), we obtain

f^(1)(x) = f^(1)(x_0) + e^(1)(x),     (3.49)




so e^(1)(x_0) = 0. From (3.49), f^(2)(x) = e^(2)(x) (so we now assume that f(x) has
a second derivative, and we will also assume that it is continuous). Theorem 3.4
has the following corollary.

Corollary 3.1  Suppose that f(a) = g(a) = 0; then for all b ≠ a there is a ξ ∈
(a, b) such that

f(b)/g(b) = f^(1)(ξ)/g^(1)(ξ).     (3.50)

Now apply this corollary to f(x) = e(x), and g(x) = (x - x_0)^2, with a = x_0,
b = x. Thus, from (3.50),

e(x)/(x - x_0)^2 = e^(1)(τ)/[2(τ - x_0)]     (3.51)

(τ ∈ (x_0, x)). Apply the corollary once more to f(τ) = e^(1)(τ), and g(τ) =
2(τ - x_0):

e^(1)(τ)/[2(τ - x_0)] = (1/2) e^(2)(ξ) = (1/2) f^(2)(ξ)     (3.52)

(ξ ∈ (x_0, τ)). Apply (3.51) in (3.52):

e(x)/(x - x_0)^2 = (1/2) f^(2)(ξ),

or (for some ξ ∈ (x_0, x))

e(x) = (1/2) f^(2)(ξ)(x - x_0)^2.     (3.53)

If |f^(2)(t)| ≤ M_2 for t ∈ (x_0, x), then

|e(x)| ≤ (1/2!) M_2 |x - x_0|^2     (3.54)

and

f(x) = f(x_0) + f^(1)(x_0)(x - x_0) + e(x),     (3.55)

for which (3.54) is an upper bound on the size of the error involved in approxi-
mating f(x) using (3.43).

Example 3.6  Suppose that f(x) = √x; then

f^(1)(x) = (1/2) x^(-1/2),   f^(2)(x) = -(1/4) x^(-3/2).

Suppose that x_0 = 1, and x = x_0 + δx = 1 + δx. Thus, via (3.55),

√x = √(1 + δx) = 1 + f^(1)(1) δx + e(x) = 1 + (1/2)δx + e(x),

so if, for example, |δx| ≤ 1/2, then |f^(2)(x)| ≤ 2 (for x ∈ (1/2, 3/2)), so M_2 = 2,
and so

|e(x)| ≤ (δx)^2

via (3.54). This bound may be compared to the following table of values:

δx      √(1 + δx)    1 + (1/2)δx    e(x)       (δx)^2

-3/4    0.5000       0.6250         -0.1250    0.5625
-1/2    0.7071       0.7500         -0.0429    0.2500
 0      1.0000       1.0000          0.0000    0.0000
 1/2    1.2247       1.2500         -0.0253    0.2500
 3/4    1.3229       1.3750         -0.0521    0.5625

It is easy to see that indeed |e(x)| ≤ (δx)^2.
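The rows of this table are easy to reproduce. The short Python check below (ours)
evaluates √(1 + δx), the linear approximation 1 + (1/2)δx, the error e(x), and the
bound (δx)^2 for the same values of δx.

import math

def e_bound_check(dx):
    # Compare the actual linearization error for sqrt(1 + dx) about x0 = 1
    # with the bound |e(x)| <= (dx)**2 derived in Example 3.6.
    exact = math.sqrt(1.0 + dx)
    linear = 1.0 + 0.5 * dx            # f(x0) + f'(x0)*(x - x0)
    err = exact - linear
    return exact, linear, err, dx**2

for dx in [-0.75, -0.5, 0.0, 0.5, 0.75]:
    print(e_bound_check(dx))           # reproduces the rows of the table above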

We mention that Corollary 3.1 leads to l'Hopital's rule. It therefore allows us
to determine

lim_{x→a} f(x)/g(x)

when f(a) = g(a) = 0. We now digress briefly to consider this subject. The rule
applies if f(x) and g(x) are continuous at x = a, if f(x) and g(x) have continu-
ous derivatives at x = a, and if g^(1)(x) ≠ 0 near x = a, except perhaps at x = a.

l'Hopital's rule is as follows. If lim_{x→a} f(x) = lim_{x→a} g(x) = 0 and lim_{x→a} f^(1)(x)/g^(1)(x)
exists, then

lim_{x→a} f(x)/g(x) = lim_{x→a} f^(1)(x)/g^(1)(x).     (3.56)

The rationale is that from Corollary 3.1, for all x ≠ a there is a ξ between a and x such that
f(x)/g(x) = f^(1)(ξ)/g^(1)(ξ). So, if x is close to a, then ξ must also be close to a, and f^(1)(ξ)/g^(1)(ξ)
is close to its limit. l'Hopital's rule is also referred to as "the rule for evaluating
the indeterminate form 0/0." If it happens that f^(1)(a) = g^(1)(a) = 0, then one may
attempt l'Hopital's rule yet again; that is, if lim_{x→a} f^(2)(x)/g^(2)(x) exists, then

lim_{x→a} f(x)/g(x) = lim_{x→a} f^(2)(x)/g^(2)(x).

Example 3.7 Consider

lim_{x->0} (sin x - e^x + 1)/x^2 = lim_{x->0} [d/dx (sin x - e^x + 1)] / [d/dx (x^2)]
                                = lim_{x->0} (cos x - e^x)/(2x)
                                = lim_{x->0} (-sin x - e^x)/2 = -1/2,

for which the rule has been applied twice. Now consider instead 



lim_{x->0} (1 - 2x)/(2 + 4x) = lim_{x->0} [d/dx (1 - 2x)] / [d/dx (2 + 4x)] = lim_{x->0} (-2/4) = -1/2.

This is wrong! l'Hopital's rule does not apply here because f(0) = 1 and g(0) = 2 (i.e., we do not have f(0) = g(0) = 0 as needed by the theory).

The rule can be extended to cover other indeterminate forms (e.g., infinity/infinity). For example, consider

lim_{x->0^+} x log_e x = lim_{x->0^+} (log_e x)/(1/x) = lim_{x->0^+} (1/x)/(-1/x^2) = lim_{x->0^+} (-x) = 0.



An interesting case is that of finding

lim_{x->infinity} (1 + 1/x)^x.

This is an indeterminate of the form 1^infinity. Consider

lim_{x->infinity} log_e (1 + 1/x)^x = lim_{x->infinity} [log_e (1 + 1/x)] / (1/x)
  = lim_{x->infinity} [ (1/(1 + 1/x)) (-1/x^2) ] / (-1/x^2)
  = lim_{x->infinity} 1/(1 + 1/x) = 1.

The logarithm and exponential functions are continuous functions, so it happens to be the case that

lim_{x->infinity} log_e (1 + 1/x)^x = log_e [ lim_{x->infinity} (1 + 1/x)^x ],

that is, the limit and the logarithm can be interchanged. Thus

e^1 = lim_{x->infinity} (1 + 1/x)^x,

so finally we have

lim_{x->infinity} (1 + 1/x)^x = e.   (3.57)



More generally, it can be shown that

lim_{n->infinity} (1 + x/n)^n = e^x.   (3.58)

This result has various applications, including some in probability theory relating to Poisson and exponential random variables [20]. An alternative derivation of (3.57) appears on pp. 64-65 of Rudin [2], but involves the use of the Maclaurin series expansion for e. We revisit the Maclaurin series for e^x later.
We have demonstrated that for suitable xi in (x_0, x)

f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + (1/2) f^{(2)}(xi)(x - x_0)^2

(recall (3.55)). Define

p(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + (1/2) f^{(2)}(x_0)(x - x_0)^2,   (3.59)

so this is some approximation to f(x) near x = x_0. Equation (3.43) is a linear approximation to f(x), and (3.59) is a quadratic approximation to f(x). Once again, we wish to consider the error

f(x) - p(x) = e(x).   (3.60)

We note that

p(x_0) = f(x_0),   p^{(1)}(x_0) = f^{(1)}(x_0),   p^{(2)}(x_0) = f^{(2)}(x_0).   (3.61)

In other words, the approximation to f(x) in (3.59) matches the function and its first two derivatives at x = x_0. Because of (3.61), via (3.60),

e(x_0) = e^{(1)}(x_0) = e^{(2)}(x_0) = 0,   (3.62)

and so via (3.59) and (3.60)

e^{(3)}(x) = f^{(3)}(x)   (3.63)

(because p^{(3)}(x) = 0, since p(x) is a quadratic in x). As in the derivation of (3.53), we may repeatedly apply Corollary 3.1:

e(x)/(x - x_0)^3 = e^{(1)}(tau_1)/[3(tau_1 - x_0)^2]              for tau_1 in (x_0, x),
e^{(1)}(tau_1)/[3(tau_1 - x_0)^2] = e^{(2)}(tau_2)/[3*2(tau_2 - x_0)]   for tau_2 in (x_0, tau_1),
e^{(2)}(tau_2)/[3*2(tau_2 - x_0)] = e^{(3)}(xi)/(3*2)              for xi in (x_0, tau_2),

which together yield

e(x)/(x - x_0)^3 = f^{(3)}(xi)/(3*2)   for xi in (x_0, x),

or

e(x) = (1/3!) f^{(3)}(xi)(x - x_0)^3   (3.64)

for some xi in (x_0, x). Thus

f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + (1/2!) f^{(2)}(x_0)(x - x_0)^2 + e(x).   (3.65)

Analogously to (3.54), if |f^{(3)}(t)| <= M_3 for t in (x_0, x), then we have the error bound

|e(x)| <= (1/3!) M_3 |x - x_0|^3.   (3.66)

We have gone from a linear approximation to f(x) to a quadratic approximation to f(x). All of this suggests that we may generalize to a degree-n polynomial approximation to f(x). Therefore, we define

p_n(x) = \sum_{k=0}^{n} p_{n,k} (x - x_0)^k,   (3.67)

where

p_{n,k} = (1/k!) f^{(k)}(x_0).   (3.68)

Then

f(x) = p_n(x) + e_{n+1}(x),   (3.69)

where the error term is

e_{n+1}(x) = [1/(n+1)!] f^{(n+1)}(xi)(x - x_0)^{n+1}   (3.70)

for suitable xi in (x_0, x). We call p_n(x) the Taylor polynomial of degree n. This polynomial is the approximation to f(x), and the error e_{n+1}(x) in (3.70) can be formally obtained by the repeated application of Corollary 3.1. These details are omitted. Expanding (3.69), we obtain

f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + (1/2!) f^{(2)}(x_0)(x - x_0)^2 + ... + (1/n!) f^{(n)}(x_0)(x - x_0)^n + [1/(n+1)!] f^{(n+1)}(xi)(x - x_0)^{n+1},   (3.71)

which is the familiar Taylor formula for f(x). We remark that

f^{(k)}(x_0) = p_n^{(k)}(x_0)   (3.72)

for k = 0, 1, ..., n - 1, n. So we emphasize that the approximation p_n(x) to f(x) is based on forcing p_n(x) to match the first n derivatives of f(x), as well as enforcing p_n(x_0) = f(x_0). If |f^{(n+1)}(t)| <= M_{n+1} for all t in I [interval I contains (x_0, x)], then

|e_{n+1}(x)| <= [1/(n+1)!] M_{n+1} |x - x_0|^{n+1}.   (3.73)

If all derivatives of f(x) exist and are continuous, then we have the Taylor series expansion of f(x), namely, the infinite series

f(x) = \sum_{k=0}^{infinity} (1/k!) f^{(k)}(x_0)(x - x_0)^k.   (3.74)

The Maclaurin series expansion is a special case of (3.74) for x_0 = 0:

f(x) = \sum_{k=0}^{infinity} (1/k!) f^{(k)}(0) x^k.   (3.75)

If we retain only terms k = 0 to k = n in the infinite series (3.74) and (3.75), we know that e_{n+1}(x) gives the error in the resulting approximation. This error may be called the truncation error (since it arises from truncation of the infinite series to a finite number of terms). Now we consider some examples.
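Before the worked examples, the following short Python sketch (ours, not part of the original text; the function name taylor_poly_exp is illustrative) evaluates the Taylor polynomial (3.67)-(3.68) for f(x) = e^x about x_0 = 0 and compares the actual truncation error with the bound (3.73), taking M_{n+1} = e^x, which bounds every derivative of e^x on [0, x] for x > 0.

    import math

    def taylor_poly_exp(x, x0, n):
        # p_n(x) = sum_{k=0}^{n} f^(k)(x0)/k! (x - x0)^k with f = exp,
        # so every derivative at x0 equals exp(x0); see (3.67)-(3.68).
        return sum(math.exp(x0) / math.factorial(k) * (x - x0) ** k
                   for k in range(n + 1))

    x0, x, n = 0.0, 1.5, 6
    pn = taylor_poly_exp(x, x0, n)
    actual_err = abs(math.exp(x) - pn)
    # Bound (3.73): |e_{n+1}(x)| <= M_{n+1} |x - x0|^{n+1} / (n+1)!
    bound = math.exp(x) * abs(x - x0) ** (n + 1) / math.factorial(n + 1)
    print(pn, actual_err, bound)

The printed actual error should always be no larger than the printed bound, as (3.73) requires.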
First recall the binomial theorem:

(a + x)^n = \sum_{k=0}^{n} C(n, k) x^k a^{n-k},   (3.76)

where

C(n, k) = n! / [k!(n - k)!].   (3.77)

In (3.76) we emphasize that n is in Z^+. But we can use Taylor's formula to obtain an expression for (a + x)^alpha when alpha != 0, and alpha is not necessarily an element of Z^+. Let us consider the special case

f(x) = (1 + x)^alpha,

for which (if k >= 1)

f^{(k)}(x) = alpha(alpha - 1)(alpha - 2)...(alpha - k + 1)(1 + x)^{alpha - k}.   (3.78)

These derivatives are guaranteed to exist, provided x > -1. We will assume this restriction always applies. So, in particular,

f^{(k)}(0) = alpha(alpha - 1)(alpha - 2)...(alpha - k + 1),   (3.79)

giving the Maclaurin expansion

(1 + x)^alpha = 1 + \sum_{k=1}^{n} (1/k!)[alpha(alpha - 1)...(alpha - k + 1)] x^k + [1/(n+1)!][alpha(alpha - 1)...(alpha - n)](1 + xi)^{alpha - n - 1} x^{n+1}   (3.80)

for some xi in (0, x). We may extend the definition (3.77), that is, define

C(alpha, 0) = 1,   C(alpha, k) = alpha(alpha - 1)...(alpha - k + 1)/k!   (k >= 1),   (3.81)

so that (3.80) becomes

(1 + x)^alpha = \sum_{k=0}^{n} C(alpha, k) x^k + C(alpha, n + 1)(1 + xi)^{alpha - n - 1} x^{n+1},   (3.82)

where the sum is p_n(x) and the last term is e_{n+1}(x), for x > -1.



Example 3.8 We wish to compute [1.03]^{1/3} with n = 2 in (3.82), and to estimate the error involved in doing so. We have x = 0.03, alpha = 1/3, and xi in (0, 0.03). Therefore, from (3.82), [1 + x]^{1/3} is approximated by the Taylor polynomial

p_2(x) = 1 + C(1/3, 1) x + C(1/3, 2) x^2 = 1 + (1/3)x - (1/9)x^2,

so

[1.03]^{1/3} ~= p_2(0.03) = 1.009900000,

but [1.03]^{1/3} = 1.009901634, so e_3(x) = 1.634 x 10^{-6}. From (3.82),

e_3(x) = C(1/3, 3)(1 + xi)^{1/3 - 3} x^3 = (5/81)(1 + xi)^{-8/3} x^3,

and so e_3(0.03) = (5/3) x 10^{-6} (1 + xi)^{-8/3}. Since 0 < xi < 0.03, we have

1.5403 x 10^{-6} <= e_3(0.03) <= 1.6667 x 10^{-6}.

The actual error is certainly within this range.
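A quick numerical check of Example 3.8, as a minimal Python sketch (illustrative only, not the book's code):

    # [1.03]^(1/3) via the n = 2 binomial series (3.82)
    x, alpha = 0.03, 1.0 / 3.0
    c1 = alpha                          # C(alpha, 1) from (3.81)
    c2 = alpha * (alpha - 1.0) / 2.0    # C(alpha, 2) from (3.81)
    p2 = 1.0 + c1 * x + c2 * x ** 2     # = 1 + x/3 - x^2/9
    exact = 1.03 ** (1.0 / 3.0)
    print(p2, exact, exact - p2)        # error is about 1.6e-6, inside the stated range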

If f(x) = 1/(1 + x), and if x_0 = 0, then

1/(1 + x) = \sum_{k=0}^{n} (-1)^k x^k + (-1)^{n+1} x^{n+1}/(1 + x),   (3.83)

where the last term is the remainder r(x). This may be seen by recalling that

\sum_{k=0}^{n} a^k = (1 - a^{n+1})/(1 - a)   (a != 1),   (3.84)

so \sum_{k=0}^{n} (-x)^k = [1 - (-1)^{n+1} x^{n+1}]/(1 + x), and thus

\sum_{k=0}^{n} (-x)^k + (-1)^{n+1} x^{n+1}/(1 + x) = [1 - (-1)^{n+1} x^{n+1}]/(1 + x) + (-1)^{n+1} x^{n+1}/(1 + x) = 1/(1 + x).

This confirms (3.83). We observe that the remainder term r(x) in (3.83) is not given by e_{n+1}(x) in (3.82). We have obtained an exact expression for the remainder using elementary methods.

Now, from (3.83), we have

1/(1 + t) = 1 - t + t^2 - t^3 + ... + (-1)^{n-1} t^{n-1} + (-1)^n t^n/(1 + t),

and we see immediately that

log_e(1 + x) = \int_0^x dt/(1 + t) = x - (1/2)x^2 + (1/3)x^3 - ... + (-1)^{n-1}(1/n)x^n + (-1)^n \int_0^x t^n/(1 + t) dt,   (3.85)

where the last term is the remainder r(x). For x >= 0 and 0 <= t <= x we have 1/(1 + t) <= 1, implying that

0 <= \int_0^x t^n/(1 + t) dt <= \int_0^x t^n dt = x^{n+1}/(n + 1)   (x >= 0).

For -1 < x < 0 with x <= t <= 0, we have

1/(1 + t) <= 1/(1 + x) = 1/(1 - |x|),

so

| \int_0^x t^n/(1 + t) dt | <= [1/(1 - |x|)] | \int_0^x t^n dt | = |x|^{n+1}/[(1 - |x|)(n + 1)].

Consequently, we may conclude that

|r(x)| <= x^{n+1}/(n + 1)                      for x >= 0,
|r(x)| <= |x|^{n+1}/[(1 - |x|)(n + 1)]          for -1 < x < 0.   (3.86)

Equation (3.85) gives us a means to compute logarithms, and (3.86) gives us a bound on the error.
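A small Python sketch (ours, assuming only the standard math library) that computes log_e(1 + x) from the partial sum in (3.85) and compares the actual error with the bound (3.86):

    import math

    def log1p_series(x, n):
        # Partial sum of (3.85): x - x^2/2 + x^3/3 - ... + (-1)^(n-1) x^n / n
        return sum((-1) ** (k - 1) * x ** k / k for k in range(1, n + 1))

    def remainder_bound(x, n):
        # Bound (3.86) on |r(x)|
        if x >= 0:
            return x ** (n + 1) / (n + 1)
        return abs(x) ** (n + 1) / ((1 - abs(x)) * (n + 1))

    x, n = 0.5, 10
    approx = log1p_series(x, n)
    print(approx, math.log(1 + x),
          abs(math.log(1 + x) - approx), remainder_bound(x, n))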

Now consider (3.83) with x replaced by x^2:

1/(1 + x^2) = \sum_{k=0}^{n} (-1)^k x^{2k} + (-1)^{n+1} x^{2n+2}/(1 + x^2).   (3.87)

Replacing n with n - 1, replacing x with t, and expanding, this becomes

1/(1 + t^2) = 1 - t^2 + t^4 - ... + (-1)^{n-1} t^{2n-2} + (-1)^n t^{2n}/(1 + t^2),

where, on integrating, we obtain

tan^{-1} x = \int_0^x dt/(1 + t^2) = x - (1/3)x^3 + (1/5)x^5 - (1/7)x^7 + ... + (-1)^{n-1} x^{2n-1}/(2n - 1) + (-1)^n \int_0^x t^{2n}/(1 + t^2) dt,   (3.88)

where the last term is the remainder r(x). Because 1/(1 + t^2) <= 1 for all t in R, it follows that

|r(x)| <= |x|^{2n+1}/(2n + 1).   (3.89)

We now have a method of computing pi. Since pi/4 = tan^{-1}(1), we have

pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... + (-1)^{n-1}/(2n - 1) + r(1),   (3.90)

and

|r(1)| <= 1/(2n + 1).   (3.91)

Using (3.90) to compute pi is not efficient with respect to the number of arithmetic operations needed (i.e., it is not computationally efficient). This is because to achieve an accuracy of about 1/n requires about n/2 terms in the series [which follows from (3.91)]. However, if x is small (i.e., close to zero), then series (3.88) converges relatively quickly. Observe that

tan^{-1}[(x + y)/(1 - xy)] = tan^{-1} x + tan^{-1} y.   (3.92)

Suppose that x = 1/2 and y = 1/3; then (x + y)/(1 - xy) = 1, so

pi/4 = tan^{-1}(1/2) + tan^{-1}(1/3).   (3.93)

It is actually faster to compute tan^{-1}(1/2) and tan^{-1}(1/3) using (3.88), and from these obtain pi using (3.93), than to compute tan^{-1}(1) directly. In fact, this approach (a type of "divide and conquer" method) can be taken further by noting that

tan^{-1}(1/2) = tan^{-1}(1/3) + tan^{-1}(1/7),   tan^{-1}(1/3) = tan^{-1}(1/5) + tan^{-1}(1/8),

implying that

pi/4 = 2 tan^{-1}(1/3) + tan^{-1}(1/7) = 2 tan^{-1}(1/5) + tan^{-1}(1/7) + 2 tan^{-1}(1/8).   (3.94)
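The speedup from the identities (3.93)-(3.94) can be seen with a short Python sketch (illustrative, not the book's code); arctan_series is our name for a helper that forms the partial sums of (3.88):

    import math

    def arctan_series(x, n):
        # Partial sum of (3.88): x - x^3/3 + ... + (-1)^(n-1) x^(2n-1)/(2n-1);
        # by (3.89) the truncation error is at most |x|^(2n+1)/(2n+1).
        return sum((-1) ** (k - 1) * x ** (2 * k - 1) / (2 * k - 1)
                   for k in range(1, n + 1))

    n = 10
    pi_direct = 4.0 * arctan_series(1.0, n)   # error of order 4/(2n+1), quite poor
    pi_split = 4.0 * (arctan_series(0.5, n) + arctan_series(1.0 / 3.0, n))   # via (3.93)
    print(pi_direct, pi_split, math.pi)

With the same number of terms, the split computation is accurate to many digits while the direct one is not.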



Now consider f(x) = e^x. Since f^{(k)}(x) = e^x for all k in Z^+, we have for x_0 = 0 the Maclaurin series expansion

e^x = \sum_{k=0}^{infinity} x^k/k!.   (3.95)

This is theoretically valid for -infinity < x < infinity. We have employed this series before in various ways. We now consider it as a computational tool for calculating e^x. Appendix 3.C is based on a famous example in Forsythe et al. [21, pp. 14-16]. This example shows that series expansions must be implemented on computers with rather great care. Specifically, Appendix 3.C shows what can happen when we compute e^{-20} by the direct implementation of the series (3.95). Using MATLAB as stated, e^{-20} ~= 4.1736 x 10^{-9}, which is based on keeping terms k = 0 to 88 (inclusive) of (3.95). Using additional terms will have no effect on the final answer as they are too small. However, the correct value is actually e^{-20} = 2.0612 x 10^{-9}, as may be verified using the MATLAB exponential function, or using a typical pocket calculator. Our series approximation has resulted in an answer possessing no significant digits at all. What went wrong? Many of the terms in the series are orders of magnitude bigger than the final result and typically possess rounding errors about as big as the final answer. The phenomenon is called catastrophic cancellation (or catastrophic convergence). As Forsythe et al. [21] stated, "It is important to realize that this great cancellation is not the cause of error in the answer; it merely magnifies the error already present in the terms." Catastrophic cancellation can in principle be eliminated by carrying more significant digits in the computation. However, this is costly with respect to computing resources. In the present problem a cheap and very simple solution is to compute e^{20} using (3.95), and then take the reciprocal, i.e., use e^{-20} = 1/e^{20}.
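The effect, and the reciprocal remedy, can be reproduced with a minimal Python sketch (ours; the exact digits obtained depend on the machine arithmetic and summation order, so they may differ slightly from the MATLAB run shown in Appendix 3.C):

    import math

    def exp_series(x, terms=150):
        # Direct summation of the Maclaurin series (3.95) in double precision
        s, term = 0.0, 1.0
        for k in range(terms):
            s += term
            term *= x / (k + 1)   # term is now x^(k+1)/(k+1)!
        return s

    bad = exp_series(-20.0)         # catastrophic cancellation: typically no correct digits
    good = 1.0 / exp_series(20.0)   # sum the series for e^20, then take the reciprocal
    print(bad, good, math.exp(-20.0))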
An important special function is the gamma function:

Gamma(z) = \int_0^{infinity} x^{z-1} e^{-x} dx.   (3.96)

Here, we assume z in R. This is an improper integral, so we are left to wonder whether

lim_{M->infinity} \int_0^M x^{z-1} e^{-x} dx

exists. It turns out that the integral (3.96) converges for z > 0, but diverges for z <= 0. The proof is slightly tedious, and so we will omit it [22, pp. 273-274]. If z = n in N, then consider

Gamma(n) = \int_0^{infinity} x^{n-1} e^{-x} dx.   (3.97)

Now

Gamma(n + 1) = \int_0^{infinity} x^n e^{-x} dx = lim_{M->infinity} \int_0^M x^n e^{-x} dx = lim_{M->infinity} [ -x^n e^{-x} |_0^M + n \int_0^M x^{n-1} e^{-x} dx ]

(via \int u dv = uv - \int v du, i.e., integration by parts). Therefore

Gamma(n + 1) = n \int_0^{infinity} x^{n-1} e^{-x} dx = n Gamma(n).   (3.98)

We see that Gamma(1) = \int_0^{infinity} e^{-x} dx = [-e^{-x}]_0^{infinity} = 1. Thus, Gamma(n + 1) = n!. Using the gamma function in combination with (3.95), we may obtain Stirling's formula

n! ~= sqrt(2 pi) n^{n + 1/2} e^{-n},   (3.99)

which is a good approximation to n! if n is big. The details of a rigorous derivation of this are tedious, so we give only an outline presentation. Begin by noting that

n! = \int_0^{infinity} x^n e^{-x} dx = \int_0^{infinity} e^{n ln x - x} dx.

Let x = n + y, so

n! = \int_{-n}^{infinity} e^{n ln(n + y) - n - y} dy.

Now, since n ln(n + y) = n ln[n(1 + y/n)] = n ln n + n ln(1 + y/n), we have

n! = \int_{-n}^{infinity} e^{n ln n + n ln(1 + y/n) - n - y} dy = n^n e^{-n} \int_{-n}^{infinity} e^{n ln(1 + y/n) - y} dy.

Using (3.85), that is,

ln(1 + y/n) = y/n - y^2/(2n^2) + y^3/(3n^3) - ...,

we have

n ln(1 + y/n) - y = -y^2/(2n) + y^3/(3n^2) - ...,

so

n! = n^n e^{-n} \int_{-n}^{infinity} e^{-y^2/(2n) + y^3/(3n^2) - ...} dy.

If now y = sqrt(n) v, then dy = sqrt(n) dv, and so

n! = n^{n + 1/2} e^{-n} \int_{-sqrt(n)}^{infinity} e^{-v^2/2 + v^3/(3 sqrt(n)) - ...} dv.

So if n is big, then

n! ~= n^{n + 1/2} e^{-n} \int_{-infinity}^{infinity} e^{-v^2/2} dv.

If we accept that

\int_{-infinity}^{infinity} e^{-x^2/2} dx = sqrt(2 pi),

then immediately we have

n! ~= sqrt(2 pi) n^{n + 1/2} e^{-n},   (3.100)

and the formula is now established. Stirling's formula is very useful in statistical mechanics (e.g., in deriving the Fermi-Dirac distribution of fermion particle energies, and this in turn is important in understanding the operation of solid-state electronic devices at a physical level).
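A quick numerical comparison of (3.99) with the exact factorial (an illustrative Python sketch, not part of the original text):

    import math

    # Compare Stirling's formula (3.99) with the exact factorial.
    for n in (5, 10, 20):
        stirling = math.sqrt(2.0 * math.pi) * n ** (n + 0.5) * math.exp(-n)
        print(n, math.factorial(n), stirling, stirling / math.factorial(n))

The ratio approaches 1 as n grows, confirming that the relative error of (3.99) shrinks for large n.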

Another important special function is

g(x) = [1/sqrt(2 pi sigma^2)] exp[ -(x - m)^2/(2 sigma^2) ],   -infinity < x < infinity,   (3.101)

which is the Gaussian function (or Gaussian pulse). This function is of immense importance in probability theory [20], and is also involved in the uncertainty principle in signal processing and quantum mechanics [23]. A sketch of g(x) for m = 0, with sigma^2 = 1, and sigma^2 = 0.1, appears in Fig. 3.6. For m = 0 and sigma^2 = 1, the standard form pulse

f(x) = [1/sqrt(2 pi)] e^{-x^2/2}   (3.102)

is sometimes defined [20]. In this case we observe that g(x) = (1/sigma) f((x - m)/sigma). We will show that

\int_0^{infinity} e^{-x^2} dx = (1/2) sqrt(pi),   (3.103)

which can be used to obtain (3.100) by a simple change of variable. From [22] (p. 262) we have



[Figure 3.6 Plots of two Gaussian pulses: g(x) in (3.101) for m = 0, with sigma^2 = 1 and sigma^2 = 0.1.]



Theorem 3.5: Let lim_{x->infinity} x^p f(x) = A. Then

1. \int_a^{infinity} f(x) dx converges if p > 1 and -infinity < A < infinity.
2. \int_a^{infinity} f(x) dx diverges if p <= 1 and A != 0 (A may be infinite).

We see that lim_{x->infinity} x^2 e^{-x^2} = 0 (perhaps via l'Hopital's rule). So in Theorem 3.5, f(x) = e^{-x^2} and p = 2, with A = 0, and so \int_0^{infinity} e^{-x^2} dx converges. Define

I_M = \int_0^M e^{-x^2} dx = \int_0^M e^{-y^2} dy,

and let lim_{M->infinity} I_M = I. Then

I_M^2 = \int\int_{R_M} e^{-(x^2 + y^2)} dx dy,

for which R_M is the square OASC in Fig. 3.7. This square has sides of length M. Since e^{-(x^2 + y^2)} >= 0, we obtain

\int\int_{R_I} e^{-(x^2 + y^2)} dx dy <= I_M^2 <= \int\int_{R_II} e^{-(x^2 + y^2)} dx dy,   (3.104)



[Figure 3.7 Regions used to establish (3.103).]

where R_I is the region in the first quadrant bounded by a circle of radius M. Similarly, R_II is the region in the first quadrant bounded by a circle of radius sqrt(2) M. Using polar coordinates, r^2 = x^2 + y^2 and dx dy = r dr dphi, so (3.104) becomes

\int_{phi=0}^{pi/2} \int_{r=0}^{M} e^{-r^2} r dr dphi <= I_M^2 <= \int_{phi=0}^{pi/2} \int_{r=0}^{sqrt(2) M} e^{-r^2} r dr dphi.   (3.105)

Since -(d/dx)(1/2)e^{-x^2} = x e^{-x^2}, we have \int_0^M r e^{-r^2} dr = -(1/2)[e^{-x^2}]_0^M = (1/2)[1 - e^{-M^2}]. Thus, (3.105) reduces to

(pi/4)[1 - e^{-M^2}] <= I_M^2 <= (pi/4)[1 - e^{-2M^2}].   (3.106)

If we now allow M -> infinity in (3.106), then I_M^2 -> pi/4, implying that I^2 = pi/4, or I = (1/2) sqrt(pi). This confirms (3.103).

In probability theory it is quite important to be able to compute functions such as the error function

erf(x) = (2/sqrt(pi)) \int_0^x e^{-t^2} dt.   (3.107)

This has wide application in digital communications system analysis, for example. No closed-form^4 expression for (3.107) exists. We may therefore try to compute (3.107) using series expansions. In particular, we may try working with the Maclaurin series expansion for e^x:

erf(x) = (2/sqrt(pi)) \int_0^x [ \sum_{k=0}^{infinity} (-1)^k t^{2k}/k! ] dt
       = (2/sqrt(pi)) \sum_{k=0}^{infinity} [(-1)^k/k!] \int_0^x t^{2k} dt
       = (2/sqrt(pi)) \sum_{k=0}^{infinity} (-1)^k x^{2k+1}/[k!(2k + 1)].   (3.108)

However, to arrive at this expression, we had to integrate an infinite series term by term. It is not obvious that we can do this. When is this justified?

A power series is any series of the form

f(x) = \sum_{k=0}^{infinity} a_k (x - x_0)^k.   (3.109)

Clearly, Taylor and Maclaurin series are all examples of power series. We have the following theorem.

[Footnote 4:] A closed-form expression is simply a "nice" formula typically involving more familiar functions such as sines, cosines, tangents, polynomials, and exponential functions.

Theorem 3.6: Given the power series (3.109), there is an R >= 0 (which may be R = +infinity) such that the series is absolutely convergent for |x - x_0| < R, and is divergent for |x - x_0| > R. At x = x_0 + R and at x = x_0 - R, the series might converge or diverge.

Series (3.109) is absolutely convergent if the series

h(x) = \sum_{k=0}^{infinity} |a_k (x - x_0)^k|   (3.110)

converges. We remark that absolutely convergent series are convergent. This means that if (3.110) converges, then (3.109) also converges. (However, the converse is not necessarily true.) We also have the following theorem.

Theorem 3.7: If

f(x) = \sum_{k=0}^{infinity} a_k (x - x_0)^k   for |x - x_0| < R,

where R > 0 is the radius of convergence of the power series, then f(x) is continuous and differentiable in the interval of convergence x in (x_0 - R, x_0 + R), and

f^{(1)}(x) = \sum_{k=1}^{infinity} k a_k (x - x_0)^{k-1},   (3.111a)

\int_{x_0}^{x} f(t) dt = \sum_{k=0}^{infinity} [a_k/(k + 1)](x - x_0)^{k+1}.   (3.111b)

The series in Eqs. (3.111a,b) also have radius of convergence R.

As a consequence of Theorem 3.7, Eq. (3.108) is valid for -infinity < x < infinity (i.e., the radius of convergence is R = +infinity). This is because the Maclaurin expansion for e^x had R = +infinity.

Example 3.9 Here we will find an expression for the error involved in truncating the series for erf(x) in (3.108).

From (3.71), for some xi in [0, x] (interval endpoints may be included because of continuity of the function being approximated),

e^x = \sum_{k=0}^{n} x^k/k! + e_n(x),

where the sum is p_n(x) and

e_n(x) = [x^{n+1}/(n + 1)!] e^{xi}.

Thus, with x replaced by -t^2, so that for some xi with -t^2 <= xi <= 0,

e^{-t^2} = p_n(-t^2) + e_n(-t^2),

and hence

erf(x) = (2/sqrt(pi)) \int_0^x p_n(-t^2) dt + (2/sqrt(pi)) \int_0^x e_n(-t^2) dt = q_n(x) + epsilon_n(x),

where the polynomial

q_n(x) = (2/sqrt(pi)) \int_0^x [ \sum_{k=0}^{n} (-1)^k t^{2k}/k! ] dt = (2/sqrt(pi)) \sum_{k=0}^{n} (-1)^k x^{2k+1}/[k!(2k + 1)]

is the approximation, and we are interested in the error

epsilon_n(x) = erf(x) - q_n(x) = (2/sqrt(pi)) \int_0^x e_n(-t^2) dt.

Clearly,

epsilon_n(x) = (2/sqrt(pi)) [(-1)^{n+1}/(n + 1)!] \int_0^x t^{2n+2} e^{xi} dt,

where we recall that xi depends on t in that -t^2 <= xi <= 0. There is an integral mean-value theorem, which states that for f(t), g(t) in C[a, b] (and g(t) does not change sign on the interval [a, b]) there is a zeta in [a, b] such that

\int_a^b g(t) f(t) dt = f(zeta) \int_a^b g(t) dt.

Thus, there is a xi-bar in [-x^2, 0], giving

epsilon_n(x) = (2/sqrt(pi)) e^{xi-bar} (-1)^{n+1} x^{2n+3} / [(n + 1)!(2n + 3)].

Naturally the error expression in Example 3.9 can be used to estimate how many 
terms one must keep in the series expansion (3.108) in order to compute erf(x) to 
a desired accuracy. 
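A minimal Python sketch (ours) that sums the first n + 1 terms of (3.108) and compares the result, and the bound implied by Example 3.9 with e^{xi-bar} <= 1, against a library value of erf(x):

    import math

    def erf_series(x, n):
        # q_n(x) from Example 3.9: the first n + 1 terms of (3.108)
        s = sum((-1) ** k * x ** (2 * k + 1) / (math.factorial(k) * (2 * k + 1))
                for k in range(n + 1))
        return 2.0 / math.sqrt(math.pi) * s

    def erf_error_bound(x, n):
        # |epsilon_n(x)| from Example 3.9 with e^(xi-bar) <= 1 (since xi-bar <= 0)
        return (2.0 / math.sqrt(math.pi)
                * x ** (2 * n + 3) / (math.factorial(n + 1) * (2 * n + 3)))

    x, n = 1.0, 8
    print(erf_series(x, n), math.erf(x),
          abs(math.erf(x) - erf_series(x, n)), erf_error_bound(x, n))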






3.6 ASYMPTOTIC SERIES 

The Taylor series expansions of Section 3.5 might have a large radius of convergence, but practically speaking, if x is sufficiently far from x_0, then a great many terms may be needed in a computer implementation to converge to the correct solution with adequate accuracy. This is highly inefficient. Also, if many terms are to be retained, then rounding errors might accumulate and destroy the result. In other words, Taylor series approximations are really effective only for x sufficiently close to x_0 (i.e., "small x"). We therefore seek expansion methods that give good approximations for large values of the argument x. These are called asymptotic expansions, or asymptotic series. This section is just a quick introduction based mainly on Section 19.15 in Kreyszig [24]. Another source of information on asymptotic expansions, although applied mainly to problems involving differential equations, appears in Lakin and Sanchez [25].

Asymptotic expansions may take on different forms. That is, there are different "varieties" of such expansions. (This is apparent in Ref. 25.) However, we will focus on the following definition.

Definition 3.5: A series of the form

\sum_{k=0}^{infinity} c_k / x^k   (3.112)

for which c_k in R (real-valued constants), and x in R, is called an asymptotic expansion, or asymptotic series, of a function f(x) defined for all sufficiently large x if, for every n in Z^+,

x^n [ f(x) - \sum_{k=0}^{n} c_k / x^k ] -> 0   as x -> infinity,   (3.113)

and we shall then write

f(x) ~ \sum_{k=0}^{infinity} c_k / x^k.



It is to be emphasized that the series (3.112) need not converge for any x. The condition (3.113) suggests a possible method of finding the sequence (c_k). Specifically,

f(x) - c_0 -> 0,                         or  c_0 = lim_{x->infinity} f(x),
[f(x) - c_0 - c_1/x] x -> 0,             or  c_1 = lim_{x->infinity} [f(x) - c_0] x,
[f(x) - c_0 - c_1/x - c_2/x^2] x^2 -> 0, or  c_2 = lim_{x->infinity} [f(x) - c_0 - c_1/x] x^2,

or in general

c_n = lim_{x->infinity} [ f(x) - \sum_{k=0}^{n-1} c_k / x^k ] x^n   (3.114)

for n >= 1. However, this recursive procedure is seldom practical for generating more than the first few series coefficients. Of course, in some cases this might be all that is needed. We remark that Definition 3.5 can be usefully extended according to

f(x) ~ g(x) + h(x) \sum_{k=0}^{infinity} c_k / x^k,   (3.115)

for which

[f(x) - g(x)] / h(x) ~ \sum_{k=0}^{infinity} c_k / x^k.   (3.116)

The single most generally useful method for getting (c_k) is probably to use integration by parts. This is illustrated with examples.

Example 3.10 Recall erf(x) from (3.107). We would like to evaluate this function for large x [whereas the series in (3.108) is better suited for small x; see the error expression in Example 3.9]. In this regard it is preferable to work with the complementary error function

erfc(x) = 1 - erf(x) = (2/sqrt(pi)) \int_x^{infinity} e^{-t^2} dt.   (3.117)

We observe that erf(infinity) = 1 [via (3.103)]. Now let tau = t^2, so that dt = (1/2) tau^{-1/2} dtau. With this change of variable

erfc(x) = (1/sqrt(pi)) \int_{x^2}^{infinity} tau^{-1/2} e^{-tau} dtau.   (3.118)

Now observe that, via integration by parts, we have

\int_{x^2}^{infinity} tau^{-1/2} e^{-tau} dtau = [-tau^{-1/2} e^{-tau}]_{x^2}^{infinity} - (1/2) \int_{x^2}^{infinity} tau^{-3/2} e^{-tau} dtau
                                               = (1/x) e^{-x^2} - (1/2) \int_{x^2}^{infinity} tau^{-3/2} e^{-tau} dtau,

\int_{x^2}^{infinity} tau^{-3/2} e^{-tau} dtau = (1/x^3) e^{-x^2} - (3/2) \int_{x^2}^{infinity} tau^{-5/2} e^{-tau} dtau,

and so on. We observe that this process of successive integration by parts has generated integrals of the form

F_n(x) = \int_{x^2}^{infinity} tau^{-(2n+1)/2} e^{-tau} dtau   (3.119)

for n in Z^+, and we see that erfc(x) = (1/sqrt(pi)) F_0(x). So, if we apply integration by parts to F_n(x), then

F_n(x) = [-tau^{-(2n+1)/2} e^{-tau}]_{x^2}^{infinity} - [(2n + 1)/2] \int_{x^2}^{infinity} tau^{-(2n+3)/2} e^{-tau} dtau
       = x^{-(2n+1)} e^{-x^2} - [(2n + 1)/2] F_{n+1}(x),

so that we have the recursive expression

F_n(x) = e^{-x^2}/x^{2n+1} - [(2n + 1)/2] F_{n+1}(x),   (3.120)



which holds for n in Z^+. This may be rewritten as

e^{x^2} F_n(x) = 1/x^{2n+1} - [(2n + 1)/2] e^{x^2} F_{n+1}(x).   (3.121)

Repeated application of (3.121) yields

e^{x^2} F_0(x) = 1/x - (1/2) e^{x^2} F_1(x),
e^{x^2} F_0(x) = 1/x - 1/(2x^3) + [(1*3)/2^2] e^{x^2} F_2(x),
e^{x^2} F_0(x) = 1/x - 1/(2x^3) + (1*3)/(2^2 x^5) - [(1*3*5)/2^3] e^{x^2} F_3(x),

and so finally

e^{x^2} F_0(x) = 1/x - 1/(2x^3) + (1*3)/(2^2 x^5) - ... + (-1)^{n-1} [1*3*...*(2n - 3)]/(2^{n-1} x^{2n-1}) + (-1)^n [1*3*...*(2n - 1)]/2^n * e^{x^2} F_n(x),   (3.122)

where the sum of the terms through x^{2n-1} is denoted S_{2n-1}(x).

From this it appears that our asymptotic expansion is

e^{x^2} F_0(x) ~ 1/x - 1/(2x^3) + (1*3)/(2^2 x^5) - ... + (-1)^{n-1} [1*3*...*(2n - 3)]/(2^{n-1} x^{2n-1}) + ....   (3.123)

However, this requires confirmation. Define K_n = (-2)^{-n}[1*3*...*(2n - 1)]. From (3.122), we have

[e^{x^2} F_0(x) - S_{2n-1}(x)] x^{2n-1} = K_n e^{x^2} x^{2n-1} F_n(x).   (3.124)

We wish to show that for any fixed n = 1, 2, 3, ... the expression in (3.124) on the right of the equality goes to zero as x -> infinity. In (3.119) we have

1/tau^{(2n+1)/2} <= 1/x^{2n+1}   for all tau >= x^2,

which gives the bound

F_n(x) = \int_{x^2}^{infinity} e^{-tau}/tau^{(2n+1)/2} dtau <= (1/x^{2n+1}) \int_{x^2}^{infinity} e^{-tau} dtau = e^{-x^2}/x^{2n+1}.   (3.125)

But this implies that

|K_n| e^{x^2} x^{2n-1} F_n(x) <= |K_n| e^{x^2} x^{2n-1} e^{-x^2}/x^{2n+1} = |K_n|/x^2,

and |K_n|/x^2 -> 0 for x -> infinity. Thus, immediately, (3.123) is indeed the asymptotic expansion for e^{x^2} F_0(x). Hence

erfc(x) ~ (1/sqrt(pi)) e^{-x^2} [ 1/x - 1/(2x^3) + (1*3)/(2^2 x^5) - ... + (-1)^{n-1} [1*3*...*(2n - 3)]/(2^{n-1} x^{2n-1}) + ... ].   (3.126)
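A small Python sketch (ours, not the book's code) that evaluates the bracketed partial sums of (3.126) and compares them with a library value of erfc(x); as expected of an asymptotic series, a few terms suffice once x is moderately large:

    import math

    def erfc_asymptotic(x, n):
        # Partial sum of (3.126), keeping n terms of the bracketed series
        s, coeff = 0.0, 1.0
        for k in range(n):
            s += (-1) ** k * coeff / x ** (2 * k + 1)
            coeff *= (2 * k + 1) / 2.0   # builds 1*3*...*(2k+1) / 2^(k+1)
        return math.exp(-x * x) / math.sqrt(math.pi) * s

    for x in (2.0, 3.0, 4.0):
        print(x, erfc_asymptotic(x, 4), math.erfc(x))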



We recall from Section 3.4 that the integral \int_0^x (sin t / t) dt was important in analyzing the Gibbs phenomenon in Fourier series expansions. We now consider asymptotic approximations to this integral.

Example 3.11 The sine integral is

Si(x) = \int_0^x (sin t / t) dt,   (3.127)

and the complementary sine integral is

si(x) = \int_x^{infinity} (sin t / t) dt.   (3.128)



It turns out that si(0) = j, which is shown on p. 277 of Spiegel [22]. We wish to 
find an asymptotic series for Si(x). Since 

n r°° sint f x sint f°° sint 

-=/ dt = / df+/ Jf = Si(x) + si(x), (3.129) 

2 Jo t J t Jx t 

we will consider the expansion of si(x). If we integrate by parts in succession, then 

1 



/; 

r 

J X 

f 

J X 



i ' f°° ! 

t sint dt — — cosx — 1 • / — 

x J x V 

,00 , 

Jx 7* 



cost dt, 



-7 1 

£ cos t dt — -sinx + 2 



1 



f sinf df = — 7 cosx — 3 • 

x i 



1 



■ sin? df, 
cost dt, 

■ sin t dt, 

Jx ' 

and so on. If n € N, then we may define 

,00 ,00 

,y n (x) = / t~ n sin tdt, c„(x) — I t~ n cost dt. 

J X J X 



poo 

I t cos tdt — jsinx + 4- 

Jx x 4 



,oo l 
Jx ^ 



Therefore, for odd n 



and for even n 



1 

Sn 00 = — - COS X - n C n +\(x), 

x" 



c n {x) = -sinx +n s n+ \(x). 

x" 



(3.130) 



(3.131a) 



(3.131b) 



We observe that Si(x) = si(x). Repeated application of the recursions (3.131a,b) 
results in 

11 11. 
s\(x) — cos* H 7t- sinx — 1 • 2st,(x) 



x 
1- 1 



1 • 1 



1 -2 



1-2-3 



■ COS X 



sinx + 1-2-3 -4s 5 (x), 



or in general 

s\(x) = cosx 



11 1-2 
x x 3 

1-1 1-2-3 



x x* 

+ (-l)"(2»)Lv 2 „ +1 (x) 



(-D 



„+i(2«-2)! 



r 2/i — 1 



+ ••• + (-1) 



+1 (2n-l)! 

v 2n 



(3.132) 



TLFeBOOK 



102 SEQUENCES AND SERIES 

for n in N (with 0! = 1). From this, we obtain

[si(x) - S_{2n}(x)] x^{2n} = (-1)^n (2n)! x^{2n} s_{2n+1}(x),   (3.133)

for which S_{2n}(x) is appropriately defined as those terms in (3.132) involving sin x and cos x. It is unclear whether

lim_{x->infinity} x^{2n} s_{2n+1}(x) = 0   (3.134)

for any n in N. But if we accept (3.134), then

si(x) ~ S_{2n}(x) = cos x [ 1/x - (1*2)/x^3 + ... + (-1)^{n+1}(2n - 2)!/x^{2n-1} ]
                  + sin x [ (1*1)/x^2 - (1*2*3)/x^4 + ... + (-1)^{n+1}(2n - 1)!/x^{2n} ].   (3.135)

Figure 3.8 shows plots of Si(x) and the asymptotic approximation pi/2 - S_{2n}(x) for n = 1, 2, 3. We observe that the approximations are good for "large x," but poor for "small x," as we would expect. Moreover, for small x, the approximations are better for smaller n. This is also reasonable. Our plots therefore constitute informal verification of the correctness of (3.135), in spite of potential doubts about (3.134).

[Figure 3.8 Plot of Si(x) using the MATLAB routine sinint, and plots of pi/2 - S_{2n}(x) for n = 1, 2, 3. The horizontal solid line is at height pi/2.]
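The following Python sketch (illustrative only) evaluates S_{2n}(x) from (3.135), forms Si(x) ~= pi/2 - S_{2n}(x) via (3.129), and compares it with a simple trapezoidal-rule value of (3.127); the sinint routine mentioned in the figure caption is MATLAB's, so a crude quadrature stands in for it here:

    import math

    def si_asymptotic(x, n):
        # S_2n(x) from (3.135), the asymptotic approximation to si(x)
        cos_part = sum((-1) ** (k + 1) * math.factorial(2 * k - 2) / x ** (2 * k - 1)
                       for k in range(1, n + 1))
        sin_part = sum((-1) ** (k + 1) * math.factorial(2 * k - 1) / x ** (2 * k)
                       for k in range(1, n + 1))
        return math.cos(x) * cos_part + math.sin(x) * sin_part

    def si_quadrature(x, steps=20000):
        # trapezoidal rule for Si(x) = integral_0^x sin(t)/t dt (sin(t)/t -> 1 at t = 0)
        h = x / steps
        f = lambda t: 1.0 if t == 0.0 else math.sin(t) / t
        return h * (0.5 * f(0.0) + sum(f(k * h) for k in range(1, steps)) + 0.5 * f(x))

    x = 10.0
    for n in (1, 2, 3):
        print(n, math.pi / 2 - si_asymptotic(x, n), si_quadrature(x))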

Another approach to finding (c_k) is based on the fact that many special functions are solutions to particular differential equations. If the originating differential equation is known, it may be used to generate the sequence (c_k). However, this section was intended to be relatively brief, and so this method is omitted. The interested reader may see Kreyszig [24].

3.7 MORE ON THE DIRICHLET KERNEL 

In Section 3.4 the Dirichlet kernel was introduced in order to analyze the manner in which Fourier series converge in the vicinity of a discontinuity of a 2 pi-periodic function. However, this was only done with respect to the special case of g(t) in (3.19). In this section we consider the Dirichlet kernel D_n(t) in a more general manner.

We begin by recalling the complex Fourier series expansion from Chapter 1 (Section 1.3.3). If f(t) in L^2(0, 2 pi), then

f(t) = \sum_{n=-infinity}^{infinity} f_n e^{jnt},   (3.136)

where f_n = <f, e_n>, with e_n(t) = e^{jnt}, and

<x, y> = (1/2 pi) \int_0^{2 pi} x(t) y^*(t) dt.   (3.137)

An approximation to f(t) is the truncated Fourier series expansion

f_L(t) = \sum_{n=-L}^{L} f_n e^{jnt} in L^2(0, 2 pi).   (3.138)

The approximation error is

e_L(t) = f(t) - f_L(t) in L^2(0, 2 pi).   (3.139)

We seek a general expression for e_L(t) that is hopefully more informative than (3.139). The overall goal is to generalize the error analysis approach seen in Section 3.4. Therefore, consider

e_L(t) = f(t) - \sum_{n=-L}^{L} <f, e_n> e_n(t)

[via f_n = <f, e_n>, and (3.138) into (3.139)]. Thus

e_L(t) = f(t) - \sum_{n=-L}^{L} [ (1/2 pi) \int_0^{2 pi} f(x) e^{-jnx} dx ] e_n(t)
       = f(t) - (1/2 pi) \int_0^{2 pi} f(x) [ \sum_{n=-L}^{L} e^{jn(t - x)} ] dx.   (3.140)

Since

\sum_{n=-L}^{L} e^{jn(t - x)} = 1 + 2 \sum_{n=1}^{L} cos[n(t - x)]   (3.141)

(show this as an exercise), via (3.24) we obtain

1 + 2 \sum_{n=1}^{L} cos[n(t - x)] = sin[(L + 1/2)(t - x)] / sin[(1/2)(t - x)] = 2 pi D_L(t - x).   (3.142)

Immediately, we see that

e_L(t) = f(t) - \int_0^{2 pi} f(x) D_L(t - x) dx,   (3.143)

where also [recall (3.139)]

f_L(t) = \int_0^{2 pi} f(x) D_L(t - x) dx.   (3.144)

Equation (3.144) is an alternative integral form of the approximation to f(t) originally specified in (3.138). The integral in (3.144) is really an example of something called a convolution integral. The following example will demonstrate how we might apply (3.143).
might apply (3.143). 

Example 3.12 Suppose that

f(t) = sin t   for 0 <= t < pi,   f(t) = 0   for pi <= t < 2 pi.

Note that f(t) is continuous for all t, but that f^{(1)}(t) = df(t)/dt is not continuous everywhere. For example, f^{(1)}(t) is not continuous at t = pi. Plots of f_L(t) for various L (see Fig. 3.10) suggest that f_L(t) converges most slowly to f(t) near t = pi. Can we say something about the rate of convergence?






Therefore, consider

f_L(pi) = (1/2 pi) \int_0^{pi} sin x * sin[(L + 1/2)(pi - x)] / sin[(1/2)(pi - x)] dx.   (3.145)

Now

sin[(1/2)(pi - x)] = sin(pi/2) cos(x/2) - cos(pi/2) sin(x/2) = cos(x/2),

and since sin x = 2 sin(x/2) cos(x/2), (3.145) reduces to

f_L(pi) = (1/pi) \int_0^{pi} sin(x/2) sin[(L + 1/2)(pi - x)] dx
        = (1/2 pi) \int_0^{pi} { cos[(L + 1)x - (L + 1/2) pi] - cos[Lx - (L + 1/2) pi] } dx
        = [1/(2 pi (L + 1))] [ sin((L + 1)x - (L + 1/2) pi) ]_0^{pi} - [1/(2 pi L)] [ sin(Lx - (L + 1/2) pi) ]_0^{pi}
        = [1/(2 pi (L + 1))] [ 1 + sin((L + 1/2) pi) ] - [1/(2 pi L)] [ -1 + sin((L + 1/2) pi) ]
        = (1/2 pi) [2L + 1 - (-1)^L] / [L(L + 1)].   (3.146)

We see that, as expected,

lim_{L->infinity} f_L(pi) = 0.

Also, (3.146) gives e_L(pi) = -f_L(pi), and this is the exact value for the approximation error at t = pi for all L. Furthermore,

|e_L(pi)| proportional to 1/L

for large L, and so we have a measure of the rate of convergence of the error, at least at the point t = pi. (Symbol "proportional to" is written oc in some renderings of this text.)

We remark that f(t) in Example 3.12 may be regarded as the voltage drop across the resistor R in Fig. 3.9. The circuit in Fig. 3.9 is a simple half-wave rectifier circuit. The reader ought to verify as an exercise that

f_L(t) = 1/pi + (1/2) sin t + \sum_{n=2}^{L} [1 + (-1)^n] / [pi(1 - n^2)] cos nt.   (3.147)

Plots of f_L(t) for L = 3, 10 versus the plot of f(t) appear in Fig. 3.10.

[Figure 3.9 An electronic circuit interpretation for Example 3.12 (an ideal diode and resistor R); here, v(t) = sin(t) for all t in R.]

[Figure 3.10 A plot of f(t) and Fourier series approximations to f(t), i.e., f_L(t) for L = 3, 10.]
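A short Python sketch (ours) that sums (3.147) and evaluates it at t = pi, where the true signal is zero; the values reproduce (3.146) and show the 1/L decay of the error:

    import math

    def f_L(t, L):
        # Truncated Fourier series (3.147) for the half-wave rectified sine of Example 3.12
        s = 1.0 / math.pi + 0.5 * math.sin(t)
        s += sum((1.0 + (-1.0) ** n) / (math.pi * (1.0 - n * n)) * math.cos(n * t)
                 for n in range(2, L + 1))
        return s

    # At t = pi the true signal is 0; (3.146) predicts the error decays like 1/L.
    for L in (3, 10, 50):
        print(L, f_L(math.pi, L), (2 * L + 1 - (-1) ** L) / (2 * math.pi * L * (L + 1)))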




3.8 FINAL REMARKS 

We have seen that sequences and series might converge "mathematically" yet not "numerically." Essentially, we have seen three categories of difficulty:

1. Pointwise convergence of series leading to irreducible errors in certain regions of the approximation, such as the Gibbs phenomenon in Fourier expansions, which arises in the vicinity of discontinuities in the function being approximated
2. The destructive effect of rounding errors, as illustrated by the catastrophic convergence of series
3. Slow convergence, as illustrated by the problem of computing pi with the Maclaurin expansion for tan^{-1} x

We have seen that some ingenuity may be needed to overcome obstacles such as these. For example, a divide-and-conquer approach helped in the problem of computing pi. In the case of catastrophic convergence, the problem was solved by changing the computational algorithm. The problem of overcoming Gibbs overshoot is not considered here. However, it involves seeking uniformly convergent series approximations.



APPENDIX 3.A COORDINATE ROTATION DIGITAL COMPUTING (CORDIC)

3.A.1 Introduction 

This appendix presents the basics of a method for computing "elementary functions," which includes the problem of rotating vectors in the plane, and computing tan^{-1} x, sin theta, and cos theta. The method to be used is called "coordinate rotation digital computing" (CORDIC), and was invented by Jack Volder [3] in the late 1950s. However, in spite of the age of the method, it is still important. The method is one of those great ideas that is able to survive despite technological changes. It is a good example of how a clever mathematical idea, if anything, becomes less obsolete with the passage of time.

The CORDIC method was, and is, desirable because it reduces the problem of computing apparently complicated functions, such as trig functions, to a succession of simple operations. Specifically, these simple operations are shifting and adding. In the 1950s it was a major achievement just to build systems that could add two numbers together because all that was available for use was vacuum tubes, and to a lesser degree, discrete transistors. However, even with the enormous improvements in computing technology that have occurred since then, it is still important to reduce complicated operations to simple ones. Thus, the CORDIC method has survived very well. For example, in 1980 [5] a special-purpose CORDIC VLSI (very large scale integration) chip was presented. More recent references to the method will

be given later. Nowadays, the CORDIC method is more likely to be implemented 
with gate-array technology. 

Since the CORDIC method involves the operations of shifting and adding only, 
once the mathematics of the method is understood, it is easy to build CORDIC 
computing hardware using the logic design methods considered in typical elemen- 
tary digital logic courses or books. 5 Consideration of CORDIC computing also 
makes a connection between elementary mathematics courses (calculus and linear 
algebra) and computer hardware systems design, as well as the subject of numerical 
analysis. 

3.A.2 The Concept of a Discrete Basis 

The original paper of Volder [3] does not give a rigorous treatment of the mathematics of the CORDIC method. However, in this section (based closely on Schelin [6]), we will begin the process of deriving the CORDIC method in a mathematically rigorous manner. The central idea is to represent operands (e.g., the theta in sin theta) in terms of a discrete basis. When the discrete basis representation is combined with appropriate mathematical identities, the CORDIC algorithm results.

Let R denote the set of real numbers. Everything we do here revolves around the following theorem (from Schelin [6]).

Theorem 3.A.1: Suppose that theta_k in R for k in {0, 1, 2, 3, ...} satisfy theta_0 > theta_1 > theta_2 > ... > theta_n > 0, and that

theta_k <= \sum_{j=k+1}^{n} theta_j + theta_n   for 0 <= k <= n,   (3.A.1a)

and suppose theta in R satisfies

|theta| <= \sum_{j=0}^{n} theta_j.   (3.A.1b)

If theta^{(0)} = 0, and theta^{(k+1)} = theta^{(k)} + delta_k theta_k for 0 <= k <= n, where

delta_k = +1 if theta >= theta^{(k)},   delta_k = -1 if theta < theta^{(k)},   (3.A.1c)

then

|theta - theta^{(k)}| <= \sum_{j=k}^{n} theta_j + theta_n   for 0 <= k <= n,   (3.A.1d)

and so in particular |theta - theta^{(n+1)}| <= theta_n.



The method is also easy to implement in an assembly language (or other low-level) programming 
environment. 




Proof We may use proof by mathematical induction^6 on the index k. For k = 0,

|theta - theta^{(0)}| = |theta| <= \sum_{j=0}^{n} theta_j <= \sum_{j=0}^{n} theta_j + theta_n

via (3.A.1b).

Assume that |theta - theta^{(k)}| <= \sum_{j=k}^{n} theta_j + theta_n is true, and consider |theta - theta^{(k+1)}|. Via (3.A.1c), delta_k and theta - theta^{(k)} have the same sign, and so

|theta - theta^{(k+1)}| = |theta - theta^{(k)} - delta_k theta_k| = | |theta - theta^{(k)}| - theta_k |.

Now, via the inductive hypothesis (i.e., |theta - theta^{(k)}| <= \sum_{j=k}^{n} theta_j + theta_n), we have

|theta - theta^{(k)}| - theta_k <= \sum_{j=k+1}^{n} theta_j + theta_n.   (3.A.2a)

Via (3.A.1a), we obtain

-\sum_{j=k+1}^{n} theta_j - theta_n <= -theta_k,

so that

-\sum_{j=k+1}^{n} theta_j - theta_n <= |theta - theta^{(k)}| - theta_k.   (3.A.2b)

Combining (3.A.2a) with (3.A.2b) gives

|theta - theta^{(k+1)}| = | |theta - theta^{(k)}| - theta_k | <= \sum_{j=k+1}^{n} theta_j + theta_n,

so that (3.A.1d) holds for k replaced by k + 1, and so (3.A.1d) holds via induction.

We will call this result Schelin's theorem. The set {theta_k} is called a discrete basis if it satisfies the restrictions given in the theorem.

In what follows we will interpret theta as an angle. However, note that theta in this theorem could be more general than this. Now suppose that we define

theta_k = tan^{-1} 2^{-k}   (3.A.3)

for k = 0, 1, ..., n. Table 3.A.1 shows typical values for theta_k as given by (3.A.3). We see that good approximations to theta can be obtained from relatively small n (because from Schelin's theorem |theta - theta^{(n+1)}| <= theta_n).

A brief introduction to proof by mathematical induction appears in Appendix 3.B. 






TABLE 3.A.1 Values 
for Some Elements of 
the Discrete Basis given 
by Eq. (3.A.3) 



k 


Ok 


= tan _1 2-' : 







45° 


1 




26.565° 


2 




14.036° 


3 




7.1250° 


4 




3.5763° 



Clearly, these satisfy theta_0 > theta_1 > ... > theta_n > 0, which is one of the restrictions in Schelin's theorem.

The mean-value theorem of elementary calculus says that there exists a xi in [x_0, x_0 + Delta] such that

[df(x)/dx]_{x=xi} = [f(x_0 + Delta) - f(x_0)] / Delta,

so if f(x) = tan^{-1} x, then df(x)/dx = 1/(1 + x^2), and thus

(theta_k - theta_{k+1}) / (2^{-k} - 2^{-(k+1)}) = [1/(1 + x^2)]_{x=xi},   xi in [2^{-(k+1)}, 2^{-k}],
                                               <= [1/(1 + x^2)]_{x=2^{-(k+1)}},

because the slope of tan^{-1} x is largest at x = 2^{-(k+1)} on this interval, and this in turn implies

theta_k - theta_{k+1} <= (2^{-k} - 2^{-(k+1)}) / (1 + 2^{-2(k+1)}) = (2^{k+2} - 2^{k+1}) / (1 + 2^{2(k+1)}) = 2^{k+1} / (1 + 2^{2(k+1)}),

and also

theta_k >= 2^{-k} / (1 + 2^{-2k}) = 2^k / (1 + 2^{2k}),

because tan^{-1} x >= x (d/dx) tan^{-1} x = x/(1 + x^2) for x >= 0 (let x = 2^{-k}), for k = 0, 1, ..., n.^7

[Footnote 7:] For x >= 0, (d/dx)[x/(1 + x^2)] = 1/(1 + x^2) - 2x^2/(1 + x^2)^2 = (1 - x^2)/(1 + x^2)^2 <= 1/(1 + x^2), so tan^{-1} x = \int_0^x dt/(1 + t^2) >= x/(1 + x^2).

Now, as a consequence of these results [note that 2^{j+1}/(1 + 2^{2(j+1)}) <= theta_{j+1}],

theta_k - theta_n = (theta_k - theta_{k+1}) + (theta_{k+1} - theta_{k+2}) + ... + (theta_{n-2} - theta_{n-1}) + (theta_{n-1} - theta_n)
                 <= \sum_{j=k}^{n-1} 2^{j+1}/(1 + 2^{2(j+1)})
                 <= \sum_{j=k+1}^{n} theta_j,

implying that

theta_k <= \sum_{j=k+1}^{n} theta_j + theta_n

for k = 0, 1, ..., n, and thus (3.A.1a) holds for {theta_k} in (3.A.3), and so (3.A.3) is a concrete example of a discrete basis. If you have a pocket calculator handy, then it is easy to verify that

\sum_{j=0}^{3} theta_j = 92.73 degrees > 90 degrees,

so we will work only with angles theta that satisfy |theta| <= 90 degrees. Thus, for such theta, there exists a sequence {delta_k} such that delta_k in {-1, +1}, where

theta = \sum_{k=0}^{n} delta_k theta_k + epsilon_{n+1} = theta^{(n+1)} + epsilon_{n+1},   (3.A.4)

where |epsilon_{n+1}| <= theta_n = tan^{-1} 2^{-n}. Equation (3.A.1c) gives us a way to find {delta_k}. We can also write that any angle theta satisfying |theta| <= pi/2 (radians) can be represented exactly as

theta = \sum_{k=0}^{infinity} delta_k theta_k

for appropriately chosen coordinates {delta_k}.

In the next section we will begin to see how useful it is to be able to represent angles in terms of the discrete basis given by (3.A.3).






3.A.3 Rotating Vectors in the Plane 



No one would disagree that a basic computational problem in electrical and computer engineering is to find

z' = e^{j theta} z,   theta in R,   (3.A.5)

where j = sqrt(-1), z = x + jy, and z' = x' + jy' (x, y, x', y' in R). Certainly, this problem arises in the phasor analysis of circuits, in computer graphics (to rotate objects, for example), and it also arises in digital signal processing (DSP) a lot.^8

Thus, z and z' are complex variables. Expressing them in terms of their real and imaginary parts, we may rewrite (3.A.5) as

x' + jy' = (cos theta + j sin theta)(x + jy) = [x cos theta - y sin theta] + j[x sin theta + y cos theta],

and this can be further rewritten as the matrix-vector product

[x' ; y'] = [ cos theta   -sin theta ; sin theta   cos theta ] [x ; y],   (3.A.6)

which we recognize as the formula for rotating the vector [x y]^T in the plane to [x' y']^T.

Recall the trigonometric identities

sin theta = tan theta / sqrt(1 + tan^2 theta),   cos theta = 1 / sqrt(1 + tan^2 theta),   (3.A.7)

and so for theta = theta_k in (3.A.3)

sin theta_k = 2^{-k} / sqrt(1 + 2^{-2k}),   cos theta_k = 1 / sqrt(1 + 2^{-2k}).   (3.A.8)

Thus, rotating x_k + j y_k by angle delta_k theta_k to obtain x_{k+1} + j y_{k+1} is accomplished via

[x_{k+1} ; y_{k+1}] = (1/sqrt(1 + 2^{-2k})) [ 1   -delta_k 2^{-k} ; delta_k 2^{-k}   1 ] [x_k ; y_k].   (3.A.9)

[Footnote 8:] In DSP it is often necessary to compute the discrete Fourier transform (DFT), which is an approximation to the Fourier transform (FT) and is defined by

X_n = \sum_{k=0}^{N-1} exp(-j 2 pi k n / N) x_k,

where {x_k} is the set of samples of some analog signal, i.e., x_k = x(kT), where k is an integer and T is a positive constant; x(t) for t in R is the analog signal. Note that additional information about the DFT appeared in Section 1.4.






where we have used (3.A.6). From Schelin's theorem, theta^{(n+1)} ~= theta, so if we wish to rotate x + jy = x_0 + j y_0 by theta^{(n+1)} to x_{n+1} + j y_{n+1} (~= x' + jy'), then via (3.A.9)

[x_{n+1} ; y_{n+1}] = ( \prod_{k=0}^{n} 1/sqrt(1 + 2^{-2k}) ) [ 1   -delta_n 2^{-n} ; delta_n 2^{-n}   1 ] ... [ 1   -delta_1 2^{-1} ; delta_1 2^{-1}   1 ] [ 1   -delta_0 ; delta_0   1 ] [x_0 ; y_0].   (3.A.10)

Define

K_n = \prod_{k=0}^{n} 1/sqrt(1 + 2^{-2k}),   (3.A.11)

where \prod_{k=0}^{n} a_k = a_n a_{n-1} a_{n-2} ... a_1 a_0. Consider

[X_{n+1} ; Y_{n+1}] = [ 1   -delta_n 2^{-n} ; delta_n 2^{-n}   1 ] ... [ 1   -delta_1 2^{-1} ; delta_1 2^{-1}   1 ] [ 1   -delta_0 ; delta_0   1 ] [x_0 ; y_0],   (3.A.12)

which is the same expression as (3.A.10) except that we have dropped the multiplication by K_n. We observe that implementing (3.A.12) requires only the simple (to implement in digital hardware, or assembly language) operations of shifting and adding. Implementing (3.A.12) rotates x + jy by approximately the desired amount, but gives a solution vector that is a factor of 1/K_n longer than it should be. Of course, this can be corrected by multiplying [X_{n+1} Y_{n+1}]^T by K_n if desired. Note that in some applications this would not be necessary.

Note that

[ cos theta   -sin theta ; sin theta   cos theta ] ~= K_n [ 1   -delta_n 2^{-n} ; delta_n 2^{-n}   1 ] ... [ 1   -delta_1 2^{-1} ; delta_1 2^{-1}   1 ] [ 1   -delta_0 ; delta_0   1 ],   (3.A.13)

and so the matrix product in (3.A.12) [or (3.A.10)] represents an efficient approximate factorization of the rotation operator in (3.A.6). The approximation gets better and better as n increases, and in the limit as n -> infinity becomes exact.

The computational complexity of the CORDIC rotation algorithm may be described as follows. In Eq. (3.A.10) there are exactly 2n shifts (i.e., multiplications by 2^{-k}), and 2n + 2 additions, plus two scalings by the factor K_n. As well, only n + 1 bits of storage are needed to save the sequence {delta_k}.

We conclude this section with a numerical example to show how to obtain the sequence {delta_k} via (3.A.1c).




Example 3.A.1 Suppose that we want to rotate a vector by an angle of theta = 20 degrees, and we decide that n = 4 gives sufficient accuracy for the application at hand. Via (3.A.1c) and Table 3.A.1, we have

theta^{(1)} = theta_0 = 45 degrees,                                     as delta_0 = +1 since theta >= theta^{(0)} = 0 degrees
theta^{(2)} = theta^{(1)} + delta_1 theta_1 = 45 - 26.565 = 18.435 degrees,   as delta_1 = -1 since theta < theta^{(1)}
theta^{(3)} = theta^{(2)} + delta_2 theta_2 = 18.435 + 14.036 = 32.471 degrees,  as delta_2 = +1 since theta >= theta^{(2)}
theta^{(4)} = theta^{(3)} + delta_3 theta_3 = 32.471 - 7.1250 = 25.346 degrees,  as delta_3 = -1 since theta < theta^{(3)}
theta^{(5)} = theta^{(4)} + delta_4 theta_4 = 25.346 - 3.5763 = 21.770 degrees ~= theta.

Via (3.A.4) and |theta - theta^{(n+1)}| <= theta_n,

|epsilon_5| = 1.770 degrees < theta_4 = 3.5763 degrees,

so the error bound in Schelin's theorem is actually somewhat conservative, at least in this special case.
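A minimal Python sketch of the rotation-mode iteration of Section 3.A.3 (ours, not the book's implementation); it reproduces the angle decomposition of Example 3.A.1 and applies the shift-and-add steps (3.A.9), with the scaling K_n of (3.A.11) applied at the end:

    import math

    def cordic_rotate(x, y, theta_deg, n):
        theta = math.radians(theta_deg)
        thetas = [math.atan(2.0 ** (-k)) for k in range(n + 1)]   # discrete basis (3.A.3)
        acc, K = 0.0, 1.0
        for k in range(n + 1):
            delta = 1.0 if theta >= acc else -1.0                 # sign rule (3.A.1c)
            acc += delta * thetas[k]
            # shift-and-add step of (3.A.9), without the per-step scaling
            x, y = x - delta * 2.0 ** (-k) * y, delta * 2.0 ** (-k) * x + y
            K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * k))           # accumulate K_n of (3.A.11)
        return K * x, K * y, math.degrees(acc)

    xr, yr, approx_angle = cordic_rotate(1.0, 0.0, 20.0, 4)
    print(xr, yr, approx_angle)   # approx_angle is 21.770 degrees, as in Example 3.A.1
    print(math.cos(math.radians(20.0)), math.sin(math.radians(20.0)))

With n = 4 the rotated vector agrees with (cos 20 degrees, sin 20 degrees) only to within the angle error theta_4, as the theorem predicts; larger n tightens both.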

3.A.4 Computing Arctangents 

The results in Section 3.A.3 can be modified to obtain a CORDIC algorithm for computing an approximation to theta = tan^{-1}(y/x). The idea is to find the sequence {delta_k | k = 0, 1, ..., n - 1, n} that rotates the vector [x y]^T = [x_0 y_0]^T to the vector [x_n y_n]^T, where y_n ~= 0. More specifically, we would select delta_k so that |y_{k+1}| <= |y_k|.

Let theta-hat denote the approximation to theta. The desired algorithm to compute theta-hat may be expressed as Pascal-like pseudocode:

theta_hat := 0;
x_0 := x; y_0 := y;
for k := 0 to n do begin
    if y_k >= 0 then begin
        delta_k := -1;
    end
    else begin
        delta_k := +1;
    end;
    theta_hat := -delta_k * theta_k + theta_hat;
    x_{k+1} := x_k - delta_k * 2^{-k} * y_k;
    y_{k+1} := delta_k * 2^{-k} * x_k + y_k;
end;

In this pseudocode

[x_{k+1} ; y_{k+1}] = [ 1   -delta_k 2^{-k} ; delta_k 2^{-k}   1 ] [x_k ; y_k],

and we see that, for the manner in which the sequence {delta_k} is constructed by the pseudocode, the inequality |y_{k+1}| <= |y_k| is satisfied. We choose n to achieve the desired accuracy of our estimate theta-hat of theta; specifically, |theta - theta-hat| <= theta_n.

3.A.5 Final Remarks 

As an exercise, the reader should modify the previous results to determine a CORDIC method for computing cos theta and sin theta. [Hint: Take a good look at (3.A.13).]

The CORDIC philosophy can be extended to the computation of hyperbolic 
trigonometric functions, logarithms 9 and other functions [4, 7]. It can also perform 
multiplication and division (see Table on p. 324 of Schelin [6]). As shown by Hu 
and Naganathan [9], the rate of convergence of the CORDIC method can be accel- 
erated by a method similar to the Booth algorithm (see pp. 287-289 of Hamacher 
et al. [10]) for multiplication. However, this is at the expense of somewhat more 
complicated hardware structures. A roundoff error analysis of the CORDIC method 
has been performed by Hu [8]. We do not present these results in this book as they 
are quite involved. Hu claims to have fairly tight bounds on the errors, however. 
Fixed-point and floating-point schemes are both analyzed. A tutorial presentation 
of CORDIC-based VLSI architectures for digital signal processing applications 
appears in Hu [11]. Other papers on the CORDIC method are those by Timmer- 
mann et al. [12] and Lee and Lang [13] (which appeared in the IEEE Transactions 
on Computers, "Special Issue on Computer Arithmetic" of August 1992). An alter- 
native summary of the CORDIC method may be found in Hwang [14]. Many of 
the ideas in Hu's paper [11] are applicable in a gate-array technology environ- 
ment. Applications include the computation of discrete transforms (e.g., the DFT), 
digital filtering, adaptive filtering, Kalman filtering, the solution of special linear 

[Footnote 9:] A clever alternative to the CORDIC approach for log calculations appears in Lo and Chen [15], and a method of computing square roots without division appears in Mikami et al. [16].




systems of equations (e.g., Toeplitz), deconvolution, and eigenvalue and singular 
value decompositions. 



APPENDIX 3.B MATHEMATICAL INDUCTION 

The basic idea of mathematical induction is as follows. Assume that we are given a sequence of statements

S_0, S_1, ..., S_n, ...

and each S_i is true or it is false. To prove that all of the statements are true (i.e., to prove that S_n is true for all n) by induction: (1) prove that S_n is true for n = 0, and then (2) assume that S_n is true for any n = k and then show that S_n is true for n = k + 1.

Example 3.B.1 Prove

S_n = \sum_{i=0}^{n} i^2 = n(n + 1)(2n + 1)/6,   n >= 0.

Proof We will use induction, but note that there are other methods (e.g., via z transforms). For n = 0, we obtain

S_0 = \sum_{i=0}^{0} i^2 = 0   and   n(n + 1)(2n + 1)/6 |_{n=0} = 0.

Thus, S_n is certainly true for n = 0.

Assume now that S_n is true for n = k, so that

S_k = \sum_{i=0}^{k} i^2 = k(k + 1)(2k + 1)/6.   (3.B.1)

We have

S_{k+1} = \sum_{i=0}^{k+1} i^2 = \sum_{i=0}^{k} i^2 + (k + 1)^2 = S_k + (k + 1)^2,

and so

S_k + (k + 1)^2 = k(k + 1)(2k + 1)/6 + (k + 1)^2 = (k + 1)(k + 2)(2k + 3)/6 = n(n + 1)(2n + 1)/6 |_{n=k+1},

where we have used (3.B.1).

Therefore, S_n is true for n = k + 1 if S_n is true for n = k. Therefore, S_n is true for all n >= 0 by induction.

APPENDIX 3.C CATASTROPHIC CANCELLATION 

The phenomenon of catastrophic cancellation is illustrated in the following output from a MATLAB implementation that ran on a Sun Microsystems Ultra 10 workstation using MATLAB version 6.0.0.88, release 12, in an attempt to compute exp(-20) using the Maclaurin series for exp(x) directly:



term k        x^k / k!
 0            1.000000000000000
 1          -20.000000000000000
 2          200.000000000000000

3 -1333.333333333333258 

4 6666.666666666666970 

5 -26666.666666666667879 

6 88888.888888888890506 

7 -253968.253968253964558 

8 634920.634920634911396 

9 -1410934.744268077658489 

10 2821869.488536155316979 

11 -5130671.797338464297354

12 8551119.662230772897601 

13 -13155568.711124267429113 

14 18793669.587320379912853 

15 -25058226.116427175700665 

16 31322782.645533967763186 

17 -36850332.524157606065273 

18 40944813.915730677545071 

19 -43099804.121821768581867 

20 43099804.121821768581867 

21 -41047432.496973112225533 

22 37315847.724521011114120 

23 -32448563.238713916391134 

24 27040469.365594934672117 

25 -21632375.492475949227810 

26 16640288.840366113930941 

27 -12326139.881752677261829 

28 8804385.629823340103030 

29 -6071990.089533339254558 

30 4047993.393022226169705 

31 -2611608.640659500379115 

32 1632255.400412187911570 

33 -989245.697219507652335

34 581909.233658534009010 

35 -332519.562090590829030 

36 184733.090050328260986 

37 -99855.724351528784609 

38 52555.644395541465201 




39 -26951.612510534087050 

40 13475.806255267041706 

41 -6573.564026959533294 

42 3130.268584266443668 

43 -1455.938876402997266 

44 661.790398364998850

45 -294.129065939999407 

46 127.882202582608457 

47 -54.417958545790832

48 22.674149394079514 

49 -9.254754854726333

50 3.701901941890533 

51 -1.451726251721777 

52 0.558356250662222 

53 -0.210700471948008 

54 0.078037211832596 

55 -0.028377167939126 

56 0.010134702835402 

57 -0.003556036082597

58 0.001226219338827 

59 -0.000415667572484 

60 0.000138555857495 

61 -0.000045428149998 

62 0.000014654241935 

63 -0.000004652140297 

64 0.000001453793843 

65 -0.000000447321182 

66 0.000000135551873 

67 -0.000000040463246 

68 0.000000011900955 

69 -0.000000003449552 

70 0.000000000985586 

71 -0.000000000277630 

72 0.000000000077119 

73 -0.000000000021129 

74 0.000000000005710 

75 -0.000000000001523 

76 0.000000000000401 

77 -0.000000000000104 

78 0.000000000000027 

79 -0.000000000000007 

80 0.000000000000002 

81 -0.000000000000000 

82 0.000000000000000 

83 -0.000000000000000 

84 0.000000000000000 

85 -0.000000000000000 

86 0.000000000000000 

87 -0.000000000000000 

88 0.000000000000000 

exp(-20) from sum of the above terms = 0.000000004173637 
True value of exp(-20) = 0.000000002061154 




REFERENCES 

1. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 
1978. 

2. W. Rudin, Principles of Mathematical Analysis, 3rd ed., McGraw-Hill, New York, 1976. 

3. J. E. Volder, "The CORDIC Trigonometric Computing Technique," IRE Trans. Electron. Comput. EC-8, 330-334 (Sept. 1959).

4. J. S. Walther, "A Unified Algorithm for Elementary Functions," AFIPS Conf. Proc, 
Vol. 38, 1971 Spring Joint Computer Conf., 379-385 May 18-20, 1971. 

5. G. L. Haviland and A. A. Tuszynski, "A CORDIC Arithmetic Processor Chip," IEEE 
Trans. Comput. C-29, 68-79 (Feb. 1980). 

6. C. W. Schelin, "Calculator Function Approximation," Am. Math. Monthly 90, 317-325 
(May 1983). 

7. J.-M. Muller, "Discrete Basis and Computation of Elementary Functions," IEEE Trans. 
Comput. C-34, 857-862 (Sept. 1985). 

8. Y. H. Hu, "The Quantization Effects of the CORDIC Algorithm," IEEE Trans. Signal 
Process. 40, 834-844 (April 1992). 

9. Y. H. Hu and S. Naganathan, "An Angle Recoding Method for CORDIC Algorithm 
Implementation," IEEE Trans. Comput. 42, 99-102 (Jan. 1993). 

10. V. C. Hamacher, Z. G. Vranesic, and S. G. Zaky, Computer Organization, 3rd ed., 
McGraw-Hill, New York, 1990. 

11. Y. H. Hu, "CORDIC-Based VLSI Architectures for Digital Signal Processing," IEEE Signal Process. Mag. 9, 16-35 (July 1992).

12. D. Timmermann, H. Hahn, and B. J. Hosticka, "Low Latency Time CORDIC Algo- 
rithms," IEEE Trans. Comput. 41, 1010-1015 (Aug. 1992). 

13. J. Lee and T. Lang, "Constant-Factor Redundant CORDIC for Angle Calculation and 
Rotation," IEEE Trans. Comput. 41, 1016-1025 (Aug. 1992). 

14. K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, Wiley, New 
York, 1979. 

15. H.-Y. Lo and J.-L. Chen, "A Hardwired Generalized Algorithm for Generating the Log- 
arithm Base-k by Iteration," IEEE Trans. Comput. C-36, 1363-1367 (Nov. 1987). 

16. N. Mikami, M. Kobayashi, and Y. Yokoyama, "A New DSP-Oriented Algorithm for 
Calculation of the Square Root Using a Nonlinear Digital Filter," IEEE Trans. Signal 
Process. 40, 1663-1669 (July 1992). 

17. G. G. Walter, Wavelets and Other Orthogonal Systems with Applications, CRC Press, 
Boca Raton, FL, 1994. 

18. I. S. Gradshteyn and I. M. Ryzhik, in Table of Integrals, Series and Products, A. Jeffrey, 
ed., 5th ed., Academic Press, San Diego, CA, 1994. 

19. L. Bers, Calculus: Preliminary Edition, Vol. 2, Holt, Rinehart, Winston, New York, 1967. 

20. A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, 2nd ed., 
Addison-Wesley, Reading, MA, 1994. 

21. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical 
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977. 

22. M. R. Spiegel, Theory and Problems of Advanced Calculus (Schaum's Outline Series). 
Schaum (McGraw-Hill), New York, 1963. 






23. A. Papoulis, Signal Analysis, McGraw-Hill, New York, 1977. 

24. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979. 

25. W. D. Lakin and D. A. Sanchez, Topics in Ordinary Differential Equations, Dover Pub- 
lications, New York, 1970. 



PROBLEMS 

3.1. Prove the following theorem: Every convergent sequence in a metric space 
is a Cauchy sequence. 

3.2. Let f_n(x) = x^n for n ∈ {1, 2, 3, ...} = N, and f_n(x) ∈ C[0, 1] for all n ∈ N.

(a) What is f(x) = lim_{n→∞} f_n(x) (for x ∈ [0, 1])?

(b) Is f(x) ∈ C[0, 1]?

3.3. Sequence (x_n) is defined to be x_n = (n + 1)/(n + 2) for n ∈ Z^+. Clearly, if
X = [0, 1) ⊂ R, then x_n ∈ X for all n ∈ Z^+. Assume the metric for metric
space X is d(x, y) = |x - y| (x, y ∈ X).

(a) What is x = lim_{n→∞} x_n?

(b) Is X a complete space?

(c) Prove that (x_n) is Cauchy.

3.4. Recall Section 3.A.3 wherein the rotation operator was defined [Eq. (3.A.6)].

(a) Find an expression for angle θ such that for y ≠ 0

    G(θ) [ x ]   [ cos θ   -sin θ ] [ x ]   [ x' ]
         [ y ] = [ sin θ    cos θ ] [ y ] = [ 0  ],

where x' is some arbitrary nonzero constant.

(b) Prove that G^{-1}(θ) = G^T(θ) [i.e., the inverse of G(θ) is given by its
transpose].

(c) Consider the matrix

        [ 4  0  2 ]
    A = [ 1  4  1 ].
        [ 0  2  4 ]

Let 0_{n×m} denote an array (matrix) of zeros with n rows and m columns.
Find G(θ_1) and G(θ_2) so that

    [ 1        0_{1×2} ] [ G(θ_1)   0_{2×1} ]
    [ 0_{2×1}  G(θ_2)  ] [ 0_{1×2}     1    ] A = R,

where R is an upper triangular matrix (defined in Section 4.5, if you do
not recall what this is), and Q^T denotes the product of the two block
matrices on the left.




(d) Find Q^{-T} (inverse of Q^T).

{Comment: The procedure illustrated by this problem is important in various 
applications such as solving least-squares approximations, and in finding the 
eigenvalues and eigenvectors of matrices.) 

3.5. Review Appendix 3.A. Suppose that we wish to rotate a vector [x y]^T ∈ R^2
through an angle θ = 25° ± 1°. Find n, and the required delta sequence (δ_k),
to achieve this accuracy.

3.6. Review Appendix 3.A. A certain CORDIC routine has the following pseudocode description:

    Input x and z (|z| < 1);
    y_0 := 0; z_0 := z;
    for k := 0 to n - 1 do begin
        if z_k < 0 then begin
            δ_k := -1;
        end
        else begin
            δ_k := +1;
        end;
        y_{k+1} := y_k + δ_k x 2^{-k};
        z_{k+1} := z_k - δ_k 2^{-k};
    end;

The algorithm's output is y_n. What is y_n?

3.7. Suppose that

    x_n(t) = t / (n + t)

for n ∈ Z^+, and t ∈ (0, 1) ⊂ R. Show that (x_n(t)) is uniformly convergent
on S = (0, 1).

3.8. Suppose that u_n ≥ 0, and also that Σ_{n=0}^{∞} u_n converges. Prove that Π_{n=0}^{∞} (1 + u_n) converges.

[Hint: Recall Lemma 2.1 (of Chapter 2).]

3.9. Prove that lim_{n→∞} x^n = 0 if |x| < 1.

3.10. Prove that for a, b, x ∈ R

    |a - b| / (1 + |a - b|) ≤ |a - x| / (1 + |a - x|) + |x - b| / (1 + |x - b|).

(Comment: This inequality often appears in the context of convergence proofs
for certain series expansions.)




3.11. Consider the function

    K_N(x) = [1/(N + 1)] Σ_{n=0}^{N} D_n(x)

[recall (3.24)]. Since D_n(x) is 2π-periodic, K_N(x) is also 2π-periodic. We
may assume that x ∈ [-π, π].

(a) Prove that

    K_N(x) = [1/(N + 1)] · [1 - cos((N + 1)x)] / [1 - cos x].

    [Hint: Consider Σ_{n=0}^{N} sin((n + 1/2)x) = Im[ Σ_{n=0}^{N} e^{j(n + 1/2)x} ].]

(b) Prove that

    K_N(x) ≥ 0.

(c) Prove that

    (1/2π) ∫_{-π}^{π} K_N(x) dx = 1.

[Comment: The partial sums of the complex Fourier series expansion of the
2π-periodic function f(x) (again x ∈ [-π, π]) are given by

    f_N(x) = Σ_{n=-N}^{N} f_n e^{jnx},

where f_n = (1/2π) ∫_{-π}^{π} f(x) e^{-jnx} dx. Define

    σ_N(x) = [1/(N + 1)] Σ_{j=0}^{N} f_j(x).

It can be shown that

    σ_N(x) = (1/2π) ∫_{-π}^{π} f(x - t) K_N(t) dt.

It is also possible to prove that σ_N(x) → f(x) uniformly on [-π, π] if f(x)
is continuous. This is often called Fejér's theorem.]

3.12. Repeat the analysis of Example 3.12 for /l(0). 

3.13. If f(t) ∈ L^2(0, 2π) and f(t) = Σ_{n∈Z} f_n e^{jnt}, show that

    (1/2π) ∫_0^{2π} |f(t)|^2 dt = Σ_{n=-∞}^{∞} |f_n|^2.



This relates the energy/power of f(t) to the energy/power of its Fourier 
series coefficients. 

[Comment: For example, if f(t) is one period of a 2π-periodic signal, then
the power interpretation applies. In particular, suppose that f(t) = i(t), a
current waveform, and that this is the current into a resistor of resistance R;
then the average power delivered to R is

    P_av = (1/2π) ∫_0^{2π} R i^2(t) dt.]



3.14. Use the result of Example 1.20 to prove that

    Σ_{n=0}^{∞} (-1)^n / (2n + 1) = π/4.

3.15. Prove that

    (1/2π) ∫_0^{π} sin[(L + 1/2)t] / sin[t/2] dt = 1/2.

3.16. Use mathematical induction to prove that

    (1.1)^n ≥ 1 + (1/10) n

for all n ∈ N.

3.17. Use mathematical induction to prove that

    (1 + h)^n ≥ 1 + nh

for all n ∈ N, with h > -1. (This is Bernoulli's inequality.)

3.18. Use mathematical induction to prove that 4^n + 2 is a multiple of 6 for all
n ∈ N.

3.19. Conjecture a formula for (n ∈ N)

    s_n = Σ_{k=1}^{n} 1 / [k(k + 1)],

and prove it using mathematical induction.

3.20. Suppose that f(x) = tan x. Use (3.55) to approximate f(x) for all x ∈
(-1, 1). Find an upper bound on |e(x)|, where e(x) is the error of the
approximation.




3.21. Given f(x) = x^2 + 1, find all ξ ∈ (1, 2) such that

    f^{(1)}(ξ) = [f(2) - f(1)] / (2 - 1).

3.22. Use (3.65) to find an approximation to √(1 + x) for x ∈ (|, |). Find an upper
bound on the magnitude of e(x).

3.23. Using a pocket calculator, compute (0.97)^{1/3} for n = 3 using (3.82). Find
upper and lower bounds on the error e_{n+1}(x).

3.24. Using a pocket calculator, compute (1.05)^{1/4} using (3.82). Choose n = 2 (i.e.,
quadratic approximation). Estimate the error involved in doing this using the
error bound expression. Compare the bounds to the actual error.

3.25. Show that

    Σ_{k=0}^{n} C(n, k) = 2^n.

3.26. Show that

    j C(n, j) = n C(n - 1, j - 1)

for j = 1, 2, ..., n - 1, n.

3.27. Show that for n ≥ 2

    Σ_{k=1}^{n} k C(n, k) = n 2^{n-1}.

3.28. For

    p_n(k) = C(n, k) p^k (1 - p)^{n-k},   0 < p < 1,

where k = 0, 1, ..., n, show that

    p_n(k + 1) = [(n - k)p / ((k + 1)(1 - p))] p_n(k).

[Comment: This recursive approach for finding p_n(k) extends the range of
n for which p_n(k) may be computed before experiencing problems with
numerical errors.]
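As a rough illustration of the comment (this sketch is mine, not part of the problem), the recursion can be coded in a few lines of MATLAB and checked against the direct formula for small k:

    n = 50; p = 0.3;
    pn = zeros(1, n+1);                 % pn(k+1) holds p_n(k); MATLAB indices start at 1
    pn(1) = (1 - p)^n;                  % p_n(0)
    for k = 0:n-1
        pn(k+2) = ((n - k)*p) / ((k + 1)*(1 - p)) * pn(k+1);
    end
    direct = arrayfun(@(k) nchoosek(n,k) * p^k * (1-p)^(n-k), 0:5);
    disp([pn(1:6); direct])             % the two rows should agree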

3.29. Identity (3.103) was confirmed using an argument associated with Fig. 3.7. A
somewhat different approach is the following. Confirm (3.103) by working
with

    [ (1/√(2π)) ∫_{-∞}^{∞} e^{-x^2/2} dx ]^2 = (1/2π) ∫_{-∞}^{∞} ∫_{-∞}^{∞} e^{-(x^2 + y^2)/2} dx dy.




(Hint: Use the Cartesian to polar coordinate conversion x = r cos θ, y = r sin θ.)

3.30. Find the Maclaurin (infinite) series expansion for f(x) = sin^{-1} x. What is
the radius of convergence? [Hint: Theorem 3.7 and Eq. (3.82) are useful.]

3.31. From Eq. (3.85) for x > -1

    log_e(1 + x) = Σ_{k=1}^{n} (-1)^{k-1} x^k / k + r(x),     (3.P.1)

where

    |r(x)| ≤ x^{n+1} / (n + 1)                     for x ≥ 0,
    |r(x)| ≤ |x|^{n+1} / [(1 - |x|)(n + 1)]        for -1 < x < 0     (3.P.2)

[Eq. (3.86)]. For what range of values of x does the series

    log_e(1 + x) = Σ_{k=1}^{∞} (-1)^{k-1} x^k / k

converge? Explain using (3.P.2).
3.32. The following problems are easily worked with a pocket calculator.

(a) It can be shown that

    sin(x) = Σ_{k=0}^{n} [(-1)^k / (2k + 1)!] x^{2k+1} + e_{2n+3}(x),     (3.P.3)

where

    |e_{2n+3}(x)| ≤ |x|^{2n+3} / (2n + 3)!.     (3.P.4)

Use (3.P.4) to compute sin(x) to 3 decimal places of accuracy for x = 1.5
radians. How large does n need to be?

(b) Use the approximation

    log_e(1 + x) ≈ Σ_{n=1}^{N} (-1)^{n-1} x^n / n

to compute log_e(1.5) to three decimal places of accuracy. How large
should N be to achieve this level of accuracy?




3.33. Assuming that

    (2/π) ∫_0^{π} (sin x / x) dx = 2 Σ_{r=0}^{∞} (-1)^r π^{2r} / [(2r + 1)(2r + 1)!],

show that for suitable n

    | (2/π) ∫_0^{π} (sin x / x) dx - 2 Σ_{r=0}^{n} (-1)^r π^{2r} / [(2r + 1)(2r + 1)!] | ≤ 2π^{2n+1} / [(2n + 2)(2n + 2)!],     (3.P.5)

and hence that

    (2/π) ∫_0^{π} (sin x / x) dx > 1.17.

What is the smallest n needed? Justify Eq. (3.P.5).
3.34. Using integration by parts, find the asymptotic expansion of

    c(x) = ∫_x^{∞} cos(t^2) dt.

3.35. Using integration by parts, find the asymptotic expansion of

    s(x) = ∫_x^{∞} sin(t^2) dt.

3.36. Use MATLAB to plot (on the same graph) the function K_N(x) in Problem 3.11
for N = 2, 4, 15.




4 Linear Systems of Equations 



4.1 INTRODUCTION 

The necessity to solve linear systems of equations is commonplace in numerical 
computing. This chapter considers a few examples of how such problems arise 
(more examples will be seen in subsequent chapters) and the numerical problems 
that are frequently associated with attempts to solve them. We are particularly inter- 
ested in the phenomenon of ill conditioning. We will largely concentrate on how 
the problem arises, what its effects are, and how to test for this problem. 1 In addi- 
tion to this, we will also consider methods of solving linear systems other than the 
Gaussian elimination method that you most likely learned in an elementary linear 
algebra course. 2 More specifically, we consider LU and QR matrix factorization 
methods, and iterative methods of linear system solution. The concept of a singular 
value decomposition (SVD) is also introduced. 

We will often employ the term "linear systems" instead of the longer phrase 
"linear systems of equations." However, the reader must be warned that the phrase 
"linear systems" can have a different meaning from our present usage. In signals 
and systems courses you will most likely see that a "linear system" is either a 
continuous-time (i.e., analog) or discrete-time (i.e., digital) dynamic system whose 
input/output (I/O) behavior satisfies superposition. However, such dynamic systems 
can be described in terms of linear systems of equations. 



4.2 LEAST-SQUARES APPROXIMATION AND LINEAR SYSTEMS 

Suppose that f(x), g(x) ∈ L^2[0, 1], and that these functions are real-valued. Recalling
Chapter 1, their inner product is therefore given by

    <f, g> = ∫_0^1 f(x) g(x) dx.     (4.1)

1 Methods employed to avoid ill-conditioned linear systems of equations will be mainly considered in
a later chapter. These chiefly involve working with orthogonal basis sets.

2 Review the Gaussian elimination procedure now if necessary. It is mandatory that you recall the basic
matrix and vector operations and properties [(AB)^T = B^T A^T, etc.] from elementary linear algebra.







Note that here we will assume that all members of L^2[0, 1] are real-valued (for
simplicity). Now assume that {φ_n(x) | n ∈ Z_N} form a linearly independent set such
that φ_n(x) ∈ L^2[0, 1] for all n. We wish to find the coefficients a_n such that

    f(x) ≈ Σ_{n=0}^{N-1} a_n φ_n(x)   for x ∈ [0, 1].     (4.2)

A popular approach to finding a_n [1] is to choose them to minimize the functional

    V(a) = || f(x) - Σ_{n=0}^{N-1} a_n φ_n(x) ||^2,     (4.3)

where, for future convenience, we will treat a_n as belonging to the vector a =
[a_0 a_1 ··· a_{N-2} a_{N-1}]^T ∈ R^N. Of course, in (4.3) ||f||^2 = <f, f> via (4.1). We are
at liberty to think of

    e(x) = f(x) - Σ_{n=0}^{N-1} a_n φ_n(x)     (4.4)

as the error between f(x) and its approximation Σ_n a_n φ_n(x). So, our goal is to
pick a to minimize ||e(x)||^2, which, we have previously seen, may be interpreted
as the energy of the error e(x). This methodology of approximation is called least-
squares approximation. The version of this that we are now considering is only
one of a great many variations. We will see others later.
We may rewrite (4.3) as follows:

    V(a) = || f(x) - Σ_{n=0}^{N-1} a_n φ_n(x) ||^2
         = < f - Σ_{n=0}^{N-1} a_n φ_n , f - Σ_{k=0}^{N-1} a_k φ_k >
         = <f, f> - < f, Σ_{k=0}^{N-1} a_k φ_k > - < Σ_{n=0}^{N-1} a_n φ_n , f > + < Σ_{n=0}^{N-1} a_n φ_n , Σ_{k=0}^{N-1} a_k φ_k >
         = ||f||^2 - Σ_{k=0}^{N-1} a_k <f, φ_k> - Σ_{n=0}^{N-1} a_n <φ_n, f> + Σ_{n=0}^{N-1} Σ_{k=0}^{N-1} a_k a_n <φ_k, φ_n>




         = ||f||^2 - 2 Σ_{k=0}^{N-1} a_k <f, φ_k> + Σ_{n=0}^{N-1} Σ_{k=0}^{N-1} a_k a_n <φ_k, φ_n>.     (4.5)



Naturally, we have made much use of the inner product properties of Chapter 1.
It is very useful to define

    ρ = ||f||^2,   g_k = <f, φ_k>,   r_{n,k} = <φ_n, φ_k>,     (4.6)

along with the vector

    g = [g_0 g_1 ··· g_{N-1}]^T ∈ R^N,     (4.7a)

and the matrix

    R = [r_{n,k}] ∈ R^{N×N}.     (4.7b)

We immediately observe that R = R^T (i.e., R is symmetric). This is by virtue of
the fact that r_{n,k} = <φ_n, φ_k> = <φ_k, φ_n> = r_{k,n}. Immediately we may rewrite (4.5)
in order to obtain the quadratic form 3

    V(a) = a^T R a - 2 a^T g + ρ.     (4.8)

The quadratic form occurs very widely in optimization and approximation prob-
lems, and so warrants considerable study. An expanded view of R is

    R = [ ∫_0^1 φ_0^2(x) dx            ∫_0^1 φ_0(x)φ_1(x) dx        ···   ∫_0^1 φ_0(x)φ_{N-1}(x) dx   ]
        [ ∫_0^1 φ_0(x)φ_1(x) dx        ∫_0^1 φ_1^2(x) dx            ···   ∫_0^1 φ_1(x)φ_{N-1}(x) dx   ]
        [        ⋮                            ⋮                      ⋱            ⋮                    ]
        [ ∫_0^1 φ_0(x)φ_{N-1}(x) dx    ∫_0^1 φ_1(x)φ_{N-1}(x) dx    ···   ∫_0^1 φ_{N-1}^2(x) dx       ]     (4.9)

Note that the reader must get used to visualizing matrices and vectors in the general 

manner now being employed. The practical use of linear/matrix algebra demands 

this. Writing programs for anything involving matrix methods (which encompasses 

a great deal) is almost impossible without this ability. Moreover, modern software 

tools (e.g., MATLAB) assume that the user is skilled in this manner. 

3 The quadratic form is a generalization of the familiar quadratic ax^2 + bx + c, for which x ∈ C (if
we are interested in the roots of a quadratic equation; otherwise we usually consider x ∈ R), and also
a, b, c ∈ R.




It is essential to us that R^{-1} exist (i.e., we need R to be nonsingular). Fortu-
nately, this is always the case because {φ_n | n ∈ Z_N} is an independent set. We shall
prove this. If R is singular, then there is a set of coefficients α_i (not all zero) such that

    Σ_{i=0}^{N-1} α_i <φ_i, φ_j> = 0     (4.10)

for j ∈ Z_N. This is equivalent to saying that the columns of R are linearly depen-
dent. Now consider the function

    f(x) = Σ_{i=0}^{N-1} α_i φ_i(x),

so if (4.10) holds for all j ∈ Z_N, then

    <f, φ_j> = < Σ_{i=0}^{N-1} α_i φ_i , φ_j > = Σ_{i=0}^{N-1} α_i <φ_i, φ_j> = 0

for all j ∈ Z_N. Thus

    Σ_{j=0}^{N-1} α_j [ Σ_{i=0}^{N-1} α_i <φ_i, φ_j> ] = 0,

implying that

    < Σ_{i=0}^{N-1} α_i φ_i , Σ_{j=0}^{N-1} α_j φ_j > = 0,

or in other words, <f, f> = ||f||^2 = 0, and so f(x) = 0 for all x ∈ [0, 1]. This
contradicts the assumption that {φ_n | n ∈ Z_N} is a linearly independent set. So R^{-1}
must exist.

From (4.3) it is clear that (via basic norm properties) V(a) ≥ 0 for all a ∈ R^N.
If we now assume that f(x) = 0 (all x ∈ [0, 1]), we have ρ = 0, and g = 0 too.
Thus, a^T R a ≥ 0 for all a.

Definition 4.1: Positive Semidefinite Matrix, Positive Definite Matrix  Sup-
pose that A = A^T and that A ∈ R^{n×n}. Suppose that x ∈ R^n. We say that A is
positive semidefinite (psd) iff

    x^T A x ≥ 0

for all x. We say that A is positive definite (pd) iff

    x^T A x > 0

for all x ≠ 0.




If A is psd, we often symbolize this by writing A ≥ 0, and if A is pd, we often
symbolize this by writing A > 0. If a matrix is pd then it is clearly psd, but the
converse is not necessarily true.

So far it is clear that R ≥ 0. But in fact R > 0. This follows from the linear
independence of the columns of R. If the columns of R were linearly dependent,
then there would be an a ≠ 0 such that Ra = 0, but we have already shown that
R^{-1} exists, so it must be so that Ra = 0 iff a = 0. Immediately we conclude that
R is positive definite.

Why is R > 0 so important? Recall that we may solve ax^2 + bx + c = 0 (x ∈ C)
by completing the square:

    ax^2 + bx + c = a ( x + b/(2a) )^2 + c - b^2/(4a).     (4.11)

Now if a > 0, then y(x) = ax^2 + bx + c has a unique minimum. Since y^{(1)}(x) =
2ax + b = 0 for x = -b/(2a), this choice of x forces the first term of (4.11) (right-hand
side of the equality) to zero, and we see that the minimum value of y(x) is

    y( -b/(2a) ) = c - b^2/(4a) = (4ac - b^2)/(4a).     (4.12)

Thus, completing the square makes the location of the minimum (if it exists), and
the value of the minimum of a quadratic very obvious. For the same purpose we
may complete the square of (4.8):

    V(a) = [a - R^{-1}g]^T R [a - R^{-1}g] + ρ - g^T R^{-1} g.     (4.13)

It is quite easy to confirm that (4.13) gives (4.8) (so these equations must be
equivalent):

    [a - R^{-1}g]^T R [a - R^{-1}g] + ρ - g^T R^{-1}g
        = [a^T - g^T (R^{-1})^T][R a - g] + ρ - g^T R^{-1}g
        = a^T R a - g^T R^{-1} R a - a^T g + g^T R^{-1} g + ρ - g^T R^{-1} g
        = a^T R a - g^T a - a^T g + ρ
        = a^T R a - 2 a^T g + ρ.

We have used (R^{-1})^T = (R^T)^{-1} = R^{-1}, and the fact that a^T g = g^T a. If vector
x = a - R^{-1}g, then

    [a - R^{-1}g]^T R [a - R^{-1}g] = x^T R x.

So, because R > 0, it follows that x^T R x > 0 for all x ≠ 0. The last two terms of
(4.13) do not depend on a. So we can minimize V(a) only by minimizing the first
term. R > 0 implies that this minimum must be for x = 0, implying a = â, where

    â - R^{-1} g = 0,

or

    R â = g.     (4.14)




Thus

    â = argmin_a V(a).     (4.15)

We see that to minimize ||e(x)||^2 we must solve a linear system of equations,
namely, Eq. (4.14). We remark that for R > 0, the minimum of V(a) is at a unique
location â ∈ R^N; that is, the minimum is unique.

In principle, solving least-squares approximation problems seems quite simple
because we have systematic (and numerically reliable) methods to solve (4.14)
(e.g., Gaussian elimination with partial pivoting). However, one apparent difficulty
is the need to determine various integrals:

    g_k = ∫_0^1 f(x) φ_k(x) dx,   r_{n,k} = ∫_0^1 φ_n(x) φ_k(x) dx.

Usually, the independent set {φ_k | k ∈ Z_N} is chosen to make finding r_{n,k} rela-
tively straightforward. In fact, sometimes nice closed-form expressions exist. But
numerical integration is generally needed to find g_k. Practically, this could involve
applying series expansions such as considered in Chapter 3, or perhaps using
quadratures such as will be considered in a later chapter. Other than this, there
is a more serious problem. This is the problem that R might be ill-conditioned.
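To make the procedure concrete, the following MATLAB sketch (the data f(x) = sin(πx), the basis, and N = 4 are choices made here for illustration, not taken from the text) sets up and solves (4.14) for the monomial basis φ_k(x) = x^k that is examined in the next section; the g_k are obtained by numerical integration:

    N = 4;
    f = @(x) sin(pi*x);                        % example data (an assumption for illustration)
    g = zeros(N,1); R = zeros(N,N);
    for k = 0:N-1
        g(k+1) = integral(@(x) x.^k .* f(x), 0, 1);   % g_k = <f, phi_k>
        for n = 0:N-1
            R(n+1,k+1) = 1/(n + k + 1);               % r_{n,k} = <phi_n, phi_k>
        end
    end
    a = R \ g;                                 % coefficients solving R*a = g, Eq. (4.14)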



4.3 LEAST-SQUARES APPROXIMATION AND ILL-CONDITIONED 
LINEAR SYSTEMS 

A popular choice for an independent set {φ_k(x)} would be

    φ_k(x) = x^k   for x ∈ [0, 1], k ∈ Z_N.     (4.16)

Certainly, these functions belong to the inner product space L^2[0, 1]. Thus, for
f(x) ∈ L^2[0, 1] an approximation to it is

    f(x) = Σ_{k=0}^{N-1} a_k x^k ∈ P^{N-1}[0, 1],     (4.17)

and so we wish to fit a degree N - 1 polynomial to f(x). Consequently

    g_k = ∫_0^1 x^k f(x) dx,     (4.18a)

which is sometimes called the kth moment 4 of f(x) on [0, 1], and also

    r_{n,k} = ∫_0^1 φ_n(x) φ_k(x) dx = ∫_0^1 x^{n+k} dx = 1/(n + k + 1)     (4.18b)

for n, k ∈ Z_N.

4 The concept of a moment is also central to probability theory.




For example, suppose that N = 3 (i.e., a quadratic fit); then

    g = [ ∫_0^1 f(x) dx   ∫_0^1 x f(x) dx   ∫_0^1 x^2 f(x) dx ]^T = [g_0 g_1 g_2]^T     (4.19a)

and

    R = [ 1    1/2  1/3 ]   [ r_00  r_01  r_02 ]
        [ 1/2  1/3  1/4 ] = [ r_10  r_11  r_12 ]     (4.19b)
        [ 1/3  1/4  1/5 ]   [ r_20  r_21  r_22 ]

and a = [a_0 a_1 a_2]^T, so we wish to solve

    [ 1    1/2  1/3 ] [ a_0 ]   [ ∫_0^1 f(x) dx     ]
    [ 1/2  1/3  1/4 ] [ a_1 ] = [ ∫_0^1 x f(x) dx   ]     (4.20)
    [ 1/3  1/4  1/5 ] [ a_2 ]   [ ∫_0^1 x^2 f(x) dx ]


We remark that R does not depend on the "data" f(x), only the elements of g 
do. This is true in general, and it can be used to advantage. Specifically, if f(x) 
changes frequently (i.e., we must work with different data), but the independent 
set does not change, then we need to invert R only once. 

Matrix R in (4.19b) is a special case of the famous Hilbert matrix [2-4]. The
general form of this matrix is (for any N ∈ N)

    R = [ 1     1/2      1/3      ···   1/N      ]
        [ 1/2   1/3      1/4      ···   1/(N+1)  ]
        [ 1/3   1/4      1/5      ···   1/(N+2)  ]  ∈ R^{N×N}.     (4.21)
        [ ⋮     ⋮        ⋮        ⋱     ⋮        ]
        [ 1/N   1/(N+1)  1/(N+2)  ···   1/(2N-1) ]


Thus, (4.20) is a special case of a Hilbert linear system of equations. The matrix R 
in (4.21) seems "harmless," but it is actually a menace from a numerical computing 
standpoint. We now demonstrate this concept. 






Suppose that our data are something very simple. Say that

    f(x) = 1   for all x ∈ [0, 1].

In this case g_k = ∫_0^1 x^k dx = 1/(k+1). Therefore, for any N ∈ N, we are compelled to
solve

    [ 1     1/2      1/3      ···   1/N      ] [ a_0     ]   [ 1   ]
    [ 1/2   1/3      1/4      ···   1/(N+1)  ] [ a_1     ]   [ 1/2 ]
    [ 1/3   1/4      1/5      ···   1/(N+2)  ] [ a_2     ] = [ 1/3 ]     (4.22)
    [ ⋮     ⋮        ⋮        ⋱     ⋮        ] [ ⋮       ]   [ ⋮   ]
    [ 1/N   1/(N+1)  1/(N+2)  ···   1/(2N-1) ] [ a_{N-1} ]   [ 1/N ]

A moment of thought reveals that solving (4.22) is trivial because g is the first
column of R. Immediately, we see that

    a = [1 0 0 ··· 0 0]^T.     (4.23)

(No other solution is possible since R^{-1} exists, implying that Ra = g always
possesses a unique solution.)

MATLAB implements Gaussian elimination (with partial pivoting) using the
operator "\" to solve linear systems. For example, if we want x in Ax = y for
which A^{-1} exists, then x = A\y. MATLAB also computes R using function "hilb";
that is, R = hilb(N) will result in R being set to an N × N Hilbert matrix. Using the
MATLAB "\" operator to solve for a in (4.22) gives the expected answer (4.23) for
N < 50 (at least). The computer-generated answers are correct to several decimal
places. (Note that it is somewhat unusual to want to fit polynomials to data that
are of such large degree.) So far, so good.

Now consider the results in Appendix 4.A. The MATLAB function "inv" may
be used to compute the inverse of matrices. The appendix shows R^{-1} (computed
via inv) for N = 10, 11, 12, and the MATLAB-computed product RR^{-1} for these
cases. Of course, RR^{-1} = I (identity matrix) is expected in all cases. For the
number of decimal places shown, we observe that RR^{-1} ≠ I. Not only that, but
the error E = RR^{-1} - I rapidly becomes large with an increase in N. For N = 12,
the error is substantial. In fact, the MATLAB function inv has built-in features to
warn of trouble, and it does so for case N = 12. Since RR^{-1} is not being computed
correctly, something has clearly gone wrong, and this has happened for rather small
values of N. This is in striking contrast with the previous problem, where we wanted
to compute a in (4.22). In this case, apparently, nothing went wrong.
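A minimal MATLAB sketch of the two computations just described (the value N = 12 is chosen here to match the worst case in Appendix 4.A) is:

    N = 12;
    R = hilb(N);
    a = R \ R(:,1);                   % data f(x) = 1, so g is the first column of R
    disp(norm(a - eye(N,1), Inf))     % error in the computed coefficient vector
    E = R*inv(R) - eye(N);            % inv issues a warning here, as described above
    disp(max(abs(E(:))))              % how far R*inv(R) is from the identity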

We may consider changing our data to f(x) = x^{N-1}. In this case g_k = 1/(N + k)
for k ∈ Z_N. The vector g in this case will be the last column of R. Thus,
mathematically, a = [0 0 ··· 0 0 1]^T. If we use MATLAB "\" to compute a for this
problem we obtain the computed solutions:

    a = [0.0000 0.0000 0.0000 0.0000 ...
         0.0002 -0.0006 0.0013 -0.0017 0.0014 -0.0007 1.0001]^T   (N = 11),

    a = [0.0000 0.0000 0.0000 -0.0002 ... 0.0015 -0.0067
         0.0187 -0.0342 0.0403 -0.0297 0.0124 0.9978]^T   (N = 12).

The errors in the computed solutions a here are much greater than those experienced
in computing a in (4.23).

It turns out that the Hilbert matrix R is a classical example of an ill-conditioned 
matrix (with respect to the problem of solving linear systems of equations). The 
linear system in which it resides [i.e., Eq. (4.14)] is therefore an ill-conditioned 
linear system. In such systems the final answer (which is a here) can be exquisitely 
sensitive to very small perturbations (i.e., disturbances) in the inputs. The inputs 
in this case are the elements of R and g. From Chapter 2 we remember that R 
and g will not have an exact representation on the computer because of quantiza- 
tion errors. Additionally, as the computation proceeds rounding errors will cause 
further disturbances. The result is that in the end the final computed solution can 
deviate enormously from the correct mathematical solution. On the other hand, we 
have also shown that it is possible for the computed solution to be very close to 
the mathematically correct solution even in the presence of ill conditioning. Our 
problem then is to be able to detect when ill conditioning arises, and hence might 
pose a problem. 



4.4 CONDITION NUMBERS 

In the previous section there appears to be a problem involved in accurately com- 
puting the inverse of R (Hilbert matrix). This was attributed to the so-called ill 
conditioning of R. We begin here with some simpler lower-order examples that 
illustrate how the solution to a linear system Ax — y can depend sensitively on A 
and y. This will lead us to develop a theory of condition numbers that warn us that 
the solution x might be inaccurately computed due to this sensitivity. 

We will consider Ax = y on the assumption that A ∈ R^{n×n}, and x, y ∈ R^n.
Initially we will assume n = 2, so

    [ a_00  a_01 ] [ x_0 ]   [ y_0 ]
    [ a_10  a_11 ] [ x_1 ] = [ y_1 ].     (4.24)



In practice, we may be uncertain about the accuracy of the entries of A and y. 
Perhaps these entities originate from experimental data. So the entries may be 
subject to experimental errors. Additionally, as previously mentioned, the elements 






of A and y cannot normally be exactly represented on a computer because of the 
need to quantize their entries. Thus, we must consider the perturbed system 



    ( [ a_00  a_01 ]   [ δa_00  δa_01 ] ) [ x̂_0 ]   [ y_0 ]   [ δy_0 ]
    ( [ a_10  a_11 ] + [ δa_10  δa_11 ] ) [ x̂_1 ] = [ y_1 ] + [ δy_1 ],     (4.25)

that is, [A + δA] x̂ = y + δy.



The perturbations are δA and δy. We will assume that these are "small." As you
might expect, the practical definition of "small" will force us to define and work
with suitable norms. This is dealt with below. We further assume that the computing
machine we use to solve (4.25) is a "magical machine" that computes without
rounding errors. Thus, any errors in the computed solution, denoted x̂ here, can be
due only to the perturbations δA and δy. It is our hope that x̂ ≈ x. Unfortunately,
this will not always be so, even for n = 2 with small perturbations.

Because n is small, we may obtain closed-form expressions for A^{-1}, x, [A +
δA]^{-1}, and x̂. More specifically

    A^{-1} = 1/(a_00 a_11 - a_01 a_10) [  a_11   -a_01 ]
                                       [ -a_10    a_00 ],     (4.26)

and

    [A + δA]^{-1} = 1/[(a_00 + δa_00)(a_11 + δa_11) - (a_01 + δa_01)(a_10 + δa_10)]
                    × [  a_11 + δa_11       -(a_01 + δa_01) ]
                      [ -(a_10 + δa_10)      a_00 + δa_00   ].     (4.27)

The reader can confirm these by multiplying A^{-1} as given in (4.26) by A. The
2 × 2 identity matrix should be obtained. Using these formulas, we may consider
the following example.

Example 4.1 Suppose that 



-.01 
.01 



Nominally, the correct solution is 



1 
100 



Let us consider different perturbation cases: 



TLFeBOOK 



CONDITION NUMBERS 



137 



1. Suppose that 



In this case 



2. Suppose that 



for which 



SA = 




.005 



δy = [0 0]^T.



x̂ = [1.1429 -85.7143]^T.



<5A = 



.03 



SA 



δy = [0 0]^T.



.01 

.02 



This matrix is mathematically singular, so it does not possess an inverse. If
MATLAB tries to compute x̂ using (4.27), then we obtain

    x̂ = 1.0 × 10^{17} × [-0.0865 -8.6469]^T

and MATLAB issues a warning that the answer may not be correct. Obvi-
ously, this is truly a nonsense answer.
3. Suppose that 



SA 





-.02 



δy = [0.10 -0.05]^T.



In this case

    x̂ = [-1.1500 -325.0000]^T.



It is clear that small perturbations of A and y can lead to large errors in the 
computed value for x. These errors are not a result of accumulated rounding errors 
in the computational algorithm for solving the problem. For computations on a 
"nonmagical" (i.e., "real") computer, this should be at least intuitively plausible 
since our formulas for x are very simple in the sense of creating little opportunity 
for rounding error to grow (there are very few arithmetical operations involved). 
Thus, the errors x — x must be due entirely (or nearly so) to uncertainties in the 
original inputs. We conclude that the real problem is that the linear system we 
are solving is too sensitive to perturbations in the inputs. This naturally raises the 
question of how we may detect such sensitivity. 

In view of this, we shall say that a matrix A is ill-conditioned if the solution x 
(in Ax = y) is very sensitive to perturbations on A and y. Otherwise, the matrix 
is said to be well-conditioned. 






We will need to introduce appropriate norms in order to objectively measure
the sizes of objects in our problem. However, before doing this we make a few
observations that give additional insight into the nature of the problem. In Example
4.1 we note that the first column of A is big (in some sense), while the second
column is small. The smallness of the second column makes A close to being
singular. A similar observation may be made about Hilbert matrices. For a general
N × N Hilbert matrix, the last two columns are given by

    [ 1/(N-1)    1/N      ]
    [ 1/N        1/(N+1)  ]
    [ ⋮          ⋮        ]
    [ 1/(2N-2)   1/(2N-1) ]

For very large N, it is apparent that these two columns are almost linearly depen-
dent; that is, one may be taken as close to being equal to the other (e.g.,
1/(2N-2) ≈ 1/(2N-1) when N is big). Thus, at least at the outset, it seems that
ill-conditioned matrices are close to being singular, and that this is the root cause
of the sensitivity problem.

We now need to extend our treatment of the concept of norms from what we 
have seen in earlier chapters. Our main source is Golub and Van Loan [5], but 
similar information is to be found in Ref. 3 or 4 (or the references cited therein). 
A fairly rigorous treatment of matrix and vector norms can be found in Horn and 
Johnson [6]. 

Suppose again that x ∈ R^n. The p-norm of x is defined to be

    ||x||_p = [ Σ_{k=0}^{n-1} |x_k|^p ]^{1/p},     (4.28)

where p ≥ 1. The most important special cases are, respectively, the 1-norm, 2-
norm, and ∞-norm:

    ||x||_1 = Σ_{k=0}^{n-1} |x_k|,     (4.29a)
    ||x||_2 = [x^T x]^{1/2},     (4.29b)
    ||x||_∞ = max_{0≤k≤n-1} |x_k|.     (4.29c)

The operation "max" means to select the biggest |x_k|. A unit vector with respect
to norm || · || is a vector x such that ||x|| = 1. Note that if x is a unit vector with




respect to one norm, then it is not necessarily a unit vector with respect to another
choice of norm. For example, suppose that x = [√3/2  1/2]^T; then

    ||x||_2 = 1,   ||x||_1 = (√3 + 1)/2,   ||x||_∞ = √3/2.

The vector x is a unit vector under the 2-norm, but is not a unit vector under the
1-norm or the ∞-norm.
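These three numbers are easily checked in MATLAB (a small sketch, not part of the text):

    x = [sqrt(3)/2; 1/2];
    disp([norm(x,1), norm(x,2), norm(x,Inf)])   % (sqrt(3)+1)/2, 1, sqrt(3)/2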

Norms have various properties that we will list without proof. Assume that
x, y ∈ R^n. For example, the Hölder inequality [recall (1.16) (in Chapter 1) for
comparison] is

    |x^T y| ≤ ||x||_p ||y||_q     (4.30)

for which 1/p + 1/q = 1. A special case is the Cauchy-Schwarz inequality

    |x^T y| ≤ ||x||_2 ||y||_2,     (4.31)

which is a special instance of Theorem 1.1 (in Chapter 1). An important feature of
norms is that they are equivalent. This means that if || · ||_α and || · ||_β are norms
on R^n, then there are c_1, c_2 > 0 such that

    c_1 ||x||_α ≤ ||x||_β ≤ c_2 ||x||_α     (4.32)

for all x ∈ R^n. Some special instances of this are

    ||x||_2 ≤ ||x||_1 ≤ √n ||x||_2,     (4.33a)
    ||x||_∞ ≤ ||x||_2 ≤ √n ||x||_∞,     (4.33b)
    ||x||_∞ ≤ ||x||_1 ≤ n ||x||_∞.     (4.33c)

Equivalence is significant with respect to our problem in the following manner. 
When we define condition numbers below, we shall see that the specific value of 
the condition number depends in part on the choice of norm. However, equivalence 
says that if a matrix is ill-conditioned with respect to one type of norm, then it 
must be ill-conditioned with respect to any other type of norm. This can simplify 
analysis in practice because it allows us to compute the condition number using 
whatever norms are the easiest to work with. Equivalence can be useful in another 
respect. If we have a sequence of vectors in the space R" , then, if the sequence is 
Cauchy with respect to some chosen norm, it must be Cauchy with respect to any 
other choice of norm. This can simplify convergence analysis, again because we 
may pick the norm that is easiest to work with. 

In Chapter 2 we considered absolute and relative error in the execution of 
floating-point operations. In this setting, operations were on scalars, and scalar 
solutions were generated. Now we must redefine absolute and relative error for 
vector quantities using the norms defined in the previous paragraph. Since x̂ ∈ R^n




is the computed (i.e., approximate) solution to x ∈ R^n, it is reasonable to define the
absolute error to be

    ε_a = ||x̂ - x||,     (4.34a)

and the relative error is

    ε_r = ||x̂ - x|| / ||x||.     (4.34b)

Of course, x ≠ 0 is assumed here. The choice of norm is in principle arbitrary.
However, if we use the ∞-norm, then the concept of relative error with respect to
it can be made equivalent to a statement about the correct number of significant
digits in x̂:

    ||x̂ - x||_∞ / ||x||_∞ ≈ 10^{-d}.     (4.35)

In other words, the largest element of the computed solution x̂ is correct to approx-
imately d decimal digits. For example, suppose that x = [1.256 -2.554]^T, and
x̂ = [1.251 -2.887]^T; then x̂ - x = [-0.005 -0.333]^T, and so

    ||x̂ - x||_∞ = 0.333,   ||x||_∞ = 2.554,

so therefore ε_r = 0.1304 ≈ 10^{-1}. Thus, x̂ has a largest element that is accurate to
about one decimal digit, but the smallest element is observed to be correct to about
three significant digits.
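A short MATLAB check of (4.34b) and (4.35) for these particular vectors (a sketch added here for illustration):

    x    = [1.256; -2.554];
    xhat = [1.251; -2.887];
    er = norm(xhat - x, Inf) / norm(x, Inf);   % relative error in the infinity norm
    d  = -log10(er);                           % approximate number of correct digits
    disp([er, d])                              % roughly 0.13 and 0.9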

Matrices can have norms defined on them. We have remarked that ill condi-
tioning seems to arise when a matrix is close to singular. Suitable matrix norms
can allow us to measure how close a matrix is to being singular, and thus give
insight into its condition. Suppose that A, B ∈ R^{m×n} (so A and B are not neces-
sarily square matrices). || · || : R^{m×n} → R is a matrix norm, provided the following
axioms hold:

(MN1) ||A|| ≥ 0 for all A, and ||A|| = 0 iff A = 0.
(MN2) ||A + B|| ≤ ||A|| + ||B||.
(MN3) ||αA|| = |α| ||A||. Constant α is from the same field as the elements of
the matrix A.

In the present context we usually consider α ∈ R. Extensions to complex-valued
matrices and vectors are possible. The axioms above are essentially the same as for
the norm in all other cases (see Definition 1.3 for comparison). The most common
matrix norms are the Frobenius norm

    ||A||_F = [ Σ_{k=0}^{m-1} Σ_{l=0}^{n-1} |a_{kl}|^2 ]^{1/2}     (4.36a)




and the p-norms

    ||A||_p = sup_{x≠0} ||Ax||_p / ||x||_p.     (4.36b)

We see that in (4.36b) the matrix p-norm is dependent on the vector p-norm. Via
(4.36b), we have

    ||Ax||_p ≤ ||A||_p ||x||_p.     (4.36c)

We may regard A as an operator applied to x that yields output Ax. Equation
(4.36c) gives an upper bound on the size of the output, as we know the size of A
and the size of x as given by their respective p-norms. Also, since A ∈ R^{m×n} it
must be the case that x ∈ R^n, but y = Ax ∈ R^m. We observe that

    ||A||_p = sup_{x≠0} ||Ax||_p / ||x||_p = max_{||x||_p = 1} ||Ax||_p.     (4.37)

This is an alternative means to compute the matrix p-norm: Evaluate ||Ax||_p at all
points on the unit sphere, which is the set of vectors {x | ||x||_p = 1}, and then pick
the largest value of ||Ax||_p. Note that the term "sphere" is an extension of what
we normally mean by a sphere. For the 2-norm in n dimensions, the unit sphere is
clearly

    ||x||_2 = [x_0^2 + x_1^2 + ··· + x_{n-1}^2]^{1/2} = 1.     (4.38)

This represents our intuitive (i.e., Euclidean) notion of a sphere. But, say, for the
1-norm the unit sphere is

    ||x||_1 = |x_0| + |x_1| + ··· + |x_{n-1}| = 1.     (4.39)

Equations (4.38) and (4.39) specify very different looking surfaces in n-dimensional
space. A suggested exercise is to sketch these spheres for n = 2.

As with vector norms, matrix norms have various properties. One property pos-
sessed by the matrix p-norms is called the submultiplicative property:

    ||AB||_p ≤ ||A||_p ||B||_p,   A ∈ R^{m×n}, B ∈ R^{n×q}.     (4.40)

(The reader is warned that not all matrix norms possess this property; a coun-
terexample appears on p. 57 of Golub and Van Loan [5].) A miscellany of other
properties (including equivalences) is

    ||A||_2 ≤ ||A||_F ≤ √n ||A||_2,     (4.41a)
    max_{i,j} |a_{ij}| ≤ ||A||_2 ≤ √(mn) max_{i,j} |a_{ij}|,     (4.41b)


TLFeBOOK 



142 LINEAR SYSTEMS OF EQUATIONS 



m — \ 



A||i=maxy]|a u |, (4.41c) 



/eZ„ 

(=0 



n-\ 

X 

ieZ, 

1 



A||oo =maxY"|a iJ |, (4.41d) 

Sal '—' J 

.7=0 



,-JAHoo < ||A|j 2 < V^IIAMoo, (4.41e) 

l|A||i <||A|| 2 <V^||A||i. (4.41f) 

m 



The equivalences [e.g., (4.41a) and (4.41b)] have the same significance for 
matrices as the analogous equivalences for vectors seen in (4.32) and (4.33). 

From (4.41c,d) we see that computing matrix 1 -norms and oo-norms is easy. 
However, computing matrix 2-norms is not easy. Consider (4.37) with p = 2:

    ||A||_2 = max_{||x||_2 = 1} ||Ax||_2.     (4.42)

Let R = A^T A ∈ R^{n×n} (no, R is not a Hilbert matrix here; we have "recycled" the
symbol for another use), so then

    ||Ax||_2^2 = x^T A^T A x = x^T R x.     (4.43)

Now consider n = 2. Thus

    x^T R x = [x_0 x_1] [ r_00  r_01 ] [ x_0 ]
                        [ r_10  r_11 ] [ x_1 ],     (4.44)

where r_01 = r_10 because R = R^T. The vectors and matrix in (4.44) multiply out
to become

    x^T R x = r_00 x_0^2 + 2 r_01 x_0 x_1 + r_11 x_1^2.     (4.45)

Since ||A||_2^2 = max_{||x||_2 = 1} ||Ax||_2^2, we may find ||A||_2^2 by maximizing (4.45) subject
to the equality constraint ||x||_2^2 = 1, i.e., x^T x = x_0^2 + x_1^2 = 1. This problem may
be solved by using Lagrange multipliers (considered somewhat more formally in
Section 8.5). Thus, we must maximize

    V(x) = x^T R x - λ[x^T x - 1],     (4.46)

where λ is the Lagrange multiplier. Since

    V(x) = r_00 x_0^2 + 2 r_01 x_0 x_1 + r_11 x_1^2 - λ[x_0^2 + x_1^2 - 1],






we have

    ∂V(x)/∂x_0 = 2 r_00 x_0 + 2 r_01 x_1 - 2λ x_0 = 0,
    ∂V(x)/∂x_1 = 2 r_01 x_0 + 2 r_11 x_1 - 2λ x_1 = 0,

and these equations may be rewritten in matrix form as

    [ r_00  r_01 ] [ x_0 ]     [ x_0 ]
    [ r_10  r_11 ] [ x_1 ] = λ [ x_1 ].     (4.47)



In other words, Rx = λx. Thus, the optimum choice of x is an eigenvector of
R = A^T A. But which eigenvector is it?

First note that A^{-1} exists (by assumption), so x^T R x = x^T A^T A x =
(Ax)^T (Ax) > 0 for all x ≠ 0. Therefore, R > 0. Additionally, R = R^T, so all
of the eigenvalues of R are real numbers. 5 Furthermore, because R > 0, all of its
eigenvalues are positive. This follows if we consider Rx = λx, and assume that
λ < 0. In this case x^T R x = λ x^T x = λ||x||_2^2 < 0 for any x ≠ 0. [If λ = 0, then
Rx = 0 · x = 0 implies that x = 0 (as R^{-1} exists), so x^T R x = 0.] But this con-
tradicts the assumption that R > 0, and so all of the eigenvalues of R must be
positive. Now, since ||Ax||_2^2 = x^T A^T A x = x^T R x = x^T (λx) = λ||x||_2^2, and since
||x||_2^2 = 1, it must be the case that ||Ax||_2^2 is biggest for the eigenvector of R cor-
responding to the biggest eigenvalue of R. If the eigenvalues of R are denoted λ_1
and λ_0 with λ_1 ≥ λ_0 > 0, then finally we must have

    ||A||_2^2 = λ_1.     (4.48)



This argument can be generalized for all n ≥ 2. If R > 0, we assume that all
of its eigenvalues are distinct (this is not always true). If we denote them by
λ_0, λ_1, ..., λ_{n-1}, then we may arrange them in decreasing order:

    λ_{n-1} ≥ λ_{n-2} ≥ ··· ≥ λ_1 ≥ λ_0 > 0.     (4.49)

Therefore, for A ∈ R^{n×n}

    ||A||_2^2 = λ_{n-1}.     (4.50)

The problem of computing the eigenvalues and eigenvectors of a matrix has its own
special numerical difficulties. At this point we warn the reader that these problems
must never be treated lightly.



5 If A is a real-valued symmetric square matrix, then we may prove this claim as follows. Suppose
that for eigenvector x of A, the eigenvalue is λ, that is, Ax = λx. Now ((Ax)*)^T = ((λx)*)^T, and so
(x*)^T A^T = λ*(x*)^T. Therefore, (x*)^T A^T x = λ*(x*)^T x. But (x*)^T A^T x = (x*)^T A x = λ(x*)^T x, so
finally λ*(x*)^T x = λ(x*)^T x, so we must have λ = λ*. This can be true only if λ ∈ R.






Example 4.2  Let det(A) denote the determinant of A. Suppose that R = A^T A,
where

    R = [ 1    0.5 ]
        [ 0.5  1   ].

We will find ||A||_2. Consider

    [ 1    0.5 ] [ x_0 ]     [ x_0 ]
    [ 0.5  1   ] [ x_1 ] = λ [ x_1 ].

We must solve det(λI - R) = 0 for λ. [Recall that det(λI - R) is the characteristic
polynomial of R.] Thus

    det(λI - R) = det [ λ - 1    -0.5  ]
                      [ -0.5     λ - 1 ] = (λ - 1)^2 - 1/4 = 0,

and (λ - 1)^2 - 1/4 = λ^2 - 2λ + 3/4 = 0, for

    λ = [ -(-2) ± √((-2)^2 - 4 · (3/4)) ] / 2 = (2 ± 1)/2.

So λ_1 = 3/2, λ_0 = 1/2. Thus, ||A||_2^2 = λ_1 = 3/2, and so finally ||A||_2 = √(3/2).
(We do not need the eigenvectors of R to compute the 2-norm of A.)

We see that the essence of computing the 2-norm of matrix A is to find the zeros
of the characteristic polynomial of A^T A. The problem of finding polynomial zeros
is the subject of a later chapter. Again, this problem has its own special numerical
difficulties that must never be treated lightly.
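Example 4.2 is easy to confirm numerically (a sketch, not from the text): the largest eigenvalue of R = A^T A gives ||A||_2^2.

    R = [1 0.5; 0.5 1];                      % R = A'*A is specified directly, as in Example 4.2
    lambda = eig(R);                         % eigenvalues 1/2 and 3/2
    disp([max(lambda), sqrt(max(lambda))])   % lambda_1 = 3/2 and ||A||_2 = sqrt(3/2)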

We now derive the condition number. Begin by assuming that A ∈ R^{n×n}, and
that A^{-1} exists. The error between the computed solution x̂ to Ax = y and x is

    e = x - x̂.     (4.51)

Ax = y, but Ax̂ ≠ y in general. So we may define the residual

    r = y - Ax̂.     (4.52)

We see that

    Ae = Ax - Ax̂ = y - Ax̂ = r.     (4.53)

Thus, e = A^{-1}r. We observe that if e = 0, then r = 0, but if r is small, then e
is not necessarily small because A^{-1} might be big, making A^{-1}r big. In other


words, a small residual r does not guarantee that x̂ is close to x. Sometimes r is
computed as a cursory check to see if x̂ is "reasonable." The main advantage of
r is that it may always be computed, whereas x is not known in advance and so
e may never be computed exactly. Below it will be shown that considering r in
combination with a condition number is a more reliable method of assessing how
close x̂ is to x.

Now, since e = A^{-1}r, we can say that ||e||_p = ||A^{-1}r||_p ≤ ||A^{-1}||_p ||r||_p [via
(4.36c)]. Similarly, since r = Ae, we have ||r||_p = ||Ae||_p ≤ ||A||_p ||e||_p. Thus

    ||r||_p / ||A||_p ≤ ||e||_p ≤ ||A^{-1}||_p ||r||_p.     (4.54)

Similarly, x = A^{-1}y, so immediately

    ||y||_p / ||A||_p ≤ ||x||_p ≤ ||A^{-1}||_p ||y||_p.     (4.55)

If ||x||_p ≠ 0, and ||y||_p ≠ 0, then taking reciprocals in (4.55) yields

    1 / (||A^{-1}||_p ||y||_p) ≤ 1 / ||x||_p ≤ ||A||_p / ||y||_p.     (4.56)

We may multiply corresponding terms in (4.56) and (4.54) to obtain

    [1 / (||A^{-1}||_p ||A||_p)] (||r||_p / ||y||_p) ≤ ||e||_p / ||x||_p ≤ ||A^{-1}||_p ||A||_p (||r||_p / ||y||_p).     (4.57)

We recall from (4.34b) that ε_r = ||x - x̂||_p / ||x||_p = ||e||_p / ||x||_p, so

    [1 / (||A^{-1}||_p ||A||_p)] (||r||_p / ||y||_p) ≤ ε_r ≤ ||A^{-1}||_p ||A||_p (||r||_p / ||y||_p).     (4.58)

We call

    ||r||_p / ||y||_p = ||y - Ax̂||_p / ||y||_p     (4.59)

the relative residual. We define

    κ_p(A) = ||A||_p ||A^{-1}||_p     (4.60)

to be the condition number of A. It is immediately apparent that κ_p(A) ≥ 1 for
any A and valid p. We see that ε_r is between 1/κ_p(A) and κ_p(A) times the
relative residual. In particular, if κ_p(A) >> 1 (i.e., if the condition number is very
large), even if the relative residual is tiny, then ε_r might be large. On the other
hand, if κ_p(A) is close to unity, then ε_r will be small if the relative residual is
small. In conclusion, if κ_p(A) is large, it is a warning (not a certainty) that small





perturbations in A and y may cause x̂ to differ greatly from x. Equivalently, if κ_p(A)
is large, then a small r does not imply that x̂ is close to x.

A rule of thumb in interpreting condition numbers is as follows [3, p. 229], and
is more or less true regardless of p in (4.60). If κ_p(A) ≈ d × 10^k, where d is a
decimal digit from one to nine, we can expect to lose (at worst) about k digits of
accuracy. The reason that p does not matter too much is because we recall that
matrix norms are equivalent. Therefore, for this rule of thumb to be useful, the
working precision of the computing machine/software package must be known.
For example, MATLAB computes to about 16 decimal digits of precision. Thus,
k ≥ 16 would give us concern that x̂ is not close to x.
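The role of κ_p(A) in (4.58) can be seen in a small experiment (the matrix, data, and perturbation below are arbitrary choices made for illustration):

    A = [1 1; 1 1.0001];                     % a nearly singular matrix
    x = [1; 1]; y = A*x;
    xhat = A \ (y + [1e-6; -1e-6]);          % solve with a tiny perturbation of y
    er      = norm(x - xhat)/norm(x);        % actual relative error
    rel_res = norm(y - A*xhat)/norm(y);      % relative residual
    disp([er, cond(A)*rel_res])              % (4.58): er does not exceed cond(A)*rel_res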



Example 4.3  Suppose that

    A = [ 1   1 - ε ]
        [ 1   1     ]  ∈ R^{2×2},   |ε| << 1.

We will determine an estimate of κ_1(A). Clearly

    A^{-1} = (1/ε) [  1    -1 + ε ]   [ b_00  b_01 ]
                   [ -1     1     ] = [ b_10  b_11 ].

We have

    Σ_{i=0}^{1} |a_{i,0}| = |a_00| + |a_10| = 2,   Σ_{i=0}^{1} |a_{i,1}| = |a_01| + |a_11| = |1 - ε| + 1,

so via (4.41c), ||A||_1 = max{2, |1 - ε| + 1} ≈ 2. Similarly

    Σ_{i=0}^{1} |b_{i,0}| = |b_00| + |b_10| = 2/|ε|,   Σ_{i=0}^{1} |b_{i,1}| = |b_01| + |b_11| = (|-1 + ε| + 1)/|ε|,

so again via (4.41c) ||A^{-1}||_1 = max{2/|ε|, (|1 - ε| + 1)/|ε|} ≈ 2/|ε|. Thus

    κ_1(A) ≈ 4/|ε|.

We observe that if ε = 0, then A^{-1} does not exist, so our approximation to κ_1(A)
is a reasonable result because

    lim_{ε→0} κ_1(A) = ∞.

We may wish to compute κ_2(A) = ||A||_2 ||A^{-1}||_2. We will suppose that A ∈
R^{n×n} and that A^{-1} exists. But we recall that computing matrix 2-norms involves
finding eigenvalues. More specifically, ||A||_2^2 is the largest eigenvalue of R = A^T A




[recall (4.50)]. Suppose, as in (4.49), that λ_0 is the smallest eigenvalue of R, for
which the corresponding eigenvector is denoted by v, that is, Rv = λ_0 v. Then we
observe that R^{-1}v = (1/λ_0)v. In other words, 1/λ_0 is an eigenvalue of R^{-1}. By similar
reasoning, 1/λ_k for k ∈ Z_n must all be eigenvalues of R^{-1}. Thus, 1/λ_0 will be the
biggest eigenvalue of R^{-1}. For present simplicity assume that A is a normal matrix.
This means that AA^T = A^T A = R. The reader is cautioned that not all matrices A
are normal. However, in this case we have R^{-1} = A^{-1}A^{-T} = A^{-T}A^{-1}. [Recall
that (A^{-1})^T = (A^T)^{-1} = A^{-T}.] We have that ||A^{-1}||_2^2 is the largest eigenvalue of
A^{-T}A^{-1}, but R^{-1} = A^{-T}A^{-1} since A is assumed normal. The largest eigenvalue
of R^{-1} has been established to be 1/λ_0, so it must be the case that for a normal
matrix A (real-valued and invertible)

    κ_2(A) = √(λ_{n-1} / λ_0),     (4.61)

that is, A is ill-conditioned if the ratio of the biggest to smallest eigenvalue of
A^T A is large. In other words, a large eigenvalue spread is associated with matrix
ill conditioning. It turns out that this conclusion holds even if A is not normal; that
is, (4.61) is valid even if A is not normal. But we will not prove this. (The interested
reader can see pp. 312 and 340 of Horn and Johnson [6] for more information.)

An obvious difficulty with condition numbers is that their exact calculation often 
seems to require knowledge of A -1 . Clearly this is problematic since computing 
A -1 accurately may not be easy or possible (because of ill conditioning). We seem 
to have a "chicken and egg" problem. This problem is often dealt with by using 
condition number estimators. This in turn generally involves placing bounds on 
condition numbers. But the subject of condition number estimation is not within 
the scope of this book. The interested reader might consult Higham [7] for further 
information on this subject if desired. There is some information on this matter in 
the treatise by Golub and Van Loan [5, pp. 128-130], which includes a pseudocode 
algorithm for oo-norm condition number estimation of an upper triangular nonsin- 
gular matrix. We remark that ||A||2 is sometimes called the spectral norm of A, 
and is actually best computed using entities called singular values [5, p. 72]. This 
is because computing singular values avoids the necessity of computing A -1 , and 
can be done in a numerically reliable manner. Singular values will be discussed in 
more detail later. 

We conclude this section with a remark about the Hilbert matrix R of Section 4.3.
As discussed by Hill [3, p. 232], we have

    κ_2(R) ∝ e^{αN}

for some α > 0. (Recall that symbol ∝ means "proportional to.") Proving this is
tough, and we will not attempt it. Thus, the condition number of R grows very
rapidly with N and explains why the attempt to invert R in Appendix 4.A failed
for so small a value of N.
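The growth is easy to observe with MATLAB's cond function (a short sketch added here; cond uses singular values, so no explicit inverse is formed):

    for N = 2:2:12
        fprintf('N = %2d   cond(hilb(N)) = %10.3e\n', N, cond(hilb(N)));
    end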




4.5 LU DECOMPOSITION 

In this section we will assume A ∈ R^{n×n}, and that A^{-1} exists. Many algorithms
to solve Ax = y work by factoring the matrix A in various ways. In this section
we consider a Gaussian elimination approach to writing A as

    A = LU,     (4.62)

where L is a nonsingular lower triangular matrix, and U is a nonsingular upper
triangular matrix. This is the LU decomposition (factorization) of A. Naturally,
L, U ∈ R^{n×n}, and L = [l_{ij}], U = [u_{ij}]. Since these matrices are lower and upper
triangular, respectively, it must be the case that

    l_{ij} = 0 for j > i   and   u_{ij} = 0 for j < i.     (4.63)

For example, the following are (respectively) lower and upper triangular matrices:

    L = [ 1  0  0 ]        U = [ 1  2  3 ]
        [ 1  1  0 ],           [ 0  4  5 ].
        [ 1  1  1 ]            [ 0  0  6 ]

These matrices are clearly nonsingular since their determinants are 1 and 24, respec-
tively. In fact, L is nonsingular iff l_{ii} ≠ 0 for all i, and U is nonsingular iff u_{jj} ≠ 0
for all j. We note that with A factored as in (4.62), the solution of Ax = y becomes
quite easy, but the details of this will be considered later. We now concentrate on
finding the factors L, U.

We begin by defining a Gauss transformation matrix G_k such that

            [ 1  ···  0            0  ···  0 ] [ x_0     ]   [ x_0     ]
            [ ⋮   ⋱   ⋮            ⋮       ⋮ ] [ ⋮       ]   [ ⋮       ]
    G_k x = [ 0  ···  1            0  ···  0 ] [ x_{k-1} ] = [ x_{k-1} ]     (4.64)
            [ 0  ···  -τ_k^k       1  ···  0 ] [ x_k     ]   [ 0       ]
            [ ⋮        ⋮           ⋮   ⋱   ⋮ ] [ ⋮       ]   [ ⋮       ]
            [ 0  ···  -τ_{n-1}^k   0  ···  1 ] [ x_{n-1} ]   [ 0       ]

for

    τ_i^k = x_i / x_{k-1},   i = k, k + 1, ..., n - 1.     (4.65)

The superscript k on τ_i^k does not denote raising τ_i to a power; it is simply part of
the name of the symbol. This naming convention is needed to account for the fact
that there is a different set of τ values for every G_k. For this to work requires that
x_{k-1} ≠ 0. Equation (4.65) followed from considering the matrix-vector product






in (4.64):

    -τ_k^k x_{k-1} + x_k = 0,
    -τ_{k+1}^k x_{k-1} + x_{k+1} = 0,
    ⋮
    -τ_{n-1}^k x_{k-1} + x_{n-1} = 0.

We observe that G_k is "designed" to annihilate the last n - k elements of vector
x. We also see that G_k is lower triangular, and if it exists, always possesses an
inverse because the main diagonal elements are all equal to unity. A lower triangular
matrix where all of the main diagonal elements are equal to unity is called unit
lower triangular. Similar terminology applies to upper triangular matrices. Define
the kth Gauss vector

    τ^k = [ 0 ··· 0  τ_k^k ··· τ_{n-1}^k ]^T,     (4.66)
           (k zeros)

The kth unit vector is

    e_k^T = [ 0 ··· 0   1   0 ··· 0 ].     (4.67)
             (k zeros)    (n-k-1 zeros)



If I is an n × n identity matrix, then

    G_k = I - τ^k e_{k-1}^T     (4.68)

for k = 1, 2, ..., n - 1. For example, if n = 4, we have

    G_1 = [  1       0  0  0 ]     G_2 = [ 1    0      0  0 ]     G_3 = [ 1  0    0      0 ]
          [ -τ_1^1   1  0  0 ],          [ 0    1      0  0 ],          [ 0  1    0      0 ].     (4.69)
          [ -τ_2^1   0  1  0 ]           [ 0  -τ_2^2   1  0 ]           [ 0  0    1      0 ]
          [ -τ_3^1   0  0  1 ]           [ 0  -τ_3^2   0  1 ]           [ 0  0  -τ_3^3   1 ]

The Gauss transformation matrices may be applied to A, yielding an upper trian-
gular matrix. This is illustrated by the following example.

Example 4.4  Suppose that

    A = [  1  2  3  4 ]
        [ -1  1  2  1 ]  (= A^0).
        [  0  2  1  3 ]
        [  0  0  1  1 ]

We introduce matrices A^k, where A^k = G_k A^{k-1} for k = 1, 2, ..., n - 1, and finally
U = A^{n-1}. Once again, A^k is not the kth power of A, but rather denotes the kth
matrix in a sequence of matrices. Now consider

    G_1 A^0 = [ 1  0  0  0 ] [  1  2  3  4 ]   [ 1  2  3  4 ]
              [ 1  1  0  0 ] [ -1  1  2  1 ] = [ 0  3  5  5 ] = A^1,
              [ 0  0  1  0 ] [  0  2  1  3 ]   [ 0  2  1  3 ]
              [ 0  0  0  1 ] [  0  0  1  1 ]   [ 0  0  1  1 ]

for which the τ_i^1 entries in the first column of G_1 depend on the first column of
A^0 (i.e., of A) according to (4.65). Similarly

    G_2 A^1 = [ 1   0    0  0 ] [ 1  2  3  4 ]   [ 1  2   3     4    ]
              [ 0   1    0  0 ] [ 0  3  5  5 ] = [ 0  3   5     5    ] = A^2,
              [ 0  -2/3  1  0 ] [ 0  2  1  3 ]   [ 0  0  -7/3  -1/3  ]
              [ 0   0    0  1 ] [ 0  0  1  1 ]   [ 0  0   1     1    ]

for which the τ_i^2 entries in the second column of G_2 depend on the second column
of A^1, and also

    G_3 A^2 = [ 1  0  0    0 ] [ 1  2   3     4    ]   [ 1  2   3     4    ]
              [ 0  1  0    0 ] [ 0  3   5     5    ] = [ 0  3   5     5    ] = U,
              [ 0  0  1    0 ] [ 0  0  -7/3  -1/3  ]   [ 0  0  -7/3  -1/3  ]
              [ 0  0  3/7  1 ] [ 0  0   1     1    ]   [ 0  0   0     6/7  ]

for which the τ_3^3 entry in the third column of G_3 depends on the third column of
A^2. We see that U is indeed upper triangular, and it is also nonsingular. We also
see that

    U = G_3 G_2 G_1 A.

Since the product of lower triangular matrices is a lower triangular matrix, it is the
case that L_1 = G_3 G_2 G_1 is lower triangular. Thus

    A = L_1^{-1} U.

Since the inverse (if it exists) of a lower triangular matrix is also a lower triangular
matrix, we can define L = L_1^{-1}, and so A = LU. Thus

    L = G_1^{-1} G_2^{-1} G_3^{-1}.
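The factorization in this example can be verified numerically. The sketch below (mine, not the author's listing) builds G_1, G_2, G_3 explicitly from the τ values found above and confirms that LU = A:

    A  = [1 2 3 4; -1 1 2 1; 0 2 1 3; 0 0 1 1];
    G1 = [1 0 0 0; 1 1 0 0; 0 0 1 0; 0 0 0 1];      % -tau^1 = [1 0 0]' placed in column 1
    G2 = [1 0 0 0; 0 1 0 0; 0 -2/3 1 0; 0 0 0 1];   % -tau^2 = [-2/3 0]' placed in column 2
    G3 = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 3/7 1];    % -tau^3 = [3/7] placed in column 3
    U  = G3*G2*G1*A;                                % upper triangular, as computed above
    L  = inv(G1)*inv(G2)*inv(G3);                   % unit lower triangular
    disp(norm(L*U - A))                             % (near) zero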

From this example it appears that we need to do much work in order to find G_k^{-1}.
However, this is not the case. It turns out that

    G_k^{-1} = I + τ^k e_{k-1}^T.     (4.70)






This is easy to confirm. From (4.68) and (4.70)

    G_k G_k^{-1} = [I - τ^k e_{k-1}^T][I + τ^k e_{k-1}^T]
                 = I - τ^k e_{k-1}^T + τ^k e_{k-1}^T - τ^k e_{k-1}^T τ^k e_{k-1}^T
                 = I - τ^k (e_{k-1}^T τ^k) e_{k-1}^T.

But from (4.66) and (4.67) e_{k-1}^T τ^k = 0, so finally G_k G_k^{-1} = I.

To obtain τ^k from (4.65), we see that we must divide by x_{k-1}. In our matrix
factorization application of the Gauss transformation, we have seen (in Example
4.4) that x_{k-1} will be an element of A^{k-1}. These elements are called pivots. It
is apparent that the factorization procedure cannot work if a pivot is zero. The
occurrence of zero-valued pivots is a common situation. A simple example of a
matrix that cannot be factored with our algorithm is

    A = [ 0  1 ]
        [ 1  0 ].     (4.71)

In this case

    G_1 A = [ 1          0 ] [ a_00  a_01 ]
            [ -x_1/x_0   1 ] [ a_10  a_11 ],     (4.72)

and from (4.65)

    τ_1^1 = a_10 / a_00 = 1/0.     (4.73)



This result implies that not all matrices possess an LU factorization. Let det(A)
denote the determinant of A. We may state a general condition for the existence of
the LU factorization:

Theorem 4.1:  Since A = [a_{ij}]_{i,j=0,...,n-1} ∈ R^{n×n} we define the kth leading
principle submatrix of A to be A_k = [a_{ij}]_{i,j=0,...,k-1} ∈ R^{k×k} for k = 1, 2, ..., n
(so that A = A_n, and A_1 = [a_00] = a_00). There exists a unit lower triangular matrix
L and an upper triangular matrix U such that A = LU, provided that det(A_k) ≠ 0
for all k = 1, 2, ..., n. Furthermore, with U = [u_{ij}] ∈ R^{n×n} we have det(A_k) =
u_00 u_11 ··· u_{k-1,k-1}.

The proof is given in Golub and Van Loan [5]. It will not be considered here. For
A in (4.71), we see that A_1 = [0] = 0, so det(A_1) = 0. Thus, even though A^{-1}
exists, it does not possess an LU decomposition. It is also easy to verify that for



    A = [ 1  4  1 ]
        [ 2  8  1 ],
        [ 1  1  1 ]

although A^{-1} exists, again A does not possess an LU decomposition. In this case

we have det(A_2) = det [ 1  4 ]
                       [ 2  8 ] = 0. Theorem 4.1 leads to a test of positive
definiteness according to the following theorem.




Theorem 4.2:  Suppose R ∈ R^{n×n} with R = R^T. Suppose that R = LDL^T,
where L is unit lower triangular, and D is a diagonal matrix (L =
[l_{ij}]_{i,j=0,...,n-1}, D = [d_{ij}]_{i,j=0,...,n-1}). If d_{ii} > 0 for all i ∈ Z_n, then R > 0.

Proof  L is unit lower triangular, so for any y ∈ R^n there will be a unique
x ∈ R^n such that

    y = L^T x   (y^T = x^T L)

because L^{-1} exists. Thus, assuming D > 0,

    x^T R x = x^T L D L^T x = y^T D y = Σ_{i=0}^{n-1} y_i^2 d_{ii} > 0

for all y ≠ 0, since d_{ii} > 0 for all i ∈ Z_n. In fact, Σ_{i=0}^{n-1} y_i^2 d_{ii} = 0 iff y_i = 0 for
all i ∈ Z_n. Consequently, x^T R x > 0 for all x ≠ 0, and so immediately R > 0.

We relate D in Theorem 4.2 to U in Theorem 4.1 according to U = DL^T. If the
LDL^T decomposition of a matrix R exists, then matrix D immediately tells us
whether R is pd just by viewing the signs of the diagonal elements.
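A small MATLAB sketch of this test (written here for illustration; it computes the LDL^T factors with a simple loop and assumes that no zero pivots occur):

    R = hilb(4);                         % a symmetric positive definite example matrix
    n = size(R,1);
    L = eye(n); d = zeros(n,1);
    for j = 1:n
        d(j) = R(j,j) - (L(j,1:j-1).^2) * d(1:j-1);
        for i = j+1:n
            L(i,j) = (R(i,j) - (L(i,1:j-1).*L(j,1:j-1)) * d(1:j-1)) / d(j);
        end
    end
    disp(d')                             % all entries positive, so R > 0 by Theorem 4.2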

We may define (as in Example 4.4) A^k = [a_{ij}^k], where k = 0, 1, ..., n - 1 and
A^0 = A. Consequently

    τ_i^k = a_{i,k-1}^{k-1} / a_{k-1,k-1}^{k-1}     (4.74)

for i = k, k + 1, ..., n - 1. This follows because G_k contains τ^k, and as observed
in the example above, τ^k depends on the column indexed k - 1 in A^{k-1}. Thus, a
pseudocode program for finding U can therefore be stated as follows:

    A^0 := A;
    for k := 1 to n - 1 do begin
        for i := k to n - 1 do begin
            τ_i^k := a_{i,k-1}^{k-1} / a_{k-1,k-1}^{k-1};   { This loop computes τ^k }
        end;
        A^k := G_k A^{k-1};   { G_k contains τ^k via (4.64) }
    end;
    U := A^{n-1};

We see that the pivots are a_{k-1,k-1}^{k-1} for k = 1, 2, ..., n - 1.
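A direct MATLAB transcription of this pseudocode (a sketch; no pivoting is performed, so it fails if a zero pivot arises) is given below, using the matrix of Example 4.4 as test data. The Gauss vectors τ^k are stored directly in L, as justified by Eq. (4.76) further below.

    A = [1 2 3 4; -1 1 2 1; 0 2 1 3; 0 0 1 1];     % the matrix of Example 4.4
    n = size(A,1);
    U = A; L = eye(n);
    for k = 1:n-1
        tau = U(k+1:n, k) / U(k,k);                % tau_i^k = a_{i,k-1}^{k-1}/a_{k-1,k-1}^{k-1}
        L(k+1:n, k) = tau;                         % L collects the Gauss vectors
        U(k+1:n, :) = U(k+1:n, :) - tau*U(k, :);   % A^k = G_k A^{k-1}
    end
    disp(norm(L*U - A))                            % (near) zero: A = LU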

Now

    U = G_{n-1} G_{n-2} ··· G_2 G_1 A,

so

    A = G_1^{-1} G_2^{-1} ··· G_{n-2}^{-1} G_{n-1}^{-1} U.     (4.75)




Consequently, from (4.70), we obtain

    L = (I + τ^1 e_0^T)(I + τ^2 e_1^T) ··· (I + τ^{n-1} e_{n-2}^T)
      = I + Σ_{k=1}^{n-1} τ^k e_{k-1}^T.                                (4.76)

To confirm the last equality of (4.76), consider defining L_m = G_1^{-1} ··· G_m^{-1} for
m = 1, ..., n-1. Assume that L_m = I + Σ_{k=1}^{m} τ^k e_{k-1}^T, which is true for m = 1
because L_1 = G_1^{-1} = I + τ^1 e_0^T. Consider L_{m+1} = L_m G_{m+1}^{-1}, so

    L_{m+1} = ( I + Σ_{k=1}^{m} τ^k e_{k-1}^T )( I + τ^{m+1} e_m^T )
            = I + Σ_{k=1}^{m} τ^k e_{k-1}^T + τ^{m+1} e_m^T
              + Σ_{k=1}^{m} τ^k e_{k-1}^T τ^{m+1} e_m^T.

But e_{k-1}^T τ^{m+1} = 0 for k = 1, ..., m from (4.66) and (4.67). Thus

    L_{m+1} = I + Σ_{k=1}^{m} τ^k e_{k-1}^T + τ^{m+1} e_m^T = I + Σ_{k=1}^{m+1} τ^k e_{k-1}^T.

Therefore, (4.76) is valid by mathematical induction. (A simpler example of a 
proof by induction appears in Appendix 3.B.) Because of (4.76), the previous 
pseudocode implicitly computes L as well as U. Thus, if no zero-valued pivots 
are encountered, the algorithm will terminate, having provided us with both L 
and U. [As an exercise, the reader should use (4.76) to find L in Example 4.4
simply by looking at the appropriate entries of the matrices G_k; that is, do not
use L = G_1^{-1} G_2^{-1} G_3^{-1}. Having found L by this means, confirm that LU = A.] We
remark that (4.76) shows that L is unit lower triangular. 

It is worth mentioning that certain classes of matrix are guaranteed to possess 
an LU decomposition. Suppose that A ∈ R^{n×n} with A = A^T and A > 0. Let
v = [v_0 ··· v_{k-1} 0 ··· 0]^T (the last n-k entries are zeros); then, if v ≠ 0, we have
v^T A v > 0, but if A_k is the kth leading principal submatrix of A and u = [v_0 ··· v_{k-1}]^T, then

    v^T A v = u^T A_k u > 0,

which holds for all k = 1, 2, ..., n. Consequently, A_k > 0 for all k, and so A_k^{-1}
exists for all k. Since A_k^{-1} exists for all k, it follows that det(A_k) ≠ 0 for all k.
The conditions of Theorem 4.1 are met, and so A possesses an LU decomposi- 
tion. That is, all real-valued, symmetric positive definite matrices possess an LU 
decomposition. 






We recall that the class of positive definite matrices is an important one since 
they have a direct association with least-squares approximation problems. This was 
demonstrated in Section 4.2. 

How many floating-point operations (flops) are needed by the algorithm for 
finding the LU decomposition of a matrix? Answering this question gives us an 
indication of the computational complexity of the algorithm. Neglecting multiplication
by zero or by one, to compute A^k = G_k A^{k-1} requires (n-k)(n-k+1)
multiplications, and the same number of additions. This follows from considering
the product G_k A^{k-1} with the factors partitioned into submatrices according to

    G_k = [ I_k    0       ],     A^{k-1} = [ A^{k-1}_{00}   A^{k-1}_{01} ]      (4.77)
          [ T_k    I_{n-k} ]                [ A^{k-1}_{10}   A^{k-1}_{11} ]

where I_k is a k × k identity matrix, T_k is (n-k) × k and is zero-valued except for
its last column, which contains -τ^k [see (4.64)]. Similarly, A^{k-1}_{00} is (k-1) ×
(k-1), A^{k-1}_{01} is (k-1) × (n-k+1), and A^{k-1}_{11} is (n-k+1) × (n-k+1).



From the pseudocode, we see that we need Σ_{k=1}^{n-1} (n-k) division operations.
Operation A^k = G_k A^{k-1} is executed for k = 1 to n-1, so the total number of
operations is:

    Σ_{k=1}^{n-1} (n-k)(n-k+1)   multiplications

    Σ_{k=1}^{n-1} (n-k)(n-k+1)   additions

    Σ_{k=1}^{n-1} (n-k)          divisions

We now recognize that

    Σ_{k=1}^{N} k = N(N+1)/2,     Σ_{k=1}^{N} k^2 = N(N+1)(2N+1)/6,     (4.78)

where the second summation identity was proven in Appendix 3.B. The first summation
identity may be proved in a similar manner. Therefore



    Σ_{k=1}^{n-1} (n-k)(n-k+1) = Σ_{k=1}^{n-1} [ n^2 + n - (2n+1)k + k^2 ]
                               = (n-1)(n^2 + n) - (2n+1) Σ_{k=1}^{n-1} k + Σ_{k=1}^{n-1} k^2
                               = (1/3)n^3 - (1/3)n,                     (4.79a)

    Σ_{k=1}^{n-1} (n-k) = n(n-1) - Σ_{k=1}^{n-1} k = (1/2)n^2 - (1/2)n.  (4.79b)

So-called asymptotic complexity measures are defined using

Definition 4.2: Big O  We say that f(n) = O(g(n)) if there is a 0 < c < ∞,
and an N ∈ N (N < ∞) such that

    f(n) ≤ c g(n)

for all n > N.

Our algorithm needs a total of f(n) = 2[(1/3)n^3 - (1/3)n] + (1/2)n^2 - (1/2)n = (2/3)n^3 + (1/2)n^2 -
(7/6)n flops. We may say that O(n^3) operations (flops) are needed (so here g(n) =
n^3). We may read O(n^3) as "order n-cubed," so order n-cubed operations are
needed. If one operation takes one unit of time on a computing machine we say
the asymptotic time complexity of the algorithm is O(n^3). Parameter n (matrix
order) is the size of the problem. We might also say that the time complexity
of the algorithm is cubic in the size of the problem since the number of operations
f(n) is a cubic polynomial in n. But we caution the reader about flop
counting:

"Flop counting is a necessarily crude approach to the measuring of program efficiency 
since it ignores subscripting, memory traffic, and the countless other overheads asso- 
ciated with program execution. We must not infer too much from a comparison of 
flops counts. . . . Flop counting is just a 'quick and dirty' accounting method that 
captures only one of several dimensions of the efficiency issue." 

— Golub and Van Loan [5, p. 20] 

Asymptotic complexity measures allow us to talk about algorithmic resource 
demands without getting bogged down in detailed expressions for computing time, 
memory requirements, and other variables. However, the comment by Golub and 
Van Loan above may clearly be extended to asymptotic measures. 

Suppose that A is LU-factorable, and that we know L and U. Suppose that we
wish to solve Ax = y. Thus

    LUx = y,                                                            (4.80)

and define Ux = z, so we begin by considering

    Lz = y.                                                             (4.81)




In expanded form this becomes

    [ l_{0,0}                                           ] [ z_0     ]   [ y_0     ]
    [ l_{1,0}   l_{1,1}                                 ] [ z_1     ]   [ y_1     ]
    [ l_{2,0}   l_{2,1}   l_{2,2}                       ] [ z_2     ] = [ y_2     ]     (4.82)
    [    ⋮          ⋮          ⋮         ⋱              ] [    ⋮    ]   [    ⋮    ]
    [ l_{n-1,0} l_{n-1,1} l_{n-1,2}  ···  l_{n-1,n-1}   ] [ z_{n-1} ]   [ y_{n-1} ]

Since L^{-1} exists, solving (4.81) is easy using forward elimination (forward substitution).
Specifically, from (4.82)

    z_0 = y_0 / l_{0,0}
    z_1 = (1/l_{1,1}) [ y_1 - z_0 l_{1,0} ]
    z_2 = (1/l_{2,2}) [ y_2 - z_0 l_{2,0} - z_1 l_{2,1} ]
      ⋮
    z_{n-1} = (1/l_{n-1,n-1}) [ y_{n-1} - Σ_{k=0}^{n-2} z_k l_{n-1,k} ].

Thus, in general

    z_k = (1/l_{k,k}) [ y_k - Σ_{i=0}^{k-1} z_i l_{k,i} ]               (4.83)

for k = 1, 2, ..., n-1 with z_0 = y_0 / l_{0,0}. Since we now know z, we may solve
Ux = z by backward substitution. To see this, express the problem in expanded
form:

    [ u_{0,0}  u_{0,1}  ···  u_{0,n-2}    u_{0,n-1}   ] [ x_0     ]   [ z_0     ]
    [          u_{1,1}  ···  u_{1,n-2}    u_{1,n-1}   ] [ x_1     ]   [ z_1     ]
    [                    ⋱       ⋮            ⋮       ] [    ⋮    ] = [    ⋮    ]     (4.84)
    [                        u_{n-2,n-2}  u_{n-2,n-1} ] [ x_{n-2} ]   [ z_{n-2} ]
    [                                     u_{n-1,n-1} ] [ x_{n-1} ]   [ z_{n-1} ]

From (4.84), we obtain

    x_{n-1} = z_{n-1} / u_{n-1,n-1}
    x_{n-2} = (1/u_{n-2,n-2}) [ z_{n-2} - x_{n-1} u_{n-2,n-1} ]
    x_{n-3} = (1/u_{n-3,n-3}) [ z_{n-3} - x_{n-1} u_{n-3,n-1} - x_{n-2} u_{n-3,n-2} ]
      ⋮
    x_0 = (1/u_{0,0}) [ z_0 - Σ_{k=1}^{n-1} x_k u_{0,k} ].

In general

    x_k = (1/u_{k,k}) [ z_k - Σ_{i=k+1}^{n-1} x_i u_{k,i} ]             (4.85)

for k = n-2, ..., 0 with x_{n-1} = z_{n-1}/u_{n-1,n-1}. The forward-substitution and
backward-substitution algorithms that we have just derived have an asymptotic
time complexity of O(n^2). The reader should confirm this as an exercise. This
result suggests that most of the computational effort needed to solve for x in
Ax = y lies in the LU decomposition stage.
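The recursions (4.83) and (4.85) translate almost line for line into code. The short NumPy sketch below is one possible rendering of them, offered purely as an illustration of the formulas (it has no safeguards against zero or tiny diagonal entries).

import numpy as np

def forward_substitution(L, y):
    # Solve Lz = y for lower triangular L, as in Eq. (4.83)
    n = L.shape[0]
    z = np.zeros(n)
    for k in range(n):
        z[k] = (y[k] - L[k, :k] @ z[:k]) / L[k, k]
    return z

def backward_substitution(U, z):
    # Solve Ux = z for upper triangular U, as in Eq. (4.85)
    n = U.shape[0]
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = (z[k] - U[k, k + 1:] @ x[k + 1:]) / U[k, k]
    return x

Combined with the factorization sketch given earlier, solving Ax = y amounts to z = forward_substitution(L, y) followed by x = backward_substitution(U, z).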

So far we have said nothing about the performance of our linear system solution 
method with respect to finite precision arithmetic effects (i.e., rounding error). 
Before considering this matter, we make a few remarks regarding the stability of 
our method. We have noted that the LU decomposition algorithm will fail if a zero- 
valued pivot is encountered. This can happen even if Ax — y has a solution and A 
is well-conditioned. In other words, our algorithm is actually unstable since we can 
input numerically well-posed problems that cause it to fail. This does not necessarily 
mean that our algorithm should be totally rejected. For example, we have shown 
that positive definite matrices will never result in a zero-valued pivot. Furthermore, 
if A > 0, and it is well-conditioned, then it can be shown that an accurate answer 
will be provided by the algorithm despite its faults. Nonetheless, the problem of 
failure due to encountering a zero-valued pivot needs to be addressed. Also, what 
happens if a pivot is not exactly zero, but is close to zero? We might expect that 
this can result in a computed solution x that differs greatly from the mathematically 
exact solution x, especially where rounding error is involved, even if A is well- 
conditioned. 

Recall (2.15) from Chapter 2,

    fl[x op y] = (x op y)(1 + ε)                                        (4.86)

for which |ε| ≤ 2^{-t}. If we store A in a floating-point machine, then, because of
the necessity to quantize, we are really storing the elements

    [fl[A]]_{i,j} = fl[a_{i,j}] = a_{i,j}(1 + ε_{i,j})                  (4.87)

with |ε_{i,j}| ≤ 2^{-t}. Suppose now that A, B ∈ R^{m×n}; we then define^6

    |A| = [ |a_{i,j}| ] ∈ R^{m×n},                                      (4.88)

and by B ≤ A we mean b_{i,j} ≤ a_{i,j} for all i and j. So we may express (4.87) more
compactly as

    |fl[A] - A| ≤ u|A|,                                                 (4.89)

where u = 2^{-t}, since |ε_{i,j}| ≤ 2^{-t}.

Forsythe and Moler [4, pp. 104-105] show that the computed solution z to
Lz = y [recall (4.81)] as obtained by forward substitution is actually the exact
solution to a perturbed lower triangular system

    (L + δL)z = y,                                                      (4.90)

where δL is a lower triangular perturbation matrix, and where

    |δL| ≤ 1.01 n u |L|.                                                (4.91)

A very similar bound exists for the problem of solving Ux = z by backward
substitution. We will not derive these bounds, but will simply mention that the
derivation involves working with a bound similar to (2.39) in Chapter 2. From
(4.91) we have relative perturbations |δl_{i,j}|/|l_{i,j}| ≤ 1.01 n u. It is apparent that since u
is typically quite tiny, unless n (matrix order) is quite huge, these relative perturbations
will not be significant. In other words, forward substitution and backward
substitution are very stable procedures that are quite resistant to the effects of
rounding errors. Thus, any difficulties with our linear system solution procedure in
terms of rounding error likely involve only the LU factorization stage.

The rounding error analysis for our Gaussian elimination algorithm is even more 
involved than the effort required to obtain (4.91), so again we will content ourselves 
with citing the main result without proof. We cite Theorem 3.3.1 in Golub and Van 
Loan [5] as follows. 

Theorem 4.3: Assume that A is an n x n matrix of floating-point numbers. 
If no zero-valued pivots are encountered during the execution of the Gaussian 
elimination algorithm for which A is the input, then the computed triangular factors 
(here denoted L and U) satisfy 

    LU = A + δA                                                         (4.92a)

such that

    |δA| ≤ 3(n - 1) u ( |A| + |L||U| ) + O(u^2).                        (4.92b)

There is some danger in confusing this with the determinant. That is, some people use \A\ to denote 
the determinant of A. We will avoid this here by sticking with det(A) as the notation for determinant 
of A. 






In this theorem the term O(u^2) denotes a part of the error term dependent on u^2.
This is quite small as u^2 = 2^{-2t} (rounding assumed), and so may be practically
disregarded. The term arises in the work of Golub and Van Loan [5] because those
authors prefer to work with slightly looser bounding results than are to be found
in the volume by Forsythe and Moler [4]. The bound in (4.92b) gives us cause for
concern. The perturbation matrix δA may not be small. This is because |L||U| can
be quite large. An example of this would be

    A = [ 1   4       1 ]
        [ 2   8.001   1 ]
        [ 0  -1       1 ]

for which

    L = [ 1      0     0 ],     U = [ 1   4        1 ].
        [ 2      1     0 ]          [ 0   0.001   -1 ]
        [ 0  -1000     1 ]          [ 0   0     -999 ]

This has happened because

    A^1 = [ 1   4       1 ]
          [ 0   0.001  -1 ]
          [ 0  -1       1 ]

which has a^1_{1,1} = 0.001. This is a small pivot and is ultimately responsible for
giving us "big" triangular factors. Clearly, the smaller the pivot the bigger the
potential problem. Golub and Van Loan's [5] Theorem 3.3.2 (which we will not
repeat here) goes on to demonstrate that the errors in the computed triangular
factors can adversely affect the solution to Ax = LUx = y as obtained by forward
substitution and backward substitution. Thus, if we use the computed factors L
and U in LUx = y, then the computed solution may not be close to the exact solution x.
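The effect is easy to reproduce numerically. The short check below factors this matrix with the no-pivoting sketch given earlier and compares the size of |L||U| against |A|; it is only a quick illustration of why the bound (4.92b) is worrying, not a formal error analysis.

import numpy as np

A = np.array([[1.0, 4.0,   1.0],
              [2.0, 8.001, 1.0],
              [0.0, -1.0,  1.0]])

L, U = lu_no_pivoting(A)                  # factorization sketch from earlier
print(np.max(np.abs(A)))                  # largest |a_ij| is about 8
print(np.max(np.abs(L) @ np.abs(U)))      # about 2 x 10^3, so |L||U| >> |A|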

How may our Gaussian elimination LU factorization algorithm be modified to
make it more stable? The standard solution is to employ partial pivoting. We do not
consider the method in detail here, but illustrate it with a simple example (Example
4.5). Essentially, before applying a Gauss transformation G_k, the rows of matrix
A^{k-1} are permuted (i.e., exchanged) in such a manner as to make the pivot as
large as possible while simultaneously ensuring that A^k is as close to being upper
triangular as possible. Permutation operations have a matrix description, and such
matrices may be denoted by P_k. We remark that P_k^{-1} = P_k.



Example 4.5  Suppose that

    A = [ 1   4   1 ] = A^0.
        [ 2   8   1 ]
        [ 0  -1   1 ]

Thus

    G_1 P_1 A^0 = [ 1     0   0 ] [ 0  1  0 ] [ 1   4   1 ]
                  [ -1/2  1   0 ] [ 1  0  0 ] [ 2   8   1 ]
                  [ 0     0   1 ] [ 0  0  1 ] [ 0  -1   1 ]

                = [ 1     0   0 ] [ 2   8   1 ]   [ 2   8    1  ]
                  [ -1/2  1   0 ] [ 1   4   1 ] = [ 0   0   1/2 ] = A^1,
                  [ 0     0   1 ] [ 0  -1   1 ]   [ 0  -1    1  ]

and

    G_2 P_2 A^1 = [ 1  0  0 ] [ 1  0  0 ] A^1 = [ 1  0  0 ] [ 2   8    1  ]   [ 2   8    1  ]
                  [ 0  1  0 ] [ 0  0  1 ]       [ 0  1  0 ] [ 0  -1    1  ] = [ 0  -1    1  ] = A^2
                  [ 0  0  1 ] [ 0  1  0 ]       [ 0  0  1 ] [ 0   0   1/2 ]   [ 0   0   1/2 ]

for which U = A^2. We see that P_2 interchanges rows 2 and 3 rather than 1 and
2 because to do otherwise would ruin the upper triangular structure we seek. It is
apparent that

    G_2 P_2 G_1 P_1 A = U,

so that

    A = P_1^{-1} G_1^{-1} P_2^{-1} G_2^{-1} U,

for which

    P_1^{-1} G_1^{-1} P_2^{-1} G_2^{-1} = [ 1/2  0  1 ].
                                          [ 1    0  0 ]
                                          [ 0    1  0 ]

This matrix is manifestly not lower triangular. Thus, our use of partial pivoting to
achieve algorithmic stability has been purchased at the expense of some loss of
structure (although Theorem 3.4.1 in Ref. 5 shows how to recover much of what
is lost^7). Also, permutations involve moving data around in the computer, and this
is a potentially significant cost. But these prices are usually worth paying.

In general, the Gaussian elimination with partial pivoting algorithm generates

    G_{n-1} P_{n-1} ··· G_2 P_2 G_1 P_1 A = U,

and it turns out that

    P_{n-1} ··· P_2 P_1 A = LU

for which L is unit lower triangular, and U is upper triangular.

^7 The expression for L in terms of the factors G_k is messy, and so we omit it. The interested
reader can see pp. 112-113 of Ref. 5 for details.
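Library routines implement exactly this partial-pivoting factorization. As a quick check of Example 4.5, the sketch below asks SciPy for the factorization of the same matrix; SciPy returns it in the equivalent form A = P L U with P a permutation matrix. This is a verification aid only.

import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 4.0, 1.0],
              [2.0, 8.0, 1.0],
              [0.0, -1.0, 1.0]])

P, L, U = lu(A)                    # SciPy convention: A = P @ L @ U
print(P)                           # permutation produced by partial pivoting
print(L)                           # unit lower triangular
print(U)                           # upper triangular
print(np.allclose(P @ L @ U, A))   # True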






It is worth mentioning that the need to trade off algorithm speed in favor of sta- 
bility is common in numerical computing; that is, fast algorithms often have stability 
problems. Much of numerical computing is about creating the fastest possible stable 
algorithms. This is a notoriously challenging engineering problem. 

A much more detailed account of Gaussian elimination with partial pivoting 
appears in Golub and Van Loan [5, pp. 108-116]. This matter will not be discussed
further in this book. 



4.6 LEAST-SQUARES PROBLEMS AND QR DECOMPOSITION 

In this section we consider the QR decomposition of A ∈ R^{m×n} for which m ≥ n,
and A is of full rank [i.e., rank(A) = n]. Full rank in this sense means that the
columns of A are linearly independent. The QR decomposition of A is

    A = QR,                                                             (4.93)

where Q ∈ R^{m×m} is an orthogonal matrix [i.e., Q^T Q = Q Q^T = I (identity
matrix)], and R ∈ R^{m×n} is upper triangular in the following sense:

        [ r_{0,0}  r_{0,1}  ···  r_{0,n-1}   ]
        [ 0        r_{1,1}  ···  r_{1,n-1}   ]
        [ ⋮                  ⋱        ⋮      ]
    R = [ 0        0        ···  r_{n-1,n-1} ]  =  [ R̃ ]               (4.94)
        [ 0        0        ···  0           ]     [ 0 ]
        [ ⋮                           ⋮      ]
        [ 0        0        ···  0           ]

Here R̃ ∈ R^{n×n} is a square upper triangular matrix and is nonsingular because A
is of full rank. The bottom block of zeros in R of (4.94) is (m-n) × n.

It should be immediately apparent that the existence of a QR decomposition 
for A makes it quite easy to solve for x in Ax — y, if A -1 exists (which implies 
that in this special case A is square). Thus, Ax — QRx — y, and so Rx — Q T y. 
The upper triangular linear system Rx — Q T y may be readily solved by backward 
substitution (recall the previous section). 

The case where m > n is important because it arises in overdetermined least- 
squares approximation problems. We illustrate with the following example based on 
a real-world problem. 8 Figure 4. 1 is a plot of some simulated body core temperature 

This example is from the problem of estimating the circadian rhythm parameters of human patients 
who have sustained head injuries. The estimates are obtained by the suitable processing of various 
physiological data sets (e.g., body core temperature, heart rate, blood pressure). The nature of the injury 
has made the patients' rhythms deviate from the nominal 24-h cycle. Correct estimation of rhythm 
parameters can lead to improved clinical treatment because of improved timing in the administering of 






[Figure 4.1  Simulated human patient temperature data to illustrate overdetermined least-squares
model parameter estimation. Here we have N = 1000 samples f_n (the dots), for
T_s = 300 (seconds), T = 24 (hours), a = 2 × 10^{-7} °C/s, b = 37 °C, and c = 0.1 °C. The
solution to (4.103) is â = 2.0582 × 10^{-7} °C/s, b̂ = 36.9999 °C, ĉ = 0.1012 °C. The plot shows
the noisy data with trend, the linear trend component, and the fitted model; temperature (°C)
versus time (hours).]

measurements from a human patient (this is the noisy data with trend). The data has three components:



1. A sinusoidal component 

2. Random noise. 

3. A linear trend. 

Our problem is to estimate the parameters of the sinusoid (i.e., the amplitude, 
period, and phase), which represents the patient's circadian rhythm. In other words, 
the noise and trend are undesirable and so are to be, in effect, removed from the 
desired sinusoidal signal component. Here we will content ourselves with estimating 
only the amplitude of the sinusoid. The problem of estimating the remaining param- 
eters is tougher. Methods to estimate the remaining parameters will be considered 
later. (This is a nonlinear optimization problem.) 

We assume the model for the data in Fig. 4.1 is the analog signal

    f(t) = at + b + c sin( (2π/T) t ) + η(t).                           (4.95)

Here the first two terms model the trend (assumed to be a straight line), the third
term is the desired sinusoidal signal component, and η(t) is a random noise component.
We only possess samples of the signal f_n = f(nT_s) (i.e., t = nT_s), for
n = 0, 1, ..., N-1, where T_s is the sampling period of the data collection system.

medication. We emphasize that the model in (4.95) is grossly oversimplified. Indeed, a better model is
to replace the term at + b with subharmonic and harmonic terms of sin( (2π/T) t ). A harmonic term is one
of frequency (2π/T)n, while a subharmonic has frequency (2π/T)/n. Cosine terms should also be included in
the improved model.






We assume that we know T, which is the period of the patient's circadian rhythm.
Our model also implicitly assumes knowledge of the phase of the sinusoid, too.
These are very artificial assumptions since in practice these are the most important
parameters we are trying to estimate, and they are never known in advance. However,
our present circumstances demand simplification. Our estimate of f_n may be
defined by

    f̂_n = a T_s n + b + c sin( (2π/T) n T_s ).                          (4.96)

This is a sampled version of the analog model, except the noise term has been
deleted.

We may estimate the unknown model parameters a, b, c by employing the same
basic strategy we used in Section 4.2, specifically, a least-squares approach. Thus,
defining x = [a b c]^T (vector of unknown parameters), we strive to minimize

    V(x) = Σ_{n=0}^{N-1} e_n^2 = Σ_{n=0}^{N-1} [ f_n - f̂_n ]^2          (4.97)

with respect to x. Using matrix/vector notation was very helpful in Section 4.2,
and it remains so here. Define

    v_n = [ T_s n    1    sin( (2π/T) T_s n ) ]^T.                      (4.98)

Thus

    f̂_n = v_n^T x.                                                      (4.99)

We may define the error vector e = [e_0 e_1 ··· e_{N-1}]^T, the data vector f =
[f_0 f_1 ··· f_{N-1}]^T, and the matrix of basis vectors

        [ v_0^T     ]
    A = [ v_1^T     ] ∈ R^{N×3}.                                        (4.100)
        [    ⋮      ]
        [ v_{N-1}^T ]

Consequently, via (4.99)

    e = f - Ax.                                                         (4.101)

Obviously, we would like to have e = 0, which implies the desire to solve Ax = f.
If we have N = 3 and A^{-1} exists, then we may uniquely solve for x given any
f. However, in practice, N >> 3, so our linear system is overdetermined. Thus,
no unique solution is possible. We have no option but to select x to minimize e in




some sense. Once again, previous experience from Section 4.2 says least-squares
is a viable choice. Thus, since ||e||_2^2 = e^T e = Σ_{n=0}^{N-1} e_n^2, we consider

    V(x) = e^T e = f^T f - 2 x^T A^T f + x^T A^T A x                    (4.102)

[which is a more compact version of (4.97)]. This is yet another quadratic form
[recall (4.8)], with P = A^T A ∈ R^{3×3} and g = A^T f ∈ R^3. In our problem A is of full rank, so
from the results in Section 4.2 we see that P > 0. Naturally, from the discussions
of Sections 4.3 and 4.4, the conditioning of P is a concern. Here it turns out that
because P is of low order (largely because we are interested only in estimating
three parameters) it typically has a low condition number. However, as the order
of P rises, the conditioning of P usually rapidly worsens; that is, ill conditioning
tends to be a severe problem when the number of parameters to be estimated rises.
From Section 4.2 we know that the optimum choice for x, denoted x̂, is obtained
by solving the linear system

    P x̂ = g.                                                            (4.103)

The model curve of Fig. 4.1 (solid line) is the curve obtained using x̂ in (4.96).
Thus, since x̂ = [â b̂ ĉ]^T, we plot f̂_n for â, b̂, ĉ in place of a, b, c in (4.96).
Equation (4.103) can be written as

    A^T A x̂ = A^T f.                                                    (4.104)

This is just the overdetermined linear system Ax = f multiplied on the left (i.e.,
premultiplied) by A^T. The system (4.104) is often referred to in the literature as
the normal equations.
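To make the construction concrete, the sketch below builds the matrix A of (4.100) and the normal equations (4.104) for the model (4.96), using synthetic data generated from assumed parameter values; the true a, b, c, the noise level, and the random seed are all illustrative choices made here, not values taken from the text.

import numpy as np

# Assumed illustration values (T_s in seconds; T = 24 hours, converted to seconds)
N, Ts, T = 1000, 300.0, 24.0 * 3600.0
a_true, b_true, c_true = 2e-7, 37.0, 0.1

rng = np.random.default_rng(0)
t = np.arange(N) * Ts
f = a_true * t + b_true + c_true * np.sin(2 * np.pi * t / T) \
    + 0.02 * rng.standard_normal(N)          # noisy samples f_n

# Matrix of basis vectors, Eq. (4.100): rows are v_n^T = [T_s n, 1, sin(2 pi n T_s / T)]
A = np.column_stack([t, np.ones(N), np.sin(2 * np.pi * t / T)])

P = A.T @ A                                  # P = A^T A, Eq. (4.103)
g = A.T @ f                                  # g = A^T f
x_hat = np.linalg.solve(P, g)                # normal-equations estimate [a, b, c]
print(x_hat)

A more numerically careful alternative is np.linalg.lstsq(A, f), which works with A directly rather than with A^T A; the conditioning issue behind that recommendation is taken up next.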

How is the previous applications example relevant to the problem of QR fac- 
torizing A as in Eq. (4.93)? To answer this, we need to consider the condition 
numbers of A, and of P — A T A, and to see how orthogonal matrices Q facili- 
tate the solution of overdetermined least-squares problems. We will then move on 
to the problem of how to practically compute the QR factorization of a full-rank 
matrix. We will consider the issue of conditioning first since this is a justification 
for considering QR factorization methods as opposed to the linear system solution 
methods of the previous section. 

Singular values were mentioned in Section 4.4 as being relevant to the problem 
of computing spectral norms, and so of computing K2(A). Now we need to consider 
the consequences of 

Theorem 4.4: Singular Value Decomposition (SVD)  Suppose A ∈ R^{m×n};
then there exist orthogonal matrices

    U = [u_0 u_1 ··· u_{m-1}] ∈ R^{m×m},    V = [v_0 v_1 ··· v_{n-1}] ∈ R^{n×n}

such that

    Σ = U^T A V = diag(σ_0, σ_1, ..., σ_{p-1}) ∈ R^{m×n},   p = min{m, n},   (4.105)

where σ_0 ≥ σ_1 ≥ ··· ≥ σ_{p-1} ≥ 0.






An outline proof appears in Ref. 5 (p. 71) and is omitted here. The notation
diag(σ_0, ..., σ_{p-1}) means a diagonal matrix with main diagonal elements
σ_0, ..., σ_{p-1}. For example, if m = 3, n = 2, then p = 2, and

    U^T A V = [ σ_0   0  ],
              [ 0    σ_1 ]
              [ 0     0  ]

but if m = 2, n = 3, then again p = 2, but now

    U^T A V = [ σ_0   0    0 ].
              [ 0    σ_1   0 ]

The numbers σ_i are called singular values. Vector u_i is the ith left singular vector,
and v_i is the ith right singular vector. The following notation is helpful:

    σ_i(A) = the ith singular value of A (i ∈ Z_p).
    σ_max(A) = the biggest singular value of A.
    σ_min(A) = the smallest singular value of A.

We observe that because AV = UΣ, and A^T U = VΣ^T, we have, respectively,

    A v_i = σ_i u_i,    A^T u_i = σ_i v_i                               (4.106)

for i ∈ Z_p. Singular values give matrix 2-norms, as noted in the following theorem.

Theorem 4.5:

    ||A||_2 = σ_0 = σ_max(A).

Proof  Recall the result (4.37). From (4.105) A = UΣV^T, so

    ||Ax||_2^2 = x^T A^T A x,

and

    A^T A = V Σ^T Σ V^T = Σ_{i=0}^{p-1} σ_i^2 v_i v_i^T ∈ R^{n×n}.      (4.107)

For any x ∈ R^n there exist d_i such that

    x = Σ_{i=0}^{n-1} d_i v_i                                           (4.108)

(because V is orthogonal so its column vectors form an orthogonal basis for R^n).
Thus

    ||x||_2^2 = x^T x = Σ_{i=0}^{n-1} Σ_{j=0}^{n-1} d_i d_j v_i^T v_j = Σ_{i=0}^{n-1} d_i^2   (4.109)




(via v_i^T v_j = δ_{i-j}). Now

    x^T A^T A x = Σ_{i=0}^{p-1} σ_i^2 (x^T v_i)(v_i^T x) = Σ_{i=0}^{p-1} σ_i^2 ⟨x, v_i⟩^2,   (4.110)

but

    ⟨x, v_i⟩ = ⟨ Σ_j d_j v_j , v_i ⟩ = Σ_j d_j ⟨v_j, v_i⟩ = d_i.        (4.111)

Using (4.111) in (4.110), we obtain

    ||Ax||_2^2 = Σ_{i=0}^{n-1} σ_i^2 d_i^2,                             (4.112)

for which it is understood that σ_i = 0 for i > p-1. We maximize ||Ax||_2^2 subject
to the constraint ||x||_2 = 1, which means employing Lagrange multipliers; that is, we
maximize

    L(d) = Σ_{i=0}^{n-1} σ_i^2 d_i^2 - λ ( Σ_{i=0}^{n-1} d_i^2 - 1 ),   (4.113)

where d = [d_0 d_1 ··· d_{n-1}]^T. Thus

    ∂L(d)/∂d_j = 2 σ_j^2 d_j - 2 λ d_j = 0,   or   σ_j^2 d_j = λ d_j.   (4.114)

From (4.114), substituting into (4.112),

    ||Ax||_2^2 = λ Σ_{i=0}^{n-1} d_i^2 = λ,                             (4.115)

for which we have used the fact that ||x||_2^2 = 1 in (4.109). From (4.114) λ is an
eigenvalue of a diagonal matrix containing the σ_i^2. Consequently, λ is maximized for
λ = σ_0^2. Therefore, ||A||_2 = σ_0.

Suppose that

    σ_0 ≥ ··· ≥ σ_{r-1} > σ_r = ··· = σ_{p-1} = 0;                      (4.116)

then

    rank(A) = r.                                                        (4.117)




Thus, the SVD of A can tell us the rank of A. In our overdetermined least-squares
problem we have m > n and A is assumed to be of full rank. This implies that
r = n. Also, p = n. Thus, all singular values of a full-rank matrix are bigger than
zero. Now suppose that A^{-1} exists. From (4.105) A^{-1} = V Σ^{-1} U^T. Immediately,
||A^{-1}||_2 = 1/σ_min(A). Hence

    κ_2(A) = ||A||_2 ||A^{-1}||_2 = σ_max(A) / σ_min(A).                (4.118)

Thus, a large singular value spread is associated with matrix ill conditioning. [Recall
(4.61) and the related discussion.] As remarked on p. 223 of Ref. 5, Eq. (4.118)
can be extended to cover full-rank rectangular matrices with m ≥ n:

    A ∈ R^{m×n}, rank(A) = n  ⇒  κ_2(A) = σ_max(A) / σ_min(A).          (4.119)

This also holds for the transpose of A because A^T = V Σ^T U^T, so A^T has the
same singular values as A. Thus, κ_2(A^T) = κ_2(A). Golub and Van Loan [5, p. 225]
claim (without formal proof) that κ_2(A^T A) = [κ_2(A)]^2. In other words, if the linear
system Ax = f is ill-conditioned, then A^T A x = A^T f is even more ill-conditioned.
The condition number of the latter system is the square of that of the former system.
More information on the conditioning of rectangular matrices is to be found in
Appendix 4.B. This includes justification that κ_2(A^T A) = [κ_2(A)]^2.
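The squaring of the condition number is easy to observe numerically. The check below uses NumPy's SVD-based condition number on an arbitrary tall full-rank matrix (any such matrix will do); it merely illustrates the claim κ_2(A^T A) = [κ_2(A)]^2, it does not prove it.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))        # any full-rank tall matrix works here

kA = np.linalg.cond(A)                   # kappa_2(A) = sigma_max / sigma_min
kAtA = np.linalg.cond(A.T @ A)           # kappa_2(A^T A)
print(kA ** 2, kAtA)                     # the two numbers agree to rounding error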

A popular approach toward solving the normal equations A T Ax — A T f is based 
on Cholesky decomposition 

Theorem 4.6: Cholesky Decomposition If R e R" xn is symmetric and pos- 
itive definite, then there exists a unique lower triangular matrix L e R" x " with 
positive diagonal entries such that R — LL T . This is the Cholesky decomposition 
(factorization) of R. 

Algorithms to find this decomposition appear in Chapter 4 of Ref. 5. We do not 
consider them except to note that if they are used, then the computed solution to 
A^T A x = A^T f, which we denote by x̂, may satisfy

    || x̂ - x ||_2 / || x ||_2 ≈ u [ κ_2(A) ]^2,                         (4.120)

where u is as in (4.89). Thus, this method of linear system solution is potentially 
highly susceptible to errors due to ill-conditioned problems. On the other hand, 
Cholesky approaches are computationally efficient in that they require about n 3 /3 
flops (Floating-point operations). Clearly, Gaussian elimination may be employed 
to solve the normal equations as well, but we recall that Gaussian elimination 
needed about 2n 3 /3 flops. Gaussian elimination is less efficient because it does 
not account for symmetry in matrix R. Note that these counts do not take into 
consideration the number of flops needed to determine A T A and A T f, and do 






not account for the number of flops needed by the forward/backward substitution 
steps. However, the comparison between Cholesky decomposition and Gaussian 
elimination is reasonably fair because these other steps are essentially the same for 
both approaches. 
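For reference, the Cholesky route through the normal equations can be sketched as follows: factor A^T A = L L^T and then perform one forward and one backward triangular solve. This is a bare-bones illustration of the idea under the assumption that A has full rank, not a substitute for a library least-squares routine.

import numpy as np
from scipy.linalg import solve_triangular

def normal_equations_cholesky(A, f):
    """Solve min ||Ax - f||_2 via A^T A x = A^T f using a Cholesky factorization."""
    P = A.T @ A
    g = A.T @ f
    L = np.linalg.cholesky(P)                     # P = L L^T, L lower triangular
    z = solve_triangular(L, g, lower=True)        # forward substitution: L z = g
    return solve_triangular(L.T, z, lower=False)  # backward substitution: L^T x = z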



Recall that ||e||_2^2 = ||Ax - f||_2^2. Thus, for an orthogonal matrix Q

    ||Q^T e||_2^2 = [Q^T e]^T Q^T e = e^T Q Q^T e = e^T e = ||e||_2^2.   (4.121)

Thus, the 2-norm is invariant to orthogonal transformations. This is one of the
more important properties of 2-norms. Now consider

    ||e||_2^2 = ||Q^T A x - Q^T f||_2^2.                                 (4.122)

Suppose that

    Q^T f = [ f^u ]                                                      (4.123)
            [ f^l ]

for which f^u ∈ R^n, and f^l ∈ R^{m-n}. Thus, from (4.94) and Q^T A = R, we obtain

    Q^T A x - Q^T f = [ R̃x - f^u ],                                     (4.124)
                      [   -f^l   ]

implying that

    ||e||_2^2 = ||R̃x - f^u||_2^2 + ||f^l||_2^2.                         (4.125)

Immediately, we see that

    R̃ x̂ = f^u.                                                          (4.126)

The least-squares optimal solution x̂ is therefore found by backward substitution.
Equally clearly, we see that

    min_x ||e||_2^2 = ||f^l||_2^2 = ρ_{LS}^2.                            (4.127)

This is the minimum error energy. Quantity ρ_{LS}^2 is also called the minimum sum
of squares, and e is called the residual [5]. It is easy to verify that κ_2(Q) = 1
(Q is orthogonal). In other words, orthogonal matrices are perfectly conditioned.
This means that the operation Q^T A will not result in a matrix that is worse
conditioned than A. This in turn suggests that solving our least-squares problem
using QR decomposition might be numerically more reliable than working with
the normal equations. As explained on p. 230 of Ref. 5, this is not necessarily
always true, but it is nevertheless a good reason to contemplate QR approaches to
solving least-squares problems.^9

^9 If the residual is big and the problem is ill-conditioned, then neither QR nor normal equation methods
may give an accurate answer. However, QR approaches may be more accurate for small residuals in
ill-conditioned problems than normal equation approaches.
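In code, the QR route amounts to forming Q^T f, splitting off its first n entries as f^u, and back-substituting in R̃ x̂ = f^u, exactly as in (4.123)-(4.126). A minimal NumPy rendering (using the library QR factorization rather than one we build ourselves) might look like this:

import numpy as np
from scipy.linalg import solve_triangular

def qr_least_squares(A, f):
    """Solve min ||Ax - f||_2 via Eq. (4.126)."""
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode="complete")   # Q is m x m, R is m x n
    Qtf = Q.T @ f
    fu, fl = Qtf[:n], Qtf[n:]                 # split as in (4.123)
    x_hat = solve_triangular(R[:n, :], fu, lower=False)
    residual_energy = np.sum(fl ** 2)         # rho_LS^2 of (4.127)
    return x_hat, residual_energy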




How may we compute Q? There are three major approaches:

1. Gram-Schmidt algorithms 

2. Givens rotation algorithms 

3. Householder transformation algorithms 

We will consider only Householder transformations. 

We begin with a review of how vectors are projected onto vectors. Recall the
law of cosines from trigonometry in reference to Fig. 4.2a. Assume that x, y ∈ R^n.
Suppose that ||x - y||_2 = a, ||x||_2 = b, and that ||y||_2 = c. Therefore, where θ is
the angle between x and y (0 ≤ θ ≤ π radians),

    a^2 = b^2 + c^2 - 2bc cos θ,                                         (4.128)

or in terms of the vectors x and y, Eq. (4.128) becomes

    ||x - y||_2^2 = ||x||_2^2 + ||y||_2^2 - 2||x||_2 ||y||_2 cos θ.      (4.129)

In terms of inner products, this becomes

    ⟨x - y, x - y⟩ = ⟨x, x⟩ + ⟨y, y⟩ - 2[⟨x, x⟩]^{1/2}[⟨y, y⟩]^{1/2} cos θ,

which reduces to

    ⟨x, y⟩ = [⟨x, x⟩]^{1/2}[⟨y, y⟩]^{1/2} cos θ = ||x||_2 ||y||_2 cos θ.  (4.130)

[Figure 4.2  Illustration of the law of cosines (a) and the projection of vector x onto vector
y (b).]






Now consider Fig. 4.2b. Vector P_y x is the projection of x onto y, where P_y denotes
the projection operator that projects x onto y. It is immediately apparent that

    ||P_y x||_2 = ||x||_2 cos θ.                                         (4.131)

This is the Euclidean length of P_y x. The unit vector in the direction of y is y/||y||_2.
Therefore

    P_y x = ( ||x||_2 cos θ / ||y||_2 ) y.                               (4.132)

But from (4.130) this becomes

    P_y x = ( ⟨x, y⟩ / ||y||_2^2 ) y.                                    (4.133)

Since ⟨x, y⟩ = x^T y = y^T x, we see that

    P_y x = ( y^T x / ||y||_2^2 ) y = (1/||y||_2^2) y y^T x.             (4.134)

In (4.134) y y^T ∈ R^{n×n}, so the operator P_y has the matrix representation

    P_y = (1/||y||_2^2) y y^T.                                           (4.135)

In Fig. 4.2b we see that z = x - P_y x, and that

    z = (I - P_y) x = [ I - (1/||y||_2^2) y y^T ] x,                     (4.136)

which is the component of x that is orthogonal to y. We observe that

    P_y^2 = (1/||y||_2^4) y y^T y y^T = y ||y||_2^2 y^T / ||y||_2^4 = P_y.   (4.137)

If A^2 = A, we say that matrix A is idempotent. Thus, projection operators are
idempotent. Also, P_y^T = P_y, so projection operators are also symmetric.

In Fig. 4.3, x, y, z ∈ R^n, and y^T z = 0. Define the Householder transformation
matrix

    H = I - 2 y y^T / ||y||_2^2.                                         (4.138)

We see that H = I - 2P_y [via (4.135)]. Hence Hx is as shown in Fig. 4.3; that
is, the Householder transformation finds the reflection of vector x with respect to
vector z, where z ⊥ y (Definition 1.6 of Chapter 1). Recall the unit vector e_i ∈ R^n

    e_i = [0 ··· 0 1 0 ··· 0]^T,







Figure 4.3 Geometric interpretation of the Householder transformation operator H . Note 
that z T y = 0. 



so e_0 = [1 0 ··· 0]^T. Suppose that we want Hx = αe_0 for some α ∈ R with α ≠ 0;
that is, we wish to design H to annihilate all elements of x except for the top
element. Let y = x + αe_0; then

    y^T x = (x + αe_0)^T x = x^T x + αx_0                                (4.139a)

(as x = [x_0 x_1 ··· x_{n-1}]^T), and

    ||y||_2^2 = (x + αe_0)^T (x + αe_0) = x^T x + 2αx_0 + α^2.           (4.139b)

Therefore

    Hx = x - 2 ( y y^T / ||y||_2^2 ) x = x - 2 ( y^T x / ||y||_2^2 ) y,

so from (4.139), this becomes

    Hx = x - 2 (x^T x + αx_0)(x + αe_0) / ||y||_2^2
       = [ 1 - 2 (x^T x + αx_0) / (x^T x + 2αx_0 + α^2) ] x - 2α ( y^T x / ||y||_2^2 ) e_0.   (4.140)

To force the first term to zero, we require

    x^T x + 2αx_0 + α^2 - 2(x^T x + αx_0) = 0,

which implies that α^2 = x^T x, or in other words, we need

    α = ±||x||_2.                                                        (4.141)

Consequently, we select y = x ± ||x||_2 e_0. In this case

    Hx = -2α ( y^T x / ||y||_2^2 ) e_0 = -2α (x^T x + αx_0) / (x^T x + 2αx_0 + α^2) e_0
       = -2α (α^2 + αx_0) / (2α^2 + 2αx_0) e_0 = -α e_0 = ∓||x||_2 e_0,  (4.142)

so Hx = ae_0 for a = -α if y = x + αe_0 with α = ±||x||_2.

Example 4.6  Suppose x = [4 3 0]^T, so ||x||_2 = 5. Choose α = 5. Thus
y = [9 3 0]^T, and

    H = I - 2 y y^T / (y^T y) = (1/45) [ -36  -27    0 ].
                                       [ -27   36    0 ]
                                       [   0    0   45 ]

We see that

    Hx = (1/45) [ -36  -27    0 ] [ 4 ]   = (1/45) [ -225 ]   [ -5 ]
                [ -27   36    0 ] [ 3 ]            [    0 ] = [  0 ],
                [   0    0   45 ] [ 0 ]            [    0 ]   [  0 ]

so Hx = -αe_0.
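This computation is easy to verify numerically. The snippet below forms H from (4.138) for the same x and α and checks both the annihilation property and the orthogonality H^T H = I; it is nothing more than a numerical check of Example 4.6.

import numpy as np

x = np.array([4.0, 3.0, 0.0])
alpha = np.linalg.norm(x)                 # alpha = ||x||_2 = 5
e0 = np.array([1.0, 0.0, 0.0])
y = x + alpha * e0                        # y = [9, 3, 0]^T

H = np.eye(3) - 2.0 * np.outer(y, y) / (y @ y)   # Eq. (4.138)
print(H @ x)                              # [-5, 0, 0] = -alpha * e0
print(np.allclose(H.T @ H, np.eye(3)))    # True: H is orthogonal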



















The Householder transformation is designed to annihilate elements of vectors. But 
in contrast with the Gauss transformations of Section 4.5, Householder matrices 
are orthogonal. To see this observe that 



H T H = 



1-2 



yy 

yTy 



1-2 



yy 
y T yj 



T T T 

y T y iy T yf 

T T 

= 7-4^+4^ = 7. 

y y y y 



Thus, no matter how we select y, K2(H) = 1. Householder matrices are therefore 
perfectly conditioned. 

To obtain R in (4.94), we define 



H k 



4-i 








€R" 



(4.143) 



TLFeBOOK 



LEAST-SQUARES PROBLEMS AND QR DECOMPOSITION 



173 



where k — 1,2, ... ,n, h-i is an order k — 1 identity matrix, Hk is an order m — 
k + 1 Householder transformation matrix. We design Hk to annihilate elements k 
to m — 1 of column A: — 1 in A k ~ l , where A — A, and 



A k = H k A k ~\ 



(4.144) 



so A" = R (in (4.94)). Much as in Section 4.5, we have A k = [af ,.] € R mx ", and 
we assume m > n. 

Example 4.7  Suppose

          [ a^0_{00}  a^0_{01}  a^0_{02} ]
    A^0 = [ a^0_{10}  a^0_{11}  a^0_{12} ] ∈ R^{4×3},
          [ a^0_{20}  a^0_{21}  a^0_{22} ]
          [ a^0_{30}  a^0_{31}  a^0_{32} ]

and so therefore

                    [ x x x x ]       [ a^1_{00}  a^1_{01}  a^1_{02} ]
    A^1 = H_1 A^0 = [ x x x x ] A^0 = [    0      a^1_{11}  a^1_{12} ],
                    [ x x x x ]       [    0      a^1_{21}  a^1_{22} ]
                    [ x x x x ]       [    0      a^1_{31}  a^1_{32} ]

                    [ 1 0 0 0 ]       [ a^1_{00}  a^1_{01}  a^1_{02} ]
    A^2 = H_2 A^1 = [ 0 x x x ] A^1 = [    0      a^2_{11}  a^2_{12} ],
                    [ 0 x x x ]       [    0         0      a^2_{22} ]
                    [ 0 x x x ]       [    0         0      a^2_{32} ]

and

                    [ 1 0 0 0 ]       [ a^1_{00}  a^1_{01}  a^1_{02} ]
    A^3 = H_3 A^2 = [ 0 1 0 0 ] A^2 = [    0      a^2_{11}  a^2_{12} ] = R.
                    [ 0 0 x x ]       [    0         0      a^3_{22} ]
                    [ 0 0 x x ]       [    0         0         0     ]




The x signs denote the Householder matrix elements that are not specified. This 
example is intended only to show the general pattern of elements in the matrices. 

Define

    x^k = [ a^{k-1}_{k-1,k-1}  a^{k-1}_{k,k-1}  ···  a^{k-1}_{m-1,k-1} ]^T ∈ R^{m-k+1},   (4.145)

so if x^k = [x^k_0 x^k_1 ··· x^k_{m-k}]^T then x^k_i = a^{k-1}_{i+k-1,k-1}, and so

    H̃_k = I_{m-k+1} - 2 y^k (y^k)^T / [ (y^k)^T y^k ],                  (4.146)

where y^k = x^k ± ||x^k||_2 e^k_0, and e^k_0 = [1 0 ··· 0]^T ∈ R^{m-k+1}. A pseudocode analogous
to that for Gaussian elimination (recall Section 4.5) is as follows:

A^0 := A;
for k := 1 to n do begin
  for i := 0 to m-k do begin
    x^k_i := a^{k-1}_{i+k-1,k-1};  { This loop makes x^k }
  end;
  y^k := x^k + sign(x^k_0) ||x^k||_2 e^k_0;
  A^k := H_k A^{k-1};  { H_k contains H̃_k via (4.146) }
end;
R := A^n;
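A compact dense-matrix rendering of this pseudocode is given below. For clarity it forms each full H_k explicitly and accumulates Q, which (as discussed shortly) is exactly what one avoids doing in a serious implementation; treat it purely as an illustrative sketch.

import numpy as np

def householder_qr(A):
    """Illustrative Householder QR: returns Q (m x m) and R (m x n) with A = Q @ R."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy()
    for k in range(n):
        x = R[k:, k].copy()                          # x^k of (4.145)
        s = 1.0 if x[0] >= 0 else -1.0               # sign(x^k_0), with sign(0) = +1
        y = x.copy()
        y[0] += s * np.linalg.norm(x)                # y^k = x^k + sign(x0) ||x^k|| e0
        if y @ y == 0.0:
            continue                                 # column is already annihilated
        Hk = np.eye(m)
        Hk[k:, k:] -= 2.0 * np.outer(y, y) / (y @ y) # embed H~_k as in (4.143)
        R = Hk @ R
        Q = Q @ Hk                                   # each H_k is symmetric orthogonal
    return Q, R

if __name__ == "__main__":
    A = np.arange(12, dtype=float).reshape(4, 3) + np.eye(4, 3)
    Q, R = householder_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))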

From (4.143) H_k^T H_k = I_m because H̃_k^T H̃_k = I_{m-k+1}, and of course I_{k-1}^T I_{k-1} =
I_{k-1}; that is, H_k is orthogonal for all k. Since

    R = A^n = H_n H_{n-1} ··· H_2 H_1 A,                                (4.147)

we have

    A = H_1^T H_2^T ··· H_n^T R,    with Q = H_1^T H_2^T ··· H_n^T.     (4.148)

Thus, the pseudocode above implicitly computes Q because it creates the orthogonal
factors H_k.

In the pseudocode we see that

    y^k = x^k + sign(x^k_0) ||x^k||_2 e^k_0.                             (4.149)

Recall from (4.141) that α = ±||x||_2, so we must choose the sign of α. It is best that
α = sign(x_0)||x||_2, where sign(x_0) = +1 for x_0 ≥ 0, and sign(x_0) = -1 if x_0 < 0.
This turns out to ensure that H remains as close as possible to perfect orthogonality
in the face of rounding errors. Because ||x||_2 might be very large or very small,
there is a risk of overflow or underflow in the computation of ||x||_2. Thus, it
is often better to compute y from x/||x||_∞. This works because scaling x does
not mathematically alter H (which may be confirmed as an exercise). Typically,
m >> n (e.g., in the example of Fig. 4.1 we had m = N = 1000, while n = 3),
so, since H_k ∈ R^{m×m}, we rarely can accumulate and store the elements of H_k for
all k, as too much memory is needed for such a task. Instead, it is much better to
observe that (for example) if H ∈ R^{m×m}, A ∈ R^{m×n} then, as y ∈ R^m, we have



    HA = [ I - 2 y y^T / (y^T y) ] A = A - (2 / (y^T y)) y (A^T y)^T.   (4.150)

From (4.150), A^T y ∈ R^n, which has jth element

    [A^T y]_j = Σ_{k=0}^{m-1} a_{k,j} y_k                               (4.151)

for j = 0, 1, ..., n-1. If β = 2/(y^T y), then, from (4.150) and (4.151), we have

    [HA]_{i,j} = a_{i,j} - β [ Σ_{k=0}^{m-1} a_{k,j} y_k ] y_i          (4.152)

for i = 0, 1, ..., m-1, and j = 0, 1, ..., n-1. A pseudocode program that
implements this is as follows:

β := 2 / y^T y;
for j := 0 to n-1 do begin
  s := Σ_{k=0}^{m-1} a_{k,j} y_k;
  s := β s;
  for i := 0 to m-1 do begin
    a_{i,j} := a_{i,j} - s y_i;
  end;
end;

This program is written to overwrite matrix A with matrix HA. This reduces
computer system memory requirements. Recall (4.123), where we see that Q^T f
must be computed so that f^u can be found. Knowledge of f^u is essential to
compute x̂ via (4.126). As in the problem of computing H_k A^{k-1}, we do not wish
to accumulate and save the factors H_k in

    Q^T f = H_n H_{n-1} ··· H_1 f.                                      (4.153)

Instead, Q^T f would be computed using an algorithm similar to that suggested by
(4.152).

All the suggestions in the previous paragraph are needed in a practical imple- 
mentation of the Householder transformation matrix method for QR factorization. 
As noted in Ref. 5, the rounding error performance of the practical Householder 
QR factorization algorithm is quite good. It is stated [5] as well that the number of 
flops needed by the Householder method for finding x is greater than that needed by 




Cholesky factorization. Somewhat simplistically, the Cholesky method is computa- 
tionally more efficient than the Householder method, but the Householder method 
is less susceptible to ill conditioning and to rounding errors than is the Cholesky 
method. More or less, there is therefore a tradeoff between speed and accuracy 
involved in selecting between these competing methods for solving the overde- 
termined least-squares problem. The Householder approach is also claimed [5] to 
require more memory than the Cholesky approach. 



4.7 ITERATIVE METHODS FOR LINEAR SYSTEMS 

Matrix A € R" x " is said to be sparse if most of its n 2 elements are zero-valued. 
Such matrices can arise in various applications, such as in the numerical solution 
of partial differential equations (PDEs). Sections 4.5 and 4.6 have presented such 
direct methods as the LU and QR decompositions (factorizations) of A in order to 
solve Ax — b (assuming that A is nonsingular). However, these procedures do not 
in themselves take advantage of any structure that may be possessed by A such 
as sparsity. Thus, they are not necessarily computationally efficient procedures. 
Therefore, in the present section, we consider iterative methods to determine x € R" 
in Ax = b. In this section, whenever we consider Ax — b, we will always assume 
that A -1 exists. Iterative methods work by creating a Cauchy sequence of vectors 
(x w ) that converges to x. 10 Iterative methods may be particularly advantageous 
when A is not only sparse, but is also large (i.e., large n). This is because direct 
methods often require the considerable movement of data around the computing 
machine memory system, and this can slow the computation down substantially. But 
a properly conceived and implemented iterative method can alleviate this problem. 

Our presentation of iterative methods here is based largely on the work of 
Quarteroni et al. [8, Chapter 4]. We use much of the same notation as that in 
Ref. 8. But it is a condensed presentation as this section is intended only to convey 
the main ideas about iterative linear system solvers. 

In Section 4.4 matrix and vector norms were considered in order to characterize 
the sizes of errors in the numerical estimate of x in Ax = b due to perturbations 
of A, and b. We will need to consider such norms here. As noted above, our goal 
here is to derive a methodology to generate a vector sequence (x^{(k)})^{11} such that

    lim_{k→∞} x^{(k)} = x,                                              (4.154)

where x = [x_0 x_1 ··· x_{n-1}]^T ∈ R^n satisfies Ax = b and x^{(k)} = [x^{(k)}_0 x^{(k)}_1 ···
x^{(k)}_{n-1}]^T ∈ R^n. The basic idea is to find an operator T such that x^{(k+1)} = Tx^{(k)} (=
T(x^{(k)})), for k = 0, 1, 2, .... Because (x^{(k)}) is designed to be Cauchy (recall

^{10} As such, we will be revisiting ideas first seen in Section 3.2.

^{11} Note that the "(k)" in x^{(k)} does not denote the raising of x to a power or the taking of the kth
derivative, but rather is part of the name of the vector. Similar notation applies to matrices. So, A^k is
the kth power of A, but A^{(k)} is not.




Section 3.2), for any ε > 0 there will be an m ∈ Z^+ such that ||x^{(m)} - x|| < ε
[recall that d(x^{(m)}, x) = ||x^{(m)} - x||]. The operator T is defined according to

    x^{(k+1)} = B x^{(k)} + f,                                           (4.155)

where x^{(0)} ∈ R^n is the starting value (initial guess about the solution x), B ∈ R^{n×n}
is called the iteration matrix, and f ∈ R^n is derived from A and b in Ax = b. Since
we want (4.154) to hold, from (4.155) we seek B and f such that x = Bx + f, or
A^{-1}b = BA^{-1}b + f (using Ax = b, implying x = A^{-1}b), so

    f = (I - B) A^{-1} b.                                                (4.156)

The error vector at step k is defined to be

    e^{(k)} = x^{(k)} - x,                                               (4.157)

and naturally we want lim_{k→∞} e^{(k)} = 0. Convergence would be in some suitably
selected norm.

As matters now stand, there is no guarantee that (4.154) will hold. We achieve
convergence only by the proper selection of B, and for matrices A possessing
suitable properties (considered below). Before we can consider these matters we
require certain basic results involving matrix norms.

Definition 4.3: Spectral Radius  Let s(A) denote the set of eigenvalues of
matrix A ∈ R^{n×n}. The spectral radius of A is

    ρ(A) = max_{λ∈s(A)} |λ|.

An important property possessed by ρ(A) is as follows.

Property 4.1  If A ∈ R^{n×n} and ε > 0, then there is a norm denoted ||·||_ε
(i.e., a norm perhaps dependent on ε) satisfying the consistency condition (4.36c),
and such that

    ||A||_ε ≤ ρ(A) + ε.

Proof  See Isaacson and Keller [9].

This is just a formal way of saying that there is always a matrix norm that is
arbitrarily close to the spectral radius of A:

    ρ(A) = inf ||A||                                                     (4.158)

with the infimum (defined in Section 1.3) taken over all possible norms that satisfy
(4.36c). We say that the sequence of matrices (A^{(k)}) [with A^{(k)} ∈ R^{n×n}] converges
to A ∈ R^{n×n} iff

    lim_{k→∞} ||A^{(k)} - A|| = 0.                                       (4.159)




The norm in (4.159) is arbitrary because of norm equivalence (recall discussion on 
this idea in Section 4.4). 

Theorem 4.7: Let A ∈ R^{n×n}; then

    lim_{k→∞} A^k = 0  ⟺  ρ(A) < 1.                                     (4.160)

As well, the matrix geometric series Σ_{k=0}^{∞} A^k converges iff ρ(A) < 1. In this
instance

    Σ_{k=0}^{∞} A^k = (I - A)^{-1}.                                      (4.161)

So, if ρ(A) < 1, then matrix I - A is invertible, and also

    1/(1 + ||A||) ≤ ||(I - A)^{-1}|| ≤ 1/(1 - ||A||),                    (4.162)

where ||·|| here is an induced matrix norm [i.e., (4.36b) holds] such that ||A|| < 1.

Proof  We begin by showing that (4.160) holds. Let ρ(A) < 1, so there must be an
ε > 0 such that ρ(A) < 1 - ε, and from Property 4.1 there is a consistent matrix
norm ||·|| such that

    ||A|| ≤ ρ(A) + ε < 1.

Because [recall (4.40)] ||A^k|| ≤ ||A||^k < 1, and by the definition of convergence,
as k → ∞ we have A^k → 0 ∈ R^{n×n}. Conversely, assume that lim_{k→∞} A^k = 0,
and let λ be any eigenvalue of A. For an eigenvector x (≠ 0) of A associated with
eigenvalue λ, we have A^k x = λ^k x, and so lim_{k→∞} λ^k = 0. Thus, |λ| < 1, and
hence ρ(A) < 1. Now consider (4.161). If λ is an eigenvalue of A, then 1 - λ is
an eigenvalue of I - A. We observe that

    (I - A)(I + A + A^2 + ··· + A^{n-1} + A^n) = I - A^{n+1}.            (4.163)

Since ρ(A) < 1, I - A has an inverse, and letting n → ∞ in (4.163) yields

    (I - A) Σ_{k=0}^{∞} A^k = I

so that (4.161) holds.

Now, because matrix norm ||·|| satisfies (4.36b), we must have ||I|| = 1. Thus

    1 = ||I|| ≤ ||I - A|| ||(I - A)^{-1}|| ≤ (1 + ||A||) ||(I - A)^{-1}||,




which gives the first inequality in (4.162). Since I = (I - A) + A, we have

    (I - A)^{-1} = I + A(I - A)^{-1},

so that

    ||(I - A)^{-1}|| ≤ 1 + ||A|| ||(I - A)^{-1}||.

Condition ||A|| < 1 implies that this yields the second inequality in (4.162).

We mention that in Theorem 4.7 an induced matrix norm exists to give ||A|| <
1 because of Property 4.1 [recall that (A^k) is convergent, giving ρ(A) < 1].
Theorem 4.7 now leads us to the following theorem.

Theorem 4.8: Suppose that f ∈ R^n satisfies (4.156); then (x^{(k)}) converges to
x satisfying Ax = b for any x^{(0)} iff ρ(B) < 1.

Proof  From (4.155)-(4.157), we have

    e^{(k+1)} = x^{(k+1)} - x = Bx^{(k)} + f - x = Bx^{(k)} + (I - B)A^{-1}b - x
              = Be^{(k)} + Bx + (I - B)A^{-1}b - x
              = Be^{(k)} + Bx + x - Bx - x
              = Be^{(k)}.

Immediately, we see that

    e^{(k)} = B^k e^{(0)}                                                (4.164)

for k ∈ Z^+. From Theorem 4.7

    lim_{k→∞} B^k e^{(0)} = 0

for all e^{(0)} ∈ R^n iff ρ(B) < 1.

On the other hand, suppose ρ(B) ≥ 1; then there is at least one eigenvalue λ of
B such that |λ| ≥ 1. Let e^{(0)} be the eigenvector associated with λ, so Be^{(0)} = λe^{(0)},
implying that e^{(k)} = λ^k e^{(0)}. But this implies that e^{(k)} does not tend to 0 as k → ∞ since |λ| ≥ 1.

This theorem gives a general condition on B so that the iterative procedure (4.155)
converges. Theorem 4.9 (below) will say more. However, our problem now is to
find B. From (4.158) and Theorem 4.7, a sufficient condition for convergence is
that ||B|| < 1, for any matrix norm.

A general approach to constructing iterative methods is to use the additive
splitting of the matrix A according to

    A = P - N,                                                           (4.165)






where P, N ∈ R^{n×n} are suitable matrices, and P^{-1} exists. Matrix P is sometimes
called a preconditioning matrix, or preconditioner (for reasons we will not consider
here, but that are explained in Ref. 8). To be specific, we rewrite (4.155) as

    x^{(k+1)} = P^{-1} N x^{(k)} + P^{-1} b,

that is, for k ∈ Z^+

    P x^{(k+1)} = N x^{(k)} + b,                                         (4.166)

so that f = P^{-1}b, and B = P^{-1}N. Alternatively

    x^{(k+1)} = x^{(k)} + P^{-1} r^{(k)},                                (4.167)

where r^{(k)} = b - Ax^{(k)} is the residual vector at step k. From (4.167) we see that to obtain
x^{(k+1)} requires us to solve a linear system of equations involving P. Clearly, for
this approach to be worth the trouble, P must be nonsingular, and be easy to invert
as well in order to save on computations.

We will now make the additional assumption that the main diagonal elements of
A are nonzero (i.e., a_{i,i} ≠ 0 for all i ∈ Z_n). All the iterative methods we consider in
this section will assume this. In this case we may express Ax = b in the equivalent
form

    x_i = (1/a_{i,i}) [ b_i - Σ_{j=0, j≠i}^{n-1} a_{i,j} x_j ]           (4.168)

for i = 0, 1, ..., n-1.

The expression (4.168) immediately leads to, for any initial guess x^{(0)}, the
Jacobi method, which is defined by the iterations

    x^{(k+1)}_i = (1/a_{i,i}) [ b_i - Σ_{j=0, j≠i}^{n-1} a_{i,j} x^{(k)}_j ]   (4.169)

for i = 0, 1, ..., n-1. It is easy to show that this algorithm implements the
splitting

    P = D,    N = D - A = L + U,                                         (4.170)

where D = diag(a_{0,0}, a_{1,1}, ..., a_{n-1,n-1}) (i.e., the diagonal matrix formed from the main
diagonal elements of A), L is the lower triangular matrix such that l_{i,j} = -a_{i,j}
if i > j, and l_{i,j} = 0 if i ≤ j, and U is the upper triangular matrix such that
u_{i,j} = -a_{i,j} if j > i, and u_{i,j} = 0 if j ≤ i. Here the iteration matrix B is given by

    B = B_J = P^{-1} N = D^{-1}(L + U) = I - D^{-1} A.                   (4.171)




The Jacobi method generalizes according to

    x^{(k+1)}_i = (ω/a_{i,i}) [ b_i - Σ_{j=0, j≠i}^{n-1} a_{i,j} x^{(k)}_j ] + (1 - ω) x^{(k)}_i,   (4.172)

where i = 0, 1, ..., n-1, and ω is the relaxation parameter. Relaxation parameters
are introduced into iterative procedures in order to control convergence rates. The
algorithm (4.172) is called the Jacobi overrelaxation (JOR) method. In this
algorithm the iteration matrix B takes on the form

    B = B_J(ω) = ω B_J + (1 - ω) I,                                      (4.173)

and (4.172) can be expressed in the form (4.167) according to

    x^{(k+1)} = x^{(k)} + ω D^{-1} r^{(k)}.                              (4.174)

The JOR method satisfies (4.156) provided that ω ≠ 0. The method is easily seen
to reduce to the Jacobi method when ω = 1.
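As an illustration of (4.169) and (4.172), the sketch below implements the JOR update in its vector form x^{(k+1)} = x^{(k)} + ωD^{-1}r^{(k)}; setting omega = 1.0 gives the plain Jacobi method. The stopping rule (a fixed iteration budget plus a residual test) is a choice made for this example, not something prescribed by the text.

import numpy as np

def jor(A, b, omega=1.0, x0=None, tol=1e-10, max_iter=500):
    """Jacobi overrelaxation, Eq. (4.174); omega = 1 is the Jacobi method."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    d = np.diag(A)                        # main diagonal of A (assumed nonzero)
    for _ in range(max_iter):
        r = b - A @ x                     # residual r^(k) = b - A x^(k)
        x = x + omega * r / d             # x^(k+1) = x^(k) + omega D^{-1} r^(k)
        if np.linalg.norm(r, np.inf) < tol:
            break
    return x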

An alternative to the Jacobi method is the Gauss-Seidel method. This is defined
as

    x^{(k+1)}_i = (1/a_{i,i}) [ b_i - Σ_{j=0}^{i-1} a_{i,j} x^{(k+1)}_j - Σ_{j=i+1}^{n-1} a_{i,j} x^{(k)}_j ],   (4.175)

where i = 0, 1, ..., n-1. In matrix form (4.175) can be expressed as

    D x^{(k+1)} = b + L x^{(k+1)} + U x^{(k)},                           (4.176)

where D, L, and U are the same matrices as those associated with the Jacobi
method. In the Gauss-Seidel method we implement the splitting

    P = D - L,    N = U                                                  (4.177)

with the iteration matrix

    B = B_{GS} = (D - L)^{-1} U.                                         (4.178)



As there is an overrelaxation method for the Jacobi approach, the same idea applies
for the Gauss-Seidel case. The Gauss-Seidel successive overrelaxation (SOR)
method is defined to be

    x^{(k+1)}_i = (ω/a_{i,i}) [ b_i - Σ_{j=0}^{i-1} a_{i,j} x^{(k+1)}_j - Σ_{j=i+1}^{n-1} a_{i,j} x^{(k)}_j ] + (1 - ω) x^{(k)}_i,   (4.179)

again for i = 0, 1, ..., n-1. In matrix form this procedure can be expressed as

    D x^{(k+1)} = ω [ b + L x^{(k+1)} + U x^{(k)} ] + (1 - ω) D x^{(k)}

or

    [ I - ω D^{-1} L ] x^{(k+1)} = ω D^{-1} b + [ (1 - ω) I + ω D^{-1} U ] x^{(k)},   (4.180)

for which the iteration matrix is now

    B = B_{GS}(ω) = [ I - ω D^{-1} L ]^{-1} [ (1 - ω) I + ω D^{-1} U ].  (4.181)

We see from (4.180) (on multiplying both sides by D) that

    [ D - ω L ] x^{(k+1)} = ω b + [ (1 - ω) D + ω U ] x^{(k)},

so from the fact that A = D - (L + U) [recall (4.170)], this may be rearranged as

    x^{(k+1)} = x^{(k)} + [ (1/ω) D - L ]^{-1} r^{(k)}                   (4.182)

(r^{(k)} = b - A x^{(k)}), which is the form (4.167). Condition (4.156) holds if ω ≠ 0.
The case ω = 1 corresponds to the Gauss-Seidel method in (4.175). If ω ∈ (0, 1),
the technique is often called an underrelaxation method, while for ω ∈ (1, ∞) it
is an overrelaxation method.
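The componentwise form (4.179) is the one normally coded, since the new entries x^{(k+1)}_j for j < i are already available when x^{(k+1)}_i is computed. A minimal sketch, again with an illustrative relative-residual stopping rule [compare (4.186) below], is:

import numpy as np

def sor(A, b, omega=1.0, x0=None, tol=1e-3, max_iter=500):
    """Gauss-Seidel SOR, Eq. (4.179); omega = 1 gives plain Gauss-Seidel."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r0 = np.linalg.norm(b - A @ x, np.inf)
    for k in range(1, max_iter + 1):
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (omega / A[i, i]) * (b[i] - s) + (1.0 - omega) * x[i]
        r = np.linalg.norm(b - A @ x, np.inf)
        if r / r0 < tol:
            return x, k
    return x, max_iter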

We will now summarize, largely without proof, results concerning the convergence
of (x^{(k)}) to x for sequences generated by the previous iterative algorithms.
We observe that every iteration in any of the proposed methods needs (in the worst
case, assuming that A is not sparse) O(n^2) arithmetic operations. The total number
of iterations is m, and is needed to achieve desired accuracy ||x^{(m)} - x|| < ε, and
so in turn the total number of arithmetic operations needed is O(mn^2). Gaussian
elimination needs O(n^3) operations to solve Ax = b, so the iterative methods are
worthwhile computationally only if m is sufficiently small. If m is about the same
size as n, then little advantage can be expected from iterative methods. On the
other hand, if A is sparse, perhaps possessing only O(n) nonzero elements, then
the iterative methods require only O(mn) operations to achieve ||x^{(m)} - x|| < ε.
We need to give conditions on A so that x^{(k)} → x, and also to say something about
the number of iterations needed to achieve convergence to desired accuracy.

Let us begin with the following definition.

Definition 4.4: Diagonal Dominance  Matrix A ∈ R^{n×n} is diagonally dominant
if

    |a_{i,i}| > Σ_{j=0, j≠i}^{n-1} |a_{i,j}|                             (4.183)

for i = 0, 1, ..., n-1.




We mention here that Definition 4.4 is a bit different from Definition 6.2 (Chapter 6), 
where diagonal dominance concepts appear in the context of spline interpolation 
problems. It can be shown that if A in Ax = b is diagonally dominant according to 
Definition 4.4, then the Jacobi and Gauss-Seidel methods both converge. Proof for 
the Jacobi method appears in Theorem 4.2 of Ref. 8, while the Gauss-Seidel case is 
proved by Axelsson [10]. 

If A = A^T and A > 0, both the Jacobi and Gauss-Seidel methods will converge.
A proof for the Gauss-Seidel case appears in Golub and Van Loan [5, Theorem
10.1.2]. The Jacobi case is considered in Ref. 8. Convergence results exist for
the overrelaxation methods JOR and SOR. For example, if A = A^T with A > 0,
the SOR method is convergent iff 0 < ω < 2 [8]. Naturally, we wish to select ω so
that convergence occurs as rapidly as possible (i.e., so that m in ||x^{(m)} - x|| < ε is minimal).
However, the problem of selecting the optimal value for ω is well beyond
the scope of this book.

We recall that our iterative procedures have the general form in (4.155), where
it is intended that x = Bx + f. We may regard y = Tx = Bx + f as a mapping
T: R^n → R^n. On the linear vector space R^n we may define the metric

    d(x, y) = max_{j∈Z_n} |x_j - y_j|                                    (4.184)

(recall the properties of metrics from Chapter 1). Space (R^n, d) is a complete metric
space [11, p. 308]. From Kreyszig [11] we have the following theorem.

Theorem 4.9: If the linear system x = Bx + f is such that

    Σ_{j=0}^{n-1} |b_{i,j}| < 1

for i = 0, 1, ..., n-1, then the solution x is unique. The solution can be obtained as
the limit of the vector sequence (x^{(k)}) for k = 0, 1, 2, ... (x^{(0)} is arbitrary), where

    x^{(k+1)} = B x^{(k)} + f,

and where for α = max_{i∈Z_n} Σ_{j=0}^{n-1} |b_{i,j}| we have the error bounds

    d(x^{(m)}, x) ≤ (α/(1-α)) d(x^{(m-1)}, x^{(m)}) ≤ (α^m/(1-α)) d(x^{(0)}, x^{(1)}).   (4.185)

Proof  We will give only an outline proof. This theorem is really just a special
instance of the contraction theorem, which appears and is proved in Chapter 7 (see
Theorem 7.3 and Corollary 7.1).

The essence of the proof is to consider the fact that

    d(Tx, Ty) = max_{i∈Z_n} | Σ_{j=0}^{n-1} b_{i,j}(x_j - y_j) |
              ≤ max_{j∈Z_n} |x_j - y_j| · max_{i∈Z_n} Σ_{j=0}^{n-1} |b_{i,j}|
              = d(x, y) max_{i∈Z_n} Σ_{j=0}^{n-1} |b_{i,j}|,

so d(Tx, Ty) ≤ α d(x, y), if we define

    α = max_{i∈Z_n} Σ_{j=0}^{n-1} |b_{i,j}|

[recall (4.41d)].



In this theorem we see that if α < 1, then x^(k) → x. In this case d(Tx, Ty) <
d(x, y) for all x, y ∈ R^n. Such a mapping T is called a contraction mapping (or
contractive mapping). We see that contraction mappings have the effect of moving
points in a space closer together. The error bounds stated in (4.185) give us an
idea about the number of iterations m needed to achieve ||x^(m) - x|| < ε (ε >
0). We emphasize that condition α < 1 is sufficient for convergence, so (x^(k))
may converge to x even if this condition is violated. It is also noteworthy that
convergence will be fast if α is small, that is, if ||B||_∞ is small. The result in
Theorem 4.8 certainly suggests convergence ought to be fast if ρ(B) is small.
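
A small numerical experiment makes the error bounds of Theorem 4.9 concrete. The
fragment below is a sketch only; the matrix B, the vector f, and the variable names
are invented for the illustration and do not come from the text.

   % Iterate x^(k+1) = B x^(k) + f and compare the true error with the
   % a priori bound alpha^m/(1 - alpha) d(x^(0), x^(1)) from (4.185).
   B = [0.1 0.2; 0.3 0.4];            % ||B||_inf = 0.7 < 1 (a contraction)
   f = [1; 2];
   xstar = (eye(2) - B)\f;            % the unique solution of x = Bx + f
   alpha = norm(B, inf);
   x0  = zeros(2,1);
   x1  = B*x0 + f;
   d01 = norm(x1 - x0, inf);          % d(x^(0), x^(1))
   xk  = x0;
   for m = 1:10
      xk = B*xk + f;                  % xk is now x^(m)
      fprintf('m = %2d  error = %9.3e  bound = %9.3e\n', ...
              m, norm(xk - xstar, inf), alpha^m/(1 - alpha)*d01);
   end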

Example 4.8  We shall consider the application of SOR to the problem of
solving Ax = b, where

    A = [ 4  1  0  0            b = [ 1
          1  4  1  0                  2
          0  1  4  1                  3
          0  0  1  4 ],               4 ].

We shall assume that x^(0) = [0 0 0 0]^T. Note that SOR is not the best way to solve
this problem. A better approach is to be found in Section 6.5 (Chapter 6). This
example is for illustration only. However, it is easy to confirm that

    x = [0.1627 0.3493 0.4402 0.8900]^T.



Recall that the SOR iterations are specified by (4.179). However, we have not
discussed how to terminate the iterative process. A popular choice is to recall that
r^(k) = b - Ax^(k) [see (4.167)], and to stop the iterations when, for k = m,

    ||r^(m)|| / ||r^(0)|| < τ                              (4.186)

Figure 4.4  Plot of the number of iterations m needed by SOR as a function of ω for the
parameters of Example 4.8 in order to satisfy the stopping condition
||r^(m)||_∞ / ||r^(0)||_∞ < τ. [The plot itself is not reproduced here; its vertical
axis ran from 0 to about 150 iterations.]



for some τ > 0 (a small value). For our present purposes || · || shall be the norm
in (4.29c), which is compatible with the needs of Theorem 4.9. We shall choose
τ = 0.001.

We observe that A is diagonally dominant, so convergence is certainly expected
for ω = 1. In fact, A > 0, so convergence of the SOR method can be expected for
all ω ∈ (0, 2). Figure 4.4 plots the m that achieves (4.186) versus ω, and we see
that there is an optimal choice for ω that is somewhat larger than ω = 1. In this
case, though, the optimal choice does not lead to much of an improvement over the
choice ω = 1.
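
By way of illustration, the following MATLAB fragment is one way the SOR iteration
with the stopping rule (4.186) could be coded; it is a minimal sketch written for this
discussion (the function name and argument list are ours), not the implementation
used to produce the results quoted below.

   function [x, m] = sor_solve(A, b, omega, tau, maxit)
   % SOR iteration for Ax = b, started from x^(0) = 0, stopped when
   % ||r^(k)||_inf / ||r^(0)||_inf < tau  (cf. (4.186)).
   n  = length(b);
   x  = zeros(n,1);
   r0 = norm(b - A*x, inf);
   for m = 1:maxit
      for i = 1:n
         sigma = A(i,1:i-1)*x(1:i-1) + A(i,i+1:n)*x(i+1:n);
         x(i)  = (1 - omega)*x(i) + omega*(b(i) - sigma)/A(i,i);
      end
      if norm(b - A*x, inf)/r0 < tau
         return;
      end
   end

For the A and b of Example 4.8 with omega = 1 and tau = 0.001 this kind of iteration
stops after only a few sweeps, which is consistent with the m = 5 reported below.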

For our problem [recalling (4.170)], we have

    D = [ 4  0  0  0        L = [  0  0  0  0        U = [ 0 -1  0  0
          0  4  0  0              -1  0  0  0              0  0 -1  0
          0  0  4  0               0 -1  0  0              0  0  0 -1
          0  0  0  4 ],            0  0 -1  0 ],           0  0  0  0 ],

and, using (4.178),

    B_GS = [ 0.0000  -0.2500   0.0000   0.0000
             0.0000   0.0625  -0.2500   0.0000
             0.0000  -0.0156   0.0625  -0.2500
             0.0000   0.0039  -0.0156   0.0625 ].
We therefore find that ||B_GS||_∞ = 0.3281. It is possible to show (preferably using
MATLAB or some other software tool that is good with eigenproblems) that
ρ(B_GS) = 0.1636. Given (4.185) in Theorem 4.9, we therefore expect fast convergence
for our problem since α is fairly small. In fact,

    ||x^(m) - x||_∞ ≤ [||B_GS||_∞^m / (1 - ||B_GS||_∞)] ||(D - L)^{-1} b||_∞      (4.187)

[using x^(0) = 0, x^(1) = (D - L)^{-1} b]. For the stopping criterion of (4.186) we
obtained (recalling that ω = 1, and τ = 0.001) m = 5 with

    x^(5) = [0.1630 0.3490 0.4403 0.8899]^T

so that

    ||x^(5) - x||_∞ = 3.6455 × 10^{-4}.

The right-hand side of (4.187) evaluates to

    [||B_GS||_∞^5 / (1 - ||B_GS||_∞)] ||(D - L)^{-1} b||_∞ = 4.7523 × 10^{-3}.

Thus, (4.187) certainly holds true.

4.8 FINAL REMARKS 

We have seen that inaccurate solutions to linear systems of equations can arise 
when the linear system is ill-conditioned. Condition numbers warn us if this is 
a potential problem. However, even if a problem is well-conditioned, an inac- 
curate solution may arise if the algorithm applied to solve it is unstable. In the 
case of problems arising out of algorithm instability, we naturally replace the 
unstable algorithm with a stable one (e.g., Gaussian elimination may need to 
be replaced by Gaussian elimination with partial pivoting). In the case of an 
ill-conditioned problem, we may try to improve the accuracy of the solution by 
either 

1 . Using an algorithm that does not worsen the conditioning of the underlying 
problem (e.g., choosing QR factorization in preference to Cholesky factor- 
ization) 

2. Reformulating the problem so that it is better conditioned 

We have not considered the second alternative in this chapter. This will be done 
in Chapter 5. 

APPENDIX 4.A HILBERT MATRIX INVERSES 

Consider the following MATLAB code: 

R = hilb(10); 
inv(R) 






ans =

   1.0e+12 *

   [10 x 10 array of entries of magnitude up to a few units, i.e., entries of
    inv(R) of the order of 10^12; the printed column layout was lost in
    extraction and is not reproduced here.]

R*inv(R)

ans =

   [10 x 10 array; the diagonal entries print as 1.0000 (one of them as 0.9999)
    and the off-diagonal entries as 0.0000 or +/-0.0001, so for N = 10 the
    product R*inv(R) is still close to the identity matrix.]
R = hilb(11);
inv(R)

ans =

   1.0e+14 *

   [11 x 11 array of entries of magnitude up to about one, i.e., entries of
    inv(R) of the order of 10^14; the printed column layout was lost in
    extraction and is not reproduced here.]

R*inv(R)

ans =

   [11 x 11 array; the diagonal entries are now only approximately one (values
    such as 0.9997, 0.9992, 0.9915, and 0.9468 appear) and off-diagonal entries
    as large as about 0.17 appear, so R*inv(R) is visibly drifting away from
    the identity matrix for N = 11.]
R = hilb(12);
inv(R)
Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate. RCOND = 2.632091e-17.

ans =

   1.0e+15 *

   [12 x 12 array of entries of magnitude up to a few units, i.e., entries of
    the computed inv(R) of the order of 10^15; the printed column layout was
    lost in extraction and is not reproduced here.]

R*inv(R)
Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate. RCOND = 2.632091e-17.

ans =

   [12 x 12 array; the entries now differ from those of the identity matrix by
    amounts of order unity (values such as 2.2383, -4.0762, and 4.7656 appear),
    so for N = 12 the computed inverse is essentially worthless.]

diary off



















The MATLAB rcond function (which gave the number RCOND above) needs
some explanation. A useful reference on this is Hill [3, pp. 229-230]. It is based on
a condition number estimator in the old FORTRAN codes known as "LINPACK".
It is based on 1-norms. rcond(A) will give the reciprocal of the 1-norm condition
number of A. If A is well-conditioned, then rcond(A) will be close to unity (i.e.,
close to one), and it will be very tiny if A is ill-conditioned. The rule of thumb
involved in interpreting an rcond output is "if rcond(A) ≈ d × 10^{-k}, where d is
a digit from 1 to 9, then the elements of xcomp can usually be expected to have
k fewer significant digits of accuracy than the elements of A" [3]. Here xcomp




is simply the computed solution to Ax = y; that is, in the notation of the present
set of notes, x̂ = xcomp. MATLAB does arithmetic with about 16 decimal digits
[3, p. 228], so in the preceding example of a Hilbert matrix inversion problem for
N = 12, since RCOND is about 10^{-17}, we have lost about 17 digits in computing
R^{-1}. Of course, this loss is catastrophic for our problem.
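
The behavior documented in this appendix can be summarized in a few lines. The
fragment below is a sketch (not part of the original diary) that tabulates
rcond(hilb(N)) together with the departure of R*inv(R) from the identity; the
warnings seen above reappear for the larger N.

   for N = 8:12
      R = hilb(N);
      fprintf('N = %2d   rcond = %9.3e   ||R*inv(R) - I||_inf = %9.3e\n', ...
              N, rcond(R), norm(R*inv(R) - eye(N), inf));
   end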



APPENDIX 4.B SVD AND LEAST SQUARES 

From Theorem 4.4, A = UΣV^T, and this expands into the summation

    A = sum_{i=0}^{p-1} σ_i u_i v_i^T.                    (4.A.1)

But if r = rank(A), then (4.A.1) reduces to

    A = sum_{i=0}^{r-1} σ_i u_i v_i^T.                    (4.A.2)

In the following theorem ρ_LS² = ||f^⊥||_2² = ||Ax̂ - f||_2² [see (4.127)], and x̂ is the
least-squares optimal solution to Ax = f.

Theorem 4.B.1: Let A be represented as in (4.A.2) with A ∈ R^{m×n} and m ≥ n.
If f ∈ R^m, then

    x̂ = sum_{i=0}^{r-1} (u_i^T f / σ_i) v_i,              (4.A.3)

and

    ρ_LS² = sum_{i=r}^{m-1} [u_i^T f]².                    (4.A.4)


Proof  For all x ∈ R^n, using the invariance of the 2-norm to orthogonal
transformations, and the fact that VV^T = I_n (n × n identity matrix),

    ||Ax - f||_2² = ||U^T A V (V^T x) - U^T f||_2² = ||Σα - U^T f||_2²,          (4.A.5)

where α = V^T x, so as α = [α_0 ... α_{n-1}]^T we have α_j = v_j^T x. Equation (4.A.5)
expands as

    ||Ax - f||_2² = α^T Σ^T Σ α - 2 α^T Σ^T U^T f + f^T U U^T f,                 (4.A.6)

which further expands as

    ||Ax - f||_2² = sum_{i=0}^{r-1} σ_i² α_i² - 2 sum_{i=0}^{r-1} σ_i α_i u_i^T f + sum_{i=0}^{m-1} [u_i^T f]²
                  = sum_{i=0}^{r-1} { σ_i² α_i² - 2 σ_i α_i u_i^T f + [u_i^T f]² } + sum_{i=r}^{m-1} [u_i^T f]²
                  = sum_{i=0}^{r-1} [σ_i α_i - u_i^T f]² + sum_{i=r}^{m-1} [u_i^T f]².    (4.A.7)

To minimize this we must have σ_i α_i - u_i^T f = 0, and so

    α_i = (1/σ_i) u_i^T f                                                        (4.A.8)

for i ∈ Z_r. As α = V^T x, we have x = Vα, so if we set α_r = ··· = α_{n-1} = 0, then
from Eq. (4.A.8) we obtain

    x̂ = sum_{i=0}^{r-1} (u_i^T f / σ_i) v_i,

which is (4.A.3). For this choice of x, from (4.A.7),

    ||Ax̂ - f||_2² = sum_{i=r}^{m-1} [u_i^T f]² = ρ_LS²,

which is (4.A.4).

Define A^+ = VΣ^+ U^T (again A ∈ R^{m×n} with m ≥ n), where

    Σ^+ = diag(σ_0^{-1}, ..., σ_{r-1}^{-1}, 0, ..., 0) ∈ R^{n×m}.                (4.A.9)

We observe that

    A^+ f = V Σ^+ U^T f = sum_{i=0}^{r-1} (u_i^T f / σ_i) v_i = x̂.              (4.A.10)

We call A^+ the pseudoinverse of A. We have established that if rank(A) = n,
then A^T A x̂ = A^T f, so x̂ = (A^T A)^{-1} A^T f, which implies that in this case
A^+ = (A^T A)^{-1} A^T. If A ∈ R^{n×n} and A^{-1} exists, then A^+ = A^{-1}.

If A^{-1} exists (i.e., m = n and rank(A) = n), then we recall that κ_2(A) =
||A||_2 ||A^{-1}||_2. If A ∈ R^{m×n} and m ≥ n, then we extend this definition to

    κ_2(A) = ||A||_2 ||A^+||_2.                                                   (4.A.11)
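
The construction (4.A.9), (4.A.10) is easy to check in MATLAB. The fragment below is
a minimal sketch; the test matrix, the tolerance used to estimate the numerical rank,
and the comparison against pinv are choices made here for illustration, not
prescriptions from the text.

   A = [1 2; 3 4; 5 6];                  % m = 3, n = 2, rank(A) = 2
   f = [1; 0; 1];
   [U, S, V] = svd(A);
   s = diag(S);
   r = sum(s > max(size(A))*eps(s(1)));  % numerical rank
   Splus = zeros(size(A'));              % Sigma^+ is n x m
   Splus(1:r,1:r) = diag(1./s(1:r));
   Aplus = V*Splus*U';                   % A^+ = V Sigma^+ U^T
   xhat  = Aplus*f;                      % least-squares solution, cf. (4.A.10)
   norm(Aplus - pinv(A))                 % agrees with MATLAB's pseudoinverse
   norm(xhat - (A'*A)\(A'*f))            % agrees with (A^T A)^{-1} A^T f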




We have established that ||A||_2 = σ_0, so since A^+ = VΣ^+ U^T we have
||A^+||_2 = 1/σ_{r-1}. Consequently

    κ_2(A) = σ_0 / σ_{r-1},                                (4.A.12)

which provides a somewhat better justification of (4.119), because if rank(A) =
n then (4.A.12) is κ_2(A) = σ_0/σ_{n-1} = σ_max(A)/σ_min(A) [which is (4.119)].
From (4.A.11), κ_2(A^T A) = ||A^T A||_2 ||(A^T A)^+||_2. With A = UΣV^T, and A^T =
VΣ^T U^T, we have

    A^T A = V Σ^T Σ V^T,

and

    (A^T A)^+ = V (Σ^T Σ)^+ V^T.

Thus, ||A^T A||_2 = σ_0², and ||(A^T A)^+||_2 = σ_{n-1}^{-2} (rank(A) = n). Thus

    κ_2(A^T A) = σ_0² / σ_{n-1}² = [κ_2(A)]².

The condition number definition κ_p(A) in (4.60) was fully justified because of
(4.58). An analogous justification exists for (4.A.11), but is much more difficult to
derive, and this is why we do not consider it in this book.
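
The squaring of the condition number is the main reason why explicitly forming and
solving the normal equations A^T A x = A^T f can be much less accurate than working
with A directly (e.g., via QR or the SVD). A quick numerical check, using an
arbitrarily chosen ill-conditioned test matrix, is sketched below:

   A  = vander(linspace(0, 1, 8));      % a moderately ill-conditioned 8 x 8 matrix
   k  = cond(A);                        % kappa_2(A) = sigma_max/sigma_min
   kn = cond(A'*A);                     % kappa_2(A^T A)
   fprintf('kappa_2(A) = %.3e  kappa_2(A''A) = %.3e  kappa_2(A)^2 = %.3e\n', k, kn, k^2);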



REFERENCES 

1. J. R. Rice, The Approximation of Functions, Vol. I: Linear Theory, Addison-Wesley,
Reading, MA, 1964.

2. M.-D. Choi, "Tricks or Treats with the Hilbert Matrix," Am. Math. Monthly 90(5), 
301-312 (May 1983). 

3. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.), 
Random House, New York, 1988. 

4. G. E. Forsythe and C. B. Moler, Computer Solution of Linear Algebraic Systems, 
Prentice-Hall, Englewood Cliffs, NJ, 1967. 

5. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ. 
Press, Baltimore, MD, 1989. 

6. R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge Univ. Press, Cambridge, 
MA, 1985. 

7. N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, PA, 
1996. 

8. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied
Mathematics series, Vol. 37), Springer-Verlag, New York, 2000.

9. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966. 

10. O. Axelsson, Iterative Solution Methods, Cambridge Univ. Press, New York, 1994. 

11. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 1978. 




PROBLEMS 

4.1. Function f(x) ∈ L²[0, 1] is to be approximated according to

         f(x) ≈ a_0 + a_1 / (x + c)

     using least squares, where c ∈ R is some fixed parameter. This involves
     solving the linear system Ra = g, where

         R = [ 1                 log_e(1 + 1/c)          g = [ ∫_0^1 f(x) dx
               log_e(1 + 1/c)    1/(c(c+1))     ],             ∫_0^1 f(x)/(x + c) dx ],

     and a = [a_0 a_1]^T ∈ R² is the vector that minimizes the energy V(a) [Eq. (4.8)].

     (a) Suppose that f(x) = x + 1/(x + 1). Find a for c = 1. For this special case
         it is possible to know the answer in advance without solving the linear
         system above. However, this problem requires you to solve the system.
         [Hint: It helps to recall that

             [ a  b ]^{-1}         1      [  d  -b ]
             [ c  d ]        =  -------   [ -c   a ]  .]
                                ad - bc

     (b) Derive R [which is a special case of (4.9)].




4.2. Suppose that

         A = [  1   2
               -1  -5 ].

     Find ||A||_∞, ||A||_1, and ||A||_2.
4.3. Suppose that 

A = 



1 

2 6 



     ∈ R^{2×2}, ε > 0,



     and that A^{-1} exists. Find κ_∞(A) if ε is small.
4.4. Suppose that 



A = 



1 1 

6 



eR 



2x2 



     and assume ε > 0 (so that A^{-1} always exists). Find κ_2(A) = ||A||_2 ||A^{-1}||_2.
     What happens to condition number κ_2(A) if ε → 0?

4.5. Let A(ε), B(ε) ∈ R^{n×n}. For example, A(ε) = [a_ij(ε)], so element a_ij(ε) of
     A(ε) depends on the parameter ε ∈ R.

     (a) Prove that

             (d/dε)[A(ε)B(ε)] = A(ε) (dB(ε)/dε) + (dA(ε)/dε) B(ε),

         where dA(ε)/dε = [da_ij(ε)/dε], and dB(ε)/dε = [db_ij(ε)/dε].

     (b) Prove that

             (d/dε) A^{-1}(ε) = -A^{-1}(ε) [dA(ε)/dε] A^{-1}(ε).

         [Hint: Consider (d/dε)[A(ε) A^{-1}(ε)] = (d/dε) I = 0, and use (a).]

4.6. This problem is an alternative derivation of κ(A). Suppose that ε ∈ R, A, F ∈
     R^{n×n}, and x(ε), y, f ∈ R^n. Consider the perturbed linear system of equations

         (A + εF) x(ε) = y + εf,                           (4.P.1)

     where Ax = y, so x(0) = x is the correct solution. Clearly, εF models the
     errors in A, while εf models the errors in y. From (4.P.1), we obtain

         x(ε) = [A + εF]^{-1} (y + εf).                    (4.P.2)

     The Taylor series expansion for x(ε) about ε = 0 is

         x(ε) = x(0) + [dx(0)/dε] ε + O(ε²),               (4.P.3)

     where O(ε²) denotes terms in the expansion containing ε^k for k ≥ 2.

     Use (4.P.3), results from the previous problem, and basic matrix-vector norm
     properties (Section 4.4) to derive the bound

         ||x(ε) - x|| / ||x|| ≤ ε ||A|| ||A^{-1}|| { ||f||/||y|| + ||F||/||A|| } + O(ε²),

     where κ(A) = ||A|| ||A^{-1}||.

     [Comment: The relative error in A is ρ_A = ε||F||/||A||, and the relative error
     in y is ρ_y = ε||f||/||y||, so more concisely ||x(ε) - x||/||x|| ≤ κ(A)(ρ_A + ρ_y)
     + O(ε²).]

4.7. This problem follows the example relating to Eq. (4.95). An analog signal
     f(t) is modeled according to

         f(t) = sum_{j=0}^{p-1} a_j t^j + η(t),

     where a_j ∈ R for all j ∈ Z_p, and η(t) is some random noise term. We only
     possess samples of the signal; that is, we only have the finite-length sequence
     (f_n) defined by f_n = f(nT_s) for n ∈ Z_N, where T_s > 0 is the sampling
     period of the data acquisition system that gave us the samples. Our estimate
     of f_n is therefore

         f̂_n = sum_{j=0}^{p-1} a_j T_s^j n^j.

     With a = [a_0 a_1 ... a_{p-2} a_{p-1}]^T ∈ R^p, find

         V(a) = sum_{n=0}^{N-1} [f_n - f̂_n]²

     in the form of Eq. (4.102). This implies that you must specify p, g, A, and P.

4.8. Sampled data f_n (n = 0, 1, ..., N - 1) is modeled according to

         f_n = a + bn + C sin(θn + φ).

     Recall sin(A + B) = sin A cos B + cos A sin B. Also recall that

         V(θ) = sum_{n=0}^{N-1} [f_n - f̂_n]² = f^T f - f^T A(θ) [A^T(θ) A(θ)]^{-1} A^T(θ) f,

     where f = [f_0 f_1 ... f_{N-1}]^T ∈ R^N. Give a detailed expression for A(θ).

4.9. Prove that the product of two lower triangular matrices is a lower triangular 
matrix. 

4.10. Prove that the product of two upper triangular matrices is an upper triangular 
matrix. 

4.11. Find the LU decomposition of the matrix

          A = [  2  -1   0
                -1   2  -1
                 0  -1   2 ]

      using Gauss transformations as recommended in Section 4.5. Use A = LU to
      rewrite the factorization of A as A = LDL^T, where L is unit lower triangular,
      and D is a diagonal matrix. Is A > 0? Why?

4.12. Use Gaussian elimination to LU factorize the matrix 



A = 



l 

L 2 



4 -1 

-1 4 

-1 



l -\ 

2 
-1 

4 



Is A > 0? Why" 






4.13. (a) Consider A ∈ R^{n×n}, and A = A^T. Suppose that the leading principal
          submatrices of A all have positive determinants. Prove that A > 0.

      (b) Is

              [  5  -3   0
                -3   5   1    > 0?
                 0   1   5 ]

          Why?

4.14. The vector space R^n is an inner product space with inner product (x, y) =
      x^T y for all x, y ∈ R^n. Suppose that A ∈ R^{n×n}, A = A^T, and also that A > 0.
      Prove that (x, y) = x^T A y is also an inner product on the vector space R^n.

4.15. In the quadratic form

          V(x) = f^T f - 2 x^T A^T f + x^T A^T A x,

      we assume x ∈ R^n, and A^T A > 0. Prove that

          V(x̂) = f^T f - f^T A [A^T A]^{-1} A^T f,

      where x̂ is the vector that minimizes V(x).



4.16. Suppose that A ∈ R^{N×M} with M ≤ N, and rank(A) = M. Suppose also that
      (x, y) = x^T y. Consider P = A[A^T A]^{-1} A^T, P_⊥ = I - P (I is the N × N
      identity matrix). Prove that for all x ∈ R^N we have (Px, P_⊥ x) = 0. (Comment:
      Matrices P and P_⊥ are examples of orthogonal projection operators.)

4.17. (a) Write a MATLAB function for forward substitution (solving Lz = y).
          Write a MATLAB function for backward substitution (solving Ux = z).
          Test your functions out on the following matrices and vectors:



v 



[1 



1 


o - 




- 1 


2 


3 4 


1 1 

| 1 






, u = 






3 



5 5 

7 1 

3 3 


-f 


1 _ 




. 





f 


1 -1 - 


if 


.* = [! 


2 


7 

3 


~2] T - 



      (b) Write a MATLAB function to implement the LU decomposition algorithm
          based on Gauss transformations considered in Section 4.5. Test your
          function out on the following A matrices:

              A = [ 1  2  3  4            A = [  2  -1   0
                   -1  1  2  1                  -1   2  -1
                    0  2  1  3                   0  -1   2 ].
                    0  0  1  1 ],





Figure 4.P.1 The DC electric circuit of Problem 4.18. 

4.18. Consider the DC electric circuit in Fig. 4.P.1.

      Write the node equations for the node voltages V_1, V_2, ..., V_6 as shown.
      These may be loaded into a vector

          v = [V_1 V_2 V_3 V_4 V_5 V_6]^T

      such that the node equations have the form

          Gv = y.                                          (4.P.4)

      Use the Gaussian elimination, forward substitution, and backward substitution
      MATLAB functions from the previous problem to solve the linear system (4.P.4).

4.19. Let I_n ∈ R^{n×n} be the order n identity matrix, and define T_n ∈ R^{n×n} to be
      the tridiagonal matrix with a in every main diagonal position and b in every
      sub- and superdiagonal position:

          T_n = [ a  b  0  ...  0  0
                  b  a  b  ...  0  0
                  0  b  a  ...  0  0
                  .  .  .       .  .
                  0  0  0  ...  a  b
                  0  0  0  ...  b  a ].

      Matrix T_n is tridiagonal. It is also an example of a symmetric matrix that is
      Toeplitz (defined in a later problem). The characteristic polynomial of T_n is
      p_n(λ) = det(λI_n - T_n). Show that

          p_n(λ) = (λ - a) p_{n-1}(λ) - b² p_{n-2}(λ)      (4.P.5)

      for n = 3, 4, 5, .... Find p_1(λ) and p_2(λ) [initial conditions for the
      polynomial recursion in (4.P.5)].

4.20. Consider the following definition: T ∈ C^{n×n} is Toeplitz if it has the form
      T = [t_{i-j}]_{i,j=0,1,...,n-1}. Thus, for example, if n = 3 we have

          T = [ t_0    t_{-1}   t_{-2}
                t_1    t_0      t_{-1}
                t_2    t_1      t_0 ].

      Observe that in a Toeplitz matrix all of the elements on any given diagonal
      are equal to each other. A symmetric Toeplitz matrix has the form T = [t_{|i-j|}],
      since T = T^T implies that t_{-i} = t_i for all i. Let x_n =
      [x_{n,0} x_{n,1} ... x_{n,n-2} x_{n,n-1}]^T ∈ C^n. Let J_n ∈ C^{n×n} be the n × n
      exchange matrix (also called the contra-identity matrix), which is defined as the
      matrix yielding x̃_n = J_n x_n = [x_{n,n-1} x_{n,n-2} ... x_{n,1} x_{n,0}]^T. We see
      that J_n simply reverses the order of the elements of x_n. An immediate
      consequence of this is that J_n J_n x_n = x_n (i.e., J_n² = I_n). What is J_3?
      (Write this matrix out completely.) Suppose that T_n is a symmetric Toeplitz
      matrix.

      (a) Show that (noting that t̃_n = J_n t_n)

              T_{n+1} = [ T_n       t̃_n          [ t_0    t_n^T
                          t̃_n^T    t_0   ]  =     t_n    T_n  ]

          (nesting property). What is t_n?

      (b) Show that

              J_n T_n J_n = T_n

          (persymmetry property).

      (Comment: Toeplitz matrices have an important role to play in digital signal
      processing. For example, they appear in problems in spectral analysis, and
      in voice compression algorithms.)

4.21. This problem is about a computationally efficient method to solve the linear
      system

          R_n a_n = σ_n² e_0,                              (4.P.6)

      where R_n ∈ R^{n×n} is symmetric and Toeplitz (recall Problem 4.20). All
      of the leading principal submatrices (recall the definition in Theorem 4.1)
      of R_n are nonsingular. Also, e_0 = [1 0 ... 0]^T ∈ R^n, a_n =
      [1 a_{n,1} ... a_{n,n-2} a_{n,n-1}]^T, and σ_n² ∈ R is unknown as well. Define
      e_n = J_n e_0. Clearly, a_{n,0} = 1 (all n).

      (a) Prove that

              R_n ã_n = σ_n² J_n e_0 = σ_n² e_n,           (4.P.7)

          where J_n a_n = ã_n.
(b) Observe that R n \_a n a n ] — cr^[eo e n ]. Augmented matrix [a n a n ] is 
n x 2, and so [eo e n ] is n x 2 as well. Prove that 



Rn 



a„ 
a„ 



In ole n 



(4.P.8) 



What is r\rP. (That is, find a simple expression for it.) 
(c) We wish to obtain 



R n +\[a n +\ a n+ i] — a n+l [eo e„+i] 



(4.P.9) 



from a manipulation of (4. P. 8). To this end, find a formula for parameter 
K„ € R in 



^H 



a n 
a„ 



1 K„ 

K n 1 



0„ eo f] n 
Vn ff„ 2 e„ 



1 tf» 

K n 1 



such that (4. P. 9) is obtained. This implies that we obtain the vector 
recursions 



a n +\ 







A"„ 







and 











Find the initial condition a\ e R lxl . What is er, 2 ? 



(d) Prove that 



ff„ +1 =ff„(l - /«:„). 



(e) Summarize the algorithm obtained in the previous steps in the form of 
pseudocode. The resulting algorithm is often called the Levinson-Durbin 
algorithm. 






(f) Count the number of arithmetic operations needed to implement the 
Levinson-Durbin algorithm in (e). Compare this number to the num- 
ber of arithmetic operations needed by the general LU decomposition 
algorithm presented in Section 4.5. 

(g) Write a MATLAB function to implement the Levinson-Durbin algorithm. 
Test your algorithm out on the matrix 



2 1 
1 2 1 
1 2 



4.22. 



(Hint: You will need the properties in Problem 4.20. The parameters K n 
are called reflection coefficients, and are connected to certain problems in 
electrical transmission line theory.) 

This problem is about proving that solving (4. P. 6) yields the LDL T decom- 
position of R~ x . (Of course, R n is real-valued, symmetric, and Toeplitz.) 
Observe that (via the nesting property for Toeplitz matrices) 



Rn 



1 





■ 


0" 




[On 1 


X 


X 


X 


0/1,1 


1 


• 










<£l • 


X 


X 


0/1,2 


fl/z-1,1 


• 





= 








X 


X 


@n,n— 2 


a n -l,n-3 ■ 


1 













•• °l 


X 


Qn,n—1 


a n -l,n-2 ■ 


• 02,1 


1 _ 


, 


_ 





■■ 


°l\ 



= L„ 



= U„ 



(4.P.10) 



where x denotes "don't care" entries; thus, the particular value of such an 
entry is of no interest to us. Use (4.P.10) to prove that 



L n K n L n — L) n , 



(4.P.11) 



where D n — diag (cr„ , o^-v • • • ' a 2 > °t ) (diagonal matrix; see the comments 
following Theorem 4.4). Use (4. P. 11) to prove that 



r: 



l D J 



(4.P.12) 



What is D n l [Hint: Using (4.P.10), and R n — R^ we note that L T n R n L n can 
be expressed in two distinct but equivalent ways. This observation is used 
to establish (4.P.11).] 

4.23. A matrix T n is said to be strongly nonsingular (strongly regular) if all of its 
leading principle submatrices are nonsingular (recall the definition in Theo- 
rem 4.1). Suppose that x n — [x n q x n j • • • x„„_2 x nn _i\ T e C", and define 




Z„ e C" x " according to 

Thus, Z„ shifts the elements of any column vector down by one position. 
The top position is filled in with a zero, while the last element x„„_i is lost. 
Assume that T n is real-valued, symmetric and Toeplitz. Consider 

t — 7 t y T — y 

(a) Find X n . 

(b) If T n is strongly nonsingular then what is rank (X n )l (Be careful. There 
may be separate cases to consider.) 

(c) Use &k (Kronecker delta) to specify Z„; that is, Z n — [zij], so what is 
Zij in terms of the Kronecker delta? (Hint: For example, the identity 
matrix can be described as / = [5,_ ; ].) 

4.24. Suppose that R ∈ R^{N×N} is strongly nonsingular (see Problem 4.23 for the
      definition of this term), and that R = R^T. Thus, there exists the factorization

          R = L_N D_N L_N^T,                               (4.P.13)

      where L_N is a unit lower triangular matrix, and D_N is a diagonal matrix.
      Let

          L_N = [l_0 l_1 ... l_{N-2} l_{N-1}],

      so l_i is column i of L_N. Let

          D_N = diag(d_0, d_1, ..., d_{N-1}).

      Thus, via (4.P.13), we have

          R = sum_{k=0}^{N-1} d_k l_k l_k^T.               (4.P.14)

      Consider the following algorithm (with R_0 := R):

          for n := 0 to N - 1 do begin
             d_n := e_n^T R_n e_n;
             l_n := d_n^{-1} R_n e_n;
             R_{n+1} := R_n - d_n l_n l_n^T;
          end;






      As usual, we have the unit vector

          e_i = [0 ... 0 1 0 ... 0]^T ∈ R^N

      (with the 1 in position i). The algorithm above is the Jacobi procedure
      (algorithm) for computing the Cholesky factorization (recall Theorem 4.6) of R.

      Is R > 0 a necessary condition for the algorithm to work? Explain. Test the
      Jacobi procedure out on

          R = [ 2  1  0
                1  2  1
                0  1  2 ].

      Is R > 0? Justify your answer. How many arithmetic operations are needed to
      implement the Jacobi procedure? How does this compare with the Gaussian
      elimination method for general LU factorization considered in Section 4.5?



4.25. Suppose that A^{-1} exists, and that I + V^T A^{-1} U is also nonsingular. Of
      course, I is the identity matrix.

      (a) Prove the Sherman-Morrison-Woodbury formula

              [A + UV^T]^{-1} = A^{-1} - A^{-1} U [I + V^T A^{-1} U]^{-1} V^T A^{-1}.

      (b) Prove that if U = u ∈ C^n and V = v ∈ C^n, then

              [A + uv^T]^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u).

      (Comment: These identities can be used to develop adaptive filtering algorithms,
      for example.)



4.26. Suppose that

          A = [ a  b
                b  c ].

      Find conditions on a, b, and c to ensure that A > 0.

4.27. Suppose that A ∈ R^{n×n}, and that A is not necessarily symmetric. We still
      say that A > 0 iff x^T A x > 0 for all x ≠ 0 (x ∈ R^n). Show that A > 0 iff
      B = (1/2)(A + A^T) > 0. Matrix B is often called the symmetric part of A. (Note:
      In this book, unless stated to the contrary, a pd matrix is always assumed to
      be symmetric.)






4.28. Prove that for A ∈ R^{n×n}

          ||A||_∞ = max_{0≤i≤n-1} sum_{j=0}^{n-1} |a_ij|.



4.29. Derive Eq. (4.128) (the law of cosines). 

4.30. Consider A_n ∈ R^{n×n} defined by a_ii = 1, a_ij = -1 for j > i, and a_ij = 0
      for j < i; that is, A_n is unit upper triangular with -1 in every entry above
      the main diagonal. Its inverse A_n^{-1} is also unit upper triangular, with
      (i, j)th entry equal to 2^{j-i-1} for j > i. Show that κ_∞(A_n) = n 2^{n-1}.
      Clearly, det(A_n) = 1. Consider

          D_n = diag(10^{-1}, 10^{-1}, ..., 10^{-1}) ∈ R^{n×n}.

      What is κ_p(D_n)? Clearly, det(D_n) = 10^{-n}. What do these two cases say
      about the relationship between det(A) and κ(A) (A ∈ R^{n×n}, and nonsingular)
      in general?

4.31. Recall Problem 1.7 (in Chapter 1). Suppose a = [a_0 a_1 ... a_n]^T, b =
      [b_0 b_1 ... b_m]^T, and c = [c_0 c_1 ... c_{n+m}]^T, where

          c_i = sum_k a_k b_{i-k}.

      Find matrix A such that c = Ab, and find matrix B such that c = Ba. What
      are the sizes of matrices A and B? (Hint: A and B will be rectangular
      Toeplitz matrices. This problem demonstrates the close association between
      Toeplitz matrices and the convolution operation, and so partially explains the
      central importance of Toeplitz matrices in digital signal processing.)






4.32. Matrix P ∈ R^{n×n} is a permutation matrix if it possesses exactly one 1 per
      row and column, and zeros everywhere else. Such a matrix simply reorders
      the elements in a vector. For example



10 




xo 




x\ 


10 




XI 




X2 


10 




X2 




x 


1 




XT, 




XT, 



      Show that P^{-1} = P^T (i.e., P is an orthogonal matrix).
4.33. Find c and s in 



*0 
x\ 



"0 



l l 



where s 2 + c 2 — 1. 
4.34. Consider

          A = [ 2  1  0  0
                1  2  1  0
                0  1  2  1
                0  0  1  2 ].

      Use Householder transformation matrices to find the QR factorization of
      matrix A.



4.35. Consider

          A = [ 5  2  1  0
                2  5  2  1
                1  2  5  2
                0  1  2  5 ].

      Using Householder transformation matrices, find orthogonal matrices Q_0
      and Q_1 such that

          Q_1 Q_0 A = [ h_00  h_01  h_02  h_03
                        h_10  h_11  h_12  h_13
                        0     h_21  h_22  h_23
                        0     0     h_32  h_33 ]  = H.

      [Comment: Matrix H is upper Hessenberg. This problem is an illustration of
      a process that is important in finding matrix eigenvalues (Chapter 11).]

4.36. Write a MATLAB function to implement the Householder QR factorization 
algorithm as given by the pseudocode in Section 4.6 [between Eqs. (4.146), 
and (4.147)]. The function must output the separate factors Hi that make up 
Q, in addition to the factors Q and R. Test your function out on the matrix 
A in Problem 4.34. 




4.37. Prove Eq. (4.117), and establish that 

rank (A) = rank (A T ). 

4.38. If A ∈ C^{n×n}, then the trace of A is given by

          Tr(A) = sum_{k=0}^{n-1} a_{k,k}

      (i.e., it is the sum of the main diagonal elements of matrix A). Prove that

          ||A||_F² = Tr(A A^H)

      [recall (4.36a)].

4.39. Suppose that A, B ∈ R^{n×n}, and Q ∈ R^{n×n} is orthogonal; then, if A =
      QBQ^T, prove that

          Tr(A) = Tr(B).

4.40. Recall Theorem 4.4. Prove that for A ∈ R^{n×n}

          ||A||_F² = sum_{k=0}^{n-1} σ_k².

      (Hint: Use the results from Problems 4.38 and 4.39.)

4.41. Consider Theorem 4.9. Is α < 1 a necessary condition for convergence?
      Explain.

4.42. Suppose that A ∈ R^{n×n} is pd, and prove that the JOR method (Section 4.7)
      converges if

          0 < ω < 2 / ρ(D^{-1} A).



4.43. Repeat the analysis made in Example 4.8, but instead use the matrix

          A = [ 4  2  1  0
                1  4  1  0
                0  1  4  1
                0  0  2  4 ].

      This will involve writing and running a suitable MATLAB function. Find the
      optimum choice for ω to an accuracy of ±0.02.






5 



Orthogonal Polynomials 



5.1 INTRODUCTION 

Orthogonal polynomials arise in highly diverse settings. They can be solutions to 
special classes of differential equations that arise in mathematical physics problems. 
They are vital in the design of analog and digital filters. They arise in numerical 
integration methods, and they have a considerable role to play in solving least- 
squares and uniform approximation problems. 

Therefore, in this chapter we begin by considering some of the properties shared 
by all orthogonal polynomials. We then consider the special cases of Chebyshev, 
Hermite, and Legendre polynomials. Additionally, we consider the application 
of orthogonal polynomials to least-squares and uniform approximation problems. 
However, we emphasize the case of least-squares approximation, which was first 
considered in some depth in Chapter 4. The approach to least-squares problems 
taken here alleviates some of the concerns about ill-conditioning that were noted 
in Chapter 4. 



5.2 GENERAL PROPERTIES OF ORTHOGONAL POLYNOMIALS 

We are interested here in the inner product space L²(D), where D is the domain
of definition of the functions in the space. Typically, D = [a, b], D = R, or D =
[0, ∞). We shall, as in Chapter 4, assume that all members of L²(D) are real-valued
to simplify matters. So far, our inner product has been

    (f, g) = ∫_D f(x) g(x) dx,                             (5.1)

but now we consider the weighted inner product

    (f, g) = ∫_D w(x) f(x) g(x) dx,                        (5.2)

where w(x) > 0 (x ∈ D) is the weighting function. This includes (5.1) as a special
case for which w(x) = 1 for all x ∈ D.




Our goal is to consider polynomials

    φ_n(x) = sum_{k=0}^n φ_{n,k} x^k,   x ∈ D              (5.3)

of degree n such that

    (φ_n, φ_m) = δ_{n-m}                                   (5.4)

for all n, m ∈ Z^+. The inner product is that of (5.2). Our polynomials {φ_n(x) | n ∈
Z^+} are orthogonal polynomials on D with respect to weighting function w(x).
Changing D and/or w(x) will generate very different polynomials, and we will
consider important special cases later. However, all orthogonal polynomials possess
certain features in common with each other regardless of the choice of D or w(x).
We shall consider a few of these in this section.

If p(x) is a polynomial of degree n, then we may write deg(p(x)) = n.

Theorem 5.1: Any three consecutive orthogonal polynomials are related by the
three-term recurrence formula (relation)

    φ_{n+1}(x) = (A_n x + B_n) φ_n(x) + C_n φ_{n-1}(x),    (5.5)

where

    A_n = φ_{n+1,n+1} / φ_{n,n},
    B_n = (φ_{n+1,n+1} / φ_{n,n}) [ φ_{n+1,n} / φ_{n+1,n+1} - φ_{n,n-1} / φ_{n,n} ],
    C_n = - φ_{n+1,n+1} φ_{n-1,n-1} / φ_{n,n}².            (5.6)

Proof Our proof is a somewhat expanded version of that in Isaacson and Keller 
[1, pp. 204-205]. 

Observe that for A n — (j> n +i,n+i/<Pn,n< we have 

h + 1 n 

q„(x) = <t> n +l(x) - A n X(j) n (x) = 'Y^(l>n+\,kX k ~ A„ ^ <t>n,kX k+l 

k=0 k=0 

n+\ n+\ 

= ^2<Pn+l,kX k ~ A n 'Y^ l 4>n,}-\X i 
fc=0 ;=1 

n n 

= ^24>n+l,kX k - A n 'Y^ l 4>n,j-\X i + [0 B+ i >B+ i - A n (/) ntn ]x n + l 
k=0 ;=1 

n n 

= ^2<Pn+l,kX k ~ A n 'Y^ l 4>n,}-\X } , 
k=0 j=\ 




which is a polynomial of degree at most n, i.e., deg(q n (x)) < n. Thus, for suit- 
able Uk 



q n (x) = y^akfaW, 



k=0 

so because (fa, <j>j) — &k-j, we have a ; - = (q„, (j>j). In addition 

n-2 

0n+iW - A n xfa(x) = a„fa(x) + a n -\cj) n -i(x) + ^a k (j> k (x). (5.7) 

Now degC*0y(;c)) < « — 1 for j < n — 2. Thus, there are /8a such that 

n-i 
x0/(x) = ^Mk(x) 

k=0 

so (</>,-, x</) ; > = 0ifr>n — 1, or 

(fa,X<j>j} = (5.8) 

for j = 0, 1, ...,« — 2, From (5.7) via (5.8), we obtain 

{(/>„+! - A„X<p„,fa) — (<j>n+l,fa) - A n {(p n ,X<t>k) — 

for k — 0, 1, . . . , n — 2. This is the inner product of cj>k(x) with the left-hand side 
of (5.7). For the right-hand side of (5.7) 

/ "~ 2 \ 

= ( a„<p n + a n -\4>n-l + 2_, a j4>j' <Pk ) 

n-2 
= a n (<p n , fa) + a„-l {fa-l,fa) + /, &j (</>; , fa > 

7=0 

n-2 

= ^2 a .j(<i>j><i>k) 

again for £ = 0, . . . , n — 2. We can only have Yl'}Zo a j (4>j> <Pk) = if ay =0 for 
j = 0, 1, . . . , n — 2. Thus, (5.7) reduces to 

fa+i(x) - A n xfa{x) = a n fa(x) + a n -i<t>n-i(x) 

or 

<Pn+i(x) = (A„x + a„)0„(x)H-a„-i0„-i(x), (5.9) 




which has the form of (5.5). We now need to verify that a n — B n , a n -\ — C„ as 
in (5.6). From (5.9) 

<p n (x) = A n -ix4> n -i(x) + a n -i4>„-i(x) + a n -2<t>n-2(x) 



or 



1 

x<f> n -i(x) = <p n (x) + 

A n -\ 



1 



-(a n -i(j>„-i(x) + a n - 2 (/>n-2(x)) 



(5.10) 



=P„-l(x) 

for which deg(/?„_i(x)) < n — 1. Thus, from (5.9) 

((pn+l, <l>n-\) — A n {x<j) n ,(t) n -i) +a n (</>„, (j)„-i) +a„-\ (</>„_i, </>„-i) 
so that 



and via (5.10) 



which becomes 



or 



a«-l = -A n (x<p n , 4>n-\) — -A n {(p n ,X<p n -\), 



a n -\ 



ln-1 



"0« + Pn-l ) , 



«»-l - ~ ; " (<t>n,<t>n) — 



Cm Of ,7 _ 1 



A n — \ A n — \ 

<l>n + l,n + l 4>n-l,n-l 



which is the expression for C n in (5.6). Expanding (5.5) 



«+i 



n+\ 



n-\ 



22&n+i,kx k — A n 22<p n , k -ix k + B n 22(j) n ^x k + C„ )J) n -\ :k x k , 

k=Q k=\ k=0 k=0 

where, on comparing the terms for k — n, we see that 

4 > n+l,n — A n (p n n -\ + D n <p nn 

(since 4> n -\,n =0). Therefore 



1 1 

B n — — [<Pn + l,n ~ A n <p nn -\] — — — 

tyn,n fin,) 



<Pn + l,n + l 
'«+!,« 7 <Pn,n-\ 



Pn+l,n+l 



"w + l,w H'n.n — 1 



,4>n+\,n+\ <Pn,n . 



which is the expression for B n in (5.6). 
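
In practice, Theorem 5.1 is used to generate orthogonal polynomial values without
ever forming the polynomials explicitly: once A_n, B_n, and C_n are known, (5.5) is
simply iterated. The MATLAB sketch below illustrates the mechanics using the
Chebyshev polynomials of Section 5.3, whose recurrence takes the familiar form
T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x) for n >= 1; the function name and calling
convention are ours.

   function T = cheb_recur(nmax, x)
   % Evaluate T_0(x), ..., T_nmax(x) by the three-term recurrence.
   % x may be a row vector of evaluation points; T(n+1,:) holds T_n(x).
   T = zeros(nmax+1, length(x));
   T(1,:) = ones(size(x));                  % T_0(x) = 1
   if nmax >= 1, T(2,:) = x; end            % T_1(x) = x
   for n = 2:nmax
      T(n+1,:) = 2*x.*T(n,:) - T(n-1,:);    % T_n = 2x T_{n-1} - T_{n-2}
   end

Comparing cheb_recur(5, x) with cos(n*acos(x)) for x in [-1, 1] gives agreement to
machine precision, in line with Lemma 5.1 and the development of Section 5.3.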




Theorem 5.2: Orthogonal polynomials satisfy the Christoffel-Darboux formula
(relation)

    (x - y) sum_{j=0}^n φ_j(x) φ_j(y)
        = (φ_{n,n} / φ_{n+1,n+1}) [ φ_{n+1}(x) φ_n(y) - φ_{n+1}(y) φ_n(x) ].      (5.11)

Proof The proof is an expanded version of that from Isaacson and Keller [1, 
p. 205]. 

From (5.5), we obtain 

<M)O0m+iOO = (A n x + B n )(j) n (x)^ n (y) + C n cj) n -\{x)(i> n (y), (5.12) 

so reversing the roles of x and y gives 

4 l n{x)4> n+ i(y) = (A n y + B n )cp n {y)(j> n (x) + C„</>„_i(;y )</>„(*)■ (5.13) 

Subtracting (5.13) from (5.12) yields 

^n(y)4>n + l(x) - (/) n (x)^ n + i(y) = A n (x - y)(pn(x)<Pn(y) 

+ c n (<p n -i(x)(p„(y) - (Pn(x)<P n -i(y))- (5.14) 

We note that 



(5.15) 



C„ __ 0«-i,«-i __ 1 

A n <Pn,n A n —\ 

so (5.14) may be rewritten as 

(x - y)(/)n(x)(j)n(y) = A" 1 (</>„+] (x)(j>„(y) ~ <t>n+l{y)<t>n{x)) 

- A~\(<p n (x)<l> n -i(y) - (p n -i(x)<p n (y)). 
Now consider 

n n 

(x - y)J2 ( / ) j(x)cPj(y) = J2 {AfWj+iixWjW - <Pj+i(y)<t>j(x)] 

7=0 j=0 

- Ajl 1 [4>j(.x)4>j-i(y) - 4>j-i(x)<t>j(y)]] . (5.16) 

Since Aj — 0/+i j+i/<j>; j, we have AZi — <P-i,-i/(/>o,0 = because ^>-i,-i — 0. 
Taking this into account, the summation on the right-hand side of (5.16) reduces 
due to the cancellation of terms, 1 and so (5.16) finally becomes 



(X ~ y)^2(pj(x)(/>j(y) = - — - [<Pn+l(x)(t>n(y) ~ <t>n+l(y)4>n(x)], 

r- 
which is (5.11). 



,_ Pn + \,n + \ 



The cancellation process seen here is similar to the one that occurred in the derivation of the Dirichlet 
kernel identity (3.24). This mathematical technique seems to be a recurring theme in analysis. 




The following corollary to Theorem 5.2 is from Hildebrand [2, p. 342]. However,
there appears to be no proof in Ref. 2, so one is provided here.

Corollary 5.1  With φ_k^{(1)}(x) = dφ_k(x)/dx we have

    sum_{j=0}^n φ_j²(x)
        = (φ_{n,n} / φ_{n+1,n+1}) [ φ_{n+1}^{(1)}(x) φ_n(x) - φ_n^{(1)}(x) φ_{n+1}(x) ].    (5.17)

Proof Since 

<Pn+i(x)(j) n (y) - <p n+ i(y)(p n (x) = [0 n+ i(x) - 0„+i(y )]0„(y) 

- [0n(*) - 0n(>')]0n+l(}') 



via (5.11) 

n 

J^<l>j(x)<t>j(y) 



<Pn+l(x) ~4>n+l(y) , , , <Pn(x) - (j) n (y) 

0n(;y) 0«+i(y) 

'«+!,«+! L x - y x - y 



(5.18) 



Letting y — > x in (5.18) immediately yields (5.17). This is so from the definition of 
the derivative, and the fact that all polynomials may be differentiated any number 
of times. 

The three-term recurrence relation is a vital practical method for orthogonal
polynomial generation. However, an alternative approach is the following.

If deg(q_{r-1}(x)) = r - 1 (q_{r-1}(x) is an arbitrary degree r - 1 polynomial), then

    (φ_r, q_{r-1}) = ∫_D w(x) φ_r(x) q_{r-1}(x) dx = 0.    (5.19)

Define

    w(x) φ_r(x) = d^r G_r(x) / dx^r = G_r^{(r)}(x)         (5.20)

so that (5.19) becomes

    ∫_D G_r^{(r)}(x) q_{r-1}(x) dx = 0.                    (5.21)

If we repeatedly integrate by parts (using q_{r-1}^{(r)}(x) = 0 for all x), we obtain

f Gj r) (x)«r-lWrf* = Gj r ~ 1) (x)?r-lW|D- f G^~ l) (x)q^ (x) dx 
J D J D 




= G< r - 1) G0«.-1(J0|D - G ( r 2 \x)q'; l \{x)\ D 
) (x)q r _ l (x)dx 

G < f- l \x)q r . x (x)\ D - G { ;- 2 \x)qf\{x)\ D 






/. 



G (r-2)^_(2) 



G ( r 3) (x)qf-\{x)\ D + f G { r 3) (x)qf\{x)dx 
J D 



and so on. Finally 



<r-2)M) 



[G^ ^,-1 - g; '? r _j + g; ^ r _j 



,( r _3) (2) 



+ (-i) r - i G f .« r ( r 1 1) ]D = o, 



or alternatively 



Since from (5.20) 



Yy\f x G^\^ 



.k=\ 



= o. 



4>,-(x) = 



1 d r G r (;c) 

w(x) dx r 



(5.22a) 
(5.22b) 

(5.23) 



which is a polynomial of degree r, it must be the case that G,-{x) satisfies the 
differential equation 

d r+1 f 1 d r G r (x) 



dx r+1 



w{x) dx r 







(5.24) 



for x ∈ D. Recalling D = [a, b] and allowing a → -∞, b → ∞ (so we may have
D = [a, ∞), or D = R), Eq. (5.22a) must be satisfied for any values of q_{r-1}^{(k)}(a) and
q_{r-1}^{(k)}(b) (k = 0, 1, ..., r - 1). This implies that we have the boundary conditions

    G_r^{(k)}(a) = 0,   G_r^{(k)}(b) = 0                   (5.25)

for k = 0, 1, ..., r - 1. These restrict the solution to (5.24). It can be shown [6]
that (5.24) has a nontrivial 2 solution for all r ∈ Z^+ assuming that w(x) > 0 for
all x ∈ D, and that the moments ∫_D x^k w(x) dx exist for all k ∈ Z^+. Proving this
is difficult, and so will not be considered. The expression in (5.23) may be called
a Rodrigues formula for φ_r(x) (although it should be said that this terminology
usually arises only in the context of Legendre polynomials).

The Christoffel-Darboux formulas arise in various settings. For example, they 
are relevant to the problem of designing Savitzky-Golay smoothing digital filters 
[3], and they arise in numerical integration methods [2, 4]. Another way in which 
the Christoffel-Darboux formula of (5.11) can make an appearance is as follows. 

The trivial solution for a differential equation is the identically zero function. 




We may wish to approximate f(x) ∈ L²(D) as

    f(x) ≈ sum_{j=0}^n a_j φ_j(x),                         (5.26)

so the residual is

    r_n(x) = f(x) - sum_{j=0}^n a_j φ_j(x).                (5.27)

Adopting the now familiar least-squares approach, we select a_j to minimize the
functional V(a) = ||r_n||² = (r_n, r_n):

    V(a) = ∫_D w(x) r_n²(x) dx = ∫_D w(x) [ f(x) - sum_{j=0}^n a_j φ_j(x) ]² dx,   (5.28)

where a = [a_0 a_1 ... a_n]^T ∈ R^{n+1}. Of course, this expands to become

    V(a) = ∫_D w(x) f²(x) dx - 2 sum_{j=0}^n a_j ∫_D w(x) f(x) φ_j(x) dx
           + sum_{j=0}^n sum_{k=0}^n a_j a_k ∫_D w(x) φ_j(x) φ_k(x) dx,

so if g = [g_0 g_1 ... g_n]^T with g_j = ∫_D w(x) f(x) φ_j(x) dx, and R = [r_ij] ∈
R^{(n+1)×(n+1)} with r_ij = ∫_D w(x) φ_i(x) φ_j(x) dx, and ρ = ∫_D w(x) f²(x) dx, then

    V(a) = ρ - 2 a^T g + a^T R a.                          (5.29)

But r_ij = (φ_i, φ_j) = δ_{i-j} [via (5.4)], so we have R = I (identity matrix).
Immediately, one of the advantages of working with orthogonal polynomials is that R
is perfectly conditioned (contrast this with the Hilbert matrix of Chapter 4). This
in itself is a powerful incentive to consider working with orthogonal polynomial
expansions. Another nice consequence is that the optimal solution â satisfies

    â = g,                                                 (5.30)

that is,

    â_j = ∫_D w(x) f(x) φ_j(x) dx = (f, φ_j),              (5.31)

where j ∈ Z_{n+1}. If a = â in (5.27), then we have the optimal residual

    r̂_n(x) = f(x) - sum_{j=0}^n â_j φ_j(x).                (5.32)
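
Since R = I, fitting by orthogonal polynomials reduces to evaluating the inner
products (5.31), i.e., a handful of one-dimensional integrals. The MATLAB sketch
below makes this concrete for the special case w(x) = 1 on D = [-1, 1], using the
first three orthonormal Legendre polynomials as the φ_j; the choice of f and all
names here are illustrative assumptions, not part of the text.

   f   = @(x) exp(x);
   phi = { @(x) sqrt(1/2)*ones(size(x)), ...          % phi_0
           @(x) sqrt(3/2)*x, ...                      % phi_1
           @(x) sqrt(5/8)*(3*x.^2 - 1) };             % phi_2
   a = zeros(length(phi),1);
   for j = 1:length(phi)
      a(j) = integral(@(x) f(x).*phi{j}(x), -1, 1);   % a_j = (f, phi_j), cf. (5.31)
   end
   fhat = @(x) a(1)*phi{1}(x) + a(2)*phi{2}(x) + a(3)*phi{3}(x);   % degree-2 fit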




We may substitute (5.31) into (5.32) to obtain

    r̂_n(x) = f(x) - sum_{j=0}^n [ ∫_D w(y) f(y) φ_j(y) dy ] φ_j(x)
           = f(x) - ∫_D f(y) w(y) [ sum_{j=0}^n φ_j(x) φ_j(y) ] dy.              (5.33)

For convenience we define the kernel function

    K_n(x, y) = sum_{j=0}^n φ_j(x) φ_j(y),                                       (5.34)

so that (5.33) becomes

    r̂_n(x) = f(x) - ∫_D w(y) f(y) K_n(x, y) dy.                                 (5.35)

Clearly, K_n(x, y) in (5.34) has the alternative formula given by (5.11). Now
consider

    ∫_D f(x) w(y) K_n(x, y) dy = f(x) ∫_D w(y) sum_{j=0}^n φ_j(x) φ_j(y) dy
        = f(x) sum_{j=0}^n φ_j(x) ∫_D w(y) φ_j(y) dy
        = f(x) sum_{j=0}^n φ_j(x) (1, φ_j)
        = f(x) φ_0(x) / φ_{0,0} = f(x)

because φ_0(x) = φ_{0,0} for x ∈ D, and (φ_0, φ_j) = δ_j. Thus, (5.35) becomes

    r̂_n(x) = ∫_D f(x) w(y) K_n(x, y) dy - ∫_D f(y) w(y) K_n(x, y) dy
           = ∫_D w(y) K_n(x, y) [f(x) - f(y)] dy.                                (5.36)



Recall the Dirichlet kernel of Chapter 3, which we now see is really just a special instance of the 
general idea considered here. 




The optimal residual (optimal error) f n (x) presumably gets smaller in some sense 
as n — > oo. Clearly, insight into how it behaves in the limit as n -> oo can be 
provided by (5.36). The behavior certainly depends on f(x), w(x), and the kernel 
K n (x,y). Intuitively, the summation expression for K n (x,y) in (5.34) is likely 
to be less convenient to work with than the alternative expression we obtain 
from (5.11). 

Some basic results on polynomial approximations are as follows. 

Theorem 5.3: Weierstrass' Approximation Theorem  Let f(x) be continuous
on the closed interval [a, b]. For any ε > 0 there exists an integer N = N(ε),
and a polynomial p_N(x) ∈ P[a, b] (deg(p_N(x)) ≤ N), such that

    |f(x) - p_N(x)| < ε

for all x ∈ [a, b].

Various proofs of this result exist in the literature (e.g., see Rice [5, p. 121] and
Isaacson and Keller [1, pp. 183-186]), but we omit them here. We see that the
convergence of p_N(x) to f(x) is uniform as N → ∞ (recall Definition 3.4 in
Chapter 3). Theorem 5.3 states that any function continuous on an interval may be
approximated with a polynomial to arbitrary accuracy. Of course, a large degree
polynomial may be needed to achieve a particular level of accuracy depending upon
f(x). We also remark that Weierstrass' theorem is an existence theorem. It claims
the existence of a polynomial that uniformly approximates a continuous function
on [a, b], but it does not tell us how to find the polynomial. Some information
about the convergence behavior of a least-squares approximation is provided by
the following theorem.

Theorem 5.4: Let D = [a, b], w(x) = 1 for all x ∈ D. Let f(x) be continuous
on D, and let

    q_n(x) = sum_{j=0}^n â_j φ_j(x)

(least-squares polynomial approximation to f(x) on D), so â_j = (f, φ_j) =
∫_a^b f(x) φ_j(x) dx. Then

    lim_{n→∞} V(â) = lim_{n→∞} ∫_a^b [f(x) - q_n(x)]² dx = 0,

and we have Parseval's equality

    ∫_a^b f²(x) dx = sum_{j=0}^∞ â_j².                     (5.37)




Proof  We use proof by contradiction. Assume that lim_{n→∞} V(â) = δ > 0.
Pick any ε > 0 such that ε² = δ/(2(b - a)). By Theorem 5.3 (Weierstrass' theorem)
there is a polynomial p_m(x) such that |f(x) - p_m(x)| < ε for x ∈ D. Thus

    ∫_a^b [f(x) - p_m(x)]² dx ≤ ε² [b - a] = δ/2.

Now via (5.29) and (5.30),

    V(â) = ρ - â^T â,

that is,

    V(â) = ∫_a^b f²(x) dx - sum_{j=0}^n â_j²,

and V(â) ≥ 0 for all â ∈ R^{n+1}. So we have Bessel's inequality

    V(â) = ∫_a^b f²(x) dx - sum_{j=0}^n â_j² ≥ 0,          (5.38)

and V(â) must be a nonincreasing function of n. Hence the least-squares
approximation of degree m, say q_m(x), satisfies

    δ/2 ≥ ∫_a^b [f(x) - q_m(x)]² dx ≥ δ,

which is a contradiction unless δ = 0. Since δ = 0, we must have (5.37).

We observe that Parseval's equality of (5.37) relates the energy of f(x) to
the energy of the coefficients â_j. The convergence behavior described by
Theorem 5.4 is sometimes referred to as convergence in the mean [1, pp. 197-198].
This theorem does not say what happens if f(x) is not continuous on [a, b]. The
result (5.36) holds regardless of whether f(x) is continuous, and it is a potentially
more powerful result for this reason. It turns out that the convergence of
the least-squares polynomial sequence (q_n(x)) to f(x) is pointwise in general but,
depending on φ_j(x) and f(x), the convergence can sometimes be uniform. For uniform
convergence, f(x) must be sufficiently smooth. The pointwise convergence
of the orthogonal polynomial series when f(x) has a discontinuity implies that the
Gibbs phenomenon can be expected. (Recall this phenomenon in the context of the
Fourier series expansion as seen in Section 3.4.)

We have seen polynomial approximation to functions in Chapter 3. There we saw 
that the Taylor formula is a polynomial approximation to a function with a number 
of derivatives equal to the degree of the Taylor polynomial. This approximation 
technique is obviously limited to functions that are sufficiently differentiable. But 
our present polynomial approximation methodology has no such limitation. 




We remark that Theorem 5.4 suggests (but does not prove) that if f(x) ∈
L²[a, b], then

    f(x) = sum_{j=0}^∞ (f, φ_j) φ_j(x),                    (5.39)

where

    (f, φ_j) = ∫_a^b f(x) φ_j(x) dx.                       (5.40)

This has the basic form of the Fourier series expansion that was first seen in
Chapter 1. For this reason (5.39) is sometimes called a generalized Fourier series
expansion, although in the most general form of this idea the orthogonal functions
are not necessarily always polynomials. The idea of a generalized Fourier series
expansion can be extended to domains such as D = [0, ∞), or D = R, and to any
weighting functions that lead to solutions to (5.24).

The next three sections consider the Chebyshev, Hermite, and Legendre polynomials
as examples of how to apply the core theory of this section.



5.3 CHEBYSHEV POLYNOMIALS

Suppose that D = [a, b] and recall (5.28),

    V(a) = ∫_a^b w(x) r_n²(x) dx,                          (5.41)

which is the (weighted) energy of the approximation error (residual) in (5.27):

    r_n(x) = f(x) - sum_{j=0}^n a_j φ_j(x).                (5.42)

The weighting function w(x) is often selected to give more or less weight to errors
in different places on the interval [a, b]. This is intended to achieve a degree of
control over error behavior. If w(x) = c > 0 for x ∈ [a, b], then equal importance
is given to errors across the interval. This choice (with c = 1) gives rise to the
Legendre polynomials, and will be considered later. If we wish to give more weight
to errors at the ends of the interval, then a popular instance of this is for D = [-1, 1]
with weighting function

    w(x) = 1 / sqrt(1 - x²).                               (5.43)

This choice leads to the famous Chebyshev polynomials of the first kind. The reader
will most likely see these applied to problems in analog and digital filter design in
subsequent courses. For now, we concentrate on their basic theory.




The following lemma and ideas expressed in it are pivotal in understanding the 
Chebyshev polynomials and how they are constructed. 

Lemma 5.1: If n \in \mathbf{Z}^+, then

\cos n\theta = \sum_{k=0}^{n} \beta_{n,k} \cos^k \theta    (5.44)

for suitable \beta_{n,k} \in \mathbf{R}.

Proof  First of all recall the trigonometric identities

\cos(m+1)\theta = \cos m\theta \cos\theta - \sin m\theta \sin\theta,
\cos(m-1)\theta = \cos m\theta \cos\theta + \sin m\theta \sin\theta

(which follow from the more basic identity \cos(a + b) = \cos a \cos b - \sin a \sin b). From the sum of these identities

\cos(m+1)\theta = 2\cos m\theta \cos\theta - \cos(m-1)\theta.    (5.45)

For n = 1 in (5.44), we have \beta_{1,0} = 0, \beta_{1,1} = 1, and for n = 0, we have \beta_{0,0} = 1. These will form initial conditions in a recursion that we now derive using mathematical induction.

Assume that (5.44) is valid both for n = m and for n = m - 1, and so

\cos m\theta = \sum_{k=0}^{m} \beta_{m,k} \cos^k \theta,
\cos(m-1)\theta = \sum_{k=0}^{m-1} \beta_{m-1,k} \cos^k \theta,

and so via (5.45)

\cos(m+1)\theta = 2\cos\theta \sum_{k=0}^{m} \beta_{m,k} \cos^k \theta - \sum_{k=0}^{m-1} \beta_{m-1,k} \cos^k \theta
               = 2\sum_{k=0}^{m} \beta_{m,k} \cos^{k+1} \theta - \sum_{k=0}^{m-1} \beta_{m-1,k} \cos^k \theta
               = 2\sum_{r=1}^{m+1} \beta_{m,r-1} \cos^r \theta - \sum_{r=0}^{m-1} \beta_{m-1,r} \cos^r \theta






               = \sum_{k=0}^{m+1} [2\beta_{m,k-1} - \beta_{m-1,k}] \cos^k \theta \qquad (\beta_{m,k} = 0 \text{ for } k < 0 \text{ and } k > m)
               = \sum_{k=0}^{m+1} \beta_{m+1,k} \cos^k \theta,

which is to say that

\beta_{m+1,k} = 2\beta_{m,k-1} - \beta_{m-1,k}.    (5.46)

This is the desired three-term recursion for the coefficients in (5.44). As a consequence of this result, Eq. (5.44) is valid for n = m + 1 and thus is valid for all n \ge 0 by mathematical induction.

This lemma states that \cos n\theta may be expressed as a polynomial of degree n in \cos\theta. Equation (5.46) along with the initial conditions

\beta_{0,0} = 1, \quad \beta_{1,0} = 0, \quad \beta_{1,1} = 1    (5.47)

tells us how to find the polynomial coefficients. For example, from (5.46)

\beta_{2,k} = 2\beta_{1,k-1} - \beta_{0,k}

for k = 0, 1, 2. [In general, we evaluate (5.46) for k = 0, 1, \ldots, m + 1.] Therefore

\beta_{2,0} = 2\beta_{1,-1} - \beta_{0,0} = -1, \quad \beta_{2,1} = 2\beta_{1,0} - \beta_{0,1} = 0, \quad \beta_{2,2} = 2\beta_{1,1} - \beta_{0,2} = 2,

which implies that \cos 2\theta = -1 + 2\cos^2\theta. This is certainly true as it is a well-known trigonometric identity.

Lemma 5.1 possesses a converse. Any polynomial in \cos\theta of degree n can be expressed as a linear combination of members from the set \{\cos k\theta \mid k = 0, 1, \ldots, n\}.

We now have enough information to derive the Chebyshev polynomials. Recalling (5.19) we need a polynomial \phi_r(x) of degree r such that

\int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} \phi_r(x) q_{r-1}(x)\,dx = 0,    (5.48)

where q_{r-1}(x) is an arbitrary polynomial such that \deg(q_{r-1}(x)) \le r - 1. Let us change variables according to

x = \cos\theta.    (5.49)






Then dx = -\sin\theta\,d\theta, and x \in [-1, 1] maps to \theta \in [0, \pi]. Thus

\int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} \phi_r(x) q_{r-1}(x)\,dx = \int_{0}^{\pi} \frac{1}{\sqrt{1 - \cos^2\theta}} \phi_r(\cos\theta) q_{r-1}(\cos\theta) \sin\theta\,d\theta
= \int_{0}^{\pi} \phi_r(\cos\theta) q_{r-1}(\cos\theta)\,d\theta = 0.

Because of Lemma 5.1 (and the above-mentioned converse to it), this holds if and only if

\int_{0}^{\pi} \phi_r(\cos\theta) \cos k\theta\,d\theta = 0    (5.50)

for k = 0, 1, \ldots, r - 1. Consider \phi_r(\cos\theta) = C_r \cos r\theta; then

C_r \int_{0}^{\pi} \cos r\theta \cos k\theta\,d\theta = \frac{C_r}{2} \int_{0}^{\pi} [\cos(r + k)\theta + \cos(r - k)\theta]\,d\theta    (5.51)
= \frac{C_r}{2} \left[ \frac{1}{r + k}\sin(r + k)\theta + \frac{1}{r - k}\sin(r - k)\theta \right]_{0}^{\pi} = 0

for k = 0, 1, \ldots, r - 1. Thus, we may indeed choose

\phi_r(x) = C_r \cos[r \cos^{-1} x].    (5.52)

Constant C_r is selected to normalize the polynomial according to user requirements. Perhaps the most common choice is simply to set C_r = 1 for all r \in \mathbf{Z}^+. In this case we set T_r(x) = \phi_r(x):



T_r(x) = \cos[r \cos^{-1} x].    (5.53)

These are the Chebyshev polynomials of the first kind. By construction, if r \ne k, then

(T_r, T_k) = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} T_r(x) T_k(x)\,dx = 0.    (5.54)

Consider r = k:

\|T_r\|^2 = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} T_r^2(x)\,dx = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} \cos^2[r \cos^{-1} x]\,dx,

and apply (5.49). Thus

\|T_r\|^2 = \int_{0}^{\pi} \cos^2 r\theta\,d\theta = \begin{cases} \pi, & r = 0, \\ \pi/2, & r > 0. \end{cases}    (5.55)




We have claimed that \phi_r(x) in (5.52), and hence T_r(x) in (5.53), are polynomials. This is not immediately obvious, given that the expressions are in terms of trigonometric functions. We will now confirm that T_r(x) is indeed a polynomial in x for all r \in \mathbf{Z}^+. Our approach follows the proof of Lemma 5.1. First, we observe that

T_0(x) = \cos(0 \cdot \cos^{-1} x) = 1, \quad T_1(x) = \cos(1 \cdot \cos^{-1} x) = x.    (5.56)

Clearly, these are polynomials in x. Once again (see the proof of Lemma 5.1), with \theta = \cos^{-1} x,

T_{n+1}(x) = \cos(n+1)\theta = 2\cos n\theta \cos\theta - \cos(n-1)\theta = 2x T_n(x) - T_{n-1}(x),

that is,

T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)    (5.57)

for n \in \mathbf{N}. The initial conditions are expressed in (5.56). We see that this is a special case of (5.5) (Theorem 5.1). It is immediately clear from (5.57) and (5.56) that T_n(x) is a polynomial in x for all n \in \mathbf{Z}^+.

Some plots of Chebyshev polynomials of the first kind appear in Fig. 5.1. We 
remark that on the interval x € [— 1, 1] the "ripples" of the polynomials are of 
the same height. This fact makes this class of polynomials quite useful in certain 
uniform approximation problems. It is one of the main properties that makes this 
class of polynomials useful in filter design. Some indication of how the Chebyshev 
polynomials relate to uniform approximation problems appears in Section 5.7. 
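The recursion (5.57) is also a convenient way to evaluate the Chebyshev polynomials numerically. The following MATLAB sketch (MATLAB is the language used for the computer exercises in this book; the grid size and variable names here are our own arbitrary choices) evaluates T_0(x) through T_5(x) on a grid and plots degrees 2 to 5, which should reproduce the curves of Fig. 5.1.

N = 5;
x = linspace(-1, 1, 401);
T = zeros(N+1, numel(x));        % row n+1 holds samples of T_n(x)
T(1,:) = ones(size(x));          % T_0(x) = 1
T(2,:) = x;                      % T_1(x) = x
for n = 1:N-1
    T(n+2,:) = 2*x.*T(n+1,:) - T(n,:);   % T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)
end
plot(x, T(3:6,:));               % degrees 2, 3, 4, 5 as in Fig. 5.1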

Example 5.1  This example is about how to program the recursion specified in (5.57). Define

T_n(x) = \sum_{j=0}^{n} T_{n,j} x^j,

so that T_{n,j} is the coefficient of x^j. This is the same kind of notation as in (5.3) for \phi_n(x). In this setting it is quite important to note that T_{n,j} = 0 for j < 0, and for j > n. From (5.57), we obtain

\sum_{j=0}^{n+1} T_{n+1,j} x^j = 2x \sum_{j=0}^{n} T_{n,j} x^j - \sum_{j=0}^{n-1} T_{n-1,j} x^j,

and so

\sum_{j=0}^{n+1} T_{n+1,j} x^j = 2 \sum_{j=0}^{n} T_{n,j} x^{j+1} - \sum_{j=0}^{n-1} T_{n-1,j} x^j.



[Figure 5.1  Chebyshev polynomials of the first kind of degrees 2, 3, 4, 5.]

Changing the variable in the second summation according to k = j + 1 (so j = k - 1) yields

\sum_{k=0}^{n+1} T_{n+1,k} x^k = 2 \sum_{k=1}^{n+1} T_{n,k-1} x^k - \sum_{k=0}^{n-1} T_{n-1,k} x^k,

and modifying the limits on the summations on the right-hand side yields

\sum_{k=0}^{n+1} T_{n+1,k} x^k = 2 \sum_{k=0}^{n+1} T_{n,k-1} x^k - \sum_{k=0}^{n+1} T_{n-1,k} x^k.

We emphasize that this is permitted because we recall that T_{n,-1} = T_{n-1,n} = T_{n-1,n+1} = 0 for all n \ge 1. Comparing like powers of x gives us the recursion

T_{n+1,k} = 2T_{n,k-1} - T_{n-1,k}

for k = 0, 1, \ldots, n + 1 with n = 1, 2, 3, \ldots. Since T_0(x) = 1, we have T_{0,0} = 1, and since T_1(x) = x, we have T_{1,0} = 0, T_{1,1} = 1, which are the initial conditions for the recursion. Therefore, a pseudocode program to compute T_n(x) is

T_{0,0} := 1;
T_{1,0} := 0; T_{1,1} := 1;
for n := 1 to N - 1 do begin
    for k := 0 to n + 1 do begin
        T_{n+1,k} := 2*T_{n,k-1} - T_{n-1,k};
    end;
end;



This computes T_2(x), \ldots, T_N(x) (so we need N \ge 2).



We remark that the recursion in Example 5.1 may be implemented using integer arithmetic, and so there will be no rounding or quantization errors involved in computing T_n(x). However, there is the risk of machine overflow in the computation.
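The pseudocode above translates directly into MATLAB. The following is a sketch of one possible realization (our own; note that MATLAB arrays are 1-based, so the entry T(n+1, k+1) stores the coefficient T_{n,k}).

N = 8;                          % highest degree to generate
T = zeros(N+1, N+1);            % row n+1 holds the coefficients of T_n(x)
T(1,1) = 1;                     % T_0(x) = 1
T(2,2) = 1;                     % T_1(x) = x
for n = 1:N-1
    for k = 0:n+1
        tnk1 = 0;               % T_{n,k-1}, which is zero when k - 1 < 0
        if k >= 1
            tnk1 = T(n+1, k);
        end
        T(n+2, k+1) = 2*tnk1 - T(n, k+1);   % T_{n+1,k} = 2 T_{n,k-1} - T_{n-1,k}
    end
end
disp(T)     % e.g., row 3 is [-1 0 2 0 ...], i.e., T_2(x) = 2x^2 - 1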

Example 5.2  This example is about changing representations. Specifically, how might we express

p_n(x) = \sum_{j=0}^{n} p_{n,j} x^j \in \mathbf{P}^n[-1, 1]

in terms of the Chebyshev polynomials of the first kind? In other words, we wish to determine the series coefficients in

p_n(x) = \sum_{j=0}^{n} \alpha_j T_j(x).

We will consider this problem only for n = 4.

Therefore, we begin by noting that [via (5.56) and (5.57)] T_0(x) = 1, T_1(x) = x, T_2(x) = 2x^2 - 1, T_3(x) = 4x^3 - 3x, and T_4(x) = 8x^4 - 8x^2 + 1. Consequently

\sum_{j=0}^{4} \alpha_j T_j(x) = \alpha_0 + \alpha_1 x + 2\alpha_2 x^2 - \alpha_2 + 4\alpha_3 x^3 - 3\alpha_3 x + 8\alpha_4 x^4 - 8\alpha_4 x^2 + \alpha_4
= (\alpha_0 - \alpha_2 + \alpha_4) + (\alpha_1 - 3\alpha_3) x + (2\alpha_2 - 8\alpha_4) x^2 + 4\alpha_3 x^3 + 8\alpha_4 x^4,

implying that (on comparing like powers of x)

p_{4,0} = \alpha_0 - \alpha_2 + \alpha_4, \quad p_{4,1} = \alpha_1 - 3\alpha_3, \quad p_{4,2} = 2\alpha_2 - 8\alpha_4, \quad p_{4,3} = 4\alpha_3, \quad p_{4,4} = 8\alpha_4.

This may be more conveniently expressed in matrix form:



\begin{bmatrix} 1 & 0 & -1 & 0 & 1 \\ 0 & 1 & 0 & -3 & 0 \\ 0 & 0 & 2 & 0 & -8 \\ 0 & 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 0 & 8 \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \end{bmatrix} = \begin{bmatrix} p_{4,0} \\ p_{4,1} \\ p_{4,2} \\ p_{4,3} \\ p_{4,4} \end{bmatrix}.

The upper triangular system U\alpha = p is certainly easy to solve using the backward substitution algorithm presented in Chapter 4 (see Section 4.5). We note that the elements of matrix U are the coefficients T_{n,j} as might be obtained from the algorithm in Example 5.1.

Can you guess, on the basis of Example 5.2, what matrix U will be for any n?
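As a sketch of how this change of representation might be programmed, the following MATLAB fragment builds U column by column with the same recursion as Example 5.1 (column k+1 holds the power-series coefficients of T_k(x)) and then solves U\alpha = p. The polynomial p here is a hypothetical example of our own choosing; MATLAB's backslash operator applied to an upper triangular matrix amounts to back substitution.

n = 4;
U = zeros(n+1, n+1);
U(1,1) = 1;                      % column 1: T_0(x) = 1
if n >= 1, U(2,2) = 1; end       % column 2: T_1(x) = x
for m = 1:n-1                    % build the T_{m+1} column from T_m and T_{m-1}
    U(2:m+2, m+2) = 2*U(1:m+1, m+1);        % 2x * T_m(x)
    U(1:m, m+2)   = U(1:m, m+2) - U(1:m, m);% minus T_{m-1}(x)
end
p = [1; -2; 0; 3; 4];            % hypothetical p_4(x) = 1 - 2x + 3x^3 + 4x^4
alpha = U \ p;                   % alpha(k+1) is the coefficient of T_k(x)
% For this p, alpha = [2.5; 0.25; 2; 0.75; 0.5].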



5.4 HERMITE POLYNOMIALS 

Now let us consider D = \mathbf{R} with weighting function

w(x) = e^{-\alpha^2 x^2}.    (5.58)

This is essentially the Gaussian pulse from Chapter 3. Recalling (5.23), we have

\phi_r(x) = e^{\alpha^2 x^2} \frac{d^r G_r(x)}{dx^r},    (5.59)

where G_r(x) satisfies the differential equation

\frac{d^{r+1}}{dx^{r+1}} \left[ e^{\alpha^2 x^2} \frac{d^r G_r(x)}{dx^r} \right] = 0    (5.60)

[recall (5.24)]. From (5.25), G_r(x) and G_r^{(k)}(x) (for k = 1, 2, \ldots, r - 1) must tend to zero as x \to \pm\infty. We may consider the trial solution

G_r(x) = C_r e^{-\alpha^2 x^2}.    (5.61)

The kth derivative of this is of the form

C_r e^{-\alpha^2 x^2} \times (\text{polynomial of degree } k),

so (5.61) satisfies both (5.60) and the required boundary conditions. Therefore

\phi_r(x) = C_r e^{\alpha^2 x^2} \frac{d^r}{dx^r}\left( e^{-\alpha^2 x^2} \right).    (5.62)

It is common practice to define the Hermite polynomials to be \phi_r(x) for C_r = (-1)^r with either \alpha^2 = 1 or \alpha^2 = \frac{1}{2}. We shall select \alpha^2 = 1, and so our Hermite polynomials are

H_r(x) = (-1)^r e^{x^2} \frac{d^r}{dx^r}\left( e^{-x^2} \right).    (5.63)

By construction, for k \ne r,

\int_{-\infty}^{\infty} e^{-\alpha^2 x^2} \phi_r(x) \phi_k(x)\,dx = 0.    (5.64)

For the case where k = r, the following result is helpful.




It must be the case that

\phi_r(x) = \sum_{k=0}^{r} \phi_{r,k} x^k,

so that

\|\phi_r\|^2 = \int_D w(x) \phi_r^2(x)\,dx = \int_D w(x) \phi_r(x) \left[ \sum_{k=0}^{r} \phi_{r,k} x^k \right] dx,

but \int_D w(x) \phi_r(x) x^i\,dx = 0 for i = 0, 1, \ldots, r - 1 [special case of (5.19)], and so

\|\phi_r\|^2 = \phi_{r,r} \int_D w(x) \phi_r(x) x^r\,dx = \phi_{r,r} \int_D x^r G_r^{(r)}(x)\,dx    (5.65)

[via (5.20)]. We may integrate (5.65) by parts r times, and apply (5.25) to obtain

\|\phi_r\|^2 = (-1)^r r!\, \phi_{r,r} \int_D G_r(x)\,dx.    (5.66)

So, for our present problem with k = r, we obtain

\|\phi_r\|^2 = \int_{-\infty}^{\infty} e^{-\alpha^2 x^2} \phi_r^2(x)\,dx = (-1)^r r!\, \phi_{r,r} C_r \int_{-\infty}^{\infty} e^{-\alpha^2 x^2}\,dx.    (5.67)

Now (if y = \alpha x with \alpha > 0)

\int_{-\infty}^{\infty} e^{-\alpha^2 x^2}\,dx = 2\int_{0}^{\infty} e^{-\alpha^2 x^2}\,dx = \frac{2}{\alpha}\int_{0}^{\infty} e^{-y^2}\,dy = \frac{1}{\alpha}\sqrt{\pi}    (5.68)

via (3.103) in Chapter 3. Consequently, (5.67) becomes

\|\phi_r\|^2 = (-1)^r r!\, \phi_{r,r} C_r \frac{\sqrt{\pi}}{\alpha}.    (5.69)

With C_r = (-1)^r and \alpha = 1, we recall that H_r(x) = \phi_r(x), so (5.69) becomes

\|H_r\|^2 = r!\, H_{r,r} \sqrt{\pi}.    (5.70)



We need an expression for H_{r,r}. We know that for suitable p_{n,k}

\frac{d^n}{dx^n} e^{-x^2} = \left[ \sum_{k=0}^{n} p_{n,k} x^k \right] e^{-x^2} = p_n(x)\, e^{-x^2}.    (5.71)






Now consider 



d n+l 



dx n+l 



dx 



-2x 



-k=Q 
n 

.k=0 
n+\ 



k X 



kX 



^2, kPn ' kX 



k-\ 



\-k=\ 



"2 ^2 Pn,k-\ + ^(k + l)Pn,k+l 

k=0 k=0 

1 
= ^2 V-?-Pn,k-\ + (k+ l)Pn,k+l] X k e 



x k e-* 2 



-Pn+l,k 



k=Q 

so we have the recurrence relation 

Pn+l,k — -2Pn,k-l + (k + l)p n ,k+l- 

From (5.71) and (5.63) H n (x) = (-1)" p n (x). From (5.72) 

Pn + l,n + l = -2pn,n + (« + 2)/?„, n+ 2 = -2p n ,n, 

and so 

fln+l.n+l — ( — 1) Pn+l,n+\ — — (— 1) ( — 2Pn,n) — 2//,,,„. 



(5.72) 



(5.73) 



Because Hq(x) — 1 [via (5.63)], //o,o = 1, and immediately we have //„„ = 2" 
[solution to the difference equation (5.73)]. Therefore 



(5.74) 



\\H,Y = 2 r rW7T 



thus, f^ e- x2 H?(x) dx = 2 r r ly^T. 

A three-term recurrence relation for H_n(x) is needed. Define the generating function

S(x, t) = \exp[-t^2 + 2xt] = \exp[x^2 - (t - x)^2].    (5.75)

Observe that

\frac{\partial^n}{\partial t^n} S(x, t) = \exp[x^2] \frac{\partial^n}{\partial t^n} \exp[-(t - x)^2] = (-1)^n \exp[x^2] \frac{\partial^n}{\partial x^n} \exp[-(t - x)^2].    (5.76)

Because of the second equality in (5.76), we have

S^{(n)}(x, 0) = \frac{\partial^n}{\partial t^n} S(x, t)\Big|_{t=0} = (-1)^n \exp[x^2] \frac{d^n}{dx^n} \exp[-x^2] = H_n(x).    (5.77)

The Maclaurin expansion of S(x, t) about t = 0 is [recall (3.75)]

S(x, t) = \sum_{n=0}^{\infty} \frac{S^{(n)}(x, 0)}{n!} t^n,

so via (5.77), this becomes

S(x, t) = \sum_{n=0}^{\infty} \frac{t^n}{n!} H_n(x).    (5.78)

Now, from (5.75) and (5.78), we have

\frac{\partial S}{\partial x} = 2t\, e^{-t^2 + 2xt} = \sum_{n=0}^{\infty} \frac{2 t^{n+1}}{n!} H_n(x),    (5.79a)

and also

\frac{\partial S}{\partial x} = \sum_{n=0}^{\infty} \frac{t^n}{n!} \frac{dH_n(x)}{dx} = \sum_{n=-1}^{\infty} \frac{t^{n+1}}{(n+1)!} \frac{dH_{n+1}(x)}{dx}.    (5.79b)

Comparing the like powers of t in (5.79) yields

\frac{1}{(n+1)!} \frac{dH_{n+1}(x)}{dx} = \frac{2}{n!} H_n(x),

which implies that

\frac{dH_{n+1}(x)}{dx} = 2(n+1) H_n(x)    (5.80)

for n \in \mathbf{Z}^+. We may also consider

\frac{\partial S}{\partial t} = (-2t + 2x)\, e^{-t^2 + 2xt} = \sum_{n=0}^{\infty} \frac{(-2t + 2x)}{n!} t^n H_n(x)    (5.81a)

and

\frac{\partial S}{\partial t} = \sum_{n=1}^{\infty} \frac{t^{n-1}}{(n-1)!} H_n(x) = \sum_{n=0}^{\infty} \frac{t^n}{n!} H_{n+1}(x).    (5.81b)

From (5.81a)

\sum_{n=0}^{\infty} \frac{(-2t + 2x)}{n!} t^n H_n(x) = -2 \sum_{n=0}^{\infty} \frac{t^n}{(n-1)!} H_{n-1}(x) + 2x \sum_{n=0}^{\infty} \frac{t^n}{n!} H_n(x).    (5.82)



[Figure 5.2  Hermite polynomials of degrees 2, 3, 4, 5.]



Comparing like powers of t in (5.81b) and (5.82) yields

\frac{1}{n!} H_{n+1}(x) = -\frac{2}{(n-1)!} H_{n-1}(x) + 2x \frac{1}{n!} H_n(x),

or

H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x).    (5.83)

This holds for all n \in \mathbf{N}. Equation (5.83) is another special case of (5.5).

The Hermite polynomials are relevant to quantum mechanics (quantum harmonic 
oscillator problem), and they arise in signal processing as well (often in connection 
with the uncertainty principle for signals). A few Hermite polynomials are plotted 
in Fig. 5.2. 
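The norm result (5.74) and the recursion (5.83) are easy to check numerically. The following MATLAB sketch (our own; the choice r = 3 is arbitrary) uses H_3(x) = 8x^3 - 12x, obtained from (5.83) with H_0(x) = 1 and H_1(x) = 2x, and compares the weighted integral against 2^r r! sqrt(pi).

r = 3;
Hr = @(x) 8*x.^3 - 12*x;                        % H_3(x) from the recursion (5.83)
lhs = integral(@(x) exp(-x.^2).*Hr(x).^2, -Inf, Inf);
rhs = 2^r * factorial(r) * sqrt(pi);            % both are approximately 85.08
disp([lhs, rhs])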



5.5 LEGENDRE POLYNOMIALS 

We consider D = [-1, 1] with uniform weighting function

w(x) = 1 \quad \text{for all } x \in D.    (5.84)

In this case (5.24) becomes

\frac{d^{r+1}}{dx^{r+1}}\left[ \frac{d^r G_r(x)}{dx^r} \right] = \frac{d^{2r+1}}{dx^{2r+1}} G_r(x) = 0.    (5.85)

The boundary conditions of (5.25) become

G_r^{(k)}(\pm 1) = 0    (5.86)

for k \in \mathbf{Z}_r. Consequently, the solution to (5.85) is given by

G_r(x) = C_r (x^2 - 1)^r.    (5.87)

Thus, from (5.23),

\phi_r(x) = C_r \frac{d^r}{dx^r}(x^2 - 1)^r.    (5.88)

The Legendre polynomials use C_r = \frac{1}{2^r r!}, and are denoted by

P_r(x) = \frac{1}{2^r r!} \frac{d^r}{dx^r}(x^2 - 1)^r,    (5.89)



which is the Rodrigues formula for P_r(x). By construction, for k \ne r, we must have

\int_{-1}^{1} P_r(x) P_k(x)\,dx = 0.    (5.90)

From (5.66) and (5.87), we obtain

\|P_r\|^2 = \frac{(-1)^r}{2^r} P_{r,r} \int_{-1}^{1} (x^2 - 1)^r\,dx.    (5.91)

Recalling the binomial theorem

(a + b)^r = \sum_{k=0}^{r} \binom{r}{k} a^k b^{r-k},    (5.92)






we see that

\frac{d^r}{dx^r}(x^2 - 1)^r = \frac{d^r}{dx^r}\left[ \sum_{k=0}^{r} \binom{r}{k} (-1)^{r-k} x^{2k} \right]
= \frac{d^r}{dx^r} x^{2r} + \frac{d^r}{dx^r}\left[ \sum_{k=0}^{r-1} \binom{r}{k} (-1)^{r-k} x^{2k} \right]
= 2r(2r-1)\cdots(r+1)\, x^r + \frac{d^r}{dx^r}\left[ \sum_{k=0}^{r-1} \binom{r}{k} (-1)^{r-k} x^{2k} \right],

implying that [recall (5.89)]

P_{r,r} = \frac{2r(2r-1)\cdots(r+1)}{2^r r!} = \frac{(2r)!}{2^r [r!]^2}.    (5.93)



But we need to evaluate the integral in (5.91) as well. With the change of variable x = \sin\theta, we have

I_r = \int_{-1}^{1} (x^2 - 1)^r\,dx = (-1)^r \int_{-\pi/2}^{\pi/2} \cos^{2r+1}\theta\,d\theta.    (5.94)



We may integrate by parts:

\int_{-\pi/2}^{\pi/2} \cos^{2r+1}\theta\,d\theta = \int_{-\pi/2}^{\pi/2} \cos\theta \cos^{2r}\theta\,d\theta
= \left[ \cos^{2r}\theta \sin\theta \right]_{-\pi/2}^{\pi/2} + 2r \int_{-\pi/2}^{\pi/2} \cos^{2r-1}\theta \sin^2\theta\,d\theta

(in \int u\,dv = uv - \int v\,du, we let u = \cos^{2r}\theta, and dv = \cos\theta\,d\theta), which becomes (on using the identity \sin^2\theta = 1 - \cos^2\theta)

\int_{-\pi/2}^{\pi/2} \cos^{2r+1}\theta\,d\theta = 2r \int_{-\pi/2}^{\pi/2} \cos^{2r-1}\theta\,d\theta - 2r \int_{-\pi/2}^{\pi/2} \cos^{2r+1}\theta\,d\theta

for r \ge 1. Therefore

(2r + 1)\int_{-\pi/2}^{\pi/2} \cos^{2r+1}\theta\,d\theta = 2r \int_{-\pi/2}^{\pi/2} \cos^{2r-1}\theta\,d\theta.    (5.95)



Now, since [via (5.94)]

I_{r-1} = (-1)^{r-1} \int_{-\pi/2}^{\pi/2} \cos^{2r-1}\theta\,d\theta,




Eq. (5.95) becomes

(2r + 1)(-1)^r I_r = -2r(-1)^r I_{r-1},

or more simply

I_r = -\frac{2r}{2r + 1} I_{r-1}.    (5.96)

This holds for r \ge 1 with initial condition I_0 = 2. The solution to the difference equation is

I_r = (-1)^r \frac{2^{2r+1} [r!]^2}{(2r + 1)!}.    (5.97)

This can be confirmed by direct substitution of (5.97) into (5.96). Consequently, if we combine (5.91), (5.93), and (5.97), then

\|P_r\|^2 = \frac{(-1)^r}{2^r} \cdot \frac{(2r)!}{2^r [r!]^2} \cdot \frac{(-1)^r 2^{2r+1} [r!]^2}{(2r + 1)!},

which simplifies to

\|P_r\|^2 = \frac{2}{2r + 1}.    (5.98)



A closed-form expression for P_n(x) is possible using (5.92) in (5.89). Specifically, consider

P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n}(x^2 - 1)^n = \frac{1}{2^n n!} \frac{d^n}{dx^n}\left[ \sum_{k=0}^{M} \binom{n}{k} (-1)^k x^{2n-2k} \right],    (5.99)

where M = n/2 (n even), or M = (n - 1)/2 (n odd); terms with k > M in the binomial expansion do not survive the n-fold differentiation. We observe that

\frac{d^n}{dx^n} x^{2n-2k} = (2n - 2k)(2n - 2k - 1) \cdots (n - 2k + 1)\, x^{n-2k}.    (5.100)

Now

(2n - 2k)! = (2n - 2k)(2n - 2k - 1) \cdots (n - 2k + 1) \cdot (n - 2k)!,

so (5.100) becomes

\frac{d^n}{dx^n} x^{2n-2k} = \frac{(2n - 2k)!}{(n - 2k)!} x^{n-2k}.    (5.101)

Thus, (5.99) reduces to

P_n(x) = \frac{1}{2^n n!} \sum_{k=0}^{M} \binom{n}{k} (-1)^k \frac{(2n - 2k)!}{(n - 2k)!} x^{n-2k},

or alternatively

P_n(x) = \sum_{k=0}^{M} (-1)^k \frac{(2n - 2k)!}{2^n\, k!\, (n - k)!\, (n - 2k)!} x^{n-2k}.    (5.102)
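The closed form (5.102) is easy to evaluate directly. The following MATLAB sketch (our own illustration; the choice n = 5 is arbitrary) computes the power-series coefficients of P_n(x) from (5.102).

n = 5;
M = floor(n/2);
c = zeros(1, n+1);                   % c(j+1) is the coefficient of x^j in P_n(x)
for k = 0:M
    j = n - 2*k;
    c(j+1) = (-1)^k * factorial(2*n - 2*k) / ...
             (2^n * factorial(k) * factorial(n-k) * factorial(n - 2*k));
end
disp(c)
% For n = 5 this gives c = [0 15/8 0 -70/8 0 63/8],
% i.e., P_5(x) = (63x^5 - 70x^3 + 15x)/8, which matches the list in Section 5.6.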

Consider f(x) = \frac{1}{\sqrt{1 - x}}, so that f^{(k)}(x) = \frac{(2k-1)(2k-3)\cdots 3 \cdot 1}{2^k}(1 - x)^{-(2k+1)/2} (for k \ge 1). Define (2k - 1)!! = (2k - 1)(2k - 3) \cdots 3 \cdot 1, and define (-1)!! = 1. As usual, define 0! = 1. We note that (2n)! = 2^n n! (2n - 1)!!, which may be seen by considering

(2n)! = (2n)(2n - 1)(2n - 2)(2n - 3) \cdots 3 \cdot 2 \cdot 1
      = [(2n)(2n - 2) \cdots 4 \cdot 2][(2n - 1)(2n - 3) \cdots 3 \cdot 1] = 2^n n! (2n - 1)!!.    (5.103)

Consequently, the Maclaurin expansion for f(x) is given by

\frac{1}{\sqrt{1 - x}} = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!} x^k = \sum_{k=0}^{\infty} \frac{(2k)!}{2^{2k} [k!]^2} x^k.    (5.104)

The ratio test [7, p. 709] confirms that this series converges if |x| < 1. Using (5.102) and (5.104), it is possible to show that

S(x, t) = \frac{1}{\sqrt{1 - 2xt + t^2}} = \sum_{n=0}^{\infty} P_n(x)\, t^n,    (5.105)

so this is the generating function for the Legendre polynomials P_n(x). We observe that

\frac{\partial S}{\partial t} = \frac{x - t}{[1 - 2xt + t^2]^{3/2}} = \frac{x - t}{1 - 2xt + t^2} S.    (5.106)

Also

\frac{\partial S}{\partial t} = \sum_{n=0}^{\infty} n P_n(x)\, t^{n-1}.    (5.107)

Equating (5.107) and (5.106), we have

(x - t) \sum_{n=0}^{\infty} P_n(x)\, t^n = (1 - 2xt + t^2) \sum_{n=0}^{\infty} n P_n(x)\, t^{n-1},

which becomes

x \sum_{n=0}^{\infty} P_n(x)\, t^n - \sum_{n=0}^{\infty} P_n(x)\, t^{n+1} = \sum_{n=0}^{\infty} n P_n(x)\, t^{n-1} - 2x \sum_{n=0}^{\infty} n P_n(x)\, t^n + \sum_{n=0}^{\infty} n P_n(x)\, t^{n+1},




and if P_{-1}(x) = 0, then this becomes

x \sum_{n=0}^{\infty} P_n(x)\, t^n - \sum_{n=0}^{\infty} P_{n-1}(x)\, t^n = \sum_{n=0}^{\infty} (n + 1) P_{n+1}(x)\, t^n - 2x \sum_{n=0}^{\infty} n P_n(x)\, t^n + \sum_{n=0}^{\infty} (n - 1) P_{n-1}(x)\, t^n,

so on comparing like powers of t in this expression, we obtain

x P_n(x) - P_{n-1}(x) = (n + 1) P_{n+1}(x) - 2xn P_n(x) + (n - 1) P_{n-1}(x),

which finally yields

(n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x),

or (for n \ge 1)

P_{n+1}(x) = \frac{2n + 1}{n + 1} x P_n(x) - \frac{n}{n + 1} P_{n-1}(x),    (5.108)

which is the three-term recurrence relation for the Legendre polynomials, and hence is yet another special case of (5.5).



[Figure 5.3  Legendre polynomials of degrees 2, 3, 4, 5.]

The Legendre polynomials arise in potential problems in electromagnetics. Mod- 
eling the scattering of electromagnetic radiation by particles involves working with 
these polynomials. Legendre polynomials appear in quantum mechanics as part 
of the solution to Schrodinger's equation for the hydrogen atom. Some Legendre 
polynomials appear in Fig. 5.3. 

5.6 AN EXAMPLE OF ORTHOGONAL POLYNOMIAL 
LEAST-SQUARES APPROXIMATION 

Chebyshev polynomials of the first kind, and Legendre polynomials are both orthogonal on [-1, 1], but their weighting functions are different. We shall illustrate the approximation behavior of these polynomials through an example wherein we wish to approximate

f(x) = \begin{cases} 0, & -1 \le x < 0 \\ 1, & 0 \le x \le 1 \end{cases}    (5.109)

by both of these polynomial types. We will work with fifth-degree least-squares polynomial approximations in both cases.

We consider Legendre polynomial approximation first. We must therefore normalize the polynomials so that our basis functions have unit norm. Thus, our approximation will be

\hat{f}(x) = \sum_{k=0}^{5} a_k \phi_k(x),    (5.110)

where

\phi_k(x) = \frac{1}{\|P_k\|} P_k(x)

and

a_k = (f, \phi_k) = \int_{-1}^{1} f(x) \frac{1}{\|P_k\|} P_k(x)\,dx,

for which we see that

\hat{f}(x) = \sum_{k=0}^{5} \left[ \frac{\int_{-1}^{1} f(x) P_k(x)\,dx}{\|P_k\|^2} \right] P_k(x).    (5.111)

We have [e.g., via (5.102)]

P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = \frac{1}{2}[3x^2 - 1],
P_3(x) = \frac{1}{2}[5x^3 - 3x], \quad P_4(x) = \frac{1}{8}[35x^4 - 30x^2 + 3],
P_5(x) = \frac{1}{8}[63x^5 - 70x^3 + 15x].




The squared norms [via (5.98)] are

\|P_0\|^2 = 2, \quad \|P_1\|^2 = \frac{2}{3}, \quad \|P_2\|^2 = \frac{2}{5}, \quad \|P_3\|^2 = \frac{2}{7}, \quad \|P_4\|^2 = \frac{2}{9}, \quad \|P_5\|^2 = \frac{2}{11}.

By direct calculation, \int_0^1 P_k(x)\,dx becomes

\int_0^1 P_0(x)\,dx = \int_0^1 1\,dx = 1, \quad \int_0^1 P_1(x)\,dx = \int_0^1 x\,dx = \frac{1}{2},
\int_0^1 P_2(x)\,dx = \frac{1}{2}\int_0^1 [3x^2 - 1]\,dx = 0,
\int_0^1 P_3(x)\,dx = \frac{1}{2}\int_0^1 [5x^3 - 3x]\,dx = \frac{1}{2}\left[ \frac{5}{4} - \frac{3}{2} \right] = -\frac{1}{8},
\int_0^1 P_4(x)\,dx = \frac{1}{8}\int_0^1 [35x^4 - 30x^2 + 3]\,dx = \frac{1}{8}[7 - 10 + 3] = 0,
\int_0^1 P_5(x)\,dx = \frac{1}{8}\int_0^1 [63x^5 - 70x^3 + 15x]\,dx = \frac{1}{8}\left[ \frac{63}{6} - \frac{70}{4} + \frac{15}{2} \right] = \frac{1}{16}.


The substitution of these (and the squared norms \|P_k\|^2) into (5.111) yields the least-squares Legendre polynomial approximation

\hat{f}(x) = \frac{1}{2} P_0(x) + \frac{3}{4} P_1(x) - \frac{7}{16} P_3(x) + \frac{11}{32} P_5(x).    (5.112)

We observe that \hat{f}(0) = \frac{1}{2}, \hat{f}(1) = \frac{37}{32} = 1.15625, and \hat{f}(-1) = -\frac{5}{32} = -0.15625.
Now we consider the Chebyshev polynomial approximation. In this case

\hat{f}(x) = \sum_{k=0}^{5} b_k \phi_k(x),    (5.113)

where

\phi_k(x) = \frac{1}{\|T_k\|} T_k(x)

and

b_k = (f, \phi_k) = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} f(x) \frac{T_k(x)}{\|T_k\|}\,dx,






from which we see that

\hat{f}(x) = \sum_{k=0}^{5} \left[ \frac{\int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} f(x) T_k(x)\,dx}{\|T_k\|^2} \right] T_k(x).    (5.114)

We have [e.g., via (5.57)] the polynomials

T_0(x) = 1, \quad T_1(x) = x, \quad T_2(x) = 2x^2 - 1,
T_3(x) = 4x^3 - 3x, \quad T_4(x) = 8x^4 - 8x^2 + 1,
T_5(x) = 16x^5 - 20x^3 + 5x.

The squared norms [via (5.55)] are given by

\|T_0\|^2 = \pi, \quad \|T_k\|^2 = \frac{\pi}{2} \quad (k \ge 1).

By direct calculation, \beta_k = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} f(x) T_k(x)\,dx becomes [using x = \cos\theta, and T_k(\cos\theta) = \cos(k\theta); recall (5.53)] \beta_k = \int_0^{\pi/2} \cos(k\theta)\,d\theta, and hence



\beta_0 = \int_0^{\pi/2} 1\,d\theta = \frac{\pi}{2},
\beta_1 = \int_0^{\pi/2} \cos\theta\,d\theta = [\sin\theta]_0^{\pi/2} = 1,
\beta_2 = \int_0^{\pi/2} \cos(2\theta)\,d\theta = \left[ \tfrac{1}{2}\sin(2\theta) \right]_0^{\pi/2} = 0,
\beta_3 = \int_0^{\pi/2} \cos(3\theta)\,d\theta = \left[ \tfrac{1}{3}\sin(3\theta) \right]_0^{\pi/2} = -\frac{1}{3},
\beta_4 = \int_0^{\pi/2} \cos(4\theta)\,d\theta = \left[ \tfrac{1}{4}\sin(4\theta) \right]_0^{\pi/2} = 0,
\beta_5 = \int_0^{\pi/2} \cos(5\theta)\,d\theta = \left[ \tfrac{1}{5}\sin(5\theta) \right]_0^{\pi/2} = \frac{1}{5}.



Substituting these (and the squared norms \|T_k\|^2) into (5.114) yields the least-squares Chebyshev polynomial approximation

\hat{f}(x) = \frac{1}{2} T_0(x) + \frac{2}{\pi}\left[ T_1(x) - \frac{1}{3} T_3(x) + \frac{1}{5} T_5(x) \right].    (5.115)

We observe that \hat{f}(0) = \frac{1}{2}, \hat{f}(1) = 1.051737, and \hat{f}(-1) = -0.051737.



[Figure 5.4  Plots of f(x) from (5.109), the Legendre approximation (5.112), and the Chebyshev approximation (5.115). The least-squares approximations are fifth-degree polynomials in both cases.]



Plots of f(x) from (5.109) and the approximations in (5.112) and (5.115) appear in Fig. 5.4. Both polynomial approximations are fifth-degree polynomials, and yet they look fairly different. This is so because the weighting functions are different. The Chebyshev approximation is better than the Legendre approximation near x = ±1 because greater weight is given to errors near the ends of the interval [-1, 1].
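The two approximations of this section are easy to recompute numerically. The following MATLAB sketch (our own; the grid, variable names, and use of MATLAB's integral function are arbitrary implementation choices) evaluates the coefficient integrals of (5.111) and (5.114) for the f(x) of (5.109) and plots both fifth-degree approximations against f(x), which should reproduce Fig. 5.4.

f = @(x) double(x >= 0);                         % f(x) of (5.109) on [-1, 1]
xx = linspace(-0.999, 0.999, 400);
P = {@(x) ones(size(x)), @(x) x, @(x) (3*x.^2 - 1)/2, @(x) (5*x.^3 - 3*x)/2, ...
     @(x) (35*x.^4 - 30*x.^2 + 3)/8, @(x) (63*x.^5 - 70*x.^3 + 15*x)/8};
fL = zeros(size(xx));                            % Legendre approximation (5.111)
for k = 0:5
    ck = (2*k + 1)/2 * integral(P{k+1}, 0, 1);   % int f P_k dx / ||P_k||^2
    fL = fL + ck*P{k+1}(xx);
end
fC = zeros(size(xx));                            % Chebyshev approximation (5.114)
for k = 0:5
    g = @(th) f(cos(th)).*cos(k*th);             % substitute x = cos(theta)
    ck = (2 - (k == 0))/pi * integral(g, 0, pi); % beta_k / ||T_k||^2
    fC = fC + ck*cos(k*acos(xx));                % T_k(x) = cos(k acos x)
end
plot(xx, f(xx), xx, fL, xx, fC);                 % compare with Fig. 5.4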

5.7 UNIFORM APPROXIMATION 



The subject of uniform approximation is distinctly more complicated than that of 
least-squares approximation. So, we will not devote too much space to it in this 
book. We will concentrate on a relatively simple illustration of the main idea. 

Recall the space C[a, b] from Chapter 1. We recall that it is a normed space with the norm

\|x\| = \sup_{t \in [a, b]} |x(t)|    (5.116)

for any x(t) \in C[a, b]. In fact, further recalling Chapter 3, this space happens to be a complete metric space for which the metric induced by (5.116) is

d(x, y) = \sup_{t \in [a, b]} |x(t) - y(t)|,    (5.117)




where x(t), y(t) \in C[a, b]. In uniform approximation we (for example) may wish to approximate f(t) \in C[a, b] with x(t) \in \mathbf{P}^n[a, b] \subset C[a, b] such that \|f - x\| is minimized. The norm (5.116) is sometimes called the Chebyshev norm, and so our problem is sometimes called the Chebyshev approximation problem. The error e(t) = f(t) - x(t) has the norm

\|e\| = \sup_{t \in [a, b]} |f(t) - x(t)|    (5.118)

and is the maximum deviation between f(t) and x(t) on [a, b]. We wish to find x(t) to minimize this. Consequently, uniform approximation is sometimes also called minimax approximation as we wish to minimize the maximum deviation between f(t) and its approximation x(t). We remark that because f(t) is continuous (by definition) it will have a well-defined maximum on [a, b] (although this maximum need not be at a unique location). We are therefore at liberty to replace "sup" by "max" in (5.116), (5.117), and (5.118) if we wish.

Suppose that y_j(t) \in C[a, b] for j = 0, 1, \ldots, n - 1 and the set of functions \{y_j(t) \mid j \in \mathbf{Z}_n\} is an independent set. This set generates an n-dimensional subspace^4 of C[a, b] that may be denoted by Y = \left\{ \sum_{j=0}^{n-1} \alpha_j y_j(t) \mid \alpha_j \in \mathbf{R} \right\}. From Kreyszig [8, p. 337] consider

Definition 5.1: Haar Condition  Subspace Y \subset C[a, b] satisfies the Haar condition if every y \in Y (y \ne 0) has at most n - 1 zeros in [a, b], where n = \dim(Y) (dimension of the subspace Y).

We may select y_j(t) = t^j for j \in \mathbf{Z}_n, so any y(t) \in Y has the form y(t) = \sum_{j=0}^{n-1} a_j t^j and thus is a polynomial of degree at most n - 1. A degree n - 1 polynomial has n - 1 zeros, and so such a subspace Y satisfies the Haar condition.

Definition 5.2: Alternating Set  Let x \in C[a, b], and y \in Y, where Y is any subspace of C[a, b]. A set of points t_0, \ldots, t_k in [a, b] such that t_0 < t_1 < \cdots < t_k is called an alternating set for x - y if x(t_j) - y(t_j) has alternately the values +\|x - y\| and -\|x - y\| at consecutive points t_j.

Thus, suppose x(t_j) - y(t_j) = +\|x - y\|; then x(t_{j+1}) - y(t_{j+1}) = -\|x - y\|, but if instead x(t_j) - y(t_j) = -\|x - y\|, then x(t_{j+1}) - y(t_{j+1}) = +\|x - y\|. The norm is, of course, that in (5.116).

Lemma 5.2: Best Approximation  Let Y be any subspace of C[a, b] that satisfies the Haar condition. Given f \in C[a, b], let y \in Y be such that for f - y there exists an alternating set of n + 1 points, where n = \dim(Y); then y is the best uniform approximation of f out of Y.

The proof is omitted, but may be found in Kreyszig [8, pp. 345-346].

^4 A subspace of a vector space X is a nonempty subset Y of X such that for any y_1, y_2 \in Y, and all scalars a, b from the field of the vector space, we have ay_1 + by_2 \in Y.

Consider the particular case of C[-1, 1] with f(t) \in C[-1, 1] such that for a given n \in \mathbf{N}

f(t) = t^n.    (5.119)

Now Y = \left\{ \sum_{j=0}^{n-1} a_j t^j \mid a_j \in \mathbf{R} \right\}, so y_j(t) = t^j for j \in \mathbf{Z}_n. We wish to select a_j such that the error e = f - y,

e(t) = f(t) - \sum_{j=0}^{n-1} a_j t^j,    (5.120)

is minimized with respect to the Chebyshev norm (5.116); that is, select a_j to minimize \|e\|. Clearly, \dim(Y) = n. According to Lemma 5.2, \|e\| is minimized if e(t) in (5.120) has an alternating set of n + 1 points.

Recall Lemma 5.1 (Section 5.3), which stated [see (5.44)] that

\cos n\theta = \sum_{k=0}^{n} \beta_{n,k} \cos^k \theta.    (5.121)

From (5.46), \beta_{n+1,n+1} = 2\beta_{n,n}, and \beta_{0,0} = 1. Consequently, \beta_{n,n} = 2^{n-1} (for n \ge 1). Thus, (5.121) can be rewritten as

\cos n\theta = 2^{n-1} \cos^n \theta + \sum_{j=0}^{n-1} \beta_{n,j} \cos^j \theta    (5.122)

(n \ge 1). Suppose that t = \cos\theta, so \theta \in [0, \pi] maps to t \in [-1, 1], and from (5.122)

\cos[n \cos^{-1} t] = 2^{n-1} t^n + \sum_{j=0}^{n-1} \beta_{n,j} t^j.    (5.123)

We observe that \cos n\theta has an alternating set of n + 1 points \theta_0, \theta_1, \ldots, \theta_n on [0, \pi] for which \cos n\theta_k = \pm 1 (clearly \|\cos n\theta\| = 1). For example, if n = 1, then

\theta_0 = 0, \quad \theta_1 = \pi.

If n = 2, then

\theta_0 = 0, \quad \theta_1 = \pi/2, \quad \theta_2 = \pi,

and if n = 3, then

\theta_0 = 0, \quad \theta_1 = \frac{\pi}{3}, \quad \theta_2 = \frac{2\pi}{3}, \quad \theta_3 = \pi.

In general, \theta_k = \frac{k}{n}\pi for k = 0, 1, \ldots, n. Thus, if t_k = \cos\theta_k, then t_k = \cos\left( \frac{k}{n}\pi \right). We may rewrite (5.123) as

\frac{1}{2^{n-1}} \cos[n \cos^{-1} t] = t^n + \frac{1}{2^{n-1}} \sum_{j=0}^{n-1} \beta_{n,j} t^j.    (5.124)

This is identical to e(t) in (5.120) if we set \beta_{n,j} = -2^{n-1} a_j. In other words, if we choose

e(t) = \frac{1}{2^{n-1}} \cos[n \cos^{-1} t],    (5.125)

then \|e\| is minimized since we know that e has an alternating set t = t_k = \cos\left( \frac{k}{n}\pi \right), k = 0, 1, \ldots, n, and t_k \in [-1, 1]. We recall from Section 5.3 that T_k(t) = \cos[k \cos^{-1} t] is the kth-degree Chebyshev polynomial of the first kind [see (5.53)]. So, e(t) = T_n(t)/2^{n-1}. Knowing this, we may readily determine the optimal coefficients a_j in (5.120) if we so desire.

Thus, the Chebyshev polynomials of the first kind determine the best degree n - 1 polynomial uniform approximation to f(t) = t^n on the interval t \in [-1, 1].
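To make the result concrete, the following MATLAB sketch (our own worked case, with n = 4 chosen arbitrarily) plots t^4 and the minimax error e(t) = T_4(t)/2^3, and forms the best uniform cubic approximation p_3(t) = t^4 - e(t). Since T_4(t) = 8t^4 - 8t^2 + 1, this gives p_3(t) = t^2 - 1/8, and the error never exceeds 1/8 in magnitude on [-1, 1].

n = 4;
t = linspace(-1, 1, 401);
e = cos(n*acos(t)) / 2^(n-1);    % minimax error, ||e|| = 1/2^(n-1)
p = t.^n - e;                    % best degree n-1 uniform approximation to t^n
plot(t, t.^n, t, p);             % p stays within 1/8 of t^4 everywhere on [-1, 1]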



REFERENCES 

1. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966.

2. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York, 1974.

3. J. S. Lim and A. V. Oppenheim, eds., Advanced Topics in Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1988.

4. P. J. Davis and P. Rabinowitz, Numerical Integration, Blaisdell, Waltham, MA, 1967.

5. J. R. Rice, The Approximation of Functions, Vol. I: Linear Theory, Addison-Wesley, Reading, MA, 1964.

6. G. Szego, Orthogonal Polynomials, 3rd ed., American Mathematical Society, 1967.

7. L. Bers, Calculus: Preliminary Edition, Vol. 2, Holt, Rinehart, Winston, New York, 1967.

8. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 1978.



PROBLEMS 

5.1. Suppose that \{\phi_k(x) \mid k \in \mathbf{Z}_N\} are orthogonal polynomials on the interval D = [a, b] \subset \mathbf{R} with respect to some weighting function w(x) \ge 0. Show that

\sum_{k=0}^{N-1} a_k \phi_k(x) = 0

holds for all x \in [a, b] iff a_k = 0 for all k \in \mathbf{Z}_N [i.e., prove that \{\phi_k(x) \mid k \in \mathbf{Z}_N\} is an independent set].

5.2. Verify (5.11) by direct calculation for \phi_k(x) = T_k(x)/\|T_k\| with n = 2; that is, find \phi_0(x), \phi_1(x), \phi_2(x), and \phi_3(x), and verify that the left- and right-hand sides of (5.11) are equal to each other.




5.3. Suppose that \{\phi_n(x) \mid n \in \mathbf{Z}^+\} are orthogonal polynomials on the interval [a, b] with respect to some weighting function w(x) \ge 0. Prove the following theorem: The roots x_j (j = 1, 2, \ldots, n) of \phi_n(x) = 0 (n \in \mathbf{N}) are all real-valued, simple, and a < x_j < b for all j.

5.4. Recall that T_j(x) is the degree j Chebyshev polynomial of the first kind. Recall from Lemma 5.1 that \cos(n\theta) = \sum_{k=0}^{n} \beta_{n,k} \cos^k \theta. For x \in [-1, 1] with k \in \mathbf{Z}^+, we can find coefficients a_j such that

x^k = \sum_{j=0}^{k} a_j T_j(x).

Prove that

a_j = \frac{2 - \delta_j}{\pi} \int_0^{\pi} \cos^k\theta \cos(j\theta)\,d\theta.

Is this the best way to find the coefficients a_j? If not, specify an alternative approach.

5.5. This problem is about the converse to Lemma 5.1. Suppose that we have p(\cos\theta) = \sum_{k=0}^{3} a_{3,k} \cos^k \theta. We wish to find coefficients b_{3,j} such that

p(\cos\theta) = \sum_{j=0}^{3} b_{3,j} \cos(j\theta).

Use Lemma 5.1 to show that a = Ub, where

a = [a_{3,0}\ a_{3,1}\ a_{3,2}\ a_{3,3}]^T, \quad b = [b_{3,0}\ b_{3,1}\ b_{3,2}\ b_{3,3}]^T,

and U is an upper triangular matrix containing the coefficients \beta_{m,k} for m = 0, 1, 2, 3, and k = 0, 1, \ldots, m. Of course, we may use back-substitution to solve Ub = a for b if this were desired.

5.6. Suppose that for n \in \mathbf{Z}^+ we are given p(\cos\theta) = \sum_{k=0}^{n} a_{n,k} \cos^k \theta. Show how to find the coefficients b_{n,j} such that

p(\cos\theta) = \sum_{j=0}^{n} b_{n,j} \cos(j\theta),

that is, generalize the previous problem from n = 3 to any n.
5.7. Recall Section 5.6, where

f(x) = \begin{cases} 0, & -1 \le x < 0 \\ 1, & 0 \le x \le 1 \end{cases}

was approximated by

\hat{f}(x) = \sum_{k=0}^{n} b_k \frac{1}{\|T_k\|} T_k(x)

for n = 5. Find a general expression for b_k for all k \in \mathbf{Z}^+. Use the resulting expansion to prove that

\frac{\pi}{4} = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{2n - 1}.

5.8. Suppose that

f(x) = \begin{cases} 0, & -1 \le x < 0 \\ x, & 0 \le x \le 1 \end{cases}

then find b_k in

f(x) = \sum_{k=0}^{\infty} b_k \frac{1}{\|T_k\|} T_k(x).



5.9. Do the following:

(a) Solve the polynomial equation T_k(x) = 0 for all k > 0, and so find all the zeros of the polynomials T_k(x).

(b) Show that T_n(x) satisfies the differential equation

(1 - x^2) T_n^{(2)}(x) - x T_n^{(1)}(x) + n^2 T_n(x) = 0.

Recall that T_n^{(r)}(x) = d^r T_n(x)/dx^r.

5.10. The three-term recurrence relation for the Chebyshev polynomials of the second kind is

U_{r+1}(x) = 2x U_r(x) - U_{r-1}(x),    (5.P.1)

where r \ge 1, and where the initial conditions are

U_0(x) = 1, \quad U_1(x) = 2x.    (5.P.2)

We remark that (5.P.1) is identical to (5.57). Specifically, the recursion for Chebyshev polynomials of both kinds is the same, except that the initial conditions in (5.P.2) are not the same for both. [From (5.56), T_0(x) = 1, but T_1(x) = x.] Since \deg(U_r) = r, we have (for r \ge 0)

U_r(x) = \sum_{j=0}^{r} U_{r,j} x^j.    (5.P.3)

[Recall the notation for \phi_r(x) in (5.3).] From the polynomial recursion in (5.P.1) we may obtain

U_{r+1,j} = 2 U_{r,j-1} - U_{r-1,j}.    (5.P.4a)

This expression holds for r \ge 1, and for j = 0, 1, \ldots, r, r + 1. From (5.P.2) the initial conditions for (5.P.4a) are

U_{0,0} = 1, \quad U_{1,0} = 0, \quad U_{1,1} = 2.    (5.P.4b)

Write a MATLAB function that uses (5.P.4) to generate the Chebyshev polynomials of the second kind for r = 0, 1, \ldots, N. Test your program out for N = 8. Program output should be in the form of a table that is written to a file. The tabular format should be something like

degree | coefficient of x^0  x^1  x^2  x^3  ...
   0   |        1
   1   |        0             2
   2   |       -1             0    4
   3   |        0            -4    0    8

etc.



5.11. The Chebyshev polynomials of the second kind use

D = [-1, 1], \quad w(x) = \sqrt{1 - x^2}.    (5.P.5)

Denote these polynomials as the set \{\phi_n(x) \mid n \in \mathbf{Z}^+\}. (This notation applies if the polynomials are normalized to possess unity-valued norms.) Of course, in our function space, we use the inner product

(f, g) = \int_{-1}^{1} \sqrt{1 - x^2}\, f(x) g(x)\,dx.    (5.P.6)

Derive the polynomials \{\phi_n(x) \mid n \in \mathbf{Z}^+\}. (Hint: The process is much the same as the derivation of the Chebyshev polynomials of the first kind presented in Section 5.3.) Therefore, begin by considering q_{r-1}(x), which is any polynomial of degree not more than r - 1. Thus, for suitable c_j \in \mathbf{R}, we must have q_{r-1}(x) = \sum_{j=0}^{r-1} c_j \phi_j(x), and (\phi_r, q_{r-1}) = \sum_{j=0}^{r-1} c_j (\phi_r, \phi_j) = 0 because (\phi_r, \phi_j) = 0 for j = 0, 1, \ldots, r - 1. In expanded form

(\phi_r, q_{r-1}) = \int_{-1}^{1} \sqrt{1 - x^2}\, \phi_r(x) q_{r-1}(x)\,dx = 0.    (5.P.7)

Use the change of variable x = \cos\theta (so dx = -\sin\theta\,d\theta) to reduce (5.P.7) to

\int_0^{\pi} \sin^2\theta \cos(k\theta) \phi_r(\cos\theta)\,d\theta = 0    (5.P.8)

for k = 0, 1, 2, \ldots, r - 1, where Lemma 5.1 and its converse have also been employed. Next consider the candidate

\phi_r(\cos\theta) = C_r \frac{\sin(r + 1)\theta}{\sin\theta},    (5.P.9)

and verify that this satisfies (5.P.8). [Hence (5.P.9) satisfies (5.P.7).] Show that (5.P.9) becomes

\phi_r(x) = C_r \frac{\sin[(r + 1)\cos^{-1} x]}{\sqrt{1 - x^2}}.

Prove that for C_r = 1 (all r \in \mathbf{Z}^+)

\phi_{r+1}(x) = 2x \phi_r(x) - \phi_{r-1}(x).

In this case we normally use the notation U_r(x) = \phi_r(x). Verify that U_0(x) = 1, and that U_1(x) = 2x. Prove that \|U_n\|^2 = \frac{\pi}{2} for n \in \mathbf{Z}^+. Finally, of course, \phi_n(x) = U_n(x)/\|U_n\|.

5.12. Write a MATLAB function to produce plots of Uk(x) for k — 2, 3, 4, 5 similar 
to Fig. 5.1. 

5.13. For \phi_k(x) = U_k(x)/\|U_k\|, we have

\phi_k(x) = \sqrt{\frac{2}{\pi}}\, U_k(x).

Since (\phi_k, \phi_j) = \delta_{k-j}, for any f(x) \in L^2[-1, 1], we have the series expansion

f(x) = \sum_{k=0}^{\infty} a_k \phi_k(x),

where

a_k = (f, \phi_k) = \frac{1}{\|U_k\|} \int_{-1}^{1} \sqrt{1 - x^2}\, f(x) U_k(x)\,dx.

Suppose that we work with the following function:

f(x) = \begin{cases} 0, & -1 \le x < 0 \\ 1, & 0 \le x \le 1 \end{cases}

(a) Find a nice general formula for the elements of the sequence (a_k) (k \in \mathbf{Z}^+).

(b) Use MATLAB to plot the approximation

f_2(x) = \sum_{k=0}^{5} a_k \phi_k(x)

on the same graph as that of f(x) (i.e., create a plot similar to that of Fig. 5.4). Suppose that \psi_k(x) = T_k(x)/\|T_k\|; then another approximation to f(x) is given by

f_1(x) = \sum_{k=0}^{5} b_k \psi_k(x).

Plot f_1(x) on the same graph as f_2(x) and f(x).

(c) Compare the accuracy of the approximations f_1(x) and f_2(x) to f(x) near the endpoints x = ±1. Which approximation is better near these endpoints? Explain why if you can.

5.14. Prove the following:

(a) T_n(x) and T_{n-1}(x) have no zeros in common.

(b) Between any two neighboring zeros of T_n(x), there is precisely one zero of T_{n-1}(x). This is called the interleaving of zeros property.

(Comment: The interleaving of zeros property is possessed by all orthogonal polynomials. When this property is combined with ideas from later chapters, it can be used to provide algorithms to find the zeros of orthogonal polynomials in general.)



5.15. (a) Show that we can write

T_n(x) = \cos[n \cos^{-1} x] = \cosh[n \cosh^{-1} x].

[Hint: Note that \cos x = \frac{1}{2}(e^{jx} + e^{-jx}), \cosh x = \frac{1}{2}(e^x + e^{-x}), so that \cos x = \cosh(jx).]

(b) Prove that T_{2n}(x) = T_n(2x^2 - 1). [Hint: \cos(2x) = 2\cos^2 x - 1.]
5.16. Use Eq. (5.63) in the following problems:

(a) Show that

\frac{d}{dx} H_r(x) = 2r H_{r-1}(x).

(b) Show that

\frac{d}{dx}\left[ e^{-x^2} \frac{d}{dx} H_r(x) \right] = -2r e^{-x^2} H_r(x).

(c) From the preceding confirm that H_r(x) satisfies the Hermite differential equation

H_r^{(2)}(x) - 2x H_r^{(1)}(x) + 2r H_r(x) = 0.

5.17. Find H_k(x) for k = 0, 1, 2, 3, 4, 5 (i.e., find the first six Hermite polynomials) using (5.83).

5.18. Using (5.63), prove that

H_n(x) = n! \sum_{j=0}^{N} (-1)^j \frac{2^{n-2j}}{j!\,(n - 2j)!} x^{n-2j},

where N = n/2 (n is even), N = (n - 1)/2 (n is odd).

5.19. Suppose that P_k(x) is the Legendre polynomial of degree k. Recall Eq. (5.90). Find constants \alpha and \beta such that for Q_k(x) = P_k(\alpha x + \beta) we have (for k \ne r)

\int_a^b Q_r(x) Q_k(x)\,dx = 0.

[Comment: This linear transformation of variable allows us to least-squares approximate f(x) using Legendre polynomial series on any interval [a, b].]

5.20. Recall Eq. (5.105):

\frac{1}{\sqrt{1 - 2xt + t^2}} = \sum_{n=0}^{\infty} P_n(x)\, t^n.

Verify the terms n = 0, 1, 2, 3.

[Hint: Recall Eq. (3.82) from Chapter 3.]

5.21. The distance between two points A and B in \mathbf{R}^3 is r, while the distance from A to the origin O is r_1, and the distance from B to the origin O is r_2. The angle between the vector OA and vector OB is \theta.

(a) Show that

\frac{1}{r} = \frac{1}{\sqrt{r_1^2 + r_2^2 - 2 r_1 r_2 \cos\theta}}.

[Hint: Recall the law of cosines (Section 4.6).]

(b) Show that

\frac{1}{r} = \frac{1}{r_2} \sum_{n=0}^{\infty} P_n(\cos\theta) \left( \frac{r_1}{r_2} \right)^n,

where, of course, P_n(x) is the Legendre polynomial of degree n.

(Comment: This result is important in electromagnetic potential theory.)

5.22. Recall Section 5.7. Write a MATLAB function to plot on the same graph both f(t) = t^n and e(t) [in (5.125)] for each of n = 2, 3, 4, 5. You must generate four separate plots, one for each instance of n.

5.23. The set of points C = \{e^{j\theta} \mid \theta \in \mathbf{R}, j = \sqrt{-1}\} is the unit circle of the complex plane. If R(e^{j\theta}) > 0 for all \theta \in \mathbf{R}, then we may define an inner product on a suitable space of functions that are defined on C:

(F, G) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\theta}) F^*(e^{j\theta}) G(e^{j\theta})\,d\theta,    (5.P.10)

where, in general, F(e^{j\theta}), G(e^{j\theta}) \in \mathbf{C}. Since e^{j\theta} is 2\pi-periodic in \theta, the integration limits in (5.P.10) are -\pi to \pi, but another standard choice is from 0 to 2\pi. The function R(e^{j\theta}) is a weighting function for the inner product in (5.P.10). For R(e^{j\theta}), there will be a real-valued sequence (r_k) such that

R(e^{j\theta}) = \sum_{k=-\infty}^{\infty} r_k e^{-jk\theta},

and we also have r_{-k} = r_k [i.e., (r_k) is a symmetric sequence]. For F(e^{j\theta}) and G(e^{j\theta}), we have real-valued sequences (f_k) and (g_k) such that

F(e^{j\theta}) = \sum_{k=-\infty}^{\infty} f_k e^{-jk\theta}, \quad G(e^{j\theta}) = \sum_{k=-\infty}^{\infty} g_k e^{-jk\theta}.

(a) Show that (e^{-jn\theta}, e^{-jm\theta}) = r_{n-m}.

(b) Show that (e^{-jn\theta} F(e^{j\theta}), e^{-jn\theta} G(e^{j\theta})) = (F(e^{j\theta}), G(e^{j\theta})).

(c) Show that (F(e^{j\theta}), G(e^{j\theta})) = (1, F(e^{-j\theta}) G(e^{j\theta})).

[Comment: The unit circle C is of central importance in the theory of stability of linear time-invariant (LTI) discrete-time systems, and so appears in the subjects of digital control and digital signal processing.]

5.24. Recall Problem 4.21 and the previous problem (5.23). Given a_n(z) = \sum_{k=0}^{n} a_{n,k} z^{-k} (and z = e^{j\theta}), show that

(a_n(z), z^{-k}) = \xi \delta_k

for k = 0, 1, \ldots, n - 1.




[Comment: This result ultimately leads to an alternative derivation of the Levinson-Durbin algorithm, and suggests that this algorithm actually generates a sequence of orthogonal polynomials on the unit circle C. These polynomials are in the indeterminate z^{-1} (instead of x).]

5.25. It is possible to construct orthogonal polynomials on discrete sets. This problem is about a particular example of this. Suppose that

\phi_r[n] = \sum_{j=0}^{r} \phi_{r,j} n^j,

where n \in [-L, U] \subset \mathbf{Z}, U, L \ge 0, and \phi_{r,r} \ne 0, so that \deg(\phi_r) = r, and let us also assume that \phi_{r,j} \in \mathbf{R} for all r and j. We say that \{\phi_r[n] \mid r \in \mathbf{Z}^+\} is an orthogonal set if (\phi_k[n], \phi_m[n]) = \|\phi_k\|^2 \delta_{k-m}, where the inner product is defined by

(f[n], g[n]) = \sum_{n=-L}^{U} w[n] f[n] g[n],    (5.P.11)

and w[n] > 0 for all n \in [-L, U] is a weighting sequence for our inner product space. (Of course, \|f\|^2 = (f[n], f[n]).) Suppose that L = U = M; then, for w[n] = 1 (all n \in [-M, M]), it can be shown (with much effort) that the Gram polynomials are given by the three-term recurrence relation

p_{k+1}[n] = \frac{2(2k + 1)}{(k + 1)(2M - k)}\, n\, p_k[n] - \frac{k}{k + 1} \cdot \frac{2M + k + 1}{2M - k}\, p_{k-1}[n],    (5.P.12)

where p_0[n] = 1, and p_1[n] = n/M.

(a) Use (5.P.12) to find p_k[n] for k = 2, 3, 4, where M = 2.

(b) Use (5.P.12) to find p_k[n] for k = 2, 3, 4, where M = 3.

(c) Use (5.P.11) to find \|p_k\|^2 for k = 2, 3, 4, where M = 2.

(d) Use (5.P.11) to find \|p_k\|^2 for k = 2, 3, 4, where M = 3.

(Comment: The uniform weighting function w[n] = 1 makes the Gram polynomials the discrete version of the Legendre polynomials. The Gram polynomials were actually invented by Chebyshev.)

5.26. Integration by parts is clearly quite important in analysis (e.g., recall Section 3.6). Deriving the Gram polynomials (previous problem) makes use of summation by parts.

Suppose that v[n], u[n], and f[n] are defined on \mathbf{Z}. Define the forward difference operator \Delta according to

\Delta f[n] = f[n + 1] - f[n]

[for any sequence (f[n])]. Prove the expression for summation by parts, which is

\sum_{n=-L}^{U} u[n] \Delta v[n] = u[n] v[n] \Big|_{-L}^{U+1} - \sum_{n=-L}^{U} v[n + 1] \Delta u[n].

[Hint: Show that

u[n] \Delta v[n] = \Delta(u[n] v[n]) - v[n + 1] \Delta u[n],

and then consider using the identity

\sum_{n=-L}^{U} \Delta f[n] = \sum_{n=-L}^{U} (f[n + 1] - f[n]) = f[n] \Big|_{-L}^{U+1}.

Of course, f[n] \big|_A^B = f[B] - f[A].]






6 



Interpolation 



6.1 INTRODUCTION 

Suppose that we have the data \{(t_k, x(t_k)) \mid k \in \mathbf{Z}_{n+1}\}, perhaps obtained experimentally. An example of this appeared in Section 4.6. In this case we assumed that t_k = kT_s, for which x(t_k) = x(t)|_{t = kT_s} are the samples of some analog signal. In this example these time samples were of simulated (and highly oversimplified) physiological data for human patients (e.g., blood pressure, heart rate, body core temperature). Our problem involved assuming a model for the data; thus, assuming x(t) is explained by a particular mathematical function with certain unknown parameters to be estimated on the basis of the model and the data. In other words, we estimate x(t) with x(t, a), where a is the vector of unknown parameters (the model parameters to be estimated), and we chose a to minimize the error

e(t_k) = x(t_k) - x(t_k, a), \quad k \in \mathbf{Z}_{n+1},    (6.1)

according to some criterion. We have emphasized choosing a to minimize

V(a) = \sum_{k=0}^{n} e^2(t_k).    (6.2)

This was the least-squares approach. However, the idea of choosing a to minimize \max_{k \in \mathbf{Z}_{n+1}} |e(t_k)| is an alternative suggested by Section 5.7. Other choices are possible. However, no matter what choice we make, in all cases x(t_k, a) is not necessarily exactly equal to x(t_k), except perhaps by chance. The problem of finding x(t, a) to minimize e(t) in this manner is often called curve fitting. It is to be distinguished from interpolation, which may be defined as follows.

Usually we assume t_0 < t_1 < \cdots < t_{n-1} < t_n with t_0 = a, and t_n = b, so that t_k \in [a, b] \subset \mathbf{R}. To interpolate the data \{(t_k, x(t_k)) \mid k \in \mathbf{Z}_{n+1}\}, we seek a function p(t) such that t \in [a, b], and

p(t_k) = x(t_k)    (6.3)

for all k \in \mathbf{Z}_{n+1}. We might know something about the properties of x(t) for t \ne t_k on the interval [a, b], and so we might select p(t) to possess similar properties.



However, we emphasize that the interpolating function p(t) exactly matches x(t) at the given sample points t_k, k \in \mathbf{Z}_{n+1}.

Curve fitting is used when the data are uncertain because of the corrupting effects 
of measurement errors, random noise, or interference. Interpolation is appropriate 
when the data are accurately or exactly known. 

Interpolation is quite important in digital signal processing. For example, ban- 
dlimited signals may need to be interpolated in order to change sampling rates. 
Interpolation is vital in numerical integration methods, as will be seen later. In 
this application the integrand is typically known or can be readily found at some 
finite set of points with significant accuracy. Interpolation at these points leads to 
a function (usually a polynomial) that can be easily integrated, and so provides a 
useful approximation to the given integral. 

This chapter discusses interpolation with polynomials only. In principle it is possible to interpolate using other functions (rational functions, trigonometric functions, etc.). But these other approaches are usually more involved, and so will not be considered in this book.



6.2 LAGRANGE INTERPOLATION 

This chapter presents polynomial interpolation in three different forms. The first 
form, which might be called the direct form, involves obtaining the interpolat- 
ing polynomial by the direct solution of a particular linear system of equations 
(Vandermonde system). The second form is an alternative called Lagrange inter- 
polation and is considered in this section along with the direct form. The third 
form is called Newton interpolation, and is considered in the next section. All 
three approaches give the same polynomial but expressed in different mathemati- 
cal forms each possessing particular advantages and disadvantages. No one form 
is useful in all applications, and this is why we must consider them all. 

As in Section 6.1, we consider the data set \{(t_k, x(t_k)) \mid k \in \mathbf{Z}_{n+1}\}, and we will let x_k = x(t_k). We wish, as already noted, that our interpolating function be a polynomial of degree n:

p_n(t) = \sum_{j=0}^{n} p_{n,j} t^j.    (6.4)

For example, if n = 1 (linear interpolation), then

x_k = \sum_{j=0}^{1} p_{1,j} t_k^j, \quad x_{k+1} = \sum_{j=0}^{1} p_{1,j} t_{k+1}^j,

that is,



p_{1,0} + p_{1,1} t_k = x_k,
p_{1,0} + p_{1,1} t_{k+1} = x_{k+1},

or in matrix form this becomes

\begin{bmatrix} 1 & t_k \\ 1 & t_{k+1} \end{bmatrix} \begin{bmatrix} p_{1,0} \\ p_{1,1} \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k+1} \end{bmatrix}.

This has a unique solution, provided t_k \ne t_{k+1}, because in this instance the matrix has determinant \det \begin{bmatrix} 1 & t_k \\ 1 & t_{k+1} \end{bmatrix} = t_{k+1} - t_k. Polynomial p_1(t) = p_{1,0} + p_{1,1} t linearly interpolates the points (t_k, x_k) and (t_{k+1}, x_{k+1}). As another example, if n = 2 (quadratic interpolation), then we have

x_k = \sum_{j=0}^{2} p_{2,j} t_k^j, \quad x_{k+1} = \sum_{j=0}^{2} p_{2,j} t_{k+1}^j, \quad x_{k+2} = \sum_{j=0}^{2} p_{2,j} t_{k+2}^j,

or in matrix form

\begin{bmatrix} 1 & t_k & t_k^2 \\ 1 & t_{k+1} & t_{k+1}^2 \\ 1 & t_{k+2} & t_{k+2}^2 \end{bmatrix} \begin{bmatrix} p_{2,0} \\ p_{2,1} \\ p_{2,2} \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k+1} \\ x_{k+2} \end{bmatrix}.    (6.5)

The matrix in (6.5) has the determinant (t_k - t_{k+1})(t_{k+1} - t_{k+2})(t_{k+2} - t_k), which will not be zero if t_k, t_{k+1}, and t_{k+2} are all distinct. The polynomial p_2(t) = p_{2,0} + p_{2,1} t + p_{2,2} t^2 interpolates the points (t_k, x_k), (t_{k+1}, x_{k+1}), and (t_{k+2}, x_{k+2}). In general, for arbitrary n we have the linear system

\begin{bmatrix} 1 & t_k & \cdots & t_k^{n-1} & t_k^n \\ 1 & t_{k+1} & \cdots & t_{k+1}^{n-1} & t_{k+1}^n \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & t_{k+n-1} & \cdots & t_{k+n-1}^{n-1} & t_{k+n-1}^n \\ 1 & t_{k+n} & \cdots & t_{k+n}^{n-1} & t_{k+n}^n \end{bmatrix} \begin{bmatrix} p_{n,0} \\ p_{n,1} \\ \vdots \\ p_{n,n-1} \\ p_{n,n} \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k+1} \\ \vdots \\ x_{k+n-1} \\ x_{k+n} \end{bmatrix}.    (6.6)






Matrix A is called a Vandermonde matrix, and the linear system (6.6) is a Vandermonde linear system of equations. The solution to (6.6) (if it exists) gives the direct form of the interpolating polynomial stated in (6.4). For convenience we will let k = 0. If we let t = t_0, then

A = A(t) = \begin{bmatrix} 1 & t & \cdots & t^{n-1} & t^n \\ 1 & t_1 & \cdots & t_1^{n-1} & t_1^n \\ \vdots & \vdots & & \vdots & \vdots \\ 1 & t_{n-1} & \cdots & t_{n-1}^{n-1} & t_{n-1}^n \\ 1 & t_n & \cdots & t_n^{n-1} & t_n^n \end{bmatrix}.    (6.7)

Let D(t) = \det(A(t)), and we see that D(t) is a polynomial in the indeterminate t of degree n. Therefore, D(t) = 0 is an equation with exactly n roots (via the fundamental theorem of algebra). But D(t_k) = 0 for k = 1, \ldots, n since the rows of A(t) are dependent for any t = t_k. So t_1, t_2, \ldots, t_n are the only possible roots of D(t) = 0. Therefore, if t_0, t_1, \ldots, t_n are all distinct, then we must have \det(A(t_0)) \ne 0. Hence A in (6.6) will always possess an inverse if t_{k+i} \ne t_{k+j} for i \ne j.

For small values of n (e.g., n = 1, or n = 2), the direct solution of (6.6) can be a useful method of polynomial interpolation. However, it is known that Vandermonde matrices can be very ill-conditioned even for relatively small n (e.g., n > 10 or so). This is particularly likely to happen for the common case of equispaced data, i.e., the case where t_k = t_0 + hk for k = 0, 1, \ldots, n with h > 0, as mentioned in Hill [1, p. 233]. Thus, we much prefer to avoid interpolating by the direct solution of (6.6) when n \gg 2. Also, direct solution of (6.6) is computationally inefficient, unless one contemplates the use of fast algorithms for Vandermonde system solution in Golub and Van Loan [2]. These are significantly faster than Gaussian elimination as they possess asymptotic time complexities of only O(n^2) versus the O(n^3) complexity of Gaussian elimination approaches.
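For small n the direct solution is nonetheless easy to carry out. The following MATLAB sketch (our own illustration; it simply hands the Vandermonde system to MATLAB's backslash solver) solves (6.6) for the three-point data set used again in Example 6.1 below.

tk = [0; 1; 2];  xk = [1; 2; 3];
A = [tk.^0, tk.^1, tk.^2];     % Vandermonde matrix for n = 2
p = A \ xk;                    % p = [p_{2,0}; p_{2,1}; p_{2,2}] = [1; 1; 0]
% That is, p_2(t) = 1 + t, as found in Example 6.1.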

We remark that so far we have proved the existence of a polynomial of degree n that interpolates the data \{(t_k, x_k) \mid k \in \mathbf{Z}_{n+1}\}, provided t_k \ne t_j (j \ne k). The polynomial also happens to be unique, a fact readily apparent from the uniqueness of the solution to (6.6), assuming the existence condition is met. So, if we can find (by any method) any polynomial of degree \le n that interpolates the given data, then this is the only possible interpolating polynomial for the data.

Since we disdain the idea of solving (6.6) directly, we seek alternative methods to obtain p_n(t). In this regard, an often better approach to polynomial interpolation is Lagrange interpolation, which works as follows.

Again assume that we wish to interpolate the data set \{(t_k, x_k) \mid k \in \mathbf{Z}_{n+1}\}. Suppose that we possess polynomials (called Lagrange polynomials) L_j(t) with the property

L_j(t_k) = \begin{cases} 0, & j \ne k \\ 1, & j = k \end{cases} = \delta_{j-k}.    (6.8)



Then the interpolating polynomial for the data set is

p_n(t) = x_0 L_0(t) + x_1 L_1(t) + \cdots + x_n L_n(t) = \sum_{j=0}^{n} x_j L_j(t).    (6.9)

We observe that

p_n(t_k) = \sum_{j=0}^{n} x_j L_j(t_k) = \sum_{j=0}^{n} x_j \delta_{j-k} = x_k    (6.10)

for k \in \mathbf{Z}_{n+1}, so p_n(t) in (6.9) does indeed interpolate the data set, and via uniqueness, if we were to write p_n(t) in the direct form p_n(t) = \sum_{j=0}^{n} p_{n,j} t^j, then the polynomial coefficients p_{n,j} would satisfy (6.6). We may see that for j \in \mathbf{Z}_{n+1} the Lagrange polynomials are given by

L_j(t) = \prod_{\substack{i=0 \\ i \ne j}}^{n} \frac{t - t_i}{t_j - t_i}.    (6.11)

Equation (6.9) is called the Lagrange form of the interpolating polynomial.

Example 6.1  Consider the data set \{(t_0, x_0), (t_1, x_1), (t_2, x_2)\}, so n = 2. Therefore

L_0(t) = \frac{(t - t_1)(t - t_2)}{(t_0 - t_1)(t_0 - t_2)}, \quad L_1(t) = \frac{(t - t_0)(t - t_2)}{(t_1 - t_0)(t_1 - t_2)}, \quad L_2(t) = \frac{(t - t_0)(t - t_1)}{(t_2 - t_0)(t_2 - t_1)}.

It is not difficult to see that p_2(t) = x_0 L_0(t) + x_1 L_1(t) + x_2 L_2(t). For example, p_2(t_0) = x_0 L_0(t_0) + x_1 L_1(t_0) + x_2 L_2(t_0) = x_0.

Suppose that the data set has the specific values \{(0, 1), (1, 2), (2, 3)\}, and so (6.6) becomes

\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{bmatrix} \begin{bmatrix} p_{2,0} \\ p_{2,1} \\ p_{2,2} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}.

This has solution p_2(t) = t + 1 (i.e., p_{2,2} = 0). We see that the interpolating polynomial p_n(t) need not have degree exactly equal to n, but can be of lower degree. We also see that

L_0(t) = \frac{(t - 1)(t - 2)}{(0 - 1)(0 - 2)} = \frac{1}{2}(t - 1)(t - 2) = \frac{1}{2}(t^2 - 3t + 2),
L_1(t) = \frac{(t - 0)(t - 2)}{(1 - 0)(1 - 2)} = -t(t - 2) = -(t^2 - 2t),
L_2(t) = \frac{(t - 0)(t - 1)}{(2 - 0)(2 - 1)} = \frac{1}{2}t(t - 1) = \frac{1}{2}(t^2 - t).

Observe that

x_0 L_0(t) + x_1 L_1(t) + x_2 L_2(t) = 1 \cdot \frac{1}{2}(t^2 - 3t + 2) + 2 \cdot (-1)(t^2 - 2t) + 3 \cdot \frac{1}{2}(t^2 - t)
= \left( \frac{1}{2} - 2 + \frac{3}{2} \right) t^2 + \left( -\frac{3}{2} + 4 - \frac{3}{2} \right) t + 1 = t + 1,

which is p_2(t). As expected, the Lagrange form of the interpolating polynomial and the solution to the Vandermonde system are the same polynomial.
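The Lagrange form (6.9) with (6.11) is straightforward to evaluate numerically. The following MATLAB sketch is a small helper of our own (name and interface are hypothetical) that evaluates p_n(t) at a single point t for given nodes tk and samples xk.

function p = lagrange_eval(tk, xk, t)
    % Evaluate the Lagrange form (6.9) at the scalar point t.
    n = numel(tk);
    p = 0;
    for j = 1:n
        Lj = 1;
        for i = [1:j-1, j+1:n]
            Lj = Lj * (t - tk(i)) / (tk(j) - tk(i));   % factor of (6.11)
        end
        p = p + xk(j) * Lj;                            % accumulate (6.9)
    end
end

% For the data of Example 6.1, lagrange_eval([0 1 2], [1 2 3], 0.5) returns 1.5,
% which equals p_2(0.5) = 0.5 + 1.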

We remark that once p_n(t) is found, we often wish to evaluate p_n(t) for t \ne t_k (k \in \mathbf{Z}_{n+1}). Suppose that p_n(t) is known in direct form; then this should be done using Horner's rule:

p_n(t) = p_{n,0} + t[p_{n,1} + t[p_{n,2} + t[p_{n,3} + \cdots + t[p_{n,n-1} + t p_{n,n}] \cdots ]]].    (6.12)

For example, if n = 3, then

p_3(t) = p_{3,0} + t[p_{3,1} + t[p_{3,2} + t p_{3,3}]].    (6.13)

To evaluate this requires only 6 flops (3 floating-point multiplications, and 3 floating-point additions). Evaluating p_3(t) = p_{3,0} + p_{3,1} t + p_{3,2} t^2 + p_{3,3} t^3 directly needs 9 flops (6 floating-point multiplications and 3 floating-point additions). Thus, Horner's rule is more efficient from a computational standpoint. Using Horner's rule may be described as evaluating the polynomial from the inside out.
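A minimal MATLAB sketch of Horner's rule (6.12) follows; the coefficient vector and evaluation point are hypothetical choices of ours, with c(j+1) holding p_{n,j}.

c = [1, 1, 0];              % hypothetical coefficients: p_2(t) = 1 + t
t = 0.5;
v = c(end);
for j = numel(c)-1:-1:1
    v = c(j) + t*v;         % work from the inside out, as in (6.12)
end
% v now holds p(t); here v = 1.5.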

While Lagrange interpolation represents a simple way to solve (6.6), the form of the solution is (6.9), and so it requires more effort to evaluate at t \ne t_k than the direct form, the latter of which is easily evaluated using Horner's rule as noted in the previous paragraph. Another objection to Lagrange interpolation is that if we were to add elements to our data set, then the calculations for the current data set would have to be discarded, and this would force us to begin again. It is possible to overcome this inefficiency using Newton's divided-difference method, which leads to the Newton form of the interpolating polynomial. This may be seen in Hildebrand [3, Chapter 2]. We will consider this methodology in the next section.

When we evaluate p_n(t) for t \ne t_k, but with t \in [a, b] = [t_0, t_n], then this is interpolation. If we wish to evaluate p_n(t) for t < a or t > b, then this is referred to as extrapolation. To do this is highly risky. Indeed, even if we constrain t to satisfy t \in [a, b], the results can be poor. We illustrate with a famous example of Runge's, which is described in Forsythe et al. [4, pp. 69-70]. It is very possible that for f(t) \in C[a, b], we have

\lim_{n \to \infty} \sup_{t \in [a, b]} |f(t) - p_n(t)| = \infty.

Runge's specific example is for t \in [-5, 5] with

f(t) = \frac{1}{1 + t^2},

where he showed that for any t satisfying 3.64 < |t| < 5, then

\lim_{n \to \infty} \sup |f(t) - p_n(t)| = \infty.

This divergence with respect to the Chebyshev norm [recall Section 5.7 (of Chapter 5) for the definition of this term] is called Runge's phenomenon. A fairly detailed account of Runge's example appears in Isaacson and Keller [5, pp. 275-279].

We close this section with mention of the approximation error /(f) — p n (t). 
Suppose that /(f) € C n+1 [a, b], and it is understood that f^ n+v> {t) is continuous 
as well; that is, /(f) has continuous derivatives f^\t) fork— 1, 2, ...,« + 1. It 
can be shown that for suitable £ = §(f) 

1 " 

e„(t) = /(f) - Pn(t) = - ■ — / ( " +1) (g) ]"[(' - ft), ( 6 - 14 ) 

(=0 

where f € [a, &]. If we know /^ n+1 ^(f), then clearly (6.14) may yield useful bounds 
on |e„(f)|. Of course, if we know nothing about /(f), then (6.14) is useless. 
Equation (6.14) is useless if /(f) is not sufficiently differentiable. A derivation 
of (6.14) is given by Hildebrand [3, pp. 81-83], but we omit it here. However, it 
will follow from results to be considered in the next section. 



6.3 NEWTON INTERPOLATION 

Define 

x(h)-x(to) XI -XQ 

x[to, h]= = . (6.15) 

h — to h — to 

This is called the first divided difference of x(t) relative to t\ and to- We see that 
x[to, fi] = x[t\, to]- We may linearly interpolate x(t) for t e [to, t{\ according to 

t — to 

x(t) « x(t ) H [x(h) - x(t )] = x(to) + (f - t )x[t , h]. (6.16) 

h — to 



TLFeBOOK 



258 



INTERPOLATION 



It is convenient to define po(t) — x(to) and p\(t) — x(to) + (t — to)x[to, t\\. This 
is notation consistent with Section 6.2. In fact, p\ (t) agrees with the solution to 



(6.17) 



as we expect. 

Unless x(t) is truly linear the secant slope x[to, t\] will depend on the abscissas 
to and t\. If x(t) is a second-degree polynomial, then x\t\, t] will itself be a linear 
function of t for a given t\. Consequently, the ratio 



" 1 h ' 


Pl.O 


= 


x Q 



x[tQ, h, f2] 



x[t\, t 2 ] - x[t , tl] 
h — to 



(6.18) 



will be independent of to, t\, and t 2 - [This ratio is the second divided difference of 
x(t) with respect to to, t\, and ?2-] To see that this claim is true consider x(t) — 
«o + ci\t + a 2 t 2 , so then 



x[t\, t] = a\ + a 2 (t + t\) . 



(6.19) 



Therefore 



x[h, t 2 ] - x[to, h] — x[h, t 2 ] - x[t\, to] — a 2 (t 2 - to) 
so that x[tQ, t\, t 2 ] — a 2 . We also note from (6.18) that 



1 



x[to 


h 


ti\ = 


ti 


-to 


x[t\ 


to 


ti\ = 




1 


ti 


-h 



X 2 - X\ X\- XQ 



h — h t\ — to 
xi - XQ xq- xi 



h — to to — ti 



(6.20a) 
(6.20b) 



for which we may rewrite these in symmetric form 

xo xi 



x[to, h, t 2 ] = 



xi 



(to - ti)(t Q - t 2 ) (ti - toKh - t 2 ) (h - to)(t 2 - h) 
= x[ti,t ,t 2 ], (6.21) 

that is, x[to, t\, t 2 ] = x[ti, to, t 2 ]. We see from (6.16) that 

x(t) - x(t Q ) 



t-to 



x[to,h], 



that is 



x[t Q ,t] t*tx[to,h], 



(6.22) 



TLFeBOOK 



NEWTON INTERPOLATION 259 

and from this we consider the difference 

x[t , t] - x[t , h] — x[to, t] -x[h, t ] — (t - h)x[to, h, t] (6.23) 

via (6.18), and the symmetry property x[to, t\,t]= x[t\, to,t]. Since x(t) is assumed 
to be quadratic, we may replace the approximation of (6.16) with the identity 

x(t) =x(t ) + (t-t )x[t ,t] 

= x(to) + (t-to)x[to,ti] +(t - t Q )(t - h)x[t , h, t] (6.24) 

=P\(t) 

via (6.23). The first equality of (6.24) may be verified by direct calculation using 
x(t) — ciq + a\t + ci2t 2 , and (6.19). We see that if pi(t) approximates x(t), then 
the error involved is [from (6.24)] 

e(t) = x(t) - pi(t) = {t- t )(t - ti)x[t , h, t]. (6.25) 

These results generalize to x(t) a polynomial of higher degree. 

We may recursively define the divided differences of orders 0, I, . . . , k — l,k 
according to 

x[t ] = x(t ) =x , 

x(ti)-x(t ) x\ -x 
x[t ,ti] = 



x[to, h, ti\ 



t\ —to t\— to 

x[t\, t 2 ] - x[tp,ti] 

h — to 



x[h, ...,tt] - x[t , ...,t k -i] 
x[to,..-,t k ]= . (6.26) 

tk — to 

We have established the symmetry x[to, h] = x\t\, to] (case k — 1), and also 
x[to, t\, ?2] = x[h, to, ?2] (case k — 2). For k — 2, symmetry of this kind can also 
be deduced from the symmetric form (6.21). It seems reasonable [from (6.21)] that 
in general 

x[f , ...,tk\ 

xo , x\ t x k 



(to - ti) ■ ■ ■ (t - t k ) (h - t ) ■ ■ ■ (ti - t k ) (t k - t ) ■ ■ ■ (t k - t k -i) 

k 
J"— t xj. (6.27) 



TLFeBOOK 



260 INTERPOLATION 

It is convenient to define the coefficient of xj as 

a) = — r—i (6.28) 

>¥./' 

for y = 0, 1, . . . , k. Thus, x[?o, . . . , fjtl = S/=o a / x ;'- We ma y P rove (6-27) for- 
mally by mathematical induction. We outline the detailed approach as follows. 
Suppose that it is true for k — r; that is, assume that 



± ^_ 

hn r i= o(tj-ti) 



x[t ,...,t r ] = l^-—— —xj, (6.29) 

• n 1 1 i=o \tj H) 

and consider [from definition (6.26)] 

1 

X[t , . ..,t r +\] = {X[t\, ..., t r +\] -X[t , ...,t r ]}. (6.30) 

tr+l — tQ 

For (6.30) 



X[t\, ...,t r + \] = 



(h - t 2 ) ■ ■ ■ (h - t r +l) 

X2 , , X r+ \ 



(t 2 ~ h) ■ ■ ■ (t 2 ~ t r+ l) (tr+l ~ h) ■ ■ ■ (tr+l ~ tr) 

(6.31a) 



and 

xq 



x[t , ...,t r ] 



(t ~ h) ■ ■ ■ (t ~ t r ) 

XI 



(h -t )--- (tl - t r ) (t r -to)--- (t r ~ t r -l) 



(6.31b) 



If we substitute (6.31) into (6.30), we see that, for example, for the terms involving 
only x\ 

1 I x\ 



tr+l ~ to I (h - t 2 )(h ~ f 3 ) ' ' ' (tl ~ U)(h - tr+l) 
XI 



(tl - t Q )(h -t 2 )--- (tl - t r -l)(h - t r ) . 

XI 1 f 1 1 



tr+l — to (tl — t 2 ) ■ ■ ■ (tl — t r ) [ tl — t r +i tl — to 
Xl r + i 

= a^ xi. 



(tl ~ t )(tl -t 2 )--- (tl - t r )(h - tr+l) 

The same holds for all remaining terms in x / for j = 0, 1 , . . . , r + 1 . Hence (6.27) 
is valid by induction for all k > 1. Because of (6.27), the ordering of the arguments 



TLFeBOOK 



NEWTON INTERPOLATION 261 

in x[to, . . . , t k ] is irrelevant. Consequently, x[to, ■ ■ ■ ,t k ] can be expressed as the 
difference between two divided differences of order k — 1, having any k — 1 of 
their k arguments in common, divided by the difference between those arguments 
that are not in common. For example 

X[h,t2,t3] — X[t0,tl,t2] x[to, t2, h] -X[h,t2,t3] 

x[t , h, t 2 , tj,] = = . 

?3 — to tQ — t\ 

What happens if two arguments of a divided difference become equal? The 
situation is reminiscent of Corollary 5.1 (in Chapter 5). For example, suppose t\ — 
t + e ; then 

x(h) - x(t) x(t + e)-x(t) 
x[t, fi] = 



implying that 



so that 



t\ — t 



x(t + e)-x(t) dx(t) 

x[t, t] = lim = 

6^0 e dt 



dx(t) 

x[t,t]= — — . (6.32) 

dt 

This assumes that x(t) is differentiable. By similar reasoning 

d 

—x[to, ...,t k ,t] = x[t Q , ...,t k ,t, t] (6.33) 

dt 

(assuming that to, ■ ■ ■ , tk are constants). Suppose u\, . . . , u n are differentiable func- 
tions of t, then it turns out that 

d y—^ duj 

— x[to, ...,tk,u\,...,u n ]= } x[t Q , ...,tk,u\, ...,u„,Uj]——. (6.34) 
dt "-^ dt 

./ = ! 

Therefore, if u\ = U2 = • • • = u n = t, then, from (6.34) 

d 

—x[to, ■ . . , tk, t, . . . , t] = nx[to, ■ ■ ■ ,h,t, . . . ,t], (6.35) 

n n+1 

Using (6.33), and (6.35) it may be shown that 

d r 

—x[t ,...,t k ,t] = r\ x[t ,...,t k ,t,...,t]. (6.36) 

r+\ 

Of course, this assumes that x(t) is sufficiently differentiable. 



TLFeBOOK 



262 INTERPOLATION 

Equation (6.24) is just a special case of something more general. We note that 
if x(t) is not a quadratic, then (6.24) is only an approximation for t $ {to, t\, t 2 ], 
and so 

x(t) « x(t ) + (t- to)x[t , h] + (t - t )(t - h)x[t Q , fi, t 2 ] = P2(0. (6-37) 

It is easy to verify by direct evaluation that p 2 (t{) — x{t{) for i e {0, 1,2}. Equation 
(6.37) is the second-degree interpolation formula, while (6.16) is the first-degree 
interpolation formula. We may generalize (6.24) [and hence (6.37)] to higher 
degrees by using (6.26); that is 

x(t) = x[t ] + (t- t Q )x[tQ, t], 
x[to, t] = x[to, t\] + (t — h)x[to, t\, t], 

x[to, h,t] = x[tQ,t\,t2] + (t - t2)x[tQ, t\, t2, t], 



x[t , ...,t n -i,t] = x[t , ...,t n ] + (t - t„)x[t Q , ...,t n ,t], (6.38) 

where the last equation follows from 

x[t\, ...,t n ,t] -x[to, ...,t n ] 



x[t , ...,t n ,t] = 



t-t 

x[t , ...,f„-l,f] -x[t , ...,t n ] 
t-tn 



(6.39) 



(the second equality follows by exchanging to and t n ). If the second relation of 
(6.38) is substituted into the first, we obtain 

x(t) = x[t ] + (t- to)x[t , h] + (t- t )(t - ti)x[t , h,t], (6.40) 

which is just (6.24) again. If we substitute the third relation of (6.38) into (6.40), 
we obtain 

x(t) = x[t ] + (t- to)x[to, h] + (t - t )(t - ti)x[to, h,t 2 ] 

+ (t - t )(t - h){t - t 2 )x[t , t\, t 2 , t], (6.41) 

This leads to the third-degree interpolation formula 

p 3 (t) = x[t ] + (f - t Q )x[t , fi] + (t - t )(t - h)x[t , h, t 2 ] 
+ (t - t )(t - h)(t - t 2 )x[t , ti, t 2 , t 3 ]. 



TLFeBOOK 



NEWTON INTERPOLATION 263 

Continuing in this fashion, we obtain 

x(t) = x[t ] + (t- foMfo, h] + (t- t )(t - h)x[to, h, h] 

+ ••• + (?- t )(t - h) ■ ■ ■ (t - t„-i)x[to, h...,t n ] + e(t), (6.42a) 

where 

e(t) = (t- t Q )(t - h) ■ ■ ■ (t - t n )x[t , t u ...,t n ,t], (6.42b) 

and we define 

Pn(t) = X[t ] + (t - t Q )x[t , fi] 

+ (t- t )(t - h)x[t Q , h,t 2 ] + --- + (t-to)---(t- t„-i)x[to, • ■ • , t n ], 

(6.42c) 
which is the nth-degree interpolating formula, and is clearly a polynomial of degree 
n. So e(t) is the error involved in interpolating x(t) using polynomial p n (t)- It is 
the case that p n (tk) — x(tk) for k = 0, 1, . . . , n. Equation (6.42a) is the Newton 
interpolating formula with divided differences. If x(t) is a polynomial of degree n 
(or less), then e(t) = (all t). This is more formally justified later. 

Example 6.2 Consider e(t) for n = 2. This requires [via (6.27)] 

x x\ 



X[tt), h, t 2 , t] 



(to - h)(fo - t 2 )(to - t) (h - fo)Oi - t 2 )(h - t) 
x 2 x(t) 



(h - t )(t 2 - h){t 2 -t) (t- t )(t - h)(t - t 2 ) 
Thus 



e(t) = (t - t )(t - ti)(t - t 2 )x[t , t u t 2 , t] 

(t - h)(t - t 2 )x _ (t - t )(t - t 2 ) Xl _ (t - t )(t - t 1 )x 2 
(to - h)(t - t 2 ) (h - to)(h - t 2 ) (t 2 -to)(t 2 -h) 



+x(t). 



=-PiV) 

(6.43) 
The form of p 2 (t) in (6.43) is that of p 2 (t) in Example 6.1: 

p 2 (t) = xoL (t) + x\L\(t) + x 2 L 2 (t). 

If x(t) — «o + a\t + a 2 t 2 , then clearly e(t) — for all t. In fact, p 2 t — ak- 

Example 6.3 Suppose that we wish to interpolate x(t) — e* given that t\ — kh 
with h = 0.1, and k = 0, 1, 2, 3. We are told that 

x Q = 1.000000, xi = 1.105171, x 2 = 1.221403, x 3 = 1.349859. 



TLFeBOOK 



= 


*Oo) = 


1.000000 


x[t ,ti] = 


: 1.05171 


= 0.1 


x(h) = 


1.105171 


x[h,t 2 ] = 


: 1.16232 


= 0.2 


x(t 2 ) = 


1.221403 


x[t 2 ,t 3 ] = 


1.28456 


= 0.3 


x(t 3 ) = 


1.349859 







264 INTERPOLATION 

We will consider n — 2 (i.e., quadratic interpolation). The task is aided if we 
construct the divided- difference table: 



x[t ,ti,t 2 ] = 0.55305 
x[t l ,t 2 ,t 3 ] = 0.61120 



For t € [0, 0.2] we consider [from (6.37)] 

x(t) % x (t ) + (t- t )x[t , ti] + (t- t )(t - h)x[t , ti,t 2 ], 

which for the data we are given becomes 

x(t) % 1.000000+ 1.051710? + 0.55305f(f- 0.1) = p$(t). (6.44a) 

Iff = 0.11, then ^(0.11)= 1.116296, while it turns out that x (0.11) = 1.116278, 
so e(0.11) = jc(O.II) - p^(0.11) = -0.000018. For t € [0.1, 0.3] we might con- 
sider 

x(t) % X (h) + (t- h)x[h, t 2 ] + (t- ti)(t - t 2 )x[h,t 2 , t 3 ], 

which for the given data becomes 

x(t) % 1.105171+ 1.16232(f-0.1) + 0.61120(f-0.1)(f-0.2) = p\(t). 

(6.44b) 
We observe that to calculate p\(t) does not require discarding all the results needed 
to determine pf (£)• We do not need to begin again as with Lagrangian interpolation 
since, for example 

x[h,t 2 ] -x[t ,ti] x[t 2 , ?3] -x[h,t 2 ] 
x[to,h,t 2 \= , x[h,t 2 , ?3J = , 

h — to ?3 — t\ 

and both of these divided differences require x\t\, t 2 ]. If we wanted to use cubic 
interpolation then the table is very easily augmented to include x[tQ, ti, t 2 ,t 3 ]. 
Thus, updating the interpolating polynomial due to the addition of more data to the 
table, or of increasing n, may proceed more efficiently than if we were to employ 
Lagrange interpolation. 

We also observe that both p^it) and p\(t) may be used to interpolate x(t) for 
t e [0.1,0.2]. Which polynomial should be chosen? Ideally, we would select the 
one for which e(t) is the smallest. Practically, this means seeking bounds for \e(f)\ 
and choosing the interpolating polynomial with the best error bound. 



TLFeBOOK 



NEWTON INTERPOLATION 265 

We have shown that if x(t) is approximated by p n (t), then the error has the form 

e(t) — jt(t)x[t Q , . . .,t n ,t] (6.45a) 

[recall (6.42b)], where 

7t(t) = (t- t )(t - h) ■ ■ ■ (t - t n ), (6.45b) 

which is an degree n + 1 polynomial. The form of the error in (6.45a) can be useful 
in analyzing the accuracy of numerical integration and numerical differentiation 
procedures, but another form of the error can be found. 

We note that e(t) — x(t) - p n (t) [recall (6.42)], so both x(t) - p n (t) and n(t) 
vanish at to, t\, . . . , t n . Consider the linear combination 

X(t) =x(t)~ Pn (t)- KTT(t). (6.46) 

We wish to select k so that X(t) = 0, where 1 ^ tk for k e Z n+ \. Such a k 
exists because n(t) vanishes only at to, h, . . . , t n . Let a — minffo, . . . , t n , 7], b — 
max{?o, • ■ • , t n , J], and define the interval / = [a, b]. By construction, X(t) van- 
ishes at least n + 2 times on /. Rolle's theorem from calculus states that X^it) — 
d k X(t)/dt k vanishes at least n + 2 — k times inside /. Specifically, X^ I+1 ^(f) van- 
ishes at least once inside /. Let this point be called f . Therefore, from (6.46), we 
obtain 

*(»+!>(!) - p^ +x \f) - KJt {n+l \f) = 0. (6.47) 

But p n (t) is a polynomial of degree n, so p^ n+l \t) = for all t e I. From (6.45b) 
jt (n+l) (t) = (n+ 1)!, so finally (6.47) reduces to 

' x (n+1) (f). (6.48) 



(n+l)\ 
From X(i) — in (6.46), and using (6.48), we find that 

e(t) = x(t) - Pn (t) = - x (n+V) (f)7T(t) (6.49) 

(n + 1)! 

for some | e /. If we were to let 1 — tk for any k e Z n+ i, then both sides of (6.49) 
vanish even in this previously excluded case. This allows us to write 

e ^ = / * , * ( " +1) (g(0M0 (6.50) 

(n + 1)! 

for some ^ = f (t) el (I — [a, b] with a — minffo, ■ • ■ , t n , t], and b — 
max{?o, ■ ■ ■ ,t n , t}). If x < - n+1 \t) is continuous for t e I, then x^" +1 '(f) is bounded 
on /, so there is an M n+ \ > such that 

*(»+!>($) < M n+l , (6.51) 



TLFeBOOK 



266 INTERPOLATION 
and hence 



k(OI < , M " + L \*(t)\ (6-52) 

(« + l)! 



for all t e I. It is to be emphasized that for this to hold x ( - n+i Ht) must exist, 
and we normally require it to be continuous, too. Equations (6.50) and (6.45a) are 
equivalent, and thus 

7t(t)x[t , ..., t n , t] = — — — x ( " +1) (^)7r(r), 
(n + 1)! 

or in other words 

x[t , . . . , t n , t] = * — * ( " +1) (l) (6.53) 

(« + 1)! 

for some f € /, whenever x'" +1 ^(f) exists in /. In particular, if jt (?) is a polynomial 
of degree n or less, then (6.53) yields ;c[fo, ...?«, t] = 0, hence e(f) = for all t 
(a fact mentioned earlier). 

Example 6.4 Recall Example 6.3. We considered x(t) = e' with fg = 0, fi = 
0.1, ?2 = 0.2, and we found that 

pf(0= 1.000000+ 1.051710? + 0.553050f(f- 0.1) 

for which pf (0.11) = 1.116296, and x(0.11) = 1.116278. The exact error is 

e(0.11) = x(0.11) - p$(0.U) = -0.000018. 

We will compare this with the bound we obtain from (6.52). Since n = 2, x^(f) = 
e', and / = [0,0.2], so 

M 3 = e 02 = 1.221403 

(which was given data), and n(t) = f(f -0.1)0 - 0.2), so |tt(0.11)| =9.9 x 10" 5 . 
Consequently, from (6.52), we have 

1.221403 . 

|e(0.11)| < j 9.9 x 10" 5 = 0.000020. 

The actual error certainly agrees with this bound. 

We end this section by observing that (6.14) immediately follows from (6.50). 

6.4 HERMITE INTERPOLATION 

In the previous sections polynomial interpolation methods matched the polynomial 
only to the value of the function f(x) at various points x — Xk € [a, b] C R. In 



TLFeBOOK 



HERMITE INTERPOLATION 267 

this section we consider Hermite interpolation where the interpolating polynomial 
also matches the first derivatives f^\x) at x — x k . This interpolation technique 
is important in the development of higher order numerical integration methods as 
will be seen in Chapter 9. 

The following theorem is the main result, and is essentially Theorem 3.9 from 
Burden and Faires [6]. 

Theorem 6.1: Hermite Interpolation Suppose that f(x) e C l [a, b], and 
that xq,x\, . . . ,x n e [a, b] are distinct, then the unique polynomial of degree (at 
most) In + 1 denoted by p2 n +i(x), and such that 

Pln + 1 [Xj ) = / (Xj ) , p% + , (Xj ) = f m (Xj ) (6.54) 

(J e Z„+i) is given by 

n n 

P2„+i(x) = ^2h k (x)f(x k ) + J^h k (x)f m (x k ), (6.55) 



where 



and 



t=0 k=0 



h k (x) = [1 - 2L ( l\x k ){x - x k )][L k (x)f, (6.56) 



h k (x) = (x - x k )[L k (x)] 2 (6.57) 



such that [recall (6.11)] 

n 

L k (x) = f] ±^±L. (6.58) 

. ^ X k %i 

1=0 

ijtk 

Proof To show (6.54) for xq, x\, . . . , x„, we require that h k (x), and h k (x) in 
(6.56) and (6.57) satisfy the conditions 

h k ( X j) = Sj- k , h ( k 1 \x j ) = (6.59a) 



and 



h k ( Xj ) = 0, h[ l \xj) = t>i- k . (6.59b) 



Assuming that these conditions hold, we may confirm (6.54) as follows. Via (6.55) 

n n 

P2n + l(Xj) = ^h k (xj)f(x k ) + ^2h k (xj)f (1) (x k ), 



k=0 k =0 



TLFeBOOK 



268 INTERPOLATION 
and via (6.59), this becomes 

n n 

k=Q k=0 

This confirms the first case in (6.54). Similarly, via (6.59) 

n n 

*=0 /t=0 

becomes 

P&+i(*j) = E°- /(**) + £*;-*/ (1) (**) = f m (xj), 

k=0 k=0 

which confirms the second case in (6.54). 

Now we will confirm that hk(x), and h k (x) as defined in (6.56), and (6.57) 
satisfy the requirements given in (6.59). The conditions in (6.59b) imply that hk(x) 
must have a double root at x = Xj for j ^ k, and a single root at x — Xk ■ A 
polynomial of degree at most 2n + 1 that satisfies these requirements, and such 
that h^\x k ) — 1 is 

f , , , , (x -x ) 2 ---(x -Xk-i) 2 - (x - x k+i ) 2 ■ ■ ■ (x - x n ) 2 

h k (x) = (x- x k )- 



(x k - x ) z ■■■(x k - x k -\) z ■ (x k - x k+ \Y ■■■{xk- x„) 2 



= (x - x k )L 2 k {x). 

Certainly h k (x k ) — 0. Moreover, h k (x) — L^(x) + 2(x — x k )L k (x)L k (x) so 
h^\xk) = L 2 (x k ) = 1 [via (6.8)]. These verify (6.59b). 

Now we consider (6.59a). These imply xj for j ^ k is a double root of h k (x), 
and we may consider (for suitable a and £> to be found below) 

hk(x) = ™ — r~(x - x ) 2 ■ ■ ■ (x - x k -i) 2 (x - x k+ if 

1 L=oW ~~ x k ) 

i^k 

■ ■ ■ (x — x n ) (ax + b) 
which has degree at most In + 1. More concisely, this polynomial is 



h k (x) = L k (x)(ax + b). 



From (6.59a) we require 



1 = h k (x k ) — L 2 k {x k )(ax k + b) — ax k + b, (6.60) 



TLFeBOOK 



SPLINE INTERPOLATION 269 

Also, h, (x) = aL k (x) + 2L k (x)L k (x)(ax + b), and we also need [via (6.59a)] 

h k (xk) — ah\{x k ) + 2L k (x k )L k (x k )(ax k + b) = 0, 
but again since L k (x k ) — 1, this expression reduces to 

a + 2L ( k l \x k ) = 0, 
where we have used (6.60). Hence 

a = -2L ( k l) (x k ), b = 1 + 2L ( k 1 \x k )x k . 



Therefore, we finally have 



h k (x) = [1 - 2L ( k l \x k )(x - x k )]L 2 k (x) 



which is (6.56). Since L k {xj) = for j ^ k it is clear that h k {xj) — for j ^ k. 
It is also easy to see that h, (xj) = for all j ^ k too. Thus, (6.59a) is confirmed 
for h k (x) as defined in (6.56). 

An error bound for Hermite interpolation is provided by the expression 

1 " 

/(*) = P2n+lM + .. ; — 11^ " ^) 2 / (2 " +2) «) (6-61) 

(2m + 2)! * = * 

for some f € (a, b), where f(x) e C 2n+2 [a, b]. We shall not derive (6.61) except 
to note that the approach is similar to the derivation of (6.14). Equation (6.14) was 
really derived in Section 6.3. 

In its present form Hermite interpolation requires working with Lagrange poly- 
nomials, and their derivatives. As noted by Burden and Faires [6], this is rather 
tedious (i.e., not computationally efficient). A procedure involving Newton inter- 
polation (recall Section 6.3) may be employed to reduce the labor that would 
otherwise be involved in Hermite interpolation. We do not consider this approach, 
but instead refer the reader to Burden and Faires [6] for the details. We use Hermite 
interpolation in Chapter 9 to develop numerical integration methods, and efficient 
Hermite interpolation is not needed for this purpose. 

6.5 SPLINE INTERPOLATION 

Spline (spliced line) interpolation is a particular kind of piecewise polynomial 
interpolation. We may wish, for example, to approximate f(x) for x € [a, b] C 
R when given the sample points {(x k , f(x k ))\k e Z n+ \] by fitting straight-line 
segments in between (x k , f(x k )), and {x k +\, f(x k+ \)) for k = 0, 1, . . . , n — 1. An 



TLFeBOOK 



270 INTERPOLATION 



^ 




Figure 6.1 The cubic polynomial f{x) = (x — \)(x — 2)(x — 3), and its piecewise linear 
interpolant (dashed line) at the nodes x\ = xq + hk for which xq = 0, and h = A, where 
k = 0, 1 n with n — 8. 



example of this appears in Fig. 6.1. This has a number of disadvantages. Although 
f(x) may be differentiable at x = Xk, the piecewise linear approximation will not 
be (in general). Also, the graph of the interpolant has visually displeasing "kinks" 
in it. If interpolation is for a computer graphics application, or to define the physical 
surface of an automobile body or airplane, then such kinks are seldom acceptable. 
Splines are a means to deal with this problem. It is also worth noting that, more 
recently, splines have found a role in the design of wavelet functions [7, 8], which 
were briefly mentioned in Chapter 1. 

The following definition is taken from Epperson [9], and our exposition of spline 
functions in this section follows that in [9] fairly closely. As always f^(x) — 
d l f(x)/dx l (i.e., this is the notation for the ith derivative of fix)). 

Definition 6.1: Spline Suppose that we are given {{Xk, f(xk))\k e Z n+ \}. 
The piecewise polynomial function p m (x) is called a spline if 



(51) p m (xk) — f(xk) for all k e Z„+i (interpolation). 

(52) lirn 



Pm \x) = lim x 



Pm (x) for all i e Zjv+i (smoothness). 



TLFeBOOK 



SPLINE INTERPOLATION 271 

(S3) p m (x) is a polynomial of degree no larger than m on every subinterval 
[xk, Xk+\] for k e Z„ (interval of definition). 

We say that m is the degree of approximation and N is the degree of smoothness 
of the spline p m (x). 

There is a relationship between m and N. As there are n subintervals [xjc, X/t+i], 
and each of these is the domain of definition of a degree m polynomial, we see 
that there are Df — n(m + 1) degrees of freedom. Each polynomial is specified by 
m + 1 coefficients, and there are n of these polynomials; hence Df is the number 
of parameters to solve for in total. From Definition 6.1 there are n + 1 interpo- 
lation conditions [axiom (SI)]. And there are n — 1 junction points x\, . . .x n -\ 
(sometimes also called knots), with N + 1 continuity conditions being imposed on 
each of them [axiom (S2)]. As a result, there are D c — («+ 1) + (n — l)(N + 1) 
constraints. Consider 

D f - D c = n(m + 1) - [(« + 1) + (« - l)(N + 1)] =n(m-N-l) + N. 

(6.62) 
It is a common practice to enforce the condition m — N — 1 = 0; that is, we let 

m = N+l. (6.63) 

This relates the degree of approximation to the degree of smoothness in a simple 
manner. Below we will focus our attention exclusively on the special case of the 
cubic splines for which m — 3. From (6.63) we must therefore have N = 2. With 
condition (6.63), then, from (6.62) we have 

D f -D C = N. (6.64) 

As a result, it is necessary to impose TV further constraints on the design problem. 
How this is done is considered in detail below. Since we will look only at m — 3 
with N — 2, we must impose two additional constraints. This will be done by 
imposing one constraint at each endpoint of the interval [a, b]. There is more than 
one way to do this as will be seen later. 

From Definition 6. 1 it superficially appears that we need to compute n different 
polynomials. However, it is possible to recast our problem in terms of B-splines. 
A B-spline acts as a prototype in the formation of a basis set of splines. 

Aside from the assumption that m — 3, N — 2, let us further assume that 

a — xq < x\ < • • • < x n -\ < x n — b (6.65) 

with Xk+i — Xk — h for k = 0, 1, . . . , n — 1. This is the uniform grid assumption. 
We will also need to account for boundary conditions, and this requires us to 
introduce the additional grid points 

x_3 — a — 3h, x-2 — a — 2h, x_\ — a — h (6.66) 



TLFeBOOK 



272 INTERPOLATION 

and 

x n+ 3 = b + 3h, x n+ 2 = b + 2h, x n +\ = b + h. 

Our prototype cubic B-spline will be the function 



(6.67) 



S(x) 



0, 

(x + 2)\ 

1 + 3(x + 1) + 3(x + l) 2 - 3(x + l) 3 , 

1 + 3(1 - x) + 3(1 - xf - 3(1 - x) 3 , 

(2-x) 3 , 

0, 



x < -2 
-2 < x < -1 
-1 <x < 
0<x < 1 

1 < x < 2 
x>2 



(6.68) 



This function has nodes at x e {—2, —1,0, 1 , 2}. A plot of it appears in Fig. 6.2, and 
we see that it has a bell shape similar to the Gaussian pulse we saw in Chapter 3. 
We may verify that S(x) satisfies Definition 6.1 as follows. Plainly, it is piecewise 
cubic (i.e., m — 3), so axiom (S3) holds. The first and second derivatives are, 




Figure 6.2 A plot of the cubic B-spline defined in Eq. (6.68). 



TLFeBOOK 



respectively 



SPLINE INTERPOLATION 273 



s (1 \ X ) 



0, 


x < -2 


3(x + 2) 2 , 


-2 < x < -1 


3 + 6(x + l)-9(x+l) 2 , 


-1 < x < 


-3-6(1 — x) + 9(1 -x) 2 , 


<x < 1 


-3(2- x) 2 , 


1 < x < 2 


0, 


x > 2 



(6.69) 



and 



S (2) (x) = 



0, x < -2 

6(x + 2), -2<x<-l 

6-18(x + l), -l<x<0 

6- 18(1 -x), <x < 1 



6(2 -x), 
0. 



1 < x < 2 
x > 2 



(6.70) 



We note that from (6.68)-(6.70) that 



5(0) = 4, 


S(±l)=l, 


5(±2) = 0, 


(6.71a) 


S (1) (0) = 0, 


S (1) (±1) = T 3, 


S (1) (±2) = 0, 


(6.71b) 


S (2) (0) = -12, 


S (2) (±l) = 6, 


S (2) (±2) = 0. 


(6.71c) 



So it is apparent that for i = 0, 1, 2 we have 



?(«')/ 



?(<), 



lim S u; (x) = lim S^(x) 



for all Xk € {—2, —1,0, 1,2}. Thus, the smoothness axiom (S2) is met for N — 2. 
Now we need to consider how we may employ S(x) to approximate any f(x) 
for x e [a, b] C R when working with the grid specified in (6.65)-(6.67). To this 
end we define 

(X — X; \ 
— JT~) (6 ' 72) 



for i = —1,0, l,...,/i,n + l. Since 5 ( u, (x) = ^5 (1) (^), 5, u, (x) = 
l 5 (2) (£^£l) from (6 71 ) ; we have (e g ^ x . ±1 = x . ± jg 



S,-(jc i -) = S(0) = 4, S ( te±i) = S(±l) = l, S I -(x I -± 2 ) = S(±2) = 0, 

(6.73a) 



TLFeBOOK 



274 INTERPOLATION 

S?\ Xi ) = 0, S ; (1) (* (±1 ) = tJ, 5, (1) (x !±2 ) = 0, (6.73b) 

S i 2) ^ = ~^' SfWi) = ^. 5, (2) (x i±2 ) = 0. (6.73c) 

We construct a cubic B-spline interpolant for any f(x) by defining spline pi(x) 
to be a linear combination of Si (x) for i — —1,0, 1, . . . ,n,n + 1, i.e., for suitable 

coefficients a, we have 

«+i 

p 3 (x)= J2 a iS,(x). (6.74) 

The series coefficients a, are determined in order to satisfy axiom (SI) of Definition 
6.1; thus, we have 

n + \ 

f(xk) = ^2 aiSi{x k ) (6.75) 

for & e Z„+i. 

If we apply (6.73a) to (6.75), we observe that 

n+\ 



f(xo) = ^ ajSjixo) = a-iS-i(x Q ) + a Q SQ(xo) + aiSi(xo) 

«+i 
/(*i) = £^ a,-S,-(*i) = a 5 , o(xi)+ai5 , i(x 1 ) + fl 2 52(xi) 



i=-l 



n+1 
f(x n ) — 2_^ a iSi( x n) — Cn-\S n -i(x n ) + a n S n (x n ) + a n+ iS n +\(x n ) 

(6.76) 
For example, for /(xq) in (6.76), we note that Sk(xo) — for k > 2 since (x k — 
xq + A;/;) 

S k (x ) = 5 (^) = S ( *> -»*+*"> ) = S(-*) = 0. 

More generally 

f(x k ) — a k -iS k -i(xk) + a k S k (x k ) + a k+ \S k+ \(x k ) (6.77) 

for which k e X n+ \. Again via (6.73a) we see that 

c , , c f (x + kh) - (xq + (k - l)h) \ 

S k -i(x k ) = S = 5(1) = 1, 



TLFeBOOK 



SPLINE INTERPOLATION 



275 



Sk(xk) = s ■ {X0 + kh)-( X0 + khr = m = 4> 



5/t+l C^yt) = S 



h 



(xo + kh)- (x + (k + \)h) 



S(-l) = l. 



Thus, (6.77) becomes 



«£_! + 4a* + Ojfc+1 = /(**) 



(6.78) 



again for fc = 0, 1, . . . , n. In matrix form we have 



1 


4 


1 


• 


• 











1 


4 


1 • 


• 

















• 


• 4 


1 





_ 








• 


• 1 


4 


1 



a_i 

«o 

a„ 

a n + \ 



=A 





/(*o) 




f(xi) 




f(Xn-l) 




f(x„) 



(6.79) 



We note that A e r("+ 1 ) x ("+ 3 ) ; so there are n + 1 equations in n + 3 unknowns 
(a e R" +3 ). The tridiagonal linear system in (6.79) cannot be solved in its present 
form as we need two additional constraints. This is to be expected from our earlier 
discussion surrounding (6.62)-(6.64). Recall that we have chosen m — 3, so N — 2, 
and so, via (6.64), we have D f — D c — N — 2, implying the need for two more 
constraints. There are two common approaches to obtaining these constraints: 



,(2), 



1. We enforce p^ (xq) 

2. We enforce pf\xo) = / (1) 
or clamped spline). 



pf\x n ) 



(natural spline). 



,(l) 



(xo), and p\ (x n ) = f ( '(x n ) (complete spline 



Of these two choices, the natural spline is a bit easier to work with but can lead 
to larger approximation errors near the interval endpoints xq, and x n compared to 
working with the complete spline. The complete spline can avoid the apparent need 
to know the derivatives f^ l \xo), and f^\x n ) by using numerical approximations 
to the derivative (see Section 9.6). 

We will first consider the case of the natural spline. From (6.74), we obtain 

P3(xq) = a-iS-i(x ) + a So(x ) + aiS\(x ), 



so that 



/?3 2) (*o) = a- 



iS^(*o)- 



a sf?\x ) 



aiS[ 2) (x ). 



(6.80) 



TLFeBOOK 



276 INTERPOLATION 



Since S, (2) (x ) = ±S {2) ( 5fi ^ L ), we have 



l2) _ J_ (2) / x -(x -h) \ _ 1 (2) _ 6_ 



Sri(xo) = —S w ( - — ^ ) = —5^(1) = — , (6.81a) 



^"--X 2 ^) -?*"«»-£ 



S^'(xo) = —S w ( -^-^ I = —S { '(0) = - — , (6.81b) 



,(2)^_, _ 1 eg) { xo ~ (*o + h)\ _ 1 (2) 



where (6.71c) was used. Thus, (6.80) reduces to 

h~ [6a_i — 12«o + 6«i] = 



a_i = 2ao — «i. (6.82) 

Similarly 

Pi\ x n) == Qn—l^n—l\X n ) ~r a n^n\ x n) T fl n + Wn+l (^-n)> 

SO that 

P3 2) (^«) = a«-i5'® 1 (x n ) +a„^ 2) (^«) +« n +i^+ 1 (x„). (6.83) 

Since S, (2) (x„) = ^5 (2) (^^), we have 

c (2) , . _ 1 „( 2 ) f (xQ + nh)-(xQ + (n- l)h) \ _ 1 (2) _ 6 

(6.84a) 



„i 1 m / (xn + m/j) — (xo + n/0\ 1 ,.,■, 12 

&<*,) = 7^5 (2) (^ L_li dj = _ 5 (2) (0 ) = __ (6.84b) 

c (2) , . _ 1 c (2) ( (XQ + w/i ) ~ (*0 + (W+1W \ 1 „( 2 ), n _ 6 



(6.84c) 



where (6.71c) was again employed. Thus, (6.83) reduces to 
h~ [6a„_i — 12a„ + 6a n+ \] — 



o„+i = 2a„ - a„_i. (6.85) 

Now since [from (6.79)] 

a_i + 4a + «i = f(xo) 



TLFeBOOK 



SPLINE INTERPOLATION 277 



using (6.82) we have 

and similarly 

so via (6.85) we have 



6a = /Oo), 
a„_i + 4a„ + a„+\ = /(*„), 



(6.86a) 



6a„ = /(*„). 
Using (6.86) we may rewrite (6.79) as the linear system 



(6.86b) 



4 


1 • 


• 








1 


4 1 • 


• 











• 


• 1 


4 


1 


_ 


• 


• 


1 


4 _ 



a n -2 
a n -\ 



f(xi) - g/(*o) 

f(X 2 ) 

f(Xn-2) 
_ f(x„-i) - hf(x n ) 



(6.87) 



=/ 

Here we have A e r(" _1 ) x (" -1 ^ so now the tridiagonal system Aa — f has a 
unique solution, assuming that A -1 exists. The existence of A -1 will be justified 
below. 

Now we will consider the case of the complete spline. In the case of 



,(!)/ 



?(!)/ 



?(!)/ 



:■(!)/ 



P^Oo) = a-i^^xo) + ooSq^Cso) + aiS^-'^o) = / (1) (*o)> 
since 5 ( (1) (xo) = ^5 (1) ( a ^ £L ), we have, using (6.71b) 



(6. 



S^(xo) = f 5« ( ^"-^-^ = I 5 (i)(l) = 



r(l)/ 



n 

so (6.88) becomes 



1 -(1) / Xqj-JCo 

h~ V h 
1 



= -S (1) (0) = 0, 
h 



xo - (xq + h) 



,(i) 



is«(-D = -. 
h h 



pl l '(x ) = SA-^-a-i +oi] = / UJ (x ). 



(6.89) 



Similarly 



P \ l) (x„) = a„_i^!_\(x„) + flnS^Oc) + a^i^^fe) = f m (x n ) (6.90) 



TLFeBOOK 



278 INTERPOLATION 

reduces to 

p 3 (x„) = 3h~ 1 [-a„-i +a„ +i ] = / (1) (x„). 

From (6.89) and (6.91), we obtain 

0-1 = 01 - ±fr/ (1) (*o), Cl n +\ = On-l + ^/ (1) (^n)- 

If we substitute (6.92) into (6.79), we obtain 



(6.91) 



(6.92) 



4 2 

1 4 1 











1 4 1 

2 4 



«0 
O] 



a«-i 



/(x )+^/ (1) (*o) 



/(*n-l) 



(6.93) 



=/ 

Now we have A € r(«+ 1 ) x ("+ 1 ) ) anc j we see that Aa — f of (6.93) will have a 
unique solution provided that A -1 exists. 

Matrix A is tridiagonal, and so is quite sparse (i.e., it has many zero-valued 
entries) because of the "locality" of the function S(x). This locality makes it pos- 
sible to evaluate pj,(x) in (6.74) efficiently. If we know that x e [xk, Xk+i], then 



P3(x) = a k -\S k -\{x) + a k S k (x) + a k+ \S k +\(x) + ak+iSk+lix). 



(6.94) 



We write suppg(x) = [a, b] to represent the fact that g(x) — for all x < a, and 
x > b, while g(x) might not be zero- valued for x € [a,b]. From (6.68) we may 
therefore say that suppS(x) = [—2,2], and so from (6.72), suppS^x) = [x,- — 
2h, xi + 2h], so we see that Si(x) is not necessarily zero-valued for x € [x,- — 
2h, xt + 2h]. From this, (6.94), and the fact that x,- = xq + ih, we obtain 



,k+2 



supp p 3 (x) = U^_jsupp Si(x) = [x + (* - 3)h, x Q + (k+ l)h] U 
• • • U [x + kh, x Q + (k + 4)h] 



(6.95) 



(if A,- are sets, thenU^_ A,- = A n U • • • U A m ), which "covers" the interval 
[xk, Xk+i] — [xo + kh, xq + (k + l)h\. Because the sampling grid (6.65)-(6.67), 
is uniform it is easy to establish that (via x > xo + kh) 



k = 



xo 



h 



(6.96) 



where [x] — the largest integer that is < x € R. Since suppS(x) is an interval of 
finite length, we say that S(x) is compactly supported. This locality of support is 
also a part of what makes splines useful in wavelet constructions. 



TLFeBOOK 



SPLINE INTERPOLATION 



279 



We now consider the solution of Aa — f in either of (6.87) or (6.93). Obviously, 
Gaussian elimination is a possible method, but this is not efficient since Gaussian 
elimination is a general procedure that does not take advantage of any matrix 
structure. The sparse tridiagonal structure of A can be exploited for a more efficient 
solution. We will consider a general method of tridiagonal linear system solution 
here, but it is one that is based on modifying the general Gaussian elimination 
method. We begin by defining 



«00 


floi 


• 










«10 


on 


an ■ 













«21 


A22 ' 
















• 


■ a n - 


-!,«-! 


&n—\,n 








■ 


®n,n — \ 


a n,n 



X 




r h 


Xl 




h 


x 2 


= 


h 


%n—l 




fn-\ 


%n 




In 



(6.97) 



Clearly, A e r("+ 1 ) x ("+ 1 ). There is some terminology associated with tridiagonal 
matrices. The main diagonal consists of the elements a, ,, while the diagonal above 
this consists of the elements atj+i, and is often called the superdiagonal. Similarly, 
the diagonal below the main diagonal consists of the elements a,+i,;, and is often 
called the subdiagonal. 

Our approach to solving Ax — f in (6.97) will be to apply Gaussian elimination 
to the augmented linear system [A\f] in order to reduce A to upper triangular form, 
and then backward substitution will be used to solve for x. This is much the same 
procedure as considered in Section 4.5, except that the tridiagonal structure of A 
makes matters easier. To see this, consider the special case of n — 3 as an example, 
that is (A° = A,f°= f) 



[A°|/°] = 



" a 
"oo 



MO 



*01 

2° 

Ml 

'21 





"12 




(I 



22 



'32 



'23 



-'33 



I /o°" 

I f? 

I f 2 ° 

I / 3 °J 



(6.98) 



We may apply elementary row operations to eliminate a® in (6.98). Thus, (6.98) 
becomes 



[A 1 !/ 1 ]^ 



-'oo 








a oi 
o _ 4> fl o 

"11 o "oi 

"oo 



a 



o 







M2 



'22 



'32 



'23 



'33 



Jo 



o 

f _ f?o f 
J\ „o Jo 
"oo 



ft 

/3° 



TLFeBOOK 



280 



INTERPOLATION 



a 00 


"oi 











"ii 


"l2 








"21 


a 22 


fl 








q\ 


a 



23 



33 



I /o 1 

I S\ 



(6.99) 



Now we apply elementary row operations to eliminate a 21 . Thus, (6.99) becomes 



[A 2 \f 2 l 



-'oo 









-'oi 



♦ 22 



'12 

a\x 1 
„l a 12 



'32 



'23 



'33 



?l 



f! 

i a 2\ f\ 



'1 

fl a 2\ f 

J 2 „1 •/! 



./■, 



r a 2 
"oo 







„2 



'12 

'22 

v 2 

'32 



'23 

2 
'33 



I /, 



/, 2 



I /2 2 

I .A 2 



(6.100) 



Finally, we eliminate o| 2 , in which case (6.100) becomes 



[A 3 |/ 3 ] 



"oo 

a\ 







'01 



'12 



'22 






a 2 

"23 



"33 



%2 „2 
,2 "23 

22 



fl 

ff 



f: 



I /? 



2 
2 

%2 f 2 
a 2 J 2 

22 



'00 









-*oi 

„3 



'12 



'22 







'23 



'33 



I /, 



./? 



I fl 
I /3 3 



(6.101) 



We have A 3 = U, an upper triangular matrix. Thus, Ux — / 3 can be solved by 
backward substitution. The reader should write a pseudocode program to implement 
this approach for any n. Ideally, the code ought to be written such that only the 
vector /, and the main, super-, and subdiagonals of A are stored. 



TLFeBOOK 



SPLINE INTERPOLATION 281 

We observe that the algorithm we have just constructed will work only if a\ • ^ 
for all i = 0, 1, . . . , n. Thus, our approach may not be stable. We expect this 
potential problem because our algorithm does not employ any pivoting. However, 
our application here is to use the algorithm to solve either (6.87) or (6.93). For A in 
either of these cases, we never get a\ ■ = 0. We may justify this claim as follows. 
In Definition 6.2 it is understood that a,- ; = for i, j g Z n+ \. 

Definition 6.2: Diagonal Dominance The tridiagonal matrix A — 

\fli, ;'](', ;'=0 n € R("+ 1 ) x (' i + 1 ) is diagonally dominant if 

fl;,i > k,i-il + Kml > o 

for i =0, 1, . . . , n. 

It is clear that A in (6.87), or in (6.93) is diagonally dominant. For the algorithm 
we have developed to solve Ax — f, in general we can say 

a k 
a k+1 -a k k+hk a k (6 1021 

"k+l,k+l — a k+\,k+\ k "k,k+l> 10.1UAJ 



with a^\ k = 0, and 



a k,k+ 1 = a k,k+l (6.103) 



for k = 0, 1, . . . , n — 1. Condition (6.103) states that the algorithm does not modify 
the superdiagonal elements of A. 

Theorem 6.2: If tridiagonal matrix A is diagonally dominant, then the algo- 
rithm for solving Ax — f will not yield a\ ■ = for any i = 0, 1, . . . , n. 

Proof We give only an outline of the main idea. The complete proof needs 
mathematical induction. 
From Definition 6.2 

«o,o> l«o,-il + l«o,il = l a o,il >0, 

so from (6.102) 

1 _ o fl o,i 
a l,l — a l,l ~o~ a i,o> 
a o,o 

with aj = 0. Thus, A 1 can be obtained from A = A. Now again from Defini- 
tion 6.2 

a% > |a? | + |o? 2 | > 0, 



TLFeBOOK 



282 



INTERPOLATION 



so that, because < |oq j/flg | < 1, we can say that 



a\ , > a? i 



•*o,i 



•*o,o 



l«i.ol 



> l«? l + l«i 2 l 



*o,i 



^0,0 



l«i,ol 



= 1 



'0,1 



^0,0 



l«l,0l + l«l,2l 



> \a\ 2 \ = |aj >2 | = |o} >0 | + \a\ 2 \ >0, 

and A 1 is diagonally dominant. The diagonal dominance of A implies the diagonal 
dominance of A 1 . In general, we see that if A k is diagonally dominant, then A k+1 
will be diagonally dominant as well, and so A k+1 can be formed from A k for all 
k = 0, 1, . . . , n - 1. Thus, a\ ■ ^ for all i = 0, 1, . . . , w. 

If A is diagonally dominant, it will be well-conditioned, too (not obvious), and 
so our algorithm to solve Ax = / is actually quite stable in this case. 

Before considering an example of spline interpolation, we specify a result con- 
cerning the accuracy of approximation with cubic splines. This is a modified version 
of Theorem 3.13 in Burden and Faires [6], or of Theorem 4.7 in Epperson [9]. 

Theorem 6.3: If f(x) e C 4 [a, b] with m^ x(z[aM |/ (4) (x)| < M, and Pi {x) 
is the unique complete spline interpolant for f(x), then 



max I/O) - p 3 (x)\ < —Mb*. 

x€[a,b] J64 

Proof The proof is omitted, but Epperson [9] suggests referring to the article 
by Hall [10]. 

We conclude this section with an illustration of the quality of approximation of the 
cubic splines. 

Example 6.5 Figure 6.3 shows the natural and complete cubic spline inter- 
polants to the function f(x) — exp(— x 2 ) for x e [—1, 1]. We have chosen the 
nodes x^ — — 1 + jk, for k = 0, 1,2, 3, 4 (i.e., n = 4, and h — j). Clearly, 
/W(x) = — 2x exp(— x 2 ). Thus, at the nodes 

/(±1) = 0.36787944, f{±\) = 0.60653066, /(0) = 1.00000000, 



and 



f w (±l) = ^0.73575888. 



TLFeBOOK 



SPLINE INTERPOLATION 283 




0.8 

^ 0.6 

0.4 



0.2 



1 1 ^-Q ■— 1 1 




/"■ \\ 




y n. 




i i i i i 



-1.5 



-0.5 



(b) 




x 



0.5 



1.5 



Figure 6.3 Natural (a) and complete (b) spline interpolants (dashed lines) for the function 
f(x) = e~ x (solid lines) on the interval [— 1, 1]. The circles correspond to node locations. 
Here n = 4, with h = i, and xq = — 1. The nodes are at x k = xq + hk for k = 0, . . . , n. 

Of course, the spline series for our example has the form 

5 



Pi(x) = J^ a k S k (x), 



-l 



so we need to determine the coefficients a k from both (6.87), and (6.93). 
Considering the natural spline interpolant first, from (6.87), we have 



4 1 
1 4 1 
1 4 



a\ 




" /(*l) " g/(*o) 


«2 


= 


f(X2) 


«3 




- f(X3) ~ lf(X4) 



Additionally 



and 



11 11 

ao = -f(xo) = 7/( -1 )> fl 4 = -f(x 4 ) = -/(l), 
6 6 6 6 



a_l = f(x ) - 4a Q - a\, a 5 — f(x 4 ) - 4a 4 - a 3 . 



TLFeBOOK 



284 INTERPOLATION 

The natural spline series coefficients are therefore 



k 


at 


-1 


-0.01094139 





0.06131324 


1 


0.13356787 


2 


0.18321607 


3 


0.13356787 


4 


0.06131324 


5 


-0.01094139 



Now, on considering the case of the complete spline, from (6.93) we have 





4 


2 













1 


4 


1 













1 


4 


1 













1 


4 


1 




_ 








2 


4 


Additionall 


y 











«0 

a\ 
as 

(34 



\hfV\x ), 



/(jc ) + £a/ (1) (*o) 
f(xi) 

f(X2) 

f(X3) 

f(x A )-\hf l \x A ) 



a_i — a\ — ^nf^'{xo), as — «3 + ^hf ( '(x4). 
The complete spline series coefficients are therefore 



k 


ak 


-1 


0.01276495 





0.05493076 


1 


0.13539143 


2 


0.18230428 


3 


0.13539143 


4 


0.05493076 


5 


0.01276495 



This example demonstrates what was suggested earlier, and that is that the com- 
plete spline interpolant tends to be more accurate than the natural spline interpolant. 
However, accuracy of the complete spline interpolant is contingent on accurate 
estimation, or knowledge of the derivatives /^'(xo) and /^'(x„). 

REFERENCES 

1. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.), 
Random House, New York, 1988. 

2. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ. 
Press, Baltimore, MD, 1989. 



TLFeBOOK 



PROBLEMS 285 

3. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York, 
1974. 

4. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical 
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977. 

5. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 
1966. 

6. R. L. Burden and J. D. Faires, Numerical Analysis, 4th ed., PWS-KENT Publi., Boston, 
MA, 1989. 

7. C. K. Chui, An Introduction to Wavelets, Academic Press, Boston, MA, 1992. 

8. M. Unser and T. Blu, "Wavelet Theory Demystified," IEEE Trans. Signal Process. 51, 
470-483 (Feb. 2003). 

9. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York, 
2002. 

10. C. A. Hall, "On Error Bounds for Spline Interpolation," J. Approx. Theory. 1, 209-218 
(1968). 



PROBLEMS 



6.1. Find the Lagrange interpolant pi(t) — Y2j—(,XjLj(t) for data set 
{(-1, i), (0, 1), (1, -1)}. Find p 2J in P2 (t) = Y. )=o P2,jt j ■ 

6.2. Find the Lagrange interpolant /53(f) = ^2j = o x jLj(t) f° r data set 



{(0,1), (± 2), (1,|), (2, 



<2 ,*.„ V x, 2 „ v ^, -1)}. Find pjj in p 3 (t) = J2%o P3jt j . 
6.3. We want to interpolate x(t) at t = t\, h, and x^(t) = dx(t)/dt at t — to, £3 



using p3(t) = J2j=o P3,jt J .Let Xj = x(tj) and 
ear system of equations satisfied by (pij). 

(i)/ 



,(i) 



= x w (tj). Find the lin- 



6.4. Find a general expression for L , . (t) = dLj(t)/dt. 

6.5. In Section 6.2 it was mentioned that fast algorithms exist to solve Vander- 
monde linear systems of equations. The Vandermonde system Ap — x is 
given in expanded form as 



to 



1 t„-i 
1 t„ 



n-\ 

2 




n-\ 



n-\ 
n-\ 



n-\ 













PQ 




XQ 




Pi 


= 


XI 




Pn-l 




Xn — 1 




Pn 




Xn 



(6.P.1) 



and p n (t) — Yl"j=o Pj* J interpolates the points {(tj, Xj)\k — 0, 1, 2, ... , n}. 
A fast algorithm to solve (6.P.1) is 



TLFeBOOK 



286 



INTERPOLATION 



for k := to n - 1 do begin 

for / := n downto to k + 1 do begin 
Xi:=(Xj-x,_i)/(ti-ti_ k _iy, 
end; 
end; 
for k :— n — 1 downto do begin 
for / := k to n - 1 do begin 

X; := X; - t k X i+ i ; 

end; 
end; 

The algorithm overwrites vector i = [io ii • • • x n ] T with the vector p — 
[po P\ ■■■ PnV ■ 

(a) Count the number of arithmetic operations needed by the fast algorithm. 
What is the asymptotic time complexity of it, and how does this compare 
with Gaussian elimination as a method to solve (6.P.1)? 

(b) Test the fast algorithm out on the system 



1 1 1 


1 


Po 




10 


1 2 4 


8 


Pi 




26 


1 3 9 


27 


P2 




58 


1 4 16 


64 


_ Pi 




112 



(c) The "top" k loop in the fast algorithm produces the Newton form (Section 
6.3) of the representation for p n (t). For the system in (b), confirm that 



k-\ 



Pn(t) = ^X k Y[(t ~ti), 



k=0 1=0 

where {x^k e Z n+ i} are the outputs from the top k loop. Since in (b) 
n — 3, we must have for this particular special case 

p 3 (t) =x +xi(t- t ) + x 2 (t - t )(t - h) + x 3 (t - t )(t - t x )(t - t 2 ). 

(Comment: It has been noted by Bjorck and Pereyra that the fast algorithm 
often yields accurate results even when A is ill-conditioned.) 

6.6. Prove that for 



A n = 



to 


t 2 
l 




t\ 


t 2 


.n — \ 
' ' l \ 


tn-l 


t 2 

l n-\ 


.n — \ 
" l n-\ 


tn 


t 2 


t n — 1 



TLFeBOOK 



PROBLEMS 



287 



we have 



det(A„) = Yl ('i ~ '.0- 



0<i<j<n 



(Hint: Use mathematical induction.) 

6.7. Write a MATLAB function to interpolate the data {(tj,Xj)\j e Z„ + i} with 
polynomial p n (t) via Lagrange interpolation. The function must accept f as 
input, and return p n (t). 

6.8. Runge's phenomenon was mentioned in Section 6.2 with respect to interpo- 
lating f(t) — 1/(1 + f 2 ) on t e [—5, 5], Use polynomial p n (t) to interpolate 
/(f) at the points t k = t o + kh, k — 0, 1, . . . , n, where to — —5, and t „ — 5 
(so h — (t n - t )/n). Do this for n — 5, 8, 10. Use the MATLAB function 
from the previous problem. Plot fit) and p n (t) on the same graph for all of 
n — 5, 8, 10. Comment on the accuracy of interpolation as n increases. 

6.9. Suppose that we wish to interpolate f(t) = sinf for t e [0, jt/2] using poly- 
nomial p n (t) — Yl"j=o Pn,jt J ■ The approximation error is <?„(?) = f(t) — 
p n (t). Since \f (n) (t)\ < 1 for all t e R via (6.14), it follows that 



\e„(t)\ < 



1 



(n+l)\ 



f\\t - tt\ = Ht). 



(6.P.2) 



,=o 



Let the grid (sample) points be to = 0, t n = tt/2, and t^ — to + kh for £ = 
0, 1, . . . , n so that h — (t n - t Q )/n = ^. Using MATLAB 

(a) For « = 2, on the same graph plot /(f) — p n {t), and ±£>(f). 

(b) For n — 4, on the same graph plot /(f) — p„(t), and ±£>(f). 

Cases (a) and (b) can be separate plots (e.g., using MATLAB subplot). Does 
the bound in (6.P.2) hold in both cases? 



6.10. Consider f(t) — coshf = j[e' + e '] which is to be interpolated by P2(t) — 
S/=o P2jt' on f € [-1, 1] at the points to = -1, fo = 0, t\ = 1. Use (6.14) 
to find an upper bound on the size of the approximation error e n (t) for 
f € [—1, 1], where the size of the approximation error is given by the norm 







1 


e„ 


loo 


= max |e„(f)|, 

a<t<b 


where here a — — 


l,b = 


1. 








6.11. For 














' 1 


1 






1 1 




V = 


to 

t" 
L 'o 


t\ 
t" 






f n — 1 f n 
l n-\ l n _ 


g ^(n + l)x(n+l) 



TLFeBOOK 



288 



INTERPOLATION 



it is claimed by Gautschi that 



IV 



-in , rrl±M 

I loo < max — 

0<k<n ^ \t k -t t \ 
ijtk 



(6.P.3) 



Find an upper bound on k^CV) that uses (6. P. 3). How useful is an upper 
bound on the condition number? Would a lower bound on condition number 
be more useful? Explain. 

6.12. For each function listed below, use divided difference tables (recall Example 
6.3) to construct the degree n Newton interpolating polynomial for the spec- 
ified points. 

(a) f(t) = 4t , fo = 0, t\ — 1, H — 3. Use n = 2. 

(b) f(t) — coshf, to — — 1, t\ — -j-, ii — 2> 

(c) f(t) — Int, to = 1, t\ = 2, t2 = 3. Use n = 2. 



_1 i 2 = i f 3 = 1. Use n = 3. 



(d) /(0 = 1/(1 4 
6.13. Prove Eq. (6.33). 



f), to = 0,t\ = A, ?2 = 1, ?3 = 2. Use « = 3. 



6.14. Consider Theorem 6.1. For n = 2, find /^(x) and ^^(x) as direct-form (i.e., 
a form such as (6.4)) polynomials in x. Of course, do this for all k — 0, 1, 2. 
Find Hermite interpolation polynomial p2n+\(x) for n — 2 that interpolates 
/(x) = .y/x at the points xo = 1, x\ — 3/2, X2 = 2. 

6.15. Find the Hermite interpolating polynomial p$(x) to interpolate f(x) — e x at 
the points xo = 0, x\ — 0.1, X2 = 0.2. Use MATLAB to compare the accu- 
racy of approximation of ps(x) to that of p^ix) given in Example 6.3 (or 
Example 6.4). 

6.16. The following matrix is important in solving spline interpolation problems 
[recall (6.87)]: 



4 


1 


• 


• 





1 


4 


1 • 


• 








1 


4 • 


• 











• 


• 4 


1 








• 


• 1 


4 



€R" 



Suppose that D n — det(A„). 

(a) Find D\, D2, Dj, and D4 by direct (hand) calculation using the basic 
rules for computing determinants. 

(b) Show that 



D 



n+2 



4D„ 



D„=0 



TLFeBOOK 



PROBLEMS 289 

(i.e., the determinants we seek may be generated by a second-order dif- 
ference equation). 

(c) For suitable constants a, ft e R, it can be shown that 

D n =a(2 + v / 3)"+ y 8(2- V3f (6.P.4) 

in e N). Find a and [J. 

[Hint: Set up two linear equations in the two unknowns a and fJ using 

(6. P. 4) for n = 1, 2 and the results from (a).] 

(d) Prove that D„ > for all n e N. 

(e) Is A„ > for all n e N? Justify your answer. 

[//mf: Recall part (a) in Problem 4.13 (of Chapter 4).] 

6.17. Repeat Example 6.5, except use 

fM = rh> 

and x € [—5, 5] with nodes Xk — — 5 + \k for k — 0, 1, 2, 3, 4 (i.e., « = 4, 
and /; = |). How do the results compare to the results in Problem 6.8 (assum- 
ing that you have done Problem 6.8)? Of course, you should use MATLAB 
to "do the dirty work." You may use built-in MATLAB linear system solvers. 

6.18. Repeat Example 6.5, except use 



f(x) = Vl+x. 

Use MATLAB to aid in the task. 

6.19. Write a MATLAB function to solve tridiagonal linear systems of equations 
based on the theory for doing so given in Section 6.5. Test your algorithm 
out on the linear systems given in Example 6.5. 



TLFeBOOK 



7 



Nonlinear Systems of Equations 



7.1 INTRODUCTION 

In this chapter we consider the problem of finding x to satisfy the equation 

/0O = (7.1) 

for arbitrary /, but such that f(x) e R. Usually we will restrict our discussion 
to the case where x e R, but x e C can also be of significant interest in practice 
(especially if / is a polynomial). However, any x satisfying (7.1) is called a root 
of the equation. Such an x is also called a zero of the function /. More generally, 
we are also interested in solutions to systems of equations: 

f (xo,xi,...,x„-i) = 0, 
fi(xo,xi, ...,x n -\) = 0, 

fn-l(xo,Xl, ...,X n -l) — 0. (7.2) 

Again, fi(xo, x\, . . . , x n -\) e R, and we will assume Xk e R for all ;', k e Z„. Var- 
ious solution methods will be considered such the bisection method, fixed-point 
method, and the Newton-Raphson method. All of these methods are iterative in 
that they generate a sequence of points (or more generally vectors) that converge 
(we hope) to the desired solution. Consequently, ideas from Chapter 3 are rele- 
vant here. The number of iterations needed to achieve a solution with a given 
accuracy is considered. Iterative procedures can break down (i.e., fail) in various 
ways. The sequence generated by a procedure may diverge, oscillate, or display 
chaotic behavior (which in the past was sometimes described as "wandering behav- 
ior"). Examples of breakdown phenomena will therefore also be considered. Some 
attention will be given to chaotic phenomena, as these are of growing engineering 
interest. Applications of chaos are now being considered for such areas as cryp- 
tography and spread-spectrum digital communications. These proposed applications 
are still controversial, but at the very least some knowledge of chaos gives a deeper 
insight into the behavior of nonlinear dynamic systems. 



An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski 
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc. 

290 



TLFeBOOK 



INTRODUCTION 291 

The equations considered in this chapter are nonlinear, and so are in contrast 
with those of Chapter 4. Chapter 4 considered linear systems of equations only. The 
reader is probably aware of the fact that a linear system either has a unique solution, 
no solution, or an infinite number of solutions. Which case applies depends on the 
size and rank of the matrix in the linear system. Chapter 4 emphasized the handling 
of square and invertible matrices for which the solution exists, and is unique. In 
Chapter 4 we saw that well-defined procedures exist (e.g., Gaussian elimination) 
that give the solution in a finite number of steps. 

The solution of nonlinear equations is significantly more complicated than the 
solution of linear systems. Existence and uniqueness problems typically have no 
easy answers. For example, e 2x + 1 = has no real- valued solutions, but if we allow 
x e C, then x — jknj for which k is an odd integer (since e 2x + 1 = e-' nk + 1 = 
cos(£jt) + j sin(A:7T) + 1 = cos(kjt) = —1 + 1 = as k is odd). On the other hand 

e~' - sin(f) = 

also has an infinite number of solutions, but for which t e R (see Fig. 7.1). How- 
ever, the solutions are not specifiable with a nice formula. In Fig. 7.1 the solutions 
correspond to the point where the two curves intersect each other. 

Polynomial equations are of special interest. 1 For example, x 2 + 1 = has only 
complex solutions x — ±j . Multiple roots are also possible. For example 

x 3 - 3x 2 + 3x - 1 = (x - l) 3 = 

has a real root of multiplicity 3 at x = 1. We remark that the methods of this 
chapter are general and so (in principle) can be applied to find the solution of 
any nonlinear equation, polynomial or otherwise. But the special importance of 
polynomial equations has caused the development (over the centuries) of algorithms 
dedicated to polynomial equation solution. Thus, special algorithms exist to solve 

n 

Pn(.x) = J^p n , k X k = 0, (7.3) 

k=0 

¹Why are polynomial equations of special interest? There are numerous answers to this. But the reader
already knows one reason why from basic electric circuits. For example, an unforced RLC (resistance-
inductance-capacitance) circuit has a response due to energy initially stored in the energy storage
elements (the inductors and capacitors). If x(t) is the voltage drop across an element or the current
through an element, then

a_n d^n x(t)/dt^n + a_{n−1} d^{n−1} x(t)/dt^{n−1} + ··· + a_1 dx(t)/dt + a_0 x(t) = 0.

The coefficients a_k depend on the circuit elements. The solution to the differential equation depends on
the roots of the characteristic equation:

a_n λ^n + a_{n−1} λ^{n−1} + ··· + a_1 λ + a_0 = 0.







Figure 7.1  Plot of the individual terms in the equation f(t) = e^{−t} − sin(t) = 0. The infinite
number of solutions corresponds to the points where the plotted curves intersect.



and these algorithms do not apply to general nonlinear equations. Sometimes 
these are based on the general methods we consider in this chapter. Other times 
completely different methods are employed (e.g., replacing the problem of find- 
ing polynomial zeros by the equivalent problem of finding matrix eigenvalues as 
suggested in Jenkins and Traub [22]). However, we do not consider these special 
polynomial equation solvers here. We only mention that some interesting references 
on this matter are Wilkinson [1] and Cohen [2]. These describe the concept of ill-
conditioned polynomials, and how to apply deflation procedures to produce more 
accurate estimates of roots. The difficulties posed by multiple roots are considered 
in Hull and Mathon [3]. Modern math-oriented software tools (e.g., MATLAB) 
often take advantage of theories such as those described in Refs. 1-3 (as well as in
other sources). In MATLAB polynomial zeros may be found using the roots and 
mroots functions. Function mroots is a modern root finder that reliably determines 
multiple roots. 



7.2 BISECTION METHOD 

The bisection method is a simple intuitive approach to solving

f(x) = 0.    (7.4)

It is assumed that f(x) ∈ R, and that x ∈ R. This method is based on the following
theorem.

Theorem 7.1: Intermediate Value Theorem  If f: [a, b] → R is continuous
on the closed, bounded interval [a, b], and y_0 ∈ R is such that f(a) ≤ y_0 ≤ f(b),
then there is an x_0 ∈ [a, b] such that f(x_0) = y_0. In other words, a continuous
function on a closed and bounded interval takes on all values between f(a) and
f(b) at least once.




The bisection method works as follows. Suppose that we have an initial interval
[a_0, b_0] such that

f(a_0) f(b_0) < 0,    (7.5)

which means that f(a_0) and f(b_0) have opposite signs (i.e., one is positive while
the other is negative). By Theorem 7.1 there must be a p ∈ (a_0, b_0) so that f(p) = 0.
We say that [a_0, b_0] brackets the root p. Suppose that

p_0 = ½(a_0 + b_0).    (7.6)

This is the midpoint of the interval [a_0, b_0]. Consider the following cases:

1. If f(p_0) = 0, then p = p_0 and we have found a root. We may stop at this
point.

2. If f(a_0) f(p_0) < 0, then it must be the case that p ∈ [a_0, p_0], so we define
the new interval [a_1, b_1] = [a_0, p_0]. This new interval brackets the root.

3. If f(p_0) f(b_0) < 0, then it must be the case that p ∈ [p_0, b_0], so we define
the new interval [a_1, b_1] = [p_0, b_0]. This new interval brackets the root.

The process is repeated by considering the midpoint of the new interval, which is

p_1 = ½(a_1 + b_1),    (7.7)

and considering the three cases again. In principle, the process terminates when
case 1 is encountered. In practice, case 1 is unlikely in part because of the effects
of rounding errors, and so we need a more practical criterion to stop the process.
This will be considered a little later on. For now, pseudocode describing the basic
algorithm may be stated as follows:

input [a_0, b_0] which brackets the root p;
p_0 := (a_0 + b_0)/2;
k := 0;
while stopping criterion is not met do begin
    if f(a_k) f(p_k) < 0 then begin
        a_{k+1} := a_k;
        b_{k+1} := p_k;
    end;
    else begin
        a_{k+1} := p_k;
        b_{k+1} := b_k;
    end;
    p_{k+1} := (a_{k+1} + b_{k+1})/2;
    k := k + 1;
end;

When the algorithm terminates, the last value of p_{k+1} computed is an estimate
of p.
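The pseudocode translates directly into a short program. The following is a minimal Python sketch (not from the text) using the relative-change stopping criterion discussed later in this section [see (7.13d)] together with a cap on the number of iterations; the names bisect, eps, and max_iter are arbitrary choices.

    def bisect(f, a, b, eps=1e-6, max_iter=100):
        # Assumes f(a) and f(b) have opposite signs so that [a, b] brackets a root.
        if f(a) * f(b) >= 0:
            raise ValueError("[a, b] does not bracket a root")
        p_old = a
        for _ in range(max_iter):
            p = 0.5 * (a + b)
            if f(p) == 0.0:
                return p
            # Keep the half-interval whose endpoints still have opposite signs.
            if f(a) * f(p) < 0.0:
                b = p
            else:
                a = p
            # Relative-change stopping criterion, cf. (7.13d).
            if p != 0.0 and abs((p - p_old) / p) < eps:
                return p
            p_old = p
        return p

For the problem of Example 7.1 below, bisect(lambda x: x**3 - 5, 1.0, 2.0) should return approximately 1.709976.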




We see that the bisection algorithm constructs a sequence (p_n) = (p_0, p_1, p_2, ...)
such that

lim_{n→∞} p_n = p,    (7.8)

where p_n is the midpoint of [a_n, b_n], and f(p) = 0. Formal proof that this process
works (i.e., yields a unique p such that f(p) = 0) is due to the following theorem.

Theorem 7.2: Cantor's Intersection Theorem  Suppose that ([a_k, b_k]) is a
sequence of closed and bounded intervals such that

[a_0, b_0] ⊃ [a_1, b_1] ⊃ ··· ⊃ [a_n, b_n] ⊃ ···,

with lim_{n→∞}(b_n − a_n) = 0. Then there is a unique point p ∈ [a_n, b_n] for all n ∈ Z^+:

∩_{n=0}^{∞} [a_n, b_n] = {p}.



The bisection method produces (p_n) such that a_n ≤ p_n ≤ b_n, and p ∈ [a_n, b_n] for
all n ∈ Z^+. Consequently, since p_n = ½(a_n + b_n),

|p_n − p| ≤ |b_n − a_n| ≤ (b − a)/2^n    (7.9)

for n ∈ Z^+, so that lim_{n→∞} p_n = p. Recall that f is assumed continuous on [a, b],
so lim_{n→∞} f(p_n) = f(p). So now observe that

|p_n − a_n| ≤ (1/2^{n+1})|b − a|,   |b_n − p_n| ≤ (1/2^{n+1})|b − a|,    (7.10)

so, via |x − y| = |(x − z) − (y − z)| ≤ |x − z| + |y − z| (triangle inequality), we
have

|p − a_n| ≤ |p − p_n| + |p_n − a_n| ≤ (b − a)/2^n + (b − a)/2^{n+1} = (3/2^{n+1})(b − a),    (7.11a)

and similarly

|p − b_n| ≤ |p − p_n| + |p_n − b_n| ≤ (3/2^{n+1})(b − a).    (7.11b)

Thus

lim_{n→∞} a_n = lim_{n→∞} b_n = p.    (7.12)

At each step the root p is bracketed. This implies that there is a subsequence [of (p_n)]
denoted (x_n) converging to p so that f(x_n) ≥ 0 for all n ∈ Z^+. Similarly, there is a
subsequence (y_n) converging to p such that f(y_n) ≤ 0 for all n ∈ Z^+. Thus

f(p) = lim_{n→∞} f(x_n) ≥ 0,   f(p) = lim_{n→∞} f(y_n) ≤ 0,






which implies that f(p) = 0. We must conclude that the bisection method produces
a sequence (p_n) converging to p such that f(p) = 0. Hence, the bisection method
always works.

Example 7.1  We want to find 5^{1/3} (cube root of five). This is equivalent to
solving the equation f(x) = x³ − 5 = 0. We note that f(1) = −4 and f(2) = 3,
so we may use [a_0, b_0] = [a, b] = [1, 2] to initially bracket the root. We remark
that the "exact" value is 5^{1/3} = 1.709976 (seven significant figures). Consider the
following iterations of the bisection method:

[a_0, b_0] = [1, 2],        p_0 = 1.500000,   f(p_0) = −1.625000
[a_1, b_1] = [p_0, b_0],    p_1 = 1.750000,   f(p_1) =  0.359375
[a_2, b_2] = [a_1, p_1],    p_2 = 1.625000,   f(p_2) = −0.708984
[a_3, b_3] = [p_2, b_2],    p_3 = 1.687500,   f(p_3) = −0.194580
[a_4, b_4] = [p_3, b_3],    p_4 = 1.718750,   f(p_4) =  0.077362
[a_5, b_5] = [a_4, p_4],    p_5 = 1.703125,   f(p_5) = −0.059856

We see that |p_5 − p| = |1.703125 − 1.709976| = 0.006851. From (7.9),

|p_5 − p| ≤ (b − a)/2^5 = (2 − 1)/32 = 0.0312500.

The exact error certainly agrees with this bound.

When should we stop iterating? In other words, what stopping criterion should
be chosen? Some possibilities are (for ε > 0)

½(b_n − a_n) < ε,    (7.13a)

|p_n − p| < ε,    (7.13b)

|f(p_n)| < ε,    (7.13c)

|(p_n − p_{n−1})/p_n| < ε   (p_n ≠ 0).    (7.13d)

We would stop iterating when the chosen inequality is satisfied. Usually (7.13d) is
recommended. Condition (7.13a) is not so good, as termination depends on the size
of the nth interval, while it is the accuracy of the estimate of p that is of most
interest. Condition (7.13b) requires knowing p in advance, which is not reasonable
since it is p that we are trying to determine. Condition (7.13c) is based on f(p_n),
and again we are more interested in how well p_n approximates p. Thus, we are left
with (7.13d). This condition leads to termination when p_n is not much different
from p_{n−1} in a relative sense.




How may we characterize the computational efficiency of an iterative algorithm? 
In Chapter 4 the algorithms terminated in a finite number of steps (with the excep- 
tion of the iterative procedures suggested in Section 4.7 which do not terminate 
unless a stopping condition is imposed), so flop counting was a reasonable mea- 
sure. Where iterative procedures such as the bisection method are concerned, we 
prefer 

Definition 7.1: Rate of Convergence  Suppose that (x_n) converges to 0, that is,
lim_{n→∞} x_n = 0, and suppose that (p_n) converges to p, i.e., lim_{n→∞} p_n = p. If there
is a K ∈ R with K > 0, and an N ∈ Z^+ such that

|p_n − p| ≤ K|x_n|

for all n ≥ N, then we say that (p_n) converges to p with rate of convergence O(x_n).

This is an alternative use of the "big O" notation that was first seen in Chapter 4.
Recalling (7.9), we obtain

|p_n − p| ≤ (b − a)(1/2^n),    (7.14)

so that K = b − a, x_n = 1/2^n, and N = 0. Thus, the bisection method generates a
sequence (p_n) that converges to p (with f(p) = 0) at the rate O(1/2^n).

From (7.14), if we want |p_n − p| < ε, then we may choose n so that

|p_n − p| ≤ (b − a)/2^n < ε,

implying 2^n > (b − a)/ε, or in other words we may choose

n = ⌈log₂((b − a)/ε)⌉,    (7.15)

where ⌈x⌉ is the smallest integer greater than or equal to x. This can be used as an
alternative means to terminate the bisection algorithm. But the conservative nature
of (7.15) suggests that the algorithm that employs it may compute more iterations
than are really necessary for the desired accuracy.
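For instance, a minimal Python sketch (not from the text) of how (7.15) might be evaluated; the function name bisection_iterations and the sample numbers are placeholders only:

    import math

    def bisection_iterations(a, b, eps):
        # Number of bisection steps guaranteeing |p_n - p| < eps, cf. (7.15).
        return math.ceil(math.log2((b - a) / eps))

    # For [a, b] = [1, 2] and eps = 0.05 this gives ceil(log2(20)) = 5 steps.
    print(bisection_iterations(1.0, 2.0, 0.05))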



7.3 FIXED-POINT METHOD 

Here we consider the Banach fixed-point theorem [4] as the theoretical basis for a
nonlinear equation solver. Suppose that X is a set, and T: X → X is a mapping of
X into itself. A fixed point of T is an x ∈ X such that

Tx = x.    (7.16)




For example, suppose X = [0, 1] ⊂ R, and

Tx = ½x(1 − x).    (7.17)

We certainly have Tx ∈ [0, 1] for any x ∈ X. The solution to

x = ½x(1 − x)

(i.e., to Tx = x) is x = 0. So T has fixed point x = 0. (We reject the solution
x = −1 since −1 ∉ [0, 1].)

Definition 7.2: Contraction  Let X = (X, d) be a metric space. Mapping
T: X → X is called a contraction (or a contraction mapping, or a contractive
mapping) on X if there is an α ∈ R such that 0 ≤ α < 1, and for all x, y ∈ X

d(Tx, Ty) ≤ α d(x, y).    (7.18)

Applying T to "points" x and y brings them closer together if T is a contractive
mapping. If α = 1, the mapping is sometimes called nonexpansive.

Theorem 7.3: Banach Fixed-Point Theorem  Consider the metric space X =
(X, d), where X ≠ ∅. Suppose that X is complete, and T: X → X is a contraction
on X. Then T has a unique fixed point.

Proof  We must construct (x_n), and show that it is Cauchy so that (x_n) con-
verges in X (recall Section 3.2). Then we prove that x is the only fixed point of
T. Suppose that x_0 ∈ X is any point from X. Consider the sequence produced by
the repeated application of T to x_0:

x_0,  x_1 = Tx_0,  x_2 = Tx_1 = T²x_0,  ...,  x_n = Tx_{n−1} = T^n x_0, ....    (7.19)

From (7.19) and (7.18), we obtain

d(x_{k+1}, x_k) = d(Tx_k, Tx_{k−1})
              ≤ α d(x_k, x_{k−1})
              = α d(Tx_{k−1}, Tx_{k−2})
              ≤ α² d(x_{k−1}, x_{k−2})
              ⋮
              ≤ α^k d(x_1, x_0).    (7.20)




Using the triangle inequality with n > k, from (7.20)

d(x_k, x_n) ≤ d(x_k, x_{k+1}) + d(x_{k+1}, x_{k+2}) + ··· + d(x_{n−1}, x_n)
           ≤ (α^k + α^{k+1} + ··· + α^{n−1}) d(x_0, x_1)
           = α^k (1 − α^{n−k})/(1 − α) d(x_0, x_1)

(where the last equality follows by application of the formula for geometric series,
seen very frequently in previous chapters). Since 0 ≤ α < 1, we have 1 − α^{n−k} < 1.
Thus

d(x_k, x_n) ≤ α^k/(1 − α) d(x_0, x_1)    (7.21)

(n > k). Since 0 ≤ α < 1, 0 < 1 − α ≤ 1, too. In addition, d(x_0, x_1) is fixed (since
we have chosen x_0). We may make the right-hand side of (7.21) arbitrarily small
by making k sufficiently big (keeping n > k). Consequently, (x_n) is Cauchy. X is
assumed to be complete, so x_n → x ∈ X.

From the triangle inequality and (7.18)

d(Tx, x) ≤ d(x, x_k) + d(x_k, Tx)
         ≤ d(x, x_k) + α d(x_{k−1}, x)   (since x_k = Tx_{k−1})
         < ε   (any ε > 0)

if k → ∞, since x_n → x. Consequently, d(x, Tx) = 0, which implies that Tx = x [recall
(M1) in Definition 1.1]. Immediately, x is a fixed point of T.

Point x is unique. Let us assume that there is another fixed point x̃, i.e., T x̃ = x̃.
But from (7.18)

d(x, x̃) = d(Tx, T x̃) ≤ α d(x, x̃),

implying d(x, x̃) = 0 because 0 ≤ α < 1. Thus x = x̃ [(M1) from Definition 1.1
again].

Theorem 7.3 is also called the contraction theorem, a special instance of which 
was seen in Section 4.7. The theorem applies to complete metric spaces, and so is 
applicable to Banach and Hilbert spaces. (Recall that inner products induce norms, 
and norms induce metrics. Hilbert and Banach spaces are complete. So they must 
be complete metric spaces as well.) 

Note that we define T⁰x = x for any x ∈ X. Thus T⁰ is the identity mapping
(identity operator).

Corollary 7.1  Under the conditions of Theorem 7.3, the sequence (x_n) from x_n =
T^n x_0 [i.e., the sequence in (7.19)] for any x_0 ∈ X converges to a unique x ∈ X
such that Tx = x. We have the following error estimates (i.e., bounds):






1. Prior estimate:

   d(x_k, x) ≤ α^k/(1 − α) d(x_0, x_1).    (7.22a)

2. Posterior estimate:

   d(x_k, x) ≤ α/(1 − α) d(x_k, x_{k−1}).    (7.22b)

Proof  The prior estimate is an immediate consequence of Theorem 7.3 since
it lies within the proof of this theorem [let n → ∞ in (7.21)].

Now consider (7.22b). In (7.22a) let k = 1, y_0 = x_0, and y_1 = x_1. Thus,
d(x_1, x) = d(y_1, x), d(x_0, x_1) = d(y_0, y_1), and so

d(y_1, x) ≤ α/(1 − α) d(y_0, y_1).    (7.23)

Let y_0 = x_{k−1}, so y_1 = Ty_0 = Tx_{k−1} = x_k, and (7.23) becomes

d(x_k, x) ≤ α/(1 − α) d(x_k, x_{k−1}),

which is (7.22b).

The error bounds in Corollary 7.1 are useful in application of the contraction the- 
orem to computational problems. For example, (7.22a) can estimate the number of 
iteration steps needed to achieve a given accuracy in the estimate of the solution to 
a nonlinear equation. The result would be analogous to Eq. (7.15) in the previous 
section. 

A difficulty with the theory so far is that T: X → X is not always contractive
over the entire metric space X, but only on a subset, say, Y ⊂ X. However, a basic
result from functional analysis is that any closed subset of a complete space X is
complete. Thus, for any closed Y ⊂ X on which T is contractive, there is a fixed point
x ∈ Y ⊂ X, and x_n → x (x_n = T^n x_0 for suitable x_0 ∈ Y). For this idea to work, we
must choose x_0 ∈ Y so that x_n ∈ Y for all n ∈ Z^+.

What does "closed subset" (formally) mean? A neighborhood of x ∈ X (metric
space X) is the set N_ε(x) = {y | d(x, y) < ε} ⊂ X, where ε > 0. Parameter ε is the radius
of the neighborhood. We say that x is a limit point of Y ⊂ X if every neighborhood
of x contains a y ≠ x such that y ∈ Y. Y is closed if every limit point of Y belongs
to Y. These definitions are taken from Rudin [5, p. 32].

Example 7.2  Suppose that X = R, Y = (0, 1) ⊂ X. We recall that X is a
complete metric space if d(x, y) = |x − y| (x, y ∈ X). Is Y closed? Limit points
of Y are x = 0 and x = 1. [Any x ∈ (0, 1) is also a limit point of Y.] But 0 ∉ Y,
and 1 ∉ Y. Therefore, Y is not closed. On the other hand, Y = [0, 1] is a closed
subset of X since the limit points are now in Y.




Fixed-point theorems (and related ideas) such as we have considered so far have 
a large application range. They can be used to prove the existence and uniqueness 
of solutions to both integral and differential equations [4, pp. 314-326]. They can 
also provide (sometimes) computational algorithms for their solution. Fixed-point 
results can have applications in signal reconstruction and image processing [6], 
digital filter design [7], the interpolation of bandlimited sequences [8], and the 
solution to so-called convex feasibility problems in general [9]. However, we will 
consider the application of fixed-point theorems only to the problem of solving 
nonlinear equations. 

If we wish to find p ∈ R so that

f(p) = 0,    (7.24)

then we may define g(x) (g: R → R) with fixed point p such that

g(x) = x − f(x),    (7.25)

and we then see g(p) = p − f(p) = p ⟹ f(p) = 0. Conversely, if there is a
function g(x) such that

g(p) = p,    (7.26)

then

f(x) = x − g(x)    (7.27)

will have a zero at x = p. So, if we wish to solve f(x) = 0, one approach would
be to find a suitable g(x) as in (7.27) [or (7.25)], and find fixed points for it.

Theorem 7.3 informs us about the existence and uniqueness of fixed points for
mappings on complete metric spaces (and ultimately on closed subsets of such
spaces). Furthermore, the theorem leads us to a well-defined computational algo-
rithm to find fixed points. At the outset, space X = R with metric d(x, y) = |x − y|
is complete, and for us g and f are mappings on X. So if g and f are related
according to (7.27), then any fixed point of g will be a zero of f, and can be
found by iterating as spelled out in Theorem 7.3. However, the discussion follow-
ing Corollary 7.1 warned us that g may not be contractive on X = R but only on
some subset Y ⊂ X. In fact, g is only rarely contractive on all of R. We therefore
usually need to find Y = [a, b] ⊂ X = R such that g is contractive on Y, and then
we compute x_n = g^n x_0 ∈ Y for n ∈ Z^+. Then x_n → x, and f(x) = 0. We again
emphasize that x_n ∈ Y is necessary for all n. If g is contractive on [a, b] ⊂ X,
then, from Definition 7.2, this means that for any x, y ∈ [a, b]

|g(x) − g(y)| ≤ α|x − y|    (7.28)

for some real-valued α such that 0 ≤ α < 1.

Example 7.3  Suppose that we want roots of f(x) = λx² + (1 − λ)x = 0
(assume λ > 0). Of course, this is a quadratic equation and so may be easily






solved by the usual formula for the roots of such an equation (recall Section 4.2).
However, this example is an excellent illustration of the behavior of fixed-point
schemes. It is quite easy to verify that

f(x) = x − λx(1 − x),    (7.29)

so here g(x) = λx(1 − x). We observe that g(x) is a quadratic in x, and also

g^(1)(x) = λ(1 − 2x),

so that g(x) has a maximum at x = ½ for which g(½) = λ/4. Therefore, if we
allow only 0 < λ ≤ 4, then g(x) ∈ [0, 1] for all x ∈ [0, 1]. Certainly [0, 1] ⊂ R. A
sketch of g(x) and of y = x appears in Fig. 7.2 for various λ. The intersection of
these two curves locates the fixed points of g on [0, 1].

Figure 7.2  Plot of g(x) = λx(1 − x) for various λ, and a plot of y = x. The places where
y = x and g(x) intersect define the fixed points of g(x).

We will suppose Y = [0, 1]. Although g(x) ∈ [0, 1] for all x ∈ [0, 1] under the
stated conditions, g is not necessarily always contractive on the closed interval
[0, 1]. Suppose λ = ½; then, for all x ∈ [0, 1], g(x) ∈ [0, ⅛] (see the dotted line
in Fig. 7.2; we can calculate this from knowledge of g). The mapping is
contractive for this case on all of [0, 1]. [This is justified below with respect to
Eq. (7.30).] Also, g(x) = x only for x = 0. If we select any x_0 ∈ [0, 1], then, for




x_n = g^n x_0, we can expect x_n → 0. For example, suppose x_0 = 0.7500; then the
first few iterates are

x_0 = 0.7500
x_1 = 0.5x_0(1 − x_0) = 0.0938
x_2 = 0.5x_1(1 − x_1) = 0.0425
x_3 = 0.5x_2(1 − x_2) = 0.0203
x_4 = 0.5x_3(1 − x_3) = 0.0100
x_5 = 0.5x_4(1 − x_4) = 0.0049
⋮

The process is converging to the unique fixed point at x = 0.

Suppose now that λ = 2; then g(x) = 2x(1 − x), and now g(x) has two fixed
points on [0, 1], which are x = 0 and x = ½. For all x ∈ [0, 1] we have g(x) ∈
[0, ½], but g is not contractive on [0, 1]. For example, suppose x = 0.8, y = 0.9;
then g(0.8) = 0.3200, and g(0.9) = 0.1800. Thus, |x − y| = 0.1, but |g(x) − g(y)| =
0.14 > 0.1. On the other hand, suppose that x_0 = 0.7500; then the first few iter-
ates are

x_0 = 0.7500
x_1 = 2x_0(1 − x_0) = 0.3750
x_2 = 2x_1(1 − x_1) = 0.4688
x_3 = 2x_2(1 − x_2) = 0.4980
x_4 = 2x_3(1 − x_3) = 0.5000
x_5 = 2x_4(1 − x_4) = 0.5000
⋮

This process converges to one of the fixed points of g even though g is not
contractive on [0, 1].

From (7.28)

α ≥ |g(x) − g(y)| / |x − y|    (x ≠ y),

from which, if we substitute g(x) = λx(1 − x) and g(y) = λy(1 − y), then

α = sup λ|1 − x − y|.    (7.30)

If λ = ½, then α = ½. If λ = 2, we cannot have an α such that 0 ≤ α < 1. If
g(x) = 2x(1 − x) (i.e., λ = 2 again), but now instead Y = [0.4, 0.6], then, for all






x ∈ Y = [0.4, 0.6], we have g(x) ∈ [0.48, 0.50] ⊂ Y, implying g: Y → Y. From
(7.30) we have for this situation α = 0.4. The mapping g is contractive on Y.
Suppose x_0 = 0.450000; then

x_0 = 0.450000
x_1 = 2x_0(1 − x_0) = 0.495000
x_2 = 2x_1(1 − x_1) = 0.499950
x_3 = 2x_2(1 − x_2) = 0.500000
x_4 = 2x_3(1 − x_3) = 0.500000
⋮

The reader ought to compare the results of this example to the error bounds from
Corollary 7.1 as an exercise.
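The computations in Examples 7.3 and 7.4 all follow the same pattern, which is easy to program. The following minimal Python sketch (not from the text) iterates x_{n+1} = g(x_n) with a relative-change stopping rule; the names fixed_point, eps, and max_iter are arbitrary:

    def fixed_point(g, x0, eps=1e-6, max_iter=100):
        # Iterates x_{n+1} = g(x_n); for guaranteed convergence, g must map the
        # chosen interval Y into itself and be contractive on Y.
        x = x0
        for _ in range(max_iter):
            x_new = g(x)
            if x_new != 0.0 and abs((x_new - x) / x_new) < eps:
                return x_new
            x = x_new
        return x

    # Example 7.3 with lambda = 2 and a starting point in Y = [0.4, 0.6]:
    print(fixed_point(lambda x: 2.0 * x * (1.0 - x), 0.45))   # tends to 0.5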

More generally, and again from (7.28), we have (for x, y ∈ Y = [a, b] ⊂ R)

α = sup_{x≠y} |g(x) − g(y)| / |x − y|.    (7.31)

Now recall the mean-value theorem (i.e., Theorem 3.3). If g(x) is continuous on
[a, b], and g^(1)(x) is continuous on (a, b), then there is a ξ ∈ (a, b) such that

g^(1)(ξ) = [g(b) − g(a)] / (b − a).

Consequently, instead of (7.31) we may use

α = sup_{x∈(a,b)} |g^(1)(x)|.    (7.32)



Example 7.4  We may use (7.32) to rework some of the results in Example 7.3.
Since g^(1)(x) = λ(1 − 2x), if Y = [0, 1] and λ = ½, then

α = sup_{x∈(0,1)} |½ − x| = ½.

If λ = 2, then α = 2. If now Y = [0.4, 0.6] with λ = 2, then

α = sup_{x∈(0.4,0.6)} |4x − 2| = 0.4.

Now suppose that λ = 2.8, and consider Y = [0.61, 0.67], which contains a fixed
point of g (see Fig. 7.2, which contains a curve for this case). We have

α = sup_{x∈(0.61,0.67)} |5.6x − 2.8| = 0.9520,




and if x ∈ Y, then g(x) ∈ [0.619080, 0.666120] ⊂ Y so that g: Y → Y, and so g is
contractive on Y. Thus, we consider the iterates

x_0 = 0.650000
x_1 = 2.8x_0(1 − x_0) = 0.637000
x_2 = 2.8x_1(1 − x_1) = 0.647447
x_3 = 2.8x_2(1 − x_2) = 0.639126
x_4 = 2.8x_3(1 − x_3) = 0.645803
x_5 = 2.8x_4(1 − x_4) = 0.640476
⋮

The true fixed point (to six significant figures) is x = 0.642857 (i.e., x_n → x). We
may check these numbers against the bounds of Corollary 7.1. Therefore, we con-
sider the distances

d(x_5, x) = 0.002381,   d(x_0, x_1) = 0.013000,   d(x_5, x_4) = 0.005327,

and from (7.22a)

d(x_5, x) ≤ α^5/(1 − α) d(x_0, x_1) = 0.211780,

and from (7.22b)

d(x_5, x) ≤ α/(1 − α) d(x_5, x_4) = 0.105652.

These error bounds are very loose, but they are nevertheless consistent with the
true error d(x_5, x) = 0.002381.

We have worked with g(x) — Xx(l — x) in the previous two examples 
(Example 7.3 and Example 7.4). But this is not the only possible choice. It may 
be better to make other choices. 

Example 7.5  Again, assume f(x) = λx² + (1 − λ)x = 0 as in the previous
two examples. Observe that

λx² = (λ − 1)x

implies that

x = [(λ − 1)/λ]^{1/2} x^{1/2} = g(x).    (7.33)




If λ = 4, then f(x) = 0 for x = 0 and for x = ¾. For g(x) in (7.29), g^(1)(x) =
4(1 − 2x), so |g^(1)(¾)| = 2. We cannot find a closed interval Y containing x = ¾ on
which g is contractive with g: Y → Y (the slope of the curve g(x) is too steep in
the vicinity of the fixed point). But if we choose (7.33) instead, then

g^(1)(x) = (√3/4) x^{−1/2},

and if Y = [0.7, 0.8], then for x ∈ Y, we have g(x) ∈ [0.7246, 0.7746] ⊂ Y, and

α = sup_{x∈(0.7,0.8)} (√3/4)(1/√x) = 0.5175.

So, g in (7.33) is contractive on Y. We observe the iterates

x_0 = 0.7800
x_1 = (√3/2) x_0^{1/2} = 0.7649
x_2 = (√3/2) x_1^{1/2} = 0.7574
x_3 = (√3/2) x_2^{1/2} = 0.7537
x_4 = (√3/2) x_3^{1/2} = 0.7518
x_5 = (√3/2) x_4^{1/2} = 0.7509,

which converge to x = ¾ (i.e., x_n → x).



7.4 NEWTON-RAPHSON METHOD 



This method is yet another iterative approach to finding roots of nonlinear equations. 
In fact, it is a version of the fixed-point method that was considered in the previous 
section. However, it is of sufficient importance to warrant separate consideration 
within its own section. 

7.4.1 The Method 

One way to derive the Newton-Raphson method is by a geometric approach. The 
method attempts to solve 

f(x) = (7.34) 






Figure 7.3  Geometric interpretation of the Newton-Raphson method. The line y − f(p_n) =
f^(1)(p_n)(x − p_n) is tangent to the curve y = f(x) at the point (p_n, f(p_n)).



(x ∈ R, and f(x) ∈ R) by approximating the root p (i.e., f(p) = 0) by a succession
of x intercepts of tangent lines to the curve y = f(x) at the current approximation
to p. Specifically, if p_n is the current estimate of p, then the tangent line to the
point (p_n, f(p_n)) on the curve y = f(x) is

y = f(p_n) + f^(1)(p_n)(x − p_n)    (7.35)

(see Fig. 7.3). The next approximation to p is x = p_{n+1} such that y = 0; that is,
from (7.35)

0 = f(p_n) + f^(1)(p_n)(p_{n+1} − p_n),

so for n ∈ Z^+

p_{n+1} = p_n − f(p_n)/f^(1)(p_n).    (7.36)

To start this process off, we need an initial guess at p; that is, we need to select p_0.
Clearly this approach requires f^(1)(p_n) ≠ 0 for all n; otherwise the process will
terminate. Continuation will not be possible, except perhaps by choosing a new
starting point p_0. This is one of the ways in which the method can break down.
It might be called premature termination. (Breakdown phenomena are discussed
further later in the text.)

Another derivation of (7.36) is as follows. Recall the theory of Taylor series
from Section 3.5. Suppose that f(x), f^(1)(x), and f^(2)(x) are all continuous on
[a, b]. Suppose that p is a root of f(x) = 0, and that p_n approximates p. From
Taylor's theorem [i.e., (3.71)]

f(x) = f(p_n) + f^(1)(p_n)(x − p_n) + ½ f^(2)(ξ)(x − p_n)²,    (7.37)




where ξ = ξ(x) ∈ (p_n, x). Since f(p) = 0, from (7.37) we have

0 = f(p) = f(p_n) + f^(1)(p_n)(p − p_n) + ½ f^(2)(ξ)(p − p_n)².    (7.38)

If |p − p_n| is small, we may neglect the last term of (7.38), and hence

0 ≈ f(p_n) + f^(1)(p_n)(p − p_n),

implying

p ≈ p_n − f(p_n)/f^(1)(p_n).    (7.39)

We treat the right-hand side of (7.39) as p_{n+1}, the next approximation to p. Thus,
we again arrive at (7.36). The assumption that (p − p_n)² is negligible is important.
If p_n is not close enough to p, then the method may not converge. In particular,
the choice of starting point p_0 is important.
the choice of starting point po is important. 

As already mentioned, the Newton-Raphson method is a special instance of the
fixed-point iteration method, where

g(x) = x − f(x)/f^(1)(x).    (7.40)

So

p_{n+1} = g(p_n)    (7.41)

for n ∈ Z^+. Stopping criteria are (for ε > 0)

|p_n − p_{n−1}| < ε,    (7.42a)

|f(p_n)| < ε,    (7.42b)

|(p_n − p_{n−1})/p_n| < ε,   p_n ≠ 0.    (7.42c)

As with the bisection method, we prefer (7.42c).
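A minimal Python sketch of the iteration (7.36) with stopping rule (7.42c), and with guards for the breakdown cases discussed later in this section, is as follows (this is illustrative only; the names newton, df, eps, and max_iter are not from the text):

    def newton(f, df, p0, eps=1e-8, max_iter=50):
        # Newton-Raphson iteration p_{n+1} = p_n - f(p_n)/f'(p_n), cf. (7.36).
        p = p0
        for _ in range(max_iter):
            d = df(p)
            if d == 0.0:
                raise ZeroDivisionError("f'(p_n) = 0: premature termination")
            p_new = p - f(p) / d
            if p_new != 0.0 and abs((p_new - p) / p_new) < eps:   # criterion (7.42c)
                return p_new
            p = p_new
        raise RuntimeError("no convergence within max_iter iterations")

    # Cube root of 5 (cf. Example 7.1):
    print(newton(lambda x: x**3 - 5.0, lambda x: 3.0 * x**2, 2.0))   # about 1.709976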

Theorem 7.4: Convergence Theorem for the Newton-Raphson Method
Let f be continuous on [a, b]. Let f^(1)(x) and f^(2)(x) exist and be continuous for
all x ∈ (a, b). If p ∈ [a, b] with f(p) = 0, and f^(1)(p) ≠ 0, then there is a δ > 0
such that (7.36) generates a sequence (p_n) with p_n → p for any p_0 ∈ [p − δ, p + δ].

Proof  We have

p_{n+1} = g(p_n),   n ∈ Z^+

with g(x) = x − f(x)/f^(1)(x). We need Y = [p − δ, p + δ] ⊂ R with g: Y → Y, and g
contractive on Y. (Then we may immediately apply the convergence results from
the previous section.)

Since f^(1)(p) ≠ 0, and since f^(1)(x) is continuous at p, there will be a δ_1 > 0
such that f^(1)(x) ≠ 0 for all x ∈ [p − δ_1, p + δ_1] ⊂ [a, b], so g is defined and
continuous for x ∈ [p − δ_1, p + δ_1]. Also

g^(1)(x) = 1 − {[f^(1)(x)]² − f(x) f^(2)(x)}/[f^(1)(x)]² = f(x) f^(2)(x)/[f^(1)(x)]²

for x ∈ [p − δ_1, p + δ_1]. In addition, f^(2)(x) is continuous on (a, b), so g^(1)(x) is
continuous on [p − δ_1, p + δ_1]. We assume f(p) = 0, so

g^(1)(p) = f(p) f^(2)(p)/[f^(1)(p)]² = 0.

We have g^(1)(x) continuous at x = p, implying that lim_{x→p} g^(1)(x) = g^(1)(p) = 0,
so there is a δ > 0 with 0 < δ < δ_1 such that

|g^(1)(x)| ≤ α < 1    (7.43)

for x ∈ [p − δ, p + δ] = Y, for some α such that 0 ≤ α < 1. If x ∈ Y, then by the
mean-value theorem there is a ξ ∈ (x, p) such that

|g(x) − p| = |g(x) − g(p)| = |g^(1)(ξ)| |x − p|
           ≤ α|x − p| < |x − p| ≤ δ,

so that we have g(x) ∈ Y for all x ∈ Y. That is, g: Y → Y. Because of (7.43)

α = sup_{x∈Y} |g^(1)(x)|,

and 0 ≤ α < 1, so that g is contractive on Y. Immediately, the sequence (p_n) from
p_{n+1} = g(p_n) for any p_0 ∈ Y converges to p (i.e., p_n → p).

Essentially, the Newton-Raphson method is guaranteed to converge to a root if
p_0 is close enough to it and f is sufficiently smooth. Theorem 7.4 is weak because
it does not specify how to select p_0.

The Newton-Raphson method, if it converges, tends to do so quite quickly.
However, the method needs f^(1)(p_n) as well as f(p_n). It might be the case that
f^(1)(p_n) requires much effort to evaluate. The secant method is a variation on the
Newton-Raphson method that replaces f^(1)(p_n) with an approximation. Specifi-
cally, since

f^(1)(p_n) = lim_{x→p_n} [f(x) − f(p_n)]/(x − p_n),






if p_{n−1} ≈ p_n, then

f^(1)(p_n) ≈ [f(p_{n−1}) − f(p_n)]/(p_{n−1} − p_n).    (7.44)

This is the slope of the chord that connects the points (p_{n−1}, f(p_{n−1})) and
(p_n, f(p_n)) on the graph of f(x). We may substitute (7.44) into (7.36), obtaining
(for n ∈ N)

p_{n+1} = p_n − (p_n − p_{n−1}) f(p_n)/[f(p_n) − f(p_{n−1})].    (7.45)

We need p_0 and p_1 to initialize the iteration process in (7.45). Usually these are
chosen to bracket the root, but this does not guarantee convergence. If the method
does converge, it tends to do so more slowly than the Newton-Raphson method, but
this is the penalty to be paid for avoiding the computation of derivatives f^(1)(x).
We remark that the method of false position is a modification of the secant method 
that is guaranteed to converge because successive approximations to p (i.e., p n -\ 
and p n ) are chosen to always bracket the root. But it is possible that convergence 
may be slow. We do not cover this method in this book. We merely mention that 
it appears elsewhere in the literature [10-12]. 
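A minimal Python sketch (not from the text) of the secant iteration (7.45); p_0 and p_1 would normally be chosen to bracket the root, and the names secant, eps, and max_iter are arbitrary:

    def secant(f, p0, p1, eps=1e-8, max_iter=100):
        # Secant iteration (7.45): the derivative is replaced by a chord slope.
        for _ in range(max_iter):
            denom = f(p1) - f(p0)
            if denom == 0.0:
                raise ZeroDivisionError("f(p_n) = f(p_{n-1}): cannot continue")
            p2 = p1 - (p1 - p0) * f(p1) / denom
            if p2 != 0.0 and abs((p2 - p1) / p2) < eps:
                return p2
            p0, p1 = p1, p2
        raise RuntimeError("no convergence within max_iter iterations")

    print(secant(lambda x: x**3 - 5.0, 1.0, 2.0))   # about 1.709976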



7.4.2 Rate of Convergence Analysis 

What can be said about the speed of convergence of fixed-point schemes in gen-
eral, and of the Newton-Raphson method in particular? Consider the following
definition.

Definition 7.3:  Suppose that (p_n) is such that p_n → p with p_n ≠ p (n ∈ Z^+).
If there are λ > 0 and δ > 0 such that

lim_{n→∞} |p_{n+1} − p| / |p_n − p|^δ = λ,

then we say that (p_n) converges to p of order δ with asymptotic error constant λ.
Additionally

• If δ = 1, we say that (p_n) is linearly convergent.

• If δ > 1, we have superlinear convergence.

• If δ = 2, we have quadratic convergence.

From this definition, if n is big enough, then

|p_{n+1} − p| ≈ λ|p_n − p|^δ.    (7.46)

Thus, we would like δ to be large and λ to be small for fast convergence. Since
δ is in the exponent, this parameter is more important than λ for determining the
rate of convergence.




Consider the fixed-point iteration

p_{n+1} = g(p_n),

where g satisfies the requirements of a contraction mapping (so the Banach fixed-
point theorem applies). We can therefore say that

lim_{n→∞} |p_{n+1} − p| / |p_n − p| = lim_{n→∞} |g(p_n) − g(p)| / |p_n − p|,    (7.47)

so from the mean-value theorem we have ξ_n between p_n and p for which

g(p_n) − g(p) = g^(1)(ξ_n)(p_n − p),

which, if used in (7.47), implies that

lim_{n→∞} |p_{n+1} − p| / |p_n − p| = lim_{n→∞} |g^(1)(ξ_n)|.    (7.48)

Because ξ_n is between p_n and p, and p_n → p, we must have ξ_n → p as well. Also,
we will assume g^(1)(x) is continuous at p, so (7.48) now becomes

lim_{n→∞} |p_{n+1} − p| / |p_n − p| = |g^(1)(p)|,    (7.49)

which is a constant. Applying Definition 7.3, we conclude that the fixed-point
method is typically only linearly convergent because δ = 1, and the asymptotic
error constant is λ = |g^(1)(p)|, provided g^(1)(p) ≠ 0. However, if g^(1)(p) = 0, we
expect faster convergence. It turns out that this is often the case for the Newton-
Raphson method, as will now be demonstrated.

The iterative scheme for the Newton-Raphson method is (7.36), which is a
particular case of fixed-point iteration where now

g(x) = x − f(x)/f^(1)(x),    (7.50)

and for which

g^(1)(x) = f(x) f^(2)(x)/[f^(1)(x)]²,    (7.51)

so that g^(1)(p) = 0 because f(p) = 0. Thus, superlinear convergence is anticipated
for this particular fixed-point scheme. Suppose that we have the Taylor expansion

g(x) = g(p) + g^(1)(p)(x − p) + ½ g^(2)(ξ)(x − p)²,




for which ξ is between p and x. Since g(p) = p and g^(1)(p) = 0, this becomes

g(x) = p + ½ g^(2)(ξ)(x − p)².

For x = p_n, this in turn becomes

p_{n+1} = g(p_n) = p + ½ g^(2)(ξ_n)(p − p_n)²    (7.52)

for which ξ_n lies between p and p_n. Equation (7.52) can be rearranged as

p_{n+1} − p = ½ g^(2)(ξ_n)(p − p_n)²,

and so

lim_{n→∞} |p_{n+1} − p| / |p_n − p|² = ½ |g^(2)(p)|,    (7.53)

since ξ_n → p, and we are also assuming that g^(2)(x) is continuous at p. Imme-
diately, δ = 2, and the Newton-Raphson method is quadratically convergent. The
asymptotic error constant is plainly equal to ½|g^(2)(p)|, provided g^(2)(p) ≠ 0. If
g^(2)(p) = 0, then an even higher order of convergence may be expected. It is
emphasized that these convergence results depend on g(x) being smooth enough.
Given this, since the convergence is at least quadratic, the number of accurate dec-
imal digits in the approximation to a root approximately doubles at every iteration.

7.4.3 Breakdown Phenomena 

The Newton-Raphson method can fail to converge (i.e., break down) in vari- 
ous ways. Premature termination was mentioned earlier. An example appears in 
Fig. 7.4. This figure also illustrates divergence, which means that [if f(p) = 0] 

lim_{n→∞} |p_n − p| = ∞.

In other words, the sequence of iterates (p n ) generated by the Newton-Raphson 
method moves progressively farther and farther away from the desired root p. 

Another failure mechanism called oscillation may be demonstrated as follows.
Suppose that

f(x) = x³ − 2x + 2,   f^(1)(x) = 3x² − 2.    (7.54)

Newton's method for this specific case is

p_{n+1} = p_n − (p_n³ − 2p_n + 2)/(3p_n² − 2) = (2p_n³ − 2)/(3p_n² − 2).    (7.55)

If we were to select p_0 = 0 as a starting point, then

p_0 = 0,  p_1 = 1,  p_2 = 0,  p_3 = 1,  p_4 = 0,  p_5 = 1, ....









Figure 7.4  Illustration of breakdown phenomena in the Newton-Raphson method; here,
f(x) = xe^{−x}. The figure shows premature termination if p_n = x = 1 since f^(1)(1) = 0.
Also shown is divergence for the case where p_0 > 1. (See how the x intercepts of the
tangent lines for x > 1 become bigger as x increases.)



More succinctly, p_n = ½[1 − (−1)^n]. The sequence of iterates is oscillating; it
does not diverge, and it does not converge, either. The sequence is periodic (it may
be said to have period 2), and in this case quite simple, but far more complicated
oscillations are possible with longer periods.
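The oscillation is easy to reproduce. A small Python sketch (not from the text) of (7.55):

    def newton_step(p):
        # One Newton-Raphson step for f(x) = x**3 - 2*x + 2, cf. (7.55).
        return (2.0 * p**3 - 2.0) / (3.0 * p**2 - 2.0)

    p = 0.0
    for n in range(6):
        print(n, p)
        p = newton_step(p)
    # Prints 0, 1, 0, 1, 0, 1: the iterates oscillate with period 2.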

Premature termination, divergence, and oscillation are not the only possible 
breakdown mechanisms. Another possibility is that the sequence (p_n) can be
chaotic. Loosely speaking, chaos is a nonperiodic oscillation with a complicated 
structure. Chaotic oscillations look a lot like random noise. This will be considered 
in more detail in Section 7.6. 



7.5 SYSTEMS OF NONLINEAR EQUATIONS 

We now extend the fixed-point and Newton-Raphson methods to solving nonlinear 
systems of equations. We emphasize two equations in two unknowns (i.e., two- 
dimensional problems). But much of what is said here applies to higher dimensions. 

7.5.1 Fixed-Point Method 

As remarked, it is easier to begin by first considering the two-dimensional problem.
More specifically, we wish to solve

f_0(x_0, x_1) = 0,   f_1(x_0, x_1) = 0,    (7.56)

which we will assume may be rewritten in the form

x_0 − f_0(x_0, x_1) = 0,   x_1 − f_1(x_0, x_1) = 0,    (7.57)

and we see that solving these is equivalent to finding where the curves in (7.57)
intersect in R² (i.e., [x_0 x_1]^T ∈ R²). A general picture is in Fig. 7.5. As in the







Figure 7.5 Typical curves in R corresponding to the system of equations in (7.57). 



case of one-dimensional problems, there will usually be more than one way to 
rewrite (7.56) in the form of (7.57). 



Example 7.6  Suppose

x_0 − x_0² − ¼x_1² = 0  ⟹  x_0 = f_0(x_0, x_1) = x_0² + ¼x_1²,
x_1 − x_0² + x_1² = 0   ⟹  x_1 = f_1(x_0, x_1) = x_0² − x_1².

We see that

x_0 − x_0² − ¼x_1² = 0  ⟹  (x_0 − ½)² + ¼x_1² = ¼

(an ellipse in R²), and

x_1 − x_0² + x_1² = 0  ⟹  (x_1 + ½)² − x_0² = ¼

(a hyperbola in R²). These are plotted in Fig. 7.6. The solutions to the system
are the points where the ellipse and hyperbola intersect each other. Clearly, the
solution is not unique.

It is frequently the case that nonlinear systems of equations will have more than
one solution, just as was very possible and very common in the case of a single
equation.

Vector notation leads to compact descriptions. Specifically, we define x̄ =
[x_0 x_1]^T ∈ R², f_0(x̄) = f_0(x_0, x_1), f_1(x̄) = f_1(x_0, x_1), and

F(x̄) = [f_0(x_0, x_1)  f_1(x_0, x_1)]^T = [f_0(x̄)  f_1(x̄)]^T.    (7.58)







Figure 7.6 The curves of Example 7.6 (an ellipse and a hyperbola). The solutions to the 
equations in Example 7.6 are the points where these curves intersect. 



Then the nonlinear system in (7.57) becomes

x̄ = F(x̄),    (7.59)

and the fixed point p̄ ∈ R² of F satisfies

p̄ = F(p̄).    (7.60)

We recall that R² is a normed space if

||x̄||² = Σ_{k=0}^{1} x_k² = x̄^T x̄    (7.61)

(x̄ ∈ R²). We may consider a sequence of vectors (x̄_n) (i.e., x̄_n ∈ R² for n ∈ Z^+),
and it converges to x̄ iff

lim_{n→∞} ||x̄_n − x̄|| = 0.

We recall from Chapter 3 that R² with the norm in (7.61) is a complete space, so
every Cauchy sequence (x̄_n) in it will converge.

As with the scalar case considered in Section 7.3, we consider the sequence of
iterates (p̄_n) such that

p̄_{n+1} = F(p̄_n),   n ∈ Z^+.

The previous statements (and the following theorem) apply if R² is replaced by R^m
(m > 2); that is, the space can be of higher dimension to accommodate m equations
in m unknowns. Naturally, for R^m the norm in (7.61) must change according to
||x̄||² = Σ_{k=0}^{m−1} x_k². The following theorem is really a special instance of the Banach
fixed-point theorem seen earlier.




Theorem 7.5:  Suppose that ℛ is a closed subset of R², F: ℛ → ℛ, and F is
contractive on ℛ; then

x̄ = F(x̄)

has a unique solution p̄ ∈ ℛ. The sequence (p̄_n), where

p̄_{n+1} = F(p̄_n),   p̄_0 ∈ ℛ,   n ∈ Z^+,    (7.62)

is such that

lim_{n→∞} ||p̄_n − p̄|| = 0

and

||p̄_n − p̄|| ≤ α^n/(1 − α) ||p̄_1 − p̄_0||,    (7.63)

where ||F(p̄_1) − F(p̄_2)|| ≤ α||p̄_1 − p̄_2|| for any p̄_1, p̄_2 ∈ ℛ, and 0 ≤ α < 1.

Proof  As noted, this theorem is really a special instance of the Banach fixed-
point theorem (Theorem 7.3), so we only outline the proof.

It was mentioned in Section 7.3 that any closed subset of a complete metric
space is also complete. R² with norm (7.61) is a complete metric space, and since
ℛ ⊂ R² is closed, ℛ must be complete. According to Theorem 7.3, F has a unique
fixed point p̄ ∈ ℛ [i.e., F(p̄) = p̄], and the sequence (p̄_n) from p̄_{n+1} = F(p̄_n) (with
p̄_0 ∈ ℛ, and n ∈ Z^+) converges to p̄. The error bound

||p̄_n − p̄|| ≤ α^n/(1 − α) ||p̄_1 − p̄_0||

is an immediate consequence of Corollary 7.1.

A typical choice for ℛ would be the bounded and closed rectangular region

ℛ = {[x_0 x_1]^T | a_0 ≤ x_0 ≤ b_0, a_1 ≤ x_1 ≤ b_1}.    (7.64)

The next theorem applies the Schwarz inequality of Chapter 1 to the estimation 
of a. It must be admitted that applying the following theorem is often quite difficult 
in practice, and in the end it is often better to simply implement the iterative 
method in (7.62) and experiment with it rather than go through the laborious task 
of computing a according to the theorem's dictates. However, exceptions cannot 
be ruled out, and so knowledge of the theorem might be helpful. 



Theorem 7.6:  Suppose that ℛ ⊂ R² is as defined in (7.64); then, if

α = max_{x̄∈ℛ} [ (∂f_0/∂x_0)² + (∂f_0/∂x_1)² + (∂f_1/∂x_0)² + (∂f_1/∂x_1)² ]^{1/2},    (7.65)




we have

||F(x̄_1) − F(x̄_2)|| ≤ α ||x̄_1 − x̄_2||

for all x̄_1, x̄_2 ∈ ℛ.

Proof  We use a two-dimensional version of the Taylor expansion theorem,
which we will not attempt to justify in this book.

Given x̄_1 = [x_{1,0} x_{1,1}]^T, x̄_2 = [x_{2,0} x_{2,1}]^T, there is a point ξ̄ ∈ ℛ on the line
segment that joins x̄_1 to x̄_2 such that

F(x̄_1) = F(x̄_2) + F^(1)(ξ̄)(x̄_1 − x̄_2),

where F^(1)(ξ̄) is the matrix of first partial derivatives

F^(1)(ξ̄) = [ ∂f_0(ξ̄)/∂x_0   ∂f_0(ξ̄)/∂x_1
             ∂f_1(ξ̄)/∂x_0   ∂f_1(ξ̄)/∂x_1 ].

Consequently,

||F(x̄_1) − F(x̄_2)||² = [ (∂f_0(ξ̄)/∂x_0)(x_{1,0} − x_{2,0}) + (∂f_0(ξ̄)/∂x_1)(x_{1,1} − x_{2,1}) ]²
                      + [ (∂f_1(ξ̄)/∂x_0)(x_{1,0} − x_{2,0}) + (∂f_1(ξ̄)/∂x_1)(x_{1,1} − x_{2,1}) ]²

                    ≤ [ (∂f_0(ξ̄)/∂x_0)² + (∂f_0(ξ̄)/∂x_1)² ] ||x̄_1 − x̄_2||²
                      + [ (∂f_1(ξ̄)/∂x_0)² + (∂f_1(ξ̄)/∂x_1)² ] ||x̄_1 − x̄_2||²

via the Schwarz inequality (Theorem 1.1). Consequently,

||F(x̄_1) − F(x̄_2)||² ≤ { max_{x̄∈ℛ} [ (∂f_0(x̄)/∂x_0)² + (∂f_0(x̄)/∂x_1)² + (∂f_1(x̄)/∂x_0)² + (∂f_1(x̄)/∂x_1)² ] } ||x̄_1 − x̄_2||²
                     = α² ||x̄_1 − x̄_2||².




As with the one-dimensional Taylor theorem from Chapter 3, our application of it 
here assumes that F is sufficiently smooth. 

To apply Theorem 7.5, we try to select F and ℛ so that 0 ≤ α < 1. As already
noted, this is often done experimentally. The iterative process (7.62) can give us
an algorithm only if we have a stopping criterion. A good choice is (for suitable
ε > 0) to stop iterating when

||p̄_n − p̄_{n−1}|| / ||p̄_n|| < ε,   p̄_n ≠ 0.    (7.66)

It is possible to use norms in (7.66) other than the one defined by (7.61). (A 
Chebyshev norm might be a good alternative.) Since our methodology is often to 
make guesses about F and 1Z, it is very possible that a given guess will be wrong. 
In other words, convergence may never occur. Thus, a loop that implements the 
algorithm should also be programmed to terminate when the number of iterations 
exceeds some reasonable threshold. In this event, the program must also be written 
to print a message saying that convergence has not occurred because the upper limit 
on the allowed number of iterations was exceeded. This is an important example 
of exception handling in numerical computing. 

Example 7.7  If we consider f_0(x_0, x_1) and f_1(x_0, x_1) as in Example 7.6, then
the sequence of iterates for the starting vector p̄_0 = [1 1]^T is

p_{1,0} = 1.2500,   p_{1,1} = 0.0000
p_{2,0} = 1.5625,   p_{2,1} = 1.5625
p_{3,0} = 3.0518,   p_{3,1} = 0.0000
p_{4,0} = 9.3132,   p_{4,1} = 9.3132
⋮

This vector sequence is not converging to the root near [0.9 0.5]^T (see Fig. 7.6).
Many experiments with different starting values do not lead to a solution. So we
need to change F.

We may rewrite x_0 − x_0² − ¼x_1² = 0 as

x_0 = x_0² / (x_0² + ¼x_1²) = f_0(x_0, x_1),

and we may rewrite x_1 − x_0² + x_1² = 0 as

x_1 = x_0² / (1 + x_1) = f_1(x_0, x_1).




These redefine F [recall (7.58) for the general form of F]. For this choice of F,
and again with p̄_0 = [1 1]^T, the sequence of vectors is

p_{1,0} = 0.8000,   p_{1,1} = 0.5000
p_{2,0} = 0.9110,   p_{2,1} = 0.4267
p_{3,0} = 0.9480,   p_{3,1} = 0.5818
p_{4,0} = 0.9140,   p_{4,1} = 0.5682
⋮
p_{14,0} = 0.9189,  p_{14,1} = 0.5461

In this case the vector sequence now converges to the root with four-decimal-place
accuracy by the 14th iteration.
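The successful iteration in Example 7.7 can be reproduced in a few lines of Python. The sketch below (not from the text) implements (7.62) with the stopping rule (7.66), using the second choice of F; the helper names are arbitrary:

    import math

    def F(x0, x1):
        # Second choice of F from Example 7.7.
        return (x0**2 / (x0**2 + 0.25 * x1**2), x0**2 / (1.0 + x1))

    p = (1.0, 1.0)
    for n in range(1, 30):
        p_new = F(*p)
        rel = math.hypot(p_new[0] - p[0], p_new[1] - p[1]) / math.hypot(*p_new)
        p = p_new
        if rel < 1e-4:          # stopping rule (7.66)
            break
    print(n, p)                 # converges near (0.9189, 0.5461)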



7.5.2 Newton-Raphson Method 

Consider yet again two equations in two unknowns

f_0(x_0, x_1) = 0,
f_1(x_0, x_1) = 0,    (7.67)

each of which defines a curve in the plane (again x̄ = [x_0 x_1]^T ∈ R²). Solutions
to (7.67) are points of intersection of the two curves in R². We will denote a point
of intersection by p̄ = [p_0 p_1]^T ∈ R², which is a root of the system (7.67).

Suppose that x̄_0 = [x_{0,0} x_{0,1}]^T is an initial approximation to the root p̄. Assume
that f_0 and f_1 are smooth enough to possess a two-dimensional Taylor series expan-
sion,

f_0(x_0, x_1) = f_0(x_{0,0}, x_{0,1}) + (∂f_0/∂x_0)(x_{0,0}, x_{0,1})(x_0 − x_{0,0})
    + (∂f_0/∂x_1)(x_{0,0}, x_{0,1})(x_1 − x_{0,1})
    + (1/2!){ (∂²f_0(x_{0,0}, x_{0,1})/∂x_0²)(x_0 − x_{0,0})²
    + 2(∂²f_0(x_{0,0}, x_{0,1})/∂x_0∂x_1)(x_0 − x_{0,0})(x_1 − x_{0,1})
    + (∂²f_0(x_{0,0}, x_{0,1})/∂x_1²)(x_1 − x_{0,1})² } + ···

f_1(x_0, x_1) = f_1(x_{0,0}, x_{0,1}) + (∂f_1/∂x_0)(x_{0,0}, x_{0,1})(x_0 − x_{0,0})
    + (∂f_1/∂x_1)(x_{0,0}, x_{0,1})(x_1 − x_{0,1})
    + (1/2!){ (∂²f_1(x_{0,0}, x_{0,1})/∂x_0²)(x_0 − x_{0,0})²
    + 2(∂²f_1(x_{0,0}, x_{0,1})/∂x_0∂x_1)(x_0 − x_{0,0})(x_1 − x_{0,1})
    + (∂²f_1(x_{0,0}, x_{0,1})/∂x_1²)(x_1 − x_{0,1})² } + ···,

which have the somewhat more compact form using vector notation

f_0(x̄) = f_0(x̄_0) + (∂f_0(x̄_0)/∂x_0)(x_0 − x_{0,0}) + (∂f_0(x̄_0)/∂x_1)(x_1 − x_{0,1})
    + (1/2!){ (∂²f_0(x̄_0)/∂x_0²)(x_0 − x_{0,0})² + 2(∂²f_0(x̄_0)/∂x_0∂x_1)(x_0 − x_{0,0})(x_1 − x_{0,1})
    + (∂²f_0(x̄_0)/∂x_1²)(x_1 − x_{0,1})² } + ···    (7.68a)

f_1(x̄) = f_1(x̄_0) + (∂f_1(x̄_0)/∂x_0)(x_0 − x_{0,0}) + (∂f_1(x̄_0)/∂x_1)(x_1 − x_{0,1})
    + (1/2!){ (∂²f_1(x̄_0)/∂x_0²)(x_0 − x_{0,0})² + 2(∂²f_1(x̄_0)/∂x_0∂x_1)(x_0 − x_{0,0})(x_1 − x_{0,1})
    + (∂²f_1(x̄_0)/∂x_1²)(x_1 − x_{0,1})² } + ···.    (7.68b)

If x̄_0 is close to p̄, then from (7.68), we have

0 = f_0(p̄) ≈ f_0(x̄_0) + (∂f_0(x̄_0)/∂x_0)(p_0 − x_{0,0}) + (∂f_0(x̄_0)/∂x_1)(p_1 − x_{0,1})
    + (1/2!){ (∂²f_0(x̄_0)/∂x_0²)(p_0 − x_{0,0})² + 2(∂²f_0(x̄_0)/∂x_0∂x_1)(p_0 − x_{0,0})(p_1 − x_{0,1})
    + (∂²f_0(x̄_0)/∂x_1²)(p_1 − x_{0,1})² } + ···    (7.69a)

0 = f_1(p̄) ≈ f_1(x̄_0) + (∂f_1(x̄_0)/∂x_0)(p_0 − x_{0,0}) + (∂f_1(x̄_0)/∂x_1)(p_1 − x_{0,1})
    + (1/2!){ (∂²f_1(x̄_0)/∂x_0²)(p_0 − x_{0,0})² + 2(∂²f_1(x̄_0)/∂x_0∂x_1)(p_0 − x_{0,0})(p_1 − x_{0,1})
    + (∂²f_1(x̄_0)/∂x_1²)(p_1 − x_{0,1})² } + ···.    (7.69b)




If we neglect the higher-order terms (second derivatives and higher-order deriva-
tives), then we obtain

(∂f_0(x̄_0)/∂x_0)(p_0 − x_{0,0}) + (∂f_0(x̄_0)/∂x_1)(p_1 − x_{0,1}) ≈ −f_0(x̄_0),    (7.70a)

(∂f_1(x̄_0)/∂x_0)(p_0 − x_{0,0}) + (∂f_1(x̄_0)/∂x_1)(p_1 − x_{0,1}) ≈ −f_1(x̄_0).    (7.70b)

As a shorthand notation, define

f_{i,j}(x̄_n) = ∂f_i(x̄_n)/∂x_j,    (7.71)

so (7.70) becomes

(p_0 − x_{0,0}) f_{0,0}(x̄_0) + (p_1 − x_{0,1}) f_{0,1}(x̄_0) ≈ −f_0(x̄_0),    (7.72a)

(p_0 − x_{0,0}) f_{1,0}(x̄_0) + (p_1 − x_{0,1}) f_{1,1}(x̄_0) ≈ −f_1(x̄_0).    (7.72b)

Multiply (7.72a) by f_{1,1}(x̄_0), and multiply (7.72b) by f_{0,1}(x̄_0). Subtracting the
second equation from the first results in

(p_0 − x_{0,0})[f_{0,0}(x̄_0) f_{1,1}(x̄_0) − f_{1,0}(x̄_0) f_{0,1}(x̄_0)]
    ≈ −f_0(x̄_0) f_{1,1}(x̄_0) + f_1(x̄_0) f_{0,1}(x̄_0).    (7.73a)

Now multiply (7.72a) by f_{1,0}(x̄_0), and multiply (7.72b) by f_{0,0}(x̄_0). Subtracting
the second equation from the first results in

(p_1 − x_{0,1})[f_{0,1}(x̄_0) f_{1,0}(x̄_0) − f_{0,0}(x̄_0) f_{1,1}(x̄_0)]
    ≈ −f_0(x̄_0) f_{1,0}(x̄_0) + f_1(x̄_0) f_{0,0}(x̄_0).    (7.73b)

From (7.73), we obtain

p_0 ≈ x_{0,0} + [−f_0(x̄_0) f_{1,1}(x̄_0) + f_1(x̄_0) f_{0,1}(x̄_0)] / [f_{0,0}(x̄_0) f_{1,1}(x̄_0) − f_{0,1}(x̄_0) f_{1,0}(x̄_0)],    (7.74a)

p_1 ≈ x_{0,1} + [−f_1(x̄_0) f_{0,0}(x̄_0) + f_0(x̄_0) f_{1,0}(x̄_0)] / [f_{0,0}(x̄_0) f_{1,1}(x̄_0) − f_{0,1}(x̄_0) f_{1,0}(x̄_0)].    (7.74b)

We may assume that the right-hand side of (7.74) is the next approximation to p̄:

x_{1,0} ≈ x_{0,0} + (−f_0 f_{1,1} + f_1 f_{0,1}) / (f_{0,0} f_{1,1} − f_{0,1} f_{1,0}),    (7.75a)

x_{1,1} ≈ x_{0,1} + (−f_1 f_{0,0} + f_0 f_{1,0}) / (f_{0,0} f_{1,1} − f_{0,1} f_{1,0})    (7.75b)






(~x\ — [xi, 0*1,1] ), where the functions and derivatives are to be evaluated at xo. 
We may continue this process to generate (x„) for n e Z + (so in general x n — 
[x n ,o *«,i] r ) according to 



-/o(*»)/i,i(*») + /i(*n)/o,i(*») 
fo,o( J n)fi,i(x n ) - fo,l(x„)flfi(x n )' 

~fl(Xn)f0,0(Xn) + /o(*«)/l,o(*») 
fo,o(x n )fl,l(x n ) - fo,l(Xn)fl,o(Xn) 

As in the previous subsection, we define 



Xn+1,0 — X„fi + 



Xn+1,1 — X n ,\ 



(7.76a) 
(7.76b) 



Also 



F(x n ) = 



F m (x n ) = 



fo(x n ,Q, *n,l) 
fl(Xn,0, X n ,\) 



f0,0(Xn) fo,l(Xn) 
/l,o(*n) f\,\(x n ) 



fo(x„) 

fl(x~n) 



= Jf(x„), 



(7.77) 



(7.78) 



which is the Jacobian matrix Jf evaluated at x = x n - We see that 

1 



[j F (x n )r l = 



f0,0(Xn)fl,l(x n ) ~ f0,l(Xn)fl,0(Xn) 



fl,l(x n ) -fo,l(Xn) 
f\fi(x n ) /b,o(*n) 



(7.79) 

so in vector notation (7.76) becomes 

x n +\ =x„- [J F (x„)] _1 F(x„) (7.80) 

for n e Z + . If ~x n e R m (i.e., if we consider m equations in m unknowns), then 



Jf(x„) = 



fo,o( x n) fo,l(.X„) 

flfiiXn) f\,\(x„) 



fo,m-l(x n ) 
J\,m—{ \Xfi) 



Jm—l,0yXn) Jm—\,\\Xn) Jm—l,m—l\Xn) 



(7.81a) 



and 



F(x n ) 



MXn) 
fl(x n ) 



Jm — l \Xn) 

Of course, x n = [x nfi x n ,\ ••• x„, m -i] r e R m . 



(7.81b) 




Equation (7.80) reduces to (7.36) when we have only one equation in one
unknown. We see that the method will fail if J_F(x̄_n) is singular at x̄_n. As in
the one-dimensional problem of Section 7.4, the success of the method depends
on picking a good starting point x̄_0. If convergence occurs, then it is quadratic
as in the one-dimensional (i.e., scalar) case. It is sometimes possible to force the
method to converge even if the starting point is poorly selected, but this will not
be considered here. The computational complexity of the method is quite high. If
x̄_n ∈ R^m, then from (7.80) and (7.81), we require m² + m function evaluations,
and we need to invert an m × m Jacobian matrix at every iteration. We know from
Chapter 4 that matrix inversion needs O(m³) operations. Ill conditioning of the
Jacobian is very much a potential problem as well.

Example 7.8  Refer to the examples in Section 7.5.1. In Example 7.6 we
wanted to solve

f_0(x_0, x_1) = x_0 − x_0² − ¼x_1² = 0,   f_1(x_0, x_1) = x_1 − x_0² + x_1² = 0.

Consequently

f_0(x̄_n) = x_{n,0} − x_{n,0}² − ¼x_{n,1}²,   f_1(x̄_n) = x_{n,1} − x_{n,0}² + x_{n,1}²,

and the derivatives are

f_{0,0}(x̄_n) = 1 − 2x_{n,0},   f_{1,0}(x̄_n) = −2x_{n,0},
f_{0,1}(x̄_n) = −½x_{n,1},      f_{1,1}(x̄_n) = 1 + 2x_{n,1}.

Via (7.76), the desired equations are

x_{n+1,0} = x_{n,0}
    + [−(x_{n,0} − x_{n,0}² − ¼x_{n,1}²)(1 + 2x_{n,1}) + (x_{n,1} − x_{n,0}² + x_{n,1}²)(−½x_{n,1})]
      / [(1 − 2x_{n,0})(1 + 2x_{n,1}) − x_{n,0}x_{n,1}],    (7.82a)

x_{n+1,1} = x_{n,1}
    + [−(x_{n,1} − x_{n,0}² + x_{n,1}²)(1 − 2x_{n,0}) + (x_{n,0} − x_{n,0}² − ¼x_{n,1}²)(−2x_{n,0})]
      / [(1 − 2x_{n,0})(1 + 2x_{n,1}) − x_{n,0}x_{n,1}].    (7.82b)

If we execute the iterative procedure in (7.82), we obtain

x_{0,0} = 0.8000,   x_{0,1} = 0.5000
x_{1,0} = 0.9391,   x_{1,1} = 0.5562
x_{2,0} = 0.9193,   x_{2,1} = 0.5463
x_{3,0} = 0.9189,   x_{3,1} = 0.5461




We see that the answer is correct to four decimal places in only three iterations. 
This is much faster than the fixed-point method seen in Example 7.7. 
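The same computation can be written compactly in terms of (7.80). The following Python sketch (not from the text) uses NumPy to form the Jacobian and solve the 2 × 2 linear system at each step instead of inverting J_F explicitly; the function and variable names are arbitrary:

    import numpy as np

    def F(x):
        x0, x1 = x
        return np.array([x0 - x0**2 - 0.25 * x1**2,
                         x1 - x0**2 + x1**2])

    def JF(x):
        x0, x1 = x
        return np.array([[1.0 - 2.0 * x0, -0.5 * x1],
                         [-2.0 * x0,       1.0 + 2.0 * x1]])

    x = np.array([0.8, 0.5])
    for n in range(10):
        # x_{n+1} = x_n - J_F(x_n)^{-1} F(x_n), cf. (7.80), via a linear solve.
        x = x - np.linalg.solve(JF(x), F(x))
    print(x)   # approaches [0.9189, 0.5461]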

7.6 CHAOTIC PHENOMENA AND A CRYPTOGRAPHY APPLICATION 

Iterative processes such as x n +\ = g(x n ) (which includes the Newton-Raphson 
method) can converge to a fixed point [i.e., to x such that g(x) — x], or they 
can fail to do so in various ways. This was considered in Section 7.4.3. We are 
interested here in the case where (x n ) is a chaotic sequence, in which case g is 
often said to be a chaotic map. Formal definitions exist for chaotic maps [13, 
p. 50]. However, these are rather technical. They are also difficult to apply except 
in relatively simple cases. We shall therefore treat chaos in an intuitive/empirical 
(i.e., experimental) manner for simplicity. 

In Section 7.3 we considered examples based on the logistic map

g(x) = λx(1 − x)    (7.83)

(recall Examples 7.3-7.5). Suppose that λ = 4. Figure 7.7 shows two output
sequences from this map for two slightly different initial conditions. Plot (a) shows
the sequence for x_0 = x_0^a = 0.745, while plot (b) shows it for x_0 = x_0^b = 0.755.
We see that |x_0^a − x_0^b| = 0.01, yet after only a few iterations the two sequences
are very different from each other. This is one of the distinguishing features of chaos:
sensitive dependence of the resulting sequence on minor changes in the initial
conditions. For λ = 4 we have g: [0, 1] → [0, 1], so divergence is impossible.
Chaotic sequences do not diverge. They remain bounded; that is, there is an M ∈ R
such that 0 < M < ∞ with |x_n| ≤ M for all n ∈ Z^+. But the sequence does not
converge, and it is not periodic, either. In fact, the plots in Fig. 7.7 show that the
elements of the sequence (x_n) seem to wander around rather aimlessly (i.e., apparently
randomly). This wandering behavior has been observed in the past [14, p. 167],
but was not generally recognized as being a chaotic phenomenon until more recently.
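The sensitive dependence is easy to observe numerically. A minimal Python sketch (not from the text) reproduces the experiment behind Fig. 7.7:

    def logistic_orbit(x0, n):
        # Iterates the logistic map g(x) = 4x(1 - x) for n steps.
        xs = [x0]
        for _ in range(n):
            xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
        return xs

    a = logistic_orbit(0.745, 20)
    b = logistic_orbit(0.755, 20)
    for n, (xa, xb) in enumerate(zip(a, b)):
        print(n, round(xa, 4), round(xb, 4), round(abs(xa - xb), 4))
    # The difference grows rapidly even though |x_0^a - x_0^b| = 0.01.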

It has been known for a very long time that effective cryptographic systems 
should exploit randomness [15]. Since chaotic sequences have apparently random 
qualities, it is not surprising that they have been proposed as random-number gen- 
erators (or as pseudo-random-number generators) for applications in cryptography 
[16, 17]. However, it is presently a matter of legitimate controversy regarding just 
how secure a cryptosystem based on chaos can be. One difficulty is as follows. 
Nominally, a chaotic map g takes on values from the set of real numbers. But if 
such a map is implemented on a digital computer, then, since all computers are 
finite-state machines, any chaotic sequence will not be truly chaotic as it will even- 
tually repeat. Short period sequences are cryptographically weak (i.e., not secure). 
There is presently no effective procedure (beyond exhaustive searching) for deter- 
mining when this difficulty will arise in a chaos-based system. This is not the only 
problem (see p. 1507 of Ref. 17 for others). 

Two specific chaos-based cryptosystems are presented by De Angeli et al. [18]
and Papadimitriou et al. [19]. (There have been many others proposed in recent 







Figure 7.7  Output sequence from the logistic map g(x) = 4x(1 − x) with initial conditions
x_0 = 0.745 (a) and x_0 = 0.755 (b). Observe that although these initial conditions are close
together, the resulting sequences become very different from each other after only a few
iterations.



years.) Key size is the number of different possible encryption keys available in the 
system. Naturally, this number should be large enough to prevent a codebreaker 
(eavesdropper, cryptanalyst) from guessing the correct key. However, it is also well 
known that a large key size is most definitely not a guarantee that the cryptosystem 
will be secure. Papadimitriou et al. [19] demonstrate that their system has a large 
key size, but their security analysis [19] did not go beyond this. Their system [19] 
seems difficult to analyze, however. In what follows we shall present some analysis 
of the system in De Angeli et al. [18], and see that if implemented on a digital 
computer, it is not really very secure. We begin with a description of their system 
[18]. Their method in [18] is based on the Henon map (see Fig. 7.8), which is a 
mapping defined on R 2 according to 



x_{0,n+1} = 1 − αx_{0,n}² + x_{1,n},
x_{1,n+1} = βx_{0,n}    (7.84)

for n ∈ Z^+ (so [x_{0,n} x_{1,n}]^T ∈ R²), and which is known to be chaotic in some neigh-
borhood of

α = 1.4,   β = 0.3,    (7.85)







Figure 7.8  Typical state sequences from the Henon map for α = 1.45 and β = 0.25, with
initial conditions x_{0,0} = −0.5 and x_{1,0} = 0.2.

so that the constants α and β form the encryption key for the system. Not every choice
in the allowed neighborhood is a valid key. An immediate problem is that there seems
to be no detailed description of which points are allowed. A particular choice of
key should therefore be tested to see if the resulting sequence is chaotic. In what
follows the output sequence from the map is defined to be



y_n = x_{0,n}.    (7.86)

The vector x̄_n = [x_{0,n} x_{1,n}]^T is often called a state vector. The elements are state
variables.

The encryption algorithm works by mixing the chaotic sequence of (7.86) with 
the message sequence, which we shall denote by (s n ). The mixing (described below) 
yields the cyphertext sequence, denoted (c„). A problem is that the receiver must 
somehow recover the original message (s n ) from the cyphertext (c„) using knowl- 
edge of the encryption algorithm, and the key {a, fi}. 

Consider the following mapping: 



%«+l = 1 -ay n +xi,„, 

xi,n+i = Py n - 



(7.87) 



Note that this mapping is of the same form as (7.84) except that xo,„ is replaced 
by y„, but from (7.86) these are the same (nominally). The mapping in (7.84) 



TLFeBOOK 



326 



NONLINEAR SYSTEMS OF EQUATIONS 



represents a physical system (or piece of software) at the transmitter, while (7.87) 
is part of the receiver (hardware or software). Now define the error sequences 



v x i,n — x i,n 



(i € {0, 1}) 



for which it is possible to show [using (7.84), (7.86), and (7.87)] that 



8x0,n + l 



Sxo, n 

8xi „ 



(7.88) 



(7.89) 



= A 



= 


"01" 




"01" 




Sxi,„ 


= 


" " 




We observe that A 2 — 0. Matrix A is an example of a nilpotent matrix for this 
reason. This immediately implies that the error sequences go to zero in at most 
two iterations (i.e., two steps): 



<5*l,n+2 



This fact tells us that if (y n ) is generated at the transmitter, and sent over the 
communications channel to the receiver, then the receiver may perfectly recover 2 
the state sequence of the transmitter in not more than two steps. This is called dead- 
beat synchronization. The system in (7.87) is a specific example of a nonlinear 
observer for a nonlinear dynamic system. There is a well-developed theory of 
observers for linear dynamic systems [20]. The notion of an observer is a control 
systems concept, so we infer that control theory is central to the problem of applying 
chaotic systems to cryptographic applications. 

All of this suggests the following algorithm for encrypting a message. Assume 
that we wish to encrypt a length N message sequence (s n ); that is, we only have 
the elements sq,Si, . . . , sn-2, $N-i- 



1. The transmitter generates and transmits yo, y\ according to (7.84) and (7.86). 
The transmitter also generates yi, V3, . . . , y^, yN+i, but these are not trans- 
mitted. 

2. The transmitter sends the cyphertext sequence 



yn+2 



(7.90) 



for n = 0, 1 , . . . , N — 1 . Of course, this assumes we never have s„ — 0. 

Equation (7.90) is not the only possible means of mixing the message and the 
chaotic sequence together. The decryption algorithm at the receiver is as follows: 

This assumes a perfect communications channel (which is not realistic), and that rounding errors in 
the computation are irrelevant (which will be true if the receiver implements arithmetic in the same 
manner as the transmitter). 



TLFeBOOK 



CHAOTIC PHENOMENA AND A CRYPTOGRAPHY APPLICATION 327 

1. The receiver regenerates the transmitter's state sequence using (7.87), its 
knowledge of the key, and the sequence elements yo,y\. Specifically, for 
n — 0, 1 compute 

*0,n+l = 1 -uyl + h,n, 

*i,«+i = Py n (7.91a) 

while for n — 2,3, . . . , N — I, N compute 

X0,n + \ = 1 ~0Cxl n +X\, n , 

X\,n + \ = /*%«■ (7.91b) 

Recover y n for n — 2,3, . . . , N, N + I according to 

?«=%«■ (7.92) 

2. Recover the original message via 

V„_1_T 

(7.93) 







yn+2 
s n — 

Cn 


where n = 0, 1 , . . 


., N - 


-2, N - 1. 



The initial states at the transmitter and receiver are arbitrary; that is, any xo,o, *i,0 
and x\fi may be selected in (7.84) and (7.91a). The elements yo, y\, and the cypher- 
text are sent over the channel. If these are lost because of a corrupt channel, then the 
receiver should request retransmission of this information. It is extremely impor- 
tant that the transmitter resend the same synchronizing elements yo, yi and not a 
different pair. The reason will become clear below. 

We remark that since we are assuming that the algorithms are implemented 
on a digital computer, so each sequence element is a binary word of some form. 
The synchronizing elements and cyphertext are thus a bitstream. Methods exist to 
encode such data for transmission over imperfect channels such that the probability 
of successful transmission can be made quite high. These are called error control 
coding schemes. 

In general, the receiver will fail to recover the message if (1) the way in which 
arithmetic is performed by the receiver is not the same as at the transmitter, (2) the 
channel corrupts the transmission, or (3) the receiver does not know the key {a, ft}. 
Item 1 is important since the failure to properly duplicate arithmetic operations at 
the receiver will cause machine rounding errors to accumulate and prevent data 
recovery. This is really a case of improper synchronization. The plots of Figs. 7.9- 
7.11 illustrate some of these effects. 

It is noteworthy that even though the receiver may not be a perfect match to the 
transmitter, some of the samples (at the beginning of the message) are recovered. 



TLFeBOOK 



328 



NONLINEAR SYSTEMS OF EQUATIONS 



E 
< 



(a) 




50 60 

Time step 



100 




(b) 



40 50 60 

Time step 



100 



Figure 7.9 Transmitted (a) and reconstructed (b) message sequences. The sinusoidal mes- 
sage sequence (s n ) is perfectly reconstructed at the receiver; because the channel is perfect, 
the receiver knows the key (here a = 1.4, /J ^ 0.3), and arithmetic is performed in identical 
fashion at both the transmitter and receiver. 



This is actually an indication that the system is not really very secure. We now 
consider security of the method in greater detail. 

What is the key size? This question seems hard to answer accurately, but a 
simple analysis is as follows. Suppose that the Henon map is chaotic in the 
rectangular region of the afi plane defined by a € [1.4 — Aa, 1.4+ Aa], and 
P e [0.3 - A0, 0.3 + Aft], with Aa, Ap > 0. We do not specifically know the 
interval limits, and it is an ad hoc assumption that the chaotic neighborhood is 
rectangular (this is a false assumption). Suppose that p is the smallest M-bit binary 
fraction such that p > 0.3 — p, and that p is the largest M-bit binary fraction such 
that p < 0.3 + A/J. In this case the number of M-bit fractions from 0.3 — A/3 to 
0.3 + AP is about 2 M - P) + 1. Thus 



2 M (P - p + 1) « 2 M [(0.3 + AP) - (0.3 - A/3)] + 1 = 2 M+1 AP + 1 = K p 



and by similar reasoning for a 



K a = 2 M+l Aa+\, 



(7.94a) 



(7.94b) 



TLFeBOOK 



CHAOTIC PHENOMENA AND A CRYPTOGRAPHY APPLICATION 



329 



E 
< 



(a) 




E 
< 



(b) 



40 50 60 

Time step 



100 




40 50 60 

Time step 



100 



Figure 7.10 Transmitted (a) and reconstructed (b) message sequences. Here, the conditions 
of Fig. 7.9 hold except that the receiver uses a mismatched key a — 1.400001, fi = 0.300001. 
Thus, the message is eventually lost. (Note that the first few message samples seem to be 
recovered accurately.). 



which implies that the key size is (very approximately) 

K — K a Kp. 



(7.95) 



Even if Ace and A/3 are small, and the structure of the chaotic neighborhood is 
not rectangular, we can make M big enough (in principle) to generate a big key 
space. Apparently, Ruelle [21, p. 19] has shown that the Henon map is periodic 
(i.e., not chaotic) for a — 1.3, ^ = 0.3, so the size of the chaotic region is not very 
big. It is irregular and "full of holes" (the "holes" are key parameters that don't 
give chaotic outputs). In any case it seems that a large key size is possible. But as 
cautioned earlier, this is no guarantee of security. 

What does the transmitter send over the channel? This is the same as asking 
what the eavesdropper knows. From the encryption algorithm the eavesdropper 
knows the synchronizing elements yo,y\, and the cyphertext. The eavesdropper 
also knows the algorithm, but not the key. Is this enough to find the key? It is now 
obvious to ask if knowledge of yo, y\ gives the key away. This question may be 
easily (?) resolved as follows. 

From (7.84) and (7.86) for n e N, we have 



y n +\ = i -uy n + Pyn-i- 



(7.96) 



TLFeBOOK 



330 



NONLINEAR SYSTEMS OF EQUATIONS 




(a) 



(b) 



40 50 60 

Time step 



100 




40 50 60 
Time step 



100 



Figure 7.11 Transmitted (a) and reconstructed (b) message sequences. Here, the conditions 
of Fig. 7.9 hold except the receiver and the transmitter do not perform arithmetic in identical 
fashion. In this case the order of two operations was reversed at the receiver. Thus, the 
message is eventually lost. (Note again that the first few message samples seem to be 
recovered accurately.). 



The encryption algorithm generates yo, y\, , , , , yti+i, which, from (7.96), must 
lead to the key satisfying the linear system of equations 



n 



y 2 N -i 
yl 



-yo 
-y\ 

-y>N-2 
-yN-i 



a 
(i 



-yi 
-yi 

- yN 
yN+\ 



(7.97) 



Compactly, we have Ya — y. This is an overdetermined system. The eavesdropper 
has only yo, y\, so from (7.97) the eavesdropper needs to solve 



y\ 



-yo 
-yi 



a 




yi 

>'3 



(7.98) 



TLFeBOOK 



CHAOTIC PHENOMENA AND A CRYPTOGRAPHY APPLICATION 



331 



But the eavesdropper does not know y>2, yi as these were not transmitted. These 
elements become mixed with the message, and so are not available to the eaves- 
dropper. We immediately conclude that deadbeat synchronization is secure. 3 There 
is an alternative synchronization scheme [18] that may be proved to be insecure 
by this method of analysis. 

Does the cyphertext give key information away? This seems not to have a 
complete answer either. However, we may demonstrate that the system of D'Angeli 
et al. [18] (using deadbeat synchronization) is vulnerable to a known plaintext 4 
attack. Such a vulnerability is ordinarily sufficient to preclude using a method in 
a high-security application. The analysis assumes partial prior knowledge of the 
message (s n ). Let us specifically assume that 



$n 



E 

k=0 



a^n 



(7.99) 



but we do not know a^ or p. That is, the structure of our message is a polyno- 
mial sequence, but we do not know more than this. Combining (7.90) with (7.96) 
gives us 



1 



-2Sn-2- 



(7.100) 



It is not difficult to confirm that 



J «-i 



2/> 

= E 

k=0 



so that (7.100) becomes 



E aia i 



i+j=k 



(n - If 



(7.101) 



E 

k=0 



cikn k c n 



2p 
k=Q 



yfc\_ 



P 

^ Pak(n 

k=0 



T) k c n 



-2 = 1. 



(7.102) 



Remembering that the eavesdropper has cyphertext (c„) (obtained by eavesdrop- 
ping), the eavesdropper can use (7.102) to set up a linear system of equations in 
the key and message parameters <%. The code is therefore easily broken in this 
case. 

If the message has a more complex structure, (7.102) will generally be replaced 
by some hard-to-solve nonlinear problem wherein the methods of previous sections 
(e.g., Newton-Raphson) can be used to break the code. We conclude that, in spite 
of technical problems from the eavesdropper's point of view (ill conditioning, 
incomplete cyphertext sequences, etc.), the scheme in Ref. 18 is not secure. 

This conclusion assumes there are no other distinct ways to work with the encryption algorithm 
equations in such a manner as to give an equation that an eavesdropper can solve for the key knowing 
only yo. >'i, and the cyphertext. 

The message is also called plaintext. 



TLFeBOOK 



332 NONLINEAR SYSTEMS OF EQUATIONS 

REFERENCES 

1. J. H. Wilkinson, "The Perfidious Polynomial," in Studies in Mathematics, G. H. Golub, 
ed., Vol. 24, Mathematical Association of America, 1984. 

2. A. M. Cohen, "Is the Polynomial so Perfidious?" Numerische Mathematik 68, 225-238 
(1994). 

3. T. E. Hull and R. Mathon, "The Mathematical Basis and Prototype Implementation of a 
New Polynomial Root Finder with Quadratic Convergence," ACM Trans. Math. Software 
22, 261-280 (Sept. 1996). 

4. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 
1978. 

5. W. Rudin, Principles of Mathematical Analysis, 3rd ed., McGraw-Hill, New York, 1976. 

6. D. C. Youla and H. Webb, "Image Restoration by the Method of Convex Projections: 
Part I— Theory," IEEE Trans. Med. Imag. MM, 81-94 (Oct. 1982). 

7. A. E. Cetin, O. N. Gerek and Y. Yardimci, "Equiripple FIR Filter Design by the FFT 
Algorithm," IEEE Signal Process. Mag. 14, 60-64 (March 1997). 

8. K. Grochenig, "A Discrete Theory of Irregular Sampling," Linear Algebra Appl. 193, 
129-150 (1993). 

9. H. H. Bauschke and I. M. Borwein, "On Projection Algorithms for Solving Convex 
Feasibility Problems," SIAM Rev. 38, 367-426 (Sept. 1996). 

10. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979. 

11. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 1966. 

12. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York, 
1974. 

13. R. L. Devaney, An Introduction to Chaotic Dynamical Systems, 2nd ed., Addison- 
Wesley, Redwood City, CA, 1989. 

14. G. E. Forsythe, M. A. Malcolm and C. B. Moler, Computer Methods for Mathematical 
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977. 

15. G Brassard, Modern Cryptology: A Tutorial, Lecture Notes in Computer Science 
(series), Vol. 325, G. Goos and J. Hartmanis, eds., Springer- Verlag, New York, 1988. 

16. L. Kocarev, "Chaos-Based Cryptography: A Brief Overview," IEEE Circuits Syst. Mag. 
1(3), 6-21 (2001). 

17. F. Dachselt and W. Schwarz, "Chaos and Cryptography," IEEE Trans. Circuits Syst. 
(Part I: Fundamental Theory and Applications) 48, 1498-1509 (Dec. 2001). 

18. A. De Angeli, R. Genesio and A. Tesi, "Dead-Beat Chaos Synchronization in Discrete- 
Time Systems," IEEE Trans. Circuits Syst. (Part I: Fundamental Theory and Applica- 
tions) 42, 54-56 (Jan. 1995). 

19. S. Papadimitriou, A. Bezerianos and T. Bountis, "Secure Communications with Chaotic 
Systems of Difference Equations," IEEE Trans. Comput. 46, 27-38 (Jan. 1997). 

20. M. S. Santina, A. R. Stubberud and G. H. Hostetter, Digital Control System Design, 
2nd ed., Saunders College Publ., Fort Worth, TX, 1994. 

21. D. Ruelle, Chaotic Evolution and Strange Attractors: The Statistical Analysis of Time 
Series for Deterministic Nonlinear Systems, Cambridge Univ. Press, New York, 1989. 

22. M. Jenkins and J. Traub, "A Three-Stage Variable Shift Algorithm for Polynomial Zeros 
and Its Relation to Generalized Rayleigh Iteration," Numer. Math. 14, 252-263 (1970). 



TLFeBOOK 



PROBLEMS 333 

PROBLEMS 

7.1. For the functions and starting intervals below solve fix) — using the 
bisection method. Use stopping criterion (7.13d) with e — 0.005. Do the 
calculations with a pocket calculator. 

(a) f(x) = log,* + 2x + 1, [a , b Q ] = [0.2, 0.3]. 

(b) f(x) = x 3 - cosx, [a , b ] = [0.8, 1.0]. 

(c) /(*)=x-e-*/ 5 ,[fl ,6o] = [!,l]. 

(d) fix)=x 6 -x-l,[ao,b ] = [l,l]. 

(e) /(*) = SS£ + exp(-x), [ fl0 , &o] = [3, 4]. 

(f) /(*) = SSi - , + 1, [« , fc ] = [1, 2]. 

7.2. Consider /(x) = sin(x)/x. This function has a minimum value for some x e 
[jr, 2jt]. Use the bisection method to find this x. Use the stopping criterion 
in (7.13d) with e — 0.005. Use a pocket calculator to do the computations. 

7.3. This problem introduces the variation on the bisection method called regula 
falsi (or the method of false position). Suppose that [oq, bo] brackets the root 
p [i.e., fip) = 0]. Thus, /(flo)/(^o) < 0. The first estimate of the root p, 
denoted by po, is where the line joining the points (ao, fi^o)) and (£>o, fiba)) 
crosses the x axis. 

(a) Show that 

h) - «o , , , 

Po = ao- -—— — — fia ). 

fibo) - fia ) 

(b) Using stopping criterion (7.13d), write pseudocode for the method of 
false position. 

7.4. In certain signal detection problems (e.g., radar or sonar) the probability of 
false alarm (FA) (i.e., of saying that a certain signal is present in the data 
when it actually is not) is given by 

r°° i 

Pp A = / -if-'c" 1 ' 2 ^, (7.P.1) 

J, Vip/2)2P/2 

where n is called the detection threshold. If p is an even number, it can be 
shown that (7.P.1) reduces to the finite series 

, ( " /2) " 1 1 /mt 

**-•-*' £ f I) • 

k=Q 

The detection threshold r\ is a very important design parameter in signal 
detectors. Often it is desired to specify an acceptable value for Pfa (where 



TLFeBOOK 



334 NONLINEAR SYSTEMS OF EQUATIONS 

< Pfa < 1), and then it is necessary to solve nonlinear equation (7.P.2) 
for r\. Let p — 6. Use the bisection method to find r\ for 

(a) P m = 0.001 

(b) Pfa = 0.01 

(c) P F a = 0.1 

7.5. We wish to solve 

f(x) = x 4 - 5 -x 3 + 5 -x - 1 = 
using a fixed-point method. This requires finding gix) such that 

g(x) -x — f{x). 

Find four different functions gix). 

7.6. Can the fixed-point method be used to find the solution to 

fix) = x 6 -x- 1 =0 

for the root located in the interval [1, |]? Explain. 

7.7. Consider the nonlinear equation 



fix) =x-e 



-x/5 



(which has a solution on interval [|, 1]). Use (7.32) to estimate a. Recalling 
that x n = g"xo (in the fixed-point method), if xq — 1, then use (7.22a) to 
estimate n so that dix n , x) < 0.005 [x is the root of fix) — 0]. Use a pocket 
calculator to compute x\, . . . , x n . 

7.8. Consider the nonlinear equation 

fix) =x-l- \e~ x = 

(which has a solution on interval [1, 1.2]). Use (7.32) to estimate a. Recalling 
that x n = g n XQ (in the fixed-point method), if xo = 1 then use (7.22a) to 
estimate n so that dix n , x) < 0.001 [x is the root of fix) — 0]. Use a pocket 
calculator to compute x\, . . . , x n . 

7.9. Problem 5.14 (in Chapter 5) mentioned the fact that orthogonal polynomi- 
als possess the "interleaving of zeros" property. Use this property and the 
bisection method to derive an algorithm to find the zeros of all Legendre poly- 
nomials P„ix) for n = 1, 2, . . . , N. Express the algorithm in pseudocode. Be 
fairly detailed about this. 



TLFeBOOK 



PROBLEMS 335 

7.10. We wish to find all of the roots of 

f{x) = x 3 - 3x 2 + Ax - 2 = 0. 

There is one real-valued root, and two complex-valued roots. It is easy to 
confirm that f(l) — 0, but use 

x 3 + 3x 2 +x + 2 
8(X) = 2^T1 

to estimate the real root p using fixed-point iteration [i.e, p n +\ — g(p n )]- 
Using a pocket calculator, compute only p\, p%, pj,, and p\, and use the 
starting point po — 2. Also, use the Newton-Raphson method to estimate 
the real root. Again choose po — 2, and compute only p\, pi, pj, and p\. 
Once the real root is found, finding the complex-valued roots is easy. Find 
the complex-valued roots by making use of the formula for the roots of a 
quadratic equation. 

7.11. Consider Eq. (7.36). Via Theorem 3.3, there is an a n between root p (f(p) = 
0) and the iterate p n such that 

f(p) ~ f(Pn) = f (1) (cc„)(p - Pn ). 

(a) Show that 

' fiPn) f m ( P n-l)\ 



Pn = (Pn ~ Pn-l) 



f(pn-i) / (1) (««) 



[Hint: Use the identity 1 = 



f(Pn-l) f W (Pn-l) I 
f (Pn-l) fW (Pn-l) '* 

(b) Argue that if convergence is occurring, we have 



lim \A„\ = 1. 

n— >oo 



Hence lim„ 



P-Pn 



Pn-pn-1 ) 

(c) An alternative stopping criterion for the Newton-Raphson method is to 
stop iterating when 

\f(Pn)\ + \Pn ~ Pn-\\ < <? 

for some suitably small e > 0. Is this criterion preferable to (7.42)? 
Explain. 



TLFeBOOK 



336 NONLINEAR SYSTEMS OF EQUATIONS 

7.12. For the functions listed below, and for the stated starting value po, use 
the Newton-Raphson method to solve f(p) — 0. Use the stopping crite- 
rion (7.42a) with e — 0.001. Perform all calculations using only a pocket 
calculator. 

(a) f(x) — x + tanx, po — 2. 

(b) f(x) = x 6 -x-l,p = l.5. 

(c) f(x) — x 3 — cosx, po — 1. 

(d) f(x) = x - e~ x l\ p = 1. 

7.13. Use the Newton-Raphson method to find the real-valued root of the polyno- 
mial equation 

1 1,1, 

/ (x) = 1 + -x + -x 2 + —x 3 = 0. 
J 2 6 24 

Choose starting point po — —2. Iterate 4 times. [Comment: Polynomial /(x) 
arises in the stability analysis of a numerical method for solving ordinary 
differential equations. This will be seen in Chapter 10, Eq. (10.83).] 

7.14. Write a MATLAB function to solve Problem 7.4 using the Newton-Raphson 
method. Use the stopping criterion (7.42a). 

7.15. Prove the following theorem (Newton-Raphson error formula). Let f(x) e 
C 2 [a, b], and f(p) = for some p e [a, b]. For p n e [a, b] with 

f(Pn) 
Pn + l -Pn f(l)(pn y 

there is a % n between p and p n such that 

1 2 / (2) (f«) 
P ~ Pn+1 = -~(P ~ Pn) -77777 ". 

2 f W (Pn) 

[Hint: Consider the Taylor series expansion of /(x) about the point x = p n 

fix) = f(p„) + (X- p„)f m ( Pn ) + \(X - Pn ) 2 f (2) ^n), 

and then set x = p.] 

7.16. This problem is about two different methods to compute */x. To begin, recall 
that if x is a binary floating-point number (Chapter 2), then it has the form 

x = xo.xi • • • x t x 2 e , 

where, since x > 0, we have xo = 0, and because of normalization x\ = 1. 
Generally, x^ e {0, 1}, and e is the exponent. If e = 2k (i.e., the exponent 



TLFeBOOK 



PROBLEMS 337 

is even), we do not adjust x. But if e = 2k + 1 (i.e., the exponent is odd), 
we shift the mantissa to the right by one bit position so that e becomes 
e — 2k J r 2. Thus, x now has the form 

x — a x 2 e , 

where a € [\, 1) in general, and e is an even number. Immediately, yfx — 
*fa x 2 e l 2 . From this description we see that any square root algorithm need 
work with arguments only on the interval [^,1] without loss of generality 
(w.l.o.g.). 



(a) Finding the square root of x — a is the same problem as solving f(x) — 
x 2 — a — 0. Show that 
the iterative algorithm 



x 2 — a — 0. Show that the Newton-Raphson method for doing so yields 



Pn+i = -( Pn + —\, (7.P.3) 

where p n — > */a. [Comment: It can be shown via an argument based 
on the theorem in Problem 7.15 that to ensure convergence, we should 
set po — I (2a + 1). A simpler choice for the starting value is p$ = 



i(i + 1) = | (since we know a € [4, 1]). 



3' 
2M i •-> — 8 V"" v " "* ^""" " = L 4 > 

(b) Mikami et al. (1992) suggest an alternative algorithm to find */x. They 
recommend that for a e [j, 1] the square root of a be obtained by the 
algorithm 

p n+l =/3(a- p 2 ) + p n . (7.P.4) 

Define error sequence (e n ) as e n — «Ja — p n . Show that 
e„+i = Pel + (1 - 2fi^/a)e„. 

[Comment: Mikami et al. recommend that po — 0.666667a + 0.354167.] 

(c) What condition on fi gives quadratic convergence for the algorithm in 
(7.P.4)? 

(d) Some microprocessors are intended for applications in high-speed digital 
signal processing. As such, they tend to be fixed-point machines with a 
high-speed hardware multiplier, but no divider unit. Floating-point arith- 
metic tends to be avoided in this application context. In view of this, 
what advantage might (7.P.4) have over (7. P. 3) as a means to compute 
square roots? 

7.17. Review Problem 7.10. Use x — xq + jx\ (xo, x\ e R) to rewrite the equation 
f(x) = jt 3 — 3x 2 + 4x — 2 — in the form 

fo(xo, *i) = 0, fi(xo,xi) = 0. 



TLFeBOOK 



338 NONLINEAR SYSTEMS OF EQUATIONS 

Use the Newton-Raphson method (as implemented in MATLAB) to solve 
this nonlinear system of equations for the complex roots of f(x) — 0. Use the 
starting points Jo = [ xo,o *o,i ] T — [ 2 2 ] T and 1q — [ 2 —2 ] T . 
Output six iterations in both cases. 

7.18. Consider the nonlinear system of equations 

f( X ,y) = X 2 + y 2 -l=0, 
g(x,y)=\x 2 + 4y 2 -l=0. 

(a) Sketch / and g on the (x, y) plane. 

(b) Solve for the points of intersection of the two curves in (a) by hand 
calculation. 

(c) Write a MATLAB function that uses the Newton-Raphson method to 
solve for the points of intersection. Use the starting vectors [xoyo] r = 
[±1 ± l] r . Output six iterations in all four cases. 

7.19. If y n+l = 1 - byl and x„ = Ux - ±) y„ + \ for n e Z+, then, if b = 
jk 2 — jX, show that 

%n + \ == XXn\i X n ). 

7.20. De Angeli, et al, [18] suggest an alternative synchronization scheme (i.e., 
alternative to deadbeat synchronization). This works as follows. Suppose 
that at the transmitter 

XQ,n + l = 1 -a*o,n + x l,«- 
*l,n+l = Pxo,„, 
y n = \-ax 2 n . 

The expression for y„ here replaces that in (7.86). At the receiver 

XQ, n +l — xi,„ + y„ 
X\,n+l = &XQ,n- 

(a) Error sequences are defined as 

0Xi,n == X[ n X{ n 

for i e {0, 1}. Find conditions on a and ft, giving linin^oo ||5x„|| = 
(Sx„ = [Sxo.n <5^i, n ] r ). 

(b) Prove that using y n = 1 — axh to synchronize the receiver and trans- 
mitter is not a secure synchronization method [i.e., an eavesdropper may 
collect enough elements from (y n ) to solve for the key {a, fi}]. 



TLFeBOOK 



PROBLEMS 339 

7.21. The chaotic encryption scheme of De Angeli et al. [18], which employs 
deadbeat synchronization, was shown to be vulnerable to a known plaintext 
attack, assuming a polynomial message sequence (recall Section 7.6). Show 
that it is vulnerable to a known plaintext attack when the message sequence 
is given by 

s n — a sin(o)M + </>) . 

Assuming that the eavesdropper already knows w and </>, show that a, a and 
jS can be obtained by solving a third-order linear system of equations. How 
many cyphertext elements c n are needed to solve the system? 

7.22. Write, compile, and run a C program to implement the De Angeli et al. 
chaotic encryption/decryption scheme in Section 7.6. (C is suggested here, 
as I am not certain that MATLAB can do the job so easily.) Implement both 
the encryption and decryption algorithms in the same program. Keep the 
program structure simple and direct (i.e., avoid complicated data structures, 
and difficult pointer operations). The program input is plaintext from a file, 
and the output is decrypted cyphertext (also known as "recovered plaintext") 
that is written to another file. The user is to input the encryption key {a, /3}, 
and the decryption key {a\, fi{\ at the terminal. Test your program out on 
some plaintext file of your own making. It should include keyboard charac- 
ters other than letters of the alphabet and numbers. Of course, your program 
must convert character data into a floating-point format. The floating-point 
numbers are input to the encryption algorithm. Algorithm output is also a 
sequence of floating-point numbers. These are decrypted using the decryp- 
tion algorithm, and the resulting floating-point numbers are converted back 
into characters. There is a complication involved in decryption. Recall that 
nominally the plaintext is recovered according to 

_ y«+2 

s n — 

Cn 

for n = 0, 1, . .., N — 1. However, this is a floating-point operation that 
incurs rounding error. It is therefore necessary to implement 

yn+2 -_ 

s„ = h offset 



for which offset is a small positive value (e.g., 0.0001). The rationale is 
as follows. Suppose that nominally (i.e., in the absence of roundoff) s n — 
yn+i/cn = 113.000000. The number 113 is the ASCII (American Standard 
Code for Information Interchange) code for some text character. Rounding in 
division may instead yield the value 112.999999. The operation of converting 
this number to an integer type will give 112 instead of 1 13. Clearly, the offset 
cures this problem. A rounding error that gives, for instance, 113.001000 is 
harmless. We mention that if plaintext (s„) is from sampled voice or video, 



TLFeBOOK 



and 



340 NONLINEAR SYSTEMS OF EQUATIONS 

then these rounding issues are normally irrelevant. Try using the following 
key sets: 

a= 1.399, £ = 0.305, 

ai = 1.399, Pi =0.305, 

a= 1.399,0 = 0.305, 

ai = 1.389, Pi =0.304. 

Of course, perfect recovery is expected for the first set, but not the second set. 
[Comment: It is to be emphasized that in practice keys must be chosen that 
cause the Henon map to be chaotic to good approximation on the computer. 
Keys must not lead to an unstable system, a system that converges to a 
fixed point, or to one with a short-period oscillation. Key choices leading to 
instability will cause floating-point overflow (leading to a crash), while the 
other undesirable choices likely generate security hazards. Ideally (and more 
practically) the cyphertext array (floating-point numbers) should be written to 
a file by an encryption program. A separate decryption program would then 
read in the cyphertext from the file and decrypt it, producing the recovered 
plaintext, which would be written to another file (i.e., three separate files: 
original file of input plaintext, cyphertext file, and recovered plaintext file).] 



TLFeBOOK 



8 



Unconstrained Optimization 



8.1 INTRODUCTION 

In engineering design it is frequently the case that an optimal design is sought 
with respect to some performance criterion or criteria. Problems of this class are 
generally referred to as optimal design problems, or mathematical programming 
problems. Frequently nonlinearity is involved, and so the term nonlinear program- 
ming also appears. There are many subcategories of problem types within this broad 
category, and so space limitations will restrict us to an introductory presentation 
mainly of so-called unconstrained problems. Even so, only a very few ideas from 
within this category will be considered. However, some of the methods treated in 
earlier chapters were actually examples of optimal design methods within this cate- 
gory. This would include least-squares ideas, for example. In fact, an understanding 
of least-squares ideas from the previous chapters helps a lot in understanding the 
present one. Although the emphasis here is on unconstrained optimization problems, 
some consideration of constrained problems appears in Section 8.5. 



8.2 PROBLEM STATEMENT AND PRELIMINARIES 

In this chapter we consider the problem 

min/(;c) (8.1) 

jcsR" 

for which f(x) e R. This notation means that we wish to find a vector x that min- 
imizes the function f(x). We shall follow previous notational practices, so here 
x — [xq x\ ... x„_2 x n -\\ T . In what follows we shall usually assume that f(x) 
possesses all first- and second-order partial derivatives (with respect to the elements 
of x), and that these are continuous functions, too. In the present context the func- 
tion f(x) is often called the objective function. For example, in least-squares prob- 
lems (recall Chapter 4) we sought to minimize V(a) — \\f(x) — J2k=o a k<Pk(x)\\ 2 
with respect to a € R N . Thus, V(a) is an objective function. The least-squares 
problem had sufficient structure so that a simple solution was arrived at; specifi- 
cally, to find the optimal vector a (denoted a), all we had to do was solve a linear 
system of equations. 



An Introduction to Numerical Analysis for Electrical and Computer Engineers, by C.J. Zarowski 
ISBN 0-471-46737-5 © 2004 John Wiley & Sons, Inc. 



341 



TLFeBOOK 



342 



UNCONSTRAINED OPTIMIZATION 



Now we are interested in solving more general problems. For example, we might 
wish to find x — [xqx\] t to minimize Rosenbrock' s function 



f(x) = l00(x 1 - X 2) 2 + (l-x Q ) 2 . 



(8.2) 



This is a famous standard test function (taken here from Fletcher [1]) often used 
by those who design optimization algorithms to test out their theories. A contour 
plot of this function appears in Fig. 8.1. It turns out that this function has a unique 
minimum at x = x = [1 l] r . As before, we have used a "hat" symbol to indicate 
the optimal solution. 

Some ideas from vector calculus are essential to understanding nonlinear opti- 
mization problems. Of particular importance is the gradient operator. 



3 3 3 

3xo 9*i 9x„_ 

For example, if we apply this operator to Rosenbrock' s function, then 



(8.3) 



V/(x) 



3*0 

df(x) 

dxi 



-400x (xi - x%) - 2(1 - x ) 



200(xi - x$ 



(8.4) 




Figure 8.1 Contour plot of Rosenbrock's function. 



TLFeBOOK 



PROBLEM STATEMENT AND PRELIMINARIES 



343 



We observe that V/(Jc) = [0 0] T (recall that x — [1 l] T ). In other words, the 
gradient of the objective function at x is zero. Intuitively we expect that if x is 
to minimize the objective function, then we should always have V/(jc) = 0. This 
turns out to be a necessary but not sufficient condition for a minimum. To see this, 
consider g(x) — —x 2 (x e R) for which 

Vg(x) = -2x. 

Clearly, Vg(0) = yet x — maximizes g{x) instead of minimizing it. Consider 
h(x) — x 3 , so that 

Vh(x) = 3x 2 

for which Vft(0) = 0, but x — neither minimizes nor maximizes h(x). In this 
case x — corresponds to a one-dimensional version of a saddle point. 

Thus, finding x such that V/(x) = generally gives us minima, maxima, or 
saddle points. These are collectively referred to as stationary points. We need (if 
possible) a condition that tells us whether a given stationary point corresponds to 
a minimum. Before considering this matter, we note that another problem is that 
a given objective function will often have several local minima. This is illustrated 
in Fig. 8.2. The function is a quartic polynomial for which there are two values of 



40 



35 



30 



25 



20 



15 



10 






-1.5 



-0.5 



0.5 
x 



1.5 



2.5 



Figure 8.2 A function f(x) = x — 3.1x + 2x + 1 with two local minima. The local 
minimizers are near x = —0.5 and x = 2.25. The minimum near x = 2.25 is the unique 
global minimum of fix). 



TLFeBOOK 



344 



UNCONSTRAINED OPTIMIZATION 



x such that V fix) — 0. We might denote these by x a and Xb- Suppose that x a is the 
local minimizer near x — —0.5 and Xb is the local minimizer near x — 2.25. We see 
that f(x a + S) > f(x a ) for all sufficiently small (but nonzero) S e R. Similarly, 
f(xb + S) > f(xb) for all sufficiently small (but nonzero) S e R. But we see that 
f(xb) < f(x a ), so Xb is the unique global minimizer for fix). An algorithm for 
minimizing f(x) should seek Xb\ that is, in general, we seek the global minimizer 
(assuming that it is unique). Except in special situations, an optimization algorithm 
is usually not guaranteed to do better than find a local minimizer, and is not 
guaranteed to find a global minimizer. Note that it is entirely possible for the 
global minimizer not to be unique. A simple example is f(x) — sin(x) for which 
the stationary points are x — (2k + 1)tt/2 (A: € Z). The minimizers that are a subset 
of these points all give —1 for the value of sin(x). 

To determine whether a stationary point is a local minimizer, it is useful to have 
the Hessian matrix 



V 2 f(x) 



d 2 f(x) 
dxt dxj 



eR" 



(8.5) 



!,;=0,l,...,n-l 



This matrix should not be confused with the Jacobian matrix (seen in Section 7.5.2). 
The Jacobian and Hessian matrices are not the same. The veracity of this may be 
seen by comparing their definitions. For the special case of n = 2, we have the 
2x2 Hessian matrix 



V 2 /(*) = 



d\f(x) d 2 f(x) 

3xq 3^o3xi 

S 2 f(x) d 2 f(x) 

3xi 3xQ dx 2 



(8.6) 



If f(x) in (8.6) is Rosenbrock's function, then (8.6) becomes 



V 2 /(*) = 



1200x2 - 400xi + 2 -400xo 
-400xo 200 



(8.7) 



The Hessian matrix for f(x) helps in the following manner. Suppose that x is 
a local minimizer for objective function f(x). We may Taylor-series-expand f(x) 
around x according to 



f{x + h) = f{x)+h'Vftx) 



\h T [V 2 f{x)]h 



(8.8) 



where h eR". (This will not be formally justified.) If f(x) is sufficiently smooth 
and \\h\\ is sufficiently small, then the terms in (8.8) that are not shown may 
be entirely ignored (i.e., we neglect higher-order terms). In other words, in the 
neighborhood of x 



fix + h)*> f(x) + h 1 V/(x) + W [V z fix)]h 



(8.9) 



TLFeBOOK 



LINE SEARCHES 345 

is assumed. But this is the now familiar quadratic form. For convenience we will 
(as did Fletcher [1]) define 

G(x) = V 2 f(x), (8.10) 

so (8.9) may be rewritten as 

f(x + h) % f(x) + h T Vf(x) +\h T G(x)h. (8.11) 

Sometimes we will write G instead of G(x) if there is no danger of confusion. 
In Chapter 4 we proved that (8.11) has a unique minimum iff G > 0. In words, 
f(x) looks like a positive definite quadratic form in the neighborhood of a local 
minimizer. Therefore, the Hessian of f(x) at a local minimizer will be positive 
definite and thus represents a way of testing a stationary point to see if it is a 
minimizer. (Recall the second-derivative test for the minimum of a single-variable 
function from elementary calculus, which is really just a special case of this more 
general test.) More formally, we therefore have the following theorem. 

Theorem 8.1: A sufficient condition for a local minimizer x is that V/(i) = 0, 
and G(x) > 0. 

This is a simplified statement of Fletcher's Theorem 2.1.1 [1, p. 14]. The proof 
is really just a more rigorous version of the Taylor series argument just given, and 
will therefore be omitted. For convenience we will also define the gradient vector 

gW = V/(x). (8.12) 

Sometimes we will write g for g(x) if there is no danger of confusion. 

If we recall Rosenbrock's function again, we may now test the claim made earlier 
that x = [1 l] T is a minimizer for f(x) in (8.2). For Rosenbrock's function the 
Hessian is in (8.7), and thus we have 



G(x) = 



802 -400 
-400 200 



(8.13) 



The eigenvalues of this matrix are A. — 0.3994, 1002. These are both bigger than 
zero so G(x) > 0. We have already remarked that g(x) — 0, so immediately from 
Theorem 8.1 we conclude that x — [1 l] T is a local minimizer for Rosenbrock's 
function. 



8.3 LINE SEARCHES 

In general, for objective function f(x) we wish to allow x e R"; that is, we seek the 
minimum of f(x) by performing a search in an n-dimensional vector space. How- 
ever, the one-dimensional problem is an important special case. In this section we 
begin by considering n — 1 , so x is a scalar. The problem of finding the minimum 



TLFeBOOK 



346 UNCONSTRAINED OPTIMIZATION 

of f(x) [where f(x) e R for all x e R] is sometimes called the univariate search. 
Various approaches exist for the solution of this problem, but we will consider only 
the golden ratio search method (sometimes also called the golden section search). 
We will then consider the backtracking line search [3] for the case where x e R" 
for any n > 1. 

Suppose that we know f(x) has a minimum over the interval [xf , xi]. Define 
I 3 — xi — xf , which is the length of this interval. The index j represents the jth 
iteration of the search algorithm, so it represents the current estimate of the interval 
that contains the minimum. Select two points xi and x J h (xi < xf) such that they 
are symmetrically placed in the interval [xf , xi]. Specifically 

xi-xj =xl-x{. (8.14) 

A new interval [xf ,x ] u ] is created according to the following procedure, such 
that for all j 

7^r = t>i. (8.i5) 

If f(x£) > f(xl) then the minimum lies in [xi, xi], and so 7 /+1 = xi — xi. Our 
new points are given according to 

x l — x a > x u — x m> x a = x b (8.16a) 



and 



xl +1 =xi + -I i+1 . (8.16b) 

T 



If f(xi) < f(xl), then the minimum lies in [xf , xf], and so / /+1 = x } b — xf . Our 
new points are given according to 



7+1 / 7+1 j ;+l _ J 



and 



xi, x^=x J a (8.17a) 



x J a +1 =xi-^I' +1 . (8.17b) 

Since 1° — x® — x® (j — indicates the initial case), we must have X® = x® — -1° 
and x® — xf + -1°. Figure 8.3 illustrates the search procedure. This is for the 
particular case (8.17). 

Because of (8.15), the rate of convergence to the minimum can be estimated. 
But we need to know t; see the following theorem. 

Theorem 8.2: The search interval lengths of the golden ratio search algorithm 
are related according to 

IJ = IJ+ 1 +iJ+ 2 (8.18a) 



TLFeBOOK 



LINE SEARCHES 



347 




Figure 8.3 Illustrating the golden ratio search procedure. 

for which the golden ratio x is given by 

r = i(l + V5)« 1.62. 



(8.18b) 



Proof There are four cases to consider in establishing (8.18a). First, suppose 
that f(xi) > f(x J b ), so IJ +l = xi - x J a with 



J+ 1 



.7+1 



.7 + 1 



1 
-/ 

T 



/+] 



.7+1 



If it happens that f(x J a + ) > f{x[ ), then /'+ 2 = ^ 



/+i> 



^"+1 



.7+1 



Ki (via xi — x/ — x 3 u — xi). This implies that 7 ;+1 + / ;+2 = x^ — xi — V . 



On the other hand, if it happens that f(x 3 a ) < f(xi ), then 7 /+2 = x 



,7+1 



x /+1 - x /+1 



x ;+1 - x 1 



x, (via x, 
xi — xi). Again we have I J+i + I J+2 



7 + 1 



.7+1 _ 7+1 

A / — A w 

-x/ =/•'. 



.7+1 



Now suppose that f(x ] a ) < f(xi). Therefore, I J+l — x J h — xi with 



J+ 1 



.7+1 



If it happens that f(x J a )>f(xi + ), then V+ 2 = x 3 u 



-I j+l , 

x 



,7+1 



.7+1 



,7+1 



.7 + 1 



,7+1 



x i+l - x J ' - 
-7+1-, 



so that IJ +l + V+ 2 = x,; 



that /(x a ;+ ) < /(x^ + ), so therefore 7'+ 2 = x fc ;+ - x 



x; ; = /'. Finally, suppose 

,7+1 _ _; „7 _ v i „7 



TLFeBOOK 



348 UNCONSTRAINED OPTIMIZATION 

so yet again we conclude that /■ /+1 + L' +2 — xi — xj — P . Thus, (8.18a) 
now verified. 



Since 



— - I — = r and V = I j+1 + I j+2 , 



jj+l 77+2 



we have V +x = ±V and L' +1 = ±/ /+1 = XlJ, so 

V =IJ +1 + IJ +2 = -IJ + ±-I j , 



rl 



immediately implying that 



which yields (8.18b). 



T 2 = T + l, 



The golden ratio has a long and interesting history in art as well as science 
and engineering, and this is considered in Schroeder [2]. For example, a famous 
painting by Seurat contains figures that are proportioned according to this ratio 
(Fig. 5.2 in Schroeder [2] includes a sketch). 

The golden ratio search algorithm as presented so far assumes a user-provided 
starting interval. The golden ratio search algorithm has the same drawback as the 
bisection method for root finding (Chapter 7) in that the optimum solution must be 
bracketed before the algorithm can be successfully run (in general). On the other 
hand, an advantage of the golden ratio search method is that it does not need f(x) 
to be differentiable. But the method can be slow to converge, in which case an 
improved minimizer that also does not need derivatives can be found in Brent [4]. 

When setting up an optimization problem it is often advisable to look for ways 
to reduce the dimensionality of the problem, if at all possible. We illustrate this 
idea with an example that is similar in some ways to the least-squares problem 
considered in the beginning of Section 4.6. Suppose that signal fit) e R it e R) 
is modeled as 

f(t) = asm(—t + 4>)+ri(t), (8.19) 



where ??(f) is the term that accounts for noise, interference, or measurement errors 
in the data fit). In other words, our data are modeled as a sinusoid plus noise. 
The amplitude a, phase 4>, and period T are unknown parameters. We are assumed 
only to possess samples of the signal; thus, we have only the sequence elements 

fn = finT s ) = a sin fynT, + J + rjinT s ) (8.20) 

for n — 0, I, . . . , N — 1. The sampling period parameter T s is assumed to be 
known. As before, we may define an error sequence 

(In \ 

e n — /„ — a sin I — nT s + <f> I , (8.21) 



TLFeBOOK 



LINE SEARCHES 



349 



where again n — 0, I, . . . , N — 1. The objective function for our problem (using a 
least-squares criterion) would therefore be 



N-l 

V{a,T,<t>)= J2 

n=0 



. lit 

f n -asm) — nT s 



-i2 



(8.22) 



This function depends on the unknown model parameters a, T, and <p in a very non- 
linear manner. We might formally define our parameter vector to be x — [a T 4>] T e 
R 3 . This would lead us to conclude that we must search (by some means) a three- 
dimensional space to find x to minimize (8.22). But in this special problem it is 
possible to reduce the dimensionality of the search space from three dimensions 
down to only one dimension. Reducing the problem in this way makes it solvable 
using the golden section search method that we have just considered. 
Recall the trigonometric identity 



sin(A + B) — sin A cos B + cos A sin B. 
Using this, we may write 



(2tt 
a sin — nl 



, lit \ (lit 

— a cosd) sin — nT s + a sin d> cos — nT s 



(8.23) 



(8.24) 



Define x — [xq x{\ and 



2tt \ (2tt 
sin ( — nT s I cos I — nT s 



-lT 



(8.25) 



We may rewrite e n in (8.21) using these vectors: 



e n — J n v n % ■ 



(8.26) 



Note that the approach used here is the same as that used to obtain e„ in Eqn. (4.99). 
Therefore, the reader will probably find it useful to review this material now. 
Continuing in this fashion, the error vector e — [eoei ■ ■ ■ epj-\] T , data vector / = 
[/b/i • • • /n-i] T , and matrix of basis vectors 



all may be used to write 



L "N-l J 



eR 



JVx2 



e — f — Ax, 



(8.27) 



TLFeBOOK 



350 UNCONSTRAINED OPTIMIZATION 

for which our objective function may be rewritten as 

V(x) = e T e = f T f - 2x T A T f + x T A T Ax, (8.28) 

which implicitly assumes that we already know the period T. If T were known 
in advance, we could use the method of Chapter 4 to minimize (8.28); that is, 
the optimum choice (least-squares sense) for x (denoted by x) satisfies the linear 
system 

A T Ax = A T f. (8.29) 

Because of (8.24) (with x — [xq x\] T ), we obtain 

xo = flcos(j!>, x\=asm<p. (8.30) 

Since we have x from the solution to (8.29), we may use (8.30) to solve for a 
and <p. 

However, our original problem specified that we do not know a, <p, or T in 
advance. So, how do we exploit these results to determine T as well as a and 4>1 
The approach is to change how we think about V (x) in (8.28). Instead of thinking 
of V(x) as a function of x, consider it to be a function of T, but with x — x as 
given by (8.29). From (8.25) we see that v n depends on T, so A is also a function 
of T, Thus, x is a function of T, too, because of (8.29). In fact, we may emphasize 
this by rewriting (8.29) as 

[A(T)] T A(T)x(T) = [A(T)ff. (8.31) 

In other words, A = A(T), andx = x (T). The objective function V(x) then becomes 

Vi(T) = V(x(T)) = f f - 2[x(T)] T A T (T)f + [x(T)f A T (T)A(T)x(T). 

(8.32) 
However, we may substitute (8.31) into (8.32) and simplify with the result that 

Vi(T) = f f - f A(T)[A T (T)A(T)r l A T (T)f. (8.33) 

We have reduced the search space from three dimensions down to one, so we 
may apply the golden section search algorithm (or some other univariate search 
procedure) to objective function V\(T). This would result in determining the opti- 
mum period T [which minimizes V\(T)]. We then compute A(T), and solve 
for x using (8.29) as before. Knowledge of x allows us to determine a and </> 
via (8.30). 

Figure 8.4 illustrates a typical noisy sinusoid and the corresponding objective 
function V\(T), In this case the parameters chosen were T = 24 h, T s — 5 min, 
a = 1, </> = — jt/10 radians, and N — 500. The noise component in the data was 
created using MATLAB's Gaussian random-number generator. The noise variance 
is a 2 — 0.5, and the mean value is zero. We observe that the minimum value of 
V\{T) certainly corresponds to a value of T that is at or close to 24 h. 



TLFeBOOK 



LINE SEARCHES 



351 



(a) 




500 
450 
400 
350 
300 



250 



16 



(b) 



20 25 

Time (hours) 






18 



20 



22 24 26 

Time (hours) 



28 



30 



32 



Figure 8.4 An example of a noisy sinusoid (a) and the corresponding objective function 
Vi(T)(b). 



Appendix 8. A contains a sample MATLAB program implementation of the 
golden section search algorithm applied to the noisy sinusoid problem depicted 
in Fig. 8.4. In golden.m Topt is T , and eta rj is the random noise sequence r](nT s ). 

We will now consider the backtracking line search algorithm. The exposition 
will be similar to that in Boyd and Vandenberghe [5]. 

Now we assume that f(x) is our objective function with x e R". In a general 
line search we seek the minimizing vector sequence (x^'), k e Z + , and x w € R" 
(i.e., x^ = [xq x\ • • • x^^] 7 ) constructed according to the iterative process 



x ik+i) =x ik) + t m s (k) 



(8.34) 



where t^' e R + , and t w > except when x^ is optimal [i.e., minimizes f(x)]. 
The vector s^ is called the search direction, and scalar t ^ is called the step size. 
Because line searches (8.34) are descent methods, we have 



,(*+!) 



f(x^>) < f(x^>) 



(/<K 



(8.35) 



Ak) 



except when x ( ' is optimal. The "points" x 



(*+i) 



Ak) 



lie along a line in the direction 



in n -dimensional space R", and since the minimum of f(x) must lie in the 



TLFeBOOK 



352 UNCONSTRAINED OPTIMIZATION 

direction that satisfies (8.35), we must ensure that s^ satisfies [recall (8.12)] 

g(x (k) ) T s (k) = [Vf(x (k) )] T s (k) < 0. (8.36) 

Geometrically, the negative- gradient vector — g{x^) (which "points down") makes 
an acute angle (i.e., one of magnitude <90°) with the vector s", [Recall (4.130).] 
If s^ satisfies (8.36) for f(x^) (i.e., f(x) at x — x^), it is called a descent direc- 
tion for f(x) at x^\ A general descent algorithm has the following pseudocode 
description: 

Specify starting point x' 0) e R"; 

k:=0; 

while stopping criterion is not met do begin 

Find s^'; { determine descent direction } 

Find fW; { line search step } 

Compute x<*+ 1 > := x™ + t^s^ ■ 
{ update step } 

k:=k+-\; 

end ; 

Newton's method with the backtracking line search (Section 8.4) is a specific 
example of a descent method. There are others, but these will not be considered in 
this book. 

Now we need to say more about how to choose the step size t^ on the assump- 
tion that s w is known. How to determine the direction s'*' is the subject of 
Section 8.4. 

So far /|R" — > R. Subsequent considerations are simplified if we assume that 
f(x) satisfies the following definition. 

Definition 8.1: Convex Function Function /|R" -> R is called a convex 
function if for all $ e [0, 1], and for any x, y e R" 

f(0x + (1 - 6)y) < 9f(x) + (1 - $)f(y). (8.37) 

We emphasize that the domain of definition of f(x) is assumed to be R". It is 
possible to modify the definition to accommodate f(x) where the domain of f(x) 
is a proper subset of R". The geometric meaning of (8.37) is shown in Fig. 8.5 
for the case where x e R (i.e., one-dimensional case). We see that when f(x) 
is convex, the chord, which is the line segment joining (x, f(x)) to (y, f(y)), 
always lies above the graph of fix). Now further assume that f(x) is at least 
twice continuously differentiable in all elements of the vector x. We observe that 
for any x,y e R", if f(x) is convex, then 

f(y) > fix) + [Vf(x)] T (y - x). (8.38) 

From the considerations of Section 8.2 it is easy to believe that f(x) is convex iff 

V 2 f(x) > (8.39) 



TLFeBOOK 



NEWTON'S METHOD 



353 



(y, W) 



fix) 

k 




(x, f(x)) 



Figure 8.5 Graph of a convex function fix) (x e R) and the chord that connects the points 
(*,/(*)) and (y, /OO). 

for all x e R"; that is, fix) is convex iff its Hessian matrix is at least positive 
semidefinite (recall Definition 4.1). The function f(x) is said to be strongly convex 
iff V 2 /(x) > for all x e R". 

The backtracking line search attempts to approximately minimize f(x) along 
the line {x + ts\t > 0} for some given s (search direction at x). The pseudocode 
for this algorithm is as follows: 

Specify the search direction s; 

t :=1; 

while (f(x + fs) > f(x) + St[Wf(x)] T s) 

t := at; 

end ; 

In this algorithm < 8 < j, and < a < 1. Commonly, 8 e [0.01,0.30], and 
a e [0.1,0.5]. These parameter ranges will not be justified here. As suggested 
earlier, how to choose s will be the subject of the next section. The method is 
called "backtracking" as it begins with t = 1, and then reduces t by factor a until 
f(x + ts) < f{x) +8tV T f(x)s. [We have V T f(x) = [V/(x)] r .] Recall that s 
is a descent direction so that (8.36) holds, specifically, V r f(x)s < 0, and so if t 
is small enough [recall (8.9)], then 

f(x + ts) % fix) + tV T fix)s < fix) + 8tV T fix)s, 

which shows that the search must terminate eventually. We mention that the back- 
tracking line search will terminate even if fix) is only "locally convex" — convex 
on some proper subset S of R" . This will happen provided x e S in the algorithm. 



8.4 NEWTON'S METHOD 

Section 8.3 suggests attempting to reduce an n-dimensional search space to a 
one-dimensional search space. Of course, this approach seldom works, which is 



TLFeBOOK 



354 UNCONSTRAINED OPTIMIZATION 

why there is an elaborate body of methods on searching for minima in higher- 
dimensional spaces. However, these methods are too involved to consider in detail 
in this book, and so we will only partially elaborate on an idea from Section 8.2. 
The quadratic model from Section 8.2 suggests an approach often called New- 
ton's method. Suppose that x™ is the current estimate of the sought-after optimum 
x. Following (8.11), we have the Taylor approximation 

f(x (k) + S) % V(S) = f{x {k) ) + S T g(x (k) ) + ±8 T G(x (k) )8 (8.40) 

for which S e R" since x™ € R". Since x^ is not necessarily the minimum x, 
usually g(x^) ^ 0. Vector S is selected to minimize V(<5), and since this is a 
quadratic form, if G(x^) > 0, then 

G(x (k) )8 = -g(x (k) ). (8.41) 

The next estimate of x is given by 

X (k+D = x (k) + s _ (8 42) 

Pseudocode for Newton's method (in its most basic form) is 

Input starting point x* 0) ; 

k:=0; 

While stopping criterion is not met do begin 

G(xW)« := -g(x<Q); 

x (/c+1). = x W +a . 

/t:=/c+1; 

end; 

The algorithm will terminate (if all goes well) with the last vector x { • l ' as a good 
approximation to x. However, the Hessian G*- ' — G{x^') may not always be 
positive definite, in which case this method can be expected to fail. Modification 
of the method is required to guarantee that at least it will converge to a local 
minimum. Said modifications often involve changing the method to work with line 
searches (i.e., Section 8.3 ideas). We will now say more about this. 

As suggested in Ref. 5, we may combine the backtracking line search with the 
basic form of Newton's algorithm described above. A pseudocode description of 
the result is 

Input starting point x* 0) , and a tolerance e > 0; 

k := 0; 

S (0) := -[G(x (0) )n 1 g(x(°)); { search direction atx< 0) } 

X 2 := -gW; 

while k 2 > e do begin 

Use backtracking line search to find (^ forx^ and s^; 
X (fr+1) . = x (k) + t (k) s (k). 

S (*+D := -[G(x<' < + 1 >)r 1 g(x(' < + 1 >); 

A 2 :=-g 7 "(x<' ( + 1 ))s(' f + 1 ); 

k :=/c + 1; 

end; 



TLFeBOOK 



NEWTON'S METHOD 355 

The algorithm assumes that G(x^) > for all k. In this case [G(x^^)] _1 > as 
well. If we define (for all x e R") 

\\x\\ 2 G(y) ^x T [G(y)r 1 x, (8.43) 

then ||x||(j(y) satisfies the norm axioms (recall Definition 1.3), and is in fact an 
example of a weighted norm. But why do we consider X 2 as a stopping criterion in 
Newton's algorithm? If we recall the term jS T G(x^)S in (8.40), since S satisfies 
(8.41), at step k we must have 

\& T G(x^)& = i/(x«)[G(x«)]- 1 g(x«) = l 2\\g{x (k) )\\ 2 G(x(k)) = j^ 2 - (8.44) 

Estimate x w of x is likely to be good if (8.44) in particular is small [as opposed 
to merely considering squared unweighted norm g(x^) T g(x^)]. It is known that 
Newton's algorithm can converge rapidly (quadratic convergence). An analysis 
showing this appears in Boyd and Vandenberghe [5] but is omitted here. 

As it stands, Newton's method is computationally expensive since we must solve 
the linear system in (8.41) at every step k. This would normally involve applying 
the Cholesky factorization algorithm that was first mentioned (but not considered 
in detail) in Section 4.6. We remark in passing that the Cholesky algorithm will 
factorize G^ according to 

G (k) = LDL T , (8.45) 

where L is unit lower triangular and D is a diagonal matrix. We also mention that 
G^ 1 > iff the elements of D are all positive, so the Cholesky algorithm provides 
a built-in positive definiteness test. The decomposition in (8.45) is a variation on 
the LU decomposition of Chapter 4. Recall that we proved in Section 4.5 that 
positive definite matrices always possess such a factorization (see Theorems 4.1 
and 4.2). So Eq. (8.45) is consistent with this result. The necessity to solve a linear 
system of equations at every step makes us wonder if sensitivity to ill-conditioned 
matrices is a problem in Newton's method. It turns out that the method is often 
surprisingly resistant to ill conditioning (at least as reported in Ref. 5). 

Example 8.1 A typical run of Newton's algorithm with the backtracking line 
search as applied to the problem of minimizing Rosenbrock's function [Eq. (8.2)] 
yields the following output: 



 k     t^(k)       λ²          x_0^(k)    x_1^(k)

 0     1.0000     800.00499     2.0000     2.0000
 1     0.1250       1.98757     1.9975     3.9900
 2     1.0000       0.41963     1.8730     3.4925
 3     1.0000       0.49663     1.6602     2.7110
 4     0.5000       0.38333     1.5945     2.5382
 5     1.0000       0.21071     1.4349     2.0313
 6     0.5000       0.14763     1.3683     1.8678
 7     1.0000       0.07134     1.2707     1.6031
 8     1.0000       0.03978     1.1898     1.4092
 9     1.0000       0.01899     1.1076     1.2201
10     1.0000       0.00627     1.0619     1.1255
11     1.0000       0.00121     1.0183     1.0350
12     1.0000       0.00006     1.0050     1.0099


The search parameters selected in this example are α = 0.5, δ = 0.3, and ε =
0.00001, and the final estimate is x^(13) = [1.0002 1.0003]^T. For these same param-
eters, if instead x^(0) = [-1 1]^T, then 18 iterations are needed, yielding x^(18) =
[0.9999 0.9998]^T. Figure 8.6 shows the sequence of points x^(k) for the case
x^(0) = [-1 1]^T. The dashed line shows the path from starting point [-1 1]^T to
the minimum at [1 1]^T, and we see that the algorithm follows the "valley" to the
optimum solution quite well.




Figure 8.6 The sequence of points x^(k) generated by Newton's method with the back-
tracking line search as applied to Rosenbrock's function using the parameters α = 0.5,
δ = 0.3, and ε = 0.00001 with x^(0) = [-1 1]^T (see Example 8.1). The path followed is
shown by the dashed line.




8.5 EQUALITY CONSTRAINTS AND LAGRANGE MULTIPLIERS 

In this section we modify the original optimization problem in (8.1) according to 

min_{x∈R^n} f(x)
subject to f_i(x) = 0 for all i = 0, 1, ..., m-1,   (8.46)

where f(x) is the objective function as before, and f_i(x) = 0 for i = 0, 1, ...,
m-1 are the equality constraints. The functions f_i : R^n → R are equality constraint
functions. The set F = {x | f_i(x) = 0, i = 0, ..., m-1} is called the feasible set.
We are interested in

f̂ = f(x̂) = min_{x∈F} f(x).   (8.47)

There may be more than one x = x̂ ∈ R^n satisfying (8.47); that is, the set

X̂ = {x | x ∈ F, f(x) = f̂}   (8.48)

may have more than one element in it. We assume that our problem yields X̂ with
at least one element in it (i.e., X̂ ≠ ∅).

Equation (8.47) is really a more compact statement of (8.46), and in words it states
that any minimizer x̂ of f(x) must also satisfy the equality constraints. We recall
examples of this type of problem from Chapter 4 [e.g., the problem of deriving a
computable expression for κ₂(A), and in the proof of Theorem 4.5]. More examples
will be seen later. Generally, in engineering, constrained optimization problems are 
more common than unconstrained problems. However, it is important to understand 
that algorithms for unconstrained problems form the core of algorithms for solving 
constrained problems. 

We now wish to make some general statements about how to solve (8.46), and 
in so doing we introduce the concept of Lagrange multipliers. The arguments to 
follow are somewhat heuristic, and they follow those of Section 9.1 in Fletcher [1]. 

Suppose that x̂ ∈ R^n is at least a local minimizer for objective function f(x) ∈
R. Analogously to (8.9), we have

f_i(x + δ) ≈ f_i(x) + g_i^T(x)δ + ½δ^T[∇²f_i(x)]δ,   (8.49)

where δ ∈ R^n is some incremental step away from x, and g_i(x) = ∇f_i(x) ∈ R^n
(i = 0, 1, ..., m-1) is the gradient of the ith constraint function at x. A feasible
incremental step δ must yield x + δ ∈ F, and so must satisfy

f_i(x + δ) = f_i(x) = 0   (8.50)

for all i. From (8.49) this implies the condition that δ must lie along a feasible
direction s ∈ R^n (at x = x̂) such that

g_i^T(x)s = 0   (8.51)



again for all i. [We shall suppose that the vectors g_i(x) = ∇f_i(x) are linearly
independent for all i.] Recalling (8.36), if s were also a descent direction at x̂, then

g^T(x̂)s < 0   (8.52)

would hold (g(x) = ∇f(x) ∈ R^n). In this situation δ would reduce f(x), as δ is
along direction s. But this is impossible since we have assumed that x̂ is a local
minimizer for f(x). [For any s at x̂, we expect to have g^T(x̂)s = 0.] Consequently,
no direction s can satisfy (8.51) and (8.52) simultaneously. This statement remains
true if g(x̂) is a linear combination of the g_i(x̂), that is, if, for suitable λ_i ∈ R, we have

g(x̂) = Σ_{i=0}^{m-1} λ_i g_i(x̂).¹   (8.53)

Thus, a necessary condition for x̂ to be a local minimizer (or, more generally, a
stationary point) of f(x) is that [rewriting (8.53)]

g(x̂) − Σ_{i=0}^{m-1} λ_i g_i(x̂) = 0   (8.54)

for suitable λ_i ∈ R, which are called the Lagrange multipliers. We see that (8.54)
can be expressed as [with ∇ as in (8.3)]

∇[ f(x) − Σ_{i=0}^{m-1} λ_i f_i(x) ] = 0.   (8.55)

In other words, we replace the original problem (8.46) with the mathematically
equivalent problem of minimizing

L(x, λ) = f(x) − Σ_{i=0}^{m-1} λ_i f_i(x),   (8.56)

¹ Equation (8.53) may be more formally justified as follows. Note that the same argument will also
extend to make (8.53) a necessary condition for x̂ to be a local maximizer, or a saddle point of f(x).
Thus, (8.53) is really a necessary condition for x̂ to be a stationary point of f(x). We employ proof by
contradiction. Suppose that

g(x̂) = Gλ + h,

where G = [g_0(x̂) ··· g_{m-1}(x̂)] ∈ R^{n×m}, λ = [λ_0 ··· λ_{m-1}]^T ∈ R^m, and h ≠ 0. Further assume that
h ∈ R^n is the component of g(x̂) that is orthogonal to all g_i(x̂). Thus, G^T h = 0. In this instance
s = -h will satisfy both (8.51) and (8.52) [i.e., g^T(x̂)s = -[λ^T G^T + h^T]h = -h^T h < 0]. Satisfaction
of (8.52) implies that a step δ in the direction s will reduce f(x) [i.e., f(x̂ + δ) < f(x̂)]. But this cannot
be the case since x̂ is a local minimizer of f(x). Consequently, h ≠ 0 is impossible, which establishes
(8.53).




called the Lagrangian function (or Lagrangian), where, of course, x ∈ R^n and
λ ∈ R^m. Since L : R^n × R^m → R, in order to satisfy (8.55), we must determine
x̂, λ̂ such that

∇L(x̂, λ̂) = 0,   (8.57)

where now instead [of (8.3)] we have ∇ = [∇_x^T  ∇_λ^T]^T such that

∇_x = [∂/∂x_0 ··· ∂/∂x_{n-1}]^T,   ∇_λ = [∂/∂λ_0 ··· ∂/∂λ_{m-1}]^T.   (8.58)

Now we see that a necessary condition for a stationary point of f(x) subject to
our constraints is that x̂ and λ̂ form a stationary point of the Lagrangian function.
Of course, to resolve whether stationary point x̂ is a minimizer requires additional
information (e.g., the Hessian). Observe that ∇_λ L(x, λ) = [-f_0(x) -f_1(x) ···
-f_{m-1}(x)]^T, so ∇_λ L(x, λ) = 0 implies that f_i(x) = 0 for all i. This is why we take
derivatives of the Lagrangian with respect to all elements of λ; it is equivalent to
imposing the equality constraints as in the original problem (8.46).

We now consider a few examples of the application of the method of Lagrange 
multipliers. 

Example 8.2 This example is from Fletcher [1, pp. 196-198]. Suppose that

f(x) = x_0 + x_1,   f_0(x) = x_0² − x_1 = 0,

so x = [x_0 x_1]^T ∈ R², and the Lagrangian is

L(x, λ) = x_0 + x_1 − λ(x_0² − x_1).

Clearly, to obtain stationary points, we must solve

∂L/∂x_0 = 1 − 2λx_0 = 0,
∂L/∂x_1 = 1 + λ = 0,
∂L/∂λ = −(x_0² − x_1) = 0.

Immediately, λ = −1, so that 1 − 2λx_0 = 0 yields x_0 = −½, and x_0² − x_1 = 0
yields x_1 = ¼. Thus

x̂ = [−½  ¼]^T.

Is x̂ really a minimizer, or is it a maximizer, or a saddle point?




An alternative means to solve our problem is to recognize that, since x_0² − x_1 = 0,
we can actually minimize the new objective function

f'(x_0) = (x_0 + x_1)|_{x_1 = x_0²} = x_0 + x_0²

with respect to x_0 instead. Clearly, this is a positive definite quadratic with a well-
defined and unique minimum at x_0 = −½. Again we conclude x̂ = [−½  ¼]^T, and
it must specifically be a minimizer.

In Example 8.2, f̂ = f(x̂) = x̂_0 + x̂_1 = −¼ > −∞ only because of the con-
straint f_0(x) = x_0² − x_1 = 0. Without such a constraint, we would have f̂ = −∞.

We have described the method of Lagrange multipliers as being applied largely to 
minimization problems. But we have noted that this method applies to maximization 
problems as well because (8.57) is the necessary condition for a stationary point, 
and not just a minimizer. The next example is a simple maximization problem from 
geometry. 

Example 8.3 We wish to maximize

F(x, y) = 4xy

subject to the constraint

C(x, y) = ¼x² + y² − 1 = 0.

This problem may be interpreted as the problem of maximizing the area of a
rectangle of area 4xy such that the corners of the rectangle are on an ellipse centered
at the origin of the two-dimensional plane. The ellipse is the curve C(x, y) = 0.
(Drawing a sketch is a useful exercise.)

Following the Lagrange multiplier procedure, we construct the Lagrangian
function

G(x, y, λ) = 4xy + λ(¼x² + y² − 1).

Taking the derivatives of G and setting them to zero yields

∂G/∂x = 4y + ½λx = 0,   (8.59a)
∂G/∂y = 4x + 2λy = 0,   (8.59b)
∂G/∂λ = ¼x² + y² − 1 = 0.   (8.59c)



From (8.59a,b) we have

λ = −8y/x   and   λ = −2x/y,

which means that −2x/y = −8y/x, or x² = 4y². Thus, we may replace x² by 4y²
in (8.59c), giving

2y² − 1 = 0,

for which y² = ½, and so x² = 2. From these equations we easily obtain the loca-
tions of the corners of the rectangle on the ellipse. The area of the rectangle is also
seen to be four units.
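
A quick numerical check of this result (not part of the original example) is to
parameterize the ellipse as x = 2cos(θ), y = sin(θ) and search the area 4xy over θ;
the few MATLAB lines below are a sketch under that assumption.

% Sketch: verify Example 8.3 by parameterizing the ellipse x^2/4 + y^2 = 1
theta = linspace(0, pi/2, 100001);   % corner restricted to the first quadrant
x = 2*cos(theta); y = sin(theta);
area = 4*x.*y;                       % rectangle area 4xy
[amax, i] = max(area);
fprintf('max area = %.6f at (x,y) = (%.6f, %.6f)\n', amax, x(i), y(i));
% Expect max area = 4 at x = sqrt(2), y = 1/sqrt(2), in agreement with the text.
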

Example 8.4 Recall from Chapter 4 that we sought a method to determine
(compute) the matrix 2-norm

‖A‖_2 = max_{‖x‖_2 = 1} ‖Ax‖_2.

Chapter 4 considered only the special case x ∈ R² (i.e., n = 2). Now we consider
the general case for which n > 2.

Since ‖Ax‖_2² = x^T A^T A x = x^T R x with R = A^T A = R^T and R > 0 (if A is of full rank),
our problem is to maximize x^T R x subject to the equality constraint ‖x‖_2 = 1 (or
equivalently x^T x = 1). The Lagrangian is

L(x, λ) = x^T R x − λ(x^T x − 1)

since f(x) = x^T R x and f_0(x) = x^T x − 1. Now

f(x) = x^T R x = \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} x_i x_j r_{ij} = \sum_{i=0}^{n-1} r_{ii} x_i^2 + \sum_{i=0}^{n-1}\sum_{j=0, j≠i}^{n-1} x_i x_j r_{ij},

so that (using r_{ij} = r_{ji})

∂f/∂x_k = 2 r_{kk} x_k + \sum_{j=0, j≠k}^{n-1} r_{kj} x_j + \sum_{i=0, i≠k}^{n-1} x_i r_{ik}.



This reduces to

∂f/∂x_k = 2 r_{kk} x_k + 2\sum_{j=0, j≠k}^{n-1} r_{kj} x_j = 2\sum_{j=0}^{n-1} r_{kj} x_j

for all k = 0, 1, ..., n-1. Consequently, ∇_x f(x) = 2Rx. Similarly, ∇_x f_0(x) =
2x. Also, ∇_λ L(x, λ) = −x^T x + 1 = −f_0(x), and so ∇L(x, λ) = 0 yields the
equations

2Rx − 2λx = 0,
x^T x − 1 = 0.

The first equation states that the maximizing solution (if it exists) must satisfy the
eigenproblem

Rx = λx

for which λ is an eigenvalue and x is the corresponding eigenvector. Consequently,
x^T R x = λ x^T x, so

λ = \frac{x^T R x}{x^T x} = \frac{‖Ax‖_2^2}{‖x‖_2^2} = ‖Ax‖_2^2

must be chosen to be the biggest eigenvalue of R. Since R > 0, such an eigenvalue
will exist. As before (Chapter 4), we conclude that

‖A‖_2^2 = λ_{n-1}

for which λ_{n-1} is the largest of the eigenvalues λ_0, ..., λ_{n-1} of R = A^T A (λ_{n-1} ≥
λ_{n-2} ≥ ··· ≥ λ_0 > 0).
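
A brief MATLAB check of this conclusion (a sketch, not from the text) is to compare
norm(A,2) squared with the largest eigenvalue of R = A^T A for a random full-rank
test matrix; the size 5-by-3 chosen below is arbitrary.

% Sketch: ||A||_2^2 equals the largest eigenvalue of R = A'*A
A = randn(5,3);                  % random full-rank test matrix
R = A'*A;
lam = eig(R);                    % eigenvalues lambda_0 <= ... <= lambda_{n-1}
disp([norm(A,2)^2, max(lam)]);   % the two numbers should agree
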



APPENDIX 8.A MATLAB CODE FOR GOLDEN SECTION SEARCH 

%
% SineFit1.m
%
% This routine computes the objective function V_1(T) for user input
% T and data vector f as required by the golden section search test
% procedure golden.m. Note that Ts (sampling period) must be consistent
% with Ts in golden.m.
%
function V1 = SineFit1(f,T);

N = length(f);   % Number of samples collected
Ts = 5*60;       % 5 minute sampling period

% Compute the objective function V_1(T)
n = [0:N-1];
T = T*60*60;
A = [ sin(2*pi*Ts*n/T).' cos(2*pi*Ts*n/T).' ];
B = inv(A.'*A);
V1 = f.'*f - f.'*A*B*A.'*f;



% golden.m
%
% This routine tests the golden section search procedure of Chapter 8
% on the noisy sinusoid problem depicted in Fig. 8.4.
% It creates a test signal and uses SineFit1.m to compute
% the corresponding objective function V_1(T) (given by Equation (8.33)
% in Chapter 8).

function Topt = golden

% Compute the test signal f
N = 500;          % Number of samples collected
Ts = 5*60;        % 5 minute sampling period
T = 24*60*60;     % 24 hr period for the sinusoid
phi = -pi/10;     % phase angle of sinusoid
a = 1.0;          % sinusoid amplitude
var = .5000;      % desired noise variance
std = sqrt(var);
eta = std*randn(1,N);
for n = 1:N
   f(n) = a*sin(((2*pi*(n-1)*Ts)/T) + phi);
end;
f = f + eta;
f = f.';

% Specify a starting interval and initial parameters
% (units of hours)
xl = 16;
xu = 32;
tau = (sqrt(5)+1)/2;   % The golden ratio
tol = .05;             % Accuracy of the location of the minimum

% Apply the golden section search procedure
I = xu - xl;           % length of the starting interval
xa = xu - I/tau;
xb = xl + I/tau;

while I > tol
   if SineFit1(f,xa) >= SineFit1(f,xb)
      I = xu - xa;
      xl = xa;
      temp = xa;
      xa = xb;
      xb = temp + I/tau;
   else
      I = xb - xl;
      xu = xb;
      temp = xb;
      xb = xa;
      xa = temp - I/tau;
   end
end

Topt = (xl + xu)/2;    % Estimate of optimum choice for T



REFERENCES 

1. R. Fletcher, Practical Methods of Optimization, 2nd ed., Wiley, New York, 1987 
(reprinted July 1993). 

2. M. R. Schroeder, Number Theory in Science and Communication (with Applications
in Cryptography, Physics, Digital Information, Computing, and Self-Similarity), 2nd
(expanded) ed., Springer-Verlag, New York, 1986.

3. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Math-
ematics series, Vol. 37), Springer-Verlag, New York, 2000.

4. R. P. Brent, Algorithms for Minimization without Derivatives, Dover Publications, Mine-
ola, NY, 2002.

5. S. Boyd and L. Vandenberghe, Convex Optimization, preprint, Dec. 2001. 

PROBLEMS 

8.1. Suppose A ∈ R^{n×n}, and that A is not symmetric in general. Prove that

∇(x^T A x) = (A + A^T)x,

where ∇ is the gradient operator [Eq. (8.3)].

8.2. This problem is about ideas from vector calculus useful in nonlinear opti-
mization methods.

(a) If s, x, x′ ∈ R^n, then a line in R^n is defined by

x = x(α) = x′ + αs,   α ∈ R.

Vector s may be interpreted as determining the direction of the line in the
n-dimensional space R^n. The notation x(α) implies x(α) = [x_0(α) x_1(α)
··· x_{n-1}(α)]^T. Prove that the slope df/dα of f(x(α)) ∈ R along the line
at any x(α) is given by

df/dα = s^T ∇f,

where ∇ is the gradient operator [see Eq. (8.3)]. (Hint: Use the chain
rule for derivatives.)

(b) Suppose that u(x), v(x) ∈ R^n (again x ∈ R^n). Prove that

∇(u^T v) = (∇u^T)v + (∇v^T)u.

8.3. Use the golden ratio search method to find the global minimizer of the
polynomial objective function in Fig. 8.2. Do the computations using a pocket
calculator, with starting interval [x_l^(0), x_u^(0)] = [2, 2.5]. Iterate 5 times.

8.4. Review Problem 7.4 (in Chapter 7). Use a MATLAB implementation of the
golden ratio search method to find detection threshold η for P_fa = 0.1, 0.01,
and 0.001. The objective function is

f(η) = \sum_{k=0}^{(p/2)-1} (···).

Make reasonable choices about starting intervals.

8.5. Suppose x = [x_0 x_1]^T ∈ R², and consider

f(x) = x_0^4 + x_0 x_1 + (1 + x_1)^2.

Find general expressions for the gradient and the Hessian of f(x). Is G(0) >
0? What does this signify? Use the Newton-Raphson method (Chapter 7)
to confirm that x̂ = [0.6959 −1.3479]^T is a stationary point for f(x).
Select x^(0) = [0.7000 −1.3]^T as the starting point. Is G(x̂) > 0? What does
this signify? In this problem do all necessary computations using a pocket
calculator.

8.6. If x̂ ∈ R^n is a local minimizer for f(x), then we know that G(x̂) > 0, that
is, G(x̂) is positive definite (pd). On the other hand, if x̂ is a local maximizer
of f(x), then −G(x̂) > 0. In this case we say that G(x̂) is negative definite
(nd). Show that

f(x) = (x_0^2 − x_1)^2 + x_0^5

has only one stationary point, and that it is neither a minimizer nor a maxi-
mizer of f(x).

8.7. Take note of the criterion for a maximizer in the previous problem. For both
a = 6 and a = 8, find all stationary points of the function

f(x) = 2x_0^3 − 3x_0^2 − a x_0 x_1 (x_0 − x_1 − 1).

Determine which are local minima, local maxima, or neither.




8.8. Write a MATLAB function to find the minimum of Rosenbrock's function 
using the Newton algorithm (basic form that does not employ a line search). 
Separate functions must be written to implement computation of both the 
gradient and the inverse of the Hessian. The function for computing the 
inverse of the Hessian must return an integer that indicates whether the Hes- 
sian is positive definite. The Newton algorithm must terminate if a Hessian 
is encountered that is not positive definite. The I/O is to be at the terminal 
only. The user must input the starting vector at the terminal, and the program 
must report the estimated minimum, or it must print an error message if the 
Hessian is not positive definite. Test your program out on the starting vectors 

x^(0) = [-3 -3]^T and x^(0) = [0 10]^T.

8.9. Write and test your own MATLAB routine (or routines) to verify Example 8.1. 

8.10. Find the points on the ellipse

x²/a² + y²/b² = 1

that are closest to, and farthest from, the origin (x, y) = (0, 0). Use the
method of Lagrange multipliers.

8.11. The theory of Lagrange multipliers in Section 8.5 is a bit oversimplified.
Consider the following theorem. Suppose that x̂ ∈ R^n gives an extremum
(i.e., minimum or maximum) of f(x) ∈ R among all x satisfying g(x) = 0.
If f, g ∈ C¹[D] for a domain D ⊂ R^n containing x̂, then either

g(x̂) = 0 and ∇g(x̂) = 0,   (8.P.1)

or there is a λ ∈ R such that

g(x̂) = 0 and ∇f(x̂) − λ∇g(x̂) = 0.   (8.P.2)

From this theorem, candidate points x̂ for extrema of f(x) satisfying g(x) = 0
therefore are

(a) Points where f and g fail to have continuous partial derivatives.

(b) Points satisfying (8.P.1).

(c) Points satisfying (8.P.2).

In view of the theorem above and its consequences, find the minimum dis-
tance from x = [x_0 x_1 x_2]^T = [0 0 −1]^T to the surface

g(x) = x_0² + x_1² − x_2² = 0.




8.12. A second-order finite-impulse response (FIR) digital filter has the frequency
response H(e^{jω}) = \sum_{k=0}^{2} h_k e^{−jωk}, where h_k ∈ R are the filter parameters.
Since H(e^{jω}) is 2π-periodic we usually consider only ω ∈ [−π, π]. The
DC response of the filter is H(1) = H(e^{j0}) = \sum_{k=0}^{2} h_k. Define the energy
of the filter in the band [−ω_p, ω_p] to be

E = \frac{1}{2π} \int_{−ω_p}^{ω_p} |H(e^{jω})|² dω.

Find the filter parameters h = [h_0 h_1 h_2]^T ∈ R³ such that for ω_p = π/2
energy E is minimized subject to the constraint that H(1) = 1 (i.e., the gain
of the filter is unity at DC). Plot |H(e^{jω})| for ω ∈ [−π, π]. [Hint: E will
have the form E = h^T R h ∈ R, where R ∈ R^{3×3} is a symmetric Toeplitz
matrix (recall Problem 4.20). Note also that |H(e^{jω})|² = H(e^{jω})H*(e^{jω}).]

8.13. This problem introduces incremental condition estimation (ICE) and is based
on the paper C. H. Bischof, "Incremental Condition Estimation," SIAM J.
Matrix Anal. Appl. 11, 312-322 (April 1990). ICE can be used to estimate
the condition number of a lower triangular matrix as it is generated one
row at a time. Many algorithms for linear system solution produce triangular
matrix factorizations one row or one column at a time. Thus, ICE may be
built into such algorithms to warn the user of possible inaccuracies in the
solution due to ill conditioning. Let A_n be an n × n matrix with singular
values σ_1(A_n) ≥ ··· ≥ σ_n(A_n) > 0. A condition number for A_n is

κ(A_n) = σ_1(A_n)/σ_n(A_n).

Consider the order n lower triangular linear system

L_n x_n = d_n.   (8.P.3)

The minimum singular value, σ_n(L_n), of L_n satisfies

σ_n(L_n) ≤ ‖d_n‖_2 / ‖x_n‖_2

(‖x_n‖_2² = \sum_i x_n²(i)). Thus, an estimate (upper bound) of this singular value is

σ̂_n(L_n) = ‖d_n‖_2 / ‖x_n‖_2.

We would like to make this upper bound as small as possible. So, Bischof
suggests finding x_n to satisfy (8.P.3) such that ‖x_n‖_2 is maximized subject
to the constraint that ‖d_n‖_2 = 1. Given x_{n-1} such that L_{n-1}x_{n-1} = d_{n-1}
with ‖d_{n-1}‖_2 = 1 [which gives us σ̂_{n-1}(L_{n-1}) = 1/‖x_{n-1}‖_2], find s_n and
c_n such that ‖x_n‖_2 is maximized, where

L_n = [ L_{n-1}   0  ;  v^T   γ_n ],   d_n = [ s_n d_{n-1} ; c_n ],   x_n = [ s_n x_{n-1} ; (c_n − s_n α_n)/γ_n ],

with α_n = v^T x_{n-1}. Find α_n, c_n, s_n. Assume α_n ≠ 0. (Comment: The indexing of the singular
values used here is different from that in Chapter 4. The present notation is
more convenient in the present context.)

8.14. Assume that A ∈ R^{m×n} with m > n, but rank(A) < n is possible. As usual,
‖x‖_2² = x^T x. Solve the following problem:

min_{x∈R^n} [ ‖Ax − b‖_2² + δ‖x‖_2² ],

where δ > 0. This is often called the Tychonov regularization problem. It
is a simple ploy to alleviate problems with ill-conditioning in least-squares
applications. Since A is not necessarily of full rank, we have A^T A ≥ 0, but
not necessarily A^T A > 0. What is rank(A^T A + δI)? Of course, I is the order
n identity matrix.






9 



Numerical Integration 
and Differentiation 



9.1 INTRODUCTION 

We are interested in how to compute the integral

I = \int_a^b f(x)\,dx   (9.1)

for which f(x) ∈ R (and, of course, x ∈ R). Depending on f(x), and perhaps also
on [a, b], the reader knows that "nice" closed-form expressions for I rarely exist.
This forces us to consider numerical methods to approximate I. We have seen from
Chapter 3 that one approach is to find a suitable series expansion for the integral
in (9.1). For example, recall that we wished to compute the error function

erf(x) = \frac{2}{\sqrt{π}} \int_0^x e^{−t²}\,dt,   (9.2)



which has no antiderivative (i.e., "nice" formula). Recall that the error function is 
crucial in solving various problems in applied probability that involve the Gaussian 
probability density function [i.e., the function in (3.101) of Chapter 3]. The Taylor 
series expansion of Eq. (3.108) was suggested as a means to approximately evaluate 
erf(x), and is known to be practically effective if x is not too big. If x is large, then 
the asymptotic expansion of Example 3.10 was suggested. The series expansion 
methodology may seem to solve our problem, but there are integrals for which it 
is not easy to find series expansions of any kind. 

A recursive approach may be attempted as an alternative. An example of this was 
seen in Chapter 5, where finding the norm of a Legendre polynomial required solv- 
ing a recursion involving variables that were certain integrals [recall Eq. (5.96)]. 
As another example of this approach, consider the following case from Forsythe 
et al. [1]. Suppose that we wish to compute 



In = f X n i 

JO 



"' dx (9.3) 




for any n ∈ N (natural numbers). Recalling integration by parts, we see that

\int_0^1 x^n e^{x−1}\,dx = x^n e^{x−1}\Big|_0^1 − \int_0^1 n x^{n−1} e^{x−1}\,dx = 1 − nE_{n−1},

so

E_n = 1 − nE_{n−1}   (9.4)

for n = 2, 3, 4, .... It is easy to confirm that

E_1 = \int_0^1 x e^{x−1}\,dx = \frac{1}{e}.

This is the initial condition for recursion (9.4). We observe that E_n > 0 for all
n. But if, for example, MATLAB is used to compute E_19, we obtain the computed
solution Ê_19 = −5.1930, which is clearly wrong. Why has this happened?
Because of the need to quantize, E_1 is actually stored in the computer as

Ê_1 = E_1 + ε

for which ε is some quantization error. Assuming that the operations in (9.4) do
not lead to further errors (i.e., assuming no rounding errors), we may arrive at a
formula for Ê_n:

Ê_2 = 1 − 2Ê_1 = E_2 + (−2)ε,
Ê_3 = 1 − 3Ê_2 = E_3 + (−3)(−2)ε,
Ê_4 = 1 − 4Ê_3 = E_4 + (−4)(−3)(−2)ε,

and so on. In general

Ê_n = E_n + (−1)^{n−1}\,1·2·3···(n−1)\,n\,ε,   (9.5)

or

Ê_n − E_n = (−1)^{n−1} n!\,ε.   (9.6)

We see that even a very tiny quantization error ε will grow very rapidly during
the course of the computation, even without any additional rounding errors at all!
Thus, (9.4) is a highly unstable numerical procedure, and so must be rejected as
a method to compute (9.3). However, it is possible to arrive at a stable procedure
by modifying (9.4). Now observe that

E_n = \int_0^1 x^n e^{x−1}\,dx < \int_0^1 x^n\,dx = \frac{1}{n+1},   (9.7)




implying that E_n → 0 as n increases. Instead of (9.4), consider the recursion
[obtained by rearranging (9.4)]

E_{n−1} = \frac{1}{n}(1 − E_n).   (9.8)

If we wish to compute E_m, we may assume E_n = 0 for some n significantly bigger
than m and apply (9.8). From (9.7) the error involved in approximating E_n by zero
is not bigger than 1/(n + 1). Thus, an algorithm for E_m is

E_{k−1} = \frac{1}{k}(1 − E_k)

for k = n, n−1, ..., m+2, m+1, where E_n = 0. At each stage of this algorithm
the initial error is reduced by the factor 1/k rather than being magnified as it was in
the procedure of (9.4).
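
The following is a minimal MATLAB sketch of this backward recursion; the function
name backward_E and the choice of starting index n = m + 20 are illustrative
assumptions, not taken from the text.

% Sketch: stable backward recursion (9.8) for E_m = int_0^1 x^m exp(x-1) dx
function Em = backward_E(m)
n = m + 20;           % start well above m; E_n is approximated by 0
E = 0;                % error of this approximation is at most 1/(n+1), by (9.7)
for k = n:-1:m+1
    E = (1 - E)/k;    % E_{k-1} = (1 - E_k)/k
end
Em = E;

For instance, backward_E(19) returns a small positive value below 1/20, in contrast
to the erroneous -5.1930 produced by the forward recursion (9.4).
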

Other than potential numerical stability problems, which may or may not be 
easy to solve, it is apparent that not all integration problems may be cast into a 
recursive form. It is also possible that f(x) is not known for all x e R. We might 
know f(x) only on a finite subset of R, or perhaps a countably infinite subset of 
R. This situation might arise in the context of obtaining f(x) experimentally. Such 
a scenario would rule out the previous suggestions. 

Thus, there is much room to consider alternative methodologies. In this chapter 
we consider what may be collectively called quadrature methods. For the most part, 
these are based on applying some of the interpolation ideas considered in Chapter 6. 
But Gaussian quadrature (see Section 9.4) also employs orthogonal polynomials 
(material from Chapter 5). 

This chapter is dedicated mainly to the subject of numerical integration by
quadratures. But the final section considers numerical approximations to deriva- 
tives (i.e., numerical differentiation). Numerical differentiation is relevant to the 
numerical solution of differential equations (to be considered in later chapters), 
and we have mentioned that it is relevant to spline interpolation (Section 6.5). In 
fact, it can also find a role in refined methods of numerical integration (Section 9.5). 



9.2 TRAPEZOIDAL RULE 

A simple approach to numerical integration is the following. 

In this book we implicitly assume all functions are Riemann integrable. From
elementary calculus such integrals are obtained by the limiting process

I = \int_a^b f(x)\,dx = \lim_{n→∞} \frac{b−a}{n} \sum_{k=1}^{n} f(x_k^*)   (9.9)

for which x_0 = a, x_n = b, and for which the value I is independent of the point x_k^* ∈
[x_{k−1}, x_k].






we will ignore this potential problem. We may approximate I according to

I ≈ \frac{b−a}{n} \sum_{k=1}^{n} f(x_k^*).   (9.10)

Such an approximation is called the rectangular rule (or rectangle rule) for numer-
ical integration, and there are different variants depending on the choice for x_k^*.
Three possible choices are shown in Fig. 9.1. We mention that all variants involve
assuming f(x) is piecewise constant on [x_{k−1}, x_k], and so amount to the constant
interpolation of f(x) (i.e., fitting a polynomial which is a constant to f(x) on
some interval). Define

h = \frac{b−a}{n}.

From Fig. 9.1, the right-point rule uses

x_k^* = a + kh,   (9.11a)

while the left-point rule uses

x_k^* = a + (k − 1)h,   (9.11b)

Figure 9.1 Illustration of the different forms of the rectangular rule: (a) right-point rule;
(b) left-point rule; (c) midpoint rule. In all cases x_0 = a and x_n = b.






and the midpoint rule uses

x_k^* = a + (k − ½)h,   (9.11c)

where in all cases k = 1, 2, ..., n−1, n. The midpoint rule is often preferred
among the three as it is usually more accurate. However, the rectangular rule is often
(sometimes unfairly) regarded as too crude, and so the following (or something still
"better") is chosen.

It is often better to approximate f(x) with trapezoids as shown in Fig. 9.2. This
results in the trapezoidal rule for numerical integration. From Fig. 9.2 we see that
this rule is based on the linear interpolation of the function f(x) on [x_{k−1}, x_k].
The approximation to \int_{x_{k−1}}^{x_k} f(x)\,dx is given by the area of the trapezoid; this is

\int_{x_{k−1}}^{x_k} f(x)\,dx ≈ ½[f(x_k) + f(x_{k−1})](x_k − x_{k−1}) = ½[f(x_{k−1}) + f(x_k)]h.   (9.12)

[We have h = x_k − x_{k−1} = (b−a)/n.] It is intuitively plausible that this method
should be more accurate than the rectangular rule, and yet not require much, if
any, additional computational effort to implement it. Applying (9.12) for k = 1 to
k = n, we have

\int_a^b f(x)\,dx ≈ T(n) = \frac{h}{2}\sum_{k=1}^{n}[f(x_{k−1}) + f(x_k)],   (9.13)

and the summation expands out as

T(n) = \frac{h}{2}[f(x_0) + 2f(x_1) + 2f(x_2) + ··· + 2f(x_{n−1}) + f(x_n)],   (9.14)

where n ∈ N (set of natural numbers).
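
A direct MATLAB implementation of (9.13)/(9.14) is sketched below; the function
name trap_rule is illustrative only, and f is assumed to accept a vector argument.

% Sketch: composite trapezoidal rule T(n) of Eq. (9.14)
function T = trap_rule(f, a, b, n)
h = (b - a)/n;
x = a + h*(0:n);                          % x_0 = a, ..., x_n = b
y = f(x);
T = (h/2)*(y(1) + 2*sum(y(2:n)) + y(n+1));

For example, trap_rule(@(x) exp(-x), 0, 1, 100) approximates 1 - exp(-1), the
integral used in Example 9.1 below.
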




Figure 9.2 Illustration of the trapezoidal rule. 




We may investigate the error behavior of the trapezoidal rule as follows. The
process of analysis begins by assuming that n = 1; that is, we linearly interpolate
f(x) on [a, b] using

p(x) = f(a)\frac{x−b}{a−b} + f(b)\frac{x−a}{b−a},   (9.15)

which is from the Lagrange interpolation formula [recall (6.9) for n = 1]. It must
be the case that for a suitable error function e(x), we have

f(x) = p(x) + e(x)   (9.16)

with x ∈ [a, b]. Consequently

\int_a^b f(x)\,dx = \int_a^b p(x)\,dx + \int_a^b e(x)\,dx = \frac{b−a}{2}[f(a) + f(b)] + \int_a^b e(x)\,dx = T(1) + E_{T(1)},   (9.17)

where

E_{T(1)} = \int_a^b e(x)\,dx,   (9.18)

which is the error involved in using the trapezoidal rule. Of course, we would like
a suitable bound on this error. To obtain such a bound, we will assume that f^{(1)}(x)
and f^{(2)}(x) both exist and are continuous on [a, b]. Let x be fixed at some value
such that a < x < b, and define

g(t) = f(t) − p(t) − [f(x) − p(x)]\frac{(t−a)(t−b)}{(x−a)(x−b)}   (9.19)

for t ∈ [a, b]. It is not difficult to confirm that

g(a) = g(b) = g(x) = 0,

so g(t) vanishes at three different places on the interval [a, b]. Rolle's theorem¹
says that there are points ξ_1 ∈ (a, x), ξ_2 ∈ (x, b) such that

g^{(1)}(ξ_1) = g^{(1)}(ξ_2) = 0.

Thus, g^{(1)}(t) vanishes at two different places on (a, b), so yet again by Rolle's
theorem there is a ξ ∈ (ξ_1, ξ_2) such that

g^{(2)}(ξ) = 0.

This was proved by Bers [2], but the proof is actually rather lengthy, and so we omit it. 




We note that ξ = ξ(x), that is, point ξ depends on x. Therefore

g^{(2)}(ξ) = f^{(2)}(ξ) − p^{(2)}(ξ) − [f(x) − p(x)]\frac{2}{(x−a)(x−b)} = 0.   (9.20)

The polynomial p(x) is of the first degree so p^{(2)}(ξ) = 0. We may use this in
(9.20) and rearrange the result so that

f(x) = p(x) + ½f^{(2)}(ξ(x))(x−a)(x−b)   (9.21)

for any x ∈ (a, b). This expression also happens to be valid at x = a and at x = b,
so

e(x) = f(x) − p(x) = ½f^{(2)}(ξ(x))(x−a)(x−b)   (9.22)

for x ∈ [a, b]. But we need to evaluate E_{T(1)} in (9.18). The second mean-value
theorem for integrals states that if f(x) is continuous and g(x) is integrable (Rie-
mann) on [a, b], and further that g(x) does not change sign on [a, b], then there
is a point ρ ∈ (a, b) such that

\int_a^b f(x)g(x)\,dx = f(ρ)\int_a^b g(x)\,dx.   (9.23)

The proof is omitted. We observe that (x−a)(x−b) does not change sign on
x ∈ [a, b], so via this theorem we have

E_{T(1)} = ½\int_a^b f^{(2)}(ξ(x))(x−a)(x−b)\,dx = −\frac{1}{12}f^{(2)}(ρ)(b−a)³,   (9.24)

where ρ ∈ (a, b). We emphasize that this is the error for n = 1 in T(n). Naturally,
we want an error expression for n > 1, too. When n > 1, we may refer to the
integration rule as a compound or composite rule.

The error committed in numerically integrating over the kth subinterval [x_{k−1}, x_k]
must be [via (9.24)]

E_k = −\frac{1}{12}f^{(2)}(ξ_k)(x_k − x_{k−1})³ = −\frac{h³}{12}f^{(2)}(ξ_k) = −\frac{h²}{12}\frac{b−a}{n}f^{(2)}(ξ_k),   (9.25)

where ξ_k ∈ [x_{k−1}, x_k] and k = 1, 2, ..., n−1, n. Therefore, the total error com-
mitted is

E_{T(n)} = \int_a^b f(x)\,dx − T(n) = \sum_{k=1}^{n} E_k,   (9.26)

which becomes [via (9.25)]

E_{T(n)} = −\frac{h²}{12}\frac{b−a}{n}\sum_{k=1}^{n} f^{(2)}(ξ_k).   (9.27)




The average \frac{1}{n}\sum_{k=1}^{n} f^{(2)}(ξ_k) must lie between the largest and smallest values of
f^{(2)}(x) on [a, b], so recalling that f^{(2)}(x) is continuous on [a, b], the intermediate
value theorem (Theorem 7.1) yields that there is a ξ ∈ (a, b) such that

f^{(2)}(ξ) = \frac{1}{n}\sum_{k=1}^{n} f^{(2)}(ξ_k).

Therefore

E_{T(n)} = −\frac{h²}{12}(b−a)f^{(2)}(ξ),   (9.28)

where ξ ∈ (a, b). If the maximum value of f^{(2)}(x) on [a, b] is known, then this
may be used in (9.28) to provide an upper bound on the error. We remark that
E_{T(n)} is often called truncation error.
We see that

T(n) = x^T y   (9.29)

for which

x = h[½ 1 ··· 1 ½]^T ∈ R^{n+1},   y = [f(x_0) f(x_1) ··· f(x_{n−1}) f(x_n)]^T ∈ R^{n+1}.   (9.30)

We know that rounding errors will be committed in the computation of (9.29). The
total rounding error might be denoted by E_R. We recall from Chapter 2 [Eq. (2.40)]
that a bound on these errors is

|E_R| = |x^T y − fl[x^T y]| ≤ 1.01(n+1)u|x|^T|y|,   (9.31)

where u is the unit roundoff, or else the machine epsilon. The cumulative effect of
rounding errors can be expected to grow as n increases. If we suppose that

M = \max_{x∈[a,b]} |f^{(2)}(x)|,   (9.32)

then, from (9.28), we obtain

|E_{T(n)}| ≤ \frac{1}{12}\frac{1}{n²}(b−a)³M.   (9.33)

Thus, (9.33) is an upper bound on the truncation error for the composite trapezoidal
rule. We see that the bound gets smaller as n increases. Thus, as we expect,
truncation error is reduced as the number of trapezoids used increases. Combining
(9.33) with (9.31) results in a bound on the total error E:

|E| ≤ \frac{1}{12}\frac{1}{n²}(b−a)³M + 1.01(n+1)u|x|^T|y|.   (9.34)






Usually n ≫ 1, so n + 1 ≈ n. Thus, substituting this and (9.30) into (9.31) results in

|x^T y − fl[x^T y]| ≤ 1.01(b−a)u\left[½(|f(x_0)| + |f(x_n)|) + \sum_{k=1}^{n−1}|f(x_k)|\right],   (9.35)

and so (9.34) becomes

|E| ≤ \frac{1}{12}\frac{1}{n²}(b−a)³M + 1.01(b−a)u\left[½(|f(x_0)| + |f(x_n)|) + \sum_{k=1}^{n−1}|f(x_k)|\right].   (9.36)

In general, the first term in the bound of (9.36) becomes smaller as n increases,
while the second term becomes larger. Thus, there is a tradeoff involved in choosing
the number of trapezoids to approximate a given integral, and the best choice ought
to minimize the total error.

Example 9.1 We may apply the bound of (9.36) to the following problem. We
wish to compute

I = \int_0^1 e^{−x}\,dx.

Thus, [a, b] = [0, 1], and so b − a = 1. Of course, it is very easy to confirm that
I = 1 − e^{−1}. But this is what makes it a good example to test our theory out. We
also see that f^{(2)}(x) = e^{−x}, and so in (9.32) M = 1. Also,

f(x_k) = e^{−k/n}

for k = 0, 1, ..., n. It is therefore easy to see that

\sum_{k=1}^{n−1} f(x_k) = \frac{e^{−1/n} − e^{−1}}{1 − e^{−1/n}}.

We might assume that the trapezoidal rule for this problem is implemented in the
C programming language using single-precision floating-point arithmetic, in which
case a typical value for u would be

u = 1.1921 × 10^{−7}.

Therefore, from (9.36) the total error is bounded according to

|E| ≤ \frac{1}{12n²} + 1.2040 × 10^{−7}\left[\frac{e+1}{2e} + \frac{e^{−1/n} − e^{−1}}{1 − e^{−1/n}}\right].
Figure 9.3 plots this bound versus n, and also shows the magnitude of the computed
(i.e., the true or actual) total error in the trapezoidal rule approximation, which
is |T(n) − I|. We see that the true error is always less than the bound, as we
would expect. However, the bound is rather pessimistic. Also, the bound predicts
that the proper choice for n is much less than what the computed result predicts.
Specifically, the bound suggests that we choose n ≈ 100, while the computed result
suggests that we choose n ≈ 100,000.

Figure 9.3 Comparison of total error (computed) to bound on total error, illustrating the
tradeoff between rounding error and truncation error in numerical integration by the trape-
zoidal rule. The bound employed here is that of Eq. (9.36).

What is important is that the computed result and the bound both confirm that 
there is a tradeoff between minimizing the truncation error and minimizing the 
rounding error. To minimize rounding error, we prefer a small n, but to minimize 
the truncation error, we prefer a large n. The best solution minimizes the total error 
from both sources. 

In practice, attempting a detailed analysis to determine the true optimum choice 
for n is usually not worth the effort. What is important is to understand the funda- 
mental tradeoffs involved in the choice of n, and from this understanding select a 
reasonable value for n. 
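
The kind of computation behind Figure 9.3 is easy to reproduce. The lines below
are a sketch (not the author's original code) that tabulates the true error of T(n) for
Example 9.1 using the trap_rule sketch given after (9.14); note that in MATLAB's
double precision the rounding contribution is much smaller than for the single-precision
u assumed in the example, so the error floor appears at larger n.

% Sketch: total error of T(n) for I = int_0^1 exp(-x) dx, for several n
I = 1 - exp(-1);
for n = 10.^(1:6)
    T = trap_rule(@(x) exp(-x), 0, 1, n);
    fprintf('n = %8d   |T(n) - I| = %10.3e\n', n, abs(T - I));
end
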



9.3 SIMPSON'S RULE 

The trapezoidal rule employed linear interpolation to approximate f(x) between
sample points x_k on the x axis. We might consider quadratic interpolation in






the hope of improving accuracy still further. Here "accuracy" is a reference to
truncation error.

Therefore, we wish to fit a quadratic curve to the points (x_{k−1}, f(x_{k−1})),
(x_k, f(x_k)), and (x_{k+1}, f(x_{k+1})). We may define the quadratic to be

p_k(x) = a(x − x_k)² + b(x − x_k) + c.   (9.37)

Contrary to past practice, the subscript k now does not denote degree, but rather
denotes the "centerpoint" of the interval [x_{k−1}, x_{k+1}] on which we are fitting the
quadratic. The situation is illustrated in Fig. 9.4. For convenience, define y_k =
f(x_k). Therefore, from (9.37) we may set up three equations in the unknowns
a, b, c:

a(x_{k−1} − x_k)² + b(x_{k−1} − x_k) + c = y_{k−1},
a(x_k − x_k)² + b(x_k − x_k) + c = y_k,
a(x_{k+1} − x_k)² + b(x_{k+1} − x_k) + c = y_{k+1}.

This is a linear system of equations, and we will assume that h = x_k − x_{k−1} =
x_{k+1} − x_k, so therefore

a = \frac{y_{k+1} − 2y_k + y_{k−1}}{2h²},   (9.38a)
b = \frac{y_{k+1} − y_{k−1}}{2h},   (9.38b)
c = y_k.   (9.38c)

This leads to the approximation

\int_{x_{k−1}}^{x_{k+1}} f(x)\,dx ≈ \int_{x_{k−1}}^{x_{k+1}} p_k(x)\,dx = \frac{h}{3}[y_{k−1} + 4y_k + y_{k+1}].   (9.39)

Figure 9.4 Simpson's rule for numerical integration: a parabolic arc p_k(x) is fitted to f(x)
over [x_{k−1}, x_{k+1}].






Of course, some algebra has been omitted to arrive at the equality in (9.39). As
in Section 9.2, we wish to integrate f(x) on [a, b]. So, as before, a = x_0, and
b = x_n. If n is an even number, then the number of subdivisions of [a, b] is an
even number, and hence we have the approximation

\int_a^b f(x)\,dx ≈ \int_{x_0}^{x_2} p_1(x)\,dx + \int_{x_2}^{x_4} p_3(x)\,dx + ··· + \int_{x_{n−2}}^{x_n} p_{n−1}(x)\,dx
   = \frac{h}{3}[y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + ··· + 2y_{n−2} + 4y_{n−1} + y_n].   (9.40)

The last equality follows from applying (9.39). We define the Simpson rule approx-
imation to I as

S(n) = \frac{h}{3}[y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + ··· + 2y_{n−2} + 4y_{n−1} + y_n]   (9.41)

for which n is even and n ≥ 2.
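
A corresponding MATLAB sketch of (9.41) is given below; simp_rule is an
illustrative name, n must be even, and f is assumed to accept a vector argument.

% Sketch: composite Simpson rule S(n) of Eq. (9.41), n even
function S = simp_rule(f, a, b, n)
h = (b - a)/n;
x = a + h*(0:n);
y = f(x);                                   % y(1) = y_0, ..., y(n+1) = y_n
S = (h/3)*(y(1) + 4*sum(y(2:2:n)) + 2*sum(y(3:2:n-1)) + y(n+1));

For example, simp_rule(@(x) exp(-x), -1, 1, 2) returns 2.362053..., the value S(2)
obtained in Example 9.4 later in this chapter.
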

A truncation error analysis of Simpson's rule is more involved than that of the
trapezoidal rule seen in the previous section. Therefore, we only outline the major
steps and results. We begin by using only two subintervals to approximate
I = \int_a^b f(x)\,dx, specifically, n = 2. Define c = (a + b)/2. Denote
the interpolating quadratic by p(x). For a suitable error function e(x), we must
have

f(x) = p(x) + e(x).   (9.42)

Immediately we see that

I = \int_a^b f(x)\,dx = \int_a^b p(x)\,dx + \int_a^b e(x)\,dx = \frac{h}{3}[f(a) + 4f(c) + f(b)] + \int_a^b e(x)\,dx = S(2) + E_{S(2)}.   (9.43)

So, the truncation error in Simpson's rule is thus

E_{S(2)} = \int_a^b e(x)\,dx.   (9.44)

It is clear that Simpson's rule is exact for f(x) a quadratic function. Less clear is
the fact that Simpson's rule is exact if f(x) is a cubic polynomial. To demonstrate
the truth of this claim, we need an error result from Chapter 6. We assume that
f^{(k)}(x) exists and is continuous for all k = 0, 1, 2, 3, 4 for all x ∈ [a, b]. From






Eq. (6.14) the error involved in interpolating f(x) with a quadratic polynomial is
given by

e(x) = \frac{1}{3!}f^{(3)}(ξ(x))(x−a)(x−b)(x−c)   (9.45)

for some ξ = ξ(x) ∈ [a, b]. Hence

E_{S(2)} = \int_a^b e(x)\,dx = \frac{1}{3!}\int_a^b f^{(3)}(ξ(x))(x−a)(x−b)(x−c)\,dx.   (9.46)

Unfortunately, the polynomial (x−a)(x−b)(x−c) changes sign on the interval
[a, b], and so we are not able to apply the second mean-value theorem for integrals
as we did in Section 9.2. This is a major reason why the analysis of Simpson's
rule is harder than the analysis of the trapezoidal rule. However, at this point we
may still consider (9.46) for the case where f(x) is a cubic polynomial. In this
case we must have f^{(3)}(x) = K (some constant). Consequently, from (9.46)

E_{S(2)} = \frac{K}{3!}\int_a^b (x−a)(x−b)(x−c)\,dx,

but if z = x − c, then, since c = ½(a+b), we must have

E_{S(2)} = \frac{K}{3!}\int_{−(b−a)/2}^{(b−a)/2} \left[z + ½(b−a)\right]\,z\,\left[z − ½(b−a)\right]\,dz = \frac{K}{3!}\int_{−(b−a)/2}^{(b−a)/2} z\left[z² − ¼(b−a)²\right]\,dz.   (9.47)

The integrand is an odd function of z, and the integration limits are symmetric
about the point z = 0. Immediately we conclude that E_{S(2)} = 0 in this particular
case. Thus, we conclude that Simpson's rule gives the exact result when f(x) is a
cubic polynomial.

Hermite interpolation (considered in a general way in Section 6.4) is polynomial
interpolation where not only does the interpolating polynomial match f(x) at the
sample points x_k, but the first derivative of f(x) is matched as well. It is useful
to interpolate f(x) with a cubic polynomial that we will denote by r(x) at the
points (a, f(a)), (b, f(b)), and (c, f(c)), and also such that r^{(1)}(c) = f^{(1)}(c). A
cubic polynomial is specified by four coefficients, so these constraints uniquely
determine r(x). In fact

r(x) = p(x) + α(x−a)(x−b)(x−c),   (9.48a)

for which

α = \frac{4[p^{(1)}(c) − f^{(1)}(c)]}{(b−a)²}.   (9.48b)




Analogously to (9.19), we may define

g(t) = f(t) − r(t) − [f(x) − r(x)]\frac{(t−a)(t−c)²(t−b)}{(x−a)(x−c)²(x−b)},   a ≤ t ≤ b.   (9.49)

It happens that g^{(k)}(t) for k = 0, 1, 2, 3, 4 all exist and are continuous at all t ∈
[a, b]. Additionally, g(a) = g(b) = g(c) = g^{(1)}(c) = g(x) = 0. The vanishing of
g(t) at four distinct points on [a, b], together with g^{(1)}(c) = 0, guarantees that g^{(4)}(ξ) = 0
for some ξ ∈ [a, b] by the repeated application of Rolle's theorem. Consequently,
using (9.49), we obtain

g^{(4)}(ξ) = f^{(4)}(ξ) − r^{(4)}(ξ) − [f(x) − r(x)]\frac{4!}{(x−a)(x−c)²(x−b)} = 0.   (9.50)

Since r(x) is cubic, r^{(4)}(ξ) = 0, and so (9.50) can be used to say that

f(x) = r(x) + \frac{1}{4!}f^{(4)}(ξ(x))(x−a)(x−c)²(x−b)   (9.51)

for x ∈ (a, b). This is valid at the endpoints of [a, b], so finally

e(x) = f(x) − r(x) = \frac{1}{4!}f^{(4)}(ξ(x))(x−a)(x−c)²(x−b)   (9.52)

for x ∈ [a, b], and ξ(x) ∈ [a, b]. Immediately, we see that

E_{S(2)} = \frac{1}{4!}\int_a^b f^{(4)}(ξ(x))(x−a)(x−c)²(x−b)\,dx.   (9.53)

The polynomial in the integrand of (9.53) does not change sign on [a, b]. Thus, the
second mean-value theorem for integrals is applicable. Hence, for some ξ ∈ (a, b),
we have

E_{S(2)} = \frac{f^{(4)}(ξ)}{4!}\int_a^b (x−a)(x−c)²(x−b)\,dx,   (9.54)

which reduces to

E_{S(2)} = −\frac{h⁵}{90}f^{(4)}(ξ)   (9.55)

again for some ξ ∈ (a, b), where h = (b−a)/2.

We need an expression for E_{S(n)}, that is, an error expression for the compos-
ite Simpson rule. We will assume again that h = (b−a)/n, where n is an even
number. Consequently

E_{S(n)} = \int_a^b f(x)\,dx − S(n) = \sum_{k=1}^{n/2} E_k,   (9.56)




where E_k is the error committed in the approximation for the kth subinterval
[x_{2(k−1)}, x_{2k}]. Thus, for ξ_k ∈ [x_{2(k−1)}, x_{2k}], with k = 1, 2, ..., n/2, we have

E_k = −\frac{h⁵}{90}f^{(4)}(ξ_k) = −\frac{h⁴}{90}\frac{b−a}{n}f^{(4)}(ξ_k).   (9.57)

Therefore

E_{S(n)} = −\frac{h⁴}{180}(b−a)\frac{1}{n/2}\sum_{k=1}^{n/2} f^{(4)}(ξ_k).   (9.58)

Applying the intermediate-value theorem to the average \frac{1}{n/2}\sum_{k=1}^{n/2} f^{(4)}(ξ_k) con-
firms that there is a ξ ∈ (a, b) such that

f^{(4)}(ξ) = \frac{1}{n/2}\sum_{k=1}^{n/2} f^{(4)}(ξ_k),

so therefore the truncation error expression for the composite Simpson rule becomes

E_{S(n)} = −\frac{h⁴}{180}(b−a)f^{(4)}(ξ)   (9.59)

for some ξ ∈ (a, b). For convenience, we repeat the truncation error expression for
the composite trapezoidal rule:

E_{T(n)} = −\frac{h²}{12}(b−a)f^{(2)}(ξ).   (9.60)

It is not really obvious which rule, trapezoidal or Simpson's, is better in general. 
For a particular interval [a,b] and n, the two expressions depend on different 
derivatives of f(x). It is possible that Simpson's rule may not be an improvement 
on the trapezoidal rule in particular cases for this reason. More specifically, a 
function that is not too smooth can be expected to have "big" higher derivatives. 
Simpson's rule has a truncation error dependent on the fourth derivative, while the 
trapezoidal rule has an error that depends only on the second derivative. Thus, a 
nonsmooth function might be better approximated by the trapezoidal rule than by 
Simpson's. In fact, Davis and Rabinowitz [3, p. 26] state that 

The more "refined" a rule of approximate integration is, the more certain we must 
be that it has been applied to a function which is sufficiently smooth. There may be 
little or no advantage in using a 'better' rule for a function that is not smooth. 

We repeat a famous example from Ref. 3 (originally due to Salzer and Levine). 




Example 9.2 The following series defines a function due to Weierstrass that
happens to be continuous but is, surprisingly, not differentiable anywhere²:

W(x) = \sum_{n=1}^{∞} \frac{1}{2^n}\cos(7^n π x).   (9.61a)

If we assume that we may integrate this expression term by term, then

I(y) = \int_0^y W(x)\,dx = \sum_{n=1}^{∞} \frac{1}{14^n π}\sin(7^n π y).   (9.61b)

Of course, I = \int_a^b W(x)\,dx = I(b) − I(a). The series (9.61b) gives the "exact"
value for I(y) and so may be compared to estimates produced by the trapezoidal
and Simpson rules. Assuming that n = 100, the following table of values is obtained
[MATLAB implementation of (9.61) and the numerical integration rules]:

Interval    Exact          Trapezoidal    Error          Simpson's      Error
[a, b]      Value I        T(n)           I − T(n)       S(n)           I − S(n)

[0, .1]      0.01899291     0.01898760     0.00000531     0.01901426    −0.00002135
[.1, .2]    −0.04145650    −0.04143815    −0.00001834    −0.04146554     0.00000904
[.2, .3]     0.03084617     0.03084429     0.00000188     0.03086261    −0.00001645
[.3, .4]     0.00337701     0.00342534    −0.00004833     0.00341899    −0.00004198
[.4, .5]    −0.03298025    −0.03300674     0.00002649    −0.03303611     0.00005586

We see that the errors involved in both the trapezoidal and Simpson rules do 
not differ greatly from each other. So Simpson's rule has no advantage here. 



It is commonplace for integration problems to involve integrals that possess
oscillatory integrands. For example, the Fourier transform of x(t) is defined to be

X(ω) = \int_{−∞}^{∞} x(t)e^{−jωt}\,dt = \int_{−∞}^{∞} x(t)\cos(ωt)\,dt − j\int_{−∞}^{∞} x(t)\sin(ωt)\,dt.   (9.62)

You will likely see much of this integral in other books and associated courses
(e.g., signals and systems). Also, determination of the Fourier series coefficients
required computing [recall Eq. (1.45)]

\frac{1}{2π}\int_0^{2π} (···)\,dx.   (9.63)



Proof that W(x) is continuous but not differentiable is quite difficult. There are many different 
Weierstrass functions possessing this property of continuity without differentiability. Another example 
complete with a proof appears on pp. 38-41 of Korner [4]. 




An integrand is said to be rapidly oscillatory if there are numerous (i.e., of the 
order of > 10) local maxima and minima over the range of integration (i.e., here 
assumed to be the finite interval [a, b]). Some care is often required to compute 
these properly with the aid of the integration rules we have considered so far. 
However, we will consider only a simple idea called integration between the zeros. 
Davis and Rabinowitz have given a more detailed consideration of how to handle 
oscillatory integrands [3, pp. 53-68]. 

Relevant to the computation of (9.63) is, for example, the integral

I = \int_0^{2π} f(x)\sin(nx)\,dx.   (9.64)

It may be that f(x) oscillates very little or not at all on [0, 2π]. We may therefore
replace (9.64) with

I = \sum_{k=0}^{2n−1}\int_{kπ/n}^{(k+1)π/n} f(x)\sin(nx)\,dx.   (9.65)

The endpoints of the "subintegrals"

I_k = \int_{kπ/n}^{(k+1)π/n} f(x)\sin(nx)\,dx   (9.66)

in (9.65) are the zeros of sin(nx) on [0, 2π]. Thus, we are truly proposing integra-
tion between the zeros. At this point it is easiest to approximate (9.66) with either
the trapezoidal or Simpson's rules for all k. Since the endpoints of the integrands
in (9.66) are zero-valued, we can expect some savings in computation as a result
because some of the terms in the rules (9.14) and (9.41) will be zero-valued.
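
The lines below sketch integration between the zeros for (9.64), using the Simpson
rule sketch given after (9.41) on each subintegral; the particular f(x), the value n = 25,
and the 10 samples per subinterval are arbitrary illustrative choices.

% Sketch: integration between the zeros of sin(n*x) on [0, 2*pi]
f = @(x) exp(-x/5);        % an integrand factor that oscillates very little
n = 25;                    % sin(n*x) is then rapidly oscillatory
I = 0;
for k = 0:2*n-1
    a = k*pi/n; b = (k+1)*pi/n;                     % consecutive zeros of sin(n*x)
    I = I + simp_rule(@(x) f(x).*sin(n*x), a, b, 10);  % I_k of (9.66)
end
disp(I);
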



9.4 GAUSSIAN QUADRATURE 

Gaussian quadrature is a numerical integration method that uses a higher order of 
interpolation than do either the trapezoidal or Simpson rules. A detailed derivation 
of the method is rather involved as it relies on Hermite interpolation, orthogonal 
polynomial theory, and some aspects of the theory rely on issues relating to linear 
system solution. Thus, only an outline presentation is given here. However, a 
complete description may be found in Hildebrand [5, pp. 382-400]. Additional 
information is presented in Davis and Rabinowitz [3]. 

It helps to recall ideas from Chapter 6 here. If we know f(x) for x = x_j,
where j = 0, 1, ..., n−1, n, then the Lagrange interpolating polynomial is (with
p(x_j) = f(x_j))

p(x) = \sum_{j=0}^{n} f(x_j)L_j(x),   (9.67)



where

L_j(x) = \prod_{k=0, k≠j}^{n} \frac{x − x_k}{x_j − x_k}.   (9.68)

Recalling (6.45b), we may similarly define

π(x) = \prod_{i=0}^{n} (x − x_i).   (9.69)

Consequently

π^{(1)}(x) = \sum_{i=0}^{n}\prod_{k=0, k≠i}^{n} (x − x_k),   (9.70)

so that

π^{(1)}(x_i) = \prod_{k=0, k≠i}^{n} (x_i − x_k),   (9.71)

which allows us to rewrite (9.68) as

L_j(x) = \frac{π(x)}{π^{(1)}(x_j)(x − x_j)}   (9.72)

for j = 0, 1, ..., n. From (6.14)

e(x) = f(x) − p(x) = \frac{1}{(n+1)!}f^{(n+1)}(ξ)π(x)   (9.73)

for some ξ ∈ [a, b], and ξ = ξ(x).

We now summarize Hermite interpolation (recall Section 6.4 for more detail).
Suppose that we have knowledge of both f(x) and f^{(1)}(x) at x = x_j (again j =
0, 1, ..., n). We may interpolate f(x) using a polynomial of degree 2n + 1 since
we must match the polynomial to both f(x) and f^{(1)}(x) at x = x_j. Thus, we need
the polynomial

p(x) = \sum_{k=0}^{n} h_k(x)f(x_k) + \sum_{k=0}^{n} ĥ_k(x)f^{(1)}(x_k),   (9.74)

where h_k(x) and ĥ_k(x) are both polynomials of degree 2n + 1 that we must deter-
mine according to the constraints of our interpolation problem.
If

h_i(x_j) = δ_{i−j},   ĥ_i(x_j) = 0,   (9.75a)




then p(x_j) = f(x_j), and if we have

h_i^{(1)}(x_j) = 0,   ĥ_i^{(1)}(x_j) = δ_{i−j},   (9.75b)

then p^{(1)}(x_j) = f^{(1)}(x_j) for all j = 0, 1, ..., n. Using (9.75), it is possible to
arrive at the conclusion that

h_i(x) = [1 − 2L_i^{(1)}(x_i)(x − x_i)][L_i(x)]²   (9.76a)

and

ĥ_i(x) = (x − x_i)[L_i(x)]².   (9.76b)

Equation (9.74) along with (9.76) is Hermite's interpolating formula. [Both parts
of Eq. (9.76) are derived on pp. 383-384 of Ref. 5, as well as in Theorem 6.1 of
Chapter 6.] It is further possible to prove that for p(x) in (9.74) we have
the error function

e(x) = f(x) − p(x) = \frac{1}{(2n+2)!}f^{(2n+2)}(ξ)[π(x)]²,   (9.77)

where ξ ∈ [a, b] and ξ = ξ(x).
From (9.77) and (9.74), we obtain

f(x) = \sum_{k=0}^{n} h_k(x)f(x_k) + \sum_{k=0}^{n} ĥ_k(x)f^{(1)}(x_k) + \frac{1}{(2n+2)!}f^{(2n+2)}(ξ(x))[π(x)]².   (9.78)

Suppose that w(x) > 0 for x ∈ [a, b]. Function w(x) is intended to be a weighting
function such as seen in Chapter 5. Consequently, from (9.78),

\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n}\left[\int_a^b w(x)h_k(x)\,dx\right]f(x_k) + \sum_{k=0}^{n}\left[\int_a^b w(x)ĥ_k(x)\,dx\right]f^{(1)}(x_k)
   + \frac{1}{(2n+2)!}\int_a^b f^{(2n+2)}(ξ(x))w(x)[π(x)]²\,dx,   (9.79)

where a ≤ ξ(x) ≤ b, if a ≤ x_k ≤ b. This can be rewritten as

\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n} H_k f(x_k) + \sum_{k=0}^{n} Ĥ_k f^{(1)}(x_k) + E,   (9.80)




where

H_k = \int_a^b w(x)h_k(x)\,dx = \int_a^b w(x)[1 − 2L_k^{(1)}(x_k)(x − x_k)][L_k(x)]²\,dx   (9.81a)

and

Ĥ_k = \int_a^b w(x)ĥ_k(x)\,dx = \int_a^b w(x)(x − x_k)[L_k(x)]²\,dx.   (9.81b)

If we neglect the term E in (9.80), then the resulting approximation to \int_a^b w(x)
f(x)\,dx is called the Hermite quadrature formula. Since we are assuming that
w(x) > 0, the second mean-value theorem for integrals allows us to claim that

E = \frac{1}{(2n+2)!}f^{(2n+2)}(ξ)\int_a^b w(x)[π(x)]²\,dx   (9.82)

for some ξ ∈ [a, b].

Now, recalling (9.72), we see that (9.81b) can be rewritten as

Ĥ_k = \int_a^b w(x)(x − x_k)\left[\frac{π(x)}{π^{(1)}(x_k)(x − x_k)}\right]²\,dx = \frac{1}{π^{(1)}(x_k)}\int_a^b w(x)π(x)\frac{π(x)}{π^{(1)}(x_k)(x − x_k)}\,dx
   = \frac{1}{π^{(1)}(x_k)}\int_a^b w(x)π(x)L_k(x)\,dx.   (9.83)



We recall from Chapter 5 that an inner product on L²[a, b] is (f, g are real-valued)

⟨f, g⟩ = \int_a^b w(x)f(x)g(x)\,dx.   (9.84)

Thus, Ĥ_k = 0 for k = 0, 1, ..., n if π(x) is orthogonal to L_k(x) over [a, b] with
respect to the weighting function w(x). Since deg(L_k(x)) = n (all k), this will be
the case if π(x) is orthogonal to all polynomials of degree ≤ n over [a, b] with
respect to the weighting function w(x). Note that deg(π(x)) = n + 1 [recall (9.69)].
In fact, if the polynomial π(x) of degree n + 1 is orthogonal to all polynomials
of degree ≤ n over [a, b] with respect to w(x), the Hermite quadrature formula
reduces to the simpler form

\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n} H_k f(x_k) + E,   (9.85)

where

E = \frac{1}{(2n+2)!}f^{(2n+2)}(ξ)\int_a^b w(x)[π(x)]²\,dx,   (9.86)




and where x_0, x_1, ..., x_n are the zeros of π(x) (such that a ≤ x_k ≤ b). A formula of
this type is called a Gaussian quadrature formula. The weights H_k are sometimes
called Christoffel numbers. We see that this numerical integration methodology
requires us to possess samples of f(x) at the zeros of π(x). Variations on this
theory can be used to remove this restriction [6], but we do not consider this
matter in this book. However, if f(x) is known at all x ∈ [a, b], then this is not a
serious restriction.

In any case, to apply the approximation

\int_a^b w(x)f(x)\,dx ≈ \sum_{k=0}^{n} H_k f(x_k),   (9.87)

it is clear that we need a method to determine the Christoffel numbers H_k. It is
possible to do this using the Christoffel-Darboux formula [Eq. (5.11); see also
Theorem 5.2]. From this it can be shown that

H_k = −\frac{φ_{n+2,n+2}}{φ_{n+1,n+1}\,φ_{n+1}^{(1)}(x_k)\,φ_{n+2}(x_k)},   (9.88)

where polynomial φ_r(x) is obtained, for instance, from (5.5) and where

φ_{n+1}(x) = φ_{n+1,n+1}\,π(x).   (9.89)

Thus, we identify the zeros of π(x) with the zeros of the orthogonal polynomial
φ_{n+1}(x) [recalling that ⟨φ_i, φ_j⟩ = δ_{i−j} with respect to the inner product (9.84)].

Since there are an infinite number of choices for orthogonal polynomials φ_k(x)
and there is a theory for creating them (Chapter 5), it is possible to choose π(x) in
an infinite number of ways. We have implicitly assumed that [a, b] is a finite-length
interval, but this assumption is actually entirely unnecessary. Infinite or semiinfinite
intervals of integration are permitted. Thus, for example, π(x) may be associated
with the Hermite polynomials of Section 5.4, as well as with the Chebyshev or
Legendre polynomials.

Let us consider as an example the case of Chebyshev polynomials of the first
kind (first seen in Section 5.3). In this case

w(x) = \frac{1}{\sqrt{1 − x²}}

with [a, b] = [−1, 1], and we will obtain the Chebyshev-Gauss quadrature rule.
The Chebyshev polynomials of the first kind are T_k(x) = \cos[k\cos^{−1}x], so, via
(5.55) for k > 0, we have

φ_k(x) = \sqrt{\frac{2}{π}}\,T_k(x).   (9.90)




From the recursion for Chebyshev polynomials of the first kind [Eq. (5.57)], we have (k > 0)

T_{k,k} = 2^{k-1}   (9.91)

(T_k(x) = \sum_{j=0}^{k} T_{k,j}x^j). Thus

\phi_{k,k} = \sqrt{\frac{2}{\pi}}\,2^{k-1}.   (9.92)

We have that T_k(x) = 0 for

x = x_i = \cos\left(\frac{2i+1}{2k}\pi\right)   (9.93)

(i = 0, 1, ..., k-1). Additionally,

T_k^{(1)}(x) = \frac{k\sin[k\cos^{-1}x]}{\sqrt{1 - x^2}},   (9.94)

so, if, for convenience, we define \alpha_i = \frac{2i+1}{2k}\pi, then x_i = \cos\alpha_i, and therefore

T_k^{(1)}(x_i) = k\,\frac{\sin(k\alpha_i)}{\sin\alpha_i} = k\,\frac{\sin\left(\frac{2i+1}{2}\pi\right)}{\sin\alpha_i} = \frac{(-1)^i k}{\sin\alpha_i}.   (9.95)



Also,

T_{k+1}(x_i) = \cos\left[(k+1)\frac{2i+1}{2k}\pi\right] = \cos\left[\frac{2i+1}{2}\pi + \frac{2i+1}{2k}\pi\right] = -\sin\left(\frac{2i+1}{2}\pi\right)\sin\alpha_i = (-1)^{i+1}\sin\alpha_i.   (9.96)



Therefore, (9.88) becomes

H_k = -\frac{\sqrt{2/\pi}\,2^{n+1}}{\sqrt{2/\pi}\,2^{n}\cdot\sqrt{2/\pi}\,\dfrac{(-1)^k(n+1)}{\sin\alpha_k}\cdot\sqrt{2/\pi}\,(-1)^{k+1}\sin\alpha_k} = \frac{\pi}{n+1}.   (9.97)

Thus, the weights (Christoffel numbers) are all the same in this particular case. So, (9.87) is now

\int_{-1}^{1}\frac{f(x)}{\sqrt{1 - x^2}}\,dx \approx \frac{\pi}{n+1}\sum_{k=0}^{n} f\left(\cos\frac{2k+1}{2n+2}\pi\right) = C(n).   (9.98)






The error expression E in (9.86) can also be reduced accordingly, but we will omit 
this here (again, see Hildebrand [5]). 

A simple example of the application of (9.98) is as follows. 



Example 9.3 Suppose that f(x) = \sqrt{1 - x^2}, in which case

I = \int_{-1}^{1}\frac{\sqrt{1 - x^2}}{\sqrt{1 - x^2}}\,dx = \int_{-1}^{1}dx = 2.

For this case (9.98) becomes

\int_{-1}^{1}\sqrt{1 - x^2}\,\frac{dx}{\sqrt{1 - x^2}} \approx C(n) = \frac{\pi}{n+1}\sum_{k=0}^{n}\sin\left(\frac{2k+1}{2n+2}\pi\right).

For various n, we have the following table of values:

n      C(n)
1      2.2214
2      2.0944
5      2.0230
10     2.0068
20     2.0019
100    2.0001
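The rule (9.98) is simple to program. The following MATLAB fragment is a minimal sketch (the function name chebgauss is our own choice, not the text's); it assumes f accepts vector arguments.

function C = chebgauss(f, n)
% Chebyshev-Gauss quadrature (9.98): approximates the integral of
% f(x)/sqrt(1 - x^2) over [-1,1] using n+1 samples of f.
k  = 0:n;                           % sample index
xk = cos((2*k + 1)*pi/(2*n + 2));   % zeros of T_{n+1}(x), Eq. (9.93)
C  = (pi/(n + 1))*sum(f(xk));       % equal weights pi/(n+1), Eq. (9.97)
end

For example, chebgauss(@(x) sqrt(1 - x.^2), 10) returns approximately 2.0068, in agreement with the table above.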



Finally, we remark that the error expression in (9.82) suggests that the method of this section is worth applying only if f(x) is sufficiently smooth. This is consistent with comments made in Section 9.3 regarding how to choose between the trapezoidal and Simpson's rules. In the next example f(x) = e^{-x}, which is a very smooth function.



Example 9.4 Here we will consider

I = \int_{-1}^{1} e^{-x}\,dx = e - \frac{1}{e} = 2.350402387

and compare the approximation to I obtained by applying Simpson's rule and Legendre–Gauss quadrature. We will assume n = 2 in both cases.

Let us first consider application of Simpson's rule. Since x_0 = a = -1, x_1 = 0, x_2 = b = 1 (h = (b - a)/n = (1 - (-1))/2 = 1), we have via (9.41)

S(2) = \frac{1}{3}[e^{+1} + 4e^{0} + e^{-1}] = 2.362053757




for which the error is 

E_S(2) = I - S(2) = -0.011651.

Now let us consider the Legendre–Gauss quadrature for our problem. We recall that for Legendre polynomials the weight function is w(x) = 1 for all x \in [-1, 1] (Section 5.5). From Section 5.6 we have

\phi_3(x) = \frac{1}{\|P_3\|}P_3(x) = \sqrt{\frac{7}{2}}\,\frac{1}{2}[5x^3 - 3x],

\phi_4(x) = \frac{1}{\|P_4\|}P_4(x) = \sqrt{\frac{9}{2}}\,\frac{1}{8}[35x^4 - 30x^2 + 3].

Consequently, \phi_{3,3} = \frac{5}{2}\sqrt{\frac{7}{2}}, and the zeros of \phi_3(x) are at x = 0, \pm\sqrt{\frac{3}{5}}, so now our sample points (grid points, mesh points) are

x_0 = -\sqrt{\frac{3}{5}}, \quad x_1 = 0, \quad x_2 = +\sqrt{\frac{3}{5}}.

Hence, since \phi_3^{(1)}(x) = \sqrt{\frac{7}{2}}\,\frac{1}{2}[15x^2 - 3], we have

\phi_3^{(1)}(x_0) = 3\sqrt{\frac{7}{2}}, \quad \phi_3^{(1)}(x_1) = -\frac{3}{2}\sqrt{\frac{7}{2}}, \quad \phi_3^{(1)}(x_2) = 3\sqrt{\frac{7}{2}}.

Also, \phi_{4,4} = \frac{35}{8}\sqrt{\frac{9}{2}}, and

\phi_4(x_0) = -\frac{3}{10}\sqrt{\frac{9}{2}}, \quad \phi_4(x_1) = \frac{3}{8}\sqrt{\frac{9}{2}}, \quad \phi_4(x_2) = -\frac{3}{10}\sqrt{\frac{9}{2}}.

Therefore, from (9.88) the Christoffel numbers are

H_0 = \frac{5}{9}, \quad H_1 = \frac{8}{9}, \quad H_2 = \frac{5}{9}.

From (9.87) the resulting quadrature is

\int_{-1}^{1} e^{-x}\,dx \approx H_0 e^{-x_0} + H_1 e^{-x_1} + H_2 e^{-x_2} = L(2)

with

L(2) = 2.350336929,




and the corresponding error is

E_L(2) = I - L(2) = 6.5458 \times 10^{-5}.

Clearly, |E_L(2)| \ll |E_S(2)|. Thus, the Legendre–Gauss quadrature is much more accurate than Simpson's rule. Considering how small n is here, the accuracy of the Legendre–Gauss quadrature is remarkably high.
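A minimal MATLAB sketch of this computation follows (the variable names are our own). It hard-codes the n = 2 nodes and Christoffel numbers found above rather than rederiving them from (9.88).

f   = @(x) exp(-x);                 % integrand of Example 9.4
xk  = [-sqrt(3/5), 0, sqrt(3/5)];   % zeros of phi_3(x) (Legendre P_3)
Hk  = [5/9, 8/9, 5/9];              % Christoffel numbers from (9.88)
L2  = sum(Hk.*f(xk));               % Legendre-Gauss estimate L(2)
err = (exp(1) - exp(-1)) - L2;      % compare with I = e - 1/e

Running this reproduces L(2) = 2.350336929 and the error of about 6.5 x 10^{-5}.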

9.5 ROMBERG INTEGRATION 

Romberg integration is a recursive procedure that seeks to improve on the trape- 
zoidal and Simpson rules. But before we consider this numerical integration 
methodology, we will look at some more basic ideas. 

Suppose that I = f f(x) dx and I(n) is a quadrature that approximates /. For 
us, I(n) will be either the trapezoidal rule T(n) from (9.14), or else it will be 
Simpson's rule S(n) from (9.41). It could also be the corrected trapezoidal rule 
Tc(n), which is considered below [see either (9.104), or (9.107)]. 

It is possible to improve on the "basic" trapezoidal rule from Section 9.2. Begin by recalling (9.27)

E_T(n) = -\frac{1}{12}h^3\sum_{k=1}^{n} f^{(2)}(\xi_k),   (9.99)

where h = (b - a)/n, x_0 = a, x_n = b, and \xi_k \in [x_{k-1}, x_k]. Of course, x_k - x_{k-1} = h (uniform sampling grid). Assuming (as usual) that f^{(2)}(x) is Riemann integrable, then

\lim_{n\to\infty}\sum_{k=1}^{n} h f^{(2)}(\xi_k) = f^{(1)}(b) - f^{(1)}(a) = \int_a^b f^{(2)}(x)\,dx.   (9.100)

Thus, we have the approximation

\sum_{k=1}^{n} h f^{(2)}(\xi_k) \approx f^{(1)}(b) - f^{(1)}(a).   (9.101)

Consequently,

E_T(n) = -\frac{1}{12}h^2\sum_{k=1}^{n} h f^{(2)}(\xi_k) \approx -\frac{1}{12}h^2[f^{(1)}(b) - f^{(1)}(a)],   (9.102)

or

I - T(n) \approx -\frac{1}{12}h^2[f^{(1)}(b) - f^{(1)}(a)].   (9.103)




This immediately suggests that we can improve on the trapezoidal rule by replacing T(n) with the new approximation

T_c(n) = T(n) - \frac{1}{12}h^2[f^{(1)}(b) - f^{(1)}(a)],   (9.104)

where T_c(n) denotes the corrected trapezoidal rule approximation to I. Clearly, once we have T(n), rather little extra effort is needed to obtain T_c(n).
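As a concrete illustration, the following MATLAB sketch computes T(n) and then applies the correction (9.104) when the end derivatives are available analytically (the function name trapc and the argument names are ours, not the text's; df is the analytic first derivative of f).

function Tc = trapc(f, df, a, b, n)
% Corrected trapezoidal rule (9.104); f and df must accept vector arguments.
h  = (b - a)/n;
x  = a + h*(0:n);
T  = h*(sum(f(x)) - 0.5*(f(a) + f(b)));   % composite trapezoidal rule T(n)
Tc = T - (h^2/12)*(df(b) - df(a));        % end correction of (9.104)
end

For instance, trapc(@(x) sin(x), @(x) cos(x), 0, pi/2, 8) gives about 0.99999793, matching the T_c(n) column of Example 9.5 below.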

In fact, we do not necessarily need to know f^{(1)}(x) exactly anywhere, much less at the points x = a or x = b. In the next section we will argue that either

f^{(1)}(x) = \frac{1}{2h}[-3f(x) + 4f(x+h) - f(x+2h)] + \frac{1}{3}h^2 f^{(3)}(\xi)   (9.105a)

or that

f^{(1)}(x) = \frac{1}{2h}[3f(x) - 4f(x-h) + f(x-2h)] + \frac{1}{3}h^2 f^{(3)}(\xi).   (9.105b)

In (9.105a) \xi \in [x, x+2h], while in (9.105b) \xi \in [x-2h, x]. Consequently, with x_0 = a for \xi_0 \in [a, a+2h], we have [via (9.105a)]

f^{(1)}(x_0) = f^{(1)}(a) = \frac{1}{2h}[-3f(x_0) + 4f(x_1) - f(x_2)] + \frac{1}{3}h^2 f^{(3)}(\xi_0),   (9.106a)

and with x_n = b for \xi_n \in [b-2h, b], we have [via (9.105b)]

f^{(1)}(x_n) = f^{(1)}(b) = \frac{1}{2h}[3f(x_n) - 4f(x_{n-1}) + f(x_{n-2})] + \frac{1}{3}h^2 f^{(3)}(\xi_n).   (9.106b)

Thus, (9.104) becomes (approximate corrected trapezoidal rule)

T_c(n) = T(n) - \frac{h}{24}[3f(x_n) - 4f(x_{n-1}) + f(x_{n-2}) + 3f(x_0) - 4f(x_1) + f(x_2)] - \frac{h^4}{36}[f^{(3)}(\xi_n) - f^{(3)}(\xi_0)].   (9.107)

Of course, in evaluating (9.107) we would exclude the terms involving f^{(3)}(\xi_0) and f^{(3)}(\xi_n).

As noted in Epperson [7], for the trapezoidal and Simpson rules

I - I(n) \propto \frac{1}{n^p},   (9.108)

where p = 2 for I(n) = T(n) and p = 4 for I(n) = S(n). In other words, for a given rule and a suitable constant C we must have I - I(n) \approx Cn^{-p}. Now observe






that we may define the ratio

r_{4n} = \frac{I(n) - I(2n)}{I(2n) - I(4n)} = \frac{(I - Cn^{-p}) - (I - C(2n)^{-p})}{(I - C(2n)^{-p}) - (I - C(4n)^{-p})} = \frac{(2n)^{-p} - n^{-p}}{(4n)^{-p} - (2n)^{-p}} = \frac{2^{-p} - 1}{4^{-p} - 2^{-p}} = 2^p.   (9.109)

Immediately we conclude that

p = \log_2 r_{4n} = \frac{\log_{10} r_{4n}}{\log_{10} 2}.   (9.110)



This is useful as a check on program implementation of our quadratures. If (9.110) 
is not approximately satisfied when we apply the trapezoidal or Simpson rules, 
then (1) the integrand f(x) is not smooth enough for our theories to apply, (2) 
there is a "bug" in the program, or (3) the error may not be decreasing quickly 
with n because it is already tiny to begin with, as might happen when integrating 
an oscillatory function using Simpson's rule. 
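The check in (9.110) takes only a few lines in MATLAB. Given three quadrature values at n, 2n, and 4n panels (the variable names I_n, I_2n, and I_4n below are ours), a minimal sketch is:

r4n = (I_n - I_2n)/(I_2n - I_4n);   % ratio (9.109)
p   = log10(r4n)/log10(2);          % estimated order via (9.110)

Values of p near 2 (trapezoidal rule) or near 4 (Simpson or corrected trapezoidal rule) indicate that both the implementation and the smoothness assumptions are in order.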

The following examples illustrate the previous principles. 

Example 9.5 In this example we consider approximating

I = \int_0^{\pi/2} \sin x\,dx = 1

using T(n) [via (9.14)], T_c(n) [via (9.104)], and S(n) [via (9.41)]. Parameter p in (9.110) is computed for each of these cases, and is displayed in the following table (where "NaN" means "not a number"):



n     T(n)         p for T(n)   T_c(n)       p for T_c(n)   S(n)         p for S(n)
2     0.94805945   NaN          0.99946364   NaN            1.00227988   NaN
4     0.98711580   NaN          0.99996685   NaN            1.00013458   NaN
8     0.99678517   2.0141       0.99999793   4.0169         1.00000830   4.0864
16    0.99919668   2.0035       0.99999987   4.0042         1.00000052   4.0210
32    0.99979919   2.0009       0.99999999   4.0010         1.00000003   4.0052
64    0.99994980   2.0002       1.00000000   4.0003         1.00000000   4.0013
128   0.99998745   2.0001       1.00000000   4.0001         1.00000000   4.0003
256   0.99999686   2.0000       1.00000000   4.0000         1.00000000   4.0001

We see that since f(x) = \sin x is a rather smooth function we obtain the values for p that we expect to see.

Example 9.6 This example is in contrast with the previous one. Here we approximate

I = \int_0^1 x^{1/3}\,dx = \frac{3}{4}



using T(n) [via (9.14)], T c (n) [via (9.107)], and S(n) [via (9.41)]. Again, p 
[via (9.110)] is computed for each of these cases, and the results are tabulated 
as follows: 



n     T(n)         p for T(n)   T_c(n)       p for T_c(n)   S(n)         p for S(n)
2     0.64685026   NaN          0.69580035   NaN            0.69580035   NaN
4     0.70805534   NaN          0.72437494   NaN            0.72845703   NaN
8     0.73309996   1.2892       0.73980487   0.8890         0.74144817   1.3298
16    0.74322952   1.3059       0.74595297   1.3275         0.74660604   1.3327
32    0.74729720   1.3163       0.74839388   1.3327         0.74865310   1.3332
64    0.74892341   1.3227       0.74936261   1.3333         0.74946548   1.3333
128   0.74957176   1.3267       0.74974705   1.3333         0.74978788   1.3333
256   0.74982980   1.3292       0.74989962   1.3333         0.74991582   1.3333


We observe that f(x) = x^{1/3}, but that f^{(1)}(x) = \frac{1}{3}x^{-2/3}, f^{(2)}(x) = -\frac{2}{9}x^{-5/3}, etc. Thus, the derivatives of f(x) are unbounded at x = 0, and so f(x) is not smooth on the interval of integration. This explains why we obtain p \approx 1.3333 in all cases.

As a further step toward Romberg integration, consider the following. Since I - I(2n) \approx C(2n)^{-p} = C2^{-p}n^{-p} \approx 2^{-p}(I - I(n)), we obtain the approximate equality

I \approx \frac{I(2n) - 2^{-p}I(n)}{1 - 2^{-p}},

or

I \approx \frac{2^p I(2n) - I(n)}{2^p - 1} = R(2n).   (9.111)



We call R(2n) Richardson's extrapolated value (or Richardson's extrapolation), which is an improvement on I(2n). The estimated error in the extrapolation is given by

E_R(2n) = I(2n) - R(2n) = \frac{I(n) - I(2n)}{2^p - 1}.   (9.112)



Of course, p in (9.111) and (9.112) must be the proper choice for the quadrature I(n). We may "confirm" that (9.111) works for T(n) (as an example) by considering (9.28)

E_{T(n)} = -\frac{h^2}{12}(b - a)f^{(2)}(\xi)   (9.113)

(for some \xi \in [a, b]). Clearly

E_{T(2n)} \approx -\frac{(h/2)^2}{12}(b - a)f^{(2)}(\xi) = \frac{1}{4}E_{T(n)},

or in other words (E_{T(n)} = I - T(n))

I - T(2n) \approx \frac{1}{4}[I - T(n)],

and hence for n \ge 1

I \approx \frac{4T(2n) - T(n)}{3} = R_T(2n),   (9.114)



which is (9.111) for case p — 2. Equation (9.114) is called the Romberg integration 
formula for the trapezoidal rule. Of course, a similar expression may be obtained 
for Simpson's rule [with p = 4 in (9.111)]; that is, for n even 



I \approx \frac{16S(2n) - S(n)}{15} = R_S(2n).   (9.115)

In fact, it can also be shown that R_T(2n) = S(2n):

S(2n) = \frac{4T(2n) - T(n)}{3}.   (9.116)



(Perhaps this is most easily seen in the special case where n — 1.) In other words, 
the Romberg procedure applied to the trapezoidal rule yields the Simpson rule. 

Now, Romberg integration is really the repeated application of the Richardson 
extrapolation idea to the composite trapezoidal rule. A simple way to visualize the 
process is with the Romberg table {Romberg array): 



T(1)
T(2)    S(2)
T(4)    S(4)    R_S(4)
T(8)    S(8)    R_S(8)    ?
T(16)   S(16)   R_S(16)   ?    ?

In its present form the table consists of only three columns, but the recursive 
process may be continued to produce a complete "triangular array." 

The complete Romberg integration procedure is often fully justified and devel- 
oped with respect to the following theorem. 



Theorem 9.1: Euler–Maclaurin Formula Let f \in C^{2k+2}[a, b] for some k \ge 0, and let us approximate I = \int_a^b f(x)\,dx by the composite trapezoidal rule of (9.14). Letting h_n = (b - a)/n for n \ge 1, we have

T(n) = I + \sum_{i=1}^{k}\frac{B_{2i}}{(2i)!}h_n^{2i}\left[f^{(2i-1)}(b) - f^{(2i-1)}(a)\right] + \frac{B_{2k+2}}{(2k+2)!}h_n^{2k+2}(b - a)f^{(2k+2)}(\eta),   (9.117)

where \eta \in [a, b], and for j \ge 1

B_{2j} = (-1)^{j-1}\left[\sum_{m=1}^{\infty}\frac{2}{(2\pi m)^{2j}}\right](2j)!   (9.118)



are the Bernoulli numbers. 



Proof This is Property 9.3 in Quarteroni et al. [8]. A proof appears in Ralston 
[9]. Alternative descriptions of the Bernoulli numbers appear in Gradshteyn and 
Ryzhik [10]. Although not apparent from (9.118), the Bernoulli numbers are all 
rational numbers. 

We will present the complete Romberg integration process in a more straight- 
forward manner. Begin by considering the following theorem. 

Theorem 9.2: Recursive Trapezoidal Rule Suppose that h = (b - a)/(2n); then, for n \ge 1 (x_0 = a, x_n = b)

T(2n) = \frac{1}{2}T(n) + h\sum_{k=1}^{n} f(x_0 + (2k-1)h).   (9.119)

The first column in the Romberg table is given by

T(2^n) = \frac{1}{2}T(2^{n-1}) + \frac{b-a}{2^n}\sum_{k=1}^{2^{n-1}} f\left(x_0 + (2k-1)\frac{b-a}{2^n}\right)   (9.120)

for all n \ge 1.

Proof Omitted, but clearly (9.120) immediately follows from (9.119). 

We let R_n^{(k)} denote the row n, column k entry of the Romberg table, where k = 0, 1, ..., N and n = k, ..., N [i.e., we construct an (N+1) \times (N+1) lower triangular array, as suggested earlier]. Table entries are "blank" for n = 0, 1, ..., k-1 in column k. The first column of the table is certainly

R_n^{(0)} = T(2^n)   (9.121)






for n = 0, 1, ..., N. From Theorem 9.2 we must have

R_n^{(0)} = \frac{1}{2}R_{n-1}^{(0)} + \frac{b-a}{2^n}\sum_{k=1}^{2^{n-1}} f\left(a + (2k-1)\frac{b-a}{2^n}\right)   (9.122)

for n = 1, 2, ..., N. Equation (9.122) [together with R_0^{(0)} = T(1) from (9.121)] is the algorithm for constructing the first column of the Romberg table. The second column is R_n^{(1)}, and these numbers are given by [via (9.114)]

R_n^{(1)} = \frac{4R_n^{(0)} - R_{n-1}^{(0)}}{4 - 1}   (9.123)

for n = 1, 2, ..., N. Similarly, the third column is R_n^{(2)}, and these numbers are given by [via (9.115)]

R_n^{(2)} = \frac{4^2 R_n^{(1)} - R_{n-1}^{(1)}}{4^2 - 1}   (9.124)

for n = 2, 3, ..., N. The pattern suggested by (9.123) and (9.124) generalizes according to

R_n^{(k)} = \frac{4^k R_n^{(k-1)} - R_{n-1}^{(k-1)}}{4^k - 1}   (9.125)

for n = k, ..., N, with k = 1, 2, ..., N. Assuming that f(x) is sufficiently smooth, we can estimate the error using the Richardson extrapolation method in this manner:

E_n^{(k)} = \frac{R_n^{(k)} - R_{n-1}^{(k)}}{4^k}.   (9.126)

This can be used to stop the recursive process of table construction when E_n^{(k)} is small enough. Recall that every entry R_n^{(k)} in the table is an estimate of I = \int_a^b f(x)\,dx. In some sense R_N^{(N)} is the "final estimate," and will be the best one if f(x) is smooth enough. Finally, the general appearance of the Romberg table is



R_0^{(0)}
R_1^{(0)}       R_1^{(1)}
R_2^{(0)}       R_2^{(1)}       R_2^{(2)}
   .               .               .           .
R_{N-1}^{(0)}   R_{N-1}^{(1)}   R_{N-1}^{(2)}   ...   R_{N-1}^{(N-1)}
R_N^{(0)}       R_N^{(1)}       R_N^{(2)}       ...   R_N^{(N-1)}       R_N^{(N)}



The Romberg integration procedure is efficient. Function evaluation is confined 
to the construction of the first column. The remaining columns are filled in with 
just a fixed (and small) number of arithmetic operations per entry as determined 
by (9.125). 
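A minimal MATLAB sketch of the procedure defined by (9.121)–(9.125) follows (the function name romberg is ours). It returns the (N+1)-by-(N+1) lower triangular array R, whose entry R(n+1,k+1) holds R_n^{(k)} in the notation above; f must accept vector arguments.

function R = romberg(f, a, b, N)
% Construct the Romberg table for I = integral of f over [a,b].
R = zeros(N + 1, N + 1);
R(1,1) = 0.5*(b - a)*(f(a) + f(b));           % T(1), i.e., R_0^(0)
for n = 1:N                                    % first column via (9.122)
  hn = (b - a)/2^n;
  k  = 1:2^(n-1);
  R(n+1,1) = 0.5*R(n,1) + hn*sum(f(a + (2*k - 1)*hn));
end
for k = 1:N                                    % remaining columns via (9.125)
  for n = k:N
    R(n+1,k+1) = (4^k*R(n+1,k) - R(n,k))/(4^k - 1);
  end
end
end

For example, romberg(@(x) exp(x), 0, 1, 3) reproduces the first table of Example 9.7 below.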

Example 9.7 Begin by considering the Romberg approximation to

I = \int_0^1 e^x\,dx = 1.718281828.

The Romberg table for this is (N = 3)

1.85914091
1.75393109   1.71886115
1.72722190   1.71831884   1.71828269
1.72051859   1.71828415   1.71828184   1.71828183

Table entry R_3^{(3)} = 1.71828183 is certainly the most accurate estimate of I. Now contrast this example with the next one.

The zeroth-order modified Bessel function of the first kind I_0(y) \in \mathbf{R} (y \in \mathbf{R}) is important in applied probability. For example, it appears in the problem of computing bit error probabilities in amplitude shift keying (ASK) digital data communications [11]. There is a series expansion expression for I_0(y), but there is also an integral form, which is

I_0(y) = \frac{1}{2\pi}\int_0^{2\pi} e^{y\cos x}\,dx.   (9.127)

For y = 1, we have (according to MATLAB's besseli function; I_0(y) = besseli(0,y))

I_0(1) = 1.2660658778.

The Romberg table of estimates for this integral is (N = 4)



2.71828183
1.54308063   1.15134690
1.27154032   1.18102688   1.18300554
1.26606608   1.26424133   1.26978896   1.27116647
1.26606588   1.26606581   1.26618744   1.26613028   1.26611053

Plainly, table entry R_4^{(4)} = 1.26611053 is not as accurate as R_4^{(0)} = 1.26606588. The integrand of (9.127) is smooth, but it is also periodic over [0, 2\pi], so all of its derivatives agree at the two endpoints; the correction terms in the Euler–Maclaurin formula (9.117) therefore vanish, and the trapezoidal estimates in the first column are already exceptionally accurate. In this situation Romberg extrapolation has nothing systematic left to remove, and so it does not improve matters.






9.6 NUMERICAL DIFFERENTIATION 



A simple theory of numerical approximation to the derivative can be obtained via Taylor series expansions (Chapter 3). Recall that [via (3.71)]

f(x + h) = \sum_{k=0}^{n}\frac{h^k}{k!}f^{(k)}(x) + \frac{h^{n+1}}{(n+1)!}f^{(n+1)}(\xi)   (9.128)



for suitable \xi \in [x, x+h]. As usual, 0! = 1, and f(x) = f^{(0)}(x). Since from elementary calculus

f^{(1)}(x) = \lim_{h\to 0}\frac{f(x+h) - f(x)}{h} \approx \frac{f(x+h) - f(x)}{h},   (9.129)

from (9.128) we have

f^{(1)}(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{2!}h f^{(2)}(\xi).   (9.130)

This was obtained simply by rearranging

f(x + h) = f(x) + hf^{(1)}(x) + \frac{1}{2}h^2 f^{(2)}(\xi).   (9.131)

We may write

f^{(1)}(x) = \underbrace{\frac{f(x+h) - f(x)}{h}}_{=\hat{f}_f^{(1)}(x)}\;\underbrace{-\,\frac{h}{2}f^{(2)}(\xi)}_{=e_f(x)}.   (9.132)

Approximation \hat{f}_f^{(1)}(x) is called the forward difference approximation to f^{(1)}(x), and the error e_f(x) is seen to be approximately proportional to h. Now consider (\xi_1 \in [x, x+h])

f(x + h) = f(x) + hf^{(1)}(x) + \frac{1}{2}h^2 f^{(2)}(x) + \frac{1}{3!}h^3 f^{(3)}(\xi_1),   (9.133)

and clearly (\xi_2 \in [x-h, x])

f(x - h) = f(x) - hf^{(1)}(x) + \frac{1}{2}h^2 f^{(2)}(x) - \frac{1}{3!}h^3 f^{(3)}(\xi_2).   (9.134)



From (9.134),

f^{(1)}(x) = \underbrace{\frac{f(x) - f(x-h)}{h}}_{=\hat{f}_b^{(1)}(x)}\;\underbrace{+\,\frac{h}{2}f^{(2)}(\xi)}_{=e_b(x)}.   (9.135)






Here the approximation \hat{f}_b^{(1)}(x) is called the backward difference approximation to f^{(1)}(x), and has error e_b(x) that is also approximately proportional to h. However, an improvement is possible, and this is obtained by subtracting (9.134) from (9.133):

f(x+h) - f(x-h) = 2hf^{(1)}(x) + \frac{h^3}{3!}\left[f^{(3)}(\xi_1) + f^{(3)}(\xi_2)\right],

or on rearranging this, we have

f^{(1)}(x) = \underbrace{\frac{f(x+h) - f(x-h)}{2h}}_{=\hat{f}_c^{(1)}(x)}\;\underbrace{-\,\frac{h^2}{6}\,\frac{f^{(3)}(\xi_1) + f^{(3)}(\xi_2)}{2}}_{=e_c(x)}.   (9.136)



Recalling the derivation of (9.28), there is a \xi \in [x-h, x+h] such that

f^{(3)}(\xi) = \frac{1}{2}\left[f^{(3)}(\xi_1) + f^{(3)}(\xi_2)\right]   (9.137)

(\xi_1 \in [x, x+h], \xi_2 \in [x-h, x]). Hence, (9.136) can be rewritten as (for some \xi \in [x-h, x+h])



f^{(1)}(x) = \underbrace{\frac{f(x+h) - f(x-h)}{2h}}_{=\hat{f}_c^{(1)}(x)}\;\underbrace{-\,\frac{h^2}{6}f^{(3)}(\xi)}_{=e_c(x)}.   (9.138)

Clearly, the error e_c(x), which is the error of the central difference approximation \hat{f}_c^{(1)}(x) to f^{(1)}(x), is proportional to h^2. Thus, if f(x) is smooth enough, the central difference approximation is more accurate than the forward or backward difference approximations.
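The three approximations are easily compared numerically. The following MATLAB sketch (variable names ours) estimates f^{(1)}(1) for f(x) = e^x with h = 0.01:

f  = @(x) exp(x);
x  = 1;  h = 0.01;
df_fwd = (f(x + h) - f(x))/h;           % forward difference,  error ~ h
df_bwd = (f(x) - f(x - h))/h;           % backward difference, error ~ h
df_cen = (f(x + h) - f(x - h))/(2*h);   % central difference,  error ~ h^2
errs   = abs([df_fwd, df_bwd, df_cen] - exp(1));

Here the forward and backward errors are both about 1.4 x 10^{-2}, while the central difference error is about 4.5 x 10^{-5}, consistent with the h versus h^2 behavior of e_f, e_b, and e_c.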

The errors e_f(x), e_b(x), and e_c(x) are truncation errors in the approximations. Of course, when implementing any of these approximations on a computer there will be rounding errors, too. Each approximation can be expressed in the form

\hat{f}^{(1)}(x) = \frac{1}{h}f^T y,   (9.139)

where for the forward difference approximation

f = [f(x)\ \ f(x+h)]^T, \quad y = [-1\ \ 1]^T,   (9.140a)

for the backward difference approximation

f = [f(x-h)\ \ f(x)]^T, \quad y = [-1\ \ 1]^T,   (9.140b)




and for the central difference approximation

f = [f(x-h)\ \ f(x+h)]^T, \quad y = \frac{1}{2}[-1\ \ 1]^T.   (9.140c)

Thus, the approximations are all Euclidean inner products of samples of f(x) with a vector of constants y, followed by division by h. This is structurally much the same kind of computation as numerical integration. In fact, an upper bound on the size of the error due to rounding is given by the following theorem.

Theorem 9.3: Since fl[\hat{f}^{(1)}(x)] = fl[\,fl[f^T y]/h\,], we have (f, y \in \mathbf{R}^m)

\left|fl[\hat{f}^{(1)}(x)] - \hat{f}^{(1)}(x)\right| \le \frac{u}{h}[1.01m + 1]\,\|f\|_2\|y\|_2.   (9.141)

Proof Our analysis here is rather similar to Example 2.4 in Chapter 2. Thus, we exploit yet again the results from Chapter 2 on rounding errors in dot product computation.

Via (2.41)

fl[f^T y] = f^T y(1 + \epsilon_1),

for which

|\epsilon_1| \le 1.01mu\,\frac{|f|^T|y|}{|f^T y|}.

In addition,

fl[\hat{f}^{(1)}(x)] = \frac{f^T y}{h}(1 + \epsilon_1)(1 + \epsilon),

where |\epsilon| \le u. Thus, since

fl[\hat{f}^{(1)}(x)] = \frac{f^T y}{h} + \frac{f^T y}{h}(\epsilon_1 + \epsilon + \epsilon_1\epsilon),

we have

\left|fl[\hat{f}^{(1)}(x)] - \hat{f}^{(1)}(x)\right| \le \frac{|f^T y|}{h}\left(|\epsilon_1| + |\epsilon| + |\epsilon_1||\epsilon|\right) \le \frac{1}{h}\left(1.01mu\,|f|^T|y| + u|f^T y| + 1.01mu^2|f|^T|y|\right),

and since u^2 \ll u, we may neglect the last term, yielding

\left|fl[\hat{f}^{(1)}(x)] - \hat{f}^{(1)}(x)\right| \le \frac{1}{h}\left(1.01mu\,|f|^T|y| + u|f^T y|\right).




But |f|^T|y| = \langle|f|, |y|\rangle \le \|f\|_2\|y\|_2, and |f^T y| = |\langle f, y\rangle| \le \|f\|_2\|y\|_2 via Theorem 1.1, so finally we have

\left|fl[\hat{f}^{(1)}(x)] - \hat{f}^{(1)}(x)\right| \le \frac{u}{h}[1.01m + 1]\,\|f\|_2\|y\|_2,

which is the theorem statement.

The bound in (9.141) suggests that, since we treat u, m, \|f\|_2, and \|y\|_2 as fixed, the rounding errors will grow in size as h becomes smaller. On the other hand, as h becomes smaller, the truncation errors diminish in size. Thus, much as with numerical integration (recall Fig. 9.3), there is a tradeoff between rounding and truncation errors leading in the present case to the existence of some optimal value for the choice of h. In most practical circumstances the truncation errors will dominate, however.
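The tradeoff is easy to see experimentally. The following MATLAB sketch (ours) evaluates the central difference error for f(x) = e^x at x = 1 over a wide range of h; in IEEE double precision the total error typically reaches a minimum near h of about 10^{-5} to 10^{-6} and then grows again as rounding errors take over.

f   = @(x) exp(x);
x   = 1;
h   = 10.^(-(1:12));                  % h = 1e-1, 1e-2, ..., 1e-12
df  = (f(x + h) - f(x - h))./(2*h);   % central difference estimates
err = abs(df - exp(1));               % total (truncation + rounding) error
semilogy(-log10(h), err);             % plot error versus -log10(h)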

We recall that interpolation theory from Chapter 6 was useful in developing 
theories on numerical integration in earlier sections of the present chapter. We 
therefore reasonably expect that interpolation ideas from Chapter 6 ought to be 
helpful in developing approximations to the derivative. 

Recall that p_n(x) = \sum_{k=0}^{n} p_{n,k}x^k interpolates f(x) for x \in [a, b], i.e., p_n(x_k) = f(x_k) with x_0 = a, x_n = b, and x_k \in [a, b] for all k, but such that x_k \ne x_j for k \ne j. If f(x) \in C^{n+1}[a, b], then, from (6.14),

f(x) = p_n(x) + \frac{1}{(n+1)!}f^{(n+1)}(\xi)\prod_{i=0}^{n}(x - x_i)   (9.142)

(\xi = \xi(x) \in [a, b]). For convenience we also defined \pi(x) = \prod_{i=0}^{n}(x - x_i). Thus, from (9.142),

f^{(1)}(x) = p_n^{(1)}(x) + \frac{1}{(n+1)!}\left[\pi^{(1)}(x)f^{(n+1)}(\xi) + \pi(x)\frac{d}{dx}f^{(n+1)}(\xi)\right].   (9.143)

Since \xi = \xi(x) is a function of x that is seldom known, it is not at all clear what df^{(n+1)}(\xi)/dx is in general, and to evaluate this also assumes that d\xi(x)/dx exists.³ We may sidestep this problem by evaluating (9.143) only for x = x_k, and since \pi(x_k) = 0 for all k, Eq. (9.143) reduces to

f^{(1)}(x_k) = p_n^{(1)}(x_k) + \frac{1}{(n+1)!}\pi^{(1)}(x_k)f^{(n+1)}(\xi(x_k)).   (9.144)

For simplicity we will now assume that x_k = x_0 + hk, h = (b - a)/n (i.e., uniform sampling grid). We will also suppose that n = 2, in which case (9.144) becomes

f^{(1)}(x_k) = p_2^{(1)}(x_k) + \frac{1}{3!}\pi^{(1)}(x_k)f^{(3)}(\xi(x_k)).   (9.145)

³This turns out to be true, although it is not easy to prove. Fortunately, we do not need this result.




Since \pi(x) = (x - x_0)(x - x_1)(x - x_2), we have \pi^{(1)}(x) = (x - x_1)(x - x_2) + (x - x_0)(x - x_2) + (x - x_0)(x - x_1), and therefore

\pi^{(1)}(x_0) = (x_0 - x_1)(x_0 - x_2) = (-h)(-2h) = 2h^2,
\pi^{(1)}(x_1) = (x_1 - x_0)(x_1 - x_2) = h(-h) = -h^2,   (9.146)
\pi^{(1)}(x_2) = (x_2 - x_0)(x_2 - x_1) = (2h)h = 2h^2.

From (6.9), we obtain

p_2(x) = f(x_0)L_0(x) + f(x_1)L_1(x) + f(x_2)L_2(x),   (9.147)

where, from (6.11),

L_j(x) = \prod_{i=0,\ i\ne j}^{2}\frac{x - x_i}{x_j - x_i}.   (9.148)

Hence, the approximation to f^{(1)}(x_k) is (k \in \{0, 1, 2\})

p_2^{(1)}(x_k) = f(x_0)L_0^{(1)}(x_k) + f(x_1)L_1^{(1)}(x_k) + f(x_2)L_2^{(1)}(x_k).   (9.149)

From (9.148),

L_0(x) = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)}, \quad L_0^{(1)}(x) = \frac{(x-x_1)+(x-x_2)}{(x_0-x_1)(x_0-x_2)},
L_1(x) = \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)}, \quad L_1^{(1)}(x) = \frac{(x-x_0)+(x-x_2)}{(x_1-x_0)(x_1-x_2)},   (9.150)
L_2(x) = \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)}, \quad L_2^{(1)}(x) = \frac{(x-x_0)+(x-x_1)}{(x_2-x_0)(x_2-x_1)},

so that

L_0^{(1)}(x_0) = -\frac{3}{2h}, \quad L_0^{(1)}(x_1) = -\frac{1}{2h}, \quad L_0^{(1)}(x_2) = +\frac{1}{2h},
L_1^{(1)}(x_0) = +\frac{2}{h}, \quad L_1^{(1)}(x_1) = 0, \quad L_1^{(1)}(x_2) = -\frac{2}{h},   (9.151)
L_2^{(1)}(x_0) = -\frac{1}{2h}, \quad L_2^{(1)}(x_1) = +\frac{1}{2h}, \quad L_2^{(1)}(x_2) = +\frac{3}{2h}.

We therefore have the following expressions for f^{(1)}(x_k):

f^{(1)}(x_0) = \frac{1}{2h}[-3f(x_0) + 4f(x_1) - f(x_2)] + \frac{h^2}{3}f^{(3)}(\xi(x_0)),   (9.152a)

f^{(1)}(x_1) = \frac{1}{2h}[-f(x_0) + f(x_2)] - \frac{h^2}{6}f^{(3)}(\xi(x_1)),   (9.152b)

f^{(1)}(x_2) = \frac{1}{2h}[f(x_0) - 4f(x_1) + 3f(x_2)] + \frac{h^2}{3}f^{(3)}(\xi(x_2)).   (9.152c)




We recognize that the case (9.152b) contains the central difference approximation [recall (9.138)], since we may let x_0 = x - h, x_2 = x + h (and x_1 = x). If we let x_0 = x, x_1 = x + h, and x_2 = x + 2h, then (9.152a) yields

f^{(1)}(x) \approx \frac{1}{2h}[-3f(x) + 4f(x+h) - f(x+2h)],   (9.153)

and if we let x_2 = x, x_1 = x - h, and x_0 = x - 2h, then (9.152c) yields

f^{(1)}(x) \approx \frac{1}{2h}[f(x-2h) - 4f(x-h) + 3f(x)].   (9.154)

Note that (9.153) and (9.154) were employed in obtaining T_c(n) in (9.107).



REFERENCES 

1. G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical 
Computations, Prentice-Hall, Englewood Cliffs, NJ, 1977. 

2. L. Bers, Calculus: Preliminary Edition, Vol. 2, Holt, Rinehart, Winston, New York, 
1967. 

3. P. J. Davis and P. Rabinowitz, Numerical Integration, Blaisdell, Waltham, MA, 1967. 

4. T. W. Korner, Fourier Analysis, Cambridge Univ. Press, New York, 1988. 

5. F. B. Hildebrand, Introduction to Numerical Analysis, 2nd ed., McGraw-Hill, New York, 
1974. 

6. W. Sweldens and R. Piessens, "Quadrature Formulae and Asymptotic Error Expansions 
for Wavelet Approximations of Smooth Functions," SIAM J. Numer. Anal. 31, 1240- 
1264 (Aug. 1994). 

7. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York, 
2002. 

8. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Math- 
ematics series, Vol. 37), Springer- Verlag, New York, 2000. 

9. A. Ralston, A First Course in Numerical Analysis, McGraw-Hill, New York, 1965. 

10. I. S. Gradshteyn and I. M. Ryzhik, in Table of Integrals, Series and Products, 5th ed., 
A. Jeffrey, ed., Academic Press, San Diego, CA, 1994. 

11. R. E. Ziemer and W. H. Tranter, Principles of Communications: Systems, Modulation, 
and Noise, Houghton Mifflin, Boston, MA, 1976. 



PROBLEMS 
9.1. This problem is based on an assignment problem due to I. Leonard. Consider

I_n = \int_0^1 \frac{x^n}{x + a}\,dx,

where a > 1 and n \in \mathbf{Z}^+. It is easy to see that for 0 \le x \le 1, we have x^{n+1} \le x^n, and 0 \le I_{n+1} \le I_n for all n \in \mathbf{Z}^+. For 0 \le x \le 1, we have

\frac{x^n}{1 + a} \le \frac{x^n}{x + a} \le \frac{x^n}{a},

implying that

\int_0^1 \frac{x^n}{1 + a}\,dx \le I_n \le \int_0^1 \frac{x^n}{a}\,dx,

or

\frac{1}{(n+1)(1+a)} \le I_n \le \frac{1}{(n+1)a},

so immediately \lim_{n\to\infty} I_n = 0. Also, we have the difference equation

I_n = \int_0^1 \frac{x^{n-1}[x + a - a]}{x + a}\,dx = \frac{1}{n} - aI_{n-1}   (9.P.1)

for n \in \mathbf{N}, where I_0 = \int_0^1 \frac{dx}{x + a} = [\log_e(x + a)]_0^1 = \log_e\left(\frac{1+a}{a}\right).

(a) Assume that \hat{I}_0 = I_0 + \epsilon is the computed value of I_0. Assume that no other errors arise in computing \hat{I}_n for n \ge 1 using (9.P.1). Then

\hat{I}_n = \frac{1}{n} - a\hat{I}_{n-1}.

Define the error e_n = \hat{I}_n - I_n, and find a difference equation for e_n.

(b) Solve for e_n, and show that for large enough a we have \lim_{n\to\infty}|e_n| = \infty.

(c) Find a stable algorithm to compute I_n for n \in \mathbf{Z}^+.

9.2. (a) Find an upper bound on the magnitude of the rounding error involved in 
applying the trapezoidal rule to 



Jo 



x dx. 



(b) Find an upper bound on the magnitude of the truncation error in applying 
the trapezoidal rule to the integral in (a) above. 

9.3. Consider the integral

I(\epsilon) = \int_{\epsilon}^{a} \sqrt{x}\,dx,

where a > \epsilon > 0. Write a MATLAB routine to fill in the following table for a = 1 and n = 100:

\epsilon     |I(\epsilon) - T(n)|     B_T(n)     |I(\epsilon) - S(n)|     B_S(n)
0.1000
0.0100
0.0010
0.0001

In this table T(n) is from (9.14), S(n) is from (9.41), and

|E_T(n)| \le B_T(n), \quad |E_S(n)| \le B_S(n),

where the upper bounds B_T(n) and B_S(n) are obtained using (9.33) and (9.59), respectively.



9.4. Consider the integral 



h i 



dx 



4' 



(a) Use the trapezoidal rule to estimate I, assuming that h = j, 

(b) Use Simpson's rule to estimate I, assuming that h — j. 

(c) Use the corrected trapezoidal rule to estimate I, assuming that h — j. 

Perform all computations using only a pocket calculator. 
9.5. Consider the integral 

I = \int_{-1/2}^{1/2} \frac{dx}{1 - x^2} = \ln 3.



(a) Use the trapezoidal rule to estimate I, assuming that h — j. 

(b) Use Simpson's rule to estimate I, assuming that h=\. 

(c) Use the corrected trapezoidal rule to estimate I, assuming that h — j. 

Perform all computations using only a pocket calculator. 
9.6. Consider the integral

I = \int_0^{\pi/2} \frac{\sin(3x)}{\sin x}\,dx = \frac{\pi}{2}.

(a) Use the trapezoidal rule to estimate I, assuming that h = \pi/12.

(b) Use Simpson's rule to estimate I, assuming that h = \pi/12.

(c) Use the corrected trapezoidal rule to estimate I, assuming that h = \pi/12.

Perform all computations using only a pocket calculator.






9.7. The length of the curve y = f(x) for a \le x \le b is given by

L = \int_a^b \sqrt{1 + [f^{(1)}(x)]^2}\,dx.

Suppose f(x) = \cos x. Compute L for a = -\pi/2 and b = \pi/2. Use the trapezoidal rule, selecting n so that |E_T(n)| \le 0.001. Hence, using (9.33), select n such that

\frac{1}{12}\frac{(b-a)^3}{n^2}\,M \le 0.001.

Do the computations using a suitable MATLAB routine.

9.8. Recall Example 6.5. Estimate the integral 

I — I e~ x dx 

by 

(a) Integrating the natural spline interpolant 

(b) Integrating the complete spline interpolant 

9.9. Find the constants \alpha and \beta in x = \alpha t + \beta, and find f in terms of g such that

\int_a^b g(t)\,dt = \int_{-1}^{1}\frac{f(x)}{\sqrt{1 - x^2}}\,dx.

[Comment: This transformation will permit you to apply the Chebyshev–Gauss quadrature rule from Eq. (9.98) to general integrands.]

9.10. Consider the midpoint rule (Section 9.2). If we approximate I = \int_a^b f(x)\,dx by one rectangle, then the rule is

R(1) = (b - a)f\left(\frac{a+b}{2}\right),

so the truncation error involved in this approximation is

E_R(1) = \int_a^b f(x)\,dx - R(1).

Use Taylor expansion error analysis to find an approximation to E_R(1). [Hint: With \bar{x} = (a+b)/2 and h = b - a, we obtain

f(x) = f(\bar{x}) + (x - \bar{x})f^{(1)}(\bar{x}) + \frac{1}{2!}(x - \bar{x})^2 f^{(2)}(\bar{x}) + \cdots.

Consider f(a), f(b), and \int_a^b f(x)\,dx using this series expansion.]






9.11. Use both the trapezoidal rule [Eq. (9.14)] and the Chebyshev-Gauss quadra- 
ture rule [Eq. (9.98)] to approximate I — ^ Q ^^ dx for n — 6. Assuming 
that / = 1.1790, which rule gives an answer closer to this value for / ? Use 
a pocket calculator to do the computations. 



9.12. Consider the integral 



Jo 



cos x dx — — . 

2 



Write a MATLAB routine to approximate I using the trapezoidal rule, and Richardson's extrapolation. The program must allow you to fill in the following table:

n       T(n)     |I - T(n)|     R(n)     |I - R(n)|
2
4
8
16
32
64
128
256
512
1024

Of course, the extrapolated values are obtained from (9.111), where I(n) = T(n). Does extrapolation improve on the accuracy? Comment on this.

9.13. Write a MATLAB routine that allows you to make a table similar to that in Example 9.6, but for the integral

I = \int_0^1 x^{1/4}\,dx.


9.14. The complete elliptic integral of the first kind is

K(k) = \int_0^{\pi/2}\frac{d\theta}{\sqrt{1 - k^2\sin^2\theta}}, \quad 0 \le k < 1,

and the complete elliptic integral of the second kind is

E(k) = \int_0^{\pi/2}\sqrt{1 - k^2\sin^2\theta}\,d\theta, \quad 0 \le k < 1.

(a) Find a series expansion for K(k). [Hint: Recall (3.80).] 




(b) Construct a Romberg table for N = 4 for the integral K(1/2). Use the series expansion from (a) to find the "exact" value of K(1/2) and compare.

(c) Find a series expansion for E(k). [Hint: Recall (3.80).]

(d) Construct a Romberg table for N = 4 for the integral E(1/2). Use the series expansion from (c) to find the "exact" value of E(1/2) and compare.

Use MATLAB to do all of the calculations. [Comment: Elliptic integrals are 
important in electromagnetic potential problems (e.g., finding the magnetic 
vector potential of a circular current loop), and are important in analog and 
digital filter design (e.g., elliptic filters).] 

9.15. This problem is about an alternative approach to the derivation of Gauss-type 
quadrature rules. The problem statement is long, but the solution is short 
because so much information has been provided. Suppose that w(x) > for 
x e [a, b], so w(x) is some weighting function. We wish to find weights Wk 
and sample points x_k for k = 0, 1, ..., n-1 such that

\int_a^b w(x)f(x)\,dx = \sum_{k=0}^{n-1} w_k f(x_k)   (9.P.2)

for all f(x) = x^j, where j = 0, 1, ..., 2n-2, 2n-1. This task is greatly
aided by defining the moments 



-l 



b 

w(x)x } dx, 



where m ; is called the 7th moment of w(x). In everything that follows it is 
important to realize that because of (9.P.2) 



pb "-1 

ij = I x J w(x)dx = >•. 
Ja fc=0 



w k x J k . 



The method proposed here implicitly assumes that it is easy to find the moments m_j. Once the weights and sample points are found, expression (9.P.2) forms a quadrature rule according to

\int_a^b w(x)f(x)\,dx \approx \sum_{k=0}^{n-1} w_k f(x_k) = G(n-1),

where now f(x) is essentially arbitrary. Define the vectors

w = [w_0\ w_1\ \cdots\ w_{n-2}\ w_{n-1}]^T \in \mathbf{R}^n, \quad m = [m_0\ m_1\ \cdots\ m_{2n-2}\ m_{2n-1}]^T \in \mathbf{R}^{2n}.



Find matrix A \in \mathbf{R}^{n\times 2n} such that A^T w = m. Matrix A turns out to be a rectangular Vandermonde matrix. If we knew the sample points x_k, then it would be possible to use A^T w = m to solve for the weights in w. (This is so even though the linear system is overdetermined.) Define the polynomial

p_n(x) = \prod_{j=0}^{n-1}(x - x_j) = \sum_{j=0}^{n} p_{n,j}x^j,

for which we see that p_{n,n} = 1. We observe that the zeros of p_n(x) happen to be the sample points x_k that we are looking for. The following suggestion makes it possible to find the sample points x_k by first finding the polynomial p_n(x). Using (in principle) Chapter 7 ideas, we can then find the roots of p_n(x) = 0, and so find the sample points. Consider the expression

m_{j+r}\,p_{n,j} = \sum_{k=0}^{n-1} w_k x_k^{j+r} p_{n,j},   (9.P.3)

where r = 0, 1, ..., n-2, n-1. Consider the sum of (9.P.3) over j = 0, 1, ..., n. Use this sum to show that

\sum_{j=0}^{n-1} m_{j+r}\,p_{n,j} = -m_{n+r}   (9.P.4)

for r = 0, 1, ..., n-1. This can be expressed in matrix form as Mp = q, where

M = [m_{i+j}]_{i,j=0,1,...,n-1} \in \mathbf{R}^{n\times n},

which is called a Hankel matrix, and where

p = [p_{n,0}\ p_{n,1}\ \cdots\ p_{n,n-1}]^T \in \mathbf{R}^n

and

q = -[m_n\ m_{n+1}\ \cdots\ m_{2n-2}\ m_{2n-1}]^T.

Formal proof that M^{-1} always exists is possible, but is omitted from consideration. [Comment: The reader is warned that the approach to Gaussian quadrature suggested here is not really practical due to the ill-conditioning of the matrices involved (unless n is small).]

9.16. In the previous problem let a = -1, b = 1, and w(x) = \sqrt{1 - x^2}. Develop a numerically reliable algorithm to compute the moments m_j = \int_{-1}^{1} w(x)x^j\,dx.






9.17. Derive Eq. (9.88). 

9.18. Repeat Example 9.4, except that the integral is now 



/ 



e x dx. 

1 



9.19. This problem is a preview of certain aspects of probability theory. It is an 
example of an application for numerical integration. An experiment produces 
a measurable quantity denoted x e R. The experiment is random in that the 
value of x is different from one experiment to the next, but the probability 
of x lying within a particular range of values is known to be 

P = P[a \le x \le b] = \int_a^b f_X(x)\,dx,

where

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right),

which is an instance of the Gaussian function mentioned in Chapter 3. This 
fact may be interpreted as follows. Suppose that we perform the experiment R 
times, where R is a "large" number. Then on average we expect a < x < b 
a total of PR times. The function fx(x) is called a Gaussian probability 
density function (pdf) with a mean of zero, and variance a 2 . Recall Fig. 
3.6, which shows the effect of changing a 2 . In Monte Carlo simulations of 
digital communications systems, or for that matter any other system where 
randomness is an important factor, it is necessary to write programs that gen- 
erate simulated random variables such as x. The MATLAB randn function 
will generate zero mean Gaussian random variables with variance a 2 — 1. 
For example, x — randn(l, AO will load N Gaussian random variables into 
the row vector x. Write a MATLAB routine to generate N = 1000 simu- 
lated Gaussian random variables (also called Gaussian variates) using randn. 
Count the number of times x satisfies — 1 < x < 1. Let this count be denoted 
C. Your routine must also use the trapezoidal rule to estimate the probability 
P — P[—l < x < 1] using erf(x) [defined in Eq. (3.107)]. The magnitude of 
the error in computing P must be <0.0001. This will involve using the trun- 
cation error bound (9.33) in Chapter 9 to estimate the number of trapezoids 
n that you need to do this job. You are to neglect rounding error effects here. 
Compute C — PN. Your program must print C and C to a file. Of course, 
we expect C « C. 

9.20. Develop a MATLAB routine to fill in the following table, which uses the central difference approximation to the first derivative of a function [i.e., \hat{f}_c^{(1)}(x)] to estimate f^{(1)}(x), where here

f(x) = \log_e x.

x      h = 10^{-4}   h = 10^{-5}   h = 10^{-6}   h = 10^{-7}   h = 10^{-8}   f^{(1)}(x)
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0

Explain the results you get.

9.21. Suppose that f(x) = e^{-x^2}, and recall (9.132).

(a) Sketch f^{(2)}(x).

(b) Let h = 1/10, and compute \hat{f}_f^{(1)}(1). Find an upper bound on |e_f(1)|. Compute |f^{(1)}(1) - \hat{f}_f^{(1)}(1)| = |e_f(1)|, and compare to the bound.

(c) Let h = 1/10, and compute \hat{f}_f^{(1)}(1/\sqrt{2}). Find an upper bound on |e_f(1/\sqrt{2})|. Compute |f^{(1)}(1/\sqrt{2}) - \hat{f}_f^{(1)}(1/\sqrt{2})| = |e_f(1/\sqrt{2})|, and compare to the bound.

9.22. Show that another approximation to f^{(1)}(x) is given by

f^{(1)}(x) \approx \frac{1}{12h}\left[8f(x+h) - 8f(x-h) - f(x+2h) + f(x-2h)\right].

Give an expression for the error involved in using this approximation.






10 Numerical Solution of Ordinary Differential Equations



10.1 INTRODUCTION 

In this chapter we consider numerical methods for the solution of ordinary differ- 
ential equations (ODEs). We recall that in such differential equations the function 
that we wish to solve for is in one independent variable. By contrast partial differ- 
ential equations (PDEs) involve solving for functions in two or more independent 
variables. The numerical solution of PDEs is a subject for a later chapter. 

With respect to the level of importance of the subject the reader knows that all 
dynamic systems with physical variables that change continuously over time (or 
space, or both) are described in terms of differential equations, and so form the 
basis for a substantial portion of engineering systems analysis, and design across 
all branches of engineering. The reader is also well aware of the fact that it is quite 
easy to arrive at differential equations that completely defy attempts at an analytical 
solution. This remains so in spite of the existence of quite advanced methods for 
analytical solution (e.g., symmetry methods that use esoteric ideas from Lie group 
theory [1]), and so the need for this chapter is not hard to justify. 

Where ODEs are concerned, differential equations arise within two broad cate- 
gories of problems: 

1. Initial-value problems (IVPs) 

2. Boundary-value problems (BVPs) 

In this chapter we shall restrict consideration to initial value problems. However, 
this is quite sufficient to accommodate much of electric/electronic circuit modeling, 
modeling the orbital dynamics of satellites around planetary bodies, and many other 
problems besides. 

A simple example of an ODE for which no general analytical theory of solution 
is known is the Duffing equation 



d x(t) dx(t) ■, 

m ^— + k — — + axit) + Sx 6 (t) = F cos(cot), (10.1) 

dt z dt 




where t > 0. Since we are concerned with initial-value problems we would need to know, at least implicitly, x(0) and dx(t)/dt|_{t=0}, which are the initial conditions. If it were the case that \delta = 0, then the solution of (10.1) is straightforward because it is a particular case of a second-order linear ODE with constant coefficients. Perhaps the best method for solving (10.1) in this case would be the Laplace transform method. However, the case where \delta \ne 0 immediately precludes a straightforward analytical solution of this kind. It is worth noting that the Duffing equation models a forced nonlinear mechanical spring, where the restoring force of the spring is accounted for by the terms \alpha x(t) + \delta x^3(t). Function x(t), which we wish to solve for, is the displacement at time t of some point on the spring (e.g., the point mass m at the free end) with respect to a suitable reference frame. Term k\,dx(t)/dt is the opposing friction, while F\cos(\omega t) is the periodic forcing function that drives the system. An example of a recent application for (10.1) is in the modeling of micromechanical filters/resonators [2].¹

At the outset we consider only first-order problems, specifically, how to solve (numerically)

\frac{dx}{dt} = f(x, t), \quad x_0 = x(0)   (10.2)

for t > 0. [From now on we shall often write x instead of x(t), and dx/dt instead of dx(t)/dt for brevity.] However, the example of (10.1) is a second-order problem. But it is possible to replace it with a system of equivalent first-order problems. There are many ways to do this in principle. One way is to define

y = \frac{dx}{dt}.   (10.3)

The functions x(t) and y(t) are examples of state variables. Since we are interpreting (10.1) as the model for a mechanical system wherein x(t) is displacement, it therefore follows that we may interpret y(t) as velocity. From the definition (10.3) we may use (10.1) to write

\frac{dy}{dt} = -\frac{\alpha}{m}x - \frac{k}{m}y - \frac{\delta}{m}x^3 + \frac{F}{m}\cos(\omega t)   (10.4a)

and

\frac{dx}{dt} = y.   (10.4b)

The clock circuit in many present-day digital systems is built around a quartz crystal. Such crystals 
do not integrate onto chips. Micromechanical resonators are intended to replace the crystal since such 
resonators can be integrated onto chips. This is in furtherance of the goal of more compact electronic 
systems. This applications example is a good illustration of the rapidly growing trend to integrate 
nonelectrical/nonelectronic systems onto chips. The implication of this is that it is now very necessary 
for the average electrical and/or computer engineer to become very knowledgeable about most other 
branches of engineering, and to possess a much broader and deeper knowledge of science (physics, 
chemistry, biology, etc.) and mathematics. 






Equations (10.4) have the forms

\frac{dx}{dt} = f(x, y, t),   (10.5a)

\frac{dy}{dt} = g(x, y, t)   (10.5b)



for the appropriate choices of / and g. The initial conditions for our example 
are x(0) and y(0) (initial position and initial velocity, respectively). These rep- 
resent a coupled system of first-order ODEs. Methods applicable to the solution 
of (10.2) are extendable to the larger problem of solving systems of first-order 
ODEs, and so in this way higher-order ODEs may be solved. Thus, we shall also 
consider the numerical solution of initial-value problems in systems of first-order 
ODEs. 
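For the Duffing equation, a MATLAB sketch of the right-hand side of the coupled system (10.4) might look as follows (the function name duffing_rhs and the argument ordering are our own illustrative choices):

function dz = duffing_rhs(z, t, m, k, alpha, delta, F, omega)
% State vector z = [x; y] with y = dx/dt, as in (10.3)-(10.5).
x  = z(1);  y = z(2);
dz = [ y;                                                             % dx/dt, (10.4b)
      -(alpha/m)*x - (k/m)*y - (delta/m)*x^3 + (F/m)*cos(omega*t) ];  % dy/dt, (10.4a)
end

Any method developed below for the scalar problem (10.2) then extends componentwise to such a vector-valued right-hand side.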

The next two examples illustrate how to arrive at coupled systems of first-order 
ODEs for electrical and electronic circuits. 

Example 10.1 Consider the linear electric circuit shown in Fig. 10.1. The input 
to the circuit is the voltage source v s (t), while we may regard the output as the 
voltage drop across capacitor C, denoted vc(t). The differential equation relating 
the input voltage v s (t) and the output voltage vc (t) is thus 



LiC 



d 3 v C (t) 
dt 3 



R\C 



d 2 v c (t) 
dt 2 






dvcif) 
dt 



R\ 
—vc(t) 

i>2 



dv s {t) 
dt 



(10.6) 



This third-order ODE may be obtained by mesh analysis of the circuit. The reader ought to attempt this derivation as an exercise. One way to replace (10.6) with a coupled system of first-order ODEs is to define the state variables x_k(t) (k \in \{0, 1, 2\}) according to

x_0(t) = v_C(t), \quad x_1(t) = \frac{dv_C(t)}{dt}, \quad x_2(t) = \frac{d^2v_C(t)}{dt^2}.   (10.7)

Substituting (10.7) into (10.6) yields

L_1C\frac{dx_2(t)}{dt} + R_1Cx_2(t) + \left(\frac{L_1}{L_2} + 1\right)x_1(t) + \frac{R_1}{L_2}x_0(t) = \frac{dv_s(t)}{dt}.





Figure 10.1 The linear electric circuit for Example 10.1.




If we recognize that

x_1(t) = \frac{dx_0(t)}{dt}, \quad x_2(t) = \frac{dx_1(t)}{dt},

then the complete system of first-order ODEs is

\frac{dx_0(t)}{dt} = x_1(t),

\frac{dx_1(t)}{dt} = x_2(t),   (10.8)

\frac{dx_2(t)}{dt} = \frac{1}{L_1C}\frac{dv_s(t)}{dt} - \frac{R_1}{L_1}x_2(t) - \frac{1}{L_1C}\left(\frac{L_1}{L_2} + 1\right)x_1(t) - \frac{R_1}{L_1L_2C}x_0(t).

In many ways this is not the best description for the circuit dynamics. 

Instead, we may find the matrix A \in \mathbf{R}^{3\times 3} and column vector b \in \mathbf{R}^3 such that

\frac{d}{dt}\begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} = A\begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} + b\,v_s(t).   (10.9)

This defines a new set of state equations in terms of the new state variables v_C(t), i_{L_1}(t), and i_{L_2}(t). The matrix A and vector b contain constants that depend only on the circuit parameters R_1, L_1, L_2, and C.

Equation (10.9) is often a better representation than (10.8) because 

1. There is no derivative of the forcing function v s (t) in (10.9) as there is in 
(10.8). 

2. There is a general (linear) theory of solution to (10.9) that is in practice easy 
to apply, and it is based on state-space methods. 

3. Inductor currents [i.e., iz,[(f), *'l, (?)] and capacitor voltages [i.e., vc(t)] can 
be readily measured in a laboratory setting while derivatives of these are 
not as easily measured. Thus, it is relatively easy to compare theoretical and 
numerical solutions to (10.9) with laboratory experimental results. 



Since

i_C(t) = C\frac{dv_C(t)}{dt}, \quad v_{L_1}(t) = L_1\frac{di_{L_1}(t)}{dt}, \quad v_{L_2}(t) = L_2\frac{di_{L_2}(t)}{dt},

on applying Kirchhoff's voltage law (KVL) and Kirchhoff's current law (KCL), we arrive at the relevant state equations as follows.






First,

v_s(t) = R_1 i_{L_1}(t) + L_1\frac{di_{L_1}(t)}{dt} + L_2\frac{di_{L_2}(t)}{dt}.   (10.10)

We see that v_C(t) = v_{L_2}(t), and so

v_C(t) = L_2\frac{di_{L_2}(t)}{dt},

giving

\frac{di_{L_2}(t)}{dt} = \frac{1}{L_2}v_C(t),   (10.11)

which is one of the required state equations. Since

i_{L_1}(t) = i_C(t) + i_{L_2}(t),

and so

i_{L_1}(t) = C\frac{dv_C(t)}{dt} + i_{L_2}(t),

we also have

\frac{dv_C(t)}{dt} = \frac{1}{C}i_{L_1}(t) - \frac{1}{C}i_{L_2}(t).   (10.12)

This is another of the required state equations. Substituting (10.11) into (10.10) gives the final state equation

\frac{di_{L_1}(t)}{dt} = -\frac{R_1}{L_1}i_{L_1}(t) - \frac{1}{L_1}v_C(t) + \frac{1}{L_1}v_s(t).   (10.13)

The state equations may be collected together in matrix form as required:

\frac{d}{dt}\begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} =
\underbrace{\begin{bmatrix} 0 & \frac{1}{C} & -\frac{1}{C} \\ -\frac{1}{L_1} & -\frac{R_1}{L_1} & 0 \\ \frac{1}{L_2} & 0 & 0 \end{bmatrix}}_{=A}
\begin{bmatrix} v_C(t) \\ i_{L_1}(t) \\ i_{L_2}(t) \end{bmatrix} +
\underbrace{\begin{bmatrix} 0 \\ \frac{1}{L_1} \\ 0 \end{bmatrix}}_{=b} v_s(t).


Example 10.2 Now let us consider a more complicated third-order nonlinear 
electronic circuit called the Colpitts oscillator [9]. The electronic circuit, and its 
electric circuit equivalent (model) appears in Fig. 10.2. This circuit is a popular 
analog signal generator with a long history (it used to be built using vacuum 
tubes). The device Q is a three-terminal device called an NPN-type bipolar junction 
transistor (BJT). The detailed theory of operation of BJTs is beyond the scope of 




Figure 10.2 The BJT Colpitts oscillator (a) and its electric circuit equivalent (b).



this book, but may be found in basic electronics texts [10]. For present purposes 
it is enough to know that Q may be represented with a nonlinear resistor R (the 
resistor enclosed in the box in Fig. 10.2b), and a current-controlled current source 
(CCCS), where 

i_C(t) = \beta_F i_B(t)   (10.14)

and

i_B(t) = \begin{cases} 0, & v_{BE}(t) < V_{TH} \\ \dfrac{v_{BE}(t) - V_{TH}}{R_{ON}}, & v_{BE}(t) \ge V_{TH}. \end{cases}   (10.15)



The current i_B(t) is called the base current of Q, and flows into the base terminal of the transistor as shown. From (10.15) we observe that if the base–emitter voltage v_{BE}(t) is below a threshold voltage V_{TH}, then there is no base current into Q (i.e., the device is cut off). The relationship between v_{BE}(t) - V_{TH} and i_B(t) obeys Ohm's law only when v_{BE}(t) is above threshold, in which case Q is active. In either case (10.14) says the collector current i_C(t) is directly proportional to i_B(t), and the constant of proportionality \beta_F is called the forward current gain of Q. Voltage v_{CE}(t) is the collector–emitter voltage of Q. Typically, V_{TH} \approx 0.75 V, \beta_F is about 100 (order of magnitude), and R_{ON} (on resistance of Q) is seldom more than hundreds of ohms in size.




From (10.15), i_B(t) is a nonlinear function of v_{BE}(t) that we may compactly write as i_B(t) = f_R(v_{BE}(t)). There are power supply voltages v_{CC}(t) and V_{EE}. Voltage V_{EE} < 0 is a constant, with a typical value V_{EE} = -5 V. Here we treat v_{CC}(t) as time-varying, but it is usually the case that (approximately) v_{CC}(t) = V_{CC}u(t), where

u(t) = \begin{cases} 1, & t \ge 0 \\ 0, & t < 0. \end{cases}   (10.16)

Function u(t) is the unit step function. To say that v_{CC}(t) = V_{CC}u(t) is to say that the circuit is turned on at time t = 0. Typically, V_{CC} = +5 V.

The reader may verify (again as a circuit analysis review exercise) that the state equations for the Colpitts oscillator are

C_2\frac{dv_{CE}(t)}{dt} = i_L(t) - \beta_F f_R(v_{BE}(t)),   (10.17a)

C_1\frac{dv_{BE}(t)}{dt} = -\frac{v_{BE}(t) + V_{EE}}{R_{EE}} - f_R(v_{BE}(t)) - i_L(t),   (10.17b)

L\frac{di_L(t)}{dt} = v_{CC}(t) - v_{CE}(t) + v_{BE}(t) - R_L i_L(t).   (10.17c)

Thus, the state variables are v_{BE}(t), v_{CE}(t), and i_L(t). As previously, this circuit description is not unique, but it is convenient.
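In the same spirit as the earlier sketch for the Duffing system, a minimal MATLAB sketch of the right-hand side of (10.17) is given below. The function name colpitts_rhs, the ordering of the state vector, and the use of a parameter struct p (with fields matching the symbols above) are our own choices, not the text's.

function dz = colpitts_rhs(z, t, p)
% State vector z = [vBE; vCE; iL] for the Colpitts oscillator (10.17).
vBE = z(1);  vCE = z(2);  iL = z(3);
iB  = max(vBE - p.VTH, 0)/p.RON;      % nonlinear resistor f_R, Eq. (10.15)
vCC = p.VCC*(t >= 0);                 % vCC(t) = VCC u(t), Eq. (10.16)
dz  = [ (-(vBE + p.VEE)/p.REE - iB - iL)/p.C1;   % dvBE/dt, (10.17b)
        (iL - p.betaF*iB)/p.C2;                  % dvCE/dt, (10.17a)
        (vCC - vCE + vBE - p.RL*iL)/p.L ];       % diL/dt,  (10.17c)
end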

Since numerical methods only provide approximate solutions to ODEs, we are 
naturally concerned about the accuracy of these approximations. There are also 
issues about the stability of proposed methods, and so this matter as well will be 
considered in this chapter. 



10.2 FIRST-ORDER ODEs 

Strictly speaking, before applying a numerical method to the solution of an ODE, 
we must be certain that a solution exists. We are also interested in whether the 
solution is unique. It is worth stating that in many cases, since ODEs are often 
derived from problems in the physical world, existence and uniqueness are often 
"obvious" for physical reasons. Notwithstanding this, a mathematical statement 
about existence and uniqueness is worthwhile. 

The following definition is needed by the succeeding theorem regarding the 
existence and uniqueness of solutions to first order ODE initial value problems. 

Definition 10.1: The Lipschitz Condition The function f(x, t) \in \mathbf{R} satisfies a Lipschitz condition in x for S \subset \mathbf{R}^2 iff there is an \alpha > 0 such that

|f(x, t) - f(y, t)| \le \alpha|x - y|

when (x, t), (y, t) \in S. The constant \alpha is called a Lipschitz constant for f(x, t).




It is apparent that if f(x, t) satisfies a Lipschitz condition, then it is smooth in some sense. The following theorem is about the existence and uniqueness of the solution to

\frac{dx}{dt} = f(x, t)   (10.18)

for 0 \le t \le t_f, with the initial condition x_0 = x(0). Time t = 0 is the initial time, or starting time. We call the constant t_f the final time. Essentially, we are only interested in the solution over a finite time interval. This constraint on the theory is not unreasonable since a computer can run for only a finite amount of time anyway. We also remark that interpreting the independent variable t as time is common practice, but not mandatory in general.

Theorem 10.1: Picard's Theorem Suppose that S = \{(x, t) \in \mathbf{R}^2 \mid 0 \le t \le t_f, -\infty < x < \infty\}, and that f(x, t) is continuous on S. If f satisfies a Lipschitz condition on set S in the variable x, then the initial-value problem (10.18) has a unique solution x = x(t) for all 0 \le t \le t_f.

Proof Omitted. We simply mention that it is based on the Banach fixed-point 
theorem (recall Theorem 7.3). 

We also mention that a proof of a somewhat different version of this theorem 
appears in Kreyszig [3, pp. 315-317]. It involves working with a contractive map- 
ping on a certain closed subspace of C(J), where / = [to — fi, to + /J] C R and 
C(J) is the metric space of continuous functions on /, where the metric is that 
of (1.8) in Chapter 1. It was remarked in Chapter 3 [see Eq. (3.8)] that this space 
is complete. Thus, any closed subspace of it is complete as well (a fact that was 
mentioned in Chapter 7 following Corollary 7.1). 

We may now consider specific numerical techniques. Define x_n = x(t_n) for n \in \mathbf{Z}^+. Usually we assume that t_0 = 0, and that

t_{n+1} = t_n + h,   (10.19)

where h > 0, and we call h the step size. From (10.18),

x_n^{(1)} = \frac{dx_n}{dt} = \frac{dx(t)}{dt}\Big|_{t=t_n} = f(x_n, t_n).   (10.20)

We may expand solution x(t) in a Taylor series about t = t_n. Therefore, since x(t_{n+1}) = x(t_n + h) = x_{n+1}, and with x_n^{(k)} = x^{(k)}(t_n),

x_{n+1} = x_n + hx_n^{(1)} + \frac{1}{2!}h^2x_n^{(2)} + \frac{1}{3!}h^3x_n^{(3)} + \cdots.   (10.21)

If we drop terms in x_n^{(k)} for k > 1, then (10.21) and (10.20) imply

x_{n+1} = x_n + hx_n^{(1)} = x_n + hf(x_n, t_n).




Since x_0 = x(t_0) = x(0) we may find (x_n) via

x_{n+1} = x_n + hf(x_n, t_n).   (10.22)

This is often called the Euler method (or Euler's method).² A more accurate description would be to call it the explicit form of Euler's method in order to distinguish it from the implicit form to be considered a little later on. The distinction matters in practice because implicit methods tend to be stable, whereas explicit methods are often prone to instability.
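A minimal MATLAB sketch of the explicit Euler iteration (10.22) follows (the function name euler_ivp and the test problem are our own choices). It returns the time grid and the sequence (x_n).

function [t, x] = euler_ivp(f, x0, h, nsteps)
% Explicit Euler method (10.22) for dx/dt = f(x,t), x(0) = x0.
t = h*(0:nsteps);
x = zeros(1, nsteps + 1);
x(1) = x0;
for n = 1:nsteps
  x(n+1) = x(n) + h*f(x(n), t(n));   % x_{n+1} = x_n + h f(x_n, t_n)
end
end

For instance, [t, x] = euler_ivp(@(x,t) -x, 1, 0.1, 50) approximates dx/dt = -x with x(0) = 1, whose exact solution is e^{-t}.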

A few general words about stability and accuracy are now appropriate. In what 
follows we will assume (unless otherwise noted) that the solution to a differen- 
tial equation remains bounded; that is, \x{t)\ < M < oo for all t > 0. However, 
approximations to this solution [e.g., (x n ) from (10.22)] will not necessarily remain 
bounded in the limit as n — > oo; that is, our numerical methods might not always be 
stable. Of course, in a situation like this the numerical solution will deviate greatly 
from the correct solution, and this is simply unacceptable. It therefore follows that 
we must find methods to test the stability of a proposed numerical solution. Some 
informal definitions relating to stability are 

Stable method: The numerical solution does not grow without bound (i.e., "blow 
up") with any choice of parameters such as step size. 

Unstable method: The numerical solution blows up with any choices of param- 
eters (such as step size). 

Conditionally stable method: For certain choices of parameters the numerical 
solution remains bounded. 

We mention that even if the Euler method is stable, its accuracy is low because only the first two terms in the Taylor series are retained. More specifically, we say that it is a first-order method because only the first power of h is retained in the Taylor approximation that gave rise to it. The omission of higher-order terms causes truncation errors. Since h² (and higher-power) terms are omitted, we also say that the truncation error per step (sometimes called the order of accuracy) is of order h². This is often written as O(h²). (Here we follow the terminology in Kreyszig [4, pp. 793-794].) In summary, we prefer methods that are both stable and accurate. It is important to emphasize that accuracy and stability are distinct concepts, and so must never be confused.

²Strictly speaking, in truncating the series in (10.21) we should write x̃_{n+1} = x_n + h f(x_n, t_n), so that Euler's method is

x̃_{n+1} = x̃_n + h f(x̃_n, t_n)

with x̃_0 = x_0. This is to emphasize that the method only generates approximations to x_n = x(t_n). However, this kind of notation is seldom applied. It is assumed that the reader knows that the numerical method only approximates x(t_n) even though the notation does not necessarily explicitly distinguish the exact value from the approximate.




We can say more about the accuracy of the Euler method: 

Theorem 10.2: For dx(t)/dt = f(x(t), t) let f(x(t), t) be Lipschitz continuous with constant α (Definition 10.1), and assume that x(t) ∈ C²[t_0, t_f] (t_f > t_0). If x_n ≈ x(t_n), where (t_n = t_0 + nh, and t_n ≤ t_f)

x_{n+1} = x_n + h f(x_n, t_n),  (x_0 ≈ x(t_0)),

then

|x(t_n) − x_n| ≤ e^{α(t_n − t_0)} |x(t_0) − x_0| + (1/2) h M [e^{α(t_n − t_0)} − 1]/α,    (10.23)

where M = max_{t ∈ [t_0, t_f]} |x^{(2)}(t)|.
Proof Euler's method is

x_{n+1} = x_n + h f(x_n, t_n)

and from Taylor's theorem

x(t_{n+1}) = x(t_n) + h x^{(1)}(t_n) + (1/2) h² x^{(2)}(ξ_n)

for some ξ_n ∈ [t_n, t_{n+1}]. Thus

x(t_{n+1}) − x_{n+1} = x(t_n) − x_n + h[x^{(1)}(t_n) − f(x_n, t_n)] + (1/2) h² x^{(2)}(ξ_n)
                     = x(t_n) − x_n + h[f(x(t_n), t_n) − f(x_n, t_n)] + (1/2) h² x^{(2)}(ξ_n),

so that

|x(t_{n+1}) − x_{n+1}| ≤ |x(t_n) − x_n| + αh|x(t_n) − x_n| + (1/2) h² |x^{(2)}(ξ_n)|.

For convenience we will let e_n = |x(t_n) − x_n|, λ = 1 + αh, and r_n = (1/2) h² |x^{(2)}(ξ_n)|, so that

e_{n+1} ≤ λ e_n + r_n.

It is easy to see that³

e_1 ≤ λ e_0 + r_0
e_2 ≤ λ e_1 + r_1 = λ² e_0 + λ r_0 + r_1
e_3 ≤ λ e_2 + r_2 = λ³ e_0 + λ² r_0 + λ r_1 + r_2
⋮
e_n ≤ λ^n e_0 + Σ_{j=0}^{n−1} λ^j r_{n−1−j}.

³More formally, we may use mathematical induction.




If M = max_{t ∈ [t_0, t_f]} |x^{(2)}(t)|, then r_{n−1−j} ≤ (1/2) h² M, and hence

e_n ≤ λ^n e_0 + (1/2) h² M Σ_{j=0}^{n−1} λ^j,

and since Σ_{j=0}^{n−1} λ^j = (λ^n − 1)/(λ − 1), and for x > −1 we have (1 + x)^n ≤ e^{nx}, thus

Σ_{j=0}^{n−1} λ^j = (λ^n − 1)/(αh) ≤ (e^{nαh} − 1)/(αh) = [e^{α(t_n − t_0)} − 1]/(αh).

Consequently,

e_n ≤ e^{α(t_n − t_0)} e_0 + (1/2) h M [e^{α(t_n − t_0)} − 1]/α,

which immediately yields the theorem statement.

We remark that e_0 = |x(t_0) − x_0| = 0 only if x_0 = x(t_0) exactly. Where quantization errors (recall Chapter 2) are concerned, this will seldom be the case. The second term in the bound of (10.23) may be large even if h is tiny. In other words, Euler's method is not necessarily very accurate. Certainly, from (10.23) we can say that e_n ∝ h.

As a brief digression, we also note that Theorem 10.2 needed the bound

(1 + x)^n ≤ e^{nx}  (x > −1).    (10.24)

We may easily establish (10.24) as follows. From the Maclaurin expansion (Chapter 3)

e^x = 1 + x + (1/2) x² e^{ξ}

(for some ξ ∈ [0, x]), so that

0 < 1 + x ≤ 1 + x + (1/2) x² e^{ξ} = e^x,

and because 1 + x > 0 (i.e., x > −1),

0 < (1 + x)^n ≤ e^{nx},

thus establishing (10.24).




The stability of any method may be analyzed in the following manner. First recall the Taylor series expansion of f(x, t) about the point (x_0, t_0):

f(x, t) = f(x_0, t_0) + (t − t_0) ∂f(x_0, t_0)/∂t + (x − x_0) ∂f(x_0, t_0)/∂x
          + (1/2!)[(t − t_0)² ∂²f(x_0, t_0)/∂t² + 2(t − t_0)(x − x_0) ∂²f(x_0, t_0)/∂t∂x + (x − x_0)² ∂²f(x_0, t_0)/∂x²] + ⋯.    (10.25)



If we retain only the linear terms of (10.25) and substitute these into (10.18), then we obtain

x^{(1)}(t) = dx/dt = f(x_0, t_0) + (t − t_0) ∂f(x_0, t_0)/∂t + (x − x_0) ∂f(x_0, t_0)/∂x
           = [∂f(x_0, t_0)/∂x] x + [∂f(x_0, t_0)/∂t] t + [f(x_0, t_0) − t_0 ∂f(x_0, t_0)/∂t − x_0 ∂f(x_0, t_0)/∂x],    (10.26)

where we identify λ = ∂f(x_0, t_0)/∂x, λ_1 = ∂f(x_0, t_0)/∂t, and λ_2 = f(x_0, t_0) − t_0 λ_1 − x_0 λ, so this has the general form (with λ, λ_1, and λ_2 as constants)

dx/dt = λx + λ_1 t + λ_2.    (10.27)



This linearized approximation to the original problem in (10.18) allows us to investigate the behavior of the solution in close proximity to (x_0, t_0). Equation (10.27) is often simplified still further by considering what is called the model problem

dx/dt = λx.    (10.28)

Thus, here we assume that λ_1 t + λ_2 in (10.27) can also be neglected. However, we do remark that (10.27) has the form

dx/dt + P(t)x = Q(t)x^n,    (10.29)

where n = 0, P(t) = −λ, and Q(t) = λ_1 t + λ_2. Thus, (10.27) is an instance of Bernoulli's differential equation [5, p. 62] for which a general method of solution exists. But for the purpose of stability analysis it turns out to be enough (usually, but not always) to consider only (10.28). Equation (10.28) is certainly simple in that its solution is

x(t) = x(0)e^{λt}.    (10.30)



If Euler's method is applied to (10.28), then

x_{n+1} = x_n + hλx_n = (1 + hλ)x_n.    (10.31)

Clearly, for n ∈ Z^+

x_n = (1 + hλ)^n x_0,    (10.32)

and we may avoid lim_{n→∞} |x_n| = ∞ if

|1 + hλ| < 1.

The model problem (10.28) with the solution (10.30) is stable⁴ only if λ < 0. Hence Euler's method is conditionally stable for

λ < 0 and h < 2/|λ|,    (10.33)

and is unstable if

|1 + hλ| > 1.    (10.34)

We see that depending on λ and h, the explicit Euler method might be unstable.

Now we consider the alternative implicit form of Euler's method. This method is also called the backward Euler method. Instead of (10.22) we use

x_{n+1} = x_n + h f(x_{n+1}, t_{n+1}).    (10.35)

It can be seen that a drawback of this method is the necessity to solve (10.35) for x_{n+1}. This is generally a nonlinear problem requiring the techniques of Chapter 7. However, a strength of the implicit method is enhanced stability. This may be easily seen as follows. Apply (10.35) to the model problem (10.28), yielding

x_{n+1} = x_n + λh x_{n+1}

or

x_{n+1} = [1/(1 − λh)] x_n.    (10.36)

Clearly

x_n = [1/(1 − λh)]^n x_0.    (10.37)

Since we must assume as before that λ < 0, the backward Euler method is stable for all h > 0. In this sense we may say that the backward Euler method is unconditionally stable. Thus, the implicit Euler method (10.35) is certainly more stable

⁴For stability we usually insist that λ < 0 as opposed to allowing λ = 0. This is to accommodate a concept called bounded-input, bounded-output (BIBO) stability. However, we do not consider the details of this matter here.




than the previous explicit form (10.22). However, the implicit and explicit forms 
have the same accuracy as both are first-order methods. 

Example 10.3 We wish to apply the implicit and explicit forms of the Euler method to

dx/dt + x = 0

for x(0) = 1. Of course, since this is a simple linear problem, we immediately know that

x(t) = x(0)e^{−t} = e^{−t}

for all t ≥ 0. Since we have f(x, t) = −x, we obtain

∂f(x, t)/∂x = −1,

implying that λ = −1. Thus, the explicit Euler method (10.22) gives

x_{n+1} = (1 − h)x_n    (10.38a)

for which 0 < h < 2 via (10.33). Similarly, (10.35) gives for the implicit method

x_{n+1} = [1/(1 + h)] x_n    (10.38b)

for which h > 0. In both (10.38a) and (10.38b) we have x_0 = 1.

Some typical simulation results for (10.38) appear in Fig. 10.3. Note the instability of the explicit method for the case where h > 2.
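The computations behind (10.38a) and (10.38b) are easy to reproduce. The following MATLAB sketch is not taken from Appendix 10.A; the step size and final time are illustrative choices.

% Example 10.3: dx/dt + x = 0, x(0) = 1, so x(t) = exp(-t).
h = 0.1; tf = 10; N = round(tf/h);
t = (0:N)*h;
xe = zeros(1, N+1); xi = zeros(1, N+1);   % explicit and implicit Euler iterates
xe(1) = 1; xi(1) = 1;
for n = 1:N
    xe(n+1) = (1 - h)*xe(n);      % explicit Euler, Eq. (10.38a); needs 0 < h < 2
    xi(n+1) = xi(n)/(1 + h);      % implicit Euler, Eq. (10.38b); any h > 0
end
plot(t, exp(-t), '-', t, xe, 'o-', t, xi, '+-');
legend('x(t) = e^{-t}', 'Explicit Euler', 'Implicit Euler');
xlabel('Time (t)'); ylabel('Amplitude');

Rerunning the same loop with h = 2.1 reproduces the growing oscillation of the explicit iterate seen in Fig. 10.3b, while the implicit iterate remains bounded.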

It is to be noted that a small step size h is desirable to achieve good accuracy. 
Yet a larger h is desirable to minimize the amount of computation involved in 
simulating the differential equation over the desired time interval. 

Example 10.4 Now consider the ODE

dx/dt + 2tx = t e^{−t²} x³

[5, pp. 62-63]. The exact solution to this differential equation is

x²(t) = 3/[e^{−t²} + c e^{2t²}],    (10.39)



Figure 10.3 Illustration of the implicit and explicit forms of the Euler method for the differential equation in Example 10.3 [plots of x(t) and the explicit and implicit Euler solutions versus time t: (a) h = 0.1; (b) h = 2.1]. In plot (a), h is small enough that the explicit method is stable. Here the implicit and explicit methods display similar accuracies. In plot (b), h is too big for the explicit method to be stable. Instability is indicated by the oscillatory behavior of the method and the growing amplitude of the oscillations with time. However, the implicit method remains stable, but because h is quite large, the accuracy is not very good.



for t > 0, where c = -^ — 1, and we assume that c > 0. Thus 



f(x,t) = t( 



-t l Ji 



Itx 



so 



df(x, t) .2 9 

= 3te ' x 2 -It. 
dx 



Consequently 



_ df(x ,t ) _ 8/(*o,0) _ n 

A — : — : — u. 



dx 



dx 



Via (10.33) we conclude that h > is possible for both forms of the Euler method. 
Since stability is therefore not a problem here, we choose to simulate the differential 



Figure 10.4 Illustration of the explicit Euler method for the differential equation in Example 10.4 [plots of x(t) and the explicit Euler solution: (a) h = 0.02; (b) h = 0.20]. Clearly, although stability is not a problem here, the accuracy of the method is better for smaller h.



Thus, from (10.22), we obtain

x_{n+1} = x_n + h[t_n e^{−t_n²} x_n³ − 2 t_n x_n].    (10.40)

We shall assume x_0 = 1 [initial condition x(0)]. Of course, t_n = hn for n = 0, 1, 2, ....

Figure 10.4 illustrates the exact solution from (10.39), and the simulated solution via (10.40) for h = 0.02 and h = 0.20. As expected, the result for h = 0.02 is more accurate.
accurate. 

The next example involves an ODE whose solution does not remain bounded 
over time. Nevertheless, our methods are applicable since we terminate the simulation after a finite time.



Example 10.5 Consider the ODE

dx/dt = t² − 2x/t

for t ≥ t_0 > 0 [5, pp. 60-61]. The exact solution is given by

x(t) = t³/5 + c/t².    (10.41)

The initial condition is x(t_0) = x_0 with t_0 > 0, and so

x_0 = t_0³/5 + c/t_0²,

implying that

c = t_0² x_0 − t_0⁵/5.

Since f(x, t) = t² − 2x/t, we have

λ = ∂f(x_0, t_0)/∂x = −2/t_0,

so via (10.33) for the explicit Euler method

0 < h < t_0.

However, this result is misleading here because x(t) is not bounded with time. In other words, it does not really apply here. From (10.22)

x_{n+1} = x_n + h[t_n² − 2x_n/t_n],    (10.42a)

where

t_n = t_0 + nh

for n ∈ Z^+. If we consider the implicit method, then via (10.35)

x_{n+1} = x_n + h[t_{n+1}² − 2x_{n+1}/t_{n+1}],

so

x_{n+1} = [x_n + h t_{n+1}²]/[1 + 2h/t_{n+1}],    (10.42b)

where t_{n+1} = t_0 + (n + 1)h for n ∈ Z^+.

Figure 10.5 illustrates the exact solution x(t) from (10.41) along with the simulated solutions from (10.42). This is for x_0 = 1 and t_0 = 0.05 with h = 0.025 (a), and h = 5 (b). It can be seen that the implicit method is more accurate for t close to t_0. Of course, this could be very significant since startup transients are often of



Figure 10.5 Illustration of the explicit and implicit forms of the Euler method for the ODE in Example 10.5 [plots of x(t) and the explicit and implicit Euler solutions: (a) h = 0.025; (b) h = 5]. In plot (a), note that the implicit form tracks the true solution x(t) better near t_0 = 0.05. In plot (b), note that both forms display a similar accuracy even though h is huge, provided t ≫ t_0.



interest in simulations of dynamic systems. It is noteworthy that both implicit and 
explicit forms simulate the true solution with similar accuracy for t much bigger 
than to even when h is large. 

This example is of a "stiff system" (see Section 10.6). 

Example 10.5 illustrates that stability and accuracy issues with respect to the 
numerical solution of ODE initial-value problems can be more subtle than our 
previous analysis would suggest. The reader is therefore duly cautioned about 
these matters. 

Recalling (3.71) from Chapter 3 (or recalling Theorem 10.2), the Taylor formula for x(t) about t = t_n is [recall x_n = x(t_n) for all n]

x_{n+1} = x̃_{n+1} + (1/2) h² x^{(2)}(ξ),  where x̃_{n+1} = x_n + h f(x_n, t_n),    (10.43)

for some ξ ∈ [t_n, t_{n+1}]. Thus, the truncation error per step in the Euler method is defined to be

ε_{n+1} = x_{n+1} − x̃_{n+1} = (1/2) h² x^{(2)}(ξ).    (10.44)



We may therefore state, as suggested earlier, that the truncation error per step is of order O(h²) because of this. The usefulness of (10.44) is somewhat limited in that it depends on the solution x(t) [or rather on the second derivative x^{(2)}(t)], which is, of course, something we seldom know in practice.

How may we obtain more accurate methods? More specifically, this means finding methods for which the truncation error per step is of order O(h^m) with m > 2.

One way to obtain improved accuracy is to try to improve the Euler method. More than one possibility for improvement exists. However, a popular approach is Heun's method. It is based on the following observation. A drawback of the Euler method in (10.22) is that f(x_n, t_n) is the derivative x^{(1)}(t) at the beginning of the interval [t_n, t_{n+1}], and yet x^{(1)}(t) varies over [t_n, t_{n+1}]. The implicit form of the Euler method works with f(x_{n+1}, t_{n+1}), namely, the derivative at t = t_{n+1}, and so has a similar defect. Therefore, intuitively, we may believe that we can improve the algorithm by replacing f(x_n, t_n) with the average derivative

(1/2)[f(x_n, t_n) + f(x_n + h f(x_n, t_n), t_n + h)].    (10.45)

This is approximately the average of x^{(1)}(t) at the endpoints of interval [t_n, t_{n+1}]. The approximation is due to the fact that

f(x_{n+1}, t_{n+1}) ≈ f(x_n + h f(x_n, t_n), t_n + h).    (10.46)

We see in (10.46) that we have employed (10.22) to approximate x_{n+1} according to x_{n+1} = x_n + h f(x_n, t_n) (explicit Euler method). Of course, t_{n+1} = t_n + h does not involve any approximation. Thus, Heun's method is defined by

x_{n+1} = x_n + (h/2)[f(x_n, t_n) + f(x_n + h f(x_n, t_n), t_n + h)].    (10.47)

This is intended to replace (10.22) and (10.35).
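A minimal MATLAB sketch of (10.47) follows (again not from Appendix 10.A); it assumes f is a function handle and t_0 = 0.

% Heun's method, Eq. (10.47), for dx/dt = f(x,t), x(0) = x0.
function [t, x] = heun(f, x0, h, tf)
  N = round(tf/h);
  t = (0:N)*h;
  x = zeros(1, N+1);
  x(1) = x0;
  for n = 1:N
      k1 = f(x(n), t(n));                 % slope at the left endpoint
      k2 = f(x(n) + h*k1, t(n) + h);      % slope at the Euler-predicted right endpoint
      x(n+1) = x(n) + (h/2)*(k1 + k2);    % average the two slopes
  end
end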

However, (10.47) is an explicit method, and so we may wonder about its stability. If we apply the model problem to (10.47), we obtain

x_{n+1} = [1 + λh + (1/2)h²λ²] x_n    (10.48)

for which

x_n = [1 + λh + (1/2)h²λ²]^n x_0.    (10.49)

For stability we must select h such that we avoid lim_{n→∞} |x_n| = ∞. For convenience, define

σ = 1 + λh + (1/2)h²λ²,    (10.50)

so this requirement implies that we must have |σ| < 1. A plot of (10.50) in terms of hλ appears in Fig. 10.6. This makes it easy to see that we must have

−2 < hλ < 0.    (10.51)


Figure 10.6 A plot of σ in terms of hλ [see Eq. (10.50)].



Since X < is assumed [again because we assume that x(t) is bounded] (10.51), 
implies that Heun's method is conditionally stable for the same conditions as in 
(10.33). Thus, the stability characteristics of the method are identical to those of 
the explicit Euler method, which is perhaps not such a surprize. 

What about the accuracy of Heun's method? Here we may see that there is an improvement. Again via (3.71) from Chapter 3

x_{n+1} = x_n + h x^{(1)}(t_n) + (1/2) h² x^{(2)}(t_n) + (1/6) h³ x^{(3)}(ξ)    (10.52)

for some ξ ∈ [t_n, t_{n+1}]. We may approximate x^{(2)}(t_n) using a forward difference operation

x^{(2)}(t_n) ≈ [x^{(1)}(t_{n+1}) − x^{(1)}(t_n)]/h,    (10.53)

so that (10.52) becomes [using x^{(1)}(t_{n+1}) = f(x_{n+1}, t_{n+1}), and x^{(1)}(t_n) = f(x_n, t_n)]

x_{n+1} = x_n + h f(x_n, t_n) + (1/2) h² [x^{(1)}(t_{n+1}) − x^{(1)}(t_n)]/h + (1/6) h³ x^{(3)}(ξ),    (10.54)

or upon simplifying this, we have

x_{n+1} = x_n + (h/2)[f(x_n, t_n) + f(x_{n+1}, t_{n+1})] + (1/6) h³ x^{(3)}(ξ).    (10.55)

Replacing f(x_{n+1}, t_{n+1}) in (10.55) with the approximation (10.46), and dropping the error term, we see that what remains is identical to (10.47), namely, Heun's method. Various approximations were made to arrive at this conclusion, but they are certainly reasonable, and so we claim that the truncation error per step for Heun's method is

ε_{n+1} = (1/6) h³ x^{(3)}(ξ),    (10.56)



where again ξ ∈ [t_n, t_{n+1}], and so this error is of the order O(h³). In other words, although Heun's method is based on modifying the explicit Euler method, the modification has led to a method with improved accuracy.

Example 10.6 Here we repeat Example 10.5 by applying Heun's method under the same conditions as for Fig. 10.5a. Thus, the differential equation is again

dx/dt = t² − 2x/t,

and again we choose x_0 = 1.0, t_0 = 0.05, with h = 0.025. The simulation result appears in Fig. 10.7.

It is very clear that Heun's method is distinctly more accurate than the Euler method, especially near t = t_0.



Heun's method may be viewed in a different light by considering the following. We may formally integrate (10.18) to arrive at x(t) according to

x(t) − x(t_n) = ∫_{t_n}^{t} f(x(τ), τ) dτ.    (10.57)



Figure 10.7 Comparison of the implicit Euler method with Heun's method for Example 10.6 [plots of x(t), the implicit Euler solution, and the Heun solution, both with h = 0.025]. We see that Heun's method is more accurate, especially for t near t_0 = 0.05.

So if t = t_{n+1}, then, from (10.57), we obtain

x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x(τ), τ) dτ

or

x_{n+1} = x_n + ∫_{t_n}^{t_{n+1}} f(x(τ), τ) dτ.    (10.58)

According to the trapezoidal rule for numerical integration (see Chapter 9, Section 9.2), we have

∫_{t_n}^{t_{n+1}} f(x(τ), τ) dτ ≈ (h/2)[f(x_n, t_n) + f(x_{n+1}, t_{n+1})]

(since h = t_{n+1} − t_n), and hence (10.58) becomes

x_{n+1} = x_n + (h/2)[f(x_n, t_n) + f(x_{n+1}, t_{n+1})],    (10.59)

which is just the first step in the derivation of Heun's method. We may certainly call (10.59) the trapezoidal method. Clearly, it is an implicit method since we must solve (10.59) for x_{n+1}. Equally clearly, Eq. (10.59) appears in (10.55), and so we immediately conclude that the trapezoidal method is a second-order method with a truncation error per step of order O(h³). Thus, we may regard Heun's method as the explicit form of the trapezoidal method. Or, equivalently, the trapezoidal method can be regarded as the implicit form of Heun's method. We mention that the trapezoidal method is unconditionally stable, but will not prove this here.
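For the model problem (10.28) the implicit relation (10.59) can be solved for x_{n+1} in closed form, which gives an easy way to observe the unconditional stability claim numerically. The sketch below applies only to the linear model problem; for a general nonlinear f, (10.59) would require a Chapter 7 root finder at every step. The parameter values are illustrative.

% Trapezoidal method, Eq. (10.59), applied to the model problem dx/dt = lambda*x.
% Solving x_{n+1} = x_n + (h/2)*(lambda*x_n + lambda*x_{n+1}) for x_{n+1} gives
% x_{n+1} = [(1 + h*lambda/2)/(1 - h*lambda/2)]*x_n.
lambda = -1; x0 = 1; h = 2.5; N = 40;        % deliberately large step size
g = (1 + h*lambda/2)/(1 - h*lambda/2);       % per-step growth factor; |g| < 1 for lambda < 0
x = zeros(1, N+1); x(1) = x0;
for n = 1:N
    x(n+1) = g*x(n);
end
disp(max(abs(x)));    % stays bounded by |x0| even though h*|lambda| exceeds 2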

The following example illustrates some more subtle issues relating to the stability 
of numerical solutions to ODE initial-value problems. It is an applications example 
from population dynamics, but the issues it raises are more broadly applicable. The 
example is taken from Beltrami [6]. 

Example 10.7 Suppose that x(t) is the total size of a population (people, insects, bacteria, etc.). The members of the population exist in a habitat that can realistically support not more than N individuals. This is the carrying capacity for the system. The population may grow at some rate that diminishes to zero as x(t) approaches N. But if the population size x(t) is much smaller than the carrying capacity, the rate of growth might be considered proportional to the present population size. Consequently, a model for population growth might be

dx(t)/dt = r x(t)[1 − x(t)/N].    (10.60)



This is called the logistic equation. By separation of variables this equation has solution

x(t) = N/(1 + c e^{−rt})    (10.61)


for t ≥ 0. As usual, c depends on the initial condition (initial population size) x(0). The exact solution in (10.61) is clearly "well behaved." Therefore, any numerical solution to (10.60) must also be well behaved.

Suppose that we attempt to simulate (10.60) numerically using the explicit Euler method. In this case we obtain [via (10.22)]

x_{n+1} = (1 + hr)x_n − (hr/N)x_n².    (10.62)

Suppose that we transform variables according to x_n = a y_n, in which case (10.62) can be rewritten as

y_{n+1} = (1 + hr)y_n[1 − (hra)/(N(1 + hr)) y_n].    (10.63)

If we select

a = N(1 + hr)/(hr),

then (10.63) becomes

y_{n+1} = λ y_n(1 − y_n),    (10.64)

where λ = 1 + hr. We recognize this as the logistic map from Chapter 7 [see Examples 7.3-7.5 and Eq. (7.83)]. From Section 7.6 in particular we recall that this map can become chaotic for certain choices of λ. In other words, chaotic instability is another possible failure mode for a numerical method that purports to solve ODEs.
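The effect is easy to reproduce with a few lines of MATLAB (a sketch, not from Appendix 10.A; r, N, x(0), and the two step sizes are illustrative choices). For hr small the iterate (10.62) settles smoothly toward the carrying capacity, while a large hr pushes λ = 1 + hr into the chaotic regime of the map (10.64).

% Explicit Euler applied to the logistic equation, Eq. (10.62).
r = 1; N = 1;                    % growth rate and carrying capacity
x0 = 0.01;                       % small initial population
for h = [0.1 2.7]                % hr = 0.1 (well behaved) versus hr = 2.7 (lambda = 3.7)
    M = 100;
    x = zeros(1, M+1); x(1) = x0;
    for n = 1:M
        x(n+1) = (1 + h*r)*x(n) - (h*r/N)*x(n)^2;   % Eq. (10.62)
    end
    fprintf('h = %g, last iterates: %s\n', h, mat2str(x(end-3:end), 3));
end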

The explicit form of the Euler method is actually an example of a first-order Runge-Kutta method. Similarly, Heun's method is an example of a second-order Runge-Kutta method. It is second-order essentially because the approximation involved retains the term in h² in the Taylor series expansion. We mention that methods of still higher order can be obtained simply by retaining more terms in the Taylor series expansion of (10.21). This is seldom done because to do so requires working with derivatives of increasing order, and this requires much computational effort. But this effort can be completely avoided by developing Runge-Kutta methods of higher order. We now outline a general approach for doing this. It is based on material from Rao [7].

All Runge-Kutta methods have a particular form that may be stated as

x_{n+1} = x_n + h a(x_n, t_n, h),    (10.65)

where a(x_n, t_n, h) is called the increment function. The increment function is selected to represent the average slope on the interval t ∈ [t_n, t_{n+1}]. In particular, the increment function has the form

a(x_n, t_n, h) = Σ_{j=1}^{m} c_j k_j,    (10.66)



where m is called the order of the Runge-Kutta method, the c_j are constants, and the coefficients k_j are obtained recursively according to

k_1 = f(x_n, t_n)
k_2 = f(x_n + a_{2,1}hk_1, t_n + p_2 h)
k_3 = f(x_n + a_{3,1}hk_1 + a_{3,2}hk_2, t_n + p_3 h)
⋮
k_m = f(x_n + Σ_{j=1}^{m−1} a_{m,j}hk_j, t_n + p_m h).    (10.67)

A more compact description of the Runge-Kutta methods is

x_{n+1} = x_n + h Σ_{j=1}^{m} c_j k_j,    (10.68a)

where

k_j = f(x_n + h Σ_{i=1}^{j−1} a_{j,i}k_i, t_n + p_j h).    (10.68b)

To specify a particular method requires selecting a variety of coefficients (c_j, a_{j,i}, etc.). How is this to be done?

We illustrate with examples. Suppose that m = 1. In this case

x_{n+1} = x_n + hc_1 k_1 = x_n + hc_1 f(x_n, t_n),    (10.69)

which gives (10.22) when c_1 = 1. Thus, we are justified in calling the explicit Euler method a first-order Runge-Kutta method.

Suppose that m = 2. In this case

x_{n+1} = x_n + hc_1 f(x_n, t_n) + hc_2 f(x_n + a_{2,1}hf(x_n, t_n), t_n + p_2 h).    (10.70)

We observe that if we choose

c_2 = c_1 = 1/2,  a_{2,1} = 1,  p_2 = 1,    (10.71)

then (10.70) reduces to

x_{n+1} = x_n + (1/2)h[f(x_n, t_n) + f(x_n + hf(x_n, t_n), t_n + h)],

which is Heun's method [compare this with (10.47)]. Thus, we are justified in calling Heun's method a second-order Runge-Kutta method. However, the coefficient



choices in (10.71) are not unique. Other choices will lead to other second-order Runge-Kutta methods. We may arrive at a systematic approach for creating alternatives as follows.

For convenience, as in (10.21), define x_n^{(k)} = x^{(k)}(t_n). Since m = 2, we will consider the Taylor expansion

x_{n+1} = x_n + h x_n^{(1)} + (1/2) h² x_n^{(2)} + O(h³)    (10.72)

[recall (10.52)] for which the term O(h³) simply denotes the higher-order terms. We recall that x^{(1)}(t) = f(x, t), so x_n^{(1)} = f(x_n, t_n), and via the chain rule

x^{(2)}(t) = ∂f/∂t + (∂f/∂x)(dx/dt) = ∂f/∂t + (∂f/∂x)f(x, t),    (10.73)

so (10.72) may be rewritten as

x_{n+1} = x_n + h f(x_n, t_n) + (1/2) h² ∂f(x_n, t_n)/∂t + (1/2) h² [∂f(x_n, t_n)/∂x] f(x_n, t_n) + O(h³).    (10.74)

Once again, the Runge-Kutta method for m = 2 is

x_{n+1} = x_n + hc_1 f(x_n, t_n) + hc_2 f(x_n + a_{2,1}hk_1, t_n + p_2 h).    (10.75)

Recalling (10.25), the Taylor expansion of f(x_n + a_{2,1}hk_1, t_n + p_2 h) is given by

f(x_n + a_{2,1}hk_1, t_n + p_2 h) = f(x_n, t_n) + a_{2,1}hf(x_n, t_n) ∂f(x_n, t_n)/∂x + p_2 h ∂f(x_n, t_n)/∂t + O(h²).    (10.76)

Now we substitute (10.76) into (10.75) to obtain

x_{n+1} = x_n + (c_1 + c_2)hf(x_n, t_n) + p_2 c_2 h² ∂f(x_n, t_n)/∂t + a_{2,1}c_2 h² [∂f(x_n, t_n)/∂x] f(x_n, t_n) + O(h³).    (10.77)

We may now compare like terms of (10.77) with those in (10.74) to conclude that the coefficients we seek satisfy the nonlinear system of equations

c_1 + c_2 = 1,  p_2 c_2 = 1/2,  a_{2,1}c_2 = 1/2.    (10.78)

To generate second-order Runge-Kutta methods, we are at liberty to choose the coefficients c_1, c_2, p_2, and a_{2,1} in any way we wish so long as the choice satisfies (10.78). Clearly, Heun's method is only one choice among many possible choices. We observe from (10.78) that we have four unknowns, but possess three



equations. Thus, we may select one parameter "arbitrarily" that will then determine the remaining ones. For example, we may select c_2, so then, from (10.78),

c_1 = 1 − c_2,  p_2 = 1/(2c_2),  and  a_{2,1} = 1/(2c_2).    (10.79)

Since one parameter is freely chosen, thus constraining all the rest, we say that second-order Runge-Kutta methods possess one degree of freedom.
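For instance, choosing c_2 = 1 in (10.79) gives c_1 = 0 and p_2 = a_{2,1} = 1/2, another member of the second-order family (often called the midpoint or modified Euler variant). A one-step MATLAB sketch of this choice follows; f is assumed to be a function handle.

% One step of the second-order Runge-Kutta member obtained from (10.79) with c2 = 1
% (so c1 = 0 and p2 = a21 = 1/2).
function xnp1 = rk2_midpoint_step(f, xn, tn, h)
  k1 = f(xn, tn);                        % slope at (x_n, t_n)
  k2 = f(xn + 0.5*h*k1, tn + 0.5*h);     % slope at the predicted midpoint
  xnp1 = xn + h*k2;                      % x_{n+1} = x_n + h*(c1*k1 + c2*k2) with c1 = 0, c2 = 1
end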

It should be clear that the previous procedure may be extended to systematically 
generate Runge-Kutta methods of higher order (i.e., m > 2). However, this requires 
a more complete description of the Taylor expansion for a function in two variables. 
This is stated as follows. 

Theorem 10.3: Taylor's Theorem Suppose that f(x, t), and all of its partial derivatives of order n + 1 or less, are defined and continuous on D = {(x, t) | a ≤ t ≤ b, c ≤ x ≤ d}. Let (x_0, t_0) ∈ D; then, for all (x, t) ∈ D, there is a point (η, ζ) ∈ D such that

f(x, t) = Σ_{r=0}^{n} (1/r!) Σ_{k=0}^{r} C(r, k) (t − t_0)^{r−k} (x − x_0)^k ∂^r f(x_0, t_0)/(∂t^{r−k} ∂x^k)
          + [1/(n + 1)!] Σ_{k=0}^{n+1} C(n + 1, k) (t − t_0)^{n+1−k} (x − x_0)^k ∂^{n+1} f(η, ζ)/(∂t^{n+1−k} ∂x^k),

where C(r, k) denotes the binomial coefficient, and for which (η, ζ) is on the line segment that joins the points (x_0, t_0) and (x, t).

Proof Omitted.

The reader can now easily imagine that any attempt to apply this approach for m > 2 will be quite tedious. Thus, we shall not do this here. We will restrict ourselves to stating a few facts. Applying the method to m = 3 (i.e., the generation of third-order Runge-Kutta methods) leads to algorithm coefficients satisfying six equations with eight unknowns. There will be two degrees of freedom as a consequence.

Fourth-order Runge-Kutta methods (i.e., m = 4) also possess two degrees of freedom, and also have a truncation error per step of O(h⁵). One such method (attributed to Runge) in common use is

x_{n+1} = x_n + (h/6)[k_1 + 2k_2 + 2k_3 + k_4],    (10.80)

where

k_1 = f(x_n, t_n)
k_2 = f(x_n + (1/2)hk_1, t_n + (1/2)h)
k_3 = f(x_n + (1/2)hk_2, t_n + (1/2)h)
k_4 = f(x_n + hk_3, t_n + h).    (10.81)

Of course, an infinite number of other fourth-order methods are possible.
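The classic method (10.80)-(10.81) translates directly into a short MATLAB function. This sketch is not the Appendix 10.A code; it assumes f is a function handle and allows an arbitrary starting time t_0.

% Fourth-order Runge-Kutta method, Eqs. (10.80) and (10.81).
function [t, x] = rk4(f, x0, t0, h, tf)
  N = round((tf - t0)/h);
  t = t0 + (0:N)*h;
  x = zeros(1, N+1);
  x(1) = x0;
  for n = 1:N
      k1 = f(x(n),            t(n));
      k2 = f(x(n) + 0.5*h*k1, t(n) + 0.5*h);
      k3 = f(x(n) + 0.5*h*k2, t(n) + 0.5*h);
      k4 = f(x(n) + h*k3,     t(n) + h);
      x(n+1) = x(n) + (h/6)*(k1 + 2*k2 + 2*k3 + k4);   % Eq. (10.80)
  end
end

For instance, the ODE of Example 10.8 could be run as [t, x] = rk4(@(x,t) t^2 - 2*x/t, 1.0, 0.05, 0.05, 2).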



We mention that Runge-Kutta methods are explicit methods, and so in principle carry some risk of instability. However, it turns out that the higher the order of the method, the lower the risk of stability problems. In particular, users of fourth-order methods typically experience few stability problems in practice. In fact, it can be shown that on applying (10.80) and (10.81) to the model problem (10.28), we obtain

x_n = [1 + hλ + (1/2)h²λ² + (1/6)h³λ³ + (1/24)h⁴λ⁴]^n x_0.    (10.82)

A plot of

σ = 1 + hλ + (1/2)h²λ² + (1/6)h³λ³ + (1/24)h⁴λ⁴    (10.83)

in terms of hλ appears in Fig. 10.8. To avoid lim_{n→∞} |x_n| = ∞, we must have |σ| < 1, and so it turns out that

−2.785 < hλ < 0

(see Table 9.11 on p. 685 of Rao [7]). This is in agreement with Fig. 10.8. Thus

λ < 0,  h < 2.785/|λ|.    (10.84)

This represents an improvement over (10.33).

Example 10.8 Once again we repeat Example 10.5, for which

dx/dt = t² − 2x/t,

with x_0 = 1.0 for t_0 = 0.05, but here instead our step size is now h = 0.05. Additionally, our comparison is between Heun's method and the fourth-order Runge-Kutta method defined by Eqs. (10.80) and (10.81).



b 1 




Figure 10.8 A plot of a in terms of hX [see Eq. (10.83)]. 



TLFeBOOK 



442 



NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS 



— x(t) 

-o- 4th-order Runge-Kutta (h = 0.05) 
Heun (h = 0.05) 




Time (f) 

Figure 10.9 Comparison of Heun's method (a second-order Runge-Kutta method) with 
a fourth-order Runge-Kutta method [Eqs. (10.80) and (10.81)]. This is for the differential 
equation in Example 10.8. 



The simulated solutions based on these methods appear in Fig. 10.9. As expected, the fourth-order method is much more accurate than Heun's method. The plot in Fig. 10.9 was generated using MATLAB, and the code for this appears in Appendix 10.A as an example. (Of course, previous plots were also produced by MATLAB codes similar to that in Appendix 10.A.)



10.3 SYSTEMS OF FIRST-ORDER ODEs 

The methods of Section 10.2 may be extended to handle systems of first-order ODEs where the number of ODEs in the system is arbitrary (but finite). However, we will consider only systems of two first-order ODEs here. Specifically, we wish to solve (numerically)

dx/dt = f(x, y, t),    (10.85a)
dy/dt = g(x, y, t),    (10.85b)

where the initial condition is x_0 = x(t_0) and y_0 = y(t_0). This is sufficient, for example, to simulate the Duffing equation mentioned in Section 10.1.



As in Section 10.2, we will begin with Euler methods. Therefore, following (10.21) we may Taylor-expand x(t) and y(t) about the sampling time t = t_n. As before, for convenience we may define x_n^{(k)} = x^{(k)}(t_n), y_n^{(k)} = y^{(k)}(t_n). The relevant expansions are given by

x_{n+1} = x_n + h x_n^{(1)} + (1/2) h² x_n^{(2)} + ⋯,    (10.86a)
y_{n+1} = y_n + h y_n^{(1)} + (1/2) h² y_n^{(2)} + ⋯.    (10.86b)

The explicit Euler method follows by retaining the first two terms in each expansion in (10.86). Thus, the Euler method in this case is

x_{n+1} = x_n + h f(x_n, y_n, t_n),    (10.87a)
y_{n+1} = y_n + h g(x_n, y_n, t_n),    (10.87b)

where we have used the fact that x_n^{(1)} = f(x_n, y_n, t_n) and y_n^{(1)} = g(x_n, y_n, t_n). As we might expect, the implicit form of the Euler method is

x_{n+1} = x_n + h f(x_{n+1}, y_{n+1}, t_{n+1}),    (10.88a)
y_{n+1} = y_n + h g(x_{n+1}, y_{n+1}, t_{n+1}).    (10.88b)

Of course, to employ (10.88) will generally involve solving a nonlinear system of equations for x_{n+1} and y_{n+1}, necessitating the use of Chapter 7 techniques. As before, we refer to parameter h as the step size.

The accuracy of the explicit and implicit Euler methods for systems is the same as for individual equations; specifically, it is O(h²). However, stability analysis is more involved. Matrix methods simply cannot be avoided. This is demonstrated as follows.

The model problem for a single first-order ODE was Eq. (10.28). For a coupled system of two first-order ODEs as in (10.85), the model problem is now

dx/dt = a_{00} x + a_{01} y,    (10.89a)
dy/dt = a_{10} x + a_{11} y.    (10.89b)

Here the a_{ij} are real-valued constants. We remark that this may be written in more compact matrix form

dx̄/dt = Ax̄,    (10.90)

where x̄ = x̄(t) = [x(t) y(t)]^T and dx̄/dt = [dx(t)/dt  dy(t)/dt]^T and, of course,

A = [ a_{00}  a_{01} ; a_{10}  a_{11} ].    (10.91)
Powerful claims are possible using matrix methods. For example, it can be argued that if x̄(t) ∈ R^N (so that A ∈ R^{N×N}), then dx̄(t)/dt = Ax̄(t) has solution

x̄(t) = e^{At} x̄(0)    (10.92)

for t ≥ 0; that is, (10.92) for N = 2 is the general solution to (10.90) [and hence to (10.89)].⁵

The constants in A are related to f(x, y, t) and g(x, y, t) in the following manner. Recall (10.25). If we retain only the linear terms in the Taylor expansions of f and g around the point (x_0, y_0, t_0), then

f(x, y, t) ≈ f(x_0, y_0, t_0) + (x − x_0) ∂f(x_0, y_0, t_0)/∂x + (y − y_0) ∂f(x_0, y_0, t_0)/∂y + (t − t_0) ∂f(x_0, y_0, t_0)/∂t,    (10.93a)

g(x, y, t) ≈ g(x_0, y_0, t_0) + (x − x_0) ∂g(x_0, y_0, t_0)/∂x + (y − y_0) ∂g(x_0, y_0, t_0)/∂y + (t − t_0) ∂g(x_0, y_0, t_0)/∂t.    (10.93b)

As a consequence

A = [ ∂f(x_0, y_0, t_0)/∂x   ∂f(x_0, y_0, t_0)/∂y ; ∂g(x_0, y_0, t_0)/∂x   ∂g(x_0, y_0, t_0)/∂y ].    (10.94)



At this point we may apply the explicit Euler method (10.87) to the model problem 
(10.89), which results in 



x n +i 

Jn+l 



1 + haoo haoi 
ha\o 1 + han 



Xn 

y n 



(10.95) 



⁵Yes, as surprising as it seems, although At is a matrix, exp(At) makes sense as an operation. In fact, for example, with x̄(t) = [x_0(t) ⋯ x_{n−1}(t)]^T, the system of first-order ODEs

dx̄(t)/dt = Ax̄(t) + b̄y(t)

(A ∈ R^{n×n}, b̄ ∈ R^n, and y(t) ∈ R) has the general solution

x̄(t) = e^{At}x̄(0^−) + ∫_{0^−}^{t} e^{A(t−τ)} b̄ y(τ) dτ.

The integral in this solution is an example of a convolution integral.



An alternative form for this is

x̄_{n+1} = (I + hA)x̄_n.    (10.96)

Here I is a 2 × 2 identity matrix and x̄_n = [x_n y_n]^T. With x̄_0 = [x_0 y_0]^T, we may immediately claim that

x̄_n = (I + hA)^n x̄_0,    (10.97)

where n ∈ Z^+. We observe that this includes (10.32) as a special case. Naturally, we must select step size h to avoid instability; that is, we are forced to select h to prevent lim_{n→∞} ||x̄_n|| = ∞. In principle, the choice of norm is arbitrary, but 2-norms are often chosen. We recall that there is a nonsingular matrix T (matrix of eigenvectors) such that

T^{−1}[I + hA]T = Λ,    (10.98)

where Λ is the matrix of eigenvalues. We will assume that

Λ = [ λ_0  0 ; 0  λ_1 ].    (10.99)

In other words, we assume I + hA is diagonalizable. This is not necessarily always the case, but is an acceptable assumption for present purposes. Since from (10.98) we have I + hA = TΛT^{−1}, (10.96) becomes

x̄_{n+1} = TΛT^{−1}x̄_n,

or

T^{−1}x̄_{n+1} = ΛT^{−1}x̄_n.    (10.100)

Let ȳ_n = T^{−1}x̄_n, so therefore (10.100) becomes

ȳ_{n+1} = Λȳ_n.    (10.101)

In any norm lim_{n→∞} ||ȳ_n|| ≠ ∞, provided |λ_k| ≤ 1 for all k = 0, 1, ..., N − 1 (Λ ∈ R^{N×N}). Consequently, lim_{n→∞} ||x̄_n|| ≠ ∞ too (because ||x̄_n|| = ||Tȳ_n|| ≤ ||T|| ||ȳ_n|| and ||T|| is finite). We conclude that h is an acceptable step size, provided the eigenvalues of I + hA do not possess a magnitude greater than unity. Note that in practice we normally insist that h result in |λ_k| < 1 for all k.
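This criterion is straightforward to check numerically before committing to a step size. The MATLAB sketch below simply scans the eigenvalue magnitudes of I + hA over a range of h; the matrix used is the A of Example 10.9 (considered shortly), and the range of h is an illustrative choice.

% Scan step sizes and record max |eigenvalue| of I + h*A (explicit Euler stability check).
A = [-2 1/4; -3 0];                 % the A matrix of Example 10.9 (considered shortly)
hvals = 0.05:0.05:1.6;
maxmag = zeros(size(hvals));
for i = 1:length(hvals)
    maxmag(i) = max(abs(eig(eye(2) + hvals(i)*A)));
end
plot(hvals, maxmag); xlabel('Step size (h)'); ylabel('max |\lambda_k(I + hA)|');
% Step sizes for which maxmag < 1 are acceptable for the explicit Euler method.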

We may apply the previous stability analysis to the implicit Euler method. Specifically, apply (10.88) to model problem (10.89), giving

x_{n+1} = x_n + h[a_{00}x_{n+1} + a_{01}y_{n+1}]
y_{n+1} = y_n + h[a_{10}x_{n+1} + a_{11}y_{n+1}],



which in matrix form becomes

[ x_{n+1} ; y_{n+1} ] = [ x_n ; y_n ] + h [ a_{00}  a_{01} ; a_{10}  a_{11} ] [ x_{n+1} ; y_{n+1} ],

or more compactly as

x̄_{n+1} = x̄_n + hAx̄_{n+1}.    (10.102)

Consequently, for n ∈ Z^+

x̄_n = ([I − hA]^{−1})^n x̄_0.    (10.103)

For convenience we can define B = [I − hA]^{−1}. Superficially, (10.103) seems to have the same form as (10.97). We might be led therefore to believe (falsely) that the implicit method can be unstable, too. However, we may assume that there exists a nonsingular matrix V such that (if A ∈ R^{N×N})

V^{−1}AV = Γ,    (10.104)

where Γ = diag(γ_0, γ_1, ..., γ_{N−1}), which is the diagonal matrix of the eigenvalues of A. (Once again, it is not necessarily the case that A is always diagonalizable, but the assumption is reasonable for our present purposes.) Immediately

[I − hA]^{−1} = [I − hVΓV^{−1}]^{−1} = (V[V^{−1}V − hΓ]V^{−1})^{−1},

so that

[I − hA]^{−1} = V[I − hΓ]^{−1}V^{−1}.    (10.105)

Consequently x̄_{n+1} = [I − hA]^{−1}x̄_n becomes

x̄_{n+1} = V[I − hΓ]^{−1}V^{−1}x̄_n.    (10.106)

Define ȳ_n = V^{−1}x̄_n, and so (10.106) becomes

ȳ_{n+1} = [I − hΓ]^{−1}ȳ_n.    (10.107)

Because I − hΓ is a diagonal matrix, a typical main diagonal element of [I − hΓ]^{−1} is σ_k = 1/(1 − hγ_k). It is a fact (which we will not prove here) that the model problem in the general case is stable provided the eigenvalues of A all possess negative-valued real parts.⁶ Thus, provided Re(γ_k) < 0 for all k, we are assured that |σ_k| < 1 for all k, and hence lim_{n→∞} ||ȳ_n|| = 0. Thus, lim_{n→∞} ||x̄_n|| = 0, too, and so we conclude that the implicit form of the Euler method is unconditionally stable. Thus, if the model problem is stable, we may select any h > 0.

The eigenvalues of A may be complex-valued, and so it is the real parts of these that truly determine 
system stability. 




The following example is of a linear system for which a mathematically exact 
solution can be found. [In fact, the solution is given by (10.92).] 

Example 10.9 Consider the ODE system

dx/dt = −2x + (1/4)y,    (10.108a)
dy/dt = −3x.    (10.108b)

The initial condition is x_0 = x(0) = 1, y_0 = y(0) = −1. From (10.94) we see that

A = [ −2  1/4 ; −3  0 ].

The eigenvalues of A are γ_0 = −1/2 and γ_1 = −3/2. These eigenvalues are both negative, and so the solution to (10.108) happens to be stable. In fact, the exact solution can be shown to be

x(t) = −(3/4)e^{−t/2} + (7/4)e^{−3t/2},    (10.109a)
y(t) = −(9/2)e^{−t/2} + (7/2)e^{−3t/2}    (10.109b)

for t ≥ 0. Note that the eigenvalues of A appear in the exponents of the exponentials in (10.109). This is not a coincidence. The explicit Euler method has the iterations

x_{n+1} = x_n + h[−2x_n + (1/4)y_n],    (10.110a)
y_{n+1} = y_n − 3hx_n.    (10.110b)

Simulation results are shown in Figs. 10.10 and 10.11 for h = 0.1 and h = 1.4, respectively. This involves comparing (10.110a,b) with the exact solution (10.109a,b). Figure 10.10b shows a plot of the eigenvalues of I + hA for various step sizes. We see from this that choosing h = 1.4 must result in an unstable simulation. This is confirmed by the result in Fig. 10.11. For comparison purposes, the eigenvalues of I + hA and of [I − hA]^{−1} are plotted in Fig. 10.12. This shows that, at least in this particular case, the implicit method is more stable than the explicit method.



Example 10.10 Recall the Duffing equation of Section 10.1. Also, recall the 
fact that this ODE can be rewritten in the form of (10.85a,b), and this was done in 
Eq. (10.4a,b). 

Figures 10.13 and 10.14 show the result of simulating the Duffing equation 
using the explicit Euler method for the model parameters 

F = 0.5, «=1, m=\, a = \, 5 = 0.1, k = 0.05. 



Figure 10.10 Simulation results for h = 0.1 [plot (a): x(t), y(t), and the Euler method solutions with h = 0.1; plot (b): the eigenvalues of I + hA versus step size h]. Both plots were obtained by applying the explicit form of the Euler method to the ODE system of Example 10.9. Clearly, the simulation is stable for h = 0.1.



Thus (10.1) is now

d²x/dt² = 0.5 cos(t) − 0.05 dx/dt − [x + 0.1x³].

We use initial condition x(0) = y(0) = 0. The driving function (applied force) 0.5 cos(t) is being opposed by the restoring force of the spring (terms in square brackets) and friction (first-derivative term). Therefore, on physical grounds, we do not expect the solution x(t) to grow without bound as t → ∞. Thus, the simulated solution to this problem must be stable, too.

We mention that an analytical solution to the differential equation that we are simulating is not presently known.

From (10.94) for our Duffing system example we have

A = [ 0   1 ; −(a/m) − (3aδ/m)x_0²   −k/m ].
Example 10.11 According to Hydon [1, p. 61], the second-order ODE

d²x/dt² = (1/x)(dx/dt)² − (1/x − x)(dx/dt)    (10.111)



Figure 10.11 This is the result of applying the explicit form of the Euler method to the ODE system of Example 10.9 [plots of x(t), y(t), and the Euler solutions with h = 1.4]. Clearly, the simulation is not stable for h = 1.4. This is predicted by the eigenvalue plot in Fig. 10.10b, which shows that one of the eigenvalues of I + hA has a magnitude exceeding unity for this choice of h.

Figure 10.12 Plots of the eigenvalues of I + hA (a), which determine the stability of the explicit Euler method, and the eigenvalues of [I − hA]^{−1} (b), which determine the stability of the implicit Euler method, as functions of the step size h. This applies for the ODE system of Example 10.9.



Figure 10.13 (a) Explicit Euler method simulation of the Duffing equation (x_n and y_n for h = 0.020); (b) magnitude of the eigenvalues of I + hA versus step size h. Both plots were obtained by applying the explicit form of the Euler method to the ODE system of Example 10.10, which is the Duffing equation expressed as a coupled system of first-order ODEs. The simulation is apparently stable for h = 0.02. This is in agreement with the prediction based on the eigenvalues of I + hA [plot (b)], which have a magnitude of less than unity for this choice of h.



has the exact solution

x(t) = c_1 − √(c_1² − 1) tanh(√(c_1² − 1)(t + c_2))   for c_1² > 1,
x(t) = c_1 − (t + c_2)^{−1}                           for c_1² = 1,    (10.112)
x(t) = c_1 + √(1 − c_1²) tan(√(1 − c_1²)(t + c_2))    for c_1² < 1.
The ODE in (10.111) can be rewritten as the system of first-order ODEs

dx/dt = y,    (10.113a)
dy/dt = y²/x − (1/x − x)y.    (10.113b)


Figure 10.14 (a) Explicit Euler method simulation of the Duffing equation; (b) magnitude of the eigenvalues of I + hA versus step size h. Both plots were obtained by applying the explicit form of the Euler method to the ODE system of Example 10.10, which is the Duffing equation expressed as a coupled system of first-order ODEs. The simulation is not stable for h = 0.055. This is in agreement with the prediction based on the eigenvalues of I + hA [plot (b)], which have a magnitude exceeding unity for this choice of h.

The initial condition is x_0 = x(0), and y(0) = dx(t)/dt |_{t=0}. From (10.94) we have

A = [ 0   1 ; −y_0²/x_0² + y_0(1/x_0² + 1)   2y_0/x_0 − (1/x_0 − x_0) ].    (10.114)

Using (10.113a), we may obtain y(t) from (10.112). For example, let us consider simulating the case c_1² = 1. Thus, in this case

y(t) = 1/(t + c_2)².    (10.115)

For the choice c_1² = 1, we have

x_0 = c_1 − 1/c_2,    (10.116a)
y_0 = 1/c_2².    (10.116b)



Figure 10.15 Plot (a) shows x(t), y(t), and the explicit Euler solutions x_n, y_n for h = 0.020; plot (b) shows the magnitude of the eigenvalues of I + hA versus step size h. Both plots show the simulation results of applying the explicit Euler method to the ODE system in Example 10.11. The simulation is (as expected) stable for h = 0.02. Clearly, the simulated result agrees well with the exact solution.



If we select c_1 = 1, then

x_0 = 1 − 1/c_2,  y_0 = (1 − x_0)².    (10.117)

Let us assume x_0 = −1, and so y_0 = 4. The result of applying the explicit Euler method to system (10.113) with these conditions is shown in Figs. 10.15 and 10.16. The magnitudes of the eigenvalues of I + hA are displayed in plots (b) of both figures. We see that for Fig. 10.15, h = 0.02 and a stable simulation is the result, while for Fig. 10.16, we have h = 0.35, for which the simulation is unstable. This certainly agrees with the stability predictions based on finding the eigenvalues of matrix I + hA.

Examples 10.10 and 10.11 illustrate just how easy it is to arrive at differential 
equations that are not so simple to simulate in a stable manner with low-order 
explicit methods. It is possible to select an h that is "small" in some sense, yet 
not small enough for stability. The cubic nonlinearity in the Duffing model makes 
the implementation of the implicit form of Euler's method in this problem quite



Figure 10.16 Plot (a) shows x(t), y(t), and the explicit Euler solutions; plot (b) shows the magnitude of the eigenvalues of I + hA versus step size h. Both plots show the simulation results of applying the explicit Euler method to the ODE system in Example 10.11. The simulation is (as expected) not stable for h = 0.35. Instability is confirmed by the fact that the simulated result deviates greatly from the exact solution when t is sufficiently large.



unattractive. So, a better approach to simulating the Duffing equation is with a 
higher-order explicit method. 

For example, Heun's method for (10.85) may be stated as 



x„+i = x„ + jh[f(x n ,y n ,t„) + f(x n +ki,y„ + h,t„ + h)], 



y n +\ — y n + 2 h \-8(Xn,yn,t n ) + g(Xn +kl,y„ + h,t„ + h)], 



where 



k\ = hf(x„, y„, t n ), h = hg(x„, y„, t„). 



(10.118a) 
(10.118b) 

(10.119) 



Also, for example, Chapter 36 of Branson [8] contains a summary of higher-order 
methods that may be applied to (10.85). 
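A sketch of (10.118)-(10.119) specialized to the Duffing system of Example 10.10 follows. This is not the Appendix 10.A code; the right-hand side g is written so as to reproduce the numerical form quoted in Example 10.10, and the step size and final time are illustrative choices.

% Heun's method, Eqs. (10.118)-(10.119), for the Duffing system of Example 10.10.
F = 0.5; w = 1; m = 1; a = 1; d = 0.1; k = 0.05;
f = @(x, y, t) y;                                               % dx/dt = y
g = @(x, y, t) (F/m)*cos(w*t) - (k/m)*y - (a/m)*(x + d*x^3);    % dy/dt
h = 0.005; tf = 100; N = round(tf/h);
t = (0:N)*h;
x = zeros(1, N+1); y = zeros(1, N+1);       % initial condition x(0) = y(0) = 0
for n = 1:N
    k1 = h*f(x(n), y(n), t(n));
    k2 = h*g(x(n), y(n), t(n));
    x(n+1) = x(n) + (h/2)*(f(x(n), y(n), t(n)) + f(x(n) + k1, y(n) + k2, t(n) + h));
    y(n+1) = y(n) + (h/2)*(g(x(n), y(n), t(n)) + g(x(n) + k1, y(n) + k2, t(n) + h));
end
plot(x, y); xlabel('x_n'); ylabel('y_n');   % phase portrait, as in Fig. 10.18b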

Example 10.12 Recall the Duffing equation simulation in Example 10.10. Figure 10.17 illustrates the simulation of the Duffing equation using both the explicit Euler method and Heun's method for a small h (i.e., h = 0.005 in both cases).



Figure 10.17 Comparison of the explicit Euler (a) and Heun (b) method simulations of the ODE in Example 10.12, which is the Duffing equation. Here the step size h is small enough that the two methods give similar results.



At this point we note that there are other ways to display the results of numerical 
solutions to ODEs that can lead to further insights into the behavior of the dynamic 
system that is modeled by those ODEs. Figure 10.18 illustrates the phase portrait 
of the Duffing system. This is obtained by plotting the points (x n ,y n ) on the 
Cartesian plane, yielding an approximate plot of (x(t),y(t)). The resulting curve 
is the trajectory, or orbit for the system. Periodicity of the system's response is 
indicated by curves that encircle a point of equilibrium, which in this case would 
be the center of the Cartesian plane [i.e., point (0, 0)]. The trajectory is tending to 
an approximately ellipse-shaped closed curve indicative of approximately simple 
harmonic motion. 

The results in Fig. 10.18 are based on the parameters given in Example 10.10. However, Fig. 10.19 shows what happens when the system parameters become

F = 0.3,  ω = 1,  m = 1,  a = −1,  δ = 1,  k = 0.22.    (10.120)



The phase portrait displays a more complicated periodicity than what appears in 
Fig. 10.18. The figure is similar to Fig. 2.2.5 in Guckenheimer and Holmes [11]. 



Figure 10.18 (a) The result of applying Heun's method to obtain the numerical solution of the Duffing system specified in Example 10.10; (b) the phase portrait for the system obtained by plotting the points (x_n, y_n) [from plot (a)] on the Cartesian plane, thus yielding an approximate plot of (x(t), y(t)).



As in the cases of the explicit and implicit Euler methods, we may obtain a theory of stability for Heun's method. As before, the approach is to apply the model problem (10.90) to (10.118) and (10.119). As an exercise, the reader should show that this yields

x̄_{n+1} = [I + hA + (1/2)h²A²] x̄_n,    (10.121)

where I is the 2 × 2 identity matrix, A is obtained by using (10.94), and, of course, x̄_n = [x_n y_n]^T. [The similarity between (10.121) and (10.48) is no coincidence.] Criteria for the selection of step size h leading to a stable simulation can be obtained by analysis of (10.121). But the details are not considered here.



10.4 MULTISTEP METHODS FOR ODEs 

The numerical ODE solvers we have considered so far were either implicit methods 
or explicit methods. But in all cases they were examples of so-called single-step 







Figure 10.19 (a) The result of applying Heun's method to obtain the numerical solu- 
tion of the Duffing system specified in Example 10.12 [i.e., using the parameter values in 
Eq. (10.120)]; (b) the phase portrait for the system obtained by plotting the points (x„, y n ) 
[from plot (a)] on the Cartesian plane, thus yielding an approximate plot of (x(t), y(f))- 



methods; that is, x_{n+1} was ultimately only a function of x_n. A disadvantage of
single-step methods is that to achieve good accuracy often requires the use of 
higher-order methods (e.g., fourth- or fifth-order Runge-Kutta). But higher-order 
methods need many function evaluations per step, and so are computationally 
expensive. 

Implicit single-step methods, although inherently stable, are not more accurate 
than explicit methods, although they can track fast changes in the solution x(t) 
better than can explicit methods (recall Example 10.5). However, implicit methods 
may require nonlinear system solvers (i.e., Chapter 7 methods) as part of their 
implementation. This is a complication that is also not necessarily very efficient 
computationally. Furthermore, the methods in Chapter 7 possess their own stability 
problems. Therefore, in this section we introduce multistep predictor-corrector 
methods that overcome some of the deficiencies of the methods we have considered 
so far. 

In this section we return to consideration of a single first-order ODE IVP

x^{(1)}(t) = dx(t)/dt = f(x(t), t),  x_0 = x(0).    (10.122)



In reality, we have already seen a single-step predictor-corrector method in Section 10.2. Suppose that we have the following method:

x̃_{n+1} = x_n + h f(x_n, t_n)  (predictor step)    (10.123a)
x_{n+1} = x_n + (1/2)h[f(x̃_{n+1}, t_{n+1}) + f(x_n, t_n)]  (corrector step).    (10.123b)

If we substitute (10.123a) into (10.123b), we again arrive at Heun's method [Eq. (10.47)], which overcame the necessity to solve for x_{n+1} in the implicit method of Eq. (10.59) (trapezoidal method). Generally, predictor-corrector methods replace implicit methods in this manner, and we will see more examples further on in this section. When a higher-order implicit method is "converted" to a predictor-corrector method, the need to solve nonlinear equations is eliminated and accuracy is preserved, but the stability characteristics of the implicit method will be lost, at least to some extent. Of course, a suitable stability theory will still allow the user to select reasonable values for the step size parameter h.

We may now consider a few simple examples of multistep methods. Perhaps the simplest multistep methods derive from the numerical differentiation ideas from Section 9.6 (of Chapter 9). Recall (9.138), for which

x^{(1)}(t) = (1/(2h))[x(t + h) − x(t − h)] − (1/6)h² x^{(3)}(ξ)    (10.124)

for some ξ ∈ [t − h, t + h]. The explicit Euler method (10.22) can be replaced with the midpoint method derived from [using t = t_n in (10.124)]

f(x(t_n), t_n) ≈ (1/(2h))[x(t_{n+1}) − x(t_{n−1})],

so this method is

x_{n+1} = x_{n−1} + 2h f(x_n, t_n).    (10.125)

This is an explicit method, but x_{n+1} depends on x_{n−1} as well as x_n. We may call it a two-step method. Similarly, via (9.153)

f(x(t_n), t_n) ≈ (1/(2h))[−3x(t_n) + 4x(t_{n+1}) − x(t_{n+2})],

so we have the method

x_{n+2} = 4x_{n+1} − 3x_n − 2h f(x_n, t_n),

which can be rewritten as

x_{n+1} = 4x_n − 3x_{n−1} − 2h f(x_{n−1}, t_{n−1}).    (10.126)



Finally, via (9.154), we obtain

f(x(t_n), t_n) ≈ (1/(2h))[x(t_{n−2}) − 4x(t_{n−1}) + 3x(t_n)],

yielding the method

x_{n+1} = (4/3)x_n − (1/3)x_{n−1} + (2/3)h f(x_{n+1}, t_{n+1}).    (10.127)

Method (10.126) is a two-step method that is explicit, but (10.127) is a two-step implicit method since we need to solve for x_{n+1}.

A problem with multistep methods is that the IVP (10.122) provides only one initial condition, x_0. But for n = 1 in any of (10.125), (10.126), or (10.127), we need to know x_1; that is, for two-step methods we need two initial conditions, or
starting values. A simple way out of this dilemma is to use single-step methods to 
provide any missing starting values. In fact, predictor-corrector methods derived 
from single-step concepts (e.g., Runge-Kutta methods) are often used to provide 
the starting values for multistep methods. 

What about multistep method accuracy? Let us consider the midpoint method again. From (10.124) for some ξ_n ∈ [t_{n−1}, t_{n+1}]

x(t_{n+1}) = x(t_{n−1}) + 2h f(x(t_n), t_n) + (1/3)h³ x^{(3)}(ξ_n),    (10.128)

so the method (10.125) has a truncation error per step that is O(h³). The midpoint method is therefore more accurate than the explicit Euler method [recall (10.43) and (10.44)]. Yet we see that both methods need only one function evaluation per step. Thus, the midpoint method is more efficient than the Euler method. We recall that Heun's method (a Runge-Kutta method) has a truncation error per step that is O(h³), too [recall (10.56)], and so Heun's method may be used to initialize (i.e., provide starting values for) the midpoint method (10.125). This specific situation holds up in general. Thus, a multistep method can often achieve comparable accuracy to single-step methods, and yet use fewer function calls, leading to reduced computational effort.
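A MATLAB sketch of the midpoint method (10.125), with the single missing starting value x_1 supplied by one Heun step (10.47), is given below; f is assumed to be a function handle and t_0 = 0. As the stability analysis that follows shows, this scheme must nevertheless be used with caution.

% Two-step midpoint method, Eq. (10.125), started with one Heun step for x_1.
function [t, x] = midpoint2(f, x0, h, tf)
  N = round(tf/h);
  t = (0:N)*h;
  x = zeros(1, N+1);
  x(1) = x0;
  k1 = f(x(1), t(1));                        % one Heun step supplies the second
  k2 = f(x(1) + h*k1, t(1) + h);             % starting value x_1
  x(2) = x(1) + (h/2)*(k1 + k2);
  for n = 2:N
      x(n+1) = x(n-1) + 2*h*f(x(n), t(n));   % x_{n+1} = x_{n-1} + 2h f(x_n, t_n)
  end
end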

What about stability considerations? Let us continue with our midpoint method example. If we apply the model problem (10.28) to (10.125), we have the difference equation

x_{n+1} = x_{n−1} + 2hλ x_n,    (10.129)

which is a second-order difference equation. This has characteristic equation⁷

z² − 2hλz − 1 = 0.    (10.130)

We may rewrite (10.129) as

x_{n+2} − 2hλ x_{n+1} − x_n = 0,

which has the z-transform

(z² − 2hλz − 1)X(z) = 0.




For convenience, let ρ = hλ, in which case the roots of (10.130) are easily seen
to be

z_1 = ρ + √(ρ² + 1),   z_2 = ρ - √(ρ² + 1).    (10.131)

A general solution to (10.129) will have the form

x_n = c_1 z_1^n + c_2 z_2^n    (10.132)

for n ∈ Z⁺. Knowledge of x_0 and x_1 allows us to solve for the constants c_1 and c_2 in
(10.132), if this is desired. However, more importantly, we recall that we assume
λ < 0, and we seek a step size h > 0 so that lim_{n→∞} |x_n| does not diverge. But in this case
ρ = hλ < 0, and hence from (10.131), |z_2| > 1 for all h > 0. If c_2 ≠ 0 in (10.132)
(which is practically always the case), then we will have lim_{n→∞} |x_n| = ∞! Thus,
the midpoint method is inherently unstable under all realistic conditions! The term c_1 z_1^n
in (10.132) is "harmless" since |z_1| < 1 for suitable h. But the term c_2 z_2^n, often
called a parasitic term, will eventually "blow up" with increasing n, thus fatally
corrupting the approximation to x(t).
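A quick numerical check of (10.130) illustrates the parasitic root; the step size and λ
below are arbitrary illustrative values, not from the text.

% Roots of the midpoint method's characteristic equation (10.130)
% for an illustrative case with lambda < 0.
lambda = -2; h = 0.1; rho = h*lambda;
z = roots([1, -2*rho, -1]);   % z^2 - 2*rho*z - 1 = 0
abs(z)                        % one magnitude < 1, the other > 1 (parasitic)

For any ρ < 0 the parasitic root has magnitude |ρ| + √(ρ² + 1) > 1.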

Unfortunately, parasitic terms are inherent in multistep methods. However, there 
are more advanced methods with stability theories designed to minimize the effects 
of the parasitics. We now consider a few of these improved multistep methods. 

10.4.1 Adams-Bashforth Methods 

Here we look at the Adams-Bashforth (AB) family of multistep ODE IVP solvers. 
Section 10.4.2 will look at the Adams-Moulton (AM) family. Our approach follows 
Epperson [12, Section 6.6]. Both families are derived using Lagrange interpolation 
[recall Section 6.2 from Chapter 6 (above)]. 

Recall (10.122), which we may integrate to obtain (t_n = t_0 + nh)

x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x(t), t) dt.    (10.133)

Now suppose that we had the samples x(t_{n-k}) for k = 0, 1, . . . , m (i.e., m + 1 samples
of the exact solution x(t)). Via Lagrange interpolation theory, we may interpolate
F(t) = f(x(t), t) [the integrand of (10.133)] for t ∈ [t_{n-m}, t_{n+1}]⁸ according to

p_m(t) = Σ_{k=0}^{m} L_k(t) f(x(t_{n-k}), t_{n-k}),    (10.134)

⁷A solution to (10.129) exists only if z² - 2hλz - 1 = 0. If the reader has not had a signals and systems
course (or equivalent), then this reasoning must be accepted "on faith." But it may help to observe that
the reasoning is similar to the theory of solution for linear ODEs in constant coefficients.

⁸The upper limit on the interval [t_{n-m}, t_n] has been extended from t_n to t_{n+1} here. This is allowed
under interpolation theory, and actually poses no great problem in either method development or error
analysis. We are using the Lagrange interpolant to extrapolate from t = t_n to t_{n+1}.




where

L_k(t) = Π_{i=0, i≠k}^{m} (t - t_{n-i})/(t_{n-k} - t_{n-i}).    (10.135)

From (6.14), for some ξ_t ∈ [t_{n-m}, t_{n+1}]

F(t) = p_m(t) + (1/(m+1)!) F^{(m+1)}(ξ_t) Π_{i=0}^{m} (t - t_{n-i}).    (10.136)

However, F(t) = f(x(t), t) = x^{(1)}(t), so (10.136) becomes

F(t) = p_m(t) + (1/(m+1)!) x^{(m+2)}(ξ_t) Π_{i=0}^{m} (t - t_{n-i}).    (10.137)

Thus, if we now substitute (10.137) into (10.133), we obtain

x(t_{n+1}) = x(t_n) + Σ_{k=0}^{m} f(x(t_{n-k}), t_{n-k}) ∫_{t_n}^{t_{n+1}} L_k(t) dt + R_m(t_{n+1}),    (10.138)

where

R_m(t_{n+1}) = ∫_{t_n}^{t_{n+1}} (1/(m+1)!) x^{(m+2)}(ξ_t) Π_{i=0}^{m} (t - t_{n-i}) dt.    (10.139)

Polynomial π(t) = Π_{i=0}^{m} (t - t_{n-i}) does not change sign for t ∈ [t_n, t_{n+1}] (which is the interval of
integration). Thus, we can say that there is a ξ_n ∈ [t_n, t_{n+1}] such that

R_m(t_{n+1}) = (1/(m+1)!) x^{(m+2)}(ξ_n) ∫_{t_n}^{t_{n+1}} π(t) dt.    (10.140)

For convenience, define

ρ_m = (1/(m+1)!) ∫_{t_n}^{t_{n+1}} (t - t_n)(t - t_{n-1}) ··· (t - t_{n-m}) dt    (10.141)

and

λ_k = ∫_{t_n}^{t_{n+1}} L_k(t) dt.    (10.142)

Thus, (10.138) reduces to [with R_m(t_{n+1}) = ρ_m x^{(m+2)}(ξ_n)]

x(t_{n+1}) = x(t_n) + Σ_{k=0}^{m} λ_k f(x(t_{n-k}), t_{n-k}) + ρ_m x^{(m+2)}(ξ_n)    (10.143)




for some ξ_n ∈ [t_n, t_{n+1}]. The order m + 1 Adams-Bashforth method is therefore
defined to be

x_{n+1} = x_n + Σ_{k=0}^{m} λ_k f(x_{n-k}, t_{n-k}).    (10.144)

It is an explicit method involving m + 1 steps. Table 10.1 summarizes the method
parameters for various m, and is essentially Table 6.6 from Ref. 12.

TABLE 10.1 Adams-Bashforth Method Parameters

 m | λ_0       | λ_1        | λ_2       | λ_3       | R_m(t_{n+1})
---|-----------|------------|-----------|-----------|---------------------------
 0 | h         |            |           |           | (1/2)h² x^{(2)}(ξ_n)
 1 | (3/2)h    | -(1/2)h    |           |           | (5/12)h³ x^{(3)}(ξ_n)
 2 | (23/12)h  | -(16/12)h  | (5/12)h   |           | (3/8)h⁴ x^{(4)}(ξ_n)
 3 | (55/24)h  | -(59/24)h  | (37/24)h  | -(9/24)h  | (251/720)h⁵ x^{(5)}(ξ_n)



10.4.2 Adams-Moulton Methods 

The Adams-Moulton methods are a modification of the Adams-Bashforth methods.
The Adams-Bashforth methods interpolate using the nodes t_n, t_{n-1}, . . . , t_{n-m}.
On the other hand, the Adams-Moulton methods interpolate using the nodes
t_{n+1}, t_n, . . . , t_{n-m+1}. Note that the number of nodes is the same in both methods.
Consequently, (10.138) becomes

x(t_{n+1}) = x(t_n) + Σ_{k=-1}^{m-1} f(x(t_{n-k}), t_{n-k}) ∫_{t_n}^{t_{n+1}} L_k(t) dt + R_m(t_{n+1}),    (10.145)

where now

L_k(t) = Π_{i=-1, i≠k}^{m-1} (t - t_{n-i})/(t_{n-k} - t_{n-i}),    (10.146)

and R_m(t_{n+1}) = ρ_m x^{(m+2)}(ξ_n) with

ρ_m = (1/(m+1)!) ∫_{t_n}^{t_{n+1}} (t - t_{n+1})(t - t_n) ··· (t - t_{n-m+1}) dt.    (10.147)

Thus, the order m + 1 Adams-Moulton method is defined to be

x_{n+1} = x_n + Σ_{k=-1}^{m-1} λ_k f(x_{n-k}, t_{n-k}),    (10.148)

where

λ_k = ∫_{t_n}^{t_{n+1}} L_k(t) dt    (10.149)

[the same as (10.142) except that k = -1, 0, 1, . . . , m - 1, and L_k(t) is now (10.146)].
Method (10.148) is an implicit method since it is necessary to solve for x_{n+1}.
It also requires m + 1 steps. Table 10.2 summarizes the method parameters for
various m, and is essentially Table 6.7 from Ref. 12.

10.4.3 Comments on the Adams Families

For small values of m in Tables 10.1 and 10.2, we see that the Adams families
(AB family and AM family) correspond to methods seen earlier. To be specific:

1. For m = 0 in Table 10.1, λ_0 = h, so (10.144) yields the explicit Euler method
(10.22).

2. For m = 0 in Table 10.2, λ_{-1} = h, so (10.148) yields the implicit Euler
method (10.35).

3. For m = 1 in Table 10.2, λ_{-1} = λ_0 = (1/2)h, so (10.148) yields the trapezoidal
method (10.59).

Stability analysis for members of the Adams families is performed in the usual
manner. For example, when m = 1 in Table 10.1 (i.e., consider the second-order
AB method), Eq. (10.144) becomes

x_{n+1} = x_n + (1/2)h[3f(x_n, t_n) - f(x_{n-1}, t_{n-1})].    (10.150)



TABLE 10.2 Adams-Moulton Method Parameters

 m | λ_{-1}   | λ_0      | λ_1       | λ_2      | R_m(t_{n+1})
---|----------|----------|-----------|----------|----------------------------
 0 | h        |          |           |          | -(1/2)h² x^{(2)}(ξ_n)
 1 | (1/2)h   | (1/2)h   |           |          | -(1/12)h³ x^{(3)}(ξ_n)
 2 | (5/12)h  | (8/12)h  | -(1/12)h  |          | -(1/24)h⁴ x^{(4)}(ξ_n)
 3 | (9/24)h  | (19/24)h | -(5/24)h  | (1/24)h  | -(19/720)h⁵ x^{(5)}(ξ_n)

Application of the model problem (10.28) to (10.150) yields

x_{n+1} = (1 + (3/2)hλ)x_n - (1/2)hλ x_{n-1},

or

x_{n+2} - (1 + (3/2)hλ)x_{n+1} + (1/2)hλ x_n = 0.

This has the characteristic equation (with ρ = hλ)

z² - (1 + (3/2)ρ)z + (1/2)ρ = 0.    (10.151)

This equation has the roots

z_1 = (1/2)[(1 + (3/2)ρ) + √(1 + ρ + (9/4)ρ²)],
z_2 = (1/2)[(1 + (3/2)ρ) - √(1 + ρ + (9/4)ρ²)].    (10.152)

We need to know what range of h > 0 yields |z_1|, |z_2| < 1. We consider only ρ < 0
since λ < 0. Figure 10.20 plots |z_1| and |z_2| versus ρ, and suggests that we may
select h such that

-1 < hλ < 0.    (10.153)

Plots of stability regions for the other Adams family members may be seen in
Figs. 6.7 and 6.8 of Epperson [12]. Note that stability regions occupy the complex
plane, as it is assumed in such a context that λ ∈ C. However, we have restricted
our attention to a first-order ODE IVP here, and so it is actually enough to assume
that λ ∈ R.

Figure 10.20 Magnitudes of the roots in (10.152) as a function of ρ = hλ.
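A plot like Figure 10.20 can be regenerated directly from (10.152); the following short
MATLAB sketch (illustrative, not from the text) evaluates the two root magnitudes over
ρ ∈ [-2, 0].

% Root magnitudes of (10.151) versus rho = h*lambda (cf. Figure 10.20).
rho = linspace(-2, 0, 400);
d  = sqrt(1 + rho + (9/4)*rho.^2);      % square-root term in (10.152)
z1 = 0.5*((1 + 1.5*rho) + d);
z2 = 0.5*((1 + 1.5*rho) - d);
plot(rho, abs(z1), '-', rho, abs(z2), '--'), grid
xlabel('\rho = h\lambda'), ylabel('|z|')
legend('|z_1|', '|z_2|')
% Both magnitudes stay below 1 only for -1 < rho < 0, i.e., (10.153).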

Finally, observe that AB and AM methods can be combined to yield predictor- 
correctors. An mth-order AB method can act as a predictor for an mth-order 
AM method that is the corrector. A Runge-Kutta method can initialize the 
procedure. 
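The following MATLAB fragment sketches one way to pair the methods just described: a
second-order AB predictor with the m = 1 Adams-Moulton (trapezoidal) corrector, started
with a Heun step. The routine and its arguments are illustrative assumptions, not code
from the text.

% abm2.m
% Sketch of an AB2 predictor / AM (trapezoidal) corrector pair.
function [t, x] = abm2(f, t0, x0, h, N)
t = t0 + h*(0:N);
x = zeros(1, N+1);
x(1) = x0;
f0 = f(x(1), t(1));
x(2) = x(1) + (h/2)*(f0 + f(x(1) + h*f0, t(2)));   % Heun start for x_1
for n = 2:N
   fn  = f(x(n),   t(n));
   fn1 = f(x(n-1), t(n-1));
   xp = x(n) + (h/2)*(3*fn - fn1);                 % AB2 predictor
   x(n+1) = x(n) + (h/2)*(fn + f(xp, t(n+1)));     % trapezoidal corrector
end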



10.5 VARIABLE-STEP-SIZE (ADAPTIVE) METHODS FOR ODEs 

Accuracy in the numerical solution of ODEs requires either increasing the order 
of the method applied to the problem or decreasing the step-size parameter h. 
However, high-order methods (e.g., Runge-Kutta methods of order exceeding 5) 
are not very attractive at least in part because of the computational effort involved. 
To preserve accuracy while reducing computational requirements suggests that we 
should adaptively vary step size h. 

Recall Example 10.5, where we saw that low-order methods were not accurate
near t = t_0. For a method of a given order, we would like in Example 10.5 to
have a small h for t near t_0, but a larger h for t away from t_0. This would reduce
the overall number of function evaluations needed to estimate x(t) for t ∈ [t_0, t_f].
The idea of adaptively varying h from step to step requires monitoring the error
in the solution somehow; that is, ideally, we need to infer e_n = |x(t_n) - x_n| [x_n
is the estimate of x(t) at t = t_n from some method] at step n. If e_n is small
enough, h may be increased in size at the next step, but if e_n is too big, we
decrease h.

Of course, we do not know x(t_n), so we do not have direct access to the error e_n.
However, one idea that is implemented in modern software tools (e.g., MATLAB
routines ode23 and ode45) is to compute x_n for a given h using two methods,
each of a different order. The method of higher order is of greater accuracy, so
if x_n does not differ much between the methods, we are led to believe that h is
small enough, and so it may be increased in the next step. On the other hand, if the
x_n values given by the different methods differ significantly, we are then led to
believe that h is too big, and so it should be reduced.

In this section we give only a basic outline of the main ideas of this process.
Our emphasis is on the Runge-Kutta-Fehlberg (RKF) methods, of which MATLAB
routines ode23 and ode45 are particular implementations. Routine ode23
implements second- and third-order Runge-Kutta methods, while ode45 implements
fourth- and fifth-order Runge-Kutta methods. Computational efficiency is
maintained by sharing intermediate results that are common to both the second- and
third-order methods and common to both the fourth- and fifth-order methods. More
specifically, Runge-Kutta methods of consecutive orders have constants such as k_j
[recall (10.67)] in common with each other, and so these need not be computed twice.
We mention that ode45 implements a method based on Dormand and Prince [14],
and a more detailed account of this appears in Epperson [12]. An analysis of the
RKF methods also appears in Burden and Faires [17]. The details of all of this are

quite tedious, and so are not presented here. It is also worth noting that an account 
of MATLAB ODE solvers is given by Shampine and Reichelt [13], who present 
some improvements to the older MATLAB codes that make them better at solving 
stiff systems (next section). 

A pseudocode for something like ode45 is as follows, and is based on Algo- 
rithm 6.5 in Epperson [12]: 

Input t_0, x_0;  { initial condition and starting time }
Input tolerance ε > 0;
Input the initial step size h > 0, and final time t_f > t_0;
n := 0;
while t_n < t_f do begin
    X_1 := RKF4(x_n, t_n, h);  { 4th-order RKF estimate of x_{n+1} }
    X_2 := RKF5(x_n, t_n, h);  { 5th-order RKF estimate of x_{n+1} }
    E := |X_1 - X_2|;
    if (1/2)hε <= E <= hε then begin  { h is OK }
        x_{n+1} := X_2;
        t_{n+1} := t_n + h;
        n := n + 1;
    else if E > hε then  { h is too big }
        h := h/2;  { reduce h and repeat }
    else  { h is too small }
        h := 2h;
        x_{n+1} := X_2;
        t_{n+1} := t_n + h;
        n := n + 1;
    end;
end;



Of course, variations on the "theme" expressed in this pseudocode are possible. As
noted in Epperson [12], a drawback of this algorithm is that it will tend to oscillate
between small and large step size values. We emphasize that the method is based on
considering the local error in going from time step t_n to t_{n+1}. However, this does
not in itself guarantee that the global error |x(t_n) - x_n| is small. It turns out that if
adequate smoothness prevails [e.g., if f(x, t) is Lipschitz as per Definition 10.1],
then small local errors do imply small global errors (see Theorem 6.6 or 6.7 in
Ref. 12).
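In practice, the step-size control just described is driven by user-supplied tolerances. A
typical MATLAB usage pattern is sketched below; the tolerance values are illustrative
assumptions (the Colpitts simulation in Appendix 10.B shows a complete case).

% Illustrative use of an adaptive MATLAB solver with explicit tolerances.
% The ODE is the scalar problem of Example 10.8: dx/dt = t^2 - 2x/t.
f = @(t, x) t.^2 - 2*x./t;                 % note ode45 expects f(t, x)
opts = odeset('RelTol', 1e-6, 'AbsTol', 1e-9);
[t, x] = ode45(f, [0.05 1.5], 1.0, opts);  % steps chosen adaptively
plot(t, x, '-o'), grid
xlabel('t'), ylabel('x(t)')
% diff(t) shows how the step size grows as the solution becomes smoother.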

Example 10.13 This example illustrates a typical application of MATLAB 
routine ode23 to the problem of simulating the Colpitts oscillator circuit of Example 
10.2. 

Figure 10.21 shows a typical plot of v_CE(t), and the phase portrait, for the parameter
values

V_TH = 0.75 V (volts), V_CC = 5 (V), V_EE = -5 (V), R_EE = 400 Ω (ohms),
R_L = 35 (Ω), L = 98.5 × 10⁻⁶ H (henries), β_F = 200, R_ON = 100 (Ω),
C_1 = C_2 = 54 × 10⁻⁹ F (farads).    (10.154)



Figure 10.21 Chaotic regime: (a) phase portrait of the Colpitts oscillator and (b) collector-emitter
voltage, showing typical results of applying MATLAB routine ode23 to the simulation
of the Colpitts oscillator circuit. Equation (10.154) specifies the circuit parameters for
the results shown here.



These circuit parameters were used in Kennedy [9], and the phase portrait in
Fig. 10.21 is essentially that in Fig. 5 of that article [9]. The MATLAB code that
generates Fig. 10.21 appears in Appendix 10.B.

For the parameters in (10.154) the circuit simulation phase portrait in Fig. 10.21 
is that of a strange attractor [11, 16], and so is strongly indicative (although not 
conclusive) of chaotic dynamics in the circuit. 

We note that under "normal" circumstances the Colpitts oscillator is intended to 
generate sinusoidal waveforms, and so the chaotic regime traditionally represents 
a failure mode, or abnormal operating condition for the circuit. However, Kennedy 
[9] suggests that the chaotic mode of operation may be useful in chaos-based data 
communications (e.g., chaotic-carrier communications). 

The following circuit parameters lead to approximately sinusoidal circuit out- 
puts: 



V_TH = 0.75 V, V_CC = 5 V, V_EE = -5 V, R_EE = 100 Ω,
R_L = 200 Ω, L = 100 × 10⁻⁶ H, β_F = 80, R_ON = 115 Ω,
C_1 = 45 × 10⁻⁹ F, C_2 = 58 × 10⁻⁹ F.    (10.155)



Figure 10.22 Sinusoidal operation: (a) phase portrait of the Colpitts oscillator and
(b) collector-emitter voltage, showing typical results of applying MATLAB routine ode23
to the simulation of the Colpitts oscillator circuit. Equation (10.155) specifies the circuit
parameters for the results shown here.



Figure 10.22 shows the phase portrait for the oscillator using these parameter values.
We see that v_CE(t) is much more sinusoidal than in Fig. 10.21. The trajectory
in the phase portrait of Fig. 10.22 is tending to an elliptical closed curve, indicative
of simple harmonic (i.e., sinusoidal) oscillation.



10.6 STIFF SYSTEMS 

Consider the general system of coupled first-order ODEs

dx_0(t)/dt = f_0(x_0, x_1, . . . , x_{m-1}, t),
dx_1(t)/dt = f_1(x_0, x_1, . . . , x_{m-1}, t),
    ⋮                                              (10.156)
dx_{m-1}(t)/dt = f_{m-1}(x_0, x_1, . . . , x_{m-1}, t),

which we wish to solve for t > 0 given x(0), where x(t) = [x_0(t) x_1(t) ··· x_{m-1}(t)]^T.
If we also define f(x(t), t) = [f_0(x(t), t) f_1(x(t), t) ··· f_{m-1}(x(t), t)]^T, then we may
express (10.156) in compact vector form as

dx(t)/dt = f(x(t), t).    (10.157)
We have so far described a general order m ODE IVP. 

If we now wish to consider the stability of a numerical method applied to the
solution of (10.156) [or (10.157)], then we need to consider the model problem

dx(t)/dt = Ax(t),    (10.158)

where A is the m × m Jacobian matrix of f evaluated at (x(0), 0); that is,

A = [∂f_i(x(0), 0)/∂x_j],   i, j = 0, 1, . . . , m - 1    (10.159)

[which generalizes A ∈ R^{2×2} in (10.94)]. The solution to (10.158) is given by

x(t) = e^{At} x(0)    (10.160)

[recall (10.92)].⁹ Ensuring the stability of the order m linear ODE system in
(10.158) requires all the eigenvalues λ_k of A to have negative real parts (i.e.,
Re[λ_k] < 0 for all k = 0, 1, . . . , m - 1, where λ_k is the kth eigenvalue of A).
Of course, if m = 1, then with x(t) = x_0(t), (10.158) reduces to

dx(t)/dt = λx(t),    (10.161)



which is the model problem (10.28) again. Recall once again Example 10.5, for
which we found that

λ = ∂f(x_0, t_0)/∂x = -2/t_0,


⁹Note that if we know x(t_0) (any t_0 ∈ R), then we may slightly generalize our linear problem (10.158)
to determining x(t) for all t ≥ t_0, in which case

x(t) = e^{A(t - t_0)} x(t_0)

replaces (10.160). However, little is lost by assuming t_0 = 0.




as (10.41) is the solution to dx/dt = t² - 2x/t for t ≥ t_0 > 0. If t_0 is small, then
|λ| is large, and we saw that numerical methods, especially low-order explicit
ones, had difficulty in estimating x(t) accurately when t was near t_0. If we recall
(10.33) (which stated that h < 2/|λ|) as an example, we see that large negative
values for λ force us to select small step sizes h to ensure stability of the explicit
Euler method. Since λ is an eigenvalue of A = [λ] in (10.161), we might expect
that this generalizes. In other words, a numerical method can be expected to have
accuracy problems if A in (10.159) has eigenvalues with large negative real parts.
In a situation like this, x(t) in (10.156) has (it seems) a solution that changes so
rapidly over some time intervals (e.g., fast startup transients) that accurate numerical
solutions are hard to achieve. Such systems are called stiff systems.

So far our definition of a stiff system has not been at all rigorous. Indeed, 
a rigorous definition is hard to come by. Higham and Trefethen [15] argue that 
looking at the eigenvalues of A alone is not enough to decide on the stiffness of 
(10.156) in a completely reliable manner. It is possible, for example, that A may 
have favorable eigenvalues and yet (10.156) may still be stiff. 

Stiff systems will not be discussed further here except to note that implicit 
methods, or higher-order predictor-corrector methods, should be used for their 
solution. The paper by Higham and Trefethen [15] is highly recommended reading 
for those readers seriously interested in the problems posed by stiff systems. 
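To see the practical consequence, one can compare a nonstiff and a stiff MATLAB solver on
a system whose Jacobian has an eigenvalue of large negative real part. The system below is
an illustrative assumption (it is not one of the text's examples); ode23s is MATLAB's stiff
counterpart to ode23.

% Illustrative stiff test problem: one fast mode (lambda = -1000) and one
% slow mode (lambda = -1). A stiff (implicit) solver typically needs far
% fewer steps than a nonstiff one.
A = [-1000 0; 1 -1];
f = @(t, x) A*x;
x0 = [1; 1];
[t1, x1] = ode23( f, [0 10], x0);   % nonstiff solver: many tiny steps
[t2, x2] = ode23s(f, [0 10], x0);   % stiff solver: few steps
fprintf('ode23 used %d steps, ode23s used %d steps\n', ...
        length(t1)-1, length(t2)-1);
eig(A)   % confirms the widely separated negative real parts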

10.7 FINAL REMARKS 

In the numerical solution (i.e., simulation) of ordinary differential equations (ODEs),
two issues are of primary importance: accuracy and stability. The successful simulation
of any system requires proper attention to both of these issues.

Computational efficiency is also an issue. Generally, we prefer to use the largest
possible step size consistent with the required accuracy, while avoiding any
instability in the simulation.

APPENDIX 10.A MATLAB CODE FOR EXAMPLE 10.8 



% f23.m
% This defines function f(x,t) in the differential equation for Example 10.8
% (in Section 10.2).

function y = f23(x,t)
y = t*t - (2*x/t);

% Runge.m
% This routine simulates the Heun and 4th-order Runge-Kutta methods as

% applied to the differential equation in Example 10.8 (Sect. 10.2), so this 
% routine requires function f23.m. It therefore generates Fig. 10.9. 

function Runge

t0 = .05;   % initial time (starting time)
x0 = 1.0;   % initial condition (x(t0))

% Exact solution x(t)

c = (5*t0*t0*x0) - (t0^5);
te = [t0:.02:1.5];
for k = 1:length(te)
   xe(k) = (te(k)*te(k)*te(k))/5 + c/(5*te(k)*te(k));
end;

h = .05;

% Heun's method simulation

xh(1) = x0;
th(1) = t0;
for n = 1:25
   fn = th(n)*th(n) - (2*xh(n)/th(n));   % f(x_n,t_n)
   th(n+1) = th(n) + h;
   xn1 = xh(n) + h*fn;
   tn1 = th(n+1);
   fn1 = tn1*tn1 - (2*xn1/tn1);          % f(x_{n+1},t_{n+1}) (approx.)
   xh(n+1) = xh(n) + (h/2)*(fn + fn1);
end;

% 4th order Runge-Kutta simulation

xr(1) = x0;
tr(1) = t0;
for n = 1:25
   t = tr(n);
   x = xr(n);
   k1 = f23(x,t);
   k2 = f23(x + .5*h*k1,t + .5*h);
   k3 = f23(x + .5*h*k2,t + .5*h);
   k4 = f23(x + h*k3,t + h);
   xr(n+1) = xr(n) + (h/6)*(k1 + 2*k2 + 2*k3 + k4);
   tr(n+1) = tr(n) + h;
end;

plot(te,xe,'-',tr,xr,'--o',th,xh,'--+'), grid
legend('x(t)','4th Order Runge-Kutta (h = .05)','Heun (h = .05)',1);
xlabel(' Time (t) ')
ylabel(' Amplitude ')



APPENDIX 10.B MATLAB CODE FOR EXAMPLE 10.13 

%
% fR.m
%
% This is Equation (10.15) of Chapter 10 pertaining to Example 10.2.

function i = fR(v)

VTH = 0.75;  % Threshold voltage in volts
RON = 100;   % On resistance of NPN BJT Q in Ohms

if v <= VTH
   i = 0;
else
   i = (v-VTH)/RON;
end;

% vCC.m
% Supply voltage function v_CC(t) for Example 10.2 of Chapter 10.
% Here v_CC(t) = V_CC u(t) (i.e., oscillator switches on at t = 0).

function v = vCC(t)

VCC = 5;
if t < 0
   v = 0;
else
   v = VCC;
end;

% Colpitts.m
% Computes the right-hand side of the state equations in Equation (10.17a,b,c)
% pertaining to Example 10.2 of Chapter 10.

function y = Colpitts(t,x)

C1 = 54e-9;
C2 = 54e-9;
REE = 400;
VEE = -5;
betaF = 200;
RL = 35;
L = 98.5e-6;

y(1) = ( x(3) - betaF*fR(x(2)) )/C1;
y(2) = ( -(x(2)+VEE)/REE - fR(x(2)) - x(3) )/C2;
y(3) = ( vCC(t) - x(1) + x(2) - RL*x(3) )/L;

y = y.';   % return a column vector, as required by ode23

% SimulateColpitts.m
% This routine uses vCC.m, fR.m and Colpitts.m to simulate the Colpitts
% oscillator circuit of Example 10.2 in Chapter 10. It produces
% Figure 10.21 in Chapter 10.

% The state vector x(:,:) is as follows:
% x(:,1) = v_CE(t)
% x(:,2) = v_BE(t)
% x(:,3) = i_L(t)

function SimulateColpitts

[t,x] = ode23(@Colpitts,[0 .003],[0 0 0]);

% [0 .003] ---> simulate from t = 0 to 3 milliseconds
% [0 0 0]  ---> initial state vector

clf

L = length(t);

subplot(211), plot(x(:,1),x(:,2)), grid
xlabel(' v_{CE}(t) (volts) ')
ylabel(' v_{BE}(t) (volts) ')
title(' Phase Portrait of the Colpitts Oscillator (Chaotic Regime) ')

subplot(212), plot(t(L-1999:L),x(L-1999:L,1),'-'), grid
xlabel(' t (seconds) ')
ylabel(' v_{CE}(t) (volts) ')
title(' Collector-Emitter Voltage (Chaotic Regime) ')



REFERENCES 

1. P. E. Hydon, Symmetry Methods for Differential Equations: A Beginner's Guide, Cambridge Univ. Press, Cambridge, UK, 2000.

2. C. T.-C. Nguyen and R. T. Howe, "An Integrated CMOS Micromechanical Resonator High-Q Oscillator," IEEE J. Solid-State Circuits 34, 450-455 (April 1999).

3. E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, New York, 1978.

4. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979.

5. L. M. Kells, Differential Equations: A Brief Course with Applications, McGraw-Hill, New York, 1968.

6. E. Beltrami, Mathematics for Dynamic Modeling, Academic Press, Boston, MA, 1987.

7. S. S. Rao, Applied Numerical Methods for Engineers and Scientists, Prentice-Hall, Upper Saddle River, NJ, 2002.

8. R. Bronson, Modern Introductory Differential Equations (Schaum's Outline Series), McGraw-Hill, New York, 1973.

9. M. P. Kennedy, "Chaos in the Colpitts Oscillator," IEEE Trans. Circuits Syst. (Part I: Fundamental Theory and Applications) 41, 771-774 (Nov. 1994).

10. A. S. Sedra and K. C. Smith, Microelectronic Circuits, 3rd ed., Saunders College Publ., Philadelphia, PA, 1989.

11. J. Guckenheimer and P. Holmes, Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields, Springer-Verlag, New York, 1983.

12. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York, 2002.

13. L. F. Shampine and M. W. Reichelt, "The MATLAB ODE Suite," SIAM J. Sci. Comput. 18, 1-22 (Jan. 1997).






14. J. R. Dormand and P. J. Prince, "A Family of Embedded Runge-Kutta Formulae," J. 
Comput. Appl. Math. 6, 19-26 (1980). 

15. D. J. Higham and L. N. Trefethen, "Stiffness of ODEs," BIT 33, 285-303 (1993). 

16. P. G. Drazin, Nonlinear Systems, Cambridge Univ. Press, Cambridge, UK, 1992. 

17. R. L. Burden and J. D. Faires, Numerical Analysis, 4th ed., PWS-KENT Publ., Boston, 
MA, 1989. 



PROBLEMS 

10.1. Consider the electric circuit depicted in Fig. 10.P.1. Find matrix A ∈ R^{2×2}
and vector b ∈ R² such that

[ di_{L_1}(t)/dt ]       [ i_{L_1}(t) ]
[ di_{L_2}(t)/dt ]  = A  [ i_{L_2}(t) ]  + b v_s(t),

where i_{L_k}(t) is the current through inductor L_k (k ∈ {1, 2, 3}).

(Comment: Although the number of energy storage elements in the circuit is
3, there are only two state variables needed to describe the circuit dynamics.)

Figure 10.P.1 The linear electric circuit for Problem 10.1.

10.2. The circuit in Fig. 10.P.2 is a simplified model for a parametric amplifier.
The amplifier contains a reverse-biased varactor diode that is modeled by
the parallel interconnection of linear time-invariant capacitor C_0 and linear
time-varying capacitor C(t). You may assume that C(t) = 2C_1 cos(ω_p t),
where C_1 is constant, and ω_p is the pumping frequency. Note that

i_C(t) = (d/dt)[C(t)v(t)].

The input to the amplifier is the ideal cosinusoidal current source i_s(t) =
2I_s cos(ω_0 t), and the load is the resistor R, so the output is the current i(t)





into R. Write the state equations for the circuit, assuming that the state variables
are v(t) and i_1(t). Write these equations in matrix form. (Comment:
This problem is based on the example of a parametric amplifier as considered
in C. A. Desoer and E. S. Kuh, Basic Circuit Theory, McGraw-Hill,
New York, 1969.)

Figure 10.P.2 A model for a parametric amplifier (Problem 10.2).

10.3. Give a detailed derivation of Eqs. (10.17).

10.4. The general linear first-order ODE is

dx(t)/dt = a(t)x(t) + b(t),   x(t_0) = x_0.

Use the trapezoidal method to find an expression for x_{n+1} in terms of x_n,
t_n, and t_{n+1}.

10.5. Prove that the trapezoidal method for ODEs is unconditionally stable.

10.6. Consider Theorem 10.2. Assume that x_0 = x(t_0). Use (10.23) to find an
upper bound on |x(t_n) - x_n|/M for the following ODE IVPs:

(a) dx/dt = 1 - 2x,   x(0) = 1.

(b) dx/dt = 2 cos x,   x(0) = 0.

10.7. Consider the ODE IVP

dx(t)/dt = 1 - 2x,   x(0) = 1.

(a) Approximate the solution to this problem using the explicit Euler 
method with h = 0.1 for n = 0, 1, . . . , 10. Do the computations with a 
pocket calculator. 

(b) Find the exact solution x(t). 




10.8. Consider the ODE IVP

dx(t)/dt = 2 cos x,   x(0) = 0.

(a) Approximate the solution to this problem using the explicit Euler
method with h = 0.1 for n = 0, 1, . . . , 10. Do the computations with a
pocket calculator.

(b) Find the exact solution x(t).

10.9. Write a MATLAB routine to simulate the ODE

dx/dt = x/(x + t)

for the initial condition x(0) = 1. Use both the implicit and explicit Euler
methods. The program must accept as input the step size h, and the number
of iterations N that are desired. Parameters h and N are the same for both
methods. The program output is to be written to a file in the form of a table,
something such as (e.g., for h = 0.05, and N = 5) the following:



time step    explicit x_n    implicit x_n
 0.0000        value            value
 0.0500        value            value
 0.1000        value            value
 0.1500        value            value
 0.2000        value            value
 0.2500        value            value

Test your program out on

h = 0.01, N = 100

and

h = 0.10, N = 10.



10.10. Consider

dx(t)/dt = α(1 - 2βt²)e^{-βt²}.    (10.P.1)

(a) Verify that for t ≥ 0, with x(0) = 0, we have the solution

x(t) = αt e^{-βt²}.    (10.P.2)

(b) For what range of step sizes h is the explicit Euler method a stable
means of solving (10.P.1)?

(c) Write a MATLAB routine to simulate (10.P.1) for x(0) = 0 using both
the explicit and implicit Euler methods. Assume that α = 10 and β = 1.
Test your program out on

h = 0.01, N = 400    (10.P.3a)

and

h = 0.10, N = 40.    (10.P.3b)

The program must produce plots of {x_n | n = 0, 1, . . . , N} (for both
explicit and implicit methods), and x(t) [from (10.P.2)] on the same
graph. This will lead to two separate plots, one for each of (10.P.3a)
and (10.P.3b).

10.11. Consider

dx(t)/dt = 2tx + 1,   x(0) = 0.

(a) Find {x_n | n = 0, 1, . . . , 10} for h = 0.1 using the explicit Euler method.
Do the calculations with a pocket calculator.

(b) Verify that

x(t) = e^{t²} ∫_0^t e^{-s²} ds.

(c) Does the stability condition (10.33) apply here? Explain.

10.12. Prove that Runge-Kutta methods of order one have no degrees of freedom
(i.e., we are forced to select c_1 in only one possible way).

10.13. A fourth-order Runge-Kutta method is

x_{n+1} = x_n + (1/6)h[k_1 + 2k_2 + 2k_3 + k_4],

for which

k_1 = f(x_n, t_n),   k_2 = f(x_n + (1/2)hk_1, t_n + (1/2)h),
k_3 = f(x_n + (1/2)hk_2, t_n + (1/2)h),   k_4 = f(x_n + hk_3, t_n + h).

When applied to the model problem, we get x_n = σⁿ x_0 for n ∈ Z⁺. Derive
the expression for σ.

10.14. A third-order Runge-Kutta method is

x_{n+1} = x_n + (1/6)[k_1 + 4k_2 + k_3],

for which

k_1 = hf(x_n, t_n),   k_2 = hf(x_n + (1/2)k_1, t_n + (1/2)h),
k_3 = hf(x_n - k_1 + 2k_2, t_n + h).

(a) When applied to the model problem, we get x_n = σⁿ x_0 for n ∈ Z⁺.
Derive the expression for σ.

(b) Find the allowable range of step sizes h that ensures stability of the
method.

10.15. Consider the fourth-order Runge-Kutta method in Eqs. (10.80) and (10.81).
Show that if f(x, t) = f(t), then the method reduces to Simpson's rule for
numerical integration over the interval [t_n, t_{n+1}].

10.16. Recall Eq. (10.98). Suppose that A ∈ R^{2×2} has distinct eigenvalues γ_k such
that Re[γ_k] > 0 for at least one of the eigenvalues. Show that I + hA will
have at least one eigenvalue μ_k such that |μ_k| > 1.

10.17. Consider the coupled first-order ODEs

dx/dt = -y√(x² + y²),    (10.P.4a)
dy/dt = x√(x² + y²),    (10.P.4b)

where (x_0, y_0) = (x(0), y(0)) are the initial conditions.

(a) Prove that for suitable constants r_0 and θ_0, we have

x(t) = r_0 cos(r_0 t + θ_0),   y(t) = r_0 sin(r_0 t + θ_0).    (10.P.5)

(b) Write a MATLAB routine to simulate the system represented by (10.P.4)
using the explicit Euler method [which will produce x̄_n = [x_n y_n]^T
such that x_n ≈ x(t_n) and y_n ≈ y(t_n)]. Assume that h = 0.05 and the
initial condition is x̄_0 = [1 0]^T. Plot x̄_n and (x(t), y(t)) [via (10.P.5)]
on the (x, y) plane.

(c) Write a MATLAB routine to simulate the system represented by (10.P.4)
using Heun's method [which will produce x̄_n = [x_n y_n]^T such that
x_n ≈ x(t_n) and y_n ≈ y(t_n)]. Assume that h = 0.05 and the initial condition
is x̄_0 = [1 0]^T. Plot x̄_n and (x(t), y(t)) [via (10.P.5)] on the
(x, y) plane.

Make reasonable choices about the number of time steps in the simulation.

10.18. In the previous problem the step size is h = 0.05.

(a) Determine whether the simulation using the explicit Euler method is
stable for this choice of step size. (Hint: Recall that one must consider
the eigenvalues of I + hA.)

(b) Determine whether the simulation using Heun's method is stable for
this choice of step size. [Hint: Consider the implications of (10.121).]

Use the MATLAB eig function to assist you in your calculations.

10.19. A curve in R² is specified parametrically according to

x(t) = A cos(ωt),   y(t) = aA cos(ωt - φ),    (10.P.6)

where a, A > 0, and t ∈ R is the "parameter." When the points (x(t), y(t))
are plotted, the result is what electrical engineers often call a Lissajous figure
(or curve), which is really just an alternative name for a phase portrait.

(a) Find an implicit function expression for the curve; that is, find a description
of the form

f(x, y) = 0    (10.P.7)

[i.e., via algebra and trigonometry eliminate t from (10.P.6) to obtain
(10.P.7)].

(b) On the (x, y) plane sketch the Lissajous figures for the cases φ = 0, φ =
±π/4, and φ = ±π/2.

(c) An interpretation of (10.P.6) is that x(t) may be the input voltage from
a source in an electric circuit, while y(t) may be the output voltage
drop across a load element in the circuit (in the steady-state condition,
of course). Find a simple expression for sin φ in terms of B such that
f(0, B) = 0 [i.e., point(s) B on the y axis where the curve cuts the
y axis], and in terms of a and A. (Comment: On analog oscilloscopes
of olden days, it was possible to display a Lissajous figure, and so use
this figure to estimate the phase angle φ on the lab bench.)

10.20. Verify the values for λ_k in Table 10.1 for m = 2.

10.21. Verify the values for λ_k in Table 10.2 for m = 2.

10.22. For the ODE IVP

dx(t)/dt = f(x, t),   x(t_0) = x_0,

write pseudocode for a numerical method that approximates the solution to
it using an AB method for m = 2 as a predictor, an AM method for m = 2
as a corrector, and the third-order Runge-Kutta (RK) method from Problem
10.14 to perform the initialization.

10.23. From Forsythe, Malcolm, and Moler (see Ref. 5 in Chapter 2)

dx/dt = 998x + 1998y,    (10.P.8a)
dy/dt = -999x - 1999y    (10.P.8b)

has the solution

x(t) = 4e^{-t} - 3e^{-1000t},    (10.P.9a)
y(t) = -2e^{-t} + 3e^{-1000t},    (10.P.9b)

where x(0) = y(0) = 1. Recall A as given by (10.94).

(a) By direct substitution verify that (10.P.9) is the solution to (10.P.8).

(b) If the eigenvalues of I + hA are λ_0 and λ_1, plot |λ_0| and |λ_1| versus h
(using MATLAB).

(c) If the eigenvalues of I + hA + (1/2)h²A² [recall (10.121)] are λ_0 and λ_1,
plot |λ_0| and |λ_1| versus h (using MATLAB).

In parts (b) and (c), what can be said about the range of values for h leading
to a stable simulation of the system (10.P.8)?

10.24. Consider the coupled system of first-order ODEs

dx(t)/dt = Ax(t) + y(t),    (10.P.10)

where A ∈ R^{n×n} and x(t), y(t) ∈ Rⁿ for all t ∈ R⁺. Suppose that the eigenvalues
of A are λ_0, λ_1, . . . , λ_{n-1} such that Re[λ_k] < 0 for all k ∈ Z_n.
Suppose that

a ≤ Re[λ_k] ≤ τ < 0

for all k. A stiffness quotient is defined to be

r = a/τ.

The system (10.P.10) is said to be stiff if r ≫ 1 (again, we assume Re[λ_k] < 0
for all k).

For the previous problem, does A correspond to a stiff system? (Comment:
As Higham and Trefethen [15] warned, the present stiffness definition is
not entirely reliable.)



11

Numerical Methods for Eigenproblems



11.1 INTRODUCTION

In previous chapters we have seen that eigenvalues and eigenvectors are important
(e.g., recall condition numbers from Chapter 4, and the stability analysis of numerical
methods for ODEs in Chapter 10). In this chapter we treat the eigenproblem
somewhat more formally than previously. We shall define and review the basic
problem in Section 11.2, and in Section 11.3 we shall apply this understanding to
the problem of computing the matrix exponential [i.e., exp(At), where A ∈ R^{n×n}
and t ∈ R] since this is of central importance in many areas of electrical and computer
engineering (signal processing, stability of dynamic systems, control systems,
circuit simulation, etc.). In subsequent sections we will consider numerical methods
to determine the eigenvalues and eigenvectors of matrices.



11.2 REVIEW OF EIGENVALUES AND EIGENVECTORS

In this section we review some basic facts relating to the determination of eigenvalues
and eigenvectors of matrices. Our emphasis, with a few exceptions, is on
matrices that are diagonalizable.

Definition 11.1: Eigenproblem Let A ∈ C^{n×n}. The eigenproblem for A is
to find solutions to the matrix equation

Ax = λx,    (11.1)

where λ ∈ C and x ∈ Cⁿ such that x ≠ 0. A solution (λ, x) to (11.1) is called an
eigenpair, λ is an eigenvalue, and x is its corresponding eigenvector.

Even if A ∈ R^{n×n} (the situation we will emphasize most), it is very possible to
have λ ∈ C and x ∈ Cⁿ. We must also emphasize that x = 0 is never permitted to
be an eigenvector of A.

We may rewrite (11.1) as (I is the n × n identity matrix)

(A - λI)x = 0,    (11.2a)



or, equivalently, as

(λI - A)x = 0.    (11.2b)

Equations (11.2) are homogeneous linear systems of n equations in n unknowns.
Since x = 0 is never an eigenvector, an eigenvector x must be a nontrivial solution
to (11.2). An n × n homogeneous linear system has a nonzero (i.e., nontrivial)
solution iff the coefficient matrix is singular. Immediately, eigenvalue λ satisfies

det(A - λI) = 0,    (11.3a)

or equivalently

det(λI - A) = 0.    (11.3b)

Of course, p(λ) = det(λI - A) is a polynomial of degree n. In principle, we
may find eigenpairs by finding p(λ) (the characteristic polynomial of A), then
finding the zeros of p(λ), and then substituting these into (11.2) to find x. In
practice this approach really works only for small analytical examples. Properly
conceived numerical methods are needed to determine eigenpairs reliably for larger
matrices A.

To set the stage for what follows, consider the following examples.

Example 11.1 From Hill [1], consider a 3 × 3 example for which p(λ) =
det(λI - A) = (λ - 1)²(λ - 3), so A has a double eigenvalue (eigenvalue of
multiplicity 2) at λ = 1, and a simple eigenvalue at λ = 3. The eigenvalues
may be individually denoted as λ_0 = 1, λ_1 = 1, and λ_2 = 3.
For λ = 3, applying elementary row operations to (3I - A)x = 0 reduces the
system so that x_0 = x_1 = 0, while x_2 is arbitrary (except, of course, it is not
allowed to be zero). The general form of the eigenvector corresponding to λ = λ_2
is therefore

x^{(2)} = [0  0  x_2]^T ∈ C³.

On the other hand, now let us consider λ = λ_0 = λ_1. In this case (I - A)x = 0
reduces so that x_0 = x_2 = 0, and x_1 is arbitrary, so an eigenvector corresponding
to λ = λ_0 = λ_1 is of the general form

x^{(0)} = [0  x_1  0]^T ∈ C³.

Even though λ = 1 is a double eigenvalue, we are able to find only one eigenvector
for this case. In effect, one eigenvector seems to be "missing."

Example 11.2 Now consider (again from Hill [1])

A = [ 0  -2   1
      1   3  -1
      0   0   1 ].

Here p(λ) = det(λI - A) = (λ - 1)²(λ - 2). The eigenvalues of A are thus λ_0 = 1,
λ_1 = 1, and λ_2 = 2.

For λ = 2, (2I - A)x = 0 becomes

[  2   2  -1 ] [ x_0 ]   [ 0 ]
[ -1  -1   1 ] [ x_1 ] = [ 0 ],
[  0   0   1 ] [ x_2 ]   [ 0 ]

which reduces to x_0 + x_1 = 0 with x_2 = 0, so the general form of the eigenvector
for λ = λ_2 is

v^{(2)} = x^{(2)} = [-x_1  x_1  0]^T ∈ C³.

Now, if we consider λ = λ_0 = λ_1, (I - A)x = 0 becomes

[  1   2  -1 ] [ x_0 ]   [ 0 ]
[ -1  -2   1 ] [ x_1 ] = [ 0 ],
[  0   0   0 ] [ x_2 ]   [ 0 ]

which reduces to the single equation

x_0 + 2x_1 - x_2 = 0.

Since x_0 + 2x_1 - x_2 = 0, we may choose any two of x_0, x_1, or x_2 as free parameters,
giving a general form of an eigenvector for λ = λ_0 = λ_1 as

v^{(0)} = x_0 [1  0  1]^T + x_1 [0  1  2]^T ∈ C³.

Thus, we have eigenpairs (λ_0, x^{(0)}), (λ_1, x^{(1)}), and (λ_2, x^{(2)}).

We continue to emphasize that in all cases any free parameters are arbitrary,
except that they must never be selected to give a zero-valued eigenvector. In
Example 11.2, λ = 1 is an eigenvalue of multiplicity 2, and v^{(0)} is a vector in
a two-dimensional vector subspace of C³. On the other hand, in Example 11.1,
λ = 1 is also of multiplicity 2, and yet x^{(0)} is only a vector in a one-dimensional
vector subspace of C³.



Definition 11.2: Defective Matrix For any A ∈ C^{n×n}, if the multiplicity
of any eigenvalue λ ∈ C is not equal to the dimension of the solution space
(eigenspace) of (λI - A)x = 0, then A is defective.

From this definition, A in Example 11.1 is a defective matrix, while A in
Example 11.2 is not defective (i.e., is nondefective). In a sense soon to be made
precise, defective matrices cannot be diagonalized. However, all matrices, diagonalizable
or not, can be placed into Jordan canonical form, as follows.

Theorem 11.1: Jordan Decomposition If A ∈ C^{n×n}, then there exists a nonsingular
matrix T ∈ C^{n×n} such that

T⁻¹AT = diag(J_0, J_1, . . . , J_{k-1}),

where each Jordan block J_i is the m_i × m_i upper bidiagonal matrix

J_i = [ λ_i   1
              λ_i   1
                    ⋱    ⋱
                         λ_i   1
                               λ_i ]  ∈ C^{m_i × m_i}

and Σ_{i=0}^{k-1} m_i = n.



Proof See Halmos [2] or Horn and Johnson [3]. 

Of course, λ_i in the theorem is an eigenvalue of A. The submatrices J_i are called
Jordan blocks. The number of blocks k and their dimensions m_i are unique, but
their order is not unique. Note that if an eigenvalue has a multiplicity of unity (i.e.,
if it is simple), then the Jordan block in this case is the 1 × 1 matrix consisting
of that eigenvalue. From the theorem statement the characteristic polynomial of
A ∈ C^{n×n} is given by

p(λ) = det(λI - A) = Π_{i=0}^{k-1} (λ - λ_i)^{m_i}.    (11.4)

But if A ∈ R^{n×n}, and if λ_i ∈ C for some i, then λ_i* must also be an eigenvalue of
A, that is, a zero of p(λ). This follows from the fact that complex-valued roots of
polynomials with real-valued coefficients must always occur in complex-conjugate
pairs.

Example 11.3 Consider (with θ ≠ kπ, k ∈ Z)

A = [ cos θ   -sin θ
      sin θ    cos θ ] ∈ R^{2×2}

(the 2 × 2 rotation operator from Appendix 3.A). We have the characteristic equation

p(λ) = det(λI - A) = det [ λ - cos θ      sin θ
                            -sin θ     λ - cos θ ] = λ² - 2λ cos θ + 1 = 0,

for which the roots (eigenvalues of A) are therefore λ = e^{±jθ}. Define λ_0 = e^{jθ}
and λ_1 = e^{-jθ}. Clearly λ_1 = λ_0* (i.e., the two simple eigenvalues of A are a
conjugate pair).

For λ = λ_0, (λ_0 I - A)x = 0 is

sin θ [  j   1 ] [ x_0 ]   [ 0 ]
      [ -1   j ] [ x_1 ] = [ 0 ],

which reduces to x_0 - jx_1 = 0. The eigenvector for λ = λ_0 is therefore of the form

x^{(0)} = a [ 1  -j ]^T,   a ∈ C.

Similarly, for λ = λ_1, the homogeneous linear system (λ_1 I - A)x = 0 is

sin θ [ -j   1 ] [ x_0 ]   [ 0 ]
      [ -1  -j ] [ x_1 ] = [ 0 ],

which reduces to x_0 + jx_1 = 0. The eigenvector for λ = λ_1 is therefore of the form

x^{(1)} = b [ 1  j ]^T,   b ∈ C.

Of course, free parameters a and b are never allowed to be zero.

Computing the Jordan canonical form when m_i > 1 is numerically rather difficult
(as noted in Golub and Van Loan [4] and Horn and Johnson [3]), and so
is often avoided. However, there are important exceptions, often involving state-variable
(state-space) systems analysis and design (e.g., see Fairman [5]). Within
the theory of Jordan forms it is possible to find supposedly "missing" eigenvectors,
resulting in a theory of generalized eigenvectors. We will not consider this here as
it is rather involved. Some of the references at the end of this chapter cover the
relevant theory [3].

We will now consider a series of theorems leading to a sufficient condition for
A to be nondefective. Our presentation largely follows Hill [1].

Theorem 11.2: If A ∈ C^{n×n}, then the eigenvectors corresponding to two distinct
eigenvalues of A are linearly independent.

Proof We employ proof by contradiction.

Suppose that (α, x) and (β, y) are two eigenpairs for A, and α ≠ β. Assume
y = ax for some a ≠ 0 (a ∈ C). Thus

βy = Ay = aAx = aαx,

and also

βy = aβx.

Hence

aβx = aαx,

implying that

a(β - α)x = 0.

But x ≠ 0 as it is an eigenvector of A, and also a ≠ 0. Immediately, α = β,
contradicting our assumption that these eigenvalues are distinct. Thus, y = ax is
impossible; that is, we have proved that y is independent of x.

Theorem 11.2 leads us to the next theorem.

Theorem 11.3: If A ∈ C^{n×n} has n distinct eigenvalues, then A has n linearly
independent eigenvectors.

Proof Uses mathematical induction (e.g., Stewart [6]).

We have already seen the following theorem.

Theorem 11.4: If A ∈ R^{n×n} and A = A^T, then all eigenvalues of A are real-valued.

Proof See Hill [1], or see the appropriate footnote in Chapter 4.

In addition to this theorem, we also have the following one.

Theorem 11.5: If A ∈ R^{n×n}, and if A = A^T, then eigenvectors corresponding
to distinct eigenvalues of A are orthogonal.

Proof Suppose that (α, x) and (β, y) are eigenpairs of A with α ≠ β. We wish
to show that x^T y = y^T x = 0 (recall Definition 1.6). Now

αx = Ax = A^T x,

so that

αy^T x = y^T A^T x = (Ay)^T x = βy^T x,

implying that

(α - β)y^T x = 0.

But α ≠ β, so that y^T x = 0; that is, x ⊥ y.

Theorem 11.5 states that eigenspaces corresponding to distinct eigenvalues of a
symmetric, real-valued matrix form mutually orthogonal vector subspaces of Rⁿ.
Any vector from one eigenspace must therefore be orthogonal to any eigenvector
from another eigenspace. If we recall Definition 11.2, it is apparent that all symmetric,
real-valued matrices are nondefective if their eigenvalues are all distinct.¹
In fact, even if A ∈ C^{n×n} and is not symmetric, then, as long as the eigenvalues
are distinct, A will be nondefective (Theorem 11.3).

¹It is possible to go even further and prove that any real-valued, symmetric matrix is nondefective,
even if it possesses multiple eigenvalues. Thus, any real-valued, symmetric matrix is diagonalizable.




Definition 11.3: Similarity Transformation If A, B ∈ C^{n×n}, and there is a
nonsingular matrix P ∈ C^{n×n} such that

B = P⁻¹AP,

we say that B is similar to A, and that P is a similarity transformation.

If A ∈ C^{n×n}, and A has n distinct eigenvalues forming n distinct eigenpairs
{(λ_k, x^{(k)}) | k = 0, 1, . . . , n - 1}, then

Ax^{(k)} = λ_k x^{(k)}

allows us to write

A[x^{(0)} x^{(1)} ··· x^{(n-1)}] = [x^{(0)} x^{(1)} ··· x^{(n-1)}] diag(λ_0, λ_1, . . . , λ_{n-1});    (11.5)

that is, AP = PΛ, where Λ = diag(λ_0, λ_1, . . . , λ_{n-1}) ∈ C^{n×n} is the diagonal matrix
of eigenvalues of A. Thus

P⁻¹AP = Λ,    (11.6)

and the matrix of eigenvectors P ∈ C^{n×n} of A defines the similarity transformation
that diagonalizes matrix A. More generally, we have the following theorem.
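In MATLAB, the eigenvector matrix P and the diagonal matrix Λ of (11.6) are returned
directly by eig; the matrix below is an arbitrary illustrative choice with distinct
eigenvalues.

% Numerical illustration of (11.5)-(11.6): P^{-1} A P = Lambda.
A = [2 1 0; 0 3 1; 0 0 -1];        % distinct eigenvalues, so diagonalizable
[P, Lambda] = eig(A);              % columns of P are eigenvectors x^{(k)}
norm(A*P - P*Lambda)               % (11.5): ~ machine precision
norm(P\A*P - Lambda)               % (11.6): the similarity transformation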

Theorem 11.6: If A, B ∈ C^{n×n}, and A and B are similar matrices, then A and
B have the same eigenvalues.

Proof Since A and B are similar, there exists a nonsingular matrix P ∈ C^{n×n}
such that B = P⁻¹AP. Therefore

det(λI - B) = det(λI - P⁻¹AP)
            = det(P⁻¹(λI - A)P)
            = det(P⁻¹) det(λI - A) det(P)
            = det(λI - A).

Thus, A and B possess the same characteristic polynomial, and so possess identical
eigenvalues.

In other words, similarity transformations preserve eigenvalues.² Note that Theorem
11.6 holds regardless of whether A and B are defective. In developing (11.6)

²This makes such transformations highly valuable in state-space control systems design, in addition to
a number of other application areas.




we have seen that if A ∈ C^{n×n} has n distinct eigenvalues, it is diagonalizable. We
emphasize that this is only a sufficient condition. Example 11.2 confirms that a
matrix can have eigenvalues with multiplicity greater than one, and yet still be
diagonalizable.



11.3 THE MATRIX EXPONENTIAL

In Chapter 10 the problem of computing e^{At} (A ∈ R^{n×n}, and t ∈ R) was associated
with the stability analysis of numerical methods for systems of ODEs. It is also
noteworthy that to solve

dx(t)/dt = Ax(t) + by(t)    (11.7)

[x(t) = [x_0(t) x_1(t) ··· x_{n-1}(t)]^T ∈ Rⁿ, A ∈ R^{n×n}, and b ∈ Rⁿ with y(t) ∈ R for
all t] required us to compute e^{At} [recall Example 10.1, which involved an example
of (11.7) from electric circuit analysis; see Eq. (10.9)]. Thus, we see that computing
the matrix exponential is an important problem in analysis. In this section we shall
gain more familiarity with the matrix exponential because of its significance.

Moler and Van Loan [7] caution that computing the matrix exponential is a
numerically difficult problem. Stable, reliable, accurate, and computationally efficient
algorithms are not so easy to come by. Their paper [7], as its title states,
considers 19 methods, and none of them are entirely satisfactory. Indeed, this
paper [7] appeared in 1978, and to this day the problem of successfully computing
e^{At} for any A ∈ C^{n×n} has not been fully resolved. We shall say something about
why this is a difficult problem later.

Before considering this matter, we shall consider a general analytic (i.e., hand
calculation) method for obtaining e^{At} for any A ∈ R^{n×n}, including when A is
defective. In principle, this would involve working with Jordan decompositions
and generalized eigenvectors, but we will avoid this by adopting the approach
suggested in Leonard [8].

The matrix exponential e^{At} may be defined in the expected manner as

Φ(t) = e^{At} = Σ_{k=0}^{∞} (1/k!) A^k t^k,    (11.8)

so, for example, the kth derivative of the matrix exponential is

Φ^{(k)}(t) = A^k e^{At} = e^{At} A^k    (11.9)

for k ∈ Z⁺ (Φ^{(0)}(t) = Φ(t)). To see how this works, consider the following special
case k = 1:

Φ^{(1)}(t) = (d/dt) Σ_{k=0}^{∞} (1/k!) A^k t^k
           = (d/dt)[I + (1/1!)At + (1/2!)A²t² + ··· + (1/k!)A^k t^k + ···]
           = A + A²t + (1/2!)A³t² + ···
           = A[I + (1/1!)At + (1/2!)A²t² + ···]
           = Ae^{At} = e^{At}A.

It is possible to formally verify that the series in (11.8) converges to a matrix
function of t ∈ R by working with the Jordan decomposition of A. However,
we will avoid this level of detail. But we will consider the situation where A
is diagonalizable later on.
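The series (11.8) also suggests a naive way to compute e^{At} numerically, which can be
compared against MATLAB's built-in expm (the built-in routine uses a scaling-and-squaring
method and is the one to use in practice). The matrix, time value, and truncation length
below are illustrative assumptions.

% Truncated series (11.8) versus MATLAB's expm, for an illustrative A and t.
A = [0 1; -2 -3];  t = 0.5;
S = eye(2); term = eye(2);
for k = 1:25                      % partial sum of sum_k (1/k!) (At)^k
   term = term*(A*t)/k;
   S = S + term;
end
norm(S - expm(A*t))               % small here, but series truncation is not
                                  % a reliable general-purpose method [7]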

There is some additional background material needed to more fully appreciate
[8], and we will now consider this. The main result is the Cayley-Hamilton theorem
(Theorem 11.8, below).

Definition 11.4: Minors and Cofactors Let A ∈ C^{n×n}. The minor m_{ij} is the
determinant of the (n - 1) × (n - 1) submatrix of A derived from it by deleting
row i and column j. The cofactor c_{ij} associated with m_{ij} is c_{ij} = (-1)^{i+j} m_{ij} for
all i, j ∈ Z_n.

A formula for the inverse of A (assuming this exists) is given by Theorem 11.7.

Theorem 11.7: If A ∈ C^{n×n} is nonsingular, then

A⁻¹ = (1/det(A)) adj(A),    (11.10)

where adj(A) (the adjoint matrix of A) is the transpose of the matrix of cofactors of
A. Thus, if C = [c_{ij}] ∈ C^{n×n} is the matrix of cofactors of A, then adj(A) = C^T.

Proof See Noble and Daniel [9].

Of course, the method suggested by Theorem 11.7 is useful only for the hand 
calculation of low-order (small n) problems. Practical matrix inversion must use 
ideas from Chapter 4. But Theorem 11.7 is a very useful result for theoretical 
purposes, such as obtaining the following theorem. 

Theorem 11.8: Cayley-Hamilton Theorem Any matrix A ∈ C^{n×n} satisfies
its own characteristic equation.

Proof The characteristic polynomial for A is p(λ) = det(λI - A), and can be
written as

p(λ) = λⁿ + a_1λ^{n-1} + ··· + a_{n-1}λ + a_n

for suitable constants a_k ∈ C. The theorem claims that

Aⁿ + a_1A^{n-1} + ··· + a_{n-1}A + a_nI = 0,    (11.11)

where I is the order n identity matrix. To show (11.11), we consider adj(μI - A),
whose elements are polynomials in μ of a degree that is not greater than n - 1,
where μ is not an eigenvalue of A. Hence

adj(μI - A) = M_0μ^{n-1} + M_1μ^{n-2} + ··· + M_{n-2}μ + M_{n-1}

for suitable constant matrices M_k ∈ C^{n×n}. Via Theorem 11.7

(μI - A) adj(μI - A) = det(μI - A)I,

or, in expanded form, this becomes

(μI - A)(M_0μ^{n-1} + M_1μ^{n-2} + ··· + M_{n-2}μ + M_{n-1})
        = (μⁿ + a_1μ^{n-1} + ··· + a_n)I.

If we now equate like powers of μ on both sides of this equation, we obtain

M_0 = I,
M_1 - AM_0 = a_1I,
M_2 - AM_1 = a_2I,
        ⋮                      (11.12)
M_{n-1} - AM_{n-2} = a_{n-1}I,
-AM_{n-1} = a_nI.

Premultiplying³ the jth equation in (11.12) by A^{n-j} (j = 0, 1, . . . , n), and then
adding all the equations that result from this, yields

AⁿM_0 + A^{n-1}(M_1 - AM_0) + A^{n-2}(M_2 - AM_1) + ··· + A(M_{n-1} - AM_{n-2}) - AM_{n-1}
        = Aⁿ + a_1A^{n-1} + a_2A^{n-2} + ··· + a_{n-1}A + a_nI.

But the left-hand side of this is seen to be zero because of cancellation of all the
terms, and (11.11) immediately results.

As an exercise the reader should verify that the matrices in Examples 11.1-11.3
all satisfy their own characteristic equations.
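Such a check is easy to automate: MATLAB's poly returns the characteristic polynomial
coefficients and polyvalm evaluates a polynomial with a matrix argument. The matrix below
is an arbitrary illustrative choice.

% Numerical check of the Cayley-Hamilton theorem, p(A) = 0.
A = [2 -1 0; 1 3 1; 0 2 1];     % any square matrix will do
p = poly(A);                    % coefficients of det(lambda*I - A)
norm(polyvalm(p, A))            % p(A); ~ 0 up to roundoff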

We will now consider the approach in Leonard [8], who, however, assumes that
the reader is familiar with the theory of solution of nth-order homogeneous linear
ODEs in constant coefficients

x^{(n)}(t) + c_{n-1}x^{(n-1)}(t) + ··· + c_1x^{(1)}(t) + c_0x(t) = 0,    (11.13)

where the initial conditions are known. In particular, the reader must know that if
λ is a root of the characteristic equation

λⁿ + c_{n-1}λ^{n-1} + ··· + c_1λ + c_0 = 0,    (11.14)

then if λ has multiplicity m, its contribution to the solution of the initial-value
problem (IVP) (11.13) is of the general form

(a_0 + a_1t + ··· + a_{m-1}t^{m-1})e^{λt}.    (11.15)

These matters are considered by Derrick and Grossman [10] and Reid [11]. We
shall be combining these facts with the results of Theorem 11.10 (below).

³This means that we must multiply on the left.

Leonard [8] presents two theorems that relate the solution of (11.13) to the
computation of Φ(t) = e^{At}.

Theorem 11.9: Leonard I Let A ∈ R^{n×n} be a constant matrix with characteristic
polynomial

p(λ) = det(λI - A) = λⁿ + c_{n-1}λ^{n-1} + ··· + c_1λ + c_0.

Then Φ(t) = e^{At} is the unique solution to the nth-order matrix differential equation

Φ^{(n)}(t) + c_{n-1}Φ^{(n-1)}(t) + ··· + c_1Φ^{(1)}(t) + c_0Φ(t) = 0    (11.16)

with the initial conditions

Φ(0) = I,  Φ^{(1)}(0) = A,  . . . ,  Φ^{(n-2)}(0) = A^{n-2},  Φ^{(n-1)}(0) = A^{n-1}.    (11.17)

Proof We will demonstrate uniqueness of the solution first of all.

Suppose that Φ_1(t) and Φ_2(t) are two solutions to (11.16) for the initial conditions
stated in (11.17). Let Φ(t) = Φ_1(t) - Φ_2(t) for present purposes, in which
case Φ(t) satisfies (11.16) with the initial conditions

Φ(0) = Φ^{(1)}(0) = ··· = Φ^{(n-2)}(0) = Φ^{(n-1)}(0) = 0.

Consequently, each entry of the matrix Φ(t) satisfies a scalar IVP of the form

x^{(n)}(t) + c_{n-1}x^{(n-1)}(t) + ··· + c_1x^{(1)}(t) + c_0x(t) = 0,
x(0) = x^{(1)}(0) = ··· = x^{(n-2)}(0) = x^{(n-1)}(0) = 0,

where the solution is x(t) = 0 for all t, so that Φ(t) = 0 for all t ∈ R⁺. Thus,
Φ_1(t) = Φ_2(t), and so the solution must be unique (if it exists).




Now we confirm that the solution is Φ(t) = e^{At} (i.e., we confirm existence in
a constructive manner). Let A be a constant matrix of order n with characteristic
polynomial p(λ) as in the theorem statement. If now Φ(t) = e^{At}, then we recall that

Φ^{(k)}(t) = A^k e^{At},   k = 1, 2, . . . , n    (11.18)

[see (11.9)], so that

Φ^{(n)}(t) + c_{n-1}Φ^{(n-1)}(t) + ··· + c_1Φ^{(1)}(t) + c_0Φ(t)
        = [Aⁿ + c_{n-1}A^{n-1} + ··· + c_1A + c_0I]e^{At}
        = p(A)e^{At} = 0

via Theorem 11.8 (Cayley-Hamilton). From (11.18), we obtain

Φ^{(0)}(0) = I,  Φ^{(1)}(0) = A,  . . . ,  Φ^{(n-2)}(0) = A^{n-2},  Φ^{(n-1)}(0) = A^{n-1},

and so Φ(t) = e^{At} is the unique solution to the IVP in the theorem statement.

Theorem 11.10: Leonard II Let A e R" x " be a constant matrix with char- 
acteristic polynomial 

p{X) = X n + C,,-!!"" 1 + • • • + cik + co, 

then 

e A> = x (0/ + xi (0 A + x 2 0) A 2 H h jc„_i (t) A"" 1 , 

where x^(f), k e Z n are the solutions to the nth-order scalar ODEs 

x {n \t) + c n -ix (n - l \t) + ■■■ + cix m (t) + c Q x(t) = 0, 
satisfying the initial conditions 

x { k j) (0) = Sj- k 
forj,keZ n {xf\t) = x k {t)). 

Proof Let constant matrix A have characteristic polynomial p(X) as in the 
theorem statement. Define 

O(f) = x (t)I + xi(t)A + x 2 (t)A 2 + ■■■+ x n -i(t)A n -\ 

where Xk(t), k e Z n are unique solutions to the «th-order scalar ODEs 

x {n \t) + c n -ix {n - l \t) + ■■■ + c { x m (t) + c x(t) = 0, 




satisfying the initial conditions stated in the theorem. Thus, for all t ∈ R^+

    Φ^{(n)}(t) + c_{n−1} Φ^{(n−1)}(t) + ··· + c_1 Φ^{(1)}(t) + c_0 Φ(t)
        = Σ_{k=0}^{n−1} [x_k^{(n)}(t) + c_{n−1} x_k^{(n−1)}(t) + ··· + c_1 x_k^{(1)}(t) + c_0 x_k(t)] A^k
        = 0·I + 0·A + ··· + 0·A^{n−1} = 0.

As well we see that

    Φ(0) = x_0(0) I + x_1(0) A + ··· + x_{n−1}(0) A^{n−1} = I,
    Φ^{(1)}(0) = x_0^{(1)}(0) I + x_1^{(1)}(0) A + ··· + x_{n−1}^{(1)}(0) A^{n−1} = A,
       ⋮
    Φ^{(n−1)}(0) = x_0^{(n−1)}(0) I + x_1^{(n−1)}(0) A + ··· + x_{n−1}^{(n−1)}(0) A^{n−1} = A^{n−1}.

Therefore

    Φ(t) = x_0(t) I + x_1(t) A + ··· + x_{n−1}(t) A^{n−1}

satisfies the IVP

    Φ^{(n)}(t) + c_{n−1} Φ^{(n−1)}(t) + ··· + c_1 Φ^{(1)}(t) + c_0 Φ(t) = 0

possessing the initial conditions

    Φ^{(k)}(0) = A^k

(k ∈ Z_n). The solution is unique, and so we must conclude that e^{At} = Σ_{k=0}^{n−1} x_k(t) A^k
for all t ∈ R^+, which is the central claim of the theorem.

An example of how to apply the result of Theorem 11.10 is as follows.

Example 11.4  Suppose that

    A = [ α  γ ; 0  β ] ∈ R^{2×2},

which clearly has the eigenvalues λ = α, β. Begin by assuming distinct eigenvalues
for A, specifically, that α ≠ β.

The general solution to the second-order homogeneous ODE

    x^{(2)}(t) + c_1 x^{(1)}(t) + c_0 x(t) = 0

with characteristic roots α, β (eigenvalues of A) is [recall (11.15)]

    x(t) = a_0 e^{αt} + a_1 e^{βt}.

We have x^{(1)}(t) = a_0 α e^{αt} + a_1 β e^{βt}.




For the initial conditions x(0) = 1, x^{(1)}(0) = 0, we have the linear system
of equations

    a_0 + a_1 = 1,
    α a_0 + β a_1 = 0,

which solve to yield

    a_0 = β/(β − α),  a_1 = −α/(β − α).

Thus, the solution in this case is

    x_0(t) = [1/(β − α)] [β e^{αt} − α e^{βt}].

Now, if instead the initial conditions are x(0) = 0, x^{(1)}(0) = 1, we have the linear
system of equations

    a_0 + a_1 = 0,
    α a_0 + β a_1 = 1,

which solve to yield

    a_0 = −1/(β − α),  a_1 = 1/(β − α).

Thus, the solution in this case is

    x_1(t) = [1/(β − α)] [−e^{αt} + e^{βt}].

Via Leonard II we must have

    e^{At} = x_0(t) I + x_1(t) A
           = 1/(β − α) [ β e^{αt} − α e^{βt}        0
                         0                          β e^{αt} − α e^{βt} ]
             + 1/(β − α) [ −α e^{αt} + α e^{βt}     −γ e^{αt} + γ e^{βt}
                            0                        −β e^{αt} + β e^{βt} ]
           = 1/(β − α) [ (β − α) e^{αt}    −γ e^{αt} + γ e^{βt}
                          0                 (β − α) e^{βt} ]
           = [ e^{αt}    γ (e^{βt} − e^{αt})/(β − α)
               0         e^{βt} ].                                         (11.19)

Now assume that α = β. The general solution to

    x^{(2)}(t) + c_1 x^{(1)}(t) + c_0 x(t) = 0

with characteristic roots α, α (eigenvalues of A) is [again, recall (11.15)]

    x(t) = (a_0 + a_1 t) e^{αt}.

We have x^{(1)}(t) = (a_1 + a_0 α + a_1 α t) e^{αt}.

For the initial conditions x(0) = 1, x^{(1)}(0) = 0 we have the linear system
of equations

    a_0 = 1,
    a_1 + a_0 α = 0,

which solves to yield a_1 = −α, so that

    x_0(t) = (1 − α t) e^{αt}.

Now if instead the initial conditions are x(0) = 0, x^{(1)}(0) = 1, we have the linear
system of equations

    a_0 = 0,
    a_1 + a_0 α = 1,

which solves to yield a_1 = 1, so that

    x_1(t) = t e^{αt}.

If we again apply Leonard II, then we have

    e^{At} = x_0(t) I + x_1(t) A = [ e^{αt}   γ t e^{αt}
                                     0        e^{αt} ].                    (11.20)

A good exercise for the reader is to verify that x(t) = e^{At} x(0) solves dx(t)/dt =
Ax(t) [of course, here x(t) = [x_0(t) x_1(t)]^T is a state vector] in both of the cases
considered in Example 11.4. Do this by direct substitution.

Example 11.4 is considered in Moler and Van Loan [7] as it illustrates problems
in computing e^{At} when the eigenvalues of A are nearly multiple. If we consider
(11.19) when β − α is small, and yet is not negligible, the "divided difference"

    (e^{βt} − e^{αt}) / (β − α),                                            (11.21)

when computed directly, may result in a large relative error. In (11.19) the ratio
(11.21) is multiplied by γ, so the final answer may be very inaccurate, indeed.
Matrix A in Example 11.4 is of low order (i.e., n = 2) and is triangular. This type
of problem is very difficult to detect and correct when A is larger and not triangular.
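The effect is easy to observe numerically. The following MATLAB sketch (the parameter values are our own, chosen only for illustration) evaluates γ times the divided difference (11.21) directly and compares it with the (1,2) entry produced by MATLAB's expm; the two drift apart as β − α shrinks.

    % Direct evaluation of the divided difference in (11.19) versus expm
    alpha = -1; t = 1; gamma = 50;          % illustrative values only
    for d = [1e-4 1e-8 1e-12]
        beta = alpha + d;
        A = [alpha gamma; 0 beta];
        direct = gamma*(exp(beta*t) - exp(alpha*t))/(beta - alpha);  % gamma times (11.21)
        E = expm(A*t);
        fprintf('beta-alpha = %g: direct = %.12e, expm = %.12e\n', d, direct, E(1,2));
    end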






[Figure: plot of ||e^{At}||_2 versus t for 0 ≤ t ≤ 5.]

Figure 11.1  An illustration of the hump phenomenon in computing e^{At}.

Another difficulty noted in Moler and Van Loan [7] is sometimes called the hump
phenomenon. It is illustrated in Fig. 11.1 for Eq. (11.19) using the parameters

    α = −1.01,  β = −1.00,  γ = 50.                                        (11.22)

Figure 11.1 is a plot of the matrix 2-norm [spectral norm; recall Equation (4.37) with
p = 2 in Chapter 4] ||e^{At}||_2 versus t. (It is a version of Fig. 1 in Ref. 7.) The
problem with this arises from the fact that, one way or another, some algorithms for
the computation of e^{At} make use of the identity

    e^{At} = (e^{At/m})^m.                                                  (11.23)

When s/m is under the hump while s lies beyond it (e.g., in Fig. 11.1, s = 4
with m = 8), we can have

    ||e^{As}||_2 ≪ ||e^{As/m}||_2^m.                                        (11.24)

Unfortunately, rounding errors in the mth power of e^{As/m} are usually small only
relative to ||e^{As/m}||_2^m, rather than ||e^{As}||_2. Thus, rounding errors may be a problem
in using (11.23) to compute e^{At}.

The Taylor series expansion in (11.8) is not a good method for computing e^{At}.
The reader should recall the example of catastrophic convergence in the computation
of e^x (x ∈ R) from Chapter 3 (Appendix 3.C). It is not difficult to imagine
that the problem of catastrophic convergence in (11.8) is likely to be much worse,
and much harder to contain. Indeed, this is the case, as shown by an example in
Moler and Van Loan [7].
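The point can be made concretely with a short MATLAB sketch. The matrix below is a small test matrix of the kind used for this purpose in Ref. 7, and the truncation length of 60 terms is an arbitrary choice of ours; the intermediate terms of the series are large and cancel, so the partial sum loses accuracy relative to expm.

    % Truncated Taylor series for e^A versus expm (illustration only)
    A = [-49 24; -64 31];      % a 2-by-2 test matrix with well-separated negative eigenvalues
    S = eye(2); T = eye(2);
    for k = 1:60               % 60 terms is an arbitrary truncation choice
        T = T*A/k;             % next term A^k/k!
        S = S + T;             % partial sum of the series (11.8) with t = 1
    end
    disp(S)
    disp(expm(A))              % compare with MATLAB's expm
    disp(norm(S - expm(A))/norm(expm(A)))   % relative discrepancy shows the digits lost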

It was suggested earlier in this section that the series in (11.8) can be shown to
converge by considering the diagonalization of A (assuming that A is nondefective).
Suppose that A ∈ C^{n×n}, and that A possesses eigenvalues that are all distinct, and
so we may apply (11.6). Since

    A^k = [P Λ P^{−1}]^k = [P Λ P^{−1}][P Λ P^{−1}] ··· [P Λ P^{−1}]  (k factors)  = P Λ^k P^{−1},   (11.25)

we have

    e^{At} = Σ_{k=0}^{∞} (1/k!) A^k t^k = Σ_{k=0}^{∞} (1/k!) P Λ^k P^{−1} t^k
           = P [ Σ_{k=0}^{∞} (1/k!) Λ^k t^k ] P^{−1}
           = P diag( Σ_{k=0}^{∞} (1/k!) λ_0^k t^k,  Σ_{k=0}^{∞} (1/k!) λ_1^k t^k,  ...,  Σ_{k=0}^{∞} (1/k!) λ_{n−1}^k t^k ) P^{−1}
           = P diag(e^{λ_0 t}, e^{λ_1 t}, ..., e^{λ_{n−1} t}) P^{−1}.

If we define

    e^{Λt} = diag(e^{λ_0 t}, e^{λ_1 t}, ..., e^{λ_{n−1} t}),                (11.26)

then clearly we can say that

    e^{At} = Σ_{k=0}^{∞} (1/k!) A^k t^k = P e^{Λt} P^{−1}.                  (11.27)

We know from the theory of Maclaurin series that e^x = Σ_{k=0}^{∞} x^k/k! converges for
all x ∈ R. Thus, e^{Λt} is well defined for all t ∈ R, and hence the
series in (11.8) converges, and so e^{At} is well defined. Of course, all of this suggests
that e^{At} may be numerically computed using (11.27). From Chapter 3 we infer that
accurate, reliable means to compute e^x (x is a scalar) do exist. Also, reliable
methods exist to find the elements of Λ, and this will be considered later. But P
may be close to singular, that is, the condition number κ(P) (recall Chapter 4) may
be large, and so accurate determination of P^{−1}, which is required by (11.27), may
be difficult. Additionally, the approach (11.27) lacks generality since it won't work
unless A is nondefective (i.e., can be diagonalized). Matrix factorization methods
to compute e^{At} (including those in the defective case) are considered in greater
detail in Ref. 7, and this matter will not be mentioned further here.
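A minimal MATLAB sketch of (11.27), assuming A is nondefective, is given below (the test matrix is our own illustrative choice); it should be compared with expm, keeping in mind the warning above about an ill-conditioned P.

    % e^{At} via the eigendecomposition (11.27); assumes A is nondefective
    A = [4 1 0; 1 4 1; 0 1 4];           % illustrative symmetric (hence nondefective) matrix
    t = 0.5;
    [P, Lam] = eig(A);                   % A = P*Lam*inv(P)
    E1 = P*diag(exp(t*diag(Lam)))/P;     % P * e^{Lam t} * P^{-1}
    E2 = expm(A*t);
    disp(norm(E1 - E2))                  % small when P is well conditioned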

In Chapter 4 a condition number κ(A) was defined that informed us about the
sensitivity of the solution to Ax = b due to perturbations in A and b. It is possible
to develop a similar notion for the problem of computing e^{At}. From Golub and
Van Loan [4], the matrix exponential condition number is

    ν(A, t) = max_{||E||_2 ≤ 1}  || ∫_0^t e^{A(t−s)} E e^{As} ds ||_2 · ||A||_2 / ||e^{At}||_2.   (11.28)

(The theory behind this originally appeared in Van Loan [12].) In this expression
E ∈ R^{n×n} is a perturbation matrix. The condition number (11.28) measures the
sensitivity of the mapping A → e^{At} for a given t ∈ R. For a given t, there is a matrix
E such that

    ||e^{(A+E)t} − e^{At}||_2 / ||e^{At}||_2  ≈  ν(A, t) ||E||_2 / ||A||_2.        (11.29)




We see from this that if ν(A, t) is large, then a small change in A (modeled by the
perturbation matrix E) can cause a large change in e^{At}. In general, it is not easy
to specify A leading to large values for ν(A, t). However, it is known that

    ν(A, t) ≥ t ||A||_2                                                     (11.30)

for t ∈ R^+, with equality iff A is normal. (Any A ∈ C^{n×n} is normal iff A^H A =
A A^H.) Thus, it appears that normal matrices are generally the least troublesome
with respect to computing e^{At}. From the definition of a normal matrix we see that
real-valued, symmetric matrices are an important special case.

Of the less dubious means to compute e^{At}, Golub and Van Loan's Algorithm
11.3.1 [4] is suggested. It is based on Padé approximation, which is the use of
rational functions to approximate other functions. However, we will only refer the
reader to Ref. 4 (or Ref. 7) for the relevant details. A version of Algorithm 11.3.1
[4] is implemented in the MATLAB expm function, and MATLAB provides other
algorithm implementations for computing e^{At}.

11.4 THE POWER METHODS 

In this section we consider a simple approach to determine the eigenvalues and
eigenvectors of A ∈ R^{n×n}. The approach is iterative. The main result is as follows.

Theorem 11.11: Let A ∈ R^{n×n} be such that

(a) A has n linearly independent eigenvectors x^{(k)}, corresponding to the eigenpairs
    {(λ_k, x^{(k)}) | k ∈ Z_n}.

(b) The eigenvalues satisfy λ_{n−1} ∈ R with

    |λ_{n−1}| > |λ_{n−2}| > |λ_{n−3}| > ··· > |λ_1| > |λ_0|

    (λ_{n−1} is the dominant eigenvalue).

If y_0 ∈ R^n is a starting vector such that

    y_0 = Σ_{j=0}^{n−1} a_j x^{(j)}

with a_{n−1} ≠ 0, then for y_{k+1} = A y_k with k ∈ Z^+

    lim_{k→∞} y_k / λ_{n−1}^k = c x^{(n−1)}                                 (11.31)

for some c ≠ 0, and (recalling that ⟨x, y⟩ = x^T y = y^T x for any x, y ∈ R^n)

    lim_{k→∞} ⟨y_0, A^k y_0⟩ / ⟨y_0, A^{k−1} y_0⟩ = λ_{n−1}.                (11.32)




Proof  We observe that y_k = A^k y_0, and since A^k x^{(j)} = λ_j^k x^{(j)}, we have

    A^k y_0 = Σ_{j=0}^{n−2} a_j λ_j^k x^{(j)} + a_{n−1} λ_{n−1}^k x^{(n−1)},

implying that

    A^k y_0 / λ_{n−1}^k = a_{n−1} x^{(n−1)} + Σ_{j=0}^{n−2} a_j (λ_j/λ_{n−1})^k x^{(j)}.   (11.33)

Since |λ_j| < |λ_{n−1}| for j = 0, 1, ..., n − 2, we have lim_{k→∞} (λ_j/λ_{n−1})^k = 0,
and hence

    lim_{k→∞} A^k y_0 / λ_{n−1}^k = a_{n−1} x^{(n−1)}.

Now A^k y_0 = Σ_{j=0}^{n−1} a_j λ_j^k x^{(j)}, so that

    ⟨y_0, A^k y_0⟩ = Σ_{j=0}^{n−1} a_j λ_j^k ⟨y_0, x^{(j)}⟩,

and if η_j = ⟨y_0, x^{(j)}⟩, then

    ⟨y_0, A^k y_0⟩ / ⟨y_0, A^{k−1} y_0⟩
        = [a_{n−1} λ_{n−1}^k η_{n−1} + Σ_{j=0}^{n−2} a_j λ_j^k η_j] / [a_{n−1} λ_{n−1}^{k−1} η_{n−1} + Σ_{j=0}^{n−2} a_j λ_j^{k−1} η_j]
        = λ_{n−1} · [a_{n−1} η_{n−1} + Σ_{j=0}^{n−2} a_j η_j (λ_j/λ_{n−1})^k] / [a_{n−1} η_{n−1} + Σ_{j=0}^{n−2} a_j η_j (λ_j/λ_{n−1})^{k−1}].

Again, since lim_{k→∞} (λ_j/λ_{n−1})^k = 0 (for j = 0, 1, ..., n − 2), we have

    lim_{k→∞} ⟨y_0, A^k y_0⟩ / ⟨y_0, A^{k−1} y_0⟩ = λ_{n−1}.

We note that

    1 > |λ_{n−2}/λ_{n−1}| ≥ |λ_{n−3}/λ_{n−1}| ≥ ··· ≥ |λ_0/λ_{n−1}|,

so the rate of convergence of A^k y_0/λ_{n−1}^k to a_{n−1} x^{(n−1)} is, according to (11.33),
dominated by the term containing λ_{n−2}/λ_{n−1}. This is sometimes expressed
by writing

    || A^k y_0/λ_{n−1}^k − a_{n−1} x^{(n−1)} || = O( |λ_{n−2}/λ_{n−1}|^k ).        (11.34)

The choice of norm in (11.34) is arbitrary. Of course, a small value for |λ_{n−2}/λ_{n−1}|
implies faster convergence.

The theory of Theorem 11.11 assumes that A is nondefective. If the algorithm
suggested by this theorem is applied to a defective A, it will attempt to converge.
In effect, for any defective matrix A there is a nondefective matrix close to it, and
so the limiting values in (11.31) and (11.32) will be for a "nearby" nondefective
matrix. However, convergence can be very slow, particularly if the dependent
eigenvectors of A correspond to λ_{n−1} and λ_{n−2}. In this situation the results may
not be meaningful.

If y_0 is chosen such that a_{n−1} = 0, rounding errors in computing y_{k+1} = A y_k
will usually give a y_k with a component in the direction of the eigenvector x^{(n−1)}.
Thus, convergence ultimately is the result. To ensure that this happens, it is best
to select y_0 with noninteger components.

From (11.31), if k is big enough, then

    y_k = A^k y_0 ≈ c λ_{n−1}^k x^{(n−1)},

and so y_k is an approximation to x^{(n−1)} (to within some scale factor). However,
if |λ_{n−1}| > 1, we see that ||y_k|| → ∞ with increasing k, while if |λ_{n−1}| < 1, we
see that ||y_k|| → 0 with increasing k. Either way, serious numerical problems will
certainly result (overflow in the former case and rounding errors or underflow in the
latter case). This difficulty may be eliminated by proper scaling of the iterates,
which leads to what is sometimes called the scaled power algorithm:

Input N; { upper limit on the number of iterations }
Input y0; { starting vector }
y0 := y0/||y0||_2; { normalize y0 to unit norm }
k := 0;
while k < N do begin
    z_{k+1} := A y_k;
    y_{k+1} := z_{k+1}/||z_{k+1}||_2;
    λ^{(k+1)} := y_{k+1}^T A y_{k+1};
    k := k + 1;
end;

In this algorithm y_{k+1} is the (k + 1)th estimate of x^{(n−1)}, while λ^{(k+1)} is the
corresponding estimate of the eigenvalue λ_{n−1}. From the pseudocode above we may
easily see that

    y_k = A^k y_0 / ||A^k y_0||_2                                           (11.35)




for all k ≥ 1. With y_0 = Σ_{j=0}^{n−1} a_j x^{(j)} (a_j ∈ R such that ||y_0||_2 = 1)

    A^k y_0 = Σ_{j=0}^{n−2} a_j A^k x^{(j)} + a_{n−1} A^k x^{(n−1)}
            = a_{n−1} λ_{n−1}^k [ x^{(n−1)} + Σ_{j=0}^{n−2} (a_j/a_{n−1})(λ_j/λ_{n−1})^k x^{(j)} ]
            = a_{n−1} λ_{n−1}^k [ x^{(n−1)} + u^{(k)} ],

and as in Theorem 11.11 we see lim_{k→∞} ||u^{(k)}||_2 = 0, and (11.35) becomes

    y_k = a_{n−1} λ_{n−1}^k [x^{(n−1)} + u^{(k)}] / ||a_{n−1} λ_{n−1}^k [x^{(n−1)} + u^{(k)}]||_2
        = μ_k (x^{(n−1)} + u^{(k)}) / ||x^{(n−1)} + u^{(k)}||_2,            (11.36)

where μ_k is the sign of a_{n−1} λ_{n−1}^k (i.e., μ_k ∈ {+1, −1}). Clearly, as k → ∞, vector
y_k in (11.36) becomes a better and better approximation to eigenvector x^{(n−1)}. To
confirm that λ^{(k+1)} estimates λ_{n−1}, consider that from (11.36) we have (for k
sufficiently large)

    λ^{(k+1)} = y_{k+1}^T A y_{k+1} ≈ (x^{(n−1)})^T A x^{(n−1)} / ||x^{(n−1)}||_2^2 = λ_{n−1}

[recall that A x^{(n−1)} = λ_{n−1} x^{(n−1)}].

If |λ_{n−1}| = |λ_{n−2}| > |λ_j| for j = 0, 1, ..., n − 3, we have two dominant eigenvalues.
In this situation, as noted in Quarteroni et al. [13], convergence may or may
not occur. If, for example, λ_{n−1} = λ_{n−2}, then the vector sequence (y_k) converges to
a vector in the subspace of R^n spanned by x^{(n−1)} and x^{(n−2)}. In this case, since
A ∈ R^{n×n}, we must have λ_{n−1}, λ_{n−2} ∈ R, and hence for k ≥ 1, we must have

    A^k y_0 = Σ_{j=0}^{n−3} a_j λ_j^k x^{(j)} + a_{n−2} λ_{n−2}^k x^{(n−2)} + a_{n−1} λ_{n−1}^k x^{(n−1)},

implying that

    A^k y_0 / λ_{n−1}^k = a_{n−1} x^{(n−1)} + a_{n−2} x^{(n−2)} + Σ_{j=0}^{n−3} a_j (λ_j/λ_{n−1})^k x^{(j)},

so that

    lim_{k→∞} A^k y_0 / λ_{n−1}^k = a_{n−1} x^{(n−1)} + a_{n−2} x^{(n−2)},




which is a vector in a two-dimensional subspace of R^n. On the other hand,
recall Example 11.3, where n = 2 so that λ_0 = e^{jθ}, λ_1 = e^{−jθ}. From (11.35)
||y_k||_2 = 1, so, because A is a rotation operator, y_k = A^k y_0/||A^k y_0||_2 will always
be a point on the unit circle {(x, y) | x^2 + y^2 = 1}. Convergence does not occur
since it is generally not the same point from one iteration to the next (e.g., consider
θ = π/2 radians).



Example 11.5  Consider an example based on application of the scaled power
algorithm to the matrix

    A = [ 4 1 0
          1 4 1
          0 1 4 ].

This matrix turns out to have the eigenvalues

    λ_0 = 2.58578644,  λ_1 = 4.00000000,  λ_2 = 5.41421356

(as may be determined in MATLAB using the eig function). If we define y_k =
[y_{k,0} y_{k,1} y_{k,2}]^T ∈ R^3, then, from the algorithm, we obtain the iterates:

     k      y_{k,0}        y_{k,1}        y_{k,2}        λ^{(k)}
     0    0.57735027    0.57735027    0.57735027        —
     1    0.53916387    0.64699664    0.53916387    5.39534884
     2    0.51916999    0.67891460    0.51916999    5.40988836
     3    0.50925630    0.69376945    0.50925630    5.41322584
     4    0.50444312    0.70076692    0.50444312    5.41398821
     5    0.50212703    0.70408585    0.50212703    5.41416216
     6    0.50101700    0.70566560    0.50101700    5.41420184
     7    0.50048597    0.70641885    0.50048597    5.41421089
     8    0.50023215    0.70677831    0.50023215    5.41421295
     9    0.50011089    0.70694993    0.50011089    5.41421342
    10    0.50005296    0.70703187    0.50005296    5.41421353
    11    0.50002530    0.70707101    0.50002530    5.41421356

In only 11 iterations the power algorithm has obtained the dominant eigenvalue
to an accuracy of eight decimal places.
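A direct MATLAB translation of the scaled power algorithm is sketched below (the function name and the starting vector are our own choices, not part of the pseudocode above); applied to the matrix of Example 11.5 it reproduces the iterates in the table.

    function [lambda, y] = scaledpower(A, y0, N)
    % Scaled power algorithm: estimates the dominant eigenpair of A.
    y = y0/norm(y0);              % normalize the starting vector
    for k = 1:N
        z = A*y;                  % z_{k+1} := A*y_k
        y = z/norm(z);            % y_{k+1} := z_{k+1}/||z_{k+1}||_2
        lambda = y'*A*y;          % lambda^{(k+1)} := y_{k+1}^T A y_{k+1}
    end

    % Example usage (matrix of Example 11.5):
    % A = [4 1 0; 1 4 1; 0 1 4];
    % [lambda, y] = scaledpower(A, [1; 1; 1], 11)   % lambda ~ 5.41421356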

Continue to assume that A ∈ R^{n×n} is nondefective. Now define A_μ = A − μI
(where I is the n × n identity matrix, as usual), and μ ∈ R is called the shift
(or shift parameter).^4 We will assume that μ always results in the existence of
A_μ^{−1}. Because A is not defective, there will be a nonsingular matrix P such that
P^{−1}AP = Λ = diag(λ_0, λ_1, ..., λ_{n−1}) (recall the basic facts from Section 11.2
that justify this). Consequently

    A_μ = P Λ P^{−1} − μI  ⟹  P^{−1} A_μ P = Λ − μI,                        (11.37)

^4 The reason for introducing the shift parameter μ will be made clear a bit later.




and so A_μ^{−1} has the eigenvalues

    γ_k = 1/(λ_k − μ),  k ∈ Z_n.                                            (11.38)

P is a similarity transformation that diagonalizes A_μ, giving Λ − μI, so the eigenvalues
of (A − μI)^{−1} must be the eigenvalues of A_μ^{−1}, as these are similar matrices.
A modification of the previous scaled power algorithm is the following shifted
inverse power algorithm:

Input N; { upper limit on the number of iterations }
Input y0; { starting vector }
y0 := y0/||y0||_2; { normalize y0 to unit norm }
k := 0;
while k < N do begin
    A_μ z_{k+1} := y_k;
    y_{k+1} := z_{k+1}/||z_{k+1}||_2;
    γ^{(k+1)} := y_{k+1}^T A y_{k+1};
    k := k + 1;
end;

Assume that the eigenvalues of A satisfy

    |λ_{n−1}| > |λ_{n−2}| > ··· > |λ_1| > |λ_0|,                            (11.39)

and also that μ = 0, so then |γ_k| = 1/|λ_k|, and (11.39) yields

    |γ_0| > |γ_1| > |γ_2| > ··· > |γ_{n−2}| > |γ_{n−1}|.                    (11.40)

We observe that in the shifted inverse power algorithm A_μ z_{k+1} = y_k is equivalent
to z_{k+1} = A_μ^{−1} y_k, and so A_μ^{−1} effectively replaces A in the statement z_{k+1} := A y_k in
the scaled power algorithm. This implies that the shifted inverse power algorithm
produces a vector sequence (y_k) that converges to the eigenvector of A_0^{−1} = A^{−1}
corresponding to the eigenvalue γ_0 (= 1/λ_0). Since A x^{(0)} = λ_0 x^{(0)} implies that
A^{−1} x^{(0)} = (1/λ_0) x^{(0)} = γ_0 x^{(0)}, then for a sufficiently large k, the vector y_k will
approximate x^{(0)}. The argument to verify this follows the proof of Theorem 11.11.
Therefore, consider the starting vector

    y_0 = a_0 x^{(0)} + Σ_{j=1}^{n−1} a_j x^{(j)},  a_0 ≠ 0                 (11.41)

(such that ||y_0||_2 = 1). We see that we must have

    A^{−k} y_0 = a_0 (1/λ_0^k) x^{(0)} + Σ_{j=1}^{n−1} a_j (1/λ_j^k) x^{(j)}
              = a_0 γ_0^k x^{(0)} + Σ_{j=1}^{n−1} a_j γ_j^k x^{(j)},        (11.42)




and thus

    A^{−k} y_0 / γ_0^k = a_0 x^{(0)} + Σ_{j=1}^{n−1} a_j (γ_j/γ_0)^k x^{(j)}.

Immediately, because of (11.40), we must have

    lim_{k→∞} A^{−k} y_0 / γ_0^k = a_0 x^{(0)}.

From (11.42), we obtain

    A^{−k} y_0 = a_0 γ_0^k [ x^{(0)} + Σ_{j=1}^{n−1} (a_j/a_0)(γ_j/γ_0)^k x^{(j)} ]
              = a_0 γ_0^k [ x^{(0)} + v^{(k)} ].                            (11.43)

In much the same way as we arrived at (11.35), for the shifted inverse power
algorithm, we must have for all k ≥ 1

    y_k = A_μ^{−k} y_0 / ||A_μ^{−k} y_0||_2.                                (11.44)

From (11.43) for μ = 0, this equation becomes

    y_k = a_0 γ_0^k [x^{(0)} + v^{(k)}] / ||a_0 γ_0^k [x^{(0)} + v^{(k)}]||_2
        = ν_k (x^{(0)} + v^{(k)}) / ||x^{(0)} + v^{(k)}||_2,                (11.45)

where ν_k ∈ {+1, −1}. From the pseudocode for the shifted inverse power algorithm,
if k is large enough, via (11.45), we obtain

    γ^{(k+1)} = y_{k+1}^T A y_{k+1} ≈ (x^{(0)})^T A x^{(0)} / ||x^{(0)}||_2^2 = λ_0.   (11.46)

Thus, γ^{(k+1)} is an approximation to λ_0.

In summary, for a nondefective A ∈ R^{n×n} such that (11.39) holds, the shifted
inverse power algorithm will generate a sequence of increasingly better approximations
to the eigenpair (λ_0, x^{(0)}) when we set μ = 0.

We note that the scaled power algorithm needs O(n^2) flops (recall the definition
of flops from Chapter 4) at every iteration. This is due mainly to the matrix–vector
product step z_{k+1} = A y_k. To solve the linear system A_μ z_{k+1} = y_k requires O(n^3)
flops in general. To save on computations in the implementation of the shifted
inverse power algorithm, it is often best to LU-decompose A_μ (recall Section 4.5)
only once: A_μ = LU. At each iteration LU z_{k+1} = y_k may be solved using forward






and backward substitution, which is efficient since this needs only O(n^2) flops at
every iteration. Even so, because of the need to compute the LU decomposition of
A_μ, the shifted inverse power algorithm still needs O(n^3) flops overall. It is thus
intrinsically more computation-intensive than the scaled power algorithm.

However, it is the concept of a shift that makes the shifted inverse power algorithm
attractive, at least in some circumstances. But before we consider the reason
for introducing the shift parameter μ, the reader should view the following example.

Example 11.6  Let us reconsider matrix A from Example 11.5. If we apply
the shifted inverse power algorithm using μ = 0 to this matrix, we obtain the
following iterates:

     k      y_{k,0}        y_{k,1}         y_{k,2}        γ^{(k)}
     0    0.57735027     0.57735027     0.57735027        —
     1    0.63960215     0.42640143     0.63960215    5.09090909
     2    0.70014004     0.14002801     0.70014004    4.39215686
     3    0.69011108    -0.21792981     0.69011108    3.39841689
     4    0.62357865    -0.47148630     0.62357865    2.82396484
     5    0.56650154    -0.59845803     0.56650154    2.64389042
     6    0.53330904    -0.65662999     0.53330904    2.59925317
     7    0.51623508    -0.68337595     0.51623508    2.58886945
     8    0.50782504    -0.69586455     0.50782504    2.58649025
     9    0.50375306    -0.70175901     0.50375306    2.58594700
    10    0.50179601    -0.70455768     0.50179601    2.58582306
    11    0.50085857    -0.70589049     0.50085857    2.58579479
    12    0.50041023    -0.70652615     0.50041023    2.58578834
    13    0.50019597    -0.70682953     0.50019597    2.58578687
    14    0.50009360    -0.70697438     0.50009360    2.58578654
    15    0.50004471    -0.70704355     0.50004471    2.58578646
    16    0.50002135    -0.70707658     0.50002135    2.58578644

We see that the method converges to an estimate of λ_0 (smallest eigenvalue of
A) that is accurate to eight decimal places in only 16 iterations.

As an exercise, the reader should confirm that the vector y_k in the bottom row
of the table above is an estimate of the eigenvector for λ_0. The reader should do
the same for Example 11.5.
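A sketch of the shifted inverse power algorithm in MATLAB follows (the function name and the one-time LU factorization are our own choices, consistent with the efficiency remark made above); with mu = 0 and the matrix of Example 11.5 it reproduces the iterates in the table.

    function [gamma, y] = shiftinvpower(A, mu, y0, N)
    % Shifted inverse power algorithm: converges to the eigenvalue of A closest to mu.
    n = size(A, 1);
    [L, U, Pm] = lu(A - mu*eye(n));    % factor A_mu = A - mu*I once
    y = y0/norm(y0);
    for k = 1:N
        z = U\(L\(Pm*y));              % solve A_mu*z_{k+1} = y_k by forward/backward substitution
        y = z/norm(z);                 % y_{k+1} := z_{k+1}/||z_{k+1}||_2
        gamma = y'*A*y;                % gamma^{(k+1)} := y_{k+1}^T A y_{k+1}
    end

    % Example usage (Example 11.6, mu = 0):
    % A = [4 1 0; 1 4 1; 0 1 4];
    % [gamma, y] = shiftinvpower(A, 0, [1; 1; 1], 16)   % gamma ~ 2.58578644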



Recall again that A_μ^{−1} is assumed to exist (so that μ is not an eigenvalue of A).
Observe that A x^{(j)} = λ_j x^{(j)}, so that (A − μI) x^{(j)} = (λ_j − μ) x^{(j)}, and therefore

    A_μ^{−1} x^{(j)} = γ_j x^{(j)}.

Suppose that there is an m ∈ Z_n such that

    |λ_m − μ| < |λ_j − μ|                                                   (11.47)

for all j ∈ Z_n with j ≠ m; that is, λ_m is closest to μ of all the eigenvalues of
A. This also says that λ_m has a multiplicity of one (i.e., is simple). Now consider






the starting vector

    y_0 = Σ_{j=0, j≠m}^{n−1} a_j x^{(j)} + a_m x^{(m)}                      (11.48)

with a_m ≠ 0, and ||y_0||_2 = 1. Clearly

    A_μ^{−k} y_0 = Σ_{j=0, j≠m}^{n−1} a_j γ_j^k x^{(j)} + a_m γ_m^k x^{(m)},

implying that

    A_μ^{−k} y_0 / γ_m^k = a_m x^{(m)} + Σ_{j=0, j≠m}^{n−1} a_j (γ_j/γ_m)^k x^{(j)}.   (11.49)

Now via (11.47)

    |γ_j/γ_m| = |λ_m − μ| / |λ_j − μ| < 1.

Therefore, via (11.49)

    lim_{k→∞} A_μ^{−k} y_0 / γ_m^k = a_m x^{(m)}.

This implies that the vector sequence (y_k) in the shifted inverse power algorithm
converges to x^{(m)}. Put simply, by the proper selection of the shift parameter μ, we
can extract just about any eigenpair of A that we wish to (as long as λ_m is simple).
Thus, in this sense, the shifted inverse power algorithm is more general than the
scaled power algorithm. The following example illustrates another important point.

Example 11.7  Once again we apply the shifted inverse power algorithm to
matrix A from Example 11.5. However, now we select μ = 2. The resulting
sequence of iterates for this case is as follows:

     k      y_{k,0}        y_{k,1}         y_{k,2}        γ^{(k)}
     0    0.57735027     0.57735027     0.57735027        —
     1    0.70710678     0.00000000     0.70710678    4.00000000
     2    0.57735027    -0.57735027     0.57735027    2.66666667
     3    0.51449576    -0.68599434     0.51449576    2.58823529
     4    0.50251891    -0.70352647     0.50251891    2.58585859
     5    0.50043309    -0.70649377     0.50043309    2.58578856
     6    0.50007433    -0.70700164     0.50007433    2.58578650
     7    0.50001275    -0.70708874     0.50001275    2.58578644






We see that convergence to the smallest eigenvalue of A has now occurred in
only seven iterations, which is faster than the case considered in Example 11.6 (for
which we used μ = 0).

This example illustrates the fact that a properly chosen shift parameter can
greatly accelerate the convergence of iterative eigenproblem solvers. This
notion of shifting to improve convergence rates is also important in practical
implementations of QR iteration methods for solving eigenproblems (next section).

So far our methods extract only one eigenvalue from A at a time. One may
apply a method called deflation to extract all the eigenvalues of A under certain
conditions. Begin by noting the following elementary result.

Lemma 11.1:  Suppose that B ∈ R^{(n−1)×(n−1)}, that B^{−1} exists, and that
r ∈ R^{n−1}; then

    [ 1  r^T ; 0  B ]^{−1} = [ 1  −r^T B^{−1} ; 0  B^{−1} ].                (11.50)

Proof  Exercise.

The deflation procedure is based on the following theorem.

Theorem 11.12: Deflation  Suppose that A_n ∈ R^{n×n}, that eigenvalue λ_i ∈ R
for all i ∈ Z_n, and that all the eigenvalues of A_n are distinct. The dominant eigenpair
of A_n is (λ_{n−1}, x^{(n−1)}), and we assume that ||x^{(n−1)}||_2 = 1. Suppose that Q_n ∈
R^{n×n} is an orthogonal matrix such that Q_n x^{(n−1)} = [1 0 ··· 0]^T = e_0; then

    Q_n A_n Q_n^T = [ λ_{n−1}  a_{n−1}^T ; 0  A_{n−1} ].                    (11.51)

Proof  Q_n exists because it can be a Householder transformation matrix (recall
Section 4.6). Any eigenvector x^{(k)} of A_n can always be normalized so that
||x^{(k)}||_2 = 1.

Following (11.5), we have

    A_n [x^{(n−1)} x^{(n−2)} ··· x^{(1)} x^{(0)}]
        = [x^{(n−1)} x^{(n−2)} ··· x^{(1)} x^{(0)}] diag(λ_{n−1}, λ_{n−2}, ..., λ_1, λ_0),

that is, A_n T_n = T_n D_n, where T_n = [x^{(n−1)} x^{(n−2)} ··· x^{(1)} x^{(0)}] and
D_n = diag(λ_{n−1}, λ_{n−2}, ..., λ_1, λ_0). Thus (Q_n^T = Q_n^{−1} via orthogonality)

    Q_n A_n Q_n^T = Q_n T_n D_n T_n^{−1} Q_n^T = (Q_n T_n) D_n (Q_n T_n)^{−1}.   (11.52)

Now

    Q_n T_n = [e_0  Q_n x^{(n−2)} ··· Q_n x^{(0)}] = [ 1  b_{n−1}^T ; 0  B_{n−1} ],

and via Lemma 11.1, we have

    (Q_n T_n)^{−1} = [ 1  −b_{n−1}^T B_{n−1}^{−1} ; 0  B_{n−1}^{−1} ].

Thus, (11.52) becomes

    Q_n A_n Q_n^T = [ 1  b_{n−1}^T ; 0  B_{n−1} ] [ λ_{n−1}  0 ; 0  D_{n−1} ] [ 1  −b_{n−1}^T B_{n−1}^{−1} ; 0  B_{n−1}^{−1} ]
                  = [ λ_{n−1}   b_{n−1}^T (D_{n−1} − λ_{n−1} I_{n−1}) B_{n−1}^{−1} ; 0   B_{n−1} D_{n−1} B_{n−1}^{−1} ],

which has the form given in (11.51).

From Theorem 11.6, Q_n A_n Q_n^T and A_n are similar matrices, and so have the
same eigenvalues. Via (11.51), A_{n−1} has the same eigenvalues as A_n, except
for λ_{n−1}. Clearly, the scaled power method could be used to find the eigenpair
(λ_{n−1}, x^{(n−1)}). The Householder procedure from Section 4.6 gives Q_n. From
Theorem 11.12 we obtain A_{n−1}, and the deflation procedure may be repeated to find
all the remaining eigenvalues of A = A_n.

It is important to note that the deflation procedure may be improved with respect
to computational efficiency by employing instead the Rayleigh quotient iteration
method. This replaces the power methods we have considered so far. This approach
is suggested and considered in detail in Golub and Van Loan [4] and Epperson [14];
we omit the details here.



11.5 QR ITERATIONS 

The power methods of Section 11.4 and variations thereof such as Rayleigh quotient 
iterations are deficient in that they are not computationally efficient methods for 
computing all possible eigenpairs. The power methods are really at their best when 
we seek only a few eigenpairs (usually corresponding to either the smallest or 
the largest eigenvalues). In Section 11.4 the power methods were applied only to 



TLFeBOOK 



QR ITERATIONS 



509 



computing real-valued eigenvalues, but it is noteworthy that power methods can 
be adapted to finding complex-conjugate eigenvalue pairs [19]. 

The QR iterations algorithms are, according to Watkins [15], due originally to
Francis [16] and Kublanovskaya [17]. The methodology involved in QR iterations
is based, in turn, on earlier work of H. Rutishauser performed in the 1950s. The
detailed theory and rationale for the QR iterations are not by any means straightforward,
and even the geometric arguments in Ref. 15 (based, in turn, on the work
of Parlett and Poole [18]) are not easy to follow. However, for matrices A ∈ C^{n×n}
that are dense (i.e., nonsparse; recall Section 4.7), that are not too large, and that
are nondefective, the QR iterations are the best approach presently known for finding
all possible eigenpairs of A. Indeed, the MATLAB eig function implements a
modern version of the QR iteration methodology.^5

Because of the highly involved nature of the QR iteration theory, we will only 
present a few of the main ideas here. Other than the references cited so far, the 
reader is referred to the literature [4,6,13,19] for more thorough discussions. Of 
course, these are not the only references available on this subject. 

Eigenvalue computations such as the QR iterations reduce large problems into 
smaller problems. Golub and Van Loan [4] present two lemmas that are involved 
in this reduction approach. Recall from Section 4.7 that s(A) denotes the set of all 
the eigenvalues of matrix A (and is also called the spectrum of A). 

Lemma 11.2:  If A ∈ C^{n×n} is of the form

    A = [ A_{00}  A_{01} ; 0  A_{11} ],

where A_{00} ∈ C^{p×p}, A_{01} ∈ C^{p×q}, A_{11} ∈ C^{q×q} (q + p = n), then s(A) = s(A_{00}) ∪
s(A_{11}).

Proof  Consider

    Ax = [ A_{00}  A_{01} ; 0  A_{11} ] [ x_1 ; x_2 ] = λ [ x_1 ; x_2 ]

(x_1 ∈ C^p and x_2 ∈ C^q). If x_2 ≠ 0, then A_{11} x_2 = λ x_2, and so we conclude that
λ ∈ s(A_{11}). On the other hand, if x_2 = 0, then A_{00} x_1 = λ x_1, so we must have
λ ∈ s(A_{00}). Thus, s(A) ⊆ s(A_{00}) ∪ s(A_{11}). Sets s(A) and s(A_{00}) ∪ s(A_{11}) have
the same cardinality (i.e., the same number of elements), and so s(A) = s(A_{00}) ∪
s(A_{11}).

^5 If A ∈ C^{n×n}, then [V, D] = eig(A) is such that

    A = V D V^{−1},

where D ∈ C^{n×n} is the diagonal matrix of eigenvalues of A and V ∈ C^{n×n} is the matrix whose columns
are the corresponding eigenvectors. The eigenvectors in V are "normalized" so that each eigenvector
has a 2-norm of unity.






Essentially, Lemma 11.2 states that if A is block upper triangular, then the eigenvalues
lie within the diagonal blocks.

Lemma 11.3:  If A ∈ C^{n×n}, B ∈ C^{p×p}, X ∈ C^{n×p} (with p ≤ n) satisfy

    AX = XB,  rank(X) = p,                                                  (11.53)

then there is a unitary Q ∈ C^{n×n} (so Q^{−1} = Q^H) such that

    Q^H A Q = T = [ T_{00}  T_{01} ; 0  T_{11} ],                           (11.54)

where T_{00} ∈ C^{p×p}, T_{01} ∈ C^{p×(n−p)}, T_{11} ∈ C^{(n−p)×(n−p)}, and s(T_{00}) =
s(A) ∩ s(B).

Proof  The QR decomposition idea from Section 4.6 generalizes to any X ∈
C^{n×p} with p ≤ n and rank(X) = p; that is, complex-valued Householder matrices
are available. Thus, there is a unitary matrix Q ∈ C^{n×n} such that

    X = Q [ R ; 0 ],

where R ∈ C^{p×p}. Substituting this into (11.53) yields

    [ T_{00}  T_{01} ; T_{10}  T_{11} ] [ R ; 0 ] = [ R ; 0 ] B,

where

    Q^H A Q = [ T_{00}  T_{01} ; T_{10}  T_{11} ].                          (11.55)

From (11.55), T_{10} R = 0, implying that T_{10} = 0 [yielding (11.54)], and also T_{00} R =
R B, implying that B = R^{−1} T_{00} R (R^{−1} exists because X is full-rank). T_{00} and
B are similar matrices, so s(B) = s(T_{00}). From Lemma 11.2 we have s(A) =
s(T_{00}) ∪ s(T_{11}). Thus, s(A) = s(B) ∪ s(T_{11}). From basic properties regarding sets
(distributive laws)

    s(T_{00}) ∩ s(A) = s(T_{00}) ∩ [s(B) ∪ s(T_{11})]
                     = [s(T_{00}) ∩ s(B)] ∪ [s(T_{00}) ∩ s(T_{11})]
                     = s(T_{00}) ∪ ∅,

implying that s(T_{00}) = s(T_{00}) ∩ s(A) = s(A) ∩ s(B). This statement [s(T_{00}) =
s(A) ∩ s(B)] really says that the eigenvalues of B are a subset of those of A.






Recall that a subspace of the vector space C^n is a subset of C^n that is also a vector
space. Suppose that we have the vectors x_0, ..., x_{m−1} ∈ C^n; then we may define
the spanning set as

    span(x_0, ..., x_{m−1}) = { Σ_{j=0}^{m−1} a_j x_j | a_j ∈ C }.          (11.56)

In particular, if S = span(x), where x is an eigenvector of A, then

    y ∈ S  ⟹  Ay ∈ S,

and so S is invariant for A, or invariant to the action of A. It is a subspace
(eigenspace) of C^n that is invariant to A. Lemmas 11.2 and 11.3 can be used to
establish the following important decomposition theorem (Theorem 7.4.1 in Ref.
4). We emphasize that it is for real-valued A only.

Theorem 11.13: Real Schur Decomposition  If A ∈ R^{n×n}, then there is an
orthogonal matrix Q ∈ R^{n×n} such that

    Q^T A Q = [ R_{00}  R_{01}  ···  R_{0,m−1}
                0       R_{11}  ···  R_{1,m−1}
                ⋮                    ⋮
                0       0       ···  R_{m−1,m−1} ] = R,                     (11.57)

where each R_{ii} is either 1 × 1 or 2 × 2. In the latter case R_{ii} will have a complex-conjugate
pair of eigenvalues.

Proof  The matrix A ∈ R^{n×n}, so det(λI − A) has real-valued coefficients, and
so complex eigenvalues of A always occur in conjugate pairs (recall Section 11.2).
Let k be the number of complex-conjugate eigenvalue pairs in s(A). We will
employ mathematical induction on k.

The theorem certainly holds for k = 0 via Lemmas 11.2 and 11.3, since real-valued
matrices are only a special case. Now we assume that k ≥ 1 (i.e., A possesses
at least one conjugate pair of eigenvalues). Suppose that an eigenvalue is λ =
α + jβ ∈ s(A) with β ≠ 0. There must be vectors x, y ∈ R^n (with y ≠ 0) such that

    A(x + jy) = (α + jβ)(x + jy),

or equivalently

    A[x y] = [x y] [ α  β ; −β  α ].                                        (11.58)

Since β ≠ 0, vectors x and y span a two-dimensional subspace of R^n that is
invariant to the action of A because of (11.58). From Lemma 11.3 there is an
orthogonal matrix U_1 ∈ R^{n×n} such that

    U_1^T A U_1 = [ T_{00}  T_{01} ; 0  T_{11} ],

where T_{00} ∈ R^{2×2}, T_{01} ∈ R^{2×(n−2)}, T_{11} ∈ R^{(n−2)×(n−2)}, and s(T_{00}) = {λ, λ*}. By
induction there is another orthogonal matrix U_2 such that U_2^T T_{11} U_2 has the necessary
structure. Equation (11.57) then follows by letting

    Q = U_1 [ I_2  0 ; 0  U_2 ],

where I_2 is the 2 × 2 identity matrix. Of course, this process may be repeated as
often as needed.

A method that reliably gives us the blocks R_{ii} for all i in (11.57) therefore gives
us all the eigenvalues of A, since R_{ii} is only 1 × 1 or 2 × 2, making its eigenvalues
easy to find in any case. The elements in the first subdiagonal of Q^T A Q of (11.57)
are not necessarily zero-valued (again because R_{ii} might be 2 × 2), so we say that
Q^T A Q is upper quasi-triangular.

Definition 11.5: Hessenberg Form  Matrix A ∈ C^{n×n} is in Hessenberg form
if a_{ij} = 0 for all i, j such that i − j > 1.

Technically, A in this definition is upper Hessenberg. Matrix A in Example 4.4
is Hessenberg. All upper triangular matrices are Hessenberg. The quasi-triangular
matrix Q^T A Q in Theorem 11.13 is Hessenberg.

A pseudocode for the basic QR iterations algorithm is

Input N; { Upper limit on the number of iterations }
Input A ∈ R^{n×n}; { Matrix we want to eigendecompose }
H_0 := Q_0^T A Q_0; { Reduce A to Hessenberg form }
k := 1;
while k <= N do begin
    H_{k−1} := Q_k R_k; { QR-decomposition step }
    H_k := R_k Q_k;
    k := k + 1;
end;

In this algorithm we emphasize that A is assumed to be real-valued. Generalization
to the complex case is possible but omitted. The statement H_0 = Q_0^T A Q_0 generally
involves applying an orthogonal transformation Q_0 ∈ R^{n×n} to reduce A to the Hessenberg
matrix H_0, although in principle this is not necessary. However, there are major
advantages (discussed below) to reducing A to Hessenberg form as a first step.
The basis for this initial reduction step is the following theorem, which proves that
such a step is always possible.






Theorem 11.14: Hessenberg Reduction  If A ∈ R^{n×n}, there is an orthogonal
matrix Q ∈ R^{n×n} such that Q^T A Q = H is Hessenberg.

Proof  From Section 4.6, in general there is an orthogonal matrix P such that
Px = ||x||_2 e_0 (e.g., P is a Householder matrix), where e_0 = [1 0 ··· 0]^T ∈ R^n
if x ∈ R^n.

Partition A according to

    A^{(0)} = A = [ a_{00}^{(0)}  b_0^T ; a_0  A_{11}^{(0)} ],

where a_{00}^{(0)} ∈ R and a_0, b_0 ∈ R^{n−1}, A_{11}^{(0)} ∈ R^{(n−1)×(n−1)}. Let P_1 be orthogonal such
that P_1 a_0 = ||a_0||_2 e_0 (e_0 ∈ R^{n−1}). Define

    Q_1 = [ 1  0^T ; 0  P_1 ],

and clearly Q_1^{−1} = Q_1^T (i.e., Q_1 is also orthogonal). Thus

    A^{(1)} = Q_1 A^{(0)} Q_1^T = [ a_{00}^{(0)}  b_0^T P_1^T ; ||a_0||_2 e_0  P_1 A_{11}^{(0)} P_1^T ].

The first column of A^{(1)} satisfies the Hessenberg condition since it is
[a_{00}^{(0)}  ||a_0||_2  0 ··· 0]^T (with n − 2 zeros). The process may be repeated again by partitioning A^{(1)}
according to

    A^{(1)} = [ A_{00}^{(1)}  b_1^T ; [0 a_1]  A_{11}^{(1)} ],

where A_{00}^{(1)} ∈ R^{2×2}, and [0 a_1], b_1 ∈ R^{(n−2)×2}, A_{11}^{(1)} ∈ R^{(n−2)×(n−2)}. Let P_2 be
orthogonal such that P_2 a_1 = ||a_1||_2 e_0 (e_0 ∈ R^{n−2}). Define

    Q_2 = [ I_2  0 ; 0  P_2 ],

where I_2 is the 2 × 2 identity matrix. Thus

    A^{(2)} = Q_2 A^{(1)} Q_2^T = Q_2 Q_1 A^{(0)} Q_1^T Q_2^T
            = [ A_{00}^{(1)}  b_1^T P_2^T ; [0  ||a_1||_2 e_0]  P_2 A_{11}^{(1)} P_2^T ],

and the first two columns of A^{(2)} satisfy the Hessenberg condition. Of course, we
may continue in this fashion, finally yielding

    A^{(n−2)} = Q_{n−2} ··· Q_2 Q_1 A Q_1^T Q_2^T ··· Q_{n−2}^T,

which is Hessenberg. We may define Q^T = Q_{n−2} ··· Q_2 Q_1 and H = A^{(n−2)},
which is the claim made in the theorem statement.

Thus, Theorem 11.14 contains a prescription for finding H_0 = Q_0^T A Q_0 as well
as a simple proof of existence of the decomposition. Hessenberg reduction is done
to facilitate reducing the amount of computation per iteration. Clearly, A and H_0
are similar matrices and so possess the same eigenvalues.

From the pseudocode, for k = 1, 2, ..., N, we obtain

    H_{k−1} = Q_k R_k,
    H_k = R_k Q_k,

which yields

    H_N = Q_N^T ··· Q_2^T Q_1^T H_0 Q_1 Q_2 ··· Q_N,                        (11.59)

and therefore

    H_N = Q_N^T ··· Q_2^T Q_1^T Q_0^T A Q_0 Q_1 Q_2 ··· Q_N.                (11.60)

Matrices H_N and A are similar, and so have the same eigenvalues for any N. It is
important to note that if Q_k is constructed properly, then H_k is Hessenberg for all
k. As explained in Golub and Van Loan [4, Section 7.4.2], the use of orthogonal
matrices Q_k based on the 2 × 2 rotation operator (matrix A from Example 11.3)
is recommended. These orthogonal matrices are called Givens matrices, or Givens
rotations. The result is an algorithm that needs only O(n^2) flops per iteration
instead of O(n^3) flops. Overall computational complexity is still O(n^3) flops, due
to the initial Hessenberg reduction step. It is to be noted that the rounding error
performance of the suggested algorithm is quite good [19].

We have already noted the desirability of the real Schur decomposition of A
into R according to Q^T A Q = R in (11.57). In fact, with proper attention to details
(many of which cannot be considered here), the QR iterations method is an excellent
means to find Q and R in that, in any valid matrix norm,

    lim_{N→∞} H_N = R                                                       (11.61)

and

    Π_{i=0}^{∞} Q_i = Q                                                     (11.62)





(of course, Π_{i=0}^{N} Q_i = Q_0 Q_1 ··· Q_N; the ordering of the factors in the product is
important since the Q_i are matrices). The formal proof of this is rather difficult,
and so it is omitted.

Suppose that we have

    A = [ 0  0  ···  0  −a_n
          1  0  ···  0  −a_{n−1}
          0  1  ···  0  −a_{n−2}
          ⋮            ⋮
          0  0  ···  1  −a_1 ] ∈ R^{n×n}.                                   (11.63)

It can be shown that

    p(λ) = det(λI − A) = λ^n + a_1 λ^{n−1} + a_2 λ^{n−2} + ··· + a_{n−1} λ + a_n.   (11.64)

Matrix A is called a companion matrix. We see that it is easy to obtain (11.64) from
(11.63), or vice versa. We also see that A is Hessenberg. Because of (11.61), we
may conceivably input A of (11.63) into the basic QR iterations algorithm (omitting
the initial Hessenberg reduction step), and so determine the roots of p(λ) = 0, as
these are the eigenvalues of A. Since p(λ) is essentially arbitrary, except that it
should not yield a defective A, we have an algorithm to solve the polynomial zero-finding
problem that was mentioned in Chapter 7. Unfortunately, it has been noted
[4,19] that this is not necessarily a stable method for finding polynomial zeros.
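The connection between (11.63) and polynomial zero finding can be demonstrated with a short MATLAB sketch (our own illustration, using the polynomial of Example 11.8 below); MATLAB's roots function is itself based on a companion-matrix eigenvalue computation of this kind.

    % Polynomial zeros as eigenvalues of a companion matrix (illustration only)
    a = [1, -(2+sqrt(2)), 2*sqrt(2)+3, -2*(1+sqrt(2)), 2];   % coefficients of p(lambda)
    n = length(a) - 1;
    C = [[zeros(1, n-1); eye(n-1)], -a(end:-1:2)'];          % companion matrix in the form (11.63)
    disp(sort(eig(C)))     % eigenvalues of C ...
    disp(sort(roots(a)))   % ... agree with the zeros of p(lambda)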

Example 11.8  Suppose that

    p(λ) = (λ^2 − 2λ + 2)(λ^2 − √2 λ + 1)
         = λ^4 − (2 + √2)λ^3 + (2√2 + 3)λ^2 − 2(1 + √2)λ + 2,

which has the zeros

    λ = 1 ± j,  λ = (1/√2)(1 ± j).

After 50 iterations of the basic QR iterations algorithm, we obtain

    H_50 = [ 0.5000   -6.0355    0.9239    5.3848
             0.2071    1.5000   -0.3827   -2.2304
             0.0000    0.0000    0.7071   -0.7071
             0.0000    0.0000    0.7071    0.7071 ]
         = [ R_{00}  R_{01} ; 0  R_{11} ],

where R_{ij} ∈ R^{2×2}. The reader may wish to confirm that R_{00} has the eigenvalues
1 ± j, and that R_{11} has the eigenvalues (1/√2)(1 ± j).

On the other hand, for

    p(λ) = (λ + 1)(λ^2 − 2λ + 2)(λ^2 − √2 λ + 1),

which is a slight modification of the previous example, the basic QR iterations
algorithm will fail to converge.

The following point is also important. In (11.60), define Q = Q_0 Q_1 ··· Q_N,
so that H_N = Q^T A Q. Now suppose that A = A^T. Clearly, H_N^T = Q^T A^T Q =
Q^T A Q = H_N. This implies that H_N will be tridiagonal (defined in Section 6.5)
for a real and symmetric A. Because of (11.61), we must now have

    lim_{N→∞} H_N = diag(R_{0,0}, R_{1,1}, ..., R_{m−1,m−1}) = D,

where each R_{i,i} is 1 × 1. Thus, D is the diagonal matrix of eigenvalues of A. Also,
we must have Π_{i=0}^{∞} Q_i as the corresponding matrix of eigenvectors of A.

Example 11.9  Suppose that A is the matrix from Example 11.5:

    A = [ 4 1 0
          1 4 1
          0 1 4 ].

We see that A is in Hessenberg form already. After 36 iterations of the basic QR
iteration algorithm, we obtain

    H_36 = [ 5.4142  0.0000  0.0000
             0.0000  4.0000  0.0000
             0.0000  0.0000  2.5858 ],

so the matrix is diagonal and reveals all the eigenvalues of A. Additionally, we have

    Π_{i=0}^{36} Q_i = [ 0.5000  -0.7071   0.5000
                         0.7071   0.0000  -0.7071
                         0.5000   0.7071   0.5000 ],

which is a good approximation to the eigenvectors of A.
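A bare-bones MATLAB sketch of the basic QR iterations (our own illustration; it relies on the built-in hess and qr functions and has none of the convergence tests or shifts a production implementation would use) is given below.

    function [H, V] = basicqr(A, N)
    % Basic (unshifted) QR iterations; H tends toward the real Schur form of A.
    [Q0, H] = hess(A);            % Hessenberg reduction: A = Q0*H*Q0', so H0 = Q0'*A*Q0
    V = Q0;
    for k = 1:N
        [Qk, Rk] = qr(H);         % QR-decomposition step: H_{k-1} = Qk*Rk
        H = Rk*Qk;                % H_k = Rk*Qk
        V = V*Qk;                 % accumulate the product of the Qk (eigenvectors when A = A')
    end

    % Example usage (Example 11.9):
    % A = [4 1 0; 1 4 1; 0 1 4];
    % [H, V] = basicqr(A, 36)     % H ~ diag(5.4142, 4.0000, 2.5858)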

We noted in Section 11.4 that shifting can be used to accelerate convergence in
power methods. Similarly, shifting can be employed in QR iterations to achieve
the same result. Indeed, all modern implementations of QR iterations incorporate
some form of shifting for this reason. The previous basic QR iteration algorithm
may be modified to incorporate the shift parameter μ ∈ R. The overall structure of
the result is described by the following pseudocode:

Input N; { Upper limit on the number of iterations }
Input A ∈ R^{n×n}; { Matrix we want to eigendecompose }
H_0 := Q_0^T A Q_0; { Reduce A to Hessenberg form }
k := 1;
while k <= N do begin
    Determine the shift parameter μ ∈ R;
    H_{k−1} − μI := Q_k R_k; { QR-decomposition step }
    H_k := R_k Q_k + μI;
    k := k + 1;
end;

The reader may readily confirm that we still have

    H_N = Q_N^T ··· Q_1^T Q_0^T A Q_0 Q_1 ··· Q_N,

just as we had in (11.60) for the basic QR iteration algorithm. Thus, we again find
that H_k is similar to A for all k. Perhaps the simplest means to generate μ is the
single-shift QR iterations algorithm:

Input N; { Upper limit on the number of iterations }
Input A ∈ R^{n×n}; { Matrix we want to eigendecompose }
H_0 := Q_0^T A Q_0; { Reduce A to Hessenberg form }
k := 1;
while k <= N do begin
    μ_k := H_{k−1}(n − 1, n − 1); { μ_k is the lower right corner element of H_{k−1} }
    H_{k−1} − μ_k I := Q_k R_k; { QR-decomposition step }
    H_k := R_k Q_k + μ_k I;
    k := k + 1;
end;

We note that μ is not fixed in general from one iteration to the next. Basically, μ
varies from iteration to iteration in order to account for new information about s(A)
as the subdiagonal entries of H_k converge to zero. We will avoid the technicalities
involved in a full justification of this approach, except to mention that it is flawed,
and that more sophisticated shifting methods are needed for an acceptable algorithm
(e.g., the double shift [4,19]). However, the following example shows that shifting
in this way does speed convergence.

Example 11.10  If we apply the single-shift QR iterations algorithm to A in
Example 11.9, we obtain the following matrix in only one iteration:

    H_1 = [ 4.0000  -1.4142  0.0000
           -1.4142   4.0000  0.0000
            0.0000   0.0000  4.0000 ].

This matrix certainly does not have the structure of H_36 in Example 11.9, but the
eigenvalues of the submatrix

    [ 4.0000  -1.4142 ; -1.4142  4.0000 ]

are 5.4142, 2.5858.




Finally, we mention that our pseudocodes assume a user-specified number of 
iterations N. This is not convenient, and is inefficient in practice. Criteria to auto- 
matically terminate the QR iterations without user intervention are available, but a 
discussion of this matter is beyond our scope. 



REFERENCES 

1. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.), 
Random House, New York, 1988. 

2. P. Halmos, Finite Dimensional Vector Spaces, Van Nostrand, New York, 1958. 

3. R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge Univ. Press, Cambridge, 
UK, 1985. 

4. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., Johns Hopkins Univ. 
Press, Baltimore, MD, 1989. 

5. F. W. Fairman, "On Using Singular Value Decomposition to Obtain Irreducible Jordan 
Realizations," in Linear Circuits, Systems and Signal Processing: Theory and Appli- 
cation, C. I. Byrnes, C. F. Martin, and R. E. Saeks, eds., North-Holland, Amsterdam, 
1988, pp. 35-40. 

6. G. Stewart, Introduction to Matrix Computations, Academic Press, New York, 1973. 

7. C. Moler and C. Van Loan, "Nineteen Dubious Ways to Compute the Exponential of a
Matrix," SIAM Rev. 20, 801-836 (Oct. 1978).

8. I. E. Leonard, "The Matrix Exponential," SIAM Rev. 38, 507-512 (Sept. 1996). 

9. B. Noble and J. W. Daniel, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs, 
NJ, 1977. 

10. W. R. Derrick and S. I. Grossman, Elementary Differential Equations with Applications, 
2nd ed., Addison-Wesley, Reading, MA, 1981. 

11. W. T. Reid, Ordinary Differential Equations, Wiley, New York, 1971. 

12. C. Van Loan, "The Sensitivity of the Matrix Exponential," SIAM J. Numer. Anal. 14, 
971-981 (Dec. 1977). 

13. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Math- 
ematics series, Vol. 37). Springer- Verlag, New York, 2000. 

14. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New 
York, 2002. 

15. D. S. Watkins, "Understanding the QR Algorithm," SIAM Rev. 24, 427-440 (Oct. 
1982). 

16. J. G. F. Francis, "The QR Transformation: A Unitary Analogue to the LR Transforma- 
tions, Parts I and II," Comput. J. 4, 265-272, 332-345 (1961). 

17. V. N. Kublanovskaya, "On Some Algorithms for the Solution of the Complete Eigen- 
value Problem," USSR Comput. Math. Phys. 3, 637-657, (1961). 

18. B. N. Parlett and W. G. Poole, Jr., "A Geometric Theory for the QR, LU, and Power 
Iterations," SIAM J. Numer. Anal. 10, 389-412 (1973). 

19. J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK, 
1965. 



TLFeBOOK 



PROBLEMS 



519 



PROBLEMS 

11.1. Aided with at most a pocket calculator, find all the eigenvalues and eigen- 
vectors of the following matrices: 



(a) A = 



(b) B = 



(c) C 



4 
1 


1 

4 




" 





-2 " 


1 





1 





1 


2 


" 


1 


l - 

4 








1 

4 


1 





1 



(d) D 



\-j 

~j 2 

11.2. A conic section in R^2 is described in general by

    α x_0^2 + 2β x_0 x_1 + γ x_1^2 + δ x_0 + ε x_1 + ρ = 0.                 (11.P.1)

(a) Show that (11.P.1) can be rewritten as a quadratic form:

        x^T A x + g^T x + ρ = 0,                                            (11.P.2)

    where x = [x_0 x_1]^T, and A = A^T.

(b) For a conic section in standard form A is diagonal. Suppose that A is
    diagonal. State the conditions on the diagonal elements that result in
    (11.P.2) describing an ellipse, a parabola, and a hyperbola.

(c) Suppose that (11.P.2) is not in standard form (i.e., A is not a diagonal
    matrix). Explain how similarity transformations might be used to place
    (11.P.2) in standard form.

11.3. Consider the companion matrix

    C = [ 0  0  ···  0  −c_n
          1  0  ···  0  −c_{n−1}
          0  1  ···  0  −c_{n−2}
          ⋮            ⋮
          0  0  ···  1  −c_1 ] ∈ C^{n×n}.

Prove that

    p_n(λ) = det(λI − C) = λ^n + c_1 λ^{n−1} + c_2 λ^{n−2} + ··· + c_{n−1} λ + c_n.

11.4. Suppose that the eigenvalues of A ∈ C^{n×n} are λ_0, λ_1, ..., λ_{n−1}. Find the
eigenvalues of A + αI, where I is the order n identity matrix and α ∈ C
is a constant.

11.5. Suppose that A ∈ R^{n×n} is orthogonal (i.e., A^{−1} = A^T); then, if λ is an
eigenvalue of A, show that we must have |λ| = 1.

11.6. Find all the eigenvalues of

    A = [ cos θ   0       −sin θ   0
          0       cos φ   0        −sin φ
          sin θ   0       cos θ    0
          0       sin φ   0        cos φ ].

(Hint: The problem is simplified by using permutation matrices.)

11.7. Consider the following definition: A, B ∈ C^{n×n} are simultaneously diagonalizable
if there is a similarity matrix S ∈ C^{n×n} such that S^{−1}AS and
S^{−1}BS are both diagonal matrices. Show that if A, B ∈ C^{n×n} are simultaneously
diagonalizable, then they commute (i.e., AB = BA).

11.8. Prove the following theorem. Let A, B ∈ C^{n×n} be diagonalizable. Then
A and B commute iff they are simultaneously diagonalizable.

11.9. Matrix A ∈ C^{n×n} is a square root of B ∈ C^{n×n} if A^2 = B. Show that every
diagonalizable matrix in C^{n×n} has a square root.

11.10. Prove the following (Bauer–Fike) theorem (which says something about
how perturbations of matrices affect their eigenvalues). If γ is an eigenvalue
of A + E ∈ R^{n×n}, and T^{−1}AT = D = diag(λ_0, λ_1, ..., λ_{n−1}), then

    min_{λ ∈ s(A)} |λ − γ| ≤ κ_2(T) ||E||_2.

Recall that s(A) denotes the set of eigenvalues of A (i.e., the spectrum of
A). [Hint: If γ ∈ s(A), the result is certainly true, so we need consider
only the situation where γ ∉ s(A). Confirm that if T^{−1}(A + E − γI)T
is singular, then so is I + (D − γI)^{−1}(T^{−1}ET). Note that if for some
B ∈ R^{n×n} the matrix I + B is singular, then (I + B)x = 0 for some x ∈ R^n
that is nonzero, so ||x||_2 = ||Bx||_2, and so ||B||_2 ≥ 1. Consider upper and
lower bounds on the norm ||(D − γI)^{−1}(T^{−1}ET)||_2.]

11.11. The Daubechies 4-tap scaling function φ(t) satisfies the two-scale difference
equation

    φ(t) = p_0 φ(2t) + p_1 φ(2t − 1) + p_2 φ(2t − 2) + p_3 φ(2t − 3),       (11.P.3)

where supp φ(t) = [0, 3] ⊂ R [i.e., φ(t) is nonzero only on the interval
[0, 3]], and where

    p_0 = (1/4)(1 + √3),  p_1 = (1/4)(3 + √3),
    p_2 = (1/4)(3 − √3),  p_3 = (1/4)(1 − √3).

Note that the solution to (11.P.3) is continuous (an important fact).

(a) Find the matrix M such that

        M φ = φ,                                                            (11.P.4)

    where φ = [φ(1) φ(2)]^T ∈ R^2.

(b) Find the eigenvalues of M. Find the solution φ to (11.P.4).

(c) Take φ from (b) and multiply it by a constant α such that α Σ_{k=1}^{2} φ(k) =
    1 (i.e., replace φ by the normalized form αφ).

(d) Using the normalized vector from (c) (i.e., αφ), find φ(k/2) for all
    k ∈ Z.

[Comment: Computation of the Daubechies 4-tap scaling function is the
first major step in computing the Daubechies 4-tap wavelet. The process
suggested in (d) may be continued to compute φ(k/2^J) for any k ∈ Z, and
for any positive integer J. The algorithm suggested by this is often called
the interpolatory graphical display algorithm (IGDA).]

11.12. Prove the following theorem. Suppose A ∈ R^{n×n} and A = A^T. Then A > 0
iff A = P^T P for some nonsingular matrix P ∈ R^{n×n}.

11.13. Let A ∈ R^{n×n} be symmetric with eigenvalues

    λ_0 ≤ λ_1 ≤ ··· ≤ λ_{n−2} ≤ λ_{n−1}.

Show that for all x ∈ R^n

    λ_0 x^T x ≤ x^T A x ≤ λ_{n−1} x^T x.

[Hint: Use the fact that there is an orthogonal matrix P such that P^T A P =
Λ (diagonal matrix of eigenvalues of A). Partition P in terms of its row
vectors.]

11.14. Section 11.3 presented a method of computing e^{At} "by hand." Use this
method to

(a) Derive (10.109) (in Example 10.9).

(b) Derive (10.P.9) in Problem 10.23.

(c) Find a closed-form expression for e^{At}, where

        A = [ λ 1 0
              0 λ 1
              0 0 λ ].
11.15. This exercise confirms that eigenvalues and singular values are definitely
not the same thing. Consider the matrix

    A = [ 2 0 ; 1 1 ].

Use the MATLAB eig and svd functions to find the eigenvalues and singular
values of A.

11.16. Show that e^{(A+B)t} = e^{At} e^{Bt} for all t ∈ R if AB = BA. Does e^{(A+B)t} =
e^{At} e^{Bt} always hold for all t ∈ R when AB ≠ BA? Justify your answer.

11.17. This problem is an introduction to Floquet theory. Consider a linear system
with state vector x(t) ∈ R^n for all t ∈ R such that

    dx(t)/dt = A(t) x(t)                                                    (11.P.5)

for some A(t) ∈ R^{n×n} (all t ∈ R), and such that A(t + T) = A(t) for some
T > 0 [so that A(t) is periodic with period T]. Let Φ(t) be the fundamental
matrix of the system such that

    dΦ(t)/dt = A(t) Φ(t),  Φ(0) = I.                                        (11.P.6)

(a) Let Ψ(t) = Φ(t + T), and show that

        dΨ(t)/dt = A(t) Ψ(t).                                               (11.P.7)

(b) Show that Φ(t + T) = Φ(t) C, where C = Φ(T). [Hint: Equations
    (11.P.6) and (11.P.7) differ only in their initial conditions [i.e., Ψ(0) =
    what?].]

(c) Assume that C^{−1} exists for some C ∈ R^{n×n}, and that there exists some
    R ∈ R^{n×n} such that C = e^{TR}. Define P(t) = Φ(t) e^{−tR}, and show that
    P(t + T) = P(t). [Thus, Φ(t) = P(t) e^{tR}, which is the general form of
    the solution to (11.P.6).]

(Comment: Further details of the theory of solution of (11.P.5) based on
working with (11.P.6) may be found in E. A. Coddington and N. Levinson,
Theory of Ordinary Differential Equations, McGraw-Hill, New York, 1955.
The main thing for the student to notice is that the theory involves matrix
exponentials.)



11.18. Consider the matrix

    A = [ 4 2 0 0
          1 4 1 0
          0 1 4 1
          0 0 2 4 ].

(a) Create a MATLAB routine that implements the scaled power algorithm,
    and use your routine to find the largest eigenvalue of A.

(b) Create a MATLAB routine that implements the shifted inverse power
    algorithm, and use your routine to find the smallest eigenvalue of A.

11.19. For A in the previous problem, find κ_2(A) using the MATLAB routines that
you created to solve the problem.

11.20. If A ∈ R^{n×n} and A = A^T, then, for some x ∈ R^n such that x ≠ 0, we define the Rayleigh quotient of A and x to be the ratio x^T Ax / x^T x (= ⟨x, Ax⟩ / ⟨x, x⟩). The Rayleigh quotient iterative algorithm is described as

    k := 0;
    while k < N do begin
        μ_k := z_k^T A z_k / (z_k^T z_k);
        (A − μ_k I) y_k := z_k;      {solve for y_k}
        z_{k+1} := y_k / ||y_k||_2;
        k := k + 1;
    end;

The user inputs z_0, which is the initial guess about the eigenvector. Note that the shift μ_k is changed (i.e., updated) at every iteration. This has the effect of accelerating convergence (i.e., of reducing the number of iterations needed to achieve an accurate solution). However, A_k = A − μ_k I needs to be factored anew with every iteration as a result. Prove the following theorem. Let A ∈ R^{n×n} be symmetric and (λ, x) be an eigenpair for A. If y ≈ x and μ = y^T Ay / y^T y with ||x||_2 = ||y||_2 = 1, then

    |λ − μ| ≤ ||A − λI||_2 ||x − y||_2².

[Comment: The norm ||A − λI||_2 is an A-dependent constant, while ||x − y||_2² = ||e||_2² is the square of the size of the error between eigenvector x and the estimate y of it. So the size of the error between λ and the estimate μ (i.e., |λ − μ|) is at worst proportional to ||e||_2². This explains the fast convergence of the method (i.e., only a relatively small N is usually needed). Note that the proof uses (4.31).]
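As a point of orientation only (this is not part of the problem statement), a minimal MATLAB sketch of the Rayleigh quotient iteration described above might look as follows; the symmetric matrix A, the initial guess z0, and the iteration count N are assumed inputs, and no convergence test is included.

    function [mu, z] = rqi(A, z0, N)
    % Rayleigh quotient iteration (sketch).
    z = z0 / norm(z0);
    for k = 1:N
        mu = (z' * A * z) / (z' * z);       % Rayleigh quotient (the shift mu_k)
        y  = (A - mu * eye(size(A))) \ z;   % solve (A - mu*I) y = z for y
        z  = y / norm(y);                   % normalize to get the next iterate
    end
    mu = (z' * A * z) / (z' * z);           % final eigenvalue estimate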






11.21. Write a MATLAB routine that implements the basic QR iteration algorithm. You may use the MATLAB function qr to perform QR factorizations. Test your program out on the following matrices:

(a)
        [ 4  2  0  0 ]
        [ 1  4  1  0 ]
        [ 0  1  4  1 ]
        [ 0  0  2  4 ]

(b) 



-\ 

1 ° ° 1 + 75 

1 -§->/2 

1 1 + V2 



Use other built-in MATLAB functions (e.g., roots or eig) to verify your answers. Iterate enough to obtain entries for H_n that are accurate to four decimal places.

11.22. Repeat the previous problem using your own MATLAB implementation of 
the single-shift QR iteration algorithm. Compare the number of iterations 
needed to obtain four decimal places of accuracy with the result from the 
previous problem. 
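For orientation, a bare-bones MATLAB sketch of the basic (unshifted) QR iteration asked for in Problem 11.21 is given below; the iteration count N is an assumed input, and no Hessenberg reduction or convergence test is included.

    function Ak = qr_iteration(A, N)
    % Basic (unshifted) QR iteration (sketch): the diagonal entries of Ak
    % approach the eigenvalues of A for real matrices with real eigenvalues.
    Ak = A;
    for k = 1:N
        [Q, R] = qr(Ak);   % QR factorization of the current iterate
        Ak = R * Q;        % reverse the factors to form the next iterate
    end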

11.23. Suppose that X, Y ∈ R^{n×n}, and we define the matrices

    A = X + jY,    B = [ X  −Y
                         Y   X ].

Show that if λ ∈ s(A) is real-valued, then λ ∈ s(B). Find a relationship between the corresponding eigenvectors.






12 



Numerical Solution of Partial 
Differential Equations 



12.1 INTRODUCTION 

The subject of partial differential equations (PDEs) with respect to the matter of 
their numerical solution is impossibly large to properly cover within a single chapter 
(or, for that matter, even within a single textbook). Furthermore, the development of 
numerical methods for PDEs is a highly active area of research, and so it continues 
to be a challenge to decide what is truly "fundamental" material to cover at an 
introductory level. In this chapter we shall place emphasis on wave propagation 
problems modeled by hyperbolic PDEs (defined in Section 12.2). We will consider 
especially the finite-difference time-domain (FDTD) method [8], as this appears to 
be gaining importance in such application areas as modeling of the scattering of 
electromagnetic waves from particles and objects and modeling of optoelectronic 
systems. We will only illustrate the method with respect to planar electromagnetic 
wave propagation problems at normal incidence. However, prior to this we shall 
give an overview of PDEs, including how they are classified into elliptic, parabolic, 
and hyperbolic types. 



12.2 A BRIEF OVERVIEW OF PARTIAL DIFFERENTIAL EQUATIONS 

In this section we define some notation and terminology that is used throughout 
the chapter. We explain how second-order PDEs are classified. We also summarize 
some problems that will not be covered within this book, simply citing references 
where the interested reader can find out more. 

We will consider only two-dimensional functions u(x, t) e R (or u(x, y) e R), 
where the independent variable x is interpreted as a space variable, and independent 
variable t is interpreted as time. The order of a PDE is the order of the highest 
derivative. For our purposes, we will never consider PDEs of an order greater than 
2. Common shorthand notation for partial derivatives includes 



    u_x = ∂u/∂x,   u_t = ∂u/∂t,   u_xt = ∂²u/(∂t∂x),   u_xx = ∂²u/∂x².




If the PDE has solution u(x, y), where x and y are spatial variables, then often we are interested only in approximating the solution on a bounded region (i.e., a bounded subset of R²). However, we will consider mainly a PDE with solution u(x, t), where, as already noted, t is time, and so we consider u(x, t) only for x ∈ [a, b] = [x_0, x_f] and t ∈ [t_0, t_f]. Commonly, t_0 = 0, x_0 = 0, and x_f = L, with t_f = T. We wish to approximate the PDE solution u(x, t) at grid points (mesh points) much as we did in the problem of numerically solving ODEs as considered in Chapter 10. Thus, we wish to approximate u(x_k, t_n), where

    x_k = x_0 + hk,   t_n = t_0 + τn,                          (12.1)

such that

    h = (x_f − x_0)/M,   τ = (t_f − t_0)/N,                    (12.2)

and so k = 0, 1, ..., M, and n = 0, 1, ..., N. This implies that we assume sampling on a uniform two-dimensional grid defined on the xt plane [(x, t) plane]. Commonly, the numerical approximation to u(x_k, t_n) is denoted by u_{k,n}. The index k is then a space index, and n is a time index.

There is a classification scheme for second-order linear PDEs. According to Kreyszig [1], Myint-U and Debnath [2], and Courant and Hilbert [9], a PDE of the form

    A u_xx + 2B u_xy + C u_yy = F(x, y, u, u_x, u_y)           (12.3)

is elliptic if AC − B² > 0, parabolic if AC − B² = 0, and hyperbolic if AC − B² < 0.¹ It is possible for A, B, and C to be functions of x and y, in which case (12.3) may be of a type (elliptic, parabolic, hyperbolic) that varies with x and y. For example, (12.3) might be hyperbolic in one region of R² but parabolic in another. Of course, the terminology as to type remains the same when space variable y is replaced by time variable t.

An example of an elliptic PDE is the Poisson equation from electrostatics [10]

    V_xx + V_yy = −ρ(x, y)/ε,                                  (12.4)

where the solution V(x, y) is the electrical potential (e.g., in units of volts) at the spatial location (x, y) in R², constant ε is the permittivity of the medium (e.g., in units of farads per meter), and ρ(x, y) is the charge density (e.g., in units of coulombs per square meter) at the spatial point (x, y). We have assumed that the permittivity is a constant, but it can vary spatially as well. Certainly, (12.4)

¹This classification scheme is related to the classification of conic sections on the Cartesian plane. The general equation for such a conic on R² is

    Ax² + Bxy + Cy² + Dx + Ey + F = 0.

The conic is hyperbolic if B² − 4AC > 0, parabolic if B² − 4AC = 0, and elliptic if B² − 4AC < 0.




has the form of (12.3), where for u(x, y) = V(x, y) we have B = 0, A = C = 1, and F(x, y, u, u_x, u_y) = −ρ(x, y)/ε. Therefore, AC − B² = 1 > 0, confirming that (12.4) is elliptic.

To develop an approximate method of solving (12.4), one may employ finite differences. For example, from Taylor series theory

    ∂²V(x_k, y_n)/∂x² = [V(x_{k+1}, y_n) − 2V(x_k, y_n) + V(x_{k−1}, y_n)]/h² − (h²/12) ∂⁴V(ξ_k, y_n)/∂x⁴      (12.5a)

for some ξ_k ∈ [x_{k−1}, x_{k+1}], and

    ∂²V(x_k, y_n)/∂y² = [V(x_k, y_{n+1}) − 2V(x_k, y_n) + V(x_k, y_{n−1})]/τ² − (τ²/12) ∂⁴V(x_k, η_n)/∂y⁴      (12.5b)

for some η_n ∈ [y_{n−1}, y_{n+1}], where x_k = x_0 + hk, y_n = y_0 + τn [recall (12.1) and (12.2)]. The finite-difference approximation to (12.4) is thus

    [V_{k+1,n} − 2V_{k,n} + V_{k−1,n}]/h² + [V_{k,n+1} − 2V_{k,n} + V_{k,n−1}]/τ² = −ρ(x_k, y_n)/ε.            (12.6)



Here k = 0, 1, ..., M, and n = 0, 1, ..., N. Depending on ρ(x, y) and the boundary conditions, it is possible to rewrite (12.6) as a linear system of equations in the unknown (approximate) potentials V_{k,n}. In practice, N and M may be large, and so the linear system of equations will be of high order, consisting of O(NM) unknowns to solve for. Because of the structure of (12.6), the linear system is a sparse one, too. It has therefore been pointed out [3,7,11] that iterative solution methods are preferred, such as the Jacobi or Gauss–Seidel methods (recall Section 4.7). This avoids the problems inherent in storing and manipulating large dense matrices. Epperson [11] notes that in recent years conjugate gradient methods have begun to displace Gauss–Seidel/Jacobi approaches to solving large and sparse linear systems such as are generated from (12.6). In part this is due to the difficulties inherent in obtaining the optimal value for the relaxation parameter ω (recall the definition from Section 4.7).
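As a minimal sketch of such an iterative approach (assuming, purely for illustration, a square grid with h = τ, zero potential on the boundary, and a user-supplied charge density array rho), a Jacobi sweep for (12.6) might be coded in MATLAB as follows.

    % Jacobi iteration for the discretized Poisson equation (12.6), assuming
    % h = tau, V = 0 on the boundary, and rho given on the grid. The arrays V
    % and rho are (M+1)-by-(N+1), with entry (i,j) at grid point x_{i-1}, y_{j-1}.
    % M, N, h, eps0 (the permittivity), and rho are assumed inputs.
    V = zeros(M+1, N+1);               % initial guess; boundary values stay 0
    for iter = 1:500                   % fixed sweep count for this sketch
        Vold = V;
        for i = 2:M                    % interior points in x
            for j = 2:N                % interior points in y
                V(i,j) = 0.25*( Vold(i+1,j) + Vold(i-1,j) ...
                              + Vold(i,j+1) + Vold(i,j-1) ...
                              + h^2 * rho(i,j)/eps0 );
            end
        end
    end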

An example of a parabolic PDE is sometimes called the heat equation, or diffusion equation [2–4, 11], since it models one-dimensional diffusion processes such as the flow of heat through a metal bar. The general form of the basic parabolic PDE is

    u_t = a² u_xx                                              (12.7)

for x ∈ [0, L], t ∈ R⁺. Here u(x, t) could be the temperature of some material at (x, t). It could also be the concentration of some chemical substance that is diffusing out from some source. (Of course, other physical interpretations are possible.) Typical boundary conditions are

    u(0, t) = 0,   u(L, t) = 0   for t > 0,                    (12.8a)




and the initial condition is

    u(x, 0) = f(x).                                            (12.8b)

The initial condition might be an initial temperature distribution, or a chemical concentration. If u(x, t) is interpreted as temperature, then the boundary conditions (12.8a) state that the ends of the one-dimensional medium are held at a constant temperature of 0 (e.g., 0 degrees Celsius).

Equation (12.7) has the form of (12.3) with B = C = 0, and hence AC − B² = 0, which is the criterion for a parabolic PDE. Finite-difference schemes analogous to the case for elliptic PDEs may be developed. Classically, perhaps the most popular choice is the Crank–Nicolson method, summarized by Burden and Faires [3] but given a more detailed treatment by Epperson [11].

Another popular numerical solution technique for PDEs is the finite-element method (FEM). It applies to a broad class of PDEs, and there are many commercially available software packages that implement this approach for various applications such as structural vibration analysis or electromagnetics. However, we will not consider the FEM here, as it deserves its own textbook. The interested reader can see Strang and Fix [5] or Brenner and Scott [6] for details. A brief introduction appears in Burden and Faires [3].

As stated earlier, the emphasis in this book will be on wave propagation prob- 
lems as modeled by hyperbolic PDEs. We now turn our attention to this class 
of PDEs. 



12.3 APPLICATIONS OF HYPERBOLIC PDEs 

In this section we summarize two problems that illustrate how hyperbolic PDEs 
arise in practice. In later sections we will see that although both involve the mod- 
eling of waves propagating in physical systems, the numerical methods for their 
solution are different in the two cases, and yet they have in common the application 
of finite-difference schemes. 

12.3.1 The Vibrating String 

Consider an elastic string with its ends fixed at the points x = 0 and x = L (so that the string is of length L unstretched). If the string is plucked at position x = x_P (x_P ∈ (0, L)) at time t = 0, such as shown in Fig. 12.1, then it will vibrate for t > 0. The PDE describing u(x, t), which is the displacement of the string at position x and time t, is given by

    u_tt = c² u_xx.                                            (12.9)

The system of Fig. 12.1 is also characterized by the boundary conditions

    u(0, t) = 0,   u(L, t) = 0   for all t ∈ R⁺,               (12.10)







Figure 12.1 An elastic string plucked at time t = 0 at point P, which is located at x = x_P.

which specify that the string's ends are fixed, and we have the initial conditions

    u(x, 0) = f(x),   ∂u(x, t)/∂t |_{t=0} = g(x),              (12.11)

which describe the initial displacement and velocity of the string, respectively.
As explained, for example, in Kreyszig [1] or in Elmore and Heald [12], the PDE 
(12.9) is derived from elementary Newtonian mechanics based on the following 
assumptions: 

1. The mass of the string per unit of length is a constant. 

2. The string is perfectly elastic, offers no resistance to bending, and there is 
no friction. 

3. The tension in stretching the string before fixing its ends is large enough to 
neglect the action of gravity. 

4. The motion of the string is purely a vibration in the vertical plane (i.e., the 
y direction), and the deflection and slope are small in absolute value. 

We will omit the details of the derivation of (12.9), as this would carry us too far off course. However, note that the constant c² in (12.9) is

    c² = T/ρ,                                                  (12.12)

where T is the tension in the string (e.g., units of newtons) and ρ is the density of the string (e.g., units of kilograms per meter). A dimensional analysis of (12.12) quickly reveals that c has the units of speed (e.g., meters per second). It specifies the speed at which waves propagate on the string.




It is easy to confirm that (12.9) is a hyperbolic PDE since, on comparing (12.9) 
with (12.3), we have A = c 2 , B = 0, and C = -1. Thus, AC - B 2 = -c 2 < 0, 
which meets the definition of a hyperbolic PDE. 

At this point we summarize a standard method for obtaining series-based analyt- 
ical solutions to PDEs. This is the method of separation of variables (also called the 
product method). We shall also find that Fourier series expansions (recall Chapters 1 
and 3) have an important role to play in the solution method. The solutions we 
obtain yield test cases that we can use to gauge the accuracy of numerical methods 
that we consider later on. 

Assume that the solution 2 to (12.9) can be rewritten in the form 

u(x,t) = X(x)T(t). (12.13) 

Clearly

    u_xx = X_xx T,   u_tt = X T_tt,                            (12.14)

which may be substituted into (12.9), yielding

    X T_tt = c² T X_xx,                                        (12.15)

or equivalently

    T_tt/(c²T) = X_xx/X.                                       (12.16)

The expression on the left-hand side is a function of t only, while that on the right-hand side is a function only of x. Thus, both sides must equal some constant, say, κ:

    T_tt/(c²T) = X_xx/X = κ.                                   (12.17)

From (12.17) we obtain two second-order linear ODEs in constant coefficients

    X_xx − κX = 0                                              (12.18)

and

    T_tt − c²κT = 0.                                           (12.19)

We will now ascertain the general form of the solutions to (12.18) and (12.19), 
based on the conditions (12.10) and (12.11). From (12.10) substituted into (12.13), 
we obtain 

u(0, t) = X(0)T(t) = 0, u(L, t) = X(L)T(t) = 0. 

Theories about the existence and uniqueness of solutions to PDEs are often highly involved, and so 
we completely ignore this matter here. The reader is advised to consult books dedicated to PDEs and 
their solution for such information. 




If T(t) = 0 (all t), then u(x, t) = 0 for all x and t. This is the trivial solution, and we reject it. Thus, we must have

    X(0) = X(L) = 0.                                           (12.20)

For κ = 0, Eq. (12.18) is X_xx = 0, which has the general solution X(x) = ax + b, but from (12.20) we conclude that a = b = 0, and so X(x) = 0 for all x. This is the trivial solution and so is rejected. If κ = μ² > 0, we have the ODE X_xx − μ²X = 0, which has a characteristic equation possessing roots at ±μ. Consequently, X(x) = ae^{μx} + be^{−μx}. If we apply (12.20) to this, we conclude that a = b = 0, once again giving the trivial solution X(x) = 0 for all x. Now finally suppose that κ = −β² < 0, in which case (12.18) becomes

    X_xx + β²X = 0,                                            (12.21)

which has the characteristic equation s² + β² = 0. Thus, the general solution to (12.21) is of the form

    X(x) = a cos(βx) + b sin(βx).                              (12.22)

Applying (12.20) yields

    X(0) = a = 0,   X(L) = b sin(βL) = 0.                      (12.23)

Clearly, to avoid encountering the trivial solution, we must assume that b ≠ 0. Thus, we must have

    sin(βL) = 0,

implying that we have

    β = nπ/L,   n ∈ Z.                                         (12.24)

However, we avoid β = 0 (for n = 0) to prevent X(x) = 0 for all x, and we consider only n ∈ {1, 2, 3, ...} = N because sin(−x) = −sin x, and the minus sign can be absorbed into the constant b. Thus, in general,

    X(x) = X_n(x) = b_n sin(nπx/L)                             (12.25)

for n ∈ N, and where x ∈ [0, L]. So now we have found that κ = −β² = −(nπ/L)², in which case (12.19) takes on the form

    T_{n,tt} + λ_n² T_n = 0,   where λ_n = nπc/L.              (12.26)




This has a general solution of the form

    T_n(t) = A_n cos(λ_n t) + B_n sin(λ_n t),                  (12.27)

again for n ∈ N. Consequently, u_n(x, t) = X_n(x)T_n(t) is a solution to (12.9) for all n ∈ N, and

    u_n(x, t) = [A_n cos(λ_n t) + B_n sin(λ_n t)] sin(nπx/L).  (12.28)

The functions (12.28) are eigenfunctions with eigenvalues λ_n = nπc/L for the PDE in (12.9). The set {λ_n | n ∈ N} is the spectrum. Each u_n(x, t) represents harmonic motion of the string with frequency λ_n/(2π) cycles per unit of time, and is also called the nth normal mode for the string. The first mode (for n = 1) is called the fundamental mode, and the others (for n > 1) are called overtones, or harmonics. It is clear that u_n(x, t) in (12.28) satisfies PDE (12.9) and the boundary conditions (12.10). However, u_n(x, t) by itself will not satisfy (12.9), (12.10), and (12.11) all simultaneously. In general, the complete solution is [using superposition, as the PDE (12.9) is linear]

    u(x, t) = Σ_{n=1}^∞ u_n(x, t) = Σ_{n=1}^∞ [A_n cos(λ_n t) + B_n sin(λ_n t)] sin(nπx/L),   (12.29)

where the initial conditions (12.11) are employed to find the series coefficients A_n and B_n for all n. We will now consider how this is done in general.
From (12.11) we obtain

    u(x, 0) = Σ_{n=1}^∞ A_n sin(nπx/L) = f(x)                  (12.30)

and

    ∂u(x, t)/∂t |_{t=0} = Σ_{n=1}^∞ [−A_nλ_n sin(λ_n t) + B_nλ_n cos(λ_n t)] sin(nπx/L) |_{t=0}
                        = Σ_{n=1}^∞ B_nλ_n sin(nπx/L) = g(x).  (12.31)

The orthogonality properties of sinusoids can be used to determine A_n and B_n for all n. Note that (12.30) and (12.31) are particular instances of Fourier series expansions. In particular, observe that for k, n ∈ N

    ∫_0^L sin(nπx/L) sin(kπx/L) dx = { 0,    n ≠ k
                                     { L/2,  n = k.            (12.32)




Plainly, the set {sin(nπx/L) | n ∈ N} is an orthogonal set. Thus, from (12.30)

    ∫_0^L sin(kπx/L) [ Σ_{n=1}^∞ A_n sin(nπx/L) ] dx = ∫_0^L f(x) sin(kπx/L) dx,

and via (12.32) this reduces to

    A_k = (2/L) ∫_0^L f(x) sin(kπx/L) dx,                      (12.33)

and similarly from (12.31) and (12.32)

    B_k = [2/(λ_k L)] ∫_0^L g(x) sin(kπx/L) dx.                (12.34)



Example 12.1 Suppose that g(x) = 0 for all x. Thus, the initial velocity of the string is zero. Let the initial position (deflection) of the plucked string be triangular, such that

    f(x) = (2H/L) x          for 0 ≤ x ≤ L/2,
    f(x) = (2H/L)(L − x)     for L/2 ≤ x ≤ L.                  (12.35)

In Fig. 12.1 this corresponds to x_P = L/2 and u(x_P, 0) = H. Since g(x) = 0 for all x, via (12.34) we must have B_k = 0 for all k. From (12.33) we have

    A_k = (4H/L²) [ ∫_0^{L/2} x sin(kπx/L) dx + ∫_{L/2}^L (L − x) sin(kπx/L) dx ].

Since

    ∫ sin(ax) dx = −(1/a) cos(ax) + C

and

    ∫ x sin(ax) dx = −(x/a) cos(ax) + (1/a²) sin(ax) + C

(C is a constant of integration), on simplification we have

    A_k = [8H/(k²π²)] sin(kπ/2).                               (12.36)






Figure 12.2 Fourier series solution to the vibrating string problem. A mesh plot of u(x, t) as given by Eq. (12.37) for the parameters L = 1, c/L = 1/4, and H = 0.1. The plot employed the first 100 terms of the series expansion.



Thus, substituting (12.36) into (12.29) yields the general solution for our example, namely,

    u(x, t) = (8H/π²) Σ_{k=1}^∞ (1/k²) sin(kπ/2) cos(kπct/L) sin(kπx/L),

but sin(kπ/2) = 0 for even k, and so this expression reduces to

    u(x, t) = (8H/π²) Σ_{n=1}^∞ [(−1)^{n−1}/(2n − 1)²] cos[(2n − 1)πct/L] sin[(2n − 1)πx/L].   (12.37)

Figure 12.2 shows a typical plot of the function u(x, t) as given by (12.37) (for the parameters stated in the figure caption).

The reader is encouraged to think about Fig. 12.2, and to ask if the picture is a reasonable one on the basis of his/her intuitive understanding of how a plucked string (say, that on a stringed musical instrument) behaves. Of course, this question should be considered with respect to the modeling assumptions that led to (12.9), and that were listed earlier.
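A plot such as Fig. 12.2 is easy to generate by truncating the series (12.37); the following MATLAB sketch does so (the parameter values here are illustrative assumptions only).

    % Evaluate the partial sum of the series solution (12.37) on a grid.
    L = 1; c = 0.25; H = 0.1;            % illustrative values only
    Nterms = 100;                        % number of series terms retained
    x = linspace(0, L, 101);
    t = linspace(0, 2, 101);
    [X, T] = meshgrid(x, t);
    u = zeros(size(X));
    for n = 1:Nterms
        m = 2*n - 1;                     % only odd harmonics contribute
        u = u + ((-1)^(n-1)/m^2) .* cos(m*pi*c*T/L) .* sin(m*pi*X/L);
    end
    u = (8*H/pi^2) * u;
    mesh(X, T, u); xlabel('x'); ylabel('t'); zlabel('u(x,t)');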

12.3.2 Plane Electromagnetic Waves 

An electromagnetic wave (e.g., radio wave or light) in three-dimensional space R 3 
within some material is described by the vector magnetic field intensity H(x, y, z, t) 




[e.g., in units of amperes per meter (A/m)] and the vector electric field intensity E(x, y, z, t) [e.g., in units of volts per meter (V/m)] such that

    H(x, y, z, t) = H_x(x, y, z, t) x + H_y(x, y, z, t) y + H_z(x, y, z, t) z,
    E(x, y, z, t) = E_x(x, y, z, t) x + E_y(x, y, z, t) y + E_z(x, y, z, t) z,

where x, y, and z are the unit vectors in the x, y, and z directions of R³, respectively. The dynamic equations that H and E both satisfy are Maxwell's equations:

    ∇ × E = −∂B/∂t   (Faraday's law),                          (12.38)

    ∇ × H = ∂D/∂t   (Ampère's law).                            (12.39)

Here the material in which the wave propagates contains no charges or current sources. The magnetic flux density B(x, y, z, t) and the electric flux density D(x, y, z, t) are assumed to satisfy

    D = εE,   B = μH.                                          (12.40)

These relations assume that the material is linear, isotropic (i.e., the same in all directions), and homogeneous [i.e., the parameters ε and μ do not vary with (x, y, z)]. Constant ε is the material's permittivity [units of farads per meter (F/m)], and constant μ is the material's permeability [units of henries per meter (H/m)]. The permittivity and permeability of free space (i.e., a vacuum) are often denoted by ε_0 and μ_0, respectively, and

    ε = ε_r ε_0,   μ = μ_r μ_0,                                (12.41)

where ε_r is the relative permittivity and μ_r is the relative permeability of the material. Note that

    ε_0 ≈ 8.854185 × 10⁻¹² F/m,   μ_0 = 400π × 10⁻⁹ H/m.       (12.42)



If the material is air, then ε_r ≈ 1 and μ_r ≈ 1 to very good approximation, and so air is not practically distinguished (usually) from free space. For commonly occurring dielectric materials (i.e., insulators), we have μ_r ≈ 1, also to excellent approximation. On the other hand, for magnetic materials (e.g., iron, cobalt, nickel, various alloys and mixtures), μ_r will be very different from unity, and in fact the relationship B = μH must often be replaced by sometimes quite complicated nonlinear relationships, often involving the phenomenon of hysteresis. But we will completely avoid this situation here.






In general, for the vector field A = A_x x + A_y y + A_z z, the curl is the determinant

            | x      y      z    |
    ∇ × A = | ∂/∂x   ∂/∂y   ∂/∂z |                             (12.43)
            | A_x    A_y    A_z  |

so, expanding this expression with A = E and A = H gives (respectively)

    ∇ × E = x(∂E_z/∂y − ∂E_y/∂z) + y(∂E_x/∂z − ∂E_z/∂x) + z(∂E_y/∂x − ∂E_x/∂y)        (12.44)

and

    ∇ × H = x(∂H_z/∂y − ∂H_y/∂z) + y(∂H_x/∂z − ∂H_z/∂x) + z(∂H_y/∂x − ∂H_x/∂y).       (12.45)



We will consider only transverse electromagnetic (TEM) waves (i.e., plane waves). If such a wave propagates in the x direction, then we may assume that E_x = E_z = 0 and H_x = H_y = 0. Note that the electric and magnetic field components E_y and H_z are orthogonal to each other. They lie within the (y, z) plane, which itself is orthogonal to the direction of travel of the plane wave. From (12.44) and (12.45), Maxwell's equations (12.38) and (12.39) reduce to

    ∂E_y/∂x = −μ ∂H_z/∂t,   ∂H_z/∂x = −ε ∂E_y/∂t,              (12.46)



where we have used (12.40). Combining the two equations in (12.46), we obtain 
either 

    ∂²H_z/∂x² = με ∂²H_z/∂t²,                                  (12.47)

which is the wave equation for the magnetic field, or

    ∂²E_y/∂x² = με ∂²E_y/∂t²,                                  (12.48)

which is the wave equation for the electric field. If we define

    v = 1/√(με),                                               (12.49)

then the general solution to (12.48) (for example) can be expressed in the form

    E_y(x, t) = E_{y,r}(x − vt) + E_{y,l}(x + vt),              (12.50)






where the first term is a wave propagating in the +x direction (i.e., to the right) with speed v and the second term is a wave propagating in the −x direction (i.e., to the left) with speed v.³ Equation (12.50) is the classical d'Alembert solution to the scalar wave equation (12.48). Clearly, similar reasoning applies to (12.47). Of course, using (12.49) in (12.48), we can write

    ∂²E_y/∂t² = v² ∂²E_y/∂x²,                                  (12.51)

which has the same form as (12.9). In short, the mathematics describing the vibrations of mechanical systems is much the same as that describing electromagnetic systems; only the physical interpretations differ. Of course, (12.51) clearly implies that (12.47) and (12.48) are hyperbolic PDEs.



Example 12.2 It is easy to confirm that (12.37) can be rewritten in the form of (12.50). Via the identity

    cos A sin B = ½[sin(A + B) − sin(A − B)]                   (12.52)

we see that

    cos[(2n − 1)πct/L] sin[(2n − 1)πx/L]
        = ½ sin[(2n − 1)π(x + ct)/L] + ½ sin[(2n − 1)π(x − ct)/L].

Thus, (12.37) may immediately be rewritten as

    u(x, t) = (4H/π²) Σ_{n=1}^∞ [(−1)^{n−1}/(2n − 1)²] sin[(2n − 1)π(x − ct)/L]
            + (4H/π²) Σ_{n=1}^∞ [(−1)^{n−1}/(2n − 1)²] sin[(2n − 1)π(x + ct)/L],

where the first sum is u_1(x − ct) and the second sum is u_2(x + ct).

We note that when ε_r = μ_r = 1, we have v = c, where

    c = 1/√(μ_0 ε_0).                                          (12.53)



Readers are invited to draw a simple sketch and convince themselves that this interpretation is cor- 
rect. This interpretation is vital in understanding the propagation of electromagnetic waves in layered 
materials, such as thin optical films. 




This is the speed of light in a vacuum. Since for real materials μ_r ≥ 1 and ε_r ≥ 1, we have v ≤ c, so an electromagnetic wave cannot travel at a speed exceeding that of light in a vacuum.

Now we will assume sinusoidal solutions to the wave equations, such as would originate from sinusoidal sources.⁴ Specifically, let us assume that

    E_y(x, t) = E_0 sin(ωt − βx),                              (12.54)

where β = ω√(με) = 2π/λ, and λ is the wavelength (e.g., in units of meters). The frequency of the source is ω, a fixed constant, and so the wavelength will vary depending on the medium. If the free-space wavelength is denoted λ_0, then

    2π/λ_0 = ω/c,                                              (12.55)

where c is from (12.53). If the free-space wave then propagates into a denser material, then

    2π/λ = ω/v                                                 (12.56)

for v given by (12.49). From (12.55) and (12.56), we obtain

    λ = (v/c) λ_0.                                             (12.57)

Since v ≤ c, we always have λ ≤ λ_0; that is, the wavelength will shorten. This observation is useful in checking numerical methods that model the propagation of sinusoidal waves across interfaces between different materials (e.g., layered structures such as thin optical films).
From (12.54) we have

    ∂E_y/∂x = −βE_0 cos(ωt − βx),

so that from the first equation in (12.46) we have

    H_z = −(1/μ) ∫ [−βE_0 cos(ωt − βx)] dt = [βE_0/(ωμ)] sin(ωt − βx).     (12.58)

The characteristic impedance of the medium in which the sinusoidal electromagnetic wave travels is defined to be

    Z = E_y/H_z = ωμ/β = √(μ/ε) = √(μ_r/ε_r) Z_0,              (12.59)

For example, the Colpitts oscillator from Chapter 10 could operate as a sinusoidal signal generator to 
drive an antenna, thus producing a sinusoidal electromagnetic wave (radio wave) in space. 




where Z_0 = √(μ_0/ε_0) is the characteristic impedance of free space. The units of Z are ohms (Ω). We see that Z is analogous to the concept of impedance that arises in electric circuit analysis.

The analogy between our present problem and phasor analysis in basic electric circuit theory can be exploited. For suitable E(x) ∈ R

    E_y = E_y(x, t) = E(x)e^{jωt}                              (12.60)

so that

    ∂²E_y/∂x² = [d²E(x)/dx²] e^{jωt},   ∂²E_y/∂t² = −ω²E(x)e^{jωt}.        (12.61)

Substituting (12.61) into (12.48) yields

    [d²E(x)/dx²] e^{jωt} = −μεω²E(x)e^{jωt},

which reduces to the second-order linear ODE

    d²E(x)/dx² + μεω²E(x) = 0.                                 (12.62)

For convenience, we define the propagation constant

    γ = jω√(με) = jβ,                                          (12.63)

so −γ² = μεω², and (12.62) is now

    d²E/dx² − γ²E = 0                                          (12.64)

(E = E(x)). This ODE has a general solution of the form

    E(x) = E_0 e^{−γx} + E_1 e^{γx},                            (12.65)

where E_0 and E_1 are constants. Recalling (12.60), it follows that

    E_y(x, t) = E_0 e^{jωt − γx} + E_1 e^{jωt + γx}
              = E_0 e^{j(ωt − βx)} + E_1 e^{j(βx + ωt)}.         (12.66)

Of course, the first term is a wave propagating to the right, and the second term is 
a wave propagating to the left. 

So far we have assumed wave propagation in lossless materials since this is 
the easiest case to consider at the outset. We shall now consider the effects of 
lossy materials on propagation. This will be important in that it is a more realistic 




assumption in practice, and it is important in designing perfectly matched layers (PMLs) in the finite-difference time-domain (FDTD) method, as will be seen later. We may define an electrical conductivity σ [units of amperes per volt-meter, A/(V·m), or mhos per meter] and a magnetic conductivity σ* [units of volts per ampere-meter, V/(A·m), or ohms per meter]. In this case (12.38) and (12.39) take on the more general forms

    ∇ × E = −μ ∂H/∂t − σ*H,                                    (12.67)

    ∇ × H = ε ∂E/∂t + σE.                                      (12.68)

Note that σ* is not the complex conjugate of σ. In fact, σ, σ* ∈ R with σ > 0 for a lossy material (i.e., an imperfect insulator, or a conductor), and σ* > 0. As before, we will assume E possesses only a y component, and H possesses only a z component. Since we again have propagation only in the x direction, via (12.44) and (12.45) in (12.67) and (12.68), we have

    ∂E_y/∂x = −μ ∂H_z/∂t − σ*H_z,                              (12.69)

    −∂H_z/∂x = ε ∂E_y/∂t + σE_y.                               (12.70)

If E_y = E_y(x, t) possesses a term that propagates only to the right, then (using phasors again)

    E = E_y(x, t) y = E_0 e^{jωt} e^{−γx} y                     (12.71)

for some suitable propagation constant γ. For a suitable characteristic impedance Z, we must have

    H = H_z(x, t) z = (E_0/Z) e^{jωt} e^{−γx} z.                (12.72)

We may use (12.69) and (12.70) to determine γ and Z. Substituting (12.71) and (12.72) into (12.69) and (12.70) and solving for γ and Z yields

    Z² = (jωμ + σ*)/(jωε + σ),   γ² = (jωε + σ)(jωμ + σ*).      (12.73)

How to handle the complex square roots needed to obtain Z and γ will be dealt with below. Observe that the equations in (12.73) reduce to the previous cases (12.59) and (12.63) when σ = σ* = 0. It is noteworthy that when we have the condition

    σ/ε = σ*/μ,                                                (12.74)

then we have

    Z² = (jωμ + σ*)/(jωε + σ) = (μ/ε)(jω + σ*/μ)/(jω + σ/ε) = μ/ε.         (12.75)






Condition (12.74) is what makes the creation of a PML possible, as will be considered later in this section and demonstrated in Section 12.5.

Now we must investigate what happens to waves when they encounter a sudden change in the material properties, specifically, an interface between layers. This situation is depicted in Fig. 12.3. Assume that medium 1 has physical parameters ε_1, μ_1, σ_1, and σ_1*, while medium 2 has physical parameters ε_2, μ_2, σ_2, and σ_2*. The corresponding characteristic impedance and propagation constant for medium 1 are thus [via (12.73)]

    Z_1² = (jωμ_1 + σ_1*)/(jωε_1 + σ_1),   γ_1² = (jωε_1 + σ_1)(jωμ_1 + σ_1*),      (12.76)

while for medium 2 we have

    Z_2² = (jωμ_2 + σ_2*)/(jωε_2 + σ_2),   γ_2² = (jωε_2 + σ_2)(jωμ_2 + σ_2*).      (12.77)

In Fig. 12.3, for some constants E and H, we have for the incident field

    E_i = E_i y = E e^{jωt − γ_1 x} y,   H_i = H_i z = H e^{jωt − γ_1 x} z,



Figure 12.3 A plane wave normally incident on an interface (boundary) between two different media. The magnetic field components are directed orthogonal to the page. The interface is at x = 0 and is the yz plane.




so that

    ∂H_i/∂t = jωH_i,   ∂E_i/∂x = −γ_1 E_i,

and via (12.69)

    −γ_1 E_i = −jωμ_1 H_i − σ_1* H_i,

implying that

    E_i/H_i = (jωμ_1 + σ_1*)/γ_1 = Z_1.                         (12.78)

Similarly, for the transmitted field, we must have

    E_t/H_t = (jωμ_2 + σ_2*)/γ_2 = Z_2.                         (12.79)

But, for the reflected field, again for suitable constants E′ and H′, we have

    E_r = E_r y = E′ e^{jωt} e^{γ_1 x} y,   H_r = H_r z = H′ e^{jωt} e^{γ_1 x} z,

so that

    ∂H_r/∂t = jωH_r,   ∂E_r/∂x = γ_1 E_r,

and via (12.69)

    γ_1 E_r = −jωμ_1 H_r − σ_1* H_r,

implying

    E_r/H_r = −(jωμ_1 + σ_1*)/γ_1 = −Z_1.                       (12.80)

The electric and magnetic field components are tangential to the interface, and so must be continuous across it. This implies that at x = 0, and for all t (in Fig. 12.3),

    H_i + H_r = H_t,   E_i + E_r = E_t.                         (12.81)

If we substitute (12.78)–(12.80) into (12.81), then after a bit of algebra, we have

    E_t/E_i = 2Z_2/(Z_2 + Z_1) = τ                              (12.82)

and

    E_r/E_i = (Z_2 − Z_1)/(Z_2 + Z_1) = ρ.                      (12.83)

It is easy to confirm that

    τ = 1 + ρ.                                                  (12.84)




We call τ the transmission coefficient from medium 1 into medium 2, and ρ is the reflection coefficient from medium 1 into medium 2. The coefficients τ and ρ are often called Fresnel coefficients, especially in the field of optics.
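As a small numerical illustration (a sketch only; the frequency and material parameters below are arbitrary assumed values), the impedances of (12.76) and (12.77) and the Fresnel coefficients of (12.82) and (12.83) can be evaluated directly in MATLAB.

    % Characteristic impedances and Fresnel coefficients, Eqs. (12.76)-(12.83).
    eps0 = 8.854185e-12; mu0 = 4e-7*pi;
    f = 1e9; w = 2*pi*f;                        % assumed source frequency
    e1 = eps0;   m1 = mu0; s1 = 0; s1s = 0;     % medium 1: free space
    e2 = 4*eps0; m2 = mu0; s2 = 0; s2s = 0;     % medium 2: lossless dielectric
    Z1 = sqrt((1j*w*m1 + s1s)/(1j*w*e1 + s1));
    Z2 = sqrt((1j*w*m2 + s2s)/(1j*w*e2 + s2));
    tau = 2*Z2/(Z2 + Z1);                       % transmission coefficient (12.82)
    rho = (Z2 - Z1)/(Z2 + Z1);                  % reflection coefficient (12.83)
    % For this lossless case tau = 2/3 and rho = -1/3, and tau = 1 + rho holds.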

If σ* = 0, then from (12.73) the propagation constant for a sinusoidal wave in a conductor is obtained from

    γ² = −μεω² + jωμσ,                                          (12.85)

where μεω² > 0 and ωμσ > 0. We may express γ² in polar form: γ² = r_1 e^{jθ_1}. But γ = r e^{jθ} = [r_1 e^{jθ_1}]^{1/2}. In general, if z = r e^{jθ}, then

    z^{1/2} = r^{1/2} e^{jθ/2},   z^{1/2} = r^{1/2} e^{j(θ/2 + π)}.       (12.86)

From (12.85)

    r_1 = |γ²| = ωμ√(σ² + ε²ω²),   θ_1 = π/2 + tan⁻¹(εω/σ).     (12.87)

Consequently,

    γ = ±[ωμ√(σ² + ε²ω²)]^{1/2} e^{j[π/4 + ½ tan⁻¹(εω/σ)]}.     (12.88)

A special case is the perfect insulator (perfect dielectric), for which σ = 0. Since tan⁻¹(∞) = π/2, Eq. (12.88) reduces to γ = ±j√(με)ω = ±jβ. More generally (σ not necessarily zero) we have γ = ±(α + jβ), where

    α = [ωμ√(σ² + ε²ω²)]^{1/2} cos[π/4 + ½ tan⁻¹(εω/σ)],        (12.89a)
    β = [ωμ√(σ² + ε²ω²)]^{1/2} sin[π/4 + ½ tan⁻¹(εω/σ)].        (12.89b)



Example 12.3 Assume that μ = μ_0, ε = ε_0, and that σ = 0.0001 mhos/meter. For ω = 2πf [f is the sinusoid's frequency in hertz (Hz)], from (12.89) we obtain the following table:

    f (Hz)        α           β
    1 × 10⁶      0.015236    0.025911
    1 × 10⁷      0.018762    0.210423
    1 × 10⁸      0.018836    2.095929
    1 × 10⁹      0.018837   20.958455

Keeping the parameters the same, except that now σ = 0.01 mhos/meter, we have the following table:






    f (Hz)        α           β
    1 × 10⁶      0.198140    0.199245
    1 × 10⁷      0.611091    0.646032
    1 × 10⁸      1.523602    2.591125
    1 × 10⁹      1.876150   21.042254
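The entries in these tables are easy to reproduce; a MATLAB sketch of the computation of α and β from (12.89) follows (the frequencies and conductivities are those of the example).

    % Attenuation and phase constants from Eq. (12.89).
    eps0 = 8.854185e-12; mu0 = 4e-7*pi;
    sigma = 0.0001;                     % repeat with sigma = 0.01 for the second table
    f = [1e6 1e7 1e8 1e9];
    w = 2*pi*f;
    r = sqrt( w*mu0 .* sqrt(sigma^2 + (eps0*w).^2) );   % [w*mu*sqrt(...)]^(1/2)
    ang = pi/4 + 0.5*atan(eps0*w/sigma);                % pi/4 + (1/2)tan^(-1)(eps*w/sigma)
    alpha = r .* cos(ang);
    beta  = r .* sin(ang);
    disp([f.' alpha.' beta.'])          % tabulate f, alpha, beta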



In general, from (12.89a) we have α > 0. If in Fig. 12.3 medium 1 is free space while medium 2 is a conductor with 0 < σ < ∞, then E_t has the form [using (12.82)] for x > 0

    E_t = τE e^{jωt} e^{−γ_2 x} y = τE e^{jωt} e^{−α_2 x} e^{−jβ_2 x} y.      (12.90)

Of course, we also have E_i = E e^{jωt} e^{−jβ_1 x} y for x < 0, since γ_1 = jβ_1 (i.e., α_1 = 0 in free space), and E_r = ρE e^{jωt} e^{jβ_1 x} y for x < 0 [using (12.83)]. Since α_2 > 0, the factor e^{−α_2 x} will go to zero as x → ∞. The amplitude of the wave must decay as it progresses from free space (medium 1) into the conductive medium (medium 2). The rate of decay certainly depends on the size of α_2.

Now suppose that medium 1 is again free space, but that medium 2 has both σ_2 > 0 and σ_2* > 0 such that condition (12.74) holds with μ_2 = μ_0 and ε_2 = ε_0, specifically
specifically 



    σ_2/ε_0 = σ_2*/μ_0,                                        (12.91)

which implies [via (12.75)] that Z_2 = √(μ_0/ε_0). Since medium 1 is free space, Z_1 = √(μ_0/ε_0), too. The reflection coefficient from medium 1 into medium 2 is [via (12.83)]

    ρ = (Z_2 − Z_1)/(Z_2 + Z_1) = (Z_0 − Z_0)/(2Z_0) = 0.

When wave E_i in medium 1 encounters the interface (at x = 0 in Fig. 12.3), there will be no reflected component; that is, we will have E_r = 0. From (12.73) we obtain

    γ_2² = (σ_2σ_2* − ω²μ_0ε_0) + jω(σ_2μ_0 + σ_2*ε_0),         (12.92)

and we select the medium 2 parameters so that for γ_2 = α_2 + jβ_2 we obtain α_2 > 0, with α_2 large enough that the wave is rapidly attenuated, in that e^{−α_2 x} is small for relatively small x. In this case we may define medium 2 to be a perfectly matched layer (PML). It is perfectly matched in the sense that its characteristic impedance is the same as that of medium 1, thus eliminating reflections at the interface. Because it is lossy, it absorbs radiation incident on it. The layer dissipates energy without reflection. It thus simulates the walls of an anechoic chamber. In other words, an anechoic chamber has walls that approximately realize condition (12.74). The necessity to simulate the walls of an anechoic chamber will become clearer in Section 12.5 when we look at the FDTD method.




Finally, we remark on the similarities between the vibrating string problem and the problem considered here. The analytical solution method employed in Section 12.3.1 was separation of variables, and we have employed the same approach here, since all of our electromagnetic field solutions are of the form u(x, t) = X(x)T(t). The main difference is that in the vibrating string problem we have boundary conditions defined by the ends of the string being tied down somewhere, while in the electromagnetic wave propagation problem as we have considered it here there are no boundaries, or rather, the boundaries are at x = ±∞.



12.4 THE FINITE-DIFFERENCE (FD) METHOD 

We now consider a classical approach to the numerical solution of (12.9) that we call the finite-difference (FD) method. Note that the method to follow is by no means the only approach. Indeed, the FDTD method to be considered in Section 12.5 is an alternative, and there are still others. Following (12.5),

    ∂²u(x_k, t_n)/∂t² = [u(x_k, t_{n+1}) − 2u(x_k, t_n) + u(x_k, t_{n−1})]/τ² − (τ²/12) ∂⁴u(x_k, η_n)/∂t⁴    (12.93)

for some η_n ∈ [t_{n−1}, t_{n+1}], and

    ∂²u(x_k, t_n)/∂x² = [u(x_{k+1}, t_n) − 2u(x_k, t_n) + u(x_{k−1}, t_n)]/h² − (h²/12) ∂⁴u(ξ_k, t_n)/∂x⁴    (12.94)

for some ξ_k ∈ [x_{k−1}, x_{k+1}], where

    x_k = kh,   t_n = nτ                                       (12.95)

for k = 0, 1, ..., M, and n ∈ Z⁺. On substitution of (12.93) and (12.94) into (12.9), we have

    [u(x_k, t_{n+1}) − 2u(x_k, t_n) + u(x_k, t_{n−1})]/τ² − c²[u(x_{k+1}, t_n) − 2u(x_k, t_n) + u(x_{k−1}, t_n)]/h²
        = (1/12)[τ² ∂⁴u(x_k, η_n)/∂t⁴ − c²h² ∂⁴u(ξ_k, t_n)/∂x⁴] = e_{k,n},                 (12.96)

where e_{k,n} is the local truncation error. Since u_{k,n} ≈ u(x_k, t_n), from (12.96) we obtain the difference equation

    u_{k,n+1} − 2u_{k,n} + u_{k,n−1} − e²u_{k+1,n} + 2e²u_{k,n} − e²u_{k−1,n} = 0,          (12.97)

where

    e = (τ/h)c                                                 (12.98)




is sometimes called the Courant parameter. It has a crucial role to play in determining the stability of the FD method (and of the FDTD method, too). If we solve (12.97) for u_{k,n+1}, we obtain

    u_{k,n+1} = 2(1 − e²)u_{k,n} + e²(u_{k+1,n} + u_{k−1,n}) − u_{k,n−1},                   (12.99)

where k = 1, 2, ..., M − 1, and n = 1, 2, 3, .... Equation (12.99) is the main recursion in the FD algorithm. However, we need to account for the initial and boundary conditions in order to initialize this recursion. Before we consider this matter, note that, in the language of Section 10.2, the FD algorithm has a truncation error of O(τ² + h²) per step [via e_{k,n} in (12.96)].

Immediately on applying (12.10), since L = Mh, we have

    u_{0,n} = u_{M,n} = 0   for n ∈ Z⁺.                        (12.100)

Since from (12.11) u(x, 0) = f(x), we also have

    u_{k,0} = f(x_k)                                           (12.101)

for k = 0, 1, ..., M. From (12.101) we have u_{k,0}, but we also need u_{k,1} [consider (12.99) for n = 1]. To obtain a suitable expression, first observe that from (12.9)

    ∂²u(x, t)/∂t² = c² ∂²u(x, t)/∂x²   ⇒   ∂²u(x, 0)/∂t² = c² ∂²u(x, 0)/∂x² = c² f^{(2)}(x).   (12.102)

Now, on applying the Taylor series expansion, we see that for some μ ∈ [0, t_1] = [0, τ] (and μ may depend on x)

    u(x, t_1) = u(x, 0) + τ ∂u(x, 0)/∂t + (1/2)τ² ∂²u(x, 0)/∂t² + (1/6)τ³ ∂³u(x, μ)/∂t³,

and on applying (12.11) and (12.102), this becomes

    u(x, t_1) = u(x, 0) + τg(x) + (1/2)τ²c²f^{(2)}(x) + (1/6)τ³ ∂³u(x, μ)/∂t³.              (12.103)

In particular, for x = x_k, this yields

    u(x_k, t_1) = u(x_k, 0) + τg(x_k) + (1/2)τ²c²f^{(2)}(x_k) + (1/6)τ³ ∂³u(x_k, μ_k)/∂t³.   (12.104)

If f(x) ∈ C⁴[0, L], then, for some ξ_k ∈ [x_{k−1}, x_{k+1}], we have

    f^{(2)}(x_k) = [f(x_{k+1}) − 2f(x_k) + f(x_{k−1})]/h² − (h²/12) f^{(4)}(ξ_k).            (12.105)




Since u(x_k, 0) = f(x_k) [recall (12.11) again], if we substitute (12.105) into (12.104), we obtain

    u(x_k, t_1) = f(x_k) + τg(x_k) + [c²τ²/(2h²)][f(x_{k+1}) − 2f(x_k) + f(x_{k−1})] + O(τ³ + τ²h²).   (12.106)

Via (12.98) this yields the required approximation

    u_{k,1} = f(x_k) + τg(x_k) + (1/2)e²[f(x_{k+1}) − 2f(x_k) + f(x_{k−1})],

or

    u_{k,1} = (1 − e²)f(x_k) + (1/2)e²[f(x_{k+1}) + f(x_{k−1})] + τg(x_k).                  (12.107)

Taking account of the boundary conditions (12.100), Eq. (12.99) can be expressed 
in matrix form as 



    u^{(n+1)} = T u^{(n)} − u^{(n−1)},                         (12.108)

where u^{(n)} = [u_{1,n}  u_{2,n}  ···  u_{M−2,n}  u_{M−1,n}]^T and T is the (M − 1) × (M − 1) tridiagonal matrix with 2(1 − e²) on its main diagonal and e² on its subdiagonal and superdiagonal. This matrix recursion is run for n = 1, 2, 3, ..., and the initial conditions are provided by (12.101) and (12.107).
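A minimal MATLAB sketch of the resulting FD algorithm, i.e., the recursion (12.99) with the initialization (12.101) and (12.107), might read as follows; c, L, T, M, N and vectorized function handles f and g (initial displacement and velocity) are assumed inputs.

    % FD solution of u_tt = c^2 u_xx with fixed ends (sketch).
    h = L/M; tau = T/N; e = tau*c/h;      % Courant parameter; require e <= 1
    x = (0:M)'*h;
    u = zeros(M+1, N+1);                  % u(k+1, n+1) approximates u(x_k, t_n)
    u(:,1) = f(x);                        % Eq. (12.101)
    k = 2:M;                              % interior space indices
    u(k,2) = (1 - e^2)*f(x(k)) + 0.5*e^2*(f(x(k+1)) + f(x(k-1))) + tau*g(x(k));  % (12.107)
    for n = 2:N
        u(k,n+1) = 2*(1 - e^2)*u(k,n) + e^2*(u(k+1,n) + u(k-1,n)) - u(k,n-1);    % (12.99)
    end
    mesh(x, (0:N)*tau, u.');              % compare with Figs. 12.2 and 12.4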

We recall from Chapter 10 that numerical methods for the solution of ODE IVPs can be unstable. The same problem can arise in the numerical solution of PDEs. In particular, as the FD method is effectively an explicit method, it can certainly become unstable if h and τ are inappropriately selected.

As noted by others [2, 13], we will have

    lim_{h,τ→0} u_{k,n} = u(hk, τn)

provided that 0 < e ≤ 1. This is the famous Courant–Friedrichs–Lewy (CFL) condition for the stability of the FD method, and it is originally due to Courant et al. [23]. The special case where e = 1 is interesting and easy to analyze. In this case (12.99) reduces to

    u_{k,n+1} = u_{k+1,n} − u_{k,n−1} + u_{k−1,n}.             (12.109)




We recall that u(x, t) has the form

    u(x, t) = v(x − ct) + w(x + ct)

[see (12.50)]. Hence u(x_k, t_n) = v(x_k − ct_n) + w(x_k + ct_n). Observe that, since h = cτ (as e = 1), we have

    u(x_{k+1}, t_n) − u(x_k, t_{n−1}) + u(x_{k−1}, t_n)
        = v(x_k + h − ct_n) + w(x_k + h + ct_n)
          − v(x_k − ct_n + cτ) − w(x_k + ct_n − cτ)
          + v(x_k − h − ct_n) + w(x_k − h + ct_n)
        = v(x_k − h − ct_n) + w(x_k + h + ct_n)
        = v(x_k − ct_n − cτ) + w(x_k + ct_n + cτ)
        = v(x_k − ct_{n+1}) + w(x_k + ct_{n+1})
        = u(x_k, t_{n+1}),

or

    u(x_k, t_{n+1}) = u(x_{k+1}, t_n) − u(x_k, t_{n−1}) + u(x_{k−1}, t_n).   (12.110)

Equation (12.110) has a form that is identical to that of (12.109). In other words, the algorithm (12.109) gives the exact solution to (12.9), but only at x = hk and t = τn with h = cτ (which is a rather restrictive situation).

A more general approach to error analysis that confirms the CFL condition is sometimes called von Neumann stability analysis [2]. We outline the approach as follows. We begin by defining the global truncation error

    ε_{k,n} = u(x_k, t_n) − u_{k,n}.                           (12.111)

Via (12.96)

    u(x_k, t_{n+1}) − 2u(x_k, t_n) + u(x_k, t_{n−1}) − e²[u(x_{k+1}, t_n) − 2u(x_k, t_n) + u(x_{k−1}, t_n)] = τ²e_{k,n}.   (12.112)

If we subtract (12.97) from (12.112) and simplify the result using (12.111), we obtain

    ε_{k,n+1} = 2(1 − e²)ε_{k,n} + e²[ε_{k+1,n} + ε_{k−1,n}] − ε_{k,n−1} + τ²e_{k,n}        (12.113)

for k = 0, 1, ..., M (L = Mh), and n = 1, 2, 3, .... Equation (12.113) is a two-dimensional difference equation for the global error sequence (ε_{k,n}). The term τ²e_{k,n} is a forcing term, and if u(x, t) is smooth enough, the forcing term will be bounded for all k and n. Basically, we can show that the CFL condition 0 < e ≤ 1 prevents lim_{n→∞} |ε_{k,n}| = ∞ for all k = 0, 1, ..., M. Analogously to our




stability analysis approach for ODE IVPs from Chapter 10, we may consider the homogeneous problem

    ε_{k,n+1} = 2(1 − e²)ε_{k,n} + e²[ε_{k+1,n} + ε_{k−1,n}] − ε_{k,n−1},                   (12.114)

which is just (12.113) with the forcing term made identically zero for all k and n. In Section 12.3.1 we learned that separation of variables was a useful means to solve (12.9). We therefore believe that a discrete version of this approach is helpful in solving (12.114). To this end we postulate a typical solution of (12.114) of the form

    ε_{k,n} = exp[jαkh + βnτ]                                  (12.115)

for suitable constants α ∈ R and β ∈ C. We note that (12.115) has similarities to (12.28) and is really a term in a discrete form of Fourier series expansion. We also see that, with s = exp(βτ),

    |ε_{k,n}| = |exp(βnτ)| = |s|^n.

Thus, if |s| ≤ 1, we will not have unbounded growth of the error sequence (ε_{k,n}) as n increases. If we now substitute (12.115) into (12.114), we obtain (after simplification) the characteristic equation

    s² − [2(1 − e²) + 2cos(αh)e²]s + 1 = 0,                    (12.116)

in which the coefficient in brackets is 2b. Using the identity 2sin²x = 1 − cos(2x), we obtain

    b = 1 − 2e² sin²(αh/2).                                    (12.117)

It is easy to confirm that |b| ≤ 1 for all e such that 0 < e ≤ 1, because 0 ≤ sin²(αh/2) ≤ 1 for all αh ∈ R. We note that s² − 2bs + 1 = 0 for s = s_1, s_2, where

    s_1 = b + √(b² − 1),   s_2 = b − √(b² − 1).                (12.118)

If |b| > 1, then |s_k| > 1 for some k ∈ {1, 2}, which can happen if we permit e > 1. Naturally, we reject this choice, as it yields unbounded growth in the size of ε_{k,n} as n → ∞. If |b| ≤ 1, then clearly |s_k| = 1 for all k. (To see this, consider the product s_1 s_2 = s_1 s̄_1 = |s_1|².) This prevents unbounded growth of ε_{k,n}. Thus, we have validated the CFL condition for the selection of the Courant parameter e (i.e., we must always choose e to satisfy 0 < e ≤ 1).

Example 12.4 Figure 12.4 illustrates the application of the recursion (12.108) 
to the vibrating string problem of Example 12.1. The simulation parameters are 
stated in the figure caption. The reader should compare the approximate solution 






Figure 12.4 FD method approximate solution to the vibrating string problem. A mesh plot of u_{k,n} as given by Eq. (12.108) for the parameters L = 1, c/L = 1/4, and H = 10⁻¹. Additionally, h = 0.05 and τ = 0.1, which meets the CFL criterion for stability of the simulation.



of Fig. 12.4 to the exact solution of Fig. 12.2. The apparent loss of accuracy as the number of time steps increases (i.e., with increasing nτ) is due to the phenomenon of numerical dispersion [22], a topic considered in the next section in the context of the FDTD method. Of course, simulation accuracy improves as h, τ → 0 for fixed T = Nτ and L = Mh.



12.5 THE FINITE-DIFFERENCE TIME-DOMAIN (FDTD) METHOD 

The FDTD method is often attributed to Yee [14]. It is a finite-difference scheme 
just as the FD method of Section 12.4 is a finite-difference scheme. However, it is 
of such a nature as to be particularly useful in solving hyperbolic PDEs where the 
boundary conditions are at infinity (i.e., wave propagation problems of the kind 
considered in Section 12.3.2). 

The FDTD method considers approximations to H z (x, t) and E y (x, t) given by 
applying the central difference and forward difference approximations to the first 
derivatives in the PDEs (12.69) and (12.70). We will use the following notation 
for the sampling of continuous functions such as f(x, t): 



    f_{k,n} ≈ f(kΔx, nΔt),   f_{k+1/2,n+1/2} ≈ f((k + ½)Δx, (n + ½)Δt)      (12.119)

(so Δx replaces h, and Δt replaces τ here, where h and τ were the grid spacings used in previous sections). For convenience, let E = E_y and H = H_z (i.e., we drop






the subscripts on the field components). We approximate the derivatives in (12.69) and (12.70) specifically according to

    ∂H/∂t ≈ (1/Δt)[H_{k+1/2,n+1/2} − H_{k+1/2,n−1/2}],          (12.120a)
    ∂E/∂t ≈ (1/Δt)[E_{k,n+1} − E_{k,n}],                        (12.120b)
    ∂H/∂x ≈ (1/Δx)[H_{k+1/2,n+1/2} − H_{k−1/2,n+1/2}],          (12.120c)
    ∂E/∂x ≈ (1/Δx)[E_{k+1,n} − E_{k,n}].                        (12.120d)

Define ε_k = ε(kΔx), μ_k = μ((k + ½)Δx), σ_k = σ(kΔx), and σ_k* = σ*((k + ½)Δx), which assumes the general situation where the material parameters vary with x ∈ [0, L] (the computational region). Substituting these discretized material parameters and (12.120) into (12.69) and (12.70), we obtain the following algorithm:

    H_{k+1/2,n+1/2} = [1 − (σ_k*/μ_k)Δt] H_{k+1/2,n−1/2} − (1/μ_k)(Δt/Δx)[E_{k+1,n} − E_{k,n}],      (12.121a)
    E_{k,n+1} = [1 − (σ_k/ε_k)Δt] E_{k,n} − (1/ε_k)(Δt/Δx)[H_{k+1/2,n+1/2} − H_{k−1/2,n+1/2}].        (12.121b)



This is sometimes called the leapfrog algorithm. The dependencies between the estimated field components in (12.121) are illustrated in Fig. 12.5. If we assume that

    H(−½Δx, t) = H((M + ½)Δx, t) = 0                           (12.122)



x=L 



Figure 12.5 An illustration of the dependencies between the approximate field components given by (12.121a,b); the lines with arrows denote the "flow" of these dependencies [○ = electric field component (E_y); □ = magnetic field component (H_z)].




for all t ∈ R⁺, then a more detailed pseudocode description of the FDTD algorithm is

    H_{k+1/2,−1/2} := 0 for k = 0, 1, ..., M − 1;
    E_{k,0} := 0 for k = 0, 1, ..., M;
    for n := 0 to N − 1 do begin
        E_{m,n} := E_0 sin(ωnΔt);   {0 < m < M}
        for k := 0 to M − 1 do begin
            H_{k+1/2,n+1/2} := [1 − (σ_k*/μ_k)Δt] H_{k+1/2,n−1/2} − (1/μ_k)(Δt/Δx)[E_{k+1,n} − E_{k,n}];
        end;
        for k := 0 to M do begin
            E_{k,n+1} := [1 − (σ_k/ε_k)Δt] E_{k,n} − (1/ε_k)(Δt/Δx)[H_{k+1/2,n+1/2} − H_{k−1/2,n+1/2}];
        end;
    end;

The statement E_{m,n} := E_0 sin(ωnΔt) simulates an antenna that broadcasts a sinusoidal electromagnetic wave from the location x = mΔx ∈ (0, L). Of course, the antenna must be located in free space.
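For concreteness, a bare-bones MATLAB rendering of the pseudocode above is sketched next; it is not the full routine FDTD.m of Appendix 12.A (there are no PMLs, and the medium is free space throughout), and M, N, dx, dt, w, E0, and the source index m are assumed inputs.

    % Minimal FDTD leapfrog loop (free space, hard sinusoidal source at index m).
    eps0 = 8.854185e-12; mu0 = 4e-7*pi;
    E = zeros(1, M+1);          % E(k+1) ~ E_{k,n}, k = 0..M
    H = zeros(1, M);            % H(k+1) ~ H_{k+1/2,n-1/2}, k = 0..M-1
    for n = 0:N-1
        E(m+1) = E0*sin(w*n*dt);                       % the antenna (source)
        H = H - (dt/(mu0*dx))*diff(E);                 % update H_{k+1/2,n+1/2}
        E(2:M) = E(2:M) - (dt/(eps0*dx))*diff(H);      % update E_{k,n+1} (interior k)
    end
    plot((0:M)*dx, E);          % snapshot of the electric field at t = N*dt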

The FDTD algorithm is an explicit difference scheme, and so it may have stability problems. However, it can be shown that the algorithm is stable provided we have

    e = cΔt/Δx ≤ 1,   or equivalently   Δt ≤ Δx/c.              (12.123)

Thus, the CFL condition of Section 12.4 applies to the FDTD algorithm as well. A justification of this claim appears in Taflove [8]. A MATLAB implementation of the FDTD algorithm may be found in Appendix 12.A (see routine FDTD.m). In this implementation we have introduced the parameters s_x and s_t (0 < s_x, s_t ≤ 1) such that

    Δx = s_x λ_0,   Δt = s_t Δx/c.                              (12.124)

Clearly, cΔt/Δx = s_t, and so the CFL condition is met. Also, spatial sampling is determined by Δx = s_x λ_0, which is some fraction of a free-space wavelength λ_0 [recall (12.55)]. Note that the algorithm simulates the field for all t ∈ [0, T], where T = NΔt. If the wave is propagating only through free space, then the wave will travel a distance

    D = cT = N s_t s_x λ_0;                                     (12.125)

that is, the distance traveled is N s_t s_x free-space wavelengths. Since L = M s_x λ_0 (i.e., the computational region spans M s_x free-space wavelengths), this allows us to make a reasonable choice for N.

A problem with the FDTD algorithm is that even if the computational region is only free space, a wave launched from location x = mΔx ∈ (0, L) will eventually strike the boundaries at x = 0 and/or x = L, and so will be reflected back toward the source. These reflections will cause very large errors in the estimates of H and



E. But we know from Section 12.3.2 that we may design absorbing layers, called perfectly matched layers (PMLs), that suppress these reflections.

Suppose that the PML has physical parameters μ, ε, σ, and σ*; then, from (12.73), the PML will have a propagation constant given by

    γ² = (σσ* − ω²με) + jω(σμ + σ*ε).                           (12.126)

If we enforce the condition (12.74), namely

    σ/ε = σ*/μ,                                                 (12.127)

then (12.126) becomes



    γ² = (μ/ε)(σ² − ω²ε²) + 2jωσμ.                              (12.128)

If we enforce σ² − ω²ε² ≤ 0 (i.e., σ ≤ ωε), then γ = α + jβ, where

    γ = [(μ/ε)(σ² + ω²ε²)]^{1/2} e^{j[π/4 + ½ tan⁻¹((ω²ε² − σ²)/(2ωσε))]}.   (12.129)

Equation (12.129) is obtained by the same arguments that yielded (12.89). Since α > 0, a wave entering the PML will be attenuated by a factor e^{−αx}, where x is the depth of penetration of the wave into the PML. A particularly simple choice for σ is to let σ² = ω²ε², in which case



    α = ω√(με).                                                 (12.130)

Since ω = (2π/λ_0)c and c = 1/√(μ_0ε_0), with μ = μ_rμ_0 and ε = ε_rε_0, we can rewrite (12.130) as

    α = √(μ_rε_r) (2π/λ_0).                                     (12.131)

If we are matching the PML to free space, then μ_r = ε_r = 1, and so α = 2π/λ_0, in which case

    e^{−αx} = e^{−2πx/λ_0}.                                     (12.132)

If the PML is of thickness x = 2λ_0, then from (12.132) we have e^{−αx} = e^{−4π} ≈ 3.5 × 10⁻⁶. A PML that is two free-space wavelengths thick will therefore absorb very nearly all of the radiation incident on it at the wavelength λ_0. Since we have chosen σ² = ω²ε², it is easy to confirm that



    σ = (2π/λ_0)(ε_r/Z_0),   σ* = (2π/λ_0) μ_r Z_0              (12.133)

[via (12.127)], where we recall that Z_0 = √(μ_0/ε_0). Routine FDTD.m in Appendix 12.A implements PMLs according to this approach.

Figure 12.6 Typical output from FDTD.m (see Appendix 12.A) for the system described in Example 12.5. The antenna broadcasts a sinusoid of wavelength λ_0 = 500 nm (nanometers) from the location x = 3λ_0. The transmitted field strength is E_0 = 1 V/m.
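As a quick numerical check (a sketch; the thickness and wavelength are the values discussed above), the PML attenuation of (12.132) and the matched conductivities of (12.133) are easily evaluated.

    % PML attenuation and matched conductivities, Eqs. (12.131)-(12.133).
    eps0 = 8.854185e-12; mu0 = 4e-7*pi; Z0 = sqrt(mu0/eps0);
    lambda0 = 500e-9;                 % free-space wavelength, as in Example 12.5
    mur = 1; epsr = 1;                % PML matched to free space
    alpha = sqrt(mur*epsr)*2*pi/lambda0;    % Eq. (12.131)
    x = 2*lambda0;                          % PML thickness of two wavelengths
    atten = exp(-alpha*x)                   % approximately 3.5e-6, as in the text
    sigma  = (2*pi/lambda0)*epsr/Z0;        % Eq. (12.133)
    sigmas = (2*pi/lambda0)*mur*Z0;         % Eq. (12.133)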



Example 12.5 Figure 12.6 illustrates a typical output from FDTD.m (Appendix 12.A). The system shown in the figure occupies a computational region of length 10λ_0 (i.e., x ∈ [0, 10λ_0]). Somewhat arbitrarily, we have λ_0 = 500 nm (nanometers). The antenna (which, given the wavelength, could be a laser) is located at index m = 150 (i.e., it is at x = mΔx = m s_x λ_0 = 3λ_0, since s_x = 0.02). The free-space region is x ∈ (2λ_0, 6λ_0). The lossless dielectric occupies x ∈ [6λ_0, 8λ_0] and has a relative permittivity of ε_r = 4. The entire computational region is nonmagnetic, and so we have μ = μ_0 everywhere. Clearly, PML 1 is matched to free space, while PML 2 is matched to the dielectric.

Since ε_r = 4, according to (12.82) the transmission coefficient from free space into the dielectric is

    τ = 2Z_2/(Z_2 + Z_1) = 2√(μ_0/(ε_rε_0)) / [√(μ_0/(ε_rε_0)) + √(μ_0/ε_0)] = 2/(1 + √ε_r) = 2/3.




Since E_0 = 1 V/m, the amplitude of the electric field within the dielectric must be τE_0 = 2/3 V/m. From Fig. 12.6 the reader can see that the electric field within the dielectric does indeed have an amplitude of about 2/3 V/m to a good approximation. From (12.57) the wavelength within the dielectric material is

    λ = (1/√ε_r) λ_0 = ½ λ_0.

Again from Fig. 12.6 we see that the wavelength of the transmitted field is indeed close to ½λ_0 within the dielectric.

We observe that the PMLs in Example 12.5 do not perfectly suppress reflections 
at their boundaries. For example, the wave crest closest to the interface between 
PML 2 and the dielectric, and that lies within the dielectric, is somewhat higher 
than it should be. It is the discretization of a continuous space that has led to these residual reflections.

The theory of PMLs presented here does not easily extend from electromagnetic 
wave propagation problems in one spatial dimension into propagation problems in 
two or three spatial dimensions. It appears that the first truly successful extension of 
PML theory to higher spatial dimensions is due to Berenger [15,16]. Wu and Fang 
[17] claim to have improved the theory still further by improving the suppression 
of the residual reflections noted above. 

The problem of numerical dispersion was mentioned in Example 12.4 in the
application of the FD method to the simulation of a vibrating string. We conclude
this chapter with an account of the problem based mainly on the work of Trefethen
[22]. We will assume lossless propagation, so σ = σ* = 0 in (12.121). We will also
assume that the computational region is free space, so μ = μ₀, and ε = ε₀ every-
where. If we now substitute E(x, t) = E₀ sin(ωt - βx) and H(x, t) = H₀ sin(ωt -
βx) into either of (12.121a) or (12.121b), apply the appropriate trigonometric iden-
tities, and then cancel out common factors, we obtain the identity

    sin(ωΔt/2) = s sin(βΔx/2).                                     (12.134)

We may use (12.134) and (12.123) to obtain

    v_p = ω/β = [2c/(sβΔx)] sin⁻¹[ s sin(βΔx/2) ],                 (12.135)

which is the phase speed of the wave of wavelength λ₀ (recall β = 2π/λ₀) in the
FDTD method. For the continuous wave E(x, t) [or, for that matter, H(x, t)], recall
from Section 12.3.2 that ω/β = c, so without spatial or temporal discretization
effects, a sinusoid will propagate through free space at the speed c regardless of its
wavelength (or, equivalently, its frequency). However, (12.135) suggests that the
speed of an FDTD-simulated sinusoidal wave will vary with the wavelength. As




explained in Ref. 22 (or see Ref. 12), the group speed

    v_g = dω/dβ = d(β v_p)/dβ = c cos(βΔx/2) / √[ 1 - s² sin²(βΔx/2) ]      (12.136)

is more relevant to assessing how propagation speed varies with the wavelength.
Again, in free space the continuous wave propagates at the group speed v_g =
d(βc)/dβ = c. A plot of v_g/c versus λ₀/Δx [with v_g/c given by (12.136)] appears
in Fig. 2 of Represa et al. [19]. It shows that short-wavelength sinusoids travel at
slower speeds than do long-wavelength sinusoids when simulated using the FDTD
method.
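Equation (12.136) is simple enough to plot directly (this is essentially what Problem 12.14
asks for). The fragment below is an illustrative sketch only, under the assumption that s
denotes the Courant parameter cΔt/Δx; it plots v_g/c against λ₀/Δx for a few values of s.

% Group speed ratio v_g/c from Eq. (12.136) (illustrative sketch)
ratio = [2:0.1:40];            % values of lambda0/(Delta x)
arg = pi./ratio;               % beta*(Delta x)/2 = pi*(Delta x)/lambda0
for s = [0.1 0.5 0.8]          % assumed Courant parameter values
   vg = cos(arg)./sqrt(1 - (s^2)*(sin(arg).^2));
   plot(ratio,vg); hold on;
end;
hold off; grid
xlabel(' \lambda_0/\Delta x ')
ylabel(' v_g/c ')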

We have seen that nonsinusoidal waveforms (e.g., the triangle wave of Exam-
ple 12.1) are a superposition of sinusoids of varying frequency. Thus, if we use
the FDTD method, the FD method, or indeed any numerical method to simulate
wave propagation, we will see the effects of numerical dispersion. In other words,
the various frequency components in the wave will travel at different speeds, and
so the original shape of the wave will become lost as the simulation progresses
in time (i.e., as nΔt increases). Figure 12.7 illustrates this for the case of two





? 










(1) 








II) 




fc 


1 






(1) 




Q. 




<0 





() 




> 












m 


-1 




-2 



(a) 



o 

uf 

(b) 























/ 


\ 


/ 


I 




























n = 500 steps 

i 









3 4 5 

Distance (xx ) 



■I 
1 



1 

? 








I I 






1 


w 




Errors due to 
numerical dispersion ^^^ 


"-J 


V 




\p- 








•V 








n = 1 500 steps 

i i 









3 4 5 

Distance (xX Q ) 



Figure 12.7 Numerical dispersion in the FDTD method as illustrated by the propagation 
of two Gaussian pulses. The medium is free space. 



TLFeBOOK 



MATLAB CODE FOR EXAMPLE 12.5 557 

Gaussian pulses traveling in opposite directions. The two pulses originally appeared 
at x = 4A.o, and the medium is free space. For N = 1500 time steps, there is a very 
noticeable error due to the "breakup" of the pulses as their constituent frequency 
components separate out as a result of the numerical dispersion. 

In closing, note that more examples of numerical dispersion may be found in 
Luebbers et al. [18]. Shin and Nevels [21] explain how to work with Gaussian 
test pulses to reduce numerical dispersion. We mention that Represa et al. [19] 
use absorbing boundary conditions based on the theory in Mur [20], which is a 
different method from the PML approach we have used in this book. 



APPENDIX 12.A MATLAB CODE FOR EXAMPLE 12.5 



% permittivity.m
%
% This routine specifies the permittivity profile of the
% computational region [0,L], and is needed by FDTD.m
%
function epsilon = permittivity(k,sx,lambda0,M)

epsilon0 = 8.854185*1e-12;  % free-space permittivity
er1 = 4;                    % relative permittivity of the dielectric
Dx = sx*lambda0;            % this is Delta x
x = k*Dx;                   % position at which we determine epsilon
L = M*Dx;                   % location of right end of computational
                            %   region

if ((x >= 0) & (x < (L-4*lambda0)))
   epsilon = epsilon0;
else
   epsilon = er1*epsilon0;
end;

% permeability.m
%
% This routine specifies the permeability profile of the
% computational region [0,L], and is needed by FDTD.m
%
function mu = permeability(k,sx,lambda0,M)

mu0 = 400*pi*1e-9;   % free-space permeability
Dx = sx*lambda0;     % this is Delta x
x = k*Dx;            % position at which we determine mu
L = M*Dx;            % location of right end of computational
                     %   region

mu = mu0;

% econductivity.m
%
% This routine specifies the electrical conductivity profile
% of the computational region [0,L], and is needed by FDTD.m
%
function sigma = econductivity(k,sx,lambda0,M)

epsilon0 = 8.854185*1e-12;      % free-space permittivity
mu0 = 400*pi*1e-9;              % free-space permeability
er1 = 4;                        % dielectric relative permittivity
epsilon1 = er1*epsilon0;        % dielectric permittivity
Z0 = sqrt(mu0/epsilon0);        % free-space impedance
Dx = sx*lambda0;                % this is Delta x
x = k*Dx;                       % position at which we determine sigma
L = M*Dx;                       % location of right end of computational
                                %   region

star1 = (2*pi/lambda0)*(1/Z0);  % conductivity of PML 1 (at x = 0 end)
star2 = er1*star1;              % conductivity of PML 2 (at x = L end)

if ((x > 2*lambda0) & (x < (L - 2*lambda0)))
   sigma = 0;
elseif (x <= 2*lambda0)
   sigma = star1;
elseif (x >= (L - 2*lambda0))
   sigma = star2;
end;



% mconductivity.m
%
% This routine specifies the magnetic conductivity profile of the
% computational region [0,L], and is needed by FDTD.m
%
function sigmastar = mconductivity(k,sx,lambda0,M)

epsilon0 = 8.854185*1e-12;  % free-space permittivity
mu0 = 400*pi*1e-9;          % free-space permeability
Z0 = sqrt(mu0/epsilon0);    % free-space impedance
Dx = sx*lambda0;            % this is Delta x
x = (k+.5)*Dx;              % position at which we determine sigmastar
L = M*Dx;                   % location of right end of computational
                            %   region

star = (2*pi/lambda0)*Z0;

if ((x > 2*lambda0) & (x < (L - 2*lambda0)))
   sigmastar = 0;
else
   sigmastar = star;
end;












% FDTD.m
%
% This routine produces the plot in Fig. 12.6 which is associated with
% Example 12.5. Thus, it illustrates the FDTD method.
%
% The routine returns the total electric field component Ey to the
% caller.
%



function Ey = FDTD

mu0 = 400*pi*1e-9;          % free-space permeability
epsilon0 = 8.854185*1e-12;  % free-space permittivity
c = 1/sqrt(mu0*epsilon0);   % speed of light in free-space

lambda0 = 500;              % free-space wavelength of the source
                            %   in nanometers
lambda0 = lambda0*1e-9;     % free-space wavelength of the source
                            %   in meters
beta0 = (2*pi)/lambda0;     % free-space beta (wavenumber)
sx = .02;                   % fraction of a free-space wavelength
                            %   used to determine Delta x
st = .10;                   % scale factor used to determine time-step
                            %   size Delta t
Dx = sx*lambda0;            % Delta x
Dt = (st/c)*Dx;             % application of the CFL condition
                            %   to determine time-step Delta t

E0 = 1;                     % amplitude of the electric field (V/m)
                            %   generated by the source
m = 150;                    % source (antenna) location index
omega = beta0*c;            % source frequency (radians/second)
M = 500;                    % number of spatial grid points is (M+1)
N = 4000;                   % the number of time steps in the simulation

E = zeros(1,M+1);           % initial electric field
H = zeros(1,M+2);           % initial magnetic field

% Specify the material properties in the computational region
% (which is x in [0,M*Dx])
for k = 0:M
   epsilon(k+1) = permittivity(k, sx, lambda0, M);
   ce(k+1) = 1 - Dt*econductivity(k, sx, lambda0, M)/epsilon(k+1);
   h(k+1) = 1/epsilon(k+1);
end;
h = h*Dt/Dx;
for k = 0:(M-1)
   mu(k+1) = permeability(k, sx, lambda0, M);
   ch(k+1) = 1 - Dt*mconductivity(k, sx, lambda0, M)/mu(k+1);
   e(k+1) = 1/mu(k+1);
end;
e = e*Dt/Dx;

% Run the simulation for N time steps
for n = 1:N
   E(m+1) = E0*sin(omega*(n-1)*Dt);   % Antenna is at index m
   H(2:M+1) = ch.*H(2:M+1) - e.*(E(2:M+1) - E(1:M));
   E(1:M+1) = ce.*E(1:M+1) - h.*(H(2:M+2) - H(1:M+1));
   E(m+1) = E0*sin(omega*n*Dt);
   Ey(n,:) = E;   % Save the total electric field at time step n
end;
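As listed, FDTD.m returns the array Ey (one row of the total electric field per time step),
but the plotting commands themselves are not shown. A minimal sketch of how one might
display a single snapshot such as the one in Fig. 12.6 from the returned data follows; the
script name and the choice of which time step to display are assumptions made here for
illustration only.

% plotFDTD.m (illustrative sketch; not part of the text's listings)
Ey = FDTD;                 % run the simulation (M = 500, sx = .02, N = 4000 inside FDTD.m)
M = 500; sx = .02;
x = (0:M)*sx;              % spatial grid in units of lambda0
plot(x,Ey(4000,:),'-');    % snapshot of the field at the final time step
grid
xlabel(' Distance (x\lambda_0) ')
ylabel(' E_y (V/m) ')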




REFERENCES 

1. E. Kreyszig, Advanced Engineering Mathematics, 4th ed., Wiley, New York, 1979. 

2. T. Myint-U and L. Debnath, Partial Differential Equations for Scientists and Engineers, 
3rd ed., North-Holland, New York, 1987. 

3. R. L. Burden and J. D. Faires, Numerical Analysis, 4th ed., PWS-KENT Publ., Boston, 
MA, 1989. 

4. A. Quarteroni, R. Sacco and F. Saleri, Numerical Mathematics (Texts in Applied Math- 
ematics series, Vol. 37), Springer- Verlag, New York, 2000. 

5. G. Strang and G. Fix, An Analysis of the Finite Element Method, Prentice-Hall, Engle- 
wood Cliffs, NJ, 1973. 

6. S. Brenner and R. Scott, The Mathematical Theory of Finite Element Methods, Springer- 
Verlag, New York, 1994. 

7. C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, SIAM, Philadel- 
phia, PA, 1995. 

8. A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain 
Method, Artech House, Norwood, MA, 1995. 

9. R. Courant and D. Hilbert, Methods of Mathematical Physics, Vol. II. Partial Differen- 
tial Equations, Wiley, New York, 1962. 

10. J. D. Kraus and K. R. Carver, Electromagnetics, 2nd ed., McGraw-Hill, New York, 
1973. 

11. J. F. Epperson, An Introduction to Numerical Methods and Analysis, Wiley, New York, 
2002. 

12. W. C. Elmore and M. A. Heald, Physics of Waves, Dover Publ., New York, 1969. 

13. E. Isaacson and H. B. Keller, Analysis of Numerical Methods, Wiley, New York, 
1966. 

14. K. S. Yee, "Numerical Solution of Initial Boundary Value Problems Involving 
Maxwell's Equations in Isotropic Media," IEEE Trans. Antennas Propag. AP-14, 302- 
307 (May 1966). 

15. J. -P. Berenger, "A Perfectly Matched Layer for the Absorption of Electromagnetic 
Waves," J. Comput. Phys. 114, 185-200 (1994). 

16. J. -P. Berenger, "Perfectly Matched Layer for the FDTD Solution of Wave-Structure 
Interaction Problems," IEEE Trans. Antennas Propag. 44, 110-117 (Jan. 1996). 

17. Z. Wu and J. Fang, "High-Performance PML Algorithms," IEEE Microwave Guided 
Wave Lett. 6, 335-337 (Sept. 1996). 

18. R. J. Luebbers, K. S. Kunz and K. A. Chamberlin, "An Interactive Demonstration of 
Electromagnetic Wave Propagation Using Time-Domain Finite Differences," IEEE 
Trans. Educ. 33, 60-68 (Feb. 1990). 

19. J. Represa, C. Pereira, M. Panizo and F. Tadeo, "A Simple Demonstration of Numerical 
Dispersion under FDTD," IEEE Trans. Educ. 40, 98-102 (Feb. 1997). 

20. G. Mur, "Absorbing Boundary Conditions for the Finite-Difference Approximation of 
the Time-Domain Electromagnetic-Field Equations," IEEE Trans. Electromagn. Compat. 
EMC-23, 377-382 (Nov. 1981). 

21. C.-S. Shin and R. Nevels, "Optimizing the Gaussian Excitation Function in the Finite 
Difference Time Domain Method," IEEE Trans. Educ. 45, 15-18 (Feb. 2002). 




22. L. N. Trefethen, "Group Velocity in Finite Difference Schemes," SIAM Rev. 24, 113-136
(April 1982).

23. R. Courant, K. Friedrichs and H. Lewy, "Über die Partiellen Differenzengleichungen
der Mathematischen Physik," Math. Ann. 100, 32-74 (1928).

PROBLEMS 

12.1. Classify the following PDEs into elliptic, parabolic, and hyperbolic types 
(or a combination of types). 



(a) 3u_xx + 5u_xy + u_yy = x + y

(b) u_xx - u_xy + 2u_yy = -u_x + u

(c) y u_xx + u_yy = 0

(d) y² u_xx - 2xy u_xy + x² u_yy = 0

(e) 4x² u_xx + u_yy = u






12.2. Derive (12.5) (both equations).





12.3. In (12.6) let h = τ, and let N = M = 4, and for convenience let f_{k,n} =
ρ(x_k, y_n)/ε. Define the vectors

    v = [v_{1,1} v_{1,2} v_{1,3} v_{2,1} v_{2,2} v_{2,3} v_{3,1} v_{3,2} v_{3,3}],
    f = [f_{1,1} f_{1,2} f_{1,3} f_{2,1} f_{2,2} f_{2,3} f_{3,1} f_{3,2} f_{3,3}].

Find the matrix A such that Av = h²f. Assume that

    V_{0,n} = V_{k,0} = V_{k,4} = V_{4,n} = 0

for all k and n.

12.4. In (12.6) let h = τ = 1/4, and N = M = 4, and assume that ρ(x_k, y_n) = 0
for all k and n. Let x₀ = y₀ = 0. Suppose that

    V(0, y) = V(x, 0) = 0,   V(x, 1) = x,   V(1, y) = y.

Find the linear system of equations for V_{k,n} with 1 ≤ k, n ≤ 3, and put it
in matrix form.

12.5. Recall the previous problem. 

(a) Write a MATLAB routine to implement the Gauss-Seidel method (recall 
Section 4.7). Use your routine to solve the linear system of equations 
in the previous problem. 

(b) Find the exact solution to the PDE

        V_xx + V_yy = 0

    for the boundary conditions stated in the previous problem.




(c) Use the solution in (b) to find V(k/4, n/4) for 1 ≤ k, n ≤ 3, and compare
to the results obtained from part (a). They should be the same. Explain
why. [Hint: Consider the error terms in (12.5).]

12.6. The previous two problems suggest that the linear systems that arise in the
numerical solution of elliptic PDEs are sparse, and so it is worth considering
their solution using the iterative methods from Section 4.7. Recall that the
iterative methods of Section 4.7 have the general form

    x^(k+1) = B x^(k) + f

[from (4.155)]. Show that

    ‖x^(k+1) - x^(k)‖_p ≤ ‖B‖_p ‖x^(k) - x^(k-1)‖_p.

[Comment: Recalling (4.36c), it can be shown that ρ(A) ≤ ‖A‖_p [which is
really another way of expressing (4.158)]. For example, this result can be
used to estimate the spectral radius of B_J [Eq. (4.171)].]

12.7. Consider Eq. (12.9), with the initial and boundary conditions in (12.10) and
(12.11), respectively.

(a) Consider the change of variables

        ξ = x + ct,   η = x - ct,

    and φ(ξ, η) replaces u(x, t) according to

        φ(ξ, η) = u( (1/2)(ξ + η), (1/(2c))(ξ - η) ).              (12.P.1)

    Verify the derivative operator equivalences

        ∂/∂x = ∂/∂ξ + ∂/∂η,   (1/c) ∂/∂t = ∂/∂ξ - ∂/∂η.            (12.P.2)

    [Hint: ∂φ/∂ξ = (∂u/∂x)(∂x/∂ξ) + (∂u/∂t)(∂t/∂ξ), and
    ∂φ/∂η = (∂u/∂x)(∂x/∂η) + (∂u/∂t)(∂t/∂η).]

(b) Show that (12.9) is replaceable with

        ∂²φ/(∂ξ ∂η) = 0.                                           (12.P.3)

(c) Show that the solution to (12.P.3) is of the form φ(ξ, η) = P(ξ) + Q(η),
    and hence

        u(x, t) = P(x + ct) + Q(x - ct),

    where P and Q are arbitrary twice continuously differentiable functions.

(d) Show that

        P(x) + Q(x) = f(x),   P^(1)(x) - Q^(1)(x) = (1/c) g(x).

(e) Use the facts from (d) to show that

        u(x, t) = (1/2)[ f(x + ct) + f(x - ct) ] + (1/(2c)) ∫_{x-ct}^{x+ct} g(s) ds.



12.8. In (12.11) suppose that

        f(x) = (H/d)(x - L/2 + d),   L/2 - d ≤ x ≤ L/2,
             = (H/d)(L/2 + d - x),   L/2 ≤ x ≤ L/2 + d,
             = 0,                    elsewhere,

where 0 < d < L/2. Assume that g(x) = 0 for all x.

(a) Sketch f(x).

(b) Write a MATLAB routine to implement the FD algorithm (12.108) for
computing u_{k,n}. Write the routine in such a way that it is easy to change
the parameters c, H, h, τ, N, M, and d. The routine must produce a mesh
plot similar to Fig. 12.4 and a plot similar to Fig. 12.7 on the same page
(i.e., make use of subplot). The latter plot is to be of u_{k,n} versus k. Try
out your routine using the parameters c = 1, h = 0.05, τ = 0.025, H =
0.1, d = L/10, M = 200, and N = 1100 (recalling that L = Mh). Do
you find numerical dispersion effects?



12.9. Repeat Example 12.1 for f(x) and g(x) in the previous problem. 

12.10. Example 12.1 is about the "plucked string." Repeat Example 12.1 assuming
that f(x) = 0 and

        g(x) = (2V/L) x,         0 ≤ x ≤ L/2,
             = (2V/L)(L - x),    L/2 ≤ x ≤ L.

This describes a "struck string."



12.11. The MATLAB routines in Appendix 12.A implement the FDTD method,
and generate information for the plot in Fig. 12.6. However, the reflected
field component for 2λ₀ < x < 6λ₀ is not computed or displayed.

Modify the code(s) in Appendix 12.A to compute the reflected field com-
ponent in the free-space region and to plot it. Verify that at the interface
between the free-space region and the dielectric |ρ| = 1/3 (magnitude of
the reflection coefficient). Of course, you will need to read the amplitude
of the reflected component from your plot to do this.

12.12. Derive Eq. (12.134). 

12.13. Modify the MATLAB code(s) in Appendix 12.A to generate a plot similar
to Fig. 12.7. [Hint: E_{k,0} = E₀ exp[-((k - m)/10)²] for k = 0, 1, ..., M is the
initial electric field. Set the initial magnetic field to zero for all k.]

12.14. Plot v_g/c versus λ₀/Δx for v_g given by (12.136). Choose s = 0.1, 0.5, and
0.8, and plot all curves on the same graph. Use these curves to explain why
the errors due to numerical dispersion (see Fig. 12.7) are worse on the side
of the pulse opposite to its direction of travel.






13 



An Introduction to MATLAB 



13.1 INTRODUCTION 

MATLAB is short for "matrix laboratory," and is an extremely powerful software 
tool 1 for the development and testing of algorithms over a wide range of fields 
including, but not limited to, control systems, signal processing, optimization, image 
processing, wavelet methods, probability and statistics, and symbolic computing. 
These various applications are generally divided up into toolboxes that typically 
must be licensed separately from the core package. 

Many books have already been written that cover MATLAB in varying degrees 
of detail. Some, such as Nakamura [3], Quarteroni et al. [4], and Recktenwald 
[5], emphasize MATLAB with respect to numerical analysis and methods, but 
are otherwise fairly general. Other books emphasize MATLAB with respect to 
particular areas such as matrix analysis and methods (e.g., see Golub and Van 
Loan [1] or Hill [2]). Some books implicitly assume that the reader already knows 
MATLAB [1,4]. Others assume little or no previous knowledge on the part of the 
reader [2,3,5]. 

This chapter is certainly not a comprehensive treatment of the MATLAB tool, 
and is nothing more than a quick introduction to it. Thus, the reader will have to 
obtain other books on the subject, or consult the appropriate manuals for further 
information. MATLAB's online help facility is quite useful, too.



13.2 STARTUP 

Once properly installed, MATLAB is often invoked (e.g., on a UNIX workstation 
with a cmdtool window open) by typing matlab, and hitting return. A window 
under which MATLAB runs will appear. The MATLAB prompt also appears:

>>

MATLAB commands may then be entered and executed interactively.

If you wish to work with commands in M-files (discussed further below), then 
having two cmdtool windows open to the same working directory is usually desir- 
able. One window would be used to run MATLAB, and the other would be used 

MATLAB is written in C, but is effectively a language on its own. 







to edit the M-files as needed (to either develop the algorithm in the file or to 
debug it). 



13.3 SOME BASIC OPERATORS, OPERATIONS, AND FUNCTIONS 

The MATLAB command 

>> diary filename 

will create the file filename, and every command you type in and run, and every 
result of this, will be stored in filename. This is useful for making a permanent 
record of a MATLAB session which can help in documentation and sometimes in 
debugging. When writing MATLAB M-files, always make sure to document your 
programs. The examples in Section 13.6 illustrate this. 

MATLAB tends to work in the more or less intuitive way where matrix and/or 
vector operations are concerned. Of course, it is in the nature of this software tool 
to assume the user is already familiar with matrix analysis and methods before 
attempting to use it. 

When a MATLAB command creates a vector as the output from some operation, 
it may be in the form of a column or a row vector, depending on the command. A 
typical MATLAB row vector is 

>> x = [ 1 1 1 ];
>>

The semicolon at the end of a line prevents the printout of the result of the command 
at that line (this is useful in preventing display clutter). If you wish to turn it into 
a column vector, then type: 

>> x = x.'

x =

     1
     1
     1

Making this conversion is sometimes necessary as the inputs to some routines need 
the vectors in either row or column format, and some routines do not care. Routines 
that do not care whether the input vector is row or column make the conversion to 
a consistent form internally. For example, the following command sequence (which 
can be stored in a file called an M-file) converts vector x into a row vector if it is 
not one already: 

>> [N,M] = size(x);
>> if N ~= 1
>>    x = x.';
>> end;
>>




In this routine N is the number of rows and M is the number of columns in x. (The 
size command also accepts matrices.) The related command length(x) will return 
the length of vector x. This can be very useful in for loops (below). 
The addition of vectors works in the obvious way: 

>> x = [ 1 1 1 ];
>> y = [ 2 -1 2 ];
>> x + y

ans =

     3     0     3



In this routine, the answer might be saved in vector z by typing >> z = x + y;.
Clearly, to add vectors without error means that they must be of the same size. 
MATLAB will generate an error message if matrix and vector objects are not 
dimensioned properly when operations are performed on them. The mismatching 
of array sizes is a very common error in MATLAB programming. 
Matrices can be entered as follows: 

» A = [ 1 1 ; 1 2 ] 

A = 

1 1 

1 2 



Again, addition or subtraction would occur in the expected manner. We can invert 
a matrix as follows: 

» inv(A) 

ans = 

2 -1 

-1 1 



Operation det(A) will give the determinant of A, and [L, U] = lu(A) will give the 
LU factorization of A (if it exists). Of course, there are many other routines for 
common matrix operations, and decompositions (QR decomposition, singular value 
decomposition, eigendecompositions, etc.). Compute y = Ax + b:

>> x = [ 1 -1 ].';
>> b = [ 2 3 ].';
>> y = A*x + b






The colon operator can extract parts of matrices and vectors. For example, to
place the elements in rows j to k of column n of matrix B into vector x, use
>> x = B(j:k,n);. To extract the element from row k and column n of matrix A,
use >> x = A(k,n);. To raise something (including matrices) to a specific power,
use >> C = A^p, for which p is the desired power. (This computes C = A^p.)
(Note: MATLAB indexes vectors and matrices beginning with 1.)
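For instance, the following short transcript (the matrix entries are chosen here purely for
illustration and do not come from the text) shows the colon operator and the power
operator at work:

>> B = [ 1 2 ; 3 4 ; 5 6 ];
>> x = B(1:2,2)

x =

     2
     4

>> A = [ 1 1 ; 1 2 ];
>> C = A^2

C =

     2     3
     3     5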
Unless the user overrides the defaults, variables i and j denote the square root
of -1:

>> sqrt(-1)

ans =

     0 + 1.0000i



Here, i and j are built-in constants. So, to enter a complex number, say, z = 3 - 2j,
type

>> z = 3 - 2*i;

Observe

>> x = [ 1 1+i ];
>> x'

ans =

   1.0000
   1.0000 - 1.0000i



So the transposition operator without the period gives the complex-conjugate
transpose (Hermitian transpose). Note that besides i and j, another useful built-in
constant is pi (= π).

Floating-point numbers are entered as, for instance, 1.5e-3 (which is 1.5 ×
10⁻³). MATLAB agrees with IEEE floating-point conventions, and so 0/0 will
result in NaN ("not a number") to more clearly indicate an undefined operation.
An operation like 1/0 will result in Inf as an output.




We may summarize a few important operators, functions, and other terms: 



Relational Operators

  <      less than
  <=     less than or equal to
  >      greater than
  >=     greater than or equal to
  ==     equal to
  ~=     not equal to




Trigonometric Functions 



sin sine 

cos cosine 

tan tangent 

asin arcsine 

acos arccosine 

atan arctangent 

atan2 four quadrant arctangent 

sinh hyperbolic sine 

cosh hyperbolic cosine 

tanh hyperbolic tangent 

asinh hyperbolic arcsine 

acosh hyperbolic arccosine 

atanh hyperbolic arctangent 



Elementary Mathematical Functions 



abs absolute value 

angle phase angle (argument of a complex number) 

sqrt square root 






real      real part of a complex number
imag      imaginary part of a complex number
conj      complex conjugate
rem       remainder or modulus
exp       exponential to base e
log       natural logarithm
log10     base-10 logarithm
round     round to nearest integer
fix       round toward zero
floor     round toward -∞
ceil      round toward +∞



In setting up the time axis for plotting things (discussed below), a useful com- 
mand is illustrated by 

>> y = [0:.2:1]

y =

     0    0.2000    0.4000    0.6000    0.8000    1.0000



Thus, [x :y :z ] creates a row vector whose first element is x and whose last element 
is z (depending on step size y), where the elements in between are of the form x 
+ ky (where k is a positive integer). 

It can also be useful to create vectors of zeros: 

>> zeros(size([1:4]))

ans =

     0     0     0     0

Or, alternatively, a simpler way is

>> zeros(1,4)

ans =

     0     0     0     0






Using "zeros(n,m)" will result in an n x m matrix of zeros. Similarly, a vector 
(or matrix) containing only ones would be obtained using the MATLAB function 
called "ones." 



13.4 WORKING WITH POLYNOMIALS 

We have seen on many occasions that polynomials are vital to numerical analysis 
and methods. MATLAB has nice tools for dealing with these objects. 

Suppose that we have polynomials P₁(s) = s + 2 and P₂(s) = 3s + 4 and wish
to multiply them. In this case type the following command sequence:

» P1 = [ 1 2 ] ; 

» P2 = [ 3 4 ] ; 
» conv(P1 ,P2) 

ans = 

3 10 8 



This is the correct answer since P₁(s)P₂(s) = 3s² + 10s + 8. From this we see
polynomials are represented as vectors of the polynomial coefficients, where the 
highest-degree coefficient is the first element in the vector. This rule is followed 
pretty consistently. (Note: "conv" is the MATLAB convolution function, so if you 
don't already know this, convolution is mathematically essentially the same as 
polynomial multiplication.) 

Suppose P(s) = s² + s - 2, and we want the roots. In this case you may type

>> P = [ 1 1 -2 ];
>> roots(P)

ans =

    -2
     1



MATLAB (version 5 and later) has "mroots," which is a root finder that does a 
better job of computing multiple roots. 

The MATLAB function "polyval" is used to evaluate polynomials. For example, 
suppose P(s) — s 2 + 3s + 5, and we wanted to compute P(— 3). The command 
sequence is 

» P = [ 1 3 5 ] ; 
» polyval(P, -3) 




ans = 

5 
>> 

13.5 LOOPS 

We may illustrate the simplest loop construct in MATLAB as follows: 

>> t = [0:.1:1];

>> for k = 1 :length(t) 

» x(k) = 5*sin( (pi/3) * t(k) ) + 2; 

» end; 

>> 

This command sequence computes x(t) = 5 sin(πt/3) + 2 for t = 0.1k, where k =
0, 1, . . . , 10. The result is saved in the (row) vector x. However, an alternative
approach is to vectorize the calculation according to

» t = [0: .1:1]; 

» x = 5*sin( pi*t/3 ) + 2*ones(1 , length(t) ) ; 



This yields the same result. Vectorizing calculations leads to faster code (in terms 
of runtime). 
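To see the speed difference for yourself, the tic and toc commands can be used to time
the two versions. The following is only an illustrative sketch (the problem size here is
chosen arbitrarily):

>> t = [0:1e-6:1];    % a long time vector, so the difference is measurable
>> tic
>> for k = 1:length(t)
>>    x1(k) = 5*sin( (pi/3)*t(k) ) + 2;
>> end;
>> toc                % elapsed time for the loop version
>> tic
>> x2 = 5*sin( pi*t/3 ) + 2*ones(1,length(t));
>> toc                % elapsed time for the vectorized version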

A potentially useful method to add (append) elements to a vector is

>> x = [];
>> for k = 1:2:6
>>    x = [ x k ];
>> end;
>> x

x =

     1     3     5

where x = [ ] defines x to be initially empty, while the for loop appends 1, 3, and
5 to the vector one element at a time.

The format of numerical outputs can be controlled using MATLAB fprintf 
(which has many similarities to the ANSI C fprintf function). For example 

» for k = 0:9 

fprintf ( '%12.8f \n' ,sqrt(k)) ; 
end; 
0.00000000 






1 .00000000 
1 .41421356 
1.73205081 
2.00000000 
2.23606798 
2.44948974 
2.64575131 
2.82842712 
3.00000000 



The use of a file identifier can force the result to be printed to a specific file instead 
of to the terminal (which is the result in this example). MATLAB also has save 
and load commands that can save variables and arrays to memory, and read them 
back, respectively. 
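For example, a file identifier returned by fopen may be given as the first argument of
fprintf, and save/load move variables to and from a .mat file. The file names below are
chosen here purely for illustration:

>> fid = fopen('roots.txt','w');    % open a text file for writing
>> for k = 0:9
>>    fprintf(fid,'%12.8f \n',sqrt(k));
>> end;
>> fclose(fid);                     % the square roots are now in roots.txt
>> save mydata x t                  % save vectors x and t to the file mydata.mat
>> clear
>> load mydata                      % read x and t back in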

Certainly, for loops may be nested in the expected manner. Of course, MATLAB 
also supports a "while" statement. For information on conditional statements (i.e., 
"if" statements), use ">> help if." 



13.6 PLOTTING AND M-FILES 

Let's illustrate plotting and the use of M-files with an example. Note that M-files 
are also called script files (use script as the keyword when using help for more 
information on this feature). 

As an exercise the reader may wish to create a file called "stepH.m" (open and 
edit it in the manner you are accustomed to). In this file place the following lines: 



% stepH.m
%
% This routine computes the unit-step response of the
% LTI system with system function H(s) given by
%
%                  K
% H(s) = -------------------
%           s^2 + 3s + K
%
% for user input parameter K. The result is plotted.
%
function stepH(K)

b = [ K ];
a = [ 1 3 K ];

clf          % Clear any existing plots from the screen
step(b,a);   % Compute the step response and plot it
grid         % plot the grid






This M-file becomes a MATLAB command, and for K = 0.1 may be executed
using

>> stepH(.1);
>>

This will result in another window opening where the plot of the step response will
appear. To save this plot for printing, use

>> print -dps filename.ps

which will save the plot as a postscript file called "filename.ps." Other printing for-
mats are available. As usual, the details are available from online help. Figure 13.1
is the plot produced by stepH.m for the specified value of K.

Another example is an M-file that computes the frequency response (both mag-
nitude and phase) of the linear time-invariant (LTI) system with Laplace transfer
function

    H(s) = 1 / (s² + 0.5s + 1).

Figure 13.1 Step response: typical output from stepH.m. [Horizontal axis: Time (seconds),
0 to 180; vertical axis: Amplitude, 0 to 1.]




If you are not familiar with Laplace transforms, recall phasor analysis from basic
electric circuit theory. You may, for instance, interpret H(jω) as the ratio of two
phasor voltages (frequency ω), such as

    H(jω) = V₂(jω) / V₁(jω).

The numerator phasor V₂(jω) is the output of the system, and the denominator
phasor V₁(jω) is the input to the system (a sinusoidal voltage source).



% freqresp.m 

% 

% This routine plots the frequency response of the Laplace 

% transfer function 



%                   1
% H(s) = ---------------------
%           s^2 + .5s + 1

% 

% The magnitude response (in dB) and the phase response (in 
% degrees) are returned in the vectors mag, and pha, 
% respectively. The places on the frequency axis where the 
% response is computed are returned in vector f . 



function [mag, pha, f] = freqresp 

b = [ 1]; 

a = [ 1 .5 1 ] ; 

w = logspace(-2,1,50); % Compute the frequency response for 10^(-2) to 10^1
                       % radians per second at 50 points in this range
h = freqs(b,a,w);      % h is the frequency response

mag = abs(h); % magnitude response 
pha = angle(h); % phase response 

f = w/(2*pi); % setup frequency axis in Hz 

pha = pha*180/pi; % phase now in degrees 

mag = 20*log10(mag) ; % magnitude response now in dB 

elf 

subplot(211 ) , semilogx(f ,mag, ' - ' ) , grid 
xlabel(' Frequency (Hz) ') 
ylabel(' Amplitude (dB) ') 
title (' Magnitude Response ') 

subplot(212) , semilogx(f ,pha, ' - ' ) , grid 
xlabel(' Frequency (Hz) ') 
ylabel(' Phase Angle (Degrees) ') 
title(' Phase Response ') 

Executing the command 

» [mag,phase,f ] = freqresp; 
» 



will result in the plot of Fig. 13.2, and will also give the vectors mag, phase,
and f. Vector mag contains the magnitude response (in decibels), and vector phase
contains the phase response (in degrees) at the sample values in the vector f, which
defines the frequency axis for the plots (in hertz). In other words, "freqresp.m" is a
Bode plotting routine. Note that the MATLAB command "bode" does Bode plots
as well.

Figure 13.2 The output from freqresp.m: (a) magnitude response; (b) phase response.
[Horizontal axes: Frequency (Hz); panel (a) vertical axis: Amplitude (dB); panel (b)
vertical axis: Phase Angle (Degrees).]

Additional labeling may be applied to plots using the MATLAB command text 
(or via a mouse using "gtext"; see online help). As well, the legend statement is 
useful in producing labels for a plot with different curves on the same graph. For 
example 

function ShowLegend

ul = 2.5*pi;
ll = -pi/4;
N = 200;
dt = (ul - ll)/N;
for k = 0:N-1
   t(k+1) = ll + dt*k;
   x(k+1) = exp(-t(k+1));
   y(k+1) = sin(t(k+1));
end;







Figure 13.3 Illustration of the MATLAB legend statement. 



subplot(211), plot(t,x,'-',t,y,'--'), grid
xlabel(' t ')
legend(' e^{-t} ', ' sin(t) ')

When this code is run, it gives Fig. 13.3. Note that the label syntax is similar to 
that in LaTeX [6]. 

Other sample MATLAB codes have appeared as appendixes in earlier chapters, 
and the reader may wish to view these as additional examples. 



REFERENCES 

1. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. Johns Hopkins Univ. 
Press, Baltimore, MD, 1989. 

2. D. R. Hill, Experiments in Computational Matrix Algebra (C. B. Moler, consulting ed.), 
Random House, New York, 1988. 

3. S. Nakamura, Numerical Analysis and Graphic Visualization with MATLAB, 2nd ed. 
Prentice-Hall, Upper Saddle River, NJ, 2002. 

4. A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics (Texts in Applied Math- 
ematics series, Vol. 37). Springer- Verlag, New York, 2000. 

5. G. Recktenwald, Numerical Methods with MATLAB: Implementation and Application, 
Prentice-Hall, Upper Saddle River, NJ, 2000. 

6. M. Goossens, F. Mittelbach, and A. Samarin, The LaTeX Companion, Addison- Wesley, 
Reading, MA, 1994. 






INDEX 



Absolute convergence, 95 
Additive splitting, 179 
Adjoint matrix, 489 
Alternating set, 239 
Ampere's law, 535 
Anechoic chamber, 544 
Appolonius' identity, 35 
Argmin, 39 
ASCII, 339 

Asymptotic expansion, 97 
Asymptotic series, 97-103 
Asymptotic time complexity, 155



Two's complement, 57 

Binomial theorem, 85 

Bipolar junction transistor (BJT) 
Base current, 420 
Base-emitter voltage, 420 
Collector current, 420 
Collector-emitter voltage, 420 
Forward current gain, 420 
On resistance, 420 

Bisection method, 292-296 

Block upper triangular matrix, 510 

Boundary conditions, 527 

Boundary value problem (BVP), 445 



Backtracking line search, 346,353 
Backward difference 

approximation, 402 
Backward Euler method 

see Euler' s method of solution 
(implicit form) 
Backward substitution, 156 
Banach fixed-point theorem, 297,422 

see also Contraction theorem 
Banach space, 68 

Basic QR iterations algorithm, 512 
Bauer-Fike theorem, 520 
Bernoulli's differential equation, 426 
Bernoulli's inequality, 123 
Bernoulli numbers, 398 
Bessel's inequality, 217 
"Big O" notation, 155,296 
Binary number codes 

One's complement, 57 

Sign-magnitude, 56 



C, 38,339,565 
C++, 38 
Cancellation, 44 

Cantor's intersection theorem, 294 
Cartesian product, 2 
Catastrophic cancellation, 62,117 
Catastrophic convergence, 90 
Cauchy sequence, 64 
Cauchy-Schwarz inequality, 139 
Cayley-Hamilton theorem, 489 
Central difference approximation, 402 
Chaotic sequence, 323 
Characteristic equation, 458,549 
Characteristic impedance, 538 
Characteristic impedance of free 

space, 539 
Characteristic polynomial, 144,481 
Chebyshev approximation, 239 
Chebyshev norm, 239 









Chebyshev polynomials of the first kind, 

218-225 
Chebyshev polynomials 

of the second kind, 243-245 
Cholesky decomposition, 167,355 
Christoffel-Darboux formula, 211,389 
Christoffel numbers, 389 
Closed subset, 299 
Cofactor, 489 

Colpitts oscillator, 419,465,538 
Compactly supported, 278 
Companion matrix, 515,519 
Complementary error function, 98 
Complementary sine integral, 100 
Complete solution, 532 
Complete space, 64 
Complete spline, 275 
Completing the square, 131 
Complex numbers 

Cartesian form, 28 

Imaginary part, 31 

Magnitude, 29 

Polar form, 28 

Real part, 31 
Computational complexity, 154 
Computational efficiency, 89 
Computational region, 551 
Condition number of a matrix, 

135-147 
Conic section, 519,526 
Conjugate gradient methods, 527 
Conjugate symmetry, 35 
Continuous mapping, 11 
Contraction (contractive) 

mapping, 184,297 
Contraction theorem, 183,298 
Contra-identity matrix, 199 
Convergence in the mean, 217 
Convergence of a sequence, 63 
Convex function, 352 
Convolution, 204 
Convolution integral, 104,444 
Coordinate rotation digital computing 
(CORDIC) method, 70,107-116 
Countably infinite set, 22 



Courant-Friedrichs-Lewy (CFL) 

condition, 547 
Courant parameter, 546 
Crank-Nicolson method, 528 
Cubic B-spline, 272 
Current-controlled current source 

(CCCS), 420 
Curve fitting, 252 
Cyphertext sequence, 325 



D'Alembert solution, 537 
Daubechies 4-tap scaling function, 520 
Daubechies wavelets, 18 
Deadbeat synchronization, 326 
Defective matrix, 483 
Deflation theorem, 507 
Degree of freedom, 440 
Descent algorithm, 352 
Detection threshold, 333 
Diagonal dominance, 182,281 
Diagonalizable matrix, 445 
Diffusion equation 

see Heat equation 
Dimension, 22 
Dirichlet kernel, 74,103 
Discrete basis, 70,108 
Discrete convolution, 34 
Discrete Fourier series, 25 
Discrete Fourier transform 

(DFT), 27,37,112 
Divergent sequence, 64 
Divided differences, 257-259 
Divided-difference table, 264 
Dominant eigenvalue, 498 
Double sequence, 69 
Duffing equation, 

415,447-448,453-454 



Eavesdropper, 331 
Eigenfunction, 532
Eigenpair, 480 
Eigenproblem, 480 
Eigenspace, 483 






Eigenvalue 

Matrix, 143,480 
Partial differential equation 
(PDE), 532 

Eigenvector, 480 

Electric flux density, 535 

Electrical conductivity, 540 

Elementary logic 
Logical contrary, 32 
Logically equivalent, 32 
Necessary condition, 31 
Sufficient condition, 31 

Elliptic integrals, 410-411 

Elliptic partial differential 
equation, 526 

Equality constraint, 357 

Error control coding, 327 

Encryption key, 325 

Energy 

Of a signal (analog), 13 
Of a sequence, 13 

Error 

Absolute, 45,60,140 
Relative, 45,60,140,195 

Euler-Maclaurin formula, 397 

Error function, 94 

Euclidean metric, 6 

Euclidean space, 12,16 

Euler's identity, 19,30 

Exchange matrix 

see Contra-identity matrix 



Faraday's law, 535 

Fast Fourier transform (FFT), 27 

Feasible direction, 357 

Feasible incremental step, 357 

Feasible set, 357 

Fejer's theorem, 122 

Finite difference (FD) method, 545-550 

Finite-difference time-domain (FDTD) 

method, 525,540,550-557 
Finite-element method (FEM), 528
Finite impulse response (FIR) filtering, 

34,367 



Finite-precision arithmetic effects, 1 
Fixed-point dynamic range, 39 
Fixed-point number 

representations, 38-41 
Fixed-point overflow, 40 
Fixed-point method, 296-305,312-318 
Fixed-point rounding, 41 
Floating-point chopping, 47 
Floating-point dynamic range, 42 
Floating-point exponent, 42 
Floating-point mantissa, 42 
Floating-point normalization, 45 
Floating-point number 

representations, 42-47 
Floating-point overflow, 43 
Floating-point rounding, 43 
Floating-point underflow, 43 
Flop, 154 

Floquet theory, 522 
Forcing term, 548 
FORTRAN, 38 

Forward difference approximation, 401 
Forward elimination, 156 
Forward substitution 

see Forward elimination 
Fourier series, 18,73 
Free space, 535 
Fresnel coefficients, 543 
Frobenius norm, 140 
Full rank matrix, 161 
Function space, 4 
Fundamental mode, 532 
Fundamental theorem 

of algebra, 254,484 



Gamma function, 90 
Gaussian function 

see Gaussian pulse 
Gaussian probability 

density function, 413 
Gaussian pulse, 92,556 
Gauss-Seidel method, 181,527 
Gauss-Seidel successive overrelaxation 

(SOR) method, 181 






Gauss transformation matrix, 148 

Gauss vector, 149 

Generalized eigenvectors, 485 

Generalized Fourier series 
expansion, 218 

Generalized mean-value theorem, 79 

Generating functions 

Hermite polynomials, 227 
Legendre polynomials, 233 

Geometric series, 34 

Gibbs phenomenon, 77 

Givens matrix, 514 

Global truncation error, 548 

Golden ratio, 347 

Golden ratio search method, 346 

Gradient operator, 342 

Greatest lower bound 
see Infimum 

Group speed, 556 

Haar condition, 239 

Haar scaling function, 17,36 

Haar wavelet, 17,36 

Hankel matrix, 412 

Harmonics, 162,532 

Heat equation, 527 

Henon map, 324 

Hermite's interpolation formula, 267-269,387
Hermite polynomials, 225-229 
Hessenberg matrix, 205,512 
Hessenberg reduction, 513 
Hessian matrix, 344 
Hilbert matrix, 133 
Hilbert space, 68 
Holder inequality, 139 
Homogeneous material, 535 
Householder transformation, 169-170 
Hump phenomenon, 496 
Hyperbolic partial differential 

equation, 526 
Hysteresis, 535 

Ideal diode, 106 
Idempotent matrix, 170 



IEEE floating-point standard, 42 
If and only if (iff), 32 
Ill-conditioned matrix, 132 
Increment function, 437 
Incremental condition estimation 

(ICE), 367 
Inequality 

Cauchy-Schwarz, 15 
Holder, 10 
Minkowski, 10 
Schwarz, 24,36 
Infimum, 5 

Initial conditions, 528-529 
Initial value problem (IVP) 

Adams-Bashforth (AB) methods 

of solution, 459-461 
Adams-Moulton (AM) methods 

of solution, 461-462 
Chaotic instability, 437 
Definition, 415-416 
Euler's method of solution (explicit 

form), 423,443 
Euler method of solution (implicit 

form), 427,443 
Global error, 465 

Heun's method of solution, 433,453 
Midpoint method of solution, 

457-458 
Model problem, 426,443 
Multistep predictor-corrector 

methods of solution, 456-457 
Parasitic term, 459 
Runge-Kutta methods of solution, 

437-441 
Runge-Kutta-Fehlberg (RKF) 

methods of solution, 464-466 
Single-step methods of solution, 455 
Stability, 421,423,427,433,436, 
445-447,458-459,462-463 
Stability regions, 463 
Stiff systems, 432,467-469 
Trapezoidal method of solution, 436 
Truncation error per step, 432-434 
Inner product, 14 
Inner product space, 14 






Interface between media, 541 
Integral mean-value theorem, 96 
Integration between zeros, 385 
Integration by parts, 90 
Intermediate value theorem, 292,383 
Interpolation 

Hermite, 266-269,381 

Lagrange, 252-257 

Newton, 257-266 

Polynomial, 251 

Spline, 269-284 
Interpolatory graphical display 

algorithm (IGDA), 521 
Invariance (2-norm), 168 
Invariant subspace, 511 
Inverse discrete Fourier transform (IDFT), 27
Inverse power algorithm, 503 
Isotropic material, 535 



Jacobi algorithm, 203 
Jacobi method, 181,527 
Jacobi overrelaxation 

(JOR) method, 181 
Jacobian matrix, 321 
Jordan blocks, 484 
Jordan canonical form, 483 



Kernel function, 215 

Key size, 324,329 

Kirchoff's laws, 418

Known plaintext attack, 331,339 

Kronecker delta sequence, 22 



L'Hopital's rule, 81 

Lagrange multipliers, 142,166,357-358 

Lagrange polynomials, 254 

Lagrangian function, 359 

Laplace transform, 416 

Law of cosines, 169 

Leading principle submatrix, 151 

Leapfrog algorithm, 551 



Least upper bound 

see Supremum 
Least-squares, 127-128,132,161 
Least-squares approximation using 

orthogonal polynomials, 235 
Lebesgue integrable functions, 67 
Legendre polynomials, 229-235 
Levinson-Durbin algorithm, 200 
Limit of a sequence, 63 
Line searches, 345-353 
Linearly independent, 21 
Lipschitz condition, 421 
Lipschitz constant, 421 
Lissajous figure, 478 
Local minima, 343 
Local minimizer, 365 
Local truncation error, 545 
Logistic equation, 436 
Logistic map, 323,437 
Lossless material, 539 
Lower triangular matrix, 148 
LU decomposition, 148 



Machine epsilon, 53 

Maclaurin series, 85 

Magnetic conductivity, 540 

Magnetic flux density, 535 

Manchester pulse, 17 

Mathematical induction, 116 

MATLAB, 34,54,90,102,117-118,126, 
129,134-135,137,146,185, 
186-191, 197-198,201,205-206, 
244-246,248,287-289,292,336, 
338-339,351,362-366,409-411, 
413,442,465-467,469-472, 
475-479,498,509,522-524,552, 
557-559,561,563-564,565-577 

Matrix exponential, 444,488-498 

Matrix exponential condition 
number, 497 

Matrix norm, 140 

Matrix norm equivalence, 141 

Matrix operator, 176 

Matrix spectrum, 509 






Maxwell's equations, 535,540 

Mean value theorem, 79,303 

Message sequence, 325 

Method of false position, 333 

Method of separation of variables, 530 

Metric, 6 

Metric induced by the norm, 11

Metric space, 6 

Micromechanical resonator, 416 

Minor, 489 

Moments, 140,411 



Natural spline, 275
Negative definite matrix, 365
Nesting property, 199
Newton's method, 353-356
Newton's method with backtracking line search, 354-356
Newton-Raphson method breakdown phenomena, 311-312
Newton-Raphson method, 305-312, 318-323
Newton-Raphson method rate of convergence, 309-311
Nondefective matrix, 483
Nonexpansive mapping, 297
Nonlinear observer, 326
Non-return-to-zero (NRZ) pulse, 17
Norm, 10
Normal equations, 164
Normal matrix, 147,498
Normal mode, 532
Normed space, 10
Numerical dispersion, 550,555-556
Numerical integration
    Chebyshev-Gauss quadrature, 389
    Composite Simpson's rule, 380
    Composite trapezoidal rule, 376
    Corrected trapezoidal rule, 394
    Gaussian quadrature, 389
    Hermite quadrature, 388
    Left-point rule, 372
    Legendre-Gauss quadrature, 391
    Midpoint rule, 373,409
    Rectangular rule, 372
    Recursive trapezoidal rule, 398
    Richardson's extrapolation, 396
    Right-point rule, 372
    Romberg integration formula, 397
    Romberg table, 397,399
    Simpson's rule, 378-385
    Simpson's rule truncation error, 380,383
    Trapezoidal rule, 371-378
    Trapezoidal rule truncation error formula, 375-376



Objective function, 341 
Orbit 

see Trajectory 
Order of a partial differential equation, 

525 
Ordinary differential equation 

(ODE), 415 
Orthogonality, 17 
Orthogonal matrix, 161 
Orthogonal polynomial, 208 
Orthonormal set, 23 
Overdetermined least-squares 

approximation, 161 
Overrelaxation method, 182 
Overshoot, 76-77 
Overtones 

see Harmonics 



Parabolic partial differential 

equation, 526 
Parallelogram equality, 15 
Parametric amplifier, 473 
Parseval's equality, 216 
Partial differential equation (PDE), 525 
PASCAL, 38 
Perfectly matched layer (PML), 

540,544,553 
Permeability, 535 
Permittivity, 535 
Permutation matrix, 205 






Persymmetry property, 199 

Phase portrait, 454 

Phase speed, 555 

Phasor analysis, 539 

Picard's theorem, 422 

Pivot, 151 

Plane electromagnetic waves, 534-535 

Plane waves, 536 

Pointwise convergence, 71 

Poisson equation, 526 

Polynomial, 4 

Positive definite matrix, 130 

Positive semidefinite matrix, 130 

Power method, 498-500 

Power series, 94 

Probability of false alarm, 333 

Product method 

see Method of separation of variables 
Projection operator, 170,197 
Propagation constant, 539 
Pseudoin verse, 192 
Pumping frequency, 473 



QR decomposition, 161 
QR factorization 

see QR decomposition 
QR iterations, 508-518 
Quadratic form, 129,164 
Quantization, 39 
Quantization error, 40 
Quartz crystal, 416 



Radius of convergence, 95 

Rate of convergence, 105,296 

Ratio test, 233 

Rayleigh quotient iteration, 508,523 

Real Schur decomposition, 511 

Reflection coefficient, 543 

Reflection coefficients of a Toeplitz 

matrix, 201 
Regula falsi 

see Method of false position 
Relative permeability, 535 



Relative permittivity, 535 
Relative perturbations, 158 
Residual reflections, 555 
Residual vector, 144-145,168,180 
Riemann integrable functions, 68-69 
Ringing artifacts, 77 
Rodrigues formula, 213,230 
Rolle's theorem, 265 
Rosenbrock's function, 342 
Rotation operator, 112-113,120,484 
Rounding errors, 38 
Roundoff errors 

see Rounding errors 
Runge's phenomenon, 257 



Saddle point, 343 

Scaled power algorithm, 580 

Schelin's theorem, 109 

Second-order linear partial differential 

equation, 526 
Sequence, 3 

Sequence of partial sums, 74,122 
Sherman-Morrison-Woodbury 

formula, 203 
Shift parameter, 502,516 
Similarity transformation, 487 
Simultaneous diagonalizability, 520 
Sine integral, 100 
Single-shift QR iterations 

algorithm, 517 
Singular values, 147,165 
Singular value decomposition 

(SVD), 164 
Singular vectors, 165 
Stability, 157 
State variables, 416 
State vector, 325 
Stationary point, 343 
Stiffness quotient, 479 
Stirling's formula, 91 
Strange attractor, 466 
Strongly nonsingular matrix 

see Strongly regular matrix 
Strongly regular matrix, 201 






Subharmonics, 162 
Submultiplicative property of matrix 

norms, 141 
Summation by parts, 249 
Spectral norm, 147 
Spectral radius, 177 
Spectrum, 532 
Speed of light, 538 
Supremum, 5 
Symmetric part of a matrix, 203 



Unique global minimizer, 344 

Unitary space, 12,16 

Unit circle, 248,502 

Unit roundoff, 53 

Unit sphere, 141 

Unit step function, 421 

Unit vector, 17,170 

UNIX, 565 

Upper quasi-triangular matrix, 512 

Upper triangular matrix, 148 



Taylor series, 78-96 

Taylor polynomial, 85 

Taylor's theorem, 440 

Three-term recurrence formula (relation) 

Definition, 208 

Chebyshev polynomials of the first 
kind, 222 

Chebyshev polynomials of the second 
kind, 243 

Gram polynomials, 249 

Hermite polynomials, 229 

Legendre polynomials, 234 
Toeplitz matrix, 199 
Trace of a matrix, 206 
Trajectory, 454 
Transmission coefficient, 543 
Transverse electromagnetic (TEM) 

waves, 586 
Tridiagonal matrix, 199,275,516 
Truncation error, 85 
Truncation error per step, 546 
Two-dimensional Taylor series, 

318-319 
Tychonov regularization, 368 



Uncertainty principle, 36 
Underrelaxation method, 182 
Uniform approximation, 222,238 
Uniform convergence, 71 



Vandermonde matrix, 254,285 

Varactor diode, 473 

Vector electric field intensity, 534 

Vector magnetic field intensity, 535 

Vector, 9 

Vector dot product, 16,48 

Vector Hermitian transpose, 16 

Vector norm, 138 

Vector norm equivalence, 139 

Vector space, 9 

Vector subspace, 239 

Vector transpose, 16 

Vibrating string, 528-534 

Von Neumann stability analysis, 548 

Wavelength, 538 
Wavelet series, 18 
Wave equation 

Electric field, 536 

Magnetic field, 536 
Weierstrass' approximation 

theorem, 216 
Weierstrass function, 384 
Weighted inner product, 207 
Weighting function, 207 
Well-conditioned matrix, 137 



Zeroth order modified Bessel function 
of the first kind, 400 


