This is ONLY the text version of the file which you are looking for!

You most probably arrived to this file by performing a search.

This file was generated only in order to help search engines; its contents will display poorly, as the file was obtained by filtering out all symbols.

The file that you really want to look at is the (gzipped) postscript, or the pdf, version of the one that you are currently looking at.
These files, as well others, with publication information, can be found in Eduardo Sontag's Online Papers site.
If the file that you are looking at is named "foo.html", then the file that you want is (only one of the formats may be available) foo.ps.gz and/or foo.pdf.
For convenience, here are direct links to those files (the pdf file is in some cases unavailable):

The text version follows.

                            Learning  Complexity  Dimensions


                      for  a  Continuous-Time  Control  System



                    Pirkko Kuusela, Daniel Ocone, and Eduardo D. Sontag

                       Rutgers, The State University of New Jersey, USA
                                                       Abstract

             This  paper  takes  a  computational  learning  theory  approach  to  a  problem  of
        linear  systems  identification.   It  is  assumed  that  input  signals  have  only  a  finite
        number k of frequency components, and systems to be identified have dimension no
        greater than n.  The main result establishes that the sample complexity needed for
        identification scales polynomially with n and logarithmically with k.
             Keywords:  linear systems identification, learning theory, VC dimension
1      Introduction


The problem of systems identification may be seen as an instance of the general question
of "learning" an unknown function.  Thus, one may ask if techniques from Computational
Learning Theory ("CLT" in what follows; see for instance [20  ] for a general introduction,
as  well  as  the  papers  in  the  Special  Issue  [4 ]  of  Systems  and  Control  Letters)  can  be
used to obtain insight into this central question in control theory.
     Indeed, Ljung asked precisely this question in a paper presented at a Special Session
in Learning Theory organized at the 1996 Conference on Decision and Control, see [14  ],
and, independently, the papers [7 , 8 ] had already provided results along these lines for
discrete-time  linear  systems  on  finite-window  data.   See  also  [11  ],  [9 ],  and  references
there, as well as [12  ] for several results for nonlinear systems in discrete time.
     For continuous-time systems, the situation is complicated by the fact that, even for
finite-length inputs, learnability is impossible when formulated in the CLT framework;
this  was  proved  in  [23  ],  and,  alternatively,  can  be  seen  by  applying  the  discrete-time
results (through sampling) from [7 , 8 ].
     Thus,  in  this  paper,  we  suppose  that  all  inputs  to  be  used,  in  the  learning  as  well
as  validation  stages,  belong  to  the  linear  span  of  a  fixed  number  k  of  sinusoidal  basic
functions.   This  band-limiting  assumption  allows  us  to  obtain  a  precise  result:   the
sample complexity needed for identification scales polynomially on an upper bound on
the systems being identified, and logarithmically with k.  This provides a tight analogy
to  the  discrete  results  previously  obtained,  in  which  k  appeared  as  the  length  of  the
discrete-time window employed.
     The reader is referred to [17  ] for results that apply to a class of nonlinear continuous-
time  systems,  but  which  is  formulated  in  terms  of  learning  derivatives  evaluated  at  a
particular instant (as opposed to time data).
     This paper is organized in a top-down fashion.  Definitions and main results are given
in Section 2 and in Section 3 we state main upper and lower bounds for the complexity
                                                             1


dimensions.   After  that  we  concentrate  on  proving  the  results;  central  techniques  are
discussed in Section 4 and proofs are in Sections 5 and 6.  An example of a class with
VC-dimension k is given in Section 7.



2      Definitions  and  Statements  of  Main  Results


In this section, we formulate the problem of system identification as a learning problem
and illustrate the corresponding learning setting.  We define the systems to be studied
and,  as  main  results,  we  state  bounds  for  the  number  of  samples  needed  in  order  to
identify systems in this setting.
     We  begin  with  a  general  introduction  to  classification  problems.   Assume  that  a
set  X ,  to  be  called  the  input  space,  is  given  together  with  a  collection  C  of  mappings
X ! {0, 1}.1  Let W  be the set of all sequences


                                        w = (u1 , OE(u1 )), . . ., (us , OE(us ))


over all s   1, (u1 , . . ., us ) 2 X s and let OE 2 C.  An identifier is a map _ : W  ! C.  The
value of _  on a sequence w  above is denoted as _w  instead of _(w).  The "error" of _
is the probability that _ will misclassify a future sample.  More formally, the error of _
with respect to a probability measure P  on X , a OE 2 C, and a sequence (u1 , . . ., us ) 2 X s,
is


                          Err  (P, OE, u1 , . . ., us ) := P {u 2 X ;  _w (u) 6= OE(u)}.


     The  class  C  is  said  to  be  learnable  if  there  is  some  identifier  _  with  the  following
property:  For each accuracy  parameter ffl > 0 and confidence  parameter ffi  > 0 there is
some s so that, for every probability P  and every OE 2 C,


                        P s{(u1 , . . ., us ) 2 X s ;  Err (P, OE, u1 , . . ., us ) > ffl} < ffi,


where  P s is  the  s-fold  product  of  P .  In  the  learnable  case,  the  function  s(ffl, ffi)  which
provides the smallest s achieving this bound for any positive ffl and ffi is called the sample
complexity.   It  can  be  proved  that  learnability  is  equivalent  to  the  finiteness  of  the
Vapnik-Chervonenkis dimension VC   (C) of the class C, which is a combinatorial quantity
describing  the  richness  of  the  class  C.  Moreover,  for  learning  algorithms  that  classify
the observed samples correctly, the sample complexity is bounded by
                                                ae                  `      '             `    ' oe
                                                   8VC   (C)           8e       4           2
                           s(ffl, ffi)   max       ___________log  2   ____   , __ log 2    __    .
                                                        ffl             ffl     ffl         ffi


In addition, there is a similar lower bound for the sample complexity.
     Classification  may  be  viewed  as  the  problem  of  identifying  systems  with  binary
outputs.   More  generally,  we  introduce  a  problem  of  identification  for  systems  hav-
ing bounded outputs ([0, 1]-valued, for technical reasons) via an L1 -error, following [5 ]
(for similar statements with L2 -error see [2 ]).  Denote by F  a class of mappings from X
to [0, 1].
__________________________________________________1
     The set X is assumed to be either countable or an Euclidean space, and the maps in C are assumed
to be measurable. In addition, a mild regularity assumption called "permissibility" is needed so that all
sets appearing below are measurable, for further discussion on the topic see an appendix in [15 ]. In our
context the measurability assumptions are satisfied.



                                                             2


     By  definition,  an  identifier  is  a  mapping  from  [s2N   (X x [0, 1])    s to  [0, 1]X  .  Such  a
map takes as data a sequence of labeled samples and produces an hypothesis.  If h is a
[0, 1]-valued function defined on X  and P  is a probability measure over X x [0, 1],  we
define the error of h with respect to P  as
                                                Z

                                   ErP   (h) :=             |h(u) - y|    dP (u, y).
                                                  Xx[0,1]


For ffl > 0 and ffi > 0 we say that an identifier _  (ffl, ffi)-learns in the agnostic sense with
respect to F  from s examples if, for all distributions P  on X x [0, 1], and all f  in F :


                                  P s{w ;  ErP  (_w )    inf  ErP  (f ) + ffl} < ffi.
                                                        f 2F


Similarly, for ffl > 0 the function class F  is said to be ffl-agnostically learnable if there is
a  function  s0  : (0, 1) ! N  such  that,  for  all  0 < ffi  < 1,  there  is  an  identifier  _  which
(ffl, ffi)-learns  in  the  above  sense  with  s0  samples.   In  addition,  if  the  identifier  always
chooses a hypothesis from F , we say that F  is properly ffl-agnostically learnable.
     For  learning  [0, 1]-valued  functions,  a  sample  complexity  result  may  be  stated  in
terms of a fat-shattering dimension, which is a generalization of the VC-dimension.  For
ffl > 0 and ffi > 0 there is an identifier _ that properly (ffl, ffi)-learns in the agnostic sense
with respect to F  from ([5 ])

                                         `              `                  '          '
                                     _4__   _6d__     7      336e        7          8
                                                  ln ___   _________ ln  ___  + ln  __
                                     ff2    ln  2    ff    ff3 ln 2      ff         ffi
                                       `      `                      '   '
                                           1          2  1         1
                                  =  O    ____  d log   ___ + log  __
                                          ff2           ff         ffi


samples,  where 0 < ff < ffl=4 is chosen so that d = fat ffl=4-ff(F ) is finite.  The quantity
fat fl(F ) is called the fat-shattering dimension of the class F and it measures the richness
of the class F  with scale fl.
     The  sample  complexity  results  show  us  that  the  difficulty  of  system  identification
in  the  learning  theoretic  setting  can  be  analyzed  by  studying  various  complexity  di-
mensions,  and  it  is  the  main  focus  of  this  paper.   However,  formal  definitions  of  the
complexity dimensions are delayed until Section 3.



2.1      Linear  Systems


In the context of learning we discuss continuous-time linear control systems:


                                `x = Ax + Bu,       x(0) = x0 ,       y = Cx,                                           (1)


where A, B, and C  are n x n, n x m, and p x n real matrices, and the time interval is
[0,1].  We study sign-observations (see [13  ] for related work in control theory):


                                 sign   y(1) = (sign   y1 (1), . . ., sign   yp (1)) T ,


where  sign  z  =  0,  if  z     0,  sign  z  =  1,  if  z  >  0  and  T  stands  for  the  transpose.  For
scalar observations this is a classification problem;  each output is classified either 0 or
1  and  the  VC-dimension  can  be  used  to  study  the  complexity  of  the  problem.  When
p > 1, a generalization of the VC-dimension or a loss function is needed.



                                                             3


     In general, unlike the VC-dimension associated to discrete-time linear systems [7 , 12 ],
the  VC-dimension  of  the  classification  problem  for  continuous-time  control  systems  is
unbounded [23  ], even when n = 1, and the identification problem is not learnable in the
sense discussed earlier.  Therefore we restrict the class of admissible controls in order to
achieve a bound for the VC-dimension.  We consider controls u = (u1 , . . ., um  ) such that


                                                        u = G!,


where G is a m x k matrix that parameterizes the control.  The set of basis input func-
tions   = {!1 , . . ., !k } is fixed.  The bounds for the VC-dimension or other complexity
dimensions will depend on the properties of the set  .
     For scalar inputs (i.e.,   m = 1) the VC-dimension associated to the mapping from
inputs G to scalar sign-observations is bounded by k, which in fact can be very large in
applications.  This  bound  is  tight;  we  give  an  example  of  a  function  class     for  which
the associated VC-dimension is indeed k (see Section 7).  However, by considering band-
limited controls a better bound can be achieved.  In this work we consider the following
set of basis input functions

                      n
                    =   !1 , . . ., !k  ;   !1 , . . ., !k  linearly independent and


                                            !j = t`j effjt sin (fij t) or  !j = t`j effjt cos (fij t)                   (2)
                                                                                          o
                                            with    `j 2 N, ffj , fij 2 R, j = 1, . . ., k   ,


and let


                                             `max   = max   {`1 , . . ., `k }.                                          (3)


The  results  in  this  paper  hold  with  minor  modifications  if  basis  input  functions
!j , j  =  1, . . ., k  are,  for  example,  linear  combinations  of  the  functions  of  the  above
form.  However, the proofs and formulae are messier.


Definition  1.        (Sign system concept class, Cm,p  ) Order the set of basis input func-
tions   and denote ! = (!1 , . . ., !k )T .  Let


                                   X   = {G! : [0, 1] ! Rm   ;  G 2 Rmk  },


and  for  each  linear  system    = (A, B, C, x0 )  of  dimension  n  define  the  mapping      :
X    !  Rp  by     (G!)  =  y(1),  where  y(1)  is  the  solution  of     with  control  u  =  G!.
Similarly we define the mapping for sign-observations,


                                S   : X   ! {0, 1}p    G! 7! sign  (   (G!)).


The   class   of   above   mappings   is   the   sign   system   concept   class,    Cm,p      =
{S   ;    linear system of dimension             n}.


     When studying the learning complexity associated to the classification problem we
consider the time interval [0, 1] for simplicity, as the length of the interval plays no role
in  the  result.   However,  for  learning  [0, 1]-valued  functions  utilizing  pseudo-dimension
without a loss function, we consider the time interval [0, o ] with o  > 1.
                                                             4


     For the system


                                                    x` =  Ax + Bu,                                                      (4)

                                                    y =   Cx,


we can write CeAt B = [fl1 , . . ., flm  ], where each fli is a linear combination of n functions
,1 , . . ., ,n .  Each ,i is of the form t`eat sin (bt) or t`eat cos  (bt) with ` 2 {0, . . ., n - 1} and
a + ib  an  eigenvalue  of  A.  Assume  that  A  has  a  fixed  Jordan  block  structure  and  let
ak + ibk  be an eigenvalue of A.P We take ff11 , . . ., ffnm  , a1 , b1 , . . ., ar , br  to be the system
parameters,  where  fli(t)  =     nj=1 ffij,j (t)  for  i  =  1, . . ., m  and  a1 , b1 , . . ., ar , br  are  n
eigenvalue parameters.  For example the eigenvalue parameters for a real 4 x 4 matrix A
with eigenvalues a1   b1 i, a2  and a3  would be a1 , b1 , a2  and a3 .  Similarly, the eigenvalue
parameters with purely complex eigenvalues a1   b1 i and a2   b2 i would be a1 , b1 , a2  and
b2 , whereas real eigenvalue parameters would be listed as a1 , a2 , a3  and a4 .
     Let  U     Rmk    be  a  bounded  set.   Define  a  mapping  F  :  ~ x U  !  R  such  that
F (~, G) = y(o ),  where  o    1  and  y(o )  is  a  solution  of  (4)  with  system  parameters  ~,
and initial condition x(0) = 0.
     In the following definition we take the final time to be o  > 1 in order to show the
effect of the time interval in the learning complexity.


Definition  2.        Assume  that  the  system  x`  =  Ax  +  Bu,  y   =  Cx,  x(0)  =  0  can
be  parameterized  by  ~  2  Rn(m+1)     as  above  and  k ~k 1    =  max   1 i n(m+1)      ~i  <  1.
Let  F (~, u)  =  y(o )  be  the  solution  of  (4)   with  systemR parameters  ~  and  control
u  =  (u1 , . . ., um  )  2  U  =  {u  =  (u1 , . . ., um  )  ;    o0ui(t)dt     M, i  =  1, . . ., m}.  Denote
Bk1 (c) := {x 2 Rk  ;  kxk  1  < c} and define


                                FB  = {F (~, .) : U  ! R ;  ~ 2 Bn(m+1)1   (1)}.



2.2      Main  Results


We  formulate  two  theorems  about  bounding  sample  complexities  as  main  results.   In
Section  3  we  summarize  upper  and  lower  bounds  for  complexity  dimensions  studied
in  this  paper  together  with  definitions.   The  following  results,  formulated  as  sample
complexity statements, are immediate corollaries of learning complexity bounds proved
in this paper, and hence no separate proofs are given.


Theorem  1  (Sample  complexity  for  concept  learning).                                 For sign systems concept
class Cm,1   with scalar observations, i.e., p = 1, the sample complexity s(ffl, ffi) for identi-
fiers that agree with the observed sample can be bounded as

                                             ae                       `     '             `    ' oe
                                                8VC   (Cm,1  )          8e       4           2
                        s(ffl, ffi)   max       _______________log  2   ____   , __ log 2    __     ,
                                                       ffl               ffl     ffl         ffi


where
                                        h                                                i
  VC   (Cm,1  )   2(2mn2 + 4n + 1) log 2  8e  8mn2 k(n + `max  ) + 1   2nk + 2(1 + 2k)n


and `max   is given by  (3) .


                                                             5


     In terms of n (the dimension of the state space) and k  (the band-width) the upper
bound for the VC-dimension is of the form O(n3 log 2 (nk)).  The next section states also
a  corresponding  VC-dimension  lower  bound,  in  terms  of  the  band-width,  of  the  form
O(log (k)), and together with a lower bound for the sample complexity, this provides an
estimate for the number of samples needed in learning.  In particular, in a typical setting
of  fairly  small  system  dimension  n  and  large  band-width  k,  the  log k  bound  is  a  clear
improvement over the linear bound given by elementary analysis.


Theorem  2  (Sample  complexity  for  proper  agnostic  learning).                                     Let ~ > 0, then
the class FB , given in Definition 2, is properly agnostically learnable from

                                   `     `                                        '     '
                                      1                             2 1         1
                                 O    ___   fat (1=4-~)ffl(FB ) log   __ + log  __
                                      ffl2                            ffl       ffi


samples, where
                            8                   j                k
                            <    (m + 1)n log 2   n2mo_neo_kM____  ,
fat (1=4-~)ffl(FB )   min                        i  (1=4-~)ffl                                  j
                            :    2(m + 4)n log 2   8e  nmk4(n + `max  ) + 1   2nk + 2(2k + 1)n    ,

together with `max   given by  (2)  and  (3) , and M  a constant satisfying

                                              Z  o

                                                   |ui(o - t)|dt   kM
                                                0

for all i = 1, . . ., m.  In above, bxc stands for the integer part of x.


     In addition to thePbound above the control systems can be parameterized linearly as
follows:  Let u(s) =     kj=1 gj !j (s).  Then

             Z  o                       Z  o             X    k

   y(o ) = C      eA(o -s)  Bu(s)ds = C      eA(o -s)  B           gj !j (s)ds
               0                          0                 Z j=1
                                             X    k            o                             X    k
                                           =      j=1  gj C      eA(o -s)  B!j (s)ds       =           gj ~j (o ).   (5)
                                                          ____0_________-z_____________"          j=1

                                                                       ~j(o )

                                                P   k                  P  k
Then by considering the class F  := {g 2 Rk  7!     j=1  gj ~j (o ) ;     j=1  ~j (o )2  = 1} and
further restricting g so that k gk 2    R, [18  , 16  ] have shown that


                                     fat fl(F )   min   {9R2 =fl2 , k + 1} + 1.
3      Complexity  dimensions;  main  upper  and  lower  bounds


We begin this section by defining various learning complexity dimensions and after that
we summarize the main upper and lower bounds proved in this paper.


Definition  3  (Vapnik-Chervonenkis  dimension).                               The richness of the collection C
can be measured by its Vapnik-Chervonenkis (VC) dimension introduced in [19  ].  A set
S = {x1 , . . ., xn }   X is said to be shattered by C if, for every subset B   S, there exists



                                                             6


a set A 2 C  such that S \ A = B.  The Vapnik-Chervonenkis  dimension  of  C, denoted
VC   (C), equals the largest integer n such that there exists a set of cardinality n that is
shattered by C.  For example, in Rk  the VC-dimension of closed half-spaces through the
origin is k [22  ].  Thus, if VC   (C) = d, C is not rich enough to distinguish all subsets of any
d + 1 element set, but there is some d element set where subsets can be distinguished.
Proving exact values of the VC-dimension is hard and typically one looks for upper and
lower bounds for the VC-dimension, as is also done in this paper.

     For  our  purposes,  it  is  more  convenient  to  work  with  shattering  in  terms  of  di-
chotomies, i.e., boolean-valued maps.  We identify subsets of D with Boolean functions
OE : D  ! {0, 1}.  Similarly,  each set C  2 C  gives rise to a Boolean function on X ,  and
intersections C \ D are restrictions of functions to D.  In this language, a subset D   X
is shattered by F  := {OE ;  OE : X  ! {0, 1}}, if every dichotomy on D  is a restriction to
D of some OE 2 F .

Definition  4  (Pseudo-dimension  with  respect  to  a  loss  function).                                   For  a  given
class  of  functions  F  :  X  !  Y  and  a  loss  function  L  :  Y  x Y  !  [0, r],  we  introduce
for each f  2 F  the function


                   Af,L  : X x Y  x R ! {0, 1};   (x, y, ae) 7! sign  (L(f (x), y) - ae),                               (6)


and let AF,L   denote all such Af,L  with f  2 F .  The pseudo-dimension of F  with respect
to the loss function L, PD   [F , L], is defined as:


                                             PD   [F , L] := VC   (AF,L  ).


     The  VC-dimension  characterizes  learnability  of  {0, 1}-valued  functions.  For  learn-
ing real-valued functions we look for a generalization of the VC-dimension with similar
properties.   One  such  generalization  is  the  pseudo-dimension.   Unfortunately,  pseudo-
dimension does not share the property the VC-dimension has; there are learnable func-
tion classes with infinite pseudo-dimension, see [20  , p.206] and [3 ].
     Next  we  define  the  fat-shattering  dimension  that  corresponds  to  shattering  with
fixed  "margin"  fl.   Both  the  pseudo-dimension  and  the  fat-shattering  dimension  can
be  used  to  bound  certain  covering  numbers  and  in  this  sense  they  act  like  the  VC-
dimension.   Moreover,  the  fat-shattering  dimension  gives  upper  and  lower  bounds  for
covering numbers of function classes and the finiteness of the fat-shattering dimension
can characterize learnability (see [1 ] and [2 ]).

Definition  5  (Fat-shattering  dimension).                        Let F  be a set of real-valued functions.
We say that a set of points X  is fl-shattered by F  if there are real numbers rx  indexed
by x 2 X  such that for all binary vectors bx  indexed by X , there is a function fb  2 F
satisfying


                                       fb(x)     rx  + fl,     if bx  = 1, and

                                       fb(x)     rx  - fl,     otherwise     .


The fat-shattering dimension fat fl(F ) is a function from positive real numbers to integers
which maps a value fl to the size of the largest fat-shattered set, if it is finite, or infinity
otherwise.
     The shattering dimension when the margin fl equals 0 is called the pseudo-dimension
and it is denoted by PD   (F ).  Clearly, for all fl > 0, fat fl(F )   PD   (F ).



                                                             7


3.1      Bounds


We  begin  by  stating  bounds  in  the  easiest  learning  setting_classifying  the  final  state
observations as either 0 or 1.

Theorem  3.         (VC-dimension  upper  bound,  p = 1   ) The VC-dimension of the sign
system concept class Cm,1   with scalar observations can be bounded as
                                         h                                                i
    VC  (Cm,1  )   2(2mn2 + 4n + 1) log 2  8e  8mn2 k(n + `max  ) + 1   2nk + 2(1 + 2k)n    ,


where `max   is given by  (3) .

     In terms of n (the dimension of the state space) and k  (the band-width) the upper
bound is of the form O(n3 log 2 (nk)).
     All  VC-dimension  upper  bounds  are  based  on  the  fact  that  input  basis  functions
satisfy  a  certain  rationality  condition.  Remark  17  indicates  how  the  bound  is  formed
when the input functions satisfy the more abstract rationality condition.  In that case
the degrees of the polynomials and the number of polynomial evaluations are different.
However, in terms of n and k, the bound is of the same form.  VC-dimension or pseudo-
dimension bounds stated in this paper can be modified for the rationality condition in
the same way.
     The lower bound for the VC-dimension is in terms of n and k.  It holds for linearly
independent continuous basis input functions and compared to upper bounds,  no par-
ticular form of the functions is needed.  The bound is obtained by imposing a specific
structure on control systems, and a lower bound for a restricted class of control systems
provides a lower bound for more general classes.

Theorem  4.         (VC-dimension  lower  bound,m = 1     ,  p = 1   )
                                                        ae   ~       ~             oe
                                                                        k         0
                                  VC   (C1,1 )   max      m0   log 2   ____    , m
                                                                       m0

where m0 = min   {n, k}.

     In terms of k the upper and the lower bound match up to a constant.  For n and k the
lower bound is typically of the form O(n log 2(k=n)).  Note that if the system dimension
n is small compared to the band-width k the VC-dimension upper and lower bounds in
Theorems 3 and 4 become tighter, both being of the form c log 2 k (with different values
of the constant c).
     Extending the upper bounds to the case of vector-valued observations can be done
in various ways based on the result obtained for scalar observations.  For example,  we
may consider the p-dimensional output as bits representing a number in {0, . . ., 2p - 1}
and introduce a loss function for each f  2 Cm,p   as L0-1,f (z, a) = L0-1 (f (z), a) = 1, when
f (z) 6= a, and 0 otherwise.  We define the VC-dimension of the p-dimensional observation
as the VC-dimension of the above class of loss functions.  Modifying the argument used
with scalar observations leads to a bound of the following form:

Theorem  5.         (VC-dimension  upper  bound)


   VC   (Cm,p  )   2(2pmn2 + 4n + p)
                                      h                                                          i
                              x log 2  8e  8mn2 k(n + `max  ) + 1   2p - 1 + 2p(2k + 1)n  + 2nk   ,


where `max   is given by  (3) .



                                                             8


     Next  we  state  the  main  result  concerning  learnability  of  the  actual  input-output
mapping, i.e., learning without taking the sign of the final state observation.


Definition  6.        (Control  system  concept  class,   Gp,L )  Let  Fe   =  {     :  X    !
Rp  ;    linear system of dimension             n}  and  define  the  control  system  concept  class  as
Gp,L  = AFe,L , where AFe,L  is given by (6)  .


     Methods for calculating upper bounds for the VC-dimension readily give a tool for
obtaining  upper  bounds  for  the  pseudo-dimension  with  respect  to  loss  functions  that
preserve the rationality structure of the output.  A typical example is illustrated by the
loss function


                                   L(z1 , z2 ) = (z1 - z2 )2 =(1 + (z1 - z2 )2 )                                        (7)


and the following result:


Theorem  6.         (Pseudo-dimension  upper  bound,  p = 1   )

                       2                         h         2                                    n  i
   PD  (G1,L )   2  2mn  + 4n + 1          log  2  16e  8mn  k(n + `max  ) + 1   2nk + 2(2k + 1)     ,


where the loss function L is given by  (7)  and `max   is given by  (3) .


     This  differs  from  the  corresponding  VC-dimension  bound  only  by  the  maximum
degree  of  the  polynomials,  which  is  doubled.  Extending  this  pseudo-dimension  bound
for  p-dimensional  observations  can  be  done  naturally  by  modifying  the  loss  function.
Lower bounds for the VC-dimension are lower bounds for the pseudo-dimension as such.
     The next results summarize upper bounds for the fat-shattering dimension.  We begin
by illustrating how fat-shattering dimension can be bounded for Lipschitz functions in
certain cases:


Theorem  7  (Fat-shattering  bound).                       Let F (~, u) : Rk x U  ! R be such that F (., u)
is Lipschitz with constant L, i.e., |F (~1 , u) - F (~2 , u)|   Lk~1 - ~2 k for all u 2 U .  For
any subset B   Rk , consider the following class of functions:


                                       FB  = {F (~, .) : U  ! R ;  ~ 2 B}.


Then
                                                                 ~
                                                                   CL
                                   fat fl(FBk1 (C)  )    k log 2   _____   ,
                                                                     fl
                                                                 `     ~        '
                                                                         CL
                                   fat fl(FB~k1(C)  )    k log 2   1 +   _____       ,
                                                                           fl
                                                                 `        '
                                                                       L
                                       fat fl(FB~2,1)    k log 2   C + ___      ,
                                                                       fl

where


                             Bk1 (C)    = {x 2 Rk  ;  kxk1   =  max    |xi| < C},
                                                               1 i k

                             B~k1 (C)   = {x 2 Rk  ;  kxk1   =  max    |xi|   C},
                                                               1 i k
                                                              r  _____________
                                                                 X    k
                              B~k2(C)   = {x 2 Rk  ;  kxk2  =              x2    C}.
                                                                      i=1   i



                                                             9


Theorem  8  (Fat-shattering  bound  for  a  control  system).                                Assume  that  the  sys-
tem  `x= Ax + Bu,  y = Cx,  x(0) = 0  can  be  parameterized  by  ~ 2 Rn(m+1)     as  in  Def-
inition  2  and  k~k  1   < 1.  Let  F (~, u) = y(o )  be  the  solutionR with  system  parameters  ~
and control u = (u1 , . . ., um  ) 2 U  = {u = (u1 , . . ., um  ) ;   0o ui(t)dt   M, i = 1, . . ., m}.
Define


                                   FB  = {F (~, .) : U  ! R ;  ~ 2 Bk1 (1)}.


Then
                                                             ~
                                                               n2 mo neo M
                                 fatfl(FB )   n(m + 1) log 2   _________________  .
                                                                      fl



4      Techniques  for  Proving  VC-Dimension  Results


Our  main  results  are  based  on  a  fact  that  the  basis  input  functions  satisfy  a  certain
rationality  condition.  In  this  section  we  first  formulate  this  rationality  condition  and
then we summarize existing results that are used in proving upper and lower bounds for
the complexity dimensions.
     We recall briefly the control system setting.  We study systems

                                 x` = Ax + Bu,    x(0) = x0 ,    y = Cx,

                                 u  = G!,    and      !j 2   for  j = 1, . . ., k,

with basis input functions
                      n
                    =   !1 , . . ., !k  ;   !1 , . . ., !k  linearly independent and


                                            !j = t`j effjt sin (fij t) or  !j = t`j effjt cos (fij t)
                                                                                          o
                                            with    `j 2 N, ffj , fij 2 R, j = 1, . . ., k   ,


such that


                                             `max   = max   {`1 , . . ., `k }.


Definition  7.        (Rationality  condition  (RAT)) Let n be a positive integer.  We say
that a bounded function ! : [0, 1] ! R satisfies the rationality condition relative to the
class of n-dimensional systems if there exists h polynomial functions f1 , . . ., fh  : R4  ! R
and  2fln  rational  functions  rij`",  i  2  {1, 2},  j  2  {1, . . ., fl}  and  "`2  {1, . . ., n}  with  no

poles on subsets Sij`" of R4 , such that the following properties hold:


    1.  For each i 2 {1, 2}, `" 2 {1, . . ., n},  R4  is a disjoint union of Si1"`, . . ., Sifl`".


    2.  Each Sij`" can be defined in terms of a Boolean expression involving [f1  = 0]  , . . .,

        [fh  = 0]  , where we say that for functions f1 , . . ., fh  : R4  ! R, [fi = 0] has value 1,
        if fi(x1 , x2 , x3 , x4 ) = 0, and 0 otherwise.


    3.  Letting ri"`: R4  ! R, i 2 {1, 2}, `" 2 {1, . . ., n} be defined as
                                                      8
                                                      >>>:
                                                         rifl`"(v),     if v 2 Sifl`",



                                                            10


        then for each a, b 2 R, and for all `" 2 {1, . . ., n}

                              Z  1

                                   t"`-1 eat cos  (bt)!(t)dt    = r1"`(a, b, ea cos  b, ea sin  b),
                              Z 01

                                   t"`-1 eat sin (bt)!(t)dt     = r2"`(a, b, ea cos  b, ea sin  b).
                                0


     We denote by dmax   the maximum degree of any polynomial (i.e., f1 , . . ., fh , numer-
ators and denominators of rij`"'s) appearing in the rationality condition .

Remark 9.       First,  entries of eAt  are functions of the form ts eat cos  (bt) and ts eat sin (bt).
Solving (1)  involves convolutions of eAt  and the basis input functions !j , and we require
those to be rational functions.

Example 10.        Let !(t) = sin (ct), with nonzero c.  Then

           Z  1                         Z  1                            Z  1
                                      1       at                     1        at
                eat sin (bt)!(t)dt =  __     e   cos  ((b - c)t)dt - __      e   cos  ((b + c)t)dt
             0                        2   0                          2    0

after integration this can be split into cases with no poles yielding

                                          8     p(a,b,ea cos b,ea sin b)
                Z  1                      >< ____________________________(a2+(b+c)2)(a2+(b-c)2),iff1  6= 0, f2  6= 0,

                     eat sin (bt)!(t)dt =    b-sin_b_cosb__2b,                 if f1  = 0, f2  6= 0,
                  0                       >: sin_b_cosb-b__
                                                   2b      ,                   if f1  6= 0, f2  = 0,

where

                                  f1 (a, b, ea sin  b, ea cos  b)  = a2 + (b + c)2 ,

                                  f2 (a, b, ea sin  b, ea cos  b)  = a2 + (b - c)2 ,


and p(a, b, ea cos  b, ea sin  b) stands for the polynomial


    - 4abc + 4abcea  cos  b cos  c - 2a2 cea  cos  c sin b + 2b2 cea  cos  c sin b

            - 2c3 ea  cos  c sin b - 2a2 bea  cos  b sin c - 2b3 ea  cos  b sin c + 2bc2 ea  cos  b sin c

                                            + 2a3 ea  sin  b sin c + 2ab2 ea  sin b sin c + 2ac2 ea  sin b sin c.


Lemma  11.          Each !j  2   given by  (2)  satisfies the rationality condition.  Further, the
maximum  degree  of  polynomials  in  (RAT)  is  at  most  4(n + `max  ),  where  `max   is  given
by  (3) .



Review  of  VC-Dimension  Techniques


In  the  context  of  control  theory  it  is  sometimes  easier  to  work  with  the  dual  VC-
dimension.   Assume  that  a  function  F  :    x X  !  {0, 1}  is  given.   This  induces  two
function classes


                                     F := {F (~, .) : X ! {0, 1} ;  ~ 2  }


and


                                    F * := {F (., x) :   ! {0, 1} ;  x 2 X }.



                                                            11


The complexity dimension VC   (F *) is called the dual VC-dimension of F and it is related
to VC   (F ) as follows [20  ]:


                                            VC   (F )   blog 2 VC   (F *)c,                                             (8)


where bxc is the integer part of x.
     A sharper estimate can be obtained if   can be written as a product  1 x . . .x  n .
The following construction and result are due to DasGupta and Sontag [7 ].  We study
in  particular  those  dichotomies  that  are  defined  on  "rectangular"  subsets  of   .   Let
L = L1 x . . .x Ln  be a subset of   such that for each i,  Li    i is nonempty.  Given
any  index  1     ~     n  a  ~-axis  dichotomy  on  L  is  any  function  ffi  :  L  !  {0, 1}  which
depends  only  on  the  ~th  coordinate,  i.e.,  there  is  some  function  OE  :  L~  !  {0, 1}  so
that ffi(~1 , . . ., ~n ) = OE(~~ ) for all (~1 , . . ., ~n ) 2 L.  We say that a mapping is an axis
dichotomy if it is a ~-axis dichotomy for some ~.  A rectangular set L is said to be axis
shattered by F * if every axis dichotomy is a restriction to L of some function of the form
F (., x) :   ! {0, 1}, for some x 2 X .


Theorem  12.          (Axis  shattering  bound  [7 ])  If  L = L1  x . . .x Ln       can  be  axis
shattered and each Li has cardinality ri > 0, then VC   (F )   blog 2 (r1 )c + . . .+ blog 2 (rn )c.


     Upper bounds for VC-dimensions of concept classes that are obtained by evaluating
polynomial  equalities  and  inequalities  can  be  obtained  in  terms  of  the  number  and
degrees of the polynomials:


Theorem  13.          (Goldberg-Jerrum bound [10  ]) Given a function F  :   x X ! {0, 1}
and  the  associated  concept  class  F  :=  {F (~, .)  :  X  !  {0, 1}  ;  ~  2   },  suppose  that
   =  R`  and  X  =  Rk .   Let  F  be  defined  in  terms  of  a  Boolean  formula  involving  at
most  s  polynomial  equalities  and  inequalities  in  ` + k  variables,  each  polynomial  being
of degree at most d in ~ for all x 2 Rk .  Then, VC   (F )   2` log 2(8eds).


     The Goldberg-Jerrum bound is based on a result showing that the number of sign
assignments {-1, 0, 1} to polynomials can not grow too fast:


Theorem  14.          ([10  ])  Suppose  that  f1 , . . ., fm   are  polynomials  of  degree  at  most  d  in
n   m variables.  Then the number of distinct vectors

                                                                                    m
                                  sign  f1 (x), . . ., sign   fm  (x)   2 {-1, 0, 1}   ,


that can be generated by varying x over Rn , is at most ( (8edm)    =n)  n .
5      Proofs  of  the  VC-dimension  bounds


5.1      An Upper Bound for the VC-Dimension with Scalar Observations


We begin this section by proving Lemma 11 stating that the input basis functions satisfy
the rationality condition (RAT) and bounding the degrees of polynomials appearing in
(RAT). As a proposition we formalize how control systems can be parameterized.  After
that,  as  a  lemma,  we  develop  an  upper  bound  for  the  VC-dimension  induced  by  the
control system (1)   with its initial state fixed to be zero.  Theorem 3 with an arbitrary
initial condition is then a simple modification of the argument.



                                                            12


Proof  of  Lemma  11.             If !(t) = t`efftsin (fit) or !(t) = t`efftcos  (fit) with `   `max   then
in place of

                           Z  1                                Z   1

                                t"`eat sin (bt)!(t)dt    or         t"`eat cos  (bt)!(t)dt                              (9)
                             0                                   0

by combining exponents and using sum formulae for sin  and cos   (see Example 10) it is
enough to study terms of the form

                                Z  1                             Z   1

                                     t"ke"atsin ("bt)dt    or         t"ke"atcos  ("bt)dt,
                                  0                                0


where k" 2 {0, . . ., n + `max   - 1}.  In fact, each expression in (9)   is one of the following
type

                         `  Z  1                               Z  1                            '
                       1_         "`"at                              "`"at
                                 t e   cos  ((b - fi)t)   dt -      t e   cos  ((b + fi)t)   dt
                       2     0                                   0
                         `  Z  1                               Z  1                            '
                       1_         "`"at                              "`"at
                                 t e   cos  ((b + fi)t)   dt +      t e   cos  ((b - fi)t)   dt
                       2     0                                   0
                        `  Z  1                               Z  1                            '
                      1_         "`"at                              "`"at
                                t e   sin  ((b - fi)t)   dt -      t e   sin  ((b + fi)t)   dt       ,
                      2      0                                  0
                        `  Z  1                               Z  1                            '
                      1_         "`"at                              "`"at
                                t e   sin  ((b + fi)t)   dt +      t e   sin  ((b - fi)t)   dt       ,
                      2      0                                  0


where "a = a + ff.  Because for "k > 0

   Z   1                           "a
                                  e
         t"ke"atsin ("bt)dt =  __________("a sin "b- "bcos  "b)
     0                         "a2 + "b2
                                                            "k     Z  1  "
                                                      - __________      tk -1 e"at("a sin ("bt) - "bcos ("bt))dt   (10)
                                                        "a2 + "b2    0

and
                                 Z  1                     "a
                                                         e  ("a sin "b- "bcos  "b) + "b
                                      e"atsin ("bt)dt =  ________________________________                             (11)
                                   0                                "a2 + "b2
                           R 1  "
and a similar formulae for   0 tk e"atcos  ("bt)dt hold, we see by induction that the numerator
   R  1  "
of   0  tk e"atsin ("bt)dt is a polynomial of "a, "b, e"acos  "b and e"asin  "b.  By using sum formulae
for sin and cos, the previous expression is in turn a polynomial of a, b, ea cos  b and ea sin  b
because  "a= a + ff  and "b = b   fi  for  some  fixed  ff  and  fi.  By  similar  arguments,  the
denominator is a polynomial of a and b.  Note that, for example, eff equals a constant
times ea , so this process does not change the degreesRof the polynomials.
     Further,  observe  that  the  denominator  of   01 t"`eat sin (bt)!(t)dt  consists  of  at  most

two products of variables a and b of the form ((a + ff)2 + (b   fi)2 )"`+`+1  , and similarly
with the cos  (bt) term.  Let us index the basis input functions !1 , . . ., !k  so that !~  has
parameters ff~  and fi~ .  Hence the functions fi in (RAT), defining the subsets without
poles, can be taken as


                   {(a + ff~ )2 + (b - fi~ )2 , (a + ff~ )2 + (b + fi~ )2 ;  ~ = 1, . . ., k}.



                                                            13


Furthermore, the sets Sij`" are as simple as


[k~=1  {{(x1 , x2 , x3 , x4 ) ;  x1  = -ff~ , x2  = -fi~ } [ {(x1 , x2 , x3 , x4 ) ;  x1  = -ff~ , x2  = fi~ }}



and


   R4 \ [k~=1  {{(x1 , x2 , x3 , x4 ) ;  x1  = -ff~ , x2  = -fi~ } [

                                                                 {(x1 , x2 , x3 , x4 ) ;  x1  = -ff~ , x2  = fi~ }}     .

     We turn to estimating the maximum degree of polynomials appearing in (RAT). We
already saw that functions fi are polynomials of degree 2.  Equations (10)   and (11)   show
that the degree of numerator is not higher than the one of denominator.  We claim that
                      Z   1        ae       "
                                                         P (2("k + 1))
                            t"ke"at    sin (b"t)   dt =  __________________"  for  k" = 0, 1, . . .
                        0              cos  (b t)        ("a2 + "b2)k +1

where P (2("k + 1)) stands for some polynomial in "a, "b, e"asin ("b) and e"acos  ("b) of degree
2("k + 1).  Clearly, the claim is true for "k = 0 by (11)   and the inductive argument follows
from (10)  .  Assuming the claim true for "k - 1, we get

   Z   1

         t"ke"atsin ("bt)dt
     0                                 Z                                       Z

                P (2)           "k"a      1  "k-1  "at                  "k"b      1  "k-1  "at
            =  __________ - __________      t     e   sin ("bt)dt + __________      t     e   cos  ("bt)dt
               "a2 + "b2    "a2 + "b2    0                          "a2 + "b2    0

                            P (2)          P (1)        P (2"k)            P (1)        P (2"k)
                        =  __________ - _____________ _______________"+ _____________ _______________"
                           "a2 + "b2    ("a2 + "b2)   ("a2 + "b2)k      ("a2 + "b2)   ("a2 + "b2)k

                                                         P (2)("a2 + "b2)"k - 2P (2"k + 1)            P (2("k + 1))
                                                      =  _________________________________________"=  __________________"
                                                                     ("a2 + "b2)k +1                  ("a2 + "b2)k +1


and similarly for the cos  ("bt) term, concluding the proof of the claim.  As a corollary of
the claim
                Z  1
                                                  P (2("k + 1))                 P (2("k + 1))
                     t"`eat sin (bt)!(t)dt =  __________________________"+ __________________________",
                 0                            ("a2 + (b + fi)2 )k +1       ("a2 + (b - fi)2 )k +1

where "k = ` + "`and "a = a + ff.
     Hence the maximum degree of denominators of expressions in (9)  is 2("k + 1) + 2("k +
1) = 4("k + 1) with "k 2 {0, . . ., `max   + n - 1}.  Thus the maximum degree of polynomials                __
appearing in the (RAT) is 4(n + `max  ).                                                                   |__|


     The next proposition indicates how control systems are parameterized and later the
concept  or  function  classes  associated  to  control  systems  are  obtained  by  varying  the
parameter vector.

Proposition  15.           Denote  the  basis  input  functions  by  !  =  (!1 , . . ., !k )T   and  assume
that each !i, i = 1, . . ., k satisfies the rationality condition (RAT) and let   = R2pn2m   x
R4n  x Rp .  Then there exists a mapping H :   x Rmk   ! Rp  (depending on !) such that
for each   = (A, B, C, x0 ) there exists a ~ 2   satisfying

                                       (G!) = H (~, G) for all   G 2 Rmk  .



                                                            14


Proof.    Given a system   = (A, B, C, x0 ),

                                                        Z  1

                                (u) = y(1) = CeA x0 + C      eA(1-t)  Bu(t)dt.
                                                          0


By  an  argument  based  on  the  real  Jordan  form  of  eAt ,  the  entries  of  eA(1-t)    are
linear  combinations  of  functions  of  the  form  t"`eat cos  (bt)  and  t"`eat sin (bt),   where
"`2 {0, . . ., n - 1}     and  a + ib  is  an  eigenvalue  of  A.  Hence  we  define  the  following  2n

functions


                                             ,1 (a, b, t)  = eat cos  (bt)

                                             ,2 (a, b, t)  = teat cos  (bt)

                                                         ..
                                                         .

                                             ,n (a, b, t)  = tn-1  eat cos  (bt)

                                          ,n+1  (a, b, t)  = eat sin (bt)

                                                         ..
                                                         .

                                           ,2n (a, b, t)   = tn-1  eat sin (bt).


     By the rationality condition (RAT), for all ` = 1, . . ., 2n

                             Z  1                        ^           a          a
                                                         P`j (a, b, e  cos  b, e  sin  b)
                                  ,`(a, b, t)!j (t)dt =  ___________________________________
                              0                          Q^`j (a, b, ea cos  b, ea sin  b)


for all a, b 2 R and where P^`j  and Q^`j  are piecewise polynomial expressions.
     Let H (A, X, h, G) = (H1 , . . ., Hp )T , where for 1   ~   p


                    X    m    X   n     X   2n           X   k        P^`j(xr1 , xr2 , xr3 , xr4 )
  H~ (A, X, h, G) =      i=1      r=1       `=1  ffir`~      j=1  gij ______________________________^+ h~
                                                                      Q`j  (xr1 , xr2 , xr3 , xr4 )


and


                        A   = (ffir`~ ) i=1,...,m                          X   = (xrj ) r=1,...,n
                                        r=1,...,n                                       j=1,...,4
                                        `=1,...,2n
                                        ~=1,...,p

                         h  = (h1 , . . ., hp )T                            G  = (gij) i=1,...,m  .
                                                                                        j=1,...,k

     Next, we relate     and H  and we write

                                                      2                     3
                                                        fl11    . . .  fl1m

                                        CeA(1-t)  B = 64  ...            ...75 .

                                                        flp1    . . . flpm


We list the eigenvalues of A as ar + ibr  for r = 1, . . ., n and let ,r` (t) = ,`(ar , br , t) for
r = 1, . . ., n and ` = 1, . . ., 2n.  Then there exists some ( ffir`~ ) such that

                                                 X    n    X    2n
                                      fl~i (t) =      r=1       `=1  ffir`~ ,r~ (t).                                  (12)
                                                            15


     Let ~ = (A, X, h), where A satisfies (12)  , X = (xrj ), where xr1  = ar , xr2  = br , xr3  =
ear  cos br  and xr4  = ear  sin br , and h = CeA x0 .  We claim that


                               H (~, G) = y(1) =    (G!) for all   G 2 Rmk  .


Note that the ~-th component of    (G!) is given by

                Z   1 X   m

                               fl~i (t)ui(t)dt + h~
                  0       i=1
                   X    m    X   n     X   2n           Z  1         X    k
                 =      i=1      r=1       `=1  ffir`~       ,r` (t)           gij!j (t)dt + h~
                                                         0                j=1
                   X    m    X   n     X   2n           X   k        Z  1
                 =      i=1      r=1       `=1  ffir`~      j=1  gij      ,`(ar , br , t)!j (t)dt + h~
                                                                       0
                   X    m    X   n     X   2n           X   k        P^`j(xr1 , xr2 , xr3 , xr4 )
                 =      i=1      r=1       `=1  ffir`~      j=1  gij ______________________________^+ h~
                                                                     Q`j  (xr1 , xr2 , xr3 , xr4 )

                 = H~ (A, X, h, G).

                                                                                                                           __
                                                                                                                          |__|


     Next we take p = 1 and study the VC-dimension of the sign system concept class,
Cm,1  , where each control parameterized by G gives rise to sign   y(1).


Lemma  16.          Sign system concept class Cm,1   with initial condition x(0) = 0 satisfies

                            2                   h        2                                    n  i
      VC   (Cm,1  )   2  2mn  + 4n         log 2  8e  8mn  k(n + `max  ) + 1   2nk + 2(1 + 2k)     ,


where `max   is given by  (3) .


Proof.    By  Proposition  15  y(1)  =  H (~, G),  where  ~  2  R2mn2    x R4n   are  considered  as
parameters.  In fact, y(1) =  P__Q, where P and Q denote piecewise polynomial functions.

As  in  the  statement  of  Goldberg-Jerrum  bounds  we  have  a  function  F  :    x Rmk   !
{0, 1}  defined  by  F (~, G) = sign   H (~, G).  The  concept  class  associated  to  the  system
identification problem is F  := {F (~, .) : Rmk   ! {0, 1} ;  ~ 2  }, where   = R2mn2+4n    .
Before applying the Goldberg-Jerrum bound we need to determine the possible degrees
of P and Q with respect to the parameters.
     The rationality condition implies that
                                             n                           o
                                      max      deg  (P^`j ), deg  (Q^`j )     dmax  .
                                     i ` n
                                     1 j k


Then

                                                Pei`    X    k        P^
                                                _____ =           gij _`j__
                                                Qei`         j=1      Q^`j



so  deg  (Qei` )     kdmax    and  deg  (Pei`)     kdmax  .   Note  here  that  we  are  calculating  the
degree  with  respect  to  the  system  parameters,  and  the  inputs  gij  do  not  contribute.
By  continuing  in  a  similar  fashion  andP combining  r-summation  to  the  `-summation
                                                 2
in  Proposition  15,  we  write  Pi=Qi  =      2n`=1ffi`Pei`=Qei`   to  conclude  that  deg  (Qi)



                                                            16

                                                                                          P  m
2n2 deg  (Qei` )  =  2n2 kdmax    and  deg  (Pi)     2n2 kdmax   + 1.   Finally,  P=Q  =     i=1  Pi=Qi
with deg  (Q)   m2n2 kdmax   and deg  (P)   m2n2 kdmax   + 1.
     Recall that with p = 1 and initial condition x(0) = 0, using the notation of Propo-
sition 15

                      X    m    X   n     X   2n         X    k        Z  1
               y(1) =      i=1      r=1       `=1  ffir`      j=1  gij      ,`(xr1 , xr2 , t)!j (t)dt.
                                                                         0

The proof of Lemma 11 indicates that the denominator of
                                             Z   1

                                                   ,`(xr1 , xr2 , t)!j (t)dt
                                               0

equals
                  i                                    j  z`j i                                   j  z`j
                    ( xr1  + ffj ) 2 + (xr2  + fij ) 2          (xr1  + ffj ) 2 + (xr2  - fij ) 2        ,


where ffj , fij  are fixed parameters of the basis input function !j  and z`j  2 N.
     By carrying out the summations we get y(1) = P=Q, where Q consists of powers of
polynomials fij1 , fij2  with


                                fij1 (A, X, G)      = (xi1 + ffj )2 + (xi2 + fij )2 ,

                                fij2 (A, X, G)      = (xi1 + ffj )2 + (xi2 - fij )2 ,


and i = 1, . . ., n, j = 1, . . ., k.
     Our final step before applying the Goldberg-Jerrum bound is finding out the number
of  polynomial  inequalities  s  needed  in  the  Boolean  formula  evaluating  the  sign  of  the
final  state  output.  This  is  done  by  studying  the  number  of  different  P=Q  expressions
without poles.
     An  upper  bound  for  different  P=Q  expressions  without  poles  can  be  obtained  by
applying Theorem 14 to 2nk polynomials fij1 , fij2 , i = 1, . . ., n and j = 1, . . ., k viewing
those as polynomials of 2n variables and each polynomial having degree 2.  This gives
the upper bound (16ek)2n .
     However, a more specific bound can be obtained in this problem.  Note that varying
xi1  and  xi2  we  can  make  at  most  one  of  the  2k  polynomials  fij1 ,  fij2 ,  j  = 1, . . ., k  to
be  zero.  For  example,  fl  zeros  among  fij1 ,  fij2 ,  i  =  1, . . ., n  and  j  =  1, . . ., k  can  be
obtained in (2k)fl nfl  ways and the number of possible sign assignments is obtained by
summing over fl  yielding

                                         X    n       fl` n '             n

                                              fl=0(2k)    fl    = (1 + 2k)  .


Thus  the  number  of  P=Q  expressions  without  poles  is  (1 + 2k)n ,  which  gives  rise  to
2(1 + 2k)n  polynomials.
     Note  that  in  order  to  write  sign  y(1)  as  a  Boolean  formula  evaluating  polynomial
inequalities  and  equalities  it  also  has  to  include  the  2nk  polynomials  fij1 ,  fij2 ,  i  =
1, . . ., n, j = 1, . . ., k.  Values of these polynomials determine which P=Q expression is
the valid one to determine sign  y(1).  The Boolean formula for sign  y(1) can be given as a
truth table involving polynomial inequalities of 2nk fij1 , fij2  expressions and 2(1 + 2k)n
different P and Q expressions.
     Using  Lemma  11  for  bound  on  dmax  ,  we  apply  the  Goldberg-Jerrum  bound  with __
s = 2nk + 2(2k + 1)n , d = m2n2 k4(n + `max  ) + 1 and ` = 2mn2 + 4n.                       |__|



                                                            17


     A simple example of a piecewise polynomial function P=Q together with the decision
table for the final output is provided in Appendix A.

Remark 17.        The  VC-dimension  bound  is  modified  for  the  more  abstract  rationality
conditions  as  follows.   Evaluating  the  sign  of  the  output  involves  the  evaluation  of
2(8edmax  2n2 kh=4n)4n  + 2n2 kh  polynomials;  2n2 kh  evaluations  are  needed  to  find  an
appropriate  piece  and  by  Theorem  14  the  maximum  number  of  possible  expressions
of  the  type  P=Q  is  bounded  by  (8edmax  2n2 kh=4n)4n .  Applying  the  Goldberg-Jerrum
bound with s = 2(8edmax  2n2 kh=4n)4n  + 2n2 kh, d = m2n2 kdmax   + 1 and ` = 2mn2 + 4n
gives the result.


Proof  of  Theorem  3,  the  VC-dimensionR upper  bound,  p = 1   .                                 By using the pre-
vious  notation  y   =  CeA x0  + C   10eA(1-t)  Bu(t)dt.    Let  ex  =  CeA x0 .    Then  y   =

ex + P=Q = (exQ + P)     =Q = Pe =Q.  This has 2mn2 + 4n + 1 parameters and deg  (Pe )                                     __
m2n2 kdmax   + 1      .                                                                                                   |__|


Remark 18.        Taking 2n2  functions ,r` (t), r = 1, . . ., n, ` = 1, . . ., 2n in Proposition 15 is
clear  overcounting.  Further  calculations  indicate  that  O(n ln n)  functions  ,r` (t)  would
be enough.  For more details see Appendix B.



5.2      Lower  Bounds  for  the  VC-Dimension


The lower bounds for the VC-dimension are developed for a single-input single-output
system with initial state zero.  The control is

                                              X    k
                                          u =      j=1  gj !j : [0, 1] ! R.



We derive lower bounds by fixing the structure of A, B, and C, and using the dual VC-
dimension and axis shattering following the ideas of DasGupta and Sontag [7 ].  Lemmata
20,  21  and  22  given  in  this  section  together  prove  Theorem  4.   These  lower  bounds
are  very  general;  we  just  assume  that  the  input  functions  are  continuous  and  linearly
independent, thus no particular structure on input function is required as in the upper
bounds.
     To make the next proof cleaner we formulate a part of it as a separate proposition.
(The proposition is a standard fact but we include a short proof for the completeness of
exposition.)


Proposition  19.           Let  !j  : [0, 1] ! R,  j  = 1, . . ., k  be  continuous  and  linearly  indepen-
dent.  Then the functions

                                             Z  1

                                    hj (~) =      e~t !j (t)dt,    j = 1, . . ., k
                                               0

are linearly independent.


Proof.    We begin by showing that for any f  2 C[0, 1] = {f : [0, 1] ! R ;  f  continuous     }

                                      Z  1

                                           e~t f (t)dt = 0       for all ~ 2 R
                                        0

                                                            18


implies f (t)   0 on [0, 1].  The above integral is anRanalytic function of ~.  Moreover, all
derivatives in ~ are zero at ~ = 0, which gives that   01 tk f (t)dt = 0 for all k = 1, 2, . . ..
As f  is continuous and polynomials are dense in C[0, 1], we have that f (t)   0.
     Linear  independence  of  h1 , . . ., hk  follows:  If  ff1 h1 (~) + . . .+ ffk hk (~)  =  0  for  all
~ 2 R, then

                            Z  1

                                 e~t (ff1 !1 (t) + . . .+ ffk !k (t))dt = 0,    8~ 2 R
                             0

and  by  the  above  argument  ff1 !1 (t) + . . .+ ffk !k (t) = 0  for  all  t.  As  !1 , . . ., !k  were_
assumed to be linearly independent, we have that (ff1 , . . ., ffk ) = (0, . . ., 0).               |__|


Lemma  20  (Lower  bound  1).                   Sign system concept class C1,1  with scalar inputs and
scalar outputs satisfies

                                                           ~       ~
                                                                      k
                                          VC  (C1,1 )   m0   log 2   ____    ,
                                                                     m0


where m0 = min   {n, k}.


Proof.    Let  !j (t),  j  =  1, . . ., k  be  continuous  and  linearly  independent.  Let  A  have  n
distinct real eigenvalues -~1 , . . ., -~n , and take B  and C  so that

                                                           X    m0
                                             CeA(1-t)  B =      i=1  e~it ,


where m0 = min   {n, k}.  Then the final output of the system is

                  Z  1             X    k                  X    m0   X   k       Z  1

           y(1) =      CeA(1-t)  B           gj !j (t)dt =                    gj      e~it !j (t)dt.
                    0                   j=1                     i=1      j=1       0

                      R  1
     Define  hj (~) =   0  e~t !j (t)dt.  By  Proposition  19  the  hj 's  are  linearly  independent
and we can find ~1 , . . ., ~k  such that the matrix

                                            2                                 3
                                               h1 (~1 )     . . .  hk (~1 )
                                            6      .                   .      7
                                            4      ..                  ..     5

                                               h1 (~k )     . . .  hk (~k )


has rank k.
     The control system with sign observations gives the mapping F  : Rm0  x Rk  ! {0, 1}
by

                                                                   ~ X                            ~
                                                                          m0   X   k
                      (~1 , . . ., ~m0 , g1 , . . ., gk ) 7! sign         i=1      j=1  gj hj (~i)    .



     We show that the mapping from parameters ~1 , . . ., ~m0   to {0, 1} can be axis shat-
tered.  Let  L  =  {~1 , . . ., ~k }  be  so  that  [hj (~i)]i,j has  rank  k.  Denote  by  L1 , . . ., Lm0
                                                                S  m0
disjoint subsets of L such that |Li| = bk=m0c and let M  = L \ {   i=1  Li}.  Next we want
to interpolate in the points of L.



                                                            19


     Fix  s,  1   s   m0 and  let  OE : Ls  ! {0, 1}  be  any  dichotomy.  Next  find  g1 , . . ., gk
such that

                              X    k

                                   j=1  gj hj (~s )  = OE(~s ),       8 ~s 2 Ls ,
                                X   k                                                                                 (13)

                                    j=1   gj hj (~)  = 0,       8 ~ 2 (L [ M ) \ Ls .


Let g*1, . . .g*k satisfy (13)  .  (A unique solution exists because [hj (~i)]   has rank k.)  Then

                                                              ~ X                            ~
                                                                     m0   X   k     *
                F [~1 , . . ., ~m0 , g*1, . . ., g*k] = sign         i=1      j=1  gj hj (~i)     = OE(~),



when ~ 2 Ls  and for all (~1 , . . ., ~m0 ) 2 L1 x . . .x Lm0 .
     Let  Fe  =  {F (~1 , . . ., ~m0 , .)  :  Rk   !  {0, 1}  ;   (~1 , . . ., ~m0 )  2  Rm0 }.  By  the  Axis
shattering bound given in Theorem 12

                                                           ~       ~
                                                                      k
                                           VC   (Fe )   m0   log 2   ____
                                                                     m0


and  thus  VC  (C1,1 )     VC  (Fe ),  where  C1,1  is  the  control  system  concept  class  with  p  =               __
m = 1.                                                                                                                |__|


Lemma  21  (Lower  bound  2).                   If k   n then


                                                    VC   (C1,1 )   k.


Proof.    We make a small modification of the above argument.  Assume that k   n and
letPA have n real eigenvalues ~1 , . . ., ~n .  Next we take B  and C  so that CeA(1-t)  B  =
   n     ~it
   i=1  e    fii, where (fi1 , . . ., fin , ~1 , . . ., ~n ) are considered as system parameters.
     We study the mapping

                                                                  ~ X                               ~
                                                                         n    X   k
      (fi1 , . . ., fin , ~1 , . . .,,~ng1 , . . ., gk ) 7! sign         i=1      j=1  gj hj (~i)fii

                                              "                                     #         ~                  ~
                                                 X    k       X    n                            X    k
                                      = sign          j=1  gj      i=1  hj (~i)fii    = sign              gj flj  .
                                                              _________-z________"                   j=1
                                                                       flj


Given    (fl1 , . . ., flk ),     by    linearP   independence    of    h1 , . . ., hk ,     we    can    find
~1 , . . ., ~n , fi1 , . . ., fin  such that     ni=1 hj (~i)fii = flj ,  j  = 1, . . ., k.  But (fl1 , . . ., flk ) can
be viewed as a normal vector for a hyperplane through the origin inhRk  and the concepti
                                                                   P   k
class  associated  to  the  mapping  (g1 , . . ., gk )  7!  sign       j=1  gj flj   as  (fl1 , . . ., flk )  varies
                                                                                                           __
has VC-dimension k.  Hence VC   (C1,1 )   k.                                                              |__|


Lemma  22  (Lower  bound  3).                   If n   k  then


                                                    VC   (C1,1 )   n.


                                                            20


Proof.    Our construction for the control system is as in the previous proof but now we
assume that n   k and we study
                                                                    ~ X                               ~
                                                                           n    X   k
        (fi1 , . . ., fin , ~1 , . . .,,~ng1 , . . ., gk ) 7! sign         i=1      j=1  gj hj (~i)fii

                                                "                                    #
                                                   X    n   X    k                              h X    n         i
                                        = sign          i=1      j=1  gj hj (~i)  fii   = sign             g"ifii
                                                            _________-z________"                       i=1
                                                                      "gi


and  again  by  linear  independence  and  the  above  hyperplane  argument  (now  via  first
transforming (g1 , . . ., gk )) we can conclude that the above mapping has VC-dimension                              __
n.  Thus VC   (C1,1 )   n.                                                                                          |__|



5.3      VC-Dimension  Upper  Bounds  for  p-Dimensional  Outputs


We begin by proving Theorem 5.


Proof  of  the  VC-dimension  upper  bound.                          We develop an upper bound based on
the  bound  for  a  scalar  sign-observation.   We  have  seen  that  under  the  rationality  as-
sumption (RAT) the scalar output is a piecewise rational expression P=Q.  In general,
the control system maps G to (sign  (P1 =Q1 ), . . ., sign (Pp =Qp ))T , which is understood as
a binary representation of a number in {0, 1, . . ., 2p - 1}.  Let f  : Rmk   ! {0, . . ., 2p - 1}
be the mapping given by the control system, and denote the class of all such mappings
by F .  For each f  2 F  introduce a loss function L0-1,f (z, a) = L0-1 (f (z), a) = 1,  when
f (z) 6= a, and 0 otherwise.  Define the class L0-1,F  = {L0-1,f  ;  f  2 F }.
     In order to calculate the value of the output, after determining an appropriate piece,
one needs to know the truth values of the expressions P1  > 0, Q1  > 0, . . ., Pp  > 0 and
Qp  >  0,  where  P's  and  Q's  are  polynomials  on  inputs  and  parameters  of  the  control
system.  To evaluate the value of the loss function L0-1,f (z, a), one needs the truth values
of y = 0, y = 1, . . ., y = 2p - 2.
     In  the  general  case  one  needs  2nk + 2p(2k + 1)n  + 2p  - 1  truth  values.   As  this
procedure  evaluates  only  polynomials,  we  can  use  the  Goldberg-Jerrum  bound  again.
The maximum degree of the polynomials is m2n2 k4(n + `max  ) + 1, and the total number
of parameters is 2pn2 m + 4n + p, where the last term comes from the initial condition.                                    __
                                                                                                                          |__|


Remark 23.        An upper bound can be derived by using the uniform  -dimension defined
by  Ben-David  et.  al.  [6 ].   As  before  let  f  :  Rmk    !  {0, . . ., 2p  - 1}  be  the  mapping
given  by  the  control  system.   Let     =  {_1 , . . ., _p }  be  a  collection  of  distinguishers
_j  : {0, . . ., 2p - 1} ! {0, 1} given by _j (sign  (y1 ), . . ., sign (yp )) = sign  (yj ).

     Let _jf : Rmk   ! {0, 1} be given by _jf(z) = _j (f (z)) and _jF  = {_jf ;  f  2 F }.  An

upper bound for VC   (_jF) is given by Theorem 3.  Ben-David et. al. define the uniform
 -dimension of F , in which only one distinguisher is selected to shatter, as


                                   - dim  U  (F ) = max   {VC   (_jF) ;  _j  2  }.


That is, the upper bound for the uniform  -dimension is just the upper bound obtained
earlier for scalar observations.  However, the uniform  -dimension can not be compared
to  the  VC-dimension  as  such.   For  example,  the  formula  for  sample  complexity  with



                                                            21


 -dimension is different:  to calculate the sample complexity the uniform  -dimension
is multiplied by p log 2(2p - 1), see [6 , Theorem 27].
6      A  Fat-Shattering  Bound


We  begin  this  section  by  proving  Theorems  7  and  8.  As  a  corollary  of  Theorem  8  we
prove the fat-shattering bound appearing in Theorem 2 bounding the sample complexity
for proper agnostic learning.


Proof  of  Theorem  7.             For  the  first  part  of  the  proof  we  use  a  generic  set  B  for  the
parameters.  Assume that we can fl-shatter a set of inputs {u1 , . . ., ud } and there exists
{r1 , . . ., rd } such that, for each assignment b 2 {0, 1}d , there exists a ~ 2 B  such that


                                      F (~, ui)     ri + fl,     if bi = 1, and

                                      F (~, ui)     ri - fl,     otherwise.


We write ~ ~ ~ if and only if the parameters ~ and ~ give the same assignment for all
{u1 , . . ., ud }.  Further, let   = {~1 , . . ., ~2d } be a collection of parameters that shatter
{u1 , . . ., ud } and let ~i, ~j 2  .  Now ~i 6~ ~j  implies that there exists u*  2 {u1 , . . ., ud }
and r*  2 {r1 , . . ., rd } such that F (~i, u* )   fl + r*  and F (~j , u* )   fl - r* , or vice versa.
Hence


                                2fl   |F (~i, u* ) - F (~j , u* )|   Lk~i - ~j k


and so k~i - ~j k   2fl=L.  That is, the set   of cardinality 2d  is a 2fl=L-separated set
in  B.  Now  the  fat-shattering  bounds  follow  by  calculating  2fl=L-packing  numbers  for
different sets B.
     If B = Bk1 (C) the maximum possible cardinality for an ffl-separated set is b2C=fflck ,
and thus

                                                  ~           k   ~        k
                                                      2C            CL
                                            2d      ________    =   _____
                                                    2fl=L             fl


and solving for d yields d   k log 2 bCL=flc.
     Similarly, if B = B~k1 (C) the maximum possible cardinality for an ffl-separated set is
(1 + b2C=fflc)    k and by a similar argument we arrive at the bound d   k log 2 (1 + bLC=flc)      .
     For B = B~k2(C), let P (ffl) be a collection of ffl-separated sets in B~k2 (C) and let |P (ffl)|
denote  its  cardinality.   As  all  open  balls  with  radius  ffl=2  with  centers  at  ffl-separated
points have to be disjoint and their union has to be inside a ball of radius C + ffl=2, we
get that |P (ffl)|ff(k)(ffl=2)k    ff(k)(C + ffl=2)k , where ff(k) = ssk=2 = (k=2 + 1) is the volume
of a unit ball in Rk .  Hence |P (ffl)|   (C + 2=ffl)k  and


                                                 2d    (C + L=fl)     k ,

                                                                                                                    __
i.e., d   k log 2 (C + L=fl).                                                                                      |__|


     Next we prove Theorem 8 by applying the Lipschitz bound to a control system.



                                                            22


Proof  of  Theorem  8.             Our aim is to compute the Lipschitz constant associated to the
control system in Definition 2 and then we apply Theorem 7.
     Denote  the  system  parameters  (ff11 , . . ., ffnm  , a1 , b1 , . . ., ar , br )  by  ~  and  assume
k~k1   < 1.  Let
                                           Z  o  X   m    X    n

                        F (~, u) = y(o ) =                          ffi`,`(t)ui(o - t)dt.
                                             0       i=1       `=1

Functions ,1 (t), . . ., ,n (t) are of the form ,(t) = tceat sin (bt) or ,(t) = tceat cos  (bt) where
a + ib  is  an  eigenvalue  of  A  and  c  2  {0, . . ., n - 1}.  Thus  taking  a  partial  derivative
with respect to a or b will increase the power of t by one and change the trigonometric
functions.  Therefore,
                fifi          fi    fi        fi     fiX        X                Z  o                 fi
                fifi@F_(~,_u)_fifi= fifi@y(o_)_fifi= fifi  m        n    d(i, `)      ,`(t)ui(o - t)dtfifi
                    @ff~ae    fi    fi @ff~ae fi     fi    i=1      `=1            0                  fi
                                                    Z  o

                                                 nm      |,`(t)ui(o - t)|     dt   nmo n-1  eo M,
                                                      0

because  sup t2[0,o ] |,`(t)|    eo o n  and  d(i, `) = @ffij=@ff~ae = 1  if  (i, `) = (~, ae)  and  zero
otherwise.  Similarly we calculate
                                  fifi          fi    fi        fi
                                  fifi@F_(~,_u)_fifi= fifi@y(o_)_fifi  nmo neo M  and
                                       @a~      fi    fi  @a~   fi
                                  fifi          fi    fi        fi
                                  fifi@F_(~,_u)_fifi= fifi@y(o_)_fifi  nmo neo M
                                       @b~      fi    fi  @b~   fi
                fi       fi                                fi       fi
as sup  t2[0,o ]fifi@,`(t)_@~fifi  eo o n and sup  t2[0,o ]fifi@,`(t)_@b~fifi  eo o n.

     Now the Lipschitz constant can be taken to be L = n2 meo o nM  as


                        |F (~, u) - F (~* , u)|      = |rF . (~ - ~* )|        Lk~ - ~* k1  .


The  number  of  system  parameters  is  at  most  nm + n  =  (m + 1)n  and  we  get  the
level fat-shattering bound by applying Theorem 7 with space dimension n(m + 1) and                                  __
L = n2 meo o nM .                                                                                                  |__|


     As a corollary, we combine the above result together with a pseudo-dimension bound
to prove the fat-shattering bound given in Theorem 2.

Corollary  24  (Fat-shattering  bound  in  Theorem  2).                              Assume   that   the   system
`x =  Ax + Bu,  y  =  Cx,  x(0)  =  0  can  be  parameterized  by  ~  2  Rn(m+1)     as  in  Defi-
nition  2  with  k~k  1   <  1  and  assume  in  addition  that  the  control  is  given  by  u  =  G!
where the input basis functions !j  are in   given by  (2) .  We denote the corresponding
control system class by


                                       FB  = {F (~, .) : U  ! R ;  ~ 2 B}.


Then
                       8                   j                k
                       <    (m + 1)n log 2   n2mo_neo_kM____  ,
   fat fl(FB )   min                        i       fl                                     j
                       :    2(m + 4)n log 2   8e  nmk4(n + `max  ) + 1   2nk + 2(2k + 1)n    ,


                                                            23


where `max   is given by  (2)  and  (3)  and M  is a constant satisfying

                                              Z  o

                                                   |ui(o - t)|dt   kM
                                                0

for all i = 1, . . ., m.


Proof.    The first part of the bound follows from Theorem 8 with kM  in place of M .
     The  remaining  part  of  the  bound  comes  from  the  pseudo-dimension  bound.  First
we  derive  the  associated  VC-dimension  bound.   As  we  assumed  that  A  has  a  fixed
Jordan  block  structure,  every  entry  of  eA(1-t)    is  a  linear  combination  of  n  functions
,1 (t), . . ., ,n (t).  (That is, we don't need to consider all possible functions over different
Jordan  block  structures.)  This  implies  that  in  the  Goldberg-Jerrum  argument  of  Sec-
tion 5.1 we can take ` = mn + 4n, d = nmk4(n + `max  ) + 1 and s = 2nk + 2(2k + 1)n .
Moreover, in that section the VC-dimension bounds were derived for time interval [0, 1].
However,  the  upper  bound  depends  on  the  number  of  system  parameters  and  the  de-
grees  of  polynomials  to  be  evaluated.   Changing  the  time  interval  to  be  [0, o ]  means
just that we replace the eigenvalue parameters (referring to the proof of Proposition 15)
a, b, ea cos  b, ea sin  b by ao, bo, eao  cos  bo, eao  sin bo .
     The  above  bound  is  also  a  bound  for  the  pseudo-dimension.     Observe  that
for  G   =   {g   :   X   !   R},   the  pseudo-dimension  can  be  defined  as  PD   (G)   =
VC   {Ind  (x, y)  =  sign (g(x) - y)      ;  g  2  G}.   Hence  we  study  the  VC-dimension  associ-
ated to sign  (y(o ) - z)    = sign  (P=Q - z) = sign  (P^ =Q), where P^  = P - zQ has the same
degree as P with respect to the parameters.  Here z  is a new input, but the bound uti-
lizing Goldberg-Jerrum technique does not depend on the dimension of the inputs, and
hence the above VC-dimension bound is also a bound for the pseudo-dimension.  (Note
that here in the scale sensitive setting we do not apply the pseudo-dimension results of          __
Section 5.3 using loss functions, as those rescaled the outputs.)                                |__|


Remark 25.        In  the  special  case  of  scalar  controls  (i.e.,  m  =  1)  that  are  given  by  a
linear combination of k fixed input functions, the control system class discussed can be
viewedPas a family of linear mappings in Rk  as calculated in (5)  .  Let F  = {x 2 Rk  7!

   i wixi + `     ;  kwk2  = 1    }.  If we further restrict x such that kxk2    R and |`|    R then
[18  , 16  ] have shown that


                                     fat fl(F )   min   {9R2 =fl2 , k + 1} + 1.


     In  this  special  case  our  fl-shattering  result  is  of  the  form  c1 log 2 ( bc2 k=flc)  ,  where
c1  and c2  are constants.  This gives improvement over the hyperplane bound when the
margin fl  is small.
7      A  Class  of  Systems  with  VC-Dimension  k

                                                                 P  k
For  the  control  system  (1)   with  scalar  control  u(t)  =     i=1  gi!i(t)  and  unrestricted
!1 , . . ., !k ,  the  standard  half-space  argument  gives  an  upper  bound  k.  This  bound  is
tight.  We will give an example of a single-input, single-output one parameter family of
control systemsPin dimension two that has VC-dimension k, when the controls are of the
form u(t) =     ki=1gi!i(t) and !i(1 - t) = 1[2-i ,2-i +2ff], where ff = -2(k + 1).

                                                            24


     Consider a control system


                                                   x`1 = x2

                                                   x`2 = -~2 x1 + u                                                   (14)

                                                    y  = -x1 .


For  timeR interval  [0,1]  and  initial  condition  (x1 , x2 )  =  (0, 0),  the  output  is  given  by
y(1) =   10sin (~t)u(1 - t)dt.


Lemma  26.          Controls {!1 , . . ., !k } such that !i(1-t) = 1[2-i ,2-i +2ff], where ff = -2(k +
1), are shattered by the control system  (14)   with sign-observations.
                                                                          P  k
Proof.    Let T  = {2-i ,  i = 1, . . ., k} and J    T .  Define ~J  = ss    i=1  ai2i, where ai = 1
if 2-i  62 J  and ai = 0 otherwise.  Now if t = 2-` , then
                                               _                                                !
                          X    k                  X    `-1                 X    k
                ~J t = ss      i=1 ai2i-`  = ss        i=1  ai2i-`   +a` +              ai2i-`     ,
                                                  ________-z_______"       _____i=`+1_-z________"
                                                        1=2 c2                      2c1


where c1  2 N and 0   c2  < 2.
     Hence sin (~J t) = sin (ss(1=2 c2 + a`)). Note that if a` = 0 then 1=2 c2 + a` 2 [0, 1 - 2-` ]
and if a` = 1 then 1=2 c2 + a` 2 [1, 2 - 2-` ].  Thus sin (ss(1=2 c2 + a`))   0 if a` = 0 and
sin (ss(1=2 c2 + a`))   0 if a` = 1.  Therefore,


                                              sin (~J t)   0  ,  a` = 0


and


                                              sin (~J t)   0  ,  t 2 J.


Further,

                                     Z   2-` +2ff

                                                  sin (~J t)dt   0  ,  a` = 0,
                                       2-`

where ff is taken so that
                                              X    k     j  ff    -(k+1)

                                                   j=1  2  2     2         .                                          (15)


This assures that when `   k and t 2 [2-` , 2-`  + 2ff]

                                                 X    k
                        ~J t 2 [0, ss(1 - 2-`  +      j=1  2j 2ff)]   [0, ss) ,  if a` = 0


or similarly


                                             ~J t 2 [ss, 2ss) ,  if a` = 1.

                                        P   k
In (15)   we can take ff = -2(k + 1) as     j=1  2j = 2k+1   - 2.
     In this way the integrand in

                                                Z   2-` +2ff

                                                             sin (~J t)dt
                                                  2-`



                                                            25


is either positive or negative.
     For S   {1, . . ., k}, let J = {2-i , i 2 S}.  For each !i,

                    Z  1                         Z  2-i +2ff

                         sin (~J t)!i(1 - t)dt =              sin (~J t)dt > 0 ,  i 2 S,
                      0                            2-i

i.e., the set of controls {!1 , . . ., !k } is shattered by the mapping

                                                 h Z   1                      i
                                     !i 7! sign          sin (~J t)!i(1 - t)dt  .
                                                     0
                                                                                                                           __
                                                                                                                          |__|



A       An  Example  of  the  Goldberg-Jerrum  Bound


We begin this appendix with an informal discussion on the Goldberg-Jerrum technique
used to prove the VC-dimension upper bounds in this paper.
     We want to write y(1) = P=Q, where P and Q are polynomials.  Unfortunately, the
value  of  sign  y(1)  can  not  be  obtained  by  just  evaluating  P  and  Q  since  Q  may  have
zeros.  Therefore we need to write
                                        8
                                        >>>:
                                           Pfl =Qfl  ,     if f1  = 0, . . ., f~  6= 0


so that after evaluating ~ polynomials f1 , . . ., f~  we can pick a definition Pi=Qi without
poles in a region defined by the ~ polynomials.  When y(1) is defined in this way sign  y(1)
can be easily expressed by a Boolean formula evaluating 2fl + ~ polynomial inequalities
and equalities.
     For simplicity we assume that p = 1 and the initial condition x(0) = 0.  Then using
the notation of Proposition 15 we write

                       X    m    X   n     X   2n         X    k        Z  1
                y(1) =      i=1      r=1       `=1  ffir`      j=1  gij      ,`(xr1 , xr2 , t)!j (t)dt
                                                                          0

and by the proof of Lemma 11
Z   1
                                                                          P`j
      ,`(xr1 , xr2 , t)!j (t)dt =  ____________________________________________________________________________________22z`j22z`j,
  0                                ( (xr1  + ffj )  + (xr2  + fij )  )       ((xr1  + ffj )  + (xr2  - fij )  )



where  P`j   is  some  polynomial,  z`j   2   N  and  !j (t)   =   t`j effjt sin (fij t)  or  !j (t)   =
t`j effjt cos (fij t).  Hence the denominator of

                                      X    k        Z  1

                                           j=1  gij   0  ,`(xr1 , xr2 , t)!j (t)dt

is

                  2                2      z`1                2                2      z`1
     (xr1  + ff1 )  + (xr2  + fi1 )             (xr1  + ff1 )  + (xr2  - fi1 )           x
                                               2                2       z`k                2                2      z`k
                          . . .x  (xr1  + ffk )  + (xr2  + fik )              (xr1  + ffk )  + (xr2  - fik )           .



                                                            26


By  carrying  out  all  summations  y(1)  =  P=Q.   The  denominator  Q  consists  of  the
product
            _
   Y    n                    2                2      *                2                2      *

        r=1     (xr1  + ff1 )  + (xr2  + fi1 )           (xr1  + ff1 )  + (xr2  - fi1 )         x
                                                                                                                    !
                                                 2                2       *               2                2       *
                            . . .x  (xr1  + ffk )  + (xr2  + fik )           (xr1  + ffk )  + (xr2  - fik )            ,



where *'s stand for some unspecified powers.  Hence the zeros of Q are determined by
2nk polynomials

                                       fij1  = (xi1 + ffj )2 + (xi2 + fij )2 ,

                                       fij2  = (xi1 + ffj )2 + (xi2 - fij )2 ,

and i = 1, . . ., n, j = 1, . . ., k.  The number of different sign assignments determining fl
is calculated as in the proof of Lemma 16.



Example


The  purpose  of  the  following  example  is  to  illustrate  the  function  y  =  P=Q  used  in
the  Goldberg-Jerrum  technique  together  with  the  sequence  of  polynomial  evaluations
involved and a table for the final output depending on the outcomes of the polynomial
evaluations.
     Take m = 1, n = 2, k = 2, and assume that A has complex eigenvalues a   ib.  Take
basis input functions to be !1 (t) = et and !2 (t) = e2t .  Then

                                       X    2        X   2        Z  1
                                y(1) =      l=1  ffl     j=1   gj      ,l(t)!j (t) dt,
                                                                    0

where ,1 (t) = eat sin (bt), ,2 (t) = eat cos  (bt), ff1 , ff2 , a, b, ea sin  b and ea cos  b are system
parameters and g1 , g2  are input parameters.
     By using formulas
                            Z  1                       "a
                                                      e  ("a sin "b- "bcos  "b) + "b
                                 e"atsin ("bt) dt  =  ________________________________,    and
                            Z 0                                  "a2 + "b2
                               1                      e"a("a cos "b + "bsin "b) - "a
                                 e"atcos  ("bt) dt =  ________________________________,
                             0                                   "a2 + "b2

we calculate the integrals appearing in the rationality condition, and we call them r11 ,
r12 , r21 , and r22 :
             Z  1                     Z  1                        (                    2    2
                                                                     r11 ,   if (a + 1)  + b   6= 0,
                  ,1 (t)!1 (t) dt   =      e(a+1)t  sin (bt) dt =                      2    2
               0                        0                            0,      if (a + 1)  + b   = 0,
             Z  1                     (                    2    2
                                         r12 ,   if (a + 2)  + b   6= 0,
                  ,1 (t)!2 (t) dt   =                      2    2
               0                         0,      if (a + 2)  + b   = 0,
             Z  1                     (                    2    2
                                         r21 ,   if (a + 1)  + b   6= 0,
                  ,2 (t)!1 (t) dt   =                      2    2
               0                         1,      if (a + 1)  + b   = 0,
             Z  1                     (                    2    2
                                         r22 ,   if (a + 2)  + b   6= 0,
                  ,2 (t)!2 (t) dt   =                      2    2
               0                         1,      if (a + 2)  + b   = 0.



                                                            27


     The computation of

                                            `  X                         Z                  '
                                                   2        X   2           1
                        sign   y(1) = sign         l=1  ffl     j=1   gj      ,l(t)!j (t) dt
                                                                           0

is divided into three cases:


     o  Case (a + 1)2 + b2  6= 0, (a + 2)2 + b2  6= 0:

                                                                                                   `      '
                                                                                                      P1
                sign  y(1) = sign  (ff1 g1 r11  + ff1 g2 r12  + ff2 g1 r21  + ff2 g2 r22 ) = sign     ____   .
                                                                                                      Q1



     o  Case (a + 1)2 + b2  = 0, (a + 2)2 + b2  6= 0:

                                                                                         `      '
                                                                                            P2
                         sign  y(1) = sign  (ff1 g2 r12  + ff2 g1 + ff2 g2 r22 ) = sign     ____   .
                                                                                            Q2



     o  Case (a + 1)2 + b2  6= 0, (a + 2)2 + b2  = 0:

                                                                                          `      '
                                                                                             P3
                         sign  y(1) = sign  (ff1 g1 r11  + ff2 g2 r21  + ff2 g2 ) = sign     ____   .
                                                                                             Q3


                                                      P
Thus we have three different expressions of the form  ___.
                                                      Q
     Next  we  form  the  Boolean  formula,  F  =  sign  y(1),  evaluating  polynomials  f1  =
(a + 1)2 + b2  = 0, f2  = (a + 2)2 + b2  = 0, Pi > 0, Qi > 0, for i 2 {1, 2, 3}.  In the following
table 1 means true and 0 means false for the above polynomial evaluation (** = 1 or 0,
i.e., extend the table).


      __f1__=_0______f2__=_0_____P1__>_0______Q1__>_0_______P2__>_0______Q2__>_0_______P3__>_0______Q3__>_0__|___F____

            0           0            1            1            **           **            **           **     |   1
            0           0            1            0            **           **            **           **     |   0
            ..           .            .            .            .            .             .            .         .
            .            ..           ..           ..           ..           ..            ..           ..     || ..
            1           0           **            **            1            1            **           **     |   1
            1           0           **            **            1            0            **           **     |   0
            ..           .            .            .            .            .             .            .         .
            .            ..           ..           ..           ..           ..            ..           ..     || ..
            0           1           **            **           **           **            0             0      |  1


In    this    case    (see    the    statement    of    Goldberg-Jerrum    bounds)           =
{ff1 , ff2 , a, b, ea cos  b, ea sin  b}, X = {g1 , g2 }, s = 8, d = 12, and l = 6.
B       Functions  Needed  to  Express  an  Exponential  of  a  Real

        Matrix


Taking 2n2  functions ,r` (t), r = 1, . . ., n, ` = 1, . . ., 2n to express eAt  in Proposition 15
is clear overcounting.  In this appendix we estimate how many functions are needed to
express eAt  for any n x n real matrix A.

                                                            28


     Let us list the complex eigenvalues of A as a1   ib1 , . . ., a~    ib~ , where ~   bn=2c,
in the decreasing order by their geometric multiplicities.  We introduce functions

                                                                                     n_c-1  a t
      a1   ib1  :     Mfa1 ib1    := {ea1t cos  (b1 t), tea1t cos  (b1 t), . . ., tb 2     e 1  cos  (b1 t),
                                                                                       n_c-1  a t
                                          ea1t sin (b1 t), tea1t sin (b1 t), . . ., tb 2     e 1  sin (b1 t)},
                                                                  n_c-1  a t
      a2   ib2  :     Mfa2 ib2    := {ea2t cos  (b2 t), . . ., tb 4     e 2  cos  (b2 t),
                                                                     n_c-1  a t
                                          ea2t sin (b2 t), . . ., tb 4     e 2  sin (b2 t)},

                     ..
                     .

      a~    ib~  :    Mfa~  ib~    := {ea~ t cos (b~ t),

                                          ea~ t sin (b~ t)}.


Linear combinations of the above functions can express all possible matrices eAt  when
A has purely complex eigenvalues.
     As  real  eigenvalues  can  have  higher  geometric  multiplicities,  we  add  the  following
functions

                                                             n_c a t           n-1   a t
                            Ma1 ib1     := Mfa1 ib1    [ {tb 2  e 1  , . . ., t     e 1  },
                                                             n_c a t           b n_c-1  a t
                            Ma2 ib2     := Mfa2 ib2    [ {tb 4  e 2  , . . ., t  2     e 2  },

                                      ..
                                      .

                            Ma~  ib~    := Mfa~  ib~   [ {tea~ t}.


     To  express  eAt   when  A  has  real  eigenvalues  a`,  `  =  ~ + 1, . . ., n  with  geometric
multiplicity one, we add at most bn=2c + 1 functions Ma`  = {ea`t } for ` = ~ + 1, . . ., n.
     The total number of functions in


                              Ma1 ib1   [ . . .[ Ma~  ib~  [ Ma~+1   [ . . .[ Man


is

                             X    bn=2c   n      X    bn=2c  i   n       n   j       n
                           2      i=1   b ___c +               b __c - b ___c    + b __c + 1
                                          2i          i=1         i      2i          2
                              X    bn=2c   n      X    bn=2c   n       n
                            =      i=1   b ___c +            b __c + b __c + 1
                                           2i          i=1      i      2
                               3n  X    bn=2c  1      n
                               ____            __ + b __c + 1
                                2       i=1     i     2
                               3n  i      n      j        n
                               ____  ln b __c + 1     + b __c + 1 = O(n ln n),
                                2         2               2

                                 P   l    1
using the fact that ln (k + 1)       i=1  __i  ln (k) + 1.



Acknowledgments


This  research  has  been  financed  in  part  by  grants  from  the  following  organizations:
the  Academy  of  Finland,  Finnish  Academy  of  Sciences  and  Letters,  Finnish  Cultural
Foundation and US Air Force (grant F47620-97-1-0159).

                                                            29


References


  [1]  N.  Alon,  S.  Ben-David,  N  Cesa-Bianchi,  and  D.  Haussler,  Scale-sensitive  dimen-
       sions, uniform convergence and learnability, Journal of the ACM 44 (1997), no. 4,
       615-631.


  [2]  M. Anthony and B. Bartlett, Learning in neural networks:  theoretical foundations,
       Cambridge University Press, 1999.


  [3]  P.  Bartlett,  P.  Long,  and  R.  Williamson,  Fat-shattering  and  learnability  of  real-
       valued functions, Proceedings of the Seventh Annual Conference on Computational
       Learning Theory (New York, N.Y.), ACM, 1994, p. 299.


  [4]  P. Bartlett and M. Vidyasagar (eds.), Learning theory special issue of Systems and
       Control Letters, vol. 34, 1998.


  [5]  Peter L. Bartlett and Philip M. Long, More theorems about scale-sensitive dimen-
       sions  and  learning,  In  Proceedings  of  the  Eight  Annual  Conference  on  Computa-
       tional Learning Theory, Santa Cruz, California, 5-8 July 1995, ACM Press, 1995,
       also  appeared  as  Prediction,  Learning,  Uniform  Convergence  and  Scale-Sensitive
       Dimensions, Journal of Computer and System Sciences,v 56, n2, Feb 1998, pp. 174-
       190.


  [6]  S. Ben-David, N. Cesa-Bianchi, D. Haussler, and P. M. Long, Characterizations of
       learnability  for  classes  of  {0, . . ., n}-valued  functions,  J.  Comput.  System  Sci.  50
       (1995), no. 1, 74-86.


  [7]  B. DasGupta and E. D. Sontag, Sample complexity for learning recurrent perceptron
       mappings, IEEE Trans. Inform. Theory 42 (1996), no. 5, 1479-1487.


  [8]  B. DasGupta and E. D. Sontag, Sample complexity for learning recurrent perceptron
       mappings, Advances in Neural Information Processing Systems 8, (D.S. Touretzky,
       M.C.  Moser,  and  M.E.  Hasselmo,  eds.),  MIT  Press,  Cambridge,  MA,  1996,  pp.
       204-210.


  [9]  C.-N. Fiechter, PAC adaptive control of linear systems, Tenth Annual Conference
       on Computational Learning Theory, ACM, 1997.


[10]   P. Goldberg and M. Jerrum, Bounding the Vapnik-Chervonenkis dimension of con-
       cept classes parametrized by real numbers, Machine Learning 18 (1995), 131-148.


[11]   T.  Johansen  and  E.  Weyer,  On  convergence  proofs  in  system  identification  -  a
       general principle using ideas from learning theory, Systems and Contol Letters 34
       (1998), 85-92.


[12]   P.  Koiran  and  E.  D.  Sontag,  Vapnik-Chervonenkis  dimension  of  recurrent  neural
       networks, Discrete Applied Mathematics 86 (1998), no. 1, 63-80.


[13]   R. Koplon and E. D. Sontag, Linear systems with sign-observations, SIAM J. Con-
       trol Optim. 31 (1993), 1245-1266.


                                                            30


[14]   L. Ljung, PAC-learning and symptotic system identification theory, Proceedings of
       35th  IEEE  Conf.  on  Decision  and  Control,  Kobe,  Japan,  IEEE,  1996,  pp.  2303-
       2307.


[15]   D.  Pollard,   Convergence  of  stochastic  processes,   Springer  series  in  statistics,
       Springer Verlag, 1984.


[16]   J.     Shawe-Taylor,      P.     Bartlett,      R     Williamson,      and     M.     Anthony,
       Structural     risk     minimization     over     data-dependent     hierarchies,      Neu-
       roCOLT      Technical      Report,       NC-TR-96-053      (1996),       available      at
       ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech_reports                          or
       www.neurocolt.com


[17]   E.D. Sontag, A learning result for continuous-time recurrent neural networks, Sys-
       tems and Control Letters 34 (1998), 151-158.


[18]   V. Vapnik, The nature of statistical learning theory, Springer-Verlag, 1995.


[19]   V. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies
       of events to their probabilities, Theory Probab. Appl. 16 (1971), 264-280.


[20]   M. Vidyasagar, A theory of learning and generalization; with applications to neural
       networks and control systems, Springer-Verlag, 1997.


[21]   __________,  Introduction  to  some  statistical  aspects  of  PAC  learning  theory,  Systems
       and Control Letters 34 (1998), no. 3, 115-124.


[22]   R.  Wenocur  and  R.  Dudley,  Some  special  Vapnik-Chervonenkis  classes,  Discrete
       Mathematics 33 (1981), 313-318.


[23]   A. Zador and B. Pearlmutter, VC dimension of an integrate-and-fire neuron model,
       Neural Computation 8 (1996), no. 3, 611-624.
                                                            31