NAME
     regexp, compile, step, advance - simple  regular  expression
     compile and match routines

SYNOPSIS
     #define INIT declarations
     #define GETC(void) getc code
     #define PEEKC(void) peekc code
     #define UNGETC(void) ungetc code
     #define RETURN(ptr) return code
     #define ERROR(val) error code

     #include <regexp.h>

     char *compile(char *instring, char *expbuf, char *endbuf,
          int eof);

     int step(char *string, char *expbuf);

     int advance(char *string, char *expbuf);

     extern char *loc1, *loc2, *locs;

DESCRIPTION
     Regular Expressions (REs)  provide  a  mechanism  to  select
     specific  strings from a set of character strings.  The Sim-
     ple Regular Expressions  described  below  differ  from  the
     Internationalized   Regular  Expressions  described  on  the
     regex(5) manual page in the following ways:

          o   only Basic Regular Expressions are supported

          o   the Internationalization features (character class,
             equivalence class, and multi-character collation) are
             not supported.

     The functions step(), advance(), and compile()  are  general
     purpose  regular  expression matching routines to be used in
     programs that perform regular  expression  matching.   These
     functions are defined by the <regexp.h> header.

     The functions step() and advance() do pattern matching given
     a  character  string  and  a  compiled regular expression as
     input.

     The function compile() takes as input a  regular  expression
     as defined below and produces a compiled expression that can
     be used with step() or advance().


  Basic Regular Expressions
     A regular expression specifies a set of  character  strings.
     A member of this set of strings is said to be matched by the
     regular expression.  Some characters  have  special  meaning
     when  used  in  a regular expression; other characters stand
     for themselves.

     The following one-character REs match a single character:

     1.1  An ordinary character (not one of  those  discussed  in
          1.2 below) is a one-character RE that matches itself.

     1.2  A backslash (\) followed by any special character is  a
          one-character  RE  that  matches  the special character
          itself.  The special characters are:

          a.   ., *, [, and  \  (period,  asterisk,  left  square
               bracket,  and  backslash, respectively), which are
               always special, except  when  they  appear  within
               square brackets ([]; see 1.4 below).

          b.   ^ (caret or circumflex), which is special  at  the
               beginning of an entire RE (see 4.1 and 4.3 below),
               or when it immediately follows the left of a  pair
               of square brackets ([]) (see 1.4 below).

          c.   $ (dollar sign), which is special at the end of an
               entire RE (see 4.2 below).

          d.   The character used to bound (that is, delimit)  an
               entire RE, which is special for that RE (for exam-
               ple, see how slash (/) is used in the  g  command,
               below.)

     1.3  A period (.) is a one-character  RE  that  matches  any
          character except new-line.

     1.4  A non-empty string of  characters  enclosed  in  square
          brackets  ([])  is  a one-character RE that matches any
          one character in that string.  If, however,  the  first
          character  of  the string is a circumflex (^), the one-
          character RE matches any character except new-line  and
          the remaining characters in the string.  The ^ has this
          special meaning only if it occurs first in the  string.
          The  minus  (-) may be used to indicate a range of con-
          secutive characters; for example, [0-9]  is  equivalent
          to  [0123456789].   The - loses this special meaning if
          it occurs first (after an initial ^, if any) or last in
          the string.  The right square bracket (]) does not ter-
          minate such a string when it  is  the  first  character
          within  it  (after  an initial ^, if any); for example,
          []a-f] matches either a right square bracket (]) or one
          of  the  ASCII letters a through f inclusive.  The four
          characters listed in 1.2.a above stand  for  themselves
          within such a string of characters.

     The following rules may be used to construct REs  from  one-
     character REs:

     2.1  A one-character RE is a RE that  matches  whatever  the
          one-character RE matches.

     2.2  A one-character RE followed by an asterisk (*) is a  RE
          that matches 0 or more occurrences of the one-character
          RE.  If there  is  any  choice,  the  longest  leftmost
          string that permits a match is chosen.

     2.3  A  one-character  RE  followed  by  \{m\},  \{m,\},  or
          \{m,n\}  is a RE that matches a range of occurrences of
          the one-character RE.  The values of m and  n  must  be
          non-negative  integers  less  than  256;  \{m\} matches
          exactly  m  occurrences;  \{m,\}  matches  at  least  m
          occurrences;  \{m,n\} matches any number of occurrences
          between m and n inclusive.  Whenever a  choice  exists,
          the RE matches as many occurrences as possible.

     2.4  The concatenation of REs is a RE that matches the  con-
          catenation  of the strings matched by each component of
          the RE.

     2.5  A RE enclosed between the character sequences \( and \)
          is a RE that matches whatever the unadorned RE matches.

     2.6  The expression \n matches the same string of characters
          as was matched by an expression enclosed between \( and
          \) earlier in the same RE.  Here  n  is  a  digit;  the
          sub-expression  specified is that beginning with the n-
          th occurrence of \( counting from the left.  For  exam-
          ple,  the expression ^\(.*\)\1$ matches a line consist-
          ing of two repeated appearances of the same string.

     A RE may be constrained to match words.

     3.1  \< constrains a RE to match the beginning of  a  string
          or  to  follow  a character that is not a digit, under-
          score, or letter.  The first character matching the  RE
          must be a digit, underscore, or letter.

     3.2  \> constrains a RE to match the end of a string  or  to
          precede a character that is not a digit, underscore, or
          letter.

     An entire RE may be constrained to  match  only  an  initial
     segment or final segment of a line (or both).

     4.1  A circumflex (^) at the beginning of an entire RE  con-
          strains that RE to match an initial segment of a line.

     4.2  A dollar sign ($) at the end of an entire RE constrains
          that RE to match a final segment of a line.

     4.3  The construction ^entire RE$ constrains the  entire  RE
          to match the entire line.

     The null RE (for example, //) is equivalent to the  last  RE
     encountered.
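
     The rules above can be tried out with a short sketch.  Since
     <regexp.h> may be absent on current systems, the sketch
     assumes the POSIX regcomp() and regexec() functions from
     <regex.h> in their default basic-RE mode, not compile() and
     step().  The helper bre_match() is hypothetical.  The word
     constraints of 3.1 and 3.2 are left out because regcomp() is
     not guaranteed to accept them.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of <regexp.h>.
 * Returns 1 if the basic RE `pattern` matches somewhere in `text`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int bre_match(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    /* cflags of 0 (no REG_EXTENDED) selects basic regular expressions. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    return rc == 0;
}
```

     Under these assumptions, bre_match("^\\(.*\\)\\1$", "abcabc")
     should report a match (rule 2.6), while the same pattern
     applied to "abcabd" should not.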

  Addressing with REs
     Addresses are constructed as follows:

      1.  The character "." addresses the current line.

      2.  The character  "$"  addresses  the  last  line  of  the
          buffer.

      3.  A decimal number n  addresses  the  n-th  line  of  the
          buffer.

      4.  'x addresses the line marked with the mark name charac-
          ter  x, which must be an ASCII lower-case letter (a-z).
          Lines are marked with the k command described below.

      5.  A RE enclosed by slashes (/) addresses the  first  line
          found  by searching forward from the line following the
          current line toward the end of the buffer and  stopping
          at  the first line containing a string matching the RE.
          If necessary, the search wraps around to the  beginning
          of  the  buffer  and  continues up to and including the
          current line, so that the entire buffer is searched.

      6.  A RE enclosed in question marks (?) addresses the first
          line  found by searching backward from the line preced-
          ing the current line toward the beginning of the buffer
          and  stopping  at  the  first  line containing a string
          matching the RE.  If necessary, the search wraps around
          to  the  end  of  the  buffer  and  continues up to and
          including the current line.

      7.  An address followed by a plus sign (+) or a minus  sign
          (-) followed by a decimal number specifies that address
          plus  (respectively  minus)  the  indicated  number  of
          lines.  A shorthand for .+5 is .5.

      8.  If an address begins with + or -, the addition or  sub-
          traction is taken with respect to the current line; for
          example, -5 is understood to mean .-5.

      9.  If an address ends with + or -, then 1 is added  to  or
          subtracted from the address, respectively.  As a conse-
          quence of this rule and of Rule 8,  immediately  above,
          the  address - refers to the line preceding the current
          line.  (To maintain compatibility with earlier versions
          of the editor, the character ^ in addresses is entirely
          equivalent to -.)  Moreover, trailing + and  -  charac-
          ters  have  a  cumulative  effect,  so -- refers to the
          current line less 2.

     10.  For convenience, a comma (,)  stands  for  the  address
          pair  1,$,  while  a  semicolon (;) stands for the pair
          .,$.
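
     The wrap-around searches of rules 5 and 6 can be sketched as
     follows.  This is only an illustration: the buffer is assumed
     to be an array of NUL-terminated lines, line numbers are
     1-based, matching is delegated to POSIX regexec(), and
     search_forward() and search_backward() are hypothetical names.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Rule 5: visit cur+1, ..., nlines, 1, ..., cur and return the 1-based
 * number of the first line that matches, or 0 if no line matches.
 */
size_t search_forward(const char *const *lines, size_t nlines,
                      size_t cur, const regex_t *re)
{
    size_t i, n;

    if (nlines == 0 || cur < 1 || cur > nlines)
        return 0;
    for (i = 1; i <= nlines; i++) {
        n = (cur - 1 + i) % nlines + 1;
        if (regexec(re, lines[n - 1], 0, NULL, 0) == 0)
            return n;
    }
    return 0;
}

/*
 * Rule 6: visit cur-1, ..., 1, nlines, ..., cur and return the 1-based
 * number of the first line that matches, or 0 if no line matches.
 */
size_t search_backward(const char *const *lines, size_t nlines,
                       size_t cur, const regex_t *re)
{
    size_t i, n;

    if (nlines == 0 || cur < 1 || cur > nlines)
        return 0;
    for (i = 1; i <= nlines; i++) {
        n = (cur - 1 + nlines - i) % nlines + 1;
        if (regexec(re, lines[n - 1], 0, NULL, 0) == 0)
            return n;
    }
    return 0;
}
```

     In both directions the current line is examined last, so it
     is addressed only when no other line in the buffer matches.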

  Characters With Special Meaning
     Characters that have special meaning except when they appear
     within square brackets ([]) or are preceded by \ are:  ., *,
     [, \.  Other special characters, such as $, have special
     meaning in more restricted contexts.

     The character ^ at the beginning of an expression permits  a
     successful  match  only immediately after a newline, and the
     character $ at the end of an expression requires a  trailing
     newline.

     Two characters have special meaning only  when  used  within
     square  brackets.   The  character - denotes a range, [c-c],
     unless it is just after the open bracket or before the clos-
     ing  bracket,  [ -c] or [c-] in which case it has no special
     meaning.  When used within brackets, the character ^ has the
     meaning "complement of" if it immediately follows the open
     bracket (example: [^c]); elsewhere between brackets (example:
     [c^]) it stands for the ordinary character ^.

     The special meaning of the \ operator can be escaped only by
     preceding it with another \, for example \\.

  Macros
     Programs must have the following five macros declared before
     the #include <regexp.h> statement.  These macros are used by
     the compile() routine.  The macros GETC, PEEKC,  and  UNGETC
     operate  on  the  regular  expression given as input to com-
     pile().

     GETC           This macro returns  the  value  of  the  next
                    character  (byte)  in  the regular expression
                    pattern.  Successive  calls  to  GETC  should
                    return  successive  characters of the regular
                    expression.

     PEEKC          This macro returns the next character  (byte)
                    in  the regular expression.  Immediately suc-
                    cessive calls to PEEKC should return the same
                    character,  which  should  also  be  the next
                    character returned by GETC.

     UNGETC(c)      This  macro  causes  the  argument  c  to  be
                    returned  by the next call to GETC and PEEKC.
                    No more than one  character  of  pushback  is
                    ever  needed and this character is guaranteed
                    to be the last character read by  GETC.   The
                    return value of the macro UNGETC(c) is always
                    ignored.

     RETURN(ptr)    This macro is used on normal exit of the com-
                    pile()  routine.   The  value of the argument
                    ptr is a pointer to the character  after  the
                    last   character   of  the  compiled  regular
                    expression.  This is useful to programs which
                    have memory allocation to manage.

     ERROR(val)     This macro is the abnormal  return  from  the
                    compile()  routine.   The  argument val is an
                    error number (see DIAGNOSTICS below for
                    meanings).  This call should never return.
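
     The input macro contract can be illustrated without including
     <regexp.h>.  In the sketch below the macros read from a plain
     character pointer; scan_pattern() is a hypothetical helper
     that only stands in for the way compile() might consume its
     input, and its treatment of backslash is an assumption made
     for the illustration.

```c
#include <stddef.h>

/* Same shapes as the macros described above, reading from a pointer. */
#define INIT       const char *sp = instring;
#define GETC()     (*sp++)
#define PEEKC()    (*sp)
#define UNGETC(c)  (--sp)

/*
 * Hypothetical helper, not part of <regexp.h>.
 * Copies the pattern text that precedes the delimiter `eof` into `out`,
 * keeping a backslash together with the character it escapes.
 * Returns the offset of the delimiter in `instring`, or -1 if the
 * delimiter is missing or `out` is too small.
 */
int scan_pattern(const char *instring, int eof, char *out, size_t outsz)
{
    INIT
    size_t n = 0;
    int c;

    if (outsz == 0)
        return -1;
    for (;;) {
        c = GETC();
        if (c == eof) {
            UNGETC(c);              /* leave the delimiter unread */
            break;
        }
        if (c == '\0')
            return -1;              /* ran off the end of the input */
        if (c == '\\' && PEEKC() != '\0') {
            if (n + 1 >= outsz)
                return -1;
            out[n++] = (char)c;     /* keep the backslash ... */
            c = GETC();             /* ... and the escaped character */
        }
        if (n + 1 >= outsz)
            return -1;
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return (int)(sp - instring);
}
```

     UNGETC pushes back only the character just read by GETC,
     which is the single character of pushback described above.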

  compile()
     The syntax of the compile() routine is as follows:

          compile(instring, expbuf, endbuf, eof)

     The first parameter, instring, is never used  explicitly  by
     the  compile()  routine but is useful for programs that pass
     down different pointers to input characters.   It  is  some-
     times  used  in  the INIT declaration (see below).  Programs
     which call functions to input characters or have  characters
     in  an external array can pass down a value of (char *)0 for
     this parameter.

     The next parameter, expbuf,  is  a  character  pointer.   It
     points  to  the  place where the compiled regular expression
     will be placed.

     The parameter endbuf is one more than  the  highest  address
     where the compiled regular expression may be placed.  If the
     compiled expression cannot fit in (endbuf-expbuf)  bytes,  a
     call to ERROR(50) is made.

     The parameter eof is the character which marks  the  end  of
     the regular expression.  This character is usually a /.

     Each program that includes the <regexp.h> header  file  must
     have a #define statement for INIT.  It is used for dependent
     declarations and initializations.  Most often it is used  to
     set  a  register  variable  to point to the beginning of the
     regular expression so that this  register  variable  can  be
     used  in the declarations for GETC, PEEKC, and UNGETC.  Oth-
     erwise it can be used to  declare  external  variables  that
     might  be  used  by  GETC,  PEEKC and UNGETC.  (See EXAMPLES
     below.)

  step(), advance()
     The first parameter to the step() and advance() functions is
     a  pointer  to  a  string  of characters to be checked for a
     match.  This string should be null terminated.

     The  second  parameter,  expbuf,  is  the  compiled  regular
     expression which was obtained by a call to the function com-
     pile().

     The function step() returns non-zero if  some  substring  of
     string  matches  the  regular  expression in expbuf and 0 if
     there is no match.  If there is a match, two external  char-
     acter  pointers  are  set  as  a  side effect to the call to
     step().  The variable loc1 points  to  the  first  character
     that  matched  the  regular  expression;  the  variable loc2
     points to  the  character  after  the  last  character  that
     matches the regular expression.  Thus if the regular expres-
     sion matches the entire input string, loc1 will point to the
     first character of string and loc2 will point to the null at
     the end of string.

     The function advance() returns non-zero if the initial  sub-
     string  of  string matches the regular expression in expbuf.
     If there is a match, an external character pointer, loc2, is
     set  as a side effect.  The variable loc2 points to the next
     character in string after the last character that matched.
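
     The difference between the two functions can be sketched on
     systems without <regexp.h> by using POSIX regexec() as a
     stand-in.  The names step_like(), advance_like(), my_loc1 and
     my_loc2 are hypothetical; they imitate the return values and
     the loc1 and loc2 side effects described above and are not
     the real routines.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical stand-ins for the external pointers loc1 and loc2. */
const char *my_loc1, *my_loc2;

/* Like step(): succeed if some substring of `string` matches. */
int step_like(const char *string, const regex_t *re)
{
    regmatch_t m[1];

    if (regexec(re, string, 1, m, 0) != 0)
        return 0;
    my_loc1 = string + m[0].rm_so;   /* first character that matched */
    my_loc2 = string + m[0].rm_eo;   /* character after the match */
    return 1;
}

/*
 * Like advance(): succeed only if an initial substring matches.
 * regexec() reports the leftmost match, so a match that starts
 * anywhere other than offset 0 means nothing matches at offset 0.
 */
int advance_like(const char *string, const regex_t *re)
{
    regmatch_t m[1];

    if (regexec(re, string, 1, m, 0) != 0 || m[0].rm_so != 0)
        return 0;
    my_loc2 = string + m[0].rm_eo;
    return 1;
}
```

     The regex_t passed in is assumed to have been compiled with
     regcomp() without REG_NOSUB, so that match offsets are
     reported.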

     When advance() encounters a * or \{ \} sequence in the regu-
     lar expression, it will advance its pointer to the string to
     be matched as far as  possible  and  will  recursively  call
     itself trying to match the rest of the string to the rest of
     the regular expression.  As  long  as  there  is  no  match,
     advance()  will  back  up  along the string until it finds a
     match or reaches the point  in  the  string  that  initially
     matched  the  * or \{ \}.  It is sometimes desirable to stop
     this backing up before the initial point in  the  string  is
     reached.  If the external character pointer locs equals the
     current position in the string at some time during the
     backing-up process, advance() breaks out of the loop that
     backs up and returns zero.

     The external variables circf, sed, and nbra are reserved.

EXAMPLES
     The following is an example of how  the  regular  expression
     macros and calls might be defined by an application program:

     #define INIT         register char *sp = instring;
     #define GETC()       (*sp++)
     #define PEEKC()      (*sp)
     #define UNGETC(c)    (--sp)
     #define RETURN(c)    return (c);
     #define ERROR(c)     regerr(c)   /* application-defined; must not return */
     #include <regexp.h>
      . . .
           (void) compile(*argv, expbuf, &expbuf[ESIZE], '\0');
      . . .
           if (step(linebuf, expbuf))
                             succeed;

DIAGNOSTICS
     The function compile() uses the macro RETURN on success  and
     the  macro  ERROR  on  failure  (see  above).  The functions
     step() and advance() return non-zero on a  successful  match
     and zero if there is no match.  Errors are:

     11   range endpoint too large.

     16   bad number.

     25   \ digit out of range.

     36   illegal or missing delimiter.

     41   no remembered search string.

     42   \( \) imbalance.

     43   too many \(.

     44   more than 2 numbers given in \{ \}.

     45   } expected after \.

     46   first number exceeds second in \{ \}.

     49   [ ] imbalance.

     50   regular expression overflow.
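
     An application's ERROR macro often needs to turn the error
     number into a message.  A minimal sketch follows, assuming a
     hypothetical helper named regexp_errmsg() that simply
     restates the table above:

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of <regexp.h>.
 * Maps an error number passed to ERROR(val) to the message listed
 * in the table above; returns NULL for a number not in the table.
 */
const char *regexp_errmsg(int val)
{
    switch (val) {
    case 11: return "range endpoint too large";
    case 16: return "bad number";
    case 25: return "\\ digit out of range";
    case 36: return "illegal or missing delimiter";
    case 41: return "no remembered search string";
    case 42: return "\\( \\) imbalance";
    case 43: return "too many \\(";
    case 44: return "more than 2 numbers given in \\{ \\}";
    case 45: return "} expected after \\";
    case 46: return "first number exceeds second in \\{ \\}";
    case 49: return "[ ] imbalance";
    case 50: return "regular expression overflow";
    default: return NULL;
    }
}
```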

SEE ALSO
     regex(5)