| PCRE2TEST(1)                General Commands Manual               PCRE2TEST(1)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        pcre2test - a program for testing Perl-compatible regular expressions.
 | |
| 
 | |
| SYNOPSIS
 | |
| 
 | |
|        pcre2test [options] [input file [output file]]
 | |
| 
 | |
|        pcre2test is a test program for the PCRE2 regular expression libraries,
 | |
|        but it can also be used for  experimenting  with  regular  expressions.
 | |
|        This  document  describes the features of the test program; for details
 | |
|        of the regular expressions themselves, see the pcre2pattern  documenta-
 | |
|        tion.  For  details  of  the  PCRE2  library  function  calls and their
 | |
|        options, see the pcre2api documentation.
 | |
| 
 | |
|        The input for pcre2test is a sequence of  regular  expression  patterns
 | |
|        and  subject  strings  to  be matched. There are also command lines for
 | |
|        setting defaults and controlling some special actions. The output shows
 | |
|        the  result  of  each  match attempt. Modifiers on external or internal
 | |
|        command lines, the patterns, and the subject lines specify PCRE2  func-
 | |
|        tion  options, control how the subject is processed, and what output is
 | |
|        produced.
 | |
| 
 | |
|        As the original fairly simple PCRE library evolved,  it  acquired  many
 | |
|        different  features,  and  as  a  result, the original pcretest program
 | |
|        ended up with a lot of options in a messy, arcane syntax,  for  testing
 | |
|        all the features. The move to the new PCRE2 API provided an opportunity
 | |
|        to re-implement the test program as pcre2test, with a cleaner  modifier
 | |
|        syntax.  Nevertheless,  there are still many obscure modifiers, some of
 | |
|        which are specifically designed for use in conjunction  with  the  test
 | |
|        script  and  data  files that are distributed as part of PCRE2. All the
 | |
|        modifiers are documented here, some  without  much  justification,  but
 | |
|        many  of  them  are  unlikely  to  be  of  use  except when testing the
 | |
|        libraries.
 | |
| 
 | |
| 
 | |
| PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 | |
| 
 | |
|        Different versions of the PCRE2 library can be built to support charac-
 | |
|        ter  strings  that  are encoded in 8-bit, 16-bit, or 32-bit code units.
 | |
|        One, two, or  all  three  of  these  libraries  may  be  simultaneously
 | |
|        installed. The pcre2test program can be used to test all the libraries.
 | |
|        However, its own input and output are  always  in  8-bit  format.  When
 | |
|        testing  the  16-bit  or 32-bit libraries, patterns and subject strings
 | |
|        are converted to 16- or  32-bit  format  before  being  passed  to  the
 | |
|        library  functions.  Results are converted back to 8-bit code units for
 | |
|        output.
 | |
| 
 | |
|        In the rest of this document, the names of library functions and struc-
 | |
|        tures  are  given  in  generic  form, for example, pcre2_compile(). The
 | |
|        actual names used in the libraries have a suffix _8, _16,  or  _32,  as
 | |
|        appropriate.
 | |
| 
 | |
| 
 | |
| INPUT ENCODING
 | |
| 
 | |
|        Input  to  pcre2test is processed line by line, either by calling the C
 | |
|        library's fgets() function, or via the libreadline library (see below).
 | |
|        The input is processed using C's string functions, so it must not
 | |
|        contain binary zeroes, even though in Unix-like  environments,  fgets()
 | |
|        treats any bytes other than newline as data characters. In some Windows
 | |
|        environments character 26 (hex 1A) causes an immediate end of file, and
 | |
|        no further data is read.
 | |
| 
 | |
|        For  maximum portability, therefore, it is safest to avoid non-printing
 | |
|        characters in pcre2test input files. There is a facility for specifying
 | |
|        a pattern's characters as hexadecimal pairs, thus making it possible to
 | |
|        include binary zeroes in a pattern for testing purposes. Subject  lines
 | |
|        are processed for backslash escapes, which makes it possible to include
 | |
|        any data value.
 | |
| 
 | |
| 
 | |
| COMMAND LINE OPTIONS
 | |
| 
 | |
|        -8        If the 8-bit library has been built, this option causes it to
 | |
|                  be  used  (this is the default). If the 8-bit library has not
 | |
|                  been built, this option causes an error.
 | |
| 
 | |
|        -16       If the 16-bit library has been built, this option  causes  it
 | |
|                  to  be  used. If only the 16-bit library has been built, this
 | |
|                  is the default. If the 16-bit library  has  not  been  built,
 | |
|                  this option causes an error.
 | |
| 
 | |
|        -32       If  the  32-bit library has been built, this option causes it
 | |
|                  to be used. If only the 32-bit library has been  built,  this
 | |
|                  is  the  default.  If  the 32-bit library has not been built,
 | |
|                  this option causes an error.
 | |
| 
 | |
|        -b        Behave as if each pattern has the /fullbincode modifier;  the
 | |
|                  full internal binary form of the pattern is output after com-
 | |
|                  pilation.
 | |
| 
 | |
|        -C        Output the version number  of  the  PCRE2  library,  and  all
 | |
|                  available  information  about  the optional features that are
 | |
|                  included, and then  exit  with  zero  exit  code.  All  other
 | |
|                  options are ignored.
 | |
| 
 | |
|        -C option Output  information  about a specific build-time option, then
 | |
|                  exit. This functionality is intended for use in scripts  such
 | |
|                  as  RunTest.  The  following options output the value and set
 | |
|                  the exit code as indicated:
 | |
| 
 | |
|                    ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
 | |
|                                 0x15 or 0x25
 | |
|                                 0 if used in an ASCII environment
 | |
|                                 exit code is always 0
 | |
|                    linksize   the configured internal link size (2, 3, or 4)
 | |
|                                 exit code is set to the link size
 | |
|                    newline    the default newline setting:
 | |
|                                 CR, LF, CRLF, ANYCRLF, or ANY
 | |
|                                 exit code is always 0
 | |
|                    bsr        the default setting for what \R matches:
 | |
|                                 ANYCRLF or ANY
 | |
|                                 exit code is always 0
 | |
| 
 | |
|                  The following options output 1 for true or 0 for  false,  and
 | |
|                  set the exit code to the same value:
 | |
| 
 | |
|                    ebcdic     compiled for an EBCDIC environment
 | |
|                    jit        just-in-time support is available
 | |
|                    pcre2-16   the 16-bit library was built
 | |
|                    pcre2-32   the 32-bit library was built
 | |
|                    pcre2-8    the 8-bit library was built
 | |
|                    unicode    Unicode support is available
 | |
| 
 | |
|                  If  an  unknown  option is given, an error message is output;
 | |
|                  the exit code is 0.
 | |
| 
 | |
|        -d        Behave as if each pattern has the debug modifier; the  inter-
 | |
|                  nal form and information about the compiled pattern is output
 | |
|                  after compilation; -d is equivalent to -b -i.
 | |
| 
 | |
|        -dfa      Behave as if each subject line has the dfa modifier; matching
 | |
|                  is  done  using the pcre2_dfa_match() function instead of the
 | |
|                  default pcre2_match().
 | |
| 
 | |
|        -help     Output a brief summary of these options and then exit.
 | |
| 
 | |
|        -i        Behave as if each pattern has the /info modifier; information
 | |
|                  about the compiled pattern is given after compilation.
 | |
| 
 | |
|        -jit      Behave  as  if  each pattern line has the jit modifier; after
 | |
|                  successful compilation, each pattern is passed to  the  just-
 | |
|                  in-time compiler, if available.
 | |
| 
 | |
|        -pattern modifier-list
 | |
|                  Behave as if each pattern line contains the given modifiers.
 | |
| 
 | |
|        -q        Do not output the version number of pcre2test at the start of
 | |
|                  execution.
 | |
| 
 | |
|        -S size   On Unix-like systems, set the size of the run-time  stack  to
 | |
|                  size megabytes.
 | |
| 
 | |
|        -subject modifier-list
 | |
|                  Behave as if each subject line contains the given modifiers.
 | |
| 
 | |
|        -t        Run  each compile and match many times with a timer, and out-
 | |
|                  put the resulting times per compile or  match.  When  JIT  is
 | |
|                  used,  separate  times  are given for the initial compile and
 | |
|                  the JIT compile. You can control  the  number  of  iterations
 | |
|                  that  are used for timing by following -t with a number (as a
 | |
|                  separate item on the command line). For  example,  "-t  1000"
 | |
|                  iterates 1000 times. The default is to iterate 500,000 times.
 | |
| 
 | |
|        -tm       This is like -t except that it times only the matching phase,
 | |
|                  not the compile phase.
 | |
| 
 | |
|        -T -TM    These behave like -t and -tm, but in addition, at the end  of
 | |
|                  a  run, the total times for all compiles and matches are out-
 | |
|                  put.
 | |
| 
 | |
|        -version  Output the PCRE2 version number and then exit.
 | |
| 
 | |
| 
 | |
| DESCRIPTION
 | |
| 
 | |
|        If pcre2test is given two filename arguments, it reads from  the  first
 | |
|        and writes to the second. If the first name is "-", input is taken from
 | |
|        the standard input. If pcre2test is given only one argument,  it  reads
 | |
|        from that file and writes to stdout. Otherwise, it reads from stdin and
 | |
|        writes to stdout.
 | |
| 
 | |
|        When pcre2test is built, a configuration option  can  specify  that  it
 | |
|        should  be linked with the libreadline or libedit library. When this is
 | |
|        done, if the input is from a terminal, it is read using the  readline()
 | |
|        function. This provides line-editing and history facilities. The output
 | |
|        from the -help option states whether or not readline() will be used.
 | |
| 
 | |
|        The program handles any number of tests, each of which  consists  of  a
 | |
|        set  of input lines. Each set starts with a regular expression pattern,
 | |
|        followed by any number of subject lines to be matched against that pat-
 | |
|        tern. In between sets of test data, command lines that begin with # may
 | |
|        appear. This file format, with some restrictions, can also be processed
 | |
|        by  the perltest.sh script that is distributed with PCRE2 as a means of
 | |
|        checking that the behaviour of PCRE2 and Perl is the same.
 | |
| 
 | |
|        When the input is a terminal, pcre2test prompts for each line of input,
 | |
|        using  "re>"  to prompt for regular expression patterns, and "data>" to
 | |
|        prompt for subject lines. Command lines starting with # can be  entered
 | |
|        only in response to the "re>" prompt.
 | |
| 
 | |
|        Each  subject line is matched separately and independently. If you want
 | |
|        to do multi-line matches, you have to use the \n escape sequence (or \r
 | |
|        or  \r\n,  etc.,  depending on the newline setting) in a single line of
 | |
|        input to encode the newline sequences. There is no limit on the  length
 | |
|        of  subject  lines; the input buffer is automatically extended if it is
 | |
|        too small. There is a replication feature that  makes  it  possible  to
 | |
|        generate long subject lines without having to supply them explicitly.
 | |
| 
 | |
|        An  empty  line  or  the end of the file signals the end of the subject
 | |
|        lines for a test, at which point a  new  pattern  or  command  line  is
 | |
|        expected if there is still input to be read.
 | |
| 
 | |
| 
 | |
| COMMAND LINES
 | |
| 
 | |
|        In  between sets of test data, a line that begins with # is interpreted
 | |
|        as a command line. If the first character is followed by white space or
 | |
|        an  exclamation  mark,  the  line is treated as a comment, and ignored.
 | |
|        Otherwise, the following commands are recognized:
 | |
| 
 | |
|          #forbid_utf
 | |
| 
 | |
|        Subsequent  patterns  automatically  have   the   PCRE2_NEVER_UTF   and
 | |
|        PCRE2_NEVER_UCP  options  set, which locks out the use of the PCRE2_UTF
 | |
|        and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start  of
 | |
|        patterns.  This  command  also  forces an error if a subsequent pattern
 | |
|        contains any occurrences of \P, \p, or \X, which  are  still  supported
 | |
|        when  PCRE2_UTF  is not set, but which require Unicode property support
 | |
|        to be included in the library.
 | |
| 
 | |
|        This is a trigger guard that is used in test files to ensure  that  UTF
 | |
|        or  Unicode property tests are not accidentally added to files that are
 | |
|        used when Unicode support is  not  included  in  the  library.  Setting
 | |
|        PCRE2_NEVER_UTF  and  PCRE2_NEVER_UCP as a default can also be obtained
 | |
|        by the use of #pattern; the difference is that  #forbid_utf  cannot  be
 | |
|        unset,  and the automatic options are not displayed in pattern informa-
 | |
|        tion, to avoid cluttering up test output.
 | |
| 
 | |
|          #load <filename>
 | |
| 
 | |
|        This command is used to load a set of precompiled patterns from a file,
 | |
|        as  described  in  the  section entitled "Saving and restoring compiled
 | |
|        patterns" below.
 | |
| 
 | |
|          #pattern <modifier-list>
 | |
| 
 | |
|        This command sets a default modifier list that applies  to  all  subse-
 | |
|        quent patterns. Modifiers on a pattern can change these settings.
 | |
| 
 | |
|          #perltest
 | |
| 
 | |
|        The  appearance of this line causes all subsequent modifier settings to
 | |
|        be checked for compatibility with the perltest.sh script, which is used
 | |
|        to  confirm that Perl gives the same results as PCRE2. Also, apart from
 | |
|        comment lines, none of the other command lines are  permitted,  because
 | |
|        they  and  many  of the modifiers are specific to pcre2test, and should
 | |
|        not be used in test files that are also processed by  perltest.sh.  The
 | |
|        #perltest  command  helps detect tests that are accidentally put in the
 | |
|        wrong file.
 | |
| 
 | |
|          #pop [<modifiers>]
 | |
| 
 | |
|        This command is used to manipulate the stack of compiled  patterns,  as
 | |
|        described  in  the section entitled "Saving and restoring compiled pat-
 | |
|        terns" below.
 | |
| 
 | |
|          #save <filename>
 | |
| 
 | |
|        This command is used to save a set of compiled patterns to a  file,  as
 | |
|        described  in  the section entitled "Saving and restoring compiled pat-
 | |
|        terns" below.
 | |
| 
 | |
|          #subject <modifier-list>
 | |
| 
 | |
|        This command sets a default modifier list that applies  to  all  subse-
 | |
|        quent  subject lines. Modifiers on a subject line can change these set-
 | |
|        tings.
 | |
| 
 | |
| 
 | |
| MODIFIER SYNTAX
 | |
| 
 | |
|        Modifier lists are used with both pattern and subject lines. Items in a
 | |
|        list  are  separated by commas and optional white space. Some modifiers
 | |
|        may be given for both patterns and subject lines,  whereas  others  are
 | |
|        valid  for  one  or  the other only. Each modifier has a long name, for
 | |
|        example "anchored", and some of them must be followed by an equals sign
 | |
|        and a value, for example, "offset=12".  Modifiers that do not take val-
 | |
|        ues may be preceded by a minus sign to turn off a previous setting.
 | |
| 
 | |
|        A few of the more common modifiers can also be specified as single let-
 | |
|        ters,  for  example "i" for "caseless". In documentation, following the
 | |
|        Perl convention, these are written with a slash ("the /i modifier") for
 | |
|        clarity.  Abbreviated  modifiers  must all be concatenated in the first
 | |
|        item of a modifier list. If the first item is not recognized as a  long
 | |
|        modifier  name, it is interpreted as a sequence of these abbreviations.
 | |
|        For example:
 | |
| 
 | |
|          /abc/ig,newline=cr,jit=3
 | |
| 
 | |
|        This is a pattern line whose modifier list starts with  two  one-letter
 | |
|        modifiers  (/i  and  /g).  The lower-case abbreviated modifiers are the
 | |
|        same as used in Perl.
 | |
| 
 | |
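       The rules above can be sketched in a few lines of Python. This is
       only an illustration of the syntax, not the parser that pcre2test
       uses: the function name, the small set of long names, and the long
       name given for /g are assumptions made for the example.

```python
# Single-letter abbreviations and the long names they stand for.
ABBREVIATIONS = {
    "i": "caseless",
    "s": "dotall",
    "x": "extended",
    "m": "multiline",
    "g": "global",  # long name assumed for this sketch
}

# Hypothetical subset of long modifier names, for illustration only.
LONG_NAMES = set(ABBREVIATIONS.values()) | {"anchored", "offset", "newline", "jit"}


def parse_modifier_list(text):
    """Return a list of (name, value) pairs for a modifier list.

    value is True for a plain setting, False for "-name", or the
    string that follows the equals sign.
    """
    items = [item.strip() for item in text.split(",") if item.strip()]
    parsed = []
    for index, item in enumerate(items):
        name, has_value, value = item.partition("=")
        negated = name.startswith("-")
        bare = name[1:] if negated else name
        if bare in LONG_NAMES:
            if has_value and negated:
                # Only modifiers without values may be turned off.
                raise ValueError(f"cannot negate a modifier with a value: {item!r}")
            parsed.append((bare, value if has_value else not negated))
        elif index == 0 and not has_value and all(c in ABBREVIATIONS for c in item):
            # First item is not a long name: read it as abbreviations.
            parsed.extend((ABBREVIATIONS[c], True) for c in item)
        else:
            raise ValueError(f"unrecognized modifier: {item!r}")
    return parsed
```

       With these assumptions, the list "ig,newline=cr,jit=3" is read as
       caseless and global, followed by newline=cr and jit=3.
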
| 
 | |
| PATTERN SYNTAX
 | |
| 
 | |
|        A pattern line must start with one of the following characters  (common
 | |
|        symbols, excluding pattern meta-characters):
 | |
| 
 | |
|          / ! " ' ` - = _ : ; , % & @ ~
 | |
| 
 | |
|        This  is  interpreted  as the pattern's delimiter. A regular expression
 | |
|        may be continued over several input lines, in which  case  the  newline
 | |
|        characters are included within it. It is possible to include the delim-
 | |
|        iter within the pattern by escaping it with a backslash, for example
 | |
| 
 | |
|          /abc\/def/
 | |
| 
 | |
|        If you do this, the escape and the delimiter form part of the  pattern,
 | |
|        but since the delimiters are all non-alphanumeric, this does not affect
 | |
|        its interpretation. If the terminating delimiter  is  immediately  fol-
 | |
|        lowed by a backslash, for example,
 | |
| 
 | |
|          /abc/\
 | |
| 
 | |
|        then  a  backslash  is added to the end of the pattern. This is done to
 | |
|        provide a way of testing the error condition that arises if  a  pattern
 | |
|        finishes with a backslash, because
 | |
| 
 | |
|          /abc\/
 | |
| 
 | |
|        is  interpreted as the first line of a pattern that starts with "abc/",
 | |
|        causing pcre2test to read the next line as a continuation of the  regu-
 | |
|        lar expression.
 | |
| 
 | |
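       The delimiter rules can be illustrated with a short Python sketch.
       It handles a single input line only and returns None when the
       pattern would continue on the next line; the function name and
       return convention are assumptions for the example, not a
       description of how pcre2test is implemented.

```python
# Characters that may open a pattern line, as listed above.
DELIMITERS = set("/!\"'`-=_:;,%&@~")


def split_pattern_line(line):
    """Split one pattern line into (pattern, modifier_text).

    Returns None if no terminating delimiter is found, meaning the
    pattern continues on the next input line.
    """
    delimiter = line[0]
    if delimiter not in DELIMITERS:
        raise ValueError("line does not start with a pattern delimiter")
    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            # The escape and the escaped character both stay in the pattern.
            i += 2
            continue
        if line[i] == delimiter:
            pattern, rest = line[1:i], line[i + 1:]
            if rest.startswith("\\"):
                # Terminating delimiter followed at once by a backslash:
                # a backslash is added to the end of the pattern.
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None
```

       Here "/abc\/def/" yields the pattern abc\/def, "/abc/\" yields
       abc\ (a pattern ending in a backslash), and "/abc\/" yields None
       because the escaped delimiter does not terminate the pattern.
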
|        A pattern can be followed by a modifier list (details below).
 | |
| 
 | |
| 
 | |
| SUBJECT LINE SYNTAX
 | |
| 
 | |
|        Before    each   subject   line   is   passed   to   pcre2_match()   or
 | |
|        pcre2_dfa_match(), leading and trailing white space is removed, and the
 | |
|        line is scanned for backslash escapes. The following provide a means of
 | |
|        encoding non-printing characters in a visible way:
 | |
| 
 | |
|          \a         alarm (BEL, \x07)
 | |
|          \b         backspace (\x08)
 | |
|          \e         escape (\x1b)
 | |
|          \f         form feed (\x0c)
 | |
|          \n         newline (\x0a)
 | |
|          \r         carriage return (\x0d)
 | |
|          \t         tab (\x09)
 | |
|          \v         vertical tab (\x0b)
 | |
|          \nnn       octal character (up to 3 octal digits); always
 | |
|                       a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
 | |
|          \o{dd...}  octal character (any number of octal digits)
 | |
|          \xhh       hexadecimal byte (up to 2 hex digits)
 | |
|          \x{hh...}  hexadecimal character (any number of hex digits)
 | |
| 
 | |
|        The use of \x{hh...} is not dependent on the use of the utf modifier on
 | |
|        the  pattern. It is recognized always. There may be any number of hexa-
 | |
|        decimal digits inside the braces; invalid  values  provoke  error  mes-
 | |
|        sages.
 | |
| 
 | |
|        Note  that  \xhh  specifies one byte rather than one character in UTF-8
 | |
|        mode; this makes it possible to construct invalid UTF-8  sequences  for
 | |
|        testing  purposes.  On the other hand, \x{hh} is interpreted as a UTF-8
 | |
|        character in UTF-8 mode, generating more than one byte if the value  is
 | |
|        greater  than  127.   When testing the 8-bit library not in UTF-8 mode,
 | |
|        \x{hh} generates one byte for values less than 256, and causes an error
 | |
|        for greater values.
 | |
| 
 | |
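       The difference between \xhh and \x{hh} when testing the 8-bit
       library can be sketched as follows. The function and its
       parameters are hypothetical and exist only to show which bytes
       result; Python's own UTF-8 encoder stands in for the conversion,
       so it rejects surrogate values that pcre2test may treat
       differently.

```python
def encode_x_escape(hex_digits, braced, utf8_mode):
    # Bytes produced for \xhh (braced=False) or \x{hh...} (braced=True)
    # in the 8-bit library, following the description above.
    value = int(hex_digits, 16)
    if not braced:
        if len(hex_digits) > 2:
            raise ValueError("\\xhh takes at most two hex digits")
        return bytes([value])  # always exactly one byte
    if utf8_mode:
        # One character, which needs more than one byte above 127.
        return chr(value).encode("utf-8")
    if value > 0xFF:
        raise ValueError("value does not fit in one byte outside UTF-8 mode")
    return bytes([value])
```

       For the value e9, the unbraced form gives the single byte E9 even
       in UTF-8 mode (an invalid UTF-8 sequence on its own), whereas the
       braced form in UTF-8 mode gives the two bytes C3 A9.
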
|        In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
 | |
|        possible to construct invalid UTF-16 sequences for testing purposes.
 | |
| 
 | |
|        In UTF-32 mode, all 4- to 8-digit \x{...}  values  are  accepted.  This
 | |
|        makes  it  possible  to  construct invalid UTF-32 sequences for testing
 | |
|        purposes.
 | |
| 
 | |
|        There is a special backslash sequence that specifies replication of one
 | |
|        or more characters:
 | |
| 
 | |
|          \[<characters>]{<count>}
 | |
| 
 | |
|        This  makes  it possible to test long strings without having to provide
 | |
|        them as part of the file. For example:
 | |
| 
 | |
|          \[abc]{4}
 | |
| 
 | |
|        is converted to "abcabcabcabc". This feature does not support  nesting.
 | |
|        To include a closing square bracket in the characters, code it as \x5D.
 | |
| 
 | |
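       A minimal Python sketch of the replication expansion is shown
       below. It is a simplification for illustration: the function name
       is hypothetical, and escapes inside the brackets (such as \x5D)
       are not decoded here.

```python
import re

# Matches \[<characters>]{<count>}; nesting is not supported.
REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(subject):
    # Replace each replication item by its characters repeated count times.
    return REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), subject)
```

       For example, expand_replication(r"\[abc]{4}") returns
       "abcabcabcabc".
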
|        A  backslash  followed  by  an equals sign marks the end of the subject
 | |
|        string and the start of a modifier list. For example:
 | |
| 
 | |
|          abc\=notbol,notempty
 | |
| 
 | |
|        A backslash followed  by  any  other  non-alphanumeric  character  just
 | |
|        escapes that character. A backslash followed by anything else causes an
 | |
|        error. However, if the very last character in the line is  a  backslash
 | |
|        (and  there  is  no  modifier list), it is ignored. This gives a way of
 | |
|        passing an empty line as data, since a real empty line  terminates  the
 | |
|        data input.
 | |
| 
 | |
| 
 | |
| PATTERN MODIFIERS
 | |
| 
 | |
|        There are three types of modifier that can appear in pattern lines, two
 | |
|        of which may also be used in a #pattern command. A  pattern's  modifier
 | |
|        list can add to or override default modifiers that were set by a previ-
 | |
|        ous #pattern command.
 | |
| 
 | |
|    Setting compilation options
 | |
| 
 | |
|        The following modifiers set options for pcre2_compile(). The most  com-
 | |
|        mon  ones have single-letter abbreviations. See pcre2api for a descrip-
 | |
|        tion of their effects.
 | |
| 
 | |
|              allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
 | |
|              alt_bsux                  set PCRE2_ALT_BSUX
 | |
|              alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
 | |
|              anchored                  set PCRE2_ANCHORED
 | |
|              auto_callout              set PCRE2_AUTO_CALLOUT
 | |
|          /i  caseless                  set PCRE2_CASELESS
 | |
|              dollar_endonly            set PCRE2_DOLLAR_ENDONLY
 | |
|          /s  dotall                    set PCRE2_DOTALL
 | |
|              dupnames                  set PCRE2_DUPNAMES
 | |
|          /x  extended                  set PCRE2_EXTENDED
 | |
|              firstline                 set PCRE2_FIRSTLINE
 | |
|              match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
 | |
|          /m  multiline                 set PCRE2_MULTILINE
 | |
|              never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
 | |
|              never_ucp                 set PCRE2_NEVER_UCP
 | |
|              never_utf                 set PCRE2_NEVER_UTF
 | |
|              no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
 | |
|              no_auto_possess           set PCRE2_NO_AUTO_POSSESS
 | |
|              no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
 | |
|              no_start_optimize         set PCRE2_NO_START_OPTIMIZE
 | |
|              no_utf_check              set PCRE2_NO_UTF_CHECK
 | |
|              ucp                       set PCRE2_UCP
 | |
|              ungreedy                  set PCRE2_UNGREEDY
 | |
|              utf                       set PCRE2_UTF
 | |
| 
 | |
|        As well as turning on the PCRE2_UTF option, the utf modifier causes all
 | |
|        non-printing  characters  in  output  strings  to  be printed using the
 | |
|        \x{hh...} notation. Otherwise, those less than 0x100 are output in  hex
 | |
|        without the curly brackets.
 | |
| 
 | |
|    Setting compilation controls
 | |
| 
 | |
|        The  following  modifiers  affect  the  compilation  process or request
 | |
|        information about the pattern:
 | |
| 
 | |
|              bsr=[anycrlf|unicode]     specify \R handling
 | |
|          /B  bincode                   show binary code without lengths
 | |
|              callout_info              show callout information
 | |
|              debug                     same as info,fullbincode
 | |
|              fullbincode               show binary code with lengths
 | |
|          /I  info                      show info about compiled pattern
 | |
|              hex                       pattern is coded in hexadecimal
 | |
|              jit[=<number>]            use JIT
 | |
|              jitfast                   use JIT fast path
 | |
|              jitverify                 verify JIT use
 | |
|              locale=<name>             use this locale
 | |
|              memory                    show memory used
 | |
|              newline=<type>            set newline type
 | |
|              parens_nest_limit=<n>     set maximum parentheses depth
 | |
|              posix                     use the POSIX API
 | |
|              push                      push compiled pattern onto the stack
 | |
|              stackguard=<number>       test the stackguard feature
 | |
|              tables=[0|1|2]            select internal tables
 | |
| 
 | |
|        The effects of these modifiers are described in the following sections.
 | |
| 
 | |
|    Newline and \R handling
 | |
| 
 | |
|        The bsr modifier specifies what \R in a pattern should match. If it  is
 | |
|        set  to  "anycrlf",  \R  matches  CR, LF, or CRLF only. If it is set to
 | |
|        "unicode", \R matches any Unicode  newline  sequence.  The  default  is
 | |
|        specified when PCRE2 is built; if no build-time choice is made, it is Unicode.
 | |
| 
 | |
|        The  newline  modifier specifies which characters are to be interpreted
 | |
|        as newlines, both in the pattern and in subject lines. The type must be
 | |
|        one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
 | |
| 
 | |
|    Information about a pattern
 | |
| 
 | |
|        The  debug modifier is a shorthand for info,fullbincode, requesting all
 | |
|        available information.
 | |
| 
 | |
|        The bincode modifier causes a representation of the compiled code to be
 | |
|        output  after compilation. This information does not contain length and
 | |
|        offset values, which ensures that the same output is generated for dif-
 | |
|        ferent  internal  link  sizes  and different code unit widths. By using
 | |
|        bincode, the same regression tests can be used  in  different  environ-
 | |
|        ments.
 | |
| 
 | |
|        The  fullbincode  modifier, by contrast, does include length and offset
 | |
|        values. This is used in a few special tests that run only for  specific
 | |
|        code unit widths and link sizes, and is also useful for one-off tests.
 | |
| 
 | |
|        The  info  modifier  requests  information  about  the compiled pattern
 | |
|        (whether it is anchored, has a fixed first character, and so  on).  The
 | |
|        information  is  obtained  from the pcre2_pattern_info() function. Here
 | |
|        are some typical examples:
 | |
| 
 | |
|            re> /(?i)(^a|^b)/m,info
 | |
|          Capturing subpattern count = 1
 | |
|          Compile options: multiline
 | |
|          Overall options: caseless multiline
 | |
|          First code unit at start or follows newline
 | |
|          Subject length lower bound = 1
 | |
| 
 | |
|            re> /(?i)abc/info
 | |
|          Capturing subpattern count = 0
 | |
|          Compile options: <none>
 | |
|          Overall options: caseless
 | |
|          First code unit = 'a' (caseless)
 | |
|          Last code unit = 'c' (caseless)
 | |
|          Subject length lower bound = 3
 | |
| 
 | |
|        "Compile options" are those specified by modifiers;  "overall  options"
 | |
|        have  added options that are taken or deduced from the pattern. If both
 | |
|        sets of options are the same, just a single "options" line  is  output;
 | |
|        if  there  are  no  options,  the line is omitted. "First code unit" is
 | |
|        where any match must start; if there is more than one they  are  listed
 | |
|        as  "starting  code  units".  "Last code unit" is the last literal code
 | |
|        unit that must be present in any match. This  is  not  necessarily  the
 | |
|        last  character.  These lines are omitted if no starting or ending code
 | |
|        units are recorded.
 | |
| 
 | |
|        The callout_info modifier requests information about all  the  callouts
 | |
|        in the pattern. A list of them is output at the end of any other infor-
 | |
|        mation that is requested. For each callout, either its number or string
 | |
|        is given, followed by the item that follows it in the pattern.
 | |
| 
 | |
|    Specifying a pattern in hex
 | |
| 
 | |
|        The hex modifier specifies that the characters of the pattern are to be
 | |
|        interpreted as pairs of hexadecimal digits. White  space  is  permitted
 | |
|        between pairs. For example:
 | |
| 
 | |
|          /ab 32 59/hex
 | |
| 
 | |
|        This  feature  is  provided  as a way of creating patterns that contain
 | |
|        binary zero and other non-printing characters.  By  default,  pcre2test
 | |
|        passes  patterns  as zero-terminated strings to pcre2_compile(), giving
 | |
|        the length as PCRE2_ZERO_TERMINATED. However, for patterns specified in
 | |
|        hexadecimal, the actual length of the pattern is passed.
 | |
| 
 | |
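       The conversion from a hex-coded pattern to the bytes that are
       compiled can be sketched in Python. The function name is
       hypothetical, and the sketch covers only the decoding step, not
       the call to pcre2_compile().

```python
import string


def decode_hex_pattern(text):
    # White space may separate pairs but may not split a pair.
    out = bytearray()
    for token in text.split():
        if len(token) % 2 or any(c not in string.hexdigits for c in token):
            raise ValueError(f"not a sequence of hex pairs: {token!r}")
        for i in range(0, len(token), 2):
            out.append(int(token[i:i + 2], 16))
    return bytes(out)
```

       Decoding "ab 32 59" gives the three bytes AB 32 59. Because the
       result is a byte string with an explicit length, it can contain
       binary zeroes, which is why the actual length rather than
       PCRE2_ZERO_TERMINATED is passed for such patterns.
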
|    JIT compilation
 | |
| 
 | |
|        The  /jit  modifier  may optionally be followed by an equals sign and a
 | |
|        number in the range 0 to 7:
 | |
| 
 | |
|          0  disable JIT
 | |
|          1  use JIT for normal match only
 | |
|          2  use JIT for soft partial match only
 | |
|          3  use JIT for normal match and soft partial match
 | |
|          4  use JIT for hard partial match only
 | |
|          6  use JIT for soft and hard partial match
 | |
|          7  all three modes
 | |
| 
 | |
|        If no number is given, 7 is assumed. If JIT compilation is  successful,
 | |
|        the  compiled JIT code will automatically be used when pcre2_match() is
 | |
|        run for the appropriate type of match, except  when  incompatible  run-
 | |
|        time options are specified. For more details, see the pcre2jit documen-
 | |
|        tation. See also the jitstack modifier below for a way of  setting  the
 | |
|        size of the JIT stack.
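The values in the table above are consistent with a three-bit mask (1 for normal matching, 2 for soft partial matching, 4 for hard partial matching). That reading is an inference from the listed values, not something this document states; under it, the unlisted value 5 would mean normal plus hard partial matching. A minimal Python sketch of the decoding, with hypothetical names:

```python
JIT_NORMAL = 1        # assumed bit for normal matching
JIT_PARTIAL_SOFT = 2  # assumed bit for soft partial matching
JIT_PARTIAL_HARD = 4  # assumed bit for hard partial matching

def jit_modes(value=7):
    """List the match types selected by a jit=<value> setting (default 7)."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    modes = []
    if value & JIT_NORMAL:
        modes.append("normal")
    if value & JIT_PARTIAL_SOFT:
        modes.append("soft partial")
    if value & JIT_PARTIAL_HARD:
        modes.append("hard partial")
    return modes
```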
 | |
| 
 | |
|        If  the  jitfast  modifier is specified, matching is done using the JIT
 | |
|        "fast path" interface, pcre2_jit_match(), which skips some of the  san-
 | |
|        ity  checks that are done by pcre2_match(), and of course does not work
 | |
|        when JIT is not supported. If jitfast is specified without  jit,  jit=7
 | |
|        is assumed.
 | |
| 
 | |
|        If  the jitverify modifier is specified, information about the compiled
 | |
|        pattern shows whether JIT compilation was or  was  not  successful.  If
 | |
|        jitverify  is  specified without jit, jit=7 is assumed. If JIT compila-
 | |
|        tion is successful when jitverify is set, the text "(JIT)" is added  to
 | |
|        the first output line after a match or non match when JIT-compiled code
 | |
|        was actually used in the match.
 | |
| 
 | |
|    Setting a locale
 | |
| 
 | |
|        The /locale modifier must specify the name of a locale, for example:
 | |
| 
 | |
|          /pattern/locale=fr_FR
 | |
| 
 | |
|        The given locale is set, pcre2_maketables() is called to build a set of
 | |
|        character  tables for the locale, and this is then passed to pcre2_com-
 | |
|        pile() when compiling the regular expression. The same tables are  used
 | |
|        when matching the following subject lines. The /locale modifier applies
 | |
|        only to the pattern on which it appears, but can be given in a #pattern
 | |
|        command  if a default is needed. Setting a locale and alternate charac-
 | |
|        ter tables are mutually exclusive.
 | |
| 
 | |
|    Showing pattern memory
 | |
| 
 | |
|        The /memory modifier causes the size in bytes of  the  memory  used  to
 | |
|        hold  the compiled pattern to be output. This does not include the size
 | |
|        of the pcre2_code block; it is just the actual compiled  data.  If  the
 | |
|        pattern is subsequently passed to the JIT compiler, the size of the JIT
 | |
|        compiled code is also output. Here is an example:
 | |
| 
 | |
|            re> /a(b)c/jit,memory
 | |
|          Memory allocation (code space): 21
 | |
|          Memory allocation (JIT code): 1910
 | |
| 
 | |
| 
 | |
|    Limiting nested parentheses
 | |
| 
 | |
|        The parens_nest_limit modifier sets a limit  on  the  depth  of  nested
 | |
|        parentheses  in  a  pattern.  Breaching  the limit causes a compilation
 | |
|        error.  The default for the library is set when  PCRE2  is  built,  but
 | |
|        pcre2test  sets  its  own default of 220, which is required for running
 | |
|        the standard test suite.
 | |
| 
 | |
|    Using the POSIX wrapper API
 | |
| 
 | |
|        The /posix modifier causes pcre2test to call PCRE2 via the POSIX  wrap-
 | |
|        per  API  rather  than  its  native  API.  This supports only the 8-bit
 | |
|        library.  When the POSIX API is being used, the following pattern modi-
 | |
|        fiers set options for the regcomp() function:
 | |
| 
 | |
|          caseless           REG_ICASE
 | |
|          multiline          REG_NEWLINE
 | |
|          no_auto_capture    REG_NOSUB
 | |
|          dotall             REG_DOTALL     )
 | |
|          ungreedy           REG_UNGREEDY   ) These options are not part of
 | |
|          ucp                REG_UCP        )   the POSIX standard
 | |
|          utf                REG_UTF8       )
 | |
| 
 | |
|        The  aftertext  and  allaftertext  subject  modifiers work as described
 | |
|        below. All other modifiers cause an error.
 | |
| 
 | |
|    Testing the stack guard feature
 | |
| 
 | |
|        The /stackguard modifier is used to  test  the  use  of  pcre2_set_com-
 | |
|        pile_recursion_guard(),  a  function  that  is provided to enable stack
 | |
|        availability to be checked during compilation (see the  pcre2api  docu-
 | |
|        mentation  for  details).  If  the  number specified by the modifier is
 | |
|        greater than zero, pcre2_set_compile_recursion_guard() is called to set
 | |
|        up a callback from pcre2_compile() to a local function. The argument it
 | |
|        receives is the current nesting parenthesis depth; if this  is  greater
 | |
|        than the value given by the modifier, non-zero is returned, causing the
 | |
|        compilation to be aborted.
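The guard logic described above can be modelled in a few lines of Python. This is a toy stand-in, not the PCRE2 compiler: compile_with_guard is a hypothetical function that only tracks parenthesis nesting (skipping escaped characters and ignoring character classes), and the exact depth value that the real callback receives is an assumption.

```python
def make_guard(limit):
    """Return a callback that reports non-zero once depth exceeds limit."""
    def guard(depth):
        return 1 if depth > limit else 0
    return guard

def compile_with_guard(pattern, limit):
    """Toy compile: True if 'compiled', False if the guard aborted it."""
    guard = make_guard(limit) if limit > 0 else None  # no guard for zero
    depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":          # skip an escaped character
            i += 2
            continue
        if ch == "(":
            depth += 1
            if guard is not None and guard(depth):
                return False    # non-zero return aborts the compilation
        elif ch == ")":
            depth = max(depth - 1, 0)
        i += 1
    return True
```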
 | |
| 
 | |
|    Using alternative character tables
 | |
| 
 | |
|        The value specified for the /tables modifier must be one of the  digits
 | |
|        0, 1, or 2. It causes a specific set of built-in character tables to be
 | |
|        passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
 | |
|        haviour with different character tables. The digit specifies the tables
 | |
|        as follows:
 | |
| 
 | |
|          0   do not pass any special character tables
 | |
|          1   the default ASCII tables, as distributed in
 | |
|                pcre2_chartables.c.dist
 | |
|          2   a set of tables defining ISO 8859 characters
 | |
| 
 | |
|        In table 2, some characters whose codes are greater than 128 are  iden-
 | |
|        tified  as  letters,  digits,  spaces, etc. Setting alternate character
 | |
|        tables and a locale are mutually exclusive.
 | |
| 
 | |
|    Setting certain match controls
 | |
| 
 | |
|        The following modifiers are really subject modifiers, and are described
 | |
|        below.   However, they may be included in a pattern's modifier list, in
 | |
|        which case they are applied to every subject  line  that  is  processed
 | |
|        with that pattern. They do not affect the compilation process.
 | |
| 
 | |
|              aftertext           show text after match
 | |
|              allaftertext        show text after captures
 | |
|              allcaptures         show all captures
 | |
|              allusedtext         show all consulted text
 | |
|          /g  global              global matching
 | |
|              mark                show mark values
 | |
|              replace=<string>    specify a replacement string
 | |
|              startchar           show starting character when relevant
 | |
| 
 | |
|        These  modifiers may not appear in a #pattern command. If you want them
 | |
|        as defaults, set them in a #subject command.
 | |
| 
 | |
|    Saving a compiled pattern
 | |
| 
 | |
|        When a pattern with the push modifier is successfully compiled,  it  is
 | |
|        pushed  onto  a  stack  of compiled patterns, and pcre2test expects the
 | |
|        next line to contain a new pattern (or a command) instead of a  subject
 | |
|        line. This facility is used when saving compiled patterns to a file, as
 | |
|        described in the section entitled "Saving and restoring  compiled  pat-
 | |
|        terns" below.  The push modifier is incompatible with compilation modi-
 | |
|        fiers such as global that act at match time. Any that are specified are
 | |
|        ignored,  with  a  warning message, except for replace, which causes an
 | |
|        error. Note that jitverify, which is allowed, does  not  carry  through
 | |
|        to any subsequent matching that uses this pattern.
 | |
| 
 | |
| 
 | |
| SUBJECT MODIFIERS
 | |
| 
 | |
|        The modifiers that can appear in subject lines and the #subject command
 | |
|        are of two types.
 | |
| 
 | |
|    Setting match options
 | |
| 
 | |
|        The   following   modifiers   set   options   for   pcre2_match()    or
 | |
|        pcre2_dfa_match(). See pcre2api for a description of their effects.
 | |
| 
 | |
|              anchored                  set PCRE2_ANCHORED
 | |
|              dfa_restart               set PCRE2_DFA_RESTART
 | |
|              dfa_shortest              set PCRE2_DFA_SHORTEST
 | |
|              no_utf_check              set PCRE2_NO_UTF_CHECK
 | |
|              notbol                    set PCRE2_NOTBOL
 | |
|              notempty                  set PCRE2_NOTEMPTY
 | |
|              notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
 | |
|              noteol                    set PCRE2_NOTEOL
 | |
|              partial_hard (or ph)      set PCRE2_PARTIAL_HARD
 | |
|              partial_soft (or ps)      set PCRE2_PARTIAL_SOFT
 | |
| 
 | |
|        The  partial matching modifiers are provided with abbreviations because
 | |
|        they appear frequently in tests.
 | |
| 
 | |
|        If the /posix modifier was present on the pattern,  causing  the  POSIX
 | |
|        wrapper API to be used, the only option-setting modifiers that have any
 | |
|        effect  are  notbol,  notempty,   and   noteol,   causing   REG_NOTBOL,
 | |
|        REG_NOTEMPTY,  and REG_NOTEOL, respectively, to be passed to regexec().
 | |
|        Any other modifiers cause an error.
 | |
| 
 | |
|    Setting match controls
 | |
| 
 | |
|        The following modifiers affect the matching process  or  request  addi-
 | |
|        tional  information.  Some  of  them may also be specified on a pattern
 | |
|        line (see above), in which case they apply to every subject  line  that
 | |
|        is matched against that pattern.
 | |
| 
 | |
|              aftertext                 show text after match
 | |
|              allaftertext              show text after captures
 | |
|              allcaptures               show all captures
 | |
|              allusedtext               show all consulted text (non-JIT only)
 | |
|              altglobal                 alternative global matching
 | |
|              callout_capture           show captures at callout time
 | |
|              callout_data=<n>          set a value to pass via callouts
 | |
|              callout_fail=<n>[:<m>]    control callout failure
 | |
|              callout_none              do not supply a callout function
 | |
|              copy=<number or name>     copy captured substring
 | |
|              dfa                       use pcre2_dfa_match()
 | |
|              find_limits               find match and recursion limits
 | |
|              get=<number or name>      extract captured substring
 | |
|              getall                    extract all captured substrings
 | |
|          /g  global                    global matching
 | |
|              jitstack=<n>              set size of JIT stack
 | |
|              mark                      show mark values
 | |
|              match_limit=<n>           set a match limit
 | |
|              memory                    show memory usage
 | |
|              offset=<n>                set starting offset
 | |
|              ovector=<n>               set size of output vector
 | |
|              recursion_limit=<n>       set a recursion limit
 | |
|              replace=<string>          specify a replacement string
 | |
|              startchar                 show startchar when relevant
 | |
|              zero_terminate            pass the subject as zero-terminated
 | |
| 
 | |
|        The effects of these modifiers are described in the following sections.
 | |
| 
 | |
|    Showing more text
 | |
| 
 | |
|        The  aftertext modifier requests that as well as outputting the part of
 | |
|        the subject string that matched the entire pattern, pcre2test should in
 | |
|        addition output the remainder of the subject string. This is useful for
 | |
|        tests where the subject contains multiple copies of the same substring.
 | |
|        The  allaftertext  modifier  requests the same action for captured sub-
 | |
|        strings as well as the main matched substring. In each case the remain-
 | |
|        der is output on the following line with a plus character following the
 | |
|        capture number.
 | |
| 
 | |
|        The allusedtext modifier requests that all the text that was  consulted
 | |
|        during  a  successful pattern match by the interpreter should be shown.
 | |
|        This feature is not supported for JIT matching, and if  requested  with
 | |
|        JIT  it  is  ignored  (with  a  warning message). Setting this modifier
 | |
|        affects the output if there is a lookbehind at the start of a match, or
 | |
|        a  lookahead  at  the  end, or if \K is used in the pattern. Characters
 | |
|        that precede or follow the start and end of the actual match are  indi-
 | |
|        cated  in  the output by '<' or '>' characters underneath them. Here is
 | |
|        an example:
 | |
| 
 | |
|            re> /(?<=pqr)abc(?=xyz)/
 | |
|          data> 123pqrabcxyz456\=allusedtext
 | |
|           0: pqrabcxyz
 | |
|              <<<   >>>
 | |
| 
 | |
|        This shows that the matched string is "abc",  with  the  preceding  and
 | |
|        following  strings  "pqr"  and  "xyz"  having been consulted during the
 | |
|        match (when processing the assertions).
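The marker line in the example can be derived from four offsets: where the consulted text starts, where the match starts and ends, and where the consulted text ends. The Python sketch below is only a model of that layout; the function name and its parameters are hypothetical.

```python
def allusedtext_lines(subject, used_start, match_start, match_end, used_end):
    """Return (text shown, marker line) for an allusedtext-style display."""
    shown = subject[used_start:used_end]
    markers = (
        "<" * (match_start - used_start)   # consulted before the match
        + " " * (match_end - match_start)  # the match itself is unmarked
        + ">" * (used_end - match_end)     # consulted after the match
    )
    return shown, markers.rstrip()
```

With the subject 123pqrabcxyz456 and offsets 3, 6, 9, and 12, this gives the text pqrabcxyz and the markers shown in the example.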
 | |
| 
 | |
|        The startchar modifier requests that the  starting  character  for  the
 | |
|        match  be  indicated,  if  it  is different to the start of the matched
 | |
|        string. The only time when this occurs is when \K has been processed as
 | |
|        part of the match. In this situation, the output for the matched string
 | |
|        is displayed from the starting character  instead  of  from  the  match
 | |
|        point,  with  circumflex  characters  under the earlier characters. For
 | |
|        example:
 | |
| 
 | |
|            re> /abc\Kxyz/
 | |
|          data> abcxyz\=startchar
 | |
|           0: abcxyz
 | |
|              ^^^
 | |
| 
 | |
|        Unlike allusedtext, the startchar modifier can be used with JIT.   How-
 | |
|        ever, these two modifiers are mutually exclusive.
 | |
| 
 | |
|    Showing the value of all capture groups
 | |
| 
 | |
|        The allcaptures modifier requests that the values of all potential cap-
 | |
|        tured parentheses be output after a match. By default, only those up to
 | |
|        the highest one actually used in the match are output (corresponding to
 | |
|        the return code from pcre2_match()). Groups that did not take  part  in
 | |
|        the match are output as "<unset>".
 | |
| 
 | |
|    Testing callouts
 | |
| 
 | |
|        A  callout function is supplied when pcre2test calls the library match-
 | |
|        ing functions, unless callout_none is specified. If callout_capture  is
 | |
|        set, the current captured groups are output when a callout occurs.
 | |
| 
 | |
|        The  callout_fail modifier can be given one or two numbers. If there is
 | |
|        only one number, 1 is returned instead of 0 when a callout of that num-
 | |
|        ber  is  reached.  If two numbers are given, 1 is returned when callout
 | |
|        <n> is reached for the <m>th time. Note that callouts with string argu-
 | |
|        ments  are  always  given  the  number zero. See "Callouts" below for a
 | |
|        description of the output when a callout is taken.
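The wording above can be read as the following small model, written in Python. The class name is hypothetical, and the counting follows the description literally (occurrences of callout <n> only); the real program's bookkeeping may differ in detail.

```python
class CalloutFail:
    """Model of callout_fail=<n>[:<m>] as described in the text."""

    def __init__(self, number, occurrence=None):
        self.number = number          # <n>: the callout number to fail on
        self.occurrence = occurrence  # <m>: which visit fails, if given
        self.count = 0                # visits to callout <n> so far

    def __call__(self, callout_number):
        if callout_number != self.number:
            return 0                  # other callouts carry on matching
        self.count += 1
        if self.occurrence is None or self.count == self.occurrence:
            return 1                  # non-zero makes the callout fail
        return 0
```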
 | |
| 
 | |
|        The callout_data modifier can be given an unsigned or a  negative  num-
 | |
|        ber.   This  is  set  as the "user data" that is passed to the matching
 | |
|        function, and passed back when the callout  function  is  invoked.  Any
 | |
|        value  other  than  zero  is  used as a return from pcre2test's callout
 | |
|        function.
 | |
| 
 | |
|    Finding all matches in a string
 | |
| 
 | |
|        Searching for all possible matches within a subject can be requested by
 | |
|        the  global  or altglobal modifier. After finding a match, the matching
 | |
|        function is called again to search the remainder of  the  subject.  The
 | |
|        difference  between  global  and  altglobal is that the former uses the
 | |
|        start_offset argument to pcre2_match() or  pcre2_dfa_match()  to  start
 | |
|        searching  at  a new point within the entire string (which is what Perl
 | |
|        does), whereas the latter passes over a shortened subject. This makes a
 | |
|        difference to the matching process if the pattern begins with a lookbe-
 | |
|        hind assertion (including \b or \B).
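The difference can be seen with any regex engine that distinguishes a start offset from a shortened string. The sketch below uses Python's re module as a stand-in for PCRE2 (an assumption: the two engines are not the same, but \B behaves alike here). The function names are hypothetical.

```python
import re

def global_style(pattern, subject):
    """Keep the whole subject and move only the start offset."""
    rx = re.compile(pattern)
    found, pos = [], 0
    while pos <= len(subject):
        m = rx.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Step past an empty match so the loop always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found

def altglobal_style(pattern, subject):
    """Pass a shortened subject, losing the text before the new start."""
    rx = re.compile(pattern)
    found, rest = [], subject
    while True:
        m = rx.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]
    return found
```

For the pattern \Bi(\w\w) and the subject Mississippi, the first function finds iss, iss, ipp, whereas the second finds only iss and ipp, because \B cannot succeed at the very start of the shortened subject.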
 | |
| 
 | |
|        If an empty string  is  matched,  the  next  match  is  done  with  the
 | |
|        PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
 | |
|        for another, non-empty, match at the same point in the subject. If this
 | |
|        match  fails,  the  start  offset  is advanced, and the normal match is
 | |
|        retried. This imitates the way Perl handles such cases when  using  the
 | |
|        /g  modifier  or  the  split()  function. Normally, the start offset is
 | |
|        advanced by one character, but if  the  newline  convention  recognizes
 | |
|        CRLF  as  a newline, and the current character is CR followed by LF, an
 | |
|        advance of two characters occurs.
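The advance rule in the last sentence is small enough to state as code. This Python sketch covers only that step (not the retry with PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED), and it treats one Python character as one character of the subject; the function name is hypothetical.

```python
def advance_after_empty_match(subject, offset, crlf_is_newline):
    """Return the next start offset after a failed retry at offset."""
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2   # step over CR and LF together
    return offset + 1       # the normal one-character advance
```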
 | |
| 
 | |
|    Testing substring extraction functions
 | |
| 
 | |
|        The copy  and  get  modifiers  can  be  used  to  test  the  pcre2_sub-
 | |
|        string_copy_xxx() and pcre2_substring_get_xxx() functions.  They can be
 | |
|        given more than once, and each can specify a group name or number,  for
 | |
|        example:
 | |
| 
 | |
|           abcd\=copy=1,copy=3,get=G1
 | |
| 
 | |
|        If  the  #subject command is used to set default copy and/or get lists,
 | |
|        these can be unset by specifying a negative number to cancel  all  num-
 | |
|        bered groups and an empty name to cancel all named groups.
 | |
| 
 | |
|        The  getall  modifier  tests pcre2_substring_list_get(), which extracts
 | |
|        all captured substrings.
 | |
| 
 | |
|        If the subject line is successfully matched, the  substrings  extracted
 | |
|        by  the  convenience  functions  are  output  with C, G, or L after the
 | |
|        string number instead of a colon. This is in  addition  to  the  normal
 | |
|        full  list.  The string length (that is, the return from the extraction
 | |
|        function) is given in parentheses after each substring, followed by the
 | |
|        name when the extraction was by name.
 | |
| 
 | |
|    Testing the substitution function
 | |
| 
 | |
|        If  the  replace  modifier  is  set, the pcre2_substitute() function is
 | |
|        called instead  of  one  of  the  matching  functions.  Unlike  subject
 | |
|        strings,  pcre2test  does  not  process  replacement strings for escape
 | |
|        sequences. In UTF mode, a replacement string is checked to see if it is
 | |
|        a valid UTF-8 string.  If so, it is correctly converted to a UTF string
 | |
|        of the appropriate code unit width. If it is not a valid UTF-8  string,
 | |
|        the individual code units are copied directly. This provides a means of
 | |
|        passing an invalid UTF-8 string for testing purposes.
 | |
| 
 | |
|        If the global modifier is set,  PCRE2_SUBSTITUTE_GLOBAL  is  passed  to
 | |
|        pcre2_substitute().  After  a  successful  substitution,  the  modified
 | |
|        string is output, preceded by the number of replacements. This  may  be
 | |
|        zero  if there were no matches. Here is a simple example of a substitu-
 | |
|        tion test:
 | |
| 
 | |
|          /abc/replace=xxx
 | |
|              =abc=abc=
 | |
|           1: =xxx=abc=
 | |
|              =abc=abc=\=global
 | |
|           2: =xxx=xxx=
 | |
| 
 | |
|        Subject and replacement strings should be  kept  relatively  short  for
 | |
|        substitution  tests, as fixed-size buffers are used. To make it easy to
 | |
|        test for buffer overflow, if the replacement string starts with a  num-
 | |
|        ber  in square brackets, that number is passed to pcre2_substitute() as
 | |
|        the size of the output buffer, with the replacement string starting  at
 | |
|        the next character. Here is an example that tests the edge case:
 | |
| 
 | |
|          /abc/
 | |
|              123abc123\=replace=[10]XYZ
 | |
|           1: 123XYZ123
 | |
|              123abc123\=replace=[9]XYZ
 | |
|          Failed: error -47: no more memory
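The edge case in this example can be checked by hand: the result 123XYZ123 has nine code units, and it fits a buffer of 10 but not one of 9, which suggests that room for a terminating zero is needed. The Python sketch below models the bracketed size prefix and that size check; both function names are hypothetical, and the terminating-zero rule is an assumption drawn from the example.

```python
import re

def split_buffer_prefix(replacement):
    """Split '[<n>]text' into (n, 'text'); n is None if there is no prefix."""
    m = re.match(r"\[(\d+)\]", replacement)
    if m:
        return int(m.group(1)), replacement[m.end():]
    return None, replacement

def fits(result, buffer_size):
    """True if result plus a terminating zero fits in buffer_size units."""
    return len(result) + 1 <= buffer_size
```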
 | |
| 
 | |
|        A replacement string is ignored with POSIX and DFA matching. Specifying
 | |
|        partial matching provokes an error return  ("bad  option  value")  from
 | |
|        pcre2_substitute().
 | |
| 
 | |
|    Setting the JIT stack size
 | |
| 
 | |
|        The  jitstack modifier provides a way of setting the maximum stack size
 | |
|        that is used by the just-in-time optimization code. It  is  ignored  if
 | |
|        JIT optimization is not being used. The value is a number of kilobytes.
 | |
|        Providing a stack that is larger than the default 32K is necessary only
 | |
|        for very complicated patterns.
 | |
| 
 | |
|    Setting match and recursion limits
 | |
| 
 | |
|        The  match_limit and recursion_limit modifiers set the appropriate lim-
 | |
|        its in the match context. These values are ignored when the find_limits
 | |
|        modifier is specified.
 | |
| 
 | |
|    Finding minimum limits
 | |
| 
 | |
|        If  the  find_limits modifier is present, pcre2test calls pcre2_match()
 | |
|        several times, setting  different  values  in  the  match  context  via
 | |
|        pcre2_set_match_limit()  and pcre2_set_recursion_limit() until it finds
 | |
|        the minimum values for each parameter that allow pcre2_match() to  com-
 | |
|        plete without error.
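One plausible way to find such a minimum is to double a trial limit until the match completes and then bisect between the last failure and the first success. The text above does not say which strategy pcre2test uses, so the Python sketch below is an assumption; succeeds stands for a hypothetical callback that runs the match with a given limit and reports whether it completed, and it must be monotone in the limit.

```python
def find_minimum_limit(succeeds, upper=2**32 - 1):
    """Smallest limit >= 1 for which succeeds(limit) is true."""
    lo, hi = 0, 1                     # lo is treated as a known failure
    while not succeeds(hi):           # grow until the match completes
        if hi >= upper:
            raise ValueError("no limit up to %d is sufficient" % upper)
        lo, hi = hi, min(hi * 2, upper)
    while hi - lo > 1:                # bisect between failure and success
        mid = (lo + hi) // 2
        if succeeds(mid):
            hi = mid
        else:
            lo = mid
    return hi
```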
 | |
| 
 | |
|        If JIT is being used, only the match limit is relevant. If DFA matching
 | |
|        is being used, neither limit is relevant, and this modifier is  ignored
 | |
|        (with a warning message).
 | |
| 
 | |
|        The  match_limit number is a measure of the amount of backtracking that
 | |
|        takes place, and learning the minimum value  can  be  instructive.  For
 | |
|        most  simple  matches, the number is quite small, but for patterns with
 | |
|        very large numbers of matching possibilities, it can become large  very
 | |
|        quickly    with    increasing    length    of   subject   string.   The
 | |
|        recursion_limit number is a measure of how much stack (or, if
 | |
|        PCRE2  is  compiled with NO_RECURSE, how much heap) memory is needed to
 | |
|        complete the match attempt.
 | |
| 
 | |
|    Showing MARK names
 | |
| 
 | |
| 
 | |
|        The mark modifier causes the names from backtracking control verbs that
 | |
|        are  returned from calls to pcre2_match() to be displayed. If a mark is
 | |
|        returned for a match, non-match, or partial match, pcre2test shows  it.
 | |
|        For  a  match, it is on a line by itself, tagged with "MK:". Otherwise,
 | |
|        it is added to the non-match message.
 | |
| 
 | |
|    Showing memory usage
 | |
| 
 | |
|        The memory modifier causes pcre2test to log all memory  allocation  and
 | |
|        freeing calls that occur during a match operation.
 | |
| 
 | |
|    Setting a starting offset
 | |
| 
 | |
|        The  offset  modifier  sets  an  offset  in the subject string at which
 | |
|        matching starts. Its value is a number of code units, not characters.
 | |
| 
 | |
|    Setting the size of the output vector
 | |
| 
 | |
|        The ovector modifier applies only to  the  subject  line  in  which  it
 | |
|        appears,  though  of  course  it can also be used to set a default in a
 | |
|        #subject command. It specifies the number of pairs of offsets that  are
 | |
|        available for storing matching information. The default is 15.
 | |
| 
 | |
|        A  value of zero is useful when testing the POSIX API because it causes
 | |
|        regexec() to be called with a NULL capture vector. When not testing the
 | |
|        POSIX  API,  a  value  of  zero  is used to cause pcre2_match_data_cre-
 | |
|        ate_from_pattern() to be called, in order to create a  match  block  of
 | |
|        exactly the right size for the pattern. (It is not possible to create a
 | |
|        match block with a zero-length ovector; there is always  at  least  one
 | |
|        pair of offsets.)
 | |
| 
 | |
|    Passing the subject as zero-terminated
 | |
| 
 | |
|        By default, the subject string is passed to a native API matching func-
 | |
|        tion with its correct length. In order to test the facility for passing
 | |
|        a  zero-terminated  string, the zero_terminate modifier is provided. It
 | |
|        causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
 | |
|        via  the  POSIX  interface, this modifier has no effect, as there is no
 | |
|        facility for passing a length.)
 | |
| 
 | |
|        When testing pcre2_substitute(), this modifier also has the  effect  of
 | |
|        passing the replacement string as zero-terminated.
 | |
| 
 | |
| 
 | |
| THE ALTERNATIVE MATCHING FUNCTION
 | |
| 
 | |
|        By  default,  pcre2test  uses  the  standard  PCRE2  matching function,
 | |
|        pcre2_match() to match each subject line. PCRE2 also supports an alter-
 | |
|        native  matching  function, pcre2_dfa_match(), which operates in a dif-
 | |
|        ferent way, and has some restrictions. The differences between the  two
 | |
|        functions are described in the pcre2matching documentation.
 | |
| 
 | |
|        If  the dfa modifier is set, the alternative matching function is used.
 | |
|        This function finds all possible matches at a given point in  the  sub-
 | |
|        ject.  If,  however, the dfa_shortest modifier is set, processing stops
 | |
|        after the first match is found. This is always  the  shortest  possible
 | |
|        match.
 | |
| 
 | |
| 
 | |
| DEFAULT OUTPUT FROM pcre2test
 | |
| 
 | |
|        This  section  describes  the output when the normal matching function,
 | |
|        pcre2_match(), is being used.
 | |
| 
 | |
|        When a match succeeds, pcre2test outputs  the  list  of  captured  sub-
 | |
|        strings,  starting  with number 0 for the string that matched the whole
 | |
|        pattern.   Otherwise,  it  outputs  "No  match"  when  the  return   is
 | |
|        PCRE2_ERROR_NOMATCH,  or  "Partial  match:"  followed  by the partially
 | |
|        matching substring when the return is PCRE2_ERROR_PARTIAL.  (Note  that
 | |
|        this  is  the  entire  substring  that was inspected during the partial
 | |
|        match; it may include characters before the actual  match  start  if  a
 | |
|        lookbehind assertion, \K, \b, or \B was involved.)
 | |
| 
 | |
|        For any other return, pcre2test outputs the PCRE2 negative error number
 | |
|        and a short descriptive phrase. If the error is  a  failed  UTF  string
 | |
|        check,  the  code  unit offset of the start of the failing character is
 | |
|        also output. Here is an example of an interactive pcre2test run.
 | |
| 
 | |
|          $ pcre2test
 | |
|          PCRE2 version 9.00 2014-05-10
 | |
| 
 | |
|            re> /^abc(\d+)/
 | |
|          data> abc123
 | |
|           0: abc123
 | |
|           1: 123
 | |
|          data> xyz
 | |
|          No match
 | |
| 
 | |
|        Unset capturing substrings that are not followed by one that is set are
 | |
|        not shown by pcre2test unless the allcaptures modifier is specified. In
 | |
|        the following example, there are two capturing substrings, but when the
 | |
|        first  data  line is matched, the second, unset substring is not shown.
 | |
|        An "internal" unset substring is shown as "<unset>", as for the  second
 | |
|        data line.
 | |
| 
 | |
|            re> /(a)|(b)/
 | |
|          data> a
 | |
|           0: a
 | |
|           1: a
 | |
|          data> b
 | |
|           0: b
 | |
|           1: <unset>
 | |
|           2: b
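The rule for which substrings are listed can be sketched with Python's re module standing in for PCRE2 (an assumption; only the display rule is being modelled). The function name capture_lines is hypothetical.

```python
import re

def capture_lines(pattern, subject, allcaptures=False):
    """Format captured substrings the way the examples above show them."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    groups = (m.group(0),) + m.groups()
    if allcaptures:
        last = len(groups) - 1        # show every group in the pattern
    else:
        # Show groups only up to the highest one that is set.
        last = max(i for i, g in enumerate(groups) if g is not None)
    return [
        "%2d: %s" % (i, "<unset>" if groups[i] is None else groups[i])
        for i in range(last + 1)
    ]
```

Calling capture_lines("(a)|(b)", "a") omits the trailing unset group, while passing allcaptures=True lists it as "<unset>".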
 | |
| 
 | |
|        If  the strings contain any non-printing characters, they are output as
 | |
|        \xhh escapes if the value is less than 256 and UTF  mode  is  not  set.
 | |
|        Otherwise they are output as \x{hh...} escapes. See below for the defi-
 | |
|        nition of non-printing characters. If the /aftertext modifier  is  set,
 | |
|        the output for substring 0 is followed by the rest  of  the  subject
 | |
|        string, identified by "0+" like this:
 | |
| 
 | |
|            re> /cat/aftertext
 | |
|          data> cataract
 | |
|           0: cat
 | |
|           0+ aract
 | |
| 
 | |
|        If global matching is requested, the  results  of  successive  matching
 | |
|        attempts are output in sequence, like this:
 | |
| 
 | |
|            re> /\Bi(\w\w)/g
 | |
|          data> Mississippi
 | |
|           0: iss
 | |
|           1: ss
 | |
|           0: iss
 | |
|           1: ss
 | |
|           0: ipp
 | |
|           1: pp
 | |
| 
 | |
|        "No  match" is output only if the first match attempt fails. Here is an
 | |
|        example of a failure message (the offset 4 that  is  specified  by  the
 | |
|        offset modifier is past the end of the subject string):
 | |
| 
 | |
|            re> /xyz/
 | |
|          data> xyz\=offset=4
 | |
|          Error -24 (bad offset value)
 | |
| 
 | |
|        Note that whereas patterns can be continued over several lines (a plain
 | |
|        ">" prompt is used for continuations), subject lines may  not.  However
 | |
|        newlines can be included in a subject by means of the \n escape (or \r,
 | |
|        \r\n, etc., depending on the newline sequence setting).
 | |
| 
 | |
| 
 | |
| OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
 | |
| 
 | |
|        When the alternative matching function, pcre2_dfa_match(), is used, the
 | |
|        output  consists  of  a list of all the matches that start at the first
 | |
|        point in the subject where there is at least one match. For example:
 | |
| 
 | |
|            re> /(tang|tangerine|tan)/
 | |
|          data> yellow tangerine\=dfa
 | |
|           0: tangerine
 | |
|           1: tang
 | |
|           2: tan
 | |
| 
 | |
|        Using the normal matching function on this data finds only "tang".  The
 | |
|        longest  matching  string  is  always  given first (and numbered zero).
 | |
|        After a PCRE2_ERROR_PARTIAL return, the  output  is  "Partial  match:",
 | |
|        followed  by  the  partially  matching substring. Note that this is the
 | |
|        entire substring that was inspected during the partial  match;  it  may
 | |
|        include characters before the actual match start if a lookbehind asser-
 | |
|        tion, \b, or \B was involved. (\K is not supported for DFA matching.)
 | |
| 
 | |
|        If global matching is requested, the search for further matches resumes
 | |
|        at the end of the longest match. For example:
 | |
| 
 | |
|            re> /(tang|tangerine|tan)/g
 | |
|          data> yellow tangerine and tangy sultana\=dfa
 | |
|           0: tangerine
 | |
|           1: tang
 | |
|           2: tan
 | |
|           0: tang
 | |
|           1: tan
 | |
|           0: tan
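The shape of this output can be reproduced with a toy model that handles only patterns made of literal alternatives. It is not a DFA matcher and does not call PCRE2; the function names are hypothetical. It lists every alternative that matches at the first matching position, longest first, and for global matching resumes at the end of the longest match.

```python
def dfa_style_matches(alternatives, subject, start=0):
    """Return (position, matches longest first) or None if nothing matches."""
    for pos in range(start, len(subject)):
        hits = sorted(
            {alt for alt in alternatives if alt and subject.startswith(alt, pos)},
            key=len,
            reverse=True,
        )
        if hits:
            return pos, hits
    return None

def dfa_style_global(alternatives, subject):
    """Repeat the search, resuming after the longest match each time."""
    results, pos = [], 0
    while True:
        found = dfa_style_matches(alternatives, subject, pos)
        if found is None:
            return results
        at, hits = found
        results.append(hits)
        pos = at + len(hits[0])       # hits[0] is the longest match
```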
 | |
| 
 | |
|        The  alternative  matching function does not support substring capture,
 | |
|        so the modifiers that are concerned with captured  substrings  are  not
 | |
|        relevant.
 | |
| 
 | |
| 
 | |
| RESTARTING AFTER A PARTIAL MATCH
 | |
| 
 | |
|        When  the  alternative matching function has given the PCRE2_ERROR_PAR-
 | |
|        TIAL return, indicating that the subject partially matched the pattern,
 | |
|        you  can restart the match with additional subject data by means of the
 | |
|        dfa_restart modifier. For example:
 | |
| 
 | |
|            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | |
|          data> 23ja\=P,dfa
 | |
|          Partial match: 23ja
 | |
|          data> n05\=dfa,dfa_restart
 | |
|           0: n05
 | |
| 
 | |
|        For further information about partial matching,  see  the  pcre2partial
 | |
|        documentation.
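
The idea behind restarting can be illustrated with a small Python model that keeps its matching state between two segments of subject data. It is a sketch only: it does not call PCRE2, it hard-codes the date pattern from the example above by expanding it into templates, and the names TEMPLATES, feed and status are hypothetical.

```python
MONTHS = "jan feb mar apr may jun jul aug sep oct nov dec".split()

# Every shape the pattern can match, with "D" standing for one digit:
# one or two leading digits, a month name, then two digits.
TEMPLATES = [lead + month + "DD" for month in MONTHS for lead in ("D", "DD")]

def feed(states, chunk):
    """Advance the live (template, position) pairs over one chunk of subject."""
    for ch in chunk:
        states = {
            (t, pos + 1)
            for t, pos in states
            if pos < len(t)
            and (ch in "0123456789" if t[pos] == "D" else ch == t[pos])
        }
    return states

def status(states):
    """Classify the state set as a complete match, a partial match, or a failure."""
    if any(pos == len(t) for t, pos in states):
        return "match"
    return "partial" if states else "no match"

start = {(t, 0) for t in TEMPLATES}
after_first = feed(start, "23ja")          # first segment of the subject
print(status(after_first))
after_second = feed(after_first, "n05")    # restart with the saved state
print(status(after_second))
```

The second call sees only "n05", just as the restarted match in the example above reports only that segment; the earlier characters are represented by the retained state, not by the new data.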
 | |
| 
 | |
| 
 | |
| CALLOUTS
 | |
| 
 | |
|        If the pattern contains any callout requests, pcre2test's callout func-
 | |
|        tion is called during matching unless callout_none is specified.   This
 | |
|        works with both matching functions.
 | |
| 
 | |
|        The  callout  function in pcre2test returns zero (carry on matching) by
 | |
|        default, but you can use a callout_fail modifier in a subject line  (as
 | |
|        described above) to change this and other parameters of the callout.
 | |
| 
 | |
|        Inserting callouts can be helpful when using pcre2test to check compli-
 | |
|        cated regular expressions. For further information about callouts,  see
 | |
|        the pcre2callout documentation.
 | |
| 
 | |
|        The  output for callouts with numerical arguments and those with string
 | |
|        arguments is slightly different.
 | |
| 
 | |
|    Callouts with numerical arguments
 | |
| 
 | |
|        By default, the callout function displays the callout number, the start
 | |
|        and  current positions in the subject text at the callout time, and the
 | |
|        next pattern item to be tested. For example:
 | |
| 
 | |
|          --->pqrabcdef
 | |
|            0    ^  ^     \d
 | |
| 
 | |
|        This output indicates that  callout  number  0  occurred  for  a  match
 | |
|        attempt  starting  at  the fourth character of the subject string, when
 | |
|        the pointer was at the seventh character, and  when  the  next  pattern
 | |
|        item  was  \d.  Just  one circumflex is output if the start and current
 | |
|        positions are the same.
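
The position indicators can be reproduced with a short Python sketch. The three-character field for the callout number is inferred from the example above, not taken from the pcre2test source, and caret_line is a hypothetical helper; the next pattern item that follows the indicators is left out.

```python
def caret_line(start, current):
    """Return the position indicators, aligned with the subject string.

    start and current are zero-based offsets into the subject, with
    current >= start. A single circumflex results when they are equal.
    """
    line = [" "] * (current + 1)
    line[start] = "^"
    line[current] = "^"
    return "".join(line)

subject = "pqrabcdef"
print("--->" + subject)
# Callout number 0, match attempt starting at the fourth character
# (offset 3), current position at the seventh character (offset 6).
print("%3d " % 0 + caret_line(3, 6))
```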
 | |
| 
 | |
|        Callouts numbered 255 are assumed to be automatic callouts, inserted as
 | |
|        a  result  of the /auto_callout pattern modifier. In this case, instead
 | |
|        of showing the callout number, the offset in the pattern, preceded by a
 | |
|        plus, is output. For example:
 | |
| 
 | |
|            re> /\d?[A-E]\*/auto_callout
 | |
|          data> E*
 | |
|          --->E*
 | |
|           +0 ^      \d?
 | |
|           +3 ^      [A-E]
 | |
|           +8 ^^     \*
 | |
|          +10 ^ ^
 | |
|           0: E*
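
The offsets in this output are simply the positions at which each item starts in the pattern string, with a final callout at the end of the pattern. A minimal Python sketch, assuming the pattern has already been split into its items by hand (callout_offsets is a hypothetical helper):

```python
from itertools import accumulate

def callout_offsets(items):
    """Pattern offset of each item, plus the offset of the end of the pattern."""
    return [0] + list(accumulate(len(item) for item in items))

# The items of /\d?[A-E]\*/ in order.
items = [r"\d?", "[A-E]", r"\*"]
for offset in callout_offsets(items):
    print("+%d" % offset)
```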
 | |
| 
 | |
|        If a pattern contains (*MARK) items, an additional line is output when-
 | |
|        ever a change of latest mark is passed to  the  callout  function.  For
 | |
|        example:
 | |
| 
 | |
|            re> /a(*MARK:X)bc/auto_callout
 | |
|          data> abc
 | |
|          --->abc
 | |
|           +0 ^       a
 | |
|           +1 ^^      (*MARK:X)
 | |
|          +10 ^^      b
 | |
|          Latest Mark: X
 | |
|          +11 ^ ^     c
 | |
|          +12 ^  ^
 | |
|           0: abc
 | |
| 
 | |
|        The  mark  changes between matching "a" and "b", but stays the same for
 | |
|        the rest of the match, so nothing more is output. If, as  a  result  of
 | |
|        backtracking,  the  mark  reverts to being unset, the text "<unset>" is
 | |
|        output.
 | |
| 
 | |
|    Callouts with string arguments
 | |
| 
 | |
|        The output for a callout with a string argument is similar, except that
 | |
|        no callout number precedes the position indicators. Instead, the callout
 | |
|        string and its offset in the pattern are output on a line of their own,
 | |
|        followed by the subject string, which is repeated for each callout. For
 | |
|        example:
 | |
| 
 | |
|            re> /^ab(?C'first')cd(?C"second")ef/
 | |
|          data> abcdefg
 | |
|          Callout (7): 'first'
 | |
|          --->abcdefg
 | |
|              ^ ^         c
 | |
|          Callout (20): "second"
 | |
|          --->abcdefg
 | |
|              ^   ^       e
 | |
|           0: abcdef
 | |
| 
 | |
| 
 | |
| NON-PRINTING CHARACTERS
 | |
| 
 | |
|        When pcre2test is outputting text in the compiled version of a pattern,
 | |
|        bytes  other  than 32-126 are always treated as non-printing characters
 | |
|        and are therefore shown as hex escapes.
 | |
| 
 | |
|        When pcre2test is outputting text that is a matched part of  a  subject
 | |
|        string,  it behaves in the same way, unless a different locale has been
 | |
|        set for the pattern (using the /locale modifier).  In  this  case,  the
 | |
|        isprint()  function  is  used  to distinguish printing and non-printing
 | |
|        characters.
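
The default rule can be sketched in Python as follows. The exact spelling of the escape (lower-case, two hex digits) is an assumption, show_bytes is a hypothetical helper, and the locale-dependent isprint() case is not modelled.

```python
def show_bytes(data):
    """Render bytes 32-126 as themselves and all other bytes as hex escapes."""
    return "".join(
        chr(b) if 32 <= b <= 126 else "\\x%02x" % b
        for b in data
    )

print(show_bytes(b"tab\there\x80"))
```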
 | |
| 
 | |
| 
 | |
| SAVING AND RESTORING COMPILED PATTERNS
 | |
| 
 | |
|        It is possible to save compiled patterns  on  disc  or  elsewhere,  and
 | |
|        reload them later, subject to a number of restrictions. JIT data cannot
 | |
|        be saved. The host on which the patterns are reloaded must  be  running
 | |
|        the same version of PCRE2, with the same code unit width, and must also
 | |
|        have the same endianness, pointer width  and  PCRE2_SIZE  type.  Before
 | |
|        compiled  patterns  can be saved they must be serialized, that is, con-
 | |
|        verted to a stream of bytes. A single byte stream may contain any  num-
 | |
|        ber  of  compiled  patterns,  but  they must all use the same character
 | |
|        tables. A single copy of the tables is included in the byte stream (its
 | |
|        size is 1088 bytes).
 | |
| 
 | |
|        The  functions  whose  names  begin  with pcre2_serialize_ are used for
 | |
|        serializing and de-serializing. They are described in the  pcre2serial-
 | |
|        ize  documentation.  In  this  section  we  describe  the  features  of
 | |
|        pcre2test that can be used to test these functions.
 | |
| 
 | |
|        When a pattern with the push modifier is  successfully  compiled,  it  is
 | |
|        pushed  onto  a  stack  of compiled patterns, and pcre2test expects the
 | |
|        next line to contain a new pattern (or command) instead  of  a  subject
 | |
|        line. By this means, a number of patterns can be compiled and retained.
 | |
|        The push modifier is incompatible with  posix,  and  control  modifiers
 | |
|        that act at match time are ignored (with a message). The jitverify mod-
 | |
|        ifier applies only at compile time. The command
 | |
| 
 | |
|          #save <filename>
 | |
| 
 | |
|        causes all the stacked patterns to be serialized and the result written
 | |
|        to  the named file. Afterwards, all the stacked patterns are freed. The
 | |
|        command
 | |
| 
 | |
|          #load <filename>
 | |
| 
 | |
|        reads the data in the file, and then arranges for it to  be  de-serial-
 | |
|        ized,  with the resulting compiled patterns added to the pattern stack.
 | |
|        The pattern on the top of the stack can be retrieved by the  #pop  com-
 | |
|        mand,  which  must  be  followed  by  lines  of subjects that are to be
 | |
|        matched with the pattern, terminated as usual by an empty line  or  end
 | |
|        of  file.  This  command  may be followed by a modifier list containing
 | |
|        only control modifiers that act after a pattern has been  compiled.  In
 | |
|        particular,  hex,  posix, and push are not allowed, nor are any option-
 | |
|        setting modifiers. The JIT modifiers are, however, permitted.  Here  is
 | |
|        an example that saves and reloads two patterns.
 | |
| 
 | |
|          /abc/push
 | |
|          /xyz/push
 | |
|          #save tempfile
 | |
|          #load tempfile
 | |
|          #pop info
 | |
|          xyz
 | |
| 
 | |
|          #pop jit,bincode
 | |
|          abc
 | |
| 
 | |
|        If  jitverify  is  used with #pop, it does not automatically imply jit,
 | |
|        which is different behaviour from when it is used on a pattern.
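
The stack discipline behind push, #save, #load and #pop can be shown with a toy Python model. It illustrates only the ordering: pattern strings stand in for compiled patterns, JSON stands in for the serialized byte stream, and the functions are hypothetical, not the pcre2_serialize_ functions.

```python
import json
import os
import tempfile

stack = []  # stand-in for pcre2test's stack of compiled patterns

def push(pattern):
    stack.append(pattern)

def save(filename):
    """Write every stacked pattern to the file, then free the stack."""
    with open(filename, "w") as f:
        json.dump(stack, f)
    stack.clear()

def load(filename):
    """Read the file and add its patterns to the stack."""
    with open(filename) as f:
        stack.extend(json.load(f))

def pop():
    """Retrieve the pattern on the top of the stack."""
    return stack.pop()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tempfile")
    push("abc")
    push("xyz")
    save(path)
    load(path)
    first, second = pop(), pop()

print(first, second)
```

Because the most recently pushed pattern is on top, the first #pop in the example above is followed by the subject xyz and the second by abc.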
 | |
| 
 | |
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcre2(3),  pcre2api(3), pcre2callout(3), pcre2jit(3), pcre2matching(3),
 | |
|        pcre2partial(3), pcre2pattern(3), pcre2serialize(3).
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 20 May 2015
 | |
|        Copyright (c) 1997-2015 University of Cambridge.
 | 
