PCRE2TEST(1)                General Commands Manual               PCRE2TEST(1)


NAME
       pcre2test - a program for testing Perl-compatible regular expressions.

SYNOPSIS

       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression libraries,
       but it can also be used for experimenting with regular expressions.
       This document describes the features of the test program; for details
       of the regular expressions themselves, see the pcre2pattern
       documentation. For details of the PCRE2 library function calls and
       their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output shows
       the result of each match attempt. Modifiers on external or internal
       command lines, the patterns, and the subject lines specify PCRE2
       function options, control how the subject is processed, and what output
       is produced.

       As the original fairly simple PCRE library evolved, it acquired many
       different features, and as a result, the original pcretest program
       ended up with a lot of options in a messy, arcane syntax, for testing
       all the features. The move to the new PCRE2 API provided an opportunity
       to re-implement the test program as pcre2test, with a cleaner modifier
       syntax. Nevertheless, there are still many obscure modifiers, some of
       which are specifically designed for use in conjunction with the test
       script and data files that are distributed as part of PCRE2. All the
       modifiers are documented here, some without much justification, but
       many of them are unlikely to be of use except when testing the
       libraries.


PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units. One, two, or all three of these libraries may be simultaneously
       installed. The pcre2test program can be used to test all the libraries.
       However, its own input and output are always in 8-bit format. When
       testing the 16-bit or 32-bit libraries, patterns and subject strings
       are converted to 16- or 32-bit format before being passed to the
       library functions. Results are converted back to 8-bit code units for
       output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile(). The
       actual names used in the libraries have a suffix _8, _16, or _32, as
       appropriate.


INPUT ENCODING

       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline library (see below).
       The input is processed using C's string functions, so must not contain
       binary zeroes, even though in Unix-like environments, fgets() treats
       any bytes other than newline as data characters. In some Windows
       environments character 26 (hex 1A) causes an immediate end of file, and
       no further data is read.

       For maximum portability, therefore, it is safest to avoid non-printing
       characters in pcre2test input files. There is a facility for specifying
       a pattern's characters as hexadecimal pairs, thus making it possible to
       include binary zeroes in a pattern for testing purposes. Subject lines
       are processed for backslash escapes, which makes it possible to include
       any data value.


COMMAND LINE OPTIONS

       -8        If the 8-bit library has been built, this option causes it to
                 be used (this is the default). If the 8-bit library has not
                 been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If only the 16-bit library has been built, this
                 is the default. If the 16-bit library has not been built,
                 this option causes an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If only the 32-bit library has been built, this
                 is the default. If the 32-bit library has not been built,
                 this option causes an error.

       -b        Behave as if each pattern has the /fullbincode modifier; the
                 full internal binary form of the pattern is output after
                 compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored.

       -C option Output information about a specific build-time option, then
                 exit. This functionality is intended for use in scripts such
                 as RunTest. The following options output the value and set
                 the exit code as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, or ANY
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   ebcdic     compiled for an EBCDIC environment
                   jit        just-in-time support is available
                   pcre2-16   the 16-bit library was built
                   pcre2-32   the 32-bit library was built
                   pcre2-8    the 8-bit library was built
                   unicode    Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern are
                 output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier; matching
                 is done using the pcre2_dfa_match() function instead of the
                 default pcre2_match().

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the /info modifier; information
                 about the compiled pattern is given after compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -pattern modifier-list
                 Behave as if each pattern line contains the given modifiers.

       -q        Do not output the version number of pcre2test at the start of
                 execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size megabytes.

       -subject modifier-list
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match. When JIT is
                 used, separate times are given for the initial compile and
                 the JIT compile. You can control the number of iterations
                 that are used for timing by following -t with a number (as a
                 separate item on the command line). For example, "-t 1000"
                 iterates 1000 times. The default is to iterate 500,000 times.

       -tm       This is like -t except that it times only the matching phase,
                 not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end of
                 a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.


DESCRIPTION

       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken from
       the standard input. If pcre2test is given only one argument, it reads
       from that file and writes to stdout. Otherwise, it reads from stdin and
       writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this is
       done, if the input is from a terminal, it is read using the readline()
       function. This provides line-editing and history facilities. The output
       from the -help option states whether or not readline() will be used.

       The program handles any number of tests, each of which consists of a
       set of input lines. Each set starts with a regular expression pattern,
       followed by any number of subject lines to be matched against that
       pattern. In between sets of test data, command lines that begin with #
       may appear. This file format, with some restrictions, can also be
       processed by the perltest.sh script that is distributed with PCRE2 as a
       means of checking that the behaviour of PCRE2 and Perl is the same.

       When the input is a terminal, pcre2test prompts for each line of input,
       using "re>" to prompt for regular expression patterns, and "data>" to
       prompt for subject lines. Command lines starting with # can be entered
       only in response to the "re>" prompt.

       Each subject line is matched separately and independently. If you want
       to do multi-line matches, you have to use the \n escape sequence (or \r
       or \r\n, etc., depending on the newline setting) in a single line of
       input to encode the newline sequences. There is no limit on the length
       of subject lines; the input buffer is automatically extended if it is
       too small. There is a replication feature that makes it possible to
       generate long subject lines without having to supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.


COMMAND LINES

       In between sets of test data, a line that begins with # is interpreted
       as a command line. If the first character is followed by white space or
       an exclamation mark, the line is treated as a comment, and ignored.
       Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
       patterns. This command also forces an error if a subsequent pattern
       contains any occurrences of \P, \p, or \X, which are still supported
       when PCRE2_UTF is not set, but which require Unicode property support
       to be included in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that are
       used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a file,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these settings.

         #perltest

       The appearance of this line causes all subsequent modifier settings to
       be checked for compatibility with the perltest.sh script, which is used
       to confirm that Perl gives the same results as PCRE2. Also, apart from
       comment lines, none of the other command lines are permitted, because
       they and many of the modifiers are specific to pcre2test, and should
       not be used in test files that are also processed by perltest.sh. The
       #perltest command helps detect tests that are accidentally put in the
       wrong file.

         #pop [<modifiers>]

       This command is used to manipulate the stack of compiled patterns, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change these
       settings.


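       The rule that separates comments from commands can be illustrated with
       a short Python sketch. It is not taken from pcre2test itself: the
       function name classify_hash_line and the tuples it returns are
       hypothetical, and only the command names listed in this section are
       assumed.

```python
# Hypothetical helper that mirrors the documented rule; this is an
# illustration, not pcre2test's own code.
KNOWN_COMMANDS = {"forbid_utf", "load", "pattern", "perltest", "pop",
                  "save", "subject"}


def classify_hash_line(line):
    """Classify a line that begins with '#' as a comment or a command."""
    if not line.startswith("#"):
        raise ValueError("not a command line")
    rest = line[1:]
    # '#' followed by white space or an exclamation mark is a comment.
    if rest[:1].isspace() or rest.startswith("!"):
        return ("comment", None, None)
    # Otherwise the first word names the command; the remainder, if any,
    # is its argument (a filename or a modifier list).
    parts = rest.split(None, 1)
    if not parts or parts[0] not in KNOWN_COMMANDS:
        return ("unknown", parts[0] if parts else "", None)
    argument = parts[1].strip() if len(parts) > 1 else None
    return ("command", parts[0], argument)
```

       For example, classify_hash_line("#pattern caseless,utf") yields the
       command name "pattern" with the argument "caseless,utf", whereas
       "# note" and "#! note" are both classified as comments.
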
MODIFIER SYNTAX

       Modifier lists are used with both pattern and subject lines. Items in a
       list are separated by commas and optional white space. Some modifiers
       may be given for both patterns and subject lines, whereas others are
       valid for one or the other only. Each modifier has a long name, for
       example "anchored", and some of them must be followed by an equals sign
       and a value, for example, "offset=12". Modifiers that do not take
       values may be preceded by a minus sign to turn off a previous setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless". In documentation, following
       the Perl convention, these are written with a slash ("the /i modifier")
       for clarity. Abbreviated modifiers must all be concatenated in the
       first item of a modifier list. If the first item is not recognized as a
       long modifier name, it is interpreted as a sequence of these
       abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.


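       The following Python sketch shows one way the rules in this section
       could be applied to a modifier list. It is a simplified illustration
       under stated assumptions: the function parse_modifier_list, the small
       LONG_NAMES subset, and the mapping of "g" to "global" are chosen for
       the example and do not reproduce pcre2test's full modifier table.

```python
# Illustrative subset only; pcre2test recognizes many more modifiers.
LONG_NAMES = {"anchored", "caseless", "dotall", "extended", "multiline",
              "global", "newline", "jit", "offset", "utf", "info", "bincode"}
ABBREVIATIONS = {"i": "caseless", "s": "dotall", "x": "extended",
                 "m": "multiline", "g": "global", "B": "bincode", "I": "info"}


def parse_modifier_list(text):
    """Return a list of (name, value) pairs.

    value is the string after '=', True for a plain modifier, or False
    for a modifier that is turned off with a leading minus sign.
    """
    items = [item.strip() for item in text.split(",") if item.strip()]
    result = []
    for index, item in enumerate(items):
        name, has_value, value = item.partition("=")
        enabled = not name.startswith("-")
        bare = name if enabled else name[1:]
        if bare in LONG_NAMES:
            result.append((bare, value if has_value else enabled))
        elif index == 0 and all(ch in ABBREVIATIONS for ch in item):
            # First item is not a long name: treat it as concatenated
            # single-letter abbreviations.
            result.extend((ABBREVIATIONS[ch], True) for ch in item)
        else:
            raise ValueError(f"unrecognized modifier: {item!r}")
    return result
```

       Applied to the list from the example above, "ig,newline=cr,jit=3", the
       sketch returns caseless and global as enabled, followed by newline
       with the value "cr" and jit with the value "3".
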
PATTERN SYNTAX

       A pattern line must start with one of the following characters (common
       symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it. It is possible to include the
       delimiter within the pattern by escaping it with a backslash, for
       example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the pattern,
       but since the delimiters are all non-alphanumeric, this does not affect
       its interpretation. If the terminating delimiter is immediately
       followed by a backslash, for example,

         /abc/\

       then a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with "abc/",
       causing pcre2test to read the next line as a continuation of the
       regular expression.

       A pattern can be followed by a modifier list (details below).


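       The three delimiter cases described in this section can be traced with
       a small Python sketch. It handles a single input line only and is an
       approximation for illustration: the function split_pattern_line and
       its return convention are hypothetical, not part of pcre2test.

```python
# The delimiter characters listed in this section.
DELIMITERS = set("/!\"'`-=_:;,%&@~")


def split_pattern_line(line):
    """Split one pattern line into (pattern, modifier_text).

    Returns None when no closing delimiter is found, which means the
    pattern continues on the next input line.
    """
    delimiter = line[0]
    if delimiter not in DELIMITERS:
        raise ValueError("invalid pattern delimiter")
    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            # An escaped character (including an escaped delimiter) stays
            # in the pattern together with its backslash.
            i += 2
            continue
        if line[i] == delimiter:
            pattern, rest = line[1:i], line[i + 1:]
            if rest.startswith("\\"):
                # Closing delimiter immediately followed by a backslash:
                # the backslash is added to the end of the pattern.
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None
```

       With this sketch, the line /abc\/def/ yields the pattern abc\/def, the
       line /abc/\ yields the pattern abc\ (ending in a backslash), and the
       line /abc\/ returns None because the delimiter was escaped and the
       pattern is still open.
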
SUBJECT LINE SYNTAX

       Before each subject line is passed to pcre2_match() or
       pcre2_dfa_match(), leading and trailing white space is removed, and the
       line is scanned for backslash escapes. The following provide a means of
       encoding non-printing characters in a visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier on
       the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value is
       greater than 127. When testing the 8-bit library not in UTF-8 mode,
       \x{hh} generates one byte for values less than 256, and causes an error
       for greater values.

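       The difference between \xhh and \x{hh...} for the 8-bit library can be
       made concrete with a Python sketch. The function encode_x_escape and
       its parameters are hypothetical and exist only for this illustration;
       Python's own UTF-8 encoder also rejects surrogate values, a case the
       sketch does not try to model.

```python
def encode_x_escape(digits, braced, utf8_mode):
    # Return the bytes contributed to an 8-bit subject string by \xhh
    # (braced=False) or \x{hh...} (braced=True).
    value = int(digits, 16)
    if not braced:
        if len(digits) > 2:
            raise ValueError("\\xhh takes at most 2 hex digits")
        # \xhh is always exactly one byte, even in UTF-8 mode.
        return bytes([value])
    if utf8_mode:
        # \x{hh...} is a character: values above 127 need several bytes.
        return chr(value).encode("utf-8")
    if value > 255:
        raise ValueError("value greater than 255 in non-UTF 8-bit mode")
    return bytes([value])
```

       Under these assumptions, \xe9 always produces the single byte E9 (an
       invalid UTF-8 sequence on its own), whereas \x{e9} produces the two
       bytes C3 A9 in UTF-8 mode and the single byte E9 otherwise.
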
       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
       possible to construct invalid UTF-16 sequences for testing purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.

       There is a special backslash sequence that specifies replication of one
       or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support nesting.
       To include a closing square bracket in the characters, code it as \x5D.

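       As a rough model of the replication sequence, the following Python
       sketch expands \[<characters>]{<count>} with a regular expression. The
       function name expand_replication is hypothetical; the sketch performs
       only the replication step and leaves every other backslash escape
       untouched.

```python
import re

# Backslash, '[', any characters other than ']', then ']{count}'.
# Nesting is not supported, matching the documented restriction.
_REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(subject):
    """Expand every \\[chars]{count} sequence in a subject line."""
    return _REPLICATION.sub(
        lambda match: match.group(1) * int(match.group(2)), subject)
```

       Calling expand_replication(r"\[abc]{4}") gives "abcabcabcabc", the
       same result as in the example above.
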
       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes an
       error. However, if the very last character in the line is a backslash
       (and there is no modifier list), it is ignored. This gives a way of
       passing an empty line as data, since a real empty line terminates the
       data input.


PATTERN MODIFIERS

       There are three types of modifier that can appear in pattern lines, two
       of which may also be used in a #pattern command. A pattern's modifier
       list can add to or override default modifiers that were set by a
       previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile(). The most
       common ones have single-letter abbreviations. See pcre2api for a
       description of their effects.

             allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
             alt_bsux                  set PCRE2_ALT_BSUX
             alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
             anchored                  set PCRE2_ANCHORED
             auto_callout              set PCRE2_AUTO_CALLOUT
         /i  caseless                  set PCRE2_CASELESS
             dollar_endonly            set PCRE2_DOLLAR_ENDONLY
         /s  dotall                    set PCRE2_DOTALL
             dupnames                  set PCRE2_DUPNAMES
         /x  extended                  set PCRE2_EXTENDED
             firstline                 set PCRE2_FIRSTLINE
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
         /m  multiline                 set PCRE2_MULTILINE
             never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
             never_ucp                 set PCRE2_NEVER_UCP
             never_utf                 set PCRE2_NEVER_UTF
             no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
             no_auto_possess           set PCRE2_NO_AUTO_POSSESS
             no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
             no_start_optimize         set PCRE2_NO_START_OPTIMIZE
             no_utf_check              set PCRE2_NO_UTF_CHECK
             ucp                       set PCRE2_UCP
             ungreedy                  set PCRE2_UNGREEDY
             utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes all
       non-printing characters in output strings to be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
       without the curly brackets.

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern:

             bsr=[anycrlf|unicode]     specify \R handling
         /B  bincode                   show binary code without lengths
             callout_info              show callout information
             debug                     same as info,fullbincode
             fullbincode               show binary code with lengths
         /I  info                      show info about compiled pattern
             hex                       pattern is coded in hexadecimal
             jit[=<number>]            use JIT
             jitfast                   use JIT fast path
             jitverify                 verify JIT use
             locale=<name>             use this locale
             memory                    show memory used
             newline=<type>            set newline type
             parens_nest_limit=<n>     set maximum parentheses depth
             posix                     use the POSIX API
             push                      push compiled pattern onto the stack
             stackguard=<number>       test the stackguard feature
             tables=[0|1|2]            select internal tables

       The effects of these modifiers are described in the following sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match. If it is
       set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
       "unicode", \R matches any Unicode newline sequence. The default is
       specified when PCRE2 is built; unless changed at build time, it is
       Unicode.

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must be
       one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case).

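       To make the two bsr settings concrete, the Python sketch below models
       what \R accepts in each case. The set of Unicode newline sequences
       (CRLF, LF, VT, FF, CR, NEL, LS, PS) is taken from the pcre2pattern
       documentation rather than from this page, and the function name
       matches_backslash_r is hypothetical.

```python
import re

# bsr=anycrlf: CR, LF, or CRLF only.
BSR_ANYCRLF = re.compile("\r\n|\n|\r")
# bsr=unicode: CRLF, or any single Unicode newline character
# (LF, VT, FF, CR, NEL U+0085, LS U+2028, PS U+2029).
BSR_UNICODE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def matches_backslash_r(text, bsr="unicode"):
    """Report whether text as a whole is something \\R would match."""
    pattern = BSR_ANYCRLF if bsr == "anycrlf" else BSR_UNICODE
    return pattern.fullmatch(text) is not None
```

       In this model a vertical tab is matched under bsr=unicode but not
       under bsr=anycrlf, while CRLF is matched as a single unit under both.
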
   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting all
       available information.

       The bincode modifier causes a representation of the compiled code to be
       output after compilation. This information does not contain length and
       offset values, which ensures that the same output is generated for
       different internal link sizes and different code unit widths. By using
       bincode, the same regression tests can be used in different
       environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for specific
       code unit widths and link sizes, and is also useful for one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function. Here
       are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capturing subpattern count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capturing subpattern count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern. If both
       sets of options are the same, just a single "options" line is output;
       if there are no options, the line is omitted. "First code unit" is
       where any match must start; if there is more than one they are listed
       as "starting code units". "Last code unit" is the last literal code
       unit that must be present in any match. This is not necessarily the
       last character. These lines are omitted if no starting or ending code
       units are recorded.

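       The rule for the option lines can be modelled in a few lines of Python.
       This is only an illustrative sketch, not pcre2test's own code: the
       function name option_lines and the exact "Options:" label used for the
       single-line case are assumptions.

```python
def option_lines(compile_opts, overall_opts):
    """Model of the option lines in an info listing (illustrative only)."""
    def fmt(opts):
        # An empty set is shown as "<none>" when the two sets differ.
        return " ".join(opts) if opts else "<none>"

    if compile_opts == overall_opts:
        # Identical sets: one line, or nothing at all when both are empty.
        # The "Options:" label is an assumption made for this sketch.
        return [f"Options: {fmt(compile_opts)}"] if compile_opts else []
    return [
        f"Compile options: {fmt(compile_opts)}",
        f"Overall options: {fmt(overall_opts)}",
    ]
```

       With an empty compile set and an overall set of ["caseless"], the
       sketch yields the two option lines shown for /(?i)abc/info.
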
       The callout_info modifier requests information about all the callouts
       in the pattern. A list of them is output at the end of any other
       information that is requested. For each callout, either its number or
       string is given, followed by the item that follows it in the pattern.

   Specifying a pattern in hex

       The hex modifier specifies that the characters of the pattern are to be
       interpreted as pairs of hexadecimal digits. White space is permitted
       between pairs. For example:

         /ab 32 59/hex

       This feature is provided as a way of creating patterns that contain
       binary zero and other non-printing characters. By default, pcre2test
       passes patterns as zero-terminated strings to pcre2_compile(), giving
       the length as PCRE2_ZERO_TERMINATED. However, for patterns specified in
       hexadecimal, the actual length of the pattern is passed.

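       As a rough illustration of the decoding described above, the following
       Python sketch turns the text of a hex pattern into bytes. The helper
       name decode_hex_pattern is hypothetical, and the error handling is an
       assumption; pcre2test's own diagnostics may differ.

```python
def decode_hex_pattern(text):
    """Decode pairs of hex digits, optionally separated by white space."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1                      # white space is allowed between pairs
            continue
        pair = text[i:i + 2]
        if len(pair) != 2 or any(c not in "0123456789abcdefABCDEF" for c in pair):
            raise ValueError(f"expected two hex digits at offset {i}")
        out.append(int(pair, 16))
        i += 2
    return bytes(out)
```

       Decoding "61 00 62" gives three bytes with a binary zero in the middle,
       which is why an explicit length is needed instead of zero termination.
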
   JIT compilation

       The /jit modifier may optionally be followed by an equals sign and a
       number in the range 0 to 7:

         0  disable JIT
         1  use JIT for normal match only
         2  use JIT for soft partial match only
         3  use JIT for normal match and soft partial match
         4  use JIT for hard partial match only
         5  use JIT for normal match and hard partial match
         6  use JIT for soft and hard partial match
         7  all three modes

       If no number is given, 7 is assumed. If JIT compilation is successful,
       the compiled JIT code will automatically be used when pcre2_match() is
       run for the appropriate type of match, except when incompatible
       run-time options are specified. For more details, see the pcre2jit
       documentation. See also the jitstack modifier below for a way of
       setting the size of the JIT stack.

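       The numbers accepted by the jit modifier behave like a three-bit mask:
       1 selects normal matching, 2 soft partial matching, and 4 hard partial
       matching. The Python sketch below decodes a value on that basis; the
       constant and function names are chosen for this illustration only.

```python
JIT_NORMAL, JIT_PARTIAL_SOFT, JIT_PARTIAL_HARD = 1, 2, 4

def jit_modes(value=7):
    """Return the match types selected by a jit=<value> setting."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    names = [
        (JIT_NORMAL, "normal"),
        (JIT_PARTIAL_SOFT, "soft partial"),
        (JIT_PARTIAL_HARD, "hard partial"),
    ]
    # Each bit that is set enables JIT for one kind of match.
    return [name for bit, name in names if value & bit]
```
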
       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the compiled
       pattern shows whether JIT compilation was or was not successful. If
       jitverify is specified without jit, jit=7 is assumed. If JIT
       compilation is successful when jitverify is set, the text "(JIT)" is
       added to the first output line after a match or non-match when
       JIT-compiled code was actually used in the match.

   Setting a locale

       The /locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set of
       character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same tables
       are used when matching the following subject lines. The /locale
       modifier applies only to the pattern on which it appears, but can be
       given in a #pattern command if a default is needed. Setting a locale
       and alternate character tables are mutually exclusive.

   Showing pattern memory

       The /memory modifier causes the size in bytes of the memory used to
       hold the compiled pattern to be output. This does not include the size
       of the pcre2_code block; it is just the actual compiled data. If the
       pattern is subsequently passed to the JIT compiler, the size of the JIT
       compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910

   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern. Breaching the limit causes a compilation
       error. The default for the library is set when PCRE2 is built, but
       pcre2test sets its own default of 220, which is required for running
       the standard test suite.

   Using the POSIX wrapper API

       The /posix modifier causes pcre2test to call PCRE2 via the POSIX
       wrapper API rather than its native API. This supports only the 8-bit
       library. When the POSIX API is being used, the following pattern
       modifiers set options for the regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         no_auto_capture    REG_NOSUB
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The aftertext and allaftertext subject modifiers work as described
       below. All other modifiers cause an error.

   Testing the stack guard feature

       The /stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard() is
       called to set up a callback from pcre2_compile() to a local function.
       The argument it receives is the current nesting parenthesis depth; if
       this is greater than the value given by the modifier, non-zero is
       returned, causing the compilation to be aborted.

   Using alternative character tables

       The value specified for the /tables modifier must be one of the digits
       0, 1, or 2. It causes a specific set of built-in character tables to be
       passed to pcre2_compile(). This is used in the PCRE2 tests to check
       behaviour with different character tables. The digit specifies the
       tables as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters

       In table 2, some characters whose codes are greater than 128 are
       identified as letters, digits, spaces, etc. Setting alternate character
       tables and a locale are mutually exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are described
       below. However, they may be included in a pattern's modifier list, in
       which case they are applied to every subject line that is processed
       with that pattern. They do not affect the compilation process.

             aftertext           show text after match
             allaftertext        show text after captures
             allcaptures         show all captures
             allusedtext         show all consulted text
         /g  global              global matching
             mark                show mark values
             replace=<string>    specify a replacement string
             startchar           show starting character when relevant

       These modifiers may not appear in a #pattern command. If you want them
       as defaults, set them in a #subject command.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it is
       pushed onto a stack of compiled patterns, and pcre2test expects the
       next line to contain a new pattern (or a command) instead of a subject
       line. This facility is used when saving compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below. The push modifier is incompatible with pattern
       modifiers such as global that act at match time. Any that are specified
       are ignored, with a warning message, except for replace, which causes
       an error. Note that jitverify, which is allowed, does not carry through
       to any subsequent matching that uses this pattern.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject command
       are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See pcre2api for a description of their effects.

             anchored                  set PCRE2_ANCHORED
             dfa_restart               set PCRE2_DFA_RESTART
             dfa_shortest              set PCRE2_DFA_SHORTEST
             no_utf_check              set PCRE2_NO_UTF_CHECK
             notbol                    set PCRE2_NOTBOL
             notempty                  set PCRE2_NOTEMPTY
             notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
             noteol                    set PCRE2_NOTEOL
             partial_hard (or ph)      set PCRE2_PARTIAL_HARD
             partial_soft (or ps)      set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations because
       they appear frequently in tests.

       If the /posix modifier was present on the pattern, causing the POSIX
       wrapper API to be used, the only option-setting modifiers that have any
       effect are notbol, notempty, and noteol, causing REG_NOTBOL,
       REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
       Any other modifiers cause an error.

   Setting match controls

       The following modifiers affect the matching process or request
       additional information. Some of them may also be specified on a pattern
       line (see above), in which case they apply to every subject line that
       is matched against that pattern.

             aftertext                 show text after match
             allaftertext              show text after captures
             allcaptures               show all captures
             allusedtext               show all consulted text (non-JIT only)
             altglobal                 alternative global matching
             callout_capture           show captures at callout time
             callout_data=<n>          set a value to pass via callouts
             callout_fail=<n>[:<m>]    control callout failure
             callout_none              do not supply a callout function
             copy=<number or name>     copy captured substring
             dfa                       use pcre2_dfa_match()
             find_limits               find match and recursion limits
             get=<number or name>      extract captured substring
             getall                    extract all captured substrings
         /g  global                    global matching
             jitstack=<n>              set size of JIT stack
             mark                      show mark values
             match_limit=<n>           set a match limit
             memory                    show memory usage
             offset=<n>                set starting offset
             ovector=<n>               set size of output vector
             recursion_limit=<n>       set a recursion limit
             replace=<string>          specify a replacement string
             startchar                 show startchar when relevant
             zero_terminate            pass the subject as zero-terminated

       The effects of these modifiers are described in the following sections.

   Showing more text

       The aftertext modifier requests that as well as outputting the part of
       the subject string that matched the entire pattern, pcre2test should in
       addition output the remainder of the subject string. This is useful for
       tests where the subject contains multiple copies of the same substring.
       The allaftertext modifier requests the same action for captured
       substrings as well as the main matched substring. In each case the
       remainder is output on the following line with a plus character
       following the capture number.

       The allusedtext modifier requests that all the text that was consulted
       during a successful pattern match by the interpreter should be shown.
       This feature is not supported for JIT matching, and if requested with
       JIT it is ignored (with a warning message). Setting this modifier
       affects the output if there is a lookbehind at the start of a match, or
       a lookahead at the end, or if \K is used in the pattern. Characters
       that precede or follow the start and end of the actual match are
       indicated in the output by '<' or '>' characters underneath them. Here
       is an example:

           re> /(?<=pqr)abc(?=xyz)/
         data> 123pqrabcxyz456\=allusedtext
          0: pqrabcxyz
             <<<   >>>

       This shows that the matched string is "abc", with the preceding and
       following strings "pqr" and "xyz" having been consulted during the
       match (when processing the assertions).

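       The layout of the allusedtext output can be reproduced with a short
       Python sketch. It assumes that four offsets are already known: where
       the consulted text starts and ends, and where the match itself starts
       and ends. The function and parameter names are hypothetical, and the
       sketch models only the formatting, not how the offsets are obtained.

```python
def allusedtext_lines(subject, used_start, match_start, match_end, used_end):
    """Return the text line and the marker line for the whole match."""
    prefix = " 0: "
    shown = subject[used_start:used_end]
    # '<' under consulted text before the match, '>' under text after it,
    # and spaces under the matched string itself.
    marks = ("<" * (match_start - used_start)
             + " " * (match_end - match_start)
             + ">" * (used_end - match_end))
    return prefix + shown, " " * len(prefix) + marks
```

       For the subject 123pqrabcxyz456 with offsets 3, 6, 9 and 12, the two
       returned lines are the same as the last two lines of the example.
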
       The startchar modifier requests that the starting character for the
       match be indicated, if it is different to the start of the matched
       string. The only time when this occurs is when \K has been processed as
       part of the match. In this situation, the output for the matched string
       is displayed from the starting character instead of from the match
       point, with circumflex characters under the earlier characters. For
       example:

           re> /abc\Kxyz/
         data> abcxyz\=startchar
          0: abcxyz
             ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT.
       However, these two modifiers are mutually exclusive.

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential
       captured parentheses be output after a match. By default, only those up
       to the highest one actually used in the match are output (corresponding
       to the return code from pcre2_match()). Groups that did not take part
       in the match are output as "<unset>".

   Testing callouts

       A callout function is supplied when pcre2test calls the library
       matching functions, unless callout_none is specified. If
       callout_capture is set, the current captured groups are output when a
       callout occurs.

       The callout_fail modifier can be given one or two numbers. If there is
       only one number, 1 is returned instead of 0 when a callout of that
       number is reached. If two numbers are given, 1 is returned when callout
       <n> is reached for the <m>th time. Note that callouts with string
       arguments are always given the number zero. See "Callouts" below for a
       description of the output when a callout is taken.

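       The effect of callout_fail on the return value can be sketched as a
       small Python closure. This is a model, not pcre2test's callout
       function. One detail is an assumption: the sketch keeps returning 1 on
       every hit from the <m>th onwards, which makes the one-number form the
       same as <m> = 1; the text above specifies only the <m>th hit itself.

```python
def make_callout(fail_number=None, fail_count=1):
    """Build a callout that returns 1 (fail) as callout_fail=<n>[:<m>] asks."""
    hits = 0

    def callout(number):
        nonlocal hits
        if fail_number is None or number != fail_number:
            return 0                    # not the selected callout: carry on
        hits += 1
        # Assumption: fail on the m-th hit and on any later hit.
        return 1 if hits >= fail_count else 0

    return callout
```

       For example, make_callout(2, 3) returns 0 for the first two callouts
       numbered 2 and 1 for the third, and returns 0 for every callout with a
       different number.
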
       The callout_data modifier can be given an unsigned or a negative
       number. This is set as the "user data" that is passed to the matching
       function, and passed back when the callout function is invoked. Any
       value other than zero is used as a return from pcre2test's callout
       function.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested by
       the global or altglobal modifier. After finding a match, the matching
       function is called again to search the remainder of the subject. The
       difference between global and altglobal is that the former uses the
       start_offset argument to pcre2_match() or pcre2_dfa_match() to start
       searching at a new point within the entire string (which is what Perl
       does), whereas the latter passes over a shortened subject. This makes a
       difference to the matching process if the pattern begins with a
       lookbehind assertion (including \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
       for another, non-empty, match at the same point in the subject. If this
       match fails, the start offset is advanced, and the normal match is
       retried. This imitates the way Perl handles such cases when using the
       /g modifier or the split() function. Normally, the start offset is
       advanced by one character, but if the newline convention recognizes
       CRLF as a newline, and the current character is CR followed by LF, an
       advance of two characters occurs.

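       The empty-match handling described above can be sketched in Python.
       Because Python's re module has no equivalent of PCRE2_NOTEMPTY_ATSTART,
       the sketch takes a matcher function as a parameter and supplies a
       hand-written toy matcher for the pattern a*. Both function names and
       the matcher interface are assumptions made for this illustration.

```python
def match_a_star(subject, start, anchored=False, notempty_atstart=False):
    """Toy stand-in for a matcher compiled from /a*/ (hypothetical helper)."""
    pos = start
    while pos <= len(subject):
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        # An empty match at the starting offset is rejected when requested.
        rejected = notempty_atstart and pos == start and end == pos
        if not rejected:
            return pos, end
        if anchored:
            return None                 # may not move on from the start
        pos += 1
    return None


def global_matches(match, subject, crlf_is_newline=False):
    """Collect (start, end) pairs the way the global loop is described."""
    results = []
    offset = 0
    retry = False                       # True just after an empty match
    while offset <= len(subject):
        found = match(subject, offset, anchored=retry, notempty_atstart=retry)
        if found is None:
            if not retry:
                break                   # ordinary failure: no more matches
            # The non-empty retry failed: advance and match normally again.
            two = crlf_is_newline and subject[offset:offset + 2] == "\r\n"
            offset += 2 if two else 1
            retry = False
            continue
        results.append(found)
        offset = found[1]
        retry = found[0] == found[1]
    return results
```

       Applied to the subject "baab", the loop finds an empty match at offset
       0, "aa" at offsets 1 to 3, and empty matches at offsets 3 and 4.
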
   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a group name or
       number, for example:

          abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get lists,
       these can be unset by specifying a negative number to cancel all
       numbered groups and an empty name to cancel all named groups.

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings extracted
       by the convenience functions are output with C, G, or L after the
       string number instead of a colon. This is in addition to the normal
       full list. The string length (that is, the return from the extraction
       function) is given in parentheses after each substring, followed by the
       name when the extraction was by name.

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions. Unlike subject
       strings, pcre2test does not process replacement strings for escape
       sequences. In UTF mode, a replacement string is checked to see if it is
       a valid UTF-8 string. If so, it is correctly converted to a UTF string
       of the appropriate code unit width. If it is not a valid UTF-8 string,
       the individual code units are copied directly. This provides a means of
       passing an invalid UTF-8 string for testing purposes.

       If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
       pcre2_substitute(). After a successful substitution, the modified
       string is output, preceded by the number of replacements. This may be
       zero if there were no matches. Here is a simple example of a
       substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short for
       substitution tests, as fixed-size buffers are used. To make it easy to
       test for buffer overflow, if the replacement string starts with a
       number in square brackets, that number is passed to pcre2_substitute()
       as the size of the output buffer, with the replacement string starting
       at the next character. Here is an example that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       A replacement string is ignored with POSIX and DFA matching. Specifying
       partial matching provokes an error return ("bad option value") from
       pcre2_substitute().

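       The buffer-size edge case in the example above can be modelled in
       Python. The sketch assumes that the output buffer must hold the result
       plus one terminating zero, which is consistent with a 9-character
       result fitting in 10 units but not in 9. All names here, the default
       size of 256, and the use of a literal search string instead of a
       compiled pattern are simplifications for illustration.

```python
import re


class OutputBufferTooSmall(Exception):
    """Stands in for the "no more memory" error in this sketch."""


def split_buffer_prefix(replacement, default_size=256):
    """Split an optional leading [n] from a replacement string."""
    m = re.match(r"\[(\d+)\]", replacement)
    if m:
        return int(m.group(1)), replacement[m.end():]
    return default_size, replacement    # default_size is hypothetical


def substitute_once(subject, literal, replacement):
    """Replace the first occurrence of a literal, checking the buffer size."""
    size, text = split_buffer_prefix(replacement)
    result = subject.replace(literal, text, 1)
    # Assumption: one extra unit is needed for the terminating zero.
    if len(result) + 1 > size:
        raise OutputBufferTooSmall("no more memory")
    return result
```
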
   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack size
       that is used by the just-in-time optimization code. It is ignored if
       JIT optimization is not being used. The value is a number of kilobytes.
       Providing a stack that is larger than the default 32K is necessary only
       for very complicated patterns.

   Setting match and recursion limits

       The match_limit and recursion_limit modifiers set the appropriate
       limits in the match context. These values are ignored when the
       find_limits modifier is specified.

   Finding minimum limits

       If the find_limits modifier is present, pcre2test calls pcre2_match()
       several times, setting different values in the match context via
       pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
       the minimum values for each parameter that allow pcre2_match() to
       complete without error.

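       One way to search for such a minimum is to double a trial limit until
       the match completes and then bisect between the last failing and the
       first successful value. The Python sketch below shows that strategy
       for a single limit. It is not taken from the pcre2test source; the
       search strategy, the function name, and the completes callback (which
       stands for "run the match with this limit and report whether it
       finished without a limit error") are assumptions.

```python
def find_min_limit(completes, upper=2**32 - 1):
    """Return the smallest limit for which completes(limit) is true.

    Assumes that completes is monotonic: once a limit is large enough,
    every larger limit is large enough too.
    """
    hi = 1
    while not completes(hi):
        if hi >= upper:
            raise ValueError("match does not complete within the upper bound")
        hi = min(hi * 2, upper)         # doubling phase
    lo = hi // 2                        # known to fail (or 0, never tried)
    while lo + 1 < hi:                  # bisection phase
        mid = (lo + hi) // 2
        if completes(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

       The number of trial matches grows with the logarithm of the minimum
       rather than with the minimum itself.
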
       If JIT is being used, only the match limit is relevant. If DFA matching
       is being used, neither limit is relevant, and this modifier is ignored
       (with a warning message).

       The match_limit number is a measure of the amount of backtracking that
       takes place, and learning the minimum value can be instructive. For
       most simple matches, the number is quite small, but for patterns with
       very large numbers of matching possibilities, it can become large very
       quickly with increasing length of subject string. The recursion_limit
       number is a measure of how much stack (or, if PCRE2 is compiled with
       NO_RECURSE, how much heap) memory is needed to complete the match
       attempt.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs that
       are returned from calls to pcre2_match() to be displayed. If a mark is
       returned for a match, non-match, or partial match, pcre2test shows it.
       For a match, it is on a line by itself, tagged with "MK:". Otherwise,
       it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log all memory allocation and
       freeing calls that occur during a match operation.

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not characters.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it
       appears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that are
       available for storing matching information. The default is 15.

       A value of zero is useful when testing the POSIX API because it causes
       regexec() to be called with a NULL capture vector. When not testing the
       POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to create
       a match block of exactly the right size for the pattern. (It is not
       possible to create a match block with a zero-length ovector; there is
       always at least one pair of offsets.)

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching
       function with its correct length. In order to test the facility for
       passing a zero-terminated string, the zero_terminate modifier is
       provided. It causes the length to be passed as PCRE2_ZERO_TERMINATED.
       (When matching via the POSIX interface, this modifier has no effect, as
       there is no facility for passing a length.)

       When testing pcre2_substitute(), this modifier also has the effect of
       passing the replacement string as zero-terminated.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match(), to match each subject line. PCRE2 also supports an
       alternative matching function, pcre2_dfa_match(), which operates in a
       different way, and has some restrictions. The differences between the
       two functions are described in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is used.
       This function finds all possible matches at a given point in the
       subject. If, however, the dfa_shortest modifier is set, processing
       stops after the first match is found. This is always the shortest
       possible match.


DEFAULT OUTPUT FROM pcre2test
 | 
						|
 | 
						|
       This  section  describes  the output when the normal matching function,
 | 
						|
       pcre2_match(), is being used.
 | 
						|
 | 
						|
       When a match succeeds, pcre2test outputs  the  list  of  captured  sub-
 | 
						|
       strings,  starting  with number 0 for the string that matched the whole
 | 
						|
       pattern.   Otherwise,  it  outputs  "No  match"  when  the  return   is
 | 
						|
       PCRE2_ERROR_NOMATCH,  or  "Partial  match:"  followed  by the partially
 | 
						|
       matching substring when the return is PCRE2_ERROR_PARTIAL.  (Note  that
 | 
						|
       this  is  the  entire  substring  that was inspected during the partial
 | 
						|
       match; it may include characters before the actual  match  start  if  a
 | 
						|
       lookbehind assertion, \K, \b, or \B was involved.)
 | 
						|
 | 
						|
       For any other return, pcre2test outputs the PCRE2 negative error number
 | 
						|
       and a short descriptive phrase. If the error is  a  failed  UTF  string
 | 
						|
       check,  the  code  unit offset of the start of the failing character is
 | 
						|
       also output. Here is an example of an interactive pcre2test run.
 | 
						|
 | 
						|
         $ pcre2test
 | 
						|
         PCRE2 version 9.00 2014-05-10
 | 
						|
 | 
						|
           re> /^abc(\d+)/
 | 
						|
         data> abc123
 | 
						|
          0: abc123
 | 
						|
          1: 123
 | 
						|
         data> xyz
 | 
						|
         No match
 | 
						|
 | 
						|
       Unset capturing substrings that are not followed by one that is set are
 | 
						|
       not shown by pcre2test unless the allcaptures modifier is specified. In
 | 
						|
       the following example, there are two capturing substrings, but when the
 | 
						|
       first  data  line is matched, the second, unset substring is not shown.
 | 
						|
       An "internal" unset substring is shown as "<unset>", as for the  second
 | 
						|
       data line.
 | 
						|
 | 
						|
           re> /(a)|(b)/
 | 
						|
         data> a
 | 
						|
          0: a
 | 
						|
          1: a
 | 
						|
         data> b
 | 
						|
          0: b
 | 
						|
          1: <unset>
 | 
						|
          2: b
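
       The rule for unset substrings can be sketched in Python. This  is  only
       an  illustration  of  the  output  convention described above; it  uses
       Python's re module rather than PCRE2, and the helper name
       format_captures is hypothetical. How trailing unset groups are rendered
       under allcaptures is an assumption here.

```python
import re


def format_captures(groups, allcaptures=False):
    """Render captures in the style shown above.

    groups[0] is the whole match; None marks an unset group.
    Trailing unset groups are dropped unless allcaptures is True.
    """
    last = len(groups) - 1
    if not allcaptures:
        # Drop unset groups that are not followed by a set group.
        while last > 0 and groups[last] is None:
            last -= 1
    lines = []
    for number, text in enumerate(groups[: last + 1]):
        shown = "<unset>" if text is None else text
        lines.append("%2d: %s" % (number, shown))
    return lines


def captures_for(pattern, subject):
    """Match with Python's re and return [whole match, group 1, ...]."""
    match = re.search(pattern, subject)
    return None if match is None else [match.group(0), *match.groups()]


if __name__ == "__main__":
    for subject in ("a", "b"):
        print("\n".join(format_captures(captures_for(r"(a)|(b)", subject))))
```

       For the subject "a" the trailing unset group 2 is omitted;  for  "b"
       the internal unset group 1 is kept and shown as "<unset>".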

       If  the strings contain any non-printing characters, they are output as
       \xhh escapes if the value is less than 256 and UTF  mode  is  not  set.
       Otherwise they are output as \x{hh...} escapes. See below for the defi-
       nition of non-printing characters. If the /aftertext modifier  is  set,
       the  output  for  substring 0 is followed by the rest  of  the  subject
       string, identified by "0+" like this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the  results  of  successive  matching
       attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp
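
       The sequence of results in the global example can  be  reproduced  with
       Python's  re.finditer().  This is a sketch using Python's regex engine,
       not PCRE2; the function name global_matches is  hypothetical,  and  the
       two engines may treat empty matches differently during global searches.

```python
import re


def global_matches(pattern, subject):
    """Collect pcre2test-style output lines for successive matches."""
    lines = []
    for match in re.finditer(pattern, subject):
        lines.append(" 0: " + match.group(0))
        for number, text in enumerate(match.groups(), start=1):
            shown = "<unset>" if text is None else text
            lines.append("%2d: %s" % (number, shown))
    if not lines:
        # "No match" is reported only when the first attempt fails.
        lines.append("No match")
    return lines


if __name__ == "__main__":
    print("\n".join(global_matches(r"\Bi(\w\w)", "Mississippi")))
```

       Each new attempt starts where the previous match ended, which  is  why
       the second "iss" is found at the fifth character.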

       "No  match" is output only if the first match attempt fails. Here is an
       example of a failure message (the offset 4 that  is  specified  by  the
       offset modifier is past the end of the subject string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a plain
       ">" prompt is used for continuations), subject lines may not. However,
       newlines can be included in a subject by means of the \n escape (or \r,
       \r\n, etc., depending on the newline sequence setting).


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used, the
       output  consists  of  a list of all the matches that start at the first
       point in the subject where there is at least one match. For example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang".  The
       longest  matching  string  is  always  given first (and numbered zero).
       After a PCRE2_ERROR_PARTIAL return, the  output  is  "Partial  match:",
       followed  by  the  partially  matching substring. Note that this is the
       entire substring that was inspected during the partial  match;  it  may
       include characters before the actual match start if a lookbehind asser-
       tion, \b, or \B was involved. (\K is not supported for DFA matching.)

       If global matching is requested, the search for further matches resumes
       at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The  alternative  matching function does not support substring capture,
       so the modifiers that are concerned with captured  substrings  are  not
       relevant.
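
       The "all matches at the first matching point, longest first"  behaviour
       can  be  imitated  for  a  plain  alternation of literal strings. This
       Python sketch is not a DFA matcher and does not call pcre2_dfa_match();
       the function name dfa_style_matches is hypothetical, and it only  han-
       dles literal alternatives.

```python
def dfa_style_matches(alternatives, subject, global_search=False):
    """List every literal alternative matching at the first matching point.

    Returns a list of result groups; each group is ordered longest first.
    With global_search, scanning resumes at the end of the longest match.
    """
    results = []
    position = 0
    while position <= len(subject):
        found = sorted(
            {alt for alt in alternatives if subject.startswith(alt, position)},
            key=len,
            reverse=True,
        )
        if not found:
            position += 1
            continue
        results.append(found)
        if not global_search:
            break
        # Resume after the longest match; always advance by at least one.
        position += max(len(found[0]), 1)
    return results


if __name__ == "__main__":
    words = ["tang", "tangerine", "tan"]
    for group in dfa_style_matches(
        words, "yellow tangerine and tangy sultana", global_search=True
    ):
        for number, text in enumerate(group):
            print("%2d: %s" % (number, text))
```

       The numbers printed here are  match  indexes,  not  capture  groups,
       which mirrors the way the output above should be read.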


RESTARTING AFTER A PARTIAL MATCH

       When  the  alternative matching function has given the PCRE2_ERROR_PAR-
       TIAL return, indicating that the subject partially matched the pattern,
       you  can restart the match with additional subject data by means of the
       dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching,  see  the  pcre2partial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout func-
       tion is called during matching unless callout_none is specified.   This
       works with both matching functions.

       The  callout  function in pcre2test returns zero (carry on matching) by
       default, but you can use a callout_fail modifier in a subject line  (as
       described above) to change this and other parameters of the callout.

       Inserting callouts can be helpful when using pcre2test to check compli-
       cated regular expressions. For further information about callouts,  see
       the pcre2callout documentation.

       The  output for callouts with numerical arguments and those with string
       arguments is slightly different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the start
       and  current positions in the subject text at the callout time, and the
       next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^     \d

       This output indicates that  callout  number  0  occurred  for  a  match
       attempt  starting  at  the fourth character of the subject string, when
       the pointer was at the seventh character, and  when  the  next  pattern
       item  was  \d.  Just  one circumflex is output if the start and current
       positions are the same.
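
       The placement of the circumflexes can be sketched  as  follows.  This
       Python  helper (position_markers, a hypothetical name) builds only the
       marker portion that sits under the subject text; the callout number to
       its left and the padding before the next pattern item are left out, as
       their exact widths are not specified here.

```python
def position_markers(start, current):
    """Return the circumflex markers aligned under the subject string.

    start and current are zero-based offsets into the subject. A single
    circumflex is produced when the two positions coincide.
    """
    if start < 0 or current < start:
        raise ValueError("need 0 <= start <= current")
    if start == current:
        return " " * start + "^"
    gap = current - start - 1
    return " " * start + "^" + " " * gap + "^"


if __name__ == "__main__":
    subject = "pqrabcdef"
    print("--->" + subject)
    # Match attempt starting at the fourth character, pointer at the seventh.
    print("  0 " + position_markers(3, 6))
```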

       Callouts numbered 255 are assumed to be automatic callouts, inserted as
       a  result  of the /auto_callout pattern modifier. In this case, instead
       of showing the callout number, the offset in the pattern, preceded by a
       plus, is output. For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output when-
       ever a change of latest mark is passed to  the  callout  function.  For
       example:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The  mark  changes between matching "a" and "b", but stays the same for
       the rest of the match, so nothing more is output. If, as  a  result  of
       backtracking,  the  mark  reverts to being unset, the text "<unset>" is
       output.

   Callouts with string arguments

       The output for a callout with a string argument is similar, except that
       instead  of outputting a callout number before the position indicators,
       the callout string and its offset in  the  pattern  string  are  output
       before  the reflection of the subject string, and the subject string is
       reflected for each callout. For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a pattern,
       bytes  other  than 32-126 are always treated as non-printing characters
       and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of  a  subject
       string,  it behaves in the same way, unless a different locale has been
       set for the pattern (using the /locale modifier).  In  this  case,  the
       isprint()  function  is  used  to distinguish printing and non-printing
       characters.
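
       The default rule (no locale set) can be sketched in Python. The  func-
       tion name show_text is hypothetical, and the digit case and minimum
       width of the hex escapes are assumptions; only the choice between  the
       \xhh and \x{hh...} forms is taken from the description in this document.

```python
def show_text(code_points, utf=False):
    """Render code points, escaping anything outside 32-126.

    Values below 256 use the \\xhh form when UTF mode is off; everything
    else that is non-printing uses the \\x{hh...} form.
    """
    pieces = []
    for value in code_points:
        if 32 <= value <= 126:
            pieces.append(chr(value))
        elif value < 256 and not utf:
            pieces.append("\\x%02x" % value)
        else:
            pieces.append("\\x{%02x}" % value)
    return "".join(pieces)


if __name__ == "__main__":
    print(show_text([ord("a"), 0x0A, ord("b")]))
    print(show_text([ord("a"), 0x100], utf=True))
```

       A locale-aware variant would replace the 32-126 test with a  call  to
       the C library's isprint() for the selected locale.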


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns  on  disc  or  elsewhere,  and
       reload them later, subject to a number of restrictions. JIT data cannot
       be saved. The host on which the patterns are reloaded must  be  running
       the same version of PCRE2, with the same code unit width, and must also
       have the same endianness, pointer width  and  PCRE2_SIZE  type.  Before
       compiled  patterns  can be saved they must be serialized, that is, con-
       verted to a stream of bytes. A single byte stream may contain any  num-
       ber  of  compiled  patterns,  but  they must all use the same character
       tables. A single copy of the tables is included in the byte stream (its
       size is 1088 bytes).

       The  functions  whose  names  begin  with pcre2_serialize_ are used for
       serializing and de-serializing. They are described in the  pcre2serial-
       ize  documentation.  In  this  section  we  describe  the  features  of
       pcre2test that can be used to test these functions.

       When a pattern with the push modifier is successfully compiled,  it  is
       pushed  onto  a  stack  of compiled patterns, and pcre2test expects the
       next line to contain a new pattern (or command) instead  of  a  subject
       line. By this means, a number of patterns can be compiled and retained.
       The push modifier is incompatible with  posix,  and  control  modifiers
       that act at match time are ignored (with a message). The jitverify mod-
       ifier applies only at compile time. The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result written
       to  the named file. Afterwards, all the stacked patterns are freed. The
       command

         #load <filename>

       reads the data in the file, and then arranges for it to  be  de-serial-
       ized,  with the resulting compiled patterns added to the pattern stack.
       The pattern on the top of the stack can be retrieved by the  #pop  com-
       mand,  which  must  be  followed  by  lines  of subjects that are to be
       matched with the pattern, terminated as usual by an empty line  or  end
       of  file.  This  command  may be followed by a modifier list containing
       only control modifiers that act after a pattern has been  compiled.  In
       particular,  hex,  posix, and push are not allowed, nor are any option-
       setting modifiers. The JIT modifiers are, however,  permitted.  Here is
       an example that saves and reloads two patterns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc
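
       The push, #save, #load and #pop workflow  in  the  example  above  be-
       haves like a stack that is written to a file and read  back.  The  fol-
       lowing Python sketch models only that stack discipline. It is not  the
       PCRE2  serialization format: it pickles the pattern source strings and
       recompiles them with Python's re module, whereas the pcre2_serialize_
       functions store the compiled form. The class PatternStack and the func-
       tion round_trip_demo are hypothetical names.

```python
import os
import pickle
import re
import tempfile


class PatternStack:
    """A stack of compiled patterns that can be saved and reloaded."""

    def __init__(self):
        self._stack = []

    def __len__(self):
        return len(self._stack)

    def push(self, source):
        # Corresponds to compiling a pattern with the push modifier.
        self._stack.append(re.compile(source))

    def save(self, filename):
        # Like #save: write every stacked pattern, then free the stack.
        with open(filename, "wb") as handle:
            pickle.dump([p.pattern for p in self._stack], handle)
        self._stack.clear()

    def load(self, filename):
        # Like #load: add the stored patterns to the stack, in order.
        with open(filename, "rb") as handle:
            sources = pickle.load(handle)
        self._stack.extend(re.compile(source) for source in sources)

    def pop(self):
        # Like #pop: retrieve the pattern on top of the stack.
        return self._stack.pop()


def round_trip_demo():
    """Push two patterns, save, load, and pop them; return pop order."""
    stack = PatternStack()
    stack.push("abc")
    stack.push("xyz")
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "tempfile")
        stack.save(path)
        assert len(stack) == 0  # the stack is emptied after saving
        stack.load(path)
    return [stack.pop().pattern, stack.pop().pattern]


if __name__ == "__main__":
    print(round_trip_demo())
```

       As in the example, the last pattern pushed (xyz) is the first  to  be
       popped after reloading.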

       If  jitverify  is  used with #pop, it does not automatically imply jit,
       which is different behaviour from when it is used on a pattern.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3),  pcre2matching(3),
       pcre2partial(3), pcre2pattern(3), pcre2serialize(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 20 May 2015
       Copyright (c) 1997-2015 University of Cambridge.