Update bundled PCRE2-library to version 10.23

Some manual changes done to the library were lost with this update. They will be added in the next commit.
2017-05-29 15:31:42 +03:00
parent 7231563937
commit 36af74cb25
218 changed files with 49218 additions and 26130 deletions
--- a/pcre2/doc/html/pcre2test.html
+++ b/pcre2/doc/html/pcre2test.html
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
 <P>
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original <b>pcretest</b> program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@@ -77,31 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 <b>pcre2test</b> program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 </P>
 <P>
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, <b>pcre2_compile()</b>. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
-</P>
+<a name="inputencoding"></a></P>
 <br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
 <P>
 Input to <b>pcre2test</b> is processed line by line, either by calling the C
-library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 </P>
 <P>
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in <b>pcre2test</b> input files. There is a facility for specifying a
-pattern's characters as hexadecimal pairs, thus making it possible to include
-binary zeroes in a pattern for testing purposes. Subject lines are processed
-for backslash escapes, which makes it possible to include any data value.
+The input is processed using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+</P>
+<br><b>
+Input for the 16-bit and 32-bit libraries
+</b><br>
+<P>
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the <b>utf</b> modifier (see
+<a href="#optionmodifiers">"Setting compilation options"</a>
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+</P>
+<P>
+For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
+used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+</P>
+<P>
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 </P>
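The <b>utf8_input</b> rules added in this hunk can be illustrated with a small decoder. This is only a sketch of the behaviour the text describes, not the code in pcre2test itself: the function name `decode_utf8_input` and its `code_unit_bits` parameter are hypothetical, and the error handling is an assumption.

```python
def decode_utf8_input(data, code_unit_bits=32):
    """Decode bytes as original-definition (RFC 2279) UTF-8.

    Hypothetical helper: returns one integer per character, as if each
    character were placed in a single 16-bit or 32-bit code unit.
    """
    values = []
    add_high_bit = False  # set when the next character was preceded by 0xff
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0xFF:
            # 0xff is never valid in UTF-8; here it marks the next character
            # as needing 0x80000000 added to its value.
            add_high_bit = True
            i += 1
            continue
        if b < 0x80:
            length, value = 1, b
        elif b < 0xC0:
            raise ValueError("unexpected continuation byte")
        elif b < 0xE0:
            length, value = 2, b & 0x1F
        elif b < 0xF0:
            length, value = 3, b & 0x0F
        elif b < 0xF8:
            length, value = 4, b & 0x07
        elif b < 0xFC:
            length, value = 5, b & 0x03
        elif b < 0xFE:
            length, value = 6, b & 0x01
        else:
            raise ValueError("0xfe cannot start a character")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated character")
        for c in tail:
            if c & 0xC0 != 0x80:
                raise ValueError("bad continuation byte")
            value = (value << 6) | (c & 0x3F)
        if add_high_bit:
            value += 0x80000000
            add_high_bit = False
        if code_unit_bits == 16 and value > 0xFFFF:
            raise ValueError("value does not fit in a 16-bit code unit")
        values.append(value)
        i += length
    return values
```

The six-byte form is what allows values up to 0x7fffffff; the 0xff prefix is the only route to larger values in a pattern string.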
 <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
 <P>
@@ -123,8 +153,13 @@ the 32-bit library has been built, this is the default. If the 32-bit library
 has not been built, this option causes an error.
 </P>
 <P>
+<b>-ac</b>
+Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
+automatic callouts into every pattern that is compiled.
+</P>
+<P>
 <b>-b</b>
-Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
+Behave as if each pattern has the <b>fullbincode</b> modifier; the full
 internal binary form of the pattern is output after compilation.
 </P>
 <P>
@@ -155,12 +190,13 @@ following options output the value and set the exit code as indicated:
 The following options output 1 for true or 0 for false, and set the exit code
 to the same value:
 <pre>
-  ebcdic     compiled for an EBCDIC environment
-  jit        just-in-time support is available
-  pcre2-16   the 16-bit library was built
-  pcre2-32   the 32-bit library was built
-  pcre2-8    the 8-bit library was built
-  unicode    Unicode support is available
+  backslash-C  \C is supported (not locked out)
+  ebcdic       compiled for an EBCDIC environment
+  jit          just-in-time support is available
+  pcre2-16     the 16-bit library was built
+  pcre2-32     the 32-bit library was built
+  pcre2-8      the 8-bit library was built
+  unicode      Unicode support is available
 </pre>
 If an unknown option is given, an error message is output; the exit code is 0.
 </P>
@@ -177,12 +213,19 @@ using the <b>pcre2_dfa_match()</b> function instead of the default
 <b>pcre2_match()</b>.
 </P>
 <P>
+<b>-error</b> <i>number[,number,...]</i>
+Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
+comma-separated list, display the resulting messages on the standard output,
+then exit with zero exit code. The numbers may be positive or negative. This is
+a convenience facility for PCRE2 maintainers.
+</P>
+<P>
 <b>-help</b>
 Output a brief summary of these options and then exit.
 </P>
 <P>
 <b>-i</b>
-Behave as if each pattern has the <b>/info</b> modifier; information about the
+Behave as if each pattern has the <b>info</b> modifier; information about the
 compiled pattern is given after compilation.
 </P>
 <P>
@@ -265,9 +308,9 @@ Each subject line is matched separately and independently. If you want to do
 multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
 etc., depending on the newline setting) in a single line of input to encode the
 newline sequences. There is no limit on the length of subject lines; the input
-buffer is automatically extended if it is too small. There is a replication
-feature that makes it possible to generate long subject lines without having to
-supply them explicitly.
+buffer is automatically extended if it is too small. There are replication
+features that make it possible to generate long repetitive pattern or subject
+lines without having to supply them explicitly.
 </P>
 <P>
 An empty line or the end of the file signals the end of the subject lines for a
@@ -304,6 +347,36 @@ output.
 This command is used to load a set of precompiled patterns from a file, as
 described in the section entitled "Saving and restoring compiled patterns"
 <a href="#saverestore">below.</a>
+<pre>
+  #newline_default [&#60;newline-list&#62;]
+</pre>
+When PCRE2 is built, a default newline convention can be specified. This
+determines which characters and/or character pairs are recognized as indicating
+a newline in a pattern or subject string. The default can be overridden when a
+pattern is compiled. The standard test files contain tests of various newline
+conventions, but the majority of the tests expect a single linefeed to be
+recognized as a newline by default. Without special action the tests would fail
+when PCRE2 is compiled with either CR or CRLF as the default newline.
+</P>
+<P>
+The #newline_default command specifies a list of newline types that are
+acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
+ANY (in upper or lower case), for example:
+<pre>
+  #newline_default LF Any anyCRLF
+</pre>
+If the default newline is in the list, this command has no effect. Otherwise,
+except when testing the POSIX API, a <b>newline</b> modifier that specifies the
+first newline convention in the list (LF in the above example) is added to any
+pattern that does not already have a <b>newline</b> modifier. If the newline
+list is empty, the feature is turned off. This command is present in a number
+of the standard test input files.
+</P>
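The decision that <b>#newline_default</b> makes can be sketched as a small function. This models only the rules stated above; the function name, its parameters, and the exact spelling of the returned modifier are assumptions for illustration, not pcre2test internals.

```python
VALID_NEWLINE_TYPES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}

def newline_modifier_to_add(build_default, acceptable, pattern_modifiers,
                            posix=False):
    """Return the newline modifier to add to a pattern, or None.

    build_default:     newline convention PCRE2 was built with, e.g. "CRLF"
    acceptable:        types listed on the #newline_default line
    pattern_modifiers: modifiers already present on the pattern
    posix:             True when the POSIX API is being tested
    """
    types = [t.upper() for t in acceptable]  # upper or lower case is allowed
    for t in types:
        if t not in VALID_NEWLINE_TYPES:
            raise ValueError("unknown newline type: " + t)
    if not types:
        return None  # an empty list turns the feature off
    if build_default.upper() in types:
        return None  # the build default is acceptable: no effect
    if posix:
        return None  # the default cannot be overridden for the POSIX API
    if any(m.split("=", 1)[0] == "newline" for m in pattern_modifiers):
        return None  # the pattern already sets its own newline modifier
    return "newline=" + types[0]  # the first type in the list wins
```

With the example line above, a build whose default is CRLF would have newline=LF added to every pattern that lacks its own <b>newline</b> modifier.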
+<P>
+When the POSIX API is being tested there is no way to override the default
+newline convention, though it is possible to set the newline convention from
+within the pattern. A warning is given if the <b>posix</b> modifier is used when
+<b>#newline_default</b> would set a default for the non-POSIX API.
 <pre>
  #pattern &#60;modifier-list&#62;
 </pre>
@@ -321,9 +394,10 @@ test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
 command helps detect tests that are accidentally put in the wrong file.
 <pre>
  #pop [&#60;modifiers&#62;]
+  #popcopy [&#60;modifiers&#62;]
 </pre>
-This command is used to manipulate the stack of compiled patterns, as described
-in the section entitled "Saving and restoring compiled patterns"
+These commands are used to manipulate the stack of compiled patterns, as
+described in the section entitled "Saving and restoring compiled patterns"
 <a href="#saverestore">below.</a>
 <pre>
  #save &#60;filename&#62;
@@ -340,12 +414,13 @@ subject lines. Modifiers on a subject line can change these settings.
 <br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
 <P>
 Modifier lists are used with both pattern and subject lines. Items in a list
-are separated by commas and optional white space. Some modifiers may be given
-for both patterns and subject lines, whereas others are valid for one or the
-other only. Each modifier has a long name, for example "anchored", and some of
-them must be followed by an equals sign and a value, for example, "offset=12".
-Modifiers that do not take values may be preceded by a minus sign to turn off a
-previous setting.
+are separated by commas followed by optional white space. Trailing whitespace
+in a modifier list is ignored. Some modifiers may be given for both patterns
+and subject lines, whereas others are valid only for one or the other. Each
+modifier has a long name, for example "anchored", and some of them must be
+followed by an equals sign and a value, for example, "offset=12". Values cannot
+contain comma characters, but may contain spaces. Modifiers that do not take
+values may be preceded by a minus sign to turn off a previous setting.
 </P>
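The revised modifier-list rules (commas followed by optional white space, trailing whitespace ignored, values that may contain spaces but not commas, a leading minus to unset) can be modelled as follows. This is a sketch under those stated rules only: `parse_modifiers` is a hypothetical name, and single-letter abbreviations and validity checks are left out.

```python
def parse_modifiers(text):
    """Parse a modifier list into a dict (illustrative model only).

    Valued modifiers map to their string value, plain modifiers to True,
    and modifiers preceded by a minus sign to False.
    """
    settings = {}
    # Trailing whitespace in the list is ignored.
    for item in text.rstrip().split(","):
        item = item.lstrip()  # white space after the comma is optional
        if not item:
            continue
        if "=" in item:
            # Values cannot contain commas, but may contain spaces.
            name, value = item.split("=", 1)
            settings[name] = value
        elif item.startswith("-"):
            settings[item[1:]] = False  # turn off a previous setting
        else:
            settings[item] = True
    return settings
```

For example, "anchored, offset=12, -caseless" yields anchored set, offset equal to "12", and caseless unset.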
 <P>
 A few of the more common modifiers can also be specified as single letters, for
@@ -454,6 +529,12 @@ the start of a modifier list. For example:
 <pre>
  abc\=notbol,notempty
 </pre>
+If the subject string is empty and \= is followed by whitespace, the line is
+treated as a comment line, and is not used for matching. For example:
+<pre>
+  \= This is a comment.
+  abc\= This is an invalid modifier list.
+</pre>
 A backslash followed by any other non-alphanumeric character just escapes that
 character. A backslash followed by anything else causes an error. However, if
 the very last character in the line is a backslash (and there is no modifier
@@ -462,10 +543,10 @@ a real empty line terminates the data input.
 </P>
 <br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
 <P>
-There are three types of modifier that can appear in pattern lines, two of
-which may also be used in a <b>#pattern</b> command. A pattern's modifier list
-can add to or override default modifiers that were set by a previous
-<b>#pattern</b> command.
+There are several types of modifier that can appear in pattern lines. Except
+where noted below, they may also be used in <b>#pattern</b> commands. A
+pattern's modifier list can add to or override default modifiers that were set
+by a previous <b>#pattern</b> command.
 <a name="optionmodifiers"></a></P>
 <br><b>
 Setting compilation options
@@ -473,12 +554,13 @@ Setting compilation options
 <P>
 The following modifiers set options for <b>pcre2_compile()</b>. The most common
 ones have single-letter abbreviations. See
-<a href="pcreapi.html"><b>pcreapi</b></a>
+<a href="pcre2api.html"><b>pcre2api</b></a>
 for a description of their effects.
 <pre>
      allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
      alt_bsux                  set PCRE2_ALT_BSUX
      alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
+      alt_verbnames             set PCRE2_ALT_VERBNAMES
      anchored                  set PCRE2_ANCHORED
      auto_callout              set PCRE2_AUTO_CALLOUT
  /i  caseless                  set PCRE2_CASELESS
@@ -499,12 +581,15 @@ for a description of their effects.
      no_utf_check              set PCRE2_NO_UTF_CHECK
      ucp                       set PCRE2_UCP
      ungreedy                  set PCRE2_UNGREEDY
+      use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
      utf                       set PCRE2_UTF
 </pre>
 As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
 non-printing characters in output strings to be printed using the \x{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 <a name="controlmodifiers"></a></P>
 <br><b>
 Setting compilation controls
@@ -519,18 +604,24 @@ about the pattern:
      debug                     same as info,fullbincode
      fullbincode               show binary code with lengths
  /I  info                      show info about compiled pattern
-      hex                       pattern is coded in hexadecimal
+      hex                       unquoted characters are hexadecimal
      jit[=&#60;number&#62;]            use JIT
      jitfast                   use JIT fast path
      jitverify                 verify JIT use
      locale=&#60;name&#62;             use this locale
+      max_pattern_length=&#60;n&#62;    set the maximum pattern length
      memory                    show memory used
      newline=&#60;type&#62;            set newline type
+      null_context              compile with a NULL context
      parens_nest_limit=&#60;n&#62;     set maximum parentheses depth
      posix                     use the POSIX API
+      posix_nosub               use the POSIX API with REG_NOSUB
      push                      push compiled pattern onto the stack
+      pushcopy                  push a copy onto the stack
      stackguard=&#60;number&#62;       test the stackguard feature
      tables=[0|1|2]            select internal tables
+      use_length                do not zero-terminate the pattern
+      utf8_input                treat input as UTF-8
 </pre>
 The effects of these modifiers are described in the following sections.
 </P>
@@ -604,40 +695,145 @@ is requested. For each callout, either its number or string is given, followed
 by the item that follows it in the pattern.
 </P>
 <br><b>
-Specifying a pattern in hex
+Passing a NULL context
 </b><br>
 <P>
-The <b>hex</b> modifier specifies that the characters of the pattern are to be
-interpreted as pairs of hexadecimal digits. White space is permitted between
-pairs. For example:
+Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If
+the <b>null_context</b> modifier is set, however, NULL is passed. This is for
+testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
+default values).
+</P>
+<br><b>
+Specifying the pattern's length
+</b><br>
+<P>
+By default, patterns are passed to the compiling functions as zero-terminated
+strings. When using the POSIX wrapper API, there is no other option. However,
+when using PCRE2's native API, patterns can be passed by length instead of
+being zero-terminated. The <b>use_length</b> modifier causes this to happen.
+Using a length happens automatically (whether or not <b>use_length</b> is set)
+when <b>hex</b> is set, because patterns specified in hexadecimal may contain
+binary zeros.
+</P>
+<br><b>
+Specifying pattern characters in hexadecimal
+</b><br>
+<P>
+The <b>hex</b> modifier specifies that the characters of the pattern, except for
+substrings enclosed in single or double quotes, are to be interpreted as pairs
+of hexadecimal digits. This feature is provided as a way of creating patterns
+that contain binary zeros and other non-printing characters. White space is
+permitted between pairs of digits. For example, this pattern contains three
+characters:
 <pre>
  /ab 32 59/hex
 </pre>
-This feature is provided as a way of creating patterns that contain binary zero
-and other non-printing characters. By default, <b>pcre2test</b> passes patterns
-as zero-terminated strings to <b>pcre2_compile()</b>, giving the length as
-PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
-actual length of the pattern is passed.
+Parts of such a pattern are taken literally if quoted. This pattern contains
+nine characters, only two of which are specified in hexadecimal:
+<pre>
+  /ab "literal" 32/hex
+</pre>
+Either single or double quotes may be used. There is no way of including
+the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
+mutually exclusive.
+</P>
+<P>
+The POSIX API cannot be used with patterns specified in hexadecimal because
+they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
+requirement for a zero-terminated string. Such patterns are always passed to
+<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
+</P>
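The two <b>hex</b> examples above (three characters and nine characters) can be checked with a small model of the decoding rules. This is a sketch, not the pcre2test implementation: the function name is hypothetical, and encoding quoted text as Latin-1 is an assumption standing in for "8-bit input characters".

```python
import string

def hex_pattern_to_bytes(pattern):
    """Decode the body of a pattern that carries the hex modifier.

    Unquoted text is read as pairs of hexadecimal digits, white space is
    allowed between pairs, and text in single or double quotes is literal.
    """
    out = bytearray()
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch.isspace():
            i += 1
        elif ch in "'\"":
            # The delimiter cannot appear inside the quoted substring.
            end = pattern.find(ch, i + 1)
            if end == -1:
                raise ValueError("unterminated quoted substring")
            out += pattern[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            pair = pattern[i:i + 2]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("expected a pair of hexadecimal digits")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)
```

Because the result may contain binary zeros, it has to be passed with an explicit length, which is why <b>hex</b> implies <b>use_length</b> and rules out the POSIX API.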
+<br><b>
+Specifying wide characters in 16-bit and 32-bit modes
+</b><br>
+<P>
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
+can be used. It is mutually exclusive with <b>utf</b>. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+<a href="#inputencoding">"Input encoding"</a>
+above.
+</P>
+<br><b>
+Generating long repetitive patterns
+</b><br>
+<P>
+Some tests use long patterns that are very repetitive. Instead of creating a
+very long input line for such a pattern, you can use a special repetition
+feature, similar to the one described for subject lines above. If the
+<b>expand</b> modifier is present on a pattern, parts of the pattern that have
+the form
+<pre>
+  \[&#60;characters&#62;]{&#60;count&#62;}
+</pre>
+are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
+example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
+cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
+by decimal digits and "}" is found later in the pattern. If not, the characters
+remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
+mutually exclusive.
+</P>
+<P>
+If part of an expanded pattern looks like an expansion, but is really part of
+the actual pattern, unwanted expansion can be avoided by giving two values in
+the quantifier. For example, \[AB]{6000,6000} is not recognized as an
+expansion item.
+</P>
+<P>
+If the <b>info</b> modifier is set on an expanded pattern, the result of the
+expansion is included in the information that is output.
 </P>
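The <b>expand</b> rules can be approximated with one regular-expression substitution. This is an illustrative sketch: `expand_pattern` is a hypothetical name, and excluding "]" from the repeated characters is a simplifying assumption that the text does not state.

```python
import re

# "\[" then the characters to repeat, then "]{", decimal digits, and "}".
# A quantifier with two values, such as {6000,6000}, does not match.
_EXPANSION = re.compile(r"\\\[([^\]]+)\]\{(\d+)\}")

def expand_pattern(pattern):
    """Expand \\[<characters>]{<count>} items in a single pass (no nesting)."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)
```

Text that does not have the full form, including the two-value quantifier used to suppress expansion, is left in the pattern unaltered.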
 <br><b>
 JIT compilation
 </b><br>
 <P>
-The <b>/jit</b> modifier may optionally be followed by an equals sign and a
-number in the range 0 to 7:
+Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
+speed up pattern matching. See the
+<a href="pcre2jit.html"><b>pcre2jit</b></a>
+documentation for details. JIT compiling happens, optionally, after a pattern
+has been successfully compiled into an internal form. The JIT compiler converts
+this to optimized machine code. It needs to know whether the match-time options
+PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
+different code is generated for the different cases. See the <b>partial</b>
+modifier in "Subject Modifiers"
+<a href="#subjectmodifiers">below</a>
+for details of how these options are specified for each match attempt.
+</P>
+<P>
+JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
+optionally be followed by an equals sign and a number in the range 0 to 7.
+The three bits that make up the number specify which of the three JIT operating
+modes are to be compiled:
+<pre>
+  1  compile JIT code for non-partial matching
+  2  compile JIT code for soft partial matching
+  4  compile JIT code for hard partial matching
+</pre>
+The possible values for the <b>jit</b> modifier are therefore:
 <pre>
  0  disable JIT
-  1  use JIT for normal match only
-  2  use JIT for soft partial match only
-  3  use JIT for normal match and soft partial match
-  4  use JIT for hard partial match only
-  6  use JIT for soft and hard partial match
+  1  normal matching only
+  2  soft partial matching only
+  3  normal and soft partial matching
+  4  hard partial matching only
+  6  soft and hard partial matching only
  7  all three modes
 </pre>
-If no number is given, 7 is assumed. If JIT compilation is successful, the
-compiled JIT code will automatically be used when <b>pcre2_match()</b> is run
-for the appropriate type of match, except when incompatible run-time options
-are specified. For more details, see the
+If no number is given, 7 is assumed. The phrase "partial matching" means a call
+to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
+PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
+match; the options enable the possibility of a partial match, but do not
+require it. Note also that if you request JIT compilation only for partial
+matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
+subject line, that match will not use JIT code because none was compiled for
+non-partial matching.
+</P>
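The bit values of the <b>jit</b> number, and the note that /jit=2 without <b>partial</b> on the subject line does not use JIT code, can be expressed as a short sketch. The constant and function names are hypothetical and model only the rules described above.

```python
JIT_NORMAL = 1        # compile JIT code for non-partial matching
JIT_SOFT_PARTIAL = 2  # compile JIT code for soft partial matching
JIT_HARD_PARTIAL = 4  # compile JIT code for hard partial matching

def jit_modes(number=7):
    """List the JIT modes selected by a jit=<number> value (default 7)."""
    if not 0 <= number <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    names = []
    if number & JIT_NORMAL:
        names.append("normal")
    if number & JIT_SOFT_PARTIAL:
        names.append("soft partial")
    if number & JIT_HARD_PARTIAL:
        names.append("hard partial")
    return names

def match_uses_jit(number, partial=None):
    """Report whether JIT code exists for a match of the given kind.

    partial is None for a non-partial match, or "soft" or "hard".
    """
    needed = {None: JIT_NORMAL,
              "soft": JIT_SOFT_PARTIAL,
              "hard": JIT_HARD_PARTIAL}[partial]
    return bool(number & needed)
```

Here `match_uses_jit(2)` is False because only soft-partial code was compiled, which is the situation the note above warns about.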
+<P>
+If JIT compilation is successful, the compiled JIT code will automatically be
+used when an appropriate type of match is run, except when incompatible
+run-time options are specified. For more details, see the
 <a href="pcre2jit.html"><b>pcre2jit</b></a>
 documentation. See also the <b>jitstack</b> modifier below for a way of
 setting the size of the JIT stack.
@@ -661,14 +857,14 @@ code was actually used in the match.
 Setting a locale
 </b><br>
 <P>
-The <b>/locale</b> modifier must specify the name of a locale, for example:
+The <b>locale</b> modifier must specify the name of a locale, for example:
 <pre>
  /pattern/locale=fr_FR
 </pre>
 The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
 character tables for the locale, and this is then passed to
 <b>pcre2_compile()</b> when compiling the regular expression. The same tables
-are used when matching the following subject lines. The <b>/locale</b> modifier
+are used when matching the following subject lines. The <b>locale</b> modifier
 applies only to the pattern on which it appears, but can be given in a
 <b>#pattern</b> command if a default is needed. Setting a locale and alternate
 character tables are mutually exclusive.
@@ -677,7 +873,7 @@ character tables are mutually exclusive.
 Showing pattern memory
 </b><br>
 <P>
-The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
+The <b>memory</b> modifier causes the size in bytes of the memory used to hold
 the compiled pattern to be output. This does not include the size of the
 <b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
 subsequently passed to the JIT compiler, the size of the JIT compiled code is
@@ -700,30 +896,53 @@ sets its own default of 220, which is required for running the standard test
 suite.
 </P>
 <br><b>
+Limiting the pattern length
+</b><br>
+<P>
+The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
+length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
+causes a compilation error. The default is the largest number a PCRE2_SIZE
+variable can hold (essentially unlimited).
+</P>
+<br><b>
 Using the POSIX wrapper API
 </b><br>
 <P>
-The <b>/posix</b> modifier causes <b>pcre2test</b> to call PCRE2 via the POSIX
-wrapper API rather than its native API. This supports only the 8-bit library.
-When the POSIX API is being used, the following pattern modifiers set options
-for the <b>regcomp()</b> function:
+The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
+PCRE2 via the POSIX wrapper API rather than its native API. When
+<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
+<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
+it does not imply POSIX matching semantics; for more detail see the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+documentation. The following pattern modifiers set options for the
+<b>regcomp()</b> function:
 <pre>
  caseless           REG_ICASE
  multiline          REG_NEWLINE
-  no_auto_capture    REG_NOSUB
  dotall             REG_DOTALL     )
  ungreedy           REG_UNGREEDY   ) These options are not part of
  ucp                REG_UCP        )   the POSIX standard
  utf                REG_UTF8       )
 </pre>
+The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that
+is passed to <b>regerror()</b> in the event of a compilation error. For example:
+<pre>
+  /abc/posix,regerror_buffsize=20
+</pre>
+This provides a means of testing the behaviour of <b>regerror()</b> when the
+buffer is too small for the error message. If this modifier has not been set, a
+large buffer is used.
+</P>
+<P>
 The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
-below. All other modifiers cause an error.
+below. All other modifiers are either ignored, with a warning message, or cause
+an error.
 </P>
 <br><b>
 Testing the stack guard feature
 </b><br>
 <P>
-The <b>/stackguard</b> modifier is used to test the use of
+The <b>stackguard</b> modifier is used to test the use of
 <b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
 enable stack availability to be checked during compilation (see the
 <a href="pcre2api.html"><b>pcre2api</b></a>
@@ -738,7 +957,7 @@ be aborted.
 Using alternative character tables
 </b><br>
 <P>
-The value specified for the <b>/tables</b> modifier must be one of the digits 0,
+The value specified for the <b>tables</b> modifier must be one of the digits 0,
 1, or 2. It causes a specific set of built-in character tables to be passed to
 <b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
 different character tables. The digit specifies the tables as follows:
@@ -758,17 +977,22 @@ Setting certain match controls
 <P>
 The following modifiers are really subject modifiers, and are described below.
 However, they may be included in a pattern's modifier list, in which case they
-are applied to every subject line that is processed with that pattern. They do
-not affect the compilation process.
+are applied to every subject line that is processed with that pattern. They may
+not appear in <b>#pattern</b> commands. These modifiers do not affect the
+compilation process.
 <pre>
-      aftertext           show text after match
-      allaftertext        show text after captures
-      allcaptures         show all captures
-      allusedtext         show all consulted text
-  /g  global              global matching
-      mark                show mark values
-      replace=&#60;string&#62;    specify a replacement string
-      startchar           show starting character when relevant
+      aftertext                  show text after match
+      allaftertext               show text after captures
+      allcaptures                show all captures
+      allusedtext                show all consulted text
+  /g  global                     global matching
+      mark                       show mark values
+      replace=&#60;string&#62;           specify a replacement string
+      startchar                  show starting character when relevant
+      substitute_extended        use PCRE2_SUBSTITUTE_EXTENDED
+      substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
+      substitute_unknown_unset   use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
+      substitute_unset_empty     use PCRE2_SUBSTITUTE_UNSET_EMPTY
 </pre>
 These modifiers may not appear in a <b>#pattern</b> command. If you want them as
 defaults, set them in a <b>#subject</b> command.
@@ -782,13 +1006,17 @@ pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
 line to contain a new pattern (or a command) instead of a subject line. This
 facility is used when saving compiled patterns to a file, as described in the
 section entitled "Saving and restoring compiled patterns"
-<a href="#saverestore">below.</a>
-The <b>push</b> modifier is incompatible with compilation modifiers such as
-<b>global</b> that act at match time. Any that are specified are ignored, with a
-warning message, except for <b>replace</b>, which causes an error. Note that,
-<b>jitverify</b>, which is allowed, does not carry through to any subsequent
-matching that uses this pattern.
-</P>
+<a href="#saverestore">below.</a>
+If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled
+pattern is stacked, leaving the original as current, ready to match the
+following input lines. This provides a way of testing the
+<b>pcre2_code_copy()</b> function.
+The <b>push</b> and <b>pushcopy</b> modifiers are incompatible with compilation
+modifiers such as <b>global</b> that act at match time. Any that are specified
+are ignored (for the stacked copy), with a warning message, except for
+<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is
+allowed, does not carry through to any subsequent matching that uses a stacked
+pattern.
+<a name="subjectmodifiers"></a></P>
 <br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
 <P>
 The modifiers that can appear in subject lines and the <b>#subject</b>
@@ -806,6 +1034,7 @@ for a description of their effects.
      anchored                  set PCRE2_ANCHORED
      dfa_restart               set PCRE2_DFA_RESTART
      dfa_shortest              set PCRE2_DFA_SHORTEST
+      no_jit                    set PCRE2_NO_JIT
      no_utf_check              set PCRE2_NO_UTF_CHECK
      notbol                    set PCRE2_NOTBOL
      notempty                  set PCRE2_NOTEMPTY
@ -818,11 +1047,11 @@ The partial matching modifiers are provided with abbreviations because they
 appear frequently in tests.
 </P>
 <P>
-If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
+If the <b>posix</b> modifier was present on the pattern, causing the POSIX
 wrapper API to be used, the only option-setting modifiers that have any effect
 are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
 REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
-Any other modifiers cause an error.
+The other modifiers are ignored, with a warning message.
 </P>
 <br><b>
 Setting match controls
@ -833,33 +1062,44 @@ information. Some of them may also be specified on a pattern line (see above),
 in which case they apply to every subject line that is matched against that
 pattern.
 <pre>
-      aftertext                 show text after match
-      allaftertext              show text after captures
-      allcaptures               show all captures
-      allusedtext               show all consulted text (non-JIT only)
-      altglobal                 alternative global matching
-      callout_capture           show captures at callout time
-      callout_data=&#60;n&#62;          set a value to pass via callouts
-      callout_fail=&#60;n&#62;[:&#60;m&#62;]    control callout failure
-      callout_none              do not supply a callout function
-      copy=&#60;number or name&#62;     copy captured substring
-      dfa                       use <b>pcre2_dfa_match()</b>
-      find_limits               find match and recursion limits
-      get=&#60;number or name&#62;      extract captured substring
-      getall                    extract all captured substrings
-  /g  global                    global matching
-      jitstack=&#60;n&#62;              set size of JIT stack
-      mark                      show mark values
-      match_limit=&#62;n&#62;           set a match limit
-      memory                    show memory usage
-      offset=&#60;n&#62;                set starting offset
-      ovector=&#60;n&#62;               set size of output vector
-      recursion_limit=&#60;n&#62;       set a recursion limit
-      replace=&#60;string&#62;          specify a replacement string
-      startchar                 show startchar when relevant
-      zero_terminate            pass the subject as zero-terminated
+      aftertext                  show text after match
+      allaftertext               show text after captures
+      allcaptures                show all captures
+      allusedtext                show all consulted text (non-JIT only)
+      altglobal                  alternative global matching
+      callout_capture            show captures at callout time
+      callout_data=&#60;n&#62;           set a value to pass via callouts
+      callout_error=&#60;n&#62;[:&#60;m&#62;]    control callout error
+      callout_fail=&#60;n&#62;[:&#60;m&#62;]     control callout failure
+      callout_none               do not supply a callout function
+      copy=&#60;number or name&#62;      copy captured substring
+      dfa                        use <b>pcre2_dfa_match()</b>
+      find_limits                find match and recursion limits
+      get=&#60;number or name&#62;       extract captured substring
+      getall                     extract all captured substrings
+  /g  global                     global matching
+      jitstack=&#60;n&#62;               set size of JIT stack
+      mark                       show mark values
+      match_limit=&#60;n&#62;            set a match limit
+      memory                     show memory usage
+      null_context               match with a NULL context
+      offset=&#60;n&#62;                 set starting offset
+      offset_limit=&#60;n&#62;           set offset limit
+      ovector=&#60;n&#62;                set size of output vector
+      recursion_limit=&#60;n&#62;        set a recursion limit
+      replace=&#60;string&#62;           specify a replacement string
+      startchar                  show startchar when relevant
+      startoffset=&#60;n&#62;            same as offset=&#60;n&#62;
+      substitute_extended        use PCRE2_SUBSTITUTE_EXTENDED
+      substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
+      substitute_unknown_unset   use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
+      substitute_unset_empty     use PCRE2_SUBSTITUTE_UNSET_EMPTY
+      zero_terminate             pass the subject as zero-terminated
 </pre>
-The effects of these modifiers are described in the following sections.
+The effects of these modifiers are described in the following sections. When
+matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>,
+and <b>ovector</b> subject modifiers work as described below. All other
+modifiers are either ignored, with a warning message, or cause an error.
 </P>
 <br><b>
 Showing more text
@ -916,7 +1156,8 @@ The <b>allcaptures</b> modifier requests that the values of all potential
 captured parentheses be output after a match. By default, only those up to the
 highest one actually used in the match are output (corresponding to the return
 code from <b>pcre2_match()</b>). Groups that did not take part in the match
-are output as "&#60;unset&#62;".
+are output as "&#60;unset&#62;". This modifier is not relevant for DFA matching (which
+does no capturing); it is ignored, with a warning message, if present.
 </P>
 <br><b>
 Testing callouts
@ -924,15 +1165,22 @@ Testing callouts
 <P>
 A callout function is supplied when <b>pcre2test</b> calls the library matching
 functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
-set, the current captured groups are output when a callout occurs.
+set, the current captured groups are output when a callout occurs. The default
+return from the callout function is zero, which allows matching to continue.
 </P>
 <P>
 The <b>callout_fail</b> modifier can be given one or two numbers. If there is
-only one number, 1 is returned instead of 0 when a callout of that number is
-reached. If two numbers are given, 1 is returned when callout &#60;n&#62; is reached
-for the &#60;m&#62;th time. Note that callouts with string arguments are always given
-the number zero. See "Callouts" below for a description of the output when a
-callout it taken.
+only one number, 1 is returned instead of 0 (causing matching to backtrack)
+when a callout of that number is reached. If two numbers (&#60;n&#62;:&#60;m&#62;) are given, 1
+is returned when callout &#60;n&#62; is reached and there have been at least &#60;m&#62;
+callouts. The <b>callout_error</b> modifier is similar, except that
+PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
+aborted. If both these modifiers are set for the same callout number,
+<b>callout_error</b> takes precedence.
+</P>
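+<P>
+As a minimal sketch of how these modifiers might appear in a test file (the
+pattern and subject are illustrative assumptions, and the resulting output is
+not shown), the first data line below asks for 1 to be returned from callout 1,
+and the second asks for PCRE2_ERROR_CALLOUT to be returned from callout 2:
+<pre>
+  /a(?C1)b(?C2)c/
+      abc\=callout_fail=1
+      abc\=callout_error=2
+</pre>
+</P>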
+<P>
+Note that callouts with string arguments are always given the number zero. See
+"Callouts" below for a description of the output when a callout is taken.
 </P>
 <P>
 The <b>callout_data</b> modifier can be given an unsigned or a negative number.
@ -945,7 +1193,7 @@ Finding all matches in a string
 </b><br>
 <P>
 Searching for all possible matches within a subject can be requested by the
-<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
+<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
 function is called again to search the remainder of the subject. The difference
 between <b>global</b> and <b>altglobal</b> is that the former uses the
 <i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
@ -996,19 +1244,34 @@ Testing the substitution function
 </b><br>
 <P>
 If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
-called instead of one of the matching functions. Unlike subject strings,
-<b>pcre2test</b> does not process replacement strings for escape sequences. In
-UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
-If so, it is correctly converted to a UTF string of the appropriate code unit
-width. If it is not a valid UTF-8 string, the individual code units are copied
-directly. This provides a means of passing an invalid UTF-8 string for testing
-purposes.
+called instead of one of the matching functions. Note that replacement strings
+cannot contain commas, because a comma signifies the end of a modifier. This is
+not thought to be an issue in a test program.
 </P>
 <P>
-If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
-<b>pcre2_substitute()</b>. After a successful substitution, the modified string
-is output, preceded by the number of replacements. This may be zero if there
-were no matches. Here is a simple example of a substitution test:
+Unlike subject strings, <b>pcre2test</b> does not process replacement strings
+for escape sequences. In UTF mode, a replacement string is checked to see if it
+is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
+the appropriate code unit width. If it is not a valid UTF-8 string, the
+individual code units are copied directly. This provides a means of passing an
+invalid UTF-8 string for testing purposes.
+</P>
+<P>
+The following modifiers set options (in addition to the normal match options)
+for <b>pcre2_substitute()</b>:
+<pre>
+  global                      PCRE2_SUBSTITUTE_GLOBAL
+  substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
+  substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
+  substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
+  substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY
+
+</pre>
+</P>
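+<P>
+For illustration only (this pattern and subject are assumptions, not taken from
+the PCRE2 test files, and no output is shown), the following input refers to an
+unset group in the replacement string. On the second data line,
+<b>substitute_unset_empty</b> asks for the unset group 1 to be treated as an
+empty string:
+<pre>
+  /a(b)?c/replace=X${1}Y
+      ac
+      ac\=substitute_unset_empty
+</pre>
+</P>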
+<P>
+After a successful substitution, the modified string is output, preceded by the
+number of replacements. This may be zero if there were no matches. Here is a
+simple example of a substitution test:
 <pre>
  /abc/replace=xxx
      =abc=abc=
@ -1016,12 +1279,12 @@ were no matches. Here is a simple example of a substitution test:
      =abc=abc=\=global
   2: =xxx=xxx=
 </pre>
-Subject and replacement strings should be kept relatively short for
-substitution tests, as fixed-size buffers are used. To make it easy to test for
-buffer overflow, if the replacement string starts with a number in square
-brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
-output buffer, with the replacement string starting at the next character. Here
-is an example that tests the edge case:
+Subject and replacement strings should be kept relatively short (fewer than 256
+characters) for substitution tests, as fixed-size buffers are used. To make it
+easy to test for buffer overflow, if the replacement string starts with a
+number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
+the size of the output buffer, with the replacement string starting at the next
+character. Here is an example that tests the edge case:
 <pre>
  /abc/
      123abc123\=replace=[10]XYZ
@ -1029,6 +1292,19 @@ is an example that tests the edge case:
      123abc123\=replace=[9]XYZ
  Failed: error -47: no more memory
 </pre>
+The default action of <b>pcre2_substitute()</b> is to return
+PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
+PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
+<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
+to go through the motions of matching and substituting, in order to compute the
+size of buffer that is required. When this happens, <b>pcre2test</b> shows the
+required buffer length (which includes space for the trailing zero) as part of
+the error message. For example:
+<pre>
+  /abc/substitute_overflow_length
+      123abc123\=replace=[9]XYZ
+  Failed: error -47: no more memory: 10 code units are needed
+</pre>
 A replacement string is ignored with POSIX and DFA matching. Specifying partial
 matching provokes an error return ("bad option value") from
 <b>pcre2_substitute()</b>.
@ -1100,6 +1376,16 @@ The <b>offset</b> modifier sets an offset in the subject string at which
 matching starts. Its value is a number of code units, not characters.
 </P>
 <br><b>
+Setting an offset limit
+</b><br>
+<P>
+The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match
+cannot be found starting at or before this offset in the subject, a "no match"
+return is given. The data value is a number of code units, not characters. When
+this modifier is used, the <b>use_offset_limit</b> modifier must have been set
+for the pattern; if not, an error is generated.
+</P>
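+<P>
+A short sketch, using a pattern and subject chosen for illustration: "abc"
+begins at code unit offset 4 in the subject, so, going by the description
+above, a match is expected with a limit of 4 but not with a limit of 3.
+<pre>
+  /abc/use_offset_limit
+      1234abcde\=offset_limit=4
+      1234abcde\=offset_limit=3
+</pre>
+</P>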
+<br><b>
 Setting the size of the output vector
 </b><br>
 <P>
@ -1131,6 +1417,17 @@ this modifier has no effect, as there is no facility for passing a length.)
 When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
 passing the replacement string as zero-terminated.
 </P>
+<br><b>
+Passing a NULL context
+</b><br>
+<P>
+Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>,
+<b>pcre2_dfa_match()</b> or <b>pcre2_jit_match()</b>. If the <b>null_context</b>
+modifier is set, however, NULL is passed. This is for testing that the matching
+functions behave correctly in this case (they use default values). This
+modifier cannot be used with the <b>find_limits</b> modifier or when testing the
+substitution function.
+</P>
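+<P>
+For example (an illustrative sketch only, with no output shown), this input
+requests a match with NULL passed in place of the context block:
+<pre>
+  /abc/
+      abc\=null_context
+</pre>
+</P>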
 <br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
 <P>
 By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
@ -1196,7 +1493,7 @@ unset substring is shown as "&#60;unset&#62;", as for the second data line.
 If the strings contain any non-printing characters, they are output as \xhh
 escapes if the value is less than 256 and UTF mode is not set. Otherwise they
 are output as \x{hh...} escapes. See below for the definition of non-printing
-characters. If the <b>/aftertext</b> modifier is set, the output for substring
+characters. If the <b>aftertext</b> modifier is set, the output for substring
 0 is followed by the rest of the subject string, identified by "0+" like
 this:
 <pre>
@ -1321,7 +1618,9 @@ item to be tested. For example:
 This output indicates that callout number 0 occurred for a match attempt
 starting at the fourth character of the subject string, when the pointer was at
 the seventh character, and when the next pattern item was \d. Just
-one circumflex is output if the start and current positions are the same.
+one circumflex is output if the start and current positions are the same, or if
+the current position precedes the start position, which can happen if the
+callout is in a lookbehind assertion.
 </P>
 <P>
 Callouts numbered 255 are assumed to be automatic callouts, inserted as a
@ -1387,7 +1686,7 @@ therefore shown as hex escapes.
 <P>
 When <b>pcre2test</b> is outputting text that is a matched part of a subject
 string, it behaves in the same way, unless a different locale has been set for
-the pattern (using the <b>/locale</b> modifier). In this case, the
+the pattern (using the <b>locale</b> modifier). In this case, the
 <b>isprint()</b> function is used to distinguish printing and non-printing
 characters.
 <a name="saverestore"></a></P>
@ -1413,11 +1712,16 @@ can be used to test these functions.
 <P>
 When a pattern with the <b>push</b> modifier is successfully compiled, it is pushed
 onto a stack of compiled patterns, and <b>pcre2test</b> expects the next line to
-contain a new pattern (or command) instead of a subject line. By this means, a
-number of patterns can be compiled and retained. The <b>push</b> modifier is
-incompatible with <b>posix</b>, and control modifiers that act at match time are
-ignored (with a message). The <b>jitverify</b> modifier applies only at compile
-time. The command
+contain a new pattern (or command) instead of a subject line. By contrast,
+the <b>pushcopy</b> modifier causes a copy of the compiled pattern to be
+stacked, leaving the original available for immediate matching. By using
+<b>push</b> and/or <b>pushcopy</b>, a number of patterns can be compiled and
+retained. These modifiers are incompatible with <b>posix</b>, and control
+modifiers that act at match time are ignored (with a message) for the stacked
+patterns. The <b>jitverify</b> modifier applies only at compile time.
+</P>
+<P>
+The command
 <pre>
  #save &#60;filename&#62;
 </pre>
@ -1434,7 +1738,8 @@ usual by an empty line or end of file. This command may be followed by a
 modifier list containing only
 <a href="#controlmodifiers">control modifiers</a>
 that act after a pattern has been compiled. In particular, <b>hex</b>,
-<b>posix</b>, and <b>push</b> are not allowed, nor are any
+<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
+nor are any
 <a href="#optionmodifiers">option-setting modifiers.</a>
 The JIT modifiers are, however, permitted. Here is an example that saves and
 reloads two patterns.
@ -1452,6 +1757,11 @@ reloads two patterns.
 If <b>jitverify</b> is used with #pop, it does not automatically imply
 <b>jit</b>, which is different behaviour from when it is used on a pattern.
 </P>
+<P>
+The #popcopy command is analogous to the <b>pushcopy</b> modifier in that it
+makes current a copy of the topmost stack pattern, leaving the original still
+on the stack.
+</P>
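+<P>
+As an illustrative sketch (the pattern and subjects are assumptions, and no
+output is shown), the following input stacks a copy of a pattern with
+<b>pushcopy</b> and matches against the original. It then uses #popcopy and
+#pop so that the stacked pattern is matched twice more. The blank lines
+terminate each list of subject lines:
+<pre>
+  /abc/pushcopy
+      123abc
+
+  #popcopy
+      456abc
+
+  #pop
+      789abc
+</pre>
+</P>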
 <br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
 <P>
 <b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
@ -1469,9 +1779,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 20 May 2015
+Last updated: 28 December 2016
 <br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2016 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.