Update bundled PCRE2-library to version 10.23
Some manual changes done to the library were lost with this update. They will be added in the next commit.
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
<P>
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original <b>pcretest</b> program ended up with a
lot of options in a messy, arcane syntax, for testing all the features. The
lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@@ -77,31 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
<b>pcre2test</b> program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16- or 32-bit format
before being passed to the library functions. Results are converted back to
8-bit code units for output.
libraries, patterns and subject strings are converted to 16-bit or 32-bit
format before being passed to the library functions. Results are converted back
to 8-bit code units for output.
</P>
<P>
In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre2_compile()</b>. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
</P>
<a name="inputencoding"></a></P>
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
<P>
Input to <b>pcre2test</b> is processed line by line, either by calling the C
library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
below). The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
further data is read.
library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
Windows environments character 26 (hex 1A) causes an immediate end of file, and
no further data is read, so this character should be avoided unless you really
want that action.
</P>
<P>
For maximum portability, therefore, it is safest to avoid non-printing
characters in <b>pcre2test</b> input files. There is a facility for specifying a
pattern's characters as hexadecimal pairs, thus making it possible to include
binary zeroes in a pattern for testing purposes. Subject lines are processed
for backslash escapes, which makes it possible to include any data value.
The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash
escapes, which makes it possible to include any data value in strings that are
passed to the library for matching. For patterns, there is a facility for
specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros.
</P>
<br><b>
Input for the 16-bit and 32-bit libraries
</b><br>
<P>
When testing the 16-bit or 32-bit libraries, there is a need to be able to
generate character code points greater than 255 in the strings that are passed
to the library. For subject lines, backslash escapes can be used. In addition,
when the <b>utf</b> modifier (see
<a href="#optionmodifiers">"Setting compilation options"</a>
below) is set, the pattern and any following subject lines are interpreted as
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
</P>
<P>
For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
or 32-bit mode. It causes the pattern and following subject lines to be treated
as UTF-8 according to the original definition (RFC 2279), which allows for
character values up to 0x7fffffff. Each character is placed in one 16-bit or
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
to occur).
</P>
<P>
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
</P>
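<P>
As an illustration only, the <b>utf8_input</b> rules described above can be
modelled in a few lines of Python. This sketch is not taken from the
<b>pcre2test</b> source; the function name <b>decode_utf8_input()</b>, its
<b>width</b> parameter, and the error handling are assumptions made for the
example, and overlong encodings are not checked.
</P>

```python
def decode_utf8_input(data, width=32):
    """Model of utf8_input decoding: one code unit value per character.

    Hypothetical helper, not pcre2test code. It follows the original UTF-8
    definition (RFC 2279): sequences of up to 6 bytes, values up to 0x7fffffff.
    """
    if width not in (16, 32):
        raise ValueError("utf8_input is allowed only in 16-bit or 32-bit mode")
    limit = 0xFFFF if width == 16 else 0xFFFFFFFF
    units = []
    i = 0
    while i < len(data):
        extra = 0
        if data[i] == 0xFF:
            # A 0xff prefix adds 0x80000000 to the following character.
            extra = 0x80000000
            i += 1
            if i == len(data):
                raise ValueError("0xff must be followed by a character")
        lead = data[i]
        if lead < 0x80:
            count, value = 0, lead
        elif 0xC0 <= lead <= 0xDF:
            count, value = 1, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            count, value = 2, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            count, value = 3, lead & 0x07
        elif 0xF8 <= lead <= 0xFB:
            count, value = 4, lead & 0x03
        elif 0xFC <= lead <= 0xFD:
            count, value = 5, lead & 0x01
        else:
            raise ValueError("invalid lead byte 0x%02x" % lead)
        tail = data[i + 1:i + 1 + count]
        if len(tail) != count or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError("malformed UTF-8 sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        value += extra
        if value > limit:
            # In the 16-bit case, values greater than 0xffff are an error.
            raise ValueError("value too large for a %d-bit code unit" % width)
        units.append(value)
        i += 1 + count
    return units
```

<P>
In this model, the six-byte sequence FD BF BF BF BF BF yields 0x7fffffff, and
the two bytes FF 41 yield 0x80000041 when <b>width</b> is 32.
</P>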
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
@@ -123,8 +153,13 @@ the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
</P>
<P>
<b>-ac</b>
Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
automatic callouts into every pattern that is compiled.
</P>
<P>
<b>-b</b>
Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
Behave as if each pattern has the <b>fullbincode</b> modifier; the full
internal binary form of the pattern is output after compilation.
</P>
<P>
@@ -155,12 +190,13 @@ following options output the value and set the exit code as indicated:
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
<pre>
  ebcdic     compiled for an EBCDIC environment
  jit        just-in-time support is available
  pcre2-16   the 16-bit library was built
  pcre2-32   the 32-bit library was built
  pcre2-8    the 8-bit library was built
  unicode    Unicode support is available
  backslash-C  \C is supported (not locked out)
  ebcdic       compiled for an EBCDIC environment
  jit          just-in-time support is available
  pcre2-16     the 16-bit library was built
  pcre2-32     the 32-bit library was built
  pcre2-8      the 8-bit library was built
  unicode      Unicode support is available
</pre>
If an unknown option is given, an error message is output; the exit code is 0.
</P>
@@ -177,12 +213,19 @@ using the <b>pcre2_dfa_match()</b> function instead of the default
<b>pcre2_match()</b>.
</P>
<P>
<b>-error</b> <i>number[,number,...]</i>
Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
comma-separated list, display the resulting messages on the standard output,
then exit with zero exit code. The numbers may be positive or negative. This is
a convenience facility for PCRE2 maintainers.
</P>
<P>
<b>-help</b>
Output a brief summary of these options and then exit.
</P>
<P>
<b>-i</b>
Behave as if each pattern has the <b>/info</b> modifier; information about the
Behave as if each pattern has the <b>info</b> modifier; information about the
compiled pattern is given after compilation.
</P>
<P>
@@ -265,9 +308,9 @@ Each subject line is matched separately and independently. If you want to do
multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
etc., depending on the newline setting) in a single line of input to encode the
newline sequences. There is no limit on the length of subject lines; the input
buffer is automatically extended if it is too small. There is a replication
feature that makes it possible to generate long subject lines without having to
supply them explicitly.
buffer is automatically extended if it is too small. There are replication
features that make it possible to generate long repetitive pattern or subject
lines without having to supply them explicitly.
</P>
<P>
An empty line or the end of the file signals the end of the subject lines for a
@@ -304,6 +347,36 @@ output.
This command is used to load a set of precompiled patterns from a file, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #newline_default [<newline-list>]
</pre>
When PCRE2 is built, a default newline convention can be specified. This
determines which characters and/or character pairs are recognized as indicating
a newline in a pattern or subject string. The default can be overridden when a
pattern is compiled. The standard test files contain tests of various newline
conventions, but the majority of the tests expect a single linefeed to be
recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
</P>
<P>
The #newline_default command specifies a list of newline types that are
acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
ANY (in upper or lower case), for example:
<pre>
  #newline_default LF Any anyCRLF
</pre>
If the default newline is in the list, this command has no effect. Otherwise,
except when testing the POSIX API, a <b>newline</b> modifier that specifies the
first newline convention in the list (LF in the above example) is added to any
pattern that does not already have a <b>newline</b> modifier. If the newline
list is empty, the feature is turned off. This command is present in a number
of the standard test input files.
</P>
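<P>
The decision that <b>#newline_default</b> makes for each pattern can be
summarised as a small Python sketch. This is a model of the rules stated above,
not code from <b>pcre2test</b>; the function <b>newline_to_add()</b> and its
parameters are hypothetical names chosen for the example.
</P>

```python
NEWLINE_TYPES = ("CR", "LF", "CRLF", "ANYCRLF", "ANY")


def newline_to_add(build_default, newline_list, pattern_modifiers, posix=False):
    """Return the newline type to add to a pattern, or None for no change.

    Hypothetical model of #newline_default; all names are assumptions.
    """
    acceptable = [t.upper() for t in newline_list]  # upper or lower case allowed
    for t in acceptable:
        if t not in NEWLINE_TYPES:
            raise ValueError("unknown newline type: " + t)
    if not acceptable:                       # empty list: feature turned off
        return None
    if build_default.upper() in acceptable:  # build default is acceptable
        return None
    if posix:                                # no override when testing POSIX
        return None
    for m in pattern_modifiers:
        if m.split("=", 1)[0] == "newline":  # pattern already sets newline
            return None
    return acceptable[0]                     # first convention in the list
```

<P>
For a library built with CR as the default newline, the list "LF Any anyCRLF"
gives LF for a pattern that has no <b>newline</b> modifier of its own.
</P>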
<P>
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the <b>posix</b> modifier is used when
<b>#newline_default</b> would set a default for the non-POSIX API.
<pre>
  #pattern <modifier-list>
</pre>
@@ -321,9 +394,10 @@ test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
command helps detect tests that are accidentally put in the wrong file.
<pre>
  #pop [<modifiers>]
  #popcopy [<modifiers>]
</pre>
This command is used to manipulate the stack of compiled patterns, as described
in the section entitled "Saving and restoring compiled patterns"
These commands are used to manipulate the stack of compiled patterns, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #save <filename>
@@ -340,12 +414,13 @@ subject lines. Modifiers on a subject line can change these settings.
<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
<P>
Modifier lists are used with both pattern and subject lines. Items in a list
are separated by commas and optional white space. Some modifiers may be given
for both patterns and subject lines, whereas others are valid for one or the
other only. Each modifier has a long name, for example "anchored", and some of
them must be followed by an equals sign and a value, for example, "offset=12".
Modifiers that do not take values may be preceded by a minus sign to turn off a
previous setting.
are separated by commas followed by optional white space. Trailing whitespace
in a modifier list is ignored. Some modifiers may be given for both patterns
and subject lines, whereas others are valid only for one or the other. Each
modifier has a long name, for example "anchored", and some of them must be
followed by an equals sign and a value, for example, "offset=12". Values cannot
contain comma characters, but may contain spaces. Modifiers that do not take
values may be preceded by a minus sign to turn off a previous setting.
</P>
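<P>
The syntax rules above are simple enough to express as a short parser. The
following Python sketch is only an approximation written for this
illustration: <b>parse_modifiers()</b> is a hypothetical name, single-letter
abbreviations are not handled, and modifier names are not validated.
</P>

```python
def parse_modifiers(text):
    """Split a modifier list into (name, value, enabled) tuples.

    Hypothetical helper modelling the documented syntax only.
    """
    result = []
    # Trailing whitespace in a modifier list is ignored.
    for item in text.rstrip().split(","):
        item = item.lstrip()  # commas may be followed by optional white space
        if not item:
            raise ValueError("empty modifier")
        name, sep, value = item.partition("=")
        if sep:
            # Values cannot contain commas, but may contain spaces.
            if name.startswith("-"):
                raise ValueError("only valueless modifiers can be negated")
            result.append((name, value, True))
        elif name.startswith("-"):
            # A leading minus turns off a previous setting.
            result.append((name[1:], None, False))
        else:
            result.append((name, None, True))
    return result
```

<P>
With this model, the list "anchored, offset=12,-notbol" produces three entries:
<b>anchored</b> enabled, <b>offset</b> with the value "12", and <b>notbol</b>
turned off.
</P>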
<P>
A few of the more common modifiers can also be specified as single letters, for
@@ -454,6 +529,12 @@ the start of a modifier list. For example:
<pre>
  abc\=notbol,notempty
</pre>
If the subject string is empty and \= is followed by whitespace, the line is
treated as a comment line, and is not used for matching. For example:
<pre>
  \= This is a comment.
  abc\= This is an invalid modifier list.
</pre>
A backslash followed by any other non-alphanumeric character just escapes that
character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
@@ -462,10 +543,10 @@ a real empty line terminates the data input.
</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P>
There are three types of modifier that can appear in pattern lines, two of
which may also be used in a <b>#pattern</b> command. A pattern's modifier list
can add to or override default modifiers that were set by a previous
<b>#pattern</b> command.
There are several types of modifier that can appear in pattern lines. Except
where noted below, they may also be used in <b>#pattern</b> commands. A
pattern's modifier list can add to or override default modifiers that were set
by a previous <b>#pattern</b> command.
<a name="optionmodifiers"></a></P>
<br><b>
Setting compilation options
@@ -473,12 +554,13 @@ Setting compilation options
<P>
The following modifiers set options for <b>pcre2_compile()</b>. The most common
ones have single-letter abbreviations. See
<a href="pcreapi.html"><b>pcreapi</b></a>
<a href="pcre2api.html"><b>pcre2api</b></a>
for a description of their effects.
<pre>
      allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
      alt_bsux                  set PCRE2_ALT_BSUX
      alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
      alt_verbnames             set PCRE2_ALT_VERBNAMES
      anchored                  set PCRE2_ANCHORED
      auto_callout              set PCRE2_AUTO_CALLOUT
  /i  caseless                  set PCRE2_CASELESS
@@ -499,12 +581,15 @@ for a description of their effects.
      no_utf_check              set PCRE2_NO_UTF_CHECK
      ucp                       set PCRE2_UCP
      ungreedy                  set PCRE2_UNGREEDY
      use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
      utf                       set PCRE2_UTF
</pre>
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
non-printing characters in output strings to be printed using the \x{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets.
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
@@ -519,18 +604,24 @@ about the pattern:
      debug                     same as info,fullbincode
      fullbincode               show binary code with lengths
  /I  info                      show info about compiled pattern
      hex                       pattern is coded in hexadecimal
      hex                       unquoted characters are hexadecimal
      jit[=<number>]            use JIT
      jitfast                   use JIT fast path
      jitverify                 verify JIT use
      locale=<name>             use this locale
      max_pattern_length=<n>    set the maximum pattern length
      memory                    show memory used
      newline=<type>            set newline type
      null_context              compile with a NULL context
      parens_nest_limit=<n>     set maximum parentheses depth
      posix                     use the POSIX API
      posix_nosub               use the POSIX API with REG_NOSUB
      push                      push compiled pattern onto the stack
      pushcopy                  push a copy onto the stack
      stackguard=<number>       test the stackguard feature
      tables=[0|1|2]            select internal tables
      use_length                do not zero-terminate the pattern
      utf8_input                treat input as UTF-8
</pre>
The effects of these modifiers are described in the following sections.
</P>
@@ -604,40 +695,145 @@ is requested. For each callout, either its number or string is given, followed
by the item that follows it in the pattern.
</P>
<br><b>
Specifying a pattern in hex
Passing a NULL context
</b><br>
<P>
The <b>hex</b> modifier specifies that the characters of the pattern are to be
interpreted as pairs of hexadecimal digits. White space is permitted between
pairs. For example:
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If
the <b>null_context</b> modifier is set, however, NULL is passed. This is for
testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
default values).
</P>
<br><b>
Specifying the pattern's length
</b><br>
<P>
By default, patterns are passed to the compiling functions as zero-terminated
strings. When using the POSIX wrapper API, there is no other option. However,
when using PCRE2's native API, patterns can be passed by length instead of
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
Using a length happens automatically (whether or not <b>use_length</b> is set)
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
binary zeros.
</P>
<br><b>
Specifying pattern characters in hexadecimal
</b><br>
<P>
The <b>hex</b> modifier specifies that the characters of the pattern, except for
substrings enclosed in single or double quotes, are to be interpreted as pairs
of hexadecimal digits. This feature is provided as a way of creating patterns
that contain binary zeros and other non-printing characters. White space is
permitted between pairs of digits. For example, this pattern contains three
characters:
<pre>
  /ab 32 59/hex
</pre>
This feature is provided as a way of creating patterns that contain binary zero
and other non-printing characters. By default, <b>pcre2test</b> passes patterns
as zero-terminated strings to <b>pcre2_compile()</b>, giving the length as
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
actual length of the pattern is passed.
Parts of such a pattern are taken literally if quoted. This pattern contains
nine characters, only two of which are specified in hexadecimal:
<pre>
  /ab "literal" 32/hex
</pre>
Either single or double quotes may be used. There is no way of including
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
mutually exclusive.
</P>
<P>
The POSIX API cannot be used with patterns specified in hexadecimal because
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
requirement for a zero-terminated string. Such patterns are always passed to
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
</P>
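<P>
To make the <b>hex</b> rules concrete, here is a minimal Python sketch that
turns the text between the pattern delimiters into bytes. It is a model of the
description above rather than the <b>pcre2test</b> implementation;
<b>decode_hex_pattern()</b> is a hypothetical name, and 8-bit input is assumed.
</P>

```python
import string


def decode_hex_pattern(text):
    """Convert the body of a /.../hex pattern into bytes.

    Hypothetical helper: hex pairs become single bytes, quoted substrings
    are taken literally, and white space between items is skipped.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch in "'\"":
            # Either quote may be used; the delimiter cannot appear inside.
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("missing closing quote")
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
            continue
        pair = text[i:i + 2]
        if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
            raise ValueError("expected a pair of hexadecimal digits")
        out.append(int(pair, 16))
        i += 2
    return bytes(out)
```

<P>
Because the result can contain binary zeros, its length has to be carried
separately from the data, which is why such patterns are passed by length.
</P>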
<br><b>
Specifying wide characters in 16-bit and 32-bit modes
</b><br>
<P>
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
can be used. It is mutually exclusive with <b>utf</b>. Input lines are
interpreted as UTF-8 as a means of specifying wide characters. More details are
given in
<a href="#inputencoding">"Input encoding"</a>
above.
</P>
<br><b>
Generating long repetitive patterns
</b><br>
<P>
Some tests use long patterns that are very repetitive. Instead of creating a
very long input line for such a pattern, you can use a special repetition
feature, similar to the one described for subject lines above. If the
<b>expand</b> modifier is present on a pattern, parts of the pattern that have
the form
<pre>
  \[<characters>]{<count>}
</pre>
are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
mutually exclusive.
</P>
<P>
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
the quantifier. For example, \[AB]{6000,6000} is not recognized as an
expansion item.
</P>
<P>
If the <b>info</b> modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
</P>
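<P>
The effect of the <b>expand</b> modifier can be approximated with a regular
expression substitution. The Python sketch below is an assumption-laden model,
not the scanning code used by <b>pcre2test</b>: <b>expand_pattern()</b> is a
hypothetical name, and the non-greedy search is only one plausible reading of
"found later in the pattern".
</P>

```python
import re

# \[<characters>]{<count>} with a single decimal count; a quantifier with two
# values such as {6000,6000} does not match and is left alone.
_EXPANSION = re.compile(r"\\\[(.*?)\]\{(\d+)\}")


def expand_pattern(pattern):
    """Replace each \\[chars]{n} item with chars repeated n times.

    Hypothetical helper; text that does not form a complete item is
    returned unaltered.
    """
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)
```

<P>
Under this model, \[AB]{3} becomes ABABAB, while \[AB]{6000,6000} is passed
through unchanged.
</P>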
<br><b>
JIT compilation
</b><br>
<P>
The <b>/jit</b> modifier may optionally be followed by an equals sign and a
number in the range 0 to 7:
Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
speed up pattern matching. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for details. JIT compiling happens, optionally, after a pattern
has been successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time options
PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
different code is generated for the different cases. See the <b>partial</b>
modifier in "Subject Modifiers"
<a href="#subjectmodifiers">below</a>
for details of how these options are specified for each match attempt.
</P>
<P>
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating
modes are to be compiled:
<pre>
  1  compile JIT code for non-partial matching
  2  compile JIT code for soft partial matching
  4  compile JIT code for hard partial matching
</pre>
The possible values for the <b>jit</b> modifier are therefore:
<pre>
  0  disable JIT
  1  use JIT for normal match only
  2  use JIT for soft partial match only
  3  use JIT for normal match and soft partial match
  4  use JIT for hard partial match only
  6  use JIT for soft and hard partial match
  1  normal matching only
  2  soft partial matching only
  3  normal and soft partial matching
  4  hard partial matching only
  6  soft and hard partial matching only
  7  all three modes
</pre>
If no number is given, 7 is assumed. If JIT compilation is successful, the
compiled JIT code will automatically be used when <b>pcre2_match()</b> is run
for the appropriate type of match, except when incompatible run-time options
are specified. For more details, see the
If no number is given, 7 is assumed. The phrase "partial matching" means a call
to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
subject line, that match will not use JIT code because none was compiled for
non-partial matching.
</P>
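<P>
The bit layout of the <b>jit</b> value can be checked with a few lines of
Python. This is an illustrative model only: the names <b>compiled_modes()</b>
and <b>match_can_use_jit()</b> are hypothetical, and the sketch ignores the
incompatible run-time options that can also prevent JIT code from being used.
</P>

```python
# Bit values taken from the table above.
JIT_MODES = ((1, "normal"), (2, "soft partial"), (4, "hard partial"))


def compiled_modes(jit=7):
    """List the matching modes for which JIT code is compiled.

    Hypothetical helper; 7 is the value assumed when no number is given.
    """
    if not 0 <= jit <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES if jit & bit]


def match_can_use_jit(jit, partial=None):
    """Report whether JIT code exists for a match of the given kind.

    partial is None for a normal match, or "soft" or "hard" when the
    partial modifier is set on the subject line.
    """
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(jit & needed)
```

<P>
For /jit=2, <b>match_can_use_jit(2)</b> is false for a subject line without the
<b>partial</b> modifier, which matches the note above.
</P>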
<P>
If JIT compilation is successful, the compiled JIT code will automatically be
used when an appropriate type of match is run, except when incompatible
run-time options are specified. For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation. See also the <b>jitstack</b> modifier below for a way of
setting the size of the JIT stack.
@@ -661,14 +857,14 @@ code was actually used in the match.
Setting a locale
</b><br>
<P>
The <b>/locale</b> modifier must specify the name of a locale, for example:
The <b>locale</b> modifier must specify the name of a locale, for example:
<pre>
  /pattern/locale=fr_FR
</pre>
The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
character tables for the locale, and this is then passed to
<b>pcre2_compile()</b> when compiling the regular expression. The same tables
are used when matching the following subject lines. The <b>/locale</b> modifier
are used when matching the following subject lines. The <b>locale</b> modifier
applies only to the pattern on which it appears, but can be given in a
<b>#pattern</b> command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.
@@ -677,7 +873,7 @@ character tables are mutually exclusive.
Showing pattern memory
</b><br>
<P>
The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
The <b>memory</b> modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is
@@ -700,30 +896,53 @@ sets its own default of 220, which is required for running the standard test
suite.
</P>
<br><b>
Limiting the pattern length
</b><br>
<P>
The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited).
</P>
<br><b>
Using the POSIX wrapper API
</b><br>
<P>
The <b>/posix</b> modifier causes <b>pcre2test</b> to call PCRE2 via the POSIX
wrapper API rather than its native API. This supports only the 8-bit library.
When the POSIX API is being used, the following pattern modifiers set options
for the <b>regcomp()</b> function:
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
PCRE2 via the POSIX wrapper API rather than its native API. When
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
it does not imply POSIX matching semantics; for more detail see the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. The following pattern modifiers set options for the
<b>regcomp()</b> function:
<pre>
  caseless           REG_ICASE
  multiline          REG_NEWLINE
  no_auto_capture    REG_NOSUB
  dotall             REG_DOTALL     )
  ungreedy           REG_UNGREEDY   ) These options are not part of
  ucp                REG_UCP        )   the POSIX standard
  utf                REG_UTF8       )
</pre>
The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that
is passed to <b>regerror()</b> in the event of a compilation error. For example:
<pre>
  /abc/posix,regerror_buffsize=20
</pre>
This provides a means of testing the behaviour of <b>regerror()</b> when the
buffer is too small for the error message. If this modifier has not been set, a
large buffer is used.
</P>
<P>
The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
below. All other modifiers cause an error.
below. All other modifiers are either ignored, with a warning message, or cause
an error.
</P>
<br><b>
Testing the stack guard feature
</b><br>
<P>
The <b>/stackguard</b> modifier is used to test the use of
The <b>stackguard</b> modifier is used to test the use of
<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
enable stack availability to be checked during compilation (see the
<a href="pcre2api.html"><b>pcre2api</b></a>
@@ -738,7 +957,7 @@ be aborted.
Using alternative character tables
</b><br>
<P>
The value specified for the <b>/tables</b> modifier must be one of the digits 0,
The value specified for the <b>tables</b> modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
@@ -758,17 +977,22 @@ Setting certain match controls
<P>
The following modifiers are really subject modifiers, and are described below.
However, they may be included in a pattern's modifier list, in which case they
are applied to every subject line that is processed with that pattern. They do
not affect the compilation process.
are applied to every subject line that is processed with that pattern. They may
not appear in <b>#pattern</b> commands. These modifiers do not affect the
compilation process.
<pre>
      aftertext            show text after match
      allaftertext         show text after captures
      allcaptures          show all captures
      allusedtext          show all consulted text
  /g  global               global matching
      mark                 show mark values
      replace=<string>     specify a replacement string
      startchar            show starting character when relevant
      aftertext                   show text after match
      allaftertext                show text after captures
      allcaptures                 show all captures
      allusedtext                 show all consulted text
  /g  global                      global matching
      mark                        show mark values
      replace=<string>            specify a replacement string
      startchar                   show starting character when relevant
      substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
      substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
      substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
      substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY
</pre>
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
defaults, set them in a <b>#subject</b> command.
@ -782,13 +1006,17 @@ pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
|
||||
line to contain a new pattern (or a command) instead of a subject line. This
|
||||
facility is used when saving compiled patterns to a file, as described in the
|
||||
section entitled "Saving and restoring compiled patterns"
|
||||
<a href="#saverestore">below.</a>
|
||||
The <b>push</b> modifier is incompatible with compilation modifiers such as
|
||||
<b>global</b> that act at match time. Any that are specified are ignored, with a
|
||||
warning message, except for <b>replace</b>, which causes an error. Note that
|
||||
<b>jitverify</b>, which is allowed, does not carry through to any subsequent
|
||||
matching that uses this pattern.
|
||||
</P>
|
||||
<a href="#saverestore">below.</a> If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled
|
||||
pattern is stacked, leaving the original as current, ready to match the
|
||||
following input lines. This provides a way of testing the
|
||||
<b>pcre2_code_copy()</b> function.
|
||||
The <b>push</b> and <b>pushcopy</b> modifiers are incompatible with compilation
|
||||
modifiers such as <b>global</b> that act at match time. Any that are specified
|
||||
are ignored (for the stacked copy), with a warning message, except for
|
||||
<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is
|
||||
allowed, does not carry through to any subsequent matching that uses a stacked
|
||||
pattern.
|
||||
<a name="subjectmodifiers"></a></P>
|
||||
<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
|
||||
<P>
|
||||
The modifiers that can appear in subject lines and the <b>#subject</b>
|
||||
@ -806,6 +1034,7 @@ for a description of their effects.
|
||||
anchored set PCRE2_ANCHORED
|
||||
dfa_restart set PCRE2_DFA_RESTART
|
||||
dfa_shortest set PCRE2_DFA_SHORTEST
|
||||
no_jit set PCRE2_NO_JIT
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
notbol set PCRE2_NOTBOL
|
||||
notempty set PCRE2_NOTEMPTY
|
||||
@ -818,11 +1047,11 @@ The partial matching modifiers are provided with abbreviations because they
|
||||
appear frequently in tests.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
|
||||
If the <b>posix</b> modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any effect
|
||||
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
||||
Any other modifiers cause an error.
|
||||
The other modifiers are ignored, with a warning message.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match controls
|
||||
@ -833,33 +1062,44 @@ information. Some of them may also be specified on a pattern line (see above),
|
||||
in which case they apply to every subject line that is matched against that
|
||||
pattern.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
offset=<n> set starting offset
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_error=<n>[:<m>] control callout error
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
The effects of these modifiers are described in the following sections. When
|
||||
matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>,
|
||||
and <b>ovector</b> subject modifiers work as described below. All other
|
||||
modifiers are either ignored, with a warning message, or cause an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Showing more text
|
||||
@ -916,7 +1156,8 @@ The <b>allcaptures</b> modifier requests that the values of all potential
|
||||
captured parentheses be output after a match. By default, only those up to the
|
||||
highest one actually used in the match are output (corresponding to the return
|
||||
code from <b>pcre2_match()</b>). Groups that did not take part in the match
|
||||
are output as "<unset>".
|
||||
are output as "<unset>". This modifier is not relevant for DFA matching (which
|
||||
does no capturing); it is ignored, with a warning message, if present.
|
||||
</P>
|
||||
<br><b>
|
||||
Testing callouts
|
||||
@ -924,15 +1165,22 @@ Testing callouts
|
||||
<P>
|
||||
A callout function is supplied when <b>pcre2test</b> calls the library matching
|
||||
functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
|
||||
set, the current captured groups are output when a callout occurs.
|
||||
set, the current captured groups are output when a callout occurs. The default
|
||||
return from the callout function is zero, which allows matching to continue.
|
||||
</P>
|
||||
<P>
|
||||
The <b>callout_fail</b> modifier can be given one or two numbers. If there is
|
||||
only one number, 1 is returned instead of 0 when a callout of that number is
|
||||
reached. If two numbers are given, 1 is returned when callout <n> is reached
|
||||
for the <m>th time. Note that callouts with string arguments are always given
|
||||
the number zero. See "Callouts" below for a description of the output when a
|
||||
callout is taken.
|
||||
only one number, 1 is returned instead of 0 (causing matching to backtrack)
|
||||
when a callout of that number is reached. If two numbers (<n>:<m>) are given, 1
|
||||
is returned when callout <n> is reached and there have been at least <m>
|
||||
callouts. The <b>callout_error</b> modifier is similar, except that
|
||||
PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
|
||||
aborted. If both these modifiers are set for the same callout number,
|
||||
<b>callout_error</b> takes precedence.
|
||||
</P>
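The return-value rules for <b>callout_fail</b> and <b>callout_error</b> described above can be modelled with a short sketch. This is an illustration of the documented behaviour, not the code used by pcre2test: the function name, the (n, m) tuple representation, and the numeric value chosen for the PCRE2_ERROR_CALLOUT stand-in are all assumptions.

```python
# Stand-in for the library constant PCRE2_ERROR_CALLOUT.
# The numeric value used here is an assumption made for this sketch.
ERROR_CALLOUT = -37


def callout_return(number, count, fail=None, error=None):
    """Model the value a test callout function would return.

    number -- the callout number that has just been reached
    count  -- how many callouts have occurred so far, including this one
    fail   -- None, or an (n, m) tuple modelling callout_fail=<n>[:<m>]
    error  -- None, or an (n, m) tuple modelling callout_error=<n>[:<m>]
    Use m = 0 when only one number was given.
    """
    def triggered(rule):
        if rule is None:
            return False
        n, m = rule
        # The rule fires at callout n once at least m callouts have happened.
        return number == n and count >= m

    # callout_error takes precedence when both apply to the same callout.
    if triggered(error):
        return ERROR_CALLOUT
    if triggered(fail):
        return 1  # non-zero positive return: matching backtracks
    return 0      # default: matching continues
```

For example, with callout_fail=1:3 modelled as fail=(1, 3), the sketch returns 0 at the first two callouts and 1 once callout 1 is reached with a count of three or more.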
|
||||
<P>
|
||||
Note that callouts with string arguments are always given the number zero. See
|
||||
"Callouts" below for a description of the output when a callout is taken.
|
||||
</P>
|
||||
<P>
|
||||
The <b>callout_data</b> modifier can be given an unsigned or a negative number.
|
||||
@ -945,7 +1193,7 @@ Finding all matches in a string
|
||||
</b><br>
|
||||
<P>
|
||||
Searching for all possible matches within a subject can be requested by the
|
||||
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
|
||||
<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The difference
|
||||
between <b>global</b> and <b>altglobal</b> is that the former uses the
|
||||
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
|
||||
@ -996,19 +1244,34 @@ Testing the substitution function
|
||||
</b><br>
|
||||
<P>
|
||||
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
||||
called instead of one of the matching functions. Unlike subject strings,
|
||||
<b>pcre2test</b> does not process replacement strings for escape sequences. In
|
||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
||||
purposes.
|
||||
called instead of one of the matching functions. Note that replacement strings
|
||||
cannot contain commas, because a comma signifies the end of a modifier. This is
|
||||
not thought to be an issue in a test program.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||
the appropriate code unit width. If it is not a valid UTF-8 string, the
|
||||
individual code units are copied directly. This provides a means of passing an
|
||||
invalid UTF-8 string for testing purposes.
|
||||
</P>
|
||||
<P>
|
||||
The following modifiers set options (in addition to the normal match options)
|
||||
for <b>pcre2_substitute()</b>:
|
||||
<pre>
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
After a successful substitution, the modified string is output, preceded by the
|
||||
number of replacements. This may be zero if there were no matches. Here is a
|
||||
simple example of a substitution test:
|
||||
<pre>
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
@ -1016,12 +1279,12 @@ were no matches. Here is a simple example of a substitution test:
|
||||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
</pre>
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer than 256
|
||||
characters) for substitution tests, as fixed-size buffers are used. To make it
|
||||
easy to test for buffer overflow, if the replacement string starts with a
|
||||
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
|
||||
the size of the output buffer, with the replacement string starting at the next
|
||||
character. Here is an example that tests the edge case:
|
||||
<pre>
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
@ -1029,6 +1292,19 @@ is an example that tests the edge case:
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
</pre>
|
||||
The default action of <b>pcre2_substitute()</b> is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
|
||||
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
|
||||
to go through the motions of matching and substituting, in order to compute the
|
||||
size of buffer that is required. When this happens, <b>pcre2test</b> shows the
|
||||
required buffer length (which includes space for the trailing zero) as part of
|
||||
the error message. For example:
|
||||
<pre>
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
|
||||
</pre>
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
<b>pcre2_substitute()</b>.
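The buffer-size arithmetic behind the two examples above can be sketched as follows. This is a rough model that uses Python's own re module rather than PCRE2; the function names are hypothetical, the replacement is treated as literal text, only the first match is replaced (no <b>global</b>), and one Python character is assumed to equal one code unit, which holds for the ASCII examples shown.

```python
import re


def split_buffer_size(replace_value):
    """Split a leading '[n]' from a replace= value.

    Returns (n, rest) when the value starts with a number in square
    brackets, otherwise (None, replace_value).
    """
    m = re.match(r"\[(\d+)\](.*)", replace_value, re.DOTALL)
    if m:
        return int(m.group(1)), m.group(2)
    return None, replace_value


def simulate_substitute(pattern, subject, replace_value):
    """Model a single, non-global substitution into a sized output buffer."""
    size, replacement = split_buffer_size(replace_value)
    # A callable replacement keeps the text literal (no escape processing).
    result = re.sub(pattern, lambda _m: replacement, subject, count=1)
    needed = len(result) + 1  # one extra code unit for the trailing zero
    if size is not None and size < needed:
        # Models the "no more memory" failure; 'needed' is the length that
        # substitute_overflow_length would cause to be reported.
        return ("nomemory", needed)
    return ("ok", result)
```

With the subject 123abc123 and the replacement XYZ, the result is nine characters long, so a buffer of 10 is just large enough and a buffer of 9 is one code unit too small.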
|
||||
@ -1100,6 +1376,16 @@ The <b>offset</b> modifier sets an offset in the subject string at which
|
||||
matching starts. Its value is a number of code units, not characters.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting an offset limit
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match
|
||||
cannot be found starting at or before this offset in the subject, a "no match"
|
||||
return is given. The data value is a number of code units, not characters. When
|
||||
this modifier is used, the <b>use_offset_limit</b> modifier must have been set
|
||||
for the pattern; if not, an error is generated.
|
||||
</P>
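The effect of an offset limit can be illustrated with a small sketch. It uses Python's re module as a stand-in for the PCRE2 matching functions, so it only models the rule that a match must start at or before the limit; the function name is hypothetical, and offsets here count Python characters rather than code units.

```python
import re


def match_with_offset_limit(pattern, subject, offset=0, offset_limit=None):
    """Return the (start, end) span of the leftmost match, or None.

    A match that would start after offset_limit is treated as "no match".
    """
    m = re.compile(pattern).search(subject, offset)
    if m is None:
        return None
    # search() returns the leftmost match, so if it starts beyond the
    # limit, no match can start at or before the limit.
    if offset_limit is not None and m.start() > offset_limit:
        return None
    return m.span()
```

For the subject xxxabc and the pattern abc, a limit of 3 still allows the match at offset 3, whereas a limit of 2 gives "no match".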
|
||||
<br><b>
|
||||
Setting the size of the output vector
|
||||
</b><br>
|
||||
<P>
|
||||
@ -1131,6 +1417,17 @@ this modifier has no effect, as there is no facility for passing a length.)
|
||||
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
</P>
|
||||
<br><b>
|
||||
Passing a NULL context
|
||||
</b><br>
|
||||
<P>
|
||||
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>,
|
||||
<b>pcre2_dfa_match()</b> or <b>pcre2_jit_match()</b>. If the <b>null_context</b>
|
||||
modifier is set, however, NULL is passed. This is for testing that the matching
|
||||
functions behave correctly in this case (they use default values). This
|
||||
modifier cannot be used with the <b>find_limits</b> modifier or when testing the
|
||||
substitution function.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
|
||||
<P>
|
||||
By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
|
||||
@ -1196,7 +1493,7 @@ unset substring is shown as "<unset>", as for the second data line.
|
||||
If the strings contain any non-printing characters, they are output as \xhh
|
||||
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
|
||||
are output as \x{hh...} escapes. See below for the definition of non-printing
|
||||
characters. If the <b>/aftertext</b> modifier is set, the output for substring
|
||||
characters. If the <b>aftertext</b> modifier is set, the output for substring
|
||||
0 is followed by the rest of the subject string, identified by "0+" like
|
||||
this:
|
||||
<pre>
|
||||
@ -1321,7 +1618,9 @@ item to be tested. For example:
|
||||
This output indicates that callout number 0 occurred for a match attempt
|
||||
starting at the fourth character of the subject string, when the pointer was at
|
||||
the seventh character, and when the next pattern item was \d. Just
|
||||
one circumflex is output if the start and current positions are the same.
|
||||
one circumflex is output if the start and current positions are the same, or if
|
||||
the current position precedes the start position, which can happen if the
|
||||
callout is in a lookbehind assertion.
|
||||
</P>
|
||||
<P>
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
||||
@ -1387,7 +1686,7 @@ therefore shown as hex escapes.
|
||||
<P>
|
||||
When <b>pcre2test</b> is outputting text that is a matched part of a subject
|
||||
string, it behaves in the same way, unless a different locale has been set for
|
||||
the pattern (using the <b>/locale</b> modifier). In this case, the
|
||||
the pattern (using the <b>locale</b> modifier). In this case, the
|
||||
<b>isprint()</b> function is used to distinguish printing and non-printing
|
||||
characters.
|
||||
<a name="saverestore"></a></P>
|
||||
@ -1413,11 +1712,16 @@ can be used to test these functions.
|
||||
<P>
|
||||
When a pattern with the <b>push</b> modifier is successfully compiled, it is pushed
|
||||
onto a stack of compiled patterns, and <b>pcre2test</b> expects the next line to
|
||||
contain a new pattern (or command) instead of a subject line. By this means, a
|
||||
number of patterns can be compiled and retained. The <b>push</b> modifier is
|
||||
incompatible with <b>posix</b>, and control modifiers that act at match time are
|
||||
ignored (with a message). The <b>jitverify</b> modifier applies only at compile
|
||||
time. The command
|
||||
contain a new pattern (or command) instead of a subject line. By contrast,
|
||||
the <b>pushcopy</b> modifier causes a copy of the compiled pattern to be
|
||||
stacked, leaving the original available for immediate matching. By using
|
||||
<b>push</b> and/or <b>pushcopy</b>, a number of patterns can be compiled and
|
||||
retained. These modifiers are incompatible with <b>posix</b>, and control
|
||||
modifiers that act at match time are ignored (with a message) for the stacked
|
||||
patterns. The <b>jitverify</b> modifier applies only at compile time.
|
||||
</P>
|
||||
<P>
|
||||
The command
|
||||
<pre>
|
||||
#save <filename>
|
||||
</pre>
|
||||
@ -1434,7 +1738,8 @@ usual by an empty line or end of file. This command may be followed by a
|
||||
modifier list containing only
|
||||
<a href="#controlmodifiers">control modifiers</a>
|
||||
that act after a pattern has been compiled. In particular, <b>hex</b>,
|
||||
<b>posix</b>, and <b>push</b> are not allowed, nor are any
|
||||
<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
|
||||
nor are any
|
||||
<a href="#optionmodifiers">option-setting modifiers.</a>
|
||||
The JIT modifiers are, however, permitted. Here is an example that saves and
|
||||
reloads two patterns.
|
||||
@ -1452,6 +1757,11 @@ reloads two patterns.
|
||||
If <b>jitverify</b> is used with #pop, it does not automatically imply
|
||||
<b>jit</b>, which is different behaviour from when it is used on a pattern.
|
||||
</P>
|
||||
<P>
|
||||
The #popcopy command is analogous to the <b>pushcopy</b> modifier in that it
|
||||
makes current a copy of the topmost stack pattern, leaving the original still
|
||||
on the stack.
|
||||
</P>
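The difference between <b>push</b>, <b>pushcopy</b>, #pop, and #popcopy can be pictured as operations on a stack plus a "current pattern" slot. The sketch below is a conceptual model only: the class and method names are hypothetical, and a dictionary stands in for a compiled pattern.

```python
import copy


class PatternStack:
    """Toy model of the stack of compiled patterns kept by the test program."""

    def __init__(self):
        self.stack = []      # stacked compiled patterns, top at the end
        self.current = None  # pattern that following subject lines would use

    def compile(self, pattern, push=False, pushcopy=False):
        compiled = {"pattern": pattern}  # stand-in for a compiled pattern
        if push:
            # push: the pattern itself is stacked; nothing is left current.
            self.stack.append(compiled)
            self.current = None
        elif pushcopy:
            # pushcopy: a copy is stacked; the original stays current.
            self.stack.append(copy.deepcopy(compiled))
            self.current = compiled
        else:
            self.current = compiled

    def pop(self):
        # #pop: the topmost pattern is removed and becomes current.
        self.current = self.stack.pop()

    def popcopy(self):
        # #popcopy: a copy of the topmost pattern becomes current;
        # the original remains on the stack.
        self.current = copy.deepcopy(self.stack[-1])
```

In this model, #pop shrinks the stack by one, whereas #popcopy leaves its depth unchanged.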
|
||||
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||
@ -1469,9 +1779,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 May 2015
|
||||
Last updated: 28 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||