Update bundled PCRE2-library to version 10.23
Some manual changes made to the library were lost in this update; they will be re-applied in the next commit.
@ -190,6 +190,12 @@ be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
</P>
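The rule in the paragraph above amounts to taking a minimum. A minimal sketch, assuming a hypothetical helper named effective_limit (it is not part of the PCRE2 API) that is given the caller's limit and any limits requested inside the pattern:

```python
def effective_limit(caller_limit, pattern_limits):
    """Return the limit that applies under the rule described above.

    caller_limit:   the value set (or defaulted) by the caller.
    pattern_limits: values requested inside the pattern, in any order.
    Hypothetical helper for illustration only; not a PCRE2 function.
    """
    limit = caller_limit
    for requested in pattern_limits:
        # A setting in the pattern only has an effect if it lowers the limit.
        if requested < limit:
            limit = requested
    return limit
```

With a caller limit of 1000, pattern settings of 500 and 200 give 200, while a pattern setting of 5000 is ignored.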
|
||||
<P>
|
||||
The match limit is used (but in a different way) when JIT is being used, but it
|
||||
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
||||
However, the recursion limit is relevant for DFA matching, which does use some
|
||||
function recursion, in particular, for recursions within the pattern.
|
||||
<a name="newlines"></a></P>
|
||||
<br><b>
|
||||
Newline conventions
|
||||
@ -379,32 +385,31 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
|
||||
code unit following \c has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||
modes.
|
||||
compile-time error occurs.
|
||||
</P>
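The arithmetic described above for an ASCII environment can be sketched in a few lines of Python. The function name control_escape is an assumption made for this illustration; it only mirrors the rule stated in the text and is not how PCRE2 itself is written.

```python
def control_escape(ch):
    """Code value for the escape 'backslash c' followed by ch (ASCII rules)."""
    code = ord(ch)
    if code < 32 or code > 126:
        # Corresponds to the compile-time error described in the text.
        raise ValueError("character after the c escape must be printable ASCII")
    if "a" <= ch <= "z":
        code = ord(ch.upper())  # lower case letters are converted to upper case
    return code ^ 0x40          # invert bit 6 (hex 40)
```

For example, control_escape("A") gives 1 and control_escape("{") gives hex 3B, matching the values quoted above.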
|
||||
<P>
|
||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \@ encodes
|
||||
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||
to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||
\? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
other character provokes a compile-time error. The sequence \c@ encodes
|
||||
character code 0; after \c the letters (in either case) encode characters 1-26
|
||||
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
||||
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
</P>
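The EBCDIC rules above can be summarized as a small lookup. This is a sketch under stated assumptions: the name ebcdic_control_escape and the posix_bc flag are hypothetical, and the Python ord() values are used only as a convenient way to number the permitted characters, not as EBCDIC codes.

```python
# Characters permitted after the c escape in EBCDIC mode, per the text above.
ALLOWED_AFTER_C = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ@[\\]^_?")


def ebcdic_control_escape(ch, posix_bc=False):
    """Character code produced by the c escape in EBCDIC mode (illustrative)."""
    upper = ch.upper()            # letters are accepted in either case
    if upper not in ALLOWED_AFTER_C:
        raise ValueError("character not allowed after the c escape")
    if upper == "?":
        # 95 (hex 5F) in the variant Perl calls POSIX-BC, otherwise 255 (hex FF).
        return 95 if posix_bc else 255
    # "@" gives 0, letters give 1-26, and [ \ ] ^ _ give 27-31.
    return ord(upper) - ord("@")
```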
|
||||
<P>
|
||||
Thus, apart from \?, these escapes generate the same character code values as
|
||||
Thus, apart from \c?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \G always generates code value 7, which is BEL in ASCII
|
||||
differ. For example, \cG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \? generate 95; otherwise it generates 255.
|
||||
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
|
||||
</P>
|
||||
<P>
|
||||
After \0 up to two further octal digits are read. If there are fewer than two
|
||||
@ -526,9 +531,9 @@ by code point, as described in the previous section.
|
||||
Absolute and relative back references
|
||||
</b><br>
|
||||
<P>
|
||||
The sequence \g followed by an unsigned or a negative number, optionally
|
||||
enclosed in braces, is an absolute or relative back reference. A named back
|
||||
reference can be coded as \g{name}. Back references are discussed
|
||||
The sequence \g followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative back reference. A named back reference
|
||||
can be coded as \g{name}. Back references are discussed
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#subpattern">parenthesized subpatterns.</a>
|
||||
@ -669,8 +674,8 @@ This is an example of an "atomic group", details of which are given
|
||||
This particular group matches either the two-character sequence CR followed by
|
||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||
line, U+0085). The two-character sequence is treated as a single unit that
|
||||
cannot be split.
|
||||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
</P>
|
||||
<P>
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
@ -736,6 +741,8 @@ Those that are not part of an identified script are lumped together as
|
||||
"Common". The current list of scripts is:
|
||||
</P>
|
||||
<P>
|
||||
Ahom,
|
||||
Anatolian_Hieroglyphs,
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
@ -776,6 +783,7 @@ Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hatran,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
@ -812,12 +820,14 @@ Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Multani,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Hungarian,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
@ -839,6 +849,7 @@ Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
SignWriting,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
@ -1180,6 +1191,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
|
||||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||
</P>
|
||||
<P>
|
||||
When the newline convention (see
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||
preferred, even if the single characters CR and LF are also recognized as
|
||||
newlines. For example, if the newline convention is "any", a multiline mode
|
||||
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
|
||||
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||
start of the string, of course.)
|
||||
</P>
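The preference for CRLF can be illustrated with a short scan over the subject. This is a simplified sketch, not PCRE2 code: the function name is hypothetical, and only CR, LF, and CRLF are treated as newlines, whereas the "any" convention recognizes further characters.

```python
def multiline_circumflex_positions(subject):
    """Offsets at which a multiline-mode circumflex could match (simplified)."""
    positions = [0]                  # the very start of the subject
    i = 0
    while i < len(subject):
        if subject.startswith("\r\n", i):
            i += 2                   # CRLF is preferred and consumed as one newline
        elif subject[i] in "\r\n":
            i += 1                   # a lone CR or LF is also a newline
        else:
            i += 1
            continue
        if i < len(subject):
            # Assumption: a newline that ends the subject does not start a new line.
            positions.append(i)
    return positions
```

For the subject "abc\r\nxyz" this returns offsets 0 and 5: the position before "xyz" is reported, but offset 4, between CR and LF, is not.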
|
||||
<P>
|
||||
Note that the sequences \A, \Z, and \z can be used to match the start and
|
||||
end of the subject in both modes, and if all branches of a pattern start with
|
||||
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||
@ -1230,20 +1251,32 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
unless the PCRE2_NO_UTF_CHECK option is used).
|
||||
</P>
|
||||
<P>
|
||||
An application can lock out the use of \C by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
|
||||
build PCRE2 with the use of \C permanently disabled.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 does not allow \C to appear in lookbehind assertions
|
||||
<a href="#lookbehind">(described below)</a>
|
||||
in a UTF mode, because this would make it impossible to calculate the length of
|
||||
the lookbehind.
|
||||
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
|
||||
the length of the lookbehind. Neither the alternative matching function
|
||||
<b>pcre2_dfa_match()</b> nor the JIT optimizer supports \C in these UTF modes.
|
||||
The former gives a match-time error; the latter fails to optimize and so the
|
||||
match is always run using the interpreter.
|
||||
</P>
|
||||
<P>
|
||||
In the 32-bit library, however, \C is always supported (when not explicitly
|
||||
locked out) because it always matches a single code unit, whether or not UTF-32
|
||||
is specified.
|
||||
</P>
|
||||
<P>
|
||||
In general, the \C escape sequence is best avoided. However, one way of using
|
||||
it that avoids the problem of malformed UTF characters is to use a lookahead to
|
||||
check the length of the next character, as in this pattern, which could be used
|
||||
with a UTF-8 string (ignore white space and line breaks):
|
||||
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
|
||||
lookahead to check the length of the next character, as in this pattern, which
|
||||
could be used with a UTF-8 string (ignore white space and line breaks):
|
||||
<pre>
|
||||
(?| (?=[\x00-\x7f])(\C) |
|
||||
(?=[\x80-\x{7ff}])(\C)(\C) |
|
||||
@ -1298,42 +1331,6 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||
class such as [^a] always matches one of these characters.
|
||||
</P>
|
||||
<P>
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class, or
|
||||
immediately after a range. For example, [b-d-z] matches letters in the range b
|
||||
to d, a hyphen character, or z.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of range, so [W-\]46] is interpreted as a class containing a range
|
||||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
</P>
|
||||
<P>
|
||||
An error is generated if a POSIX character class (see below) or an escape
|
||||
sequence other than one that defines a single character appears at a point
|
||||
where a range ending character is expected. For example, [z-\xff] is valid,
|
||||
but [A-\d] and [A-[:digit:]] are not.
|
||||
</P>
|
||||
<P>
|
||||
Ranges operate in the collating sequence of character values. They can also be
|
||||
used for characters specified numerically, for example [\000-\037]. Ranges
|
||||
can include any characters that are valid for the current mode.
|
||||
</P>
|
||||
<P>
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\xc8-\xcb] matches accented E
|
||||
characters in both cases.
|
||||
</P>
|
||||
<P>
|
||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
|
||||
\V, \w, and \W may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\dABCDEF] matches any hexadecimal
|
||||
@ -1347,6 +1344,52 @@ are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
</P>
|
||||
<P>
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class,
|
||||
or immediately after a range. For example, [b-d-z] matches letters in the range
|
||||
b to d, a hyphen character, or z.
|
||||
</P>
|
||||
<P>
|
||||
Perl treats a hyphen as a literal if it appears before or after a POSIX class
|
||||
(see below) or a character type escape such as \d, but gives a warning in
|
||||
its warning mode, as this is most likely a user error. As PCRE2 has no facility
|
||||
for warning, an error is given in these cases.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of range, so [W-\]46] is interpreted as a class containing a range
|
||||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
</P>
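The hyphen and closing-bracket cases above can be tried out quickly. The sketch below uses Python's re module purely as a stand-in: it happens to parse these three classes in the way the text describes, but it is not PCRE2 and differs from it elsewhere (for example, it does not raise the errors PCRE2 gives for a hyphen next to \d).

```python
import re

# [b-d-z]: the range b to d, then a literal hyphen, then z.
hyphen_class = re.compile(r"[b-d-z]")
matched = [c for c in "abcd-eyz" if hyphen_class.fullmatch(c)]

# [W-]46]: a class of two characters ("W" and "-") followed by the literal "46]".
unescaped = re.compile(r"[W-]46]")

# [W-\]46]: one class containing the range W to "]" plus the characters 4 and 6.
escaped = re.compile(r"[W-\]46]")

print(matched)
print(bool(unescaped.fullmatch("W46]")), bool(unescaped.fullmatch("-46]")))
print(bool(escaped.fullmatch("X")), bool(escaped.fullmatch("W46]")))
```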
|
||||
<P>
|
||||
Ranges normally include all code points between the start and end characters,
|
||||
inclusive. They can also be used for code points specified numerically, for
|
||||
example [\000-\037]. Ranges can include any characters that are valid for the
|
||||
current mode.
|
||||
</P>
|
||||
<P>
|
||||
There is a special case in EBCDIC environments for ranges whose end points are
|
||||
both specified as literal letters in the same case. For compatibility with
|
||||
Perl, EBCDIC code points within the range that are not letters are omitted. For
|
||||
example, [h-k] matches only four characters, even though the codes for h and k
|
||||
are 0x88 and 0x92, a range of 11 code points. However, if the range is
|
||||
specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
|
||||
are included.
|
||||
</P>
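The [h-k] example can be reproduced with one of the EBCDIC code pages that ships with Python. This is a sketch under assumptions: cp037 is chosen here only because it is available in the standard library, "letter" is taken to mean an unaccented Latin letter, and the helper name is hypothetical.

```python
def ebcdic_letter_range(first, last, codec="cp037"):
    """Letters matched by a literal same-case range such as [h-k] in EBCDIC."""
    low = first.encode(codec)[0]
    high = last.encode(codec)[0]
    members = []
    for code in range(low, high + 1):
        ch = bytes([code]).decode(codec)
        # Keep only plain letters; other code points in the gap are omitted.
        if ch.isascii() and ch.isalpha():
            members.append(ch)
    return members


span = "k".encode("cp037")[0] - "h".encode("cp037")[0] + 1
print(ebcdic_letter_range("h", "k"), span)
```

The range covers 11 code points (0x88 to 0x92), yet only h, i, j, and k are kept.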
|
||||
<P>
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\xc8-\xcb] matches accented E
|
||||
characters in both cases.
|
||||
</P>
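The [W-c] equivalence can be checked by expanding the range by hand. A minimal sketch, assuming ASCII characters only and a hypothetical helper name:

```python
def caseless_range_members(first, last):
    """Characters matched by [first-last] under caseless matching (ASCII only)."""
    members = set()
    for code in range(ord(first), ord(last) + 1):
        ch = chr(code)
        members.add(ch)
        if ch.isalpha():
            members.add(ch.swapcase())  # letters match in either case
    return members


# The class from the text, [][\\^_`wxyzabc], with both cases written out.
expected = set("][\\^_`" + "wxyzabc" + "WXYZABC")
print(caseless_range_members("W", "c") == expected)
```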
|
||||
<P>
|
||||
A circumflex can conveniently be used with the upper case character types to
|
||||
specify a more restricted set of characters than the matching lower case type.
|
||||
For example, the class [^\W_] matches any letter or digit, but not underscore,
|
||||
@ -1514,13 +1557,8 @@ respectively.
|
||||
<P>
|
||||
When one of these option changes occurs at top level (that is, not inside
|
||||
subpattern parentheses), the change applies to the remainder of the pattern
|
||||
that follows. If the change is placed right at the start of a pattern, PCRE2
|
||||
extracts it into the global options (and it will therefore show up in data
|
||||
extracted by the <b>pcre2_pattern_info()</b> function).
|
||||
</P>
|
||||
<P>
|
||||
An option change within a subpattern (see below for a description of
|
||||
subpatterns) affects only that part of the subpattern that follows it, so
|
||||
that follows. An option change within a subpattern (see below for a description
|
||||
of subpatterns) affects only that part of the subpattern that follows it, so
|
||||
<pre>
|
||||
(a(?i)b)c
|
||||
</pre>
|
||||
@ -1649,6 +1687,10 @@ first one in the pattern with the given number. The following pattern matches
|
||||
<pre>
|
||||
/(?|(abc)|(def))(?1)/
|
||||
</pre>
|
||||
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||
of computing an absolute group number.
|
||||
</P>
|
||||
<P>
|
||||
If a
|
||||
<a href="#conditions">condition test</a>
|
||||
for a subpattern's having matched refers to a non-unique number, the test is
|
||||
@ -2051,9 +2093,9 @@ subpattern is possible using named parentheses (see below).
|
||||
</P>
|
||||
<P>
|
||||
Another way of avoiding the ambiguity inherent in the use of digits following a
|
||||
backslash is to use the \g escape sequence. This escape must be followed by an
|
||||
unsigned number or a negative number, optionally enclosed in braces. These
|
||||
examples are all identical:
|
||||
backslash is to use the \g escape sequence. This escape must be followed by a
|
||||
signed or unsigned number, optionally enclosed in braces. These examples are
|
||||
all identical:
|
||||
<pre>
|
||||
(ring), \1
|
||||
(ring), \g1
|
||||
@ -2061,8 +2103,7 @@ examples are all identical:
|
||||
</pre>
|
||||
An unsigned number specifies an absolute reference without the ambiguity that
|
||||
is present in the older syntax. It is also useful when literal digits follow
|
||||
the reference. A negative number is a relative reference. Consider this
|
||||
example:
|
||||
the reference. A signed number is a relative reference. Consider this example:
|
||||
<pre>
|
||||
(abc(def)ghi)\g{-1}
|
||||
</pre>
|
||||
@ -2073,6 +2114,11 @@ can be helpful in long patterns, and also in patterns that are created by
|
||||
joining together fragments that contain references within themselves.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \g{+1} is a reference to the next capturing subpattern. This kind
|
||||
of forward reference can be useful in patterns that repeat. Perl does not
|
||||
support the use of + in this way.
|
||||
</P>
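A relative reference is resolved to an absolute group number from the count of capturing parentheses opened to its left. The sketch below assumes there are no duplicate group numbers; the helper name is hypothetical, and Python's re module is used only to run the resulting absolute reference.

```python
import re


def absolute_group(relative, groups_opened_so_far):
    """Resolve a signed relative group number to an absolute one."""
    if relative < 0:
        number = groups_opened_so_far + relative + 1  # -1 is the most recent group
    elif relative > 0:
        number = groups_opened_so_far + relative      # +1 is the next group to open
    else:
        raise ValueError("a relative reference needs a non-zero signed number")
    if number < 1:
        raise ValueError("reference to a non-existent group")
    return number


# In (abc(def)ghi) two groups are open before the reference, so -1 means group 2.
number = absolute_group(-1, 2)
pattern = re.compile(r"(abc(def)ghi)\%d" % number)
print(number, bool(pattern.fullmatch("abcdefghidef")))
```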
|
||||
<P>
|
||||
A back reference matches whatever actually matched the capturing subpattern in
|
||||
the current subject string, rather than anything matching the subpattern
|
||||
itself (see
|
||||
@ -2172,6 +2218,14 @@ capturing is carried out only for positive assertions. (Perl sometimes, but not
|
||||
always, does do capturing in negative assertions.)
|
||||
</P>
|
||||
<P>
|
||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
||||
succeeds, but failure to match later in the pattern causes backtracking over
|
||||
this assertion, the captures within the assertion are reset only if no higher
|
||||
numbered captures are already set. This is, unfortunately, a fundamental
|
||||
limitation of the current implementation; it may get removed in a future
|
||||
reworking.
|
||||
</P>
|
||||
<P>
|
||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||
it makes no sense to assert the same thing several times, the side effect of
|
||||
capturing parentheses may occasionally be useful. However, an assertion that
|
||||
@ -2268,18 +2322,31 @@ match. If there are insufficient characters before the current position, the
|
||||
assertion fails.
|
||||
</P>
|
||||
<P>
|
||||
In a UTF mode, PCRE2 does not allow the \C escape (which matches a single code
|
||||
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
|
||||
it impossible to calculate the length of the lookbehind. The \X and \R
|
||||
escapes, which can match different numbers of code units, are also not
|
||||
permitted.
|
||||
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
|
||||
single code unit even in a UTF mode) to appear in lookbehind assertions,
|
||||
because it makes it impossible to calculate the length of the lookbehind. The
|
||||
\X and \R escapes, which can match different numbers of code units, are never
|
||||
permitted in lookbehinds.
|
||||
</P>
|
||||
<P>
|
||||
<a href="#subpatternsassubroutines">"Subroutine"</a>
|
||||
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
|
||||
as the subpattern matches a fixed-length string.
|
||||
<a href="#recursion">Recursion,</a>
|
||||
however, is not supported.
|
||||
as the subpattern matches a fixed-length string. However,
|
||||
<a href="#recursion">recursion,</a>
|
||||
that is, a "subroutine" call into a group that is already active,
|
||||
is not supported.
|
||||
</P>
|
||||
<P>
|
||||
Perl does not support back references in lookbehinds. PCRE2 does support them,
|
||||
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
|
||||
must not be set, there must be no use of (?| in the pattern (it creates
|
||||
duplicate subpattern numbers), and if the back reference is by name, the name
|
||||
must be unique. Of course, the referenced subpattern must itself be of fixed
|
||||
length. The following pattern matches words containing at least two characters
|
||||
that begin and end with the same character:
|
||||
<pre>
|
||||
\b(\w)\w++(?<=\1)
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
Possessive quantifiers can be used in conjunction with lookbehind assertions to
|
||||
@ -2417,7 +2484,9 @@ Checking for a used subpattern by name
|
||||
<P>
|
||||
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
|
||||
subpattern by name. For compatibility with earlier versions of PCRE1, which had
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized.
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
|
||||
however, that undelimited names consisting of the letter R followed by digits
|
||||
are ambiguous (see the following section).
|
||||
</P>
|
||||
<P>
|
||||
Rewriting the above example to use a named subpattern gives this:
|
||||
@ -2432,30 +2501,52 @@ matched.
|
||||
Checking for pattern recursion
|
||||
</b><br>
|
||||
<P>
|
||||
If the condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if a recursive call to the whole pattern or any
|
||||
subpattern has been made. If digits or a name preceded by ampersand follow the
|
||||
letter R, for example:
|
||||
"Recursion" in this sense refers to any subroutine-like call from one part of
|
||||
the pattern to another, whether or not it is actually recursive. See the
|
||||
sections entitled
|
||||
<a href="#recursion">"Recursive patterns"</a>
|
||||
and
|
||||
<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
|
||||
below for details of recursion and subpattern calls.
|
||||
</P>
|
||||
<P>
|
||||
If a condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if matching is currently in a recursion or subroutine
|
||||
call to the whole pattern or any subpattern. If digits follow the letter R, and
|
||||
there is no subpattern with that name, the condition is true if the most recent
|
||||
call is into a subpattern with the given number, which must exist somewhere in
|
||||
the overall pattern. This is a contrived example that is equivalent to a+b:
|
||||
<pre>
|
||||
(?(R3)...) or (?(R&name)...)
|
||||
((?(R1)a+|(?1)b))
|
||||
</pre>
|
||||
the condition is true if the most recent recursion is into a subpattern whose
|
||||
number or name is given. This condition does not check the entire recursion
|
||||
stack. If the name used in a condition of this kind is a duplicate, the test is
|
||||
applied to all subpatterns of the same name, and is true if any one of them is
|
||||
the most recent recursion.
|
||||
However, in both cases, if there is a subpattern with a matching name, the
|
||||
condition tests for its being set, as described in the section above, instead
|
||||
of testing for recursion. For example, creating a group with the name R1 by
|
||||
adding (?<R1>) to the above pattern completely changes its meaning.
|
||||
</P>
|
||||
<P>
|
||||
If a name preceded by ampersand follows the letter R, for example:
|
||||
<pre>
|
||||
(?(R&name)...)
|
||||
</pre>
|
||||
the condition is true if the most recent recursion is into a subpattern of that
|
||||
name (which must exist within the pattern).
|
||||
</P>
|
||||
<P>
|
||||
This condition does not check the entire recursion stack. It tests only the
|
||||
current level. If the name used in a condition of this kind is a duplicate, the
|
||||
test is applied to all subpatterns of the same name, and is true if any one of
|
||||
them is the most recent recursion.
|
||||
</P>
|
||||
<P>
|
||||
At "top level", all these recursion test conditions are false.
|
||||
<a href="#recursion">The syntax for recursive patterns</a>
|
||||
is described below.
|
||||
<a name="subdefine"></a></P>
|
||||
<br><b>
|
||||
Defining subpatterns for use by reference only
|
||||
</b><br>
|
||||
<P>
|
||||
If the condition is the string (DEFINE), and there is no subpattern with the
|
||||
name DEFINE, the condition is always false. In this case, there may be only one
|
||||
If the condition is the string (DEFINE), the condition is always false, even if
|
||||
there is a group with the name DEFINE. In this case, there may be only one
|
||||
alternative in the subpattern. It is always skipped if control reaches this
|
||||
point in the pattern; the idea of DEFINE is that it can be used to define
|
||||
subroutines that can be referenced from elsewhere. (The use of
|
||||
@ -2489,7 +2580,8 @@ For example:
|
||||
(?(VERSION>=10.4)yes|no)
|
||||
</pre>
|
||||
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
|
||||
"no" otherwise.
|
||||
"no" otherwise. The fractional part of the version number may not contain more
|
||||
than two digits.
|
||||
</P>
|
||||
<br><b>
|
||||
Assertion conditions
|
||||
@ -2602,6 +2694,21 @@ parentheses preceding the recursion. In other words, a negative number counts
|
||||
capturing parentheses leftwards from the point at which it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Be aware, however, that if
|
||||
<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
|
||||
are in use, relative references refer to the earliest subpattern with the
|
||||
appropriate number. Consider, for example:
|
||||
<pre>
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
</pre>
|
||||
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
|
||||
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||
opened parenthesis has the number 1, but it is the first such group (the (a)
|
||||
group) to which the recursion refers. This would be the same if an absolute
|
||||
reference (?1) was used. In other words, relative references are just a
|
||||
shorthand for computing a group number.
|
||||
</P>
|
||||
<P>
|
||||
It is also possible to refer to subsequently opened parentheses, by writing
|
||||
references such as (?+2). However, these cannot be recursive because the
|
||||
reference is not inside the parentheses that are referenced. They are always
|
||||
@ -2899,14 +3006,36 @@ remarks apply to the PCRE2 features described in this section.
|
||||
</P>
|
||||
<P>
|
||||
The new verbs make use of what was previously invalid syntax: an opening
|
||||
parenthesis followed by an asterisk. They are generally of the form
|
||||
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
|
||||
differently depending on whether or not a name is present. A name is any
|
||||
sequence of characters that does not include a closing parenthesis. The maximum
|
||||
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
|
||||
libraries. If the name is empty, that is, if the closing parenthesis
|
||||
immediately follows the colon, the effect is as if the colon were not there.
|
||||
Any number of these verbs may occur in a pattern.
|
||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
||||
depending on whether or not a name is present.
|
||||
</P>
|
||||
<P>
|
||||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
that does not include a closing parenthesis. The name is not processed in
|
||||
any way, and it is not possible to include a closing parenthesis in the name.
|
||||
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
|
||||
is no longer Perl-compatible.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
|
||||
and only an unescaped closing parenthesis terminates the name. However, the
|
||||
only backslash items that are permitted are \Q, \E, and sequences such as
|
||||
\x{100} that define character code points. Character type escapes such as \d
|
||||
are faulted.
|
||||
</P>
|
||||
<P>
|
||||
A closing parenthesis can be included in a name either as \) or between \Q
|
||||
and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
|
||||
also set, unescaped whitespace in verb names is skipped, and #-comments are
|
||||
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
|
||||
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
|
||||
</P>
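The difference between the default rule and PCRE2_ALT_VERBNAMES can be sketched as a small name scanner. This is an illustration under assumptions, not PCRE2's parser: the function name is hypothetical, and of the permitted backslash items only an escaped closing parenthesis and \Q...\E are handled. Code point escapes such as \x{100}, PCRE2_EXTENDED processing, and the length limit are left out.

```python
def parse_verb_name(text, alt_verbnames=False):
    """Split the text after the colon in (*VERB:NAME) into (name, rest)."""
    if not alt_verbnames:
        # Default: the name is unprocessed and ends at the first ")".
        end = text.index(")")
        return text[:end], text[end + 1:]
    name = []
    i = 0
    while i < len(text):
        if text.startswith("\\Q", i):
            end = text.find("\\E", i + 2)
            if end == -1:
                raise ValueError("unterminated quoted run (not handled here)")
            name.append(text[i + 2:end])   # quoted text is taken literally
            i = end + 2
        elif text[i] == "\\":
            if text[i + 1:i + 2] != ")":
                raise NotImplementedError("escape not covered by this sketch")
            name.append(")")               # an escaped ")" is part of the name
            i += 2
        elif text[i] == ")":
            return "".join(name), text[i + 1:]  # unescaped ")" ends the name
        else:
            name.append(text[i])
            i += 1
    raise ValueError("missing closing parenthesis")
```

By default the text A\)B)C yields the name A\ (backslash included), whereas with alt_verbnames=True it yields A)B.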
|
||||
<P>
|
||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||
parenthesis immediately follows the colon, the effect is as if the colon were
|
||||
not there. Any number of these verbs may occur in a pattern.
|
||||
</P>
|
||||
<P>
|
||||
Since these verbs are specifically related to backtracking, most of them can be
|
||||
@ -3323,9 +3452,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 13 June 2015
|
||||
Last updated: 27 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||