Update bundled PCRE2-library to version 10.23

Some manual changes done to the library were lost with this update.
They will be added in the next commit.
This commit is contained in:
Esa Korhonen
2017-05-29 15:31:42 +03:00
parent 7231563937
commit 36af74cb25
218 changed files with 49218 additions and 26130 deletions

View File

@ -97,6 +97,7 @@ can skip ahead to the CMake section.
pcre2_context.c
pcre2_dfa_match.c
pcre2_error.c
pcre2_find_bracket.c
pcre2_jit_compile.c
pcre2_maketables.c
pcre2_match.c
@ -173,7 +174,11 @@ can skip ahead to the CMake section.
(11) If you want to use the pcre2grep command, compile and link
src/pcre2grep.c; it uses only the basic 8-bit PCRE2 library (it does not
need the pcre2posix library).
need the pcre2posix library). If you have built the PCRE2 library with JIT
support by defining SUPPORT_JIT in src/config.h, you can also define
SUPPORT_PCRE2GREP_JIT, which causes pcre2grep to make use of JIT (unless
it is run with --no-jit). If you define SUPPORT_PCRE2GREP_JIT without
defining SUPPORT_JIT, pcre2grep does not try to make use of JIT.
STACK SIZE IN WINDOWS ENVIRONMENTS
@ -388,4 +393,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
recommended download site.
=============================
Last Updated: 15 June 2015
Last Updated: 13 October 2016

View File

@ -44,7 +44,7 @@ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These can be found in a library called libpcre2posix. Note that this
man page). These can be found in a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.
@ -58,8 +58,8 @@ renamed or pointed at by a link.
If you are using the POSIX interface to PCRE2 and there is already a POSIX
regex library installed on your system, as well as worrying about the regex.h
header file (as mentioned above), you must also take care when linking programs
to ensure that they link with PCRE2's libpcre2posix library. Otherwise they may
pick up the POSIX functions of the same name from the other library.
to ensure that they link with PCRE2's libpcre2-posix library. Otherwise they
may pick up the POSIX functions of the same name from the other library.
One way of avoiding this confusion is to compile PCRE2 with the addition of
-Dregcomp=PCRE2regcomp (and similarly for the other POSIX functions) to the
@ -168,15 +168,12 @@ library. They are also documented in the pcre2build man page.
built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
to disable building the 8-bit library.
. If you want to include support for just-in-time compiling, which can give
large performance improvements on certain platforms, add --enable-jit to the
"configure" command. This support is available only for certain hardware
. If you want to include support for just-in-time (JIT) compiling, which can
give large performance improvements on certain platforms, add --enable-jit to
the "configure" command. This support is available only for certain hardware
architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error.
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
you add --disable-pcre2grep-jit to the "configure" command.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
library, or UTF-32 Unicode character strings in the 32-bit library, you can
@ -207,19 +204,19 @@ library. They are also documented in the pcre2build man page.
--enable-newline-is-crlf, --enable-newline-is-anycrlf, or
--enable-newline-is-any to the "configure" command, respectively.
If you specify --enable-newline-is-cr or --enable-newline-is-crlf, some of
the standard tests will fail, because the lines in the test files end with
LF. Even if the files are edited to change the line endings, there are likely
to be some failures. With --enable-newline-is-anycrlf or
--enable-newline-is-any, many tests should succeed, but there may be some
failures.
. By default, the sequence \R in a pattern matches any Unicode line ending
sequence. This is independent of the option specifying what PCRE2 considers
to be the end of a line (see above). However, the caller of PCRE2 can
restrict \R to match only CR, LF, or CRLF. You can make this the default by
adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").
. In a pattern, the escape sequence \C matches a single code unit, even in a
UTF mode. This can be dangerous because it breaks up multi-code-unit
characters. You can build PCRE2 with the use of \C permanently locked out by
adding --enable-never-backslash-C (note the upper case C) to the "configure"
command. When \C is allowed by the library, individual applications can lock
it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.
. PCRE2 has a counter that limits the depth of nesting of parentheses in a
pattern. This limits the amount of system stack that a pattern uses when it
is compiled. The default is 250, but you can change it by setting, for
@ -249,13 +246,13 @@ library. They are also documented in the pcre2build man page.
sizes in the pcre2stack man page.
. In the 8-bit library, the default maximum compiled pattern size is around
64K. You can increase this by adding --with-link-size=3 to the "configure"
command. PCRE2 then uses three bytes instead of two for offsets to different
parts of the compiled pattern. In the 16-bit library, --with-link-size=3 is
the same as --with-link-size=4, which (in both libraries) uses four-byte
offsets. Increasing the internal link size reduces performance in the 8-bit
and 16-bit libraries. In the 32-bit library, the link size setting is
ignored, as 4-byte offsets are always used.
64K bytes. You can increase this by adding --with-link-size=3 to the
"configure" command. PCRE2 then uses three bytes instead of two for offsets
to different parts of the compiled pattern. In the 16-bit library,
--with-link-size=3 is the same as --with-link-size=4, which (in both
libraries) uses four-byte offsets. Increasing the internal link size reduces
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used.
. You can build PCRE2 so that its internal match() function that is called from
pcre2_match() does not call itself recursively. Instead, it uses memory
@ -317,6 +314,14 @@ library. They are also documented in the pcre2build man page.
running "make" to build PCRE2. There is more information about coverage
reporting in the "pcre2build" documentation.
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
you add --disable-pcre2grep-jit to the "configure" command.
. On non-Windows sytems there is support for calling external scripts during
matching in the pcre2grep command via PCRE2's callout facility with string
arguments. This support can be disabled by adding --disable-pcre2grep-callout
to the "configure" command.
. The pcre2grep program currently supports only 8-bit data files, and so
requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
@ -327,12 +332,23 @@ library. They are also documented in the pcre2build man page.
Of course, the relevant libraries must be installed on your system.
. The default size (in bytes) of the internal buffer used by pcre2grep can be
set by, for example:
. The default starting size (in bytes) of the internal buffer used by pcre2grep
can be set by, for example:
--with-pcre2grep-bufsize=51200
The value must be a plain integer. The default is 20480.
The value must be a plain integer. The default is 20480. The amount of memory
used by pcre2grep is actually three times this number, to allow for "before"
and "after" lines. If very long lines are encountered, the buffer is
automatically enlarged, up to a fixed maximum size.
. The default maximum size of pcre2grep's internal buffer can be set by, for
example:
--with-pcre2grep-max-bufsize=2097152
The default is either 1048576 or the value of --with-pcre2grep-bufsize,
whichever is the larger.
. It is possible to compile pcre2test so that it links with the libreadline
or libedit libraries, by specifying, respectively,
@ -357,6 +373,22 @@ library. They are also documented in the pcre2build man page.
tgetflag, or tgoto, this is the problem, and linking with the ncurses library
should fix it.
. There is a special option called --enable-fuzz-support for use by people who
want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
be built, but not installed. This contains a single function called
LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
length of the string. When called, this function tries to compile the string
as a pattern, and if that succeeds, to match it. This is done both with no
options and with some random options bits that are generated from the string.
Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
be created. This is normally run under valgrind or used when PCRE2 is
compiled with address sanitizing enabled. It calls the fuzzing function and
outputs information about it is doing. The input strings are specified by
arguments: if an argument starts with "=" the rest of it is a literal input
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.
The "configure" script builds the following files for the basic C library:
. Makefile the makefile that builds the library
@ -531,7 +563,7 @@ script creates the .txt and HTML forms of the documentation from the man pages.
Testing PCRE2
------------
-------------
To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
There is another script called RunGrepTest that tests the pcre2grep command.
@ -724,6 +756,7 @@ The distribution should contain the files listed below.
src/pcre2_context.c )
src/pcre2_dfa_match.c )
src/pcre2_error.c )
src/pcre2_find_bracket.c )
src/pcre2_jit_compile.c )
src/pcre2_jit_match.c ) sources for the functions in the library,
src/pcre2_jit_misc.c ) and some internal functions that they use
@ -744,6 +777,7 @@ The distribution should contain the files listed below.
src/pcre2_xclass.c )
src/pcre2_printint.c debugging function that is used by pcre2test,
src/pcre2_fuzzsupport.c function for (optional) fuzzing support
src/config.h.in template for config.h, when built by "configure"
src/pcre2.h.in template for pcre2.h when built by "configure"
@ -801,7 +835,7 @@ The distribution should contain the files listed below.
libpcre2-8.pc.in template for libpcre2-8.pc for pkg-config
libpcre2-16.pc.in template for libpcre2-16.pc for pkg-config
libpcre2-32.pc.in template for libpcre2-32.pc for pkg-config
libpcre2posix.pc.in template for libpcre2posix.pc for pkg-config
libpcre2-posix.pc.in template for libpcre2-posix.pc for pkg-config
ltmain.sh file used to build a libtool script
missing ) common stub for a few missing GNU programs while
) installing, generated by automake
@ -832,4 +866,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 24 April 2015
Last updated: 01 November 2016

View File

@ -91,6 +91,12 @@ in the library.
<tr><td><a href="pcre2_callout_enumerate.html">pcre2_callout_enumerate</a></td>
<td>&nbsp;&nbsp;Enumerate callouts in a compiled pattern</td></tr>
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern</td></tr>
<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern and its character tables</td></tr>
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
<td>&nbsp;&nbsp;Free a compiled pattern</td></tr>
@ -210,9 +216,15 @@ in the library.
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
<td>&nbsp;&nbsp;Set the match limit</td></tr>
<tr><td><a href="pcre2_set_max_pattern_length.html">pcre2_set_max_pattern_length</a></td>
<td>&nbsp;&nbsp;Set the maximum length of pattern</td></tr>
<tr><td><a href="pcre2_set_newline.html">pcre2_set_newline</a></td>
<td>&nbsp;&nbsp;Set the newline convention</td></tr>
<tr><td><a href="pcre2_set_offset_limit.html">pcre2_set_offset_limit</a></td>
<td>&nbsp;&nbsp;Set the offset limit</td></tr>
<tr><td><a href="pcre2_set_parens_nest_limit.html">pcre2_set_parens_nest_limit</a></td>
<td>&nbsp;&nbsp;Set the parentheses nesting limit</td></tr>

View File

@ -126,8 +126,10 @@ running redundant checks.
<P>
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
problems, because it may leave the current matching point in the middle of a
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
lock out the use of \C, causing a compile-time error if it is encountered.
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
application to lock out the use of \C, causing a compile-time error if it is
encountered. It is also possible to build PCRE2 with the use of \C permanently
disabled.
</P>
<P>
Another way that performance can be hit is by running a pattern that has a very
@ -187,7 +189,7 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 13 April 2015
Last updated: 16 October 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>

View File

@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_code_copy specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_code_copy man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching. The
pointer to the character tables is copied, not the tables themselves (see
<b>pcre2_code_copy_with_tables()</b>). The yield of the function is NULL if
<i>code</i> is NULL or if sufficient memory cannot be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

View File

@ -0,0 +1,44 @@
<html>
<head>
<title>pcre2_code_copy_with_tables specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_code_copy_with_tables man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching.
Unlike <b>pcre2_code_copy()</b>, a separate copy of the character tables is also
made, with the new code pointing to it. This memory will be automatically freed
when <b>pcre2_code_free()</b> is called. The yield of the function is NULL if
<i>code</i> is NULL or if sufficient memory cannot be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

View File

@ -19,7 +19,7 @@ SYNOPSIS
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_code_free(pcre2_code *<i>code</i>);</b>
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
</P>
<br><b>
DESCRIPTION

View File

@ -45,8 +45,8 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
<i>wscount</i> Number of elements in the vector
</pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
up a callout function. The <i>length</i> and <i>startoffset</i> values are code
units, not characters. The options are:
up a callout function or specify the recursion limit. The <i>length</i> and
<i>startoffset</i> values are code units, not characters. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line

View File

@ -35,7 +35,10 @@ errors are negative numbers. The arguments are:
<i>bufflen</i> the length of the buffer (code units)
</pre>
The function returns the length of the message, excluding the trailing zero, or
a negative error code if the buffer is too small.
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If <i>errorcode</i> does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the

View File

@ -19,7 +19,7 @@ SYNOPSIS
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>

View File

@ -19,8 +19,8 @@ SYNOPSIS
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_match_data_create_from_pattern(const pcre2_code *<i>code</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION

View File

@ -42,19 +42,20 @@ request are as follows:
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
0 nothing set
1 first code unit is set
2 start of string or after newline
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \C
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
exist in the pattern
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
0 nothing set
1 code unit is set
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
empty string, 0 otherwise
PCRE2_INFO_MATCHLIMIT Match limit if set,
@ -62,8 +63,8 @@ request are as follows:
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
lookbehind assertion
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
PCRE2_INFO_NAMECOUNT Number of named subpatterns
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
PCRE2_INFO_NAMETABLE Pointer to name table
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
PCRE2_NEWLINE_CR

View File

@ -20,7 +20,7 @@ SYNOPSIS
</P>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, const uint32_t *<i>bytes</i>,</b>
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>

View File

@ -19,8 +19,8 @@ SYNOPSIS
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int32_t pcre2_serialize_encode(pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, uint32_t **<i>serialized_bytes</i>,</b>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>

View File

@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_set_max_pattern_length specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_max_pattern_length man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets, in a compile context, the maximum text length (in code
units) of the pattern that can be compiled. The result is always zero. If a
longer pattern is passed to <b>pcre2_compile()</b> there is an immediate error
return. The default is effectively unlimited, being the largest value a
PCRE2_SIZE variable can hold.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

View File

@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_offset_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_offset_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the offset limit field in a match context. The result is
always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

View File

@ -59,20 +59,25 @@ units, not characters, as is the contents of the variable pointed at by
<i>outlengthptr</i>, which is updated to the actual length of the new string.
The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject or replacement for
UTF validity (only relevant if PCRE2_UTF
was set at compile time)
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
subject is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject or replacement
for UTF validity (only relevant if
PCRE2_UTF was set at compile time)
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
</pre>
The function returns the number of substitutions, which may be zero if there
were no matches. The result can be greater than one only when
PCRE2_SUBSTITUTE_GLOBAL is set.
PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the

File diff suppressed because it is too large Load Diff

View File

@ -18,23 +18,26 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
<li><a name="TOC12" href="#SEC12">CREATING CHARACTER TABLES AT BUILD TIME</a>
<li><a name="TOC13" href="#SEC13">USING EBCDIC CODE</a>
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
<li><a name="TOC17" href="#SEC17">INCLUDING DEBUGGING CODE</a>
<li><a name="TOC18" href="#SEC18">DEBUGGING WITH VALGRIND SUPPORT</a>
<li><a name="TOC19" href="#SEC19">CODE COVERAGE REPORTING</a>
<li><a name="TOC20" href="#SEC20">SEE ALSO</a>
<li><a name="TOC21" href="#SEC21">AUTHOR</a>
<li><a name="TOC22" href="#SEC22">REVISION</a>
<li><a name="TOC6" href="#SEC6">DISABLING THE USE OF \C</a>
<li><a name="TOC7" href="#SEC7">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC11" href="#SEC11">AVOIDING EXCESSIVE STACK USAGE</a>
<li><a name="TOC12" href="#SEC12">LIMITING PCRE2 RESOURCE USAGE</a>
<li><a name="TOC13" href="#SEC13">CREATING CHARACTER TABLES AT BUILD TIME</a>
<li><a name="TOC14" href="#SEC14">USING EBCDIC CODE</a>
<li><a name="TOC15" href="#SEC15">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
<li><a name="TOC16" href="#SEC16">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC17" href="#SEC17">PCRE2GREP BUFFER SIZE</a>
<li><a name="TOC18" href="#SEC18">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
<li><a name="TOC19" href="#SEC19">INCLUDING DEBUGGING CODE</a>
<li><a name="TOC20" href="#SEC20">DEBUGGING WITH VALGRIND SUPPORT</a>
<li><a name="TOC21" href="#SEC21">CODE COVERAGE REPORTING</a>
<li><a name="TOC22" href="#SEC22">SUPPORT FOR FUZZERS</a>
<li><a name="TOC23" href="#SEC23">SEE ALSO</a>
<li><a name="TOC24" href="#SEC24">AUTHOR</a>
<li><a name="TOC25" href="#SEC25">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
<P>
@ -148,13 +151,19 @@ properties. The application can request that they do by setting the PCRE2_UCP
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
request this by starting with (*UCP).
</P>
<br><a name="SEC6" href="#TOC1">DISABLING THE USE OF \C</a><br>
<P>
The \C escape sequence, which matches a single code unit, even in a UTF mode,
can cause unpredictable behaviour because it may leave the current matching
point in the middle of a multi-code-unit character. It can be locked out by
setting the PCRE2_NEVER_BACKSLASH_C option.
point in the middle of a multi-code-unit character. The application can lock it
out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
<b>pcre2_compile()</b>. There is also a build-time option
<pre>
--enable-never-backslash-C
</pre>
(note the upper case C) which locks out the use of \C entirely.
</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
<pre>
@ -171,7 +180,7 @@ pcre2grep automatically makes use of it, unless you add
</pre>
to the "configure" command.
</P>
<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
<br><a name="SEC8" href="#TOC1">NEWLINE RECOGNITION</a><br>
<P>
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can
@ -208,7 +217,7 @@ Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system.
</P>
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
<br><a name="SEC9" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
By default, the sequence \R in a pattern matches any Unicode newline sequence,
independently of what has been selected as the line ending sequence. If you
@ -220,7 +229,7 @@ the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden by applications that use the
called.
</P>
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
<P>
Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
@ -239,7 +248,7 @@ longer offsets slows down the operation of PCRE2 because it has to load
additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
</P>
<br><a name="SEC10" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
<br><a name="SEC11" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
<P>
When matching with the <b>pcre2_match()</b> function, PCRE2 implements
backtracking by making recursive calls to an internal function called
@ -261,7 +270,7 @@ custom memory management functions can be called instead. PCRE2 runs noticeably
more slowly when built in this way. This option affects only the
<b>pcre2_match()</b> function; it is not relevant for <b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
<br><a name="SEC12" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
<P>
Internally, PCRE2 has a function called <b>match()</b>, which it calls
repeatedly (sometimes recursively) when matching a pattern with the
@ -290,7 +299,7 @@ constraints. However, you can set a lower limit by adding, for example,
</pre>
to the <b>configure</b> command. This value can also be overridden at run time.
</P>
<br><a name="SEC12" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
<br><a name="SEC13" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
<P>
PCRE2 uses fixed tables for processing characters whose code points are less
than 256. By default, PCRE2 is built with a set of tables that are distributed
@ -307,7 +316,7 @@ compiling, because <b>dftables</b> is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
hand".)
</P>
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
<br><a name="SEC14" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE2 assumes by default that it will run in an environment where the character
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
@ -342,7 +351,16 @@ The options that select newline behaviour, such as --enable-newline-is-cr,
and equivalent run-time options, refer to these character values in an EBCDIC
environment.
</P>
<br><a name="SEC14" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
<br><a name="SEC15" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
<P>
By default, on non-Windows systems, <b>pcre2grep</b> supports the use of
callouts with string arguments within the patterns it is matching, in order to
run external scripts. For details, see the
<a href="pcre2grep.html"><b>pcre2grep</b></a>
documentation. This support can be disabled by adding
--disable-pcre2grep-callout to the <b>configure</b> command.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
<P>
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
@ -355,22 +373,25 @@ to the <b>configure</b> command. These options naturally require that the
relevant libraries are installed on your system. Configuration will fail if
they are not.
</P>
<br><a name="SEC15" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
<br><a name="SEC17" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
<P>
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
finds a match. The size of the buffer is controlled by a parameter whose
default value is 20K. The buffer itself is three times this size, but because
of the way it is used for holding "before" lines, the longest line that is
guaranteed to be processable is the parameter size. You can change the default
parameter value by adding, for example,
finds a match. The starting size of the buffer is controlled by a parameter
whose default value is 20K. The buffer itself is three times this size, but
because of the way it is used for holding "before" lines, the longest line that
is guaranteed to be processable is the parameter size. If a longer line is
encountered, <b>pcre2grep</b> automatically expands the buffer, up to a
specified maximum size, whose default is 1M or the starting size, whichever is
the larger. You can change the default parameter values by adding, for example,
<pre>
--with-pcre2grep-bufsize=50K
--with-pcre2grep-bufsize=51200
--with-pcre2grep-max-bufsize=2097152
</pre>
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override this
value by using --buffer-size on the command line..
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override
these values by using --buffer-size and --max-buffer-size on the command line.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<br><a name="SEC18" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
If you add one of
<pre>
@ -404,7 +425,7 @@ automatically included, you may need to add something like
</pre>
immediately before the <b>configure</b> command.
</P>
<br><a name="SEC17" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
<br><a name="SEC19" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
<P>
If you add
<pre>
@ -413,7 +434,7 @@ If you add
to the <b>configure</b> command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
</P>
<br><a name="SEC18" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<br><a name="SEC20" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
If you add
<pre>
@ -423,7 +444,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
<br><a name="SEC19" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<br><a name="SEC21" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
code coverage report for its test suite. To enable this, you must install
@ -480,11 +501,32 @@ This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation.
</P>
<br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC22" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
<P>
There is a special option for use by people who want to run fuzzing tests on
PCRE2:
<pre>
--enable-fuzz-support
</pre>
At present this applies only to the 8-bit library. If set, it causes an extra
library called libpcre2-fuzzsupport.a to be built, but not installed. This
contains a single function called LLVMFuzzerTestOneInput() whose arguments are
a pointer to a string and the length of the string. When called, this function
tries to compile the string as a pattern, and if that succeeds, to match it.
This is done both with no options and with some random options bits that are
generated from the string. Setting --enable-fuzz-support also causes a binary
called <b>pcre2fuzzcheck</b> to be created. This is normally run under valgrind
or used when PCRE2 is compiled with address sanitizing enabled. It calls the
fuzzing function and outputs information about it is doing. The input strings
are specified by arguments: if an argument starts with "=" the rest of it is a
literal input string. Otherwise, it is assumed to be a file name, and the
contents of the file are the test string.
</P>
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
</P>
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC24" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -493,11 +535,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<br><a name="SEC25" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 April 2015
Last updated: 01 November 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -57,11 +57,20 @@ two callout points:
</pre>
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
pattern except for immediately before or after a callout item in the pattern.
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
<pre>
A(?C3)B
</pre>
it is processed as if it were
<pre>
(?C255)A(?C3)B(?C255)
</pre>
Here is a more complicated example:
<pre>
A(\d{2}|--)
</pre>
it is processed as if it were
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
<br>
<br>
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
@ -107,10 +116,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
No match
</pre>
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
case, the output changes to this:
(because it is being treated as a++) and therefore the callouts that would be
taken for the backtracks do not occur. You can disable the auto-possessify
feature by passing PCRE2_NO_AUTO_POSSESS to <b>pcre2_compile()</b>, or starting
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
<pre>
---&#62;aaaa
+0 ^ a+
@ -235,8 +244,8 @@ Fields for numerical callouts
<P>
For a numerical callout, <i>callout_string</i> is NULL, and <i>callout_number</i>
contains the number of the callout, in the range 0-255. This is the number
that follows (?C for manual callouts; it is 255 for automatically generated
callouts.
that follows (?C for callouts that part of the pattern; it is 255 for
automatically generated callouts.
</P>
<br><b>
Fields for string callouts
@ -310,10 +319,15 @@ the next item to be matched.
</P>
<P>
The <i>next_item_length</i> field contains the length of the next item to be
matched in the pattern string. When the callout immediately precedes an
alternation bar, a closing parenthesis, or the end of the pattern, the length
is zero. When the callout precedes an opening parenthesis, the length is that
of the entire subpattern.
processed in the pattern string. When the callout is at the end of the pattern,
the length is zero. When the callout precedes an opening parenthesis, the
length includes meta characters that follow the parenthesis. For example, in a
callout before an assertion such as (?=ab) the length is 3. For an an
alternation bar or a closing parenthesis, the length is one, unless a closing
parenthesis is followed by a quantifier, in which case its length is included.
(This changed in release 10.23. In earlier releases, before an opening
parenthesis the length was that of the entire subpattern, and before an
alternation bar or a closing parenthesis the length was zero.)
</P>
<P>
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
@ -399,9 +413,9 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 March 2015
Last updated: 29 September 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -107,7 +107,7 @@ processed as anchored at the point where they are tested.
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are examples where it differs.
same as PCRE2, but there are cases where it differs.
</P>
<P>
11. Most backtracking verbs in assertions have their normal actions. They are
@ -123,7 +123,7 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact the PCRE2
works internally just with numbers, using an external table to translate
between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b)B),
between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B),
where the two capturing parentheses have the same number but different names,
is not supported, and causes an error at compile time. If it were allowed, it
would not be possible to distinguish which parentheses matched, because both
@ -131,10 +131,11 @@ names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
</P>
<P>
14. Perl recognizes comments in some places that PCRE2 does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
Perl allows white space between ( and ? (though current Perls warn that this is
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
14. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a subpattern. If the /x modifier
is set, Perl allowed white space between ( and ? though the latest Perls give
an error (for a while it was just deprecated). There may still be some cases
where Perl behaves differently.
</P>
<P>
15. Perl, when in warning mode, gives warnings for character classes such as
@ -161,42 +162,47 @@ each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
<br>
<br>
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
(b) From PCRE2 10.23, back references to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
non-unique number or name. Perl does not support backreferences in lookbehinds.
<br>
<br>
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
<br>
<br>
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
<br>
<br>
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
<br>
<br>
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
<br>
<br>
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
<br>
<br>
(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
(h) The \R escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
<br>
<br>
(h) The callout facility is PCRE2-specific.
(i) The callout facility is PCRE2-specific.
<br>
<br>
(i) The partial matching facility is PCRE2-specific.
(j) The partial matching facility is PCRE2-specific.
<br>
<br>
(j) The alternative matching function (<b>pcre2_dfa_match()</b> matches in a
(k) The alternative matching function (<b>pcre2_dfa_match()</b> matches in a
different way and is not Perl-compatible.
<br>
<br>
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
(l) PCRE2 recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
</P>
<br><b>
@ -214,9 +220,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 15 March 2015
Last updated: 18 October 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -20,28 +20,31 @@ please consult the man page, in case the conversion went wrong.
*************************************************/
/* This is a demonstration program to illustrate a straightforward way of
calling the PCRE2 regular expression library from a C program. See the
using the PCRE2 regular expression library from a C program. See the
pcre2sample documentation for a short discussion ("man pcre2sample" if you have
the PCRE2 man pages installed). PCRE2 is a revised API for the library, and is
incompatible with the original PCRE API.
There are actually three libraries, each supporting a different code unit
width. This demonstration program uses the 8-bit library.
width. This demonstration program uses the 8-bit library. The default is to
process each code unit as a separate character, but if the pattern begins with
"(*UTF)", both it and the subject are treated as UTF-8 strings, where
characters may occupy multiple code units.
In Unix-like environments, if PCRE2 is installed in your standard system
libraries, you should be able to compile this program using this command:
gcc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
cc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
If PCRE2 is not installed in a standard place, it is likely to be installed
with support for the pkg-config mechanism. If you have pkg-config, you can
compile this program using this command:
gcc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
cc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
If you do not have pkg-config, you may have to use this:
If you do not have pkg-config, you may have to use something like this:
gcc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \
cc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \
-R/usr/local/lib -lpcre2-8 -o pcre2demo
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
@ -56,9 +59,14 @@ the following line. */
/* #define PCRE2_STATIC */
/* This macro must be defined before including pcre2.h. For a program that uses
only one code unit width, it makes it possible to use generic function names
such as pcre2_compile(). */
/* The PCRE2_CODE_UNIT_WIDTH macro must be defined before including pcre2.h.
For a program that uses only one code unit width, setting it to 8, 16, or 32
makes it possible to use generic function names such as pcre2_compile(). Note
that just changing 8 to 16 (for example) is not sufficient to convert this
program to process 16-bit characters. Even in a fully 16-bit environment, where
string-handling functions such as strcmp() and printf() work with 16-bit
characters, the code for handling the table of named substrings will still need
to be modified. */
#define PCRE2_CODE_UNIT_WIDTH 8
@ -79,19 +87,19 @@ int main(int argc, char **argv)
{
pcre2_code *re;
PCRE2_SPTR pattern; /* PCRE2_SPTR is a pointer to unsigned code units of */
PCRE2_SPTR subject; /* the appropriate width (8, 16, or 32 bits). */
PCRE2_SPTR subject; /* the appropriate width (in this case, 8 bits). */
PCRE2_SPTR name_table;
int crlf_is_newline;
int errornumber;
int find_all;
int i;
int namecount;
int name_entry_size;
int rc;
int utf8;
uint32_t option_bits;
uint32_t namecount;
uint32_t name_entry_size;
uint32_t newline;
PCRE2_SIZE erroroffset;
@ -106,15 +114,19 @@ pcre2_match_data *match_data;
* First, sort out the command line. There is only one possible option at *
* the moment, "-g" to request repeated matching to find all occurrences, *
* like Perl's /g option. We set the variable find_all to a non-zero value *
* if the -g option is present. Apart from that, there must be exactly two *
* arguments. *
* if the -g option is present. *
**************************************************************************/
find_all = 0;
for (i = 1; i &lt; argc; i++)
{
if (strcmp(argv[i], "-g") == 0) find_all = 1;
else break;
else if (argv[i][0] == '-')
{
printf("Unrecognised option %s\n", argv[i]);
return 1;
}
else break;
}
/* After the options, we require exactly two arguments, which are the pattern,
@ -122,7 +134,7 @@ and the subject string. */
if (argc - i != 2)
{
printf("Two arguments required: a regex and a subject string\n");
printf("Exactly two arguments required: a regex and a subject string\n");
return 1;
}
@ -201,7 +213,7 @@ if (rc &lt; 0)
stored. */
ovector = pcre2_get_ovector_pointer(match_data);
printf("\nMatch succeeded at offset %d\n", (int)ovector[0]);
printf("Match succeeded at offset %d\n", (int)ovector[0]);
/*************************************************************************
@ -242,7 +254,7 @@ we have to extract the count of named parentheses from the pattern. */
PCRE2_INFO_NAMECOUNT, /* get the number of named substrings */
&amp;namecount); /* where to put the answer */
if (namecount &lt;= 0) printf("No named substrings\n"); else
if (namecount == 0) printf("No named substrings\n"); else
{
PCRE2_SPTR tabptr;
printf("Named substrings\n");
@ -330,8 +342,8 @@ crlf_is_newline = newline == PCRE2_NEWLINE_ANY ||
for (;;)
{
uint32_t options = 0; /* Normally no options */
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
uint32_t options = 0; /* Normally no options */
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
/* If the previous match was for an empty string, we are finished if we are
at the end of the subject. Otherwise, arrange to run another match at the
@ -371,7 +383,7 @@ for (;;)
{
if (options == 0) break; /* All matches found */
ovector[1] = start_offset + 1; /* Advance one code unit */
if (crlf_is_newline &amp;&amp; /* If CRLF is newline &amp; */
if (crlf_is_newline &amp;&amp; /* If CRLF is a newline &amp; */
start_offset &lt; subject_length - 1 &amp;&amp; /* we are at CRLF, */
subject[start_offset] == '\r' &amp;&amp;
subject[start_offset + 1] == '\n')
@ -417,7 +429,7 @@ for (;;)
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
}
if (namecount &lt;= 0) printf("No named substrings\n"); else
if (namecount == 0) printf("No named substrings\n"); else
{
PCRE2_SPTR tabptr = name_table;
printf("Named substrings\n");

View File

@ -22,11 +22,12 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC7" href="#SEC7">NEWLINES</a>
<li><a name="TOC8" href="#SEC8">OPTIONS COMPATIBILITY</a>
<li><a name="TOC9" href="#SEC9">OPTIONS WITH DATA</a>
<li><a name="TOC10" href="#SEC10">MATCHING ERRORS</a>
<li><a name="TOC11" href="#SEC11">DIAGNOSTICS</a>
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
<li><a name="TOC14" href="#SEC14">REVISION</a>
<li><a name="TOC10" href="#SEC10">CALLING EXTERNAL SCRIPTS</a>
<li><a name="TOC11" href="#SEC11">MATCHING ERRORS</a>
<li><a name="TOC12" href="#SEC12">DIAGNOSTICS</a>
<li><a name="TOC13" href="#SEC13">SEE ALSO</a>
<li><a name="TOC14" href="#SEC14">AUTHOR</a>
<li><a name="TOC15" href="#SEC15">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
@ -79,11 +80,19 @@ span line boundaries. What defines a line boundary is controlled by the
</P>
<P>
The amount of memory used for buffering files that are being scanned is
controlled by a parameter that can be set by the <b>--buffer-size</b> option.
The default value for this parameter is specified when <b>pcre2grep</b> is
built, with the default default being 20K. A block of memory three times this
size is used (to allow for buffering "before" and "after" lines). An error
occurs if a line overflows the buffer.
controlled by parameters that can be set by the <b>--buffer-size</b> and
<b>--max-buffer-size</b> options. The first of these sets the size of buffer
that is obtained at the start of processing. If an input file contains very
long lines, a larger buffer may be needed; this is handled by automatically
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
default values for these parameters are specified when <b>pcre2grep</b> is
built, with the default defaults being 20K and 1M respectively. An error occurs
if a line is too long and the buffer can no longer be expanded.
</P>
<P>
The block of memory that is actually used is three times the "buffer size", to
allow for buffering "before" and "after" lines. If the buffer size is too
small, fewer than requested "before" and "after" lines may be output.
</P>
<P>
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
@ -154,12 +163,13 @@ processing of patterns and file names that start with hyphens.
</P>
<P>
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
Output <i>number</i> lines of context after each matching line. If file names
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
guarantees to have up to 8K of following text available for context output.
Output up to <i>number</i> lines of context after each matching line. Fewer
lines are output if the next match or the end of the file is reached, or if the
processing buffer size has been set too small. If file names and/or line
numbers are being output, a hyphen separator is used instead of a colon for the
context lines. A line containing "--" is output between each group of lines,
unless they are in fact contiguous in the input file. The value of <i>number</i>
is expected to be relatively small. When <b>-c</b> is used, <b>-A</b> is ignored.
</P>
<P>
<b>-a</b>, <b>--text</b>
@ -168,12 +178,14 @@ Treat binary files as text. This is equivalent to
</P>
<P>
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
Output <i>number</i> lines of context before each matching line. If file names
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
guarantees to have up to 8K of preceding text available for context output.
Output up to <i>number</i> lines of context before each matching line. Fewer
lines are output if the previous match or the start of the file is within
<i>number</i> lines, or if the processing buffer size has been set too small. If
file names and/or line numbers are being output, a hyphen separator is used
instead of a colon for the context lines. A line containing "--" is output
between each group of lines, unless they are in fact contiguous in the input
file. The value of <i>number</i> is expected to be relatively small. When
<b>-c</b> is used, <b>-B</b> is ignored.
</P>
<P>
<b>--binary-files=</b><i>word</i>
@ -190,8 +202,9 @@ return code.
</P>
<P>
<b>--buffer-size=</b><i>number</i>
Set the parameter that controls how much memory is used for buffering files
that are being scanned.
Set the parameter that controls how much memory is obtained at the start of
processing for buffering files that are being scanned. See also
<b>--max-buffer-size</b> below.
</P>
<P>
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
@ -201,14 +214,16 @@ This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
<P>
<b>-c</b>, <b>--count</b>
Do not output lines from the files that are being scanned; instead output the
number of matches (or non-matches if <b>-v</b> is used) that would otherwise
have caused lines to be shown. By default, this count is the same as the number
of suppressed lines, but if the <b>-M</b> (multiline) option is used (without
<b>-v</b>), there may be more suppressed lines than the number of matches.
number of lines that would have been shown, either because they matched, or, if
<b>-v</b> is set, because they failed to match. By default, this count is
exactly the same as the number of lines that would have been output, but if the
<b>-M</b> (multiline) option is used (without <b>-v</b>), there may be more
suppressed lines than the count (that is, the number of matches).
<br>
<br>
If no lines are selected, the number zero is output. If several files are are
being scanned, a count is output for each of them. However, if the
being scanned, a count is output for each of them and the <b>-t</b> option can
be used to cause a total to be output at the end. However, if the
<b>--files-with-matches</b> option is also used, only those files whose counts
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
<b>-B</b>, and <b>-C</b> options are ignored.
@ -230,12 +245,23 @@ because <b>pcre2grep</b> has to search for all possible matches in a line, not
just one, in order to colour them all.
<br>
<br>
The colour that is used can be specified by setting the environment variable
PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
string of two numbers, separated by a semicolon. They are copied directly into
the control string for setting colour on a terminal, so it is your
responsibility to ensure that they make sense. If neither of the environment
variables is set, the default is "1;31", which gives red.
The colour that is used can be specified by setting one of the environment
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
PCREGREP_COLOR, which are checked in that order. If none of these are set,
<b>pcre2grep</b> looks for GREP_COLORS or GREP_COLOR (in that order). The value
of the variable should be a string of two numbers, separated by a semicolon,
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
followed by two semicolon-separated colours, terminated by the end of the
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
ignored, and GREP_COLOR is checked.
<br>
<br>
If the string obtained from one of the above variables contains any characters
other than semicolon or digits, the setting is ignored and the default colour
is used. The string is copied directly into the control string for setting
colour on a terminal, so it is your responsibility to ensure that the values
make sense. If no relevant environment variable is set, the default is "1;31",
which gives red.
</P>
<P>
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
@ -320,18 +346,18 @@ files; it does not apply to patterns specified by any of the <b>--include</b> or
</P>
<P>
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
Read patterns from the file, one per line, and match them against
each line of input. What constitutes a newline when reading the file is the
operating system's default. The <b>--newline</b> option has no effect on this
option. Trailing white space is removed from each line, and blank lines are
ignored. An empty file contains no patterns and therefore matches nothing. See
also the comments about multiple patterns versus a single pattern with
alternatives in the description of <b>-e</b> above.
Read patterns from the file, one per line, and match them against each line of
input. What constitutes a newline when reading the file is the operating
system's default. The <b>--newline</b> option has no effect on this option.
Trailing white space is removed from each line, and blank lines are ignored. An
empty file contains no patterns and therefore matches nothing. See also the
comments about multiple patterns versus a single pattern with alternatives in
the description of <b>-e</b> above.
<br>
<br>
If this option is given more than once, all the specified files are
read. A data line is output if any of the patterns match it. A file name can
be given as "-" to refer to the standard input. When <b>-f</b> is used, patterns
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
as "-" to refer to the standard input. When <b>-f</b> is used, patterns
specified on the command line using <b>-e</b> may also be present; they are
tested before the file's patterns. However, no other pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
@ -501,19 +527,27 @@ There are no short forms for these options. The default settings are specified
when the PCRE2 library is compiled, with the default default being 10 million.
</P>
<P>
\fB--max-buffer-size=<i>number</i>
This limits the expansion of the processing buffer, whose initial size can be
set by <b>--buffer-size</b>. The maximum buffer size is silently forced to be no
smaller than the starting buffer size.
</P>
<P>
<b>-M</b>, <b>--multiline</b>
Allow patterns to match more than one line. When this option is given, patterns
may usefully contain literal newline characters and internal occurrences of ^
and $ characters. The output for a successful match may consist of more than
one line. The first is the line in which the match started, and the last is the
line in which the match ended. If the matched string ends with a newline
sequence the output ends at the end of that line.
Allow patterns to match more than one line. When this option is set, the PCRE2
library is called in "multiline" mode. This allows a matched string to extend
past the end of a line and continue on one or more subsequent lines. Patterns
used with <b>-M</b> may usefully contain literal newline characters and internal
occurrences of ^ and $ characters. The output for a successful match may
consist of more than one line. The first line is the line in which the match
started, and the last line is the line in which the match ended. If the matched
string ends with a newline sequence, the output ends at the end of that line.
If <b>-v</b> is set, none of the lines in a multi-line match are output. Once a
match has been handled, scanning restarts at the beginning of the line after
the one in which the match ended.
<br>
<br>
When this option is set, the PCRE2 library is called in "multiline" mode.
However, <b>pcre2grep</b> still processes the input line by line. The difference
is that a matched string may extend past the end of a line and continue on
one or more subsequent lines. The newline sequence must be matched as part of
The newline sequence that separates multiple lines must be matched as part of
the pattern. For example, to find the phrase "regular expression" in a file
where "regular" might be at the end of a line and "expression" at the start of
the next line, you could use this command:
@ -526,11 +560,8 @@ well as possibly handling a two-character newline sequence.
<br>
<br>
There is a limit to the number of lines that can be matched, imposed by the way
that <b>pcre2grep</b> buffers the input file as it scans it. However,
<b>pcre2grep</b> ensures that at least 8K characters or the rest of the file
(whichever is the shorter) are available for forward matching, and similarly
the previous 8K characters (or all the previous characters, if fewer than 8K)
are guaranteed to be available for lookbehind assertions. The <b>-M</b> option
that <b>pcre2grep</b> buffers the input file as it scans it. With a sufficiently
large processing buffer, this should not be a problem, but the <b>-M</b> option
does not work when input is read line by line (see \fP--line-buffered\fP.)
</P>
<P>
@ -578,12 +609,13 @@ It should never be needed in normal use.
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
<b>-C</b> options are ignored. If there is more than one match in a line, each
of them is shown separately. If <b>-o</b> is combined with <b>-v</b> (invert the
sense of the match to find non-matching lines), no output is generated, but the
return code is set appropriately. If the matched portion of the line is empty,
nothing is output unless the file name or line number are being printed, in
which case they are shown on an otherwise empty line. This option is mutually
exclusive with <b>--file-offsets</b> and <b>--line-offsets</b>.
of them is shown separately, on a separate line of output. If <b>-o</b> is
combined with <b>-v</b> (invert the sense of the match to find non-matching
lines), no output is generated, but the return code is set appropriately. If
the matched portion of the line is empty, nothing is output unless the file
name or line number are being printed, in which case they are shown on an
otherwise empty line. This option is mutually exclusive with
<b>--file-offsets</b> and <b>--line-offsets</b>.
</P>
<P>
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
@ -597,10 +629,11 @@ capturing parentheses do not exist in the pattern, or were not set in the
match, nothing is output unless the file name or line number are being output.
<br>
<br>
If this option is given multiple times, multiple substrings are output, in the
order the options are given. For example, -o3 -o1 -o3 causes the substrings
matched by capturing parentheses 3 and 1 and then 3 again to be output. By
default, there is no separator (but see the next option).
If this option is given multiple times, multiple substrings are output for each
match, in the order the options are given, and all on one line. For example,
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
then 3 again to be output. By default, there is no separator (but see the next
option).
</P>
<P>
<b>--om-separator</b>=<i>text</i>
@ -631,6 +664,18 @@ quietly skipped. However, the return code is still 2, even if matches were
found in other files.
</P>
<P>
<b>-t</b>, <b>--total-count</b>
This option is useful when scanning more than one file. If used on its own,
<b>-t</b> suppresses all output except for a grand total number of matching
lines (or non-matching lines if <b>-v</b> is used) in all the files. If <b>-t</b>
is used with <b>-c</b>, a grand total is output except when the previous output
is just one line. In other words, it is not output when just one file's count
is listed. If file names are being output, the grand total is preceded by
"TOTAL:". Otherwise, it appears as just another number. The <b>-t</b> option is
ignored when used with <b>-L</b> (list files without matches), because the grand
total would always be zero.
</P>
<P>
<b>-u</b>, <b>--utf-8</b>
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
@ -658,11 +703,12 @@ specified by any of the <b>--include</b> or <b>--exclude</b> options.
<P>
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
Force the patterns to be anchored (each must start matching at the beginning of
a line) and in addition, require them to match entire lines. This is equivalent
to having ^ and $ characters at the start and end of each alternative top-level
branch in every pattern. This option applies only to the patterns that are
matched against the contents of files; it does not apply to patterns specified
by any of the <b>--include</b> or <b>--exclude</b> options.
a line) and in addition, require them to match entire lines. In multiline mode
the match may be more than one line. This is equivalent to having \A and \Z
characters at the start and end of each alternative top-level branch in every
pattern. This option applies only to the patterns that are matched against the
contents of files; it does not apply to patterns specified by any of the
<b>--include</b> or <b>--exclude</b> options.
</P>
<br><a name="SEC6" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
<P>
@ -735,7 +781,57 @@ The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
options does have data, it must be given in the first form, using an equals
character. Otherwise <b>pcre2grep</b> will assume that it has no data.
</P>
<br><a name="SEC10" href="#TOC1">MATCHING ERRORS</a><br>
<br><a name="SEC10" href="#TOC1">CALLING EXTERNAL SCRIPTS</a><br>
<P>
<b>pcre2grep</b> has, by default, support for calling external programs or
scripts during matching by making use of PCRE2's callout facility. However,
this support can be disabled when <b>pcre2grep</b> is built. You can find out
whether your binary has support for callouts by running it with the <b>--help</b>
option. If the support is not enabled, all callouts in patterns are ignored by
<b>pcre2grep</b>.
</P>
<P>
A callout in a PCRE2 pattern is of the form (?C&#60;arg&#62;) where the argument is
either a number or a quoted string (see the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>.
String arguments are parsed as a list of substrings separated by pipe (vertical
bar) characters. The first substring must be an executable name, with the
following substrings specifying arguments:
<pre>
executable_name|arg1|arg2|...
</pre>
Any substring (including the executable name) may contain escape sequences
started by a dollar character: $&#60;digits&#62; or ${&#60;digits&#62;} is replaced by the
captured substring of the given decimal number, which must be greater than
zero. If the number is greater than the number of capturing substrings, or if
the capture is unset, the replacement is empty.
</P>
<P>
Any other character is substituted by itself. In particular, $$ is replaced by
a single dollar and $| is replaced by a pipe character. Here is an example:
<pre>
echo -e "abcde\n12345" | pcre2grep \
'(?x)(.)(..(.))
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
Output:
Arg1: [a] [bcd] [d] Arg2: |a| ()
abcde
Arg1: [1] [234] [4] Arg2: |1| ()
12345
</pre>
The parameters for the <b>execv()</b> system call that is used to run the
program or script are zero-terminated strings. This means that binary zero
characters in the callout argument will cause premature termination of their
substrings, and therefore should not be present. Any syntax errors in the
string (for example, a dollar not followed by another character) cause the
callout to be ignored. If running the program fails for any reason (including
the non-existence of the executable), a local matching failure occurs and the
matcher backtracks in the normal way.
</P>
<br><a name="SEC11" href="#TOC1">MATCHING ERRORS</a><br>
<P>
It is possible to supply a regular expression that takes a very long time to
fail to match certain lines. Such patterns normally involve nested indefinite
@ -751,7 +847,7 @@ overall resource limit; there is a second option called <b>--recursion-limit</b>
that sets a limit on the amount of memory (usually stack) that is used (see the
discussion of these options above).
</P>
<br><a name="SEC11" href="#TOC1">DIAGNOSTICS</a><br>
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P>
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
for syntax errors, overlong lines, non-existent or inaccessible files (even if
@ -759,11 +855,11 @@ matches were found in other files) or too many matching errors. Using the
<b>-s</b> option to suppress error messages about inaccessible files does not
affect the return code.
</P>
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC13" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3).
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3), <b>pcre2callout</b>(3).
</P>
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC14" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -772,11 +868,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
Last updated: 03 January 2015
Last updated: 31 December 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -86,6 +86,13 @@ results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
or a negative error code.
</P>
<P>
There is a limit to the size of pattern that JIT supports, imposed by the size
of machine stack that it uses. The exact rules are not documented because they
may change at any time, in particular, when new optimizations are introduced.
If a pattern is too big, a call to \fBpcre2_jit_compile()\fB returns
PCRE2_ERROR_NOMEMORY.
</P>
<P>
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
@ -145,6 +152,10 @@ PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
PCRE2_ANCHORED option is not supported at match time.
</P>
<P>
If the PCRE2_NO_JIT option is passed to <b>pcre2_match()</b> it disables the
use of JIT, forcing matching by the interpreter code.
</P>
<P>
The only unsupported pattern items are \C (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition
in a conditional group.
@ -224,8 +235,14 @@ whether a match operation was executed by JIT or by the interpreter.
</P>
<P>
You may safely use the same JIT stack for more than one pattern (either by
assigning directly or by callback), as long as the patterns are all matched
sequentially in the same thread. In a multithread application, if you do not
assigning directly or by callback), as long as the patterns are matched
sequentially in the same thread. Currently, the only way to set up
non-sequential matches in one thread is to use callouts: if a callout function
starts another match, that match must use a different JIT stack to the one used
for currently suspended match(es).
</P>
<P>
In a multithread application, if you do not
specify a JIT stack, or if you assign or pass back NULL from a callback, that
is thread-safe, because each thread has its own machine stack. However, if you
assign or pass back a non-NULL JIT stack, this must be a different stack for
@ -390,7 +407,7 @@ The fast path function is called <b>pcre2_jit_match()</b>, and it takes exactly
the same arguments as <b>pcre2_match()</b>. The return values are also the same,
plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
requested that was not compiled. Unsupported option bits (for example,
PCRE2_ANCHORED) are ignored.
PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
</P>
<P>
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
@ -419,9 +436,9 @@ Cambridge, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 November 2014
Last updated: 05 June 2016
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -32,6 +32,11 @@ However, the speed of execution is slower. In the 32-bit library, the internal
linkage size is always 4.
</P>
<P>
The maximum length of a source pattern string is essentially unlimited; it is
the largest number a PCRE2_SIZE variable can hold. However, the program that
calls <b>pcre2_compile()</b> can specify a smaller limit.
</P>
<P>
The maximum length (in code units) of a subject string is one less than the
largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned
integer type, usually defined as size_t. Its maximum value (that is
@ -50,17 +55,16 @@ documentation.
All values in repeating quantifiers must be less than 65536.
</P>
<P>
The maximum length of a lookbehind assertion is 65535 characters.
</P>
<P>
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The limit can
be specified when PCRE2 is built; the default is 250.
</P>
<P>
There is a limit to the number of forward references to subsequent subpatterns
of around 200,000. Repeated forward references with fixed upper limits, for
example, (?2){0,100} when subpattern number 2 is to the right, are included in
the count. There is no limit to the number of backward references.
order to limit the amount of system stack used at compile time. The default
limit can be specified when PCRE2 is built; the default default is 250. An
application can change this limit by calling pcre2_set_parens_nest_limit() to
set the limit in a compile context.
</P>
<P>
The maximum length of name for a named subpattern is 32 code units, and the
@ -68,7 +72,12 @@ maximum number of named subpatterns is 10000.
</P>
<P>
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
32-bit libraries.
</P>
<P>
The maximum length of a string argument to a callout is the largest number a
32-bit unsigned integer can hold.
</P>
<br><b>
AUTHOR
@ -85,9 +94,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 25 November 2014
Last updated: 26 October 2016
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -190,6 +190,12 @@ be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P>
<P>
The match limit is used (but in a different way) when JIT is being used, but it
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
However, the recursion limit is relevant for DFA matching, which does use some
function recursion, in particular, for recursions within the pattern.
<a name="newlines"></a></P>
<br><b>
Newline conventions
@ -379,32 +385,31 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
code unit following \c has a value less than 32 or greater than 126, a
compile-time error occurs. This locks out non-printable ASCII characters in all
modes.
compile-time error occurs.
</P>
<P>
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
generate the appropriate EBCDIC code values. The \c escape is processed
as specified for Perl in the <b>perlebcdic</b> document. The only characters
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
other character provokes a compile-time error. The sequence \@ encodes
character code 0; the letters (in either case) encode characters 1-26 (hex 01
to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
\? becomes either 255 (hex FF) or 95 (hex 5F).
other character provokes a compile-time error. The sequence \c@ encodes
character code 0; after \c the letters (in either case) encode characters 1-26
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P>
<P>
Thus, apart from \?, these escapes generate the same character code values as
Thus, apart from \c?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \G always generates code value 7, which is BEL in ASCII
differ. For example, \cG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
</P>
<P>
The sequence \? generates DEL (127, hex 7F) in an ASCII environment, but
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \? generate 95; otherwise it generates 255.
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
@ -526,9 +531,9 @@ by code point, as described in the previous section.
Absolute and relative back references
</b><br>
<P>
The sequence \g followed by an unsigned or a negative number, optionally
enclosed in braces, is an absolute or relative back reference. A named back
reference can be coded as \g{name}. Back references are discussed
The sequence \g followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative back reference. A named back reference
can be coded as \g{name}. Back references are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a>
@ -669,8 +674,8 @@ This is an example of an "atomic group", details of which are given
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). The two-character sequence is treated as a single unit that
cannot be split.
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
</P>
<P>
In other modes, two additional characters whose codepoints are greater than 255
@ -736,6 +741,8 @@ Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
</P>
<P>
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
@ -776,6 +783,7 @@ Gurmukhi,
Han,
Hangul,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
@ -812,12 +820,14 @@ Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
@ -839,6 +849,7 @@ Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
@ -1180,6 +1191,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
</P>
<P>
When the newline convention (see
<a href="#newlines">"Newline conventions"</a>
below) recognizes the two-character sequence CRLF as a newline, this is
preferred, even if the single characters CR and LF are also recognized as
newlines. For example, if the newline convention is "any", a multiline mode
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
CR, even though CR on its own is a valid newline. (It also matches at the very
start of the string, of course.)
</P>
<P>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
@ -1230,20 +1251,32 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
unless the PCRE2_NO_UTF_CHECK option is used).
</P>
<P>
An application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
build PCRE2 with the use of \C permanently disabled.
</P>
<P>
PCRE2 does not allow \C to appear in lookbehind assertions
<a href="#lookbehind">(described below)</a>
in a UTF mode, because this would make it impossible to calculate the length of
the lookbehind.
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
the length of the lookbehind. Neither the alternative matching function
<b>pcre2_dfa_match()</b> nor the JIT optimizer support \C in these UTF modes.
The former gives a match-time error; the latter fails to optimize and so the
match is always run using the interpreter.
</P>
<P>
In the 32-bit library, however, \C is always supported (when not explicitly
locked out) because it always matches a single code unit, whether or not UTF-32
is specified.
</P>
<P>
In general, the \C escape sequence is best avoided. However, one way of using
it that avoids the problem of malformed UTF characters is to use a lookahead to
check the length of the next character, as in this pattern, which could be used
with a UTF-8 string (ignore white space and line breaks):
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
lookahead to check the length of the next character, as in this pattern, which
could be used with a UTF-8 string (ignore white space and line breaks):
<pre>
(?| (?=[\x00-\x7f])(\C) |
(?=[\x80-\x{7ff}])(\C)(\C) |
@ -1298,42 +1331,6 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
</P>
<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class, or
immediately after a range. For example, [b-d-z] matches letters in the range b
to d, a hyphen character, or z.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
</P>
<P>
An error is generated if a POSIX character class (see below) or an escape
sequence other than one that defines a single character appears at a point
where a range ending character is expected. For example, [z-\xff] is valid,
but [A-\d] and [A-[:digit:]] are not.
</P>
<P>
Ranges operate in the collating sequence of character values. They can also be
used for characters specified numerically, for example [\000-\037]. Ranges
can include any characters that are valid for the current mode.
</P>
<P>
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases.
</P>
<P>
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
\V, \w, and \W may appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any hexadecimal
@ -1347,6 +1344,52 @@ are not special inside a character class. Like any other unrecognized escape
sequences, they cause an error.
</P>
<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches letters in the range
b to d, a hyphen character, or z.
</P>
<P>
Perl treats a hyphen as a literal if it appears before or after a POSIX class
(see below) or a character type escape such as as \d, but gives a warning in
its warning mode, as this is most likely a user error. As PCRE2 has no facility
for warning, an error is given in these cases.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
</P>
<P>
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\000-\037]. Ranges can include any characters that are valid for the
current mode.
</P>
<P>
There is a special case in EBCDIC environments for ranges whose end points are
both specified as literal letters in the same case. For compatibility with
Perl, EBCDIC code points within the range that are not letters are omitted. For
example, [h-k] matches only four characters, even though the codes for h and k
are 0x88 and 0x92, a range of 11 code points. However, if the range is
specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
are included.
</P>
<P>
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases.
</P>
<P>
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not underscore,
@ -1514,13 +1557,8 @@ respectively.
<P>
When one of these option changes occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the pattern
that follows. If the change is placed right at the start of a pattern, PCRE2
extracts it into the global options (and it will therefore show up in data
extracted by the <b>pcre2_pattern_info()</b> function).
</P>
<P>
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it, so
that follows. An option change within a subpattern (see below for a description
of subpatterns) affects only that part of the subpattern that follows it, so
<pre>
(a(?i)b)c
</pre>
@ -1649,6 +1687,10 @@ first one in the pattern with the given number. The following pattern matches
<pre>
/(?|(abc)|(def))(?1)/
</pre>
A relative reference such as (?-1) is no different: it is just a convenient way
of computing an absolute group number.
</P>
<P>
If a
<a href="#conditions">condition test</a>
for a subpattern's having matched refers to a non-unique number, the test is
@ -2051,9 +2093,9 @@ subpattern is possible using named parentheses (see below).
</P>
<P>
Another way of avoiding the ambiguity inherent in the use of digits following a
backslash is to use the \g escape sequence. This escape must be followed by an
unsigned number or a negative number, optionally enclosed in braces. These
examples are all identical:
backslash is to use the \g escape sequence. This escape must be followed by a
signed or unsigned number, optionally enclosed in braces. These examples are
all identical:
<pre>
(ring), \1
(ring), \g1
@ -2061,8 +2103,7 @@ examples are all identical:
</pre>
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
the reference. A negative number is a relative reference. Consider this
example:
the reference. A signed number is a relative reference. Consider this example:
<pre>
(abc(def)ghi)\g{-1}
</pre>
@ -2073,6 +2114,11 @@ can be helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
</P>
<P>
The sequence \g{+1} is a reference to the next capturing subpattern. This kind
of forward reference can be useful it patterns that repeat. Perl does not
support the use of + in this way.
</P>
<P>
A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
@ -2172,6 +2218,14 @@ capturing is carried out only for positive assertions. (Perl sometimes, but not
always, does do capturing in negative assertions.)
</P>
<P>
WARNING: If a positive assertion containing one or more capturing subpatterns
succeeds, but failure to match later in the pattern causes backtracking over
this assertion, the captures within the assertion are reset only if no higher
numbered captures are already set. This is, unfortunately, a fundamental
limitation of the current implementation; it may get removed in a future
reworking.
</P>
<P>
For compatibility with Perl, most assertion subpatterns may be repeated; though
it makes no sense to assert the same thing several times, the side effect of
capturing parentheses may occasionally be useful. However, an assertion that
@ -2268,18 +2322,31 @@ match. If there are insufficient characters before the current position, the
assertion fails.
</P>
<P>
In a UTF mode, PCRE2 does not allow the \C escape (which matches a single code
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
it impossible to calculate the length of the lookbehind. The \X and \R
escapes, which can match different numbers of code units, are also not
permitted.
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
single code unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbehind. The
\X and \R escapes, which can match different numbers of code units, are never
permitted in lookbehinds.
</P>
<P>
<a href="#subpatternsassubroutines">"Subroutine"</a>
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
as the subpattern matches a fixed-length string.
<a href="#recursion">Recursion,</a>
however, is not supported.
as the subpattern matches a fixed-length string. However,
<a href="#recursion">recursion,</a>
that is, a "subroutine" call into a group that is already active,
is not supported.
</P>
<P>
Perl does not support back references in lookbehinds. PCRE2 does support them,
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
must not be set, there must be no use of (?| in the pattern (it creates
duplicate subpattern numbers), and if the back reference is by name, the name
must be unique. Of course, the referenced subpattern must itself be of fixed
length. The following pattern matches words containing at least two characters
that begin and end with the same character:
<pre>
\b(\w)\w++(?&#60;=\1)
</PRE>
</P>
<P>
Possessive quantifiers can be used in conjunction with lookbehind assertions to
@ -2417,7 +2484,9 @@ Checking for a used subpattern by name
<P>
Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE1, which had
this facility before Perl, the syntax (?(name)...) is also recognized.
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
however, that undelimited names consisting of the letter R followed by digits
are ambiguous (see the following section).
</P>
<P>
Rewriting the above example to use a named subpattern gives this:
@ -2432,30 +2501,52 @@ matched.
Checking for pattern recursion
</b><br>
<P>
If the condition is the string (R), and there is no subpattern with the name R,
the condition is true if a recursive call to the whole pattern or any
subpattern has been made. If digits or a name preceded by ampersand follow the
letter R, for example:
"Recursion" in this sense refers to any subroutine-like call from one part of
the pattern to another, whether or not it is actually recursive. See the
sections entitled
<a href="#recursion">"Recursive patterns"</a>
and
<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
below for details of recursion and subpattern calls.
</P>
<P>
If a condition is the string (R), and there is no subpattern with the name R,
the condition is true if matching is currently in a recursion or subroutine
call to the whole pattern or any subpattern. If digits follow the letter R, and
there is no subpattern with that name, the condition is true if the most recent
call is into a subpattern with the given number, which must exist somewhere in
the overall pattern. This is a contrived example that is equivalent to a+b:
<pre>
(?(R3)...) or (?(R&name)...)
((?(R1)a+|(?1)b))
</pre>
the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
stack. If the name used in a condition of this kind is a duplicate, the test is
applied to all subpatterns of the same name, and is true if any one of them is
the most recent recursion.
However, in both cases, if there is a subpattern with a matching name, the
condition tests for its being set, as described in the section above, instead
of testing for recursion. For example, creating a group with the name R1 by
adding (?&#60;R1&#62;) to the above pattern completely changes its meaning.
</P>
<P>
If a name preceded by ampersand follows the letter R, for example:
<pre>
(?(R&name)...)
</pre>
the condition is true if the most recent recursion is into a subpattern of that
name (which must exist within the pattern).
</P>
<P>
This condition does not check the entire recursion stack. It tests only the
current level. If the name used in a condition of this kind is a duplicate, the
test is applied to all subpatterns of the same name, and is true if any one of
them is the most recent recursion.
</P>
<P>
At "top level", all these recursion test conditions are false.
<a href="#recursion">The syntax for recursive patterns</a>
is described below.
<a name="subdefine"></a></P>
<br><b>
Defining subpatterns for use by reference only
</b><br>
<P>
If the condition is the string (DEFINE), and there is no subpattern with the
name DEFINE, the condition is always false. In this case, there may be only one
If the condition is the string (DEFINE), the condition is always false, even if
there is a group with the name DEFINE. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
subroutines that can be referenced from elsewhere. (The use of
@ -2489,7 +2580,8 @@ For example:
(?(VERSION&#62;=10.4)yes|no)
</pre>
This pattern matches "yes" if the PCRE2 version is greater or equal to 10.4, or
"no" otherwise.
"no" otherwise. The fractional part of the version number may not contain more
than two digits.
</P>
<br><b>
Assertion conditions
@ -2602,6 +2694,21 @@ parentheses preceding the recursion. In other words, a negative number counts
capturing parentheses leftwards from the point at which it is encountered.
</P>
<P>
Be aware however, that if
<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
are in use, relative references refer to the earliest subpattern with the
appropriate number. Consider, for example:
<pre>
(?|(a)|(b)) (c) (?-2)
</pre>
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
is number 2. When the reference (?-2) is encountered, the second most recently
opened parentheses has the number 1, but it is the first such group (the (a)
group) to which the recursion refers. This would be the same if an absolute
reference (?1) was used. In other words, relative references are just a
shorthand for computing a group number.
</P>
<P>
It is also possible to refer to subsequently opened parentheses, by writing
references such as (?+2). However, these cannot be recursive because the
reference is not inside the parentheses that are referenced. They are always
@ -2899,14 +3006,36 @@ remarks apply to the PCRE2 features described in this section.
</P>
<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
differently depending on whether or not a name is present. A name is any
sequence of characters that does not include a closing parenthesis. The maximum
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
libraries. If the name is empty, that is, if the closing parenthesis
immediately follows the colon, the effect is as if the colon were not there.
Any number of these verbs may occur in a pattern.
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
(*VERB:NAME). Some verbs take either form, possibly behaving differently
depending on whether or not a name is present.
</P>
<P>
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
is no longer Perl-compatible.
</P>
<P>
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
and only an unescaped closing parenthesis terminates the name. However, the
only backslash items that are permitted are \Q, \E, and sequences such as
\x{100} that define character code points. Character type escapes such as \d
are faulted.
</P>
<P>
A closing parenthesis can be included in a name either as \) or between \Q
and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
also set, unescaped whitespace in verb names is skipped, and #-comments are
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
</P>
<P>
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
parenthesis immediately follows the colon, the effect is as if the colon were
not there. Any number of these verbs may occur in a pattern.
</P>
<P>
Since these verbs are specifically related to backtracking, most of them can be
@ -3323,9 +3452,9 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 13 June 2015
Last updated: 27 December 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -12,17 +12,21 @@ This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
PCRE2 PERFORMANCE
</b><br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 PERFORMANCE</a><br>
<P>
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
</P>
<br><b>
COMPILED PATTERN MEMORY USAGE
</b><br>
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
<P>
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case
@ -75,9 +79,7 @@ pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
</P>
<br><b>
STACK USAGE AT RUN TIME
</b><br>
<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
<P>
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environments the
@ -86,9 +88,7 @@ SIGSEGV. Rewriting your pattern can often help. The
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation discusses this issue in detail.
</P>
<br><b>
PROCESSING TIME
</b><br>
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
<P>
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
@ -177,9 +177,7 @@ appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
</P>
<br><b>
AUTHOR
</b><br>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -188,9 +186,7 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 January 2015
<br>

View File

@ -48,7 +48,7 @@ This set of functions provides a POSIX-style API for the PCRE2 regular
expression 8-bit library. See the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation for a description of PCRE2's native API, which contains much
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
additional functionality. There are no POSIX-style wrappers for PCRE2's 16-bit
and 32-bit libraries.
</P>
<P>
@ -67,9 +67,9 @@ POSIX interface often use it, this makes it easier to slot in PCRE2 as a
replacement library. Other POSIX options are not even defined.
</P>
<P>
There are also some other options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE2-specific features via the POSIX calling interface.
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
features via the POSIX calling interface.
</P>
<P>
When PCRE2 is called via these functions, it is only the API that is POSIX-like
@ -119,11 +119,11 @@ defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
REG_NOSUB
</pre>
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
for compilation to the native function. In addition, when a pattern that is
compiled with this flag is passed to <b>regexec()</b> for matching, the
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
are returned.
When a pattern that is compiled with this flag is passed to <b>regexec()</b> for
matching, the <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no
captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
<pre>
REG_UCP
</pre>
@ -170,7 +170,7 @@ use the contents of the <i>preg</i> structure. If, for example, you pass it to
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
never intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE2:
possibilities for matching newline characters in Perl and PCRE2:
<pre>
Default Change with
@ -180,7 +180,7 @@ possibilities for matching newline characters in PCRE2:
$ matches \n in middle no PCRE2_MULTILINE
^ matches \n in middle no PCRE2_MULTILINE
</pre>
This is the equivalent table for POSIX:
This is the equivalent table for a POSIX-compatible pattern matcher:
<pre>
Default Change with
@ -190,14 +190,18 @@ This is the equivalent table for POSIX:
$ matches \n in middle no REG_NEWLINE
^ matches \n in middle no REG_NEWLINE
</pre>
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
newline from matching [^a].
This behaviour is not what happens when PCRE2 is called via its POSIX
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
is no way to stop newline from matching [^a].
</P>
<P>
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
the REG_NEWLINE action.
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY when calling <b>pcre2_compile()</b> directly, but there is
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
the POSIX API, passing REG_NEWLINE to PCRE2's <b>regcomp()</b> function
causes PCRE2_MULTILINE to be passed to <b>pcre2_compile()</b>, and REG_DOTALL
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
</P>
<br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
<P>
@ -231,19 +235,21 @@ to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched.
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
mutually exclusive; the error REG_INVARG is returned.
</P>
<P>
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
<b>regexec()</b> are ignored.
<b>regexec()</b> are ignored (except possibly as input for REG_STARTEND).
</P>
<P>
If the value of <i>nmatch</i> is zero, or if the value <i>pmatch</i> is NULL,
no data about any matched strings is returned.
The value of <i>nmatch</i> may be zero, and the value <i>pmatch</i> may be NULL
(unless REG_STARTEND is set); in both these cases no data about any matched
strings is returned.
</P>
<P>
Otherwise,the portion of the string that was matched, and also any captured
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the <i>pmatch</i> argument, which points to an
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
@ -262,9 +268,11 @@ header file, of which REG_NOMATCH is the "expected" failure code.
The <b>regerror()</b> function maps a non-zero errorcode from either
<b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
NULL, the error should have arisen from the use of that structure. A message
terminated by a binary zero is placed in <i>errbuf</i>. The length of the
message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
function is the size of buffer needed to hold the whole message.
terminated by a binary zero is placed in <i>errbuf</i>. If the buffer is too
short, only the first <i>errbuf_size</i> - 1 characters of the error message are
used. The yield of the function is the size of buffer needed to hold the whole
message, including the terminating zero. This value is greater than
<i>errbuf_size</i> if the message was truncated.
</P>
<br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
<P>
@ -283,9 +291,9 @@ Cambridge, England.
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 October 2014
Last updated: 31 January 2016
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -24,12 +24,11 @@ documentation. If you do not have a copy of the PCRE2 distribution, you can
save this listing to re-create the contents of <i>pcre2demo.c</i>.
</P>
<P>
The demonstration program, which uses the PCRE2 8-bit library, compiles the
regular expression that is its first argument, and matches it against the
subject string in its second argument. No PCRE2 options are set, and default
character tables are used. If matching succeeds, the program outputs the
portion of the subject that matched, together with the contents of any captured
substrings.
The demonstration program compiles the regular expression that is its
first argument, and matches it against the subject string in its second
argument. No PCRE2 options are set, and default character tables are used. If
matching succeeds, the program outputs the portion of the subject that matched,
together with the contents of any captured substrings.
</P>
<P>
If the -g option is given on the command line, the program then goes on to
@ -38,34 +37,39 @@ string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
</P>
<P>
The code in <b>pcre2demo.c</b> is an 8-bit program that uses the PCRE2 8-bit
library. It handles strings and characters that are stored in 8-bit code units.
By default, one character corresponds to one code unit, but if the pattern
starts with "(*UTF)", both it and the subject are treated as UTF-8 strings,
where characters may occupy multiple code units.
</P>
<P>
If PCRE2 is installed in the standard include and library directories for your
operating system, you should be able to compile the demonstration program using
this command:
a command like this:
<pre>
gcc -o pcre2demo pcre2demo.c -lpcre2-8
cc -o pcre2demo pcre2demo.c -lpcre2-8
</pre>
If PCRE2 is installed elsewhere, you may need to add additional options to the
command line. For example, on a Unix-like system that has PCRE2 installed in
<i>/usr/local</i>, you can compile the demonstration program using a command
like this:
<pre>
gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
</PRE>
</P>
<P>
Once you have compiled and linked the demonstration program, you can run simple
tests like this:
cc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
</pre>
Once you have built the demonstration program, you can run simple tests like
this:
<pre>
./pcre2demo 'cat|dog' 'the cat sat on the mat'
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
</pre>
Note that there is a much more comprehensive test program, called
<a href="pcre2test.html"><b>pcre2test</b>,</a>
which supports many more facilities for testing regular expressions using the
PCRE2 libraries. The
which supports many more facilities for testing regular expressions using all
three PCRE2 libraries (8-bit, 16-bit, and 32-bit, though not all three need be
installed). The
<a href="pcre2demo.html"><b>pcre2demo</b></a>
program is provided as a simple coding example.
program is provided as a relatively simple coding example.
</P>
<P>
If you try to run
@ -73,7 +77,7 @@ If you try to run
when PCRE2 is not installed in the standard library directory, you may get an
error like this on some operating systems (e.g. Solaris):
<pre>
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
</pre>
This is caused by the way shared library support works on those systems. You
need to add
@ -97,9 +101,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 20 October 2014
Last updated: 02 February 2016
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -14,10 +14,11 @@ please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a>
<li><a name="TOC2" href="#SEC2">SAVING COMPILED PATTERNS</a>
<li><a name="TOC3" href="#SEC3">RE-USING PRECOMPILED PATTERNS</a>
<li><a name="TOC4" href="#SEC4">AUTHOR</a>
<li><a name="TOC5" href="#SEC5">REVISION</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONCERNS</a>
<li><a name="TOC3" href="#SEC3">SAVING COMPILED PATTERNS</a>
<li><a name="TOC4" href="#SEC4">RE-USING PRECOMPILED PATTERNS</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a><br>
<P>
@ -41,14 +42,22 @@ If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled form
instead of having to compile them every time the application is run. However,
if you are using the just-in-time optimization feature, it is not possible to
save and reload the JIT data, because it is position-dependent. In addition,
the host on which the patterns are reloaded must be running the same version of
PCRE2, with the same code unit width, and must also have the same endianness,
pointer width and PCRE2_SIZE type. For example, patterns compiled on a 32-bit
system using PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor
can they be reloaded using the 8-bit library.
save and reload the JIT data, because it is position-dependent. The host on
which the patterns are reloaded must be running the same version of PCRE2, with
the same code unit width, and must also have the same endianness, pointer width
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
reloaded using the 8-bit library.
</P>
<br><a name="SEC2" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<br><a name="SEC2" href="#TOC1">SECURITY CONCERNS</a><br>
<P>
The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded.
</P>
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P>
Before compiled patterns can be saved they must be serialized, that is,
converted to a stream of bytes. A single byte stream may contain any number of
@ -110,7 +119,7 @@ still be used for matching. Their memory must eventually be freed in the usual
way by calling <b>pcre2_code_free()</b>. When you have finished with the byte
stream, it too must be freed by calling <b>pcre2_serialize_free()</b>.
</P>
<br><a name="SEC3" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
<br><a name="SEC4" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
<P>
In order to re-use a set of saved patterns you must first make the serialized
byte stream available in main memory (for example, by reading from a file). The
@ -142,21 +151,27 @@ is filled with those that fit, and the remainder are ignored. The yield of the
function is the number of decoded patterns, or one of the following negative
error codes:
<pre>
PCRE2_ERROR_BADDATA second argument is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
PCRE2_ERROR_BADMODE mismatch of variable unit size or PCRE2 version
PCRE2_ERROR_MEMORY memory allocation failed
PCRE2_ERROR_NULL first or third argument is NULL
PCRE2_ERROR_BADDATA second argument is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
PCRE2_ERROR_MEMORY memory allocation failed
PCRE2_ERROR_NULL first or third argument is NULL
</pre>
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
on a system with different endianness.
</P>
<P>
Decoded patterns can be used for matching in the usual way, and must be freed
by calling <b>pcre2_code_free()</b> as normal. A single copy of the character
tables is used by all the decoded patterns. A reference count is used to
by calling <b>pcre2_code_free()</b>. However, be aware that there is a potential
race issue if you are using multiple patterns that were decoded from a single
byte stream in a multithreaded application. A single copy of the character
tables is used by all the decoded patterns and a reference count is used to
arrange for its memory to be automatically freed when the last pattern is
freed.
freed, but there is no locking on this reference count. Therefore, if you want
to call <b>pcre2_code_free()</b> for these patterns in different threads, you
must arrange your own locking, and ensure that <b>pcre2_code_free()</b> cannot
be called by two threads at the same time.
</P>
<P>
If a pattern was processed by <b>pcre2_jit_compile()</b> before being
@ -164,7 +179,7 @@ serialized, the JIT data is discarded and so is no longer available after a
save/restore cycle. You can, however, process a restored pattern with
<b>pcre2_jit_compile()</b> if you wish.
</P>
<br><a name="SEC4" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -173,11 +188,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 January 2015
Last updated: 24 May 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -57,12 +57,13 @@ assertion and "once-only" subpatterns, which are handled like subroutine calls.
Normally, these are never very deep, and the limit on the complexity of
<b>pcre2_dfa_match()</b> is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack. At
present, there is no protection against this.
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack unless a
limit is applied (see below).
</P>
<P>
The comments that follow do NOT apply to <b>pcre2_dfa_match()</b>; they are
relevant only for <b>pcre2_match()</b> without the JIT optimization.
The comments in the next three sections do not apply to
<b>pcre2_dfa_match()</b>; they are relevant only for <b>pcre2_match()</b> without
the JIT optimization.
</P>
<br><b>
Reducing <b>pcre2_match()</b>'s stack usage
@ -115,7 +116,7 @@ entitled
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. Since the block sizes are always the same, it may be possible to
implement customized a memory handler that is more efficient than the standard
implement a customized memory handler that is more efficient than the standard
function. The memory blocks obtained for this purpose are retained and re-used
if possible while <b>pcre2_match()</b> is running. They are all freed just
before it exits.
@ -151,6 +152,15 @@ pattern to match. This is done by calling <b>pcre2_match()</b> repeatedly with
different limits.
</P>
<br><b>
Limiting <b>pcre2_dfa_match()</b>'s stack usage
</b><br>
<P>
The recursion limit, as described above for <b>pcre2_match()</b>, also applies
to <b>pcre2_dfa_match()</b>, whose use of recursive function calls for
recursions in the pattern can lead to runaway stack usage. The non-recursive
match limit is not relevant for DFA matching, and is ignored.
</P>
<br><b>
Changing stack size in Unix-like systems
</b><br>
<P>
@ -198,9 +208,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 21 November 2014
Last updated: 23 December 2016
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -111,9 +111,10 @@ it matches a literal "u".
\W a "non-word" character
\X a Unicode extended grapheme cluster
</pre>
The application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
current matching point in the middle of a UTF-8 or UTF-16 character.
\C is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \C permanently disabled.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
@ -187,6 +188,8 @@ at release 5.18.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
@ -227,6 +230,7 @@ Gurmukhi,
Han,
Hangul,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
@ -263,12 +267,14 @@ Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
@ -290,6 +296,7 @@ Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
@ -444,9 +451,10 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>, not
increase them. The application can lock out the use of (*UTF) and (*UCP) by
setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at
compile time.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
@ -485,6 +493,9 @@ Each top-level branch of a look behind must be of a fixed length.
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
\g+n relative reference by number (PCRE2 extension)
\g-n relative reference by number
\g{+n} relative reference by number (PCRE2 extension)
\g{-n} relative reference by number
\k&#60;name&#62; reference by name (Perl)
\k'name' reference by name (Perl)
@ -523,14 +534,17 @@ Each top-level branch of a look behind must be of a fixed length.
(?(-n) relative reference condition
(?(&#60;name&#62;) named reference condition (Perl)
(?('name') named reference condition (Perl)
(?(name) named reference condition (PCRE2)
(?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
(?(Rn) specific group recursion condition
(?(R&name) specific recursion condition
(?(Rn) specific numbered group recursion condition
(?(R&name) specific named group recursion condition
(?(DEFINE) define subpattern for reference
(?(VERSION[&#62;]=n.m) test PCRE2 version
(?(assert) assertion condition
</PRE>
</pre>
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
@ -582,9 +596,9 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 13 June 2015
Last updated: 23 December 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -61,7 +61,7 @@ subject is processed, and what output is produced.
<P>
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original <b>pcretest</b> program ended up with a
lot of options in a messy, arcane syntax, for testing all the features. The
lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@ -77,31 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
<b>pcre2test</b> program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16- or 32-bit format
before being passed to the library functions. Results are converted back to
8-bit code units for output.
libraries, patterns and subject strings are converted to 16-bit or 32-bit
format before being passed to the library functions. Results are converted back
to 8-bit code units for output.
</P>
<P>
In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre_compile()</b>. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
</P>
<a name="inputencoding"></a></P>
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
<P>
Input to <b>pcre2test</b> is processed line by line, either by calling the C
library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
below). The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
further data is read.
library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
Windows environments character 26 (hex 1A) causes an immediate end of file, and
no further data is read, so this character should be avoided unless you really
want that action.
</P>
<P>
For maximum portability, therefore, it is safest to avoid non-printing
characters in <b>pcre2test</b> input files. There is a facility for specifying a
pattern's characters as hexadecimal pairs, thus making it possible to include
binary zeroes in a pattern for testing purposes. Subject lines are processed
for backslash escapes, which makes it possible to include any data value.
The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash
escapes, which makes it possible to include any data value in strings that are
passed to the library for matching. For patterns, there is a facility for
specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros.
</P>
<br><b>
Input for the 16-bit and 32-bit libraries
</b><br>
<P>
When testing the 16-bit or 32-bit libraries, there is a need to be able to
generate character code points greater than 255 in the strings that are passed
to the library. For subject lines, backslash escapes can be used. In addition,
when the <b>utf</b> modifier (see
<a href="#optionmodifiers">"Setting compilation options"</a>
below) is set, the pattern and any following subject lines are interpreted as
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
</P>
<P>
For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
or 32-bit mode. It causes the pattern and following subject lines to be treated
as UTF-8 according to the original definition (RFC 2279), which allows for
character values up to 0x7fffffff. Each character is placed in one 16-bit or
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
to occur).
</P>
<P>
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
@ -123,8 +153,13 @@ the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
</P>
<P>
<b>-ac</b>
Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
automatic callouts into every pattern that is compiled.
</P>
<P>
<b>-b</b>
Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
Behave as if each pattern has the <b>fullbincode</b> modifier; the full
internal binary form of the pattern is output after compilation.
</P>
<P>
@ -155,12 +190,13 @@ following options output the value and set the exit code as indicated:
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
<pre>
ebcdic compiled for an EBCDIC environment
jit just-in-time support is available
pcre2-16 the 16-bit library was built
pcre2-32 the 32-bit library was built
pcre2-8 the 8-bit library was built
unicode Unicode support is available
backslash-C \C is supported (not locked out)
ebcdic compiled for an EBCDIC environment
jit just-in-time support is available
pcre2-16 the 16-bit library was built
pcre2-32 the 32-bit library was built
pcre2-8 the 8-bit library was built
unicode Unicode support is available
</pre>
If an unknown option is given, an error message is output; the exit code is 0.
</P>
@ -177,12 +213,19 @@ using the <b>pcre2_dfa_match()</b> function instead of the default
<b>pcre2_match()</b>.
</P>
<P>
<b>-error</b> <i>number[,number,...]</i>
Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
comma-separated list, display the resulting messages on the standard output,
then exit with zero exit code. The numbers may be positive or negative. This is
a convenience facility for PCRE2 maintainers.
</P>
<P>
<b>-help</b>
Output a brief summary these options and then exit.
</P>
<P>
<b>-i</b>
Behave as if each pattern has the <b>/info</b> modifier; information about the
Behave as if each pattern has the <b>info</b> modifier; information about the
compiled pattern is given after compilation.
</P>
<P>
@ -265,9 +308,9 @@ Each subject line is matched separately and independently. If you want to do
multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
etc., depending on the newline setting) in a single line of input to encode the
newline sequences. There is no limit on the length of subject lines; the input
buffer is automatically extended if it is too small. There is a replication
feature that makes it possible to generate long subject lines without having to
supply them explicitly.
buffer is automatically extended if it is too small. There are replication
features that makes it possible to generate long repetitive pattern or subject
lines without having to supply them explicitly.
</P>
<P>
An empty line or the end of the file signals the end of the subject lines for a
@ -304,6 +347,36 @@ output.
This command is used to load a set of precompiled patterns from a file, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
#newline_default [&#60;newline-list&#62;]
</pre>
When PCRE2 is built, a default newline convention can be specified. This
determines which characters and/or character pairs are recognized as indicating
a newline in a pattern or subject string. The default can be overridden when a
pattern is compiled. The standard test files contain tests of various newline
conventions, but the majority of the tests expect a single linefeed to be
recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
</P>
<P>
The #newline_default command specifies a list of newline types that are
acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
ANY (in upper or lower case), for example:
<pre>
#newline_default LF Any anyCRLF
</pre>
If the default newline is in the list, this command has no effect. Otherwise,
except when testing the POSIX API, a <b>newline</b> modifier that specifies the
first newline convention in the list (LF in the above example) is added to any
pattern that does not already have a <b>newline</b> modifier. If the newline
list is empty, the feature is turned off. This command is present in a number
of the standard test input files.
</P>
<P>
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the <b>posix</b> modifier is used when
<b>#newline_default</b> would set a default for the non-POSIX API.
<pre>
#pattern &#60;modifier-list&#62;
</pre>
@ -321,9 +394,10 @@ test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
command helps detect tests that are accidentally put in the wrong file.
<pre>
#pop [&#60;modifiers&#62;]
#popcopy [&#60;modifiers&#62;]
</pre>
This command is used to manipulate the stack of compiled patterns, as described
in the section entitled "Saving and restoring compiled patterns"
These commands are used to manipulate the stack of compiled patterns, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
#save &#60;filename&#62;
@ -340,12 +414,13 @@ subject lines. Modifiers on a subject line can change these settings.
<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
<P>
Modifier lists are used with both pattern and subject lines. Items in a list
are separated by commas and optional white space. Some modifiers may be given
for both patterns and subject lines, whereas others are valid for one or the
other only. Each modifier has a long name, for example "anchored", and some of
them must be followed by an equals sign and a value, for example, "offset=12".
Modifiers that do not take values may be preceded by a minus sign to turn off a
previous setting.
are separated by commas followed by optional white space. Trailing whitespace
in a modifier list is ignored. Some modifiers may be given for both patterns
and subject lines, whereas others are valid only for one or the other. Each
modifier has a long name, for example "anchored", and some of them must be
followed by an equals sign and a value, for example, "offset=12". Values cannot
contain comma characters, but may contain spaces. Modifiers that do not take
values may be preceded by a minus sign to turn off a previous setting.
</P>
<P>
A few of the more common modifiers can also be specified as single letters, for
@ -454,6 +529,12 @@ the start of a modifier list. For example:
<pre>
abc\=notbol,notempty
</pre>
If the subject string is empty and \= is followed by whitespace, the line is
treated as a comment line, and is not used for matching. For example:
<pre>
\= This is a comment.
abc\= This is an invalid modifier list.
</pre>
A backslash followed by any other non-alphanumeric character just escapes that
character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
@ -462,10 +543,10 @@ a real empty line terminates the data input.
</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P>
There are three types of modifier that can appear in pattern lines, two of
which may also be used in a <b>#pattern</b> command. A pattern's modifier list
can add to or override default modifiers that were set by a previous
<b>#pattern</b> command.
There are several types of modifier that can appear in pattern lines. Except
where noted below, they may also be used in <b>#pattern</b> commands. A
pattern's modifier list can add to or override default modifiers that were set
by a previous <b>#pattern</b> command.
<a name="optionmodifiers"></a></P>
<br><b>
Setting compilation options
@ -473,12 +554,13 @@ Setting compilation options
<P>
The following modifiers set options for <b>pcre2_compile()</b>. The most common
ones have single-letter abbreviations. See
<a href="pcreapi.html"><b>pcreapi</b></a>
<a href="pcre2api.html"><b>pcre2api</b></a>
for a description of their effects.
<pre>
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
/i caseless set PCRE2_CASELESS
@ -499,12 +581,15 @@ for a description of their effects.
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
ungreedy set PCRE2_UNGREEDY
use_offset_limit set PCRE2_USE_OFFSET_LIMIT
utf set PCRE2_UTF
</pre>
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
non-printing characters in output strings to be printed using the \x{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets.
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
@ -519,18 +604,24 @@ about the pattern:
debug same as info,fullbincode
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex pattern is coded in hexadecimal
hex unquoted characters are hexadecimal
jit[=&#60;number&#62;] use JIT
jitfast use JIT fast path
jitverify verify JIT use
locale=&#60;name&#62; use this locale
max_pattern_length=&#60;n&#62; set the maximum pattern length
memory show memory used
newline=&#60;type&#62; set newline type
null_context compile with a NULL context
parens_nest_limit=&#60;n&#62; set maximum parentheses depth
posix use the POSIX API
posix_nosub use the POSIX API with REG_NOSUB
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=&#60;number&#62; test the stackguard feature
tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8
</pre>
The effects of these modifiers are described in the following sections.
</P>
@ -604,40 +695,145 @@ is requested. For each callout, either its number or string is given, followed
by the item that follows it in the pattern.
</P>
<br><b>
Specifying a pattern in hex
Passing a NULL context
</b><br>
<P>
The <b>hex</b> modifier specifies that the characters of the pattern are to be
interpreted as pairs of hexadecimal digits. White space is permitted between
pairs. For example:
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If
the <b>null_context</b> modifier is set, however, NULL is passed. This is for
testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
default values).
</P>
<br><b>
Specifying the pattern's length
</b><br>
<P>
By default, patterns are passed to the compiling functions as zero-terminated
strings. When using the POSIX wrapper API, there is no other option. However,
when using PCRE2's native API, patterns can be passed by length instead of
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
Using a length happens automatically (whether or not <b>use_length</b> is set)
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
binary zeros.
</P>
<br><b>
Specifying pattern characters in hexadecimal
</b><br>
<P>
The <b>hex</b> modifier specifies that the characters of the pattern, except for
substrings enclosed in single or double quotes, are to be interpreted as pairs
of hexadecimal digits. This feature is provided as a way of creating patterns
that contain binary zeros and other non-printing characters. White space is
permitted between pairs of digits. For example, this pattern contains three
characters:
<pre>
/ab 32 59/hex
</pre>
This feature is provided as a way of creating patterns that contain binary zero
and other non-printing characters. By default, <b>pcre2test</b> passes patterns
as zero-terminated strings to <b>pcre2_compile()</b>, giving the length as
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
actual length of the pattern is passed.
Parts of such a pattern are taken literally if quoted. This pattern contains
nine characters, only two of which are specified in hexadecimal:
<pre>
/ab "literal" 32/hex
</pre>
Either single or double quotes may be used. There is no way of including
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
mutually exclusive.
</P>
<P>
The POSIX API cannot be used with patterns specified in hexadecimal because
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
requirement for a zero-terminated string. Such patterns are always passed to
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
</P>
<br><b>
Specifying wide characters in 16-bit and 32-bit modes
</b><br>
<P>
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
can be used. It is mutually exclusive with <b>utf</b>. Input lines are
interpreted as UTF-8 as a means of specifying wide characters. More details are
given in
<a href="#inputencoding">"Input encoding"</a>
above.
</P>
<br><b>
Generating long repetitive patterns
</b><br>
<P>
Some tests use long patterns that are very repetitive. Instead of creating a
very long input line for such a pattern, you can use a special repetition
feature, similar to the one described for subject lines above. If the
<b>expand</b> modifier is present on a pattern, parts of the pattern that have
the form
<pre>
\[&#60;characters&#62;]{&#60;count&#62;}
</pre>
are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
mutually exclusive.
</P>
<P>
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
the quantifier. For example, \[AB]{6000,6000} is not recognized as an
expansion item.
</P>
<P>
If the <b>info</b> modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
</P>
<br><b>
JIT compilation
</b><br>
<P>
The <b>/jit</b> modifier may optionally be followed by an equals sign and a
number in the range 0 to 7:
Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
speed up pattern matching. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for details. JIT compiling happens, optionally, after a pattern
has been successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time options
PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
different code is generated for the different cases. See the <b>partial</b>
modifier in "Subject Modifiers"
<a href="#subjectmodifiers">below</a>
for details of how these options are specified for each match attempt.
</P>
<P>
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating
modes are to be compiled:
<pre>
1 compile JIT code for non-partial matching
2 compile JIT code for soft partial matching
4 compile JIT code for hard partial matching
</pre>
The possible values for the <b>jit</b> modifier are therefore:
<pre>
0 disable JIT
1 use JIT for normal match only
2 use JIT for soft partial match only
3 use JIT for normal match and soft partial match
4 use JIT for hard partial match only
6 use JIT for soft and hard partial match
1 normal matching only
2 soft partial matching only
3 normal and soft partial matching
4 hard partial matching only
6 soft and hard partial matching only
7 all three modes
</pre>
If no number is given, 7 is assumed. If JIT compilation is successful, the
compiled JIT code will automatically be used when <b>pcre2_match()</b> is run
for the appropriate type of match, except when incompatible run-time options
are specified. For more details, see the
If no number is given, 7 is assumed. The phrase "partial matching" means a call
to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
subject line, that match will not use JIT code because none was compiled for
non-partial matching.
</P>
<P>
If JIT compilation is successful, the compiled JIT code will automatically be
used when an appropriate type of match is run, except when incompatible
run-time options are specified. For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation. See also the <b>jitstack</b> modifier below for a way of
setting the size of the JIT stack.
@ -661,14 +857,14 @@ code was actually used in the match.
Setting a locale
</b><br>
<P>
The <b>/locale</b> modifier must specify the name of a locale, for example:
The <b>locale</b> modifier must specify the name of a locale, for example:
<pre>
/pattern/locale=fr_FR
</pre>
The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
character tables for the locale, and this is then passed to
<b>pcre2_compile()</b> when compiling the regular expression. The same tables
are used when matching the following subject lines. The <b>/locale</b> modifier
are used when matching the following subject lines. The <b>locale</b> modifier
applies only to the pattern on which it appears, but can be given in a
<b>#pattern</b> command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.
@ -677,7 +873,7 @@ character tables are mutually exclusive.
Showing pattern memory
</b><br>
<P>
The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
The <b>memory</b> modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is
@ -700,30 +896,53 @@ sets its own default of 220, which is required for running the standard test
suite.
</P>
<br><b>
Limiting the pattern length
</b><br>
<P>
The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited).
</P>
<br><b>
Using the POSIX wrapper API
</b><br>
<P>
The <b>/posix</b> modifier causes <b>pcre2test</b> to call PCRE2 via the POSIX
wrapper API rather than its native API. This supports only the 8-bit library.
When the POSIX API is being used, the following pattern modifiers set options
for the <b>regcomp()</b> function:
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
PCRE2 via the POSIX wrapper API rather than its native API. When
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
it does not imply POSIX matching semantics; for more detail see the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. The following pattern modifiers set options for the
<b>regcomp()</b> function:
<pre>
caseless REG_ICASE
multiline REG_NEWLINE
no_auto_capture REG_NOSUB
dotall REG_DOTALL )
ungreedy REG_UNGREEDY ) These options are not part of
ucp REG_UCP ) the POSIX standard
utf REG_UTF8 )
</pre>
The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that
is passed to <b>regerror()</b> in the event of a compilation error. For example:
<pre>
/abc/posix,regerror_buffsize=20
</pre>
This provides a means of testing the behaviour of <b>regerror()</b> when the
buffer is too small for the error message. If this modifier has not been set, a
large buffer is used.
</P>
<P>
The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
below. All other modifiers cause an error.
below. All other modifiers are either ignored, with a warning message, or cause
an error.
</P>
<br><b>
Testing the stack guard feature
</b><br>
<P>
The <b>/stackguard</b> modifier is used to test the use of
The <b>stackguard</b> modifier is used to test the use of
<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
enable stack availability to be checked during compilation (see the
<a href="pcre2api.html"><b>pcre2api</b></a>
@ -738,7 +957,7 @@ be aborted.
Using alternative character tables
</b><br>
<P>
The value specified for the <b>/tables</b> modifier must be one of the digits 0,
The value specified for the <b>tables</b> modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
@ -758,17 +977,22 @@ Setting certain match controls
<P>
The following modifiers are really subject modifiers, and are described below.
However, they may be included in a pattern's modifier list, in which case they
are applied to every subject line that is processed with that pattern. They do
not affect the compilation process.
are applied to every subject line that is processed with that pattern. They may
not appear in <b>#pattern</b> commands. These modifiers do not affect the
compilation process.
<pre>
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
/g global global matching
mark show mark values
replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
/g global global matching
mark show mark values
replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
</pre>
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
defaults, set them in a <b>#subject</b> command.
@ -782,13 +1006,17 @@ pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
line to contain a new pattern (or a command) instead of a subject line. This
facility is used when saving compiled patterns to a file, as described in the
section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
The <b>push</b> modifier is incompatible with compilation modifiers such as
<b>global</b> that act at match time. Any that are specified are ignored, with a
warning message, except for <b>replace</b>, which causes an error. Note that,
<b>jitverify</b>, which is allowed, does not carry through to any subsequent
matching that uses this pattern.
</P>
<a href="#saverestore">below. If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled</a>
pattern is stacked, leaving the original as current, ready to match the
following input lines. This provides a way of testing the
<b>pcre2_code_copy()</b> function.
The <b>push</b> and <b>pushcopy </b> modifiers are incompatible with compilation
modifiers such as <b>global</b> that act at match time. Any that are specified
are ignored (for the stacked copy), with a warning message, except for
<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is
allowed, does not carry through to any subsequent matching that uses a stacked
pattern.
<a name="subjectmodifiers"></a></P>
<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
<P>
The modifiers that can appear in subject lines and the <b>#subject</b>
@ -806,6 +1034,7 @@ for a description of their effects.
anchored set PCRE2_ANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_jit set PCRE2_NO_JIT
no_utf_check set PCRE2_NO_UTF_CHECK
notbol set PCRE2_NOTBOL
notempty set PCRE2_NOTEMPTY
@ -818,11 +1047,11 @@ The partial matching modifiers are provided with abbreviations because they
appear frequently in tests.
</P>
<P>
If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
If the <b>posix</b> modifier was present on the pattern, causing the POSIX
wrapper API to be used, the only option-setting modifiers that have any effect
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
Any other modifiers cause an error.
The other modifiers are ignored, with a warning message.
</P>
<br><b>
Setting match controls
@ -833,33 +1062,44 @@ information. Some of them may also be specified on a pattern line (see above),
in which case they apply to every subject line that is matched against that
pattern.
<pre>
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text (non-JIT only)
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=&#60;n&#62; set a value to pass via callouts
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
callout_none do not supply a callout function
copy=&#60;number or name&#62; copy captured substring
dfa use <b>pcre2_dfa_match()</b>
find_limits find match and recursion limits
get=&#60;number or name&#62; extract captured substring
getall extract all captured substrings
/g global global matching
jitstack=&#60;n&#62; set size of JIT stack
mark show mark values
match_limit=&#62;n&#62; set a match limit
memory show memory usage
offset=&#60;n&#62; set starting offset
ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text (non-JIT only)
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=&#60;n&#62; set a value to pass via callouts
callout_error=&#60;n&#62;[:&#60;m&#62;] control callout error
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
callout_none do not supply a callout function
copy=&#60;number or name&#62; copy captured substring
dfa use <b>pcre2_dfa_match()</b>
find_limits find match and recursion limits
get=&#60;number or name&#62; extract captured substring
getall extract all captured substrings
/g global global matching
jitstack=&#60;n&#62; set size of JIT stack
mark show mark values
match_limit=&#60;n&#62; set a match limit
memory show memory usage
null_context match with a NULL context
offset=&#60;n&#62; set starting offset
offset_limit=&#60;n&#62; set offset limit
ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant
startoffset=&#60;n&#62; same as offset=&#60;n&#62;
substitute_extedded use PCRE2_SUBSTITUTE_EXTENDED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
zero_terminate pass the subject as zero-terminated
</pre>
The effects of these modifiers are described in the following sections.
The effects of these modifiers are described in the following sections. When
matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>,
and <b>ovector</b> subject modifiers work as described below. All other
modifiers are either ignored, with a warning message, or cause an error.
</P>
<br><b>
Showing more text
@ -916,7 +1156,8 @@ The <b>allcaptures</b> modifier requests that the values of all potential
captured parentheses be output after a match. By default, only those up to the
highest one actually used in the match are output (corresponding to the return
code from <b>pcre2_match()</b>). Groups that did not take part in the match
are output as "&#60;unset&#62;".
are output as "&#60;unset&#62;". This modifier is not relevant for DFA matching (which
does no capturing); it is ignored, with a warning message, if present.
</P>
<br><b>
Testing callouts
@ -924,15 +1165,22 @@ Testing callouts
<P>
A callout function is supplied when <b>pcre2test</b> calls the library matching
functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
set, the current captured groups are output when a callout occurs.
set, the current captured groups are output when a callout occurs. The default
return from the callout function is zero, which allows matching to continue.
</P>
<P>
The <b>callout_fail</b> modifier can be given one or two numbers. If there is
only one number, 1 is returned instead of 0 when a callout of that number is
reached. If two numbers are given, 1 is returned when callout &#60;n&#62; is reached
for the &#60;m&#62;th time. Note that callouts with string arguments are always given
the number zero. See "Callouts" below for a description of the output when a
callout it taken.
only one number, 1 is returned instead of 0 (causing matching to backtrack)
when a callout of that number is reached. If two numbers (&#60;n&#62;:&#60;m&#62;) are given, 1
is returned when callout &#60;n&#62; is reached and there have been at least &#60;m&#62;
callouts. The <b>callout_error</b> modifier is similar, except that
PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
aborted. If both these modifiers are set for the same callout number,
<b>callout_error</b> takes precedence.
</P>
<P>
Note that callouts with string arguments are always given the number zero. See
"Callouts" below for a description of the output when a callout it taken.
</P>
<P>
The <b>callout_data</b> modifier can be given an unsigned or a negative number.
@ -945,7 +1193,7 @@ Finding all matches in a string
</b><br>
<P>
Searching for all possible matches within a subject can be requested by the
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between <b>global</b> and <b>altglobal</b> is that the former uses the
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
@ -996,19 +1244,34 @@ Testing the substitution function
</b><br>
<P>
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
called instead of one of the matching functions. Unlike subject strings,
<b>pcre2test</b> does not process replacement strings for escape sequences. In
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
If so, it is correctly converted to a UTF string of the appropriate code unit
width. If it is not a valid UTF-8 string, the individual code units are copied
directly. This provides a means of passing an invalid UTF-8 string for testing
purposes.
called instead of one of the matching functions. Note that replacement strings
cannot contain commas, because a comma signifies the end of a modifier. This is
not thought to be an issue in a test program.
</P>
<P>
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
is output, preceded by the number of replacements. This may be zero if there
were no matches. Here is a simple example of a substitution test:
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
for escape sequences. In UTF mode, a replacement string is checked to see if it
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
the appropriate code unit width. If it is not a valid UTF-8 string, the
individual code units are copied directly. This provides a means of passing an
invalid UTF-8 string for testing purposes.
</P>
<P>
The following modifiers set options (in additional to the normal match options)
for <b>pcre2_substitute()</b>:
<pre>
global PCRE2_SUBSTITUTE_GLOBAL
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
</PRE>
</P>
<P>
After a successful substitution, the modified string is output, preceded by the
number of replacements. This may be zero if there were no matches. Here is a
simple example of a substitution test:
<pre>
/abc/replace=xxx
=abc=abc=
@ -1016,12 +1279,12 @@ were no matches. Here is a simple example of a substitution test:
=abc=abc=\=global
2: =xxx=xxx=
</pre>
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
output buffer, with the replacement string starting at the next character. Here
is an example that tests the edge case:
Subject and replacement strings should be kept relatively short (fewer than 256
characters) for substitution tests, as fixed-size buffers are used. To make it
easy to test for buffer overflow, if the replacement string starts with a
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
the size of the output buffer, with the replacement string starting at the next
character. Here is an example that tests the edge case:
<pre>
/abc/
123abc123\=replace=[10]XYZ
@ -1029,6 +1292,19 @@ is an example that tests the edge case:
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
</pre>
The default action of <b>pcre2_substitute()</b> is to return
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
to go through the motions of matching and substituting, in order to compute the
size of buffer that is required. When this happens, <b>pcre2test</b> shows the
required buffer length (which includes space for the trailing zero) as part of
the error message. For example:
<pre>
/abc/substitute_overflow_length
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory: 10 code units are needed
</pre>
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from
<b>pcre2_substitute()</b>.
@ -1100,6 +1376,16 @@ The <b>offset</b> modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters.
</P>
<br><b>
Setting an offset limit
</b><br>
<P>
The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match
cannot be found starting at or before this offset in the subject, a "no match"
return is given. The data value is a number of code units, not characters. When
this modifier is used, the <b>use_offset_limit</b> modifier must have been set
for the pattern; if not, an error is generated.
</P>
<br><b>
Setting the size of the output vector
</b><br>
<P>
@ -1131,6 +1417,17 @@ this modifier has no effect, as there is no facility for passing a length.)
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
passing the replacement string as zero-terminated.
</P>
<br><b>
Passing a NULL context
</b><br>
<P>
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>,
<b>pcre2_dfa_match()</b> or <b>pcre2_jit_match()</b>. If the <b>null_context</b>
modifier is set, however, NULL is passed. This is for testing that the matching
functions behave correctly in this case (they use default values). This
modifier cannot be used with the <b>find_limits</b> modifier or when testing the
substitution function.
</P>
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
<P>
By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
@ -1196,7 +1493,7 @@ unset substring is shown as "&#60;unset&#62;", as for the second data line.
If the strings contain any non-printing characters, they are output as \xhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \x{hh...} escapes. See below for the definition of non-printing
characters. If the <b>/aftertext</b> modifier is set, the output for substring
characters. If the <b>aftertext</b> modifier is set, the output for substring
0 is followed by the the rest of the subject string, identified by "0+" like
this:
<pre>
@ -1321,7 +1618,9 @@ item to be tested. For example:
This output indicates that callout number 0 occurred for a match attempt
starting at the fourth character of the subject string, when the pointer was at
the seventh character, and when the next pattern item was \d. Just
one circumflex is output if the start and current positions are the same.
one circumflex is output if the start and current positions are the same, or if
the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
</P>
<P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
@ -1387,7 +1686,7 @@ therefore shown as hex escapes.
<P>
When <b>pcre2test</b> is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
the pattern (using the <b>/locale</b> modifier). In this case, the
the pattern (using the <b>locale</b> modifier). In this case, the
<b>isprint()</b> function is used to distinguish printing and non-printing
characters.
<a name="saverestore"></a></P>
@ -1413,11 +1712,16 @@ can be used to test these functions.
<P>
When a pattern with <b>push</b> modifier is successfully compiled, it is pushed
onto a stack of compiled patterns, and <b>pcre2test</b> expects the next line to
contain a new pattern (or command) instead of a subject line. By this means, a
number of patterns can be compiled and retained. The <b>push</b> modifier is
incompatible with <b>posix</b>, and control modifiers that act at match time are
ignored (with a message). The <b>jitverify</b> modifier applies only at compile
time. The command
contain a new pattern (or command) instead of a subject line. By contrast,
the <b>pushcopy</b> modifier causes a copy of the compiled pattern to be
stacked, leaving the original available for immediate matching. By using
<b>push</b> and/or <b>pushcopy</b>, a number of patterns can be compiled and
retained. These modifiers are incompatible with <b>posix</b>, and control
modifiers that act at match time are ignored (with a message) for the stacked
patterns. The <b>jitverify</b> modifier applies only at compile time.
</P>
<P>
The command
<pre>
#save &#60;filename&#62;
</pre>
@ -1434,7 +1738,8 @@ usual by an empty line or end of file. This command may be followed by a
modifier list containing only
<a href="#controlmodifiers">control modifiers</a>
that act after a pattern has been compiled. In particular, <b>hex</b>,
<b>posix</b>, and <b>push</b> are not allowed, nor are any
<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
nor are any
<a href="#optionmodifiers">option-setting modifiers.</a>
The JIT modifiers are, however permitted. Here is an example that saves and
reloads two patterns.
@ -1452,6 +1757,11 @@ reloads two patterns.
If <b>jitverify</b> is used with #pop, it does not automatically imply
<b>jit</b>, which is different behaviour from when it is used on a pattern.
</P>
<P>
The #popcopy command is analagous to the <b>pushcopy</b> modifier in that it
makes current a copy of the topmost stack pattern, leaving the original still
on the stack.
</P>
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
@ -1469,9 +1779,9 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 May 2015
Last updated: 28 December 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -67,15 +67,20 @@ In UTF modes, the dot metacharacter matches one UTF character instead of a
single code unit.
</P>
<P>
The escape sequence \C can be used to match a single code unit, in a UTF mode,
The escape sequence \C can be used to match a single code unit in a UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \C in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation). The use of \C is not supported in the alternative matching
function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains
\C, it will not succeed, and so the matching will be carried out by the normal
interpretive function.
documentation).
</P>
<P>
The use of \C is not supported by the alternative matching function
<b>pcre2_dfa_match()</b> when in UTF-8 or UTF-16 mode, that is, when a character
may consist of more than one code unit. The use of \C in these modes provokes
a match-time error. Also, the JIT optimization does not support \C in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
the matching will be carried out by the normal interpretive function.
</P>
<P>
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
@ -126,11 +131,22 @@ as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
strings to be in host byte order.
</P>
<P>
The entire string is checked before any other processing takes place. In
addition to checking the format of the string, there is a check to ensure that
all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area.
The so-called "non-character" code points are not excluded because Unicode
corrigendum #9 makes it clear that they should not be.
A UTF string is checked before any other processing takes place. In the case of
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> calls with a non-zero starting
offset, the check is applied only to that part of the subject that could be
inspected during matching, and there is a check that the starting offset points
to the first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the starting
offset. Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \b and \B are
one-character lookbehinds.
</P>
<P>
In addition to checking the format of the string, there is a check to ensure
that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
area. The so-called "non-character" code points are not excluded because
Unicode corrigendum #9 makes it clear that they should not be.
</P>
<P>
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
@ -232,9 +248,9 @@ Errors in UTF-16 strings
<P>
The following negative error codes are given for invalid UTF-16 strings:
<pre>
PCRE_UTF16_ERR1 Missing low surrogate at end of string
PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
PCRE_UTF16_ERR3 Isolated low surrogate
PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
<a name="utf32strings"></a></PRE>
</P>
@ -244,8 +260,8 @@ Errors in UTF-32 strings
<P>
The following negative error codes are given for invalid UTF-32 strings:
<pre>
PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff)
PCRE_UTF32_ERR2 Code point is greater than 0x10ffff
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
</PRE>
</P>
@ -264,9 +280,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 23 November 2014
Last updated: 03 July 2016
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -91,6 +91,12 @@ in the library.
<tr><td><a href="pcre2_callout_enumerate.html">pcre2_callout_enumerate</a></td>
<td>&nbsp;&nbsp;Enumerate callouts in a compiled pattern</td></tr>
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern</td></tr>
<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern and its character tables</td></tr>
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
<td>&nbsp;&nbsp;Free a compiled pattern</td></tr>
@ -210,9 +216,15 @@ in the library.
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
<td>&nbsp;&nbsp;Set the match limit</td></tr>
<tr><td><a href="pcre2_set_max_pattern_length.html">pcre2_set_max_pattern_length</a></td>
<td>&nbsp;&nbsp;Set the maximum length of pattern</td></tr>
<tr><td><a href="pcre2_set_newline.html">pcre2_set_newline</a></td>
<td>&nbsp;&nbsp;Set the newline convention</td></tr>
<tr><td><a href="pcre2_set_offset_limit.html">pcre2_set_offset_limit</a></td>
<td>&nbsp;&nbsp;Set the offset limit</td></tr>
<tr><td><a href="pcre2_set_parens_nest_limit.html">pcre2_set_parens_nest_limit</a></td>
<td>&nbsp;&nbsp;Set the parentheses nesting limit</td></tr>

View File

@ -1,4 +1,4 @@
.TH PCRE2 3 "13 April 2015" "PCRE2 10.20"
.TH PCRE2 3 "16 October 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH INTRODUCTION
@ -118,8 +118,10 @@ running redundant checks.
.P
The use of the \eC escape sequence in a UTF-8 or UTF-16 pattern can lead to
problems, because it may leave the current matching point in the middle of a
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
lock out the use of \eC, causing a compile-time error if it is encountered.
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
application to lock out the use of \eC, causing a compile-time error if it is
encountered. It is also possible to build PCRE2 with the use of \eC permanently
disabled.
.P
Another way that performance can be hit is by running a pattern that has a very
large search tree against a string that will never match. Nested unlimited
@ -187,6 +189,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
Last updated: 13 April 2015
Last updated: 16 October 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,31 @@
.TH PCRE2_CODE_COPY 3 "22 November 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
.rs
.sp
.B #include <pcre2.h>
.PP
.nf
.B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP);
.fi
.
.SH DESCRIPTION
.rs
.sp
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching. The
pointer to the character tables is copied, not the tables themselves (see
\fBpcre2_code_copy_with_tables()\fP). The yield of the function is NULL if
\fIcode\fP is NULL or if sufficient memory cannot be obtained.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
\fBpcre2api\fP
.\"
page and a description of the POSIX API in the
.\" HREF
\fBpcre2posix\fP
.\"
page.

View File

@ -0,0 +1,32 @@
.TH PCRE2_CODE_COPY 3 "22 November 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
.rs
.sp
.B #include <pcre2.h>
.PP
.nf
.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP);
.fi
.
.SH DESCRIPTION
.rs
.sp
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching.
Unlike \fBpcre2_code_copy()\fP, a separate copy of the character tables is also
made, with the new code pointing to it. This memory will be automatically freed
when \fBpcre2_code_free()\fP is called. The yield of the function is NULL if
\fIcode\fP is NULL or if sufficient memory cannot be obtained.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
\fBpcre2api\fP
.\"
page and a description of the POSIX API in the
.\" HREF
\fBpcre2posix\fP
.\"
page.

View File

@ -1,4 +1,4 @@
.TH PCRE2_CODE_FREE 3 "21 October 2014" "PCRE2 10.00"
.TH PCRE2_CODE_FREE 3 "29 July 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -7,7 +7,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.B #include <pcre2.h>
.PP
.nf
.B pcre2_code_free(pcre2_code *\fIcode\fP);
.B void pcre2_code_free(pcre2_code *\fIcode\fP);
.fi
.
.SH DESCRIPTION

View File

@ -1,4 +1,4 @@
.TH PCRE2_DFA_MATCH 3 "12 May 2013" "PCRE2 10.00"
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -33,8 +33,8 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
\fIwscount\fP Number of elements in the vector
.sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
up a callout function. The \fIlength\fP and \fIstartoffset\fP values are code
units, not characters. The options are:
up a callout function or specify the recursion limit. The \fIlength\fP and
\fIstartoffset\fP values are code units, not characters. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line

View File

@ -1,4 +1,4 @@
.TH PCRE2_GET_ERROR_MESSAGE 3 "21 October 2014" "PCRE2 10.00"
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -23,7 +23,10 @@ errors are negative numbers. The arguments are:
\fIbufflen\fP the length of the buffer (code units)
.sp
The function returns the length of the message, excluding the trailing zero, or
a negative error code if the buffer is too small.
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If \fIerrorcode\fP does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF

View File

@ -1,4 +1,4 @@
.TH PCRE2_MATCH_DATA_CREATE 3 "22 October 2014" "PCRE2 10.00"
.TH PCRE2_MATCH_DATA_CREATE 3 "29 July 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -7,7 +7,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.B #include <pcre2.h>
.PP
.nf
.B pcre2_match_data_create(uint32_t \fIovecsize\fP,
.B pcre2_match_data *pcre2_match_data_create(uint32_t \fIovecsize\fP,
.B " pcre2_general_context *\fIgcontext\fP);"
.fi
.

View File

@ -1,4 +1,4 @@
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "24 October 2014" "PCRE2 10.00"
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "29 July 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -7,8 +7,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.B #include <pcre2.h>
.PP
.nf
.B pcre2_match_data_create_from_pattern(const pcre2_code *\fIcode\fP,
.B " pcre2_general_context *\fIgcontext\fP);"
.B pcre2_match_data *pcre2_match_data_create_from_pattern(
.B " const pcre2_code *\fIcode\fP, pcre2_general_context *\fIgcontext\fP);"
.fi
.
.SH DESCRIPTION

View File

@ -1,4 +1,4 @@
.TH PCRE2_PATTERN_INFO 3 "01 December 2014" "PCRE2 10.00"
.TH PCRE2_PATTERN_INFO 3 "21 November 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -30,19 +30,20 @@ request are as follows:
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
0 nothing set
1 first code unit is set
2 start of string or after newline
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \eC
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
exist in the pattern
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
0 nothing set
1 code unit is set
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
empty string, 0 otherwise
PCRE2_INFO_MATCHLIMIT Match limit if set,
@ -50,8 +51,8 @@ request are as follows:
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
lookbehind assertion
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
PCRE2_INFO_NAMECOUNT Number of named subpatterns
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
PCRE2_INFO_NAMETABLE Pointer to name table
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
PCRE2_NEWLINE_CR

View File

@ -1,4 +1,4 @@
.TH PCRE2_SERIALIZE_DECODE 3 "19 January 2015" "PCRE2 10.10"
.TH PCRE2_SERIALIZE_DECODE 3 "02 September 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -8,7 +8,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.PP
.nf
.B int32_t pcre2_serialize_decode(pcre2_code **\fIcodes\fP,
.B " int32_t \fInumber_of_codes\fP, const uint32_t *\fIbytes\fP,"
.B " int32_t \fInumber_of_codes\fP, const uint8_t *\fIbytes\fP,"
.B " pcre2_general_context *\fIgcontext\fP);"
.fi
.

View File

@ -1,4 +1,4 @@
.TH PCRE2_SERIALIZE_ENCODE 3 "19 January 2015" "PCRE2 10.10"
.TH PCRE2_SERIALIZE_ENCODE 3 "02 September 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -7,8 +7,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.B #include <pcre2.h>
.PP
.nf
.B int32_t pcre2_serialize_encode(pcre2_code **\fIcodes\fP,
.B " int32_t \fInumber_of_codes\fP, uint32_t **\fIserialized_bytes\fP,"
.B int32_t pcre2_serialize_encode(const pcre2_code **\fIcodes\fP,
.B " int32_t \fInumber_of_codes\fP, uint8_t **\fIserialized_bytes\fP,"
.B " PCRE2_SIZE *\fIserialized_size\fP, pcre2_general_context *\fIgcontext\fP);"
.fi
.

View File

@ -0,0 +1,31 @@
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "05 October 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
.rs
.sp
.B #include <pcre2.h>
.PP
.nf
.B int pcre2_set_max_pattern_length(pcre2_compile_context *\fIccontext\fP,
.B " PCRE2_SIZE \fIvalue\fP);"
.fi
.
.SH DESCRIPTION
.rs
.sp
This function sets, in a compile context, the maximum text length (in code
units) of the pattern that can be compiled. The result is always zero. If a
longer pattern is passed to \fBpcre2_compile()\fP there is an immediate error
return. The default is effectively unlimited, being the largest value a
PCRE2_SIZE variable can hold.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
\fBpcre2api\fP
.\"
page and a description of the POSIX API in the
.\" HREF
\fBpcre2posix\fP
.\"
page.

View File

@ -0,0 +1,28 @@
.TH PCRE2_SET_OFFSET_LIMIT 3 "22 September 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
.rs
.sp
.B #include <pcre2.h>
.PP
.nf
.B int pcre2_set_offset_limit(pcre2_match_context *\fImcontext\fP,
.B " PCRE2_SIZE \fIvalue\fP);"
.fi
.
.SH DESCRIPTION
.rs
.sp
This function sets the offset limit field in a match context. The result is
always zero.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
\fBpcre2api\fP
.\"
page and a description of the POSIX API in the
.\" HREF
\fBpcre2posix\fP
.\"
page.

View File

@ -1,4 +1,4 @@
.TH PCRE2_SUBSTITUTE 3 "11 November 2014" "PCRE2 10.00"
.TH PCRE2_SUBSTITUTE 3 "12 December 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -47,20 +47,25 @@ units, not characters, as is the contents of the variable pointed at by
\fIoutlengthptr\fP, which is updated to the actual length of the new string.
The options are:
.sp
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject or replacement for
UTF validity (only relevant if PCRE2_UTF
was set at compile time)
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
subject is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject or replacement
for UTF validity (only relevant if
PCRE2_UTF was set at compile time)
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
.sp
The function returns the number of substitutions, which may be zero if there
were no matches. The result can be greater than one only when
PCRE2_SUBSTITUTE_GLOBAL is set.
PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF

File diff suppressed because it is too large Load Diff

View File

@ -1,4 +1,4 @@
.TH PCRE2BUILD 3 "23 April 2015" "PCRE2 10.20"
.TH PCRE2BUILD 3 "01 November 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@ -132,11 +132,20 @@ Pattern escapes such as \ed and \ew do not by default make use of Unicode
properties. The application can request that they do by setting the PCRE2_UCP
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
request this by starting with (*UCP).
.P
.
.
.SH "DISABLING THE USE OF \eC"
.rs
.sp
The \eC escape sequence, which matches a single code unit, even in a UTF mode,
can cause unpredictable behaviour because it may leave the current matching
point in the middle of a multi-code-unit character. It can be locked out by
setting the PCRE2_NEVER_BACKSLASH_C option.
point in the middle of a multi-code-unit character. The application can lock it
out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
\fBpcre2_compile()\fP. There is also a build-time option
.sp
--enable-never-backslash-C
.sp
(note the upper case C) which locks out the use of \eC entirely.
.
.
.SH "JUST-IN-TIME COMPILER SUPPORT"
@ -343,6 +352,19 @@ and equivalent run-time options, refer to these character values in an EBCDIC
environment.
.
.
.SH "PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS"
.rs
.sp
By default, on non-Windows systems, \fBpcre2grep\fP supports the use of
callouts with string arguments within the patterns it is matching, in order to
run external scripts. For details, see the
.\" HREF
\fBpcre2grep\fP
.\"
documentation. This support can be disabled by adding
--disable-pcre2grep-callout to the \fBconfigure\fP command.
.
.
.SH "PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT"
.rs
.sp
@ -363,16 +385,19 @@ they are not.
.sp
\fBpcre2grep\fP uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
finds a match. The size of the buffer is controlled by a parameter whose
default value is 20K. The buffer itself is three times this size, but because
of the way it is used for holding "before" lines, the longest line that is
guaranteed to be processable is the parameter size. You can change the default
parameter value by adding, for example,
finds a match. The starting size of the buffer is controlled by a parameter
whose default value is 20K. The buffer itself is three times this size, but
because of the way it is used for holding "before" lines, the longest line that
is guaranteed to be processable is the parameter size. If a longer line is
encountered, \fBpcre2grep\fP automatically expands the buffer, up to a
specified maximum size, whose default is 1M or the starting size, whichever is
the larger. You can change the default parameter values by adding, for example,
.sp
--with-pcre2grep-bufsize=50K
--with-pcre2grep-bufsize=51200
--with-pcre2grep-max-bufsize=2097152
.sp
to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can override this
value by using --buffer-size on the command line..
to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can override
these values by using --buffer-size and --max-buffer-size on the command line.
.
.
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
@ -490,6 +515,28 @@ information about code coverage, see the \fBgcov\fP and \fBlcov\fP
documentation.
.
.
.SH "SUPPORT FOR FUZZERS"
.rs
.sp
There is a special option for use by people who want to run fuzzing tests on
PCRE2:
.sp
--enable-fuzz-support
.sp
At present this applies only to the 8-bit library. If set, it causes an extra
library called libpcre2-fuzzsupport.a to be built, but not installed. This
contains a single function called LLVMFuzzerTestOneInput() whose arguments are
a pointer to a string and the length of the string. When called, this function
tries to compile the string as a pattern, and if that succeeds, to match it.
This is done both with no options and with some random options bits that are
generated from the string. Setting --enable-fuzz-support also causes a binary
called \fBpcre2fuzzcheck\fP to be created. This is normally run under valgrind
or used when PCRE2 is compiled with address sanitizing enabled. It calls the
fuzzing function and outputs information about it is doing. The input strings
are specified by arguments: if an argument starts with "=" the rest of it is a
literal input string. Otherwise, it is assumed to be a file name, and the
contents of the file are the test string.
.
.SH "SEE ALSO"
.rs
.sp
@ -510,6 +557,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 24 April 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 01 November 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "23 March 2015" "PCRE2 10.20"
.TH PCRE2CALLOUT 3 "29 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -40,11 +40,20 @@ two callout points:
.sp
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
pattern except for immediately before or after a callout item in the pattern.
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
.sp
A(?C3)B
.sp
it is processed as if it were
.sp
(?C255)A(?C3)B(?C255)
.sp
Here is a more complicated example:
.sp
A(\ed{2}|--)
.sp
it is processed as if it were
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
.sp
(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
.sp
@ -91,10 +100,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
No match
.sp
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
case, the output changes to this:
(because it is being treated as a++) and therefore the callouts that would be
taken for the backtracks do not occur. You can disable the auto-possessify
feature by passing PCRE2_NO_AUTO_POSSESS to \fBpcre2_compile()\fP, or starting
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
.sp
--->aaaa
+0 ^ a+
@ -220,8 +229,8 @@ but the intention is never to remove any of the existing fields.
.sp
For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP
contains the number of the callout, in the range 0-255. This is the number
that follows (?C for manual callouts; it is 255 for automatically generated
callouts.
that follows (?C for callouts that part of the pattern; it is 255 for
automatically generated callouts.
.
.
.SS "Fields for string callouts"
@ -286,10 +295,15 @@ The \fIpattern_position\fP field contains the offset in the pattern string to
the next item to be matched.
.P
The \fInext_item_length\fP field contains the length of the next item to be
matched in the pattern string. When the callout immediately precedes an
alternation bar, a closing parenthesis, or the end of the pattern, the length
is zero. When the callout precedes an opening parenthesis, the length is that
of the entire subpattern.
processed in the pattern string. When the callout is at the end of the pattern,
the length is zero. When the callout precedes an opening parenthesis, the
length includes meta characters that follow the parenthesis. For example, in a
callout before an assertion such as (?=ab) the length is 3. For an an
alternation bar or a closing parenthesis, the length is one, unless a closing
parenthesis is followed by a quantifier, in which case its length is included.
(This changed in release 10.23. In earlier releases, before an opening
parenthesis the length was that of the entire subpattern, and before an
alternation bar or a closing parenthesis the length was zero.)
.P
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
help in distinguishing between different automatic callouts, which all have the
@ -382,6 +396,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 March 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 29 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20"
.TH PCRE2COMPAT 3 "18 October 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -96,7 +96,7 @@ processed as anchored at the point where they are tested.
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are examples where it differs.
same as PCRE2, but there are cases where it differs.
.P
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
@ -109,17 +109,18 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact the PCRE2
works internally just with numbers, using an external table to translate
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b)B),
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B),
where the two capturing parentheses have the same number but different names,
is not supported, and causes an error at compile time. If it were allowed, it
would not be possible to distinguish which parentheses matched, because both
names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
.P
14. Perl recognizes comments in some places that PCRE2 does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
Perl allows white space between ( and ? (though current Perls warn that this is
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
14. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a subpattern. If the /x modifier
is set, Perl allowed white space between ( and ? though the latest Perls give
an error (for a while it was just deprecated). There may still be some cases
where Perl behaves differently.
.P
15. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
@ -141,33 +142,37 @@ list is with respect to Perl 5.10:
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
.sp
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
(b) From PCRE2 10.23, back references to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
non-unique number or name. Perl does not support backreferences in lookbehinds.
.sp
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
.sp
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
.sp
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
.sp
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
.sp
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
.sp
(g) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
(h) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
.sp
(h) The callout facility is PCRE2-specific.
(i) The callout facility is PCRE2-specific.
.sp
(i) The partial matching facility is PCRE2-specific.
(j) The partial matching facility is PCRE2-specific.
.sp
(j) The alternative matching function (\fBpcre2_dfa_match()\fP matches in a
(k) The alternative matching function (\fBpcre2_dfa_match()\fP matches in a
different way and is not Perl-compatible.
.sp
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
(l) PCRE2 recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
.
.
@ -185,6 +190,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 15 March 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 18 October 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -20,28 +20,31 @@
*************************************************/
/* This is a demonstration program to illustrate a straightforward way of
calling the PCRE2 regular expression library from a C program. See the
using the PCRE2 regular expression library from a C program. See the
pcre2sample documentation for a short discussion ("man pcre2sample" if you have
the PCRE2 man pages installed). PCRE2 is a revised API for the library, and is
incompatible with the original PCRE API.
There are actually three libraries, each supporting a different code unit
width. This demonstration program uses the 8-bit library.
width. This demonstration program uses the 8-bit library. The default is to
process each code unit as a separate character, but if the pattern begins with
"(*UTF)", both it and the subject are treated as UTF-8 strings, where
characters may occupy multiple code units.
In Unix-like environments, if PCRE2 is installed in your standard system
libraries, you should be able to compile this program using this command:
gcc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
cc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
If PCRE2 is not installed in a standard place, it is likely to be installed
with support for the pkg-config mechanism. If you have pkg-config, you can
compile this program using this command:
gcc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
cc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
If you do not have pkg-config, you may have to use this:
If you do not have pkg-config, you may have to use something like this:
gcc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \e
cc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \e
-R/usr/local/lib -lpcre2-8 -o pcre2demo
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
@ -56,9 +59,14 @@ the following line. */
/* #define PCRE2_STATIC */
/* This macro must be defined before including pcre2.h. For a program that uses
only one code unit width, it makes it possible to use generic function names
such as pcre2_compile(). */
/* The PCRE2_CODE_UNIT_WIDTH macro must be defined before including pcre2.h.
For a program that uses only one code unit width, setting it to 8, 16, or 32
makes it possible to use generic function names such as pcre2_compile(). Note
that just changing 8 to 16 (for example) is not sufficient to convert this
program to process 16-bit characters. Even in a fully 16-bit environment, where
string-handling functions such as strcmp() and printf() work with 16-bit
characters, the code for handling the table of named substrings will still need
to be modified. */
#define PCRE2_CODE_UNIT_WIDTH 8
@ -79,19 +87,19 @@ int main(int argc, char **argv)
{
pcre2_code *re;
PCRE2_SPTR pattern; /* PCRE2_SPTR is a pointer to unsigned code units of */
PCRE2_SPTR subject; /* the appropriate width (8, 16, or 32 bits). */
PCRE2_SPTR subject; /* the appropriate width (in this case, 8 bits). */
PCRE2_SPTR name_table;
int crlf_is_newline;
int errornumber;
int find_all;
int i;
int namecount;
int name_entry_size;
int rc;
int utf8;
uint32_t option_bits;
uint32_t namecount;
uint32_t name_entry_size;
uint32_t newline;
PCRE2_SIZE erroroffset;
@ -106,15 +114,19 @@ pcre2_match_data *match_data;
* First, sort out the command line. There is only one possible option at *
* the moment, "-g" to request repeated matching to find all occurrences, *
* like Perl's /g option. We set the variable find_all to a non-zero value *
* if the -g option is present. Apart from that, there must be exactly two *
* arguments. *
* if the -g option is present. *
**************************************************************************/
find_all = 0;
for (i = 1; i < argc; i++)
{
if (strcmp(argv[i], "-g") == 0) find_all = 1;
else break;
else if (argv[i][0] == '-')
{
printf("Unrecognised option %s\en", argv[i]);
return 1;
}
else break;
}
/* After the options, we require exactly two arguments, which are the pattern,
@ -122,7 +134,7 @@ and the subject string. */
if (argc - i != 2)
{
printf("Two arguments required: a regex and a subject string\en");
printf("Exactly two arguments required: a regex and a subject string\en");
return 1;
}
@ -201,7 +213,7 @@ if (rc < 0)
stored. */
ovector = pcre2_get_ovector_pointer(match_data);
printf("\enMatch succeeded at offset %d\en", (int)ovector[0]);
printf("Match succeeded at offset %d\en", (int)ovector[0]);
/*************************************************************************
@ -242,7 +254,7 @@ we have to extract the count of named parentheses from the pattern. */
PCRE2_INFO_NAMECOUNT, /* get the number of named substrings */
&namecount); /* where to put the answer */
if (namecount <= 0) printf("No named substrings\en"); else
if (namecount == 0) printf("No named substrings\en"); else
{
PCRE2_SPTR tabptr;
printf("Named substrings\en");
@ -330,8 +342,8 @@ crlf_is_newline = newline == PCRE2_NEWLINE_ANY ||
for (;;)
{
uint32_t options = 0; /* Normally no options */
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
uint32_t options = 0; /* Normally no options */
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
/* If the previous match was for an empty string, we are finished if we are
at the end of the subject. Otherwise, arrange to run another match at the
@ -371,7 +383,7 @@ for (;;)
{
if (options == 0) break; /* All matches found */
ovector[1] = start_offset + 1; /* Advance one code unit */
if (crlf_is_newline && /* If CRLF is newline & */
if (crlf_is_newline && /* If CRLF is a newline & */
start_offset < subject_length - 1 && /* we are at CRLF, */
subject[start_offset] == '\er' &&
subject[start_offset + 1] == '\en')
@ -417,7 +429,7 @@ for (;;)
printf("%2d: %.*s\en", i, (int)substring_length, (char *)substring_start);
}
if (namecount <= 0) printf("No named substrings\en"); else
if (namecount == 0) printf("No named substrings\en"); else
{
PCRE2_SPTR tabptr = name_table;
printf("Named substrings\en");

View File

@ -1,4 +1,4 @@
.TH PCRE2GREP 1 "03 January 2015" "PCRE2 10.00"
.TH PCRE2GREP 1 "31 December 2016" "PCRE2 10.23"
.SH NAME
pcre2grep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
@ -52,11 +52,18 @@ span line boundaries. What defines a line boundary is controlled by the
\fB-N\fP (\fB--newline\fP) option.
.P
The amount of memory used for buffering files that are being scanned is
controlled by a parameter that can be set by the \fB--buffer-size\fP option.
The default value for this parameter is specified when \fBpcre2grep\fP is
built, with the default default being 20K. A block of memory three times this
size is used (to allow for buffering "before" and "after" lines). An error
occurs if a line overflows the buffer.
controlled by parameters that can be set by the \fB--buffer-size\fP and
\fB--max-buffer-size\fP options. The first of these sets the size of buffer
that is obtained at the start of processing. If an input file contains very
long lines, a larger buffer may be needed; this is handled by automatically
extending the buffer, up to the limit specified by \fB--max-buffer-size\fP. The
default values for these parameters are specified when \fBpcre2grep\fP is
built, with the default defaults being 20K and 1M respectively. An error occurs
if a line is too long and the buffer can no longer be expanded.
.P
The block of memory that is actually used is three times the "buffer size", to
allow for buffering "before" and "after" lines. If the buffer size is too
small, fewer than requested "before" and "after" lines may be output.
.P
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
BUFSIZ is defined in \fB<stdio.h>\fP. When there is more than one pattern
@ -126,24 +133,27 @@ command line starts with a hyphen but is not an option. This allows for the
processing of patterns and file names that start with hyphens.
.TP
\fB-A\fP \fInumber\fP, \fB--after-context=\fP\fInumber\fP
Output \fInumber\fP lines of context after each matching line. If file names
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of \fInumber\fP is expected to be relatively small. However, \fBpcre2grep\fP
guarantees to have up to 8K of following text available for context output.
Output up to \fInumber\fP lines of context after each matching line. Fewer
lines are output if the next match or the end of the file is reached, or if the
processing buffer size has been set too small. If file names and/or line
numbers are being output, a hyphen separator is used instead of a colon for the
context lines. A line containing "--" is output between each group of lines,
unless they are in fact contiguous in the input file. The value of \fInumber\fP
is expected to be relatively small. When \fB-c\fP is used, \fB-A\fP is ignored.
.TP
\fB-a\fP, \fB--text\fP
Treat binary files as text. This is equivalent to
\fB--binary-files\fP=\fItext\fP.
.TP
\fB-B\fP \fInumber\fP, \fB--before-context=\fP\fInumber\fP
Output \fInumber\fP lines of context before each matching line. If file names
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of \fInumber\fP is expected to be relatively small. However, \fBpcre2grep\fP
guarantees to have up to 8K of preceding text available for context output.
Output up to \fInumber\fP lines of context before each matching line. Fewer
lines are output if the previous match or the start of the file is within
\fInumber\fP lines, or if the processing buffer size has been set too small. If
file names and/or line numbers are being output, a hyphen separator is used
instead of a colon for the context lines. A line containing "--" is output
between each group of lines, unless they are in fact contiguous in the input
file. The value of \fInumber\fP is expected to be relatively small. When
\fB-c\fP is used, \fB-B\fP is ignored.
.TP
\fB--binary-files=\fP\fIword\fP
Specify how binary files are to be processed. If the word is "binary" (the
@ -158,8 +168,9 @@ be of interest and are skipped without causing any output or affecting the
return code.
.TP
\fB--buffer-size=\fP\fInumber\fP
Set the parameter that controls how much memory is used for buffering files
that are being scanned.
Set the parameter that controls how much memory is obtained at the start of
processing for buffering files that are being scanned. See also
\fB--max-buffer-size\fP below.
.TP
\fB-C\fP \fInumber\fP, \fB--context=\fP\fInumber\fP
Output \fInumber\fP lines of context both before and after each matching line.
@ -167,13 +178,15 @@ This is equivalent to setting both \fB-A\fP and \fB-B\fP to the same value.
.TP
\fB-c\fP, \fB--count\fP
Do not output lines from the files that are being scanned; instead output the
number of matches (or non-matches if \fB-v\fP is used) that would otherwise
have caused lines to be shown. By default, this count is the same as the number
of suppressed lines, but if the \fB-M\fP (multiline) option is used (without
\fB-v\fP), there may be more suppressed lines than the number of matches.
number of lines that would have been shown, either because they matched, or, if
\fB-v\fP is set, because they failed to match. By default, this count is
exactly the same as the number of lines that would have been output, but if the
\fB-M\fP (multiline) option is used (without \fB-v\fP), there may be more
suppressed lines than the count (that is, the number of matches).
.sp
If no lines are selected, the number zero is output. If several files are are
being scanned, a count is output for each of them. However, if the
being scanned, a count is output for each of them and the \fB-t\fP option can
be used to cause a total to be output at the end. However, if the
\fB--files-with-matches\fP option is also used, only those files whose counts
are greater than zero are listed. When \fB-c\fP is used, the \fB-A\fP,
\fB-B\fP, and \fB-C\fP options are ignored.
@ -192,12 +205,22 @@ connected to a terminal. More resources are used when colouring is enabled,
because \fBpcre2grep\fP has to search for all possible matches in a line, not
just one, in order to colour them all.
.sp
The colour that is used can be specified by setting the environment variable
PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
string of two numbers, separated by a semicolon. They are copied directly into
the control string for setting colour on a terminal, so it is your
responsibility to ensure that they make sense. If neither of the environment
variables is set, the default is "1;31", which gives red.
The colour that is used can be specified by setting one of the environment
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
PCREGREP_COLOR, which are checked in that order. If none of these are set,
\fBpcre2grep\fP looks for GREP_COLORS or GREP_COLOR (in that order). The value
of the variable should be a string of two numbers, separated by a semicolon,
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
followed by two semicolon-separated colours, terminated by the end of the
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
ignored, and GREP_COLOR is checked.
.sp
If the string obtained from one of the above variables contains any characters
other than semicolon or digits, the setting is ignored and the default colour
is used. The string is copied directly into the control string for setting
colour on a terminal, so it is your responsibility to ensure that the values
make sense. If no relevant environment variable is set, the default is "1;31",
which gives red.
.TP
\fB-D\fP \fIaction\fP, \fB--devices=\fP\fIaction\fP
If an input path is not a regular file or a directory, "action" specifies how
@ -273,17 +296,17 @@ files; it does not apply to patterns specified by any of the \fB--include\fP or
\fB--exclude\fP options.
.TP
\fB-f\fP \fIfilename\fP, \fB--file=\fP\fIfilename\fP
Read patterns from the file, one per line, and match them against
each line of input. What constitutes a newline when reading the file is the
operating system's default. The \fB--newline\fP option has no effect on this
option. Trailing white space is removed from each line, and blank lines are
ignored. An empty file contains no patterns and therefore matches nothing. See
also the comments about multiple patterns versus a single pattern with
alternatives in the description of \fB-e\fP above.
Read patterns from the file, one per line, and match them against each line of
input. What constitutes a newline when reading the file is the operating
system's default. The \fB--newline\fP option has no effect on this option.
Trailing white space is removed from each line, and blank lines are ignored. An
empty file contains no patterns and therefore matches nothing. See also the
comments about multiple patterns versus a single pattern with alternatives in
the description of \fB-e\fP above.
.sp
If this option is given more than once, all the specified files are
read. A data line is output if any of the patterns match it. A file name can
be given as "-" to refer to the standard input. When \fB-f\fP is used, patterns
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
as "-" to refer to the standard input. When \fB-f\fP is used, patterns
specified on the command line using \fB-e\fP may also be present; they are
tested before the file's patterns. However, no other pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
@ -432,18 +455,25 @@ of use only if it is set smaller than \fB--match-limit\fP.
There are no short forms for these options. The default settings are specified
when the PCRE2 library is compiled, with the default default being 10 million.
.TP
\fB--max-buffer-size=\fInumber\fP
This limits the expansion of the processing buffer, whose initial size can be
set by \fB--buffer-size\fP. The maximum buffer size is silently forced to be no
smaller than the starting buffer size.
.TP
\fB-M\fP, \fB--multiline\fP
Allow patterns to match more than one line. When this option is given, patterns
may usefully contain literal newline characters and internal occurrences of ^
and $ characters. The output for a successful match may consist of more than
one line. The first is the line in which the match started, and the last is the
line in which the match ended. If the matched string ends with a newline
sequence the output ends at the end of that line.
Allow patterns to match more than one line. When this option is set, the PCRE2
library is called in "multiline" mode. This allows a matched string to extend
past the end of a line and continue on one or more subsequent lines. Patterns
used with \fB-M\fP may usefully contain literal newline characters and internal
occurrences of ^ and $ characters. The output for a successful match may
consist of more than one line. The first line is the line in which the match
started, and the last line is the line in which the match ended. If the matched
string ends with a newline sequence, the output ends at the end of that line.
If \fB-v\fP is set, none of the lines in a multi-line match are output. Once a
match has been handled, scanning restarts at the beginning of the line after
the one in which the match ended.
.sp
When this option is set, the PCRE2 library is called in "multiline" mode.
However, \fBpcre2grep\fP still processes the input line by line. The difference
is that a matched string may extend past the end of a line and continue on
one or more subsequent lines. The newline sequence must be matched as part of
The newline sequence that separates multiple lines must be matched as part of
the pattern. For example, to find the phrase "regular expression" in a file
where "regular" might be at the end of a line and "expression" at the start of
the next line, you could use this command:
@ -455,11 +485,8 @@ and is followed by + so as to match trailing white space on the first line as
well as possibly handling a two-character newline sequence.
.sp
There is a limit to the number of lines that can be matched, imposed by the way
that \fBpcre2grep\fP buffers the input file as it scans it. However,
\fBpcre2grep\fP ensures that at least 8K characters or the rest of the file
(whichever is the shorter) are available for forward matching, and similarly
the previous 8K characters (or all the previous characters, if fewer than 8K)
are guaranteed to be available for lookbehind assertions. The \fB-M\fP option
that \fBpcre2grep\fP buffers the input file as it scans it. With a sufficiently
large processing buffer, this should not be a problem, but the \fB-M\fP option
does not work when input is read line by line (see \fP--line-buffered\fP.)
.TP
\fB-N\fP \fInewline-type\fP, \fB--newline\fP=\fInewline-type\fP
@ -502,12 +529,13 @@ It should never be needed in normal use.
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the \fB-A\fP, \fB-B\fP, and
\fB-C\fP options are ignored. If there is more than one match in a line, each
of them is shown separately. If \fB-o\fP is combined with \fB-v\fP (invert the
sense of the match to find non-matching lines), no output is generated, but the
return code is set appropriately. If the matched portion of the line is empty,
nothing is output unless the file name or line number are being printed, in
which case they are shown on an otherwise empty line. This option is mutually
exclusive with \fB--file-offsets\fP and \fB--line-offsets\fP.
of them is shown separately, on a separate line of output. If \fB-o\fP is
combined with \fB-v\fP (invert the sense of the match to find non-matching
lines), no output is generated, but the return code is set appropriately. If
the matched portion of the line is empty, nothing is output unless the file
name or line number are being printed, in which case they are shown on an
otherwise empty line. This option is mutually exclusive with
\fB--file-offsets\fP and \fB--line-offsets\fP.
.TP
\fB-o\fP\fInumber\fP, \fB--only-matching\fP=\fInumber\fP
Show only the part of the line that matched the capturing parentheses of the
@ -519,10 +547,11 @@ for the non-argument case above also apply to this case. If the specified
capturing parentheses do not exist in the pattern, or were not set in the
match, nothing is output unless the file name or line number are being output.
.sp
If this option is given multiple times, multiple substrings are output, in the
order the options are given. For example, -o3 -o1 -o3 causes the substrings
matched by capturing parentheses 3 and 1 and then 3 again to be output. By
default, there is no separator (but see the next option).
If this option is given multiple times, multiple substrings are output for each
match, in the order the options are given, and all on one line. For example,
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
then 3 again to be output. By default, there is no separator (but see the next
option).
.TP
\fB--om-separator\fP=\fItext\fP
Specify a separating string for multiple occurrences of \fB-o\fP. The default
@ -547,6 +576,17 @@ Suppress error messages about non-existent or unreadable files. Such files are
quietly skipped. However, the return code is still 2, even if matches were
found in other files.
.TP
\fB-t\fP, \fB--total-count\fP
This option is useful when scanning more than one file. If used on its own,
\fB-t\fP suppresses all output except for a grand total number of matching
lines (or non-matching lines if \fB-v\fP is used) in all the files. If \fB-t\fP
is used with \fB-c\fP, a grand total is output except when the previous output
is just one line. In other words, it is not output when just one file's count
is listed. If file names are being output, the grand total is preceded by
"TOTAL:". Otherwise, it appears as just another number. The \fB-t\fP option is
ignored when used with \fB-L\fP (list files without matches), because the grand
total would always be zero.
.TP
\fB-u\fP, \fB--utf-8\fP
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
@ -570,11 +610,12 @@ specified by any of the \fB--include\fP or \fB--exclude\fP options.
.TP
\fB-x\fP, \fB--line-regex\fP, \fB--line-regexp\fP
Force the patterns to be anchored (each must start matching at the beginning of
a line) and in addition, require them to match entire lines. This is equivalent
to having ^ and $ characters at the start and end of each alternative top-level
branch in every pattern. This option applies only to the patterns that are
matched against the contents of files; it does not apply to patterns specified
by any of the \fB--include\fP or \fB--exclude\fP options.
a line) and in addition, require them to match entire lines. In multiline mode
the match may be more than one line. This is equivalent to having \eA and \eZ
characters at the start and end of each alternative top-level branch in every
pattern. This option applies only to the patterns that are matched against the
contents of files; it does not apply to patterns specified by any of the
\fB--include\fP or \fB--exclude\fP options.
.
.
.SH "ENVIRONMENT VARIABLES"
@ -653,6 +694,58 @@ options does have data, it must be given in the first form, using an equals
character. Otherwise \fBpcre2grep\fP will assume that it has no data.
.
.
.SH "CALLING EXTERNAL SCRIPTS"
.rs
.sp
\fBpcre2grep\fP has, by default, support for calling external programs or
scripts during matching by making use of PCRE2's callout facility. However,
this support can be disabled when \fBpcre2grep\fP is built. You can find out
whether your binary has support for callouts by running it with the \fB--help\fP
option. If the support is not enabled, all callouts in patterns are ignored by
\fBpcre2grep\fP.
.P
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argument is
either a number or a quoted string (see the
.\" HREF
\fBpcre2callout\fP
.\"
documentation for details). Numbered callouts are ignored by \fBpcre2grep\fP.
String arguments are parsed as a list of substrings separated by pipe (vertical
bar) characters. The first substring must be an executable name, with the
following substrings specifying arguments:
.sp
executable_name|arg1|arg2|...
.sp
Any substring (including the executable name) may contain escape sequences
started by a dollar character: $<digits> or ${<digits>} is replaced by the
captured substring of the given decimal number, which must be greater than
zero. If the number is greater than the number of capturing substrings, or if
the capture is unset, the replacement is empty.
.P
Any other character is substituted by itself. In particular, $$ is replaced by
a single dollar and $| is replaced by a pipe character. Here is an example:
.sp
echo -e "abcde\en12345" | pcre2grep \e
'(?x)(.)(..(.))
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
.sp
Output:
.sp
Arg1: [a] [bcd] [d] Arg2: |a| ()
abcde
Arg1: [1] [234] [4] Arg2: |1| ()
12345
.sp
The parameters for the \fBexecv()\fP system call that is used to run the
program or script are zero-terminated strings. This means that binary zero
characters in the callout argument will cause premature termination of their
substrings, and therefore should not be present. Any syntax errors in the
string (for example, a dollar not followed by another character) cause the
callout to be ignored. If running the program fails for any reason (including
the non-existence of the executable), a local matching failure occurs and the
matcher backtracks in the normal way.
.
.
.SH "MATCHING ERRORS"
.rs
.sp
@ -683,7 +776,7 @@ affect the return code.
.SH "SEE ALSO"
.rs
.sp
\fBpcre2pattern\fP(3), \fBpcre2syntax\fP(3).
\fBpcre2pattern\fP(3), \fBpcre2syntax\fP(3), \fBpcre2callout\fP(3).
.
.
.SH AUTHOR
@ -700,6 +793,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 January 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 31 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -51,103 +51,115 @@ DESCRIPTION
boundary is controlled by the -N (--newline) option.
The amount of memory used for buffering files that are being scanned is
controlled by a parameter that can be set by the --buffer-size option.
The default value for this parameter is specified when pcre2grep is
built, with the default default being 20K. A block of memory three
times this size is used (to allow for buffering "before" and "after"
lines). An error occurs if a line overflows the buffer.
controlled by parameters that can be set by the --buffer-size and
--max-buffer-size options. The first of these sets the size of buffer
that is obtained at the start of processing. If an input file contains
very long lines, a larger buffer may be needed; this is handled by
automatically extending the buffer, up to the limit specified by --max-
buffer-size. The default values for these parameters are specified when
pcre2grep is built, with the default defaults being 20K and 1M respec-
tively. An error occurs if a line is too long and the buffer can no
longer be expanded.
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
greater. BUFSIZ is defined in <stdio.h>. When there is more than one
The block of memory that is actually used is three times the "buffer
size", to allow for buffering "before" and "after" lines. If the buffer
size is too small, fewer than requested "before" and "after" lines may
be output.
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
greater. BUFSIZ is defined in <stdio.h>. When there is more than one
pattern (specified by the use of -e and/or -f), each pattern is applied
to each line in the order in which they are defined, except that all
to each line in the order in which they are defined, except that all
the -e patterns are tried before the -f patterns.
By default, as soon as one pattern matches a line, no further patterns
By default, as soon as one pattern matches a line, no further patterns
are considered. However, if --colour (or --color) is used to colour the
matching substrings, or if --only-matching, --file-offsets, or --line-
offsets is used to output only the part of the line that matched
matching substrings, or if --only-matching, --file-offsets, or --line-
offsets is used to output only the part of the line that matched
(either shown literally, or as an offset), scanning resumes immediately
following the match, so that further matches on the same line can be
found. If there are multiple patterns, they are all tried on the
remainder of the line, but patterns that follow the one that matched
following the match, so that further matches on the same line can be
found. If there are multiple patterns, they are all tried on the
remainder of the line, but patterns that follow the one that matched
are not tried on the earlier part of the line.
This behaviour means that the order in which multiple patterns are
specified can affect the output when one of the above options is used.
This is no longer the same behaviour as GNU grep, which now manages to
display earlier matches for later patterns (as long as there is no
This behaviour means that the order in which multiple patterns are
specified can affect the output when one of the above options is used.
This is no longer the same behaviour as GNU grep, which now manages to
display earlier matches for later patterns (as long as there is no
overlap).
Patterns that can match an empty string are accepted, but empty string
Patterns that can match an empty string are accepted, but empty string
matches are never recognized. An example is the pattern
"(super)?(man)?", in which all components are optional. This pattern
finds all occurrences of both "super" and "man"; the output differs
from matching with "super|man" when only the matching substrings are
"(super)?(man)?", in which all components are optional. This pattern
finds all occurrences of both "super" and "man"; the output differs
from matching with "super|man" when only the matching substrings are
being shown.
If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses
If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses
the value to set a locale when calling the PCRE2 library. The --locale
option can be used to override this.
SUPPORT FOR COMPRESSED FILES
It is possible to compile pcre2grep so that it uses libz or libbz2 to
read files whose names end in .gz or .bz2, respectively. You can find
It is possible to compile pcre2grep so that it uses libz or libbz2 to
read files whose names end in .gz or .bz2, respectively. You can find
out whether your binary has support for one or both of these file types
by running it with the --help option. If the appropriate support is not
present, files are treated as plain text. The standard input is always
present, files are treated as plain text. The standard input is always
so treated.
BINARY FILES
By default, a file that contains a binary zero byte within the first
1024 bytes is identified as a binary file, and is processed specially.
(GNU grep also identifies binary files in this manner.) See the
--binary-files option for a means of changing the way binary files are
By default, a file that contains a binary zero byte within the first
1024 bytes is identified as a binary file, and is processed specially.
(GNU grep also identifies binary files in this manner.) See the
--binary-files option for a means of changing the way binary files are
handled.
OPTIONS
The order in which some of the options appear can affect the output.
For example, both the -h and -l options affect the printing of file
names. Whichever comes later in the command line will be the one that
takes effect. Similarly, except where noted below, if an option is
given twice, the later setting is used. Numerical values for options
may be followed by K or M, to signify multiplication by 1024 or
The order in which some of the options appear can affect the output.
For example, both the -h and -l options affect the printing of file
names. Whichever comes later in the command line will be the one that
takes effect. Similarly, except where noted below, if an option is
given twice, the later setting is used. Numerical values for options
may be followed by K or M, to signify multiplication by 1024 or
1024*1024 respectively.
-- This terminates the list of options. It is useful if the next
item on the command line starts with a hyphen but is not an
option. This allows for the processing of patterns and file
item on the command line starts with a hyphen but is not an
option. This allows for the processing of patterns and file
names that start with hyphens.
-A number, --after-context=number
Output number lines of context after each matching line. If
file names and/or line numbers are being output, a hyphen
separator is used instead of a colon for the context lines. A
line containing "--" is output between each group of lines,
unless they are in fact contiguous in the input file. The
value of number is expected to be relatively small. However,
pcre2grep guarantees to have up to 8K of following text
available for context output.
Output up to number lines of context after each matching
line. Fewer lines are output if the next match or the end of
the file is reached, or if the processing buffer size has
been set too small. If file names and/or line numbers are
being output, a hyphen separator is used instead of a colon
for the context lines. A line containing "--" is output
between each group of lines, unless they are in fact contigu-
ous in the input file. The value of number is expected to be
relatively small. When -c is used, -A is ignored.
-a, --text
Treat binary files as text. This is equivalent to --binary-
files=text.
-B number, --before-context=number
Output number lines of context before each matching line. If
file names and/or line numbers are being output, a hyphen
separator is used instead of a colon for the context lines. A
line containing "--" is output between each group of lines,
unless they are in fact contiguous in the input file. The
value of number is expected to be relatively small. However,
pcre2grep guarantees to have up to 8K of preceding text
available for context output.
Output up to number lines of context before each matching
line. Fewer lines are output if the previous match or the
start of the file is within number lines, or if the process-
ing buffer size has been set too small. If file names and/or
line numbers are being output, a hyphen separator is used
instead of a colon for the context lines. A line containing
"--" is output between each group of lines, unless they are
in fact contiguous in the input file. The value of number is
expected to be relatively small. When -c is used, -B is
ignored.
--binary-files=word
Specify how binary files are to be processed. If the word is
@ -164,54 +176,68 @@ OPTIONS
any output or affecting the return code.
--buffer-size=number
Set the parameter that controls how much memory is used for
buffering files that are being scanned.
Set the parameter that controls how much memory is obtained
at the start of processing for buffering files that are being
scanned. See also --max-buffer-size below.
-C number, --context=number
Output number lines of context both before and after each
matching line. This is equivalent to setting both -A and -B
Output number lines of context both before and after each
matching line. This is equivalent to setting both -A and -B
to the same value.
-c, --count
Do not output lines from the files that are being scanned;
instead output the number of matches (or non-matches if -v is
used) that would otherwise have caused lines to be shown. By
default, this count is the same as the number of suppressed
lines, but if the -M (multiline) option is used (without -v),
there may be more suppressed lines than the number of
matches.
Do not output lines from the files that are being scanned;
instead output the number of lines that would have been
shown, either because they matched, or, if -v is set, because
they failed to match. By default, this count is exactly the
same as the number of lines that would have been output, but
if the -M (multiline) option is used (without -v), there may
be more suppressed lines than the count (that is, the number
of matches).
If no lines are selected, the number zero is output. If sev-
eral files are are being scanned, a count is output for each
of them. However, if the --files-with-matches option is also
used, only those files whose counts are greater than zero are
listed. When -c is used, the -A, -B, and -C options are
ignored.
of them and the -t option can be used to cause a total to be
output at the end. However, if the --files-with-matches
option is also used, only those files whose counts are
greater than zero are listed. When -c is used, the -A, -B,
and -C options are ignored.
--colour, --color
If this option is given without any data, it is equivalent to
"--colour=auto". If data is required, it must be given in
"--colour=auto". If data is required, it must be given in
the same shell item, separated by an equals sign.
--colour=value, --color=value
This option specifies under what circumstances the parts of a
line that matched a pattern should be coloured in the output.
By default, the output is not coloured. The value (which is
optional, see above) may be "never", "always", or "auto". In
the latter case, colouring happens only if the standard out-
put is connected to a terminal. More resources are used when
By default, the output is not coloured. The value (which is
optional, see above) may be "never", "always", or "auto". In
the latter case, colouring happens only if the standard out-
put is connected to a terminal. More resources are used when
colouring is enabled, because pcre2grep has to search for all
possible matches in a line, not just one, in order to colour
possible matches in a line, not just one, in order to colour
them all.
The colour that is used can be specified by setting the envi-
ronment variable PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The
value of this variable should be a string of two numbers,
separated by a semicolon. They are copied directly into the
control string for setting colour on a terminal, so it is
your responsibility to ensure that they make sense. If nei-
ther of the environment variables is set, the default is
"1;31", which gives red.
The colour that is used can be specified by setting one of
the environment variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR,
PCREGREP_COLOUR, or PCREGREP_COLOR, which are checked in that
order. If none of these are set, pcre2grep looks for
GREP_COLORS or GREP_COLOR (in that order). The value of the
variable should be a string of two numbers, separated by a
semicolon, except in the case of GREP_COLORS, which must
start with "ms=" or "mt=" followed by two semicolon-separated
colours, terminated by the end of the string or by a colon.
If GREP_COLORS does not start with "ms=" or "mt=" it is
ignored, and GREP_COLOR is checked.
If the string obtained from one of the above variables con-
tains any characters other than semicolon or digits, the set-
ting is ignored and the default colour is used. The string is
copied directly into the control string for setting colour on
a terminal, so it is your responsibility to ensure that the
values make sense. If no relevant environment variable is
set, the default is "1;31", which gives red.
-D action, --devices=action
If an input path is not a regular file or a directory,
@ -299,12 +325,12 @@ OPTIONS
Read patterns from the file, one per line, and match them
against each line of input. What constitutes a newline when
reading the file is the operating system's default. The
--newline option has no effect on this option. Trailing white
space is removed from each line, and blank lines are ignored.
An empty file contains no patterns and therefore matches
nothing. See also the comments about multiple patterns versus
a single pattern with alternatives in the description of -e
above.
--newline option has no effect on this option. Trailing
white space is removed from each line, and blank lines are
ignored. An empty file contains no patterns and therefore
matches nothing. See also the comments about multiple pat-
terns versus a single pattern with alternatives in the
description of -e above.
If this option is given more than once, all the specified
files are read. A data line is output if any of the patterns
@ -482,96 +508,101 @@ OPTIONS
tings are specified when the PCRE2 library is compiled, with
the default default being 10 million.
-M, --multiline
Allow patterns to match more than one line. When this option
is given, patterns may usefully contain literal newline char-
acters and internal occurrences of ^ and $ characters. The
output for a successful match may consist of more than one
line. The first is the line in which the match started, and
the last is the line in which the match ended. If the matched
string ends with a newline sequence the output ends at the
end of that line.
--max-buffer-size=number
This limits the expansion of the processing buffer, whose
initial size can be set by --buffer-size. The maximum buffer
size is silently forced to be no smaller than the starting
buffer size.
When this option is set, the PCRE2 library is called in "mul-
tiline" mode. However, pcre2grep still processes the input
line by line. The difference is that a matched string may
extend past the end of a line and continue on one or more
subsequent lines. The newline sequence must be matched as
part of the pattern. For example, to find the phrase "regular
expression" in a file where "regular" might be at the end of
a line and "expression" at the start of the next line, you
could use this command:
-M, --multiline
Allow patterns to match more than one line. When this option
is set, the PCRE2 library is called in "multiline" mode. This
allows a matched string to extend past the end of a line and
continue on one or more subsequent lines. Patterns used with
-M may usefully contain literal newline characters and inter-
nal occurrences of ^ and $ characters. The output for a suc-
cessful match may consist of more than one line. The first
line is the line in which the match started, and the last
line is the line in which the match ended. If the matched
string ends with a newline sequence, the output ends at the
end of that line. If -v is set, none of the lines in a
multi-line match are output. Once a match has been handled,
scanning restarts at the beginning of the line after the one
in which the match ended.
The newline sequence that separates multiple lines must be
matched as part of the pattern. For example, to find the
phrase "regular expression" in a file where "regular" might
be at the end of a line and "expression" at the start of the
next line, you could use this command:
pcre2grep -M 'regular\s+expression' <file>
The \s escape sequence matches any white space character,
including newlines, and is followed by + so as to match
trailing white space on the first line as well as possibly
The \s escape sequence matches any white space character,
including newlines, and is followed by + so as to match
trailing white space on the first line as well as possibly
handling a two-character newline sequence.
There is a limit to the number of lines that can be matched,
imposed by the way that pcre2grep buffers the input file as
it scans it. However, pcre2grep ensures that at least 8K
characters or the rest of the file (whichever is the shorter)
are available for forward matching, and similarly the previ-
ous 8K characters (or all the previous characters, if fewer
than 8K) are guaranteed to be available for lookbehind asser-
tions. The -M option does not work when input is read line by
line (see --line-buffered.)
There is a limit to the number of lines that can be matched,
imposed by the way that pcre2grep buffers the input file as
it scans it. With a sufficiently large processing buffer,
this should not be a problem, but the -M option does not work
when input is read line by line (see --line-buffered.)
-N newline-type, --newline=newline-type
The PCRE2 library supports five different conventions for
indicating the ends of lines. They are the single-character
sequences CR (carriage return) and LF (linefeed), the two-
character sequence CRLF, an "anycrlf" convention, which rec-
ognizes any of the preceding three types, and an "any" con-
The PCRE2 library supports five different conventions for
indicating the ends of lines. They are the single-character
sequences CR (carriage return) and LF (linefeed), the two-
character sequence CRLF, an "anycrlf" convention, which rec-
ognizes any of the preceding three types, and an "any" con-
vention, in which any Unicode line ending sequence is assumed
to end a line. The Unicode sequences are the three just men-
tioned, plus VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator,
to end a line. The Unicode sequences are the three just men-
tioned, plus VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator,
U+2028), and PS (paragraph separator, U+2029).
When the PCRE2 library is built, a default line-ending
sequence is specified. This is normally the standard
When the PCRE2 library is built, a default line-ending
sequence is specified. This is normally the standard
sequence for the operating system. Unless otherwise specified
by this option, pcre2grep uses the library's default. The
by this option, pcre2grep uses the library's default. The
possible values for this option are CR, LF, CRLF, ANYCRLF, or
ANY. This makes it possible to use pcre2grep to scan files
ANY. This makes it possible to use pcre2grep to scan files
that have come from other environments without having to mod-
ify their line endings. If the data that is being scanned
does not agree with the convention set by this option,
pcre2grep may behave in strange ways. Note that this option
does not apply to files specified by the -f, --exclude-from,
or --include-from options, which are expected to use the
ify their line endings. If the data that is being scanned
does not agree with the convention set by this option,
pcre2grep may behave in strange ways. Note that this option
does not apply to files specified by the -f, --exclude-from,
or --include-from options, which are expected to use the
operating system's standard newline sequence.
-n, --line-number
Precede each output line by its line number in the file, fol-
lowed by a colon for matching lines or a hyphen for context
lowed by a colon for matching lines or a hyphen for context
lines. If the file name is also being output, it precedes the
line number. When the -M option causes a pattern to match
more than one line, only the first is preceded by its line
line number. When the -M option causes a pattern to match
more than one line, only the first is preceded by its line
number. This option is forced if --line-offsets is used.
--no-jit If the PCRE2 library is built with support for just-in-time
--no-jit If the PCRE2 library is built with support for just-in-time
compiling (which speeds up matching), pcre2grep automatically
makes use of this, unless it was explicitly disabled at build
time. This option can be used to disable the use of JIT at
run time. It is provided for testing and working round prob-
time. This option can be used to disable the use of JIT at
run time. It is provided for testing and working round prob-
lems. It should never be needed in normal use.
-o, --only-matching
Show only the part of the line that matched a pattern instead
of the whole line. In this mode, no context is shown. That
is, the -A, -B, and -C options are ignored. If there is more
than one match in a line, each of them is shown separately.
If -o is combined with -v (invert the sense of the match to
find non-matching lines), no output is generated, but the
return code is set appropriately. If the matched portion of
the line is empty, nothing is output unless the file name or
line number are being printed, in which case they are shown
on an otherwise empty line. This option is mutually exclusive
with --file-offsets and --line-offsets.
of the whole line. In this mode, no context is shown. That
is, the -A, -B, and -C options are ignored. If there is more
than one match in a line, each of them is shown separately,
on a separate line of output. If -o is combined with -v
(invert the sense of the match to find non-matching lines),
no output is generated, but the return code is set appropri-
ately. If the matched portion of the line is empty, nothing
is output unless the file name or line number are being
printed, in which case they are shown on an otherwise empty
line. This option is mutually exclusive with --file-offsets
and --line-offsets.
-onumber, --only-matching=number
Show only the part of the line that matched the capturing
@ -587,65 +618,80 @@ OPTIONS
put.
If this option is given multiple times, multiple substrings
are output, in the order the options are given. For example,
-o3 -o1 -o3 causes the substrings matched by capturing paren-
theses 3 and 1 and then 3 again to be output. By default,
there is no separator (but see the next option).
are output for each match, in the order the options are
given, and all on one line. For example, -o3 -o1 -o3 causes
the substrings matched by capturing parentheses 3 and 1 and
then 3 again to be output. By default, there is no separator
(but see the next option).
--om-separator=text
Specify a separating string for multiple occurrences of -o.
The default is an empty string. Separating strings are never
Specify a separating string for multiple occurrences of -o.
The default is an empty string. Separating strings are never
coloured.
-q, --quiet
Work quietly, that is, display nothing except error messages.
The exit status indicates whether or not any matches were
The exit status indicates whether or not any matches were
found.
-r, --recursive
If any given path is a directory, recursively scan the files
it contains, taking note of any --include and --exclude set-
tings. By default, a directory is read as a normal file; in
some operating systems this gives an immediate end-of-file.
This option is a shorthand for setting the -d option to
If any given path is a directory, recursively scan the files
it contains, taking note of any --include and --exclude set-
tings. By default, a directory is read as a normal file; in
some operating systems this gives an immediate end-of-file.
This option is a shorthand for setting the -d option to
"recurse".
--recursion-limit=number
See --match-limit above.
-s, --no-messages
Suppress error messages about non-existent or unreadable
files. Such files are quietly skipped. However, the return
Suppress error messages about non-existent or unreadable
files. Such files are quietly skipped. However, the return
code is still 2, even if matches were found in other files.
-t, --total-count
This option is useful when scanning more than one file. If
used on its own, -t suppresses all output except for a grand
total number of matching lines (or non-matching lines if -v
is used) in all the files. If -t is used with -c, a grand
total is output except when the previous output is just one
line. In other words, it is not output when just one file's
count is listed. If file names are being output, the grand
total is preceded by "TOTAL:". Otherwise, it appears as just
another number. The -t option is ignored when used with -L
(list files without matches), because the grand total would
always be zero.
-u, --utf-8
Operate in UTF-8 mode. This option is available only if PCRE2
has been compiled with UTF-8 support. All patterns (including
those for any --exclude and --include options) and all sub-
ject lines that are scanned must be valid strings of UTF-8
those for any --exclude and --include options) and all sub-
ject lines that are scanned must be valid strings of UTF-8
characters.
-V, --version
Write the version numbers of pcre2grep and the PCRE2 library
to the standard output and then exit. Anything else on the
Write the version numbers of pcre2grep and the PCRE2 library
to the standard output and then exit. Anything else on the
command line is ignored.
-v, --invert-match
Invert the sense of the match, so that lines which do not
Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
Force the patterns to match only whole words. This is equiva-
lent to having \b at the start and end of the pattern. This
option applies only to the patterns that are matched against
the contents of files; it does not apply to patterns speci-
lent to having \b at the start and end of the pattern. This
option applies only to the patterns that are matched against
the contents of files; it does not apply to patterns speci-
fied by any of the --include or --exclude options.
-x, --line-regex, --line-regexp
Force the patterns to be anchored (each must start matching
at the beginning of a line) and in addition, require them to
match entire lines. This is equivalent to having ^ and $
characters at the start and end of each alternative top-level
Force the patterns to be anchored (each must start matching
at the beginning of a line) and in addition, require them to
match entire lines. In multiline mode the match may be more
than one line. This is equivalent to having \A and \Z charac-
ters at the start and end of each alternative top-level
branch in every pattern. This option applies only to the pat-
terns that are matched against the contents of files; it does
not apply to patterns specified by any of the --include or
@ -725,35 +771,86 @@ OPTIONS WITH DATA
equals character. Otherwise pcre2grep will assume that it has no data.
CALLING EXTERNAL SCRIPTS
pcre2grep has, by default, support for calling external programs or
scripts during matching by making use of PCRE2's callout facility. How-
ever, this support can be disabled when pcre2grep is built. You can
find out whether your binary has support for callouts by running it
with the --help option. If the support is not enabled, all callouts in
patterns are ignored by pcre2grep.
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
ment is either a number or a quoted string (see the pcre2callout docu-
mentation for details). Numbered callouts are ignored by pcre2grep.
String arguments are parsed as a list of substrings separated by pipe
(vertical bar) characters. The first substring must be an executable
name, with the following substrings specifying arguments:
executable_name|arg1|arg2|...
Any substring (including the executable name) may contain escape
sequences started by a dollar character: $<digits> or ${<digits>} is
replaced by the captured substring of the given decimal number, which
must be greater than zero. If the number is greater than the number of
capturing substrings, or if the capture is unset, the replacement is
empty.
Any other character is substituted by itself. In particular, $$ is
replaced by a single dollar and $| is replaced by a pipe character.
Here is an example:
echo -e "abcde\n12345" | pcre2grep \
'(?x)(.)(..(.))
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
Output:
Arg1: [a] [bcd] [d] Arg2: |a| ()
abcde
Arg1: [1] [234] [4] Arg2: |1| ()
12345
The parameters for the execv() system call that is used to run the pro-
gram or script are zero-terminated strings. This means that binary zero
characters in the callout argument will cause premature termination of
their substrings, and therefore should not be present. Any syntax
errors in the string (for example, a dollar not followed by another
character) cause the callout to be ignored. If running the program
fails for any reason (including the non-existence of the executable), a
local matching failure occurs and the matcher backtracks in the normal
way.
MATCHING ERRORS
It is possible to supply a regular expression that takes a very long
time to fail to match certain lines. Such patterns normally involve
nested indefinite repeats, for example: (a+)*\d when matched against a
line of a's with no final digit. The PCRE2 matching function has a
resource limit that causes it to abort in these circumstances. If this
happens, pcre2grep outputs an error message and the line that caused
the problem to the standard error stream. If there are more than 20
It is possible to supply a regular expression that takes a very long
time to fail to match certain lines. Such patterns normally involve
nested indefinite repeats, for example: (a+)*\d when matched against a
line of a's with no final digit. The PCRE2 matching function has a
resource limit that causes it to abort in these circumstances. If this
happens, pcre2grep outputs an error message and the line that caused
the problem to the standard error stream. If there are more than 20
such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall
resource limit; there is a second option called --recursion-limit that
sets a limit on the amount of memory (usually stack) that is used (see
The --match-limit option of pcre2grep can be used to set the overall
resource limit; there is a second option called --recursion-limit that
sets a limit on the amount of memory (usually stack) that is used (see
the discussion of these options above).
DIAGNOSTICS
Exit status is 0 if any matches were found, 1 if no matches were found,
and 2 for syntax errors, overlong lines, non-existent or inaccessible
files (even if matches were found in other files) or too many matching
and 2 for syntax errors, overlong lines, non-existent or inaccessible
files (even if matches were found in other files) or too many matching
errors. Using the -s option to suppress error messages about inaccessi-
ble files does not affect the return code.
SEE ALSO
pcre2pattern(3), pcre2syntax(3).
pcre2pattern(3), pcre2syntax(3), pcre2callout(3).
AUTHOR
@ -765,5 +862,5 @@ AUTHOR
REVISION
Last updated: 03 January 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 31 December 2016
Copyright (c) 1997-2016 University of Cambridge.

View File

@ -1,4 +1,4 @@
.TH PCRE2JIT 3 "27 November 2014" "PCRE2 10.00"
.TH PCRE2JIT 3 "05 June 2016" "PCRE2 10.22"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -61,6 +61,12 @@ much faster than the normal interpretive code, but yields exactly the same
results. The returned value from \fBpcre2_jit_compile()\fP is zero on success,
or a negative error code.
.P
There is a limit to the size of pattern that JIT supports, imposed by the size
of machine stack that it uses. The exact rules are not documented because they
may change at any time, in particular, when new optimizations are introduced.
If a pattern is too big, a call to \fBpcre2_jit_compile()\fB returns
PCRE2_ERROR_NOMEMORY.
.P
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
PCRE2_PARTIAL_SOFT options of \fBpcre2_match()\fP, you should set one or both
@ -122,6 +128,9 @@ PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
PCRE2_ANCHORED option is not supported at match time.
.P
If the PCRE2_NO_JIT option is passed to \fBpcre2_match()\fP it disables the
use of JIT, forcing matching by the interpreter code.
.P
The only unsupported pattern items are \eC (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition
in a conditional group.
@ -207,8 +216,13 @@ for JIT matching. A callback function can therefore be used to determine
whether a match operation was executed by JIT or by the interpreter.
.P
You may safely use the same JIT stack for more than one pattern (either by
assigning directly or by callback), as long as the patterns are all matched
sequentially in the same thread. In a multithread application, if you do not
assigning directly or by callback), as long as the patterns are matched
sequentially in the same thread. Currently, the only way to set up
non-sequential matches in one thread is to use callouts: if a callout function
starts another match, that match must use a different JIT stack to the one used
for currently suspended match(es).
.P
In a multithread application, if you do not
specify a JIT stack, or if you assign or pass back NULL from a callback, that
is thread-safe, because each thread has its own machine stack. However, if you
assign or pass back a non-NULL JIT stack, this must be a different stack for
@ -366,7 +380,7 @@ The fast path function is called \fBpcre2_jit_match()\fP, and it takes exactly
the same arguments as \fBpcre2_match()\fP. The return values are also the same,
plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
requested that was not compiled. Unsupported option bits (for example,
PCRE2_ANCHORED) are ignored.
PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
.P
When you call \fBpcre2_match()\fP, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
@ -399,6 +413,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 27 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 05 June 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2LIMITS 3 "25 November 2014" "PCRE2 10.00"
.TH PCRE2LIMITS 3 "26 October 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SIZE AND OTHER LIMITATIONS"
@ -20,6 +20,10 @@ documentation for details. In these cases the limit is substantially larger.
However, the speed of execution is slower. In the 32-bit library, the internal
linkage size is always 4.
.P
The maximum length of a source pattern string is essentially unlimited; it is
the largest number a PCRE2_SIZE variable can hold. However, the program that
calls \fBpcre2_compile()\fP can specify a smaller limit.
.P
The maximum length (in code units) of a subject string is one less than the
largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned
integer type, usually defined as size_t. Its maximum value (that is
@ -37,22 +41,25 @@ documentation.
.P
All values in repeating quantifiers must be less than 65536.
.P
The maximum length of a lookbehind assertion is 65535 characters.
.P
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The limit can
be specified when PCRE2 is built; the default is 250.
.P
There is a limit to the number of forward references to subsequent subpatterns
of around 200,000. Repeated forward references with fixed upper limits, for
example, (?2){0,100} when subpattern number 2 is to the right, are included in
the count. There is no limit to the number of backward references.
order to limit the amount of system stack used at compile time. The default
limit can be specified when PCRE2 is built; the default default is 250. An
application can change this limit by calling pcre2_set_parens_nest_limit() to
set the limit in a compile context.
.P
The maximum length of name for a named subpattern is 32 code units, and the
maximum number of named subpatterns is 10000.
.P
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
32-bit libraries.
.P
The maximum length of a string argument to a callout is the largest number a
32-bit unsigned integer can hold.
.
.
.SH AUTHOR
@ -69,6 +76,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 25 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 26 October 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
.TH PCRE2PATTERN 3 "27 December 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -158,6 +158,11 @@ be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
.P
The match limit is used (but in a different way) when JIT is being used, but it
is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP.
However, the recursion limit is relevant for DFA matching, which does use some
function recursion, in particular, for recursions within the pattern.
.
.
.\" HTML <a name="newlines"></a>
@ -359,29 +364,28 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
code unit following \ec has a value less than 32 or greater than 126, a
compile-time error occurs. This locks out non-printable ASCII characters in all
modes.
compile-time error occurs.
.P
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
generate the appropriate EBCDIC code values. The \ec escape is processed
as specified for Perl in the \fBperlebcdic\fP document. The only characters
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
other character provokes a compile-time error. The sequence \e@ encodes
character code 0; the letters (in either case) encode characters 1-26 (hex 01
to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
\e? becomes either 255 (hex FF) or 95 (hex 5F).
other character provokes a compile-time error. The sequence \ec@ encodes
character code 0; after \ec the letters (in either case) encode characters 1-26
(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
.P
Thus, apart from \e?, these escapes generate the same character code values as
Thus, apart from \ec?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \eG always generates code value 7, which is BEL in ASCII
differ. For example, \ecG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
.P
The sequence \e? generates DEL (127, hex 7F) in an ASCII environment, but
The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \e? generate 95; otherwise it generates 255.
values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
.P
After \e0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \e0\ex\e015
@ -508,9 +512,9 @@ by code point, as described in the previous section.
.SS "Absolute and relative back references"
.rs
.sp
The sequence \eg followed by an unsigned or a negative number, optionally
enclosed in braces, is an absolute or relative back reference. A named back
reference can be coded as \eg{name}. Back references are discussed
The sequence \eg followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative back reference. A named back reference
can be coded as \eg{name}. Back references are discussed
.\" HTML <a href="#backreferences">
.\" </a>
later,
@ -671,8 +675,8 @@ below.
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). The two-character sequence is treated as a single unit that
cannot be split.
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
.P
In other modes, two additional characters whose codepoints are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
@ -738,6 +742,8 @@ example:
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
.P
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
@ -778,6 +784,7 @@ Gurmukhi,
Han,
Hangul,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
@ -814,12 +821,14 @@ Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
@ -841,6 +850,7 @@ Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
@ -1177,6 +1187,18 @@ patterns that are anchored in single line mode because all branches start with
when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
.P
When the newline convention (see
.\" HTML <a href="#newlines">
.\" </a>
"Newline conventions"
.\"
below) recognizes the two-character sequence CRLF as a newline, this is
preferred, even if the single characters CR and LF are also recognized as
newlines. For example, if the newline convention is "any", a multiline mode
circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
CR, even though CR on its own is a valid newline. (It also matches at the very
start of the string, of course.)
.P
Note that the sequences \eA, \eZ, and \ez can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\eA it is always anchored, whether or not PCRE2_MULTILINE is set.
@ -1227,21 +1249,31 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
use of \eC by setting the PCRE2_NEVER_BACKSLASH_C option.
unless the PCRE2_NO_UTF_CHECK option is used).
.P
An application can lock out the use of \eC by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
build PCRE2 with the use of \eC permanently disabled.
.P
PCRE2 does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
.\" </a>
(described below)
.\"
in a UTF mode, because this would make it impossible to calculate the length of
the lookbehind.
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
the length of the lookbehind. Neither the alternative matching function
\fBpcre2_dfa_match()\fP nor the JIT optimizer support \eC in these UTF modes.
The former gives a match-time error; the latter fails to optimize and so the
match is always run using the interpreter.
.P
In the 32-bit library, however, \eC is always supported (when not explicitly
locked out) because it always matches a single code unit, whether or not UTF-32
is specified.
.P
In general, the \eC escape sequence is best avoided. However, one way of using
it that avoids the problem of malformed UTF characters is to use a lookahead to
check the length of the next character, as in this pattern, which could be used
with a UTF-8 string (ignore white space and line breaks):
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
lookahead to check the length of the next character, as in this pattern, which
could be used with a UTF-8 string (ignore white space and line breaks):
.sp
(?| (?=[\ex00-\ex7f])(\eC) |
(?=[\ex80-\ex{7ff}])(\eC)(\eC) |
@ -1297,37 +1329,6 @@ when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
.P
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class, or
immediately after a range. For example, [b-d-z] matches letters in the range b
to d, a hyphen character, or z.
.P
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\e]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
.P
An error is generated if a POSIX character class (see below) or an escape
sequence other than one that defines a single character appears at a point
where a range ending character is expected. For example, [z-\exff] is valid,
but [A-\ed] and [A-[:digit:]] are not.
.P
Ranges operate in the collating sequence of character values. They can also be
used for characters specified numerically, for example [\e000-\e037]. Ranges
can include any characters that are valid for the current mode.
.P
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\exc8-\excb] matches accented E
characters in both cases.
.P
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
\eV, \ew, and \eW may appear in a character class, and add the characters that
they match to the class. For example, [\edABCDEF] matches any hexadecimal
@ -1343,6 +1344,46 @@ class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
are not special inside a character class. Like any other unrecognized escape
sequences, they cause an error.
.P
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches letters in the range
b to d, a hyphen character, or z.
.P
Perl treats a hyphen as a literal if it appears before or after a POSIX class
(see below) or a character type escape such as as \ed, but gives a warning in
its warning mode, as this is most likely a user error. As PCRE2 has no facility
for warning, an error is given in these cases.
.P
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\e]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
.P
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\e000-\e037]. Ranges can include any characters that are valid for the
current mode.
.P
There is a special case in EBCDIC environments for ranges whose end points are
both specified as literal letters in the same case. For compatibility with
Perl, EBCDIC code points within the range that are not letters are omitted. For
example, [h-k] matches only four characters, even though the codes for h and k
are 0x88 and 0x92, a range of 11 code points. However, if the range is
specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
are included.
.P
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\exc8-\excb] matches accented E
characters in both cases.
.P
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\eW_] matches any letter or digit, but not underscore,
@ -1514,12 +1555,8 @@ respectively.
.P
When one of these option changes occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the pattern
that follows. If the change is placed right at the start of a pattern, PCRE2
extracts it into the global options (and it will therefore show up in data
extracted by the \fBpcre2_pattern_info()\fP function).
.P
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it, so
that follows. An option change within a subpattern (see below for a description
of subpatterns) affects only that part of the subpattern that follows it, so
.sp
(a(?i)b)c
.sp
@ -1650,6 +1687,9 @@ first one in the pattern with the given number. The following pattern matches
.sp
/(?|(abc)|(def))(?1)/
.sp
A relative reference such as (?-1) is no different: it is just a convenient way
of computing an absolute group number.
.P
If a
.\" HTML <a href="#conditions">
.\" </a>
@ -2056,9 +2096,9 @@ no such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
.P
Another way of avoiding the ambiguity inherent in the use of digits following a
backslash is to use the \eg escape sequence. This escape must be followed by an
unsigned number or a negative number, optionally enclosed in braces. These
examples are all identical:
backslash is to use the \eg escape sequence. This escape must be followed by a
signed or unsigned number, optionally enclosed in braces. These examples are
all identical:
.sp
(ring), \e1
(ring), \eg1
@ -2066,8 +2106,7 @@ examples are all identical:
.sp
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
the reference. A negative number is a relative reference. Consider this
example:
the reference. A signed number is a relative reference. Consider this example:
.sp
(abc(def)ghi)\eg{-1}
.sp
@ -2077,6 +2116,10 @@ Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
can be helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
.P
The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
of forward reference can be useful it patterns that repeat. Perl does not
support the use of + in this way.
.P
A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
@ -2184,6 +2227,13 @@ numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions. (Perl sometimes, but not
always, does do capturing in negative assertions.)
.P
WARNING: If a positive assertion containing one or more capturing subpatterns
succeeds, but failure to match later in the pattern causes backtracking over
this assertion, the captures within the assertion are reset only if no higher
numbered captures are already set. This is, unfortunately, a fundamental
limitation of the current implementation; it may get removed in a future
reworking.
.P
For compatibility with Perl, most assertion subpatterns may be repeated; though
it makes no sense to assert the same thing several times, the side effect of
capturing parentheses may occasionally be useful. However, an assertion that
@ -2281,23 +2331,34 @@ temporarily move the current position back by the fixed length and then try to
match. If there are insufficient characters before the current position, the
assertion fails.
.P
In a UTF mode, PCRE2 does not allow the \eC escape (which matches a single code
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
it impossible to calculate the length of the lookbehind. The \eX and \eR
escapes, which can match different numbers of code units, are also not
permitted.
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
single code unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbehind. The
\eX and \eR escapes, which can match different numbers of code units, are never
permitted in lookbehinds.
.P
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"Subroutine"
.\"
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
as the subpattern matches a fixed-length string.
as the subpattern matches a fixed-length string. However,
.\" HTML <a href="#recursion">
.\" </a>
Recursion,
recursion,
.\"
however, is not supported.
that is, a "subroutine" call into a group that is already active,
is not supported.
.P
Perl does not support back references in lookbehinds. PCRE2 does support them,
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
must not be set, there must be no use of (?| in the pattern (it creates
duplicate subpattern numbers), and if the back reference is by name, the name
must be unique. Of course, the referenced subpattern must itself be of fixed
length. The following pattern matches words containing at least two characters
that begin and end with the same character:
.sp
\eb(\ew)\ew++(?<=\e1)
.P
Possessive quantifiers can be used in conjunction with lookbehind assertions to
specify efficient matching of fixed-length strings at the end of subject
@ -2436,7 +2497,9 @@ This makes the fragment independent of the parentheses in the larger pattern.
.sp
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE1, which had
this facility before Perl, the syntax (?(name)...) is also recognized.
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
however, that undelimited names consisting of the letter R followed by digits
are ambiguous (see the following section).
.P
Rewriting the above example to use a named subpattern gives this:
.sp
@ -2450,33 +2513,55 @@ matched.
.SS "Checking for pattern recursion"
.rs
.sp
If the condition is the string (R), and there is no subpattern with the name R,
the condition is true if a recursive call to the whole pattern or any
subpattern has been made. If digits or a name preceded by ampersand follow the
letter R, for example:
.sp
(?(R3)...) or (?(R&name)...)
.sp
the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
stack. If the name used in a condition of this kind is a duplicate, the test is
applied to all subpatterns of the same name, and is true if any one of them is
the most recent recursion.
.P
At "top level", all these recursion test conditions are false.
"Recursion" in this sense refers to any subroutine-like call from one part of
the pattern to another, whether or not it is actually recursive. See the
sections entitled
.\" HTML <a href="#recursion">
.\" </a>
The syntax for recursive patterns
"Recursive patterns"
.\"
is described below.
and
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"Subpatterns as subroutines"
.\"
below for details of recursion and subpattern calls.
.P
If a condition is the string (R), and there is no subpattern with the name R,
the condition is true if matching is currently in a recursion or subroutine
call to the whole pattern or any subpattern. If digits follow the letter R, and
there is no subpattern with that name, the condition is true if the most recent
call is into a subpattern with the given number, which must exist somewhere in
the overall pattern. This is a contrived example that is equivalent to a+b:
.sp
((?(R1)a+|(?1)b))
.sp
However, in both cases, if there is a subpattern with a matching name, the
condition tests for its being set, as described in the section above, instead
of testing for recursion. For example, creating a group with the name R1 by
adding (?<R1>) to the above pattern completely changes its meaning.
.P
If a name preceded by ampersand follows the letter R, for example:
.sp
(?(R&name)...)
.sp
the condition is true if the most recent recursion is into a subpattern of that
name (which must exist within the pattern).
.P
This condition does not check the entire recursion stack. It tests only the
current level. If the name used in a condition of this kind is a duplicate, the
test is applied to all subpatterns of the same name, and is true if any one of
them is the most recent recursion.
.P
At "top level", all these recursion test conditions are false.
.
.
.\" HTML <a name="subdefine"></a>
.SS "Defining subpatterns for use by reference only"
.rs
.sp
If the condition is the string (DEFINE), and there is no subpattern with the
name DEFINE, the condition is always false. In this case, there may be only one
If the condition is the string (DEFINE), the condition is always false, even if
there is a group with the name DEFINE. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
subroutines that can be referenced from elsewhere. (The use of
@ -2513,7 +2598,8 @@ For example:
(?(VERSION>=10.4)yes|no)
.sp
This pattern matches "yes" if the PCRE2 version is greater or equal to 10.4, or
"no" otherwise.
"no" otherwise. The fractional part of the version number may not contain more
than two digits.
.
.
.SS "Assertion conditions"
@ -2630,6 +2716,23 @@ pattern above you can write (?-2) to refer to the second most recently opened
parentheses preceding the recursion. In other words, a negative number counts
capturing parentheses leftwards from the point at which it is encountered.
.P
Be aware however, that if
.\" HTML <a href="#dupsubpatternnumber">
.\" </a>
duplicate subpattern numbers
.\"
are in use, relative references refer to the earliest subpattern with the
appropriate number. Consider, for example:
.sp
(?|(a)|(b)) (c) (?-2)
.sp
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
is number 2. When the reference (?-2) is encountered, the second most recently
opened parentheses has the number 1, but it is the first such group (the (a)
group) to which the recursion refers. This would be the same if an absolute
reference (?1) was used. In other words, relative references are just a
shorthand for computing a group number.
.P
It is also possible to refer to subsequently opened parentheses, by writing
references such as (?+2). However, these cannot be recursive because the
reference is not inside the parentheses that are referenced. They are always
@ -2929,14 +3032,32 @@ in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE2 features described in this section.
.P
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
differently depending on whether or not a name is present. A name is any
sequence of characters that does not include a closing parenthesis. The maximum
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
libraries. If the name is empty, that is, if the closing parenthesis
immediately follows the colon, the effect is as if the colon were not there.
Any number of these verbs may occur in a pattern.
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
(*VERB:NAME). Some verbs take either form, possibly behaving differently
depending on whether or not a name is present.
.P
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
is no longer Perl-compatible.
.P
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
and only an unescaped closing parenthesis terminates the name. However, the
only backslash items that are permitted are \eQ, \eE, and sequences such as
\ex{100} that define character code points. Character type escapes such as \ed
are faulted.
.P
A closing parenthesis can be included in a name either as \e) or between \eQ
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
also set, unescaped whitespace in verb names is skipped, and #-comments are
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
.P
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
parenthesis immediately follows the colon, the effect is as if the colon were
not there. Any number of these verbs may occur in a pattern.
.P
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
@ -3361,6 +3482,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 27 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00"
.TH PCRE2POSIX 3 "31 January 2016" "PCRE2 10.22"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SYNOPSIS"
@ -28,7 +28,7 @@ expression 8-bit library. See the
\fBpcre2api\fP
.\"
documentation for a description of PCRE2's native API, which contains much
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
additional functionality. There are no POSIX-style wrappers for PCRE2's 16-bit
and 32-bit libraries.
.P
The functions described here are just wrapper functions that ultimately call
@ -44,9 +44,9 @@ value zero. This has no effect, but since programs that are written to the
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
replacement library. Other POSIX options are not even defined.
.P
There are also some other options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE2-specific features via the POSIX calling interface.
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
features via the POSIX calling interface.
.P
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
@ -95,11 +95,11 @@ defined POSIX behaviour for REG_NEWLINE (see the following section).
.sp
REG_NOSUB
.sp
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
for compilation to the native function. In addition, when a pattern that is
compiled with this flag is passed to \fBregexec()\fP for matching, the
\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings
are returned.
When a pattern that is compiled with this flag is passed to \fBregexec()\fP for
matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no
captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
.sp
REG_UCP
.sp
@ -145,7 +145,7 @@ use the contents of the \fIpreg\fP structure. If, for example, you pass it to
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
never intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE2:
possibilities for matching newline characters in Perl and PCRE2:
.sp
Default Change with
.sp
@ -155,7 +155,7 @@ possibilities for matching newline characters in PCRE2:
$ matches \en in middle no PCRE2_MULTILINE
^ matches \en in middle no PCRE2_MULTILINE
.sp
This is the equivalent table for POSIX:
This is the equivalent table for a POSIX-compatible pattern matcher:
.sp
Default Change with
.sp
@ -165,13 +165,17 @@ This is the equivalent table for POSIX:
$ matches \en in middle no REG_NEWLINE
^ matches \en in middle no REG_NEWLINE
.sp
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
newline from matching [^a].
This behaviour is not what happens when PCRE2 is called via its POSIX
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
is no way to stop newline from matching [^a].
.P
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
the REG_NEWLINE action.
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY when calling \fBpcre2_compile()\fP directly, but there is
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
the POSIX API, passing REG_NEWLINE to PCRE2's \fBregcomp()\fP function
causes PCRE2_MULTILINE to be passed to \fBpcre2_compile()\fP, and REG_DOTALL
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
.
.
.SH "MATCHING A PATTERN"
@ -207,16 +211,18 @@ to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched.
how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are
mutually exclusive; the error REG_INVARG is returned.
.P
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
\fBregexec()\fP are ignored.
\fBregexec()\fP are ignored (except possibly as input for REG_STARTEND).
.P
If the value of \fInmatch\fP is zero, or if the value \fIpmatch\fP is NULL,
no data about any matched strings is returned.
The value of \fInmatch\fP may be zero, and the value \fIpmatch\fP may be NULL
(unless REG_STARTEND is set); in both these cases no data about any matched
strings is returned.
.P
Otherwise,the portion of the string that was matched, and also any captured
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the \fIpmatch\fP argument, which points to an
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first
@ -236,9 +242,11 @@ header file, of which REG_NOMATCH is the "expected" failure code.
The \fBregerror()\fP function maps a non-zero errorcode from either
\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not
NULL, the error should have arisen from the use of that structure. A message
terminated by a binary zero is placed in \fIerrbuf\fP. The length of the
message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the
function is the size of buffer needed to hold the whole message.
terminated by a binary zero is placed in \fIerrbuf\fP. If the buffer is too
short, only the first \fIerrbuf_size\fP - 1 characters of the error message are
used. The yield of the function is the size of buffer needed to hold the whole
message, including the terminating zero. This value is greater than
\fIerrbuf_size\fP if the message was truncated.
.
.
.SH MEMORY USAGE
@ -263,6 +271,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 31 January 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SAMPLE 3 "20 October 2014" "PCRE2 10.00"
.TH PCRE2SAMPLE 3 "02 February 2016" "PCRE2 10.22"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 SAMPLE PROGRAM"
@ -13,23 +13,28 @@ distribution. A listing of this program is given in the
documentation. If you do not have a copy of the PCRE2 distribution, you can
save this listing to re-create the contents of \fIpcre2demo.c\fP.
.P
The demonstration program, which uses the PCRE2 8-bit library, compiles the
regular expression that is its first argument, and matches it against the
subject string in its second argument. No PCRE2 options are set, and default
character tables are used. If matching succeeds, the program outputs the
portion of the subject that matched, together with the contents of any captured
substrings.
The demonstration program compiles the regular expression that is its
first argument, and matches it against the subject string in its second
argument. No PCRE2 options are set, and default character tables are used. If
matching succeeds, the program outputs the portion of the subject that matched,
together with the contents of any captured substrings.
.P
If the -g option is given on the command line, the program then goes on to
check for further matches of the same regular expression in the same subject
string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
.P
The code in \fBpcre2demo.c\fP is an 8-bit program that uses the PCRE2 8-bit
library. It handles strings and characters that are stored in 8-bit code units.
By default, one character corresponds to one code unit, but if the pattern
starts with "(*UTF)", both it and the subject are treated as UTF-8 strings,
where characters may occupy multiple code units.
.P
If PCRE2 is installed in the standard include and library directories for your
operating system, you should be able to compile the demonstration program using
this command:
a command like this:
.sp
gcc -o pcre2demo pcre2demo.c -lpcre2-8
cc -o pcre2demo pcre2demo.c -lpcre2-8
.sp
If PCRE2 is installed elsewhere, you may need to add additional options to the
command line. For example, on a Unix-like system that has PCRE2 installed in
@ -37,12 +42,11 @@ command line. For example, on a Unix-like system that has PCRE2 installed in
like this:
.sp
.\" JOINSH
gcc -o pcre2demo -I/usr/local/include pcre2demo.c \e
-L/usr/local/lib -lpcre2-8
cc -o pcre2demo -I/usr/local/include pcre2demo.c \e
-L/usr/local/lib -lpcre2-8
.sp
.P
Once you have compiled and linked the demonstration program, you can run simple
tests like this:
Once you have built the demonstration program, you can run simple tests like
this:
.sp
./pcre2demo 'cat|dog' 'the cat sat on the mat'
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
@ -51,12 +55,13 @@ Note that there is a much more comprehensive test program, called
.\" HREF
\fBpcre2test\fP,
.\"
which supports many more facilities for testing regular expressions using the
PCRE2 libraries. The
which supports many more facilities for testing regular expressions using all
three PCRE2 libraries (8-bit, 16-bit, and 32-bit, though not all three need be
installed). The
.\" HREF
\fBpcre2demo\fP
.\"
program is provided as a simple coding example.
program is provided as a relatively simple coding example.
.P
If you try to run
.\" HREF
@ -65,7 +70,7 @@ If you try to run
when PCRE2 is not installed in the standard library directory, you may get an
error like this on some operating systems (e.g. Solaris):
.sp
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
.sp
This is caused by the way shared library support works on those systems. You
need to add
@ -89,6 +94,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 February 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SERIALIZE 3 "20 January 2015" "PCRE2 10.10"
.TH PCRE2SERIALIZE 3 "24 May 2016" "PCRE2 10.22"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS"
@ -22,12 +22,22 @@ If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled form
instead of having to compile them every time the application is run. However,
if you are using the just-in-time optimization feature, it is not possible to
save and reload the JIT data, because it is position-dependent. In addition,
the host on which the patterns are reloaded must be running the same version of
PCRE2, with the same code unit width, and must also have the same endianness,
pointer width and PCRE2_SIZE type. For example, patterns compiled on a 32-bit
system using PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor
can they be reloaded using the 8-bit library.
save and reload the JIT data, because it is position-dependent. The host on
which the patterns are reloaded must be running the same version of PCRE2, with
the same code unit width, and must also have the same endianness, pointer width
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
reloaded using the 8-bit library.
.
.
.SH "SECURITY CONCERNS"
.rs
.sp
The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
\fBpcre2_serialize_decode()\fP is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded.
.
.
.SH "SAVING COMPILED PATTERNS"
@ -129,20 +139,26 @@ is filled with those that fit, and the remainder are ignored. The yield of the
function is the number of decoded patterns, or one of the following negative
error codes:
.sp
PCRE2_ERROR_BADDATA second argument is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
PCRE2_ERROR_BADMODE mismatch of variable unit size or PCRE2 version
PCRE2_ERROR_MEMORY memory allocation failed
PCRE2_ERROR_NULL first or third argument is NULL
PCRE2_ERROR_BADDATA second argument is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
PCRE2_ERROR_MEMORY memory allocation failed
PCRE2_ERROR_NULL first or third argument is NULL
.sp
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
on a system with different endianness.
.P
Decoded patterns can be used for matching in the usual way, and must be freed
by calling \fBpcre2_code_free()\fP as normal. A single copy of the character
tables is used by all the decoded patterns. A reference count is used to
by calling \fBpcre2_code_free()\fP. However, be aware that there is a potential
race issue if you are using multiple patterns that were decoded from a single
byte stream in a multithreaded application. A single copy of the character
tables is used by all the decoded patterns and a reference count is used to
arrange for its memory to be automatically freed when the last pattern is
freed.
freed, but there is no locking on this reference count. Therefore, if you want
to call \fBpcre2_code_free()\fP for these patterns in different threads, you
must arrange your own locking, and ensure that \fBpcre2_code_free()\fP cannot
be called by two threads at the same time.
.P
If a pattern was processed by \fBpcre2_jit_compile()\fP before being
serialized, the JIT data is discarded and so is no longer available after a
@ -165,6 +181,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 January 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 24 May 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2STACK 3 "21 November 2014" "PCRE2 10.00"
.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 DISCUSSION OF STACK USAGE"
@ -43,11 +43,12 @@ assertion and "once-only" subpatterns, which are handled like subroutine calls.
Normally, these are never very deep, and the limit on the complexity of
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At
present, there is no protection against this.
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
limit is applied (see below).
.P
The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are
relevant only for \fBpcre2_match()\fP without the JIT optimization.
The comments in the next three sections do not apply to
\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
the JIT optimization.
.
.
.SS "Reducing \fBpcre2_match()\fP's stack usage"
@ -106,7 +107,7 @@ in the
\fBpcre2api\fP
.\"
documentation. Since the block sizes are always the same, it may be possible to
implement customized a memory handler that is more efficient than the standard
implement a customized memory handler that is more efficient than the standard
function. The memory blocks obtained for this purpose are retained and re-used
if possible while \fBpcre2_match()\fP is running. They are all freed just
before it exits.
@ -147,6 +148,15 @@ pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
different limits.
.
.
.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
.rs
.sp
The recursion limit, as described above for \fBpcre2_match()\fP, also applies
to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
recursions in the pattern can lead to runaway stack usage. The non-recursive
match limit is not relevant for DFA matching, and is ignored.
.
.
.SS "Changing stack size in Unix-like systems"
.rs
.sp
@ -197,6 +207,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 23 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
.TH PCRE2SYNTAX 3 "23 December 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -81,9 +81,10 @@ it matches a literal "u".
\eW a "non-word" character
\eX a Unicode extended grapheme cluster
.sp
The application can lock out the use of \eC by setting the
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
current matching point in the middle of a UTF-8 or UTF-16 character.
\eC is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \eC by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \eC permanently disabled.
.P
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
@ -159,6 +160,8 @@ at release 5.18.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
@ -199,6 +202,7 @@ Gurmukhi,
Han,
Hangul,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
@ -235,12 +239,14 @@ Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
@ -262,6 +268,7 @@ Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
@ -421,9 +428,10 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP, not
increase them. The application can lock out the use of (*UTF) and (*UCP) by
setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at
compile time.
.
.
.SH "NEWLINE CONVENTION"
@ -466,6 +474,9 @@ Each top-level branch of a look behind must be of a fixed length.
\en reference by number (can be ambiguous)
\egn reference by number
\eg{n} reference by number
\eg+n relative reference by number (PCRE2 extension)
\eg-n relative reference by number
\eg{+n} relative reference by number (PCRE2 extension)
\eg{-n} relative reference by number
\ek<name> reference by name (Perl)
\ek'name' reference by name (Perl)
@ -504,13 +515,17 @@ Each top-level branch of a look behind must be of a fixed length.
(?(-n) relative reference condition
(?(<name>) named reference condition (Perl)
(?('name') named reference condition (Perl)
(?(name) named reference condition (PCRE2)
(?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
(?(Rn) specific group recursion condition
(?(R&name) specific recursion condition
(?(Rn) specific numbered group recursion condition
(?(R&name) specific named group recursion condition
(?(DEFINE) define subpattern for reference
(?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
.sp
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
.
.
.SH "BACKTRACKING CONTROL"
@ -570,6 +585,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 23 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "20 May 2015" "PCRE 10.20"
.TH PCRE2TEST 1 "28 December 2016" "PCRE 10.23"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -29,7 +29,7 @@ subject is processed, and what output is produced.
.P
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original \fBpcretest\fP program ended up with a
lot of options in a messy, arcane syntax, for testing all the features. The
lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@ -47,31 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
\fBpcre2test\fP program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16- or 32-bit format
before being passed to the library functions. Results are converted back to
8-bit code units for output.
libraries, patterns and subject strings are converted to 16-bit or 32-bit
format before being passed to the library functions. Results are converted back
to 8-bit code units for output.
.P
In the rest of this document, the names of library functions and structures
are given in generic form, for example, \fBpcre_compile()\fP. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
.
.
.\" HTML <a name="inputencoding"></a>
.SH "INPUT ENCODING"
.rs
.sp
Input to \fBpcre2test\fP is processed line by line, either by calling the C
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
below). The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
further data is read.
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
Windows environments character 26 (hex 1A) causes an immediate end of file, and
no further data is read, so this character should be avoided unless you really
want that action.
.P
For maximum portability, therefore, it is safest to avoid non-printing
characters in \fBpcre2test\fP input files. There is a facility for specifying a
pattern's characters as hexadecimal pairs, thus making it possible to include
binary zeroes in a pattern for testing purposes. Subject lines are processed
for backslash escapes, which makes it possible to include any data value.
The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash
escapes, which makes it possible to include any data value in strings that are
passed to the library for matching. For patterns, there is a facility for
specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros.
.
.
.SS "Input for the 16-bit and 32-bit libraries"
.rs
.sp
When testing the 16-bit or 32-bit libraries, there is a need to be able to
generate character code points greater than 255 in the strings that are passed
to the library. For subject lines, backslash escapes can be used. In addition,
when the \fButf\fP modifier (see
.\" HTML <a href="#optionmodifiers">
.\" </a>
"Setting compilation options"
.\"
below) is set, the pattern and any following subject lines are interpreted as
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
.P
For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
or 32-bit mode. It causes the pattern and following subject lines to be treated
as UTF-8 according to the original definition (RFC 2279), which allows for
character values up to 0x7fffffff. Each character is placed in one 16-bit or
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
to occur).
.P
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
.
.
.SH "COMMAND LINE OPTIONS"
@ -92,8 +124,12 @@ If the 32-bit library has been built, this option causes it to be used. If only
the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
.TP 10
\fB-ac\fP
Behave as if each pattern has the \fBauto_callout\fP modifier, that is, insert
automatic callouts into every pattern that is compiled.
.TP 10
\fB-b\fP
Behave as if each pattern has the \fB/fullbincode\fP modifier; the full
Behave as if each pattern has the \fBfullbincode\fP modifier; the full
internal binary form of the pattern is output after compilation.
.TP 10
\fB-C\fP
@ -122,12 +158,13 @@ following options output the value and set the exit code as indicated:
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
.sp
ebcdic compiled for an EBCDIC environment
jit just-in-time support is available
pcre2-16 the 16-bit library was built
pcre2-32 the 32-bit library was built
pcre2-8 the 8-bit library was built
unicode Unicode support is available
backslash-C \eC is supported (not locked out)
ebcdic compiled for an EBCDIC environment
jit just-in-time support is available
pcre2-16 the 16-bit library was built
pcre2-32 the 32-bit library was built
pcre2-8 the 8-bit library was built
unicode Unicode support is available
.sp
If an unknown option is given, an error message is output; the exit code is 0.
.TP 10
@ -141,11 +178,17 @@ Behave as if each subject line has the \fBdfa\fP modifier; matching is done
using the \fBpcre2_dfa_match()\fP function instead of the default
\fBpcre2_match()\fP.
.TP 10
\fB-error\fP \fInumber[,number,...]\fP
Call \fBpcre2_get_error_message()\fP for each of the error numbers in the
comma-separated list, display the resulting messages on the standard output,
then exit with zero exit code. The numbers may be positive or negative. This is
a convenience facility for PCRE2 maintainers.
.TP 10
\fB-help\fP
Output a brief summary these options and then exit.
.TP 10
\fB-i\fP
Behave as if each pattern has the \fB/info\fP modifier; information about the
Behave as if each pattern has the \fBinfo\fP modifier; information about the
compiled pattern is given after compilation.
.TP 10
\fB-jit\fP
@ -217,9 +260,9 @@ Each subject line is matched separately and independently. If you want to do
multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
etc., depending on the newline setting) in a single line of input to encode the
newline sequences. There is no limit on the length of subject lines; the input
buffer is automatically extended if it is too small. There is a replication
feature that makes it possible to generate long subject lines without having to
supply them explicitly.
buffer is automatically extended if it is too small. There are replication
features that makes it possible to generate long repetitive pattern or subject
lines without having to supply them explicitly.
.P
An empty line or the end of the file signals the end of the subject lines for a
test, at which point a new pattern or command line is expected if there is
@ -259,6 +302,34 @@ described in the section entitled "Saving and restoring compiled patterns"
.\" </a>
below.
.\"
.sp
#newline_default [<newline-list>]
.sp
When PCRE2 is built, a default newline convention can be specified. This
determines which characters and/or character pairs are recognized as indicating
a newline in a pattern or subject string. The default can be overridden when a
pattern is compiled. The standard test files contain tests of various newline
conventions, but the majority of the tests expect a single linefeed to be
recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
.P
The #newline_default command specifies a list of newline types that are
acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
ANY (in upper or lower case), for example:
.sp
#newline_default LF Any anyCRLF
.sp
If the default newline is in the list, this command has no effect. Otherwise,
except when testing the POSIX API, a \fBnewline\fP modifier that specifies the
first newline convention in the list (LF in the above example) is added to any
pattern that does not already have a \fBnewline\fP modifier. If the newline
list is empty, the feature is turned off. This command is present in a number
of the standard test input files.
.P
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the \fBposix\fP modifier is used when
\fB#newline_default\fP would set a default for the non-POSIX API.
.sp
#pattern <modifier-list>
.sp
@ -276,9 +347,10 @@ test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
command helps detect tests that are accidentally put in the wrong file.
.sp
#pop [<modifiers>]
#popcopy [<modifiers>]
.sp
This command is used to manipulate the stack of compiled patterns, as described
in the section entitled "Saving and restoring compiled patterns"
These commands are used to manipulate the stack of compiled patterns, as
described in the section entitled "Saving and restoring compiled patterns"
.\" HTML <a href="#saverestore">
.\" </a>
below.
@ -303,12 +375,13 @@ subject lines. Modifiers on a subject line can change these settings.
.rs
.sp
Modifier lists are used with both pattern and subject lines. Items in a list
are separated by commas and optional white space. Some modifiers may be given
for both patterns and subject lines, whereas others are valid for one or the
other only. Each modifier has a long name, for example "anchored", and some of
them must be followed by an equals sign and a value, for example, "offset=12".
Modifiers that do not take values may be preceded by a minus sign to turn off a
previous setting.
are separated by commas followed by optional white space. Trailing whitespace
in a modifier list is ignored. Some modifiers may be given for both patterns
and subject lines, whereas others are valid only for one or the other. Each
modifier has a long name, for example "anchored", and some of them must be
followed by an equals sign and a value, for example, "offset=12". Values cannot
contain comma characters, but may contain spaces. Modifiers that do not take
values may be preceded by a minus sign to turn off a previous setting.
.P
A few of the more common modifiers can also be specified as single letters, for
example "i" for "caseless". In documentation, following the Perl convention,
@ -414,6 +487,12 @@ the start of a modifier list. For example:
.sp
abc\e=notbol,notempty
.sp
If the subject string is empty and \e= is followed by whitespace, the line is
treated as a comment line, and is not used for matching. For example:
.sp
\e= This is a comment.
abc\e= This is an invalid modifier list.
.sp
A backslash followed by any other non-alphanumeric character just escapes that
character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
@ -424,10 +503,10 @@ a real empty line terminates the data input.
.SH "PATTERN MODIFIERS"
.rs
.sp
There are three types of modifier that can appear in pattern lines, two of
which may also be used in a \fB#pattern\fP command. A pattern's modifier list
can add to or override default modifiers that were set by a previous
\fB#pattern\fP command.
There are several types of modifier that can appear in pattern lines. Except
where noted below, they may also be used in \fB#pattern\fP commands. A
pattern's modifier list can add to or override default modifiers that were set
by a previous \fB#pattern\fP command.
.
.
.\" HTML <a name="optionmodifiers"></a>
@ -437,13 +516,14 @@ can add to or override default modifiers that were set by a previous
The following modifiers set options for \fBpcre2_compile()\fP. The most common
ones have single-letter abbreviations. See
.\" HREF
\fBpcreapi\fP
\fBpcre2api\fP
.\"
for a description of their effects.
.sp
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
/i caseless set PCRE2_CASELESS
@ -464,12 +544,15 @@ for a description of their effects.
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
ungreedy set PCRE2_UNGREEDY
use_offset_limit set PCRE2_USE_OFFSET_LIMIT
utf set PCRE2_UTF
.sp
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
non-printing characters in output strings to be printed using the \ex{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets.
brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
.
.
.\" HTML <a name="controlmodifiers"></a>
@ -485,18 +568,24 @@ about the pattern:
debug same as info,fullbincode
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex pattern is coded in hexadecimal
hex unquoted characters are hexadecimal
jit[=<number>] use JIT
jitfast use JIT fast path
jitverify verify JIT use
locale=<name> use this locale
max_pattern_length=<n> set the maximum pattern length
memory show memory used
newline=<type> set newline type
null_context compile with a NULL context
parens_nest_limit=<n> set maximum parentheses depth
posix use the POSIX API
posix_nosub use the POSIX API with REG_NOSUB
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8
.sp
The effects of these modifiers are described in the following sections.
.
@ -565,40 +654,148 @@ is requested. For each callout, either its number or string is given, followed
by the item that follows it in the pattern.
.
.
.SS "Specifying a pattern in hex"
.SS "Passing a NULL context"
.rs
.sp
The \fBhex\fP modifier specifies that the characters of the pattern are to be
interpreted as pairs of hexadecimal digits. White space is permitted between
pairs. For example:
Normally, \fBpcre2test\fP passes a context block to \fBpcre2_compile()\fP. If
the \fBnull_context\fP modifier is set, however, NULL is passed. This is for
testing that \fBpcre2_compile()\fP behaves correctly in this case (it uses
default values).
.
.
.SS "Specifying the pattern's length"
.rs
.sp
By default, patterns are passed to the compiling functions as zero-terminated
strings. When using the POSIX wrapper API, there is no other option. However,
when using PCRE2's native API, patterns can be passed by length instead of
being zero-terminated. The \fBuse_length\fP modifier causes this to happen.
Using a length happens automatically (whether or not \fBuse_length\fP is set)
when \fBhex\fP is set, because patterns specified in hexadecimal may contain
binary zeros.
.
.
.SS "Specifying pattern characters in hexadecimal"
.rs
.sp
The \fBhex\fP modifier specifies that the characters of the pattern, except for
substrings enclosed in single or double quotes, are to be interpreted as pairs
of hexadecimal digits. This feature is provided as a way of creating patterns
that contain binary zeros and other non-printing characters. White space is
permitted between pairs of digits. For example, this pattern contains three
characters:
.sp
/ab 32 59/hex
.sp
This feature is provided as a way of creating patterns that contain binary zero
and other non-printing characters. By default, \fBpcre2test\fP passes patterns
as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
actual length of the pattern is passed.
Parts of such a pattern are taken literally if quoted. This pattern contains
nine characters, only two of which are specified in hexadecimal:
.sp
/ab "literal" 32/hex
.sp
Either single or double quotes may be used. There is no way of including
the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
mutually exclusive.
.P
The POSIX API cannot be used with patterns specified in hexadecimal because
they may contain binary zeros, which conflicts with \fBregcomp()\fP's
requirement for a zero-terminated string. Such patterns are always passed to
\fBpcre2_compile()\fP as a string with a length, not as zero-terminated.
.
.
.SS "Specifying wide characters in 16-bit and 32-bit modes"
.rs
.sp
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
can be used. It is mutually exclusive with \fButf\fP. Input lines are
interpreted as UTF-8 as a means of specifying wide characters. More details are
given in
.\" HTML <a href="#inputencoding">
.\" </a>
"Input encoding"
.\"
above.
.
.
.SS "Generating long repetitive patterns"
.rs
.sp
Some tests use long patterns that are very repetitive. Instead of creating a
very long input line for such a pattern, you can use a special repetition
feature, similar to the one described for subject lines above. If the
\fBexpand\fP modifier is present on a pattern, parts of the pattern that have
the form
.sp
\e[<characters>]{<count>}
.sp
are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
mutually exclusive.
.P
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
the quantifier. For example, \e[AB]{6000,6000} is not recognized as an
expansion item.
.P
If the \fBinfo\fP modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
.
.
.SS "JIT compilation"
.rs
.sp
The \fB/jit\fP modifier may optionally be followed by an equals sign and a
number in the range 0 to 7:
Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
speed up pattern matching. See the
.\" HREF
\fBpcre2jit\fP
.\"
documentation for details. JIT compiling happens, optionally, after a pattern
has been successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time options
PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
different code is generated for the different cases. See the \fBpartial\fP
modifier in "Subject Modifiers"
.\" HTML <a href="#subjectmodifiers">
.\" </a>
below
.\"
for details of how these options are specified for each match attempt.
.P
JIT compilation is requested by the \fB/jit\fP pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating
modes are to be compiled:
.sp
1 compile JIT code for non-partial matching
2 compile JIT code for soft partial matching
4 compile JIT code for hard partial matching
.sp
The possible values for the \fBjit\fP modifier are therefore:
.sp
0 disable JIT
1 use JIT for normal match only
2 use JIT for soft partial match only
3 use JIT for normal match and soft partial match
4 use JIT for hard partial match only
6 use JIT for soft and hard partial match
1 normal matching only
2 soft partial matching only
3 normal and soft partial matching
4 hard partial matching only
6 soft and hard partial matching only
7 all three modes
.sp
If no number is given, 7 is assumed. If JIT compilation is successful, the
compiled JIT code will automatically be used when \fBpcre2_match()\fP is run
for the appropriate type of match, except when incompatible run-time options
are specified. For more details, see the
If no number is given, 7 is assumed. The phrase "partial matching" means a call
to \fBpcre2_match()\fP with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial
matching (for example, /jit=2) but do not set the \fBpartial\fP modifier on a
subject line, that match will not use JIT code because none was compiled for
non-partial matching.
.P
If JIT compilation is successful, the compiled JIT code will automatically be
used when an appropriate type of match is run, except when incompatible
run-time options are specified. For more details, see the
.\" HREF
\fBpcre2jit\fP
.\"
@ -622,14 +819,14 @@ code was actually used in the match.
.SS "Setting a locale"
.rs
.sp
The \fB/locale\fP modifier must specify the name of a locale, for example:
The \fBlocale\fP modifier must specify the name of a locale, for example:
.sp
/pattern/locale=fr_FR
.sp
The given locale is set, \fBpcre2_maketables()\fP is called to build a set of
character tables for the locale, and this is then passed to
\fBpcre2_compile()\fP when compiling the regular expression. The same tables
are used when matching the following subject lines. The \fB/locale\fP modifier
are used when matching the following subject lines. The \fBlocale\fP modifier
applies only to the pattern on which it appears, but can be given in a
\fB#pattern\fP command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.
@ -638,7 +835,7 @@ character tables are mutually exclusive.
.SS "Showing pattern memory"
.rs
.sp
The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
The \fBmemory\fP modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is
@ -660,30 +857,54 @@ sets its own default of 220, which is required for running the standard test
suite.
.
.
.SS "Limiting the pattern length"
.rs
.sp
The \fBmax_pattern_length\fP modifier sets a limit, in code units, to the
length of pattern that \fBpcre2_compile()\fP will accept. Breaching the limit
causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited).
.
.
.SS "Using the POSIX wrapper API"
.rs
.sp
The \fB/posix\fP modifier causes \fBpcre2test\fP to call PCRE2 via the POSIX
wrapper API rather than its native API. This supports only the 8-bit library.
When the POSIX API is being used, the following pattern modifiers set options
for the \fBregcomp()\fP function:
The \fB/posix\fP and \fBposix_nosub\fP modifiers cause \fBpcre2test\fP to call
PCRE2 via the POSIX wrapper API rather than its native API. When
\fBposix_nosub\fP is used, the POSIX option REG_NOSUB is passed to
\fBregcomp()\fP. The POSIX wrapper supports only the 8-bit library. Note that
it does not imply POSIX matching semantics; for more detail see the
.\" HREF
\fBpcre2posix\fP
.\"
documentation. The following pattern modifiers set options for the
\fBregcomp()\fP function:
.sp
caseless REG_ICASE
multiline REG_NEWLINE
no_auto_capture REG_NOSUB
dotall REG_DOTALL )
ungreedy REG_UNGREEDY ) These options are not part of
ucp REG_UCP ) the POSIX standard
utf REG_UTF8 )
.sp
The \fBregerror_buffsize\fP modifier specifies a size for the error buffer that
is passed to \fBregerror()\fP in the event of a compilation error. For example:
.sp
/abc/posix,regerror_buffsize=20
.sp
This provides a means of testing the behaviour of \fBregerror()\fP when the
buffer is too small for the error message. If this modifier has not been set, a
large buffer is used.
.P
The \fBaftertext\fP and \fBallaftertext\fP subject modifiers work as described
below. All other modifiers cause an error.
below. All other modifiers are either ignored, with a warning message, or cause
an error.
.
.
.SS "Testing the stack guard feature"
.rs
.sp
The \fB/stackguard\fP modifier is used to test the use of
The \fBstackguard\fP modifier is used to test the use of
\fBpcre2_set_compile_recursion_guard()\fP, a function that is provided to
enable stack availability to be checked during compilation (see the
.\" HREF
@ -700,7 +921,7 @@ be aborted.
.SS "Using alternative character tables"
.rs
.sp
The value specified for the \fB/tables\fP modifier must be one of the digits 0,
The value specified for the \fBtables\fP modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
@ -720,17 +941,22 @@ are mutually exclusive.
.sp
The following modifiers are really subject modifiers, and are described below.
However, they may be included in a pattern's modifier list, in which case they
are applied to every subject line that is processed with that pattern. They do
not affect the compilation process.
are applied to every subject line that is processed with that pattern. They may
not appear in \fB#pattern\fP commands. These modifiers do not affect the
compilation process.
.sp
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
/g global global matching
mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
/g global global matching
mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
.sp
These modifiers may not appear in a \fB#pattern\fP command. If you want them as
defaults, set them in a \fB#subject\fP command.
@ -746,15 +972,20 @@ facility is used when saving compiled patterns to a file, as described in the
section entitled "Saving and restoring compiled patterns"
.\" HTML <a href="#saverestore">
.\" </a>
below.
below. If \fBpushcopy\fP is used instead of \fBpush\fP, a copy of the compiled
pattern is stacked, leaving the original as current, ready to match the
following input lines. This provides a way of testing the
\fBpcre2_code_copy()\fP function.
.\"
The \fBpush\fP modifier is incompatible with compilation modifiers such as
\fBglobal\fP that act at match time. Any that are specified are ignored, with a
warning message, except for \fBreplace\fP, which causes an error. Note that,
\fBjitverify\fP, which is allowed, does not carry through to any subsequent
matching that uses this pattern.
The \fBpush\fP and \fBpushcopy \fP modifiers are incompatible with compilation
modifiers such as \fBglobal\fP that act at match time. Any that are specified
are ignored (for the stacked copy), with a warning message, except for
\fBreplace\fP, which causes an error. Note that \fBjitverify\fP, which is
allowed, does not carry through to any subsequent matching that uses a stacked
pattern.
.
.
.\" HTML <a name="subjectmodifiers"></a>
.SH "SUBJECT MODIFIERS"
.rs
.sp
@ -775,6 +1006,7 @@ for a description of their effects.
anchored set PCRE2_ANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_jit set PCRE2_NO_JIT
no_utf_check set PCRE2_NO_UTF_CHECK
notbol set PCRE2_NOTBOL
notempty set PCRE2_NOTEMPTY
@ -786,11 +1018,11 @@ for a description of their effects.
The partial matching modifiers are provided with abbreviations because they
appear frequently in tests.
.P
If the \fB/posix\fP modifier was present on the pattern, causing the POSIX
If the \fBposix\fP modifier was present on the pattern, causing the POSIX
wrapper API to be used, the only option-setting modifiers that have any effect
are \fBnotbol\fP, \fBnotempty\fP, and \fBnoteol\fP, causing REG_NOTBOL,
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to \fBregexec()\fP.
Any other modifiers cause an error.
The other modifiers are ignored, with a warning message.
.
.
.SS "Setting match controls"
@ -801,33 +1033,44 @@ information. Some of them may also be specified on a pattern line (see above),
in which case they apply to every subject line that is matched against that
pattern.
.sp
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text (non-JIT only)
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
callout_fail=<n>[:<m>] control callout failure
callout_none do not supply a callout function
copy=<number or name> copy captured substring
dfa use \fBpcre2_dfa_match()\fP
find_limits find match and recursion limits
get=<number or name> extract captured substring
getall extract all captured substrings
/g global global matching
jitstack=<n> set size of JIT stack
mark show mark values
match_limit=>n> set a match limit
memory show memory usage
offset=<n> set starting offset
ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit
replace=<string> specify a replacement string
startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text (non-JIT only)
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
callout_error=<n>[:<m>] control callout error
callout_fail=<n>[:<m>] control callout failure
callout_none do not supply a callout function
copy=<number or name> copy captured substring
dfa use \fBpcre2_dfa_match()\fP
find_limits find match and recursion limits
get=<number or name> extract captured substring
getall extract all captured substrings
/g global global matching
jitstack=<n> set size of JIT stack
mark show mark values
match_limit=<n> set a match limit
memory show memory usage
null_context match with a NULL context
offset=<n> set starting offset
offset_limit=<n> set offset limit
ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit
replace=<string> specify a replacement string
startchar show startchar when relevant
startoffset=<n> same as offset=<n>
substitute_extedded use PCRE2_SUBSTITUTE_EXTENDED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
zero_terminate pass the subject as zero-terminated
.sp
The effects of these modifiers are described in the following sections.
The effects of these modifiers are described in the following sections. When
matching via the POSIX wrapper API, the \fBaftertext\fP, \fBallaftertext\fP,
and \fBovector\fP subject modifiers work as described below. All other
modifiers are either ignored, with a warning message, or cause an error.
.
.
.SS "Showing more text"
@ -882,7 +1125,8 @@ The \fBallcaptures\fP modifier requests that the values of all potential
captured parentheses be output after a match. By default, only those up to the
highest one actually used in the match are output (corresponding to the return
code from \fBpcre2_match()\fP). Groups that did not take part in the match
are output as "<unset>".
are output as "<unset>". This modifier is not relevant for DFA matching (which
does no capturing); it is ignored, with a warning message, if present.
.
.
.SS "Testing callouts"
@ -890,14 +1134,20 @@ are output as "<unset>".
.sp
A callout function is supplied when \fBpcre2test\fP calls the library matching
functions, unless \fBcallout_none\fP is specified. If \fBcallout_capture\fP is
set, the current captured groups are output when a callout occurs.
set, the current captured groups are output when a callout occurs. The default
return from the callout function is zero, which allows matching to continue.
.P
The \fBcallout_fail\fP modifier can be given one or two numbers. If there is
only one number, 1 is returned instead of 0 when a callout of that number is
reached. If two numbers are given, 1 is returned when callout <n> is reached
for the <m>th time. Note that callouts with string arguments are always given
the number zero. See "Callouts" below for a description of the output when a
callout it taken.
only one number, 1 is returned instead of 0 (causing matching to backtrack)
when a callout of that number is reached. If two numbers (<n>:<m>) are given, 1
is returned when callout <n> is reached and there have been at least <m>
callouts. The \fBcallout_error\fP modifier is similar, except that
PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
aborted. If both these modifiers are set for the same callout number,
\fBcallout_error\fP takes precedence.
.P
Note that callouts with string arguments are always given the number zero. See
"Callouts" below for a description of the output when a callout it taken.
.P
The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
This is set as the "user data" that is passed to the matching function, and
@ -909,7 +1159,7 @@ used as a return from \fBpcre2test\fP's callout function.
.rs
.sp
Searching for all possible matches within a subject can be requested by the
\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
\fBglobal\fP or \fBaltglobal\fP modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
@ -957,18 +1207,30 @@ by name.
.rs
.sp
If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is
called instead of one of the matching functions. Unlike subject strings,
\fBpcre2test\fP does not process replacement strings for escape sequences. In
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
If so, it is correctly converted to a UTF string of the appropriate code unit
width. If it is not a valid UTF-8 string, the individual code units are copied
directly. This provides a means of passing an invalid UTF-8 string for testing
purposes.
called instead of one of the matching functions. Note that replacement strings
cannot contain commas, because a comma signifies the end of a modifier. This is
not thought to be an issue in a test program.
.P
If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
\fBpcre2_substitute()\fP. After a successful substitution, the modified string
is output, preceded by the number of replacements. This may be zero if there
were no matches. Here is a simple example of a substitution test:
Unlike subject strings, \fBpcre2test\fP does not process replacement strings
for escape sequences. In UTF mode, a replacement string is checked to see if it
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
the appropriate code unit width. If it is not a valid UTF-8 string, the
individual code units are copied directly. This provides a means of passing an
invalid UTF-8 string for testing purposes.
.P
The following modifiers set options (in additional to the normal match options)
for \fBpcre2_substitute()\fP:
.sp
global PCRE2_SUBSTITUTE_GLOBAL
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
.sp
.P
After a successful substitution, the modified string is output, preceded by the
number of replacements. This may be zero if there were no matches. Here is a
simple example of a substitution test:
.sp
/abc/replace=xxx
=abc=abc=
@ -976,12 +1238,12 @@ were no matches. Here is a simple example of a substitution test:
=abc=abc=\e=global
2: =xxx=xxx=
.sp
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
output buffer, with the replacement string starting at the next character. Here
is an example that tests the edge case:
Subject and replacement strings should be kept relatively short (fewer than 256
characters) for substitution tests, as fixed-size buffers are used. To make it
easy to test for buffer overflow, if the replacement string starts with a
number in square brackets, that number is passed to \fBpcre2_substitute()\fP as
the size of the output buffer, with the replacement string starting at the next
character. Here is an example that tests the edge case:
.sp
/abc/
123abc123\e=replace=[10]XYZ
@ -989,6 +1251,19 @@ is an example that tests the edge case:
123abc123\e=replace=[9]XYZ
Failed: error -47: no more memory
.sp
The default action of \fBpcre2_substitute()\fP is to return
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
\fBsubstitute_overflow_length\fP modifier), \fBpcre2_substitute()\fP continues
to go through the motions of matching and substituting, in order to compute the
size of buffer that is required. When this happens, \fBpcre2test\fP shows the
required buffer length (which includes space for the trailing zero) as part of
the error message. For example:
.sp
/abc/substitute_overflow_length
123abc123\e=replace=[9]XYZ
Failed: error -47: no more memory: 10 code units are needed
.sp
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from
\fBpcre2_substitute()\fP.
@ -1059,6 +1334,16 @@ The \fBoffset\fP modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters.
.
.
.SS "Setting an offset limit"
.rs
.sp
The \fBoffset_limit\fP modifier sets a limit for unanchored matches. If a match
cannot be found starting at or before this offset in the subject, a "no match"
return is given. The data value is a number of code units, not characters. When
this modifier is used, the \fBuse_offset_limit\fP modifier must have been set
for the pattern; if not, an error is generated.
.
.
.SS "Setting the size of the output vector"
.rs
.sp
@ -1089,6 +1374,17 @@ When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
passing the replacement string as zero-terminated.
.
.
.SS "Passing a NULL context"
.rs
.sp
Normally, \fBpcre2test\fP passes a context block to \fBpcre2_match()\fP,
\fBpcre2_dfa_match()\fP or \fBpcre2_jit_match()\fP. If the \fBnull_context\fP
modifier is set, however, NULL is passed. This is for testing that the matching
functions behave correctly in this case (they use default values). This
modifier cannot be used with the \fBfind_limits\fP modifier or when testing the
substitution function.
.
.
.SH "THE ALTERNATIVE MATCHING FUNCTION"
.rs
.sp
@ -1156,7 +1452,7 @@ unset substring is shown as "<unset>", as for the second data line.
If the strings contain any non-printing characters, they are output as \exhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \ex{hh...} escapes. See below for the definition of non-printing
characters. If the \fB/aftertext\fP modifier is set, the output for substring
characters. If the \fBaftertext\fP modifier is set, the output for substring
0 is followed by the the rest of the subject string, identified by "0+" like
this:
.sp
@ -1286,7 +1582,9 @@ item to be tested. For example:
This output indicates that callout number 0 occurred for a match attempt
starting at the fourth character of the subject string, when the pointer was at
the seventh character, and when the next pattern item was \ed. Just
one circumflex is output if the start and current positions are the same.
one circumflex is output if the start and current positions are the same, or if
the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
.P
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
result of the \fB/auto_callout\fP pattern modifier. In this case, instead of
@ -1352,7 +1650,7 @@ therefore shown as hex escapes.
.P
When \fBpcre2test\fP is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
the pattern (using the \fB/locale\fP modifier). In this case, the
the pattern (using the \fBlocale\fP modifier). In this case, the
\fBisprint()\fP function is used to distinguish printing and non-printing
characters.
.
@ -1382,11 +1680,15 @@ can be used to test these functions.
.P
When a pattern with \fBpush\fP modifier is successfully compiled, it is pushed
onto a stack of compiled patterns, and \fBpcre2test\fP expects the next line to
contain a new pattern (or command) instead of a subject line. By this means, a
number of patterns can be compiled and retained. The \fBpush\fP modifier is
incompatible with \fBposix\fP, and control modifiers that act at match time are
ignored (with a message). The \fBjitverify\fP modifier applies only at compile
time. The command
contain a new pattern (or command) instead of a subject line. By contrast,
the \fBpushcopy\fP modifier causes a copy of the compiled pattern to be
stacked, leaving the original available for immediate matching. By using
\fBpush\fP and/or \fBpushcopy\fP, a number of patterns can be compiled and
retained. These modifiers are incompatible with \fBposix\fP, and control
modifiers that act at match time are ignored (with a message) for the stacked
patterns. The \fBjitverify\fP modifier applies only at compile time.
.P
The command
.sp
#save <filename>
.sp
@ -1406,7 +1708,8 @@ modifier list containing only
control modifiers
.\"
that act after a pattern has been compiled. In particular, \fBhex\fP,
\fBposix\fP, and \fBpush\fP are not allowed, nor are any
\fBposix\fP, \fBposix_nosub\fP, \fBpush\fP, and \fBpushcopy\fP are not allowed,
nor are any
.\" HTML <a href="#optionmodifiers">
.\" </a>
option-setting modifiers.
@ -1426,6 +1729,10 @@ reloads two patterns.
.sp
If \fBjitverify\fP is used with #pop, it does not automatically imply
\fBjit\fP, which is different behaviour from when it is used on a pattern.
.P
The #popcopy command is analagous to the \fBpushcopy\fP modifier in that it
makes current a copy of the topmost stack pattern, leaving the original still
on the stack.
.
.
.
@ -1451,6 +1758,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 May 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 28 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi

File diff suppressed because it is too large Load Diff

View File

@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
.TH PCRE2UNICODE 3 "03 July 2016" "PCRE2 10.22"
.SH NAME
PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@ -57,17 +57,21 @@ individual code units.
In UTF modes, the dot metacharacter matches one UTF character instead of a
single code unit.
.P
The escape sequence \eC can be used to match a single code unit, in a UTF mode,
The escape sequence \eC can be used to match a single code unit in a UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \eC in the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation). The use of \eC is not supported in the alternative matching
function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains
\eC, it will not succeed, and so the matching will be carried out by the normal
interpretive function.
documentation).
.P
The use of \eC is not supported by the alternative matching function
\fBpcre2_dfa_match()\fP when in UTF-8 or UTF-16 mode, that is, when a character
may consist of more than one code unit. The use of \eC in these modes provokes
a match-time error. Also, the JIT optimization does not support \eC in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
the matching will be carried out by the normal interpretive function.
.P
The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
characters of any code value, but, by default, the characters that PCRE2
@ -117,11 +121,21 @@ UTF-16 and UTF-32 strings can indicate their endianness by special code knows
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
strings to be in host byte order.
.P
The entire string is checked before any other processing takes place. In
addition to checking the format of the string, there is a check to ensure that
all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area.
The so-called "non-character" code points are not excluded because Unicode
corrigendum #9 makes it clear that they should not be.
A UTF string is checked before any other processing takes place. In the case of
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP calls with a non-zero starting
offset, the check is applied only to that part of the subject that could be
inspected during matching, and there is a check that the starting offset points
to the first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the starting
offset. Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \eb and \eB are
one-character lookbehinds.
.P
In addition to checking the format of the string, there is a check to ensure
that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
area. The so-called "non-character" code points are not excluded because
Unicode corrigendum #9 makes it clear that they should not be.
.P
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
where they are used in pairs to encode code points with values greater than
@ -221,9 +235,9 @@ never occur in a valid UTF-8 string.
.sp
The following negative error codes are given for invalid UTF-16 strings:
.sp
PCRE_UTF16_ERR1 Missing low surrogate at end of string
PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
PCRE_UTF16_ERR3 Isolated low surrogate
PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
.sp
.
.
@ -233,8 +247,8 @@ The following negative error codes are given for invalid UTF-16 strings:
.sp
The following negative error codes are given for invalid UTF-32 strings:
.sp
PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff)
PCRE_UTF32_ERR2 Code point is greater than 0x10ffff
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
.sp
.
.
@ -252,6 +266,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 03 July 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi