Update bundled PCRE2-library to version 10.23
Some manual changes done to the library were lost with this update. They will be added in the next commit.
@ -97,6 +97,7 @@ can skip ahead to the CMake section.
|
||||
pcre2_context.c
|
||||
pcre2_dfa_match.c
|
||||
pcre2_error.c
|
||||
pcre2_find_bracket.c
|
||||
pcre2_jit_compile.c
|
||||
pcre2_maketables.c
|
||||
pcre2_match.c
|
||||
@ -173,7 +174,11 @@ can skip ahead to the CMake section.
|
||||
|
||||
(11) If you want to use the pcre2grep command, compile and link
|
||||
src/pcre2grep.c; it uses only the basic 8-bit PCRE2 library (it does not
|
||||
need the pcre2posix library).
|
||||
need the pcre2posix library). If you have built the PCRE2 library with JIT
|
||||
support by defining SUPPORT_JIT in src/config.h, you can also define
|
||||
SUPPORT_PCRE2GREP_JIT, which causes pcre2grep to make use of JIT (unless
|
||||
it is run with --no-jit). If you define SUPPORT_PCRE2GREP_JIT without
|
||||
defining SUPPORT_JIT, pcre2grep does not try to make use of JIT.
|
||||
|
||||
|
||||
STACK SIZE IN WINDOWS ENVIRONMENTS
|
||||
@ -388,4 +393,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
|
||||
recommended download site.
|
||||
|
||||
=============================
|
||||
Last Updated: 15 June 2015
|
||||
Last Updated: 13 October 2016
|
||||
|
@ -44,7 +44,7 @@ wrappers.
|
||||
|
||||
The distribution does contain a set of C wrapper functions for the 8-bit
|
||||
library that are based on the POSIX regular expression API (see the pcre2posix
|
||||
man page). These can be found in a library called libpcre2posix. Note that this
|
||||
man page). These can be found in a library called libpcre2-posix. Note that this
|
||||
just provides a POSIX calling interface to PCRE2; the regular expressions
|
||||
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
||||
and does not give full access to all of PCRE2's facilities.
|
||||
@ -58,8 +58,8 @@ renamed or pointed at by a link.
|
||||
If you are using the POSIX interface to PCRE2 and there is already a POSIX
|
||||
regex library installed on your system, as well as worrying about the regex.h
|
||||
header file (as mentioned above), you must also take care when linking programs
|
||||
to ensure that they link with PCRE2's libpcre2posix library. Otherwise they may
|
||||
pick up the POSIX functions of the same name from the other library.
|
||||
to ensure that they link with PCRE2's libpcre2-posix library. Otherwise they
|
||||
may pick up the POSIX functions of the same name from the other library.
|
||||
|
||||
One way of avoiding this confusion is to compile PCRE2 with the addition of
|
||||
-Dregcomp=PCRE2regcomp (and similarly for the other POSIX functions) to the
|
||||
@ -168,15 +168,12 @@ library. They are also documented in the pcre2build man page.
|
||||
built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
|
||||
to disable building the 8-bit library.
|
||||
|
||||
. If you want to include support for just-in-time compiling, which can give
|
||||
large performance improvements on certain platforms, add --enable-jit to the
|
||||
"configure" command. This support is available only for certain hardware
|
||||
. If you want to include support for just-in-time (JIT) compiling, which can
|
||||
give large performance improvements on certain platforms, add --enable-jit to
|
||||
the "configure" command. This support is available only for certain hardware
|
||||
architectures. If you try to enable it on an unsupported architecture, there
|
||||
will be a compile time error.
|
||||
|
||||
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
|
||||
you add --disable-pcre2grep-jit to the "configure" command.
|
||||
|
||||
. If you do not want to make use of the support for UTF-8 Unicode character
|
||||
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
|
||||
library, or UTF-32 Unicode character strings in the 32-bit library, you can
|
||||
@ -207,19 +204,19 @@ library. They are also documented in the pcre2build man page.
|
||||
--enable-newline-is-crlf, --enable-newline-is-anycrlf, or
|
||||
--enable-newline-is-any to the "configure" command, respectively.
|
||||
|
||||
If you specify --enable-newline-is-cr or --enable-newline-is-crlf, some of
|
||||
the standard tests will fail, because the lines in the test files end with
|
||||
LF. Even if the files are edited to change the line endings, there are likely
|
||||
to be some failures. With --enable-newline-is-anycrlf or
|
||||
--enable-newline-is-any, many tests should succeed, but there may be some
|
||||
failures.
|
||||
|
||||
. By default, the sequence \R in a pattern matches any Unicode line ending
|
||||
sequence. This is independent of the option specifying what PCRE2 considers
|
||||
to be the end of a line (see above). However, the caller of PCRE2 can
|
||||
restrict \R to match only CR, LF, or CRLF. You can make this the default by
|
||||
adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").
|
||||
|
||||
. In a pattern, the escape sequence \C matches a single code unit, even in a
|
||||
UTF mode. This can be dangerous because it breaks up multi-code-unit
|
||||
characters. You can build PCRE2 with the use of \C permanently locked out by
|
||||
adding --enable-never-backslash-C (note the upper case C) to the "configure"
|
||||
command. When \C is allowed by the library, individual applications can lock
|
||||
it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.
|
||||
|
||||
. PCRE2 has a counter that limits the depth of nesting of parentheses in a
|
||||
pattern. This limits the amount of system stack that a pattern uses when it
|
||||
is compiled. The default is 250, but you can change it by setting, for
|
||||
@ -249,13 +246,13 @@ library. They are also documented in the pcre2build man page.
|
||||
sizes in the pcre2stack man page.
|
||||
|
||||
. In the 8-bit library, the default maximum compiled pattern size is around
|
||||
64K. You can increase this by adding --with-link-size=3 to the "configure"
|
||||
command. PCRE2 then uses three bytes instead of two for offsets to different
|
||||
parts of the compiled pattern. In the 16-bit library, --with-link-size=3 is
|
||||
the same as --with-link-size=4, which (in both libraries) uses four-byte
|
||||
offsets. Increasing the internal link size reduces performance in the 8-bit
|
||||
and 16-bit libraries. In the 32-bit library, the link size setting is
|
||||
ignored, as 4-byte offsets are always used.
|
||||
64K bytes. You can increase this by adding --with-link-size=3 to the
|
||||
"configure" command. PCRE2 then uses three bytes instead of two for offsets
|
||||
to different parts of the compiled pattern. In the 16-bit library,
|
||||
--with-link-size=3 is the same as --with-link-size=4, which (in both
|
||||
libraries) uses four-byte offsets. Increasing the internal link size reduces
|
||||
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
||||
link size setting is ignored, as 4-byte offsets are always used.
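The "around 64K" figure above follows from the width of the offsets: an offset stored in n bytes can address at most 2^(8n) - 1 positions. The helper below is a minimal sketch of that arithmetic only. The name max_link_offset is hypothetical and not part of PCRE2, and the exact limit that PCRE2 enforces for a given link size may be somewhat lower.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: the largest offset that fits
   in link_size bytes. Intended for the link sizes 2, 3 and 4 only. */
uint64_t max_link_offset(unsigned link_size)
{
    return (UINT64_C(1) << (8 * link_size)) - 1;
}
```

Under this sketch a link size of 2 gives 65535, a link size of 3 gives 16777215, and a link size of 4 gives 4294967295.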
|
||||
|
||||
. You can build PCRE2 so that its internal match() function that is called from
|
||||
pcre2_match() does not call itself recursively. Instead, it uses memory
|
||||
@ -317,6 +314,14 @@ library. They are also documented in the pcre2build man page.
|
||||
running "make" to build PCRE2. There is more information about coverage
|
||||
reporting in the "pcre2build" documentation.
|
||||
|
||||
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
|
||||
you add --disable-pcre2grep-jit to the "configure" command.
|
||||
|
||||
. On non-Windows systems there is support for calling external scripts during
|
||||
matching in the pcre2grep command via PCRE2's callout facility with string
|
||||
arguments. This support can be disabled by adding --disable-pcre2grep-callout
|
||||
to the "configure" command.
|
||||
|
||||
. The pcre2grep program currently supports only 8-bit data files, and so
|
||||
requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
|
||||
libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
|
||||
@ -327,12 +332,23 @@ library. They are also documented in the pcre2build man page.
|
||||
|
||||
Of course, the relevant libraries must be installed on your system.
|
||||
|
||||
. The default size (in bytes) of the internal buffer used by pcre2grep can be
|
||||
set by, for example:
|
||||
. The default starting size (in bytes) of the internal buffer used by pcre2grep
|
||||
can be set by, for example:
|
||||
|
||||
--with-pcre2grep-bufsize=51200
|
||||
|
||||
The value must be a plain integer. The default is 20480.
|
||||
The value must be a plain integer. The default is 20480. The amount of memory
|
||||
used by pcre2grep is actually three times this number, to allow for "before"
|
||||
and "after" lines. If very long lines are encountered, the buffer is
|
||||
automatically enlarged, up to a fixed maximum size.
|
||||
|
||||
. The default maximum size of pcre2grep's internal buffer can be set by, for
|
||||
example:
|
||||
|
||||
--with-pcre2grep-max-bufsize=2097152
|
||||
|
||||
The default is either 1048576 or the value of --with-pcre2grep-bufsize,
|
||||
whichever is the larger.
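The two buffer rules above can be written out as arithmetic. This is a minimal sketch, not code taken from pcre2grep: the function names are hypothetical, and the constant 1048576 is the default quoted in the text.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of pcre2grep itself. */

/* Memory used for a given starting buffer size: three times the size,
   to allow for "before" and "after" lines. */
size_t grep_buffer_memory(size_t bufsize)
{
    return 3 * bufsize;
}

/* Default maximum buffer size: 1048576 or the starting buffer size,
   whichever is the larger. */
size_t grep_default_max_bufsize(size_t bufsize)
{
    const size_t floor_max = 1048576;
    return bufsize > floor_max ? bufsize : floor_max;
}
```

With the default starting size of 20480, this gives 61440 bytes of buffer memory and a default maximum of 1048576.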
|
||||
|
||||
. It is possible to compile pcre2test so that it links with the libreadline
|
||||
or libedit libraries, by specifying, respectively,
|
||||
@ -357,6 +373,22 @@ library. They are also documented in the pcre2build man page.
|
||||
tgetflag, or tgoto, this is the problem, and linking with the ncurses library
|
||||
should fix it.
|
||||
|
||||
. There is a special option called --enable-fuzz-support for use by people who
|
||||
want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
|
||||
library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
|
||||
be built, but not installed. This contains a single function called
|
||||
LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
|
||||
length of the string. When called, this function tries to compile the string
|
||||
as a pattern, and if that succeeds, to match it. This is done both with no
|
||||
options and with some random options bits that are generated from the string.
|
||||
Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
|
||||
be created. This is normally run under valgrind or used when PCRE2 is
|
||||
compiled with address sanitizing enabled. It calls the fuzzing function and
|
||||
outputs information about what it is doing. The input strings are specified by
|
||||
arguments: if an argument starts with "=" the rest of it is a literal input
|
||||
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||
file are the test string.
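The argument rule described for pcre2fuzzcheck can be sketched as follows. This is an illustration under assumptions, not the actual pcre2fuzzcheck source: the type input_kind and the function classify_fuzzcheck_arg are hypothetical names, and reading the file contents is left out.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum { INPUT_LITERAL, INPUT_FILE } input_kind;

/* Classify one command-line argument. If it starts with "=", the rest
   of it is a literal input string; otherwise it is taken as a file
   name. *value is set to the literal text or to the file name. */
input_kind classify_fuzzcheck_arg(const char *arg, const char **value)
{
    if (arg[0] == '=') {
        *value = arg + 1;   /* skip the leading "=" */
        return INPUT_LITERAL;
    }
    *value = arg;
    return INPUT_FILE;
}
```

An argument such as "=a(b)c" is therefore treated as the literal test string a(b)c, while "pattern.txt" is treated as the name of a file to read.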
|
||||
|
||||
The "configure" script builds the following files for the basic C library:
|
||||
|
||||
. Makefile the makefile that builds the library
|
||||
@ -531,7 +563,7 @@ script creates the .txt and HTML forms of the documentation from the man pages.
|
||||
|
||||
|
||||
Testing PCRE2
|
||||
------------
|
||||
-------------
|
||||
|
||||
To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
|
||||
There is another script called RunGrepTest that tests the pcre2grep command.
|
||||
@ -724,6 +756,7 @@ The distribution should contain the files listed below.
|
||||
src/pcre2_context.c )
|
||||
src/pcre2_dfa_match.c )
|
||||
src/pcre2_error.c )
|
||||
src/pcre2_find_bracket.c )
|
||||
src/pcre2_jit_compile.c )
|
||||
src/pcre2_jit_match.c ) sources for the functions in the library,
|
||||
src/pcre2_jit_misc.c ) and some internal functions that they use
|
||||
@ -744,6 +777,7 @@ The distribution should contain the files listed below.
|
||||
src/pcre2_xclass.c )
|
||||
|
||||
src/pcre2_printint.c debugging function that is used by pcre2test,
|
||||
src/pcre2_fuzzsupport.c function for (optional) fuzzing support
|
||||
|
||||
src/config.h.in template for config.h, when built by "configure"
|
||||
src/pcre2.h.in template for pcre2.h when built by "configure"
|
||||
@ -801,7 +835,7 @@ The distribution should contain the files listed below.
|
||||
libpcre2-8.pc.in template for libpcre2-8.pc for pkg-config
|
||||
libpcre2-16.pc.in template for libpcre2-16.pc for pkg-config
|
||||
libpcre2-32.pc.in template for libpcre2-32.pc for pkg-config
|
||||
libpcre2posix.pc.in template for libpcre2posix.pc for pkg-config
|
||||
libpcre2-posix.pc.in template for libpcre2-posix.pc for pkg-config
|
||||
ltmain.sh file used to build a libtool script
|
||||
missing ) common stub for a few missing GNU programs while
|
||||
) installing, generated by automake
|
||||
@ -832,4 +866,4 @@ The distribution should contain the files listed below.
|
||||
Philip Hazel
|
||||
Email local part: ph10
|
||||
Email domain: cam.ac.uk
|
||||
Last updated: 24 April 2015
|
||||
Last updated: 01 November 2016
|
||||
|
@ -91,6 +91,12 @@ in the library.
|
||||
<tr><td><a href="pcre2_callout_enumerate.html">pcre2_callout_enumerate</a></td>
|
||||
<td> Enumerate callouts in a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
|
||||
<td> Copy a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
|
||||
<td> Copy a compiled pattern and its character tables</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
|
||||
<td> Free a compiled pattern</td></tr>
|
||||
|
||||
@ -210,9 +216,15 @@ in the library.
|
||||
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
|
||||
<td> Set the match limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_max_pattern_length.html">pcre2_set_max_pattern_length</a></td>
|
||||
<td> Set the maximum length of pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_newline.html">pcre2_set_newline</a></td>
|
||||
<td> Set the newline convention</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_offset_limit.html">pcre2_set_offset_limit</a></td>
|
||||
<td> Set the offset limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_parens_nest_limit.html">pcre2_set_parens_nest_limit</a></td>
|
||||
<td> Set the parentheses nesting limit</td></tr>
|
||||
|
||||
|
@ -126,8 +126,10 @@ running redundant checks.
|
||||
<P>
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
|
||||
problems, because it may leave the current matching point in the middle of a
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
|
||||
lock out the use of \C, causing a compile-time error if it is encountered.
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
|
||||
application to lock out the use of \C, causing a compile-time error if it is
|
||||
encountered. It is also possible to build PCRE2 with the use of \C permanently
|
||||
disabled.
|
||||
</P>
|
||||
<P>
|
||||
Another way that performance can be hit is by running a pattern that has a very
|
||||
@ -187,7 +189,7 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 13 April 2015
|
||||
Last updated: 16 October 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
43
pcre2/doc/html/pcre2_code_copy.html
Normal file
@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_code_copy specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_code_copy man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a copy of the memory used for a compiled pattern, excluding
|
||||
any memory used by the JIT compiler. Without a subsequent call to
|
||||
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching. The
|
||||
pointer to the character tables is copied, not the tables themselves (see
|
||||
<b>pcre2_code_copy_with_tables()</b>). The yield of the function is NULL if
|
||||
<i>code</i> is NULL or if sufficient memory cannot be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
44
pcre2/doc/html/pcre2_code_copy_with_tables.html
Normal file
@ -0,0 +1,44 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_code_copy_with_tables specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_code_copy_with_tables man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a copy of the memory used for a compiled pattern, excluding
|
||||
any memory used by the JIT compiler. Without a subsequent call to
|
||||
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching.
|
||||
Unlike <b>pcre2_code_copy()</b>, a separate copy of the character tables is also
|
||||
made, with the new code pointing to it. This memory will be automatically freed
|
||||
when <b>pcre2_code_free()</b> is called. The yield of the function is NULL if
|
||||
<i>code</i> is NULL or if sufficient memory cannot be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
@ -19,7 +19,7 @@ SYNOPSIS
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
||||
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
|
@ -45,8 +45,8 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
|
||||
<i>wscount</i> Number of elements in the vector
|
||||
</pre>
|
||||
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
|
||||
up a callout function. The <i>length</i> and <i>startoffset</i> values are code
|
||||
units, not characters. The options are:
|
||||
up a callout function or specify the recursion limit. The <i>length</i> and
|
||||
<i>startoffset</i> values are code units, not characters. The options are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
|
@ -35,7 +35,10 @@ errors are negative numbers. The arguments are:
|
||||
<i>bufflen</i> the length of the buffer (code units)
|
||||
</pre>
|
||||
The function returns the length of the message, excluding the trailing zero, or
|
||||
a negative error code if the buffer is too small.
|
||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
||||
this case, the returned message is truncated (but still with a trailing zero).
|
||||
If <i>errorcode</i> does not contain a recognized error code number, the
|
||||
negative value PCRE2_ERROR_BADDATA is returned.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
|
@ -19,7 +19,7 @@ SYNOPSIS
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
|
||||
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
|
@ -19,8 +19,8 @@ SYNOPSIS
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_match_data_create_from_pattern(const pcre2_code *<i>code</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
|
||||
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
|
@ -42,19 +42,20 @@ request are as follows:
|
||||
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
||||
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
|
||||
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
|
||||
0 nothing set
|
||||
1 first code unit is set
|
||||
2 start of string or after newline
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \C
|
||||
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
|
||||
exist in the pattern
|
||||
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
|
||||
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
|
||||
0 nothing set
|
||||
1 code unit is set
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
|
||||
empty string, 0 otherwise
|
||||
PCRE2_INFO_MATCHLIMIT Match limit if set,
|
||||
@ -62,8 +63,8 @@ request are as follows:
|
||||
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
|
||||
lookbehind assertion
|
||||
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMECOUNT Number of named subpatterns
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMETABLE Pointer to name table
|
||||
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
|
||||
PCRE2_NEWLINE_CR
|
||||
|
@ -20,7 +20,7 @@ SYNOPSIS
|
||||
</P>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, const uint32_t *<i>bytes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
|
@ -19,8 +19,8 @@ SYNOPSIS
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_encode(pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, uint32_t **<i>serialized_bytes</i>,</b>
|
||||
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
|
43
pcre2/doc/html/pcre2_set_max_pattern_length.html
Normal file
@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_max_pattern_length specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_max_pattern_length man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets, in a compile context, the maximum text length (in code
|
||||
units) of the pattern that can be compiled. The result is always zero. If a
|
||||
longer pattern is passed to <b>pcre2_compile()</b> there is an immediate error
|
||||
return. The default is effectively unlimited, being the largest value a
|
||||
PCRE2_SIZE variable can hold.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
40
pcre2/doc/html/pcre2_set_offset_limit.html
Normal file
@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_offset_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_offset_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the offset limit field in a match context. The result is
|
||||
always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
@ -59,20 +59,25 @@ units, not characters, as is the contents of the variable pointed at by
|
||||
<i>outlengthptr</i>, which is updated to the actual length of the new string.
|
||||
The options are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject string is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject string is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
||||
is not a valid match
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject or replacement for
|
||||
UTF validity (only relevant if PCRE2_UTF
|
||||
was set at compile time)
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
|
||||
subject is not a valid match
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject or replacement
|
||||
for UTF validity (only relevant if
|
||||
PCRE2_UTF was set at compile time)
|
||||
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
|
||||
</pre>
|
||||
The function returns the number of substitutions, which may be zero if there
|
||||
were no matches. The result can be greater than one only when
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
|
||||
is returned.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
|
File diff suppressed because it is too large
@ -18,23 +18,26 @@ please consult the man page, in case the conversion went wrong.
|
||||
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
|
||||
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
|
||||
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
|
||||
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
|
||||
<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
|
||||
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
|
||||
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
|
||||
<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
|
||||
<li><a name="TOC12" href="#SEC12">CREATING CHARACTER TABLES AT BUILD TIME</a>
|
||||
<li><a name="TOC13" href="#SEC13">USING EBCDIC CODE</a>
|
||||
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
||||
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
|
||||
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
||||
<li><a name="TOC17" href="#SEC17">INCLUDING DEBUGGING CODE</a>
|
||||
<li><a name="TOC18" href="#SEC18">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC19" href="#SEC19">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC20" href="#SEC20">SEE ALSO</a>
|
||||
<li><a name="TOC21" href="#SEC21">AUTHOR</a>
|
||||
<li><a name="TOC22" href="#SEC22">REVISION</a>
|
||||
<li><a name="TOC6" href="#SEC6">DISABLING THE USE OF \C</a>
|
||||
<li><a name="TOC7" href="#SEC7">JUST-IN-TIME COMPILER SUPPORT</a>
|
||||
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
|
||||
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
|
||||
<li><a name="TOC11" href="#SEC11">AVOIDING EXCESSIVE STACK USAGE</a>
|
||||
<li><a name="TOC12" href="#SEC12">LIMITING PCRE2 RESOURCE USAGE</a>
|
||||
<li><a name="TOC13" href="#SEC13">CREATING CHARACTER TABLES AT BUILD TIME</a>
|
||||
<li><a name="TOC14" href="#SEC14">USING EBCDIC CODE</a>
|
||||
<li><a name="TOC15" href="#SEC15">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
|
||||
<li><a name="TOC16" href="#SEC16">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
||||
<li><a name="TOC17" href="#SEC17">PCRE2GREP BUFFER SIZE</a>
|
||||
<li><a name="TOC18" href="#SEC18">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
||||
<li><a name="TOC19" href="#SEC19">INCLUDING DEBUGGING CODE</a>
|
||||
<li><a name="TOC20" href="#SEC20">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC21" href="#SEC21">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC22" href="#SEC22">SUPPORT FOR FUZZERS</a>
|
||||
<li><a name="TOC23" href="#SEC23">SEE ALSO</a>
|
||||
<li><a name="TOC24" href="#SEC24">AUTHOR</a>
|
||||
<li><a name="TOC25" href="#SEC25">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
|
||||
<P>
|
||||
@ -148,13 +151,19 @@ properties. The application can request that they do by setting the PCRE2_UCP
|
||||
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||
request this by starting with (*UCP).
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">DISABLING THE USE OF \C</a><br>
|
||||
<P>
|
||||
The \C escape sequence, which matches a single code unit, even in a UTF mode,
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. It can be locked out by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
point in the middle of a multi-code-unit character. The application can lock it
|
||||
out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
|
||||
<b>pcre2_compile()</b>. There is also a build-time option
|
||||
<pre>
|
||||
--enable-never-backslash-C
|
||||
</pre>
|
||||
(note the upper case C) which locks out the use of \C entirely.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<P>
|
||||
Just-in-time compiler support is included in the build by specifying
|
||||
<pre>
|
||||
@ -171,7 +180,7 @@ pcre2grep automatically makes use of it, unless you add
|
||||
</pre>
|
||||
to the "configure" command.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
|
||||
<br><a name="SEC8" href="#TOC1">NEWLINE RECOGNITION</a><br>
|
||||
<P>
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
||||
of a line. This is the normal newline character on Unix-like systems. You can
|
||||
@ -208,7 +217,7 @@ Whatever default line ending convention is selected when PCRE2 is built can be
|
||||
overridden by applications that use the library. At build time it is
|
||||
conventional to use the standard for your operating system.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<br><a name="SEC9" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<P>
|
||||
By default, the sequence \R in a pattern matches any Unicode newline sequence,
|
||||
independently of what has been selected as the line ending sequence. If you
|
||||
@ -220,7 +229,7 @@ the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
||||
selected when PCRE2 is built can be overridden by applications that use the
|
||||
called.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
||||
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
||||
<P>
|
||||
Within a compiled pattern, offset values are used to point from one part to
|
||||
another (for example, from an opening parenthesis to an alternation
|
||||
@ -239,7 +248,7 @@ longer offsets slows down the operation of PCRE2 because it has to load
|
||||
additional data when handling them. For the 32-bit library the value is always
|
||||
4 and cannot be overridden; the value of --with-link-size is ignored.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
|
||||
<br><a name="SEC11" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
|
||||
<P>
|
||||
When matching with the <b>pcre2_match()</b> function, PCRE2 implements
|
||||
backtracking by making recursive calls to an internal function called
|
||||
@ -261,7 +270,7 @@ custom memory management functions can be called instead. PCRE2 runs noticeably
|
||||
more slowly when built in this way. This option affects only the
|
||||
<b>pcre2_match()</b> function; it is not relevant for <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
|
||||
<br><a name="SEC12" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
|
||||
<P>
|
||||
Internally, PCRE2 has a function called <b>match()</b>, which it calls
|
||||
repeatedly (sometimes recursively) when matching a pattern with the
|
||||
@ -290,7 +299,7 @@ constraints. However, you can set a lower limit by adding, for example,
|
||||
</pre>
|
||||
to the <b>configure</b> command. This value can also be overridden at run time.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
|
||||
<br><a name="SEC13" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
|
||||
<P>
|
||||
PCRE2 uses fixed tables for processing characters whose code points are less
|
||||
than 256. By default, PCRE2 is built with a set of tables that are distributed
|
||||
@ -307,7 +316,7 @@ compiling, because <b>dftables</b> is run on the local host. If you need to
|
||||
create alternative tables when cross compiling, you will have to do so "by
|
||||
hand".)
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
|
||||
<br><a name="SEC14" href="#TOC1">USING EBCDIC CODE</a><br>
|
||||
<P>
|
||||
PCRE2 assumes by default that it will run in an environment where the character
|
||||
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||
@ -342,7 +351,16 @@ The options that select newline behaviour, such as --enable-newline-is-cr,
|
||||
and equivalent run-time options, refer to these character values in an EBCDIC
|
||||
environment.
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
|
||||
<br><a name="SEC15" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
|
||||
<P>
|
||||
By default, on non-Windows systems, <b>pcre2grep</b> supports the use of
|
||||
callouts with string arguments within the patterns it is matching, in order to
|
||||
run external scripts. For details, see the
|
||||
<a href="pcre2grep.html"><b>pcre2grep</b></a>
|
||||
documentation. This support can be disabled by adding
|
||||
--disable-pcre2grep-callout to the <b>configure</b> command.
|
||||
</P>
|
||||
<br><a name="SEC16" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
|
||||
<P>
|
||||
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
|
||||
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
|
||||
@ -355,22 +373,25 @@ to the <b>configure</b> command. These options naturally require that the
|
||||
relevant libraries are installed on your system. Configuration will fail if
|
||||
they are not.
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
|
||||
<br><a name="SEC17" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
|
||||
<P>
|
||||
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
|
||||
scanning, in order to be able to output "before" and "after" lines when it
|
||||
finds a match. The size of the buffer is controlled by a parameter whose
|
||||
default value is 20K. The buffer itself is three times this size, but because
|
||||
of the way it is used for holding "before" lines, the longest line that is
|
||||
guaranteed to be processable is the parameter size. You can change the default
|
||||
parameter value by adding, for example,
|
||||
finds a match. The starting size of the buffer is controlled by a parameter
|
||||
whose default value is 20K. The buffer itself is three times this size, but
|
||||
because of the way it is used for holding "before" lines, the longest line that
|
||||
is guaranteed to be processable is the parameter size. If a longer line is
|
||||
encountered, <b>pcre2grep</b> automatically expands the buffer, up to a
|
||||
specified maximum size, whose default is 1M or the starting size, whichever is
|
||||
the larger. You can change the default parameter values by adding, for example,
|
||||
<pre>
|
||||
--with-pcre2grep-bufsize=50K
|
||||
--with-pcre2grep-bufsize=51200
|
||||
--with-pcre2grep-max-bufsize=2097152
|
||||
</pre>
|
||||
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override this
|
||||
value by using --buffer-size on the command line..
|
||||
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override
|
||||
these values by using --buffer-size and --max-buffer-size on the command line.
|
||||
</P>
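The sizing rules described above can be sketched in C. This is only an illustration of the arithmetic, not code taken from pcre2grep: the macro and function names are hypothetical, and it assumes that the maximum applies to the size parameter rather than to the tripled block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the values are the defaults quoted in the text. */
#define DEFAULT_BUFSIZE      (20 * 1024)     /* starting size parameter: 20K */
#define DEFAULT_MAX_BUFSIZE  (1024 * 1024)   /* maximum size parameter: 1M */

/* The block that is actually allocated is three times the size parameter. */
static size_t allocated_bytes(size_t bufsize)
{
  assert(bufsize > 0);
  return 3 * bufsize;
}

/* The default maximum is 1M or the starting size, whichever is the larger. */
static size_t default_max_bufsize(size_t start_bufsize)
{
  return (start_bufsize > DEFAULT_MAX_BUFSIZE) ? start_bufsize
                                               : DEFAULT_MAX_BUFSIZE;
}

/* A line no longer than the size parameter is guaranteed to be processable
without any expansion of the buffer. */
static int fits_without_expansion(size_t line_length, size_t bufsize)
{
  return line_length <= bufsize;
}
```

With the default parameter of 20K, allocated_bytes() gives 61440 bytes, and a starting size of 2097152 raises the default maximum to the same value.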
|
||||
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
||||
<br><a name="SEC18" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
||||
<P>
|
||||
If you add one of
|
||||
<pre>
|
||||
@ -404,7 +425,7 @@ automatically included, you may need to add something like
|
||||
</pre>
|
||||
immediately before the <b>configure</b> command.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
||||
<br><a name="SEC19" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
@ -413,7 +434,7 @@ If you add
|
||||
to the <b>configure</b> command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<br><a name="SEC20" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
@ -423,7 +444,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
||||
certain memory regions as unaddressable. This allows it to detect invalid
|
||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<br><a name="SEC21" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<P>
|
||||
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
|
||||
code coverage report for its test suite. To enable this, you must install
|
||||
@ -480,11 +501,32 @@ This cleans all coverage data including the generated coverage report. For more
|
||||
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
|
||||
<P>
|
||||
There is a special option for use by people who want to run fuzzing tests on
|
||||
PCRE2:
|
||||
<pre>
|
||||
--enable-fuzz-support
|
||||
</pre>
|
||||
At present this applies only to the 8-bit library. If set, it causes an extra
|
||||
library called libpcre2-fuzzsupport.a to be built, but not installed. This
|
||||
contains a single function called LLVMFuzzerTestOneInput() whose arguments are
|
||||
a pointer to a string and the length of the string. When called, this function
|
||||
tries to compile the string as a pattern, and if that succeeds, to match it.
|
||||
This is done both with no options and with some random option bits that are
|
||||
generated from the string. Setting --enable-fuzz-support also causes a binary
|
||||
called <b>pcre2fuzzcheck</b> to be created. This is normally run under valgrind
|
||||
or used when PCRE2 is compiled with address sanitizing enabled. It calls the
|
||||
fuzzing function and outputs information about what it is doing. The input strings
|
||||
are specified by arguments: if an argument starts with "=" the rest of it is a
|
||||
literal input string. Otherwise, it is assumed to be a file name, and the
|
||||
contents of the file are the test string.
|
||||
</P>
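The argument convention described above (a leading "=" marks a literal input string, anything else is a file name) might be handled along the following lines. This is a minimal sketch under that reading of the text; the type and function names are hypothetical and it is not the actual <b>pcre2fuzzcheck</b> source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for one command-line argument. */
typedef struct {
  int is_literal;      /* 1 if the argument began with "=" */
  const char *text;    /* the literal input string, or the file name */
} fuzz_arg;

/* Decide whether an argument is a literal test string or a file name. */
static fuzz_arg classify_fuzz_arg(const char *arg)
{
  fuzz_arg result;
  assert(arg != NULL);
  if (arg[0] == '=')
    {
    result.is_literal = 1;
    result.text = arg + 1;    /* skip the "=" marker */
    }
  else
    {
    result.is_literal = 0;
    result.text = arg;        /* the contents of this file are the input */
    }
  return result;
}
```

A caller would pass a literal string directly to the fuzzing function, and would read a named file into memory before doing so.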
|
||||
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC24" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
@ -493,11 +535,11 @@ University Computing Service
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC25" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 24 April 2015
|
||||
Last updated: 01 November 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -57,11 +57,20 @@ two callout points:
|
||||
</pre>
|
||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
||||
automatically inserts callouts, all with number 255, before each item in the
|
||||
pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
pattern except for immediately before or after a callout item in the pattern.
|
||||
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
<pre>
|
||||
A(?C3)B
|
||||
</pre>
|
||||
it is processed as if it were
|
||||
<pre>
|
||||
(?C255)A(?C3)B(?C255)
|
||||
</pre>
|
||||
Here is a more complicated example:
|
||||
<pre>
|
||||
A(\d{2}|--)
|
||||
</pre>
|
||||
it is processed as if it were
|
||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||
<br>
|
||||
<br>
|
||||
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||
@ -107,10 +116,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||
No match
|
||||
</pre>
|
||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||
and therefore the callouts that would be taken for the backtracks do not occur.
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||
<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||
case, the output changes to this:
|
||||
(because it is being treated as a++) and therefore the callouts that would be
|
||||
taken for the backtracks do not occur. You can disable the auto-possessify
|
||||
feature by passing PCRE2_NO_AUTO_POSSESS to <b>pcre2_compile()</b>, or starting
|
||||
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
||||
<pre>
|
||||
--->aaaa
|
||||
+0 ^ a+
|
||||
@ -235,8 +244,8 @@ Fields for numerical callouts
|
||||
<P>
|
||||
For a numerical callout, <i>callout_string</i> is NULL, and <i>callout_number</i>
|
||||
contains the number of the callout, in the range 0-255. This is the number
|
||||
that follows (?C for manual callouts; it is 255 for automatically generated
|
||||
callouts.
|
||||
that follows (?C for callouts that are part of the pattern; it is 255 for
|
||||
automatically generated callouts.
|
||||
</P>
|
||||
<br><b>
|
||||
Fields for string callouts
|
||||
@ -310,10 +319,15 @@ the next item to be matched.
|
||||
</P>
|
||||
<P>
|
||||
The <i>next_item_length</i> field contains the length of the next item to be
|
||||
matched in the pattern string. When the callout immediately precedes an
|
||||
alternation bar, a closing parenthesis, or the end of the pattern, the length
|
||||
is zero. When the callout precedes an opening parenthesis, the length is that
|
||||
of the entire subpattern.
|
||||
processed in the pattern string. When the callout is at the end of the pattern,
|
||||
the length is zero. When the callout precedes an opening parenthesis, the
|
||||
length includes meta characters that follow the parenthesis. For example, in a
|
||||
callout before an assertion such as (?=ab) the length is 3. For an
|
||||
alternation bar or a closing parenthesis, the length is one, unless a closing
|
||||
parenthesis is followed by a quantifier, in which case its length is included.
|
||||
(This changed in release 10.23. In earlier releases, before an opening
|
||||
parenthesis the length was that of the entire subpattern, and before an
|
||||
alternation bar or a closing parenthesis the length was zero.)
|
||||
</P>
|
||||
<P>
|
||||
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
|
||||
@ -399,9 +413,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 March 2015
|
||||
Last updated: 29 September 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -107,7 +107,7 @@ processed as anchored at the point where they are tested.
|
||||
one that is backtracked onto acts. For example, in the pattern
|
||||
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
|
||||
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
|
||||
same as PCRE2, but there are examples where it differs.
|
||||
same as PCRE2, but there are cases where it differs.
|
||||
</P>
|
||||
<P>
|
||||
11. Most backtracking verbs in assertions have their normal actions. They are
|
||||
@ -123,7 +123,7 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
|
||||
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
|
||||
names is not as general as Perl's. This is a consequence of the fact that PCRE2
|
||||
works internally just with numbers, using an external table to translate
|
||||
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b)B),
|
||||
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)),
|
||||
where the two capturing parentheses have the same number but different names,
|
||||
is not supported, and causes an error at compile time. If it were allowed, it
|
||||
would not be possible to distinguish which parentheses matched, because both
|
||||
@ -131,10 +131,11 @@ names map to capturing subpattern number 1. To avoid this confusing situation,
|
||||
an error is given at compile time.
|
||||
</P>
|
||||
<P>
|
||||
14. Perl recognizes comments in some places that PCRE2 does not, for example,
|
||||
between the ( and ? at the start of a subpattern. If the /x modifier is set,
|
||||
Perl allows white space between ( and ? (though current Perls warn that this is
|
||||
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
|
||||
14. Perl used to recognize comments in some places that PCRE2 does not, for
|
||||
example, between the ( and ? at the start of a subpattern. If the /x modifier
|
||||
is set, Perl allowed white space between ( and ? though the latest Perls give
|
||||
an error (for a while it was just deprecated). There may still be some cases
|
||||
where Perl behaves differently.
|
||||
</P>
|
||||
<P>
|
||||
15. Perl, when in warning mode, gives warnings for character classes such as
|
||||
@ -161,42 +162,47 @@ each alternative branch of a lookbehind assertion can match a different length
|
||||
of string. Perl requires them all to have the same length.
|
||||
<br>
|
||||
<br>
|
||||
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
(b) From PCRE2 10.23, back references to groups of fixed length are supported
|
||||
in lookbehinds, provided that there is no possibility of referencing a
|
||||
non-unique number or name. Perl does not support backreferences in lookbehinds.
|
||||
<br>
|
||||
<br>
|
||||
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
meta-character matches only at the very end of the string.
|
||||
<br>
|
||||
<br>
|
||||
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
can be made to issue a warning.)
|
||||
<br>
|
||||
<br>
|
||||
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
inverted, that is, by default they are not greedy, but if followed by a
|
||||
question mark they are.
|
||||
<br>
|
||||
<br>
|
||||
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
only at the first matching position in the subject string.
|
||||
<br>
|
||||
<br>
|
||||
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
|
||||
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
|
||||
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
|
||||
<br>
|
||||
<br>
|
||||
(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
(h) The \R escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
by the PCRE2_BSR_ANYCRLF option.
|
||||
<br>
|
||||
<br>
|
||||
(h) The callout facility is PCRE2-specific.
|
||||
(i) The callout facility is PCRE2-specific.
|
||||
<br>
|
||||
<br>
|
||||
(i) The partial matching facility is PCRE2-specific.
|
||||
(j) The partial matching facility is PCRE2-specific.
|
||||
<br>
|
||||
<br>
|
||||
(j) The alternative matching function (<b>pcre2_dfa_match()</b> matches in a
|
||||
(k) The alternative matching function (<b>pcre2_dfa_match()</b>) matches in a
|
||||
different way and is not Perl-compatible.
|
||||
<br>
|
||||
<br>
|
||||
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
|
||||
(l) PCRE2 recognizes some special sequences such as (*CR) at the start of
|
||||
a pattern that set overall options that cannot be changed within the pattern.
|
||||
</P>
|
||||
<br><b>
|
||||
@ -214,9 +220,9 @@ Cambridge, England.
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 15 March 2015
|
||||
Last updated: 18 October 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -20,28 +20,31 @@ please consult the man page, in case the conversion went wrong.
|
||||
*************************************************/
|
||||
|
||||
/* This is a demonstration program to illustrate a straightforward way of
|
||||
calling the PCRE2 regular expression library from a C program. See the
|
||||
using the PCRE2 regular expression library from a C program. See the
|
||||
pcre2sample documentation for a short discussion ("man pcre2sample" if you have
|
||||
the PCRE2 man pages installed). PCRE2 is a revised API for the library, and is
|
||||
incompatible with the original PCRE API.
|
||||
|
||||
There are actually three libraries, each supporting a different code unit
|
||||
width. This demonstration program uses the 8-bit library.
|
||||
width. This demonstration program uses the 8-bit library. The default is to
|
||||
process each code unit as a separate character, but if the pattern begins with
|
||||
"(*UTF)", both it and the subject are treated as UTF-8 strings, where
|
||||
characters may occupy multiple code units.
|
||||
|
||||
In Unix-like environments, if PCRE2 is installed in your standard system
|
||||
libraries, you should be able to compile this program using this command:
|
||||
|
||||
gcc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
|
||||
cc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
|
||||
|
||||
If PCRE2 is not installed in a standard place, it is likely to be installed
|
||||
with support for the pkg-config mechanism. If you have pkg-config, you can
|
||||
compile this program using this command:
|
||||
|
||||
gcc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
|
||||
cc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
|
||||
|
||||
If you do not have pkg-config, you may have to use this:
|
||||
If you do not have pkg-config, you may have to use something like this:
|
||||
|
||||
gcc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \
|
||||
cc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \
|
||||
-R/usr/local/lib -lpcre2-8 -o pcre2demo
|
||||
|
||||
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
|
||||
@ -56,9 +59,14 @@ the following line. */
|
||||
|
||||
/* #define PCRE2_STATIC */
|
||||
|
||||
/* This macro must be defined before including pcre2.h. For a program that uses
|
||||
only one code unit width, it makes it possible to use generic function names
|
||||
such as pcre2_compile(). */
|
||||
/* The PCRE2_CODE_UNIT_WIDTH macro must be defined before including pcre2.h.
|
||||
For a program that uses only one code unit width, setting it to 8, 16, or 32
|
||||
makes it possible to use generic function names such as pcre2_compile(). Note
|
||||
that just changing 8 to 16 (for example) is not sufficient to convert this
|
||||
program to process 16-bit characters. Even in a fully 16-bit environment, where
|
||||
string-handling functions such as strcmp() and printf() work with 16-bit
|
||||
characters, the code for handling the table of named substrings will still need
|
||||
to be modified. */
|
||||
|
||||
#define PCRE2_CODE_UNIT_WIDTH 8
|
||||
|
||||
@ -79,19 +87,19 @@ int main(int argc, char **argv)
|
||||
{
|
||||
pcre2_code *re;
|
||||
PCRE2_SPTR pattern; /* PCRE2_SPTR is a pointer to unsigned code units of */
|
||||
PCRE2_SPTR subject; /* the appropriate width (8, 16, or 32 bits). */
|
||||
PCRE2_SPTR subject; /* the appropriate width (in this case, 8 bits). */
|
||||
PCRE2_SPTR name_table;
|
||||
|
||||
int crlf_is_newline;
|
||||
int errornumber;
|
||||
int find_all;
|
||||
int i;
|
||||
int namecount;
|
||||
int name_entry_size;
|
||||
int rc;
|
||||
int utf8;
|
||||
|
||||
uint32_t option_bits;
|
||||
uint32_t namecount;
|
||||
uint32_t name_entry_size;
|
||||
uint32_t newline;
|
||||
|
||||
PCRE2_SIZE erroroffset;
|
||||
@ -106,15 +114,19 @@ pcre2_match_data *match_data;
|
||||
* First, sort out the command line. There is only one possible option at *
|
||||
* the moment, "-g" to request repeated matching to find all occurrences, *
|
||||
* like Perl's /g option. We set the variable find_all to a non-zero value *
|
||||
* if the -g option is present. Apart from that, there must be exactly two *
|
||||
* arguments. *
|
||||
* if the -g option is present. *
|
||||
**************************************************************************/
|
||||
|
||||
find_all = 0;
|
||||
for (i = 1; i < argc; i++)
|
||||
{
|
||||
if (strcmp(argv[i], "-g") == 0) find_all = 1;
|
||||
else break;
|
||||
else if (argv[i][0] == '-')
|
||||
{
|
||||
printf("Unrecognised option %s\n", argv[i]);
|
||||
return 1;
|
||||
}
|
||||
else break;
|
||||
}
|
||||
|
||||
/* After the options, we require exactly two arguments, which are the pattern,
|
||||
@ -122,7 +134,7 @@ and the subject string. */
|
||||
|
||||
if (argc - i != 2)
|
||||
{
|
||||
printf("Two arguments required: a regex and a subject string\n");
|
||||
printf("Exactly two arguments required: a regex and a subject string\n");
|
||||
return 1;
|
||||
}
|
||||
|
||||
@ -201,7 +213,7 @@ if (rc < 0)
|
||||
stored. */
|
||||
|
||||
ovector = pcre2_get_ovector_pointer(match_data);
|
||||
printf("\nMatch succeeded at offset %d\n", (int)ovector[0]);
|
||||
printf("Match succeeded at offset %d\n", (int)ovector[0]);
|
||||
|
||||
|
||||
/*************************************************************************
|
||||
@ -242,7 +254,7 @@ we have to extract the count of named parentheses from the pattern. */
|
||||
PCRE2_INFO_NAMECOUNT, /* get the number of named substrings */
|
||||
&namecount); /* where to put the answer */
|
||||
|
||||
if (namecount <= 0) printf("No named substrings\n"); else
|
||||
if (namecount == 0) printf("No named substrings\n"); else
|
||||
{
|
||||
PCRE2_SPTR tabptr;
|
||||
printf("Named substrings\n");
|
||||
@ -330,8 +342,8 @@ crlf_is_newline = newline == PCRE2_NEWLINE_ANY ||
|
||||
|
||||
for (;;)
|
||||
{
|
||||
uint32_t options = 0; /* Normally no options */
|
||||
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
|
||||
uint32_t options = 0; /* Normally no options */
|
||||
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
|
||||
|
||||
/* If the previous match was for an empty string, we are finished if we are
|
||||
at the end of the subject. Otherwise, arrange to run another match at the
|
||||
@ -371,7 +383,7 @@ for (;;)
|
||||
{
|
||||
if (options == 0) break; /* All matches found */
|
||||
ovector[1] = start_offset + 1; /* Advance one code unit */
|
||||
if (crlf_is_newline && /* If CRLF is newline & */
|
||||
if (crlf_is_newline && /* If CRLF is a newline & */
|
||||
start_offset < subject_length - 1 && /* we are at CRLF, */
|
||||
subject[start_offset] == '\r' &&
|
||||
subject[start_offset + 1] == '\n')
|
||||
@ -417,7 +429,7 @@ for (;;)
|
||||
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
|
||||
}
|
||||
|
||||
if (namecount <= 0) printf("No named substrings\n"); else
|
||||
if (namecount == 0) printf("No named substrings\n"); else
|
||||
{
|
||||
PCRE2_SPTR tabptr = name_table;
|
||||
printf("Named substrings\n");
|
||||
|
@ -22,11 +22,12 @@ please consult the man page, in case the conversion went wrong.
|
||||
<li><a name="TOC7" href="#SEC7">NEWLINES</a>
|
||||
<li><a name="TOC8" href="#SEC8">OPTIONS COMPATIBILITY</a>
|
||||
<li><a name="TOC9" href="#SEC9">OPTIONS WITH DATA</a>
|
||||
<li><a name="TOC10" href="#SEC10">MATCHING ERRORS</a>
|
||||
<li><a name="TOC11" href="#SEC11">DIAGNOSTICS</a>
|
||||
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
|
||||
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
|
||||
<li><a name="TOC14" href="#SEC14">REVISION</a>
|
||||
<li><a name="TOC10" href="#SEC10">CALLING EXTERNAL SCRIPTS</a>
|
||||
<li><a name="TOC11" href="#SEC11">MATCHING ERRORS</a>
|
||||
<li><a name="TOC12" href="#SEC12">DIAGNOSTICS</a>
|
||||
<li><a name="TOC13" href="#SEC13">SEE ALSO</a>
|
||||
<li><a name="TOC14" href="#SEC14">AUTHOR</a>
|
||||
<li><a name="TOC15" href="#SEC15">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||
<P>
|
||||
@ -79,11 +80,19 @@ span line boundaries. What defines a line boundary is controlled by the
|
||||
</P>
|
||||
<P>
|
||||
The amount of memory used for buffering files that are being scanned is
|
||||
controlled by a parameter that can be set by the <b>--buffer-size</b> option.
|
||||
The default value for this parameter is specified when <b>pcre2grep</b> is
|
||||
built, with the default default being 20K. A block of memory three times this
|
||||
size is used (to allow for buffering "before" and "after" lines). An error
|
||||
occurs if a line overflows the buffer.
|
||||
controlled by parameters that can be set by the <b>--buffer-size</b> and
|
||||
<b>--max-buffer-size</b> options. The first of these sets the size of buffer
|
||||
that is obtained at the start of processing. If an input file contains very
|
||||
long lines, a larger buffer may be needed; this is handled by automatically
|
||||
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
|
||||
default values for these parameters are specified when <b>pcre2grep</b> is
|
||||
built; unless changed at build time, they are 20K and 1M respectively. An error occurs
|
||||
if a line is too long and the buffer can no longer be expanded.
|
||||
</P>
|
||||
<P>
|
||||
The block of memory that is actually used is three times the "buffer size", to
|
||||
allow for buffering "before" and "after" lines. If the buffer size is too
|
||||
small, fewer than requested "before" and "after" lines may be output.
|
||||
</P>
|
||||
<P>
|
||||
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
|
||||
@ -154,12 +163,13 @@ processing of patterns and file names that start with hyphens.
|
||||
</P>
|
||||
<P>
|
||||
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
|
||||
Output <i>number</i> lines of context after each matching line. If file names
|
||||
and/or line numbers are being output, a hyphen separator is used instead of a
|
||||
colon for the context lines. A line containing "--" is output between each
|
||||
group of lines, unless they are in fact contiguous in the input file. The value
|
||||
of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
|
||||
guarantees to have up to 8K of following text available for context output.
|
||||
Output up to <i>number</i> lines of context after each matching line. Fewer
|
||||
lines are output if the next match or the end of the file is reached, or if the
|
||||
processing buffer size has been set too small. If file names and/or line
|
||||
numbers are being output, a hyphen separator is used instead of a colon for the
|
||||
context lines. A line containing "--" is output between each group of lines,
|
||||
unless they are in fact contiguous in the input file. The value of <i>number</i>
|
||||
is expected to be relatively small. When <b>-c</b> is used, <b>-A</b> is ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>-a</b>, <b>--text</b>
|
||||
@ -168,12 +178,14 @@ Treat binary files as text. This is equivalent to
|
||||
</P>
|
||||
<P>
|
||||
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
|
||||
Output <i>number</i> lines of context before each matching line. If file names
|
||||
and/or line numbers are being output, a hyphen separator is used instead of a
|
||||
colon for the context lines. A line containing "--" is output between each
|
||||
group of lines, unless they are in fact contiguous in the input file. The value
|
||||
of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
|
||||
guarantees to have up to 8K of preceding text available for context output.
|
||||
Output up to <i>number</i> lines of context before each matching line. Fewer
|
||||
lines are output if the previous match or the start of the file is within
|
||||
<i>number</i> lines, or if the processing buffer size has been set too small. If
|
||||
file names and/or line numbers are being output, a hyphen separator is used
|
||||
instead of a colon for the context lines. A line containing "--" is output
|
||||
between each group of lines, unless they are in fact contiguous in the input
|
||||
file. The value of <i>number</i> is expected to be relatively small. When
|
||||
<b>-c</b> is used, <b>-B</b> is ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>--binary-files=</b><i>word</i>
|
||||
@ -190,8 +202,9 @@ return code.
|
||||
</P>
|
||||
<P>
|
||||
<b>--buffer-size=</b><i>number</i>
|
||||
Set the parameter that controls how much memory is used for buffering files
|
||||
that are being scanned.
|
||||
Set the parameter that controls how much memory is obtained at the start of
|
||||
processing for buffering files that are being scanned. See also
|
||||
<b>--max-buffer-size</b> below.
|
||||
</P>
|
||||
<P>
|
||||
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
|
||||
@ -201,14 +214,16 @@ This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
|
||||
<P>
|
||||
<b>-c</b>, <b>--count</b>
|
||||
Do not output lines from the files that are being scanned; instead output the
|
||||
number of matches (or non-matches if <b>-v</b> is used) that would otherwise
|
||||
have caused lines to be shown. By default, this count is the same as the number
|
||||
of suppressed lines, but if the <b>-M</b> (multiline) option is used (without
|
||||
<b>-v</b>), there may be more suppressed lines than the number of matches.
|
||||
number of lines that would have been shown, either because they matched, or, if
|
||||
<b>-v</b> is set, because they failed to match. By default, this count is
|
||||
exactly the same as the number of lines that would have been output, but if the
|
||||
<b>-M</b> (multiline) option is used (without <b>-v</b>), there may be more
|
||||
suppressed lines than the count (that is, the number of matches).
|
||||
<br>
|
||||
<br>
|
||||
If no lines are selected, the number zero is output. If several files are
|
||||
being scanned, a count is output for each of them. However, if the
|
||||
being scanned, a count is output for each of them and the <b>-t</b> option can
|
||||
be used to cause a total to be output at the end. However, if the
|
||||
<b>--files-with-matches</b> option is also used, only those files whose counts
|
||||
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
|
||||
<b>-B</b>, and <b>-C</b> options are ignored.
|
||||
@ -230,12 +245,23 @@ because <b>pcre2grep</b> has to search for all possible matches in a line, not
|
||||
just one, in order to colour them all.
|
||||
<br>
|
||||
<br>
|
||||
The colour that is used can be specified by setting the environment variable
|
||||
PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
|
||||
string of two numbers, separated by a semicolon. They are copied directly into
|
||||
the control string for setting colour on a terminal, so it is your
|
||||
responsibility to ensure that they make sense. If neither of the environment
|
||||
variables is set, the default is "1;31", which gives red.
|
||||
The colour that is used can be specified by setting one of the environment
|
||||
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
|
||||
PCREGREP_COLOR, which are checked in that order. If none of these are set,
|
||||
<b>pcre2grep</b> looks for GREP_COLORS or GREP_COLOR (in that order). The value
|
||||
of the variable should be a string of two numbers, separated by a semicolon,
|
||||
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
|
||||
followed by two semicolon-separated colours, terminated by the end of the
|
||||
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
|
||||
ignored, and GREP_COLOR is checked.
|
||||
<br>
|
||||
<br>
|
||||
If the string obtained from one of the above variables contains any characters
|
||||
other than semicolon or digits, the setting is ignored and the default colour
|
||||
is used. The string is copied directly into the control string for setting
|
||||
colour on a terminal, so it is your responsibility to ensure that the values
|
||||
make sense. If no relevant environment variable is set, the default is "1;31",
|
||||
which gives red.
|
||||
</P>
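<P>
The lookup order for the colour setting can be summarised in a small Python
sketch. The function name <b>resolve_colour</b> is hypothetical, and the sketch
assumes that the first variable found decides the result, so an invalid value
falls back to the default rather than to the next variable.
</P>

```python
DEFAULT_COLOUR = "1;31"  # documented default, which gives red

# Checked in this order before GREP_COLORS and GREP_COLOR.
PRIMARY_VARIABLES = (
    "PCRE2GREP_COLOUR",
    "PCRE2GREP_COLOR",
    "PCREGREP_COLOUR",
    "PCREGREP_COLOR",
)


def resolve_colour(env):
    """Return the colour string for a mapping of environment variables."""
    value = next((env[name] for name in PRIMARY_VARIABLES if name in env), None)
    if value is None:
        grep_colors = env.get("GREP_COLORS")
        if grep_colors is not None and grep_colors[:3] in ("ms=", "mt="):
            # The colours end at the end of the string or at a colon.
            value = grep_colors[3:].split(":", 1)[0]
        else:
            # GREP_COLORS is unset or ignored, so GREP_COLOR is checked.
            value = env.get("GREP_COLOR")
    # Anything other than digits and semicolons causes the setting to be ignored.
    if value is None or any(c not in "0123456789;" for c in value):
        return DEFAULT_COLOUR
    return value


if __name__ == "__main__":
    print(resolve_colour({"GREP_COLORS": "ms=01;34:mc=01;31"}))
```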
|
||||
<P>
|
||||
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
|
||||
@ -320,18 +346,18 @@ files; it does not apply to patterns specified by any of the <b>--include</b> or
|
||||
</P>
|
||||
<P>
|
||||
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
|
||||
Read patterns from the file, one per line, and match them against
|
||||
each line of input. What constitutes a newline when reading the file is the
|
||||
operating system's default. The <b>--newline</b> option has no effect on this
|
||||
option. Trailing white space is removed from each line, and blank lines are
|
||||
ignored. An empty file contains no patterns and therefore matches nothing. See
|
||||
also the comments about multiple patterns versus a single pattern with
|
||||
alternatives in the description of <b>-e</b> above.
|
||||
Read patterns from the file, one per line, and match them against each line of
|
||||
input. What constitutes a newline when reading the file is the operating
|
||||
system's default. The <b>--newline</b> option has no effect on this option.
|
||||
Trailing white space is removed from each line, and blank lines are ignored. An
|
||||
empty file contains no patterns and therefore matches nothing. See also the
|
||||
comments about multiple patterns versus a single pattern with alternatives in
|
||||
the description of <b>-e</b> above.
|
||||
<br>
|
||||
<br>
|
||||
If this option is given more than once, all the specified files are
|
||||
read. A data line is output if any of the patterns match it. A file name can
|
||||
be given as "-" to refer to the standard input. When <b>-f</b> is used, patterns
|
||||
If this option is given more than once, all the specified files are read. A
|
||||
data line is output if any of the patterns match it. A file name can be given
|
||||
as "-" to refer to the standard input. When <b>-f</b> is used, patterns
|
||||
specified on the command line using <b>-e</b> may also be present; they are
|
||||
tested before the file's patterns. However, no other pattern is taken from the
|
||||
command line; all arguments are treated as the names of paths to be searched.
|
||||
@ -501,19 +527,27 @@ There are no short forms for these options. The default settings are specified
|
||||
when the PCRE2 library is compiled, with the default default being 10 million.
|
||||
</P>
|
||||
<P>
|
||||
<b>--max-buffer-size=</b><i>number</i>
|
||||
This limits the expansion of the processing buffer, whose initial size can be
|
||||
set by <b>--buffer-size</b>. The maximum buffer size is silently forced to be no
|
||||
smaller than the starting buffer size.
|
||||
</P>
|
||||
<P>
|
||||
<b>-M</b>, <b>--multiline</b>
|
||||
Allow patterns to match more than one line. When this option is given, patterns
|
||||
may usefully contain literal newline characters and internal occurrences of ^
|
||||
and $ characters. The output for a successful match may consist of more than
|
||||
one line. The first is the line in which the match started, and the last is the
|
||||
line in which the match ended. If the matched string ends with a newline
|
||||
sequence the output ends at the end of that line.
|
||||
Allow patterns to match more than one line. When this option is set, the PCRE2
|
||||
library is called in "multiline" mode. This allows a matched string to extend
|
||||
past the end of a line and continue on one or more subsequent lines. Patterns
|
||||
used with <b>-M</b> may usefully contain literal newline characters and internal
|
||||
occurrences of ^ and $ characters. The output for a successful match may
|
||||
consist of more than one line. The first line is the line in which the match
|
||||
started, and the last line is the line in which the match ended. If the matched
|
||||
string ends with a newline sequence, the output ends at the end of that line.
|
||||
If <b>-v</b> is set, none of the lines in a multi-line match are output. Once a
|
||||
match has been handled, scanning restarts at the beginning of the line after
|
||||
the one in which the match ended.
|
||||
<br>
|
||||
<br>
|
||||
When this option is set, the PCRE2 library is called in "multiline" mode.
|
||||
However, <b>pcre2grep</b> still processes the input line by line. The difference
|
||||
is that a matched string may extend past the end of a line and continue on
|
||||
one or more subsequent lines. The newline sequence must be matched as part of
|
||||
The newline sequence that separates multiple lines must be matched as part of
|
||||
the pattern. For example, to find the phrase "regular expression" in a file
|
||||
where "regular" might be at the end of a line and "expression" at the start of
|
||||
the next line, you could use this command:
|
||||
@ -526,11 +560,8 @@ well as possibly handling a two-character newline sequence.
|
||||
<br>
|
||||
<br>
|
||||
There is a limit to the number of lines that can be matched, imposed by the way
|
||||
that <b>pcre2grep</b> buffers the input file as it scans it. However,
|
||||
<b>pcre2grep</b> ensures that at least 8K characters or the rest of the file
|
||||
(whichever is the shorter) are available for forward matching, and similarly
|
||||
the previous 8K characters (or all the previous characters, if fewer than 8K)
|
||||
are guaranteed to be available for lookbehind assertions. The <b>-M</b> option
|
||||
that <b>pcre2grep</b> buffers the input file as it scans it. With a sufficiently
|
||||
large processing buffer, this should not be a problem, but the <b>-M</b> option
|
||||
does not work when input is read line by line (see <b>--line-buffered</b>).
|
||||
</P>
|
||||
<P>
|
||||
@ -578,12 +609,13 @@ It should never be needed in normal use.
|
||||
Show only the part of the line that matched a pattern instead of the whole
|
||||
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
|
||||
<b>-C</b> options are ignored. If there is more than one match in a line, each
|
||||
of them is shown separately. If <b>-o</b> is combined with <b>-v</b> (invert the
|
||||
sense of the match to find non-matching lines), no output is generated, but the
|
||||
return code is set appropriately. If the matched portion of the line is empty,
|
||||
nothing is output unless the file name or line number are being printed, in
|
||||
which case they are shown on an otherwise empty line. This option is mutually
|
||||
exclusive with <b>--file-offsets</b> and <b>--line-offsets</b>.
|
||||
of them is shown separately, on a separate line of output. If <b>-o</b> is
|
||||
combined with <b>-v</b> (invert the sense of the match to find non-matching
|
||||
lines), no output is generated, but the return code is set appropriately. If
|
||||
the matched portion of the line is empty, nothing is output unless the file
|
||||
name or line number are being printed, in which case they are shown on an
|
||||
otherwise empty line. This option is mutually exclusive with
|
||||
<b>--file-offsets</b> and <b>--line-offsets</b>.
|
||||
</P>
|
||||
<P>
|
||||
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
|
||||
@ -597,10 +629,11 @@ capturing parentheses do not exist in the pattern, or were not set in the
|
||||
match, nothing is output unless the file name or line number are being output.
|
||||
<br>
|
||||
<br>
|
||||
If this option is given multiple times, multiple substrings are output, in the
|
||||
order the options are given. For example, -o3 -o1 -o3 causes the substrings
|
||||
matched by capturing parentheses 3 and 1 and then 3 again to be output. By
|
||||
default, there is no separator (but see the next option).
|
||||
If this option is given multiple times, multiple substrings are output for each
|
||||
match, in the order the options are given, and all on one line. For example,
|
||||
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
|
||||
then 3 again to be output. By default, there is no separator (but see the next
|
||||
option).
|
||||
</P>
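<P>
As a rough illustration of repeated <b>-o</b><i>number</i> options, the Python
sketch below uses the standard <b>re</b> module as a stand-in for PCRE2, which
is adequate only for simple patterns. The function name <b>only_matching</b>
and its <b>separator</b> parameter are hypothetical.
</P>

```python
import re


def only_matching(pattern, line, groups, separator=""):
    """Return one output line per match, built from the requested groups.

    `groups` lists capture numbers in the order the -o options were given.
    Groups that do not exist or were not set contribute an empty string.
    """
    output = []
    for match in re.finditer(pattern, line):
        parts = []
        for number in groups:
            if number <= match.re.groups:
                parts.append(match.group(number) or "")
            else:
                parts.append("")
        # All substrings for one match go on a single output line.
        output.append(separator.join(parts))
    return output


if __name__ == "__main__":
    # Models -o3 -o1 -o3 applied to a line containing two matches.
    for text in only_matching(r"(\w)(\w)(\w)", "abc xyz", [3, 1, 3]):
        print(text)
```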
|
||||
<P>
|
||||
<b>--om-separator</b>=<i>text</i>
|
||||
@ -631,6 +664,18 @@ quietly skipped. However, the return code is still 2, even if matches were
|
||||
found in other files.
|
||||
</P>
|
||||
<P>
|
||||
<b>-t</b>, <b>--total-count</b>
|
||||
This option is useful when scanning more than one file. If used on its own,
|
||||
<b>-t</b> suppresses all output except for a grand total number of matching
|
||||
lines (or non-matching lines if <b>-v</b> is used) in all the files. If <b>-t</b>
|
||||
is used with <b>-c</b>, a grand total is output except when the previous output
|
||||
is just one line. In other words, it is not output when just one file's count
|
||||
is listed. If file names are being output, the grand total is preceded by
|
||||
"TOTAL:". Otherwise, it appears as just another number. The <b>-t</b> option is
|
||||
ignored when used with <b>-L</b> (list files without matches), because the grand
|
||||
total would always be zero.
|
||||
</P>
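<P>
The interaction between <b>-c</b> and <b>-t</b> can be sketched in Python as
follows. The function name <b>count_report</b> is hypothetical, and the
"name:count" layout of the per-file lines is an assumption based on the usual
grep convention rather than on the text above.
</P>

```python
def count_report(counts, show_counts, show_total, with_names):
    """Model the lines written for -c (show_counts) and -t (show_total).

    `counts` is a list of (file name, count) pairs, one per scanned file.
    """
    lines = []
    if show_counts:
        for name, count in counts:
            lines.append(f"{name}:{count}" if with_names else str(count))
    # With -c, the total is omitted when only one count line was written.
    if show_total and not (show_counts and len(lines) == 1):
        total = sum(count for _, count in counts)
        # "TOTAL:" appears only when file names are actually being output.
        prefix = "TOTAL:" if with_names and lines else ""
        lines.append(f"{prefix}{total}")
    return lines


if __name__ == "__main__":
    for text in count_report([("a.txt", 2), ("b.txt", 3)], True, True, True):
        print(text)
```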
|
||||
<P>
|
||||
<b>-u</b>, <b>--utf-8</b>
|
||||
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
|
||||
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
|
||||
@ -658,11 +703,12 @@ specified by any of the <b>--include</b> or <b>--exclude</b> options.
|
||||
<P>
|
||||
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
|
||||
Force the patterns to be anchored (each must start matching at the beginning of
|
||||
a line) and in addition, require them to match entire lines. This is equivalent
|
||||
to having ^ and $ characters at the start and end of each alternative top-level
|
||||
branch in every pattern. This option applies only to the patterns that are
|
||||
matched against the contents of files; it does not apply to patterns specified
|
||||
by any of the <b>--include</b> or <b>--exclude</b> options.
|
||||
a line) and in addition, require them to match entire lines. In multiline mode
|
||||
the match may be more than one line. This is equivalent to having \A and \Z
|
||||
characters at the start and end of each alternative top-level branch in every
|
||||
pattern. This option applies only to the patterns that are matched against the
|
||||
contents of files; it does not apply to patterns specified by any of the
|
||||
<b>--include</b> or <b>--exclude</b> options.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
|
||||
<P>
|
||||
@ -735,7 +781,57 @@ The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
|
||||
options does have data, it must be given in the first form, using an equals
|
||||
character. Otherwise <b>pcre2grep</b> will assume that it has no data.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">MATCHING ERRORS</a><br>
|
||||
<br><a name="SEC10" href="#TOC1">CALLING EXTERNAL SCRIPTS</a><br>
|
||||
<P>
|
||||
<b>pcre2grep</b> has, by default, support for calling external programs or
|
||||
scripts during matching by making use of PCRE2's callout facility. However,
|
||||
this support can be disabled when <b>pcre2grep</b> is built. You can find out
|
||||
whether your binary has support for callouts by running it with the <b>--help</b>
|
||||
option. If the support is not enabled, all callouts in patterns are ignored by
|
||||
<b>pcre2grep</b>.
|
||||
</P>
|
||||
<P>
|
||||
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argument is
|
||||
either a number or a quoted string (see the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>.
|
||||
String arguments are parsed as a list of substrings separated by pipe (vertical
|
||||
bar) characters. The first substring must be an executable name, with the
|
||||
following substrings specifying arguments:
|
||||
<pre>
|
||||
executable_name|arg1|arg2|...
|
||||
</pre>
|
||||
Any substring (including the executable name) may contain escape sequences
|
||||
started by a dollar character: $<digits> or ${<digits>} is replaced by the
|
||||
captured substring of the given decimal number, which must be greater than
|
||||
zero. If the number is greater than the number of capturing substrings, or if
|
||||
the capture is unset, the replacement is empty.
|
||||
</P>
|
||||
<P>
|
||||
Any other character is substituted by itself. In particular, $$ is replaced by
|
||||
a single dollar and $| is replaced by a pipe character. Here is an example:
|
||||
<pre>
|
||||
echo -e "abcde\n12345" | pcre2grep \
|
||||
'(?x)(.)(..(.))
|
||||
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
|
||||
|
||||
Output:
|
||||
|
||||
Arg1: [a] [bcd] [d] Arg2: |a| ()
|
||||
abcde
|
||||
Arg1: [1] [234] [4] Arg2: |1| ()
|
||||
12345
|
||||
</pre>
|
||||
The parameters for the <b>execv()</b> system call that is used to run the
|
||||
program or script are zero-terminated strings. This means that binary zero
|
||||
characters in the callout argument will cause premature termination of their
|
||||
substrings, and therefore should not be present. Any syntax errors in the
|
||||
string (for example, a dollar not followed by another character) cause the
|
||||
callout to be ignored. If running the program fails for any reason (including
|
||||
the non-existence of the executable), a local matching failure occurs and the
|
||||
matcher backtracks in the normal way.
|
||||
</P>
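<P>
The substitution rules for callout strings can be illustrated with a Python
sketch. It is a model of the rules as described above, not the code used by
<b>pcre2grep</b>. The function name <b>expand_callout</b> is hypothetical, and
treating $0, an unterminated ${, or an empty ${} as syntax errors is an
assumption.
</P>

```python
DIGITS = "0123456789"


def expand_callout(argument, captures):
    """Split a callout string into [executable, arg1, ...] with substitutions.

    `captures[0]` holds capture group 1; None marks an unset group.
    Returns None on a syntax error, in which case the callout is ignored.
    """
    parts, current, i = [], [], 0
    while i < len(argument):
        ch = argument[i]
        if ch == "|":
            # An unescaped pipe ends the current substring.
            parts.append("".join(current))
            current = []
            i += 1
        elif ch == "$":
            i += 1
            if i >= len(argument):
                return None  # dollar not followed by another character
            if argument[i] == "{":
                end = argument.find("}", i)
                digits = argument[i + 1:end] if end != -1 else ""
                if not digits or any(c not in DIGITS for c in digits):
                    return None
                i = end + 1
            elif argument[i] in DIGITS:
                j = i
                while j < len(argument) and argument[j] in DIGITS:
                    j += 1
                digits, i = argument[i:j], j
            else:
                # Any other character stands for itself, so $$ and $| work.
                current.append(argument[i])
                i += 1
                continue
            number = int(digits)
            if number == 0:
                return None  # assumption: zero is rejected
            value = captures[number - 1] if number <= len(captures) else None
            current.append(value or "")
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    text = "/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)"
    print(expand_callout(text, ["a", "bcd", "d", None]))
```

With the captures from the first input line of the example above, the sketch
yields the executable name followed by the two arguments shown in the output.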
|
||||
<br><a name="SEC11" href="#TOC1">MATCHING ERRORS</a><br>
|
||||
<P>
|
||||
It is possible to supply a regular expression that takes a very long time to
|
||||
fail to match certain lines. Such patterns normally involve nested indefinite
|
||||
@ -751,7 +847,7 @@ overall resource limit; there is a second option called <b>--recursion-limit</b>
|
||||
that sets a limit on the amount of memory (usually stack) that is used (see the
|
||||
discussion of these options above).
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">DIAGNOSTICS</a><br>
|
||||
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
|
||||
<P>
|
||||
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
|
||||
for syntax errors, overlong lines, non-existent or inaccessible files (even if
|
||||
@ -759,11 +855,11 @@ matches were found in other files) or too many matching errors. Using the
|
||||
<b>-s</b> option to suppress error messages about inaccessible files does not
|
||||
affect the return code.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC13" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3).
|
||||
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3), <b>pcre2callout</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC14" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
@ -772,11 +868,11 @@ University Computing Service
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 03 January 2015
|
||||
Last updated: 31 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -86,6 +86,13 @@ results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
|
||||
or a negative error code.
|
||||
</P>
|
||||
<P>
|
||||
There is a limit to the size of pattern that JIT supports, imposed by the size
|
||||
of machine stack that it uses. The exact rules are not documented because they
|
||||
may change at any time, in particular, when new optimizations are introduced.
|
||||
If a pattern is too big, a call to <b>pcre2_jit_compile()</b> returns
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
||||
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
||||
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
|
||||
@ -145,6 +152,10 @@ PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
|
||||
PCRE2_ANCHORED option is not supported at match time.
|
||||
</P>
|
||||
<P>
|
||||
If the PCRE2_NO_JIT option is passed to <b>pcre2_match()</b> it disables the
|
||||
use of JIT, forcing matching by the interpreter code.
|
||||
</P>
|
||||
<P>
|
||||
The only unsupported pattern items are \C (match a single data unit) when
|
||||
running in a UTF mode, and a callout immediately before an assertion condition
|
||||
in a conditional group.
|
||||
@ -224,8 +235,14 @@ whether a match operation was executed by JIT or by the interpreter.
|
||||
</P>
|
||||
<P>
|
||||
You may safely use the same JIT stack for more than one pattern (either by
|
||||
assigning directly or by callback), as long as the patterns are all matched
|
||||
sequentially in the same thread. In a multithread application, if you do not
|
||||
assigning directly or by callback), as long as the patterns are matched
|
||||
sequentially in the same thread. Currently, the only way to set up
|
||||
non-sequential matches in one thread is to use callouts: if a callout function
|
||||
starts another match, that match must use a different JIT stack to the one used
|
||||
for currently suspended match(es).
|
||||
</P>
|
||||
<P>
|
||||
In a multithread application, if you do not
|
||||
specify a JIT stack, or if you assign or pass back NULL from a callback, that
|
||||
is thread-safe, because each thread has its own machine stack. However, if you
|
||||
assign or pass back a non-NULL JIT stack, this must be a different stack for
|
||||
@ -390,7 +407,7 @@ The fast path function is called <b>pcre2_jit_match()</b>, and it takes exactly
|
||||
the same arguments as <b>pcre2_match()</b>. The return values are also the same,
|
||||
plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
|
||||
requested that was not compiled. Unsupported option bits (for example,
|
||||
PCRE2_ANCHORED) are ignored.
|
||||
PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
|
||||
</P>
|
||||
<P>
|
||||
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
|
||||
@ -419,9 +436,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 27 November 2014
|
||||
Last updated: 05 June 2016
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -32,6 +32,11 @@ However, the speed of execution is slower. In the 32-bit library, the internal
|
||||
linkage size is always 4.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a source pattern string is essentially unlimited; it is
|
||||
the largest number a PCRE2_SIZE variable can hold. However, the program that
|
||||
calls <b>pcre2_compile()</b> can specify a smaller limit.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length (in code units) of a subject string is one less than the
|
||||
largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned
|
||||
integer type, usually defined as size_t. Its maximum value (that is
|
||||
@ -50,17 +55,16 @@ documentation.
|
||||
All values in repeating quantifiers must be less than 65536.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a lookbehind assertion is 65535 characters.
|
||||
</P>
|
||||
<P>
|
||||
There is no limit to the number of parenthesized subpatterns, but there can be
|
||||
no more than 65535 capturing subpatterns. There is, however, a limit to the
|
||||
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
|
||||
order to limit the amount of system stack used at compile time. The limit can
|
||||
be specified when PCRE2 is built; the default is 250.
|
||||
</P>
|
||||
<P>
|
||||
There is a limit to the number of forward references to subsequent subpatterns
|
||||
of around 200,000. Repeated forward references with fixed upper limits, for
|
||||
example, (?2){0,100} when subpattern number 2 is to the right, are included in
|
||||
the count. There is no limit to the number of backward references.
|
||||
order to limit the amount of system stack used at compile time. The default
|
||||
limit can be specified when PCRE2 is built; the default default is 250. An
|
||||
application can change this limit by calling pcre2_set_parens_nest_limit() to
|
||||
set the limit in a compile context.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of name for a named subpattern is 32 code units, and the
|
||||
@ -68,7 +72,12 @@ maximum number of named subpatterns is 10000.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
|
||||
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
|
||||
is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
|
||||
32-bit libraries.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a string argument to a callout is the largest number a
|
||||
32-bit unsigned integer can hold.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
@ -85,9 +94,9 @@ Cambridge, England.
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 25 November 2014
|
||||
Last updated: 26 October 2016
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -190,6 +190,12 @@ be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
</P>
|
||||
<P>
|
||||
The match limit is used (but in a different way) when JIT is being used, but it
|
||||
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
||||
However, the recursion limit is relevant for DFA matching, which does use some
|
||||
function recursion, in particular, for recursions within the pattern.
|
||||
<a name="newlines"></a></P>
|
||||
<br><b>
|
||||
Newline conventions
|
||||
@ -379,32 +385,31 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
|
||||
code unit following \c has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||
modes.
|
||||
compile-time error occurs.
|
||||
</P>
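<P>
The arithmetic behind \c in an ASCII environment can be checked with a short
Python sketch. The function name <b>control_escape</b> is hypothetical, and the
exception stands in for the compile-time error.
</P>

```python
def control_escape(ch):
    """Return the code value produced by \\c followed by `ch` (ASCII rules)."""
    code = ord(ch)
    if code < 32 or code > 126:
        # Stand-in for the compile-time error described above.
        raise ValueError("invalid character after \\c")
    if "a" <= ch <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(ch.upper())
    # Invert bit 6 (hex 40) of the character.
    return code ^ 0x40


if __name__ == "__main__":
    for ch in "AZz{;?":
        print(ch, hex(control_escape(ch)))
```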
|
||||
<P>
|
||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \@ encodes
|
||||
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||
to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||
\? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
other character provokes a compile-time error. The sequence \c@ encodes
|
||||
character code 0; after \c the letters (in either case) encode characters 1-26
|
||||
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
||||
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
</P>
|
||||
<P>
|
||||
Thus, apart from \?, these escapes generate the same character code values as
|
||||
Thus, apart from \c?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \G always generates code value 7, which is BEL in ASCII
|
||||
differ. For example, \cG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \? generate 95; otherwise it generates 255.
|
||||
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
|
||||
</P>
|
||||
<P>
|
||||
After \0 up to two further octal digits are read. If there are fewer than two
|
||||
@ -526,9 +531,9 @@ by code point, as described in the previous section.
|
||||
Absolute and relative back references
|
||||
</b><br>
|
||||
<P>
|
||||
The sequence \g followed by an unsigned or a negative number, optionally
|
||||
enclosed in braces, is an absolute or relative back reference. A named back
|
||||
reference can be coded as \g{name}. Back references are discussed
|
||||
The sequence \g followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative back reference. A named back reference
|
||||
can be coded as \g{name}. Back references are discussed
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#subpattern">parenthesized subpatterns.</a>
|
||||
@ -669,8 +674,8 @@ This is an example of an "atomic group", details of which are given
|
||||
This particular group matches either the two-character sequence CR followed by
|
||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||
line, U+0085). The two-character sequence is treated as a single unit that
|
||||
cannot be split.
|
||||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
</P>
|
||||
<P>
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
@ -736,6 +741,8 @@ Those that are not part of an identified script are lumped together as
|
||||
"Common". The current list of scripts is:
|
||||
</P>
|
||||
<P>
|
||||
Ahom,
|
||||
Anatolian_Hieroglyphs,
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
@ -776,6 +783,7 @@ Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hatran,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
@ -812,12 +820,14 @@ Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Multani,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Hungarian,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
@ -839,6 +849,7 @@ Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
SignWriting,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
@ -1180,6 +1191,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
|
||||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||
</P>
|
||||
<P>
|
||||
When the newline convention (see
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||
preferred, even if the single characters CR and LF are also recognized as
|
||||
newlines. For example, if the newline convention is "any", a multiline mode
|
||||
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
|
||||
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||
start of the string, of course.)
|
||||
</P>
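<P>
The CRLF preference can be modelled in a few lines. This Python sketch is an
illustration only: it recognizes just CR, LF, and CRLF (the "any" convention
also covers other characters), the function name circumflex_positions is
hypothetical, and it assumes that circumflex does not match after a newline
that ends the subject.
</P>

```python
def circumflex_positions(subject):
    # Offsets at which a multiline-mode circumflex would match, in a
    # simplified model that knows only CR, LF and CRLF.
    positions = [0]  # the very start of the subject always qualifies
    i = 0
    while i < len(subject):
        if subject.startswith("\r\n", i):
            i += 2  # CRLF is taken as one newline, never split
        elif subject[i] in "\r\n":
            i += 1  # a lone CR or LF is also a newline
        else:
            i += 1
            continue
        if i < len(subject):  # assumption: no match after a final newline
            positions.append(i)
    return positions


print(circumflex_positions("abc\r\nxyz"))  # [0, 5]
```

<P>
Offset 5 is the position of "x"; offset 4, between CR and LF, is not reported.
</P>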
|
||||
<P>
|
||||
Note that the sequences \A, \Z, and \z can be used to match the start and
|
||||
end of the subject in both modes, and if all branches of a pattern start with
|
||||
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||
@ -1230,20 +1251,32 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
unless the PCRE2_NO_UTF_CHECK option is used).
|
||||
</P>
|
||||
<P>
|
||||
An application can lock out the use of \C by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
|
||||
build PCRE2 with the use of \C permanently disabled.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 does not allow \C to appear in lookbehind assertions
|
||||
<a href="#lookbehind">(described below)</a>
|
||||
in a UTF mode, because this would make it impossible to calculate the length of
|
||||
the lookbehind.
|
||||
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
|
||||
the length of the lookbehind. Neither the alternative matching function
|
||||
<b>pcre2_dfa_match()</b> nor the JIT optimizer supports \C in these UTF modes.
|
||||
The former gives a match-time error; the latter fails to optimize and so the
|
||||
match is always run using the interpreter.
|
||||
</P>
|
||||
<P>
|
||||
In the 32-bit library, however, \C is always supported (when not explicitly
|
||||
locked out) because it always matches a single code unit, whether or not UTF-32
|
||||
is specified.
|
||||
</P>
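<P>
The reason \C is harmless in the 32-bit library is that every character
occupies exactly one 32-bit code unit, whereas in UTF-8 and UTF-16 a character
may span several. The following Python sketch counts code units per character
using the standard codecs; the helper name code_units is an assumption for
illustration and nothing here calls PCRE2.
</P>

```python
def code_units(ch):
    # Number of code units that one character occupies in each encoding form.
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }


for ch in ("A", "\u00e9", "\U0001F600"):
    print("U+%04X" % ord(ch), code_units(ch))
```

<P>
Matching a single code unit can therefore split a character in UTF-8 or
UTF-16, but never in UTF-32.
</P>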
|
||||
<P>
|
||||
In general, the \C escape sequence is best avoided. However, one way of using
|
||||
it that avoids the problem of malformed UTF characters is to use a lookahead to
|
||||
check the length of the next character, as in this pattern, which could be used
|
||||
with a UTF-8 string (ignore white space and line breaks):
|
||||
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
|
||||
lookahead to check the length of the next character, as in this pattern, which
|
||||
could be used with a UTF-8 string (ignore white space and line breaks):
|
||||
<pre>
|
||||
(?| (?=[\x00-\x7f])(\C) |
|
||||
(?=[\x80-\x{7ff}])(\C)(\C) |
|
||||
@ -1298,42 +1331,6 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||
class such as [^a] always matches one of these characters.
|
||||
</P>
|
||||
<P>
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class, or
|
||||
immediately after a range. For example, [b-d-z] matches letters in the range b
|
||||
to d, a hyphen character, or z.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of range, so [W-\]46] is interpreted as a class containing a range
|
||||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
</P>
|
||||
<P>
|
||||
An error is generated if a POSIX character class (see below) or an escape
|
||||
sequence other than one that defines a single character appears at a point
|
||||
where a range ending character is expected. For example, [z-\xff] is valid,
|
||||
but [A-\d] and [A-[:digit:]] are not.
|
||||
</P>
|
||||
<P>
|
||||
Ranges operate in the collating sequence of character values. They can also be
|
||||
used for characters specified numerically, for example [\000-\037]. Ranges
|
||||
can include any characters that are valid for the current mode.
|
||||
</P>
|
||||
<P>
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\xc8-\xcb] matches accented E
|
||||
characters in both cases.
|
||||
</P>
|
||||
<P>
|
||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
|
||||
\V, \w, and \W may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\dABCDEF] matches any hexadecimal
|
||||
@ -1347,6 +1344,52 @@ are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
</P>
|
||||
<P>
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class,
|
||||
or immediately after a range. For example, [b-d-z] matches letters in the range
|
||||
b to d, a hyphen character, or z.
|
||||
</P>
|
||||
<P>
|
||||
Perl treats a hyphen as a literal if it appears before or after a POSIX class
|
||||
(see below) or a character type escape such as \d, but gives a warning in
|
||||
its warning mode, as this is most likely a user error. As PCRE2 has no facility
|
||||
for warning, an error is given in these cases.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of range, so [W-\]46] is interpreted as a class containing a range
|
||||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
</P>
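<P>
The two classes discussed above can be tried out quickly. Note the
assumption: the sketch below uses Python's re module, which is a different
engine from PCRE2 but parses these two particular classes in the same way, so
it serves only as an illustration of the rule.
</P>

```python
import re

# [W-]46] is the class [W-] followed by the literal string "46]".
unescaped = re.compile(r"[W-]46]")

# [W-\]46] is one class: the range W to ], plus the characters 4 and 6.
escaped = re.compile(r"[W-\]46]")

print(bool(unescaped.fullmatch("W46]")))  # True
print(bool(unescaped.fullmatch("-46]")))  # True
print(bool(escaped.fullmatch("Z")))       # True: Z lies between W and ]
print(bool(escaped.fullmatch("-")))       # False: hyphen is not in the class
```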
|
||||
<P>
|
||||
Ranges normally include all code points between the start and end characters,
|
||||
inclusive. They can also be used for code points specified numerically, for
|
||||
example [\000-\037]. Ranges can include any characters that are valid for the
|
||||
current mode.
|
||||
</P>
|
||||
<P>
|
||||
There is a special case in EBCDIC environments for ranges whose end points are
|
||||
both specified as literal letters in the same case. For compatibility with
|
||||
Perl, EBCDIC code points within the range that are not letters are omitted. For
|
||||
example, [h-k] matches only four characters, even though the codes for h and k
|
||||
are 0x88 and 0x92, a range of 11 code points. However, if the range is
|
||||
specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
|
||||
are included.
|
||||
</P>
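<P>
The arithmetic behind the [h-k] example can be checked with a small table.
This Python sketch assumes the usual EBCDIC lower case layout (a-i at hex 81
to 89, j-r at hex 91 to 99, s-z at hex A2 to A9); the names EBCDIC_LOWER,
literal_letter_range, and numeric_range are hypothetical and the code models
the rule rather than reproducing PCRE2's implementation.
</P>

```python
# Assumed EBCDIC layout for lower case letters: three separate blocks.
EBCDIC_LOWER = {}
for first_code, letters in ((0x81, "abcdefghi"),
                            (0x91, "jklmnopqr"),
                            (0xA2, "stuvwxyz")):
    for offset, letter in enumerate(letters):
        EBCDIC_LOWER[first_code + offset] = letter


def literal_letter_range(start, end):
    # Both end points are literal letters: code points between them that
    # are not letters are left out.
    codes = {letter: code for code, letter in EBCDIC_LOWER.items()}
    return [EBCDIC_LOWER[code]
            for code in range(codes[start], codes[end] + 1)
            if code in EBCDIC_LOWER]


def numeric_range(low, high):
    # A numerically specified range keeps every code point.
    return list(range(low, high + 1))


print(literal_letter_range("h", "k"))  # ['h', 'i', 'j', 'k']
print(len(numeric_range(0x88, 0x92)))  # 11
```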
|
||||
<P>
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\xc8-\xcb] matches accented E
|
||||
characters in both cases.
|
||||
</P>
|
||||
<P>
|
||||
A circumflex can conveniently be used with the upper case character types to
|
||||
specify a more restricted set of characters than the matching lower case type.
|
||||
For example, the class [^\W_] matches any letter or digit, but not underscore,
|
||||
@ -1514,13 +1557,8 @@ respectively.
|
||||
<P>
|
||||
When one of these option changes occurs at top level (that is, not inside
|
||||
subpattern parentheses), the change applies to the remainder of the pattern
|
||||
that follows. If the change is placed right at the start of a pattern, PCRE2
|
||||
extracts it into the global options (and it will therefore show up in data
|
||||
extracted by the <b>pcre2_pattern_info()</b> function).
|
||||
</P>
|
||||
<P>
|
||||
An option change within a subpattern (see below for a description of
|
||||
subpatterns) affects only that part of the subpattern that follows it, so
|
||||
that follows. An option change within a subpattern (see below for a description
|
||||
of subpatterns) affects only that part of the subpattern that follows it, so
|
||||
<pre>
|
||||
(a(?i)b)c
|
||||
</pre>
|
||||
@ -1649,6 +1687,10 @@ first one in the pattern with the given number. The following pattern matches
|
||||
<pre>
|
||||
/(?|(abc)|(def))(?1)/
|
||||
</pre>
|
||||
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||
of computing an absolute group number.
|
||||
</P>
|
||||
<P>
|
||||
If a
|
||||
<a href="#conditions">condition test</a>
|
||||
for a subpattern's having matched refers to a non-unique number, the test is
|
||||
@ -2051,9 +2093,9 @@ subpattern is possible using named parentheses (see below).
|
||||
</P>
|
||||
<P>
|
||||
Another way of avoiding the ambiguity inherent in the use of digits following a
|
||||
backslash is to use the \g escape sequence. This escape must be followed by an
|
||||
unsigned number or a negative number, optionally enclosed in braces. These
|
||||
examples are all identical:
|
||||
backslash is to use the \g escape sequence. This escape must be followed by a
|
||||
signed or unsigned number, optionally enclosed in braces. These examples are
|
||||
all identical:
|
||||
<pre>
|
||||
(ring), \1
|
||||
(ring), \g1
|
||||
@ -2061,8 +2103,7 @@ examples are all identical:
|
||||
</pre>
|
||||
An unsigned number specifies an absolute reference without the ambiguity that
|
||||
is present in the older syntax. It is also useful when literal digits follow
|
||||
the reference. A negative number is a relative reference. Consider this
|
||||
example:
|
||||
the reference. A signed number is a relative reference. Consider this example:
|
||||
<pre>
|
||||
(abc(def)ghi)\g{-1}
|
||||
</pre>
|
||||
@ -2073,6 +2114,11 @@ can be helpful in long patterns, and also in patterns that are created by
|
||||
joining together fragments that contain references within themselves.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \g{+1} is a reference to the next capturing subpattern. This kind
|
||||
of forward reference can be useful in patterns that repeat. Perl does not
|
||||
support the use of + in this way.
|
||||
</P>
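<P>
Converting a relative reference to an absolute group number is simple
arithmetic. The Python sketch below is an illustration under stated
assumptions: the function name absolute_group_number is hypothetical, the
caller supplies the number of capturing groups opened to the left of the
reference, and duplicate group numbers are not considered.
</P>

```python
def absolute_group_number(groups_opened, relative):
    # groups_opened: capturing groups opened to the left of the reference.
    # relative: the signed number, for example -1 for \g{-1}.
    if relative < 0:
        number = groups_opened + 1 + relative  # count leftwards
    elif relative > 0:
        number = groups_opened + relative      # count rightwards
    else:
        # Choice made by this sketch: zero is rejected.
        raise ValueError("a relative reference must be non-zero")
    if number < 1:
        raise ValueError("reference points before the first group")
    return number


# In (abc(def)ghi)\g{-1} two groups are open before the reference.
print(absolute_group_number(2, -1))  # 2
print(absolute_group_number(2, -2))  # 1
print(absolute_group_number(2, +1))  # 3
```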
|
||||
<P>
|
||||
A back reference matches whatever actually matched the capturing subpattern in
|
||||
the current subject string, rather than anything matching the subpattern
|
||||
itself (see
|
||||
@ -2172,6 +2218,14 @@ capturing is carried out only for positive assertions. (Perl sometimes, but not
|
||||
always, does do capturing in negative assertions.)
|
||||
</P>
|
||||
<P>
|
||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
||||
succeeds, but failure to match later in the pattern causes backtracking over
|
||||
this assertion, the captures within the assertion are reset only if no higher
|
||||
numbered captures are already set. This is, unfortunately, a fundamental
|
||||
limitation of the current implementation; it may get removed in a future
|
||||
reworking.
|
||||
</P>
|
||||
<P>
|
||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||
it makes no sense to assert the same thing several times, the side effect of
|
||||
capturing parentheses may occasionally be useful. However, an assertion that
|
||||
@ -2268,18 +2322,31 @@ match. If there are insufficient characters before the current position, the
|
||||
assertion fails.
|
||||
</P>
|
||||
<P>
|
||||
In a UTF mode, PCRE2 does not allow the \C escape (which matches a single code
|
||||
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
|
||||
it impossible to calculate the length of the lookbehind. The \X and \R
|
||||
escapes, which can match different numbers of code units, are also not
|
||||
permitted.
|
||||
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
|
||||
single code unit even in a UTF mode) to appear in lookbehind assertions,
|
||||
because it makes it impossible to calculate the length of the lookbehind. The
|
||||
\X and \R escapes, which can match different numbers of code units, are never
|
||||
permitted in lookbehinds.
|
||||
</P>
|
||||
<P>
|
||||
<a href="#subpatternsassubroutines">"Subroutine"</a>
|
||||
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
|
||||
as the subpattern matches a fixed-length string.
|
||||
<a href="#recursion">Recursion,</a>
|
||||
however, is not supported.
|
||||
as the subpattern matches a fixed-length string. However,
|
||||
<a href="#recursion">recursion,</a>
|
||||
that is, a "subroutine" call into a group that is already active,
|
||||
is not supported.
|
||||
</P>
|
||||
<P>
|
||||
Perl does not support back references in lookbehinds. PCRE2 does support them,
|
||||
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
|
||||
must not be set, there must be no use of (?| in the pattern (it creates
|
||||
duplicate subpattern numbers), and if the back reference is by name, the name
|
||||
must be unique. Of course, the referenced subpattern must itself be of fixed
|
||||
length. The following pattern matches words containing at least two characters
|
||||
that begin and end with the same character:
|
||||
<pre>
|
||||
\b(\w)\w++(?<=\1)
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
Possessive quantifiers can be used in conjunction with lookbehind assertions to
|
||||
@ -2417,7 +2484,9 @@ Checking for a used subpattern by name
|
||||
<P>
|
||||
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
|
||||
subpattern by name. For compatibility with earlier versions of PCRE1, which had
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized.
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
|
||||
however, that undelimited names consisting of the letter R followed by digits
|
||||
are ambiguous (see the following section).
|
||||
</P>
|
||||
<P>
|
||||
Rewriting the above example to use a named subpattern gives this:
|
||||
@ -2432,30 +2501,52 @@ matched.
|
||||
Checking for pattern recursion
|
||||
</b><br>
|
||||
<P>
|
||||
If the condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if a recursive call to the whole pattern or any
|
||||
subpattern has been made. If digits or a name preceded by ampersand follow the
|
||||
letter R, for example:
|
||||
"Recursion" in this sense refers to any subroutine-like call from one part of
|
||||
the pattern to another, whether or not it is actually recursive. See the
|
||||
sections entitled
|
||||
<a href="#recursion">"Recursive patterns"</a>
|
||||
and
|
||||
<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
|
||||
below for details of recursion and subpattern calls.
|
||||
</P>
|
||||
<P>
|
||||
If a condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if matching is currently in a recursion or subroutine
|
||||
call to the whole pattern or any subpattern. If digits follow the letter R, and
|
||||
there is no subpattern with that name, the condition is true if the most recent
|
||||
call is into a subpattern with the given number, which must exist somewhere in
|
||||
the overall pattern. This is a contrived example that is equivalent to a+b:
|
||||
<pre>
|
||||
(?(R3)...) or (?(R&name)...)
|
||||
((?(R1)a+|(?1)b))
|
||||
</pre>
|
||||
the condition is true if the most recent recursion is into a subpattern whose
|
||||
number or name is given. This condition does not check the entire recursion
|
||||
stack. If the name used in a condition of this kind is a duplicate, the test is
|
||||
applied to all subpatterns of the same name, and is true if any one of them is
|
||||
the most recent recursion.
|
||||
However, in both cases, if there is a subpattern with a matching name, the
|
||||
condition tests for its being set, as described in the section above, instead
|
||||
of testing for recursion. For example, creating a group with the name R1 by
|
||||
adding (?<R1>) to the above pattern completely changes its meaning.
|
||||
</P>
|
||||
<P>
|
||||
If a name preceded by ampersand follows the letter R, for example:
|
||||
<pre>
|
||||
(?(R&name)...)
|
||||
</pre>
|
||||
the condition is true if the most recent recursion is into a subpattern of that
|
||||
name (which must exist within the pattern).
|
||||
</P>
|
||||
<P>
|
||||
This condition does not check the entire recursion stack. It tests only the
|
||||
current level. If the name used in a condition of this kind is a duplicate, the
|
||||
test is applied to all subpatterns of the same name, and is true if any one of
|
||||
them is the most recent recursion.
|
||||
</P>
|
||||
<P>
|
||||
At "top level", all these recursion test conditions are false.
|
||||
<a href="#recursion">The syntax for recursive patterns</a>
|
||||
is described below.
|
||||
<a name="subdefine"></a></P>
|
||||
<br><b>
|
||||
Defining subpatterns for use by reference only
|
||||
</b><br>
|
||||
<P>
|
||||
If the condition is the string (DEFINE), and there is no subpattern with the
|
||||
name DEFINE, the condition is always false. In this case, there may be only one
|
||||
If the condition is the string (DEFINE), the condition is always false, even if
|
||||
there is a group with the name DEFINE. In this case, there may be only one
|
||||
alternative in the subpattern. It is always skipped if control reaches this
|
||||
point in the pattern; the idea of DEFINE is that it can be used to define
|
||||
subroutines that can be referenced from elsewhere. (The use of
|
||||
@ -2489,7 +2580,8 @@ For example:
|
||||
(?(VERSION>=10.4)yes|no)
|
||||
</pre>
|
||||
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
|
||||
"no" otherwise.
|
||||
"no" otherwise. The fractional part of the version number may not contain more
|
||||
than two digits.
|
||||
</P>
|
||||
<br><b>
|
||||
Assertion conditions
|
||||
@ -2602,6 +2694,21 @@ parentheses preceding the recursion. In other words, a negative number counts
|
||||
capturing parentheses leftwards from the point at which it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Be aware, however, that if
|
||||
<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
|
||||
are in use, relative references refer to the earliest subpattern with the
|
||||
appropriate number. Consider, for example:
|
||||
<pre>
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
</pre>
|
||||
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
|
||||
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||
opened pair of parentheses has the number 1, but it is the first such group (the (a)
|
||||
group) to which the recursion refers. This would be the same if an absolute
|
||||
reference (?1) was used. In other words, relative references are just a
|
||||
shorthand for computing a group number.
|
||||
</P>
|
||||
<P>
|
||||
It is also possible to refer to subsequently opened parentheses, by writing
|
||||
references such as (?+2). However, these cannot be recursive because the
|
||||
reference is not inside the parentheses that are referenced. They are always
|
||||
@ -2899,14 +3006,36 @@ remarks apply to the PCRE2 features described in this section.
|
||||
</P>
|
||||
<P>
|
||||
The new verbs make use of what was previously invalid syntax: an opening
|
||||
parenthesis followed by an asterisk. They are generally of the form
|
||||
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
|
||||
differently depending on whether or not a name is present. A name is any
|
||||
sequence of characters that does not include a closing parenthesis. The maximum
|
||||
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
|
||||
libraries. If the name is empty, that is, if the closing parenthesis
|
||||
immediately follows the colon, the effect is as if the colon were not there.
|
||||
Any number of these verbs may occur in a pattern.
|
||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
||||
depending on whether or not a name is present.
|
||||
</P>
|
||||
<P>
|
||||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
that does not include a closing parenthesis. The name is not processed in
|
||||
any way, and it is not possible to include a closing parenthesis in the name.
|
||||
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
|
||||
is no longer Perl-compatible.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
|
||||
and only an unescaped closing parenthesis terminates the name. However, the
|
||||
only backslash items that are permitted are \Q, \E, and sequences such as
|
||||
\x{100} that define character code points. Character type escapes such as \d
|
||||
are faulted.
|
||||
</P>
|
||||
<P>
|
||||
A closing parenthesis can be included in a name either as \) or between \Q
|
||||
and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
|
||||
also set, unescaped whitespace in verb names is skipped, and #-comments are
|
||||
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
|
||||
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
|
||||
</P>
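<P>
The difference between the default and PCRE2_ALT_VERBNAMES handling of verb
names can be sketched as follows. This is a simplified Python model, not
PCRE2 code: the function verb_name and its alt_verbnames parameter are
hypothetical, only \Q, \E, and a backslash before a non-alphanumeric
character are handled, and sequences such as \x{100} as well as
PCRE2_EXTENDED white space skipping are deliberately left out.
</P>

```python
def verb_name(text, alt_verbnames=False):
    # text is whatever follows the colon in (*VERB:NAME), including the
    # closing parenthesis.
    if not alt_verbnames:
        # Default: no processing, the first ")" always ends the name.
        return text[:text.index(")")]
    name, i, quoted = [], 0, False
    while i < len(text):
        if quoted:
            if text.startswith("\\E", i):
                quoted, i = False, i + 2
            else:
                name.append(text[i])  # everything is literal inside \Q...\E
                i += 1
        elif text.startswith("\\Q", i):
            quoted, i = True, i + 2
        elif text[i] == "\\":
            following = text[i + 1:i + 2]
            if not following or following.isalnum():
                raise ValueError("escape not covered by this sketch")
            name.append(following)    # for example \) gives a literal ")"
            i += 2
        elif text[i] == ")":
            return "".join(name)      # an unescaped ")" ends the name
        else:
            name.append(text[i])
            i += 1
    raise ValueError("unterminated verb name")


print(verb_name("A)B)"))                            # A
print(verb_name("A\\)B)", alt_verbnames=True))      # A)B
print(verb_name("\\QA)B\\E)", alt_verbnames=True))  # A)B
```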
|
||||
<P>
|
||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||
parenthesis immediately follows the colon, the effect is as if the colon were
|
||||
not there. Any number of these verbs may occur in a pattern.
|
||||
</P>
|
||||
<P>
|
||||
Since these verbs are specifically related to backtracking, most of them can be
|
||||
@ -3323,9 +3452,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 13 June 2015
|
||||
Last updated: 27 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -12,17 +12,21 @@ This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
PCRE2 PERFORMANCE
|
||||
</b><br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
|
||||
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
|
||||
<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
|
||||
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
|
||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 PERFORMANCE</a><br>
|
||||
<P>
|
||||
Two aspects of performance are discussed below: memory usage and processing
|
||||
time. The way you express your pattern as a regular expression can affect both
|
||||
of them.
|
||||
</P>
|
||||
<br><b>
|
||||
COMPILED PATTERN MEMORY USAGE
|
||||
</b><br>
|
||||
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
|
||||
<P>
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||
so that most simple patterns do not use much memory. However, there is one case
|
||||
@ -75,9 +79,7 @@ pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||
that PCRE2 cannot otherwise handle.
|
||||
</P>
|
||||
<br><b>
|
||||
STACK USAGE AT RUN TIME
|
||||
</b><br>
|
||||
<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
|
||||
<P>
|
||||
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
|
||||
cause it to use large amounts of the process stack. In some environments the
|
||||
@ -86,9 +88,7 @@ SIGSEGV. Rewriting your pattern can often help. The
|
||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
||||
documentation discusses this issue in detail.
|
||||
</P>
|
||||
<br><b>
|
||||
PROCESSING TIME
|
||||
</b><br>
|
||||
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
|
||||
<P>
|
||||
Certain items in regular expression patterns are processed more efficiently
|
||||
than others. It is more efficient to use a character class like [aeiou] than a
|
||||
@ -177,9 +177,7 @@ appreciable time with strings longer than about 20 characters.
|
||||
In many cases, the solution to this kind of performance issue is to use an
|
||||
atomic group or a possessive quantifier.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
@ -188,9 +186,7 @@ University Computing Service
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
|
@ -48,7 +48,7 @@ This set of functions provides a POSIX-style API for the PCRE2 regular
|
||||
expression 8-bit library. See the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation for a description of PCRE2's native API, which contains much
|
||||
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
|
||||
additional functionality. There are no POSIX-style wrappers for PCRE2's 16-bit
|
||||
and 32-bit libraries.
|
||||
</P>
|
||||
<P>
|
||||
@ -67,9 +67,9 @@ POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||
replacement library. Other POSIX options are not even defined.
|
||||
</P>
|
||||
<P>
|
||||
There are also some other options that are not defined by POSIX. These have
|
||||
been added at the request of users who want to make use of certain
|
||||
PCRE2-specific features via the POSIX calling interface.
|
||||
There are also some options that are not defined by POSIX. These have been
|
||||
added at the request of users who want to make use of certain PCRE2-specific
|
||||
features via the POSIX calling interface.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||
@ -119,11 +119,11 @@ defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
<pre>
|
||||
REG_NOSUB
|
||||
</pre>
|
||||
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
|
||||
for compilation to the native function. In addition, when a pattern that is
|
||||
compiled with this flag is passed to <b>regexec()</b> for matching, the
|
||||
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
|
||||
are returned.
|
||||
When a pattern that is compiled with this flag is passed to <b>regexec()</b> for
|
||||
matching, the <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no
|
||||
captured strings are returned. Versions of the PCRE2 library prior to 10.22 used
|
||||
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
||||
because it disables the use of back references.
|
||||
<pre>
|
||||
REG_UCP
|
||||
</pre>
|
||||
@ -170,7 +170,7 @@ use the contents of the <i>preg</i> structure. If, for example, you pass it to
|
||||
This area is not simple, because POSIX and Perl take different views of things.
|
||||
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||
never intended to be a POSIX engine. The following table lists the different
|
||||
possibilities for matching newline characters in PCRE2:
|
||||
possibilities for matching newline characters in Perl and PCRE2:
|
||||
<pre>
|
||||
Default Change with
|
||||
|
||||
@ -180,7 +180,7 @@ possibilities for matching newline characters in PCRE2:
|
||||
$ matches \n in middle no PCRE2_MULTILINE
|
||||
^ matches \n in middle no PCRE2_MULTILINE
|
||||
</pre>
|
||||
This is the equivalent table for POSIX:
|
||||
This is the equivalent table for a POSIX-compatible pattern matcher:
|
||||
<pre>
|
||||
Default Change with
|
||||
|
||||
@ -190,14 +190,18 @@ This is the equivalent table for POSIX:
|
||||
$ matches \n in middle no REG_NEWLINE
|
||||
^ matches \n in middle no REG_NEWLINE
|
||||
</pre>
|
||||
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
|
||||
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
|
||||
newline from matching [^a].
|
||||
This behaviour is not what happens when PCRE2 is called via its POSIX
|
||||
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
|
||||
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
|
||||
is no way to stop newline from matching [^a].
|
||||
</P>
|
||||
<P>
|
||||
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
|
||||
the REG_NEWLINE action.
|
||||
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY when calling <b>pcre2_compile()</b> directly, but there is
|
||||
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
|
||||
the POSIX API, passing REG_NEWLINE to PCRE2's <b>regcomp()</b> function
|
||||
causes PCRE2_MULTILINE to be passed to <b>pcre2_compile()</b>, and REG_DOTALL
|
||||
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
|
||||
<P>
|
||||
@ -231,19 +235,21 @@ to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||
how it is matched.
|
||||
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
|
||||
mutually exclusive; if both are done, the error REG_INVARG is returned.
|
||||
</P>
|
||||
<P>
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
|
||||
<b>regexec()</b> are ignored.
|
||||
<b>regexec()</b> are ignored (except possibly as input for REG_STARTEND).
|
||||
</P>
|
||||
<P>
|
||||
If the value of <i>nmatch</i> is zero, or if the value <i>pmatch</i> is NULL,
|
||||
no data about any matched strings is returned.
|
||||
The value of <i>nmatch</i> may be zero, and the value of <i>pmatch</i> may be NULL
|
||||
(unless REG_STARTEND is set); in both these cases no data about any matched
|
||||
strings is returned.
|
||||
</P>
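To illustrate the rule above, here is a minimal sketch of a match-only test that passes zero for <i>nmatch</i> and NULL for <i>pmatch</i>. It is written against the standard <regex.h> declarations, which pcre2posix.h is designed to mirror; the helper name matches() is hypothetical and not part of PCRE2. When building against PCRE2 you would include pcre2posix.h and link with the POSIX wrapper library instead.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches subject, 0 if it does
   not, and -1 if the pattern fails to compile or matching reports an error. */
int matches(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* nmatch is zero and pmatch is NULL, so no offsets are returned. */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

Because no offsets are requested, the caller does not need to size a regmatch_t array; this form suits simple filters where only the yes/no answer matters.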
|
||||
<P>
|
||||
Otherwise,the portion of the string that was matched, and also any captured
|
||||
Otherwise, the portion of the string that was matched, and also any captured
|
||||
substrings, are returned via the <i>pmatch</i> argument, which points to an
|
||||
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
|
||||
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
|
||||
@ -262,9 +268,11 @@ header file, of which REG_NOMATCH is the "expected" failure code.
|
||||
The <b>regerror()</b> function maps a non-zero errorcode from either
|
||||
<b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
|
||||
NULL, the error should have arisen from the use of that structure. A message
|
||||
terminated by a binary zero is placed in <i>errbuf</i>. The length of the
|
||||
message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
|
||||
function is the size of buffer needed to hold the whole message.
|
||||
terminated by a binary zero is placed in <i>errbuf</i>. If the buffer is too
|
||||
short, only the first <i>errbuf_size</i> - 1 characters of the error message are
|
||||
used. The yield of the function is the size of buffer needed to hold the whole
|
||||
message, including the terminating zero. This value is greater than
|
||||
<i>errbuf_size</i> if the message was truncated.
|
||||
</P>
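The size-reporting behaviour described above allows a two-step call: ask for the required size first, then fetch the whole message. The sketch below assumes the standard <regex.h> declarations (pcre2posix.h presents the same interface); the helper name full_regerror() is hypothetical.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd copy of the complete message for
   errcode, or NULL if allocation fails. The caller frees the result. */
char *full_regerror(int errcode, const regex_t *preg)
{
    /* With a zero-length buffer nothing is written; the yield is the size
       needed for the whole message, including the terminating zero. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed);

    if (msg != NULL)
        regerror(errcode, preg, msg, needed);
    return msg;
}
```

A caller that prefers a fixed buffer can instead compare the yield with the buffer size: a yield greater than <i>errbuf_size</i> means the stored message was truncated.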
|
||||
<br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
|
||||
<P>
|
||||
@ -283,9 +291,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
Last updated: 31 January 2016
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -24,12 +24,11 @@ documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||
save this listing to re-create the contents of <i>pcre2demo.c</i>.
|
||||
</P>
|
||||
<P>
|
||||
The demonstration program, which uses the PCRE2 8-bit library, compiles the
|
||||
regular expression that is its first argument, and matches it against the
|
||||
subject string in its second argument. No PCRE2 options are set, and default
|
||||
character tables are used. If matching succeeds, the program outputs the
|
||||
portion of the subject that matched, together with the contents of any captured
|
||||
substrings.
|
||||
The demonstration program compiles the regular expression that is its
|
||||
first argument, and matches it against the subject string in its second
|
||||
argument. No PCRE2 options are set, and default character tables are used. If
|
||||
matching succeeds, the program outputs the portion of the subject that matched,
|
||||
together with the contents of any captured substrings.
|
||||
</P>
|
||||
<P>
|
||||
If the -g option is given on the command line, the program then goes on to
|
||||
@ -38,34 +37,39 @@ string. The logic is a little bit tricky because of the possibility of matching
|
||||
an empty string. Comments in the code explain what is going on.
|
||||
</P>
|
||||
<P>
|
||||
The code in <b>pcre2demo.c</b> is an 8-bit program that uses the PCRE2 8-bit
|
||||
library. It handles strings and characters that are stored in 8-bit code units.
|
||||
By default, one character corresponds to one code unit, but if the pattern
|
||||
starts with "(*UTF)", both it and the subject are treated as UTF-8 strings,
|
||||
where characters may occupy multiple code units.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2 is installed in the standard include and library directories for your
|
||||
operating system, you should be able to compile the demonstration program using
|
||||
this command:
|
||||
a command like this:
|
||||
<pre>
|
||||
gcc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
cc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
</pre>
|
||||
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||
<i>/usr/local</i>, you can compile the demonstration program using a command
|
||||
like this:
|
||||
<pre>
|
||||
gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
Once you have compiled and linked the demonstration program, you can run simple
|
||||
tests like this:
|
||||
cc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
|
||||
</pre>
|
||||
Once you have built the demonstration program, you can run simple tests like
|
||||
this:
|
||||
<pre>
|
||||
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||
</pre>
|
||||
Note that there is a much more comprehensive test program, called
|
||||
<a href="pcre2test.html"><b>pcre2test</b>,</a>
|
||||
which supports many more facilities for testing regular expressions using the
|
||||
PCRE2 libraries. The
|
||||
which supports many more facilities for testing regular expressions using all
|
||||
three PCRE2 libraries (8-bit, 16-bit, and 32-bit, though not all three need be
|
||||
installed). The
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
program is provided as a simple coding example.
|
||||
program is provided as a relatively simple coding example.
|
||||
</P>
|
||||
<P>
|
||||
If you try to run
|
||||
@ -73,7 +77,7 @@ If you try to run
|
||||
when PCRE2 is not installed in the standard library directory, you may get an
|
||||
error like this on some operating systems (e.g. Solaris):
|
||||
<pre>
|
||||
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
|
||||
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
|
||||
</pre>
|
||||
This is caused by the way shared library support works on those systems. You
|
||||
need to add
|
||||
@ -97,9 +101,9 @@ Cambridge, England.
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
Last updated: 02 February 2016
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -14,10 +14,11 @@ please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a>
|
||||
<li><a name="TOC2" href="#SEC2">SAVING COMPILED PATTERNS</a>
|
||||
<li><a name="TOC3" href="#SEC3">RE-USING PRECOMPILED PATTERNS</a>
|
||||
<li><a name="TOC4" href="#SEC4">AUTHOR</a>
|
||||
<li><a name="TOC5" href="#SEC5">REVISION</a>
|
||||
<li><a name="TOC2" href="#SEC2">SECURITY CONCERNS</a>
|
||||
<li><a name="TOC3" href="#SEC3">SAVING COMPILED PATTERNS</a>
|
||||
<li><a name="TOC4" href="#SEC4">RE-USING PRECOMPILED PATTERNS</a>
|
||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a><br>
|
||||
<P>
|
||||
@ -41,14 +42,22 @@ If you are running an application that uses a large number of regular
|
||||
expression patterns, it may be useful to store them in a precompiled form
|
||||
instead of having to compile them every time the application is run. However,
|
||||
if you are using the just-in-time optimization feature, it is not possible to
|
||||
save and reload the JIT data, because it is position-dependent. In addition,
|
||||
the host on which the patterns are reloaded must be running the same version of
|
||||
PCRE2, with the same code unit width, and must also have the same endianness,
|
||||
pointer width and PCRE2_SIZE type. For example, patterns compiled on a 32-bit
|
||||
system using PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor
|
||||
can they be reloaded using the 8-bit library.
|
||||
save and reload the JIT data, because it is position-dependent. The host on
|
||||
which the patterns are reloaded must be running the same version of PCRE2, with
|
||||
the same code unit width, and must also have the same endianness, pointer width
|
||||
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
|
||||
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
|
||||
reloaded using the 8-bit library.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
||||
<br><a name="SEC2" href="#TOC1">SECURITY CONCERNS</a><br>
|
||||
<P>
|
||||
The facility for saving and restoring compiled patterns is intended for use
|
||||
within individual applications. As such, the data supplied to
|
||||
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
||||
arbitrary external sources. There is only some simple consistency checking, not
|
||||
complete validation of what is being re-loaded.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
||||
<P>
|
||||
Before compiled patterns can be saved they must be serialized, that is,
|
||||
converted to a stream of bytes. A single byte stream may contain any number of
|
||||
@ -110,7 +119,7 @@ still be used for matching. Their memory must eventually be freed in the usual
|
||||
way by calling <b>pcre2_code_free()</b>. When you have finished with the byte
|
||||
stream, it too must be freed by calling <b>pcre2_serialize_free()</b>.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
|
||||
<br><a name="SEC4" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
|
||||
<P>
|
||||
In order to re-use a set of saved patterns you must first make the serialized
|
||||
byte stream available in main memory (for example, by reading from a file). The
|
||||
@ -142,21 +151,27 @@ is filled with those that fit, and the remainder are ignored. The yield of the
|
||||
function is the number of decoded patterns, or one of the following negative
|
||||
error codes:
|
||||
<pre>
|
||||
PCRE2_ERROR_BADDATA second argument is zero or less
|
||||
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
|
||||
PCRE2_ERROR_BADMODE mismatch of variable unit size or PCRE2 version
|
||||
PCRE2_ERROR_MEMORY memory allocation failed
|
||||
PCRE2_ERROR_NULL first or third argument is NULL
|
||||
PCRE2_ERROR_BADDATA second argument is zero or less
|
||||
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
|
||||
PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
|
||||
PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
|
||||
PCRE2_ERROR_MEMORY memory allocation failed
|
||||
PCRE2_ERROR_NULL first or third argument is NULL
|
||||
</pre>
|
||||
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
|
||||
on a system with different endianness.
|
||||
</P>
|
||||
<P>
|
||||
Decoded patterns can be used for matching in the usual way, and must be freed
|
||||
by calling <b>pcre2_code_free()</b> as normal. A single copy of the character
|
||||
tables is used by all the decoded patterns. A reference count is used to
|
||||
by calling <b>pcre2_code_free()</b>. However, be aware that there is a potential
|
||||
race issue if you are using multiple patterns that were decoded from a single
|
||||
byte stream in a multithreaded application. A single copy of the character
|
||||
tables is used by all the decoded patterns and a reference count is used to
|
||||
arrange for its memory to be automatically freed when the last pattern is
|
||||
freed.
|
||||
freed, but there is no locking on this reference count. Therefore, if you want
|
||||
to call <b>pcre2_code_free()</b> for these patterns in different threads, you
|
||||
must arrange your own locking, and ensure that <b>pcre2_code_free()</b> cannot
|
||||
be called by two threads at the same time.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern was processed by <b>pcre2_jit_compile()</b> before being
|
||||
@ -164,7 +179,7 @@ serialized, the JIT data is discarded and so is no longer available after a
|
||||
save/restore cycle. You can, however, process a restored pattern with
|
||||
<b>pcre2_jit_compile()</b> if you wish.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
@ -173,11 +188,11 @@ University Computing Service
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 January 2015
|
||||
Last updated: 24 May 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -57,12 +57,13 @@ assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||
Normally, these are never very deep, and the limit on the complexity of
|
||||
<b>pcre2_dfa_match()</b> is controlled by the amount of workspace it is given.
|
||||
However, it is possible to write patterns with runaway infinite recursions;
|
||||
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack. At
|
||||
present, there is no protection against this.
|
||||
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack unless a
|
||||
limit is applied (see below).
|
||||
</P>
|
||||
<P>
|
||||
The comments that follow do NOT apply to <b>pcre2_dfa_match()</b>; they are
|
||||
relevant only for <b>pcre2_match()</b> without the JIT optimization.
|
||||
The comments in the next three sections do not apply to
|
||||
<b>pcre2_dfa_match()</b>; they are relevant only for <b>pcre2_match()</b> without
|
||||
the JIT optimization.
|
||||
</P>
|
||||
<br><b>
|
||||
Reducing <b>pcre2_match()</b>'s stack usage
|
||||
@ -115,7 +116,7 @@ entitled
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation. Since the block sizes are always the same, it may be possible to
|
||||
implement customized a memory handler that is more efficient than the standard
|
||||
implement a customized memory handler that is more efficient than the standard
|
||||
function. The memory blocks obtained for this purpose are retained and re-used
|
||||
if possible while <b>pcre2_match()</b> is running. They are all freed just
|
||||
before it exits.
|
||||
@ -151,6 +152,15 @@ pattern to match. This is done by calling <b>pcre2_match()</b> repeatedly with
|
||||
different limits.
|
||||
</P>
|
||||
<br><b>
|
||||
Limiting <b>pcre2_dfa_match()</b>'s stack usage
|
||||
</b><br>
|
||||
<P>
|
||||
The recursion limit, as described above for <b>pcre2_match()</b>, also applies
|
||||
to <b>pcre2_dfa_match()</b>, whose use of recursive function calls for
|
||||
recursions in the pattern can lead to runaway stack usage. The non-recursive
|
||||
match limit is not relevant for DFA matching, and is ignored.
|
||||
</P>
|
||||
<br><b>
|
||||
Changing stack size in Unix-like systems
|
||||
</b><br>
|
||||
<P>
|
||||
@ -198,9 +208,9 @@ Cambridge, England.
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 21 November 2014
|
||||
Last updated: 23 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -111,9 +111,10 @@ it matches a literal "u".
|
||||
\W a "non-word" character
|
||||
\X a Unicode extended grapheme cluster
|
||||
</pre>
|
||||
The application can lock out the use of \C by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
|
||||
current matching point in the middle of a UTF-8 or UTF-16 character.
|
||||
\C is dangerous because it may leave the current matching point in the middle
|
||||
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
|
||||
with the use of \C permanently disabled.
|
||||
</P>
|
||||
<P>
|
||||
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
|
||||
@ -187,6 +188,8 @@ at release 5.18.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
|
||||
<P>
|
||||
Ahom,
|
||||
Anatolian_Hieroglyphs,
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
@ -227,6 +230,7 @@ Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hatran,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
@ -263,12 +267,14 @@ Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Multani,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Hungarian,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
@ -290,6 +296,7 @@ Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
SignWriting,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
@ -444,9 +451,10 @@ appear.
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
</pre>
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_match(), not increase them. The application
|
||||
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
|
||||
PCRE2_NEVER_UCP options, respectively, at compile time.
|
||||
limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>, not
|
||||
increase them. The application can lock out the use of (*UTF) and (*UCP) by
|
||||
setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at
|
||||
compile time.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<P>
|
||||
@ -485,6 +493,9 @@ Each top-level branch of a look behind must be of a fixed length.
|
||||
\n reference by number (can be ambiguous)
|
||||
\gn reference by number
|
||||
\g{n} reference by number
|
||||
\g+n relative reference by number (PCRE2 extension)
|
||||
\g-n relative reference by number
|
||||
\g{+n} relative reference by number (PCRE2 extension)
|
||||
\g{-n} relative reference by number
|
||||
\k<name> reference by name (Perl)
|
||||
\k'name' reference by name (Perl)
|
||||
@ -523,14 +534,17 @@ Each top-level branch of a look behind must be of a fixed length.
|
||||
(?(-n) relative reference condition
|
||||
(?(<name>) named reference condition (Perl)
|
||||
(?('name') named reference condition (Perl)
|
||||
(?(name) named reference condition (PCRE2)
|
||||
(?(name) named reference condition (PCRE2, deprecated)
|
||||
(?(R) overall recursion condition
|
||||
(?(Rn) specific group recursion condition
|
||||
(?(R&name) specific recursion condition
|
||||
(?(Rn) specific numbered group recursion condition
|
||||
(?(R&name) specific named group recursion condition
|
||||
(?(DEFINE) define subpattern for reference
|
||||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
</PRE>
|
||||
</pre>
|
||||
Note the ambiguity of (?(R) and (?(Rn), which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||
condition if the relevant named group exists.
|
||||
</P>
|
||||
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
@ -582,9 +596,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 13 June 2015
|
||||
Last updated: 23 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -61,7 +61,7 @@ subject is processed, and what output is produced.
|
||||
<P>
|
||||
As the original fairly simple PCRE library evolved, it acquired many different
|
||||
features, and as a result, the original <b>pcretest</b> program ended up with a
|
||||
lot of options in a messy, arcane syntax, for testing all the features. The
|
||||
lot of options in a messy, arcane syntax for testing all the features. The
|
||||
move to the new PCRE2 API provided an opportunity to re-implement the test
|
||||
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
|
||||
are still many obscure modifiers, some of which are specifically designed for
|
||||
@ -77,31 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
|
||||
all three of these libraries may be simultaneously installed. The
|
||||
<b>pcre2test</b> program can be used to test all the libraries. However, its own
|
||||
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
|
||||
libraries, patterns and subject strings are converted to 16- or 32-bit format
|
||||
before being passed to the library functions. Results are converted back to
|
||||
8-bit code units for output.
|
||||
libraries, patterns and subject strings are converted to 16-bit or 32-bit
|
||||
format before being passed to the library functions. Results are converted back
|
||||
to 8-bit code units for output.
|
||||
</P>
|
||||
<P>
|
||||
In the rest of this document, the names of library functions and structures
|
||||
are given in generic form, for example, <b>pcre2_compile()</b>. The actual
|
||||
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
|
||||
</P>
|
||||
<a name="inputencoding"></a></P>
|
||||
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
|
||||
<P>
|
||||
Input to <b>pcre2test</b> is processed line by line, either by calling the C
|
||||
library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
|
||||
below). The input is processed using using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
||||
treats any bytes other than newline as data characters. In some Windows
|
||||
environments character 26 (hex 1A) causes an immediate end of file, and no
|
||||
further data is read.
|
||||
library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
|
||||
Windows environments character 26 (hex 1A) causes an immediate end of file, and
|
||||
no further data is read, so this character should be avoided unless you really
|
||||
want that action.
|
||||
</P>
|
||||
<P>
|
||||
For maximum portability, therefore, it is safest to avoid non-printing
|
||||
characters in <b>pcre2test</b> input files. There is a facility for specifying a
|
||||
pattern's characters as hexadecimal pairs, thus making it possible to include
|
||||
binary zeroes in a pattern for testing purposes. Subject lines are processed
|
||||
for backslash escapes, which makes it possible to include any data value.
|
||||
The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
||||
treats any bytes other than newline as data characters. An error is generated
|
||||
if a binary zero is encountered. Subject lines are processed for backslash
|
||||
escapes, which makes it possible to include any data value in strings that are
|
||||
passed to the library for matching. For patterns, there is a facility for
|
||||
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||
which makes it possible to include binary zeros.
|
||||
</P>
|
||||
<br><b>
|
||||
Input for the 16-bit and 32-bit libraries
|
||||
</b><br>
|
||||
<P>
|
||||
When testing the 16-bit or 32-bit libraries, there is a need to be able to
|
||||
generate character code points greater than 255 in the strings that are passed
|
||||
to the library. For subject lines, backslash escapes can be used. In addition,
|
||||
when the <b>utf</b> modifier (see
|
||||
<a href="#optionmodifiers">"Setting compilation options"</a>
|
||||
below) is set, the pattern and any following subject lines are interpreted as
|
||||
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
|
||||
</P>
|
||||
<P>
|
||||
For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
|
||||
used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
|
||||
or 32-bit mode. It causes the pattern and following subject lines to be treated
|
||||
as UTF-8 according to the original definition (RFC 2279), which allows for
|
||||
character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
||||
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
|
||||
to occur).
|
||||
</P>
|
||||
<P>
|
||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
||||
values can be handled by the 32-bit library. When testing this library in
|
||||
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
|
||||
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
||||
character's value. This is the only way of passing such code points in a
|
||||
pattern string. For subject strings, using an escape sequence is preferable.
|
||||
</P>
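The decoding rule for <b>utf8_input</b> can be sketched as follows. This is not the code used by <b>pcre2test</b>; it is an illustrative decoder, under the assumption that input follows the original RFC 2279 layout of up to six bytes per character, with a leading 0xff byte adding 0x80000000 to the character that follows. The function name decode_utf8_input() is hypothetical. A caller testing the 16-bit library would additionally reject values greater than 0xffff.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character in original (RFC 2279) UTF-8 starting at p, with at
   most len bytes available. A leading 0xff byte adds 0x80000000 to the value
   of the character that follows. Stores the value in *out and returns the
   number of bytes consumed, or 0 if the sequence is malformed or truncated.
   This sketch does not reject overlong encodings. */
size_t decode_utf8_input(const uint8_t *p, size_t len, uint32_t *out)
{
    size_t used = 0, extra, i;
    uint32_t high = 0, value;
    uint8_t lead;

    if (len > 0 && p[0] == 0xff) {
        high = 0x80000000u;
        used = 1;
    }
    if (used >= len)
        return 0;

    lead = p[used++];
    if (lead < 0x80)      { value = lead;        extra = 0; }
    else if (lead < 0xc0) return 0;              /* stray continuation byte */
    else if (lead < 0xe0) { value = lead & 0x1f; extra = 1; }
    else if (lead < 0xf0) { value = lead & 0x0f; extra = 2; }
    else if (lead < 0xf8) { value = lead & 0x07; extra = 3; }
    else if (lead < 0xfc) { value = lead & 0x03; extra = 4; }
    else if (lead < 0xfe) { value = lead & 0x01; extra = 5; }
    else return 0;                               /* 0xfe or a second 0xff */

    if (len - used < extra)
        return 0;
    for (i = 0; i < extra; i++) {
        uint8_t c = p[used++];
        if ((c & 0xc0) != 0x80)
            return 0;                            /* not a continuation byte */
        value = (value << 6) | (uint32_t)(c & 0x3f);
    }

    *out = high | value;
    return used;
}
```

The six-byte form tops out at 0x7fffffff, which is why the 0xff prefix is needed to reach the upper half of the 32-bit range.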
|
||||
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
||||
<P>
|
||||
@ -123,8 +153,13 @@ the 32-bit library has been built, this is the default. If the 32-bit library
|
||||
has not been built, this option causes an error.
|
||||
</P>
|
||||
<P>
|
||||
<b>-ac</b>
|
||||
Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
|
||||
automatic callouts into every pattern that is compiled.
|
||||
</P>
|
||||
<P>
|
||||
<b>-b</b>
|
||||
Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
|
||||
Behave as if each pattern has the <b>fullbincode</b> modifier; the full
|
||||
internal binary form of the pattern is output after compilation.
|
||||
</P>
|
||||
<P>
|
||||
@ -155,12 +190,13 @@ following options output the value and set the exit code as indicated:
|
||||
The following options output 1 for true or 0 for false, and set the exit code
|
||||
to the same value:
|
||||
<pre>
|
||||
ebcdic compiled for an EBCDIC environment
|
||||
jit just-in-time support is available
|
||||
pcre2-16 the 16-bit library was built
|
||||
pcre2-32 the 32-bit library was built
|
||||
pcre2-8 the 8-bit library was built
|
||||
unicode Unicode support is available
|
||||
backslash-C \C is supported (not locked out)
|
||||
ebcdic compiled for an EBCDIC environment
|
||||
jit just-in-time support is available
|
||||
pcre2-16 the 16-bit library was built
|
||||
pcre2-32 the 32-bit library was built
|
||||
pcre2-8 the 8-bit library was built
|
||||
unicode Unicode support is available
|
||||
</pre>
|
||||
If an unknown option is given, an error message is output; the exit code is 0.
|
||||
</P>
|
||||
@ -177,12 +213,19 @@ using the <b>pcre2_dfa_match()</b> function instead of the default
|
||||
<b>pcre2_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
<b>-error</b> <i>number[,number,...]</i>
|
||||
Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
|
||||
comma-separated list, display the resulting messages on the standard output,
|
||||
then exit with zero exit code. The numbers may be positive or negative. This is
|
||||
a convenience facility for PCRE2 maintainers.
|
||||
</P>
|
||||
<P>
|
||||
<b>-help</b>
|
||||
Output a brief summary of these options and then exit.
|
||||
</P>
|
||||
<P>
|
||||
<b>-i</b>
|
||||
Behave as if each pattern has the <b>/info</b> modifier; information about the
|
||||
Behave as if each pattern has the <b>info</b> modifier; information about the
|
||||
compiled pattern is given after compilation.
|
||||
</P>
|
||||
<P>
|
||||
@ -265,9 +308,9 @@ Each subject line is matched separately and independently. If you want to do
|
||||
multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
|
||||
etc., depending on the newline setting) in a single line of input to encode the
|
||||
newline sequences. There is no limit on the length of subject lines; the input
|
||||
buffer is automatically extended if it is too small. There is a replication
|
||||
feature that makes it possible to generate long subject lines without having to
|
||||
supply them explicitly.
|
||||
buffer is automatically extended if it is too small. There are replication
|
||||
features that make it possible to generate long repetitive pattern or subject
|
||||
lines without having to supply them explicitly.
|
||||
</P>
|
||||
<P>
|
||||
An empty line or the end of the file signals the end of the subject lines for a
|
||||
@ -304,6 +347,36 @@ output.
|
||||
This command is used to load a set of precompiled patterns from a file, as
|
||||
described in the section entitled "Saving and restoring compiled patterns"
|
||||
<a href="#saverestore">below.</a>
|
||||
<pre>
|
||||
#newline_default [<newline-list>]
|
||||
</pre>
|
||||
When PCRE2 is built, a default newline convention can be specified. This
|
||||
determines which characters and/or character pairs are recognized as indicating
|
||||
a newline in a pattern or subject string. The default can be overridden when a
|
||||
pattern is compiled. The standard test files contain tests of various newline
|
||||
conventions, but the majority of the tests expect a single linefeed to be
|
||||
recognized as a newline by default. Without special action the tests would fail
|
||||
when PCRE2 is compiled with either CR or CRLF as the default newline.
|
||||
</P>
|
||||
<P>
|
||||
The #newline_default command specifies a list of newline types that are
|
||||
acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
|
||||
ANY (in upper or lower case), for example:
|
||||
<pre>
|
||||
#newline_default LF Any anyCRLF
|
||||
</pre>
|
||||
If the default newline is in the list, this command has no effect. Otherwise,
|
||||
except when testing the POSIX API, a <b>newline</b> modifier that specifies the
|
||||
first newline convention in the list (LF in the above example) is added to any
|
||||
pattern that does not already have a <b>newline</b> modifier. If the newline
|
||||
list is empty, the feature is turned off. This command is present in a number
|
||||
of the standard test input files.
|
||||
</P>
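The selection rule described above can be sketched in Python. This is only an illustrative model, not code from pcre2test; the function name and its arguments are hypothetical.

```python
VALID_TYPES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}


def added_newline_modifier(build_default, accepted,
                           pattern_has_newline=False, posix=False):
    """Return the newline type that would be added to a pattern, or None.

    build_default       -- newline convention PCRE2 was built with
    accepted            -- the list given to #newline_default
    pattern_has_newline -- True if the pattern already has a newline modifier
    posix               -- True when the POSIX API is being tested
    """
    names = [name.upper() for name in accepted]  # upper or lower case is allowed
    for name in names:
        if name not in VALID_TYPES:
            raise ValueError("unknown newline type: " + name)
    if not names:
        return None  # an empty list turns the feature off
    if build_default.upper() in names:
        return None  # the default is acceptable, so nothing is added
    if posix or pattern_has_newline:
        return None  # no override for POSIX, or the pattern sets its own
    return names[0]  # the first convention in the list is used
```

With the list from the example above (LF Any anyCRLF) and a library built with CRLF as the default, the sketch returns "LF".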
|
||||
<P>
|
||||
When the POSIX API is being tested there is no way to override the default
|
||||
newline convention, though it is possible to set the newline convention from
|
||||
within the pattern. A warning is given if the <b>posix</b> modifier is used when
|
||||
<b>#newline_default</b> would set a default for the non-POSIX API.
|
||||
<pre>
|
||||
#pattern <modifier-list>
|
||||
</pre>
|
||||
@ -321,9 +394,10 @@ test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
|
||||
command helps detect tests that are accidentally put in the wrong file.
|
||||
<pre>
|
||||
#pop [<modifiers>]
|
||||
#popcopy [<modifiers>]
|
||||
</pre>
|
||||
This command is used to manipulate the stack of compiled patterns, as described
|
||||
in the section entitled "Saving and restoring compiled patterns"
|
||||
These commands are used to manipulate the stack of compiled patterns, as
|
||||
described in the section entitled "Saving and restoring compiled patterns"
|
||||
<a href="#saverestore">below.</a>
|
||||
<pre>
|
||||
#save <filename>
|
||||
@ -340,12 +414,13 @@ subject lines. Modifiers on a subject line can change these settings.
|
||||
<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
|
||||
<P>
|
||||
Modifier lists are used with both pattern and subject lines. Items in a list
|
||||
are separated by commas and optional white space. Some modifiers may be given
|
||||
for both patterns and subject lines, whereas others are valid for one or the
|
||||
other only. Each modifier has a long name, for example "anchored", and some of
|
||||
them must be followed by an equals sign and a value, for example, "offset=12".
|
||||
Modifiers that do not take values may be preceded by a minus sign to turn off a
|
||||
previous setting.
|
||||
are separated by commas followed by optional white space. Trailing whitespace
|
||||
in a modifier list is ignored. Some modifiers may be given for both patterns
|
||||
and subject lines, whereas others are valid only for one or the other. Each
|
||||
modifier has a long name, for example "anchored", and some of them must be
|
||||
followed by an equals sign and a value, for example, "offset=12". Values cannot
|
||||
contain comma characters, but may contain spaces. Modifiers that do not take
|
||||
values may be preceded by a minus sign to turn off a previous setting.
|
||||
</P>
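As a rough illustration of the syntax rules above, the following Python sketch splits a modifier list into its items. It is a simplified model, not the parser used by pcre2test; the function name is hypothetical, and silently skipping empty items is an assumption made for brevity.

```python
def parse_modifiers(text):
    """Split a modifier list into (name, value, enabled) tuples."""
    result = []
    # Trailing white space in the list is ignored.
    for item in text.rstrip().split(","):
        item = item.lstrip()  # white space may follow each comma
        if not item:
            continue  # simplification: empty items are skipped
        name, sep, value = item.partition("=")
        if sep:
            # Values cannot contain commas, but may contain spaces.
            result.append((name, value, True))
        elif name.startswith("-"):
            # A minus sign turns off a previous setting.
            result.append((name[1:], None, False))
        else:
            result.append((name, None, True))
    return result
```

For example, parse_modifiers("anchored, offset=12,-notbol") yields one enabled modifier without a value, one with the value "12", and one that is turned off.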
|
||||
<P>
|
||||
A few of the more common modifiers can also be specified as single letters, for
|
||||
@ -454,6 +529,12 @@ the start of a modifier list. For example:
|
||||
<pre>
|
||||
abc\=notbol,notempty
|
||||
</pre>
|
||||
If the subject string is empty and \= is followed by whitespace, the line is
|
||||
treated as a comment line, and is not used for matching. For example:
|
||||
<pre>
|
||||
\= This is a comment.
|
||||
abc\= This is an invalid modifier list.
|
||||
</pre>
|
||||
A backslash followed by any other non-alphanumeric character just escapes that
|
||||
character. A backslash followed by anything else causes an error. However, if
|
||||
the very last character in the line is a backslash (and there is no modifier
|
||||
@ -462,10 +543,10 @@ a real empty line terminates the data input.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
|
||||
<P>
|
||||
There are three types of modifier that can appear in pattern lines, two of
|
||||
which may also be used in a <b>#pattern</b> command. A pattern's modifier list
|
||||
can add to or override default modifiers that were set by a previous
|
||||
<b>#pattern</b> command.
|
||||
There are several types of modifier that can appear in pattern lines. Except
|
||||
where noted below, they may also be used in <b>#pattern</b> commands. A
|
||||
pattern's modifier list can add to or override default modifiers that were set
|
||||
by a previous <b>#pattern</b> command.
|
||||
<a name="optionmodifiers"></a></P>
|
||||
<br><b>
|
||||
Setting compilation options
|
||||
@ -473,12 +554,13 @@ Setting compilation options
|
||||
<P>
|
||||
The following modifiers set options for <b>pcre2_compile()</b>. The most common
|
||||
ones have single-letter abbreviations. See
|
||||
<a href="pcreapi.html"><b>pcreapi</b></a>
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
for a description of their effects.
|
||||
<pre>
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
alt_verbnames set PCRE2_ALT_VERBNAMES
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
@ -499,12 +581,15 @@ for a description of their effects.
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
ucp set PCRE2_UCP
|
||||
ungreedy set PCRE2_UNGREEDY
|
||||
use_offset_limit set PCRE2_USE_OFFSET_LIMIT
|
||||
utf set PCRE2_UTF
|
||||
</pre>
|
||||
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
|
||||
non-printing characters in output strings to be printed using the \x{hh...}
|
||||
notation. Otherwise, those less than 0x100 are output in hex without the curly
|
||||
brackets.
|
||||
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
|
||||
subject strings to be translated to UTF-16 or UTF-32, respectively, before
|
||||
being passed to library functions.
|
||||
<a name="controlmodifiers"></a></P>
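The output notation described above can be modelled with a short Python sketch. It is not taken from pcre2test: the function name is hypothetical, and both the definition of "printing" (plain ASCII here) and the two-digit padding are assumptions.

```python
def show_code_point(cp, utf=False):
    """Format one character value in the style described above."""
    if 0x20 <= cp < 0x7F:
        return chr(cp)  # assumption: printable ASCII is shown as itself
    if utf or cp >= 0x100:
        return "\\x{%02x}" % cp  # curly-bracket notation
    return "\\x%02x" % cp  # below 0x100 without utf: no curly brackets
```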
|
||||
<br><b>
|
||||
Setting compilation controls
|
||||
@ -519,18 +604,24 @@ about the pattern:
|
||||
debug same as info,fullbincode
|
||||
fullbincode show binary code with lengths
|
||||
/I info show info about compiled pattern
|
||||
hex pattern is coded in hexadecimal
|
||||
hex unquoted characters are hexadecimal
|
||||
jit[=<number>] use JIT
|
||||
jitfast use JIT fast path
|
||||
jitverify verify JIT use
|
||||
locale=<name> use this locale
|
||||
max_pattern_length=<n> set the maximum pattern length
|
||||
memory show memory used
|
||||
newline=<type> set newline type
|
||||
null_context compile with a NULL context
|
||||
parens_nest_limit=<n> set maximum parentheses depth
|
||||
posix use the POSIX API
|
||||
posix_nosub use the POSIX API with REG_NOSUB
|
||||
push push compiled pattern onto the stack
|
||||
pushcopy push a copy onto the stack
|
||||
stackguard=<number> test the stackguard feature
|
||||
tables=[0|1|2] select internal tables
|
||||
use_length do not zero-terminate the pattern
|
||||
utf8_input treat input as UTF-8
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
</P>
|
||||
@ -604,40 +695,145 @@ is requested. For each callout, either its number or string is given, followed
|
||||
by the item that follows it in the pattern.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying a pattern in hex
|
||||
Passing a NULL context
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>hex</b> modifier specifies that the characters of the pattern are to be
|
||||
interpreted as pairs of hexadecimal digits. White space is permitted between
|
||||
pairs. For example:
|
||||
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If
|
||||
the <b>null_context</b> modifier is set, however, NULL is passed. This is for
|
||||
testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
|
||||
default values).
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying the pattern's length
|
||||
</b><br>
|
||||
<P>
|
||||
By default, patterns are passed to the compiling functions as zero-terminated
|
||||
strings. When using the POSIX wrapper API, there is no other option. However,
|
||||
when using PCRE2's native API, patterns can be passed by length instead of
|
||||
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
|
||||
Using a length happens automatically (whether or not <b>use_length</b> is set)
|
||||
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
|
||||
binary zeros.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying pattern characters in hexadecimal
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>hex</b> modifier specifies that the characters of the pattern, except for
|
||||
substrings enclosed in single or double quotes, are to be interpreted as pairs
|
||||
of hexadecimal digits. This feature is provided as a way of creating patterns
|
||||
that contain binary zeros and other non-printing characters. White space is
|
||||
permitted between pairs of digits. For example, this pattern contains three
|
||||
characters:
|
||||
<pre>
|
||||
/ab 32 59/hex
|
||||
</pre>
|
||||
This feature is provided as a way of creating patterns that contain binary zero
|
||||
and other non-printing characters. By default, <b>pcre2test</b> passes patterns
|
||||
as zero-terminated strings to <b>pcre2_compile()</b>, giving the length as
|
||||
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
|
||||
actual length of the pattern is passed.
|
||||
Parts of such a pattern are taken literally if quoted. This pattern contains
|
||||
nine characters, only two of which are specified in hexadecimal:
|
||||
<pre>
|
||||
/ab "literal" 32/hex
|
||||
</pre>
|
||||
Either single or double quotes may be used. There is no way of including
|
||||
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
|
||||
mutually exclusive.
|
||||
</P>
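The decoding rule for <b>hex</b> patterns can be sketched in Python as follows. This is an illustrative model for the 8-bit case only, not the code in pcre2test; the function name is hypothetical.

```python
import string


def decode_hex_pattern(text):
    """Decode the body of a pattern that has the hex modifier into bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # white space is permitted between pairs of digits
        elif ch in "'\"":
            # A quoted substring is taken literally, up to the same delimiter.
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("unterminated quoted substring")
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("expected a pair of hexadecimal digits")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)
```

Applied to the two examples above, the sketch produces three and nine code units respectively.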
|
||||
<P>
|
||||
The POSIX API cannot be used with patterns specified in hexadecimal because
|
||||
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
|
||||
requirement for a zero-terminated string. Such patterns are always passed to
|
||||
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying wide characters in 16-bit and 32-bit modes
|
||||
</b><br>
|
||||
<P>
|
||||
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
|
||||
translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
|
||||
the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
|
||||
can be used. It is mutually exclusive with <b>utf</b>. Input lines are
|
||||
interpreted as UTF-8 as a means of specifying wide characters. More details are
|
||||
given in
|
||||
<a href="#inputencoding">"Input encoding"</a>
|
||||
above.
|
||||
</P>
|
||||
<br><b>
|
||||
Generating long repetitive patterns
|
||||
</b><br>
|
||||
<P>
|
||||
Some tests use long patterns that are very repetitive. Instead of creating a
|
||||
very long input line for such a pattern, you can use a special repetition
|
||||
feature, similar to the one described for subject lines above. If the
|
||||
<b>expand</b> modifier is present on a pattern, parts of the pattern that have
|
||||
the form
|
||||
<pre>
|
||||
\[<characters>]{<count>}
|
||||
</pre>
|
||||
are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
|
||||
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
|
||||
by decimal digits and "}" is found later in the pattern. If not, the characters
|
||||
remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
|
||||
mutually exclusive.
|
||||
</P>
|
||||
<P>
|
||||
If part of an expanded pattern looks like an expansion, but is really part of
|
||||
the actual pattern, unwanted expansion can be avoided by giving two values in
|
||||
the quantifier. For example, \[AB]{6000,6000} is not recognized as an
|
||||
expansion item.
|
||||
</P>
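The expansion rule can be approximated with a regular expression, as in the Python sketch below. It is an approximation for illustration, not the algorithm used by pcre2test; the function name is hypothetical, and for simplicity the repeated characters may not themselves contain "]".

```python
import re

# "\[" then the characters, then "]{", decimal digits, and "}".
_EXPANSION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_pattern(pattern):
    """Replace each \\[<characters>]{<count>} item by the repeated characters."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)
```

A quantifier with two values, such as \[AB]{6000,6000}, does not match the expression and is therefore left unaltered, as is a "\[" that has no matching "]{" and count.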
|
||||
<P>
|
||||
If the <b>info</b> modifier is set on an expanded pattern, the result of the
|
||||
expansion is included in the information that is output.
|
||||
</P>
|
||||
<br><b>
|
||||
JIT compilation
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/jit</b> modifier may optionally be followed by an equals sign and a
|
||||
number in the range 0 to 7:
|
||||
Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
|
||||
speed up pattern matching. See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for details. JIT compiling happens, optionally, after a pattern
|
||||
has been successfully compiled into an internal form. The JIT compiler converts
|
||||
this to optimized machine code. It needs to know whether the match-time options
|
||||
PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
|
||||
different code is generated for the different cases. See the <b>partial</b>
|
||||
modifier in "Subject Modifiers"
|
||||
<a href="#subjectmodifiers">below</a>
|
||||
for details of how these options are specified for each match attempt.
|
||||
</P>
|
||||
<P>
|
||||
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
|
||||
optionally be followed by an equals sign and a number in the range 0 to 7.
|
||||
The three bits that make up the number specify which of the three JIT operating
|
||||
modes are to be compiled:
|
||||
<pre>
|
||||
1 compile JIT code for non-partial matching
|
||||
2 compile JIT code for soft partial matching
|
||||
4 compile JIT code for hard partial matching
|
||||
</pre>
|
||||
The possible values for the <b>jit</b> modifier are therefore:
|
||||
<pre>
|
||||
0 disable JIT
|
||||
1 use JIT for normal match only
|
||||
2 use JIT for soft partial match only
|
||||
3 use JIT for normal match and soft partial match
|
||||
4 use JIT for hard partial match only
|
||||
6 use JIT for soft and hard partial match
|
||||
1 normal matching only
|
||||
2 soft partial matching only
|
||||
3 normal and soft partial matching
|
||||
4 hard partial matching only
|
||||
6 soft and hard partial matching only
|
||||
7 all three modes
|
||||
</pre>
|
||||
If no number is given, 7 is assumed. If JIT compilation is successful, the
|
||||
compiled JIT code will automatically be used when <b>pcre2_match()</b> is run
|
||||
for the appropriate type of match, except when incompatible run-time options
|
||||
are specified. For more details, see the
|
||||
If no number is given, 7 is assumed. The phrase "partial matching" means a call
|
||||
to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
|
||||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
|
||||
match; the options enable the possibility of a partial match, but do not
|
||||
require it. Note also that if you request JIT compilation only for partial
|
||||
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
|
||||
subject line, that match will not use JIT code because none was compiled for
|
||||
non-partial matching.
|
||||
</P>
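The meaning of the three bits can be illustrated with a small Python sketch. The names are hypothetical, and the sketch ignores the case of incompatible run-time options.

```python
_JIT_BITS = ((1, "normal"), (2, "soft"), (4, "hard"))


def jit_modes(number=7):
    """Return the set of JIT modes selected by jit=<number> (7 if omitted)."""
    if not 0 <= number <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return {name for bit, name in _JIT_BITS if number & bit}


def uses_jit(number, partial=None):
    """True if JIT code was compiled for this kind of match.

    partial is None for a non-partial match, or "soft" or "hard".
    """
    wanted = "normal" if partial is None else partial
    return wanted in jit_modes(number)
```

For instance, uses_jit(2) is false because /jit=2 compiles code only for soft partial matching, whereas uses_jit(2, "soft") is true.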
|
||||
<P>
|
||||
If JIT compilation is successful, the compiled JIT code will automatically be
|
||||
used when an appropriate type of match is run, except when incompatible
|
||||
run-time options are specified. For more details, see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation. See also the <b>jitstack</b> modifier below for a way of
|
||||
setting the size of the JIT stack.
|
||||
@ -661,14 +857,14 @@ code was actually used in the match.
|
||||
Setting a locale
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/locale</b> modifier must specify the name of a locale, for example:
|
||||
The <b>locale</b> modifier must specify the name of a locale, for example:
|
||||
<pre>
|
||||
/pattern/locale=fr_FR
|
||||
</pre>
|
||||
The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
|
||||
character tables for the locale, and this is then passed to
|
||||
<b>pcre2_compile()</b> when compiling the regular expression. The same tables
|
||||
are used when matching the following subject lines. The <b>/locale</b> modifier
|
||||
are used when matching the following subject lines. The <b>locale</b> modifier
|
||||
applies only to the pattern on which it appears, but can be given in a
|
||||
<b>#pattern</b> command if a default is needed. Setting a locale and alternate
|
||||
character tables are mutually exclusive.
|
||||
@ -677,7 +873,7 @@ character tables are mutually exclusive.
|
||||
Showing pattern memory
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
|
||||
The <b>memory</b> modifier causes the size in bytes of the memory used to hold
|
||||
the compiled pattern to be output. This does not include the size of the
|
||||
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
|
||||
subsequently passed to the JIT compiler, the size of the JIT compiled code is
|
||||
@ -700,30 +896,53 @@ sets its own default of 220, which is required for running the standard test
|
||||
suite.
|
||||
</P>
|
||||
<br><b>
|
||||
Limiting the pattern length
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
|
||||
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
|
||||
causes a compilation error. The default is the largest number a PCRE2_SIZE
|
||||
variable can hold (essentially unlimited).
|
||||
</P>
|
||||
<br><b>
|
||||
Using the POSIX wrapper API
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/posix</b> modifier causes <b>pcre2test</b> to call PCRE2 via the POSIX
|
||||
wrapper API rather than its native API. This supports only the 8-bit library.
|
||||
When the POSIX API is being used, the following pattern modifiers set options
|
||||
for the <b>regcomp()</b> function:
|
||||
The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
||||
PCRE2 via the POSIX wrapper API rather than its native API. When
|
||||
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
|
||||
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
|
||||
it does not imply POSIX matching semantics; for more detail see the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
documentation. The following pattern modifiers set options for the
|
||||
<b>regcomp()</b> function:
|
||||
<pre>
|
||||
caseless REG_ICASE
|
||||
multiline REG_NEWLINE
|
||||
no_auto_capture REG_NOSUB
|
||||
dotall REG_DOTALL )
|
||||
ungreedy REG_UNGREEDY ) These options are not part of
|
||||
ucp REG_UCP ) the POSIX standard
|
||||
utf REG_UTF8 )
|
||||
</pre>
|
||||
The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that
|
||||
is passed to <b>regerror()</b> in the event of a compilation error. For example:
|
||||
<pre>
|
||||
/abc/posix,regerror_buffsize=20
|
||||
</pre>
|
||||
This provides a means of testing the behaviour of <b>regerror()</b> when the
|
||||
buffer is too small for the error message. If this modifier has not been set, a
|
||||
large buffer is used.
|
||||
</P>
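The situation that <b>regerror_buffsize</b> is meant to exercise can be modelled in Python. The sketch follows the general POSIX contract for <b>regerror()</b> (truncate, keep room for the terminating zero, report the size needed); it does not reproduce the exact messages of the PCRE2 wrapper, and the function name is hypothetical.

```python
def regerror_model(message, bufsize):
    """Return (size needed, text that fits) for a buffer of bufsize bytes."""
    needed = len(message) + 1  # one extra byte for the terminating zero
    if bufsize <= 0:
        return needed, ""  # nothing can be stored
    return needed, message[:bufsize - 1]  # truncated when the buffer is small
```

With a buffer size of 20, as in the example above, at most 19 characters of the message are stored.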
|
||||
<P>
|
||||
The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
|
||||
below. All other modifiers cause an error.
|
||||
below. All other modifiers are either ignored, with a warning message, or cause
|
||||
an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Testing the stack guard feature
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/stackguard</b> modifier is used to test the use of
|
||||
The <b>stackguard</b> modifier is used to test the use of
|
||||
<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
|
||||
enable stack availability to be checked during compilation (see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
@ -738,7 +957,7 @@ be aborted.
|
||||
Using alternative character tables
|
||||
</b><br>
|
||||
<P>
|
||||
The value specified for the <b>/tables</b> modifier must be one of the digits 0,
|
||||
The value specified for the <b>tables</b> modifier must be one of the digits 0,
|
||||
1, or 2. It causes a specific set of built-in character tables to be passed to
|
||||
<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
|
||||
different character tables. The digit specifies the tables as follows:
|
||||
@ -758,17 +977,22 @@ Setting certain match controls
|
||||
<P>
|
||||
The following modifiers are really subject modifiers, and are described below.
|
||||
However, they may be included in a pattern's modifier list, in which case they
|
||||
are applied to every subject line that is processed with that pattern. They do
|
||||
not affect the compilation process.
|
||||
are applied to every subject line that is processed with that pattern. They may
|
||||
not appear in <b>#pattern</b> commands. These modifiers do not affect the
|
||||
compilation process.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
</pre>
|
||||
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
||||
defaults, set them in a <b>#subject</b> command.
|
||||
@ -782,13 +1006,17 @@ pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
|
||||
line to contain a new pattern (or a command) instead of a subject line. This
|
||||
facility is used when saving compiled patterns to a file, as described in the
|
||||
section entitled "Saving and restoring compiled patterns"
|
||||
<a href="#saverestore">below.</a>
|
||||
The <b>push</b> modifier is incompatible with compilation modifiers such as
|
||||
<b>global</b> that act at match time. Any that are specified are ignored, with a
|
||||
warning message, except for <b>replace</b>, which causes an error. Note that,
|
||||
<b>jitverify</b>, which is allowed, does not carry through to any subsequent
|
||||
matching that uses this pattern.
|
||||
</P>
|
||||
<a href="#saverestore">below.</a> If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled
|
||||
pattern is stacked, leaving the original as current, ready to match the
|
||||
following input lines. This provides a way of testing the
|
||||
<b>pcre2_code_copy()</b> function.
|
||||
The <b>push</b> and <b>pushcopy</b> modifiers are incompatible with compilation
|
||||
modifiers such as <b>global</b> that act at match time. Any that are specified
|
||||
are ignored (for the stacked copy), with a warning message, except for
|
||||
<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is
|
||||
allowed, does not carry through to any subsequent matching that uses a stacked
|
||||
pattern.
|
||||
<a name="subjectmodifiers"></a></P>
|
||||
<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
|
||||
<P>
|
||||
The modifiers that can appear in subject lines and the <b>#subject</b>
|
||||
@ -806,6 +1034,7 @@ for a description of their effects.
|
||||
anchored set PCRE2_ANCHORED
|
||||
dfa_restart set PCRE2_DFA_RESTART
|
||||
dfa_shortest set PCRE2_DFA_SHORTEST
|
||||
no_jit set PCRE2_NO_JIT
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
notbol set PCRE2_NOTBOL
|
||||
notempty set PCRE2_NOTEMPTY
|
||||
@ -818,11 +1047,11 @@ The partial matching modifiers are provided with abbreviations because they
|
||||
appear frequently in tests.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
|
||||
If the <b>posix</b> modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any effect
|
||||
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
||||
Any other modifiers cause an error.
|
||||
The other modifiers are ignored, with a warning message.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match controls
|
||||
@ -833,33 +1062,44 @@ information. Some of them may also be specified on a pattern line (see above),
|
||||
in which case they apply to every subject line that is matched against that
|
||||
pattern.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=>n> set a match limit
|
||||
memory show memory usage
|
||||
offset=<n> set starting offset
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_error=<n>[:<m>] control callout error
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
The effects of these modifiers are described in the following sections. When
|
||||
matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>,
|
||||
and <b>ovector</b> subject modifiers work as described below. All other
|
||||
modifiers are either ignored, with a warning message, or cause an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Showing more text
|
||||
@ -916,7 +1156,8 @@ The <b>allcaptures</b> modifier requests that the values of all potential
|
||||
captured parentheses be output after a match. By default, only those up to the
|
||||
highest one actually used in the match are output (corresponding to the return
|
||||
code from <b>pcre2_match()</b>). Groups that did not take part in the match
|
||||
are output as "<unset>".
|
||||
are output as "<unset>". This modifier is not relevant for DFA matching (which
|
||||
does no capturing); it is ignored, with a warning message, if present.
|
||||
</P>
|
||||
<br><b>
|
||||
Testing callouts
|
||||
@ -924,15 +1165,22 @@ Testing callouts
|
||||
<P>
|
||||
A callout function is supplied when <b>pcre2test</b> calls the library matching
|
||||
functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
|
||||
set, the current captured groups are output when a callout occurs.
|
||||
set, the current captured groups are output when a callout occurs. The default
|
||||
return from the callout function is zero, which allows matching to continue.
|
||||
</P>
|
||||
<P>
|
||||
The <b>callout_fail</b> modifier can be given one or two numbers. If there is
|
||||
only one number, 1 is returned instead of 0 when a callout of that number is
|
||||
reached. If two numbers are given, 1 is returned when callout <n> is reached
|
||||
for the <m>th time. Note that callouts with string arguments are always given
|
||||
the number zero. See "Callouts" below for a description of the output when a
|
||||
callout is taken.
|
||||
only one number, 1 is returned instead of 0 (causing matching to backtrack)
|
||||
when a callout of that number is reached. If two numbers (<n>:<m>) are given, 1
|
||||
is returned when callout <n> is reached and there have been at least <m>
|
||||
callouts. The <b>callout_error</b> modifier is similar, except that
|
||||
PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
|
||||
aborted. If both these modifiers are set for the same callout number,
|
||||
<b>callout_error</b> takes precedence.
|
||||
</P>
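The interaction of <b>callout_fail</b> and <b>callout_error</b> can be sketched in Python. This is a model for illustration, not the callout function in pcre2test; the names are hypothetical, the error is represented by a string rather than the real constant, and counting all callouts so far for <m> is an assumption.

```python
CONTINUE = 0  # default return: matching continues
FAIL = 1  # causes matching to backtrack
ERROR_CALLOUT = "PCRE2_ERROR_CALLOUT"  # stands in for the real error code


def callout_return(number, count, fail=None, error=None):
    """Model the value returned from the callout function.

    number      -- the number of the callout that was reached
    count       -- how many callouts have happened so far, including this one
    fail, error -- (n, m) settings, or None; m is None if only n was given
    """
    def triggered(setting):
        if setting is None:
            return False
        n, m = setting
        return number == n and (m is None or count >= m)

    if triggered(error):
        return ERROR_CALLOUT  # callout_error takes precedence
    if triggered(fail):
        return FAIL
    return CONTINUE
```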
|
||||
<P>
|
||||
Note that callouts with string arguments are always given the number zero. See
|
||||
"Callouts" below for a description of the output when a callout is taken.
|
||||
</P>
|
||||
<P>
|
||||
The <b>callout_data</b> modifier can be given an unsigned or a negative number.
|
||||
@ -945,7 +1193,7 @@ Finding all matches in a string
|
||||
</b><br>
|
||||
<P>
|
||||
Searching for all possible matches within a subject can be requested by the
|
||||
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
|
||||
<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The difference
|
||||
between <b>global</b> and <b>altglobal</b> is that the former uses the
|
||||
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
|
||||
@ -996,19 +1244,34 @@ Testing the substitution function
|
||||
</b><br>
|
||||
<P>
|
||||
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
||||
called instead of one of the matching functions. Unlike subject strings,
|
||||
<b>pcre2test</b> does not process replacement strings for escape sequences. In
|
||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
||||
purposes.
|
||||
called instead of one of the matching functions. Note that replacement strings
|
||||
cannot contain commas, because a comma signifies the end of a modifier. This is
|
||||
not thought to be an issue in a test program.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||
the appropriate code unit width. If it is not a valid UTF-8 string, the
|
||||
individual code units are copied directly. This provides a means of passing an
|
||||
invalid UTF-8 string for testing purposes.
|
||||
</P>
|
||||
<P>
|
||||
The following modifiers set options (in addition to the normal match options)
|
||||
for <b>pcre2_substitute()</b>:
|
||||
<pre>
  global                      PCRE2_SUBSTITUTE_GLOBAL
  substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
  substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
  substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
  substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY

</PRE>
|
||||
</P>
|
||||
<P>
|
||||
After a successful substitution, the modified string is output, preceded by the
|
||||
number of replacements. This may be zero if there were no matches. Here is a
|
||||
simple example of a substitution test:
|
||||
<pre>
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
@ -1016,12 +1279,12 @@ were no matches. Here is a simple example of a substitution test:
|
||||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
</pre>
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer than 256
|
||||
characters) for substitution tests, as fixed-size buffers are used. To make it
|
||||
easy to test for buffer overflow, if the replacement string starts with a
|
||||
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
|
||||
the size of the output buffer, with the replacement string starting at the next
|
||||
character. Here is an example that tests the edge case:
|
||||
<pre>
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
@ -1029,6 +1292,19 @@ is an example that tests the edge case:
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
</pre>
|
||||
The default action of <b>pcre2_substitute()</b> is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
|
||||
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
|
||||
to go through the motions of matching and substituting, in order to compute the
|
||||
size of buffer that is required. When this happens, <b>pcre2test</b> shows the
|
||||
required buffer length (which includes space for the trailing zero) as part of
|
||||
the error message. For example:
|
||||
<pre>
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
|
||||
</pre>
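The arithmetic behind "10 code units are needed" can be sketched in plain C. This is only an illustration and not PCRE2 code: literal_substitute() is a hypothetical helper that replaces the first occurrence of a literal string. Like the overflow-length option, it always reports the required size (including the trailing zero) and writes output only when the buffer is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: replace the first occurrence of
   the literal string `find` in `subject` with `repl`.
   On entry *outlen is the size of `out` in bytes; on return it holds the
   size that is required, including the trailing zero.
   Returns the number of replacements (0 or 1), or -1 if `out` is too
   small, in which case nothing is written to `out`. */
static int literal_substitute(const char *subject, const char *find,
                              const char *repl, char *out, size_t *outlen)
{
    assert(subject != NULL && find != NULL && repl != NULL && outlen != NULL);

    size_t slen = strlen(subject);
    size_t flen = strlen(find);
    size_t rlen = strlen(repl);
    const char *hit = (flen > 0) ? strstr(subject, find) : NULL;

    /* First pass: work out the length without writing anything. */
    size_t needed = (hit != NULL ? slen - flen + rlen : slen) + 1;
    size_t avail = *outlen;
    *outlen = needed;
    if (out == NULL || avail < needed) return -1;

    /* Second pass: the buffer is known to be large enough. */
    if (hit == NULL) {
        memcpy(out, subject, slen + 1);
        return 0;
    }
    size_t prefix = (size_t)(hit - subject);
    memcpy(out, subject, prefix);
    memcpy(out + prefix, repl, rlen);
    /* Copy the rest of the subject together with its trailing zero. */
    memcpy(out + prefix + rlen, hit + flen, slen - prefix - flen + 1);
    return 1;
}
```

For the subject 123abc123, the literal abc and the replacement XYZ, the result is nine characters plus the trailing zero. A 9-unit buffer therefore fails and a 10-unit buffer succeeds, which mirrors the [9] and [10] cases above.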
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
<b>pcre2_substitute()</b>.
|
||||
@ -1100,6 +1376,16 @@ The <b>offset</b> modifier sets an offset in the subject string at which
|
||||
matching starts. Its value is a number of code units, not characters.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting an offset limit
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match
|
||||
cannot be found starting at or before this offset in the subject, a "no match"
|
||||
return is given. The data value is a number of code units, not characters. When
|
||||
this modifier is used, the <b>use_offset_limit</b> modifier must have been set
|
||||
for the pattern; if not, an error is generated.
|
||||
</P>
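The effect of an offset limit can be illustrated with a literal search in plain C. This is a sketch under stated assumptions, not PCRE2 code: find_with_offset_limit() is a hypothetical function, and it only shows that a match is accepted when it starts at or before the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical literal search, not PCRE2 code. Looks for `needle` in
   `subject`, trying start positions from `start` up to and including
   `limit` (both counted in code units, here bytes). Returns the offset
   of the match, or -1 for "no match". */
static long find_with_offset_limit(const char *subject, const char *needle,
                                   size_t start, size_t limit)
{
    assert(subject != NULL && needle != NULL);

    size_t slen = strlen(subject);
    size_t nlen = strlen(needle);

    for (size_t pos = start; pos <= limit && pos + nlen <= slen; pos++) {
        if (memcmp(subject + pos, needle, nlen) == 0) return (long)pos;
    }
    return -1;   /* no match starts at or before the limit */
}
```

With the subject xxabc and the needle abc, a limit of 2 finds the match at offset 2, whereas a limit of 1 gives "no match" even though the string occurs later in the subject.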
|
||||
<br><b>
|
||||
Setting the size of the output vector
|
||||
</b><br>
|
||||
<P>
|
||||
@ -1131,6 +1417,17 @@ this modifier has no effect, as there is no facility for passing a length.)
|
||||
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
</P>
|
||||
<br><b>
|
||||
Passing a NULL context
|
||||
</b><br>
|
||||
<P>
|
||||
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>,
|
||||
<b>pcre2_dfa_match()</b> or <b>pcre2_jit_match()</b>. If the <b>null_context</b>
|
||||
modifier is set, however, NULL is passed. This is for testing that the matching
|
||||
functions behave correctly in this case (they use default values). This
|
||||
modifier cannot be used with the <b>find_limits</b> modifier or when testing the
|
||||
substitution function.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
|
||||
<P>
|
||||
By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
|
||||
@ -1196,7 +1493,7 @@ unset substring is shown as "<unset>", as for the second data line.
|
||||
If the strings contain any non-printing characters, they are output as \xhh
|
||||
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
|
||||
are output as \x{hh...} escapes. See below for the definition of non-printing
|
||||
characters. If the <b>/aftertext</b> modifier is set, the output for substring
|
||||
characters. If the <b>aftertext</b> modifier is set, the output for substring
|
||||
0 is followed by the rest of the subject string, identified by "0+" like
|
||||
this:
|
||||
<pre>
|
||||
@ -1321,7 +1618,9 @@ item to be tested. For example:
|
||||
This output indicates that callout number 0 occurred for a match attempt
|
||||
starting at the fourth character of the subject string, when the pointer was at
|
||||
the seventh character, and when the next pattern item was \d. Just
|
||||
one circumflex is output if the start and current positions are the same.
|
||||
one circumflex is output if the start and current positions are the same, or if
|
||||
the current position precedes the start position, which can happen if the
|
||||
callout is in a lookbehind assertion.
|
||||
</P>
|
||||
<P>
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
||||
@ -1387,7 +1686,7 @@ therefore shown as hex escapes.
|
||||
<P>
|
||||
When <b>pcre2test</b> is outputting text that is a matched part of a subject
|
||||
string, it behaves in the same way, unless a different locale has been set for
|
||||
the pattern (using the <b>/locale</b> modifier). In this case, the
|
||||
the pattern (using the <b>locale</b> modifier). In this case, the
|
||||
<b>isprint()</b> function is used to distinguish printing and non-printing
|
||||
characters.
|
||||
<a name="saverestore"></a></P>
|
||||
@ -1413,11 +1712,16 @@ can be used to test these functions.
|
||||
<P>
|
||||
When a pattern with <b>push</b> modifier is successfully compiled, it is pushed
|
||||
onto a stack of compiled patterns, and <b>pcre2test</b> expects the next line to
|
||||
contain a new pattern (or command) instead of a subject line. By this means, a
|
||||
number of patterns can be compiled and retained. The <b>push</b> modifier is
|
||||
incompatible with <b>posix</b>, and control modifiers that act at match time are
|
||||
ignored (with a message). The <b>jitverify</b> modifier applies only at compile
|
||||
time. The command
|
||||
contain a new pattern (or command) instead of a subject line. By contrast,
|
||||
the <b>pushcopy</b> modifier causes a copy of the compiled pattern to be
|
||||
stacked, leaving the original available for immediate matching. By using
|
||||
<b>push</b> and/or <b>pushcopy</b>, a number of patterns can be compiled and
|
||||
retained. These modifiers are incompatible with <b>posix</b>, and control
|
||||
modifiers that act at match time are ignored (with a message) for the stacked
|
||||
patterns. The <b>jitverify</b> modifier applies only at compile time.
|
||||
</P>
|
||||
<P>
|
||||
The command
|
||||
<pre>
|
||||
#save <filename>
|
||||
</pre>
|
||||
@ -1434,7 +1738,8 @@ usual by an empty line or end of file. This command may be followed by a
|
||||
modifier list containing only
|
||||
<a href="#controlmodifiers">control modifiers</a>
|
||||
that act after a pattern has been compiled. In particular, <b>hex</b>,
|
||||
<b>posix</b>, and <b>push</b> are not allowed, nor are any
|
||||
<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
|
||||
nor are any
|
||||
<a href="#optionmodifiers">option-setting modifiers.</a>
|
||||
The JIT modifiers are, however, permitted. Here is an example that saves and
|
||||
reloads two patterns.
|
||||
@ -1452,6 +1757,11 @@ reloads two patterns.
|
||||
If <b>jitverify</b> is used with #pop, it does not automatically imply
|
||||
<b>jit</b>, which is different behaviour from when it is used on a pattern.
|
||||
</P>
|
||||
<P>
|
||||
The #popcopy command is analogous to the <b>pushcopy</b> modifier in that it
|
||||
makes current a copy of the topmost stack pattern, leaving the original still
|
||||
on the stack.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||
@ -1469,9 +1779,9 @@ Cambridge, England.
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 May 2015
|
||||
Last updated: 28 December 2016
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -67,15 +67,20 @@ In UTF modes, the dot metacharacter matches one UTF character instead of a
|
||||
single code unit.
|
||||
</P>
|
||||
<P>
|
||||
The escape sequence \C can be used to match a single code unit, in a UTF mode,
|
||||
The escape sequence \C can be used to match a single code unit in a UTF mode,
|
||||
but its use can lead to some strange effects because it breaks up multi-unit
|
||||
characters (see the description of \C in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation). The use of \C is not supported in the alternative matching
|
||||
function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
|
||||
optimization. If JIT optimization is requested for a UTF pattern that contains
|
||||
\C, it will not succeed, and so the matching will be carried out by the normal
|
||||
interpretive function.
|
||||
documentation).
|
||||
</P>
|
||||
<P>
|
||||
The use of \C is not supported by the alternative matching function
|
||||
<b>pcre2_dfa_match()</b> when in UTF-8 or UTF-16 mode, that is, when a character
|
||||
may consist of more than one code unit. The use of \C in these modes provokes
|
||||
a match-time error. Also, the JIT optimization does not support \C in these
|
||||
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
|
||||
contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
|
||||
the matching will be carried out by the normal interpretive function.
|
||||
</P>
|
||||
<P>
|
||||
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
|
||||
@ -126,11 +131,22 @@ as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
||||
strings to be in host byte order.
|
||||
</P>
|
||||
<P>
|
||||
The entire string is checked before any other processing takes place. In
|
||||
addition to checking the format of the string, there is a check to ensure that
|
||||
all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area.
|
||||
The so-called "non-character" code points are not excluded because Unicode
|
||||
corrigendum #9 makes it clear that they should not be.
|
||||
A UTF string is checked before any other processing takes place. In the case of
|
||||
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> calls with a non-zero starting
|
||||
offset, the check is applied only to that part of the subject that could be
|
||||
inspected during matching, and there is a check that the starting offset points
|
||||
to the first code unit of a character or to the end of the subject. If there
|
||||
are no lookbehind assertions in the pattern, the check starts at the starting
|
||||
offset. Otherwise, it starts at the length of the longest lookbehind before the
|
||||
starting offset, or at the start of the subject if there are not that many
|
||||
characters before the starting offset. Note that the sequences \b and \B are
|
||||
one-character lookbehinds.
|
||||
</P>
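For UTF-8, the two rules about the starting offset can be sketched as follows. This is an illustration only and not the library's code: both function names are hypothetical, and the second function assumes that the text before the offset is well formed enough to step back over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `offset` is the end of the subject or points at the first
   code unit of a UTF-8 character, 0 otherwise. Continuation bytes in
   UTF-8 always have the bit pattern 10xxxxxx. */
static int utf8_offset_is_char_start(const uint8_t *subject, size_t length,
                                     size_t offset)
{
    assert(subject != NULL);
    if (offset > length) return 0;
    if (offset == length) return 1;
    return (subject[offset] & 0xC0) != 0x80;
}

/* Backs up `max_lookbehind` characters from `offset`, stopping at the
   start of the subject, and returns the code unit offset at which a
   validity check would begin. Assumes `offset` passed the test above. */
static size_t utf8_check_start(const uint8_t *subject, size_t offset,
                               size_t max_lookbehind)
{
    assert(subject != NULL);
    while (max_lookbehind > 0 && offset > 0) {
        offset--;
        /* Skip continuation bytes to reach the start of the character. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80) offset--;
        max_lookbehind--;
    }
    return offset;
}
```

A pattern containing \b would call the second function with a lookbehind length of 1, because \b inspects one character before the current position.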
|
||||
<P>
|
||||
In addition to checking the format of the string, there is a check to ensure
|
||||
that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
|
||||
area. The so-called "non-character" code points are not excluded because
|
||||
Unicode corrigendum #9 makes it clear that they should not be.
|
||||
</P>
|
||||
<P>
|
||||
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
|
||||
@ -232,9 +248,9 @@ Errors in UTF-16 strings
|
||||
<P>
|
||||
The following negative error codes are given for invalid UTF-16 strings:
|
||||
<pre>
  PCRE_UTF16_ERR1  Missing low surrogate at end of string
  PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
  PCRE_UTF16_ERR3  Isolated low surrogate
  PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
  PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
  PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate

<a name="utf32strings"></a></PRE>
|
||||
</P>
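The three UTF-16 conditions can be told apart with a short scan. The following is a minimal sketch and not the library's implementation: the enum names and values are placeholders (they are not the values of the PCRE2_ERROR_UTF16_ERRn constants), and reporting the offset of the offending high or low surrogate is this sketch's own choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder result codes for this sketch only. */
enum utf16_check {
    UTF16_OK = 0,
    UTF16_MISSING_LOW_AT_END = 1,  /* high surrogate is the last unit */
    UTF16_INVALID_LOW = 2,         /* high surrogate not followed by low */
    UTF16_ISOLATED_LOW = 3         /* low surrogate with no high before it */
};

static enum utf16_check check_utf16(const uint16_t *s, size_t length,
                                    size_t *erroroffset)
{
    assert(s != NULL && erroroffset != NULL);

    for (size_t i = 0; i < length; i++) {
        enum utf16_check rc = UTF16_OK;
        uint16_t c = s[i];

        if ((c & 0xF800) != 0xD800) continue;   /* not a surrogate */

        if ((c & 0xFC00) == 0xDC00) rc = UTF16_ISOLATED_LOW;
        else if (i + 1 >= length) rc = UTF16_MISSING_LOW_AT_END;
        else if ((s[i + 1] & 0xFC00) != 0xDC00) rc = UTF16_INVALID_LOW;

        if (rc != UTF16_OK) {
            *erroroffset = i;
            return rc;
        }
        i++;   /* valid pair: skip the low surrogate */
    }
    return UTF16_OK;
}
```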
|
||||
@ -244,8 +260,8 @@ Errors in UTF-32 strings
|
||||
<P>
|
||||
The following negative error codes are given for invalid UTF-32 strings:
|
||||
<pre>
  PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
  PCRE_UTF32_ERR2  Code point is greater than 0x10ffff
  PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff

</PRE>
|
||||
</P>
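Because every UTF-32 code unit is a whole code point, the two conditions reduce to range tests. The sketch below assumes nothing about the library's internals: check_utf32() is a hypothetical name, and its return values 1 and 2 merely echo the ERR1 and ERR2 numbering rather than the real error code values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if every code unit is a valid code point, 1 for a surrogate
   (0xd800 to 0xdfff), 2 for a value greater than 0x10ffff. On failure,
   *erroroffset is set to the index of the offending code unit. */
static int check_utf32(const uint32_t *s, size_t length, size_t *erroroffset)
{
    assert(s != NULL && erroroffset != NULL);

    for (size_t i = 0; i < length; i++) {
        int rc = 0;
        if (s[i] >= 0xD800u && s[i] <= 0xDFFFu) rc = 1;
        else if (s[i] > 0x10FFFFu) rc = 2;
        if (rc != 0) {
            *erroroffset = i;
            return rc;
        }
    }
    return 0;
}
```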
|
||||
@ -264,9 +280,9 @@ Cambridge, England.
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 23 November 2014
|
||||
Last updated: 03 July 2016
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
@ -91,6 +91,12 @@ in the library.
|
||||
<tr><td><a href="pcre2_callout_enumerate.html">pcre2_callout_enumerate</a></td>
|
||||
<td> Enumerate callouts in a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
|
||||
<td> Copy a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
|
||||
<td> Copy a compiled pattern and its character tables</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
|
||||
<td> Free a compiled pattern</td></tr>
|
||||
|
||||
@ -210,9 +216,15 @@ in the library.
|
||||
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
|
||||
<td> Set the match limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_max_pattern_length.html">pcre2_set_max_pattern_length</a></td>
|
||||
<td> Set the maximum length of pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_newline.html">pcre2_set_newline</a></td>
|
||||
<td> Set the newline convention</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_offset_limit.html">pcre2_set_offset_limit</a></td>
|
||||
<td> Set the offset limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_parens_nest_limit.html">pcre2_set_parens_nest_limit</a></td>
|
||||
<td> Set the parentheses nesting limit</td></tr>
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2 3 "13 April 2015" "PCRE2 10.20"
|
||||
.TH PCRE2 3 "16 October 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH INTRODUCTION
|
||||
@ -118,8 +118,10 @@ running redundant checks.
|
||||
.P
|
||||
The use of the \eC escape sequence in a UTF-8 or UTF-16 pattern can lead to
|
||||
problems, because it may leave the current matching point in the middle of a
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
|
||||
lock out the use of \eC, causing a compile-time error if it is encountered.
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
|
||||
application to lock out the use of \eC, causing a compile-time error if it is
|
||||
encountered. It is also possible to build PCRE2 with the use of \eC permanently
|
||||
disabled.
|
||||
.P
|
||||
Another way that performance can be hit is by running a pattern that has a very
|
||||
large search tree against a string that will never match. Nested unlimited
|
||||
@ -187,6 +189,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 13 April 2015
|
||||
Last updated: 16 October 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
6509
pcre2/doc/pcre2.txt
File diff suppressed because it is too large
31
pcre2/doc/pcre2_code_copy.3
Normal file
@ -0,0 +1,31 @@
|
||||
.TH PCRE2_CODE_COPY 3 "22 November 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
.rs
|
||||
.sp
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP);
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This function makes a copy of the memory used for a compiled pattern, excluding
|
||||
any memory used by the JIT compiler. Without a subsequent call to
|
||||
\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching. The
|
||||
pointer to the character tables is copied, not the tables themselves (see
|
||||
\fBpcre2_code_copy_with_tables()\fP). The yield of the function is NULL if
|
||||
\fIcode\fP is NULL or if sufficient memory cannot be obtained.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
page and a description of the POSIX API in the
|
||||
.\" HREF
|
||||
\fBpcre2posix\fP
|
||||
.\"
|
||||
page.
|
32
pcre2/doc/pcre2_code_copy_with_tables.3
Normal file
@ -0,0 +1,32 @@
|
||||
.TH PCRE2_CODE_COPY_WITH_TABLES 3 "22 November 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
.rs
|
||||
.sp
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP);
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This function makes a copy of the memory used for a compiled pattern, excluding
|
||||
any memory used by the JIT compiler. Without a subsequent call to
|
||||
\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching.
|
||||
Unlike \fBpcre2_code_copy()\fP, a separate copy of the character tables is also
|
||||
made, with the new code pointing to it. This memory will be automatically freed
|
||||
when \fBpcre2_code_free()\fP is called. The yield of the function is NULL if
|
||||
\fIcode\fP is NULL or if sufficient memory cannot be obtained.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
page and a description of the POSIX API in the
|
||||
.\" HREF
|
||||
\fBpcre2posix\fP
|
||||
.\"
|
||||
page.
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_CODE_FREE 3 "21 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_CODE_FREE 3 "29 July 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -7,7 +7,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B pcre2_code_free(pcre2_code *\fIcode\fP);
|
||||
.B void pcre2_code_free(pcre2_code *\fIcode\fP);
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_DFA_MATCH 3 "12 May 2013" "PCRE2 10.00"
|
||||
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -33,8 +33,8 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
|
||||
\fIwscount\fP Number of elements in the vector
|
||||
.sp
|
||||
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
||||
up a callout function. The \fIlength\fP and \fIstartoffset\fP values are code
|
||||
units, not characters. The options are:
|
||||
up a callout function or specify the recursion limit. The \fIlength\fP and
|
||||
\fIstartoffset\fP values are code units, not characters. The options are:
|
||||
.sp
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_GET_ERROR_MESSAGE 3 "21 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -23,7 +23,10 @@ errors are negative numbers. The arguments are:
|
||||
\fIbufflen\fP the length of the buffer (code units)
|
||||
.sp
|
||||
The function returns the length of the message, excluding the trailing zero, or
|
||||
a negative error code if the buffer is too small.
|
||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
||||
this case, the returned message is truncated (but still with a trailing zero).
|
||||
If \fIerrorcode\fP does not contain a recognized error code number, the
|
||||
negative value PCRE2_ERROR_BADDATA is returned.
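The three possible outcomes (message length, truncation, unrecognized code) can be modelled with a toy lookup in plain C. This is a sketch only: sketch_get_error_message(), its three-entry message table indexed from zero, and the SKETCH_ constants are all hypothetical stand-ins and do not reproduce the library's messages or error code values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder values, not the real PCRE2_ERROR_* numbers. */
enum { SKETCH_NOMEMORY = -1, SKETCH_BADDATA = -2 };

/* Copies the message for `errorcode` into `buffer`, always leaving a
   trailing zero when bufflen > 0. Returns the message length (excluding
   the trailing zero), SKETCH_NOMEMORY if the message had to be truncated,
   or SKETCH_BADDATA if the code is not recognized. */
static int sketch_get_error_message(int errorcode, char *buffer,
                                    size_t bufflen)
{
    static const char *const messages[] = {
        "no error", "no match", "partial match"
    };
    size_t count = sizeof messages / sizeof messages[0];

    assert(buffer != NULL);
    if (errorcode < 0 || (size_t)errorcode >= count) return SKETCH_BADDATA;
    if (bufflen == 0) return SKETCH_NOMEMORY;

    const char *msg = messages[errorcode];
    size_t len = strlen(msg);
    size_t copy = (len < bufflen) ? len : bufflen - 1;

    memcpy(buffer, msg, copy);
    buffer[copy] = '\0';
    return (copy < len) ? SKETCH_NOMEMORY : (int)len;
}
```

A caller that receives the "no memory" result still holds a usable, zero-terminated prefix of the message, so it can either print the truncated text or retry with a larger buffer.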
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_MATCH_DATA_CREATE 3 "22 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_MATCH_DATA_CREATE 3 "29 July 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -7,7 +7,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B pcre2_match_data_create(uint32_t \fIovecsize\fP,
|
||||
.B pcre2_match_data *pcre2_match_data_create(uint32_t \fIovecsize\fP,
|
||||
.B " pcre2_general_context *\fIgcontext\fP);"
|
||||
.fi
|
||||
.
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "24 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "29 July 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -7,8 +7,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B pcre2_match_data_create_from_pattern(const pcre2_code *\fIcode\fP,
|
||||
.B " pcre2_general_context *\fIgcontext\fP);"
|
||||
.B pcre2_match_data *pcre2_match_data_create_from_pattern(
|
||||
.B " const pcre2_code *\fIcode\fP, pcre2_general_context *\fIgcontext\fP);"
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_PATTERN_INFO 3 "01 December 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_PATTERN_INFO 3 "21 November 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -30,19 +30,20 @@ request are as follows:
|
||||
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
||||
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
|
||||
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
|
||||
0 nothing set
|
||||
1 first code unit is set
|
||||
2 start of string or after newline
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \eC
|
||||
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
|
||||
exist in the pattern
|
||||
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
|
||||
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
|
||||
0 nothing set
|
||||
1 code unit is set
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
|
||||
empty string, 0 otherwise
|
||||
PCRE2_INFO_MATCHLIMIT Match limit if set,
|
||||
@ -50,8 +51,8 @@ request are as follows:
|
||||
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
|
||||
lookbehind assertion
|
||||
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMECOUNT Number of named subpatterns
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMETABLE Pointer to name table
|
||||
PCRE2_INFO_NEWLINE Code for the newline sequence:
|
||||
PCRE2_NEWLINE_CR
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_SERIALIZE_DECODE 3 "19 January 2015" "PCRE2 10.10"
|
||||
.TH PCRE2_SERIALIZE_DECODE 3 "02 September 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -8,7 +8,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.PP
|
||||
.nf
|
||||
.B int32_t pcre2_serialize_decode(pcre2_code **\fIcodes\fP,
|
||||
.B " int32_t \fInumber_of_codes\fP, const uint32_t *\fIbytes\fP,"
|
||||
.B " int32_t \fInumber_of_codes\fP, const uint8_t *\fIbytes\fP,"
|
||||
.B " pcre2_general_context *\fIgcontext\fP);"
|
||||
.fi
|
||||
.
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_SERIALIZE_ENCODE 3 "19 January 2015" "PCRE2 10.10"
|
||||
.TH PCRE2_SERIALIZE_ENCODE 3 "02 September 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -7,8 +7,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B int32_t pcre2_serialize_encode(pcre2_code **\fIcodes\fP,
|
||||
.B " int32_t \fInumber_of_codes\fP, uint32_t **\fIserialized_bytes\fP,"
|
||||
.B int32_t pcre2_serialize_encode(const pcre2_code **\fIcodes\fP,
|
||||
.B " int32_t \fInumber_of_codes\fP, uint8_t **\fIserialized_bytes\fP,"
|
||||
.B " PCRE2_SIZE *\fIserialized_size\fP, pcre2_general_context *\fIgcontext\fP);"
|
||||
.fi
|
||||
.
|
||||
|
31
pcre2/doc/pcre2_set_max_pattern_length.3
Normal file
@ -0,0 +1,31 @@
|
||||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "05 October 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
.rs
|
||||
.sp
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B int pcre2_set_max_pattern_length(pcre2_compile_context *\fIccontext\fP,
|
||||
.B " PCRE2_SIZE \fIvalue\fP);"
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This function sets, in a compile context, the maximum text length (in code
|
||||
units) of the pattern that can be compiled. The result is always zero. If a
|
||||
longer pattern is passed to \fBpcre2_compile()\fP there is an immediate error
|
||||
return. The default is effectively unlimited, being the largest value a
|
||||
PCRE2_SIZE variable can hold.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
page and a description of the POSIX API in the
|
||||
.\" HREF
|
||||
\fBpcre2posix\fP
|
||||
.\"
|
||||
page.
|
28
pcre2/doc/pcre2_set_offset_limit.3
Normal file
@ -0,0 +1,28 @@
|
||||
.TH PCRE2_SET_OFFSET_LIMIT 3 "22 September 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
.rs
|
||||
.sp
|
||||
.B #include <pcre2.h>
|
||||
.PP
|
||||
.nf
|
||||
.B int pcre2_set_offset_limit(pcre2_match_context *\fImcontext\fP,
|
||||
.B " PCRE2_SIZE \fIvalue\fP);"
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This function sets the offset limit field in a match context. The result is
|
||||
always zero.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
page and a description of the POSIX API in the
|
||||
.\" HREF
|
||||
\fBpcre2posix\fP
|
||||
.\"
|
||||
page.
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2_SUBSTITUTE 3 "11 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_SUBSTITUTE 3 "12 December 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -47,20 +47,25 @@ units, not characters, as is the contents of the variable pointed at by
|
||||
\fIoutlengthptr\fP, which is updated to the actual length of the new string.
|
||||
The options are:
|
||||
.sp
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject string is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject string is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
||||
is not a valid match
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject or replacement for
|
||||
UTF validity (only relevant if PCRE2_UTF
|
||||
was set at compile time)
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
|
||||
subject is not a valid match
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject or replacement
|
||||
for UTF validity (only relevant if
|
||||
PCRE2_UTF was set at compile time)
|
||||
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
|
||||
.sp
|
||||
The function returns the number of substitutions, which may be zero if there
|
||||
were no matches. The result can be greater than one only when
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
|
||||
is returned.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
|
File diff suppressed because it is too large
@ -1,4 +1,4 @@
|
||||
.TH PCRE2BUILD 3 "23 April 2015" "PCRE2 10.20"
|
||||
.TH PCRE2BUILD 3 "01 November 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.
|
||||
@ -132,11 +132,20 @@ Pattern escapes such as \ed and \ew do not by default make use of Unicode
|
||||
properties. The application can request that they do by setting the PCRE2_UCP
|
||||
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||
request this by starting with (*UCP).
|
||||
.P
|
||||
.
|
||||
.
|
||||
.SH "DISABLING THE USE OF \eC"
|
||||
.rs
|
||||
.sp
|
||||
The \eC escape sequence, which matches a single code unit, even in a UTF mode,
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. It can be locked out by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
point in the middle of a multi-code-unit character. The application can lock it
|
||||
out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
|
||||
\fBpcre2_compile()\fP. There is also a build-time option
|
||||
.sp
|
||||
--enable-never-backslash-C
|
||||
.sp
|
||||
(note the upper case C) which locks out the use of \eC entirely.
|
||||
.
|
||||
.
|
||||
.SH "JUST-IN-TIME COMPILER SUPPORT"
|
||||
@ -343,6 +352,19 @@ and equivalent run-time options, refer to these character values in an EBCDIC
|
||||
environment.
|
||||
.
|
||||
.
|
||||
.SH "PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS"
|
||||
.rs
|
||||
.sp
|
||||
By default, on non-Windows systems, \fBpcre2grep\fP supports the use of
|
||||
callouts with string arguments within the patterns it is matching, in order to
|
||||
run external scripts. For details, see the
|
||||
.\" HREF
|
||||
\fBpcre2grep\fP
|
||||
.\"
|
||||
documentation. This support can be disabled by adding
|
||||
--disable-pcre2grep-callout to the \fBconfigure\fP command.
|
||||
.
|
||||
.
|
||||
.SH "PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT"
|
||||
.rs
|
||||
.sp
|
||||
@ -363,16 +385,19 @@ they are not.
|
||||
.sp
|
||||
\fBpcre2grep\fP uses an internal buffer to hold a "window" on the file it is
|
||||
scanning, in order to be able to output "before" and "after" lines when it
|
||||
finds a match. The size of the buffer is controlled by a parameter whose
|
||||
default value is 20K. The buffer itself is three times this size, but because
|
||||
of the way it is used for holding "before" lines, the longest line that is
|
||||
guaranteed to be processable is the parameter size. You can change the default
|
||||
parameter value by adding, for example,
|
||||
finds a match. The starting size of the buffer is controlled by a parameter
|
||||
whose default value is 20K. The buffer itself is three times this size, but
|
||||
because of the way it is used for holding "before" lines, the longest line that
|
||||
is guaranteed to be processable is the parameter size. If a longer line is
|
||||
encountered, \fBpcre2grep\fP automatically expands the buffer, up to a
|
||||
specified maximum size, whose default is 1M or the starting size, whichever is
|
||||
the larger. You can change the default parameter values by adding, for example,
|
||||
.sp
|
||||
--with-pcre2grep-bufsize=50K
|
||||
--with-pcre2grep-bufsize=51200
|
||||
--with-pcre2grep-max-bufsize=2097152
|
||||
.sp
|
||||
to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can override this
|
||||
value by using --buffer-size on the command line..
|
||||
to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override
|
||||
these values by using --buffer-size and --max-buffer-size on the command line.
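The sizing rules in this paragraph can be illustrated with a small C sketch. It is not taken from pcre2grep.c: the type and function names are hypothetical, and doubling is an assumed growth policy, since the text only says that the buffer is expanded "up to a specified maximum size". The documented parts are the 20K and 1M defaults, the rule that the default maximum is never smaller than the starting size, and the allocation of three times the parameter size.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_BUFSIZE     (20 * 1024)    /* 20K, the documented default */
#define DEFAULT_MAX_BUFSIZE (1024 * 1024)  /* 1M, the documented default */

/* Hypothetical container for the two parameters described in the text. */
typedef struct {
  size_t bufsize;      /* parameter size: longest line guaranteed to fit */
  size_t max_bufsize;  /* limit for automatic expansion */
} grep_buffer;

/* The maximum is the configured value or the starting size, whichever is
the larger. */
size_t effective_max(size_t start, size_t configured_max)
{
  return (start > configured_max) ? start : configured_max;
}

void buffer_init(grep_buffer *b, size_t start, size_t configured_max)
{
  assert(b != NULL && start > 0);
  b->bufsize = start;
  b->max_bufsize = effective_max(start, configured_max);
}

/* The block of memory actually used is three times the parameter size. */
size_t allocated_bytes(const grep_buffer *b)
{
  return 3 * b->bufsize;
}

/* Make room for a line of line_len bytes. Doubling is an ASSUMPTION made
for this sketch. Returns 1 if the line now fits, or 0 if it is too long
even at the maximum size (the case in which an error is reported). */
int buffer_fit_line(grep_buffer *b, size_t line_len)
{
  while (b->bufsize < line_len)
    {
    size_t next;
    if (b->bufsize >= b->max_bufsize) return 0;
    next = b->bufsize * 2;
    if (next > b->max_bufsize) next = b->max_bufsize;
    b->bufsize = next;
    }
  return 1;
}
```

With the values from the configure example (51200 and 2097152), the initial allocation under this sketch is 153600 bytes, and a 60000-byte line triggers one expansion.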
|
||||
.
|
||||
.
|
||||
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
|
||||
@ -490,6 +515,28 @@ information about code coverage, see the \fBgcov\fP and \fBlcov\fP
|
||||
documentation.
|
||||
.
|
||||
.
|
||||
.SH "SUPPORT FOR FUZZERS"
|
||||
.rs
|
||||
.sp
|
||||
There is a special option for use by people who want to run fuzzing tests on
|
||||
PCRE2:
|
||||
.sp
|
||||
--enable-fuzz-support
|
||||
.sp
|
||||
At present this applies only to the 8-bit library. If set, it causes an extra
|
||||
library called libpcre2-fuzzsupport.a to be built, but not installed. This
|
||||
contains a single function called LLVMFuzzerTestOneInput() whose arguments are
|
||||
a pointer to a string and the length of the string. When called, this function
|
||||
tries to compile the string as a pattern, and if that succeeds, to match it.
|
||||
This is done both with no options and with some random option bits that are
|
||||
generated from the string. Setting --enable-fuzz-support also causes a binary
|
||||
called \fBpcre2fuzzcheck\fP to be created. This is normally run under valgrind
|
||||
or used when PCRE2 is compiled with address sanitizing enabled. It calls the
|
||||
fuzzing function and outputs information about what it is doing. The input strings
|
||||
are specified by arguments: if an argument starts with "=" the rest of it is a
|
||||
literal input string. Otherwise, it is assumed to be a file name, and the
|
||||
contents of the file are the test string.
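The argument convention described above (a leading "=" introduces a literal string, anything else is a file name) might be handled along the following lines. This is only a sketch under stated assumptions: load_fuzz_input() is a hypothetical helper, not a function from the PCRE2 sources, and the real pcre2fuzzcheck may read its input differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: obtain the test string named by one command-line
argument. On success it returns a malloc'd buffer (the caller frees it) and
stores the length in *len. On failure it returns NULL. */
unsigned char *load_fuzz_input(const char *arg, size_t *len)
{
  unsigned char *buf;
  FILE *f;
  long size;

  assert(arg != NULL && len != NULL);

  if (arg[0] == '=')               /* literal input follows the "=" */
    {
    *len = strlen(arg + 1);
    buf = malloc(*len + 1);
    if (buf == NULL) return NULL;
    memcpy(buf, arg + 1, *len + 1);
    return buf;
    }

  f = fopen(arg, "rb");            /* otherwise, treat it as a file name */
  if (f == NULL) return NULL;
  if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0)
    {
    fclose(f);
    return NULL;
    }
  rewind(f);

  buf = malloc((size_t)size + 1);
  if (buf == NULL)
    {
    fclose(f);
    return NULL;
    }
  *len = fread(buf, 1, (size_t)size, f);
  buf[*len] = '\0';
  fclose(f);
  return buf;
}
```

The buffer and its length are what would then be passed to the fuzzing function, whose arguments the text describes as a pointer to a string and the length of the string.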
|
||||
.
|
||||
.SH "SEE ALSO"
|
||||
.rs
|
||||
.sp
|
||||
@ -510,6 +557,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 24 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 01 November 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2CALLOUT 3 "23 March 2015" "PCRE2 10.20"
|
||||
.TH PCRE2CALLOUT 3 "29 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
@ -40,11 +40,20 @@ two callout points:
|
||||
.sp
|
||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
||||
automatically inserts callouts, all with number 255, before each item in the
|
||||
pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
pattern except for immediately before or after a callout item in the pattern.
|
||||
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
.sp
|
||||
A(?C3)B
|
||||
.sp
|
||||
it is processed as if it were
|
||||
.sp
|
||||
(?C255)A(?C3)B(?C255)
|
||||
.sp
|
||||
Here is a more complicated example:
|
||||
.sp
|
||||
A(\ed{2}|--)
|
||||
.sp
|
||||
it is processed as if it were
|
||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||
.sp
|
||||
(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||
.sp
|
||||
@ -91,10 +100,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||
No match
|
||||
.sp
|
||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||
and therefore the callouts that would be taken for the backtracks do not occur.
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||
case, the output changes to this:
|
||||
(because it is being treated as a++) and therefore the callouts that would be
|
||||
taken for the backtracks do not occur. You can disable the auto-possessify
|
||||
feature by passing PCRE2_NO_AUTO_POSSESS to \fBpcre2_compile()\fP, or starting
|
||||
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
||||
.sp
|
||||
--->aaaa
|
||||
+0 ^ a+
|
||||
@ -220,8 +229,8 @@ but the intention is never to remove any of the existing fields.
|
||||
.sp
|
||||
For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP
|
||||
contains the number of the callout, in the range 0-255. This is the number
|
||||
that follows (?C for manual callouts; it is 255 for automatically generated
|
||||
callouts.
|
||||
that follows (?C for callouts that are part of the pattern; it is 255 for
|
||||
automatically generated callouts.
|
||||
.
|
||||
.
|
||||
.SS "Fields for string callouts"
|
||||
@ -286,10 +295,15 @@ The \fIpattern_position\fP field contains the offset in the pattern string to
|
||||
the next item to be matched.
|
||||
.P
|
||||
The \fInext_item_length\fP field contains the length of the next item to be
|
||||
matched in the pattern string. When the callout immediately precedes an
|
||||
alternation bar, a closing parenthesis, or the end of the pattern, the length
|
||||
is zero. When the callout precedes an opening parenthesis, the length is that
|
||||
of the entire subpattern.
|
||||
processed in the pattern string. When the callout is at the end of the pattern,
|
||||
the length is zero. When the callout precedes an opening parenthesis, the
|
||||
length includes meta characters that follow the parenthesis. For example, in a
|
||||
callout before an assertion such as (?=ab) the length is 3. For an
|
||||
alternation bar or a closing parenthesis, the length is one, unless a closing
|
||||
parenthesis is followed by a quantifier, in which case its length is included.
|
||||
(This changed in release 10.23. In earlier releases, before an opening
|
||||
parenthesis the length was that of the entire subpattern, and before an
|
||||
alternation bar or a closing parenthesis the length was zero.)
|
||||
.P
|
||||
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
|
||||
help in distinguishing between different automatic callouts, which all have the
|
||||
@ -382,6 +396,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 March 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 29 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20"
|
||||
.TH PCRE2COMPAT 3 "18 October 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
||||
@ -96,7 +96,7 @@ processed as anchored at the point where they are tested.
|
||||
one that is backtracked onto acts. For example, in the pattern
|
||||
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
|
||||
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
|
||||
same as PCRE2, but there are examples where it differs.
|
||||
same as PCRE2, but there are cases where it differs.
|
||||
.P
|
||||
11. Most backtracking verbs in assertions have their normal actions. They are
|
||||
not confined to the assertion.
|
||||
@ -109,17 +109,18 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
|
||||
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
|
||||
names is not as general as Perl's. This is a consequence of the fact that PCRE2
|
||||
works internally just with numbers, using an external table to translate
|
||||
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b)B),
|
||||
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)),
|
||||
where the two capturing parentheses have the same number but different names,
|
||||
is not supported, and causes an error at compile time. If it were allowed, it
|
||||
would not be possible to distinguish which parentheses matched, because both
|
||||
names map to capturing subpattern number 1. To avoid this confusing situation,
|
||||
an error is given at compile time.
|
||||
.P
|
||||
14. Perl recognizes comments in some places that PCRE2 does not, for example,
|
||||
between the ( and ? at the start of a subpattern. If the /x modifier is set,
|
||||
Perl allows white space between ( and ? (though current Perls warn that this is
|
||||
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
|
||||
14. Perl used to recognize comments in some places that PCRE2 does not, for
|
||||
example, between the ( and ? at the start of a subpattern. If the /x modifier
|
||||
is set, Perl allowed white space between ( and ? though the latest Perls give
|
||||
an error (for a while it was just deprecated). There may still be some cases
|
||||
where Perl behaves differently.
|
||||
.P
|
||||
15. Perl, when in warning mode, gives warnings for character classes such as
|
||||
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
|
||||
@ -141,33 +142,37 @@ list is with respect to Perl 5.10:
|
||||
each alternative branch of a lookbehind assertion can match a different length
|
||||
of string. Perl requires them all to have the same length.
|
||||
.sp
|
||||
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
(b) From PCRE2 10.23, back references to groups of fixed length are supported
|
||||
in lookbehinds, provided that there is no possibility of referencing a
|
||||
non-unique number or name. Perl does not support backreferences in lookbehinds.
|
||||
.sp
|
||||
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
meta-character matches only at the very end of the string.
|
||||
.sp
|
||||
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
can be made to issue a warning.)
|
||||
.sp
|
||||
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
inverted, that is, by default they are not greedy, but if followed by a
|
||||
question mark they are.
|
||||
.sp
|
||||
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
only at the first matching position in the subject string.
|
||||
.sp
|
||||
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
|
||||
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
|
||||
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
|
||||
.sp
|
||||
(g) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
(h) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
by the PCRE2_BSR_ANYCRLF option.
|
||||
.sp
|
||||
(h) The callout facility is PCRE2-specific.
|
||||
(i) The callout facility is PCRE2-specific.
|
||||
.sp
|
||||
(i) The partial matching facility is PCRE2-specific.
|
||||
(j) The partial matching facility is PCRE2-specific.
|
||||
.sp
|
||||
(j) The alternative matching function (\fBpcre2_dfa_match()\fP matches in a
|
||||
(k) The alternative matching function (\fBpcre2_dfa_match()\fP) matches in a
|
||||
different way and is not Perl-compatible.
|
||||
.sp
|
||||
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
|
||||
(l) PCRE2 recognizes some special sequences such as (*CR) at the start of
|
||||
a pattern that set overall options that cannot be changed within the pattern.
|
||||
.
|
||||
.
|
||||
@ -185,6 +190,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 15 March 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 18 October 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -20,28 +20,31 @@
|
||||
*************************************************/
|
||||
|
||||
/* This is a demonstration program to illustrate a straightforward way of
|
||||
calling the PCRE2 regular expression library from a C program. See the
|
||||
using the PCRE2 regular expression library from a C program. See the
|
||||
pcre2sample documentation for a short discussion ("man pcre2sample" if you have
|
||||
the PCRE2 man pages installed). PCRE2 is a revised API for the library, and is
|
||||
incompatible with the original PCRE API.
|
||||
|
||||
There are actually three libraries, each supporting a different code unit
|
||||
width. This demonstration program uses the 8-bit library.
|
||||
width. This demonstration program uses the 8-bit library. The default is to
|
||||
process each code unit as a separate character, but if the pattern begins with
|
||||
"(*UTF)", both it and the subject are treated as UTF-8 strings, where
|
||||
characters may occupy multiple code units.
|
||||
|
||||
In Unix-like environments, if PCRE2 is installed in your standard system
|
||||
libraries, you should be able to compile this program using this command:
|
||||
|
||||
gcc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
|
||||
cc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
|
||||
|
||||
If PCRE2 is not installed in a standard place, it is likely to be installed
|
||||
with support for the pkg-config mechanism. If you have pkg-config, you can
|
||||
compile this program using this command:
|
||||
|
||||
gcc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
|
||||
cc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
|
||||
|
||||
If you do not have pkg-config, you may have to use this:
|
||||
If you do not have pkg-config, you may have to use something like this:
|
||||
|
||||
gcc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \e
|
||||
cc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \e
|
||||
-R/usr/local/lib -lpcre2-8 -o pcre2demo
|
||||
|
||||
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
|
||||
@ -56,9 +59,14 @@ the following line. */
|
||||
|
||||
/* #define PCRE2_STATIC */
|
||||
|
||||
/* This macro must be defined before including pcre2.h. For a program that uses
|
||||
only one code unit width, it makes it possible to use generic function names
|
||||
such as pcre2_compile(). */
|
||||
/* The PCRE2_CODE_UNIT_WIDTH macro must be defined before including pcre2.h.
|
||||
For a program that uses only one code unit width, setting it to 8, 16, or 32
|
||||
makes it possible to use generic function names such as pcre2_compile(). Note
|
||||
that just changing 8 to 16 (for example) is not sufficient to convert this
|
||||
program to process 16-bit characters. Even in a fully 16-bit environment, where
|
||||
string-handling functions such as strcmp() and printf() work with 16-bit
|
||||
characters, the code for handling the table of named substrings will still need
|
||||
to be modified. */
|
||||
|
||||
#define PCRE2_CODE_UNIT_WIDTH 8
|
||||
|
||||
@ -79,19 +87,19 @@ int main(int argc, char **argv)
|
||||
{
|
||||
pcre2_code *re;
|
||||
PCRE2_SPTR pattern; /* PCRE2_SPTR is a pointer to unsigned code units of */
|
||||
PCRE2_SPTR subject; /* the appropriate width (8, 16, or 32 bits). */
|
||||
PCRE2_SPTR subject; /* the appropriate width (in this case, 8 bits). */
|
||||
PCRE2_SPTR name_table;
|
||||
|
||||
int crlf_is_newline;
|
||||
int errornumber;
|
||||
int find_all;
|
||||
int i;
|
||||
int namecount;
|
||||
int name_entry_size;
|
||||
int rc;
|
||||
int utf8;
|
||||
|
||||
uint32_t option_bits;
|
||||
uint32_t namecount;
|
||||
uint32_t name_entry_size;
|
||||
uint32_t newline;
|
||||
|
||||
PCRE2_SIZE erroroffset;
|
||||
@ -106,15 +114,19 @@ pcre2_match_data *match_data;
|
||||
* First, sort out the command line. There is only one possible option at *
|
||||
* the moment, "-g" to request repeated matching to find all occurrences, *
|
||||
* like Perl's /g option. We set the variable find_all to a non-zero value *
|
||||
* if the -g option is present. Apart from that, there must be exactly two *
|
||||
* arguments. *
|
||||
* if the -g option is present. *
|
||||
**************************************************************************/
|
||||
|
||||
find_all = 0;
|
||||
for (i = 1; i < argc; i++)
|
||||
{
|
||||
if (strcmp(argv[i], "-g") == 0) find_all = 1;
|
||||
else break;
|
||||
else if (argv[i][0] == '-')
|
||||
{
|
||||
printf("Unrecognised option %s\en", argv[i]);
|
||||
return 1;
|
||||
}
|
||||
else break;
|
||||
}
|
||||
|
||||
/* After the options, we require exactly two arguments, which are the pattern,
|
||||
@ -122,7 +134,7 @@ and the subject string. */
|
||||
|
||||
if (argc - i != 2)
|
||||
{
|
||||
printf("Two arguments required: a regex and a subject string\en");
|
||||
printf("Exactly two arguments required: a regex and a subject string\en");
|
||||
return 1;
|
||||
}
|
||||
|
||||
@ -201,7 +213,7 @@ if (rc < 0)
|
||||
stored. */
|
||||
|
||||
ovector = pcre2_get_ovector_pointer(match_data);
|
||||
printf("\enMatch succeeded at offset %d\en", (int)ovector[0]);
|
||||
printf("Match succeeded at offset %d\en", (int)ovector[0]);
|
||||
|
||||
|
||||
/*************************************************************************
|
||||
@ -242,7 +254,7 @@ we have to extract the count of named parentheses from the pattern. */
|
||||
PCRE2_INFO_NAMECOUNT, /* get the number of named substrings */
|
||||
&namecount); /* where to put the answer */
|
||||
|
||||
if (namecount <= 0) printf("No named substrings\en"); else
|
||||
if (namecount == 0) printf("No named substrings\en"); else
|
||||
{
|
||||
PCRE2_SPTR tabptr;
|
||||
printf("Named substrings\en");
|
||||
@ -330,8 +342,8 @@ crlf_is_newline = newline == PCRE2_NEWLINE_ANY ||
|
||||
|
||||
for (;;)
|
||||
{
|
||||
uint32_t options = 0; /* Normally no options */
|
||||
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
|
||||
uint32_t options = 0; /* Normally no options */
|
||||
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
|
||||
|
||||
/* If the previous match was for an empty string, we are finished if we are
|
||||
at the end of the subject. Otherwise, arrange to run another match at the
|
||||
@ -371,7 +383,7 @@ for (;;)
|
||||
{
|
||||
if (options == 0) break; /* All matches found */
|
||||
ovector[1] = start_offset + 1; /* Advance one code unit */
|
||||
if (crlf_is_newline && /* If CRLF is newline & */
|
||||
if (crlf_is_newline && /* If CRLF is a newline & */
|
||||
start_offset < subject_length - 1 && /* we are at CRLF, */
|
||||
subject[start_offset] == '\er' &&
|
||||
subject[start_offset + 1] == '\en')
|
||||
@ -417,7 +429,7 @@ for (;;)
|
||||
printf("%2d: %.*s\en", i, (int)substring_length, (char *)substring_start);
|
||||
}
|
||||
|
||||
if (namecount <= 0) printf("No named substrings\en"); else
|
||||
if (namecount == 0) printf("No named substrings\en"); else
|
||||
{
|
||||
PCRE2_SPTR tabptr = name_table;
|
||||
printf("Named substrings\en");
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2GREP 1 "03 January 2015" "PCRE2 10.00"
|
||||
.TH PCRE2GREP 1 "31 December 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
pcre2grep - a grep with Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
@ -52,11 +52,18 @@ span line boundaries. What defines a line boundary is controlled by the
|
||||
\fB-N\fP (\fB--newline\fP) option.
|
||||
.P
|
||||
The amount of memory used for buffering files that are being scanned is
|
||||
controlled by a parameter that can be set by the \fB--buffer-size\fP option.
|
||||
The default value for this parameter is specified when \fBpcre2grep\fP is
|
||||
built, with the default default being 20K. A block of memory three times this
|
||||
size is used (to allow for buffering "before" and "after" lines). An error
|
||||
occurs if a line overflows the buffer.
|
||||
controlled by parameters that can be set by the \fB--buffer-size\fP and
|
||||
\fB--max-buffer-size\fP options. The first of these sets the size of buffer
|
||||
that is obtained at the start of processing. If an input file contains very
|
||||
long lines, a larger buffer may be needed; this is handled by automatically
|
||||
extending the buffer, up to the limit specified by \fB--max-buffer-size\fP. The
|
||||
default values for these parameters are specified when \fBpcre2grep\fP is
|
||||
built, with the default defaults being 20K and 1M respectively. An error occurs
|
||||
if a line is too long and the buffer can no longer be expanded.
|
||||
.P
|
||||
The block of memory that is actually used is three times the "buffer size", to
|
||||
allow for buffering "before" and "after" lines. If the buffer size is too
|
||||
small, fewer than requested "before" and "after" lines may be output.
|
||||
.P
|
||||
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
|
||||
BUFSIZ is defined in \fB<stdio.h>\fP. When there is more than one pattern
|
||||
@ -126,24 +133,27 @@ command line starts with a hyphen but is not an option. This allows for the
|
||||
processing of patterns and file names that start with hyphens.
|
||||
.TP
|
||||
\fB-A\fP \fInumber\fP, \fB--after-context=\fP\fInumber\fP
|
||||
Output \fInumber\fP lines of context after each matching line. If file names
|
||||
and/or line numbers are being output, a hyphen separator is used instead of a
|
||||
colon for the context lines. A line containing "--" is output between each
|
||||
group of lines, unless they are in fact contiguous in the input file. The value
|
||||
of \fInumber\fP is expected to be relatively small. However, \fBpcre2grep\fP
|
||||
guarantees to have up to 8K of following text available for context output.
|
||||
Output up to \fInumber\fP lines of context after each matching line. Fewer
|
||||
lines are output if the next match or the end of the file is reached, or if the
|
||||
processing buffer size has been set too small. If file names and/or line
|
||||
numbers are being output, a hyphen separator is used instead of a colon for the
|
||||
context lines. A line containing "--" is output between each group of lines,
|
||||
unless they are in fact contiguous in the input file. The value of \fInumber\fP
|
||||
is expected to be relatively small. When \fB-c\fP is used, \fB-A\fP is ignored.
|
||||
.TP
|
||||
\fB-a\fP, \fB--text\fP
|
||||
Treat binary files as text. This is equivalent to
|
||||
\fB--binary-files\fP=\fItext\fP.
|
||||
.TP
|
||||
\fB-B\fP \fInumber\fP, \fB--before-context=\fP\fInumber\fP
|
||||
Output \fInumber\fP lines of context before each matching line. If file names
|
||||
and/or line numbers are being output, a hyphen separator is used instead of a
|
||||
colon for the context lines. A line containing "--" is output between each
|
||||
group of lines, unless they are in fact contiguous in the input file. The value
|
||||
of \fInumber\fP is expected to be relatively small. However, \fBpcre2grep\fP
|
||||
guarantees to have up to 8K of preceding text available for context output.
|
||||
Output up to \fInumber\fP lines of context before each matching line. Fewer
|
||||
lines are output if the previous match or the start of the file is within
|
||||
\fInumber\fP lines, or if the processing buffer size has been set too small. If
|
||||
file names and/or line numbers are being output, a hyphen separator is used
|
||||
instead of a colon for the context lines. A line containing "--" is output
|
||||
between each group of lines, unless they are in fact contiguous in the input
|
||||
file. The value of \fInumber\fP is expected to be relatively small. When
|
||||
\fB-c\fP is used, \fB-B\fP is ignored.
|
||||
.TP
|
||||
\fB--binary-files=\fP\fIword\fP
|
||||
Specify how binary files are to be processed. If the word is "binary" (the
|
||||
@ -158,8 +168,9 @@ be of interest and are skipped without causing any output or affecting the
|
||||
return code.
|
||||
.TP
|
||||
\fB--buffer-size=\fP\fInumber\fP
|
||||
Set the parameter that controls how much memory is used for buffering files
|
||||
that are being scanned.
|
||||
Set the parameter that controls how much memory is obtained at the start of
|
||||
processing for buffering files that are being scanned. See also
|
||||
\fB--max-buffer-size\fP below.
|
||||
.TP
|
||||
\fB-C\fP \fInumber\fP, \fB--context=\fP\fInumber\fP
|
||||
Output \fInumber\fP lines of context both before and after each matching line.
|
||||
@ -167,13 +178,15 @@ This is equivalent to setting both \fB-A\fP and \fB-B\fP to the same value.
|
||||
.TP
|
||||
\fB-c\fP, \fB--count\fP
|
||||
Do not output lines from the files that are being scanned; instead output the
|
||||
number of matches (or non-matches if \fB-v\fP is used) that would otherwise
|
||||
have caused lines to be shown. By default, this count is the same as the number
|
||||
of suppressed lines, but if the \fB-M\fP (multiline) option is used (without
|
||||
\fB-v\fP), there may be more suppressed lines than the number of matches.
|
||||
number of lines that would have been shown, either because they matched, or, if
|
||||
\fB-v\fP is set, because they failed to match. By default, this count is
|
||||
exactly the same as the number of lines that would have been output, but if the
|
||||
\fB-M\fP (multiline) option is used (without \fB-v\fP), there may be more
|
||||
suppressed lines than the count (that is, the number of matches).
|
||||
.sp
|
||||
If no lines are selected, the number zero is output. If several files are
|
||||
being scanned, a count is output for each of them. However, if the
|
||||
being scanned, a count is output for each of them and the \fB-t\fP option can
|
||||
be used to cause a total to be output at the end. However, if the
|
||||
\fB--files-with-matches\fP option is also used, only those files whose counts
|
||||
are greater than zero are listed. When \fB-c\fP is used, the \fB-A\fP,
|
||||
\fB-B\fP, and \fB-C\fP options are ignored.
|
||||
@ -192,12 +205,22 @@ connected to a terminal. More resources are used when colouring is enabled,
|
||||
because \fBpcre2grep\fP has to search for all possible matches in a line, not
|
||||
just one, in order to colour them all.
|
||||
.sp
|
||||
The colour that is used can be specified by setting the environment variable
|
||||
PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
|
||||
string of two numbers, separated by a semicolon. They are copied directly into
|
||||
the control string for setting colour on a terminal, so it is your
|
||||
responsibility to ensure that they make sense. If neither of the environment
|
||||
variables is set, the default is "1;31", which gives red.
|
||||
The colour that is used can be specified by setting one of the environment
|
||||
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
|
||||
PCREGREP_COLOR, which are checked in that order. If none of these are set,
|
||||
\fBpcre2grep\fP looks for GREP_COLORS or GREP_COLOR (in that order). The value
|
||||
of the variable should be a string of two numbers, separated by a semicolon,
|
||||
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
|
||||
followed by two semicolon-separated colours, terminated by the end of the
|
||||
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
|
||||
ignored, and GREP_COLOR is checked.
|
||||
.sp
|
||||
If the string obtained from one of the above variables contains any characters
|
||||
other than semicolon or digits, the setting is ignored and the default colour
|
||||
is used. The string is copied directly into the control string for setting
|
||||
colour on a terminal, so it is your responsibility to ensure that the values
|
||||
make sense. If no relevant environment variable is set, the default is "1;31",
|
||||
which gives red.
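The lookup order and validation rules above can be summarised in a short C sketch. The function names are hypothetical and the code is not taken from pcre2grep.c. The caller is assumed to have already looked up the environment, passing the first of the four PCRE2GREP/PCREGREP variables that is set (or NULL), followed by the values of GREP_COLORS and GREP_COLOR. Treating an empty value as invalid is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define DEFAULT_COLOUR "1;31"   /* the documented default, which gives red */

/* True if s holds only digits and semicolons. Rejecting an empty string is
an ASSUMPTION; the text does not say how an empty value is treated. */
int colour_chars_ok(const char *s)
{
  if (s == NULL || *s == '\0') return 0;
  for (; *s != '\0'; s++)
    if (!isdigit((unsigned char)*s) && *s != ';') return 0;
  return 1;
}

/* Copy the colour part of a GREP_COLORS value into out. The value must start
with "ms=" or "mt="; the colours end at a colon or at the end of the string.
Returns 1 on success, 0 if the value is to be ignored. */
int grep_colors_extract(const char *value, char *out, size_t outsize)
{
  size_t n;
  assert(out != NULL && outsize > 0);
  if (value == NULL) return 0;
  if (strncmp(value, "ms=", 3) != 0 && strncmp(value, "mt=", 3) != 0)
    return 0;
  value += 3;
  n = strcspn(value, ":");
  if (n >= outsize) return 0;
  memcpy(out, value, n);
  out[n] = '\0';
  return 1;
}

/* Choose the colour string. scratch receives the extracted GREP_COLORS
colours when that variable is the one used. */
const char *resolve_colour(const char *specific, const char *grep_colors,
  const char *grep_color, char *scratch, size_t scratchsize)
{
  const char *candidate = specific;
  if (candidate == NULL &&
      grep_colors_extract(grep_colors, scratch, scratchsize))
    candidate = scratch;
  if (candidate == NULL) candidate = grep_color;
  return colour_chars_ok(candidate) ? candidate : DEFAULT_COLOUR;
}
```

In this sketch a GREP_COLORS value that starts with some other capability, such as "sl=1:ms=01;32", is skipped and GREP_COLOR is consulted instead, which is how the paragraph above reads.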
|
||||
.TP
|
||||
\fB-D\fP \fIaction\fP, \fB--devices=\fP\fIaction\fP
|
||||
If an input path is not a regular file or a directory, "action" specifies how
|
||||
@ -273,17 +296,17 @@ files; it does not apply to patterns specified by any of the \fB--include\fP or
|
||||
\fB--exclude\fP options.
|
||||
.TP
|
||||
\fB-f\fP \fIfilename\fP, \fB--file=\fP\fIfilename\fP
|
||||
Read patterns from the file, one per line, and match them against
|
||||
each line of input. What constitutes a newline when reading the file is the
|
||||
operating system's default. The \fB--newline\fP option has no effect on this
|
||||
option. Trailing white space is removed from each line, and blank lines are
|
||||
ignored. An empty file contains no patterns and therefore matches nothing. See
|
||||
also the comments about multiple patterns versus a single pattern with
|
||||
alternatives in the description of \fB-e\fP above.
|
||||
Read patterns from the file, one per line, and match them against each line of
|
||||
input. What constitutes a newline when reading the file is the operating
|
||||
system's default. The \fB--newline\fP option has no effect on this option.
|
||||
Trailing white space is removed from each line, and blank lines are ignored. An
|
||||
empty file contains no patterns and therefore matches nothing. See also the
|
||||
comments about multiple patterns versus a single pattern with alternatives in
|
||||
the description of \fB-e\fP above.
|
||||
.sp
|
||||
If this option is given more than once, all the specified files are
|
||||
read. A data line is output if any of the patterns match it. A file name can
|
||||
be given as "-" to refer to the standard input. When \fB-f\fP is used, patterns
|
||||
If this option is given more than once, all the specified files are read. A
|
||||
data line is output if any of the patterns match it. A file name can be given
|
||||
as "-" to refer to the standard input. When \fB-f\fP is used, patterns
|
||||
specified on the command line using \fB-e\fP may also be present; they are
|
||||
tested before the file's patterns. However, no other pattern is taken from the
|
||||
command line; all arguments are treated as the names of paths to be searched.
|
||||
@ -432,18 +455,25 @@ of use only if it is set smaller than \fB--match-limit\fP.
|
||||
There are no short forms for these options. The default settings are specified
|
||||
when the PCRE2 library is compiled, with the default default being 10 million.
|
||||
.TP
|
||||
\fB--max-buffer-size=\fInumber\fP
|
||||
This limits the expansion of the processing buffer, whose initial size can be
|
||||
set by \fB--buffer-size\fP. The maximum buffer size is silently forced to be no
|
||||
smaller than the starting buffer size.
|
||||
.TP
|
||||
\fB-M\fP, \fB--multiline\fP
|
||||
Allow patterns to match more than one line. When this option is given, patterns
|
||||
may usefully contain literal newline characters and internal occurrences of ^
|
||||
and $ characters. The output for a successful match may consist of more than
|
||||
one line. The first is the line in which the match started, and the last is the
|
||||
line in which the match ended. If the matched string ends with a newline
|
||||
sequence the output ends at the end of that line.
|
||||
Allow patterns to match more than one line. When this option is set, the PCRE2
|
||||
library is called in "multiline" mode. This allows a matched string to extend
|
||||
past the end of a line and continue on one or more subsequent lines. Patterns
|
||||
used with \fB-M\fP may usefully contain literal newline characters and internal
|
||||
occurrences of ^ and $ characters. The output for a successful match may
|
||||
consist of more than one line. The first line is the line in which the match
|
||||
started, and the last line is the line in which the match ended. If the matched
|
||||
string ends with a newline sequence, the output ends at the end of that line.
|
||||
If \fB-v\fP is set, none of the lines in a multi-line match are output. Once a
|
||||
match has been handled, scanning restarts at the beginning of the line after
|
||||
the one in which the match ended.
|
||||
.sp
|
||||
When this option is set, the PCRE2 library is called in "multiline" mode.
|
||||
However, \fBpcre2grep\fP still processes the input line by line. The difference
|
||||
is that a matched string may extend past the end of a line and continue on
|
||||
one or more subsequent lines. The newline sequence must be matched as part of
|
||||
The newline sequence that separates multiple lines must be matched as part of
|
||||
the pattern. For example, to find the phrase "regular expression" in a file
|
||||
where "regular" might be at the end of a line and "expression" at the start of
|
||||
the next line, you could use this command:
|
||||
@ -455,11 +485,8 @@ and is followed by + so as to match trailing white space on the first line as
|
||||
well as possibly handling a two-character newline sequence.
|
||||
.sp
|
||||
There is a limit to the number of lines that can be matched, imposed by the way
|
||||
that \fBpcre2grep\fP buffers the input file as it scans it. However,
|
||||
\fBpcre2grep\fP ensures that at least 8K characters or the rest of the file
|
||||
(whichever is the shorter) are available for forward matching, and similarly
|
||||
the previous 8K characters (or all the previous characters, if fewer than 8K)
|
||||
are guaranteed to be available for lookbehind assertions. The \fB-M\fP option
|
||||
that \fBpcre2grep\fP buffers the input file as it scans it. With a sufficiently
|
||||
large processing buffer, this should not be a problem, but the \fB-M\fP option
|
||||
does not work when input is read line by line (see \fB--line-buffered\fP).
|
||||
.TP
|
||||
\fB-N\fP \fInewline-type\fP, \fB--newline\fP=\fInewline-type\fP
|
||||
@ -502,12 +529,13 @@ It should never be needed in normal use.
|
||||
Show only the part of the line that matched a pattern instead of the whole
|
||||
line. In this mode, no context is shown. That is, the \fB-A\fP, \fB-B\fP, and
|
||||
\fB-C\fP options are ignored. If there is more than one match in a line, each
|
||||
of them is shown separately. If \fB-o\fP is combined with \fB-v\fP (invert the
|
||||
sense of the match to find non-matching lines), no output is generated, but the
|
||||
return code is set appropriately. If the matched portion of the line is empty,
|
||||
nothing is output unless the file name or line number are being printed, in
|
||||
which case they are shown on an otherwise empty line. This option is mutually
|
||||
exclusive with \fB--file-offsets\fP and \fB--line-offsets\fP.
|
||||
of them is shown separately, on a separate line of output. If \fB-o\fP is
|
||||
combined with \fB-v\fP (invert the sense of the match to find non-matching
|
||||
lines), no output is generated, but the return code is set appropriately. If
|
||||
the matched portion of the line is empty, nothing is output unless the file
|
||||
name or line number are being printed, in which case they are shown on an
|
||||
otherwise empty line. This option is mutually exclusive with
|
||||
\fB--file-offsets\fP and \fB--line-offsets\fP.
|
||||
.TP
|
||||
\fB-o\fP\fInumber\fP, \fB--only-matching\fP=\fInumber\fP
|
||||
Show only the part of the line that matched the capturing parentheses of the
|
||||
@ -519,10 +547,11 @@ for the non-argument case above also apply to this case. If the specified
|
||||
capturing parentheses do not exist in the pattern, or were not set in the
|
||||
match, nothing is output unless the file name or line number are being output.
|
||||
.sp
|
||||
If this option is given multiple times, multiple substrings are output, in the
|
||||
order the options are given. For example, -o3 -o1 -o3 causes the substrings
|
||||
matched by capturing parentheses 3 and 1 and then 3 again to be output. By
|
||||
default, there is no separator (but see the next option).
|
||||
If this option is given multiple times, multiple substrings are output for each
|
||||
match, in the order the options are given, and all on one line. For example,
|
||||
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
|
||||
then 3 again to be output. By default, there is no separator (but see the next
|
||||
option).
|
||||
.TP
|
||||
\fB--om-separator\fP=\fItext\fP
|
||||
Specify a separating string for multiple occurrences of \fB-o\fP. The default
|
||||
@ -547,6 +576,17 @@ Suppress error messages about non-existent or unreadable files. Such files are
|
||||
quietly skipped. However, the return code is still 2, even if matches were
|
||||
found in other files.
|
||||
.TP
|
||||
\fB-t\fP, \fB--total-count\fP
|
||||
This option is useful when scanning more than one file. If used on its own,
|
||||
\fB-t\fP suppresses all output except for a grand total number of matching
|
||||
lines (or non-matching lines if \fB-v\fP is used) in all the files. If \fB-t\fP
|
||||
is used with \fB-c\fP, a grand total is output except when the previous output
|
||||
is just one line. In other words, it is not output when just one file's count
|
||||
is listed. If file names are being output, the grand total is preceded by
|
||||
"TOTAL:". Otherwise, it appears as just another number. The \fB-t\fP option is
|
||||
ignored when used with \fB-L\fP (list files without matches), because the grand
|
||||
total would always be zero.
|
||||
.TP
|
||||
\fB-u\fP, \fB--utf-8\fP
|
||||
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
|
||||
with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
|
||||
@ -570,11 +610,12 @@ specified by any of the \fB--include\fP or \fB--exclude\fP options.
|
||||
.TP
|
||||
\fB-x\fP, \fB--line-regex\fP, \fB--line-regexp\fP
|
||||
Force the patterns to be anchored (each must start matching at the beginning of
|
||||
a line) and in addition, require them to match entire lines. This is equivalent
|
||||
to having ^ and $ characters at the start and end of each alternative top-level
|
||||
branch in every pattern. This option applies only to the patterns that are
|
||||
matched against the contents of files; it does not apply to patterns specified
|
||||
by any of the \fB--include\fP or \fB--exclude\fP options.
|
||||
a line) and in addition, require them to match entire lines. In multiline mode
|
||||
the match may be more than one line. This is equivalent to having \eA and \eZ
|
||||
characters at the start and end of each alternative top-level branch in every
|
||||
pattern. This option applies only to the patterns that are matched against the
|
||||
contents of files; it does not apply to patterns specified by any of the
|
||||
\fB--include\fP or \fB--exclude\fP options.
|
||||
.
|
||||
.
|
||||
.SH "ENVIRONMENT VARIABLES"
|
||||
@ -653,6 +694,58 @@ options does have data, it must be given in the first form, using an equals
|
||||
character. Otherwise \fBpcre2grep\fP will assume that it has no data.
|
||||
.
|
||||
.
|
||||
.SH "CALLING EXTERNAL SCRIPTS"
|
||||
.rs
|
||||
.sp
|
||||
\fBpcre2grep\fP has, by default, support for calling external programs or
|
||||
scripts during matching by making use of PCRE2's callout facility. However,
|
||||
this support can be disabled when \fBpcre2grep\fP is built. You can find out
|
||||
whether your binary has support for callouts by running it with the \fB--help\fP
|
||||
option. If the support is not enabled, all callouts in patterns are ignored by
|
||||
\fBpcre2grep\fP.
|
||||
.P
|
||||
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argument is
|
||||
either a number or a quoted string (see the
|
||||
.\" HREF
|
||||
\fBpcre2callout\fP
|
||||
.\"
|
||||
documentation for details). Numbered callouts are ignored by \fBpcre2grep\fP.
|
||||
String arguments are parsed as a list of substrings separated by pipe (vertical
|
||||
bar) characters. The first substring must be an executable name, with the
|
||||
following substrings specifying arguments:
|
||||
.sp
|
||||
executable_name|arg1|arg2|...
|
||||
.sp
|
||||
Any substring (including the executable name) may contain escape sequences
|
||||
started by a dollar character: $<digits> or ${<digits>} is replaced by the
|
||||
captured substring of the given decimal number, which must be greater than
|
||||
zero. If the number is greater than the number of capturing substrings, or if
|
||||
the capture is unset, the replacement is empty.
|
||||
.P
|
||||
Any other character is substituted by itself. In particular, $$ is replaced by
|
||||
a single dollar and $| is replaced by a pipe character. Here is an example:
|
||||
.sp
|
||||
echo -e "abcde\en12345" | pcre2grep \e
|
||||
'(?x)(.)(..(.))
|
||||
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
|
||||
.sp
|
||||
Output:
|
||||
.sp
|
||||
Arg1: [a] [bcd] [d] Arg2: |a| ()
|
||||
abcde
|
||||
Arg1: [1] [234] [4] Arg2: |1| ()
|
||||
12345
|
||||
.sp
|
||||
The parameters for the \fBexecv()\fP system call that is used to run the
|
||||
program or script are zero-terminated strings. This means that binary zero
|
||||
characters in the callout argument will cause premature termination of their
|
||||
substrings, and therefore should not be present. Any syntax errors in the
|
||||
string (for example, a dollar not followed by another character) cause the
|
||||
callout to be ignored. If running the program fails for any reason (including
|
||||
the non-existence of the executable), a local matching failure occurs and the
|
||||
matcher backtracks in the normal way.
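The substitution rules above can be sketched in Python. This is a hedged illustration of the documented behaviour, not the code pcre2grep uses: the function name `expand_callout` and the `captures` list (group 1 at index 0, `None` for an unset group) are hypothetical, and treating `$0`, an unterminated `${`, or a `${` with no digits as syntax errors is an assumption.

```python
DIGITS = "0123456789"


def expand_callout(text, captures):
    """Split a callout string on pipes and expand $ escapes.

    Returns the argument list (executable name first), or None when the
    string has a syntax error and the callout would be ignored.
    """
    args = [""]
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "|":
            args.append("")          # an unescaped pipe starts the next substring
            i += 1
        elif ch == "$":
            i += 1
            if i >= n:
                return None          # dollar not followed by another character
            braced = text[i] == "{"
            if braced:
                i += 1
            start = i
            while i < n and text[i] in DIGITS:
                i += 1
            digits = text[start:i]
            if digits:
                if braced:
                    if i >= n or text[i] != "}":
                        return None  # assumed: unterminated ${...
                    i += 1
                number = int(digits)
                if number == 0:
                    return None      # assumed: the number must be greater than zero
                # Out-of-range or unset captures expand to the empty string.
                value = captures[number - 1] if number <= len(captures) else None
                args[-1] += value or ""
            elif braced:
                return None          # assumed: ${ with no digits
            else:
                args[-1] += text[i]  # $$ gives $, $| gives |, and so on
                i += 1
        else:
            args[-1] += ch
            i += 1
    return args
```

With the callout string from the example and the captures for the line "abcde" (`["a", "bcd", "d"]`), the sketch yields the three strings `/bin/echo`, `Arg1: [a] [bcd] [d]`, and `Arg2: |a| ()`, which agrees with the first output line shown above.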
|
||||
.
|
||||
.
|
||||
.SH "MATCHING ERRORS"
|
||||
.rs
|
||||
.sp
|
||||
@ -683,7 +776,7 @@ affect the return code.
|
||||
.SH "SEE ALSO"
|
||||
.rs
|
||||
.sp
|
||||
\fBpcre2pattern\fP(3), \fBpcre2syntax\fP(3).
|
||||
\fBpcre2pattern\fP(3), \fBpcre2syntax\fP(3), \fBpcre2callout\fP(3).
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
@ -700,6 +793,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 03 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 31 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -51,103 +51,115 @@ DESCRIPTION
|
||||
boundary is controlled by the -N (--newline) option.
|
||||
|
||||
The amount of memory used for buffering files that are being scanned is
|
||||
controlled by a parameter that can be set by the --buffer-size option.
|
||||
The default value for this parameter is specified when pcre2grep is
|
||||
built, with the default default being 20K. A block of memory three
|
||||
times this size is used (to allow for buffering "before" and "after"
|
||||
lines). An error occurs if a line overflows the buffer.
|
||||
controlled by parameters that can be set by the --buffer-size and
|
||||
--max-buffer-size options. The first of these sets the size of buffer
|
||||
that is obtained at the start of processing. If an input file contains
|
||||
very long lines, a larger buffer may be needed; this is handled by
|
||||
automatically extending the buffer, up to the limit specified by --max-
|
||||
buffer-size. The default values for these parameters are specified when
|
||||
pcre2grep is built, with the default defaults being 20K and 1M respec-
|
||||
tively. An error occurs if a line is too long and the buffer can no
|
||||
longer be expanded.
|
||||
|
||||
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
|
||||
greater. BUFSIZ is defined in <stdio.h>. When there is more than one
|
||||
The block of memory that is actually used is three times the "buffer
|
||||
size", to allow for buffering "before" and "after" lines. If the buffer
|
||||
size is too small, fewer than requested "before" and "after" lines may
|
||||
be output.
|
||||
|
||||
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
|
||||
greater. BUFSIZ is defined in <stdio.h>. When there is more than one
|
||||
pattern (specified by the use of -e and/or -f), each pattern is applied
|
||||
to each line in the order in which they are defined, except that all
|
||||
to each line in the order in which they are defined, except that all
|
||||
the -e patterns are tried before the -f patterns.
|
||||
|
||||
By default, as soon as one pattern matches a line, no further patterns
|
||||
By default, as soon as one pattern matches a line, no further patterns
|
||||
are considered. However, if --colour (or --color) is used to colour the
|
||||
matching substrings, or if --only-matching, --file-offsets, or --line-
|
||||
offsets is used to output only the part of the line that matched
|
||||
matching substrings, or if --only-matching, --file-offsets, or --line-
|
||||
offsets is used to output only the part of the line that matched
|
||||
(either shown literally, or as an offset), scanning resumes immediately
|
||||
following the match, so that further matches on the same line can be
|
||||
found. If there are multiple patterns, they are all tried on the
|
||||
remainder of the line, but patterns that follow the one that matched
|
||||
following the match, so that further matches on the same line can be
|
||||
found. If there are multiple patterns, they are all tried on the
|
||||
remainder of the line, but patterns that follow the one that matched
|
||||
are not tried on the earlier part of the line.
|
||||
|
||||
This behaviour means that the order in which multiple patterns are
|
||||
specified can affect the output when one of the above options is used.
|
||||
This is no longer the same behaviour as GNU grep, which now manages to
|
||||
display earlier matches for later patterns (as long as there is no
|
||||
This behaviour means that the order in which multiple patterns are
|
||||
specified can affect the output when one of the above options is used.
|
||||
This is no longer the same behaviour as GNU grep, which now manages to
|
||||
display earlier matches for later patterns (as long as there is no
|
||||
overlap).
|
||||
|
||||
Patterns that can match an empty string are accepted, but empty string
|
||||
Patterns that can match an empty string are accepted, but empty string
|
||||
matches are never recognized. An example is the pattern
|
||||
"(super)?(man)?", in which all components are optional. This pattern
|
||||
finds all occurrences of both "super" and "man"; the output differs
|
||||
from matching with "super|man" when only the matching substrings are
|
||||
"(super)?(man)?", in which all components are optional. This pattern
|
||||
finds all occurrences of both "super" and "man"; the output differs
|
||||
from matching with "super|man" when only the matching substrings are
|
||||
being shown.
|
||||
|
||||
If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses
|
||||
If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses
|
||||
the value to set a locale when calling the PCRE2 library. The --locale
|
||||
option can be used to override this.
|
||||
|
||||
|
||||
SUPPORT FOR COMPRESSED FILES
|
||||
|
||||
It is possible to compile pcre2grep so that it uses libz or libbz2 to
|
||||
read files whose names end in .gz or .bz2, respectively. You can find
|
||||
It is possible to compile pcre2grep so that it uses libz or libbz2 to
|
||||
read files whose names end in .gz or .bz2, respectively. You can find
|
||||
out whether your binary has support for one or both of these file types
|
||||
by running it with the --help option. If the appropriate support is not
|
||||
present, files are treated as plain text. The standard input is always
|
||||
present, files are treated as plain text. The standard input is always
|
||||
so treated.
|
||||
|
||||
|
||||
BINARY FILES
|
||||
|
||||
By default, a file that contains a binary zero byte within the first
|
||||
1024 bytes is identified as a binary file, and is processed specially.
|
||||
(GNU grep also identifies binary files in this manner.) See the
|
||||
--binary-files option for a means of changing the way binary files are
|
||||
By default, a file that contains a binary zero byte within the first
|
||||
1024 bytes is identified as a binary file, and is processed specially.
|
||||
(GNU grep also identifies binary files in this manner.) See the
|
||||
--binary-files option for a means of changing the way binary files are
|
||||
handled.
|
||||
|
||||
|
||||
OPTIONS
|
||||
|
||||
The order in which some of the options appear can affect the output.
|
||||
For example, both the -h and -l options affect the printing of file
|
||||
names. Whichever comes later in the command line will be the one that
|
||||
takes effect. Similarly, except where noted below, if an option is
|
||||
given twice, the later setting is used. Numerical values for options
|
||||
may be followed by K or M, to signify multiplication by 1024 or
|
||||
The order in which some of the options appear can affect the output.
|
||||
For example, both the -h and -l options affect the printing of file
|
||||
names. Whichever comes later in the command line will be the one that
|
||||
takes effect. Similarly, except where noted below, if an option is
|
||||
given twice, the later setting is used. Numerical values for options
|
||||
may be followed by K or M, to signify multiplication by 1024 or
|
||||
1024*1024 respectively.
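The K and M suffixes amount to a simple multiplication, sketched here in Python. The function name `parse_option_number` is hypothetical, and the sketch handles only the upper-case suffixes named in the text; whether pcre2grep accepts anything else is not stated here.

```python
def parse_option_number(text):
    """Convert an option value such as "20K" or "1M" to an integer."""
    multipliers = {"K": 1024, "M": 1024 * 1024}
    if text and text[-1] in multipliers:
        # Strip the suffix and scale the remaining digits.
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)
```

Under this reading, the documented default buffer sizes of 20K and 1M correspond to 20480 and 1048576 bytes.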
|
||||
|
||||
-- This terminates the list of options. It is useful if the next
|
||||
item on the command line starts with a hyphen but is not an
|
||||
option. This allows for the processing of patterns and file
|
||||
item on the command line starts with a hyphen but is not an
|
||||
option. This allows for the processing of patterns and file
|
||||
names that start with hyphens.
|
||||
|
||||
-A number, --after-context=number
|
||||
Output number lines of context after each matching line. If
|
||||
file names and/or line numbers are being output, a hyphen
|
||||
separator is used instead of a colon for the context lines. A
|
||||
line containing "--" is output between each group of lines,
|
||||
unless they are in fact contiguous in the input file. The
|
||||
value of number is expected to be relatively small. However,
|
||||
pcre2grep guarantees to have up to 8K of following text
|
||||
available for context output.
|
||||
Output up to number lines of context after each matching
|
||||
line. Fewer lines are output if the next match or the end of
|
||||
the file is reached, or if the processing buffer size has
|
||||
been set too small. If file names and/or line numbers are
|
||||
being output, a hyphen separator is used instead of a colon
|
||||
for the context lines. A line containing "--" is output
|
||||
between each group of lines, unless they are in fact contigu-
|
||||
ous in the input file. The value of number is expected to be
|
||||
relatively small. When -c is used, -A is ignored.
|
||||
|
||||
-a, --text
|
||||
Treat binary files as text. This is equivalent to --binary-
|
||||
files=text.
|
||||
|
||||
-B number, --before-context=number
|
||||
Output number lines of context before each matching line. If
|
||||
file names and/or line numbers are being output, a hyphen
|
||||
separator is used instead of a colon for the context lines. A
|
||||
line containing "--" is output between each group of lines,
|
||||
unless they are in fact contiguous in the input file. The
|
||||
value of number is expected to be relatively small. However,
|
||||
pcre2grep guarantees to have up to 8K of preceding text
|
||||
available for context output.
|
||||
Output up to number lines of context before each matching
|
||||
line. Fewer lines are output if the previous match or the
|
||||
start of the file is within number lines, or if the process-
|
||||
ing buffer size has been set too small. If file names and/or
|
||||
line numbers are being output, a hyphen separator is used
|
||||
instead of a colon for the context lines. A line containing
|
||||
"--" is output between each group of lines, unless they are
|
||||
in fact contiguous in the input file. The value of number is
|
||||
expected to be relatively small. When -c is used, -B is
|
||||
ignored.
|
||||
|
||||
--binary-files=word
|
||||
Specify how binary files are to be processed. If the word is
|
||||
@ -164,54 +176,68 @@ OPTIONS
|
||||
any output or affecting the return code.
|
||||
|
||||
--buffer-size=number
|
||||
Set the parameter that controls how much memory is used for
|
||||
buffering files that are being scanned.
|
||||
Set the parameter that controls how much memory is obtained
|
||||
at the start of processing for buffering files that are being
|
||||
scanned. See also --max-buffer-size below.
|
||||
|
||||
-C number, --context=number
|
||||
Output number lines of context both before and after each
|
||||
matching line. This is equivalent to setting both -A and -B
|
||||
Output number lines of context both before and after each
|
||||
matching line. This is equivalent to setting both -A and -B
|
||||
to the same value.
|
||||
|
||||
-c, --count
|
||||
Do not output lines from the files that are being scanned;
|
||||
instead output the number of matches (or non-matches if -v is
|
||||
used) that would otherwise have caused lines to be shown. By
|
||||
default, this count is the same as the number of suppressed
|
||||
lines, but if the -M (multiline) option is used (without -v),
|
||||
there may be more suppressed lines than the number of
|
||||
matches.
|
||||
Do not output lines from the files that are being scanned;
|
||||
instead output the number of lines that would have been
|
||||
shown, either because they matched, or, if -v is set, because
|
||||
they failed to match. By default, this count is exactly the
|
||||
same as the number of lines that would have been output, but
|
||||
if the -M (multiline) option is used (without -v), there may
|
||||
be more suppressed lines than the count (that is, the number
|
||||
of matches).
|
||||
|
||||
If no lines are selected, the number zero is output. If several
|
||||
files are being scanned, a count is output for each
|
||||
of them. However, if the --files-with-matches option is also
|
||||
used, only those files whose counts are greater than zero are
|
||||
listed. When -c is used, the -A, -B, and -C options are
|
||||
ignored.
|
||||
of them and the -t option can be used to cause a total to be
|
||||
output at the end. However, if the --files-with-matches
|
||||
option is also used, only those files whose counts are
|
||||
greater than zero are listed. When -c is used, the -A, -B,
|
||||
and -C options are ignored.
|
||||
|
||||
--colour, --color
|
||||
If this option is given without any data, it is equivalent to
|
||||
"--colour=auto". If data is required, it must be given in
|
||||
"--colour=auto". If data is required, it must be given in
|
||||
the same shell item, separated by an equals sign.
|
||||
|
||||
--colour=value, --color=value
|
||||
This option specifies under what circumstances the parts of a
|
||||
line that matched a pattern should be coloured in the output.
|
||||
By default, the output is not coloured. The value (which is
|
||||
optional, see above) may be "never", "always", or "auto". In
|
||||
the latter case, colouring happens only if the standard out-
|
||||
put is connected to a terminal. More resources are used when
|
||||
By default, the output is not coloured. The value (which is
|
||||
optional, see above) may be "never", "always", or "auto". In
|
||||
the latter case, colouring happens only if the standard out-
|
||||
put is connected to a terminal. More resources are used when
|
||||
colouring is enabled, because pcre2grep has to search for all
|
||||
possible matches in a line, not just one, in order to colour
|
||||
possible matches in a line, not just one, in order to colour
|
||||
them all.
|
||||
|
||||
The colour that is used can be specified by setting the envi-
|
||||
ronment variable PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The
|
||||
value of this variable should be a string of two numbers,
|
||||
separated by a semicolon. They are copied directly into the
|
||||
control string for setting colour on a terminal, so it is
|
||||
your responsibility to ensure that they make sense. If nei-
|
||||
ther of the environment variables is set, the default is
|
||||
"1;31", which gives red.
|
||||
The colour that is used can be specified by setting one of
|
||||
the environment variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR,
|
||||
PCREGREP_COLOUR, or PCREGREP_COLOR, which are checked in that
|
||||
order. If none of these are set, pcre2grep looks for
|
||||
GREP_COLORS or GREP_COLOR (in that order). The value of the
|
||||
variable should be a string of two numbers, separated by a
|
||||
semicolon, except in the case of GREP_COLORS, which must
|
||||
start with "ms=" or "mt=" followed by two semicolon-separated
|
||||
colours, terminated by the end of the string or by a colon.
|
||||
If GREP_COLORS does not start with "ms=" or "mt=" it is
|
||||
ignored, and GREP_COLOR is checked.
|
||||
|
||||
If the string obtained from one of the above variables con-
|
||||
tains any characters other than semicolon or digits, the set-
|
||||
ting is ignored and the default colour is used. The string is
|
||||
copied directly into the control string for setting colour on
|
||||
a terminal, so it is your responsibility to ensure that the
|
||||
values make sense. If no relevant environment variable is
|
||||
set, the default is "1;31", which gives red.
|
||||
|
||||
-D action, --devices=action
|
||||
If an input path is not a regular file or a directory,
|
||||
@ -299,12 +325,12 @@ OPTIONS
|
||||
Read patterns from the file, one per line, and match them
|
||||
against each line of input. What constitutes a newline when
|
||||
reading the file is the operating system's default. The
|
||||
--newline option has no effect on this option. Trailing white
|
||||
space is removed from each line, and blank lines are ignored.
|
||||
An empty file contains no patterns and therefore matches
|
||||
nothing. See also the comments about multiple patterns versus
|
||||
a single pattern with alternatives in the description of -e
|
||||
above.
|
||||
--newline option has no effect on this option. Trailing
|
||||
white space is removed from each line, and blank lines are
|
||||
ignored. An empty file contains no patterns and therefore
|
||||
matches nothing. See also the comments about multiple pat-
|
||||
terns versus a single pattern with alternatives in the
|
||||
description of -e above.
|
||||
|
||||
If this option is given more than once, all the specified
|
||||
files are read. A data line is output if any of the patterns
|
||||
@ -482,96 +508,101 @@ OPTIONS
|
||||
tings are specified when the PCRE2 library is compiled, with
|
||||
the default default being 10 million.
|
||||
|
||||
-M, --multiline
|
||||
Allow patterns to match more than one line. When this option
|
||||
is given, patterns may usefully contain literal newline char-
|
||||
acters and internal occurrences of ^ and $ characters. The
|
||||
output for a successful match may consist of more than one
|
||||
line. The first is the line in which the match started, and
|
||||
the last is the line in which the match ended. If the matched
|
||||
string ends with a newline sequence the output ends at the
|
||||
end of that line.
|
||||
--max-buffer-size=number
|
||||
This limits the expansion of the processing buffer, whose
|
||||
initial size can be set by --buffer-size. The maximum buffer
|
||||
size is silently forced to be no smaller than the starting
|
||||
buffer size.
|
||||
|
||||
When this option is set, the PCRE2 library is called in "mul-
|
||||
tiline" mode. However, pcre2grep still processes the input
|
||||
line by line. The difference is that a matched string may
|
||||
extend past the end of a line and continue on one or more
|
||||
subsequent lines. The newline sequence must be matched as
|
||||
part of the pattern. For example, to find the phrase "regular
|
||||
expression" in a file where "regular" might be at the end of
|
||||
a line and "expression" at the start of the next line, you
|
||||
could use this command:
|
||||
-M, --multiline
|
||||
Allow patterns to match more than one line. When this option
|
||||
is set, the PCRE2 library is called in "multiline" mode. This
|
||||
allows a matched string to extend past the end of a line and
|
||||
continue on one or more subsequent lines. Patterns used with
|
||||
-M may usefully contain literal newline characters and inter-
|
||||
nal occurrences of ^ and $ characters. The output for a suc-
|
||||
cessful match may consist of more than one line. The first
|
||||
line is the line in which the match started, and the last
|
||||
line is the line in which the match ended. If the matched
|
||||
string ends with a newline sequence, the output ends at the
|
||||
end of that line. If -v is set, none of the lines in a
|
||||
multi-line match are output. Once a match has been handled,
|
||||
scanning restarts at the beginning of the line after the one
|
||||
in which the match ended.
|
||||
|
||||
The newline sequence that separates multiple lines must be
|
||||
matched as part of the pattern. For example, to find the
|
||||
phrase "regular expression" in a file where "regular" might
|
||||
be at the end of a line and "expression" at the start of the
|
||||
next line, you could use this command:
|
||||
|
||||
pcre2grep -M 'regular\s+expression' <file>
|
||||
|
||||
The \s escape sequence matches any white space character,
|
||||
including newlines, and is followed by + so as to match
|
||||
trailing white space on the first line as well as possibly
|
||||
The \s escape sequence matches any white space character,
|
||||
including newlines, and is followed by + so as to match
|
||||
trailing white space on the first line as well as possibly
|
||||
handling a two-character newline sequence.
|
||||
|
||||
There is a limit to the number of lines that can be matched,
|
||||
imposed by the way that pcre2grep buffers the input file as
|
||||
it scans it. However, pcre2grep ensures that at least 8K
|
||||
characters or the rest of the file (whichever is the shorter)
|
||||
are available for forward matching, and similarly the previ-
|
||||
ous 8K characters (or all the previous characters, if fewer
|
||||
than 8K) are guaranteed to be available for lookbehind asser-
|
||||
tions. The -M option does not work when input is read line by
|
||||
line (see --line-buffered.)
|
||||
There is a limit to the number of lines that can be matched,
|
||||
imposed by the way that pcre2grep buffers the input file as
|
||||
it scans it. With a sufficiently large processing buffer,
|
||||
this should not be a problem, but the -M option does not work
|
||||
when input is read line by line (see --line-buffered.)
|
||||
|
||||
-N newline-type, --newline=newline-type
|
||||
The PCRE2 library supports five different conventions for
|
||||
indicating the ends of lines. They are the single-character
|
||||
sequences CR (carriage return) and LF (linefeed), the two-
|
||||
character sequence CRLF, an "anycrlf" convention, which rec-
|
||||
ognizes any of the preceding three types, and an "any" con-
|
||||
The PCRE2 library supports five different conventions for
|
||||
indicating the ends of lines. They are the single-character
|
||||
sequences CR (carriage return) and LF (linefeed), the two-
|
||||
character sequence CRLF, an "anycrlf" convention, which rec-
|
||||
ognizes any of the preceding three types, and an "any" con-
|
||||
vention, in which any Unicode line ending sequence is assumed
|
||||
to end a line. The Unicode sequences are the three just men-
|
||||
tioned, plus VT (vertical tab, U+000B), FF (form feed,
|
||||
U+000C), NEL (next line, U+0085), LS (line separator,
|
||||
to end a line. The Unicode sequences are the three just men-
|
||||
tioned, plus VT (vertical tab, U+000B), FF (form feed,
|
||||
U+000C), NEL (next line, U+0085), LS (line separator,
|
||||
U+2028), and PS (paragraph separator, U+2029).
|
||||
|
||||
When the PCRE2 library is built, a default line-ending
|
||||
sequence is specified. This is normally the standard
|
||||
When the PCRE2 library is built, a default line-ending
|
||||
sequence is specified. This is normally the standard
|
||||
sequence for the operating system. Unless otherwise specified
|
||||
by this option, pcre2grep uses the library's default. The
|
||||
by this option, pcre2grep uses the library's default. The
|
||||
possible values for this option are CR, LF, CRLF, ANYCRLF, or
|
||||
ANY. This makes it possible to use pcre2grep to scan files
|
||||
ANY. This makes it possible to use pcre2grep to scan files
|
||||
that have come from other environments without having to mod-
|
||||
ify their line endings. If the data that is being scanned
|
||||
does not agree with the convention set by this option,
|
||||
pcre2grep may behave in strange ways. Note that this option
|
||||
does not apply to files specified by the -f, --exclude-from,
|
||||
or --include-from options, which are expected to use the
|
||||
ify their line endings. If the data that is being scanned
|
||||
does not agree with the convention set by this option,
|
||||
pcre2grep may behave in strange ways. Note that this option
|
||||
does not apply to files specified by the -f, --exclude-from,
|
||||
or --include-from options, which are expected to use the
|
||||
operating system's standard newline sequence.
|
||||
|
||||
-n, --line-number
|
||||
Precede each output line by its line number in the file, fol-
|
||||
lowed by a colon for matching lines or a hyphen for context
|
||||
lowed by a colon for matching lines or a hyphen for context
|
||||
lines. If the file name is also being output, it precedes the
|
||||
line number. When the -M option causes a pattern to match
|
||||
more than one line, only the first is preceded by its line
|
||||
line number. When the -M option causes a pattern to match
|
||||
more than one line, only the first is preceded by its line
|
||||
number. This option is forced if --line-offsets is used.
|
||||
|
||||
--no-jit If the PCRE2 library is built with support for just-in-time
|
||||
--no-jit If the PCRE2 library is built with support for just-in-time
|
||||
compiling (which speeds up matching), pcre2grep automatically
|
||||
makes use of this, unless it was explicitly disabled at build
|
||||
time. This option can be used to disable the use of JIT at
|
||||
run time. It is provided for testing and working round prob-
|
||||
time. This option can be used to disable the use of JIT at
|
||||
run time. It is provided for testing and working round prob-
|
||||
lems. It should never be needed in normal use.
|
||||
|
||||
-o, --only-matching
|
||||
Show only the part of the line that matched a pattern instead
|
||||
of the whole line. In this mode, no context is shown. That
|
||||
is, the -A, -B, and -C options are ignored. If there is more
|
||||
than one match in a line, each of them is shown separately.
|
||||
If -o is combined with -v (invert the sense of the match to
|
||||
find non-matching lines), no output is generated, but the
|
||||
return code is set appropriately. If the matched portion of
|
||||
the line is empty, nothing is output unless the file name or
|
||||
line number are being printed, in which case they are shown
|
||||
on an otherwise empty line. This option is mutually exclusive
|
||||
with --file-offsets and --line-offsets.
|
||||
of the whole line. In this mode, no context is shown. That
|
||||
is, the -A, -B, and -C options are ignored. If there is more
|
||||
than one match in a line, each of them is shown separately,
|
||||
on a separate line of output. If -o is combined with -v
|
||||
(invert the sense of the match to find non-matching lines),
|
||||
no output is generated, but the return code is set appropri-
|
||||
ately. If the matched portion of the line is empty, nothing
|
||||
is output unless the file name or line number are being
|
||||
printed, in which case they are shown on an otherwise empty
|
||||
line. This option is mutually exclusive with --file-offsets
|
||||
and --line-offsets.
|
||||
|
||||
-onumber, --only-matching=number
|
||||
Show only the part of the line that matched the capturing
|
||||
@ -587,65 +618,80 @@ OPTIONS
|
||||
put.
|
||||
|
||||
If this option is given multiple times, multiple substrings
|
||||
are output, in the order the options are given. For example,
|
||||
-o3 -o1 -o3 causes the substrings matched by capturing paren-
|
||||
theses 3 and 1 and then 3 again to be output. By default,
|
||||
there is no separator (but see the next option).
|
||||
are output for each match, in the order the options are
|
||||
given, and all on one line. For example, -o3 -o1 -o3 causes
|
||||
the substrings matched by capturing parentheses 3 and 1 and
|
||||
then 3 again to be output. By default, there is no separator
|
||||
(but see the next option).
|
||||
|
||||
--om-separator=text
|
||||
Specify a separating string for multiple occurrences of -o.
|
||||
The default is an empty string. Separating strings are never
|
||||
Specify a separating string for multiple occurrences of -o.
|
||||
The default is an empty string. Separating strings are never
|
||||
coloured.
|
||||
|
||||
-q, --quiet
|
||||
Work quietly, that is, display nothing except error messages.
|
||||
The exit status indicates whether or not any matches were
|
||||
The exit status indicates whether or not any matches were
|
||||
found.
|
||||
|
||||
-r, --recursive
|
||||
If any given path is a directory, recursively scan the files
|
||||
it contains, taking note of any --include and --exclude set-
|
||||
tings. By default, a directory is read as a normal file; in
|
||||
some operating systems this gives an immediate end-of-file.
|
||||
This option is a shorthand for setting the -d option to
|
||||
If any given path is a directory, recursively scan the files
|
||||
it contains, taking note of any --include and --exclude set-
|
||||
tings. By default, a directory is read as a normal file; in
|
||||
some operating systems this gives an immediate end-of-file.
|
||||
This option is a shorthand for setting the -d option to
|
||||
"recurse".
|
||||
|
||||
--recursion-limit=number
|
||||
See --match-limit above.
|
||||
|
||||
-s, --no-messages
|
||||
Suppress error messages about non-existent or unreadable
|
||||
files. Such files are quietly skipped. However, the return
|
||||
Suppress error messages about non-existent or unreadable
|
||||
files. Such files are quietly skipped. However, the return
|
||||
code is still 2, even if matches were found in other files.
|
||||
|
||||
-t, --total-count
|
||||
This option is useful when scanning more than one file. If
|
||||
used on its own, -t suppresses all output except for a grand
|
||||
total number of matching lines (or non-matching lines if -v
|
||||
is used) in all the files. If -t is used with -c, a grand
|
||||
total is output except when the previous output is just one
|
||||
line. In other words, it is not output when just one file's
|
||||
count is listed. If file names are being output, the grand
|
||||
total is preceded by "TOTAL:". Otherwise, it appears as just
|
||||
another number. The -t option is ignored when used with -L
|
||||
(list files without matches), because the grand total would
|
||||
always be zero.
|
||||
|
||||
-u, --utf-8
|
||||
Operate in UTF-8 mode. This option is available only if PCRE2
|
||||
has been compiled with UTF-8 support. All patterns (including
|
||||
those for any --exclude and --include options) and all sub-
|
||||
ject lines that are scanned must be valid strings of UTF-8
|
||||
those for any --exclude and --include options) and all sub-
|
||||
ject lines that are scanned must be valid strings of UTF-8
|
||||
characters.
|
||||
|
||||
-V, --version
|
||||
Write the version numbers of pcre2grep and the PCRE2 library
|
||||
to the standard output and then exit. Anything else on the
|
||||
Write the version numbers of pcre2grep and the PCRE2 library
|
||||
to the standard output and then exit. Anything else on the
|
||||
command line is ignored.
|
||||
|
||||
-v, --invert-match
|
||||
Invert the sense of the match, so that lines which do not
|
||||
Invert the sense of the match, so that lines which do not
|
||||
match any of the patterns are the ones that are found.
|
||||
|
||||
-w, --word-regex, --word-regexp
|
||||
Force the patterns to match only whole words. This is equiva-
|
||||
lent to having \b at the start and end of the pattern. This
|
||||
option applies only to the patterns that are matched against
|
||||
the contents of files; it does not apply to patterns speci-
|
||||
lent to having \b at the start and end of the pattern. This
|
||||
option applies only to the patterns that are matched against
|
||||
the contents of files; it does not apply to patterns speci-
|
||||
fied by any of the --include or --exclude options.
|
||||
|
||||
-x, --line-regex, --line-regexp
|
||||
Force the patterns to be anchored (each must start matching
|
||||
at the beginning of a line) and in addition, require them to
|
||||
match entire lines. This is equivalent to having ^ and $
|
||||
characters at the start and end of each alternative top-level
|
||||
Force the patterns to be anchored (each must start matching
|
||||
at the beginning of a line) and in addition, require them to
|
||||
match entire lines. In multiline mode the match may be more
|
||||
than one line. This is equivalent to having \A and \Z charac-
|
||||
ters at the start and end of each alternative top-level
|
||||
branch in every pattern. This option applies only to the pat-
|
||||
terns that are matched against the contents of files; it does
|
||||
not apply to patterns specified by any of the --include or
|
||||
@ -725,35 +771,86 @@ OPTIONS WITH DATA
|
||||
equals character. Otherwise pcre2grep will assume that it has no data.
|
||||
|
||||
|
||||
CALLING EXTERNAL SCRIPTS
|
||||
|
||||
pcre2grep has, by default, support for calling external programs or
|
||||
scripts during matching by making use of PCRE2's callout facility. How-
|
||||
ever, this support can be disabled when pcre2grep is built. You can
|
||||
find out whether your binary has support for callouts by running it
|
||||
with the --help option. If the support is not enabled, all callouts in
|
||||
patterns are ignored by pcre2grep.
|
||||
|
||||
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
|
||||
ment is either a number or a quoted string (see the pcre2callout docu-
|
||||
mentation for details). Numbered callouts are ignored by pcre2grep.
|
||||
String arguments are parsed as a list of substrings separated by pipe
|
||||
(vertical bar) characters. The first substring must be an executable
|
||||
name, with the following substrings specifying arguments:
|
||||
|
||||
executable_name|arg1|arg2|...
|
||||
|
||||
Any substring (including the executable name) may contain escape
|
||||
sequences started by a dollar character: $<digits> or ${<digits>} is
|
||||
replaced by the captured substring of the given decimal number, which
|
||||
must be greater than zero. If the number is greater than the number of
|
||||
capturing substrings, or if the capture is unset, the replacement is
|
||||
empty.
|
||||
|
||||
Any other character is substituted by itself. In particular, $$ is
|
||||
replaced by a single dollar and $| is replaced by a pipe character.
|
||||
Here is an example:
|
||||
|
||||
echo -e "abcde\n12345" | pcre2grep \
|
||||
'(?x)(.)(..(.))
|
||||
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
|
||||
|
||||
Output:
|
||||
|
||||
Arg1: [a] [bcd] [d] Arg2: |a| ()
|
||||
abcde
|
||||
Arg1: [1] [234] [4] Arg2: |1| ()
|
||||
12345
|
||||
|
||||
The parameters for the execv() system call that is used to run the pro-
|
||||
gram or script are zero-terminated strings. This means that binary zero
|
||||
characters in the callout argument will cause premature termination of
|
||||
their substrings, and therefore should not be present. Any syntax
|
||||
errors in the string (for example, a dollar not followed by another
|
||||
character) cause the callout to be ignored. If running the program
|
||||
fails for any reason (including the non-existence of the executable), a
|
||||
local matching failure occurs and the matcher backtracks in the normal
|
||||
way.
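The substitution rules described above can be sketched in C. This is only an illustration, not pcre2grep's own code: the function name `expand_callout`, the `MAX_ARGS` and `ARG_SIZE` limits, and the layout of the `captures` array are assumptions made for the example. It also assumes that `${` not followed by a digit is treated as a dollar followed by an ordinary `{`, which the text does not spell out.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 16    /* arbitrary limits chosen for this sketch */
#define ARG_SIZE 256

/* Hypothetical helper: splits a callout string on '|' and expands the
   dollar escapes. captures must have ncaptures + 1 entries: captures[n] is
   the text of capture n, or NULL if unset; captures[0] is unused. Returns
   the number of substrings written to args, or -1 if the string is
   malformed or does not fit. */
static int expand_callout(const char *s, const char *const *captures,
                          int ncaptures, char args[MAX_ARGS][ARG_SIZE])
{
    int argc = 0;
    size_t len = 0;

    for (;; s++) {
        if (*s == '\0' || *s == '|') {          /* end of one substring */
            args[argc][len] = '\0';
            argc++;
            if (*s == '\0') return argc;
            if (argc == MAX_ARGS) return -1;
            len = 0;
            continue;
        }

        const char *piece = s;                  /* default: the character itself */
        size_t piece_len = 1;

        if (*s == '$') {
            const char *d = s + 1;
            int braced = (d[0] == '{' && isdigit((unsigned char)d[1]));
            if (braced) d++;
            if (*d == '\0') return -1;          /* dollar with nothing after it */

            if (isdigit((unsigned char)*d)) {
                unsigned long n = 0;
                while (isdigit((unsigned char)*d)) {
                    if (n < 100000) n = n * 10 + (unsigned long)(*d - '0');
                    d++;
                }
                if (braced && *d++ != '}') return -1;
                if (n == 0) return -1;          /* number must be greater than zero */
                piece = (n <= (unsigned long)ncaptures && captures[n] != NULL)
                            ? captures[n] : ""; /* unset or out of range: empty */
                piece_len = strlen(piece);
                s = d - 1;                      /* loop increment steps past the escape */
            } else {
                piece = d;                      /* $$ gives $, $| gives |, and so on */
                s = d;
            }
        }

        if (len + piece_len >= ARG_SIZE) return -1;
        memcpy(args[argc] + len, piece, piece_len);
        len += piece_len;
    }
}
```

Given the callout string from the example and the captures "a", "bcd", "d" (with capture 4 unset), this sketch produces three substrings: the executable name and the two arguments shown in the first line of the example output. A real implementation would then pass them to execv() as zero-terminated strings.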
|
||||
|
||||
|
||||
MATCHING ERRORS
|
||||
|
||||
It is possible to supply a regular expression that takes a very long
|
||||
time to fail to match certain lines. Such patterns normally involve
|
||||
nested indefinite repeats, for example: (a+)*\d when matched against a
|
||||
line of a's with no final digit. The PCRE2 matching function has a
|
||||
resource limit that causes it to abort in these circumstances. If this
|
||||
happens, pcre2grep outputs an error message and the line that caused
|
||||
the problem to the standard error stream. If there are more than 20
|
||||
It is possible to supply a regular expression that takes a very long
|
||||
time to fail to match certain lines. Such patterns normally involve
|
||||
nested indefinite repeats, for example: (a+)*\d when matched against a
|
||||
line of a's with no final digit. The PCRE2 matching function has a
|
||||
resource limit that causes it to abort in these circumstances. If this
|
||||
happens, pcre2grep outputs an error message and the line that caused
|
||||
the problem to the standard error stream. If there are more than 20
|
||||
such errors, pcre2grep gives up.
|
||||
|
||||
The --match-limit option of pcre2grep can be used to set the overall
|
||||
resource limit; there is a second option called --recursion-limit that
|
||||
sets a limit on the amount of memory (usually stack) that is used (see
|
||||
The --match-limit option of pcre2grep can be used to set the overall
|
||||
resource limit; there is a second option called --recursion-limit that
|
||||
sets a limit on the amount of memory (usually stack) that is used (see
|
||||
the discussion of these options above).
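To see why a pattern such as (a+)*\d can take so long to fail, and how a resource limit cuts the search short, here is a toy backtracking matcher in C. It is a sketch only: it hard-codes that one pattern, anchors it at the start of the string, and counts recursive calls as a stand-in for the real limit. The names `match_group` and `toy_match` and the result codes are invented for the example, and PCRE2's matcher works quite differently.

```c
#include <stddef.h>

enum { NO_MATCH = 0, MATCH = 1, LIMIT_HIT = -1 };

/* Toy matcher for the fixed pattern (a+)*\d, anchored at s. Each call
   decides how many more times the group repeats; *steps counts the calls. */
static int match_group(const char *s, unsigned long *steps, unsigned long limit)
{
    if (++*steps > limit) return LIMIT_HIT;     /* give up: too much work */

    /* One more iteration of (a+): try the longest run first, then shorter. */
    size_t run = 0;
    while (s[run] == 'a') run++;
    for (size_t n = run; n >= 1; n--) {
        int rc = match_group(s + n, steps, limit);
        if (rc != NO_MATCH) return rc;          /* MATCH or LIMIT_HIT */
    }

    /* No further iterations: the next character must be a digit. */
    return (*s >= '0' && *s <= '9') ? MATCH : NO_MATCH;
}

static int toy_match(const char *s, unsigned long limit)
{
    unsigned long steps = 0;
    return match_group(s, &steps, limit);
}
```

For a line of k a's with no final digit, this sketch makes 2^k calls before it can report failure, so ten a's fail after about a thousand steps while forty a's exceed a limit of 10 million long before the search completes. A line such as "aaa1" matches almost immediately.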
|
||||
|
||||
|
||||
DIAGNOSTICS
|
||||
|
||||
Exit status is 0 if any matches were found, 1 if no matches were found,
|
||||
and 2 for syntax errors, overlong lines, non-existent or inaccessible
|
||||
files (even if matches were found in other files) or too many matching
|
||||
and 2 for syntax errors, overlong lines, non-existent or inaccessible
|
||||
files (even if matches were found in other files) or too many matching
|
||||
errors. Using the -s option to suppress error messages about inaccessi-
|
||||
ble files does not affect the return code.
|
||||
|
||||
|
||||
SEE ALSO
|
||||
|
||||
pcre2pattern(3), pcre2syntax(3).
|
||||
pcre2pattern(3), pcre2syntax(3), pcre2callout(3).
|
||||
|
||||
|
||||
AUTHOR
|
||||
@ -765,5 +862,5 @@ AUTHOR
|
||||
|
||||
REVISION
|
||||
|
||||
Last updated: 03 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 31 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2JIT 3 "27 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2JIT 3 "05 June 2016" "PCRE2 10.22"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||
@ -61,6 +61,12 @@ much faster than the normal interpretive code, but yields exactly the same
|
||||
results. The returned value from \fBpcre2_jit_compile()\fP is zero on success,
|
||||
or a negative error code.
|
||||
.P
|
||||
There is a limit to the size of pattern that JIT supports, imposed by the size
|
||||
of machine stack that it uses. The exact rules are not documented because they
|
||||
may change at any time, in particular, when new optimizations are introduced.
|
||||
If a pattern is too big, a call to \fBpcre2_jit_compile()\fP returns
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
.P
|
||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
||||
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
||||
PCRE2_PARTIAL_SOFT options of \fBpcre2_match()\fP, you should set one or both
|
||||
@ -122,6 +128,9 @@ PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
|
||||
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
|
||||
PCRE2_ANCHORED option is not supported at match time.
|
||||
.P
|
||||
If the PCRE2_NO_JIT option is passed to \fBpcre2_match()\fP it disables the
|
||||
use of JIT, forcing matching by the interpreter code.
|
||||
.P
|
||||
The only unsupported pattern items are \eC (match a single data unit) when
|
||||
running in a UTF mode, and a callout immediately before an assertion condition
|
||||
in a conditional group.
|
||||
@ -207,8 +216,13 @@ for JIT matching. A callback function can therefore be used to determine
|
||||
whether a match operation was executed by JIT or by the interpreter.
|
||||
.P
|
||||
You may safely use the same JIT stack for more than one pattern (either by
|
||||
assigning directly or by callback), as long as the patterns are all matched
|
||||
sequentially in the same thread. In a multithread application, if you do not
|
||||
assigning directly or by callback), as long as the patterns are matched
|
||||
sequentially in the same thread. Currently, the only way to set up
|
||||
non-sequential matches in one thread is to use callouts: if a callout function
|
||||
starts another match, that match must use a different JIT stack to the one used
|
||||
for currently suspended match(es).
|
||||
.P
|
||||
In a multithread application, if you do not
|
||||
specify a JIT stack, or if you assign or pass back NULL from a callback, that
|
||||
is thread-safe, because each thread has its own machine stack. However, if you
|
||||
assign or pass back a non-NULL JIT stack, this must be a different stack for
|
||||
@ -366,7 +380,7 @@ The fast path function is called \fBpcre2_jit_match()\fP, and it takes exactly
|
||||
the same arguments as \fBpcre2_match()\fP. The return values are also the same,
|
||||
plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
|
||||
requested that was not compiled. Unsupported option bits (for example,
|
||||
PCRE2_ANCHORED) are ignored.
|
||||
PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
|
||||
.P
|
||||
When you call \fBpcre2_match()\fP, as well as testing for invalid options, a
|
||||
number of other sanity checks are performed on the arguments. For example, if
|
||||
@ -399,6 +413,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 27 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 05 June 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2LIMITS 3 "25 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2LIMITS 3 "26 October 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "SIZE AND OTHER LIMITATIONS"
|
||||
@ -20,6 +20,10 @@ documentation for details. In these cases the limit is substantially larger.
|
||||
However, the speed of execution is slower. In the 32-bit library, the internal
|
||||
linkage size is always 4.
|
||||
.P
|
||||
The maximum length of a source pattern string is essentially unlimited; it is
|
||||
the largest number a PCRE2_SIZE variable can hold. However, the program that
|
||||
calls \fBpcre2_compile()\fP can specify a smaller limit.
|
||||
.P
|
||||
The maximum length (in code units) of a subject string is one less than the
|
||||
largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned
|
||||
integer type, usually defined as size_t. Its maximum value (that is
|
||||
@ -37,22 +41,25 @@ documentation.
|
||||
.P
|
||||
All values in repeating quantifiers must be less than 65536.
|
||||
.P
|
||||
The maximum length of a lookbehind assertion is 65535 characters.
|
||||
.P
|
||||
There is no limit to the number of parenthesized subpatterns, but there can be
|
||||
no more than 65535 capturing subpatterns. There is, however, a limit to the
|
||||
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
|
||||
order to limit the amount of system stack used at compile time. The limit can
|
||||
be specified when PCRE2 is built; the default is 250.
|
||||
.P
|
||||
There is a limit to the number of forward references to subsequent subpatterns
|
||||
of around 200,000. Repeated forward references with fixed upper limits, for
|
||||
example, (?2){0,100} when subpattern number 2 is to the right, are included in
|
||||
the count. There is no limit to the number of backward references.
|
||||
order to limit the amount of system stack used at compile time. The default
|
||||
limit can be specified when PCRE2 is built; the default default is 250. An
|
||||
application can change this limit by calling pcre2_set_parens_nest_limit() to
|
||||
set the limit in a compile context.
|
||||
.P
|
||||
The maximum length of name for a named subpattern is 32 code units, and the
|
||||
maximum number of named subpatterns is 10000.
|
||||
.P
|
||||
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
|
||||
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
|
||||
is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
|
||||
32-bit libraries.
|
||||
.P
|
||||
The maximum length of a string argument to a callout is the largest number a
|
||||
32-bit unsigned integer can hold.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
@ -69,6 +76,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 25 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 26 October 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
|
||||
.TH PCRE2PATTERN 3 "27 December 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
@ -158,6 +158,11 @@ be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
.P
|
||||
The match limit is used (but in a different way) when JIT is being used, but it
|
||||
is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP.
|
||||
However, the recursion limit is relevant for DFA matching, which does use some
|
||||
function recursion, in particular, for recursions within the pattern.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="newlines"></a>
|
||||
@ -359,29 +364,28 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
||||
code unit following \ec has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||
modes.
|
||||
compile-time error occurs.
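The ASCII rule for \ec can be written as a few lines of C. This is a sketch of the rule as stated here, not PCRE2's source; the function name `control_escape` and the use of -1 to stand for the compile-time error are choices made for the example.

```c
/* Returns the character code that \cX produces in an ASCII environment,
   where c is the code unit following \c, or -1 where the text says a
   compile-time error occurs. */
static int control_escape(int c)
{
    if (c < 32 || c > 126) return -1;       /* outside the permitted range */
    if (c >= 'a' && c <= 'z') c -= 32;      /* lower case becomes upper case */
    return c ^ 0x40;                        /* invert bit 6 (hex 40) */
}
```

With this function, 'A' and 'a' both give hex 01, 'Z' gives hex 1A, '{' gives hex 3B, ';' gives hex 7B, and '?' gives hex 7F (DEL), matching the values quoted in the text.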
|
||||
.P
|
||||
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
|
||||
generate the appropriate EBCDIC code values. The \ec escape is processed
|
||||
as specified for Perl in the \fBperlebcdic\fP document. The only characters
|
||||
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \e@ encodes
|
||||
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||
to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||
\e? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
other character provokes a compile-time error. The sequence \ec@ encodes
|
||||
character code 0; after \ec the letters (in either case) encode characters 1-26
|
||||
(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
||||
1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
.P
|
||||
Thus, apart from \e?, these escapes generate the same character code values as
|
||||
Thus, apart from \ec?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \eG always generates code value 7, which is BEL in ASCII
|
||||
differ. For example, \ecG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
.P
|
||||
The sequence \e? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \e? generate 95; otherwise it generates 255.
|
||||
values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
|
||||
.P
|
||||
After \e0 up to two further octal digits are read. If there are fewer than two
|
||||
digits, just those that are present are used. Thus the sequence \e0\ex\e015
|
||||
@ -508,9 +512,9 @@ by code point, as described in the previous section.
|
||||
.SS "Absolute and relative back references"
|
||||
.rs
|
||||
.sp
|
||||
The sequence \eg followed by an unsigned or a negative number, optionally
|
||||
enclosed in braces, is an absolute or relative back reference. A named back
|
||||
reference can be coded as \eg{name}. Back references are discussed
|
||||
The sequence \eg followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative back reference. A named back reference
|
||||
can be coded as \eg{name}. Back references are discussed
|
||||
.\" HTML <a href="#backreferences">
|
||||
.\" </a>
|
||||
later,
|
||||
@ -671,8 +675,8 @@ below.
|
||||
This particular group matches either the two-character sequence CR followed by
|
||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||
line, U+0085). The two-character sequence is treated as a single unit that
|
||||
cannot be split.
|
||||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
.P
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
|
||||
@ -738,6 +742,8 @@ example:
|
||||
Those that are not part of an identified script are lumped together as
|
||||
"Common". The current list of scripts is:
|
||||
.P
|
||||
Ahom,
|
||||
Anatolian_Hieroglyphs,
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
@ -778,6 +784,7 @@ Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hatran,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
@ -814,12 +821,14 @@ Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Multani,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Hungarian,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
@ -841,6 +850,7 @@ Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
SignWriting,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
@ -1177,6 +1187,18 @@ patterns that are anchored in single line mode because all branches start with
|
||||
when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
|
||||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||
.P
|
||||
When the newline convention (see
|
||||
.\" HTML <a href="#newlines">
|
||||
.\" </a>
|
||||
"Newline conventions"
|
||||
.\"
|
||||
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||
preferred, even if the single characters CR and LF are also recognized as
|
||||
newlines. For example, if the newline convention is "any", a multiline mode
|
||||
circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
|
||||
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||
start of the string, of course.)
|
||||
.P
|
||||
Note that the sequences \eA, \eZ, and \ez can be used to match the start and
|
||||
end of the subject in both modes, and if all branches of a pattern start with
|
||||
\eA it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||
@ -1227,21 +1249,31 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
use of \eC by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
unless the PCRE2_NO_UTF_CHECK option is used).
|
||||
.P
|
||||
An application can lock out the use of \eC by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
|
||||
build PCRE2 with the use of \eC permanently disabled.
|
||||
.P
|
||||
PCRE2 does not allow \eC to appear in lookbehind assertions
|
||||
.\" HTML <a href="#lookbehind">
|
||||
.\" </a>
|
||||
(described below)
|
||||
.\"
|
||||
in a UTF mode, because this would make it impossible to calculate the length of
|
||||
the lookbehind.
|
||||
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
|
||||
the length of the lookbehind. Neither the alternative matching function
|
||||
\fBpcre2_dfa_match()\fP nor the JIT optimizer supports \eC in these UTF modes.
|
||||
The former gives a match-time error; the latter fails to optimize and so the
|
||||
match is always run using the interpreter.
|
||||
.P
|
||||
In the 32-bit library, however, \eC is always supported (when not explicitly
|
||||
locked out) because it always matches a single code unit, whether or not UTF-32
|
||||
is specified.
|
||||
.P
|
||||
In general, the \eC escape sequence is best avoided. However, one way of using
|
||||
it that avoids the problem of malformed UTF characters is to use a lookahead to
|
||||
check the length of the next character, as in this pattern, which could be used
|
||||
with a UTF-8 string (ignore white space and line breaks):
|
||||
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
|
||||
lookahead to check the length of the next character, as in this pattern, which
|
||||
could be used with a UTF-8 string (ignore white space and line breaks):
|
||||
.sp
|
||||
(?| (?=[\ex00-\ex7f])(\eC) |
|
||||
(?=[\ex80-\ex{7ff}])(\eC)(\eC) |
|
||||
@ -1297,37 +1329,6 @@ when matching character classes, whatever line-ending sequence is in use, and
|
||||
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||
class such as [^a] always matches one of these characters.
|
||||
.P
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class, or
|
||||
immediately after a range. For example, [b-d-z] matches letters in the range b
|
||||
to d, a hyphen character, or z.
|
||||
.P
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of range, so [W-\e]46] is interpreted as a class containing a range
|
||||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
.P
|
||||
An error is generated if a POSIX character class (see below) or an escape
|
||||
sequence other than one that defines a single character appears at a point
|
||||
where a range ending character is expected. For example, [z-\exff] is valid,
|
||||
but [A-\ed] and [A-[:digit:]] are not.
|
||||
.P
|
||||
Ranges operate in the collating sequence of character values. They can also be
|
||||
used for characters specified numerically, for example [\e000-\e037]. Ranges
|
||||
can include any characters that are valid for the current mode.
|
||||
.P
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\exc8-\excb] matches accented E
|
||||
characters in both cases.
|
||||
.P
|
||||
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
|
||||
\eV, \ew, and \eW may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\edABCDEF] matches any hexadecimal
|
||||
@ -1343,6 +1344,46 @@ class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
|
||||
are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
.P
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class,
|
||||
or immediately after a range. For example, [b-d-z] matches letters in the range
|
||||
b to d, a hyphen character, or z.
|
||||
.P
|
||||
Perl treats a hyphen as a literal if it appears before or after a POSIX class
|
||||
(see below) or a character type escape such as \ed, but gives a warning in
|
||||
its warning mode, as this is most likely a user error. As PCRE2 has no facility
|
||||
for warning, an error is given in these cases.
|
||||
.P
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of range, so [W-\e]46] is interpreted as a class containing a range
|
||||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
.P
|
||||
Ranges normally include all code points between the start and end characters,
|
||||
inclusive. They can also be used for code points specified numerically, for
|
||||
example [\e000-\e037]. Ranges can include any characters that are valid for the
|
||||
current mode.
|
||||
.P
|
||||
There is a special case in EBCDIC environments for ranges whose end points are
|
||||
both specified as literal letters in the same case. For compatibility with
|
||||
Perl, EBCDIC code points within the range that are not letters are omitted. For
|
||||
example, [h-k] matches only four characters, even though the codes for h and k
|
||||
are 0x88 and 0x92, a range of 11 code points. However, if the range is
|
||||
specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
|
||||
are included.
|
||||
.P
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\exc8-\excb] matches accented E
|
||||
characters in both cases.
|
||||
.P
|
||||
A circumflex can conveniently be used with the upper case character types to
|
||||
specify a more restricted set of characters than the matching lower case type.
|
||||
For example, the class [^\eW_] matches any letter or digit, but not underscore,
|
||||
@ -1514,12 +1555,8 @@ respectively.
|
||||
.P
|
||||
When one of these option changes occurs at top level (that is, not inside
|
||||
subpattern parentheses), the change applies to the remainder of the pattern
|
||||
that follows. If the change is placed right at the start of a pattern, PCRE2
|
||||
extracts it into the global options (and it will therefore show up in data
|
||||
extracted by the \fBpcre2_pattern_info()\fP function).
|
||||
.P
|
||||
An option change within a subpattern (see below for a description of
|
||||
subpatterns) affects only that part of the subpattern that follows it, so
|
||||
that follows. An option change within a subpattern (see below for a description
|
||||
of subpatterns) affects only that part of the subpattern that follows it, so
|
||||
.sp
|
||||
(a(?i)b)c
|
||||
.sp
|
||||
@ -1650,6 +1687,9 @@ first one in the pattern with the given number. The following pattern matches
|
||||
.sp
|
||||
/(?|(abc)|(def))(?1)/
|
||||
.sp
|
||||
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||
of computing an absolute group number.
|
||||
.P
|
||||
If a
|
||||
.\" HTML <a href="#conditions">
|
||||
.\" </a>
|
||||
@ -2056,9 +2096,9 @@ no such problem when named parentheses are used. A back reference to any
|
||||
subpattern is possible using named parentheses (see below).
|
||||
.P
|
||||
Another way of avoiding the ambiguity inherent in the use of digits following a
|
||||
backslash is to use the \eg escape sequence. This escape must be followed by an
|
||||
unsigned number or a negative number, optionally enclosed in braces. These
|
||||
examples are all identical:
|
||||
backslash is to use the \eg escape sequence. This escape must be followed by a
|
||||
signed or unsigned number, optionally enclosed in braces. These examples are
|
||||
all identical:
|
||||
.sp
|
||||
(ring), \e1
|
||||
(ring), \eg1
|
||||
@ -2066,8 +2106,7 @@ examples are all identical:
|
||||
.sp
|
||||
An unsigned number specifies an absolute reference without the ambiguity that
|
||||
is present in the older syntax. It is also useful when literal digits follow
|
||||
the reference. A negative number is a relative reference. Consider this
|
||||
example:
|
||||
the reference. A signed number is a relative reference. Consider this example:
|
||||
.sp
|
||||
(abc(def)ghi)\eg{-1}
|
||||
.sp
|
||||
@ -2077,6 +2116,10 @@ Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
|
||||
can be helpful in long patterns, and also in patterns that are created by
|
||||
joining together fragments that contain references within themselves.
|
||||
.P
|
||||
The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
|
||||
of forward reference can be useful in patterns that repeat. Perl does not
|
||||
support the use of + in this way.
|
||||
.P
|
||||
A back reference matches whatever actually matched the capturing subpattern in
|
||||
the current subject string, rather than anything matching the subpattern
|
||||
itself (see
|
||||
@ -2184,6 +2227,13 @@ numbering the capturing subpatterns in the whole pattern. However, substring
|
||||
capturing is carried out only for positive assertions. (Perl sometimes, but not
|
||||
always, does do capturing in negative assertions.)
|
||||
.P
|
||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
||||
succeeds, but failure to match later in the pattern causes backtracking over
|
||||
this assertion, the captures within the assertion are reset only if no higher
|
||||
numbered captures are already set. This is, unfortunately, a fundamental
|
||||
limitation of the current implementation; it may get removed in a future
|
||||
reworking.
|
||||
.P
|
||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||
it makes no sense to assert the same thing several times, the side effect of
|
||||
capturing parentheses may occasionally be useful. However, an assertion that
|
||||
@ -2281,23 +2331,34 @@ temporarily move the current position back by the fixed length and then try to
|
||||
match. If there are insufficient characters before the current position, the
|
||||
assertion fails.
|
||||
.P
|
||||
In a UTF mode, PCRE2 does not allow the \eC escape (which matches a single code
|
||||
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
|
||||
it impossible to calculate the length of the lookbehind. The \eX and \eR
|
||||
escapes, which can match different numbers of code units, are also not
|
||||
permitted.
|
||||
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
|
||||
single code unit even in a UTF mode) to appear in lookbehind assertions,
|
||||
because it makes it impossible to calculate the length of the lookbehind. The
|
||||
\eX and \eR escapes, which can match different numbers of code units, are never
|
||||
permitted in lookbehinds.
|
||||
.P
|
||||
.\" HTML <a href="#subpatternsassubroutines">
|
||||
.\" </a>
|
||||
"Subroutine"
|
||||
.\"
|
||||
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
|
||||
as the subpattern matches a fixed-length string.
|
||||
as the subpattern matches a fixed-length string. However,
|
||||
.\" HTML <a href="#recursion">
|
||||
.\" </a>
|
||||
Recursion,
|
||||
recursion,
|
||||
.\"
|
||||
however, is not supported.
|
||||
that is, a "subroutine" call into a group that is already active,
|
||||
is not supported.
|
||||
.P
|
||||
Perl does not support back references in lookbehinds. PCRE2 does support them,
|
||||
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
|
||||
must not be set, there must be no use of (?| in the pattern (it creates
|
||||
duplicate subpattern numbers), and if the back reference is by name, the name
|
||||
must be unique. Of course, the referenced subpattern must itself be of fixed
|
||||
length. The following pattern matches words containing at least two characters
|
||||
that begin and end with the same character:
|
||||
.sp
|
||||
\eb(\ew)\ew++(?<=\e1)
|
||||
.P
|
||||
Possessive quantifiers can be used in conjunction with lookbehind assertions to
|
||||
specify efficient matching of fixed-length strings at the end of subject
|
||||
@ -2436,7 +2497,9 @@ This makes the fragment independent of the parentheses in the larger pattern.
|
||||
.sp
|
||||
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
|
||||
subpattern by name. For compatibility with earlier versions of PCRE1, which had
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized.
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
|
||||
however, that undelimited names consisting of the letter R followed by digits
|
||||
are ambiguous (see the following section).
|
||||
.P
|
||||
Rewriting the above example to use a named subpattern gives this:
|
||||
.sp
|
||||
@ -2450,33 +2513,55 @@ matched.
|
||||
.SS "Checking for pattern recursion"
|
||||
.rs
|
||||
.sp
|
||||
If the condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if a recursive call to the whole pattern or any
|
||||
subpattern has been made. If digits or a name preceded by ampersand follow the
|
||||
letter R, for example:
|
||||
.sp
|
||||
(?(R3)...) or (?(R&name)...)
|
||||
.sp
|
||||
the condition is true if the most recent recursion is into a subpattern whose
|
||||
number or name is given. This condition does not check the entire recursion
|
||||
stack. If the name used in a condition of this kind is a duplicate, the test is
|
||||
applied to all subpatterns of the same name, and is true if any one of them is
|
||||
the most recent recursion.
|
||||
.P
|
||||
At "top level", all these recursion test conditions are false.
|
||||
"Recursion" in this sense refers to any subroutine-like call from one part of
|
||||
the pattern to another, whether or not it is actually recursive. See the
|
||||
sections entitled
|
||||
.\" HTML <a href="#recursion">
|
||||
.\" </a>
|
||||
The syntax for recursive patterns
|
||||
"Recursive patterns"
|
||||
.\"
|
||||
is described below.
|
||||
and
|
||||
.\" HTML <a href="#subpatternsassubroutines">
|
||||
.\" </a>
|
||||
"Subpatterns as subroutines"
|
||||
.\"
|
||||
below for details of recursion and subpattern calls.
|
||||
.P
|
||||
If a condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if matching is currently in a recursion or subroutine
|
||||
call to the whole pattern or any subpattern. If digits follow the letter R, and
|
||||
there is no subpattern with that name, the condition is true if the most recent
|
||||
call is into a subpattern with the given number, which must exist somewhere in
|
||||
the overall pattern. This is a contrived example that is equivalent to a+b:
|
||||
.sp
|
||||
((?(R1)a+|(?1)b))
|
||||
.sp
|
||||
However, in both cases, if there is a subpattern with a matching name, the
|
||||
condition tests for its being set, as described in the section above, instead
|
||||
of testing for recursion. For example, creating a group with the name R1 by
|
||||
adding (?<R1>) to the above pattern completely changes its meaning.
|
||||
.P
|
||||
If a name preceded by ampersand follows the letter R, for example:
|
||||
.sp
|
||||
(?(R&name)...)
|
||||
.sp
|
||||
the condition is true if the most recent recursion is into a subpattern of that
|
||||
name (which must exist within the pattern).
|
||||
.P
|
||||
This condition does not check the entire recursion stack. It tests only the
|
||||
current level. If the name used in a condition of this kind is a duplicate, the
|
||||
test is applied to all subpatterns of the same name, and is true if any one of
|
||||
them is the most recent recursion.
|
||||
.P
|
||||
At "top level", all these recursion test conditions are false.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="subdefine"></a>
|
||||
.SS "Defining subpatterns for use by reference only"
|
||||
.rs
|
||||
.sp
|
||||
If the condition is the string (DEFINE), and there is no subpattern with the
|
||||
name DEFINE, the condition is always false. In this case, there may be only one
|
||||
If the condition is the string (DEFINE), the condition is always false, even if
|
||||
there is a group with the name DEFINE. In this case, there may be only one
|
||||
alternative in the subpattern. It is always skipped if control reaches this
|
||||
point in the pattern; the idea of DEFINE is that it can be used to define
|
||||
subroutines that can be referenced from elsewhere. (The use of
|
||||
@ -2513,7 +2598,8 @@ For example:
|
||||
(?(VERSION>=10.4)yes|no)
|
||||
.sp
|
||||
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
|
||||
"no" otherwise.
|
||||
"no" otherwise. The fractional part of the version number may not contain more
|
||||
than two digits.
|
||||
.
|
||||
.
|
||||
.SS "Assertion conditions"
|
||||
@ -2630,6 +2716,23 @@ pattern above you can write (?-2) to refer to the second most recently opened
|
||||
parentheses preceding the recursion. In other words, a negative number counts
|
||||
capturing parentheses leftwards from the point at which it is encountered.
|
||||
.P
|
||||
Be aware, however, that if
|
||||
.\" HTML <a href="#dupsubpatternnumber">
|
||||
.\" </a>
|
||||
duplicate subpattern numbers
|
||||
.\"
|
||||
are in use, relative references refer to the earliest subpattern with the
|
||||
appropriate number. Consider, for example:
|
||||
.sp
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
.sp
|
||||
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
|
||||
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||
opened pair of parentheses has the number 1, but it is the first such group (the (a)
|
||||
group) to which the recursion refers. This would be the same if an absolute
|
||||
reference (?1) was used. In other words, relative references are just a
|
||||
shorthand for computing a group number.
|
||||
.P
|
||||
It is also possible to refer to subsequently opened parentheses, by writing
|
||||
references such as (?+2). However, these cannot be recursive because the
|
||||
reference is not inside the parentheses that are referenced. They are always
|
||||
@ -2929,14 +3032,32 @@ in production code should be noted to avoid problems during upgrades." The same
|
||||
remarks apply to the PCRE2 features described in this section.
|
||||
.P
|
||||
The new verbs make use of what was previously invalid syntax: an opening
|
||||
parenthesis followed by an asterisk. They are generally of the form
|
||||
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
|
||||
differently depending on whether or not a name is present. A name is any
|
||||
sequence of characters that does not include a closing parenthesis. The maximum
|
||||
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
|
||||
libraries. If the name is empty, that is, if the closing parenthesis
|
||||
immediately follows the colon, the effect is as if the colon were not there.
|
||||
Any number of these verbs may occur in a pattern.
|
||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
||||
depending on whether or not a name is present.
|
||||
.P
|
||||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
that does not include a closing parenthesis. The name is not processed in
|
||||
any way, and it is not possible to include a closing parenthesis in the name.
|
||||
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
|
||||
is no longer Perl-compatible.
|
||||
.P
|
||||
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
|
||||
and only an unescaped closing parenthesis terminates the name. However, the
|
||||
only backslash items that are permitted are \eQ, \eE, and sequences such as
|
||||
\ex{100} that define character code points. Character type escapes such as \ed
|
||||
are faulted.
|
||||
.P
|
||||
A closing parenthesis can be included in a name either as \e) or between \eQ
|
||||
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
|
||||
also set, unescaped whitespace in verb names is skipped, and #-comments are
|
||||
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
|
||||
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
|
||||
.P
|
||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||
parenthesis immediately follows the colon, the effect is as if the colon were
|
||||
not there. Any number of these verbs may occur in a pattern.
|
||||
.P
|
||||
Since these verbs are specifically related to backtracking, most of them can be
|
||||
used only when the pattern is to be matched using the traditional matching
|
||||
@ -3361,6 +3482,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 13 June 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 27 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2POSIX 3 "31 January 2016" "PCRE2 10.22"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "SYNOPSIS"
|
||||
@ -28,7 +28,7 @@ expression 8-bit library. See the
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation for a description of PCRE2's native API, which contains much
|
||||
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
|
||||
additional functionality. There are no POSIX-style wrappers for PCRE2's 16-bit
|
||||
and 32-bit libraries.
|
||||
.P
|
||||
The functions described here are just wrapper functions that ultimately call
|
||||
@ -44,9 +44,9 @@ value zero. This has no effect, but since programs that are written to the
|
||||
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||
replacement library. Other POSIX options are not even defined.
|
||||
.P
|
||||
There are also some other options that are not defined by POSIX. These have
|
||||
been added at the request of users who want to make use of certain
|
||||
PCRE2-specific features via the POSIX calling interface.
|
||||
There are also some options that are not defined by POSIX. These have been
|
||||
added at the request of users who want to make use of certain PCRE2-specific
|
||||
features via the POSIX calling interface.
|
||||
.P
|
||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||
in style. The syntax and semantics of the regular expressions themselves are
|
||||
@ -95,11 +95,11 @@ defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
.sp
|
||||
REG_NOSUB
|
||||
.sp
|
||||
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
|
||||
for compilation to the native function. In addition, when a pattern that is
|
||||
compiled with this flag is passed to \fBregexec()\fP for matching, the
|
||||
\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings
|
||||
are returned.
|
||||
When a pattern that is compiled with this flag is passed to \fBregexec()\fP for
|
||||
matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no
|
||||
captured strings are returned. Versions of the PCRE2 library prior to 10.22 used
|
||||
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
||||
because it disables the use of back references.
|
||||
.sp
|
||||
REG_UCP
|
||||
.sp
|
||||
@ -145,7 +145,7 @@ use the contents of the \fIpreg\fP structure. If, for example, you pass it to
|
||||
This area is not simple, because POSIX and Perl take different views of things.
|
||||
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||
never intended to be a POSIX engine. The following table lists the different
|
||||
possibilities for matching newline characters in PCRE2:
|
||||
possibilities for matching newline characters in Perl and PCRE2:
|
||||
.sp
|
||||
Default Change with
|
||||
.sp
|
||||
@ -155,7 +155,7 @@ possibilities for matching newline characters in PCRE2:
|
||||
$ matches \en in middle no PCRE2_MULTILINE
|
||||
^ matches \en in middle no PCRE2_MULTILINE
|
||||
.sp
|
||||
This is the equivalent table for POSIX:
|
||||
This is the equivalent table for a POSIX-compatible pattern matcher:
|
||||
.sp
|
||||
Default Change with
|
||||
.sp
|
||||
@ -165,13 +165,17 @@ This is the equivalent table for POSIX:
|
||||
$ matches \en in middle no REG_NEWLINE
|
||||
^ matches \en in middle no REG_NEWLINE
|
||||
.sp
|
||||
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
|
||||
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
|
||||
newline from matching [^a].
|
||||
This behaviour is not what happens when PCRE2 is called via its POSIX
|
||||
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
|
||||
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
|
||||
is no way to stop newline from matching [^a].
|
||||
.P
|
||||
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
|
||||
the REG_NEWLINE action.
|
||||
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY when calling \fBpcre2_compile()\fP directly, but there is
|
||||
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
|
||||
the POSIX API, passing REG_NEWLINE to PCRE2's \fBregcomp()\fP function
|
||||
causes PCRE2_MULTILINE to be passed to \fBpcre2_compile()\fP, and REG_DOTALL
|
||||
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
|
||||
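The effect of REG_NEWLINE on circumflex and dollar can be checked with a small
helper. This is only a sketch: it is written against the system <regex.h>, and
the assumption is that the same calls behave the same way when the program is
built against PCRE2's POSIX wrapper (pcre2posix.h) instead. The function name
posix_matches() is hypothetical and not part of either library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in `subject`,
   0 if it does not, and -1 if the pattern fails to compile. `cflags` is
   passed on to regcomp(), so the caller can add REG_NEWLINE. */
static int posix_matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);

    /* REG_NOSUB: only a yes/no answer is wanted, no captured substrings. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling posix_matches("^xyz", "abc\nxyz", 0) is expected to report no match,
because circumflex matches only at the very start of the subject by default,
whereas the same call with REG_NEWLINE is expected to report a match.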
.
|
||||
.
|
||||
.SH "MATCHING A PATTERN"
|
||||
@ -207,16 +211,18 @@ to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
|
||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
|
||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||
how it is matched.
|
||||
how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are
|
||||
mutually exclusive; the error REG_INVARG is returned.
|
||||
.P
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
|
||||
\fBregexec()\fP are ignored.
|
||||
\fBregexec()\fP are ignored (except possibly as input for REG_STARTEND).
|
||||
.P
|
||||
If the value of \fInmatch\fP is zero, or if the value \fIpmatch\fP is NULL,
|
||||
no data about any matched strings is returned.
|
||||
The value of \fInmatch\fP may be zero, and the value of \fIpmatch\fP may be NULL
|
||||
(unless REG_STARTEND is set); in both these cases no data about any matched
|
||||
strings is returned.
|
||||
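The two ways of calling \fBregexec()\fP described above can be sketched as
follows. The helper names match_only() and match_span() are hypothetical, and
the code is written against the system <regex.h>; building it against PCRE2's
wrapper is assumed to need only a change of header and link library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Yes/no test: nmatch is zero and pmatch is NULL, so no data about the
   matched string is returned. */
static int match_only(const regex_t *re, const char *subject)
{
    assert(re != NULL && subject != NULL);
    return regexec(re, subject, 0, NULL, 0) == 0;
}

/* Same test, but also reports the byte offsets of the overall match:
   *start is the offset of the first matched character and *end is the
   offset of the first character after the match. */
static int match_span(const regex_t *re, const char *subject,
                      regoff_t *start, regoff_t *end)
{
    regmatch_t pm[1];

    assert(re != NULL && subject != NULL && start != NULL && end != NULL);
    if (regexec(re, subject, 1, pm, 0) != 0)
        return 0;

    *start = pm[0].rm_so;
    *end = pm[0].rm_eo;
    return 1;
}
```

Both helpers take a pattern that has already been compiled with
\fBregcomp()\fP; the caller remains responsible for calling \fBregfree()\fP.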
.P
|
||||
Otherwise,the portion of the string that was matched, and also any captured
|
||||
Otherwise, the portion of the string that was matched, and also any captured
|
||||
substrings, are returned via the \fIpmatch\fP argument, which points to an
|
||||
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
|
||||
members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first
|
||||
@ -236,9 +242,11 @@ header file, of which REG_NOMATCH is the "expected" failure code.
|
||||
The \fBregerror()\fP function maps a non-zero errorcode from either
|
||||
\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not
|
||||
NULL, the error should have arisen from the use of that structure. A message
|
||||
terminated by a binary zero is placed in \fIerrbuf\fP. The length of the
|
||||
message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the
|
||||
function is the size of buffer needed to hold the whole message.
|
||||
terminated by a binary zero is placed in \fIerrbuf\fP. If the buffer is too
|
||||
short, only the first \fIerrbuf_size\fP - 1 characters of the error message are
|
||||
used. The yield of the function is the size of buffer needed to hold the whole
|
||||
message, including the terminating zero. This value is greater than
|
||||
\fIerrbuf_size\fP if the message was truncated.
|
||||
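Because the yield of \fBregerror()\fP is the size of buffer needed for the
whole message, a caller can ask for the size first and then allocate exactly
that much. The following is a minimal sketch, assuming the POSIX rule that a
buffer size of zero makes the function ignore the buffer argument; the helper
name full_regerror() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd copy of the complete message for
   `errcode`, or NULL if memory cannot be obtained. The caller frees it. */
static char *full_regerror(int errcode, const regex_t *preg)
{
    /* First call: size zero, so nothing is written and the yield is the
       buffer size needed, including the terminating zero. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *buf;

    assert(needed > 0);

    buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);  /* second call: no truncation */
    return buf;
}
```

A caller with a fixed-size buffer can instead compare the yield with the
buffer size to detect that the message was truncated.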
.
|
||||
.
|
||||
.SH MEMORY USAGE
|
||||
@ -263,6 +271,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 31 January 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2SAMPLE 3 "20 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2SAMPLE 3 "02 February 2016" "PCRE2 10.22"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 SAMPLE PROGRAM"
|
||||
@ -13,23 +13,28 @@ distribution. A listing of this program is given in the
|
||||
documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||
save this listing to re-create the contents of \fIpcre2demo.c\fP.
|
||||
.P
|
||||
The demonstration program, which uses the PCRE2 8-bit library, compiles the
|
||||
regular expression that is its first argument, and matches it against the
|
||||
subject string in its second argument. No PCRE2 options are set, and default
|
||||
character tables are used. If matching succeeds, the program outputs the
|
||||
portion of the subject that matched, together with the contents of any captured
|
||||
substrings.
|
||||
The demonstration program compiles the regular expression that is its
|
||||
first argument, and matches it against the subject string in its second
|
||||
argument. No PCRE2 options are set, and default character tables are used. If
|
||||
matching succeeds, the program outputs the portion of the subject that matched,
|
||||
together with the contents of any captured substrings.
|
||||
.P
|
||||
If the -g option is given on the command line, the program then goes on to
|
||||
check for further matches of the same regular expression in the same subject
|
||||
string. The logic is a little bit tricky because of the possibility of matching
|
||||
an empty string. Comments in the code explain what is going on.
|
||||
.P
|
||||
The code in \fBpcre2demo.c\fP is an 8-bit program that uses the PCRE2 8-bit
|
||||
library. It handles strings and characters that are stored in 8-bit code units.
|
||||
By default, one character corresponds to one code unit, but if the pattern
|
||||
starts with "(*UTF)", both it and the subject are treated as UTF-8 strings,
|
||||
where characters may occupy multiple code units.
|
||||
.P
|
||||
If PCRE2 is installed in the standard include and library directories for your
|
||||
operating system, you should be able to compile the demonstration program using
|
||||
this command:
|
||||
a command like this:
|
||||
.sp
|
||||
gcc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
cc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
.sp
|
||||
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||
@ -37,12 +42,11 @@ command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||
like this:
|
||||
.sp
|
||||
.\" JOINSH
|
||||
gcc -o pcre2demo -I/usr/local/include pcre2demo.c \e
|
||||
-L/usr/local/lib -lpcre2-8
|
||||
cc -o pcre2demo -I/usr/local/include pcre2demo.c \e
|
||||
-L/usr/local/lib -lpcre2-8
|
||||
.sp
|
||||
.P
|
||||
Once you have compiled and linked the demonstration program, you can run simple
|
||||
tests like this:
|
||||
Once you have built the demonstration program, you can run simple tests like
|
||||
this:
|
||||
.sp
|
||||
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||
@ -51,12 +55,13 @@ Note that there is a much more comprehensive test program, called
|
||||
.\" HREF
|
||||
\fBpcre2test\fP,
|
||||
.\"
|
||||
which supports many more facilities for testing regular expressions using the
|
||||
PCRE2 libraries. The
|
||||
which supports many more facilities for testing regular expressions using all
|
||||
three PCRE2 libraries (8-bit, 16-bit, and 32-bit, though not all three need be
|
||||
installed). The
|
||||
.\" HREF
|
||||
\fBpcre2demo\fP
|
||||
.\"
|
||||
program is provided as a simple coding example.
|
||||
program is provided as a relatively simple coding example.
|
||||
.P
|
||||
If you try to run
|
||||
.\" HREF
|
||||
@ -65,7 +70,7 @@ If you try to run
|
||||
when PCRE2 is not installed in the standard library directory, you may get an
|
||||
error like this on some operating systems (e.g. Solaris):
|
||||
.sp
|
||||
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
|
||||
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
|
||||
.sp
|
||||
This is caused by the way shared library support works on those systems. You
|
||||
need to add
|
||||
@ -89,6 +94,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 February 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2SERIALIZE 3 "20 January 2015" "PCRE2 10.10"
|
||||
.TH PCRE2SERIALIZE 3 "24 May 2016" "PCRE2 10.22"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS"
|
||||
@ -22,12 +22,22 @@ If you are running an application that uses a large number of regular
|
||||
expression patterns, it may be useful to store them in a precompiled form
|
||||
instead of having to compile them every time the application is run. However,
|
||||
if you are using the just-in-time optimization feature, it is not possible to
|
||||
save and reload the JIT data, because it is position-dependent. In addition,
|
||||
the host on which the patterns are reloaded must be running the same version of
|
||||
PCRE2, with the same code unit width, and must also have the same endianness,
|
||||
pointer width and PCRE2_SIZE type. For example, patterns compiled on a 32-bit
|
||||
system using PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor
|
||||
can they be reloaded using the 8-bit library.
|
||||
save and reload the JIT data, because it is position-dependent. The host on
|
||||
which the patterns are reloaded must be running the same version of PCRE2, with
|
||||
the same code unit width, and must also have the same endianness, pointer width
|
||||
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
|
||||
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
|
||||
reloaded using the 8-bit library.
|
||||
.
|
||||
.
|
||||
.SH "SECURITY CONCERNS"
|
||||
.rs
|
||||
.sp
|
||||
The facility for saving and restoring compiled patterns is intended for use
|
||||
within individual applications. As such, the data supplied to
|
||||
\fBpcre2_serialize_decode()\fP is expected to be trusted data, not data from
|
||||
arbitrary external sources. There is only some simple consistency checking, not
|
||||
complete validation of what is being re-loaded.
|
||||
.
|
||||
.
|
||||
.SH "SAVING COMPILED PATTERNS"
|
||||
@ -129,20 +139,26 @@ is filled with those that fit, and the remainder are ignored. The yield of the
|
||||
function is the number of decoded patterns, or one of the following negative
|
||||
error codes:
|
||||
.sp
|
||||
PCRE2_ERROR_BADDATA second argument is zero or less
|
||||
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
|
||||
PCRE2_ERROR_BADMODE mismatch of variable unit size or PCRE2 version
|
||||
PCRE2_ERROR_MEMORY memory allocation failed
|
||||
PCRE2_ERROR_NULL first or third argument is NULL
|
||||
PCRE2_ERROR_BADDATA second argument is zero or less
|
||||
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
|
||||
PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
|
||||
PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
|
||||
PCRE2_ERROR_MEMORY memory allocation failed
|
||||
PCRE2_ERROR_NULL first or third argument is NULL
|
||||
.sp
|
||||
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
|
||||
on a system with different endianness.
|
||||
.P
|
||||
Decoded patterns can be used for matching in the usual way, and must be freed
|
||||
by calling \fBpcre2_code_free()\fP as normal. A single copy of the character
|
||||
tables is used by all the decoded patterns. A reference count is used to
|
||||
by calling \fBpcre2_code_free()\fP. However, be aware that there is a potential
|
||||
race issue if you are using multiple patterns that were decoded from a single
|
||||
byte stream in a multithreaded application. A single copy of the character
|
||||
tables is used by all the decoded patterns and a reference count is used to
|
||||
arrange for its memory to be automatically freed when the last pattern is
|
||||
freed.
|
||||
freed, but there is no locking on this reference count. Therefore, if you want
|
||||
to call \fBpcre2_code_free()\fP for these patterns in different threads, you
|
||||
must arrange your own locking, and ensure that \fBpcre2_code_free()\fP cannot
|
||||
be called by two threads at the same time.
|
||||
.P
|
||||
If a pattern was processed by \fBpcre2_jit_compile()\fP before being
|
||||
serialized, the JIT data is discarded and so is no longer available after a
|
||||
@ -165,6 +181,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 24 May 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2STACK 3 "21 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 DISCUSSION OF STACK USAGE"
|
||||
@ -43,11 +43,12 @@ assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||
Normally, these are never very deep, and the limit on the complexity of
|
||||
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
|
||||
However, it is possible to write patterns with runaway infinite recursions;
|
||||
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At
|
||||
present, there is no protection against this.
|
||||
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
|
||||
limit is applied (see below).
|
||||
.P
|
||||
The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are
|
||||
relevant only for \fBpcre2_match()\fP without the JIT optimization.
|
||||
The comments in the next three sections do not apply to
|
||||
\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
|
||||
the JIT optimization.
|
||||
.
|
||||
.
|
||||
.SS "Reducing \fBpcre2_match()\fP's stack usage"
|
||||
@ -106,7 +107,7 @@ in the
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation. Since the block sizes are always the same, it may be possible to
|
||||
implement customized a memory handler that is more efficient than the standard
|
||||
implement a customized memory handler that is more efficient than the standard
|
||||
function. The memory blocks obtained for this purpose are retained and re-used
|
||||
if possible while \fBpcre2_match()\fP is running. They are all freed just
|
||||
before it exits.
|
||||
@ -147,6 +148,15 @@ pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
|
||||
different limits.
|
||||
.
|
||||
.
|
||||
.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
|
||||
.rs
|
||||
.sp
|
||||
The recursion limit, as described above for \fBpcre2_match()\fP, also applies
|
||||
to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
|
||||
recursions in the pattern can lead to runaway stack usage. The non-recursive
|
||||
match limit is not relevant for DFA matching, and is ignored.
|
||||
.
|
||||
.
|
||||
.SS "Changing stack size in Unix-like systems"
|
||||
.rs
|
||||
.sp
|
||||
@ -197,6 +207,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 23 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
|
||||
.TH PCRE2SYNTAX 3 "23 December 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
@ -81,9 +81,10 @@ it matches a literal "u".
|
||||
\eW a "non-word" character
|
||||
\eX a Unicode extended grapheme cluster
|
||||
.sp
|
||||
The application can lock out the use of \eC by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
|
||||
current matching point in the middle of a UTF-8 or UTF-16 character.
|
||||
\eC is dangerous because it may leave the current matching point in the middle
|
||||
of a UTF-8 or UTF-16 character. The application can lock out the use of \eC by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
|
||||
with the use of \eC permanently disabled.
|
||||
.P
|
||||
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
|
||||
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||
@ -159,6 +160,8 @@ at release 5.18.
|
||||
.SH "SCRIPT NAMES FOR \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
Ahom,
|
||||
Anatolian_Hieroglyphs,
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
@ -199,6 +202,7 @@ Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hatran,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
@ -235,12 +239,14 @@ Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Multani,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Hungarian,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
@ -262,6 +268,7 @@ Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
SignWriting,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
@ -421,9 +428,10 @@ appear.
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||
.sp
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_match(), not increase them. The application
|
||||
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
|
||||
PCRE2_NEVER_UCP options, respectively, at compile time.
|
||||
limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP, not
|
||||
increase them. The application can lock out the use of (*UTF) and (*UCP) by
|
||||
setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at
|
||||
compile time.
|
||||
.
|
||||
.
|
||||
.SH "NEWLINE CONVENTION"
|
||||
@ -466,6 +474,9 @@ Each top-level branch of a look behind must be of a fixed length.
|
||||
\en reference by number (can be ambiguous)
|
||||
\egn reference by number
|
||||
\eg{n} reference by number
|
||||
\eg+n relative reference by number (PCRE2 extension)
|
||||
\eg-n relative reference by number
|
||||
\eg{+n} relative reference by number (PCRE2 extension)
|
||||
\eg{-n} relative reference by number
|
||||
\ek<name> reference by name (Perl)
|
||||
\ek'name' reference by name (Perl)
|
||||
@ -504,13 +515,17 @@ Each top-level branch of a look behind must be of a fixed length.
|
||||
(?(-n) relative reference condition
|
||||
(?(<name>) named reference condition (Perl)
|
||||
(?('name') named reference condition (Perl)
|
||||
(?(name) named reference condition (PCRE2)
|
||||
(?(name) named reference condition (PCRE2, deprecated)
|
||||
(?(R) overall recursion condition
|
||||
(?(Rn) specific group recursion condition
|
||||
(?(R&name) specific recursion condition
|
||||
(?(Rn) specific numbered group recursion condition
|
||||
(?(R&name) specific named group recursion condition
|
||||
(?(DEFINE) define subpattern for reference
|
||||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
.sp
|
||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||
condition if the relevant named group exists.
|
||||
.
|
||||
.
|
||||
.SH "BACKTRACKING CONTROL"
|
||||
@ -570,6 +585,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 13 June 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 23 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
@ -1,4 +1,4 @@
|
||||
.TH PCRE2TEST 1 "20 May 2015" "PCRE 10.20"
|
||||
.TH PCRE2TEST 1 "28 December 2016" "PCRE 10.23"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
@ -29,7 +29,7 @@ subject is processed, and what output is produced.
|
||||
.P
|
||||
As the original fairly simple PCRE library evolved, it acquired many different
|
||||
features, and as a result, the original \fBpcretest\fP program ended up with a
|
||||
lot of options in a messy, arcane syntax, for testing all the features. The
|
||||
lot of options in a messy, arcane syntax for testing all the features. The
|
||||
move to the new PCRE2 API provided an opportunity to re-implement the test
|
||||
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
|
||||
are still many obscure modifiers, some of which are specifically designed for
|
||||
@ -47,31 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
|
||||
all three of these libraries may be simultaneously installed. The
|
||||
\fBpcre2test\fP program can be used to test all the libraries. However, its own
|
||||
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
|
||||
libraries, patterns and subject strings are converted to 16- or 32-bit format
|
||||
before being passed to the library functions. Results are converted back to
|
||||
8-bit code units for output.
|
||||
libraries, patterns and subject strings are converted to 16-bit or 32-bit
|
||||
format before being passed to the library functions. Results are converted back
|
||||
to 8-bit code units for output.
|
||||
.P
|
||||
In the rest of this document, the names of library functions and structures
|
||||
are given in generic form, for example, \fBpcre2_compile()\fP. The actual
|
||||
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="inputencoding"></a>
|
||||
.SH "INPUT ENCODING"
|
||||
.rs
|
||||
.sp
|
||||
Input to \fBpcre2test\fP is processed line by line, either by calling the C
|
||||
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
|
||||
below). The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
|
||||
treats any bytes other than newline as data characters. In some Windows
|
||||
environments character 26 (hex 1A) causes an immediate end of file, and no
|
||||
further data is read.
|
||||
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
|
||||
Windows environments character 26 (hex 1A) causes an immediate end of file, and
|
||||
no further data is read, so this character should be avoided unless you really
|
||||
want that action.
|
||||
.P
|
||||
For maximum portability, therefore, it is safest to avoid non-printing
|
||||
characters in \fBpcre2test\fP input files. There is a facility for specifying a
|
||||
pattern's characters as hexadecimal pairs, thus making it possible to include
|
||||
binary zeroes in a pattern for testing purposes. Subject lines are processed
|
||||
for backslash escapes, which makes it possible to include any data value.
|
||||
The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
|
||||
treats any bytes other than newline as data characters. An error is generated
|
||||
if a binary zero is encountered. Subject lines are processed for backslash
|
||||
escapes, which makes it possible to include any data value in strings that are
|
||||
passed to the library for matching. For patterns, there is a facility for
|
||||
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||
which makes it possible to include binary zeros.
|
||||
.
|
||||
.
|
||||
.SS "Input for the 16-bit and 32-bit libraries"
|
||||
.rs
|
||||
.sp
|
||||
When testing the 16-bit or 32-bit libraries, there is a need to be able to
|
||||
generate character code points greater than 255 in the strings that are passed
|
||||
to the library. For subject lines, backslash escapes can be used. In addition,
|
||||
when the \fButf\fP modifier (see
|
||||
.\" HTML <a href="#optionmodifiers">
|
||||
.\" </a>
|
||||
"Setting compilation options"
|
||||
.\"
|
||||
below) is set, the pattern and any following subject lines are interpreted as
|
||||
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
|
||||
.P
|
||||
For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
|
||||
used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
|
||||
or 32-bit mode. It causes the pattern and following subject lines to be treated
|
||||
as UTF-8 according to the original definition (RFC 2279), which allows for
|
||||
character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
||||
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
|
||||
to occur).
|
||||
.P
|
||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
||||
values can be handled by the 32-bit library. When testing this library in
|
||||
non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
|
||||
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
||||
character's value. This is the only way of passing such code points in a
|
||||
pattern string. For subject strings, using an escape sequence is preferable.
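The decoding rule described for \fButf8_input\fP can be illustrated with a short C sketch. This is not the code used by \fBpcre2test\fP; the function name decode_wide is hypothetical, and the sketch assumes only what the text states: RFC 2279 sequences of up to six bytes, plus an optional leading 0xff byte that adds 0x80000000. It does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decode one character from the len bytes at s.
   Returns the number of bytes consumed, or 0 if the input is malformed
   or incomplete. */
size_t decode_wide(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t add = 0, c;
    size_t i = 0, extra;

    assert(s != NULL && out != NULL);

    /* 0xff is never valid in UTF-8, so it can act as a prefix marker. */
    if (len > 0 && s[0] == 0xff) { add = 0x80000000u; i = 1; }
    if (i >= len) return 0;

    /* The lead byte determines how many continuation bytes follow. */
    c = s[i++];
    if (c < 0x80) extra = 0;
    else if ((c & 0xe0) == 0xc0) { c &= 0x1f; extra = 1; }
    else if ((c & 0xf0) == 0xe0) { c &= 0x0f; extra = 2; }
    else if ((c & 0xf8) == 0xf0) { c &= 0x07; extra = 3; }
    else if ((c & 0xfc) == 0xf8) { c &= 0x03; extra = 4; }
    else if ((c & 0xfe) == 0xfc) { c &= 0x01; extra = 5; }
    else return 0;   /* stray continuation byte, 0xfe, or a second 0xff */

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    while (extra-- > 0) {
        if (i >= len || (s[i] & 0xc0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i++] & 0x3f);
    }

    *out = c + add;   /* at most 0x7fffffff + 0x80000000, so no overflow */
    return i;
}
```

In 16-bit mode a caller would additionally have to reject any result greater than 0xffff, as the text notes.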
|
||||
.
|
||||
.
|
||||
.SH "COMMAND LINE OPTIONS"
|
||||
@ -92,8 +124,12 @@ If the 32-bit library has been built, this option causes it to be used. If only
|
||||
the 32-bit library has been built, this is the default. If the 32-bit library
|
||||
has not been built, this option causes an error.
|
||||
.TP 10
|
||||
\fB-ac\fP
|
||||
Behave as if each pattern has the \fBauto_callout\fP modifier, that is, insert
|
||||
automatic callouts into every pattern that is compiled.
|
||||
.TP 10
|
||||
\fB-b\fP
|
||||
Behave as if each pattern has the \fB/fullbincode\fP modifier; the full
|
||||
Behave as if each pattern has the \fBfullbincode\fP modifier; the full
|
||||
internal binary form of the pattern is output after compilation.
|
||||
.TP 10
|
||||
\fB-C\fP
|
||||
@ -122,12 +158,13 @@ following options output the value and set the exit code as indicated:
|
||||
The following options output 1 for true or 0 for false, and set the exit code
|
||||
to the same value:
|
||||
.sp
|
||||
ebcdic compiled for an EBCDIC environment
|
||||
jit just-in-time support is available
|
||||
pcre2-16 the 16-bit library was built
|
||||
pcre2-32 the 32-bit library was built
|
||||
pcre2-8 the 8-bit library was built
|
||||
unicode Unicode support is available
|
||||
backslash-C \eC is supported (not locked out)
|
||||
ebcdic compiled for an EBCDIC environment
|
||||
jit just-in-time support is available
|
||||
pcre2-16 the 16-bit library was built
|
||||
pcre2-32 the 32-bit library was built
|
||||
pcre2-8 the 8-bit library was built
|
||||
unicode Unicode support is available
|
||||
.sp
|
||||
If an unknown option is given, an error message is output; the exit code is 0.
|
||||
.TP 10
|
||||
@ -141,11 +178,17 @@ Behave as if each subject line has the \fBdfa\fP modifier; matching is done
|
||||
using the \fBpcre2_dfa_match()\fP function instead of the default
|
||||
\fBpcre2_match()\fP.
|
||||
.TP 10
|
||||
\fB-error\fP \fInumber[,number,...]\fP
|
||||
Call \fBpcre2_get_error_message()\fP for each of the error numbers in the
|
||||
comma-separated list, display the resulting messages on the standard output,
|
||||
then exit with zero exit code. The numbers may be positive or negative. This is
|
||||
a convenience facility for PCRE2 maintainers.
|
||||
.TP 10
|
||||
\fB-help\fP
|
||||
Output a brief summary of these options and then exit.
|
||||
.TP 10
|
||||
\fB-i\fP
|
||||
Behave as if each pattern has the \fB/info\fP modifier; information about the
|
||||
Behave as if each pattern has the \fBinfo\fP modifier; information about the
|
||||
compiled pattern is given after compilation.
|
||||
.TP 10
|
||||
\fB-jit\fP
|
||||
@ -217,9 +260,9 @@ Each subject line is matched separately and independently. If you want to do
|
||||
multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
|
||||
etc., depending on the newline setting) in a single line of input to encode the
|
||||
newline sequences. There is no limit on the length of subject lines; the input
|
||||
buffer is automatically extended if it is too small. There is a replication
|
||||
feature that makes it possible to generate long subject lines without having to
|
||||
supply them explicitly.
|
||||
buffer is automatically extended if it is too small. There are replication
|
||||
features that make it possible to generate long repetitive pattern or subject
|
||||
lines without having to supply them explicitly.
|
||||
.P
|
||||
An empty line or the end of the file signals the end of the subject lines for a
|
||||
test, at which point a new pattern or command line is expected if there is
|
||||
@ -259,6 +302,34 @@ described in the section entitled "Saving and restoring compiled patterns"
|
||||
.\" </a>
|
||||
below.
|
||||
.\"
|
||||
.sp
|
||||
#newline_default [<newline-list>]
|
||||
.sp
|
||||
When PCRE2 is built, a default newline convention can be specified. This
|
||||
determines which characters and/or character pairs are recognized as indicating
|
||||
a newline in a pattern or subject string. The default can be overridden when a
|
||||
pattern is compiled. The standard test files contain tests of various newline
|
||||
conventions, but the majority of the tests expect a single linefeed to be
|
||||
recognized as a newline by default. Without special action the tests would fail
|
||||
when PCRE2 is compiled with either CR or CRLF as the default newline.
|
||||
.P
|
||||
The #newline_default command specifies a list of newline types that are
|
||||
acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
|
||||
ANY (in upper or lower case), for example:
|
||||
.sp
|
||||
#newline_default LF Any anyCRLF
|
||||
.sp
|
||||
If the default newline is in the list, this command has no effect. Otherwise,
|
||||
except when testing the POSIX API, a \fBnewline\fP modifier that specifies the
|
||||
first newline convention in the list (LF in the above example) is added to any
|
||||
pattern that does not already have a \fBnewline\fP modifier. If the newline
|
||||
list is empty, the feature is turned off. This command is present in a number
|
||||
of the standard test input files.
|
||||
.P
|
||||
When the POSIX API is being tested there is no way to override the default
|
||||
newline convention, though it is possible to set the newline convention from
|
||||
within the pattern. A warning is given if the \fBposix\fP modifier is used when
|
||||
\fB#newline_default\fP would set a default for the non-POSIX API.
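The decision that #newline_default describes can be summarized in a few lines of C. This is a sketch of the documented behaviour only, not the \fBpcre2test\fP implementation; the function name forced_newline and its parameters are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison, since the types may be given in upper or
   lower case. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Hypothetical sketch: return the newline type to add to a pattern as a
   newline modifier, or NULL if nothing should be added. */
const char *forced_newline(const char *build_default,
                           const char *const *list, size_t count,
                           int pattern_has_newline_modifier, int posix)
{
    assert(build_default != NULL);

    if (count == 0) return NULL;                     /* feature turned off */
    for (size_t i = 0; i < count; i++)
        if (same_name(build_default, list[i]))
            return NULL;                             /* default is acceptable */
    if (posix) return NULL;                          /* cannot override here */
    if (pattern_has_newline_modifier) return NULL;   /* pattern decides */
    return list[0];                                  /* first listed type */
}
```

With the list from the example above (LF, Any, anyCRLF) and a library built with CRLF as its default, the sketch yields LF for any non-POSIX pattern that has no \fBnewline\fP modifier of its own.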
|
||||
.sp
|
||||
#pattern <modifier-list>
|
||||
.sp
|
||||
@ -276,9 +347,10 @@ test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
|
||||
command helps detect tests that are accidentally put in the wrong file.
|
||||
.sp
|
||||
#pop [<modifiers>]
|
||||
#popcopy [<modifiers>]
|
||||
.sp
|
||||
This command is used to manipulate the stack of compiled patterns, as described
|
||||
in the section entitled "Saving and restoring compiled patterns"
|
||||
These commands are used to manipulate the stack of compiled patterns, as
|
||||
described in the section entitled "Saving and restoring compiled patterns"
|
||||
.\" HTML <a href="#saverestore">
|
||||
.\" </a>
|
||||
below.
|
||||
@ -303,12 +375,13 @@ subject lines. Modifiers on a subject line can change these settings.
|
||||
.rs
|
||||
.sp
|
||||
Modifier lists are used with both pattern and subject lines. Items in a list
|
||||
are separated by commas and optional white space. Some modifiers may be given
|
||||
for both patterns and subject lines, whereas others are valid for one or the
|
||||
other only. Each modifier has a long name, for example "anchored", and some of
|
||||
them must be followed by an equals sign and a value, for example, "offset=12".
|
||||
Modifiers that do not take values may be preceded by a minus sign to turn off a
|
||||
previous setting.
|
||||
are separated by commas followed by optional white space. Trailing whitespace
|
||||
in a modifier list is ignored. Some modifiers may be given for both patterns
|
||||
and subject lines, whereas others are valid only for one or the other. Each
|
||||
modifier has a long name, for example "anchored", and some of them must be
|
||||
followed by an equals sign and a value, for example, "offset=12". Values cannot
|
||||
contain comma characters, but may contain spaces. Modifiers that do not take
|
||||
values may be preceded by a minus sign to turn off a previous setting.
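The syntax rules above are enough to sketch a splitter for modifier lists. The C code below is an illustration under those stated rules, not the parser used by \fBpcre2test\fP; the type modifier, the function split_modifiers, and the fixed buffer sizes are all hypothetical, and no check is made that a name is a known modifier.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char name[32];    /* long name, for example "anchored" */
    char value[64];   /* text after '=', or empty if there is none */
    int negated;      /* 1 if the item was preceded by a minus sign */
} modifier;

/* Hypothetical sketch: split list into at most max items and return the
   number stored. Over-long names and values are silently truncated. */
size_t split_modifiers(const char *list, modifier *out, size_t max)
{
    size_t count = 0, pos = 0, end;

    assert(list != NULL && out != NULL);

    /* Trailing white space in the list is ignored. */
    end = strlen(list);
    while (end > 0 && isspace((unsigned char)list[end - 1])) end--;

    while (pos < end && count < max) {
        modifier *m = &out[count++];
        size_t stop = pos, p = pos, eq;

        /* Values cannot contain commas, so the next comma ends the item. */
        while (stop < end && list[stop] != ',') stop++;

        m->negated = (p < stop && list[p] == '-');
        if (m->negated) p++;

        /* Everything after the first '=' is the value; spaces are kept. */
        eq = p;
        while (eq < stop && list[eq] != '=') eq++;
        snprintf(m->name, sizeof m->name, "%.*s", (int)(eq - p), list + p);
        if (eq < stop)
            snprintf(m->value, sizeof m->value, "%.*s",
                     (int)(stop - eq - 1), list + eq + 1);
        else
            m->value[0] = '\0';

        /* Skip the comma and any white space that follows it. */
        pos = stop;
        if (pos < end) {
            pos++;
            while (pos < end && isspace((unsigned char)list[pos])) pos++;
        }
    }
    return count;
}
```

Given the list "anchored, offset=12,-notbol", the sketch produces three items: anchored with no value, offset with the value 12, and notbol marked as negated.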
|
||||
.P
|
||||
A few of the more common modifiers can also be specified as single letters, for
|
||||
example "i" for "caseless". In documentation, following the Perl convention,
|
||||
@ -414,6 +487,12 @@ the start of a modifier list. For example:
|
||||
.sp
|
||||
abc\e=notbol,notempty
|
||||
.sp
|
||||
If the subject string is empty and \e= is followed by whitespace, the line is
|
||||
treated as a comment line, and is not used for matching. For example:
|
||||
.sp
|
||||
\e= This is a comment.
|
||||
abc\e= This is an invalid modifier list.
|
||||
.sp
|
||||
A backslash followed by any other non-alphanumeric character just escapes that
|
||||
character. A backslash followed by anything else causes an error. However, if
|
||||
the very last character in the line is a backslash (and there is no modifier
|
||||
@ -424,10 +503,10 @@ a real empty line terminates the data input.
|
||||
.SH "PATTERN MODIFIERS"
|
||||
.rs
|
||||
.sp
|
||||
There are three types of modifier that can appear in pattern lines, two of
|
||||
which may also be used in a \fB#pattern\fP command. A pattern's modifier list
|
||||
can add to or override default modifiers that were set by a previous
|
||||
\fB#pattern\fP command.
|
||||
There are several types of modifier that can appear in pattern lines. Except
|
||||
where noted below, they may also be used in \fB#pattern\fP commands. A
|
||||
pattern's modifier list can add to or override default modifiers that were set
|
||||
by a previous \fB#pattern\fP command.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="optionmodifiers"></a>
|
||||
@ -437,13 +516,14 @@ can add to or override default modifiers that were set by a previous
|
||||
The following modifiers set options for \fBpcre2_compile()\fP. The most common
|
||||
ones have single-letter abbreviations. See
|
||||
.\" HREF
|
||||
\fBpcreapi\fP
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
for a description of their effects.
|
||||
.sp
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
alt_verbnames set PCRE2_ALT_VERBNAMES
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
@ -464,12 +544,15 @@ for a description of their effects.
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
ucp set PCRE2_UCP
|
||||
ungreedy set PCRE2_UNGREEDY
|
||||
use_offset_limit set PCRE2_USE_OFFSET_LIMIT
|
||||
utf set PCRE2_UTF
|
||||
.sp
|
||||
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
|
||||
non-printing characters in output strings to be printed using the \ex{hh...}
|
||||
notation. Otherwise, those less than 0x100 are output in hex without the curly
|
||||
brackets.
|
||||
brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
|
||||
subject strings to be translated to UTF-16 or UTF-32, respectively, before
|
||||
being passed to library functions.
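As a rough illustration of the output rule just described, the following Python sketch formats a single character value. The function name show_char and the use of the ASCII range 0x20 to 0x7e as the definition of "printing" are assumptions made for this example; they are not taken from the pcre2test source.

```python
def show_char(code_point, utf):
    """Format one character value roughly as the text above describes."""
    # Assumption: "printing" means ASCII 0x20..0x7e in this sketch.
    if 0x20 <= code_point <= 0x7E:
        return chr(code_point)
    # With utf set, or for values of 0x100 and above, use the braced form.
    if utf or code_point >= 0x100:
        return "\\x{%02x}" % code_point
    # Otherwise values below 0x100 are shown in hex without curly brackets.
    return "\\x%02x" % code_point
```

For instance, a newline would be shown differently depending on whether utf is set, while a value such as 0x100 always takes the braced form.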
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="controlmodifiers"></a>
|
||||
@ -485,18 +568,24 @@ about the pattern:
|
||||
debug same as info,fullbincode
|
||||
fullbincode show binary code with lengths
|
||||
/I info show info about compiled pattern
|
||||
hex pattern is coded in hexadecimal
|
||||
hex unquoted characters are hexadecimal
|
||||
jit[=<number>] use JIT
|
||||
jitfast use JIT fast path
|
||||
jitverify verify JIT use
|
||||
locale=<name> use this locale
|
||||
max_pattern_length=<n> set the maximum pattern length
|
||||
memory show memory used
|
||||
newline=<type> set newline type
|
||||
null_context compile with a NULL context
|
||||
parens_nest_limit=<n> set maximum parentheses depth
|
||||
posix use the POSIX API
|
||||
posix_nosub use the POSIX API with REG_NOSUB
|
||||
push push compiled pattern onto the stack
|
||||
pushcopy push a copy onto the stack
|
||||
stackguard=<number> test the stackguard feature
|
||||
tables=[0|1|2] select internal tables
|
||||
use_length do not zero-terminate the pattern
|
||||
utf8_input treat input as UTF-8
|
||||
.sp
|
||||
The effects of these modifiers are described in the following sections.
|
||||
.
|
||||
@ -565,40 +654,148 @@ is requested. For each callout, either its number or string is given, followed
|
||||
by the item that follows it in the pattern.
|
||||
.
|
||||
.
|
||||
.SS "Specifying a pattern in hex"
|
||||
.SS "Passing a NULL context"
|
||||
.rs
|
||||
.sp
|
||||
The \fBhex\fP modifier specifies that the characters of the pattern are to be
|
||||
interpreted as pairs of hexadecimal digits. White space is permitted between
|
||||
pairs. For example:
|
||||
Normally, \fBpcre2test\fP passes a context block to \fBpcre2_compile()\fP. If
|
||||
the \fBnull_context\fP modifier is set, however, NULL is passed. This is for
|
||||
testing that \fBpcre2_compile()\fP behaves correctly in this case (it uses
|
||||
default values).
|
||||
.
|
||||
.
|
||||
.SS "Specifying the pattern's length"
|
||||
.rs
|
||||
.sp
|
||||
By default, patterns are passed to the compiling functions as zero-terminated
|
||||
strings. When using the POSIX wrapper API, there is no other option. However,
|
||||
when using PCRE2's native API, patterns can be passed by length instead of
|
||||
being zero-terminated. The \fBuse_length\fP modifier causes this to happen.
|
||||
Using a length happens automatically (whether or not \fBuse_length\fP is set)
|
||||
when \fBhex\fP is set, because patterns specified in hexadecimal may contain
|
||||
binary zeros.
|
||||
.
|
||||
.
|
||||
.SS "Specifying pattern characters in hexadecimal"
|
||||
.rs
|
||||
.sp
|
||||
The \fBhex\fP modifier specifies that the characters of the pattern, except for
|
||||
substrings enclosed in single or double quotes, are to be interpreted as pairs
|
||||
of hexadecimal digits. This feature is provided as a way of creating patterns
|
||||
that contain binary zeros and other non-printing characters. White space is
|
||||
permitted between pairs of digits. For example, this pattern contains three
|
||||
characters:
|
||||
.sp
|
||||
/ab 32 59/hex
|
||||
.sp
|
||||
This feature is provided as a way of creating patterns that contain binary zero
|
||||
and other non-printing characters. By default, \fBpcre2test\fP passes patterns
|
||||
as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
|
||||
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
|
||||
actual length of the pattern is passed.
|
||||
Parts of such a pattern are taken literally if quoted. This pattern contains
|
||||
nine characters, only two of which are specified in hexadecimal:
|
||||
.sp
|
||||
/ab "literal" 32/hex
|
||||
.sp
|
||||
Either single or double quotes may be used. There is no way of including
|
||||
the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
|
||||
mutually exclusive.
|
||||
.P
|
||||
The POSIX API cannot be used with patterns specified in hexadecimal because
|
||||
they may contain binary zeros, which conflicts with \fBregcomp()\fP's
|
||||
requirement for a zero-terminated string. Such patterns are always passed to
|
||||
\fBpcre2_compile()\fP as a string with a length, not as zero-terminated.
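The decoding rule for the hex modifier can be sketched in Python as follows. This is only an illustration of the rule as stated above (hexadecimal pairs, optional white space, quoted literal substrings) for the 8-bit case; the function name decode_hex_pattern and its error messages are hypothetical and do not come from pcre2test itself.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def decode_hex_pattern(text):
    """Decode the body of a hex-modifier pattern into 8-bit code units."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # White space between items is skipped.
            i += 1
        elif ch in "'\"":
            # A quoted substring is taken literally, up to the same delimiter.
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("unterminated quoted substring")
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            # Anything else must be a pair of hexadecimal digits.
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError("expected a pair of hexadecimal digits")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)
```

Applied to the two example patterns above, this sketch yields three and nine code units respectively. Because the result may contain binary zeros, it has to be handed on together with its length.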
|
||||
.
|
||||
.
|
||||
.SS "Specifying wide characters in 16-bit and 32-bit modes"
|
||||
.rs
|
||||
.sp
|
||||
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
|
||||
translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
|
||||
the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
|
||||
can be used. It is mutually exclusive with \fButf\fP. Input lines are
|
||||
interpreted as UTF-8 as a means of specifying wide characters. More details are
|
||||
given in
|
||||
.\" HTML <a href="#inputencoding">
|
||||
.\" </a>
|
||||
"Input encoding"
|
||||
.\"
|
||||
above.
|
||||
.
|
||||
.
|
||||
.SS "Generating long repetitive patterns"
|
||||
.rs
|
||||
.sp
|
||||
Some tests use long patterns that are very repetitive. Instead of creating a
|
||||
very long input line for such a pattern, you can use a special repetition
|
||||
feature, similar to the one described for subject lines above. If the
|
||||
\fBexpand\fP modifier is present on a pattern, parts of the pattern that have
|
||||
the form
|
||||
.sp
|
||||
\e[<characters>]{<count>}
|
||||
.sp
|
||||
are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
|
||||
example, \e[AB]{6000} is expanded to "AB" repeated 6000 times. This construction
|
||||
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
|
||||
by decimal digits and "}" is found later in the pattern. If not, the characters
|
||||
remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
|
||||
mutually exclusive.
|
||||
.P
|
||||
If part of an expanded pattern looks like an expansion, but is really part of
|
||||
the actual pattern, unwanted expansion can be avoided by giving two values in
|
||||
the quantifier. For example, \e[AB]{6000,6000} is not recognized as an
|
||||
expansion item.
|
||||
.P
|
||||
If the \fBinfo\fP modifier is set on an expanded pattern, the result of the
|
||||
expansion is included in the information that is output.
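The expansion rule can be approximated with a short Python sketch. It uses Python's re module purely as a convenient way to find items of the documented form; the name expand_pattern is hypothetical, and the real pcre2test scanner may differ in corner cases (for example, how it treats a "]" inside the repeated characters).

```python
import re

# \[<characters>]{<count>} with a single decimal count. A quantifier with two
# values, such as {6000,6000}, does not fit this shape and is left unaltered.
_REPEAT_ITEM = re.compile(r"\\\[(.*?)\]\{(\d+)\}")

def expand_pattern(pattern):
    """Replace each repetition item by its characters repeated <count> times."""
    return _REPEAT_ITEM.sub(lambda m: m.group(1) * int(m.group(2)), pattern)
```

Text that merely starts with the "\e[" sequence but is not followed later by "]{", digits and "}" is returned unchanged, which mirrors the rule stated above.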
|
||||
.
|
||||
.
|
||||
.SS "JIT compilation"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/jit\fP modifier may optionally be followed by an equals sign and a
|
||||
number in the range 0 to 7:
|
||||
Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
|
||||
speed up pattern matching. See the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
documentation for details. JIT compiling happens, optionally, after a pattern
|
||||
has been successfully compiled into an internal form. The JIT compiler converts
|
||||
this to optimized machine code. It needs to know whether the match-time options
|
||||
PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
|
||||
different code is generated for the different cases. See the \fBpartial\fP
|
||||
modifier in "Subject Modifiers"
|
||||
.\" HTML <a href="#subjectmodifiers">
|
||||
.\" </a>
|
||||
below
|
||||
.\"
|
||||
for details of how these options are specified for each match attempt.
|
||||
.P
|
||||
JIT compilation is requested by the \fBjit\fP pattern modifier, which may
|
||||
optionally be followed by an equals sign and a number in the range 0 to 7.
|
||||
The three bits that make up the number specify which of the three JIT operating
|
||||
modes are to be compiled:
|
||||
.sp
|
||||
1 compile JIT code for non-partial matching
|
||||
2 compile JIT code for soft partial matching
|
||||
4 compile JIT code for hard partial matching
|
||||
.sp
|
||||
The possible values for the \fBjit\fP modifier are therefore:
|
||||
.sp
|
||||
0 disable JIT
|
||||
1 use JIT for normal match only
|
||||
2 use JIT for soft partial match only
|
||||
3 use JIT for normal match and soft partial match
|
||||
4 use JIT for hard partial match only
|
||||
6 use JIT for soft and hard partial match
|
||||
1 normal matching only
|
||||
2 soft partial matching only
|
||||
3 normal and soft partial matching
|
||||
4 hard partial matching only
|
||||
6 soft and hard partial matching only
|
||||
7 all three modes
|
||||
.sp
|
||||
If no number is given, 7 is assumed. If JIT compilation is successful, the
|
||||
compiled JIT code will automatically be used when \fBpcre2_match()\fP is run
|
||||
for the appropriate type of match, except when incompatible run-time options
|
||||
are specified. For more details, see the
|
||||
If no number is given, 7 is assumed. The phrase "partial matching" means a call
|
||||
to \fBpcre2_match()\fP with either the PCRE2_PARTIAL_SOFT or the
|
||||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
|
||||
match; the options enable the possibility of a partial match, but do not
|
||||
require it. Note also that if you request JIT compilation only for partial
|
||||
matching (for example, /jit=2) but do not set the \fBpartial\fP modifier on a
|
||||
subject line, that match will not use JIT code because none was compiled for
|
||||
non-partial matching.
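The way the three bits combine can be shown with a small Python sketch. The names JIT_MODE_BITS, jit_modes and uses_jit are invented for this illustration; the sketch models only the bit arithmetic described above, not the behaviour of the JIT compiler.

```python
# Bit values as listed above.
JIT_MODE_BITS = {
    1: "non-partial matching",
    2: "soft partial matching",
    4: "hard partial matching",
}

def jit_modes(value=7):
    """Return the matching modes for which JIT code would be compiled."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODE_BITS.items() if value & bit]

def uses_jit(value, partial=None):
    """partial is None (no partial modifier), "soft" or "hard"."""
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(value & needed)
```

With this model, a value of 2 compiles code for soft partial matching only, so a subject line without the partial modifier would not use JIT code.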
|
||||
.P
|
||||
If JIT compilation is successful, the compiled JIT code will automatically be
|
||||
used when an appropriate type of match is run, except when incompatible
|
||||
run-time options are specified. For more details, see the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
@ -622,14 +819,14 @@ code was actually used in the match.
|
||||
.SS "Setting a locale"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/locale\fP modifier must specify the name of a locale, for example:
|
||||
The \fBlocale\fP modifier must specify the name of a locale, for example:
|
||||
.sp
|
||||
/pattern/locale=fr_FR
|
||||
.sp
|
||||
The given locale is set, \fBpcre2_maketables()\fP is called to build a set of
|
||||
character tables for the locale, and this is then passed to
|
||||
\fBpcre2_compile()\fP when compiling the regular expression. The same tables
|
||||
are used when matching the following subject lines. The \fB/locale\fP modifier
|
||||
are used when matching the following subject lines. The \fBlocale\fP modifier
|
||||
applies only to the pattern on which it appears, but can be given in a
|
||||
\fB#pattern\fP command if a default is needed. Setting a locale and alternate
|
||||
character tables are mutually exclusive.
|
||||
@ -638,7 +835,7 @@ character tables are mutually exclusive.
|
||||
.SS "Showing pattern memory"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
|
||||
The \fBmemory\fP modifier causes the size in bytes of the memory used to hold
|
||||
the compiled pattern to be output. This does not include the size of the
|
||||
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
|
||||
subsequently passed to the JIT compiler, the size of the JIT compiled code is
|
||||
@ -660,30 +857,54 @@ sets its own default of 220, which is required for running the standard test
|
||||
suite.
|
||||
.
|
||||
.
|
||||
.SS "Limiting the pattern length"
|
||||
.rs
|
||||
.sp
|
||||
The \fBmax_pattern_length\fP modifier sets a limit, in code units, to the
|
||||
length of a pattern that \fBpcre2_compile()\fP will accept. Breaching the limit
|
||||
causes a compilation error. The default is the largest number a PCRE2_SIZE
|
||||
variable can hold (essentially unlimited).
|
||||
.
|
||||
.
|
||||
.SS "Using the POSIX wrapper API"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/posix\fP modifier causes \fBpcre2test\fP to call PCRE2 via the POSIX
|
||||
wrapper API rather than its native API. This supports only the 8-bit library.
|
||||
When the POSIX API is being used, the following pattern modifiers set options
|
||||
for the \fBregcomp()\fP function:
|
||||
The \fBposix\fP and \fBposix_nosub\fP modifiers cause \fBpcre2test\fP to call
|
||||
PCRE2 via the POSIX wrapper API rather than its native API. When
|
||||
\fBposix_nosub\fP is used, the POSIX option REG_NOSUB is passed to
|
||||
\fBregcomp()\fP. The POSIX wrapper supports only the 8-bit library. Note that
|
||||
it does not imply POSIX matching semantics; for more detail see the
|
||||
.\" HREF
|
||||
\fBpcre2posix\fP
|
||||
.\"
|
||||
documentation. The following pattern modifiers set options for the
|
||||
\fBregcomp()\fP function:
|
||||
.sp
|
||||
caseless REG_ICASE
|
||||
multiline REG_NEWLINE
|
||||
no_auto_capture REG_NOSUB
|
||||
dotall REG_DOTALL )
|
||||
ungreedy REG_UNGREEDY ) These options are not part of
|
||||
ucp REG_UCP ) the POSIX standard
|
||||
utf REG_UTF8 )
|
||||
.sp
|
||||
The \fBregerror_buffsize\fP modifier specifies a size for the error buffer that
|
||||
is passed to \fBregerror()\fP in the event of a compilation error. For example:
|
||||
.sp
|
||||
/abc/posix,regerror_buffsize=20
|
||||
.sp
|
||||
This provides a means of testing the behaviour of \fBregerror()\fP when the
|
||||
buffer is too small for the error message. If this modifier has not been set, a
|
||||
large buffer is used.
|
||||
.P
|
||||
The \fBaftertext\fP and \fBallaftertext\fP subject modifiers work as described
|
||||
below. All other modifiers cause an error.
|
||||
below. All other modifiers are either ignored, with a warning message, or cause
|
||||
an error.
|
||||
.
|
||||
.
|
||||
.SS "Testing the stack guard feature"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/stackguard\fP modifier is used to test the use of
|
||||
The \fBstackguard\fP modifier is used to test the use of
|
||||
\fBpcre2_set_compile_recursion_guard()\fP, a function that is provided to
|
||||
enable stack availability to be checked during compilation (see the
|
||||
.\" HREF
|
||||
@ -700,7 +921,7 @@ be aborted.
|
||||
.SS "Using alternative character tables"
|
||||
.rs
|
||||
.sp
|
||||
The value specified for the \fB/tables\fP modifier must be one of the digits 0,
|
||||
The value specified for the \fBtables\fP modifier must be one of the digits 0,
|
||||
1, or 2. It causes a specific set of built-in character tables to be passed to
|
||||
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
|
||||
different character tables. The digit specifies the tables as follows:
|
||||
@ -720,17 +941,22 @@ are mutually exclusive.
|
||||
.sp
|
||||
The following modifiers are really subject modifiers, and are described below.
|
||||
However, they may be included in a pattern's modifier list, in which case they
|
||||
are applied to every subject line that is processed with that pattern. They do
|
||||
not affect the compilation process.
|
||||
are applied to every subject line that is processed with that pattern. They may
|
||||
not appear in \fB#pattern\fP commands. These modifiers do not affect the
|
||||
compilation process.
|
||||
.sp
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
.sp
|
||||
These modifiers may not appear in a \fB#pattern\fP command. If you want them as
|
||||
defaults, set them in a \fB#subject\fP command.
|
||||
@ -746,15 +972,20 @@ facility is used when saving compiled patterns to a file, as described in the
|
||||
section entitled "Saving and restoring compiled patterns"
|
||||
.\" HTML <a href="#saverestore">
|
||||
.\" </a>
|
||||
below.
|
||||
below. If \fBpushcopy\fP is used instead of \fBpush\fP, a copy of the compiled
|
||||
pattern is stacked, leaving the original as current, ready to match the
|
||||
following input lines. This provides a way of testing the
|
||||
\fBpcre2_code_copy()\fP function.
|
||||
.\"
|
||||
The \fBpush\fP modifier is incompatible with compilation modifiers such as
|
||||
\fBglobal\fP that act at match time. Any that are specified are ignored, with a
|
||||
warning message, except for \fBreplace\fP, which causes an error. Note that,
|
||||
\fBjitverify\fP, which is allowed, does not carry through to any subsequent
|
||||
matching that uses this pattern.
|
||||
The \fBpush\fP and \fBpushcopy\fP modifiers are incompatible with compilation
|
||||
modifiers such as \fBglobal\fP that act at match time. Any that are specified
|
||||
are ignored (for the stacked copy), with a warning message, except for
|
||||
\fBreplace\fP, which causes an error. Note that \fBjitverify\fP, which is
|
||||
allowed, does not carry through to any subsequent matching that uses a stacked
|
||||
pattern.
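The difference between push and pushcopy can be modelled with a toy Python class. Everything here is hypothetical: PatternStack, its attributes, and the dictionary standing in for a compiled pattern are inventions for the example, and copy.deepcopy merely plays the role that \fBpcre2_code_copy()\fP has in the real program.

```python
import copy

class PatternStack:
    """Toy model of the stack of compiled patterns."""

    def __init__(self):
        self.stack = []       # stacked compiled patterns
        self.current = None   # pattern available for matching subject lines

    def compile(self, pattern, modifier=None):
        compiled = {"pattern": pattern}  # stand-in for a compiled pattern
        if modifier == "push":
            # The pattern itself is stacked; nothing remains current.
            self.stack.append(compiled)
            self.current = None
        elif modifier == "pushcopy":
            # A copy is stacked; the original stays current for matching.
            self.stack.append(copy.deepcopy(compiled))
            self.current = compiled
        else:
            self.current = compiled
```

In this model, push leaves no current pattern (so the next input line would have to be another pattern or a command), whereas pushcopy leaves the original ready for the following subject lines.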
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="subjectmodifiers"></a>
|
||||
.SH "SUBJECT MODIFIERS"
|
||||
.rs
|
||||
.sp
|
||||
@ -775,6 +1006,7 @@ for a description of their effects.
|
||||
anchored set PCRE2_ANCHORED
|
||||
dfa_restart set PCRE2_DFA_RESTART
|
||||
dfa_shortest set PCRE2_DFA_SHORTEST
|
||||
no_jit set PCRE2_NO_JIT
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
notbol set PCRE2_NOTBOL
|
||||
notempty set PCRE2_NOTEMPTY
|
||||
@ -786,11 +1018,11 @@ for a description of their effects.
|
||||
The partial matching modifiers are provided with abbreviations because they
|
||||
appear frequently in tests.
|
||||
.P
|
||||
If the \fB/posix\fP modifier was present on the pattern, causing the POSIX
|
||||
If the \fBposix\fP modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any effect
|
||||
are \fBnotbol\fP, \fBnotempty\fP, and \fBnoteol\fP, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to \fBregexec()\fP.
|
||||
Any other modifiers cause an error.
|
||||
The other modifiers are ignored, with a warning message.
|
||||
.
|
||||
.
|
||||
.SS "Setting match controls"
|
||||
@ -801,33 +1033,44 @@ information. Some of them may also be specified on a pattern line (see above),
|
||||
in which case they apply to every subject line that is matched against that
|
||||
pattern.
|
||||
.sp
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use \fBpcre2_dfa_match()\fP
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=>n> set a match limit
|
||||
memory show memory usage
|
||||
offset=<n> set starting offset
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_error=<n>[:<m>] control callout error
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use \fBpcre2_dfa_match()\fP
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
.sp
|
||||
The effects of these modifiers are described in the following sections.
|
||||
The effects of these modifiers are described in the following sections. When
|
||||
matching via the POSIX wrapper API, the \fBaftertext\fP, \fBallaftertext\fP,
|
||||
and \fBovector\fP subject modifiers work as described below. All other
|
||||
modifiers are either ignored, with a warning message, or cause an error.
|
||||
.
|
||||
.
|
||||
.SS "Showing more text"
|
||||
@ -882,7 +1125,8 @@ The \fBallcaptures\fP modifier requests that the values of all potential
|
||||
captured parentheses be output after a match. By default, only those up to the
|
||||
highest one actually used in the match are output (corresponding to the return
|
||||
code from \fBpcre2_match()\fP). Groups that did not take part in the match
|
||||
are output as "<unset>".
|
||||
are output as "<unset>". This modifier is not relevant for DFA matching (which
|
||||
does no capturing); it is ignored, with a warning message, if present.
|
||||
.
|
||||
.
|
||||
.SS "Testing callouts"
|
||||
@ -890,14 +1134,20 @@ are output as "<unset>".
|
||||
.sp
|
||||
A callout function is supplied when \fBpcre2test\fP calls the library matching
|
||||
functions, unless \fBcallout_none\fP is specified. If \fBcallout_capture\fP is
|
||||
set, the current captured groups are output when a callout occurs.
|
||||
set, the current captured groups are output when a callout occurs. The default
|
||||
return from the callout function is zero, which allows matching to continue.
|
||||
.P
|
||||
The \fBcallout_fail\fP modifier can be given one or two numbers. If there is
|
||||
only one number, 1 is returned instead of 0 when a callout of that number is
|
||||
reached. If two numbers are given, 1 is returned when callout <n> is reached
|
||||
for the <m>th time. Note that callouts with string arguments are always given
|
||||
the number zero. See "Callouts" below for a description of the output when a
|
||||
callout it taken.
|
||||
only one number, 1 is returned instead of 0 (causing matching to backtrack)
|
||||
when a callout of that number is reached. If two numbers (<n>:<m>) are given, 1
|
||||
is returned when callout <n> is reached and there have been at least <m>
|
||||
callouts. The \fBcallout_error\fP modifier is similar, except that
|
||||
PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
|
||||
aborted. If both these modifiers are set for the same callout number,
|
||||
\fBcallout_error\fP takes precedence.
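One possible reading of these rules is sketched below in Python. The class CalloutControl is hypothetical, the string CALLOUT_ERROR is a symbolic stand-in for PCRE2_ERROR_CALLOUT rather than its numeric value, and the sketch assumes that the callout currently being processed is included in the count compared with <m>.

```python
CALLOUT_ERROR = "PCRE2_ERROR_CALLOUT"  # symbolic stand-in, not the real value

class CalloutControl:
    """Decide what a test callout function returns."""

    def __init__(self, fail=None, error=None):
        # Each setting is None or a tuple (n, m); m is 0 if only <n> was given.
        self.fail = fail
        self.error = error
        self.count = 0  # callouts seen so far, including the current one

    def _triggered(self, setting, number):
        return (setting is not None
                and number == setting[0]
                and self.count >= setting[1])

    def on_callout(self, number):
        self.count += 1
        if self._triggered(self.error, number):
            return CALLOUT_ERROR  # abort the entire matching process
        if self._triggered(self.fail, number):
            return 1              # cause matching to backtrack
        return 0                  # default: allow matching to continue
```

Checking the error setting first is what gives callout_error precedence when both modifiers name the same callout number.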
|
||||
.P
|
||||
Note that callouts with string arguments are always given the number zero. See
|
||||
"Callouts" below for a description of the output when a callout is taken.
|
||||
.P
|
||||
The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
|
||||
This is set as the "user data" that is passed to the matching function, and
|
||||
@ -909,7 +1159,7 @@ used as a return from \fBpcre2test\fP's callout function.
|
||||
.rs
|
||||
.sp
|
||||
Searching for all possible matches within a subject can be requested by the
|
||||
\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
|
||||
\fBglobal\fP or \fBaltglobal\fP modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The difference
|
||||
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
|
||||
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
|
||||
@ -957,18 +1207,30 @@ by name.
|
||||
.rs
|
||||
.sp
|
||||
If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is
|
||||
called instead of one of the matching functions. Unlike subject strings,
|
||||
\fBpcre2test\fP does not process replacement strings for escape sequences. In
|
||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
||||
purposes.
|
||||
called instead of one of the matching functions. Note that replacement strings
|
||||
cannot contain commas, because a comma signifies the end of a modifier. This is
|
||||
not thought to be an issue in a test program.
|
||||
.P
|
||||
If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
\fBpcre2_substitute()\fP. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
Unlike subject strings, \fBpcre2test\fP does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||
the appropriate code unit width. If it is not a valid UTF-8 string, the
|
||||
individual code units are copied directly. This provides a means of passing an
|
||||
invalid UTF-8 string for testing purposes.
|
||||
.P
|
||||
The following modifiers set options (in addition to the normal match options)
|
||||
for \fBpcre2_substitute()\fP:
|
||||
.sp
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
.sp
|
||||
.P
|
||||
After a successful substitution, the modified string is output, preceded by the
|
||||
number of replacements. This may be zero if there were no matches. Here is a
|
||||
simple example of a substitution test:
|
||||
.sp
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
@ -976,12 +1238,12 @@ were no matches. Here is a simple example of a substitution test:
|
||||
=abc=abc=\e=global
|
||||
2: =xxx=xxx=
|
||||
.sp
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer than 256
|
||||
characters) for substitution tests, as fixed-size buffers are used. To make it
|
||||
easy to test for buffer overflow, if the replacement string starts with a
|
||||
number in square brackets, that number is passed to \fBpcre2_substitute()\fP as
|
||||
the size of the output buffer, with the replacement string starting at the next
|
||||
character. Here is an example that tests the edge case:
|
||||
.sp
|
||||
/abc/
|
||||
123abc123\e=replace=[10]XYZ
|
||||
@ -989,6 +1251,19 @@ is an example that tests the edge case:
|
||||
123abc123\e=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
.sp
|
||||
The default action of \fBpcre2_substitute()\fP is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
|
||||
\fBsubstitute_overflow_length\fP modifier), \fBpcre2_substitute()\fP continues
|
||||
to go through the motions of matching and substituting, in order to compute the
|
||||
size of buffer that is required. When this happens, \fBpcre2test\fP shows the
|
||||
required buffer length (which includes space for the trailing zero) as part of
|
||||
the error message. For example:
|
||||
.sp
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\e=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
|
||||
.sp
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
\fBpcre2_substitute()\fP.
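The buffer-size prefix and the overflow behaviour described above can be imitated with a small Python model. It uses Python's re module in place of PCRE2 and performs a single literal replacement; the names parse_replacement and simulate_substitute and the tuple-shaped results are assumptions made for the sketch, not pcre2test output formats.

```python
import re

def parse_replacement(text):
    """Split an optional leading "[<n>]" buffer size from a replacement."""
    m = re.match(r"\[(\d+)\]", text)
    if m:
        return int(m.group(1)), text[m.end():]
    return None, text

def simulate_substitute(pattern, subject, replace, overflow_length=False):
    size, replacement = parse_replacement(replace)
    # The lambda keeps the replacement literal (no escape processing).
    result, count = re.subn(pattern, lambda _m: replacement, subject, count=1)
    needed = len(result) + 1  # one extra code unit for the trailing zero
    if size is not None and needed > size:
        # Report the required length only when overflow_length is requested.
        return ("nomemory", needed if overflow_length else None)
    return ("ok", count, result)
```

For the subject 123abc123 and the replacement XYZ, the result is nine characters long, so ten code units are needed once the trailing zero is counted; this is why a buffer of 10 succeeds and a buffer of 9 does not.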
|
||||
@ -1059,6 +1334,16 @@ The \fBoffset\fP modifier sets an offset in the subject string at which
|
||||
matching starts. Its value is a number of code units, not characters.
|
||||
.
|
||||
.
|
||||
.SS "Setting an offset limit"
|
||||
.rs
|
||||
.sp
|
||||
The \fBoffset_limit\fP modifier sets a limit for unanchored matches. If a match
|
||||
cannot be found starting at or before this offset in the subject, a "no match"
|
||||
return is given. The data value is a number of code units, not characters. When
|
||||
this modifier is used, the \fBuse_offset_limit\fP modifier must have been set
|
||||
for the pattern; if not, an error is generated.
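The effect of an offset limit can be modelled in Python as shown below. This is a sketch only: it uses Python's re.search instead of PCRE2, it counts characters of a Python string rather than code units, and the function name and the use_offset_limit parameter are hypothetical stand-ins for the modifiers described above.

```python
import re

def match_with_offset_limit(pattern, subject, offset_limit, use_offset_limit=True):
    """Return a match only if it starts at or before offset_limit."""
    if not use_offset_limit:
        # Mirrors the rule that the pattern must have been compiled with
        # use_offset_limit for an offset limit to be accepted.
        raise ValueError("offset_limit requires use_offset_limit on the pattern")
    m = re.search(pattern, subject)
    # re.search finds the leftmost match; if even that starts beyond the
    # limit, the result is "no match".
    if m is None or m.start() > offset_limit:
        return None
    return m
```

In this model, searching for abc in xxxabc succeeds with a limit of 3 and gives no match with a limit of 2.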
|
||||
.
|
||||
.
|
||||
.SS "Setting the size of the output vector"
|
||||
.rs
|
||||
.sp
|
||||
@ -1089,6 +1374,17 @@ When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
.
|
||||
.
|
||||
.SS "Passing a NULL context"
|
||||
.rs
|
||||
.sp
|
||||
Normally, \fBpcre2test\fP passes a context block to \fBpcre2_match()\fP,
|
||||
\fBpcre2_dfa_match()\fP or \fBpcre2_jit_match()\fP. If the \fBnull_context\fP
|
||||
modifier is set, however, NULL is passed. This is for testing that the matching
|
||||
functions behave correctly in this case (they use default values). This
|
||||
modifier cannot be used with the \fBfind_limits\fP modifier or when testing the
|
||||
substitution function.
|
||||
.
|
||||
.
|
||||
.SH "THE ALTERNATIVE MATCHING FUNCTION"
|
||||
.rs
|
||||
.sp
|
||||
@ -1156,7 +1452,7 @@ unset substring is shown as "<unset>", as for the second data line.
|
||||
If the strings contain any non-printing characters, they are output as \exhh
|
||||
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
|
||||
are output as \ex{hh...} escapes. See below for the definition of non-printing
|
||||
characters. If the \fB/aftertext\fP modifier is set, the output for substring
|
||||
characters. If the \fBaftertext\fP modifier is set, the output for substring
|
||||
0 is followed by the rest of the subject string, identified by "0+" like
|
||||
this:
|
||||
.sp
|
||||
@ -1286,7 +1582,9 @@ item to be tested. For example:
|
||||
This output indicates that callout number 0 occurred for a match attempt
|
||||
starting at the fourth character of the subject string, when the pointer was at
|
||||
the seventh character, and when the next pattern item was \ed. Just
|
||||
one circumflex is output if the start and current positions are the same.
|
||||
one circumflex is output if the start and current positions are the same, or if
|
||||
the current position precedes the start position, which can happen if the
|
||||
callout is in a lookbehind assertion.
|
||||
.P
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
||||
result of the \fB/auto_callout\fP pattern modifier. In this case, instead of
|
||||
@ -1352,7 +1650,7 @@ therefore shown as hex escapes.
|
||||
.P
|
||||
When \fBpcre2test\fP is outputting text that is a matched part of a subject
|
||||
string, it behaves in the same way, unless a different locale has been set for
|
||||
the pattern (using the \fB/locale\fP modifier). In this case, the
|
||||
the pattern (using the \fBlocale\fP modifier). In this case, the
|
||||
\fBisprint()\fP function is used to distinguish printing and non-printing
|
||||
characters.
|
||||
.
|
||||
@ -1382,11 +1680,15 @@ can be used to test these functions.
|
||||
.P
|
||||
When a pattern with \fBpush\fP modifier is successfully compiled, it is pushed
|
||||
onto a stack of compiled patterns, and \fBpcre2test\fP expects the next line to
|
||||
contain a new pattern (or command) instead of a subject line. By this means, a
|
||||
number of patterns can be compiled and retained. The \fBpush\fP modifier is
|
||||
incompatible with \fBposix\fP, and control modifiers that act at match time are
|
||||
ignored (with a message). The \fBjitverify\fP modifier applies only at compile
|
||||
time. The command
|
||||
contain a new pattern (or command) instead of a subject line. By contrast,
|
||||
the \fBpushcopy\fP modifier causes a copy of the compiled pattern to be
|
||||
stacked, leaving the original available for immediate matching. By using
|
||||
\fBpush\fP and/or \fBpushcopy\fP, a number of patterns can be compiled and
|
||||
retained. These modifiers are incompatible with \fBposix\fP, and control
|
||||
modifiers that act at match time are ignored (with a message) for the stacked
|
||||
patterns. The \fBjitverify\fP modifier applies only at compile time.
|
||||
.P
|
||||
The command
|
||||
.sp
|
||||
#save <filename>
|
||||
.sp
|
||||
@ -1406,7 +1708,8 @@ modifier list containing only
|
||||
control modifiers
|
||||
.\"
|
||||
that act after a pattern has been compiled. In particular, \fBhex\fP,
|
||||
\fBposix\fP, and \fBpush\fP are not allowed, nor are any
|
||||
\fBposix\fP, \fBposix_nosub\fP, \fBpush\fP, and \fBpushcopy\fP are not allowed,
|
||||
nor are any
|
||||
.\" HTML <a href="#optionmodifiers">
|
||||
.\" </a>
|
||||
option-setting modifiers.
|
||||
@ -1426,6 +1729,10 @@ reloads two patterns.
|
||||
.sp
|
||||
If \fBjitverify\fP is used with #pop, it does not automatically imply
|
||||
\fBjit\fP, which is different behaviour from when it is used on a pattern.
|
||||
.P
|
||||
The #popcopy command is analagous to the \fBpushcopy\fP modifier in that it
|
||||
makes current a copy of the topmost stack pattern, leaving the original still
|
||||
on the stack.
|
||||
.
|
||||
.
|
||||
.
|
||||
@ -1451,6 +1758,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 May 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 28 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
File diff suppressed because it is too large
@ -1,4 +1,4 @@
|
||||
.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2UNICODE 3 "03 July 2016" "PCRE2 10.22"
|
||||
.SH NAME
|
||||
PCRE - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
@ -57,17 +57,21 @@ individual code units.
|
||||
In UTF modes, the dot metacharacter matches one UTF character instead of a
|
||||
single code unit.
|
||||
.P
|
||||
The escape sequence \eC can be used to match a single code unit, in a UTF mode,
|
||||
The escape sequence \eC can be used to match a single code unit in a UTF mode,
|
||||
but its use can lead to some strange effects because it breaks up multi-unit
|
||||
characters (see the description of \eC in the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
documentation). The use of \eC is not supported in the alternative matching
|
||||
function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
|
||||
optimization. If JIT optimization is requested for a UTF pattern that contains
|
||||
\eC, it will not succeed, and so the matching will be carried out by the normal
|
||||
interpretive function.
|
||||
documentation).
|
||||
.P
|
||||
The use of \eC is not supported by the alternative matching function
|
||||
\fBpcre2_dfa_match()\fP when in UTF-8 or UTF-16 mode, that is, when a character
|
||||
may consist of more than one code unit. The use of \eC in these modes provokes
|
||||
a match-time error. Also, the JIT optimization does not support \eC in these
|
||||
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
|
||||
contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
|
||||
the matching will be carried out by the normal interpretive function.
|
||||
.P
|
||||
The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
|
||||
characters of any code value, but, by default, the characters that PCRE2
|
||||
@ -117,11 +121,21 @@ UTF-16 and UTF-32 strings can indicate their endianness by special code knows
|
||||
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
||||
strings to be in host byte order.
|
||||
.P
|
||||
The entire string is checked before any other processing takes place. In
|
||||
addition to checking the format of the string, there is a check to ensure that
|
||||
all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area.
|
||||
The so-called "non-character" code points are not excluded because Unicode
|
||||
corrigendum #9 makes it clear that they should not be.
|
||||
A UTF string is checked before any other processing takes place. In the case of
|
||||
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP calls with a non-zero starting
|
||||
offset, the check is applied only to that part of the subject that could be
|
||||
inspected during matching, and there is a check that the starting offset points
|
||||
to the first code unit of a character or to the end of the subject. If there
|
||||
are no lookbehind assertions in the pattern, the check starts at the starting
|
||||
offset. Otherwise, it starts at the length of the longest lookbehind before the
|
||||
starting offset, or at the start of the subject if there are not that many
|
||||
characters before the starting offset. Note that the sequences \eb and \eB are
|
||||
one-character lookbehinds.
|
||||
.P
|
||||
In addition to checking the format of the string, there is a check to ensure
|
||||
that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
|
||||
area. The so-called "non-character" code points are not excluded because
|
||||
Unicode corrigendum #9 makes it clear that they should not be.
|
||||
.P
|
||||
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
|
||||
where they are used in pairs to encode code points with values greater than
|
||||
@ -221,9 +235,9 @@ never occur in a valid UTF-8 string.
|
||||
.sp
|
||||
The following negative error codes are given for invalid UTF-16 strings:
|
||||
.sp
|
||||
PCRE_UTF16_ERR1 Missing low surrogate at end of string
|
||||
PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
|
||||
PCRE_UTF16_ERR3 Isolated low surrogate
|
||||
PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
|
||||
PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
|
||||
PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
|
||||
.sp
|
||||
.
|
||||
.
|
||||
@ -233,8 +247,8 @@ The following negative error codes are given for invalid UTF-16 strings:
|
||||
.sp
|
||||
The following negative error codes are given for invalid UTF-32 strings:
|
||||
.sp
|
||||
PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff)
|
||||
PCRE_UTF32_ERR2 Code point is greater than 0x10ffff
|
||||
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
|
||||
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
|
||||
.sp
|
||||
.
|
||||
.
|
||||
@ -252,6 +266,6 @@ Cambridge, England.
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 03 July 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|