Update bundled PCRE2-library to version 10.23

Some manual changes done to the library were lost with this update.
They will be added in the next commit.
This commit is contained in:
Esa Korhonen
2017-05-29 15:31:42 +03:00
parent 7231563937
commit 36af74cb25
218 changed files with 49218 additions and 26130 deletions

View File

@ -22,11 +22,12 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC7" href="#SEC7">NEWLINES</a>
<li><a name="TOC8" href="#SEC8">OPTIONS COMPATIBILITY</a>
<li><a name="TOC9" href="#SEC9">OPTIONS WITH DATA</a>
<li><a name="TOC10" href="#SEC10">MATCHING ERRORS</a>
<li><a name="TOC11" href="#SEC11">DIAGNOSTICS</a>
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
<li><a name="TOC14" href="#SEC14">REVISION</a>
<li><a name="TOC10" href="#SEC10">CALLING EXTERNAL SCRIPTS</a>
<li><a name="TOC11" href="#SEC11">MATCHING ERRORS</a>
<li><a name="TOC12" href="#SEC12">DIAGNOSTICS</a>
<li><a name="TOC13" href="#SEC13">SEE ALSO</a>
<li><a name="TOC14" href="#SEC14">AUTHOR</a>
<li><a name="TOC15" href="#SEC15">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
@ -79,11 +80,19 @@ span line boundaries. What defines a line boundary is controlled by the
</P>
<P>
The amount of memory used for buffering files that are being scanned is
controlled by a parameter that can be set by the <b>--buffer-size</b> option.
The default value for this parameter is specified when <b>pcre2grep</b> is
built, with the default default being 20K. A block of memory three times this
size is used (to allow for buffering "before" and "after" lines). An error
occurs if a line overflows the buffer.
controlled by parameters that can be set by the <b>--buffer-size</b> and
<b>--max-buffer-size</b> options. The first of these sets the size of buffer
that is obtained at the start of processing. If an input file contains very
long lines, a larger buffer may be needed; this is handled by automatically
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
default values for these parameters are specified when <b>pcre2grep</b> is
built, with the default defaults being 20K and 1M respectively. An error occurs
if a line is too long and the buffer can no longer be expanded.
</P>
<P>
The block of memory that is actually used is three times the "buffer size", to
allow for buffering "before" and "after" lines. If the buffer size is too
small, fewer than requested "before" and "after" lines may be output.
</P>
<P>
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
@ -154,12 +163,13 @@ processing of patterns and file names that start with hyphens.
</P>
<P>
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
Output <i>number</i> lines of context after each matching line. If file names
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
guarantees to have up to 8K of following text available for context output.
Output up to <i>number</i> lines of context after each matching line. Fewer
lines are output if the next match or the end of the file is reached, or if the
processing buffer size has been set too small. If file names and/or line
numbers are being output, a hyphen separator is used instead of a colon for the
context lines. A line containing "--" is output between each group of lines,
unless they are in fact contiguous in the input file. The value of <i>number</i>
is expected to be relatively small. When <b>-c</b> is used, <b>-A</b> is ignored.
</P>
<P>
<b>-a</b>, <b>--text</b>
@ -168,12 +178,14 @@ Treat binary files as text. This is equivalent to
</P>
<P>
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
Output <i>number</i> lines of context before each matching line. If file names
and/or line numbers are being output, a hyphen separator is used instead of a
colon for the context lines. A line containing "--" is output between each
group of lines, unless they are in fact contiguous in the input file. The value
of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
guarantees to have up to 8K of preceding text available for context output.
Output up to <i>number</i> lines of context before each matching line. Fewer
lines are output if the previous match or the start of the file is within
<i>number</i> lines, or if the processing buffer size has been set too small. If
file names and/or line numbers are being output, a hyphen separator is used
instead of a colon for the context lines. A line containing "--" is output
between each group of lines, unless they are in fact contiguous in the input
file. The value of <i>number</i> is expected to be relatively small. When
<b>-c</b> is used, <b>-B</b> is ignored.
</P>
<P>
<b>--binary-files=</b><i>word</i>
@ -190,8 +202,9 @@ return code.
</P>
<P>
<b>--buffer-size=</b><i>number</i>
Set the parameter that controls how much memory is used for buffering files
that are being scanned.
Set the parameter that controls how much memory is obtained at the start of
processing for buffering files that are being scanned. See also
<b>--max-buffer-size</b> below.
</P>
<P>
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
@ -201,14 +214,16 @@ This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
<P>
<b>-c</b>, <b>--count</b>
Do not output lines from the files that are being scanned; instead output the
number of matches (or non-matches if <b>-v</b> is used) that would otherwise
have caused lines to be shown. By default, this count is the same as the number
of suppressed lines, but if the <b>-M</b> (multiline) option is used (without
<b>-v</b>), there may be more suppressed lines than the number of matches.
number of lines that would have been shown, either because they matched, or, if
<b>-v</b> is set, because they failed to match. By default, this count is
exactly the same as the number of lines that would have been output, but if the
<b>-M</b> (multiline) option is used (without <b>-v</b>), there may be more
suppressed lines than the count (that is, the number of matches).
<br>
<br>
If no lines are selected, the number zero is output. If several files are are
being scanned, a count is output for each of them. However, if the
being scanned, a count is output for each of them and the <b>-t</b> option can
be used to cause a total to be output at the end. However, if the
<b>--files-with-matches</b> option is also used, only those files whose counts
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
<b>-B</b>, and <b>-C</b> options are ignored.
@ -230,12 +245,23 @@ because <b>pcre2grep</b> has to search for all possible matches in a line, not
just one, in order to colour them all.
<br>
<br>
The colour that is used can be specified by setting the environment variable
PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
string of two numbers, separated by a semicolon. They are copied directly into
the control string for setting colour on a terminal, so it is your
responsibility to ensure that they make sense. If neither of the environment
variables is set, the default is "1;31", which gives red.
The colour that is used can be specified by setting one of the environment
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
PCREGREP_COLOR, which are checked in that order. If none of these are set,
<b>pcre2grep</b> looks for GREP_COLORS or GREP_COLOR (in that order). The value
of the variable should be a string of two numbers, separated by a semicolon,
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
followed by two semicolon-separated colours, terminated by the end of the
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
ignored, and GREP_COLOR is checked.
<br>
<br>
If the string obtained from one of the above variables contains any characters
other than semicolon or digits, the setting is ignored and the default colour
is used. The string is copied directly into the control string for setting
colour on a terminal, so it is your responsibility to ensure that the values
make sense. If no relevant environment variable is set, the default is "1;31",
which gives red.
</P>
<P>
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
@ -320,18 +346,18 @@ files; it does not apply to patterns specified by any of the <b>--include</b> or
</P>
<P>
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
Read patterns from the file, one per line, and match them against
each line of input. What constitutes a newline when reading the file is the
operating system's default. The <b>--newline</b> option has no effect on this
option. Trailing white space is removed from each line, and blank lines are
ignored. An empty file contains no patterns and therefore matches nothing. See
also the comments about multiple patterns versus a single pattern with
alternatives in the description of <b>-e</b> above.
Read patterns from the file, one per line, and match them against each line of
input. What constitutes a newline when reading the file is the operating
system's default. The <b>--newline</b> option has no effect on this option.
Trailing white space is removed from each line, and blank lines are ignored. An
empty file contains no patterns and therefore matches nothing. See also the
comments about multiple patterns versus a single pattern with alternatives in
the description of <b>-e</b> above.
<br>
<br>
If this option is given more than once, all the specified files are
read. A data line is output if any of the patterns match it. A file name can
be given as "-" to refer to the standard input. When <b>-f</b> is used, patterns
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
as "-" to refer to the standard input. When <b>-f</b> is used, patterns
specified on the command line using <b>-e</b> may also be present; they are
tested before the file's patterns. However, no other pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
@ -501,19 +527,27 @@ There are no short forms for these options. The default settings are specified
when the PCRE2 library is compiled, with the default default being 10 million.
</P>
<P>
\fB--max-buffer-size=<i>number</i>
This limits the expansion of the processing buffer, whose initial size can be
set by <b>--buffer-size</b>. The maximum buffer size is silently forced to be no
smaller than the starting buffer size.
</P>
<P>
<b>-M</b>, <b>--multiline</b>
Allow patterns to match more than one line. When this option is given, patterns
may usefully contain literal newline characters and internal occurrences of ^
and $ characters. The output for a successful match may consist of more than
one line. The first is the line in which the match started, and the last is the
line in which the match ended. If the matched string ends with a newline
sequence the output ends at the end of that line.
Allow patterns to match more than one line. When this option is set, the PCRE2
library is called in "multiline" mode. This allows a matched string to extend
past the end of a line and continue on one or more subsequent lines. Patterns
used with <b>-M</b> may usefully contain literal newline characters and internal
occurrences of ^ and $ characters. The output for a successful match may
consist of more than one line. The first line is the line in which the match
started, and the last line is the line in which the match ended. If the matched
string ends with a newline sequence, the output ends at the end of that line.
If <b>-v</b> is set, none of the lines in a multi-line match are output. Once a
match has been handled, scanning restarts at the beginning of the line after
the one in which the match ended.
<br>
<br>
When this option is set, the PCRE2 library is called in "multiline" mode.
However, <b>pcre2grep</b> still processes the input line by line. The difference
is that a matched string may extend past the end of a line and continue on
one or more subsequent lines. The newline sequence must be matched as part of
The newline sequence that separates multiple lines must be matched as part of
the pattern. For example, to find the phrase "regular expression" in a file
where "regular" might be at the end of a line and "expression" at the start of
the next line, you could use this command:
@ -526,11 +560,8 @@ well as possibly handling a two-character newline sequence.
<br>
<br>
There is a limit to the number of lines that can be matched, imposed by the way
that <b>pcre2grep</b> buffers the input file as it scans it. However,
<b>pcre2grep</b> ensures that at least 8K characters or the rest of the file
(whichever is the shorter) are available for forward matching, and similarly
the previous 8K characters (or all the previous characters, if fewer than 8K)
are guaranteed to be available for lookbehind assertions. The <b>-M</b> option
that <b>pcre2grep</b> buffers the input file as it scans it. With a sufficiently
large processing buffer, this should not be a problem, but the <b>-M</b> option
does not work when input is read line by line (see \fP--line-buffered\fP.)
</P>
<P>
@ -578,12 +609,13 @@ It should never be needed in normal use.
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
<b>-C</b> options are ignored. If there is more than one match in a line, each
of them is shown separately. If <b>-o</b> is combined with <b>-v</b> (invert the
sense of the match to find non-matching lines), no output is generated, but the
return code is set appropriately. If the matched portion of the line is empty,
nothing is output unless the file name or line number are being printed, in
which case they are shown on an otherwise empty line. This option is mutually
exclusive with <b>--file-offsets</b> and <b>--line-offsets</b>.
of them is shown separately, on a separate line of output. If <b>-o</b> is
combined with <b>-v</b> (invert the sense of the match to find non-matching
lines), no output is generated, but the return code is set appropriately. If
the matched portion of the line is empty, nothing is output unless the file
name or line number are being printed, in which case they are shown on an
otherwise empty line. This option is mutually exclusive with
<b>--file-offsets</b> and <b>--line-offsets</b>.
</P>
<P>
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
@ -597,10 +629,11 @@ capturing parentheses do not exist in the pattern, or were not set in the
match, nothing is output unless the file name or line number are being output.
<br>
<br>
If this option is given multiple times, multiple substrings are output, in the
order the options are given. For example, -o3 -o1 -o3 causes the substrings
matched by capturing parentheses 3 and 1 and then 3 again to be output. By
default, there is no separator (but see the next option).
If this option is given multiple times, multiple substrings are output for each
match, in the order the options are given, and all on one line. For example,
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
then 3 again to be output. By default, there is no separator (but see the next
option).
</P>
<P>
<b>--om-separator</b>=<i>text</i>
@ -631,6 +664,18 @@ quietly skipped. However, the return code is still 2, even if matches were
found in other files.
</P>
<P>
<b>-t</b>, <b>--total-count</b>
This option is useful when scanning more than one file. If used on its own,
<b>-t</b> suppresses all output except for a grand total number of matching
lines (or non-matching lines if <b>-v</b> is used) in all the files. If <b>-t</b>
is used with <b>-c</b>, a grand total is output except when the previous output
is just one line. In other words, it is not output when just one file's count
is listed. If file names are being output, the grand total is preceded by
"TOTAL:". Otherwise, it appears as just another number. The <b>-t</b> option is
ignored when used with <b>-L</b> (list files without matches), because the grand
total would always be zero.
</P>
<P>
<b>-u</b>, <b>--utf-8</b>
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
@ -658,11 +703,12 @@ specified by any of the <b>--include</b> or <b>--exclude</b> options.
<P>
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
Force the patterns to be anchored (each must start matching at the beginning of
a line) and in addition, require them to match entire lines. This is equivalent
to having ^ and $ characters at the start and end of each alternative top-level
branch in every pattern. This option applies only to the patterns that are
matched against the contents of files; it does not apply to patterns specified
by any of the <b>--include</b> or <b>--exclude</b> options.
a line) and in addition, require them to match entire lines. In multiline mode
the match may be more than one line. This is equivalent to having \A and \Z
characters at the start and end of each alternative top-level branch in every
pattern. This option applies only to the patterns that are matched against the
contents of files; it does not apply to patterns specified by any of the
<b>--include</b> or <b>--exclude</b> options.
</P>
<br><a name="SEC6" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
<P>
@ -735,7 +781,57 @@ The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
options does have data, it must be given in the first form, using an equals
character. Otherwise <b>pcre2grep</b> will assume that it has no data.
</P>
<br><a name="SEC10" href="#TOC1">MATCHING ERRORS</a><br>
<br><a name="SEC10" href="#TOC1">CALLING EXTERNAL SCRIPTS</a><br>
<P>
<b>pcre2grep</b> has, by default, support for calling external programs or
scripts during matching by making use of PCRE2's callout facility. However,
this support can be disabled when <b>pcre2grep</b> is built. You can find out
whether your binary has support for callouts by running it with the <b>--help</b>
option. If the support is not enabled, all callouts in patterns are ignored by
<b>pcre2grep</b>.
</P>
<P>
A callout in a PCRE2 pattern is of the form (?C&#60;arg&#62;) where the argument is
either a number or a quoted string (see the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>.
String arguments are parsed as a list of substrings separated by pipe (vertical
bar) characters. The first substring must be an executable name, with the
following substrings specifying arguments:
<pre>
executable_name|arg1|arg2|...
</pre>
Any substring (including the executable name) may contain escape sequences
started by a dollar character: $&#60;digits&#62; or ${&#60;digits&#62;} is replaced by the
captured substring of the given decimal number, which must be greater than
zero. If the number is greater than the number of capturing substrings, or if
the capture is unset, the replacement is empty.
</P>
<P>
Any other character is substituted by itself. In particular, $$ is replaced by
a single dollar and $| is replaced by a pipe character. Here is an example:
<pre>
echo -e "abcde\n12345" | pcre2grep \
'(?x)(.)(..(.))
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
Output:
Arg1: [a] [bcd] [d] Arg2: |a| ()
abcde
Arg1: [1] [234] [4] Arg2: |1| ()
12345
</pre>
The parameters for the <b>execv()</b> system call that is used to run the
program or script are zero-terminated strings. This means that binary zero
characters in the callout argument will cause premature termination of their
substrings, and therefore should not be present. Any syntax errors in the
string (for example, a dollar not followed by another character) cause the
callout to be ignored. If running the program fails for any reason (including
the non-existence of the executable), a local matching failure occurs and the
matcher backtracks in the normal way.
</P>
<br><a name="SEC11" href="#TOC1">MATCHING ERRORS</a><br>
<P>
It is possible to supply a regular expression that takes a very long time to
fail to match certain lines. Such patterns normally involve nested indefinite
@ -751,7 +847,7 @@ overall resource limit; there is a second option called <b>--recursion-limit</b>
that sets a limit on the amount of memory (usually stack) that is used (see the
discussion of these options above).
</P>
<br><a name="SEC11" href="#TOC1">DIAGNOSTICS</a><br>
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P>
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
for syntax errors, overlong lines, non-existent or inaccessible files (even if
@ -759,11 +855,11 @@ matches were found in other files) or too many matching errors. Using the
<b>-s</b> option to suppress error messages about inaccessible files does not
affect the return code.
</P>
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC13" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3).
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3), <b>pcre2callout</b>(3).
</P>
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC14" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -772,11 +868,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
Last updated: 03 January 2015
Last updated: 31 December 2016
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.